RankNet and LambdaRank

The ranking problem is about ordering a collection of documents according to their relevance to a given query.

There are multiple approaches to the problem, but in the pairwise approach we simply care about predicting the relative order of document pairs for the query. Given two documents $d_i$ and $d_j$, the true relative ordering is specified as

$$h_{ij} = \begin{cases} 1 & d_i > d_j \\ 0 & d_i = d_j \\ -1 & d_i < d_j \end{cases}$$

In terms of modelling, we assume there is a base model that takes in the features $x_i$ corresponding to document $d_i$ and predicts a score depicting its relevance to the given query: $s_i = f(x_i)$.
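For concreteness, here is a minimal sketch with a linear scorer standing in for $f$; the features, weights, and dimensions are made up for illustration, and in practice $f$ is usually a neural network:

```python
import numpy as np

def f(X, w):
    """Linear scoring model s_i = w . x_i, a stand-in for any
    differentiable scorer such as a small neural network."""
    return X @ w

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # features x_i for 5 documents, 3 features each
w = rng.normal(size=3)        # scorer parameters
s = f(X, w)                   # one relevance score s_i per document
```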

A comparator model is fed these scores and predicts $P(d_i > d_j)$.

The comparator model can be a binary classifier by setting $y_{ij} = \frac{1 + h_{ij}}{2}$ as the target variable. The comparator is then trained to predict 1 for confirming the order and 0 for negating it.
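As a sketch, assuming graded relevance labels per document (the `rel` array here is made up), both $h_{ij}$ and the targets $y_{ij}$ can be built in one shot:

```python
import numpy as np

rel = np.array([3, 1, 2])     # assumed graded relevance labels per document

# h[i, j] = 1 if d_i > d_j, 0 if tied, -1 if d_i < d_j
h = np.sign(rel[:, None] - rel[None, :])

# Comparator target y_ij = (1 + h_ij) / 2: 1, 1/2 (tie), or 0
y = (1 + h) / 2
```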

RankNet

RankNet uses logistic regression as the comparator, which is fed the difference of the scores:

$$\hat{y}_{ij} = P(d_i > d_j) = \frac{1}{1 + e^{-\alpha(s_i - s_j)}}$$

where $\alpha$ is a parameter which controls the slope of the sigmoid function.
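In code, the comparator is just a sigmoid over the score difference (a direct transcription of the formula, with `alpha` as above):

```python
import numpy as np

def ranknet_prob(s_i, s_j, alpha=1.0):
    """P(d_i > d_j) = sigmoid(alpha * (s_i - s_j))."""
    return 1.0 / (1.0 + np.exp(-alpha * (s_i - s_j)))

print(ranknet_prob(1.0, 1.0))   # equal scores -> 0.5, i.e. no preference
```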

We can define the binary cross entropy loss on this model as

$$C_{ij} = -\left(y_{ij}\log\hat{y}_{ij} + (1 - y_{ij})\log(1 - \hat{y}_{ij})\right)$$
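A direct transcription, with clipping added for numerical safety (the clipping is an implementation detail, not part of the formulation):

```python
import numpy as np

def pair_loss(y_ij, y_hat_ij, eps=1e-12):
    """Binary cross entropy C_ij for a single document pair."""
    y_hat_ij = np.clip(y_hat_ij, eps, 1 - eps)   # avoid log(0)
    return -(y_ij * np.log(y_hat_ij) + (1 - y_ij) * np.log(1 - y_hat_ij))
```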

Now consider a mini-batch of documents $d_1, \ldots, d_n$ corresponding to a particular query. The documents have some ideal ordering for answering the query, which is specified by the values of $h_{ij}$.

For the gradient update, we are interested in computing the gradients generated by each document:

$$\frac{\partial C}{\partial w} = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial C_i}{\partial w}, \qquad w \leftarrow w - \eta\,\frac{\partial C}{\partial w}$$
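For reference, here is a naive sketch of this update under the running assumptions (linear scorer, graded labels): it walks over every ordered pair explicitly, which is the $O(n^2)$ cost the $\lambda$ trick below removes. The per-pair gradient $\alpha(\hat{y}_{ij} - y_{ij})(x_i - x_j)$ follows from the chain rule for the linear scorer.

```python
import numpy as np

def naive_grad(w, X, rel, alpha=1.0):
    """dC/dw for C = (1/n) * sum_i C_i, looping over every ordered
    pair: O(n^2) work per batch, which the lambda trick avoids."""
    n = len(rel)
    s = X @ w
    g = np.zeros_like(w)
    for i in range(n):
        for j in range(n):
            if i == j or rel[i] == rel[j]:
                continue
            y = 1.0 if rel[i] > rel[j] else 0.0
            y_hat = 1.0 / (1.0 + np.exp(-alpha * (s[i] - s[j])))
            # d(BCE)/dw for one pair, with ds_i/dw = x_i for a linear scorer
            g += alpha * (y_hat - y) * (X[i] - X[j])
    return g / n

rng = np.random.default_rng(0)
X, rel = rng.normal(size=(4, 3)), np.array([2, 0, 1, 0])
w = rng.normal(size=3)
w = w - 0.1 * naive_grad(w, X, rel)   # w <- w - eta * dC/dw
```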

Now, if we let $a$ index documents ranked below $d_i$ and $b$ index documents ranked above it, so that $d_b > d_i > d_a$, then the loss incurred by $d_i$ can be written as

$$C_i = -\sum_{a:\, d_i > d_a} y_{ia}\log\hat{y}_{ia} - \sum_{b:\, d_b > d_i} (1 - y_{ib})\log(1 - \hat{y}_{ib})$$
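In code, $C_i$ is just the sum of the pair losses in which $d_i$ participates (a sketch, reusing the sigmoid comparator; the `rel` labels encode the assumed ordering):

```python
import numpy as np

def doc_loss(i, s, rel, alpha=1.0):
    """C_i: BCE summed over all pairs involving document d_i."""
    y_hat = lambda a, b: 1.0 / (1.0 + np.exp(-alpha * (s[a] - s[b])))
    C = 0.0
    for j in range(len(s)):
        if rel[i] > rel[j]:          # d_i > d_a terms, where y_ia = 1
            C += -np.log(y_hat(i, j))
        elif rel[i] < rel[j]:        # d_b > d_i terms, where y_ib = 0
            C += -np.log(1 - y_hat(i, j))
    return C

s = np.array([0.2, 1.5, -0.3])
rel = np.array([1, 2, 0])            # so d_1 > d_0 > d_2
print(doc_loss(0, s, rel))
```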

Note that the ground truth labels are $y_{ia} = 1$ and $y_{ib} = 0$. The gradient from $d_i$ can then be simplified as

$$\frac{\partial C_i}{\partial w} = \sum_a \frac{\partial C_i}{\partial s_a}\frac{\partial s_a}{\partial w} + \sum_b \frac{\partial C_i}{\partial s_b}\frac{\partial s_b}{\partial w}$$

(The dependence of $C_i$ on $s_i$ itself is deliberately left out: every pair appears in two of the $C_i$'s, so when we sum over the batch, the gradient through $s_i$ is picked up by the partner document's loss term.)

The $\partial s/\partial w$ part of the gradient only depends on the score prediction network. For computing the gradient of the comparator, we have

$$\begin{aligned} \frac{\partial C_i}{\partial s_a} &= \frac{\partial(-\log\hat{y}_{ia})}{\partial s_a} = \frac{\alpha}{1 + e^{\alpha(s_i - s_a)}} = \alpha(1 - \hat{y}_{ia}) \\ \frac{\partial C_i}{\partial s_b} &= \frac{\partial(-\log(1 - \hat{y}_{ib}))}{\partial s_b} = \frac{-\alpha}{1 + e^{-\alpha(s_i - s_b)}} = -\alpha\hat{y}_{ib} \end{aligned}$$
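A quick finite-difference check of these two expressions, using the sigmoid comparator and some arbitrary scores consistent with $d_b > d_i > d_a$:

```python
import numpy as np

alpha = 1.0
s_i, s_a, s_b = 0.3, -0.5, 1.2   # arbitrary scores, d_b > d_i > d_a

y_hat = lambda u, v: 1.0 / (1.0 + np.exp(-alpha * (u - v)))

# Analytic gradients from the text
g_a = alpha * (1 - y_hat(s_i, s_a))   # dC_i/ds_a for a pair with y_ia = 1
g_b = -alpha * y_hat(s_i, s_b)        # dC_i/ds_b for a pair with y_ib = 0

# Finite-difference versions of the same derivatives
eps = 1e-6
C_a = lambda sa: -np.log(y_hat(s_i, sa))        # -log y_hat_ia
C_b = lambda sb: -np.log(1 - y_hat(s_i, sb))    # -log(1 - y_hat_ib)
fd_a = (C_a(s_a + eps) - C_a(s_a - eps)) / (2 * eps)
fd_b = (C_b(s_b + eps) - C_b(s_b - eps)) / (2 * eps)

assert abs(g_a - fd_a) < 1e-6 and abs(g_b - fd_b) < 1e-6
```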

If we pick an arbitrary document pair $(d_i, d_j)$, the two cases above combine to

$$\frac{\partial C_i}{\partial s_j} = \begin{cases} \alpha(1 - \hat{y}_{ij}) & d_i > d_j, \text{ i.e. } h_{ij} = 1 \\ -\alpha\hat{y}_{ij} & d_j > d_i, \text{ i.e. } h_{ij} = -1 \end{cases}$$

Now if we define the quantities

$$\lambda_{ij} \equiv \alpha\left[\frac{1 - h_{ij}}{2} - (1 - \hat{y}_{ij})\right]$$

then $\frac{\partial C_i}{\partial s_j} = -\lambda_{ij}$ in both cases above, and $\lambda_{ji} = -\lambda_{ij}$. We can write the individual gradient as

$$\frac{\partial C_i}{\partial w} = -\sum_a \lambda_{ia}\frac{\partial s_a}{\partial w} - \sum_b \lambda_{ib}\frac{\partial s_b}{\partial w}$$

and, collecting terms by document, the batch gradient becomes

$$\frac{\partial C}{\partial w} = \frac{1}{n}\sum_{i=1}^{n}\lambda_i\frac{\partial s_i}{\partial w}, \qquad \lambda_i = \sum_{j:\, d_i > d_j}\lambda_{ij} - \sum_{j:\, d_i < d_j}\lambda_{ji}$$
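A sketch that puts the pieces together for a single query, assuming the linear scorer from before, and verifies the factorised gradient against finite differences. The cost here counts each pair once, from the winner's side; the double counting in $\sum_i C_i$ only rescales the gradient by a constant that folds into $\eta$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.normal(size=(n, d))        # document features
rel = np.array([2, 0, 1, 3, 0])    # assumed graded relevance labels
w = rng.normal(size=d)
alpha = 1.0

s = X @ w
h = np.sign(rel[:, None] - rel[None, :])                        # pairwise h_ij
y_hat = 1.0 / (1.0 + np.exp(-alpha * (s[:, None] - s[None, :])))

# lambda_ij = alpha * [(1 - h_ij)/2 - (1 - y_hat_ij)], zeroed for ties
lam_ij = np.where(h != 0, alpha * ((1 - h) / 2 - (1 - y_hat)), 0.0)

# lambda_i: since lam_ji == -lam_ij, the two sums collapse to one row sum
lam = lam_ij.sum(axis=1)

# For a linear scorer ds_i/dw = x_i, so dC/dw = (1/n) * sum_i lam_i * x_i
grad = lam @ X / n

# Sanity check against finite differences of the pairwise BCE
def cost(w):
    """Total pairwise BCE, each pair counted once from the winner's side."""
    s = X @ w
    p = 1.0 / (1.0 + np.exp(-alpha * (s[:, None] - s[None, :])))
    return -np.log(p[h > 0]).sum() / n

eps = 1e-6
fd = np.array([(cost(w + eps * e) - cost(w - eps * e)) / (2 * eps)
               for e in np.eye(d)])
assert np.allclose(grad, fd, atol=1e-6)
```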

So for each document in the batch, we can simply accumulate its $\lambda$ and then apply it to $\partial s_i/\partial w$, thus not requiring $n^2$ gradient computations through the scoring network, as sketched below.
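One way to realise this trick in PyTorch (a sketch; the linear model and labels are placeholders for a real scoring network and data): compute the $\lambda$'s with gradient tracking off, then hand $\lambda_i / n$ to `backward` as the upstream gradient of each score, so the network is differentiated once per document instead of once per pair.

```python
import torch

torch.manual_seed(0)
n, d = 5, 3
X = torch.randn(n, d)
rel = torch.tensor([2, 0, 1, 3, 0])
model = torch.nn.Linear(d, 1)        # stand-in for the scoring network f
alpha = 1.0

s = model(X).squeeze(1)              # scores s_i, shape (n,)

with torch.no_grad():                # lambdas are treated as constants
    h = torch.sign(rel[:, None] - rel[None, :]).float()
    y_hat = torch.sigmoid(alpha * (s[:, None] - s[None, :]))
    lam_ij = torch.where(h != 0, alpha * ((1 - h) / 2 - (1 - y_hat)),
                         torch.zeros_like(y_hat))
    lam = lam_ij.sum(dim=1)          # one accumulated lambda per document

# dC/dw = (1/n) * sum_i lam_i * ds_i/dw: one backward pass, not n^2
s.backward(gradient=lam / n)

with torch.no_grad():
    for p in model.parameters():
        p -= 0.1 * p.grad            # w <- w - eta * dC/dw
        p.grad = None
```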

Each $\lambda_i$ can also be thought of as the strength of the gradient acting on document $d_i$: it grows with every inversion in the ordering and shrinks for correctly ordered pairs.

LambdaRank

At this point, explaining LambdaRank is very simple. It is exactly the same as RankNet, but we modify the computation of the $\lambda$'s as follows, for pairs with $d_i > d_j$:

$$\lambda_{ij} \equiv \alpha(1 - \hat{y}_{ij})\,|\Delta \mathrm{NDCG}|$$

$\Delta \mathrm{NDCG}$ is the change in the NDCG measure if we swap $d_i$ and $d_j$ in the current ordering. This results in gradient updates that directly optimise for the NDCG measure. Since higher NDCG is better, we have to do gradient ascent instead of gradient descent:

$$w \leftarrow w + \eta\left(\frac{1}{n}\sum_{i=1}^{n}\lambda_i\frac{\partial s_i}{\partial w}\right)$$
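A sketch of one LambdaRank-style update under the running assumptions. For a swap of $d_i$ and $d_j$, only their two positions change, so $\Delta \mathrm{DCG} = (\mathrm{gain}_i - \mathrm{gain}_j)(\mathrm{disc}_j - \mathrm{disc}_i)$ with the common $(2^{rel} - 1)$ gain and $1/\log_2(\mathrm{rank}+1)$ discount, and NDCG normalises by the ideal DCG:

```python
import numpy as np

def dcg(rels_in_rank_order):
    """DCG with the common (2^rel - 1) gain and log2(rank + 1) discount."""
    ranks = np.arange(1, len(rels_in_rank_order) + 1)
    return ((2.0 ** rels_in_rank_order - 1) / np.log2(ranks + 1)).sum()

def abs_delta_ndcg(rel, pos, idcg, i, j):
    """|Delta NDCG| if d_i and d_j swap places; pos[k] is d_k's current
    0-based rank position, idcg the ideal DCG used for normalisation."""
    disc = lambda p: 1.0 / np.log2(p + 2)          # discount at position p
    gain = lambda k: 2.0 ** rel[k] - 1
    delta = (gain(i) - gain(j)) * (disc(pos[j]) - disc(pos[i]))
    return abs(delta) / idcg

rng = np.random.default_rng(0)
n, d, alpha, eta = 5, 3, 1.0, 0.1
X = rng.normal(size=(n, d))
rel = np.array([2, 0, 1, 3, 0])                    # assumed relevance labels
w = rng.normal(size=d)

s = X @ w
order = np.argsort(-s)                             # ranking induced by scores
pos = np.empty(n, dtype=int); pos[order] = np.arange(n)
idcg = dcg(np.sort(rel)[::-1])

# Accumulate lambda_i with the LambdaRank weighting
lam = np.zeros(n)
for i in range(n):
    for j in range(n):
        if rel[i] > rel[j]:                        # pairs with d_i > d_j
            y_hat = 1.0 / (1.0 + np.exp(-alpha * (s[i] - s[j])))
            lam_ij = alpha * (1 - y_hat) * abs_delta_ndcg(rel, pos, idcg, i, j)
            lam[i] += lam_ij                       # push the winner up ...
            lam[j] -= lam_ij                       # ... and the loser down

w = w + eta * (lam @ X / n)                        # gradient *ascent* step
```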




tags: #ranking #search