Evaluation

PyTorchLTR provides several built-in evaluation metrics, including ARP [JSS17] and DCG [JK02]. The library can also generate pytrec_eval [VR18] compatible output.

Example

>>> import torch
>>> from pytorchltr.evaluation import ndcg
>>> scores = torch.tensor([[1.0, 0.0, 1.5], [1.5, 0.2, 0.5]])
>>> relevance = torch.tensor([[0, 1, 0], [0, 1, 1]])
>>> n = torch.tensor([3, 3])
>>> ndcg(scores, relevance, n, k=10)
tensor([0.5000, 0.6934])

Built-in metrics

pytorchltr.evaluation.arp(scores, relevance, n)

Average Relevant Position (ARP)

\[\text{arp}(\mathbf{s}, \mathbf{y}) = \frac{1}{\sum_{i=1}^n y_i} \sum_{i=1}^n y_{\pi_i} \cdot i\]

where \(\pi_i\) is the index of the item at rank \(i\) after sorting the scores.

Parameters
  • scores (FloatTensor) – A tensor of size (batch_size, list_size, 1) or (batch_size, list_size), indicating the scores per document per query.

  • relevance (LongTensor) – A tensor of size (batch_size, list_size), indicating the relevance judgements per document per query.

  • n (LongTensor) – A tensor of size (batch_size) indicating the number of docs per query.

Return type

FloatTensor

Returns

A tensor of size (batch_size) indicating the ARP of each query.
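
Example usage (a minimal sketch on the same toy batch as the NDCG example above; the expected values are derived by hand from the ARP formula, so they are illustrative rather than copied from library output):

>>> import torch
>>> from pytorchltr.evaluation import arp
>>> scores = torch.tensor([[1.0, 0.0, 1.5], [1.5, 0.2, 0.5]])
>>> relevance = torch.tensor([[0, 1, 0], [0, 1, 1]])
>>> n = torch.tensor([3, 3])
>>> arp(scores, relevance, n)  # expected: tensor([3.0000, 2.5000]) per the formula above (lower is better)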

pytorchltr.evaluation.dcg(scores, relevance, n, k=None, exp=True)

Discounted Cumulative Gain (DCG)

\[\text{dcg}(\mathbf{s}, \mathbf{y}) = \sum_{i=1}^n \frac{\text{gain}(y_{\pi_i})}{\log_2(1 + i)}\]

where \(\pi_i\) is the index of the item at rank \(i\) after sorting the scores, and:

\[ \text{gain}(y_i) = \left\{ \begin{array}{ll} 2^{y_i} - 1 & \text{if } \texttt{exp=True} \\ y_i & \text{otherwise} \end{array} \right. \]

Parameters
  • scores (FloatTensor) – A tensor of size (batch_size, list_size, 1) or (batch_size, list_size), indicating the scores per document per query.

  • relevance (LongTensor) – A tensor of size (batch_size, list_size), indicating the relevance judgements per document per query.

  • n (LongTensor) – A tensor of size (batch_size) indicating the number of docs per query.

  • k (Optional[int]) – An integer indicating the cutoff for dcg.

  • exp (Optional[bool]) – A boolean indicating whether to use the exponential gain formulation of DCG (see the gain definition above).

Return type

FloatTensor

Returns

A tensor of size (batch_size, list_size) indicating the DCG of each query at every rank. If k is not None, then this returns a tensor of size (batch_size), indicating the DCG@k of each query.
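
Example usage (a minimal sketch on the same toy batch; the expected DCG@3 values are computed by hand from the formula with exponential gain and are consistent with the NDCG values in the example at the top):

>>> import torch
>>> from pytorchltr.evaluation import dcg
>>> scores = torch.tensor([[1.0, 0.0, 1.5], [1.5, 0.2, 0.5]])
>>> relevance = torch.tensor([[0, 1, 0], [0, 1, 1]])
>>> n = torch.tensor([3, 3])
>>> dcg(scores, relevance, n, k=3)  # expected: tensor([0.5000, 1.1309]) per the formula above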

pytorchltr.evaluation.ndcg(scores, relevance, n, k=None, exp=True)

Normalized Discounted Cumulative Gain (NDCG)

\[\text{ndcg}(\mathbf{s}, \mathbf{y}) = \frac{\text{dcg}(\mathbf{s}, \mathbf{y})} {\text{dcg}(\mathbf{y}, \mathbf{y})}\]

Parameters
  • scores (FloatTensor) – A tensor of size (batch_size, list_size, 1) or (batch_size, list_size), indicating the scores per document per query.

  • relevance (LongTensor) – A tensor of size (batch_size, list_size), indicating the relevance judgements per document per query.

  • n (LongTensor) – A tensor of size (batch_size) indicating the number of docs per query.

  • k (Optional[int]) – An integer indicating the cutoff for ndcg.

  • exp (Optional[bool]) – A boolean indicating whether to use the exponential gain formulation of DCG (see the gain definition above).

Return type

FloatTensor

Returns

A tensor of size (batch_size, list_size) indicating the NDCG of each query at every rank. If k is not None, then this returns a tensor of size (batch_size), indicating the NDCG@k of each query.
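
For binary relevance the exp flag makes no difference, since \(2^{y_i} - 1 = y_i\) when \(y_i \in \{0, 1\}\); it only changes results for graded judgements. A minimal sketch with hypothetical graded labels (outputs omitted):

>>> import torch
>>> from pytorchltr.evaluation import ndcg
>>> scores = torch.tensor([[1.0, 0.0, 1.5]])
>>> graded = torch.tensor([[2, 1, 0]])  # hypothetical graded relevance judgements
>>> n = torch.tensor([3])
>>> ndcg(scores, graded, n, k=3, exp=True)   # gain 2^y - 1
>>> ndcg(scores, graded, n, k=3, exp=False)  # gain y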

Integration with pytrec_eval

pytorchltr.evaluation.generate_pytrec_eval(scores, relevance, n, qids=None, qid_offset=0, q_prefix='q', d_prefix='d')

Generates pytrec_eval compatible qrel and run dictionaries from a given batch.

Example usage:

>>> import json
>>> import torch
>>> import pytrec_eval
>>> from pytorchltr.evaluation.trec import generate_pytrec_eval
>>> scores = torch.tensor([[1.0, 0.0, 1.5], [1.5, 0.2, 0.5]])
>>> relevance = torch.tensor([[0, 1, 0], [0, 1, 1]])
>>> n = torch.tensor([3, 3])
>>> qrel, run = generate_pytrec_eval(scores, relevance, n)
>>> evaluator = pytrec_eval.RelevanceEvaluator(qrel, {'map', 'ndcg'})
>>> print(json.dumps(evaluator.evaluate(run), indent=1))
{
 "q0": {
  "map": 0.3333333333333333,
  "ndcg": 0.5
 },
 "q1": {
  "map": 0.5833333333333333,
  "ndcg": 0.6934264036172708
 }
}

Parameters
  • scores (FloatTensor) – A FloatTensor of size (batch_size, list_size) indicating the scores of each document.

  • relevance (LongTensor) – A LongTensor of size (batch_size, list_size) indicating the relevance of each document.

  • n (LongTensor) – A LongTensor of size (batch_size) indicating the number of docs per query.

  • qids (Optional[LongTensor]) – A LongTensor of size (batch_size) indicating the qid of each query.

  • qid_offset (int) – An integer offset added to all query identifiers in this batch. Only used if qids is None.

  • q_prefix (str) – A string prefix to add for query identifiers.

  • d_prefix (str) – A string prefix to add for doc identifiers.

Return type

Tuple[Dict[str, Dict[str, int]], Dict[str, Dict[str, float]]]

Returns

A tuple containing a qrel dict and a run dict.
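
When scoring a dataset in multiple batches, qid_offset (or an explicit qids tensor) keeps query identifiers unique across batches. A sketch assuming a hypothetical iterable batches of (scores, relevance, n) tensors, and assuming the per-batch dicts can simply be merged by key:

>>> qrel_all, run_all = {}, {}
>>> offset = 0
>>> for scores, relevance, n in batches:  # `batches` is hypothetical, e.g. a DataLoader yielding tensors
...     qrel, run = generate_pytrec_eval(scores, relevance, n, qid_offset=offset)
...     qrel_all.update(qrel)  # keys are query identifiers such as "q0", "q1", ...
...     run_all.update(run)
...     offset += scores.shape[0]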

References

JSS17

Thorsten Joachims, Adith Swaminathan, and Tobias Schnabel. Unbiased learning-to-rank with biased feedback. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, WSDM ’17, 781–789. New York, NY, USA, 2017. Association for Computing Machinery. doi:10.1145/3018661.3018699.

JK02

Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst., 20(4):422–446, October 2002. doi:10.1145/582415.582418.

VR18

Christophe Van Gysel and Maarten de Rijke. Pytrec_eval: an extremely fast Python interface to trec_eval. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’18, 873–876. New York, NY, USA, 2018. Association for Computing Machinery. doi:10.1145/3209978.3210065.