Evaluation

PyTorchLTR provides several built-in evaluation metrics, including ARP [JSS17] and DCG [JK02]. The library can also generate pytrec_eval [VR18] compatible output.

Example

>>> import torch
>>> from pytorchltr.evaluation import ndcg
>>> scores = torch.tensor([[1.0, 0.0, 1.5], [1.5, 0.2, 0.5]])
>>> relevance = torch.tensor([[0, 1, 0], [0, 1, 1]])
>>> n = torch.tensor([3, 3])
>>> ndcg(scores, relevance, n, k=10)
tensor([0.5000, 0.6934])

Built-in metrics

pytorchltr.evaluation.arp(scores, relevance, n)

Average Relevant Position (ARP)

\[\text{arp}(\mathbf{s}, \mathbf{y}) = \frac{1}{\sum_{i=1}^n y_i} \sum_{i=1}^n y_{\pi_i} \cdot i\]

where \(\pi_i\) is the index of the item at rank \(i\) after sorting the scores.

Parameters
  • scores (FloatTensor) – A tensor of size (batch_size, list_size, 1) or (batch_size, list_size), indicating the scores per document per query.

  • relevance (LongTensor) – A tensor of size (batch_size, list_size), indicating the relevance judgements per document per query.

  • n (LongTensor) – A tensor of size (batch_size) indicating the number of docs per query.

Return type

FloatTensor

Returns

A tensor of size (batch_size) indicating the ARP of each query.
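
Example usage (a minimal sketch on the same toy batch as the NDCG example above; the expected values are derived by hand from the ARP formula, so they are illustrative rather than copied from library output):

>>> import torch
>>> from pytorchltr.evaluation import arp
>>> scores = torch.tensor([[1.0, 0.0, 1.5], [1.5, 0.2, 0.5]])
>>> relevance = torch.tensor([[0, 1, 0], [0, 1, 1]])
>>> n = torch.tensor([3, 3])
>>> arp(scores, relevance, n)  # expected: tensor([3.0000, 2.5000]) per the formula above (lower is better)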

pytorchltr.evaluation.dcg(scores, relevance, n, k=None, exp=True)

Discounted Cumulative Gain (DCG)

\[\text{dcg}(\mathbf{s}, \mathbf{y}) = \sum_{i=1}^n \frac{\text{gain}(y_{\pi_i})}{\log_2(1 + i)}\]

where \(\pi_i\) is the index of the item at rank \(i\) after sorting the scores, and:

\[ \text{gain}(y_i) = \left\{ \begin{array}{ll} 2^{y_i} - 1 & \text{if } \texttt{exp=True} \\ y_i & \text{otherwise} \end{array} \right. \]

Parameters
  • scores (FloatTensor) – A tensor of size (batch_size, list_size, 1) or (batch_size, list_size), indicating the scores per document per query.

  • relevance (LongTensor) – A tensor of size (batch_size, list_size), indicating the relevance judgements per document per query.

  • n (LongTensor) – A tensor of size (batch_size) indicating the number of docs per query.

  • k (Optional[int]) – An integer indicating the cutoff for dcg.

  • exp (Optional[bool]) – A boolean indicating whether to use the exponential gain formulation of DCG (see the gain definition above).

Return type

FloatTensor

Returns

A tensor of size (batch_size, list_size) indicating the DCG of each query at every rank. If k is not None, then this returns a tensor of size (batch_size), indicating the DCG@k of each query.
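
Example usage (a minimal sketch on the same toy batch; the expected DCG@3 values are computed by hand from the formula with exponential gain and are consistent with the NDCG values in the example at the top):

>>> import torch
>>> from pytorchltr.evaluation import dcg
>>> scores = torch.tensor([[1.0, 0.0, 1.5], [1.5, 0.2, 0.5]])
>>> relevance = torch.tensor([[0, 1, 0], [0, 1, 1]])
>>> n = torch.tensor([3, 3])
>>> dcg(scores, relevance, n, k=3)  # expected: tensor([0.5000, 1.1309]) per the formula above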

pytorchltr.evaluation.ndcg(scores, relevance, n, k=None, exp=True)

Normalized Discounted Cumulative Gain (NDCG)

\[\text{ndcg}(\mathbf{s}, \mathbf{y}) = \frac{\text{dcg}(\mathbf{s}, \mathbf{y})} {\text{dcg}(\mathbf{y}, \mathbf{y})}\]

Parameters
  • scores (FloatTensor) – A tensor of size (batch_size, list_size, 1) or (batch_size, list_size), indicating the scores per document per query.

  • relevance (LongTensor) – A tensor of size (batch_size, list_size), indicating the relevance judgements per document per query.

  • n (LongTensor) – A tensor of size (batch_size) indicating the number of docs per query.

  • k (Optional[int]) – An integer indicating the cutoff for ndcg.

  • exp (Optional[bool]) – A boolean indicating whether to use the exponential gain formulation of DCG (see the gain definition above).

Return type

FloatTensor

Returns

A tensor of size (batch_size, list_size) indicating the NDCG of each query at every rank. If k is not None, then this returns a tensor of size (batch_size), indicating the NDCG@k of each query.
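
For binary relevance the exp flag makes no difference, since \(2^{y_i} - 1 = y_i\) when \(y_i \in \{0, 1\}\); it only changes results for graded judgements. A minimal sketch with hypothetical graded labels (outputs omitted):

>>> import torch
>>> from pytorchltr.evaluation import ndcg
>>> scores = torch.tensor([[1.0, 0.0, 1.5]])
>>> graded = torch.tensor([[2, 1, 0]])  # hypothetical graded relevance judgements
>>> n = torch.tensor([3])
>>> ndcg(scores, graded, n, k=3, exp=True)   # gain 2^y - 1
>>> ndcg(scores, graded, n, k=3, exp=False)  # gain y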

Integration with pytrec_eval

pytorchltr.evaluation.generate_pytrec_eval(scores, relevance, n, qids=None, qid_offset=0, q_prefix='q', d_prefix='d')

Generates pytrec_eval compatible qrel and run dictionaries from a given batch.

Example usage:

>>> import json
>>> import torch
>>> import pytrec_eval
>>> from pytorchltr.evaluation.trec import generate_pytrec_eval
>>> scores = torch.tensor([[1.0, 0.0, 1.5], [1.5, 0.2, 0.5]])
>>> relevance = torch.tensor([[0, 1, 0], [0, 1, 1]])
>>> n = torch.tensor([3, 3])
>>> qrel, run = generate_pytrec_eval(scores, relevance, n)
>>> evaluator = pytrec_eval.RelevanceEvaluator(qrel, {'map', 'ndcg'})
>>> print(json.dumps(evaluator.evaluate(run), indent=1))
{
 "q0": {
  "map": 0.3333333333333333,
  "ndcg": 0.5
 },
 "q1": {
  "map": 0.5833333333333333,
  "ndcg": 0.6934264036172708
 }
}

Parameters
  • scores (FloatTensor) – A FloatTensor of size (batch_size, list_size) indicating the scores of each document.

  • relevance (LongTensor) – A LongTensor of size (batch_size, list_size) indicating the relevance of each document.

  • n (LongTensor) – A LongTensor of size (batch_size) indicating the number of docs per query.

  • qids (Optional[LongTensor]) – A LongTensor of size (batch_size) indicating the qid of each query.

  • qid_offset (int) – An integer offset added to all query identifiers in this batch. Only used if qids is None.

  • q_prefix (str) – A string prefix to add for query identifiers.

  • d_prefix (str) – A string prefix to add for doc identifiers.

Return type

Tuple[Dict[str, Dict[str, int]], Dict[str, Dict[str, float]]]

Returns

A tuple containing a qrel dict and a run dict.
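
When scoring a dataset in multiple batches, qid_offset (or an explicit qids tensor) keeps query identifiers unique across batches. A sketch assuming a hypothetical iterable batches of (scores, relevance, n) tensors, and assuming the per-batch dicts can simply be merged by key:

>>> qrel_all, run_all = {}, {}
>>> offset = 0
>>> for scores, relevance, n in batches:  # `batches` is hypothetical, e.g. a DataLoader yielding tensors
...     qrel, run = generate_pytrec_eval(scores, relevance, n, qid_offset=offset)
...     qrel_all.update(qrel)  # keys are query identifiers such as "q0", "q1", ...
...     run_all.update(run)
...     offset += scores.shape[0]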

References

JSS17

Thorsten Joachims, Adith Swaminathan, and Tobias Schnabel. Unbiased learning-to-rank with biased feedback. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, WSDM ’17, 781–789. New York, NY, USA, 2017. Association for Computing Machinery. doi:10.1145/3018661.3018699.

JK02

Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst., 20(4):422–446, October 2002. doi:10.1145/582415.582418.

VR18

Christophe Van Gysel and Maarten de Rijke. Pytrec_eval: an extremely fast Python interface to trec_eval. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’18, 873–876. New York, NY, USA, 2018. Association for Computing Machinery. doi:10.1145/3209978.3210065.