Datasets

PyTorchLTR provides utility classes for several LTR datasets that can automatically download and/or process the dataset files.

Warning

PyTorchLTR provides utilities to automatically download and prepare several public LTR datasets. We cannot vouch for the quality, correctness or usefulness of these datasets. We do not host or distribute any datasets and it is ultimately your responsibility to determine whether you have permission to use each dataset under its respective license.

Example

The following is a usage example for the small Example3 dataset.

>>> from pytorchltr.datasets import Example3
>>> train = Example3(split="train")
>>> test = Example3(split="test")
>>> print(len(train))
3
>>> print(len(test))
1
>>> sample = train[0]
>>> print(sample["features"])
tensor([[1.0000, 1.0000, 0.0000, 0.3333, 0.0000],
        [0.0000, 0.0000, 1.0000, 0.0000, 1.0000],
        [0.0000, 1.0000, 0.0000, 1.0000, 0.0000],
        [0.0000, 0.0000, 1.0000, 0.6667, 0.0000]])
>>> print(sample["relevance"])
tensor([3, 2, 1, 1])
>>> print(sample["n"])
4

Note

PyTorchLTR looks for dataset files in (and downloads them to) the following locations:

  • The location argument, if it is specified in the constructor of the respective Dataset class.

  • $PYTORCHLTR_DATASET_PATH/{dataset_name} if $PYTORCHLTR_DATASET_PATH is a defined environment variable.

  • $DATASET_PATH/{dataset_name} if $DATASET_PATH is a defined environment variable.

  • $HOME/.pytorchltr_datasets/{dataset_name} if all the above fail.
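The lookup order above can be sketched as follows. This is a minimal illustration of the documented behavior, not the library's actual implementation; the function name resolve_dataset_path is hypothetical:

```python
import os


def resolve_dataset_path(dataset_name, location=None):
    """Illustrative sketch of the documented dataset-path lookup order."""
    # 1. An explicit `location` argument always wins.
    if location is not None:
        return location
    # 2. Fall back to $PYTORCHLTR_DATASET_PATH, then $DATASET_PATH.
    for var in ("PYTORCHLTR_DATASET_PATH", "DATASET_PATH"):
        base = os.environ.get(var)
        if base is not None:
            return os.path.join(base, dataset_name)
    # 3. Last resort: a hidden folder in the user's home directory.
    return os.path.join(os.path.expanduser("~"),
                        ".pytorchltr_datasets", dataset_name)
```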

SVMRank datasets

Example3

class pytorchltr.datasets.Example3(location="'$PYTORCHLTR_DATASET_PATH'/example3", split='train', normalize=True, filter_queries=None, download=True, validate_checksums=True)

Utility class for loading and using the Example3 dataset: http://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html

This dataset is a very small toy sample that is useful as a sanity check when testing your code.

__init__(location="'$PYTORCHLTR_DATASET_PATH'/example3", split='train', normalize=True, filter_queries=None, download=True, validate_checksums=True)
Parameters
  • location (str) – Directory where the dataset is located.

  • split (str) – The data split to load (“train” or “test”).

  • normalize (bool) – Whether to perform query-level feature normalization.

  • filter_queries (Optional[bool]) – Whether to filter out queries that have no relevant items. If not given, queries are filtered for the test set but not the train set.

  • download (bool) – Whether to download the dataset if it does not exist.

  • validate_checksums (bool) – Whether to validate the dataset files via sha256.
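The normalize flag applies per-query feature normalization. A common scheme, used here as a mental model, is min-max scaling of each feature within a single query's document list; this sketch assumes min-max scaling and is not taken from the library source:

```python
def query_minmax_normalize(features):
    """Min-max scale each feature column within one query's document list.

    `features` is a list of equal-length rows (documents x features).
    Columns with zero range are mapped to 0.0.
    """
    n_features = len(features[0])
    cols = list(zip(*features))  # transpose to per-feature columns
    out = []
    for row in features:
        scaled = []
        for j in range(n_features):
            lo, hi = min(cols[j]), max(cols[j])
            # Guard against constant columns to avoid division by zero.
            scaled.append((row[j] - lo) / (hi - lo) if hi > lo else 0.0)
        out.append(scaled)
    return out
```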

static collate_fn(list_sampler=None)

Returns a collate_fn that can be used to collate batches.

Parameters

list_sampler (Optional[ListSampler]) – Sampler to use for sampling lists of documents.

Return type

Callable[[List[SVMRankItem]], SVMRankBatch]
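Because queries contain varying numbers of documents, collating a batch requires padding each query's tensors to a common length and recording the original lengths (the n field) so padding can be masked out later. A minimal pure-Python sketch of that padding step follows; the library's actual collate_fn operates on SVMRankItem objects and returns torch tensors, and the function name pad_and_batch is hypothetical:

```python
def pad_and_batch(relevance_lists, pad_value=0):
    """Pad variable-length relevance lists to the batch maximum.

    Returns the padded lists plus the original lengths, so downstream
    code can mask out the padded positions.
    """
    lengths = [len(r) for r in relevance_lists]
    max_n = max(lengths)
    padded = [r + [pad_value] * (max_n - len(r)) for r in relevance_lists]
    return padded, lengths
```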

__getitem__(index)

Returns the item at given index.

Parameters

index (int) – The index.

Return type

SVMRankItem

Returns

A pytorchltr.datasets.svmrank.SVMRankItem that contains features, relevance, qid, n and sparse fields.

__len__()
Returns

The length of the dataset.

Return type

int

Istella

class pytorchltr.datasets.Istella(location="'$PYTORCHLTR_DATASET_PATH'/istella", split='train', normalize=True, filter_queries=None, download=True, validate_checksums=True)

Utility class for downloading and using the istella dataset: http://quickrank.isti.cnr.it/istella-dataset/.

__init__(location="'$PYTORCHLTR_DATASET_PATH'/istella", split='train', normalize=True, filter_queries=None, download=True, validate_checksums=True)
Parameters
  • location (str) – Directory where the dataset is located.

  • split (str) – The data split to load (“train” or “test”).

  • normalize (bool) – Whether to perform query-level feature normalization.

  • filter_queries (Optional[bool]) – Whether to filter out queries that have no relevant items. If not given, queries are filtered for the test set but not the train set.

  • download (bool) – Whether to download the dataset if it does not exist.

  • validate_checksums (bool) – Whether to validate the dataset files via sha256.

static collate_fn(list_sampler=None)

Returns a collate_fn that can be used to collate batches.

Parameters

list_sampler (Optional[ListSampler]) – Sampler to use for sampling lists of documents.

Return type

Callable[[List[SVMRankItem]], SVMRankBatch]

__getitem__(index)

Returns the item at given index.

Parameters

index (int) – The index.

Return type

SVMRankItem

Returns

A pytorchltr.datasets.svmrank.SVMRankItem that contains features, relevance, qid, n and sparse fields.

__len__()
Returns

The length of the dataset.

Return type

int

Istella-S

class pytorchltr.datasets.IstellaS(location="'$PYTORCHLTR_DATASET_PATH'/istella_s", split='train', normalize=True, filter_queries=None, download=True, validate_checksums=True)

Utility class for downloading and using the istella-s dataset: http://quickrank.isti.cnr.it/istella-dataset/.

This dataset is a smaller sampled version of the Istella dataset.

__init__(location="'$PYTORCHLTR_DATASET_PATH'/istella_s", split='train', normalize=True, filter_queries=None, download=True, validate_checksums=True)
Parameters
  • location (str) – Directory where the dataset is located.

  • split (str) – The data split to load (“train”, “test” or “vali”).

  • normalize (bool) – Whether to perform query-level feature normalization.

  • filter_queries (Optional[bool]) – Whether to filter out queries that have no relevant items. If not given, queries are filtered for the test set but not the train set.

  • download (bool) – Whether to download the dataset if it does not exist.

  • validate_checksums (bool) – Whether to validate the dataset files via sha256.

static collate_fn(list_sampler=None)

Returns a collate_fn that can be used to collate batches.

Parameters

list_sampler (Optional[ListSampler]) – Sampler to use for sampling lists of documents.

Return type

Callable[[List[SVMRankItem]], SVMRankBatch]

__getitem__(index)

Returns the item at given index.

Parameters

index (int) – The index.

Return type

SVMRankItem

Returns

A pytorchltr.datasets.svmrank.SVMRankItem that contains features, relevance, qid, n and sparse fields.

__len__()
Returns

The length of the dataset.

Return type

int

Istella-X

class pytorchltr.datasets.IstellaX(location="'$PYTORCHLTR_DATASET_PATH'/istella_x", split='train', normalize=True, filter_queries=None, download=True, validate_checksums=True)

Utility class for downloading and using the istella-X dataset: http://quickrank.isti.cnr.it/istella-dataset/.

__init__(location="'$PYTORCHLTR_DATASET_PATH'/istella_x", split='train', normalize=True, filter_queries=None, download=True, validate_checksums=True)
Parameters
  • location (str) – Directory where the dataset is located.

  • split (str) – The data split to load (“train”, “test” or “vali”).

  • normalize (bool) – Whether to perform query-level feature normalization.

  • filter_queries (Optional[bool]) – Whether to filter out queries that have no relevant items. If not given, queries are filtered for the test set but not the train set.

  • download (bool) – Whether to download the dataset if it does not exist.

  • validate_checksums (bool) – Whether to validate the dataset files via sha256.

static collate_fn(list_sampler=None)

Returns a collate_fn that can be used to collate batches.

Parameters

list_sampler (Optional[ListSampler]) – Sampler to use for sampling lists of documents.

Return type

Callable[[List[SVMRankItem]], SVMRankBatch]

__getitem__(index)

Returns the item at given index.

Parameters

index (int) – The index.

Return type

SVMRankItem

Returns

A pytorchltr.datasets.svmrank.SVMRankItem that contains features, relevance, qid, n and sparse fields.

__len__()
Returns

The length of the dataset.

Return type

int

MSLR-WEB10K

class pytorchltr.datasets.MSLR10K(location="'$PYTORCHLTR_DATASET_PATH'/MSLR10K", split='train', fold=1, normalize=True, filter_queries=None, download=True, validate_checksums=True)

Utility class for downloading and using the MSLR-WEB10K dataset: https://www.microsoft.com/en-us/research/project/mslr/.

This dataset is a smaller sampled version of the MSLR-WEB30K dataset.

__init__(location="'$PYTORCHLTR_DATASET_PATH'/MSLR10K", split='train', fold=1, normalize=True, filter_queries=None, download=True, validate_checksums=True)
Parameters
  • location (str) – Directory where the dataset is located.

  • split (str) – The data split to load (“train”, “test” or “vali”).

  • fold (int) – Which data fold to load (1…5).

  • normalize (bool) – Whether to perform query-level feature normalization.

  • filter_queries (Optional[bool]) – Whether to filter out queries that have no relevant items. If not given, queries are filtered for the test set but not the train set.

  • download (bool) – Whether to download the dataset if it does not exist.

  • validate_checksums (bool) – Whether to validate the dataset files via sha256.

static collate_fn(list_sampler=None)

Returns a collate_fn that can be used to collate batches.

Parameters

list_sampler (Optional[ListSampler]) – Sampler to use for sampling lists of documents.

Return type

Callable[[List[SVMRankItem]], SVMRankBatch]

__getitem__(index)

Returns the item at given index.

Parameters

index (int) – The index.

Return type

SVMRankItem

Returns

A pytorchltr.datasets.svmrank.SVMRankItem that contains features, relevance, qid, n and sparse fields.

__len__()
Returns

The length of the dataset.

Return type

int

MSLR-WEB30K

class pytorchltr.datasets.MSLR30K(location="'$PYTORCHLTR_DATASET_PATH'/MSLR30K", split='train', fold=1, normalize=True, filter_queries=None, download=True, validate_checksums=True)

Utility class for downloading and using the MSLR-WEB30K dataset: https://www.microsoft.com/en-us/research/project/mslr/.

__init__(location="'$PYTORCHLTR_DATASET_PATH'/MSLR30K", split='train', fold=1, normalize=True, filter_queries=None, download=True, validate_checksums=True)
Parameters
  • location (str) – Directory where the dataset is located.

  • split (str) – The data split to load (“train”, “test” or “vali”).

  • fold (int) – Which data fold to load (1…5).

  • normalize (bool) – Whether to perform query-level feature normalization.

  • filter_queries (Optional[bool]) – Whether to filter out queries that have no relevant items. If not given, queries are filtered for the test set but not the train set.

  • download (bool) – Whether to download the dataset if it does not exist.

  • validate_checksums (bool) – Whether to validate the dataset files via sha256.

static collate_fn(list_sampler=None)

Returns a collate_fn that can be used to collate batches.

Parameters

list_sampler (Optional[ListSampler]) – Sampler to use for sampling lists of documents.

Return type

Callable[[List[SVMRankItem]], SVMRankBatch]

__getitem__(index)

Returns the item at given index.

Parameters

index (int) – The index.

Return type

SVMRankItem

Returns

A pytorchltr.datasets.svmrank.SVMRankItem that contains features, relevance, qid, n and sparse fields.

__len__()
Returns

The length of the dataset.

Return type

int