Datasets
PyTorchLTR provides several LTR dataset utility classes that can be used to automatically download and process the dataset files.
Warning
PyTorchLTR provides utilities to automatically download and prepare several public LTR datasets. We cannot vouch for the quality, correctness or usefulness of these datasets. We do not host or distribute any datasets and it is ultimately your responsibility to determine whether you have permission to use each dataset under its respective license.
Example
The following is a usage example for the small Example3 dataset.
>>> from pytorchltr.datasets import Example3
>>> train = Example3(split="train")
>>> test = Example3(split="test")
>>> print(len(train))
3
>>> print(len(test))
1
>>> sample = train[0]
>>> print(sample["features"])
tensor([[1.0000, 1.0000, 0.0000, 0.3333, 0.0000],
[0.0000, 0.0000, 1.0000, 0.0000, 1.0000],
[0.0000, 1.0000, 0.0000, 1.0000, 0.0000],
[0.0000, 0.0000, 1.0000, 0.6667, 0.0000]])
>>> print(sample["relevance"])
tensor([3, 2, 1, 1])
>>> print(sample["n"])
4
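Because normalize=True by default, the feature values above are scaled per query. As a rough illustration only, the following sketches per-query min-max scaling in plain Python; the exact scheme pytorchltr applies may differ, and normalize_query_features is a hypothetical helper, not part of the library:

```python
def normalize_query_features(features):
    """Min-max scale each feature column within one query's document list.

    `features` is a list of rows (one per document); a column that is
    constant across the query maps to 0.0 to avoid division by zero.
    """
    n_cols = len(features[0])
    lo = [min(row[c] for row in features) for c in range(n_cols)]
    hi = [max(row[c] for row in features) for c in range(n_cols)]
    return [
        [0.0 if hi[c] == lo[c] else (row[c] - lo[c]) / (hi[c] - lo[c])
         for c in range(n_cols)]
        for row in features
    ]

raw = [[3.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
print(normalize_query_features(raw))  # [[1.0, 0.0], [0.0, 0.0], [0.5, 0.0]]
```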
Note
PyTorchLTR looks for dataset files in (and downloads them to) the following locations:
- The location arg, if it is specified in the constructor of each respective Dataset class.
- $PYTORCHLTR_DATASET_PATH/{dataset_name}, if $PYTORCHLTR_DATASET_PATH is a defined environment variable.
- $DATASET_PATH/{dataset_name}, if $DATASET_PATH is a defined environment variable.
- $HOME/.pytorchltr_datasets/{dataset_name}, if all the above fail.
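The lookup order above can be sketched as follows; resolve_dataset_path is a hypothetical helper written for illustration, not part of the library's API:

```python
import os

def resolve_dataset_path(dataset_name, location=None):
    """Mirror the documented lookup order for dataset files."""
    if location is not None:
        return location  # 1. explicit constructor argument wins
    for var in ("PYTORCHLTR_DATASET_PATH", "DATASET_PATH"):
        root = os.environ.get(var)
        if root:  # 2./3. environment variables, in order
            return os.path.join(root, dataset_name)
    # 4. fallback under the user's home directory
    return os.path.join(os.path.expanduser("~"),
                        ".pytorchltr_datasets", dataset_name)

print(resolve_dataset_path("example3", location="/data/example3"))  # /data/example3
```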
SVMRank datasets
Example3

class pytorchltr.datasets.Example3(location="'$PYTORCHLTR_DATASET_PATH'/example3", split='train', normalize=True, filter_queries=None, download=True, validate_checksums=True)

Utility class for loading and using the Example3 dataset: http://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html
This dataset is a very small toy sample which is useful as a sanity check for testing your code.

__init__(location="'$PYTORCHLTR_DATASET_PATH'/example3", split='train', normalize=True, filter_queries=None, download=True, validate_checksums=True)

- Parameters
  - location (str) – Directory where the dataset is located.
  - split (str) – The data split to load ("train" or "test").
  - normalize (bool) – Whether to perform query-level feature normalization.
  - filter_queries (Optional[bool]) – Whether to filter out queries that have no relevant items. If not given this will filter queries for the test set but not the train set.
  - download (bool) – Whether to download the dataset if it does not exist.
  - validate_checksums (bool) – Whether to validate the dataset files via sha256.

static collate_fn(list_sampler=None)

Returns a collate_fn that can be used to collate batches.

- Parameters
  - list_sampler (Optional[ListSampler]) – Sampler to use for sampling lists of documents.
- Return type
  - Callable[[List[SVMRankItem]], SVMRankBatch]

__getitem__(index)

Returns the item at the given index.

- Parameters
  - index (int) – The index.
- Return type
  - SVMRankItem
- Returns
  - A pytorchltr.datasets.svmrank.SVMRankItem that contains features, relevance, qid, n and sparse fields.

__len__()

- Returns
  - The length of the dataset.
- Return type
  - int
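A collate_fn pads variable-length query lists so they can be stacked into one batch. The real collate_fn returns an SVMRankBatch of torch tensors; the padding step it performs can be sketched in plain Python like this (pad_batch is an illustrative helper, not the library's implementation):

```python
def pad_batch(samples, pad_value=0.0):
    """Pad a list of {'features', 'relevance', 'n'} samples to equal length."""
    max_n = max(s["n"] for s in samples)       # longest document list in batch
    n_feats = len(samples[0]["features"][0])
    feats, rels, ns = [], [], []
    for s in samples:
        pad = max_n - s["n"]
        feats.append(s["features"] + [[pad_value] * n_feats] * pad)
        rels.append(s["relevance"] + [0] * pad)
        ns.append(s["n"])                      # keep true lengths for masking
    return {"features": feats, "relevance": rels, "n": ns}

batch = pad_batch([
    {"features": [[1.0, 0.0], [0.0, 1.0]], "relevance": [2, 1], "n": 2},
    {"features": [[0.5, 0.5]], "relevance": [1], "n": 1},
])
print(batch["relevance"])  # [[2, 1], [1, 0]]
```

Keeping the original lengths in "n" lets downstream loss functions mask out the padded entries.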
Istella

class pytorchltr.datasets.Istella(location="'$PYTORCHLTR_DATASET_PATH'/istella", split='train', normalize=True, filter_queries=None, download=True, validate_checksums=True)

Utility class for downloading and using the istella dataset: http://quickrank.isti.cnr.it/istella-dataset/

__init__(location="'$PYTORCHLTR_DATASET_PATH'/istella", split='train', normalize=True, filter_queries=None, download=True, validate_checksums=True)

- Parameters
  - location (str) – Directory where the dataset is located.
  - split (str) – The data split to load ("train" or "test").
  - normalize (bool) – Whether to perform query-level feature normalization.
  - filter_queries (Optional[bool]) – Whether to filter out queries that have no relevant items. If not given this will filter queries for the test set but not the train set.
  - download (bool) – Whether to download the dataset if it does not exist.
  - validate_checksums (bool) – Whether to validate the dataset files via sha256.

static collate_fn(list_sampler=None)

Returns a collate_fn that can be used to collate batches.

- Parameters
  - list_sampler (Optional[ListSampler]) – Sampler to use for sampling lists of documents.
- Return type
  - Callable[[List[SVMRankItem]], SVMRankBatch]

__getitem__(index)

Returns the item at the given index.

- Parameters
  - index (int) – The index.
- Return type
  - SVMRankItem
- Returns
  - A pytorchltr.datasets.svmrank.SVMRankItem that contains features, relevance, qid, n and sparse fields.

__len__()

- Returns
  - The length of the dataset.
- Return type
  - int
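The filter_queries option described above drops queries whose documents are all non-relevant (useful for evaluation, where such queries contribute nothing to ranking metrics). In effect it does something like the following sketch; filter_no_relevant is a hypothetical helper, not the library's code:

```python
def filter_no_relevant(queries):
    """Keep only queries that have at least one document with relevance > 0."""
    return [q for q in queries if any(r > 0 for r in q["relevance"])]

queries = [
    {"qid": 1, "relevance": [0, 0, 0]},   # no relevant documents: dropped
    {"qid": 2, "relevance": [0, 2, 1]},   # has relevant documents: kept
]
print([q["qid"] for q in filter_no_relevant(queries)])  # [2]
```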
Istella-S

class pytorchltr.datasets.IstellaS(location="'$PYTORCHLTR_DATASET_PATH'/istella_s", split='train', normalize=True, filter_queries=None, download=True, validate_checksums=True)

Utility class for downloading and using the istella-s dataset: http://quickrank.isti.cnr.it/istella-dataset/
This dataset is a smaller sampled version of the Istella dataset.

__init__(location="'$PYTORCHLTR_DATASET_PATH'/istella_s", split='train', normalize=True, filter_queries=None, download=True, validate_checksums=True)

- Parameters
  - location (str) – Directory where the dataset is located.
  - split (str) – The data split to load ("train", "test" or "vali").
  - normalize (bool) – Whether to perform query-level feature normalization.
  - filter_queries (Optional[bool]) – Whether to filter out queries that have no relevant items. If not given this will filter queries for the test set but not the train set.
  - download (bool) – Whether to download the dataset if it does not exist.
  - validate_checksums (bool) – Whether to validate the dataset files via sha256.

static collate_fn(list_sampler=None)

Returns a collate_fn that can be used to collate batches.

- Parameters
  - list_sampler (Optional[ListSampler]) – Sampler to use for sampling lists of documents.
- Return type
  - Callable[[List[SVMRankItem]], SVMRankBatch]

__getitem__(index)

Returns the item at the given index.

- Parameters
  - index (int) – The index.
- Return type
  - SVMRankItem
- Returns
  - A pytorchltr.datasets.svmrank.SVMRankItem that contains features, relevance, qid, n and sparse fields.

__len__()

- Returns
  - The length of the dataset.
- Return type
  - int
Istella-X

class pytorchltr.datasets.IstellaX(location="'$PYTORCHLTR_DATASET_PATH'/istella_x", split='train', normalize=True, filter_queries=None, download=True, validate_checksums=True)

Utility class for downloading and using the istella-X dataset: http://quickrank.isti.cnr.it/istella-dataset/

__init__(location="'$PYTORCHLTR_DATASET_PATH'/istella_x", split='train', normalize=True, filter_queries=None, download=True, validate_checksums=True)

- Parameters
  - location (str) – Directory where the dataset is located.
  - split (str) – The data split to load ("train", "test" or "vali").
  - normalize (bool) – Whether to perform query-level feature normalization.
  - filter_queries (Optional[bool]) – Whether to filter out queries that have no relevant items. If not given this will filter queries for the test set but not the train set.
  - download (bool) – Whether to download the dataset if it does not exist.
  - validate_checksums (bool) – Whether to validate the dataset files via sha256.

static collate_fn(list_sampler=None)

Returns a collate_fn that can be used to collate batches.

- Parameters
  - list_sampler (Optional[ListSampler]) – Sampler to use for sampling lists of documents.
- Return type
  - Callable[[List[SVMRankItem]], SVMRankBatch]

__getitem__(index)

Returns the item at the given index.

- Parameters
  - index (int) – The index.
- Return type
  - SVMRankItem
- Returns
  - A pytorchltr.datasets.svmrank.SVMRankItem that contains features, relevance, qid, n and sparse fields.

__len__()

- Returns
  - The length of the dataset.
- Return type
  - int
MSLR-WEB10K

class pytorchltr.datasets.MSLR10K(location="'$PYTORCHLTR_DATASET_PATH'/MSLR10K", split='train', fold=1, normalize=True, filter_queries=None, download=True, validate_checksums=True)

Utility class for downloading and using the MSLR-WEB10K dataset: https://www.microsoft.com/en-us/research/project/mslr/
This dataset is a smaller sampled version of the MSLR-WEB30K dataset.

__init__(location="'$PYTORCHLTR_DATASET_PATH'/MSLR10K", split='train', fold=1, normalize=True, filter_queries=None, download=True, validate_checksums=True)

- Parameters
  - location (str) – Directory where the dataset is located.
  - split (str) – The data split to load ("train", "test" or "vali").
  - fold (int) – Which data fold to load (1...5).
  - normalize (bool) – Whether to perform query-level feature normalization.
  - filter_queries (Optional[bool]) – Whether to filter out queries that have no relevant items. If not given this will filter queries for the test set but not the train set.
  - download (bool) – Whether to download the dataset if it does not exist.
  - validate_checksums (bool) – Whether to validate the dataset files via sha256.

static collate_fn(list_sampler=None)

Returns a collate_fn that can be used to collate batches.

- Parameters
  - list_sampler (Optional[ListSampler]) – Sampler to use for sampling lists of documents.
- Return type
  - Callable[[List[SVMRankItem]], SVMRankBatch]

__getitem__(index)

Returns the item at the given index.

- Parameters
  - index (int) – The index.
- Return type
  - SVMRankItem
- Returns
  - A pytorchltr.datasets.svmrank.SVMRankItem that contains features, relevance, qid, n and sparse fields.

__len__()

- Returns
  - The length of the dataset.
- Return type
  - int
MSLR-WEB30K

class pytorchltr.datasets.MSLR30K(location="'$PYTORCHLTR_DATASET_PATH'/MSLR30K", split='train', fold=1, normalize=True, filter_queries=None, download=True, validate_checksums=True)

Utility class for downloading and using the MSLR-WEB30K dataset: https://www.microsoft.com/en-us/research/project/mslr/

__init__(location="'$PYTORCHLTR_DATASET_PATH'/MSLR30K", split='train', fold=1, normalize=True, filter_queries=None, download=True, validate_checksums=True)

- Parameters
  - location (str) – Directory where the dataset is located.
  - split (str) – The data split to load ("train", "test" or "vali").
  - fold (int) – Which data fold to load (1...5).
  - normalize (bool) – Whether to perform query-level feature normalization.
  - filter_queries (Optional[bool]) – Whether to filter out queries that have no relevant items. If not given this will filter queries for the test set but not the train set.
  - download (bool) – Whether to download the dataset if it does not exist.
  - validate_checksums (bool) – Whether to validate the dataset files via sha256.

static collate_fn(list_sampler=None)

Returns a collate_fn that can be used to collate batches.

- Parameters
  - list_sampler (Optional[ListSampler]) – Sampler to use for sampling lists of documents.
- Return type
  - Callable[[List[SVMRankItem]], SVMRankBatch]

__getitem__(index)

Returns the item at the given index.

- Parameters
  - index (int) – The index.
- Return type
  - SVMRankItem
- Returns
  - A pytorchltr.datasets.svmrank.SVMRankItem that contains features, relevance, qid, n and sparse fields.

__len__()

- Returns
  - The length of the dataset.
- Return type
  - int