Datasets
PyTorchLTR provides several LTR dataset utility classes that can be used to automatically download and process the dataset files.
Warning
PyTorchLTR provides utilities to automatically download and prepare several public LTR datasets. We cannot vouch for the quality, correctness or usefulness of these datasets. We do not host or distribute any datasets and it is ultimately your responsibility to determine whether you have permission to use each dataset under its respective license.
Example
The following is a usage example for the small Example3 dataset.
>>> from pytorchltr.datasets import Example3
>>> train = Example3(split="train")
>>> test = Example3(split="test")
>>> print(len(train))
3
>>> print(len(test))
1
>>> sample = train[0]
>>> print(sample["features"])
tensor([[1.0000, 1.0000, 0.0000, 0.3333, 0.0000],
[0.0000, 0.0000, 1.0000, 0.0000, 1.0000],
[0.0000, 1.0000, 0.0000, 1.0000, 0.0000],
[0.0000, 0.0000, 1.0000, 0.6667, 0.0000]])
>>> print(sample["relevance"])
tensor([3, 2, 1, 1])
>>> print(sample["n"])
4
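Because normalize=True by default, the feature values above are scaled per query. As a rough illustration only, the following sketches per-query min-max scaling in plain Python; the exact scheme pytorchltr applies may differ, and normalize_query_features is a hypothetical helper, not part of the library:

```python
def normalize_query_features(features):
    """Min-max scale each feature column within one query's document list.

    `features` is a list of rows (one per document); a column that is
    constant across the query maps to 0.0 to avoid division by zero.
    """
    n_cols = len(features[0])
    lo = [min(row[c] for row in features) for c in range(n_cols)]
    hi = [max(row[c] for row in features) for c in range(n_cols)]
    return [
        [0.0 if hi[c] == lo[c] else (row[c] - lo[c]) / (hi[c] - lo[c])
         for c in range(n_cols)]
        for row in features
    ]

raw = [[3.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
print(normalize_query_features(raw))  # [[1.0, 0.0], [0.0, 0.0], [0.5, 0.0]]
```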
Note
PyTorchLTR looks for dataset files in (and downloads them to) the following locations:
- The location arg, if it is specified in the constructor of each respective Dataset class.
- $PYTORCHLTR_DATASET_PATH/{dataset_name}, if $PYTORCHLTR_DATASET_PATH is a defined environment variable.
- $DATASET_PATH/{dataset_name}, if $DATASET_PATH is a defined environment variable.
- $HOME/.pytorchltr_datasets/{dataset_name}, if all the above fail.
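The lookup order above can be sketched as follows; resolve_dataset_path is a hypothetical helper written for illustration, not part of the library's API:

```python
import os

def resolve_dataset_path(dataset_name, location=None):
    """Mirror the documented lookup order for dataset files."""
    if location is not None:
        return location  # 1. explicit constructor argument wins
    for var in ("PYTORCHLTR_DATASET_PATH", "DATASET_PATH"):
        root = os.environ.get(var)
        if root:  # 2./3. environment variables, in order
            return os.path.join(root, dataset_name)
    # 4. fallback under the user's home directory
    return os.path.join(os.path.expanduser("~"),
                        ".pytorchltr_datasets", dataset_name)

print(resolve_dataset_path("example3", location="/data/example3"))  # /data/example3
```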
SVMRank datasets
Example3

class pytorchltr.datasets.Example3(location="'$PYTORCHLTR_DATASET_PATH'/example3", split='train', normalize=True, filter_queries=None, download=True, validate_checksums=True)

Utility class for loading and using the Example3 dataset: http://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html
This dataset is a very small toy sample which is useful as a sanity check for testing your code.

__init__(location="'$PYTORCHLTR_DATASET_PATH'/example3", split='train', normalize=True, filter_queries=None, download=True, validate_checksums=True)

- Parameters
  - location (str) – Directory where the dataset is located.
  - split (str) – The data split to load ("train" or "test").
  - normalize (bool) – Whether to perform query-level feature normalization.
  - filter_queries (Optional[bool]) – Whether to filter out queries that have no relevant items. If not given this will filter queries for the test set but not the train set.
  - download (bool) – Whether to download the dataset if it does not exist.
  - validate_checksums (bool) – Whether to validate the dataset files via sha256.

static collate_fn(list_sampler=None)

Returns a collate_fn that can be used to collate batches.

- Parameters
  - list_sampler (Optional[ListSampler]) – Sampler to use for sampling lists of documents.
- Return type
  - Callable[[List[SVMRankItem]], SVMRankBatch]

__getitem__(index)

Returns the item at the given index.

- Parameters
  - index (int) – The index.
- Return type
  - SVMRankItem
- Returns
  - A pytorchltr.datasets.svmrank.SVMRankItem that contains features, relevance, qid, n and sparse fields.

__len__()

- Returns
  - The length of the dataset.
- Return type
  - int
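A collate_fn pads variable-length query lists so they can be stacked into one batch. The real collate_fn returns an SVMRankBatch of torch tensors; the padding step it performs can be sketched in plain Python like this (pad_batch is an illustrative helper, not the library's implementation):

```python
def pad_batch(samples, pad_value=0.0):
    """Pad a list of {'features', 'relevance', 'n'} samples to equal length."""
    max_n = max(s["n"] for s in samples)       # longest document list in batch
    n_feats = len(samples[0]["features"][0])
    feats, rels, ns = [], [], []
    for s in samples:
        pad = max_n - s["n"]
        feats.append(s["features"] + [[pad_value] * n_feats] * pad)
        rels.append(s["relevance"] + [0] * pad)
        ns.append(s["n"])                      # keep true lengths for masking
    return {"features": feats, "relevance": rels, "n": ns}

batch = pad_batch([
    {"features": [[1.0, 0.0], [0.0, 1.0]], "relevance": [2, 1], "n": 2},
    {"features": [[0.5, 0.5]], "relevance": [1], "n": 1},
])
print(batch["relevance"])  # [[2, 1], [1, 0]]
```

Keeping the original lengths in "n" lets downstream loss functions mask out the padded entries.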
Istella

class pytorchltr.datasets.Istella(location="'$PYTORCHLTR_DATASET_PATH'/istella", split='train', normalize=True, filter_queries=None, download=True, validate_checksums=True)

Utility class for downloading and using the istella dataset: http://quickrank.isti.cnr.it/istella-dataset/

__init__(location="'$PYTORCHLTR_DATASET_PATH'/istella", split='train', normalize=True, filter_queries=None, download=True, validate_checksums=True)

- Parameters
  - location (str) – Directory where the dataset is located.
  - split (str) – The data split to load ("train" or "test").
  - normalize (bool) – Whether to perform query-level feature normalization.
  - filter_queries (Optional[bool]) – Whether to filter out queries that have no relevant items. If not given this will filter queries for the test set but not the train set.
  - download (bool) – Whether to download the dataset if it does not exist.
  - validate_checksums (bool) – Whether to validate the dataset files via sha256.

static collate_fn(list_sampler=None)

Returns a collate_fn that can be used to collate batches.

- Parameters
  - list_sampler (Optional[ListSampler]) – Sampler to use for sampling lists of documents.
- Return type
  - Callable[[List[SVMRankItem]], SVMRankBatch]

__getitem__(index)

Returns the item at the given index.

- Parameters
  - index (int) – The index.
- Return type
  - SVMRankItem
- Returns
  - A pytorchltr.datasets.svmrank.SVMRankItem that contains features, relevance, qid, n and sparse fields.

__len__()

- Returns
  - The length of the dataset.
- Return type
  - int
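The filter_queries option described above drops queries whose documents are all non-relevant (useful for evaluation, where such queries contribute nothing to ranking metrics). In effect it does something like the following sketch; filter_no_relevant is a hypothetical helper, not the library's code:

```python
def filter_no_relevant(queries):
    """Keep only queries that have at least one document with relevance > 0."""
    return [q for q in queries if any(r > 0 for r in q["relevance"])]

queries = [
    {"qid": 1, "relevance": [0, 0, 0]},   # no relevant documents: dropped
    {"qid": 2, "relevance": [0, 2, 1]},   # has relevant documents: kept
]
print([q["qid"] for q in filter_no_relevant(queries)])  # [2]
```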
Istella-S

class pytorchltr.datasets.IstellaS(location="'$PYTORCHLTR_DATASET_PATH'/istella_s", split='train', normalize=True, filter_queries=None, download=True, validate_checksums=True)

Utility class for downloading and using the istella-s dataset: http://quickrank.isti.cnr.it/istella-dataset/
This dataset is a smaller sampled version of the Istella dataset.

__init__(location="'$PYTORCHLTR_DATASET_PATH'/istella_s", split='train', normalize=True, filter_queries=None, download=True, validate_checksums=True)

- Parameters
  - location (str) – Directory where the dataset is located.
  - split (str) – The data split to load ("train", "test" or "vali").
  - normalize (bool) – Whether to perform query-level feature normalization.
  - filter_queries (Optional[bool]) – Whether to filter out queries that have no relevant items. If not given this will filter queries for the test set but not the train set.
  - download (bool) – Whether to download the dataset if it does not exist.
  - validate_checksums (bool) – Whether to validate the dataset files via sha256.

static collate_fn(list_sampler=None)

Returns a collate_fn that can be used to collate batches.

- Parameters
  - list_sampler (Optional[ListSampler]) – Sampler to use for sampling lists of documents.
- Return type
  - Callable[[List[SVMRankItem]], SVMRankBatch]

__getitem__(index)

Returns the item at the given index.

- Parameters
  - index (int) – The index.
- Return type
  - SVMRankItem
- Returns
  - A pytorchltr.datasets.svmrank.SVMRankItem that contains features, relevance, qid, n and sparse fields.

__len__()

- Returns
  - The length of the dataset.
- Return type
  - int
Istella-X

class pytorchltr.datasets.IstellaX(location="'$PYTORCHLTR_DATASET_PATH'/istella_x", split='train', normalize=True, filter_queries=None, download=True, validate_checksums=True)

Utility class for downloading and using the istella-X dataset: http://quickrank.isti.cnr.it/istella-dataset/

__init__(location="'$PYTORCHLTR_DATASET_PATH'/istella_x", split='train', normalize=True, filter_queries=None, download=True, validate_checksums=True)

- Parameters
  - location (str) – Directory where the dataset is located.
  - split (str) – The data split to load ("train", "test" or "vali").
  - normalize (bool) – Whether to perform query-level feature normalization.
  - filter_queries (Optional[bool]) – Whether to filter out queries that have no relevant items. If not given this will filter queries for the test set but not the train set.
  - download (bool) – Whether to download the dataset if it does not exist.
  - validate_checksums (bool) – Whether to validate the dataset files via sha256.

static collate_fn(list_sampler=None)

Returns a collate_fn that can be used to collate batches.

- Parameters
  - list_sampler (Optional[ListSampler]) – Sampler to use for sampling lists of documents.
- Return type
  - Callable[[List[SVMRankItem]], SVMRankBatch]

__getitem__(index)

Returns the item at the given index.

- Parameters
  - index (int) – The index.
- Return type
  - SVMRankItem
- Returns
  - A pytorchltr.datasets.svmrank.SVMRankItem that contains features, relevance, qid, n and sparse fields.

__len__()

- Returns
  - The length of the dataset.
- Return type
  - int
MSLR-WEB10K

class pytorchltr.datasets.MSLR10K(location="'$PYTORCHLTR_DATASET_PATH'/MSLR10K", split='train', fold=1, normalize=True, filter_queries=None, download=True, validate_checksums=True)

Utility class for downloading and using the MSLR-WEB10K dataset: https://www.microsoft.com/en-us/research/project/mslr/
This dataset is a smaller sampled version of the MSLR-WEB30K dataset.

__init__(location="'$PYTORCHLTR_DATASET_PATH'/MSLR10K", split='train', fold=1, normalize=True, filter_queries=None, download=True, validate_checksums=True)

- Parameters
  - location (str) – Directory where the dataset is located.
  - split (str) – The data split to load ("train", "test" or "vali").
  - fold (int) – Which data fold to load (1...5).
  - normalize (bool) – Whether to perform query-level feature normalization.
  - filter_queries (Optional[bool]) – Whether to filter out queries that have no relevant items. If not given this will filter queries for the test set but not the train set.
  - download (bool) – Whether to download the dataset if it does not exist.
  - validate_checksums (bool) – Whether to validate the dataset files via sha256.

static collate_fn(list_sampler=None)

Returns a collate_fn that can be used to collate batches.

- Parameters
  - list_sampler (Optional[ListSampler]) – Sampler to use for sampling lists of documents.
- Return type
  - Callable[[List[SVMRankItem]], SVMRankBatch]

__getitem__(index)

Returns the item at the given index.

- Parameters
  - index (int) – The index.
- Return type
  - SVMRankItem
- Returns
  - A pytorchltr.datasets.svmrank.SVMRankItem that contains features, relevance, qid, n and sparse fields.

__len__()

- Returns
  - The length of the dataset.
- Return type
  - int
MSLR-WEB30K

class pytorchltr.datasets.MSLR30K(location="'$PYTORCHLTR_DATASET_PATH'/MSLR30K", split='train', fold=1, normalize=True, filter_queries=None, download=True, validate_checksums=True)

Utility class for downloading and using the MSLR-WEB30K dataset: https://www.microsoft.com/en-us/research/project/mslr/

__init__(location="'$PYTORCHLTR_DATASET_PATH'/MSLR30K", split='train', fold=1, normalize=True, filter_queries=None, download=True, validate_checksums=True)

- Parameters
  - location (str) – Directory where the dataset is located.
  - split (str) – The data split to load ("train", "test" or "vali").
  - fold (int) – Which data fold to load (1...5).
  - normalize (bool) – Whether to perform query-level feature normalization.
  - filter_queries (Optional[bool]) – Whether to filter out queries that have no relevant items. If not given this will filter queries for the test set but not the train set.
  - download (bool) – Whether to download the dataset if it does not exist.
  - validate_checksums (bool) – Whether to validate the dataset files via sha256.

static collate_fn(list_sampler=None)

Returns a collate_fn that can be used to collate batches.

- Parameters
  - list_sampler (Optional[ListSampler]) – Sampler to use for sampling lists of documents.
- Return type
  - Callable[[List[SVMRankItem]], SVMRankBatch]

__getitem__(index)

Returns the item at the given index.

- Parameters
  - index (int) – The index.
- Return type
  - SVMRankItem
- Returns
  - A pytorchltr.datasets.svmrank.SVMRankItem that contains features, relevance, qid, n and sparse fields.

__len__()

- Returns
  - The length of the dataset.
- Return type
  - int