recbole.data.dataset¶

class recbole.data.dataset.dataset.Dataset(config, saved_dataset=None)[source]¶

Bases: object

Dataset stores the original dataset in memory. It provides many useful functions for data preprocessing, such as k-core data filtering and missing value imputation. Features are stored as pandas.DataFrame inside Dataset. General and Context-aware Models can use this class.

Calling build() processes the dataset into DataLoaders according to EvalSetting.

Parameters

config (Config) – Global configuration object.
saved_dataset (str, optional) – Restore Dataset object from saved_dataset. Defaults to None.

dataset_name¶

Name of this dataset.

Type: str

dataset_path¶

Local file path of this dataset.

Type: str

field2type¶

Dict mapping feature name (str) to its type (FeatureType).

Type: dict

field2source¶

Dict mapping feature name (str) to its source (FeatureSource). As a special case, if a feature is loaded via the additional_feat_suffix argument, its source is a str: the suffix of its local file (the same suffix given in additional_feat_suffix).

Type: dict

field2id_token¶

Dict mapping feature name (str) to a np.ndarray, which stores the original tokens of this feature. For example, if test is a token-like feature, token_a is remapped to 1 and token_b is remapped to 2, then field2id_token['test'] = ['[PAD]', 'token_a', 'token_b']. (Note that 0 is always PADDING for token-like features.)

Type: dict

field2token_id¶

Dict mapping feature name (str) to a dict, which stores the token remap table of this feature. For example, if test is a token-like feature, token_a is remapped to 1 and token_b is remapped to 2, then field2token_id['test'] = {'[PAD]': 0, 'token_a': 1, 'token_b': 2}. (Note that 0 is always PADDING for token-like features.)

Type: dict
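The relationship between the two remap tables can be illustrated with a minimal standalone sketch. It does not call RecBole; the helper name build_remap and the rule of assigning ids in order of first appearance are assumptions made for illustration only.

```python
def build_remap(tokens, pad_token="[PAD]"):
    """Return (id -> token list, token -> id dict), reserving id 0 for padding.

    Hypothetical helper: ids are assigned in order of first appearance.
    """
    id_token = [pad_token]
    token_id = {pad_token: 0}
    for tok in tokens:
        if tok not in token_id:
            token_id[tok] = len(id_token)
            id_token.append(tok)
    return id_token, token_id


# Repeated tokens keep the id they were given the first time.
id_token, token_id = build_remap(["token_a", "token_b", "token_a"])
print(id_token)
print(token_id)
```

The two structures are inverses of each other: id_token[token_id[t]] == t for every token t, including the padding token.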

field2seqlen¶

Dict mapping feature name (str) to its sequence length (int). For sequence features, their length can be either set in config, or set to the max sequence length of this feature. For token and float features, their length is 1.

Type: dict

uid_field¶

The same as config['USER_ID_FIELD'].

Type: str or None

iid_field¶

The same as config['ITEM_ID_FIELD'].

Type: str or None

label_field¶

The same as config['LABEL_FIELD'].

Type: str or None

time_field¶

The same as config['TIME_FIELD'].

Type: str or None

inter_feat¶

Internal data structure that stores the interaction features. It is loaded from the .inter file.

Type: pandas.DataFrame

user_feat¶

Internal data structure that stores the user features. It is loaded from the .user file if it exists.

Type: pandas.DataFrame or None

item_feat¶

Internal data structure that stores the item features. It is loaded from the .item file if it exists.

Type: pandas.DataFrame or None

feat_list¶

A list containing all the features (pandas.DataFrame), including additional features.

Type: list

property avg_actions_of_items¶

Get the average number of interaction records per item.

Returns: Average number of interaction records per item.
Return type: numpy.float64

property avg_actions_of_users¶

Get the average number of interaction records per user.

Returns: Average number of interaction records per user.
Return type: numpy.float64
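As a rough illustration of what these two averages measure, the sketch below computes them with the standard library. It assumes the average is taken over ids that appear in at least one record; the sample data and the helper name avg_actions are hypothetical, and this is not RecBole's implementation.

```python
from collections import Counter

# Hypothetical (user_id, item_id) interaction records.
interactions = [(1, 1), (1, 2), (2, 1), (3, 1)]


def avg_actions(records, position):
    """Average number of records per distinct id found at `position`."""
    counts = Counter(rec[position] for rec in records)
    return sum(counts.values()) / len(counts)


avg_users = avg_actions(interactions, 0)  # 4 records over 3 distinct users
avg_items = avg_actions(interactions, 1)  # 4 records over 2 distinct items
print(avg_users, avg_items)
```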

build(eval_setting)[source]¶

Process the dataset according to the evaluation setting, including Group, Order and Split. See EvalSetting for details.

Parameters: eval_setting (EvalSetting) – Object containing the evaluation settings, which guide the data processing procedure.
Returns: List of built Dataset objects.
Return type: list

copy(new_inter_feat)[source]¶

Given a new interaction feature, return a new Dataset object whose interaction feature is replaced with new_inter_feat and whose other attributes are unchanged.

Parameters: new_inter_feat (pandas.DataFrame) – The new interaction feature to use.
Returns: The new Dataset object, whose interaction feature has been updated.
Return type: Dataset

copy_field_property(dest_field, source_field)[source]¶

Copy properties from source_field to dest_field.

Parameters

dest_field (str) – Destination field.
source_field (str) – Source field.

fields(ftype=None)[source]¶

Given a feature type, return the names of all fields of this type. If ftype is None, return all fields.

Parameters: ftype (FeatureType, optional) – Type of features.
Returns: List of field names.
Return type: list

property float_like_fields¶

Get fields of type FLOAT and FLOAT_SEQ.

Returns: List of field names.
Return type: list

get_item_feature()[source]¶

Returns: item features
Return type: pandas.DataFrame

get_preload_weight(field)[source]¶

Get preloaded weight matrix, whose rows are sorted by token ids.

0 is used as padding.

Parameters: field (str) – preloaded feature field name.
Returns: preloaded weight matrix. See Args for Data for details.
Return type: numpy.ndarray

get_user_feature()[source]¶

Returns: user features
Return type: pandas.DataFrame

history_item_matrix(value_field=None)[source]¶

Get a dense matrix describing users’ historical interaction records.

history_matrix[i] holds the item_ids that user i has interacted with.

history_value[i] holds the values of user i’s historical interaction records, 0 if value_field = None.

history_len[i] is the number of user i’s historical interaction records.

0 is used as padding.

Parameters

value_field (str, optional) – Data of matrix, which should exist in self.inter_feat. Defaults to None.

Returns

History matrix (torch.Tensor): history_matrix described above.
History values matrix (torch.Tensor): history_value described above.
History length matrix (torch.Tensor): history_len described above.

Return type

tuple
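The padded layout of the three return values can be sketched without RecBole or PyTorch. The example below uses nested Python lists in place of torch.Tensor, always fills in values (rather than handling value_field = None), and assumes ids start at 1 so that row 0 and id 0 act as padding; the helper name history_item_lists and the sample records are hypothetical.

```python
def history_item_lists(records, user_num):
    """Build padded per-user history from (user_id, item_id, value) records.

    Ids are assumed to start at 1; index 0 is reserved for padding.
    """
    per_user = [[] for _ in range(user_num)]
    for uid, iid, value in records:
        per_user[uid].append((iid, value))

    max_len = max(len(h) for h in per_user)
    # Right-pad every row with 0 up to the longest history.
    history_matrix = [[iid for iid, _ in h] + [0] * (max_len - len(h)) for h in per_user]
    history_value = [[v for _, v in h] + [0] * (max_len - len(h)) for h in per_user]
    history_len = [len(h) for h in per_user]
    return history_matrix, history_value, history_len


records = [(1, 2, 5.0), (1, 3, 3.0), (2, 1, 4.0)]
matrix, value, length = history_item_lists(records, user_num=3)
print(matrix)
print(value)
print(length)
```

The same layout applies to history_user_matrix with the roles of users and items swapped.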

history_user_matrix(value_field=None)[source]¶

Get a dense matrix describing items’ historical interaction records.

history_matrix[i] holds the user_ids that have interacted with item i.

history_value[i] holds the values of item i’s historical interaction records, 0 if value_field = None.

history_len[i] is the number of item i’s historical interaction records.

0 is used as padding.

Parameters

value_field (str, optional) – Data of matrix, which should exist in self.inter_feat. Defaults to None.

Returns

History matrix (torch.Tensor): history_matrix described above.
History values matrix (torch.Tensor): history_value described above.
History length matrix (torch.Tensor): history_len described above.

Return type

tuple

id2token(field, ids)[source]¶

Map internal ids to external tokens.

Parameters

field (str) – Field of internal ids.
ids (int, list, np.ndarray or torch.Tensor) – Internal ids.

Returns

The external tokens of internal ids.

Return type

str or np.ndarray

inter_matrix(form='coo', value_field=None)[source]¶

Get a sparse matrix that describes the interactions between user_id and item_id.

Sparse matrix has shape (user_num, item_num).

For a row of <src, tgt>, matrix[src, tgt] = 1 if value_field is None, else matrix[src, tgt] = self.inter_feat[src, tgt].

Parameters

form (str, optional) – Sparse matrix format. Defaults to coo.
value_field (str, optional) – Data of sparse matrix, which should exist in self.inter_feat. Defaults to None.

Returns

Sparse matrix in form coo or csr.

Return type

scipy.sparse
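The rule for filling the matrix can be sketched as coordinate (COO) triples in plain Python. The helper names coo_triples and to_dense and the sample records are hypothetical; this is not RecBole's implementation. The three lists are in the form accepted by scipy.sparse.coo_matrix((data, (rows, cols)), shape=shape).

```python
def coo_triples(records, value_index=None):
    """Turn (user_id, item_id, ...) records into COO row/col/data lists.

    With value_index=None every interaction is stored as 1; otherwise the
    entry at that position of each record is used as the matrix value.
    """
    rows, cols, data = [], [], []
    for rec in records:
        rows.append(rec[0])
        cols.append(rec[1])
        data.append(1 if value_index is None else rec[value_index])
    return rows, cols, data


def to_dense(rows, cols, data, shape):
    """Expand COO triples into a dense nested list, for inspection only."""
    dense = [[0] * shape[1] for _ in range(shape[0])]
    for r, c, v in zip(rows, cols, data):
        dense[r][c] = v
    return dense


records = [(1, 2, 4.5), (2, 1, 3.0)]
binary = to_dense(*coo_triples(records), shape=(3, 3))
valued = to_dense(*coo_triples(records, value_index=2), shape=(3, 3))
print(binary)
print(valued)
```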

property inter_num¶

Get the number of interaction records.

Returns: Number of interaction records.
Return type: int

property item_num¶

Get the number of different tokens of self.iid_field.

Returns: Number of different tokens of self.iid_field.
Return type: int

join(df)[source]¶

Given interaction feature, join user/item feature into it.

Parameters: df (pandas.DataFrame) – Interaction feature to be joined.
Returns: Interaction feature after joining operation.
Return type: pandas.DataFrame

leave_one_out(group_by, leave_one_num=1)[source]¶

Split interaction records by the leave-one-out strategy.

Parameters

group_by (str) – Field name that interaction records should be grouped by before splitting.
leave_one_num (int, optional) – Number of parts whose length is expected to be 1. Defaults to 1.

Returns

List of Dataset, whose interaction features have been split.

Return type

list
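One way to picture the leave-one-out split is on record indices alone. The sketch below is a standalone approximation, not RecBole's code: the helper name leave_one_out_indices is hypothetical, and the handling of groups with too few records (holding out fewer records and filling the last parts first) is an assumption.

```python
from collections import defaultdict


def leave_one_out_indices(group_keys, leave_one_num=1):
    """Return leave_one_num + 1 lists of record indices.

    Within each group, the last `leave_one_num` records go one each into
    the trailing parts; everything before them goes into the first part.
    """
    groups = defaultdict(list)
    for idx, key in enumerate(group_keys):
        groups[key].append(idx)

    parts = [[] for _ in range(leave_one_num + 1)]
    for indices in groups.values():
        # Assumption: always keep at least one record in the first part.
        held = min(leave_one_num, len(indices) - 1)
        cut = len(indices) - held
        parts[0].extend(indices[:cut])
        for offset, idx in enumerate(indices[cut:]):
            parts[leave_one_num + 1 - held + offset].append(idx)
    return parts


group_keys = ["u1", "u1", "u1", "u2", "u2", "u3"]
parts = leave_one_out_indices(group_keys, leave_one_num=2)
print(parts)
```

With leave_one_num=2, each held-out part receives at most one record per group, which matches the description of parts "whose length is expected to be 1".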

property non_seq_fields¶

Get fields of type TOKEN and FLOAT.

Returns: List of field names.
Return type: list

num(field)[source]¶

Given a field, return the number of different tokens after remapping if it is a token-like field, or 1 if it is a float-like field.

Parameters: field (str) – field name to get token number.
Returns: The number of different tokens (1 if field is a float-like field).
Return type: int

save(filepath)[source]¶

Save this Dataset object to a local path.

Parameters: filepath (str) – Path of the directory to save to.

property seq_fields¶

Get fields of type TOKEN_SEQ and FLOAT_SEQ.

Returns: List of field names.
Return type: list

set_field_property(field, field_type, field_source, field_seqlen)[source]¶

Set a new field’s properties.

Parameters

field (str) – Name of the new field.
field_type (FeatureType) – Type of the new field.
field_source (FeatureSource) – Source of the new field.
field_seqlen (int) – Max length of the sequence in field. 1 if the field’s type is not sequence-like.

shuffle()[source]¶

Shuffle the interaction records in place.

sort(by, ascending=True)[source]¶

Sort the interaction records in place.

Parameters

by (str) – Field used as the key in the sorting process.
ascending (bool, optional) – Results are ascending if True, otherwise descending. Defaults to True.

property sparsity¶

Get the sparsity of this dataset.

Returns: Sparsity of this dataset.
Return type: float
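The page does not state the formula. A common definition, assumed here, is the fraction of user-item pairs with no recorded interaction; whether the padding id is counted in user_num and item_num is also not specified, so the sketch below is illustrative only.

```python
def sparsity(inter_num, user_num, item_num):
    """Assumed definition: share of the user-item grid that is empty."""
    return 1 - inter_num / (user_num * item_num)


# 4 interactions in a 4 x 5 grid leave 16 of 20 cells empty.
value = sparsity(inter_num=4, user_num=4, item_num=5)
print(value)
```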

split_by_ratio(ratios, group_by=None)[source]¶

Split interaction records by ratios.

Parameters

ratios (list) – List of split ratios. No need to be normalized.
group_by (str, optional) – Field name that interaction records should be grouped by before splitting. Defaults to None.

Returns

List of Dataset, whose interaction features have been split.

Return type

list

Note

Other than the first one, each part is rounded down.
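The rounding rule in the note can be made concrete with a small sketch of how part sizes might be derived from unnormalized ratios. The helper name split_sizes is hypothetical and the code is an approximation of the described behaviour, not RecBole's implementation.

```python
def split_sizes(total, ratios):
    """Compute part sizes for `total` records from unnormalized ratios.

    Every part except the first is rounded down; the first part takes
    whatever remains, so the sizes always sum to `total`.
    """
    ratio_sum = sum(ratios)
    sizes = [int(total * r / ratio_sum) for r in ratios]
    sizes[0] = total - sum(sizes[1:])
    return sizes


sizes = split_sizes(25, [8, 1, 1])
print(sizes)
```

Here each 10% part of 25 records would be 2.5, which rounds down to 2, and the first part absorbs the remaining 21 records.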

token2id(field, tokens)[source]¶

Map external tokens to internal ids.

Parameters

field (str) – Field of external tokens.
tokens (str, list or np.ndarray) – External tokens.

Returns

The internal ids of external tokens.

Return type

int or np.ndarray

property token_like_fields¶

Get fields of type TOKEN and TOKEN_SEQ.

Returns: List of field names.
Return type: list

property uid2index¶

Sort self.inter_feat and get the mapping from each user_id to the index range of its interaction records.

Returns

numpy.ndarray of tuple (uid, slice); the interaction records within each slice all belong to the same uid.
numpy.ndarray of int, representing the number of interaction records of each user.

Return type

tuple
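The (uid, slice) mapping can be sketched in plain Python: after sorting by user id, every user's records are contiguous and can be addressed by one slice. The helper name uid_to_slices is hypothetical, and plain lists stand in for the numpy arrays that the property returns.

```python
def uid_to_slices(uids):
    """Sort user ids and return (sorted ids, [(uid, slice)], [record counts])."""
    order = sorted(range(len(uids)), key=lambda i: uids[i])
    sorted_uids = [uids[i] for i in order]

    slices, lengths = [], []
    start = 0
    for pos in range(1, len(sorted_uids) + 1):
        # Close the current run at the end of the data or when the uid changes.
        if pos == len(sorted_uids) or sorted_uids[pos] != sorted_uids[start]:
            slices.append((sorted_uids[start], slice(start, pos)))
            lengths.append(pos - start)
            start = pos
    return sorted_uids, slices, lengths


sorted_uids, slices, lengths = uid_to_slices([2, 1, 2, 3, 1, 2])
for uid, sl in slices:
    print(uid, sorted_uids[sl])
```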

property user_num¶

Get the number of different tokens of self.uid_field.

Returns: Number of different tokens of self.uid_field.
Return type: int