recbole.data.dataset
-
class recbole.data.dataset.dataset.Dataset(config, saved_dataset=None)
Bases: object
Dataset stores the original dataset in memory. It provides many useful functions for data preprocessing, such as k-core data filtering and missing value imputation. Features are stored as pandas.DataFrame inside Dataset. General and Context-aware Models can use this class.
Calling the method build() processes the dataset into DataLoaders, according to EvalSetting.
- Parameters
config (Config) – Global configuration object.
saved_dataset (str, optional) – Restore the Dataset object from saved_dataset. Defaults to None.
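The k-core filtering mentioned in the class description can be illustrated in plain Python. This is a minimal sketch of the general technique, not RecBole's implementation; the helper name `k_core_filter` and the `(user, item)` tuple layout are assumptions made for illustration.

```python
from collections import Counter

def k_core_filter(interactions, k):
    """Repeatedly drop users and items with fewer than k interactions.

    `interactions` is a list of (user, item) pairs.
    Hypothetical helper, not part of RecBole.
    """
    current = list(interactions)
    while True:
        user_count = Counter(u for u, _ in current)
        item_count = Counter(i for _, i in current)
        kept = [(u, i) for u, i in current
                if user_count[u] >= k and item_count[i] >= k]
        if len(kept) == len(current):  # nothing removed: fixed point reached
            return kept
        current = kept

pairs = [("u1", "i1"), ("u1", "i2"), ("u2", "i1"), ("u2", "i2"), ("u3", "i3")]
filtered = k_core_filter(pairs, k=2)
```

The loop is needed because removing a sparse user can push an item below the threshold, and vice versa.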
-
dataset_name
Name of this dataset.
- Type
str
-
dataset_path
Local file path of this dataset.
- Type
str
-
field2type
Dict mapping feature name (str) to its type (FeatureType).
- Type
dict
-
field2source
Dict mapping feature name (str) to its source (FeatureSource). Specifically, if a feature is loaded from the argument additional_feat_suffix, its source has type str, which is the suffix of its local file (also the suffix written in the argument additional_feat_suffix).
- Type
dict
-
field2id_token
Dict mapping feature name (str) to a np.ndarray, which stores the original tokens of this feature. For example, if test is a token-like feature, token_a is remapped to 1 and token_b is remapped to 2, then field2id_token['test'] = ['[PAD]', 'token_a', 'token_b']. (Note that 0 is always PADDING for token-like features.)
- Type
dict
-
field2token_id
Dict mapping feature name (str) to a dict, which stores the token remap table of this feature. For example, if test is a token-like feature, token_a is remapped to 1 and token_b is remapped to 2, then field2token_id['test'] = {'[PAD]': 0, 'token_a': 1, 'token_b': 2}. (Note that 0 is always PADDING for token-like features.)
- Type
dict
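The relationship between the two remap tables can be sketched as follows. This is an illustration only: the helper `build_remap` is hypothetical, it uses a plain list where RecBole stores a np.ndarray, and assigning ids in order of first appearance is an assumption.

```python
def build_remap(tokens, pad_token="[PAD]"):
    """Return (id_token, token_id) with the padding token fixed at id 0.

    Hypothetical helper; ids are assigned in order of first appearance.
    """
    id_token = [pad_token]
    token_id = {pad_token: 0}
    for token in tokens:
        if token not in token_id:  # first occurrence gets the next free id
            token_id[token] = len(id_token)
            id_token.append(token)
    return id_token, token_id

id_token, token_id = build_remap(["token_a", "token_b", "token_a"])

# The two tables are inverses of each other.
round_trip = [id_token[token_id[t]] for t in ["token_a", "token_b"]]
```

Indexing `id_token` by an internal id corresponds to what id2token() does, and looking up `token_id` corresponds to token2id().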
-
field2seqlen
Dict mapping feature name (str) to its sequence length (int). For sequence features, their length can be either set in config, or set to the max sequence length of this feature. For token and float features, their length is 1.
- Type
dict
-
uid_field
The same as config['USER_ID_FIELD'].
- Type
str or None
-
iid_field
The same as config['ITEM_ID_FIELD'].
- Type
str or None
-
label_field
The same as config['LABEL_FIELD'].
- Type
str or None
-
time_field
The same as config['TIME_FIELD'].
- Type
str or None
-
inter_feat
Internal data structure that stores the interaction features. It is loaded from the .inter file.
- Type
pandas.DataFrame
-
user_feat
Internal data structure that stores the user features. It is loaded from the .user file if it exists.
- Type
pandas.DataFrame or None
-
item_feat
Internal data structure that stores the item features. It is loaded from the .item file if it exists.
- Type
pandas.DataFrame or None
-
feat_list
A list containing all the features (pandas.DataFrame), including additional features.
- Type
list
-
property avg_actions_of_items
Get the average number of items’ interaction records.
- Returns
Average number of items’ interaction records.
- Return type
numpy.float64
-
property avg_actions_of_users
Get the average number of users’ interaction records.
- Returns
Average number of users’ interaction records.
- Return type
numpy.float64
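As a rough illustration of these two averages, the sketch below counts records per user and per item in plain Python. It assumes the average is the mean group size over users (or items) that appear in the interaction records; `average_actions` is a hypothetical helper, and it returns a Python float rather than numpy.float64.

```python
from collections import Counter

def average_actions(interactions, position):
    """Mean number of records per distinct value at `position`.

    `interactions` is a list of (user, item) pairs; position 0 groups by
    user and position 1 groups by item. Hypothetical helper.
    """
    counts = Counter(record[position] for record in interactions)
    return sum(counts.values()) / len(counts)

records = [("u1", "i1"), ("u1", "i2"), ("u2", "i1"), ("u3", "i1")]
avg_per_user = average_actions(records, 0)  # 4 records over 3 users
avg_per_item = average_actions(records, 1)  # 4 records over 2 items
```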
-
build(eval_setting)
Process the dataset according to the evaluation setting, including Group, Order and Split. See EvalSetting for details.
- Parameters
eval_setting (EvalSetting) – Object containing the evaluation settings, which guide the data processing procedure.
- Returns
List of built Dataset objects.
- Return type
list
-
copy(new_inter_feat)
Given a new interaction feature, return a new Dataset object whose interaction feature is replaced with new_inter_feat and whose other attributes are the same.
-
copy_field_property(dest_field, source_field)
Copy properties from source_field to dest_field.
- Parameters
dest_field (str) – Destination field.
source_field (str) – Source field.
-
fields(ftype=None)
Given a feature type, return all the field names of this type. If ftype = None, return all the fields.
- Parameters
ftype (FeatureType, optional) – Type of features.
- Returns
List of field names.
- Return type
list
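The lookup described here amounts to filtering field2type by type. The sketch below shows the idea with a stand-in enum; the class `FeatureType` defined here only mimics the four type names used on this page, its member values are illustrative, and accepting a collection of types is an assumption added so the same helper can express properties such as token_like_fields.

```python
from enum import Enum

class FeatureType(Enum):
    """Stand-in for RecBole's FeatureType; member values are illustrative."""
    TOKEN = "token"
    FLOAT = "float"
    TOKEN_SEQ = "token_seq"
    FLOAT_SEQ = "float_seq"

def fields(field2type, ftype=None):
    """Return field names of the given type(s); all fields if ftype is None."""
    if ftype is None:
        return list(field2type)
    # Assumption: a single type or a collection of types is accepted.
    wanted = set(ftype) if isinstance(ftype, (list, tuple, set)) else {ftype}
    return [name for name, t in field2type.items() if t in wanted]

field2type = {
    "user_id": FeatureType.TOKEN,
    "item_id": FeatureType.TOKEN,
    "rating": FeatureType.FLOAT,
    "item_list": FeatureType.TOKEN_SEQ,
}
float_fields = fields(field2type, FeatureType.FLOAT)
token_like = fields(field2type, [FeatureType.TOKEN, FeatureType.TOKEN_SEQ])
```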
-
property float_like_fields
Get fields of type FLOAT and FLOAT_SEQ.
- Returns
List of field names.
- Return type
list
-
get_preload_weight(field)
Get the preloaded weight matrix, whose rows are sorted by token ids. 0 is used as padding.
- Parameters
field (str) – Preloaded feature field name.
- Returns
Preloaded weight matrix. See Args for Data for details.
- Return type
numpy.ndarray
-
history_item_matrix(value_field=None)
Get a dense matrix describing users’ historical interaction records.
history_matrix[i] represents the item_id values that user i has interacted with.
history_value[i] represents the values of user i’s historical interaction records, 0 if value_field = None.
history_len[i] represents the number of user i’s historical interaction records.
0 is used as padding.
- Parameters
value_field (str, optional) – Data of matrix, which should exist in self.inter_feat. Defaults to None.
- Returns
History matrix (torch.Tensor): history_matrix described above.
History values matrix (torch.Tensor): history_value described above.
History length matrix (torch.Tensor): history_len described above.
- Return type
tuple
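The three returned structures can be pictured with a small plain-Python sketch. It uses nested lists instead of torch.Tensor, takes `(user_id, item_id)` pairs of internal ids instead of reading self.inter_feat, and the function and argument names are assumptions made for illustration.

```python
def history_item_matrix(interactions, user_num, values=None):
    """Build padded per-user history rows from (user_id, item_id) pairs.

    Ids are internal ids where 0 is padding. `values[k]` is the value of
    the k-th interaction; when omitted, history_value stays all zeros.
    Illustrative sketch, not RecBole's implementation.
    """
    history_len = [0] * user_num
    for user_id, _ in interactions:
        history_len[user_id] += 1
    width = max(history_len, default=0)  # longest history sets the row width
    history_matrix = [[0] * width for _ in range(user_num)]
    history_value = [[0] * width for _ in range(user_num)]
    filled = [0] * user_num  # next free column per user
    for k, (user_id, item_id) in enumerate(interactions):
        col = filled[user_id]
        history_matrix[user_id][col] = item_id
        if values is not None:
            history_value[user_id][col] = values[k]
        filled[user_id] += 1
    return history_matrix, history_value, history_len

inter = [(1, 2), (1, 3), (2, 1)]
matrix, value, length = history_item_matrix(inter, user_num=3)
```

Row 0 stays empty because id 0 is the padding user; shorter histories are right-padded with 0.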
-
history_user_matrix(value_field=None)
Get a dense matrix describing items’ historical interaction records.
history_matrix[i] represents the user_id values that have interacted with item i.
history_value[i] represents the values of item i’s historical interaction records, 0 if value_field = None.
history_len[i] represents the number of item i’s historical interaction records.
0 is used as padding.
- Parameters
value_field (str, optional) – Data of matrix, which should exist in self.inter_feat. Defaults to None.
- Returns
History matrix (torch.Tensor): history_matrix described above.
History values matrix (torch.Tensor): history_value described above.
History length matrix (torch.Tensor): history_len described above.
- Return type
tuple
-
id2token(field, ids)
Map internal ids to external tokens.
- Parameters
field (str) – Field of internal ids.
ids (int, list, np.ndarray or torch.Tensor) – Internal ids.
- Returns
The external tokens of internal ids.
- Return type
str or np.ndarray
-
inter_matrix(form='coo', value_field=None)
Get a sparse matrix that describes interactions between user_id and item_id.
The sparse matrix has shape (user_num, item_num).
For a row of <src, tgt>, matrix[src, tgt] = 1 if value_field is None, else matrix[src, tgt] = self.inter_feat[src, tgt].
- Parameters
form (str, optional) – Sparse matrix format. Defaults to coo.
value_field (str, optional) – Data of sparse matrix, which should exist in df_feat. Defaults to None.
- Returns
Sparse matrix in form coo or csr.
- Return type
scipy.sparse
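To make the two formats concrete, the sketch below builds COO triplets from interaction pairs and converts them to CSR arrays in plain Python. It is only an illustration of the data layout: the helper names are hypothetical, and in practice the triplets would be handed to scipy.sparse rather than converted by hand.

```python
def coo_triplets(interactions, values=None):
    """Return (rows, cols, data) for a COO matrix.

    `interactions` is a list of (user_id, item_id) pairs; data is 1 for
    every entry when `values` is None. Hypothetical helper.
    """
    rows = [u for u, _ in interactions]
    cols = [i for _, i in interactions]
    data = [1] * len(interactions) if values is None else list(values)
    return rows, cols, data

def coo_to_csr(rows, cols, data, n_rows):
    """Convert COO triplets to CSR arrays (indptr, indices, data)."""
    indptr = [0] * (n_rows + 1)
    for r in rows:
        indptr[r + 1] += 1          # count entries per row
    for r in range(n_rows):
        indptr[r + 1] += indptr[r]  # prefix sum gives row boundaries
    order = sorted(range(len(rows)), key=lambda k: (rows[k], cols[k]))
    return indptr, [cols[k] for k in order], [data[k] for k in order]

inter = [(1, 2), (2, 1), (1, 1)]
rows, cols, data = coo_triplets(inter)
indptr, indices, csr_data = coo_to_csr(rows, cols, data, n_rows=3)
```

COO keeps one (row, column, value) triplet per interaction, while CSR groups the column indices by row so that row slicing is cheap.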
-
property inter_num
Get the number of interaction records.
- Returns
Number of interaction records.
- Return type
int
-
property item_num
Get the number of different tokens of self.iid_field.
- Returns
Number of different tokens of self.iid_field.
- Return type
int
-
join(df)
Given an interaction feature, join the user/item features into it.
- Parameters
df (pandas.DataFrame) – Interaction feature to be joined.
- Returns
Interaction feature after the joining operation.
- Return type
pandas.DataFrame
-
leave_one_out(group_by, leave_one_num=1)
Split interaction records by the leave-one-out strategy.
- Parameters
group_by (str) – Field name that interaction records should be grouped by before splitting.
leave_one_num (int, optional) – Number of parts whose length is expected to be 1. Defaults to 1.
- Returns
List of Dataset objects, whose interaction features have been split.
- Return type
list
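The leave-one-out strategy can be sketched in plain Python as below. The sketch works on a list of record tuples rather than a Dataset, assumes the records are already in the desired order within each group, and keeps groups that are too small entirely in the first part; that last choice, like the helper's name and arguments, is an assumption and may differ from RecBole's behaviour.

```python
from collections import defaultdict

def leave_one_out(records, group_index=0, leave_one_num=1):
    """Split records into 1 + leave_one_num parts.

    Within each group, the last `leave_one_num` records go one each to the
    held-out parts and the rest stay in the first part. Groups with no more
    than `leave_one_num` records are kept whole in the first part
    (an assumption of this sketch).
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[group_index]].append(record)
    parts = [[] for _ in range(leave_one_num + 1)]
    for group in groups.values():
        if len(group) <= leave_one_num:
            parts[0].extend(group)
            continue
        cut = len(group) - leave_one_num
        parts[0].extend(group[:cut])
        for offset, record in enumerate(group[cut:], start=1):
            parts[offset].append(record)  # exactly one record per group
    return parts

records = [("u1", "a"), ("u1", "b"), ("u1", "c"),
           ("u2", "x"), ("u2", "y"), ("u2", "z")]
train, valid, test = leave_one_out(records, leave_one_num=2)
```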
-
property non_seq_fields
Get fields of type TOKEN and FLOAT.
- Returns
List of field names.
- Return type
list
-
num(field)
Given field, return the number of different tokens after remapping for token-like fields, and 1 for float-like fields.
- Parameters
field (str) – Field name to get the token number of.
- Returns
The number of different tokens (1 if field is a float-like field).
- Return type
int
-
save(filepath)
Save this Dataset object to a local path.
- Parameters
filepath (str) – Path of the directory to save to.
-
property seq_fields
Get fields of type TOKEN_SEQ and FLOAT_SEQ.
- Returns
List of field names.
- Return type
list
-
set_field_property(field, field_type, field_source, field_seqlen)
Set a new field’s properties.
- Parameters
field (str) – Name of the new field.
field_type (FeatureType) – Type of the new field.
field_source (FeatureSource) – Source of the new field.
field_seqlen (int) – Max length of the sequence in field. 1 if field’s type is not sequence-like.
-
sort(by, ascending=True)
Sort the interaction records in place.
- Parameters
by (str) – Field used as the key in the sorting process.
ascending (bool, optional) – Results are ascending if True, otherwise descending. Defaults to True.
-
property sparsity
Get the sparsity of this dataset.
- Returns
Sparsity of this dataset.
- Return type
float
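The page does not spell out how sparsity is computed. A minimal sketch, assuming the common definition (the fraction of user-item cells with no interaction record), is shown below; the function and its arguments are hypothetical, and whether the padding id counts towards user_num and item_num is not addressed here.

```python
def sparsity(inter_num, user_num, item_num):
    """Fraction of user-item cells that have no interaction record.

    Assumes the common definition 1 - inter_num / (user_num * item_num).
    """
    return 1 - inter_num / (user_num * item_num)

# 4 records in a 4 x 5 user-item grid leave 16 of 20 cells empty.
value = sparsity(inter_num=4, user_num=4, item_num=5)
```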
-
split_by_ratio(ratios, group_by=None)
Split interaction records by ratios.
- Parameters
ratios (list) – List of split ratios. They do not need to be normalized.
group_by (str, optional) – Field name that interaction records should be grouped by before splitting. Defaults to None.
- Returns
List of Dataset objects, whose interaction features have been split.
- Return type
list
Note
Other than the first one, each part is rounded down.
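The rounding rule in the note can be sketched as follows: every part except the first gets its share rounded down, and the first part absorbs the remainder. The helpers `split_counts` and `split_by_ratio` below are illustrative stand-ins that operate on a plain list and ignore group_by.

```python
import math

def split_counts(total, ratios):
    """Sizes of each part: later parts rounded down, first part takes the rest."""
    ratio_sum = sum(ratios)  # ratios need not be normalized
    counts = [math.floor(total * r / ratio_sum) for r in ratios]
    counts[0] = total - sum(counts[1:])
    return counts

def split_by_ratio(records, ratios):
    """Cut `records` into consecutive parts with the sizes from split_counts."""
    parts, start = [], 0
    for count in split_counts(len(records), ratios):
        parts.append(records[start:start + count])
        start += count
    return parts

sizes = split_counts(25, [8, 1, 1])
parts = split_by_ratio(list(range(25)), [8, 1, 1])
```

With 25 records and ratios 8:1:1, the exact shares are 20, 2.5 and 2.5, so the last two parts get 2 records each and the first part gets the remaining 21.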
-
token2id(field, tokens)
Map external tokens to internal ids.
- Parameters
field (str) – Field of external tokens.
tokens (str, list or np.ndarray) – External tokens.
- Returns
The internal ids of external tokens.
- Return type
int or np.ndarray
-
property token_like_fields
Get fields of type TOKEN and TOKEN_SEQ.
- Returns
List of field names.
- Return type
list
-
property uid2index
Sort self.inter_feat, and get the mapping between each user_id and the index range of its interaction records.
- Returns
numpy.ndarray of tuple (uid, slice): the interaction records within each slice all belong to the same uid.
numpy.ndarray of int: the number of interaction records of each user.
- Return type
tuple
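The (uid, slice) mapping can be pictured with the plain-Python sketch below. It sorts a list of user ids instead of self.inter_feat and returns Python lists instead of numpy.ndarray; the helper name and its extra return value (the sorted ids) are assumptions made so the slices can be checked.

```python
from itertools import groupby

def uid2index(uids):
    """Sort uids and return (sorted_uids, [(uid, slice)], lengths).

    Each slice covers the contiguous run of records that belong to one
    uid after sorting. Illustrative sketch, not RecBole's implementation.
    """
    sorted_uids = sorted(uids)
    index, lengths, start = [], [], 0
    for uid, run in groupby(sorted_uids):
        n = len(list(run))
        index.append((uid, slice(start, start + n)))
        lengths.append(n)
        start += n
    return sorted_uids, index, lengths

sorted_uids, index, lengths = uid2index([2, 1, 2, 3, 1, 2])
```

Applying a returned slice to the sorted records yields every record of that user, which is why the records have to be sorted first.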
-
property user_num
Get the number of different tokens of self.uid_field.
- Returns
Number of different tokens of self.uid_field.
- Return type
int