Data settings¶
RecBole provides several arguments for describing:
Basic information of the dataset
Operations of dataset preprocessing
See below for the details:
Atomic File Format¶
field_separator (str)
: Separator of different columns in atomic files. Defaults to "\t".
seq_separator (str)
: Separator inside the sequence features. Defaults to " ".
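A minimal sketch of how these two separators could appear in a YAML configuration file; the values below simply restate the defaults:

    field_separator: "\t"   # columns in the atomic files are tab-separated
    seq_separator: " "      # tokens inside a sequence feature are space-separated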
Basic Information¶
Common Features¶
USER_ID_FIELD (str)
: Field name of user ID feature. Defaults to user_id.
ITEM_ID_FIELD (str)
: Field name of item ID feature. Defaults to item_id.
RATING_FIELD (str)
: Field name of rating feature. Defaults to rating.
TIME_FIELD (str)
: Field name of timestamp feature. Defaults to timestamp.
seq_len (dict)
: Keys are field names of sequence features, values are the maximum length of each sequence (which means too long sequences will be cut off). If not set, the sequences will not be cut off. Defaults to None.
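A sketch of these common-feature settings in YAML; the field names are the defaults, while the seq_len entry is purely an illustrative assumption:

    USER_ID_FIELD: user_id
    ITEM_ID_FIELD: item_id
    RATING_FIELD: rating
    TIME_FIELD: timestamp
    seq_len:
      review: 50   # hypothetical sequence feature "review", cut off at 50 tokens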
Label for Point-wise DataLoader¶
LABEL_FIELD (str)
: Expected field name of the generated labels. Defaults to label.
threshold (dict)
: The format is {k (str): v (float)}. 0/1 labels will be generated according to the value of inter_feat[k] and v. The rows with inter_feat[k] >= v will be labeled as positive, otherwise the label is negative. Note that at most one pair of k and v can exist in threshold. Defaults to None.
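For example, a sketch that turns explicit ratings into 0/1 labels; the field name rating and the cutoff 3.0 are assumptions for illustration:

    LABEL_FIELD: label
    threshold:
      rating: 3.0   # rows with rating >= 3.0 are labeled 1, the rest 0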
Negative Sampling Prefix for Pair-wise DataLoader¶
NEG_PREFIX (str)
: Prefix of field names which are generated as negative cases. E.g. if we have positive item ID named item_id, then the item IDs in negative samples will be called NEG_PREFIX + item_id. Defaults to neg_.
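With the default prefix, a sketch of the resulting field name (purely illustrative):

    NEG_PREFIX: neg_   # negatively sampled items are stored in a field named neg_item_id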
Sequential Model Needed¶
ITEM_LIST_LENGTH_FIELD (str)
: Field name of the feature representing item sequences' length. Defaults to item_length.
LIST_SUFFIX (str)
: Suffix of field names which are generated as sequences. E.g. if we have item ID named item_id, then the item ID sequences will be called item_id + LIST_SUFFIX. Defaults to _list.
MAX_ITEM_LIST_LENGTH (int)
: Maximum length of each generated sequence. Defaults to 50.
POSITION_FIELD (str)
: Field name of the generated position sequence. For a sequence of length k, its position sequence is range(k). Note that this field will only be generated if this arg is not None. Defaults to position_id.
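A sketch of these settings in YAML, shortening the generated sequences to 20 items (the length 20 is an arbitrary illustrative choice, not the default):

    ITEM_LIST_LENGTH_FIELD: item_length
    LIST_SUFFIX: _list          # item_id becomes item_id_list in the generated sequences
    MAX_ITEM_LIST_LENGTH: 20
    POSITION_FIELD: position_id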
Knowledge-based Model Needed¶
HEAD_ENTITY_ID_FIELD (str)
: Field name of the head entity ID feature. Defaults to head_id.
TAIL_ENTITY_ID_FIELD (str)
: Field name of the tail entity ID feature. Defaults to tail_id.
RELATION_ID_FIELD (str)
: Field name of the relation ID feature. Defaults to relation_id.
ENTITY_ID_FIELD (str)
: Field name of the entity ID. Note that it's only a symbol of entities, not a real feature of one of the xxx_feat. Defaults to entity_id.
kg_reverse_r (bool)
: Whether or not to reverse relations of triples for bidirectional edges. Defaults to False.
entity_kg_num_interval (str)
: Has the interval format, such as [A,B] / [A,B) / (A,B) / (A,B], where A and B are the endpoints of the interval and A <= B. Entities (including head entities and tail entities) whose number of triples is in the interval will be retained. Defaults to [0,inf).
relation_kg_num_interval (str)
: Has the interval format, such as [A,B] / [A,B) / (A,B) / (A,B], where A and B are the endpoints of the interval and A <= B. Relations whose number of triples is in the interval will be retained. Defaults to [0,inf).
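A sketch of a knowledge-graph configuration; the interval [5,inf) is an illustrative assumption that keeps only entities involved in at least 5 triples, and the intervals are quoted so YAML does not parse the brackets as a list:

    HEAD_ENTITY_ID_FIELD: head_id
    TAIL_ENTITY_ID_FIELD: tail_id
    RELATION_ID_FIELD: relation_id
    ENTITY_ID_FIELD: entity_id
    kg_reverse_r: False
    entity_kg_num_interval: "[5,inf)"
    relation_kg_num_interval: "[0,inf)"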
Selectively Loading¶
load_col (dict)
: Keys are the suffixes of loaded atomic files, values are the lists of field names to be loaded. If a suffix doesn't exist in load_col, the corresponding atomic file will not be loaded. Note that if load_col is None, then all the existing atomic files will be loaded. Defaults to {inter: [user_id, item_id]}.
unload_col (dict)
: Keys are suffixes of loaded atomic files, values are lists of field names NOT to be loaded. Note that load_col and unload_col can not be set at the same time. Defaults to None.
unused_col (dict)
: Keys are suffixes of loaded atomic files, values are lists of field names which are loaded for data processing but will not be used in the model. E.g. the time_field may be used for time ordering but the model does not use this field. Defaults to None.
additional_feat_suffix (list)
: Controls loading additional atomic files. E.g. if you want to load features from ml-100k.hello, just set this arg as additional_feat_suffix: [hello]. The additional features will be stored in Dataset.feat_list. Defaults to None.
numerical_features (list)
: The numerical features to be embedded for context-aware methods. Defaults to None.
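A sketch of selective loading in YAML; the rating, timestamp and genre fields are assumptions about what the atomic files contain:

    load_col:
      inter: [user_id, item_id, rating, timestamp]
      item: [item_id, genre]
    unused_col:
      inter: [timestamp]   # loaded (e.g. for time-based ordering) but not fed to the model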
Filtering¶
Remove duplicated user-item interactions¶
rm_dup_inter (str)
: Whether to remove duplicated user-item interactions. If time_field exists, inter_feat will be sorted by time_field in ascending order; otherwise it will remain unchanged. After that, if rm_dup_inter=first, we will keep the first user-item interaction among the duplicates; if rm_dup_inter=last, we will keep the last one. Defaults to None.
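For instance, a sketch that keeps only the earliest occurrence of each duplicated user-item pair:

    rm_dup_inter: first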
Filter by value¶
val_interval (dict)
: Has the format {k (str): interval (str), ...}, where interval can be set as [A,B] / [A,B) / (A,B) / (A,B]. Rows whose feat[k] falls in the given interval will be retained. If you want to specify more than one interval, separate them with semicolon(s). For instance, {k: "[A,B);(C,D]"} can be adopted, and rows whose feat[k] falls in any of the specified intervals will be retained. Defaults to None, which means all rows will be retained.
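A sketch of value-based filtering; rating and age are assumed fields used only for illustration:

    val_interval:
      rating: "[3,inf)"          # keep interactions with rating >= 3
      age: "[18,30);(60,inf)"    # keep rows whose age lies in either interval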
Remove interaction by user or item¶
filter_inter_by_user_or_item (bool)
: If True, we will remove the interactions in inter_feat whose user or item is not in user_feat or item_feat. Defaults to True.
Filter by number of interactions¶
user_inter_num_interval (str)
: Has the interval format, such as [A,B] / [A,B) / (A,B) / (A,B], where A and B are the endpoints of the interval and A <= B. Users whose number of interactions is in the interval will be retained. Defaults to [0,inf).
item_inter_num_interval (str)
: Has the interval format, such as [A,B] / [A,B) / (A,B) / (A,B], where A and B are the endpoints of the interval and A <= B. Items whose number of interactions is in the interval will be retained. Defaults to [0,inf).
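A sketch of the common 10-core setting, which keeps only users and items with at least 10 interactions (the value 10 is an illustrative choice, not the default):

    user_inter_num_interval: "[10,inf)"
    item_inter_num_interval: "[10,inf)"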
Preprocessing¶
alias_of_user_id (list)
: List of field names which will be remapped into the same index system with USER_ID_FIELD. Defaults to None.
alias_of_item_id (list)
: List of field names which will be remapped into the same index system with ITEM_ID_FIELD. Defaults to None.
alias_of_entity_id (list)
: List of field names which will be remapped into the same index system with ENTITY_ID_FIELD, HEAD_ENTITY_ID_FIELD and TAIL_ENTITY_ID_FIELD. Defaults to None.
alias_of_relation_id (list)
: List of field names which will be remapped into the same index system with RELATION_ID_FIELD. Defaults to None.
preload_weight (dict)
: Has the format {k (str): v (float), ...}. k is a token field, representing the IDs of each row of the preloaded weight matrix. v is a float-like field. Each pair of k and v should be from the same atomic file. This arg can be used to load pretrained vectors. Defaults to None.
normalize_field (list)
: List of field names to be normalized. Note that only float-like fields can be normalized. Defaults to None.
normalize_all (bool)
: Normalize all the float-like fields if True. Defaults to None.
discretization (dict)
: Has the format {k (str): v (dict), ...}. k is a float field, representing a field to be discretized. v is a config dict with 2 keys, method and bucket, which control the discretization strategy.
method (str)
: Decides how the float data is discretized. Two discretization strategies are supported: ['ED', 'LD']. If the value of method is ED, the data will be discretized in equal distance; if the value is LD, the data will be discretized in logarithm. The default value is ED.
bucket (int)
: The number of buckets the data is divided into when the discretization method is 'ED'.
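A sketch of these preprocessing options in YAML; the fields rating, price, ent_id and ent_vec are hypothetical and only illustrate the expected shapes:

    normalize_field: [rating]   # normalize the float-like rating field
    preload_weight:
      ent_id: ent_vec           # ent_id gives the row IDs, ent_vec the pretrained vectors, both from one atomic file
    discretization:
      price:
        method: ED              # equal-distance bucketing
        bucket: 10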
Benchmark file¶
benchmark_filename (list)
: List of pre-split user-item interaction suffixes. In this case only normalization and ID remapping are applied, so no interactions in inter_feat are deleted; inter_feat is then split according to benchmark_filename. E.g. let's assume the dataset is called click and benchmark_filename equals ['part1', 'part2', 'part3']. Then we will load click.part1.inter, click.part2.inter and click.part3.inter, and treat them as the train, validation and test sets. Defaults to None.
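Following the example above, a sketch of the corresponding setting (assuming the three pre-split files exist alongside the other atomic files of the click dataset):

    benchmark_filename: [part1, part2, part3]   # loads click.part1.inter, click.part2.inter, click.part3.inter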