Args for Data
RecBole provides several arguments for describing:
Basic information of the dataset
Operations of dataset preprocessing
See below for the details:
Atomic File Format
field_separator (str): Separator between columns in atomic files. Defaults to "\t".
seq_separator (str): Separator between the elements of a sequence feature. Defaults to " ".
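To make the two separators concrete, here is a minimal sketch of how one might split a small atomic-file snippet by hand. The sample text, the `name:type` header style, and the helper `parse_atomic` are assumptions made for illustration; this is not RecBole's loader.

```python
FIELD_SEPARATOR = "\t"  # default of field_separator
SEQ_SEPARATOR = " "     # default of seq_separator

# Hypothetical atomic-file content: a header line followed by two rows.
raw = "user_id:token\titem_id_list:token_seq\n1\t10 11 12\n2\t13 14"


def parse_atomic(text, field_sep=FIELD_SEPARATOR, seq_sep=SEQ_SEPARATOR):
    """Split columns on field_sep and sequence-typed cells on seq_sep."""
    header, *rows = text.splitlines()
    # Each header cell is assumed to look like "name:type".
    columns = [col.split(":") for col in header.split(field_sep)]
    records = []
    for row in rows:
        record = {}
        for (name, ftype), cell in zip(columns, row.split(field_sep)):
            # Only fields whose type ends in "_seq" are split further.
            record[name] = cell.split(seq_sep) if ftype.endswith("_seq") else cell
        records.append(record)
    return records


records = parse_atomic(raw)
```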
Basic Information
Common Features
USER_ID_FIELD (str): Field name of the user ID feature. Defaults to user_id.
ITEM_ID_FIELD (str): Field name of the item ID feature. Defaults to item_id.
RATING_FIELD (str): Field name of the rating feature. Defaults to rating.
TIME_FIELD (str): Field name of the timestamp feature. Defaults to timestamp.
seq_len (dict): Keys are field names of sequence features; values are the maximum length of each sequence. Sequences longer than the maximum are truncated. If not set, sequences are not truncated. Defaults to None.
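The effect of seq_len can be pictured with a short sketch. The helper `truncate_sequences` and the field name `genre_list` are hypothetical, and keeping the head of the sequence is an assumption, since the text above only says that overly long sequences are cut off.

```python
def truncate_sequences(row, seq_len):
    """Cut each sequence field listed in seq_len to its maximum length."""
    if seq_len is None:
        # Default: nothing is cut off.
        return dict(row)
    return {
        # Assumption: the first seq_len[name] elements are the ones kept.
        name: value[: seq_len[name]] if name in seq_len else value
        for name, value in row.items()
    }


row = {"user_id": "1", "genre_list": ["a", "b", "c", "d"]}
truncated = truncate_sequences(row, {"genre_list": 2})
untouched = truncate_sequences(row, None)
```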
Label for Point-wise DataLoader
LABEL_FIELD (str): Expected field name of the generated labels. Defaults to label.
threshold (dict): Has the format {k (str): v (float)}. 0/1 labels are generated by comparing inter_feat[k] with v: rows with inter_feat[k] >= v are labeled as positive, and all other rows are labeled as negative. Note that threshold can contain at most one pair of k and v. Defaults to None.
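The labeling rule for threshold can be sketched in a few lines of plain Python. The helper `add_labels` and the sample rows are hypothetical and only illustrate the comparison inter_feat[k] >= v.

```python
def add_labels(rows, threshold, label_field="label"):
    """Attach a 0/1 label to every row according to a one-entry threshold dict."""
    if threshold is None:
        # Default: no labels are generated.
        return [dict(row) for row in rows]
    if len(threshold) != 1:
        raise ValueError("threshold must contain exactly one {field: value} pair")
    (field, cutoff), = threshold.items()
    # Rows at or above the cutoff are positive (1), the rest negative (0).
    return [{**row, label_field: 1 if row[field] >= cutoff else 0} for row in rows]


interactions = [
    {"user_id": 1, "item_id": 10, "rating": 5.0},
    {"user_id": 1, "item_id": 11, "rating": 2.0},
    {"user_id": 2, "item_id": 10, "rating": 3.0},
]
labeled = add_labels(interactions, {"rating": 3.0})
```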
NegSample Prefix for Pair-wise DataLoader
NEG_PREFIX (str): Prefix of the field names generated for negative cases. E.g. if the positive item ID field is named item_id, then the item ID field in negative samples is named NEG_PREFIX + item_id. Defaults to neg_.
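As a small illustration of the naming rule, the snippet below composes the negative field name from the defaults. The sample dictionary and its ID values are made up for the example.

```python
NEG_PREFIX = "neg_"          # default of NEG_PREFIX
ITEM_ID_FIELD = "item_id"    # default of ITEM_ID_FIELD

# Field name under which negative item IDs would appear.
neg_item_field = NEG_PREFIX + ITEM_ID_FIELD

# A hypothetical pair-wise sample: one positive item and one negative item.
sample = {"user_id": 7, ITEM_ID_FIELD: 42, neg_item_field: 913}
```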
Sequential Model Needed
ITEM_LIST_LENGTH_FIELD (str): Field name of the feature representing the length of each item sequence. Defaults to item_length.
LIST_SUFFIX (str): Suffix of the field names generated for sequences. E.g. if the item ID field is named item_id, then the item ID sequence field is named item_id + LIST_SUFFIX. Defaults to _list.
MAX_ITEM_LIST_LENGTH (int): Maximum length of each generated sequence. Defaults to 50.
POSITION_FIELD (str): Field name of the generated position sequence. For a sequence of length k, its position sequence is range(k). Note that this field is only generated if this arg is not None. Defaults to position_id.
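The following sketch shows how the four sequential args relate to each other for a single interaction history. The helper `build_sequence_fields` is hypothetical, and keeping the most recent items when the history exceeds the maximum length is an assumption that the text above does not confirm.

```python
def build_sequence_fields(
    history,
    item_field="item_id",
    list_suffix="_list",
    length_field="item_length",
    position_field="position_id",
    max_len=50,
):
    """Derive the sequence, length, and position fields for one item history."""
    # Assumption: when the history is too long, the most recent items are kept.
    items = history[-max_len:]
    row = {item_field + list_suffix: items, length_field: len(items)}
    if position_field is not None:
        # For a sequence of length k, the position sequence is range(k).
        row[position_field] = list(range(len(items)))
    return row


full = build_sequence_fields([3, 8, 5, 9], max_len=3)
no_position = build_sequence_fields([3, 8], position_field=None)
```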
Knowledge-based Model Needed
HEAD_ENTITY_ID_FIELD (str): Field name of the head entity ID feature. Defaults to head_id.
TAIL_ENTITY_ID_FIELD (str): Field name of the tail entity ID feature. Defaults to tail_id.
RELATION_ID_FIELD (str): Field name of the relation ID feature. Defaults to relation_id.
ENTITY_ID_FIELD (str): Field name of the entity ID. Note that it is only a symbol for entities, not a real feature in any of the xxx_feat. Defaults to entity_id.
Selectively Loading
load_col (dict): Keys are the suffixes of the atomic files to load; values are the lists of field names to load. If a suffix does not exist in load_col, the corresponding atomic file will not be loaded. Note that if load_col is None, all existing atomic files will be loaded. Defaults to {inter: [user_id, item_id]}.
unload_col (dict): Keys are the suffixes of the loaded atomic files; values are the lists of field names NOT to load. Note that load_col and unload_col cannot be set at the same time. Defaults to None.
additional_feat_suffix (list): Controls the loading of additional atomic files. E.g. if you want to load features from ml-100k.hello, set this arg as additional_feat_suffix: [hello]. The additional features will be stored in Dataset.feat_list. Defaults to None.
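The selection rules for load_col and unload_col can be summarized in a short sketch. The helper `select_columns` and the `available` mapping (file suffix to the columns found in that file) are hypothetical; the code mirrors the description above, not RecBole's implementation.

```python
def select_columns(available, load_col=None, unload_col=None):
    """Return the columns that would be loaded from each atomic file."""
    if load_col is not None and unload_col is not None:
        raise ValueError("load_col and unload_col cannot be set at the same time")
    if load_col is not None:
        # Only files whose suffix appears in load_col are loaded at all.
        return {
            suffix: [col for col in cols if col in load_col[suffix]]
            for suffix, cols in available.items()
            if suffix in load_col
        }
    # With load_col unset, every file is loaded, minus the excluded columns.
    excluded = unload_col or {}
    return {
        suffix: [col for col in cols if col not in excluded.get(suffix, [])]
        for suffix, cols in available.items()
    }


available = {
    "inter": ["user_id", "item_id", "rating", "timestamp"],
    "item": ["item_id", "genre"],
}
only_ids = select_columns(available, load_col={"inter": ["user_id", "item_id"]})
no_time = select_columns(available, unload_col={"inter": ["timestamp"]})
```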
Filtering
Remove duplicated user-item interactions
rm_dup_inter (str): Whether to remove duplicated user-item interactions. If time_field exists, inter_feat is first sorted by time_field in ascending order; otherwise its order remains unchanged. After that, if rm_dup_inter == first, the first user-item interaction among the duplicates is kept; if rm_dup_inter == last, the last one is kept. Defaults to None.
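A minimal sketch of the sort-then-deduplicate behavior described above is given here. The helper `remove_duplicates` and the sample rows are hypothetical, and the field names user_id, item_id, and timestamp are the documented defaults.

```python
def remove_duplicates(rows, rm_dup_inter=None, time_field="timestamp"):
    """Keep the first or last interaction for each (user, item) pair."""
    if rm_dup_inter is None:
        return list(rows)
    if rm_dup_inter not in ("first", "last"):
        raise ValueError("rm_dup_inter must be 'first', 'last', or None")
    if rows and all(time_field in row for row in rows):
        # sorted() is stable, so rows with equal timestamps keep their order.
        rows = sorted(rows, key=lambda row: row[time_field])
    kept = {}
    for row in rows:
        key = (row["user_id"], row["item_id"])
        # 'last' overwrites earlier duplicates; 'first' ignores later ones.
        if rm_dup_inter == "last" or key not in kept:
            kept[key] = row
    return list(kept.values())


rows = [
    {"user_id": 1, "item_id": 10, "rating": 5, "timestamp": 3},
    {"user_id": 1, "item_id": 10, "rating": 2, "timestamp": 1},
    {"user_id": 1, "item_id": 11, "rating": 4, "timestamp": 2},
]
first_kept = remove_duplicates(rows, "first")
last_kept = remove_duplicates(rows, "last")
```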
Filter by number of interactions
max_user_inter_num (int): Users with more than max_user_inter_num interactions will be filtered out. Defaults to None.
min_user_inter_num (int): Users with fewer than min_user_inter_num interactions will be filtered out. Defaults to 0.
max_item_inter_num (int): Items with more than max_item_inter_num interactions will be filtered out. Defaults to None.
min_item_inter_num (int): Items with fewer than min_item_inter_num interactions will be filtered out. Defaults to 0.
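The count-based filters can be illustrated with a single-pass sketch. The helper `filter_by_inter_num` is hypothetical, and whether the library repeats the filtering after users or items are removed is not stated above, so this example applies the bounds once only.

```python
from collections import Counter


def filter_by_inter_num(rows, field, min_num=0, max_num=None):
    """Drop rows whose user (or item) has too few or too many interactions."""
    counts = Counter(row[field] for row in rows)

    def within_bounds(n):
        # max_num=None means there is no upper bound.
        return n >= min_num and (max_num is None or n <= max_num)

    return [row for row in rows if within_bounds(counts[row[field]])]


rows = [
    {"user_id": 1, "item_id": 10},
    {"user_id": 1, "item_id": 11},
    {"user_id": 1, "item_id": 12},
    {"user_id": 2, "item_id": 10},
]
# Users need at least 2 interactions; items may have at most 1.
active_users = filter_by_inter_num(rows, "user_id", min_num=2)
rare_items = filter_by_inter_num(rows, "item_id", max_num=1)
```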
Filter by value
lowest_val (dict): Has the format {k (str): v (float), ...}. Rows with feat[k] < v will be filtered out. Defaults to None.
highest_val (dict): Has the format {k (str): v (float), ...}. Rows with feat[k] > v will be filtered out. Defaults to None.
equal_val (dict): Has the format {k (str): v (float), ...}. Rows with feat[k] != v will be filtered out. Defaults to None.
not_equal_val (dict): Has the format {k (str): v (float), ...}. Rows with feat[k] == v will be filtered out. Defaults to None.
drop_filter_field (bool): Fields that occurred in lowest_val, highest_val, equal_val and not_equal_val will be dropped if drop_filter_field == True. Defaults to True.
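The four value filters and drop_filter_field can be sketched together as follows. The helper `filter_by_value` and the sample rows are hypothetical; each comparison states the condition a row must satisfy to be kept, which is the inverse of the removal condition listed above.

```python
import operator


def filter_by_value(rows, lowest_val=None, highest_val=None,
                    equal_val=None, not_equal_val=None, drop_filter_field=True):
    """Apply the value filters, then optionally drop the fields they used."""
    # Each pair is (filter dict, predicate a row must pass to be KEPT).
    rules = [
        (lowest_val, operator.ge),     # removes feat[k] < v
        (highest_val, operator.le),    # removes feat[k] > v
        (equal_val, operator.eq),      # removes feat[k] != v
        (not_equal_val, operator.ne),  # removes feat[k] == v
    ]
    kept = list(rows)
    used_fields = set()
    for spec, keep in rules:
        for field, value in (spec or {}).items():
            kept = [row for row in kept if keep(row[field], value)]
            used_fields.add(field)
    if drop_filter_field:
        kept = [{k: v for k, v in row.items() if k not in used_fields} for row in kept]
    return kept


rows = [{"item_id": i, "rating": float(i)} for i in range(1, 6)]
high_rated = filter_by_value(rows, lowest_val={"rating": 3.0})
mid_rated = filter_by_value(rows, lowest_val={"rating": 2.0},
                            highest_val={"rating": 4.0}, drop_filter_field=False)
```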
Preprocessing
fields_in_same_space (list): List of spaces. Each space is a list of strings that name fields. The fields in the same space will be remapped into the same index system. Note that if you want some fields to be remapped in the same space as entities, set fields_in_same_space = [entity_id, xxx, ...]. (If ENTITY_ID_FIELD != 'entity_id', then change the 'entity_id' in the above example accordingly.) Defaults to None.
fill_nan (bool): Fill the NaN values in the features if True. Token fields will be filled with [PAD]. Float fields will be filled with the average value. Sequence fields will be filled with [0]. Defaults to True.
preload_weight (dict): Has the format {k (str): v (float), ...}. k is a token field, representing the ID of each row of the preloaded weight matrix. v is a float-like field. Each pair of k and v should come from the same atomic file. This arg can be used to load pretrained vectors. Defaults to None.
drop_preload_weight (bool): Drop the fields whose names are keys of preload_weight if drop_preload_weight == True. Defaults to True.
normalize_field (list): List of field names to be normalized. Note that only float-like fields can be normalized. Defaults to None.
normalize_all (bool): Normalize all the float-like fields if True. Defaults to True.
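The fill rules listed for fill_nan can be sketched as below. The helper `fill_missing`, the `field_types` mapping, and the use of None to mark a missing value are all assumptions for the example, as is the fallback of 0.0 when a float field has no observed values.

```python
def fill_missing(rows, field_types):
    """Fill missing values: '[PAD]' for tokens, the mean for floats, [0] for sequences."""
    filled = [dict(row) for row in rows]
    for field, ftype in field_types.items():
        if ftype == "float":
            present = [row[field] for row in rows if row[field] is not None]
            # Assumption: fall back to 0.0 when no value is observed at all.
            default = sum(present) / len(present) if present else 0.0
        elif ftype == "token":
            default = "[PAD]"
        else:  # treated as a sequence field
            default = [0]
        for row in filled:
            if row[field] is None:
                # Copy list defaults so rows do not share one list object.
                row[field] = list(default) if isinstance(default, list) else default
    return filled


rows = [
    {"genre": "a", "price": 1.0, "tags": [4]},
    {"genre": None, "price": None, "tags": None},
    {"genre": "b", "price": 3.0, "tags": [7]},
]
filled = fill_missing(rows, {"genre": "token", "price": "float", "tags": "seq"})
```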