Data settings

RecBole provides several arguments for describing:

  • Basic information of the dataset

  • Operations of dataset preprocessing

See below for the details:

Atomic File Format

  • field_separator (str) : Separator of different columns in atomic files. Defaults to "\t".

  • seq_separator (str) : Separator inside the sequence features. Defaults to " ".

Basic Information

Common Features

  • USER_ID_FIELD (str) : Field name of user ID feature. Defaults to user_id.

  • ITEM_ID_FIELD (str) : Field name of item ID feature. Defaults to item_id.

  • RATING_FIELD (str) : Field name of rating feature. Defaults to rating.

  • TIME_FIELD (str) : Field name of timestamp feature. Defaults to timestamp.

  • seq_len (dict) : Keys are field names of sequence features, values are maximum length of each sequence (which means too long sequences will be cut off). If not set, the sequences will not be cut off. Defaults to None.

Label for Point-wise DataLoader

  • LABEL_FIELD (str) : Expected field name of the generated labels. Defaults to label.

  • threshold (dict) : The format is {k (str): v (float)}. 0/1 labels will be generated according to the value of inter_feat[k] and v. The rows with inter_feat[k] >= v will be labeled as positive, otherwise the label is negative. Note that at most one pair of k and v can exist in threshold. Defaults to None.

Negative Sampling Prefix for Pair-wise DataLoader

  • NEG_PREFIX (str) : Prefix of field names which are generated as negative cases. E.g. if we have positive item ID named item_id, then those item ID in negative samples will be called NEG_PREFIX + item_id. Defaults to neg_.

Sequential Model Needed

  • ITEM_LIST_LENGTH_FIELD (str) : Field name of the feature representing item sequences’ length. Defaults to item_length.

  • LIST_SUFFIX (str) : Suffix of field names which are generated as sequences. E.g. if we have item ID named item_id, then those item ID sequences will be called item_id + LIST_SUFFIX. Defaults to _list.

  • MAX_ITEM_LIST_LENGTH (int): Maximum length of each generated sequence. Defaults to 50.

  • POSITION_FIELD (str) : Field name of the generated position sequence. For sequence of length k, its position sequence is range(k). Note that this field will only be generated if this arg is not None. Defaults to position_id.

Knowledge-based Model Needed

  • HEAD_ENTITY_ID_FIELD (str) : Field name of the head entity ID feature. Defaults to head_id.

  • TAIL_ENTITY_ID_FIELD (str) : Field name of the tail entity ID feature. Defaults to tail_id.

  • RELATION_ID_FIELD (str) : Field name of the relation ID feature. Defaults to relation_id.

  • ENTITY_ID_FIELD (str) : Field name of the entity ID. Note that it’s only a symbol of entities, not real feature of one of the xxx_feat. Defaults to entity_id.

  • kg_reverse_r (bool) : Whether or not to reverse relations of triples for bidirectional edges. Defaults to False.

  • entity_kg_num_interval (str) : Has the interval format, such as [A,B] / [A,B) / (A,B) / (A,B], where A and B are the endpoints of the interval and A <= B. Entities (including head entities and tail entities) whose number of triples is in the interval will be retained. Defaults to [0,inf).

  • relation_kg_num_interval (str) : Has the interval format, such as [A,B] / [A,B) / (A,B) / (A,B], where A and B are the endpoints of the interval and A <= B. Relations whose number of triples is in the interval will be retained. Defaults to [0,inf).

Selectively Loading

  • load_col (dict) : Keys are the suffix of loaded atomic files, values are the list of field names to be loaded. If a suffix doesn’t exist in load_col, the corresponding atomic file will not be loaded. Note that if load_col is None, then all the existed atomic files will be loaded. Defaults to {inter: [user_id, item_id]}.

  • unload_col (dict) : Keys are suffix of loaded atomic files, values are list of field names NOT to be loaded. Note that load_col and unload_col can not be set at the same time. Defaults to None.

  • unused_col (dict) : Keys are suffix of loaded atomic files, values are list of field names which are loaded for data processing but will not be used in model. E.g. the time_field may be used for time ordering but model does not use this field. Defaults to None.

  • additional_feat_suffix (list): Control loading additional atomic files. E.g. if you want to load features from ml-100k.hello, just set this arg as additional_feat_suffix: [hello]. Features of additional features will be stored in Dataset.feat_list. Defaults to None.

  • numerical_features (list): The numerical features to be embed for context-aware methods. Defaults to None.

Filtering

Remove duplicated user-item interactions

  • rm_dup_inter (str) : Whether to remove duplicated user-item interactions. If time_field exists, inter_feat will be sorted by time_field in ascending order. Otherwise it will remain unchanged. After that, if rm_dup_inter=first, we will keep the first user-item interaction in duplicates; if rm_dup_inter=last, we will keep the last user-item interaction in duplicates. Defaults to None.

Filter by value

  • val_interval (dict): Has the format {k (str): interval (str), ...}, where interval can be set as [A,B] / [A,B) / (A,B) / (A,B]. The rows whose feat[k] is in the interval interval will be retained. If you want to specify more than one interval, separate them with semicolon(s). For instance, {k: "[A,B);(C,D]"} can be adopted and rows whose feat[k] is in any specified interval will be retained. Defaults to None, which means all rows will be retained.

Remove interation by user or item

  • filter_inter_by_user_or_item (bool) : If True, we will remove the interaction in inter_feat which user or item is not in user_feat or item_feat. Defaults to True.

Filter by number of interactions

  • user_inter_num_interval (str) : Has the interval format, such as [A,B] / [A,B) / (A,B) / (A,B], where A and B are the endpoints of the interval and A <= B. Users whose number of interactions is in the interval will be retained. Defaults to [0,inf).

  • item_inter_num_interval (str) : Has the interval format, such as [A,B] / [A,B) / (A,B) / (A,B], where A and B are the endpoints of the interval and A <= B. Items whose number of interactions is in the interval will be retained. Defaults to [0,inf).

Preprocessing

  • alias_of_user_id (list): List of fields’ names, which will be remapped into the same index system with USER_ID_FIELD. Defaults to None.

  • alias_of_item_id (list): List of fields’ names, which will be remapped into the same index system with ITEM_ID_FIELD. Defaults to None.

  • alias_of_entity_id (list): List of fields’ names, which will be remapped into the same index system with ENTITY_ID_FIELD, HEAD_ENTITY_ID_FIELD and TAIL_ENTITY_ID_FIELD. Defaults to None.

  • alias_of_relation_id (list): List of fields’ names, which will be remapped into the same index system with RELATION_ID_FIELD. Defaults to None.

  • preload_weight (dict) : Has the format {k (str): v (float)}, .... k is a token field, representing the IDs of each row of preloaded weight matrix. v is a float-like field. Each pair of k and v should be from the same atomic file. This arg can be used to load pretrained vectors. Defaults to None.

  • normalize_field (list) : List of filed names to be normalized. Note that only float-like fields can be normalized. Defaults to None.

  • normalize_all (bool) : Normalize all the float like fields if True. Defaults to None.

  • discretization (dict) : Has the format {k (str): v (dict)}, .... k is a float field, representing the fields to be discretized. v is a config dict which have 2 keys: method and bucket, which respectively control the discretization strategy.

    • method (str): decides how we discretize the float data. Now we support two kinds of discretization strategies: ['ED', 'LD']. If the value of method is ED, the data will be discretized in equal distance. If the value is LD, the data will be discretized in logarithm. The default value is ED.

    • bucket (int): the number of buckets that contains equal number of features when the discretization method is ‘ED’.

Benchmark file

  • benchmark_filename (list) : List of pre-split user-item interaction suffix. We will only apply normalize, remap-id, which will not delete the interaction in inter_feat. And then split the inter_feat by benchmark_filename. E.g. Let’s assume that the dataset is called click, and benchmark_filename equals to ['part1', 'part2', 'part3']. That we will load click.part1.inter, click.part2.inter, click.part3.inter, and treat them as train, valid, test dataset. Defaults to None.