Evaluation Settings
Evaluation settings are designed to set parameters about model evaluation.
`eval_args (dict)`
: This parameter has four keys: `group_by`, `order`, `split`, and `mode`, which respectively control the data grouping strategy, the data ordering strategy, the data splitting strategy, and the evaluation mode for model evaluation. A configuration sketch follows this list.

  - `group_by (str)`: decides how we group the data in `.inter`. Now we support two kinds of grouping strategies: `['user', 'none']`. If the value of `group_by` is `user`, the data will be grouped by the column of `USER_ID_FIELD` and split along the user dimension. If the value is `none`, the data won't be grouped. The default value is `user`.
  - `order (str)`: decides how we sort the data in `.inter`. Now we support two kinds of ordering strategies: `['RO', 'TO']`, which denote random ordering and temporal ordering. For `RO`, we shuffle the data and then split it in this order. For `TO`, we sort the data by the column of `TIME_FIELD` in ascending order and then split it in this order. The default value is `RO`.
  - `split (dict)`: decides how we split the data in `.inter`. Now we support two kinds of splitting strategies: `['RS', 'LS']`, which denote ratio-based splitting and leave-one-out splitting. If the key of `split` is `RS`, you need to set a splitting ratio such as `[0.8, 0.1, 0.1]`, `[7, 2, 1]`, or `[8, 0, 2]`, which denotes the ratio of the training set, validation set, and testing set respectively. If the key of `split` is `LS`, we support three kinds of `LS` modes: `['valid_and_test', 'valid_only', 'test_only']`, and you should choose one of them as the value of `LS`. The default value of `split` is `{'RS': [0.8, 0.1, 0.1]}`.
  - `mode (str or dict)`: decides the data range over which we evaluate the model during the `valid` and `test` phases. Now we support four kinds of evaluation modes: `['full', 'unixxx', 'popxxx', 'labeled']`. `full`, `unixxx`, and `popxxx` are designed for evaluation on implicit feedback (data without labels). For implicit feedback, we regard the items with observed interactions as positive items and those without observed interactions as negative items. `full` means evaluating the model on the set of all items. `unixxx`, for example `uni100`, means uniformly sampling 100 negative items for each positive item in the testing set, and evaluating the model on these positive items together with their sampled negative items. `popxxx`, for example `pop100`, means sampling 100 negative items for each positive item in the testing set according to item popularity (`Counter(item)` in the `.inter` file), and evaluating the model in the same way. Here `xxx` must be an integer. For explicit feedback (data with labels), you should set the mode to `labeled`, and we will evaluate the model based on your labels. You can use `valid` and `test` as dict keys to set a specific `mode` for each phase. The default value is `full`, which is equivalent to `{'valid': 'full', 'test': 'full'}`.
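As a minimal sketch of how these keys fit together, the dict below collects one possible `eval_args` configuration. Every key and allowed value comes from the documentation above, but this particular combination is only illustrative, not a set of defaults:

```python
# Illustrative eval_args configuration; keys and values are those documented
# above, but this combination is an example rather than the default.
eval_args = {
    'group_by': 'user',                   # group interactions by USER_ID_FIELD
    'order': 'TO',                        # sort by TIME_FIELD before splitting
    'split': {'RS': [0.8, 0.1, 0.1]},     # 8:1:1 train/valid/test ratio split
    # 'split': {'LS': 'valid_and_test'},  # alternative: leave-one-out splitting
    'mode': {'valid': 'uni100',           # 100 uniformly sampled negatives per positive
             'test': 'full'},             # rank against the full item set at test time
}
```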
`repeatable (bool)`
: Whether to evaluate the results under a repeatable recommendation scenario. Note that this setting is disabled for sequential models, since their recommendation is already repeatable. For other models, the default value is `False`.
`metrics (list or str)`
: Evaluation metrics. Defaults to `['Recall', 'MRR', 'NDCG', 'Hit', 'Precision']`. The supported metrics are listed in the following table:

  | Type          | Metrics |
  |---------------|---------|
  | Ranking-based | Recall, MRR, NDCG, Hit, MAP, Precision, GAUC, ItemCoverage, AveragePopularity, GiniIndex, ShannonEntropy, TailPercentage |
  | Value-based   | AUC, MAE, RMSE, LogLoss |

  Note that value-based metrics and ranking-based metrics cannot be used together.
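Since value-based metrics compare predicted scores against ground-truth labels, they would naturally be paired with the `labeled` evaluation mode described above; this pairing is an inference from the explicit-feedback description, and whether a partial `eval_args` dict is merged with the defaults may depend on the library version. A hedged sketch:

```python
# Sketch: value-based evaluation on explicit feedback. Metric names come from
# the table above; pairing them with mode='labeled' is an assumption drawn
# from the explicit-feedback description, not a documented requirement.
value_based_config = {
    'metrics': ['AUC', 'MAE', 'RMSE', 'LogLoss'],  # value-based metrics only
    'eval_args': {'mode': 'labeled'},              # evaluate against provided labels
}
```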
`topk (list or int or None)`
: The value of k for top-k evaluation metrics. Defaults to `10`.

`valid_metric (str)`
: The evaluation metric used for early stopping. It must be one of the metrics in `metrics`. Defaults to `'MRR@10'`.

`eval_batch_size (int)`
: The evaluation batch size. Defaults to `4096`.

`metric_decimal_place (int)`
: The number of decimal places for metric scores. Defaults to `4`.
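Putting the pieces together, the sketch below passes these evaluation settings programmatically through RecBole's quick-start entry point. The model name `'BPR'` and dataset name `'ml-100k'` are placeholder choices for illustration; the same keys can equally be written in a YAML config file.

```python
# Sketch: run an experiment with explicit evaluation settings.
# All config keys and values mirror the defaults documented in this section;
# the model and dataset names are placeholders for illustration only.
from recbole.quick_start import run_recbole

config_dict = {
    'eval_args': {
        'group_by': 'user',
        'order': 'RO',
        'split': {'RS': [0.8, 0.1, 0.1]},
        'mode': 'full',
    },
    'metrics': ['Recall', 'MRR', 'NDCG', 'Hit', 'Precision'],
    'topk': 10,
    'valid_metric': 'MRR@10',     # early stopping watches this metric
    'eval_batch_size': 4096,
    'metric_decimal_place': 4,
}

run_recbole(model='BPR', dataset='ml-100k', config_dict=config_dict)
```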