RecBole contains an evaluation module to evaluate the performance of recommendation algorithms.
The function of the evaluation module is to implement commonly used evaluation protocols for recommender systems. Since different models can be compared under the same evaluation module, RecBole helps standardize the evaluation of recommender systems.
In the following table, we summarize the evaluation settings that RecBole supports, where the first four rows correspond to the dataset splitting methods and the last two rows correspond to the ranking mechanisms.
| Notation | Explanation |
|---|---|
| RO_RS | Random Ordering + Ratio-based Splitting |
| TO_LS | Temporal Ordering + Leave-one-out Splitting |
| RO_LS | Random Ordering + Leave-one-out Splitting |
| TO_RS | Temporal Ordering + Ratio-based Splitting |
| full | full ranking with all item candidates |
| uniN | sample-based ranking: each positive item is paired with N sampled negative items |
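As a rough illustration of how such a setting might be selected, the sketch below passes a configuration dictionary to `run_recbole`; the key `eval_setting` and its comma-separated value mirror the notation in the table above, but the exact configuration keys can differ across RecBole versions, so treat this as a hedged example rather than the definitive API.

```python
# Hedged sketch: choosing an evaluation setting via a config dict.
# The key `eval_setting` mirrors the table's notation (splitting,ranking)
# and may be named differently in newer RecBole versions.
from recbole.quick_start import run_recbole

run_recbole(
    model='BPR',
    dataset='ml-100k',
    config_dict={
        'eval_setting': 'TO_LS,full',   # Temporal Ordering + Leave-one-out, full ranking
        'metrics': ['Recall', 'NDCG', 'MRR'],
        'topk': [10],
    },
)
```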
In RecBole, the evaluation module consists of two major classes: the Ranking-based Evaluator and the Value-based Evaluator. Both classes can work with the above evaluation settings. For top-`k` item recommendation, the implemented evaluation settings cover the various settings of our earlier work (Zhao et al., 2020), where we studied the influence of different evaluation protocols on the performance comparison of models.
Ranking-based Evaluator is mainly used for the evaluation of ranking tasks. At present, RecBole supports evaluation metrics such as HR (hit ratio), NDCG, MRR, Recall, MAP, and Precision. These metrics are often computed in a truncated version at a cut-off length `k`, such as Recall@5 and NDCG@10.
For more information about the top-`k` metrics, see also: metrics
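As a concrete illustration of the truncation at `k`, the following self-contained sketch computes Recall@5 and NDCG@10 for a single user from a toy ranked list; the helper functions are ours for illustration and are not RecBole's internal implementation.

```python
# Toy illustration of truncated ranking metrics for one user; the variable
# names are illustrative and not taken from RecBole's implementation.
import numpy as np

# 1 marks a test (relevant) item at that rank position, 0 otherwise.
ranked_hits = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0, 0])
num_test_items = 3  # total relevant items of this user

def recall_at(hits, n_rel, k):
    # Fraction of the user's relevant items retrieved in the top-k list.
    return hits[:k].sum() / n_rel

def ndcg_at(hits, n_rel, k):
    # DCG truncated at k, normalized by the ideal DCG (all relevant items first).
    gains = hits[:k] / np.log2(np.arange(2, k + 2))
    ideal = 1.0 / np.log2(np.arange(2, min(n_rel, k) + 2))
    return gains.sum() / ideal.sum()

print("Recall@5 :", recall_at(ranked_hits, num_test_items, 5))   # 2 of 3 test items found
print("NDCG@10  :", ndcg_at(ranked_hits, num_test_items, 10))
```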
Value-based Evaluator is mainly used for rating prediction and click-through rate prediction. Currently, it supports evaluation metrics such as AUC, LogLoss, MAE, and RMSE.
For more information about the Loss metrics, see also: metrics
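As a quick sketch of what these value-based metrics compute (toy data, illustrative names only, not RecBole's internal API):

```python
# Toy illustration of value-based metrics on toy predictions.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])            # ground-truth labels / ratings
y_pred = np.array([0.9, 0.2, 0.6, 0.4, 0.1])  # predicted probabilities / scores

mae = np.abs(y_true - y_pred).mean()
rmse = np.sqrt(((y_true - y_pred) ** 2).mean())

# LogLoss (binary cross-entropy), with clipping for numerical stability.
eps = 1e-15
p = np.clip(y_pred, eps, 1 - eps)
logloss = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)).mean()

print(f"MAE={mae:.4f}  RMSE={rmse:.4f}  LogLoss={logloss:.4f}")
```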
Computing top-`k` evaluation metrics is usually time-consuming. The basic reason is that one needs to exhaustively estimate the score of each user-item pair. Since the method of score estimation varies across models, it is not easy to optimize the entire evaluation procedure in a general way. Therefore, we mainly focus on the step of selecting and generating the top-`k` items given the ranking scores.
The following illustration shows the proposed acceleration strategy for top-`k` item evaluation. Here, `u0, · · · , u5` denote six users; black, blue and grey boxes denote training items, test items and other candidate items, respectively. The entries of `A` are the indices of the `k` items identified by a recommendation algorithm. The entries of `B` indicate whether an item exists in the test set (blue boxes in the figure). The entries of `C` indicate whether the corresponding position of matrix `A` is predicted correctly.
We utilize the GPU version of the topk() function provided by PyTorch to find the top-`k` items with the highest scores for each user. The GPU version of topk() has been specially optimized based on CUDA, which makes it very efficient in our case. With the topk() function, we obtain a result matrix `A` of size `n \times k`, which records the original indices of the selected top-`k` items.
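A minimal sketch of this matrix-based procedure is given below, assuming a score matrix with training items already masked out; the variable names (`scores`, `test_matrix`) are illustrative and not RecBole's internal API.

```python
# Sketch of the top-k evaluation matrices described above (illustrative names).
import torch

n_users, n_items, k = 6, 100, 10

# `scores`: an (n_users x n_items) matrix of predicted scores, with the
# scores of training items assumed to be masked out already.
scores = torch.rand(n_users, n_items)

# `test_matrix` plays the role of matrix B: a 0/1 matrix marking which
# items appear in each user's test set.
test_matrix = torch.zeros(n_users, n_items)
test_matrix[torch.arange(n_users), torch.randint(0, n_items, (n_users,))] = 1

# torch.topk yields matrix A (indices of the k highest-scored items per user);
# it runs on GPU when `scores` lives on a CUDA device.
_, A = torch.topk(scores, k, dim=1)

# Gathering B at the positions of A yields matrix C: C[u, j] == 1 iff the
# j-th recommended item of user u is a test item.
C = torch.gather(test_matrix, 1, A)
```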
Considering that users may want to calculate the results of one metric at different positions (i.e., `k`) at the same time, we implement the parallel computation of evaluation metrics. The major procedure is given as follows:

1. We use np.cumsum() instead of np.sum() to calculate the metrics. Different from the traditional calculation method, our metric functions return a list instead of a single value, which records the top-`k` results from `1` to `\max(k)`.
2. We avoid Python loop statements (e.g., `for` statements). Because we store the top-`k` items as a matrix, we can make full use of the rich matrix functions provided by NumPy to speed up our calculations.
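The following sketch illustrates the cumsum trick with the 0/1 hit matrix `C` from the previous step; `pos_len` (the number of test items per user) and the other names are illustrative rather than RecBole's internal API.

```python
# Hedged sketch of computing HR@k and Recall@k for all k = 1..max_k at once.
import numpy as np

C = np.array([[0, 1, 0, 0, 1],
              [1, 0, 0, 1, 0],
              [0, 0, 1, 0, 0]])          # shape: (n_users, max_k), 1 = hit
pos_len = np.array([2, 3, 1])            # number of test items per user

# Cumulative hits along the ranking give the numerators for every cut-off
# k in a single pass, with no Python `for` loop.
cum_hits = np.cumsum(C, axis=1)          # shape: (n_users, max_k)

# HR@k: whether at least one test item appears in the top-k list.
hit_at_k = (cum_hits > 0).mean(axis=0)   # one value per k

# Recall@k: fraction of each user's test items retrieved in the top-k list.
recall_at_k = (cum_hits / pos_len[:, None]).mean(axis=0)

print("HR@1..5    :", hit_at_k)
print("Recall@1..5:", recall_at_k)
```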