Evaluation

RecBole contains an evaluation module to evaluate the performance of recommendation algorithms.


Overview of evaluation module

The function of the evaluation module is to implement commonly used evaluation protocols for recommender systems. Since different models can be compared under the same evaluation module, RecBole helps standardize the evaluation of recommender systems.


Evaluation protocols and settings

In the following table, we summarize the evaluation settings that RecBole supports, where the first four rows correspond to the dataset splitting methods and the last two rows correspond to the ranking mechanisms.

Notation Explanation
RO_RS Random Ordering + Ratio-based Splitting
TO_LS Temporal Ordering + Leave-one-out Splitting
RO_LS Random Ordering + Leave-one-out Splitting
TO_RS Temporal Ordering + Ratio-based Splitting
full full ranking with all item candidates
uniN sample-based ranking: each positive item is paired with N sampled negative items
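To make the splitting notations above concrete, the sketch below contrasts ratio-based splitting under random ordering (RO_RS) with leave-one-out splitting under temporal ordering (TO_LS) for a single user's interaction list. The function names and the 8:1:1 ratio are illustrative assumptions, not part of RecBole's API.

```python
import random

def ro_rs_split(interactions, ratios=(0.8, 0.1, 0.1)):
    """RO_RS: shuffle a user's interactions, then cut them by ratio."""
    items = list(interactions)
    random.shuffle(items)                      # random ordering
    n_train = int(len(items) * ratios[0])
    n_valid = int(len(items) * ratios[1])
    return (items[:n_train],
            items[n_train:n_train + n_valid],
            items[n_train + n_valid:])

def to_ls_split(interactions_with_time):
    """TO_LS: sort by timestamp, hold out the last item for test
    and the second-to-last item for validation (leave-one-out)."""
    items = [i for i, _ in sorted(interactions_with_time, key=lambda x: x[1])]
    return items[:-2], items[-2:-1], items[-1:]
```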

In RecBole, the evaluation module consists of two major classes: Ranking-based Evaluator and Value-based Evaluator. Both classes can work with the above evaluation settings. For `top`-`k` item recommendation, the implemented evaluation settings cover the various settings of our earlier work (Zhao et al., 2020), where we studied the influence of different evaluation protocols on the performance comparison of models.



Ranking-based Evaluator

Ranking-based Evaluator is mainly used for the evaluation of ranking tasks. At present, RecBole supports evaluation metrics such as HR (hit ratio), NDCG, MRR, recall, MAP and precision. These metrics are often computed in a truncated version with a length of `k`, such as Recall@5 and NDCG@10.
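As a minimal sketch of how such truncated metrics are computed, the snippet below derives Recall@k and NDCG@k for one user from a ranked recommendation list. The helper names are illustrative and do not correspond to RecBole's internal implementation.

```python
import numpy as np

def recall_at_k(ranked_items, ground_truth, k):
    """Fraction of ground-truth items that appear in the top-k list."""
    hits = len(set(ranked_items[:k]) & set(ground_truth))
    return hits / len(ground_truth)

def ndcg_at_k(ranked_items, ground_truth, k):
    """DCG of the top-k list divided by the ideal DCG."""
    gt = set(ground_truth)
    dcg = sum(1.0 / np.log2(rank + 2)            # rank is 0-based
              for rank, item in enumerate(ranked_items[:k]) if item in gt)
    idcg = sum(1.0 / np.log2(rank + 2) for rank in range(min(k, len(gt))))
    return dcg / idcg
```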

For more information about the `top`-`k` metrics, see also: metrics


Value-based Evaluator

Value-based Evaluator is mainly used for rating prediction and click-through rate prediction. Currently, it supports evaluation metrics such as AUC, LogLoss, MAE and RMSE.
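The snippet below is a minimal sketch of how such value-based metrics can be computed from predicted scores and true labels with NumPy; it is for illustration only and is not RecBole's internal code.

```python
import numpy as np

def rmse(pred, true):
    """Root mean squared error between predicted and true ratings."""
    return np.sqrt(np.mean((pred - true) ** 2))

def mae(pred, true):
    """Mean absolute error between predicted and true ratings."""
    return np.mean(np.abs(pred - true))

def logloss(pred_prob, true, eps=1e-15):
    """Binary cross-entropy; predictions are clipped to avoid log(0)."""
    p = np.clip(pred_prob, eps, 1 - eps)
    return -np.mean(true * np.log(p) + (1 - true) * np.log(1 - p))
```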

For more information about the Loss metrics, see also: metrics



Acceleration Strategy for `top`-`k` Evaluation

Computing `top`-`k` evaluation metrics is usually time-consuming. The basic reason is that one needs to exhaustively estimate the score for each user-item pair. Since the methods of score estimation vary across different models, it is not easy to optimize the entire evaluation procedure in a general way. Therefore, we mainly focus on the step of selecting and generating the `top`-`k` items given the ranking scores.

The following illustration shows the proposed acceleration strategy for `top`-`k` item evaluation. Here, `u_0, \ldots, u_5` indicate six users; black, blue and grey boxes denote training items, test items and other candidate items, respectively. The entries in `A` correspond to the indices of the `k` items identified by a recommendation algorithm. The entries in `B` indicate the existence of an item in the test set (blue boxes in the figure). The entries in `C` indicate whether the corresponding position of matrix `A` is predicted correctly.




Reassigning the score matrix

  • reshaping: Given `n` users and `m` items for consideration, when performing full ranking, we can obtain an `n \times m` matrix `D` consisting of the confidence scores from a model over the entire item set. When performing sample-based ranking, we create an `n \times m` matrix `D`, initialize all elements to negative infinity, and then fill matrix `D` with the confidence scores over the sampled items.
  • filling: When performing full ranking with all item candidates, we provide an option to mask the scores of training items. If the user chooses to mask, the matrix obtained in the above step cannot be directly used for `top`-`k` prediction. Our solution is to set the scores of training items to negative infinity and perform the full ranking over the entire item set without removing training items (a sketch of both steps follows this list).
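The PyTorch sketch below illustrates the two steps above. The tensor name `D` follows the description (the `n \times m` score matrix), while the sampled-item layout and the mask construction are simplified assumptions for the example.

```python
import torch

n_users, n_items = 4, 10
neg_inf = -float('inf')

# --- reshaping (sample-based ranking) ---
# start from an all -inf matrix and fill in the scores of the sampled items only
D = torch.full((n_users, n_items), neg_inf)
sampled_items = torch.tensor([[1, 3, 5], [0, 2, 9], [4, 6, 7], [2, 5, 8]])
sampled_scores = torch.rand(n_users, 3)        # scores from the model (placeholder)
D.scatter_(1, sampled_items, sampled_scores)

# --- filling (full ranking with masked training items) ---
# set the scores of training items to -inf so they can never enter the top-k
D_full = torch.rand(n_users, n_items)          # scores over the entire item set
train_mask = torch.zeros(n_users, n_items, dtype=torch.bool)
train_mask[0, [1, 2]] = True                   # e.g. user 0 trained on items 1 and 2
D_full[train_mask] = neg_inf
```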

Get the `top`-`k` items

We utilize the GPU-version topk() function provided by PyTorch to find the `top`-`k` items with the highest scores for each user. The GPU-version topk() function has been specially optimized based on CUDA, which is very efficient in our case. With the topk() function, we can obtain a result matrix `A` of size `n \times k`, which records the original indices of the selected `top`-`k` items.
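For example, with a score matrix `D` as built in the previous step, a single call to torch.topk yields the index matrix `A` (shapes here are illustrative):

```python
import torch

n, m, k = 6, 100, 5
D = torch.rand(n, m)                    # score matrix; runs on GPU if moved with .cuda()
scores, A = torch.topk(D, k, dim=1)     # A has shape (n, k): indices of the top-k items
```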


Calculate the intersection of `top`-`k` items and ground-truth items

  • indexing: We further generate a binary matrix `B` of size `n \times m` to indicate the existence of an item in the test set. Next, we use each row of matrix `A` to index the same row in matrix `B` and obtain a binary matrix `C` of size `n \times k` indicating whether the corresponding entry is a correct recommendation or not, which can be implemented efficiently through the gather() function provided by PyTorch (see the sketch below).
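A minimal sketch of this indexing step with torch.gather, using the matrices `A` and `B` described above (the shapes and the example test items are illustrative):

```python
import torch

n, m, k = 6, 100, 5
A = torch.randint(0, m, (n, k))             # top-k item indices per user
B = torch.zeros(n, m, dtype=torch.int8)     # 1 where the item is in the test set
B[0, [3, 17]] = 1                           # e.g. user 0's test items

# C[u, i] is 1 iff the i-th recommended item of user u is a test item
C = torch.gather(B, dim=1, index=A)         # shape (n, k)
```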

Calculate `top`-`k` metrics

Considering that users may want to calculate the results of one metric at several positions (i.e., several values of `k`) at the same time, we implement the parallel computation of evaluation metrics. The major procedure is given as follows:

  • If a user needs to compute the evaluation metrics at several positions, we maintain a list of `k` values and obtain the top-item matrix of size `n \times \max(k)`.
  • In calculation, we use NumPy functions such as np.cumsum() instead of np.sum() to calculate the metrics. Different from the traditional calculation, our metric functions return a list instead of a single value, recording the `top`-`k` results for every position from `1` to `\max(k)` (as shown in the sketch after this list).
  • We replace for statements with matrix operations. Because we store the `top`-`k` items as a matrix, we can make full use of the rich matrix functions provided by NumPy to speed up our calculations.
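As a minimal sketch of this cumulative trick, the snippet below computes Recall at every position from `1` to `\max(k)` in one pass over the binary matrix `C`. The variable names mirror the description above; the toy matrix and the per-user test-item counts are illustrative.

```python
import numpy as np

# C: 1 = correct recommendation at that rank, shape (n_users, max_k) with max_k = 5
C = np.array([[1, 0, 1, 0, 0],
              [0, 1, 0, 0, 1],
              [1, 1, 0, 0, 0]])
pos_len = np.array([3, 2, 4])               # number of test items per user

# cumulative hits give the numerators of Recall@1 ... Recall@max_k in one shot
cum_hits = np.cumsum(C, axis=1)             # shape (n_users, max_k)
recall_at_all_k = cum_hits / pos_len[:, None]

# column j holds Recall@(j + 1); average over users for the reported value
print(recall_at_all_k.mean(axis=0))
```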