RecBole contains an evaluation module to evaluate the performance of recommendation algorithms.

The evaluation module implements commonly used evaluation protocols for recommender systems. Since different models can be compared under the same evaluation module, RecBole helps standardize the evaluation of recommender systems.

In the following table, we summarize the evaluation settings that RecBole supports, where the first four rows correspond to the dataset splitting methods and the last two rows correspond to the ranking mechanisms.

Notation | Explanation |
---|---|
RO_RS | Random Ordering + Ratio-based Splitting |
TO_LS | Temporal Ordering + Leave-one-out Splitting |
RO_LS | Random Ordering + Leave-one-out Splitting |
TO_RS | Temporal Ordering + Ratio-based Splitting |
full | Full ranking with all item candidates |
uniN | Sample-based ranking: each positive item is paired with N sampled negative items |
In RecBole, the evaluation module consists of two major classes: the Ranking-based Evaluator and the Value-based Evaluator. Both classes can work with the above evaluation settings. For top-`k` item recommendation, the implemented evaluation settings cover the various settings of our earlier work in (Zhao et al., 2020), where we studied the influence of different evaluation protocols on the performance comparison of models.

Ranking-based Evaluator is mainly used for the evaluation of ranking tasks. At present, RecBole supports evaluation metrics such as HR (hit ratio), NDCG, MRR, recall, MAP and precision. These metrics are often computed in a truncated version with a cutoff length `k`, such as `Recall@5` and `NDCG@10`.

For more information about the top-`k` metrics, see also: metrics
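As a quick illustration, the truncated metrics can be computed from a binary hit matrix. The following is a minimal sketch (not RecBole's implementation) of `Recall@k` and `NDCG@k`; `hits` and `num_pos` are hypothetical toy inputs:

```python
import numpy as np

def recall_at_k(hits, num_pos, k):
    # hits: binary (n_users x max_k) matrix, 1 = hit at that rank
    return (hits[:, :k].sum(axis=1) / num_pos).mean()

def ndcg_at_k(hits, num_pos, k):
    # log2 position discounts for ranks 1..k
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = (hits[:, :k] * discounts).sum(axis=1)
    # ideal DCG: all test items ranked at the top
    idcg = np.array([discounts[:n].sum() for n in np.minimum(num_pos, k)])
    return (dcg / idcg).mean()

# two users, top-3 lists; user 0 hits at ranks 1 and 3, user 1 at rank 2
hits = np.array([[1, 0, 1],
                 [0, 1, 0]])
num_pos = np.array([2, 1])  # number of test items per user
print(recall_at_k(hits, num_pos, 3))  # → 1.0
```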

Value-based Evaluator is mainly used for rating prediction and click-through rate prediction. Currently, it supports evaluation metrics such as AUC, logloss, MAE and RMSE.

For more information about the Loss metrics, see also: metrics
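For intuition, these value-based metrics reduce to simple elementwise formulas. A minimal NumPy sketch (not RecBole's implementation) of RMSE and logloss, on toy ratings and click probabilities, might look like:

```python
import numpy as np

def rmse(y_true, y_pred):
    # root mean squared error for rating prediction
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def logloss(y_true, y_prob, eps=1e-15):
    # binary cross-entropy for click-through rate prediction;
    # clipping avoids log(0)
    p = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

ratings = np.array([4.0, 3.0, 5.0])
preds = np.array([3.5, 3.0, 4.0])
print(rmse(ratings, preds))      # ≈ 0.6455

clicks = np.array([1.0, 0.0, 1.0])
probs = np.array([0.9, 0.2, 0.7])
print(logloss(clicks, probs))    # ≈ 0.2284
```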

Computing top-`k` evaluation metrics is usually time-consuming. The basic reason is that one needs to exhaustively estimate the score for each user-item pair. Since the method of score estimation varies across models, it is not easy to optimize the entire evaluation procedure in a general way. Therefore, we mainly focus on the step of selecting and generating the top-`k` items given the ranking scores.

The following illustration shows the proposed acceleration strategy for top-`k` item evaluation. Here, `u0, · · · , u5` denote six users; black, blue and grey boxes denote training items, test items and other candidate items, respectively. The entries of `A` are the indices of the `k` items identified by a recommendation algorithm. The entries of `B` indicate whether an item exists in the test set (blue boxes in the figure). The entries of `C` indicate whether the corresponding position of matrix `A` is predicted correctly.

**reshaping**: Given `n` users and `m` items for consideration, when performing full ranking, we obtain an `n \times m` matrix `D` consisting of the confidence scores from a model over the entire item set. When performing sample-based ranking, we create an `n \times m` matrix `D` with all elements initialized to negative infinity, and then fill matrix `D` with the confidence scores over the sampled items.

**filling**: When performing full ranking with all item candidates, we provide an option to mask the scores of training items. If the user chooses to mask, the matrix obtained in the above step cannot be directly used for top-`k` prediction. Our solution is to set the scores of training items to negative infinity, so that the full ranking can be performed over the entire item set without removing training items.
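The reshaping and filling steps can be sketched with NumPy as follows; the score matrix and the training/sampled index arrays are toy, hypothetical data:

```python
import numpy as np

n_users, n_items = 3, 5
rng = np.random.default_rng(0)

# reshaping (full ranking): an n x m matrix of model confidence scores
scores = rng.random((n_users, n_items))

# filling: set training-item scores to -inf so they can never
# enter the top-k; these (user, item) pairs are toy data
train_users = np.array([0, 0, 1, 2])
train_items = np.array([1, 3, 0, 4])
scores[train_users, train_items] = -np.inf

# reshaping (sample-based ranking): start from an all -inf matrix and
# fill in only the scores of the sampled candidate items
sampled = np.full((n_users, n_items), -np.inf)
sampled[np.array([0, 0, 1]), np.array([2, 4, 3])] = [0.8, 0.3, 0.6]
```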

We utilize the GPU version of the topk() function provided by PyTorch to find the top-`k` items with the highest scores for each user. This function has been specially optimized based on CUDA and is very efficient in our case. With the topk() function, we obtain a result matrix `A` of size `n \times k`, which records the original indices of the selected top-`k` items.
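A minimal example of this step with `torch.topk()`, using toy scores:

```python
import torch

# toy confidence scores for 2 users over 4 items
scores = torch.tensor([[0.2, 0.9, 0.1, 0.5],
                       [0.7, 0.3, 0.8, 0.4]])
k = 2
# A records the original indices of the top-k items per user,
# sorted by descending score; `values` holds the scores themselves
values, A = torch.topk(scores, k, dim=1)
print(A)  # tensor([[1, 3], [2, 0]])
```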

**indexing**: We further generate a binary matrix `B` of size `n \times m` to indicate the existence of an item in the test set. Next, we use each row of matrix `A` to index the same row of matrix `B` and obtain a binary matrix `C` of size `n \times k`, indicating whether each entry is a correct recommendation or not. This can be implemented efficiently with the gather() function provided by PyTorch.
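The indexing step maps directly onto `torch.gather()`; a toy sketch:

```python
import torch

# B: binary n x m matrix, B[u, i] = 1 iff item i is in user u's test set
B = torch.tensor([[0, 1, 0, 1],
                  [1, 0, 0, 0]])
# A: n x k top-k item indices per user (e.g. produced by torch.topk)
A = torch.tensor([[1, 2],
                  [0, 3]])
# C[u, j] = B[u, A[u, j]]: 1 iff the j-th recommendation for user u
# hits a test item
C = B.gather(1, A)
print(C)  # tensor([[1, 0], [1, 0]])
```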

Considering that users may want to compute a metric at several positions (i.e., several values of `k`) at the same time, we implement the parallel computation of evaluation metrics. The major procedure is as follows:

- If a user needs to compute the evaluation metrics at several positions, we maintain a list of `k` values and obtain the top-`k` items matrix of size `n \times \max(k)`.
- In the calculation, we use NumPy functions such as `np.cumsum()` instead of `np.sum()` to compute the metrics. Different from the traditional calculation, our metric functions return a list instead of a single value, which records the top-`k` results from `1` to `\max(k)`.
- We replace `for` statements with matrix operations. Because we store the top-`k` items as a matrix, we can make full use of the rich matrix functions provided by NumPy to speed up our calculations.
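The cumulative-sum trick above can be sketched as follows, computing the recall at every cutoff from `1` to `\max(k)` in a single pass over a toy hit matrix `C`:

```python
import numpy as np

# C: binary n x max(k) hit matrix (1 = correct recommendation);
# num_pos: number of test items per user; both are toy data
C = np.array([[1, 0, 1, 0],
              [0, 1, 0, 0]])
num_pos = np.array([2, 1])

# cumulative hits yield the result at every cutoff 1..max(k) at once,
# so one pass replaces max(k) separate np.sum() calls
cum_hits = np.cumsum(C, axis=1)                 # n x max(k)
recall = (cum_hits / num_pos[:, None]).mean(axis=0)
print(recall)  # Recall@1..Recall@4: [0.25, 0.75, 1.0, 1.0]
```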