Data Flow
For extensibility and reusability, the data module defines a data flow that transforms raw data into the model input.
The overall data flow consists of four stages, detailed as follows:
- Raw Input
The unprocessed raw input dataset. See Dataset List for details.
- Atomic Files
Basic components for characterizing the input of various recommendation tasks, proposed by RecBole. See Atomic Files for details.
- Dataset:
Mainly based on pandas.DataFrame, the primary data structure of the pandas library. During the transformation from atomic files to the Dataset class, we provide many useful functions that support common preprocessing steps in recommender systems, such as k-core data filtering and missing value imputation.
- DataLoader:
Mainly based on Interaction, a general internal data structure implemented by our library. Interaction is the internal data structure that is fed into the recommendation algorithms. It is implemented as a new abstract data type based on Python's built-in dict, a key-value indexed data structure. The keys correspond to features from the input, which can be conveniently referenced by feature name when writing recommendation algorithms; the values correspond to tensors (implemented by torch.Tensor), which are used for the update and computation in learning algorithms. Specifically, the value entry for a given key stores all the corresponding tensor data in a batch or mini-batch.
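The stages above can be illustrated with a minimal sketch. It is not RecBole's implementation: the helpers k_core_filter and to_batches, the column names, and the toy data are all hypothetical, and a plain dict of tensors stands in for the library's Interaction class. It assumes pandas and PyTorch are installed.

```python
import pandas as pd
import torch


def k_core_filter(df, k, user_col="user_id", item_col="item_id"):
    """Repeatedly drop users and items with fewer than k interactions.

    Hypothetical helper for illustration only, not a RecBole function.
    """
    while True:
        # Per-row interaction counts of the row's user and of the row's item.
        user_counts = df[user_col].map(df[user_col].value_counts())
        item_counts = df[item_col].map(df[item_col].value_counts())
        keep = (user_counts >= k) & (item_counts >= k)
        if keep.all():
            return df.reset_index(drop=True)
        df = df[keep]


def to_batches(df, batch_size):
    """Yield mini-batches as dicts that map feature name -> torch.Tensor.

    A plain dict is used here as a stand-in for Interaction (assumption).
    """
    for start in range(0, len(df), batch_size):
        chunk = df.iloc[start:start + batch_size]
        yield {name: torch.tensor(chunk[name].tolist()) for name in chunk.columns}


# Toy interaction table standing in for data loaded from atomic files.
raw = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "item_id": [10, 11, 10, 11, 12],
    "rating": [4.0, None, 5.0, 3.0, 2.0],
})

# Dataset stage: k-core filtering, then mean imputation of missing ratings.
filtered = k_core_filter(raw, k=2)
filtered = filtered.assign(rating=filtered["rating"].fillna(filtered["rating"].mean()))

# DataLoader stage: each batch is keyed by feature name.
batches = list(to_batches(filtered, batch_size=3))
first = batches[0]
print(first["user_id"], first["rating"])
```

Inside a model, a batch built this way is read by feature name, for example first["item_id"], which mirrors the access pattern described for Interaction. Each value holds the data for every row in the mini-batch.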