Atomic Files
Atomic files are introduced to format the input of mainstream recommendation tasks in a flexible way.
So far, our library introduces six atomic file types, and we identify different files by their suffixes.
Suffix | Content | Example Format |
---|---|---|
.inter | User-item interaction | user_id, item_id, rating, timestamp, review |
.user | User feature | user_id, age, gender |
.item | Item feature | item_id, category |
.kg | Triplets in a knowledge graph | head_entity, tail_entity, relation |
.link | Item-entity linkage data | entity, item_id |
.net | Social graph data | source, target |
Atomic files are combined to support the input of different recommendation tasks.
To load an atomic file, write its suffix into the config arg load_col.
Each recommendation task requires several mandatory files:
Tasks | Mandatory atomic files |
---|---|
General | .inter |
Context-aware | .inter, .user, .item |
Knowledge-aware | .inter, .kg, .link |
Sequential | .inter |
Social | .inter, .net |
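The mandatory-file table can be turned into a quick pre-flight check. The following is a minimal sketch, not part of the library: the name `missing_atomic_files` and the `MANDATORY_FILES` mapping are hypothetical helpers that simply encode the table.

```python
# Mandatory atomic-file suffixes per task, copied from the table above.
MANDATORY_FILES = {
    "General": [".inter"],
    "Context-aware": [".inter", ".user", ".item"],
    "Knowledge-aware": [".inter", ".kg", ".link"],
    "Sequential": [".inter"],
    "Social": [".inter", ".net"],
}


def missing_atomic_files(task, available_suffixes):
    """Return the mandatory suffixes for `task` that are absent from `available_suffixes`.

    Hypothetical helper for illustration; it only checks suffix names,
    not whether the files are well formed.
    """
    available = set(available_suffixes)
    return [suffix for suffix in MANDATORY_FILES[task] if suffix not in available]


# A knowledge-aware dataset that ships without its .link file:
missing = missing_atomic_files("Knowledge-aware", [".inter", ".kg"])
```

Here `missing` lists the suffixes that still have to be prepared before the task can be run.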
Format
Each atomic file can be viewed as an m × n table, where n is the number of features and m-1 is the number of data records (one line is the header).
The first row holds the feature names; each entry has the form feat_name:feat_type, giving the feature name and the feature type.
We support four feature types, all of which can be processed as tensors in batches.
feat_type | Explanations | Examples |
---|---|---|
token | single discrete feature | user_id, age |
token_seq | discrete features sequence | review |
float | single continuous feature | rating, timestamp |
float_seq | continuous feature sequence | vector |
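To make the feat_name:feat_type convention concrete, here is a small sketch of how a header and a data row could be interpreted. It is not the library's loader. The function names are hypothetical, and the separators are assumptions not stated on this page: tab between fields and a single space between sequence elements.

```python
FEAT_TYPES = {"token", "token_seq", "float", "float_seq"}


def parse_header(header_line, field_sep="\t"):
    """Split a header line into (feat_name, feat_type) pairs."""
    fields = []
    for entry in header_line.rstrip("\n").split(field_sep):
        name, feat_type = entry.rsplit(":", 1)
        if feat_type not in FEAT_TYPES:
            raise ValueError(f"unknown feature type: {feat_type!r}")
        fields.append((name, feat_type))
    return fields


def parse_row(row_line, fields, field_sep="\t", seq_sep=" "):
    """Convert one data row into a dict, casting each value by its feature type."""
    values = row_line.rstrip("\n").split(field_sep)
    if len(values) != len(fields):
        raise ValueError("row width does not match header width")
    record = {}
    for (name, feat_type), raw in zip(fields, values):
        if feat_type == "token":
            record[name] = raw  # single discrete value, kept as a string
        elif feat_type == "token_seq":
            record[name] = raw.split(seq_sep) if raw else []
        elif feat_type == "float":
            record[name] = float(raw)
        else:  # float_seq
            record[name] = [float(x) for x in raw.split(seq_sep)] if raw else []
    return record


fields = parse_header("user_id:token\trating:float\treview:token_seq")
record = parse_row("1\t5\tgreat movie", fields)
```

Tokens stay as strings in this sketch; remapping them to integer IDs is a separate step that is left out here.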
Examples
We present example data rows from the formatted ML-1M dataset.
ml-1m.inter
user_id:token | item_id:token | rating:float | timestamp:float |
---|---|---|---|
1 | 1193 | 5 | 978300760 |
1 | 661 | 3 | 978302109 |
ml-1m.user
user_id:token | age:token | gender:token | occupation:token | zip_code:token |
---|---|---|---|---|
1 | 1 | F | 10 | 48067 |
2 | 56 | M | 16 | 70072 |
ml-1m.item
item_id:token | movie_title:token_seq | release_year:token | genre:token_seq |
---|---|---|---|
1 | Toy Story | 1995 | Animation Children’s Comedy |
2 | Jumanji | 1995 | Adventure Children’s Fantasy |
ml-1m.kg
head_id:token | relation_id:token | tail_id:token |
---|---|---|
m.0gs6m | film.film_genre.films_in_this_genre | m.01b195 |
m.052_dz | film.film.actor | m.02nrdp |
ml-1m.link
item_id:token | entity_id:token |
---|---|
2694 | m.02hxhz |
2079 | m.0kvcr9 |
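For readers converting their own data, the sketch below writes records into the layout shown in these examples. It is an illustration only: `write_atomic_file` is a hypothetical helper, and the tab field separator and space sequence separator are assumptions rather than something this page specifies.

```python
import io


def write_atomic_file(out, fields, records, field_sep="\t", seq_sep=" "):
    """Write a header line plus one line per record to the text stream `out`.

    `fields` is a list of (feat_name, feat_type) pairs; sequence-typed
    values are joined with `seq_sep`, everything else is written via str().
    """
    out.write(field_sep.join(f"{name}:{feat_type}" for name, feat_type in fields) + "\n")
    for rec in records:
        cells = []
        for name, feat_type in fields:
            value = rec[name]
            if feat_type.endswith("_seq"):
                cells.append(seq_sep.join(str(v) for v in value))
            else:
                cells.append(str(value))
        out.write(field_sep.join(cells) + "\n")


# The two ml-1m.inter rows from the example table above.
inter_fields = [
    ("user_id", "token"),
    ("item_id", "token"),
    ("rating", "float"),
    ("timestamp", "float"),
]
inter_records = [
    {"user_id": 1, "item_id": 1193, "rating": 5, "timestamp": 978300760},
    {"user_id": 1, "item_id": 661, "rating": 3, "timestamp": 978302109},
]

buf = io.StringIO()
write_atomic_file(buf, inter_fields, inter_records)
inter_text = buf.getvalue()
```

Passing an open file handle (for example one opened on a path ending in `.inter`) instead of `io.StringIO` would write the same content to disk.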
Additional Atomic Files
For users who want to load features from additional atomic files (e.g. pretrained entity embeddings), we provide a simple way, as follows.
First, prepare your additional atomic file (e.g. ml-1m.ent).
ent_id:token | ent_emb:float_seq |
---|---|
m.0gs6m | -115.08 13.60 113.69 |
m.01b195 | -130.97 263.05 -129.88 |
Secondly, update the args as:
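The args for this step did not survive on this page, so the following is only a sketch of what they might look like, written as a Python dict. The key `additional_feat_suffix` is our assumption about the config name and should be checked against the config reference; `load_col` is the arg described above, and the `inter` columns are placeholders taken from the ML-1M example.

```python
# Hypothetical config for loading ml-1m.ent alongside the usual files.
# "additional_feat_suffix" is an assumed key name; verify it before use.
config_dict = {
    "additional_feat_suffix": ["ent"],
    "load_col": {
        # Placeholder: load the interaction columns as usual.
        "inter": ["user_id", "item_id", "rating", "timestamp"],
        # Columns to read from the additional atomic file.
        "ent": ["ent_id", "ent_emb"],
    },
}
```

The same keys could equally be written in a YAML config file; the dict form is used here only to keep the example self-contained.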
Then, this additional atomic file will be loaded into the Dataset object. These new features can be used as follows.
Note that these features can be preprocessed in the same way as the other features.
For example, if you want to map the tokens of ent_id into the same space as entity_id, then update the args as:
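The args for this step are also missing from this page. As a hedged sketch, the remapping could be requested with an alias entry like the one below; the key name `alias_of_entity_id` is our assumption and should be confirmed against the config reference.

```python
# Hypothetical config: treat ent_id as an alias of entity_id so that both
# fields are remapped into the same ID space. The key name is an assumption.
alias_config = {
    "alias_of_entity_id": ["ent_id"],
}
```

In practice this entry would be merged into the same config that declares the additional atomic file.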