Atomic Files
Atomic files are introduced to format the input of mainstream recommendation tasks in a flexible way.
So far, our library introduces six atomic file types, and we identify different files by their suffixes.
| Suffix | Content | Example Format |
|---|---|---|
| .inter | User-item interaction | user_id, item_id, rating, timestamp, review |
| .user | User feature | user_id, age, gender |
| .item | Item feature | item_id, category |
| .kg | Triplets in a knowledge graph | head_entity, tail_entity, relation |
| .link | Item-entity linkage data | entity, item_id |
| .net | Social graph data | source, target |
Atomic files are combined to support the input of different recommendation tasks.
To load an atomic file, list its suffix under the config argument load_col.
For each recommendation task, we have to provide several mandatory files:
| Tasks | Mandatory atomic files |
|---|---|
| General | .inter |
| Context-aware | .inter, .user, .item |
| Knowledge-aware | .inter, .kg, .link |
| Sequential | .inter |
| Social | .inter, .net |
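As a rough illustration of the table above, the following sketch checks whether a load_col setting lists every mandatory suffix for a given task. The helper `missing_suffixes`, the `MANDATORY_SUFFIXES` dictionary, and the column lists are hypothetical names and values made up for this example; they are not part of the library.

```python
# Mandatory atomic-file suffixes per task, copied from the table above.
MANDATORY_SUFFIXES = {
    "General": {"inter"},
    "Context-aware": {"inter", "user", "item"},
    "Knowledge-aware": {"inter", "kg", "link"},
    "Sequential": {"inter"},
    "Social": {"inter", "net"},
}


def missing_suffixes(task, load_col):
    """Return the mandatory suffixes for `task` that `load_col` does not list."""
    return sorted(MANDATORY_SUFFIXES[task] - set(load_col))


# Hypothetical load_col value: suffix -> columns to load from that file.
load_col = {
    "inter": ["user_id", "item_id", "rating", "timestamp"],
    "kg": ["head_id", "relation_id", "tail_id"],
}

print(missing_suffixes("General", load_col))          # []
print(missing_suffixes("Knowledge-aware", load_col))  # ['link']
```

A check like this can be run before training to catch a missing file early, rather than partway through dataset loading.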
Format
Each atomic file can be viewed as an m × n table, where n is the number of features and m - 1 is the number of data records (the first line is the header).
The first row holds the feature names. Each entry has the form feat_name:feat_type, giving the feature name and its feature type.
Four feature types are supported, each of which can be processed in batches as tensors.
| feat_type | Explanations | Examples |
|---|---|---|
| token | single discrete feature | user_id, age |
| token_seq | discrete features sequence | review |
| float | single continuous feature | rating, timestamp |
| float_seq | continuous feature sequence | vector |
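The header convention and the four feature types can be sketched in plain Python as follows. This is only an illustration, not the library's own parser: the functions `parse_header` and `convert` are hypothetical, the sample values are invented, and the sketch assumes that the elements of a sequence feature are separated by single spaces.

```python
def parse_header(header_cells):
    """Split each 'feat_name:feat_type' cell into a (name, type) pair."""
    fields = []
    for cell in header_cells:
        name, feat_type = cell.strip().rsplit(":", 1)
        fields.append((name, feat_type))
    return fields


def convert(value, feat_type, seq_sep=" "):
    """Convert one raw cell to a Python value according to its feature type."""
    if feat_type == "token":
        return value                      # discrete value, kept as a string
    if feat_type == "float":
        return float(value)
    if feat_type == "token_seq":
        return value.split(seq_sep)       # assumed space-separated tokens
    if feat_type == "float_seq":
        return [float(x) for x in value.split(seq_sep)]
    raise ValueError(f"unknown feature type: {feat_type}")


# Hypothetical header and data row using the feature names from the table.
header = ["user_id:token", "review:token_seq", "rating:float", "vector:float_seq"]
row = ["1", "great movie", "5", "0.1 0.2"]

fields = parse_header(header)
record = {name: convert(cell, ftype) for (name, ftype), cell in zip(fields, row)}
print(record)
# {'user_id': '1', 'review': ['great', 'movie'], 'rating': 5.0, 'vector': [0.1, 0.2]}
```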
Examples
The tables below show two example data rows from each atomic file of the formatted ML-1M dataset.
ml-1m.inter
| user_id:token | item_id:token | rating:float | timestamp:float |
|---|---|---|---|
| 1 | 1193 | 5 | 978300760 |
| 1 | 661 | 3 | 978302109 |
ml-1m.user
| user_id:token | age:token | gender:token | occupation:token | zip_code:token |
|---|---|---|---|---|
| 1 | 1 | F | 10 | 48067 |
| 2 | 56 | M | 16 | 70072 |
ml-1m.item
| item_id:token | movie_title:token_seq | release_year:token | genre:token_seq |
|---|---|---|---|
| 1 | Toy Story | 1995 | Animation Children’s Comedy |
| 2 | Jumanji | 1995 | Adventure Children’s Fantasy |
ml-1m.kg
| head_id:token | relation_id:token | tail_id:token |
|---|---|---|
| m.0gs6m | film.film_genre.films_in_this_genre | m.01b195 |
| m.052_dz | film.film.actor | m.02nrdp |
ml-1m.link
| item_id:token | entity_id:token |
|---|---|
| 2694 | m.02hxhz |
| 2079 | m.0kvcr9 |
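A file such as ml-1m.inter could be produced from in-memory records roughly as shown below. This is a minimal sketch under one stated assumption: columns are separated by tabs. The field separator is not specified in this section, so adjust `delimiter` to match your own configuration. The output goes to a temporary directory purely for demonstration.

```python
import csv
import os
import tempfile

# Assumption: tab-separated columns; change `delimiter` if your setup differs.
header = ["user_id:token", "item_id:token", "rating:float", "timestamp:float"]
rows = [
    ["1", "1193", "5", "978300760"],
    ["1", "661", "3", "978302109"],
]

out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, "ml-1m.inter")
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t", lineterminator="\n")
    writer.writerow(header)   # first line: feat_name:feat_type entries
    writer.writerows(rows)    # remaining lines: one data record each

# Read the file back to confirm its shape.
with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(len(lines))  # 3: one header line plus two records
```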
Additional Atomic Files
To load features from additional atomic files (e.g. pretrained entity embeddings), proceed as follows.
First, prepare the additional atomic file (e.g. ml-1m.ent).
| ent_id:token | ent_emb:float_seq |
|---|---|
| m.0gs6m | -115.08 13.60 113.69 |
| m.01b195 | -130.97 263.05 -129.88 |
Secondly, update the args as:
additional_feat_suffix: [ent]
load_col:
    # inter/user/item/...: As usual
    ent: [ent_id, ent_emb]
This additional atomic file is then loaded into the Dataset object, and the new features can be accessed as follows.
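from recbole.data import create_dataset
# config: the configuration object built from the args above
dataset = create_dataset(config)
print(dataset.ent_feat)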
Note that these features can be preprocessed in the same way as the other features.
For example, to map the tokens of ent_id into the same space as entity_id, update the args as:
additional_feat_suffix: [ent]
load_col:
    # inter/user/item/...: As usual
    ent: [ent_id, ent_emb]
alias_of_entity_id: [ent_id]
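Conceptually, mapping two token fields into the same space means that an identical token receives the same integer index in both fields. The sketch below illustrates that idea only. It is not the library's internal remapping, which may differ in details such as reserved indices, and the function `build_shared_vocab` and the sample tokens are hypothetical.

```python
def build_shared_vocab(*token_columns):
    """Assign one integer index per distinct token across all given columns."""
    vocab = {}
    for column in token_columns:
        for token in column:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


# Hypothetical token values for the two fields.
entity_id = ["m.02hxhz", "m.0gs6m"]
ent_id = ["m.0gs6m", "m.01b195"]

vocab = build_shared_vocab(entity_id, ent_id)
print([vocab[t] for t in entity_id])  # [0, 1]
print([vocab[t] for t in ent_id])     # [1, 2]
```

Because "m.0gs6m" gets index 1 in both fields, an embedding row loaded for that ent_id can be looked up with the index of the matching entity_id.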