Atomic Files¶

Atomic files are introduced to format the input of mainstream recommendation tasks in a flexible way.

So far, our library introduces six atomic file types, and we identify different files by their suffixes.

Suffix	Content	Example Format
.inter	User-item interaction	user_id, item_id, rating, timestamp, review
.user	User feature	user_id, age, gender
.item	Item feature	item_id, category
.kg	Triplets in a knowledge graph	head_entity, tail_entity, relation
.link	Item-entity linkage data	entity, item_id
.net	Social graph data	source, target

Atomic files are combined to support the input of different recommendation tasks.

One can write the suffixes into the config arg load_col to load the corresponding atomic files.

For each recommendation task, we have to provide several mandatory files:

Format¶

Each atomic file can be viewed as a m x n table, where n is the number of features and m-1 is the number of data records(one line for header).

The first row corresponds to feature names, in which each entry has the form of feat_name:feat_type，indicating the feature name and feature type.

We support four feature types, which can be processed by tensors in batch.

feat_type	Explanations	Examples
token	single discrete feature	user_id, age
token_seq	discrete features sequence	review
float	single continuous feature	rating, timestamp
float_seq	continuous feature sequence	vector

We present three example data rows in the formatted ML-1M dataset.

ml-1m.inter

user_id:token	item_id:token	rating:float	timestamp:float
1	1193	5	978300760
1	661	3	978302109

ml-1m.user

user_id:token	age:token	gender:token	occupation:token	zip_code:token
1	1	F	10	48067
2	56	M	16	70072

ml-1m.item

item_id:token	movie_title:token_seq	release_year:token	genre:token_seq
1	Toy Story	1995	Animation Children’s Comedy
2	Jumanji	1995	Adventure Children’s Fantasy

ml-1m.kg

head_id:token	relation_id:token	tail_id:token
m.0gs6m	film.film_genre.films_in_this_genre	m.01b195
m.052_dz	film.film.actor	m.02nrdp

ml-1m.link

item_id:token	entity_id:token
2694	m.02hxhz
2079	m.0kvcr9

For users who want to load features from additional atomic files (e.g. pretrained entity embeddings), we provide a simple way as following.

Firstly, prepare your additional atomic file (e.g. ml-1m.ent).

ent_id:token	ent_emb:float_seq
m.0gs6m	-115.08 13.60 113.69
m.01b195	-130.97 263.05 -129.88

Secondly, update the args as:

additional_feat_suffix: [ent]
load_col:
    # inter/user/item/...: As usual
    ent: [ent_id, ent_emb]

Then, this additional atomic file will be loaded into the Dataset object. These new features can be used as following.

dataset = create_dataset(config)
print(dataset.ent_feat)

Note that these features can be preprocessed by the same way as the other features.

For example, if you want to map the tokens of ent_id into the same space of entity_id, then update the args as:

additional_feat_suffix: [ent]
load_col:
    # inter/user/item/...: As usual
    ent: [ent_id, ent_emb]

fields_in_same_space: [[ent_id, entity_id]]