Dataset List

To convert raw data files into our atomic files, we have collected 43 datasets and released the scripts that format them, detailed in RecSysDatasets. We have also uploaded the processed atomic files to network disks, available via Google Drive and Baidu Wangpan (Password: e272).
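As a rough illustration of what such a formatting script does, the sketch below converts a small comma-separated interaction table into a tab-separated file with a typed header. The column names, the `name:type` header convention, and the `to_atomic` helper are assumptions made for this example; the released scripts in RecSysDatasets remain the authoritative reference.

```python
import csv
import io

# Hypothetical column-to-type mapping; the "name:type" header style is an
# assumption for this sketch, not taken from the text above.
FIELD_TYPES = {
    "user_id": "token",
    "item_id": "token",
    "rating": "float",
    "timestamp": "float",
}


def to_atomic(raw_csv_text: str) -> str:
    """Convert comma-separated raw interactions into a tab-separated table
    whose header carries a type suffix for every column."""
    reader = csv.DictReader(io.StringIO(raw_csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # Header row, e.g. "user_id:token<TAB>item_id:token<TAB>..."
    writer.writerow([f"{name}:{ftype}" for name, ftype in FIELD_TYPES.items()])
    # Data rows, written in the same column order as the header.
    for row in reader:
        writer.writerow([row[name] for name in FIELD_TYPES])
    return out.getvalue()


# Toy input standing in for a raw data file.
raw = (
    "user_id,item_id,rating,timestamp\n"
    "u1,i9,4.0,1262304000\n"
    "u2,i3,5.0,1262390400\n"
)
atomic = to_atomic(raw)
print(atomic)
```

Real raw files differ widely in layout (JSON lines, multiple tables, missing fields), so each dataset generally needs its own parsing step before a write-out like this one.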

A brief introduction to these datasets follows:


  • Amazon: Amazon Review Data includes reviews (ratings, text, helpfulness votes) and product metadata (descriptions, category information, price, brand, and image features). It is available in an earlier 2014 version and an updated 2018 version. Our processed datasets are detailed here.
    • Amazon 2014: This dataset contains product reviews and metadata from Amazon, including 24 categories and 142.8 million reviews spanning May 1996 - July 2014.
    • Amazon 2018: This dataset is an updated version of the Amazon review dataset released in 2014. It contains 233.1 million reviews across 29 categories (compared with 142.8 million reviews and 24 categories in 2014), spanning May 1996 - Oct 2018.
  • Amazon_M2: This dataset is a collection of anonymized customer sessions containing products from six different locales: English, German, Japanese, French, Italian, and Spanish.
  • Alibaba-iFashion: This is a fashion outfit dataset collected from Alibaba's online shopping systems and introduced in the POG paper. The items in each outfit are treated as the items recommended to users, and each item has attributes such as category and title.
  • Epinions: This dataset was collected from Epinions, a popular online consumer review website.
  • Yelp: This dataset was collected from Yelp. It is a subset of Yelp's businesses, reviews, and user data, provided for personal, educational, and academic use. Starting from Yelp Challenge 2018 (the original link to this competition can no longer be found, and there will not be another round of the Yelp Dataset Challenge), there are four versions of the Yelp dataset in total. Yelp has also posted the dataset on Kaggle, where a few earlier versions can be downloaded. Our 5 processed datasets are detailed here.
    • Yelp 2018: It is the first version of the Yelp dataset, released in Yelp Challenge 2018, including 5,261,669 reviews.
    • Yelp 2020: It is the second version of the Yelp dataset, released in 2020, including 8,021,122 reviews.
    • Yelp 2021: It is the third version of the Yelp dataset, released in 2021, including 8,635,403 reviews.
    • Yelp 2022: It is the latest version of the Yelp dataset. It contains 908,915 tips by 1,987,897 users; over 1.2 million business attributes such as hours, parking, availability, and ambience; and aggregated check-ins over time for each of the 131,930 businesses.
    • Yelp-full: This dataset combines the four versions of the Yelp dataset mentioned above, with duplicates dropped. The total number of reviews is 28,908,240.
  • Tmall: This dataset is provided by Ant Financial Services and was used in the IJCAI16 contest.
  • DIGINETICA: The dataset includes user sessions extracted from e-commerce search engine logs, with anonymized user IDs, hashed queries, hashed query terms, hashed product descriptions and meta-data, log-scaled prices, clicks, and purchases.
  • YOOCHOOSE: This dataset was constructed by YOOCHOOSE GmbH to support participants in the RecSys Challenge 2015.
  • Retailrocket: The data was collected from a real-world e-commerce website. It is raw data, i.e. without any content preprocessing; however, all values are hashed for confidentiality reasons.
  • Ta Feng: The dataset contains transaction data from a Chinese grocery store, covering November 2000 to February 2001.
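The Yelp-full entry above describes combining several dataset versions and dropping duplicates. The text does not say how duplicates are identified, so the sketch below assumes a hypothetical unique `review_id` field and keeps the first occurrence of each review; it illustrates the idea and is not the actual processing script.

```python
def merge_versions(*versions):
    """Concatenate review lists and keep the first occurrence of each review.

    Assumes every review is a dict with a unique "review_id" key
    (a hypothetical field name chosen for this example).
    """
    seen = set()
    merged = []
    for reviews in versions:
        for review in reviews:
            key = review["review_id"]
            if key not in seen:
                seen.add(key)
                merged.append(review)
    return merged


# Toy stand-ins for two dataset versions that share one review (r2).
v2018 = [{"review_id": "r1", "stars": 4}, {"review_id": "r2", "stars": 5}]
v2020 = [{"review_id": "r2", "stars": 5}, {"review_id": "r3", "stars": 2}]

full = merge_versions(v2018, v2020)
print(len(full))  # 3
```

Keeping the first occurrence preserves the oldest copy of a review; a different policy (for example, preferring the newest version) would only change the iteration order.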


  • Criteo: This dataset was collected from Criteo and consists of a portion of Criteo's traffic over a period of several days.
  • Avazu: This dataset was used in the Avazu CTR prediction contest.
  • iPinYou: This dataset was provided by iPinYou and contains all training datasets and leaderboard testing datasets from the three seasons of the iPinYou Global RTB (Real-Time Bidding) Bidding Algorithm Competition.
  • AliEC: Ali_Display_Ad_Click is a click-through rate prediction dataset for display ads shown on the Taobao website. The dataset is provided by Alibaba.


  • Foursquare: This dataset contains check-ins in NYC and Tokyo collected for about 10 months. Each check-in is associated with its time stamp, its GPS coordinates and its semantic meaning.
  • Gowalla: This dataset is from a location-based social networking website where users share their locations by checking in. It contains a total of 6,442,890 check-ins by these users over the period Feb. 2009 - Oct. 2010.


  • MovieLens: GroupLens Research has collected and made available rating datasets from their movie web site.
  • Netflix: This is the official data set used in the Netflix Prize competition.
  • Douban: Douban Movie is a Chinese website that allows Internet users to share their comments and viewpoints about movies. This dataset contains more than 2 million short comments on 28 movies from the Douban Movie website.
  • Twitch: This is a dataset of users consuming streaming content on Twitch. We retrieved all streamers, and all users connected in their respective chats, every 10 minutes over 43 days.
    • Twitch-100k: Twitch-100k is a subset of 100k users for benchmark purposes. The code is available in this GitHub repository.
    • Twitch-full: See the Google Drive folder containing all Twitch files. Twitch-full contains the full dataset while Twitch-100k is a subset.


  • Last.FM: This dataset contains social networking, tagging, and music artist listening information from a set of 2K users of the Last.fm online music system.
  • LFM-1b: This dataset contains more than one billion music listening events created by more than 120,000 users of Last.fm. Each listening event is characterized by artist, album, and track name, and includes a timestamp.
  • Yahoo Music: This dataset represents a snapshot of the Yahoo! Music community's preferences for various musical artists.
  • KGRec: Music and Sound Recommendation with Knowledge Graphs comprises two datasets, one for music recommendation (KGRec-music) and one for sound recommendation (KGRec-sound). Each provides users, items, implicit feedback interactions between users and items, item tags, and item text descriptions.
    • KGRec-music: All the data comes from the songfacts.com and last.fm websites. Items are songs, which are described in terms of textual descriptions extracted from songfacts.com and tags from last.fm.
    • KGRec-sound: All the data comes from freesound.org. Items are sounds, which are described in terms of textual descriptions and tags created by the sound creator at uploading time.
  • Music4All-Onion: The dataset expands the Music4All dataset by including 26 additional audio, video, and metadata characteristics for 109,269 music pieces.


  • Book-Crossing: This dataset was collected by Cai-Nicolas Ziegler in a 4-week crawl (August / September 2004) from the Book-Crossing community with kind permission from Ron Hornbaker, CTO of Humankind Systems. It contains 278,858 users (anonymized but with demographic information) providing 1,149,780 ratings (explicit / implicit) about 271,379 books.
  • GoodReads: This dataset contains reviews from the Goodreads book review website, along with a variety of attributes describing the items. Critically, the datasets have multiple levels of user interaction, ranging from adding to a shelf to rating and reading.


  • Steam: This dataset contains reviews and game information from Steam, comprising 7,793,069 reviews, 2,567,538 users, and 32,135 games. In addition to the review text, the data also includes the users' play hours in each review.


  • Anime: This dataset contains user preference data for anime. Each user can add anime to their completed list and give it a rating, and this dataset is a compilation of those ratings.


  • Pinterest: This dataset was originally constructed in the paper "Learning image and user features for recommendations in social networks" for evaluating content-based image recommendation, and was processed in the paper "Neural Collaborative Filtering".


  • Jester: This dataset contains anonymous ratings of jokes by users of the Jester Joke Recommender System.


  • KDD2010: This dataset was released in the KDD Cup 2010 Educational Data Mining Challenge and contains records of students submitting exercises on the systems.
  • EndoMondo: This is a collection of workout logs from users of EndoMondo. Data includes multiple sources of sequential sensor data such as heart rate logs, speed, GPS, as well as sport type, gender and weather conditions.


  • Phishing Websites: This dataset contains 30 features of 11,055 websites and labels of whether they are phishing websites or not. The websites' features include 12 address-bar based features, 6 abnormal based features, 5 HTML-and-JavaScript based features and 7 domain based features.
  • Behance: This is a small, anonymized version of a larger proprietary dataset about likes and image data from the community art website Behance.


  • Adult: This dataset is extracted by Barry Becker from the 1994 Census database, which consists of a list of people's attributes and whether they make over 50k a year.


  • MIND: This is a large-scale dataset for news recommendation research. It was collected from anonymized behavior logs of the Microsoft News website. MIND contains about 160k English news articles and more than 15 million impression logs generated by 1 million users.


  • BeerAdvocate: This dataset includes beer reviews with multiple rated dimensions, covering sensory aspects such as taste, look, feel, and smell.
  • RateBeer: This dataset contains beer reviews with multiple rated dimensions, including item attributes with sensory aspects such as taste, look, feel, and smell.