Microsoft News Recommendation

Microsoft News Dataset (MIND) is a large-scale dataset for news recommendation research. It was collected from anonymized behavior logs of the Microsoft News website. MIND is intended to serve as a benchmark dataset for news recommendation and to facilitate research on news recommendation and recommender systems.

MIND contains about 160k English news articles and more than 15 million impression logs generated by 1 million users. Every news article contains rich textual content including title, abstract, body, category, and entities. Each impression log contains the click events, non-click events, and historical news click behaviors of the user before that impression. To protect user privacy, each user was de-linked from the production system and securely hashed into an anonymized ID. For more detailed information about the MIND dataset, refer to the paper MIND: A Large-scale Dataset for News Recommendation.

Volume

The training and validation data are each provided as a zip-compressed folder containing four files:

FILE NAME DESCRIPTION
behaviors.tsv The click histories and impression logs of users
news.tsv The information of news articles
entity_embedding.vec The embeddings of entities in news extracted from knowledge graph
relation_embedding.vec The embeddings of relations between entities extracted from knowledge graph
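
As a minimal sketch of how one might confirm that a downloaded archive contains the four files listed above, the snippet below compares the archive's member names against the expected set. The helper name `missing_files` is hypothetical, and the demo archive is built in memory purely for illustration; it is not real MIND data.

```python
import io
import zipfile

# File names taken from the table above.
EXPECTED_FILES = {
    "behaviors.tsv",
    "news.tsv",
    "entity_embedding.vec",
    "relation_embedding.vec",
}

def missing_files(zip_file):
    """Return the expected file names that are absent from the archive (hypothetical helper)."""
    with zipfile.ZipFile(zip_file) as archive:
        # Compare base names only, in case members sit under a subfolder.
        present = {name.rsplit("/", 1)[-1] for name in archive.namelist()}
    return sorted(EXPECTED_FILES - present)

# Illustrative in-memory archive that deliberately lacks the two .vec files.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("behaviors.tsv", "")
    archive.writestr("news.tsv", "")
buffer.seek(0)
demo_missing = missing_files(buffer)
print(demo_missing)
```

For a real download, the same function can be called with the path of the zip file on disk instead of the in-memory buffer.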

behaviors.tsv

The behaviors.tsv file contains the impression logs and users’ news click histories. It has five columns divided by the tab symbol:

  • Impression ID. The ID of an impression.
  • User ID. The anonymous ID of a user.
  • Time. The impression time with format “MM/DD/YYYY HH:MM:SS AM/PM”.
  • History. The news click history (ID list of clicked news) of this user before this impression.
  • Impressions. List of news displayed in this impression and user’s click behaviors on them (1 for click and 0 for non-click).

An example is shown in the table below:

COLUMN CONTENT
Impression ID 123
User ID U131
Time 11/13/2019 8:36:57 AM
History N11 N21 N103
Impressions N4-1 N34-1 N156-0 N207-0 N198-0
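
The example row above can be turned into typed values with a few lines of standard-library Python. This is a sketch under stated assumptions: the function name `parse_behavior_row` is hypothetical, an empty History field is assumed to mean the user has no earlier clicks, and the time is parsed with the `strptime` pattern that matches the documented "MM/DD/YYYY HH:MM:SS AM/PM" format.

```python
from datetime import datetime

def parse_behavior_row(line):
    """Split one tab-separated behaviors.tsv line into typed fields (hypothetical helper)."""
    impression_id, user_id, time_str, history, impressions = line.rstrip("\n").split("\t")
    # Assumption: an empty History field means no earlier clicks.
    clicked_before = history.split() if history else []
    shown = []
    for item in impressions.split():
        # Each item looks like "N4-1": news ID, a hyphen, then 1 (click) or 0 (non-click).
        news_id, label = item.rsplit("-", 1)
        shown.append((news_id, int(label)))
    return {
        "impression_id": int(impression_id),
        "user_id": user_id,
        "time": datetime.strptime(time_str, "%m/%d/%Y %I:%M:%S %p"),
        "history": clicked_before,
        "impressions": shown,
    }

# The example row from the table above, joined with tabs.
example = "123\tU131\t11/13/2019 8:36:57 AM\tN11 N21 N103\tN4-1 N34-1 N156-0 N207-0 N198-0"
parsed = parse_behavior_row(example)
clicked_now = [news_id for news_id, label in parsed["impressions"] if label == 1]
print(parsed["time"], clicked_now)
```

Keeping the impressions as (news ID, label) pairs makes it straightforward to separate clicked from non-clicked candidates when building training samples.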

news.tsv

The news.tsv file contains the detailed information of news articles involved in the behaviors.tsv file. It has eight columns, which are divided by the tab symbol:

  • News ID
  • Category
  • Subcategory
  • Title
  • Abstract
  • URL
  • Title Entities (entities contained in the title of this news)
  • Abstract Entities (entities contained in the abstract of this news)

The full bodies of MSN news articles are not available for download because of licensing restrictions. However, for your convenience, we have provided a utility script to help parse news webpages from the MSN URLs in the dataset. Some URLs have expired over time and can no longer be accessed. We are currently working to resolve this problem.

An example is shown in the following table:

COLUMN CONTENT
News ID N37378
Category sports
Subcategory golf
Title PGA Tour winners
Abstract A gallery of recent winners on the PGA Tour.
URL https://www.msn.com/en-us/sports/golf/pga-tour-winners/ss-AAjnQjj?ocid=chopendata
Title Entities [{"Label": "PGA Tour", "Type": "O", "WikidataId": "Q910409", "Confidence": 1.0, "OccurrenceOffsets": [0], "SurfaceForms": ["PGA Tour"]}]
Abstract Entities [{"Label": "PGA Tour", "Type": "O", "WikidataId": "Q910409", "Confidence": 1.0, "OccurrenceOffsets": [35], "SurfaceForms": ["PGA Tour"]}]

The dictionary keys in the Title Entities and Abstract Entities columns are described as follows:

KEYS DESCRIPTION
Label The entity name in the Wikidata knowledge graph
Type The type of this entity in Wikidata
WikidataId The entity ID in Wikidata
Confidence The confidence of entity linking
OccurrenceOffsets The character-level entity offset in the text of title or abstract
SurfaceForms The raw entity names in the original text
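
Because the entity columns hold JSON arrays, they can be decoded with the standard `json` module. The sketch below uses the Abstract Entities value from the example row and checks that the character offset points at the surface form in the abstract. The helper name `parse_entities` is hypothetical. Treating empty or non-string cells as "no entities" is an assumption, as is pairing the first offset with the first surface form.

```python
import json

def parse_entities(raw):
    """Decode one Title Entities / Abstract Entities cell into a list of dicts (hypothetical helper)."""
    # Assumption: an empty or non-string cell (for example NaN from pandas) means no entities.
    if not isinstance(raw, str) or not raw.strip():
        return []
    return json.loads(raw)

# Values copied from the example row above.
abstract = "A gallery of recent winners on the PGA Tour."
abstract_entities = (
    '[{"Label": "PGA Tour", "Type": "O", "WikidataId": "Q910409", '
    '"Confidence": 1.0, "OccurrenceOffsets": [35], '
    '"SurfaceForms": ["PGA Tour"]}]'
)

entities = parse_entities(abstract_entities)
first = entities[0]
# Assumption: the first offset corresponds to the first surface form.
offset = first["OccurrenceOffsets"][0]
surface = first["SurfaceForms"][0]
# Slice the abstract at the character-level offset to recover the mention.
mention = abstract[offset:offset + len(surface)]
wikidata_ids = [entity["WikidataId"] for entity in entities]
print(mention, wikidata_ids)
```

The extracted `WikidataId` values are the keys used to look up vectors in entity_embedding.vec.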

entity_embedding.vec & relation_embedding.vec

The entity_embedding.vec and relation_embedding.vec files contain the 100-dimensional embeddings of the entities and relations learned from the subgraph (from the Wikidata knowledge graph) using the TransE method. In both files, the first column is the ID of the entity or relation, and the remaining columns are the embedding vector values. We hope this data can facilitate research on knowledge-aware news recommendation. An example is shown below:

ID EMBEDDING VALUES
Q42306013 0.014516 -0.106958 0.024590 … -0.080382

For reasons related to how the embeddings were learned from the subgraph, a few entities may not have embeddings in the entity_embedding.vec file.
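
Since some entities have no embedding, lookups need a fallback. The following is a minimal sketch, not a prescribed method: it parses whitespace-separated lines into an ID-to-vector dictionary and returns a zero vector for unknown IDs. The function names, the zero-vector fallback, and the toy 3-dimensional rows with made-up IDs are all assumptions for illustration; the real files use 100 dimensions.

```python
def parse_vec_line(line):
    """Split one .vec line into (ID, list of floats). Hypothetical helper."""
    parts = line.split()  # split() also drops any trailing tab or newline
    return parts[0], [float(value) for value in parts[1:]]

def load_embeddings(lines):
    """Build an ID -> vector dictionary from an iterable of lines."""
    table = {}
    for line in lines:
        if line.strip():  # skip blank lines
            key, vector = parse_vec_line(line)
            table[key] = vector
    return table

def lookup(table, entity_id, dim):
    """Return the stored vector, or a zero vector when the entity has no embedding."""
    # Assumption: a zero vector is an acceptable stand-in for a missing embedding.
    return table.get(entity_id, [0.0] * dim)

# Toy rows with made-up IDs and 3 dimensions (the real files have 100).
toy_lines = [
    "Q_TOY_1\t0.5\t-0.25\t1.0\t\n",
    "Q_TOY_2\t0.0\t2.0\t-1.5\t\n",
]
embeddings = load_embeddings(toy_lines)
known = lookup(embeddings, "Q_TOY_1", dim=3)
unknown = lookup(embeddings, "Q_TOY_MISSING", dim=3)
print(known, unknown)
```

Other fallbacks, such as a randomly initialized or learned "unknown" vector, are equally plausible; the zero vector is used here only because it is the simplest to show.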

Storage location

The data are stored in blobs in the West/East US data center, in the following blob container: 'https://mind201910small.blob.core.windows.net/release/'.

Within the container, the training and validation sets are compressed into MINDlarge_train.zip and MINDlarge_dev.zip, respectively.

Additional information

The MIND dataset is free to download for research purposes under the Microsoft Research License Terms. Contact mind@microsoft.com if you have any questions about the dataset.

Data access

Azure Notebooks

Demo notebook for accessing MIND data on Azure

This notebook provides an example of accessing MIND data from blob storage on Azure.

MIND data are stored in the West/East US data center, so this notebook will run more efficiently on Azure compute located in West/East US.

Imports and environment

import os
import tempfile
import shutil
import urllib.request
import zipfile
import pandas as pd

# Temporary folder for data we need during execution of this notebook (we'll clean up
# at the end, we promise)
temp_dir = os.path.join(tempfile.gettempdir(), 'mind')
os.makedirs(temp_dir, exist_ok=True)

# The dataset is split into training and validation sets, each with a large and a small version.
# The format of the four files is the same.
# For demonstration purposes, we use only the small version of the validation set.
base_url = 'https://mind201910small.blob.core.windows.net/release'
training_small_url = f'{base_url}/MINDsmall_train.zip'
validation_small_url = f'{base_url}/MINDsmall_dev.zip'
training_large_url = f'{base_url}/MINDlarge_train.zip'
validation_large_url = f'{base_url}/MINDlarge_dev.zip'

Functions

def download_url(url,
                 destination_filename=None,
                 progress_updater=None,
                 force_download=False,
                 verbose=True):
    """
    Download a URL to a temporary file
    """
    if not verbose:
        progress_updater = None
    # This is not intended to guarantee uniqueness, we just know it happens to guarantee
    # uniqueness for this application.
    if destination_filename is None:
        url_as_filename = url.replace('://', '_').replace('/', '_')
        destination_filename = \
            os.path.join(temp_dir,url_as_filename)
    if (not force_download) and (os.path.isfile(destination_filename)):
        if verbose:
            print('Bypassing download of already-downloaded file {}'.format(
                os.path.basename(url)))
        return destination_filename
    if verbose:
        print('Downloading file {} to {}'.format(os.path.basename(url),
                                                 destination_filename),
              end='')
    urllib.request.urlretrieve(url, destination_filename, progress_updater)
    assert (os.path.isfile(destination_filename))
    nBytes = os.path.getsize(destination_filename)
    if verbose:
        print('...done, {} bytes.'.format(nBytes))
    return destination_filename

Download and extract the files

# For demonstration purposes, we use only the small version of the validation set.
# This file is about 30MB.
zip_path = download_url(validation_small_url, verbose=True)
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(temp_dir)

os.listdir(temp_dir)

Read the files with pandas

# The behaviors.tsv file contains the impression logs and users' news click histories. 
# It has 5 columns divided by the tab symbol:
# - Impression ID. The ID of an impression.
# - User ID. The anonymous ID of a user.
# - Time. The impression time with format "MM/DD/YYYY HH:MM:SS AM/PM".
# - History. The news click history (ID list of clicked news) of this user before this impression.
# - Impressions. List of news displayed in this impression and user's click behaviors on them (1 for click and 0 for non-click).
behaviors_path = os.path.join(temp_dir, 'behaviors.tsv')
pd.read_table(
    behaviors_path,
    header=None,
    names=['impression_id', 'user_id', 'time', 'history', 'impressions'])
# The news.tsv file contains the detailed information of news articles involved in the behaviors.tsv file.
# It has 8 columns, which are divided by the tab symbol:
# - News ID
# - Category
# - Subcategory
# - Title
# - Abstract
# - URL
# - Title Entities (entities contained in the title of this news)
# - Abstract Entities (entities contained in the abstract of this news)
news_path = os.path.join(temp_dir, 'news.tsv')
pd.read_table(news_path,
              header=None,
              names=[
                  'id', 'category', 'subcategory', 'title', 'abstract', 'url',
                  'title_entities', 'abstract_entities'
              ])
# The entity_embedding.vec file contains the 100-dimensional embeddings
# of the entities learned from the subgraph using the TransE method.
# The first column is the ID of entity, and the other columns are the embedding vector values.
entity_embedding_path = os.path.join(temp_dir, 'entity_embedding.vec')
entity_embedding = pd.read_table(entity_embedding_path, header=None)
entity_embedding['vector'] = entity_embedding.iloc[:, 1:101].values.tolist()
entity_embedding = entity_embedding[[0,
                                     'vector']].rename(columns={0: "entity"})
entity_embedding
# The relation_embedding.vec file contains the 100-dimensional embeddings
# of the relations learned from the subgraph using the TransE method.
# The first column is the ID of relation, and the other columns are the embedding vector values.
relation_embedding_path = os.path.join(temp_dir, 'relation_embedding.vec')
relation_embedding = pd.read_table(relation_embedding_path, header=None)
relation_embedding['vector'] = relation_embedding.iloc[:,
                                                       1:101].values.tolist()
relation_embedding = relation_embedding[[0, 'vector'
                                         ]].rename(columns={0: "relation"})
relation_embedding

Clean up temporary files

shutil.rmtree(temp_dir)

Next steps

Check out several baseline news recommendation models developed on MIND in the Microsoft Recommenders repository.

View the rest of the datasets in the Open Datasets catalog.