COVID-19 Open Research Dataset

Article
10/16/2024

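A dataset of full-text scholarly articles and metadata about COVID-19 and related coronaviruses, optimized for machine readability and made available to the global research community.

In response to the COVID-19 pandemic, the Allen Institute for AI has partnered with leading research groups to prepare and distribute the COVID-19 Open Research Dataset (CORD-19). This dataset is a free resource of over 47,000 scholarly articles, including over 36,000 with full text, about COVID-19 and the coronavirus family of viruses, for use by the global research community.

This dataset mobilizes researchers to apply recent advances in natural language processing to generate new insights in support of the fight against this infectious disease.

The corpus may be updated as new research is published in peer-reviewed publications and in archival services such as bioRxiv and medRxiv.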

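Note

Microsoft provides Azure Open Datasets on an "as is" basis. Microsoft makes no warranties, express or implied, guarantees, or conditions with respect to your use of the datasets. To the extent permitted under your local law, Microsoft disclaims all liability for any damages or losses, including direct, consequential, special, indirect, incidental, or punitive, resulting from your use of the datasets.

This dataset is provided under the original terms under which Microsoft received the source data. The dataset may include data sourced from Microsoft.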

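License terms

This dataset is made available by the Allen Institute for AI and Semantic Scholar. By accessing, downloading, or otherwise using any content provided in the CORD-19 dataset, you agree to the dataset license related to the use of this dataset. Specific licensing information for individual articles in the dataset is available in the metadata file. More licensing information is available on the PMC website, the medRxiv website, and the bioRxiv website.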

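Volume and retention

This dataset is stored in JSON format, and the latest release contains over 36,000 full-text articles. Each paper is represented as a single JSON object. View the schema.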

Storage location

This dataset is stored in the East US Azure region. Locating compute resources in East US is recommended for affinity.

Citation

When including CORD-19 data in a publication or redistribution, cite the dataset as follows:

In bibliography:

COVID-19 Open Research Dataset (CORD-19). 2020. Version YYYY-MM-DD. Retrieved from COVID-19 Open Research Dataset (CORD-19). Accessed YYYY-MM-DD. doi:10.5281/zenodo.3715505

In text: (CORD-19, 2020)
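The bibliography template above leaves two YYYY-MM-DD placeholders to fill in: the dataset version and the access date. A minimal sketch of a helper that fills them follows; the function name `cord19_citation`, its parameters, and the example dates are illustrative assumptions, not part of the dataset or of any library.

```python
from datetime import date


def cord19_citation(version: date, accessed: date) -> str:
    """Fill the two YYYY-MM-DD placeholders of the bibliography template."""
    template = (
        "COVID-19 Open Research Dataset (CORD-19). 2020. "
        "Version {0}. Retrieved from COVID-19 Open Research Dataset (CORD-19). "
        "Accessed {1}. doi:10.5281/zenodo.3715505"
    )
    # date.isoformat() yields the YYYY-MM-DD form the template asks for
    return template.format(version.isoformat(), accessed.isoformat())


# hypothetical dates, chosen only for illustration
print(cord19_citation(date(2020, 4, 10), date(2020, 4, 15)))
```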

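Contact

For any questions about this dataset, contact partnerships@allenai.org.

Data access

The CORD-19 Dataset

CORD-19 is a collection of over 50,000 scholarly articles, including over 40,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This dataset has been made freely available with the goal of helping research communities combat the COVID-19 pandemic.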

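The goal of this notebook is two-fold:

Demonstrate how to access the CORD-19 dataset on Azure: we connect to the Azure Blob storage account that houses the CORD-19 dataset.
Walk through the structure of the dataset: articles in the dataset are stored as JSON files. We provide examples showing:

How to find the articles (navigating the container)
How to read the articles (navigating the JSON schema)

Dependencies: This notebook requires the following libraries:

Azure storage (for example, pip install azure-storage-blob==2.1.0; the BlockBlobService class used below exists only in the legacy 2.x releases of this package)
NLTK (docs) (for example, pip install nltk; sent_tokenize also needs the punkt data, which nltk.download('punkt') fetches)
Pandas (for example, pip install pandas)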

Getting the CORD-19 data from Azure

The CORD-19 data has been uploaded as an Azure Open Dataset. We create a blob service linked to this CORD-19 open dataset.

from azure.storage.blob import BlockBlobService

# storage account details
azure_storage_account_name = "azureopendatastorage"
azure_storage_sas_token = "sv=2019-02-02&ss=bfqt&srt=sco&sp=rlcup&se=2025-04-14T00:21:16Z&st=2020-04-13T16:21:16Z&spr=https&sig=JgwLYbdGruHxRYTpr5dxfJqobKbhGap8WUtKFadcivQ%3D"

# create a blob service
blob_service = BlockBlobService(
    account_name=azure_storage_account_name,
    sas_token=azure_storage_sas_token,
)

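We can use this blob service as a handle on the data, and navigate the dataset with the BlockBlobService APIs. For more details, see:

Blob service concepts
Operations on containers

The CORD-19 data is stored in the covid19temp container. This is the file structure within the container, together with an example file for each directory.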
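A SAS token such as the one passed to the blob service is an ordinary URL query string, so its fields can be inspected with the standard library. The sketch below reads the validity window (`st`, `se`), the permissions (`sp`), and the allowed protocol (`spr`) from the token shown in the snippet; this can be a quick first check when a request is rejected. The variable names are illustrative.

```python
from urllib.parse import parse_qs

# the SAS token from the snippet above
sas_token = "sv=2019-02-02&ss=bfqt&srt=sco&sp=rlcup&se=2025-04-14T00:21:16Z&st=2020-04-13T16:21:16Z&spr=https&sig=JgwLYbdGruHxRYTpr5dxfJqobKbhGap8WUtKFadcivQ%3D"

# parse_qs returns a list per key; every key occurs exactly once in this token
fields = {key: values[0] for key, values in parse_qs(sas_token).items()}

print("valid from: ", fields["st"])   # signed start time
print("valid until:", fields["se"])   # signed expiry time
print("permissions:", fields["sp"])   # signed permissions
print("protocol:   ", fields["spr"])  # allowed protocol
```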

metadata.csv
custom_license/
    pdf_json/
        0001418189999fea7f7cbe3e82703d71c85a6fe5.json        # filename is sha-hash
        ...
    pmc_json/
        PMC1065028.xml.json                                  # filename is the PMC ID
        ...
noncomm_use_subset/
    pdf_json/
        0036b28fddf7e93da0970303672934ea2f9944e7.json
        ...
    pmc_json/
        PMC1616946.xml.json
        ...
comm_use_subset/
    pdf_json/
        000b7d1517ceebb34e1e3e817695b6de03e2fa78.json
        ...
    pmc_json/
        PMC1054884.xml.json
        ...
biorxiv_medrxiv/                                             # note: there is no pmc_json subdir
    pdf_json/
        0015023cc06b5362d332b3baf348d11567ca2fbb.json
        ...

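Each .json file corresponds to an individual article in the dataset. This is where the title, authors, abstract, and (where available) the full text are stored.

Using metadata.csv

The CORD-19 dataset comes with metadata.csv, a single file that records basic information on all the papers available in the CORD-19 dataset. This is a good place to start exploring!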
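The layout above encodes three pieces of information in every blob name: the license subset, the parse type, and the article identifier. As a minimal sketch, the hypothetical helper `describe_blob` below (not part of any SDK) splits a name into those parts, assuming the layout stays exactly as shown.

```python
from pathlib import PurePosixPath


def describe_blob(blob_name):
    """Split '<subset>/<parse_type>/<file>' into its three parts.

    Returns None for names that do not follow that layout (e.g. metadata.csv).
    """
    parts = PurePosixPath(blob_name).parts
    if len(parts) != 3:
        return None
    subset, parse_type, filename = parts
    # pdf_json files are named '<sha>.json', pmc_json files '<pmcid>.xml.json'
    suffix = ".xml.json" if parse_type == "pmc_json" else ".json"
    if not filename.endswith(suffix):
        return None
    return {"subset": subset, "parse_type": parse_type, "id": filename[: -len(suffix)]}


print(describe_blob("custom_license/pdf_json/0001418189999fea7f7cbe3e82703d71c85a6fe5.json"))
print(describe_blob("custom_license/pmc_json/PMC1065028.xml.json"))
print(describe_blob("metadata.csv"))
```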

# container housing CORD-19 data
container_name = "covid19temp"

# download metadata.csv
metadata_filename = 'metadata.csv'
blob_service.get_blob_to_path(
    container_name=container_name,
    blob_name=metadata_filename,
    file_path=metadata_filename
)

import pandas as pd

# read metadata.csv into a dataframe
metadata_filename = 'metadata.csv'
metadata = pd.read_csv(metadata_filename)

metadata.head(3)

That's a lot to take in at first glance, so let's apply a little polish.

simple_schema = ['cord_uid', 'source_x', 'title', 'abstract', 'authors', 'full_text_file', 'url']

def make_clickable(address):
    '''Make the url clickable'''
    return '<a href="{0}">{0}</a>'.format(address)

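def preview(text):
    '''Show only a preview of the text data.'''
    # pandas reads empty cells as float NaN, which cannot be sliced
    if not isinstance(text, str):
        return ''
    return text[:30] + '...'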

format_ = {'title': preview, 'abstract': preview, 'authors': preview, 'url': make_clickable}

metadata[simple_schema].head().style.format(format_)

# let's take a quick look around
num_entries = len(metadata)
print("There are {} many entries in this dataset:".format(num_entries))

metadata_with_text = metadata[metadata['full_text_file'].notna()]
with_full_text = len(metadata_with_text)
print("-- {} have full text entries".format(with_full_text))

with_doi = metadata['doi'].count()
print("-- {} have DOIs".format(with_doi))

with_pmcid = metadata['pmcid'].count()
print("-- {} have PubMed Central (PMC) ids".format(with_pmcid))

with_microsoft_id = metadata['Microsoft Academic Paper ID'].count()
print("-- {} have Microsoft Academic paper ids".format(with_microsoft_id))

There are 51078 many entries in this dataset:
-- 42511 have full text entries
-- 47741 have DOIs
-- 41082 have PubMed Central (PMC) ids
-- 964 have Microsoft Academic paper ids
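The summary above relies on pandas reading empty cells as missing values: `Series.count()` returns the number of non-missing entries in a column. As a sketch of the same bookkeeping using only the standard library, the example below counts non-empty cells in a tiny CSV; the three rows are invented and only mimic a few metadata.csv column names.

```python
import csv
import io

# made-up rows that only mimic a few metadata.csv columns
sample_csv = """cord_uid,doi,pmcid,full_text_file
aaaa0001,10.1000/example.1,PMC0000001,custom_license
aaaa0002,,PMC0000002,comm_use_subset
aaaa0003,10.1000/example.3,,
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))


def count_filled(rows, column):
    """Count rows whose value in `column` is not empty."""
    return sum(1 for row in rows if row[column].strip())


print("entries:", len(rows))
for column in ("full_text_file", "doi", "pmcid"):
    print("--", column, count_filled(rows, column))
```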

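Example: Read full text

metadata.csv doesn't contain the full text itself. Let's look at an example of how to read it: locate the full-text JSON, unpack it, and convert it to a list of sentences.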

# choose an arbitrary example (row 42) with pdf parse available
metadata_with_pdf_parse = metadata[metadata['has_pdf_parse']]
example_entry = metadata_with_pdf_parse.iloc[42]

# construct path to blob containing full text
# (assumes 'sha' holds a single hash; see the appendix on entries with multiple shas)
blob_name = '{0}/pdf_json/{1}.json'.format(example_entry['full_text_file'], example_entry['sha'])
print("Full text blob for this entry:")
print(blob_name)

We can now read the JSON content associated with this blob as follows.

import json
blob_as_json_string = blob_service.get_blob_to_text(container_name=container_name, blob_name=blob_name)
data = json.loads(blob_as_json_string.content)

# in addition to the body text, the metadata is also stored within the individual json files
print("Keys within data:", ', '.join(data.keys()))

For the purposes of this example, we're interested in body_text, which stores the text data as follows:

"body_text": [                      # list of paragraphs in full body
    {
        "text": <str>,
        "cite_spans": [             # list of character indices of inline citations
                                    # e.g. citation "[7]" occurs at positions 151-154 in "text"
                                    #      linked to bibliography entry BIBREF3
            {
                "start": 151,
                "end": 154,
                "text": "[7]",
                "ref_id": "BIBREF3"
            },
            ...
        ],
        "ref_spans": <list of dicts similar to cite_spans>,     # e.g. inline reference to "Table 1"
        "section": "Abstract"
    },
    ...
]

The full JSON schema is available here.
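In the schema above, each entry in `cite_spans` records character offsets into the paragraph's `text`; the "[7]" example (start 151, end 154, three characters) suggests that `end` is exclusive, as in a Python slice. The sketch below uses a made-up paragraph to show how the offsets can be checked and how citation markers can be stripped. The helper names are illustrative, and whether offsets in real files always line up is not something this example verifies.

```python
# a made-up paragraph in the shape of one 'body_text' element
paragraph = {
    "text": "Coronaviruses are enveloped RNA viruses [7] that infect many hosts.",
    "cite_spans": [
        {"start": 40, "end": 43, "text": "[7]", "ref_id": "BIBREF3"},
    ],
    "ref_spans": [],
    "section": "Introduction",
}


def check_spans(paragraph):
    """Return (ref_id, sliced_text, matches) for every inline citation."""
    text = paragraph["text"]
    results = []
    for span in paragraph["cite_spans"]:
        found = text[span["start"]:span["end"]]
        results.append((span["ref_id"], found, found == span["text"]))
    return results


def strip_citations(paragraph):
    """Return the paragraph text with inline citation markers removed."""
    text = paragraph["text"]
    # remove from the end so that earlier offsets stay valid
    for span in sorted(paragraph["cite_spans"], key=lambda s: s["start"], reverse=True):
        text = text[:span["start"]] + text[span["end"]:]
    return text


print(check_spans(paragraph))
print(strip_citations(paragraph))
```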

from nltk.tokenize import sent_tokenize

# the text itself lives under 'body_text'
text = data['body_text']

# many NLP tasks play nicely with a list of sentences
sentences = []
for paragraph in text:
    sentences.extend(sent_tokenize(paragraph['text']))

print("An example sentence:", sentences[0])
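Each paragraph in `body_text` also carries a `section` label, so the text can be grouped by section before tokenizing. A minimal sketch follows, using a made-up stand-in for `data['body_text']`; the helper name `paragraphs_by_section` is an assumption for illustration.

```python
from collections import defaultdict

# made-up stand-in for data['body_text']
body_text = [
    {"text": "First paragraph of the introduction.", "section": "Introduction"},
    {"text": "Second paragraph of the introduction.", "section": "Introduction"},
    {"text": "How the study was run.", "section": "Methods"},
]


def paragraphs_by_section(body_text):
    """Group paragraph texts by their 'section' label, keeping document order."""
    sections = defaultdict(list)
    for paragraph in body_text:
        # paragraphs without a label are collected under the empty string
        sections[paragraph.get("section", "")].append(paragraph["text"])
    return dict(sections)


for section, paragraphs in paragraphs_by_section(body_text).items():
    print(section, "->", len(paragraphs), "paragraph(s)")
```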

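PDF vs PMC XML parse

In the example above, we looked at a case with has_pdf_parse == True. In that case, the blob file path has the form:

'<full_text_file>/pdf_json/<sha>.json'

Alternatively, for cases with has_pmc_xml_parse == True, use the following format:

'<full_text_file>/pmc_json/<pmcid>.xml.json'

For example: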

# choose an arbitrary example (row 42) with pmc parse available
metadata_with_pmc_parse = metadata[metadata['has_pmc_xml_parse']]
example_entry = metadata_with_pmc_parse.iloc[42]

# construct path to blob containing full text
blob_name = '{0}/pmc_json/{1}.xml.json'.format(example_entry['full_text_file'], example_entry['pmcid'])
print("Full text blob for this entry:")
print(blob_name)

blob_as_json_string = blob_service.get_blob_to_text(container_name=container_name, blob_name=blob_name)
data = json.loads(blob_as_json_string.content)

# the text itself lives under 'body_text'
text = data['body_text']

# many NLP tasks play nicely with a list of sentences
sentences = []
for paragraph in text:
    sentences.extend(sent_tokenize(paragraph['text']))

print("An example sentence:", sentences[0])

Full text blob for this entry:
custom_license/pmc_json/PMC546170.xml.json
An example sentence: Double-stranded small interfering RNA (siRNA) molecules have drawn much attention since it was unambiguously shown that they mediate potent gene knock-down in a variety of mammalian cells (1).
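The two path formats can be folded into one helper that returns every candidate blob name for a metadata row. This is a sketch under the assumption that the row exposes `full_text_file`, `has_pdf_parse`, `has_pmc_xml_parse`, `sha`, and `pmcid` as in the examples above; the function name and the sample row (apart from the path printed above) are made up for illustration.

```python
def full_text_blob_names(entry):
    """Return the candidate blob names for one metadata row (dict-like)."""
    names = []
    subset = entry["full_text_file"]
    if entry.get("has_pdf_parse"):
        names.append("{0}/pdf_json/{1}.json".format(subset, entry["sha"]))
    if entry.get("has_pmc_xml_parse"):
        names.append("{0}/pmc_json/{1}.xml.json".format(subset, entry["pmcid"]))
    return names


# illustrative row; only the resulting path is taken from the output above
example = {
    "full_text_file": "custom_license",
    "has_pdf_parse": False,
    "has_pmc_xml_parse": True,
    "sha": None,
    "pmcid": "PMC546170",
}
print(full_text_blob_names(example))
```

Because both a plain dict and a pandas row support `.get(...)` and key lookup, the same function can be called with `example_entry` from the snippets above.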

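Iterate through blobs directly

In the examples above, we used the metadata.csv file to navigate the data, construct the blob file path, and read data from the blob. An alternative is to iterate through the blobs themselves.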

# get and sort list of available blobs
blobs = blob_service.list_blobs(container_name)
sorted_blobs = sorted(list(blobs), key=lambda e: e.name, reverse=True)

Now we can iterate through the blobs directly. For example, let's count the number of JSON files available.

# we can now iterate directly through the blobs
count = 0
for blob in sorted_blobs:
    if blob.name.endswith(".json"):
        count += 1
print("There are {} many json files".format(count))

There are 59784 many json files

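Appendix

Data quality issues

This is a large dataset that, for obvious reasons, was put together rather hastily! Here are some data quality issues we've observed.

Multiple shas

We observe that in some cases there are multiple shas for a given entry.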

metadata_multiple_shas = metadata[metadata['sha'].str.len() > 40]

print("There are {} many entries with multiple shas".format(len(metadata_multiple_shas)))

metadata_multiple_shas.head(3)

There are 1999 many entries with multiple shas
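The filter above works because each hash in the file names shown earlier is 40 characters long, so a longer `sha` cell must hold more than one. The sketch below splits such a cell into individual hashes. It assumes the hashes are separated by semicolons, which is an assumption to verify against your copy of metadata.csv; the helper name and the combined sample value are made up for illustration.

```python
def split_shas(sha_field):
    """Split a 'sha' cell into individual hashes.

    Assumes multiple hashes are separated by ';' (verify on your data).
    """
    if not isinstance(sha_field, str):
        return []  # missing values arrive as NaN or None
    return [part.strip() for part in sha_field.split(";") if part.strip()]


# made-up cell that combines two hashes from the file structure shown earlier
cell = "0001418189999fea7f7cbe3e82703d71c85a6fe5; 0036b28fddf7e93da0970303672934ea2f9944e7"

for sha in split_shas(cell):
    print(len(sha), sha)
```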

Layout of the container

Here we use a simple regex to explore the file structure of the container, in case it is updated in the future.

container_name = "covid19temp"
blobs = blob_service.list_blobs(container_name)
sorted_blobs = sorted(list(blobs), key=lambda e: e.name, reverse=True)

import re
dirs = {}

pattern = r'([\w]+)/([\w]+)/([\w.]+)\.json'  # raw string; the dot before "json" is escaped
for blob in sorted_blobs:
    
    m = re.match(pattern, blob.name)
    
    if m:
        dir_ = m[1] + '/' + m[2]
        
        if dir_ in dirs:
            dirs[dir_] += 1
        else:
            dirs[dir_] = 1
        
dirs
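The same tally can be tried offline on a handful of names before running it against the whole container. The sketch below uses `collections.Counter` in place of the manual dictionary bookkeeping and `fullmatch` so that only names of the form `<dir>/<dir>/<name>.json` are counted; the sample names are modeled on the file structure shown earlier.

```python
import re
from collections import Counter

# names modeled on the file structure shown earlier
blob_names = [
    "metadata.csv",
    "custom_license/pdf_json/0001418189999fea7f7cbe3e82703d71c85a6fe5.json",
    "custom_license/pmc_json/PMC1065028.xml.json",
    "comm_use_subset/pdf_json/000b7d1517ceebb34e1e3e817695b6de03e2fa78.json",
    "comm_use_subset/pmc_json/PMC1054884.xml.json",
    "biorxiv_medrxiv/pdf_json/0015023cc06b5362d332b3baf348d11567ca2fbb.json",
]

# raw string with an escaped dot before 'json'
pattern = re.compile(r"(\w+)/(\w+)/([\w.]+)\.json")

# count files per '<subset>/<parse_type>' directory; non-matching names are skipped
dirs = Counter(
    "{0}/{1}".format(m[1], m[2])
    for m in map(pattern.fullmatch, blob_names)
    if m
)
print(dict(dirs))
```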

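The CORD-19 Dataset

The goal of this notebook is two-fold:

Demonstrate how to access the CORD-19 dataset on Azure: we use an AzureML Dataset to provide context for the CORD-19 data.
Walk through the structure of the dataset: articles in the dataset are stored as JSON files. We provide examples showing:

How to find the articles (navigating the directory structure)
How to read the articles (navigating the JSON schema)

Dependencies: This notebook requires the following libraries:

AzureML Python SDK (for example, pip install --upgrade azureml-sdk)
Pandas (for example, pip install pandas)
NLTK (docs) (for example, pip install nltk)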

If your NLTK installation doesn't have the punkt package, you need to run:

import nltk
nltk.download('punkt')

Getting the CORD-19 data from Azure

The CORD-19 data has been uploaded as an Azure Open Dataset. In this notebook, we use an AzureML Dataset to reference the CORD-19 open dataset.

import azureml.core
print("Azure ML SDK Version: ", azureml.core.VERSION)

from azureml.core import Dataset
cord19_dataset = Dataset.File.from_files('https://azureopendatastorage.blob.core.windows.net/covid19temp')
mount = cord19_dataset.mount()

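The mount() method creates a context manager for mounting the file system streams defined by the dataset as local files.

Use mount.start() and mount.stop(), or alternatively use with mount: to manage the context.

Mounting is only supported on Unix or Unix-like operating systems, and libfuse must be present. If you are running inside a Docker container, the container must be started with the --privileged flag or with --cap-add SYS_ADMIN --device /dev/fuse. For more information, see the docs.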
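Because mounting depends on the operating system and on FUSE, a rough pre-flight check can save a confusing failure later. The sketch below is a heuristic and not an official test: the helper name is made up, and treating the presence of `/dev/fuse` as a sign that FUSE is usable is an assumption that holds on typical Linux setups only.

```python
import os
import platform


def mount_prerequisites():
    """Rough pre-flight check before calling mount(); a heuristic only."""
    return {
        # mounting is documented as supported on Unix-like systems only
        "unix_like": os.name == "posix",
        # /dev/fuse is where the FUSE device usually appears on Linux (assumption)
        "fuse_device": os.path.exists("/dev/fuse"),
    }


checks = mount_prerequisites()
print(platform.system(), checks)
if not all(checks.values()):
    print("Mounting may fail on this machine.")
```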

import os

COVID_DIR = '/covid19temp'
path = mount.mount_point + COVID_DIR

with mount:
    print(os.listdir(path))

['antiviral_with_properties_compressed.sdf', 'biorxiv_medrxiv', 'biorxiv_medrxiv_compressed.tar.gz', 'comm_use_subset', 'comm_use_subset_compressed.tar.gz', 'custom_license', 'custom_license_compressed.tar.gz', 'metadata.csv', 'noncomm_use_subset', 'noncomm_use_subset_compressed.tar.gz']

This is the file structure within the CORD-19 dataset, together with an example file for each directory.

metadata.csv
custom_license/
    pdf_json/
        0001418189999fea7f7cbe3e82703d71c85a6fe5.json        # filename is sha-hash
        ...
    pmc_json/
        PMC1065028.xml.json                                  # filename is the PMC ID
        ...
noncomm_use_subset/
    pdf_json/
        0036b28fddf7e93da0970303672934ea2f9944e7.json
        ...
    pmc_json/
        PMC1616946.xml.json
        ...
comm_use_subset/
    pdf_json/
        000b7d1517ceebb34e1e3e817695b6de03e2fa78.json
        ...
    pmc_json/
        PMC1054884.xml.json
        ...
biorxiv_medrxiv/                                             # note: there is no pmc_json subdir
    pdf_json/
        0015023cc06b5362d332b3baf348d11567ca2fbb.json
        ...

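Each .json file corresponds to an individual article in the dataset. This is where the title, authors, abstract, and (where available) the full text are stored.

Using metadata.csv

The CORD-19 dataset comes with metadata.csv, a single file that records basic information on all the papers available in the CORD-19 dataset. This is a good place to start exploring!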

import pandas as pd

# create mount context
mount.start()

# specify path to metadata.csv
COVID_DIR = 'covid19temp'
metadata_filename = '{}/{}/{}'.format(mount.mount_point, COVID_DIR, 'metadata.csv')

# read metadata
metadata = pd.read_csv(metadata_filename)
metadata.head(3)

simple_schema = ['cord_uid', 'source_x', 'title', 'abstract', 'authors', 'full_text_file', 'url']

def make_clickable(address):
    '''Make the url clickable'''
    return '<a href="{0}">{0}</a>'.format(address)

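def preview(text):
    '''Show only a preview of the text data.'''
    # pandas reads empty cells as float NaN, which cannot be sliced
    if not isinstance(text, str):
        return ''
    return text[:30] + '...'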

format_ = {'title': preview, 'abstract': preview, 'authors': preview, 'url': make_clickable}

metadata[simple_schema].head().style.format(format_)

# let's take a quick look around
num_entries = len(metadata)
print("There are {} many entries in this dataset:".format(num_entries))

metadata_with_text = metadata[metadata['full_text_file'].notna()]
with_full_text = len(metadata_with_text)
print("-- {} have full text entries".format(with_full_text))

with_doi = metadata['doi'].count()
print("-- {} have DOIs".format(with_doi))

with_pmcid = metadata['pmcid'].count()
print("-- {} have PubMed Central (PMC) ids".format(with_pmcid))

with_microsoft_id = metadata['Microsoft Academic Paper ID'].count()
print("-- {} have Microsoft Academic paper ids".format(with_microsoft_id))

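Example: Read full text

metadata.csv doesn't contain the full text itself. Let's look at an example of how to read it: locate the full-text JSON, unpack it, and convert it to a list of sentences.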

# choose an arbitrary example (row 42) with pdf parse available
metadata_with_pdf_parse = metadata[metadata['has_pdf_parse']]
example_entry = metadata_with_pdf_parse.iloc[42]

# construct path to the mounted file containing full text
# (assumes 'sha' holds a single hash; some entries list several)
filepath = '{0}/{1}/pdf_json/{2}.json'.format(path, example_entry['full_text_file'], example_entry['sha'])
print("Full text filepath:")
print(filepath)

We can now read the JSON content associated with this file as follows.

import json

try:
    with open(filepath, 'r') as f:
        data = json.load(f)
except FileNotFoundError as e:
    # in case the mount context has been closed
    mount.start()
    with open(filepath, 'r') as f:
        data = json.load(f)
        
# in addition to the body text, the metadata is also stored within the individual json files
print("Keys within data:", ', '.join(data.keys()))

Keys within data: paper_id, metadata, abstract, body_text, bib_entries, ref_entries, back_matter
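The key list printed above can double as a lightweight sanity check when reading many files. The sketch below reports which of those top-level keys are absent from a parsed article. The helper name and the sample article are made up, and a missing key is not necessarily an error, since files from a different parse may not carry every key.

```python
# top-level keys printed in the output above
EXPECTED_KEYS = (
    "paper_id", "metadata", "abstract", "body_text",
    "bib_entries", "ref_entries", "back_matter",
)


def missing_keys(data):
    """Return the expected top-level keys that are absent from a parsed article."""
    return [key for key in EXPECTED_KEYS if key not in data]


# made-up, incomplete article used only to exercise the check
article = {"paper_id": "example", "metadata": {}, "body_text": []}
print("missing:", missing_keys(article))
```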

For the purposes of this example, we're interested in body_text, which stores the text data as follows:

"body_text": [                      # list of paragraphs in full body
    {
        "text": <str>,
        "cite_spans": [             # list of character indices of inline citations
                                    # e.g. citation "[7]" occurs at positions 151-154 in "text"
                                    #      linked to bibliography entry BIBREF3
            {
                "start": 151,
                "end": 154,
                "text": "[7]",
                "ref_id": "BIBREF3"
            },
            ...
        ],
        "ref_spans": <list of dicts similar to cite_spans>,     # e.g. inline reference to "Table 1"
        "section": "Abstract"
    },
    ...
]

View the full JSON schema.

from nltk.tokenize import sent_tokenize
# the text itself lives under 'body_text'
text = data['body_text']

# many NLP tasks play nicely with a list of sentences
sentences = []
for paragraph in text:
    sentences.extend(sent_tokenize(paragraph['text']))

print("An example sentence:", sentences[0])

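PDF vs PMC XML parse

In the example above, we looked at a case with has_pdf_parse == True. In that case, the file path has the form:

'<full_text_file>/pdf_json/<sha>.json'

Alternatively, for cases with has_pmc_xml_parse == True, use the following format:

'<full_text_file>/pmc_json/<pmcid>.xml.json'

For example: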

# choose an arbitrary example (row 42) with pmc parse available
metadata_with_pmc_parse = metadata[metadata['has_pmc_xml_parse']]
example_entry = metadata_with_pmc_parse.iloc[42]

# construct path to the file containing full text, relative to the dataset root
filename = '{0}/pmc_json/{1}.xml.json'.format(example_entry['full_text_file'], example_entry['pmcid'])
print("Path to file: {}\n".format(filename))

with open(mount.mount_point + '/' + COVID_DIR + '/' + filename, 'r') as f:
    data = json.load(f)

# the text itself lives under 'body_text'
text = data['body_text']

# many NLP tasks play nicely with a list of sentences
sentences = []
for paragraph in text:
    sentences.extend(sent_tokenize(paragraph['text']))

print("An example sentence:", sentences[0])
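The snippets in this notebook define `COVID_DIR` both with and without a leading slash and join path pieces by string concatenation, which is easy to get wrong. As a sketch, the hypothetical helper below builds the local path with `pathlib` and tolerates either spelling; `/tmp/mount` stands in for `mount.mount_point`, and the article identifier is used only for illustration.

```python
from pathlib import PurePosixPath


def local_full_text_path(mount_point, covid_dir, subset, parse_type, article_id):
    """Build '<mount_point>/<covid_dir>/<subset>/<parse_type>/<file>'."""
    suffix = ".xml.json" if parse_type == "pmc_json" else ".json"
    # strip slashes first: joining an absolute segment would discard mount_point
    root = PurePosixPath(mount_point) / covid_dir.strip("/")
    return str(root / subset / parse_type / (article_id + suffix))


# '/tmp/mount' is a hypothetical mount point
print(local_full_text_path("/tmp/mount", "/covid19temp", "custom_license", "pmc_json", "PMC546170"))
print(local_full_text_path("/tmp/mount", "covid19temp", "custom_license", "pmc_json", "PMC546170"))
```

`PurePosixPath` is used because mounting is limited to Unix-like systems, so forward slashes are the right separator regardless of where the path string is constructed.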

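Appendix

Data quality issues

This is a large dataset that, for obvious reasons, was put together rather hastily! Here are some data quality issues we've observed.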

metadata_multiple_shas = metadata[metadata['sha'].str.len() > 40]

print("There are {} many entries with multiple shas".format(len(metadata_multiple_shas)))

metadata_multiple_shas.head(3)

Next steps

View the rest of the datasets in the Open Datasets catalog.
