COVID-19 Open Research Dataset (CORD-19)

A Free, Open Resource for the Global AI Community

The White House is tapping the expertise of researchers from Georgetown’s Center for Security and Emerging Technology to determine how data and open research can be used to address the COVID-19 pandemic. CSET has partnered with leading research groups to prepare and distribute the COVID-19 Open Research Dataset (CORD-19), a resource of more than 44,000 articles in JSON format about COVID-19 and the coronavirus family of viruses for use by the global machine learning community. The dataset represents the most extensive machine-readable coronavirus literature collection available for data and text mining to date. 

CORD-19 contains 29,000 full-text articles with a wealth of information about the novel coronavirus (SARS-CoV-2), the associated illness COVID-19, and related viruses. The collection will be updated as new research is published in peer-reviewed publications and archival services like bioRxiv, medRxiv, and others.
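
For readers who want to start mining the collection, the following is a minimal Python sketch of reading a folder of the JSON articles and counting how many carry full text. It is an illustration under stated assumptions, not an official loader: the field names "body_text" and "text", and the helper names load_articles, full_text, and count_with_full_text, are assumptions made for this example and should be checked against the schema shipped with the dataset.

```python
import json
from pathlib import Path


def load_articles(root):
    """Yield one parsed article (a dict) per .json file found under root."""
    for path in sorted(Path(root).rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            yield json.load(handle)


def full_text(article):
    """Join the paragraphs of an article's body into a single string.

    Assumes (hypothetically) that "body_text" is a list of objects that
    each carry a "text" field; returns "" when the field is absent.
    """
    paragraphs = article.get("body_text", [])
    return "\n".join(p.get("text", "") for p in paragraphs)


def count_with_full_text(root):
    """Count the articles under root that have non-empty body text."""
    return sum(1 for article in load_articles(root) if full_text(article).strip())
```

Calling count_with_full_text on the directory where the dataset was unpacked (the path is up to the reader) would give a quick check on how many of the downloaded articles actually include full text, which is a reasonable first step before any text mining.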

At the request of the White House Office of Science and Technology Policy, CSET leads this effort in partnership with the Allen Institute for AI, Chan Zuckerberg Initiative, Microsoft Research and the National Library of Medicine of the National Institutes of Health.

"With this step, we've made available full-text, machine-readable resources to help speed response to this global crisis. The worldwide machine learning community now has the opportunity to apply recent advances in natural language processing to find answers to important questions about this infectious disease."
Dewey Murdick, CSET Director of Data Science

Now, preliminary answers to questions about COVID-19, including estimates of reproduction rate, incubation period, and key risk factors, are emerging. Kaggle has summarized early findings extracted from the CORD-19 papers by machine learning algorithms.