The White House Office of Science and Technology Policy on Monday released the most extensive collection of machine-readable coronavirus literature so that data and text mining can help answer scientists’ most pressing questions.
About 29,000 articles — 13,000 of which are full text — now exist within the updatable COVID-19 Open Research Dataset (CORD-19), which grows by more than 100 documents a week.
The World Health Organization identified key questions about COVID-19 that the global research community is trying to answer. And the National Academies of Sciences, Engineering and Medicine narrowed those queries down to the ones data scientists can answer using natural language processing on the dataset.
“For us, we’re posing important questions,” said Michael Kratsios, U.S. chief technology officer, during a press briefing Monday. “And we’re able to extract answers from this large database.”
Last Wednesday, the White House held a phone call with the tech industry about forming a loose partnership to seek artificial intelligence breakthroughs in coronavirus response. The database was first demonstrated to companies on that call.
Once the data mining is fine-tuned, answers to questions about COVID-19 incubation, transmission, therapeutics and vaccines can be relayed to drugmakers, policy experts and government officials, Kratsios said.
The National Library of Medicine at the National Institutes of Health made more than 10,000 articles related to the coronavirus family of syndromes — which includes COVID-19, SARS and MERS — available from its PubMed Central digital archive of biomedical journals.
Those articles were incorporated into a larger database with preprint content — literature that hasn’t yet been peer-reviewed and published — from Cold Spring Harbor Laboratory’s BioRxiv server funded by the Chan Zuckerberg Initiative. Preprint content reaches the global research community about 100 days faster than peer-reviewed articles, which is important when faced with “tight” COVID-19 response timelines, said Alex Wade, technical program manager for Meta at CZI.
“What we learn from this current situation should influence how the scientific community shares research information going forward,” Wade said.
Microsoft lent its experience indexing and mapping global scientific literature, with Chief Scientific Officer Eric Horvitz noting Monday how hard it typically is for data scientists to gain machine-readable rights to such articles. And Georgetown University’s Center for Security and Emerging Technology coordinated the entire collection.
The Allen Institute for AI transformed the article text into a machine-readable JSON format for data analysis, and its free Semantic Scholar academic discovery engine pulls research relevant to individual scientists.
All of this work has taken place since March 13, said Lynne Parker, U.S. deputy CTO.
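For readers curious what mining the machine-readable JSON might look like in practice, here is a minimal sketch of a keyword search over one article. The record below is invented for illustration, and its field names (`paper_id`, `metadata`, `body_text`, `section`, `text`) are assumptions about the file layout rather than a description of the actual CORD-19 schema. The helper `find_passages` is likewise hypothetical, and real submissions would likely use far richer natural language processing than substring matching.

```python
import json

# Made-up record for illustration only. The field names are an ASSUMED,
# simplified layout, not a confirmed description of the CORD-19 files.
SAMPLE = json.dumps({
    "paper_id": "example-0001",
    "metadata": {"title": "Example study of incubation period"},
    "body_text": [
        {"section": "Introduction",
         "text": "Coronaviruses cause respiratory illness."},
        {"section": "Results",
         "text": "The median incubation period was estimated from case reports."},
        {"section": "Discussion",
         "text": "Transmission may occur before symptoms appear."},
    ],
})


def find_passages(raw_json, keywords):
    """Return (section, text) pairs whose text mentions any keyword.

    Hypothetical helper: a case-insensitive substring match over the
    paragraphs in the assumed "body_text" list.
    """
    doc = json.loads(raw_json)
    wanted = [k.lower() for k in keywords]
    hits = []
    for para in doc.get("body_text", []):
        text = para.get("text", "")
        if any(k in text.lower() for k in wanted):
            hits.append((para.get("section", ""), text))
    return hits


if __name__ == "__main__":
    for section, text in find_passages(SAMPLE, ["incubation", "transmission"]):
        print(f"[{section}] {text}")
```

Running the script prints the two sample paragraphs that mention incubation or transmission, each tagged with its section name.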
Now it’s up to scientists and data scientists to share with the CORD-19 Kaggle community any AI tools they develop and any answers they find to WHO’s questions. Owned by Google Cloud, the Kaggle platform allows for the global sharing of machine learning tools and insights among researchers.
OSTP created 10 broad tasks for researchers on Kaggle, based on WHO’s questions, to maximize the insights that AI analysis of the database provides:
- What do we know about virus genetics, origin, and evolution?
- What is known about transmission, incubation, and environmental stability?
- What has been published about medical care?
- What do we know about COVID-19 risk factors?
- What do we know about vaccines and therapeutics?
- What has been published about ethical and social science considerations?
- What do we know about diagnostics and surveillance?
- What has been published about information sharing and inter-sectoral collaboration?
- What do we know about non-pharmaceutical interventions?
- How does geography affect virality?
Within each task is a subset of questions more directly related to WHO’s queries. Kaggle is offering a $1,000-per-task award to the submissions that best meet the evaluation criteria. Winners can receive the award as a monetary payment or as a charitable donation to COVID-19 relief and research efforts.
OSTP and its partners stressed that the dataset does not contain any personally identifiable information.
“We’re not making medical records and so forth available from this initiative,” said Jerry Sheehan, deputy director of NLM.