Researchers across the globe are working to develop vaccines and cures for COVID-19, a respiratory disease caused by a new strain of coronavirus. To assist in this effort, the White House and its industry and academic partners announced Monday a new data repository offering open access to the nation’s best, cutting-edge research on the coronavirus family, along with a prize challenge to develop artificial intelligence tools that can mine it.
The COVID-19 Open Research Dataset, or CORD-19, was created through a collaboration between government, industry and academia, including the Allen Institute for AI, the Chan Zuckerberg Initiative, Georgetown University’s Center for Security and Emerging Technology, Microsoft, Google Cloud’s Kaggle platform and the National Institutes of Health’s National Library of Medicine.
With a wealth of information available, some trustworthy and some not, the group moved to ensure the dataset contained “the most extensive machine-readable coronavirus literature collection available,” enabling researchers to comb through more than 29,000 articles using keyword searches.
“Microsoft’s web-scale literature curation tools were used to identify and bring together worldwide scientific efforts and results, CZI provided access to pre-publication content, NLM provided access to literature content, and the Allen AI team transformed the content into machine-readable form, making the corpus ready for analysis and study,” the White House Office of Science and Technology Policy said in a statement Monday.
“Decisive action from America’s science and technology enterprise is critical to prevent, detect, treat, and develop solutions to COVID-19,” U.S. Chief Technology Officer Michael Kratsios said. “We thank each institution for voluntarily lending its expertise and innovation to this collaborative effort, and call on the United States research community to put artificial intelligence technologies to work in answering key scientific questions about the novel coronavirus.”
With the information now available—and being continuously updated—the next step will be for artificial intelligence developers to work on tools to mine key data from the archive and target machine learning algorithms toward answering “high-priority scientific questions related to COVID-19,” according to OSTP.
As those tools are developed, OSTP encouraged AI and machine learning researchers to publish their work through Google Cloud’s Kaggle platform to disseminate the tools and research to other data science teams working throughout the world.
“It’s difficult for people to manually go through more than 20,000 articles and synthesize their findings,” said Anthony Goldbloom, co-founder and CEO at Kaggle. “Recent advances in technology can be helpful here. We’re putting machine-readable versions of these articles in front of our community of more than 4 million data scientists. Our hope is that AI can be used to help find answers to a key set of questions about COVID-19.”
Kaggle is offering up to $1,000 for each of nine tasks addressing key questions in the response to the outbreak:
- What do we know about virus genetics, origin and evolution?
- What is known about transmission, incubation and environmental stability?
- What has been published about medical care?
- What has been published about information sharing and inter-sectoral collaboration?
- What do we know about COVID-19 risk factors?
- What do we know about vaccines and therapeutics?
- What has been published about ethical and social science considerations?
- What do we know about diagnostics and surveillance?
- What do we know about non-pharmaceutical interventions?
“This valuable new resource is the fruit of unselfish collaboration and now offers the opportunity to find answers to important questions about COVID-19,” said Dr. Dewey Murdick, director of data science at Georgetown University’s Center for Security and Emerging Technology, who coordinated the cross-team effort. “Once the crisis has passed, we hope this project will inspire new ways to use machine learning to advance scientific research.”