Earlier this week, five organisations released an open dataset – CORD-19 – containing nearly 30 000 scientific articles, in the hope that artificial intelligence can use the data to help combat the spread of COVID-19 infections.
These articles were previously published in journals or posted on pre-print servers. CORD-19 is short for COVID-19 Open Research Dataset.
What is CORD-19 and how could it fight COVID-19?
The CORD-19 dataset was released after the Trump administration issued a “call to action” for the tech community to develop AI (artificial intelligence) techniques to curb the spread of COVID-19 infections.
In addition, Michael Kratsios, US Chief Technology Officer at The White House, explained that “decisive action from America’s science and technology enterprise” was needed to prevent, detect and treat COVID-19, and to develop a cure for it.
The dataset was released in the hopes of spurring America’s artificial intelligence experts to “develop new techniques for mining data and text that could help answer some of the most pressing questions about the novel coronavirus and the disease it causes”.
Rebecca Robbins from Stat News reports that the dataset is said to be the “most extensive collection of its kind”, and will be made machine-readable so that AI specialists can work with it.
Despite the impressive size of the collection, researchers point out that only 13 000 articles contain usable, machine-readable data. Unfortunately, approximately 16 000 articles contain only metadata such as the authors’ names and the papers’ abstracts.
Organisations that partnered to create CORD-19
Microsoft contributed its literature curation tools while the Allen Institute for AI – one of the research institutes founded by the late Microsoft co-founder Paul Allen – transformed the content into a form that would be machine-readable.
Dr Eric Horvitz, Chief Scientific Officer at Microsoft, said that companies, governments and scientists must come together “and work to bring our best technologies to bear across biomedicine, epidemiology, AI, and other sciences”. He added:
“The COVID-19 literature resource and challenge will stimulate efforts that can accelerate the path to solutions on COVID-19”.
In addition, the National Institutes of Health’s National Library of Medicine provided access to its literature content while Georgetown University’s Center for Security and Emerging Technology coordinated the initiative.
The Chan Zuckerberg Initiative – the philanthropic vehicle launched by Facebook founder Mark Zuckerberg and his wife, the paediatrician Priscilla Chan – provided access to articles that have been posted on pre-print servers but not yet peer-reviewed.
Finding the ‘right information in a sea of scientific papers’
Dewey Murdick, CSET Director of Data Science, explains that “the worldwide machine learning community now has the opportunity to apply recent advances in natural language processing to find answers to important questions about” COVID-19.
“One of the most immediate and impactful applications of AI is in the ability to help scientists and technologists find the right information in a sea of scientific papers to move research faster”, said Dr Oren Etzioni, Chief Executive Officer of the Allen Institute for AI.
The CORD-19 data can be downloaded from the website of Georgetown University’s Center for Security and Emerging Technology.
At the time of publishing, South Africa had 240 confirmed cases.