On March 16, 2020, the White House Office of Science and Technology Policy (OSTP) announced the availability of an open research dataset on COVID-19 and issued a call to action for the nation’s artificial intelligence (AI) researchers to help scientists fight the disease. Established in 1976, the OSTP provides scientific and technological advice to the President and the Executive Office, among other duties.
CORD-19 (the COVID-19 Open Research Dataset) is the result of a collaboration between the Allen Institute for Artificial Intelligence (AI2), Microsoft, the National Library of Medicine at the National Institutes of Health (NIH), Georgetown University’s Center for Security and Emerging Technology (CSET), and the Chan Zuckerberg Initiative, in response to a request by the White House Office of Science and Technology Policy. Together these institutions issued a joint call to action to the world’s AI researchers to create text and data mining tools to help accelerate COVID-19 research.
The Allen Institute’s Semantic Scholar has an adaptive feed on COVID-19 research. The late Paul Allen founded the Allen Institute for Artificial Intelligence in 2013, and two years later AI2 launched Semantic Scholar, an AI-powered search engine for papers published in scientific journals that goes beyond simple search and retrieval. Semantic Scholar uses artificial intelligence to extract meaning from the data and to indicate a paper’s influence by highlighting the volume of citations. It also provides quick scans and automatic extraction of tables, figures, abstracts, and citations, and lets readers view supplemental content for context. The plan is to update CORD-19 periodically with data not only from preprint archives such as bioRxiv and medRxiv, but also from articles published in peer-reviewed scientific and medical journals.
“The scientific literature on coronavirus is growing exponentially,” said Oren Etzioni, CEO of the Allen Institute for Artificial Intelligence and professor of computer science at the University of Washington, on a conference call on Monday, as reported by GeekWire’s Health Tech Podcast. “And scientists need AI capabilities to do their research on COVID-19 quickly and effectively, with the goal of bolstering prevention, detection, treatment, and vaccination. The work on the free Semantic Scholar academic discovery engine, which began five years ago, has prepared us for this moment, where humanity needs scientists to succeed, and to succeed quickly.”
CORD-19 is available on Kaggle, a platform for hosting machine learning and data science competitions. Kaggle was founded in 2010 by Anthony Goldbloom and Ben Hamner, and had investors that included Naval Ravikant, Khosla Ventures, Index Ventures, Hal Varian, Yuri Milner, Max Levchin, and SV Angel. In 2017, Google acquired Kaggle.
AI researchers are encouraged to participate in the COVID-19 Open Research Dataset Challenge. Kaggle is awarding $1,000 per task to the winners who best meet the evaluation criteria, with the option to donate the prize towards COVID-19 relief efforts. Current task challenges include finding COVID-19 insights on topics such as transmission, incubation, and environmental stability; risk factors; virus genetics, origin, and evolution; geographical impact on virality; non-pharmaceutical interventions; potential vaccines and therapeutics; diagnostics and disease surveillance; medical care; information sharing and inter-sectoral collaboration; and social science and ethical considerations.
The world is in a public health crisis due to COVID-19, the disease caused by the SARS-CoV-2 coronavirus. The virus was first detected during the winter of 2019 in Wuhan, a city of 11 million people and the capital of Hubei Province in the People’s Republic of China. According to today’s Reuters news report, 3,405 people in Italy have died from COVID-19, which surpasses the number of reported deaths in China.
Globally there are 207,855 confirmed cases and 8,648 reported deaths, according to today’s figures from the World Health Organization (WHO). In countries with limited testing, the reported numbers may not reflect the number of people who actually have the disease. The actual number of cases and deaths due to COVID-19 may be higher, as these figures rely on testing and on transparent, timely reporting of data from the affected countries.
Scientists are urgently researching ways to prevent, treat, and cure COVID-19. A way to identify patterns in large volumes of text and data can help researchers in that effort. CORD-19 provides access to more than 29,000 scholarly articles, including 13,000 articles in full text, on coronaviruses, SARS-CoV-2, and COVID-19.
It would take many years and a large number of people to read each of these academic papers manually, extract the relevant information, and then apply it to scientific research. That approach is costly and impractical, as new papers are published constantly. Artificial intelligence natural language processing (NLP) is often used for computer speech recognition, AI chatbots, translation, natural language generation, natural language understanding, text-to-speech, and similar purposes.
This is an opportunity for NLP developers and AI experts to create machine learning algorithms to identify meaningful patterns in the CORD-19 dataset that will help researchers rapidly find solutions with the potential to stop the pandemic and save lives around the world.
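As a rough illustration of what a very simple text mining tool might look like, the sketch below ranks a handful of paper abstracts against a search query using TF-IDF weighting, a basic technique that scores a document higher when it contains query words that are rare across the collection. This is only a minimal sketch under stated assumptions: the abstracts are hypothetical stand-ins rather than records from CORD-19, the function names are invented for this example, and real submissions to the challenge would likely use far more capable NLP methods.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in abstracts for illustration; NOT taken from CORD-19.
ABSTRACTS = {
    "paper_a": "Incubation period and transmission dynamics of a novel coronavirus.",
    "paper_b": "Environmental stability of coronavirus on surfaces and in aerosols.",
    "paper_c": "Vaccine candidates and therapeutics under evaluation for coronavirus infection.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, docs):
    """Return (doc_id, score) pairs sorted by TF-IDF relevance to the query."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                # Smoothed inverse document frequency: rarer terms weigh more.
                idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
                score += (tf[term] / len(tokens)) * idf
        scores[doc_id] = score

    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for doc_id, score in rank("incubation transmission", ABSTRACTS):
        print(doc_id, round(score, 3))
```

A query such as "incubation transmission" places the first abstract at the top because it is the only one containing those words, while a word like "coronavirus" that appears in every abstract does little to separate them. The same idea would need to scale to tens of thousands of full-text articles to be useful to researchers.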