One month after the debut of the COVID-19 Open Research Dataset, or CORD-19, the database of coronavirus-related research papers has doubled in size – and has given rise to more than a dozen software tools to channel the hundreds of studies that are being published every day about the pandemic.
In a roundup published on the arXiv preprint server this week, researchers from Seattle’s Allen Institute for Artificial Intelligence, Microsoft Research and other partners in the project say CORD-19’s collection has risen from about 28,000 papers to more than 52,000. Every day, several hundred more papers are being published, in peer-reviewed journals and on preprint servers such as bioRxiv and medRxiv.
CORD-19 aims to make sense of them all, using the Semantic Scholar academic search engine developed by the Allen Institute for AI, also known as AI2.
“We commit to providing regular updates to the dataset until an end to the crisis is foreseeable,” the project’s organizers say.
Since mid-March, the dataset has been viewed more than 1.5 million times and downloaded more than 75,000 times.
But it’s not just a question of quantity: CORD-19 has sparked the development of spin-off projects aimed at visualizing and organizing COVID-19 research to answer key questions about the pandemic and how to stop it.
One of the highest-profile efforts is the Text Retrieval Conference-COVID, or TREC-COVID, launched last week by the Commerce Department’s National Institute of Standards and Technology and the White House Office of Science and Technology Policy.
Among other organizers of TREC-COVID are AI2, the National Library of Medicine, Oregon Health and Science University and the University of Texas Health Science Center at Houston. The goal of the project is to assess systems on their ability to rank COVID-19 research papers based on their relevance to topical queries – for example, “How does the coronavirus respond to changes in the weather?”
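To give a rough sense of what ranking papers by relevance to a query involves, here is a minimal sketch in Python. It is not the method used by TREC-COVID or by any participating system; it swaps in plain TF-IDF term weighting as a simple stand-in. The function names `tokenize` and `rank` and the three paper titles are hypothetical, made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, docs):
    """Score each document against the query with a simple TF-IDF sum.

    Returns (score, doc_index) pairs, best match first. Ties are broken
    by the document's original position.
    """
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    # Smoothed inverse document frequency; rarer terms weigh more.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

    q_terms = set(tokenize(query))
    scores = []
    for i, toks in enumerate(tokenized):
        tf = Counter(toks)
        length = len(toks) or 1
        score = sum((tf[t] / length) * idf[t] for t in q_terms if t in tf)
        scores.append((score, i))
    return sorted(scores, key=lambda s: (-s[0], s[1]))


# Hypothetical paper titles, not drawn from CORD-19.
docs = [
    "Protein structure of a viral spike",
    "Effects of weather and temperature changes on coronavirus transmission",
    "Hospital staffing models during a pandemic",
]
query = "How does the coronavirus respond to changes in the weather?"

for score, idx in rank(query, docs):
    print(f"{score:.3f}  {docs[idx]}")
```

A word-overlap scheme like this one only rewards papers that share exact terms with the question, which is one reason more sophisticated retrieval systems are worth evaluating.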
“AI experts worldwide are responding to the White House’s call to action, developing approaches that help scientists gain insights from thousands of articles of COVID-19 scholarly literature,” Michael Kratsios, U.S. chief technology officer, said in a news release. “The TREC-COVID program expands upon these efforts by creating powerful and accurate search engines that extract knowledge from this literature, tailored to the needs of the health-care and medical research communities.”
Another partner in CORD-19 is the Kaggle online data science community, which is conducting a text-mining competition to extract answers to key research questions surrounding the pandemic. More than 550 teams are participating in the competition, and they’re already finding new ways to blend machine-based analysis with human-based curation.
“A few Kagglers are collaborating with a group of medical students to create a semi-automated living literature review page,” said AI2’s Lucy Lu Wang, a member of the CORD-19 team. “The machine-learning experts are creating systems to extract answers out of the CORD-19 dataset, and the medical students are helping to evaluate those results and present them in a form that’s suited for public consumption.”
Wang and the other team members say they’ve faced a few obstacles in their efforts to build the database. One has to do with access to research. “Though many publishers have generously made COVID-19 papers available during this time, there are still bottlenecks to information access,” the organizers say in their report.
Securing release rights to papers that haven’t yet been made available for CORD-19 is one of the top items on the organizers’ to-do list, with the National Institutes of Health’s PubMed Central COVID-19 Initiative taking a leading role.
Another obstacle has to do with the PDF document format, which is the primary distribution format for scientific papers. PDF is optimized to render papers faithfully for reading and printing, not for automated document analysis. For that reason, studies published as PDF files have to go through significant cleanup in order for AI to do its work. What’s more, there’s no standard format for representing the metadata that accompany research papers.
“We encourage the community to come together and propose solutions to these challenges,” CORD-19’s organizers say.
The good news is that a new crop of data search and visualization tools has flowered in the fertile field of CORD-19 meta-analysis. Here’s a sampling:
- Neural Covidex: AI-based ranking of biomedical studies.
- CovidScholar: Adaptation of MatScholar optimized for COVID-19.
- COVID-19 Explorer: Search filter developed by Indian researchers.
- COVID-19 Search: Powered by Microsoft’s Azure Cognitive Search.
- Covid Graph: Knowledge graph created by a German-based team.
- CoViz: Visualization tool developed at AI2.
- CovidAsk: Question-answering tool created in South Korea.
- Vespa: Verizon Media’s search engine, optimized for CORD-19.
- ASReview: CORD-19 plugin for ASReview software.
- CORD-19 Demos and Resources: Semantic Scholar’s list of tools.
CORD-19 team member Kyle Lo, an applied research scientist at AI2, said the use of natural language processing and text mining to address biomedical research questions isn’t exactly a new idea. “What’s new about this is the pace at which we need the answers and findings extracted from these articles,” he said.
Wang said the information infrastructure and tools created for CORD-19 should yield dividends long after the current pandemic has passed. “We hope that they can help with whatever crises come up in the future,” she said.
This report has been updated with comments from Wang and Lo. Other authors of the newly issued preprint paper on CORD-19, titled “CORD-19: The COVID-19 Open Research Dataset,” include Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Rodney Kinney, William Merrill, Brandon Stilson, Chris Wilhelm, Douglas Raymond, Daniel Weld, Oren Etzioni and Sebastian Kohlmeier of AI2; Darrin Eide, Zhihong Shen, Kuansan Wang and Boya Xie of Microsoft Research; Kathryn Funk and Jerry Sheehan of the National Library of Medicine; Paul Mooney and Devvret Rishi of Kaggle; Ziyang Liu and Alex Wade of the Chan Zuckerberg Initiative; and Dewey Murdick of Georgetown University.