U.S. health and technology specialists on Monday said they had launched a new collaborative venture to assemble a dataset of tens of thousands of scientific papers and literature on the coronavirus, which would then be analyzed by artificial intelligence programs to find patterns and answer questions raised by the World Health Organization about the pandemic.
The dataset comprises 29,000 articles, among them 13,000 full-text pieces of medical literature, and will be made available on a special website where data scientists and artificial intelligence programmers can propose tools and software code to unearth insights from the articles, White House officials and experts told reporters in a conference call.
The venture came together after the White House Office of Science and Technology Policy issued a call to tech companies and research groups to figure out how artificial intelligence tools could be used to sift through thousands of research articles being published worldwide on the pandemic, said Lynne Parker, deputy chief technology officer at the White House office.
With data scientists and machine learning experts mining the literature compilation, known as the COVID-19 Open Research Dataset, experts and White House officials expect to get help developing vaccines, forming new guidelines on how long social distancing should be maintained, and gaining other insights, Michael Kratsios, the U.S. chief technology officer, said.
The venture includes the National Library of Medicine, which is part of the National Institutes of Health; Microsoft; the Allen Institute for AI; Georgetown University’s Center for Security and Emerging Technology; the Chan Zuckerberg Initiative (named for Mark Zuckerberg, Facebook’s founder, and his wife Priscilla Chan); and Kaggle, which is a unit of Google.
The Allen Institute’s Semantic Scholar website will host the database of scientific articles and add to the collection over time, while Kaggle’s platform, which provides access to about 4 million artificial intelligence researchers, will receive suggestions from the experts on tools and code to use to mine the database, experts from both organizations said.
Scientists have been working on various coronaviruses and publishing their findings over the years, including the viruses behind SARS, MERS and the latest, COVID-19. Applying artificial intelligence tools to look for commonalities and differences among the thousands of such published articles will help the scientists spot things they may have missed, Eric Horvitz, Microsoft’s chief scientific officer, said.
“It’s difficult for people to manually go through more than 20,000 articles and synthesize their findings,” Anthony Goldbloom, co-founder and CEO of Kaggle, said. “Recent advances in technology can be helpful here. We’re putting machine-readable versions of these articles in front of our community of more than 4 million data scientists. Our hope is that AI can be used to help find answers to a key set of questions about COVID-19.”
“Sharing vital information across scientific and medical communities is key to accelerating our ability to respond to the coronavirus pandemic,” said Cori Bargmann, head of science at the Chan Zuckerberg Initiative. “The new COVID-19 Open Research Dataset will help researchers worldwide to access important information faster.”
Publishers of scientific journals and literature have agreed to make their full articles available to researchers so that machine learning algorithms can look for key insights in them, the experts said. As scientists around the world continue to publish new research, journal publishers have agreed to provide those articles in electronic form ahead of their printed versions, they said.