Policymakers and researchers alike have a vested interest in understanding the emerging technology pipeline. CSET alone has a number of publications that aim to track the prevalence of research on key technologies across the world, from tracking AI-related research and examining China’s Advanced AI Research to measuring AI development and its transformation into patented technology. GitHub metadata contributes to this area of research by providing deeper insight into the utilization of research presented in publications. While not all emerging technology fields are heavily represented on GitHub, subjects including artificial intelligence, computer engineering, and bioinformatics have a large presence within the landscape of open source software.
As covered in earlier snapshots, the advantages of open source software extend beyond idea sharing amongst researchers and into more general experimentation and implementation by any user. GitHub users include students, researchers, computer scientists, data analysts, and many others with an interest in software engineering. CSET’s GitHub repository dataset includes repositories cited by publications in our merged corpus of scholarly literature.[1] By tracking changes in repository metadata before and after publication, we are able to explore the proliferation of the research across GitHub.
Each piece of information about a repository offers insight into a different aspect of adoption. For example, tracking “stars” over time for a given repository gives researchers a loose measure of its popularity. Tracking “commits” shows how actively a repository is updated; the commit contributors are independent members of the GitHub community who directly provide software updates. Depending on the research goal, an analysis can focus on one or more of these metrics to draw conclusions.
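As a rough illustration of tracking stars over time, the sketch below turns a list of starring years into a running total per year. The input list and the function name `cumulative_stars` are hypothetical and chosen only for this example; they do not reflect the structure of CSET's dataset.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical input: the year in which each star was given to one repository.
star_years = [2018, 2019, 2019, 2021]

def cumulative_stars(star_years):
    """Return {year: total stars received up to and including that year}."""
    if not star_years:
        return {}
    counts = Counter(star_years)
    # Include years with no new stars so the series has no gaps.
    years = range(min(counts), max(counts) + 1)
    totals = accumulate(counts[year] for year in years)
    return dict(zip(years, totals))

print(cumulative_stars(star_years))
```

The same shape of calculation would apply to commits, with commit dates in place of starring dates.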
Figure 1. Average Increase in Collaboration Following Publication, by Field
Figure 1 provides an example of the advantages of using GitHub metadata to research the emerging technology pipeline. We first link each repository in our dataset to any associated publications in our merged corpus. Keeping only the linked repositories, we then extract the publication year, which enables the pre- versus post-publication analysis. Next, we find the top field of study for each publication. Most of the linked publications are in data science, but most of the repositories in our dataset are linked to machine learning publications. In other words, a large number of data science publications reference a smaller number of repositories in our dataset, whereas a smaller number of machine learning publications reference a larger number of repositories. Tracking the number of unique contributors to each repository before and after publication helps us understand changes in the network of interested parties.
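The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not CSET's actual pipeline: the record layout, the toy data, and the function name `new_contributors_by_field` are hypothetical, each repository is assumed to link to a single publication, and commits made in the publication year are counted as post-publication.

```python
from collections import defaultdict

# Hypothetical toy data; the field names are assumptions, not CSET's schema.
# One linked publication per repository: its top field and publication year.
publications = {
    "repo_a": {"field": "machine learning", "year": 2019},
    "repo_b": {"field": "computer security", "year": 2018},
}
# Each commit record: (repository, contributor, commit year).
commits = [
    ("repo_a", "alice", 2018),
    ("repo_a", "bob", 2020),
    ("repo_a", "alice", 2020),
    ("repo_b", "carol", 2017),
    ("repo_b", "dave", 2019),
    ("repo_b", "erin", 2020),
    ("repo_c", "frank", 2020),  # no linked publication, so it is dropped
]

def new_contributors_by_field(publications, commits):
    """Average number of contributors first seen after publication, per field."""
    before = defaultdict(set)
    after = defaultdict(set)
    for repo, contributor, year in commits:
        pub = publications.get(repo)
        if pub is None:
            continue  # keep only repositories linked to a publication
        # Assumption: commits in the publication year count as "after".
        if year < pub["year"]:
            before[repo].add(contributor)
        else:
            after[repo].add(contributor)

    per_field = defaultdict(list)
    for repo, pub in publications.items():
        # New contributors are those who never committed before publication.
        new = after[repo] - before[repo]
        per_field[pub["field"]].append(len(new))
    return {field: sum(n) / len(n) for field, n in per_field.items()}

print(new_contributors_by_field(publications, commits))
```

Averaging per repository within each field keeps a field with many repositories from dominating simply because of its size, which is the comparison Figure 1 is described as making.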
This analysis shows that the number of repositories for a given field of study does not necessarily correspond to the scale of the field’s network development, as measured by new contributors after publication. For example, machine learning research has the most repositories linked to publications. However, every other field except bioinformatics has more new contributors, on average, after publication. Similarly, the scale of research in a field of study does not necessarily correspond to network development. The computer security field has nearly four times as many contributors post-publication as the machine learning field, despite machine learning having twice as many linked publications. Data science has relatively few repositories on GitHub and relatively few new contributors, yet it has nearly 70 million publications within the merged corpus. GitHub metadata analysis offers insight into the network of development, implementation, and collaboration of key advanced technologies beyond the realm of academic publication.
- CSET’s merged corpus of scholarly literature includes Digital Science Dimensions (an inter-linked research information system provided by Digital Science, http://www.dimensions.ai), Clarivate’s Web of Science, Microsoft Academic Graph, China National Knowledge Infrastructure, arXiv, and Papers With Code. All China National Knowledge Infrastructure content is furnished for use in the United States by East View Information Services, Minneapolis, MN, USA.