GitHub’s status as the world’s primary software-collaboration and file-sharing site attracts users of all kinds. Any one account holder can access the existing public packages for implementation—known as repositories—into their own projects, whether it be software engineering, web development, or data science. Individual users tend to create these repositories, relying upon community support for maintenance and upkeep. In recent years, organizations have increasingly waded into the public-facing side of GitHub to share their own packages and software tools. In our dataset, roughly 13 percent of repository owners are organizations (10 percent verified organizations). This practice raises a key question: why do private organizations offer software packages for free?
The rapid expansion and adoption of open source software, primarily through GitHub, aspires to provide high value in the software-development landscape. Individual users can rely upon a large pool of programmers to maintain code packages rather than handle development independently. The network of programming support can extend the shelf life of software by allowing timely updates and fixing key deficiencies. Furthermore, the ability to access a network of expertise not only extends the longevity of software but also provides a mechanism for improvement. Reducing computation time, implementing state-of-the-art methods, and detailing implementation methods are all ways in which open source software development can enhance the state of emerging software.
Figure 1. Screenshot of SynapseML Repository in GitHub
Organizations can tap into all of these benefits when they list their packages as public on GitHub. Certain software implementations serve as entry points for more sophisticated packages. For example, the publicly available SynapseML repository from Microsoft (seen in Figure 1) allows users to create machine learning (ML) models for large datasets. Microsoft—which acquired GitHub in 2018—benefits from the open source offering in two ways: First, users who implement and test the software can report bugs back to Microsoft’s internal software team in an expedited and efficient manner (currently standing at 227 open issues as of writing). Second, users who wish to implement the software at scale are more likely to use Azure, Microsoft’s paid cloud computing platform, because of built-in compatibility. In short, GitHub offers the combination of essentially free bug testing for software and advertising for premium products directly towards a well-versed consumer base for an organization like Microsoft.
Figure 2. AI/ML Developers Know Microsoft but Rely on Explosion
Figure 2 shows the top organizations by “stars,” “commits,” and “dependencies” for AI/ML-related repositories in CSET’s dataset of GitHub repositories. Microsoft stands as the clear leader in terms of GitHub popularity: “stars” and “commits.” Users can star any public repository, which essentially functions as a bookmark. Commits occur when a repository manager approves a user-provided modification to the code. Microsoft leads the other two front-runners in both of these metrics among AI/ML repositories. This means that users on GitHub are more likely to visit and contribute to Microsoft’s set of AI/ML repositories than any other organization’s, and that the company benefits the most from open source availability in terms of total number of contributions and total number of interested users for its products. However, in terms of direct implementation of these repositories, Explosion—a software company based out of Germany—outpaces Microsoft by a factor of 10. Explosion is the developer of the incredibly popular and “industrial-strength” NLP package spaCy. While Microsoft may be more likely to be implemented in private projects, and thus not captured within our dataset, Explosion ranks as the leader in public implementations of AI/ML-related software. Explosion benefits from the extremely active issues board on each of its repositories, providing insight to improvements that may otherwise have not been detected.
GitHub metadata provides a lens through which we can investigate implementation and explore development stages for key technologies and methods. Tracking popularity and implementation metrics across topics, organizations, programming language, or any other captured field offers the opportunity for real-time analysis of the state of software distribution. Whereas academic publication occurrences of key technological topics capture broader use cases, GitHub metadata estimates the dispersion of advanced technologies.