Programming Language Distribution
Question: What are the different programming languages present in an open source project(s), and what is the percentage of each language?
The number of programming languages and the percentage of each language in a project provide some understanding of the skills required from code contributors, as well as the nature of the project itself.
This metric will aid newcomers to a particular open source project, as well as provide open source program managers with a perspective on the project’s profile, in the context of their own experience and organization.
In Value: This metric may be used by developers for identifying projects that rely heavily on languages they use, as part of a job search.
In Risk: This metric can be combined with dependency metrics to determine if there is a prominent language used in a project, but for which a dependency scanner is not yet identified.
In DEI: When inclusive, diverse, and equitable communities are identified, they will have some degree of language distribution.
As an individual looks for new projects to work on, knowing which projects rely on languages they already know, or want to learn, can be one of a number of “personal filters”.
In general, this metric is useful for OSPOs and community managers aiming to understand which languages are most prominent, and which languages are little used but critical.
Language distribution takes into account different properties of each file in a repository with an a priori identifiable computer programming language. As new languages emerge, there may be initial periods where counting tools do not recognize their extensions, in which case they may be counted as “other”. Such periods are typically brief.
Number of Files - The number of files of each language.
Lines of Code - The number of lines of code in each language.
Either lines of code or files can be presented as absolute numbers or as percentages, depending on the application of the metric. In many cases, a simple count of files is useful, while absolute line counts can be harder to compare at a glance because the numbers are much larger.
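As an illustration of the two presentations, the following minimal sketch converts absolute counts into percentages. The function name `to_percentages` and the sample counts are hypothetical, chosen only to show how the file-based and line-based views of the same project can differ.

```python
from typing import Dict


def to_percentages(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert absolute per-language counts into percentages of the total."""
    total = sum(counts.values())
    if total == 0:
        # An empty repository has no meaningful distribution.
        return {language: 0.0 for language in counts}
    return {
        language: round(100.0 * count / total, 1)
        for language, count in counts.items()
    }


# Hypothetical counts for one repository (not taken from a real project).
file_counts = {"Python": 120, "JavaScript": 60, "Shell": 20}
line_counts = {"Python": 45000, "JavaScript": 52000, "Shell": 3000}

print(to_percentages(file_counts))  # {'Python': 60.0, 'JavaScript': 30.0, 'Shell': 10.0}
print(to_percentages(line_counts))  # {'Python': 45.0, 'JavaScript': 52.0, 'Shell': 3.0}
```

In this made-up case Python leads by file count while JavaScript leads by lines of code, which is one reason it can help to report both views.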
Tools Providing the Metric
The Augur-Community-Reports repository currently provides this metric.
GrimoireLab provides this information through the proxy of file extensions.
Augur provides this information in its frontend, as well as through an API endpoint.
Data Collection Strategies
The contents of a repository can be counted by iterating through each file. Several existing tools already do this, including the one used by Augur: https://github.com/boyter/scc
File extensions for some formats, such as Jupyter Notebooks (.ipynb), might be excluded because they obscure the actual language used.
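The file-by-file approach described above might be sketched as follows. This is not how Augur or scc work internally; it is a simplified illustration that assumes a small, hypothetical extension-to-language table (`EXTENSION_TO_LANGUAGE`), counts raw physical lines (including blanks and comments), groups unrecognized extensions under "other", and skips notebook files.

```python
import os
from collections import Counter
from typing import Tuple

# Hypothetical and deliberately tiny mapping; real tools ship far larger tables.
EXTENSION_TO_LANGUAGE = {
    ".py": "Python",
    ".js": "JavaScript",
    ".go": "Go",
    ".c": "C",
    ".sh": "Shell",
}

# Formats skipped because the extension hides the language actually used.
EXCLUDED_EXTENSIONS = {".ipynb"}


def language_distribution(repo_path: str) -> Tuple[Counter, Counter]:
    """Return (files per language, physical lines per language) for a directory tree."""
    files: Counter = Counter()
    lines: Counter = Counter()
    for root, dirs, filenames in os.walk(repo_path):
        # Do not descend into version control metadata.
        dirs[:] = [d for d in dirs if d != ".git"]
        for name in filenames:
            extension = os.path.splitext(name)[1].lower()
            if extension in EXCLUDED_EXTENSIONS:
                continue
            language = EXTENSION_TO_LANGUAGE.get(extension, "other")
            try:
                # Read as bytes so that unusual encodings do not raise errors.
                with open(os.path.join(root, name), "rb") as handle:
                    line_count = sum(1 for _ in handle)
            except OSError:
                # Unreadable files (for example broken symlinks) are skipped.
                continue
            files[language] += 1
            lines[language] += line_count
    return files, lines


if __name__ == "__main__":
    file_totals, line_totals = language_distribution(".")
    print("Files:", dict(file_totals))
    print("Lines:", dict(line_totals))
```

A dedicated counter such as scc additionally separates code, comment, and blank lines, so its figures will generally differ from the raw line counts produced here.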
- Dawn Foster
- Beth Hancock
- Matt Germonprez
- Elizabeth Barron
- Daniel Izquierdo
- Kevin Lumbard
- Sean Goggins