
What we Learned from the CHAOSS Data Science Hackathon

July 17, 2025
A bunch of television screens hanging from the ceiling and displaying data

This blog post was co-authored by Chan Voong, Sal Kimmich, Nandana Krishnan, and Ishan Juneja

We held our first-ever CHAOSS Data Science Hackathon in Denver, co-located with the Linux Foundation’s Open Source Summit and CHAOSScon NA. Several first-time hackathon participants had a great time! We felt it was successful overall, and we hope some of the participants will stick around and continue to participate in the CHAOSS Data Science Working Group.

During the hackathon, we focused on three ongoing CHAOSS Data Science Working Group projects: Relicensing and Forks, Archival of Open Source Projects, and Projects Moving to Foundations. Here’s what we accomplished for each of them.

Relicensing and Forks

Analyzing PR data before and after a relicense/fork event helps us understand open source project and community behavior. For the hackathon, we took inspiration from Dawn’s code, which looked at commits by people and organizations. To build on the case study, we focused on pull requests and visualized how the number of PRs changed before and after a relicense or fork.
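As a minimal sketch of the counting step behind such a comparison (the dates, the event date, and the helper function here are all illustrative, not the hackathon’s actual notebook or data):

```python
from datetime import date

# Hypothetical PR creation dates for a project (stand-ins for real data
# pulled from the GitHub API or a project database).
pr_dates = [
    date(2023, 1, 5), date(2023, 2, 14), date(2023, 3, 9),
    date(2023, 4, 2), date(2023, 5, 20), date(2023, 6, 11),
    date(2023, 7, 1), date(2023, 8, 30),
]

# Assumed relicense/fork event date -- purely illustrative.
event = date(2023, 5, 1)

def prs_before_after(dates, event_date):
    """Count PRs opened before vs. on/after the event date."""
    before = sum(1 for d in dates if d < event_date)
    after = len(dates) - before
    return before, after

before, after = prs_before_after(pr_dates, event)
print(f"PRs before event: {before}, after: {after}")
```

In practice the same split can be grouped by month and plotted, which is what makes a drop-off or surge around the event visible at a glance.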

This group was led by Chan Voong and Cali Dolfi and consisted of students and faculty from CU Denver’s Department of Mathematical and Statistical Sciences. The group went in a slightly different direction, exploring projects they were interested in rather than the ones we expected them to look at, but that’s great! Projects important to them included Bayesian-Inference, Stan, and TensorFlow, and the team’s primary project, which they maintain, is <T>LAPACK.

After visualizing those project PRs in a Jupyter Notebook, we checked out 8knot, which led to interest in submitting these projects to be added to 8knot for visualizations.

There was great discussion around other metrics, such as the lottery factor, and the possible repercussions if the projects they cared about were relicensed to a non-OSI-approved license. Learn more about the Relicensing and Forks project and how to get involved.

Archival of Open Source Projects

The archive project focused on data cleaning and classification, since data scientists spend a lot of time on these kinds of manual tasks, which are necessary for ensuring data accuracy. The advantage was that anyone could help out with this project regardless of their skill set. For this project, we categorized popular open source projects that had been archived on GitHub by the primary reason each project had been archived.

We categorized 45 out of 729 projects as part of the hackathon, and here are the categories for those 45 projects.

  • Corporate: 3
  • Inactive: 13
  • Moved: 7
  • Personal: 4
  • Technology: 18
  • Unevaluated: 684
  • Grand Total: 729

While we classified projects into a single, primary category, we also had some interesting discussions about how and why projects are archived and how the reasoning for archival can be complex. For each project that we classified, we added notes about how we came to our decision and any other categories that might have influenced the decision.

Learn more about the Archival project and how to get involved.

Projects Moving to Foundations

This group explored the question, “How can we better understand the health of open source projects across different ecosystems?” Using public datasets from Apache, CNCF, and Eclipse, four developers (Sal Kimmich, Ejiro Oghenekome, Nandana Krishnan, and Ishan Juneja) transformed raw project metadata into visual insights. Their work revealed how lifecycle patterns, language choices, and infrastructure gaps affect the sustainability of open source.

Which Projects Graduate—and Which Get Stuck? One of the central outcomes was a unified status timeline (Incubated → Graduated → Retired) for tracking project lifecycles across Apache and CNCF. By merging public metadata from foundation websites and CSV datasets, it became possible to identify projects that had stalled or succeeded in reaching maturity. We had a few key insights:

  • Projects written in Go and Python showed significantly higher graduation rates.
  • Projects built with Java were more likely to be retired or linger in long-term incubation.
  • Some Apache projects had been in incubation for over a decade without graduating.
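A minimal sketch of how a graduation-rate-by-language comparison like the one above could be computed, assuming project records with language and lifecycle-status fields (the records, names, and values below are hypothetical, not the hackathon’s data):

```python
from collections import Counter

# Hypothetical project records merged from foundation metadata.
projects = [
    {"name": "proj-a", "language": "Go", "status": "Graduated"},
    {"name": "proj-b", "language": "Go", "status": "Incubated"},
    {"name": "proj-c", "language": "Python", "status": "Graduated"},
    {"name": "proj-d", "language": "Java", "status": "Retired"},
    {"name": "proj-e", "language": "Java", "status": "Incubated"},
]

def graduation_rate_by_language(records):
    """Share of projects per language that reached Graduated status."""
    totals, graduated = Counter(), Counter()
    for p in records:
        totals[p["language"]] += 1
        if p["status"] == "Graduated":
            graduated[p["language"]] += 1
    return {lang: graduated[lang] / totals[lang] for lang in totals}

rates = graduation_rate_by_language(projects)
```

The same per-language tally, run over the merged Apache and CNCF datasets, is the kind of aggregation that surfaces patterns like Go and Python projects graduating more often.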

What Can Metadata Tell Us About a Project’s Health? Another analysis looked at the quality and structure of the metadata available across foundation datasets. Specifically, we explored the distribution of path types and status fields from podling histories and CNCF websites to surface inconsistencies. Findings included:

  • Overrepresentation of certain paths like sandbox, with minimal detail on contributor activity or oversight.
  • Incomplete lifecycle data in projects still marked as “current” or “incubating.”
  • Descriptive gaps that could obscure the true state of a project’s community or release status.

Are Contributors Visible—and Is Anyone Listening? Contributor and communication data were analyzed for Eclipse and CNCF projects. This revealed a startling insight: many active projects are flying blind when it comes to external visibility. The team found that:

  • A significant percentage of projects lacked a public-facing blog or news source.
  • Many projects did not offer a Slack, Gitter, or any other form of community channel.
  • Contributor and committer data was sparse, inconsistent, or missing altogether.

There is still much work left to do, but you can read more about this group’s results from the hackathon and how to get involved.

Learn more

All of these are ongoing projects that people can get involved with by joining the CHAOSS Data Science WG! You can learn more about what we’ve been up to by reading our latest update from June on the CHAOSS blog.

Photo by Leif Christoph Gottwald on Unsplash
