
URL: https://thenewstack.io/navigating-the-messy-world-of-open-source-contributor-data/


Navigating the Messy World of Open Source Contributor Data

Open source communities have varying data sources. Are you counting the right contributions while still protecting privacy?
Oct 19th, 2021 10:47am by Jennifer Riggins
Image by MasterTux from Pixabay
Kasten sponsored this post. Insight Partners is an investor in Kasten and TNS.

Open source communities are inherently open and distributed, which makes open source software data inevitably messy. And when someone’s privacy — especially someone who has been contributing to your project for free — is at risk, messy is not OK. At this year’s KubeCon+CloudNativeCon North America, Sophia Vargas, research analyst and open source program manager at Google’s open source office, talked about how best to navigate the regulations and ethics of data management in an open source community.

Or, as she called it, “Things I wish I knew before I started this job.”

Define Your Project Contributions to Define Your Data

Part of the pain is that there is no single definition of an open source project. Yes, an open source project is defined by the open source license it holds, but from there, it could be open source code, datasets, a framework, metrics, a common language or a design. A license can apply to all sorts of assets, each with its own assortment of data.

A lot of the data collected within an open source community is typically centered on contributions: who is doing what, who to acknowledge, who needs more encouragement, how to attract more users, how to embrace diversity, equity and inclusion, and other goals that traditional business models may not prioritize.

Since an open source model isn’t necessarily a business model, it demands whole new metrics that are community-based.

Vargas argues that a community contribution focus should determine your data sources. However, according to research by the University of Vermont from earlier this year, the “idiosyncratic methods” that most open source projects are measured by are notoriously hard to mine, leaving, for example, GitHub’s top contributors to be measured solely by actions taken on the platform, like commits, pushes and patches. But there are many other contributions available. The University of Vermont All Contributors Model acknowledges more diverse contributions, including outreach, finance, infrastructure, and community management. These groupings include:

  • Data Source #1: Git, issues — the most often measured contributions like writing and reviewing code, bug triaging, quality assurance and testing, and any security-related activities.
  • Data Source #2: Surveys — this is useful to measure community engagement, event organization, demographics, teaching and tutorial development.
  • Data Source #3: Forums — documentation writing and editing, localization and translation, and troubleshooting and support.
  • Data Source #4: Site Analytics — marketing, public relations, social media management and content creation.
  • Data Source #5: Legal and finance — as projects scale, there are also significant contributions from legal counsel and financial management that are often measured in more traditional business ways.

This research found that more community-generated systems of contribution acknowledgment are necessary, not only to cultivate a culture of collaboration but to increase bug visibility and overall innovation.

Now, What Are You Doing with All That PII?

To start, you have to acknowledge you’re only looking at a subset of the information, and most of that is just in your repository. Vargas advocates for investigating alternative trajectories, including acknowledging that public data sometimes deserves classification. It can certainly be valuable to gather information, ranging from demographics, in an effort to improve diversity, to who is actively promoting your open source project on social media.

This public data most often comes from social media, but could come from anywhere, including what you might say speaking at an event or meetup. It could also be anecdotal data that a contributor shares with a maintainer in a chat. Vargas said that the GitHub push event payload includes an author name in order to verify the person who made the original commit.

But if it’s public data, what’s the problem? When you combine datasets, you risk exposing personally identifiable information, or PII. Risk classifications depend on the parties involved and on where regulations, organizations and projects intersect.

“We encourage a right to anonymity, but, at the same time, the more you learn about someone on GitHub or on other channels, the more you can infer about them,” Vargas said.

Screenshot from Vargas’ talk, in which she emphasized the ‘read the fine print’ message. She had passively accepted the T&Cs to attend KubeCon Barcelona 2019 (the last in-person event), and her face has graced the KubeCon registration page ever since.

Using herself as an example, she noted that she had introduced herself at the start of the talk as working at Google and had shared her Twitter profile, which says she lives in New York. Her LinkedIn profile says when and where she went to college. If you cross that with any places she has published or commits she’s made on open source projects, it aggregates into a lot of information, more than would typically be necessary for an open source maintainer to have. Still, if an open source project is looking to increase diversity, equity, inclusion and representation (and it should be), it needs many of those demographics. PII can also help you decide who to promote or give awards to within your community.
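
To make the aggregation risk concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the records are synthetic, the field names are invented, and `link_profiles` is a hypothetical helper, not something from Vargas’ talk. It only shows how quickly attributes pile up once separate public sources share a handle.

```python
# Synthetic records only: none of these describe a real person.
commits = [{"handle": "example-dev", "email_domain": "example.com", "repo": "demo/project"}]
social = [{"handle": "example-dev", "city": "Springfield"}]
professional = [{"handle": "example-dev", "employer": "Example Corp", "graduation_year": 2010}]


def link_profiles(*sources):
    """Merge records that share a handle into one profile per handle."""
    profiles = {}
    for source in sources:
        for record in source:
            profile = profiles.setdefault(record["handle"], {})
            profile.update(record)
    return profiles


profiles = link_profiles(commits, social, professional)
for handle, profile in profiles.items():
    # List every attribute now attached to the handle after linking.
    attributes = sorted(key for key in profile if key != "handle")
    print(handle, len(attributes), attributes)
```

Each source on its own holds one or two attributes. The linked profile holds all of them, which is the point at which a dataset may need a higher risk classification.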

There’s also an argument for organizations that are contributing to open source to understand how their employees are interacting with it. Are you providing the right tooling to support them? Is the documentation sufficient? Of course, this could backfire if an organization decides to promote or compensate based on total number of commits, again acknowledging only the most technical contributions.

And we already know there are many reasons to want to separate your identity from your contributions, ranging from safety to a higher chance of success. It’s widely cited that, when women hide their gender on GitHub, they are more likely than their male counterparts to have pull requests accepted, meaning that women are statistically better coders but are more likely to have their open source contributions ignored.

There’s no clear answer to any of this, but when your contributors’ privacy is at stake, maintainers should constantly consider how much data they need, how they are getting it, how they are sharing it and, especially, who could be harmed by that data retention and sharing.

Different Ways of Managing and Measuring Open Source Data

Another challenge is that these open source datasets are innately as distributed as the systems that generate them. Vargas, at Google, knows there are about 300 repositories within Kubernetes alone, although maybe they aren’t all active.

It’s also hard with all that interconnectivity to understand where one project stops and another begins, with one perhaps forked into another or a new one being added. Git activity logs do not stay consistent over time. Historical information can change. Somebody could have made a mistake or forgotten to count something.

“In the context of git, if the project has a lot of forks [and] that fork gets merged back in, that fork becomes part of the history of the project, [and then] your historical and current state don’t necessarily match.” Vargas went on to warn that GitHub APIs have rate limits, which means you often end up with lost or missing data. There could also be cleaning and correction over time, which will also alter the data.
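
One way to avoid silently losing data to rate limits is to pause when the quota runs out rather than skip pages. The sketch below is one possible approach, not a method described in the talk. It assumes responses carry GitHub’s documented `x-ratelimit-remaining` and `x-ratelimit-reset` headers; `fetch_page` is a hypothetical callable you would supply, returning a list of items and the response headers.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how long to pause before the next request, in seconds.

    Assumes x-ratelimit-reset is a UTC epoch timestamp, as GitHub documents.
    """
    now = time.time() if now is None else now
    lowered = {key.lower(): value for key, value in headers.items()}
    remaining = int(lowered.get("x-ratelimit-remaining", 1))
    if remaining > 0:
        return 0.0
    reset_at = float(lowered.get("x-ratelimit-reset", now))
    return max(0.0, reset_at - now)


def collect_pages(fetch_page, sleep=time.sleep):
    """Collect every page, waiting out the rate limit instead of dropping data.

    fetch_page(page_number) is hypothetical: it returns (items, headers),
    and an empty items list marks the end of the results.
    """
    results, page = [], 1
    while True:
        items, headers = fetch_page(page)
        if not items:
            return results
        results.extend(items)
        delay = seconds_until_reset(headers)
        if delay:
            sleep(delay)
        page += 1
```

Recording when each collection run happened also helps later, since the history itself can change between runs.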

“There’s a little bit of a tribal knowledge problem if you don’t realize the perimeters of the project,” she said.

With so much data, it can be attractive to just lump it all together to see what comes up.

“Huge tables with everything in the same place is much easier to work with, but when we are talking about this information, it’s often better to give back to the people it’s about. But when you share back with your community, it increases the risk levels,” Vargas said.

For many of these situations, is the actual data even necessary? Could you leverage anonymized or synthetic data instead?

Whenever possible, Vargas recommends tokenizing and compartmentalizing data, and only allowing individual access that can be revoked. These big tables shared around will always be higher risk.
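
As a rough illustration of tokenizing and compartmentalizing, the sketch below swaps email addresses for keyed tokens and keeps the token-to-email lookup in a separate structure. The choice of HMAC-SHA256, the field names and the `compartmentalize` helper are all assumptions made for this example; the talk did not prescribe a specific technique.

```python
import hashlib
import hmac


def tokenize(identifier, secret_key):
    """Replace an identifier with a keyed token (HMAC-SHA256, hex encoded).

    Without the secret key, the token cannot be recomputed from the identifier.
    """
    normalized = identifier.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()


def compartmentalize(records, secret_key):
    """Split records into an activity table (tokens only) and a lookup table."""
    activity, lookup = [], {}
    for record in records:
        token = tokenize(record["email"], secret_key)
        lookup[token] = record["email"]  # store separately, under stricter access
        activity.append({"contributor": token, "commits": record["commits"]})
    return activity, lookup
```

The activity table can then be analyzed or shared more widely, while the lookup table and the key stay with the few people who genuinely need to resolve a token back to a person.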

Also, be cognizant that the method by which you collect data can actually affect that data. For example, if you contact someone directly, such as through a survey, they may not be as candid. Demographics, opinions, feedback, satisfaction and morale are best gathered anonymously. And of course provide opt-outs, so people do not have to answer every question if they do not want to.

Vargas recommends always being transparent about why you’re collecting data, like you want to encourage more participation or representation.

“If your community is very tiny, you may be able to figure out who they are. Decide if things are appropriate to share back,” Vargas warned. Even anonymously, it could be clear who is the source. We’ve already written about instances where DEI survey respondents were concerned about being identifiable, such as the only woman in her country working on a project.
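
A common safeguard before sharing survey results back is to suppress groups that are too small to hide in. The following is a minimal sketch under stated assumptions: the threshold of five and the `summarize_with_suppression` helper are illustrative choices, not figures or tooling from the talk.

```python
from collections import Counter


def summarize_with_suppression(responses, field, min_group_size=5):
    """Count answers for one survey field, folding small groups together.

    Skipped questions (missing or None) are ignored, which respects opt-outs.
    """
    counts = Counter(r[field] for r in responses if r.get(field) is not None)
    summary, suppressed = {}, 0
    for value, count in counts.items():
        if count >= min_group_size:
            summary[value] = count
        else:
            suppressed += count
    if suppressed:
        summary["suppressed (small groups)"] = suppressed
    return summary
```

Even the combined bucket can be revealing in a very small community, so it is worth checking whether that total is itself large enough to publish.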

You can also draw conclusions from indirect data, often aggregated or inferred. Just be sure to exclude the 120 million bots Vargas estimates are on GitHub alone.
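
Filtering out bots usually comes down to heuristics. This sketch assumes each event carries an `author` record shaped like a GitHub REST API user object, where app accounts have `type` set to `"Bot"` and a login ending in `[bot]`. The `KNOWN_BOT_LOGINS` set is a hypothetical, hand-maintained list for automation that runs under ordinary user accounts.

```python
KNOWN_BOT_LOGINS = {"example-ci-account"}  # hypothetical, maintained by the project


def is_probable_bot(account):
    """Heuristic check: API account type, login suffix, or the project's own list."""
    login = account.get("login", "").lower()
    return (
        account.get("type") == "Bot"
        or login.endswith("[bot]")
        or login in KNOWN_BOT_LOGINS
    )


def human_contributions(events):
    """Keep only the events whose author does not look like a bot."""
    return [event for event in events if not is_probable_bot(event["author"])]
```

No heuristic catches everything, so it helps to state in your results how bots were identified.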

When in doubt, consult with your legal and compliance colleagues. And always have your privacy policy publicly documented.

Be Careful What You Share, Too

Remember, the act of measuring influences what is being measured.

Open source leaderboards are a clear motivator. But be aware of what puts people on top. Often GitHub leaderboards just tout the number of commits, which could encourage a mass of smaller, less significant contributions and showcase only technical contributions. Be sure to acknowledge more qualitative contributions as well.

Focus on what behaviors and contributions you want to encourage and see more of in your community. Vargas offered some logical things to measure, depending on what outcome or outcomes you hope to achieve:

  • Project growth — activity and contributions per repo per time
  • Contributor growth — new versus retained contributors over time; activity per contributor over time
  • Contributor diversity — the percentage of contribution from outside of your organization
  • Contribution experience — time to respond or close pull requests
  • Contribution quality — the percentage of pull requests closed and merged
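
Several of these measures can be computed from a plain list of pull request records. The sketch below is illustrative only: the record fields (`opened_at`, `closed_at`, `merged`, `author_org`), the sample data and the `contribution_metrics` helper are assumptions for the example, and how you define each metric is a choice you should state alongside the results.

```python
from datetime import datetime
from statistics import median


def contribution_metrics(pull_requests, home_org):
    """Compute merge rate, outside-organization share and median days to close."""
    total = len(pull_requests)
    closed = [pr for pr in pull_requests if pr["closed_at"] is not None]
    merged = [pr for pr in closed if pr["merged"]]
    outside = [pr for pr in pull_requests if pr["author_org"] != home_org]
    days_to_close = [
        (pr["closed_at"] - pr["opened_at"]).total_seconds() / 86400
        for pr in closed
    ]
    return {
        "percent_closed_and_merged": 100 * len(merged) / total if total else 0.0,
        "percent_from_outside_org": 100 * len(outside) / total if total else 0.0,
        "median_days_to_close": median(days_to_close) if days_to_close else None,
    }


# Synthetic sample data for illustration.
sample = [
    {"opened_at": datetime(2021, 1, 1), "closed_at": datetime(2021, 1, 3),
     "merged": True, "author_org": "example-org"},
    {"opened_at": datetime(2021, 1, 1), "closed_at": datetime(2021, 1, 5),
     "merged": True, "author_org": "other-org"},
    {"opened_at": datetime(2021, 1, 2), "closed_at": datetime(2021, 1, 8),
     "merged": False, "author_org": "other-org"},
    {"opened_at": datetime(2021, 1, 3), "closed_at": None,
     "merged": False, "author_org": "example-org"},
]
metrics = contribution_metrics(sample, home_org="example-org")
print(metrics)
```

None of these numbers needs a contributor’s name or email, which makes them a reasonable starting point when you want to limit the PII you hold.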

Vargas recommends keeping contributions relevant, at least within the last year, and sharing that time period along with the results. And note anything external that would affect results, like the challenges of contributing during a pandemic.

Transparency matters. Always state what you are counting. Define what you are counting as a “contribution” or “engagement.”

And remember a good data scientist states her sources. What methods, assumptions, biases and boundaries influenced your data collection and processing? What types of contributors does this represent? Vargas shared an example of how she handles an “About this Data” section of data sharing.

Own Your Accountability

In the end, Vargas says you just have to assume you are accountable. She recommends that you:

  • Know your licenses and regulations, as well as those of anything you’re integrating with and the tools you’re using
  • Openly communicate your purpose and intent
  • Design policies that foster trust, transparency and safety
  • Provide an option to opt in and out
  • Ask if you really even need to use personally identifiable information
  • Always read the fine print to understand your own rights as a contributor

Also, while this article focuses a lot on GitHub, the world’s biggest open source repository host, don’t think that’s your one and only source of data. Go where your community is.

Just remember, “Every time you add more PII, it adds more risk and potential vulnerability to your dataset.”

Jennifer Riggins is a tech storyteller and journalist, event and panel host. She bridges the gap between business, culture and technology, with her work grounded in the developer experience. She has been a working writer since 2003, and is based...
Kubecon+CloudNativeCon is a sponsor of The New Stack.
TNS owner Insight Partners is an investor in: Pragma, Kasten, Veeam.