URL: https://thenewstack.io/open-source-ai-what-about-data-transparency/

Open Source AI: What About Data Transparency? - The New Stack


Open Source AI: What About Data Transparency?

AI uses both code and data, and this combination continues to be a challenge for open source, said experts at the United Nations OSPOs for Good Conference.
Jul 10th, 2024 9:09am by Steven J. Vaughan-Nichols
Photo of Ambassador Philip Thigo by Steven J. Vaughan-Nichols.

NEW YORK — At the United Nations OSPOs for Good Conference, we were once more reminded of the curious situation of AI and open source programs: While the foundations of AI are built on open source tools and libraries, almost no major AI program is truly open source. OpenAI’s ChatGPT, Google’s PaLM (and its successor, the multimodal Gemini), and Meta’s Llama-3 are often touted as open, but they’re not. They come with significant restrictions that don’t meet the definition of open source software.

Enter the Open Source Initiative (OSI), the stewards of the Open Source Definition. Recognizing the growing importance of AI and the need for clarity in this space, the OSI has embarked on an ambitious project to define what “open source AI” should mean. This effort brings together 70 experts, including researchers, lawyers, policymakers and representatives from tech giants like Amazon, Google and Meta.

That’s easier said than done. As Stefano Maffulli, OSI’s executive director, noted in a panel on open source and AI, “While there’s broad agreement on the overarching principles, it’s becoming obvious that the devil is in the details.”

The open source community is a big tent, encompassing everyone from basement hackers to grassroots activists to Fortune 500 companies, each with their own priorities and concerns.

In short, “we need to have new guardrails and new guidelines when it comes to what open source AI actually means,” said Ashley Kramer, GitLab’s chief marketing and strategy officer, during the panel discussion.

LLM Data Transparency: a Thorny Issue

It became clear from the panel’s discussion that the biggest challenge in defining open source AI lies in addressing the role of training data. Large language models (LLMs) rely on vast data sets, often scraped from the internet without explicit permission. This messy data raises thorny questions about privacy, copyright and ethics.

Indeed, we know some of this data is flatly illegal. “One of the largest data sets of images [LAION-5B] that is being used for training a lot of the image generation AI tools recently has contained child sexual abuse images,” Maffulli said. “We need data set maintainers to notice and remove those things.”

The OSI’s draft definition attempts to sidestep the data issues by focusing on the “four freedoms” traditionally associated with open source software: The freedom to use, study, modify and distribute the AI system. It focuses on the code and not the data.

Should an open source AI model be required to disclose its training data? If so, how can this be reconciled with privacy concerns and the practical challenges of sharing petabytes of information? The answer is not just yes but hell yes, to many critics of the OSI AI definition draft.

As Tom Callaway, principal open source technical strategist at Amazon Web Services, wrote before the conference on LinkedIn, “You cannot build an LLM without data. Without the data, the LLM doesn’t just lack any purpose; it doesn’t exist. That makes the data a functional and required source component of an LLM.”

He and others argue that any definition of open source AI will be incomplete without addressing the data issue.

Maffulli acknowledged that this is a real concern: “This needs to be debated and finalized.” But, he added, “pushing for radical openness for data has drawbacks and brings issues. So it’s going to be a balance of intentions and what’s going to be the best outcome for the general public.”

However, another panelist, Sasha Luccioni, AI and climate lead at Hugging Face, sees it another way. Luccioni believes being an open source purist is a mistake.

“You can’t really expect all companies to be 100% open source as it’s defined by the open source license,” she said during the panel. “That’s why there is a multitude of licenses. Saying that this is not true open source can antagonize companies. You can’t expect companies to just give up everything that they’re making money off of and do so in a way that they’re comfortable with.”

She believes that “there’s a responsible AI license that can exist” — one that is open source friendly — “where you can kind of define your terms of open source. By tweaking the language a little bit, you can build forward in a way that companies, governments and academia are all comfortable with instead of saying this project or license is not open source.”

‘We Have To Do It Together’

None of the open source advocates at the conference that The New Stack spoke with was pleased with this take. However the OSI AI definition works out, the issue of what is — and isn’t — open source AI remains critical to the open source community.

It’s also important outside the open source community. As Ambassador Philip Thigo, special envoy on technology for Kenya, observed in a keynote address at the conference devoted to open source and AI, “Open source AI ensures that many Global South communities can build their own AI programs and LLMs.”

These countries can’t afford to pay an OpenAI for their AI needs. They need open source, global standards and interoperability to build AI systems to address their health, climate and education needs.

Looking ahead, “we have to do it together,” Kramer said on the conference panel, indicating that open source is the way to do it.

“We must understand the data that was foundational for the model,” Kramer said. “While I love the hype around AI and I love the direction it’s going, we saw very similar patterns with the Internet and the rise of cloud technology. The faster we move, the more things we miss. So it takes a group and it takes an open source AI guardrail model to really figure out how to get there fast with privacy, trust and security at top of mind.”

Stay tuned. We’re still writing the story of open source AI. As the OSI and others grapple with these complex issues, the outcome will have profound implications for the future of AI development, innovation and governance. The challenge lies in finding a definition that preserves the spirit of openness while addressing the unique challenges posed by data. This task may require rethinking some long-held assumptions about what it means to be “open source” in the age of AI.

Steven J. Vaughan-Nichols, aka sjvn, has been writing about technology and the business of technology since CP/M-80 was the cutting-edge PC operating system, 300bps was a fast internet connection, WordStar was the state-of-the-art word processor, and we liked it.
Amazon Web Services, GitLab and Google Cloud are sponsors of The New Stack.
TNS owner Insight Partners is an investor in: OpenAI, Ambassador.