URL: https://thenewstack.io/building-starcoder-an-open-source-llm-alternative/

⇱ Building StarCoder, an Open Source LLM Alternative - The New Stack


2023-06-07 08:19:35

Building StarCoder, an Open Source LLM Alternative

Find out how Big Code created an alternative open source large language model that can be used to create AI coding assistants and chats.
Jun 7th, 2023 8:19am by Loraine Lawson

A challenge with proprietary large language models, particularly for regulated industries, is that they lack transparency in how they are developed.

This is not an insignificant issue. For instance, amid all the hullabaloo around AI assistants, it’s easy to forget that OpenAI, Microsoft and GitHub still face a lawsuit over the coding assistant Copilot. Last month, a judge allowed the lawsuit to move forward despite an attempt to have it dismissed, which, to be fair, is a standard move in lawsuits. Concerns about the use of personal information also led Italy to temporarily ban ChatGPT and then launch an ongoing investigation into OpenAI’s compliance with the European Union’s General Data Protection Regulation (GDPR).

Big Code is attempting to avoid that problem by open sourcing its large language models for greater transparency and by taking steps to ensure the training data is “ethically sourced,” so to speak.

Why Create an Open Source Model

StarCoder: May the Source Be With You, a paper about the project posted to arXiv, the preprint server hosted by Cornell University, explained why creating the open source model was necessary. It noted that while OpenAI and other AI startups have made their LLMs available to the general public through a paid API, they have not shared all the details of the development process.

“While API access allows researchers to experiment with these models, it limits their ability to research LLM safety and alignment and inspect the models’ inner workings,” the paper noted. “Additionally, the high development costs make it nearly impossible for academic institutions to develop these models from scratch, which has created anxiety among academic researchers about whether they can meaningfully contribute to new AI breakthroughs.”

Another drawback of proprietary systems is the inability to adapt them to your own domain or codebase, the StarCoder team noted in a recent blog post about how developers can create their own coding assistant with the LLM.

The model isn’t just for code completion, either, said Leandro von Werra, a machine learning engineer at Hugging Face and co-lead on the project. The model isn’t just trained on raw code but also on GitHub commits and issues, which taught it a lot about chat.

“The model can also respond, for example, to GitHub issues,” he said. “One thing that was quite interesting that we found is if we just showed the model a lot of examples of conversations about coding problems, like a conversation between a human and a hypothetical assistant, the model would also be able to answer questions. So we were able to use it as a tech assistant, where you can say, ‘I have this error in Python. What should I do?’ It would try to help you, which was a little bit surprising because it was primarily trained on code, not to chat.”

Training it a bit more explicitly yields better results, he said, adding that the Big Code team has created an alpha version of a chat assistant called StarChat.
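The prompting trick von Werra describes, showing a code model a few example conversations and letting it continue the pattern, can be sketched roughly as follows. This is a minimal illustration only: the speaker labels, the example turns and the `build_dialogue_prompt` helper are all assumptions made for the sketch, not the prompt the Big Code team actually used.

```python
# Hypothetical few-shot dialogue prompt. The speaker labels and example turns
# are illustrative assumptions, not the Big Code team's actual prompt.
EXAMPLE_TURNS = [
    ("Human", "How do I reverse a list in Python?"),
    ("Assistant", "Use my_list[::-1] for a reversed copy, or my_list.reverse() to reverse in place."),
    ("Human", "What does 'IndexError: list index out of range' mean?"),
    ("Assistant", "The code asked for a position that does not exist; compare the index with len(my_list)."),
]


def build_dialogue_prompt(question, examples=EXAMPLE_TURNS):
    """Return a prompt that ends where the assistant's next reply should begin."""
    lines = [f"{speaker}: {text}" for speaker, text in examples]
    lines.append(f"Human: {question}")
    # Ending on the assistant label invites the model to complete the reply.
    lines.append("Assistant:")
    return "\n".join(lines)


prompt = build_dialogue_prompt("I have this error in Python. What should I do?")
print(prompt)
```

The resulting string would be passed to the model as an ordinary completion prompt; the model call itself is left out because the article does not describe it.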

The Challenge in Creating Open Source LLMs

Big Code recently released its LLM, StarCoderBase, which was trained on 1 trillion tokens (“words”) in 80 languages from the dataset The Stack, a collection of source code in over 300 languages. The team then further trained StarCoderBase for 34 billion tokens on the Python subset of the dataset to create a second LLM called StarCoder.

StarCoder is not the only open source LLM for code available, but it is the most recent and most performant one, von Werra claimed. There’s also Salesforce’s CodeGen Mono 16B for Python and Replit’s 3B parameter model trained on 20 programming languages.

One of the barriers to creating open source LLMs is that training on the data sets requires a lot of compute power. That’s not something most open source projects can afford. In September 2022, Hugging Face and ServiceNow Research launched Big Code, an open science collaboration. Hugging Face is a large open source community that builds tools for machine learning models based on open source code and technologies. ServiceNow Research is an enterprise AI company. Both companies made their compute clusters available for the large-scale training of Big Code’s StarCoder and StarCoderBase. Since its launch, 600 more members from academic institutes and industry labs have joined the Big Code effort.

StarCoder is trained using only “permissively licensed code on GitHub,” explained von Werra. The 15.5B parameter model is trained on one trillion tokens sourced from 80+ programming languages, GitHub issues, Git commits, and Jupyter notebooks.

The models can copy verbatim from the pretraining data, and even if that code is permissively licensed, it will still require attribution, von Werra added. The VSCode extension includes a quick test to see whether code generated by the model was in the pretraining data, plus a full-text search to find exactly where the code came from and how it is licensed, he explained.

“If you have a 15 billion parameter model, you have 15 billion things that you can adjust and optimize during training,” von Werra said. “You need a lot of GPUs and a lot of data. That’s the main thing. Training StarCoder required roughly 500 GPUs for almost a month, 24 days of training. That’s quite expensive.”
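To put von Werra's figures in perspective, the GPU time can be worked out directly from the round numbers he gives. The sketch below uses only those quoted figures (roughly 500 GPUs, 24 days); the hourly rate is a placeholder chosen to show the arithmetic, not a quoted price.

```python
def gpu_hours(num_gpus, days):
    """Total GPU-hours for a run that keeps every GPU busy around the clock."""
    return num_gpus * days * 24


def training_cost(hours, usd_per_gpu_hour):
    """Rough cost estimate; the rate must be supplied by the reader."""
    return hours * usd_per_gpu_hour


hours = gpu_hours(num_gpus=500, days=24)
print(hours)  # 288000

# HYPOTHETICAL_RATE is a placeholder for illustration, not a real price.
HYPOTHETICAL_RATE = 1.00
print(training_cost(hours, HYPOTHETICAL_RATE))
```

At any realistic per-hour price for data center GPUs, several hundred thousand GPU-hours explains why von Werra calls the run "quite expensive."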

By comparison, GPT is rumored to have a trillion parameters, but bigger is not always better, Ori Goshen, AI21 Labs co-founder and co-CEO, told The New Stack’s Senior Editor Richard MacManus in March.

LLM size “plays a factor, but it’s not the only factor,” said Goshen. “So we’ve stopped referring to the size because it can be misleading about the actual performance of the model.”

Ethically Sourced Training Data

Beyond using only permissively licensed GitHub material, Big Code took other steps to ensure its training data is “ethically sourced.” First, it stripped out personally identifiable information (PII), such as names, email addresses, and passwords, that might be in the code.

“One thing that you can quite easily do with these language models is you can prompt them to generate PII if it was trained on such information,” von Werra said. “You could, for example, input to the model ‘password equals’ and then the model would generate a password that it has seen during pre-training. We created a dataset, an annotated data set where we know if there was PII and we trained a model to detect and then we applied that to the whole data set to remove this information such that you can’t easily abuse the model to create a big data set of personal information.”
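The team's approach, as von Werra describes it, was a trained detection model applied across the data set. As a much simpler stand-in, the sketch below uses two regular expressions to show what redaction looks like in practice. The patterns, the placeholder tokens and the `redact_pii` helper are assumptions for illustration and would miss many real cases.

```python
import re

# Simplified patterns for illustration; a trained detector, as the team used,
# would catch far more cases than these two regular expressions.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Matches quoted assignments such as password = "hunter2" or DB_PASSWORD='abc'.
PASSWORD_RE = re.compile(r"""(\b\w*password\w*\s*=\s*)(["'])(.*?)\2""", re.IGNORECASE)


def redact_pii(source):
    """Replace email addresses and quoted password values with placeholder tokens."""
    source = EMAIL_RE.sub("<EMAIL>", source)
    source = PASSWORD_RE.sub(
        lambda m: f"{m.group(1)}{m.group(2)}<PASSWORD>{m.group(2)}", source
    )
    return source


sample = 'admin_email = "jane.doe@example.com"\npassword = "hunter2"\n'
print(redact_pii(sample))
```

After redaction, a prompt such as "password equals" has no real secret to complete from this file, which is the abuse case von Werra describes.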

Second, Big Code added an opt-out process. Developers can look up whether their code was used to train the model and then, by completing a form, opt out of having it used for future model training.

StarCoder Compared to Copilot

How does it compare to Copilot? One of the first OpenAI models presumed to power Copilot was called Cushman, von Werra said. StarCoder either matched or outperformed Cushman on the HumanEval benchmark, he said.

“We found that on this HumanEval benchmark, they’re either the same performance or better depending on the language (we train on many languages and we evaluate many languages), but in general, we match the performance of the first iteration of Copilot,” von Werra said. “We also outperform it on some other benchmarks that are more related to data science coding tasks; there’s a DS-1000 benchmark that we are pretty good at with StarCoder.”
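The article does not say which metric was reported, but HumanEval results are commonly given as pass@k: the chance that at least one of k generated solutions passes a problem's unit tests. A minimal sketch of the standard unbiased estimator, assuming n samples were generated for a problem and c of them passed:

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples passes,
    given that c of the n generated samples passed the unit tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k includes a pass.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=20, c=5, k=1))  # 0.25
```

A benchmark score is the average of this value over all problems, which is how two models can be compared language by language.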

Loraine Lawson is a veteran technology reporter who has covered technology issues from data integration to security for 25 years. Before joining The New Stack, she served as the editor of the banking technology site Bank Automation News. She has...
TNS owner Insight Partners is an investor in: OpenAI.