VOOZH about

URL: https://thenewstack.io/developers-can-now-access-the-worlds-fastest-ai-chip/

⇱ Developers Can Now Access the World's Fastest AI Chip - The New Stack


TNS
SUBSCRIBE
Join our community of software engineering leaders and aspirational developers. Always stay in-the-know by getting the most important news and exclusive content delivered fresh to your inbox to learn more about at-scale software development.
REQUIRED
It seems that you've previously unsubscribed from our newsletter in the past. Click the button below to open the re-subscribe form in a new tab. When you're done, simply close that tab and continue with this form to complete your subscription.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.
Welcome and thank you for joining The New Stack community!
Please answer a few simple questions to help us deliver the news and resources you are interested in.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Great to meet you!
Tell us a bit about your job so we can cover the topics you find most relevant.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Welcome!

We’re so glad you’re here. You can expect all the best TNS content to arrive Monday through Friday to keep you on top of the news and at the top of your game.

What’s next?

Check your inbox for a confirmation email where you can adjust your preferences and even join additional groups.

Follow TNS on your favorite social media networks.

Become a TNS follower on LinkedIn.

Check out the latest featured and trending stories while you wait for your first TNS newsletter.

PREV
1 of 2
NEXT
VOXPOP
As a JavaScript developer, what non-React tools do you use most often?
Angular
0%
Astro
0%
Svelte
0%
Vue.js
0%
Other
0%
I only use React
0%
I don't use JavaScript
0%
Thanks for your opinion! Subscribe below to get the final results, published exclusively in our TNS Update newsletter:
NEW! Try Stackie AI
From clobbered drafts to real-time sync
Apr 14th 2026 10:00am, by David Moore
TypeScript 6.0 RC arrives as a bridge to a faster future
Mar 14th 2026 9:00am, by Darryl K. Taft
Mastra empowers web devs to build AI agents in TypeScript
Jan 28th 2026 11:00am, by Loraine Lawson
2024-09-03 07:37:01
Developers Can Now Access the World's Fastest AI Chip
AI / Software Development

Developers Can Now Access the World’s Fastest AI Chip

Cerebras — a rival chip maker to Nvidia — has launched an AI cloud service it claims is 10 to 20 times faster than regular cloud providers.
Sep 3rd, 2024 7:37am by Agam Shah
👁 Featued image for: Developers Can Now Access the World’s Fastest AI Chip
Image via Unsplash+. 

AI computing is still at the dial-up level. Getting an answer from an LLM can be slow. But now Cerebras — a chip maker — has launched an AI cloud service that it claims is 10 to 20 times faster than regular cloud providers.

The service, called Cerebras Inference, provides access to the world’s largest and fastest AI chip, which handily outperforms a single or group of Nvidia GPUs cobbled together.

Cerebras’ WSE-3 AI chip size is about 46,225mm², which is 56 times larger than Nvidia’s H100 GPU. Cerebras has put together these mega-AI chips in its data centers.

The company is also welcoming developers to build AI applications on it with free API keys, though there’s very limited customization available. The available models on the service include Llama 3.1 and its 8 billion and 70 billion parameter variants.

Upcoming models include Llama 3.1 with 405 billion parameters, Mistral’s Large 2 with 123 billion parameters, which was announced a month ago, and Cohere’s Command R.

Behind the Numbers

The response time on Cerebras chips for Llama 3.1 with 8 billion parameters is 1,842 tokens per second, according to independent benchmarks on Artificial Analysis.

On the same LLM, the output speed of Microsoft Azure is 51.5 tokens per second, and Amazon is 92.2. That makes Cerebras 20 times faster than the major cloud providers.

“Most of our users are inference application developers who just want to build on top of the stack of an open source model.”
– Andy Hock, Cerebras Systems

The numbers vary with larger context sizes and LLM size, which have demanding memory and data requirements.

Developers have control over the frontend of the development, but can’t do much with the backend in customizing models or controlling hardware.

“Most of our users are […] inference application developers who just want to build on top of the stack [of an] open source model,” said Andy Hock, vice president of product management at Cerebras Systems, in an interview with The New Stack.

What Devs Need to Know

The Cerebras-compatible models themselves are written in standard Pytorch.

Developers will get a free API key, and can easily move chatbot or other AI applications to Cerebras’ inferencing cloud service.

For example, developers can change a line code by swapping out the API key and pointing from OpenAI to Cerebras’ cloud service. That’s by changing just a line or two of code.

Cerebras users can build RAG or customized models with personalized data.

“They’re therefore just waiting and ready for your inbound API call to throw data at those weights on the wafer and generate the answer back,” Hock said.

“If they want to run something that’s not Llama 8B, not Llama 70B… they can work with us to build it and deploy it,” Hock said.

Customers can build RAG or customized models with personalized data. That will require the installation of Ollama and creation of a vector database locally. An example using Pinecone and Docker is outlined here, and Weaviate and Hugging Face are outlined here. You can view other examples and, also, the Cerebras Inference platform is compatible with OpenAI client libraries.

The company is discussing “easy buttons” and customization capabilities, executives said. But the first thing was to get the service up and running so that developers get access to the chips, Hock said.

“We’re going to learn a lot about the market for this new breed of performance over the next 60 days. We have some really exciting partnerships and application projects coming,” he added.

Pricing

Cerebras’s inference isn’t cheap when compared to cloud providers. But as with CPUs and GPUs, you have to pay for performance.

The good news: the API is free.

A free tier of Llama 3.1-8b — with the 1,800 tokens per second speed — has a daily token limit of 1 million, and 30 requests per minute. The paid tier has 10 cents per 1 million tokens in input or output, with unlimited requests.

The free tier of Llama 3.1-70B — with 450 tokens per second speed — has a daily token limit of 1 million, and 30 requests. The paid tier has 60 cents per minute per 1 million tokens in input or output.

Cerebras also has an enterprise model for those who want to run customized models.

Google and OpenAI have recently been raising prices for customers using APIs to access LLMs — that upset customers who were using Google AI tools to build applications. Similarly, the price of Cerebras inferencing may go up as it comes out of the experimental phase. Cerebras’ chips are expensive to manufacture, and it’ll need to recoup money to build up its cloud infrastructure.

Scope

Cerebras’ AI speed opens the door to agents (or agentic) modeling, in which a single prompt is spread out and sent out to many different models. Those models review it, analyze it, and produce results, which are spread into other models, and then aggregated back.

“We have partners building applications that chain our LLM with multiple other models.”
– Hock

“We have partners building applications that chain our LLM with multiple other models. For example, doing speech-to-text conversion before sending text to our LLM inference, then outputting to a text-to-speech model,” Hock said.

Developers can do that on Cerebras cloud once many more models become available.

“You have full flexibility as a developer to effectively string our LLM inference into a multimodal workflow,” Hock said.

There’s no easy-button capability yet to do that. Developers will have to customize scripts to different models for such workflows.

Under the Hood

There’s a reason Cerebras is able to pull off such significant speed upgrades.

The chip is 57 times larger than a single Nvidia GPU. In production, individual Nvidia’s H100 GPUs are cut off a big wafer. Instead, Cerebras has put its entire chip on a wafer.

“What we are putting together is impossible to achieve on GPUs.”
– Andrew Feldman, Cerebras CEO

The sheer size and wafer engineering of the 4-trillion-transistor chip give it the performance benefits. Cerebras claims it is 10x faster than the H100 chips in Nvidia’s DGX servers.

“What we are putting together is impossible to achieve on GPUs,” Cerebras CEO Andrew Feldman said during a press conference.

Existing AI installations involve a complex network of interconnected GPUs with independent memory units working in tandem. The distance between processors and integrated memory creates a bottleneck, which leads to slowdowns.

Cerebras has instead put its SRAM memory inside the mega chip, which solves the bandwidth problem.

“Speed converts to quality, more powerful and more relevant answers,” Feldman said.

The results are from a 16-bit data type, which requires speed but generates more accurate answers. Many cloud providers are quantizing down 8-bit or 4-bit data types, which sacrifices quality for speed.

Cerebras vs. Nvidia

Cerebras may have better hardware, but it isn’t Nvidia, which has its GPUs in some of the most powerful generative AI systems.

OpenAI and Microsoft have built their own hardware, and Google’s AI system relies on TPUs. Nvidia will ship its next generation of Blackwell at the end of this year. Most proprietary and open source large language models are already tuned to run on the GPUs.

Cerebras doesn’t have a software ecosystem, which is built around open source AI models such as Llama and Mistral. Developers will be key in helping the company’s inferencing service succeed, and it has Discord and Slack channels for developers.

That said, Cerebras’ chip is doing well in the high-performance computing and AI training space. It is also working with G42, which is the largest Middle Eastern data center provider, to establish AI data centers in the US.

TRENDING STORIES
Agam Shah has covered enterprise IT for more than a decade. Outside of machine learning, hardware and chips, he's also interested in martial arts and Russia.
Read more from Agam Shah
SHARE THIS STORY
TRENDING STORIES
TNS owner Insight Partners is an investor in: Docker, OpenAI.
SHARE THIS STORY
TRENDING STORIES
TNS DAILY NEWSLETTER Receive a free roundup of the most recent TNS articles in your inbox each day.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.