Bit by Bit is a weekly column focusing on technical advances across multiple spaces. My name is Adam Conway, and I've been covering tech and following the cutting edge for a decade. If there's something you're interested in and would like to see covered, you can reach out to me at adam@xda-developers.com.
Just this week, DeepSeek R1 plunged the U.S. stock market into chaos and upstaged OpenAI at its own game. Its release wiped $1 trillion in valuations across the stock market, with $600 billion of that coming from Nvidia alone. Some stocks have bounced back, and others are recovering, but it's clear that DeepSeek had a pretty big impact on the top computing and AI companies.
DeepSeek claims that training its model cost a mere fraction of what OpenAI spends, and it sells API access at prices that significantly undercut OpenAI as well. How did they do it? What happened? There's a lot to break down here, particularly around DeepSeek's claims, the retaliation, and how claims that R1 is "open source" don't tell the full picture.
DeepSeek R1 isn't the same as DeepSeek V3
Though they're very similar
First and foremost, DeepSeek released two models: V3 and R1. Both of them are pretty important to the story, but all of the talk has been around R1. DeepSeek R1 is the company's reasoning model, which can ask itself questions and talk to itself before answering a prompt, just like OpenAI's o1 model.
DeepSeek V3 is a general-purpose Mixture of Experts (MoE) LLM with 671B parameters. DeepSeek R1 is based on DeepSeek-V3-Base, and the full-fledged 671B R1 model is available for download. There are also smaller 1.5B, 7B, 8B, 14B, 32B, and 70B parameter models available for download, which are based on Qwen and Llama and distilled from DeepSeek R1. R1 and V3 are similar models, but R1's reasoning capabilities are what make it particularly impressive.
The best way to use DeepSeek's R1 and V3 671B models is to navigate over to DeepSeek's site, where you can create an account and use them like you would ChatGPT. The company's servers are in China, and some prompts result in a censored answer. DeepSeek's R1 671B model can be run locally, but it requires at least 800 GB of HBM in FP8 format, according to AWS. This is where the open-weight nature of the model comes in too, as you can tweak the parameters to remove this censorship, and there are already a number of uncensored models available to download, made with a process known as "abliteration".
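To put that 800 GB figure in context, here is a rough back-of-the-envelope sketch. It assumes one byte per parameter in FP8 and two in FP16, and it assumes the gap between the raw weights and 800 GB is headroom for things like the context cache and activations. That split is my assumption, not something AWS or DeepSeek has broken down.

```python
# Back-of-the-envelope memory needed just to hold a 671B-parameter model.
# Assumptions: 1 byte per parameter in FP8, 2 bytes in FP16, decimal GB (1e9 bytes).

PARAMS = 671e9  # parameter count quoted for the full R1 and V3 models

BYTES_PER_PARAM = {"fp8": 1, "fp16": 2}


def weight_memory_gb(params: float, fmt: str) -> float:
    """Return the memory needed for the weights alone, in GB."""
    return params * BYTES_PER_PARAM[fmt] / 1e9


fp8_gb = weight_memory_gb(PARAMS, "fp8")
fp16_gb = weight_memory_gb(PARAMS, "fp16")

# Whatever remains of the quoted 800 GB once the FP8 weights are loaded.
headroom_gb = 800 - fp8_gb

print(f"FP8 weights:  {fp8_gb:.0f} GB")
print(f"FP16 weights: {fp16_gb:.0f} GB")
print(f"Left over within 800 GB: {headroom_gb:.0f} GB")
```

The weights alone account for most of the requirement, which is why the distilled models are the realistic option for consumer hardware.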
The process of "distillation" mentioned when it comes to those smaller parameter models is one that you mightn't necessarily be familiar with. Distillation refers to using a larger model to train a smaller model, where the larger model is the parent and the smaller model is the child. The child model asks the parent model a litany of questions, labeling the answers and learning from its responses. In other words, the DeepSeek R1 models that you can run locally are based on Qwen and Llama, where those two models learned from the larger DeepSeek R1.
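As a loose illustration of that parent and child relationship, here is a toy sketch in plain Python. The "teacher" and "student" below are hypothetical stand-ins (a fixed function and a two-number model), and have nothing to do with DeepSeek's actual training code. The point is only that the student never sees any ground truth, just the teacher's answers.

```python
# Toy distillation: a small "student" learns only from a "teacher's" answers.
# Both models are hypothetical stand-ins, not anything DeepSeek has released.


def teacher(x: float) -> float:
    """Stand-in for the large parent model: it can answer any question x."""
    return 3.0 * x + 2.0


# 1. The student "asks" the teacher a batch of questions and records the answers.
questions = [i / 10 for i in range(-10, 11)]
answers = [teacher(q) for q in questions]

# 2. The student is a much smaller model: two numbers, both starting at zero.
w, b = 0.0, 0.0
learning_rate = 0.1

# 3. Gradient descent on the squared error between student and teacher outputs.
for _ in range(2000):
    grad_w = grad_b = 0.0
    for q, a in zip(questions, answers):
        error = (w * q + b) - a
        grad_w += 2 * error * q / len(questions)
        grad_b += 2 * error / len(questions)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b


def student(x: float) -> float:
    """The trained child model."""
    return w * x + b


# The student now agrees with the teacher, even on a question it never asked.
print(f"student(5.0) = {student(5.0):.3f}, teacher(5.0) = {teacher(5.0):.3f}")
```

With LLMs, the questions are prompts and the answers are generated text, but the shape of the process is the same: generate outputs from the big model, then train the small one to reproduce them.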
Did DeepSeek R1 steal from OpenAI?
Even if they did, it's hypocritical of OpenAI to complain
OpenAI is currently facing a number of lawsuits relating to the collection of the data that it has used to train its models. The New York Times sued OpenAI, as did Canadian news outlets, Intercept Media, and ANI in India. There are countless more lawsuits out there too, and all of them allege more or less the same thing: OpenAI used their data without permission to train its GPT models.
Right now, nobody from OpenAI has made the claim that DeepSeek stole from it on an official channel, but both Bloomberg and the Financial Times have reported that OpenAI and Microsoft are currently investigating the possibility. First and foremost: this is laughable. Even if DeepSeek did "steal" from OpenAI, it's hard to have sympathy for a company that feels its data was taken in an "unauthorized" way when significant portions of its own data were collected in the exact same way.
In fact, OpenAI has argued more or less in favor of what DeepSeek is alleged to have done. “Training AI models using publicly available internet materials is fair use, as supported by long-standing and widely accepted precedents. We view this principle as fair to creators, necessary for innovators, and critical for US competitiveness,” OpenAI once said in a blog post.
However, it's not clear what exactly DeepSeek could have trained on when it comes to OpenAI. o1's reasoning is obfuscated; when you ask o1 a question, it doesn't give you the full chain-of-thought that R1 does. What you see is a summary, and OpenAI deliberately hides the actual inner workings, going so far as to make it very clear that any attempt to siphon this information will result in your account being banned.
It doesn't stop there, though, as David Sacks, a venture capitalist and the White House's "AI and crypto czar", claimed that there was "substantial" evidence that R1 was distilled from OpenAI's models.
“There’s a technique in AI called distillation, which you’re going to hear a lot about, and it’s when one model learns from another model, effectively what happens is that the student model asks the parent model a lot of questions, just like a human would learn, but AIs can do this asking millions of questions, and they can essentially mimic the reasoning process that they learn from the parent model and they can kind of suck the knowledge of the parent model,” Sacks told Fox News. “There’s substantial evidence that what DeepSeek did here is they distilled the knowledge out of OpenAI’s models and I don’t think OpenAI is very happy about this.”
As we've already mentioned, this reasoning process cannot be distilled. The obfuscated chain-of-thought that the o1 model shows users does not contain a full chain-of-thought, and instead summarizes what it's "thinking". This isn't enough information to train DeepSeek R1, especially not when R1 actually matches (and even outperforms at times) the alleged source of its reasoning process in multiple benchmarks.
With that said, we don't know where the initial training data came from, but that's not really what the allegations of stolen data relate to. DeepSeek has actually been very open about how R1's reasoning capabilities came about, and in the whitepaper released by the team of researchers, they say that the capabilities emerged through reinforcement learning when building R1-Zero. This focuses on "self-evolution," a technique where the model itself "learns" to achieve a goal in the most efficient way.
A particularly intriguing phenomenon observed during the training of DeepSeek-R1-Zero is the occurrence of an “aha moment”. This moment, as illustrated in Table 3, occurs in an intermediate version of the model. During this phase, DeepSeek-R1-Zero learns to allocate more thinking time to a problem by reevaluating its initial approach. This behavior is not only a testament to the model’s growing reasoning abilities but also a captivating example of how reinforcement learning can lead to unexpected and sophisticated outcomes.
Reinforcement learning is a very common machine learning technique, and neuroevolution, a part of the reinforcement learning paradigm, has even been used to teach models how to play games like Super Mario, in the form of MarI/O by SethBling. This isn't a new concept, but one that has been somewhat overlooked when it comes to LLMs. Plenty of LLMs use RLHF, which stands for Reinforcement Learning from Human Feedback, but pure RL does not require any supervision or feedback provided by a human.
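To make the "no human in the loop" part concrete, here is a deliberately tiny sketch of reinforcement learning where the reward comes from an automatic rule rather than a person. It is a toy epsilon-greedy learner on a made-up task, and every name and number in it is hypothetical. It is not how DeepSeek trained R1-Zero, just the same basic idea at the smallest possible scale.

```python
import random

# Toy "pure RL" loop: the reward comes from an automatic rule, not a human.
# Hypothetical example; the task, actions, and hyperparameters are made up.

CANDIDATE_ANSWERS = [4, 5, 6]  # actions the agent can pick for "2 + 3 = ?"


def reward(answer: int) -> float:
    """Rule-based check: 1.0 for a correct answer, 0.0 otherwise."""
    return 1.0 if answer == 2 + 3 else 0.0


def train(steps: int = 200, epsilon: float = 0.2, seed: int = 0) -> int:
    """Learn which answer earns the most reward and return it."""
    rng = random.Random(seed)
    value = {a: 0.0 for a in CANDIDATE_ANSWERS}  # estimated reward per action
    count = {a: 0 for a in CANDIDATE_ANSWERS}    # times each action was tried

    for step in range(steps):
        if step < len(CANDIDATE_ANSWERS):
            action = CANDIDATE_ANSWERS[step]        # try every action once
        elif rng.random() < epsilon:
            action = rng.choice(CANDIDATE_ANSWERS)  # explore at random
        else:
            action = max(value, key=value.get)      # exploit the best estimate
        count[action] += 1
        # Incremental average of the rewards observed for this action.
        value[action] += (reward(action) - value[action]) / count[action]

    return max(value, key=value.get)


learned = train()
print("Learned answer:", learned)
```

Nobody tells the agent which answer is right. It tries things, a rule scores the result, and the scores steer what it does next.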
Did it really cost $5.576M to train DeepSeek R1? And why did the stock market panic?
Yes and no, but mostly no
This claim originated in the DeepSeek V3 whitepaper, which says that this model cost $5.576M to train, racking up 2788K Nvidia H800 GPU hours estimated at $2 per hour. That figure covers a single training run of one model. It doesn't cover all of the other test runs, or all of the other times they built the model and then had to build it again. This is the final output cost to build the model, nothing more, and there has definitely been significantly more investment in this project than that.
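The headline number is just multiplication, and it's worth seeing how much it depends on the rental rate. The sketch below reproduces the figure from the GPU hours and the $2 per hour estimate quoted above. The alternative rates in the loop are hypothetical values I picked for illustration, not numbers from the whitepaper.

```python
# Reproduce the headline training cost from the figures quoted above.

GPU_HOURS = 2_788_000      # 2788K H800 GPU hours
USD_PER_GPU_HOUR = 2.00    # rental rate the estimate is based on


def training_cost_usd(gpu_hours: float, usd_per_gpu_hour: float) -> float:
    """Cost of a training run as GPU hours times the hourly rental rate."""
    return gpu_hours * usd_per_gpu_hour


reported = training_cost_usd(GPU_HOURS, USD_PER_GPU_HOUR)
print(f"At ${USD_PER_GPU_HOUR:.2f}/hour: ${reported:,.0f}")

# Hypothetical alternative rates, to show how the total scales with the price.
for rate in (1.50, 3.00):
    print(f"At ${rate:.2f}/hour: ${training_cost_usd(GPU_HOURS, rate):,.0f}")
```

The total is an estimate built on an assumed rental price, which is one more reason to treat it as the cost of a single run rather than the cost of the whole project.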
This oversight has led to allegations that DeepSeek lied about its costs, despite the fact that the whitepaper makes it very clear that the training cost was for just the model, without any other overheads included such as research and development, models trained in the process of building up V3, and other adjacent costs. It's also the cost of building V3, not R1. Eryck Banatt has an excellent breakdown of this cost, which asserts that DeepSeek's numbers are plausible and that many aspects of their claims are verifiable from the outset.
However, these fundamental misunderstandings, coupled with the actual efficiency of DeepSeek's newest models and the fact that they were trained on older GPUs, caused market chaos. Nvidia's H100 GPUs, bought in the hundreds of thousands by big players in the AI space such as Google, Meta, and OpenAI, are the most powerful GPUs out there and were previously seen to be necessary in the development of cutting-edge technology.
With that said, DeepSeek achieved all of this on a series of H800 GPUs, which cut the chip-to-chip transfer rate by about half. The H800 complied with export regulations for a short time, before a loophole that Nvidia was said to have exploited was closed. This calls into question just how important Nvidia's latest technology actually is when it comes to AI, if slower GPUs can still compete with the results of using the best.
And that's another thing too; allegations surfaced that DeepSeek had skirted export controls and acquired H100 GPUs. Scale AI CEO Alexandr Wang made the claim that DeepSeek had about 50,000 of them and had avoided talking about them as it would prove it had violated those export controls. It's likely that Wang misunderstood a tweet from Dylan Patel, which said that DeepSeek had more than 50,000 Hopper GPUs. H800 GPUs are still Hopper GPUs, as they are modified versions of the H100 that were made to comply with those U.S. export controls.
All of this prompted Nvidia to release a statement, saying that it expects all partners to comply with regulations and it will act accordingly if it discovers they haven't. Nvidia has also “stated that there is no reason to believe that DeepSeek obtained any export-controlled products from Singapore,” according to the Ministry of Trade and Industry in Singapore.
Even still, this cost is remarkably low. Aran Komatsuzaki, an AI researcher, estimates that the cost of training GPT-4o and o1 is about $15 million each, nearly three times the cost of DeepSeek's V3 model. The lower cost is partially enabled by optimization, as DeepSeek has made a number of advancements in this area. That includes using PTX, a low-level language for Nvidia GPUs that enables the researchers to do things like dedicating a portion of the H800 GPUs' resources to managing cross-chip communications.
DeepSeek represents several major advancements in AI, and we'll all reap the benefits
Even if it's panicking competitors
Despite suggestions that Meta has set up "war rooms" and that OpenAI is potentially looking to take action against DeepSeek, this is a major win for the AI community. Advancement helps everyone, and the open nature of DeepSeek's research will allow competitors to use some of those techniques in improving their own models, too. As for why I described DeepSeek as "open weights" rather than "open source" earlier: open source would also require the original data that it was trained on.
In contrast, open weights means that we have the parameters and the numerical values that define how the model runs. That, alongside the research papers, is more than enough to go off of when trying to build a model that replicates R1. In fact, someone is already working on building their own version of R1 in a project called "Open R1", which uses all of the information released by DeepSeek to implement it. It's not completed, but there's a very clear path and outline to follow if you want to do it yourself.
If a regular person like you or me can read the paper and understand the basics of what's going on, then you know that researchers at companies like Google, Meta, and OpenAI definitely can. This will improve models across the board, reducing power consumption, lowering costs, and further democratizing AI. OpenAI CEO Sam Altman has already said that OpenAI's reasoning models will now share more of their chain of thought, thanking R1 in his response.
You can run a distilled version of DeepSeek R1 in LM Studio at the moment. I've been using it to run the 32B Qwen model distilled from DeepSeek R1 on my MacBook Pro with an M4 Pro SoC.
