DeepSeek-R1 represents a significant advancement in large language models (LLMs) by focusing on reasoning capabilities through innovative training methodologies and architectural improvements. The model tackles key challenges and introduces new methods to improve its reasoning abilities.
Unlike traditional models, DeepSeek-R1-Zero doesn't go through the usual supervised fine-tuning step. Most LLMs are trained in three stages: pre-training, supervised fine-tuning, and then reinforcement learning. DeepSeek-R1-Zero skips supervised fine-tuning and instead applies rule-based reinforcement learning directly to the pre-trained DeepSeek-V3-Base model, which has 671 billion parameters.
DeepSeek-R1 is trained with Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm that is paired here with rule-based rewards. For each input, GRPO samples a group of outputs, and predefined rules assign a reward to each one. An output is then reinforced according to how its reward compares with the rest of its group. The rules cover accuracy and format.
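As a rough sketch of the "group relative" idea, assume the commonly described formulation in which each sampled output's reward is normalized by the mean and standard deviation of its own group. The function name `group_relative_advantages`, the `eps` term, and the example rewards are illustrative assumptions, not taken from DeepSeek's code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled output relative to the other outputs in its group.

    Assumption: population standard deviation is used; implementations may
    differ (for example, sample standard deviation).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every output receives the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rule-based rewards for four outputs sampled from one prompt.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)

# When all outputs score the same, no output is preferred over another.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Outputs that beat their group's average get a positive advantage and the rest a negative one, so no separate value network is needed to provide a baseline.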
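For example, consider a code generation task where the model is asked to implement a Python function that calculates the factorial of a number. The predefined rules for reward assignment could include an accuracy check (does the function return the correct value on test inputs?) and a format check (is the answer presented in the required structure?).
For instance, if the model generates a straightforward implementation, the reward would be assigned based on whether it passes those checks. A better output might include memoization, although it would only earn a higher reward if the rules also measured efficiency.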
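The factorial example can be sketched as two rule checks applied to candidate outputs. Everything here is hypothetical: the candidate snippets, the test inputs, the format rule (the response must define a function named `factorial`), and the equal weighting of the two rewards are illustrative assumptions, not DeepSeek's actual rules.

```python
import math

# Hypothetical test inputs for the accuracy rule.
TEST_INPUTS = [0, 1, 5, 10]

# Hypothetical model outputs for the factorial task.
RECURSIVE = """
def factorial(n):
    return 1 if n <= 1 else n * factorial(n - 1)
"""

MEMOIZED = """
from functools import lru_cache

@lru_cache(maxsize=None)
def factorial(n):
    return 1 if n <= 1 else n * factorial(n - 1)
"""

OFF_BY_ONE = """
def factorial(n):
    result = 1
    for i in range(1, n):  # bug: stops before n
        result *= i
    return result
"""


def format_reward(source):
    """Format rule (assumed): the response defines a function named factorial."""
    return 1.0 if "def factorial(" in source else 0.0


def accuracy_reward(source):
    """Accuracy rule (assumed): fraction of test inputs answered correctly."""
    namespace = {}
    try:
        # NOTE: exec on untrusted model output is unsafe; a real pipeline
        # would run candidates in a sandbox. Any exception scores zero.
        exec(source, namespace)
        func = namespace["factorial"]
        passed = sum(func(n) == math.factorial(n) for n in TEST_INPUTS)
    except Exception:
        return 0.0
    return passed / len(TEST_INPUTS)


def total_reward(source):
    """Sum of the two rule-based rewards (equal weighting is an assumption)."""
    return format_reward(source) + accuracy_reward(source)


scores = {
    "recursive": total_reward(RECURSIVE),
    "memoized": total_reward(MEMOIZED),
    "off_by_one": total_reward(OFF_BY_ONE),
}
```

Under these two rules the plain and memoized versions score the same, and only the buggy version is penalized. Rewarding memoization would require an additional, explicitly defined efficiency rule.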
This rule-based approach simplifies training and makes it more cost-effective by removing the need for a neural reward model. Learned reward models can be susceptible to reward hacking, where the model exploits loopholes in the learned reward function rather than truly optimizing for quality. Explicit, human-defined rules help keep the reward signal reliable, interpretable, and aligned with the intended objectives.
Through reinforcement learning, DeepSeek-R1-Zero exhibits a "self-evolution" process where it learns to allocate more thinking time for complex reasoning tasks. Additionally, it demonstrates an "Aha moment" phenomenon, where the model learns to reevaluate its initial approach and correct itself if needed. These capabilities emerge naturally during training without external adjustments.
For example: Solving a Logical Puzzle
A farmer needs to cross a river with a wolf, a goat, and a cabbage. He has a boat that can only carry himself and one other item at a time. If left alone together:
- The wolf will eat the goat.
- The goat will eat the cabbage.
How can the farmer safely transport all three across the river?
Incorrect! The wolf will eat the goat when left alone in Step 3.
Realizing the mistake, the model re-evaluates its strategy and adjusts:
Correct Solution! The model self-corrected after noticing the initial error.
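The self-correction in this puzzle can be illustrated with a small checker that simulates a plan and reports the first crossing that leaves an unsafe pair unattended. This is a sketch under stated assumptions: the plan encoding, the name `check_plan`, and the flawed first attempt are hypothetical, and each one-way crossing counts as a step, so the step numbers need not match the numbering used in the example above. The corrected plan is the classic solution to the puzzle.

```python
ITEMS = {"wolf", "goat", "cabbage"}

# Pairs that must never be left on a bank without the farmer.
UNSAFE_PAIRS = [{"wolf", "goat"}, {"goat", "cabbage"}]


def check_plan(plan):
    """Simulate a plan and return (first_unsafe_step, solved).

    Each entry in plan is the item carried on one crossing, or None when
    the farmer crosses alone. The farmer starts on the left bank and
    switches banks on every crossing. Steps are numbered from 1.
    """
    left, right = set(ITEMS), set()
    farmer_on_left = True
    for step, cargo in enumerate(plan, start=1):
        src, dst = (left, right) if farmer_on_left else (right, left)
        if cargo is not None:
            src.remove(cargo)  # raises KeyError if the item is on the other bank
            dst.add(cargo)
        farmer_on_left = not farmer_on_left
        # The bank the farmer just left is now unattended.
        if any(pair <= src for pair in UNSAFE_PAIRS):
            return step, False
    return None, right == ITEMS


# Hypothetical first attempt: the wolf ends up alone with the goat.
first_attempt = ["goat", None, "wolf", None]

# Classic solution: bring the goat back before fetching the cabbage.
corrected = ["goat", None, "wolf", "goat", "cabbage", None, "goat"]

first_result = check_plan(first_attempt)
corrected_result = check_plan(corrected)
```

The checker plays the same role as a rule-based verifier: it does not say how to fix the plan, only where it breaks, which is the kind of signal that prompts a re-evaluation.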
Hence, the model first attempts a simple approach but then allocates more thinking time when it detects an issue. Instead of repeating the mistake, it reassesses its logic and finds a better strategy.
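To address the readability and language consistency issues seen in DeepSeek-R1-Zero, DeepSeek-R1 uses a four-phase training pipeline: (1) cold-start supervised fine-tuning on a small set of long chain-of-thought examples, (2) reasoning-oriented reinforcement learning, (3) rejection sampling followed by supervised fine-tuning on the collected data, and (4) a final reinforcement learning stage covering all scenarios.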
Smaller open-source models are created by fine-tuning existing models on the data from phase 3 of the DeepSeek-R1 training, a process known as distillation. These smaller models retain strong reasoning skills and can be practical alternatives when the full model is too costly to run. Even a 32-billion-parameter model performs well on reasoning tasks.
DeepSeek-R1 represents a significant leap forward in the field of large language models. By showing with DeepSeek-R1-Zero that reasoning can emerge without supervised fine-tuning, leveraging rule-based reinforcement learning, and implementing a multi-stage training pipeline, it achieves strong reasoning capabilities. Its ability to self-evolve, correct mistakes, and adapt to complex tasks sets a new benchmark for LLMs. As the AI landscape continues to evolve, DeepSeek-R1 stands as a testament to the power of innovation and precision in model training.