
How does DeepSeek-R1 improve upon existing LLMs reasoning?

Last Updated : 3 Feb, 2025

DeepSeek-R1 represents a significant advancement in large language models (LLMs) by focusing on reasoning capabilities through innovative training methodologies. It improves reasoning mainly through how the model is trained, introducing new methods that address key challenges of earlier approaches.

Unlike traditional models, DeepSeek-R1-Zero doesn't go through the usual supervised fine-tuning step. Most large language models (LLMs) are trained through pre-training, supervised fine-tuning, and then reinforcement learning. However, DeepSeek-R1-Zero skips the fine-tuning and instead uses rule-based reinforcement learning directly on the pre-trained DeepSeek-V3-Base model, which has 671 billion parameters.

Figure: DeepSeek-R1

1. Rule-Based Reinforcement Learning

DeepSeek-R1 is trained with Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm that is paired here with rule-based rewards. For each input, GRPO samples a group of outputs, scores each one, and judges every output relative to the rest of its group rather than relying on a separate critic model. The rewards come from predefined rules that cover accuracy and format.
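The group-relative step can be illustrated with a small sketch. It assumes the rule-based rewards for one prompt's sampled outputs are already available and normalizes each reward against the group's mean and standard deviation. The function name, the use of the population standard deviation, and the epsilon guard are illustrative assumptions, not DeepSeek's actual code.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    # eps avoids division by zero when every output earns the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled outputs for one prompt: two pass the rules, two fail.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Outputs that score above the group average get a positive advantage and are reinforced; those below average get a negative one.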

For example, consider a code generation task where the model is asked to implement a Python function that calculates the factorial of a number. The predefined rules for reward assignment could include:

  1. Accuracy: The function should return the correct factorial value for any valid input.
  2. Format Adherence: The implementation must include a docstring describing the function’s purpose, and all variables should follow standard naming conventions.
  3. Efficiency: Recursive implementations should include memoization to optimize performance.

For instance, if the model generates the following output:
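A minimal sketch of such an output, assuming a plain recursive implementation with a docstring and no caching (the exact listing is an assumption):

```python
def factorial(n):
    """Return the factorial of a non-negative integer n."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n <= 1:
        return 1
    return n * factorial(n - 1)
```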

The reward would be assigned based on:

  • Correctness (✔️) – The function computes the correct factorial values.
  • Format (✔️) – It includes a docstring.
  • Efficiency (❌) – It lacks memoization, so repeated calls recompute the same values.
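The three checks above could be automated roughly as follows. This is only a sketch: the function name `rule_based_reward`, the test cases, and the `cache_info` heuristic for detecting memoization are hypothetical choices for illustration, and a real pipeline would run generated code in a sandbox rather than call `exec` directly.

```python
def rule_based_reward(source, func_name="factorial"):
    """Score generated code against three hand-written rules (illustrative only)."""
    failed = {"accuracy": 0.0, "format": 0.0, "efficiency": 0.0}
    namespace = {}
    try:
        exec(source, namespace)  # unsafe for untrusted code; sandbox in practice
        func = namespace[func_name]
    except Exception:
        return failed  # code does not even load, so no rule can pass

    # Rule 1: accuracy on a few known factorial values.
    cases = {0: 1, 1: 1, 5: 120, 10: 3628800}
    try:
        accurate = all(func(n) == expected for n, expected in cases.items())
    except Exception:
        accurate = False

    # Rule 2: format, here reduced to "has a non-empty docstring".
    has_docstring = bool(func.__doc__ and func.__doc__.strip())

    # Rule 3: efficiency heuristic; functools cache wrappers expose cache_info().
    memoized = hasattr(func, "cache_info")

    return {
        "accuracy": float(accurate),
        "format": float(has_docstring),
        "efficiency": float(memoized),
    }

naive_source = '''
def factorial(n):
    """Return the factorial of n."""
    return 1 if n <= 1 else n * factorial(n - 1)
'''
scores = rule_based_reward(naive_source)
```

Because every rule is an explicit check, each component of the score can be traced back to a concrete reason.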

A better output might include memoization:
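One way such an output might look, assuming the standard library's `functools.lru_cache` is used for memoization (the exact listing is an assumption):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def factorial(n):
    """Return the factorial of a non-negative integer n, caching earlier results."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n <= 1:
        return 1
    return n * factorial(n - 1)
```

The cache pays off across repeated calls: once `factorial(10)` has been computed, a later `factorial(12)` only needs two new multiplications.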

This rule-based approach simplifies training and makes it more cost-effective by eliminating the need for neural reward models, which are susceptible to reward hacking, where the model finds loopholes in a learned reward function rather than truly optimizing for quality. Because the rewards come from explicit human-defined rules, they are easier to inspect and harder to game, which helps keep the model's outputs reliable and aligned with the intended objectives.

2. Self-Evolution and 'Aha Moment'

Through reinforcement learning, DeepSeek-R1-Zero exhibits a "self-evolution" process where it learns to allocate more thinking time for complex reasoning tasks. Additionally, it demonstrates an "Aha moment" phenomenon, where the model learns to reevaluate its initial approach and correct itself if needed. These capabilities emerge naturally during training without external adjustments.

For example: Solving a Logical Puzzle

A farmer needs to cross a river with a wolf, a goat, and a cabbage. He has a boat that can only carry himself and one other item at a time. If left alone together:

  • The wolf will eat the goat.
  • The goat will eat the cabbage.

How can the farmer safely transport all three across the river?

First Attempt (Initial Thinking) – Rushing to a Conclusion

  1. Take the goat across first.
  2. Go back alone.
  3. Take the wolf across and leave it.
  4. Return alone.
  5. Take the cabbage across.

Incorrect! After Step 3 the wolf and the goat are both on the far bank, so when the farmer returns alone in Step 4 the wolf eats the goat.

"Aha Moment" – Self-Correction and Deeper Reasoning

Realizing the mistake, the model re-evaluates its strategy and adjusts:

  1. Take the goat across first.
  2. Go back alone.
  3. Take the wolf across, but bring the goat back.
  4. Leave the goat and take the cabbage across.
  5. Return alone and bring the goat across.

Correct Solution! The model self-corrected after noticing the initial error.

Hence, the model first attempts a simple approach but allocates more thinking time when it detects an issue. Instead of repeating the mistake, it reassesses its logic and finds a better strategy.
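The two plans above can be checked mechanically with a small simulator. The sketch below is an illustration of the puzzle's rules, not part of DeepSeek-R1. It assumes each plan is written as a list of crossings, where each entry names the item the farmer carries (or `None` for crossing alone), and it reports the crossing at which a forbidden pair is first left unattended.

```python
def check_plan(crossings):
    """Simulate the crossings; return (solved, number of the failing crossing or None)."""
    # 0 = starting bank, 1 = far bank
    side = {"farmer": 0, "wolf": 0, "goat": 0, "cabbage": 0}
    for number, cargo in enumerate(crossings, start=1):
        if cargo is not None:
            if side[cargo] != side["farmer"]:
                return False, number  # the cargo is not on the farmer's bank
            side[cargo] = 1 - side[cargo]
        side["farmer"] = 1 - side["farmer"]
        # A pair is in danger when both share a bank and the farmer is elsewhere.
        for eater, eaten in (("wolf", "goat"), ("goat", "cabbage")):
            if side[eater] == side[eaten] != side["farmer"]:
                return False, number
    return all(bank == 1 for bank in side.values()), None

first_attempt = ["goat", None, "wolf", None, "cabbage"]
corrected = ["goat", None, "wolf", "goat", "cabbage", None, "goat"]

print(check_plan(first_attempt))  # (False, 4)
print(check_plan(corrected))      # (True, None)
```

The corrected plan spans seven crossings because Steps 3 to 5 of the written solution each combine an outbound and a return trip.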

Multi-stage Training Pipeline for DeepSeek-R1

To address readability and language consistency issues seen in DeepSeek-R1-Zero, DeepSeek-R1 uses a four-phase training pipeline.

  1. Cold Start: The DeepSeek-V3-Base model is first fine-tuned on a small, carefully curated set of long chain-of-thought examples, including DeepSeek-R1-Zero outputs cleaned up into a readable format. This step makes the model's answers clearer and easier to read before reinforcement learning begins.
  2. Reasoning Reinforcement Learning: After the cold start, the model is trained with GRPO to strengthen its reasoning on tasks such as math, coding, and logic. The rewards focus on accuracy and proper formatting, along with a language-consistency reward that discourages mixing languages within one answer.
  3. Rejection Sampling and Supervised Fine-Tuning: In this phase, the model generates many candidate answers and rejection sampling keeps only the ones that pass a quality check. For some of this data, DeepSeek-V3 acts as a judge of correctness. The retained reasoning data is combined with general-purpose data and used for another round of supervised fine-tuning, so the model can handle a wider variety of tasks.
  4. Diverse Reinforcement Learning: Finally, the model combines two types of feedback. For tasks like math, it uses rule-based rewards, while for general tasks it takes feedback from a learned reward model that reflects human preferences. This mix helps the model do well in both logical reasoning and open-ended, general-purpose tasks.
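The rejection sampling idea from phase 3 can be sketched as "sample several answers, keep only those that pass a check". In the toy example below, `toy_generate` and `toy_check` are hypothetical stand-ins for a real model and a real verifier, and the sample count is arbitrary. Nothing here reflects DeepSeek's actual implementation.

```python
import random

def rejection_sample(prompt, generate, is_acceptable, num_samples=8, seed=0):
    """Sample several candidate answers and keep only those that pass the check."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(num_samples)]
    return [
        {"prompt": prompt, "response": candidate}
        for candidate in candidates
        if is_acceptable(prompt, candidate)
    ]

# Toy stand-in for a model: answers "a+b" prompts, sometimes off by one.
def toy_generate(prompt, rng):
    a, b = map(int, prompt.split("+"))
    return str(a + b + rng.choice([0, 0, 1]))

# Toy stand-in for a verifier: exact match against the true sum.
def toy_check(prompt, response):
    a, b = map(int, prompt.split("+"))
    return response == str(a + b)

# Accepted pairs would become supervised fine-tuning examples.
sft_data = rejection_sample("2+3", toy_generate, toy_check)
```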

Smaller open-source models are created by fine-tuning existing Qwen and Llama models on the data from phase 3 of the DeepSeek-R1 training, a process known as distillation. These smaller models retain strong reasoning skills and can be practical alternatives. Even the 32-billion-parameter distilled model performs well on reasoning benchmarks.

Conclusion: A New Benchmark in LLM Reasoning

DeepSeek-R1 represents a significant step forward in the field of Large Language Models. DeepSeek-R1-Zero shows that strong reasoning can emerge from rule-based reinforcement learning without an initial supervised fine-tuning step, and DeepSeek-R1 builds on it with a multi-stage training pipeline that restores readability and broadens the model's abilities. Its ability to self-evolve, correct mistakes, and adapt to complex tasks sets a strong benchmark for LLM reasoning. As the AI landscape continues to evolve, DeepSeek-R1 stands as a testament to the power of innovation and precision in model training.
