[BARO-Tech] Why Did DeepSeek-R1 Choose Reinforcement Learning?

2025. 03. 19


Before reading this article, we recommend checking out the previous two parts. 😊

1. Open-source ‘DeepSeek’: Making GPT-level AI More Accessible for Startups?

With the release of DeepSeek-R1 as open source in China, startups and independent developers now have greater access to the AI market.

2. DeepSeek-R1’s Strategy to Catch Up with ChatGPT

The DeepSeek research team aimed to replicate GPT-4 level performance at just 5% of the cost, using performance optimization techniques like reinforcement learning and distillation.



The following content is drawn from the account of DeepSeek-R1’s training process.


🔎 Why Did DeepSeek Choose Reinforcement Learning?


The research team wanted DeepSeek to demonstrate more logical thinking by implementing CoT reasoning.

To understand why Reinforcement Learning (RL) was chosen, it’s essential to first look at CoT (Chain-of-Thought), a technique that guides AI models to solve problems step by step rather than jumping straight to conclusions.


This explains the underlying principle of the model: Chain-of-Thought (CoT), a core technique used alongside reinforcement learning, was applied to DeepSeek in order to implement Test-Time Scaling.


Unlike typical AI models that immediately generate outputs, CoT emphasizes intermediate reasoning steps. This improves logical thinking capabilities and leads to better performance in tasks like math, coding, scientific reasoning, and complex writing.

However, even with improved reasoning, another critical challenge remained unsolved: Test-Time Scaling. The model struggled to dynamically adjust its computational effort based on input complexity, leading to unstable results.
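
To make these two ideas concrete, here is a minimal sketch (in Python) of how a CoT-style prompt differs from a direct-answer prompt, and how test-time scaling could allocate more sampled reasoning chains to harder inputs. The `generate` call and the `estimate_difficulty` heuristic are hypothetical placeholders, not DeepSeek’s API or method.

```python
from typing import Callable, List

def build_prompt(question: str, use_cot: bool) -> str:
    """Build either a direct-answer prompt or a Chain-of-Thought prompt."""
    if use_cot:
        # CoT: ask the model to write out intermediate reasoning before the answer.
        return f"{question}\nLet's think step by step, then state the final answer."
    # Direct: ask for the answer immediately, with no intermediate reasoning.
    return f"{question}\nAnswer:"

def solve_with_test_time_scaling(
    question: str,
    generate: Callable[[str], str],               # hypothetical LLM call: prompt -> completion
    estimate_difficulty: Callable[[str], float],  # hypothetical heuristic returning a value in [0, 1]
) -> List[str]:
    """Sample more CoT reasoning chains for harder questions (test-time scaling)."""
    difficulty = estimate_difficulty(question)
    n_chains = 1 + int(difficulty * 7)  # easy question -> 1 chain, hard question -> up to 8 chains
    prompt = build_prompt(question, use_cot=True)
    return [generate(prompt) for _ in range(n_chains)]
```

Here, “scaling” simply means spending more sampled reasoning on inputs that look hard; an RL-trained reasoning model learns to adjust the length of its own chain of thought rather than relying on an external heuristic like this.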

To solve this, the team explored several approaches.



After Trial and Error: Finding the Optimal Training Strategy



First Attempt:
Process-Based Reward Models – ❌

These models evaluate the appropriateness of each reasoning step.
While promising in theory, they weren’t effective in practice and were dismissed.


Second Attempt:
Search Algorithms – ✔️?

Search algorithms explore various reasoning paths to find optimal answers.
For example, Google DeepMind’s AlphaGo used Monte Carlo Tree Search to master Go, and its successor AlphaZero extended the approach to chess.

However, since LLMs like DeepSeek must handle subjective responses
(unlike games with clear win/lose outcomes), this approach wasn’t a good fit either. ❌
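
For intuition, here is a minimal sketch of what “searching over reasoning paths” could look like: a small beam search in which each node is a partial chain of reasoning steps and a `score` function rates how promising each partial chain is. In Go or chess that score ultimately comes from a clear win/lose signal; for open-ended LLM answers no such ground truth exists, which is exactly the weakness noted above. The `expand` and `score` functions are assumptions for illustration, and this is beam search rather than full Monte Carlo Tree Search.

```python
from typing import Callable, List, Tuple

def beam_search_reasoning(
    question: str,
    expand: Callable[[str, List[str]], List[str]],  # hypothetical: propose next reasoning steps
    score: Callable[[str, List[str]], float],       # hypothetical: rate a partial reasoning chain
    beam_width: int = 3,
    max_depth: int = 4,
) -> List[str]:
    """Explore several reasoning paths and return the highest-scoring chain of steps."""
    beams: List[Tuple[float, List[str]]] = [(0.0, [])]  # (score, partial chain)
    for _ in range(max_depth):
        candidates: List[Tuple[float, List[str]]] = []
        for _, chain in beams:
            for step in expand(question, chain):
                new_chain = chain + [step]
                candidates.append((score(question, new_chain), new_chain))
        if not candidates:
            break
        # Keep only the most promising partial chains (the "beam").
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return max(beams, key=lambda c: c[0])[1]
```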


Third Attempt:
Reinforcement Learning (RL)

In reinforcement learning, the model receives rewards based on the effectiveness of its actions.
It learns the optimal policy by maximizing these rewards.

💡 How RL Rewards Work

In supervised learning, the model is explicitly told, “This is how you get a high score”—it learns from clear, labeled answers. But in reinforcement learning, no such explicit guidance is provided. Instead, the model learns through rewards that indicate how good or bad a particular action is in achieving a goal. As a result, it must learn on its own through trial and error.

For example, imagine learning to ride a bicycle. At first, you might try pedaling with different intensities or turning the handlebars in various ways. Without any prior knowledge, you’re likely to fall many times. But through repeated attempts, you may eventually discover a way to go farther than before. On the other hand, you might also fall immediately. In this process, you start identifying which actions were effective, reinforcing successful ones and discarding ineffective ones. This is the core principle of reinforcement learning—gradually finding the optimal strategy through experience.

Traditional reinforcement learning involves balancing Exploration—trying new actions to discover better rewards—and Exploitation—repeating actions that have yielded the highest rewards so far. This balance is essential for finding the optimal strategy. However, DeepSeek takes a different approach: it minimizes Exploration and prioritizes Exploitation to accelerate learning and performance. 
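
As a toy illustration of the exploration/exploitation trade-off described above (not DeepSeek’s actual algorithm), here is a minimal epsilon-greedy bandit sketch: with probability epsilon the agent explores a random action, otherwise it exploits the action with the best average reward seen so far. Shrinking epsilon corresponds to leaning harder on exploitation.

```python
import random
from typing import Callable, List

def epsilon_greedy_bandit(reward_fn: Callable[[int], float],
                          n_actions: int, steps: int, epsilon: float) -> List[float]:
    """Learn an average reward per action; a small epsilon means mostly exploitation."""
    counts = [0] * n_actions      # how often each action has been tried
    values = [0.0] * n_actions    # running average reward per action
    for _ in range(steps):
        if random.random() < epsilon:
            action = random.randrange(n_actions)                     # explore: try something new
        else:
            action = max(range(n_actions), key=lambda a: values[a])  # exploit: best action so far
        reward = reward_fn(action)
        counts[action] += 1
        # Incrementally update the running average reward for the chosen action.
        values[action] += (reward - values[action]) / counts[action]
    return values

# Example: action 2 has the highest true reward; with a small epsilon the agent settles on it quickly.
estimates = epsilon_greedy_bandit(
    reward_fn=lambda a: random.gauss([0.1, 0.5, 0.9][a], 0.1),
    n_actions=3, steps=1000, epsilon=0.05,
)
```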

 



Simply put, while traditional reinforcement learning explores all possible actions experimentally, DeepSeek focuses on leveraging already validated, high-performing behaviors. By reducing unnecessary exploration and selectively learning from high-quality data samples, it improves sample efficiency.

Through this optimized reinforcement learning algorithm, DeepSeek achieved high performance with significantly less data, shorter training time, and lower cost compared to conventional RL methods. This is why the DeepSeek research team chose reinforcement learning as the core strategy for developing the new DeepSeek model.
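
The idea of reducing exploration and “selectively learning from high-quality data samples” can be sketched as a simple filter-then-train loop: generate several candidate answers per prompt, keep only the ones a reward function rates highly, and fine-tune on those. The `generate`, `reward`, and `train_on` functions below are hypothetical placeholders, and this is a generic reward-filtered fine-tuning sketch rather than DeepSeek’s exact update rule.

```python
from typing import Callable, List, Tuple

def reward_filtered_update(
    prompts: List[str],
    generate: Callable[[str], List[str]],               # hypothetical: prompt -> candidate answers
    reward: Callable[[str, str], float],                # hypothetical: score a (prompt, answer) pair
    train_on: Callable[[List[Tuple[str, str]]], None],  # hypothetical: fine-tune on selected pairs
    keep_top: int = 2,
) -> None:
    """Exploitation-heavy update: train only on the highest-reward samples for each prompt."""
    selected: List[Tuple[str, str]] = []
    for prompt in prompts:
        candidates = generate(prompt)
        # Rank candidates by reward and keep only the best few (little exploration).
        ranked = sorted(candidates, key=lambda ans: reward(prompt, ans), reverse=True)
        selected.extend((prompt, ans) for ans in ranked[:keep_top])
    # One supervised-style update on the filtered, high-quality samples improves sample efficiency.
    train_on(selected)
```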



📈 Performance Boost from RL-Centered LLM Design

A performance comparison chart between DeepSeek, which improved results solely through reinforcement learning (RL), and OpenAI’s ChatGPT o1 model.


The DeepSeek-R1-Zero model initially achieved a pass@1 score of 15.6% on the AIME 2024 benchmark. However, after reinforcement learning (RL) training, its performance significantly improved to 71.0%. These results clearly demonstrate that reinforcement learning can effectively enhance model performance. 

With Majority Voting added, performance increased to 86.7%, matching the level of OpenAI’s o1-0912 model.

Majority Voting works like this: If a model is asked “What’s 2 + 2?” and replies with 4, 4, 5, 8, and 3, the most common answer (4) is selected as the final response. This technique further boosts answer accuracy.
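
Majority voting is straightforward to implement; here is a minimal sketch over a list of sampled answers, using Python’s built-in Counter.

```python
from collections import Counter
from typing import List

def majority_vote(answers: List[str]) -> str:
    """Return the most common answer among several sampled responses."""
    return Counter(answers).most_common(1)[0][0]

# The example from the text: five sampled answers to "What's 2 + 2?"
print(majority_vote(["4", "4", "5", "8", "3"]))  # -> 4
```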



🧠 Multi-Stage Training for General-Purpose, Multilingual AI

Reinforcement learning alone pushed reasoning performance, but because DeepSeek aimed to become a widely accessible AI model with robust performance in multilingual environments, it also needed to address key challenges such as poor readability and language mixing across languages. To tackle these, the team implemented a Multi-Stage Training approach.

Here’s an overview of the multi-stage training process for DeepSeek-R1 (a schematic sketch follows the list):

  1. Initial Supervised Fine-Tuning (SFT)
    • At the early stage, due to limited training data, the model undergoes supervised fine-tuning using a restricted dataset.
  2. Reasoning-Oriented RL
    • To enhance reasoning capabilities, DeepSeek-R1 adopts the same reasoning-focused RL strategy used in the DeepSeek-R1-Zero model.
  3. Rejection Sampling & Additional Supervised Fine-Tuning
    • After reinforcement learning, only high-quality outputs that meet specific criteria are selected through rejection sampling.
    • These refined samples are then combined with DeepSeek’s existing supervised training data for additional fine-tuning.
  4. Final Tuning
    • Using the enriched dataset from the previous stages, a final round of reinforcement or supervised learning is performed to optimize the model—bringing it up to a performance level comparable to the latest OpenAI models.
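
As noted above, the four stages can be summarized as a simple pipeline. The sketch below only shows the order and data flow of the stages; the stage functions are hypothetical placeholders passed in as arguments, not DeepSeek’s actual training code.

```python
from typing import Any, Callable, List

def train_deepseek_r1_pipeline(
    base_model: Any,
    cold_start_data: List,    # restricted, curated dataset for the initial SFT stage
    rl_prompts: List[str],    # prompts used during reasoning-oriented RL
    general_sft_data: List,   # DeepSeek's existing supervised training data
    # Hypothetical stage functions; in practice each is a full training procedure.
    supervised_finetune: Callable,
    reasoning_rl: Callable,
    rejection_sample: Callable,
    final_tune: Callable,
) -> Any:
    """Schematic of the four-stage recipe: SFT -> reasoning RL -> rejection sampling + SFT -> final tuning."""
    # Stage 1: supervised fine-tuning on a restricted dataset.
    model = supervised_finetune(base_model, cold_start_data)
    # Stage 2: reasoning-oriented RL, as in DeepSeek-R1-Zero.
    model = reasoning_rl(model, rl_prompts)
    # Stage 3: rejection sampling keeps only high-quality outputs,
    # which are combined with the existing supervised data for another SFT pass.
    curated = rejection_sample(model, rl_prompts)
    model = supervised_finetune(model, curated + general_sft_data)
    # Stage 4: a final round of tuning on the enriched dataset.
    return final_tune(model)
```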

Through this multi-stage process, DeepSeek-R1 has evolved into a publicly accessible AI model with strong multilingual reasoning capabilities and linguistic consistency across languages.



๐Ÿ‹๏ธ‍โ™€๏ธ What’s Next for DeepSeek-R1?

There are four key areas where DeepSeek needs improvement in order to become a more advanced AI system. 

  • General Capability Enhancement
    DeepSeek-R1 falls short of DeepSeek-V3 in several key aspects: function calling (executing specific commands), multi-turn interactions (richer, ongoing conversations between the user and the AI), and JSON output (a lightweight data format commonly used for storing and transmitting structured information). To address these limitations, the model needs stronger task-handling capabilities, particularly through continued research into leveraging long Chain-of-Thought (CoT) reasoning for these tasks.
  • Language Mixing Issues
    DeepSeek-R1 is optimized for Chinese and English, and it tends to reason and respond in English when presented with queries in other languages. To resolve this, the model needs to be improved to generate more natural responses across a wider range of languages in multilingual environments. 
  • Prompt Engineering Optimization 
    DeepSeek-R1 is highly sensitive to prompts, and its performance tends to decline when using few-shot prompting, a method that guides the model to produce accurate and structured outputs through a limited number of examples. To mitigate this sensitivity and achieve more stable performance, the model should be optimized to rely on zero-shot prompting instead. This approach allows the model to perform new tasks by leveraging general knowledge it has already learned, without needing specific examples in the prompt. (A short sketch of both prompt styles follows this list.)
  • Software Engineering Task Efficiency
    Evaluation times for software engineering tasks are long, leading to low reinforcement learning (RL) efficiency. In addition, large-scale RL has not been sufficiently applied, which is why DeepSeek-R1 has not shown significant performance gains compared to DeepSeek-V3. To address this, further research is needed to improve the efficiency of the RL process by incorporating rejection sampling, a method for generating sequences from specific probability distributions, and asynchronous evaluation, which allows the next task to begin without waiting for the current one to finish. 
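
As mentioned in the prompt-engineering item above, the difference between few-shot and zero-shot prompting is easy to show. The snippet below builds both prompt styles for the same made-up task; the point from the text is simply that DeepSeek-R1 tends to behave more stably with the zero-shot form.

```python
from typing import List, Tuple

def few_shot_prompt(task: str, examples: List[Tuple[str, str]], query: str) -> str:
    """Few-shot: include worked examples in the prompt (R1's performance tends to decline with these)."""
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{task}\n{shots}\nQ: {query}\nA:"

def zero_shot_prompt(task: str, query: str) -> str:
    """Zero-shot: describe the task directly, with no examples (the style recommended for R1)."""
    return f"{task}\nQ: {query}\nA:"

# Hypothetical usage with a made-up task:
task = "Answer the arithmetic question."
print(zero_shot_prompt(task, "What is 17 + 25?"))
print(few_shot_prompt(task, [("What is 2 + 2?", "4")], "What is 17 + 25?"))
```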



🤞 Questions for Developers and Founders

As DeepSeek continues to make waves with its groundbreaking performance, the open-sourced release of the model presents significant opportunities for developers and startups alike. While the barriers to entry remain relatively high, the arrival of DeepSeek is expected to gradually lower these obstacles. In particular, the reduced GPU requirements create an environment where AI applications can be developed with far less financial burden, offering strong advantages in terms of resource optimization.

One of China's key strengths lies in its ability to rapidly implement a copy & paste strategy. While many view this approach negatively and argue that mere imitation is not an effective long-term strategy, the success of DeepSeek demonstrates that, when executed strategically, it can lead to powerful results.

Against this backdrop, developers and entrepreneurs in Korea should take a moment to reflect on several important questions: What lessons can Korea learn from this? In what areas can we improve? What kind of mindset should entrepreneurs adopt? How should Korea’s startup ecosystem evolve? Are there strategies worth adapting from Silicon Valley or China's startup environments?

Korean AI companies must now think critically about the path forward. It will become increasingly important to explore strategies and differentiated approaches that can strengthen competitiveness in the global AI landscape.




Yong Hyerim, CEO of 10X AI Club, studied Computer Science at NYU and previously developed the AI chatbot Eddie.

Website | YouTube | Disquiet

Editor: Joen