DeepSeek-R1-Zero and DeepSeek-R1: Reinforcement Learning & Fine-Tuning Analysis
Introduction
This post follows the research detailed in the paper DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning by DeepSeek AI. The paper explores how reinforcement learning can enhance reasoning abilities in large language models (LLMs).
DeepSeek-R1-Zero and DeepSeek-R1 are two cutting-edge models built upon DeepSeek-V3-Base that leverage reinforcement learning to enhance reasoning capabilities. This post examines their architectural innovations, training strategies, and performance improvements.
DeepSeek-V3-Base: The Foundation
Both DeepSeek-R1-Zero and DeepSeek-R1 originate from DeepSeek-V3-Base, a Mixture-of-Experts (MoE) LLM with:
- 671 billion total parameters (37 billion active per token during inference)
- 128K token context window for handling long-context reasoning
- Multi-Head Latent Attention (MLA) and DeepSeek-MoE architecture
- Pre-trained on 14.8 trillion tokens
These innovations allow efficient long-context handling and reasoning performance while maintaining training feasibility.
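To illustrate why only a fraction of the total parameters are active per token, here is a toy sketch of top-k expert routing in Python. The expert count, layer sizes, and routing logic are generic placeholders for illustration, not DeepSeek-V3's actual implementation.

```python
# Toy illustration of Mixture-of-Experts routing: a router picks the top-k
# experts for each token, so only those experts' weights participate in the
# forward pass. Generic sketch, not DeepSeek-V3's routing code.
import torch
import torch.nn as nn

num_experts, top_k, hidden = 8, 2, 64
experts = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(num_experts))
router = nn.Linear(hidden, num_experts)

token = torch.randn(1, hidden)                 # one token's hidden state
scores = router(token).softmax(dim=-1)         # routing probabilities over experts
weights, idx = scores.topk(top_k, dim=-1)      # keep only the top-k experts
# Only the selected experts run a forward pass for this token.
output = sum(w * experts[int(i)](token) for w, i in zip(weights[0], idx[0]))
print(output.shape)  # torch.Size([1, 64]); 2 of 8 experts were used
```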
For a detailed exploration of DeepSeek-V3, check out this post on my blog, where I provide an analysis of its architecture.
DeepSeek-R1-Zero: Pure Reinforcement Learning Model
DeepSeek-R1-Zero was trained entirely via Reinforcement Learning (RL) using Group Relative Policy Optimization (GRPO), without any supervised fine-tuning (SFT). Key aspects:
- No human-annotated data was used in training; the model learned reasoning abilities purely through RL.
- Task-based rewards were used, focusing on accuracy-based and format-based incentives.
- Challenges emerged, such as verbosity, repetition, and formatting inconsistencies, as RL alone didn't optimize for readability.
Despite these challenges, R1-Zero achieved remarkable performance, nearly matching top-tier closed models in mathematical and logical reasoning tasks.
DeepSeek-R1: Enhanced Reasoning and Readability
To address R1-Zero's shortcomings, DeepSeek-R1 incorporated a hybrid training approach:
- Cold-start SFT: A small set of high-quality reasoning demonstrations helped establish clear formatting and structured reasoning.
- Reasoning-Focused RL: Large-scale reinforcement learning further improved its problem-solving ability.
- Data Augmentation & Additional SFT: The best reasoning samples from RL were used to fine-tune the model again.
- Final RLHF & Alignment: A last RL phase ensured helpfulness, harmlessness, and user alignment.
Key Improvements in DeepSeek-R1
- Concise, well-structured responses
- Higher accuracy in reasoning tasks
- Language consistency maintained
- Better alignment for real-world applications
Reinforcement Learning
Reinforcement Learning (RL) is a machine learning paradigm where an agent learns to make decisions by interacting with an environment. The core idea is to optimize actions based on rewards, gradually improving performance over time.
Key Components of RL:
- Agent: The model or algorithm making decisions.
- Environment: The system the agent interacts with.
- Actions (A): The possible choices the agent can make.
- State (S): The current situation the agent observes.
- Reward (R): A signal indicating the quality of an action taken.
- Policy (π): A strategy mapping states to actions.
The learning process follows a loop (a minimal code sketch appears after the list):
- The agent observes the environment's state (S).
- It selects an action (A) based on its current policy.
- The environment responds with a reward (R) and a new state (S').
- The agent updates its policy to maximize future rewards.
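As a concrete illustration of this loop, here is a minimal Python sketch. The `Environment` and `Policy` classes and the toy reward are hypothetical placeholders, not part of any DeepSeek codebase.

```python
# Minimal agent-environment interaction loop (illustrative sketch only).

class Environment:
    def reset(self):
        """Return the initial state S."""
        return 0

    def step(self, action):
        """Return (next_state, reward) for the chosen action."""
        reward = 1.0 if action == 1 else 0.0   # toy reward signal
        return 0, reward

class Policy:
    def select_action(self, state):
        """Map a state to an action (here: a fixed toy choice)."""
        return 1

    def update(self, state, action, reward, next_state):
        """Adjust the policy to favor actions that earned higher reward."""
        pass  # a real implementation would apply a learning rule here

env, policy = Environment(), Policy()
state = env.reset()                                   # 1. observe state S
for _ in range(10):                                   # a few interaction steps
    action = policy.select_action(state)              # 2. choose action A
    next_state, reward = env.step(action)             # 3. receive R and S'
    policy.update(state, action, reward, next_state)  # 4. improve the policy
    state = next_state
```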
Supervised Fine-Tuning
Supervised Fine-Tuning (SFT) is a training technique where a pre-trained model is refined using high-quality labeled data. This method ensures the model learns structured responses, clear formatting, and task-specific knowledge.
Key Aspects of SFT:
- Uses labeled datasets where inputs are paired with ideal outputs.
- Improves response quality, ensuring the model follows proper reasoning steps.
- Reduces hallucinations by grounding responses in curated knowledge.
- Helps with formatting, making model outputs more readable and structured.
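To make the SFT step concrete, here is a minimal PyTorch sketch of one supervised update with teacher forcing. The tiny model and random token IDs are illustrative stand-ins for a real pre-trained LLM and a curated instruction dataset.

```python
# One supervised fine-tuning step on a labeled example (illustrative sketch).
import torch
import torch.nn as nn

vocab_size, hidden = 1000, 64
# Stand-in for a pre-trained LLM: embedding followed by an output projection.
model = nn.Sequential(nn.Embedding(vocab_size, hidden),
                      nn.Linear(hidden, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# One labeled example: input tokens paired with the ideal output tokens.
inputs  = torch.randint(0, vocab_size, (1, 16))
targets = torch.randint(0, vocab_size, (1, 16))

logits = model(inputs)                               # (1, 16, vocab_size)
loss = loss_fn(logits.view(-1, vocab_size), targets.view(-1))
loss.backward()                                      # gradient of the supervised loss
optimizer.step()                                     # one fine-tuning update
```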
What is Cold-start SFT?
Cold-start SFT is an approach used to jump-start learning in an RL-trained model by first providing it with a small but high-quality supervised dataset before reinforcement learning begins. This helps establish:
- Consistent formatting rules for responses.
- Baseline reasoning skills before RL fine-tuning.
- Faster convergence by giving the model a structured foundation.
In DeepSeek-R1, Cold-start SFT was used to:
- Train the model with a small set of expert-crafted reasoning examples.
- Establish clear formatting guidelines to prevent verbosity and repetition in later RL stages.
- Serve as a stepping stone before large-scale reinforcement learning.
Group Relative Policy Optimization
Group Relative Policy Optimization (GRPO) is a reinforcement learning technique that reduces training costs by estimating the baseline from group scores rather than using a critic model. Eliminating the separate critic network, which is resource-intensive and complex to train, makes the approach considerably more efficient. By using group scores to establish baselines, GRPO also enhances the stability and robustness of policy updates, resulting in more reliable learning outcomes.
The following sections provide a detailed breakdown of GRPO's mathematical formulation, highlighting its optimization objective, KL divergence penalty, and advantage estimation.
Key Ideas of GRPO
1. Relative Reward Estimation
GRPO avoids the need for a critic model by assigning rewards based on relative comparisons within a group of outputs. Instead of estimating absolute value functions, it computes:
$$A_i = \frac{r_i - \mathrm{mean}(\{r_1, r_2, \ldots, r_G\})}{\mathrm{std}(\{r_1, r_2, \ldots, r_G\})}$$

where $A_i$ is the advantage of output $o_i$, measured relative to the other outputs sampled in the same group, and $r_1, r_2, \ldots, r_G$ are the rewards assigned to the $G$ outputs the model generated for the same prompt. Normalizing each reward against the group mean and standard deviation makes it straightforward to compare the outputs and determine which ones perform better relative to each other.
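The group-relative computation is simple enough to show directly. The sketch below assumes a plain list of scalar rewards for one prompt's sampled outputs.

```python
# Group-relative advantage estimation: normalize each output's reward against
# the group's mean and standard deviation, so no learned critic is needed.
import numpy as np

def group_advantages(rewards):
    """Return (r_i - mean) / std for each reward in the group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Four outputs sampled for the same prompt, each scored by the reward function.
print(group_advantages([1.0, 0.0, 0.5, 1.0]))  # higher reward -> positive advantage
```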
2. Policy Ratio Clipping for Stability
To prevent unstable updates, GRPO adopts a clipped importance ratio, inspired by Proximal Policy Optimization (PPO):
$$\frac{1}{G}\sum_{i=1}^{G}\min\!\left(r_i(\theta)\,A_i,\ \mathrm{clip}\big(r_i(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,A_i\right)$$

where:
- $r_i(\theta) = \dfrac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)}$ is the policy ratio. It measures how likely output $o_i$ is under the current policy $\pi_\theta$ compared to the previous policy $\pi_{\theta_{old}}$, which is crucial for understanding how much the policy has changed and for ensuring stable updates.
- $\mathrm{clip}(r_i(\theta), 1-\varepsilon, 1+\varepsilon)$ restricts the policy ratio to a safe range.
- $A_i$ is the advantage, determining whether the output is better or worse than the group mean.
- $\theta$ denotes the parameters of the current policy, and $q$ represents the prompt (context) for which the group of outputs was sampled, ensuring that the policy is adapted appropriately to the current situation.

This ensures that policy updates remain within a controlled range, preventing overly large updates that can destabilize training. If $r_i(\theta)$ moves outside the clipping threshold, the clipped version is used to avoid excessive policy shifts.
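A minimal sketch of the clipped surrogate term, assuming per-output log-probabilities under the current and previous policies and the group-relative advantages computed above; the toy values are illustrative.

```python
# PPO-style clipped surrogate term, as adopted by GRPO (illustrative sketch).
import torch

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Clipped objective term averaged over a group of sampled outputs."""
    ratio = torch.exp(logp_new - logp_old)               # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return torch.min(unclipped, clipped).mean()          # objective to maximize

# Toy values for a group of four outputs.
logp_new = torch.tensor([-1.0, -2.0, -1.5, -0.5])
logp_old = torch.tensor([-1.1, -1.9, -1.6, -0.7])
adv      = torch.tensor([ 1.2, -0.8,  0.1, -0.5])
print(clipped_surrogate(logp_new, logp_old, adv))
```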
3. KL Regularization for Controlled Updates
GRPO applies KL divergence regularization to prevent the policy from diverging too far from a reference policy:
$$-\,\beta\,\mathbb{D}_{KL}\!\left(\pi_\theta \,\|\, \pi_{ref}\right), \qquad \mathbb{D}_{KL}\!\left(\pi_\theta \,\|\, \pi_{ref}\right) = \frac{\pi_{ref}(o_i \mid q)}{\pi_\theta(o_i \mid q)} - \log\frac{\pi_{ref}(o_i \mid q)}{\pi_\theta(o_i \mid q)} - 1$$

KL divergence measures how much one probability distribution diverges from a second, reference distribution. The weighting parameter $\beta$ controls how strongly the policy $\pi_\theta$ is constrained to stay close to the reference policy $\pi_{ref}$.
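A small sketch of this penalty term, assuming per-token log-probabilities under the current and reference policies; the value of $\beta$ here is illustrative, not the one used in DeepSeek-R1's training.

```python
# KL regularization term estimated from log-probabilities (illustrative sketch).
import torch

def kl_penalty(logp_theta, logp_ref):
    """Estimate D_KL(pi_theta || pi_ref) as ratio - log(ratio) - 1,
    with ratio = pi_ref / pi_theta; the estimate is always non-negative."""
    log_ratio = logp_ref - logp_theta
    return (torch.exp(log_ratio) - log_ratio - 1).mean()

logp_theta = torch.tensor([-1.0, -2.0, -1.5])   # current policy log-probs
logp_ref   = torch.tensor([-1.2, -1.8, -1.5])   # reference policy log-probs
beta = 0.04                                     # illustrative weighting parameter
print(beta * kl_penalty(logp_theta, logp_ref))
```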
4. Efficient Reward Computation in Practical Applications
In implementations like DeepSeek-R1, rewards are determined by:
- Accuracy-based rewards, where responses are evaluated against ground-truth answers.
- Format-based rewards, ensuring structured outputs (e.g., enforcing reasoning steps within <think> tags).
This structured reward system allows GRPO to guide models towards producing both accurate and well-formatted responses without requiring an explicit value function.
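The sketch below shows rewards of this kind. The exact reward logic used for DeepSeek-R1 is not released as code, so the tag checks and scoring values here are illustrative assumptions.

```python
# Illustrative accuracy-based and format-based reward functions.
import re

def format_reward(response: str) -> float:
    """Reward responses that place their reasoning inside <think>...</think> tags."""
    return 1.0 if re.search(r"<think>.*?</think>", response, re.DOTALL) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """Reward responses whose final answer matches the ground-truth answer."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    answer = match.group(1).strip() if match else ""
    return 1.0 if answer == ground_truth.strip() else 0.0

response = "<think>3 * 4 = 12</think><answer>12</answer>"
print(format_reward(response), accuracy_reward(response, "12"))  # 1.0 1.0
```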
Why GRPO Works
- No need for a critic model, reducing computational cost.
- Relative scoring ensures stable training, avoiding noisy reward signals.
- Clipped updates and KL constraints prevent drastic policy shifts.
- Task-specific reward shaping makes it flexible across different applications.
GRPO is an efficient reinforcement learning approach that balances scalability, stability, and performance, making it ideal for modern AI applications.
Conclusion
- Reinforcement Learning Enhancements: The use of reinforcement learning, particularly GRPO, significantly boosts reasoning capabilities without relying on human-annotated data, offering a scalable solution for training large models.
- Hybrid Training Approach: The combination of reinforcement learning with supervised fine-tuning in DeepSeek-R1 addresses the limitations of RL alone, improving readability and alignment with human-like reasoning.
- Architectural Innovations: DeepSeek models leverage Mixture-of-Experts architecture to maintain efficiency while handling long-context reasoning tasks.
- Policy Optimization Techniques: GRPO provides a cost-effective, stable method for policy optimization, avoiding the need for a critic model and reducing computational overhead.
- Real-World Alignment: The structured reward system ensures that models not only perform well in reasoning tasks but also align with real-world application requirements, enhancing usability and effectiveness.
Source(s)
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- DeepSeek-AI Model Releases on Hugging Face
Enjoyed this post? Found it insightful? Feel free to leave a comment below to share your thoughts or ask questions. A GitHub account is required to join the discussion.