DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI
Abstract
Figure 1 | Benchmark performance of DeepSeek-R1. The chart reports AIME 2024 (Pass@1), Codeforces (Percentile), GPQA Diamond (Pass@1), MATH-500 (Pass@1), MMLU (Pass@1), and SWE-bench Verified (Resolved); the y-axis gives Accuracy / Percentile (%).
Contents
1 Introduction
  1.1 Contributions
  1.2 Summary of Evaluation Results
2 Approach
  2.1 Overview
  2.2 DeepSeek-R1-Zero: Reinforcement Learning on the Base Model
    2.2.1 Reinforcement Learning Algorithm
    2.2.2 Reward Modeling
    2.2.3 Training Template
    2.2.4 Performance, Self-evolution Process and Aha Moment of DeepSeek-R1-Zero
  2.3 DeepSeek-R1: Reinforcement Learning with Cold Start
    2.3.1 Cold Start
    2.3.2 Reasoning-oriented Reinforcement Learning
    2.3.3 Rejection Sampling and Supervised Fine-Tuning
    2.3.4 Reinforcement Learning for all Scenarios
  2.4 Distillation: Empower Small Models with Reasoning Capability
3 Experiment
  3.1 DeepSeek-R1 Evaluation
  3.2 Distilled Model Evaluation
4 Discussion
  4.1 Distillation v.s. Reinforcement Learning
  4.2 Unsuccessful Attempts
1. Introduction
In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and
evolution (Anthropic, 2024; Google, 2024; OpenAI, 2024a), progressively diminishing the gap
towards Artificial General Intelligence (AGI).
Recently, post-training has emerged as an important component of the full training pipeline.
It has been shown to enhance accuracy on reasoning tasks, align models with social values, and adapt
them to user preferences, all while requiring relatively little computation compared with pre-training.
In the context of reasoning capabilities, OpenAI’s o1 (OpenAI, 2024b) series models
were the first to introduce inference-time scaling by increasing the length of the Chain-of-
Thought reasoning process. This approach has achieved significant improvements in various
reasoning tasks, such as mathematics, coding, and scientific reasoning. However, the challenge
of effective test-time scaling remains an open question for the research community. Several prior
works have explored various approaches, including process-based reward models (Lightman
et al., 2023; Uesato et al., 2022; Wang et al., 2023), reinforcement learning (Kumar et al., 2024),
and search algorithms such as Monte Carlo Tree Search and Beam Search (Feng et al., 2024; Trinh
et al., 2024; Xin et al., 2024). However, none of these methods has achieved general reasoning
performance comparable to OpenAI’s o1 series models.
In this paper, we take the first step toward improving language model reasoning capabilities
using pure reinforcement learning (RL). Our goal is to explore the potential of LLMs to develop
reasoning capabilities without any supervised data, focusing on their self-evolution through
a pure RL process. Specifically, we use DeepSeek-V3-Base as the base model and employ
GRPO (Shao et al., 2024) as the RL framework to improve model performance in reasoning.
During training, DeepSeek-R1-Zero naturally developed numerous powerful and interesting
reasoning behaviors. After thousands of RL steps, DeepSeek-R1-Zero exhibits remarkable performance
on reasoning benchmarks. For instance, the pass@1 score on AIME 2024 increases from 15.6% to
71.0%, and with majority voting, the score further improves to 86.7%, matching the performance
of OpenAI-o1-0912.
However, DeepSeek-R1-Zero encounters challenges such as poor readability and language
mixing. To address these issues and further enhance reasoning performance, we introduce
DeepSeek-R1, which incorporates a small amount of cold-start data and a multi-stage training
pipeline. Specifically, we begin by collecting thousands of cold-start examples to fine-tune the
DeepSeek-V3-Base model. Following this, we perform reasoning-oriented RL as in DeepSeek-R1-
Zero. Upon nearing convergence in the RL process, we create new SFT data through rejection
sampling on the RL checkpoint, combined with supervised data from DeepSeek-V3 in domains
such as writing, factual QA, and self-cognition, and then retrain the DeepSeek-V3-Base model.
After fine-tuning with the new data, the checkpoint undergoes an additional RL process, taking
into account prompts from all scenarios. After these steps, we obtained a checkpoint referred to
as DeepSeek-R1, which achieves performance on par with OpenAI-o1-1217.
We further explore distillation from DeepSeek-R1 to smaller dense models. Using Qwen2.5-
32B (Qwen, 2024b) as the base model, direct distillation from DeepSeek-R1 outperforms applying
RL on it. This demonstrates that the reasoning patterns discovered by larger base models are cru-
cial for improving reasoning capabilities. We open-source the distilled Qwen and Llama (Dubey
et al., 2024) series. Notably, our distilled 14B model outperforms state-of-the-art open-source
QwQ-32B-Preview (Qwen, 2024a) by a large margin, and the distilled 32B and 70B models set a
new record on the reasoning benchmarks among dense models.
1.1. Contributions
• Reasoning tasks: (1) DeepSeek-R1 achieves a score of 79.8% Pass@1 on AIME 2024, slightly
surpassing OpenAI-o1-1217. On MATH-500, it attains an impressive score of 97.3%,
performing on par with OpenAI-o1-1217 and significantly outperforming other models. (2)
On coding-related tasks, DeepSeek-R1 demonstrates expert-level performance in code competitions,
achieving a 2,029 Elo rating on Codeforces and outperforming 96.3% of human participants in
the competition. For engineering-related tasks, DeepSeek-R1 performs slightly better than
DeepSeek-V3, which could help developers in real-world tasks.
• Knowledge: On benchmarks such as MMLU, MMLU-Pro, and GPQA Diamond, DeepSeek-
R1 achieves outstanding results, significantly outperforming DeepSeek-V3 with scores
of 90.8% on MMLU, 84.0% on MMLU-Pro, and 71.5% on GPQA Diamond. While its
performance is slightly below that of OpenAI-o1-1217 on these benchmarks, DeepSeek-R1
surpasses other closed-source models, demonstrating its competitive edge in educational
tasks. On the factual benchmark SimpleQA, DeepSeek-R1 outperforms DeepSeek-V3,
demonstrating its capability in handling fact-based queries. A similar trend is observed
where OpenAI-o1 surpasses 4o on this benchmark.
• Others: DeepSeek-R1 also excels in a wide range of tasks, including creative writing,
general question answering, editing, summarization, and more. It achieves an impressive
length-controlled win-rate of 87.6% on AlpacaEval 2.0 and a win-rate of 92.3% on ArenaHard,
showcasing its strong ability to intelligently handle non-exam-oriented queries.
Additionally, DeepSeek-R1 demonstrates outstanding performance on tasks requiring
long-context understanding, substantially outperforming DeepSeek-V3 on long-context
benchmarks.
2. Approach
2.1. Overview
Previous work has heavily relied on large amounts of supervised data to enhance model
performance. In this study, we demonstrate that reasoning capabilities can be significantly
improved through large-scale reinforcement learning (RL), even without using supervised
fine-tuning (SFT) as a cold start. Furthermore, performance can be further enhanced with
the inclusion of a small amount of cold-start data. In the following sections, we present: (1)
DeepSeek-R1-Zero, which applies RL directly to the base model without any SFT data, and
(2) DeepSeek-R1, which applies RL starting from a checkpoint fine-tuned with thousands of
long Chain-of-Thought (CoT) examples; and (3) the distillation of reasoning capability from
DeepSeek-R1 into small dense models.
2.2. DeepSeek-R1-Zero: Reinforcement Learning on the Base Model
2.2.1. Reinforcement Learning Algorithm
Group Relative Policy Optimization To save the training costs of RL, we adopt Group
Relative Policy Optimization (GRPO) (Shao et al., 2024), which foregoes the critic model that is
typically the same size as the policy model, and estimates the baseline from group scores instead.
Specifically, for each question 𝑞, GRPO samples a group of outputs { 𝑜1 , 𝑜2 , · · · , 𝑜𝐺 } from the old
policy 𝜋𝜃𝑜𝑙𝑑 and then optimizes the policy model 𝜋𝜃 by maximizing the following objective:
$$
\mathcal{J}_{GRPO}(\theta) = \mathbb{E}\!\left[q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O \mid q)\right]
\frac{1}{G}\sum_{i=1}^{G}\left(\min\!\left(\frac{\pi_{\theta}(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)} A_i,\ \operatorname{clip}\!\left(\frac{\pi_{\theta}(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)},\, 1-\varepsilon,\, 1+\varepsilon\right) A_i\right) - \beta\, \mathbb{D}_{KL}\!\left(\pi_{\theta} \,\|\, \pi_{ref}\right)\right), \tag{1}
$$

$$
\mathbb{D}_{KL}\!\left(\pi_{\theta} \,\|\, \pi_{ref}\right) = \frac{\pi_{ref}(o_i \mid q)}{\pi_{\theta}(o_i \mid q)} - \log\frac{\pi_{ref}(o_i \mid q)}{\pi_{\theta}(o_i \mid q)} - 1, \tag{2}
$$

where $\varepsilon$ and $\beta$ are hyper-parameters, and $A_i$ is the advantage, computed using a group of rewards $\{r_1, r_2, \ldots, r_G\}$ corresponding to the outputs within each group:

$$
A_i = \frac{r_i - \operatorname{mean}(\{r_1, r_2, \cdots, r_G\})}{\operatorname{std}(\{r_1, r_2, \cdots, r_G\})}. \tag{3}
$$
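To make the objective concrete, below is a minimal PyTorch sketch of Equations (1)–(3), assuming per-output sequence log-probabilities under the current, old, and reference policies are already available; the function name `grpo_loss` and the hyper-parameter values are illustrative, not the paper's.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.04):
    """One GRPO update for a single question with a group of G sampled outputs."""
    # Group-relative advantage, Eq. (3): normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Clipped surrogate of Eq. (1); ratio = pi_theta / pi_theta_old per output.
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.minimum(ratio * adv,
                              torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
    # Unbiased KL estimator of Eq. (2): r - log r - 1 with r = pi_ref / pi_theta.
    r = torch.exp(logp_ref - logp_new)
    kl = r - (logp_ref - logp_new) - 1.0
    # Maximizing objective (1) is equivalent to minimizing its negation.
    return -(surrogate - beta * kl).mean()

# Toy usage: one question, a group of G = 4 sampled outputs.
G = 4
logp_old = torch.randn(G)
logp_ref = logp_old.clone()
logp_new = (logp_old + 0.1 * torch.randn(G)).requires_grad_()
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
grpo_loss(logp_new, logp_old, logp_ref, rewards).backward()
```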
A conversation between User and Assistant. The user asks a question, and the Assistant solves it.
The assistant first thinks about the reasoning process in the mind and then provides the user
with the answer. The reasoning process and answer are enclosed within <think> </think> and
<answer> </answer> tags, respectively, i.e., <think> reasoning process here </think>
<answer> answer here </answer>. User: prompt. Assistant:
Table 1 | Template for DeepSeek-R1-Zero. prompt will be replaced with the specific reasoning
question during training.
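As a concrete illustration, the small sketch below instantiates the Table 1 template in Python; the helper name and the sample question are illustrative only.

```python
TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, and the Assistant "
    "solves it. The assistant first thinks about the reasoning process in the mind and then "
    "provides the user with the answer. The reasoning process and answer are enclosed within "
    "<think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning "
    "process here </think> <answer> answer here </answer>. User: {prompt}. Assistant:"
)

def build_training_prompt(question: str) -> str:
    # `prompt` in Table 1 is replaced with the specific reasoning question.
    return TEMPLATE.format(prompt=question)

print(build_training_prompt("What is 7 * 8?"))
```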
2.2.2. Reward Modeling
The reward is the source of the training signal, which decides the optimization direction of RL.
To train DeepSeek-R1-Zero, we adopt a rule-based reward system that mainly consists of two
types of rewards:
• Accuracy rewards: The accuracy reward model evaluates whether the response is correct.
For example, in the case of math problems with deterministic results, the model is required
to provide the final answer in a specified format (e.g., within a box), enabling reliable
rule-based verification of correctness. Similarly, for LeetCode problems, a compiler can be
used to generate feedback based on predefined test cases.
• Format rewards: In addition to the accuracy reward model, we employ a format reward
model that enforces the model to put its thinking process between ‘<think>’ and ‘</think>’
tags.
We do not apply outcome or process neural reward models in developing DeepSeek-R1-Zero,
because we find that neural reward models may suffer from reward hacking during the large-scale
reinforcement learning process, and retraining the reward model requires additional training
resources and complicates the whole training pipeline.
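A minimal sketch of the two rule-based signals is shown below, assuming boxed final answers can be compared as strings; the exact matching and scoring rules used for DeepSeek-R1-Zero are not specified at this level of detail, so these functions are illustrative only.

```python
import re

def format_reward(response: str) -> float:
    # Reward responses that wrap the reasoning in <think> ... </think>
    # followed by an <answer> ... </answer> block.
    pattern = r"\s*<think>.*</think>\s*<answer>.*</answer>\s*"
    return 1.0 if re.fullmatch(pattern, response, flags=re.DOTALL) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    # For math problems, require the final answer inside \boxed{...} so that a
    # simple rule can verify correctness against the reference answer.
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

def total_reward(response: str, ground_truth: str) -> float:
    return accuracy_reward(response, ground_truth) + format_reward(response)
```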
2.2.4. Performance, Self-evolution Process and Aha Moment of DeepSeek-R1-Zero

| Model | AIME 2024 (pass@1) | AIME 2024 (cons@64) | MATH-500 (pass@1) | GPQA Diamond (pass@1) | LiveCodeBench (pass@1) | CodeForces (rating) |
| --- | --- | --- | --- | --- | --- | --- |
| OpenAI-o1-mini | 63.6 | 80.0 | 90.0 | 60.0 | 53.8 | 1820 |
| OpenAI-o1-0912 | 74.4 | 83.3 | 94.8 | 77.3 | 63.4 | 1843 |
| DeepSeek-R1-Zero | 71.0 | 86.7 | 95.9 | 73.3 | 50.0 | 1444 |

Table 2 | Comparison of DeepSeek-R1-Zero and OpenAI o1 models on reasoning-related benchmarks.
Figure 2 | AIME accuracy of DeepSeek-R1-Zero during training. For each question, we sample
16 responses and calculate the overall average accuracy to ensure a stable evaluation.
As shown in Figure 2, reinforcement learning alone enables DeepSeek-R1-Zero to attain robust reasoning capabilities without the need for any supervised
fine-tuning data. This is a noteworthy achievement, as it underscores the model’s ability to
learn and generalize effectively through RL alone. Additionally, the performance of DeepSeek-
R1-Zero can be further augmented through the application of majority voting. For example,
when majority voting is employed on the AIME benchmark, DeepSeek-R1-Zero’s performance
escalates from 71.0% to 86.7%, thereby exceeding the performance of OpenAI-o1-0912. The
ability of DeepSeek-R1-Zero to achieve such competitive performance, both with and without
majority voting, highlights its strong foundational capabilities and its potential for further
advancements in reasoning tasks.
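Majority voting can be sketched as follows; extracting each generation's final answer is assumed to have been done already, and the sample values are made up for illustration.

```python
from collections import Counter

def majority_vote(final_answers: list[str]) -> str:
    # Pick the most frequent extracted final answer among the sampled generations.
    return Counter(final_answers).most_common(1)[0][0]

samples = ["34", "34", "12", "34", "7"]   # e.g., answers parsed from 64 generations
print(majority_vote(samples))             # -> "34"
```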
Figure 3 | The average response length of DeepSeek-R1-Zero on the training set during the RL
process. DeepSeek-R1-Zero naturally learns to solve reasoning tasks with more thinking time.
As shown in Figure 3, the thinking time of DeepSeek-R1-Zero shows consistent improvement throughout the training process. This improvement is not the result of external adjustments
but rather an intrinsic development within the model. DeepSeek-R1-Zero naturally acquires the
ability to solve increasingly complex reasoning tasks by leveraging extended test-time compu-
tation. This computation ranges from generating hundreds to thousands of reasoning tokens,
allowing the model to explore and refine its thought processes in greater depth.
One of the most remarkable aspects of this self-evolution is the emergence of sophisticated
behaviors as the test-time computation increases. Behaviors such as reflection—where the model
revisits and reevaluates its previous steps—and the exploration of alternative approaches to
problem-solving arise spontaneously. These behaviors are not explicitly programmed but instead
emerge as a result of the model’s interaction with the reinforcement learning environment. This
spontaneous development significantly enhances DeepSeek-R1-Zero’s reasoning capabilities,
enabling it to tackle more challenging tasks with greater efficiency and accuracy.
Question: If $a > 1$, then the sum of the real solutions of $\sqrt{a - \sqrt{a + x}} = x$ is equal to

Response: <think>
To solve the equation $\sqrt{a - \sqrt{a + x}} = x$, let's start by squaring both ...
$\left(\sqrt{a - \sqrt{a + x}}\right)^2 = x^2 \implies a - \sqrt{a + x} = x^2.$
Rearrange to isolate the inner square root term:
$(a - x^2)^2 = a + x \implies a^2 - 2ax^2 + (x^2)^2 = a + x \implies x^4 - 2ax^2 - x + (a^2 - a) = 0$
...
Wait, wait. Wait. That's an aha moment I can flag here.
Let's reevaluate this step-by-step to identify if the correct sum can be ...
We started with the equation:
$\sqrt{a - \sqrt{a + x}} = x$
First, let's square both sides:
$a - \sqrt{a + x} = x^2 \implies \sqrt{a + x} = a - x^2$
Next, I could square both sides again, treating the equation: ...
...

Table 3 | An interesting "aha moment" of an intermediate version of DeepSeek-R1-Zero. The model learns to rethink using an anthropomorphic tone.
2.3. DeepSeek-R1: Reinforcement Learning with Cold Start
Inspired by the promising results of DeepSeek-R1-Zero, two natural questions arise: 1) Can
reasoning performance be further improved or convergence accelerated by incorporating a small
amount of high-quality data as a cold start? 2) How can we train a user-friendly model that
not only produces clear and coherent Chains of Thought (CoT) but also demonstrates strong
general capabilities? To address these questions, we design a pipeline to train DeepSeek-R1. The
pipeline consists of four stages, outlined as follows.
2.3.1. Cold Start
Unlike DeepSeek-R1-Zero, to prevent the unstable early cold-start phase of RL training from
the base model, for DeepSeek-R1 we construct and collect a small amount of long CoT data
to fine-tune the model as the initial RL actor. To collect such data, we have explored several
approaches: using few-shot prompting with a long CoT as an example, directly prompting
models to generate detailed answers with reflection and verification, gathering DeepSeek-R1-
Zero outputs in a readable format, and refining the results through post-processing by human
annotators.
In this work, we collect thousands of cold-start samples to fine-tune DeepSeek-V3-Base as
the starting point for RL. Compared to DeepSeek-R1-Zero, the advantages of cold start data
include:
• Readability: A key limitation of DeepSeek-R1-Zero is that its content is often not suitable
for reading. Responses may mix multiple languages or lack markdown formatting to
highlight answers for users. In contrast, when creating cold-start data for DeepSeek-R1,
we design a readable pattern that includes a summary at the end of each response and
filters out responses that are not reader-friendly. Here, we define the output format as
|special_token|<reasoning_process>|special_token|<summary>, where the reasoning
process is the CoT for the query, and the summary is used to summarize the reasoning
results (a small sketch of this format follows the list).
• Potential: By carefully designing the pattern for cold-start data with human priors, we
observe better performance than with DeepSeek-R1-Zero. We believe iterative training is
a better approach for reasoning models.
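The sketch below illustrates the cold-start output format; the literal `|special_token|` marker is a placeholder, since the paper does not specify the actual special token.

```python
SPECIAL = "|special_token|"   # placeholder for the actual special token

def render(reasoning_process: str, summary: str) -> str:
    return f"{SPECIAL}{reasoning_process}{SPECIAL}{summary}"

def parse(output: str) -> tuple[str, str]:
    _, reasoning_process, summary = output.split(SPECIAL, 2)
    return reasoning_process, summary
```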
2.3.2. Reasoning-oriented Reinforcement Learning
After fine-tuning DeepSeek-V3-Base on the cold start data, we apply the same large-scale
reinforcement learning training process as employed in DeepSeek-R1-Zero. This phase focuses
on enhancing the model’s reasoning capabilities, particularly in reasoning-intensive tasks such
as coding, mathematics, science, and logic reasoning, which involve well-defined problems with
clear solutions. During the training process, we observe that CoT often exhibits language mixing,
particularly when RL prompts involve multiple languages. To mitigate the issue of language
mixing, we introduce a language consistency reward during RL training, which is calculated
as the proportion of target language words in the CoT. Although ablation experiments show
that such alignment results in a slight degradation in the model’s performance, this reward
aligns with human preferences, making the output more readable. Finally, we combine the accuracy reward on
reasoning tasks and the language consistency reward by directly summing them to form the
final reward. We then apply reinforcement learning (RL) training on the fine-tuned model until
it achieves convergence on reasoning tasks.
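A rough sketch of this combined reward follows; the paper only states that the language consistency reward is the proportion of target-language words in the CoT and that the two signals are summed, so treating "target-language words" as ASCII-alphabetic tokens (for an English target) is an assumption made purely for illustration.

```python
import re

def language_consistency_reward(cot: str) -> float:
    # Proportion of target-language words in the CoT; here "target-language"
    # is approximated as ASCII-alphabetic tokens for an English target.
    words = re.findall(r"\S+", cot)
    if not words:
        return 0.0
    target = [w for w in words if re.fullmatch(r"[A-Za-z][A-Za-z'\-]*", w)]
    return len(target) / len(words)

def final_reward(accuracy_reward: float, cot: str) -> float:
    # The two signals are combined by directly summing them.
    return accuracy_reward + language_consistency_reward(cot)
```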
2.3.3. Rejection Sampling and Supervised Fine-Tuning
Reasoning data We curate reasoning prompts and generate reasoning trajectories by perform-
ing rejection sampling from the checkpoint of the above RL training. In the previous stage,
we only included data that could be evaluated using rule-based rewards. However, in this stage,
we expand the dataset by incorporating additional data, some of which use a generative reward
model by feeding the ground-truth and model predictions into DeepSeek-V3 for judgment.
Additionally, because the model output is sometimes chaotic and difficult to read, we have
filtered out chains of thought with mixed languages, long paragraphs, and code blocks. For
each prompt, we sample multiple responses and retain only the correct ones. In total, we collect
about 600k reasoning-related training samples.
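A simplified sketch of this collection loop is given below; `generate`, `is_correct`, and `is_readable` are hypothetical stand-ins for the RL checkpoint's sampler, the rule-based or DeepSeek-V3-judged verifier, and the readability filters described above.

```python
from typing import Callable

def collect_sft_data(prompts: list[str],
                     generate: Callable[[str], str],
                     is_correct: Callable[[str, str], bool],
                     is_readable: Callable[[str], bool],
                     samples_per_prompt: int = 4) -> list[dict]:
    dataset = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            response = generate(prompt)
            # Keep only responses that pass the readability filters (no mixed
            # languages, long paragraphs, or code blocks in the CoT) and whose
            # final answer is judged correct.
            if is_readable(response) and is_correct(prompt, response):
                dataset.append({"prompt": prompt, "response": response})
    return dataset
```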
Non-Reasoning data For non-reasoning data, such as writing, factual QA, self-cognition,
and translation, we adopt the DeepSeek-V3 pipeline and reuse portions of the SFT dataset of
DeepSeek-V3. For certain non-reasoning tasks, we call DeepSeek-V3 to generate a potential
chain-of-thought before answering the question by prompting. However, for simpler queries,
such as “hello”, we do not provide a CoT in response. In the end, we collected a total of
approximately 200k training samples that are unrelated to reasoning.
We fine-tune DeepSeek-V3-Base for two epochs using the above curated dataset of about
800k samples.
2.3.4. Reinforcement Learning for all Scenarios
To further align the model with human preferences, we implement a secondary reinforcement
learning stage aimed at improving the model’s helpfulness and harmlessness while simultane-
ously refining its reasoning capabilities. Specifically, we train the model using a combination
of reward signals and diverse prompt distributions. For reasoning data, we adhere to the
methodology outlined in DeepSeek-R1-Zero, which utilizes rule-based rewards to guide the
learning process in math, code, and logical reasoning domains. For general data, we resort to
reward models to capture human preferences in complex and nuanced scenarios. We build
upon the DeepSeek-V3 pipeline and adopt a similar distribution of preference pairs and train-
ing prompts. For helpfulness, we focus exclusively on the final summary, ensuring that the
assessment emphasizes the utility and relevance of the response to the user while minimizing
interference with the underlying reasoning process. For harmlessness, we evaluate the entire
response of the model, including both the reasoning process and the summary, to identify and
mitigate any potential risks, biases, or harmful content that may arise during the generation
process. Ultimately, the integration of reward signals and diverse data distributions enables us
to train a model that excels in reasoning while prioritizing helpfulness and harmlessness.
2.4. Distillation: Empower Small Models with Reasoning Capability
To equip more efficient smaller models with reasoning capabilities like DeepSeek-R1, we directly
fine-tuned open-source models like Qwen (Qwen, 2024b) and Llama (AI@Meta, 2024) using
the 800k samples curated with DeepSeek-R1, as detailed in §2.3.3. Our findings indicate that
this straightforward distillation method significantly enhances the reasoning abilities of smaller
models. The base models we use here are Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Qwen2.5-
14B, Qwen2.5-32B, Llama-3.1-8B, and Llama-3.3-70B-Instruct. We select Llama-3.3 because its
reasoning capability is slightly better than that of Llama-3.1.
For distilled models, we apply only SFT and do not include an RL stage, even though
incorporating RL could substantially boost model performance. Our primary goal here is to
demonstrate the effectiveness of the distillation technique, leaving the exploration of the RL
stage to the broader research community.
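Conceptually, this distillation step is plain supervised fine-tuning on DeepSeek-R1-generated samples. The sketch below uses Hugging Face transformers with a placeholder base model and illustrative hyper-parameters, and it omits details such as masking prompt tokens from the loss and multi-GPU training; it is not the authors' training code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B"  # placeholder; the paper distills into several Qwen/Llama sizes
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # illustrative learning rate

def sft_step(prompt: str, response: str) -> float:
    # Next-token cross-entropy on prompt + DeepSeek-R1-generated response
    # (prompt-token masking is omitted for brevity).
    batch = tokenizer(prompt + response, return_tensors="pt", truncation=True)
    out = model(**batch, labels=batch["input_ids"])
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```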
3. Experiment
Benchmarks We evaluate models on MMLU (Hendrycks et al., 2020), MMLU-Redux (Gema
et al., 2024), MMLU-Pro (Wang et al., 2024), C-Eval (Huang et al., 2023), CMMLU (Li et al.,
2023), IFEval (Zhou et al., 2023), FRAMES (Krishna et al., 2024), GPQA Diamond (Rein et al.,
2023), SimpleQA (OpenAI, 2024c), C-SimpleQA (He et al., 2024), SWE-Bench Verified (OpenAI,
2024d), Aider¹, LiveCodeBench (Jain et al., 2024) (2024-08 – 2025-01), Codeforces², Chinese
National High School Mathematics Olympiad (CNMO 2024)³, and American Invitational Math-
ematics Examination 2024 (AIME 2024) (MAA, 2024). In addition to standard benchmarks, we
also evaluate our models on open-ended generation tasks using LLMs as judges. Specifically, we
adhere to the original configurations of AlpacaEval 2.0 (Dubois et al., 2024) and Arena-Hard (Li
et al., 2024), which leverage GPT-4-Turbo-1106 as the judge for pairwise comparisons. Here, we
feed only the final summary to the evaluation to avoid length bias. For distilled models, we
report representative results on AIME 2024, MATH-500, GPQA Diamond, Codeforces, and
LiveCodeBench.
Generation Setup For all our models, the maximum generation length is set to 32,768 tokens.
For benchmarks requiring sampling, we use a temperature of 0.6, a top-p value of 0.95, and
generate 64 responses per query to estimate pass@1.
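Under this setup, pass@1 is simply the mean correctness over the k = 64 sampled responses, as sketched below with a smaller k for illustration.

```python
def pass_at_1(correct_flags: list[bool]) -> float:
    # correct_flags[i] says whether the i-th of the k sampled responses solved
    # the problem; pass@1 is the mean correctness over the k samples.
    return sum(correct_flags) / len(correct_flags)

print(pass_at_1([True, False, True, True]))  # 0.75, with k = 4 for illustration
```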
3.1. DeepSeek-R1 Evaluation
For education-oriented knowledge benchmarks such as MMLU, MMLU-Pro, and GPQA Di-
amond, DeepSeek-R1 demonstrates superior performance compared to DeepSeek-V3. This
improvement is primarily attributed to enhanced accuracy in STEM-related questions, where
significant gains are achieved through large-scale reinforcement learning (RL). Additionally,
DeepSeek-R1 excels on FRAMES, a long-context-dependent QA task, showcasing its strong
document analysis capabilities. This highlights the potential of reasoning models in AI-driven search and data analysis tasks.
¹ https://aider.chat
² https://codeforces.com
³ https://www.cms.org.cn/Home/comp/comp/cid/12.html
| Category | Benchmark (Metric) | Claude-3.5-Sonnet-1022 | GPT-4o-0513 | DeepSeek-V3 | OpenAI-o1-mini | OpenAI-o1-1217 | DeepSeek-R1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| | Architecture | - | - | MoE | - | - | MoE |
| | # Activated Params | - | - | 37B | - | - | 37B |
| | # Total Params | - | - | 671B | - | - | 671B |
| English | MMLU (Pass@1) | 88.3 | 87.2 | 88.5 | 85.2 | 91.8 | 90.8 |
| English | MMLU-Redux (EM) | 88.9 | 88.0 | 89.1 | 86.7 | - | 92.9 |
| English | MMLU-Pro (EM) | 78.0 | 72.6 | 75.9 | 80.3 | - | 84.0 |
| English | DROP (3-shot F1) | 88.3 | 83.7 | 91.6 | 83.9 | 90.2 | 92.2 |
| English | IF-Eval (Prompt Strict) | 86.5 | 84.3 | 86.1 | 84.8 | - | 83.3 |
| English | GPQA Diamond (Pass@1) | 65.0 | 49.9 | 59.1 | 60.0 | 75.7 | 71.5 |
| English | SimpleQA (Correct) | 28.4 | 38.2 | 24.9 | 7.0 | 47.0 | 30.1 |
| English | FRAMES (Acc.) | 72.5 | 80.5 | 73.3 | 76.9 | - | 82.5 |
| English | AlpacaEval2.0 (LC-winrate) | 52.0 | 51.1 | 70.0 | 57.8 | - | 87.6 |
| English | ArenaHard (GPT-4-1106) | 85.2 | 80.4 | 85.5 | 92.0 | - | 92.3 |
| Code | LiveCodeBench (Pass@1-COT) | 38.9 | 32.9 | 36.2 | 53.8 | 63.4 | 65.9 |
| Code | Codeforces (Percentile) | 20.3 | 23.6 | 58.7 | 93.4 | 96.6 | 96.3 |
| Code | Codeforces (Rating) | 717 | 759 | 1134 | 1820 | 2061 | 2029 |
| Code | SWE Verified (Resolved) | 50.8 | 38.8 | 42.0 | 41.6 | 48.9 | 49.2 |
| Code | Aider-Polyglot (Acc.) | 45.3 | 16.0 | 49.6 | 32.9 | 61.7 | 53.3 |
| Math | AIME 2024 (Pass@1) | 16.0 | 9.3 | 39.2 | 63.6 | 79.2 | 79.8 |
| Math | MATH-500 (Pass@1) | 78.3 | 74.6 | 90.2 | 90.0 | 96.4 | 97.3 |
| Math | CNMO 2024 (Pass@1) | 13.1 | 10.8 | 43.2 | 67.6 | - | 78.8 |
| Chinese | CLUEWSC (EM) | 85.4 | 87.9 | 90.9 | 89.9 | - | 92.8 |
| Chinese | C-Eval (EM) | 76.7 | 76.0 | 86.5 | 68.9 | - | 91.8 |
| Chinese | C-SimpleQA (Correct) | 55.4 | 58.7 | 68.0 | 40.3 | - | 63.7 |

Table 4 | Comparison between DeepSeek-R1 and other representative models.
On the factual benchmark SimpleQA, DeepSeek-R1 outperforms
DeepSeek-V3, demonstrating its capability in handling fact-based queries. A similar trend
is observed where OpenAI-o1 surpasses GPT-4o on this benchmark. However, DeepSeek-R1
performs worse than DeepSeek-V3 on the Chinese SimpleQA benchmark, primarily due to its
tendency to refuse answering certain queries after safety RL. Without safety RL, DeepSeek-R1
could achieve an accuracy of over 70%.
DeepSeek-R1 also delivers impressive results on IF-Eval, a benchmark designed to assess a
model’s ability to follow format instructions. These improvements can be linked to the inclusion
of instruction-following data during the final stages of supervised fine-tuning (SFT) and RL
training. Furthermore, remarkable performance is observed on AlpacaEval2.0 and ArenaHard,
indicating DeepSeek-R1’s strengths in writing tasks and open-domain question answering. Its
significant outperformance of DeepSeek-V3 underscores the generalization benefits of large-scale
RL, which not only boosts reasoning capabilities but also improves performance across diverse
domains. Moreover, the summary lengths generated by DeepSeek-R1 are concise, with an
average of 689 tokens on ArenaHard and 2,218 characters on AlpacaEval 2.0. This indicates that
DeepSeek-R1 avoids introducing length bias during GPT-based evaluations, further solidifying
its robustness across multiple tasks.
On math tasks, DeepSeek-R1 demonstrates performance on par with OpenAI-o1-1217,
surpassing other models by a large margin. A similar trend is observed on coding algorithm
tasks, such as LiveCodeBench and Codeforces, where reasoning-focused models dominate these
benchmarks. On engineering-oriented coding tasks, OpenAI-o1-1217 outperforms DeepSeek-R1
on Aider but achieves comparable performance on SWE Verified. We believe the engineering
performance of DeepSeek-R1 will improve in the next version, as the amount of related RL
training data currently remains very limited.
3.2. Distilled Model Evaluation

| Model | AIME 2024 (pass@1) | AIME 2024 (cons@64) | MATH-500 (pass@1) | GPQA Diamond (pass@1) | LiveCodeBench (pass@1) | CodeForces (rating) |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4o-0513 | 9.3 | 13.4 | 74.6 | 49.9 | 32.9 | 759 |
| Claude-3.5-Sonnet-1022 | 16.0 | 26.7 | 78.3 | 65.0 | 38.9 | 717 |
| OpenAI-o1-mini | 63.6 | 80.0 | 90.0 | 60.0 | 53.8 | 1820 |
| QwQ-32B-Preview | 50.0 | 60.0 | 90.6 | 54.5 | 41.9 | 1316 |
| DeepSeek-R1-Distill-Qwen-1.5B | 28.9 | 52.7 | 83.9 | 33.8 | 16.9 | 954 |
| DeepSeek-R1-Distill-Qwen-7B | 55.5 | 83.3 | 92.8 | 49.1 | 37.6 | 1189 |
| DeepSeek-R1-Distill-Qwen-14B | 69.7 | 80.0 | 93.9 | 59.1 | 53.1 | 1481 |
| DeepSeek-R1-Distill-Qwen-32B | 72.6 | 83.3 | 94.3 | 62.1 | 57.2 | 1691 |
| DeepSeek-R1-Distill-Llama-8B | 50.4 | 80.0 | 89.1 | 49.0 | 39.6 | 1205 |
| DeepSeek-R1-Distill-Llama-70B | 70.0 | 86.7 | 94.5 | 65.2 | 57.5 | 1633 |

Table 5 | Comparison of DeepSeek-R1 distilled models and other comparable models on reasoning-related benchmarks.
As shown in Table 5, simply distilling DeepSeek-R1’s outputs enables the efficient DeepSeek-
R1-7B (i.e., DeepSeek-R1-Distill-Qwen-7B, abbreviated similarly below) to outperform non-
reasoning models like GPT-4o-0513 across the board. DeepSeek-R1-14B surpasses QwQ-32B-
Preview on all evaluation metrics, while DeepSeek-R1-32B and DeepSeek-R1-70B significantly
exceed o1-mini on most benchmarks. These results demonstrate the strong potential of distilla-
tion. Additionally, we found that applying RL to these distilled models yields significant further
gains. We believe this warrants further exploration and therefore present only the results of the
simple SFT-distilled models here.
4. Discussion
4.1. Distillation v.s. Reinforcement Learning
In Section 3.2, we saw that distilling DeepSeek-R1 enables small models to achieve
impressive results. However, there is still one question left: can the model achieve comparable
performance through the large-scale RL training discussed in the paper without distillation?
To answer this question, we conduct large-scale RL training on Qwen-32B-Base using math,
code, and STEM data, training for over 10K steps, resulting in DeepSeek-R1-Zero-Qwen-32B. The
experimental results, shown in Figure 6, demonstrate that the 32B base model, after large-scale
RL training, achieves performance on par with QwQ-32B-Preview. However, DeepSeek-R1-
Distill-Qwen-32B, which is distilled from DeepSeek-R1, performs significantly better than
DeepSeek-R1-Zero-Qwen-32B across all benchmarks. Therefore, we can draw two conclusions:
First, distilling more powerful models into smaller ones yields excellent results, whereas smaller
models relying on the large-scale RL mentioned in this paper require enormous computational
power and may not even achieve the performance of distillation. Second, while distillation
strategies are both economical and effective, advancing beyond the boundaries of intelligence
may still require more powerful base models and larger-scale reinforcement learning.
4.2. Unsuccessful Attempts
In the early stages of developing DeepSeek-R1, we also encountered failures and setbacks along
the way. We share our failure experiences here to provide insights, but this does not imply that
these approaches are incapable of developing effective reasoning models.
Process Reward Model (PRM) PRM is a reasonable method to guide the model toward better
approaches for solving reasoning tasks (Lightman et al., 2023; Uesato et al., 2022; Wang et al.,
2023). However, in practice, PRM has three main limitations that may hinder its ultimate suc-
cess. First, it is challenging to explicitly define a fine-grained step in general reasoning. Second,
determining whether the current intermediate step is correct is a challenging task. Automated
annotation using models may not yield satisfactory results, while manual annotation is not con-
ducive to scaling up. Third, once a model-based PRM is introduced, it inevitably leads to reward
hacking (Gao et al., 2022), and retraining the reward model requires additional training resources
and complicates the whole training pipeline. In conclusion, while PRM demonstrates a good
ability to rerank the top-N responses generated by the model or assist in guided search (Snell
et al., 2024), its advantages are limited compared to the additional computational overhead it
introduces during the large-scale reinforcement learning process in our experiments.
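For context, reranking with a PRM can be sketched as follows; `prm_score` is a hypothetical stand-in for the process reward model, and aggregating per-step scores by their minimum is one common choice, not necessarily the one used in these experiments.

```python
from typing import Callable

def rerank_with_prm(candidates: list[list[str]],
                    prm_score: Callable[[str], float]) -> list[list[str]]:
    # `candidates` holds N candidate solutions, each a list of reasoning steps;
    # each solution is scored by the minimum of its per-step PRM scores.
    def aggregate(steps: list[str]) -> float:
        return min(prm_score(step) for step in steps)
    return sorted(candidates, key=aggregate, reverse=True)
```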
Monte Carlo Tree Search (MCTS) Inspired by AlphaGo (Silver et al., 2017b) and AlphaZero (Sil-
ver et al., 2017a), we explored using Monte Carlo Tree Search (MCTS) to enhance test-time
compute scalability. This approach involves breaking answers into smaller parts to allow the
model to explore the solution space systematically. To facilitate this, we prompt the model to
generate multiple tags that correspond to specific reasoning steps necessary for the search. For
training, we first use collected prompts to find answers via MCTS guided by a pre-trained value
model. Subsequently, we use the resulting question-answer pairs to train both the actor model
and the value model, iteratively refining the process.
However, this approach encounters several challenges when scaling up the training. First,
unlike chess, where the search space is relatively well-defined, token generation presents an
exponentially larger search space. To address this, we set a maximum extension limit for each
node, but this can lead to the model getting stuck in local optima. Second, the value model
directly influences the quality of generation since it guides each step of the search process.
Training a fine-grained value model is inherently difficult, which makes it challenging for the
model to iteratively improve. While AlphaGo’s core success relied on training a value model to
progressively enhance its performance, this principle proves difficult to replicate in our setup
due to the complexities of token generation.
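For concreteness, a heavily simplified sketch of this search is shown below; `propose_step` and `value_model` are hypothetical stand-ins for the policy's step generator and the pretrained value model, and the per-node cap mirrors the maximum extension limit mentioned above. It is only an illustration of the general idea, not the actual system.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Node:
    steps: list[str]
    parent: Optional["Node"] = None
    children: list["Node"] = field(default_factory=list)
    visits: int = 0
    value_sum: float = 0.0

def ucb(child: Node, parent: Node, c: float = 1.0) -> float:
    if child.visits == 0:
        return float("inf")
    return (child.value_sum / child.visits
            + c * math.sqrt(math.log(parent.visits + 1) / child.visits))

def mcts_first_step(question: str,
                    propose_step: Callable[[str, list[str]], str],
                    value_model: Callable[[str, list[str]], float],
                    iterations: int = 100,
                    max_children: int = 4,   # maximum extension limit per node
                    max_depth: int = 8) -> list[str]:
    root = Node(steps=[])
    for _ in range(iterations):
        # Selection: descend by UCB while the node is fully extended.
        node = root
        while node.children and len(node.children) >= max_children:
            node = max(node.children, key=lambda ch: ucb(ch, node))
        # Expansion: add one more reasoning step, capped by the extension limit.
        if len(node.steps) < max_depth and len(node.children) < max_children:
            step = propose_step(question, node.steps)
            child = Node(steps=node.steps + [step], parent=node)
            node.children.append(child)
            node = child
        # Evaluation: the value model scores the partial solution.
        value = value_model(question, node.steps)
        # Backpropagation.
        while node is not None:
            node.visits += 1
            node.value_sum += value
            node = node.parent
    # In practice the search would be repeated to commit one step at a time;
    # here we simply return the most-visited first step's prefix.
    best = max(root.children, key=lambda ch: ch.visits) if root.children else root
    return best.steps
```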
In conclusion, while MCTS can improve performance during inference when paired with a
pre-trained value model, iteratively boosting model performance through self-search remains a
significant challenge.
References
AI@Meta. Llama 3.1 model card, 2024. URL https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md.
Anthropic. Claude 3.5 sonnet, 2024. URL https://www.anthropic.com/news/claude-3-5-sonnet.
A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten,
A. Yang, A. Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
Y. Dubois, B. Galambosi, P. Liang, and T. B. Hashimoto. Length-controlled alpacaeval: A simple
way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024.
X. Feng, Z. Wan, M. Wen, S. M. McAleer, Y. Wen, W. Zhang, and J. Wang. Alphazero-like
tree-search can guide large language model decoding and training, 2024. URL https://arxiv.org/abs/2309.17179.
L. Gao, J. Schulman, and J. Hilton. Scaling laws for reward model overoptimization, 2022. URL
https://arxiv.org/abs/2210.10760.
A. P. Gema, J. O. J. Leang, G. Hong, A. Devoto, A. C. M. Mancino, R. Saxena, X. He, Y. Zhao,
X. Du, M. R. G. Madani, C. Barale, R. McHardy, J. Harris, J. Kaddour, E. van Krieken, and
P. Minervini. Are we done with mmlu? CoRR, abs/2406.04127, 2024. URL https://doi.org/10.48550/arXiv.2406.04127.
Google. Our next-generation model: Gemini 1.5, 2024. URL https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024.
Y. He, S. Li, J. Liu, Y. Tan, W. Wang, H. Huang, X. Bu, H. Guo, C. Hu, B. Zheng, et al. Chi-
nese simpleqa: A chinese factuality evaluation for large language models. arXiv preprint
arXiv:2411.07140, 2024.
Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su, J. Liu, C. Lv, Y. Zhang, J. Lei, et al. C-Eval: A
multi-level multi-discipline chinese evaluation suite for foundation models. arXiv preprint
arXiv:2305.08322, 2023.
N. Jain, K. Han, A. Gu, W. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica.
Livecodebench: Holistic and contamination free evaluation of large language models for code.
CoRR, abs/2403.07974, 2024. URL https://doi.org/10.48550/arXiv.2403.07974.
H. Li, Y. Zhang, F. Koto, Y. Yang, H. Zhao, Y. Gong, N. Duan, and T. Baldwin. CMMLU: Measur-
ing massive multitask language understanding in Chinese. arXiv preprint arXiv:2306.09212,
2023.
T. Li, W.-L. Chiang, E. Frick, L. Dunlap, T. Wu, B. Zhu, J. E. Gonzalez, and I. Stoica. From
crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline. arXiv
preprint arXiv:2406.11939, 2024.
B. Y. Lin. ZeroEval: A Unified Framework for Evaluating Language Models, July 2024. URL
https://github.com/WildEval/ZeroEval.
MAA. American invitational mathematics examination - aime. In American Invitational
Mathematics Examination - AIME 2024, February 2024. URL https://maa.org/math-competitions/american-invitational-mathematics-examination-aime.
OpenAI. Hello GPT-4o, 2024a. URL https://openai.com/index/hello-gpt-4o/.
Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. Li, Y. Wu, and D. Guo. Deepseekmath:
Pushing the limits of mathematical reasoning in open language models. arXiv preprint
arXiv:2402.03300, 2024.
C. Snell, J. Lee, K. Xu, and A. Kumar. Scaling llm test-time compute optimally can be more
effective than scaling model parameters, 2024. URL https://arxiv.org/abs/2408.03314.
T. Trinh, Y. Wu, Q. Le, H. He, and T. Luong. Solving olympiad geometry without human
demonstrations. Nature, 2024. doi: 10.1038/s41586-023-06747-5.
P. Wang, L. Li, Z. Shao, R. Xu, D. Dai, Y. Li, D. Chen, Y. Wu, and Z. Sui. Math-shepherd: A label-
free step-by-step verifier for llms in mathematical reasoning. arXiv preprint arXiv:2312.08935,
2023.
Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, T. Li,
M. Ku, K. Wang, A. Zhuang, R. Fan, X. Yue, and W. Chen. Mmlu-pro: A more robust and
challenging multi-task language understanding benchmark. CoRR, abs/2406.01574, 2024.
URL https://doi.org/10.48550/arXiv.2406.01574.
H. Xin, Z. Z. Ren, J. Song, Z. Shao, W. Zhao, H. Wang, B. Liu, L. Zhang, X. Lu, Q. Du, W. Gao,
Q. Zhu, D. Yang, Z. Gou, Z. F. Wu, F. Luo, and C. Ruan. Deepseek-prover-v1.5: Harnessing
proof assistant feedback for reinforcement learning and monte-carlo tree search, 2024. URL
https://arxiv.org/abs/2408.08152.
J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou. Instruction-following
evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023.
Appendix
A. Contributions and Acknowledgments
Ruyi Chen Y.X. Wei
Shanghao Lu Yang Zhang
Shangyan Zhou Yanhong Xu
Shanhuang Chen Yao Li
Shengfeng Ye Yao Zhao
Shiyu Wang Yaofeng Sun
Shuiping Yu Yaohui Wang
Shunfeng Zhou Yi Yu
Shuting Pan Yichao Zhang
S.S. Li Yifan Shi
Shuang Zhou Yiliang Xiong
Shaoqing Wu Ying He
Shengfeng Ye Yishi Piao
Tao Yun Yisong Wang
Tian Pei Yixuan Tan
Tianyu Sun Yiyang Ma*
T. Wang Yiyuan Liu
Wangding Zeng Yongqiang Guo
Wen Liu Yuan Ou
Wenfeng Liang Yuduan Wang
Wenjun Gao Yue Gong
Wenqin Yu* Yuheng Zou
Wentao Zhang Yujia He
W.L. Xiao Yunfan Xiong
Wei An Yuxiang Luo
Xiaodong Liu Yuxiang You
Xiaohan Wang Yuxuan Liu
Xiaokang Chen Yuyang Zhou
Xiaotao Nie Y.X. Zhu
Xin Cheng Yanping Huang
Xin Liu Yaohui Li
Xin Xie Yi Zheng
Xingchao Liu Yuchen Zhu
Xinyu Yang Yunxian Ma
Xinyuan Li Ying Tang
Xuecheng Su Yukun Zha
Xuheng Lin Yuting Yan
X.Q. Li Z.Z. Ren
Xiangyue Jin Zehui Ren
Xiaojin Shen Zhangli Sha
Xiaosha Chen Zhe Fu
Xiaowen Sun Zhean Xu
Xiaoxiang Wang Zhenda Xie
Xinnan Song Zhengyan Zhang
Xinyi Zhou Zhewen Hao
Xianzu Wang Zhicheng Ma
Xinxia Shan Zhigang Yan
Y.K. Li Zhiyu Wu
Y.Q. Wang Zihui Gu
Zijia Zhu Zhen Huang
Zijun Liu* Zhipeng Xu
Zilin Li Zhongyu Zhang
Ziwei Xie Zhen Zhang
Ziyang Song
Zizheng Pan
Within each role, authors are listed alphabetically by the first name. Names marked with *
denote individuals who have departed from our team.