DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-AI
Abstract
Figure 1 | (a) MMLU accuracy vs. activated parameters, among different open-source models.
(b) Training costs and inference efficiency of DeepSeek 67B (Dense) and DeepSeek-V2.
Contents

1 Introduction
2 Architecture
 2.1 Multi-Head Latent Attention: Boosting Inference Efficiency
  2.1.1 Preliminaries: Standard Multi-Head Attention
  2.1.2 Low-Rank Key-Value Joint Compression
  2.1.3 Decoupled Rotary Position Embedding
  2.1.4 Comparison of Key-Value Cache
 2.2 DeepSeekMoE: Training Strong Models at Economical Costs
  2.2.1 Basic Architecture
  2.2.2 Device-Limited Routing
  2.2.3 Auxiliary Loss for Load Balance
  2.2.4 Token-Dropping Strategy
3 Pre-Training
 3.1 Experimental Setups
  3.1.1 Data Construction
  3.1.2 Hyper-Parameters
  3.1.3 Infrastructures
  3.1.4 Long Context Extension
 3.2 Evaluations
  3.2.1 Evaluation Benchmarks
  3.2.2 Evaluation Results
  3.2.3 Training and Inference Efficiency
4 Alignment
 4.1 Supervised Fine-Tuning
 4.2 Reinforcement Learning
 4.3 Evaluation Results
 4.4 Discussion
A Contributions and Acknowledgments
B DeepSeek-V2-Lite: A 16B Model Equipped with MLA and DeepSeekMoE
 B.1 Model Description
 B.2 Performance Evaluation
C Full Formulas of MLA
G Evaluation Formats
1. Introduction
In the past few years, Large Language Models (LLMs) (Anthropic, 2023; Google, 2023; OpenAI,
2022, 2023) have undergone rapid development, offering a glimpse into the dawn of Artificial
General Intelligence (AGI). In general, the intelligence of an LLM tends to improve as the number
of parameters increases, allowing it to exhibit emergent capabilities across various tasks (Wei
et al., 2022). However, the improvement comes at the cost of larger computing resources for
training and a potential decrease in inference throughput. These constraints present significant
challenges that impede the widespread adoption and utilization of LLMs. In order to tackle this
problem, we introduce DeepSeek-V2, a strong open-source Mixture-of-Experts (MoE) language
model, characterized by economical training and efficient inference through an innovative
Transformer architecture. It is equipped with a total of 236B parameters, of which 21B are
activated for each token, and supports a context length of 128K tokens.
We optimize the attention modules and Feed-Forward Networks (FFNs) within the Trans-
former framework (Vaswani et al., 2017) with our proposed Multi-head Latent Attention (MLA)
and DeepSeekMoE. (1) In the context of attention mechanisms, the Key-Value (KV) cache
of the Multi-Head Attention (MHA) (Vaswani et al., 2017) poses a significant obstacle to the
inference efficiency of LLMs. Various approaches have been explored to address this issue,
including Grouped-Query Attention (GQA) (Ainslie et al., 2023) and Multi-Query Attention
(MQA) (Shazeer, 2019). However, these methods often compromise performance in their attempt
to reduce the KV cache. In order to achieve the best of both worlds, we introduce MLA, an
attention mechanism equipped with low-rank key-value joint compression. Empirically, MLA
achieves superior performance compared with MHA, and meanwhile significantly reduces
the KV cache during inference, thus boosting the inference efficiency. (2) For Feed-Forward
Networks (FFNs), we follow the DeepSeekMoE architecture (Dai et al., 2024), which adopts
fine-grained expert segmentation and shared expert isolation for higher potential in expert
specialization. The DeepSeekMoE architecture demonstrates great advantages compared with
conventional MoE architectures like GShard (Lepikhin et al., 2021), enabling us to train strong
models at an economical cost. As we employ expert parallelism during training, we also devise
supplementary mechanisms to control communication overheads and ensure load balance.
By combining these two techniques, DeepSeek-V2 features strong performance (Figure 1(a)),
economical training costs, and efficient inference throughput (Figure 1(b)), simultaneously.
We construct a high-quality and multi-source pre-training corpus consisting of 8.1T tokens.
Compared with the corpus used in DeepSeek 67B (our previous release) (DeepSeek-AI, 2024), this
corpus features an extended amount of data, especially Chinese data, and higher data quality. We
first pretrain DeepSeek-V2 on the full pre-training corpus. Then, we collect 1.5M conversational
sessions, which encompass various domains such as math, code, writing, reasoning, safety, and
more, to perform Supervised Fine-Tuning (SFT) and obtain DeepSeek-V2 Chat (SFT). Finally, we follow
DeepSeekMath (Shao et al., 2024) to employ Group Relative Policy Optimization (GRPO) to
further align the model with human preference and produce DeepSeek-V2 Chat (RL).
We evaluate DeepSeek-V2 on a wide range of benchmarks in English and Chinese, and
compare it with representative open-source models. Evaluation results show that even with only
21B activated parameters, DeepSeek-V2 still achieves top-tier performance among open-source
models and becomes the strongest open-source MoE language model. Figure 1(a) highlights
that, on MMLU, DeepSeek-V2 achieves top-ranking performance with only a small number
of activated parameters. In addition, as shown in Figure 1(b), compared with DeepSeek 67B,
DeepSeek-V2 saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the
maximum generation throughput to 5.76 times. We also evaluate DeepSeek-V2 Chat (SFT) and
DeepSeek-V2 Chat (RL) on open-ended benchmarks. Notably, DeepSeek-V2 Chat (RL) achieves
38.9 length-controlled win rate on AlpacaEval 2.0 (Dubois et al., 2024), 8.97 overall score on
MT-Bench (Zheng et al., 2023), and 7.91 overall score on AlignBench (Liu et al., 2023). The
English open-ended conversation evaluations demonstrate that DeepSeek-V2 Chat (RL) has top-
tier performance among open-source chat models. In addition, the evaluation on AlignBench
indicates that, in Chinese, DeepSeek-V2 Chat (RL) outperforms all open-source models and
even beats most closed-source models.
In order to facilitate further research and development on MLA and DeepSeekMoE, we also
release DeepSeek-V2-Lite, a smaller model equipped with MLA and DeepSeekMoE, for the
open-source community. It has a total of 15.7B parameters, where 2.4B are activated for each
token. Detailed descriptions about DeepSeek-V2-Lite can be found in Appendix B.
In the rest of this paper, we first provide a detailed description of the model architecture of
DeepSeek-V2 (Section 2). Subsequently, we introduce our pre-training endeavors, including the
training data construction, hyper-parameter settings, infrastructures, long context extension,
and the evaluation of model performance and efficiency (Section 3). Following this, we demon-
strate our efforts in alignment, encompassing Supervised Fine-Tuning (SFT), Reinforcement
Learning (RL), the evaluation results, and other discussion (Section 4). Finally, we summarize
the conclusion, deliberate on the current limitations of DeepSeek-V2, and outline our future
work (Section 5).
2. Architecture
By and large, DeepSeek-V2 is still based on the Transformer architecture (Vaswani et al., 2017), where
each Transformer block consists of an attention module and a Feed-Forward Network (FFN).
However, for both the attention module and the FFN, we design and employ innovative archi-
tectures. For attention, we design MLA, which utilizes low-rank key-value joint compression to
eliminate the bottleneck of inference-time key-value cache, thus supporting efficient inference.
For FFNs, we adopt the DeepSeekMoE architecture (Dai et al., 2024), a high-performance MoE
architecture that enables training strong models at an economical cost. An illustration of the
architecture of DeepSeek-V2 is presented in Figure 2, and we will introduce the details of MLA
and DeepSeekMoE in this section. For other tiny details (e.g., layer normalization and the
activation function in FFNs), unless specifically stated, DeepSeek-V2 follows the settings of
DeepSeek 67B (DeepSeek-AI, 2024).
2.1. Multi-Head Latent Attention: Boosting Inference Efficiency

2.1.1. Preliminaries: Standard Multi-Head Attention

We first introduce the standard MHA mechanism as background. Let d be the embedding
dimension, n_h be the number of attention heads, d_h be the dimension per head, and h_t ∈ R^d
be the attention input of the t-th token at an attention layer. Standard MHA first produces
q_t, k_t, v_t ∈ R^{d_h n_h} through three matrices W^Q, W^K, W^V ∈ R^{d_h n_h × d}, respectively:

q_t = W^Q h_t,    (1)

k_t = W^K h_t,    (2)

v_t = W^V h_t,    (3)
[Figure 3 (illustration): queries, keys, and values under MHA, GQA, MQA, and MLA, highlighting what is cached during inference; MLA caches only the compressed latent KV.]
Then, q_t, k_t, v_t will be sliced into n_h heads for the multi-head attention computation:

[q_{t,1}; q_{t,2}; ...; q_{t,n_h}] = q_t,    (4)

[k_{t,1}; k_{t,2}; ...; k_{t,n_h}] = k_t,    (5)

[v_{t,1}; v_{t,2}; ...; v_{t,n_h}] = v_t,    (6)

o_{t,i} = \sum_{j=1}^{t} \mathrm{Softmax}_j\!\left(\frac{q_{t,i}^T k_{j,i}}{\sqrt{d_h}}\right) v_{j,i},    (7)

u_t = W^O [o_{t,1}; o_{t,2}; ...; o_{t,n_h}],    (8)

where q_{t,i}, k_{t,i}, v_{t,i} ∈ R^{d_h} denote the query, key, and value of the i-th attention head, respectively, and W^O ∈ R^{d × d_h n_h} denotes the output projection matrix. During inference, all keys and values need to be cached to avoid recomputation, so MHA needs to cache 2 n_h d_h l elements for each token, where l denotes the number of layers. In model deployment, this heavy KV cache is a large bottleneck that limits the maximum batch size and sequence length.
2.1.2. Low-Rank Key-Value Joint Compression

The core of MLA is the low-rank joint compression for keys and values to reduce the KV cache:

c_t^{KV} = W^{DKV} h_t,    (9)

k_t^C = W^{UK} c_t^{KV},    (10)

v_t^C = W^{UV} c_t^{KV},    (11)

where c_t^{KV} ∈ R^{d_c} is the compressed latent vector for keys and values; d_c (≪ d_h n_h) denotes the KV compression dimension; W^{DKV} ∈ R^{d_c × d} is the down-projection matrix; and W^{UK}, W^{UV} ∈ R^{d_h n_h × d_c} are the up-projection matrices for keys and values, respectively. During inference, MLA only needs to cache c_t^{KV}, so its KV cache has only d_c l elements, where l denotes the number of layers. In addition, during inference, since W^{UK} can be absorbed into W^Q and W^{UV} can be absorbed into W^O, we do not even need to compute keys and values explicitly for attention. Figure 3 intuitively illustrates how the KV joint compression in MLA reduces the KV cache.
Moreover, in order to reduce the activation memory during training, we also perform low-rank compression for the queries, even though it cannot reduce the KV cache:

c_t^Q = W^{DQ} h_t,    (12)

q_t^C = W^{UQ} c_t^Q,    (13)

where c_t^Q ∈ R^{d_c'} is the compressed latent vector for queries; d_c' (≪ d_h n_h) denotes the query compression dimension; and W^{DQ} ∈ R^{d_c' × d}, W^{UQ} ∈ R^{d_h n_h × d_c'} are the down-projection and up-projection matrices for queries, respectively.
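For illustration, the following is a minimal PyTorch sketch of the joint KV compression in Equations (9)-(11). The dimensions follow Section 3.1.2, but the code is a simplified stand-in rather than the actual training or inference implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of low-rank KV joint compression (Equations (9)-(11)).
# Dimensions follow Section 3.1.2; this is an illustrative stand-in, not the real implementation.
d, n_h, d_h, d_c = 5120, 128, 128, 512

W_DKV = nn.Linear(d, d_c, bias=False)          # down-projection W^{DKV}
W_UK  = nn.Linear(d_c, n_h * d_h, bias=False)  # up-projection for keys W^{UK}
W_UV  = nn.Linear(d_c, n_h * d_h, bias=False)  # up-projection for values W^{UV}

h_t  = torch.randn(1, d)   # attention input of one token
c_kv = W_DKV(h_t)          # compressed latent c_t^{KV}: the only KV tensor that must be cached
k_c  = W_UK(c_kv)          # reconstructed keys k_t^C (can be absorbed into W^Q at inference)
v_c  = W_UV(c_kv)          # reconstructed values v_t^C (can be absorbed into W^O at inference)

print(c_kv.shape, k_c.shape, v_c.shape)  # torch.Size([1, 512]) torch.Size([1, 16384]) torch.Size([1, 16384])
```

Caching the 512-dimensional latent instead of the full per-head keys and values in every layer is what yields the cache reduction quantified in Table 1 below.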
2.1.3. Decoupled Rotary Position Embedding

Following DeepSeek 67B (DeepSeek-AI, 2024), we intend to use Rotary Position Embedding (RoPE) (Su et al., 2024) for DeepSeek-V2. However, RoPE is incompatible with low-rank KV compression. To be specific, RoPE is position-sensitive for both keys and queries. If we apply RoPE to the keys k_t^C, then W^{UK} in Equation (10) will be coupled with a position-sensitive RoPE matrix. In this way, W^{UK} can no longer be absorbed into W^Q during inference, since a RoPE matrix related to the currently generating token will lie between W^Q and W^{UK}, and matrix multiplication is not commutative. As a result, we would have to recompute the keys for all the prefix tokens during inference, which would significantly hinder the inference efficiency.

As a solution, we propose the decoupled RoPE strategy, which uses additional multi-head queries q_{t,i}^R ∈ R^{d_h^R} and a shared key k_t^R ∈ R^{d_h^R} to carry RoPE, where d_h^R denotes the per-head dimension of the decoupled queries and key. Equipped with the decoupled RoPE strategy, MLA performs the following computation:

[q_{t,1}^R; q_{t,2}^R; ...; q_{t,n_h}^R] = q_t^R = \mathrm{RoPE}(W^{QR} c_t^Q),    (14)

k_t^R = \mathrm{RoPE}(W^{KR} h_t),    (15)

q_{t,i} = [q_{t,i}^C; q_{t,i}^R],    (16)

k_{t,i} = [k_{t,i}^C; k_t^R],    (17)

o_{t,i} = \sum_{j=1}^{t} \mathrm{Softmax}_j\!\left(\frac{q_{t,i}^T k_{j,i}}{\sqrt{d_h + d_h^R}}\right) v_{j,i}^C,    (18)

u_t = W^O [o_{t,1}; o_{t,2}; ...; o_{t,n_h}],    (19)

where W^{QR} ∈ R^{d_h^R n_h × d_c'} and W^{KR} ∈ R^{d_h^R × d} are the matrices that produce the decoupled queries and the decoupled key, respectively; RoPE(·) denotes the operation that applies RoPE matrices; and [·; ·] denotes the concatenation operation. During inference, the decoupled key should also be cached. Therefore, DeepSeek-V2 requires a total KV cache containing (d_c + d_h^R) l elements per token.
In order to demonstrate the complete computation process of MLA, we also organize and
provide its full formulas in Appendix C.
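To make the decoupling concrete, here is a small sketch of how a single head's query and key are assembled from a position-free compressed part and a RoPE-carrying decoupled part. The rope helper is a generic rotary-embedding implementation in the style of Su et al. (2024), and the random tensors stand in for the projected quantities above.

```python
import torch

def rope(x: torch.Tensor, pos: int, base: float = 10000.0) -> torch.Tensor:
    """Generic rotary position embedding over the last (even) dimension, after Su et al. (2024)."""
    dim = x.shape[-1]
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    cos, sin = (pos * inv_freq).cos(), (pos * inv_freq).sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

d_h, d_h_R, pos = 128, 64, 7                       # per-head dims from Section 3.1.2; arbitrary position
q_c, q_r = torch.randn(d_h), torch.randn(d_h_R)    # q_{t,i}^C and the pre-RoPE decoupled query
k_c, k_r = torch.randn(d_h), torch.randn(d_h_R)    # k_{t,i}^C and the pre-RoPE shared key
q_i = torch.cat([q_c, rope(q_r, pos)])             # Equation (16): q_{t,i} = [q_{t,i}^C ; q_{t,i}^R]
k_i = torch.cat([k_c, rope(k_r, pos)])             # Equation (17): k_{t,i} = [k_{t,i}^C ; k_t^R]
# Per token, only c_t^{KV} (d_c values) plus the shared RoPE key k_t^R (d_h^R values) are cached.
```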
2.1.4. Comparison of Key-Value Cache

We demonstrate a comparison of the KV cache per token among different attention mechanisms
in Table 1. MLA requires only a small amount of KV cache, equal to GQA with only 2.25 groups,
but can achieve stronger performance than MHA.
Attention Mechanism | KV Cache per Token (# Elements) | Capability
Multi-Head Attention (MHA) | 2 n_h d_h l | Strong
Grouped-Query Attention (GQA) | 2 n_g d_h l | Moderate
Multi-Query Attention (MQA) | 2 d_h l | Weak
MLA (Ours) | (d_c + d_h^R) l ≈ (9/2) d_h l | Stronger

Table 1 | Comparison of the KV cache per token among different attention mechanisms. n_h denotes the number of attention heads, d_h denotes the dimension per attention head, l denotes the number of layers, n_g denotes the number of groups in GQA, and d_c and d_h^R denote the KV compression dimension and the per-head dimension of the decoupled queries and key in MLA, respectively. The amount of KV cache is measured by the number of elements, regardless of the storage precision. For DeepSeek-V2, d_c is set to 4 d_h and d_h^R is set to d_h/2, so its KV cache is equal to that of GQA with only 2.25 groups, but its performance is stronger than MHA.
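The counts in Table 1 can be checked directly with the DeepSeek-V2 dimensions from Section 3.1.2; the GQA group count below is an illustrative assumption, since it is model-specific.

```python
# KV-cache elements per token under each attention variant (cf. Table 1),
# using DeepSeek-V2 dimensions from Section 3.1.2; n_g is an illustrative GQA setting.
n_h, d_h, l = 128, 128, 60
n_g = 8
d_c, d_h_R = 4 * d_h, d_h // 2       # 512 and 64

mha = 2 * n_h * d_h * l              # 1,966,080 elements per token
gqa = 2 * n_g * d_h * l              # 122,880 with the assumed 8 groups
mqa = 2 * d_h * l                    # 15,360
mla = (d_c + d_h_R) * l              # 34,560

print(f"MLA vs. MHA: {mla / mha:.2%} of the cache")             # ~1.76%
print(f"MLA equals GQA with {mla / (2 * d_h * l):.2f} groups")  # 2.25
```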
2.2. DeepSeekMoE: Training Strong Models at Economical Costs

2.2.1. Basic Architecture

For FFNs, we employ the DeepSeekMoE architecture (Dai et al., 2024). DeepSeekMoE has two
key ideas: segmenting experts into finer granularity for higher expert specialization and more
accurate knowledge acquisition, and isolating some shared experts for mitigating knowledge
redundancy among routed experts. With the same number of activated and total expert parame-
ters, DeepSeekMoE can outperform conventional MoE architectures like GShard (Lepikhin et al.,
2021) by a large margin.
Let u_t be the FFN input of the t-th token; we compute the FFN output h'_t as follows:

h'_t = u_t + \sum_{i=1}^{N_s} \mathrm{FFN}^{(s)}_i(u_t) + \sum_{i=1}^{N_r} g_{i,t} \, \mathrm{FFN}^{(r)}_i(u_t),    (20)

g_{i,t} = \begin{cases} s_{i,t}, & s_{i,t} \in \mathrm{Topk}(\{ s_{j,t} \mid 1 \leqslant j \leqslant N_r \}, K_r), \\ 0, & \text{otherwise}, \end{cases}    (21)

s_{i,t} = \mathrm{Softmax}_i(u_t^T e_i),    (22)
where N_s and N_r denote the numbers of shared experts and routed experts, respectively; FFN^{(s)}_i(·) and FFN^{(r)}_i(·) denote the i-th shared expert and the i-th routed expert, respectively; K_r denotes the number of activated routed experts; g_{i,t} is the gate value for the i-th expert; s_{i,t} is the token-to-expert affinity; e_i is the centroid of the i-th routed expert in this layer; and Topk(·, K) denotes the set comprising the K highest scores among the affinity scores calculated for the t-th token and all routed experts.
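A minimal sketch of this gate is shown below. The expert FFNs are stand-in linear layers and the hidden size is tiny; only the expert counts (2 shared, 160 routed, 6 activated) follow the DeepSeek-V2 configuration in Section 3.1.2.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the DeepSeekMoE gate in Equations (20)-(22). Expert FFNs are
# stand-in linear layers; the hidden size is tiny and purely illustrative.
d, N_s, N_r, K_r = 16, 2, 160, 6

shared = [torch.nn.Linear(d, d) for _ in range(N_s)]   # placeholder FFN_i^{(s)}
routed = [torch.nn.Linear(d, d) for _ in range(N_r)]   # placeholder FFN_i^{(r)}
centroids = torch.randn(N_r, d)                        # e_i, one centroid per routed expert

u_t = torch.randn(d)
s = F.softmax(centroids @ u_t, dim=-1)                 # token-to-expert affinities s_{i,t}
top = torch.topk(s, K_r)                               # keep the K_r highest affinities

h_t = u_t + sum(f(u_t) for f in shared)                # shared experts are always active
for g, i in zip(top.values, top.indices):              # g_{i,t} = s_{i,t} for selected experts, 0 otherwise
    h_t = h_t + g * routed[int(i)](u_t)
```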
2.2.2. Device-Limited Routing

For DeepSeek-V2, beyond the naive top-K selection of routed experts, we additionally ensure
that the target experts of each token will be distributed on at most M devices. To be specific, for
each token, we first select the M devices that contain experts with the highest affinity scores.
Then, we perform top-K selection among experts on these 𝑀 devices. In practice, we find that
when 𝑀 ⩾ 3, the device-limited routing can achieve a good performance roughly aligned with
the unrestricted top-K routing.
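The following sketch shows device-limited routing for a single token, under the simplifying assumption that routed experts are laid out contiguously and uniformly across the devices (as in Section 3.1.2); the affinity scores are random placeholders.

```python
import torch

# Device-limited routing sketch for one token: restrict routing to the M devices
# holding the highest-affinity experts, then run the usual top-K among them.
N_r, D, M, K_r = 160, 8, 3, 6
experts_per_device = N_r // D                         # assumes a uniform, contiguous expert layout

s = torch.rand(N_r)                                   # placeholder affinity scores s_{i,t}
per_device = s.view(D, experts_per_device)
top_devices = per_device.max(dim=1).values.topk(M).indices

mask = torch.zeros(D, experts_per_device, dtype=torch.bool)
mask[top_devices] = True                              # only experts on the M chosen devices are eligible
masked = s.masked_fill(~mask.view(-1), float("-inf"))
selected = masked.topk(K_r).indices                   # final top-K_r routed experts
```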
2.2.3. Auxiliary Loss for Load Balance

We take the load balance into consideration for automatically learned routing strategies. Firstly,
an unbalanced load will raise the risk of routing collapse (Shazeer et al., 2017), preventing some
experts from being fully trained and utilized. Secondly, when expert parallelism is employed, unbal-
anced load will diminish computation efficiency. During the training of DeepSeek-V2, we design
three kinds of auxiliary losses, for controlling expert-level load balance (LExpBal ), device-level
load balance (LDevBal ), and communication balance (LCommBal ), respectively.
Expert-Level Balance Loss. We use an expert-level balance loss (Fedus et al., 2021; Lepikhin
et al., 2021) to mitigate the risk of routing collapse:
\mathcal{L}_{\mathrm{ExpBal}} = \alpha_1 \sum_{i=1}^{N_r} f_i P_i,    (23)

f_i = \frac{N_r}{K_r T} \sum_{t=1}^{T} \mathbb{1}(\text{Token } t \text{ selects Expert } i),    (24)

P_i = \frac{1}{T} \sum_{t=1}^{T} s_{i,t},    (25)
where 𝛼1 is a hyper-parameter called expert-level balance factor; 1(·) denotes the indicator
function; and 𝑇 denotes the number of tokens in a sequence.
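As a sketch, the expert-level balance loss can be computed from the per-token affinities and routing decisions as follows; device-limited routing is ignored here for brevity, and all tensors are random placeholders.

```python
import torch

# Sketch of the expert-level balance loss (Equations (23)-(25)) for one sequence.
alpha1, N_r, K_r, T = 0.003, 160, 6, 1024
s = torch.softmax(torch.randn(T, N_r), dim=-1)        # affinities s_{i,t}, shape (T, N_r)
selected = s.topk(K_r, dim=-1).indices                # experts chosen per token (device limits ignored)

counts = torch.zeros(N_r).scatter_add_(0, selected.reshape(-1), torch.ones(T * K_r))
f = counts * N_r / (K_r * T)                          # f_i, Equation (24)
P = s.mean(dim=0)                                     # P_i, Equation (25)
L_expbal = alpha1 * (f * P).sum()                     # Equation (23)
```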
receives more tokens than other devices, the practical communication efficiency will also be
affected. In order to mitigate this issue, we design a communication balance loss as follows:
\mathcal{L}_{\mathrm{CommBal}} = \alpha_3 \sum_{i=1}^{D} f''_i P''_i,    (29)

f''_i = \frac{D}{M T} \sum_{t=1}^{T} \mathbb{1}(\text{Token } t \text{ is sent to Device } i),    (30)

P''_i = \sum_{j \in \mathcal{E}_i} P_j,    (31)

where D denotes the number of devices, \mathcal{E}_i denotes the set of routed experts deployed on the i-th device, and \alpha_3 is a hyper-parameter called the communication balance factor.
2.2.4. Token-Dropping Strategy

While balance losses aim to encourage a balanced load, it is important to acknowledge that
they cannot guarantee a strict load balance. In order to further mitigate the computation
wastage caused by unbalanced load, we introduce a device-level token-dropping strategy during
training. This approach first computes the average computational budget for each device, which
means that the capacity factor for each device is equivalent to 1.0. Then, inspired by Riquelme
et al. (2021), we drop tokens with the lowest affinity scores on each device until reaching the
computational budget. In addition, we ensure that the tokens belonging to approximately 10%
of the training sequences will never be dropped. In this way, we can flexibly decide whether
to drop tokens during inference according to the efficiency requirements, and always ensure
consistency between training and inference.
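A sketch of the per-device dropping rule is given below for a single device and training step; the token counts are illustrative, and the exemption of tokens from roughly 10% of training sequences is omitted.

```python
import torch

# Device-level token dropping at a capacity factor of 1.0 (sketch for one device).
tokens_on_device = 1400          # tokens routed to this device in the current step (illustrative)
budget = 1024                    # average computational budget per device (illustrative)

affinity = torch.rand(tokens_on_device)                       # affinity score of each routed token
kept = affinity.topk(min(budget, tokens_on_device)).indices   # keep the highest-affinity tokens
# Tokens beyond the budget (the lowest-affinity ones) are dropped for this device.
```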
3. Pre-Training
3.1. Experimental Setups

3.1.1. Data Construction

While maintaining the same data processing stages as for DeepSeek 67B (DeepSeek-AI, 2024),
we extend the amount of data and elevate the data quality. In order to enlarge our pre-training
corpus, we explore the potential of the internet data and optimize our cleaning processes, thus
recovering a large amount of mistakenly deleted data. Moreover, we incorporate more Chinese
data, aiming to better leverage the corpus available on the Chinese internet. In addition to
the amount of data, we also focus on the data quality. We enrich our pre-training corpus with
high-quality data from various sources, and meanwhile improve the quality-based filtering
algorithm. The improved algorithm ensures that a large amount of non-beneficial data will
be removed, while the valuable data will be mostly retained. In addition, we filter out the
contentious content from our pre-training corpus to mitigate the data bias introduced from
specific regional cultures. A detailed discussion about the influence of this filtering strategy is
presented in Appendix E.
We adopt the same tokenizer as used in DeepSeek 67B, which is built based on the Byte-level
Byte-Pair Encoding (BBPE) algorithm and has a vocabulary size of 100K. Our tokenized pre-
training corpus contains 8.1T tokens, where Chinese tokens are approximately 12% more than
English ones.
3.1.2. Hyper-Parameters
Model Hyper-Parameters. We set the number of Transformer layers to 60 and the hidden
dimension to 5120. All learnable parameters are randomly initialized with a standard deviation
of 0.006. In MLA, we set the number of attention heads 𝑛ℎ to 128 and the per-head dimension 𝑑ℎ
to 128. The KV compression dimension 𝑑 𝑐 is set to 512, and the query compression dimension
𝑑 𝑐′ is set to 1536. For the decoupled queries and key, we set the per-head dimension 𝑑ℎ𝑅 to 64.
Following Dai et al. (2024), we substitute all FFNs except for the first layer with MoE layers.
Each MoE layer consists of 2 shared experts and 160 routed experts, where the intermediate
hidden dimension of each expert is 1536. Among the routed experts, 6 experts will be activated
for each token. In addition, the low-rank compression and fine-grained expert segmentation
will impact the output scale of a layer. Therefore, in practice, we employ additional RMS Norm
layers after the compressed latent vectors, and multiply additional scaling factors at the width
bottlenecks (i.e., the compressed latent vectors and the intermediate hidden states of routed
experts) to ensure stable training. Under this configuration, DeepSeek-V2 comprises 236B total
parameters, of which 21B are activated for each token.
Training Hyper-Parameters. We employ the AdamW optimizer (Loshchilov and Hutter, 2017)
with hyper-parameters set to 𝛽1 = 0.9, 𝛽2 = 0.95, and weight_decay = 0.1. The learning rate is
scheduled using a warmup-and-step-decay strategy (DeepSeek-AI, 2024). Initially, the learning
rate linearly increases from 0 to the maximum value during the first 2K steps. Subsequently,
the learning rate is multiplied by 0.316 after training about 60% of tokens, and again by 0.316
after training about 90% of tokens. The maximum learning rate is set to 2.4 × 10−4 , and the
gradient clipping norm is set to 1.0. We also use a batch size scheduling strategy, where the
batch size is gradually increased from 2304 to 9216 in the training of the first 225B tokens, and
then keeps 9216 in the remaining training. We set the maximum sequence length to 4K, and
train DeepSeek-V2 on 8.1T tokens. We leverage pipeline parallelism to deploy different layers of
a model on different devices, and for each layer, the routed experts will be uniformly deployed
on 8 devices ( 𝐷 = 8). As for the device-limited routing, each token will be sent to at most 3
devices ( 𝑀 = 3). As for balance losses, we set 𝛼1 to 0.003, 𝛼2 to 0.05, and 𝛼3 to 0.02. We employ
the token-dropping strategy during training for acceleration, but do not drop any tokens for
evaluation.
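The warmup-and-step-decay schedule can be summarized by the following sketch; it mirrors the description above rather than the actual training code, and the step and token counters would come from the training loop.

```python
def learning_rate(step: int, tokens_seen: float, total_tokens: float,
                  max_lr: float = 2.4e-4, warmup_steps: int = 2000) -> float:
    """Sketch of the warmup-and-step-decay schedule described above (not the training code)."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps   # linear warmup from 0 to the maximum value
    lr = max_lr
    if tokens_seen > 0.6 * total_tokens:
        lr *= 0.316                           # first decay after ~60% of the tokens
    if tokens_seen > 0.9 * total_tokens:
        lr *= 0.316                           # second decay after ~90% of the tokens
    return lr

# Example: late in training, the learning rate has decayed to about 0.1 * max_lr.
print(learning_rate(step=100_000, tokens_seen=7.5e12, total_tokens=8.1e12))
```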
3.1.3. Infrastructures
DeepSeek-V2 is trained based on the HAI-LLM framework (High-flyer, 2023), an efficient and
light-weight training framework developed internally by our engineers. It employs a 16-way
zero-bubble pipeline parallelism (Qi et al., 2023), an 8-way expert parallelism (Lepikhin et al.,
2021), and ZeRO-1 data parallelism (Rajbhandari et al., 2020). Given that DeepSeek-V2 has
relatively few activated parameters, and a portion of the operators are recomputed to save acti-
vation memory, it can be trained without the necessity of tensor parallelism, thereby decreasing
the communication overhead. Moreover, in order to further improve the training efficiency, we
overlap the computation of shared experts with the expert parallel all-to-all communication.
We also customize faster CUDA kernels for communications, routing algorithms, and fused
linear computations across different experts. In addition, MLA is also optimized based on an
improved version of FlashAttention-2 (Dao, 2023).

Figure 4 | Evaluation results on the “Needle In A Haystack” (NIAH) tests. DeepSeek-V2
performs well across all context window lengths up to 128K.
We conduct all experiments on a cluster equipped with NVIDIA H800 GPUs. Each node in
the H800 cluster contains 8 GPUs connected using NVLink and NVSwitch within nodes. Across
nodes, InfiniBand interconnects are utilized to facilitate communications.
3.1.4. Long Context Extension

After the initial pre-training of DeepSeek-V2, we employ YaRN (Peng et al., 2023) to extend the
default context window length from 4K to 128K. YaRN is specifically applied to the decoupled
shared key k_t^R, as it is responsible for carrying RoPE (Su et al., 2024). For YaRN, we set the scale
s to 40, α to 1, β to 32, and the target maximum context length to 160K. Under these settings,
we can expect the model to respond well at a context length of 128K. Slightly diverging from
the original YaRN, due to our distinct attention mechanism, we adjust the length scaling factor to
modulate the attention entropy. The factor √t is computed as √t = 0.0707 ln s + 1, aiming at
minimizing the perplexity.
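Plugging in the scale used for DeepSeek-V2, this adjustment works out as follows (a simple check of the formula above, not an excerpt from our code):

```python
import math

s = 40                               # YaRN scale used for DeepSeek-V2
sqrt_t = 0.0707 * math.log(s) + 1    # sqrt(t) = 0.0707 ln s + 1
print(sqrt_t, sqrt_t ** 2)           # sqrt(t) ~= 1.261, t ~= 1.590
```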
We additionally train the model for 1000 steps, with a sequence length of 32K and a batch
size of 576 sequences. Although the training is conducted solely at the sequence length of 32K,
the model still demonstrates robust performance when being evaluated at a context length of
128K. As shown in Figure 4, the results on the “Needle In A Haystack” (NIAH) tests indicate
that DeepSeek-V2 performs well across all context window lengths up to 128K.
3.2. Evaluations
3.2.1. Evaluation Benchmarks

We evaluate DeepSeek-V2 with our internal evaluation framework, which is integrated in the
HAI-LLM framework. Included benchmarks are categorized and listed as follows:
Multi-subject multiple-choice datasets include MMLU (Hendrycks et al., 2020), C-Eval
(Huang et al., 2023), and CMMLU (Li et al., 2023).
Language understanding and reasoning datasets include HellaSwag (Zellers et al., 2019),
PIQA (Bisk et al., 2020), ARC (Clark et al., 2018), and BigBench Hard (BBH) (Suzgun et al., 2022).
Closed-book question answering datasets include TriviaQA (Joshi et al., 2017) and Natu-
ralQuestions (Kwiatkowski et al., 2019).
Reading comprehension datasets include RACE Lai et al. (2017), DROP (Dua et al., 2019),
C3 (Sun et al., 2019), and CMRC (Cui et al., 2019).
Reference disambiguation datasets include WinoGrande Sakaguchi et al. (2019) and CLUEWSC
(Xu et al., 2020).
Language modeling datasets include Pile (Gao et al., 2020).
Chinese understanding and culture datasets include CHID (Zheng et al., 2019) and CCPM
(Li et al., 2021).
Math datasets include GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), and
CMath (Wei et al., 2023).
Code datasets include HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021), and
CRUXEval (Gu et al., 2024).
Standardized exams include AGIEval (Zhong et al., 2023). Note that AGIEval includes both
English and Chinese subsets.
Following our previous work (DeepSeek-AI, 2024), we adopt perplexity-based evaluation
for datasets including HellaSwag, PIQA, WinoGrande, RACE-Middle, RACE-High, MMLU,
ARC-Easy, ARC-Challenge, CHID, C-Eval, CMMLU, C3, and CCPM, and adopt generation-
based evaluation for TriviaQA, NaturalQuestions, DROP, MATH, GSM8K, HumanEval, MBPP,
CRUXEval, BBH, AGIEval, CLUEWSC, CMRC, and CMath. In addition, we perform language-
modeling-based evaluation for Pile-test and use Bits-Per-Byte (BPB) as the metric to guarantee
fair comparison among models with different tokenizers.
For an intuitive overview of these benchmarks, we additionally provide our evaluation
formats for each benchmark in Appendix G.
Benchmark (Metric) | # Shots | DeepSeek 67B | Qwen1.5 72B | Mixtral 8x22B | LLaMA 3 70B | DeepSeek-V2
Architecture - Dense Dense MoE Dense MoE
# Activated Params - 67B 72B 39B 70B 21B
# Total Params - 67B 72B 141B 70B 236B
English:
Pile-test (BPB) - 0.642 0.637 0.623 0.602 0.606
BBH (EM) 3-shot 68.7 59.9 78.9 81.0 78.9
MMLU (Acc.) 5-shot 71.3 77.2 77.6 78.9 78.5
DROP (F1) 3-shot 69.7 71.5 80.4 82.5 80.1
ARC-Easy (Acc.) 25-shot 95.3 97.1 97.3 97.9 97.6
ARC-Challenge (Acc.) 25-shot 86.4 92.8 91.2 93.3 92.4
HellaSwag (Acc.) 10-shot 86.3 85.8 86.6 87.9 84.2
PIQA (Acc.) 0-shot 83.6 83.3 83.6 85.0 83.7
WinoGrande (Acc.) 5-shot 84.9 82.4 83.7 85.7 84.9
RACE-Middle (Acc.) 5-shot 69.9 63.4 73.3 73.3 73.1
RACE-High (Acc.) 5-shot 50.7 47.0 56.7 57.9 52.7
TriviaQA (EM) 5-shot 78.9 73.1 82.1 81.6 79.9
NaturalQuestions (EM) 5-shot 36.6 35.6 39.6 40.2 38.7
AGIEval (Acc.) 0-shot 41.3 64.4 43.4 49.8 51.2
Code:
HumanEval (Pass@1) 0-shot 45.1 43.9 53.1 48.2 48.8
MBPP (Pass@1) 3-shot 57.4 53.6 64.2 68.6 66.6
CRUXEval-I (Acc.) 2-shot 42.5 44.3 52.4 49.4 52.8
CRUXEval-O (Acc.) 2-shot 41.0 42.3 52.8 54.3 49.8
Math:
GSM8K (EM) 8-shot 63.4 77.9 80.3 83.0 79.2
MATH (EM) 4-shot 18.7 41.4 42.5 42.2 43.6
CMath (EM) 3-shot 63.0 77.8 72.3 73.9 78.7
Chinese:
CLUEWSC (EM) 5-shot 81.0 80.5 77.5 78.3 82.2
C-Eval (Acc.) 5-shot 66.1 83.7 59.6 67.5 81.7
CMMLU (Acc.) 5-shot 70.8 84.3 60.0 69.3 84.0
CMRC (EM) 1-shot 73.4 66.6 73.1 73.3 77.5
C3 (Acc.) 0-shot 75.3 78.2 71.4 74.0 77.4
CHID (Acc.) 0-shot 92.1 - 57.0 83.2 92.7
CCPM (Acc.) 0-shot 88.5 88.1 61.0 68.1 93.1
Table 2 | Comparison among DeepSeek-V2 and other representative open-source models. All
models are evaluated in our internal framework and share the same evaluation setting. Bold
denotes the best and underline denotes the second-best. Scores with a gap smaller than 0.3
are regarded as at the same level. With only 21B activated parameters, DeepSeek-V2 achieves
top-tier performance among open-source models.
70B overwhelmingly on Chinese benchmarks.
Finally, it is worth mentioning that certain prior studies (Hu et al., 2024) incorporate SFT
data during the pre-training stage, whereas DeepSeek-V2 has never been exposed to SFT data
during pre-training.
3.2.3. Training and Inference Efficiency

Training Costs. Since DeepSeek-V2 activates fewer parameters for each token and requires
fewer FLOPs than DeepSeek 67B, training DeepSeek-V2 will be more economical than training
DeepSeek 67B theoretically. Although training an MoE model will introduce additional commu-
nication overheads, through our operator and communication optimizations, the training for
DeepSeek-V2 can attain a relatively high Model FLOPs Utilization (MFU). During our practical
training on the H800 cluster, for training on each trillion tokens, DeepSeek 67B requires 300.6K
GPU hours, while DeepSeek-V2 needs only 172.8K GPU hours, i.e., sparse DeepSeek-V2 can
save 42.5% training costs compared with dense DeepSeek 67B.
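The 42.5% figure follows directly from these per-trillion-token GPU-hour counts:

```python
# GPU hours per trillion training tokens on the H800 cluster (Section 3.2.3).
deepseek_67b_gpu_hours = 300.6e3
deepseek_v2_gpu_hours = 172.8e3
print(f"{1 - deepseek_v2_gpu_hours / deepseek_67b_gpu_hours:.1%}")   # 42.5%
```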
Inference Efficiency. In order to efficiently deploy DeepSeek-V2 for service, we first convert
its parameters into the precision of FP8. In addition, we also perform KV cache quantiza-
tion (Hooper et al., 2024; Zhao et al., 2023) for DeepSeek-V2 to further compress each element
in its KV cache into 6 bits on average. Benefiting from MLA and these optimizations, actually
deployed DeepSeek-V2 requires significantly less KV cache than DeepSeek 67B, and thus can
serve a much larger batch size. We evaluate the generation throughput of DeepSeek-V2 based
on the prompt and generation length distribution from the actually deployed DeepSeek 67B
service. On a single node with 8 H800 GPUs, DeepSeek-V2 achieves a generation throughput
exceeding 50K tokens per second, which is 5.76 times the maximum generation throughput of
DeepSeek 67B. In addition, the prompt input throughput of DeepSeek-V2 exceeds 100K tokens
per second.
4. Alignment
4.1. Supervised Fine-Tuning

Building upon our prior research (DeepSeek-AI, 2024), we curate our instruction tuning datasets
to include 1.5M instances, comprising 1.2M instances for helpfulness and 0.3M instances for
safety. In comparison to the initial version, we improve the data quality to mitigate hallucinatory
responses and enhance writing proficiency. We fine-tune DeepSeek-V2 with 2 epochs, and
the learning rate is set to 5 × 10−6 . For the evaluation of DeepSeek-V2 Chat (SFT), we mainly
include generation-based benchmarks, except for several representative multiple-choice tasks
(MMLU and ARC). We also conduct an instruction-following evaluation (IFEval) (Zhou et al.,
2023) for DeepSeek-V2 Chat (SFT), using prompt-level loose accuracy as the metric. Moreover,
we employ LiveCodeBench (Jain et al., 2024) questions from September 1st, 2023 to April 1st,
2024 to evaluate chat models. In addition to the standard benchmarks, we further evaluate
our model on open-ended conversation benchmarks including MT-Bench (Zheng et al., 2023),
AlpacaEval 2.0 (Dubois et al., 2024), and AlignBench (Liu et al., 2023). For comparison, we also
evaluate Qwen1.5 72B Chat, LLaMA-3-70B Instruct, and Mistral-8x22B Instruct in our evaluation
framework and settings. As for DeepSeek 67B Chat, we directly refer to the evaluation results
reported in our previous release.
4.2. Reinforcement Learning
In order to further unlock the potential of DeepSeek-V2 and align it with human preference, we
conduct Reinforcement Learning (RL) to adjust its preference.
Reinforcement Learning Algorithm. In order to save the training costs of RL, we adopt Group
Relative Policy Optimization (GRPO) (Shao et al., 2024), which foregoes the critic model that is
typically of the same size as the policy model, and estimates the baseline from group scores
instead. Specifically, for each question q, GRPO samples a group of outputs {o_1, o_2, ..., o_G}
from the old policy \pi_{\theta_{old}} and then optimizes the policy model \pi_\theta by maximizing the following
objective:

\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[ q \sim P(Q), \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O \mid q) \right] \frac{1}{G} \sum_{i=1}^{G} \left( \min\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)} A_i, \ \mathrm{clip}\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)}, 1-\varepsilon, 1+\varepsilon \right) A_i \right) - \beta \, \mathbb{D}_{KL}\!\left( \pi_\theta \,\|\, \pi_{ref} \right) \right),    (32)

\mathbb{D}_{KL}\!\left( \pi_\theta \,\|\, \pi_{ref} \right) = \frac{\pi_{ref}(o_i \mid q)}{\pi_\theta(o_i \mid q)} - \log \frac{\pi_{ref}(o_i \mid q)}{\pi_\theta(o_i \mid q)} - 1,    (33)

where \varepsilon and \beta are hyper-parameters, and A_i is the advantage, computed using a group of
rewards \{r_1, r_2, \ldots, r_G\} corresponding to the outputs within each group:

A_i = \frac{r_i - \mathrm{mean}(\{r_1, r_2, \cdots, r_G\})}{\mathrm{std}(\{r_1, r_2, \cdots, r_G\})}.    (34)
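The group-relative advantage in Equation (34) amounts to standardizing the rewards within each sampled group, as in this small sketch (reward values are placeholders):

```python
import torch

rewards = torch.tensor([0.2, 0.9, 0.4, 0.7])               # r_1..r_G for one question's G outputs
advantages = (rewards - rewards.mean()) / rewards.std()    # A_i, Equation (34)
print(advantages)                                          # above-average outputs get positive advantage
```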
Training Strategy. In our preliminary experiments, we find that the RL training on reasoning
data, such as code and math prompts, exhibits unique characteristics that are distinct from the
training on general data. For example, the mathematical and coding abilities of our model can
keep improving over a longer period of training steps. Therefore, we employ a two-stage RL
training strategy, which first performs reasoning alignment, and then performs human prefer-
ence alignment. In the first reasoning alignment stage, we train a reward model 𝑅𝑀𝑟𝑒𝑎𝑠𝑜𝑛𝑖𝑛𝑔 for
code and math reasoning tasks, and optimize the policy model with the feedback of 𝑅𝑀𝑟𝑒𝑎𝑠𝑜𝑛𝑖𝑛𝑔 :
𝑟𝑖 = 𝑅𝑀𝑟𝑒𝑎𝑠𝑜𝑛𝑖𝑛𝑔 ( 𝑜𝑖 ). (35)
In the second human preference alignment stage, we adopt a multi-reward framework, which
acquires rewards from a helpful reward model RM_helpful, a safety reward model RM_safety, and a
rule-based reward model RM_rule. The final reward of a response o_i is a weighted combination of these three rewards.
Optimizations for Training Efficiency. Conducting RL training on extremely large models
places high demands on the training framework. It requires careful engineering optimization to
manage the GPU memory and RAM pressure, and meanwhile maintain a fast training speed.
For this goal, we implement the following engineering optimizations. (1) Firstly, we propose a
hybrid engine that adopts different parallel strategies for training and inference respectively to
achieve higher GPU utilization. (2) Secondly, we leverage vLLM (Kwon et al., 2023) with large
batch sizes as our inference backend to accelerate the inference speed. (3) Thirdly, we carefully
design a scheduling strategy for offloading models to CPUs and loading models back to GPUs,
which achieves a near-optimal balance between the training speed and memory consumption.
Benchmark | # Shots | DeepSeek 67B Chat | Qwen1.5 72B Chat | LLaMA3 70B Inst. | Mixtral 8x22B Inst. | DeepSeek-V2 Chat (SFT) | DeepSeek-V2 Chat (RL)
Context Length - 4K 32K 8K 64K 128K 128K
Architecture - Dense Dense Dense MoE MoE MoE
# Activated Params - 67B 72B 70B 39B 21B 21B
# Total Params - 67B 72B 70B 141B 236B 236B
English:
TriviaQA 5-shot 81.5 79.6 69.1 80.0 85.4 86.7
NaturalQuestions 5-shot 47.0 46.9 44.6 54.9 51.9 53.4
MMLU 5-shot 71.1 76.2 80.3 77.8 78.4 77.8
ARC-Easy 25-shot 96.6 96.8 96.9 97.1 97.6 98.1
ARC-Challenge 25-shot 88.9 91.7 92.6 90.0 92.5 92.3
BBH 3-shot 71.7 65.9 80.1 78.4 81.3 79.7
AGIEval 0-shot 46.4 62.8 56.6 41.4 63.2 61.4
IFEval 0-shot 55.5 57.3 79.7 72.1 64.1 63.8
Code:
HumanEval 0-shot 73.8 68.9 76.2 75.0 76.8 81.1
MBPP 3-shot 61.4 52.2 69.8 64.4 70.4 72.0
CRUXEval-I-COT 2-shot 49.1 51.4 61.1 59.4 59.5 61.5
CRUXEval-O-COT 2-shot 50.9 56.5 63.6 63.6 60.7 63.0
LiveCodeBench 0-shot 18.3 18.8 30.5 25.0 28.7 32.5
Math:
GSM8K 8-shot 84.1 81.9 93.2 87.9 90.8 92.2
MATH 4-shot 32.6 40.6 48.5 49.8 52.7 53.9
CMath 0-shot 80.3 82.8 79.2 75.1 82.0 81.9
Chinese:
CLUEWSC 5-shot 78.5 90.1 85.4 75.8 88.6 89.9
C-Eval 5-shot 65.2 82.2 67.9 60.0 80.9 78.0
CMMLU 5-shot 67.8 82.9 70.7 61.0 82.4 81.6
Table 3 | Comparison among DeepSeek-V2 Chat (SFT), DeepSeek-V2 Chat (RL), and other
representative open-source chat models. Regarding TriviaQA and NaturalQuestions, it is
worth noting that chat models, such as LLaMA3 70B Instruct, might not strictly adhere to the
format constraints typically specified in the few-shot setting. Consequently, this can lead to
underestimation of certain models in our evaluation framework.
Table 4 | English open-ended conversation evaluations. For AlpacaEval 2.0, we use the length-
controlled win rate as the metric.
72B Chat on both Chinese reasoning and language. Moreover, both DeepSeek-V2 Chat (SFT)
and DeepSeek-V2 Chat (RL) outperform GPT-4-0613 and ERNIEBot 4.0, solidifying the position
of our models in the top-tier LLMs that support Chinese. Specifically, DeepSeek-V2 Chat (RL)
shows remarkable performance in Chinese language understanding, which outperforms all
models including GPT-4-Turbo-1106-Preview. On the other hand, the reasoning capability of
DeepSeek-V2 Chat (RL) still lags behind giant models, such as Erniebot-4.0 and GPT-4s.
Model | Overall | Reasoning: Avg. | Math. | Logi. | Language: Avg. | Fund. | Chi. | Open. | Writ. | Role. | Pro.
(Reasoning columns: overall average, math computation, logical reasoning. Language columns: overall average, fundamental tasks, Chinese understanding, open-ended QA, writing, role play, professional ability.)
GPT-4-1106-Preview 8.01 7.73 7.80 7.66 8.29 7.99 7.33 8.61 8.67 8.47 8.65
DeepSeek-V2 Chat (RL) 7.91 7.45 7.77 7.14 8.36 8.10 8.28 8.37 8.53 8.33 8.53
ERNIEBot-4.0-202404*(文心一言) 7.89 7.61 7.81 7.41 8.17 7.56 8.53 8.13 8.45 8.24 8.09
DeepSeek-V2 Chat (SFT) 7.74 7.30 7.34 7.26 8.17 8.04 8.26 8.13 8.00 8.10 8.49
GPT-4-0613 7.53 7.47 7.56 7.37 7.59 7.81 6.93 7.42 7.93 7.51 7.94
ERNIEBot-4.0-202312*(文心一言) 7.36 6.84 7.00 6.67 7.88 7.47 7.88 8.05 8.19 7.84 7.85
Moonshot-v1-32k-202404*(月之暗面) 7.22 6.42 6.41 6.43 8.02 7.82 7.58 8.00 8.22 8.19 8.29
Qwen1.5-72B-Chat* 7.19 6.45 6.58 6.31 7.93 7.38 7.77 8.15 8.02 8.05 8.24
DeepSeek-67B-Chat 6.43 5.75 5.71 5.79 7.11 7.12 6.52 7.58 7.20 6.91 7.37
ChatGLM-Turbo(智谱清言) 6.24 5.00 4.74 5.26 7.49 6.82 7.17 8.16 7.77 7.76 7.24
ERNIEBot-3.5(文心一言) 6.14 5.15 5.03 5.27 7.13 6.62 7.60 7.26 7.56 6.83 6.90
Yi-34B-Chat* 6.12 4.86 4.97 4.74 7.38 6.72 7.28 7.76 7.44 7.58 7.53
GPT-3.5-Turbo-0613 6.08 5.35 5.68 5.02 6.82 6.71 5.81 7.29 7.03 7.28 6.77
ChatGLM-Pro(智谱清言) 5.83 4.65 4.54 4.75 7.01 6.51 6.76 7.47 7.07 7.34 6.89
SparkDesk-V2(讯飞星火) 5.74 4.73 4.71 4.74 6.76 5.84 6.97 7.29 7.18 6.92 6.34
Qwen-14B-Chat 5.72 4.81 4.91 4.71 6.63 6.90 6.36 6.74 6.64 6.59 6.56
Baichuan2-13B-Chat 5.25 3.92 3.76 4.07 6.59 6.22 6.05 7.11 6.97 6.75 6.43
ChatGLM3-6B 4.97 3.85 3.55 4.14 6.10 5.75 5.29 6.71 6.83 6.28 5.73
Baichuan2-7B-Chat 4.97 3.66 3.56 3.75 6.28 5.81 5.50 7.13 6.84 6.53 5.84
InternLM-20B 4.96 3.66 3.39 3.92 6.26 5.96 5.50 7.18 6.19 6.49 6.22
Qwen-7B-Chat 4.91 3.73 3.62 3.83 6.09 6.40 5.74 6.26 6.31 6.19 5.66
ChatGLM2-6B 4.48 3.39 3.16 3.61 5.58 4.91 4.52 6.66 6.25 6.08 5.08
InternLM-Chat-7B 3.65 2.56 2.45 2.66 4.75 4.34 4.09 5.82 4.89 5.32 4.06
Chinese-LLaMA-2-7B-Chat 3.57 2.68 2.29 3.07 4.46 4.31 4.26 4.50 4.63 4.91 4.13
LLaMA-2-13B-Chinese-Chat 3.35 2.47 2.21 2.73 4.23 4.13 3.31 4.79 3.93 4.53 4.71
Table 5 | AlignBench leaderboard rated by GPT-4-0613. Models are ranked in descending order
of overall score. Models marked with * are evaluated through their API service or open-weight
checkpoints, rather than by referring to the results reported in their original papers. Suffixes of
ERNIEBot-4.0 and Moonshot denote the timestamps at which we called their APIs.
4.4. Discussion
Amount of SFT Data. The discussion surrounding the necessity of a large SFT corpus has been
a topic of intense debate. Previous works (Young et al., 2024; Zhou et al., 2024) argue that fewer
than 10K instances of SFT data are enough to produce satisfactory results. However, in our
experiments, we observe a significant performance decline on the IFEval benchmark if we use
fewer than 10K instances. A possible explanation is that, a language model necessitates a certain
amount of data to develop specific skills. Although the requisite data amount may diminish
with the model size increasing, it cannot be entirely eliminated. Our observation underscores the
critical need for sufficient data to equip an LLM with desired capabilities. Moreover, the quality
of SFT data is also crucial, especially for tasks involving writing or open-ended questions.
compromising its general performance presents a valuable direction for future research.
Online Reinforcement Learning. In our preference alignment experiments, we find that the
online approach significantly outperforms the offline approach. Therefore, we invest tremendous
efforts in implementing an online RL framework for aligning DeepSeek-V2. The conclusion
about online or offline preference alignment can vary in different contexts, and we reserve a
more thorough comparison and analysis between them for future work.
• In our ongoing exploration, we are dedicated to devising methods that enable further
scaling up MoE models while maintaining economical training and inference costs. The
goal of our next step is to achieve performance on par with GPT-4 in our upcoming release.
• Our alignment team continuously strives to enhance our models, aiming to develop
a model that is not only helpful but also honest and safe for worldwide users. Our
ultimate objective is to align the values of our model with human values, while minimizing
the need for human supervision. By prioritizing ethical considerations and responsible
development, we are dedicated to creating a positive and beneficial impact on society.
• Currently, DeepSeek-V2 is designed to support the text modality exclusively. In our
forward-looking agenda, we intend to enable our model to support multiple modalities,
enhancing its versatility and utility in a wider range of scenarios.
References
AI@Meta. Llama 3 model card, 2024. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai. Gqa: Training
generalized multi-query transformer models from multi-head checkpoints. arXiv preprint
arXiv:2305.13245, 2023.
Anthropic. Introducing Claude, 2023. URL https://www.anthropic.com/index/introd
ucing-claude.
J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry,
Q. Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732,
2021.
J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, B. Hui, L. Ji,
M. Li, J. Lin, R. Lin, D. Liu, G. Liu, C. Lu, K. Lu, J. Ma, R. Men, X. Ren, X. Ren, C. Tan, S. Tan,
J. Tu, P. Wang, S. Wang, W. Wang, S. Wu, B. Xu, J. Xu, A. Yang, H. Yang, J. Yang, S. Yang, Y. Yao,
B. Yu, H. Yuan, Z. Yuan, J. Zhang, X. Zhang, Y. Zhang, Z. Zhang, C. Zhou, J. Zhou, X. Zhou,
and T. Zhu. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi. PIQA: reasoning about physical commonsense
in natural language. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI
2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI
2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI
2020, New York, NY, USA, February 7-12, 2020, pages 7432–7439. AAAI Press, 2020. doi:
10.1609/aaai.v34i05.6239. URL https://doi.org/10.1609/aaai.v34i05.6239.
P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you
have solved question answering? try arc, the AI2 reasoning challenge. CoRR, abs/1803.05457,
2018. URL http://arxiv.org/abs/1803.05457.
Y. Cui, T. Liu, W. Che, L. Xiao, Z. Chen, W. Ma, S. Wang, and G. Hu. A span-extraction
dataset for Chinese machine reading comprehension. In K. Inui, J. Jiang, V. Ng, and X. Wan,
editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language
Processing and the 9th International Joint Conference on Natural Language Processing
(EMNLP-IJCNLP), pages 5883–5889, Hong Kong, China, Nov. 2019. Association for Computa-
tional Linguistics. doi: 10.18653/v1/D19-1600. URL https://aclanthology.org/D19-1600.
D. Dai, C. Deng, C. Zhao, R. X. Xu, H. Gao, D. Chen, J. Li, W. Zeng, X. Yu, Y. Wu, Z. Xie, Y. K.
Li, P. Huang, F. Luo, C. Ruan, Z. Sui, and W. Liang. Deepseekmoe: Towards ultimate expert
specialization in mixture-of-experts language models. CoRR, abs/2401.06066, 2024. URL
https://doi.org/10.48550/arXiv.2401.06066.
T. Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning, 2023.
DeepSeek-AI. Deepseek LLM: scaling open-source language models with longtermism. CoRR,
abs/2401.02954, 2024. URL https://doi.org/10.48550/arXiv.2401.02954.
D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner. DROP: A reading compre-
hension benchmark requiring discrete reasoning over paragraphs. In J. Burstein, C. Doran, and
T. Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, NAACL-HLT
2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 2368–
2378. Association for Computational Linguistics, 2019. doi: 10.18653/V1/N19-1246. URL
https://doi.org/10.18653/v1/n19-1246.
Y. Dubois, B. Galambosi, P. Liang, and T. B. Hashimoto. Length-controlled alpacaeval: A simple
way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024.
W. Fedus, B. Zoph, and N. Shazeer. Switch transformers: Scaling to trillion parameter models
with simple and efficient sparsity. CoRR, abs/2101.03961, 2021. URL https://arxiv.org/abs/2101.03961.
L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite,
N. Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv
preprint arXiv:2101.00027, 2020.
Google. Introducing gemini: our largest and most capable ai model, 2023. URL https://blog.google/technology/ai/google-gemini-ai/.
A. Gu, B. Rozière, H. Leather, A. Solar-Lezama, G. Synnaeve, and S. I. Wang. Cruxeval: A
benchmark for code reasoning, understanding and execution, 2024.
D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Mea-
suring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874,
2021.
S. Hu, Y. Tu, X. Han, C. He, G. Cui, X. Long, Z. Zheng, Y. Fang, Y. Huang, W. Zhao, et al.
Minicpm: Unveiling the potential of small language models with scalable training strategies.
arXiv preprint arXiv:2404.06395, 2024.
Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su, J. Liu, C. Lv, Y. Zhang, J. Lei, et al. C-Eval: A
multi-level multi-discipline chinese evaluation suite for foundation models. arXiv preprint
arXiv:2305.08322, 2023.
N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica.
Livecodebench: Holistic and contamination free evaluation of large language models for code.
arXiv preprint arXiv:2403.07974, 2024.
M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer. TriviaQA: A large scale distantly supervised chal-
lenge dataset for reading comprehension. In R. Barzilay and M.-Y. Kan, editors, Proceedings of
the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers), pages 1601–1611, Vancouver, Canada, July 2017. Association for Computational
Linguistics. doi: 10.18653/v1/P17-1147. URL https://aclanthology.org/P17-1147.
W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica.
Efficient memory management for large language model serving with pagedattention. In
Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
G. Lai, Q. Xie, H. Liu, Y. Yang, and E. H. Hovy. RACE: large-scale reading comprehension
dataset from examinations. In M. Palmer, R. Hwa, and S. Riedel, editors, Proceedings of
the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017,
Copenhagen, Denmark, September 9-11, 2017, pages 785–794. Association for Computational
Linguistics, 2017. doi: 10.18653/V1/D17-1082. URL https://doi.org/10.18653/v1/d17-1082.
D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, and Z. Chen.
Gshard: Scaling giant models with conditional computation and automatic sharding. In 9th
International Conference on Learning Representations, ICLR 2021. OpenReview.net, 2021.
URL https://openreview.net/forum?id=qrwe7XHTmYb.
H. Li, Y. Zhang, F. Koto, Y. Yang, H. Zhao, Y. Gong, N. Duan, and T. Baldwin. CMMLU: Measur-
ing massive multitask language understanding in Chinese. arXiv preprint arXiv:2306.09212,
2023.
W. Li, F. Qi, M. Sun, X. Yi, and J. Zhang. Ccpm: A chinese classical poetry matching dataset,
2021.
X. Liu, X. Lei, S. Wang, Y. Huang, Z. Feng, B. Wen, J. Cheng, P. Ke, Y. Xu, W. L. Tam, X. Zhang,
L. Sun, H. Wang, J. Zhang, M. Huang, Y. Dong, and J. Tang. Alignbench: Benchmarking
chinese alignment of large language models. CoRR, abs/2311.18743, 2023. doi: 10.48550/ARXIV.2311.18743. URL https://doi.org/10.48550/arXiv.2311.18743.
Mistral. Cheaper, better, faster, stronger: Continuing to push the frontier of ai and making it
accessible to all, 2024. URL https://mistral.ai/news/mixtral-8x22b.
B. Peng, J. Quesnelle, H. Fan, and E. Shippole. Yarn: Efficient context window extension of large
language models. arXiv preprint arXiv:2309.00071, 2023.
P. Qi, X. Wan, G. Huang, and M. Lin. Zero bubble pipeline parallelism. arXiv preprint
arXiv:2401.10241, 2023.
S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He. Zero: Memory optimizations toward training tril-
lion parameter models. In SC20: International Conference for High Performance Computing,
Networking, Storage and Analysis, pages 1–16. IEEE, 2020.
Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. Li, Y. Wu, and D. Guo. Deepseekmath:
Pushing the limits of mathematical reasoning in open language models. arXiv preprint
arXiv:2402.03300, 2024.
N. Shazeer. Fast transformer decoding: One write-head is all you need. CoRR, abs/1911.02150,
2019. URL http://arxiv.org/abs/1911.02150.
K. Sun, D. Yu, D. Yu, and C. Cardie. Investigating prior knowledge for challenging chinese
machine reading comprehension, 2019.
T. Wei, J. Luan, W. Liu, S. Dong, and B. Wang. Cmath: Can your language model pass chinese
elementary school math test?, 2023.
L. Xu, H. Hu, X. Zhang, L. Li, C. Cao, Y. Li, Y. Xu, K. Sun, D. Yu, C. Yu, Y. Tian, Q. Dong, W. Liu,
B. Shi, Y. Cui, J. Li, J. Zeng, R. Wang, W. Xie, Y. Li, Y. Patterson, Z. Tian, Y. Zhang, H. Zhou,
S. Liu, Z. Zhao, Q. Zhao, C. Yue, X. Zhang, Z. Yang, K. Richardson, and Z. Lan. CLUE: A chi-
nese language understanding evaluation benchmark. In D. Scott, N. Bel, and C. Zong, editors,
Proceedings of the 28th International Conference on Computational Linguistics, COLING
2020, Barcelona, Spain (Online), December 8-13, 2020, pages 4762–4772. International Com-
mittee on Computational Linguistics, 2020. doi: 10.18653/V1/2020.COLING-MAIN.419. URL
https://doi.org/10.18653/v1/2020.coling-main.419.
A. Young, B. Chen, C. Li, C. Huang, G. Zhang, G. Zhang, H. Li, J. Zhu, J. Chen, J. Chang, et al.
Yi: Open foundation models by 01. ai. arXiv preprint arXiv:2403.04652, 2024.
R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. HellaSwag: Can a machine really finish
your sentence? In A. Korhonen, D. R. Traum, and L. Màrquez, editors, Proceedings of the 57th
Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July
28- August 2, 2019, Volume 1: Long Papers, pages 4791–4800. Association for Computational
Linguistics, 2019. doi: 10.18653/v1/p19-1472. URL https://doi.org/10.18653/v1/p19-1472.
Y. Zhao, C. Lin, K. Zhu, Z. Ye, L. Chen, S. Zheng, L. Ceze, A. Krishnamurthy, T. Chen, and
B. Kasikci. Atom: Low-bit quantization for efficient and accurate LLM serving. CoRR,
abs/2310.19102, 2023. URL https://doi.org/10.48550/arXiv.2310.19102.
C. Zheng, M. Huang, and A. Sun. Chid: A large-scale chinese idiom dataset for cloze test. In
A. Korhonen, D. R. Traum, and L. Màrquez, editors, Proceedings of the 57th Conference of
the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2,
2019, Volume 1: Long Papers, pages 778–787. Association for Computational Linguistics, 2019.
doi: 10.18653/V1/P19-1075. URL https://doi.org/10.18653/v1/p19-1075.
L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing,
H. Zhang, J. E. Gonzalez, and I. Stoica. Judging llm-as-a-judge with mt-bench and chatbot
arena, 2023.
W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang, A. Saied, W. Chen, and N. Duan. AGIEval: A
human-centric benchmark for evaluating foundation models. CoRR, abs/2304.06364, 2023.
doi: 10.48550/arXiv.2304.06364. URL https://doi.org/10.48550/arXiv.2304.06364.
C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, et al. Lima: Less is
more for alignment. Advances in Neural Information Processing Systems, 36, 2024.
J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou. Instruction-following
evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023.
Appendix
A. Contributions and Acknowledgments
Yongji Wang Xiaosha Chen
Yongqiang Guo Xiaowen Sun
Yuduan Wang Xiaoxiang Wang
Yuheng Zou Xinnan Song
Yuxiang You Xinyi Zhou
Yuxuan Liu Y.X. Zhu
Z.Z. Ren Yanhong Xu
Zehui Ren Yanping Huang
Zhangli Sha Yaohui Li
Zhe Fu Yi Zheng
Zhenda Xie Yuchen Zhu
Zhewen Hao Yunxian Ma
Zhihong Shao Zhen Huang
Zhuoshu Li Zhipeng Xu
Zihan Wang Zhongyu Zhang
Zihui Gu
Zilin Li
Business & Compliance
Ziwei Xie
Bin Wang
Dongjie Ji
Data Annotation Jian Liang
Bei Feng Jin Chen
Hui Li Leyi Xia
J.L. Cai Miaojun Wang
Jiaqi Ni Mingming Li
Lei Xu Peng Zhang
Meng Li Shaoqing Wu
Ning Tian Shengfeng Ye
R.J. Chen T. Wang
R.L. Jin W.L. Xiao
Ruyi Chen Wei An
S.S. Li Xianzu Wang
Shuang Zhou Ying Tang
Tian Yuan Yukun Zha
Tianyu Sun Yuting Yan
X.Q. Li Zhen Zhang
Xiangyue Jin Zhiniu Wen
Xiaojin Shen
Within each role, authors are listed alphabetically by first name. Especially, Huazuo Gao
and Wangding Zeng have made key innovations in the research of the MLA architecture.
Furthermore, we’d like to thank Jianlin Su for his helpful discussion on position embedding.
We thank all those who have contributed to DeepSeek-V2 but are not mentioned in the paper.
DeepSeek believes that innovation, novelty, and curiosity are essential in the path to AGI.
B. DeepSeek-V2-Lite: A 16B Model Equipped with MLA and DeepSeekMoE
B.1 Model Description
Architectures. DeepSeek-V2-Lite has 27 layers and a hidden dimension of 2048. It also employs
MLA and has 16 attention heads, where each head has a dimension of 128. Its KV compression
dimension is 512, but, slightly different from DeepSeek-V2, it does not compress the queries.
For the decoupled queries and key, the per-head dimension is 64. DeepSeek-V2-Lite also
employs DeepSeekMoE, and all FFNs except for the first layer are replaced with MoE layers.
Each MoE layer consists of 2 shared experts and 64 routed experts, where the intermediate
hidden dimension of each expert is 1408. Among the routed experts, 6 experts are activated
for each token. Under this configuration, DeepSeek-V2-Lite comprises 15.7B total parameters, of
which 2.4B are activated for each token.
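For readers skimming the configuration, the settings above can be collected into a small illustrative Python dictionary; the key names below are ad-hoc shorthand of ours, not the released configuration format.

# Illustrative summary of the DeepSeek-V2-Lite configuration described above.
# Key names are ad-hoc shorthand, not the official config keys.
deepseek_v2_lite_config = {
    "num_layers": 27,
    "hidden_dim": 2048,
    "num_attention_heads": 16,
    "head_dim": 128,
    "kv_compression_dim": 512,          # queries are not compressed in the Lite model
    "decoupled_rope_head_dim": 64,
    "num_shared_experts": 2,
    "num_routed_experts": 64,
    "num_activated_routed_experts": 6,
    "expert_intermediate_dim": 1408,
    "total_params": 15.7e9,
    "activated_params_per_token": 2.4e9,
}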
Training Details. DeepSeek-V2-Lite is also trained from scratch on the same pre-training
corpus as DeepSeek-V2, which is not polluted by any SFT data. It uses the AdamW optimizer
with hyper-parameters set to 𝛽1 = 0.9, 𝛽2 = 0.95, and weight_decay = 0.1. The learning rate is
scheduled using a warmup-and-step-decay strategy. Initially, the learning rate increases linearly
from 0 to the maximum value during the first 2K steps. Subsequently, the learning rate is
multiplied by 0.316 after training on about 80% of the tokens, and again by 0.316 after training on
about 90% of the tokens. The maximum learning rate is set to 4.2 × 10⁻⁴, and the gradient clipping
norm is set to 1.0. We do not employ the batch size scheduling strategy for it, and it is trained with
a constant batch size of 4608 sequences. During pre-training, we set the maximum sequence
length to 4K and train DeepSeek-V2-Lite on 5.7T tokens. We leverage pipeline parallelism to
deploy different layers of it on different devices, but for each layer, all experts are deployed
on the same device. Therefore, we only employ a small expert-level balance loss with 𝛼1 = 0.001,
and do not employ the device-level balance loss or the communication balance loss for it. After
pre-training, we also perform long context extension and SFT for DeepSeek-V2-Lite and obtain a
chat model called DeepSeek-V2-Lite Chat.
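For illustration, the warmup-and-step-decay schedule described above can be written as a small helper; this is a sketch of our own, not the actual training code, with the warmup length, decay factor, and token thresholds taken from the numbers in this paragraph.

# Minimal sketch of the warmup-and-step-decay schedule described above.
# Illustrative code only, not the DeepSeek-V2-Lite training script.
MAX_LR = 4.2e-4
WARMUP_STEPS = 2000          # linear warmup over the first 2K steps
DECAY_FACTOR = 0.316         # applied at ~80% and ~90% of the training tokens

def learning_rate(step: int, tokens_seen: float, total_tokens: float = 5.7e12) -> float:
    if step < WARMUP_STEPS:                     # linear warmup from 0 to MAX_LR
        return MAX_LR * step / WARMUP_STEPS
    lr = MAX_LR
    if tokens_seen >= 0.8 * total_tokens:       # first decay at ~80% of tokens
        lr *= DECAY_FACTOR
    if tokens_seen >= 0.9 * total_tokens:       # second decay at ~90% of tokens
        lr *= DECAY_FACTOR
    return lr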
B.2 Performance Evaluation
Base Model. We evaluate the performance of DeepSeek-V2-Lite and compare it with our
previous small-size base models in Table 6. DeepSeek-V2-Lite exhibits overwhelming performance
advantages, especially in reasoning, coding, and math.
Chat Model. We evaluate the performance of DeepSeek-V2-Lite Chat and compare it with our
previous small-size chat models in Table 7. DeepSeek-V2-Lite Chat also outperforms our previous
small-size chat models by a large margin.
C. Full Formulas of MLA
In order to demonstrate the complete computation process of MLA, we provide its full formulas
in the following:
c_t^Q = W^{DQ} h_t,   (37)
[q_{t,1}^C; q_{t,2}^C; ...; q_{t,n_h}^C] = q_t^C = W^{UQ} c_t^Q,   (38)
[q_{t,1}^R; q_{t,2}^R; ...; q_{t,n_h}^R] = q_t^R = RoPE(W^{QR} c_t^Q),   (39)
q_{t,i} = [q_{t,i}^C; q_{t,i}^R],   (40)
c_t^{KV} = W^{DKV} h_t,   (41)
[k_{t,1}^C; k_{t,2}^C; ...; k_{t,n_h}^C] = k_t^C = W^{UK} c_t^{KV},   (42)
k_t^R = RoPE(W^{KR} h_t),   (43)
k_{t,i} = [k_{t,i}^C; k_t^R],   (44)
[v_{t,1}^C; v_{t,2}^C; ...; v_{t,n_h}^C] = v_t^C = W^{UV} c_t^{KV},   (45)
o_{t,i} = \sum_{j=1}^{t} Softmax_j( (q_{t,i}^T k_{j,i}) / \sqrt{d_h + d_h^R} ) v_{j,i}^C.   (46)
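To make the data flow above concrete, here is a minimal, illustrative PyTorch sketch of Eqs. (37)-(46) for step-by-step decoding. All dimensions and weight matrices are random placeholders rather than the DeepSeek-V2 settings, RoPE is stubbed out as the identity, and no weight absorption or batching tricks are applied; only the compressed latent c_t^{KV} and the decoupled key k_t^R are appended to the cache, which is the point of MLA.

import torch

# Illustrative dimensions only (not the DeepSeek-V2 configuration).
d_model, n_h, d_h, d_hR = 512, 8, 64, 32   # hidden size, heads, head dim, decoupled RoPE dim
d_cq, d_ckv = 128, 128                     # query and key-value compression dims

W_DQ  = torch.randn(d_cq, d_model)         # down-projection for queries
W_UQ  = torch.randn(n_h * d_h, d_cq)       # up-projection for compressed queries
W_QR  = torch.randn(n_h * d_hR, d_cq)      # decoupled RoPE queries
W_DKV = torch.randn(d_ckv, d_model)        # down-projection for keys/values
W_UK  = torch.randn(n_h * d_h, d_ckv)      # up-projection for compressed keys
W_KR  = torch.randn(d_hR, d_model)         # decoupled RoPE key (shared across heads)
W_UV  = torch.randn(n_h * d_h, d_ckv)      # up-projection for compressed values

def rope(x):
    # Placeholder: rotary position embedding omitted for brevity.
    return x

def mla_step(h_t, cached_c_kv, cached_k_r):
    """One decoding step; only c_t^KV and k_t^R are added to the cache."""
    c_q = W_DQ @ h_t                                      # Eq. (37)
    q_c = (W_UQ @ c_q).view(n_h, d_h)                     # Eq. (38)
    q_r = rope(W_QR @ c_q).view(n_h, d_hR)                # Eq. (39)
    q   = torch.cat([q_c, q_r], dim=-1)                   # Eq. (40)

    cached_c_kv.append(W_DKV @ h_t)                       # Eq. (41)
    cached_k_r.append(rope(W_KR @ h_t))                   # Eq. (43)

    outputs = []
    for i in range(n_h):
        scores, values = [], []
        for c_j, k_r_j in zip(cached_c_kv, cached_k_r):
            k_c_j = (W_UK @ c_j).view(n_h, d_h)[i]        # Eq. (42), recomputed from the latent
            k_j   = torch.cat([k_c_j, k_r_j])             # Eq. (44)
            v_j   = (W_UV @ c_j).view(n_h, d_h)[i]        # Eq. (45)
            scores.append(q[i] @ k_j / (d_h + d_hR) ** 0.5)
            values.append(v_j)
        attn = torch.softmax(torch.stack(scores), dim=0)  # Eq. (46)
        outputs.append(attn @ torch.stack(values))
    return torch.cat(outputs)                             # concatenated per-head outputs

cache_c, cache_k = [], []
for _ in range(3):                                        # decode three tokens
    out = mla_step(torch.randn(d_model), cache_c, cache_k)
print(out.shape)                                          # torch.Size([512]) == n_h * d_h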
D. Ablation of Attention Mechanisms
We show the evaluation results for 7B dense models with MHA, GQA, and MQA on four hard
benchmarks in Table 8. All three models are trained on 1.33T tokens and share the same
architecture except for the attention mechanism. In addition, for a fair comparison, we align
their numbers of parameters to around 7B by adjusting the number of layers. From the table,
we can find that MHA demonstrates significant advantages over GQA and MQA on these
benchmarks.
In Table 9, we show the evaluation results for MoE models equipped with MLA and MHA,
respectively, on four hard benchmarks. For a solid conclusion, we train and evaluate models
at two scales: two small MoE models comprising about 16B total parameters, trained on 1.33T
tokens, and two large MoE models comprising about 250B total parameters, trained on 420B
tokens. At each scale, the two models share the same architecture except for the attention
mechanism. From the table, we can observe that MLA shows better performance than MHA.
More importantly, MLA requires a significantly smaller KV cache (14% for the small MoE
models and 4% for the large MoE models) than MHA.
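To make the cache saving concrete, the following back-of-the-envelope sketch compares the per-token, per-layer cache size of MHA (2 n_h d_h elements) with that of MLA (d_c + d_h^R elements). The dimensions below are illustrative, borrowed from the DeepSeek-V2-Lite configuration in Appendix B rather than from the ablated models, yet they land close to the 14% ratio reported for the small MoE models.

# Per-token, per-layer KV-cache elements for MHA vs. MLA.
# Dimensions are illustrative (taken from the DeepSeek-V2-Lite config in Appendix B),
# not the exact settings of the ablated MoE models.
n_h, d_h = 16, 128         # attention heads, per-head dimension
d_c, d_h_r = 512, 64       # MLA KV compression dim, decoupled RoPE key dim

mha_cache = 2 * n_h * d_h  # keys and values cached for every head
mla_cache = d_c + d_h_r    # shared compressed latent plus one decoupled RoPE key

print(f"MHA: {mha_cache} elements, MLA: {mla_cache} elements, "
      f"ratio: {mla_cache / mha_cache:.1%}")  # ratio: 14.1%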
| Benchmark (Metric) | # Shots | Dense 7B w/ MQA | Dense 7B w/ GQA (8 Groups) | Dense 7B w/ MHA |
| # Params | - | 7.1B | 6.9B | 6.9B |
| BBH (EM) | 3-shot | 33.2 | 35.6 | 37.0 |
| MMLU (Acc.) | 5-shot | 37.9 | 41.2 | 45.2 |
| C-Eval (Acc.) | 5-shot | 30.0 | 37.7 | 42.9 |
| CMMLU (Acc.) | 5-shot | 34.6 | 38.4 | 43.5 |
Table 8 | Comparison among 7B dense models with MHA, GQA, and MQA, respectively. MHA
demonstrates significant advantages over GQA and MQA on hard benchmarks.
Table 9 | Comparison between MLA and MHA on hard benchmarks. MLA shows better
performance than MHA, but requires a significantly smaller amount of KV cache.
| Agreement | Ground-Truth Label | Annotator 1 | Annotator 2 | Annotator 3 |
| Ground-Truth Label | 100.0% | 66.7% | 59.8% | 42.1% |
| Annotator 1 | 66.7% | 100.0% | 57.9% | 69.0% |
| Annotator 2 | 59.8% | 57.9% | 100.0% | 65.5% |
| Annotator 3 | 42.1% | 69.0% | 65.5% | 100.0% |
Table 10 | Three well-educated human annotators conducted independent annotations on 420 moral
scenarios from the MMLU Humanity-Moral subset, on which DeepSeek-V2 and its competitive
models demonstrate performance inconsistency. The three annotators and the ground-truth labels
exhibit low agreement with one another, which indicates that the answers in the Humanity-Moral
subset can be contentious and may depend on regional cultures.
| Model Name | R Level | Comp. Score | Reas. Steps Score | OvrAcc Score |
| GPT-4-1106-Preview | 5 | 90.71 | 91.65 | 89.77 |
| GPT-4 | 5 | 88.40 | 89.10 | 87.71 |
| DeepSeek-V2 Chat (RL) | 5 | 83.35 | 85.73 | 84.54 |
| Ernie-bot 4.0 | 5 | 85.60 | 86.82 | 84.38 |
| Qwen-110B-Chat | 5 | 83.25 | 84.93 | 84.09 |
| GLM-4 | 5 | 84.24 | 85.72 | 82.77 |
| Xinghuo 3.5 | 5 | 83.73 | 85.37 | 82.09 |
| Qwen-72B-Chat | 4 | 78.42 | 80.07 | 79.25 |
| ChatGLM-Turbo | 4 | 57.70 | 60.32 | 55.09 |
| GPT-3.5-Turbo | 4 | 57.05 | 59.61 | 54.50 |
| Qwen-14B-Chat | 4 | 53.12 | 55.99 | 50.26 |
| ChatGLM3-6B | 3 | 40.90 | 44.20 | 37.60 |
| Xinghuo 3.0 | 3 | 40.08 | 45.27 | 34.89 |
| Baichuan2-13B-Chat | 3 | 39.40 | 42.63 | 36.18 |
| Ernie-3.5-turbo | 2 | 25.19 | 27.70 | 22.67 |
| Chinese-Alpaca2-13B | 2 | 20.55 | 22.52 | 18.58 |
Table 11 | SC-Math6 Model Reasoning Level. "R Level" stands for Reasoning Level, "Comp.
Score" stands for Comprehensive Score, "Reas. Steps Score" stands for Reasoning Steps Score,
and "OvrAcc Score" stands for Overall Accuracy Score.
The questions of LiveCodeBench are selected from the period between September 1st, 2023, and
April 1st, 2024. As shown in the figure, DeepSeek-V2 Chat (RL) demonstrates considerable
proficiency on LiveCodeBench, achieving a Pass@1 score that even surpasses some much larger
models. This performance highlights the strong capability of DeepSeek-V2 Chat (RL) in tackling
live coding tasks.
G. Evaluation Formats
We present our evaluation formats for each benchmark in Tables 12-37, respectively.
[Figure 5: scatter plot of HumanEval (Pass@1, x-axis) versus LiveCodeBench (Pass@1, y-axis) for
DeepSeek-V2-Chat-RL, DeepSeek 67B, GPT-4-Turbo-1106, GPT-4-0613, Claude Opus, Claude Sonnet,
Claude Haiku, LLaMA3-70B-Chat, Mistral Large, Mixtral 8x22B, Qwen Max, and Qwen1.5 72B.]
Figure 5 | Evaluation results on HumanEval and LiveCodeBench. The questions of Live-
CodeBench are selected from the period between September 1st, 2023 and April 1st, 2024.
PROMPT
以下是一道中国高考生物选择题,请选择正确的答案。
问题:下列有关高尔基体、线粒体和叶绿体的叙述, 正确的是选项:(A)三者都
存在于蓝藻中(B)三者都含有DNA (C)三者都是ATP 合成的场所(D)三者的膜结
构中都含有蛋白质
答案:从A到D, 我们应选择
PROMPT
Question: A sample in a cylindrical container has a cylindrical shape and a
fixed volume. The state of matter of the sample _
A. must be solid
B. could be either solid or liquid
C. must be liquid
D. could be either liquid or gas
Answer: B
Question: When oil and water are mixed together, they form a _
A. gas
B. solid
C. compound
D. suspension
Answer: D
Question: A container of liquid water was placed outside during the day when
the temperature was 3°C. At night the outside temperature dropped to -2°C.
This temperature change most likely caused the water to _
A. condense
B. evaporate
C. remain a liquid
D. become a solid
Answer:
PROMPT
Evaluate the result of a random Boolean expression.
PROMPT
以下是中国关于教育学考试的单项选择题,请选出其中的正确答案。
根据我国心理学家冯忠良教授的学习分类,培养学生品德要通过____。
A. 知识的学习
B. 技能的学习
C. 行为规范的学习
D. 态度的学习
答案:C
开设跨学科课程或建立跨学科专业体现了高等教育课程发展的____。
A. 综合化趋势
B. 多样化趋势
C. 人文化趋势
D. 科学化趋势
答案:A
心智技能的特点有____。
A. 物质性、外显性、简缩性
B. 观念性、内潜性、简缩性
C. 物质性、外显性、展开性
D. 观念性、内潜性、展开性
答案:B
下列关于大学生的情绪与理智关系的说法中正确的是____。
A. 能冷静控制自己情绪
B. 感情用事,难以用理智控制情绪
C. 遇事能坚持自己正确认识
D. 已发展到不为小事而发怒和怄气
答案:B
在学完一篇逻辑结构严密的课文以后,勾画出课文的论点论据的逻辑关系图以
帮助理解和记忆。这种学习方法属于____。
A. 精细加工策略
B. 组织策略
C. 复述策略
D. 做笔记策略
答案:B
有学者强调,教育要根据一个民族固有的特征来定,这种观点体现了____
A. 生产力对教育的影响和制约
B. 政治制度对教育的影响和制约
C. 文化对教育的影响和制约
D. 经济制度对教育的影响和制约
答案:
OPTIONS
-A
-B
-C
-D
PROMPT
女:这些药怎么吃?
男:一天三次,一次两片。
请根据上文回答问题:
他们在哪儿?
答案:
OPTIONS
- 商店
- 饭店
- 医院
- 教室
PROMPT
以下是将某句古诗文翻译而成的现代表述:春天已至,万物复苏,春风如一位
美丽而又心灵手巧的姑娘,迈着纤纤细步款款而来,她挥舞剪刀,尽情地展示
那高超的女工技巧,她先裁出了柳叶,随着柳条袅袅依依地舞蹈,又裁出杏
叶,桃叶。
该翻译所对应的古诗文是:
OPTIONS
- 春风骋巧如翦刀
- 剪裁无巧似春风
- 风吹怨恨快如刀
- 春风欲擅秋风巧
PROMPT
Q: 某 小 学 在“献 爱 心–为 汶 川 地 震 区 捐 款”活 动 中 , 六 年 级 五 个 班 共
捐 款8000元 , 其 中 一 班 捐 款1500元 , 二 班 比 一 班 多 捐 款200元 , 三 班 捐
款1600元,四班与五班捐款数之比是3:5.四班捐款多少元?
A: 一 班 捐 款1500元 , 而 二 班 比 一 班 多 捐200元 , 所 以 二 班 捐
款1500+200=1700元 , 又 知 道 六 年 级 五 个 班 一 共 捐 款8000元 , 所 以 四 班
和 五 班 捐 款 之 和= 一 共 捐 款- 一 班 和 二 班 和 三 班 捐 款 之 和 , 即8000-1500-
1700-1600=3200元,而题目说四班与 五班捐款数之比是3:5,则四班捐款
了3200/(3+5)*3=1200元。所以答案是:1200。
Q: 小俊在东西大道上跑步,若规定向东为正。他先向东跑了800米,然后又跑
了一段之后,他位于出发点西边100米处,小俊第二段跑了多少米?
A: 小俊第二段跑完后位于出发点西边,所以第二段应该是向西跑,第二
段跑的长度-第一段跑的长度=100,第二段跑了100+800=900米。所以答案
是:900。
Q: A车和B车同时从甲、乙两地相向开出,经过5小时相遇.然后,它们又各
自按原速原方向继续行驶3小时,这时A车离乙地还有135千米,B车离甲地还
有165千米.甲、乙两地相距多少千米?
A: 假设A车的速度为x千米每小时,B车的速度为y千米每小时,根据而A、B相
遇时A车行驶了5小时,A车行驶3小时后离乙地还有135千米,B车行驶3小时
后距离甲地还有165千米,可以得到甲乙两地相距=5x+5y=135+8x=165+8y,
变换得到:10(x+y)=300+8(x+y),于是x+y=150,甲乙两地相距5(x+y)=750千
米。所以答案是:750。
Q: 在一个底面半径为10厘米的圆柱形容器内,倒入10厘米深的水,然后将一
个底面直径4厘米,高6厘米的圆锥形铅锤放入水中,容器中水面上升多少厘
米?
A:
PROMPT
以下是关于解剖学的单项选择题,请直接给出正确答案的选项。
题目:壁胸膜的分部不包括
A. 肋胸膜
B. 肺胸膜
C. 膈胸膜
D. 胸膜顶
答案是:B
题目:属于蝶骨上的结构为
A. 垂体窝
B. 棘孔
C. 破裂孔
D. 视神经管
答案是:B
题目:属于右心房的结构是
A. 肉柱
B. 室上嵴
C. 乳头肌
D. 梳状肌
答案是:D
题目:咽的分部
A. 咽隐窝
B. 口咽部
C. 鼻咽部
D. 喉咽部
答案是:C
题目:舌下神经核位于
A. 间脑
B. 延髓
C. 中脑
D. 脑挢
答案是:B
题目:从脑干背侧出脑的脑神经是
A. 副神经
B. 三叉神经
C. 舌下神经
D. 滑车神经
答案是:
OPTIONS
-A
-B
-C
-D
PROMPT
文章:英雄广场(Heldenplatz)是奥地利首都维也纳的一个广场。在此曾发
生许多重要事件— 最著名的是1938年希特勒在此宣告德奥合并。英雄广场是
霍夫堡皇宫的外部广场,兴建于皇帝弗朗茨·约瑟夫一世统治时期,是没有完
全建成的所谓“帝国广场”(Kaiserforum)的一部分。其东北部是霍夫堡皇宫
的Leopoldinian Tract,东南方是新霍夫堡,西南方的内环路,将其与“城门
外”(Äußeres Burgtor)隔开。西北部没有任何建筑物,可以很好地眺望内环
路、国会大厦、市政厅,以及城堡剧院。广场上有2尊军事领袖的骑马像:欧
根亲王和卡尔大公。
根据上文回答下面的问题。
问题:英雄广场是哪个皇宫的外部广场?
答案:霍夫堡皇宫
问题:广场上有哪两位军事领袖的骑马像?
答案:
PROMPT
Passage: The median age in the city was 22.1 years. 10.1% of residents were
under the age of 18; 56.2% were between the ages of 18 and 24; 16.1% were
from 25 to 44; 10.5% were from 45 to 64; and 7% were 65 years of age or older.
The gender makeup of the city was 64.3% male and 35.7% female.
Answer the following questions based on the above passage, please calculate
carefully if calculation is necessary.
Q: How many percent were not from 25 to 44?
A: The answer type is number. So according to above Passage, the answer is
83.9.
PROMPT
中新网12月7日电综合外媒6日报道,在美国得克萨斯州,负责治疗新冠肺炎患者
的医生约瑟夫·瓦隆(Joseph Varon)已连续上班超260天,每天只睡不超过2小时。
瓦隆日前接受采访时呼吁,美国民众应遵从防疫规定,一线的医护人员“已
OPTIONS
- 神清气爽”。
- 诡计多端”。
- 精疲力竭”。
- 分工合作”。
- 寅吃卯粮”。
- 土豪劣绅”。
- 芸芸众生”。
PROMPT
胡雪岩离船登岸,坐轿进城,等王有龄到家,他接着也到了他那里,脸上是掩
抑不住的笑容,王有龄夫妇都觉得奇怪,问他什么事这么高兴。
上面的句子中的"他"指的是
胡雪岩
渐渐地,汤中凝结出一团团块状物,将它们捞起放进盆里冷却,肥皂便出现在
世上了。
上面的句子中的"它们"指的是
块状物
“她序上明明引着JulesTellier的比喻,说有个生脱发病的人去理发,那剃头的
对他说不用剪发,等不了几天,头毛压儿全掉光了;大部分现代文学也同样的
不值批评。这比喻还算俏皮。”
上面的句子中的"他"指的是
生脱发病的人
在洛伦佐大街的尽头处,矗立着著名的圣三一大教堂。它有着巨大的穹顶,还
有明亮的彩色玻璃窗,上面描绘着《旧约》和《新约》的场景。
上面的句子中的"它"指的是
圣三一大教堂
他伯父还有许多女弟子,大半是富商财主的外室;这些财翁白天忙着赚钱,怕
小公馆里的情妇长日无聊,要不安分,常常叫她们学点玩艺儿消遣。
上面的句子中的"她们"指的是
情妇
赵雨又拿出了一个杯子,我们热情地请老王入座,我边给他倒酒边问:1962年
的哪次记得吗?“
上面的句子中的"他"指的是
PROMPT
Q: Max can mow the lawn in 40 minutes. If it takes him twice that long to
fertilize the lawn, how long will it take him to both mow and fertilize the lawn?
A: Let’s think step by step. It takes Max 2 * 40 minutes = 80 minutes to fertilize
the lawn. In total, Max takes 80 minutes + 40 minutes = 120 minutes to both
mow and fertilize the lawn. The answer is 120.
Q: The bagels cost $2.25 each, or a dozen for $24. How much is saved, per
bagel, in cents, by buying a dozen at a time?
A: Let’s think step by step. They cost 2.25*100=225 cents each. At the bulk rate,
they are 24/12=2 dollar each. They cost 2*100=200 cents each. 225-200=25 cents
are saved per bagel. The answer is 25.
Q: Tim is 5 years old. His cousin, Rommel, is thrice as old as he is. His other
cousin, Jenny, is 2 years older than Rommel. How many years younger is Tim
than Jenny?
A: Let’s think step by step. Rommel is 5 x 3 = 15 years old. Jenny is 15 + 2 = 17
years old. So, Tim is 17 - 5 = 12 years younger than Jenny. The answer is 12.
Q: The school has 14 boys and 10 girls. If 4 boys and 3 girls drop out, how
many boys and girls are left?
A: Let’s think step by step. There are 14 boys - 4 boys = 10 boys left. There are
10 girls - 3 girls = 7 girls left. In total there are 10 boys + 7 girls = 17 boys and
girls left. The answer is 17.
Q: Building one birdhouse requires 7 planks and 20 nails. If 1 nail costs 0.05,
and one plank costs 3, what is the cost, in dollars, to build 4 birdhouses?
A: Let’s think step by step. The cost of the planks for one birdhouse is 7 * 3 =
21. And the nails are a cost of 20 * 0.05 = 1 for each birdhouse. So to build one
birdhouse one will need 21 + 1 = 22. So the cost of building 4 birdhouses is at 4
* 22 = 88. The answer is 88.
Q: Cori is 3 years old today. In 5 years, she will be one-third the age of her aunt.
How old is her aunt today?
A: Let’s think step by step. In 5 years, Cori will be 3 + 5 = 8 years old. In 5
years, Cori’s aunt will be 8 x 3 = 24 years old. Today, her aunt is 24 - 5 = 19
years old. The answer is 19.
Q: Indras has 6 letters in her name. Her sister’s name has 4 more letters than
half of the letters in Indras’ name. How many letters are in Indras and her
sister’s names?
A: Let’s think step by step.
PROMPT
def starts_one_ends(n):
    """
    Given a positive integer n, return the count of the numbers of n-digit
    positive integers that start or end with 1.
    """
PROMPT
Problem:
Find the domain of the expression $\frac{\sqrt{x-2}}{\sqrt{5-x}}$.}
Solution:
The expressions inside each square root must be non-negative.
Therefore, $x-2 \ge 0$, so $x\ge2$, and $5 - x \ge 0$, so $x \le 5$.
Also, the denominator cannot be equal to zero, so $5-x>0$, which gives $x<5$.
Therefore, the domain of the expression is $\boxed{[2,5)}$.
Final Answer: The final answer is $[2,5)$. I hope it is correct.
Problem:
If $\det \mathbf{A} = 2$ and $\det \mathbf{B} = 12,$ then find $\det
(\mathbf{A} \mathbf{B}).$
Solution:
We have that $\det (\mathbf{A} \mathbf{B}) = (\det \mathbf{A})(\det
\mathbf{B}) = (2)(12) = \boxed{24}.$
Final Answer: The final answer is $24$. I hope it is correct.
Problem:
Terrell usually lifts two 20-pound weights 12 times. If he uses two 15-pound
weights instead, how many times must Terrell lift them in order to lift the same
total weight?
Solution:
If Terrell lifts two 20-pound weights 12 times, he lifts a total of $2\cdot
12\cdot20=480$ pounds of weight. If he lifts two 15-pound weights instead
for $n$ times, he will lift a total of $2\cdot15\cdot n=30n$ pounds of weight.
Equating this to 480 pounds, we can solve for $n$: \begin{align*}
30n&=480\\
\Rightarrow\qquad n&=480/30=\boxed{16}
\end{align*}
Final Answer: The final answer is $16$. I hope it is correct.
Problem:
If the system of equations
\begin{align*}
6x-4y&=a,\\
6y-9x &=b.
\end{align*}has a solution $(x, y)$ where $x$ and $y$ are both nonzero, find
$\frac{a}{b},$ assuming $b$ is nonzero.
Solution:
If we multiply the first equation by $-\frac{3}{2}$, we obtain
$$-\frac{3}{2}a=b\Rightarrow\frac{a}{b}=\boxed{-\frac{2}{3}}.$$
Final Answer: The final answer is $-\frac{2}{3}$. I hope it is correct.
Solution:
assert heap_queue_largest([25, 35, 22, 85, 14, 65, 75, 22, 58], 3)==[85, 75, 65]
assert heap_queue_largest([25, 35, 22, 85, 14, 65, 75, 22, 58], 2)==[85, 75]
assert heap_queue_largest([25, 35, 22, 85, 14, 65, 75, 22, 58], 5)==[85, 75, 65, 58, 35]
[BEGIN]
import heapq as hq
def heap_queue_largest(nums, n):
    largest_nums = hq.nlargest(n, nums)
    return largest_nums
[DONE]
You are an expert Python programmer, and here is your task: Write a function
to return the sum of all divisors of a number. Your code should pass these tests:
assert sum_div(8)==7
assert sum_div(12)==16
assert sum_div(7)==1
[BEGIN]
PROMPT
The following are multiple choice questions (with answers) about miscella-
neous.
What place is named in the title of the 1979 live album by rock legends Cheap
Trick?
A. Budapest
B. Budokan
C. Bhutan
D. Britain
Answer: B
Who is the shortest man to ever win an NBA slam dunk competition?
A. Anthony ’Spud’ Webb
B. Michael ’Air’ Jordan
C. Tyrone ’Muggsy’ Bogues
D. Julius ’Dr J’ Erving
Answer: A
Which of these songs was a Top 10 hit for the rock band The Police?
A. ’Radio Ga-Ga’
B. ’Ob-la-di Ob-la-da’
C. ’De Do Do Do De Da Da Da’
D. ’In-a-Gadda-Da-Vida’
Answer: C
PROMPT
Answer these questions:
Q: Who is hosting the fifa world cup in 2022?
A: Qatar
Q: Who won the first women ’s fifa world cup?
A: United States
Q: When did miami vice go off the air?
A: 1989
Q: Who wrote the song shout to the lord?
A: Darlene Zschech
Q: Who was thrown in the lion ’s den?
A: Daniel
Q: What is the meaning of the name habib?
A:
PROMPT
A woman notices that she is depressed every autumn, and wonders why. A
friend suggests to her that perhaps certain changes that take place as seasons
move from warm to cold may be having an effect on her. When pressed for an
example of these changes, the friend cites
OPTIONS
- flowers blooming
- grass turning brown
- trees growing
- blossoms blooming
PROMPT
To make it easier to push the reset button of the garbage disposable machine
which is located underneath the machine,
OPTIONS
- place a wall mirror on the floor of the cabinet
- hold a hand mirror under the garbage disposable machine
PROMPT
Article:
When you read an article you will understand and remember it better if you
can work out how the writer has put the ideas together. Sometimes a writer
puts ideas together by asking questions and then answering them.For example,
if the article is about groundhogs, the set of questions in the writer’s head
might be:
What does a groundhog look like?
Where do groundhogs live?
What do they eat?...
In the article,the author might answer those questions.
Sometimes an author writes out her questions in the article.These questions
give you signals.They tell you what the author is going to write next.Often
an author has a question in her head but she doesn’t write it out for you.You
have to work out her question for yourself.Here’s a sample reading for you to
practice this method.
Earthworms
Do you know how many kinds of earthworms there are?There are about 1800
kinds in the world! They can be brown,purple,green.They can be as small as 3
cm long and as large as 3 m long.
The best time to see earthworms is at night,especially a cool,damp night.That’s
when they come up from their burrows to hunt for food.Earthworms don’t
like to be in the sun.That’s because they breathe through their skin,and they
can’t breathe if their skin gets too dry.Earthworms must come out of the earth
if it rains a lot,because they can’t breathe in their flooded burrows.What a
dangerous life!
Earthworms don’t have eyes,so how can they tell when it’s dark? They have
special places on their skin that are sensitive to light.These spots tell whether
it’s light or dark.If you shine a flashlight on an earthworm at night,it will
quickly disappear into the ground.
Earthworms don’t have ears either,but they can hear by feeling movements in
the earth.If you want to hear like an earthworm,lie on the ground with your
fingers in your ears.Then have a friend stamp his or her feet near you.This is
how earthworms feel birds and people walking,and moles digging,near them.
Earthworms are useful.Farmers and gardeners like having lots of earthworms
in their land because the worms help to make better soil when they dig.That
digging keeps the soil loose and airy .In one year earthworms can pile up as
much as 23,000 kg of castings in an area about the size of a football field.
A: Read to work out all the questions in the writer’s head while reading.
A:
OPTIONS
- One way to help with understanding
- One way to practice with a new idea
- One way to learn to be a wise writer
- One way to be clearer about worms
PREFIXES
- So Monica
- So Jessica
COMPLETION
avoids eating carrots for their eye health because Emily needs good eyesight
while Monica doesn’t.
Table 35 | An example of WinoGrande. Note that there are multiple prefixes and only one
completion for WinoGrande, and we choose the predicted prefix with the lowest perplexity of
the completion.
Prompt
You will be given a function f and an output in the form f(??) == output. Find
any input such that executing f on the input leads to the given output. There
may be multiple answers, but you should only output one. In [ANSWER] and
[/ANSWER] tags, complete the assertion with one such input that will produce
the output when executing the function.
[PYTHON]
def f(my_list):
    count = 0
    for i in my_list:
        if len(i) % 2 == 0:
            count += 1
    return count
assert f(??) == 3
[/PYTHON]
[ANSWER]
assert f(["mq", "px", "zy"]) == 3
[/ANSWER]
[PYTHON]
def f(s1, s2):
    return s1 + s2
assert f(??) == "banana"
[/PYTHON]
[ANSWER]
assert f("ba", "nana") == "banana"
[/ANSWER]
[PYTHON]
def f(a, b, c):
    result = {}
    for d in a, b, c:
        result.update(dict.fromkeys(d))
    return result
assert f(??) == {1: None, 2: None}
[/PYTHON]
[ANSWER]
Prompt
You are given a Python function and an assertion containing an input to the
function. Complete the assertion with a literal (no unsimplified expressions,
no function calls) containing the output when executing the provided code on
the given input, even if the function is incorrect or incomplete. Do NOT output
any extra information. Provide the full assertion with the correct output in
[ANSWER] and [/ANSWER] tags, following the examples.
[PYTHON]
def f(n):
    return n
assert f(17) == ??
[/PYTHON]
[ANSWER]
assert f(17) == 17
[/ANSWER]
[PYTHON]
def f(s):
    return s + "a"
assert f("x9j") == ??
[/PYTHON]
[ANSWER]
assert f("x9j") == "x9ja"
[/ANSWER]
[PYTHON]
def f(nums):
    output = []
    for n in nums:
        output.append((nums.count(n), n))
    output.sort(reverse=True)
    return output
assert f([1, 1, 3, 1, 3, 1]) == ??
[/PYTHON]
[ANSWER]