Qwen2.5-Coder Technical Report
https://github.com/QwenLM/Qwen2.5-Coder
Abstract
[Figure: Performance of Qwen2.5-Coder-32B across code-related benchmarks, including HumanEval, MBPP, BigCodeBench, LiveCodeBench, McEval, Aider, BIRD-SQL, and CodeArena.]
Contents
1 Introduction
2 Model Architecture
3 Pre-training
3.1 Pretraining Data
3.1.1 Data Composition
3.1.2 Data Mixture
3.2 Training Policy
3.2.1 File-Level Pretraining
3.2.2 Repo-Level Pretraining
4 Post-training
4.1 A Recipe for Instruction Data
4.2 Training Policy
5 Decontamination
9 Conclusion
1 Introduction
With the rapid development of large language models (LLMs) (Brown, 2020; Achiam et al.,
2023; Touvron et al., 2023; Dubey et al., 2024; Jiang et al., 2023; Bai et al., 2023; Yang et al., 2024;
Anthropic, 2024; OpenAI, 2024), code-specific language models have garnered significant
attention in the community. Built upon pre-trained LLMs, code LLMs such as the StarCoder
series (Li et al., 2023; Lozhkov et al., 2024), CodeLlama series (Roziere et al., 2023), DeepSeek-
Coder series (Guo et al., 2024a), CodeQwen1.5 (Qwen, 2024), and CodeStral (MistralAI,
2024), have demonstrated superior performance in coding evaluations (Chen et al., 2021;
Austin et al., 2021; Cassano et al., 2022; Jain et al., 2024; Liu et al., 2024a; Li et al., 2024b; Guo
et al., 2024b; Wu et al., 2024b). However, compared with the latest state-of-the-art
proprietary LLMs such as Claude-3.5-Sonnet (Anthropic, 2024) and GPT-4o (OpenAI, 2024),
code LLMs, whether open-source or proprietary, still fall behind.
Building upon our previous work, CodeQwen1.5, we are excited to introduce Qwen2.5-
Coder, a new series of language models designed to achieve top-tier performance in coding
tasks at various model sizes. Qwen2.5-Coder models are derived from the Qwen2.5 LLMs,
inheriting their advanced architecture and tokenizer. These models are trained on exten-
sive datasets and further fine-tuned on carefully curated instruction datasets specifically
designed for coding tasks. We are committed to fostering research and innovation in the
field of code LLMs, coding agents, and coding assistant applications. Therefore, we release
the Powerful, Diverse, and Practical Qwen2.5-Coder series, dedicated to continuously pro-
moting the development of Open CodeLLMs. (1) Powerful: Qwen2.5-Coder-32B-Instruct
has become the current SOTA open-source code model, matching the coding capabilities of
GPT-4o. While demonstrating strong and comprehensive coding abilities, it also possesses
good general and mathematical skills. (2) Diverse: the Qwen2.5-Coder series covers six
mainstream model sizes (0.5B, 1.5B, 3B, 7B, 14B, and 32B) to meet the needs of different
developers. (3) Practical: we explore the practicality of Qwen2.5-Coder in two scenarios,
code assistants and Artifacts, with examples showcasing its potential applications in
real-world settings.
Significant efforts have been dedicated to constructing a large-scale, coding-specific pretrain-
ing dataset comprising over 5.5 trillion tokens. This dataset is sourced from a broad range of
public code repositories, such as those on GitHub, as well as large-scale web-crawled data
containing code-related texts. We have implemented sophisticated procedures to recall and
clean potential code data and to filter out low-quality content using weak-model-based
classifiers and scorers. Our approach encompasses both file-level and repository-level pretraining
to ensure comprehensive coverage. To optimize performance and balance coding expertise
with general language understanding, we have carefully curated a data mixture that in-
cludes code, mathematics, and general texts. To transform models into coding assistants for
downstream applications, we have developed a well-designed instruction-tuning dataset.
This dataset includes a wide range of coding-related problems and solutions, sourced from
real-world applications and synthetic data generated by code-focused LLMs, covering a
broad spectrum of coding tasks.
To evaluate the effectiveness of Qwen2.5-Coder, we conducted an extensive evaluation
on a suite of popular benchmarks. The results highlight Qwen2.5-Coder’s superior code
generation capabilities, achieving state-of-the-art performance across more than ten code-
focused benchmarks while maintaining robust general and mathematical reasoning abilities.
This model outperforms larger code models on a variety of tasks. The release of these
models aims to advance code intelligence research and promote widespread adoption in
real-world applications, facilitated by permissive licensing.
2 Model Architecture
The architecture hyper-parameters increase as the model size scales up. Comparing the 7B and 32B models, for instance: the 7B
model features a hidden size of 3,584, whereas the 32B model has a hidden size of 5,120. The
7B model uses 28 query heads and 4 key-value heads, while the 32B model uses 40 query
heads and 8 key-value heads, reflecting its enhanced capacity. Similarly, the intermediate
size scales with model size, being 18,944 for the 7B model and 27,648 for the 32B model.
Additionally, smaller models use embedding tying, while larger models do not. Both models
have a vocabulary size of 151,646 tokens and are trained on 5.5 trillion tokens.
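As a concrete illustration of these hyper-parameters, the sketch below records the two configurations discussed above as plain Python dictionaries; the key names are informal stand-ins rather than the exact fields of the released configuration files.

# Illustrative summary of the 7B and 32B hyper-parameters described above.
# Key names are informal; consult the released model configs for the exact schema.
qwen25_coder_7b = {
    "hidden_size": 3584,
    "num_attention_heads": 28,      # query heads
    "num_key_value_heads": 4,       # grouped-query attention (GQA)
    "intermediate_size": 18944,
    "vocab_size": 151646,
}
qwen25_coder_32b = {
    "hidden_size": 5120,
    "num_attention_heads": 40,
    "num_key_value_heads": 8,
    "intermediate_size": 27648,
    "vocab_size": 151646,
}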
3 Pre-training
Large-scale, high-quality, and diverse data forms the foundation of pre-trained models. To
this end, we constructed a dataset named Qwen2.5-Coder-Data. This dataset comprises
five key data types: Source Code Data, Text-Code Grounding Data, Synthetic Data, Math
Data and Text Data. In this section, we provide a brief overview of the sources and cleaning
methods applied to these datasets.
[Figure 1: The number of retained training tokens (in billions) and the corresponding performance of Qwen2.5-Coder-1.5B across the four filtering stages (Stage 1 to Stage 4) of the Text-Code Grounding Data cleaning pipeline.]
We designed a cleaning pipeline for the Text-Code Grounding Data, where each filter level is
built using smaller models, such as fastText. Although we experimented with larger models,
they did not yield significant benefits. A likely explanation is that smaller models focus
more on surface-level features, avoiding unnecessary semantic complexity.
In Qwen2.5-Coder, we applied this process iteratively. As shown in Figure 1, each iteration
resulted in improvement for Qwen2.5-Coder-1.5B. Through 4-stage filtering, the average
scores on HumanEval and MBPP increased from 41.6% to 46.8% compared to the baseline,
demonstrating the value of high-quality Text-Code Grounding Data for code generation.
Synthetic Data Synthetic data offers a promising way to address the anticipated scarcity
of training data. We used CodeQwen1.5, the predecessor of Qwen2.5-Coder, to generate
large-scale synthetic datasets. To mitigate the risk of hallucinations during this process, we
introduced an executor for validation, ensuring that only executable code was retained.
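The report does not detail the executor itself; the following is a minimal sketch of such executability filtering, assuming Python snippets are run in an isolated environment with a timeout (the helper names and threshold are our own).

import subprocess
import tempfile

def runs_cleanly(snippet: str, timeout_s: float = 5.0) -> bool:
    """Return True if the candidate snippet executes without errors (run inside a sandbox in practice)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(snippet)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout_s)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

candidates = ["print(sum(range(10)))", "print(undefined_name)"]
validated = [s for s in candidates if runs_cleanly(s)]   # keeps only the first snippet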
Text Data Similar to the Math Data, we included high-quality general natural language
data from the pre-training corpus of the Qwen2.5 model to preserve Qwen2.5-Coder’s
general capabilities. This data had already passed stringent quality checks during the
cleaning phase of Qwen2.5’s dataset, so no further processing was applied. However, all
code segments were removed from the general Text data to avoid overlap with our code
data, ensuring the independence of different data sources.
[Figure: The training pipeline of Qwen2.5-Coder: ① file-level pretraining of Qwen2.5 on 5.2T tokens, ② repo-level pretraining on an additional 300B tokens, and ③ alignment (SFT & DPO) to obtain Qwen2.5-Coder-Instruct.]
<|fim_prefix|>{code_pre}<|fim_suffix|>{code_suf}<|fim_middle|>{code_mid}<|endoftext|>
<|repo_name|>{repo_name}
<|file_sep|>{file_path1}
{file_content1}
<|file_sep|>{file_path2}
{file_content2}
<|file_sep|>{file_path3}
<|fim_prefix|>{code_pre}<|fim_suffix|>{code_suf}<|fim_middle|>{code_fim}<|endoftext|>
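To make the two formats above concrete, the sketch below assembles a file-level FIM sample and a repo-level sample from the listed special tokens; it is our own illustration rather than the actual data-construction code.

def build_file_level_fim(code: str, mid_start: int, mid_end: int) -> str:
    """Wrap a single file into the file-level FIM format shown above."""
    prefix, middle, suffix = code[:mid_start], code[mid_start:mid_end], code[mid_end:]
    return (f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}"
            f"<|fim_middle|>{middle}<|endoftext|>")

def build_repo_level_sample(repo_name: str, files: dict) -> str:
    """Concatenate the files of one repository into the repo-level format shown above."""
    parts = [f"<|repo_name|>{repo_name}"]
    for path, content in files.items():
        parts.append(f"<|file_sep|>{path}\n{content}")
    return "\n".join(parts)

code = "def add(a, b):\n    return a + b\n"
print(build_file_level_fim(code, 15, 31))   # masks the function body as the middle span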
4 Post-training
Instruction Synthesis from GitHub For the unsupervised data (code snippets) that exists at
scale on many websites (e.g., GitHub), we construct a supervised instruction dataset using
LLMs. Specifically, we use an LLM to generate an instruction from each code snippet of up
to 1,024 tokens, and then use a code LLM to generate the corresponding response (Wei
et al., 2024; Sun et al., 2024; Yu et al., 2024). Finally, we use an LLM scorer to filter out
low-quality pairs and obtain the final instruction-response pairs. Given code snippets in
different programming languages, we construct an instruction dataset covering all of them.
To fully unleash the potential of this method, we also include open-source instruction
datasets (e.g., McEval-Instruct for massively multilingual code generation and debugging1)
in the seed instruction set. Finally, we combine the instruction data derived from GitHub
code snippets with the open-source instructions for supervised fine-tuning.

1 https://huggingface.co/datasets/Multilingual-Multimodal-NLP/McEval-Instruct
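A simplified sketch of this synthesis loop is given below; the prompts, model names, and score threshold are placeholders chosen for illustration, not the exact setup used for Qwen2.5-Coder.

def synthesize_pair(snippet, chat):
    """chat(model_name, prompt) is a hypothetical helper that queries an LLM and returns text."""
    if len(snippet.split()) > 1024:      # rough proxy for the 1024-token limit on snippets
        return None
    instruction = chat("instruction-llm", f"Write a programming task that the code below solves:\n{snippet}")
    response = chat("code-llm", instruction)          # a code LLM answers the generated task
    score = float(chat("scorer-llm", f"Rate this Q&A pair from 1 to 10:\nQ: {instruction}\nA: {response}"))
    return {"instruction": instruction, "response": response} if score >= 7 else None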
Multilingual Code Instruction Data To bridge the gap among different programming
languages, we propose a multilingual multi-agent collaborative framework to synthesize
multilingual instruction corpora. The generation process can be split into six components:
(1) Language-Specific Intelligent Agents: we create a set of specialized agents, each dedicated
to a particular programming language and initialized with language-specific instruction
data derived from the limited existing multilingual instruction corpora and from curated
code snippets. (2) Collaborative Discussion Protocol: multiple language-specific agents
engage in a structured dialogue to formulate new instructions and solutions; this process
can either enhance existing language capabilities or generate instructions for a novel
programming language. (3) Adaptive Memory System: each agent maintains a dynamic
memory bank that stores its generation history to avoid producing near-duplicate samples.
(4) Cross-Lingual Discussion: we implement a novel knowledge distillation technique that
allows agents to share insights and patterns across language boundaries, fostering a more
comprehensive understanding of programming concepts. (5) Synergy Evaluation Metric:
we develop a new metric to quantify the degree of knowledge sharing and synergy between
different programming languages within the model. (6) Adaptive Instruction Generation:
the framework includes a mechanism to dynamically generate new instructions based on
identified knowledge gaps across languages. A simplified sketch of components (1)-(3) is
given after this paragraph.
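The framework itself is not released; the heavily simplified, hypothetical sketch below only illustrates how language-specific agents with memory banks could be wired together, with the collaborative discussion step abstracted into a callable.

import difflib

class LanguageAgent:
    """Hypothetical language-specific agent with a dynamic memory bank."""
    def __init__(self, language: str, seed_instructions: list):
        self.language = language
        self.memory = list(seed_instructions)      # stores generation history

    def is_near_duplicate(self, candidate: str, threshold: float = 0.9) -> bool:
        return any(difflib.SequenceMatcher(None, candidate, m).ratio() > threshold
                   for m in self.memory)

    def propose(self, discuss) -> str:
        """discuss(language, recent_memory) stands in for the collaborative discussion protocol."""
        candidate = discuss(self.language, self.memory[-3:])
        if self.is_near_duplicate(candidate):
            return ""                              # skip near-duplicate generations
        self.memory.append(candidate)
        return candidate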
Checklist-based Scoring for Instruction Data To comprehensively evaluate the quality of
each created instruction pair, we introduce several scoring criteria per sample: (1) Question&Answer
Consistency: whether the Q&A pair is consistent and correct for fine-tuning. (2) Question&Answer
Relevance: whether the Q&A pair is related to the computer field. (3) Question&Answer
Difficulty: whether the Q&A pair is sufficiently challenging. (4) Code Exist: whether code is
provided in the question or answer. (5) Code Correctness: whether the provided code is free
from syntax errors and logical flaws, considering factors such as proper variable naming,
code indentation, and adherence to best practices. (6) Code Clarity: how clear and
understandable the code is, e.g., whether it uses meaningful variable names, proper comments,
and a consistent coding style. (7) Code Comments: the presence of comments and their
usefulness in explaining the code's functionality. (8) Easy to Learn: the educational value of
the sample for a student whose goal is to learn basic coding concepts. After obtaining all
scores (s1, . . . , sn), we compute the final score as s = w1 s1 + · · · + wn sn, where (w1, . . . , wn)
are pre-defined weights.
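As a small worked example of the weighted aggregation (the actual checklist weights are not disclosed, so the numbers below are invented for illustration):

# Hypothetical per-criterion scores s_i in [0, 1] and weights w_i summing to 1.
scores  = {"consistency": 1.0, "relevance": 1.0, "difficulty": 0.6, "code_exist": 1.0,
           "correctness": 0.8, "clarity": 0.7, "comments": 0.5, "easy_to_learn": 0.9}
weights = {"consistency": 0.2, "relevance": 0.1, "difficulty": 0.1, "code_exist": 0.1,
           "correctness": 0.2, "clarity": 0.1, "comments": 0.1, "easy_to_learn": 0.1}
final_score = sum(weights[k] * scores[k] for k in scores)   # s = w1*s1 + ... + wn*sn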
A Multilingual Sandbox for Code Verification To further verify the correctness of code
syntax, we apply static checking to all extracted code snippets across programming
languages (e.g., Python, Java, and C++). We parse each code snippet into an abstract syntax
tree and filter out snippets whose parsed nodes contain parsing errors. We create a
multilingual sandbox to support static checking for the main programming languages.
Beyond static checking, the multilingual sandbox is a comprehensive platform designed to
validate code snippets across multiple programming languages: it automates the generation
of relevant unit tests based on language-specific samples and evaluates whether the
provided code snippets pass these tests. Notably, only self-contained code snippets (e.g.,
algorithm problems) are fed into the multilingual sandbox. The multilingual verification
sandbox mainly comprises five components.
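For Python, the static check reduces to parsing each snippet and discarding those that fail to parse; a minimal sketch is shown below (the real sandbox relies on per-language parsers and unit-test generation, which are omitted here).

import ast

def passes_static_check(snippet: str) -> bool:
    """Keep only snippets whose abstract syntax tree parses without errors."""
    try:
        ast.parse(snippet)
        return True
    except SyntaxError:
        return False

snippets = ["def square(x):\n    return x * x\n", "def broken(:\n    pass\n"]
clean = [s for s in snippets if passes_static_check(s)]   # keeps only the first snippet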
Mixed Tuning Since most instruction data are short, we also construct instruction pairs
in the FIM format to preserve the long-context capability of the base model. Inspired by
programming language syntax rules and user habits in practical scenarios, we leverage
tree-sitter-languages2 to parse code snippets and extract basic logic blocks as the middle
code to infill. For example, the abstract syntax tree (AST) represents the structure of
Python code in a tree format, where each node represents a construct occurring in the
source code. The tree's hierarchical nature reflects the syntactic nesting of constructs in
the code and includes elements such as expressions, statements, and functions. By
traversing and manipulating the AST, we can randomly extract nodes at multiple levels
and use the code context from the same file to recover the masked node. Finally, we
optimize the instruction model with a majority of standard SFT data and a small portion
of FIM instruction samples.
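A simplified version of this procedure is sketched below, using Python's built-in ast module in place of tree-sitter for brevity: one logic block is masked as the middle span and the rest of the file is kept as context.

import ast
import random

def make_fim_instruction_sample(source: str) -> str:
    """Mask one randomly chosen logic block and format it as a FIM sample."""
    tree = ast.parse(source)
    blocks = [n for n in ast.walk(tree) if isinstance(n, (ast.FunctionDef, ast.For, ast.If))]
    node = random.choice(blocks)
    lines = source.splitlines(keepends=True)
    start, end = node.lineno - 1, node.end_lineno            # convert 1-based line numbers
    prefix, middle, suffix = "".join(lines[:start]), "".join(lines[start:end]), "".join(lines[end:])
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>{middle}<|endoftext|>"

print(make_fim_instruction_sample("def add(a, b):\n    return a + b\n\nprint(add(1, 2))\n"))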
Direct Preference Optimization for Code After obtaining the SFT model, we further align
Qwen2.5-Coder with offline direct preference optimization (DPO) (Rafailov et al., 2023).
Given that human feedback is highly labor-intensive, we use the multilingual code sandbox
to provide code execution feedback, while an LLM provides judgment feedback in place of
human annotation. For algorithm-like, self-contained code snippets (in Python, Java, and
other languages), we generate test cases and use the execution results as feedback on code
correctness. For other, more complex code snippets, we use LLM-as-a-judge (Zheng et al.,
2023) to decide which code snippet is better. Finally, we combine the code DPO data with
general-domain preference data for offline DPO training.
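For reference, the standard offline DPO objective of Rafailov et al. (2023), applied here to pairs in which y_w is the preferred and y_l the rejected code response to a prompt x, is

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]

where \pi_{\mathrm{ref}} is the frozen SFT model, \beta controls the strength of the implicit KL regularization, and \sigma is the sigmoid function.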
5 Decontamination
To ensure that Qwen2.5-Coder does not produce inflated results due to test set leakage,
we performed decontamination on all data, including both the pre-training and post-training
datasets. We removed any samples overlapping with key evaluation datasets such as
HumanEval, MBPP, GSM8K, and MATH. The filtering used a 10-gram overlap method:
any training sample with a 10-gram word-level overlap with the test data was removed.
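A minimal sketch of the described 10-gram filter, using whitespace tokenization (the production pipeline may tokenize and normalize text differently), is shown below.

def word_ngrams(text: str, n: int = 10) -> set:
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_sample: str, test_ngrams: set, n: int = 10) -> bool:
    """A training sample is dropped if it shares any word-level 10-gram with the test data."""
    return not word_ngrams(train_sample, n).isdisjoint(test_ngrams)

# Usage (placeholders): build the 10-gram index once over all benchmark test samples,
# then drop any training sample that overlaps with it:
#   test_ngrams = set().union(*(word_ngrams(t) for t in test_samples))
#   clean_train = [s for s in train_samples if not is_contaminated(s, test_ngrams)]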
2 https://pypi.org/project/tree-sitter-languages/
For the base model, we conducted a comprehensive and fair evaluation in six key aspects, in-
cluding code generation, code completion, code reasoning, mathematical reasoning, general
natural language understanding and long-context modeling. To ensure the reproducibility
of all results, we made all evaluation code publicly available3. For comparison, we chose
the most popular and powerful open-source language models, including the StarCoder2
and DeepSeek-Coder series. Below is the list of artifacts used in the evaluation
for this section.
HumanEval and MBPP Code generation serves as a fundamental capability for code
models to handle more complex tasks. We selected two popular code generation benchmarks
to evaluate Qwen2.5-Coder, namely HumanEval (Chen et al., 2021) and MBPP (Austin et al.,
2021). HumanEval consists of 164 manually written programming tasks, each providing a
Python function signature and a docstring as input to the model. MBPP, on the other hand,
comprises 974 programming problems created by crowdsourced contributors. Each problem
includes a problem statement (i.e., a docstring), a function signature, and three test cases.
To further ensure accurate evaluation, EvalPlus (Liu et al., 2023) extends HumanEval into
HumanEval+ by adding 80 times more unique test cases and correcting inaccurate ground-
truth solutions in HumanEval. Similarly, MBPP+ offers 35 times more test cases than the
original MBPP.
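Accuracy on these benchmarks is conventionally reported with the pass@k metric; given n generated samples per problem of which c pass the unit tests, the unbiased estimator of Chen et al. (2021) is

\text{pass@}k = \mathbb{E}_{\text{problems}}\!\left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right]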
Additionally, it is worth noting that MBPP 3-shot is particularly suitable for monitoring
model convergence during training. Early in training, the model tends to be unstable,
causing significant fluctuations in metrics, and simple 3-shot examples effectively mitigate
this. Therefore, we also report MBPP 3-shot performance.
As shown in Table 5, the Qwen2.5-Coder series has shown impressive performance in basic code
generation, achieving state-of-the-art results among open-source models of the same size
and surpassing even larger models. In particular, Qwen2.5-Coder-7B outperforms the
previous best dense model, DS-Coder-33B, across all five metrics.
Table 5: Performance of various models on HumanEval, MBPP and the “complete” task of
BigCodeBench.
Many developer aid tools rely on the capability to autocomplete code based on preced-
ing and succeeding code snippets. Qwen2.5-Coder utilizes the Fill-In-the-Middle (FIM)
training strategy, as introduced in Bavarian et al. (2022), enabling the model to generate
code that is contextually coherent. To assess its code completion proficiency, we utilize the
HumanEval-FIM benchmark (Allal et al., 2023), CrossCodeEval (Ding et al., 2024), Cross-
CodeLongEval (Wu et al., 2024a), RepoEval (Zhang et al., 2023) and SAFIM (Gong et al.,
2024). Figure 5 shows the overall evaluation results of Qwen2.5-Coder-32B on different code
completion benchmarks.
[Figure 5: Overall evaluation results of Qwen2.5-Coder-32B on the code completion benchmarks listed above.]
HumanEval-FIM
Model Size Python Java JavaScript Average∗
0.5B+ Models
Qwen2.5-Coder-0.5B 0.5B 70.3 78.1 81.2 77.7
1B+ Models
DS-Coder-1.3B-Base 1.3B 72.8 84.3 81.7 80.7
Qwen2.5-Coder-1.5B 1.5B 77.0 85.6 85.0 83.5
3B+ Models
StarCoder2-3B 3B 70.9 84.4 81.8 80.4
Qwen2.5-Coder-3B 3B 78.7 88.0 87.4 85.7
6B+ Models
StarCoder2-7B 7B 70.8 86.0 84.4 82.0
DS-Coder-6.7B-Base 6.7B 78.1 87.4 84.1 84.0
DS-Coder-V2-Lite-Base 2.4/16B 78.7 87.8 85.9 85.0
CodeQwen1.5-7B 7B 75.8 85.7 85.0 83.3
Qwen2.5-Coder-7B 7B 79.7 88.5 87.6 86.2
14B+ Models
StarCoder2-15B 15B 74.2 85.2 84.6 82.6
Qwen2.5-Coder-14B 14B 80.5 91.0 88.5 87.7
20B+ Models
CodeStral-22B 22B 76.7 82.5 86.0 82.7
DS-Coder-33B-Base 33B 80.1 89.0 86.8 86.2
Qwen2.5-Coder-32B 32B 81.5 91.0 89.4 88.3
In real-world scenarios, code completion often depends on accessing cross-file context and
dependencies. CrossCodeEval is a benchmark that requires a deep understanding of this
cross-file context to accurately complete the code. In our evaluation, we set a maximum
sequence length of 8192 tokens, designate a maximum output length of 50 tokens, and
impose a limit of 2048 tokens for the cross-file context. For the cross-file context, we use
the official BM25 search results provided by Ding et al. (2024). We evaluate performance
using Exact Match (EM) and Edit Similarity (ES) metrics. Table 8 shows that Qwen2.5-Coder-32B
achieves state-of-the-art performance with a 3.7% improvement. Qwen2.5-Coder outperforms
all models of comparable size. Meanwhile, Qwen2.5-Coder-7B performs comparably to
models exceeding 20 billion parameters.
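As a rough sketch of these two metrics (ES is typically defined via edit distance; here difflib's similarity ratio is used as a convenient stand-in, which may differ slightly from the official evaluation scripts):

import difflib

def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip() == reference.strip())

def edit_similarity(prediction: str, reference: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for the Levenshtein-based ES score."""
    return difflib.SequenceMatcher(None, prediction.strip(), reference.strip()).ratio()

print(exact_match("return a + b", "return a + b"))            # 1.0
print(round(edit_similarity("return a+b", "return a + b"), 2))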
CrossCodeLongEval is a long-context benchmark for cross-file code completion tasks. In
our evaluation, we set a maximum sequence length of 8192 tokens and set the maximum
output as 256 tokens for function completion and 50 tokens for other tasks. The cross-file
context is truncated to 2048 tokens. For the cross-file context, we use the official BM25 search
results provided by Wu et al. (2024a). We evaluate performance using Exact Match (EM)
and Edit Similarity (ES) metrics. Qwen2.5-Coder-32B achieves state-of-the-art performance,
as detailed in Table 9. The Qwen2.5-Coder series surpasses all other models of a similar size.
All models demonstrate low Exact Match (EM) results on function completion tasks, likely
due to the complexity of generating multi-line code snippets that are challenging to match
precisely.
RepoEval is a benchmark designed to evaluate repository-level code completion capabilities
across three granularities: line, API invocation, and function body completion. In our
evaluation, we set a maximum sequence length of 8192 tokens, set the maximum output as
256 tokens for function completion and 50 tokens for other tasks, and impose a limit of 2048
tokens for the cross-file context. In addition, we utilize the official sparse retriever (Lu et al.,
2022) to extract the cross-file context. We evaluate performance using Exact Match (EM)
and Edit Similarity (ES) metrics. As shown in Table 10, Qwen2.5-Coder-32B achieves state-
of-the-art performance with an average improvement of 7.9% EM and 4.2% ES compared
to DS-Coder-33B-Base. Furthermore, Qwen2.5-Coder-14B and Qwen2.5-Coder-7B achieve
comparable performance to models with more than 20B parameters, while maintaining
state-of-the-art results among models of similar size.
SAFIM is a syntax-aware fill-in-the-middle benchmark that emphasizes AST-based code
completion, specifically targeting algorithmic blocks, control-flow expressions, and API
function calls. The benchmark consists of 17,720 examples from 8,590 code files created
after April 2022, deliberately avoiding overlap with mainstream pretraining corpora. For
evaluation, we use pass@1 rate as the metric for algorithmic and control-flow tasks, and
Exact Match (EM) for API completion tasks.
Code is a highly abstract form of logical language, and reasoning based on code helps
us determine whether a model truly understands the reasoning flow behind the code.
We selected CRUXEval (Gu et al., 2024) as the benchmark, which includes 800 Python
functions along with corresponding input-output examples. It consists of two distinct
tasks: CRUXEval-I, where the large language model (LLM) must predict the input based
on a known output; and CRUXEval-O, where the model must predict the output for a
given input. For both CRUXEval-I and CRUXEval-O, we used a chain-of-thought (CoT)
approach, requiring the LLM to output steps sequentially during simulated execution.
As shown in Table 11, Qwen2.5-Coder delivered highly promising results, achieving a score
of 56.5 on CRUXEval-I and 56.0 on CRUXEval-O, thanks to our focus on executable quality
during the code cleaning process.
Mathematics and coding have always been closely intertwined. Mathematics forms the
foundational discipline for coding, while coding serves as a vital tool in mathematical
fields. As such, we expect an open and powerful code model to exhibit strong mathematical
capabilities as well. To assess Qwen2.5-Coder’s mathematical performance, we selected
four popular benchmarks: MATH (Hendrycks et al., 2021), GSM8K (Cobbe et al.,
2021), MMLU-STEM (Hendrycks et al., 2020), and TheoremQA (Chen et al., 2023). Table 12
highlights Qwen2.5-Coder’s strengths in mathematics, which likely stem from two key
factors: first, the model’s strong foundation built on Qwen2.5, and second, the careful
mixing of code and mathematical data during training, which has ensured a well-balanced
performance across these domains.
In addition to mathematical ability, we aim to retain as much of the base model’s general-
purpose capabilities as possible, such as general knowledge. To evaluate general natural
language understanding, we selected MMLU (Hendrycks et al., 2020) and its variant MMLU-
Redux (Gema et al., 2024), along with four other benchmarks: ARC-Challenge (Clark et al.,
2018), TruthfulQA (Lin et al., 2021), WinoGrande (Sakaguchi et al., 2019), and HellaSwag
(Zellers et al., 2019). Similar to the results in mathematics, Table 14 highlights Qwen2.5-
Coder’s advantage in general natural language capabilities compared to other code models,
further validating the effectiveness of Qwen2.5-Coder’s data mixing strategy.
Long context capability is crucial for code LLMs, serving as the core skill for understanding
repository-level code and becoming a code agent. However, most current code models still
support only limited context lengths, which hinders their potential for practical application.
Qwen2.5-Coder aims to further advance the progress of open-source code
models in long context modeling. To achieve this, we have collected and constructed
long sequence code data at the repository level for pre-training. Through careful data
proportioning and organization, we have enabled it to support input lengths of up to 128K
tokens.
Needle in the Code We created a simple but fundamental synthetic task called Needle in the
Code, inspired by popular long-context evaluations in the text domain. In this task, we
inserted a very simple custom function at various positions within a code repository (we
chose Megatron4 to honor its contributions to open-source LLMs!) and tested whether the
model could replicate this function at the end of the codebase. Figure 6 shows that
Qwen2.5-Coder is capable of successfully completing this task within a 128K-token context range.
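The construction of this task is straightforward; the sketch below is a hypothetical re-implementation (the needle function, prompt wording, and pass criterion are our own choices) that inserts a trivial function at a chosen relative depth of the repository text and checks whether the model reproduces it.

def build_needle_prompt(repo_text: str, depth: float) -> tuple:
    """Insert a trivial 'needle' function at the given relative depth (0.0-1.0) of the repo text."""
    needle = "def magic_qwen_number():\n    return 20241106\n"
    cut = int(len(repo_text) * depth)
    context = repo_text[:cut] + "\n" + needle + "\n" + repo_text[cut:]
    prompt = context + "\n# Reproduce the body of magic_qwen_number below:\ndef magic_qwen_number():\n"
    return prompt, "20241106"

def needle_recovered(model_output: str, answer: str) -> bool:
    return answer in model_output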
CRUXEval
Model Size
Input-CoT Output-CoT
0.5B+ Models
Qwen2.5-Coder-0.5B 0.5B 35.2 23.0
1B+ Models
DS-Coder-1.3B-Base 1.3B 32.1 28.2
Qwen2.5-Coder-1.5B 1.5B 43.8 34.6
3B+ Models
StarCoder2-3B 3B 42.1 29.2
Qwen2.5-Coder-3B 3B 46.5 43.8
6B+ Models
StarCoder2-7B 7B 39.5 35.1
DS-Coder-6.7B-Base 6.7B 39.0 41.0
DS-Coder-V2-Lite-Base 2.4/16B 53.4 46.1
CodeQwen1.5-7B 7B 44.8 40.1
Qwen2.5-Coder-7B 7B 56.5 56.0
14B+ Models
StarCoder2-15B 15B 46.1 47.6
Qwen2.5-Coder-14B 14B 60.6 66.4
20B+ Models
DS-Coder-33B-Base 33B 50.6 48.8
DS-Coder-V2-Base 21/236B 62.7 67.4
Qwen2.5-Coder-32B 32B 62.5 69.4
Table 11: Performance of different models on CRUXEval with Input-CoT and Output-CoT
settings.
Figure 6: The long context ability of Qwen2.5-Coder, evaluated by Needle in the Code.
For the instruct models, we conducted a comprehensive evaluation covering code generation, code reasoning, code editing, text-to-SQL, and general natural language understanding. The evaluation was structured to ensure a fair and thorough comparison
across models. All evaluation code is publicly accessible for reproducibility5 . To ensure a
broad comparison, we included some of the most popular and widely-used open-source
instruction-tuned models, notably versions from the DeepSeek-Coder series and Codestral
models. Below is a list of all artifacts referenced in this section.
5 https://github.com/QwenLM/Qwen2.5-Coder
Table 12: Performance of various models on four math benchmarks, namely MATH, GSM8K,
MMLU-STEM, and TheoremQA.
Building on the performance improvements of the Qwen2.5-Coder series base models, our
Qwen2.5-Coder series instruct models similarly demonstrated outstanding performance in
code generation tasks.
HumanEval and MBPP We also assessed the code generation capabilities of the Qwen2.5-
Coder series instruction models using the EvalPlus (Liu et al., 2023) dataset. As shown by
the results in Table 16, our Qwen2.5-Coder-7B-Instruct model demonstrated exceptional
accuracy, significantly outperforming other models with a comparable parameter count.
Remarkably, it even surpassed larger models with over 20 billion parameters, such as
CodeStral-22B and DS-Coder-33B-Instruct. Furthermore, our Qwen2.5-Coder-32B-Instruct
model achieved the highest performance on EvalPlus, even outperforming DS-Coder-V2-
Instruct, making it the most powerful open-source code model to date.
MMLU
Model Size
Base Pro Redux
0.5B+ Models
Qwen2.5-Coder-0.5B 0.5B 42.0 13.3 40.6
1B+ Models
DS-Coder-1.3B-Base 1.3B 25.8 11.4 24.5
Qwen2.5-Coder-1.5B 1.5B 53.6 23.1 50.9
3B+ Models
StarCoder2-3B 3B 36.6 15.5 37.0
Qwen2.5-Coder-3B 3B 61.2 32.0 59.5
6B+ Models
StarCoder2-7B 7B 38.8 17.2 38.6
DS-Coder-6.7B-Base 6.7B 36.4 16.7 36.5
DS-Coder-V2-Lite-Base 2.4/16B 60.5 33.4 58.3
CodeQwen1.5-7B 7B 40.5 17.2 41.2
Qwen2.5-Coder-7B 7B 68.0 40.1 66.6
14B+ Models
StarCoder2-15B 15B 64.1 24.3 48.8
Qwen2.5-Coder-14B 14B 75.2 49.3 72.4
20B+ Models
DS-Coder-33B-Base 33B 39.4 18.4 38.7
Qwen2.5-Coder-32B 32B 79.1 50.4 77.5
Table 13: MMLU results of different models, a general benchmark for common knowledge.
Table 14: General performance of different models on four popular general benchmarks,
ARC-Challenge, TruthfulQA, WinoGrande and HellaSwag.
Table 16: The performance of different instruct models on code generation, evaluated by
HumanEval, MBPP, BigCodeBench, and LiveCodeBench. For BigCodeBench, we report the
score on the “instruct” tasks.
This level of performance is achieved by the Qwen2.5-Coder-32B-Instruct model with only
32 billion parameters, bringing it very close to the performance of several closed-source APIs.
CodeArena Similar to Chatbot Arena (Chiang et al., 2024), we use CodeArena to emulate user code-related prompts
in realistic environments. We use GPT-4o as the evaluation model for preference alignment,
employing an “A vs. B win” evaluation method, which measures the percentage of instances
in the test set where the score of A exceeds the score of B. The results in Figure 9 demonstrate
the advantage of Qwen2.5-Coder-32B-Instruct in preference alignment.
To evaluate the code reasoning capabilities of the Qwen2.5-Coder series instruct mod-
els, we conducted an assessment on the CRUXEval (Gu et al., 2024) dataset. As shown
in Table 18, the Qwen2.5-Coder-7B-Instruct model achieved Input-CoT and Output-CoT
accuracies of 65.8% and 65.9%, respectively—demonstrating a substantial improvement
over the DS-Coder-V2-Lite-Instruct model, with gains of 12.8% in Input-CoT accuracy
and 13.0% in Output-CoT accuracy. Additionally, the Qwen2.5-Coder-7B-Instruct model
outperformed larger models, including CodeStral-22B and DS-Coder-33B-Instruct, high-
lighting its advanced code reasoning capabilities despite its smaller size. Notably, our
Qwen2.5-Coder-32B-Instruct model achieved accuracies of 75.2% and 83.4% on Input-CoT
and Output-CoT, respectively, significantly outperforming other open-source code mod-
els (including DS-Coder-V2-Instruct) and underscoring its robust performance in code
reasoning.
Figure 10 illustrates the relationship between model sizes and code reasoning capabilities.
The Qwen2.5-Coder instruct models stand out for delivering superior code reasoning
performance with the fewest parameters, surpassing the results of other open-source large
language models by a significant margin.

[Figure: McEval performance of Qwen2.5-Coder-7B-Chat compared with CodeStral-20B, DS-Coder-V1-6.7B-Instruct, DS-Coder-V2-Lite-Instruct, DS-Coder-V1-33B-Instruct, and CodeQwen1.5-7B-Chat across roughly 40 programming languages.]
Aider Aider9 provides a code editing benchmark designed to quantitatively measure how
well it collaborates with large language models (LLMs). Drawing from a set of 133 Python
exercises sourced from Exercism10, the benchmark tests the ability of Aider and LLMs to
interpret natural language programming requests and translate them into executable code
that successfully passes unit tests. This assessment goes beyond evaluating raw coding
proficiency; it also examines how effectively LLMs can edit existing code and format those
modifications for seamless integration with Aider’s system, ensuring that local source
files can be updated without issues. The comprehensive nature of this benchmark reflects
both the technical aptitude of the LLMs and their consistency in task completion. Table 19
highlights the performance of several language models in the Code Editing task. Among
these models, Qwen2.5-Coder-7B-Instruct exhibits exceptional code repair capabilities.
Despite its relatively modest scale of 7 billion parameters, it achieves an impressive PASS@1
accuracy of 51.9%, significantly outperforming comparable models. Remarkably, it also
surpasses larger models such as CodeStral-22B and DS-Coder-33B-Instruct, highlighting
its remarkable efficiency and effectiveness in code editing tasks. Our Qwen2.5-Coder-32B-
Instruct model achieves even higher accuracy, with Pass@1 and Pass@2 rates reaching 60.9%
and 73.7%, respectively.

9 https://github.com/paul-gauthier/aider
10 https://github.com/exercism/python

Figure 8: The MdEval performance of Qwen2.5-Coder-32B-Instruct compared with popular
open-source large code models of similar size, covering C, Common Lisp, C++, Go, Java,
JavaScript, Julia, Pascal, PHP, Python, R, Ruby, Rust, Scala, Swift, C#, and F#.
CRUXEval
Model Size
Input-CoT Output-CoT
0.5B+ Models
Qwen2.5-Coder-0.5B-Instruct 0.5B 33.9 27.8
1B+ Models
DS-Coder-1.3B-Instruct 1.3B 12.9 28.1
Yi-Coder-1.5B-Chat 1.5B 19.9 24.9
Qwen2.5-Coder-1.5B-Instruct 1.5B 45.4 37.5
3B+ Models
Qwen2.5-Coder-3B-Instruct 3B 53.2 56.0
6B+ Models
CodeLlama-7B-Instruct 7B 36.1 36.2
DS-Coder-6.7B-Instruct 6.7B 42.6 45.1
CodeQwen1.5-7B-Chat 7B 44.0 38.8
Yi-Coder-9B-Chat 9B 47.5 55.6
DS-Coder-V2-Lite-Instruct 2.4/16B 53.0 52.9
Qwen2.5-Coder-7B-Instruct 7B 65.8 65.9
13B+ Models
CodeLlama-13B-Instruct 13B 47.5 41.1
Starcoder2-15B-Instruct-v0.1 15B 45.5 50.9
Qwen2.5-Coder-14B-Instruct 14B 69.5 79.5
20B+ Models
CodeLlama-34B-Instruct 34B 48.5 47.1
CodeStral-22B-v0.1 22B 61.3 63.5
DS-Coder-33B-Instruct 33B 47.3 50.6
CodeLlama-70B-Instruct 70B 56.5 57.8
DS-Coder-V2-Instruct 21/236B 70.0 75.1
Qwen2.5-Coder-32B-Instruct 32B 75.2 83.4
Closed-APIs
Claude-3.5-Sonnet-20240620 - 75.5 81.8
Claude-3.5-Sonnet-20241022 - 84.4 87.2
GPT-4o-mini-2024-07-18 - 67.5 78.4
GPT-4o-2024-08-06 - 78.6 89.2
o1-mini - 91.6 96.2
o1-preview - 86.5 81.4
Table 18: The CRUXEval performance of different instruct models, with Input-CoT and
Output-CoT settings.
7.4 Text-to-SQL
SQL is one of the essential tools in daily software development and production, but its
steep learning curve often hinders free interaction between non-programming experts and
databases. To address this issue, the Text-to-SQL task was introduced, aiming for models
to automatically map natural language questions to structured SQL queries. Previous
improvements in Text-to-SQL focused primarily on structure-aware learning, domain-
specific pre-training, and sophisticated prompt designs.
Thanks to the use of finely crafted synthetic data during both pre-training and fine-tuning,
we significantly enhanced Qwen2.5-Coder’s capability in Text-to-SQL tasks. We selected
two well-known benchmarks, Spider (Yu et al., 2018) and BIRD (Li et al., 2024a), for com-
prehensive evaluation. To ensure a fair comparison between Qwen2.5-Coder and other
open-source language models on this task, we used a unified prompt template as input,
following the work of Chang & Fosler-Lussier (2023). The evaluation prompt consists of
table representations aligned with database instructions, examples of table content, op-
tional additional knowledge, and natural language questions. This standardized prompt
template minimizes biases that may arise from prompt variations. As shown in Figure 12,
Qwen2.5-Coder outperforms other code models of the same size on the Text-to-SQL task.
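To illustrate the unified prompt format, a hypothetical instance in the style of Chang & Fosler-Lussier (2023) is shown below; the schema, example rows, and question are fabricated for illustration and are not taken from Spider or BIRD.

# A hypothetical unified Text-to-SQL prompt: table schema, example rows,
# optional external knowledge, and the natural language question.
prompt = """CREATE TABLE singer (
    singer_id INT PRIMARY KEY,
    name TEXT,
    country TEXT,
    age INT
);
/* 3 example rows:
1, 'Joe Sharp', 'Netherlands', 52
2, 'Timbaland', 'United States', 43
3, 'Rose White', 'France', 41
*/
-- External knowledge: none
-- Question: What is the average age of singers from France?
SELECT"""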
[Figure 10 plots CRUXEval-O (CoT) scores against parameter counts for the Qwen2.5-Coder instruct models and other open-source and proprietary models (GPT-4o-0513, Claude-3-Opus, DS-Coder-V2-Instruct, Codestral, Llama3-Instruct, DS-Coder-V2-Lite-Instruct, DS-Coder-33B-Instruct, DS-Coder-6.7B-Instruct, CodeQwen1.5-7B-Chat, DS-Coder-1.3B-Instruct), highlighting the best performance/size ratio of the Qwen2.5-Coder series.]
Figure 10: The relationship between model sizes and code reasoning capabilities. The x-axis
represents the parameter sizes of different models, and the y-axis indicates the CRUXEval-O
(CoT) scores respectively.
Figure 11: The evaluation results on CodeEditBench, reporting win rates of Qwen2.5-Coder-32B-Instruct, DS-Coder-V2-Instruct, Codestral-22B-v0.1, and DS-Coder-V1-33B-Instruct on the Overall, Code Debug, Code Translation, Code Requirement Switch, and Code Polishment categories.
Figure 12: Text-to-SQL performance (BIRD and Spider) of different models.
Model BIRD Spider
Qwen2.5-Coder-32B-Instruct 58.4 85.1
Qwen2.5-Coder-14B-Instruct 56.9 84.8
Qwen2.5-Coder-7B-Instruct 51.1 82.0
CodeStral-22B 46.2 76.6
DS-Coder-33B-Instruct 45.6 73.8
DS-Coder-V2-Lite-Instruct 41.6 74.6
DS-Coder-6.7B-Instruct 39.8 70.0
Figure 13: The table understanding evaluation on TableBench, covering the Overall, Fact Checking, Num-Reasoning, Data Analysis, and Visualization categories.
Aider
Model Size
Pass@1 Pass@2
0.5B+ Models
Qwen2.5-Coder-0.5B-Instruct 0.5B 14.3 14.3
1B+ Models
DS-Coder-1.3B-Instruct 1.3B 18.0 18.8
Yi-Coder-1.5B-Chat 1.5B 17.3 17.3
Qwen2.5-Coder-1.5B-Instruct 1.5B 28.6 31.6
3B+ Models
Qwen2.5-Coder-3B-Instruct 3B 33.8 39.1
6B+ Models
CodeLlama-7B-Instruct 7B 1.5 1.5
DS-Coder-6.7B-Instruct 6.7B 37.6 44.4
CodeQwen1.5-7B-Chat 7B 24.8 38.3
Yi-Coder-9B-Chat 9B 45.9 54.1
DS-Coder-V2-Lite-Instruct 2.4/16B 44.4 52.6
Qwen2.5-Coder-7B-Instruct 7B 55.6 68.4
13B+ Models
CodeLlama-13B-Instruct 13B 1.5 1.5
Qwen2.5-Coder-14B-Instruct 14B 58.6 69.2
20B+ Models
CodeLlama-34B-Instruct 34B 1.5 1.5
CodeStral-22B-v0.1 22B 36.8 51.1
DS-Coder-33B-Instruct 33B 50.4 54.5
CodeLlama-70B-Instruct 70B 12.8 15.0
DS-Coder-V2-Instruct 21/236B 51.9 73.7
Qwen2.5-Coder-32B-Instruct 32B 60.9 73.7
Closed-APIs
Claude-3.5-Sonnet-20240620 - 59.4 66.2
Claude-3.5-Sonnet-20241022 - 71.4 86.5
GPT-4o-mini-2024-07-18 - 43.6 55.6
GPT-4o-2024-08-06 - 56.8 74.4
o1-mini - 49.6 70.7
o1-preview - 69.9 88.0
Table 19: The code editing ability of different instruct models evaluated by Aider benchmark.
The “whole” edit format was consistently applied across all our experiments.
In this section, we provide a comparative analysis of the performance between our Qwen2.5-
Coder series models and the DS-Coder-V2 series models, with a focus on both mathematical
computation and general natural language processing tasks. The results in Table 20 highlight
the versatility of the Qwen2.5-Coder series, which excels not only in complex coding tasks
but also in advanced general-purpose tasks, setting it apart from its competitors.
Figure 14: The evaluation results of Qwen2.5-Coder models of different sizes on MBPP 3-shot
and LiveCodeBench.
9 Conclusion
This work introduces Qwen2.5-Coder, the latest addition to the Qwen series. Built upon
Qwen2.5, a top-tier open-source LLM, Qwen2.5-Coder has been developed through exten-
sive pre-training and post-training of Qwen2.5-0.5B/1.5B/3B/7B/14B/32B on large-scale
datasets. To ensure the quality of the pre-training data, we have curated a dataset by collect-
ing public code data and extracting high-quality code-related content from web texts, while
filtering out low-quality data using advanced classifiers. Additionally, we have constructed
a meticulously designed instruction-tuning dataset to transform the base code LLM into a
strong coding assistant.
Looking ahead, our research will focus on exploring the impact of scaling up code LLMs
in terms of both data size and model size. We will also continue to enhance the reasoning
capabilities of these models, aiming to push the boundaries of what code LLMs can achieve.
References
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni
Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al.
Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Car-
los Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, et al.
Santacoder: don’t reach for the stars! arXiv preprint arXiv:2301.03988, 2023.
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David
Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with
large language models. arXiv preprint arXiv:2108.07732, 2021.
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin
Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609,
2023.
Mohammad Bavarian, Heewoo Jun, Nikolas Tezak, John Schulman, Christine McLeavey,
Jerry Tworek, and Mark Chen. Efficient training of language models to fill in the middle.
arXiv preprint arXiv:2207.14255, 2022.
Tom B Brown. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin,
Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feld-
man, et al. Multipl-e: A scalable and extensible approach to benchmarking neural code
generation. arXiv preprint arXiv:2208.08227, 2022.
Linzheng Chai, Shukai Liu, Jian Yang, Yuwei Yin, Ke Jin, Jiaheng Liu, Tao Sun, Ge Zhang,
Changyu Ren, Hongcheng Guo, et al. Mceval: Massively multilingual code evaluation.
arXiv preprint arXiv:2406.07436, 2024.
Shuaichen Chang and Eric Fosler-Lussier. How to prompt llms for text-to-sql: A study
in zero-shot, single-domain, and cross-domain settings. arXiv preprint arXiv:2305.11853,
2023.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto,
Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evalu-
ating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang,
and Tony Xia. Theoremqa: A theorem-driven question answering dataset. In Proceedings
of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 7889–7901,
2023.
Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle
Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, et al.
Chatbot arena: An open platform for evaluating llms by human preference. arXiv preprint
arXiv:2403.04132, 2024.
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick,
and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning
challenge. arXiv preprint arXiv:1803.05457, 2018.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser,
Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers
to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
Yangruibo Ding, Zijian Wang, Wasi Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Kr-
ishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, et al. Crosscodeeval:
A diverse and multilingual benchmark for cross-file code completion. Advances in Neural
Information Processing Systems, 36, 2024.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle,
Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3
herd of models. arXiv preprint arXiv:2407.21783, 2024.
Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun
Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. Codebert: A pre-trained model
for programming and natural languages. In Trevor Cohn, Yulan He, and Yang Liu (eds.),
Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20
November 2020, volume EMNLP 2020 of Findings of ACL, pp. 1536–1547. Association for
Computational Linguistics, 2020. doi: 10.18653/V1/2020.FINDINGS-EMNLP.139. URL
https://doi.org/10.18653/v1/2020.findings-emnlp.139.
Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto
Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad
Reza Ghasemi Madani, et al. Are we done with mmlu? arXiv preprint arXiv:2406.04127,
2024.
Linyuan Gong, Sida Wang, Mostafa Elhoushi, and Alvin Cheung. Evaluation of llms on
syntax-aware code fill-in-the-middle tasks. arXiv preprint arXiv:2403.04814, 2024.
Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and
Sida I Wang. Cruxeval: A benchmark for code reasoning, understanding and execution.
arXiv preprint arXiv:2401.03065, 2024.
Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen,
Xiao Bi, Yu Wu, YK Li, et al. Deepseek-coder: When the large language model meets
programming–the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024a.
Jiawei Guo, Ziming Li, Xueling Liu, Kaijing Ma, Tianyu Zheng, Zhouliang Yu, Ding Pan,
Yizhi Li, Ruibo Liu, Yue Wang, et al. Codeeditorbench: Evaluating code editing capability
of large language models. arXiv preprint arXiv:2404.03543, 2024b.
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and
Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint
arXiv:2009.03300, 2020.
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang,
Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the
math dataset. arXiv preprint arXiv:2103.03874, 2021.
Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Ar-
mando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contami-
nation free evaluation of large language models for code. arXiv preprint arXiv:2403.07974,
2024.
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh
Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile
Saulnier, et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin,
Ruiying Geng, Nan Huo, et al. Can llm already serve as a database interface? a big bench
for large-scale database grounded text-to-sqls. Advances in Neural Information Processing
Systems, 36, 2024a.
Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Cheng-
hao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. Starcoder: may the
source be with you! arXiv preprint arXiv:2305.06161, 2023.
Ziming Li, Qianbo Zang, David Ma, Jiawei Guo, Tianyu Zheng, Xinyao Niu, Xiang Yue,
Yue Wang, Jian Yang, Jiaheng Liu, et al. Autokaggle: A multi-agent framework for
autonomous data science competitions. arXiv preprint arXiv:2410.20424, 2024b.
Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic
human falsehoods. arXiv preprint arXiv:2109.07958, 2021.
Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated
by ChatGPT really correct? Rigorous evaluation of large language models for code
generation. arXiv preprint arXiv:2305.01210, 2023.
Jiaheng Liu, Ken Deng, Congnan Liu, Jian Yang, Shukai Liu, He Zhu, Peng Zhao, Linzheng
Chai, Yanan Wu, Ke Jin, et al. M2rc-eval: Massively multilingual repository-level code
completion evaluation. arXiv preprint arXiv:2410.21157, 2024a.
Shukai Liu, Linzheng Chai, Jian Yang, Jiajun Shi, He Zhu, Liran Wang, Ke Jin, Wei Zhang,
Hualei Zhu, Shuyue Guo, et al. Mdeval: Massively multilingual code debugging. arXiv
preprint arXiv:2411.02310, 2024b.
Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier,
Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, et al. Starcoder 2
and the stack v2: The next generation. arXiv preprint arXiv:2402.19173, 2024.
Shuai Lu, Nan Duan, Hojae Han, Daya Guo, Seung-won Hwang, and Alexey Svy-
atkovskiy. Reacc: A retrieval-augmented code completion framework. arXiv preprint
arXiv:2203.07722, 2022.
MistralAI. Codestral. https://mistral.ai/news/codestral, 2024. 2024.05.29.
OpenAI. Gpt-4o. https://openai.com/index/hello-gpt-4o, 2024. 2024.05.13.
Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context
window extension of large language models. arXiv preprint arXiv:2309.00071, 2023.
Qwen. Code with codeqwen1.5, April 2024. URL https://qwenlm.github.io/blog/
codeqwen1.5/.
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and
Chelsea Finn. Direct preference optimization: Your language model is secretly a reward
model. arXiv preprint arXiv:2305.18290, 2023.
Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan,
Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. Code llama: Open foundation
models for code. arXiv preprint arXiv:2308.12950, 2023.
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An
adversarial Winograd schema challenge at scale. arXiv preprint arXiv:1907.10641, 2019.
Tao Sun, Linzheng Chai, Jian Yang, Yuwei Yin, Hongcheng Guo, Jiaheng Liu, Bing Wang,
Liqun Yang, and Zhoujun Li. Unicoder: Scaling code large language model via universal
code. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd
Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL
2024, Bangkok, Thailand, August 11-16, 2024, pp. 1812–1824. Association for Computational
Linguistics, 2024. URL https://aclanthology.org/2024.acl-long.100.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei,
Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2:
Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. Magicoder: Em-
powering code generation with oss-instruct. In Forty-first International Conference on
Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024.
URL https://openreview.net/forum?id=XUeoOBid3x.
Di Wu, Wasi Uddin Ahmad, Dejiao Zhang, Murali Krishna Ramanathan, and Xiaofei Ma.
Repoformer: Selective retrieval for repository-level code completion. arXiv preprint
arXiv:2403.10059, 2024a.
Xianjie Wu, Jian Yang, Linzheng Chai, Ge Zhang, Jiaheng Liu, Xinrun Du, Di Liang, Daixin
Shu, Xianfu Cheng, Tianzhen Sun, et al. Tablebench: A comprehensive and complex
benchmark for table question answering. arXiv preprint arXiv:2408.09174, 2024b.
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li,
Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint
arXiv:2407.10671, 2024.
Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma,
Irene Li, Qingning Yao, Shanelle Roman, et al. Spider: A large-scale human-labeled
dataset for complex and cross-domain semantic parsing and text-to-sql task. arXiv preprint
arXiv:1809.08887, 2018.
Zhaojian Yu, Xin Zhang, Ning Shang, Yangyu Huang, Can Xu, Yishujie Zhao, Wenxiang
Hu, and Qiufeng Yin. Wavecoder: Widespread and versatile enhancement for code
large language models by instruction tuning. In Lun-Wei Ku, Andre Martins, and Vivek
Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pp.
5140–5153. Association for Computational Linguistics, 2024. doi: 10.18653/V1/2024.
ACL-LONG.280. URL https://doi.org/10.18653/v1/2024.acl-long.280.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a
machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-
Guang Lou, and Weizhu Chen. Repocoder: Repository-level code completion through
iterative retrieval and generation. arXiv preprint arXiv:2303.12570, 2023.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao
Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with
mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–
46623, 2023.
Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari,
Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al. Bigcodebench:
Benchmarking code generation with diverse function calls and complex instructions.
arXiv preprint arXiv:2406.15877, 2024.