FullStack Bench: Evaluating LLMs as Full Stack Coders
Siyao Liu∗1 , He Zhu∗2 , Jerry Liu∗2 , Shulin Xin∗1 , Aoyan Li∗1 , Rui Long1 , Li Chen1 ,
Jack Yang2 , Jinxiang Xia2 , Z.Y. Peng2 , Shukai Liu2 , Zhaoxiang Zhang2 , Jing Mai1 ,
Ge Zhang1,2 , Wenhao Huang1 , Kai Shen†,1 , Liang Xiang†,1 ,
Bytedance Seed1
M-A-P2
Abstract
As the capabilities of code large language models (LLMs) continue to expand, their applications across diverse code intelligence domains are rapidly increasing. However, most existing datasets only evaluate limited application domains. To address this gap, we have developed a comprehensive code evaluation dataset, FullStack Bench1,2, focusing on full-stack programming, which encompasses a wide range of application domains (e.g., basic programming, data analysis, software engineering, mathematics, and machine learning). Besides, to assess multilingual programming capabilities, FullStack Bench provides real-world instructions and corresponding unit test cases in 16 widely-used programming languages, designed to reflect real-world usage scenarios rather than simple translations. Moreover, we also release an effective code sandbox execution tool (i.e., SandboxFusion3) supporting various programming languages and packages to evaluate the performance on our FullStack Bench efficiently. Comprehensive experimental results on FullStack Bench demonstrate the necessity and effectiveness of both FullStack Bench and SandboxFusion.
Contents
1 Introduction
2 FullStack Bench
  2.1 Data Overview
  2.2 Data Construction and Quality Control
  2.3 Bilingual Benchmark Construction
  2.4 Evaluation Metrics
3 SandboxFusion
4 Experiments
  4.1 Experimental Setup
  4.2 Results and Analysis
  4.3 Analysis on the performance of different programming languages
  4.4 Scaling Laws on FullStack Bench
  4.5 Analysis on the performance of different difficulties
  4.6 Analysis on the effect of feedback from SandboxFusion
5 Related Works
6 Conclusion
7 Acknowledgements
A Appendix
  A.1 Visualization on the cases of FullStack Bench
  A.2 Details of SandboxFusion
    A.2.1 Dataset Module
    A.2.2 Sandbox Execution Module
  A.3 Comparison with Other Sandboxes
1. Introduction
Code large language models (LLMs), which are pre-trained on extensive datasets comprising billions of code-related tokens, have achieved significant improvements in code intelligence [Roziere et al., 2023, Zheng et al., 2023, Guo et al., 2024a, Hui et al., 2024, Huang et al., 2024b].
Recently, to discover the limitations of existing code LLMs and facilitate further development
of code intelligence, many code evaluation benchmark datasets (e.g., HumanEval [Chen et al.,
2021a], MBPP [Austin et al., 2021b], DS-1000 [Lai et al., 2022], xCodeEval [Khan et al., 2023])
have been proposed as shown in Figure 1.
However, as shown in Figure 1, we observe that existing benchmarks cover limited application domain types and therefore cannot assess code-related abilities in real-world code development scenarios. Specifically, in Figure 1, we sample 500k questions from the widely-used software development community StackOverflow and tag each question with an application domain label using LLMs4. Then, based on these labels, we summarize 11 mainstream application domains (e.g., Basic Programming, Software Engineering, Data Analysis), which cover about 88.1% of the problems on StackOverflow. Meanwhile, using these domain labels, we also tag four popular code evaluation datasets (i.e., HumanEval, MBPP, DS-1000, xCodeEval) and observe that these benchmarks usually focus on very limited domains.
For example, a large portion of DS-1000 (>95%) is related to data analysis and machine learning
tasks, and even the so-called multi-task benchmark xCodeEval (with code understanding,
generation, translation and retrieval tasks) mainly focuses on advanced programming and
mathematics domains.
To address the abovementioned limitation, we propose the FullStack Bench, an evaluation
set spanning multiple computer science domains and programming languages, which aims
to assess large models’ capabilities across various real-world code development scenarios. As
shown in Figure 1, when compared to existing benchmarks, our FullStack Bench covers more
application domains, which demonstrates the diversity and necessity of our FullStack Bench.
Besides, based on the analysis of StackOverflow, we observe that our FullStack Bench simulates StackOverflow well for real-world programming scenarios: the selected 11 application domains (excluding "Others") cover 94.3% of FullStack Bench and 88.1% of StackOverflow questions, respectively.
Moreover, automating the evaluation on FullStack Bench is challenging due to the various
data formats and dependencies for different application domains and programming languages.
Recently, several sandbox execution environments (e.g., DifySandbox [LangGenius, 2024], MultiPL-E [Cassano et al., 2023], MPLSandbox [Dou et al., 2024]) have been proposed. However, these sandboxes have significant limitations (e.g., they support only limited packages and programming languages) and thus cannot evaluate our FullStack Bench well. For example, front-end browsers and deep-learning packages (e.g., PyTorch [Paszke et al., 2019], TensorFlow [Abadi et al., 2015]) are not supported in these sandboxes. Besides, our FullStack Bench covers 16 programming languages (i.e., Bash, C++, C#, D, Go, HTML, Java, JavaScript, PHP, Python, R, Ruby, Rust, Scala, SQL, TypeScript), and many sandboxes do not fully support all of these languages. Therefore, we also
introduce a new execution environment (i.e., SandboxFusion) to support the evaluation on our
FullStack Bench, and the main features of SandboxFusion are as follows: (1) Supporting various
languages: our SandboxFusion supports 23 commonly-used programming languages, which
satisfies different real-world usage scenarios (e.g., front-end development, backend development,
4 Prompt: You are an expert in the field of computer programming, proficient in various programming knowledge. Below, I
will provide a set of user questions and AI assistant answers. You need to output application domain tags for this Q&A pair.
Here are some tags for reference: Mathematics, Data Analysis, Database, Desktop and Web Development.
ML training). (2) Easy-to-deploy: we only need a single server to deploy our SandboxFusion with high throughput for large model evaluation scenarios. (3) Unified multi-dataset execution environment: apart from our FullStack Bench, we additionally support 10+ widely-used code evaluation benchmarks.

Figure 2. Performance plot of tested LLMs on HumanEval and FullStack Bench.
Overall, the contributions are summarized as follows:
• We propose FullStack Bench, a multilingual full-stack code evaluation benchmark with 3374 problems spanning 11 mainstream application domains and 16 programming languages.
• We release SandboxFusion, an easy-to-deploy unified sandbox execution environment that supports 23 programming languages and 10+ widely-used code evaluation benchmarks.
• We conduct comprehensive experiments with 27 popular (code) LLMs on FullStack Bench, demonstrating the necessity and effectiveness of FullStack Bench and SandboxFusion.
2. FullStack Bench
2.1. Data Overview
As illustrated in Table 1, FullStack Bench consists of 3374 problems, where each problem includes a question, unit test cases, a reference solution, and labels. Besides, we also calculate the token lengths of the questions and reference solutions using the LLaMA3 tokenizer [Team, 2024]; the average question length is 210.2 tokens. To ensure judgment accuracy, the dataset contains 15,168 unit tests in total, with an average of 4.5 unit tests per problem. We strive to cover all error types in each language. Due to the inherent differences among languages, we ensure a balanced distribution of difficulty levels, which leads to variations in the distribution of error types across languages.

Figure 3. Overview of the data collection process of FullStack Bench.
2.2. Data Construction and Quality Control
To curate the multilingual full-stack code evaluation benchmark FullStack Bench, we employ
a comprehensive and systematic human annotation process for producing code samples of
different application domains, where meticulously pre-defined guidelines are provided to
guarantee accuracy and consistency.
As shown in Figure 3, we illustrate the overall dataset construction process. Specifically, we first collect code snippets from GitHub, code-related documents (e.g., blogs and books), and XLCoST [Zhu et al., 2022]. Then, we use LLM-based generation and human verification to produce the instruction, unit test cases, and corresponding reference solution. Besides, we also actively engage programming experts in each field to create domain-specific questions for LLMs. These questions do not involve proprietary information, but are designed to assess essential skills in the respective application domains, similar to interview questions. For example, we engaged our internal data engineering team to develop a series of data analysis questions, including data filtering, data mining, and data visualization. After obtaining the initial dataset, to improve the annotation quality, the annotators evaluate the annotated code based on three criteria: problem difficulty, ambiguity, and solvability. Furthermore, after completing their annotations, each annotator exchanges data with another annotator for cross-refining, aiming to minimize subjective bias and errors. Any discrepancies between annotators are resolved through consensus or with input from senior annotators.

Table 1. Dataset statistics of FullStack Bench.

Statistics                    Number
#Problems                     3374
Difficulty Level
  - Easy/Medium/Hard          1,466 / 1,184 / 724
Question Length
  - maximum length            1931 tokens
  - minimum length            35 tokens
  - avg length                210.2 tokens
Reference Solution Length
  - maximum length            2720 tokens
  - minimum length            4 tokens
  - avg length                153.0 tokens
Additionally, to improve the difficulty of our FullStack Bench, we follow LIME [Zhu et al., 2024a] and implement a voting method using six selected models (i.e., DeepSeek-Coder-6.7B [Guo et al., 2024a], DeepSeek-Coder-33B [Guo et al., 2024a], Qwen2.5-Coder-7B [Hui et al., 2024], LLaMA3.1-70B [Team, 2024], Claude-3.5-Sonnet5, GPT-4o [Achiam et al., 2023]) to filter out samples that can be correctly answered by all of these LLMs. Specifically, for each question, if only one model obtains the correct answer, the question is classified as a hard sample, and if five or six models obtain the correct answer, it is classified as an easy sample. Apart from the easy and hard samples, the remaining samples are classified as medium.
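To make the voting rule concrete, here is a minimal sketch of the difficulty labeling described above; the `solves` helper that runs a model's solution against the unit tests is hypothetical, and only its vote count matters.

```python
# Minimal sketch of the voting-based difficulty labeling described above.
# `solves` is a hypothetical helper returning True when `model` produces a solution
# that passes all unit tests of `question`; only the resulting vote count is used.
VOTER_MODELS = [
    "DeepSeek-Coder-6.7B", "DeepSeek-Coder-33B", "Qwen2.5-Coder-7B",
    "LLaMA3.1-70B", "Claude-3.5-Sonnet", "GPT-4o",
]

def label_difficulty(question, solves) -> str:
    votes = sum(1 for model in VOTER_MODELS if solves(model, question))
    if votes == len(VOTER_MODELS):
        return "discard"   # solvable by all six voters, filtered out of the benchmark
    if votes >= 5:
        return "easy"      # five (or six) correct voters
    if votes == 1:
        return "hard"      # exactly one correct voter
    return "medium"        # all remaining samples
```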
Moreover, to simulate the real-world usage of full-stack development, we summarize the common application domains by analyzing the question distribution on "Stackoverflow.com". As shown in Figure 4, we sample 500k questions from "Stackoverflow.com" and then prompt LLMs to label the application domain type of each question. After that, we preserve the top 11 application domains, which account for 88.1% of the questions, and group the remaining application domain types as "Others". In this way, we also prompt GPT to label the domain types of our annotated questions and generate our final FullStack Bench, where the domain types are as follows:
• Basic Programming (BP): Basic programming involves fundamental concepts and skills
to write simple computer programs. This typically includes understanding data types,
variables, control structures, functions, and basic input/output operations.
• Advanced Programming (AP): Advanced programming involves developing complex
software solutions and focuses on creating efficient, scalable, and robust applications while
implementing sophisticated algorithms, data structures, and design patterns.
• Software Engineering (SE): Software engineering covers the design, development, test-
ing, and maintenance of software systems, and includes tasks of requirements analysis,
software architecture design, coding, quality assurance, and project management.
• Data Analysis (DP): Data analysis is the cleaning, processing, and analysis of collected
data to discover meaningful patterns and relationships to make data-driven decisions.
• Mathematics (MA): Mathematical problems involve solving various problems through mathematical methods and theories, covering multiple fields such as algebra, geometry, calculus, number theory, probability, and statistics.
• Desktop and Web Development (DW): Desktop and web development encompasses a wide range of programming languages, frameworks, and tools to design, build, and maintain user-friendly interfaces and robust backend systems.
• Machine Learning (ML): Machine learning algorithms are developed to learn from data
for tasks such as classification, prediction, and pattern recognition.
• Scientific Computing (SC): Scientific computing solves complex scientific and engineering problems, encompassing numerical analysis and high-performance computing to simulate, model, and analyze phenomena across various scientific disciplines.
• DataBase (DB): Database includes tasks such as insertion, querying, updating, and dele-
tion, and these tasks are typically performed using query languages such as SQL to ensure
efficient storage and retrieval of data.
• Multimedia (MM): Multimedia involves processing and manipulating various forms of
content, including text, images, audio, and video.
• Operating System (OS): Operating system tasks include memory management, process scheduling, file system management, and device control, which aim to manage computer hardware and software resources.
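For illustration, the LLM-based domain tagging (see the prompt in footnote 4) could look like the sketch below; the OpenAI client, the model name, and the output parsing are assumptions, as the paper does not specify the exact tagging setup.

```python
# Hedged sketch of the LLM-based domain tagging described above. The client,
# model, and parsing are assumptions used only for illustration.
from openai import OpenAI

DOMAINS = [
    "Basic Programming", "Advanced Programming", "Software Engineering",
    "Data Analysis", "Mathematics", "Desktop and Web Development",
    "Machine Learning", "Scientific Computing", "DataBase", "Multimedia",
    "Operating System", "Others",
]

TAGGING_PROMPT = (
    "You are an expert in the field of computer programming, proficient in various "
    "programming knowledge. Below, I will provide a set of user questions and AI "
    "assistant answers. You need to output application domain tags for this Q&A pair. "
    "Here are some tags for reference: " + ", ".join(DOMAINS) + "."
)

def tag_domain(question: str, answer: str, model: str = "gpt-4o") -> str:
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": TAGGING_PROMPT},
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
        ],
    )
    # The returned tag text is used as the domain label for the Q&A pair.
    return response.choices[0].message.content.strip()
```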
5 https://www.anthropic.com/news/claude-3-5-sonnet
[Figure: Overview of the SandboxFusion evaluation pipeline: ① Prompt Generation, ② Model Completion, ③ Code Extraction, ④ Test Code Synthesis, ⑤ Code Execution, ⑥ Judgement, ⑦ Metric Calculation, built on a Dataset Module (supporting FullStack Bench, HumanEval, MBPP, MultiPL-E, CRUXEval, miniF2F, verilog-eval, Code Contests, etc.) and a Sandbox Execution Module.]
2.3. Bilingual Benchmark Construction
The collected questions are written in Chinese or English. We translate each Chinese problem into English and each English problem into Chinese, so that every problem has both a Chinese and an English version. Finally, FullStack Bench contains 3374/2 = 1687 Chinese problems and 1687 English problems.
2.4. Evaluation Metrics
Following HumanEval and MBPP, we directly use Pass@1 as the default evaluation metric for our proposed FullStack Bench.
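For reference, Pass@1 here is simply the fraction of problems whose single generated solution passes all unit tests; it is the n = k = 1 special case of the unbiased pass@k estimator introduced with HumanEval [Chen et al., 2021a], shown below for completeness.

```latex
% Unbiased pass@k estimator (Chen et al., 2021a): n samples per problem, c of them correct.
\mathrm{pass@}k = \mathbb{E}_{\text{problems}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right],
\qquad
\mathrm{Pass@1} = \frac{1}{|D|}\sum_{i=1}^{|D|}\mathbf{1}\left[\text{the solution for problem } i \text{ passes all unit tests}\right].
```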
3. SandboxFusion
Execution-based datasets are crucial for evaluating code generation tasks [Hendrycks et al., 2021]. Automating the evaluation of these datasets requires extracting complete code from the model's responses and executing it in a compatible environment, which is a complex task due to the varying data formats and dependencies. To facilitate the evaluation of FullStack Bench, we also propose the SandboxFusion execution environment. SandboxFusion is a unified architecture that is compatible with many datasets in addition to FullStack Bench, which makes the sandbox widely applicable for data processing, model evaluation, reinforcement learning, etc. As shown in Figure 4, the overall evaluation process of SandboxFusion usually involves the following steps:
• Prompt Generation: The system generates diverse prompts based on the original problem
specifications and evaluation paradigms (e.g., few-shot, zero-shot), enabling systematic
assessment of model capabilities.
• Model Completion: Users need to perform model completion using the generated
prompts independently, as our framework does not provide built-in inference capabilities.
While many efficient inference engines exist (e.g., vLLM, text-generation-inference), we
focus on prompt generation and evaluation.
• Code Extraction: The system extracts executable code segments from model outputs,
primarily focusing on code contained within markdown blocks.
• Test Code Synthesis: The framework combines the extracted code with predefined test
cases to create executable test programs. This process handles various language-specific
requirements, such as distributing classes across files in Java or adapting main functions
for unit testing.
• Code Execution: The system executes the synthesized code with all dependent files and
captures program output.
• Judgement: The framework assesses solution correctness based on execution results,
typically through standard unit testing frameworks where zero return values indicate
successful execution.
• Metric Calculation: The evaluation primarily focuses on pass rates across different prob-
lem instances.
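To make steps ④ to ⑥ concrete, the sketch below shows a generic version of test code synthesis, execution, and judgement for a Python problem; it is illustrative only and does not use SandboxFusion's actual API.

```python
# Illustrative sketch of test code synthesis, execution, and judgement for a Python
# problem. This is NOT SandboxFusion's API; it only mirrors the idea of concatenating
# extracted code with predefined unit tests and judging by the process exit status.
import subprocess
import tempfile

def synthesize_test_program(extracted_code: str, test_code: str) -> str:
    """Combine the model's extracted code with the predefined unit tests."""
    return extracted_code + "\n\n" + test_code

def execute_and_judge(program: str, timeout_s: float = 10.0) -> bool:
    """Run the synthesized program in a subprocess; a zero return code means all
    assertions passed, which is treated as a correct solution."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True,
                                text=True, timeout=timeout_s)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

# Usage with a hypothetical problem and assert-based unit tests.
solution = "def rev_str(s):\n    return s[::-1]\n"
tests = "assert rev_str('he') == 'eh'\nassert rev_str('') == ''\n"
print(execute_and_judge(synthesize_test_program(solution, tests)))  # True
```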
SandboxFusion mainly contains two modules: the Dataset Module and the Sandbox Execution Module. The Dataset Module is responsible for implementing various datasets and abstracting out common components for reuse. The Sandbox Execution Module focuses on executing code in different languages, controlling resource usage, and ensuring execution safety. Please see Appendix A.2 for more details of SandboxFusion and Appendix A.3 for comparisons with other sandboxes.
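As a rough illustration of the resource control handled by the Sandbox Execution Module (not SandboxFusion's actual implementation; the CPU-time and memory limits are arbitrary example values), a Linux-only sketch might look like this:

```python
# Linux-only sketch of sandbox-style resource limiting, not SandboxFusion's
# implementation. It caps CPU time and address space in the child process running
# untrusted code and enforces a wall-clock timeout.
import resource
import subprocess

def limit_resources(cpu_seconds: int = 5, memory_bytes: int = 512 * 1024 * 1024):
    """Applied in the forked child before exec: cap CPU time and memory usage."""
    resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
    resource.setrlimit(resource.RLIMIT_AS, (memory_bytes, memory_bytes))

def run_untrusted(path: str, wall_clock_s: float = 10.0) -> int:
    """Run an untrusted script with rlimits applied in the child; return its exit code
    (non-zero, including -1 on wall-clock timeout, is treated as failure)."""
    try:
        proc = subprocess.run(["python", path], preexec_fn=limit_resources,
                              capture_output=True, timeout=wall_clock_s)
        return proc.returncode
    except subprocess.TimeoutExpired:
        return -1
```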
4. Experiments
4.1. Experimental Setup
FullStack AI Coders. We select 27 popular (code) language models as full-stack AI coders and test them on FullStack Bench. For open-source models, we select AI coders from well-known and rising code LLM series, including CodeQwen1.5 [Bai et al., 2023], Qwen2.5-Coder [Hui et al., 2024], DeepSeek-Coder [Guo et al., 2024b], DeepSeek-Coder-v2 [Zhu et al., 2024b], CodeLlama [Roziere et al., 2023], Yi-Coder [Young et al., 2024], StarCoder2 [Lozhkov et al., 2024], and OpenCoder [Huang et al., 2024a]. Further, we include two open-source general LLMs, Qwen2.5 6 and Llama3.1 [Team, 2024], in the comparison. As the majority of problems in FullStack Bench are complex natural language instructions, we adopt the instruction-tuned versions of these AI coders rather than their base models. According to model size, we categorize the AI coders into five groups: 1B+, 6B+, 13B+, 20B+, and 70B+.
On the other hand, we also evaluate several prominent closed-source LLMs, including GPT-4o, OpenAI-o1, Claude, GLM4, DeepSeek-v2.5, Qwen-Max, and the upcoming Doubao-Coder-Preview. The access links of the open-source and closed-source models are listed in Table 5 and Table 6, respectively.
6 https://qwenlm.github.io/blog/qwen2.5/
Model BP AP SE DP MA DW ML SC DB MM OS Others Overall
1B+ Instruction Tuned Coder
OpenCoder-1.5B-Instruct 26.05 40.03 31.50 42.64 25.17 39.12 23.75 13.97 30.16 26.67 44.12 38.30 33.52
Qwen2.5-Coder-1.5B-Instruct 18.37 34.75 29.00 33.50 28.32 41.33 17.50 15.81 40.48 23.33 47.06 28.19 30.74
DeepSeek-Coder-1.3B-Instruct 16.74 29.91 32.50 37.06 22.73 35.54 18.75 9.19 27.78 25.00 36.76 30.32 27.65
6B+ Instruction Tuned Coder
Qwen2.5-Coder-7B-Instruct 36.51 52.06 46.00 59.39 48.95 50.00 37.50 30.51 53.17 50.00 63.24 53.19 48.16
Yi-Coder-9B-Chat 39.07 46.04 39.50 64.97 46.50 49.66 42.50 34.93 48.41 41.67 58.82 49.47 47.13
OpenCoder-8B-Instruct 39.53 49.12 38.00 55.58 36.01 45.92 27.50 26.47 47.62 46.67 45.59 45.74 43.63
DeepSeek-Coder-7B-Instruct-v1.5 38.37 45.16 36.00 57.36 35.66 47.96 30.00 30.88 46.03 53.33 45.59 44.15 43.48
DeepSeek-Coder-6.7B-Instruct 34.19 43.40 38.50 58.12 38.11 43.88 33.75 23.90 46.03 38.33 60.29 44.15 41.88
CodeQwen1.5-7B-Chat 36.74 44.87 46.00 51.78 29.72 40.82 26.25 24.26 42.06 41.67 48.53 44.68 40.52
CodeLlama-7B-Instruct 21.40 21.70 30.50 34.26 20.28 40.48 8.75 11.76 34.92 15.00 50.00 29.26 27.06
13B+ Instruction Tuned Coder
Qwen2.5-Coder-14B-Instruct 53.26 58.50 41.00 69.54 69.23 46.26 51.25 43.01 49.21 60.00 69.12 57.45 55.28
DeepSeekCoder-v2-Lite-Instruct 45.81 57.18 38.50 56.85 52.80 44.56 42.50 33.82 52.38 33.33 50.00 51.60 48.73
StarCoder2-15B-Instruct-v0.1 38.37 42.23 29.00 59.90 37.06 40.99 42.50 28.68 54.76 33.33 42.65 45.74 41.79
CodeLlama-13B-Instruct 24.88 21.41 31.00 31.47 18.18 41.67 16.25 13.24 35.71 15.00 45.59 32.45 27.59
20B+ Instruction Tuned Coder
DeepSeekCoder-v2-Instruct 52.79 63.64 43.00 71.57 75.87 47.45 46.25 52.94 53.97 51.67 63.24 59.57 58.09
Qwen2.5-Coder-32B-Instruct 51.86 60.85 43.00 73.10 69.93 47.11 55.00 44.85 56.35 61.67 61.76 60.64 56.88
DeepSeekCoder-33B-Instruct 38.37 50.59 35.50 65.99 50.00 49.49 43.75 39.71 49.21 53.33 54.41 48.40 48.61
CodeLlama-34B-Instruct 23.72 22.73 26.50 37.56 18.18 43.71 17.50 17.65 38.10 26.67 51.47 30.85 29.22
70B+ Instruction Tuned General Language Model
Qwen2.5-72B-Instruct 52.56 61.44 43.00 66.50 76.57 48.47 55.00 51.10 52.38 51.67 55.88 55.32 56.88
Llama3.1-70B-Instruct 46.51 54.69 34.50 65.48 64.69 45.24 51.25 38.60 56.35 46.67 57.35 53.72 51.45
Closed-Source API Model
OpenAI o1-preview 71.63 71.99 49.50 72.59 80.77 51.53 50.00 63.97 57.14 60.00 67.65 68.09 65.62
OpenAI o1-mini 70.23 75.66 41.50 71.07 81.47 48.47 56.25 59.19 54.76 60.00 69.12 67.55 64.73
Claude-3.5-Sonnet 61.63 65.40 53.00 71.83 77.27 48.81 53.75 63.24 58.73 68.33 64.71 68.62 62.57
GPT-4o-0806 57.21 67.60 46.00 74.37 76.92 48.47 63.75 55.88 60.32 63.33 70.59 64.89 61.77
Doubao-Coder-Preview 56.98 64.66 43.00 71.07 74.48 49.15 45.00 59.19 50.00 48.33 60.29 55.32 58.92
DeepSeek-v2.5 51.86 63.78 43.00 69.54 75.17 49.66 47.50 57.35 53.17 60.00 60.29 61.70 58.65
GLM-4-Plus 49.77 59.97 43.00 71.32 72.73 46.43 56.25 50.00 57.14 61.67 63.24 52.66 56.40
Qwen-Max 47.21 61.14 42.00 63.20 72.73 47.11 53.75 55.15 57.94 41.67 54.41 50.53 55.16
Implementation Details. For open-source coders, we pull model checkpoints from Hugging Face7 and load the model with AutoTokenizer and AutoModelForCausalLM from the transformers package8. Further, we apply vLLM [Kwon et al., 2023] to accelerate the evaluation. We set top_k to 1 and max_tokens to 2048, and keep all other settings as default. For those model versions not supported by vLLM, we call the default generate() method provided by the model and set top_k to 1 and max_new_tokens to 1024. For the input prompt, we set "You are a helpful AI assistant" as the system prompt and set each problem text in FullStack Bench as the user prompt. Then we apply the default chat template within the tokenizer and feed the tokens into the model.
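A minimal sketch of this setup with vLLM and the transformers tokenizer is shown below; the checkpoint name is only an example, and the snippet is not the authors' exact evaluation code.

```python
# Hedged sketch of the described setup: greedy-style decoding (top_k=1,
# max_tokens=2048) with vLLM, using the tokenizer's default chat template.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "Qwen/Qwen2.5-Coder-7B-Instruct"  # example checkpoint, not prescriptive

tokenizer = AutoTokenizer.from_pretrained(MODEL)
llm = LLM(model=MODEL)
params = SamplingParams(top_k=1, max_tokens=2048)

def build_prompt(problem_text: str) -> str:
    messages = [
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": problem_text},
    ]
    # Render the conversation with the model's default chat template.
    return tokenizer.apply_chat_template(messages, tokenize=False,
                                         add_generation_prompt=True)

prompts = [build_prompt("Write a Python function that reverses a string.")]
outputs = llm.generate(prompts, params)
print(outputs[0].outputs[0].text)
```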
We chat with the closed-source models via API calls similarly, where top_k is set to 1 and max_new_tokens is set to 2048, and the prompt template is kept the same. After model inference, the first Markdown code block in the corresponding programming language is extracted from the generated output. If no Markdown code block is detected, a heuristic approach is employed to identify and extract incomplete code snippets. The extracted code, combined with the predefined test cases, is then used to synthesize the complete program, which is evaluated for correctness.
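The Markdown-based extraction step can be sketched as follows; the fallback of treating the whole response as code is an assumption, since the paper does not describe its exact heuristic.

```python
# Hedged sketch of code extraction from a model response. The fallback heuristic
# (returning the raw response) is an assumption, not the paper's actual heuristic.
import re

FENCE = "`" * 3  # literal triple backtick, built this way to keep the snippet readable

def extract_first_code_block(response: str, language: str = "python") -> str:
    # Prefer a fenced block tagged with the target language.
    tagged = re.search(FENCE + language + r"\s*\n(.*?)" + FENCE, response, re.DOTALL)
    if tagged:
        return tagged.group(1).strip()
    # Otherwise take the first fenced block of any language.
    any_block = re.search(FENCE + r"[\w+-]*\s*\n(.*?)" + FENCE, response, re.DOTALL)
    if any_block:
        return any_block.group(1).strip()
    # Fallback heuristic (assumed): treat the entire response as code.
    return response.strip()

demo = "Here is a solution:\n" + FENCE + "python\ndef rev(s):\n    return s[::-1]\n" + FENCE
print(extract_first_code_block(demo))  # prints only the code inside the block
```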
7 https://huggingface.co/
8 https://pypi.org/project/transformers/