CIBench: Evaluating Your LLMs with a Code Interpreter Plugin
Songyang Zhang1,∗, Chuyu Zhang1,2,∗, Yingfan Hu∗, Haowen Shen, Kuikun Liu1,
Zerun Ma1, Fengzhe Zhou1, Wenwei Zhang1, Xuming He2, Dahua Lin1, Kai Chen1,†
1 Shanghai Artificial Intelligence Laboratory    2 ShanghaiTech University
zhangsongyang@pjlab.org.cn
Abstract
While LLM-based agents, which use external tools to solve complex problems,
have made significant progress, benchmarking their ability is challenging, thereby
hindering a clear understanding of their limitations. In this paper, we propose
an interactive evaluation framework, named CIBench, to comprehensively assess
LLMs’ ability to utilize code interpreters for data science tasks. Our evaluation
framework includes an evaluation dataset and two evaluation modes. The evaluation
dataset is constructed using an LLM-human cooperative approach and simulates an
authentic workflow by leveraging consecutive and interactive IPython sessions. The
two evaluation modes assess LLMs’ ability with and without human assistance. We
conduct extensive experiments to analyze the ability of 24 LLMs on CIBench and
provide valuable insights for future LLMs in code interpreter utilization.
1 Introduction
Empowered by the emerging abilities of Large Language Models (LLMs), numerous agents have
surfaced to harness these generalist models for utilizing external tools, such as web browsing,
document retrieval, and code interpretation, in tackling complex real-world problems. Notable
examples include ToolLLM [23], LangChain [5], AutoGPT [25], and QwenAgent [2]. Specifically,
agents equipped with a code interpreter leverage the advanced programming skills of LLMs through
a natural language interface, thereby facilitating the creation of workflows that are both effortless and
efficient. However, assessing the agents’ capability to generate executable code and effectively use
code interpreters remains a challenge, hindering a clear understanding of their limitations.
In this study, we focus on assessing the proficiency of LLMs in leveraging code interpreters to address
data science problems across several distinct domains, like data analysis, visualization, and machine
learning. These tasks necessitate that LLMs exhibit advanced capabilities in instruction following,
reasoning, and programming. Existing benchmarks, including GSM8K [8], MathBench [20], and
HumanEval [6], primarily assess models' abilities to solve mathematical or coding problems in isolation.
These benchmarks, while valuable for measuring specific competencies, fail to fully represent the
LLMs’ aptitude for building complex workflows involving code interpreters in practical applications.
More recent efforts, such as MINT [27], QwenAgent [2], CodeGen [21], and DS-1000 [16], aim
to bridge the existing gap by constructing novel benchmarks, particularly tailored to mathematical
problems and data analysis tasks. Despite considerable advancements, they either focus on single-turn
question assessments or have a limited scope in data science. Consequently, the insights they provide
only partially reflect LLMs' ability to invoke a code interpreter to solve real-world data science
problems, which typically require consecutive sessions.
To address these shortcomings, as depicted in Fig.1, we introduce a novel evaluation framework that
encompasses a benchmark with consecutive and diverse tasks, along with comprehensive assessment
protocols. This framework aims to provide a thorough evaluation of LLMs' ability to use code
interpreters in practical data science workflows.
∗ Equal contribution. Code is available at https://github.com/open-compass/CIBench
Table 1: Comparison of different datasets. "Multi-turn" denotes that LLMs can attempt to solve tasks
through multiple trials. "Code Interpreter" indicates whether the LLM needs to call a code interpreter to
solve the problem. "Consecutive" indicates that each task consists of a series of sequential questions
rather than a single question.
Dataset Topic Size Multi-turn Code Interpreter Consecutive
DS-1000 Data Science 1000 questions No No No
MINT Math, Reasoning 586 questions Yes Yes No
CodeGen Math, Data Science, etc 115 tasks No No Yes
QwenAgent Math, Data visualization 295 questions No Yes No
CIBench Data Science 234 tasks, 1900+ questions Yes Yes Yes
In summary, our main contributions are as follows:
• We build a new benchmark for agents with code interpreters using an LLM-human cooperative
method. It consists of interactive IPython sessions with interconnected questions on key data science
libraries, simulating interactive problem-solving scenarios in practical workflows.
• We devise unique assessment strategies involving both end-to-end and oracle modes. We also
introduce several evaluation metrics to assess various outputs, offering a comprehensive gauge of
LLMs’ coding prowess within the benchmark.
• We conduct thorough experiments with 24 LLMs to analyze their performance on our benchmark.
The results indicate that open-sourced LLMs are inferior to GPT-4 by a large margin and perform
poorly in the modeling category modules.
2 Related Works
CIBench is an evaluation framework that assesses the ability of LLMs [26, 2, 10, 7, 4] to utilize external
code interpreters for solving data science tasks. Therefore, we focus on presenting work related to
invoking code interpreters and benchmarks related to data science.
3 CIBench
To benchmark LLMs' ability to leverage code interpreters for addressing data science problems,
we propose a novel evaluation framework (Fig.2), which comprises a diverse evaluation dataset
and two newly devised evaluation modes. The evaluation dataset is generated through an LLM-
human cooperative approach and simulates authentic workflow scenarios for solving sequential and
interconnected tasks. Given the evaluation dataset, we adhere to the ReAct protocol [31] to generate
Fig. 2. Overview of CIBench. CIBench first selects Python modules to generate candidate topics
and then generates tasks based on these modules and the selected topic. Additionally, humans are
engaged to generate new tasks to ensure diversity and filter out incorrect questions to enhance quality.
Task Generation
Prompt:
Please create jupyter notebook experiment based on Python module {}. Please follow these
rules:
1. The experiment should be conducted in a jupyter notebook manner, but use the markdown
format.
2. The experiment should only use Python code.
3. The experiment has around 10-15 continuous steps, from the easiest to the hardest.
4. The step description should be concise.
5. The step description should be precise and contain exact parameter names and values to
instruct.
6. Each step requires Python code to solve and the executed result should be the numeric
answer, structured output, or visualized result.
7. Please use ‘matplotlib’ to visualize if necessary.
8. DO NOT have any steps to save or write any output files.
9. Please provide an input data file with an external link.
The experiment topic is {}. You should generate the experiment file without any other
statements.
reasoning traces and invoke code interpreters alternately. In addition, we allow LLMs to attempt to solve
tasks multiple times, enabling exploration of their self-debugging capabilities based on feedback
from the code interpreter. Finally, we propose two evaluation modes: the end-to-end mode and the
oracle mode, to comprehensively measure LLM’s ability with and without human interaction.
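To make this interaction protocol concrete, the sketch below shows one way a ReAct-style loop with retries could be driven. It is an illustrative sketch rather than the CIBench implementation: generate_response, extract_code, and run_in_kernel are hypothetical helpers standing in for the model call, code extraction, and execution in a persistent IPython kernel.
```python
# Minimal sketch of a ReAct-style loop with a code interpreter (not the CIBench code).
# `generate_response`, `extract_code`, and `run_in_kernel` are hypothetical helpers.

def solve_step(question, history, max_trials=3):
    """Let the LLM answer one sub-question, retrying on interpreter errors."""
    feedback = None
    for _ in range(max_trials):
        user_msg = question if feedback is None else (
            f"{question}\nInterpreter feedback:\n{feedback}")
        response = generate_response(history + [{"role": "user", "content": user_msg}])
        code = extract_code(response)        # pull the code block out of the reply
        result = run_in_kernel(code)         # execute in a persistent IPython kernel
        if result.ok:                        # ran without raising an exception
            history += [{"role": "user", "content": user_msg},
                        {"role": "assistant", "content": response},
                        {"role": "interpreter", "content": result.output}]
            return history, result
        feedback = result.traceback          # feed the error back for self-debugging
    return history, result
```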
In the following sections, we will detail the construction of the dataset in Sec.3.1 and the evaluation
modes and metrics in Sec.3.2.
End-to-End Mode
User: Load the dataset from path xxx.csv, display the column names and ...
Assistant:
```python
import pandas as pd
data = pd.read_csv(path)
```
Interpreter: NameError: name 'path' is not defined.
User: Create a scatterplot with a regression line to visualize the relationship between ...

Oracle Mode
User: Load the dataset from path xxx.csv, display the column names and ...
Assistant (ground truth):
```python
import pandas as pd
path = xxx
data = pd.read_csv(path)
data.columns
```
Interpreter: (Output of ground truth code)
User: Create a scatterplot with a regression line to visualize the relationship between ...
Fig. 4. Evaluation modes: In end-to-end mode, the LLM addresses the user’s question (bottom)
within the context of its response, while in oracle mode, it answers the user’s question (bottom) within
the context of ground truth.
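To make the two modes concrete, the fragment below sketches how the dialogue context for the next question could be assembled in each mode. It reuses the hypothetical extract_code and run_in_kernel helpers from the earlier sketch, and ground_truth_code is likewise an assumed lookup of the reference solution for the current step.
```python
# Sketch of context construction per evaluation mode (assumed helpers, not the CIBench code).

def append_step(history, model_response, mode, task, step):
    """Append one solved step to the dialogue history before asking the next question."""
    if mode == "end_to_end":
        # Keep the model's own (possibly wrong) code and the interpreter's output.
        code = extract_code(model_response)
        content = model_response            # the model's full reply stays in context
    elif mode == "oracle":
        # Substitute the ground-truth code, simulating a human correcting each step.
        code = ground_truth_code(task, step)
        content = code                      # the context is the reference solution itself
    else:
        raise ValueError(f"unknown mode: {mode}")
    output = run_in_kernel(code).output
    history += [{"role": "assistant", "content": content},
                {"role": "interpreter", "content": output}]
    return history
```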
Topic Candidates Generation After selecting modules, we prompt (Appendix B) GPT-4 to
summarize 50 topics for each module, to encapsulate the vast majority of the module’s functionalities,
thereby offering precise guidance for the subsequent generation of more targeted questions.
Tasks Generation and Refinement We sample a module and topic, then prompt GPT-4 to generate
questions and code based on the prompt in Fig.3. The prompt is designed to enable GPT-4 to
generate a Jupyter notebook with sequential steps and various outputs, including numeric answers,
structured output, and visualizations, mirroring real-world scenarios. Despite our request for concise
descriptions, the generated content may lack conciseness and continuity. To address this, we undertake
iterative refinement of these tasks. This process involves presenting both good and bad cases, along
with additional modifications, to enhance the quality of questions and reduce ambiguity. Details of
prompts used for refinement are in Appendix B.
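As an illustration of this generation step only, the snippet below sketches how a sampled module/topic pair might be turned into a candidate notebook task via the OpenAI chat API; MODULES_TOPICS and TASK_PROMPT are abbreviated stand-ins for the actual module list and the full prompt of Fig. 3, and the model name is an assumption.
```python
# Sketch of LLM-driven task generation (assumes the `openai` v1 client; the prompt
# below is an abbreviated stand-in for the full prompt shown in Fig. 3).
import random
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MODULES_TOPICS = {                       # abbreviated example module/topic pool
    "pandas": ["groupby aggregation", "missing-value handling"],
    "matplotlib": ["scatter plots", "histogram styling"],
}

TASK_PROMPT = (
    "Please create a jupyter notebook experiment based on Python module {module}. "
    "The experiment has around 10-15 continuous steps, from the easiest to the hardest. "
    "The experiment topic is {topic}."
)

def generate_task(module: str, topic: str) -> str:
    """Ask GPT-4 to draft one multi-step notebook task, to be refined later."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": TASK_PROMPT.format(module=module, topic=topic)}],
    )
    return response.choices[0].message.content

module = random.choice(list(MODULES_TOPICS))
print(generate_task(module, random.choice(MODULES_TOPICS[module])))
```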
Human Refinement Despite the excellent capabilities of LLMs, we found that the generated tasks
often focus on typical datasets, such as Titanic and Iris, which limits task diversity. Additionally,
ensuring perfect accuracy in question descriptions and result correctness is challenging.
To enhance diversity, we summarize template tasks based on the generated tasks and existing high-
quality tutorials available in various Python libraries (Fig.11). With minor modifications, these
template tasks can be adapted to a wide range of datasets. To further diversify the benchmark, we
collect new datasets through two approaches: 1) We prompt GPT-4 to generate datasets tailored to
different templates, leveraging GPT-4’s ability to flexibly specify characteristics and data attributes.
2) We incorporate the latest datasets from the past year, ensuring authenticity and diversity, and
significantly reducing the likelihood that the model has previously encountered this data.
To improve quality, we employ several experts who use a code interpreter for manual double-checking.
This approach ensures that questions are written by real users and that various factors, such as runtime,
are carefully controlled. For a comprehensive overview of the rules governing quality control, please
refer to Appendix B. The statistics of the dataset are provided in Appendix A.2.
Table 3: Main results of CIBench. Tool, Exe, Num, Text, and Vis denote the tool call rate, executable
rate, numeric accuracy, text score, and visualization score, respectively. Bold denotes the best score
among models of the same scale. Average is the mean of Num, Text, and Vis across the two modes.
Process-oriented metrics focus on the correct invocation of tools and the successful compilation
and execution of code. These metrics include the Tool Call Rate, which measures the proportion of
instances where the model correctly follows the instructions to invoke a code interpreter, and the
Executable Rate, which indicates the percentage of code that is executed without any errors.
On the other hand, output-oriented metrics focus on the outcomes of the model. These metrics
include Numeric Accuracy, which assesses the accuracy of the numerical results; Text Score, which
measures the quality of the structural text output using the Rouge metric [19]; and Visualization
Score, which evaluates the quality of visual output. Instead of using GPT-4V like Qwen-Agent, which
is expensive and ineffective, we propose using structural similarities [28] between predictions and
ground truth images as the visualization score. These metrics provide a holistic evaluation of the
LLM’s capabilities.
4 Experiments
4.1 Experiments Setup
To provide a thorough analysis, we evaluate 19 chat models, including popular open-sourced LLMs
and the private GPT-4, using the CIBench benchmark. During inference, we allow LLMs to attempt
up to 3 times. The specific versions of Python modules utilized in the code interpreter are provided in
Appendix A.1. All experiments are conducted within the OpenCompass [9] evaluation platform.
[Figure 5: six scatter plots of CIBench scores against IFEval, BBH, GSM8K, MATH, HumanEval, and MBPP. Pearson correlation coefficients range from 0.73 to 0.88, with all p-values below 1e-3.]
Fig. 5. Correlation of CIBench with other benchmarks. The small p-value (top-left) and high Pearson
correlation coefficients (title) indicate a strong correlation between CIBench and IFEval, BBH,
GSM8K, MATH, HumanEval, and MBPP. These benchmarks evaluate the instruction-following,
reasoning, and coding abilities of LLMs, respectively.
GPT-4 consistently outperforms the other models, especially in end-to-end mode, highlighting the significant potential for improvement in
current open-source models. Furthermore, larger models tend to exhibit superior performance across
various metrics, in line with established trends [3, 15, 29]. Moreover, models within the same series
(such as Mistral, InternLM, Qwen, Llama3, etc.) consistently maintain relatively stable rankings
within their respective parameter groups, underscoring the stability and efficacy of our approach.
When comparing the end-to-end mode and oracle mode, it becomes evident that the oracle mode
surpasses the end-to-end mode across all metrics for most models. This observation suggests that
LLMs can achieve better results with human interaction, hinting at a promising avenue for integrating
LLMs to assist humans in data science. The experiment demos are shown in Appendix C.
In the evaluation of CIBench, we identify four prevalent types of errors in the code generated
by the model. These errors are categorized as follows: 1) Instruction Following Errors: These
encompass instances where the model deviates from or disregards provided instructions, reflecting
a lack of adherence to specified guidelines; 2) Hallucination Errors: This category pertains to the
phenomenon of the model generating code that contains hallucinated elements, such as utilizing
undefined parameters or referencing irrelevant variables; 3) Reasoning Errors: These errors occur
when the model encounters complex problems, often resulting in logical errors in the generated
code. Such errors offer valuable insights into the model’s ability to handle intricate tasks in code
generation; 4) Code Errors: Basic errors in code generation fall under this category. While these
errors may sometimes appear trivial, they signify potential deficiencies in the model’s code-generation
process. These identified errors effectively underscore the current limitations of LLMs in terms of
their coding capabilities, providing valuable insights for the ongoing development of CIBench. Since
automatically classifying errors is challenging, we manually categorized the four types of errors in
GPT-4-1106-preview. The proportions of these error types are 31.9%, 4.3%, 40.4%, and 23.4%,
respectively. Detailed examples of these errors are presented in Appendix E.
Furthermore, to provide insight into enhancing performance on CIBench, we analyze its correlation
with existing benchmarks such as IFEval, BBH, GSM8K, MATH, HumanEval, and MBPP. These
benchmarks evaluate the instruction-following, reasoning, and coding abilities of LLMs. As shown
in Fig.5, CIBench performance exhibits a strong correlation with these benchmarks, with Pearson
correlation coefficients exceeding 0.7, particularly for HumanEval and GSM8K. This result indicates
that improvements in reasoning and coding can boost LLM performance on CIBench.
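Given per-model scores, this correlation analysis is easy to reproduce: it reduces to a Pearson test such as the one sketched below, where the score arrays are placeholders rather than the actual benchmark numbers.
```python
# Sketch of the correlation analysis; the scores below are placeholders, one entry per model.
from scipy.stats import pearsonr

cibench_scores   = [22.0, 35.5, 41.2, 48.7, 55.3, 63.1]
humaneval_scores = [18.0, 30.5, 45.0, 52.4, 66.0, 78.0]   # same model order

r, p_value = pearsonr(cibench_scores, humaneval_scores)
print(f"Pearson r = {r:.2f}, p = {p_value:.1e}")
```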
[Figure 6: Tool Call Rate, Executable Rate, Numeric Accuracy, Text Score, and Visualization Score plotted against the maximum number of trials (1-4) for DeepSeek-Coder-6.7B, Qwen1.5-14B-Chat, Mixtral-8x7B-Instruct, and InternLM2-20B-Chat.]
Fig. 6. Debug ability analysis. Max trial denotes the maximum number of trials in the ReAct protocol.
[Figure 7: scatter plot of structural similarity scores against GPT-4V scores on a subset of CIBench tasks, with a fitted line.]
[Figure 8: per-model scores on the Chinese and English versions of CIBench.]
Fig. 7. Structural similarity vs. GPT-4V scores.
Fig. 8. Chinese CIBench vs. English CIBench.
Debugging Ability Analysis   The code interpreter returns error feedback on the model's generated code. To assess the LLMs' ability to autonomously correct bugs, we vary the number
of trials. As shown in Fig.6, increasing the number of trials correlates with improvements across
all metrics. Significantly, for most LLMs, there is a notable enhancement when the number of
trials reaches two, particularly evident in metrics such as executable rate, numeric accuracy, and
visualization score. This suggests that the LLM can autonomously rectify bugs to a certain extent. In
our experiments, to balance evaluation time and performance, we set the number of trials to three.
Visualization Metric Analysis To validate the effectiveness of our proposed visualization metric,
we follow QwenAgent [2] and use GPT-4V to assess visualization scores on a subset of CIBench tasks.
The prompt is provided in Appendix D. As shown in Fig.7, despite structural similarities being derived
from low-level features, there is a strong correlation between them and GPT-4V scores, demonstrating
remarkable consistency between the two metrics. Therefore, structural similarity can serve as a
simpler and cheaper visualization metric in place of GPT-4V.
Cross Language Analysis To benchmark the LLMs’ ability in Chinese, we created a Chinese
version of CIBench by translating the human-created tasks into Chinese. This allows us to evaluate
the Code Interpreter performance in Chinese. As shown in Fig.8, we observe that: 1) most models
exhibit a slight decrease on Chinese CIBench compared to their English counterparts; 2) even the strong
DeepSeek-67B-Chat and Qwen-72B-Chat drop substantially on Chinese CIBench compared to the English
version. Further research and development efforts are necessary to address these discrepancies and
improve the performance of LLMs in multilingual scenarios.
Difficulty Analysis CIBench comprises interactive tasks, and we assume that varying numbers of
interactive steps denote distinct levels of difficulty. To examine model performance across these
difficulty levels, we decompose each task into three categories: easy (up to 2 steps), medium (2-4
steps), and difficult (more than 4 steps), and then evaluate performance accordingly. As Fig. 9 shows,
the performance of most models decreases as the number of steps increases.
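A minimal sketch of this split, under the assumed reading that each question is bucketed by its step index within a task:
```python
# Sketch of the difficulty bucketing (assumed reading: bucket by step index within a task).
def difficulty_bucket(step_index: int) -> str:
    """Map a 1-based step index to a difficulty level."""
    if step_index <= 2:
        return "easy"
    if step_index <= 4:
        return "medium"
    return "hard"

print([difficulty_bucket(s) for s in range(1, 7)])
# ['easy', 'easy', 'medium', 'medium', 'hard', 'hard']
```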
Different Category Modules Analysis We assess the capabilities of different LLMs with various
category modules (refer to Tab. 2). As shown in Fig. 10, LLMs demonstrate proficiency in tackling
mathematical and statistical tasks utilizing SciPy modules. However, open-sourced LLMs encounter
challenges when handling modeling tasks that demand advanced coding and reasoning skills. We
hope that future open-sourced LLMs will excel in modeling tasks.
Model                         Easy   Medium   Hard
Yi-6B-chat                    37.5   33.1     23.1
Llama-2-7B-chat               25.7   19.2     15.7
Qwen-7B-chat                  51.0   45.5     36.5
Vicuna-7B-v1.5-16k            39.0   35.9     23.8
DeepSeek-7B-chat              38.8   41.8     22.8
Mistral-7B-instruct-v0.2      52.3   52.4     33.0
Chatglm3-6B-32k               43.2   33.3     19.0
Mixtral-8x7B-instruct-v0.1    62.1   61.2     47.6
Internlm2-chat-7B             55.5   52.0     33.9
Qwen-14B-chat                 59.6   58.8     42.0
Vicuna-13B-v1.5-16k           46.1   41.6     27.7
Llama-2-13B-chat              34.8   23.5     18.7
Internlm2-chat-20B            51.5   59.2     45.1
Yi-34b-chat                   55.2   58.1     38.0
Llama-2-70B-chat              38.1   33.9     18.6
DeepSeek-67B-chat             60.4   63.9     43.7
Qwen-72B-chat                 60.7   66.2     50.3
Llama-3-70B-chat              62.3   67.1     42.0
Fig. 9. Performance on different difficulty levels. Easy means <2 steps, Medium means 2-4 steps, and Hard means >4 steps.
[Figure 10: per-category performance (Mathematics and Statistics, Natural Language Processing, Data Manipulation, Image Processing, Modeling, Data Visualization) for Yi-34B-Chat, Llama-3-70B-Instruct, InternLM2-20B-Chat, Qwen-72B-Chat, DeepSeek-67B-Chat, and GPT-4-1106-preview.]
Fig. 10. Performance on different categories.
Limitation Our work has two main limitations: 1) CIBench is currently limited to Python, although
it could be extended to other programming languages using a similar methodology; 2) the
evaluation metric of CIBench has limitations in measuring certain data science tasks, such as "training
a model with PyTorch" and tasks involving randomness.
5 Conclusion
We propose a novel benchmark, named CIBench, to comprehensively assess LLMs' ability to
leverage code interpreters for complex data science tasks. It includes an evaluation dataset covering
widely used Python modules in data science and two evaluation modes measuring LLMs’ ability
with and without human assistance. The evaluation dataset is constructed using an LLM-human
cooperative approach, leveraging interactive IPython sessions to simulate realistic scenarios in data
science. Thorough experimental analysis with 24 LLMs on CIBench indicates that LLMs perform
poorly in modeling category modules. Based on our experimental analysis, we offer several insights
for the future development of LLMs: enhancing the LLM’s ability to correct errors based on feedback,
improving its understanding of user intentions across multiple rounds of interactions, and, most
importantly, strengthening its reasoning capabilities.
References
[1] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David
Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis
with large language models, 2021.
[2] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin
Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu,
Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren,
Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu,
Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu,
Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang,
Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. Qwen technical report, 2023.
[3] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel
Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M.
Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz
Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec
Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.
[4] Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui
Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297,
2024.
[5] Harrison Chase. LangChain, October 2022.
[6] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto,
Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul
Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke
Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad
Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias
Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex
Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain,
William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra,
Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer,
Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech
Zaremba. Evaluating large language models trained on code, 2021.
[7] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng,
Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot
impressing gpt-4 with 90%* chatgpt quality. https://vicuna.lmsys.org, 2023.
[8] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser,
Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John
Schulman. Training verifiers to solve math word problems, 2021.
[9] OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation
models. https://github.com/open-compass/opencompass, 2023.
[10] DeepSeek-AI. Deepseek llm: Scaling open-source language models with longtermism, 2024.
[11] Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan,
and Graham Neubig. Pal: Program-aided language models. In International Conference on
Machine Learning, pages 10764–10799. PMLR, 2023.
[12] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn
Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset,
2021.
[13] Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili
Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al. Metagpt: Meta programming for
multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023.
[14] Xueyu Hu, Ziyu Zhao, Shuang Wei, Ziwei Chai, Guoyin Wang, Xuwu Wang, Jing Su, Jingjing
Xu, Ming Zhu, Yao Cheng, et al. Infiagent-dabench: Evaluating agents on data analysis tasks.
arXiv preprint arXiv:2401.05507, 2024.
[15] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child,
Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language
models, 2020.
[16] Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-
tau Yih, Daniel Fried, Sida Wang, and Tao Yu. Ds-1000: A natural and reliable benchmark
for data science code generation. In International Conference on Machine Learning, pages
18319–18345. PMLR, 2023.
[17] Chenliang Li, Hehong Chen, Ming Yan, Weizhou Shen, Haiyang Xu, Zhikai Wu, Zhicheng
Zhang, Wenmeng Zhou, Yingda Chen, Chen Cheng, Hongzhu Shi, Ji Zhang, Fei Huang, and
Jingren Zhou. Modelscope-agent: Building your customizable agent system with open-source
large language models, 2023.
[18] Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard
Ghanem. Camel: Communicative agents for "mind" exploration of large scale language model
society. arXiv preprint arXiv:2303.17760, 2023.
[19] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization
branches out, pages 74–81, 2004.
[20] Hongwei Liu, Zilong Zheng, Yuxuan Qiao, Haodong Duan, Zhiwei Fei, Fengzhe Zhou, Wenwei
Zhang, Songyang Zhang, Dahua Lin, and Kai Chen. Mathbench: Evaluating the theory and
application proficiency of llms with a hierarchical mathematics benchmark. arXiv preprint
arXiv:2405.12209, 2024.
[21] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese,
and Caiming Xiong. Codegen: An open large language model for code with multi-turn program
synthesis. arXiv preprint arXiv:2203.13474, 2022.
[22] Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu,
and Maosong Sun. Communicative agents for software development. arXiv preprint
arXiv:2307.07924, 2023.
[23] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong,
Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou,
Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. Toolllm: Facilitating large language
models to master 16000+ real-world apis, 2023.
[24] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettle-
moyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach
themselves to use tools. arXiv preprint arXiv:2302.04761, 2023.
[26] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei,
Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open
foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[27] Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, and Heng Ji.
Mint: Evaluating llms in multi-turn interaction with tools and language feedback, 2023.
[28] Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from
error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612,
2004.
[29] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani
Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto,
Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language
models, 2022.
[30] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li,
Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via
multi-agent conversation framework. arXiv preprint arXiv:2308.08155, 2023.
[31] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan
Cao. React: Synergizing reasoning and acting in language models, 2023.
[32] Huaiyuan Ying, Shuo Zhang, Linyang Li, Zhejian Zhou, Yunfan Shao, Zhaoye Fei, Yichuan
Ma, Jiawei Hong, Kuikun Liu, Ziyi Wang, et al. Internlm-math: Open math large language
models toward verifiable reasoning. arXiv preprint arXiv:2402.06332, 2024.
[33] Aojun Zhou, Ke Wang, Zimu Lu, Weikang Shi, Sichun Luo, Zipeng Qin, Shaoqing Lu, Anya
Jia, Linqi Song, Mingjie Zhan, et al. Solving challenging math word problems using gpt-4 code
interpreter with code-based self-verification. arXiv preprint arXiv:2308.07921, 2023.
[34] Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, and Chao Zhang. Toolqa: A dataset for llm
question answering with external tools. arXiv preprint arXiv:2306.13304, 2023.
A Dataset Details
A.1 Module Version Settings
CIBench comprises generation tasks, template tasks, and Chinese template tasks, which produce
three types of output: numerical, text, and visualization. The statistics of CIBench are shown in
Tab.5.
Question Refinement The prompts used for question refinement are shown in Fig. 14, 15.
Quality Control Rules We include manual quality checking in the final steps due to limitations
in the LLMs’ ability to control runtime or file size. The rules used for quality control are shown in
Fig.16.
E Dataset Error Analysis
In the evaluation of CIBench, we identify four prevalent types of errors in the code generated by the
model. These errors are categorized as follows: 1) Instruction Following Errors (Fig.18): These
encompass instances where the model deviates from or disregards provided instructions, reflecting a
lack of adherence to specified guidelines; 2) Hallucination Errors (Fig.19): This category pertains to
the phenomenon of the model generating code that contains hallucinated elements, such as utilizing
undefined parameters or referencing irrelevant variables; 3) Reasoning Errors (Fig.20): These
errors occur when the model encounters complex problems, often resulting in logical errors in the
generated code. Such errors offer valuable insights into the model’s ability to handle intricate tasks in
code generation; 4) Code Errors (Fig.21): Basic errors in code generation fall under this category.
While these errors may sometimes appear trivial, they signify potential deficiencies in the model’s
code-generation process. These identified errors effectively underscore the current limitations of
LLMs in terms of their Code Interpreter capabilities, providing valuable insights for the ongoing
development of CIBench.
F Human Annotator
In our paper, the authors also serve as annotators for CIBench. These annotators are tasked with
creating template tasks for various data science challenges, ensuring coverage of a wide range of
concepts in the field. Importantly, our data collection process for these tasks is unbiased and does not
involve any private information.
G Ethical Consideration
We use GPT-4 and online resources to construct our benchmark. The benchmarks are carefully
processed by experts to exclude any private information. Additionally, we utilize ChatGPT for text
refinement and to correct any typographical errors during the writing process.
Fig. 11. Example of successful template task. The penguins dataset and the specific attribute
"flipper_length_mm" can be substituted with alternative datasets and columns. Such a template task
enriches the diversity of CIBench.
Fig. 12. Example of successful template task, continuing Fig. 11.
Topic Generation
Prompt:
Please create 50 different topics base on Python module {}. These topics can cover the
majority of functionality of this module.
Question Refinement
Prompt:
Based on the experiment document and each step, please refine these steps into specific
instructions for code writing to solve a real world question.
Provide parameter settings if needed, ensuring that the execution results are identical for any
reasonable person attempting to write the code.
Please refrain from directly mentioning the name of the API, just target on purpose and give
parameter settings.
Please also summarize the required external dataset if applicable.
1. "Remove the ’Cabin’ column from the dataframe due to its numerous missing values.
Make sure to specify that you are removing a column and that the change should be
permanent."
Too wordy, "Remove the ’Cabin’ column and make the change permanent." should be fine.
2. "Identify and count how many values are missing in each column of the dataframe. Use a
chain of two functions: the first checks if each value is null, and the second adds up these
null instances for each column."
Too specific, "Count how many values are missing in each column of the dataframe. Print the
count." should be fine.
1. "Convert the ’Sex’ column to numerical values. Map ’male’ to 0 and ’female’ to 1."
Concise, clear instruction.
Your output format should be as follows starting with import necessary libraries:
[Optional] Dataset Link:
[Optional] Dataset Description:
Step 1. xx
# code blcok
Step 2. xx
# code blcok
...
Another Question Refinement
Prompt:
Given the above experiment document. Do the following modification:
1. Remove all the installation steps.
2. Remove all the file saving steps.
3. Elaborate the steps to be more specific with number and inputs that the execution results
are identical for any reasonable person attempting to solve this step.
4. Reorder the steps.
GPT-4V Visualization Scoring
Prompt:
A score of 1 to 2 is given when the image cannot reflect the given instruction content at all.
A score of 3 to 4 is given when the image can partly reflect the given instruction content, but
there is a significant lack of specific detail expression.
If the image basically meets the given instructions, but the aesthetic quality of the image is
average without any outstanding features, this kind of rating can get a score of 5 to 6.
When the image matches the given instructions well, and shows some aesthetic appeal, such
as some originality in color matching and shape design, this kind of rating can get a score of
7 to 8.
Only when the image completely matches the given instructions, covers all the details in the
instructions, and performs excellently in terms of aesthetics, can this kind of rating get a
score of 9 to 10.
[Visualization Problem]:{question}
Instruction following error
User:
Detect corners using the Shi-Tomasi corner detector with max-corners=50, min-distance=0.5
and blocksize=10, mark the corners with circles on the image. The radius and thickness of
the circle are 5 and 1. Show the marked image. Please use opencv and numpy modules.
Assistant:
# Detect corners using Shi-Tomasi corner detector
corners = cv2.goodFeaturesToTrack(equalized_image,
                                  maxCorners=max_corners,
                                  qualityLevel=quality_level,
                                  minDistance=min_distance,
                                  blockSize=block_size)
# Mark the corners with circles on the image
marked_image = equalized_image.copy()
for corner in corners:
    x, y = corner.ravel()
    cv2.circle(marked_image, (int(x), int(y)), 5, (255, 0, 0), -1)
    # Draw a blue filled circle at each corner
System:
NameError                                 Traceback (most recent call last)
Cell In[6], line 2
      1 # Detect corners using Shi-Tomasi corner detector
----> 2 corners = cv2.goodFeaturesToTrack(equalized_image, maxCorners=max_corners,
        qualityLevel=quality_level, minDistance=min_distance, blockSize=block_size)
      4 # Mark the corners with circles on the image
      5 marked_image = equalized_image.copy()
User:
Define the CrossEntropyLoss function and the Adam optimizer. Train the model for 3
epochs. Display the loss for the last epoch (keep to two decimal places).
Assistant:
import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

epochs = 3
for epoch in range(epochs):
    model.train()
    running_loss = 0.0
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
System:
933.55
933.45
933.09
Hallucination error
User:
Build a vocabulary of all the tokens and assign an index to the vocabulary. Display the index
of the word "happy". Please use nltk modules.
Assistant:
# Check the vocabulary
vocab
# Display the index of the word "happy"
word_to_index.get('happy', 'Word not found in vocabulary')
System:
NameError                                 Traceback (most recent call last)
Cell In[2], line 2
      1 # Check the vocabulary
----> 2 vocab
      3 # Display the index of the word "happy"
      4 word_to_index.get('happy', 'Word not found in vocabulary')
User:
Define a LightGBM model with max_depth=4, n_estimators=120, learning_rate=0.01 and
num_leaves=31. Train the model with Evaluation Metric='logloss'. Please use LightGBM
modules.
Assistant:
import lightgbm as lgb
System:
TypeError                                 Traceback (most recent call last)
Cell In[3], line 7
      4 model = lgb.LGBMClassifier(max_depth=4, n_estimators=120,
        learning_rate=0.01, num_leaves=31)
      6 # Train the model with evaluation metric as 'logloss'
----> 7 model.fit(X_train, y_train, eval_set=(X_test, y_test),
        eval_metric='logloss', early_stopping_rounds=20)
Reasoning error
User:
Split the dataset into training and testing sets using 0.2 as the test size, then define the
train_loader and test_loader. Set batch size as 64. Please use pytorch and sklearn modules.
Assistant:
from sklearn.model_selection import train_test_split
import torch
from torch.utils.data import DataLoader, TensorDataset

batch_size = 64
train_loader = create_data_loader(X_train, y_train, batch_size)
test_loader = create_data_loader(X_test, y_test, batch_size)
System:
TypeError                                 Traceback (most recent call last)
Cell In[4], line 19
     16     return data_loader
     18 batch_size = 64
---> 19 train_loader = create_data_loader(X_train, y_train, batch_size)
     20 test_loader = create_data_loader(X_test, y_test, batch_size)
TypeError: can't convert np.ndarray of type numpy.object_. The only supported types
are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8,
uint8, and bool.
User:
Group and aggregate data by weight and calculate the average of each numerical group. Find
out how many attributes have a negative correlation with weight. Please use pandas modules.
Assistant:
# Group and aggregate data by weight and calculate the average of each numerical group
grouped_data = df.groupby('Weight').mean()
System:
KeyError                                  Traceback (most recent call last)
File anaconda3/envs/prod-torch1.13/lib/python3.10/site-packages/pandas/core/indexes/base.py:3802,
in Index.get_loc(self, key, method, tolerance)
   3801 try:
-> 3802     return self._engine.get_loc(casted_key)
   3803 except KeyError as err:
File anaconda3/envs/prod-torch1.13/lib/python3.10/site-packages/pandas/_libs/index.pyx:138,
in pandas._libs.index.IndexEngine.get_loc()
File anaconda3/envs/prod-torch1.13/lib/python3.10/site-packages/pandas/_libs/index.pyx:165,
in pandas._libs.index.IndexEngine.get_loc()
KeyError: 'Weight'
Code error
System:
ValueError                                Traceback (most recent call last)
File anaconda3/envs/prod-torch1.13/lib/python3.10/site-packages/matplotlib/axes/_axes.py:4372,
in Axes._parse_scatter_color_args(c, edgecolors, kwargs, xsize, get_next_color_func)
   4371 try:
-> 4372     mcolors.to_rgba_array(kwcolor)
   4373 except ValueError as err:
File anaconda3/envs/prod-torch1.13/lib/python3.10/site-packages/matplotlib/colors.py:471,
in to_rgba_array(c, alpha)
    470 if isinstance(c, str):
--> 471     raise ValueError(f"{c!r} is not a valid color value.")
    473 if len(c) == 0:
User:
Blur the initial image by applying a Gaussian filter with a sigma value of 3. Sigma is the
standard deviation for the Gaussian kernel, which controls the amount of blur. Then display
the smoothed image.
Assistant:
smooth = ndimage.gaussian_filter(image, sigma=3)
plt.imshow(smooth)
plt.show()
System:
NameError                                 Traceback (most recent call last)
Cell In[48], line 2
      1 smooth = ndimage.gaussian_filter(image, sigma=3)
----> 2 plt.imshow(smooth)
      3 plt.show()