CIBench: Evaluating Your LLMs with a Code Interpreter Plugin
Songyang Zhang1,∗, Chuyu Zhang1,2,∗, Yingfan Hu∗, Haowen Shen, Kuikun Liu1,
Zerun Ma1, Fengzhe Zhou1, Wenwei Zhang1, Xuming He2, Dahua Lin1, Kai Chen1,†
1 Shanghai Artificial Intelligence Laboratory    2 ShanghaiTech University
zhangsongyang@pjlab.org.cn
Abstract
While LLM-based agents, which use external tools to solve complex problems,
have made significant progress, benchmarking their ability is challenging, thereby
hindering a clear understanding of their limitations. In this paper, we propose
an interactive evaluation framework, named CIBench, to comprehensively assess
LLMs’ ability to utilize code interpreters for data science tasks. Our evaluation
framework includes an evaluation dataset and two evaluation modes. The evaluation
dataset is constructed using an LLM-human cooperative approach and simulates an
authentic workflow by leveraging consecutive and interactive IPython sessions. The
two evaluation modes assess LLMs’ ability with and without human assistance. We
conduct extensive experiments to analyze the ability of 24 LLMs on CIBench and
provide valuable insights for future LLMs in code interpreter utilization.
1 Introduction
Empowered by the emerging abilities of Large Language Models (LLMs), numerous agents have
surfaced to harness these generalist models for utilizing external tools, such as web browsing,
document retrieval, and code interpretation, in tackling complex real-world problems. Notable
examples include ToolLLM [23], LangChain [5], AutoGPT [25], and QwenAgent [2]. Specifically,
agents equipped with a code interpreter leverage the advanced programming skills of LLMs through
a natural language interface, thereby facilitating the creation of workflows that are both effortless and
efficient. However, assessing the agents’ capability to generate executable code and effectively use
code interpreters remains a challenge, hindering a clear understanding of their limitations.
In this study, we focus on assessing the proficiency of LLMs in leveraging code interpreters to address
data science problems across several distinct domains, like data analysis, visualization, and machine
learning. These tasks necessitate that LLMs exhibit advanced capabilities in instruction following,
reasoning, and programming. Existing benchmarks, including GSM8K [8], MathBench [20], and
HumanEval [6], primarily assess models' abilities to solve mathematical or coding problems in isolation.
These benchmarks, while valuable for measuring specific competencies, fail to fully represent the
LLMs’ aptitude for building complex workflows involving code interpreters in practical applications.
More recent efforts, such as MINT [27], QwenAgent [2], CodeGen [21], and DS-1000 [16], aim
to bridge the existing gap by constructing novel benchmarks, particularly tailored to mathematical
problems and data analysis tasks. Despite considerable advancements, they either focus on single-turn
question assessments or have a limited scope in data science. Consequently, the insights they provide
only partially reflect LLMs' ability to invoke a code interpreter to solve real-world data science
problems, which typically require consecutive sessions.
To address these shortcomings, as depicted in Fig.1, we introduce a novel evaluation framework that
encompasses a benchmark with consecutive and diverse tasks, along with comprehensive assessment
protocols. This framework aims to provide a thorough evaluation of LLMs' ability to use code
interpreters in practical data science workflows.
∗ Equal contribution. Code is available at https://github.com/open-compass/CIBench
Table 1: Comparison of different datasets. "Multi-turn" denotes that LLMs can attempt to solve tasks
through multiple trials. "Code Interpreter" indicates whether the LLM needs to call a code interpreter to
solve the problem. "Consecutive" indicates that each task consists of a series of sequential questions
rather than a single question.
Dataset Topic Size Multi-turn Code Interpreter Consecutive
DS-1000 Data Science 1000 questions No No No
MINT Math, Reasoning 586 questions Yes Yes No
CodeGen Math, Data Science, etc 115 tasks No No Yes
QwenAgent Math, Data visualization 295 questions No Yes No
CIBench Data Science 234 tasks, 1900+ questions Yes Yes Yes
In summary, our main contributions are as follows:
• We build a new benchmark for agents with code interpreters using an LLM-human cooperative
method. It consists of interactive IPython sessions with interconnected questions on key data science
libraries, simulating interactive problem-solving scenarios in practical workflows.
• We devise unique assessment strategies involving both end-to-end and oracle modes. We also
introduce several evaluation metrics to assess various outputs, offering a comprehensive gauge of
LLMs’ coding prowess within the benchmark.
• We conduct thorough experiments with 24 LLMs to analyze their performance on our benchmark.
The results indicate that open-sourced LLMs are inferior to GPT-4 by a large margin and perform
poorly in the modeling category modules.
2 Related Works
CIBench is an evaluation framework that assesses the ability of LLMs [26, 2, 10, 7, 4] to utilize external
code interpreters for solving data science tasks. Therefore, we focus on presenting work related to
invoking code interpreters and benchmarks related to data science.
3 CIBench
To benchmark LLMs' ability to leverage code interpreters for addressing data science problems,
we propose a novel evaluation framework (Fig.2), which comprises a diverse evaluation dataset
and two newly devised evaluation modes. The evaluation dataset is generated through an LLM-
human cooperative approach and simulates authentic workflow scenarios for solving sequential and
interconnected tasks. Given the evaluation dataset, we adhere to the ReAct protocol [31] to generate
Fig. 2. Overview of CIBench. CIBench first selects Python modules to generate candidate topics
and then generates tasks based on these modules and the selected topic. Additionally, humans are
engaged to generate new tasks to ensure diversity and filter out incorrect questions to enhance quality.
Task Generation
Prompt:
Please create jupyter notebook experiment based on Python module {}. Please follow these
rules:
1. The experiment should be conducted in a jupyter notebook manner, but use the markdown
format.
2. The experiment should only use Python code.
3. The experiment has around 10-15 continuous steps, from the easiest to the hardest.
4. The step description should be concise.
5. The step description should be precise and contain exact parameter names and values to
instruct.
6. Each step requires Python code to solve and the executed result should be the numeric
answer, structured output, or visualized result.
7. Please use ‘matplotlib’ to visualize if necessary.
8. DO NOT have any steps to save or write any output files.
9. Please provide an input data file with an external link.
The experiment topic is {}. You should generate the experiment file without any other
statements.
reasoning traces and invoke code interpreters alternately. In addition, we allow LLMs to attempt to solve
tasks multiple times, enabling exploration of their self-debugging capabilities based on feedback
from the code interpreter. Finally, we propose two evaluation modes: the end-to-end mode and the
oracle mode, to comprehensively measure LLM’s ability with and without human interaction.
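To make this interaction protocol concrete, the sketch below shows one way a ReAct-style loop with retries could be driven. It is an illustrative sketch rather than the CIBench implementation: generate_response, extract_code, and run_in_kernel are hypothetical helpers standing in for the model call, code extraction, and execution in a persistent IPython kernel.
```python
# Minimal sketch of a ReAct-style loop with a code interpreter (not the CIBench code).
# `generate_response`, `extract_code`, and `run_in_kernel` are hypothetical helpers.

def solve_step(question, history, max_trials=3):
    """Let the LLM answer one sub-question, retrying on interpreter errors."""
    feedback = None
    for _ in range(max_trials):
        user_msg = question if feedback is None else (
            f"{question}\nInterpreter feedback:\n{feedback}")
        response = generate_response(history + [{"role": "user", "content": user_msg}])
        code = extract_code(response)        # pull the code block out of the reply
        result = run_in_kernel(code)         # execute in a persistent IPython kernel
        if result.ok:                        # ran without raising an exception
            history += [{"role": "user", "content": user_msg},
                        {"role": "assistant", "content": response},
                        {"role": "interpreter", "content": result.output}]
            return history, result
        feedback = result.traceback          # feed the error back for self-debugging
    return history, result
```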
In the following sections, we will detail the construction of the dataset in Sec.3.1 and the evaluation
modes and metrics in Sec.3.2.
End-to-End Mode
User: Load the dataset from path xxx.csv, display the column names and ...
Assistant:
```python
import pandas as pd
data = pd.read_csv(path)
```
Interpreter: NameError: name 'path' is not defined.
User: Create a scatterplot with a regression line to visualize the relationship between ...

Oracle Mode
User: Load the dataset from path xxx.csv, display the column names and ...
Assistant (ground truth):
```python
import pandas as pd
path = xxx
data = pd.read_csv(path)
data.columns
```
Interpreter: (Output of ground truth code)
User: Create a scatterplot with a regression line to visualize the relationship between ...
Fig. 4. Evaluation modes: In end-to-end mode, the LLM addresses the user’s question (bottom)
within the context of its response, while in oracle mode, it answers the user’s question (bottom) within
the context of ground truth.
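To make the two modes concrete, the fragment below sketches how the dialogue context for the next question could be assembled in each mode. It reuses the hypothetical extract_code and run_in_kernel helpers from the earlier sketch, and ground_truth_code is likewise an assumed lookup of the reference solution for the current step.
```python
# Sketch of context construction per evaluation mode (assumed helpers, not the CIBench code).

def append_step(history, model_response, mode, task, step):
    """Append one solved step to the dialogue history before asking the next question."""
    if mode == "end_to_end":
        # Keep the model's own (possibly wrong) code and the interpreter's output.
        code = extract_code(model_response)
        content = model_response            # the model's full reply stays in context
    elif mode == "oracle":
        # Substitute the ground-truth code, simulating a human correcting each step.
        code = ground_truth_code(task, step)
        content = code                      # the context is the reference solution itself
    else:
        raise ValueError(f"unknown mode: {mode}")
    output = run_in_kernel(code).output
    history += [{"role": "assistant", "content": content},
                {"role": "interpreter", "content": output}]
    return history
```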
Topic Candidates Generation After selecting modules, we prompt (Appendix B) GPT-4 to
summarize 50 topics for each module, to encapsulate the vast majority of the module’s functionalities,
thereby offering precise guidance for the subsequent generation of more targeted questions.
Tasks Generation and Refinement We sample a module and topic, then prompt GPT-4 to generate
questions and code based on the prompt in Fig.3. The prompt is designed to enable GPT-4 to
generate a Jupyter notebook with sequential steps and various outputs, including numeric answers,
structured output, and visualizations, mirroring real-world scenarios. Despite our request for concise
descriptions, the generated content may lack conciseness and continuity. To address this, we undertake
iterative refinement of these tasks. This process involves presenting both good and bad cases, along
with additional modifications, to enhance the quality of questions and reduce ambiguity. Details of
prompts used for refinement are in Appendix B.
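As an illustration of this generation step only, the snippet below sketches how a sampled module/topic pair might be turned into a candidate notebook task via the OpenAI chat API; MODULES_TOPICS and TASK_PROMPT are abbreviated stand-ins for the actual module list and the full prompt of Fig. 3, and the model name is an assumption.
```python
# Sketch of LLM-driven task generation (assumes the `openai` v1 client; the prompt
# below is an abbreviated stand-in for the full prompt shown in Fig. 3).
import random
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MODULES_TOPICS = {                       # abbreviated example module/topic pool
    "pandas": ["groupby aggregation", "missing-value handling"],
    "matplotlib": ["scatter plots", "histogram styling"],
}

TASK_PROMPT = (
    "Please create a jupyter notebook experiment based on Python module {module}. "
    "The experiment has around 10-15 continuous steps, from the easiest to the hardest. "
    "The experiment topic is {topic}."
)

def generate_task(module: str, topic: str) -> str:
    """Ask GPT-4 to draft one multi-step notebook task, to be refined later."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": TASK_PROMPT.format(module=module, topic=topic)}],
    )
    return response.choices[0].message.content

module = random.choice(list(MODULES_TOPICS))
print(generate_task(module, random.choice(MODULES_TOPICS[module])))
```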
Human Refinement Despite the excellent capabilities of LLMs, we found that the generated tasks
often focus on typical datasets, such as Titanic and Iris, which limits task diversity. Additionally,
ensuring perfect accuracy in question descriptions and result correctness is challenging.
To enhance diversity, we summarize template tasks based on the generated tasks and existing high-
quality tutorials available in various Python libraries (Fig.11). With minor modifications, these
template tasks can be adapted to a wide range of datasets. To further diversify the benchmark, we
collect new datasets through two approaches: 1) We prompt GPT-4 to generate datasets tailored to
different templates, leveraging GPT-4’s ability to flexibly specify characteristics and data attributes.
2) We incorporate the latest datasets from the past year, ensuring authenticity and diversity, and
significantly reducing the likelihood that the model has previously encountered this data.
To improve quality, we employ several experts who use a code interpreter for manual double-checking.
This approach ensures that questions are written by real users and that various factors, such as runtime,
are carefully controlled. For a comprehensive overview of the rules governing quality control, please
refer to Appendix B. The statistics of the dataset are provided in Appendix A.2.
Table 3: Main results of CIBench. Tool, Exe, Num, Text, and Vis denote the tool call rate, executable
rate, numeric accuracy, text score, and visualization score, respectively. Bold denotes the best score
among models of the same scale. Average is the mean of Num, Text, and Vis across the two modes.
Process-oriented metrics focus on the correct invocation of tools and the successful compilation
and execution of code. These metrics include the Tool Call Rate, which measures the proportion of
instances where the model correctly follows the instructions to invoke a code interpreter, and the
Executable Rate, which indicates the percentage of code that is executed without any errors.
On the other hand, output-oriented metrics focus on the outcomes of the model. These metrics
include Numeric Accuracy, which assesses the accuracy of the numerical results; Text Score, which
measures the quality of the structural text output using the Rouge metric [19]; and Visualization
Score, which evaluates the quality of visual output. Instead of using GPT-4V like Qwen-Agent, which
is expensive and ineffective, we propose using structural similarities [28] between predictions and
ground truth images as the visualization score. These metrics provide a holistic evaluation of the
LLM’s capabilities.
4 Experiments
4.1 Experiments Setup
To provide a thorough analysis, we evaluate 19 chat models, including popular open-sourced LLMs
and the private GPT-4, using the CIBench benchmark. During inference, we allow LLMs to attempt
up to 3 times. The specific versions of Python modules utilized in the code interpreter are provided in
Appendix A.1. All experiments are conducted within the OpenCompass [9] evaluation platform.
[Figure 5: six scatter plots of CIBench scores against IFEval, BBH, GSM8K, MATH, HumanEval, and MBPP. Pearson correlation coefficients range from 0.73 to 0.88, with all p-values below 1e-3.]
Fig. 5. Correlation of CIBench with other benchmarks. The small p-value (top-left) and high Pearson
correlation coefficients (title) indicate a strong correlation between CIBench and IFEval, BBH,
GSM8K, MATH, HumanEval, and MBPP. These benchmarks evaluate the instruction-following,
reasoning, and coding abilities of LLMs, respectively.
GPT-4 consistently outperforms the other models, especially in end-to-end mode, highlighting the significant potential for improvement in
current open-source models. Furthermore, larger models tend to exhibit superior performance across
various metrics, in line with established trends [3, 15, 29]. Moreover, models within the same series
(such as Mistral, InternLM, Qwen, Llama3, etc.) consistently maintain relatively stable rankings
within their respective parameter groups, underscoring the stability and efficacy of our approach.
When comparing the end-to-end mode and oracle mode, it becomes evident that the oracle mode
surpasses the end-to-end mode across all metrics for most models. This observation suggests that
LLMs can achieve better results with human interaction, hinting at a promising avenue for integrating
LLMs to assist humans in data science. The experiment demos are shown in Appendix C.
In the evaluation of CIBench, we identify four prevalent types of errors in the code generated
by the model. These errors are categorized as follows: 1) Instruction Following Errors: These
encompass instances where the model deviates from or disregards provided instructions, reflecting
a lack of adherence to specified guidelines; 2) Hallucination Errors: This category pertains to the
phenomenon of the model generating code that contains hallucinated elements, such as utilizing
undefined parameters or referencing irrelevant variables; 3) Reasoning Errors: These errors occur
when the model encounters complex problems, often resulting in logical errors in the generated
code. Such errors offer valuable insights into the model’s ability to handle intricate tasks in code
generation; 4) Code Errors: Basic errors in code generation fall under this category. While these
errors may sometimes appear trivial, they signify potential deficiencies in the model’s code-generation
process. These identified errors effectively underscore the current limitations of LLMs in terms of
their coding capabilities, providing valuable insights for the ongoing development of CIBench. Since
automatically classifying errors is challenging, we manually categorized the four types of errors in
GPT-4-1106-preview. The proportions of these error types are 31.9%, 4.3%, 40.4%, and 23.4%,
respectively. Detailed examples of these errors are presented in Appendix E.
Furthermore, to provide insight into enhancing performance on CIBench, we analyze its correlation
with existing benchmarks such as IFEval, BBH, GSM8K, MATH, HumanEval, and MBPP. These
benchmarks evaluate the instruction-following, reasoning, and coding abilities of LLMs. As shown
in Fig.5, CIBench performance exhibits a strong correlation with these benchmarks, with Pearson
correlation coefficients exceeding 0.7, particularly for HumanEval and GSM8K. This result indicates
that improvements in reasoning and coding can boost LLM performance on CIBench.
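Given per-model scores, this correlation analysis is easy to reproduce: it reduces to a Pearson test such as the one sketched below, where the score arrays are placeholders rather than the actual benchmark numbers.
```python
# Sketch of the correlation analysis; the scores below are placeholders, one entry per model.
from scipy.stats import pearsonr

cibench_scores   = [22.0, 35.5, 41.2, 48.7, 55.3, 63.1]
humaneval_scores = [18.0, 30.5, 45.0, 52.4, 66.0, 78.0]   # same model order

r, p_value = pearsonr(cibench_scores, humaneval_scores)
print(f"Pearson r = {r:.2f}, p = {p_value:.1e}")
```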
[Figure 6: Tool Call Rate, Executable Rate, Numeric Accuracy, Text Score, and Visualization Score plotted against the maximum number of trials (1-4) for DeepSeek-Coder-6.7B, Qwen1.5-14B-Chat, Mixtral-8x7B-Instruct, and InternLM2-20B-Chat.]
Fig. 6. Debug ability analysis. Max trial denotes the maximum number of trials in the ReAct protocol.
[Figure 7: scatter plot of structural similarity scores against GPT-4V scores on a subset of CIBench tasks, with a fitted line.]
[Figure 8: per-model scores on the Chinese and English versions of CIBench.]
Fig. 7. Structural similarity vs. GPT-4V scores.
Fig. 8. Chinese CIBench vs. English CIBench.
Debugging Ability Analysis   The code interpreter returns error feedback on the model's generated code. To assess the LLMs' ability to autonomously correct bugs, we vary the number
of trials. As shown in Fig.6, increasing the number of trials correlates with improvements across
all metrics. Significantly, for most LLMs, there is a notable enhancement when the number of
trials reaches two, particularly evident in metrics such as executable rate, numeric accuracy, and
visualization score. This suggests that the LLM can autonomously rectify bugs to a certain extent. In
our experiments, to balance evaluation time and performance, we set the number of trials to three.
Visualization Metric Analysis To validate the effectiveness of our proposed visualization metric,
we follow QwenAgent [2] and use GPT-4V to assess visualization scores on a subset of CIBench tasks.
The prompt is provided in Appendix D. As shown in Fig.7, despite structural similarities being derived
from low-level features, there is a strong correlation between them and GPT-4V scores, demonstrating
remarkable consistency between the two metrics. Therefore, structural similarity can serve as a
simpler and cheaper visualization metric in place of GPT-4V.
Cross Language Analysis To benchmark the LLMs’ ability in Chinese, we created a Chinese
version of CIBench by translating the human-created tasks into Chinese. This allows us to evaluate
the Code Interpreter performance in Chinese. As shown in Fig.8, we observe that: 1) most models
exhibit a slight decrease on Chinese CIBench compared to their English counterparts; 2) even the strong
DeepSeek-67B-Chat and Qwen-72B-Chat drop substantially on Chinese CIBench compared to the English
version. Further research and development efforts are necessary to address these discrepancies and
improve the performance of LLMs in multilingual scenarios.
Difficulty Analysis CIBench comprises interactive tasks, and we assume that varying numbers of
interactive steps denote distinct levels of difficulty. To examine model performance across these
difficulty levels, we decompose each task into three categories: easy (up to 2 steps), medium (2-4
steps), and difficult (more than 4 steps), and then evaluate performance accordingly. As Fig. 9 shows,
the performance of most models decreases as the number of steps increases.
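A minimal sketch of this split, under the assumed reading that each question is bucketed by its step index within a task:
```python
# Sketch of the difficulty bucketing (assumed reading: bucket by step index within a task).
def difficulty_bucket(step_index: int) -> str:
    """Map a 1-based step index to a difficulty level."""
    if step_index <= 2:
        return "easy"
    if step_index <= 4:
        return "medium"
    return "hard"

print([difficulty_bucket(s) for s in range(1, 7)])
# ['easy', 'easy', 'medium', 'medium', 'hard', 'hard']
```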
Different Category Modules Analysis We assess the capabilities of different LLMs with various
category modules (refer to Tab. 2). As shown in Fig. 10, LLMs demonstrate proficiency in tackling
mathematical and statistical tasks utilizing SciPy modules. However, open-sourced LLMs encounter
challenges when handling modeling tasks that demand advanced coding and reasoning skills. We
hope that future open-sourced LLMs will excel in modeling tasks.
Model                         Easy   Medium   Hard
Yi-6B-chat                    37.5   33.1     23.1
Llama-2-7B-chat               25.7   19.2     15.7
Qwen-7B-chat                  51.0   45.5     36.5
Vicuna-7B-v1.5-16k            39.0   35.9     23.8
DeepSeek-7B-chat              38.8   41.8     22.8
Mistral-7B-instruct-v0.2      52.3   52.4     33.0
Chatglm3-6B-32k               43.2   33.3     19.0
Mixtral-8x7B-instruct-v0.1    62.1   61.2     47.6
Internlm2-chat-7B             55.5   52.0     33.9
Qwen-14B-chat                 59.6   58.8     42.0
Vicuna-13B-v1.5-16k           46.1   41.6     27.7
Llama-2-13B-chat              34.8   23.5     18.7
Internlm2-chat-20B            51.5   59.2     45.1
Yi-34b-chat                   55.2   58.1     38.0
Llama-2-70B-chat              38.1   33.9     18.6
DeepSeek-67B-chat             60.4   63.9     43.7
Qwen-72B-chat                 60.7   66.2     50.3
Llama-3-70B-chat              62.3   67.1     42.0
Fig. 9. Performance on different difficulty levels. Easy means <2 steps, Medium means 2-4 steps, and Hard means >4 steps.
[Figure 10: per-category performance (Mathematics and Statistics, Natural Language Processing, Data Manipulation, Image Processing, Modeling, Data Visualization) for Yi-34B-Chat, Llama-3-70B-Instruct, InternLM2-20B-Chat, Qwen-72B-Chat, DeepSeek-67B-Chat, and GPT-4-1106-preview.]
Fig. 10. Performance on different categories.
Limitation Our work has two main limitations: 1) CIBench is currently limited to Python, although
it could be extended to other programming languages using a similar methodology; 2) the
evaluation metric of CIBench has limitations in measuring certain data science tasks, such as "training
a model with PyTorch" and tasks involving randomness.
5 Conclusion
We propose a novel benchmark, named CIBench, to comprehensively assess LLMs' ability to
leverage code interpreters for complex data science tasks. It includes an evaluation dataset covering
widely used Python modules in data science and two evaluation modes measuring LLMs’ ability
with and without human assistance. The evaluation dataset is constructed using an LLM-human
cooperative approach, leveraging interactive IPython sessions to simulate realistic scenarios in data
science. Thorough experimental analysis with 24 LLMs on CIBench indicates that LLMs perform
poorly in modeling category modules. Based on our experimental analysis, we offer several insights
for the future development of LLMs: enhancing the LLM’s ability to correct errors based on feedback,
improving its understanding of user intentions across multiple rounds of interactions, and, most
importantly, strengthening its reasoning capabilities.
References
[1] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David
Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis
with large language models, 2021.
[2] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin
Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu,
Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren,
Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu,
Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu,
Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang,
Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. Qwen technical report, 2023.
[3] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel
Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M.
Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz
Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec
Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.
[4] Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui
Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297,
2024.
[5] Harrison Chase. LangChain, October 2022.
[6] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto,
Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul
Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke
Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad
Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias
Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex
Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain,
William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra,
Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer,
Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech
Zaremba. Evaluating large language models trained on code, 2021.
[7] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng,
Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot
impressing gpt-4 with 90%* chatgpt quality. https://vicuna.lmsys.org, 2023.
[8] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser,
Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John
Schulman. Training verifiers to solve math word problems, 2021.
[9] OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation
models. https://github.com/open-compass/opencompass, 2023.
[10] DeepSeek-AI. Deepseek llm: Scaling open-source language models with longtermism, 2024.
[11] Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan,
and Graham Neubig. Pal: Program-aided language models. In International Conference on
Machine Learning, pages 10764–10799. PMLR, 2023.
[12] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn
Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset,
2021.
[13] Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili
Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al. Metagpt: Meta programming for
multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023.
[14] Xueyu Hu, Ziyu Zhao, Shuang Wei, Ziwei Chai, Guoyin Wang, Xuwu Wang, Jing Su, Jingjing
Xu, Ming Zhu, Yao Cheng, et al. Infiagent-dabench: Evaluating agents on data analysis tasks.
arXiv preprint arXiv:2401.05507, 2024.
[15] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child,
Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language
models, 2020.
[16] Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-
tau Yih, Daniel Fried, Sida Wang, and Tao Yu. Ds-1000: A natural and reliable benchmark
for data science code generation. In International Conference on Machine Learning, pages
18319–18345. PMLR, 2023.
[17] Chenliang Li, Hehong Chen, Ming Yan, Weizhou Shen, Haiyang Xu, Zhikai Wu, Zhicheng
Zhang, Wenmeng Zhou, Yingda Chen, Chen Cheng, Hongzhu Shi, Ji Zhang, Fei Huang, and
Jingren Zhou. Modelscope-agent: Building your customizable agent system with open-source
large language models, 2023.
[18] Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard
Ghanem. Camel: Communicative agents for "mind" exploration of large scale language model
society. arXiv preprint arXiv:2303.17760, 2023.
[19] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization
branches out, pages 74–81, 2004.
[20] Hongwei Liu, Zilong Zheng, Yuxuan Qiao, Haodong Duan, Zhiwei Fei, Fengzhe Zhou, Wenwei
Zhang, Songyang Zhang, Dahua Lin, and Kai Chen. Mathbench: Evaluating the theory and
application proficiency of llms with a hierarchical mathematics benchmark. arXiv preprint
arXiv:2405.12209, 2024.
[21] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese,
and Caiming Xiong. Codegen: An open large language model for code with multi-turn program
synthesis. arXiv preprint arXiv:2203.13474, 2022.
[22] Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu,
and Maosong Sun. Communicative agents for software development. arXiv preprint
arXiv:2307.07924, 2023.
[23] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong,
Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou,
Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. Toolllm: Facilitating large language
models to master 16000+ real-world apis, 2023.
[24] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettle-
moyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach
themselves to use tools. arXiv preprint arXiv:2302.04761, 2023.
[26] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei,
Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open
foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[27] Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, and Heng Ji.
Mint: Evaluating llms in multi-turn interaction with tools and language feedback, 2023.
[28] Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from
error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612,
2004.
[29] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani
Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto,
Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language
models, 2022.
[30] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li,
Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via
multi-agent conversation framework. arXiv preprint arXiv:2308.08155, 2023.
[31] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan
Cao. React: Synergizing reasoning and acting in language models, 2023.
[32] Huaiyuan Ying, Shuo Zhang, Linyang Li, Zhejian Zhou, Yunfan Shao, Zhaoye Fei, Yichuan
Ma, Jiawei Hong, Kuikun Liu, Ziyi Wang, et al. Internlm-math: Open math large language
models toward verifiable reasoning. arXiv preprint arXiv:2402.06332, 2024.
[33] Aojun Zhou, Ke Wang, Zimu Lu, Weikang Shi, Sichun Luo, Zipeng Qin, Shaoqing Lu, Anya
Jia, Linqi Song, Mingjie Zhan, et al. Solving challenging math word problems using gpt-4 code
interpreter with code-based self-verification. arXiv preprint arXiv:2308.07921, 2023.
[34] Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, and Chao Zhang. Toolqa: A dataset for llm
question answering with external tools. arXiv preprint arXiv:2306.13304, 2023.
A Dataset Details
A.1 Module Version Settings
CIBench comprises generation tasks, template tasks, and Chinese template tasks, which produce
three types of output: numerical, text, and visualization. The statistics of CIBench are shown in
Tab.5.
Question Refinement The prompts used for question refinement are shown in Fig. 14, 15.
Quality Control Rules We include manual quality checking in the final steps due to limitations
in the LLMs’ ability to control runtime or file size. The rules used for quality control are shown in
Fig.16.
E Dataset Error Analysis
In the evaluation of CIBench, we identify four prevalent types of errors in the code generated by the
model. These errors are categorized as follows: 1) Instruction Following Errors (Fig.18): These
encompass instances where the model deviates from or disregards provided instructions, reflecting a
lack of adherence to specified guidelines; 2) Hallucination Errors (Fig.19): This category pertains to
the phenomenon of the model generating code that contains hallucinated elements, such as utilizing
undefined parameters or referencing irrelevant variables; 3) Reasoning Errors (Fig.20): These
errors occur when the model encounters complex problems, often resulting in logical errors in the
generated code. Such errors offer valuable insights into the model’s ability to handle intricate tasks in
code generation; 4) Code Errors (Fig.21): Basic errors in code generation fall under this category.
While these errors may sometimes appear trivial, they signify potential deficiencies in the model’s
code-generation process. These identified errors effectively underscore the current limitations of
LLMs in terms of their Code Interpreter capabilities, providing valuable insights for the ongoing
development of CIBench.
F Human Annotator
In our paper, the authors also serve as annotators for CIBench. These annotators are tasked with
creating template tasks for various data science challenges, ensuring coverage of a wide range of
concepts in the field. Importantly, our data collection process for these tasks is unbiased and does not
involve any private information.
G Ethical Consideration
We use GPT-4 and online resources to construct our benchmark. The benchmarks are carefully
processed by experts to exclude any private information. Additionally, we utilize ChatGPT for text
refinement and to correct any typographical errors during the writing process.
Fig. 11. Example of successful template task. The penguins dataset and the specific attribute
"flipper_length_mm" can be substituted with alternative datasets and columns. Such a template task
enriches the diversity of CIBench.
Fig. 12. Example of successful template task, continuing Fig. 11.
Topic Generation
Prompt:
Please create 50 different topics base on Python module {}. These topics can cover the
majority of functionality of this module.
Question Refinement
Prompt:
Based on the experiment document and each step, please refine these steps into specific
instructions for code writing to solve a real world question.
Provide parameter settings if needed, ensuring that the execution results are identical for any
reasonable person attempting to write the code.
Please refrain from directly mentioning the name of the API, just target on purpose and give
parameter settings.
Please also summarize the required external dataset if applicable.
1. "Remove the ’Cabin’ column from the dataframe due to its numerous missing values.
Make sure to specify that you are removing a column and that the change should be
permanent."
Too wordy, "Remove the ’Cabin’ column and make the change permanent." should be fine.
2. "Identify and count how many values are missing in each column of the dataframe. Use a
chain of two functions: the first checks if each value is null, and the second adds up these
null instances for each column."
Too specific, "Count how many values are missing in each column of the dataframe. Print the
count." should be fine.
1. "Convert the ’Sex’ column to numerical values. Map ’male’ to 0 and ’female’ to 1."
Concise, clear instruction.
Your output format should be as follows starting with import necessary libraries:
[Optional] Dataset Link:
[Optional] Dataset Description:
Step 1. xx
# code blcok
Step 2. xx
# code blcok
...
Another Question Refinement
Prompt:
Given the above experiment document. Do the following modification:
1. Remove all the installation steps.
2. Remove all the file saving steps.
3. Elaborate the steps to be more specific with number and inputs that the execution results
are identical for any reasonable person attempting to solve this step.
4. Reorder the steps.
GPT-4V Visualization Scoring
Prompt:
A score of 1 to 2 is given when the image cannot reflect the given instruction content at all.
A score of 3 to 4 is given when the image can partly reflect the given instruction content, but
there is a significant lack of specific detail expression.
If the image basically meets the given instructions, but the aesthetic quality of the image is
average without any outstanding features, this kind of rating can get a score of 5 to 6.
When the image matches the given instructions well, and shows some aesthetic appeal, such
as some originality in color matching and shape design, this kind of rating can get a score of
7 to 8.
Only when the image completely matches the given instructions, covers all the details in the
instructions, and performs excellently in terms of aesthetics, can this kind of rating get a
score of 9 to 10.
[Visualization Problem]:{question}
Instruction following error
User:
Detect corners using the Shi-Tomasi corner detector with max-corners=50, min-distance=0.5
and blocksize=10, mark the corners with circles on the image. The radius and thickness of
the circle are 5 and 1. Show the marked image. Please use opencv and numpy modules.
Assistant:
# Detect corners using Shi-Tomasi corner detector
corners = cv2.goodFeaturesToTrack(equalized_image,
                                  maxCorners=max_corners,
                                  qualityLevel=quality_level,
                                  minDistance=min_distance,
                                  blockSize=block_size)
# Mark the corners with circles on the image
marked_image = equalized_image.copy()
for corner in corners:
    x, y = corner.ravel()
    cv2.circle(marked_image, (int(x), int(y)), 5, (255, 0, 0), -1)
    # Draw a blue filled circle at each corner
System:
NameError                                 Traceback (most recent call last)
Cell In[6], line 2
      1 # Detect corners using Shi-Tomasi corner detector
----> 2 corners = cv2.goodFeaturesToTrack(equalized_image, maxCorners=max_corners,
        qualityLevel=quality_level, minDistance=min_distance, blockSize=block_size)
      4 # Mark the corners with circles on the image
      5 marked_image = equalized_image.copy()
User:
Define the CrossEntropyLoss function and the Adam optimizer. Train the model for 3
epochs. Display the loss for the last epoch (keep to two decimal places).
Assistant:
import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

epochs = 3
for epoch in range(epochs):
    model.train()
    running_loss = 0.0
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
System:
933.55
933.45
933.09
Hallucination error
User:
Build a vocabulary of all the tokens and assign an index to the vocabulary. Display the index
of the word "happy". Please use nltk modules.
Assistant:
# Check the vocabulary
vocab
# Display the index of the word "happy"
word_to_index.get('happy', 'Word not found in vocabulary')
System:
NameError                                 Traceback (most recent call last)
Cell In[2], line 2
      1 # Check the vocabulary
----> 2 vocab
      3 # Display the index of the word "happy"
      4 word_to_index.get('happy', 'Word not found in vocabulary')
User:
Define a LightGBM model with max_depth=4, n_estimators=120, learning_rate=0.01 and
num_leaves=31. Train the model with Evaluation Metric='logloss'. Please use LightGBM
modules.
Assistant:
import lightgbm as lgb
System:
TypeError                                 Traceback (most recent call last)
Cell In[3], line 7
      4 model = lgb.LGBMClassifier(max_depth=4, n_estimators=120,
        learning_rate=0.01, num_leaves=31)
      6 # Train the model with evaluation metric as 'logloss'
----> 7 model.fit(X_train, y_train, eval_set=(X_test, y_test),
        eval_metric='logloss', early_stopping_rounds=20)
Reasoning error
User:
Split the dataset into training and testing sets using 0.2 as the test size, then define the
train_loader and test_loader. Set batch size as 64. Please use pytorch and sklearn modules.
Assistant:
from sklearn.model_selection import train_test_split
import torch
from torch.utils.data import DataLoader, TensorDataset

batch_size = 64
train_loader = create_data_loader(X_train, y_train, batch_size)
test_loader = create_data_loader(X_test, y_test, batch_size)
System:
TypeError                                 Traceback (most recent call last)
Cell In[4], line 19
     16     return data_loader
     18 batch_size = 64
---> 19 train_loader = create_data_loader(X_train, y_train, batch_size)
     20 test_loader = create_data_loader(X_test, y_test, batch_size)
TypeError: can't convert np.ndarray of type numpy.object_. The only supported types
are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8,
uint8, and bool.
User:
Group and aggregate data by weight and calculate the average of each numerical group. Find
out how many attributes have a negative correlation with weight. Please use pandas modules.
Assistant:
# Group and aggregate data by weight and calculate the average of each numerical group
grouped_data = df.groupby('Weight').mean()
System:
KeyError                                  Traceback (most recent call last)
File anaconda3/envs/prod-torch1.13/lib/python3.10/site-packages/pandas/core/indexes/base.py:3802,
in Index.get_loc(self, key, method, tolerance)
   3801 try:
-> 3802     return self._engine.get_loc(casted_key)
   3803 except KeyError as err:
File anaconda3/envs/prod-torch1.13/lib/python3.10/site-packages/pandas/_libs/index.pyx:138,
in pandas._libs.index.IndexEngine.get_loc()
File anaconda3/envs/prod-torch1.13/lib/python3.10/site-packages/pandas/_libs/index.pyx:165,
in pandas._libs.index.IndexEngine.get_loc()
KeyError: 'Weight'
Code error
System:
ValueError                                Traceback (most recent call last)
File anaconda3/envs/prod-torch1.13/lib/python3.10/site-packages/matplotlib/axes/_axes.py:4372,
in Axes._parse_scatter_color_args(c, edgecolors, kwargs, xsize, get_next_color_func)
   4371 try:
-> 4372     mcolors.to_rgba_array(kwcolor)
   4373 except ValueError as err:
File anaconda3/envs/prod-torch1.13/lib/python3.10/site-packages/matplotlib/colors.py:471,
in to_rgba_array(c, alpha)
    470 if isinstance(c, str):
--> 471     raise ValueError(f"{c!r} is not a valid color value.")
    473 if len(c) == 0:
User:
Blur the initial image by applying a Gaussian filter with a sigma value of 3. Sigma is the
standard deviation for the Gaussian kernel, which controls the amount of blur. Then display
the smoothed image.
Assistant:
smooth = ndimage.gaussian_filter(image, sigma=3)
plt.imshow(smooth)
plt.show()
System:
NameError                                 Traceback (most recent call last)
Cell In[48], line 2
      1 smooth = ndimage.gaussian_filter(image, sigma=3)
----> 2 plt.imshow(smooth)
      3 plt.show()