CODEJUDGE: Evaluating Code Generation with Large Language Models
Abstract

Evaluating code generated by LLMs remains an unresolved problem. This paper presents CODEJUDGE, a code evaluation framework that leverages LLMs to evaluate the semantic correctness of generated code without the need for test cases. We investigate different ways to guide the LLM in performing “slow thinking” to arrive at an in-depth and reliable evaluation. We experimented with four LLMs as evaluators on four code generation datasets and five programming languages. The results show that CODEJUDGE significantly outperformed existing methods in most settings. Furthermore, compared with a SOTA GPT-3.5-based code evaluation method, CODEJUDGE achieved better results even when using a much smaller model, Llama-3-8B-Instruct. Our code and datasets are available on GitHub: https://github.com/VichyTong/CodeJudge.

1 Introduction

There is an increasing interest in leveraging Large Language Models (LLMs) to generate code (Rozière et al., 2023; Shen et al., 2023). However, reliably evaluating LLMs on code generation tasks remains a challenging problem (Evtikhiev et al., 2023). Test-based methods, such as measuring pass@k (Kulal et al., 2019; Chen et al., 2021), rely on manually written test cases to evaluate code quality. This reliance presents a significant limitation, since many tasks do not come with test cases or only have insufficient test cases that miss corner cases (Liu et al., 2023a). Moreover, it is challenging to write test cases for many coding tasks, such as object serialization and web scraping, since they require extensive effort to construct and configure test stubs and mock objects.

When there are no test cases, existing work often relies on token-based metrics, such as BLEU (Papineni et al., 2002) and ROUGE-L (Lin, 2004). However, these metrics cannot reliably recognize semantically equivalent code having syntactic variations, e.g., using a while loop instead of a for loop, following different naming conventions, etc. In particular, Evtikhiev et al. (2023) shows a statistically significant disagreement on code assessment between human judges and these token-based metrics.

Recent studies show that LLMs are promising alternatives to human evaluators in different tasks (Liu et al., 2023b; Zheng et al., 2023a; Chan et al., 2024). Inspired by these findings, we propose an LLM-based code evaluation framework called CODEJUDGE. CODEJUDGE supports two kinds of assessment: (1) determining whether the model-generated code is correct or not, and (2) estimating to what extent the generated code is aligned with user-intended code. While the former is the typical way of evaluating LLMs in code generation, we argue that the latter provides a more informative evaluation, since LLMs often generate partially correct code, which provides a good starting point or hints to developers (Vaithilingam et al., 2022; Barke et al., 2023). Thus, it is useful to account for partial correctness and the severity of code errors when evaluating LLMs for code generation.

We design two methods to guide the LLM to perform “slow thinking” (Kahneman, 2011) for reliable code evaluation. For the first assessment, CODEJUDGE instructs the LLM to perform a step-by-step analysis of the code functionality and then asks it to summarize the analysis results into a binary decision. For the second assessment, CODEJUDGE provides the LLM with a taxonomy of common coding errors and instructs the LLM to identify what types of errors the generated code contains. Then, it computes a code correctness score based on the severity of identified errors. Notably, our framework does not require any test cases or any fine-tuning of backbone models in code evaluation.

We evaluate CODEJUDGE on five programming languages—Java, C++, Python, JavaScript, Go—and four datasets—HumanEval-X (Zheng et al., 2023b), CoNaLa (Yin et al., 2018; Evtikhiev et al., 2023), APPS (Hendrycks et al., 2021), and BigCodeBench (Zhuo et al., 2024). Following prior work on text generation evaluation (Zhang et al., 2020; Yuan et al., 2021) and code generation evaluation (Zhou et al., 2023; Yuan et al., 2021), we adopt Kendall’s τ coefficient and Spearman’s ρ to measure the statistical correlation between CODEJUDGE’s assessment and the ground truth, which provides a robust measurement of CODEJUDGE’s performance. For the first assessment, we also measure the accuracy of the binary decision made by CODEJUDGE as a more intuitive metric of CODEJUDGE’s performance.

We experiment with four LLMs as code evaluators and compare CODEJUDGE with nine existing methods, including ICE-Score (Zhuo, 2024), a state-of-the-art code evaluation method based on GPT-3.5-Turbo. For all four LLMs, we observe that CODEJUDGE achieves significantly higher correlations (12.1%-41.8%) than existing methods in most settings. Even when using a relatively small model (Llama-3-8B-Instruct), CODEJUDGE still outperforms ICE-Score, which uses GPT-3.5-Turbo. CODEJUDGE also achieves a high accuracy (80.56% on average) when directly predicting whether a generated code is correct or not. Notably, when the ground-truth code is not available for comparison, CODEJUDGE still achieves reasonable performance (e.g., a 0.502 Kendall's τ coefficient1 and 73.13% accuracy) and outperforms all existing methods that rely on references. This demonstrates that CODEJUDGE can effectively guide LLMs to exert their reasoning capabilities to examine the correctness of code.

1 A correlation coefficient above 0.5 is often interpreted as a strong correlation (Cohen, 1988).

2 Background and Related Work

2.1 Problem Formulation

The objective of this work is to evaluate the semantic correctness of machine-generated code. The task of code generation is formulated as generating a code snippet c based on a given task description t. We define an evaluation method as a function f(c, t).

Code evaluation is typically treated as a binary classification task (Chen et al., 2021). The evaluation method f simply determines whether the generated code is correct or not (i.e., f(c, t) ∈ {0, 1}). For instance, test-based evaluation treats the generated code as correct if it passes all test cases. Otherwise, the code is considered wrong.

Recent studies indicate that code evaluation should not be simply treated as a yes-or-no question (Vaithilingam et al., 2022; Barke et al., 2023). In practice, LLMs often generate code that is not fully correct, e.g., not handling corner cases, missing one computation step, etc. Despite these errors, many developers find the generated code a good starting point compared with writing code from scratch, since they can fix the errors by changing a few lines of code or at least get some inspiration. Thus, it would be helpful if the evaluation method f could measure to what extent the generated code deviates from the user-intended code based on the task description (i.e., f(c, t) ∈ R).

Evaluation without reference code. Many evaluation methods assume that the correct code is available as the ground truth so that they can directly compare the generated code with the ground-truth code. Token-based and embedding-based methods, such as CodeBLEU (Ren et al., 2020) and CodeBERTScore (Zhou et al., 2023), fall into this category. However, this assumption does not always hold in practice, especially in online settings. In many cases, human programmers can make a good assessment only through code inspection and reasoning based on their programming knowledge, without the need for the ground truth. Since LLMs have been demonstrated as promising alternatives to human judges (Liu et al., 2023b; Zheng et al., 2023a; Chan et al., 2024), it is appealing to investigate whether LLMs can make accurate code assessments without reference code. In this work, we consider code evaluation without references as a special and more challenging evaluation task.

Challenges. The challenge of code evaluation is two-fold. First, the generated code may have many syntactic variations compared with the correct code while being semantically equivalent, e.g., using different variable names, using a different ordering of independent program statements, using a for loop instead of a while loop, etc. Second, there can be multiple alternative solutions for a code generation task. For instance, for the task of sorting integers, there are many different sorting algorithms that have drastically different implementations. In the following section, we will illustrate these challenges using existing code evaluation methods on a running example in Figure 1.
Task Description: Alphabetize letters in each word of a sentence, keeping the words and spaces in the same order.

def anti_shuffle(s):
    return ' '.join([
        ''.join(sorted(list(i)))
        for i in s.split(' ')
    ])
(a) Reference code (e.g., ground truth)

def anti_shuffle(s):
    return ' '.join([
        sorted(list(i))
        for i in s.split(' ')
    ])
(b) Partially correct code

def anti_shuffle(s):
    pass
(c) Completely useless code

def anti_shuffle(s):
    return ' '.join([
        ''.join(sorted(list(word)))
        for word in s.split(' ')
    ])
(d) Correct code with syntactic variations

def anti_shuffle(s):
    def sort(word):
        return ''.join(sorted(list(word)))
    word_list = []
    current_word = ""
    for i in range(len(s)):
        if s[i] != " ":
            current_word += s[i]
        else:
            word_list.append(sort(current_word))
            current_word = ""
    word_list.append(sort(current_word))
    return ' '.join(word_list)
(e) Correct code with a different implementation

Figure 1: An intuitive example of different types of code solving a sentence sorting problem.
2.2 Existing Evaluation Methods

Existing evaluation methods for code generation can be categorized into four categories: test-based, token-based, embedding-based, and more recently, LLM-based methods.

Test-based Methods. Pass@k (Kulal et al., 2019) is defined as the percentage of code generation tasks where at least one of the top k code samples generated for a task passes the unit tests of the task. Chen et al. (2021) then introduces an unbiased version of pass@k to reduce variances, which is widely used to evaluate code generation models these days. However, since many tasks lack a comprehensive set of test cases, this often leads to incorrect code snippets incidentally passing given tests. To address this issue, EvalPlus (Liu et al., 2023a) augments the test cases of a given task using LLMs and mutation-based strategies. However, this method still relies on hand-written test cases as the initial seeds.
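To make the pass@k computation concrete, the sketch below implements the unbiased estimator described by Chen et al. (2021) for a single task; the function name and the toy numbers in the example are illustrative and not part of any benchmark's API.

import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021) for one task.

    n: total number of samples generated for the task
    c: number of samples that pass all unit tests
    k: sample budget considered
    """
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# Example: 10 samples generated, 3 pass the tests, estimate pass@5.
print(round(pass_at_k(n=10, c=3, k=5), 4))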
Token-based Methods. Conventional methods for evaluating machine translation or text generation have been adopted for code evaluation. Basically, these methods compute the token-level similarity between the generated text and the ground-truth text to measure the generation quality. For instance, BLEU (Papineni et al., 2002) calculates modified n-gram precision and includes a brevity penalty. ROUGE-L (Lin, 2004) computes sequence n-grams based on the longest common subsequence. METEOR (Denkowski and Lavie, 2014) relies on the recall and precision of unigrams, while also considering the order of the matched words. ChrF (Popović, 2015) calculates character-level n-gram precision and recall.

CodeBLEU (Ren et al., 2020) and RUBY (Tran et al., 2019) extend traditional token-based methods for code evaluation. CodeBLEU incorporates the similarity of data-flow graphs and abstract syntax trees into the calculation. RUBY calculates similarity based on three representation levels of code: text, AST, and the program dependence graph.
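As an illustration of token-level similarity, the following sketch computes a ROUGE-L-style F-score from the longest common subsequence of whitespace-separated tokens; the simple tokenization and the beta weight are assumptions made for illustration, not the exact configuration of the baseline implementations used here.

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate: str, reference: str, beta: float = 1.2) -> float:
    """ROUGE-L F-score over whitespace tokens (simplified tokenization)."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)

generated = "return ' '.join(sorted(list(i)) for i in s.split(' '))"
reference = "return ' '.join(''.join(sorted(list(i))) for i in s.split(' '))"
print(round(rouge_l(generated, reference), 3))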
Embedding-based Method. Zhou et al. (2023) proposed CodeBERTScore based on a machine translation evaluation method called BERTScore (Zhang et al., 2020). CodeBERTScore first encodes the generated code and reference code using a fine-tuned CodeBERT (Feng et al., 2020) model. Then, it computes a cosine similarity matrix between the embeddings, based on which CodeBERTScore calculates precision and recall by taking the maximum across rows and columns and averaging the results. CodeBERTScore employs F1 and F3 scores to represent the alignment between the generated code and reference code.
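The matching step described above can be sketched as follows with stand-in token embeddings; in the actual metric the embeddings come from a fine-tuned CodeBERT encoder, and the helper name below is hypothetical.

import numpy as np

def bertscore_f1(cand_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """Greedy-matching F1 over token embeddings, in the style of (Code)BERTScore.

    cand_emb: (m, d) embeddings of the generated code's tokens
    ref_emb:  (n, d) embeddings of the reference code's tokens
    """
    # Normalize rows so that dot products equal cosine similarities.
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand @ ref.T                  # (m, n) cosine similarity matrix
    precision = sim.max(axis=1).mean()  # best reference match per generated token
    recall = sim.max(axis=0).mean()     # best generated match per reference token
    return 2 * precision * recall / (precision + recall)

# Stand-in random embeddings; in practice these come from a code encoder such as CodeBERT.
rng = np.random.default_rng(0)
print(round(bertscore_f1(rng.normal(size=(12, 768)), rng.normal(size=(15, 768))), 3))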
LLM-based Method. To the best of our knowledge, ICE-Score (Zhuo, 2024) is the only work that also adopts LLMs for code evaluation. ICE-Score performs multi-dimensional evaluation (Zhong et al., 2022; Liu et al., 2023b; Fu et al., 2023) and instructs the LLM to predict an evaluation score from 0 to 4 based on the definition of an evaluation criterion. Unlike token-based and embedding-based methods, ICE-Score does not require the availability of the reference code. Furthermore, their evaluation shows that including the reference code in the prompt does not significantly improve ICE-Score's performance.

Drawbacks of Existing Methods. Table 1 shows the evaluation scores computed by different methods for the four types of code solutions in Figure 1. We made several interesting observations about the alignment between evaluation scores and the actual correctness of the generated code.

First, token-based and embedding-based methods assign higher scores to partially correct code (Figure 1(b)) compared to correct code with a different implementation (Figure 1(e)). This indicates that these methods face challenges in appropriately scoring code that is correct but syntactically very different.

Second, without the reference code, ICE-Score cannot differentiate between partially correct code (Figure 1(b)) and completely useless code (Figure 1(c)), as it assigns both a score of 2.0. Adding the reference code to ICE-Score addresses this problem but still cannot distinguish between partially correct code (Figure 1(b)) and correct code with a different implementation (Figure 1(e)), assigning both a score of 3.0. These drawbacks may stem from the prompt design of ICE-Score, which simply asks the LLM to predict a score based on the definition of a criterion. In this work, we investigate better ways to exert the inherent reasoning capabilities of LLMs for reliable code evaluation.

Method                 Fig. 1(b)  Fig. 1(c)  Fig. 1(d)  Fig. 1(e)
Test-based Methods
pass@1                 0          0          1          1
Token-based Methods
BLEU                   0.779      0.010      0.858      0.231
ROUGE-L                0.914      0.267      0.947      0.431
chrF                   0.852      0.266      0.891      0.466
CodeBLEU               0.852      0.052      0.983      0.851
RUBY                   0.811      0.364      0.990      0.533
METEOR                 0.846      0.164      0.947      0.705
Embedding-based Methods
CodeBERTScoreF1        0.990      0.796      0.976      0.800
CodeBERTScoreF3        0.988      0.746      0.976      0.841
LLM-based Methods
ICE-Score              3.0        0          4.0        3.0
  w/o REF              2.0        2.0        3.5        3.0
CODEJUDGE A.S.         0          0          1          1
  w/o REF              0          0          1          1
CODEJUDGE F.L.         0.50       0          1.00       1.00
  w/o REF              0.50       0          1.00       1.00

Table 1: Scores assigned by various code evaluation methods to the code snippets shown in Figure 1, where Fig. 1(b) is partially correct, Fig. 1(c) is completely useless, Fig. 1(d) is correct but with syntactic variations, and Fig. 1(e) is correct but implemented differently. CODEJUDGE A.S. and CODEJUDGE F.L. correspond to the analyze then summarize method and taxonomy-guided fault localization method described in Section 3.1 and Section 3.2, respectively.

3 CODEJUDGE

Our key insight is to guide LLMs to perform “slow thinking” (Kahneman, 2011) in code evaluation, instead of predicting an evaluation score in one step. We design two methods for the two kinds of code evaluation assessment defined in Section 2.1.

3.1 Analyze then Summarize

For the binary evaluation task, we decompose the evaluation task into two subtasks: analysis and summarization, as illustrated in Figure 2. Specifically, the analysis task provides a step-by-step evaluation guideline and asks the LLM to identify the required functionalities from the task description, examine the logic of the generated code, and report any requirement that is not fulfilled. Optionally, a reference solution can be added to the prompt to aid the analysis. Subsequently, the summarization task asks the LLM to check the analysis report and decide whether the code is correct or not.

This design is inspired by how developers perform code review in practice. Instead of directly arriving at a decision, developers typically do a round of careful analysis of task requirements and code functionality and then decide whether there is any inconsistency. By explicitly asking the LLM to generate a detailed analysis report and double-check it, CODEJUDGE forces the LLM to exert its reasoning capabilities and perform a more careful analysis, instead of making a quick decision.
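A minimal sketch of this two-step loop is given below, assuming a generic call_llm chat-completion helper; the prompt wording is abbreviated from the full prompts in Table 18 and Figure 2 rather than copied verbatim.

import re

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call (e.g., GPT-3.5-Turbo or Llama-3)."""
    raise NotImplementedError

ANALYSIS_PROMPT = """You will be provided with a problem statement and a code snippet
that supposedly addresses the problem in {language}.
Your task is to check if the code snippet covers the required functionalities.
Problem: {problem}
Code Snippet: {code}"""

SUMMARY_PROMPT = """You will be provided with an analysis result of a code snippet.
If the analysis believes that the code snippet is correct, output: "Yes".
Otherwise, output: "No".
Analysis: {analysis}"""

def judge_binary(problem: str, code: str, language: str = "Python") -> int:
    """Analyze-then-summarize evaluation: 1 if judged correct, 0 otherwise."""
    analysis = call_llm(ANALYSIS_PROMPT.format(language=language, problem=problem, code=code))
    verdict = call_llm(SUMMARY_PROMPT.format(analysis=analysis))
    return 1 if re.search(r"\byes\b", verdict, re.IGNORECASE) else 0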
Figure 2: The workflow of CODEJUDGE on the running example from Figure 1. In the Analyze then Summarize workflow, an analysis prompt (a1) reports whether the code snippet covers the required functionalities, a summarization prompt (a2) condenses the report into "Yes"/"No", and the final decision is 0 (incorrect code). In the Taxonomy-Guided Fault Localization workflow, a fault-localization prompt (b1) returns a JSON list of inconsistencies with severities; here one Major inconsistency is identified and the final decision is 0.5 / 1.0.

3.2 Taxonomy-Guided Fault Localization

To decide to what extent a generated code deviates from the user-intended code, we augment the analysis step in the previous section by supplementing the LLM with a taxonomy of common inconsistencies in LLM-generated code and instructing it to identify any potential inconsistencies. As discussed in Section 2.1, different kinds of inconsistencies have different kinds of consequences. Some errors are simple and easy to fix, while others are more severe. Therefore, we incorporate the severity of each identified inconsistency into the summarization step to calculate the correctness score. We explain the details below.
A Taxonomy of Common Inconsistencies. To design the taxonomy, we manually inspected code snippets generated by different LLMs in different programming languages and also referred to the literature on code error analysis (Hristova et al., 2003; Weimer and Necula, 2004; Chabbi and Mellor-Crummey, 2012; Chen et al., 2018). We summarized eight distinct types of frequent code inconsistencies and categorized them into four severity levels based on their impact on the semantic correctness of the generated code, as shown in Table 2.

• Negligible. Code inconsistencies in this category have little impact on semantic correctness. Specifically, we consider missing import statements or exception handling not semantically wrong, since the code generated in such cases indeed achieves the functionality in the task description while not being perfect.

• Small. We classify input handling as small due to its limited impact on the core functionality of the code snippet and the straightforward nature of its correction.

• Major. Logical errors directly affect the semantic correctness of the code, leading to incorrect outputs. These errors are considered to have a major impact on semantic correctness.

• Fatal. Code generation models sometimes hallucinate function calls or variable names that are never defined in the code. Furthermore, in many cases, they generate code with incomplete expressions and statements. These issues often lead to runtime exceptions or compilation errors that crash the program execution. Thus, we considered them as fatal errors.

Given the potential inconsistencies identified by the LLM, we aggregate them via a weighted sum based on their severity levels to compute the final score. To better compare with other methods, we normalize the score to the range of [0, 1]. More details can be found in Appendix B.
Type: Description

Negligible
  Alternative: Using different methods or algorithms to solve the problem.
  Dependency: Missing import statements.
  Error Handling: No exception handling for unexpected events, e.g., invalid inputs.
  Efficiency: Including inefficient or unnecessary statements.
Small
  Input Handling: Failing to handle edge cases.
Major
  Logic Error: Containing logical errors.
Fatal
  Declaration: Using undefined functions or variables.
  Incompletion: Incomplete code.

Table 2: The catalog of code inconsistencies.
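For reference, the taxonomy in Table 2 can be written down as a simple mapping from inconsistency type to severity level; the identifier below is illustrative rather than part of any released code.

# Table 2 as a lookup from inconsistency type to severity level
# (identifier spellings are illustrative; the categories come from the table above).
INCONSISTENCY_SEVERITY = {
    "Alternative": "Negligible",    # different method/algorithm for the same problem
    "Dependency": "Negligible",     # missing import statements
    "Error Handling": "Negligible", # no handling of unexpected events
    "Efficiency": "Negligible",     # inefficient or unnecessary statements
    "Input Handling": "Small",      # failing to handle edge cases
    "Logic Error": "Major",         # logical errors in the implementation
    "Declaration": "Fatal",         # undefined functions or variables
    "Incompletion": "Fatal",        # incomplete code
}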
4 Experiments

4.1 Datasets

As described in Section 2.1, CODEJUDGE makes two kinds of code assessment. Following Zhuo (2024), we use HumanEval-X (Zheng et al., 2023b) for the binary assessment task and CoNaLa (Yin et al., 2018) for the code deviation assessment task. The rationale is that HumanEval-X includes test cases for each task, so we can easily obtain binary correctness labels based on test results. By contrast, CoNaLa (Yin et al., 2018) does not have test cases. Instead, it provides human-annotated code usefulness scores in the range of 0 to 4, which were obtained via crowdsourcing.

Since HumanEval-X only includes introductory coding tasks, we also include two more challenging datasets, APPS (Hendrycks et al., 2021) and BigCodeBench (Zhuo et al., 2024). Compared with HumanEval-X, APPS includes competition-level coding problems and BigCodeBench includes more complex instructions and more API calls. For instance, Codex achieves a pass@1 rate of 28.81% on HumanEval, but only 0.92% on APPS (Le et al., 2022; Chen et al., 2021). Similarly, GPT-4o achieves a pass@1 rate of 90.2% on HumanEval but only 56.1% on BigCodeBench (Anthropic, 2024; Zhuo et al., 2024). Since both APPS and BigCodeBench provide only test cases, we use them for the binary assessment task.

We apply our Analyze then Summarize method to the binary assessment task datasets (HumanEval-X, APPS, and BigCodeBench) and the Taxonomy-Guided Fault Localization method to the code deviation assessment task dataset (CoNaLa). We briefly describe each dataset below.

HumanEval-X (Zheng et al., 2023b) is a multi-language version of HumanEval, a popular code generation benchmark originally from the Codex paper (Chen et al., 2021). It contains 164 introductory coding tasks, each of which includes a natural language task description, some test cases, and a human-created reference. We evaluate CODEJUDGE on five programming languages in the initial release of HumanEval-X, including Python, C++, Java, JavaScript, and Go.2

2 We tried other languages such as Rust in the latest version of HumanEval-X but encountered issues when running their test cases. Thus, we chose not to evaluate those languages.

CoNaLa (Yin et al., 2018) is a Python code generation benchmark with 472 tasks collected from StackOverflow. We use the human annotation collected by Evtikhiev et al. (2023) as ground truth for the code deviation assessment. For each task, Evtikhiev et al. (2023) asked experienced software developers to grade a score of usefulness between 0 and 4 for the generated code snippets from five different models.

APPS (Hendrycks et al., 2021) is a Python code generation benchmark. It includes introductory-level, interview-level, and competition-level coding tasks collected from code competition websites. We randomly sampled 100 competition-level tasks to form a challenging dataset.

BigCodeBench (Zhuo et al., 2024) is a recently released code generation dataset in Python with 1,140 practical and challenging programming tasks. This dataset challenges the ability of LLMs to invoke multiple function calls from various libraries.

4.2 Evaluation Metrics

Statistical Correlations. Recent studies have used statistical correlation metrics, such as Kendall's τ coefficient (τ) and Spearman's rank correlation coefficient (rs), as a robust way to measure the correlation between code evaluation results and the ground truth (Zhou et al., 2023; Zhuo, 2024). Thus, we adopt these correlation metrics to evaluate CODEJUDGE on both kinds of assessment tasks.

Accuracy. For the binary classification task, we also measure the correctness prediction accuracy of CODEJUDGE as a more intuitive metric.
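Appendix C notes that the correlation implementations come from SciPy; the snippet below is a minimal sketch of computing both coefficients on made-up evaluator scores and ground-truth labels.

from scipy import stats

# Hypothetical evaluator scores and ground-truth labels for six generated snippets.
predicted_scores = [1.0, 0.5, 0.0, 1.0, 0.5, 0.0]
ground_truth = [1, 1, 0, 1, 0, 0]   # e.g., pass/fail from test execution

tau, tau_p = stats.kendalltau(predicted_scores, ground_truth)
rho, rho_p = stats.spearmanr(predicted_scores, ground_truth)
print(f"Kendall's tau = {tau:.3f} (p = {tau_p:.3f})")
print(f"Spearman's rho = {rho:.3f} (p = {rho_p:.3f})")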
HumanEval-X CoNaLa APPS BigCodeBench
Method τ rs τ rs τ rs τ rs
EXISTING METHODS
BLEU 0.306 0.373 0.437 0.485 0.035 0.042 0.072 0.089
ROUGE-L 0.318 0.388 0.450 0.501 0.035 0.043 0.117 0.143
METEOR 0.357 0.436 0.412 0.463 0.085 0.104 0.247 0.302
chrF 0.328 0.400 0.457 0.514 0.036 0.044 0.167 0.205
CodeBLEU 0.362 0.442 0.292 0.332 0.135 0.164 0.173 0.212
RUBY 0.309 0.376 0.332 0.373 0.092 0.113 0.119 0.146
CodeBERTScoreF1 0.339 0.414 0.499 0.558 -0.003 -0.003 0.048 0.059
CodeBERTScoreF3 0.372 0.454 0.485 0.542 0.008 0.010 0.133 0.163
VANILLA 0.570 0.570 0.357 0.386 0.103 0.103 0.251 0.251
VANILLA w/o REF 0.390 0.390 0.465 0.499 -0.058 -0.058 0.131 0.131
ICE-Score 0.475 0.492 0.253 0.271 0.224 0.224 0.321 0.330
ICE-Score w/o REF 0.349 0.363 0.462 0.491 0.140 0.143 0.117 0.118
CODEJUDGE 0.612 0.612 0.457 0.478 0.354 0.354 0.392 0.392
CODEJUDGE w/o REF 0.502 0.502 0.538 0.562 0.153 0.153 0.317 0.317

Table 3: The results on four datasets when using GPT-3.5-Turbo-1106 as the evaluator. The best results are in bold. Due to space limitations, tables with standard deviation and results of each language are shown in Appendix E.

Table 5: The Kendall-Tau (τ) and Spearman (rs) correlations between CODEJUDGE using GPT-3.5-Turbo and semantic correctness in HumanEval-X.
                         CoNaLa          HE-X       APPS       BCB
Method                   τ       rs      τ = rs     τ = rs     τ = rs
CodeLlama-Instruct-34B
CODEJUDGE                0.559   0.581   0.492      0.210      0.334
  w/o REF                0.582   0.607   0.412      0.062      0.097
Llama-3-8B-Instruct
CODEJUDGE                0.523   0.547   0.480      0.161      0.383
  w/o REF                0.576   0.602   0.388      0.072      0.258
Llama-3-70B-Instruct
CODEJUDGE                0.572   0.598   0.681      0.391      0.440
  w/o REF                0.628   0.654   0.619      0.153      0.298

Table 6: The results of CODEJUDGE using three open-source models (more results in Appendix E).

However, when the reference code is provided, CODEJUDGE falls behind several existing methods in the CoNaLa dataset. One plausible explanation is that for the code deviation task, the LLM evaluator focuses too much on the differences between the generated code and reference code rather than high-level semantic similarities. This implies future opportunities to calibrate LLMs for code assessment.

Results on More Challenging Benchmarks. Table 3 shows the correlations on APPS and BigCodeBench. While CODEJUDGE still achieves the best performance, we observe that all evaluation methods suffer from a significant drop in performance on APPS and BigCodeBench. The vanilla LLM-based method, which performs comparably to ICE-Score on the other benchmarks, experienced the biggest degradation. Such a performance drop is not surprising, since these competition-level tasks are challenging to human developers, let alone LLMs. Without running and debugging the code, many developers may also struggle with assessing the code. Table 3 shows that LLM-based methods consistently perform better when reference code is provided to aid code evaluation. We also observe that for BigCodeBench, LLM-based methods with reference show a significantly smaller performance degradation compared to methods without reference. This implies that providing reference code is more helpful for challenging tasks compared with relatively simple tasks.

Accuracy of Binary Evaluation. Table 4 shows the accuracy of different methods in the binary assessment task. Since ICE-Score produces a rating in the range of 0 to 4, we treat a rating of 4 as fully correct and all other ratings as incorrect in the binary assessment task. CODEJUDGE outperforms both ICE-Score and VANILLA regardless of whether the reference code is provided or not.

Evaluating without References. We want to highlight that even when reference code is not provided to CODEJUDGE but is provided to all other methods, CODEJUDGE still outperforms all existing methods in most settings. This implies the power of performing “slow thinking” in code evaluation.

Impact of Programming Languages. Table 5 shows the statistical correlation results of CODEJUDGE on different programming languages. When reference code is provided, CODEJUDGE consistently achieves a coefficient above 0.5, which indicates a strong correlation with the ground truth. CODEJUDGE performs much better on Python and Java compared with the other three languages.

Generalizability to Open-Source LLMs. Table 6 shows the correlation results of CODEJUDGE when substituting GPT-3.5 with three open-source models. Compared with GPT-3.5, CODEJUDGE achieves better correlations when using Llama-3-70B. Besides, even when using a relatively small model (Llama-3-8B-Instruct), CODEJUDGE still achieves better or comparable performance to all existing methods, including ICE-Score, which uses GPT-3.5 as the evaluator. This demonstrates that CODEJUDGE can be easily applied to other LLMs and obtain evaluations with a reasonable correlation to semantic correctness.

Prompt Design. We further test CODEJUDGE with few-shot learning, Chain-of-Thought (CoT), and the combination of them. However, CODEJUDGE with these prompting methods does not outperform the original one. Our analysis of the drawbacks of employing CoT and few-shot learning can be found in Appendix A.

Failure Case Analysis. To understand the limitations of CODEJUDGE, we manually inspected 600 failure cases, especially those from APPS. We identified three failure patterns:
• Wrong Analysis of Code Logic (52.83%). The most common pattern is that the LLM evaluator fails to infer the code logic correctly. For example, the LLM may mention that the code implements a logic while it does not.

• Wrong Identification of Task Requirements (26.42%). For some complex tasks, the LLM evaluator struggles to identify all requirements from the task description correctly.

• Requirements of Error Handling (20.75%). We find that the LLM evaluator tends to report many error-handling errors (e.g., not handling invalid inputs) in generated code, even though it is not necessary in many cases. This makes CODEJUDGE over-conservative when evaluating some partially correct code.

5 Conclusion

We propose CODEJUDGE, a framework that leverages LLMs to evaluate code generation without the need for test cases. We demonstrate that by guiding LLMs to perform slow thinking, CODEJUDGE outperforms all existing code evaluation methods. This demonstrates a promising future direction to replace human evaluators with LLM evaluators. This is also beneficial for alignment methods that rely on human evaluation as feedback. Finally, we release our code and dataset at https://github.com/VichyTong/CodeJudge.

References

AI Anthropic. 2024. Claude 3.5 sonnet model card addendum. Claude-3.5 Model Card.

Shraddha Barke, Michael B James, and Nadia Polikarpova. 2023. Grounded copilot: How programmers interact with code-generating models. Proceedings of the ACM on Programming Languages, 7(OOPSLA1):85–111.

Milind Chabbi and John Mellor-Crummey. 2012. Deadspy: a tool to pinpoint program inefficiencies. In Proceedings of the Tenth International Symposium on Code Generation and Optimization, CGO '12, page 124–134, New York, NY, USA. Association for Computing Machinery.

Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. 2024. ChatEval: Towards better LLM-based evaluators through multi-agent debate. In The Twelfth International Conference on Learning Representations.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A pre-trained model for programming and natural languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1536–1547, Online. Association for Computational Linguistics.

Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. 2023. GPTScore: Evaluate as you desire. arXiv preprint arXiv:2302.04166.

Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. 2024. GPTScore: Evaluate as you desire. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6556–6576, Mexico City, Mexico. Association for Computational Linguistics.

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. 2021. Measuring coding challenge competence with APPS. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).

Maria Hristova, Ananya Misra, Megan Rutter, and Rebecca Mercuri. 2003. Identifying and correcting Java programming errors for introductory computer science students. In Proceedings of the 34th SIGCSE Technical Symposium on Computer Science Education, SIGCSE '03, page 153–156, New York, NY, USA. Association for Computing Machinery.

Daniel Kahneman. 2011. Thinking, fast and slow. Macmillan.

Sumith Kulal, Panupong Pasupat, Kartik Chandra, Mina Lee, Oded Padon, Alex Aiken, and Percy S Liang. 2019. SPoC: Search-based pseudocode to code. Advances in Neural Information Processing Systems, 32.

Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Chu Hong Hoi. 2022. CodeRL: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems, 35:21314–21328.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023a. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems.

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023b. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Maja Popović. 2015. chrF: Character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics.

Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. CodeBLEU: A method for automatic evaluation of code synthesis. arXiv preprint arXiv:2009.10297.

Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. 2023. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.

Bo Shen, Jiaxin Zhang, Taihong Chen, Daoguang Zan, Bing Geng, An Fu, Muhan Zeng, Ailun Yu, Jichuan Ji, Jingyang Zhao, Yuenan Guo, and Qianxiang Wang. 2023. PanGu-Coder2: Boosting large language models for code with ranking feedback. arXiv preprint arXiv:2307.14936.

N. Tran, H. Tran, S. Nguyen, H. Nguyen, and T. Nguyen. 2019. Does BLEU score work for code migration? In 2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC), pages 165–176, Los Alamitos, CA, USA. IEEE Computer Society.

Priyan Vaithilingam, Tianyi Zhang, and Elena L Glassman. 2022. Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models. In CHI Conference on Human Factors in Computing Systems Extended Abstracts, pages 1–7.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.

Westley Weimer and George C. Necula. 2004. Finding and preventing run-time error handling mistakes. In Proceedings of the 19th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA '04, page 419–431, New York, NY, USA. Association for Computing Machinery.

Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, and Graham Neubig. 2018. Learning to mine aligned code and natural language pairs from Stack Overflow. In Proceedings of the 15th International Conference on Mining Software Repositories, MSR '18, page 476–486, New York, NY, USA. Association for Computing Machinery.

Weizhe Yuan, Graham Neubig, and Pengfei Liu. 2021. BARTScore: Evaluating generated text as text generation. In Advances in Neural Information Processing Systems, volume 34, pages 27263–27277. Curran Associates, Inc.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023a. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Lei Shen, Zihan Wang, Andi Wang, Yang Li, Teng Su, Zhilin Yang, and Jie Tang. 2023b. CodeGeeX: A pre-trained model for code generation with multilingual benchmarking on HumanEval-X. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD '23, page 5673–5684, New York, NY, USA. Association for Computing Machinery.

Ming Zhong, Yang Liu, Da Yin, Yuning Mao, Yizhu Jiao, Pengfei Liu, Chenguang Zhu, Heng Ji, and Jiawei Han. 2022. Towards a unified multi-dimensional evaluator for text generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2023–2038.

Shuyan Zhou, Uri Alon, Sumit Agarwal, and Graham Neubig. 2023. CodeBERTScore: Evaluating code generation with pretrained models of code. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13921–13937, Singapore. Association for Computational Linguistics.

Terry Yue Zhuo. 2024. ICE-Score: Instructing large language models to evaluate code. In Findings of the Association for Computational Linguistics: EACL 2024, pages 2232–2242, St. Julian's, Malta. Association for Computational Linguistics.

Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al. 2024. BigCodeBench: Benchmarking code generation with diverse function calls and complex instructions. arXiv preprint arXiv:2406.15877.

A Prompt Design

Method                                   Acc.
CODEJUDGE                                81.63
CODEJUDGE w/o REF                        74.43
CoT                                      77.65
CoT w/o REF                              68.56
CoT + Few-shot                           78.22
CoT + Few-shot w/o REF                   67.61
CODEJUDGE + CoT                          78.60
CODEJUDGE + CoT w/o REF                  72.16
CODEJUDGE + Few-shot                     77.84
CODEJUDGE + Few-shot w/o REF             69.89
CODEJUDGE + CoT + Few-shot               77.27
CODEJUDGE + CoT + Few-shot w/o REF       69.51

Table 7: Average accuracy (%) across five programming languages in HumanEval-X using different prompts.

We use Chain-of-Thought (CoT) (Wei et al., 2022) and few-shot learning methods (three examples) to construct different prompts and test them using GPT-3.5-Turbo in HumanEval-X. Table 7 shows the results, helping us understand the effects of CoT and few-shot learning. These two methods slightly reduce the correlation of the results. We observe the following drawbacks of the CoT method and few-shot prompting:

• Incorrect judgments: The CoT method can cause the model to make incorrect logical judgments when thinking step by step. If the initial determination is incorrect, the model will generate incorrect evidence to support that determination. For instance, Table 8 shows that when the LLM determines that the code snippet does not cover all the required functionalities in the first sentence, it then makes the incorrect judgment that the model does not account for frequency.

• Misleading fixed code: When using the CoT method, the model may ignore the system prompt and create its own process: first, find the errors, then fix the code. This can be misleading during summarization because the model might evaluate the correctness of the fixed code instead of the original, incorrect version. For example, Table 9 shows that the LLM generates a conclusion based on the fixed code, and this conclusion further misleads the summarization step.
Task Description
You are given a non-empty list of positive integers. Return the greatest integer that is greater
than zero, and has a frequency greater than or equal to the value of the integer itself. The
frequency of an integer is the number of times it appears in the list. If no such a value exist,
return -1.
Code Snippet
def search(lst):
    # your code here
    d = {}
    for i in lst:
        if i in d:
            d[i] += 1
        else:
            d[i] = 1
    for i in sorted(d, reverse=True):
        if d[i] >= i:
            return i
    return -1
Evaluation
The code snippet provided does not cover all the required functionalities. It does not correctly
implement the condition for finding the greatest integer that is greater than zero and has a
frequency greater than or equal to the value of the integer itself.
The code snippet iterates through the dictionary keys in reverse order and returns the first key
that satisfies the condition, but it does not consider the frequency of the integer.
The reference solution correctly calculates the frequency of each integer and then iterates through
the frequencies to find the greatest integer that satisfies the condition.
Table 8: As the initial determination is that the code “does not cover all the required functionalities”, the model
provides incorrect evidence—the model does not consider the frequency—in support of the determination.
• Limitation with fewer few-shot examples: Similar to GPTScore (Fu et al., 2024), we observe a performance degradation when the number of few-shot examples is less than four. One possible reason is that LLMs become constrained by the given examples, leading to a preference for particular correctness and reduced generalization ability.

B Postprocessing Steps

For the binary evaluation task, since the LLM generates a free-form response to the summarization task, we use a regex parser that assigns a score of 1 to answers that mention “Yes” and 0 to answers that mention “No”. While this postprocessing method may sound simple, it turns out to work very well. In our experiments, we did not observe any cases where the LLMs generated ambiguous answers that cannot be handled by this method.

For the more complex code deviation estimation task, we set the initial correctness score to 100 and deduct a penalty score for each inconsistency identified by CODEJUDGE. We experimented with different penalty score settings on a small validation set, which includes 32 tasks from HumanEval (20%). We found that setting the penalty scores of Small, Major, and Fatal inconsistencies to 5, 50, and 100 points achieves the best correlation. We calculate the final score with the following equations:

S = NumSmall × 5
M = NumMajor × 50
F = NumFatal × 100        (1)
Penalty = max(−100, −(S + M + F))
Score = 1 + Penalty / 100
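The sketch below applies Equation (1) to the JSON list produced by the fault-localization prompt; it is written with a positive, capped penalty, which is equivalent to the max(−100, ·) form above, and the function and constant names are illustrative rather than taken from the released code.

import json

PENALTY = {"Negligible": 0, "Small": 5, "Major": 50, "Fatal": 100}

def deviation_score(llm_output: str) -> float:
    """Turn the LLM's JSON list of inconsistencies into a score in [0, 1]."""
    inconsistencies = json.loads(llm_output)
    total = sum(PENALTY.get(item["severity"].capitalize(), 0) for item in inconsistencies)
    penalty = min(100, total)   # cap the total penalty at 100 points
    return 1 - penalty / 100

# Example: one Major inconsistency gives 0.5, matching Figure 2's final decision.
example = '[{"inconsistency": "Logic error: sorted returns a list, not a string", "severity": "major"}]'
print(deviation_score(example))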
C Experiment Details
task. We use a regex parser that assigns a score of 1
to answers that mention “Yes” and 0 to answers that We use the official version of the HumanEval-X,
mention “No”. While this postprocessing method CoNaLa, APPS, and the BigCodeBench datasets.
may sound simple, it turns out to work very well. To generate code snippets for HumanEval-X and
In our experiments, we did not observe any cases APPS, we adopt the code from MultiPL-E.3 For
where the LLMs generated ambiguous answers that BigCodeBench, we use the official pre-generated
cannot be handled by this method. code samples from LLMs.4 For token-based meth-
For the more complex code deviation estimation 3
https://github.com/nuprl/MultiPL-E
4
task, we set the initial correctness score to 100 https://github.com/bigcode-project/bigcodebench
Task Description
You are given a word. Your task is to find the closest vowel that stands between two consonants from
the right side of the word (case sensitive). Vowels in the beginning and ending doesn’t count.
Return empty string if you didn’t find any vowel met the above condition. You may assume that the
given string contains English letter only.
Code Snippet
def get_closest_vowel(word):
    pass
Evaluation
The given problem requires finding the closest vowel that stands between two consonants from the
right side of the word. The reference solution iterates through the word from the right side and
checks if the current character is a vowel and the characters on its right and left are consonants.
If this condition is met, the function returns the current vowel. If no such vowel is found, it
returns an empty string.
The provided code snippet is incomplete and does not contain any implementation. To address the
problem statement, the code should iterate through the word from the right side and implement the
logic to find the closest vowel between two consonants.
def get_closest_vowel(word):
    if len(word) < 3:
        return ""
    vowels = {"a", "e", "i", "o", "u", "A", "E", 'O', 'U', 'I'}
    for i in range(len(word)-2, 0, -1):
        if word[i] in vowels:
            if (word[i+1] not in vowels) and (word[i-1] not in vowels):
                return word[i]
    return ""
This code snippet covers the required functionalities as per the problem statement and is
consistent with the reference solution.
Table 9: When using the CoT method, the model may design an evaluation plan by itself and generate the fixed
code. This fixed code may mislead the summarization step.
Table 10: Average single execution times (in seconds) over 100 runs.

D Latency Discussion

Table 10 shows the average execution times of CODEJUDGE using four different models over 100 runs. The results for GPT-3.5-Turbo-1106 were obtained via the official API. For CodeLlama-Instruct-34B and Llama-3-Instruct-8B, a single A100-80GB GPU was utilized. The execution times of Llama-3-Instruct-70B were measured using two A100-80GB GPUs to load the model. The generation time of CODEJUDGE is less than 20 seconds, which is reasonable for code evaluation compared to manual human annotation.

E Full Results

We report the numbers with standard deviations on the HumanEval-X dataset in Table 11. We also report the accuracy of the binary classification task on the HumanEval-X dataset in Table 12. The full results on CoNaLa, APPS, and BigCodeBench are in Table 13, Table 14, and Table 15, respectively.
F Prompts

We present the full prompts of VANILLA in Tables 16 and 17. The full prompts of CODEJUDGE are shown in Tables 18 and 19.
Java C++ Python JavaScript Go
Metric τ rs τ rs τ rs τ rs τ rs
E XISTING M ETHODS
BLEU 0.230±0.00 0.280±0.00 0.306±0.00 0.373±0.00 0.446±0.00 0.541±0.00 0.288±0.00 0.352±0.00 0.261±0.00 0.318±0.00
ROUGE-L 0.249±0.00 0.304±0.00 0.305±0.00 0.372±0.00 0.450±0.00 0.546±0.00 0.329±0.00 0.401±0.00 0.260±0.00 0.317±0.00
METEOR 0.299±0.00 0.365±0.00 0.338±0.00 0.412±0.00 0.487±0.00 0.594±0.00 0.379±0.00 0.462±0.00 0.284±0.00 0.346±0.00
chrF 0.267±0.00 0.326±0.00 0.314±0.00 0.383±0.00 0.448±0.00 0.545±0.00 0.368±0.00 0.449±0.00 0.242±0.00 0.295±0.00
CodeBLEU 0.318±0.00 0.388±0.00 0.341±0.00 0.417±0.00 0.501±0.00 0.611±0.00 0.384±0.00 0.468±0.00 0.268±0.00 0.326±0.00
RUBY 0.260±0.00 0.318±0.00 0.284±0.00 0.346±0.00 0.425±0.00 0.516±0.00 0.329±0.00 0.401±0.00 0.245±0.00 0.299±0.00
CodeBERTScoreF1 0.282±0.00 0.344±0.00 0.334±0.00 0.408±0.00 0.453±0.00 0.553±0.00 0.318±0.00 0.388±0.00 0.308±0.00 0.376±0.00
CodeBERTScoreF3 0.303±0.00 0.370±0.00 0.375±0.00 0.458±0.00 0.495±0.00 0.604±0.00 0.363±0.00 0.443±0.00 0.324±0.00 0.396±0.00
CodeLlama-Instruct-34B
VANILLA 0.300±0.01 0.300±0.01 0.345±0.01 0.345±0.01 0.489±0.03 0.489±0.03 0.316±0.03 0.316±0.03 0.314±0.01 0.314±0.01
VANILLA w/o REF 0.297±0.01 0.297±0.01 0.373±0.02 0.373±0.02 0.541±0.03 0.541±0.03 0.277±0.03 0.277±0.03 0.348±0.05 0.348±0.05
ICE-Score 0.418±0.06 0.449±0.06 0.309±0.04 0.331±0.05 0.440±0.04 0.477±0.04 0.308±0.06 0.332±0.07 0.297±0.06 0.320±0.07
ICE-Score w/o REF 0.263±0.04 0.279±0.04 0.282±0.04 0.303±0.04 0.471±0.05 0.503±0.05 0.382±0.04 0.404±0.04 0.338±0.05 0.362±0.05
C ODE J UDGE A.S. 0.515±0.04 0.515±0.04 0.464±0.03 0.464±0.03 0.625±0.00 0.625±0.00 0.503±0.03 0.503±0.03 0.354±0.02 0.354±0.02
C ODE J UDGE A.S. w/o REF 0.355±0.06 0.355±0.06 0.408±0.02 0.408±0.02 0.561±0.02 0.561±0.02 0.338±0.04 0.338±0.04 0.396±0.02 0.396±0.02
Meta-Llama-3-8B-Instruct
VANILLA 0.342±0.01 0.342±0.01 0.216±0.01 0.216±0.01 0.409±0.02 0.409±0.02 0.265±0.03 0.265±0.03 0.192±0.01 0.192±0.01
VANILLA w/o REF 0.282±0.01 0.282±0.01 0.159±0.04 0.159±0.04 0.446±0.02 0.446±0.02 0.356±0.01 0.356±0.01 0.331±0.01 0.331±0.01
ICE-Score 0.389±0.01 0.400±0.01 0.242±0.01 0.248±0.01 0.440±0.00 0.455±0.00 0.296±0.01 0.303±0.01 0.269±0.00 0.281±0.00
ICE-Score w/o REF 0.290±0.02 0.296±0.02 0.306±0.04 0.316±0.04 0.481±0.03 0.499±0.03 0.275±0.00 0.283±0.00 0.287±0.02 0.299±0.02
C ODE J UDGE 0.523±0.01 0.523±0.01 0.387±0.02 0.387±0.02 0.637±0.04 0.637±0.04 0.446±0.03 0.446±0.03 0.407±0.03 0.407±0.03
C ODE J UDGE w/o REF 0.411±0.06 0.411±0.06 0.309±0.04 0.309±0.04 0.586±0.03 0.586±0.03 0.339±0.06 0.339±0.06 0.295±0.01 0.295±0.01
Meta-Llama-3-70B-Instruct
VANILLA 0.607±0.01 0.607±0.01 0.624±0.01 0.624±0.01 0.685±0.00 0.685±0.00 0.554±0.00 0.554±0.00 0.529±0.00 0.529±0.00
VANILLA w/o REF 0.554±0.01 0.554±0.01 0.541±0.01 0.541±0.01 0.651±0.01 0.651±0.01 0.553±0.01 0.553±0.01 0.571±0.01 0.571±0.01
ICE-Score 0.552±0.00 0.576±0.00 0.516±0.01 0.543±0.01 0.626±0.01 0.654±0.01 0.471±0.00 0.490±0.00 0.389±0.01 0.411±0.01
ICE-Score w/o REF 0.509±0.01 0.531±0.00 0.507±0.00 0.533±0.00 0.591±0.00 0.620±0.00 0.425±0.00 0.444±0.00 0.478±0.00 0.508±0.00
C ODE J UDGE 0.640±0.02 0.640±0.02 0.700±0.03 0.700±0.03 0.803±0.02 0.803±0.02 0.675±0.01 0.675±0.01 0.589±0.02 0.589±0.02
C ODE J UDGE w/o REF 0.583±0.02 0.583±0.02 0.611±0.01 0.611±0.01 0.698±0.02 0.698±0.02 0.617±0.04 0.617±0.04 0.587±0.05 0.587±0.05
GPT-3.5-Turbo-1106
VANILLA 0.615 0.615 0.482 0.482 0.675 0.675 0.550 0.550 0.528 0.528
VANILLA w/o REF 0.343 0.343 0.328 0.328 0.537 0.537 0.345 0.345 0.398 0.398
ICE-Score 0.499 0.510 0.436 0.455 0.514 0.537 0.524 0.542 0.402 0.415
ICE-Score w/o REF 0.275 0.278 0.410 0.429 0.485 0.513 0.253 0.258 0.324 0.337
C ODE J UDGE 0.638 0.638 0.580 0.580 0.707 0.707 0.591 0.591 0.543 0.543
C ODE J UDGE w/o REF 0.508 0.508 0.474 0.474 0.629 0.629 0.453 0.453 0.446 0.446
Table 11: The Kendall-Tau (τ ) and Spearman (rs ) correlations of each method with semantic correctness on
HumanEval-X in multiple languages. “w/ REF” indicates that this method contains the reference code in the prompt.
The correlation coefficients are reported across three runs using open-source models, along with the standard
deviation.
Method Java C++ Python JavaScript Go
CodeLlama-Instruct-34B
VANILLA 57.07±0.01 61.11±0.01 72.22±0.01 58.33±0.01 62.37±0.00
VANILLA w/o REF 59.09±0.00 65.91±0.01 73.48±0.02 58.84±0.02 57.32±0.02
C ODE J UDGE 75.00±0.02 75.25±0.01 80.56±0.00 73.74±0.01 75.51±0.01
C ODE J UDGE w/o REF 67.93±0.03 73.48±0.01 78.03±0.01 66.16±0.02 71.97±0.01
Meta-Llama-3-8B-Instruct
VANILLA 57.83±0.00 47.47±0.01 67.42±0.01 55.05±0.01 47.73±0.01
VANILLA w/o REF 58.84±0.01 47.47±0.02 70.20±0.01 62.12±0.01 60.10±0.00
C ODE J UDGE 74.49±0.01 65.91±0.01 81.57±0.02 69.44±0.02 69.70±0.02
C ODE J UDGE w/o REF 70.20±0.03 66.16±0.02 78.79±0.01 65.15±0.02 66.16±0.01
Meta-Llama-3-70B-Instruct
VANILLA 78.28±0.00 79.29±0.00 83.33±0.00 74.24±0.00 73.48±0.00
VANILLA w/o REF 75.51±0.00 75.51±0.00 82.07±0.01 75.76±0.01 78.03±0.01
C ODE J UDGE 81.31±0.01 84.60±0.02 90.15±0.01 81.82±0.01 80.30±0.01
C ODE J UDGE w/o REF 79.55±0.01 81.82±0.01 84.60±0.01 80.56±0.02 81.82±0.02
GPT-3.5-Turbo-1106
VANILLA 77.27 71.21 82.07 72.98 76.26
VANILLA w/o REF 60.86 67.17 74.24 61.36 62.12
C ODE J UDGE 81.57 78.28 85.35 78.28 79.29
C ODE J UDGE w/o REF 73.48 72.22 80.81 68.43 70.71
Table 12: Accuracies (%) across five programming languages in the binary classification task of HumanEval-X
dataset. The accuracies are reported across three runs using open-source models, along with the standard deviation.
Table 13 (CoNaLa)
Method                    τ       rs
BLEU                      0.437   0.485
ROUGE-L                   0.450   0.501
METEOR                    0.412   0.463
chrF                      0.457   0.514
CodeBLEU                  0.292   0.332
RUBY                      0.332   0.373
CodeBERTScoreF1           0.499   0.558
CodeBERTScoreF3           0.485   0.542
CodeLlama-Instruct-34B
VANILLA                   0.317   0.344
VANILLA w/o REF           0.448   0.486
ICE-Score                 0.397   0.425
ICE-Score w/o REF         0.534   0.572

Table 14 (APPS)
Method                    τ            rs
EXISTING METHODS
BLEU                      0.035±0.00   0.042±0.00
ROUGE-L                   0.035±0.00   0.043±0.00
METEOR                    0.085±0.00   0.104±0.00
chrF                      0.036±0.00   0.044±0.00
CodeBLEU                  0.135±0.00   0.164±0.00
RUBY                      0.092±0.00   0.113±0.00
CodeBERTScoreF1           -0.003±0.00  -0.003±0.00
CodeBERTScoreF3           0.008±0.00   0.010±0.00
CodeLlama-Instruct-34B
VANILLA                   0.005±0.05   0.005±0.05
VANILLA w/o REF           0.080±0.00   0.080±0.00
ICE-Score                 0.174±0.06   0.185±0.06
ICE-Score w/o REF         -0.032±0.02  -0.034±0.02
Table 16: Full prompt of VANILLA baseline for binary assessment task. Blue text is an example of model output.
Brown text is the problem, reference, and code we provide to LLMs.
Determine the helpfulness of the code snippet. Output a score from 0 to 4 where 0 means the code
snippet is not helpful at all and 4 means the code snippet is very helpful.
Helpfulness (0-4): 4
Table 17: Full prompt of VANILLA baseline for code deviation assessment task. Blue text is an example of model
output. Brown text is the problem, reference, and code we provide to LLMs.
Analysis Subtask
You will be provided with a problem statement and a code snippet that supposedly addresses the
problem in {LANGUAGE}.
Your task is to check if the code snippet covers the required functionalities. Do not provide a
corrected version.
Evaluation Steps:
1. Read the problem statement carefully and identify the required functionalities of the
implementation. You can refer to the example to understand the problem better.
2. Read the code snippet and analyze its logic. Check if the code snippet covers all the required
functionalities of the problem.
3. Finally, conclude your evaluation.
Table 18: Full prompt of the ANALYZE THEN SUMMARIZE method. Blue text is an example of model output. Brown text is the problem and code we provide to LLMs.
You will be provided with a problem statement, a code snippet that supposedly addresses the problem,
and a catalog of code inconsistencies.
Evaluation Steps:
1. Read the problem statement carefully to identify the functionalities required for the
implementation.
2. Read the code snippet and compare it to the problem statement. Check if the code snippet covers
the required functionalities.
3. Output your answer in a JSON format list.
a) If the code snippet is correct, output: ["inconsistency": "None", "severity": "Negligible"].
b) If the code snippet is incorrect, output the identified inconsistencies and their severity
according to the catalog of code inconsistencies. For example: ["inconsistency": "<inconsistency1>",
"severity": "<severity1>", "inconsistency": "<inconsistency2>", "severity": "<severity2>", ...]
Problem: {PROBLEM}
Table 19: Full prompt of the FAULT LOCALIZATION method. Blue text is an example of model output. Brown text is the problem and code we provide to LLMs.