
LLM4VV: Exploring LLM-as-a-Judge for

Validation and Verification Testsuites


Zachariah Sollenberger* Jay Patel* Christian Munley Aaron Jarmusch Sunita Chandrasekaran
University of Delaware University of Delaware University of Delaware University of Delaware University of Delaware
Newark, DE Newark, DE Newark, DE Newark, DE Newark, DE
[email protected]

arXiv:2408.11729v2 [cs.SE] 22 Aug 2024

Abstract—Large Language Models (LLMs) continue to improve and are revolutionizing the landscape of software development. These large models have demonstrated the potential to generate, debug, test, analyze, document, and even translate code; thus they are a valuable tool in the software development cycle. If used correctly, such tools can often accelerate the development cycle. Though the tools are powerful and new, the community is cautious of training using biased or sensitive data, which can lead to biased, dangerous, or incorrect outputs along with the inadvertent release of confidential information. Additionally, the carbon footprints and the un-explainability of these "black box" models continue to raise questions about the reliability of LLMs.

With these opportunities and these challenges ahead, this paper explores the idea of "judging" LLM-generated code to better understand and "open up" the un-explainable "black box" models used by LLMs. We probe into the black box of one such LLM that has generated the best compiler tests for the directive-based programming models OpenMP and OpenACC in our earlier research. We challenge DeepSeek's deepseek-coder-33B-instruct model with intentionally erroneous code, and we also define relevant metrics and adopt an agent-based approach to evaluate the LLM and assess its capabilities as an LLM-as-a-judge. We also develop a pipeline-based approach to streamline the entire workflow. Finally, we make use of all of these strategies together to develop a more reliable method for automatically validating LLM-generated compiler tests. Based on our results, utilizing an agent-based prompting approach and setting up a validation pipeline structure drastically increased the quality of deepseek-coder-33B-instruct's evaluation of the tests that are used to validate compiler implementations of directive-based parallel programming models.

*Authors Zachariah and Jay contributed equally to this manuscript.

I. INTRODUCTION

Large Language Models (LLMs) have recently revolutionized the field of computer science. Popular models like BERT [1], GPT-4 [2], Gemini [3], and more are trained on an objective such as predicting the next words or tokens in a text, and demonstrate capabilities to process, recognize, and understand human languages at impressive levels. LLMs achieve this feat with the help of a subsection of machine learning known as deep learning. LLMs use a type of deep-learning architecture called transformers. With the combination of self-attention, positional encoding, feed-forward networks, multi-head attention [4], and other key components, the transformer architecture can be trained on internet-scale text datasets using self-supervised learning and learn to model language effectively.

With the wide-ranging capabilities provided by LLMs, this paper explores the idea of using an LLM-as-a-judge (LLMJ) to evaluate tests written to verify and validate compiler implementations. We chose DeepSeek's deepseek-coder-33B-instruct model [5] for this purpose because in a recently published work of ours [6], we found that the deepseek-coder-33B-instruct model demonstrated the best capability to generate directive-based parallel programming model codes among the several LLMs we tested for that purpose (the directive-based parallel programming models being OpenACC [7] and OpenMP [8]). This LLM generated codes with a high compilation and pass rate compared to other popular LLMs, such as GPT-4 Turbo [9], Codellama-34b-Instruct [10], and GPT-3, as narrated in the published paper.

LLMs are being widely considered for tasks such as code generation, summarization, and refactoring [11], [12], [13], [14]. However, the application of LLMJs specifically to evaluating tests used for verifying and validating compiler implementations of directive-based parallel programming models is a new topic. This paper investigates this topic and explores the application of the LLMJ technique.

The potential of LLMJ is enabled by the training of the model on a large number of codes, which allows it to comprehend code and assess a given piece of code based on user-specified metrics. The LLM processes the input data in its large network of parameters, and at a level of abstraction it is using some pattern recognition and learned knowledge to generate text. The LLMJ produces an output that reflects its judgment against the defined criteria. This process can take various forms, such as analyzing code for errors, determining syntactical correctness, and predicting the accuracy of code implementation, among others.

Our reason for exploring this usage of an LLMJ is to help automate the creation of functional validation and verification test suites for directive-based parallel programming models. The challenge we are currently facing in this process is finding a method to accurately evaluate the correctness of tests generated by an LLM. The objective of our research in this paper is to minimize or potentially remove the need for human intervention or involvement in this process by utilizing an LLMJ.
The approach in this paper could be beneficial to developers beyond directive-based programming models. Any developer would need to verify and validate their software, and an LLMJ could serve that purpose, as it takes significantly less labor and time compared to a human evaluating the code.

The paper makes the following contributions:
• Creating and defining metrics to evaluate LLM-generated code
• Developing negative-probing methods and a benchmark to evaluate a given LLM's performance as a judge
• Evaluating the capability of deepseek-coder-33B as a judge by using an agent-based approach

II. RELATED WORK

Several recent studies have demonstrated the capabilities of LLMs in generating parallel programs. For instance, Nichols [15] proposed a reinforcement learning method to improve the speedup of generated codes, while LM4HPC [16] presented various datasets and a tokenizer for HPC-related code generation. Oren et al. [17] explored AST representations of code, Godoy et al. [18] evaluated OpenAI Codex for HPC kernel generation, and Valero-Lara et al. [19] explored Llama-2 and GPT-3 for kernel generation.

Another line of work involves using LLMs as judges to evaluate other models on open-ended questions. Zheng et al. [20] presented the concept of using strong LLMs as judges to identify biases in other models, achieving an 80% agreement with human preferences. This study demonstrates the potential of LLMJs in the HPC realm.

Other studies have explored LLMs for developing test cases for applications beyond compiler V&V. For instance, Schäfer et al. [21] evaluated LLMs for automated JavaScript unit test generation, while other works have investigated LLM-based test case generation for various programming languages and software systems [22], [23].

Finally, there are several copilot models being integrated into IDEs, such as GitHub Copilot [24] and Cursor [25], which leverage LLMs to assist developers in writing code. These models have shown promising results in improving coding productivity and reducing errors.

Overall, these studies demonstrate the growing interest in exploring the potential of LLMs for software development and automation. Our work builds upon this trend by investigating the use of LLMs for compiler V&V, with a focus on improving the accuracy and efficiency of the verification process.

III. METHODOLOGY

To determine how deepseek-coder-33b-instruct performs as an LLMJ, we outline in this section our strategies, including negative probing, an agent-based approach, and a validation pipeline to streamline the process.

A. Negative Probing

Manually written compiler tests from the OpenACC V&V [26] and OpenMP V&V [27] repositories were split into two groups: one containing code that had been modified to include various errors, and the other containing code that remained unchanged. The idea is to intentionally create invalid variations of otherwise valid code in order to determine and understand how an LLM as a "black box" assesses code. We term this process negative probing. The modifications applied to Group 1 include:

Group 1: Variations of negative probing
• 0. Removed memory allocation / replaced directives with a different, syntactically incorrect directive
• 1. Removed an opening bracket
• 2. Added use of an undeclared variable
• 3. Replaced file with randomly generated non-OpenACC & non-OpenMP code
• 4. Removed last bracketed section of code

Group 2: Unchanged manually written codes:
• 5. No changes to code

First, we split the manually-written test files in half randomly and create a modified, invalid suite and an unchanged, valid suite. We prompt the deepseek-coder-33B-instruct model [5] one test at a time, instruct the model to judge the two different groups with predefined criteria, and record the evaluations for each file. A sketch of this mutation procedure is shown at the end of this subsection.

Listing 1 shows the criteria we use in prompting to review and evaluate an OpenACC code:

Listing 1: Criteria for Evaluation - an Example Prompt
1 Syntax: Ensure all OpenACC directives and pragmas are syntactically correct.
2 Directive Appropriateness: Check if the right directives are used for the intended parallel computations.
3 Clause Correctness: Verify that all clauses within the directives are correctly used according to OpenACC specifications.
4 Memory Management: Assess the accuracy of data movement between CPU and GPU.
5 Compliance: Ensure the code adheres to the latest OpenACC specifications and best practices.
6 Logic: Verify that the logic of the test (e.g. performing the same computation in serial and parallel and comparing) is correct.

By observing how the LLM judged both groups of files, and by recording the specific modifications made to each file, we were able to identify the areas where the LLMJ did well and where it encountered challenges. We were also able to measure and judge the overall accuracy of the LLM. This type of analysis allows for insights into the strengths and weaknesses of an LLM's assessment capabilities.
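The following is a minimal sketch of how such negative-probing variants could be produced. The mutation helpers, the random assignment of issue IDs, and the file layout are illustrative assumptions rather than the exact tooling used in this work; issue IDs 0-4 correspond to the Group 1 variations above, and 5 leaves the file unchanged.

```python
import random
import re
from pathlib import Path

# Sketch of the Group 1 mutations; issue IDs 0-4 inject errors, 5 leaves the file unchanged.
def mutate(source: str, issue_id: int) -> str:
    if issue_id == 0:  # swap directives for a syntactically incorrect one
        return re.sub(r"#pragma\s+(acc|omp)\b[^\n]*", "#pragma acc not_a_real_directive(", source)
    if issue_id == 1:  # remove an opening bracket
        return source.replace("{", "", 1)
    if issue_id == 2:  # add use of an undeclared variable
        return source.replace("{", "{\n    undeclared_var += 1;", 1)
    if issue_id == 3:  # replace the file with non-OpenACC/non-OpenMP filler code
        return '#include <stdio.h>\nint main(void) { printf("hello\\n"); return 0; }\n'
    if issue_id == 4:  # remove the last bracketed section of code
        return source[: source.rfind("{")]
    return source      # issue ID 5: no changes to code

def build_suites(test_dir: str, out_dir: str, seed: int = 0):
    """Split the manually written tests in half at random: one half mutated (invalid suite),
    one half left untouched (valid suite). Returns (path, issue_id) pairs that are later
    compared against the LLMJ's verdicts."""
    rng = random.Random(seed)
    files = sorted(Path(test_dir).glob("*.c"))
    rng.shuffle(files)
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    half = len(files) // 2
    records = []
    for i, path in enumerate(files):
        issue_id = rng.choice([0, 1, 2, 3, 4]) if i < half else 5
        out_path = Path(out_dir) / f"issue{issue_id}_{path.name}"
        out_path.write_text(mutate(path.read_text(), issue_id))
        records.append((str(out_path), issue_id))
    return records
```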
B. Agent-based Approach for LLM-as-a-Judge (LLMJ)

An agent-based approach involves treating the LLM as an autonomous agent that interacts with its environment and utilizes various tools to improve the quality of its outputs. In the context of using an LLMJ, the agent-based approach entails collecting and making use of external information about each file, such as compilation and execution error messages, outputs, and return codes, and providing this information to the LLMJ within the prompt. The LLMJ then utilizes the provided information to evaluate the file, deeming it either valid or invalid. Figure 1 demonstrates how the agent-based LLMJ works.

Fig. 1: An Overview of Agent-based Approach for LLMJ

Listing 2 shows how the tool use is incorporated in the prompting to provide additional information to the LLM to help it review an OpenACC code and evaluate it based on user-specified criteria:

Listing 2: Agent-based LLMJ - an Example Prompt
1 Syntax: Ensure all OpenACC directives and pragmas are syntactically correct.
2 Directive Appropriateness: Check if the right directives are used for the intended parallel computations.
3 Clause Correctness: Verify that all clauses within the directives are correctly used according to OpenACC specifications.
4 Memory Management: Assess the accuracy of data movement between CPU and GPU.
5 Compliance: Ensure the code adheres to the latest OpenACC specifications and best practices.
6 Logic: Verify that the logic of the test (e.g. performing the same computation in serial and parallel and comparing) is correct.
7 Based on these criteria, evaluate the code and determine if it is a valid or invalid test. Think step by step.
8 You MUST include the exact phrase "FINAL JUDGEMENT: valid" in your response if you deem the test to be valid.
9 If you deem the test to be invalid, include the exact phrase "FINAL JUDGEMENT: invalid" in your response instead.
10 Here is some information about the code to help you.
11 When compiled with a compliant OpenACC compiler, the below code causes the following outputs:
12 Compiler return code: {Compiler's return code}
13 Compiler STDERR: {Compiler's STDERR}
14 Compiler STDOUT: {Compiler's STDOUT}
15 When the compiled code is run, it gives the following results:
16 Return code: {Program's return code}
17 STDERR: {Program's STDERR}
18 STDOUT: {Program's STDOUT}

Through this method, the LLM is able to obtain more information about the file to aid in its evaluation. A sketch of how this information can be gathered and substituted into the prompt follows.
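One way to realize this tool use is sketched below: the candidate test is compiled and run, and the collected return codes and output streams are substituted into a Listing 2-style context block. The compiler command (nvc with -acc), the timeout, and the exact template text, including appending the test code itself, are assumptions for illustration rather than the project's actual harness.

```python
import subprocess
from pathlib import Path

# Tail of a Listing 2-style prompt; the judging criteria (lines 1-9 above) would be prepended.
AGENT_CONTEXT_TEMPLATE = """Here is some information about the code to help you.
When compiled with a compliant OpenACC compiler, the below code causes the following outputs:
Compiler return code: {compile_rc}
Compiler STDERR: {compile_err}
Compiler STDOUT: {compile_out}
When the compiled code is run, it gives the following results:
Return code: {run_rc}
STDERR: {run_err}
STDOUT: {run_out}
Here is the code:
{code}
"""

def run_cmd(cmd, timeout=120):
    # Collect return code, stdout, and stderr so they can be pasted into the prompt.
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return proc.returncode, proc.stdout, proc.stderr

def build_agent_context(test_file: str, compiler=("nvc", "-acc")) -> str:
    exe = str(Path(test_file).with_suffix(".x"))
    compile_rc, compile_out, compile_err = run_cmd([*compiler, test_file, "-o", exe])
    if compile_rc == 0:
        run_rc, run_out, run_err = run_cmd([exe])
    else:
        run_rc, run_out, run_err = "N/A", "", ""  # nothing to execute if compilation failed
    return AGENT_CONTEXT_TEMPLATE.format(
        compile_rc=compile_rc, compile_err=compile_err, compile_out=compile_out,
        run_rc=run_rc, run_err=run_err, run_out=run_out,
        code=Path(test_file).read_text(),
    )
```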

C. Validation Pipeline utilizing LLMJ

In order to efficiently evaluate the validity of compiler tests, it may not always be feasible to compile, run, and have an LLM evaluate each and every test that requires verification. Performing all three processes on every single file being verified can quickly become a time-consuming and costly task, especially when verifying LLM-generated codes with a high occurrence of invalidity or a large volume of candidate tests. To streamline this task, we re-organized the three processes into a pipeline infrastructure, as shown in Figure 2, to both optimize the overarching task by reducing the number of unnecessary steps and increase the throughput of files for verification via pipeline stages and parallel processing.

Fig. 2: An Overview of the Validation Pipeline

The driving concept behind the pipeline is that a file that fails an earlier stage of the pipeline does not need to be passed to the next stage, as it has already demonstrated its invalidity. Within this validation pipeline infrastructure, files are first compiled, then executed, and finally judged by an agent-based LLMJ. Each file being processed is first queued for compilation, which can be done either by a single thread or by a pool of threads in parallel. Files that successfully compile are then queued for execution, which can again be done synchronously in a single thread or asynchronously by a second thread pool.

Finally, files that exit with return code 0 are queued for evaluation by an agent-based LLMJ. This stage can also be parallelized if there are enough available GPU resources, but it can also be done by a single thread running synchronously or asynchronously. In this manner, unnecessary operations are reduced by preventing invalid files from continuing through the pipeline, throughput is increased by the staged architecture, and the overarching task of verification can utilize all available resources via parallel and/or asynchronous computing. A sketch of this staged structure is given below.
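The sketch below illustrates the three-stage structure just described, assuming compile_file(), execute_file(), and llmj_evaluate() helpers exist (for example, built on the subprocess and prompting sketches earlier in this section). For brevity the stages are shown as batched map calls over thread pools rather than streaming queues; files that fail an earlier stage never reach the later, more expensive stages.

```python
from concurrent.futures import ThreadPoolExecutor

def validate(files, compile_file, execute_file, llmj_evaluate, workers=4):
    """Run the compile -> execute -> LLMJ pipeline and return a verdict per file."""
    results = {}

    with ThreadPoolExecutor(max_workers=workers) as compile_pool, \
         ThreadPoolExecutor(max_workers=workers) as run_pool, \
         ThreadPoolExecutor(max_workers=1) as judge_pool:   # assume one GPU -> one judge thread

        # Stage 1: compile every candidate file in parallel.
        compiled = dict(zip(files, compile_pool.map(compile_file, files)))

        # Stage 2: only files that compiled cleanly are executed.
        runnable = [f for f, rc in compiled.items() if rc == 0]
        ran = dict(zip(runnable, run_pool.map(execute_file, runnable)))

        # Stage 3: only files that exited with return code 0 reach the agent-based LLMJ.
        judgeable = [f for f, rc in ran.items() if rc == 0]
        verdicts = dict(zip(judgeable, judge_pool.map(llmj_evaluate, judgeable)))

    for f in files:
        if compiled.get(f) != 0:
            results[f] = "invalid (failed to compile)"
        elif ran.get(f) != 0:
            results[f] = "invalid (runtime failure)"
        else:
            results[f] = verdicts.get(f, "invalid")
    return results
```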
To determine the accuracy of this method for compiler test validation, we performed the negative probing technique again, but instead of only recording the LLMJ's evaluations, we recorded each file's compilation data, execution data, and its evaluation from an agent-based LLMJ. This not only allowed us to determine which files would have passed through the pipeline architecture, and thus to determine the accuracy of the validation pipeline, but also allowed us to determine the performance and accuracy of an agent-based LLMJ on its own. The sketch below illustrates this retroactive reconstruction of the pipeline's verdicts from the recorded data.
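A minimal sketch of that retroactive analysis follows. Each record is assumed to hold the data logged during negative probing: the compiler return code, the program return code, the agent-based LLMJ's verdict, and the injected issue ID (0-4 invalid, 5 valid); the field names are illustrative.

```python
def pipeline_verdict(record):
    """Reconstruct what the full pipeline would have decided for one file."""
    if record["compile_rc"] != 0:
        return "invalid"            # would have been dropped at the compile stage
    if record["run_rc"] != 0:
        return "invalid"            # would have been dropped at the execution stage
    return record["llmj_verdict"]   # "valid" or "invalid" from the agent-based LLMJ

def ground_truth(record):
    return "valid" if record["issue_id"] == 5 else "invalid"

def pipeline_accuracy(records):
    correct = sum(pipeline_verdict(r) == ground_truth(r) for r in records)
    return correct / len(records)
```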
D. Experimental Setup

For this paper, we used the high-performance computing cluster Perlmutter located at Lawrence Berkeley National Laboratory [28]. Each node of Perlmutter is equipped with four NVIDIA A100 GPUs and one AMD EPYC 7763 CPU. Manually written test suites from the OpenACC and OpenMP Validation and Verification test suites were used for negative probing. The experiments were conducted on C, C++, and a small set of Fortran files.

IV. DEFINING METRICS

To determine the effectiveness of the LLMJ, we utilized three metrics:
• Per-issue evaluation accuracy: where the issue is the intentional error introduced into each file during negative probing
• Overall evaluation accuracy: which does not take into account the issue in each file
• Bias: a numerical measurement of the LLMJ's tendency to fail valid files or pass invalid files when an incorrect evaluation is made

All metrics were calculated with the results from negative probing. In order to determine whether the LLMJ's evaluations were accurate or not, the following system of verification was used to determine the validity of each file:
• Files with issue IDs ranging from 0-4 are considered invalid, as they have been altered to include errors.
• Files with issue ID 5 are considered valid, as they remain unchanged.

The first metric, i.e., per-issue evaluation accuracy, was determined by categorizing the LLMJ's evaluations according to each file's issue ID, and then observing the percentage of correct LLMJ evaluations in each category. The second metric, i.e., the overall accuracy, was determined by observing the percentage of correct LLMJ evaluations regardless of the issues injected into each file. Finally, the third metric, i.e., bias, was determined by numerically measuring the LLMJ's tendency to fail a valid file or to pass an invalid one when it made a mistake. A positive bias means that when the LLMJ makes a mistake, it is more likely to be one of permissiveness (passing an invalid file), whereas a negative bias means that a mistake is more likely to be one of restrictiveness (failing a valid file).

For the purposes of numerical analysis, "Correct", "Passing", and "Valid" were mapped to 0, and "Incorrect", "Failing", and "Invalid" were mapped to 1. Based on this definition, we can numerically evaluate the performance of the LLMJ for each issue type.

The following data points are recorded or calculated on a per-issue basis:
• Count: the number of files that correspond to each issue ID.
• Correct/Incorrect Judgments: the number of correct and mistaken evaluations made by the LLMJ on files corresponding to each issue ID, determined by comparing the LLMJ's evaluations against each file's validity according to the above verification system.
• Accuracy: calculated by first determining the number of correct evaluations made by the LLMJ (equal to the count value minus the mistakes value for each issue ID), and dividing that number by the number of files with the same issue ID. The resulting value represents the percentage of correct evaluations made by the LLMJ for each issue ID.

Additionally, we conducted a numerical evaluation to assess the overall accuracy and bias:
• Overall evaluation accuracy: calculated by determining the total number of correct evaluations and dividing it by the total number of files, regardless of each file's issue ID.
• Bias: calculated by first determining a total bias value. 1 is added to the total for each mistaken evaluation of an invalid file, and 1 is subtracted from the total for each mistaken evaluation of a valid file. The resulting total is then divided by the total number of mistaken evaluations, giving a value in the range [-1, 1].

These metrics allowed us to create profiles of multiple different approaches and setups for verifying compiler tests when each approach was subjected to negative probing. A sketch of these metric computations is shown below.
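The sketch below computes the three metrics exactly as defined above; the record format (issue_id plus a "valid"/"invalid" verdict string) is an assumption used only for illustration.

```python
from collections import defaultdict

def is_correct(record):
    truth = "valid" if record["issue_id"] == 5 else "invalid"
    return record["verdict"] == truth

def per_issue_accuracy(records):
    by_issue = defaultdict(list)
    for r in records:
        by_issue[r["issue_id"]].append(is_correct(r))
    return {issue: sum(v) / len(v) for issue, v in by_issue.items()}

def overall_accuracy(records):
    return sum(is_correct(r) for r in records) / len(records)

def bias(records):
    """+1 for each invalid file mistakenly passed, -1 for each valid file mistakenly failed,
    divided by the total number of mistakes; the result lies in [-1, 1]."""
    total, mistakes = 0, 0
    for r in records:
        if is_correct(r):
            continue
        mistakes += 1
        total += 1 if r["issue_id"] != 5 else -1
    return total / mistakes if mistakes else 0.0
```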
V. ANALYSIS OF DEEPSEEK-CODER-33B-INSTRUCT AS AN LLMJ

This section discusses results from analyzing deepseek-coder-33B-instruct as an LLMJ. We do so in two parts.
• Part One: We discuss results derived from using the LLMJ by itself through negative probing.
• Part Two: We discuss results from two different prompting styles for an agent-based LLMJ and a validation pipeline that utilizes an agent-based LLMJ.

A. Results for Part One

Initial experimentation began with an analysis of the LLMJ technique itself. Two test suites were put together with negative probing to test the LLM against: one suite for OpenMP (containing only C files, due to time constraints), and one suite for OpenACC (containing C, C++, and Fortran files).

After assembling the testsuites, we loaded the deepseek-coder-33b-instruct model onto one node of Perlmutter and used the following prompt for each file, as shown in Listing 3. Because the prompt asks the LLM to directly evaluate the code provided, we called this prompt a direct analysis prompt. (A sketch of how the model is queried and how the FINAL JUDGEMENT phrase is parsed follows Listing 3.)

Listing 3: Direct Analysis - an Example Prompt
1 Review the following OpenACC/OpenMP code and evaluate it based on the following criteria:
2
3 Syntax: Ensure all OpenACC/OpenMP directives and pragmas are syntactically correct.
4 Directive Appropriateness: Check if the right directives are used for the intended parallel computations.
5 Clause Correctness: Verify that all clauses within the directives are correctly used according to OpenACC/OpenMP specifications.
6 Memory Management: Assess the accuracy of data movement between CPU and GPU.
7 Compliance: Ensure the code adheres to the latest OpenACC specifications and best practices.
8 Logic: Verify that the logic of the test (e.g. performing the same computation in serial and parallel and comparing) is correct.
9 Based on these criteria, evaluate the code in a brief summary, then respond with precisely "FINAL JUDGEMENT: correct" (or incorrect).
10 You MUST include the exact phrase "FINAL JUDGEMENT: correct" in your evaluation if you believe the code is correct. Otherwise, you must include the phrase "FINAL JUDGEMENT: incorrect" in your evaluation.
11 Here is the code:
12 {C/C++/Fortran file content}
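The following is a minimal sketch of prompting the judge model and extracting its verdict with the Hugging Face Transformers library. The model identifier, chat formatting, and generation settings are illustrative assumptions, not the exact configuration used for these experiments.

```python
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "deepseek-ai/deepseek-coder-33b-instruct"  # assumed Hugging Face model id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def judge(prompt: str) -> str:
    """Return 'correct', 'incorrect', 'valid', 'invalid', or 'unparsed' from the model response."""
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(inputs, max_new_tokens=1024, do_sample=False)
    response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
    # The prompts above require an exact "FINAL JUDGEMENT: ..." phrase, which is parsed here.
    match = re.search(r"FINAL JUDGEMENT:\s*(correct|incorrect|valid|invalid)", response, re.IGNORECASE)
    return match.group(1).lower() if match else "unparsed"
```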
The LLM's response and evaluation were then recorded for each file, and we performed an analysis of the data. Table I and Table II show the per-issue accuracy of deepseek-coder-33B-instruct's evaluations for OpenACC and OpenMP files, respectively.

As Table I demonstrates, deepseek-coder-33B-instruct struggled to recognize basic syntax errors and test logic errors in OpenACC files, and was only able to accurately determine whether the test contained any OpenACC directives or routines at all. Meanwhile, Table II shows that the LLMJ was significantly better at recognizing syntax errors in OpenMP files, while struggling a bit more to recognize OpenMP errors and test logic errors. Notably, the LLMJ was almost entirely incapable of recognizing when a file did not contain any OpenMP at all.

Table III shows the overall performance of deepseek-coder-33B-instruct as a judge for OpenACC as well as OpenMP. Surprisingly, despite OpenMP having existed for a longer period of time, deepseek-coder-33B-instruct demonstrated a higher overall accuracy when evaluating OpenACC files. However, it also exhibited a much higher positive bias for OpenACC than for OpenMP, demonstrating a strong tendency for its mistakes to involve passing an invalid file.

B. Results for Part Two

Based on these results, we concluded that it would be necessary to equip the LLMJ with more tools in order to improve its accuracy. We designed the validation pipeline and implemented an agent-based approach for the LLMJ, and created larger testsuites for OpenMP and OpenACC (using C and C++ files from the manually-written testsuites for both). Many OpenMP offloading compilers do not support all OpenMP features introduced after version 4.5. To reduce the likelihood of this inconsistent feature support affecting our results, we only included files that used OpenMP 4.5 or lower, to ensure that the LLVM OpenMP offloading compiler we used would be fully compliant for all features present. For OpenACC, we used NVIDIA's HPC SDK nvc compiler. For now, we have experimented mostly with C/C++ files, with an aim to include Fortran files in the near future.

We theorize that the wording of our direct analysis prompt in Listing 3 was causing the LLM to provide results based on examples of code reviews online instead of its knowledge of OpenMP and OpenACC. To remedy this, we re-wrote the prompt and instructed the LLM to generate a detailed description of the code provided, and then determine whether that description fit the profile of a valid compiler test. In this way, the LLM would be indirectly evaluating the code, so we referred to it as an indirect analysis prompt. With this approach, the LLM would hopefully base its response on its knowledge of OpenMP and OpenACC (when generating a description of the code), and its knowledge of compiler tests (when analyzing the description). The following shows the indirect analysis prompt that we created:

Listing 4: Indirect Analysis - an Example Prompt
1 Describe what the below OpenACC/OpenMP program will do when run. Think step by step.
2 Here is some information about the code to help you; you do not have to compile or run the code yourself.
3 When the below code is compiled with an OpenACC/OpenMP-compliant compiler, the compiler gives the following outputs:
4 Compiler return code: {return code}
5 Compiler STDERR: {STDERR}
6 Compiler STDOUT: {STDOUT}
7 When the compiled code is run, it gives the following results:
8 Return code: {return code}
9 STDOUT: {STDOUT}
10 STDERR: {STDERR}
11 Using this information, describe in full detail how the below code works, what the below code will do when run, and suggest why the below code might have been written this way.
12 Then, based on that description, determine whether the described program would be a valid or invalid compiler test for {flavor} compilers.
13 You MUST include the exact phrase "FINAL JUDGEMENT: valid" in your final response if you believe that your description of the below OpenACC/OpenMP code describes a valid compiler test; otherwise, your final response MUST include the exact phrase "FINAL JUDGEMENT: invalid".
14 Here is the code for you to analyze: {C/C++/Fortran file}
TABLE I: LLMJ Negative Probing Results for OpenACC

OpenACC Issue Type | Total Count | Correct Judgments | Incorrect Judgments | Accuracy
Removed ACC memory allocation / swapped ACC directive | 203 | 31 | 172 | 15%
Removed an opening bracket | 125 | 15 | 110 | 12%
Added use of undeclared variable | 108 | 16 | 92 | 15%
Replaced file with randomly-generated non-OpenACC code | 117 | 94 | 23 | 80%
Removed last bracketed section of code | 114 | 14 | 100 | 12%
No issue | 668 | 586 | 82 | 88%

TABLE II: LLMJ Negative Probing Results for OpenMP

OpenMP Issue Type | Total Count | Correct Judgments | Incorrect Judgments | Accuracy
Removed OMP memory allocation / swapped OMP directive | 59 | 28 | 31 | 47%
Removed an opening bracket | 39 | 29 | 10 | 74%
Added use of undeclared variable | 33 | 21 | 12 | 64%
Replaced file with randomly-generated non-OpenMP code | 51 | 2 | 49 | 4%
Removed last bracketed section of code | 33 | 11 | 22 | 33%
No issue | 216 | 84 | 132 | 39%

TABLE III: LLMJ Overall Negative Probing Results

Datapoint | OpenACC | OpenMP
Total Count | 1335 | 431
Total Mistakes | 579 | 256
Overall Accuracy | 56.63% | 40.60%
Bias | 0.717 | -0.031

For each testsuite, we then compiled, executed, and used both LLMJ prompts to evaluate each file while recording the compilation data, execution data, and evaluations. We ran each file through every stage of the validation pipeline; however, for this experiment, we did not prevent invalid files from continuing through the pipeline. This way, we could gather information about both agent-based LLMJs, and retroactively verify how the entire validation pipeline would have performed on the data by checking the compilation, execution, and evaluation status of each file.

To simplify the data analysis:
• LLMJ 1: The agent-based LLMJ that used the direct analysis prompt
• LLMJ 2: The agent-based LLMJ that used the indirect analysis prompt
• Pipeline 1: Validation pipeline outputs computed with LLMJ 1's evaluation
• Pipeline 2: Validation pipeline outputs computed with LLMJ 2's evaluation

We then compared the performances of the two pipelines against each other, and compared the two agent-based LLMJs against each other and against the non-agent-based LLMJ.

Table IV shows the results of the two pipelines on the OpenACC testsuite. As can be seen, the two pipelines performed almost identically, though Pipeline 2 demonstrated a higher ability to recognize errors in the test's logic, and Pipeline 1 demonstrated a higher ability to recognize when a file contained no errors. Table V also shows a similarity between the two pipelines' performances, though in this case, Pipeline 2 was slightly worse at recognizing OpenMP errors and significantly better at recognizing a lack of OpenMP code.

Table VI shows the overall performance of both pipelines across both OpenACC and OpenMP. Both pipelines were significantly more accurate for OpenMP than for OpenACC, though Pipeline 1 was slightly more accurate than Pipeline 2 for OpenACC and slightly less accurate than Pipeline 2 for OpenMP. For both programming models, both pipelines demonstrated a bias towards restrictiveness, though Pipeline 2 consistently had a stronger bias than Pipeline 1. This demonstrates that for both pipelines, when a mistake does occur, it is more likely to be one of misjudging a valid file rather than one of misjudging an invalid file.

Figures 3 and 4 present the accuracy of both pipelines on the four categories of errors introduced into each file, for OpenACC and OpenMP respectively. As Figure 4 clearly shows, the performance of both pipelines on OpenMP was nearly identical across all four types of issues, while Pipeline 1 and Pipeline 2 had only slight differences in performance for OpenACC. The radar plots also show the large difference in the pipelines' ability to detect erroneous test logic in OpenMP files versus OpenACC files; however, both pipelines also demonstrated an almost identical ability to detect improper directive use and improper syntax across both OpenACC and OpenMP.

Table VII shows the results of the two agent-based LLMJs on the OpenACC testsuite. In this case, the two LLMJs' performances varied much more, with LLMJ 1 demonstrating a superior ability to identify missing-syntax errors and to recognize valid code, and LLMJ 2 demonstrating a superior ability to detect OpenACC errors, a lack of OpenACC code, and errors in test logic. Table VIII, which shows the performance of both LLMJs on OpenMP, also demonstrates more variance between the two LLMJs. LLMJ 1 exhibited a higher accuracy for recognizing OpenMP errors, syntax errors, and test logic errors, while LLMJ 2 was better equipped for recognizing a lack of OpenMP code and valid codes.
TABLE IV: Validation Pipeline Results for OpenACC

OpenACC Issue Type | Total Count | Pipeline 1 Correct Evaluations | Pipeline 2 Correct Evaluations | Pipeline 1 Accuracy | Pipeline 2 Accuracy
Removed ACC memory allocation / swapped ACC directive | 272 | 250 | 251 | 92% | 92%
Removed an opening bracket | 146 | 146 | 146 | 100% | 100%
Added use of undeclared variable | 151 | 151 | 151 | 100% | 100%
Replaced file with randomly-generated non-OpenACC code | 146 | 146 | 146 | 100% | 100%
Removed last bracketed section of code | 176 | 38 | 53 | 22% | 30%
No issue | 891 | 704 | 627 | 79% | 70%

TABLE V: Validation Pipeline Results for OpenMP

OpenMP Issue Type | Total Count | Pipeline 1 Correct Evaluations | Pipeline 2 Correct Evaluations | Pipeline 1 Accuracy | Pipeline 2 Accuracy
Removed OMP memory allocation / swapped OMP directive | 49 | 47 | 46 | 96% | 94%
Removed an opening bracket | 28 | 28 | 28 | 100% | 100%
Added use of undeclared variable | 26 | 26 | 26 | 100% | 100%
Replaced file with randomly-generated non-OpenMP code | 20 | 14 | 17 | 70% | 85%
Removed last bracketed section of code | 25 | 23 | 23 | 92% | 92%
No issue | 148 | 136 | 138 | 92% | 93%

TABLE VI: Overall Validation Pipeline Results

Datapoint | OpenACC | OpenMP
Total Count | 1782 | 296
Total Pipeline 1 Mistakes | 347 | 22
Total Pipeline 2 Mistakes | 408 | 18
Overall Pipeline 1 Accuracy | 80.53% | 92.57%
Overall Pipeline 2 Accuracy | 77.10% | 93.92%
Pipeline 1 Bias | -0.078 | -0.091
Pipeline 2 Bias | -0.294 | -0.111

Fig. 4: A Radar Plot for Validation Pipeline Results for OpenMP

Fig. 3: A Radar Plot for Validation Pipeline Results for OpenACC

Table IX shows the overall performance of both LLMJs. For both OpenACC and OpenMP, LLMJ 1 demonstrated a higher overall accuracy than LLMJ 2 and performed slightly better on OpenACC than OpenMP, while LLMJ 2 demonstrated a roughly equal level of accuracy between OpenACC and OpenMP. LLMJ 1 also exhibited a consistently strong positive bias, while LLMJ 2 had a much smaller positive bias for OpenACC and a much larger positive bias for OpenMP. In all cases, the agent-based LLMJs exhibited a tendency towards passing invalid files as opposed to failing valid files. Compared to the non-agent-based LLMJ, both agent-based LLMJs exhibited drastically higher overall accuracy, and both exhibited a smaller positive bias for OpenACC.
TABLE VII: Agent-Based LLMJ Results for OpenACC

OpenACC Issue Type | Total Count | LLMJ 1 Correct Evaluations | LLMJ 2 Correct Evaluations | LLMJ 1 Accuracy | LLMJ 2 Accuracy
Removed ACC memory allocation / swapped ACC directive | 272 | 182 | 224 | 67% | 82%
Removed an opening bracket | 146 | 111 | 81 | 76% | 55%
Added use of undeclared variable | 151 | 128 | 126 | 85% | 83%
Replaced file with randomly-generated non-OpenACC code | 146 | 142 | 146 | 97% | 100%
Removed last bracketed section of code | 176 | 26 | 47 | 15% | 27%
No issue | 891 | 819 | 701 | 92% | 79%

TABLE VIII: Agent-Based LLMJ Results for OpenMP

OpenMP Issue Type | Total Count | LLMJ 1 Correct Evaluations | LLMJ 2 Correct Evaluations | LLMJ 1 Accuracy | LLMJ 2 Accuracy
Removed OMP memory allocation / swapped OMP directive | 49 | 23 | 22 | 47% | 45%
Removed an opening bracket | 28 | 16 | 13 | 57% | 46%
Added use of undeclared variable | 26 | 18 | 15 | 69% | 58%
Replaced file with randomly-generated non-OpenMP code | 20 | 13 | 17 | 65% | 85%
Removed last bracketed section of code | 25 | 18 | 12 | 72% | 48%
No issue | 148 | 137 | 142 | 93% | 96%

TABLE IX: Overall Agent-Based LLMJ Results

Datapoint | OpenACC | OpenMP
Total Count | 1782 | 296
Total LLMJ 1 Mistakes | 374 | 71
Total LLMJ 2 Mistakes | 457 | 75
Overall LLMJ 1 Accuracy | 79.01% | 76.01%
Overall LLMJ 2 Accuracy | 74.35% | 74.66%
LLMJ 1 Bias | 0.615 | 0.690
LLMJ 2 Bias | 0.168 | 0.840

Fig. 5: A Radar Plot for LLMJ Results for OpenACC

Fig. 6: A Radar Plot for LLMJ Results for OpenMP

Figures 5 and 6 present the accuracy of all three LLMJs for OpenACC and OpenMP, respectively. In almost all categories, the agent-based LLMJs outperformed the non-agent-based LLMJ, with the exception of valid test recognition for OpenACC (where the non-agent-based LLMJ outperformed LLMJ 2) and improper syntax recognition for OpenMP (where the non-agent-based LLMJ outperformed both agent-based LLMJs). LLMJ 2 consistently demonstrated a higher accuracy in recognizing improper directive usage than LLMJ 1, while LLMJ 1 exhibited a better recognition of improper syntax than LLMJ 2. Both agent-based LLMJs were also consistently able to recognize valid tests with a high degree of accuracy, with LLMJ 1 slightly outperforming LLMJ 2 for OpenACC.

VI. CONCLUSION
In this paper, we explore ways to assess the capability of LLM-as-a-Judge. We employ different techniques, such as negative probing and an agent-based approach, along with prompts to understand how the LLM evaluates the codes. Our results indicate that utilizing an agent-based prompting approach and setting up a validation pipeline structure significantly increased the quality of DeepSeek Coder's evaluations of tests used to validate compiler implementations of directive-based programming models. As part of our future work, we will incorporate Fortran code into our testing to ensure more comprehensive data collection and probing. We will also be exploring the automation of compiler test generation based on lessons learnt from this work.

ACKNOWLEDGMENT

The authors are very grateful to OpenACC for supporting this work. This research used resources of NERSC, a U.S. DOE Office of Science User Facility located at LBNL, operated under Contract No. DE-AC02-05CH11231, using NERSC ERCAP0029463. This material is also based upon work supported by the U.S. DOE under Contract DE-FOA-0003177, S4PST: Next Generation Science Software Technologies Project.

REFERENCES

[1] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," 2019. [Online]. Available: https://arxiv.org/abs/1810.04805
[2] OpenAI and A. et al., "Gpt-4 technical report," 2024. [Online]. Available: https://arxiv.org/abs/2303.08774
[3] G. Team and A. et al., "Gemini: A family of highly capable multimodal models," 2024. [Online]. Available: https://arxiv.org/abs/2312.11805
[4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," 2017. [Online]. Available: https://arxiv.org/abs/1706.03762
[5] D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. Li, F. Luo, Y. Xiong, and W. Liang, "Deepseek-coder: When the large language model meets programming – the rise of code intelligence," 2024. [Online]. Available: https://arxiv.org/abs/2401.14196
[6] C. Munley, A. Jarmusch, and S. Chandrasekaran, "Llm4vv: Developing llm-driven testsuite for compiler validation," 2024. [Online]. Available: https://arxiv.org/abs/2310.04963
[7] OpenACC Organization, "OpenACC." [Online]. Available: https://www.openacc.org/
[8] OpenMP Architecture Review Board, "OpenMP application program interface version 5.2," 2021. [Online]. Available: https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5-2.pdf
[9] OpenAI, "New models and developer products announced at devday," https://openai.com/index/new-models-and-developer-products-announced-at-devday/, 2024, accessed: 2024-08-16.
[10] R. et al., "Code llama: Open foundation models for code," 2024. [Online]. Available: https://arxiv.org/abs/2308.12950
[11] J. Jiang, F. Wang, J. Shen, S. Kim, and S. Kim, "A survey on large language models for code generation," arXiv preprint arXiv:2406.00515, 2024.
[12] X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, and H. Wang, "Large language models for software engineering: A systematic literature review," arXiv preprint arXiv:2308.10620, 2023.
[13] N. Baumgartner, P. Iyenghar, T. Schoemaker, and E. Pulvermüller, "Ai-driven refactoring: A pipeline for identifying and correcting data clumps in git repositories," Electronics, vol. 13, no. 9, p. 1644, 2024.
[14] A. T. McCabe, M. Björkman, J. Engström, P. Kuang, E. Söderberg, and L. Church, "Ironies of programming automation: Exploring the experience of code synthesis via large language models," in Companion Proceedings of the 8th International Conference on the Art, Science, and Engineering of Programming, 2024, pp. 12–21.
[15] D. Nichols, P. Polasam, H. Menon, A. Marathe, T. Gamblin, and A. Bhatele, "Performance-aligned llms for generating fast code," 2024. [Online]. Available: https://arxiv.org/abs/2404.18864
[16] L. Chen, P.-H. Lin, T. Vanderbruggen, C. Liao, M. Emani, and B. de Supinski, "Lm4hpc: Towards effective language model application in high-performance computing," 2023. [Online]. Available: https://arxiv.org/abs/2306.14979
[17] T. Kadosh, N. Hasabnis, V. A. Vo, N. Schneider, N. Krien, A. Wasay, N. Ahmed, T. Willke, G. Tamir, Y. Pinter et al., "Scope is all you need: Transforming llms for hpc code," arXiv preprint arXiv:2308.09440, 2023.
[18] W. Godoy, P. Valero-Lara, K. Teranishi, P. Balaprakash, and J. Vetter, "Evaluation of openai codex for hpc parallel programming models kernel generation," in Proceedings of the 52nd International Conference on Parallel Processing Workshops, 2023, pp. 136–144.
[19] P. Valero-Lara, A. Huante, M. A. Lail, W. F. Godoy, K. Teranishi, P. Balaprakash, and J. S. Vetter, "Comparing llama-2 and gpt-3 llms for hpc kernels generation," arXiv preprint arXiv:2309.07103, 2023.
[20] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing et al., "Judging llm-as-a-judge with mt-bench and chatbot arena," Advances in Neural Information Processing Systems, vol. 36, 2024.
[21] M. Schäfer, S. Nadi, A. Eghbali, and F. Tip, "An empirical evaluation of using large language models for automated unit test generation," IEEE Transactions on Software Engineering, 2023.
[22] G. Ryan, S. Jain, M. Shang, S. Wang, X. Ma, M. K. Ramanathan, and B. Ray, "Code-aware prompting: A study of coverage-guided test generation in regression setting using llm," Proceedings of the ACM on Software Engineering, vol. 1, no. FSE, pp. 951–971, 2024.
[23] K. Liu, Y. Liu, Z. Chen, J. M. Zhang, Y. Han, Y. Ma, G. Li, and G. Huang, "Llm-powered test case generation for detecting tricky bugs," arXiv preprint arXiv:2404.10304, 2024.
[24] GitHub, "Github copilot," 2024, accessed: 2024-08-16. [Online]. Available: https://github.com/features/copilot
[25] Cursor, 2024, accessed: 2024-08-16. [Online]. Available: https://www.cursor.com/
[26] A. Jarmusch, A. Liu, C. Munley, D. Horta, V. Ravichandran, J. Denny, K. Friedline, and S. Chandrasekaran, "Analysis of validating and verifying openacc compilers 3.0 and above," in 2022 Workshop on Accelerator Programming Using Directives (WACCPD). IEEE, 2022, pp. 1–10.
[27] T. Huber, S. Pophale, N. Baker, M. Carr, N. Rao, J. Reap, K. Holsapple, J. H. Davis, T. Burnus, S. Lee, D. E. Bernholdt, and S. Chandrasekaran, "Ecp sollve: Validation and verification testsuite status update and compiler insight for openmp," in 2022 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC), 2022, pp. 123–135.
[28] Lawrence Berkeley National Laboratory, 2024, accessed: 2024-08-16. [Online]. Available: https://docs.nersc.gov/systems/perlmutter/architecture/
