Abstract—Large Language Models (LLMs) continue to improve using self-supervised learning and learn to model language
Fig. 1: An Overview of Agent-based Approach for LLMJ

The agent-based approach entails collecting and making use of external information about each file, such as compilation and execution error messages, outputs, and return codes, and providing this information to the LLMJ within the prompt. The LLMJ then uses the provided information to evaluate the file, deeming it either valid or invalid. Figure 1 demonstrates how the agent-based LLMJ works.

Listing 2 shows how tool use is incorporated into the prompting to provide additional information to the LLM, helping it review an OpenACC code and evaluate it against user-specified criteria.

To evaluate the validity of compiler tests efficiently, it is not always feasible to compile, run, and have an LLM evaluate each and every test that requires verification. Performing all three processes on every single file being verified can quickly become a time-consuming and costly task, especially when verifying LLM-generated code with a high occurrence of invalidity or a large volume of candidate tests. To streamline this task, we re-organized the three processes into a pipeline infrastructure, shown in Figure 2, both to optimize the overarching task by reducing the number of unnecessary steps and to increase the throughput of files under verification via pipeline stages and parallel processing.
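To illustrate this tool-use step concretely, the following minimal sketch shows how compilation and execution information might be gathered and folded into the judge prompt. The function names, compiler invocation, and prompt wording are illustrative assumptions; they stand in for the actual toolchain and for the prompt of Listing 2, which is not reproduced here.

import subprocess

def collect_tool_info(source_path: str, compiler: str = "nvc") -> dict:
    """Compile and run a candidate test, capturing the external information
    (return codes, error messages, outputs) passed to the agent-based LLMJ.
    The compiler and flags here are assumptions for illustration only."""
    compile_proc = subprocess.run(
        [compiler, "-acc", source_path, "-o", "test.bin"],
        capture_output=True, text=True)
    info = {
        "compile_returncode": compile_proc.returncode,
        "compile_messages": compile_proc.stderr,
        "run_returncode": None,
        "run_output": "",
    }
    if compile_proc.returncode == 0:
        # Only execute the binary if compilation succeeded.
        run_proc = subprocess.run(["./test.bin"], capture_output=True,
                                  text=True, timeout=60)
        info["run_returncode"] = run_proc.returncode
        info["run_output"] = run_proc.stdout + run_proc.stderr
    return info

def build_judge_prompt(source_code: str, info: dict, criteria: str) -> str:
    """Fold the collected information into the prompt so the LLMJ can deem
    the file valid or invalid against user-specified criteria."""
    return (
        "Review the following OpenACC test and decide whether it is VALID "
        f"or INVALID according to these criteria:\n{criteria}\n\n"
        f"Source code:\n{source_code}\n\n"
        f"Compilation return code: {info['compile_returncode']}\n"
        f"Compilation messages:\n{info['compile_messages']}\n"
        f"Execution return code: {info['run_returncode']}\n"
        f"Execution output:\n{info['run_output']}\n"
    )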
TABLE III: LLMJ Overall Negative Probing Results

Datapoint            OpenACC    OpenMP
Total Count          1335       431
Total Mistakes       579        256
Overall Accuracy     56.63%     40.60%
Bias                 0.717      -0.031

For each testsuite, we then compiled, executed, and used both LLMJ prompts to evaluate each file while recording the compilation data, execution data, and evaluations. We ran each file through every stage of the validation pipeline; however, for this experiment, we did not prevent invalid files from continuing through the pipeline. This way, we could gather information about both agent-based LLMJs and retroactively verify how the entire validation pipeline would have performed on the data by checking the compilation, execution, and evaluation status of each file.

To simplify the data analysis:

• LLMJ 1: The agent-based LLMJ that used the direct analysis prompt
• LLMJ 2: The agent-based LLMJ that used the indirect analysis prompt
• Pipeline 1: Validation pipeline outputs computed with LLMJ 1's evaluation
• Pipeline 2: Validation pipeline outputs computed with LLMJ 2's evaluation

We then compared the performances of the two pipelines against each other, and compared the two agent-based LLMJs against each other and against the non-agent-based LLMJ.

Table IV shows the results of the two pipelines on the OpenACC testsuite. As can be seen, the two pipelines performed almost identically, though Pipeline 2 demonstrated a higher ability to recognize errors in the test's logic, and Pipeline 1 demonstrated a higher ability to recognize when a file contained no errors. Table V also shows a similarity between the two pipelines' performances, though in this case, Pipeline 2 was slightly worse at recognizing OpenMP errors and significantly better at recognizing a lack of OpenMP code.

Table VI shows the overall performance of both pipelines across both OpenACC and OpenMP. Both pipelines were significantly more accurate for OpenMP than for OpenACC, though Pipeline 1 was slightly more accurate than Pipeline 2 for OpenACC and slightly less accurate than Pipeline 2 for OpenMP. For both programming models, both pipelines demonstrated a bias towards restrictiveness, though Pipeline 2 consistently had a stronger bias than Pipeline 1. This shows that for both pipelines, when a mistake does occur, it is more likely to be a misjudgment of a valid file than a misjudgment of an invalid file.

Figures 3 and 4 present the accuracy of both pipelines on the four categories of errors introduced into each file, for OpenACC and OpenMP respectively. As Figure 4 clearly shows, the performance of both pipelines on OpenMP was nearly identical across all four types of issues, while Pipeline 1 and Pipeline 2 had only slight differences in performance for OpenACC. The radar plots also show the large difference in the pipelines' ability to detect erroneous test logic in OpenMP files versus OpenACC files; however, both pipelines demonstrated an almost identical ability to detect improper directive use and improper syntax across both OpenACC and OpenMP.

Table VII shows the results of the two agent-based LLMJs on the OpenACC testsuite. In this case, the two LLMJs' performances varied much more, with LLMJ 1 demonstrating a superior ability to identify missing syntax errors and to recognize valid code, and LLMJ 2 demonstrating a superior ability to detect OpenACC errors, a lack of OpenACC code, and errors in test logic. Table VIII, which shows the performance of both LLMJs on OpenMP, also demonstrates more variance between the two LLMJs. LLMJ 1 exhibited higher accuracy in recognizing OpenMP errors, syntax errors, and test logic errors, while LLMJ 2 was better equipped to recognize a lack of OpenMP code and valid codes.

Table IX shows the overall performance of both LLMJs.
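As a minimal sketch of the experimental procedure described above (with illustrative names only, not the paper's implementation), each file can be pushed through every stage while all intermediate data is recorded, so that the two pipelines' outcomes can be reconstructed afterwards from the same records:

from dataclasses import dataclass

@dataclass
class FileRecord:
    """Everything recorded for one candidate test file (illustrative)."""
    path: str
    compile_ok: bool = False
    run_ok: bool = False
    llmj1_verdict: str = ""  # agent-based LLMJ with the direct analysis prompt
    llmj2_verdict: str = ""  # agent-based LLMJ with the indirect analysis prompt

def run_all_stages(path, compile_stage, execute_stage, llmj1, llmj2):
    """Run every stage for every file, as in the experiment: invalid files are
    not filtered out early, so full pipeline behaviour can be checked later."""
    rec = FileRecord(path=path)
    rec.compile_ok = compile_stage(path)
    # Execution is only meaningful if compilation succeeded; otherwise record failure.
    rec.run_ok = execute_stage(path) if rec.compile_ok else False
    rec.llmj1_verdict = llmj1(path)  # "valid" or "invalid"
    rec.llmj2_verdict = llmj2(path)
    return rec

def pipeline_verdict(rec: FileRecord, llmj_verdict: str) -> str:
    """A file passes the validation pipeline only if it compiles, runs,
    and the LLMJ deems it valid."""
    return "valid" if (rec.compile_ok and rec.run_ok
                       and llmj_verdict == "valid") else "invalid"

Pipeline 1's output for a file is then pipeline_verdict(rec, rec.llmj1_verdict) and Pipeline 2's is pipeline_verdict(rec, rec.llmj2_verdict), matching the definitions above.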
TABLE IV: Validation Pipeline Results for OpenACC

OpenACC Issue Type                                        Total Count   Pipeline 1 Correct Evaluations   Pipeline 2 Correct Evaluations   Pipeline 1 Accuracy   Pipeline 2 Accuracy
Removed ACC memory allocation / swapped ACC directive     272           250                              251                              92%                   92%
Removed an opening bracket                                146           146                              146                              100%                  100%
Added use of undeclared variable                          151           151                              151                              100%                  100%
Replaced file with randomly-generated non-OpenACC code    146           146                              146                              100%                  100%
Removed last bracketed section of code                    176           38                               53                               22%                   30%
No issue                                                  891           704                              627                              79%                   70%
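The per-issue accuracies in Table IV, and overall figures such as those in Table III, reduce to simple tallies over such records. The sketch below assumes each record carries the injected issue type (or "No issue") and the judged verdict; the signed bias score at the end is only one plausible formalization of a bias towards restrictiveness (positive when most mistakes are misjudged valid files) and is not claimed to be the paper's exact definition:

from collections import defaultdict

def tally(records):
    """records: iterable of (issue_type, is_actually_valid, judged_valid) tuples.
    Returns per-issue accuracy, overall accuracy, and a signed bias score."""
    per_issue = defaultdict(lambda: [0, 0])   # issue_type -> [correct, total]
    mistakes_on_valid = mistakes_on_invalid = 0
    for issue_type, is_valid, judged_valid in records:
        correct = (judged_valid == is_valid)
        per_issue[issue_type][0] += int(correct)
        per_issue[issue_type][1] += 1
        if not correct:
            if is_valid:
                mistakes_on_valid += 1        # valid file judged invalid (restrictive)
            else:
                mistakes_on_invalid += 1      # invalid file judged valid (permissive)
    total = sum(n for _, n in per_issue.values())
    correct_total = sum(c for c, _ in per_issue.values())
    mistakes = mistakes_on_valid + mistakes_on_invalid
    overall_accuracy = correct_total / total if total else 0.0
    # Assumed bias definition: +1 if every mistake misjudged a valid file,
    # -1 if every mistake misjudged an invalid file, 0 if balanced.
    bias = (mistakes_on_valid - mistakes_on_invalid) / mistakes if mistakes else 0.0
    return {k: c / n for k, (c, n) in per_issue.items()}, overall_accuracy, bias

For example, 250 correct evaluations out of the 272 files in the "Removed ACC memory allocation / swapped ACC directive" category yields the 92% reported for Pipeline 1 in Table IV.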
OpenACC and OpenMP, respectively. In almost all categories, the agent-based LLMJs outperformed the non-agent-based LLMJ, with the exception of valid test recognition for OpenACC (where the non-agent-based LLMJ outperformed LLMJ

VI. CONCLUSION

In this paper, we explore ways to assess the capability of LLM-as-a-Judge. We employ different techniques, such as negative probing and an agent-based approach, along with prompts to understand how the LLM evaluates the codes. Our results indicate that utilizing an agent-based prompting approach and setting up a validation pipeline structure significantly increased the quality of DeepSeek Coder's evaluations of tests used to validate compiler implementations of directive-based programming models. As part of our future work, we will incorporate Fortran code into our testing to ensure more comprehensive data collection and probing. We will also explore the automation of compiler test generation based on lessons learnt from this work.

ACKNOWLEDGMENT

The authors are very grateful to OpenACC for supporting this work. This research used resources of NERSC, a U.S. DOE Office of Science User Facility located at LBNL, operated under Contract No. DE-AC02-05CH11231, using NERSC ERCAP0029463. This material is also based upon work supported by the U.S. DOE under Contract DE-FOA-0003177, S4PST: Next Generation Science Software Technologies Project.