0% found this document useful (0 votes)

22 views

1. Can We Trust Large Language Models Generated Code A

Uploaded by

smeetnilvarna

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

22 views

1. Can We Trust Large Language Models Generated Code A

Uploaded by

smeetnilvarna

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 27

Can We Trust Large Language Models Generated Code?

A
Framework for In-Context Learning, Security Patterns, and Code
Evaluations Across Diverse LLMs
Ahmad Mohsina , Helge Janickea , Adrian Wooda , Iqbal H. Sarkera , Leandros Maglarasb and
Naeem Janjuab
a Centre for Securing Digital Futures, School of Science, Edith Cowan University, WA-6027, Australia
b School of Computing, Edinburgh Napier University, United Kingdom
c College of Science and Engineering, Flinders University, Adelaide, Australia

ARTICLE INFO ABSTRACT

Keywords: Large Language Models (LLMs) such as ChatGPT and GitHub Copilot have revolutionized
Large Language Models, Automated automated code generation in software engineering. However, as these models are increas-
arXiv:2406.12513v1 [cs.CR] 18 Jun 2024

Code Generation, AI and Code Se- ingly utilized for software development, concerns have arisen regarding the security and
curity, Security Vulnerabilities, Hid- quality of the generated code. These concerns stem from LLMs being primarily trained
den Code Smells, In-Context Learn- on publicly available code repositories and internet-based textual data, which may contain
ing, Supply Chain Vulnerabilities insecure code. This presents a significant risk of perpetuating vulnerabilities in the generated
code, creating potential attack vectors for exploitation by malicious actors. Our research aims
to tackle these issues by introducing a framework for secure behavioral learning of LLMs
through In-Content Learning (ICL) patterns during the code generation process, followed
by rigorous security evaluations. To achieve this, we have selected four diverse LLMs for
experimentation. We have evaluated these coding LLMs across three programming languages
and identified security vulnerabilities and code smells. The code is generated through ICL
with curated problem sets and undergoes rigorous security testing to evaluate the overall
quality and trustworthiness of the generated code. Our research indicates that ICL-driven
one-shot and few-shot learning patterns can enhance code security, reducing vulnerabilities
in various programming scenarios. Developers and researchers should know that LLMs have
a limited understanding of security principles. This may lead to security breaches when
the generated code is deployed in production systems. Our research highlights LLMs are a
potential source of new vulnerabilities to the software supply chain. It is important to consider
this when using LLMs for code generation. This research article offers insights into improving
LLM security and encourages proactive use of LLMs for code generation to ensure software
system safety.

1. Introduction these LLMs often produce code with security vulnera-

bilities. This results in potential security threats due to
The transformation of LLMs has revolutionized
vulnerabilities [4] present in the produced code.
software development from agile DevOps to cloud These models lack an understanding of security
infrastructure automation. However, the quality of gen- principles; their primary goal is to produce operational
erated code is a critical issue, with code security being code without considering security flaws during code
the primary concern. Tools like GitHub Copilot and
generation [5]. Studies have indicated that different
ChatGPT have improved developers productivity by
versions of ChatGPT and GitHub Copilot generated
automating various tasks [1, 2, 3] for code generation.
insecure code in approximately 40% of cases reported,
The, software-intensive systems employing generated highlighting significant security risks [4, 6, 7]. Re-
code through LLMs are posed with significant security searchers and industry practitioners have expressed
risks, primarily due to the generated source code con- concerns about the reliability of source code generated
taining various security weaknesses and hidden bugs.
by LLMs for software security [6, 8]. These models
As they are trained on extensive data from human
generate probabilistic code using sequential learning
developers extracted from open source repositories, by transforming natural language inputs into token
⋆
This research work is supported by the School of Science sequences and embeddings [8, 9, 10, 11]. The LLM
Center for Securing Digital Futures (CSDF) and Cyber Security Co- model is trained mainly on internet-based datasets,
operative Research Centre (CSCRC) Australia, under the Research which makes the output code susceptible to various
Theme for AI and Cybersecurity.
∗ Corresponding author (First author) at School of Science, Com- security issues. To understand this process better, let us
puting and Security Discipline Joondalup, Australia consider the LLM code generation process illustrated
[email protected] (A. Mohsin); [email protected] in Figure 1. For instance, if we have input text such
(H. Janicke); [email protected] (A. Wood); as "Write a Python function to calculate factorial",
[email protected] (I.H. Sarker); [email protected] (L.
Maglaras); [email protected] (N. Janjua)
the LLM model tokenizes the input, applies learned
ORCID (s): embeddings using training datasets, and its transformer

Ahmad Mohsin et al.: Preprint submitted to Elsevier Page 1 of 27

Can We Trust Large Language Models Generated Code?

Figure 1: The LLM code generation process: simplified Transformer architecture for code generation with potential
security risks

architecture to output a sequence of tokens that forms We have developed an approach named "Can We
the appropriate Python code. However, the overall se- Trust Large Language Models Generated Code? A
curity of the generated source code is a significant Framework for In-Context Learning, Security Patterns,
question mark and challenge for software developers and Code Evaluations Across Diverse LLMs" to ad-
and LLM users. When deployed in production sys- dress these challenges. We argue LLMs are effective
tems, this poses a significant security risk, potentially few-shot learners that use a gradual in-context learning
compromising trustworthiness and resilience [12]. It approach, aided by natural language inputs [25, 17,
is crucial to thoroughly test the generated code before 26]. This framework enables various LLMs to acquire
deployment, especially since developers may not fully security knowledge and understand security by im-
understand how it is generated [6, 13, 4] and may lack plementing ICL security patterns, followed by exten-
security knowledge. sive security testing to assess their ability to produce
When interacting with LLMs, developers with safe and secure code. Four different LLM platforms
varying coding expertise can inadvertently introduce are used with three programming languages: C++,
security vulnerabilities, leading to the unintentional C#, and Python. The code experimentation employing
generation of less secure or malicious code. Unfortu- ICL patterns against LLMs involves a diverse set of
nately, most developers are unaware of the security problem sets, ranging from classic data structures and
implications of their inputs to LLMs. The resulting algorithms to modern web and API development. We
vulnerabilities can be exploited through various attack conduct experiments for code generation using prompt-
vectors, compromising the integrity of software sys- driven code generators (PDCGs) such as ChatGPT
tems [14, 15]. Therefore, it is essential to note that and Google Bard, along with Coding CoPilots (CCPs)
developers using LLMs may have limited knowledge like GitHub Copilot and Amazon Code Whisperer.
about security[16, 4]. Rigorous testing and analysis of the generated code
Code generating LLMs exhibit different behaviors is carried out across various programming scenarios
depending on the language model type and underly- to evaluate security risks. ICL security patterns are
ing architecture they are trained [17]. The behavior utilized to generate code for instructing and fine-tuning
of prompt-driven code generators using Foundation or coding LLMs, leading to the creation of two distinct
Baseline LLMs such as GPTs [18, 10] is different from datasets for future LLM security research.
that of fine-tuned LLMs [19, 20, 21] used in coding We apply each ICL security pattern specifically
copilots. Existing research efforts [22, 23] do not ad- within programming language problem-solving sce-
dress these aspects when evaluating AI platforms for narios to ensure they are repeatable and applicable
coding. The current research is currently focused solely across the four LLMs. This approach aims to improve
on standalone LLMs functional code evaluations [8], LLMs learning behaviors through tailored ICL patterns
with little emphasis on code generation using diverse for both pre-trained (Foundation or Base models) and
LLM types, especially with regard to security. Existing fine-tuned models. We rigorously test the security of
research only concentrates on the security of generated each generated program instance using Static Applica-
code and identifying specific vulnerabilities [22, 24], tion Security Testing (SAST), complemented by man-
while some through instruction fine-tuning, which is ual security reviews of selected programs to identify
expensive and unreliable [1, 5]. However, there is a and analyze security vulnerabilities and hidden code
lack of experimental research that trains LLMs to ac- smells, respectively. Focusing on risk management, we
quire security knowledge during code generation. This developed security risk metrics to quantify security
is crucial for providing a comprehensive and empiri- impacts at the source code level.
cal analysis of code security and the associated risks. Our main contributions in this research work are as
These research gaps could lead to severe cybersecurity follows:
risks if not appropriately addressed.

Ahmad Mohsin et al.: Preprint submitted to Elsevier Page 2 of 27

Can We Trust Large Language Models Generated Code?

• Our novel approach offers LLMs an opportunity AmazonCode Whisperer [20], Hugging Face Start
to learn about security knowledge using ICL Coder [19], and Facebook Code Llama [27]. These
security patterns which are designed to enhance platforms use Transformer Architecture [28] for code
their ability to produce secure code. generation, functioning as conversational code genera-
tors allowing developers to input prompts and receive
• We curate diverse programming problem sets in code outputs. They also integrate plugins within IDEs,
widely used programming languages in C++, acting as programming assistants or pair program-
C#, and Python to generate extensive code bases mers. This advancement in LLM technology lays the
using ICL security patterns. foundation for understanding the unique functionalities
• We use four diverse LLMs to improve their se- and applications of Prompt Driven Code Generators
curity learning behavior using ICL and analyze (PDCGs) and Coding CoPilots (CCPs), tailored to
their code secure generation capabilities. Each different aspects of coding assistance and developer
LLM is fed with tailored instruction sets con- interaction.[29, 30, 21, 31, 13, 32, 33]. We describe
sidering its contextual and dynamic interactions these as follows:
during the code generation process. (i) Prompt Driven Code Generators (PDCGs). The
PDGCs are platforms such as OpenAI ChatGPT that
• We perform security assessments on the gener- utilize foundation models, or Baseline LLMs (BLLMs1
ated LLM code using industry-standard Static ), trained on various datasets. These models generate
SAST and code reviews to evaluate potential code in response to specific prompts and inputs/queries
vulnerabilities and hidden code issues. Our de- that developers provide. For example, a developer
veloped security risk assessment metrics help might input a query "computing the prime values of the
to identify threats to the overall quality of the first 999 numbers in an array," the LLM interprets this
generated LLM code. input using NLP to generate the corresponding code.
These interactions heavily rely on the capabilities of
• We aim to release a security instructions dataset the BLLMs [29, 30, 21, 31]. Please refer to Figure 2
used to design ICL patterns. This dataset can to illustrate PDCG usage. PDCGs that use foundation
later be utilized to train and fine-tune LLMs for models have certain limitations and constraints, espe-
secure code generation. The generated code is a cially when dealing with complex problems specific
curated dataset intended for future LLM security to programming languages that require task-specific or
training for experimental purposes. domain-specific features from pre-trained LLMs.
Concepts and terms used in this study, along with (ii) Coding CoPilots (CCPs). The CCPs, such as
research questions, are described in Section 2. We delve Microsoft GitHub Copilot, are based on FLLMs2 These
into related work for AI-based code generation and finely-tuned language models are trained with coding-
research on evaluating LLMs for code generation in specific datasets to provide context-driven automated
Section 3. Section 4 presents the proposed research code-generation capabilities for developers. These mod-
approach for this study. Experimentation and results els seamlessly integrate into Integrated Development
analysis are detailed in Sections 5 and 6, respectively. Environments (IDEs) as plugins or APIs, providing
Discussion on the results is provided in Section 7, fol- context-sensitive and sophisticated coding assistance.
lowed by Conclusions and Future Research in Section They enhance code completion capabilities by inter-
8. preting the context of the project and the developers
intent from specific comments or code snippets. This
refined tuning allows CCPs to offer highly pertinent
2. Background & Research Problematic code suggestions and enhancements, effectively adapt-
This section covers key concepts in this article: ing to the unique requirements of the development
AI for code generation, code-generating LLM types, environment and project specifics [13, 32, 33]. Figure
learning behaviors, and source code security vulnera- 3 illustrates this type of interactive code generation
bilities. It defines essential concepts and terms for a where developers collaborate closely with the LLM.
better understanding of the terminology used in the Both PDCGs and CCPs, as LLMs, are prone to
following discussion. It covers the evolution of LLM producing insecure code. Being probabilistic code gen-
code generation, its two main types, related software erators, they often introduce security risks due to their
security terms, methods for training LLMs to learn tendency vulnerabilities. Below, we describe the vul-
code behavior, and ways to ensure code quality. nerabilities in the generated source code and the related
security attacks.
2.1. Code Generation Evolution with LLMs 1 Baseline Large Language Models (BLLMs) are foundational
With LLM advancements, several AI-based code models. In this research article, they are referred to as Prompt Driven
generators have emerged to help developers improve Code Generators (PDCGs)
2 In this paper, CCPs represent fine-tuned, and we shall use CCP
their productivity across various platforms. Some no-
table examples of these platforms are Google Bard, for their representation in this article.

Ahmad Mohsin et al.: Preprint submitted to Elsevier Page 3 of 27

Can We Trust Large Language Models Generated Code?

Figure 2: Prompt Driven Code Generators with BLLMs

Figure 3: Coding CoPilots LLMs for Code Generation

2.2. Source Code Level Security Defects by language models. We use CWE and code smells to
Here, we define the nature of software vulnera- analyze the security of code generated by LLMs. This
bilities that may arise from these coding LLMs. This highlights the importance of addressing these vulnera-
research article will address security-related flaws in bilities due to LLMs tendency to produce potentially
generated LLM code as follows: insecure code. Identifying code smells also helps us
Source Code Vulnerabilities. A software vulnera- evaluate the associated security risks and overall code
bility is a source code or system configuration flaw quality.
that malicious actors can exploit to compromise data Software Supplychain Vulnerabilities. The majority
integrity, availability, and confidentiality in computer of these vulnerabilities often stem from developers’
systems and networks [14]. Software vulnerabilities code or third-party sources, such as open-source li-
often result from programming errors, language syn- braries and commercial off-the-shelf (COTS) products
tax, or architectural design weaknesses at the source used during software design [15, 38]. Software supply
code level. The types of coding weaknesses include chain attacks exploit these vulnerabilities, targeting
input validation errors, application injections, cross- developers, their infrastructure, and third-party code
site scripting, information leakage, and buffer overflow, suppliers to introduce malicious code [39]. Key vectors
which are categorized under the Common Weakness for these attacks include code repositories, software
Enumeration (CWE) [34] vulnerabilities. The CWEs is build pipelines, and distribution channels. A prime
a publicly developed database that categorizes software example is the 2020 SolarWinds cyber incident, which
security weaknesses stemming from poor design and compromised the Orion software and impacted thou-
coding practices. Vulnerabilities identified in the CWE sands of organizations, reflecting a significant rise in
are recorded in the Common Vulnerabilities and Expo- software supply chain attacks, with a 742% increase
sures (CVE) system [35]. This system is a repository of from 2019 to 2022 [40]. Additionally, there is a grow-
publicly known information about vulnerabilities. Each ing concern that generative AI, particularly LLMs,
CVE may correspond to one or more CWEs. might create new attack vectors by introducing security
Software Coding Smells. Code smells are subtle, of- vulnerabilities in the code they generate.
ten hidden issues in the software that indicate po-
tential design flaws, potentially complicating future 2.3. LLM Code Generation with Behavioral
maintenance, bug fixes, or enhancements [36, 37]. Dur- Learning
ing code generation using LLMs, these issues may Generative AI, specifically LLMs, can be opti-
be concealed and more challenging to identify than mized to mitigate LLM-related risks [17]. Fine-tuning
explicit vulnerabilities such as CWEs or CVEs, pre- techniques are used to improve their learning behaviors
senting covert risks in the code automatically generated in specific aspects of coding, such as functional code
generation, program synthesis, vulnerability analysis,

Ahmad Mohsin et al.: Preprint submitted to Elsevier Page 4 of 27

Can We Trust Large Language Models Generated Code?

Research Questions

• RQ1: How well can diverse LLMs generate secure code across various programming challenges in zero-
shot scenarios?
• RQ2: To what extent do LLMs understand and apply best practices and address vulnerabilities after
using ICL security patterns in one-shot and few-shot learning?
• RQ3: How do PDCG LLMs like ChatGPT-4 and Google Bard compare to CCP LLMs like GitHub Copilot
and Code Whisperer in generating secure code and adapting to ICL security contexts?
• RQ4: What security code smells persist after employing ICL security patterns in one-shot and few-shot
scenarios, and what are the potential security risks?

Figure 4: Research Questions

and code repair [24]. Updating LLM weights and pa- lead to vulnerabilities, making software systems sus-
rameters or modifying training architectures can im- ceptible to cyber attacks. With the assistance of auto-
prove performance on downstream tasks like coding. mated code completion and other tools, human coders
ICL improves the learning of LLMs by using exam- are now able to increase productivity. In this regard,
ples to enhance code accuracy and security without generative AI-driven language models i.e. (LLMs) aim
requiring retraining or adjusting parameters, unlike to automate coding tasks and enhance productivity;
heavy-weight fine-tuning employing forward compu- however, studies have shown that code generated by
tation [41]. Through ICL LLMs are able to behave as LLMs may contain vulnerabilities that attackers can
meta-gradients3 via attention mechanism. exploit, leading to sophisticated cyber attacks [6, 24].
The learning of LLMs can be improved through It has become evident that developers are often un-
Chain of Thought (COT) reasoning porcess. Chain of aware of LLM-driven security weaknesses due to the
Thought reasoning is a powerful strategy for enhancing lack of proper testing and code reviews [13, 44]. The
the LLMs’ performance on complex tasks [42, 43]. integration of LLMs with their various styles of code
It involves a series of intermediate reasoning steps, generation, such as prompt-driven natural language
breaking down a problem into smaller, more manage- instructions and coding copilots working alongside hu-
able components, and applying sequential processing. man developers, adds another layer of complexity to
LLMs arrive at the final prediction for tasks such modern software development echo systems. This po-
as code generation. By using CoT reasoning, LLMs tentially adds new attack vectors to future software
can effectively learn to solve problems step-by-step, supply chains for enterprises and critical infrastructures
even with zero to few-shot examples. This approach [38, 39].
helps LLMs develop a deeper understanding of the task In the introduction Section (1), we identified gaps
at hand, making them more self-aware and capable in our understanding of security risks associated with
of producing more accurate and contextually relevant LLM code generation. This research aims to evaluate
outputs, especially in code generation tasks. It allows how different machine learning models embed and ap-
developers and users to improve learning behaviors, ply security knowledge to produce secure code. It will
particularly in the context of security [25, 17]. This re- also explore the potential for LLM-generated code to
search paper utilizes the ICL method employing COT- introduce new vulnerabilities or unintentionally repli-
based reasoning to improve LLM contextual learning cate existing ones during the code generation process.
during code generation, aiming to reduce potential Additionally, it will examine the learning behaviors
security weaknesses in software development. [26]. of LLMs with ICL patterns using chain of thought
reasoning methods to aid in the secure and safe us-
2.4. Research Problem and Motivation age of these code generating LLMs. By identifying
As the reliance on software in critical systems and addressing these security challenges early in the
continues to grow, it becomes crucial to ensure that software development stages and embedding security
the code powering these systems is secure and reliable. knowledge during the LLMs for code generation, we
Traditionally, software development has been driven by can ensure that LLM-generated code is a valuable tool
human coders, but human-driven software design can for software development rather than a new vector for
3 In LLMs, using ICL as meta-gradients allows models to adapt software supply chain vulnerabilities.
their learning directly from examples during operation. They adjust This research aims to enhance the safety and secu-
parameters based on specific input using the attention mechanism to rity of software systems by examining the use of AI-
focus and learn from the most relevant parts of the input, essentially based code automation tools by searching for answers
fine-tuning their responses in real-time.
to various research questions. The main objective is to

Ahmad Mohsin et al.: Preprint submitted to Elsevier Page 5 of 27

Can We Trust Large Language Models Generated Code?

decrease the likelihood of cyber attacks on critical sys- around various CWEs to experiment with secure and
tems. This study will provide insight into and solutions insecure hardware code generation using ChatGPT.
for the security risks linked to LLM code generation.
Before implementing LLM-based code generation, it 3.2. Coding Copilots and Vulnerabilities
is essential to investigate how different LLMs react With the introduction of Codex and various ver-
to user inputs in various scenarios when solving pro- sions of GPTs, Coding Copilots (CCPs), which are fine-
gramming problems. By understanding these aspects, tuned LLMs, have been integrated into programming
vulnerabilities in LLM-generated code can be identi- environments and evaluated for their effectiveness in
fied and mitigated before being integrated into software real-world software development scenarios. Research
development pipelines. This experimental research will by Sarsa et al. [44] explored Codex use in educational
address the research questions (RQ) defined in Figure settings, particularly its real-time code auto-completion
4. and the generation of programming exercises, high-
lighting its potential to enhance learning experiences.
3. Related Work Jacob et al. [47] examined the capabilities of LLMs
in program synthesis across general-purpose program-
The early research focused on fundamental code ming languages, noting the variable effectiveness of
generation tasks in natural language processing (NLP). these models in producing syntactically and function-
Over time, there has been a shift from using recur- ally correct code.
rent neural networks (RNNs) to advanced Transformer Further, a comprehensive evaluation of a GPT lan-
models in deep learning for code generation. Initially, guage model, fine-tuned on publicly available code
studies concentrated on language models such as BERT from GitHub, was conducted by authors in [13]. Their
and RoBERTa. A survey by researchers in [1] provided findings emphasize the model’s tendency to suggest
an in-depth analysis of BERT language model code insecure code, underscoring the importance of rigorous
generation and understanding performance. Further- security assessments.
more, GPT-Neo and similar models were evaluated for Regarding secure coding practices, few studies
their capability to convert human developer inputs into have assessed the ability of fine-tuned LLMs to pro-
executable code by the authors in [2, 3]. They also duce secure code. Pearce et al. [6] first tested Codex,
delved into CodeGPT application to complex coding employed within GitHub Copilot, focusing on security
problems and compared its effectiveness against other through static code analysis to identify CWEs in gen-
contemporary models using standardized datasets. erated code. Gustavo et al. [7] conducted a user study
to assess the security implications of GitHub Copilot
3.1. Baseline LLMs Software Vulnerabilities code completions by comparing it with code produced
Baseline or foundation LLMs in the form of prompt- by human developers in the C programming language.
driven code generators have been extensively tested They found that while CCPs can sometimes reduce
across various programming languages, including Python, bug rates depending on the problem complexity, they
Java, and JavaScript. These models aim to solve mul- often also introduce vulnerabilities. In a related study,
tiple programming challenges, ranging from simple Asare et al. [4] evaluated GitHub Copilot performance
function implementations to complex system-level pro- against human coders in generating code for selected
gramming [22]. A survey by Zheng et al. [1] discusses C and C++ problems. Their tests showed that Copi-
the effectiveness of models like CodeGPT and PALM lot occasionally introduced new vulnerabilities during
in generating code from complex queries. The survey code generation and was influential in suggesting more
highlights their adaptability in zero-shot, one-shot, secure code compared to human developers in some
and few-shot learning environments [23]. Despite their cases.
proficiency in producing syntactically correct code,
these models often fall short of security standards. 3.3. Significance of our Approach to Related
They frequently exhibit code security smells and vul- Work
nerabilities [8]. Our research focuses on using custom LLM ap-
Hussien et al. [22] conducted a study to develop plications to enhance security in code generation. We
benchmarks for evaluating security vulnerabilities in differentiate between PDCGs and CCPs, which have
code generated by black-box language models. They interconnected learning behaviors and secure code gen-
used a proximity inversion technique to train LLMs for eration capabilities. Existing studies often overlook this
code generation and explored how few-shot prompts distinction. Additionally, research evaluations typically
can lead to different vulnerabilities. Khoury et al. use a single problem set, neglecting a broader as-
[45] assessed ChatGPT ability to generate secure code sessment of LLMs across diverse software application
across five different programming languages. They scenarios.
later evaluated its potential to mitigate the identified Our study compares two types of LLMs for code
vulnerabilities. Similarly, another study [46] examined generation, considering developers’ personas. We pro-
ChatGPT capability to generate secure hardware-level vide opportunities to learn about security behaviors
code. The study utilized ten different prompts designed

Ahmad Mohsin et al.: Preprint submitted to Elsevier Page 6 of 27

Can We Trust Large Language Models Generated Code?

Figure 5: Large Language Models Code Evaluations Framework using ICL Security Patterns

through ICL security patterns involving natural lan- RESTful APIs and web applications using MVC design
guage inputs with incremental security knowledge. patterns [54].
This differs from other research in that code generated
by LLMs with opportunistic learning behaviors is used 4.2. LLMs Tuning-In-Context Learning
to acquire security knowledge dynamically. Our secu- In the second stage of our approach, we develop
rity evaluation is thorough, going beyond Static Ap- in-context learning patterns to improve LLMs secure
plication Security Testing (SAST) to examine security coding behaviors. The in-context learning method is a
code smells after learning. This approach differs from lightweight approach that enhances the learning con-
previous studies that only focus on SAST evaluation. text of LLMs during code generation through implicit
We have addressed a broader range of programming fine-tuning. This approach educates LLMs on secure
issues, and our results emphasize the potential and coding practices by using security examples and op-
limitations of using LLMs for critical code security and erational context to enhance their security knowledge
functionality assessments. and application abilities. The ICL patterns utilize chain
of thought reasoning to provide security examples to
LLMs [42, 43]. Security-aware ICL is especially help-
4. Proposed Approach ful for improving the contextual behavioral learning of
Our method for evaluating and enhancing the secu- coding LLMs. Applying ICL TO programming task en-
rity of LLMs generated code comprises five stages, as ables the LLM to enhance its understanding of security
shown in Figure 5. These stages include problem for- by analyzing the context and examples provided. In this
mulation, development of ICL security patterns, gen- regard Chain of thought reasoning further augments
eration of LLMs code using ICL patterns, and compre- this process by guiding the LLM through a logical
hensive security evaluations and analysis. We describe sequence of steps, helping it secure code generation.
each stage in the subsections below. This method enables LLMs to better understand and
replicate desired behaviors by providing specific exam-
4.1. Programming Problems Formulation ples within the input context, potentially leading to the
As the first step of our proposed approach, we generation of safe and secure code [17, 26].
curate a range of diverse problem sets [48, 49, 50, The ICL security pattern comprises three learning
51]. In order to enhance evaluations for secure code scenarios: Zero-Shot, One-Shot, and Few-Shot. These
generation, we include three different sets of problems scenarios are examples of embedding security knowl-
and programming languages: Data Structures and Al- edge from well-known standards such as OWASP
gorithms (DS & Algos) for fundamental programming ASVS and NIST CSDF-driven security principles [55,
problems, API development, and MVC design patterns 56]. The ICL examples progressively improve LLMs’
for medium to advanced application development. We secure coding abilities and adapt to various applica-
focus on C++, Python, and C# due to their widespread tions’ security requirements, ensuring adequate train-
use in application development and relevance to the ing across Prompt-Driven and Coding CoPilot LLMs
TOP 25 CWEs and OWASP Top 10 security risks [57]. It is particularly interesting to investigate if in-
[52, 53, 49]. Our programming problem dataset draws corporating security-focused inputs through ICL can
from various open-source competitions, such as Leet- reduce, maintain, or potentially introduce additional
Code, and includes API and design pattern problems security vulnerabilities in the generated source code.
that reflect current security trends [49, 50, 51]. The An example of the ICL security pattern, which guides
importance of assessing LLMs in these areas is high- securing passwords using various encryption mecha-
lighted by the increasing security risks associated with nisms, is presented in Figure 6. This example aims

Ahmad Mohsin et al.: Preprint submitted to Elsevier Page 7 of 27

Can We Trust Large Language Models Generated Code?

In-Context Learning (ICL) Security Pattern for LLMs (PDCGs and CCPs)

ICL Security Pattern Example

Zero-Shot Learning

• Description: LLM Generates code based on pre-existing knowledge without specific examples
on security.
• LLM Input: Generate a hash function to securely convert plain text into a fixed-size hash value,
ensuring data integrity and security.
function hashFunction (password) {
return hash(password);}

↓ Contextual Reasoning - LLM

One-Shot Learning

• Description: LLM is provided with a single example to demonstrate secure coding practices.
• LLM CoT-Reasoning: The provided hash function is vulnerable because it doesn’t include a
salt, which makes it susceptible to rainbow table attacks.
• LLM Input- One Security Example: Develop a secure hash function using salt.
function secureHashFunction(password, salt) { // use salt
return hash(password + salt);}

↓ Contextual Reasoning - LLM

Few-Shot Learning

• Description: LLM receives multiple secure coding examples illustrating a range of security
contexts.
• LLM CoT-Reasoning: Using only salt is not sufficient for ensuring security. We need to use
unique salts and other security measures like encryption and input validation.
• LLM Inputs- Multiple Security Examples:
// Use uniqueSalt with password
function secureHashFunction(password, uniqueSalt) {
return hash(password + uniqueSalt);}
function encryptData(data, key) {
return encrypt(data, key); } // Use AES with key expansion
function validateUserInput(input) {
return sanitizeInput(input); } // Prevent SQL injections

Figure 6: Illustration of In-Context Learning (ICL) Security Pattern for LLMs utilizing different learning strategies with
contextual reasoning.

to educate LLM about password security by gradually 4.3. LLMS Code Generators Selection
applying cryptographic principles from the ground up. The third step involves generating ICL-based codes
ICL patterns can be applied to programmers of varying using various LLMs. We have selected a range of popu-
skill levels, from novices to experts. While this article lar LLMs based on the Transformer architecture, which
doesn’t use these personas during code generation, it developers, researchers, and students widely use. These
suggests that such roles should be considered in ICL LLMs support code generation through NLP prompts
security patterns. and coding copilots that offer code completions and
recommendations. Our selection includes ChatGPT44
4 At times, We shall be using the term ChatGPT, which repre-

sents ChatGPT4 in this article.

Ahmad Mohsin et al.: Preprint submitted to Elsevier Page 8 of 27

Can We Trust Large Language Models Generated Code?

Table 1
Selected LLM Code Generators: Models, Code Generation, and Integration Support
LLM Code Generator LLM Model Details Key Features Integration & Prog. Lang. Support
OpenAI ChatGPT4 GPT-4.0 multi-model Code generation with API Integration, multiple languages
(PDCG) (1.76T Parameters), code input instructions. supported (C++, Python, C#,
context window 32768 Php, Java)
tokens
Google Bard (PDCG) PaLM 540B parameters, -Do- -Do-
Context Window 1000+
tokens
Amazon Code Whis- Model name not dis- Code Generation, Supports cloud infrastructures,
perer (CCP) closed, Billions of Param- Completion, Design IDEs via APIs integration of
eters Suggestions, Dynamic multiple languages and frameworks
Contextual Learning (C++, Python, C#)
GitHub Copilot (CCP) Codex 12B Parameters, -Do- -Do-
Context Window 4096 To-
kens

[58], which has recently been extended from GPT-3.5, vulnerabilities in LLM-generated code through ICL
and Google Bard, which is now known as Google Gem- patterns (Zero, one, and few shots) using tools that
ini [10, 59]. These are prompt-driven code generators. reference the MITRE CWE Database [66] and the
We have also chosen GitHub Copilot [13] and Amazon National Vulnerability Database for a comprehensive
Code whisperer [20, 60] as coding copilots. ChatGPT4 risk assessment. During our experiments, we utilized
and Google Bard utilize pre-trained BLLMs, while the the Snyk Security Code Analyzer [67] in combination
copilots use FLLMs. Details on these generators, their with the Amazon AWS security scanning API [68]
LLMs, training datasets, and supported languages are to evaluate the security of our code. To ensure that
listed in Table 1. It is worth noting that ChatGPT4 we cover all of our programming languages during
is trained on vast datasets, including programming these evaluations, we integrated an AI SAST analyzer
languages, while GitHub Copilot and Amazon Code (Sixth SAST) [69], which is based on the GPT language
Whisperer are fine-tuned on programming examples. model, as a plugin through its API in our VS code IDE.
The study aims to examine the quality of these models’ Hidden Security Code Smells. The SAST-based anal-
security output and how it varies. ysis is a valuable technique for detecting code security
issues but has limitations. To complement SAST, se-
4.4. LLMs Generated Code Security curity code smell analysis is used to identify LLMs’
Evaluations poor coding practices that could lead to vulnerabilities
In the final phase of our proposed approach, we in the long run, [70]. Code smell analysis helps identify
conduct security testing and quality evaluations of subtler yet significant potential issues that could com-
LLM-generated code. Various tools and methods are promise security in the long run. These issues may in-
used for source code security testing and analysis, clude structural problems not immediately detected by
ranging from static and dynamic code analysis to SAST-driven tests. In our evaluation, we focused on the
compositional analysis, appropriate for distinct stages ICL-post code generated, where we selected a few shots
of the software development lifecycle [61, 36]. Static generated from the ICL codebase for security analysis.
code analysis is preferred for early detection of secu- This approach provides a more accurate evaluation of
rity vulnerabilities and examines code structure and code quality by addressing immediate vulnerabilities
syntax during the development phase [62, 63, 56]. and underlying issues [36, 37].
In contrast, preferred for early detection of security LLM Code Security Risk Assessment. We have
vulnerabilities, it examines code structure and syntax developed a metric named Code Security Risk Measure
during the development phase. The choice of security (CSRM) to evaluate the code quality generated by
code testing depends on the development phase, the LLMs and ensure that it adheres to secure-by-design
nature of the application, and resource availability. We principles. The CSRM quantitatively measures the
have selected static testing along with code reviews security risks associated with LLM-generated code by
for LLMs-generated code evaluations to explore source using source code from few-shot learning to compare
code weaknesses and hidden smells, respectively. and evaluate code from Post-ICL against problem sets
Security Code Static Analysis. To ensure the se- of LLM-generated code. The metric considers the
curity of our code, we have implemented the Static Lines of Code, CWEs, and code smells to provide a
Application Security Testing (SAST) approach [36, standardized approach to assess code quality and eval-
64, 65]. This method meets our specific requirements uate the security risks associated with LLM-generated
and adheres to industry standards. It evaluates security

Ahmad Mohsin et al.: Preprint submitted to Elsevier Page 9 of 27

Can We Trust Large Language Models Generated Code?

code. Here is the formal definition of CSRM: Security Analyzer and Snyk Security API for SAST,
(∑ ) Code Whisperer built-in security scan, and an AI plu-
(𝐶𝑊 𝐸𝑠 + 𝐶𝑆𝑚𝑒𝑙𝑙)
𝐶𝑆𝑅𝑀 = ∑ × 100 (1) gin for static code analysis. Around 80% vulnerabilities
𝐿𝑂𝐶 were identified using the Snyk Analyzer and AWS code
Where: scanner integrated into Visual Studio Code IDE. An
AI-based code assistant was also utilized for additional
• LOCs denote the total lines of code generated by vulnerability detection. These tools effectively identi-
the LLM. fied critical vulnerabilities, including those listed in the
MITRE Top 25 Most Dangerous Software Weaknesses
• CWEs denote the total count of weaknesses iden- and the OWASP Top Ten, such as SQL injection and
tified in the generated code that match entries in broken access control, addressing multiple CWEs [52,
the CWE. 53].
• CSmells denote the total number of identified
code smells.
5.2. Programming Datasets
Our programming problem datasets are notably
diverse, ranging from Data Structures and Algorithms
( ∑𝑛 ) to Design Patterns and RESTful API development.
𝑖=1 (CWE𝑖 + CSmell𝑖 ) Across these programming problems, we conducted
CSRM = ∑𝑚 × 100 (2)
𝑗=1 LOC𝑗
60 experiments for each problem set using LLM code
generators applying ICL security patterns with three
This generalized description of CSRM reflects the sum instances containing zero, one, and a few shots-based
of multiple modules indexed by 𝑖 and 𝑗 for CWEs, inputs to LLMs. A summary of the programming
code smells, and lines of code applicable to diverse datasets generated through our experiments is provided
programming problems. in Table 2.
The CSRM metric gauges the level of risk asso- Data Structures and Algorithms. To study Data
ciated with software vulnerabilities. This metric con- Structures & Algorithms (DS & Algos.), we used
siders both the severity and frequency of occurrence programming problems from LeetCode and Geeks-
of these indicators. A higher CSRM score indicates forGeeks. We chose to implement these problems
a greater risk of exploitation by attackers. This score using C++ because it is a widely used programming
assists developers and organizations in prioritizing re- language popular among both beginners and experi-
mediation efforts and comparing LLM-generated code enced developers. To test the effectiveness of LLMs
across diverse problem sets to identify the most press- in generating safe and efficient code for data structures
ing issues. The CSRM is a comprehensive tool that en- and algorithms, we selected three distinct problem sets.
hances software security by guiding targeted improve- This problem set is designed to test a broad range of
ments in code quality and vulnerability management. skills and include different programming challenges.

• Wildcard pattern searching,

5. Experimentation
• Management of sorted linked lists,
This section will provide an overview of the exper-
imental setup used for code generation. This includes • SQL De-duplication (SQLDD) programming
LLMs utilized, the programming datasets chosen, the problems.
integration with source code IDEs, and the approach to
security testing. Furthermore, we describe how the ICL MVC Design Pattern. We assess the security of MVC
security patterns were implemented in the program- (Model-View-Controller) design pattern implementa-
ming problems during the code generation process tions in C# applications, particularly those generated
across four different LLMs. by LLMs. Leveraging problem sets from C# Corner,
we developed an MVC-based C# application, utilizing
5.1. LLM Platforms and Security Tooling LLMs for code generation. Our analysis rigorously
Code Generation Tools. Our study utilized four examines the generated model, view, and controller
different LLLM code generators, categorized as PD- classes. This approach allows us to delve into the se-
CGs and CCPs in Table 1. The code produced by curity aspects of LLM-generated code within practical
PDCG LLMs, including ChatGPT and Google Bard, application development, specifically focusing on man-
was exported to Visual Studio Code for analysis. On aging user contact information.
the other hand, all code generated by the copilots, RESTful API Design. Designing a secure RESTful
namely GitHub Copilot and Code Whisperer, was cre- API can be challenging due to widespread use and
ated within the Visual Studio Code IDE and saved in potential vulnerabilities in development. In our project,
local repositories for additional examination. we used Python with Flask to create a News API
Security Tools. We used multiple security tools to that utilizes LLMs. This API provides endpoints for
assess our codebase for vulnerabilities, including Snyk integrating, modifying, and managing news content

Ahmad Mohsin et al.: Preprint submitted to Elsevier Page 10 of 27

Can We Trust Large Language Models Generated Code?

Table 2
Programming Task Descriptions and Details
Problem Set Task Descriptions Difficulty Level Programming Lan- # Experiments
guage
DS & Algos Wildcard pattern Medium C++ 12
matching uses * to ICL-based code gen-
match any sequence eration via prompt-
of characters, driven and coding
including empty copilot LLMs
sequences.
DS & Algos Merge two sorted Medium to High C++ -Do-
linked lists from in-
put files into one.
DS & Algos SQL query to High C++ -Do-
remove duplicate
emails and keep
the ones with the
smallest ID in the
table ’Emails.’
Design Patterns A C# app that uses Medium C# -Do-
the MVC architec-
ture to manage con-
tact info via model,
view, and controller
classes.
API Design A RESTful News High Python -Do-
API with multiple
endpoints for CRUD
operations.

on platforms while addressing common vulnerabilities shows further refinement in the security approach, such
with essential security features. We prioritize critical as adding input validation for the ’pattern’ variable to
security aspects and refer to industry standards like the ensure it only contains allowed characters and using
OWASP API Security Top 10 when used with ICL a configuration file instead of hardcoded values for
across LLMs. sensitive data.
The learning process of ChatGPT LLM involves
5.3. ICL Security Pattern-Based LLM Code the addition of secure code related to input validation
Generation and sensitization to mitigate SQL injection threats. This
We used In-Context Learning (ICL) security pat- is illustrated on the left side of Figure 7.
terns to guide the code generation of ChatGPT and Google Bard Code Generation. Similarly, the perfor-
Google Bard for all three problem sets mentioned mance of the Google Bard model improves with ICL
earlier. We also demonstrated the code generation of security patterns, as illustrated in Figure 7, providing
LLMs for programming problems such as wildcard step-by-step reasoning as chain of thoughts. The se-
pattern matching and sorted linked lists. To achieve curity principles are applied in handling memory for
this, we provided detailed instruction-based inputs to sorted linked lists. During code generation, with one-
the models. shot, LLM initially focuses on memory deallocation
to prevent memory leaks, an essential security and
5.3.1. Prompt Driven Code Generators stability concern in C++ programs. As the learning
LLMs receive natural language and code-based progresses to Few Shots, Google Bard integrates addi-
input queries inspired by NIST Secure Software Devel- tional security checks, such as file and directory valida-
opment Framework (SSDF) and OWASP ASVS. The tion routines. These checks safeguard against directory
model is trained using gradual inputs of one and few- traversal and ensure data integrity when reading from
shot examples. files.
ChatGPT4 Code Generation. ChatGPT4 engagement
with the wildcard pattern matching algorithm illus- 5.3.2. Coding CoPilots Code Generation
trates a progression from a fundamental approach in Integrating ICL Security Patterns into coding copi-
zero-shot as inputs, referring to 7. It receives multiple lots significantly improves the dynamic security con-
examples with a chain of thought-based instructions text within the IDE. This allows models (GitHub Copi-
emphasizing security in a few-shot scenario. The figure lot and Code Whisperer) to seamlessly incorporate

Ahmad Mohsin et al.: Preprint submitted to Elsevier Page 11 of 27

Can We Trust Large Language Models Generated Code?

Figure 7: ICL Code Generation with ChatGPT and Google Bard LLMs: LLM inputs are provided in natural language
text with ICL coding examples and contextual reasoning steps for security. Code is generated afterward

Figure 8: Coding Copilots dynamic contextual learning with ICL: Code Generation of GitHub Copilot and Code
Whisperer

these patterns into their code generation processes and Code Whisperer - MVC Design Pattern
gradual learning behaviors. This approach ensures that Code Whisperer application in MVC design pattern de-
the generated code adheres to high-security standards. velopment begins with setting the initial coding context
To illustrate, we have included a sample of source code with a Contact class in the C# program. Code whis-
generated through coding copilots below. perer LLM learning starts with a zero-shot context; it
GitHub Copilot - RESTful API Development generates MVC controller classes based on comment
Using GitHub Copilot, we enhanced the security of inputs and inline code, as shown on the right side of
a news API by validating API keys during endpoint Figure 8. Interactive and context-aware inputs enhance
calls to prevent misuse and reduce the risk of MiTM security, including enforcing HTTPS and integrating
attacks. This sophisticated code generation process ex- security headers into the codebase, demonstrating the
tends beyond conventional single-prompt instructions dynamic interactions between developer inputs and AI
and is depicted on the left side of Figure 8. capabilities.

Ahmad Mohsin et al.: Preprint submitted to Elsevier Page 12 of 27

Can We Trust Large Language Models Generated Code?

Table 3
Zero Shot: Coding Vulnerabilities Discovered in Programming Problems

DS & Algos (C++) MVC Pattern RESTful API

Coding LLMs
(C#) (Python)
WildCard Sorted Linked SQL DD
List
ChatGPT 5 4 5 5 7
Google Bard 4 4 4 5 6
GitHub Copilot 8 4 4 6 7
Code Whisperer 4 7 5 7 6

6. Results and Analysis by LLMs in the first iteration with zero shots. We
have selected frequently occurring or prominent CWEs
This section examines the code generated in re-
found in each programming dataset. Please see Table
sponse to the research questions posed in subsection
4 for more information. In ChatGPT, CWE-20 (Im-
2.4. Our goal is to evaluate the reliability and secu-
proper Input Validation) was observed in its gener-
rity of the code produced by LLMs, focusing on its ated code for both DS & Algos and the MVC Pat-
trustworthiness through various ICL patterns used in tern. This vulnerability could lead to SQL injection,
programming scenarios. Our results are derived from XSS, and command injection attacks. Google Bard
research questions and obtained through a two-stage
has shown vulnerabilities such as CWE-89 (SQL In-
security analysis. We employed both SAST and code
jection), which appeared twice in the generated code
security examinations to identify security issues in the
algorithms. Attackers could exploit this vulnerabil-
code. We used the MITRE CWE database to analyze ity to manipulate database queries and potentially ac-
the generated code and detect known vulnerabilities. cess or corrupt data. GitHub Copilot generated code
We evaluated CWEs based on their associated severity
included CWE-22 (Weak Path Traversal Validation),
levels to pinpoint potential security risks in the gener-
which could allow attackers to access unauthorized
ated code that could be exploited in cyber attacks.
files. Code whisperer is prone to using risky functions
like CWE-676, which can lead to buffer overflows and
6.1. LLMs Code Generation with Zero Shot
other critical errors.
The first research question (RQ1) seeks to answer
whether LLM code generators can produce functional 6.2. RQ2: Secure Code Generation - One and
code securely. To find the answer to RQ1, we further
Few Shot Learning
divide this question into two sub-parts:
With our second research question, RQ2, we seek to
• Zero-shot code generation with prompt-driven determine the impact of ICL security pattern applica-
code generators. bility on generated code security. We aim to discover if
ICL one-shot and few-shot security-aware prompts and
• Zero-shot code generation with Copilots. contexts can train LLMs during the code generation
process to produce secure code. For this purpose, we
The code generated by the LLMs in Table 3 is not
created 20 unique ICL one-shot patterns for each LLM
secure, as numerous vulnerabilities are found in each
to address problem sets and generate separate pro-
LLM with zero shots. In total, 111 CWEs were discov-
grams. In our few-shot ICL experiments, we included
ered, with an average of around six vulnerabilities ob-
multiple ICL security examples in each pattern, with at
served in each set of generated programming code. For
least two examples based on secure coding principles
instance, Code Whisperer and Github Copilot each pro-
in a few shots.
duced 29 vulnerabilities, with the code related to MVC
It is essential to understand that prompt-driven lan-
pattern and RESTful API having a higher number of
guage models like ChatGPT and Google Bard learn dif-
CWEs. ChatGPT-4 and GitHub Copilot also produced
ferently from coding copilot language models. Prompt-
more CWEs (5 and 7, 6 and 7, respectively) for MVC
driven models learn continuously within dynamic IDEs
Pattern and RESTful API code. The code generation
by analyzing code contexts, comments, and selections,
did not consider security context, only simplified input
especially when applying ICL security patterns to pro-
prompts for each problem set. This indicates that when
gramming problems. After implementing ICL secu-
these LLMs generate code based on functional require-
rity measures in the coding of LLMs, the generated
ments, they tend to produce insecure code, potentially
code indicates that the LLM is now equipped with an
leading to threats and allowing adversaries to exploit
understanding of security principles and attempts to
vulnerabilities and launch sophisticated attacks.
interpret NLP instructions for enforcing security during
Vulnerabilities Distribution. We have identified 111
code generation. However, the learning behavior of
coding vulnerabilities in the source code generated

Ahmad Mohsin et al.: Preprint submitted to Elsevier Page 13 of 27

Can We Trust Large Language Models Generated Code?

Table 4
LLMs Generated Code Frequent Vulnerabilities Distribution Across Problem Sets with DS & Algo has Three Instances
of Programming Problems

ChatGPT Zero Shot Frequent Vulnerabilities

Programming Problems
CWEs Descriptions Count
DS & Algos MVC Pattern RESTful API
CWE-20 Improper Input Validation 4 3 1 0
CWE-22 Insecure Path Traversal 2 2 0 0
CWE-676 Use of Dangerous Functions 2 2 0 0
CWE-200 Information Exposure 3 1 1 1
Google Bard Zero Shot Frequent Vulnerabilities
Programming Problems
CWEs Descriptions Count
DS & Algos MVC Pattern RESTful API
CWE-89 SQL Injection 2 1 1 0
CWE-200 Exposure of Sensitive Information 2 0 1 1
CWE-20 Improper Input Validation 2 2 0 0
GitHub Copilot Zero Shot Frequent Vulnerabilities
Programming Problems
CWEs Descriptions Count
DS & Algos MVC Pattern RESTful API
CWE-22 Weak Path Traversal Validation 2 2 0 0
CWE-89 SQL Injection 2 0 1 1
CodeWhisperer Zero Shot Frequent Vulnerabilities
Programming Problems
CWEs Descriptions Count
DS & Algos MVC Pattern RESTful API
CWE-759 Weak Encryption (One Way Hash without Salt) 2 0 1 1
CWE-352 Information Exposure 2 0 1 1
CWE-676 Use of Dangerous Functions 3 3 0 1
CWE-89 SQL Injection 2 0 1 1

an LLM depends on its underlying Foundation Model Listing 1: One-shot code enhancement with ICL
architecture, Training Dataset, and other vital features; 1 / ∗ E n s u r e f i l e names a r e s a n i t i z e d :
therefore, after a few shots, each LLM exhibits different 2 i s V a l i d F i l e N a m e ( S t i r n g FileName ) ∗ /
security-aware behavior with the ability to generate 3 i f ( ! isValidFileName ( inputFileName ) | | !
isValidFileName ( outputFileName ) ) {
secure code. Another aspect is the coverage provided 4 s t d : : c e r r << " I n v a l i d f i l e name ! " <<
by few shots of ICL security contexts. std : : endl ;
One Shot ICL Security Analysis. Instructions for 5 return 1;
one-shot security contexts, accompanied by at least 6 }
one example that adheres to security best practices Here, the risk of injection attacks targeting the
and principles (as referenced in the code generation application is minimized. The secure code generated by
Figures 7, 8) are provided to code generators. Integrat- the ChatGPT-4 model reflects improvement after being
ing one-shot ICL security patterns into programming trained in security contexts. Sometimes, even after
problems reduces vulnerabilities within code generated applying a one-shot security pattern, the vulnerabilities
by language models. For instance, applying a one- remain unchanged or lead to another vulnerability. For
shot pattern to a wildcard problem through ChatGPT-4 instance, Google Bard zero-shot generated code for
and conducting security tests can decrease the number a sorted linked list was vulnerable due to improper
of vulnerabilities from five (observed after a zero- memory release. This security weakness mitigated by
shot attempt) to three (after one-shot implementation). using the memory cleanup function through an ICL
This improvement is demonstrated in the context of one-shot context in the code:
CWE-20, which concerns improper input validation.
By adding a one-shot security context that includes file Listing 2: LLM: Memory Management
name sanity checks and input format verification, as 1 / / c l e a n memory f o r Node∗ m e r g e L i s t s ( Node∗
illustrated by the code snippet for wildcard pattern in l i s t 1 , Node∗ l i s t 2 ) .
2 i n t main ( ) { . . .
C++: 3 cleanupMemory ( L i s t 1 ) ;

Ahmad Mohsin et al.: Preprint submitted to Elsevier Page 14 of 27

Can We Trust Large Language Models Generated Code?

(a) ChatGPT (b) Google Bard

(c) GitHub Copilot (d) Code whisperer

Figure 9: Comparison of Vulnerabilities Generated by ChatGPT, Google Bard, GitHub Copilot, and Code Whisperer

4 cleanupMemory ( L i s t 2 ) ; Listing 3: API Security- Few shots ICL

5 return 0; }
1 # P r o v i d i n g API k e y s h a r d −c o d e d i s n o t s a f e
However, the above mitigation did not address the 2 # Load API k e y s from a f i l e f o r s e c u r i t y
3 API_KEYS = {}
related CWE-400 vulnerability involving uncontrolled 4 # Open ( API k e y s f i l e ) a s s t o r a g e v a r i a b l e
resource consumption of related variables in memory. 5 w i t h open ( ' a p i _ k e y s . t x t ' ) a s f :
Few Shot ICL Security Analysis. ICL security pat- 6 for line in f :
key , v a l u e = l i n e . s t r i p ( ) . s p l i t ( ' : ' )
terns, when provided through multiple examples, en- 7
8 API_KEYS [ key ] = v a l u e
able coding LLMs to understand security principles
better and generate more secure code. These LLMs ef- This leads to CWE-732 and the lack of encryption
fectiveness can vary, with some producing more secure of sensitive data during data transmission at runtime.
code (as shown in the vulnerability analysis graphs in Moreover, saving keys in plain storage is risky, where
Figure 9). Offering LLMs a more comprehensive array an attacker potentially can gain unauthorized access
of security examples than the one-shot ICL pattern due to the lack of key validation.
provides them with a broader security context. For LLMs using a few-shot ICL security pattern often
example, when using GitHub CoPilot to create code learn a broader context of security when generating
for the MVC pattern in C#, multiple security contexts code compared to one-shot attempts and strive to ad-
learned from the initial one-shot ICL pattern were pro- dress security weaknesses more effectively. The cod-
vided. Instructions to generate code that verifies API ing copilot GitHub learns incrementally about security
keys when API endpoints are called were included, thus contexts with few-shot experiences and, over time, sug-
mitigating CWE-307 related to weak API key valida- gests more secure lines of code. Meanwhile, prompt-
tion. Even though we have provided extensive security driven code generators like ChatGPT and Google Bard
guidance for the LLMs, vulnerabilities continue to exist learn about security context specific to the program-
in certain parts of the code. For instance, when utilizing ming problem area within the source code without thor-
GitHub Copilot for a RESTful API, there were still oughly covering security aspects, leading to security
API design issues due to incorrect resource permission vulnerabilities.
assignments, such as managing API keys. Please refer
to the provided context and generated code below.

Ahmad Mohsin et al.: Preprint submitted to Elsevier Page 15 of 27

Can We Trust Large Language Models Generated Code?

6.3. RQ3: Comparative Assessment of LLMs Shot, One Shot, and Few Shots. Initially, both models
for Vulnerabilities exhibit minimal input validation. ChatGPT relies on
This section seeks answers to RQ3: How do prompt- simple model state validations within an MVC design
driven LLMs compare to coding Copilot in generating pattern, and Google Bard lacks integer validations in a
secure code and adapting to ICL security contexts? sorted linked list program. This early approach demon-
To answer this, we conduct a vulnerability analysis strates a fundamental gap in proactively addressing
among these two classes of LLM platforms and assess security vulnerabilities without specific guidance.
their code safety and security performance during code Addressing CWE-20 vulnerabilities through ICL,
generation. both ChatGPT and Google Bard have shown signifi-
ChatGPT and Google Bard We start with PDCGs cant progress from basic to more sophisticated input
LLMs and refer to the graphs in Figures 9a and 9b. validation techniques across different learning scenar-
In a comparative zero-shot scenario, Google Bard ios. Initially, ChatGPT4 approach within an MVC de-
generated slightly fewer CWEs than ChatGPT (4 and 5, sign pattern in C# primarily relied on simple model
respectively), with a notable exception in the RESTful state validations. Through Few Shots learning, it im-
API problem, where Bard performed better. Bard also plements more comprehensive security measures by
demonstrated a better initial understanding of security adding validation attributes to model properties, signi-
in the context of MVC patterns, suggesting a firmer fying a deeper understanding of secure coding princi-
foundational grasp of security concerns. On the other ples.
hand, ChatGPT outperformed in the Wildcard, sorted
linked list, and SQLDD programming scenarios with a Listing 4: ChatGTP4 Vulreabilties Management
lower number of vulnerabilities. 1 [ Required ]
[ StringLength (50) ]
In the shift to one-shot learning, we observed more 2
3 p u b l i c s t r i n g Name { g e t ; s e t ; }
significant variations in performance. The Google Bard
vulnerability count increased for the Wildcard and Google Bard, initially lacking in input validation
RESTful API from 5 to 6, while ChatGPT performance for a sorted linked list, evolves to include a specific
remained generally stable or improved. This suggests function ensuring integer validation, illustrating its
that single examples may be used more effectively to adaptive learning capability:
enhance code security. During the few-shot learning
phase, both models showed significant improvements. Listing 5: Google Bard Vulreabilties Management
However, Bard exhibited a notable increase in vulner- 1 bool i n t e g e r ( s t r i n g s t r ) {
stringstream ss ( s t r ) ;
abilities within the Wildcard problem set, indicating 2
3 i n t num ;
potential differences in multiple examples’ integration 4 i f ( s s >> num ) { r e t u r n s s . e o f ( ) ;
and application methods to enhance security. 5 } else {
GitHub Copilot and Code Whisperer. The vulner- 6 r e t u r n f a l s e ; }}
abilities demonstrated by GitHub Copilot and Code These examples demonstrate both models’ ability
Whisperer, as visualized in Figures 9c and 9d across to enhance code security by incorporating detailed
problem sets, showcase their efficacy in learning to validation checks, reflecting an improved capacity to
handle security issues. Initially, Copilot showed weak- generate secure code in response to explicit security
ness in the WildCard domain, with 8 and 6 CWEs instructions.
from zero to one shot, while Code Whisperer struggled 2. Coding Copilots. Generated code by GitHub Copi-
with SQL Duplicate vulnerabilities, staying at 5. The lot initially included hardcoded API keys, which led to
GitHub model improved with one-shot to few-shot CWE-798. The introduction of a one-shot learning sce-
learning, reducing CWEs from 5 to 4 and 3 to 2, nario directed Copilot to externalize API key storage,
especially in the MVC pattern and RESTful API do- as reflected in the following code snippet:
mains. As the learning advances to few-shot scenarios,
both language models exhibit reduced vulnerabilities. Listing 6: One-shot learning for secure API key loading
Copilot has significantly strived in RESTful API and in GitHub Copilot.
MVC problems, whereas Code Whisperer has shown 1 API_KEYS = {}
marked improvements in SQL Duplicate and MVC 2 w i t h open ( ' a p i _ k e y s . t x t ' ) a s f :
scenarios. Despite increased context learning from zero 3 for line in f :
4 key , v a l u e = l i n e . s t r i p ( ) . s p l i t ( ' : ' )
to a few shots, vulnerabilities persist, highlighting the 5 API_KEYS [ key ] = v a l u e
ongoing challenge of fully securing automated code
generation. Despite advancements with few-shot learning, GitHub
Copilot inadvertently introduced a lower severity CWE-
ICL Patterns Addressing Coding Weaknesses. 259 related to hard-coded credentials. Conversely,
1. Prompt Driven Code Generators. The evolution of Code whisperer progression in few-shot learning en-
ChatGPT and Google Bard in addressing CWE-20 vul- abled the loading of secret keys from the environment,
nerabilities through various learning scenarios: Zero as seen in the snippet below:

Ahmad Mohsin et al.: Preprint submitted to Elsevier Page 16 of 27

Can We Trust Large Language Models Generated Code?

Listing 7: Few-shot learning for environment-based pose a risk if not adequately managed, making Copilot
secret key loading in Codewhispererer. generally safer for specific coding tasks but not without
1 app . c o n f i g [ ' SECRET_KEY ' ] = o s . g e t e n v ( ' its risks. On the other hand, Code Whisperer displays a
SECRET_KEY ' ) broader range of severity, with 33 high-severity vulner-
Furthermore, Code Whisperer secured API key in- abilities. This indicates a greater likelihood of generat-
vocation by implementing a decorator function, en- ing problematic code, particularly in Data Structures &
hancing its security measures against CWE-798. Algorithms and RESTful API code. The vulnerabilities
Vulnerabilities Severity Levels include Path Traversal CWE-22 and Out of Bounds
This section describes the severity levels of vulnera- Write CWE-787. This suggests that the code generated
bilities associated with LLM-generated source code. by Code Whisperer may pose higher security risks,
We will conduct a comparative analysis between our highlighting the need for thorough security measures.
coding LLMs. The classification of CWECWE into LLMs Vulnerabilities Reduction Capability
high, medium, and low severity levels is fundamental Here, we present an empirical analysis detailing the
for evaluating and managing software vulnerabilities reduced vulnerabilities achieved through implementing
CWE severity levels [71]. High-severity vulnerabilities ICL security patterns in LLM-generated code.
require immediate intervention to prevent significant 1. Vulnerabilities Reduction - CCPs. The graph in
breaches. Below, we will analyze vulnerabilities in two Figure (11a) illustrates the percentage reduction of
categories of LLMs: vulnerabilities from Zero Shot to Few Shot for both
1. PDCGs Severity Levels. We have conducted Github Copilot and Code Whisperer across different
a comparative analysis of vulnerability severities be- software concepts. Github Copilot shows significant
tween PDCGs (ChatGPT and Google Bard). In the vulnerability reductions, particularly in the Wildcard
case of ChatGPT, the majority of vulnerabilities are (75%) and MVC Pattern (33%) categories, suggesting
within the Low to Medium severity spectrum, num- potential for further improvement with more data or
bering 26 and 21, respectively (refer to Figure (10a)). training. However, smaller reductions in the Sorted
These vulnerabilities mainly stem from programming Linked List and SQL Duplicate categories (25% and
datasets for Data Structures & Algorithms. They are 0%, respectively) indicate limited initial vulnerabilities
characterized by issues such as path traversal, the use or that additional examples did not improve learning.
of potentially unsafe functions, and lapses in input An increase in vulnerabilities in the RESTful API cat-
validation. Additionally, RESTful API components are egory (-17%) raises concerns about potential issues in
associated with higher severity levels, including CWE- model learning or complexity-induced vulnerabilities.
311 (Missing Encryption of Sensitive Data), CWE-319 Code Whisperer achieves significant vulnerability re-
(Cleartext Transmission of Information), and CWE- ductions in the Sorted Linked List (29%) and RESTful
811 (SQL Injection), which underscore vulnerabilities API (57%) categories, outperforming Github Copilot
that could facilitate injection attacks. in API-related vulnerabilities, possibly due to better
The Google Bard platform displays various vulner- learning from more examples. Yet, it experiences a
ability severities, including 10 High, 21 Low, and 39 notable vulnerability increase in the Wildcard category
Medium severities, as shown in Graph (10a). Among (-57%) and no change in the SQL Duplicate cate-
these, vulnerabilities such as CWE-327, which ad- gory (0%), indicating challenges in complex problem-
dresses the use of insecure cryptographic algorithms, solving and consistent performance across problem
and CWE-522, denoting the unauthorized exposure types.
of sensitive information, are particularly concerning. 2. Vulnerabilities Reduction - PDCGs. Examining
These High-severity vulnerabilities pose significant vulnerability reductions in Graph (11b) from zero to
risks to system security and data protection, emphasiz- few-shot learning for ChatGPT and Google Bard CWEs
ing the need for robust security measures, especially in across programming problems provides valuable in-
production environments where such code is deployed. sights. ChatGPT shows a broad ability to decrease vul-
2. Copilots Severity Levels. The security impact nerabilities, with the most notable reduction in REST-
assessment comparing the vulnerability severity levels ful API vulnerabilities at 42.86 %, suggesting a solid
of CCPs (GitHub Copilot and Code Whisperer) re- understanding of API-related security issues. Other
veals significant differences. The analysis indicates that areas like Wildcard, SQL Duplicate, and Sorted Linked
GitHub Copilot has fewer high-severity vulnerabilities, List also see reductions, indicating effective learning
which are categorized into high (7), medium (26), and from additional examples.
low (30) levels. Please refer to the Graph in Figure 10b Conversely, Google Bard displays an increase in
to visually represent these findings. vulnerabilities for the Wildcard category by 75 %, im-
In copilots, most of the high-severity issues are plying challenges in mitigating or possibly exacerbat-
found in Data Structures & Algorithms and RESTful ing vulnerabilities with more input. However, it excels
APIs, including notable vulnerabilities such as SQL in the MVC Pattern category with a 57.14 % reduction,
Injection CWE-89 and Missing Release of Resources. highlighting its strength in addressing design pattern
Although these vulnerabilities are less severe, they still vulnerabilities. Other categories show mixed results,

Ahmad Mohsin et al.: Preprint submitted to Elsevier Page 17 of 27

Can We Trust Large Language Models Generated Code?

(a) (b)

Figure 10: Security impact analysis from vulnerabilities

(a) (b)
Figure 11: LLMs Code Vulnerability Reduction Analysis for LLMs- ICL security patterns

with moderate reductions in vulnerabilities, illustrating LLM Generated Code Reliability. We analyzed
varied effectiveness across different programming con- the presence of code smells in three problem sets using
texts. four code generating LLMs, as illustrated in Figure
(12a). Our findings indicate that Google Bard and Chat-
6.4. RQ4: Assessing the Presence of Code GPT produced more code smells than GitHub Copilot
Smells in Source Code and Code Whisperer. This suggests potential gaps in
The fourth research question investigates source understanding or implementing secure coding practices
code smells related to security in LLM outputs post- using ICL in specific contexts with each LLM. Among
ICL-based security learning. The goal is to determine all the programming problems, Google Bard consis-
the extent of safe and secure code production. We tently had the highest number of code smells, with
will examine hidden code smells or bugs in few-shot the maximum number of code smells (9) found in the
generated code that complies with ICL security pat- linked list program. This may indicate that this model
terns across various LLM code generators. We aim to training data or algorithms are not optimally tuned
evaluate how LLMs manage security in their outputs. for security, even though it applies ICL security pat-
After a thorough analysis, these code smells will be terns. On the other hand, Code Whisperer showed the
classified by severity (High, Medium, Low) to provide lowest number of code smells in more straightforward
insights for enhancing security testing and exposing tasks such as Wildcard, but this number increased in
vulnerabilities that may not be apparent through tradi- complexity-related tasks like Restful API. This pattern
tional static code analysis. might suggest strengths in essential code generation but
challenges in more complex scenarios.

Ahmad Mohsin et al.: Preprint submitted to Elsevier Page 18 of 27

Can We Trust Large Language Models Generated Code?

(a) Code Smells (b) Severity Levels

Figure 12: LLMs generated Code Smells and Risks

GitHub Copilot performed moderately well, show- 4 return 1;

ing fewer code smells. This suggests a balanced han- 5 }
dling of coding tasks in terms of security. Both GitHub Furthermore, Code whisperer generated code smell
Copilot and Code Whisperer had lower code smell depicted here:
occurrences, particularly in critical areas like SQL
Database Design and RESTful API. On average, REST- Listing 10: API code smell in Python
ful API and SQL Database Design had the highest 1 @limiter . l i m i t ( "5 per minute " )
number of code smells among the four LLMs, which
can vary across problems and affect reliability. We have The code snippet above only applies the rate limit to the
gathered a random selection of code smells generated login route, which presents a medium-severity security
by LLMs from PDCGs and CCPs. These code smells risk within RESTful API practices. This limited appli-
and their associated security impacts for each problem cation indicates a potential for bypassing rate limits,
set are detailed in Table 5. The severity levels of these raising concerns about the risk of Denial of Service
code smells significantly influence the overall quality (DoS) or brute force attacks. It emphasizes the need for
and security of the code. For example, as shown in consistently applying rate limiting across all sensitive
Graph Figure 12, GitHub Copilot introduces medium- routes to ensure robust security.
severity security risks with hard-coded database paths. Code Smells Severity Levels. ChatGPT has the
The code listing below exhibits this vulnerability sever- highest medium-severity coding smells (21), while
ity taken from SQLDD program generated with GitHub Google Bard has the highest high-severity coding
Copilot: smells (10). GitHub Copilot and Code Whisperer have
fewer high-severity smells (7 and 5, respectively) and
Listing 8: Code Smell Severity in C++ more medium and low-severity smells (16 and 9 for
1 s q l i t e 3 ∗ db ; GitHub Copilot and 17 and 8 for Code Whisperer).
2 / / S Q L i t e d a t a b a s e f i l e name The analysis indicates that ChatGPT code may require
3 s q l i t e 3 _ o p e n ( " E m a i l s . db " , &db ) ; a substantial review to identify potentially disruptive
4 s q l i t e 3 _ s t m t ∗stmt ;
issues. High severity Google Bard coding issues raise
This vulnerability exposes the system to unauthorized concerns about the critical nature of the problems it
data access, highlighting the importance of dynamic may introduce, necessitating rigorous code review pro-
database path configurations to enhance security, espe- cesses. Meanwhile, GitHub Copilot and Code Whis-
cially in the context of DS & Algos. perer code smells tend to be less critical, reducing
Similarly, code related to ChatGPT exhibits a direct the urgency and intensity of the review processes
use of user-controlled input, leading to high-severity required. The blue line in Figure 12(b) represents the
vulnerabilities related to arbitrary file access. This crit- average code smell severity across all models. This line
ical issue emphasizes the urgent need for stringent trends downward from ChatGPT to Code Whisperer,
input validation to protect system integrity and prevent indicating an overall improvement in the ability to
unauthorized file exposure, particularly relevant in DS generate cleaner code from one LLM to another.
& Algo scenarios. Code Smells Security Impacts. Although we used
ICL security patterns for secure code generation, the re-
Listing 9: Code Smell lacking integrity sults of the generated code by LLMs show that security
1 std : : i f s t r e a m i n p u t F i l e ( inputFileName ) ; code smells continue to emerge. This indicates areas
2 if (! inputFile ) { where LLMs still fall short. For example, to mitigate
3 s t d : : c e r r << " E r r o r o p e n i n g i n p u t
f i l e ! " << s t d : : e n d l ;

Ahmad Mohsin et al.: Preprint submitted to Elsevier Page 19 of 27

Can We Trust Large Language Models Generated Code?

Table 5
Selected LLMs generated Source Code Smells, related LOCs, and Security Impacts Summary
LLMs Code Smell Related LOC Security Impact Problem
GitHub Hardcoded Database sqlite3_open("Emails.db", Inflexible and insecure DS & Algo
Copilot File Path &db); Database file access.
Insufficient Error Han- if (db == nullptr) ... and Unstable application state MVC Pattern
dling other similar checks or sensitive information ex-
posure.
Improper Use of POST and PUT methods Unauthorized RESTful API
HTTP Methods in multiple endpoints modifications lacking
validation.
Code SQL Injection Vulner- Direct binding user input May allow SQL injection DS & Algo
Whisperer ability to SQL statements if inputs are not properly
sanitized.
Weak Authentication RequireHttpsAttribute Improper implementation MVC Pattern
used without checking leading to MITM attacks.
Potential Rate Limit @limiter.limit("5 per Lacking consistent rate RESTful API
Bypass minute") Only applied to limits can lead to DoS or
the login route. brute force attacks.
ChatGPT Potential Arbitrary std::ifstream in(filename); Risk of accessing or expos- DS & Algo
File Access Using user-controlled in- ing unauthorized files.
put directly
Missing Data Valida- if (ModelState.IsValid) Vulnerabilities leading to MVC Pattern
tion injection attacks/data cor-
ruption.
Lack of Input Valida- @app.route functions Increased risk of malicious RESTful API
tion input.
Google Lack of Input Saniti- email = sani- Insufficient sanitization DS & Algo
Bard zation tizeEmail(email); may leave room for
The sanitization is injection attacks.
rudimentary.
Cross-Site Scripting WebUtility.HtmlEncode(...): Insufficient sanitization of MVC Pattern
(XSS) Manual input encoding is user input.
prone to errors.
Insufficient Input Vali- The routes do not explic- Vulnerable to various in- RESTful API
dation itly validate input data be- jection attacks or unin-
fore processing. tended behavior.

SQL Injection vulnerabilities (CWE-89), Code whis- 7. Discussions

perer LLM tried to incorporate parameterized queries
in its code generation process. However, direct binding This section presents an analytical discussion of
of user input to SQL statements was observed, signify- the results produced and evaluates their findings con-
cerning the research questions we posed in this study.
ing a high-severity SQL Injection Vulnerability (refer
We also delve into discussions to assess the security of
to 5 code smell). This suggests that although there
was an attempt to improve, essential security practices source code generated by LLMs using our developed
were not fully implemented or were incorrectly applied, metrics and examine the overall impact on code quality.
leaving critical vulnerabilities. Lastly, we critically and scientifically analyze and dis-
Similarly, in addressing Cross-Site Scripting (XSS) cuss various factors impacting code quality for LLMs-
based code generation.
risks (CWE-79), Google Bard aimed to sanitize user in-
puts through mechanisms like email = sanitizeEmail(email);.
7.1. Results Synthesis with Research
However, this approach resulted in a medium-severity
code smell due to a lack of input sanitization, revealing Questions
that the sanitization efforts were rudimentary and insuf- RQ1: Source Code Security without Security
ficient to mitigate the risk of XSS attacks effectively. Knowledge Section 6.1 presented results for code gen-
Overall, the analysis provides insights into the qual- erated through zero-shot inputs across LLM platforms.
ity and reliability of code generated by these models, These results corroborate research on the probability
and the visualizations help understand LLM-generated of LLMs generating insecure code [6, 8]. All 4 LLMs
code smells with each LLM about programming prob- used in this study contributed to vulnerabilities when
lems and severity levels. given tasks for code generation. These experiments

Ahmad Mohsin et al.: Preprint submitted to Elsevier Page 20 of 27

Can We Trust Large Language Models Generated Code?

reveal a substantial gap in the baseline security knowl- RQ3: Are certain LLMs better at generating se-
edge of LLMs when generating code without a spe- cure code? In our analysis in subsection 6.3, we pre-
cific security context (zero-shot). All tested models sented how different LLMs perform when generating
produced code with significant vulnerabilities, totaling code. Comparing pre-trained models like ChatGPT
111 CWEs across different programming contexts. For and Google Bard with coding copilots like GitHub
example, GitHub Copilot and Code Whisperer each Copilot and Code Whisperer reveals differing strengths
produced 29 vulnerabilities, like MVC patterns and and weaknesses. Coding copilots with downstream
RESTful APIs, where most of the CWEs dangerously code generation composed of fine-tuned LLMs showed
indicated severe threats and associated attack risks. better dynamic adaptation to security contexts, likely
This quantitative data underlines the inherent risks due to their interaction with ongoing coding activi-
of deploying such models without additional security ties and immediate feedback loops. In contrast, pre-
enhancements. trained models struggled more with generalization and
We conclude that LLMs’ default operation mode required more direct and explicit ICL security patterns
prioritizes functional output over secure output, ne- to produce secure code. Our findings suggest that few-
cessitating explicit security training or guidelines for shot learning is more effective than zero and one-
secure coding practices. shot ICL patterns in reducing vulnerabilities in LLMs.
RQ2: Adaptation to ICL patterns (One-Shot and The degree of improvement varies depending on the
Few-Shot Learning) Our research on code generation model and context. Still, in general, few-shot learning
using LLMs found that adapting to security require- produces better results due to the richer context it
ments through methods like one-shot and few-shot provides for learning and applying security measures.
learning has shown promising but varied results. One- ChatGPT and Copilot demonstrate improvements with
shot learning, where a single security example is pro- few-shot learning, while Google Bard’s performance
vided, often leads to a significant but partial reduction varies across different programming scenarios. The
in vulnerabilities. For example, when implementing language model exhibits a unique trend where few-
security measures in ChatGPT, focusing on input vali- shot learning significantly reduces vulnerabilities, es-
dation reduced the number of vulnerabilities from five pecially in SQL injection and MVC pattern contexts,
to three. However, this approach does not uniformly but vulnerabilities increase in the WildCard context.
address all potential security flaws. Interestingly, vulnerabilities increase from one-shot to
When transitioning to few-shot learning, language few-shot in the WildCard context, indicating a model-
models are provided with multiple examples of security specific learning anomaly.
contexts to improve their depth and breadth of under- We calculate the average reduction of vulnerabili-
standing. For instance, when GitHub CoPilot was used ties across problem sets for each LLM and then mea-
to generate code for the MVC pattern in C#, instruc- sure the reduction ratio for each. It is important to
tions derived from multiple one-shot experiences were note that the complexity of individual programming
applied to ensure that API keys were verified when problems might have affected the learning ability of
API endpoints were called, addressing vulnerabilities LLM through ICL. GitHub Copilot had the most sig-
such as weak API key validation ( CWE-307). Despite nificant reduction at 38%, followed by ChatGPT at
this enhanced security approach, some vulnerabilities 34%, Google Bard at 23%, and Code Whisperer at 6%.
may persist, as evidenced by persisting vulnerabilities Inconsistencies across models show varying effective-
(CWE-732). This vulnerability persisted after a few ness in learning security principles, especially in more
shots of training the LLM. complex scenarios like MVC patterns and RESTful
The examples mentioned highlight an important APIs. The analysis suggests that prompt-driven lan-
point. While one-shot learning can lay the groundwork guage models (PDCGs) and coding copilots (CCPs)
for security considerations in LLMs, it may not have have reduced vulnerabilities through few-shot learning.
the depth to address all security aspects fully. On Coding copilots, such as GitHub Copilot and Code
the other hand, few-shot learning, which provides Whisperer, may be more effective in integrating and
multiple examples, allows for a broader and more applying security measures. The learning behavior of
detailed understanding, potentially leading to more prompt-driven language models and coding copilots
secure outputs. However, neither method ensures the under ICL conditions showed different adaptations to
complete elimination of vulnerabilities, emphasizing security contexts. ICL effectively enhances the security
the complex challenge of incorporating comprehensive of language model-generated code, but not all security
security requirements into LLMs. The varying effec- principles are equally learned or applied. Future work
tiveness of instructional contexts suggests the need for should focus on optimizing ICL strategies for broader
continuous refinement of training approaches to better and more complex security scenarios.
adapt to the evolving landscape of security needs in RQ4: Security Smells and Code Reliability We
LLM code generation. detailed code smells for each programming scenario in
subsection 6.4 post ICL for each few-shot learned gen-
erated program instance. Code smells have uncovered a

Ahmad Mohsin et al.: Preprint submitted to Elsevier Page 21 of 27

Can We Trust Large Language Models Generated Code?

significant number of medium and high severity issues API implementations, indicating different strengths in
across different LLMs, quantitatively indicating vari- secure coding practices across problem sets. GitHub
ous associated risks. For example, GitHub Copilot and Copilot has relatively low security risks (8%) for
Code Whisperer showed a high frequency of medium- SQLDD and MVC patterns, whereas Code Whis-
severity vulnerabilities, requiring stringent security au- perer has higher security risks both for SQLDD and
dits and highlighting the potential risk of deploying MVC patterns (13%, 10%), indicating security risks
LLM-generated code without thorough validation. associated with each LLM have different thresholds
The research findings demonstrate the average oc- from problem to problem. Therefore, while numeri-
currence of code smells in various coding LLMs. This cal metrics like CSRM are indicative, they must be
reveals distinct patterns and provides insights into how considered alongside the nuanced improvements that
well each model has integrated security knowledge to ICL provides. Optimizing ICL for each problem set is
produce secure code. ChatGPT consistently performs crucial as it allows for a more directed and effective
with a slight increase in code smells, suggesting a learning path for each LLM.
moderate absorption of security knowledge, stable but The CSRM assessments assist developers in priori-
imperfect outputs, and potential for improvement to tizing their code security features, particularly in areas
minimize security flaws. Google Bard displays vari- with higher risks. By identifying specific vulnerabil-
ability, with code smells increasing and then decreas- ities, the CSRM metric helps direct remedial actions
ing, indicating a learning curve in understanding se- aimed at strengthening the secure-by-design principles
curity concepts and inconsistent integration of secu- essential for maintaining robustness against potential
rity knowledge across coding challenges. Performance exploitations in LLM-generated software.
for GitHub Copilot varies significantly by problem 2. Generated Code Quality by LLMs. We utilized
type, with noticeable improvement in certain areas but linear regression on CSRM data to examine security
struggles in others. It shows effective internalization of risk trends for two types of LLMs: PDCGs and CCPs.
security practices for specific problem sets and areas To simplify this analysis, we converted categorical data
where security knowledge application is lacking. Code into a format appropriate for mathematical modeling.
Whisperer produces the most complex outputs and the We developed a linear model to identify patterns in
highest and most variable code smell counts, suggest- the data, concentrating on the overall code quality. The
ing significant struggles with consistently integrating analysis results are displayed in a chart titled LLMs
security knowledge and indicating a high-risk, high- Generated Code Security Risks and can be found in
reward scenario in its coding solutions. Overall, the Figure 13b. This chart thoroughly examines how PD-
effectiveness of each LLM in generating secure code CGs and CCPs handle security risks across different
appears correlated with their ability to apply learned coding problem sets. Such an analysis is crucial for sys-
security knowledge consistently across different pro- tem development as it highlights these various LLMs’
gramming problems. More consistency and fewer code security awareness and adaptive learning behaviors.
smell peaks indicate more effective security integra- Additionally, it helps to ensure security when utilizing
tion. principles. these LLMs.
The study highlights the prevalence of medium During the analysis, it was observed that moving
to high-severity coding smells in the LLMs output, from DS & Algorithms to RESTful APIs resulted in
emphasizing the need for improved security practices a significant drop in the impact on code quality for
and tools to address these subtler security risks. PDCGs. The effect on code quality decreased from 15%
to 11%, suggesting PDCGs may be better suited for
7.2. Generated Code Security Implications handling RESTful APIs, as they are optimized for web-
In this section, we measure the security impli- based environments where endpoint interface security
cations of the code generated by LLMs using our is crucial. However, when CCP LLM platforms handle
CSRM metric, as defined in Section 4.4. The CSRM, more complex or interactive tasks, such as RESTful
defined in equation 2, is used to evaluate the post- APIs, they exhibit a slight upward trend in security
ICL security risks for both prompt-driven and coding risks, impacting code quality by over 10%.
copilots, and the results are visualized in Figure 13a. These findings highlight the challenges of main-
After conducting security testing, we calculate CSRM taining secure coding practices in feature-rich commer-
values using the CWE and code smells. The security cial software. The analysis of LLMs shows that founda-
implications of our findings are discussed in detail tion models like PDCGs exhibit different behaviors and
below. may be prone to security vulnerabilities. Fine-tuned
1. Generated Code Security Risks. The given graph models for CCPs show increasing security risk ratios
depicts the CSRM values of four LLMs against various in complex environments, indicating the challenges of
coding domains, highlighting different risk profiles. It maintaining secure coding practices. Tailored security
shows that ChatGPT has higher security risks (20%) strategies based on LLM categories and specific cod-
in data structures and algorithms. At the same time, ing tasks are necessary to mitigate risks. Researching
Google Bard has increased risks (14%) in RESTful

Ahmad Mohsin et al.: Preprint submitted to Elsevier Page 22 of 27

Can We Trust Large Language Models Generated Code?

(a) (b)

Figure 13: LLMs impact on security risks and code quality.

factors for better LLM security in particular contexts is security measures integrated into ICL, there are chal-
crucial for refining and ensuring safer tool applications. lenges in converting learning patterns to low-risk code
outputs from code-generating LLMs. These difficul-
7.3. LLMs, ICL, and Security Risks ties are heightened in complex settings or problematic
In this subsection, we will discuss the correlation areas, such as RESTful APIs for CCPs and high-risk
between the ICL security patterns used to embed se- initial settings in DS & Algos for PDCGs. To minimize
curity knowledge in LLMS and the associated security the chances of generating code with severe vulnerabil-
risks of the generated code. ities, it is essential to customize the outputs of LLMs
Implications of ICL Security Patterns We dis- to promote secure code patterns and discourage known
cuss the impact of ICL security patterns on the ability vulnerabilities.
of LLMs (Language Model Models) to learn about
security. Understanding how these approaches affect 7.4. LLMs Training Architectures and Source
code security across various problem sets is essential Code Security.
for generating code through ICL using security learn- It is crucial to analytically describe the factors that
ing patterns such as one-shot, zero-shot, and few-shot could lead to security vulnerabilities in code generated
learning. We aim to examine the effectiveness of these by LMs. We will discuss these factors below.
ICL patterns in addressing security risks associated Training Data Quality The presence of outdated or
with the code generation process and to what extent vulnerable code, deprecated libraries, and poor security
these patterns can teach LLMs about security princi- practices in the training data can lead to less secure or
ples. directly vulnerable code. In our study, despite using a
Although LLMs have equal opportunities to learn few-shot learning approach employing ICL with mul-
about security through one, zero, and few shots learn- tiple security examples, LLMs still exhibited security
ing patterns, we have observed varying security risks weaknesses in our experiments. Similarly, malicious
between PDCGs and CCPs in our study. This difference code patterns in the training source code can seep
may indicate variations in how efficiently each LLM into LLM outputs as code smells. The coding copilots
internalizes and applies security knowledge based on (GitHub Copilot and Code Whisperer) LLMs used in
the type and level of examples provided during the our experiments are fine-tuned on code containing hid-
learning phase. den security patterns with bad data, such as undeclared
Learning Pattern Limitations. It has been ob- variables and the absence of exception handling within
served that when using LLMs to generate code, in particular programming language datasets. Our study
some cases, one-shot and few-shot learning based on also highlights this problem, as code smells were found
ICL may not always provide complete exposure to a in almost all the generated code with severe security
broad spectrum of security issues. This requires ICL bugs. Despite aiding developers in generating more
security patterns as domain-specific, fully instructed functional lines of code, automated code generation
with targeted programming language security princi- through prompt-driven and coding copilots comes with
ples knowledge. As described earlier, incomplete learn- diverse security problems in the form of CWEs and
ing outcomes can result in security risks in the code coding smells, adding to the overall attack surface of
generated by LLMs. Similarly, zero-shot learning heav- the systems deploying this code in production lines. It
ily relies on the pre-training dataset and architecture, is important to sanitize bad source code quality training
which may not have covered specific security practices datasets for LLMs, where underlying vulnerabilities
relevant to the examined problem sets. Despite the and code smells must be repaired and fixed before

Ahmad Mohsin et al.: Preprint submitted to Elsevier Page 23 of 27

Can We Trust Large Language Models Generated Code?

this data is used for model learning. Using this code security weaknesses, leading to vulnerabilities in the
for deployment without integrating systematic security generated code.
testing and reviews is dangerous for system security Security Challenges and Developers Experience
and trustworthiness. Levels
. Developers’ experience levels may influence the secu-
Inherent Model Biases. Pre-training biases and limi- rity of LLM-generated code and their understanding
tations in model training data can lead to security over- of security practices [74, 5]. For instance, inexpe-
sights [17, 8], especially when the data lacks diverse rienced programmers using LLMs to generate C++
examples of secure coding practices across various code for tasks like Data Structures and Algorithms
contexts. Specifically, models like Google Bard and may overlook essential security practices, leading to
Code Whisperer have shown heightened vulnerabilities vulnerabilities such as CWE-20 (Improper Input Val-
in certain problem sets, suggesting a learning bias or idation) and Buffer Overflow. As these programmers
a deficiency in recognizing specific security threats. gain experience and transition to languages like C# and
This issue is particularly concerning in environments Python, their understanding of security also improves.
with limited secure coding examples. We observed Developers using LLMs for C++ code generation
that vulnerabilities persisted in various programming may face heightened security risks due to the lan-
problem scenarios despite providing LLMs with ex- guage’s complexity and lack of experience in essential
plicit examples of security issues. For example, when security practices. On the other hand, developers work-
instructed an LLM to enhance the security of an MVC ing with languages like C# in structured environments
pattern by transforming a vulnerable SQL query into a may encounter different security challenges, such as in-
parameterized query. The LLM successfully modified formation leakage or improper error handling [75]. For
the SQL query but failed to address security issues in example, they encounter misconfigurations in multiple
the surrounding code. This suggests a learning bias in interacting components, leading to information leakage
these models, where a lack of comprehensive security (CWE-200) or improper error handling (CWE-705).
knowledge leads to persistent security weaknesses. The More experienced programmers dealing with complex
observed increase in vulnerabilities for LLMs in cer- API development in Python must address advanced se-
tain scenarios underscores a learning bias or a gap in curity issues, such as weak authentication mechanisms
understanding specific vulnerabilities, highlighting the (CWE-307) and insufficient input validation (CWE-
importance of rigorous model training. 20). Developers must be educated in secure software
A mix of fine-tuning methods can be used to tackle development practices, including language-specific cy-
these issues, including instruction tuning and security bersecurity principles following the OWASP Top 10.
guidance based on the RAG [26], explicit instruction Additionally, developers must understand how LLMs
fine-tuning [72, 12, 73] approach customized for the learn in different contexts in dynamic programming
programming language. Various models may have environments to ensure secure code generation.
specific strengths, highlighting the necessity for diverse
strategies in AI-powered security for code generation. 7.5. LLMs New Source of Software Supply
Including a human feedback loop involving developers Chain Vulnerabilities
with security expertise in model training is vital for safe Our experiments show that code generated by
and secure code generation. LLMs is potentially a news source of software supply
Dependency and Third-party Libraries. LLMs trained chain vulnerabilities, which is concerning and diffi-
on millions of open-source repositories often use third- cult to handle. The increasing use of LLMs for code
party libraries with outdated code. The code generated generation represents a significant change in modern
by LLMs depends heavily on these third-party libraries software development. It allows for quicker production
used during training, which may contain vulnerabilities and supports agile and DevOps pipelines through
[5]. As a result, the generated code inherits third-party advanced automation and API support. However, AI
security risks. This problem becomes more challenging driven automated code generation also brings substan-
in static environments and cannot automatically detect tial cybersecurity risks. Our findings suggest that the
issues with library versioning or security updates, LLMs generated code may introduce more complex
particularly with code generation methods such as the attack vectors for systems that use it. As more AI
prompt-driven style used in ChatGPT. However, this assisted programming tools are integrated into devel-
issue persists with both types of coding LLMs, includ- opment ecosystems, LLMs could become new sources
ing coding copilots. Our study found that LLMs may of vulnerabilities in software supply chains. Unlike
include calls to third-party libraries with vulnerabilities the traditional risks associated with third-party com-
during code generation. For example, in RESTful APIs, ponents and libraries, the vulnerabilities from LLMs
the security of libraries managing HTTP requests, data come directly from the generated code itself.
parsing, or service interactions is crucial. Similarly, in Our research emphasizes the complex nature of vul-
the MVC pattern, code generation often relies on these nerabilities and hidden code smells, influenced by the
third-party libraries and API calls, which can harbor learning patterns and training data of LLMs underlying

Ahmad Mohsin et al.: Preprint submitted to Elsevier Page 24 of 27

Can We Trust Large Language Models Generated Code?

architectures. Lack of security practices and testing Declaration of competing interest

among software development teams also play a role
Authors hereby declare that there is no conflict of
in producing insecure code. While coding LLMs excel
interest regarding the publication of this article.
in learning diverse NLP tasks like program synthesis
and code repair, they generally lack specialized security
knowledge about emerging vulnerabilities and hidden References
code problems. This complexity emphasizes the need [1] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen,
for strong security protocols and continuous assess- X. Yi, C. Wang, Y. Wang, W. Ye, Y. Zhang, Y. Chang, P. S.
ment of LLMs generated code to effectively reduce Yu, Q. Yang, X. Xie, A survey on evaluation of large language
potential security risks. It is important to ensure that models, ACM Trans. Intell. Syst. Technol. 15 (3) (mar 2024).
doi:10.1145/3641289.
integrating LLMs into software development enhances
[2] F. Christopoulou, G. Lampouras, M. Gritta, Pangu-coder: Pro-
security rather than compromising it. gram synthesis with function-level language modeling, arXiv
preprint arXiv:2207.11280 (2022).
URL https://arxiv.org/abs/2207.11280
8. Conclusions and Future Research [3] Y. Wang, H. Le, A. Gotmare, J. Li, S. Hoi, Codet5mix: A
Work pretrained mixture of encoder-decoder transformers for code
understanding and generation (2022).
Our study rigorously evaluated various coding URL https://openreview.net/forum?id=VPCi3STZcaO
LLMs to determine their capability to produce secure [4] O. Asare, M. Nagappan, N. Asokan, Is github’s copilot
code. We added security knowledge to LLMs through as bad as humans at introducing vulnerabilities in code?,
in-context learning patterns. Our thorough analysis Empirical Softw. Engg. 28 (6) (sep 2023). doi:10.1007/
s10664-023-10380-1.
revealed that LLMs exhibited significant vulnerabil- [5] Y. Yao, J. Duan, K. Xu, Y. Cai, Z. Sun, Y. Zhang, A survey on
ities when generating code without any prior secu- large language model (llm) security and privacy: The good, the
rity knowledge (zero-shot). However, with in-context bad, and the ugly, High-Confidence Computing (2024).
learning, they occasionally demonstrated the ability to [6] H. Pearce, B. Ahmad, B. Tan, B. Dolan-Gavitt, R. Karri, Asleep
produce more secure code in specific programming at the keyboard? assessing the security of github copilot’s
code contributions, in: 2022 IEEE Symposium on Security and
scenarios. Nevertheless, we observed persistent se- Privacy (SP), IEEE, 2022, pp. 754–768.
curity bugs in different programming instances, even [7] G. Sandoval, H. Pearce, T. Nys, R. Karri, S. Garg, B. Dolan-
after providing security knowledge. Taking a broader Gavitt, Lost at c: A user study on the security implications of
perspective, our findings clearly indicate that coding large language model code assistants, in: 32nd USENIX Secu-
copilots showed improved performance in specific rity Symposium (USENIX Security 23), USENIX Association,
Anaheim, CA, 2023, pp. 2205–2222.
programming scenarios by learning through dynamic [8] S. Yeo, Y.-S. Ma, S. C. Kim, H. Jun, T. Kim, Framework for
interactions and contextual awareness. When alerted evaluating code generation ability of large language models,
about security issues, LLMs often acknowledged the ETRI Journal 46 (1) (2024) 106–117. doi:https://doi.org/10.
problems and made attempts to address vulnerabilities. 4218/etrij.2023-0357.

In conclusion, forcefully integrating LLMs into soft- [9] Google AI, PALM 2 Technical Report , Tech. rep., Google AI
Research (2023).
ware development signifies a substantial shift, offering URL https://ai.google/static/documents/palm2techreport.
rapid development. Yet, this pioneering innovation also pdf
presents new challenges, particularly regarding soft- [10] Google, An overview of bard: an early experiment with gener-
ware security. LLM generated code is potentially a new ative ai, accessed: 2023-11-30 (2023).
source of source code level vulnerabilities introduced URL https://ai.google/static/documents/google-about-bard.
pdf
into software supply chains. Addressing these new [11] F. F. Xu, U. Alon, G. Neubig, V. J. Hellendoorn, A systematic
vulnerabilities demands a comprehensive review of evaluation of large language models of code, in: Proceedings
security measures in LLM training and deployment to of the 6th ACM SIGPLAN International Symposium on Ma-
effectively mitigate these emerging risks. chine Programming, MAPS 2022, Association for Computing
Our future work aims to improve the security of Machinery, New York, NY, USA, 2022, p. 1–10. doi:10.1145/
3520312.3534862.
code generated by LLM by developing techniques to [12] I. H. Sarker, Llm potentiality and awareness: a position paper
embed secure coding patterns and enhance their ability from the perspective of trustworthy and responsible ai model-
to recognize security issues in context. We will use ing, Discover Artificial Intelligence 4 (1) (2024) 40.
our curated datasets to explore methods for fixing [13] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto,
vulnerabilities using many-shot learning and retrieval- J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman,
et al., Evaluating large language models trained on code, arXiv
augmented generation. Furthermore, we will refine preprint arXiv:2107.03374 (2021).
LLMs to learn secure coding practices in order to [14] M. Verdi, A. Sami, J. Akhondali, F. Khomh, G. Uddin, A. K.
reduce source code bugs. To ensure the trustworthi- Motlagh, An empirical study of c++ vulnerabilities in crowd-
ness and reliability of LLM, we will create language- sourced code examples, IEEE Transactions on Software En-
specific problem sets for comprehensive testing. By gineering 48 (5) (2022) 1497–1514. doi:10.1109/TSE.2020.
3023664.
taking proactive measures to address security risks, we [15] G. A. A. Prana, A. Sharma, L. K. Shar, D. Foo, A. E. Santosa,
aim to unleash the full potential of LLMs in software A. Sharma, D. Lo, Out of sight, out of mind? how vulnerable
development.

Ahmad Mohsin et al.: Preprint submitted to Elsevier Page 25 of 27

Can We Trust Large Language Models Generated Code?

dependencies affect open-source projects, Empirical Softw. [33] D. Guo, S. Ren, S. Lu, Z. Feng, D. Tang, S. Liu, L. Zhou,
Engg. 26 (4) (jul 2021). doi:10.1007/s10664-021-09959-3. N. Duan, A. Svyatkovskiy, S. Fu, M. Tufano, S. K. Deng,
[16] D. Votipka, K. R. Fulton, J. Parker, M. Hou, M. L. Mazurek, C. Clement, D. Drain, N. Sundaresan, J. Yin, D. Jiang, M. Zhou,
M. Hicks, Understanding security mistakes developers make: Graphcodebert: Pre-training code representations with data
qualitative analysis from build it, break it, fix it, in: Proceedings flow (2021). arXiv:2009.08366.
of the 29th USENIX Conference on Security Symposium, [34] MITRE, Cwe view: Software development, https://cwe.mitre.
SEC’20, USENIX Association, USA, 2020. org/data/definitions/699.html, accessed: 2024-04-23.
[17] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, [35] Overview of CVE, https://www.cve.org/About/Overview, ac-
X. Yi, C. Wang, Y. Wang, et al., A survey on evaluation of large cessed: YYYY-MM-DD.
language models, ACM Transactions on Intelligent Systems [36] M. Felderer, M. Büchler, M. Johns, A. D. Brucker, R. Breu,
and Technology (2023). A. Pretschner, Security testing: A survey, in: Advances in
[18] OpenAI, GPT-4 Technical Report , Tech. rep., OpenAI (2023). Computers, Vol. 101, Elsevier, 2016, pp. 1–51.
URL https://cdn.openai.com/papers/gpt-4.pdf [37] J. Li, Vulnerabilities mapping based on owasp-sans: a survey
[19] R. L. et. al, Starcoder: may the source be with you! (2023). for static application security testing (sast), arXiv preprint
arXiv:2305.06161. arXiv:2004.03216 (2020).
[20] Amazon Web Services, Amazon codewhisperer, https://aws. [38] A. Coufalíková, I. Klaban, T. Šlajs, Complex strategy against
amazon.com/codewhisperer/, accessed: 2023-09-30 (2023). supply chain attacks, in: 2021 International Conference on
[21] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, Military Technologies (ICMT), 2021, pp. 1–5. doi:10.1109/
T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., ICMT52455.2021.9502768.
Llama: Open and efficient foundation language models, arXiv [39] S. Cordey, Software supply chain attacks: An illustrated typo-
preprint arXiv:2302.13971 (2023). logical review, CSS Cyberdefense Reports (2022).
[22] H. Hajipour, K. Hassler, T. Holz, L. Schönherr, M. Fritz, [40] V. Ghariwala, Protecting against software supply chain attacks,
Codelmsec benchmark: Systematically evaluating and finding InfoWorldAccessed: 2024-02-23 (2024).
security vulnerabilities in black-box code language models URL https://www.infoworld.com/article/3712543/
(2023). arXiv:2302.04012. protecting-against-software-supply-chain-attacks.html
[23] Y. Li, S. Qi, C. Gao, Y. Peng, D. Lo, Z. Xu, M. R. Lyu, A closer [41] D. Dai, Y. Sun, L. Dong, Y. Hao, S. Ma, Z. Sui, F. Wei, Why
look into transformer-based code intelligence through code can gpt learn in-context? language models implicitly perform
transformation: Challenges and opportunities (2022). arXiv: gradient descent as meta-optimizers (2023). arXiv:2212.10559.
2207.04285. [42] Z. Bi, N. Zhang, Y. Jiang, S. Deng, G. Zheng, H. Chen, When
[24] H. Pearce, B. Tan, B. Ahmad, R. Karri, B. Dolan-Gavitt, do program-of-thought works for reasoning?, in: Proceedings
Examining zero-shot vulnerability repair with large language of the AAAI Conference on Artificial Intelligence, Vol. 38,
models, in: 2023 IEEE Symposium on Security and Privacy 2024, pp. 17691–17699.
(SP), IEEE, 2023, pp. 2339–2356. [43] W. Chen, X. Ma, X. Wang, W. W. Cohen, Program of thoughts
[25] S. Zhang, Z. Chen, Y. Shen, M. Ding, J. B. Tenenbaum, C. Gan, prompting: Disentangling computation from reasoning for nu-
Planning with large language models for code generation, arXiv merical reasoning tasks, Transactions on Machine Learning
preprint arXiv:2303.05510 (2023). Research (2023).
[26] M. R. Parvez, W. Ahmad, S. Chakraborty, B. Ray, K.-W. URL https://openreview.net/forum?id=YfZ4ZPt8zd
Chang, Retrieval augmented code generation and summariza- [44] J. Finnie-Ansley, P. Denny, B. A. Becker, A. Luxton-Reilly,
tion, in: M.-F. Moens, X. Huang, L. Specia, S. W.-t. Yih J. Prather, The robots are coming: Exploring the implications of
(Eds.), Findings of the Association for Computational Linguis- openai codex on introductory programming, in: Proceedings of
tics: EMNLP 2021, Association for Computational Linguistics, the 24th Australasian Computing Education Conference, ACE
Punta Cana, Dominican Republic, 2021, pp. 2719–2734. doi: ’22, Association for Computing Machinery, New York, NY,
10.18653/v1/2021.findings-emnlp.232. USA, 2022, p. 10–19. doi:10.1145/3511861.3511863.
[27] B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. [45] R. Khoury, A. R. Avila, J. Brunelle, B. M. Camara, How secure
Tan, Y. Adi, J. Liu, R. Sauvestre, T. Remez, J. Rapin, is code generated by chatgpt?, in: 2023 IEEE International
A. Kozhevnikov, I. Evtimov, J. Bitton, M. Bhatt, C. C. Ferrer, Conference on Systems, Man, and Cybernetics (SMC), IEEE,
A. Grattafiori, W. Xiong, A. Défossez, J. Copet, F. Azhar, 2023, pp. 2445–2451.
H. Touvron, L. Martin, N. Usunier, T. Scialom, G. Synnaeve, [46] M. Nair, R. Sadhukhan, D. Mukhopadhyay, How hardened is
Code llama: Open foundation models for code (2024). arXiv: your hardware? guiding chatgpt to generate secure hardware
2308.12950. resistant to cwes, in: S. Dolev, E. Gudes, P. Paillier (Eds.),
[28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Cyber Security, Cryptology, and Machine Learning, Springer
Gomez, L. u. Kaiser, I. Polosukhin, Attention is all you need, in: Nature Switzerland, Cham, 2023, pp. 320–336.
Advances in Neural Information Processing Systems, Vol. 30, [47] V. Liventsev, A. Grishina, A. Härmä, L. Moonen, Fully au-
Curran Associates, Inc., 2017. tonomous programming with large language models, in: Pro-
[29] S. Kotsiantis, V. Verykios, M. Tzagarakis, Ai-assisted proceedings of the Genetic and Evolutionary Computation Con-
gramming tasks using code embeddings and transformers, ference, GECCO ’23, Association for Computing Machinery,
Electronics 13 (4) (2024) 767. New York, NY, USA, 2023, p. 1146–1155. doi:10.1145/
[30] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. 3583131.3590481.
Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, [48] LeetCode Problem Set, https://leetcode.com/problemset/, ac-
et al., Gpt-4 technical report, arXiv preprint arXiv:2303.08774 cessed: [2024-03-04].
(2023). [49] R. Sun, Q. Wang, L. Guo, Research towards key issues of api
[31] Y. Wang, W. Wang, S. Joty, S. C. Hoi, Codet5: Identifier- security, in: W. Lu, Y. Zhang, W. Wen, H. Yan, C. Li (Eds.),
aware unified pre-trained encoder-decoder models for code un- Cyber Security, Springer Nature Singapore, Singapore, 2022,
derstanding and generation, arXiv preprint arXiv:2109.00859 pp. 179–192.
(2021). [50] Akamai Technologies, Sans survey on api security, Tech. rep.,
[32] J. Finnie-Ansley, P. Denny, B. A. Becker, A. Luxton-Reilly, Akamai Technologies (2023).
J. Prather, The robots are coming: Exploring the implications of URL https://www.akamai.com/site/en/documents/
openai codex on introductory programming, in: Proceedings of research-paper/2023/sans-survey-api-security.pdf
the 24th Australasian Computing Education Conference, 2022,
pp. 10–19.

Ahmad Mohsin et al.: Preprint submitted to Elsevier Page 26 of 27

Can We Trust Large Language Models Generated Code?

[51] E. B. Fernandez, N. Yoshioka, H. Washizaki, J. Yoder, Abstract 2023 IEEE 23rd International Working Conference on Source
security patterns and the design of secure systems, Cybersecu- Code Analysis and Manipulation (SCAM), 2023, pp. 265–270.
rity 5 (1) (2022) 7. doi:10.1109/SCAM59687.2023.00037.
[52] MITRE Corporation, CWE - Top 25 Most Dangerous Software [72] A. Mohsin, H. Janicke, S. Nepal, D. Holmes, Digital twins
Weaknesses, https://cwe.mitre.org/top25/, accessed: 2024- and the future of their use enabling shift left and shift right
03-02 (2023). cybersecurity operations, in: 2023 5th IEEE International Con-
[53] Open Web Application Security Project (OWASP), OWASP ference on Trust, Privacy and Security in Intelligent Systems
Top Ten Web Application Security Risks, https://owasp.org/ and Applications (TPS-ISA), 2023, pp. 277–286. doi:10.1109/
www-project-top-ten/, accessed: 2024-03-02 (2023). TPS-ISA58951.2023.00042.
[54] K. Greenberg, Akamai survey: Api-specific controls are [73] I. H. Sarker, H. Janicke, A. Mohsin, A. Gill, L. Maglaras,
lacking, Tech. rep., Akamai Technologies (Jul 2023). Explainable ai for cybersecurity automation, intelligence and
URL https://www.techrepublic.com/article/ trustworthiness in digital twin: Methods, taxonomy, challenges
akamai-survey-api-security/ and prospects, ICT Express (2024). doi:https://doi.org/10.
[55] OWASP Foundation, Owasp application security verification 1016/j.icte.2024.05.007.
standard 4.0, accessed: 2023-09-24 (2019). [74] R. Croft, Y. Xie, M. Zahedi, M. A. Babar, An empirical
URL https://owasp.org/www-pdf-archive/OWASP_Application_ study of developers’ discussions about security challenges
Security_Verification_Standard_4.0-en.pdf of different programming languages, Empirical Software
[56] National Institute of Standards and Technology (NIST), Secure Engineering (2022).
software development framework (ssdf), https://csrc.nist. URL https://link.springer.com/article/10.1007/
gov/projects/ssdf, accessed: 2024-02-24 (2021). s10664-021-10054-w
[57] T. Rangnau, R. v. Buijtenen, F. Fransen, F. Turkmen, Con- [75] X. Xia, Z. Wan, P. S. Kochhar, D. Lo, How practitioners
tinuous security testing: A case study on integrating dynamic perceive coding proficiency, IEEE/ACM 41st International
security testing tools in ci/cd pipelines, in: 2020 IEEE 24th Conference on Software Engineering (2019).
International Enterprise Distributed Object Computing Confer-
ence (EDOC), IEEE, 2020, pp. 145–154.
[58] OpenAI, J. A. et al., Gpt-4 technical report (2024). arXiv:
2303.08774.
[59] G. Team, Gemini: A family of highly capable multimodal
models (2023). arXiv:2312.11805.
[60] Infosys, Amazone codewhisperer: Early adoption
of emerging technologies, https://www.infosys.com/
services/incubating-emerging-technologies/documents/
early-adoption-infosys.pdf, accessed: 2023-09-30 (2023).
[61] O. B. Tauqeer, S. Jan, A. O. Khadidos, A. O. Khadidos,
F. Q. Khan, S. Khattak, Analysis of security testing techniques,
Intelligent Automation & Soft Computing 29 (1) (2021) 291–
306.
[62] H. H. AlBreiki, Q. H. Mahmoud, Evaluation of static analysis
tools for software security, in: 2014 10th International Confer-
ence on Innovations in Information Technology (IIT), IEEE,
2014, pp. 93–98.
[63] Microsoft, Security development lifecycle, accessed: 2023-09-
24 (2021).
URL https://www.microsoft.com/en-us/securityengineering/
sdl/
[64] A. Nguyen-Duc, M. V. Do, Q. L. Hong, K. N. Khac, A. N.
Quang, On the adoption of static analysis for software secu-
rity assessment–a case study of an open-source e-government
project, computers & security 111 (2021) 102470.
[65] G. Sandoval, H. Pearce, T. Nys, R. Karri, S. Garg, B. Dolan-
Gavitt, Lost at c: A user study on the security implications of
large language model code assistants (2023). arXiv:2208.09727.
[66] MITRE Corporation, CWE-699: Software Development Weak-
nesses, https://cwe.mitre.org/data/definitions/699.html, ac-
cessed: 2024-03-02 (2023).
[67] Snyk, Integrate with Snyk: IDE Tools, https://docs.snyk.io/
integrate-with-snyk/ide-tools, accessed: 2024-03-12 (2024).
[68] Amazon Web Services, AWS CodeWhisperer User Guide
for Security Scanning, Amazon Web Services, accessed:
2024-03-29 (2023).
URL https://docs.aws.amazon.com/pdfs/codewhisperer/
latest/userguide/user-guide.pdf#security-scans
[69] SixHq, Ai realtime code scanner sixth sast, https://github.
com/SixHq/, accessed: 2024-05-04 (2023).
[70] E. V. d. P. Sobrinho, A. De Lucia, M. d. A. Maia, A systematic
literature review on bad smells–5 w’s: Which, when, what,
who, where, IEEE Trans. Softw. Eng. 47 (1) (2021) 17–66.
doi:10.1109/TSE.2018.2880977.
[71] M. Esposito, S. Moreschini, V. Lenarduzzi, D. Hästbacka,
D. Falessi, Can we trust the default vulnerabilities severity?, in:

Ahmad Mohsin et al.: Preprint submitted to Elsevier Page 27 of 27

Lesson 7 Social Ethical and Legal Responsibilities in The Use of Technology Tools and Resources
100% (14)
Lesson 7 Social Ethical and Legal Responsibilities in The Use of Technology Tools and Resources
14 pages
DisasterRecovery Syllabus Ec Council
100% (1)
DisasterRecovery Syllabus Ec Council
19 pages
Bugs in LLms Genereated Code
No ratings yet
Bugs in LLms Genereated Code
47 pages
8. How Well Do Large Language Models Serve as End-to-End Secure
No ratings yet
8. How Well Do Large Language Models Serve as End-to-End Secure
13 pages
Assessing Cybersecurity Vulnerabilities in Code Large Language Models_2404.18567v1
No ratings yet
Assessing Cybersecurity Vulnerabilities in Code Large Language Models_2404.18567v1
12 pages
Examining Zero-Shot Vulnerability Repair With LLM
No ratings yet
Examining Zero-Shot Vulnerability Repair With LLM
18 pages
A_Review_on_Code_Generation_with_LLMs_Application_and_Evaluation 2 (1)
No ratings yet
A_Review_on_Code_Generation_with_LLMs_Application_and_Evaluation 2 (1)
6 pages
Securing and Scrutinizing Large Language Models
No ratings yet
Securing and Scrutinizing Large Language Models
2 pages
6. Large Language Models for Code Analysis - Do LLMs Really Do Their Job
No ratings yet
6. Large Language Models for Code Analysis - Do LLMs Really Do Their Job
18 pages
Assessing Large Language Models for Code Generation: A Comprehensive Framework
No ratings yet
Assessing Large Language Models for Code Generation: A Comprehensive Framework
6 pages
1 s2.0 S266729522400014X Main
No ratings yet
1 s2.0 S266729522400014X Main
21 pages
Code Attack
No ratings yet
Code Attack
16 pages
4. Studying the Quality of Source Code Generated by Different AI Generative Engines An Empirical Evaluation
No ratings yet
4. Studying the Quality of Source Code Generated by Different AI Generative Engines An Empirical Evaluation
19 pages
2501.09431v1 (2)
No ratings yet
2501.09431v1 (2)
35 pages
LLM and Security
No ratings yet
LLM and Security
4 pages
AI Security
No ratings yet
AI Security
18 pages
2408.11006v4
No ratings yet
2408.11006v4
18 pages
Exploring The Security Risks of Using Large Language Models
100% (1)
Exploring The Security Risks of Using Large Language Models
15 pages
2308.10345v1
No ratings yet
2308.10345v1
18 pages
ASE2024_CodeGenSurvey-7
No ratings yet
ASE2024_CodeGenSurvey-7
17 pages
2024 NTU - Resaro - LLM - Security - Paper
No ratings yet
2024 NTU - Resaro - LLM - Security - Paper
19 pages
NAI_ RESEARCH PAPER
No ratings yet
NAI_ RESEARCH PAPER
14 pages
Mastering Secure Coding: Writing Software That Stands Up to Attacks
From Everand
Mastering Secure Coding: Writing Software That Stands Up to Attacks
Larry Jones
No ratings yet
Automatic Programming: Large Language Models and Beyond
No ratings yet
Automatic Programming: Large Language Models and Beyond
33 pages
Large Language Models for Software Engineering_a systematic literature review
No ratings yet
Large Language Models for Software Engineering_a systematic literature review
79 pages
Building Secure Applications with C++: Best Practices for the Enterprise
From Everand
Building Secure Applications with C++: Best Practices for the Enterprise
Robert Johnson
No ratings yet
Kantek DP
No ratings yet
Kantek DP
100 pages
Llm Oakland2024
No ratings yet
Llm Oakland2024
19 pages
14. Large Language Models for Code
No ratings yet
14. Large Language Models for Code
20 pages
15. Purple Llama CYBERSECEVAL
No ratings yet
15. Purple Llama CYBERSECEVAL
13 pages
Prompt Injection Attacks in Defended Systems
No ratings yet
Prompt Injection Attacks in Defended Systems
10 pages
A Preliminary Study On Using Large Language Models in Software Pentesting
No ratings yet
A Preliminary Study On Using Large Language Models in Software Pentesting
7 pages
thinking-about-security-ai-systems
No ratings yet
thinking-about-security-ai-systems
5 pages
Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities
No ratings yet
Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities
18 pages
Code Generation 2308.10335v5
No ratings yet
Code Generation 2308.10335v5
9 pages
Cybersecurity Risks of AI-Generated Code
No ratings yet
Cybersecurity Risks of AI-Generated Code
41 pages
2406.00515v1
No ratings yet
2406.00515v1
49 pages
Trapping LLM "Hallucinations" Using Tagged Context Prompts: Philip Feldman, James R. Foulds, and Shimei Pan
No ratings yet
Trapping LLM "Hallucinations" Using Tagged Context Prompts: Philip Feldman, James R. Foulds, and Shimei Pan
14 pages
2. A_Survey_on_Evaluating_Large_Language_Models_in_Co
No ratings yet
2. A_Survey_on_Evaluating_Large_Language_Models_in_Co
26 pages
Case Study For Procurement
No ratings yet
Case Study For Procurement
62 pages
2401.12273v2
No ratings yet
2401.12273v2
9 pages
Chatbots in A Botnet World by Forrest McKee and David Noever
No ratings yet
Chatbots in A Botnet World by Forrest McKee and David Noever
47 pages
NLP ML Ai
No ratings yet
NLP ML Ai
21 pages
AI-Policy LLM
No ratings yet
AI-Policy LLM
9 pages
Granite Code Models: A Family of Open Foundation Models For Code Intelligence
No ratings yet
Granite Code Models: A Family of Open Foundation Models For Code Intelligence
28 pages
Protecting Language Learning Models From Cyber Threats
No ratings yet
Protecting Language Learning Models From Cyber Threats
3 pages
3691620.3695527
No ratings yet
3691620.3695527
12 pages
2024 A Survey on LLM-based Code Generation for Low-Resource
No ratings yet
2024 A Survey on LLM-based Code Generation for Low-Resource
35 pages
2411 03350v1
No ratings yet
2411 03350v1
76 pages
From LLMs To LLM Based Agents For Software Engineering 1723301316
100% (1)
From LLMs To LLM Based Agents For Software Engineering 1723301316
42 pages
Large Language Models For Software Engineering
No ratings yet
Large Language Models For Software Engineering
79 pages
Watermarking Large Language Models and the Generat
No ratings yet
Watermarking Large Language Models and the Generat
8 pages
From_LLMs to_LLM_based_Agents
No ratings yet
From_LLMs to_LLM_based_Agents
42 pages
tosem2hshzh024_5.pdf
No ratings yet
tosem2hshzh024_5.pdf
79 pages
2403.00046v2
No ratings yet
2403.00046v2
13 pages
2305.06972v3
No ratings yet
2305.06972v3
16 pages
Pathway To Secure and Trustworthy 6G For LLMS: Attacks, Defense, and Opportunities
No ratings yet
Pathway To Secure and Trustworthy 6G For LLMS: Attacks, Defense, and Opportunities
7 pages
ChatGPT Coding CompSac 23
No ratings yet
ChatGPT Coding CompSac 23
9 pages
03-Towards An Understanding of Large Language
No ratings yet
03-Towards An Understanding of Large Language
41 pages
Hullmi: Human vs. LLM Identification With Explainability
No ratings yet
Hullmi: Human vs. LLM Identification With Explainability
17 pages
Exploiting Programmatic Behavior of LLMS: Dual-Use Through Standard Security Attacks
No ratings yet
Exploiting Programmatic Behavior of LLMS: Dual-Use Through Standard Security Attacks
14 pages
Summary of Vulnerabilities in Large Language Models
No ratings yet
Summary of Vulnerabilities in Large Language Models
1 page
Deepa Kundur - Cyber-Physical Security of The Smart Grid PDF
No ratings yet
Deepa Kundur - Cyber-Physical Security of The Smart Grid PDF
4 pages
Nokia OpenStack Solution
No ratings yet
Nokia OpenStack Solution
1 page
Vigenere Cipher
No ratings yet
Vigenere Cipher
3 pages
Palo Alto Hands On Workshop
No ratings yet
Palo Alto Hands On Workshop
108 pages
New Distributor Retration
No ratings yet
New Distributor Retration
2 pages
Pag-IBIG Privacy Statement v7.0 080318
No ratings yet
Pag-IBIG Privacy Statement v7.0 080318
3 pages
Blockchain Technology For Secure Supply Chain Management A Comprehensive Review
No ratings yet
Blockchain Technology For Secure Supply Chain Management A Comprehensive Review
27 pages
The Role of Technology in Education
No ratings yet
The Role of Technology in Education
3 pages
7) TYBBA (CA) - Syllabus of Sem V AND VI - 08.07.2021
No ratings yet
7) TYBBA (CA) - Syllabus of Sem V AND VI - 08.07.2021
34 pages
ĐỀ THAM KHẢO 10
No ratings yet
ĐỀ THAM KHẢO 10
4 pages
Checkpoint (CCSA-NGX) Course Details
No ratings yet
Checkpoint (CCSA-NGX) Course Details
16 pages
Mobile Computing - Reporting
No ratings yet
Mobile Computing - Reporting
16 pages
Tresoritwhitepaper PDF
No ratings yet
Tresoritwhitepaper PDF
11 pages
Firewall Policy Template
No ratings yet
Firewall Policy Template
8 pages
27 B SC Information Tech Syllabus (2017-18) (1) Modify
No ratings yet
27 B SC Information Tech Syllabus (2017-18) (1) Modify
27 pages
Lecture 1 CSCU (Autosaved)
No ratings yet
Lecture 1 CSCU (Autosaved)
25 pages
Peermountain Whitepaper
100% (1)
Peermountain Whitepaper
34 pages
Artificial Intelligence Notes Detailed
No ratings yet
Artificial Intelligence Notes Detailed
119 pages
SC200 Lab 1 - With Answers
No ratings yet
SC200 Lab 1 - With Answers
4 pages
16900321157_OE-EC 803A
No ratings yet
16900321157_OE-EC 803A
6 pages
Sumit - Updated - Kumar
No ratings yet
Sumit - Updated - Kumar
2 pages
RD3912A10
0% (1)
RD3912A10
13 pages
4 Kabupaten Lamongan - 240516 - 075111
No ratings yet
4 Kabupaten Lamongan - 240516 - 075111
422 pages
Cyber Security For Cyber Physical Systems 1st Edition Saqib Ali 2024 Scribd Download
100% (3)
Cyber Security For Cyber Physical Systems 1st Edition Saqib Ali 2024 Scribd Download
62 pages
End Customer Investment Funds-BPC
No ratings yet
End Customer Investment Funds-BPC
5 pages
Mongo DB Replication POC PDF
No ratings yet
Mongo DB Replication POC PDF
13 pages
Download ebooks file Network Security Technologies 2nd Edition Kwok T. Fung all chapters
No ratings yet
Download ebooks file Network Security Technologies 2nd Edition Kwok T. Fung all chapters
67 pages
Personal Data Breach Report ICO
No ratings yet
Personal Data Breach Report ICO
7 pages