
Large AI language models


Opportunities and risks for industry and authorities

Change history

Version  Date        Name   Description
1.0      05/03/2023  TK24   Initial release

Federal Office for Information Security (BSI)


P.O. Box 20 03 63
53133 Bonn
Email: [email protected]
Internet: https://www.bsi.bund.de
© Federal Office for Information Security 2021
Executive summary
Large AI language models are computer programs capable of automatically processing
natural language in written form. Such models can potentially be used in a wide variety of use cases in
which text is to be processed and thus represent an opportunity for digitization. On the other hand, the
use of large AI language models harbors new IT security risks and increases the threat potential of some
known IT security threats.
This includes, in particular, the potential for misuse of such models to generate spam/phishing emails or malware.
In response to these potential threats, companies and authorities should carry out a risk analysis for their
specific use case before integrating large AI language models into their workflows. In addition, they should
evaluate misuse scenarios to determine whether they pose a threat to their work processes. Based on this,
existing security measures can be adapted, new measures can be taken if necessary, and users can be
informed about the potential dangers.

Contents
1 Introduction
2 Background and classification of LLMs
2.1 Capabilities
2.2 Application areas
2.3 Explainability
3 Opportunities and risks of LLMs
3.1 Opportunities for IT security
3.2 Risks of using LLMs and countermeasures
3.2.1 Risks
3.2.2 Countermeasures
3.3 Misuse scenarios and countermeasures
3.3.1 Misuse scenarios
3.3.2 Countermeasures
3.4 Risks and challenges in developing secure LLMs
3.4.1 Data quality when selecting training data
3.4.2 Attacks on LLMs and specific countermeasures
4 Summary
Bibliography

1 Introduction
Since December 2022, large AI language models have been omnipresent in newspapers, social
media and other information sources. In particular, the announcement and release of models, some of which
are freely available, have led to a rapid increase in the popularity and use of large AI language models. The high
quality of the texts generated by an AI impresses even experts. At the same time, intensive discussions are
being held about areas of application for the new technology and the resulting dangers. In this document,
the BSI shows the current risks and threats of large AI language models for IT security in order to create
awareness of these aspects among authorities and companies who are considering using these models
in their work processes. Developers of large AI language models will also find relevant guidance on the topics addressed.
In addition, options are presented as to how these threats can be countered.

Definition of large AI language models

In the context of this document, the term large AI language models (LLMs) is understood to mean software
that processes natural language in written form on the basis of machine learning and also presents outputs
as text. However, acoustic or image inputs are also conceivable, since in many cases these can now be
converted into text almost flawlessly. In the future, even synthetic speech output may be scarcely
distinguishable from human voices. Some LLMs are already being expanded into so-called multi-modal
models that can, for example, process and/or produce images as well as text. Such models are not explicitly
considered in this document. The most advanced LLMs are trained on large amounts of data and can
produce text that is often indistinguishable from human-written text. Scenarios in which LLMs can be used
include, for example, chatbots, question-answer systems or automatic translation (2.2).

Purpose and target groups of the document

This information is aimed at companies and authorities as well as developers who would like to find out more
about the opportunities and risks involved in the development, deployment and/or use of LLMs. A brief
summary of the document, which is primarily aimed at consumers, will also be published
alongside this document.

The aim of this document is to present the most important current threats in relation to LLMs and to show the
associated risks for the aforementioned target groups. The focus here is primarily on the area of IT security,
which can be impaired by the use of LLMs.
This is intended to create and strengthen awareness of possible risks when using or developing LLMs.

Structure of the document

In Chapter 2, the general capabilities and areas of application of LLMs are first described and a brief
digression on the explainability of the models is also made. Chapter 3 then takes a closer look at the
opportunities and risks of the models. Various aspects are addressed:

• Description of the security threats, both in general and specifically for users and developers,

• Classification of their relevance by describing possible scenarios in which the security threats may apply,

• Measures that can be taken to mitigate the specific security threat.


Disclaimer

This compilation does not claim to be complete. The document serves to create awareness of the
risks and to present possible measures to reduce them. It can thus be the basis for a systematic risk
analysis, which should be carried out before using or making LLMs available. Not all aspects
will be relevant in every application and the individual risk assessment and acceptance will vary
depending on the application scenario and user group.

This document addresses, among other things, "Privacy Attacks". This term has established itself in
the AI literature as a standard for attacks in which sensitive training data is reconstructed.
Contrary to what the term might suggest, however, the reconstructed data does not have to relate to
individuals and can, for example, also comprise company secrets or the like. It should be noted that the
BSI does not make any statements here on data protection aspects in the strict sense.

2 Background and classification of LLMs


2.1 Capabilities
In many cases, LLMs generate correct answers to problems that are formulated as natural language text.
The tasks can come from different subject areas, not only from language processing in the narrower sense,
e.g. the creation and translation of fictional texts or text summaries, but also from areas such as mathematics,
computer science, history, law or medicine.1 This ability of a single AI model to generate appropriate
answers across different subject areas is a key innovation of LLMs.

2.2 Areas of application


LLMs are able to process a large number of text-based tasks and can therefore be used in a variety of
areas in which (partially) automated text processing and/or production is to take place. These include, for
example:

• Text generation

  • Writing a first draft of a formal document (e.g. invitation, research proposal, articles of association, etc.)

  • Composing texts in a specific writing style (e.g. of a specific person or with a certain emotional coloring)

  • Tools for text continuation or text completion

• Text editing

  • Spelling and grammar checking

  • Paraphrasing

• Word processing

  • Word and text classification

  • Sentiment analysis

  • Entity extraction (marking of terms in the text and assignment to their class, e.g. Munich → place; BSI → institution)

  • Text summarization

  • Question-and-answer systems

  • Translation

1 The MMLU multiple-choice test battery (Hendrycks, et al., 2021) contains 15,908 problems from 57
knowledge areas, ranging in difficulty from very easy to problems that are difficult even for human
experts. The authors of (Hendrycks, et al., 2021) estimate that a panel of human experts would answer 90%
of the questions correctly. The top LLMs in spring 2019 got 32% of the questions correct (Hendrycks, et al.,
2021) (Papers With Code, 2023), which was only slightly above the 25% expected from pure guessing among
the 4 multiple-choice answers. However, the rate for lay people in the academic fields is also only
34.5% (Hendrycks, et al., 2021). The best result to date was achieved by Google's LLM Flan-PaLM in
October 2022 with a rate of 75% correct answers (Papers With Code, 2023) (OpenAI, 2023). The GPT-4
model released in March 2023 answered 86.4% of the tasks correctly (OpenAI, 2023).


• Program code

  • Tools to support programming (e.g. through suggestions for completion, error messages, etc.)

  • Generating program code for a task written in natural language

  • Reprogramming and translation of a program into other programming languages

2.3 Explainability
In the following, we understand explainability as a research area spanning all application areas of AI, which
deals, among other things, with making transparent why and how an AI model produces its output. Explainability
can thus lead to greater user confidence in the output of a model and also enables technical adjustments
to be made to a model in a more targeted manner (Danilevsky, et al., 2020). In addition to the
actual output of the model, an explanation is often provided; this can be given, for example, in text form
or with visual support. A popular practice for LLMs is to highlight relevant words in the input that were
instrumental in generating the output (Danilevsky, et al., 2020).

Especially in areas where decisions can have far-reaching consequences, an explanation of the
output of an LLM is desirable. This includes, for example, applications from the following areas:
• Health (e.g. decisions about treatment methods)
• Finance (e.g. lending decisions)
• Justice (e.g. decisions on probation options)
• Personnel (e.g. decisions on applications)
Other potentially critical application areas include those that are likely to be classified as high-risk AI
systems under the EU AI Regulation (European Commission, 2021).
In addition to the aforementioned option of using tools to mark relevant words in the input, the problem
of a lack of explainability can already be countered by selecting a suitable model. Especially in critical
areas, the use of an LLM for the respective application should be critically questioned; if necessary,
the use case can also be addressed with a simpler, directly interpretable model (e.g. a decision tree)
instead of with an LLM with black-box character. Furthermore, for various use cases there are options
to select models with greater explainability. In question-answer systems, for example, extractive
approaches, i.e. models that mark the answer in the original source text, can be chosen
instead of generative approaches. In the context of text continuation, on the other hand, a certain
degree of explainability can be achieved by providing not only the actual output but also the
best alternatives with their respective probabilities, as sketched below. In addition, there is the possibility
of integrating models into search engines, for example, which provide source references that can then be checked.
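
To illustrate the last-mentioned idea of exposing the best alternative continuations with their probabilities, the following minimal sketch (not part of the original document) queries an openly available causal language model. The choice of the gpt2 model and the value of k are illustrative assumptions, not a recommendation.

```python
# Minimal sketch: expose the top-k alternative next tokens and their
# probabilities as a simple form of explainability for text continuation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def top_k_continuations(prompt: str, k: int = 5):
    """Return the k most probable next tokens with their probabilities."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]   # scores for the next token
    probs = torch.softmax(logits, dim=-1)
    top_probs, top_ids = torch.topk(probs, k)
    return [(tokenizer.decode(int(i)), float(p)) for i, p in zip(top_ids, top_probs)]

for token, prob in top_k_continuations("The BSI is responsible for"):
    print(f"{token!r}: {prob:.3f}")
```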

3 Opportunities and risks of LLMs


In this chapter, the opportunities for IT security that result from the use of LLMs are presented first (3.1).

Various security risks that can arise during the development and use of LLMs are then examined.
First, risks are considered that affect the use of LLMs from the user perspective (3.2).
Next, risks are described that people may be confronted with in their private or professional lives
because LLMs are misused (3.3). In a final section, risks are explained that should be considered
when developing LLMs (3.4). Aspects that can be influenced when developers have access to an LLM
and the associated training process are explicitly highlighted there.

Measures that can contribute to reducing the risk are presented for the respective security risks.

3.1 Opportunities for IT security


Support for detection of unwanted content
Some LLMs are well suited for text classification tasks. This results in possible applications in
the area of detecting spam/phishing mails (Yaseen, et al., 2021) or undesirable content (e.g. fake news
(Aggarwal, et al., 2020) or hate speech (Mozafari, et al., 2019)) in social media. However, models that are
suited to the task of detection are usually - possibly with some technical adjustments - also well suited
for producing the corresponding texts (3.3.1) (Zellers, et al., 2019).
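
As an illustration only (not part of the original document), the following minimal sketch shows how a pre-trained transformer classifier could be used to flag suspicious e-mail text. The model name is a placeholder assumption; in practice a model fine-tuned on spam/phishing data would be required for meaningful detection quality.

```python
# Minimal sketch: text classification with a pre-trained transformer.
# The model name below is a placeholder; a model fine-tuned on
# spam/phishing data would be needed for real detection quality.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # placeholder model
)

mail_text = "Your account has been locked. Click here immediately to verify your password."
result = classifier(mail_text)[0]
print(result["label"], round(result["score"], 3))
```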

Word processing support


With their potential applications in the areas of text analysis, summarization and
structuring, LLMs are well suited to support use cases where larger amounts of text need to be
processed. In the field of IT security, such possible applications arise, for
example, when reporting on security incidents.
Support in the creation and analysis of program code
LLMs can be used to examine existing code for known vulnerabilities, explain them verbally, and
suggest how these vulnerabilities could be exploited for attacks or how the code could be improved.
They can thus contribute to improving code security in the future.

In addition, LLMs can support the creation of code. Experimental evaluations show that the quality of
outputs in this area has improved as models have evolved (Bubeck, et al., 2023). However, it cannot be
ruled out that such generated code is susceptible to known and unknown security vulnerabilities (cf.
3.2.1).
Traffic analysis support
Due to the wide variety of text data that LLMs have processed during their training, they
can also, after additional training, support tasks in which the data to be processed is in text format
but is not natural language in the narrower sense. In the field of IT security, possible tasks are, for
example, the detection of malicious network traffic (Han, et al., 2020) or the detection of anomalies in
system logs (Lee, et al., 2021) (Almodovar, et al., 2022).


3.2 Risks of using LLMs and countermeasures


3.2.1 Risks
Since LLMs usually generate linguistically error-free text with convincing content, users quickly get
the impression that a model has human-like capabilities (automation bias) and thus develop over-confidence
in the statements it generates and in its general abilities. This leaves users vulnerable to drawing
incorrect conclusions from the generated texts, which can be critical because those texts can be flawed due to
various weaknesses of LLMs, as described below.

Lack of factuality and reproducibility


Generative LLMs are trained to generate text based on stochastic correlations. As a result, there is no
technical guarantee that this text is factually correct. This potential invention of content is also known as
"hallucination". This shows, among other things, that an LLM can handle language, but
derives its "knowledge" from (already seen) texts. The model has no reference to the real world;
accordingly, it may make incorrect statements about facts that people take for granted.

Furthermore, due to the probability-based approach, outputs from LLMs for the same input can usually
differ. This can also be seen as an indication that the content is not necessarily correct.

Lack of security of generated code


LLMs that have also been trained on data containing program code can generate code as well.
Because code used to train LLMs may contain known vulnerabilities, the generated code
may also exhibit these vulnerabilities (Pearce, et al., 2022).
Of course, the generated program code can also contain previously unknown security gaps.

Lack of timeliness

If LLMs do not have access to live Internet data (the exception being, for example, models
integrated into search engines), they also have no information about current events. As already
mentioned, LLMs derive their stochastic correlations from the texts they processed as training data.
Since these are texts from the past, it is impossible for LLMs to provide factual
information on current events without access to current data. It should be noted, however, that
LLMs will usually still generate made-up statements about current events by hallucinating in response to
corresponding inputs. Due to their linguistic polish, these may appear factually well-founded
at first glance, especially if publications or other references are part of the answer, but they
may be wrong or invented.

Incorrect response to specific inputs


LLMs also often produce erroneous output when they receive input that differs so much from the text
in the training data that the model can no longer process it correctly as text or words. Such inputs
can be produced unintentionally by a user (e.g. texts with many spelling mistakes or with a lot of
technical vocabulary/foreign words, or texts in languages unknown to the model), but the intentional
deception of a model by users is also conceivable (e.g. to bypass mechanisms for detecting unwanted
content in social media). Even for inputs that the LLM cannot process correctly, it will generally still
generate some output through hallucination (cf. 3.4.2 Adversarial Attacks).


Vulnerability to "hidden" input with manipulative intent

A particular security risk can arise if attackers manage to feed inputs to an LLM without users
noticing. This applies in particular to LLMs that access live data from the Internet during operation (e.g.
chatbots with a search engine function or as a browser function to support viewing a website), but also to
models that receive unchecked third-party documents as input. Attackers can place instructions to the
LLM on websites or in documents without users noticing, and thus, for example, manipulate the further
course of the conversation between the user and the LLM. The aim can be, for example, to extract
personal data from users or to persuade them to click on a link.

Such an attack can affect, for example, a chat tool that helps a person surf the Internet by allowing that
person to ask questions about the currently opened web page in order to understand its content more quickly.
For example, the person asks the chat tool for a summary of a blog post. The blog entry is
actually the website of a person who wants to collect e-mail addresses for later phishing attacks. This person
has hidden text on the website in white type on a white background stating that when the chat tool is asked to
generate a summary, it should then unobtrusively ask users to enter their email address in a field on the
website (see 3.4.2 Indirect Prompt Injection).

Confidentiality of the entered data

When using an external API, all inputs made to the LLM are initially passed on to the operator of the model.
The extent to which the operator accesses this data, uses it and stores it, for example, for further training of the
model, is regulated differently from model to model. The operator also usually has unrestricted access to the
outputs of the model. Some LLMs also offer the possibility of accessing plug-ins for extended
functionality, possibly unnoticed by the user. In this case, there is an additional risk that the data
entered will be passed on to unknown third parties.

The use of an LLM via an external API should therefore be critically examined, especially when processing
sensitive and confidential information; the processing of classified information is not permitted without further
measures. An on-premise solution may be possible, but for many LLMs this cannot be realized with conventional
IT due to the required computing and storage capacities. However, smaller models are also in development
that, at least in certain applications, provide performance similar to much larger LLMs and can be operated locally.

Dependency on the manufacturer/operator of the model

In addition to the lack of data sovereignty, the use of an LLM via API creates a high level of dependency
on the manufacturer and operator of the model. This dependency relates to various technical aspects: on
the one hand, the availability of the model may not be controllable; on the other hand, there is usually no
possibility of intervening in the (further) development of the model, e.g. by choosing training data for
special use cases or establishing security mechanisms from the outset.

3.2.2 Countermeasures
Users should be informed about these weaknesses of LLMs and encouraged to check statements for
their truthfulness or to question them critically. It is also possible that an LLM produces inappropriate output
(e.g. discriminatory statements, "fake news", propaganda, etc.).
Manual post-processing of machine-generated texts is therefore advisable before they are used
further. This point should be considered in particular when deciding whether an LLM with a direct
external impact (e.g. a chatbot on a website) should be used.

3.3 Misuse Scenarios and Countermeasures


3.3.1 Misuse Scenarios
LLMs can be used to produce text for malicious purposes. Examples of possible cases of abuse
include:
Social engineering
The term social engineering refers to cyber attacks in which criminals try to trick their victims into
disclosing personal data, circumventing protective measures or installing malicious code
themselves (BSI, 2022). This usually happens by exploiting human characteristics such as
helpfulness, trust or fear. Spam or phishing e-mails are often used here, which are intended to get
recipients to click on a link or open a malicious attachment. Spear phishing emails, i.e. targeted
fraudulent emails, can also serve as the first step in a ransomware attack.

The texts contained in the fraudulent e-mails can be generated automatically and with high
linguistic quality using LLMs. It is possible to adapt the writing style of the texts so that it resembles
that of a specific organization or person. The imitation of writing styles is usually accurate with current
LLMs and requires little effort (e.g. a text example of a person to be imitated or only little knowledge of
the target language). In addition, texts can be personalized without great effort by including
publicly available information about the target person (e.g. from social and professional networks)
when generating the text. These measures can be used in various scenarios, for example in the
context of business e-mail compromise or CEO fraud, in which the writing style of the management is
imitated in order to tempt employees to make money payments to third-party accounts, for example
(Europol, 2023). The spelling and grammatical errors previously known in spam and phishing e-mails,
which can help users to recognize them, are now hardly found in the automatically generated
texts. This can make it easier for criminals to produce foreign-language texts in a quality that
comes close to that of a native speaker. In addition, criminals could not only increase the
number of attacks initiated via e-mail in the future with relatively little effort, but also use LLMs to
make these messages more convincing.

Dark Web forums are already discussing the suitability of generative LLMs for phishing or spam
emails. However, widespread use could not be observed by the beginning of 2023 (Insikt Group,
2023).
Generation and execution of malware
The ability of LLMs to generate words is not just limited to generating natural language text.
Within the training data there is usually also publicly accessible program code, which enables the
models to generate code as well as text. This is not always error-free, but good enough to help users
in many areas. This ability can be abused by criminals using LLMs to generate malicious code. This
danger was already pointed out when the first code-generating LLMs were released. Even then it
was apparent that LLMs are suitable, for example, for generating polymorphic malware, i.e.
malicious code that is only slightly modified in order to circumvent security filters, for example within
antivirus software, but still has the same effect as the original version (Chen, et al., 2021).

Newer LLMs have increasingly sophisticated code generation capabilities, which could allow attackers
with little technical skill to generate malicious code without much background knowledge. Even
experienced attackers could be assisted by LLMs helping them to improve code (Europol, 2023).
According to (Insikt Group, 2023), a popular LLM can automatically
generate code that exploits critical vulnerabilities. The model is also able to generate so-called
malware payloads. According to (BSI, 2022), payload is the part of a malicious program that remains
on the target computer. This payload, which can be generated using LLMs, can pursue various goals,
e.g. information theft, cryptocurrency theft or setting up remote access on the target device. However,
the generated code is usually similar to what is already publicly available anyway, and not always
error-free. Nonetheless, the capabilities of language models in this area could lower the barrier to
entry for inexperienced attackers (Insikt Group, 2023). In addition to pure code generation, LLMs
can also be used to provide instructions for searching for vulnerabilities (Eikenberg, 2023), to
generate configuration files for malware, or to establish command-and-control mechanisms (Insikt
Group, 2023).
Disinformation
LLMs are trained on a very large body of text. Due to the large amount of data, the origin of these texts
and their quality are not fully verified. Texts with questionable content (e.g. disinformation,
propaganda or hate messages) therefore remain in the training set and contribute to an
undesirable shaping of the model, which then shows a tendency towards potentially critical
content. Despite various protective measures, these influences often appear in linguistically
similar form in the AI-generated texts (Weidinger, et al., 2022). This allows criminals to use the
models to influence public opinion through automatically generated propaganda texts, social
media posts or fake news. Due to the low effort involved in creating these texts, they can also
be mass-produced and distributed. The generation of hate messages is also conceivable.

The user-friendly access via an API and the enormous speed and flexibility of the responses
from currently popular LLMs enable the generation of a large number of high-quality texts. These can
hardly be distinguished from those of a human being and can be written in a wide variety of
moods and styles by user instructions. In this way, criminals can quickly create texts that are
negatively directed against a person or organization, or texts that are adapted to another person's
writing style in order to spread false information on their behalf. Apart from the imitation of writing styles,
machine-generated product reviews can also be written using LLMs, which can be used,
for example, to advertise a specific product or to discredit a competitor's product.

In the commercial LLMs available so far, warnings inserted into the generated text are
intended to make it more difficult to directly generate misinformation or other content
that violates the guidelines of the respective company. However, these warnings can
easily be removed from the generated texts. Thus, disinformation or the like can be generated
by small changes in a comparatively short time.

3.3.2 Countermeasures
The described possibilities for misusing LLMs can be countered with various measures to reduce
the risk of successful attacks.

3.3.2.1 General measures


Such measures can be both technical and organizational. A general method for preventing attacks is
often to ensure the authenticity of texts and messages, i.e. to prove that certain texts or messages
actually come from a specific person, group of people or institution. This takes into account the fact
that classic implicit methods for authenticating messages, which users apply unconsciously, can easily
be fooled by the capabilities of LLMs.


In the past, spam and phishing e-mails could often be recognized by their recipients through errors in
spelling, grammar or linguistic expression; if they are generated using LLMs, however, they
usually no longer have such defects. Before the widespread availability of LLMs, spear phishing emails
or posts in social media also allowed certain conclusions to be drawn about their likely authors based on
their writing style; due to the ability of LLMs to imitate writing styles, such indicators are no longer
reliable.

These implicit authentication procedures can now be supplemented by explicit technical procedures
that can cryptographically prove the authorship of a message. This could be used to distinguish
legitimate messages (e.g. from a bank to its customers or from a CEO to employees) from fake
ones. Similar approaches could also be used in social media to verifiably assign (text) contributions to
their actual source (such as a private user, a major media outlet or a state authority). The use of such
technical measures requires a certain amount of effort, which is why they have not been widespread
to date, and requires users to be made aware and informed.
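
As a simple illustration (not part of the original document), the following sketch uses an HMAC over a message with a key shared in advance between sender and recipient; digital signatures (e.g. Ed25519) would be the asymmetric alternative. Key handling and the message format are assumptions for the example.

```python
# Minimal sketch: proving that a message comes from a holder of a shared secret key.
# Key distribution and storage are out of scope here and merely assumed.
import hmac
import hashlib

SHARED_KEY = b"replace-with-a-securely-exchanged-key"

def sign(message: str) -> str:
    """Compute an authentication tag for the message."""
    return hmac.new(SHARED_KEY, message.encode("utf-8"), hashlib.sha256).hexdigest()

def verify(message: str, tag: str) -> bool:
    """Check the tag in constant time; False means the message is not authentic."""
    return hmac.compare_digest(sign(message), tag)

msg = "Please transfer 10,000 EUR to account DE00 1234"
tag = sign(msg)
print(verify(msg, tag))                  # True: authentic
print(verify(msg + " (modified)", tag))  # False: tampered or not from key holder
```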

Social engineering attacks such as CEO fraud can be made more difficult by changing the framework
and introducing additional processes to authenticate messages. For example, the mandatory
confirmation of payment instructions via a separate, authenticated channel would be conceivable.
The mass submission of contributions and documents to overload the connected processes can
be combated by measures that limit the possible submissions. This can be done, for example, through
hard-coded limits or the use of CAPTCHAs.
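
The idea of limiting possible submissions can be sketched as follows (not part of the original document); the limit values and the in-memory storage are illustrative assumptions, and a CAPTCHA would complement such a limit rather than replace it.

```python
# Minimal sketch: a sliding-window limit on submissions per sender.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 3600   # assumed: at most MAX_SUBMISSIONS per hour and sender
MAX_SUBMISSIONS = 5

_history = defaultdict(deque)

def allow_submission(sender_id: str) -> bool:
    """Return True if the sender is still under the limit, otherwise False."""
    now = time.time()
    timestamps = _history[sender_id]
    while timestamps and now - timestamps[0] > WINDOW_SECONDS:
        timestamps.popleft()                 # drop entries outside the window
    if len(timestamps) >= MAX_SUBMISSIONS:
        return False                         # reject: possible mass submission
    timestamps.append(now)
    return True
```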
An overarching measure to reduce the risk of attack is to sensitize and educate users about the
capabilities of LLMs and the resulting threats, so that they can adjust to them and question the correctness
of automatically generated messages such as emails or social media posts, in particular if there are
other indications of manipulation.

3.3.2.2 Measures at model level


At the model level, the misuse of LLMs can essentially be prevented by two strategies. On the one
hand, the possible uses can be restricted in general, which requires little effort especially for self-operated
models; on the other hand, measures can be taken to prevent potentially harmful outputs.
With the first, more general approach, the group of users can be restricted so that, for
example, only trustworthy users have access to the model. In addition, a restriction of the access rights
that users have to the model is conceivable, e.g. a restriction of the possible prompts. For example,
some attacks require fine-tuning, which requires more extensive access to the model.

The second approach, on the other hand, pursues the more specific goal of allowing
unrestricted use of the model in principle, but preventing harmful outputs. For certain inputs that are
clearly intended for malicious purposes, no free output is generated; instead, a fixed output is returned
("This model cannot be used for this purpose."). In addition to explicitly
excluding outputs for certain malicious requests through filtering, it is also possible to use
Reinforcement Learning from Human Feedback (RLHF). Through special further training, a model
learns to evaluate outputs in terms of how desirable they are and to adjust them if necessary.
Such filters and training methods are already used in current LLMs. However, they only prevent part
of the harmful output and can be circumvented by cleverly reformulating the input, also known as
prompt engineering (Cyber Security Agency of Singapore, 2023), and such circumventions are often
reproducible. Even when using filters or RLHF in the model, the distinction between permitted and
prohibited outputs again raises complex questions (cf. 3.3.2.1). In addition, citing freedom
of speech, LLMs have already been made available that do not contain any such filters. It can also be
assumed that further unrestricted models will be developed in the future by actors with corresponding
malicious motives.
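
The filtering idea with a fixed refusal output can be sketched as follows (not part of the original document). The keyword list, the query_llm function and the refusal text are placeholder assumptions; production systems typically combine such filters with learned safety classifiers and RLHF-trained models.

```python
# Minimal sketch: refuse clearly malicious requests with a fixed output
# before they ever reach the model. Keyword matching is only illustrative;
# real deployments additionally use trained safety classifiers.
BLOCKLIST = ["build ransomware", "write a phishing email", "generate malware"]

FIXED_REFUSAL = "This model cannot be used for this purpose."

def query_llm(prompt: str) -> str:
    """Placeholder for the actual model call (assumed to exist elsewhere)."""
    raise NotImplementedError

def guarded_query(prompt: str) -> str:
    lowered = prompt.lower()
    if any(phrase in lowered for phrase in BLOCKLIST):
        return FIXED_REFUSAL          # fixed output instead of a generated one
    return query_llm(prompt)
```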

3.3.2.3 Measures to detect machine-generated text


There are various complementary approaches to detecting automatically generated texts.
Detection options give users the ability to recognize texts as machine-generated and thus, if necessary, to
doubt their authenticity and the correctness of the information they contain.

On the one hand, the human ability to recognize automatically generated texts can be used. The detection
performance depends heavily on properties of the text (e.g. text type, subject, length) and personal factors
(e.g. experience with machine-generated texts, specialist knowledge of the text topic).
Simple indications for detection such as spelling or grammatical errors and gross inconsistencies
in content are not to be expected in texts generated by LLMs, so the human ability to detect is
limited, especially in the case of short texts.
In addition, there are tools for the automatic detection of machine-generated texts (e.g.
(Tian, 2023), (Kirchner, et al., 2023), (Mitchell, et al., 2023), (Gehrmann, et al., 2019)), which usually
exploit statistical properties of the texts or parameters of a model to calculate a score that serves as
an indication of machine-generated text. However, the detection performance is often limited, especially
for texts generated by LLMs that are only made available as a black box without
additional information. The results of the tools mentioned can therefore only give an indication and do
not usually represent a reliable statement per se. Limitations exist in particular for short texts and
texts that are not written in English. To support subsequent detection, research is also being carried
out into embedding statistical watermarks in machine-generated texts
(Kirchenbauer, et al., 2023). A fundamental problem of this class of tools is that the detection of a
text generated by an LLM can be made even more difficult by minor manual changes. In principle,
automatic detection can also be applied to program code and malware, but it has similar limitations.
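
To illustrate the statistical approach such tools take (not part of the original document), the following sketch scores a text by the perplexity an open language model assigns to it; machine-generated text often receives lower perplexity than human-written text. The model choice and any decision threshold are assumptions, and as stated above such scores are only indications, not proof.

```python
# Minimal sketch: perplexity of a text under an open language model as a
# rough statistical indicator. Low perplexity is only a hint, not proof,
# that a text was machine-generated.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss   # mean cross-entropy
    return float(torch.exp(loss))

print(perplexity("The quick brown fox jumps over the lazy dog."))
```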

3.4 Risks and Challenges in Developing Secure LLMs


In addition to the above measures to prevent and mitigate the potential for abuse of LLMs, there are
other security issues that providers of such models should consider. Users can use this sub-
chapter to get further pointers for an evaluation of a provider of an LLM.

3.4.1 Data quality when selecting training data


The selection of the training data is decisive for the quality of the resulting model.
During training, an LLM learns a statistical model of the training data; this only generalizes well to later,
diverse applications if the data is real or at least realistic and a range of different texts (e.g. in terms
of text types, topics, languages, specialist vocabulary, variety) is covered.

In addition to the quality of the texts, legal requirements may have to be observed. Due to the rapid
development of LLMs, there are still no final clarifications on some legal aspects.
If necessary, however, future problems can be reduced from the outset if sensitive data is not used to
train LLMs (cf. 3.4.2 Privacy Attacks).
Another aspect that should be considered when choosing training data is the undesirable
reproduction of discrimination or bias present in the training data. A model is, so to speak, a mirror of
its training data; if the data contains a bias, the model will reproduce it as well. It is then possible,
for example, for an LLM to generate discriminatory statements.
Possibilities of misuse of an LLM can also be limited by a targeted selection of training data (3.3.1).

Should there be a lot of machine-generated texts on the Internet in the future, it must also be ensured that
no self-reinforcing effects result from training an LLM on data generated by such a model. This is particularly
critical in cases in which texts with potential for misuse were generated or in which, as already mentioned,
a bias has become entrenched in the text data. Such effects can arise, for example, when more and more
of these texts are generated and in turn used to train new models, which again generate a large number of
texts (Bender, et al., 2021).

3.4.2 Attacks on LLMs and Specific Countermeasures


Privacy Attacks

In principle, it is possible to reconstruct training data by means of specific requests to an LLM. This can be
particularly critical when sensitive data has been used for training (Carlini, et al., 2021).
Data that could be reconstructed are, for example, allocations of personal data (telephone numbers,
addresses, health, financial data) to people, but also, for example, sensitive company internals
or data about the LLM itself.

Due to the large amount of training data, which is usually obtained automatically from the Internet, it is
difficult to ensure that LLMs do not contain data that was only published for limited purposes.

Ways to reduce susceptibility to privacy attacks:

• Manual selection, automatic filtering or anonymization of data so that sensitive information is not included in the training data

• Removing duplicates from the training data, since duplicates increase the likelihood of a possible reconstruction (Carlini, et al., 2021); a minimal deduplication sketch follows after this list

• Applying mechanisms that guarantee differential privacy (a detailed discussion of their feasibility with unstructured data, such as that underlying LLMs, can be found in (Klymenko, et al., 2022))

• Restricting the output possibilities of an LLM so that certain inputs that are clearly aimed at reconstructing critical data do not receive a generated output but a fixed output instead ("this model cannot be used for this purpose")

• Additional training to teach the model to avoid certain outputs (Stiennon, et al., 2020)

• Restricting access to the model: the fewer access rights users have to the model, the more difficult it is to assess whether an output is a reconstruction of the training data or an "invention" of the model

• If training on sensitive data is explicitly necessary (e.g. for specific applications in healthcare or finance):

  • Restrict the user group

  • Observe general IT security measures
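
A minimal sketch of the deduplication measure mentioned above (not part of the original document). It removes exact duplicates after a simple normalization; real training pipelines typically also perform near-duplicate detection, e.g. with MinHash, which is beyond this example.

```python
# Minimal sketch: drop exact duplicate documents from a training corpus.
# Normalization (lowercasing, whitespace collapsing) is an illustrative choice.
import hashlib

def _fingerprint(text: str) -> str:
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(documents: list[str]) -> list[str]:
    seen: set[str] = set()
    unique = []
    for doc in documents:
        fp = _fingerprint(doc)
        if fp not in seen:           # keep only the first occurrence
            seen.add(fp)
            unique.append(doc)
    return unique

corpus = ["Sensitive record A.", "sensitive record a.", "Another document."]
print(deduplicate(corpus))           # the normalized duplicate is removed
```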

Adversarial Attacks and Indirect Prompt Injections

Attackers can deliberately change text slightly so that humans hardly or not at all notice the change and
continue to understand the texts correctly, but LLMs no longer process them as desired
(Wang, et al., 2019). This can be problematic, for example, when filtering out
unwanted content in social media or when detecting spam.
Classifiers are particularly susceptible to altered text. Deliberately introducing "spelling
errors", using similar-looking characters (e.g. "$" instead of "S"), using rare synonyms that are
not in the vocabulary of the LLM, or rephrasing sentences can lead classifiers to produce an
incorrect output. Other applications that may be vulnerable to adversarial attacks include translation
programs and question-and-answer models.

Even without malicious interest, a grossly erroneous input can have the same effect. The measures
listed below will also help in this case.
Ways to reduce vulnerability to adversarial attacks:

• Train or fine-tune the model with real data, or data that is as realistic as possible, so that peculiarities of the usual input texts (e.g. use of certain terms or spellings) can be learned

• Pre-processing of the possibly adversarial text (recognition and correction); a minimal normalization sketch follows after this list

  • Spell checking / detection of unknown words (Wang, et al., 2019)

  • Automatic spelling correction

  • Use of image processing methods to prevent deception of the model through visually similar characters (Eger, et al., 2019)

• Improvement of the model

  • Training with manipulated/altered texts ("adversarial training") (Wang, et al., 2019)

  • Clustering of word embeddings so that semantically similar words are represented identically for the model (Jones, et al., 2020)

  • Integration of an external knowledge base that contains, for example, lists of synonyms (Li, et al., 2019)

• In special cases, it is possible to use models that have been certified as robust, i.e. models for which it is mathematically guaranteed that sufficiently small changes in the input do not cause a change in the output (a discussion of different approaches for implementing this in the area of LLMs can be found in (Wang, et al., 2019))
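
As an illustration of the pre-processing idea (not part of the original document), the following sketch normalizes Unicode look-alike characters and maps a few hand-picked substitutions before the text reaches a classifier. The substitution table is an assumption; spell correction and unknown-word detection would be added on top.

```python
# Minimal sketch: normalize adversarially altered text before classification.
# NFKC folds many visually similar Unicode characters; the substitution map
# below is a small illustrative example, not a complete defense.
import unicodedata

SUBSTITUTIONS = {"$": "s", "0": "o", "@": "a", "1": "i"}

def normalize(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    return "".join(SUBSTITUTIONS.get(ch, ch) for ch in text)

print(normalize("FREE CA$H - CL1CK N0W"))   # -> "free cash - click now"
```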
A special case of adversarial attacks is the so-called indirect prompt injection (Greshake, et al.,
2023). Here, attackers place hidden inputs in texts accessed by an LLM, as described under
(3.2.1 Vulnerability to "hidden" input with manipulative intent), with the aim of manipulating the
further course of the chat in order to trigger a certain behavior in end users. This attack is
particularly critical if LLMs are able to call external plug-ins, through which they can gain access to more
extensive functionalities. In these use cases, it is even possible for attackers to perform
malicious actions (e.g. sending emails on behalf of the victim or reading data) without manipulating
the interaction with end users.

Since attackers in this scenario merely exploit the normal functioning of an LLM, it is difficult to
find countermeasures against this type of attack. The only measure that can reliably protect against
indirect prompt injections is restricting (distilling) an LLM to the specifically required task. However, this
means that a large part of the general functionality of the LLM is lost.

The following measures can, in individual cases, reduce susceptibility to indirect prompt injections (a minimal confirmation sketch follows after this list):

• Executing certain actions, e.g. plug-in calls by the LLM, only after explicit consent from end users, e.g. via a confirmation button

• Preventing a model from generating output for inputs that clearly have manipulative intent (filtering of inputs)

• Additional training to teach the model to avoid certain outputs (Stiennon, et al., 2020)
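
The confirmation-button measure can be sketched as follows (not part of the original document). The send_email action and the console prompt stand in for a real plug-in call and a UI dialog and are assumptions for the example.

```python
# Minimal sketch: an LLM-triggered action (e.g. a plug-in call) is only
# executed after explicit confirmation by the end user.
def send_email(recipient: str, body: str) -> None:
    """Placeholder for a real plug-in action."""
    print(f"Email sent to {recipient}")

def confirm_and_run(description: str, action, *args) -> None:
    # In a real application this would be a UI dialog, not a console prompt.
    answer = input(f"The assistant wants to: {description}. Allow? [y/N] ")
    if answer.strip().lower() == "y":
        action(*args)
    else:
        print("Action blocked by the user.")

# The LLM proposes an action; it is only executed if the user agrees.
confirm_and_run("send an email to attacker@example.com", send_email,
                "attacker@example.com", "collected data")
```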

Poisoning attacks
As already discussed, the data used for training largely determines the functionality of an LLM. Much
of this data comes from public sources or is even collected from user input during operation, opening up
opportunities to manipulate functionality (Wallace, et al., 2020). This results in a multitude of attack
possibilities.
Public text sources are often limited thematically, regionally or institutionally and are operated
by public bodies or educational institutions (Wikipedia, Digital Public Library of America, Europeana,
PubMed Central, corpus.byu.edu etc.). The selection of these sources alone already implies a
cultural pre-selection of the text content. However, the institutions are often openly accessible,
not always protected by security technology, and can be manipulated through clever social
engineering, traditional website hacking and link redirection. In this way, data can be exchanged or
added at the storage location, or mixed in only during the download. Since large amounts of data are
used for training, they can at most be verified statistically. However, there are no standards for this
yet.
In addition to the original training data, pre-trained models that are only fine-tuned for a
specific application are also shared via code repositories, some of which are public.
These models are likewise subject to a variety of manipulation options. The multitude of individuals and
companies involved makes it difficult to attribute weaknesses in a model to a specific originator, and
undocumented supply chains can turn models that were biased or manipulated at an early stage into a
threat that is difficult to detect. Such manipulations can be hidden all the better the greater the attacker's
technical know-how.

Some chatbots can also use the data generated during interactions with end users to guide further
communication. This may affect the general functioning of the LLM if the LLM uses a
scoring model based on RLHF (Stiennon, et al., 2020) and user ratings of outputs as desirable or
undesirable are used to further train that scoring model (Shi, et al., 2023). This means that
manipulations are also possible through massive, targeted use with subsequent rating.

LLMs increasingly interact with other software via APIs and can also be manipulated in this way.
Likewise, weaknesses in the models can increasingly affect other digital processes (administration,
finance, trade). The networking of the various applications with LLMs is progressing very quickly, making
it increasingly difficult to control the data influencing the models.

Ways to reduce susceptibility to poisoning attacks:


• Use trusted sources as training data (a minimal integrity-check sketch follows after this list)

• For the human assessment as part of RLHF, rely on trained and trusted personnel and provide them with clear guidelines

• Analyze ratings intensively before they have any effect on the model

• Confine the effects of deployment to a controllable area
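
One simple, partial safeguard for the "trusted sources" measure (not part of the original document) is to verify that downloaded corpora or pre-trained model files match checksums published by the trusted source. The file names and hashes below are placeholders; this detects tampering in transit or at a mirror, not poisoning that already sits in the original data.

```python
# Minimal sketch: verify downloaded training data or model files against
# checksums published by the trusted source. Hashes below are placeholders.
import hashlib
from pathlib import Path

EXPECTED_SHA256 = {
    "corpus_part1.txt": "0123abcd...",       # placeholder values
    "pretrained_model.bin": "89efcdab...",
}

def verify(path: Path) -> bool:
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return digest == EXPECTED_SHA256.get(path.name)

for name in EXPECTED_SHA256:
    p = Path(name)
    if p.exists() and not verify(p):
        raise RuntimeError(f"Integrity check failed for {name}: possible tampering")
```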

4 Summary
The technology behind LLMs is currently evolving rapidly. Along with this, new security concerns
about the development and use of these models are also emerging dynamically.
Businesses and government agencies considering integrating LLMs into their workflows should conduct a
risk analysis for their specific use case. The security aspects presented in this document
can provide guidance. Particular attention should be paid to the following aspects:

• When using an LLM via external API access, data is processed by the provider of the model and may be further used by the latter.2

• The possibility of accessing live data from the Internet and, if applicable, plug-ins creates additional security risks when using LLMs. On the other hand, it enables additional functions and access to up-to-date information. The need for these functionalities and the possible security implications should be critically assessed and weighed up as part of a risk analysis.

• LLMs may produce inappropriate, factually incorrect or otherwise undesirable outputs. Use cases in which an output is reviewed by humans in further processing steps are therefore less critical; use cases in which the output of an LLM is immediately made available with an external effect, on the other hand, should be evaluated particularly critically.

In addition, companies and authorities should evaluate the misuse scenarios mentioned under
(3.3.1) to see whether they pose a threat to their work processes. Based on this, existing security
measures should be adapted, new measures should be taken if necessary, and users should be
informed about the potential dangers.

2 See also "Criteria catalog for AI cloud services - AIC4" (https://www.bsi.bund.de/DE/Themen/company-and-organisations/information-and-recommendations/artificial-intelligence/AIC4/aic4_node.html) and "Criteria Catalog Cloud Computing C5" (https://www.bsi.bund.de/DE/Themen/Unternahmen-und-Organisationen/Information-und-Recommendations/Recommendations-after-Attack-Targets/Cloud-Computing/Criteria-Catalog-C5/criteria-catalogue-c5_node.html)

Bibliography
Aggarwal, Akshay, et al. 2020. Classification of Fake News by Fine-tuning Deep Bidirectional Transformers
based Language Model. EAI Endorsed Transactions on Scalable Information Systems. 2020

Almodovar, Crispin, et al. 2022. Can language models help in system security? Investigating log anomaly
detection using BERT. Proceedings of the 20th Annual Workshop of the Australasian Language Technology
Association. 2022.

Bender, Emily, et al. 2021. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?
Proceedings of the 2021 ACM conference on fairness, accountability, and transparency. 2021

BSI. 2022. The situation of IT security in Germany 2022. 2022.

Bubeck, Sebastien, et al. 2023. Sparks of Artificial General Intelligence: Early experiments with GPT-4. 2023

Carlini, Nicholas, et al. 2021. Extracting Training Data from Large Language Models. 30th USENIX Security
Symposium (USENIX Security 21). 2021

Chen, Mark, et al. 2021. Evaluating Large Language Models Trained on Code. 2021

Cyber Security Agency of Singapore. 2023. ChatGPT - Learning Enough to be Dangerous. 2023

Danilevsky, Marina, et al. 2020. A survey of the state of explainable AI for natural language processing. 2020

Eger, Steffen, et al. 2019. Text Processing Like Humans Do: Visually Attacking and Shielding NLP Systems.
2019

Eikenberg, Ronald. 2023. ChatGPT as a hacking tool: where the AI can support. c't magazine. [Online]
May 02, 2023. https://www.heise.de/background/ChatGPT-als-Hacking-Tool-Wobei-die-KI-unterstuetzen-kann-7533514.html.

European Commission. 2021. Proposal for a regulation of the european parliament and of the council -
Laying down harmonized rules on artificial intelligence (artificial intelligence act) and amending certain union
legislative acts. 2021

Europol. 2023. ChatGPT - The impact of Large Language Models on Law Enforcement. 2023
Gehrmann, Sebastian, Strobelt, Hendrik and Rush, Alexander. 2019. GLTR: Statistical Detection and
Visualization of Generated Text. 2019

Greshake, Kai, et al. 2023. More than you've asked for: A Comprehensive Analysis of Novel Prompt Injection
Threats to Application-Integrated Large Language Models. 2023

Han, Luchao, Zeng, Xuewen and Song, Lei. 2020. A novel transfer learning based on albert for malicious
network traffic classification. International Journal of Innovative Computing, Information and Control. 2020

Hendrycks, Dan, et al. 2021. Measuring Massive Multitask Language Understanding. ICLR 2021. 2021.

Insikt Group. 2023. I, Chatbot. Cyber Threat Analysis, Recorded Future. 2023

Jones, Eric, et al. 2020. Robust Encodings: A Framework for Combating Adversarial Typos. 2020

Kirchenbauer, John, et al. 2023. A watermark for large language models. 2023

Kirchner, Jan Hendrik, et al. 2023. New AI classifier for indicating AI-written text. [Online] May 02, 2023.
https://openai.com/blog/new-ai-classifier-for-indicating-ai-written-text.

Klymenko, Oleksandra, Meisenbacher, Stephen and Matthes, Florian. 2022. Differential Privacy in Natural
Language Processing: The Story So Far. 2022.

Lee, Yukyung, Kim, Jina and Kang, Pilsung. 2021. System log anomaly detection based on BERT masked
language model. 2021


Li, Alexander Hanbo and Sethy, Abhinav. 2019. Knowledge Enhanced Attention for Robust Natural
Language Inference. 2019

Mitchell, Eric, et al. 2023. Detectgpt: Zero-shot machine-generated text detection using probability
curvature. 2023

Mozafari, Marzieh, Farahbakhsh, Reza and Crespi, Noël. 2019. A BERT-based transfer learning approach
for hate speech detection in online social media. Complex Networks and Their Applications VIII: Volume 1
Proceedings of the Eighth International Conference on Complex Networks and Their Applications. 2019

OpenAI. 2023. GPT-4 Technical Report. [Online] May 02, 2023. https://cdn.openai.com/papers/gpt-4.pdf.

Papers With Code. 2023. Multi-task Language Understanding on MMLU. [Online] May 02, 2023.
https://paperswithcode.com/sota/multi-task-language-understanding-on-mmlu.

Pearce, Hammond, et al. 2022. Asleep at the keyboard? Assessing the security of github copilot's code
contributions. IEEE Symposium on Security and Privacy (SP). 2022.

Shi, Jiawen, et al. 2023. BadGPT: Exploring Security Vulnerabilities of ChatGPT via Backdoor Attacks to
InstructGPT. 2023

Stiennon, Nisan, et al. 2020. Learning to summarize with human feedback. In Advances in Neural
Information Processing Systems. Advances in Neural Information Processing Systems 33 (NeurIPS 2020). 2020

Tian, Edward. 2023. GPTZero. [Online] May 02, 2023. https://gptzero.me/.

Wallace, Eric, et al. 2020. Concealed Data Poisoning Attacks on NLP Models. 2020

Wang, Wenqi, et al. 2019. A survey on Adversarial Attacks and Defenses in Text. 2019

Weidinger, Laura, et al. 2022. Taxonomy of Risks posed by Language Models. 2022.

Yaseen, Qussai and AbdulNabi, Isra'a. 2021. Spam email detection using deep learning techniques. Procedia
Computer Science. 2021

Zellers, Rowan, et al. 2019. Defending against neural fake news. Advances in neural information processing
systems. 2019
