PROJECT REPORT
Application of Large Language Models in Software Engineering
Tampere University
Version: 1.0
Publicity: Public
Contents
Abstract
Preface
1 Introduction
1.1 Definition of Large Language Model
1.1.1 Examples of LLMs
1.2 Applications and Use Cases of LLMs in Software Development
1.2.1 How LLMs Are Applied in Software Development
3 Data Preparation for LLM Training in Software Companies
3.0.1 Data Collection: Leveraging Software Documentation and Repositories
3.1 Data Cleaning and Preprocessing: Techniques for Code and Text
3.1.1 Handling Code Data
3.1.2 Managing Natural Language Data
3.1.3 Data Set Splitting and Annotation for Software Development Tasks
3.1.4 Annotation Techniques
6.2 Pre-trained Models for Code-Related Tasks
6.2.1 StarCoder
6.2.2 CodeT5
10 Conclusion
10.1 Summary of Key Findings for Software Companies
10.2 Recommendations for LLM Practitioners in Software Development
10.3 Final Thoughts
Chapter 1
Introduction
Large Language Models (LLMs) have been a buzzword that took the internet by storm in recent months, especially after the arrival of ChatGPT. In answer to this newly launched language model, other LLMs such as Google's Bard have been introduced, which shows that LLMs are here to stay [OpenAI].
Additionally, language models are expanding opportunities, since they can automate operations, reduce costs and time, and improve task accuracy. These are a few of the factors that make software companies interested in implementing LLMs in their operations. Large language models, however, are a recent innovation in computer science, so business executives might not be knowledgeable about such models. We penned this study to educate interested software company executives about large language models:
• Definitions
1.1 Definition of Large Language Model
An LLM is a type of machine learning model that produces outputs for various natural language processing (NLP) tasks, such as text generation, question answering, and machine translation. Large language models are often trained on enormous volumes of text data, frequently consisting of billions of words, and are typically built on deep learning neural networks such as the Transformer architecture. Larger models, like Google's BERT model, can produce results for a variety of tasks because they are trained with a large data set drawn from a variety of data sources. [Wodecki, July 2, 2022] [Vaswani, 2017]
To name a few common LLMs, the table below introduces the seven largest large language models by parameter size:
Model Developer Parameter Size
Requirement Generation
…without an LLM. It would have been frustrating, quite time-consuming, and still prone to errors. With LLMs, things are faster and there is less room for mistakes. Language models revolutionize the process by giving requirement engineers significant information and technological expertise. LLMs are essential in producing precise and contextually aware customer stories, descriptions of products, and feature recommendations. The development team can give explicit guidelines and criteria to LLMs so they can write narratives that support the project's goals. By considering a variety of options and the views of users, this method makes it easier to design and specify project requirements [Abbas]. LLMs can not only help to set up the requirements, but they can also help in prioritizing or validating them. By looking for contradictions, ambiguities, or conflicts in the requirement documents, language models can help in requirement validation [Hou et al., 2023]. They can help make sure that the requirements are precise and practical by offering suggestions or clarifications. Following that, the requirements' completeness is assured, strengthening the software development process [Abbas].
Implementation
Implementation is the phase of app development in which all the source code for a piece of software is actually written. This process is time-consuming, cumbersome, and sometimes tiring and challenging, which calls for automatic code creation. Code generation is a process in which source code is automatically generated from functional requirements such as natural language descriptions or pseudo-code algorithms. In recent years, large pre-trained language models such as AlphaCode and the GPT-3 series have demonstrated impressive capabilities in code generation. Other open-source code generation models include GPT-Neo [Sid Black and Biderman, 2021], GPT-J [Wang and Komatsuzaki, 2021], CodeParrot [Thomas Wolf and Rush, 2020], PolyCoder [Frank F. Xu and Hellendoorn, 2022], and InCoder [Daniel Fried and Lewis, 2022].
Software developers can give large language models high-level descriptions of what they want the code to do, and the LLMs will produce the relevant code. As a result, less manual coding is needed during the initial stages of software development [Jiang et al., 2023].
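As a rough sketch of this workflow, the snippet below asks a hosted code model to complete a function from a natural-language description. The checkpoint name, prompt, and decoding settings are illustrative assumptions, not details taken from the cited studies.

from transformers import pipeline

# Illustrative checkpoint; any causal code-generation model from the Hub could be used.
generator = pipeline("text-generation", model="bigcode/starcoderbase")

prompt = (
    "# Python function that returns the n-th Fibonacci number\n"
    "def fibonacci(n):"
)

# The model continues the prompt with a plausible implementation.
completion = generator(prompt, max_new_tokens=64, do_sample=False)
print(completion[0]["generated_text"])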
Automatic code generation is not the only step where LLMs can contribute; they can also be useful for debugging, error handling, restructuring, translation, and domain-specific code generation.
Coding errors can be found and fixed with the help of large language models.
They are able to examine code samples, spot potential mistakes, and rec-
ommend changes or different strategies. This can speed up debugging and
increase code quality [Hou et al., 2023].
Large language models can make code completion suggestions based on
the context and patterns found in the source. They can help developers write
code more quickly and easily by suggesting names for functions, variables, or
even complete snippets of code. This can boost output greatly and cut down
on syntax errors.[Ciniselli et al., 2021].
Code snippets can be analyzed by language models, which can then
suggest refactoring changes to enhance readability, performance, or quality.
They can point out redundant code, recommend improved code structure, or
offer more effective strategies. This can aid developers in following best practices and in code optimization.
In order to produce code specifically suited for a given domain or framework,
language models can be fine-tuned.[Wang et al., 2021] For instance, models
can be trained on certain frameworks or libraries, such as TensorFlow or
Django, allowing them to produce code unique to those tools.
Testing
For the purpose of generating test cases, LLMs can generate realistic and diverse data sets by understanding data dependencies; as a result, the efficiency and coverage of automated tests can be enhanced. They can analyze test case requirements and produce corresponding test scripts, which reduces the manual effort required for scripting [Schäfer et al., 2023].
LLMs can understand code semantics, extract patterns, and interpret natural language descriptions related to software functionality, which draws on their natural language processing and code comprehension capabilities. With the use of language models, test results, logs, and reports can be analyzed [Hou et al., 2023]. By examining error messages, log entries, or natural language descriptions, LLMs can assist in discovering trends, patterns, or typical failure circumstances, which helps with troubleshooting and issue identification.
Additionally, LLMs can help create test cases from natural language descriptions, promoting improved cooperation between developers and testers. They also assist in identifying test coverage gaps and make pertinent test case recommendations, ensuring thorough testing and lowering the possibility of issues going undetected. LLMs contribute to the creation of more dependable and high-quality software products by increasing test effectiveness and efficiency. [Hou et al., 2023] [Schäfer et al., 2023]
Deployment
…This step will ensure that the deployment environment is properly prepared to run the software [Gong et al., 2023].
Language models can generate release notes and documentation for software deployments. They are able to create release notes that highlight new features, bug fixes, and known issues while also summarizing the changes. Additionally, they can help create or update deployment documentation, which makes it simpler for users or administrators to comprehend and carry out the deployment procedure [Arakelyan et al., 2023].
…performance bottlenecks or other issues in the installed software.
Maintenance
In some cases, language models can generate automated bug fixes or code
patches based on error reports or issue descriptions.[Chen et al., 2023] They
can analyze the code base, understand the context, and suggest potential
fixes for common or repetitive bugs. These suggestions can then be reviewed
and applied by human developers.
Large Language models can offer suggestions for refactoring the code
to enhance its efficiency, quality, and maintainability. They can look for
anti-patterns or assess the code base and recommend refactoring or different
implementations that follow best practices.
Large language models can help optimize software performance by ana-
lyzing code, logs, or system metrics. They can identify potential bottlenecks,
suggest performance optimizations, or guide developers in applying perfor-
mance tuning techniques to enhance the efficiency and responsiveness of the
software[Hou et al., 2023].
Language models can assist in identifying potential security risks in the software. By analyzing code, configuration files, or security guidelines, they can provide insights and recommendations to enhance the security posture of the system. This can include suggestions for secure coding practices, vulnerability scanning, or access control improvements [Alqarni and Azim, 2022].
Large Language models analyze system logs and monitoring data to iden-
tify patterns, anomalies, or recurring issues. By processing and understand-
ing the log entries, they can provide insights into system behavior, perfor-
mance degradation, or potential issues that require attention.
Chapter 2
One of the reasons large language models are becoming popular in different businesses is that they can make work easier and smoother and also improve its quality. Consequently, integrating Large Language Models (LLMs) can greatly improve a variety of activities involving natural language processing (NLP) and code generation in the field of software development. This process of integration involves understanding key concepts and terminology, exploring popular LLM architectures for software engineering, and considering cloud-based solutions for accessing LLM capabilities. It is crucial to get to know certain basic terms and concepts in order to comprehend the integration process.
2.1.2 LLM
LLMs are deep learning models that have been trained on vast volumes of
text data in order to discover linguistic patterns and structures [Hadi et al.,
2023]. They can produce text that is human-like and are useful for a variety
of NLP applications, including sentiment analysis, language translation, code
generation, and more.
Figure 2.1: LLM keywords.
2.1.3 Tokenization
Tokenization is the process of breaking down text into smaller units called to-
kens.Depending on the chosen tokenization approach, tokens could be words,
subwords, or characters. In NLP, tokenization is a basic step that enables
models to process text in detail [Roumeliotis and Tselikas, 2023]
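A minimal sketch with the Hugging Face tokenizers, assuming the bert-base-uncased checkpoint purely as an example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # example checkpoint
tokens = tokenizer.tokenize("LLMs tokenize text into subword units.")
print(tokens)  # subword pieces, e.g. ['ll', '##ms', 'token', '##ize', ...]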
2.1.4 Embedding
In a high-dimensional space, the technique of embedding is utilized to rep-
resent words or tokens as continuous vectors. Word embeddings record the
semantic and syntactic connections between words, allowing models to com-
prehend the meaning of words in the context of a particular text [Wikipedia,
2023]
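As a sketch, the embedding layer of any pre-trained transformer maps token ids to such continuous vectors; the checkpoint below is again only an assumption:

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # example checkpoint
model = AutoModel.from_pretrained("bert-base-uncased")

ids = tokenizer("while loop", return_tensors="pt")["input_ids"]
vectors = model.get_input_embeddings()(ids)  # one continuous vector per token
print(vectors.shape)                         # e.g. torch.Size([1, 4, 768])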
2.1.5 Attention
LLM architectures must have attention mechanisms. They enable the model
to generate output by focusing on various segments of the input sequence.
The model’s capacity to recognize long-range dependencies and its compre-
hension of context are both improved by attention processes.
2.1.6 Pre-Training
An LLM is pre-trained using a sizable corpus of text data to gain general
knowledge and linguistic patterns. Pre-training helps the model develop a
solid grasp of syntax, semantics, and contextual relationships as it learns to
predict missing words or sentences [Hendrycks et al., 2019]
…match the requirements of the target task or area [Tormos et al., 2022].
LLMs can be fine-tuned to perform better on particular software engineering activities, such as code generation and code completion, by incorporating domain-specific knowledge.
On the other hand, autoencoding models learn to create a fixed-size vector representation (embedding) of input text by reconstructing the original input from a corrupted or hidden version of it. These models are trained to predict words that are missing or masked in the input text by utilizing the surrounding context [HuggingFace, b]. One of the best-known autoencoding language models, BERT (Bidirectional Encoder Representations from Transformers), was created by Google. It can be fine-tuned for a range of natural language processing tasks, including sentiment analysis, named entity recognition, and question answering.
The third option combines the autoencoding and autoregressive approaches; the T5 model is one example of such a model [Du et al., 2022]. Several LLM architectures have shown promising results in the field of software engineering. Some popular architectures include:
Codex
…generalize to a variety of coding jobs, including code generation, code completion, code repair, code translation, and code question answering. These features have made it useful for a variety of practical tasks, including providing documentation or unit tests for code snippets, finishing partially written code, writing explanations for code snippets, and correcting errors in code. It can also generate code from natural language descriptions.
BERT
BERT_SE, a model based on BERT, has been introduced for the software engineering domain; it is designed for textual classification in the field of software engineering. [Eliane Maria, 18/9/2020]
CodeBERT
Code2Vec
…recommendation, or even code clone detection. By encoding code snippets into vectors, we can compare or match them based on their semantic similarities. The model is more focused on learning code representations than on generating or predicting code sequences. [URI ALON, 30/10/2018]
Transformer-XL
These are just a few examples of models that can be used for software engineering related tasks; other models can also be useful in this context.
OpenAI API
Developers can make API calls online, as the OpenAI API is hosted on OpenAI's cloud infrastructure. The potential of OpenAI's models can thus be utilized without the need to build and maintain one's own computational infrastructure [Greg Brockman]. The cloud-based solution from OpenAI is built to manage many API requests and can grow to suit any application's needs. This ensures that programs can maintain performance while managing heavy workloads. The OpenAI API is a simple method for incorporating the models into one's own programs or services. It is easy to interact with the API and process the results because developers can perform API calls using conventional HTTP requests and receive responses in JSON format.
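A minimal sketch of such a call, assuming the chat completions endpoint, an API key in the OPENAI_API_KEY environment variable, and an illustrative model name:

import json
import os
import requests

response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-3.5-turbo",  # assumed model name
        "messages": [{"role": "user",
                      "content": "Write a Python one-liner that reverses a string."}],
    },
    timeout=60,
)
print(json.dumps(response.json(), indent=2))  # the completion arrives as JSON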
Hugging Face
Hugging Face is a popular platform for NLP models with a wide range of pre-trained large language models. These models can be useful for user-specific tasks such as text generation, text classification, sentiment analysis, etc. Hugging Face provides a library and API for easy integration of LLMs into different tasks, which can be related to software engineering. New models are continuously introduced on their website. [HuggingFace, a]
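As a sketch of that integration path, the pipeline API wraps a pre-trained model behind a single call; the sentiment-analysis task shown here is only an example:

from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a default model from the Hub
print(classifier("This build pipeline finally works!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99}]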
Microsoft Azure ML
2.2.2 Advantages and Considerations of Cloud-Based
Solutions for Large Language Models
Undoubtedly, cloud-based LLM services enable developers to take advantage of the power of language models without needing significant resources of their own. Yet we cannot ignore the challenges and issues that arise in this context.
However, there are other factors to take into account and difficulties that
come with using LLMs as a service, such as:
Chapter 3
Data Preparation for LLM Training in Software Companies
Effective data preparation is one of the most important tasks in the training of Large Language Models (LLMs) for software organizations. This section addresses the difficulties associated with LLM training in software development environments by concentrating on the critical phases of data preparation. By utilizing software documentation, repositories, and other pertinent sources, software businesses can gather, clean, pre-process, split, annotate, and augment the data to guarantee the best training outcomes.
…resources such as forums, online communities, and knowledge bases related to the software. These sources can contain valuable information, discussions, and best practices that can add to the documentation collection process.
Several techniques exist for extracting code snippets from version control systems like Git, SVN, or Mercurial. One approach is to analyze the differences (diffs) between consecutive commits in the version control system. These differences provide information about the added, modified, or deleted lines of code. By parsing the differences, developers can extract the relevant code snippets that were changed or added in each commit. Another technique is to extract code snippets at the file level. Developers can iterate through the files in each commit and extract code snippets based on specific criteria, such as function definitions, class definitions, or code blocks enclosed within specific markers or annotations.
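A minimal sketch of the diff-based approach for Git, using the standard git show command via subprocess; the helper name is our own:

import subprocess

def added_lines(repo_path: str, commit: str) -> list[str]:
    """Return the lines added by a commit, parsed from its diff."""
    diff = subprocess.run(
        ["git", "-C", repo_path, "show", "--unified=0", "--pretty=format:", commit],
        capture_output=True, text=True, check=True,
    ).stdout
    # Lines starting with a single '+' are additions ('+++' marks file headers).
    return [line[1:] for line in diff.splitlines()
            if line.startswith("+") and not line.startswith("+++")]

print(added_lines(".", "HEAD"))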
Removing Comments
…should be removed to focus solely on the executable code. This can be achieved by identifying and stripping out comment lines or blocks.
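A rough sketch for hash-style line comments, assuming Python-like source; a production version would need a real parser so that '#' inside string literals is not stripped:

import re

def strip_line_comments(code: str) -> str:
    # Naive: removes everything from '#' to the end of each line.
    return "\n".join(re.sub(r"#.*$", "", line).rstrip()
                     for line in code.splitlines())

print(strip_line_comments("x = 1  # counter\ny = 2"))  # -> 'x = 1\ny = 2'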
This step is important, as a large language model may not be able to handle long sequences of unstructured data. Therefore, initial data segmentation can help in splitting the data into various categories as required, for example into sentences or words [Kou et al., 2023] [Hou et al., 2023].
Removing Noise
Textual data may contain irrelevant or noisy elements, such as HTML tags,
special characters, or URLs. Cleaning techniques like regular expressions or
library functions can be used to remove or replace such noise.
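A sketch of such cleaning with regular expressions; the patterns are illustrative, not exhaustive:

import re

def clean_text(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)       # strip HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # strip URLs
    return re.sub(r"\s+", " ", text).strip()   # collapse leftover whitespace

print(clean_text("<p>See https://example.com for the docs!</p>"))
# -> 'See for the docs!'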
Dealing with Stop Words:
Stop words are common words, for example "the," "is," or "and." They often appear frequently in text but carry little semantic value. Removing stop words can reduce noise and focus the data on the more informative content. Stop word lists or libraries can be used for this purpose.
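A sketch using NLTK's English stop word list (the corpus is downloaded once):

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)  # one-time download
stop_words = set(stopwords.words("english"))

text = "the model is trained and evaluated on the held-out set"
print([w for w in text.split() if w not in stop_words])
# -> ['model', 'trained', 'evaluated', 'held-out', 'set']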
Random Splitting
In random splitting, the order of the data set's index is used to produce the training, validation, and test sets. Therefore, before dividing, the entire data set should be shuffled to avoid problems with class imbalance [medium]. This approach ensures an unbiased distribution of data across the splits but may not take into account specific characteristics or dependencies within the data set. A code sketch covering both random and stratified splitting follows the next subsection.
Stratified Splitting
In stratified splitting, the data set is divided while maintaining the distribution of specific attributes or labels across the splits. This is particularly useful when dealing with imbalanced data sets or when certain attributes are crucial for model evaluation [ludwig].
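A minimal sketch of both strategies with scikit-learn's train_test_split; the snippets and labels are stand-ins for a real annotated data set:

from sklearn.model_selection import train_test_split

snippets = [f"snippet_{i}" for i in range(100)]  # stand-in code snippets
labels = [i % 2 for i in range(100)]             # stand-in binary labels

# Random split: shuffle, then hold out 20% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    snippets, labels, test_size=0.2, shuffle=True)

# Stratified split: additionally preserve the label distribution in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    snippets, labels, test_size=0.2, stratify=labels)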
Time-based Splitting
In this process, the data set is split based on temporal order: the training set contains data from earlier time periods, the validation set includes data from a more recent period, and the test set represents the most recent data. This is common in scenarios where the model needs to generalize to future unseen data [ludwig] [Chalokia].
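A sketch of a time-based split, assuming each example carries a timestamp field:

def time_based_split(examples, train_frac=0.7, val_frac=0.15):
    # Sort chronologically, then slice: oldest -> train, newest -> test.
    ordered = sorted(examples, key=lambda e: e["timestamp"])
    train_end = int(len(ordered) * train_frac)
    val_end = int(len(ordered) * (train_frac + val_frac))
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]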
Cross-validation
In cross-validation, the data set is partitioned into multiple subsets or folds, with each fold used as a validation set while the model is trained on the remaining folds. Cross-validation therefore provides a more robust evaluation by utilizing all data for training and validation.
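A sketch with scikit-learn's KFold, where each fold serves once as the validation set:

from sklearn.model_selection import KFold

data = list(range(10))  # stand-in examples
for fold, (train_idx, val_idx) in enumerate(KFold(n_splits=5, shuffle=True).split(data)):
    print(f"fold {fold}: train={train_idx}, validation={val_idx}")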
Manual Annotation
Automated Annotation
…such as keyword matching, topic modeling, or machine learning-based approaches. Automated annotation can speed up the process but may require additional validation and refinement.
Active Learning
Chapter 4
For training large language models for software engineering related tasks, core knowledge about the different techniques for training LLMs is important. This knowledge helps in efficiently choosing which approach will be helpful for a specific task with a task-specific data set. These techniques include supervised learning, unsupervised learning, transfer learning, and fine-tuning, and this chapter focuses on them.
In software engineering tasks, this technique involves using labeled software-related data sets, where each input example is associated with a corresponding target label. In the context of software engineering, the input examples can include code snippets, natural language descriptions, or a combination of both [Hou et al., 2023].
To train a language model using supervised learning, the developers would
typically provide the input examples along with their corresponding labels
to the model during the training process. The model then learns to map
the input examples to the appropriate output labels based on the provided
training data. This approach can be used for various software engineering
tasks, such as code classification, code summarization, and more.
4.3 Transfer Learning for Software-specific Tasks
In transfer learning, the knowledge of an already trained machine learning
model is applied to a different but related problem. With transfer learning,
we generally try to re-use what has been learned in one task to improve gen-
eralization and performance in another task. We move the weights that a
network has picked up in ”task A” to a fresh ”task B.” The most typical ap-
proach is to apply what a model has learnt from a task with a lot of labeled
training data to a new task with little to no training data. We begin the
learning process using patterns discovered while completing a comparable
task, as opposed to starting from scratch. [builtin].
By using transfer learning, language models can benefit from the general
knowledge learned during pre-training, which helps them bootstrap their
understanding of software-specific tasks with limited task-specific training
data. This approach can lead to better performance, especially when the
target task has a small or limited labeled data set [Tormos et al., 2022]. For
transfer learning in software engineering, a common approach is to use a
large-scale language model pre-trained on a vast corpus of code and natural
language, such as GitHub repositories or Stack Overflow. This pre-trained
model can then be fine-tuned on specific software-related tasks, such as code
completion, bug detection, or code summarization, using task-specific labeled
data.
Transfer learning can be used for code-related tasks by pre-training trans-
former models on a generic data set using a self-supervised task, such as
filling masked words in sentences. Then, these models are fine-tuned to sup-
port specific code-related tasks, such as automatic bug-fixing, injection of
code mutants, generation of assert statements, and code summarization.
A single model can be fine-tuned to support multiple tasks, possibly exploiting the benefits of transfer learning.
to solve a specific task can be useful to boost performance on another task.
The Text-To-Text Transfer Transformer (T5) model is a pre-trained trans-
former model that has been used to support code-related tasks. The T5
model achieved better performance compared to state-of-the-art baselines in
the four code-related tasks mentioned above.[Mastropaolo et al., 2022]
Large language models pre-trained on vast amounts of source code ("Code LLMs") have achieved remarkable progress in code intelligence. For instance, with the help of generative AI tools, software developers can now create and maintain their code easily, and eventually improve their productivity significantly. [Yue Wang]
Figure 4.1: LLM fine-tuning.
Chapter 5
In the software context, evaluation metrics for Large Language Models are used to assess the performance of language models specifically on software-related tasks. Evaluating language models in this context is crucial to ensure they are effective, accurate, and reliable. The choice of evaluation metrics depends on the specific software application and use case. Different tasks require different metrics, and a combination of metrics is often used to provide a comprehensive evaluation of a language model's performance in the software context. LLM evaluation metrics can be categorized into two main types: intrinsic metrics and extrinsic metrics. In this chapter, we focus on these metrics.
5.2 Extrinsic metrics
Extrinsic metrics evaluate a machine learning model based on its performance on specific tasks or real-world applications. These metrics are often used to assess how well the model will perform in the intended use case. The extrinsic metrics used to evaluate NLP systems are as follows:
5.2.1 Accuracy
In classification jobs, the accuracy metric is employed to determine how closely a measured value resembles a known value. It is generally employed when the output variable is discrete or categorical. For instance, it captures how often a sentiment classification algorithm is correct.
5.2.2 Precision
The precision metric gives the proportion of cases labeled positive by the classifier that actually carry positive labels. For example, when identifying a cancer that is prevalent 1 percent of the time, a model that always outputs "negative" will be 99% accurate, but 0% precise. [Yashaswini and Shylaja]
5.2.3 Recall
Recall measures how well the model can recall the positive class. The recall value signifies the number of positive labels that the model has correctly identified as positive [Yashaswini and Shylaja]. Precision and recall are complementary metrics that have an inverse relationship [Afonja, 2017]. If both metrics are equally important, the F1 score can be used to combine precision and recall into a single metric.
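For reference, the standard definitions in terms of true positives (TP), false positives (FP), and false negatives (FN) are:
\[
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\]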
5.2.5 Perplexity:
Perplexity measures how well a language model predicts a given text. It
quantifies the level of uncertainty or surprise of the model when predicting
the next word. A lower perplexity score indicates that the model is better at
predicting the next word and, therefore, has a better understanding of the
language it is processing.
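Formally, for a sequence of N tokens w_1, ..., w_N, perplexity is the exponentiated average negative log-likelihood that the model assigns to the sequence:
\[
\mathrm{PPL}(w_1, \dots, w_N) = \exp\!\left(-\frac{1}{N} \sum_{i=1}^{N} \log p(w_i \mid w_{<i})\right)
\]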
5.2.6 METEOR:
A precision-based metric for assessing the output of machine translation is the
Metric for Evaluation of Translation with Explicit Ordering (METEOR). It
avoids some of the BLEU score’s drawbacks, like exact word matching when
computing precision. Synonyms and stemmed words can be matched with a
reference term using the METEOR score.
On the basis of stemmed words and meanings, the n-grams can be matched.
METEOR calculates a score using unigram precision and recall.
5.2.7 ROUGE:
The Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric assesses recall. It is frequently employed in machine translation tasks and for assessing the quality of generated content. However, since it assesses recall, it is most frequently used for summarization tasks.
Chapter 6
6.1.1 TensorFlow
Developed by Google, TensorFlow is an open-source deep learning framework
known for its flexibility and extensive community support. It’s widely used
in both research and industry for various machine learning tasks, including LLMs. [wikipedia]
6.1.2 PyTorch
Created by Facebook’s AI Research lab, PyTorch is a library which is pop-
ular among researchers for its dynamic computation graph and ease of use.
It offers a more robust approach, making it accessible for those familiar with
Python programming.
40
• Pytorch allows flexibility in building and modifying neural networks on
the fly
Keras
Keras is a high-level neural network API which runs on top of other deep
learning frameworks like TensorFlow. It’s known for its simplicity and ease
of use, making it an excellent choice for beginners and experts alike.
DeepSpeed
6.2 Pre-trained Models for Code-Related Tasks
6.2.1 StarCoder
A distinctive strength of StarCoderBase is its superior ability to produce
valid code outputs while maintaining a notably low occurrence of insecure
completions, particularly in cases where over 95% of the generated code is
valid. It showcases proficiency in tasks like converting natural language de-
scriptions into code, documenting code, and predicting type annotations.
Furthermore, StarCoder has been trained on natural language text, ren-
dering it versatile for a range of natural language tasks, including reasoning.
It’s essential to acknowledge that while StarCoder makes strides in data
privacy, it might still generate personally identifiable information (PII). Mea-
sures have been implemented to identify and remove PII, yet tailored valida-
tion and refinement remain necessary for specific applications.
The StarCoder models are accessible to the public under a version of the
Open Responsible AI Model license that encourages practical utilization and
stimulates ongoing research and advancement in the domain.[Li et al., 2023]
6.2.2 CodeT5
CodeT5 stands as an advanced pre-trained encoder-decoder model tailored
explicitly for tasks involving code comprehension and creation. Unlike its
predecessors that treated code snippets much like natural language text,
CodeT5 capitalizes on the distinctive attributes of programming languages,
meticulously considering token types within code. It embraces a unified
architecture accommodating both code understanding and generation tasks,
thereby enabling efficient multi-task learning. This is facilitated through
an innovative identifier-aware pre-training task, which equips the model to
discern code tokens functioning as identifiers, enhancing its ability to capture
semantic nuances in code.
Fine-tuning CodeT5 for specific tasks is facilitated through task-specific
transfer learning or multi-task learning strategies. When applied to code gen-
eration tasks, CodeT5 can be adapted using its Seq2Seq framework, while
for code understanding tasks, it explores methodologies such as generating
labels as unigram target sequences or predicting them based on class label
vocabularies. This versatile model has undergone fine-tuning across a spectrum of code-related tasks, showcasing its prowess in defect detection, clone identification, summarization, translation, and refinement.
Notably, CodeT5’s efficacy has demonstrated significant advancements in
both understanding and generation tasks spanning diverse directions like pro-
gramming language-to-natural language, natural language-to-programming
language, and even within programming languages themselves.[Wang et al.,
2021]
Chapter 7
Selecting a Pre-trained Model
Pre-trained models are neural networks that have undergone extensive training on a large volume of data, typically for general NLP tasks. They are able to transfer complicated linguistic elements and patterns to other relevant tasks. Compared with developing a model from scratch, using pre-trained models can help achieve better results faster and with less data [Linkedin]. When selecting a pre-trained model for a downstream task, there are several factors to consider, such as the task and the data. For instance, for a text classification task, a model pre-trained on a large text corpus, like BERT or GPT-2, might be a good starting point. Conversely, for code-related tasks, choosing a model that was not trained to understand programming language semantics may not be a good option.
Preparing the Data Set
from transformers import AutoTokenizer
# Example checkpoint (an assumption); any model's tokenizer behaves the same way.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
When processing a batch of sentences, they do not always have the same length. This is a difficulty because tensors, the model's inputs, need to have a consistent shape. Padding addresses this by adding a special padding token so that the resulting tensors are rectangular. Furthermore, a sequence can sometimes be too long for a model to handle. In that case, truncation is needed to shorten the sequence.
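Building on the tokenizer created above, a sketch of both mechanisms in a single call:

batch = tokenizer(
    ["short input", "a much longer input that might exceed the model limit ..."],
    padding=True,         # pad to the longest sequence in the batch
    truncation=True,      # cut sequences above the model's maximum length
    return_tensors="pt",  # rectangular PyTorch tensors
)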
Hyperparameter Tuning
Hyperparameters are parameters that are set before the training process begins and control various aspects of the training process itself. They are not learned during training but rather set by the user [Sathishkumar et al., 2023]. Examples of hyperparameters include the learning rate, batch size, number of layers, number of hidden units, regularization strength, etc. Hyperparameter tuning involves selecting the values for these parameters that achieve the best performance on the validation set.
Selecting the right set of hyperparameters is crucial for a model's performance and accuracy. Unfortunately, there are no set rules on which hyperparameters work best, nor on their optimal or default values. We need to experiment to find the optimum hyperparameter set. This activity is known as hyperparameter tuning or hyperparameter optimization.
Hyperparameters can affect model structure, function, and performance. Hyperparameter tuning allows developers to tweak model performance for optimal results. This process is an essential part of machine learning, and choosing appropriate hyperparameter values is crucial for success.
Consider, for example, the learning rate as a hyperparameter: if the value is too high, the model may converge too quickly with suboptimal results; if it is too low, training takes too long and the results may not converge. An appropriate and balanced choice of hyperparameters results in accurate models and excellent model performance [Amazon Web Services].
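A minimal sketch of tuning a single hyperparameter by grid search over a validation set; train_and_evaluate is a hypothetical stand-in for an actual training run:

def train_and_evaluate(learning_rate: float) -> float:
    # Hypothetical stand-in: a real implementation would train the model with
    # this learning rate and return its accuracy on the validation set.
    return 0.9 - abs(learning_rate - 1e-4)  # fake score for illustration

best_lr, best_acc = None, float("-inf")
for lr in [1e-5, 3e-5, 1e-4, 3e-4]:  # candidate learning rates
    acc = train_and_evaluate(lr)
    if acc > best_acc:
        best_lr, best_acc = lr, acc
print(f"best learning rate: {best_lr}")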
Training
A pretrained model can be fine-tuned with a deep learning framework of the developer's choice. The following is a guide to training transformer-based models using PyTorch and TensorFlow:
PyTorch
The actual model training is initiated using the train() method of the Trainer object. This step iteratively updates the model's parameters using the training data, refining its ability to make accurate predictions. By incorporating the hyperparameters, metrics, and training data, the model's adaptation process is managed in a controlled and optimized manner.
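A condensed sketch of this flow; the tokenized train_dataset and eval_dataset objects and the checkpoint name are assumptions:

from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # example checkpoint

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3),
    train_dataset=train_dataset,  # assumed: tokenized training split
    eval_dataset=eval_dataset,    # assumed: tokenized validation split
)
trainer.train()  # iteratively updates the model's parameters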
TensorFlow
The initial phase of model training entails loading a data set. The AutoTokenizer function facilitates the generation of tokenizers tailored to specific transformer architectures. Tokenization is a pivotal preprocessing step that converts raw text data into manageable units, a prerequisite for model understanding. The utilization of NumPy arrays ensures efficient handling of tokenized data and labels.
With tokenized data and labels prepared and the model compiled, the training phase begins. This pivotal stage is facilitated through the model.fit() function in TensorFlow. This process drives the gradual acquisition of domain-specific knowledge, which enhances the model's predictive abilities.
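A sketch of the TensorFlow path; the checkpoint name and the tiny stand-in data are assumptions:

import numpy as np
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = TFAutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

texts = ["fix the login bug", "add a search feature"]  # stand-in examples
labels = np.array([0, 1])                              # stand-in labels
tokens = dict(tokenizer(texts, padding=True, return_tensors="np"))

model.compile(optimizer="adam")  # recent transformers versions supply a default loss
model.fit(tokens, labels, epochs=3)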
Native PyTorch
In some scenarios, researchers and practitioners may prefer a more tailored training loop for fine-tuning transformer models. Before starting fine-tuning, it is advisable to optimize memory usage. This can be achieved by freeing resources, for example by removing previously loaded models, using the following code snippet:
del model
del pytorch_model
del trainer
torch.cuda.empty_cache()
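A skeletal native loop under the same assumptions (a loaded model and a train_dataloader of tokenized batches that include labels):

from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)
model.train()

for epoch in range(3):
    for batch in train_dataloader:  # assumed DataLoader of tokenized batches
        outputs = model(**batch)    # batches include 'labels', so a loss is returned
        outputs.loss.backward()     # backpropagate
        optimizer.step()
        optimizer.zero_grad()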
Evaluation
Data augmentation
The early layers of a pre-trained model are responsible for learning low-level
features such as edges and corners. These features are useful for many tasks,
and we do not want to overwrite them during fine-tuning. Therefore, it is
recommended to freeze the early layers of the pre-trained model and only fine-tune the later layers.
Regularization
One of the primary challenges researchers and practitioners can face when
fine-tuning pretrained models is acquiring or creating datasets that are tai-
lored to their specific task. While pretrained models are versatile, they still
require domain-specific and task-specific data for optimal performance. Ac-
quiring such datasets can be a complex task, particularly for niche or emerg-
ing fields where labeled data might be scarce. Moreover, curating datasets
that are representative and balanced poses its own set of challenges.
Documentation and Guidelines
The availability and quality of documentation play a pivotal role in the success of any implementation. While libraries like Hugging Face's Transformers have greatly simplified the process of fine-tuning, the documentation and guidelines for fine-tuning pre-trained models can sometimes be inadequate or ambiguous. Clear and comprehensive documentation is essential for guiding users through the intricate process of selecting the right model, configuring hyperparameters, pre-processing data, and interpreting results. Insufficient documentation can lead to confusion, misinterpretations, and ultimately suboptimal outcomes.
The NLP community may collaborate to create common datasets that are tai-
lored to particular objectives. Open-source platforms that encourage dataset
contributions and crowdsourcing can help alleviate the scarcity of task-specific
data.
Chapter 8
model’s overall architecture. An advanced variant, structured pruning, takes
a rule-based approach to eliminate entire structural components of the model,
all the while preserving the global network structure.[Zhu et al., 2023].
A noteworthy example of structured pruning is the LLM-Pruner, which
stands as a testament to the innovation in this field. This technique amal-
gamates a dependency detection algorithm with an efficient importance esti-
mation method, resulting in a finely tuned pruning process.[Ma et al., 2023].
By integrating model compression and pruning techniques, software com-
panies stand to achieve significant reductions in LLM size, memory utiliza-
tion, and computational expenses. This, in turn, paves the way for the
deployment of LLMs in resource-constrained environments with enhanced
efficiency, marking a substantial advancement in the realm of software engi-
neering.
…LLMs, making them suitable for resource-constrained environments.
Efficient deployment of Large Language Models (LLMs) necessitates eval-
uating their inference efficiency through accuracy, zero-shot ability, and in-
ference scaling laws.
However, ensuring their real-world applicability demands ongoing moni-
toring and maintenance throughout the software development lifecycle. This
entails adapting to evolving data distributions, detecting and correcting er-
rors, mitigating biases, optimizing performance, and staying updated with
LLM versions and updates. By integrating these measures, LLMs can con-
sistently provide accurate, unbiased, and efficient responses in dynamic real-
world environments. [Zhu et al., 2023]
Chapter 9
…models that are applied to recommend code patterns have been found to carry security flaws forward, which creates the risk not only of generating buggy code but also of perpetuating immature implementation practices among software developers. [Perry et al., 2022]
9.1 Future Directions of LLM and Software
Engineering
• Future research can focus on creating Large Language Model (LLM)
architectures that are specifically designed to address software engineer-
ing tasks. These tailored architectures can take into account the unique
characteristics of coding languages, software design patterns, and de-
velopment practices. By fine-tuning LLMs for software engineering,
researchers can enhance their ability to generate accurate and con-
textually relevant code, documentation, and recommendations. [Hou
et al., 2023]
• LLMs are often seen as black-box models due to their complex in-
ternal workings. Enhancing the interpretability and explainability of
LLMs is crucial for building trust and adoption in software engineer-
ing tasks.[Tantithamthavorn et al., 2023] Researchers can explore tech-
niques to generate human-readable explanations for LLM-generated
outputs, such as code explanations or reasons behind suggested code
changes. This empowers developers to understand and trust the model’s
recommendations.
…refine them, and incorporate their domain expertise, resulting in higher-quality code and documentation.
Responsible LLM development goes beyond addressing biases. It involves
tackling challenges inherent to these models. For instance, the sheer size of
LLMs can present deployment challenges, as they require substantial com-
putational resources and memory. Moreover, LLMs often depend heavily on
the quality and quantity of training data. Ensuring a consistent and reliable
supply of relevant data is crucial for optimal performance.
In response to these challenges, ongoing efforts are directed towards refining LLMs. One strategy involves reducing their size to improve efficiency and practical usability. Techniques like genetic algorithms are being explored to compress LLMs without compromising their performance. [Hou et al., 2023]
As LLMs find their place in various software engineering tasks, it’s vital
to consider the resources they demand. Storage, memory, and computational
requirements must be carefully managed to ensure smooth integration. Or-
ganizations should strike a balance between the benefits of LLM utilization
and the resources allocated to accommodate their usage effectively.
Chapter 10
Conclusion
At this point it is time to look back at the objective of the study and compare it to our findings. This chapter summarizes the findings of the whole study and recommends some good practices to practitioners and researchers.
…depends on the availability and quality of the data. We cannot but accept the scarcity of task-specific labelled data sets in the software domain. Moreover, the careful management of data preparation, training techniques, evaluation metrics, and deployment strategies emerges as a critical factor in achieving successful LLM integration. Ethical concerns related to biases in training data and fairness metrics underscore the need for responsible LLM development.
As we peer into the future, the significance of model optimization for effi-
cient deployment, exploration of edge device deployment, and the imperative
to maintain LLMs throughout the software development lifecycle become in-
creasingly apparent. The synthesis of these findings presents a roadmap
for software companies, guiding them toward informed decisions and best
practices as they navigate the realm of LLMs, fostering innovation while up-
holding ethical standards and achieving sustainable success in the evolving
software landscape.
…understanding the unique requirements of software engineering, models can be optimized to generate more accurate and contextually relevant code, documentation, and other software-related outputs.
Interpretability and explainability are crucial aspects, especially in critical
domains like software engineering. Techniques should be explored to make
LLMs more interpretable.
Creating interactive interfaces and tools that enable seamless interaction
between LLMs and human developers is essential. Such interfaces can help
developers integrate LLMs into their workflow, receive real-time suggestions,
and provide feedback to improve the model’s performance.
The performance of LLMs should be continuously evaluated and opti-
mized. Factors such as model size, deployment challenges, and data depen-
dencies should be considered. Regular model updates can ensure that the
LLM remains effective and aligned with the evolving needs of the software
engineering field.
Keeping up-to-date with the latest research and advancements in LLMs
is crucial. The field of machine learning, including LLMs, is rapidly evolv-
ing. Staying informed about new techniques and approaches can help in
harnessing the full potential of LLMs for optimizing software development
processes.[Hou et al., 2023]
…with data collection, pre-processing, and utilization of LLMs. We emphasize the pivotal role of high-quality datasets that have undergone meticulous curation.
Furthermore, our review sheds light on specific SE tasks that have expe-
rienced significant improvements thanks to the incorporation of LLMs. We
highlight the tangible benefits and practical advancements that have been
achieved through the integration of these models. From code generation to
natural language understanding in SE, LLMs have demonstrated their ability
to enhance the efficiency and effectiveness of various processes.
Finally, we look into several methods used to evaluate and improve LLM performance on SE tasks. Additionally, we highlight the current difficulties and introduce a roadmap that identifies interesting future directions. For researchers and engineers investigating the application of LLMs in software engineering, this thorough overview offers crucial insights.
Bibliography
A. Abbas. https://www.techopedia.com/5-ways-llms-can-empower-software-engineering.
S. Arakelyan, R. J. Das, Y. Mao, and X. Ren. Exploring distributional shifts in large language models for code analysis, 2023.
builtin. https://builtin.com/data-science/transfer-learning.
M. Chalokia. https://www.linkedin.com/pulse/time-based-splitting-determining-train-test-data-come-manraj-chalokia/.
Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang. GLM: General language model pretraining with autoregressive blank infilling, 2022.
geeksforgeeks. https://www.geeksforgeeks.org/why-tensorflow-is-so-popular-tensorflow-features/.
HuggingFace. https://huggingface.co/models, a.
ibm. https://www.ibm.com/topics/supervised-learning.
Linkedin. https://www.linkedin.com/advice/0/how-do-you-use-fine-tune-pre-trained.
ludwig. https://ludwig.ai/latest/configuration/preprocessing/.
A. Mastropaolo, N. Cooper, D. Nader, S. Scalabrino, D. Poshyvanyk, R. Oliveto, and G. Bavota. Using transfer learning for code-related tasks. IEEE Transactions on Software Engineering, PP:1–20, 01 2022. doi: 10.1109/TSE.2022.3183297.
medium. https://medium.com/data-science-365/random-vs-stratified-splits-5d3d528d445b.
nltk. https://www.nltk.org/.
nvidia. https://blogs.nvidia.com/blog/2022/03/25/what-is-a-transformer-model/.
OpenAI. https://openai.com/blog/customizing-gpt-3.
Safjan. https://safjan.com/measure-quality-of-embeddings-intrinsic-vs-extrinsic/.
simform. https://www.simform.com/blog/completeguide-finetuning-llm/.
tensorflow. https://www.tensorflow.org/about.
T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online, 2020.
Y. Wan, W. Zhao, H. Zhang, Y. Sui, G. Xu, and H. Jin. What do they capture? – A structural analysis of pre-trained language models for source code, 2022.
wikipedia. https://en.wikipedia.org/wiki/tensorflow.
B. Wodecki. 7 language models you need to know. pages 19–36, July 2, 2022.
S. Yashaswini and S. S. Shylaja. Metrics for automatic evaluation of text from NLP models for text to scene generation. EJECE, European Journal of Electrical Engineering and Computer Science, PP.
Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, and M. Zhou. CodeBERT: A pre-trained model for programming and natural languages. 18/9/2020.