Training Large Language Models and Using Them for the Web
Topic :
Training Large Language Models
and Using Them for the Web
Supervisor:
Ms. Lipsa Das
Department of Computer Science and Engineering
Neelanjan Mukherji
A41105221002
B.Tech (Computer Science and Engineering)
Semester - VIII
Amity School of Engineering
Amity University Greater Noida
Certificate
I, a student of B.Tech. CSE (2021 - 2025), hereby declare that the
presented report titled “Training Large Language Models and Using
Them for the Web”, submitted by me to the Department of Computer
Science and Engineering, Amity School of Engineering, Amity University
Uttar Pradesh, in partial fulfillment of the requirements for the award of
the degree of Bachelor of Technology in Computer Science and
Engineering, has not been previously submitted, either in part or in full,
for any other degree of Amity University or of any other
university or institution.
Percentage of Plagiarism :
Name of supervisor under whom the work was performed: Ms. Lipsa Das
Acknowledgement
I would like to thank the Amity School of Engineering and my
supervisor, Ms. Lipsa Das, for providing me with the opportunity to
write this report. I would also like to express my sincerest gratitude
to my industry guide for guiding me throughout this research with
invaluable knowledge, constantly supporting me, and strengthening my
confidence. I would also like to thank my parents and friends, with
whom I discussed my research, for their helpful insights, and all the
interviewees who helped me with the primary data.
Abstract
This paper examines the development, training, and use of large
language models (LLMs) in web services, with an emphasis on the
transformative impact of models such as GPT, BERT, and T5. Built on the
ground-breaking Transformer architecture, these models have become
indispensable tools in Natural Language Processing (NLP) owing to their
accuracy and fluency in both understanding and producing human
language. Key topics covered include advanced LLM training methods,
the use of LLMs in web-based applications, and issues of scalability,
ethics, and data protection.
The incorporation of LLMs into online applications has changed the way
people interact with digital systems. Whether by powering search
engines or enabling dynamic chatbots, LLMs play a crucial role in
improving the user experience on the web. Modern online services use
LLMs to increase the relevance of search results (for example, Google's
use of BERT in search queries), to automate customer support through
virtual assistants (for example, GPT-based chatbots), and to create
content at scale (for example, automated blog posts and summaries).
Because LLMs can interpret and generate natural language, websites
become more responsive, adaptable, and intelligent, which makes LLMs
essential components of the ever-changing digital world.
Email and Social Media Copy: LLMs can help marketing teams save time
and effort by drafting emails, social media posts, and product
descriptions that are succinct and persuasive. This automation lets
businesses maintain a consistent online presence with far less manual
effort.
N-gram Models
N-gram models were among the earliest language models to be widely
used. They operate by predicting the next word in a sequence from the
n−1 words that come before it. For example, a trigram model predicts
the next word using the two preceding words. Although effective for
specific tasks, n-gram models struggle with longer contexts and more
sophisticated language.
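To make the idea concrete, here is a minimal sketch of a toy trigram model built from word counts; the corpus, function names, and example sentences are invented purely for illustration.

# A toy trigram model (illustrative only): it predicts the next word from
# the two preceding words by counting occurrences in a small corpus.
from collections import defaultdict, Counter

def build_trigram_model(corpus):
    """Count how often each word follows every pair of preceding words."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            counts[(w1, w2)][w3] += 1
    return counts

def predict_next(model, w1, w2):
    """Return the most frequent word seen after the given two-word context."""
    following = model[(w1.lower(), w2.lower())]
    return following.most_common(1)[0][0] if following else None

corpus = [
    "the sky is blue",
    "the sun is bright",
    "the sky is clear today",
]
model = build_trigram_model(corpus)
print(predict_next(model, "sky", "is"))  # likely "blue" or "clear"

The sketch also shows the weakness described above: any context the model has never counted yields no prediction at all, and nothing outside the two-word window influences the result.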
Limitations of HMMs: Although useful for specific applications, HMMs
had problems with scalability and were not well suited to the extensive
variability present in natural language. They also lacked the deep
contextual awareness required for many NLP tasks.
Advantages of LSTMs:
LSTMs allowed models to retain information across far longer contexts
than RNNs could, enabling them to handle tasks involving longer text
sequences and more complicated sentence structures. They were widely
used for NLP tasks such as machine translation (for example, Google
Translate), text classification, and sentiment analysis.
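As an illustration of how LSTMs are typically applied to such tasks, the sketch below builds a minimal LSTM-based text classifier in PyTorch; the vocabulary size, dimensions, and class count are placeholder values, not taken from any particular system.

# A minimal LSTM text classifier sketch (illustrative placeholder values).
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)       # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)       # hidden: (1, batch, hidden_dim)
        return self.classifier(hidden[-1])         # logits: (batch, num_classes)

model = LSTMClassifier()
dummy_batch = torch.randint(0, 10000, (4, 20))     # 4 sequences of 20 token ids
print(model(dummy_batch).shape)                    # torch.Size([4, 2])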
Self-Attention Mechanism
The primary novelty introduced by the Transformer design is its
self-attention mechanism. In contrast to RNNs and LSTMs, which process
sequences one step at a time, self-attention lets the model process all
of the words in a sequence concurrently. Because each word in the
sequence can attend to every other word, the model can capture
long-range dependencies as well as complicated interactions between
specific words.
GPT-2: A far larger model with 1.5 billion parameters, released by
OpenAI in 2019. GPT-2 attracted a great deal of attention for its
capacity to generate coherent, contextually relevant text,
demonstrating the promise of large-scale language models for tasks such
as text completion, summarization, and question answering.
GPT-3: Launched in 2020 with 175 billion parameters, GPT-3 represented
a huge step forward and remains one of the largest language models ever
developed. It exhibited the capacity to perform a broad variety of tasks
without task-specific fine-tuning, relying instead on few-shot,
one-shot, or zero-shot in-context learning. It could produce human-like
text, write code, and even carry on conversations, which positioned it
as a flexible tool for web-based applications such as chatbots, content
generation, and virtual assistants.
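The in-context learning behaviour described above can be sketched with the Hugging Face transformers library. GPT-3 itself is only accessible through OpenAI's API, so the example below uses the much smaller GPT-2 as a stand-in; the prompt format is illustrative, and GPT-2's actual completions will be far less reliable than GPT-3's.

# Few-shot prompting sketch: the prompt embeds worked examples and the model
# is asked to continue the pattern with no gradient updates. GPT-2 stands in
# for GPT-3 here, so output quality is limited.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "cheese => fromage\n"
    "peppermint => menthe poivrée\n"
    "plush giraffe =>"
)
print(generator(prompt, max_new_tokens=10)[0]["generated_text"])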
Content Generation
Models such as GPT-3 can produce material for the web, automate the
creation of blogs, and summarize articles. Thanks to this automation,
websites can scale up their content creation efforts while maintaining
a high level of quality and efficiency.
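As a rough illustration, the snippet below uses the transformers summarization pipeline (which defaults to an encoder-decoder model rather than GPT-3) to condense a short passage; the article text is a placeholder.

# Summarization sketch for web content; the default pipeline model is used
# and the input text is an invented placeholder.
from transformers import pipeline

summarizer = pipeline("summarization")

article = (
    "Large language models are increasingly used by web services to draft "
    "blog posts, condense long articles, and generate product descriptions. "
    "This allows editorial teams to publish more content while spending "
    "their time on review and fact-checking rather than first drafts."
)
summary = summarizer(article, max_length=40, min_length=10, do_sample=False)
print(summary[0]["summary_text"])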
Personalization
Personalized user experiences are made possible by LLMs, which tailor
online content, recommendations, and interactions to user preferences
and behavior. This makes web applications more user-centric and
engaging.
Ethical Concerns
When LLMs are trained on huge datasets obtained from the internet, they
may inadvertently learn and propagate the biases inherent in that data.
This has raised questions about the fairness, transparency, and
accountability of artificial intelligence systems, particularly when
they are used in high-stakes fields such as law, healthcare, or finance.
Data Privacy
Because LLMs depend on enormous volumes of user data, ensuring
compliance with privacy regulations such as the General Data Protection
Regulation (GDPR) is essential. For web-based services, firms that
deploy LLMs continue to face the challenge of protecting personal data
while also offering individually tailored experiences.
Training Large Language Models
Training large language models (LLMs) is a resource-intensive and
complicated process that comprises numerous stages, from data
collection and preprocessing to model optimization and deployment. The
objective is to develop models that can process, comprehend, and
produce human language in a meaningful way. This part goes deeper into
the many components of training LLMs, including the infrastructure,
procedures, and problems involved, as well as how LLMs are adapted for
web-based applications.
Training Infrastructure
A well-developed infrastructure is essential for training LLMs
efficiently and at scale. Both the hardware and the software
environment must be tuned for high-performance computing to
accommodate the growing size and complexity of LLM projects.
Hardware Considerations
GPUs and TPUs: Specialized hardware such as Graphics Processing Units
(GPUs) and Tensor Processing Units (TPUs) has become the norm for
training huge models. Deep learning places a high demand on parallel
computing, and these accelerators are built specifically to meet it.
High-Bandwidth Networking: High-bandwidth networking is vital for
scaling training over numerous GPUs or TPUs, as it guarantees that the
model's parameters can be synchronized efficiently across devices.
Memory Management: Memory optimization is essential when working with
very large models. Techniques such as memory mapping, swapping, and
model pruning reduce the memory footprint of LLMs and make it possible
to run them on hardware with limited resources.
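Of the techniques listed above, model pruning is the easiest to demonstrate in a few lines. The sketch below uses PyTorch's built-in pruning utilities on a single linear layer; the layer size and sparsity level are arbitrary choices for illustration.

# Pruning sketch: zero out low-magnitude weights in one linear layer.
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(1024, 1024)

# Zero out the 30% of weights with the smallest absolute value.
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Make the pruning permanent (removes the mask and re-parametrization).
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"Fraction of zeroed weights: {sparsity:.2f}")  # ~0.30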
Software Considerations
Deep Learning Frameworks: Well-known frameworks including TensorFlow,
PyTorch, and JAX provide the tools required for building and training
LLMs. These frameworks offer libraries and application programming
interfaces (APIs) that support distributed computing and large-scale
parallelism.
AutoML Tools: Automated Machine Learning (AutoML) tools help streamline
model training by automatically tuning hyperparameters and refining the
model architecture, as sketched below. AutoML can be very helpful when
fine-tuning LLMs for specific web-based tasks.
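A minimal sketch of this kind of automated hyperparameter search is shown below, using Optuna as one example of such tooling; the objective function is a stand-in for a real fine-tuning run and the search space is illustrative.

# Hyperparameter search sketch with Optuna (one example of AutoML-style tooling).
import optuna

def objective(trial):
    # In practice this would fine-tune the model and return a validation loss;
    # here a placeholder score stands in for that expensive computation.
    lr = trial.suggest_float("learning_rate", 1e-6, 1e-3, log=True)
    batch_size = trial.suggest_categorical("batch_size", [8, 16, 32])
    return abs(lr - 5e-5) + 0.001 * batch_size  # placeholder validation loss

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
print(study.best_params)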
The Encoder
The encoder is responsible for transforming the input sequence into a
continuous representation that captures both the meaning of the words
and their relationships to one another. The encoder consists of several
identical layers, each containing two main sub-layers: a multi-head
self-attention mechanism and a position-wise feed-forward network.
The Decoder
The decoder generates the output sequence one token at a time,
attending both to the encoder's output and to the tokens generated so
far. Like the encoder, the decoder is composed of a stack of identical
layers, but each decoder layer has an extra sub-layer that attends to
the encoder's output.
Stacking Layers
Both the encoder and the decoder are made up of several identical
layers, typically six or more, stacked on top of one another. Each
subsequent layer refines the representations learned by the previous
layer, enabling the model to capture increasingly complex connections
and characteristics of the input.
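A minimal sketch of such a stack, using PyTorch's built-in Transformer modules, is shown below; the dimensions and the six-layer depth follow the original Transformer but are otherwise illustrative.

# Stacking identical encoder layers with PyTorch's built-in modules.
import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(
    d_model=512,          # size of each token representation
    nhead=8,              # number of attention heads per layer
    dim_feedforward=2048, # inner dimension of the feed-forward sub-layer
    batch_first=True,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

tokens = torch.randn(2, 10, 512)   # (batch, sequence length, d_model)
print(encoder(tokens).shape)       # torch.Size([2, 10, 512])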
Query, Key, and Value Vectors: For each word in the input sequence, the
model creates three vectors: a query vector, a key vector, and a value
vector. These vectors are used to compute the attention score for each
pair of words.
Attention Score: The attention score is computed as the dot product of
the query and key vectors for each pair of words, followed by a softmax
function to normalize the results. The attention scores are then used
to weight the value vector of each word, which lets the model
concentrate on the words that are most relevant when making
predictions.
Benefits of Self-Attention
Parallel Processing: In contrast to RNNs, which process tokens
sequentially, self-attention lets the model process the entire input
sequence at once. This greatly speeds up training and allows the
Transformer to handle larger datasets.
Attention(Q, K, V) = softmax(QK^T / √d_k) V

Where:
Q, K, and V are the matrices of query, key, and value vectors, and d_k
is the dimensionality of the key vectors; dividing by √d_k keeps the
dot products in a range where the softmax stays well-behaved.
Through this equation, the model calculates the significance of each
word in the sequence, concentrating more on the words that matter and
less on those that do not.
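The equation above translates almost directly into code. The following sketch implements scaled dot-product attention in PyTorch with illustrative tensor shapes.

# Scaled dot-product attention, written directly from the equation above.
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # attention scores
    weights = F.softmax(scores, dim=-1)                # normalized weights
    return weights @ V                                 # weighted sum of values

Q = torch.randn(1, 10, 64)   # (batch, sequence length, d_k)
K = torch.randn(1, 10, 64)
V = torch.randn(1, 10, 64)
print(scaled_dot_product_attention(Q, K, V).shape)  # torch.Size([1, 10, 64])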
Multi-Head Attention
The Transformer uses multi-head attention, which applies the attention
mechanism several times in parallel with distinct sets of query, key,
and value projections. This allows the Transformer to capture different
kinds of relationships between words. The outputs of these attention
heads are then concatenated and passed through a linear projection
layer.
MultiHead(Q, K, V) = Concat(head_1, head_2, ..., head_h) W^O

Where:
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V) is the output of the i-th
attention head, and W^O is the output projection matrix.
This method lets the model capture many aspects of the relationships
between words simultaneously.
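In practice this is usually delegated to a library implementation. The sketch below uses PyTorch's nn.MultiheadAttention, which applies the per-head projections and the output projection W^O internally; the dimensions are illustrative.

# Multi-head self-attention via PyTorch's built-in module.
import torch
import torch.nn as nn

attention = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

x = torch.randn(2, 10, 512)          # (batch, sequence length, embed_dim)
output, weights = attention(x, x, x) # self-attention: Q = K = V = x
print(output.shape)                  # torch.Size([2, 10, 512])
print(weights.shape)                 # torch.Size([2, 10, 10]), averaged over heads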
Feed-Forward Network
After the multi-head attention mechanism has been applied, the output
is passed to a feed-forward neural network (FFN), which performs two
linear transformations with a ReLU activation function between them.
This allows the model to capture more complicated patterns and to
further process the attention outputs.
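A minimal sketch of this position-wise feed-forward network is shown below; the inner dimension of 2048 follows the original Transformer and is otherwise an arbitrary choice.

# Position-wise feed-forward network: two linear layers with a ReLU between,
# applied independently to each position in the sequence.
import torch
import torch.nn as nn

feed_forward = nn.Sequential(
    nn.Linear(512, 2048),   # first linear transformation
    nn.ReLU(),              # non-linearity
    nn.Linear(2048, 512),   # second linear transformation back to d_model
)

x = torch.randn(2, 10, 512)   # output of the attention sub-layer
print(feed_forward(x).shape)  # torch.Size([2, 10, 512])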
Batching and Parallelism: Because the Transformer analyzes complete
sequences in parallel, it can take advantage of large batch sizes,
which considerably speeds up training. To spread training over numerous
GPUs or TPUs, techniques such as data parallelism and model parallelism
are used, as sketched below.
Reddit Data: This dataset is especially useful for tasks that involve
user-generated material and conversational interactions. T5, which
includes around twenty percent Reddit-derived data in its training mix,
is efficient at creating contextually relevant, conversational text,
which makes it well suited to applications such as chatbots and social
media content creation. GPT likewise makes use of Reddit-derived data,
with roughly fifteen percent of its training data coming from Reddit,
which contributes to its capacity to handle informal, human-like
conversation.
Training Dataset Size: The size of the training corpus has a direct
bearing on how well a model generalizes across web tasks. GPT-3 was
trained on roughly 45 terabytes of raw web text, filtered down to
several hundred gigabytes of high-quality data, which contributes to
its ability to produce cohesive, high-quality text across a range of
domains. BERT, by contrast, was pre-trained on a far smaller corpus of
roughly 16 gigabytes (about 3.3 billion words from BooksCorpus and
English Wikipedia), yet it shows exceptional performance on tasks that
require contextual understanding. T5 was trained on the C4 corpus,
roughly 750 gigabytes of cleaned Common Crawl text curated to support
both comprehension and generation tasks.
Dataset
Because this is a demonstration, we will use a very small dataset of
straightforward text-completion pairs. The dataset contains input
phrases and their associated completions, and the model will learn
from these pairs.
Sample Dataset:

Input Text      Expected Completion
The sky is      blue
The sun is      bright
Grass is        green
Roses are       red
Violets are     blue
Sugar is        sweet
The cat is      sleeping
Birds are       flying
Implementation
import torch
from torch.optim import AdamW
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the pre-trained GPT-2 model and tokenizer to be fine-tuned
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

train_data = [
    ("The sky is", "blue"),
    ("The sun is", "bright"),
    ("Grass is", "green"),
    ("Roses are", "red"),
    ("Violets are", "blue"),
    ("Sugar is", "sweet"),
    ("The cat is", "sleeping"),
    ("Birds are", "flying"),
]

# Tokenize each (prompt, completion) pair as one sequence; for causal language
# modelling the labels are simply the input ids themselves
train_inputs = [
    tokenizer(f"{prompt} {completion}", return_tensors="pt").input_ids
    for prompt, completion in train_data
]
train_labels = [ids.clone() for ids in train_inputs]

# Training hyperparameters
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
optimizer = AdamW(model.parameters(), lr=5e-5)

def train_model(model, train_inputs, train_labels, epochs=3):
    model.train()
    for epoch in range(epochs):
        epoch_loss = 0
        for input_tensor, label_tensor in zip(train_inputs, train_labels):
            input_tensor = input_tensor.to(device)
            label_tensor = label_tensor.to(device)
            outputs = model(input_tensor, labels=label_tensor)
            loss = outputs.loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        print(f"Epoch {epoch + 1} Loss: {epoch_loss / len(train_inputs):.4f}")

train_model(model, train_inputs, train_labels)
Here, we train the model for 3 epochs with a small learning rate of 5e-5.
Each epoch iterates over the dataset, calculating the loss for each
input-output pair and updating the model’s weights.
# Test cases
test_cases = [
"The sky is",
"The sun is",
"Roses are"
]
Results
The model generates completions based on the test inputs. Below are
the outputs for each test case after fine-tuning the model:
Global Reach: LLMs have made it easier for web services to reach a
worldwide audience, especially through text translation and
multilingual support. Real-time translation capabilities make services
more inclusive by allowing users with different language backgrounds to
access and interact with web content.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019).
Language models are unsupervised multitask learners. OpenAI Blog.
Retrieved from https://openai.com/blog/better-language-models/
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training
of deep bidirectional transformers for language understanding.
Proceedings of the 2019 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language
Technologies, 4171–4186.
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P.,
Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss,
A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J.,
Winter, C., … Amodei, D. (2020). Language models are few-shot learners.
Advances in Neural Information Processing Systems, 33, 1877–1901.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y.,
Li, W., & Liu, P. J. (2020). Exploring the limits of transfer learning with a
unified text-to-text transformer. Journal of Machine Learning Research,
21(140), 1–67.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M.,
Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A robustly optimized
BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019).
ALBERT: A lite BERT for self-supervised learning of language
representations. arXiv preprint arXiv:1909.11942.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P.,
Rault, T., Louf, R., Funtowicz, M., & Rush, A. M. (2020). Transformers:
State-of-the-art natural language processing. Proceedings of the 2020
Conference on Empirical Methods in Natural Language Processing:
System Demonstrations, 38–45.
Clark, K., Khandelwal, U., Levy, O., & Manning, C. D. (2019). What does
BERT look at? An analysis of BERT’s attention. Proceedings of the 2019
Conference on Empirical Methods in Natural Language Processing and
the 9th International Joint Conference on Natural Language Processing
(EMNLP-IJCNLP), 4463–4473.
Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R., & Le, Q. V.
(2019). XLNet: Generalized autoregressive pretraining for language
understanding. Advances in Neural Information Processing Systems, 32,
5754–5764.
Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O.,
Stoyanov, V., & Zettlemoyer, L. (2020). BART: Denoising
sequence-to-sequence pre-training for natural language generation,
translation, and comprehension. Proceedings of the 58th Annual Meeting
of the Association for Computational Linguistics, 7871–7880.
He, P., Liu, X., Gao, J., & Chen, W. (2021). DeBERTa: Decoding-enhanced
BERT with disentangled attention. International Conference on Learning
Representations.
Sun, C., Qiu, X., Xu, Y., & Huang, X. (2019). How to fine-tune BERT for text
classification? Proceedings of the China National Conference on Chinese
Computational Linguistics, 194–206.
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving
language understanding by generative pre-training. OpenAI Blog.
Retrieved from https://openai.com/research/pretraining-transformer-model