What Are Large Language Models
Large language models (LLMs) are a recent advance in deep learning for working with human language. Many impressive use cases of LLMs have already been demonstrated. A large language model is a trained deep-learning model that understands and generates text in a human-like fashion. Behind the scenes, it is a large transformer model that does all the magic.
In this post, you will learn about the structure of large language models and how they work.
Overview
This post is divided into three parts.
In 2017, Vaswani et al. published the paper “Attention Is All You Need,” which introduced the transformer model. It is based on the attention mechanism. In contrast to recurrent neural networks, which process one word at a time, the attention mechanism lets the model see the entire sentence (or even the paragraph) at once. This helps the transformer model understand the context of a word better. Many state-of-the-art language processing models are based on transformers.
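As a concrete illustration, below is a minimal sketch of scaled dot-product attention, the core operation of the transformer. The names Q, K, and V follow the “Attention Is All You Need” paper; the PyTorch code and the toy dimensions are illustrative assumptions, not the exact implementation from the paper.

import torch

def attention(Q, K, V):
    # Compare every token with every other token in the sequence at once
    scores = Q @ K.transpose(-2, -1) / K.shape[-1] ** 0.5
    # Turn the scores into attention weights that sum to 1 for each token
    weights = torch.softmax(scores, dim=-1)
    # Each output vector is a weighted mix of all the value vectors
    return weights @ V

x = torch.randn(7, 16)           # a toy sequence: 7 tokens, 16-dimensional vectors
print(attention(x, x, x).shape)  # torch.Size([7, 16]) -- one context-aware vector per token

Because the weights are computed over the whole sequence in one step, every token can draw on every other token, which is what lets the model capture context that a word-by-word recurrent network would struggle with.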
To process a text input with a transformer model, you first need to tokenize it into a sequence of tokens, which may be words or subwords. These tokens are then encoded as numbers and converted into embeddings, which are vector-space representations of the tokens that preserve their meaning. Next, the encoder in the transformer transforms the embeddings of all the tokens into a context vector.
Below is an example of a text string, its tokenization, and the vector embedding. Note that a token can be a subword; for example, the word “nosegay” in the text is tokenized into “nose” and “gay”.
As she said this, she looked down at her hands, and was surprised to find that she had put on one of the rabbit's little gloves while she was talking. "How can I have done that?" thought she, "I must be growing small again." She got up and went to the table to measure herself by it, and found that, as nearly as she could guess, she was now about two feet high, and was going on shrinking rapidly: soon she found out that the reason of it was the nosegay she held in her hand: she dropped it hastily, just in time to save herself from shrinking away altogether, and found that she was now only three inches high.
['As', ' she', ' said', ' this', ',', ' she', ' looked', ' down', ' at', ' her', ' hands', ',',
 ' and', ' was', ' surprised', ' to', ' find', ' that', ' she', ' had', ' put', ' on', ' one',
 ' of', ' the', ' rabbit', "'s", ' little', ' gloves', ' while', ' she', ' was', ' talking', '.',
 ' "', 'How', ' can', ' I', ' have', ' done', ' that', '?"', ' thought', ' she', ',', ' "', 'I',
 ' must', ' be', ' growing', ' small', ' again', '."', ' She', ' got', ' up', ' and', ' went',
 ' to', ' the', ' table', ' to', ' measure', ' herself', ' by', ' it', ',', ' and', ' found',
 ' that', ',', ' as', ' nearly', ' as', ' she', ' could', ' guess', ',', ' she', ' was', ' now',
 ' about', ' two', ' feet', ' high', ',', ' and', ' was', ' going', ' on', ' shrinking',
 ' rapidly', ':', ' soon', ' she', ' found', ' out', ' that', ' the', ' reason', ' of', ' it',
 ' was', ' the', ' nose', 'gay', ' she', ' held', ' in', ' her', ' hand', ':', ' she',
 ' dropped', ' it', ' hastily', ',', ' just', ' in', ' time', ' to', ' save', ' herself',
 ' from', ' shrinking', ' away', ' altogether', ',', ' and', ' found', ' that', ' she', ' was',
 ' now', ' only', ' three', ' inches', ' high', '.']
[ 2.49  0.22 -0.36 -1.55  0.22 -2.45  2.65 -1.6  -0.14  2.26
 -1.26 -0.61 -0.61 -1.89 -1.87 -0.16  3.34 -2.67  0.42 -1.71
  ...
  2.91 -0.77  0.13 -0.24  0.63 -0.26  2.47 -1.22 -1.67  1.63
  1.13  0.03 -0.68  0.8   1.88  3.05 -0.82  0.09  0.48  0.33]
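You can reproduce a tokenization like the one above in a few lines of code. Below is a minimal sketch using the Hugging Face transformers library with the pretrained GPT-2 tokenizer and model. GPT-2 is an illustrative assumption; the post does not say which tokenizer produced the output above, and a different tokenizer will split the text differently.

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

text = "the reason of it was the nosegay she held in her hand"
ids = tokenizer.encode(text)                 # each token encoded as an integer
print(tokenizer.convert_ids_to_tokens(ids))  # subword tokens; GPT-2 marks a leading space with 'Ġ'
emb = model.get_input_embeddings()           # the learned embedding lookup table
print(emb.weight[ids].shape)                 # one embedding vector per token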
The context vector is like the essence of the entire input. Using this vector, the transformer decoder generates output based on clues. For instance, you can provide the original input as a clue and let the transformer decoder produce the next word that naturally follows. Then you can reuse the same decoder, but this time the clue is the previously generated word appended to the input. This process can be repeated to create an entire paragraph, starting from a leading sentence.
Figure: Transformer architecture
This process is called autoregressive generation. This is how a large language model works, except that such a model is a transformer model that can take a very long input text, its context vector is large enough to handle very complex concepts, and its encoder and decoder have many layers.
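Below is a minimal sketch of the autoregressive loop, again using a pretrained GPT-2 model from the Hugging Face transformers library. The choice of GPT-2 and of greedy decoding (always taking the single most likely token) are illustrative assumptions; real systems usually sample with more sophisticated strategies.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer.encode("As she said this, she looked down at", return_tensors="pt")
with torch.no_grad():
    for _ in range(20):                    # generate 20 more tokens, one at a time
        logits = model(ids).logits         # a score for every token in the vocabulary
        next_id = logits[0, -1].argmax()   # greedy decoding: take the most likely next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)  # append it and feed everything back in
print(tokenizer.decode(ids[0]))

Each pass through the loop feeds the entire sequence, including the tokens generated so far, back into the model; this is exactly the repeated use of the previous output as the next clue described above.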
According to Claude Shannon’s influential paper “Prediction and Entropy of Printed English,” the English language has an entropy of about 2.1 bits per letter, despite using 27 letters (including spaces). If the letters were used uniformly at random, the entropy would be about 4.8 bits per letter. Because actual English carries so much less information per letter than random text, it is easier to predict what comes next in a human-language text. Machine learning models, and transformer models in particular, are adept at making such predictions.
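The 4.8-bit figure is simply the entropy of a uniform distribution over 27 symbols, which you can verify directly:

import math

# Entropy of a uniform distribution over 27 symbols (26 letters plus space)
print(math.log2(27))  # about 4.75 bits per letter, versus Shannon's ~2.1 for real English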
By repeating this process, a transformer model can generate an entire passage word by word. But what is grammar, as seen by a transformer model? Essentially, grammar describes how words are used in a language, categorizing them into parts of speech and requiring a specific order within a sentence. Even so, it is difficult to enumerate all the rules of grammar. In reality, the transformer model does not store these rules explicitly; it acquires them implicitly from examples. The model may learn more than just grammar rules, extending to the ideas presented in those examples, but for that the transformer model must be large enough.
For example, the GPT-3 model behind the ChatGPT service was trained on massive amounts of text data from the internet, including books, articles, websites, and various other sources. During the training process, the model learned the statistical relationships between words, phrases, and sentences, allowing it to generate coherent and contextually relevant responses when given a prompt or query. Distilling this vast amount of text, the GPT-3 model can therefore understand multiple languages and possesses knowledge of various topics. That is why it can produce text in different styles. While you may be amazed that a large language model can perform translation, text summarization, and question answering, it is not surprising if you consider these tasks as special “grammars” that match the leading text, a.k.a. the prompt.
Summary
Multiple large language models have been developed. Examples include GPT-3 and GPT-4 from OpenAI, LLaMA from Meta, and PaLM 2 from Google. These are models that can understand language and generate text. In this post, you learned what large language models are, how the transformer architecture underlies them, and how they generate text autoregressively.