
The Illustrated BERT, ELMo, and co.

(How NLP Cracked Transfer Learning)


What is BERT used for?
• BERT can be used on a wide variety of language tasks:

• Can determine how positive or negative a movie’s reviews are. (Sentiment Analysis)

• Helps chatbots answer your questions. (Question answering)

• Predicts your text when writing an email (Gmail). (Text prediction)

• Can write an article about any topic with just a few sentence inputs. (Text generation)

• Can quickly summarize long legal contracts. (Summarization)

• Can differentiate words that have multiple meanings (like ‘bank’) based on the surrounding text.
(Polysemy resolution)
Example of BERT
• BERT has helped Google better surface (English) results for nearly all searches since November 2020.

• Here’s an example of how BERT helps Google better understand a specific search, such as a query about getting a prescription filled for someone else:
  • Pre-BERT, Google surfaced general information about getting a prescription filled.
  • Post-BERT, Google understands that “for someone” relates to picking up a prescription for someone else, and the search results now help to answer that.
• BERT is a deep bidirectional, unsupervised language representation, pre-trained using a plain text corpus.

• To help bridge the gap in labeled data, researchers have developed various techniques for training general-purpose language representation models using the enormous piles of unannotated text on the web (this is known as pre-training).

Example:
The woman went to the store and bought a _____ of shoes.
Need for BERT
Bidirectional Contextual Understanding:
• Early Transformer Models: Traditional transformer models like OpenAI's
GPT (Generative Pre-trained Transformer) were unidirectional, processing
text either left-to-right (causal) or right-to-left. This means, for example, if
you're trying to predict a word in the middle of a sentence, the model can
only look at the words before (left-to-right) or after (right-to-left), but not
both at the same time.
• BERT's Innovation: BERT introduced bidirectional encoding—it considers
both left and right contexts simultaneously. This results in much richer
contextual representations of words since the model understands a word
based on the entire sentence, not just the surrounding words in one
direction.
Pre-training with Specific Tasks (MLM and NSP):

•Masked Language Model (MLM): Instead of training in a purely generative way (predicting the next
word in a sequence), BERT uses a masked language modeling objective. This means some words in
the input sentence are randomly masked, and the model learns to predict those masked words by
considering the entire context (both left and right).
This task allows BERT to learn deep bidirectional representations, which earlier transformer-based
models lacked.

•Next Sentence Prediction (NSP): In addition to word-level predictions, BERT also introduced the next
sentence prediction task. This helps BERT understand the relationships between sentences, which is
crucial for tasks like question answering and natural language inference (NLI).
Previous transformer models didn’t focus on sentence-to-sentence relationships during pre-training.
Transformers Were Mainly Encoder-Decoder Models
BERT as a Pure Encoder Model: BERT is based purely on the
transformer encoder. This specialization helps it focus entirely on
creating powerful sentence representations, making it ideal for a range
of tasks (from classification to sequence labeling) without the need for
a decoder.
Transfer Learning in NLP
BERT's Breakthrough: BERT popularized the concept of pre-training on
large datasets and then fine-tuning for downstream tasks. This
dramatically reduced the amount of labeled data and training time
needed for specific tasks like sentiment analysis, named entity
recognition, or question answering.
What is a Word Embedding?
• Word embeddings have been a major force in how leading NLP models deal with language.

• Methods like Word2Vec and GloVe have been widely used for such tasks.

• Word2Vec showed that we can use a vector (a list of numbers) to properly represent words in a way
that captures semantic or meaning-related relationships.

• (e.g. the ability to tell if words are similar, or opposites, or that a pair of words like “Stockholm” and
“Sweden” have the same relationship between them as “Cairo” and “Egypt” have between them).

• As well as syntactic, or grammar-based, relationships (e.g. the relationship between “had” and “has”
is the same as that between “was” and “is”).
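
These analogy relationships can be checked directly with pre-trained vectors. Below is a minimal sketch, assuming the gensim library is installed and can download its "glove-wiki-gigaword-100" vectors; most_similar performs the underlying vector arithmetic (e.g., Sweden - Stockholm + Cairo ≈ Egypt).

```python
# Minimal sketch: probing analogy relationships in pre-trained word vectors.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # downloads 100-d GloVe vectors on first use

# Semantic analogy: "Stockholm" is to "Sweden" as "Cairo" is to ...?
print(vectors.most_similar(positive=["sweden", "cairo"], negative=["stockholm"], topn=3))

# Syntactic analogy: "had" is to "has" as "was" is to ...?
print(vectors.most_similar(positive=["has", "was"], negative=["had"], topn=3))
```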
Pretrained Word Embeddings
• The field quickly realized it’s a great idea to use embeddings that were pre-trained
on vast amounts of text data instead of training them alongside the model on what
was frequently a small dataset.

• GloVe was created at Stanford University. It provides pre-trained dense vectors learned from corpora of around 6 billion tokens, covering a large vocabulary of English words along with general-use characters like commas, braces, and semicolons.

• GloVe: GloVe is an unsupervised learning algorithm for obtaining vector representations for words.

• Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
GloVe
• Both of Word2Vec’s architectures are predictive ones.

• They ignore the fact that some context words occur more often than others.

• They only take into consideration the local context and hence fail to capture the global context.

• The GloVe model is trained on an aggregated global word-word co-occurrence matrix built from a given collection of text documents.

• This co-occurrence matrix is decomposed to form denser and more expressive vector representations.
ELMo: Context Matters
• If we’re using this GloVe representation, then the word “stick” would be represented by the same embedding vector (say, 200 dimensions) no matter what the context was.

• “stick” has multiple meanings depending on where it’s used.

• Why not give it an embedding based on the context it’s used in – to both capture the word meaning in that context as well as other contextual information?

• And so, contextualized word embeddings were born.
ELMo: Context Matters

• Instead of using a fixed embedding for each word, Embeddings from Language Models (ELMo) looks at the
entire sentence before assigning each word in it an embedding.

• It uses a bi-directional LSTM trained on a specific task to be able to create those embeddings.

• ELMo provided a significant step towards pre-training in the context of NLP.

• The ELMo LSTM would be trained on a massive dataset in the language of our dataset, and then we can use
it as a component in other models that need to handle language.
ELMo uses Language Modelling

• ELMo gained its language understanding from being trained to predict the next word in a sequence of words – a task called Language Modeling.

• This is convenient because we have vast amounts of text data that such a model can learn from without needing labels.
BERT vs. word2vec
• Consider the two example sentences:

• “We went to the river bank.”
• “I need to go to the bank to make a deposit.”

• Word2Vec generates the same single vector for the word bank for both of the sentences.

• BERT will generate two different vectors for the word bank being used in two different contexts.

• One vector will be similar to vectors for words like money and cash; the other will be similar to vectors for words like beach and coast.
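
This difference can be seen directly by comparing the vectors BERT produces for “bank” in the two sentences above. The sketch below assumes the Hugging Face transformers library and a pre-trained bert-base-uncased checkpoint; a static word2vec embedding would give identical vectors (cosine similarity 1.0), while BERT’s two vectors differ.

```python
# Minimal sketch: contextual "bank" vectors from BERT for the two example sentences.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Final-layer hidden state of the token "bank" in the given sentence.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]

v_river = bank_vector("We went to the river bank.")
v_money = bank_vector("I need to go to the bank to make a deposit.")

cos = torch.nn.functional.cosine_similarity(v_river, v_money, dim=0)
print(f"cosine similarity between the two 'bank' vectors: {cos.item():.3f}")  # noticeably below 1.0
```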
Introduction to BERT
• Fundamentally, BERT is a stack of Transformer encoder layers (Vaswani et
al., 2017) that consist of multiple self-attention “heads”.

• For every input token in a sequence, each head computes key, value, and
query vectors, used to create a weighted representation.

• The outputs of all heads in the same layer are combined and run through a
fully connected layer.

• Each layer is wrapped with a skip connection and followed by layer normalization.
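
To make the description above concrete, here is an illustrative sketch of a single self-attention head with random (untrained) weights; BERT Base uses 12 such heads of size 64 per layer, concatenating their outputs before the fully connected layer.

```python
# Toy single-head self-attention: query/key/value projections and a weighted representation.
import torch
import torch.nn.functional as F

seq_len, d_model, d_head = 5, 768, 64            # BERT Base: hidden size 768, 12 heads of size 64
x = torch.randn(seq_len, d_model)                # token representations entering the layer

W_q, W_k, W_v = (torch.randn(d_model, d_head) for _ in range(3))   # random weights, illustration only
Q, K, V = x @ W_q, x @ W_k, x @ W_v              # per-token query, key, value vectors

scores = Q @ K.T / d_head ** 0.5                 # every token scores every other token
weights = F.softmax(scores, dim=-1)              # attention weights, each row sums to 1
head_output = weights @ V                        # weighted combination of value vectors
print(head_output.shape)                         # (seq_len, d_head); head outputs are then concatenated
```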
BERT (based on the Transformer architecture)
• BERT stands for Bidirectional Encoder Representations from Transformers.

• It is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context.

• As a result, the pre-trained BERT model can be fine-tuned with just one
additional output layer to create state-of-the-art models for a wide range
of NLP tasks.
What is BERT?
• BERT is based on the Transformer architecture.

• BERT is pre-trained on a large corpus of unlabelled text, including the entire English Wikipedia (that’s 2,500 million words!) and BookCorpus (800 million words).

• BERT is a “deeply bidirectional” model.


Example: I went to the bank to deposit money.

In BERT's deeply bidirectional model: At every layer, the word "bank" is processed with knowledge
of both its left context ("I went to the") and its right context ("to deposit money"). This leads to a
more accurate understanding at each stage of processing, reducing the chance of early-stage
confusion.
BERT introduction
• BERT grew out of Google’s research on Transformers.

• The Transformer is the part of the model that gives BERT its increased capacity for understanding context and ambiguity in language.

• The Transformer does this by processing any given word in relation to all other words in a sentence, rather than processing them one at a time.

• By looking at all surrounding words, the Transformer allows the BERT model to understand the full context of the word, and therefore better understand searcher intent.
BERT Example

• The bidirectionality of a model is important for truly understanding the meaning of a language.

• Let’s see an example to illustrate this.

• There are two sentences in this example, and both of them involve the word “bank”.
Example: "Bat"
•Sentence 1: "The baseball player swung the bat."
•Sentence 2: "A bat flew into the cave at night."
Context:
•In sentence 1, BERT uses "baseball player swung" (left) and "." (right) to understand "bat" as a piece of sports equipment.
•In sentence 2, BERT uses "A" (left) and "flew into the cave at night" (right) to understand "bat" as a flying mammal.
Example: "Rose"
•Sentence 1: "She rose from her seat to greet the guests."
•Sentence 2: "He gave her a beautiful red rose on Valentine's Day."
Context:
•In sentence 1, BERT uses "She" (left) and "from her seat to greet" (right) to understand "rose" as the past tense of "rise".
•In sentence 2, BERT uses "He gave her a beautiful red" (left) and "on Valentine's Day" (right) to understand "rose" as a
flower.
Example: "Fair"
•Sentence 1: "The judge's decision was fair and impartial."
•Sentence 2: "We're going to the county fair this weekend."
•Sentence 3: "She has fair skin and blonde hair."
Context:
•In sentence 1, BERT uses "judge's decision was" (left) and "and impartial" (right) to understand "fair" as just or unbiased.
•In sentence 2, BERT uses "We're going to the county" (left) and "this weekend" (right) to understand "fair" as an event or
exhibition.
•In sentence 3, BERT uses "She has" (left) and "skin and blonde hair" (right) to understand "fair" as light-colored.
Feature comparison: Transformer vs. BERT

• Architecture: the Transformer is encoder-decoder; BERT is encoder-only (derived from the Transformer).
• Primary Use: the Transformer targets sequence-to-sequence tasks (e.g., translation); BERT targets understanding tasks (e.g., classification).
• Directionality: the Transformer has a bidirectional encoder and a unidirectional decoder; BERT is fully bidirectional.
• Training Objective: the original Transformer paper specified no pre-training; BERT is pre-trained with MLM and NSP.
• Contextualization: the Transformer uses both directions in the encoder and left-to-right in the decoder; BERT uses full bidirectional context over the entire input.
• Downstream Applications: the Transformer is used for machine translation and text generation; BERT for text classification and question answering.
How BERT achieves bidirectionality?

1. Self-Attention Mechanism:

•BERT uses the self-attention mechanism of the Transformer encoder, where each token
(word or subword) in the input sequence can attend to all other tokens, both before and
after it.

•This means that for every word, the model looks at the entire sequence of words around
it, gathering context from both directions.

•For example, if you're trying to understand the word "bank" in "I went to the bank to
deposit money," BERT will consider not only the words before it ("I went to the") but also
the words after it ("to deposit money") when determining the meaning.
How BERT achieves bidirectionality?

2. Masked Language Modeling (MLM):


• During BERT’s pre-training, a percentage of tokens (typically 15%) in the input are
randomly masked, and the model is trained to predict these masked tokens.

• Importantly, because BERT can see the entire sentence (except for the masked tokens),
it uses context from both the left and right of the masked word to make its prediction.

• For example, if the sentence is “The [MASK] chased the mouse,” BERT will use the
context from both directions to infer that the masked word is likely “cat.”

• This contrasts with models like GPT (Generative Pre-trained Transformer), which are
unidirectional, meaning they predict the next word by only looking at the words to the
left (left-to-right).
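
The masked-word prediction described above can be tried with a pre-trained checkpoint. A minimal sketch using the Hugging Face fill-mask pipeline (assumed installed; exact predictions and scores depend on the checkpoint):

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The [MASK] chased the mouse."):
    print(prediction["token_str"], round(prediction["score"], 3))
# Animal words such as "cat" are expected near the top, inferred from both left and right context.
```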
Bidirectional
• As opposed to directional models, which read the text input
sequentially (left-to-right or right-to-left), the Transformer encoder
reads the entire sequence of words at once.

• Therefore, it is considered bidirectional, though it would be more accurate to say that it’s non-directional.

• This characteristic allows the model to learn the context of a word based on all of its surroundings (left and right of the word).
Why is this non-directional approach so powerful?

• Pre-trained language representations can either be context-free or context-based.

• Context-based representations can then be unidirectional or bidirectional.

• Context-free models like word2vec generate a single word embedding representation (a vector of numbers) for each word in the vocabulary.

• For example, the word “bank” would have the same context-free representation in “bank
account” and “bank of the river.”
Why is this non-directional approach so powerful?

• On the other hand, context-based models generate a representation of each word that is based on the other words in the sentence.

• For example, in the sentence “I accessed the bank account,” a unidirectional contextual model would represent “bank” based on “I accessed the” but not “account.”

• However, BERT represents “bank” using both its previous and next context — “I
accessed the … account” — starting from the very bottom of a deep neural
network, making it deeply bidirectional.
Why BERT use Only Encoder of Transformer?
• BERT makes use of Transformer, an attention mechanism that learns
contextual relations between words (or sub-words) in a text.

• In its vanilla form, the Transformer includes two separate mechanisms: an encoder that reads the text input and a decoder that produces a prediction for the task.

• Since BERT’s goal is to generate a language model, only the encoder mechanism is necessary.
Conventional Workflow for BERT
• The conventional workflow for BERT consists of two stages:
• pre-training
• fine-tuning.

• Pre-training uses two self-supervised tasks: masked language modeling (MLM, prediction of randomly masked input tokens) and next sentence prediction (NSP, predicting if two input sentences are adjacent to each other).
• In fine-tuning for downstream applications, one or more fully connected
layers are typically added on top of the final encoder layer.
•Task-Specific Layer (Learned from scratch): Fully trained from random initialization to perform a
specialized task (e.g., identifying answer spans in Q&A).
•Pretrained Layers: Slightly adjusted for the specific task (fine-tuned).
BERT
• BERT is a model that broke several records for how well models can handle language-based tasks.

• Soon after the release of the paper describing the model, the team also open-sourced the code of
the model and made available for download versions of the model that were already pre-trained on
massive datasets.

• This is a momentous development since it enables anyone building a machine learning model
involving language processing to use this powerhouse as a readily-available component – saving the
time, energy, knowledge, and resources that would have gone to training a language-processing
model from scratch.
Task specific-Models
The BERT paper shows several ways to use BERT for different tasks.
Model Architecture
• BERT’s architecture significantly scales up from the original Transformer
model, providing it with the capacity to understand language more deeply.
• BERT Model Sizes:
  • BERT Base: 12 layers (Transformer blocks), 12 attention heads, and a hidden size of 768.
  • BERT Large: 24 layers (Transformer blocks), 16 attention heads, and a hidden size of 1024.
The paper presents two model sizes for BERT:

• BERT BASE – comparable in size to the OpenAI Transformer, to compare performance.

• BERT LARGE – a ridiculously huge model which achieved the state-of-the-art results reported in the paper.

• BERT is basically a trained Transformer encoder stack. The Illustrated Transformer post explains the Transformer model – a foundational concept for BERT and the concepts we’ll discuss next.
Model Architecture
Why Did BERT Scale Up?
• Deeper Model (More Encoder Layers): Having more layers allows BERT to
learn richer and more complex representations of language. Each layer
processes the input more thoroughly, enabling the model to capture long-
range dependencies and more abstract features.

• More Attention Heads: BERT can look at different aspects of the sentence at the same time. This enhances its ability to capture subtle relationships between words.

• Larger Feed-Forward Networks: Increasing the size of these networks gives the model greater capacity to learn detailed representations and perform complex transformations.
Input Representation

• Each word in the input is first tokenized into word pieces (Wu et al., 2016), and then three embedding layers (token, position, and segment) are combined to obtain a fixed-length vector.

• The special token [CLS] is used for classification predictions, and [SEP] separates input segments.
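
A short sketch of this input pipeline, using a pre-trained WordPiece tokenizer from the transformers library (assumed available; the splits shown in the comments are typical, but the exact pieces depend on the vocabulary):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Rare or long words are broken into word pieces marked with "##".
print(tokenizer.tokenize("BERT gives contextual embeddings"))
# e.g. ['bert', 'gives', 'contextual', 'em', '##bed', '##ding', '##s']

# Encoding also adds the special [CLS] and [SEP] tokens described above.
encoded = tokenizer("BERT gives contextual embeddings")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# e.g. ['[CLS]', 'bert', 'gives', 'contextual', 'em', '##bed', '##ding', '##s', '[SEP]']
```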
[CLS] Token (Classification Token)
• The [CLS] token is a special token added to the beginning of every
input sequence in BERT.

• It plays a critical role in tasks where you need a single output for
the entire sequence, such as classification tasks (e.g., sentiment
analysis, spam detection, etc.).
Why use [CLS]?
• In many tasks, you want to make a single prediction based on an entire input sequence.

• For example, if you’re classifying the sentiment of a sentence, you don’t just care about the meaning of individual words; you want a summary of the entire sequence to make your decision.

• The [CLS] token serves as that summary.


How [CLS] Works?
• When you input a sentence or pair of sentences into BERT, the [CLS] token is
the first token of the sequence.

• During training, the model learns to use the hidden state (internal
representation) associated with this [CLS] token to produce an output that
summarizes the meaning of the entire input.

• After passing the input sequence through multiple Transformer layers (the
architecture used by BERT), the hidden state of the [CLS] token captures the
aggregate meaning of the sequence.

• This output is typically passed to a classifier (such as a fully connected layer) to make predictions.
Example
• Let’s say you have the sentence:
• "BERT is a great model for NLP tasks."

• When this sentence is passed into BERT, it is tokenized as:
• "[CLS] BERT is a great model for NLP tasks. [SEP]"

• The [CLS] token is treated just like any other token (such as "BERT" or "great") and goes
through all the layers of the Transformer.

• However, after the final layer, the hidden state associated with [CLS] represents a global
summary of the sentence.

• This hidden state is then used for classification tasks, like predicting whether the sentence has
positive or negative sentiment.
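
As a hedged sketch of these mechanics (assuming the transformers library), the final-layer hidden state at position 0, i.e. the [CLS] slot, can be pulled out and passed to a small classification layer; the linear head below is untrained and purely illustrative.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT is a great model for NLP tasks.", return_tensors="pt")
with torch.no_grad():
    last_hidden = model(**inputs).last_hidden_state          # (1, seq_len, 768)

cls_vector = last_hidden[:, 0, :]                            # hidden state of the [CLS] token
classifier = torch.nn.Linear(768, 2)                         # e.g. positive vs. negative sentiment
logits = classifier(cls_vector)
print(torch.softmax(logits, dim=-1))                         # untrained head, so roughly uniform
```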
In Sentence Classification:
• The [CLS] token's hidden state is typically passed to a classification layer (a
simple neural network layer) to make a prediction.

• For example, in sentiment analysis, the hidden state of [CLS] could be used to
predict whether a sentence is positive, negative, or neutral.

• In Question Answering:
• In tasks like Question Answering (QA), BERT also uses the [CLS] token, but
instead of classifying a sentence, the [CLS] token can represent whether the
input contains the correct answer to a question.
[SEP] Token (Separator Token)

• The [SEP] token is a marker used to separate different parts of the input sequence, especially when the task involves sentence pairs.

• It helps BERT differentiate between two segments of text, allowing it to better model the relationship between them.
Why use [SEP]?
• BERT is designed to handle tasks that involve understanding the
relationship between two sentences or text segments.

• The [SEP] token is used to clearly separate these segments.

• For example, in tasks like sentence pair classification (e.g., determining if one sentence logically follows another) or question-answering tasks, BERT needs to know where one sentence ends and the next begins. [SEP] provides that boundary.
Why use [SEP]?

• When BERT is processing pairs of sentences (or question-answer pairs), the [SEP] token is placed between them and at the end of the sequence.

• This helps BERT understand which tokens belong to which segment and model the relationship between them.
Segment Embeddings

• BERT uses segment embeddings (also called token type embeddings) to distinguish tokens in different sentences.

• There are two segment embeddings, A and B:
• Tokens from the first sentence are assigned the segment embedding A.
• Tokens from the second sentence are assigned the segment embedding B.

• The [SEP] token helps mark where the first sentence ends and the
second sentence begins.
Example
• Let’s say you have two sentences:
• Sentence 1: "The sky is blue."
• Sentence 2: "It is a sunny day."
• The input to BERT would look like this:
• "[CLS] The sky is blue [SEP] It is a sunny day [SEP]"
• Here, the [SEP] token is used twice:
• Once to separate the first sentence from the second sentence.
• Once at the end of the input sequence to mark the end of the input.
• During training, BERT uses these [SEP] tokens to understand that the two
sentences are separate, and to help it learn how to model the relationship
between them.
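
A minimal sketch of how this exact pair is encoded, assuming a pre-trained BERT tokenizer from the transformers library; the token_type_ids are the indices that select segment embedding A (0) or B (1):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("The sky is blue.", "It is a sunny day.")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'the', 'sky', 'is', 'blue', '.', '[SEP]', 'it', 'is', 'a', 'sunny', 'day', '.', '[SEP]']
print(encoded["token_type_ids"])
# [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]   (segment A up to the first [SEP], segment B after it)
```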
In Next Sentence Prediction (NSP):
• Next Sentence Prediction (NSP) is one of the two pretraining tasks BERT
uses (the other being Masked Language Modeling, or MLM).

• In NSP, BERT is trained to predict whether one sentence logically follows another. During training, pairs of sentences are provided to the model:
• Positive pairs: Where the second sentence follows the first one in the original
text.
• Negative pairs: Where the second sentence is randomly selected and does not
follow the first one.

• The [SEP] token separates the two sentences, helping BERT understand the
distinction between the two sentences, and allowing it to learn how to model
whether the second sentence is likely to follow the first.
In Question Answering:

• In QA tasks, the [SEP] token separates the question and the passage.
• Question: "What color is the sky?"
• Passage: "The sky is blue on a sunny day."
• The input to BERT would look like this:
• "[CLS] What color is the sky [SEP] The sky is blue on a sunny day
[SEP]"
• The [SEP] token helps BERT know where the question ends and
the passage begins.
Summary of Roles:

• [CLS] – Position: always the first token. Role: provides a global summary of the input sequence; its hidden state is used for classification tasks (e.g., sentiment analysis).
• [SEP] – Position: between sentences and at the end. Role: separates different segments (sentences, question-passage pairs) and helps BERT distinguish between two sentences or pieces of text.
Example: Sentence Classification

• The most straightforward way to use BERT is to use it to classify a single piece of text: a classifier sits on top of BERT’s output for the [CLS] position.

• To train such a model, you mainly have to train the classifier, with minimal changes happening to the BERT model during the training phase.

• This training process is called Fine-Tuning and has roots in Semi-supervised Sequence Learning and
ULMFiT(Universal Language Model Fine-tuning).
Model Architecture
• Both BERT model sizes have a large number of encoder layers (which the
paper calls Transformer Blocks) – twelve for the Base version, and twenty-
four for the Large version.

• These also have a larger hidden size (768 and 1024 units respectively) and more attention heads (12 and 16 respectively) than the default configuration in the reference implementation of the Transformer in the initial paper (6 encoder layers, 512 hidden units, and 8 attention heads).
Model Inputs
Model Outputs

• Each position outputs a vector of size hidden_size (768 in BERT Base).

• For the sentence classification example we’ve looked at above, we focus on the output of only the first position (the position we passed the special [CLS] token to).

• That vector can now be used as the input for a classifier of our choosing. The paper achieves great results by just using a single-layer neural network as the classifier.
Outputs: How does one predict output for two different tasks
simultaneously?
• The answer is: by using different FFNN + Softmax layers built on top of the output(s) from the last encoder, corresponding to the desired input tokens.

• We will refer to the outputs from last encoder as final states.

• The first input token is always a special classification [CLS] token.

• The final state corresponding to this token is used as the aggregate sequence representation
for classification tasks and used for the Next Sentence Prediction where it is fed into a FFNN +
Softmax layer that predicts probabilities for the labels “IsNext” or “NotNext”.

• The final states corresponding to [MASK] tokens are fed into a FFNN + Softmax to predict the original (masked) word from our vocabulary.
Masked Language Model
• BERT makes use of a novel technique called Masked LM (MLM):

• it randomly masks words in the sentence and then it tries to predict them.

• Masking means that the model looks in both directions and it uses the full context of
the sentence, both left and right surroundings, to predict the masked word.

• Unlike the previous language models, it takes both the previous and next tokens into
account at the same time.

• The existing combined left-to-right and right-to-left LSTM based models were missing
this “same-time part”.
What is Next Sentence Prediction?

• NSP (Next Sentence Prediction) is used to help BERT learn about relationships between sentences by predicting if a given sentence follows the previous sentence or not.

• Example:
  • "Paul went shopping." / "He bought a new shirt." (correct sentence pair)
  • "Ramona made coffee." / "Vanilla ice cream cones for sale." (incorrect sentence pair)

• In training, 50% correct sentence pairs are mixed in with 50% random sentence pairs to help BERT improve next sentence prediction accuracy.
For Finetuning:
Add a Task-Specific Output Layer
• For fine-tuning, a task-specific layer is added on top of BERT’s architecture,
depending on the nature of the downstream task. BERT’s base architecture
remains unchanged.
• Example tasks and output layers:
• Text Classification (e.g., sentiment analysis): A simple linear layer (fully connected
layer) is added on top of the BERT model. This layer uses the [CLS] token’s hidden
state (the representation of the entire input sequence) to make predictions.

• Question Answering: For tasks like SQuAD (a question-answering dataset), BERT outputs start and end positions of the answer within a passage. A linear layer is added to predict these start and end positions for the answer in the text.

• Named Entity Recognition (NER): A token-level classifier is added for each word token
to predict whether a token belongs to an entity (such as "PERSON," "ORG," etc.).
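
A brief sketch of what adding a task-specific output layer looks like in practice, assuming the ready-made heads in the transformers library; each class below wraps pre-trained BERT plus a randomly initialised head matching one of the tasks above, which is then fine-tuned on labelled data.

```python
from transformers import (
    BertForSequenceClassification,   # [CLS]-based text classification (e.g. sentiment)
    BertForQuestionAnswering,        # predicts answer start/end positions (e.g. SQuAD)
    BertForTokenClassification,      # per-token labels (e.g. NER tags)
)

classifier = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
qa_model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")
ner_model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=9)
# Fine-tuning trains the new head from scratch and typically adjusts the pre-trained weights only slightly.
```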
BERT and its Variants
What makes BERT powerful?

Self-attention, transformer blocks, and large-scale pre-training.


Limitations of BERT:

Size and Resource Intensity

•Training Costs: Pre-training BERT models is resource-heavy, requiring powerful GPUs/TPUs and a lot of
memory. Pre-training BERT-large, for example, can take weeks, even on advanced hardware.

•Inference Speed: Since BERT is large, its inference (the process of making predictions) is slow. Its many transformer layers, and the self-attention computed over every pair of tokens, add to the time required for real-time applications. This becomes a problem in use cases requiring low-latency responses (e.g., voice assistants, real-time translations).

•Memory Footprint: The large number of parameters results in high memory usage, limiting BERT’s
deployment on edge devices or platforms with limited computational power (e.g., mobile devices).
Evolution of BERT
• Although BERT has proven to be a very effective language processing model, it has some
disadvantages that may make it less suitable for certain use cases.
• One disadvantage of BERT is that it is a large model, which requires a significant amount
of computational resources to train and use.
• It has about 110 million parameters, which makes it harder and slower to train and also
makes the inference times slower.
• It requires a lot of training data.
• The training process has a lot of room for optimization.

• While BERT has shown great promise in many natural language processing tasks, its large
size and need for large amounts of labeled data can be limitations in some situations.
• Many different variants of BERT were introduced to overcome the difficulties mentioned
above.
• To overcome these limitations, several variants of BERT have been
developed, each with its unique characteristics and performance
capabilities.
• These variants include ALBERT, RoBERTa, ELECTRA, DistilBERT,
SpanBERT, and TinyBERT.
Types/Variants of BERT

Several variants of BERT have been developed, each with its unique characteristics
and performance capabilities. Some of the most notable BERT variants include:
• ALBERT: (A Lite BERT) is a BERT model that has been modified to reduce the
number of parameters and computational complexity without sacrificing
performance. This makes training faster and more efficient while maintaining the
ability to generate high-quality language representations.
• RoBERTa: (Robustly Optimized BERT) is a BERT model that has been trained on a
larger dataset and for a longer period using more advanced training techniques.
This results in a better model for various natural language processing tasks.
• ELECTRA: (Efficiently Learning an Encoder that Classifies Tokens Accurately) is a
BERT model that has been modified to improve its ability to generate high-quality
text representations. This is achieved by training the model to generate fake text
and then using a discriminator to identify which text is real and which is fake.
•DistilBERT: DistilBERT is a BERT model that has been distilled (or simplified) to
make it faster and more efficient without sacrificing too much performance. This
allows it to be used when computational resources are limited, such as on mobile
devices.
•TinyBERT: TinyBERT is a BERT model that has been further distilled to make it
smaller and more efficient than DistilBERT. This allows it to be used in even more
resource-constrained situations.
• SpanBERT: Focuses on span-based predictions rather than single-token
predictions, useful for question-answering tasks. Better for span-based problems.
Compare different BERT variants (ALBERT, RoBERTa, ELECTRA,
DistilBERT, SpanBERT, and TinyBERT) by understanding their
architecture, optimizations, performance, and real-world applications.
ALBERT
ALBERT achieves performance increase by using several
techniques that make it more efficient than BERT. For example,
it uses a technique called cross-layer parameter sharing to
share parameters across different layers of the model, reducing
the number of parameters that need to be trained. It also uses
a technique called factorized embedding parameterization to
reduce the number of parameters in the embedding layer,
which is the first layer of the model.

Using the above techniques, ALBERT brings down the total parameters from about 110 million to around 12 million.

Another key difference between ALBERT and BERT is that ALBERT uses a technique called sentence-order prediction (SOP) to pre-train the model. This involves training the model to predict whether two consecutive sentences appear in their original order or have been swapped, which helps it learn the relationships between sentences.
Cross-Layer Parameter Sharing
Optimizing Model Efficiency
• Cross-layer parameter sharing is a technique used to reduce the
number of trainable parameters in a deep learning model.
• Typical Deep Learning Model: Each layer has its own set of
parameters. Parameters are learned independently for each layer.
• Cross-Layer Parameter Sharing: Some or all layers share the same set
of parameters. Parameters are reused across multiple layers.

•Increases efficiency in terms of training time and memory usage.


•Reduces model size without sacrificing performance.
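
An illustrative sketch (not ALBERT's actual code) of the idea: one encoder layer's parameters are reused at every depth, so the parameter count no longer grows with the number of layers.

```python
import torch
import torch.nn as nn

d_model, n_layers = 768, 12
shared_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=12, batch_first=True)

def encode(x, num_layers=n_layers):
    # The same module, and therefore the same weights, is applied at every layer.
    for _ in range(num_layers):
        x = shared_layer(x)
    return x

x = torch.randn(1, 5, d_model)                       # (batch, seq_len, hidden)
print(encode(x).shape)
print(sum(p.numel() for p in shared_layer.parameters()), "parameters, reused 12 times")
```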
Factorized Embedding Parameterization

• Factorized embedding parameterization is a technique used in ALBERT (and other models) to reduce the number of parameters in the embedding layer, which is the first layer of the model. In a typical deep-learning model, the embedding layer maps each word in the input text to a high-dimensional vector, which is then used as input to the rest of the model.

• In a model that uses factorized embedding parameterization, the large vocabulary-by-hidden-size embedding matrix is factorized into the product of two smaller matrices: each word is first mapped to a lower-dimensional embedding (e.g., 128 dimensions) and then projected up to the model's hidden size (e.g., 768 dimensions).

• The advantage of this technique is that it reduces the number of parameters in the
embedding layer, making training faster and more efficient. It can also improve the
model's performance by allowing it to learn more specialized word representations.
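
A small sketch of the parameter saving, with BERT-like toy sizes (a 30,000-token vocabulary, hidden size 768, and a factorized embedding size of 128; the numbers are illustrative):

```python
import torch.nn as nn

vocab_size, hidden_size, embed_size = 30000, 768, 128

full = nn.Embedding(vocab_size, hidden_size)                  # vocab_size x hidden_size
factorized = nn.Sequential(
    nn.Embedding(vocab_size, embed_size),                     # vocab_size x E
    nn.Linear(embed_size, hidden_size, bias=False),           # E x hidden_size
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"full: {count(full):,}   factorized: {count(factorized):,}")
# full: 23,040,000   factorized: 3,938,304  (30000*128 + 128*768)
```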
Benefits of Factorized Embedding Parameterization
• Reduces Parameters:
• Factorizes word embedding into two smaller matrices.
• Memory Efficiency:
• Helps in large-scale models by reducing the memory footprint.
• Improves Training Efficiency:
• Smaller parameter count results in faster training times.

Potential Disadvantages of Factorized Embedding Parameterization

Reduced Interpretability:
• Word representations are no longer a single vector but a product of two vectors, making them harder to interpret.
Limited Expressiveness:
• The word representations are constrained by being a product of two vectors, which could limit their ability to capture
complex relationships between words.
RoBERTa
• RoBERTa (Robustly Optimized BERT Pretraining Approach) is an
advanced variant of the BERT model, designed to improve
performance by optimizing training strategies and data usage.

• Purpose: To refine the BERT model for better natural language understanding tasks, such as text classification, question answering, and sentiment analysis.
• One key difference between RoBERTa and BERT is that RoBERTa is trained on a much larger dataset, which includes more than 160 GB of text data, whereas BERT was originally trained on about 16 GB of text data.
• This allows the model to learn from various sources and capture language
nuances better.
• Another key difference is that RoBERTa is trained using a technique called
dynamic masking, which involves randomly masking out different tokens
(e.g., words or punctuation) in the input text during training.
• This helps the model better understand the relationships between different
tokens in a sentence, improving its performance on various tasks.
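
An illustrative sketch of the dynamic masking idea (simplified: real pre-training also replaces some selected tokens with random words or leaves them unchanged): a fresh random mask is drawn every time an example is seen, instead of fixing one static mask at preprocessing time.

```python
import random

def dynamic_mask(tokens, mask_prob=0.15, mask_token="[MASK]"):
    # Choose a new random set of positions to mask on every call.
    return [mask_token if random.random() < mask_prob else tok for tok in tokens]

tokens = "the quick brown fox jumps over the lazy dog".split()
for epoch in range(3):
    print(dynamic_mask(tokens))   # a different masking pattern on each pass over the data
```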
RoBERTa Architecture
• Base Architecture:
• Same as BERT (Transformer-based, bidirectional model).
• Pre training:
• Masked Language Model (MLM), where 15% of tokens are randomly masked
and predicted by the model.
• Parameters:
• RoBERTa comes in different sizes, with the most common being RoBERTa-
base (125M parameters) and RoBERTa-large (355M parameters), similar to
BERT.
• RoBERTa also uses a technique called byte-pair encoding(BPE) to
tokenize the input text.
• This involves iteratively merging frequent character sequences into single "combined" subword tokens, which keeps the vocabulary compact and allows the model to learn more efficient representations.
• Overall, RoBERTa is a more powerful and performant version of BERT
that is trained using advanced techniques and a larger dataset.
• This makes it a good choice for applications that require high-quality
language representations, such as in machine translation or
summarization.
DistilBERT: Knowledge Distillation
• Transfer knowledge from a large, pre-trained model (teacher) to a smaller model (student).
• The smaller model mimics the larger model's behavior, achieving similar performance while being more efficient and deployable.

Process:
Step 1: Generating Predictions: The larger model generates predictions on a dataset, which are used as "ground
truth" labels for the smaller model's training.

Step 2: Learning Internal Representations: The smaller model also learns from the larger model’s internal
activations (values computed during the forward pass), enhancing performance on tasks.
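
A hedged sketch of the core distillation objective used in this setup: the student is trained to match the teacher's softened output distribution through a KL-divergence loss (the temperature and tensor shapes here are illustrative).

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL divergence between the softened teacher and student distributions.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2

# Toy example: logits over a 5-class output for a batch of 2 examples.
teacher_logits = torch.randn(2, 5)
student_logits = torch.randn(2, 5, requires_grad=True)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()                    # gradients flow into the student only
print(loss.item())
```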
•DistilBERT's Efficiency:
•DistilBERT is smaller, with 40% fewer parameters than BERT (66 million vs. 110 million).
•Faster to train and easier to deploy, especially in resource-constrained environments like
mobile devices.
•Dynamic Masking:
•DistilBERT follows RoBERTa's practice of dynamic masking during training, which involves randomly masking different tokens each time an example is seen.
•This improves the model’s ability to understand relationships between tokens, enhancing task performance.

Key Benefits of DistilBERT:


•Lightweight and Efficient:
• Faster to train and requires fewer computational resources.
•Maintains Performance:
• Achieves strong performance while being more efficient, making it ideal for deployment on mobile
devices and other low-resource settings.
• WordPiece is the tokenization algorithm Google developed to pretrain
BERT. It has since been reused in quite a few Transformer models
based on BERT, such as DistilBERT, MobileBERT.
• Two-Part Model:ELECTRA consists of a generator and a
discriminator.
• Generator: Trained to create fake text similar to real input text.
• Discriminator: Trained to distinguish between real and fake text.
• Fake Token Detection: If "dog" is replaced by "cat", the discriminator is trained to identify "cat" as the replaced (fake) token.
• Replaced Token Detection:
• Technique:
• Discriminator detects replaced tokens, helping the model learn relationships
between tokens.
• Additional Techniques:
• The generator itself is trained with masked language modeling (MLM): tokens are masked and predicted by the generator, and its predictions supply the replacement tokens the discriminator must detect.
• Benefits of ELECTRA:
• High-Quality Text Representations:
• More efficient at generating accurate representations of text compared to
standard BERT.
• Use Cases:
• Suitable for tasks requiring detailed text understanding, such as
summarization and question answering.
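
An illustrative sketch of the replaced-token-detection targets: given the original sequence and the generator-corrupted sequence, the discriminator is trained on per-token real (0) vs. replaced (1) labels.

```python
original  = "the dog chased the ball".split()
corrupted = "the cat chased the ball".split()   # a generator swapped "dog" -> "cat"

labels = [0 if orig == corr else 1 for orig, corr in zip(original, corrupted)]
print(list(zip(corrupted, labels)))
# [('the', 0), ('cat', 1), ('chased', 0), ('the', 0), ('ball', 0)]
```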
Masked Language Modeling with Spans:SpanBERT randomly masks multiple
contiguous words (spans) in a sentence.
• The model is trained to predict masked spans based on the surrounding
context.
• This helps the model learn better relationships between words and
phrases.
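
An illustrative sketch of span masking (simplified relative to the paper's sampling scheme): a contiguous run of tokens is masked, rather than isolated single tokens.

```python
import random

def mask_span(tokens, max_span_len=4, mask_token="[MASK]"):
    span_len = random.randint(1, max_span_len)
    start = random.randint(0, len(tokens) - span_len)
    masked = list(tokens)
    masked[start:start + span_len] = [mask_token] * span_len
    return masked

tokens = "an american football game is played between two teams".split()
print(mask_span(tokens))   # e.g. ['an', 'american', '[MASK]', '[MASK]', '[MASK]', 'is', ...]
```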
Span Boundary Objective (SBO):
• Technique:
• SpanBERT adds a span boundary objective: the tokens inside a masked span are predicted using only the representations of the tokens at the span's boundaries (plus position information).
• Purpose:
• This pushes the boundary representations to summarize the whole span, which is exactly what span-selection tasks need.
Key Benefits of SpanBERT:
• Improved Span Prediction:
• Specifically designed to excel in span prediction tasks, like coreference resolution and
question answering.
• Document Understanding:
• The combination of span masking and the span boundary objective improves performance on tasks like summarization and text classification.
Use Cases:
• Span Prediction Tasks:
• Ideal for tasks involving the identification and prediction of relationships between words and
phrases.
• NLP Applications:
• Suitable for various natural language processing tasks such as summarization, information
extraction, and text classification.
TinyBERT
• Knowledge Distillation: TinyBERT uses knowledge distillation to train a
smaller, efficient model (the Student) to mimic a larger, pre-trained model
(the Teacher).
Learning Process in TinyBERT:
• The Student model learns from the Teacher at three levels:
• Transformer Layer – Learns from the internal attention mechanisms of the Teacher.
• Embedding Layer – Mimics how the Teacher embeds input text.
• Prediction Layer – Learns from the Teacher’s final predictions.
• Comprehensive Distillation:
• Unlike traditional distillation, TinyBERT incorporates knowledge transfer from the
Teacher’s internal representations, not just the final output.
Efficiency of TinyBERT:
• Parameter Size:
• TinyBERT has 14.5 million parameters, which is more than 7 times smaller than the original
BERT (110 million parameters).
• High-Quality Representations:
• Despite its smaller size, TinyBERT can produce high-quality text representations, making it
both efficient and effective.
Key Benefits:
• Smaller and Faster:
• TinyBERT is more efficient, ideal for deployment in resource-constrained environments (e.g.,
mobile devices).
• Maintains Performance:
• Produces text representations with competitive accuracy to larger models despite its
compact size.
• Exploring Variants of BERT (Overview) - Scaler Topics
• Byte-Pair Encoding: Subword-based tokenization | Towards Data
Science
