BERT
• Can determine how positive or negative a movie’s reviews are. (Sentiment Analysis)
• Can write an article about any topic with just a few sentence inputs. (Text generation)
• Can differentiate words that have multiple meanings (like ‘bank’) based on the surrounding text.
(Polysemy resolution)
Example of BERT
• BERT has helped Google better surface (English) results for nearly all searches since November of 2020.
• Here’s an example of how BERT helps Google better understand specific searches like “can you get medicine for someone pharmacy”:
• Pre-BERT, Google surfaced information about getting a prescription filled.
• Post-BERT, Google understands that “for someone” relates to picking up a prescription for someone else, and the search results now help to answer that.
• BERT is a deep bidirectional, unsupervised language
representation, pre-trained using a plain text corpus.
Example:
The woman went to the store and bought a _____ of shoes.
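As a quick illustration, here is a minimal sketch (assuming the Hugging Face transformers package and the public bert-base-uncased checkpoint) that lets a pre-trained BERT fill in the blank from the example above:

```python
# Minimal sketch: use BERT's masked-language-modeling head to fill in the blank.
from transformers import pipeline

# "bert-base-uncased" is the standard publicly released BERT Base checkpoint.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT's mask token is [MASK]; the model uses both left and right context.
for prediction in fill_mask("The woman went to the store and bought a [MASK] of shoes."):
    print(prediction["token_str"], round(prediction["score"], 3))
# The top prediction is typically "pair".
```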
Need for BERT
Bidirectional Contextual Understanding:
• Early Transformer Models: Traditional transformer models like OpenAI's
GPT (Generative Pre-trained Transformer) were unidirectional, processing
text either left-to-right (causal) or right-to-left. This means, for example, if
you're trying to predict a word in the middle of a sentence, the model can
only look at the words before (left-to-right) or after (right-to-left), but not
both at the same time.
• BERT's Innovation: BERT introduced bidirectional encoding—it considers
both left and right contexts simultaneously. This results in much richer
contextual representations of words since the model understands a word
based on the entire sentence, not just the surrounding words in one
direction.
Pre-training with Specific Tasks (MLM and NSP):
•Masked Language Model (MLM): Instead of training in a purely generative way (predicting the next
word in a sequence), BERT uses a masked language modeling objective. This means some words in
the input sentence are randomly masked, and the model learns to predict those masked words by
considering the entire context (both left and right).
This task allows BERT to learn deep bidirectional representations, which earlier transformer-based
models lacked.
•Next Sentence Prediction (NSP): In addition to word-level predictions, BERT also introduced the next
sentence prediction task. This helps BERT understand the relationships between sentences, which is
crucial for tasks like question answering and natural language inference (NLI).
Previous transformer models didn’t focus on sentence-to-sentence relationships during pre-training.
Transformers Were Mainly Encoder-Decoder Models
BERT as a Pure Encoder Model: BERT is based purely on the
transformer encoder. This specialization helps it focus entirely on
creating powerful sentence representations, making it ideal for a range
of tasks (from classification to sequence labeling) without the need for
a decoder.
Transfer Learning in NLP
BERT's Breakthrough: BERT popularized the concept of pre-training on
large datasets and then fine-tuning for downstream tasks. This
dramatically reduced the amount of labeled data and training time
needed for specific tasks like sentiment analysis, named entity
recognition, or question answering.
What is a Word Embedding?
• Word embeddings have been a major force in how leading NLP models deal with language.
• Methods like Word2Vec and GloVe have been widely used for such tasks.
• Word2Vec showed that we can use a vector (a list of numbers) to properly represent words in a way
that captures semantic or meaning-related relationships.
• (e.g. the ability to tell if words are similar, or opposites, or that a pair of words like “Stockholm” and
“Sweden” have the same relationship between them as “Cairo” and “Egypt” have between them).
• As well as syntactic, or grammar-based, relationships (e.g. the relationship between “had” and “has”
is the same as that between “was” and “is”).
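These analogy and similarity relationships can be checked directly with pre-trained static embeddings; below is a small sketch, assuming the gensim package and its pre-trained GloVe download are available:

```python
# Sketch: probe semantic relationships in pre-trained GloVe vectors with gensim.
import gensim.downloader as api

# Loads 100-dimensional GloVe vectors trained on Wikipedia + Gigaword (lowercased vocab).
vectors = api.load("glove-wiki-gigaword-100")

# Semantic analogy: Stockholm is to Sweden as Cairo is to ?  (expected: "egypt")
print(vectors.most_similar(positive=["sweden", "cairo"], negative=["stockholm"], topn=1))

# Similarity between related words.
print(vectors.similarity("bank", "money"))
```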
Pretrained Word Embeddings
• The field quickly realized it’s a great idea to use embeddings that were pre-trained
on vast amounts of text data instead of training them alongside the model on what
was frequently a small dataset.
• GloVe is an unsupervised learning algorithm, created at Stanford University, for obtaining vector
representations for words.
• Pre-trained GloVe vectors are available that were learned from corpora of roughly 6 billion tokens of
English text, covering common words along with many other general-use characters like commas,
braces, and semicolons.
• Instead of using a fixed embedding for each word, Embeddings from Language Models (ELMo) looks at the
entire sentence before assigning each word in it an embedding.
• It uses a bi-directional LSTM trained on a specific task to be able to create those embeddings.
• The ELMo LSTM would be trained on a massive dataset in the language of our dataset, and then we can use
it as a component in other models that need to handle language.
ELMo uses Language Modelling
• Word2Vec generates the same single vector for the word bank for both of the
sentences.
• BERT will generate two different vectors for the word bank being used in two different
contexts.
• One vector will be close to words like money and cash; the other will be close to
words like beach and coast.
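A short sketch (assuming PyTorch and the Hugging Face transformers package) makes this concrete: the same surface word “bank” gets two noticeably different contextual vectors from BERT, whereas a static embedding would be identical in both sentences.

```python
# Sketch: compare BERT's contextual vectors for "bank" in two different sentences.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (seq_len, 768)
    # Locate the position of the token "bank" in the input.
    idx = inputs["input_ids"][0].tolist().index(tokenizer.convert_tokens_to_ids("bank"))
    return hidden[idx]

v_money = bank_vector("I went to the bank to deposit money.")
v_river = bank_vector("We sat on the bank of the river.")
print(torch.cosine_similarity(v_money, v_river, dim=0))  # noticeably below 1.0
```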
Introduction to BERT
• Fundamentally, BERT is a stack of Transformer encoder layers (Vaswani et
al., 2017) that consist of multiple self-attention “heads”.
• For every input token in a sequence, each head computes key, value, and
query vectors, used to create a weighted representation.
• The outputs of all heads in the same layer are combined and run through a
fully connected layer.
• As a result, the pre-trained BERT model can be fine-tuned with just one
additional output layer to create state-of-the-art models for a wide range
of NLP tasks.
What is BERT?
• BERT is based on the Transformer architecture.
• BERT is pre-trained on a large corpus of unlabelled text, including the entire English Wikipedia (that’s
2,500 million words!) and BookCorpus (800 million words).
In BERT's deeply bidirectional model: At every layer, the word "bank" is processed with knowledge
of both its left context ("I went to the") and its right context ("to deposit money"). This leads to a
more accurate understanding at each stage of processing, reducing the chance of early-stage
confusion.
BERT introduction
• Google's research on Transformers.
• The transformer is the part of the model that gives BERT its
increased capacity for understanding context and ambiguity in
language.
• Downstream applications include machine translation, text generation, text classification, and question answering.
How does BERT achieve bidirectionality?
1. Self-Attention Mechanism:
•BERT uses the self-attention mechanism of the Transformer encoder, where each token
(word or subword) in the input sequence can attend to all other tokens, both before and
after it.
•This means that for every word, the model looks at the entire sequence of words around
it, gathering context from both directions.
•For example, if you're trying to understand the word "bank" in "I went to the bank to
deposit money," BERT will consider not only the words before it ("I went to the") but also
the words after it ("to deposit money") when determining the meaning.
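The difference between this bidirectional (full) attention and the unidirectional (causal) attention used by left-to-right models can be illustrated with a toy sketch in plain PyTorch; the tensors here are random and purely illustrative:

```python
# Toy sketch: full (BERT-style) attention vs. causal (GPT-style) attention masks.
import torch

seq_len, dim = 5, 8
q = k = torch.randn(seq_len, dim)
scores = q @ k.T / dim ** 0.5                       # raw attention scores

# BERT-style (bidirectional): every token may attend to every other token.
bert_attn = torch.softmax(scores, dim=-1)

# GPT-style (unidirectional): mask out positions to the right before softmax.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
gpt_attn = torch.softmax(scores.masked_fill(causal_mask, float("-inf")), dim=-1)

print(bert_attn[2])   # token 3 attends to tokens before *and* after it
print(gpt_attn[2])    # token 3 attends only to tokens 1-3 (zero weight on the right)
```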
How does BERT achieve bidirectionality?
• Importantly, because BERT can see the entire sentence (except for the masked tokens),
it uses context from both the left and right of the masked word to make its prediction.
• For example, if the sentence is “The [MASK] chased the mouse,” BERT will use the
context from both directions to infer that the masked word is likely “cat.”
• This contrasts with models like GPT (Generative Pre-trained Transformer), which are
unidirectional, meaning they predict the next word by only looking at the words to the
left (left-to-right).
Bidirectional
• As opposed to directional models, which read the text input
sequentially (left-to-right or right-to-left), the Transformer encoder
reads the entire sequence of words at once.
• Context-free models such as Word2Vec produce a single embedding for each word in the vocabulary;
for example, the word “bank” would have the same context-free representation in “bank
account” and “bank of the river.”
Why is this non-directional approach so powerful?
• However, BERT represents “bank” using both its previous and next context — “I
accessed the … account” — starting from the very bottom of a deep neural
network, making it deeply bidirectional.
Why does BERT use only the Encoder of the Transformer?
• BERT makes use of Transformer, an attention mechanism that learns
contextual relations between words (or sub-words) in a text.
• Soon after the release of the paper describing the model, the team also open-sourced the code of
the model and made available for download versions of the model that were already pre-trained on
massive datasets.
• This is a momentous development since it enables anyone building a machine learning model
involving language processing to use this powerhouse as a readily-available component – saving the
time, energy, knowledge, and resources that would have gone to training a language-processing
model from scratch.
Task-specific Models
The BERT paper shows several ways to use BERT for different tasks.
Model Architecture
• BERT’s architecture significantly scales up from the original Transformer
model, providing it with the capacity to understand language more deeply.
• The paper presents two model sizes for BERT:
• BERT Base: 12 layers (Transformer Blocks), 12 attention heads, and a hidden size of 768.
• BERT Large: 24 layers (Transformer Blocks), 16 attention heads, and a hidden size of 1024.
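These published sizes can be read back from the released configurations; a brief sketch (assuming the Hugging Face transformers package) is shown below:

```python
# Sketch: print the layer count, head count, and hidden size of the two BERT checkpoints.
from transformers import BertConfig

for name in ["bert-base-uncased", "bert-large-uncased"]:
    cfg = BertConfig.from_pretrained(name)
    print(name, cfg.num_hidden_layers, "layers,",
          cfg.num_attention_heads, "heads,",
          cfg.hidden_size, "hidden size")
# bert-base-uncased:  12 layers, 12 heads, 768 hidden size  (~110M parameters)
# bert-large-uncased: 24 layers, 16 heads, 1024 hidden size (~340M parameters)
```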
[CLS] Token (Classification Token)
• Every input sequence to BERT begins with a special [CLS] token. It plays a critical role in tasks
where you need a single output for the entire sequence, such as classification tasks
(e.g., sentiment analysis, spam detection, etc.).
Why use [CLS]?
• In many tasks, you want to make a single prediction based on
an entire input sequence.
• During training, the model learns to use the hidden state (internal
representation) associated with this [CLS] token to produce an output that
summarizes the meaning of the entire input.
• After passing the input sequence through multiple Transformer layers (the
architecture used by BERT), the hidden state of the [CLS] token captures the
aggregate meaning of the sequence.
• However, after the final layer, the hidden state associated with [CLS] represents a global
summary of the sentence.
• This hidden state is then used for classification tasks, like predicting whether the sentence has
positive or negative sentiment.
In Sentence Classification:
• The [CLS] token's hidden state is typically passed to a classification layer (a
simple neural network layer) to make a prediction.
• For example, in sentiment analysis, the hidden state of [CLS] could be used to
predict whether a sentence is positive, negative, or neutral.
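A minimal sketch of this pattern (assuming PyTorch and Hugging Face transformers; the small linear head here is illustrative and untrained, not BERT's exact built-in classifier):

```python
# Sketch: take the [CLS] hidden state and pass it through a small classification head.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
classifier = torch.nn.Linear(bert.config.hidden_size, 3)   # positive / negative / neutral

inputs = tokenizer("The movie was surprisingly good.", return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs)

cls_hidden = outputs.last_hidden_state[:, 0, :]   # hidden state of the [CLS] token
logits = classifier(cls_hidden)                   # untrained head: logits are random here
print(logits.softmax(dim=-1))
```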
• In Question Answering:
• In tasks like Question Answering (QA), BERT also uses the [CLS] token, but
instead of classifying a sentence, the [CLS] token can represent whether the
input contains the correct answer to a question.
[SEP] Token (Separator Token)
• BERT needs to know where one sentence ends and the next begins.
[SEP] provides that boundary.
Why use [SEP]?
• The [SEP] token helps mark where the first sentence ends and the
second sentence begins.
Example
• Let’s say you have two sentences:
• Sentence 1: "The sky is blue."
• Sentence 2: "It is a sunny day."
• The input to BERT would look like this:
• "[CLS] The sky is blue [SEP] It is a sunny day [SEP]"
• Here, the [SEP] token is used twice:
• Once to separate the first sentence from the second sentence.
• Once at the end of the input sequence to mark the end of the input.
• During training, BERT uses these [SEP] tokens to understand that the two
sentences are separate, and to help it learn how to model the relationship
between them.
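A short sketch (assuming the Hugging Face transformers tokenizer for bert-base-uncased) shows how [CLS] and [SEP] are inserted automatically and how the two segments are marked:

```python
# Sketch: tokenize a sentence pair and inspect the special tokens and segment IDs.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoding = tokenizer("The sky is blue.", "It is a sunny day.")

print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# e.g. ['[CLS]', 'the', 'sky', 'is', 'blue', '.', '[SEP]', 'it', 'is', 'a', 'sunny', 'day', '.', '[SEP]']
print(encoding["token_type_ids"])   # 0s for the first sentence, 1s for the second
```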
In Next Sentence Prediction (NSP):
• Next Sentence Prediction (NSP) is one of the two pretraining tasks BERT
uses (the other being Masked Language Modeling, or MLM).
• The [SEP] token separates the two sentences, helping BERT understand the
distinction between the two sentences, and allowing it to learn how to model
whether the second sentence is likely to follow the first.
In Question Answering:
• In QA tasks, the question and the reference passage are packed into a single input sequence, and the
[SEP] token marks the boundary between them.
Fine-Tuning for Classification
• The most straightforward way to use BERT is to classify a single piece of text. The model would
look like this:
• To train such a model, you mainly have to train the classifier, with minimal changes happening to the BERT
model during the training phase.
• This training process is called Fine-Tuning and has roots in Semi-supervised Sequence Learning and
ULMFiT(Universal Language Model Fine-tuning).
Model Architecture
• Both BERT model sizes have a large number of encoder layers (which the
paper calls Transformer Blocks) – twelve for the Base version, and twenty-four for the Large version.
• They also have larger hidden sizes (768 and 1024 respectively) and more attention heads
(12 and 16 respectively) than the default configuration in the reference implementation of the
Transformer in the initial paper (6 encoder layers, a hidden size of 512, and 8 attention heads).
Model Inputs and Outputs
• The first position of every input sequence holds the special [CLS] token. The final hidden state corresponding to this
token is used as the aggregate sequence representation for classification tasks, and for Next Sentence Prediction it is
fed into a FFNN + Softmax layer that predicts probabilities for the labels “IsNext” or “NotNext”.
• The final hidden states corresponding to [MASK] tokens are fed into a FFNN + Softmax layer to predict the masked
words from the vocabulary.
Masked Language Model
• BERT makes use of a novel technique called Masked LM (MLM):
• it randomly masks words in the sentence and then it tries to predict them.
• Masking means that the model looks in both directions and it uses the full context of
the sentence, both left and right surroundings, to predict the masked word.
• Unlike the previous language models, it takes both the previous and next tokens into
account at the same time.
• The existing combined left-to-right and right-to-left LSTM based models were missing
this “same-time part”.
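The masking recipe from the BERT paper selects roughly 15% of tokens; of those, 80% become [MASK], 10% are replaced by a random token, and 10% are left unchanged. A simplified sketch of this procedure (operating on plain token strings rather than real vocabulary IDs) is shown below:

```python
# Simplified sketch of BERT's MLM masking procedure (15% masked, 80/10/10 split).
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    masked, labels = [], []
    for token in tokens:
        if random.random() < mask_prob:
            labels.append(token)                      # the model must predict this token
            roll = random.random()
            if roll < 0.8:
                masked.append("[MASK]")               # 80%: replace with [MASK]
            elif roll < 0.9:
                masked.append(random.choice(vocab))   # 10%: replace with a random token
            else:
                masked.append(token)                  # 10%: keep the original token
        else:
            labels.append(None)                       # not a prediction target
            masked.append(token)
    return masked, labels

tokens = "the cat chased the mouse".split()
print(mask_tokens(tokens, vocab=["dog", "ran", "house", "tree"]))
```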
What is Next Sentence Prediction?
• NSP (Next Sentence Prediction) is used to help BERT learn about relationships
between sentences by predicting whether a given sentence follows the previous sentence or not.
• Next Sentence Prediction examples:
• “Paul went shopping. He bought a new shirt.” (correct sentence pair)
• “Ramona made coffee. Vanilla ice cream cones for sale.” (incorrect sentence pair)
• In training, 50% correct sentence pairs are mixed in with 50% random sentence pairs to help BERT learn
next sentence prediction.
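A toy sketch of how such 50/50 training pairs could be assembled (the tiny corpus and helper function here are purely illustrative; a real implementation would avoid accidentally sampling the true next sentence as a negative):

```python
# Toy sketch: build NSP training pairs with IsNext / NotNext labels.
import random

corpus = [
    "Paul went shopping.", "He bought a new shirt.",
    "Ramona made coffee.", "Vanilla ice cream cones for sale.",
]

def make_nsp_pair(i):
    sentence_a = corpus[i]
    if random.random() < 0.5 and i + 1 < len(corpus):
        return sentence_a, corpus[i + 1], "IsNext"        # true next sentence
    return sentence_a, random.choice(corpus), "NotNext"   # random sentence

print(make_nsp_pair(0))
```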
For Finetuning:
Add a Task-Specific Output Layer
• For fine-tuning, a task-specific layer is added on top of BERT’s architecture,
depending on the nature of the downstream task. BERT’s base architecture
remains unchanged.
• Example tasks and output layers:
• Text Classification (e.g., sentiment analysis): A simple linear layer (fully connected
layer) is added on top of the BERT model. This layer uses the [CLS] token’s hidden
state (the representation of the entire input sequence) to make predictions.
• Named Entity Recognition (NER): A token-level classifier is added for each word token
to predict whether a token belongs to an entity (such as "PERSON," "ORG," etc.).
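A condensed fine-tuning sketch for the text-classification case (assuming PyTorch and Hugging Face transformers; BertForSequenceClassification adds exactly this kind of linear head on top of the pre-trained encoder, and the two-example “dataset” is purely illustrative):

```python
# Sketch: fine-tune a pre-trained BERT with a sequence-classification head.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["I loved this film.", "A dull, lifeless movie."]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):                                   # a few illustrative steps
    outputs = model(**batch, labels=labels)          # loss is computed internally
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(outputs.loss.item())
```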
BERT and its Variants
Limitations of BERT
•Training Costs: Pre-training BERT models is resource-heavy, requiring powerful GPUs/TPUs and a lot of
memory. Pre-training BERT-large, for example, can take weeks, even on advanced hardware.
•Inference Speed: Since BERT is large, its inference (the process of making predictions) is slow. The number
and size of its transformer layers add to the time required, which becomes a problem in use cases requiring
low-latency responses (e.g., voice assistants, real-time translations).
•Memory Footprint: The large number of parameters results in high memory usage, limiting BERT’s
deployment on edge devices or platforms with limited computational power (e.g., mobile devices).
Evolution of BERT
• Although BERT has proven to be a very effective language processing model, it has some
disadvantages that may make it less suitable for certain use cases.
• One disadvantage of BERT is that it is a large model, which requires a significant amount
of computational resources to train and use.
• It has about 110 million parameters, which makes it harder and slower to train and also
makes the inference times slower.
• It requires a lot of training data.
• The training process has a lot of room for optimization.
• While BERT has shown great promise in many natural language processing tasks, its large
size and need for large amounts of labeled data can be limitations in some situations.
• Many different variants of BERT were introduced to overcome the difficulties mentioned
above.
• To overcome these limitations, several variants of BERT have been
developed, each with its unique characteristics and performance
capabilities.
• These variants include ALBERT, RoBERTa, ELECTRA, DistilBERT,
SpanBERT, and TinyBERT.
Types/Variants of BERT
Several variants of BERT have been developed, each with its unique characteristics
and performance capabilities. Some of the most notable BERT variants include:
• ALBERT:(A Lite BERT) is a BERT model that has been modified to reduce the
number of parameters and computational complexity without sacrificing
performance. This makes training faster and more efficient while maintaining the
ability to generate high-quality language representations.
• RoBERTa:(Robustly Optimized BERT) is a BERT model that has been trained on a
larger dataset and for a longer period using more advanced training techniques.
This results in a better model for various natural language processing tasks.
• ELECTRA: (Efficiently Learning an Encoder that Classifies Tokens Accurately) is a
BERT-style model that has been modified to learn high-quality text representations more
efficiently. This is achieved by having a small generator network produce plausible fake tokens
and training ELECTRA as a discriminator to identify which tokens are original and which were replaced.
•DistilBERT: DistilBERT is a BERT model that has been distilled(or simplified) to
make it faster and more efficient without sacrificing too much performance. This
allows it to be used when computational resources are limited, such as on mobile
devices.
•TinyBERT: TinyBERT is a BERT model that has been further distilled to make it
smaller and more efficient than DistilBERT. This allows it to be used in even more
resource-constrained situations.
• SpanBERT: Focuses on span-based predictions rather than single-token
predictions, useful for question-answering tasks. Better for span-based problems.
Compare different BERT variants (ALBERT, RoBERTa, ELECTRA,
DistilBERT, SpanBERT, and TinyBERT) by understanding their
architecture, optimizations, performance, and real-world applications.
ALBERT
ALBERT achieves its efficiency by using several
techniques that reduce its parameter count compared to BERT. For example,
it uses a technique called cross-layer parameter sharing to
share parameters across different layers of the model, reducing
the number of parameters that need to be trained. It also uses
a technique called factorized embedding parameterization to
reduce the number of parameters in the embedding layer,
which is the first layer of the model.
• The advantage of this technique is that it reduces the number of parameters in the
embedding layer, making training faster and more efficient. It can also improve the
model's performance by allowing it to learn more specialized word representations.
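A back-of-the-envelope sketch of the parameter savings (the vocabulary, hidden, and embedding sizes below are typical BERT/ALBERT values used for illustration):

```python
# Sketch: factorized embedding parameterization replaces one V x H matrix
# with a V x E matrix followed by an E x H projection, where E << H.
V, H, E = 30_000, 768, 128           # vocab size, hidden size, embedding size

bert_style = V * H                   # one big embedding matrix
albert_style = V * E + E * H         # two smaller factorized matrices

print(f"BERT-style embedding parameters:   {bert_style:,}")    # 23,040,000
print(f"ALBERT-style embedding parameters: {albert_style:,}")  #  3,938,304
```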
Benefits of Factorized Embedding Parameterization
• Reduces Parameters:
• Factorizes word embedding into two smaller matrices.
• Memory Efficiency:
• Helps in large-scale models by reducing the memory footprint.
• Improves Training Efficiency:
• Smaller parameter count results in faster training times.
Drawbacks of Factorized Embedding Parameterization
• Reduced Interpretability:
• Word representations are no longer rows of a single embedding matrix but the product of two smaller matrices, making them harder to interpret.
• Limited Expressiveness:
• The word representations are constrained by this factorization, which could limit their ability to capture
complex relationships between words.
RoBERTa
• RoBERTa (Robustly Optimized BERT Pretraining Approach) is an
advanced variant of the BERT model, designed to improve
performance by optimizing training strategies and data usage.
• Its key changes include training on much more data for longer with larger batches, removing the
Next Sentence Prediction objective, and using dynamic masking instead of a single static mask.
DistilBERT
• DistilBERT is a smaller BERT model trained with knowledge distillation: a compact “student” model
learns to reproduce the behaviour of the larger pre-trained “teacher” model.
Distillation Process:
Step 1: Generating Predictions: The larger model generates predictions on a dataset, which are used as "ground
truth" labels for the smaller model's training.
Step 2: Learning Internal Representations: The smaller model also learns from the larger model’s internal
activations (values computed during the forward pass), enhancing performance on tasks.
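A simplified sketch of the distillation idea (plain PyTorch; the logits and loss weights below stand in for real model outputs, and DistilBERT's full objective also combines this soft-target loss with an MLM loss and a cosine alignment of hidden states):

```python
# Sketch: the student matches the teacher's softened output distribution
# (soft targets) in addition to the usual hard-label loss.
import torch
import torch.nn.functional as F

teacher_logits = torch.tensor([[2.0, 0.5, -1.0]])    # from the large (teacher) model
student_logits = torch.tensor([[1.5, 0.8, -0.5]], requires_grad=True)
hard_label = torch.tensor([0])
T = 2.0                                               # temperature softens the distributions

soft_loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * T * T
hard_loss = F.cross_entropy(student_logits, hard_label)

loss = 0.5 * soft_loss + 0.5 * hard_loss              # illustrative weighting
print(loss.item())
```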
•DistilBERT's Efficiency:
•DistilBERT is smaller, with 40% fewer parameters than BERT (66 million vs. 110 million).
•Faster to train and easier to deploy, especially in resource-constrained environments like
mobile devices.
•Dynamic Masking:
•DistilBERT follows RoBERTa-style dynamic masking during training: the randomly masked
tokens are re-sampled each time a sequence is seen, rather than fixed once.
•This improves the model’s ability to learn relationships between tokens, enhancing task
performance.