BERT
The BERT model was proposed in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
It leverages a Transformer-based neural network to understand and represent human language.
The decision to use an encoder-only architecture in BERT suggests a primary emphasis on understanding input
sequences rather than generating output sequences.
Traditional language models limit the model's awareness by conditioning on only one side of the context (typically the preceding words). To overcome this, BERT uses a bi-directional approach: it draws on the context from both sides of a word by looking at all the words in the sequence simultaneously.
BERT is pre-trained on a large amount of unlabelled text data. The model learns contextual embeddings, which are representations of words that take into account their surrounding context in a sentence.
BERT is pre-trained with unsupervised objectives. It learns to predict masked-out words in a sentence (the Masked Language Model, or MLM, task) and to understand the relationship between two sentences by predicting whether the second sentence follows the first (Next Sentence Prediction).
After the pre-training phase, the BERT model, with its contextual embeddings, can be fine-tuned for specific NLP tasks.
This step tailors the model to more targeted applications by adapting its general language understanding to the
nuances of the particular task.
BERT is fine-tuned using labelled data specific to the downstream tasks of interest. The model's parameters are adjusted to optimize its performance on those tasks.
Working
BERT is designed to produce a language representation model, so only the encoder mechanism of the Transformer is used. A sequence of tokens is fed to the Transformer encoder: the tokens are first embedded into vectors and then processed through the stacked encoder layers. The output is a sequence of vectors, one for each input token, providing contextualized representations.
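As a minimal sketch of this flow (using the Hugging Face transformers library and the bert-base-uncased checkpoint as assumptions), each input token comes back as one contextualized vector:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenize the sentence and run it through the encoder stack.
inputs = tokenizer("BERT reads the whole sentence at once.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One hidden-size vector per input token (including [CLS] and [SEP]).
print(outputs.last_hidden_state.shape)  # torch.Size([1, sequence_length, 768])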
Masking words: BERT hides some of the input words (about 15% of them) and replaces them with a special [MASK] symbol.
Guessing hidden words: BERT’s job is to figure out what these hidden words are by looking at the
words around them.
How it learns:
o BERT adds a prediction layer on top of the encoder output to make these guesses. It then checks how close its guesses are to the actual hidden words.
o It does this by converting its output scores into probabilities over the vocabulary using a softmax.
o BERT's main focus during training is on getting these hidden words right: only the masked positions contribute to this loss. A small example of this masked-word prediction is sketched below.
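A minimal sketch of this guessing step, using the fill-mask pipeline from the transformers library (the checkpoint name is an assumption):

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT scores every vocabulary word for the [MASK] position and returns
# the most probable candidates with their softmax probabilities.
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))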
In the training process, BERT learns to understand the relationship between pairs of sentences,
predicting if the second sentence follows the first in the original document.
50% of the input pairs have the second sentence as the subsequent sentence in the original
document, and the other 50% have a randomly chosen sentence.
To help the model distinguish between connected and disconnected sentence pairs, the input is processed before entering the model:
o A [CLS] token is inserted at the beginning of the first sentence, and a [SEP] token is added at
the end of each sentence.
o A sentence embedding indicating Sentence A or Sentence B is added to each token.
o A positional embedding indicates the position of each token in the sequence.
BERT predicts if the second sentence is connected to the first. This is done by transforming the
output of the [CLS] token into a 2×1 shaped vector using a classification layer, and then calculating
the probability of whether the second sentence follows the first using SoftMax.
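A minimal sketch of this step with BertForNextSentencePrediction (checkpoint name assumed); note that the tokenizer adds the [CLS]/[SEP] tokens and the segment ids described above:

import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sentence_a = "She opened the fridge."
sentence_b = "It was completely empty."
# Encoding a pair produces input_ids with [CLS]/[SEP] plus token_type_ids (segments A/B).
inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 2)

# Index 0: sentence_b follows sentence_a; index 1: it is a random sentence.
print(torch.softmax(logits, dim=-1))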
During the training of the BERT model, the Masked LM and Next Sentence Prediction objectives are trained together. The model aims to minimize the combined loss function of the Masked LM and Next Sentence Prediction tasks.
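A rough sketch of this joint objective with BertForPreTraining (checkpoint name and toy labels are assumptions): when both label sets are supplied, the returned loss is the sum of the MLM loss and the NSP loss.

import torch
from transformers import BertTokenizer, BertForPreTraining

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

inputs = tokenizer("Paris is the capital of [MASK].", "It lies on the Seine.",
                   return_tensors="pt")

# MLM labels: -100 means "ignore this position"; only the masked token gets a target.
mlm_labels = torch.full_like(inputs["input_ids"], -100)
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)
mlm_labels[mask_pos] = tokenizer.convert_tokens_to_ids("france")

# NSP label: 0 means the second sentence really follows the first.
outputs = model(**inputs, labels=mlm_labels, next_sentence_label=torch.tensor([0]))
print(outputs.loss)  # combined Masked LM + Next Sentence Prediction loss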
BERT Architecture
BERT BASE has 12 layers in the encoder stack while BERT LARGE has 24 layers in the encoder stack.
The two architectures also differ in hidden size (768 units for BASE and 1024 for LARGE, with feed-forward layers of 3072 and 4096 units respectively) and in the number of attention heads (12 and 16 respectively).
BERT BASE contains 110M parameters while BERT LARGE has 340M parameters.
The model takes the [CLS] token as its first input, followed by the sequence of word tokens; [CLS] is a classification token. The input then passes through the stack of encoder layers: each layer applies self-attention, passes the result through a feed-forward network, and hands its output to the next encoder. For every input token, the model outputs a vector of hidden size (768 for BERT BASE).
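These hyperparameters can be checked against the published configuration; a small sketch using the transformers library (checkpoint name assumed):

from transformers import BertConfig

config = BertConfig.from_pretrained("bert-base-uncased")
print(config.num_hidden_layers)    # 12 encoder layers
print(config.hidden_size)          # 768-dimensional hidden vectors
print(config.num_attention_heads)  # 12 attention heads
print(config.intermediate_size)    # 3072 units in each feed-forward sub-layer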
Classification Task:
o Text can be classified into different categories (e.g. positive/negative/neutral), which can be implemented by adding a classification layer on top of the Transformer output for the [CLS] token.
o The [CLS] token represents the aggregated information from the entire input sequence, which can then serve as the input to the classification layer (see the sketch below).
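A minimal sketch of this setup, assuming a BERT checkpoint already fine-tuned for sentiment (the model name below is an assumption; any BERT-based sentiment classifier would work):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "textattack/bert-base-uncased-SST-2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

# The classification head sits on top of the [CLS] representation.
inputs = tokenizer("The movie was surprisingly good.", return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)
print(probs)  # probability for each sentiment class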
Question Answering:
o BERT is trained for question answering by learning two additional vectors that mark the beginning
and end of the answer. During training, the model is provided with questions and corresponding
passages, and it learns to predict the start and end positions of the answer within the passage.
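A minimal sketch using a BERT model fine-tuned on SQuAD (checkpoint name assumed); the pipeline returns the predicted answer span and its character positions in the passage:

from transformers import pipeline

qa = pipeline("question-answering",
              model="bert-large-uncased-whole-word-masking-finetuned-squad")

result = qa(question="Who proposed BERT?",
            context="BERT was proposed by Jacob Devlin and colleagues at Google.")
print(result["answer"], result["start"], result["end"])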
Named Entity Recognition (NER):
o A BERT-based NER model is trained by taking the output vector of each token from the Transformer and feeding it into a classification layer. The layer predicts the named entity label for each token, indicating the type of entity it represents (see the sketch below).
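A minimal sketch using a publicly available BERT NER checkpoint (the model name is an assumption):

from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

# Each detected span comes back with its entity type and confidence score.
for entity in ner("Barack Obama visited Paris with Google executives."):
    print(entity["word"], entity["entity_group"], round(entity["score"], 3))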
To tokenize and encode text using BERT, we will be using the 'transformers' library in Python, as in the short sketch below.
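The checkpoint name below is an assumption; any BERT tokenizer works the same way.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

text = "BERT understands context."
tokens = tokenizer.tokenize(text)  # WordPiece tokens of the sentence
encoded = tokenizer(text)          # adds [CLS]/[SEP] and maps tokens to integer ids

print(tokens)
print(encoded["input_ids"])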
Applications of BERT
Text Representation: BERT is used to generate word embeddings or representations for words in a sentence.
Named Entity Recognition (NER): BERT can be fine-tuned for named entity recognition tasks, where the goal
is to identify entities such as names of people, organizations, locations, etc., in a given text.
Text Classification: BERT is widely used for text classification tasks, including sentiment analysis, spam
detection, and topic categorization. It has demonstrated excellent performance in understanding and
classifying the context of textual data.
Question-Answering Systems: BERT has been applied to question-answering systems, where the model is
trained to understand the context of a question and provide relevant answers. This is particularly useful for
tasks like reading comprehension.
Machine Translation: BERT’s contextual embeddings can be leveraged for improving machine translation
systems. The model captures the nuances of language that are crucial for accurate translation.
Text Summarization: BERT can be used in text summarization systems, most directly for extractive summarization and, as the encoder in an encoder-decoder model, for abstractive summarization, producing concise and meaningful summaries of longer texts by understanding the context and semantics.
Conversational AI: BERT is employed in building conversational AI systems, such as chatbots, virtual
assistants, and dialogue systems. Its ability to grasp context makes it effective for understanding and
generating natural language responses.
Semantic Similarity: BERT embeddings can be used to measure semantic similarity between sentences or
documents. This is valuable in tasks like duplicate detection, paraphrase identification, and information
retrieval.
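A rough sketch of one simple way to do this with BERT embeddings (mean-pooling the token vectors is an assumption; other pooling strategies exist):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentence):
    # Mean-pool the contextualized token vectors into one sentence vector.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)

a = embed("A man is playing a guitar.")
b = embed("Someone is strumming an instrument.")
print(torch.cosine_similarity(a, b, dim=0).item())  # closer to 1 means more similar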