NLP - PPT - CH 2
Natural Language Processing
Pushpak Bhattacharyya
Aditya Joshi
Chapter 2
Representation and NLP
Learning Objectives
• Representation in NLP involves the creation of data structures
• These data structures act as the input for specific NLP tasks
Fig 2.1 Representations of ideas for humans and NLP
Ambiguity and Representations
• When a human hears the phrase ‘a galloping horse’, it creates a specific image in the human’s mind
• A vaguer phrase, such as ‘a moving animal’, leaves the idea ambiguous because it does not specify the animal or the nature of its movement
• Depending on the choice of words, the idea becomes more or less ambiguous
• Language representations are at the heart of how we visualize the three generations
of NLP
Generation 1: Belongingness via Grammars
• A language and its grammar are defined via a set of rules that produce valid sentences
• The grammar thus visualizes the language as the set of sentences generated by applying these rules
• A sentence is said to belong to the language if it can be generated via these rules
• Noam Chomsky proposed a hierarchy of language classes, arranged in layers
Representing Method Definitions in Python Using a
Set of Rules
• Python is simpler than a natural language such as English
• After the closing bracket, Python also expects to see a colon
[def <method-name>] …
• This may be followed by more than one argument, separated by commas
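As a rough illustration, such a rule can be approximated with a regular expression. A minimal sketch, assuming a simplified def-header pattern (the pattern and example strings are illustrative, not the book’s own code):

```python
import re

# A simplified pattern for a Python method definition:
# 'def', a method name, a bracketed (possibly empty) comma-separated
# argument list, and a terminating colon.
DEF_RULE = re.compile(r"^def\s+\w+\(\s*(\w+(\s*,\s*\w+)*)?\s*\)\s*:$")

for line in ["def add(a, b):", "def greet():", "def broken(a, b"]:
    print(line, "->", bool(DEF_RULE.match(line)))
```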
Representing Simple English Sentences as a Set of
Rules
• The grammar for method definitions in Python seemed simple
• We said that the word def needs to be followed by the name of the function
• Consider the following set of sentences: ‘The boy eats rice’, ‘The girl eats rice’, ‘The boy drinks milk’, ‘The girl drinks milk’
• A sentence that belongs to this language must be derivable from the start symbol S
• The words appearing in these sentences are the terminal symbols of the language
• So, we define the first step as:
S → The …
S → The A …
A → boy
A → girl
• The third word can be ‘eats’ or ‘drinks’
• Let us define a non-terminal DRINKING that generates ‘drinks milk’ and another EATING that generates ‘eats rice’
• The grammar then looks as follows:
S → The A EATING
S → The A DRINKING
A → boy
A → girl
EATING → eats rice
DRINKING → drinks milk
• We can see that the rules above are able to generate the four sentences in the set
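A minimal sketch of how this toy grammar can be encoded and used to enumerate the language (variable and function names are illustrative):

```python
from itertools import product

# Production rules for the toy grammar; symbols not listed here are terminals.
RULES = {
    "S": [["The", "A", "EATING"], ["The", "A", "DRINKING"]],
    "A": [["boy"], ["girl"]],
    "EATING": [["eats", "rice"]],
    "DRINKING": [["drinks", "milk"]],
}

def expand(symbol):
    """Return all terminal strings derivable from a symbol."""
    if symbol not in RULES:          # terminal symbol
        return [symbol]
    results = []
    for rhs in RULES[symbol]:
        # Expand every symbol on the right-hand side and combine the options.
        for combo in product(*(expand(s) for s in rhs)):
            results.append(" ".join(combo))
    return results

print(expand("S"))
# ['The boy eats rice', 'The girl eats rice', 'The boy drinks milk', 'The girl drinks milk']
```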
Chomsky Hierarchy
• Noam Chomsky presented an idea of generative grammars for languages
• Each level of the hierarchy imposes restrictions on the rules that can be a part of that level
• Each level of hierarchy represents languages that comprise a certain set of sentences
• The levels differ in these restrictions, often relaxing the restrictions progressively
Fig 2.2 Chomsky’s hierarchy of languages
• Here, we can describe the levels of the hierarchy from Type-3 (the most restrictive) to Type-0 (the least restrictive)
• Type-3 (regular) grammar allows only a limited set of rules
• Example:
i. Consider the generation rules for a fictitious language, which we call the ‘laughing language of mobile phone users’:
S → Ha
S → Ha S
S → He S
This language is assumed to be used by mobile phone users to express that they are
laughing
(More modern forms such as ‘lol’ and ‘lmao’ are ignored for the sake of explanation)
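Since Type-3 grammars correspond to regular languages, this toy language can also be recognized with a regular expression. A minimal sketch (the example strings are illustrative):

```python
import re

# The rules S -> Ha, S -> Ha S and S -> He S generate any sequence of
# 'Ha'/'He' tokens that ends in 'Ha', i.e. the regular language (Ha|He)*Ha.
LAUGH = re.compile(r"^(Ha|He)*Ha$")

for text in ["Ha", "HaHaHa", "HeHeHa", "HeHe", "lol"]:
    print(text, "->", bool(LAUGH.match(text)))
```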
• Type-2 (context-free) grammar allows a larger set of rules than Type-3
S → very S
S → good|bad|terrible|excellent S
S → S C U N
C → red|pink|green|blue
U → handy|helpful|clunky
U → ε
N → bag
• Type-1 (context-sensitive) grammar further relaxes restrictions on production rules
• The grammar above is able to generate phrases like ‘a tourist visa’, ‘diplomatic
Indian passport’, etc
Applications
• Generative grammars have been applied to determine the syntactic correctness of sentences in parsing
Generation 2: Discrete Representational Semantics
n-Gram Vectors
• We first look at unigram vectors, and then at their extensions to n-grams and some limitations of n-gram vectors in general
Unigrams
• The value of the variable corresponding to the word ‘I’ is 1 or True for all four sentences
• Now, consider the word ‘skipped’; a sentence will be represented as a set of random variables
• Here, the words present in the sentence are 1 and those not present in the
sentence are 0
• In the dummy corpus mentioned, the vocabulary is (I, skipped, my, breakfast, today,
ate, lunch, yesterday)
• The zero at the second place indicates that the word ‘skipped’ is absent in the
sentence
• The 1 in the sixth place is different from the first sentence since it indicates the word
‘ate’
• These vectors are known as unigram vectors since each element in the vector corresponds to a unigram (i.e., a word in the corpus)
Implementation of Unigrams
To implement a unigram vector, there are two steps, as shown in Figure 2.3:
Step 1: Create a vocabulary of words in the corpus.
The set of unique words in the corpus is called the vocabulary.
Step 2: Create an index indicating the position of each word.
Then, each sentence can be represented as a vector of 1’s
and 0’s corresponding to the position as indicated by the
index.
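A minimal sketch of these two steps in Python, assuming two sentences consistent with the vocabulary listed earlier (the second sentence and helper names are illustrative):

```python
# Step 1: build the vocabulary and an index for each word.
corpus = ["I skipped my breakfast today", "I ate my lunch yesterday"]
vocabulary = ["I", "skipped", "my", "breakfast", "today", "ate", "lunch", "yesterday"]
index = {word: i for i, word in enumerate(vocabulary)}

# Step 2: represent each sentence as a vector of 1's and 0's.
def unigram_vector(sentence):
    vector = [0] * len(vocabulary)
    for word in sentence.split():
        if word in index:
            vector[index[word]] = 1
    return vector

for sentence in corpus:
    print(sentence, "->", unigram_vector(sentence))
# 'I skipped my breakfast today' -> [1, 1, 1, 1, 1, 0, 0, 0]
# 'I ate my lunch yesterday'     -> [1, 0, 1, 0, 0, 1, 1, 1]
```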
n-grams
• An n-gram refers to a set of n words that appear consecutively.
• Example:
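As a minimal sketch, assuming the sentence ‘I skipped my breakfast today’ from the dummy corpus, the bigrams (n = 2) can be extracted as follows:

```python
def ngrams(sentence, n):
    words = sentence.split()
    # Each n-gram is a run of n consecutive words.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("I skipped my breakfast today", 2))
# ['I skipped', 'skipped my', 'my breakfast', 'breakfast today']
```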
Caveats
• In an n-gram vector, the components of the vector are Boolean values: an n-gram is either present or not present
• When an n-gram can occur multiple times in a sentence, the components may instead represent counts of n-grams rather than their presence
• This latter representation, where the components are counts, is referred to as n-gram count vectors
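A sketch of both variants using scikit-learn’s CountVectorizer (the example sentence is an assumption; ngram_range=(1, 2) extracts unigrams and bigrams, and binary=True switches from counts to presence/absence):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the boy eats rice and the girl eats rice"]

# Count vectors: each component is how often the n-gram occurs.
counts = CountVectorizer(ngram_range=(1, 2))
print(counts.fit_transform(docs).toarray())

# Boolean vectors: each component only records presence/absence.
boolean = CountVectorizer(ngram_range=(1, 2), binary=True)
print(boolean.fit_transform(docs).toarray())
```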
Limitations
• The n-gram vectors preserve the contiguous presence of words as indicated by the
value of n
• However, beyond that, they do not preserve the order of words in a sentence
Statistical Language Models
• The second generation of NLP is characterized by the use of probability and, in turn,
statistical language models
• In other words, the goal here is to estimate the probability of a given word
sequence
• Using the chain rule of probability, the probability can be expressed as:
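In standard form, for a word sequence w_1, w_2, ..., w_n:

$$P(w_1, w_2, \ldots, w_n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \ldots, w_{n-1}) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})$$

An n-gram language model approximates each conditional probability using only the preceding n − 1 words; for example, a bigram model uses P(w_i | w_{i-1}).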
Smoothing
• The question focuses on the notion of an impossible event in probability and posits that an impossible event has a non-zero probability of occurrence.
• A statistical language model assigns zero probability to patterns it has not encountered in the training corpus
• For a corpus with a large vocabulary, these zero probabilities can result in some
sentences being predicted as ‘impossible’
• Two intuitions are at the heart of smoothing; common techniques that build on them include the following (a worked form of the first is shown after the list):
i. Add-one smoothing
ii. Additive/Add-k smoothing
iii. Interpolation-based smoothing
iv. Backoff
v. Kneser–Ney smoothing
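As an illustration of the first technique, add-one (Laplace) smoothing for a bigram model adds 1 to every count so that unseen bigrams receive a small, non-zero probability (C(·) denotes a count in the training corpus and V is the vocabulary size):

$$P_{\text{add-1}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) + 1}{C(w_{i-1}) + V}$$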
Use of Statistical Language Modelling
“Statistical machine translation (SMT) is a sub-field of NLP that uses statistical models
to translate text from a language (known as source language) to another (known as
target language)”
• Statistical language models have also been used in information retrieval for
language identification of documents
Generation 3: Dense Representations
• The third generation of NLP, namely the neural generation of NLP, is characterized
by dense representations of text
• Unigram vectors are ‘sparse’ because, for a dataset with a large vocabulary, most elements of a sentence’s vector are zero
Dense Representation of Words
“The goal of models that learn these word embeddings is to create representations of
words based on a dataset of documents where the words co-occur with each other”
• Two models for learning word embeddings have been proposed in an early
algorithm called Word2vec
• Word2vec learns vectors for words such that their cosine similarities are likely to
capture their meaning
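The similarity between two word vectors A and B is measured with the standard cosine similarity:

$$\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert\, \lVert B \rVert} = \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2}\, \sqrt{\sum_i B_i^2}}$$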
• A and B represent word vectors for the two words being compared
• Word vectors using Word2vec can be learned using two algorithms:
i. Skip-gram model: Given a word, predict the words that are likely to appear in its
context (i.e., around it)
ii. Continuous bag-of-words model (CBOW): Given a context (i.e., a set of words that
appear in a certain sequence around a given position), predict the word at the
position
Fig 2.4 Skip-gram and continuous bag-of-words models for word representations
Training Word Vectors
• Two strategies have been suggested to optimize the computation of the word
vectors:
Hierarchical softmax
Fig 2.5 Detailed skip-gram architecture to learn word representation
Negative sampling
Using Word Vectors to Represent Text
• Consider the following code snippet that trains Word2vec for a dummy dataset as
shown next:
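A minimal sketch of such a snippet using the gensim library (the dummy sentences and parameter values are assumptions for illustration):

```python
from gensim.models import Word2Vec

# A dummy dataset: each sentence is a list of tokens.
sentences = [
    ["I", "skipped", "my", "breakfast", "today"],
    ["I", "ate", "my", "lunch", "yesterday"],
]

# Train Word2vec; vector_size is the embedding dimension, window is the
# context size, and sg=1 selects the skip-gram model (sg=0 is CBOW).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

# The learned vector for a word, and its most similar words in the corpus.
print(model.wv["breakfast"])
print(model.wv.most_similar("breakfast"))
```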
• Here, an object of the Word2vec class is created with sentences as the input text
• Here, the name of the model is word2vec-google-news-300, indicating that it was trained on the Google News dataset with vectors of size 300
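Such a pre-trained model can be loaded, for example, through gensim’s downloader API; a minimal sketch (the download is large):

```python
import gensim.downloader as api

# Load 300-dimensional Word2vec vectors pre-trained on the Google News corpus.
vectors = api.load("word2vec-google-news-300")

print(vectors["breakfast"].shape)              # (300,)
print(vectors.most_similar("breakfast", topn=3))
```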
Neural Language Models
• The Word2vec algorithm learns vectors for words based on their context
• The limitations of word vector models gave way to neural language models in the
third generation of NLP
Transformer Architecture
• Attention is at the heart of a BERT encoder, which is based on the Transformer
architecture shown below:
Attention and multi-headed attention
• Attention is a powerful concept that breaks out of the sequential nature of text
processing
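Concretely, the scaled dot-product attention of the Transformer computes, for query, key and value matrices Q, K and V (with key dimension d_k):

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$

Multi-headed attention applies this computation in parallel over several learned projections of Q, K and V and concatenates the results.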
• The input sentences are represented to the architecture using two key ideas:
position encoding and tokenization
Positional Encoding
• Positional encoding provides a unique representation for each position in the sentence
• Examples of position encodings are given as follows:
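A standard example is the sinusoidal encoding of the original Transformer, where pos is the position, i indexes the embedding dimension and d_model is the embedding size:

$$PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$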
• Here, for even dimension indices, the sine component is used, while for odd dimension indices, the cosine component is used
Tokenization
i. WordPiece tokenizer
ii. Byte-pair encoding
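A minimal sketch of WordPiece tokenization using the Transformers library (the model name bert-base-uncased and the example sentence are assumptions; exact sub-word splits depend on the trained vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Rare or unseen words are split into sub-word pieces; continuation
# pieces are marked with the '##' prefix in WordPiece vocabularies.
print(tokenizer.tokenize("The unbelievably tasty breakfast"))

# encode() also adds the special [CLS] and [SEP] tokens around the input.
ids = tokenizer.encode("The unbelievably tasty breakfast")
print(tokenizer.convert_ids_to_tokens(ids))
```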
Representation of Word and Sentence
• The weights of layers in the encoder and decoder are updated during optimization
Bidirectional Encoder Representations from Transformers (BERT)
• The encoder learns the representation of the input sentence by leveraging multi-
head attention
• Thus, the name contains the phrase ‘encoder representations from Transformers’
Objective Functions of BERT
Words in a sentence are randomly masked (i.e., replaced by a special token known as [MASK])
Pre-Training Using BERT
Fine-Tuning of BERT
“A key utility of BERT is the ability to fine-tune it for learning tasks such as the
automatic detection of sentiment”
• The process has been referred to as the ‘downstream application’ of the language
model
• The labelled dataset is converted into the representation with [CLS] and [SEP]
tokens
Variants of BERT
• Several variants of BERT have been reported to extend the capability of BERT
• RoBERTa trains BERT with larger batch sizes and more data, and removes the next-sentence prediction objective
Sample Code for BERT
• The Transformers library developed by HuggingFace is a popular library for using Transformer-based models
• The top five words returned are ‘bed’, ‘work’, ‘sleep’, ‘him’, and ‘class’
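A minimal sketch of such a fill-mask query with the Transformers pipeline (the model name and input sentence here are assumptions, so the exact predictions may differ from those listed above):

```python
from transformers import pipeline

# A masked-language-model pipeline backed by a pre-trained BERT model.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Ask BERT for the most likely words at the [MASK] position.
for prediction in fill_mask("I did not want to go to [MASK] today."):
    print(prediction["token_str"], round(prediction["score"], 3))
```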
• Let us now see how vectors of text can be obtained using the library
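A minimal sketch (the model name and pooling choice are assumptions; here the vector of the [CLS] token is used as the sentence representation):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("I skipped my breakfast today", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.last_hidden_state holds one vector per token;
# the first token ([CLS]) is often used as the sentence vector.
token_vectors = outputs.last_hidden_state      # shape: (1, num_tokens, 768)
sentence_vector = token_vectors[:, 0, :]       # shape: (1, 768)
print(token_vectors.shape, sentence_vector.shape)
```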
XLNet
Thank you