
Natural Language Processing
Pushpak Bhattacharyya
Aditya Joshi

Chapter 2
Representation and NLP

Copyright © 2023 by Wiley India Pvt. Ltd.


Chapter 2 Representation and NLP

• 2.1 Ambiguity and Representations


• 2.2 Generation 1: Belongingness via Grammars
• 2.3 Generation 2: Discrete Representational Semantics
• 2.4 Generation 3: Dense Representations

2 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
Learning Objectives

• Define grammars to represent language

• Understand statistical and neural language models

• Implement models to train and/or use neural representations

• Describe Transformers and underlying concepts

3 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
• Representation in NLP involves the creation of data structures

• It captures the belongingness of a text to a language

• These ‘data structures’ act as the input for specific NLP tasks

• The validity of a sentence in a language can be determined by its representation

• The validity of a sentence at the syntactic level is also called grammaticality

Thus, different kinds of representations make assumptions and simplifications about ambiguity in order to explicitly or implicitly construct a ‘language model’

4 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
Fig 2.1 Representations of ideas for humans and NLP

5 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
Ambiguity and Representations

• When a human hears the phrase ‘a galloping horse’, it creates a specific image in the
human’s mind

• The phrase ‘moving animal’ is more ambiguous because it specifies neither the animal nor the nature of its movement

• A ‘moving animal’ could be a ‘crawling snake’ or a ‘flying dove’

• Depending on the choice of words, the idea becomes more or less ambiguous

6 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
• Language representations are at the heart of how we visualize the three generations
of NLP

• This is due to their distinctive ways of resolving ambiguity and capturing belongingness

• The first generation uses grammar to model language

• The second generation uses probabilistic modelling of a sequence of words

• The third generation of NLP uses belongingness of sentences as a composition of belongingness over words

7 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
Generation 1: Belongingness via Grammars

• Language and its grammar are defined as a set of rules that produce valid sentences

• Grammar visualizes language as a set of rules that are applied to generate sentences

• A sentence is said to belong to the language if it can be generated via these rules

“This interplay of belongingness and generation is at the heart of representation approaches in this generation of NLP”

8 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
• Noam Chomsky proposed a set of languages, hierarchically arranged in layers

• Here, one layer relaxes certain assumptions over the next

• This is known as Chomsky’s hierarchy (Chomsky, 1956)

• It is the foundation of programming languages and compilers

• These grammars also manifest the relationship between programming languages and natural languages

9 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
Representing Method Definitions in Python Using a
Set of Rules
• Python is simpler than a natural language such as English

• It has a significantly smaller vocabulary and a restricted number of structures

• A method contains a set of Python commands that can be executed by referring to its name

• A keyword for method definition is def

• The method name is followed by the arguments of the method, enclosed within brackets

10 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
• At the end of the brackets, Python also expects to see a colon

• This collectively forms the definition of a method such as:

def func1(a, b):

• The next word is the name of the method, indicated by <method-name>

[def <method-name>] …

11 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
• This may be followed by more than one argument, separated by commas

[def <method-name> (<argument-name> {, <argument-name>}*):]

• So, if the language is L, it can be defined by the following grammar:

L → def <method-name> (<argument-name> {, <argument-name>}*):

12 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
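• As an illustration, the rule above can be checked with a regular expression; the following is a minimal sketch (not the book’s code) that covers only this simple form, not full Python syntax:

import re

# A minimal sketch: a regular expression mirroring the rule
# L -> def <method-name> (<argument-name> {, <argument-name>}*):
name = r"[A-Za-z_][A-Za-z_0-9]*"
method_def = re.compile(rf"def\s+{name}\s*\(\s*{name}(\s*,\s*{name})*\s*\)\s*:")

print(bool(method_def.fullmatch("def func1(a, b):")))  # True
print(bool(method_def.fullmatch("def func1 a, b:")))   # False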
Representing Simple English Sentences as a Set of
Rules
• The grammar for method definitions in Python seemed simple

• This is because the syntax was restricted

• We said that the word def needs to be followed by the name of the function

• Can we say something similar about a specific sub-set of English?

• Let us say that the first word of a sentence is always ‘The’

• The word ‘The’ may be followed by an adjective or a noun

13 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
• Consider the following set of sentences:

i. The boy eats rice


ii. The girl eats rice
iii. The boy drinks milk
iv. The girl drinks milk

• To begin with, let the start symbol be S

• So, a sentence that belongs to this language must start with the symbol S

• What are the terminals of the language?

• These are the symbols we see evidenced in the sentences of the language

14 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
• So, we define the first step as:
S → The …

• The word ‘The’ can be followed by either ‘boy’ or ‘girl’

• Let us define a non-terminal A which generates one among the two

• We also extend the first rule above

S → The A …
A → boy
A → girl

15 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
• The third word can be ‘eats’ or ‘drinks’

• The fourth word can be ‘milk’ or ‘rice’

• Let us define a non-terminal DRINKING that generates ‘drinks milk’ and another EATING that generates ‘eats rice’

• Note that we used a longer string ‘DRINKING’ as a non-terminal instead of a single letter

• This is primarily to make the grammar readable

16 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
• The grammar then looks as follows:

S → The PERSON ACTION


PERSON → boy | girl
ACTION → EATING | DRINKING
EATING → eats EATABLE
DRINKING → drinks DRINKABLE
EATABLE → rice
DRINKABLE → milk

• The symbol ‘|’ indicates OR

• We can see that the rules above are able to generate the four sentences in the set

17 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
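• As an illustration, the grammar above can be encoded and used for membership checking with the NLTK library; the following is a minimal sketch (not the book’s code), assuming nltk is installed:

import nltk

# The toy grammar above, expressed in NLTK's CFG notation.
grammar = nltk.CFG.fromstring("""
S -> 'The' PERSON ACTION
PERSON -> 'boy' | 'girl'
ACTION -> EATING | DRINKING
EATING -> 'eats' EATABLE
DRINKING -> 'drinks' DRINKABLE
EATABLE -> 'rice'
DRINKABLE -> 'milk'
""")

parser = nltk.ChartParser(grammar)
# A sentence belongs to the language if the parser finds at least one tree.
sentence = "The boy eats rice".split()
print(any(True for _ in parser.parse(sentence)))  # True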
Chomsky Hierarchy
• Noam Chomsky presented an idea of generative grammars for languages

• Chomsky presented a hierarchy of grammars used to represent languages

• Each level of the hierarchy imposes restrictions on the rules that can be a part of that level

• Each level of the hierarchy represents languages that comprise a certain set of sentences

• The levels differ in these restrictions, often relaxing the restrictions progressively

18 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
Fig 2.2 Chomsky’s hierarchy of languages

19 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
• Here, we can describe the levels of hierarchy from 3 to 0

• Note that Type-3 languages are a proper sub-set of Type-2 languages

• Type-3 rules are also Type-2 rules, but not vice-versa

• Similarly, Type-2 grammar is obtained by restricting rules in Type-1

• And Type-1 by restricting rules in Type-0

• This results in a hierarchical structure

20 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
• Type-3 grammar allows a limited set of rules

• Example:

i. Consider the generation rules for a fictitious language we call the ‘laughing language of mobile phone users’:
S → Ha
S → Ha S
S → He S

This language is assumed to be used by mobile phone users to express that they are
laughing

(More modern forms such as ‘lol’ and ‘lmao’ are ignored for the sake of explanation)

21 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
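• Since Type-3 grammars describe regular languages, membership in the laughing language can also be checked with a regular expression; a minimal sketch (an assumption, not the book’s code):

import re

# The rules S -> Ha | Ha S | He S generate sequences of 'Ha'/'He' tokens
# that end in 'Ha'.
laughing = re.compile(r"(?:Ha|He)*Ha")

for s in ["Ha", "HeHa", "HaHeHa", "He"]:
    print(s, bool(laughing.fullmatch(s)))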
• Type-2 grammar allows a larger set of rules than Type-3

• Let us consider a fictitious language called ‘adjective banquet language’:

S → very S
S → good|bad|terrible|excellent S
S → C U N
C → red|pink|green|blue
U → handy|helpful|clunky
U → ε
N → bag

• ε refers to the null (empty) string

• Here, the language generates an adjectival phrase describing a ‘bag’, with a sequence of adjectives

22 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
• Type-1 grammar further relaxes restrictions on production rules

• It allows the head to be a combination of terminals and non-terminals

• Therefore, Type-1 grammar allows rules like:


S → B passport
S → a B visa
B → diplomatic
B passport → diplomatic Indian passport
B passport → diplomatic Australian passport
a B visa → a T visa
T → tourist|work

• The grammar above is able to generate phrases like ‘a tourist visa’, ‘diplomatic
Indian passport’, etc

23 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
Applications
• Grammar-based representations have been applied to determine the syntactic correctness of sentences in parsing

• They have also been applied to assess the semantic validity of sentences in information extraction

• Finally, grammar-based representations have been applied to natural language generation too

24 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
Generation 2: Discrete Representational Semantics
n-Gram Vectors

• n-gram vectors can be used to represent text as a bag of words

• Let us look at unigram vectors first and understand how they are implemented

• Then we look at their extensions to n-grams and some limitations of n-gram vectors
in general

25 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
Unigrams

• Consider a dummy corpus consisting of four sentences:

i. I skipped my breakfast today


ii. I ate my breakfast today
iii. I ate my lunch yesterday
iv. I skipped my lunch today

• In order to make these sentences readable by a learning algorithm, a typical method is to create a random variable for each unique word in the corpus

• Let us consider the word ‘I’

• The value of the variable corresponding to the word ‘I’ is 1 or True for all four sentences
27 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
• Similarly, a variable is created for the word ‘skipped’, and a sentence is represented as a set of such random variables

• Here, the words present in the sentence are 1 and those not present in the
sentence are 0

• In the dummy corpus mentioned, the vocabulary is (I, skipped, my, breakfast, today,
ate, lunch, yesterday)

• Thus, the first sentence is represented as (1, 1, 1, 1, 1, 0, 0, 0)

• The second sentence is represented as (1, 0, 1, 1, 1, 1, 0, 0)

• The zero at the second place indicates that the word ‘skipped’ is absent in the
sentence
28 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
• The 1 in the sixth place is different from the first sentence since it indicates the word
‘ate’

• The third sentence will be represented as (1, 0, 1, 0, 0, 1, 1, 1)

• The fourth sentence will be represented as (1, 1, 1, 0, 1, 0, 1, 0)

These vectors are known as unigram vectors since each element in the vector
corresponds to a unigram (i.e., a word in the corpus)

29 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
Implementation of Unigrams
To implement a unigram vector, there are two steps as shown
in figure 2.3 :
Step 1: Create a vocabulary of words in the corpus. The set
of unique words in the corpus is called the vocabulary.
Step 2: Create an index indicating the position of each word.
Then, each sentence can be represented as a vector of 1’s
and 0’s corresponding to the position as indicated by the
index.

Fig 2.3 Representation of a unigram vector

30 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
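• A minimal sketch (not the book’s code) of these two steps, using the dummy corpus from this section:

corpus = [
    "I skipped my breakfast today",
    "I ate my breakfast today",
    "I ate my lunch yesterday",
    "I skipped my lunch today",
]

# Step 1: build the vocabulary of unique words.
vocabulary = []
for sentence in corpus:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)

# Step 2: index each word and represent each sentence as a 0/1 vector.
index = {word: i for i, word in enumerate(vocabulary)}
vectors = []
for sentence in corpus:
    vector = [0] * len(vocabulary)
    for word in sentence.split():
        vector[index[word]] = 1
    vectors.append(vector)

print(vocabulary)  # ['I', 'skipped', 'my', 'breakfast', 'today', 'ate', 'lunch', 'yesterday']
print(vectors[0])  # [1, 1, 1, 1, 1, 0, 0, 0]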
n-grams
• An n-gram refers to a sequence of n words that appear consecutively.

• Example:

 Consider the sentence ‘I skipped my breakfast today’

 This sentence contains the bigrams: I-skipped, skipped-my, my-breakfast, and breakfast-today

 Similarly, it contains the trigrams: I-skipped-my, skipped-my-breakfast, and my-breakfast-today

31 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
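• A minimal sketch (not the book’s code) of extracting such n-grams:

# Extract n-grams as hyphen-joined strings of n consecutive words.
def ngrams(sentence, n):
    words = sentence.split()
    return ["-".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "I skipped my breakfast today"
print(ngrams(sentence, 2))  # ['I-skipped', 'skipped-my', 'my-breakfast', 'breakfast-today']
print(ngrams(sentence, 3))  # ['I-skipped-my', 'skipped-my-breakfast', 'my-breakfast-today']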
Caveats

• In an n-gram vector, the components of the vector are Boolean values: a word is either present or not present

• What happens if a word is present in the sentence more than once?

• In such cases, components in the vector may represent counts of words instead of
their presence

• The earlier representation, where words are represented as Boolean values, can be referred to as n-gram presence vectors

• The representation using counts can be referred to as n-gram count vectors

32 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
Limitations

• The n-gram vectors preserve the contiguous presence of words as indicated by the
value of n

• However, beyond that, they do not preserve the order of words in a sentence and merely rely on the presence of words in a document

33 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
Statistical Language Models
• The second generation of NLP is characterized by the use of probability and, in turn,
statistical language models

• They formulate language models in the form of probabilities

• In other words, the goal here is to estimate the probability P(w1, w2, …, wn) of a given word sequence

• Here, wi is the word at position i

• Using the chain rule of probability, the probability can be expressed as:

P(w1, w2, …, wn) = P(w1) P(w2 | w1) P(w3 | w1, w2) … P(wn | w1, …, wn-1)

34 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
Smoothing

• The sunrise problem was discussed by the French scholar and polymath Pierre-Simon Laplace

• What is the probability that the sun will rise tomorrow?

• The question focuses on the notion of a seemingly impossible (i.e., never observed) event in probability and posits that such an event should still have a non-zero probability of occurrence.

• Smoothing does the same for probabilistic language models.

35 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
Smoothing

• A statistical language model encounters zero probabilities for patterns it has not
encountered in the training corpus

• For a corpus with a large vocabulary, these zero probabilities can result in some
sentences being predicted as ‘impossible’

• However, they may be feasible (in terms of grammar and/or semantics).

• Smoothing ‘smoothes’ some of these probabilities

35 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
• Two intuitions are at the heart of smoothing:

i. Observed events are not as likely as you think


ii. Unobserved events are not as unlikely as you think

• Some methods of smoothing are as follows:

i. Add-one smoothing
ii. Additive/Add-k smoothing
iii. Interpolation-based smoothing
iv. Backoff
v. Kneser–Ney smoothing

36 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
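• As an illustration of the simplest of these, the following is a minimal sketch (not the book’s code) of a bigram language model with add-one smoothing on an assumed toy corpus:

from collections import Counter, defaultdict

# Assumed toy corpus for illustration.
corpus = [
    "I skipped my breakfast today",
    "I ate my breakfast today",
    "I ate my lunch yesterday",
    "I skipped my lunch today",
]

unigram_counts = Counter()
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    unigram_counts.update(words)
    for w1, w2 in zip(words, words[1:]):
        bigram_counts[w1][w2] += 1

vocab_size = len(unigram_counts)

def bigram_prob(w1, w2):
    # Add-one smoothing: unseen bigrams get a small non-zero probability.
    return (bigram_counts[w1][w2] + 1) / (unigram_counts[w1] + vocab_size)

print(bigram_prob("my", "breakfast"))   # seen bigram
print(bigram_prob("my", "yesterday"))   # unseen bigram, still non-zero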
Use of Statistical Language Modelling

“Statistical machine translation (SMT) is a sub-field of NLP that uses statistical models
to translate text from a language (known as source language) to another (known as
target language)”

• SMT uses language models during the decoding step

• The language model helps determine the right order of words in the target language

• Statistical language models have also been used in information retrieval for
language identification of documents

37 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
Generation 3: Dense Representations

• The third generation of NLP, namely the neural generation of NLP, is characterized
by dense representations of text

• The word ‘dense’ here is in contrast with ‘sparse’ unigram vectors

• Unigram vectors are ‘sparse’ because, for a dataset with a large vocabulary, most elements of a vector are zero

38 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
Dense Representation of Words

• Dense representations of words map them to points in a k-dimensional space

• A word is no longer represented as a random variable but as a vector in itself

• As the representation can be viewed as embedding the words in a k-dimensional space, these vectors are referred to as word embeddings

“The goal of models that learn these word embeddings is to create representations of
words based on a dataset of documents where the words co-occur with each other”

39 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
• Two models for learning word embeddings have been proposed in an early
algorithm called Word2vec

• Word2vec learns vectors for words such that their cosine similarities are likely to
capture their meaning

• The formula for cosine similarity is as follows:

cosine(A, B) = (A · B) / (|A| |B|)

• A and B represent word vectors for the two words being compared

40 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
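• A minimal sketch of computing cosine similarity with NumPy (the vectors below are assumed toy values, not learned embeddings):

import numpy as np

# Cosine similarity between two word vectors A and B.
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([0.2, 0.7, 0.1])    # assumed toy vectors
b = np.array([0.25, 0.6, 0.2])
print(cosine_similarity(a, b))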
• Word vectors using Word2vec can be learned using two algorithms:

i. Skip-gram model: Given a word, predict the words that are likely to appear in its
context (i.e., around it)

ii. Continuous bag-of-words model (CBOW): Given a context (i.e., a set of words that
appear in a certain sequence around a given position), predict the word at the
position

41 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
Fig 2.4 Skip-gram and continuous bag-of-words models for word representations

42 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
Training Word Vectors

• Two strategies have been suggested to optimize the computation of the word
vectors:

 Hierarchical softmax

i. The first step is to compute a Huffman tree from the dataset.

ii. A Huffman tree is used to minimize the encoding length of words

iii. It satisfies the following properties:


a. Leaf nodes are unique words in the vocabulary
b. Words that are frequent are closer to the root
c. There exists a unique path from the root to the leaf nodes

43 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
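• A minimal sketch (not the book’s code) of how Huffman code lengths can be computed from word frequencies, illustrating that frequent words end up closer to the root:

import heapq
from collections import Counter

def huffman_depths(word_counts):
    # Each heap entry: (frequency, tie_breaker, {word: depth_in_subtree})
    heap = [(count, i, {word: 0}) for i, (word, count) in enumerate(word_counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {w: depth + 1 for w, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]  # depth of each leaf word from the root

counts = Counter("the cat sat on the mat the cat".split())
print(huffman_depths(counts))  # rarer words tend to sit deeper in the tree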
Fig 2.5 Detailed skip-gram architecture to learn word representation

44 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
 Negative sampling

• Instead of learning P(word|context) or P(context|word), an alternative approach is to decompose it as a classification problem

• This is done in the case of negative sampling

Fig 2.6 Huffman tree to learn hierarchical softmax for word representations

45 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
Using Word Vectors to Represent Text

• The Word2vec algorithm learns vectors for words

• One way to train word vectors is to use the gensim library

• Consider the following code snippet that trains Word2vec for a dummy dataset as
shown next:

46 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
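• A minimal sketch of such training with gensim (the dummy dataset below is an assumption, not the book’s exact snippet; gensim 4.x API):

from gensim.models import Word2Vec

# Assumed dummy dataset: a list of tokenized sentences.
sentences = [
    ["I", "skipped", "my", "breakfast", "today"],
    ["I", "ate", "my", "breakfast", "today"],
    ["I", "ate", "my", "lunch", "yesterday"],
    ["I", "skipped", "my", "lunch", "today"],
]

# Train Word2Vec; sg=1 selects the skip-gram algorithm (sg=0 gives CBOW),
# vector_size sets the dimensionality k of the word embeddings.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["breakfast"].shape)                 # (50,)
print(model.wv.similarity("breakfast", "lunch"))   # cosine similarity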
• Here, an object of the Word2Vec class is created with the sentences as the training text

47 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
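• Pretrained vectors can also be loaded through gensim’s downloader API; a minimal sketch (assumed usage, downloads a large file on first use):

import gensim.downloader

# Load pretrained word vectors by name.
vectors = gensim.downloader.load("word2vec-google-news-300")
print(vectors["computer"].shape)                 # (300,)
print(vectors.most_similar("computer", topn=3))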
• Here, the name of the model is word2vec-google-news-300 indicating that it was
trained on the news dataset with vectors of size 300

48 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
Neural Language Models

• The Word2vec algorithm learns vectors for words based on their context

• However, every word has exactly one vector

• The limitations of word vector models gave way to neural language models in the
third generation of NLP

• Broadly speaking, there are two categories of neural language models:

i. Auto-regressive language models


ii. Auto-encoding language models

49 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
50 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
Transformer Architecture
• Attention is at the heart of a BERT encoder, which is based on the Transformer
architecture shown below:

Fig 2.7 Transformer architecture

51 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
Attention and multi-headed attention
• Attention is a powerful concept that breaks out of the sequential nature of text
processing

Fig 2.8 Examples of attention in Transformer: (a) Scaled dot-product attention, (b) Multi-head attention

52 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
• The input sentences are represented to the architecture using two key ideas:
position encoding and tokenization

Fig 2.9 Encoder of Transformer architecture

53 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
Positional Encoding

• Positional encoding accounts for a unique representation for each position in the
sentence

• One-hot position vectors can be used as positional representations

• Similar representations that use real-valued values based on the relative position of the word in the document can also be used

• These are called fixed representations

54 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
• Examples of position encodings are given as:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

• Here, for even dimension indices the sine component is used, while for odd dimension indices the cosine component is used

55 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
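• A minimal sketch (not the book’s code) of computing these sinusoidal positional encodings with NumPy:

import numpy as np

# pos is the word position; i indexes the embedding dimension (d_model total).
def positional_encoding(max_len, d_model):
    pe = np.zeros((max_len, d_model))
    positions = np.arange(max_len)[:, None]                     # (max_len, 1)
    div_terms = 10000 ** (np.arange(0, d_model, 2) / d_model)   # (d_model/2,)
    pe[:, 0::2] = np.sin(positions / div_terms)   # sine on even dimensions
    pe[:, 1::2] = np.cos(positions / div_terms)   # cosine on odd dimensions
    return pe

print(positional_encoding(max_len=6, d_model=8).shape)  # (6, 8)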
Tokenization

• The input to the Transformer is provided in a tokenized manner

• The words in a sentence are split into tokens

• Tokens are sub-word units, i.e., parts of a word

• Common methods of tokenization are:

i. WordPiece tokenizer
ii. Byte-pair encoding

56 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
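• A minimal sketch (assumed model name, not the book’s code) of WordPiece tokenization using the HuggingFace Transformers library:

from transformers import AutoTokenizer

# Load a BERT tokenizer, which uses WordPiece tokenization.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("The horses were galloping unbelievably fast")
print(tokens)  # sub-word tokens; pieces inside a word are prefixed with '##'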
Representation of Word and Sentence

• The representation of a word is an addition of its token embedding and positional encoding

• The representation of a sentence is a concatenation of its word representations

• Representations (or vectors) of tokens and special tokens may be randomly initialized, and their values updated as a part of training

• The Transformer is trained on input–output pairs of sentences, where the weights of layers in the encoder and decoder are updated for optimization

57 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
Bidirectional Encoder Representations from Transformers (BERT)

• BERT uses only the encoder of the Transformer architecture

• The encoder learns the representation of the input sentence by leveraging multi-head attention

• Thus, the name contains the phrase ‘encoder representations from Transformers’

• It is an auto-encoding model, in that it learns representations of words based on context to their left as well as right in the sentence

• This refers to the ‘bidirectional’ in its name

58 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
Objective Functions of BERT

• BERT optimizes the learning using two objective functions:

i. Masked language model

Words in a sentence are randomly masked (i.e., replaced by a special token known as [MASK])

ii. Next-sentence prediction

The next-sentence prediction uses relationships between sentences to learn the discourse nature of documents

59 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
Pre-Training Using BERT

Fig 2.10 BERT pre-training: (a) Pre-training, (b) Fine-tuning

60 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
Fine-Tuning of BERT

“A key utility of BERT is the ability to fine-tune it for learning tasks such as the
automatic detection of sentiment”

• The process has been referred to as the ‘downstream application’ of the language
model

• It involves updating the parameters of the BERT model

• The labelled dataset is converted into the representation with [CLS] and [SEP]
tokens

61 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
Variants of BERT

• Several variants of BERT have been reported to extend the capability of BERT

• BioBERT captures semantic representation in biomedical documents

• MBERT is a multilingual BERT trained on multilingual corpora

• RoBERTa aims at creating a robust BERT

• RoBERTa trains BERT with larger batch sizes and removes the next-sentence prediction objective

62 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
Sample Code for BERT
• The Transformers library developed by HuggingFace is a popular way to use Transformer-based models

• Let us first use pipeline to generate output of masked words
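• The code snippet from the book is not reproduced in these slides; the following is a minimal sketch with an assumed model name and input sentence (the top-five words quoted below refer to the original example):

from transformers import pipeline

# Fill-mask pipeline with a BERT model; returns the top five predictions
# for the [MASK] token by default.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("After a long day, he went back to [MASK]."):
    print(prediction["token_str"], prediction["score"])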

• The top five words returned are ‘bed’, ‘work’, ‘sleep’, ‘him’, and ‘class’

63 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
• Let us now see how vectors of text can be obtained using the library

• Computing infrastructure that provides GPU can accelerate the performance of


these models

• The model will then be mapped to the device as follows:
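• A minimal sketch (assumed model name and sentence, not the book’s code) of obtaining contextual vectors and mapping the model to a GPU device when available:

import torch
from transformers import AutoModel, AutoTokenizer

# Use a GPU if one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").to(device)

inputs = tokenizer("A galloping horse", return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per token; the [CLS] vector is often used for the sentence.
print(outputs.last_hidden_state.shape)  # (1, number_of_tokens, 768)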

64 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
XLNet

• XLNet is a language model that combines the two kinds of models (auto-regressive and auto-encoding)

• It introduces the idea of ‘permutation-based language modelling’

• Permutation-based language modelling in the case of XLNet consists of two steps:

i. Sample a factorization order, which is a sequence of word positions in the sentence
ii. Decompose the probability of the sentence following the factorization order

65 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
Thank you

66 CH2 Representation and NLP Copyright © 2023 by Wiley India Pvt. Ltd.
