ML for NLP-LO3
Vanishing/Exploding Gradient Problem
• Backpropagated errors are multiplied at each layer, resulting in exponential decay (if the derivative is small) or exponential growth (if the derivative is large); see the toy example below.
• This makes it very difficult to train deep networks, or simple recurrent networks (SRNs) over many time steps.
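A toy numeric illustration (not from the slides): multiplying an error by the same derivative at every layer shrinks or grows it exponentially.

```python
# Toy illustration: a backpropagated error multiplied by the same
# derivative at each of 50 layers / time steps.
for d in (0.9, 1.1):                  # derivative < 1 vs. derivative > 1
    error = 1.0
    for _ in range(50):
        error *= d
    print(f"derivative={d}: error after 50 steps = {error:.3g}")
# derivative=0.9 -> ~0.00515 (vanishes); derivative=1.1 -> ~117 (explodes)
```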
Long Distance Dependencies
• It is very difficult to train SRNs to retain information over many time steps.
• This makes it very difficult for SRNs to learn long-distance dependencies, such as subject-verb agreement.
Long Short-Term Memory
• LSTM networks add gating units to each memory cell:
• Forget gate
• Input gate
• Output gate
• This prevents the vanishing/exploding gradient problem and allows the network to retain state information over longer periods of time.
LSTM Network Architecture
Cell State
• Maintains a vector C_t that has the same dimensionality as the hidden state h_t.
• Information can be added to or deleted from this state vector via the forget and input gates.
Cell State Example
• We want to remember the person & number of a subject noun so that it can be checked for agreement with the person & number of the verb when the verb is eventually encountered.
• The forget gate removes existing information about a prior subject when a new one is encountered.
• The input gate "adds" in the information for the new subject.
Forget Gate
• The forget gate computes a value between 0 and 1 using a logistic sigmoid function of the input x_t and the previous hidden state h_{t-1}:
• This value is multiplied component-wise with the cell state, "forgetting" information where the gate outputs something close to 0.
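In the standard LSTM formulation (the slide's equation image is not reproduced, so the weight names W_f, U_f, b_f are the conventional ones):

$$ f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) $$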
Hyperbolic Tangent Units
• Tanh can be used as an alternative nonlinear function to the logistic sigmoid (0-1) output function.
• It produces output bounded between –1 and 1.
Input Gate
• First, determine which entries in the cell state to update by computing a 0-1 sigmoid output.
• Then determine how much to add to or subtract from these entries by computing a tanh output (valued –1 to 1) of the input and hidden state.
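In the same standard notation (weight names are the conventional ones, not from the slides):

$$ i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \qquad \tilde{C}_t = \tanh(W_C x_t + U_C h_{t-1} + b_C) $$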
Updating the Cell State
• The cell state is updated using a component-wise vector multiply to "forget" and a vector addition to "input" new information.
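Combining the gates above, the standard update is:

$$ C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t $$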
Output Gate
• The hidden state is updated to a "filtered" version of the cell state, scaled to –1 to 1 using tanh.
• The output gate computes a sigmoid function of the input and hidden state to determine which elements of the cell state to "output".
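In standard notation:

$$ o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \qquad h_t = o_t \odot \tanh(C_t) $$

Putting the four equations together, a minimal NumPy sketch of one LSTM step (the dict-of-weights layout and the names are illustrative assumptions, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, U, b):
    """One LSTM time step; W, U, b are dicts keyed by gate ('f','i','c','o')."""
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])      # forget gate
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])      # input gate
    C_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # candidate values
    C_t = f_t * C_prev + i_t * C_tilde                          # forget, then input
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])      # output gate
    h_t = o_t * np.tanh(C_t)                                    # new hidden state
    return h_t, C_t
```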
Overall Network Architecture
• Single or multilayer networks can
compute LSTM inputs from problem
inputs and problem outputs from LSTM
outputs.
• O_t: e.g. a POS tag as a "one hot" vector
General Problems Solved with LSTMs
• Sequence labeling
• Train with a supervised output at each time step, computed using a single or multilayer network that maps the hidden state (h_t) to an output vector (O_t); see the sketch after this list.
• Language modeling
• Train to predict the next input (O_t = I_{t+1}).
• Sequence (e.g. text) classification
• Train a single or multilayer network that maps the final hidden state (h_n) to an output vector (O).
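A minimal PyTorch sketch of the sequence-labeling case (all sizes and names are illustrative assumptions): a linear layer maps each hidden state h_t to an output vector O_t.

```python
import torch.nn as nn

class LSTMTagger(nn.Module):
    """Sequence labeling sketch: map each hidden state h_t to tag scores O_t."""
    def __init__(self, vocab_size=10000, embed_dim=100,
                 hidden_dim=128, num_tags=45):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_tags)   # h_t -> O_t

    def forward(self, tokens):                # tokens: (batch, seq_len)
        h, _ = self.lstm(self.embed(tokens))  # h: (batch, seq_len, hidden)
        return self.out(h)                    # one score vector per time step
```

For sequence classification, the same linear layer would instead be applied only to the final hidden state h_n.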
Sequence to Sequence Transduction (Mapping)
• In the Encoder/Decoder framework, one LSTM maps the input sequence to a "deep vector", then another LSTM maps this vector to an output sequence (sketched below).
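A minimal PyTorch sketch of the idea (teacher-forced decoding; names and sizes are illustrative assumptions): the encoder's final state serves as the "deep vector" that initializes the decoder.

```python
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder/decoder sketch: encoder's final state initializes the decoder."""
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src, tgt):
        _, state = self.encoder(self.src_embed(src))    # state = (h_n, C_n)
        dec, _ = self.decoder(self.tgt_embed(tgt), state)
        return self.out(dec)                            # scores over target vocab
```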
Summary of LSTM Application Architectures
Successful Applications of LSTMs
• Speech recognition: Language and acoustic
modeling
• Sequence labeling
• POS Tagging
https://www.aclweb.org/aclwiki/index.php?title=POS_Tagging_(State_of_the_art)
• NER
• Phrase Chunking
• Neural syntactic and semantic parsing
• Image captioning: CNN output vector to sequence
• Sequence to Sequence
• Machine Translation (Sutskever, Vinyals, & Le, 2014)
• Video Captioning (input sequence of CNN frame
outputs)
Bi-directional LSTM (Bi-LSTM)
• Separate LSTMs process the sequence forward and backward, and the hidden layers at each time step are concatenated to form the cell output.
[Figure: forward and backward LSTM chains over inputs x_{t-1}, x_t, x_{t+1}, producing hidden states h_{t-1}, h_t, h_{t+1}]
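A minimal PyTorch usage sketch (sizes are illustrative assumptions; PyTorch's bidirectional=True concatenates the two directions, matching the slide):

```python
import torch
import torch.nn as nn

# Bidirectional LSTM: forward and backward hidden states are concatenated,
# so each time step's output has dimension 2 * hidden_size.
bilstm = nn.LSTM(input_size=100, hidden_size=128,
                 batch_first=True, bidirectional=True)
x = torch.randn(8, 20, 100)   # (batch, seq_len, input_size)
h, _ = bilstm(x)              # h.shape == (8, 20, 256)
```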
Gated Recurrent Unit (GRU)
• An alternative RNN to the LSTM that uses fewer gates (Cho, et al., 2014)
• Combines the forget and input gates into an "update" gate
• Eliminates the cell state vector
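The standard GRU equations (following Cho et al., 2014; weight names are the conventional ones, not from the slides):

$$ z_t = \sigma(W_z x_t + U_z h_{t-1}) \qquad r_t = \sigma(W_r x_t + U_r h_{t-1}) $$
$$ \tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1})) \qquad h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t $$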
GRU vs. LSTM
• GRU has significantly fewer parameters
and trains faster.
• Experimental results comparing the two are still inconclusive: on many problems they perform the same, but each has problems on which it works better.
Attention
• For many applications, it helps to add
“attention” to RNNs.
• Allows network to learn to attend to
different parts of the input at different time
steps, shifting its attention to focus on
different aspects during its processing.
• Used in image captioning to focus on
different parts of an image when generating
different parts of the output sentence.
• In MT, it allows focusing attention on different parts of the source sentence when generating different parts of the translation (see the sketch after this list).
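As a concrete illustration (an assumed minimal example, not the slides' formulation), dot-product attention computes a weighted summary of the input states:

```python
import numpy as np

def dot_product_attention(query, keys, values):
    """Score each input position against the query, softmax the scores,
    and return the attention-weighted sum of the values."""
    scores = keys @ query                      # (seq_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # attention distribution
    return weights @ values                    # context vector
```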
Attention for Image Captioning
(Xu, et al. 2015)
Conclusions
• By adding “gates” to an RNN, we can prevent the
vanishing/exploding gradient problem.
• Trained LSTMs/GRUs can retain state information
longer and handle long-distance dependencies.
• Recent impressive results on a range of challenging
NLP problems.
Outline
➔ What is NLP?
➔ Why NLP?
➔ Applications of NLP
➔ Journey of NLP
➔ Components of NLP
➔ Steps in NLP
◆ Preprocessing of Text
◆ Bag of Words Model
◆ Applying Machine Learning Algorithm(s)
➔ Challenges in NLP
➔ Future of NLP
Definition
➔ Natural Language Processing (NLP) is
the ability of a computer program to
understand human language as it is
spoken/written.
➔ Natural Language Processing (NLP) is a
field of Artificial Intelligence (AI) that
focuses on quantifying human language
to make it intelligible to machines.
➔ It is an area of computer science & AI
concerned with interactions between
computers and human languages.
NLP is used to apply machine learning models to text and language.
Why NLP?
• Text is the largest repository of human knowledge, and it is growing quickly.
• We want computer programs that understand text and speech.
• Computers should be able to communicate with us the way people communicate with each other.
Applications
• Opinion Mining / Sentiment Analysis
– The interpretation and classification of emotions (positive, negative, and neutral) within text data using text analysis techniques.
– The result is expressed as positive, negative, or neutral.
– Opinion mining is finding or classifying the opinionated parts of a text.
– Helps organizations gather insight from unorganized & unstructured text.
– Aspect-based opinion mining is a finer-grained variant.
• Machine Translation
– The process of converting one natural language into another while preserving the meaning of the input text and producing fluent text in the output language.
– A challenging task because languages contain words with multiple meanings and sentences with multiple grammatical structures.
Applications..
• Chatbots
– Designed to simulate human conversation
– Used in customer service
– Examples: Siri / Alexa
• Speech Recognition
– The ability of a machine or program to identify words and phrases in spoken language and convert them into a readable format
• Keyword Searching
– Tries to discover documents that may contain facts of interest to the user
– Web search disambiguation
Applications..
• Information Extraction
– Automatic extraction of specific information on a given topic from a body or bodies of text, which may be structured, unstructured, or semi-structured
– Web scraping
Journey
➔ The roots of NLP trace back to Alan Turing's work in the 1950s.
[Figure: NLP splits into Natural Language Understanding (NLU) and Natural Language Generation (NLG), building on word recognition]
Natural Language Understanding (NLU)
Natural Language Generation (NLG)
Steps in NLP
1- Pre Processing of Text
2- Creation of Bag of Words Model
3- Applying Machine Learning Algorithm(s)
1- Pre Processing
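The original slides illustrate this step with figures (not reproduced). A minimal NLTK sketch of common preprocessing steps — tokenization, lowercasing, punctuation and stopword removal, stemming — assuming NLTK as the tool choice:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)

text = "The striped bats are hanging on their feet for best"
tokens = word_tokenize(text.lower())                 # tokenize + lowercase
tokens = [t for t in tokens if t.isalpha()]          # drop punctuation/numbers
stops = set(stopwords.words('english'))
tokens = [t for t in tokens if t not in stops]       # remove stopwords
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]            # stem each token
print(stems)   # ['stripe', 'bat', 'hang', 'feet', 'best']
```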
NER steps
• Noun phrase identification
• Phrase classification
• Entity disambiguation (see the sketch below)
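A minimal NER sketch using spaCy (an assumed tool choice; the slides do not name a library; requires the en_core_web_sm model):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Alan Turing worked at the University of Manchester in 1948.")
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. "Alan Turing PERSON", "1948 DATE"
```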
2. Feature Representation:
Bag of Words Model
This model is used to process text into a form that can be fed to classification algorithms that classify the text (a scikit-learn sketch follows).
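A minimal scikit-learn sketch (an assumed tool choice): each document becomes a vector of word counts over the learned vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)           # sparse (2, vocab) count matrix
print(vectorizer.get_feature_names_out())    # learned vocabulary
print(X.toarray())                           # count vectors, one row per doc
```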
3. Applying Machine Learning Algorithms
Supervised Learning:
➔ Naive Bayes Algorithm (see the sketch below)
➔ Decision Tree
Unsupervised Learning:
➔ Clustering
➔ Latent Semantic Indexing
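A minimal scikit-learn sketch of the supervised case (the texts and labels are invented for illustration): Naive Bayes trained on bag-of-words features.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["loved the movie", "great acting", "terrible plot", "waste of time"]
labels = ["pos", "pos", "neg", "neg"]
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)                 # learn word-count class statistics
print(model.predict(["a great movie"]))  # -> ['pos']
```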
Challenges (NLU)
• Lexical Ambiguity: ambiguity at the word level (e.g., "bank" as a financial institution vs. the bank of a river)
Challenges (NLU)
➔ Non-Standard Language
Analysis Techniques
➔ Morphological Analysis:
The analysis of word structure: how words are built up from morphemes such as stems, prefixes, and suffixes.
➔ Syntactic Analysis:
The process of analyzing a string of symbols in natural language according to the rules of a formal grammar.
➔ Semantic Analysis:
Individual words are combined to provide the meaning of sentences; the most important task of semantic analysis is to get the proper meaning of the sentence.
➔ Discourse Analysis:
A research method for studying written or spoken language in relation to its social context.
Future of NLP
Smarter Search: Recently, Google
announced that it has added NLP
capabilities to Google Drive to allow users
to search for documents and content
using conversational language.
Intelligence from Unstructured
Information: An understanding of
human language is especially powerful
when applied to extract information and
reveal meaning and sentiment in large
amounts of text content.
Web References & Resources
➢ http://web.stanford.edu/class/cs224n
➢ https://www.analyticssteps.com/blogs/7-natural-language-processing-techniques-extracting-information
➢ https://www.tutorialspoint.com/natural_language_processing/natural_language_processing_tutorial.pdf
➢ https://www.udemy.com/course/machinelearning
➢ https://www.superdatascience.com/pages/machine-learning
➢ https://github.com/MonicaGS/Machine-Learning-A-Z
THANKS!
Any questions?