ML for NLP-LO3

Machine Learning for NLP

Vanishing/Exploding Gradient Problem
• Backpropagated errors are multiplied at each layer, resulting in exponential decay (if the derivative is small) or exponential growth (if the derivative is large).
• This makes it very difficult to train deep networks, or simple recurrent networks (SRNs) over many time steps (a numeric sketch follows).
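A minimal numeric sketch of the effect (plain Python; the per-step factors 0.9 and 1.1 are illustrative, not from the slides):

```python
# Backpropagating through T steps multiplies the gradient by a factor at each step.
# If that factor is consistently < 1 the gradient decays exponentially; if > 1 it grows.
T = 50
for factor in (0.9, 1.1):
    grad = 1.0
    for _ in range(T):
        grad *= factor            # one backprop step through one layer / time step
    print(f"factor={factor}: gradient after {T} steps = {grad:.3e}")

# factor=0.9: gradient after 50 steps ~ 5.2e-03  (vanishing)
# factor=1.1: gradient after 50 steps ~ 1.2e+02  (exploding)
```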

2
Long Distance Dependencies
• It is very difficult to train SRNs to retain information over many time steps.
• This makes it very difficult to learn SRNs that handle long-distance dependencies, such as subject-verb agreement.

3
Long Short-Term Memory (LSTM)
• LSTM networks add additional gating units to each memory cell:
• Forget gate
• Input gate
• Output gate
• These gates help prevent the vanishing/exploding gradient problem and allow the network to retain state information over longer periods of time.

4
LSTM Network Architecture

5
Cell State
• Maintains a vector C_t that has the same dimensionality as the hidden state h_t.
• Information can be added to or deleted from this state vector via the forget and input gates.

6
Cell State Example
• We want to remember the person & number of a subject noun so that it can be checked for agreement with the person & number of the verb when the verb is eventually encountered.
• The forget gate will remove existing information about a prior subject when a new one is encountered.
• The input gate "adds" in the information for the new subject.

7
Forget Gate
• The forget gate computes a 0–1 value, using a logistic sigmoid output function of the input x_t and the previous hidden state h_{t−1}.
• It is multiplicatively combined with the cell state, "forgetting" information where the gate outputs something close to 0.
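In the standard LSTM formulation (assuming the usual notation, with W_f and b_f the forget-gate weights and bias):

f_t = σ(W_f · [h_{t−1}, x_t] + b_f)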

8
Hyperbolic Tangent Units
• Tanh can be used as an alternative nonlinearity to the logistic sigmoid (0–1) output function.
• It is used to produce outputs bounded between –1 and 1.

9
Input Gate
• First, determine which entries in the cell state to update by computing a 0–1 sigmoid output.
• Then determine how much to add or subtract from these entries by computing a tanh output (valued –1 to 1) as a function of the input and hidden state.
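In the standard formulation (same notation as before, with C̃_t the vector of candidate values):

i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)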

10
Updating the Cell State
• The cell state is updated using component-wise vector multiplication to "forget" and vector addition to "input" new information.
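In the standard notation, with ⊙ denoting component-wise multiplication:

C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t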

11
Output Gate
• The hidden state is updated based on a "filtered" version of the cell state, scaled to –1 to 1 using tanh.
• The output gate computes a sigmoid function of the input and hidden state to determine which elements of the cell state to "output".

12
Overall Network Architecture
• Single or multilayer networks can
compute LSTM inputs from problem
inputs and problem outputs from LSTM
outputs.
[Figure: the LSTM's input I_t is e.g. a word as a "one-hot" vector, internally mapped to a word "embedding" with reduced dimensionality; the output O_t is e.g. a POS tag as a "one-hot" vector]

13
LSTM Training
• Trainable with backpropagation using gradient-based optimizers such as:
• Stochastic gradient descent (randomize the order of examples in each epoch) with momentum (bias weight changes to continue in the same direction as the last update).
• The Adam optimizer (Kingma & Ba, 2015).
• Each cell has many parameters (W_f, W_i, W_C, W_o).
• Generally requires lots of training data.
• Requires lots of compute time, typically exploiting GPUs or GPU clusters (a training sketch follows).
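A minimal training-loop sketch for a sequence-labeling setup (assuming PyTorch; the layer sizes, random data, and number of epochs are illustrative placeholders, not from the slides):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_in, n_hid, n_tags = 8, 16, 5

lstm = nn.LSTM(input_size=n_in, hidden_size=n_hid, batch_first=True)
proj = nn.Linear(n_hid, n_tags)
params = list(lstm.parameters()) + list(proj.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)   # Adam (Kingma & Ba, 2015)
loss_fn = nn.CrossEntropyLoss()

# Random placeholder data: a batch of 4 sequences, 10 time steps each
x = torch.randn(4, 10, n_in)
y = torch.randint(0, n_tags, (4, 10))

for epoch in range(3):                          # a few epochs, just to show the loop
    optimizer.zero_grad()
    h_seq, _ = lstm(x)                          # h_seq: (batch, time, hidden)
    logits = proj(h_seq)                        # (batch, time, n_tags)
    loss = loss_fn(logits.reshape(-1, n_tags), y.reshape(-1))
    loss.backward()                             # backpropagation through time
    optimizer.step()
    print(epoch, float(loss))
```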

14
General Problems Solved with
LSTMs
• Sequence labeling
• Train with supervised output at each time step
computed using a single or multilayer network
that maps the hidden state (ht) to an output
vector (Ot).
• Language modeling
• Train to predict the next input (O_t = I_{t+1})
• Sequence (e.g. text) classification
• Train a single or multilayer network that maps the final hidden state (h_n) to an output vector (O). (A sketch of these output mappings follows.)
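A minimal sketch of the two output mappings (assuming PyTorch; names and sizes are illustrative):

```python
import torch
import torch.nn as nn

n_in, n_hid, n_out = 8, 16, 3
lstm = nn.LSTM(input_size=n_in, hidden_size=n_hid, batch_first=True)
x = torch.randn(2, 12, n_in)                 # batch of 2 sequences, 12 time steps

h_seq, (h_n, c_n) = lstm(x)

# Sequence labeling: one output vector O_t per time step, from each h_t
per_step = nn.Linear(n_hid, n_out)(h_seq)    # shape (2, 12, n_out)

# Sequence classification: one output vector O, from the final hidden state h_n
final = nn.Linear(n_hid, n_out)(h_n[-1])     # shape (2, n_out)

print(per_step.shape, final.shape)
```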

15
Sequence to Sequence
Transduction (Mapping)
• In the encoder/decoder framework, one LSTM (the encoder) maps the input sequence to a "deep vector", and another LSTM (the decoder) maps this vector to an output sequence.

I_1, I_2, …, I_n → Encoder LSTM → h_n → Decoder LSTM → O_1, O_2, …, O_m

• Train the model "end to end" on input/output pairs of sequences.
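A compact sketch of the encoder/decoder idea (assuming PyTorch; the one-hot vocabulary handling and greedy decoding loop are simplified placeholders):

```python
import torch
import torch.nn as nn

n_in, n_hid, vocab = 8, 16, 20
encoder = nn.LSTM(input_size=n_in, hidden_size=n_hid, batch_first=True)
decoder = nn.LSTM(input_size=vocab, hidden_size=n_hid, batch_first=True)
to_vocab = nn.Linear(n_hid, vocab)

src = torch.randn(1, 7, n_in)               # input sequence I_1..I_n
_, (h_n, c_n) = encoder(src)                # "deep vector": final encoder state

# Greedy decoding sketch: feed the previous output back in as the next input
step_in = torch.zeros(1, 1, vocab)          # start-of-sequence placeholder
state = (h_n, c_n)
outputs = []
for _ in range(5):                          # generate O_1..O_m (m = 5 here)
    dec_out, state = decoder(step_in, state)
    logits = to_vocab(dec_out)              # (1, 1, vocab)
    idx = int(logits.argmax())
    outputs.append(idx)
    step_in = torch.zeros(1, 1, vocab)
    step_in[0, 0, idx] = 1.0                # one-hot of the predicted token
print(outputs)
```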

16
Summary of
LSTM Application Architectures

[Figure: example LSTM application architectures for Image Captioning, Video Activity Recognition, Video Captioning, POS Tagging, Text Classification, Machine Translation, Language Modeling]

17
Successful Applications of LSTMs
• Speech recognition: Language and acoustic
modeling
• Sequence labeling
• POS Tagging
https://www.aclweb.org/aclwiki/index.php?title=POS_Tagging_(State_of_the_art)
• NER
• Phrase Chunking
• Neural syntactic and semantic parsing
• Image captioning: CNN output vector to sequence
• Sequence to Sequence
• Machine Translation (Sutskever, Vinyals, & Le, 2014)
• Video Captioning (input sequence of CNN frame
outputs)

18
Bi-directional LSTM (Bi-LSTM)
• Separate LSTMs process the sequence forward and backward, and their hidden states at each time step are concatenated to form the output at that step.
[Figure: forward and backward LSTMs over inputs x_{t−1}, x_t, x_{t+1}, producing concatenated hidden states h_{t−1}, h_t, h_{t+1}]
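A minimal sketch (assuming PyTorch, where bidirectionality is a flag on nn.LSTM and the forward and backward hidden states are concatenated in the output):

```python
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True, bidirectional=True)
x = torch.randn(2, 10, 8)        # batch of 2 sequences, 10 time steps
out, _ = bilstm(x)
print(out.shape)                 # torch.Size([2, 10, 32]): forward + backward concatenated
```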

19
Gated Recurrent Unit (GRU)
• An alternative recurrent unit to the LSTM that uses fewer gates (Cho et al., 2014).
• Combines the forget and input gates into an "update" gate.
• Eliminates the separate cell state vector.
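One common formulation (Cho et al., 2014; bias terms omitted), where z_t is the "update" gate and r_t a "reset" gate:

z_t = σ(W_z · [h_{t−1}, x_t])
r_t = σ(W_r · [h_{t−1}, x_t])
h̃_t = tanh(W · [r_t ⊙ h_{t−1}, x_t])
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t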

20
GRU vs. LSTM
• The GRU has significantly fewer parameters and trains faster.
• Experimental results comparing the two are still inconclusive: on many problems they perform about the same, but each has problems on which it works better.

21
Attention
• For many applications, it helps to add "attention" to RNNs.
• This allows the network to learn to attend to different parts of the input at different time steps, shifting its focus during processing.
• Used in image captioning to focus on different parts of an image when generating different parts of the output sentence.
• In MT, it allows focusing attention on different parts of the source sentence when generating different parts of the translation.
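A minimal sketch of the core computation (assuming NumPy; this shows simple dot-product attention over a set of encoder states, one common variant):

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

rng = np.random.default_rng(0)
encoder_states = rng.standard_normal((6, 16))   # 6 source positions, 16-dim each
decoder_state = rng.standard_normal(16)         # current decoder hidden state

scores = encoder_states @ decoder_state         # one relevance score per source position
weights = softmax(scores)                       # attention distribution (sums to 1)
context = weights @ encoder_states              # weighted combination fed to the decoder
print(weights.round(2), context.shape)
```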

22
Attention for Image Captioning
(Xu et al., 2015)

23
Conclusions
• By adding “gates” to an RNN, we can prevent the
vanishing/exploding gradient problem.
• Trained LSTMs/GRUs can retain state information
longer and handle long-distance dependencies.
• Recent impressive results on a range of challenging
NLP problems.

24
Outline
➔ What is NLP?
➔ Why NLP?
➔ Applications of NLP
➔ Journey of NLP
➔ Components of NLP
➔ Steps in NLP
◆ Preprocessing of Text
◆ Bag of Words Model
◆ Applying Machine Learning Algorithm(s)
➔ Challenges in NLP
➔ Future of NLP

25
Definition
➔ Natural Language Processing (NLP) is
the ability of a computer program to
understand human language as it is
spoken/written.
➔ Natural Language Processing (NLP) is a field of Artificial Intelligence (AI) that focuses on quantifying human language to make it intelligible to machines.
➔ It is an area of computer science & AI
concerned with interactions between
computers and human languages.
26
NLP is used to apply machine learning models to text and language.

NLP = Computer Science + AI + Computational Linguistics

27
Why NLP?

28
Why NLP?
• Text is the largest repository of human knowledge and it is growing quickly.
• We want computer programs that can understand text and speech.
• Computers should be able to communicate the same way people communicate with each other.

29
Applications
• Opinion Mining / Sentiment Analysis
– The interpretation and classification of emotions (positive, negative and neutral) within text data using text-analysis techniques.
– The result is expressed as positive, negative or neutral.
– Opinion mining finds or classifies the opinionated parts of a text.
– Helps organizations gather insight from unorganized & unstructured text.
– Aspect-based opinion mining

• Machine Translation
– The process of converting text in one natural language into another while preserving the meaning of the input and producing fluent text in the output language.
– A challenging task, because languages contain words with multiple meanings and sentences with multiple grammatical structures.

30
Applications..
• Chatbots
– Designed to simulate human conversation
– Used in customer service
– Examples: Siri / Alexa

• Speech Recognition
– The ability of a machine or program to identify words and phrases in spoken language and convert them into a readable format

• Keyword Searching
– Tries to discover documents that may contain facts of interest to the user
– Web search disambiguation

31
Applications..
• Information Extraction
– Automatic extraction of specific information on a given topic from one or more bodies of text, which may be structured, unstructured or semi-structured
– Web scraping

● Automatic Text Summarization
○ Text summarization automatically produces a summary containing the important sentences and all relevant information from the original document.
■ Extractive
■ Abstractive

● Auto Completion / Auto Suggest of Text
○ A feature that provides predictions as you type

32
Journey
➔ The roots of NLP go back to the 1950s and Alan Turing's early work on machine intelligence.

➔ Up to the 1980s, most NLP systems were based on hand-written rules.

➔ In the late 1980s there was a revolution in NLP with the introduction of machine learning algorithms.

33
[Figure: Natural Language Understanding (NLU) | Natural Language Generation (NLG)]
34
[Figure: pipeline: Word Recognition → NLU → NLG]

35
Natural Language Understanding
(NLU)

• Taking spoken or typed sentences and working out what they mean.

• Here the input (speech/text) is transformed into a useful representation in order to analyze the various aspects of the language.

36
Natural Language Generation (NLG)

• This is the process of converting an internal representation back into natural language.

37
Steps in NLP
1- Pre Processing of Text
2- Creation of Bag of Words Model
3- Applying Machine Learning Algorithm(s)

38
1- Pre Processing
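A minimal sketch of common preprocessing steps (assuming NLTK is installed and its "punkt", "stopwords" and "wordnet" resources have been downloaded; the exact steps vary by application):

```python
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

text = "The cars were quickly driven to the garages."

tokens = word_tokenize(text.lower())               # tokenization + lowercasing
tokens = [t for t in tokens if t.isalpha()]        # drop punctuation and numbers
stops = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stops]     # stopword removal
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in tokens] # lemmatization
print(lemmas)   # e.g. ['car', 'quickly', 'driven', 'garage']
```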

39
NER steps
• Noun phrase identification
• Phrase classification
• Entity disambiguation
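A minimal NER sketch (assuming spaCy and its small English model "en_core_web_sm" are installed; the library handles the identification and classification steps internally):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is opening a new office in London in September.")

for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. Apple ORG, London GPE, September DATE
```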

44
2. Feature Representation:
Bag of Words Model
This model is used to process the text so that it can be fed to classification algorithms to classify the text.

Machine learning algorithms cannot work with raw text directly; the text must be converted into numbers, specifically vectors of numbers.

This is called feature extraction or feature encoding.
48
Bag of Words Model
Steps:
1. Tokenization
2. Counting the occurrence of each word

● Each individual token is treated as a feature.
● A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.
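A minimal sketch of the resulting document-term matrix (assuming scikit-learn):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)   # one row per document, one column per token

print(vectorizer.get_feature_names_out())
# ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']
print(X.toarray())
# [[1 0 0 1 1 1 2]
#  [0 1 1 0 1 1 2]]
```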
49
Bag of Words Model
It is called a "bag" of words because any information about the order or structure of the words in the document is discarded.

The model is only concerned with whether known words occur in the document, not where in the document they occur.

50
3. Applying Machine Learning
Algorithms
Supervised Learning:
➔ Naive Bayes Algorithm
➔ Decision Tree

Unsupervised Learning
➔ Clustering
➔ Latent Semantic Indexing
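A minimal supervised-learning sketch that combines bag-of-words features with Naive Bayes (assuming scikit-learn; the tiny dataset is a placeholder):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["great movie, loved it", "terrible plot, boring",
               "wonderful acting", "awful and dull"]
train_labels = ["pos", "neg", "pos", "neg"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["boring and awful movie", "loved the acting"]))
# e.g. ['neg' 'pos']
```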

51
Challenges (NLU)
• Lexical Ambiguity: word level

"The tank was full of water." (a military vehicle or a container?)

• Syntactic Ambiguity: sentence level, concerning the structure of the sentence

"Old men and women were taken to a safe place." (are the women also old?)


52
Challenges (NLU)
• Semantic Ambiguity: related to meaning

"The car hit the pole while it was moving." (what was moving?)

• Pragmatic Ambiguity: different interpretations depending on context

"The police are coming." (for whom?)

53
Challenges (NLU)
➔ Non Standard Language

➔ More Complex Language

54
Analysis Techniques
➔ Morphological Analysis:
The analysis of the internal structure of words (stems, prefixes, suffixes and other morphemes).
➔ Syntactic Analysis:
The process of analyzing a string of symbols in natural language according to the rules of a formal grammar.
➔ Semantic Analysis:
Individual word meanings are combined to derive the meaning of sentences. The most important task of semantic analysis is to obtain the proper meaning of the sentence.
➔ Discourse Analysis:
A research method for studying written or spoken language in relation to its social context.

55
Future of NLP
Smarter Search: Recently, Google
announced that it has added NLP
capabilities to Google Drive to allow users
to search for documents and content
using conversational language.
Intelligence from Unstructured
Information: An understanding of
human language is especially powerful
when applied to extract information and
reveal meaning and sentiment in large
amounts of text content.
56
Future of NLP

Supporting Invisible UI: The concept of


an invisible or zero user interface will rely
on direct interaction between user and
machine, whether through voice, text or a
combination of the two. Amazon’s Echo is
just one example of the move toward a
future that puts humans more directly in
contact with technology.

57
Future of NLP

The Bots: Frequently used in customer


service, especially in banking, retail and
hospitality, chatbots help customers get
right to the point without the wait,
answering customer questions and
directing them to relevant resources and
products at any hour, any day of the
week. To accomplish this,
chatbots employ NLP to understand
language, usually over text or voice-
recognition interactions, where users
communicate in their own words, as they would speak to an agent.

58
References
➢ Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv
preprint arXiv:1409.0473, 2014.
➢ Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. A neural probabilistic language model. The Journal of
Machine Learning Research, 3:1137–1155, 2003.
➢ Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 1997.
➢ Jean, S., Cho, K., Memisevic, R., and Bengio, Y. On using very large target vocabulary for neural machine
translation. CoRR, abs/1412.2007, 2014.
➢ Jurafsky, D. and Martin, J. Speech and language processing. Pearson International, 2009.
➢ Kalchbrenner, N. and Blunsom, P. Recurrent continuous translation models. In EMNLP, 2013.
➢ Luong, T., Sutskever, I., Le, Q. V., Vinyals, O., and Zaremba, W. Addressing the rare word problem in neural
machine translation. arXiv preprint arXiv:1410.8206, 2014.
➢ Mikolov, T. Statistical Language Models based on Neural Networks. PhD thesis, Brno University of Technology,
2012.
➢ Mikolov, T., Karafiát, M., Burget, L., Černocký, J., and Khudanpur, S. Recurrent neural network based language model. In INTERSPEECH, pp. 1045–1048, 2010.
➢ Shang, L., Lu, Z., and Li, H. Neural responding machine for short-text conversation. In Proceedings of ACL,
2015.
➢ Sordoni, A., Galley, M., Auli, M., Brockett, C., Ji, Y., Mitchell, M., Gao, J., Dolan, B., and Nie, J.-Y. A neural
network approach to context-sensitive generation of conversational responses. In Proceedings of NAACL, 2015.
➢ Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In NIPS, 2014.
➢ Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. Show and tell: A neural image caption generator. arXiv preprint arXiv:1411.4555, 2014.
➢ Will, T. Creating a Dynamic Speech Dialogue. VDM Verlag Dr., 2007.

59
Web References & Resources
➢ http://web.stanford.edu/class/cs224n
➢ https://www.analyticssteps.com/blogs/7-natural-language-processing-techniques-extracting-information
➢ https://www.tutorialspoint.com/natural_language_processing/natural_language_processing_tutorial.pdf
➢ https://www.udemy.com/course/machinelearning
➢ https://www.superdatascience.com/pages/machine-learning
➢ https://github.com/MonicaGS/Machine-Learning-A-Z

60
THANKS!
Any questions?
