
Bruno Gonçalves

www.bgoncalves.com

NLP With Deep Learning For Everyone


Bruno Gonçalves
www.data4sci.com/newsletter
graphs4sci.substack.com
https://github.com/DataForScience/AdvancedNLP
Question https://github.com/DataForScience/AdvancedNLP
• What’s your job title?

• Data Scientist

• Statistician

• Data Engineer

• Researcher

• Business Analyst

• Software Engineer

• Other

@bgoncalves www.data4sci.com
Question https://github.com/DataForScience/AdvancedNLP
• How experienced are you in Python?

• Beginner (<1 year)

• Intermediate (1-5 years)

• Expert (5+ years)

@bgoncalves www.data4sci.com
Question https://github.com/DataForScience/AdvancedNLP
• How did you hear about this webinar?

• O’Reilly Platform

• Newsletter

• data4sci.com Website

• Previous event

• Other?

@bgoncalves www.data4sci.com
References https://github.com/DataForScience/AdvancedNLP

https://amzn.to/3iMqanY https://amzn.to/2BGr0RL https://amzn.to/3sXAZbm https://amzn.to/3a2fhui

https://amzn.to/30fTJqB https://amzn.to/30fTMCN https://amzn.to/3qR3rKh https://amzn.to/2AavBuT

@bgoncalves www.data4sci.com
Table of Contents

1. Foundations of NLP
2. Neural Networks with Keras
3. Text Classification
4. Word Embeddings
5. Sequence Modeling
Lesson 1:
Foundations of NLP
Lesson 1.1:
One-Hot Encoding
One-Hot Encoding
• The first step in analyzing text is to represent it in a way that can be easily manipulated
numerically. Typically this takes the form of representing each term by a vector

• Many approaches have been developed for different purposes

• The most basic one is known as One-Hot Encoding:

• Each word corresponds to a different dimension in a high-dimensional space

• All elements of the vector are zero, except the one corresponding to the word
v_fleece = (0, 0, 0, 0, 1, 0, 0, ⋯)ᵀ
v_everywhere = (0, 0, 0, 1, 0, 0, 0, ⋯)ᵀ

• One-hot encoded vectors are extremely sparse and contain no semantic information

@bgoncalves www.data4sci.com
One-Hot Encoding
• So the text for “Mary had a little lamb”:

Mary had a little lamb, little lamb,


little lamb, Mary had a little lamb
whose fleece was white as snow.
And everywhere that Mary went
Mary went, Mary went, everywhere
that Mary went
The lamb was sure to go.

• Could be represented using this one-hot encoded matrix (we omit the 0 values for clarity).

@bgoncalves www.data4sci.com
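
As a minimal sketch of what this looks like in code (plain Python with NumPy; the simple whitespace tokenization is an illustrative choice, not the one used in the course notebooks):

```python
import numpy as np

text = """Mary had a little lamb, little lamb, little lamb, Mary had a little lamb
whose fleece was white as snow. And everywhere that Mary went
Mary went, Mary went, everywhere that Mary went
The lamb was sure to go."""

# Simple whitespace tokenization, stripping punctuation and lowercasing
tokens = [word.strip(".,").lower() for word in text.split()]

# Each unique word gets its own dimension
vocabulary = sorted(set(tokens))
word_index = {word: i for i, word in enumerate(vocabulary)}

# One row per token, one column per vocabulary word; a single 1 per row
one_hot = np.zeros((len(tokens), len(vocabulary)), dtype=int)
for row, word in enumerate(tokens):
    one_hot[row, word_index[word]] = 1

print(one_hot.shape)         # (number of tokens, vocabulary size)
print(word_index["fleece"])  # the dimension assigned to "fleece"
```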
Bag-of-Words
• A closely related representation is Bag-of-Words, where we keep track of how many times each word is
used within a piece of text

• For our little nursery rhyme, this could simply be:

• Similar representations could be generated for different documents, allowing us to compare or
cluster them easily.

@bgoncalves www.data4sci.com
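
A possible sketch of the same idea with scikit-learn's CountVectorizer (one of several ways to build this representation; the two short documents are only illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "Mary had a little lamb, little lamb, little lamb",
    "And everywhere that Mary went the lamb was sure to go",
]

# Each row is a document, each column the count of one vocabulary word
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())
print(counts.toarray())  # the two rows can now be compared or clustered directly
```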
Lesson 1.2:
Stemming and Lemmatization
Stemming and Lemmatization
• In practical applications, Vocabularies can become extremely large (English is estimated to
have over 1 million unique words).

• Several techniques have been developed to help reduce the vocabulary size with minimal
loss of information. In particular:

• Stemming - Use heuristics to identify the root (or stem) of the word.
The stem doesn’t need to be a “real” word as long as the mapping is consistent.

• Lemmatization - Identify the “dictionary form” (lemma) of the word. This
approach requires identifying the Part-of-Speech being used and using
hand-curated tables to find the correct lemma.

• Stopwords - Remove the most common words that don’t contain any
semantic information (the, and, a, etc.)

[Figure: loved, loves, loving, lovingly all map to the single form “love”]

@bgoncalves www.data4sci.com
Stemming
• NLTK contains several different stemmer algorithms, with varying support for different
languages

• Cistem - German

• ISRIStemmer - Arabic

• LancasterStemmer - English

• PorterStemmer - English (the original one)

• RSLPStemmer - Portuguese

• RegexpStemmer - English (using Regular Expressions)

• SnowballStemmer - Arabic, Danish, Dutch, English, Finnish, French, German,
Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish and Swedish

• The SnowballStemmer is a good default choice and tends to perform well across most
languages.

@bgoncalves www.data4sci.com
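
A small usage sketch of the SnowballStemmer, assuming NLTK is installed (the exact stems depend on the stemmer and language chosen):

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")

for word in ["loved", "loves", "loving", "lovingly"]:
    # Inflected forms are mapped to a common stem; the stem itself does not
    # need to be a dictionary word, only a consistent label
    print(word, "->", stemmer.stem(word))
```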
Lemmatization
• NLTK implements the WordNetLemmatizer algorithm that uses the WordNet database of
concepts.

• The WordNetLemmatizer is guaranteed to return a “real” word but the results depend
on correct Part-Of-Speech identification. The result for a Noun will be different than the
result for a Verb, Adverb, etc.

@bgoncalves www.data4sci.com
Lemmatization
• Lemmatization tends to be computationally more expensive than Stemming

• Depending on your specific application, you might prefer Stemming or Lemmatization

@bgoncalves www.data4sci.com
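
A small usage sketch of the WordNetLemmatizer, assuming NLTK and its WordNet data are available (depending on your NLTK version you may also need nltk.download('omw-1.4')):

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")  # the WordNet database is fetched on first use

lemmatizer = WordNetLemmatizer()

# The result depends on the Part-of-Speech passed in ('n' noun, 'v' verb, 'a' adjective, ...)
print(lemmatizer.lemmatize("loving", pos="v"))  # verb reading      -> typically 'love'
print(lemmatizer.lemmatize("loving", pos="a"))  # adjective reading -> 'loving'
print(lemmatizer.lemmatize("better", pos="a"))  # WordNet exception list -> 'good'
```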
Lesson 1.3: Stopwords
Stopwords
• Stopwords are usually the most common words that don’t contain any semantic
information (the, and, a, etc), but there is no unique universal list of stop words.

• Different applications might use different sets of stop words or none at all.

• The goal of removing them from your text is to significantly reduce the number of
words you must process while losing as little information as possible.
• Naturally, these are language dependent.

• NLTK supports 23 languages out of the box: arabic, azerbaijani, danish, dutch, english,
finnish, french, german, greek, hungarian, indonesian, italian, kazakh, nepali, norwegian,
portuguese, romanian, russian, slovene, spanish, swedish, tajik and turkish. These are
typically stored as plain text files under ‘~/nltk_data/corpora/stopwords/‘

• You can add more by simply adding a text file in the proper directory with
one word per line.

• Stopwords can be loaded into NLTK by using the file name

@bgoncalves
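
A small usage sketch, assuming NLTK is installed:

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # populates ~/nltk_data/corpora/stopwords/ on first use

english_stopwords = set(stopwords.words("english"))

tokens = ["the", "lamb", "was", "sure", "to", "go"]
content_words = [word for word in tokens if word not in english_stopwords]
print(content_words)  # the most common function words have been filtered out
```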
Lesson 1.4: N-grams
N-grams
• N-grams are co-occurring sequences of N items from a sequence of words or characters

• NLTK provides the nltk.util.ngrams utility function to easily generate N-Grams of specific
lengths

• N-grams are important to account for modifiers, Named Entity Recognition, etc.

• But how can we know if an N-Gram is significant?

@bgoncalves www.data4sci.com
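
A small usage sketch of nltk.util.ngrams:

```python
from nltk.util import ngrams

tokens = "mary had a little lamb".split()

bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))

print(bigrams)   # [('mary', 'had'), ('had', 'a'), ('a', 'little'), ('little', 'lamb')]
print(trigrams)  # [('mary', 'had', 'a'), ('had', 'a', 'little'), ('a', 'little', 'lamb')]
```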
Collocations
• A closely related concept is that of Collocation - N-Grams that occur more commonly than
expected by chance.

• The nltk.collocations submodule provides objects to identify and compute the most
significant Bigrams, Trigrams and Quadgrams:

• Bigram/Trigram/QuadgramCollocationFinder - support for different ways of finding 2,
3, and 4-grams.

• Bigram/Trigram/QuadgramAssocMeasures - selection of metrics to quantify the
relative importance of each 2, 3, and 4-gram. In particular:

• chi_sq/jaccard/likelihood_ratio/mi_like/pmi/poisson_stirling/raw_freq/student_t

• Significant collocations can prove useful for entity extraction, topic detection, etc.

@bgoncalves www.data4sci.com
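
A sketch following the usual nltk.collocations pattern; the Gutenberg corpus is used here only as a convenient source of tokenized text, and the frequency filter of 5 is an arbitrary choice:

```python
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

nltk.download("gutenberg")
words = [w.lower() for w in nltk.corpus.gutenberg.words("austen-emma.txt") if w.isalpha()]

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(5)  # ignore bigrams seen fewer than 5 times

# The 10 most significant bigrams according to Pointwise Mutual Information
print(finder.nbest(measures.pmi, 10))
```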
Code - Foundations of NLP
https://github.com/DataForScience/AdvancedNLP
Lesson 2:
Neural Networks with Keras
Lesson 2.1: Keras Overview
Artificial Neuron

[Figure: an artificial neuron. The inputs x1, …, xN (plus a constant bias input of 1) are
multiplied by the weights w0j, …, wNj and summed into zj = wᵀx, which is passed through
an activation function ϕ(z) to produce the output aj.]
@bgoncalves www.data4sci.com
Feed Forward Networks
[Figure: a single fully connected layer. The same input values (1, x1, x2, x3, ⋯, xN) are fed
to every neuron; each neuron has its own weights but the same activation function, computing
aj = ϕ(wᵀx). The number of neurons in a layer determines the number of outputs, and the
outputs of one layer become the inputs to the next.]

ht = f (xt)

@bgoncalves www.data4sci.com
(Deep) Feed Forward Networks

[Figure: a stack of fully connected layers mapping the network input xt to the network
output ht = f (xt).]

• The number of outputs from a layer must match the number of inputs in the next

• Networks can have arbitrary numbers of layers with varying numbers of neurons

@bgoncalves www.data4sci.com
Lego Blocks
keras.io

@bgoncalves www.data4sci.com
Keras
keras.io
• Open Source neural network library written in Python

• TensorFlow, Microsoft Cognitive Toolkit or Theano backends

• Enables fast experimentation

• Created and maintained by François Chollet, a Google engineer.

• Implements Layers, Objective/Loss functions, Activation functions, Optimizers, etc…

@bgoncalves www.data4sci.com
Keras
keras.io
• keras.models.Sequential(layers=None, name=None)- is the workhorse. You use it to
build a model layer by layer. Returns the object that we will use to build the model
• keras.layers
• Dense(units, activation=None, use_bias=True) - None means linear activation. Other
options are ’tanh’, ’sigmoid’, ’softmax’, ’relu’, etc.
• Activation(activation) - Same as the activation option to Dense, can also be used to
pass TensorFlow or Theano operations directly.
• Dropout(rate, seed=None) - Add a dropout factor: turn connections on/off randomly
from batch to batch
• SimpleRNN(units, input_shape, activation='tanh', use_bias=True, dropout=0.0,
return_sequences=False)
• GRU(units, input_shape, activation='tanh', use_bias=True, dropout=0.0,
return_sequences=False)
• LSTM(units, input_shape, activation='tanh', use_bias=True, dropout=0.0,
return_sequences=False)

@bgoncalves www.data4sci.com
Keras
keras.io
• Keras also has a great deal of support for Recurrent Neural Networks
• SimpleRNN(units, input_shape, activation='tanh', use_bias=True, dropout=0.0,
return_sequences=False) - Simple RNN with just a single gate
• LSTM(units, input_shape, activation='tanh', use_bias=True, dropout=0.0,
return_sequences=False) - Long-Short Term Memory is able to remember information
from several steps back
• GRU(units, input_shape, activation='tanh', use_bias=True, dropout=0.0,
return_sequences=False) - Simplified version of the LSTM, optimized for small datasets

@bgoncalves www.data4sci.com
Keras
keras.io
• As well as Convolutional Networks
• Conv1D(filters, kernel_size, activation=None, padding=“valid”) - 1D Convolutional
Neural Network for time series and text
• MaxPool1D(pool_size=2, strides=2, padding=“valid”) - Downsamples the input
representation by taking the maximum value
• Flatten() - Flattens the input without changing the batch size

@bgoncalves www.data4sci.com
Keras
keras.io
• Models are typically built in a sequential fashion, from the bottom (inputs) up (towards the
outputs)

• model = Sequential() - Initialize an empty model

• model.add(layer) - Add a layer to the top of the model

• model.summary() - Outputs a textual representation of the model with all the current layers,
parameters, etc

• Before a model can be used it must be compiled

• model.compile(optimizer, loss) - We have to compile the model before we can use it

• optimizer - ‘adam’, ‘sgd’, ‘rmsprop’, etc…

• loss - ‘mean_squared_error’, ‘categorical_crossentropy’,


‘kullback_leibler_divergence’, etc…

@bgoncalves www.data4sci.com
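
A minimal sketch of this workflow with tensorflow.keras (the layer sizes, the 100 input features and the binary output are arbitrary choices for illustration):

```python
from tensorflow import keras
from tensorflow.keras.layers import Dense, Dropout

# Build the model layer by layer, from the inputs up towards the outputs
model = keras.Sequential()
model.add(keras.Input(shape=(100,)))        # 100 input features
model.add(Dense(64, activation="relu"))
model.add(Dropout(0.5))                     # randomly drop connections from batch to batch
model.add(Dense(1, activation="sigmoid"))   # binary output

model.summary()

# The model must be compiled before it can be used
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```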
Keras
keras.io
• After compilation, the model is ready to be trained. The training interface is similar to that of
sklearn

• model.fit(x=None, y=None, batch_size=None, epochs=1, verbose=1,
validation_split=0.0, validation_data=None, shuffle=True)

• model.predict(x, batch_size=32, verbose=0)

• There is much more that can be done within the Keras framework. We’re only touching the
surface!

• Now we look more carefully at some of the pieces we’ll be using later

@bgoncalves www.data4sci.com
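
A self-contained sketch of the training interface, with random dummy data standing in for a real dataset:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras.layers import Dense

# A tiny model just to demonstrate the fit/predict interface
model = keras.Sequential([keras.Input(shape=(100,)),
                          Dense(64, activation="relu"),
                          Dense(1, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Dummy data: 1000 samples with 100 features each, and binary labels
X = np.random.random((1000, 100))
y = np.random.randint(0, 2, size=(1000,))

history = model.fit(X, y,
                    batch_size=32,
                    epochs=5,
                    validation_split=0.2,  # hold out 20% of the data for validation
                    verbose=1)

predictions = model.predict(X[:10], batch_size=32)  # outputs for the first 10 samples
```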
Keras Datasets
keras.io/api/datasets/
• Keras makes available a small number of curated datasets that you can easily use
• Each dataset provides a load_data() function to load the data (and download it the
first time it is used).
• TensorFlow also provides easy access to several dozen datasets that are preprocessed
and vectorized, across different topics:
https://www.tensorflow.org/datasets/catalog/overview#all_datasets
• Audio
• Image / Image Classification
• Object detection
• Question Answering
• Structured
• Summarization
• Text
• Translate
• Video
• Vision Language

@bgoncalves www.data4sci.com
Keras Datasets
keras.io/api/datasets/
• For our examples we’ll use the IMDB movie review
dataset:
• A dataset of 25,000 movie reviews from IMDB
• Each review is labeled as positive/negative
• Reviews have been preprocessed and are
ready to be used.

@bgoncalves www.data4sci.com
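
A sketch of loading the dataset with the settings used later in the course (10,000 most frequent words, reviews truncated/padded to 500 word IDs); the use of pad_sequences here is an assumption about how the preprocessing is done in the notebooks:

```python
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

num_words = 10_000  # keep only the 10,000 most frequent words
maxlen = 500        # truncate / pad every review to 500 word IDs

(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=num_words)

# Reviews come as lists of integer word IDs of varying length; make them fixed length
X_train = pad_sequences(X_train, maxlen=maxlen)
X_test = pad_sequences(X_test, maxlen=maxlen)

print(X_train.shape, y_train.shape)  # (25000, 500) (25000,)
```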
Lesson 2.2:
Activation Functions
Activation Function - Linear
• Linear function

• Differentiable

• Non-decreasing

• Compute new sets of features

• Each layer builds up a more abstract representation of the data

ϕ(z) = z

• The simplest

@bgoncalves www.data4sci.com
Activation Function - Sigmoid
• Non-Linear function

• Differentiable

• non-decreasing

• Compute new sets of features

• Each layer builds up a more abstract representation of the data

ϕ(z) = 1 / (1 + e⁻ᶻ)
• Perhaps the most common

@bgoncalves www.data4sci.com
Activation Function - ReLu
• Non-Linear function

• Differentiable

• non-decreasing

• Compute new sets of features

• Each layer builds up a more abstract representation of the data

ϕ(z) = max(0, z)

• Results in faster learning than with sigmoid

@bgoncalves www.data4sci.com
Activation Function - Hyperbolic Tangent
• Non-Linear function

• Differentiable

• non-decreasing

• Compute new sets of features

• Each layer builds up a more abstract representation of the data

ϕ(z) = tanh(z)

• Produces bounded positive and negative values

@bgoncalves www.data4sci.com
Lesson 2.3: Loss Functions
Loss-Functions https://keras.io/api/losses/

• Loss-Functions quantify the error we are making at each step


• Depend intrinsically on the output of our network (the final layer). Two major types:

• Probabilistic Losses - Compare two probability distributions (Classification)

• Cross-Entropy: Jw(X, y⃗) = −(1/m) [yᵀ log(hw(X)) + (1 − y)ᵀ log(1 − hw(X))]

• Regression Losses - Compare two arbitrary numbers (Regression)

• Mean Squared Error: Jw(X, y⃗) = (1/2m) Σi (hw(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
• Many other variants

@bgoncalves
Lesson 2.4:
Training Procedures
Optimization Problem
• (Machine) Learning can be thought of as an optimization problem.

• Optimization Problems have 3 distinct pieces:

• The constraints

• The function to optimize

• The optimization algorithm.

@bgoncalves www.data4sci.com
Optimization Problem
• (Machine) Learning can be thought of as an optimization problem.

• Optimization Problems have 3 distinct pieces:

• The constraints Network Structure

• The function to optimize Loss Function

• The optimization algorithm. Gradient Descent

@bgoncalves www.data4sci.com
Gradient Descent
• Goal: Find the minimum of Jw (X, y)⃗ by varying the components of w ⃗

• Intuition: Follow the slope of the error function until convergence

[Figure: the loss surface Jw(X, y⃗). At every point the update moves along the negative
gradient, −δJw(X, y⃗)/δw⃗, until it reaches the minimum.]

• Algorithm:

• Guess w⃗(0) (initial values of the parameters)

• Update until “convergence”, with step size α:

wj = wj − α δJw(X, y⃗)/δwj          where          δJw(X, y⃗)/δwj = (1/m) Xᵀ ⋅ (hw(X) − y⃗)
@bgoncalves www.data4sci.com
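
A minimal NumPy sketch of the update rule above, applied to a linear model with a squared-error loss on synthetic data:

```python
import numpy as np

# h_w(X) = X @ w, so the gradient is (1/m) X^T (h_w(X) - y), matching the slide
rng = np.random.default_rng(0)
m, n = 100, 3
X = rng.normal(size=(m, n))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=m)

w = np.zeros(n)  # initial guess w(0)
alpha = 0.1      # step size

for step in range(500):
    gradient = X.T @ (X @ w - y) / m
    w = w - alpha * gradient

print(w)  # should end up close to true_w
```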
Optimizers https://keras.io/api/optimizers/

• Keras has a wide range of Optimizers available:

• SGD - Stochastic Gradient Descent (with momentum)

• RMSprop - Divide the gradient by a discounted moving average of previous gradients

• Adam - SGD using adaptive estimation of first- and second-order moments.

• Adadelta - SGD with an adaptive learning rate

• Adagrad - SGD with parameter-specific learning rates

• Adamax - Infinity norm Adam

• Each optimizer tries to deal with one or more limitations of the basic SGD algorithm that
cause it to fail in specific cases

• Adam is a good general purpose choice

@bgoncalves www.data4sci.com
Learning Procedure

[Figure: the learning loop. The input Xᵀ flows through the constraint (the hypothesis
z = Xᵀw, ϕ(z)) to produce the predicted output y⃗̂, which is compared to the observed
output by the error function Jw(X, y⃗); the learning algorithm then updates the weights.]

@bgoncalves www.data4sci.com
Bias-Variance Tradeoff
[Figure: Bias-Variance tradeoff. Training and Testing error as a function of Model
Complexity; the low-complexity side corresponds to High Bias / Low Variance, the
high-complexity side to Low Bias / High Variance.]

@bgoncalves www.data4sci.com
Code - NN with Keras
https://github.com/DataForScience/AdvancedNLP
Lesson 3:
Text Classification
Lesson 3.1:
Text Classification
Text Classification
• Our prototypical example will be Text Classification
• We’ll learn how to classify IMDB reviews as Positive or Negative
• Reviews can have arbitrary lengths and vocabulary
"<START> this film was just brilliant casting location scenery “<START> worst mistake of my life br br i picked this
story direction everyone's really suited the part they played movie up at target for 5 because i figured hey it's
and you could just imagine being there robert <UNK> is an sandler i can get some cheap laughs i was wrong
amazing actor and now the same being director <UNK> father came completely wrong mid way through the film all three
from the same scottish island as myself so i loved the fact of my friends were asleep and i was still suffering
there was a real connection with this film the witty remarks worst plot worst script worst movie i have ever seen
throughout the film were great it was just brilliant so much i wanted to hit my head up against a wall for an hour
that i bought the film as soon as it was released for <UNK> and then i'd stop and you know why because it felt damn
would recommend it to everyone to watch and the fly fishing was good upon bashing my head in i stuck that damn movie
amazing really cried at the end it was so sad and you know what in the <UNK> and watched it burn and that felt better
they say if you cry at a film it must have been good and this than anything else i've ever done it took american
definitely was also <UNK> to the two little boy's that played psycho army of darkness and kill bill just to get
the <UNK> of norman and paul they were just brilliant children over that crap i hate you sandler for actually going
are often left out of the <UNK> list i think because the stars through with this and ruining a whole day of my life"
that play them all grown up are such a big profile for the whole
film but these children are amazing and should be praised for
what they have done don't you think the whole story was so
lovely because it was true and was someone's life after all that
was shared with us all"

• For convenience, we’ll consider only the 10,000 most frequent words and truncate the
reviews at 500 words.
• Removed words are marked by a special <UNK> token

@bgoncalves www.data4sci.com
Lesson 3.2:
Feed Forward Networks
Feed Forward Network
• Words are mapped to individual numerical IDs (in order of frequency), before being fed into
the model.
• The first layer of the network is an Embedding layer that maps numerical ids to a dense low
dimensional vector.

@bgoncalves www.data4sci.com
[Figure: the model, step by step. Each review is 500 words; each word is mapped to a 50D
vector by the Embedding layer. A Flatten layer reshapes the resulting 3D tensor into a 2D
matrix before the Dense layers. The None input dimension means that batch_size is defined
at training time.]

@bgoncalves www.data4sci.com
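
A sketch of this architecture in tensorflow.keras (the 64-unit hidden layer is an arbitrary choice; the course notebooks may differ in the details):

```python
from tensorflow import keras
from tensorflow.keras.layers import Embedding, Flatten, Dense

num_words = 10_000   # vocabulary size
maxlen = 500         # word IDs per review
embedding_dim = 50   # each word ID is mapped to a 50D vector

model = keras.Sequential([
    keras.Input(shape=(maxlen,), dtype="int32"),
    Embedding(num_words, embedding_dim),   # (batch, 500, 50)
    Flatten(),                             # (batch, 500 * 50)
    Dense(64, activation="relu"),
    Dense(1, activation="sigmoid"),        # positive / negative
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```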
Lesson 3.3:
Convolutional Neural Networks
Convolutional Neural Network http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution

• Originally developed for Image Processing

• A Convolution Layer computes a value along a moving window as it slides through the
image

• The output of the Convolution is smaller than the original image while still capturing
relevant information

• Different convolution operations produce different effects on the original image:

• Extract Edges, Blur, Emboss, etc

• Convolution layers are used to extract features from the original image

@bgoncalves www.data4sci.com
Convolutional Neural Network
• Images are just arrays of numbers, just like our input matrices of words!

[Figure: a 1D convolution sliding over the word-vector matrix (one d=8 vector per word) for
“this film was just brilliant casting”.]

• The Kernel for each Conv1D layer can be learned by the Network itself

@bgoncalves www.data4sci.com
Convolutional Neural Network
[Figure: the Conv1D model. Each word vector gets transformed from 50D to 32D, and each
review vector gets transformed from 500D to 250D.]

@bgoncalves www.data4sci.com
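
A sketch of the convolutional variant; the kernel size and "same" padding are assumptions chosen so that the shapes match the 50D→32D and 500→250 transformations described above:

```python
from tensorflow import keras
from tensorflow.keras.layers import Embedding, Conv1D, MaxPool1D, Flatten, Dense

model = keras.Sequential([
    keras.Input(shape=(500,), dtype="int32"),   # 500 word IDs per review
    Embedding(10_000, 50),                      # (batch, 500, 50)
    Conv1D(32, kernel_size=3, padding="same",
           activation="relu"),                  # (batch, 500, 32)
    MaxPool1D(pool_size=2),                     # (batch, 250, 32)
    Flatten(),
    Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```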
Code - Text Classification
https://github.com/DataForScience/AdvancedNLP
Lesson 4:
Word Embeddings
Lesson 4.1: Motivations
Word-Embeddings
• Word-Embeddings are simply vector representations of words.

• Typical vector representations are;

• one-hot encoding

• bag of words

• TF/IDF

• etc

• None of these representations includes semantic information

• We already used an Embedding layer to map word IDs to fixed dimension vectors

• After training the network, the embedding layer contains meaningful representations of the
input words

@bgoncalves www.data4sci.com
Word-Embeddings
• Different techniques were developed to generate vector representations that explicitly
encode semantics and that can be reused. Two common ones are:

• word2vec - Developed by Google using a simple Neural Network architecture.

• GloVe - Developed by Stanford to explicitly take co-occurrences into account

• Each vector encodes information about the meaning of the word it’s associated with

• Similarities between vectors match well to similarities between words

• Are useful ways of encoding words to input into a Neural Network

• Pre-trained vectors can be found online:

• word2vec - https://sites.google.com/site/rmyeid/projects/polyglot

• GloVe - https://github.com/stanfordnlp/GloVe

@bgoncalves www.data4sci.com
Word-Embeddings

@bgoncalves www.data4sci.com
Lesson 4.2:
Skip-gram and Continuous
Bag of Words
Word Embeddings
• The distributional hypothesis in linguistics states that words with similar meanings should
occur in similar contexts.

• In other words, from a word we can get some idea about the context where it might appear.

___ ___ house __ ____.
___ ___ car __ _______.          max p(C | w)

• And from the context we have some idea about possible words.

The red _____ is beautiful.
The blue _____ is old.           max p(w | C)

@bgoncalves www.data4sci.com
word2vec
Mikolov 2013

Skipgram: max p(C | w)                    Continuous Bag of Words: max p(w | C)

[Figure: the two word2vec architectures. Skipgram maps a word wj (as a one-hot vector) to
its context words wj−1, wj+1; Continuous Bag of Words maps the context words to the central
word wj. θ1 holds the word embeddings, θ2 the context embeddings, and σ is the activation
function.]

@bgoncalves www.data4sci.com
Variations
• Hierarchical Softmax:

• Approximate the softmax using a binary tree

• Reduces the number of calculations per training example from V to log2 V and
increases performance by orders of magnitude.

• Negative Sampling:

• Under sample the most frequent words by removing them from the text before
generating the contexts

• Similar idea to removing stop-words — very frequent words are less informative.

• Effectively makes the window larger, increasing the amount of information available for
context

@bgoncalves www.data4sci.com
word2vec details
• The output of this neural network is deterministic:

• If two words appear in the same context (“blue” vs “red”, for e.g.), they will have similar
internal representations in θ1 and θ2

• θ1 and θ2 are vector embeddings of the input words and the context words respectively

• Words that are too rare are also removed.

• The original implementation had a dynamic window size:

• for each word in the corpus a window size k′ is sampled uniformly between 1 and k

@bgoncalves www.data4sci.com
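
The slides describe the architecture itself; as a practical sketch, the gensim library (an assumption, not something named in the slides) exposes both variants through its sg flag:

```python
from gensim.models import Word2Vec

sentences = [
    ["mary", "had", "a", "little", "lamb"],
    ["the", "lamb", "was", "sure", "to", "go"],
    ["the", "red", "car", "is", "beautiful"],
    ["the", "blue", "car", "is", "old"],
]

# sg=1 selects the Skipgram objective, sg=0 Continuous Bag of Words;
# a real corpus needs far more text than this toy example
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["car"].shape)         # a 50-dimensional embedding
print(model.wv.most_similar("car"))  # nearest neighbours in the embedding space
```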

Lesson 4.3:
Transfer Learning
Transfer Learning https://learning.oreilly.com/library/view/java-deep-learning/9781788997454/de1d99a5-576d-45de-b77f-ee5563550894.xhtml

• Transfer Learning is the process of putting the knowledge learned by one network to use in
another, like when you make use of concepts from a different field to solve a problem

• In a more general case, entire layers of a Deep Learning Network that was trained for Task A
can be repurposed for use in Task B without any modifications

• This is particularly common in large scale systems that are extremely expensive (in both time
and money) to train from scratch

• We can take advantage of the huge amounts of work put in by Google, Stanford, etc.
to generate high quality embeddings, saving time and effort when developing our models

• In the case of small systems with relatively few training examples, specially trained
embeddings tend to outperform these high quality ones.

@bgoncalves www.data4sci.com
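
A sketch of reusing pre-trained GloVe vectors in a frozen Keras Embedding layer; the file name glove.6B.50d.txt (downloaded separately) and the simplified handling of the IMDB word index are assumptions made for illustration:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras.layers import Embedding
from tensorflow.keras.datasets import imdb

embedding_dim = 50
num_words = 10_000

# IMDB vocabulary (the index offsets used by load_data are ignored to keep the sketch short)
word_index = imdb.get_word_index()

# Copy the GloVe vector of every word we know into the embedding matrix
embedding_matrix = np.zeros((num_words, embedding_dim))
with open("glove.6B.50d.txt", encoding="utf-8") as glove_file:
    for line in glove_file:
        word, *values = line.split()
        index = word_index.get(word)
        if index is not None and index < num_words:
            embedding_matrix[index] = np.asarray(values, dtype="float32")

# Freeze the layer so the pre-trained vectors are not modified during training
embedding_layer = Embedding(
    num_words, embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False)
```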
Code - Word Embeddings
https://github.com/DataForScience/AdvancedNLP
Lesson 5:
Sequence Modeling
Lesson 5.1:
Recurrent Neural Networks
Feed Forward Networks

• The networks we’ve seen so far operate in a linear fashion: information flows from the
input xt straight through to the output ht = f (xt)

@bgoncalves www.data4sci.com
Recurrent Neural Network (RNN)

[Figure: a recurrent cell. The previous output ht−1 is fed back in alongside the input xt.]

• RNNs allow us to remember information we’ve seen before and act accordingly

ht = f (xt, ht−1)

@bgoncalves
Recurrent Neural Network (RNN)
• Each output depends (implicitly) on all previous outputs.

• RNNs are particularly useful to model sequential systems, like time series, audio or streams
of text

• Input sequences generate output sequences (seq2seq)

[Figure: an RNN unrolled in time. Each step receives the input xt and the previous state
ht−1 and produces the output ht, which is passed on to the next step.]

@bgoncalves www.data4sci.com
Recurrent Neural Network (RNN)
• SimpleRNN simply concatenates the last output it produced (ht−1)
to the current input it is processing (xt)

[Figure: the SimpleRNN cell. Both inputs, ht−1 and xt, are concatenated and passed
through a tanh.]

ht = tanh (Wht−1 + Uxt)

@bgoncalves www.data4sci.com
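
A sketch of a SimpleRNN-based classifier for the IMDB setup used earlier (32 units is an arbitrary choice):

```python
from tensorflow import keras
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

model = keras.Sequential([
    keras.Input(shape=(500,), dtype="int32"),  # 500 word IDs per review
    Embedding(10_000, 50),
    SimpleRNN(32),                             # ht = tanh(W ht-1 + U xt), 32 units
    Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```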
Lesson 5.2:
Gated Recurrent Unit
Gated Recurrent Unit (GRU)
• Introduced in 2014 by K. Cho

• Meant to solve the Vanishing Gradient Problem

• Can be considered as a simplification of LSTMs

• Similar performance to LSTM in some applications, better performance for smaller
datasets.

@bgoncalves www.data4sci.com
+ Element wise addition
× Element wise multiplication
1− 1 minus the input

Gated Recurrent Unit (GRU)

[Figure: the GRU cell, combining the previous state ht−1 and the input xt into the new
state ht through three internal quantities.]

• Update gate z: how much of the previous state should be kept?
z = σ (Wzht−1 + Uz xt)

• Reset gate r: how much of the previous output should be removed?
r = σ (Wr ht−1 + Ur xt)

• Current memory c: what information do we remember right now?
c = tanh (Wc (ht−1 ⊗ r) + Uc xt)

• Output: combine all available information.
ht = (z ⊗ c) + ((1 − z) ⊗ ht−1)

@bgoncalves www.data4sci.com
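
A NumPy sketch of a single GRU step implementing the equations above (biases omitted for clarity, toy dimensions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, x, Wz, Uz, Wr, Ur, Wc, Uc):
    """One GRU step, following the equations above (biases omitted)."""
    z = sigmoid(Wz @ h_prev + Uz @ x)        # update gate
    r = sigmoid(Wr @ h_prev + Ur @ x)        # reset gate
    c = np.tanh(Wc @ (h_prev * r) + Uc @ x)  # current memory
    return z * c + (1.0 - z) * h_prev        # new hidden state

# Toy dimensions: 4 hidden units, 3 input features
rng = np.random.default_rng(0)
hidden, features = 4, 3
weights = [rng.normal(scale=0.1, size=(hidden, hidden)) if i % 2 == 0
           else rng.normal(scale=0.1, size=(hidden, features))
           for i in range(6)]  # Wz, Uz, Wr, Ur, Wc, Uc

h = np.zeros(hidden)
for x in rng.normal(size=(10, features)):  # a sequence of 10 input vectors
    h = gru_step(h, x, *weights)
print(h)
```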
Lesson 5.3:
Long-Short Term Memory
Long-Short Term Memory (LSTM)
• What if we want to keep explicit information about previous states (memory)?

• How much information is kept, can be controlled through gates.

• LSTMs were first introduced in 1997 by Hochreiter and Schmidhuber

[Figure: an LSTM unrolled in time. In addition to the hidden state ht, a cell state ct is
passed from step to step.]

@bgoncalves www.data4sci.com
+ Element wise addition
× Element wise multiplication
1− 1 minus the input

Long-Short Term Memory (LSTM)

[Figure: the LSTM cell, combining the previous cell state ct−1, the previous output ht−1 and
the input xt into the new cell state ct and output ht through four internal quantities.]

• Forget gate f: how much of the previous state should be kept?
f = σ (Wf ht−1 + Uf xt)

• Input gate i: how much of the previous output should be remembered?
i = σ (Wiht−1 + Ui xt)

• Output gate o: how much of the previous output should contribute?
o = σ (Woht−1 + Uo xt)

• Candidate values g:
g = tanh (Wght−1 + Ug xt)

• State: update the current state
ct = (ct−1 ⊗ f ) + (g ⊗ i)

• Output: combine all available information.
ht = tanh (ct) ⊗ o

• All gates use the same inputs and activation functions, but different weights

www.data4sci.com
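
The same kind of NumPy sketch for a single LSTM step (biases omitted for clarity, toy dimensions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_prev, c_prev, x, Wf, Uf, Wi, Ui, Wo, Uo, Wg, Ug):
    """One LSTM step, following the equations above (biases omitted)."""
    f = sigmoid(Wf @ h_prev + Uf @ x)  # forget gate
    i = sigmoid(Wi @ h_prev + Ui @ x)  # input gate
    o = sigmoid(Wo @ h_prev + Uo @ x)  # output gate
    g = np.tanh(Wg @ h_prev + Ug @ x)  # candidate values
    c = c_prev * f + g * i             # new cell state
    h = np.tanh(c) * o                 # new output
    return h, c

rng = np.random.default_rng(0)
hidden, features = 4, 3
weights = [rng.normal(scale=0.1, size=(hidden, hidden)) if i % 2 == 0
           else rng.normal(scale=0.1, size=(hidden, features))
           for i in range(8)]  # Wf, Uf, Wi, Ui, Wo, Uo, Wg, Ug

h, c = np.zeros(hidden), np.zeros(hidden)
for x in rng.normal(size=(10, features)):  # a sequence of 10 input vectors
    h, c = lstm_step(h, c, x, *weights)
print(h)
```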
Lesson 5.4:
Auto-Encoder Models
Auto-Encoders
• Auto-Encoders use the same values for both inputs and outputs

• The Internal/hidden layer(s) have a smaller number of units than the input

• The fundamental idea is that the Network needs to learn an internal representation of its
inputs that is smaller but from which it is still possible to reconstruct the input.

[Figure: an auto-encoder. The inputs pass through Ωin to a smaller internal representation,
and Ωout reconstructs the outputs from it.]

• Think of it as “zipping” and “unzipping” the input values.

@bgoncalves www.data4sci.com
Auto-Encoders https://www.researchgate.net/figure/The-structure-of-proposed-Convolutional-AutoEncoders-CAE-for-MNIST-In-the-middle-there_fig1_320658590

• After training, the parts of the network that generate the internal representation can be reused
as inputs to other Networks

• This is similar to what we did when we reused the word embeddings generated by training a
word2vec network

• Auto-encoders can be arbitrarily complex, including many layers between the input and the
internal representation (or Code) and are often used in Image Processing to generate
efficient representations of complex images

@bgoncalves www.data4sci.com
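
A minimal dense auto-encoder sketch in tensorflow.keras (the 784-dimensional input and 32-dimensional code are arbitrary illustrative choices):

```python
from tensorflow import keras
from tensorflow.keras.layers import Dense

input_dim = 784  # e.g. a flattened 28x28 image; any vector input works
code_dim = 32    # the smaller internal representation

autoencoder = keras.Sequential([
    keras.Input(shape=(input_dim,)),
    Dense(code_dim, activation="relu", name="encoder"),     # "zip"
    Dense(input_dim, activation="sigmoid", name="decoder"), # "unzip"
])

# The same values are used as both inputs and targets:
autoencoder.compile(optimizer="adam", loss="mean_squared_error")
# autoencoder.fit(X, X, epochs=10, batch_size=32)
```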
Code - Sequence Modeling
https://github.com/DataForScience/AdvancedNLP
Question
• How was the technical level?

• 1 — Too Low (too many details)

• 2 — Low

• 3 — Just Right

• 4 — High

• 5 — Too High (too few details)

@bgoncalves www.data4sci.com
Question
• How was the level of Python code/explanations?

• 1 — Too Low (too many details)

• 2 — Low

• 3 — Just Right

• 4 — High

• 5 — Too High (too few details)

@bgoncalves www.data4sci.com
Events
graphs4sci.substack.com
Interactive Data Visualization with Python
Mar 13, 2024 -9am-3pm (PST)

Natural Language Processing (NLP) for Everyone


Apr 3, 2024 - 10am-4pm (PST)

Generative AI with OpenAI


May 22, 2024 -10am-4pm (PST)

ChatGPT and Competing LLMs


May 28, 2024 -10am-4pm (PST)

@bgoncalves www.data4sci.com
https://bit.ly/NLP_LL

@bgoncalves www.data4sci.com
https://bit.ly/Timeseries_LL

@bgoncalves www.data4sci.com
