NLP & Deep Learning for non-experts
Sanghamitra Deb
Staff Data Scientist
Chegg Inc
How to start projects in machine learning?
• Kaggle competitions ---
• Make sure to solve the ML problems for concept development before competing
How to start projects in machine learning?
• Self-guided workshops/projects --- let's say you have data from Zomato
• Restaurant recommendation -- user-based, content-similarity based
• Restaurant tags from reviews
• Sentiment analysis from reviews
Outline
• What is NLP
• Bag of Words model for sentiment analysis using scikit-learn
• Deep dive into deep learning
• Solve the sentiment analysis problem using Keras
• A short intro to Convolutional Neural Networks (CNNs)
What is Natural Language Processing?
• Giving structure to unstructured data
• Learn properties of the data that make decision making simple
• Provide concise information to drive the intelligence of different systems.
Why?
• Unstructured data cannot be consumed directly
• Automate simple and complex functionalities
• Inferences from text data become queryable, which can help with regular business unit (BU) reports
• Understand customers better and take necessary actions for a better experience.
Applications
• Categorization of text
• Building domain-specific Knowledge Graphs
• Recommendations
• Web --- Search
• HR --- people analytics
• Medical --- drug discovery, automated diagnosis
• …
What are the underlying tasks?
• Syntactic Parsing of sentences --- parsing based on structure
• Part of Speech Tagging
• Semantic Parsing -- mapping text directly into formal query language, e.g. SQL queries for a pre-determined database schema.
• Dialogue state tracking --- chatbots
• Machine Translation
• Language modeling
• Text extraction
• Classification
Text Classification
Offline pipeline (with SME input): Text Pre-processing → Collecting Training Data → Model Building
• Text Pre-processing: reduces noise, ensures quality, improves overall performance
• Collecting Training Data: examples of the classes that we are trying to model; model performance is directly correlated with the quality of the training data
• Model Building: model selection, architecture, parameter tuning
Online (with users): Model Evaluation
Text Data
Data Source -- https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences
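For readers following along, a minimal sketch of loading this dataset with pandas; the local file paths are placeholders, and the files are tab-separated sentence/label pairs as described on the UCI page.

```python
# A minimal sketch of loading the UCI "Sentiment Labelled Sentences" data,
# assuming the files have been downloaded locally (paths are placeholders).
import pandas as pd

filepath_dict = {
    "yelp": "data/yelp_labelled.txt",
    "amazon": "data/amazon_cells_labelled.txt",
    "imdb": "data/imdb_labelled.txt",
}

frames = []
for source, filepath in filepath_dict.items():
    # Each file is tab-separated: sentence <tab> label (0 = negative, 1 = positive)
    part = pd.read_csv(filepath, names=["sentence", "label"], sep="\t")
    part["source"] = source
    frames.append(part)

df = pd.concat(frames, ignore_index=True)
print(df.head())
```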
Model Building: a simple Bag of Words (BOW) model
https://realpython.com/python-keras-text-classification/
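A hedged sketch of such a BOW baseline with scikit-learn, along the lines of the linked tutorial; the 75/25 split and the LogisticRegression choice are illustrative.

```python
# Bag-of-words baseline: count word occurrences, then fit a linear classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

sentences = df["sentence"].values   # DataFrame loaded in the previous sketch
labels = df["label"].values

sentences_train, sentences_test, y_train, y_test = train_test_split(
    sentences, labels, test_size=0.25, random_state=1000)

# Build a vocabulary on the training sentences and turn each sentence
# into a sparse vector of word counts.
vectorizer = CountVectorizer()
vectorizer.fit(sentences_train)
X_train = vectorizer.transform(sentences_train)
X_test = vectorizer.transform(sentences_test)

classifier = LogisticRegression()
classifier.fit(X_train, y_train)
print("Test accuracy:", classifier.score(X_test, y_test))
```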
Deep Learning
"Deep learning algorithms seek to exploit the unknown structure in the input distribution in order to discover good representations, often at multiple levels, with higher-level learned features defined in terms of lower-level features." --- Yoshua Bengio
"A kind of learning where the representation you form have several levels of abstraction, rather than a direct input to output." --- Peter Norvig
"When you hear the term deep learning, just think of a large deep neural net. Deep refers to the number of layers typically and so this kind of the popular term that's been adopted in the press. I think of them as deep neural networks generally." --- Andrew Ng
Why now?
• Explosion in labelled data
• Exponential growth in computation power with cloud computing and availability of GPUs
• Improvements in setting initial conditions and activation functions
Neural Network
Can we simulate the brain by densely interconnecting neurons in a computer, such that it can learn things, recognize patterns and take decisions?
What is a neuron?
(Figure: a single neuron with inputs a1, a2, a3, and neurons wired together into a network; source: https://www.slideshare.net/tw_dsconf/ss-62245351)
Neural Network
• Each node is a function with input and output vectors
• Every network structure is defined by a set of functions
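A toy numpy version of a single node as a function, to make the "weighted sum plus activation" idea concrete; all numbers are illustrative.

```python
# A single "neuron": weighted sum of inputs plus a bias, passed through an
# activation function (sigmoid here).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

a = np.array([0.5, -1.2, 3.0])   # inputs a1, a2, a3
w = np.array([0.8, 0.1, -0.4])   # one weight per input
b = 0.2                          # bias

z = np.dot(w, a) + b             # weighted sum
output = sigmoid(z)              # neuron output
print(z, output)
```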
Output Layer
• Loss is minimized using Gradient Descent
• Find network parameters such that the loss is minimized
• This is done by taking derivatives of the loss w.r.t. the parameters
• Next the parameters are updated by subtracting the learning rate times the derivative
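A minimal numpy sketch of this loop for a one-parameter model with a mean-squared-error loss; the data and learning rate are illustrative.

```python
# Gradient descent on an MSE loss for a one-parameter linear model y = w * x.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])   # true relationship: y = 2x

w = 0.0      # initial parameter
lr = 0.01    # learning rate

for step in range(200):
    y_pred = w * x
    loss = np.mean((y_pred - y) ** 2)        # MSE loss
    grad = np.mean(2 * (y_pred - y) * x)     # derivative of the loss w.r.t. w
    w = w - lr * grad                        # update: subtract lr * derivative

print(w, loss)   # w approaches 2.0 as the loss shrinks
```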
Commonly used loss functions
Regression Loss Functions
• Mean Squared Error Loss
• Mean Squared Logarithmic Error Loss
• Mean Absolute Error Loss
Binary Classification Loss Functions
• Binary Cross-Entropy
• Hinge Loss
• Squared Hinge Loss
Multi-Class Classification Loss Functions
• Multi-Class Cross-Entropy Loss
• Sparse Multiclass Cross-Entropy Loss
• Kullback-Leibler Divergence Loss
Cost Function --- Cross Entropy
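For reference, binary cross-entropy written out in numpy; this is what Keras' 'binary_crossentropy' loss averages over a batch, and the numbers are illustrative.

```python
# Binary cross-entropy: penalizes confident wrong predictions heavily.
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.2, 0.7, 0.4])      # predicted probabilities
print(binary_cross_entropy(y_true, y_pred))
```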
Dropout -- avoid overfitting
• Large weights in a neural network are a sign of a more complex network that has overfit the training data.
• Probabilistically dropping out nodes in the network is a simple and effective regularization method.
• A large network with more training and the use of a weight constraint are suggested when using dropout.
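A hedged Keras sketch of dropout as a layer between dense layers; the layer sizes and the 0.5 rate are illustrative, not from the slides.

```python
# Dropout randomly zeroes a fraction of the previous layer's outputs during
# training, which discourages over-reliance on any single node.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

input_dim = 2000   # placeholder feature count (e.g., vocabulary size)

model = Sequential([
    Dense(64, activation="relu", input_dim=input_dim),
    Dropout(0.5),                      # drop 50% of activations during training
    Dense(1, activation="sigmoid"),
])
```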
Optimization Techniques
Gradient Descent
Adagrad
RMSprop
Adam
…
Adam Optimization
• Adam = adaptive moment estimation
• The method computes individual adaptive learning rates for different parameters from estimates of the first and second moments of the gradients.
• It calculates an exponential moving average of the gradient and the squared gradient; parameters control the decay rates of these moving averages.
https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/
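Using Adam in Keras is a one-liner at compile time; the hyperparameters shown are Keras' documented defaults.

```python
# Compile a previously defined Keras model (e.g., the dropout sketch above)
# with the Adam optimizer.
from tensorflow.keras.optimizers import Adam

optimizer = Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
model.compile(optimizer=optimizer,
              loss="binary_crossentropy",
              metrics=["accuracy"])
```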
Activation Functions
• Sigmoid / Softmax
• Tanh
• ReLU --- a = max(0, z)
• Leaky ReLU
• Swish --- https://arxiv.org/abs/1710.05941v1
(Figures: each activation function and its derivative.)
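The listed activations written out in numpy for reference; Swish is as defined in the linked paper, x multiplied by sigmoid(x).

```python
# Common activation functions in plain numpy.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0, z)              # a = max(0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def swish(z):
    return z * sigmoid(z)

def softmax(z):
    e = np.exp(z - np.max(z))            # subtract max for numerical stability
    return e / e.sum()
```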
Text Classification Reminder!
https://realpython.com/python-keras-text-classification/
Text Classification using a feed-forward NN
https://realpython.com/python-keras-text-classification/
Fit & measure accuracy!
plot_history(history)
Clearly overfits the data!
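A hedged sketch of this feed-forward classifier, following the linked tutorial: a small dense network on top of the BOW features built earlier. plot_history is the tutorial's plotting helper, and the gap between training and validation accuracy is the overfitting noted above.

```python
# Feed-forward network on bag-of-words features (X_train/X_test from the
# scikit-learn sketch earlier).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

input_dim = X_train.shape[1]                 # number of BOW features

model = Sequential([
    Dense(10, input_dim=input_dim, activation="relu"),
    Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

history = model.fit(X_train, y_train,
                    epochs=20, batch_size=10, verbose=False,
                    validation_data=(X_test, y_test))
# plot_history(history) would then show training accuracy far above
# validation accuracy, i.e. overfitting.
```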
Can we do better? Word Embeddings
• Words are represented as dense vectors
• These vectors are either learned during the training task by the neural network, or pre-trained, learned from Language Models
• They encode the semantic meaning of the word.
Text Pre-processing with Keras
Tokenizing and Padding
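A sketch of both steps with the Keras preprocessing utilities; num_words and maxlen are illustrative choices.

```python
# Tokenizer maps words to integer ids; pad_sequences makes every sequence
# the same length so it can feed an Embedding layer.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=5000)          # keep the 5000 most frequent words
tokenizer.fit_on_texts(sentences_train)        # sentences from the earlier split

X_train_seq = tokenizer.texts_to_sequences(sentences_train)
X_test_seq = tokenizer.texts_to_sequences(sentences_test)

vocab_size = len(tokenizer.word_index) + 1     # +1 for the reserved 0 index

maxlen = 100
X_train_pad = pad_sequences(X_train_seq, padding="post", maxlen=maxlen)
X_test_pad = pad_sequences(X_test_seq, padding="post", maxlen=maxlen)
```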
Start with an Embedding Layer
• The Embedding layer of Keras takes the previously calculated integers and maps them to a dense vector of the embedding.
o Parameters
Ø input_dim: the size of the vocabulary
Ø output_dim: the size of the dense vector
Ø input_length: the length of the sequence
Example sentences: "Hope to see you soon", "Nice to see you again" (figure: their embedded vectors after training)
https://stats.stackexchange.com/questions/270546/how-does-keras-embedding-layer-work
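A tiny runnable version of the layer with the toy sentences above; output_dim = 2 is chosen only so the vectors are easy to inspect.

```python
# Each integer token id is mapped to a learned dense vector.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

docs = ["Hope to see you soon", "Nice to see you again"]
tok = Tokenizer()
tok.fit_on_texts(docs)
seqs = pad_sequences(tok.texts_to_sequences(docs), maxlen=5)

model = Sequential([
    Embedding(input_dim=len(tok.word_index) + 1,  # size of the vocabulary
              output_dim=2,                       # size of the dense vector
              input_length=5)                     # length of each sequence
])
model.compile("rmsprop", "mse")
print(model.predict(seqs).shape)   # (2, 5, 2): one 2-d vector per token
```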
Add a pooling layer
• MaxPooling1D/AveragePooling1D or a GlobalMaxPooling1D/GlobalAveragePooling1D layer
• A way to downsample (reduce the size of) the incoming feature vectors
• Global max/average pooling takes the maximum/average of all features, whereas in the other case you have to define the pool size.
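For example, appended to the Embedding-only model from the previous sketch:

```python
# GlobalMaxPooling1D needs no pool size: it keeps one maximum per embedding
# dimension, collapsing (batch, seq_len, dim) to (batch, dim).
from tensorflow.keras.layers import GlobalMaxPooling1D, MaxPooling1D

model.add(GlobalMaxPooling1D())

# Alternative with an explicit window, which only halves the sequence length:
# model.add(MaxPooling1D(pool_size=2))
```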
Definition of the entire model
Training
Using pre-trained word embeddings will lead to an accuracy of 0.82. This is a case of transfer learning.
https://realpython.com/python-keras-text-classification
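A hedged sketch of the full model: an Embedding layer initialized from pre-trained vectors (e.g. GloVe), global max pooling, then dense layers. Building embedding_matrix from a pre-trained embedding file is assumed to have happened already; vocab_size, maxlen and the padded data come from the tokenizing sketch above.

```python
# Embedding initialized with pre-trained vectors (transfer learning),
# then pooling and a small dense classifier.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GlobalMaxPooling1D, Dense

embedding_dim = 50
model = Sequential([
    Embedding(vocab_size, embedding_dim,
              weights=[embedding_matrix],   # pre-trained vectors (assumed built from GloVe)
              input_length=maxlen,
              trainable=True),              # allow fine-tuning on our task
    GlobalMaxPooling1D(),
    Dense(10, activation="relu"),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

history = model.fit(X_train_pad, y_train,
                    epochs=20, batch_size=10, verbose=False,
                    validation_data=(X_test_pad, y_test))
```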
Embeddings + Maxpooling -- Benefits
• Power of generalization --- embeddings are able to share information across similar features.
• Fewer nodes with zero values.
Convolutional Neural Network
Detect features! Downsample.
What is a CNN?
In a traditional feed-forward neural network we connect each input neuron to each output neuron in the next layer. That's also called a fully connected layer, or affine layer.
• We use convolutions over the input layer to compute the output. This results in local connections, where each region of the input is connected to a neuron in the output. Each layer applies different filters and combines the results.
• During the training phase, a CNN automatically learns the values of its filters based on the task you want to perform.
Tricky --- dimensions keep changing as we go from one layer to another
Model definition
embedding_dim = 50
maxlen = 10
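A hedged sketch of the CNN text classifier with those dimensions; the 128 filters of width 5 follow the linked tutorial and are otherwise illustrative.

```python
# Conv1D slides filters over the sequence of word embeddings to detect
# local features, then pooling downsamples them.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

embedding_dim = 50
maxlen = 10

model = Sequential([
    Embedding(vocab_size, embedding_dim, input_length=maxlen),  # vocab_size from the tokenizer sketch
    Conv1D(128, 5, activation="relu"),   # 128 filters sliding over 5 tokens at a time
    GlobalMaxPooling1D(),
    Dense(10, activation="relu"),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()   # dimensions change layer to layer, as noted above
```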
Advantages of CNN
• Character-based CNN
• Has the ability to deal with out-of-vocabulary words; this makes it particularly suitable for user-generated raw text.
• Works for multiple languages.
• Model size is small since the tokens are limited to the number of characters (~70). This makes real-life deployments easier and faster.
• Networks with convolutional and pooling layers are useful for classification tasks in which we expect to find strong local clues regarding class membership.
Takeaways!
• If you have text data, you need to use NLP
• Try a simple bag-of-words model for your data
• Having a high-level understanding of deep learning will help with better judgement in architecture design and choice of parameters.
• Deep Learning has the potential to give high performance, but you need a large amount of training data to reap the benefits.
Thank You
@sangha_deb
sangha123@gmail.com
Visualization of the architecture
(Figure: input sequence of length 10, embedding dimension 50 → Conv1D → GlobalMaxPool1D → Dense layer → Sigmoid output)
Some helpful courses
https://www.coursera.org/learn/classification-vector-spaces-in-nlp
Appendix
Transfer Learning
Character-Based CNNs
https://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf
• Embedding Layer
• Six convolutional layers, three of which are followed by a max-pooling layer
• Two fully connected layers (Dense layers in Keras) with 1024 units each
• Output layer (Dense layer); the number of units depends on the number of classes. In this task we set it to 4.
Pre-processing
Setting Embedding Weights
Model
https://towardsdatascience.com/character-level-cnn-with-keras-50391c3adf33
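A compact, hedged sketch of that architecture in Keras; the filter counts, kernel sizes and the 1014-character input length follow Zhang et al. and are assumptions here, not details from the slides.

```python
# Character-level CNN: six Conv1D layers, max pooling after three of them,
# two 1024-unit dense layers, and a 4-class output.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Embedding, Conv1D, MaxPooling1D,
                                     Flatten, Dense, Dropout)

num_chars = 70       # size of the character vocabulary (~70, as noted above)
input_len = 1014     # fixed number of characters per document (assumption)

model = Sequential([
    Embedding(num_chars + 1, 16, input_length=input_len),
    Conv1D(256, 7, activation="relu"), MaxPooling1D(3),
    Conv1D(256, 7, activation="relu"), MaxPooling1D(3),
    Conv1D(256, 3, activation="relu"),
    Conv1D(256, 3, activation="relu"),
    Conv1D(256, 3, activation="relu"),
    Conv1D(256, 3, activation="relu"), MaxPooling1D(3),
    Flatten(),
    Dense(1024, activation="relu"), Dropout(0.5),
    Dense(1024, activation="relu"), Dropout(0.5),
    Dense(4, activation="softmax"),      # 4 classes, as in the appendix slide
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```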
