Natural Language Processing
Yuriy Guts – Jul 09, 2016
Who Is This Guy?
Data Science Team Lead
Sr. Data Scientist
Software Architect, R&D Engineer
I also teach Machine Learning:
What is NLP?
Study of interaction between computers and human languages
NLP = Computer Science + AI + Computational Linguistics
Common NLP Tasks
Easy:
• Chunking
• Part-of-Speech Tagging
• Named Entity Recognition
• Spam Detection
• Thesaurus
Medium:
• Syntactic Parsing
• Word Sense Disambiguation
• Sentiment Analysis
• Topic Modeling
• Information Retrieval
Hard:
• Machine Translation
• Text Generation
• Automatic Summarization
• Question Answering
• Conversational Interfaces
Interdisciplinary Tasks: Speech-to-Text
Interdisciplinary Tasks: Image Captioning
What Makes NLP so Hard?
Ambiguity
Non-Standard Language
Also: neologisms, complex entity names, phrasal verbs/idioms
More Complex Languages Than English
• German: Donaudampfschiffahrtsgesellschaftskapitän (5 “words”)
• Chinese: 50,000 different characters (2-3k to read a newspaper)
• Japanese: 3 writing systems
• Thai: Ambiguous word boundaries and sentence concepts
• Slavic: Different word forms depending on gender, case, tense
Write Traditional “If-Then-Else” Rules?
BIG NOPE!
Leads to very large and complex codebases.
Still struggles to capture trivial cases (for a human).
Better Approach: Machine Learning
“A computer program is said to learn from experience E
with respect to some class of tasks T and performance measure P,
if its performance at tasks in T, as measured by P,
improves with experience E.”
— Tom M. Mitchell
Part 1
Essential Machine Learning Background for NLP
Before We Begin: Disclaimer
• This will be a very quick description of ML. By no means exhaustive.
• Only the essential background for what we’ll have in Part 2.
• To fit everything into a small timeframe, I’ll simplify some aspects.
• I encourage you to read ML books or watch videos to dig deeper.
Common ML Tasks
1. Supervised Learning
• Regression
• Classification (Binary or Multi-Class)
2. Unsupervised Learning
• Clustering
• Anomaly Detection
• Latent Variable Models (Dimensionality Reduction, EM, …)
Regression
Predict a continuous dependent variable
based on independent predictors
Linear Regression
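A minimal sketch of what fitting a line means, assuming a single predictor and the ordinary least squares criterion. The function name fit_line and the toy data are made up for illustration; they are not from the talk.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x).
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data lying exactly on y = 2x + 1 (invented for this example).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

Adding polynomial features amounts to fitting the same kind of model on extra columns such as x squared, so the model stays linear in its parameters.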
After adding polynomial features
Classification
Assign an observation to some category
from a known discrete list of categories
Logistic Regression
Class A
Class B
(Multi-class extension = Softmax Regression)
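A rough sketch of binary logistic regression trained with batch gradient descent on log-loss. It assumes one input feature; the names train_logistic and predict, the learning rate, and the toy points are all illustrative choices, not taken from the slides.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(points, labels, lr=0.5, epochs=2000):
    """Batch gradient descent on log-loss for a one-feature logistic model."""
    w, b = 0.0, 0.0
    n = len(points)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(points, labels):
            # Derivative of log-loss with respect to the logit (w * x + b).
            error = sigmoid(w * x + b) - y
            grad_w += error * x
            grad_b += error
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Toy data: Class B (label 1) has larger x than Class A (label 0).
points = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
labels = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(points, labels)


def predict(x):
    """Return 1 (Class B) when the predicted probability is at least 0.5."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0


print(predict(-1.2), predict(1.2))
```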
Neural Networks
and Backpropagation Algorithm
http://playground.tensorflow.org/
Clustering
Group objects in such a way
that objects in the same group are similar,
and objects in the different groups are not
K-Means Clustering
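A small sketch of the K-Means loop (Lloyd's algorithm) on 2-D points. It assumes the starting centroids are passed in explicitly so the run is deterministic; real libraries usually pick them randomly. The data is invented.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on 2-D points, starting from the given centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Two visibly separate groups of toy points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
print(centroids)
```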
Evaluation
How do we know if an ML model is good?
What do we do if something goes wrong?
Underfitting & Overfitting
Development & Troubleshooting
• Picking the right metric: MAE, RMSE, AUC, Cross-Entropy, Log-Loss
• Training Set / Validation Set / Test Set split
• Picking hyperparameters against Validation Set
• Regularization to prevent overfitting
• Plotting learning curves to check for underfitting/overfitting
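To make the checklist above concrete, here is a hedged sketch of three of the listed metrics plus a three-way data split. The 60/20/20 proportions and the function names are assumptions made for the example, not recommendations from the talk.

```python
import math
import random


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def log_loss(y_true, p_pred, eps=1e-15):
    """Binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


def train_val_test_split(rows, val_share=0.2, test_share=0.2, seed=0):
    """Shuffle once, then cut into training / validation / test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_share)
    n_val = int(len(rows) * val_share)
    return rows[n_test + n_val:], rows[n_test:n_test + n_val], rows[:n_test]


train, val, test = train_val_test_split(range(10))
print(len(train), len(val), len(test))
```

Hyperparameters would be compared on val only; test is touched once, at the very end.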
Deep Learning
• Core idea: instead of hand-crafting complex features, use increased computing
capacity and build a deep computation graph that will try to learn feature
representations on its own.
End-to-end learning rather than a cascade of apps.
• Works best with lots of homogeneous, spatially related features
(image pixels, character sequences, audio signal measurements).
Usually works poorly otherwise.
• State-of-the-art and/or superhuman performance on many tasks.
• Typically requires massive amounts of data and training resources.
• But: a very young field. Theories not strongly established, views change.
Example: Convolutional Neural Network
Part 2
NLP Challenges And Approaches
“Classical” NLP Pipeline
Tokenization: break text into sentences and words, lemmatize
Morphology: part-of-speech (POS) tagging, stemming, NER
Syntax: constituency/dependency parsing
Semantics: coreference resolution, word sense disambiguation
Discourse: task-dependent (sentiment, …)
Often Relies on Language Banks
• WordNet (ontology, semantic similarity tree)
• Penn Treebank (POS, grammar rules)
• PropBank (semantic propositions)
• …Dozens of them!
Tokenization & Stemming
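A deliberately naive sketch of these steps using only regular expressions. The rules are assumptions made for illustration: the stemmer is simple suffix stripping, not the Porter algorithm, and the sentence splitter breaks on abbreviations such as "Dr.", which is exactly the kind of ambiguity that makes rule-based NLP hard.

```python
import re


def split_sentences(text):
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    # Words (allowing internal apostrophes and hyphens) or single punctuation marks.
    return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", sentence)


def naive_stem(word):
    # Toy suffix stripping that keeps at least three characters of the word.
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print(split_sentences("Dr. Smith arrived. He left!"))
print(tokenize("It's a test."))
print([naive_stem(w) for w in ["words", "parsing", "tagged"]])
```

In practice a library such as NLTK would handle these steps with trained models and proper stemmers.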
POS/NER Tagging
Parsing (LPCFG)
“Classical” way: Training a NER Tagger
Task: Predict whether the word is a PERSON, LOCATION, DATE or OTHER.
Could be more than 3 NER tags (e.g. MUC-7 contains 7 tags).
Features:
1. Current word.
2. Previous, next word (context).
3. POS tags of current word and nearby words.
4. NER label for previous word.
5. Word substrings (e.g. ends in “burg”, contains “oxa” etc.)
6. Word shape (internal capitalization, numerals, dashes etc.).
7. …on and on and on…
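The hand-crafted features listed on this slide could be extracted roughly like this. It is a sketch only: the feature names, the <START>/<END> padding markers, and the example sentence with its POS tags are assumptions, and a real tagger would feed such dictionaries into a classifier.

```python
import re


def word_shape(word):
    # Map characters to classes: uppercase -> X, lowercase -> x, digit -> d.
    shape = re.sub(r"[A-Z]", "X", word)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"[0-9]", "d", shape)


def token_features(tokens, pos_tags, i, prev_ner):
    """Build a feature dictionary for the token at position i."""
    word = tokens[i]
    last = len(tokens) - 1
    return {
        "word": word.lower(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_word": tokens[i + 1].lower() if i < last else "<END>",
        "pos": pos_tags[i],
        "prev_pos": pos_tags[i - 1] if i > 0 else "<START>",
        "prev_ner": prev_ner,
        "suffix4": word[-4:].lower(),
        "shape": word_shape(word),
    }


# Hypothetical sentence with Penn Treebank style POS tags.
tokens = ["Anna", "visited", "Hamburg", "in", "2016"]
pos_tags = ["NNP", "VBD", "NNP", "IN", "CD"]
features = token_features(tokens, pos_tags, i=2, prev_ner="OTHER")
print(features)
```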
Feature Representation: Bag of Words
A single word is a one-hot encoding vector with the size of the dictionary :(
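A minimal sketch of the representation, assuming a toy six-token corpus. It shows why the vectors grow with the dictionary and why any two different words look equally unrelated.

```python
def build_vocabulary(tokens):
    """Assign each distinct word an index (alphabetical order, for determinism)."""
    return {word: index for index, word in enumerate(sorted(set(tokens)))}


def one_hot(word, vocabulary):
    """Vector of dictionary size with a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[vocabulary[word]] = 1
    return vector


def bag_of_words(tokens, vocabulary):
    """Sum of the one-hot vectors of all tokens: word counts, order discarded."""
    vector = [0] * len(vocabulary)
    for token in tokens:
        vector[vocabulary[token]] += 1
    return vector


tokens = "the cat sat on the mat".split()
vocabulary = build_vocabulary(tokens)
cat, mat = one_hot("cat", vocabulary), one_hot("mat", vocabulary)
# The dot product of two different one-hot vectors is always 0,
# so the encoding carries no notion of similarity.
similarity = sum(a * b for a, b in zip(cat, mat))
print(cat, bag_of_words(tokens, vocabulary), similarity)
```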
Problem
• Manually designed features are often over-specified, incomplete,
take a long time to design and validate.
• Often requires PhD-level knowledge of the domain.
• Researchers spend literally decades hand-crafting features.
• Bag of words model is very high-dimensional and sparse,
cannot capture semantics or morphology.
Maybe Deep Learning can help?
Deep Learning for NLP
• Core enabling idea: represent words as dense vectors
one-hot [0 1 0 0 0 0 0 0 0] → dense [0.315 0.136 0.831]
• Try to capture semantic and morphologic similarity so that the features
for “similar” words are “similar”
(e.g. closer in Euclidean space).
• Natural language is context dependent: use context for learning.
• Straightforward (but slow) way: build a co-occurrence matrix and SVD it.
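A sketch of the first half of the co-occurrence approach: counting neighbours within a window and comparing the resulting sparse rows. The SVD step is left out here because it needs a linear algebra library; the window size and the three-sentence corpus are assumptions.

```python
import math
from collections import defaultdict


def cooccurrence(sentences, window=1):
    """Count how often each word appears within `window` positions of another."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts


def cosine(row_a, row_b):
    """Cosine similarity between two sparse count rows."""
    dot = sum(v * row_b.get(k, 0) for k, v in row_a.items())
    norm_a = math.sqrt(sum(v * v for v in row_a.values()))
    norm_b = math.sqrt(sum(v * v for v in row_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


corpus = [["i", "like", "cats"], ["i", "like", "dogs"], ["i", "enjoy", "flying"]]
counts = cooccurrence(corpus)
# "cats" and "dogs" share the same context ("like"), "flying" does not.
print(cosine(counts["cats"], counts["dogs"]), cosine(counts["cats"], counts["flying"]))
```

Each row has as many columns as the dictionary has words; a truncated SVD would then compress the rows into short dense vectors.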
Embedding Methods: Word2Vec
CBoW version: predict center word from context
Skip-gram version: predict context from center word
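The difference between the two variants is easiest to see in the training examples they generate. This sketch only builds those examples; it does not train a network, and the window size and sentence are assumptions.

```python
def context_of(tokens, i, window):
    """Words within `window` positions of index i, excluding the word itself."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]


def skipgram_pairs(tokens, window=2):
    """Skip-gram: one (center, context_word) example per neighbour."""
    return [
        (center, neighbour)
        for i, center in enumerate(tokens)
        for neighbour in context_of(tokens, i, window)
    ]


def cbow_examples(tokens, window=2):
    """CBoW: one (context_words, center) example per position."""
    return [(context_of(tokens, i, window), center) for i, center in enumerate(tokens)]


tokens = ["the", "quick", "brown", "fox"]
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

With a library such as Gensim these examples are generated internally; only the tokenized corpus is supplied.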
Benefits
• Learns features of each word on its own, given a text corpus.
• No heavy preprocessing is required, just a corpus.
• Word vectors can be used as features for lots of supervised
learning applications: POS, NER, chunking, semantic role labeling.
All with pretty much the same network architecture.
• Similarities and linear relationships between word vectors.
• A bit more modern representation: GloVe, but requires more RAM.
Linearities
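A toy illustration of the idea of linear relationships between word vectors. Important assumption: the 2-D vectors below are hand-made so that the arithmetic works out exactly; they are not learned embeddings, and real models only approximate this behaviour.

```python
import math

# Hand-made toy vectors (NOT learned): axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def analogy(a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding the inputs."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# "man is to king as woman is to ...?"
print(analogy("man", "king", "woman"))
```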
Training a NER Tagger: Deep Learning
Just replace this with NER tag
(or POS tag, chunk end, etc.)
Language Modeling
Assign high probabilities to well-formed sentences
(crucial for text generation, speech recognition, machine translation)
“Classical” Way: N-Grams
Problem: doesn’t scale well to bigger N. N = 5 is pretty much the limit.
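A sketch of the simplest useful case, a bigram model (N = 2) with maximum-likelihood estimates and no smoothing. The <s> and </s> boundary markers and the two-sentence corpus are assumptions made for the example.

```python
from collections import Counter


def train_bigram(sentences):
    """Count contexts and bigrams, padding each sentence with boundary markers."""
    contexts, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        contexts.update(padded[:-1])
        bigrams.update(zip(padded, padded[1:]))
    return contexts, bigrams


def sentence_probability(tokens, contexts, bigrams):
    """P(sentence) as a product of P(word | previous word); unseen bigrams give 0."""
    padded = ["<s>"] + tokens + ["</s>"]
    probability = 1.0
    for prev, word in zip(padded, padded[1:]):
        if contexts[prev] == 0:
            return 0.0
        probability *= bigrams[(prev, word)] / contexts[prev]
    return probability


corpus = [["i", "like", "cats"], ["i", "like", "dogs"]]
contexts, bigrams = train_bigram(corpus)
print(sentence_probability(["i", "like", "cats"], contexts, bigrams))
print(sentence_probability(["cats", "like", "i"], contexts, bigrams))
```

The scaling problem follows from counting: with a vocabulary of V words there are V to the power N possible N-grams, so most of them never appear in any corpus.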
Deep Learning Way: Recurrent NN (RNN)
Can use past information without restricting the size of the context.
But: in practice, can’t recall information that came in a long time ago.
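A bare-bones sketch of the recurrence behind a vanilla RNN, with fixed untrained weights chosen arbitrarily for illustration. The hidden state has the same size for any sequence length, and with small recurrent weights the trace of an early input fades step by step. This is only an analogy for the limitation above; in trained networks the usual explanation is vanishing gradients.

```python
import math


def rnn_step(x, h_prev, W_xh, W_hh, b):
    """One step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)."""
    h = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_xh[i][j] * x[j] for j in range(len(x)))
        total += sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        h.append(math.tanh(total))
    return h


def run_rnn(inputs, W_xh, W_hh, b):
    """Feed a whole sequence through the cell, starting from a zero state."""
    h = [0.0] * len(b)
    for x in inputs:
        h = rnn_step(x, h, W_xh, W_hh, b)
    return h


# Arbitrary untrained weights (assumptions, not from the talk).
W_xh = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.0], [0.0, 0.2]]
b = [0.0, 0.0]

h_short = run_rnn([[1.0, 0.0]], W_xh, W_hh, b)
h_long = run_rnn([[1.0, 0.0]] + [[0.0, 0.0]] * 20, W_xh, W_hh, b)
print(h_short, h_long)
```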
Long Short Term Memory Network (LSTM)
Contains gates that control forgetting, adding, updating and outputting information.
Surprisingly amazing performance at language tasks compared to vanilla RNN.
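A scalar sketch of one LSTM step using the commonly cited gate equations. All parameter values are hand-picked assumptions: the forget gate is biased open and the input gate biased shut, to show how the cell state can carry a value across many steps nearly unchanged.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One scalar LSTM step; params maps gate name -> (w_x, w_h, bias)."""

    def pre_activation(gate):
        w_x, w_h, bias = params[gate]
        return w_x * x + w_h * h_prev + bias

    f = sigmoid(pre_activation("forget"))       # how much old cell state to keep
    i = sigmoid(pre_activation("input"))        # how much new information to add
    g = math.tanh(pre_activation("candidate"))  # the new information itself
    o = sigmoid(pre_activation("output"))       # how much cell state to expose
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


# Hand-picked parameters: forget gate almost fully open, input gate almost closed.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "candidate": (1.0, 0.0, 0.0),
    "output": (0.0, 0.0, 0.0),
}

h, c = 0.0, 1.0
for _ in range(20):
    h, c = lstm_step(0.5, h, c, params)
print(c)
```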
Tackling Hard Tasks
Deep Learning enables end-to-end learning for Machine Translation,
Image Captioning, Text Generation, Summarization:
NLP tasks which are inherently very hard!
RNN for Machine Translation
Hottest Current Research
• Attention Networks
• Dynamic Memory Networks
(see ICML 2016 proceedings)
Tools I Used
• NLTK (Python)
• Gensim (Python)
• Stanford CoreNLP (Java with bindings)
• Apache OpenNLP (Java with bindings)
Deep Learning Frameworks with GPU Support:
• Torch (Torch-RNN) (Lua)
• TensorFlow, Theano, Keras (Python)
NLP Progress for Ukrainian
• Ukrainian lemma dictionary with POS tags
https://github.com/arysin/dict_uk
• Ukrainian lemmatizer plugin for ElasticSearch
https://github.com/mrgambal/elasticsearch-ukrainian-lemmatizer
• lang-uk project (1M corpus, NER, tokenization, etc.)
https://github.com/lang-uk
Demo 1: Exploring Semantic Properties Of ASOIAF (“Game of Thrones”)
Demo 2: Topic Modeling for DOU.UA Comments
GitHub Repos with IPython Notebooks
• https://github.com/YuriyGuts/thrones2vec
• https://github.com/YuriyGuts/dou-topic-modeling
yuriy.guts@gmail.com
linkedin.com/in/yuriyguts
github.com/YuriyGuts
