Vectorization
Core Concepts in Data Mining
Georgia Tech – CSE6242 – March 2015
Josh Patterson
Presenter: Josh Patterson
• Email:
– josh@pattersonconsultingtn.com
• Twitter:
– @jpatanooga
• Github:
– https://github.com/jpatanooga
Past
Published in IAAI-09:
“TinyTermite: A Secure Routing
Algorithm”
Grad work in meta-heuristics and ant algorithms
Tennessee Valley Authority (TVA)
Hadoop and the Smartgrid
Cloudera
Principal Solution Architect
Today: Patterson Consulting
Topic Index
• Why Vectorization?
• Vector Space Model
• Text Vectorization
• General Vectorization
WHY VECTORIZATION?
“How is it possible for a slow, tiny brain, whether biological or
electronic, to perceive, understand, predict, and manipulate a
world far larger and more complicated than itself?”
--- Stuart Russell and Peter Norvig, “Artificial Intelligence: A Modern Approach”
Classic Scenario:
“Classify some tweets
for positive vs
negative sentiment”
What Needs to Happen?
• Need each tweet as some structure that can be
fed to a learning algorithm
– To represent the knowledge of what makes a tweet
“negative” vs “positive”
• How does that happen?
– We need to take the raw text and convert it into what
is called a “vector”
• Vector relates to the fundamentals of linear
algebra
– “Solving sets of linear equations”
Wait. What’s a Vector Again?
• An array of floating point numbers
• Represents data
– Text
– Audio
– Image
• Example:
–[ 1.0, 0.0, 1.0, 0.5 ]
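As a small illustration of how non-text data can end up as an array of floating point numbers, the sketch below flattens a tiny grayscale image into a vector. The 2x2 image and its pixel values are made up for the example; real images are simply longer arrays.

```python
# A hypothetical 2x2 grayscale image; each pixel is an intensity in 0..255.
image = [
    [255, 0],
    [255, 51],
]

# Flatten row by row and scale every pixel into the range 0.0..1.0.
image_vector = [pixel / 255 for row in image for pixel in row]

print(image_vector)  # → [1.0, 0.0, 1.0, 0.2]
```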
VECTOR SPACE MODEL
“I am putting myself to the fullest possible use, which is
all I think that any conscious entity can ever hope to do.”
--- HAL 9000, “2001: A Space Odyssey”
Vector Space Model
• Common way of vectorizing text
– every possible word is mapped to a specific integer
• If we have a large enough array then every word
fits into a unique slot in the array
– the value at that index is the number of times the
word occurs
• Most often our array size is smaller than our corpus
vocabulary
– so we have to have a “vectorization strategy” to
account for this
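A minimal sketch of the vector space model described above, assuming whitespace tokenization and an array large enough for every word in the corpus. The function names `build_vocabulary` and `vectorize` are illustrative, not taken from any library.

```python
def build_vocabulary(documents):
    """Map every distinct word in the corpus to a unique integer index."""
    vocabulary = {}
    for document in documents:
        for word in document.lower().split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary


def vectorize(document, vocabulary):
    """Return a count vector: slot i holds how often word i occurs."""
    vector = [0.0] * len(vocabulary)
    for word in document.lower().split():
        index = vocabulary.get(word)
        if index is not None:  # words outside the vocabulary are skipped
            vector[index] += 1.0
    return vector


corpus = ["the cat sat", "the cat saw the dog"]
vocab = build_vocabulary(corpus)

print(vocab)  # → {'the': 0, 'cat': 1, 'sat': 2, 'saw': 3, 'dog': 4}
print(vectorize("the dog saw the cat", vocab))  # → [2.0, 1.0, 0.0, 1.0, 1.0]
```

When the array is smaller than the vocabulary, this one-slot-per-word mapping no longer works, which is where a vectorization strategy comes in.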
Text Processing Can Include Several Stages
• Sentence Segmentation
– can skip straight to tokenization depending on use case
• Tokenization
– find individual words
• Lemmatization
– finding the base or stem of words
• Removing Stop words
– “the”, “and”, etc
• Vectorization
– we take the output of the process and make an array of
floating point values
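The stages before vectorization can be sketched with the standard library alone. Everything here is a simplification chosen for the example: the regular expressions, the tiny stop word list, and `crude_stem`, which only strips a few English suffixes and is a stand-in for real lemmatization.

```python
import re

STOP_WORDS = {"the", "and", "a", "of", "to", "is"}  # tiny illustrative list


def split_sentences(text):
    """Naive sentence segmentation: split after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Lowercase the sentence and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", sentence.lower())


def crude_stem(token):
    """Stand-in for lemmatization: strip a few common English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Segment, tokenize, stem, then drop stop words, in that order."""
    tokens = []
    for sentence in split_sentences(text):
        for token in tokenize(sentence):
            stem = crude_stem(token)
            if stem not in STOP_WORDS:
                tokens.append(stem)
    return tokens


print(preprocess("The cats jumped. The dog is barking and playing!"))
# → ['cat', 'jump', 'dog', 'bark', 'play']
```

The resulting token list is what the final vectorization stage would turn into an array of floating point values.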
TEXT VECTORIZATION STRATEGIES
“A man who carries a cat by the tail learns something he can learn
in no other way.”
--- Mark Twain
Bag of Words
• A group of words or a document is represented as a bag
– or “multi-set” of its words
• Bag of words is a list of words and their word counts
– simplest vector model
– but can end up using a lot of columns due to number of words
involved.
• Grammar and word ordering are ignored
– but we still track how many times the word occurs in the
document
• Used most frequently in the document classification
and information retrieval domains
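A bag of words maps naturally onto Python's `collections.Counter`, which is a multiset. The sketch below assumes simple whitespace tokenization and shows that two sentences with the same words in a different order produce the same bag.

```python
from collections import Counter


def bag_of_words(text):
    """Return the multiset of words in the text: word -> occurrence count."""
    return Counter(text.lower().split())


bag_a = bag_of_words("the dog chased the cat")
bag_b = bag_of_words("the cat chased the dog")

print(bag_a["the"])    # → 2
print(bag_a == bag_b)  # → True, because word order is ignored
```

The example also shows the main weakness of the model: "the dog chased the cat" and "the cat chased the dog" become indistinguishable.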
Term frequency inverse document
frequency (TF-IDF)
• Fixes some issues with “bag of words”
• allows us to leverage the information about
how often a word occurs in a document (TF)
– while considering the frequency of the word in the
corpus to control for the fact that some words
will be more common than others (IDF)
• Often more accurate than the basic bag of words
model
– but computationally more expensive
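TF-IDF has several variants. The sketch below assumes one common form: term frequency is the word count divided by the document length, and inverse document frequency is the natural log of (number of documents / number of documents containing the word). The function name `tf_idf` and the toy corpus are illustrative.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} dict per tokenized document."""
    n_docs = len(documents)

    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weighted = []
    for tokens in documents:
        weights = {}
        for word, count in Counter(tokens).items():
            tf = count / len(tokens)                 # term frequency
            idf = math.log(n_docs / doc_freq[word])  # inverse document frequency
            weights[word] = tf * idf
        weighted.append(weights)
    return weighted


docs = [
    "the cat sat".split(),
    "the dog barked".split(),
    "the cat purred".split(),
]
scores = tf_idf(docs)

print(scores[0]["the"])  # → 0.0, since "the" occurs in every document
print(scores[0]["sat"] > scores[0]["cat"])  # → True, the rarer word weighs more
```

The extra cost relative to bag of words comes from the pass over the whole corpus needed to compute document frequencies before any document can be weighted.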
Kernel Hashing
• When we want to vectorize the data in a single
pass
– making it a “just in time” vectorizer.
• Can be used when we want to vectorize text right
before we feed it to our learning algorithm.
• We come up with a fixed sized vector that is
typically smaller than the total possible words
that we could index or vectorize
– Then we use a hash function to create an index into
the vector.
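A minimal sketch of the hashing approach, assuming `zlib.crc32` as the hash function and a vector of 8 slots; both are arbitrary choices for the example. No vocabulary is built, so each document can be vectorized in a single pass.

```python
import zlib


def hash_vectorize(tokens, n_buckets=8):
    """Vectorize tokens into a fixed-size vector without a vocabulary."""
    vector = [0.0] * n_buckets
    for token in tokens:
        # crc32 is stable across runs, unlike Python's built-in hash() for
        # strings, which is randomized per interpreter process by default.
        index = zlib.crc32(token.encode("utf-8")) % n_buckets
        vector[index] += 1.0  # words that collide share a slot
    return vector


hashed = hash_vectorize("the cat sat on the mat".split())
print(len(hashed))  # → 8
print(sum(hashed))  # → 6.0, one increment per token
```

The trade-off is that different words can hash to the same index, so a smaller vector saves memory at the cost of more collisions.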
GENERAL VECTORIZATION STRATEGIES
“Everybody good? Plenty of slaves for my robot colony?”
--- TARS, Interstellar
Four Major Attribute Types
• Nominal
– Ex: “sunny”, “overcast”, and “rainy”
• Ordinal
– Like nominal but with order
• Interval
– Ex: “year”, measured in fixed and equal units
• Ratio
– scheme defines a zero point and then a distance
from this fixed zero point
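One way the four attribute types might be turned into vector entries is sketched below. The encodings are assumptions for illustration: one-hot for nominal values, position in the ordering for ordinal values, and the raw number for interval and ratio values. The record and the `SIZE` levels are hypothetical.

```python
def one_hot(value, categories):
    """Encode a nominal value as a vector with a single 1.0."""
    return [1.0 if value == category else 0.0 for category in categories]


def ordinal_rank(value, ordered_levels):
    """Encode an ordinal value as its position in the ordering."""
    return float(ordered_levels.index(value))


OUTLOOK = ["sunny", "overcast", "rainy"]  # nominal: no inherent order
SIZE = ["small", "medium", "large"]       # ordinal: hypothetical example

record = {"outlook": "overcast", "size": "large", "year": 2015, "weight_kg": 72.5}

record_vector = (
    one_hot(record["outlook"], OUTLOOK)
    + [ordinal_rank(record["size"], SIZE)]
    + [float(record["year"])]       # interval: differences are meaningful
    + [float(record["weight_kg"])]  # ratio: has a true zero point
)

print(record_vector)  # → [0.0, 1.0, 0.0, 2.0, 2015.0, 72.5]
```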
Techniques of Feature Engineering
• Taking the values directly from the attribute unchanged
– If the value is something we can use out of the box
• Feature scaling
– standardization
– or normalizing an attribute
• Binarization of features
– 0 or 1
• Dimensionality reduction
– Use only the most interesting features
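The scaling and binarization techniques can be sketched in a few lines. The function names and the sample `ages` column are illustrative, and the code assumes the values are not all identical (otherwise the divisions would fail). Dimensionality reduction is not covered here.

```python
import statistics


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def min_max_normalize(values):
    """Rescale linearly so the smallest value is 0.0 and the largest 1.0."""
    lowest, highest = min(values), max(values)
    return [(v - lowest) / (highest - lowest) for v in values]


def binarize(values, threshold):
    """Map each value to 1.0 if it exceeds the threshold, else 0.0."""
    return [1.0 if v > threshold else 0.0 for v in values]


ages = [20.0, 30.0, 40.0, 50.0]

print(standardize(ages))
print(min_max_normalize(ages))
print(binarize(ages, threshold=35.0))  # → [0.0, 0.0, 1.0, 1.0]
```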
Canova
• Command Line Based
– We don’t want to write custom code for every dataset
• Examples of Usage
– Convert the MNIST dataset from raw binary files to
the svmLight text format.
– Convert raw text into TF-IDF based vectors in a text
vector format {svmLight, arff}
• Scales out on multiple runtimes
– Local, Hadoop
• Open source, Apache License 2.0
– https://github.com/deeplearning4j/Canova
aquibnoor22079
 
How iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost FundsHow iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost Funds
ireneschmid345
 
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsAI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
Contify
 
04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story
ccctableauusergroup
 
Ppt. Nikhil.pptxnshwuudgcudisisshvehsjks
Ppt. Nikhil.pptxnshwuudgcudisisshvehsjksPpt. Nikhil.pptxnshwuudgcudisisshvehsjks
Ppt. Nikhil.pptxnshwuudgcudisisshvehsjks
panchariyasahil
 
VKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptxVKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptx
Vinod Srivastava
 
Classification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptxClassification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptx
wencyjorda88
 
Digilocker under workingProcess Flow.pptx
Digilocker  under workingProcess Flow.pptxDigilocker  under workingProcess Flow.pptx
Digilocker under workingProcess Flow.pptx
satnamsadguru491
 
Conic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptxConic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptx
taiwanesechetan
 
183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag
fardin123rahman07
 
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbbEDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
JessaMaeEvangelista2
 
Data Analytics Overview and its applications
Data Analytics Overview and its applicationsData Analytics Overview and its applications
Data Analytics Overview and its applications
JanmejayaMishra7
 
Principles of information security Chapter 5.ppt
Principles of information security Chapter 5.pptPrinciples of information security Chapter 5.ppt
Principles of information security Chapter 5.ppt
EstherBaguma
 

Vectorization - Georgia Tech - CSE6242 - March 2015

  • 1. Vectorization Core Concepts in Data Mining Georgia Tech – CSE6242 – March 2015 Josh Patterson
  • 2. Presenter: Josh Patterson • Email: – josh@pattersonconsultingtn.com • Twitter: – @jpatanooga • Github: – https://github.com/jpatanooga Past Published in IAAI-09: “TinyTermite: A Secure Routing Algorithm” Grad work in Meta-heuristics, Ant-algorithms Tennessee Valley Authority (TVA) Hadoop and the Smartgrid Cloudera Principal Solution Architect Today: Patterson Consulting
  • 3. Topic Index • Why Vectorization? • Vector Space Model • Text Vectorization • General Vectorization
  • 4. WHY VECTORIZATION? “How is it possible for a slow, tiny brain, whether biological or electronic, to perceive, understand, predict, and manipulate a world far larger and more complicated than itself?” --- Peter Norvig, “Artificial Intelligence: A Modern Approach”
  • 5. Classic Scenario: “Classify some tweets for positive vs negative sentiment”
  • 6. What Needs to Happen? • Need each tweet as some structure that can be fed to a learning algorithm – To represent the knowledge of “negative” vs “positive” tweet • How does that happen? – We need to take the raw text and convert it into what is called a “vector” • Vector relates to the fundamentals of linear algebra – “Solving sets of linear equations”
  • 7. Wait. What’s a Vector Again? • An array of floating point numbers • Represents data – Text – Audio – Image • Example: –[ 1.0, 0.0, 1.0, 0.5 ]
  • 8. VECTOR SPACE MODEL “I am putting myself to the fullest possible use, which is all I think that any conscious entity can ever hope to do.” --- Hal, 2001
  • 9. Vector Space Model • Common way of vectorizing text – every possible word is mapped to a specific integer • If we have a large enough array then every word fits into a unique slot in the array – value at that index is the number of the times the word occurs • Most often our array size is less than our corpus vocabulary – so we have to have a “vectorization strategy” to account for this
  • 10. Text Can Include Several Stages • Sentence Segmentation – can skip straight to tokenization depending on use case • Tokenization – find individual words • Lemmatization – finding the base or stem of words • Removing Stop words – “the”, “and”, etc • Vectorization – we take the output of the process and make an array of floating point values
  • 11. TEXT VECTORIZATION STRATEGIES “A man who carries a cat by the tail learns something he can learn in no other way.” --- Mark Twain
  • 12. Bag of Words • A group of words or a document is represented as a bag – or “multi-set” of its words • Bag of words is a list of words and their word counts – simplest vector model – but can end up using a lot of columns due to number of words involved. • Grammar and word ordering is ignored – but we still track how many times the word occurs in the document • has been used most frequently in the document classification – and information retrieval domains.
  • 13. Term frequency inverse document frequency (TF-IDF) • Fixes some issues with “bag of words” • allows us to leverage the information about how often a word occurs in a document (TF) – while considering the frequency of the word in the corpus to control for the fact that some words will be more common than others (IDF) • more accurate than the basic bag of words model – but computationally more expensive
  • 14. Kernel Hashing • When we want to vectorize the data in a single pass – making it a “just in time” vectorizer. • Can be used when we want to vectorize text right before we feed it to our learning algorithm. • We come up with a fixed sized vector that is typically smaller than the total possible words that we could index or vectorize – Then we use a hash function to create an index into the vector.
  • 15. GENERAL VECTORIZATION STRATEGIES “Everybody good? Plenty of slaves for my robot colony?” --- TARS, Interstellar
  • 16. Four Major Attribute Types • Nominal – Ex: “sunny”, “overcast”, and “rainy” • Ordinal – Like nominal but with order • Interval – “year” but expressed in fixed and equal lengths • Ratio – scheme defines a zero point and then a distance from this fixed zero point
  • 17. Techniques of Feature Engineering • Taking the values directly from the attribute unchanged – If the value is something we can use out of the box • Feature scaling – standardization – or Normalizing an attribute • Binarization of features – 0 or 1 • Dimensionality reduction – Use only the most interesting features
  • 18. Canova • Command Line Based – We don’t want to write custom code for every dataset • Examples of Usage – Convert the MNIST dataset from raw binary files to the svmLight text format. – Convert raw text into TF-IDF based vectors in a text vector format {svmLight, arff} • Scales out on multiple runtimes – Local, hadoop • Open Source, ASF 2.0 Licensed – https://github.com/deeplearning4j/Canova
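
The three text vectorization strategies from slides 12 to 14 can be sketched in a few lines of plain Python. This is a minimal illustration, not Canova's implementation: the toy corpus, the stop-word list, the function names, the `log(N / df)` form of IDF, and the use of CRC32 as the hash function are all assumptions made for the example.

```python
import math
import zlib
from collections import Counter

# Hypothetical toy corpus and stop-word list, chosen only for illustration.
DOCS = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang",
]
STOP_WORDS = {"the", "a", "on", "and"}


def tokenize(text):
    """Lowercase, split on whitespace, drop stop words (no lemmatization)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def build_vocab(docs):
    """Vector space model: map every distinct token to a unique integer slot."""
    vocab = {}
    for doc in docs:
        for tok in tokenize(doc):
            vocab.setdefault(tok, len(vocab))
    return vocab


def bag_of_words(text, vocab):
    """Slot i holds the number of times word i occurs in the text."""
    vec = [0.0] * len(vocab)
    for tok, count in Counter(tokenize(text)).items():
        if tok in vocab:
            vec[vocab[tok]] = float(count)
    return vec


def tf_idf(text, vocab, docs):
    """Scale each count by log(N / document frequency), one common IDF variant.

    Needs a pass over the whole corpus to get the document frequencies.
    """
    doc_tokens = [set(tokenize(d)) for d in docs]
    vec = bag_of_words(text, vocab)
    for tok, i in vocab.items():
        df = sum(1 for toks in doc_tokens if tok in toks)
        vec[i] *= math.log(len(docs) / df)
    return vec


def hashed(text, size=8):
    """Hashing: no vocabulary pass; a hash picks the slot, so words can collide."""
    vec = [0.0] * size
    for tok in tokenize(text):
        vec[zlib.crc32(tok.encode("utf-8")) % size] += 1.0
    return vec


vocab = build_vocab(DOCS)
print(bag_of_words(DOCS[0], vocab))
print(tf_idf(DOCS[0], vocab, DOCS))
print(hashed(DOCS[0]))
```

Note the trade-off the slides describe: `tf_idf` has to see the whole corpus first, while `hashed` works on one document at a time with a fixed-size vector, at the cost of possible collisions.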

Editor's Notes

  • #15: The advantage of kernel hashing is that we don’t need the precursor pass over the data that TF-IDF requires, but we run the risk of collisions between words. The reality is that these collisions occur very infrequently and don’t have a noticeable impact on learning performance.
  • #18: Feature scaling (or “feature normalization”) can improve the convergence speed of certain algorithms (example: stochastic gradient descent). When we “standardize” a vector we subtract a measure of location (minimum, maximum, median, etc.) and then divide by a measure of scale (variance, standard deviation, range, etc.). Another method of feature normalization is “pre-whitening”. Pre-whitening is a decorrelation transformation: it uses the input covariance matrix to transform the input so that its components are uncorrelated and have unit variance. The transformation is called “pre-whitening” because it changes the input vector into a white noise vector.
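
As a rough sketch of the scaling and binarization techniques from slide 17 and note #18, the functions below subtract a measure of location and divide by a measure of scale. The function names, the sample values, and the threshold are hypothetical; the choice of mean with standard deviation, and of minimum with range, are just two of the combinations the note allows.

```python
import statistics


def standardize(values):
    """Subtract the mean (location), divide by the population std dev (scale)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant attribute carries no information; map it to zeros.
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]


def min_max(values):
    """Subtract the minimum (location), divide by the range (scale) -> [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def binarize(values, threshold):
    """Binarization: 1.0 if the value exceeds the threshold, else 0.0."""
    return [1.0 if v > threshold else 0.0 for v in values]


# Hypothetical attribute column, used only for illustration.
column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(standardize(column))
print(min_max(column))
print(binarize(column, threshold=4.5))
```

After `standardize` the column has zero mean and unit variance, which is the property that tends to help gradient-based learners such as stochastic gradient descent.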