Understanding
Feature Space in
Machine Learning
Alice Zheng, Dato
September 9, 2015
1
2
My journey so far
Applied machine learning
(Data science)
Build ML tools
Shortage of experts
and good tools.
3
Why machine learning?
Model data.
Make predictions.
Build intelligent
applications.
4
The machine learning pipeline
I fell in love the instant I laid
my eyes on that puppy. His
big eyes and playful tail, his
soft furry paws, …
Raw data
Features
Models
Predictions
Deploy in
production
Feature = numeric representation of raw data
6
Representing natural text
It is a puppy and it is
extremely cute.
What’s important?
Phrases? Specific
words? Ordering?
Subject, object, verb?
Classify:
puppy or not?
Raw Text
{“it”:2,
“is”:2,
“a”:1,
“puppy”:1,
“and”:1,
“extremely”:1,
“cute”:1 }
Bag of Words
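A minimal sketch of how the bag-of-words dictionary above could be produced. The function name `bag_of_words` and the lowercase, letters-only tokenization are illustrative assumptions, not something the slides specify.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into alphabetic tokens, and count each token."""
    # Assumed tokenization rule: runs of letters (and apostrophes) are words.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

counts = bag_of_words("It is a puppy and it is extremely cute.")
print(dict(counts))
# {'it': 2, 'is': 2, 'a': 1, 'puppy': 1, 'and': 1, 'extremely': 1, 'cute': 1}
```

Word order is discarded: any reordering of the same words gives the same counts, which is the trade-off the "Ordering?" question on this slide points at.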
7
Representing natural text
It is a puppy and it is
extremely cute.
Classify:
puppy or not?
Raw Text Bag of Words
it 2
they 0
I 0
am 0
how 0
puppy 1
and 1
cat 0
aardvark 0
cute 1
extremely 1
… …
Sparse vector
representation
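A rough sketch of the sparse vector idea: every document is mapped onto the same fixed vocabulary, and because most entries are zero, only the nonzero ones need to be stored. The short `VOCABULARY` list and both helper names are hypothetical; a real vocabulary would cover every word in the corpus.

```python
from collections import Counter

# Hypothetical toy vocabulary; words outside it (here "is" and "a") are ignored.
VOCABULARY = ["it", "they", "am", "how", "puppy", "and",
              "cat", "aardvark", "cute", "extremely"]

def to_dense_vector(text, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(text.lower().replace(".", "").split())
    return [counts[word] for word in vocabulary]

def to_sparse_vector(text, vocabulary):
    """Keep only the nonzero entries as {vocabulary index: count}."""
    dense = to_dense_vector(text, vocabulary)
    return {i: c for i, c in enumerate(dense) if c != 0}

sentence = "It is a puppy and it is extremely cute."
dense = to_dense_vector(sentence, VOCABULARY)
sparse = to_sparse_vector(sentence, VOCABULARY)
print(dense)   # [2, 0, 0, 0, 1, 1, 0, 0, 1, 1]
print(sparse)  # {0: 2, 4: 1, 5: 1, 8: 1, 9: 1}
```

With a vocabulary of hundreds of thousands of words, the sparse form stays about as small as the document itself while the dense form grows with the vocabulary.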
8
Representing images
Image source: “Recognizing and learning object categories,”
Li Fei-Fei, Rob Fergus, Antonio Torralba, ICCV 2005-2009.
Raw image:
millions of RGB triplets,
one for each pixel
Classify:
person or animal?
Raw Image Bag of Visual Words
9
Representing images
Classify:
person or animal?
Raw Image Deep learning features
3.29
-15
-5.24
48.3
1.36
47.1
-1.92
36.5
2.83
95.4
-19
-89
5.09
37.8
Dense vector
representation
10
Feature space in machine learning
• Raw data → high dimensional vectors
• Collection of data points → point cloud in feature space
• Model = geometric summary of point cloud
• Feature engineering = creating features of the appropriate
granularity for the task
Crudely speaking, mathematicians fall into two
categories: the algebraists, who find it easiest to
reduce all problems to sets of numbers and
variables, and the geometers, who understand the
world through shapes.
-- Masha Gessen, “Perfect Rigor”
12
Algebra vs. Geometry
a
b
c
a² + b² = c²
Algebra Geometry
Pythagorean
Theorem
(Euclidean space)
13
Visualizing a sphere in 2D
x² + y² = 1
a
b
c
Pythagorean theorem:
a² + b² = c²
x
y
1
1
14
Visualizing a sphere in 3D
x² + y² + z² = 1
x
y
z
1
1
1
15
Visualizing a sphere in 4D
x² + y² + z² + t² = 1
x
y
z
1
1
1
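The 2D, 3D, and 4D equations on these slides follow one pattern, which is why algebra keeps working after the pictures stop. Written for n dimensions (the coordinate names x_1, ..., x_n are notation introduced here, not on the slides), the unit sphere is the set of points at Euclidean distance 1 from the origin:

```latex
\[
  x_1^2 + x_2^2 + \cdots + x_n^2 = 1
  \quad\Longleftrightarrow\quad
  \lVert \mathbf{x} \rVert_2 = \sqrt{\sum_{i=1}^{n} x_i^2} = 1 .
\]
```

Setting n = 2, 3, 4 recovers the three equations above, and the n = 2 case is the Pythagorean theorem with hypotenuse 1.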
16
Why are we looking at spheres?
Poincaré Conjecture:
All physical objects without holes
are “equivalent” to a sphere.
17
The power of higher dimensions
• A sphere in 4D can model the birth and death process of
physical objects
• Point clouds = approximate geometric shapes
• High dimensional features can model many things
Visualizing Feature Space
19
The challenge of high dimension geometry
• Feature space can have hundreds to millions of
dimensions
• In high dimensions, our geometric imagination is limited
- Algebra comes to our aid
20
Visualizing bag-of-words
puppy
cute
1
1
I have a puppy and
it is extremely cute
I have a puppy and
it is extremely cute
it 1
they 0
I 1
am 0
how 0
puppy 1
and 1
cat 0
aardvark 0
zebra 0
cute 1
extremely 1
… …
21
Visualizing bag-of-words
puppy
cute
1
1
1
extremely
I have a puppy and
it is extremely cute
I have an extremely
cute cat
I have a cute
puppy
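As a sketch of what the plot shows, the three documents can be turned into points on the (puppy, cute, extremely) axes and compared by Euclidean distance. The helper name `to_point` and the whitespace tokenization are assumptions made for illustration.

```python
import math

AXES = ["puppy", "cute", "extremely"]

def to_point(text, axes=AXES):
    """Count how often each axis word occurs; the counts are the coordinates."""
    words = text.lower().split()
    return tuple(words.count(axis) for axis in axes)

docs = [
    "I have a puppy and it is extremely cute",
    "I have an extremely cute cat",
    "I have a cute puppy",
]
points = [to_point(d) for d in docs]
print(points)  # [(1, 1, 1), (0, 1, 1), (1, 1, 0)]

# Distance between the first two documents: they differ only on the "puppy" axis.
print(math.dist(points[0], points[1]))  # 1.0
```

Each document lands on a corner of the unit cube; documents that share more words sit closer together.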
22
Document point cloud
word 1
word 2
23
What is a model?
• Model = mathematical “summary” of data
• What’s a summary?
- A geometric shape
24
Classification model
Feature 2
Feature 1
Decide between two classes
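One simple geometric shape a classifier can use is a straight line: points on one side get one class, points on the other side get the other. The sketch below assumes a linear boundary with made-up weights; the slides do not commit to any particular classifier.

```python
def classify(point, weights, bias):
    """Return 1 if the point lies on the positive side of w . x + b = 0, else 0."""
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return 1 if score > 0 else 0

# Hypothetical boundary: the line feature_1 + feature_2 = 3.
WEIGHTS = (1.0, 1.0)
BIAS = -3.0

print(classify((0.5, 1.0), WEIGHTS, BIAS))  # 0, below the line
print(classify((2.5, 2.0), WEIGHTS, BIAS))  # 1, above the line
```

Training a classifier amounts to choosing the weights and bias so that the boundary separates the two point clouds as well as possible.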
25
Clustering model
Feature 2
Feature 1
Group data points tightly
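A sketch of one way to "group data points tightly": assign each point to its nearest centre, then move each centre to the mean of its points. This is a single k-means-style iteration, chosen here as an illustration; the data and starting centroids are invented.

```python
import math

def assign_to_nearest(points, centroids):
    """Label each point with the index of its closest centroid."""
    return [min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points]

def update_centroids(points, labels, centroids):
    """Move each centroid to the mean of its members; keep it if it has none."""
    updated = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if members:
            updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            updated.append(old)
    return updated

points = [(1.0, 1.0), (1.0, 2.0), (5.0, 5.0), (6.0, 5.0)]
centroids = [(0.0, 0.0), (6.0, 6.0)]  # hypothetical starting guesses

labels = assign_to_nearest(points, centroids)
centroids = update_centroids(points, labels, centroids)
print(labels)     # [0, 0, 1, 1]
print(centroids)  # [(1.0, 1.5), (5.5, 5.0)]
```

Repeating the two steps until the labels stop changing gives the usual k-means loop; the model's "summary" of the data is just the list of centroids.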
26
Regression model
Target
Feature
Fit the target values
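For regression with one feature, the geometric summary can be a straight line through the point cloud. A minimal least-squares sketch, with invented data and a hypothetical helper name `fit_line`:

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data lying exactly on target = 2 * feature + 1.
features = [0.0, 1.0, 2.0, 3.0]
targets = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_line(features, targets)
print(slope, intercept)  # 2.0 1.0
```

The sketch assumes the feature values are not all identical; otherwise `sxx` is zero and no unique line exists.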
Visualizing Feature Engineering
28
When does bag-of-words fail?
puppy
cat
2
1
1
have
I have a puppy
I have a cat
I have a kitten
Task: find a surface that separates
documents about dogs vs. cats
Problem: the word “have” adds fluff
instead of information
I have a dog
and I have a pen
1
29
Improving on bag-of-words
• Idea: “normalize” word counts so that popular words
are discounted
• Term frequency (tf) = Number of times a term
appears in a document
• Inverse document frequency of word (idf) = log(N / number of documents in which the word appears)
• N = total number of documents
• Tf-idf count = tf x idf
30
From BOW to tf-idf
puppy
cat
2
1
1
have
I have a puppy
I have a cat
I have a kitten
idf(puppy) = log 4
idf(cat) = log 4
idf(have) = log 1 = 0
I have a dog
and I have a pen
1
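The idf values on this slide can be reproduced with a few lines, assuming idf(word) = log(N / number of documents containing the word), which is the definition that yields log 4 and log 1 for these four documents. The function name and whitespace tokenization are illustrative choices.

```python
import math

DOCS = [
    "I have a puppy",
    "I have a cat",
    "I have a kitten",
    "I have a dog and I have a pen",
]

def idf(word, docs):
    """log(N / number of documents that contain the word at least once)."""
    n_containing = sum(1 for d in docs if word in d.lower().split())
    return math.log(len(docs) / n_containing)

print(idf("puppy", DOCS) == math.log(4))  # True: 1 of 4 documents
print(idf("cat", DOCS) == math.log(4))    # True: 1 of 4 documents
print(idf("have", DOCS))                  # 0.0: all 4 documents
```

A word that appears in no document would cause a division by zero here; production implementations usually smooth the denominator, which this sketch leaves out.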
31
From BOW to tf-idf
puppy
cat
1
have
tfidf(puppy) = log 4
tfidf(cat) = log 4
tfidf(have) = 0
I have a dog
and I have a pen,
I have a kitten
1
log 4
log 4
I have a cat
I have a puppy
Decision surface
Tf-idf flattens
uninformative
dimensions in the
BOW point cloud
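A sketch of the flattening effect, assuming tf-idf = tf x idf with idf(word) = log(N / number of documents containing the word). The axes (puppy, cat, have) follow the plot; the helper name `tfidf_vector` is hypothetical.

```python
import math

DOCS = [
    "I have a puppy",
    "I have a cat",
    "I have a kitten",
    "I have a dog and I have a pen",
]
AXES = ["puppy", "cat", "have"]

def tfidf_vector(doc, docs, axes=AXES):
    """Return the tf-idf coordinate of `doc` along each axis word."""
    words = doc.lower().split()
    vector = []
    for term in axes:
        tf = words.count(term)
        df = sum(1 for d in docs if term in d.lower().split())
        idf = math.log(len(docs) / df) if df else 0.0
        vector.append(tf * idf)
    return vector

vectors = [tfidf_vector(d, DOCS) for d in DOCS]
for doc, vec in zip(DOCS, vectors):
    print(doc, vec)

# The "have" coordinate is 0 for every document, even the one containing it twice.
print(all(vec[2] == 0 for vec in vectors))  # True
```

After the transform the puppy document sits at log 4 on the puppy axis, the cat document at log 4 on the cat axis, and the other two collapse to the origin, matching the picture on this slide.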
32
Entry points of feature engineering
• Start from data and task
- What’s the best text representation for classification?
• Start from modeling method
- What kind of features does k-means assume?
- What does linear regression assume about the data?
33
That’s not all, folks!
• There’s a lot more to feature engineering:
- Feature normalization
- Feature transformations
- “Regularizing” models
- Learning the right features
• Dato is hiring! jobs@dato.com
alicez@dato.com @RainyData
Editor's Notes

  • #5: Features sit between raw data and model. They can make or break an application.