Machine Learning A large and fascinating field;  there’s much more than what you’ll see in this class!
What should we try to learn, if we want to… make computer systems more efficient or secure? make money in the stock market? avoid losing money to fraud or scams? do science or medicine? win at games? make more entertaining games? improve user interfaces?  even brain-computer interfaces … make traditional applications more useful? word processors, drawing programs, email, web search, photo organizers, …
What should we try to learn, if we want to… make computer systems more efficient or secure? make money in the stock market? avoid losing money to fraud or scams? do science or medicine? win at games? make more entertaining games? improve user interfaces?  even brain-computer interfaces … make traditional applications more useful? word processors, drawing programs, email, web search, photo organizers, … This stuff has got to be an important part of the future  … …  beats trying to program all the special cases directly …  and there are “intelligent” behaviors you can’t imagine programming directly.  (Most of the stuff now in your brain wasn’t programmed in advance, either!)
The simplest problem: Supervised binary classification of vectors Training set: (x_1, y_1), (x_2, y_2), … (x_n, y_n) where x_1, x_2, … are in R^d and y_1, y_2, … are in {0,1} or {-,+} or {-1,+1}  Test set: (x_{n+1}, ?), (x_{n+2}, ?), … (x_{n+m}, ?) where these x’s were probably  not  seen in training
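A minimal sketch of this setup in Python (assumptions: toy 2-D data, and a perceptron as the learner; the slide itself does not commit to any particular algorithm):

```python
import numpy as np

# Toy training set: each x_i is a vector in R^2, each y_i is in {-1, +1}.
X_train = np.array([[1.0, 2.0], [2.0, 1.5], [-1.0, -0.5], [-2.0, -1.0]])
y_train = np.array([+1, +1, -1, -1])

# A perceptron is one simple way to learn a linear separator w.x + b = 0.
w, b = np.zeros(X_train.shape[1]), 0.0
for _ in range(100):                      # epochs
    for x, y in zip(X_train, y_train):
        if y * (w @ x + b) <= 0:          # misclassified -> update
            w, b = w + y * x, b + y

# Test set: vectors whose labels we must predict (probably unseen in training).
X_test = np.array([[1.5, 1.0], [-1.5, -2.0]])
print(np.sign(X_test @ w + b))            # predicted labels in {-1, +1}
```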
Linear Separators slide thanks to Ray Mooney
Linear Separators slide thanks to Ray Mooney
Nonlinear Separators slide thanks to Ray Mooney (modified)
Nonlinear Separators Note: A more complex function requires more data to generate an accurate model  (sample complexity) slide thanks to Kevin Small (modified)
Encoding and decoding for learning Binary classification of vectors … but how do we treat “real” learning problems in this framework? We need to encode each input example as a vector in R n :   feature extraction
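For instance, a hypothetical feature extractor for spam filtering might look like the sketch below; the particular features are invented purely for illustration:

```python
# Hypothetical feature extractor: turn a raw email into a vector in R^4.
def extract_features(email_text: str) -> list[float]:
    words = email_text.lower().split()
    return [
        float(len(words)),                                # length in words
        float(sum(w == "free" for w in words)),           # count of "free"
        float(email_text.count("!")),                     # exclamation marks
        float(any(w.startswith("http") for w in words)),  # contains a link?
    ]

print(extract_features("FREE tickets!!! Click http://example.com now"))
```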
Features for recognizing a chair?
Features for recognizing childhood autism? (from DSM IV, the Diagnostic and Statistical Manual) A. A total of six (or more) items from (1), (2), and (3), with at least two from (1), and one each from (2) and (3):  (1) Qualitative impairment in social interaction, as manifested by at least two of the following:  marked impairment in the use of multiple nonverbal behaviors such as eye-to-eye gaze, facial expression, body postures, and gestures to regulate social interaction.  failure to develop peer relationships appropriate to developmental level  a lack of spontaneous seeking to share enjoyment, interests, or achievements with other people (e.g., by a lack of showing, bringing, or pointing out objects of interest)  lack of social or emotional reciprocity  (2) Qualitative impairments in communication as manifested by at least one of the following: …
Features for recognizing childhood autism? (from DSM IV, the Diagnostic and Statistical Manual) B. Delays or abnormal functioning in at least one of the following areas, with onset prior to age 3 years:  (1) social interaction (2) language as used in social communication, or  (3) symbolic or imaginative play.  C.  The disturbance is not better accounted for by Rett's disorder or childhood disintegrative disorder.
Features for recognizing a prime number? (2,+)  (3,+)  (4,-)  (5,+)  (6,-)  (7,+)  (8,-)  (9,-)  (10,-)  (11,+)  (12,-)  (13,+)  (14,-)  (15,-) … Ouch! But what kinds of features might you try if you didn’t know anything about primality? How well would they work? False positives vs. false negatives? Expected performance vs. worst-case
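A sketch of candidate features one might try without knowing anything about primality (all invented for illustration; none is sufficient on its own, which is the point of the “Ouch!”):

```python
# Candidate features for "is n prime?" -- cheap properties of n that a learner
# could combine, even though none decides primality by itself.
def prime_features(n: int) -> list[int]:
    return [
        n % 2,          # parity (catches composites with a factor of 2)
        n % 3,          # remainder mod 3
        n % 5,          # remainder mod 5
        n % 10,         # last digit
        len(str(n)),    # number of digits (rough size of n)
    ]

for n in [2, 3, 4, 9, 11, 15]:
    print(n, prime_features(n))
```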
Features for recognizing masculine vs. feminine words in French? le fromage (cheese) la salade (salad, lettuce) le monument (monument) la fourchette (fork) le sentiment (feeling) la télévision (television) le couteau (knife) la culture (culture) le téléphone (telephone) la situation (situation) le microscope (microscope) la société (society) le romantisme (romanticism) la différence (difference) la philosophie (philosophy)
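One plausible feature family is word endings; the sketch below uses suffix indicator features (the suffix list is an assumption chosen to match the examples above, not an exhaustive rule of French):

```python
# Suffix indicator features. In the examples above, endings such as -tion,
# -ette, -ure, -ie tend to appear with "la", while -age, -ment, -eau, -isme
# tend to appear with "le". A learner given such indicators could discover
# these regularities itself.
def suffix_features(word: str,
                    suffixes=("tion", "ette", "ure", "ie",
                              "age", "ment", "eau", "isme")) -> list[int]:
    return [int(word.endswith(s)) for s in suffixes]

print(suffix_features("fourchette"))   # the "ette" indicator fires
print(suffix_features("romantisme"))   # the "isme" indicator fires
```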
Features for recognizing when the user who’s typing isn’t the usual user? (And how do you train this?)
Measuring performance Simplest: Classification error (fraction of wrong answers) Better: Loss functions – different penalties for false positives vs. false negatives If the learner gives a confidence or probability along with each of its answers, give extra credit for being confidently right but extra penalty for being confidently wrong What’s the formula?  Correct answer is y_i ∈ {-1, +1}; system predicts z_i ∈ [-1, +1] (perhaps fractional); score is Σ_i y_i · z_i
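A direct transcription of these two metrics (a sketch; a loss-function version would simply weight false positives and false negatives differently):

```python
def classification_error(y_true, y_pred):
    """Fraction of wrong answers."""
    return sum(y != z for y, z in zip(y_true, y_pred)) / len(y_true)

def confidence_score(y_true, z_pred):
    """Sum_i y_i * z_i with y_i in {-1,+1} and z_i in [-1,+1].
    Confidently right answers add nearly +1, confidently wrong answers
    subtract nearly 1, and hedged answers near 0 barely count."""
    return sum(y * z for y, z in zip(y_true, z_pred))

y_true = [+1, -1, +1]
z_pred = [0.9, -0.2, -0.7]   # confident right, hedged right, confident wrong
print(classification_error(y_true, [1 if z > 0 else -1 for z in z_pred]))  # 0.333...
print(confidence_score(y_true, z_pred))                                    # 0.4
```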
Encoding and decoding for learning Binary classification of vectors … but how do we treat “real” learning problems in this framework? If the output is to be binary, we need to encode each input example as a vector in R n :   feature extraction If the output is to be more complicated, we may need to obtain it as a sequence of binary decisions, each on a different feature vector
Multiclass Classification Many binary classifiers (“one versus all”) slide thanks to Kevin Small (modified) One multiway classifier
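A sketch of the one-versus-all reduction (the binary learner inside, a difference-of-means linear scorer, is just a simple stand-in, not the course’s method):

```python
import numpy as np

def train_binary(X, y_pm):                 # y_pm in {-1, +1}
    # Simple stand-in binary learner: score along the direction from the
    # negative-class mean to the positive-class mean.
    w = X[y_pm == +1].mean(axis=0) - X[y_pm == -1].mean(axis=0)
    return lambda x: float(w @ x)          # confidence that x is a positive

def train_one_vs_all(X, y, classes):
    # One binary classifier per class: class c vs. everything else.
    return {c: train_binary(X, np.where(y == c, +1, -1)) for c in classes}

def predict(models, x):
    # Pick the class whose binary classifier is most confident.
    return max(models, key=lambda c: models[c](x))

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.8], [0.0, 5.0], [0.2, 5.1]])
y = np.array(["a", "a", "b", "b", "c", "c"])
models = train_one_vs_all(X, y, classes=["a", "b", "c"])
print(predict(models, np.array([4.9, 5.0])))   # -> "b"
```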
Regression: predict a number, not a class Don’t just predict whether stock will go up or down in the present circumstance – predict by how much! Better, predict probabilities that it will go up and down by different amounts
Inference: Predict a whole pattern Predict a whole object  (in the sense of object-oriented programming) Output is a vector, or a tree, or something Why useful?  Or, return many possible trees with a different probability on each one Some fancy machine learning methods can handle this directly … but how would you do a simple encoding?
Defining Learning Problems ML algorithms are mathematical formalisms and problems must be modeled accordingly Feature Space – space used to describe each instance; often R^d, {0,1}^d, etc. Output Space – space of possible output labels Hypothesis Space – space of functions that can be selected by the machine learning algorithm (depends on the algorithm) slide thanks to Kevin Small (modified)
Context Sensitive Spelling Did anybody (else) want  too  sleep for  to  more hours this morning? Output Space Could use the entire vocabulary;  Y ={a,aback,...,zucchini} Could also use a confusion set;  Y= {to, too, two} Model as (single label) multiclass classification Hypothesis space is provided by your learner Need to define the feature space slide thanks to Kevin Small (modified)
Sentence Representation S = I would like a  piece  of cake too! Define a set of features Features are relations that hold in the sentence. Two components to defining features Describe relations in the sentence: text, text ordering, properties of the text (information sources) Define functions based upon these relations (more on this later) slide thanks to Kevin Small (modified)
Sentence Representation S_1 = I would like a  piece  of cake too! S_2 = This is not the way to achieve  peace  in Iraq. Examples of (simple) features: (1) Does ‘ever’ appear within a window of 3 words? (2) Does ‘cake’ appear within a window of 3 words? (3) Is the preceding word a verb? S_1 = 0, 1, 0 S_2 = 0, 0, 1 slide thanks to Kevin Small (modified)
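The same three features, computed in code (the ±3-word window and the tiny verb list are simplifying assumptions made for the demo):

```python
# Compute the three features around the target word ('piece' / 'peace').
VERBS = {"achieve", "like", "want", "is"}      # toy verb list, demo only

def features(words, target_index):
    window = words[max(0, target_index - 3): target_index + 4]   # +/- 3 words
    return [
        int("ever" in window),                   # 1: 'ever' nearby?
        int("cake" in window),                   # 2: 'cake' nearby?
        int(words[target_index - 1] in VERBS),   # 3: preceding word a verb?
    ]

s1 = "I would like a piece of cake too !".split()
s2 = "This is not the way to achieve peace in Iraq .".split()
print(features(s1, s1.index("piece")))   # -> [0, 1, 0]
print(features(s2, s2.index("peace")))   # -> [0, 0, 1]
```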
Embedding Requires some knowledge engineering Makes the discriminant function simpler (and learnable) slide thanks to Kevin Small (modified) [figure labels: Peace, Piece]
Sparse Representation Between basic and complex features, the dimensionality will be very high Most features will not be active in a given example Represent vectors with a list of active indices S_1 = 1, 0, 1, 0, 0, 0, 1, 0, 0, 1 becomes S_1 = 1, 3, 7, 10 S_2 = 0, 0, 0, 1, 0, 0, 1, 0, 0, 0 becomes S_2 = 4, 7 slide thanks to Kevin Small (modified)
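The conversion in both directions (a sketch, using 1-based indices as on the slide):

```python
# Dense 0/1 vector <-> list of active (nonzero) indices, 1-based.
def to_sparse(dense):
    return [i for i, v in enumerate(dense, start=1) if v]

def to_dense(active, dim):
    active = set(active)
    return [1 if i + 1 in active else 0 for i in range(dim)]

s1 = [1, 0, 1, 0, 0, 0, 1, 0, 0, 1]
print(to_sparse(s1))            # -> [1, 3, 7, 10]
print(to_dense([4, 7], 10))     # -> [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
```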
Types of Sparsity Sparse Function Space High dimensional data where target function depends on a few features (many irrelevant features) Sparse Example Space High dimensional data where only a few features are active in each example In NLP, we typically have both types of sparsity.  slide thanks to Kevin Small (modified)
Training paradigms Supervised? Unsupervised? Partly supervised? Incomplete? Active learning, online learning Reinforcement learning
Training and test sets How this relates to the midterm Want you to do well – proves I’m a good teacher (merit pay?) So I want to teach to the test …  heck, just show you the test in advance! Or equivalently, test exactly what I taught … what was the title of slide 29? How should JHU prevent this? what would the title of slide 29 ½ have been? Development sets the market newsletter scam so, what if we have an army of robotic professors? some professor’s class will do well just by luck!  she wins! JHU should only be able to send one prof to the professorial Olympics Olympic trials are like a development set
Overfitting and underfitting Overfitting: Model the training data all too well (autistic savants?).  Do really well if we test on the training data, but poorly if we test on new data. Underfitting: Try too hard to generalize.  Ignore relevant distinctions – try to find a simple linear separator when the data are actually more complicated than that. How does this relate to the # of parameters to learn? Lord Kelvin: “And with 3 parameters, I can fit an elephant …”
“ Feature Engineering” Workshop in 2005 CALL FOR PAPERS   Feature Engineering for Machine Learning in Natural Language Processing   Workshop at the Annual Meeting of the Association for Computational Linguistics (ACL 2005)   http://research.microsoft.com/~ringger/FeatureEngineeringWorkshop/   Submission Deadline: April 20, 2005   Ann Arbor, Michigan June 29, 2005
“ Feature Engineering” Workshop in 2005 As experience with machine learning for solving natural language processing tasks accumulates in the field, practitioners are finding that feature engineering is as critical as the choice of machine learning algorithm, if not more so.   Feature design, feature selection, and feature impact (through ablation studies and the like) significantly affect the performance of systems and deserve greater attention.   In the wake of the shift away from knowledge engineering and of the successes of data-driven and statistical methods, researchers in the field are likely to make further progress by incorporating additional, sometimes familiar, sources of knowledge as features.   Although some experience in the area of feature engineering is to be found in the theoretical machine learning community, the particular demands of natural language processing leave much to be discovered.
“ Feature Engineering” Workshop in 2005 Topics may include, but are not necessarily limited to: Novel methods for discovering or inducing features, such as mining the web for closed classes, useful for indicator features. Comparative studies of different feature selection algorithms for NLP tasks. Interactive tools that help researchers to identify ambiguous cases that could be disambiguated by the addition of features. Error analysis of various aspects of feature induction, selection, representation. Issues with representation, e.g., strategies for handling hierarchical representations, including decomposing to atomic features or by employing statistical relational learning. Techniques used in fields outside NLP that prove useful in NLP. The impact of feature selection and feature design on such practical considerations as training time, experimental design, domain independence, and evaluation. Analysis of feature engineering and its interaction with specific machine learning methods commonly used in NLP. Combining classifiers that employ diverse types of features. Studies of methods for defining a feature set, for example by iteratively expanding a base feature set. Issues with representing and combining real-valued and categorical features for NLP tasks.
A Machine Learning System slide thanks to Kevin Small (modified) [pipeline so far: Raw Text → Preprocessing → Formatted Text]
Preprocessing Text Sentence splitting, Word Splitting, etc. Put data in a form usable for feature extraction slide thanks to Kevin Small (modified) They recently recovered a small piece of a live Elvis concert recording. He was singing gospel songs, including “Peace in the Valley.”
0 0 0 They
0 0 1 recently
0 0 2 recovered
0 0 3 a
0 0 4 small
piece 0 5 piece
0 0 6 of
:
0 1 6 including
0 1 7 QUOTE
peace 1 8 Peace
0 1 9 in
0 1 10 the
0 1 11 Valley
0 1 12 .
0 1 13 QUOTE
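The records above appear to be (label, sentence, position, token) rows; here is a rough preprocessing sketch that produces the (sentence, position, token) part, with a deliberately crude sentence splitter and tokenizer standing in for real ones:

```python
import re

# Split raw text into sentences and tokens, emitting one
# (sentence_id, position, token) record per token.
def preprocess(raw_text):
    sentences = re.split(r"(?<=[.!?])\s+", raw_text.strip())
    records = []
    for s_id, sentence in enumerate(sentences):
        tokens = re.findall(r"\w+|[^\w\s]", sentence)
        records.extend((s_id, pos, tok) for pos, tok in enumerate(tokens))
    return records

raw = ("They recently recovered a small piece of a live Elvis concert recording. "
       "He was singing gospel songs, including “Peace in the Valley.”")
for rec in preprocess(raw)[:6]:
    print(rec)    # (0, 0, 'They'), (0, 1, 'recently'), ...
```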
A Machine Learning System slide thanks to Kevin Small (modified) [pipeline so far: Raw Text → Preprocessing → Formatted Text → Feature Extraction → Feature Vectors]
Feature Extraction Converts formatted text into feature vectors Lexicon file contains feature descriptions slide thanks to Kevin Small (modified)
0 0 0 They
0 0 1 recently
0 0 2 recovered
0 0 3 a
0 0 4 small
piece 0 5 piece
0 0 6 of
:
0 1 6 including
0 1 7 QUOTE
peace 1 8 Peace
0 1 9 in
0 1 10 the
0 1 11 Valley
0 1 12 .
0 1 13 QUOTE
0, 1001, 1013, 1134, 1175, 1206
1, 1021, 1055, 1085, 1182, 1252
Lexicon File
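A sketch of the lexicon idea: map human-readable feature descriptions to integer indices and emit “label, active indices” records like those above (the feature templates and the starting index 1001 are invented for illustration; a real system would persist the lexicon to a file):

```python
from itertools import count

lexicon, _ids = {}, count(1001)    # description -> feature index

def feature_index(description):
    # Assign the next free index the first time a description is seen.
    if description not in lexicon:
        lexicon[description] = next(_ids)
    return lexicon[description]

def extract(tokens, target_pos, label):
    # A few toy feature templates around the target word.
    active = [
        feature_index(f"word-1={tokens[target_pos - 1].lower()}"),
        feature_index(f"word+1={tokens[target_pos + 1].lower()}"),
        feature_index(f"target={tokens[target_pos].lower()}"),
    ]
    return [label] + sorted(active)

s1 = "They recently recovered a small piece of a live Elvis concert recording .".split()
print(extract(s1, s1.index("piece"), label=0))   # -> [0, 1001, 1002, 1003]
print(lexicon)                                   # descriptions -> indices
```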
A Machine Learning System slide thanks to Kevin Small (modified) [full pipeline: Raw Text → Preprocessing → Formatted Text → Feature Extraction → Feature Vectors; Training Examples → Machine Learner → Function Parameters → Classifier; Testing Examples → Classifier → Labels → Postprocessing → Annotated Text]


Editor's Notes

  • #9: Add function formula
  • #19: pictures of a 4 class classifier...with the OvA definition
  • #22: Well-posed learning problems
  • #23: Framing the classification task
  • #24: note that the sentence changed
  • #25: note that the sentence changed
  • #26: Text categorization
  • #27: Word tagging
  • #28: Without algorithmic details as Dan says
  • #35: A machine learning system
  • #36: Preprocessing text
  • #37: A machine learning system
  • #38: Feature Generation (Kernels)
  • #39: A machine learning system
  • #40: A machine learning system