Jong Youl Choi, Computer Science Department (jychoi@cs.indiana.edu)
Social Bookmarking Socialized Tags Bookmarks
Machine Learning and Statistical Analysis
Principles of Machine Learning: Bayes' theorem and maximum likelihood. Machine Learning Algorithms: clustering analysis, dimension reduction, classification. Parallel Computing: general parallel computing architecture, parallel algorithms.
Definition: algorithms or techniques that enable a computer (machine) to "learn" from data; related to many areas such as data mining, statistics, and information theory. Algorithm types: unsupervised learning, supervised learning, reinforcement learning. Topics: models (Artificial Neural Network (ANN), Support Vector Machine (SVM)) and optimization methods (Expectation-Maximization (EM), Deterministic Annealing (DA)).
Posterior probability of θ_i given X, where θ_i ∈ Θ is a parameter and X is the set of observations; P(θ_i) is the prior (or marginal) probability and P(X | θ_i) is the likelihood. Maximum Likelihood (ML) is used to find the most plausible θ_i ∈ Θ given X; computing the maximum likelihood (or log-likelihood) estimate is an optimization problem.
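Written out (a standard statement of Bayes' theorem and the ML objective, reconstructed from the bullet text rather than copied from the slide image):

    P(\theta_i \mid X) = \frac{P(X \mid \theta_i)\, P(\theta_i)}{P(X)},
    \qquad
    P(X) = \sum_{j} P(X \mid \theta_j)\, P(\theta_j)

    \hat{\theta}_{\mathrm{ML}}
    = \arg\max_{\theta \in \Theta} P(X \mid \theta)
    = \arg\max_{\theta \in \Theta} \log P(X \mid \theta)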
Problem: estimate the hidden parameters θ = {μ, σ} from data drawn from k Gaussian distributions. Maximum likelihood with a Gaussian model (P = N) can be solved either by brute force or by a numerical method (Mitchell, 1997).
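For a single Gaussian the density and its closed-form ML estimates are standard results (the k-component case has no closed form, which motivates EM on the next slide):

    \mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}
      \exp\!\Big(\!-\frac{(x-\mu)^2}{2\sigma^2}\Big),
    \qquad
    \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i,
    \quad
    \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu})^2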
Problems in ML estimation: the observation X is often incomplete, a latent (hidden) variable Z exists, and it is hard to explore the whole parameter space. Expectation-Maximization algorithm. Objective: to find the ML estimate over the latent distribution P(Z | X, θ). Steps: 0. Init: choose a random θ_old. 1. E-step: compute the expectation P(Z | X, θ_old). 2. M-step: find the θ_new which maximizes the likelihood. 3. Update θ_old ← θ_new and go to step 1.
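A minimal sketch of those steps for a two-component 1-D Gaussian mixture (the k = 2 setting, the crude initialization, and all variable names are illustrative choices, not taken from the slides):

    import numpy as np

    def em_gmm_1d(x, n_iter=50):
        """EM for a 2-component 1-D Gaussian mixture (toy sketch)."""
        # 0. Init: crude starting parameters (theta_old)
        mu = np.array([x.min(), x.max()], dtype=float)
        var = np.array([x.var(), x.var()])
        pi = np.array([0.5, 0.5])
        for _ in range(n_iter):
            # 1. E-step: responsibilities P(Z | X, theta_old)
            dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
            resp = pi * dens
            resp /= resp.sum(axis=1, keepdims=True)
            # 2. M-step: theta_new maximizing the expected log-likelihood
            nk = resp.sum(axis=0)
            mu = (resp * x[:, None]).sum(axis=0) / nk
            var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
            pi = nk / len(x)
            # 3. theta_old <- theta_new, repeat
        return mu, var, pi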
Definition: grouping unlabeled data into clusters for the purpose of inferring hidden structures or information. Dissimilarity measurements: distance (Euclidean L2, Manhattan L1, …), angle (inner product, …), non-metric (rank, intensity, …). Types of clustering: hierarchical (agglomerative or divisive) and partitioning (K-means, VQ, MDS, …) (Matlab help page).
K-means: find K partitions such that the total intra-cluster variance is minimized. Iterative method: initialization (randomized centroids y_i), assignment of x (y_i fixed), update of y_i (x fixed). Problem? It can get trapped in local minima (MacKay, 2003).
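A compact sketch of the two alternating steps (NumPy-based; the random seeding and the empty-cluster guard are my own additions):

    import numpy as np

    def kmeans(x, k, n_iter=100, seed=0):
        """Plain K-means sketch: alternate assignment and centroid update."""
        rng = np.random.default_rng(seed)
        y = x[rng.choice(len(x), k, replace=False)]      # randomized initialization of y_i
        for _ in range(n_iter):
            d = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)
            assign = d.argmin(axis=1)                     # assignment of x (y_i fixed)
            y = np.array([x[assign == j].mean(axis=0) if (assign == j).any() else y[j]
                          for j in range(k)])             # update of y_i (x fixed)
        return y, assign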
Deterministic Annealing (DA): deterministically avoid local minima, with no stochastic process (random walk); trace the global solution by changing the level of randomness. Statistical mechanics: the Gibbs distribution and the Helmholtz free energy F = D − TS, with average energy D = ⟨E_x⟩ and entropy S = −Σ P(E_x) ln P(E_x); equivalently F = −T ln Z. In DA, we minimize F. (Maxima and Minima, Wikipedia)
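The quantities the slide lists, written out in their standard form (T is the temperature and Z the partition function):

    P(E_x) = \frac{e^{-E_x/T}}{Z},
    \qquad
    Z = \sum_{x} e^{-E_x/T}

    F = D - TS = -T \ln Z,
    \qquad
    D = \langle E_x \rangle = \sum_x P(E_x)\, E_x,
    \qquad
    S = -\sum_x P(E_x) \ln P(E_x)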
Analogy to the physical annealing process: control the energy (randomness) by temperature (high → low). Starting with a high temperature (T = ∞): soft (or fuzzy) association probabilities and a smooth cost function with one global minimum. Lowering the temperature (T → 0): hard association; the full complexity is revealed and the clusters emerge. F is minimized iteratively, using E(x, y_j) = ||x − y_j||².
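A toy sketch of DA clustering in this spirit: soft Gibbs assignments at temperature T, centroid updates, then gradual cooling (the cooling schedule and constants are illustrative assumptions, not the authors' values):

    import numpy as np

    def da_cluster(x, k, t_init=10.0, t_min=0.01, cool=0.9, seed=0):
        """Deterministic-annealing clustering sketch: soft assignments under a
        Gibbs distribution, with the temperature T lowered gradually."""
        rng = np.random.default_rng(seed)
        y = x.mean(axis=0) + 1e-3 * rng.standard_normal((k, x.shape[1]))
        t = t_init
        while t > t_min:
            e = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)   # E(x, y_j) = ||x - y_j||^2
            p = np.exp(-e / t)
            p /= p.sum(axis=1, keepdims=True)                        # soft association P(y_j | x)
            y = (p[:, :, None] * x[:, None, :]).sum(axis=0) / p.sum(axis=0)[:, None]
            t *= cool                                                # lower the temperature
        return y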
Definition: the process of transforming high-dimensional data into a low-dimensional representation, to improve accuracy and understanding or to remove noise. Curse of dimensionality: volume, and hence complexity, grows exponentially as extra dimensions are added (Koppen, 2000). Types: feature selection chooses representative features (e.g., filters, …); feature extraction maps the data to a lower dimension (e.g., PCA, MDS, …).
PCA: finding a map of the principal components (PCs) of the data into an orthogonal space, y = W x, where W ∈ R^{d×h} (h ≪ d). The PCs are the variables with the largest variances; the mapping is orthogonal and linear, and optimal in the least mean-square-error sense. Limitations? Strict linearity, an assumed specific distribution, and the large-variance assumption. (Figure: data in the (x_1, x_2) plane with principal axes PC_1 and PC_2.)
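A minimal PCA sketch via the SVD of the centered data (the function name and the use of NumPy are my own choices):

    import numpy as np

    def pca(x, h):
        """PCA sketch: project d-dimensional rows of x onto the top-h principal components."""
        xc = x - x.mean(axis=0)                  # center the data
        _, _, vt = np.linalg.svd(xc, full_matrices=False)
        w = vt[:h]                               # rows = orthonormal PCs with the largest variance
        return xc @ w.T                          # the low-dimensional coordinates y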
Random projection: like PCA, it reduces dimension by y = R x, where R is a random matrix with i.i.d. columns and R ∈ R^{d×p} (p ≪ d). Johnson-Lindenstrauss lemma: when projecting to a randomly selected subspace, distances are approximately preserved. Generating R: an orthogonalized R is hard to obtain; one can use a Gaussian R, or the simple approach of choosing r_ij ∈ {+3^{1/2}, 0, −3^{1/2}} with probabilities 1/6, 4/6, 1/6 respectively.
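A sketch of that sparse construction (the 1/√p scaling is the usual normalization that keeps distances approximately unchanged; it is an assumption here, not stated on the slide):

    import numpy as np

    def random_projection(x, p, seed=0):
        """Sparse random projection sketch with +sqrt(3)/0/-sqrt(3) entries."""
        rng = np.random.default_rng(seed)
        d = x.shape[1]
        r = rng.choice([np.sqrt(3), 0.0, -np.sqrt(3)],
                       size=(d, p), p=[1 / 6, 4 / 6, 1 / 6])
        return x @ r / np.sqrt(p)                # y = R x, scaled to preserve distances on average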
Multidimensional scaling (MDS): dimension reduction that preserves the distance proximities observed in the original data set. Loss functions: inner product, distance, squared distance. Classical MDS minimizes STRAIN: given the dissimilarity matrix Δ, find the inner-product matrix B by double centering, then recover the coordinates X' from B (i.e., B = X'X'^T).
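A sketch of that pipeline, double centering the squared dissimilarities and eigendecomposing B (NumPy-based; names are my own):

    import numpy as np

    def classical_mds(delta, h):
        """Classical MDS sketch: double centering, then an eigendecomposition of B."""
        n = delta.shape[0]
        j = np.eye(n) - np.ones((n, n)) / n                      # centering matrix
        b = -0.5 * j @ (delta ** 2) @ j                          # inner-product matrix B
        vals, vecs = np.linalg.eigh(b)
        idx = np.argsort(vals)[::-1][:h]                         # h largest eigenvalues
        return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))  # coordinates X' with B = X'X'^T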
SMACOF: minimizing STRESS by majorization. Majorization: for a complicated f(x), find an auxiliary simple g(x, y) that lies above f and touches it at y, and minimize g instead. Applying majorization to STRESS leads to minimizing a quadratic majorizer whose tr(X^T B(Y) Y) term yields the update known as the Guttman transform (Cox, 2001).
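For reference, the STRESS criterion, the majorization conditions, and the unit-weight Guttman update in their usual form (standard definitions following Cox, 2001, not copied from the slide):

    \sigma(X) = \sum_{i<j} w_{ij}\,\big(d_{ij}(X) - \delta_{ij}\big)^2,
    \qquad
    g(x, y) \ge f(x)\ \ \forall x,
    \quad
    g(y, y) = f(y)

    X^{(t+1)} = \frac{1}{n}\, B\!\big(X^{(t)}\big)\, X^{(t)}
    \quad \text{(Guttman transform for unit weights)}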
Self-Organizing Map (SOM): a competitive, unsupervised learning process for clustering and visualization. Result: similar data end up closer together in the model space. Learning: choose the model vector m_j most similar to the input x_i, then update the winner and its neighbors by m_k ← m_k + α(t) Θ(t) (x_i − m_k), where α(t) is the learning rate and Θ(t) is the neighborhood (size) function.
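One learning step as a sketch, with a Gaussian neighborhood on a 2-D grid of model vectors (the exponential decay schedules for α(t) and Θ(t) are illustrative assumptions):

    import numpy as np

    def som_step(m, x_i, t, lr0=0.5, sigma0=2.0, tau=200.0):
        """One SOM update (sketch): move the winner and its grid neighbors toward x_i.
        m has shape (rows, cols, dim); the decay constants are illustrative."""
        dist = ((m - x_i) ** 2).sum(axis=2)
        win = np.unravel_index(dist.argmin(), dist.shape)    # best-matching model vector m_j
        alpha = lr0 * np.exp(-t / tau)                       # learning rate alpha(t)
        sigma = sigma0 * np.exp(-t / tau)                    # neighborhood size
        gi, gj = np.indices(dist.shape)
        grid_d2 = (gi - win[0]) ** 2 + (gj - win[1]) ** 2
        h = np.exp(-grid_d2 / (2 * sigma ** 2))              # neighborhood function Theta(t)
        return m + alpha * h[:, :, None] * (x_i - m)         # m_k <- m_k + alpha(t) Theta(t) (x_i - m_k)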
Classification. Definition: a procedure that divides data into a given set of categories, based on a training set, in a supervised way. Generalization vs. specialization: hard to achieve both. To avoid overfitting (overtraining): early stopping, holdout validation, K-fold cross-validation, leave-one-out cross-validation. (Figure: validation-error and training-error curves marking the underfitting and overfitting regions; Overfitting, Wikipedia.)
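A generic K-fold cross-validation sketch (the fit and error callables are placeholders for whatever model and error function are being validated):

    import numpy as np

    def k_fold_error(x, y, k, fit, error):
        """K-fold cross-validation sketch: average held-out error over k splits.
        fit(x_tr, y_tr) returns a model; error(model, x_te, y_te) scores it."""
        idx = np.random.permutation(len(x))
        folds = np.array_split(idx, k)
        errs = []
        for i in range(k):
            test = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            model = fit(x[train], y[train])
            errs.append(error(model, x[test], y[test]))
        return float(np.mean(errs))            # leave-one-out is the special case k = len(x)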
Perceptron: a computational unit with a binary threshold (a weighted sum followed by an activation function). Abilities: a linearly separable decision surface; it can represent Boolean functions (AND, OR, NOT). A network (multilayer) of perceptrons gives various network architectures and capabilities (Jain, 1996).
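A minimal perceptron training sketch with the classic error-correction update (the learning rate and epoch count are arbitrary illustrative values):

    import numpy as np

    def perceptron(x, y, epochs=100, lr=0.1):
        """Perceptron training sketch: binary threshold unit, labels y in {+1, -1}."""
        w = np.zeros(x.shape[1])
        b = 0.0
        for _ in range(epochs):
            for xi, yi in zip(x, y):
                o = 1.0 if w @ xi + b > 0 else -1.0   # weighted sum + threshold activation
                if o != yi:                           # error-correction update
                    w += lr * yi * xi
                    b += lr * yi
        return w, b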
Learning the weights: random initialization followed by updating. Error-correction training rules use the difference between the training data and the output, E(t, o). Gradient descent (batch learning): with E = Σ E_i, update on the gradient of the total error. Stochastic approach (on-line learning): update on the gradient for each training example. Various error functions: adding a weight-regularization term (proportional to Σ w_i²) to avoid overfitting, or adding a momentum term (proportional to Δw_i(n−1)) to expedite convergence.
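The resulting weight update in its usual form, with η the learning rate, λ the regularization weight, and α the momentum coefficient (standard notation; the slide's own Greek symbols were lost in extraction):

    E = \tfrac{1}{2} \sum_{d} (t_d - o_d)^2 + \lambda \sum_i w_i^2,
    \qquad
    \Delta w_i(n) = -\,\eta\, \frac{\partial E}{\partial w_i} + \alpha\, \Delta w_i(n-1)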
Q: How do we draw the optimal linear separating hyperplane? A: by maximizing the margin. Margin maximization: the distance between H_{+1} and H_{−1} is 2/||w||; thus ||w|| should be minimized. (Figure: the margin between the two hyperplanes.)
Constrained optimization problem. Given a training set {x_i, y_i} (y_i ∈ {+1, −1}), minimize ½||w||² subject to the margin constraints. Form the Lagrangian and look for its saddle points: it is minimized with respect to the primal variables w and b, and maximized with respect to the dual variables α_i (all α_i ≥ 0). An x_i with α_i > 0 (not α_i = 0) is called a support vector (SV).
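Written out in the standard form (a reconstruction from the surrounding text, not a copy of the slide's equations):

    \min_{\mathbf{w}, b}\ \tfrac{1}{2}\|\mathbf{w}\|^2
    \quad \text{s.t.} \quad y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1

    L(\mathbf{w}, b, \alpha) = \tfrac{1}{2}\|\mathbf{w}\|^2
      - \sum_i \alpha_i \big[y_i(\mathbf{w}\cdot\mathbf{x}_i + b) - 1\big]

    \max_{\alpha}\ \sum_i \alpha_i
      - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i \cdot \mathbf{x}_j
    \quad \text{s.t.} \quad \alpha_i \ge 0,\ \ \sum_i \alpha_i y_i = 0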
Soft margin (non-separable case): introduce slack variables ξ_i, bounded through the parameter C; the optimization gains an additional constraint. Non-linear SVM: map the non-linear input to a feature space via a kernel function k(x, y) = ⟨Φ(x), Φ(y)⟩, giving a kernel classifier built on the support vectors s_i. (Figure: mapping from input space to feature space.)
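A sketch of the kernel classifier f(x) = Σ_i α_i y_i k(s_i, x) + b, here with an RBF kernel (the RBF choice and γ are illustrative; any valid kernel k(x, y) = ⟨Φ(x), Φ(y)⟩ would do):

    import numpy as np

    def rbf_kernel(x, y, gamma=1.0):
        """k(x, y) = <Phi(x), Phi(y)> realized implicitly by the RBF kernel."""
        return np.exp(-gamma * ((x - y) ** 2).sum())

    def svm_decision(x, support_vectors, alphas, labels, b, gamma=1.0):
        """Kernel classifier sketch: f(x) = sum_i alpha_i y_i k(s_i, x) + b."""
        s = sum(a * y * rbf_kernel(sv, x, gamma)
                for sv, a, y in zip(support_vectors, alphas, labels))
        return np.sign(s + b)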
Memory architecture and decomposition strategy. Decomposition strategies: task decomposition (e.g., Word, IE, …), data decomposition (scientific problems), and pipelining (task + data). Shared memory: symmetric multiprocessor (SMP), programmed with OpenMP, POSIX threads (pthreads), or MPI; easy to manage but expensive. Distributed memory: commodity, off-the-shelf processors, programmed with MPI; cost-effective but hard to maintain (Barney, 2007).
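A tiny distributed-memory example of data decomposition with MPI (using mpi4py, assuming it is installed; the partial-sum task is purely illustrative):

    # Run with e.g.:  mpiexec -n 4 python partial_sum.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    data = list(range(1000)) if rank == 0 else None
    chunks = [data[i::size] for i in range(size)] if rank == 0 else None
    local = comm.scatter(chunks, root=0)             # data decomposition: each process gets a slice
    partial = sum(local)                             # local work on each processor
    total = comm.reduce(partial, op=MPI.SUM, root=0)
    if rank == 0:
        print(total)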
Shrinking. Recall: only the support vectors (α_i > 0) matter in the SVM optimization, so predict whether each data point is an SV or a non-SV and remove the non-SVs from the problem space. Parallel SVM: partition the problem, let each unit find its support vectors, merge the data hierarchically, and loop until convergence (Graf, 2005).
Editor's Notes

  • #6: Inductive learning: extract rules, patterns, or information from massive data (e.g., decision trees, clustering, …). Deductive learning: requires no additional input but improves performance gradually (e.g., the advice taker, …).