Introduction to Machine Learning
Ethem Alpaydin

The MIT Press
Cambridge, Massachusetts
London, England

© 2004 Massachusetts Institute of Technology

All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.

MIT Press books may be purchased at special quantity discounts for business or sales promotional use. For information, please email special_sales@mitpress.mit.edu or write to Special Sales Department, The MIT Press, 5 Cambridge Center, Cambridge, MA 02142.

Library of Congress Control Number: 2004109627
ISBN: 0-262-01211-1 (hc)

Typeset in 10/13 Lucida Bright by the author using LaTeX2e. Printed and bound in the United States of America.

Contents

Series Foreword
Figures
Tables
Preface
Acknowledgments
Notations

1 Introduction
  1.1 What Is Machine Learning?
  1.2 Examples of Machine Learning Applications
    1.2.1 Learning Associations
    1.2.2 Classification
    1.2.3 Regression
    1.2.4 Unsupervised Learning
    1.2.5 Reinforcement Learning
  1.3 Notes
  1.4 Relevant Resources
  1.5 Exercises
  1.6 References

2 Supervised Learning
  2.1 Learning a Class from Examples
  2.2 Vapnik-Chervonenkis (VC) Dimension
  2.3 Probably Approximately Correct (PAC) Learning
  2.4 Noise
  2.5 Learning Multiple Classes
  2.6 Regression
  2.7 Model Selection and Generalization
  2.8 Dimensions of a Supervised Machine Learning Algorithm
  2.9 Notes
  2.10 Exercises
  2.11 References

3 Bayesian Decision Theory
  3.1 Introduction
  3.2 Classification
  3.3 Losses and Risks
  3.4 Discriminant Functions
  3.5 Utility Theory
  3.6 Value of Information
  3.7 Bayesian Networks
  3.8 Influence Diagrams
  3.9 Association Rules
  3.10 Notes
  3.11 Exercises
  3.12 References

4 Parametric Methods
  4.1 Introduction
  4.2 Maximum Likelihood Estimation
    4.2.1 Bernoulli Density
    4.2.2 Multinomial Density
    4.2.3 Gaussian (Normal) Density
  4.3 Evaluating an Estimator: Bias and Variance
  4.4 The Bayes' Estimator
  4.5 Parametric Classification
  4.6 Regression
  4.7 Tuning Model Complexity: Bias/Variance Dilemma
  4.8 Model Selection Procedures
  4.9 Notes
  4.10 Exercises
  4.11 References

5 Multivariate Methods
  5.1 Multivariate Data
  5.2 Parameter Estimation
  5.3 Estimation of Missing Values
  5.4 Multivariate Normal Distribution
  5.5 Multivariate Classification
  5.6 Tuning Complexity
  5.7 Discrete Features
  5.8 Multivariate Regression
  5.9 Notes
  5.10 Exercises
  5.11 References

6 Dimensionality Reduction
  6.1 Introduction
  6.2 Subset Selection
  6.3 Principal Components Analysis
  6.4 Factor Analysis
  6.5 Multidimensional Scaling
  6.6 Linear Discriminant Analysis
  6.7 Notes
  6.8 Exercises
  6.9 References

7 Clustering
  7.1 Introduction
  7.2 Mixture Densities
  7.3 k-Means Clustering
  7.4 Expectation-Maximization Algorithm
  7.5 Mixtures of Latent Variable Models
  7.6 Supervised Learning after Clustering
  7.7 Hierarchical Clustering
  7.8 Choosing the Number of Clusters
  7.9 Notes
  7.10 Exercises
  7.11 References

8 Nonparametric Methods
  8.1 Introduction
  8.2 Nonparametric Density Estimation
    8.2.1 Histogram Estimator
    8.2.2 Kernel Estimator
    8.2.3 k-Nearest Neighbor Estimator
  8.3 Generalization to Multivariate Data
  8.4 Nonparametric Classification
  8.5 Condensed Nearest Neighbor
  8.6 Nonparametric Regression: Smoothing Models
    8.6.1 Running Mean Smoother
    8.6.2 Kernel Smoother
    8.6.3 Running Line Smoother
  8.7 How to Choose the Smoothing Parameter
  8.8 Notes
  8.9 Exercises
  8.10 References

9 Decision Trees
  9.1 Introduction
  9.2 Univariate Trees
    9.2.1 Classification Trees
    9.2.2 Regression Trees
  9.3 Pruning
  9.4 Rule Extraction from Trees
  9.5 Learning Rules from Data
  9.6 Multivariate Trees
  9.7 Notes
  9.8 Exercises
  9.9 References

10 Linear Discrimination
  10.1 Introduction
  10.2 Generalizing the Linear Model
  10.3 Geometry of the Linear Discriminant
    10.3.1 Two Classes
    10.3.2 Multiple Classes
  10.4 Pairwise Separation
  10.5 Parametric Discrimination Revisited
  10.6 Gradient Descent
  10.7 Logistic Discrimination
    10.7.1 Two Classes
    10.7.2 Multiple Classes
  10.8 Discrimination by Regression
  10.9 Support Vector Machines
    10.9.1 Optimal Separating Hyperplane
    10.9.2 The Nonseparable Case: Soft Margin Hyperplane
    10.9.3 Kernel Functions
    10.9.4 Support Vector Machines for Regression
  10.10 Notes
  10.11 Exercises
  10.12 References

11 Multilayer Perceptrons
  11.1 Introduction
    11.1.1 Understanding the Brain
    11.1.2 Neural Networks as a Paradigm for Parallel Processing
  11.2 The Perceptron
  11.3 Training a Perceptron
  11.4 Learning Boolean Functions
  11.5 Multilayer Perceptrons
  11.6 MLP as a Universal Approximator
  11.7 Backpropagation Algorithm
    11.7.1 Nonlinear Regression
    11.7.2 Two-Class Discrimination
    11.7.3 Multiclass Discrimination
    11.7.4 Multiple Hidden Layers
  11.8 Training Procedures
    11.8.1 Improving Convergence
    11.8.2 Overtraining
    11.8.3 Structuring the Network
    11.8.4 Hints
  11.9 Tuning the Network Size
  11.10 Bayesian View of Learning
  11.11 Dimensionality Reduction
  11.12 Learning Time
    11.12.1 Time Delay Neural Networks
    11.12.2 Recurrent Networks
  11.13 Notes
  11.14 Exercises
  11.15 References

12 Local Models
  12.1 Introduction
  12.2 Competitive Learning
    12.2.1 Online k-Means
    12.2.2 Adaptive Resonance Theory
    12.2.3 Self-Organizing Maps
  12.3 Radial Basis Functions
  12.4 Incorporating Rule-Based Knowledge
  12.5 Normalized Basis Functions
  12.6 Competitive Basis Functions
  12.7 Learning Vector Quantization
  12.8 Mixture of Experts
    12.8.1 Cooperative Experts
    12.8.2 Competitive Experts
  12.9 Hierarchical Mixture of Experts
  12.10 Notes
  12.11 Exercises
  12.12 References

13 Hidden Markov Models
  13.1 Introduction
  13.2 Discrete Markov Processes
  13.3 Hidden Markov Models
  13.4 Three Basic Problems of HMMs
  13.5 Evaluation Problem
  13.6 Finding the State Sequence
  13.7 Learning Model Parameters
  13.8 Continuous Observations
  13.9 The HMM with Input
  13.10 Model Selection in HMM
  13.11 Notes
  13.12 Exercises
  13.13 References

14 Assessing and Comparing Classification Algorithms
  14.1 Introduction
  14.2 Cross-Validation and Resampling Methods
    14.2.1 K-Fold Cross-Validation
    14.2.2 5×2 Cross-Validation
    14.2.3 Bootstrapping
  14.3 Measuring Error
  14.4 Interval Estimation
  14.5 Hypothesis Testing
  14.6 Assessing a Classification Algorithm's Performance
    14.6.1 Binomial Test
    14.6.2 Approximate Normal Test
    14.6.3 Paired t Test
  14.7 Comparing Two Classification Algorithms
    14.7.1 McNemar's Test
    14.7.2 K-Fold Cross-Validated Paired t Test
    14.7.3 5×2 cv Paired t Test
    14.7.4 5×2 cv Paired F Test
  14.8 Comparing Multiple Classification Algorithms: Analysis of Variance
  14.9 Notes
  14.10 Exercises
  14.11 References

15 Combining Multiple Learners
  15.1 Rationale
  15.2 Voting
  15.3 Error-Correcting Output Codes
  15.4 Bagging
  15.5 Boosting
  15.6 Mixture of Experts Revisited
  15.7 Stacked Generalization
  15.8 Cascading
  15.9 Notes
  15.10 Exercises
  15.11 References

16 Reinforcement Learning
  16.1 Introduction
  16.2 Single State Case: K-Armed Bandit
  16.3 Elements of Reinforcement Learning
  16.4 Model-Based Learning
    16.4.1 Value Iteration
    16.4.2 Policy Iteration
  16.5 Temporal Difference Learning
    16.5.1 Exploration Strategies
    16.5.2 Deterministic Rewards and Actions
    16.5.3 Nondeterministic Rewards and Actions
    16.5.4 Eligibility Traces
  16.6 Generalization
  16.7 Partially Observable States
  16.8 Notes
  16.9 Exercises
  16.10 References

A Probability
  A.1 Elements of Probability
    A.1.1 Axioms of Probability
    A.1.2 Conditional Probability
  A.2 Random Variables
    A.2.1 Probability Distribution and Density Functions
    A.2.2 Joint Distribution and Density Functions
    A.2.3 Conditional Distributions
    A.2.4 Bayes' Rule
    A.2.5 Expectation
    A.2.6 Variance
    A.2.7 Weak Law of Large Numbers
  A.3 Special Random Variables
    A.3.1 Bernoulli Distribution
    A.3.2 Binomial Distribution
    A.3.3 Multinomial Distribution
    A.3.4 Uniform Distribution
    A.3.5 Normal (Gaussian) Distribution
    A.3.6 Chi-Square Distribution
    A.3.7 t Distribution
    A.3.8 F Distribution
  A.4 References

Index

Series Foreword

The goal of building systems that can adapt to their environments and learn from their experience has attracted researchers from many fields, including computer science, engineering, mathematics, physics, neuroscience, and cognitive science. Out of this research has come a wide variety of learning techniques that are transforming many industrial and scientific fields. Recently, several research communities have begun to converge on a common set of issues surrounding supervised, unsupervised, and reinforcement learning problems. The MIT Press Series on Adaptive Computation and Machine Learning seeks to unify the many diverse strands of machine learning research and to foster high-quality research and innovative applications.

The MIT Press is extremely pleased to publish this contribution by Ethem Alpaydin to the series. This textbook presents a readable and concise introduction to machine learning that reflects these diverse research strands. The book covers all of the main problem formulations and introduces the latest algorithms and techniques encompassing methods from computer science, neural computation, information theory, and statistics. This book will be a compelling textbook for introductory courses in machine learning at the undergraduate and beginning graduate level.
Tables

2.1 With two inputs, there are four possible cases and sixteen possible Boolean functions.
5.1 Reducing variance through simplifying assumptions.
11.1 Input and output for the AND function.
11.2 Input and output for the XOR function.
14.1 Confusion matrix.
14.2 Type I error, type II error, and power of a test.

Preface

Machine learning is programming computers to optimize a performance criterion using example data or past experience. We need learning in cases where we cannot directly write a computer program to solve a given problem, but need example data or experience. One case where learning is necessary is when human expertise does not exist, or when humans are unable to explain their expertise. Consider the recognition of spoken speech, that is, converting the acoustic speech signal to an ASCII text; we can do this task seemingly without any difficulty, but we are unable to explain how we do it. Different people utter the same word differently due to differences in age, gender, or accent. In machine learning, the approach is to collect a large collection of sample utterances from different people and learn to map these to words.

Another case is when the problem to be solved changes in time, or depends on the particular environment. We would like to have general-purpose systems that can adapt to their circumstances, rather than explicitly writing a different program for each special circumstance. Consider routing packets over a computer network. The path maximizing the quality of service from a source to destination changes continuously as the network traffic changes. A learning routing program is able to adapt to the best path by monitoring the network traffic. Another example is an intelligent user interface that can adapt to the biometrics of its user, namely, his or her accent, handwriting, working habits, and so forth.

Already, there are many successful applications of machine learning in various domains: There are commercially available systems for recognizing speech and handwriting. Retail companies analyze their past sales data to learn their customers' behavior to improve customer relationship management. Financial institutions analyze past transactions to predict customers' credit risks. Robots learn to optimize their behavior to complete a task using minimum resources. In bioinformatics, the huge amount of data can only be analyzed and knowledge be extracted using computers. These are only some of the applications that we—that is, you and I—will discuss throughout this book. We can only imagine what future applications can be realized using machine learning: Cars that can drive themselves under different road and weather conditions, phones that can translate in real time to and from a foreign language, autonomous robots that can navigate in a new environment, for example, on the surface of another planet.
Machine learning is certainly an exciting field to be working in! The book discusses many methods that have their bases in different fields: statistics, pattern recognition, neural networks, artificial intelligence, signal processing, control, and data mining. In the past, research in these different communities followed different paths with different emphases. In this book, the aim is to incorporate them together to give a unified treatment of the problems and the proposed solutions to them.

This is an introductory textbook, intended for senior undergraduate and graduate level courses on machine learning, as well as engineers working in the industry who are interested in the application of these methods. The prerequisites are courses on computer programming, probability, calculus, and linear algebra. The aim is to have all learning algorithms sufficiently explained so it will be a small step from the equations given in the book to a computer program. For some cases, pseudocode of algorithms is also included to make this task easier.

The book can be used for a one-semester course by sampling from the chapters, or it can be used for a two-semester course, possibly by discussing extra research papers; in such a case, I hope that the references at the end of each chapter are useful.

The Web page is http://www.cmpe.boun.edu.tr/~ethem/i2ml/ where I will post information related to the book that becomes available after the book goes to press, for example, errata. I welcome your feedback via email to [email protected].

I very much enjoyed writing this book; I hope you will enjoy reading it.

Acknowledgments

The way you get good ideas is by working with talented people who are also fun to be with. The Department of Computer Engineering of Boğaziçi University is a wonderful place to work and my colleagues gave me all the support I needed while working on this book. I would also like to thank my past and present students on whom I have field-tested the content that is now in book form.

While working on this book, I was supported by the Turkish Academy of Sciences, in the framework of the Young Scientist Award Program (EA-TÜBA-GEBİP/2001-1-1).

My special thanks go to Michael Jordan. I am deeply indebted to him for his support over the years and lastly for this book. His comments on the general organization of the book, and the first chapter, have greatly improved the book, both in content and form. Taner Bilgiç, Vladimir Cherkassky, Tom Dietterich, Fikret Gürgen, Olcay Taner Yıldız, and anonymous reviewers of The MIT Press also read parts of the book and provided invaluable feedback. I hope that they will sense my gratitude when they notice ideas that I have taken from their comments without proper acknowledgment. Of course, I alone am responsible for any errors or shortcomings.

My parents believe in me, and I am grateful for their enduring love and support. Sema Oktuğ is always there whenever I need her and I will always be thankful for her friendship. I would also like to thank Hakan Ünlü for our many discussions over the years on several topics related to life, the universe, and everything.

This book is set using LaTeX macros prepared by Chris Manning, for which I thank him.
I would like to thank the editors of the Adaptive Computation and Machine Learning series, and Bob Prior, Valerie Geary, Kathleen Caruso, Sharon Deacon Warne, Erica Schultz, and Emily Gutheinz from The MIT Press for their continuous support and help during the completion of the book.

Notations

x : Scalar value
x (boldface lowercase) : Vector
X (boldface uppercase) : Matrix
x^T : Transpose
X^(-1) : Inverse
X : Random variable
P(X) : Probability mass function when X is discrete
p(X) : Probability density function when X is continuous
P(X|Y) : Conditional probability of X given Y
E[X] : Expected value of the random variable X
Var(X) : Variance of X
Cov(X, Y) : Covariance of X and Y
Corr(X, Y) : Correlation of X and Y
μ : Mean
σ² : Variance
Σ : Covariance matrix
m : Estimator to the mean
s² : Estimator to the variance
S : Estimator to the covariance matrix
N(μ, σ²) : Univariate normal distribution with mean μ and variance σ²
Z : Unit normal distribution: N(0, 1)
N_d(μ, Σ) : d-variate normal distribution with mean vector μ and covariance matrix Σ
x : Input
d : Number of inputs (input dimensionality)
y : Output
r : Required output
K : Number of outputs (classes)
N : Number of training instances
z : Hidden value, intrinsic dimension, latent factor
k : Number of hidden dimensions, latent factors
C_i : Class i
X : Training sample
{x^t}, t = 1, ..., N : Set of x with index t ranging from 1 to N
{x^t, r^t} : Set of ordered pairs of input and desired output with index t
g(x|θ) : Function of x defined up to a set of parameters θ
arg max_θ g(x|θ) : The argument θ for which g has its maximum value
arg min_θ g(x|θ) : The argument θ for which g has its minimum value
E(θ|X) : Error function with parameters θ on the sample X
l(θ|X) : Likelihood of the parameters θ on the sample X
L(θ|X) : Log likelihood of the parameters θ on the sample X
1(c) : 1 if c is true, 0 otherwise
#{c} : Number of elements for which c is true
δ_ij : Kronecker delta: 1 if i = j, 0 otherwise

1 Introduction

1.1 What Is Machine Learning?

With advances in computer technology, we currently have the ability to store and process large amounts of data, as well as to access it from physically distant locations over a computer network. Most data acquisition devices are digital now and record reliable data. Think, for example, of a supermarket chain that has hundreds of stores all over a country selling thousands of goods to millions of customers. The point of sale terminals record the details of each transaction: date, customer identification code, goods bought and their amount, total money spent, and so forth. This typically amounts to gigabytes of data every day. This stored data becomes useful only when it is analyzed and turned into information that we can make use of, for example, to make predictions.

We do not know exactly which people are likely to buy a particular product, or which author to suggest to people who enjoy reading Hemingway. If we knew, we would not need any analysis of the data; we would just go ahead and write down the code. But because we do not, we can only collect data and hope to extract the answers to these and similar questions from data.

We do believe that there is a process that explains the data we observe. Though we do not know the details of the process underlying the generation of data—for example, consumer behavior—we know that it is not completely random. People do not go to supermarkets and buy things at random. When they buy beer, they buy chips; they buy ice cream in summer and spices for Glühwein in winter. There are certain patterns in the data.
We may not be able to identify the process completely, but we believe we can construct a good and useful approximation. That approximation may not explain everything, but may still be able to account for some part of the data. We believe that though identifying the complete process may not be possible, we can still detect certain patterns or regularities. This is the niche of machine learning. Such patterns may help us understand the process, or we can use those patterns to make predictions: Assuming that the future, at least the near future, will not be much different from the past when the sample data was collected, the future predictions can also be expected to be right.

Application of machine learning methods to large databases is called data mining. The analogy is that a large volume of earth and raw material is extracted from a mine, which when processed leads to a small amount of very precious material; similarly, in data mining, a large volume of data is processed to construct a simple model with valuable use, for example, having high predictive accuracy. Its application areas are abundant: In addition to retail, in finance banks analyze their past data to build models to use in credit applications, fraud detection, and the stock market. In manufacturing, learning models are used for optimization, control, and troubleshooting. In medicine, learning programs are used for medical diagnosis. In telecommunications, call patterns are analyzed for network optimization and maximizing the quality of service. In science, large amounts of data in physics, astronomy, and biology can only be analyzed fast enough by computers. The World Wide Web is huge; it is constantly growing, and searching for relevant information cannot be done manually.

But machine learning is not just a database problem; it is also a part of artificial intelligence. To be intelligent, a system that is in a changing environment should have the ability to learn. If the system can learn and adapt to such changes, the system designer need not foresee and provide solutions for all possible situations.

Machine learning also helps us find solutions to many problems in vision, speech recognition, and robotics. Let us take the example of recognizing faces: This is a task we do effortlessly; every day we recognize family members and friends by looking at their faces or from their photographs, despite differences in pose, lighting, hair style, and so forth. But we do it unconsciously and are unable to explain how we do it. Because we are not able to explain our expertise, we cannot write the computer program. At the same time, we know that a face image is not just a random collection of pixels; a face has structure. It is symmetric. There are the eyes, the nose, the mouth, located in certain places on the face. Each person's face is a pattern composed of a particular combination of these. By analyzing sample face images of a person, a learning program captures the pattern specific to that person and then recognizes by checking for this pattern in a given image. This is one example of pattern recognition.

Machine learning is programming computers to optimize a performance criterion using example data or past experience. We have a model defined up to some parameters, and learning is the execution of a computer program to optimize the parameters of the model using the training data or past experience.
The model may be predictive to make predictions in the future, or descriptive to gain knowledge from data, or both.

Machine learning uses the theory of statistics in building mathematical models, because the core task is making inference from a sample. The role of computer science is twofold: First, in training, we need efficient algorithms to solve the optimization problem, as well as to store and process the massive amount of data we generally have. Second, once a model is learned, its representation and algorithmic solution for inference needs to be efficient as well. In certain applications, the efficiency of the learning or inference algorithm, namely, its space and time complexity, may be as important as its predictive accuracy.

Let us now discuss some example applications in more detail to gain more insight into the types and uses of machine learning.

1.2 Examples of Machine Learning Applications

1.2.1 Learning Associations

In the case of retail—for example, a supermarket chain—one application of machine learning is basket analysis, which is finding associations between products bought by customers: If people who buy X typically also buy Y, and if there is a customer who buys X and does not buy Y, he or she is a potential Y customer. Once we find such customers, we can target them for cross-selling.

In finding an association rule, we are interested in learning a conditional probability of the form P(Y|X) where Y is the product we would like to condition on X, which is the product or the set of products which we know that the customer has already purchased.

Let us say, going over our data, we calculate that P(chips|beer) = 0.7. Then, we can define the rule: 70 percent of customers who buy beer also buy chips.

We may want to make a distinction among customers and toward this, estimate P(Y|X,D) where D is the set of customer attributes, for example, gender, age, marital status, and so on, assuming that we have access to this information. If this is a bookseller instead of a supermarket, products can be books or authors. In the case of a Web portal, items correspond to links to Web pages, and we can estimate the links a user is likely to click and use this information to download such pages in advance for faster access.
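As a concrete illustration, the following short Python sketch estimates such a rule confidence, P(Y|X), from a handful of market baskets. The basket contents and product names here are made up for the example, not data from any real store.

```python
# A minimal sketch of estimating the rule confidence P(Y|X) from
# transaction data. The baskets below are hypothetical; in practice they
# would come from the point-of-sale records described above.
baskets = [
    {"beer", "chips", "diapers"},
    {"beer", "chips"},
    {"beer", "soda"},
    {"chips", "salsa"},
    {"beer", "chips", "salsa"},
]

def confidence(x, y, transactions):
    """Estimate P(y | x): the fraction of baskets containing x that also contain y."""
    containing_x = [t for t in transactions if x in t]
    if not containing_x:
        return 0.0
    return sum(y in t for t in containing_x) / len(containing_x)

print(confidence("beer", "chips", baskets))  # 0.75 for this toy data
```

With a real transaction log in place of the toy list, the same computation gives estimates of the P(chips|beer) = 0.7 kind used in the rule above.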
1.2.2 Classification

A credit is an amount of money loaned by a financial institution, for example, a bank, to be paid back with interest, generally in installments. It is important for the bank to be able to predict in advance the risk associated with a loan, which is the probability that the customer will default and not pay the whole amount back. This is both to make sure that the bank will make a profit and also to not inconvenience a customer with a loan over his or her financial capacity.

In credit scoring (Hand 1998), the bank calculates the risk given the amount of credit and the information about the customer. The information about the customer includes data we have access to and is relevant in calculating his or her financial capacity—namely, income, savings, collaterals, profession, age, past financial history, and so forth. The bank has a record of past loans containing such customer data and whether the loan was paid back or not. From this data of particular applications, the aim is to infer a general rule coding the association between a customer's attributes and his risk. That is, the machine learning system fits a model to the past data to be able to calculate the risk for a new application and then decides to accept or refuse it accordingly.

This is an example of a classification problem where there are two classes: low-risk and high-risk customers. The information about a customer makes up the input to the classifier whose task is to assign the input to one of the two classes.

Figure 1.1 Example of a training dataset where each circle corresponds to one data instance with input values in the corresponding axes and its sign indicates the class. For simplicity, only two customer attributes, income and savings, are taken as input and the two classes are low-risk ('+') and high-risk ('-'). An example discriminant that separates the two types of examples is also shown.

After training with the past data, a classification rule learned may be of the form

IF income > θ₁ AND savings > θ₂ THEN low-risk ELSE high-risk

for suitable values of θ₁ and θ₂ (see figure 1.1). This is an example of a discriminant; it is a function that separates the examples of different classes.

Having a rule like this, the main application is prediction: Once we have a rule that fits the past data, if the future is similar to the past, then we can make correct predictions for novel instances. Given a new application with a certain income and savings, we can easily decide whether it is low-risk or high-risk.

In some cases, instead of making a 0/1 (low-risk/high-risk) type decision, we may want to calculate a probability, namely, P(Y|X), where X are the customer attributes and Y is 0 or 1 respectively for low-risk and high-risk. From this perspective, we can see classification as learning an association from X to Y. Then for a given X = x, if we have P(Y = 1|X = x) = 0.8, we say that the customer has an 80 percent probability of being high-risk, or equivalently a 20 percent probability of being low-risk. We then decide whether to accept or refuse the loan depending on the possible gain and loss.
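To make the rule concrete, it can be written down directly as a small Python function. The threshold values θ₁ and θ₂ and the applicant records below are invented for the sake of the example; in a real system the thresholds would be estimated from the bank's record of past loans.

```python
# A sketch of the learned discriminant described above. The thresholds
# THETA1 and THETA2 and the applicant records are hypothetical; a real
# system would fit them to the bank's past loan data.
THETA1 = 30_000   # income threshold (assumed value)
THETA2 = 10_000   # savings threshold (assumed value)

def classify(income, savings):
    """IF income > theta1 AND savings > theta2 THEN low-risk ELSE high-risk."""
    return "low-risk" if income > THETA1 and savings > THETA2 else "high-risk"

applicants = [(45_000, 20_000), (25_000, 40_000), (60_000, 5_000)]
for income, savings in applicants:
    print(income, savings, classify(income, savings))
```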
There are many applications of machine learning in pattern recognition. One is optical character recognition, which is recognizing character codes from their images. This is an example where there are multiple classes, as many as there are characters we would like to recognize. Especially interesting is the case when the characters are handwritten. People have different handwriting styles; characters may be written small or large, slanted, with a pen or pencil, and there are many possible images corresponding to the same character. Though writing is a human invention, we do not have any system that is as accurate as a human reader. We do not have a formal description of 'A' that covers all 'A's and none of the non-'A's. Not having it, we take samples from writers and learn a definition of A-ness from these examples. But though we do not know what it is that makes an image an 'A', we are certain that all those distinct 'A's have something in common, which is what we want to extract from the examples. We know that a character image is not just a collection of random dots; it is a collection of strokes and has a regularity that we can capture by a learning program.

If we are reading a text, one factor we can make use of is the redundancy in human languages. A word is a sequence of characters and successive characters are not independent but are constrained by the words of the language. This has the advantage that even if we cannot recognize a character, we can still read the word. Such contextual dependencies may also occur in higher levels, between words and sentences, through the syntax and semantics of the language. There are machine learning algorithms to learn sequences and model such dependencies.

In the case of face recognition, the input is an image, the classes are people to be recognized, and the learning program should learn to associate the face images to identities. This problem is more difficult than optical character recognition because there are more classes, the input image is larger, and a face is three-dimensional and differences in pose and lighting cause significant changes in the image. There may also be occlusion of certain inputs; for example, glasses may hide the eyes and eyebrows, and a beard may hide the chin.

In medical diagnosis, the inputs are the relevant information we have about the patient and the classes are the illnesses. The inputs contain the patient's age, gender, past medical history, and current symptoms. Some tests may not have been applied to the patient, and thus these inputs would be missing. Tests take time, may be costly, and may inconvenience the patient, so we do not want to apply them unless we believe that they will give us valuable information. In the case of a medical diagnosis, a wrong decision may lead to a wrong or no treatment, and in cases of doubt it is preferable that the classifier reject and defer decision to a human expert.

In speech recognition, the input is acoustic and the classes are words that can be uttered. This time the association to be learned is from an acoustic signal to a word of some language. Different people, because of differences in age, gender, or accent, pronounce the same word differently, which makes this task rather difficult. Another difference of speech is that the input is temporal; words are uttered in time as a sequence of speech phonemes and some words are longer than others. A recent approach in speech recognition involves the use of lip movements as recorded by a camera as a second source of information in recognizing speech. This requires sensor fusion, which is the integration of inputs from different modalities, namely, acoustic and visual.

Learning a rule from data also allows knowledge extraction. The rule is a simple model that explains the data, and looking at this model we have an explanation about the process underlying the data. For example, once we learn the discriminant separating low-risk and high-risk customers, we have the knowledge of the properties of low-risk customers. We can then use this information to target potential low-risk customers more efficiently, for example, through advertising.

Learning also performs compression in that by fitting a rule to the data, we get an explanation that is simpler than the data, requiring less memory to store and less computation to process. Once you have the rules of addition, you do not need to remember the sum of every possible pair of numbers.

Another use of machine learning is outlier detection, which is finding the instances that do not obey the rule and are exceptions.
In this case, after learning the rule, we are not interested in the rule but the exceptions not covered by the rule, which may imply anomalies requiring attention—for example, fraud.

1.2.3 Regression

Let us say we want to have a system that can predict the price of a used car. Inputs are the car attributes—brand, year, engine capacity, mileage, and other information—that we believe affect a car's worth. The output is the price of the car. Such problems where the output is a number are regression problems.

Let X denote the car attributes and Y be the price of the car. Again surveying the past transactions, we can collect training data, and the machine learning program fits a function to this data to learn Y as a function of X. An example is given in figure 1.2, where the fitted function is of the form

y = wx + w₀

for suitable values of w and w₀.

Figure 1.2 A training dataset of used cars and the function fitted. For simplicity, mileage is taken as the only input attribute and a linear model is used.

Both regression and classification are supervised learning problems where there is an input, X, an output, Y, and the task is to learn the mapping from the input to the output. The approach in machine learning is that we assume a model defined up to a set of parameters:

y = g(x|θ)

where g(·) is the model and θ are its parameters. Y is a number in regression and is a class code (e.g., 0/1) in the case of classification. g(·) is the regression function, or in classification, it is the discriminant function separating the instances of different classes. The machine learning program optimizes the parameters, θ, such that the approximation error is minimized, that is, our estimates are as close as possible to the correct values given in the training set. For example in figure 1.2, the model is linear and w and w₀ are the parameters optimized for best fit to the training data. In cases where the linear model is too restrictive, one can use for example a quadratic

y = w₂x² + w₁x + w₀

or a higher-order polynomial, or any other nonlinear function of the input, this time optimizing its parameters for best fit.

Another example of regression is navigation of a mobile robot, for example, an autonomous car, where the output is the angle by which the steering wheel should be turned at each time, to advance without hitting obstacles and deviating from the route. Inputs in such a case are provided by sensors on the car, for example, a video camera, GPS, and so forth. Training data can be collected by monitoring and recording the actions of a human driver.

One can envisage other applications of regression where one is trying to optimize a function.¹ Let us say we want to build a machine that roasts coffee. The machine has many inputs that affect the quality: various settings of temperatures, times, coffee bean type, and so forth. We make a number of experiments and for different settings of these inputs, we measure the quality of the coffee, for example, as consumer satisfaction. To find the optimal setting, we fit a regression model linking these inputs to coffee quality and choose new points to sample near the optimum of the current model to look for a better configuration. We sample these points, check quality, and add these to the data and fit a new model. This is generally called response surface design.

1. I would like to thank Michael Jordan for this example.
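As a concrete sketch, the linear model above can be fitted by least squares in a few lines of Python. The mileage and price figures are made up for illustration, and numpy.polyfit is used here simply as one convenient way to compute w and w₀ from the data.

```python
# A minimal sketch of fitting the linear model y = w*x + w0 by least
# squares, as in figure 1.2. The mileage/price pairs are hypothetical;
# a real system would use records of past sales.
import numpy as np

mileage = np.array([20_000, 45_000, 60_000, 90_000, 120_000], dtype=float)
price   = np.array([18_500, 15_200, 13_900, 10_400,   8_100], dtype=float)

w, w0 = np.polyfit(mileage, price, deg=1)   # slope and intercept
print(f"price = {w:.4f} * mileage + {w0:.1f}")

# Predicted price of a car with 75,000 in mileage
print(np.polyval([w, w0], 75_000))
```

Passing deg=2 instead would fit the quadratic y = w₂x² + w₁x + w₀ in exactly the same way.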
1.2.4 Unsupervised Learning

In supervised learning, the aim is to learn a mapping from the input to an output whose correct values are provided by a supervisor. In unsupervised learning, there is no such supervisor and we only have input data. The aim is to find the regularities in the input. There is a structure to the input space such that certain patterns occur more often than others, and we want to see what generally happens and what does not. In statistics, this is called density estimation.

One method for density estimation is clustering, where the aim is to find clusters or groupings of input. In the case of a company with data on its past customers, the customer data contains the demographic information as well as the past transactions with the company, and the company may want to see the distribution of the profile of its customers, to see what type of customers frequently occur. In such a case, a clustering model allocates customers similar in their attributes to the same group, providing the company with natural groupings of its customers. Once such groups are found, the company may decide strategies, for example, services and products, specific to different groups. Such a grouping also allows identifying those who are outliers, namely, those who are different from other customers, which may imply a niche in the market that can be further exploited by the company.

An interesting application of clustering is in image compression. In this case, the input instances are image pixels represented as RGB values. A clustering program groups pixels with similar colors in the same group, and such groups correspond to the colors occurring frequently in the image. If in an image there are only shades of a small number of colors, and if we code those belonging to the same group with one color, for example, their average, then the image is quantized. Let us say the pixels are 24 bits to represent 16 million colors, but if there are shades of only 64 main colors, for each pixel we need 6 bits instead of 24. For example, if the scene has various shades of blue in different parts of the image, and if we use the same average blue for all of them, we lose the details in the image but gain space in storage and transmission. Ideally, one would like to identify higher-level regularities by analyzing repeated image patterns, for example, texture, objects, and so forth. This allows a higher-level, simpler, and more useful description of the scene, and for example, achieves better compression than compressing at the pixel level. If we have scanned document pages, we do not have random on/off pixels but bitmap images of characters. There is structure in the data, and we make use of this redundancy by finding a shorter description of the data: a 16 × 16 bitmap of 'A' takes 32 bytes; its ASCII code is only 1 byte.
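A minimal sketch of this color quantization is given below, assuming an H × W × 3 array of RGB values; scikit-learn's k-means is used here as one possible off-the-shelf clustering implementation (the k-means algorithm itself is the subject of chapter 7), and the random image stands in for real data.

```python
# A sketch of color quantization by clustering. `image` is assumed to be
# an H x W x 3 array of RGB values; here a random array stands in for a
# real photograph.
import numpy as np
from sklearn.cluster import KMeans

image = np.random.randint(0, 256, size=(64, 64, 3))
pixels = image.reshape(-1, 3).astype(float)

kmeans = KMeans(n_clusters=64, n_init=10, random_state=0).fit(pixels)

# Replace every pixel by the mean color of its group: an index into
# 64 colors needs 6 bits per pixel instead of 24.
quantized = kmeans.cluster_centers_[kmeans.labels_].reshape(image.shape).astype(np.uint8)
```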
Machine learning methods are also used in bioinformatics. DNA in our genome is the "blueprint of life" and is a sequence of bases, namely, A, G, C, and T. RNA is transcribed from DNA, and proteins are translated from the RNA. Proteins are what the living body is and does. Just as a DNA is a sequence of bases, a protein is a sequence of amino acids (as defined by bases). One application area of computer science in molecular biology is alignment, which is matching one sequence to another. This is a difficult string matching problem because strings may be quite long, there are many template strings to match against, and there may be deletions, insertions, and substitutions. Clustering is used in learning motifs, which are sequences of amino acids that occur repeatedly in proteins. Motifs are of interest because they may correspond to structural or functional elements within the sequences they characterize. The analogy is that if the amino acids are letters and proteins are sentences, motifs are like words, namely, a string of letters with a particular meaning occurring frequently in different sentences.

1.2.5 Reinforcement Learning

In some applications, the output of the system is a sequence of actions. In such a case, a single action is not important; what is important is the policy, that is, the sequence of correct actions to reach the goal. There is no such thing as the best action in any intermediate state; an action is good if it is part of a good policy. In such a case, the machine learning program should be able to assess the goodness of policies and learn from past good action sequences to be able to generate a policy. Such learning methods are called reinforcement learning algorithms.

A good example is game playing, where a single move by itself is not that important; it is the sequence of right moves that is good. A move is good if it is part of a good game playing policy. Game playing is an important research area in both artificial intelligence and machine learning. This is because games are easy to describe and at the same time, they are quite difficult to play well. A game like chess has a small number of rules but it is very complex because of the large number of possible moves at each state and the large number of moves that a game contains. Once
