UNIT II : Supervised Learning

Syllabus
Linear Regression Models : Least squares, single and multiple variables, Bayesian linear regression, gradient descent. Linear Classification Models : Discriminant function - Perceptron algorithm, Probabilistic discriminative model - Logistic regression, Probabilistic generative model - Naive Bayes, Maximum margin classifier - Support vector machine, Decision Tree, Random Forests.

Contents
2.1 Regression
2.2 Linear Classification Models
2.3 Probabilistic Generative Model
2.4 Maximum Margin Classifier : Support Vector Machine
2.5 Decision Tree
2.6 Random Forests
2.7 Two Marks Questions with Answers

2.1 Regression
• Regression finds correlations between dependent and independent variables. If the desired output consists of one or more continuous variables, the task is called regression.
• Therefore, regression algorithms help predict continuous variables such as house prices, market trends, weather patterns, oil and gas prices, etc.
Fig. 2.1.1 Regression (dependent variable plotted against the independent variable)
• When the targets in a dataset are real numbers, the machine learning task is known as regression and each sample in the dataset has a real-valued output or target.
• Regression analysis is a set of statistical methods used for the estimation of relationships between a dependent variable and one or more independent variables. It can be utilized to assess the strength of the relationship between variables and for modelling the future relationship between them.
• The two basic types of regression are linear regression and multiple linear regression.

Linear Regression Models
• Linear regression is a statistical method that allows us to summarize and study relationships between two continuous (quantitative) variables :
1. One variable, denoted x, is regarded as the predictor, explanatory or independent variable.
2. The other variable, denoted y, is regarded as the response, outcome or dependent variable.
• The objective of a linear regression model is to find a relationship between the input variables and a target variable.
• Regression models predict a continuous variable, such as the sales made on a day or the temperature of a city.
• Imagine that we fit a line with the training points we have. If we want to add another data point, then to fit it we would need to change the existing model. This happens with each data point added to the model; hence, linear regression is not well suited to classification problems.
• The regression line gives the average relationship between the two variables in mathematical form. For two variables X and Y, there are always two lines of regression.
• Regression line of X on Y : gives the best estimate for the value of X for any specific given value of Y :
X = a + bY
where a is the X-intercept, b is the slope of the line, X is the dependent variable and Y is the independent variable.
• Regression line of Y on X : gives the best estimate for the value of Y for any specific given value of X :
Y = a + bX
where a is the Y-intercept, b is the slope of the line, Y is the dependent variable and X is the independent variable.
• By using the least squares method (a procedure that minimizes the vertical deviations of plotted points surrounding a straight line) we are able to construct a best-fitting straight line to the scatter diagram points and then formulate a regression equation of the form :
ŷ = a + bx
• Equivalently, since the fitted line passes through the point of means, the equation can be written as ŷ = ȳ + b(x − x̄).
Fig. 2.1.2 The regression equation with its bias term (intercept), weights and input vector
• Regression analysis is the art and science of fitting straight lines to patterns of data. In a linear regression model, the variable of interest (the "dependent" variable) is predicted from k other variables (the "independent" variables) using a linear equation. If Y denotes the dependent variable and X1, ..., Xk are the independent variables, then the assumption is that the value of Y at time t in the data sample is determined by the linear equation
Yt = β0 + β1 X1t + β2 X2t + ... + βk Xkt + εt
where the betas are constants and the epsilons are independent and identically distributed normal random variables with mean zero.
• At each split point, the "error" between the predicted value and the actual values is squared to get a Sum of Squared Errors (SSE). The split point errors across the variables are compared and the variable/point yielding the lowest SSE is chosen as the root node/split point. This process is continued recursively.
• The error function measures how much our predictions deviate from the desired answers. The mean-squared error is
Jn = (1/n) Σ_{i=1..n} (yi − f(xi))²
Advantages :
a. Training a linear regression model is usually much faster than methods such as neural networks.
b. Linear regression models are simple and require minimum memory to implement.
c. By examining the magnitude and sign of the regression coefficients you can infer how predictor variables affect the target outcome.

Least Squares
• The method of least squares is about estimating parameters by minimizing the squared discrepancies between observed data, on the one hand, and their expected values on the other.
• Consider an arbitrary straight line, ŷ = b0 + b1 x, to be fitted through the data points (Fig. 2.1.3). The question is : which line is the most representative? What are the values of b0 and b1 such that the resulting line "best" fits the data points? The vertical deviation yi − ŷi is the error (residual).
• But what goodness-of-fit criterion should be used to decide among all possible combinations of b0 and b1?
• The Least Squares (LS) criterion states that the sum of the squares of the errors must be minimum. The least-squares solution yields outputs y(x) whose elements sum to 1, but it does not ensure that the outputs lie in the range [0, 1].
• How is such a line drawn from the observed data points? Suppose an imaginary line y = a + bx. Imagine a vertical distance (error) between the line and a data point, E = Y − E(Y), where E(Y) = a + bX (Fig. 2.1.4). This error is the deviation of the data point from the imaginary line, the regression line. The best values of a and b are then those that minimize the sum of such errors.
• Deviations by themselves do not have good properties for computation, which is why squares of deviations are used : we find the a and b that minimize the sum of squared deviations rather than the sum of deviations. This method is called least squares.
• The least squares method minimizes the sum of squares of errors. Such a and b are called least squares estimators, i.e., estimators of the parameters α and β. The process of obtaining parameter estimators (e.g., a and b) is called estimation. The least squares method is the estimation method of Ordinary Least Squares (OLS).
Disadvantages of least squares :
1. Lack of robustness to outliers.
2. Certain datasets are unsuitable for least squares classification.
3. The decision boundary corresponds to the maximum likelihood (ML) solution.
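To make the least squares procedure concrete, the following is a minimal sketch (not part of the original text) that fits ŷ = a + bx to a small set of made-up observations using the closed-form estimators b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄, then reports the SSE and the mean-squared error defined above. The data values are purely illustrative.

```python
import numpy as np

# Hypothetical paired observations; any (x, y) data would do.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.7, 4.2, 5.1])

# Least-squares estimators for y-hat = a + b*x
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

y_hat = a + b * x
sse = np.sum((y - y_hat) ** 2)    # sum of squared errors
mse = np.mean((y - y_hat) ** 2)   # mean-squared error Jn from the text
print(f"a = {a:.3f}, b = {b:.3f}, SSE = {sse:.4f}, MSE = {mse:.4f}")
```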
Example 2.1.1 : Fit a straight line to the points in the table. Compute m and b by least squares.

Point | x | y
A | 3.00 | 4.50
B | 4.25 | 4.25
C | 5.50 | 5.50
D | 8.00 | 5.50

Solution : Represent the data in matrix form, with design matrix A and observation vector L :
A = [3.00 1 ; 4.25 1 ; 5.50 1 ; 8.00 1],  L = [4.50, 4.25, 5.50, 5.50]ᵀ
The least squares solution is
X = (AᵀA)⁻¹ AᵀL = [121.3125 20.7500 ; 20.7500 4.0000]⁻¹ [105.8125, 19.7500]ᵀ = [0.246, 3.663]ᵀ
so m = 0.246 and b = 3.663. The residual vector is
V = AX − L = [−0.10, 0.46, −0.48, 0.13]ᵀ
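The worked example above can be checked numerically. The sketch below (assuming NumPy is available) forms the design matrix A, solves the normal equations X = (AᵀA)⁻¹AᵀL and prints the residual vector V = AX − L; it reproduces m ≈ 0.246 and b ≈ 3.663, and residuals matching the vector above up to rounding.

```python
import numpy as np

# Data points A-D from Example 2.1.1
x = np.array([3.00, 4.25, 5.50, 8.00])
L = np.array([4.50, 4.25, 5.50, 5.50])       # observed y values

A = np.column_stack([x, np.ones_like(x)])    # design matrix [x 1]
X = np.linalg.solve(A.T @ A, A.T @ L)        # X = (A^T A)^(-1) A^T L
m, b = X
V = A @ X - L                                # residual vector V = AX - L

print(f"m = {m:.3f}, b = {b:.3f}")           # approx. 0.246 and 3.663
print("residuals:", np.round(V, 2))          # approx. [-0.10, 0.46, -0.49, 0.13]
```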
Multiple Regression
• Regression analysis is used to predict the value of one or more responses from a set of predictors. It can also be used to estimate the linear association between the predictors and responses. Predictors can be continuous, categorical or a mixture of both.
• If multiple independent variables affect the response variable, then the analysis calls for a model different from that used for a single predictor variable. In a situation where more than one independent factor (variable) affects the outcome of a process, a multiple regression model is used. This is referred to as a multiple linear regression model or multivariate least squares fitting.
• Let Z1, Z2, ..., Zr be a set of r predictors believed to be related to a response variable Y. The linear regression model for the jth sample unit has the form
Yj = β0 + β1 Zj1 + β2 Zj2 + ... + βr Zjr + εj
where εj is a random error and the βi, i = 0, 1, ..., r, are unknown regression coefficients.
• With n independent observations we can write one model for each sample unit, or stack them compactly as
Y = Z β + ε
where Y is n × 1, Z is n × (r + 1), β is (r + 1) × 1 and ε is n × 1.
• In order to estimate β, we take a least squares approach that is analogous to what we did in the simple linear regression case.
• In matrix form, the data are arranged so that each row of the design matrix is (1, xi1, xi2, ..., xiK), the response vector is y = (y1, ..., yN)ᵀ and the coefficient vector is β̂ = (β̂0, β̂1, ..., β̂K)ᵀ, where the β̂j are the estimates of the regression coefficients βj.

Difference between simple regression and multiple regression :

Simple regression | Multiple regression
One dependent variable Y predicted from one independent variable X | One dependent variable Y predicted from a set of independent variables (X1, X2, ..., Xk)
One regression coefficient | One regression coefficient for each independent variable
r² : proportion of variation in the dependent variable Y predictable from X | R² : proportion of variation in the dependent variable Y predictable from the set of independent variables (the X's)

Bayesian Linear Regression
• Bayesian linear regression provides a useful mechanism to deal with insufficient data or poorly distributed data. It allows the user to put a prior on the coefficients and on the noise, so that in the absence of data the priors can take over. A prior is a distribution on a parameter.
• If we could flip a coin an infinite number of times, inferring its bias would be easy by the law of large numbers. However, what if we could only flip the coin a handful of times? Would we conclude that a coin is biased if we saw three heads in three flips, an event that happens one time out of eight with unbiased coins? Maximum likelihood estimation would overfit these data, inferring a coin bias of p = 1.
• A Bayesian approach avoids overfitting by quantifying our prior knowledge that most coins are unbiased : the prior on the bias parameter is peaked around one-half. The data must overwhelm this prior belief about coins.
• Bayesian methods allow us to estimate model parameters, to calculate forecasts and to conduct model comparisons and hypothesis tests.
• Bayesian classifiers use a simple idea : the training data are utilized to calculate an observed probability of each class based on feature values. When a Bayesian classifier is applied to unclassified data, it uses these observed probabilities to predict the most likely class for the new features.
• Each observed training example can incrementally decrease or increase the estimated probability that a hypothesis is correct.
• Prior knowledge can be combined with observed data to determine the final probability of a hypothesis. In Bayesian learning, prior knowledge is provided by asserting a prior probability for each candidate hypothesis and a probability distribution over observed data for each possible hypothesis.
• Bayesian methods can accommodate hypotheses that make probabilistic predictions. New instances can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities. Even in cases where Bayesian methods prove computationally intractable, they can provide a standard of optimal decision making against which other practical methods can be measured.
• Uses of Bayesian classifiers are as follows :
1. Text-based classification, such as spam or junk mail filtering.
2. Medical diagnosis.
3. Network security, such as detecting illegal intrusion.
• The basic procedure for implementing Bayesian linear regression is :
i) Specify priors for the model parameters.
ii) Create a model mapping the training inputs to the training outputs.
iii) Have a Markov Chain Monte Carlo (MCMC) algorithm draw samples from the posterior distributions for the parameters.

Gradient Descent
• Goal : solving nonlinear minimization problems through derivative information.
• The first and second derivatives of the objective function or the constraints play an important role in optimization. The first order derivatives are called the gradient and the second order derivatives form the Hessian matrix. Derivative based optimization is also called nonlinear optimization; it is capable of determining search directions according to an objective function's derivative information.
• Derivative based optimization methods are used for :
1. Optimization of nonlinear neuro-fuzzy models
2. Neural network learning
3. Regression analysis in nonlinear models
• Basic descent methods are as follows :
1. Steepest descent
2. Newton-Raphson method
• Gradient descent is a first-order optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of the function at the current point.
• Gradient descent is popular for very large-scale optimization problems because it is easy to implement, can handle black box functions, and each iteration is cheap.
• Given a differentiable scalar field f(x) and an initial guess x1, gradient descent iteratively moves the guess toward lower values of f by taking steps in the direction of the negative gradient −∇f(x).
• Locally, the negated gradient is the steepest descent direction, i.e., the direction in which x would need to move in order to decrease f the fastest. The algorithm typically converges to a local minimum, but may rarely reach a saddle point, or not move at all if the current point lies at a local maximum.
• The gradient gives the slope of the curve at the current x, and its direction points toward an increase in the function, so we change x in the opposite direction to lower the function value :
x_{k+1} = x_k − λ ∇f(x_k)
where λ > 0 is a small number that forces the algorithm to make small jumps.
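A minimal sketch of the update rule x_{k+1} = x_k − λ∇f(x_k) is shown below. The objective function, step size and starting point are made up for illustration; in practice λ and the stopping rule need problem-specific tuning.

```python
import numpy as np

def gradient_descent(grad_f, x0, lam=0.1, n_iter=100):
    """Minimise f by repeatedly stepping against its gradient:
    x_{k+1} = x_k - lam * grad_f(x_k), with a small step size lam > 0."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x - lam * grad_f(x)
    return x

# Toy objective f(x, y) = (x - 3)^2 + 2*(y + 1)^2, with gradient [2(x - 3), 4(y + 1)]
grad = lambda v: np.array([2 * (v[0] - 3), 4 * (v[1] + 1)])
print(gradient_descent(grad, x0=[0.0, 0.0]))   # converges towards the minimiser [3, -1]
```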
Limitations of gradient descent :
• Gradient descent is relatively slow close to the minimum : technically, its asymptotic rate of convergence is inferior to that of many other methods.
• For poorly conditioned convex problems, gradient descent increasingly "zigzags" as the gradients point nearly orthogonally to the shortest direction to the minimum point.

Steepest Descent
• Steepest descent is also known as the gradient method.
• This method is based on a first order Taylor series approximation of the objective function. It is also called the saddle point method.
Fig. 2.1.5 Steepest descent method
• Steepest descent is the simplest of the gradient methods. The chosen direction is the one in which f decreases most quickly, which is the direction opposite to ∇f(x). The search starts at an arbitrary point x0 and then proceeds down the gradient until it gets close to the minimum.
• The method of steepest descent is the discrete analogue of gradient descent, but it uses a local minimization rather than computing a gradient. It is typically able to converge in few steps, but it is unable to escape local minima or plateaus in the objective function.
• The gradient is everywhere perpendicular to the contour lines. After each line minimization the new gradient is always orthogonal to the previous step direction. Consequently, the iterates tend to zig-zag down the valley in a very inefficient manner.
• The method of steepest descent is simple, easy to apply, and each iteration is fast. It is also very stable; if minimum points exist, the method is guaranteed to locate them, though possibly only after an infinite number of iterations.

2.2 Linear Classification Models
• A classification algorithm (classifier) of this kind makes its classification based on a linear predictor function combining a set of weights with the feature vector.
• A linear classifier makes its classification decision based on the value of a linear combination of the characteristics. Imagine that the linear classifier merges into its weights all the characteristics that define a particular class.
• Linear classifiers can represent a lot of things, but they cannot represent everything. The classic example of what they cannot represent is the XOR function.

Discriminant Function
• Linear Discriminant Analysis (LDA) is the most commonly used dimensionality reduction technique in supervised learning. Basically, it is a preprocessing step for pattern classification and machine learning applications.
• LDA is a powerful algorithm that can be used to determine the best separation between two or more classes.
• LDA is a supervised learning algorithm, which means that it requires a labelled training set of data points in order to learn the linear discriminant function.
• The main purpose of LDA is to find the line or plane that best separates data points belonging to different classes. The key idea behind LDA is that the decision boundary should be chosen such that it maximizes the distance between the means of the two classes while simultaneously minimizing the variance within each class's data (the within-class scatter). This criterion is known as the Fisher criterion.
• LDA is one of the most widely used machine learning algorithms due to its accuracy and flexibility. It can be used for a variety of tasks such as classification, dimensionality reduction and feature selection.
• Suppose we have two classes and we need to classify them efficiently; Fig. 2.2.1 shows how the classes are divided before and after applying LDA.
Fig. 2.2.1 LDA (before LDA and after LDA)
• The LDA algorithm works through the following steps :
a) The first step is to calculate the mean and standard deviation of each feature.
b) The within-class scatter matrix and the between-class scatter matrix are calculated.
c) These matrices are then used to calculate the eigenvectors and eigenvalues.
d) LDA chooses the k eigenvectors with the largest eigenvalues to form a transformation matrix.
e) LDA uses this transformation matrix to transform the data into a new space with k dimensions.
f) Once the data are transformed into the new k-dimensional space, LDA can be used for classification or dimensionality reduction.
Benefits of using LDA :
a) LDA is used for classification problems.
b) LDA is a powerful tool for dimensionality reduction.
c) LDA is not susceptible to the "curse of dimensionality" like many other machine learning algorithms.

Logistic Regression
• Logistic regression is a form of regression analysis in which the outcome variable is binary or dichotomous. It is a statistical method used to model dichotomous or binary outcomes using predictor variables.
• Logistic component : Instead of modelling the outcome Y directly, the method models the log odds of Y using the logistic function.
• Regression component : Methods used to quantify the association between an outcome and predictor variables. It can be used to build predictive models as a function of predictors.
• Simple logistic regression is logistic regression with one predictor variable. The logistic regression model is
ln( P(Y) / (1 − P(Y)) ) = β0 + β1 X1 + β2 X2 + ... + βk Xk
• With logistic regression, the response variable is an indicator of some characteristic, that is, a 0/1 variable. Logistic regression is used to determine whether other measurements are related to the presence of some characteristic, for example, whether certain blood measures are predictive of having a disease.
• If analysis of covariance can be said to be a t test adjusted for other variables, then logistic regression can be thought of as a chi-square test for homogeneity of proportions adjusted for other variables.
• While the response variable in a logistic regression is a 0/1 variable, the logistic regression equation, which is a linear equation, does not predict the 0/1 variable itself.
Fig. 2.2.2 Sigmoid curve for logistic regression (linear fit versus logistic fit)
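The sigmoid curve of Fig. 2.2.2 is easy to reproduce. The short sketch below evaluates the logistic function for a hypothetical pair of coefficients b0, b1 and confirms that the log odds recovered from the predicted probabilities equal the linear predictor b0 + b1·x; the coefficient values are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping the linear predictor to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

b0, b1 = -4.0, 0.8                  # hypothetical coefficients for one predictor x
x = np.array([1.0, 5.0, 10.0])
p = sigmoid(b0 + b1 * x)            # P(Y = 1 | x)
odds = p / (1 - p)                  # ln(odds) should equal b0 + b1*x
print(np.round(p, 3), np.round(np.log(odds), 3))
```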
• The linear and logistic probability models are :
Linear regression :  p = a0 + a1 X1 + a2 X2 + ... + ak Xk
Logistic regression :  ln[ p / (1 − p) ] = b0 + b1 X1 + b2 X2 + ... + bk Xk
• The linear model assumes that the probability p is a linear function of the regressors, while the logistic model assumes that the natural log of the odds p/(1 − p) is a linear function of the regressors.
• The major advantage of the linear model is its interpretability. In the linear model, if a1 is 0.05, a one-unit increase in X1 is associated with a 5 percentage-point increase in the probability that Y is 1.
• The logistic model is less interpretable. In the logistic model, if b1 is 0.05, a one-unit increase in X1 is associated with a 0.05 increase in the log odds that Y is 1. And what does that mean? Few people have any intuition for log odds.

2.3 Probabilistic Generative Model
• Generative models are a class of statistical models that can generate new data instances. These models are used in unsupervised machine learning to perform tasks such as probability and likelihood estimation, modelling data points, and distinguishing between classes using these probabilities.
• Generative models rely on Bayes' theorem to find the joint probability. They describe how data are generated using probabilistic models, and they predict P(y|x), the probability of y given x, by way of P(x, y), the joint probability of x and y.

Naive Bayes
• Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong independence assumptions between the features. The approach is highly scalable, requiring a number of parameters linear in the number of variables (features/predictors) in a learning problem.
• A Naive Bayes classifier is a program which predicts a class value given a set of attributes.
• For each known class value :
1. Calculate probabilities for each attribute, conditional on the class value.
2. Use the product rule to obtain a joint conditional probability for the attributes.
3. Use Bayes' rule to derive conditional probabilities for the class variable.
• Once this has been done for all class values, output the class with the highest probability.
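As an illustration of these steps, the following sketch trains a Gaussian Naive Bayes classifier (one common variant, here via scikit-learn) on a tiny made-up two-class dataset and prints the predicted class and the posterior class probabilities for a new point. The data and feature meanings are assumptions made purely for the example.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Tiny hypothetical dataset: two numeric features, two classes (0 = ham, 1 = spam)
X = np.array([[1.0, 0.5], [1.2, 0.7], [0.9, 0.4],
              [3.1, 2.8], [2.9, 3.2], [3.3, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = GaussianNB().fit(X, y)              # estimates per-class feature distributions
print(model.predict([[1.1, 0.6]]))           # most probable class for a new point
print(model.predict_proba([[1.1, 0.6]]))     # posterior P(class | features)
```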
Conditional Probability
• Let A and B be two events such that P(A) > 0. We denote by P(B|A) the probability of B given that A has occurred. Since A is known to have occurred, it becomes the new sample space, replacing the original S. From this, the definition is
P(B|A) = P(A ∩ B) / P(A)   or equivalently   P(A ∩ B) = P(A) P(B|A)
• The notation P(B|A) is read "the probability of event B given event A". It is the probability of an event B given the occurrence of the event A.
• We say that the probability that both A and B occur is equal to the probability that A occurs times the probability that B occurs given that A has occurred. We call P(B|A) the conditional probability of B given A, i.e., the probability that B will occur given that A has occurred.
• Similarly, the conditional probability of an event A given B is
P(A|B) = P(A ∩ B) / P(B)
• The probability P(A|B) simply reflects the fact that the probability of an event A may depend on a second event B. If A and B are mutually exclusive, A ∩ B = φ and P(A|B) = 0.
• Another way to look at the conditional probability formula is :
P(second choice | first choice) = P(first choice and second choice) / P(first choice)
• Conditional probability is a defined quantity and cannot be proven.
• The key to solving conditional probability problems is to :
1. Define the events.
2. Express the given information and the question in probability notation.
3. Apply the formula.

Joint Probability
• A joint probability is a probability that measures the likelihood that two or more events will happen concurrently.
• If there are two independent events A and B, the probability that A and B will both occur is found by multiplying the two probabilities. Thus for two events A and B, the special rule of multiplication, shown symbolically, is P(A and B) = P(A) P(B).
• The general rule of multiplication is used to find the joint probability that two events will occur. Symbolically, the general rule of multiplication is P(A and B) = P(A) P(B|A).
• The probability P(A ∩ B) is called the joint probability of two events A and B which intersect in the sample space. A Venn diagram readily shows that
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
and equivalently P(A ∪ B) ≤ P(A) + P(B) : the probability of the union of two events never exceeds the sum of the event probabilities.
• A tree diagram is very useful for portraying conditional and joint probabilities. A tree diagram portrays outcomes that are mutually exclusive.

Bayes Theorem
• Bayes' theorem is a method to revise the probability of an event given additional information. Bayes' theorem calculates a conditional probability called a posterior or revised probability.
• Bayes' theorem is a result in probability theory that relates conditional probabilities. If A and B denote two events, P(A|B) denotes the conditional probability of A occurring given that B occurs. The two conditional probabilities P(A|B) and P(B|A) are in general different. Bayes' theorem gives the relation between P(A|B) and P(B|A).
• A prior probability is an initial probability value obtained before any additional information is considered.
• A posterior probability is a probability value that has been revised by using additional information obtained later.
• Suppose that B1, B2, B3, ..., Bn partition the outcomes of an experiment and that A is another event. For any number k, with 1 ≤ k ≤ n, we have the formula
P(Bk|A) = P(A|Bk) P(Bk) / Σ_{i=1..n} P(A|Bi) P(Bi)
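A small numeric illustration of this formula, with hypothetical prior and likelihood values (a rare condition and an imperfect test), is given below; note how the posterior P(B1|A) stays modest even though the test is quite accurate.

```python
# Bayes' theorem with a two-event partition B1, B2 (hypothetical numbers).
# Priors: P(B1) = 0.01 (condition present), P(B2) = 0.99 (condition absent).
# Likelihoods: P(A|B1) = 0.95 and P(A|B2) = 0.05, where A = "test is positive".
p_b1, p_b2 = 0.01, 0.99
p_a_given_b1, p_a_given_b2 = 0.95, 0.05

p_a = p_a_given_b1 * p_b1 + p_a_given_b2 * p_b2   # total probability of A (the denominator)
posterior_b1 = p_a_given_b1 * p_b1 / p_a           # P(B1 | A)
print(round(posterior_b1, 3))                       # about 0.161
```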
Difference between generative and discriminative models :

Generative model | Discriminative model
Generative models can generate new data instances. | Discriminative models discriminate between different kinds of data instances.
A generative model revolves around the distribution of a dataset to return a probability for a given example. | A discriminative model makes predictions based on conditional probability and is used either for classification or for regression.
Generative models capture the joint probability P(X, Y), or just P(X) if there are no labels. | Discriminative models capture the conditional probability P(Y | X).
A generative model includes the distribution of the data itself, and tells you how likely a given example is. | A discriminative model ignores the question of whether a given instance is likely, and just tells you how likely a label is to apply to the instance.
Generative models are used in unsupervised machine learning to perform tasks such as probability and likelihood estimation. | Discriminative models are used particularly for supervised machine learning. Examples : logistic regression, SVMs.

2.4 Maximum Margin Classifier : Support Vector Machine
• Support Vector Machines (SVMs) are a set of supervised learning methods which learn from the dataset and are used for classification. SVM is a classifier derived from statistical learning theory by Vapnik and Chervonenkis.
• An SVM is a kind of large-margin classifier : it is a vector space based machine learning method where the goal is to find a decision boundary between two classes that is maximally far from any point in the training data.
• Given a set of training examples, each marked as belonging to one of two classes, an SVM algorithm builds a model that predicts whether a new example falls into one class or the other. Simply speaking, we can think of an SVM model as representing the examples as points in space, mapped so that the examples of the separate classes are divided by a gap that is as wide as possible. New examples are then mapped into the same space and classified as belonging to one class or the other based on which side of the gap they fall on.
Fig. 2.4.1 Two class problem
• Two class problems : Many decision boundaries can separate these two classes. Which one should we choose? A perceptron learning rule can be used to find any decision boundary between class 1 and class 2.
• The line that maximizes the minimum margin is a good bet. The model class of "hyper-planes with a margin of m" has a low VC dimension if m is big.
• This maximum-margin separator is determined by a subset of the data points. Data points in this subset are called "support vectors". It is useful computationally if only a small fraction of the data points are support vectors, because we use the support vectors to decide which side of the separator a test case is on.
• Example of bad decision boundaries : decision boundaries that maximize the margin are preferred; a boundary B1 with a larger margin is better than a boundary B2 with a smaller margin.
Fig. 2.4.2 Bad decision boundary of SVM
• SVMs are primarily two-class classifiers with the distinct characteristic that they aim to find the optimal hyperplane such that the expected generalization error is minimized. Instead of directly minimizing the empirical risk calculated from the training data, SVMs perform structural risk minimization to achieve good generalization.
• Because we do not know the distribution P from which the data are drawn, we minimize the empirical risk over a training dataset drawn from P. This general learning technique is called empirical risk minimization.
Fig. 2.4.3 Empirical risk (confidence and empirical risk plotted against the complexity of the function set)
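The notion of margin can be computed directly. The helper below (an illustrative sketch, not from the original text) evaluates the geometric margin min_i y_i(w·x_i + b)/‖w‖ of a candidate hyperplane (w, b) over a small made-up dataset with ±1 labels; a positive value means every point is on the correct side.

```python
import numpy as np

def geometric_margin(w, b, X, y):
    """Minimum signed distance of the samples to the hyperplane w.x + b = 0.
    y must be labelled +1/-1; larger positive values mean a wider margin."""
    distances = y * (X @ w + b) / np.linalg.norm(w)
    return distances.min()

# Hypothetical separable data and a candidate separating hyperplane
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])
print(geometric_margin(np.array([1.0, 1.0]), -0.5, X, y))
```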
• SVMs maximize the margin of the decision boundary using quadratic optimization techniques which find the optimal hyperplane.
• They have the ability to handle large feature spaces.
• Overfitting can be controlled by the soft margin approach.
• When used in practice, SVM approaches frequently map the examples to a higher dimensional space and find margin-maximal hyperplanes in the mapped space, obtaining decision boundaries which are not hyperplanes in the original space. The most popular versions of SVMs use non-linear kernel functions and map the attribute space into a higher-dimensional space to facilitate finding "good" linear decision boundaries in the modified space.

SVM Applications
• SVM has been used successfully in many real-world problems :
1. Text (and hypertext) categorization
2. Image classification
3. Bioinformatics (protein classification, cancer classification)
4. Hand-written character recognition
5. Determination of SPAM email.

Limitations of SVM
1. It is sensitive to noise.
2. The biggest limitation of SVM lies in the choice of the kernel.
3. Another limitation is speed and size.
4. The optimal design for multiclass SVM classifiers is also a research area.

Soft Margin SVM
• For the very high dimensional problems common in text classification, sometimes the data are linearly separable. But in the general case they are not, and even if they are, we might prefer a solution that better separates the bulk of the data while ignoring a few weird noise documents.
• What if the training set is not linearly separable? Slack variables can be added to allow misclassification of difficult or noisy examples; the resulting margin is called soft.
• A soft margin allows a few variables to cross into the margin or over the hyperplane, allowing misclassification. We penalize the crossover by looking at the number and distance of the misclassifications. This is a trade-off between the hyperplane violations and the margin size. The slack variables are bounded by some set cost; the farther they are from the soft margin, the less influence they have on the prediction.
• All observations have an associated slack variable :
a. Slack variable = 0 : the point lies on or outside the margin, with no violation.
b. Slack variable > 0 : the point lies inside the margin or on the wrong side of the hyperplane.
c. C is the trade-off between the slack variable penalty and the margin.

Comparison of SVM and Neural Networks

Support Vector Machine | Neural Network
Kernel maps to a very high dimensional space | Hidden layers map to lower dimensional spaces
Search space has a unique minimum | Search space has multiple local minima
Kernel and cost are the two parameters to select | Requires choosing the number of hidden units and layers
Very good accuracy in typical domains | Very good accuracy in typical domains

• Training an SVM, and classification with the trained model, are also extremely efficient.

Example 2.4.1 : For the data shown in the figure accompanying this example, identify the support vectors (if any), the slack variables on the wrong side of the classifier (if any) and the slack variables on the correct side of the classifier (if any). Mention which point will have the maximum penalty and why.
Solution : Data points 1 and 5 will have the maximum penalty. The margin m is the gap between the data points and the classifier boundary; the margin is the minimum distance of any sample to the decision boundary. If the hyperplane is in canonical form, the margin can be measured by the length of the weight vector.
• Maximal margin classifier : a classifier in the family F that maximizes the margin. Maximizing the margin is good according to intuition and to PAC theory. It implies that only the support vectors matter; other training examples are ignorable.
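The C trade-off described above can be seen with a soft-margin SVM from scikit-learn. In the hypothetical sketch below, a small C tolerates the noisy point and keeps a wide margin, while a large C tries harder to avoid violations; the number of support vectors typically differs between the two fits. The data are invented purely for illustration.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [1.5, 1.8], [2.0, 1.2],
              [6.0, 6.5], [6.5, 6.0], [7.0, 7.2],
              [2.2, 5.8]])               # last point plays the role of a "noisy" example
y = np.array([0, 0, 0, 1, 1, 1, 1])

# C controls the slack/margin trade-off described above:
# small C -> softer, wider margin (more slack allowed); large C -> fewer violations.
soft = SVC(kernel="linear", C=0.1).fit(X, y)
hard = SVC(kernel="linear", C=100.0).fit(X, y)
print(len(soft.support_), len(hard.support_))   # number of support vectors in each fit
```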
2.5 Decision Tree
• A decision tree is a simple representation for classifying examples. Decision tree learning is one of the most successful techniques for supervised classification learning.
• In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. As the name suggests, it uses a tree-like model of decisions. Learned trees can also be represented as sets of if-then rules to improve human readability.
• A decision tree has two kinds of nodes :
1. Each leaf node has a class label, determined by the majority vote of the training examples reaching that leaf.
2. Each internal node is a question on features. It branches out according to the answers.
• Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree. A learned decision tree can also be re-represented as a set of if-then rules.
• Decision tree learning is one of the most widely used and practical methods for inductive inference. It is robust to noisy data and capable of learning disjunctive expressions. The decision tree learning method searches a completely expressive hypothesis space.

Decision Tree Representation
• Goal : build a decision tree for classifying examples as positive or negative instances of a concept, using supervised learning with batch processing of training examples and a preference bias.
• A decision tree is a tree where :
a. Each non-leaf node has associated with it an attribute (feature).
b. Each leaf node has associated with it a classification (+ or −).
c. Each arc has associated with it one of the possible values of the attribute at the node from which the arc is directed.
• An internal node denotes a test on an attribute, a branch represents an outcome of the test, and leaf nodes represent class labels or class distributions.
• A decision tree is thus a flow-chart-like tree structure, where each node denotes a test on an attribute value, each branch represents an outcome of the test, and the tree leaves represent classes or class distributions. Decision trees can easily be converted to classification rules.
Decision Tree Algorithm
• To generate a decision tree from the training tuples of a data partition D.
Input :
1. Data partition D
2. Attribute list
3. Attribute selection method
Algorithm :
1. Create a node N.
2. If the tuples in D are all of the same class C, then
3. return N as a leaf node labeled with the class C.
4. If the attribute list is empty, then return N as a leaf node labeled with the majority class in D.
5. Apply the attribute selection method(D, attribute list) to find the "best" splitting criterion.
6. Label node N with the splitting criterion.
7. If the splitting attribute is discrete-valued and multiway splits are allowed,
8. then attribute list ← attribute list − splitting attribute.
9. For each outcome j of the splitting criterion :
10. let Dj be the set of data tuples in D satisfying outcome j;
11. if Dj is empty, then attach a leaf labeled with the majority class in D to node N;
12. else attach the node returned by Generate decision tree(Dj, attribute list) to node N.
13. End of for loop.
14. Return N.
• Decision tree generation consists of two phases : tree construction and tree pruning.
• In the tree construction phase, all the training examples are at the root; examples are then partitioned recursively based on selected attributes.
• In the tree pruning phase, branches that reflect noise or outliers are identified and removed.
Fig. 2.5.1 Decision tree
• There are various paradigms that are used for learning binary classifiers, which include :
1. Decision trees
2. Neural networks
3. Bayesian classification
4. Support vector machines

Example 2.5.1 : Using the following feature tree, write decision rules for the majority class. The tree combines two Boolean features; each internal node or split is labelled with a feature, and each edge emanating from a split is labelled with a feature value. Each leaf therefore corresponds to a unique combination of feature values, and each leaf also shows the class distribution derived from the training set.
Solution :
• Left side : a feature tree with the training set class distribution in the leaves. Right side : the same feature tree partitions the instance space into rectangular regions, one for each leaf.
• The leaves of the tree in the figure can be labelled with the majority class of their class distribution, employing a simple decision rule called majority class; for example, a leaf with the distribution spam : 20, ham : 5 is labelled spam.
Fig. 2.5.3 Left side : a feature tree (on Boolean features such as 'Viagra') with training set class distribution in the leaves. Right side : the decision tree obtained using the majority class decision rule.
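Reading decision rules off a feature tree amounts to nested if-then tests with majority-class labels at the leaves, as the example shows. Because the exact splits and class counts of the figure are not reproduced here, the sketch below uses a hypothetical tree over the two Boolean features 'Viagra' and 'lottery' purely to show the idea.

```python
def classify_email(viagra: int, lottery: int) -> str:
    """Majority-class decision rules read off a small feature tree.
    The splits and class labels here are illustrative, not the ones in the figure."""
    if viagra == 1:
        return "spam"        # leaf where spam is the majority class
    if lottery == 1:
        return "spam"        # second spam-majority leaf
    return "ham"             # remaining leaf: ham is the majority class

print(classify_email(viagra=0, lottery=1))   # -> spam
```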
Appropriate Problems for Decision Tree Learning
• Decision tree learning is generally best suited to problems with the following characteristics :
1. Instances are represented by attribute-value pairs. There is a fixed set of attributes, and the attributes take a small number of disjoint possible values.
2. The target function has discrete output values. Decision tree learning is appropriate for boolean classification, but it easily extends to learning functions with more than two possible output values.
3. Disjunctive descriptions may be required. Decision trees naturally represent disjunctive expressions.
4. The training data may contain errors. Decision tree learning methods are robust to errors, both errors in the classifications of the training examples and errors in the attribute values that describe these examples.
5. The training data may contain missing attribute values. Decision tree methods can be used even when some training examples have unknown values.
6. Decision tree learning has been applied to a wide range of practical classification problems.

Advantages and Disadvantages of Decision Tree
Advantages :
1. Rules are simple and easy to understand.
2. Decision trees can handle both nominal and numerical attributes.
3. Decision trees are capable of handling datasets that may have errors.
4. Decision trees are capable of handling datasets that may have missing values.
5. Decision trees are considered to be a nonparametric method.
6. Decision trees are self-explanatory.
Disadvantages :
1. Most of the algorithms require that the target attribute have only discrete values.
2. Some problems are difficult to solve, like XOR.
3. Decision trees are less appropriate for estimation tasks where the goal is to predict the value of a continuous attribute.
4. Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.

2.6 Random Forests
• Random forest is a well-known machine learning algorithm that belongs to the supervised learning technique. It can be used for both classification and regression problems in ML. It is based on the concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex problem and to improve the overall performance of the model.
• As the name indicates, "Random forest is a classifier that contains a number of decision trees built on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority votes of the predictions, it predicts the final output.
• The greater number of trees in the forest leads to better accuracy and prevents the problem of overfitting.

How Does the Random Forest Algorithm Work?
• Random forest works in two phases : the first is to create the random forest by combining N decision trees, and the second is to make predictions using each tree created in the first phase. The working procedure can be explained in the following steps :
Step 1 : Select random K data points from the training set.
Step 2 : Build the decision trees associated with the selected data points (subsets).
Step 3 : Choose the number N of decision trees to build.
Step 4 : Repeat steps 1 and 2.
Step 5 : For new data points, find the predictions of each decision tree, and assign the new data points to the category that wins the majority of votes.
• Example : Suppose there is a dataset that contains multiple fruit images. This dataset is given to the random forest classifier. The dataset is divided into subsets and given to each decision tree. During the training phase, each decision tree produces a prediction result, and when a new data point occurs, then, based on the majority of results, the random forest classifier predicts the final decision.
Fig. 2.6.1 Example of random forest
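The build-then-vote procedure above is what a library random forest implements. The sketch below fits a scikit-learn RandomForestClassifier with 50 trees on synthetic data and predicts by majority vote over the trees; the dataset and parameter choices are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training data; each tree is grown on a bootstrap sample of it.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(forest.predict(X[:3]))     # class decided by majority vote over the 50 trees
print(forest.score(X, y))        # training accuracy
```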
Applications of Random Forest
There are mainly four sectors where random forest is normally used :
1. Banking : The banking sector mainly uses this algorithm for the identification of loan risk.
2. Medicine : With the help of this algorithm, disease trends and the risks of a disease can be identified.
3. Land use : Areas of similar land use can be identified with the help of this algorithm.
4. Marketing : Marketing trends can be identified using this algorithm.

Advantages of Random Forest
• Random forest is capable of performing both classification and regression tasks.
• It is capable of handling large datasets with high dimensionality.
• It enhances the accuracy of the model and prevents the overfitting problem.

Disadvantages of Random Forest
• Although random forest can be used for both classification and regression tasks, it is not as well suited to regression tasks.

2.7 Two Marks Questions with Answers

Q.1 What do you mean by the least square method?
Ans. : Least squares is a statistical method used to determine a line of best fit by minimizing the sum of squares created by a mathematical function. A "square" is determined by squaring the distance between a data point and the regression line or the mean value of the data set.

Q.2 What is a linear discriminant function?
Ans. : LDA is a supervised learning algorithm, which means that it requires a labelled training set of data points in order to learn the linear discriminant function.

Q.3 What is a support vector in SVM?
Ans. : Support vectors are the data points that are closest to the hyperplane and influence the position and orientation of the hyperplane. Using these support vectors, we maximize the margin of the classifier.

Q.4 What is a Support Vector Machine?
Ans. : A Support Vector Machine (SVM) is a supervised machine learning model that uses classification algorithms for two-group classification problems. After giving an SVM model sets of labelled training data for each category, it is able to categorize new text.

Q.5 Define logistic regression.
Ans. : Logistic regression is a supervised learning technique. It is used for predicting a categorical dependent variable using a given set of independent variables.

Q.6 List out the types of machine learning.
Ans. : The types of machine learning are supervised, semi-supervised, unsupervised and reinforcement learning.

Q.7 What is random forest?
Ans. : Random forest is an ensemble learning technique that combines multiple decision trees, implementing the bagging method, and results in a robust model with low variance.

Q.8 What are five popular algorithms of machine learning?
Ans. : Popular algorithms are decision trees, neural networks (back propagation), probabilistic networks, nearest neighbour and support vector machines.

Q.9 What is the function of 'Supervised Learning'?
Ans. : Functions of supervised learning are classification, speech recognition, regression, prediction of time series and annotating strings.

Q.10 What are the advantages of Naive Bayes?
Ans. : A Naive Bayes classifier converges more quickly than discriminative models like logistic regression, so less training data is needed. Its main limitation is that it cannot learn interactions between features.
Q.11 What is regression?
Ans. : Regression is a method to determine the statistical relationship between a dependent variable and one or more independent variables.

Q.12 Explain linear and non-linear regression models.
Ans. : In linear regression models, the dependence of the response on the regressors is defined by a linear function, which makes their statistical analysis mathematically tractable. In nonlinear regression models, on the other hand, this dependence is defined by a nonlinear function, hence the mathematical difficulty in their analysis.

Q.13 What is regression analysis used for?
Ans. : Regression analysis is a form of predictive modelling technique which investigates the relationship between a dependent (target) variable and independent variables (predictors). This technique is used for forecasting, time series modelling and finding causal effect relationships between variables.

Q.14 List two properties of logistic regression.
Ans. :
1. The dependent variable in logistic regression follows the Bernoulli distribution.
2. Estimation is done through maximum likelihood.

Q.15 What is the goal of logistic regression?
Ans. : The goal of logistic regression is to correctly predict the category of outcome for individual cases using the most parsimonious model. To accomplish this goal, a model is created that includes all predictor variables that are useful in predicting the response variable.

Q.16 Define supervised learning.
Ans. : Supervised learning is learning in which the network is trained by providing it with input and matching output patterns. These input-output pairs are usually provided by an external teacher.