ML Unit 3

Machine Learning Unit 3 notes (JNTUH R18) for exam preparation

Uploaded by

cannotfindme41
Copyright
© © All Rights Reserved
Available Formats
Download as PDF or read online on Scribd
0% found this document useful (0 votes)
39 views

ML Unit 3

Machine learning unit 3 notes jntuh r18 for exam preparation

Uploaded by

cannotfindme41
Copyright
© © All Rights Reserved
Available Formats
Download as PDF or read online on Scribd
You are on page 1/ 14
Unit 3 : Bayesian Learning, Computational Learning and Instance Based Learning

3.1 : Bayesian Learning and Bayes Theorem

Q.1 What is a Bayesian neural network ?
Ans. : A Bayesian Neural Network (BNN) refers to extending standard networks with posterior inference over the weights. Standard NN training via optimization is equivalent to Maximum Likelihood Estimation (MLE) for the weights.

Q.2 What are the features of Bayesian learning methods ?
Ans. :
1. Each observed training example can incrementally decrease or increase the estimated probability that a hypothesis is correct.
2. Prior knowledge can be combined with observed data to determine the final probability of a hypothesis.
3. Bayesian methods can accommodate hypotheses that make probabilistic predictions.
4. New instances can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities.
5. Even in cases where Bayesian methods prove computationally intractable, they can provide a standard of optimal decision making against which other practical methods can be measured.

Q.3 What is the practical difficulty in applying Bayesian methods ?
Ans. :
1. They require initial knowledge of many probabilities. When these probabilities are not known in advance, they are often estimated based on background knowledge, previously available data, and assumptions about the form of the underlying distributions.
2. The significant computational cost required to determine the Bayes optimal hypothesis in the general case.

Q.4 What is Bayes theorem ? How to select hypotheses ?
Ans. :
• In machine learning, we try to determine the best hypothesis from some hypothesis space H, given the observed training data D.
• In Bayesian learning, the best hypothesis means the most probable hypothesis, given the data D plus any initial knowledge about the prior probabilities of the various hypotheses in H.
• Bayes theorem provides a way to calculate the probability of a hypothesis based on its prior probability, the probabilities of observing various data given the hypothesis, and the observed data itself.
• Bayes' theorem is a method to revise the probability of an event given additional information. It calculates a conditional probability called a posterior or revised probability.
• Bayes' theorem is a result in probability theory that relates conditional probabilities. If A and B denote two events, P(A|B) denotes the conditional probability of A occurring, given that B occurs. The two conditional probabilities P(A|B) and P(B|A) are in general different.
• This theorem gives a relation between P(A|B) and P(B|A). An important application of Bayes' theorem is that it gives a rule for updating or revising the strengths of evidence-based beliefs in the light of new evidence, a posteriori.
• A prior probability is an initial probability value, obtained before any additional information is available. A posterior probability is a probability value that has been revised by using additional information obtained later.
• If A and B are two random variables,
      P(A|B) = P(B|A) P(A) / P(B)
• In the context of classifiers, for a hypothesis h and training data D,
      P(h|D) = P(D|h) P(h) / P(D)
  where
  P(h) = prior probability of hypothesis h
  P(D) = prior probability of the training data D
  P(h|D) = probability of h given D
  P(D|h) = probability of D given h

Choosing the hypotheses :
• Given the training data, we are interested in the most probable hypothesis. The learner considers some set of candidate hypotheses H and is interested in finding the most probable hypothesis h ∈ H given the observed data D.
• Any such maximally probable hypothesis is called a maximum a posteriori (MAP) hypothesis, h_MAP :
      h_MAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h) P(h) / P(D) = argmax_{h ∈ H} P(D|h) P(h)
  (P(D) is a constant independent of h, so it can be dropped.)
• If every hypothesis is equally probable a priori, P(h_i) = P(h_j) for all h_i and h_j in H, it suffices to consider P(D|h). P(D|h) is often called the likelihood of the data D given h, and any hypothesis that maximizes P(D|h) is called a maximum likelihood (ML) hypothesis, h_ML :
      h_ML = argmax_{h ∈ H} P(D|h)
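The formulas above can be turned into a few lines of Python. The sketch below is not from the notes; the hypothesis names, priors and likelihoods are made-up illustrative values, chosen so that the MAP and ML hypotheses disagree when the prior is not uniform.

```python
# Sketch: selecting MAP and ML hypotheses over a small, discrete hypothesis
# space. The priors and likelihoods are invented illustrative values.
priors = {"h1": 0.9, "h2": 0.1}          # P(h)
likelihoods = {"h1": 0.2, "h2": 0.9}     # P(D | h) for the observed data D

# Unnormalized posteriors P(D|h) * P(h); the constant P(D) can be dropped
# when only the argmax is needed.
unnormalized = {h: likelihoods[h] * priors[h] for h in priors}
evidence = sum(unnormalized.values())                          # P(D) by total probability
posteriors = {h: unnormalized[h] / evidence for h in priors}   # P(h | D)

h_map = max(posteriors, key=posteriors.get)    # maximizes P(h | D)
h_ml = max(likelihoods, key=likelihoods.get)   # maximizes P(D | h)

print("P(h|D):", {h: round(p, 3) for h, p in posteriors.items()})
print("h_MAP =", h_map, ", h_ML =", h_ml)      # they differ because the prior favours h1
```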
Q.5 At a certain university, 4 % of men are over 6 feet tall and 1 % of women are over 6 feet tall. The total student population is divided in the ratio 3 : 2 in favour of women. If a student is selected at random from among all those over six feet tall, what is the probability that the student is a woman ?
Ans. : Let M = {student is male}, F = {student is female}, T = {student is over 6 feet tall}.
Given data : P(M) = 2/5, P(F) = 3/5, P(T|M) = 4/100, P(T|F) = 1/100. We require P(F|T).
Using Bayes' theorem :
      P(F|T) = P(T|F) P(F) / [ P(T|F) P(F) + P(T|M) P(M) ]
             = (1/100)(3/5) / [ (1/100)(3/5) + (4/100)(2/5) ]
             = (3/500) / (11/500) = 3/11 ≈ 0.27

Q.6 A bag contains 5 red balls and 2 white balls. Two balls are drawn successively without replacement. Draw the probability tree for this.
Sol. : Let R1 be the event of getting a red ball on the first draw, W2 the event of getting a white ball on the second draw, and so forth. The probability tree has the branches :
First draw : P(R1) = 5/7, P(W1) = 2/7.
Second draw after a red ball : P(R2|R1) = 4/6, P(W2|R1) = 2/6.
Second draw after a white ball : P(R2|W1) = 5/6, P(W2|W1) = 1/6.
(Figure : two-level probability tree with these branch probabilities.)

3.2 : Maximum Likelihood and Least Squared Error Hypotheses

Q.7 What do you mean by least square method ?
Ans. : Least squares is a statistical method used to determine a line of best fit by minimizing the sum of squares created by a mathematical function. A "square" is determined by squaring the distance between a data point and the regression line or the mean value of the data set.

Q.8 What is maximum likelihood estimation ?
Ans. : Maximum Likelihood Estimation (MLE) is a method of estimating the parameters of a statistical model. When applied to a data set and given a statistical model, maximum likelihood estimation provides estimates for the model's parameters.

Q.9 Briefly discuss the least square method. List disadvantages of the least square method.
Ans. :
• The method of least squares is about estimating parameters by minimizing the squared discrepancies between the observed data, on the one hand, and their expected values on the other.
• Consider an arbitrary straight line, y = b0 + b1 x, to be fitted through the data points. The question is "Which line is the most representative ?" What are the values of b0 and b1 such that the resulting line "best" fits the data points, and what goodness-of-fit criterion should be used to decide among all possible combinations of b0 and b1 ?
(Fig. Q.9.1 : scatter of data points with a candidate straight line.)
• The Least Squares (LS) criterion states that the sum of the squares of the errors should be minimum. When least squares is used for classification, the solution yields outputs whose elements sum to 1, but it does not ensure that the outputs lie in the range [0, 1].
• How to draw such a line based on the observed data points ? Suppose an imaginary line y = a + bx, and imagine the vertical distance between the line and a data point, E = Y − E(Y).
(Fig. Q.9.2 : data points, the fitted line E(y) = a + bX, and the vertical deviation of a point from it.)
• This error is the deviation of the data point from the imaginary (regression) line. Then what are the best values of a and b ? The a and b that minimize the sum of such errors.
• Deviations do not have good properties for computation, so why use squares of the deviations ? We choose a and b to minimize the sum of squared deviations rather than the sum of deviations. This method is called least squares.
• The least squares method minimizes the sum of squares of errors. Such a and b are called least squares estimators, i.e. estimators of the parameters α and β. The process of obtaining parameter estimators (e.g. a and b) is called estimation, and the least squares method is the estimation method of Ordinary Least Squares (OLS).

Disadvantages of least squares :
1. Lack of robustness to outliers.
2. Certain datasets are unsuitable for least squares classification.
3. The decision boundary corresponds to the ML solution under a Gaussian noise assumption, which is a poor assumption for binary targets.
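A brief numpy sketch of the ordinary least squares fit of y = a + bx described above; the data points are invented illustrative values, not the table used in Q.10 below.

```python
import numpy as np

# Sketch: ordinary least squares fit of a straight line y = a + b*x
# via the normal equations. The data points are illustrative values only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Design matrix with a column of ones for the intercept a.
A = np.column_stack([np.ones_like(x), x])

# Normal equations: (A^T A) [a, b]^T = A^T y
a, b = np.linalg.solve(A.T @ A, A.T @ y)

residuals = y - (a + b * x)          # vertical deviations from the fitted line
print(f"a = {a:.3f}, b = {b:.3f}")
print("sum of squared errors:", round(float(np.sum(residuals ** 2)), 4))
```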
Q.10 Fit a straight line to the points in the table. Compute m and b by least squares.
Ans. : Represent the observations in matrix form A X = L, where each row of A is [x_i  1], X = [m  b]^T, and L is the column of observed y values. The least squares estimate is
      X = (A^T A)^(-1) A^T L
and the residual vector is V = A X − L. (The data table and the worked numerical solution are omitted here.)

Q.11 Explain with an example maximum likelihood estimation.
Ans. :
• Maximum Likelihood Estimation (MLE) is a method of estimating the parameters of a statistical model. When applied to a data set and given a statistical model, maximum likelihood estimation provides estimates for the model's parameters.
• Let X1, X2, ..., Xn have joint density f(x1, x2, ..., xn | θ). Given observed values X1 = x1, X2 = x2, ..., Xn = xn, the likelihood of θ is the function
      lik(θ) = f(x1, x2, ..., xn | θ)
  considered as a function of θ. If the distribution is discrete, f is the frequency (probability mass) function.
• The maximum likelihood estimate of θ is the value of θ that maximises lik(θ) : it is the value that makes the observed data the most probable.
• Example of maximizing likelihood : a random variable with a Bernoulli distribution is a formalization of a coin toss. The value of the random variable is 1 with probability θ and 0 with probability 1 − θ. Let X be a Bernoulli random variable and let x be an outcome of X; then
      P(X = x) = θ if x = 1, and 1 − θ if x = 0.
• Usually we use the notation P(·) for a probability mass and p(·) for a probability density. For mathematical convenience, P(X = x) can be written as
      P(X = x) = θ^x (1 − θ)^(1 − x).
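A small sketch of maximum likelihood estimation for the Bernoulli (coin-toss) model above; the observed outcomes are invented, and the closed-form answer (the sample mean) is shown alongside a simple grid search over θ.

```python
import numpy as np

# Sketch: maximum likelihood estimation for a Bernoulli (coin-toss) model.
# The observed outcomes below are illustrative values only.
x = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])   # n tosses, 1 = heads

def log_likelihood(theta, x):
    # log lik(theta) = sum_i [ x_i*log(theta) + (1 - x_i)*log(1 - theta) ]
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

# Evaluate the log-likelihood on a grid of theta values and pick the maximizer.
grid = np.linspace(0.01, 0.99, 99)
theta_grid = grid[np.argmax([log_likelihood(t, x) for t in grid])]

# Setting the derivative of the log-likelihood to zero gives the closed form:
# theta_hat = (number of ones) / n, i.e. the sample mean.
theta_closed = x.mean()

print("grid-search MLE ~", round(float(theta_grid), 2))
print("closed-form MLE =", float(theta_closed))
```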
Q.12 What is gradient search to maximize likelihood in a neural net ?
Ans. :
• Consider developing a method for computers to "understand" speech using mathematical models. For a D-dimensional input vector o, the Gaussian distribution with mean μ and positive definite covariance matrix Σ can be expressed as
      N(o ; μ, Σ) = 1 / ( (2π)^(D/2) |Σ|^(1/2) ) · exp( −(1/2) (o − μ)^T Σ^(−1) (o − μ) )
• The distribution is completely described by the D parameters representing μ and the D(D + 1)/2 parameters representing the symmetric covariance matrix Σ.
• A single Gaussian may do a bad job of modeling a multimodal distribution.
(Fig. Q.12.2 : mixture model with hidden components, class prior probabilities and Normal/Gaussian component densities.)
• Solution : a mixture of Gaussians - a formalism for modeling a probability density function as a sum of parameterized functions.

3.3 : Minimum Description Length Principle

Q.13 Explain minimum description length principle.
Ans. :
• The Minimum Description Length (MDL) criterion in machine learning says that the best description of the data is given by the model which compresses it the best.
• Put another way, learning a model for the data, or predicting it, is about capturing the regularities in the data, and any regularity in the data can be used to compress it. The more we can compress the data, the more we have learnt about it and the better we can predict it.
• The MDL principle states that one should prefer the model that yields the shortest description of the data when the complexity of the model itself is also accounted for.
• The MDL principle can be motivated by interpreting the definition of h_MAP in the light of basic concepts from information theory. Consider the definition of h_MAP :
      h_MAP = argmax_{h ∈ H} P(D|h) P(h)
  which can equivalently be written as
      h_MAP = argmin_{h ∈ H} [ − log2 P(D|h) − log2 P(h) ]
  Here − log2 P(h) is the description length of h under the optimal encoding of the hypothesis space, and − log2 P(D|h) is the description length of the data D given h under its optimal encoding.
• The MDL principle recommends choosing the hypothesis that minimizes the sum of these two description lengths.
• Assuming codes C1 and C2 are used to represent the hypothesis and the data given the hypothesis, we can state the MDL principle as
      h_MDL = argmin_{h ∈ H} [ L_C1(h) + L_C2(D|h) ]
• The above analysis shows that if we choose C1 to be the optimal encoding of hypotheses, C_H, and we choose C2 to be the optimal encoding C_{D|h}, then h_MDL = h_MAP.

3.4 : Bayes Optimal Classifier and Gibbs Algorithm

Q.14 Define Gibbs algorithm.
Ans. : The Gibbs algorithm is defined as follows :
1. Choose a hypothesis h from H at random, according to the posterior probability distribution over H.
2. Use h to predict the classification of the next instance x.

Q.15 What is the Bayes optimal classifier ?
Ans. :
• A Bayes classifier is a classifier that minimizes the error in a probabilistic manner. If it is Bayes optimal, then the errors are weighed using the joint probability distribution between the input and the output sets.
• The Bayes error is then the error of the Bayes classifier.

3.5 : Naive Bayes Classifier

Q.16 What is the Naive Bayes classifier ?
Ans. :
• Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong independence assumptions between the features. They are highly scalable, requiring a number of parameters linear in the number of variables (features / predictors) in a learning problem.
• A Naive Bayes classifier is a program which predicts a class value given a set of attributes. For each known class value :
1. Calculate probabilities for each attribute, conditional on the class value.
2. Use the product rule to obtain a joint conditional probability for the attributes.
3. Use Bayes rule to derive conditional probabilities for the class variable.
• Once this has been done for all class values, output the class with the highest probability.
• Naive Bayes simplifies the calculation of probabilities by assuming that the probability of each attribute belonging to a given class value is independent of all other attributes. This is a strong assumption, but it results in a fast and effective method.
• The probability of a class value given a value of an attribute is called the conditional probability. By multiplying the conditional probabilities together for each attribute for a given class value, we obtain the probability of a data instance belonging to that class (see the sketch following this answer).
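A minimal sketch of the Naive Bayes recipe above for categorical attributes; the tiny dataset, attribute values and class labels are all invented for illustration, and simple Laplace smoothing is used for the conditional probabilities.

```python
from collections import Counter, defaultdict

# Sketch: Naive Bayes for categorical attributes. The tiny dataset below
# (attribute tuples and class labels) is invented purely for illustration.
data = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("overcast", "hot"), "yes"),
    (("rainy", "cool"), "yes"),
]

class_counts = Counter(label for _, label in data)
attr_counts = defaultdict(lambda: defaultdict(Counter))  # class -> attr index -> value counts
attr_values = defaultdict(set)                           # attr index -> set of observed values
for attrs, label in data:
    for i, value in enumerate(attrs):
        attr_counts[label][i][value] += 1
        attr_values[i].add(value)

def predict(attrs):
    n = len(data)
    scores = {}
    for label, c_count in class_counts.items():
        score = c_count / n                                   # prior P(class)
        for i, value in enumerate(attrs):
            # Laplace-smoothed estimate of P(attribute_i = value | class)
            score *= (attr_counts[label][i][value] + 1) / (c_count + len(attr_values[i]))
        scores[label] = score                                 # proportional to P(class | attrs)
    return max(scores, key=scores.get), scores

print(predict(("sunny", "hot")))   # prints the most probable class and the raw scores
```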
3.6 : Bayesian Belief Networks

Q.17 Describe Bayesian belief network.
Ans. : A Bayesian belief network describes the probability distribution governing a set of variables by specifying a set of conditional independence assumptions along with a set of conditional probabilities.

Q.18 Explain with an example how a Bayesian belief network is represented.
Ans. :
• Bayesian belief networks represent the full joint distribution over the variables more compactly, with a smaller number of parameters. They take advantage of conditional and marginal independences among random variables.
• If A and B are independent, then P(A, B) = P(A) P(B).
• If A and B are conditionally independent given C, then P(A, B | C) = P(A | C) P(B | C) and P(A | C, B) = P(A | C).
• Example : alarm system. Assume your house has an alarm system against burglary. You live in a seismically active area and the alarm system can occasionally get set off by an earthquake.
• You have two neighbours, Mary and John, who do not know each other. If they hear the alarm they call you, but this is not guaranteed.
• We want to represent the probability distribution of the events : Burglary, Earthquake, Alarm, MaryCalls and JohnCalls.
(Fig. Q.18.1 : causal relations among Burglary, Earthquake, Alarm, JohnCalls and MaryCalls.)
Directed acyclic graph :
• Nodes = random variables : Burglary, Earthquake, Alarm, MaryCalls and JohnCalls.
• Links = direct (causal) dependencies between variables. The chance of Alarm is influenced by Earthquake; the chance of John calling is affected by the Alarm.
(Fig. Q.18.2 : the directed acyclic graph of the alarm network.)
• Local conditional distributions relate each variable to its parents, e.g. the distribution of Alarm given Burglary and Earthquake.
(Fig. Q.18.3 : local conditional probability tables attached to the nodes.)
(Fig. Q.18.4 : the complete Bayesian belief network.)

3.7 : The EM Algorithm

Q.19 Write a short note on the EM algorithm.
Ans. :
• Expectation-Maximization (EM) is an iterative method used to find maximum likelihood estimates of parameters in probabilistic models, where the model depends on unobserved, also called latent, variables.
• EM alternates between an expectation (E) step, which computes an expectation of the likelihood by including the latent variables as if they were observed, and a maximization (M) step, which computes the maximum likelihood estimates of the parameters by maximizing the expected likelihood found in the E step.
• The parameters found in the M step are then used to start another E step, and the process is repeated until some criterion is satisfied. EM is frequently used for data clustering.
• In the Expectation step, find the expected values of the latent variables (here you need to use the current parameter values). In the Maximization step, first plug the expected values of the latent variables into the log-likelihood of the augmented data, then maximize this log-likelihood to re-evaluate the parameters.
• Expectation-Maximization is a technique used in point estimation. Given a set of observable variables X and unknown (latent) variables Z, we want to estimate the parameters θ of a model.
• The EM algorithm is a widely used maximum likelihood estimation procedure for statistical models when the values of some of the variables in the model are not observed.
• The EM algorithm is an elegant and powerful method for finding the maximum likelihood of models with hidden variables. The key concept in the EM algorithm is that it iterates between the expectation step (E-step) and the maximization step (M-step) until convergence.
• In the E-step, the algorithm estimates the posterior distribution Q of the hidden variables given the observed data and the current parameter settings; in the M-step the algorithm calculates the ML parameter settings with Q fixed.
• At the end of each iteration the lower bound on the likelihood is optimized for the given parameter setting (M-step) and the likelihood is set to that bound (E-step), which guarantees an increase in the likelihood and convergence to a local maximum, or the global maximum if the likelihood function is unimodal.
• EM works best when the fraction of missing information is small and the dimensionality of the data is not too large. EM can require many iterations, and higher dimensionality can dramatically slow down the E-step.
• EM is useful for several reasons : conceptual simplicity, ease of implementation, and the fact that each iteration improves l(θ). The rate of convergence on the first few steps is typically quite good, but can become excruciatingly slow as you approach local optima.
• Sometimes the M-step is a constrained maximization, which means that there are constraints on valid solutions not encoded in the function itself.
• Expectation maximization is an effective technique that is often used in data analysis to manage missing data. Indeed, expectation maximization overcomes some of the limitations of other techniques, such as mean substitution or regression substitution. These alternative techniques generate biased estimates and, specifically, underestimate the standard errors. Expectation maximization overcomes this problem.
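A compact sketch of the E-step / M-step alternation described above, applied to a two-component one-dimensional Gaussian mixture; the synthetic data and the starting values are invented, and the updates are the standard mixture-of-Gaussians ones rather than anything specific to these notes.

```python
import numpy as np

# Sketch: EM for a two-component 1-D Gaussian mixture.
# The synthetic data and the initial parameter guesses are illustrative only.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2.0, 1.0, 100), rng.normal(3.0, 0.5, 100)])

def gaussian_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Initial guesses for mixing weights, means and variances.
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

for _ in range(50):
    # E-step: responsibility (posterior probability) of each component for
    # each data point, computed with the current parameter values.
    resp = np.stack([pi[k] * gaussian_pdf(data, mu[k], var[k]) for k in range(2)])
    resp /= resp.sum(axis=0)

    # M-step: re-estimate the parameters by maximizing the expected
    # complete-data log-likelihood with the responsibilities held fixed.
    nk = resp.sum(axis=1)
    pi = nk / len(data)
    mu = (resp * data).sum(axis=1) / nk
    var = (resp * (data - mu[:, None]) ** 2).sum(axis=1) / nk

print("weights:", np.round(pi, 2))
print("means:  ", np.round(mu, 2))
print("vars:   ", np.round(var, 2))
```

Each pass through the loop can only increase (or leave unchanged) the data log-likelihood, which is the convergence guarantee referred to in the answer above.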
3.8 : Introduction to Computational Learning Theory

Q.20 What is computational learning theory ?
Ans. :
• Computational learning theory provides a formal framework in which to precisely formulate and address questions regarding the performance of different learning algorithms, so that careful comparisons of both the predictive power and the computational efficiency of alternative learning algorithms can be made.
• Three key aspects that must be formalized are the way in which the learner interacts with its environment, the definition of successfully completing the learning task, and a formal definition of efficiency of both data usage (sample complexity) and processing time (time complexity).

3.9 : Probably Learning an Approximately Correct Hypothesis

Q.21 Define PAC learning.
Ans. : A concept class C is said to be PAC learnable using a hypothesis class H if there exists a learning algorithm L such that for all concepts in C, for all instance distributions D on an instance space X, and for all ε and δ with 0 < ε < 1/2 and 0 < δ < 1/2, the algorithm L, given examples drawn from D and labelled by the target concept, outputs with probability at least 1 − δ a hypothesis h ∈ H whose error is at most ε, using a number of examples and an amount of computation that are polynomial in 1/ε and 1/δ.
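To make the ε and δ guarantees concrete, the sketch below evaluates the standard sample-complexity bound for a consistent learner over a finite hypothesis space, m ≥ (1/ε)(ln |H| + ln(1/δ)). This bound is not derived in the notes above, and the numbers plugged in are arbitrary illustrative choices.

```python
import math

# Sketch: sample-complexity bound for a consistent learner with a finite
# hypothesis space H:  m >= (1/epsilon) * (ln|H| + ln(1/delta)).
# The values below are illustrative choices, not figures from the notes.
def pac_sample_bound(h_size, epsilon, delta):
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# Example: |H| = 2**10 hypotheses, error at most 0.1 with probability
# at least 0.95 (so delta = 0.05).
m = pac_sample_bound(h_size=2 ** 10, epsilon=0.1, delta=0.05)
print("training examples sufficient:", m)   # about 100 examples
```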