Pattern Recognition Organizer

The document provides an overview of Pattern Recognition, including its basics, methodologies such as Bayesian Decision Theory, and applications in various fields like natural language processing and image recognition. It discusses challenges in the field, such as data collection and feature extraction, and highlights the importance of pattern recognition in machine learning and data analysis. Additionally, it outlines the new curriculum changes at MAKAUT, introducing Pattern Recognition as a subject in the 6th semester.

PATTERN RECOGNITION

Basics of Pattern Recognition
Bayesian Decision Theory
Parameter Estimation Methods
Hidden Markov Models for Sequential Pattern Classification
Dimension Reduction Methods
Non-Parametric Techniques for Density Estimation
Linear Discriminant Function Based Classifier
Non-Metric Methods for Pattern Classification
Unsupervised Learning and Clustering

NOTE: The MAKAUT course structure and syllabus of the 6th semester has been changed from 2021, and PATTERN RECOGNITION has been introduced as a new subject in the present curriculum. Taking special care of this, we are providing chapter-wise model questions and answers so that students can get an idea of the pattern of university questions.

BASICS OF PATTERN RECOGNITION

Multiple Choice Type Questions

1. Which of the following is an example of pattern recognition? [MODEL QUESTION]
a) Speech recognition  b) Speaker identification  c) MDR  d) All of the above
Answer: (d)

2. Pattern recognition solves the problem of fake biometric detection. [MODEL QUESTION]
a) TRUE  b) FALSE  c) Can be true or false  d) Cannot say
Answer: (a)

3. Which of the following is a disadvantage of pattern recognition? [MODEL QUESTION]
a) The syntactic pattern recognition approach is complex to implement
b) It is a very slow process
c) Sometimes a larger dataset is required to get better accuracy
d) All of these
Answer: (d)

4. In a typical pattern recognition application, the raw data is processed and converted into a form that is amenable for a machine to use. [MODEL QUESTION]
a) TRUE  b) FALSE  c) Can be true or false  d) Cannot say
Answer: (a)

5. ________ is the process of recognizing patterns by using a machine learning algorithm. [MODEL QUESTION]
a) Processed Data  b) Literate Statistical Programming  c) Pattern Recognition  d) Likelihood
Answer: (c)

Short Answer Type Questions

1. What is pattern recognition? [MODEL QUESTION]
Answer:
Pattern recognition is the process of recognizing patterns by using a machine learning algorithm. It can be defined as the classification of data based on knowledge already gained or on statistical information extracted from patterns and/or their representation.
Examples: speech recognition, speaker identification, multimedia document recognition (MDR), automatic medical diagnosis.
In a typical pattern recognition application, the raw data is processed and converted into a form that is amenable for a machine to use. Pattern recognition involves the classification and clustering of patterns.
- In classification, an appropriate class label is assigned to a pattern based on an abstraction that is generated using a set of training patterns or domain knowledge. Classification is used in supervised learning.
- Clustering generates a partition of the data which helps decision making. Clustering is used in unsupervised learning.
Features may be represented as continuous, discrete, or discrete binary variables. A feature is a function of one or more measurements, computed so that it quantifies some significant characteristic of the object. Example: if we consider a face, then the eyes, ears, nose, etc. are features of the face.
A set of features taken together forms a feature vector. Example: in the face example, if all the features (eyes, ears, nose, etc.) are taken together, the sequence is a feature vector ([eyes, ears, nose]). The feature vector is the sequence of features represented as a d-dimensional column vector.
In the case of speech, the MFCCs (Mel-Frequency Cepstral Coefficients) are the spectral features of the speech signal; the sequence of the first 13 coefficients forms a feature vector.
A pattern recognition system should possess the following capabilities:
- recognize familiar patterns quickly and accurately;
- recognize and classify unfamiliar objects;
- accurately recognize shapes and objects from different angles;
- identify patterns and objects even when partly hidden;
- recognize patterns quickly, with ease, and with automaticity.

Long Answer Type Questions

1. Explain the idea of pattern recognition. [MODEL QUESTION]
Answer:
The problem of searching for patterns in data is a fundamental one and has a long and successful history. For instance, the extensive astronomical observations of Tycho Brahe in the 16th century allowed Johannes Kepler to discover the empirical laws of planetary motion, which in turn provided a springboard for the development of classical mechanics. Similarly, the discovery of regularities in atomic spectra played a key role in the development and verification of quantum physics in the early twentieth century. The field of pattern recognition is concerned with the automatic discovery of regularities in data through the use of computer algorithms, and with the use of these regularities to take actions such as classifying the data into different categories. Consider the example of recognizing handwritten digits: each digit corresponds to a 28x28 pixel image and so can be represented by a vector x comprising 784 real numbers. The goal is to build a machine that will take such a vector x as input and produce the identity of the digit, 0 to 9, as the output. This is a nontrivial problem due to the wide variability of handwriting.

BAYESIAN DECISION THEORY

Decide ω₁ if P(ω₁) > P(ω₂); otherwise decide ω₂. This rule makes sense if we are to judge just one fish, but if we are to judge many fish, using this rule repeatedly may seem a bit strange: we would always make the same decision even though we know that both types of fish will appear. How well the rule works depends upon the values of the prior probabilities. If P(ω₁) is very much greater than P(ω₂), our decision in favour of ω₁ will be right most of the time; if P(ω₁) = P(ω₂), we have only a fifty-fifty chance of being right. In general, the probability of error is the smaller of P(ω₁) and P(ω₂), and we shall see later that under these conditions no other decision rule can yield a larger probability of being right.
In most circumstances we are not asked to make decisions with so little information. In our example, we might for instance use a lightness measurement x to improve our classifier. Different fish will yield different lightness readings, and we express this variability in probabilistic terms: we consider x to be a continuous random variable whose distribution depends on the state of nature and is expressed as p(x|ω). This is the class-conditional probability density function, also sometimes called the state-conditional probability density. Strictly speaking, the density p(x|ω₁) should be written as p_X(x|ω₁) to indicate that we mean a particular density function for the random variable X; since this potential confusion rarely arises in practice, we adopt the simpler notation. The difference between p(x|ω₁) and p(x|ω₂) describes the difference in lightness between the populations of sea bass and salmon (Fig. 1). Suppose that we know both the prior probabilities P(ω_j) and the conditional densities p(x|ω_j).
Suppose further that we measure the lightness of a fish and discover that its value is x. How does this measurement influence our attitude concerning the true state of nature, that is, the category of the fish? We note first that the (joint) probability density of finding a pattern that is in category ω_j and has feature value x can be written in two ways: p(ω_j, x) = P(ω_j|x) p(x) = p(x|ω_j) P(ω_j). Rearranging these leads us to the answer to our question, which is called Bayes' formula:

P(ω_j|x) = p(x|ω_j) P(ω_j) / p(x),

where in this case of two categories

p(x) = Σ_{j=1}^{2} p(x|ω_j) P(ω_j).

Bayes' formula can be expressed informally in English by saying that

posterior = (likelihood × prior) / evidence.

Bayes' formula shows that by observing the value of x we can convert the prior probability P(ω_j) to the a posteriori (or posterior) probability P(ω_j|x), the probability of the state of nature being ω_j given that feature value x has been measured. We call p(x|ω_j) the likelihood of ω_j with respect to x, a term chosen to indicate that, other things being equal, the category ω_j for which p(x|ω_j) is large is more "likely" to be the true category. Notice that it is the product of the likelihood and the prior probability that is most important in determining the posterior probability; the evidence factor, p(x), can be viewed as merely a scale factor that guarantees that the posterior probabilities sum to one, as all good probabilities must. The variation of P(ω_j|x) with x is illustrated in Fig. 2 for the case P(ω₁) = 2/3 and P(ω₂) = 1/3.

Fig. 1: Hypothetical class-conditional probability density functions show the probability density of measuring a particular feature value x given that the pattern is in category ω_i. If x represents the length of a fish, the two curves might describe the difference in length of populations of two types of fish. Density functions are normalized, and thus the area under each curve is 1.0.

If we have an observation x for which P(ω₁|x) is greater than P(ω₂|x), we would naturally be inclined to decide that the true state of nature is ω₁. Similarly, if P(ω₂|x) is greater than P(ω₁|x), we would be inclined to choose ω₂. To justify this decision procedure, let us calculate the probability of error whenever we make a decision. Whenever we observe a particular x,

P(error|x) = P(ω₁|x) if we decide ω₂, and P(error|x) = P(ω₂|x) if we decide ω₁.

Clearly, for a given x we can minimize the probability of error by deciding ω₁ if P(ω₁|x) > P(ω₂|x) and ω₂ otherwise. Of course, we may never observe exactly the same value of x twice. Will this rule minimize the average probability of error? Yes, because the average probability of error is given by

P(error) = ∫ P(error, x) dx = ∫ P(error|x) p(x) dx.

Fig. 2: Posterior probabilities for the particular priors P(ω₁) = 2/3 and P(ω₂) = 1/3 and for the class-conditional probability densities shown in Fig. 1. In this case, given that a pattern is measured to have feature value x = 14, the probability it is in category ω₁ is roughly 0.08 and that it is in ω₂ is roughly 0.92. At every x, the posteriors sum to 1.0.

If for every x we ensure that P(error|x) is as small as possible, then the integral must be as small as possible. Thus we have justified the following Bayes decision rule for minimizing the probability of error:

Decide ω₁ if P(ω₁|x) > P(ω₂|x); otherwise decide ω₂,

and under this rule the probability of error becomes

P(error|x) = min[P(ω₁|x), P(ω₂|x)].
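To make the two-category rule concrete, here is a minimal Python sketch (not part of the original text). It assumes hypothetical Gaussian class-conditional densities for the lightness of the two fish categories; the priors follow Fig. 2, while the means and standard deviations are invented purely for illustration.

```python
import numpy as np

# Hypothetical class-conditional densities p(x | w_i): lightness is assumed
# Gaussian for each fish type. Means, sigmas and the test point are made up.
def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

priors = np.array([2 / 3, 1 / 3])            # P(w1), P(w2), as in Fig. 2
means = np.array([11.0, 13.0])               # illustrative class means
sigmas = np.array([1.0, 1.5])                # illustrative class spreads

def posteriors(x):
    # Bayes' formula: P(w_i | x) = p(x | w_i) P(w_i) / p(x)
    likelihoods = gaussian_pdf(x, means, sigmas)
    joint = likelihoods * priors
    evidence = joint.sum()                   # p(x), a scale factor
    return joint / evidence

def decide(x):
    # Decide w1 if P(w1 | x) > P(w2 | x); otherwise decide w2
    post = posteriors(x)
    return np.argmax(post) + 1, post

category, post = decide(12.0)
print(f"decide w{category}, posteriors = {post}, P(error|x) = {post.min():.3f}")
```

Evaluating the posteriors at several x values reproduces the behaviour of Fig. 2: the decision flips where the scaled class-conditional curves cross.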
In the multicategory case we define a set of discriminant functions g_i(x), one for each category: if g_i(x) > g_j(x) for all j ≠ i, then x is in region R_i, and the decision rule calls for us to assign x to ω_i. The regions are separated by decision boundaries, surfaces in feature space where ties occur among the largest discriminant functions (Fig. 2).

The Two-Category Case: While the two-category case is just a special instance of the multicategory case, it has traditionally received separate treatment. Indeed, a classifier that places a pattern in one of only two categories has a special name: a dichotomizer. Instead of using two discriminant functions g₁ and g₂ and assigning x to ω₁ if g₁ > g₂, it is more common to define a single discriminant function

g(x) = g₁(x) − g₂(x)

and to use the following decision rule: decide ω₁ if g(x) > 0; otherwise decide ω₂. Thus, a dichotomizer can be viewed as a machine that computes a single discriminant function g(x) and classifies x according to the algebraic sign of the result. Of the various forms in which the minimum-error-rate discriminant function can be written, the following two are particularly convenient:

g(x) = P(ω₁|x) − P(ω₂|x)

g(x) = ln [p(x|ω₁) / p(x|ω₂)] + ln [P(ω₁) / P(ω₂)]

4. What do you know about normal density and discriminant functions? [MODEL QUESTION]
Answer:
Before talking about discriminant functions for the normal density, we first need to know what a normal distribution is and how it is represented for a single variable and for a vector variable. Let us begin with the continuous univariate normal, or Gaussian, density:

p(x) = (1 / (√(2π) σ)) exp[ −(1/2) ((x − μ)/σ)² ],

for which the expected value of x is

μ = E[x] = ∫ x p(x) dx

and the expected squared deviation, or variance, is

σ² = E[(x − μ)²] = ∫ (x − μ)² p(x) dx.

The univariate normal density is completely specified by two parameters: its mean μ and variance σ². It can be written as N(μ, σ²), which says that x is distributed normally with mean μ and variance σ². Samples from a normal distribution tend to cluster about the mean with a spread related to the standard deviation σ.

For the multivariate normal density in d dimensions, p(x) is written as

p(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp[ −(1/2) (x − μ)ᵗ Σ⁻¹ (x − μ) ],

where x is a d-component column vector, μ is the d-component mean vector, Σ is the d-by-d covariance matrix, |Σ| and Σ⁻¹ are its determinant and inverse respectively, and (x − μ)ᵗ denotes the transpose of (x − μ). Moreover,

μ = E[x] = ∫ x p(x) dx  and  Σ = E[(x − μ)(x − μ)ᵗ] = ∫ (x − μ)(x − μ)ᵗ p(x) dx,

where the expected value of a vector or a matrix is found by taking the expected values of its individual components; i.e., if x_i is the ith component of x, μ_i the ith component of μ, and σ_ij the ijth component of Σ, then

μ_i = E[x_i]  and  σ_ij = E[(x_i − μ_i)(x_j − μ_j)].

The covariance matrix Σ is always symmetric and positive definite, which means that the determinant of Σ is strictly positive. The diagonal elements σ_ii are the variances of the respective x_i (i.e., σ_i²), and the off-diagonal elements σ_ij are the covariances of x_i and x_j. If x_i and x_j are statistically independent, then σ_ij = 0. If all off-diagonal elements are zero, p(x) reduces to the product of the univariate normal densities for the components of x.
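As a quick numerical illustration of the multivariate normal density defined above, the sketch below evaluates p(x) directly from the formula and cross-checks it against scipy.stats.multivariate_normal (assuming SciPy is available). The mean vector, covariance matrix and query point are invented for the example.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_pdf(x, mu, sigma):
    """Evaluate the d-dimensional normal density at x from the formula above."""
    d = len(mu)
    diff = x - mu
    inv = np.linalg.inv(sigma)
    norm_const = 1.0 / (np.power(2 * np.pi, d / 2) * np.sqrt(np.linalg.det(sigma)))
    return norm_const * np.exp(-0.5 * diff @ inv @ diff)

# Illustrative parameters (not from the text)
mu = np.array([0.0, 1.0])
sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])    # symmetric, positive definite
x = np.array([0.5, 0.5])

print(mvn_pdf(x, mu, sigma))                     # direct evaluation of the formula
print(multivariate_normal(mu, sigma).pdf(x))     # same value via SciPy
```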
Discriminant Functions:
Discriminant functions are used to find the minimum probability of error in decision-making problems. In a problem with feature vector x and state-of-nature variable w, we can represent the discriminant function for class w_i as

g_i(x) = ln p(x|w_i) + ln P(w_i),

where p(x|w_i) is the conditional probability density function for x with w_i being the state of nature, and P(w_i) is the prior probability that nature is in state w_i. If we take the class-conditional densities to be multivariate normal, that is, p(x|w_i) ~ N(μ_i, Σ_i), then the function becomes

g_i(x) = −(1/2)(x − μ_i)ᵗ Σ_i⁻¹ (x − μ_i) − (d/2) ln 2π − (1/2) ln |Σ_i| + ln P(w_i).

We will now look at the special cases of this discriminant function for a multivariate normal distribution.

Case 1: Σ_i = σ²I
This is the simplest case, and it occurs when the features are statistically independent and each feature has the same variance σ². Here the covariance matrix is diagonal, being simply σ² times the identity matrix I. This means that the samples of each class fall into equal-sized clusters centered about their respective mean vectors. The determinant and the inverse are easy to compute: |Σ_i| = σ^(2d) and Σ_i⁻¹ = (1/σ²)I. Because both |Σ_i| and the (d/2) ln 2π term are independent of i, we can ignore them, and thus we obtain the simplified discriminant function

g_i(x) = −‖x − μ_i‖² / (2σ²) + ln P(w_i),

where ‖·‖ denotes the Euclidean norm, that is, ‖x − μ_i‖² = (x − μ_i)ᵗ(x − μ_i). The discriminant function shows that the squared distance ‖x − μ_i‖² is normalized by the variance σ² and offset by adding ln P(w_i); therefore, if x is equally near two different mean vectors, the optimal decision will favour the a priori more likely category. Expansion of the quadratic form (x − μ_i)ᵗ(x − μ_i) yields

g_i(x) = −(1/(2σ²)) [xᵗx − 2μ_iᵗx + μ_iᵗμ_i] + ln P(w_i),

which looks like a quadratic function of x. However, the quadratic term xᵗx is the same for all i, so it can be ignored as an additive constant, and we obtain the equivalent linear discriminant function

g_i(x) = w_iᵗ x + w_i0,  where  w_i = μ_i / σ²  and  w_i0 = −μ_iᵗμ_i / (2σ²) + ln P(w_i).

Here w_i0 is the threshold or bias for the ith category. A classifier that uses linear discriminant functions is called a linear machine. For a linear machine, the decision surfaces are pieces of hyperplanes defined by the linear equations g_i(x) = g_j(x) for the two categories with the highest posterior probabilities. In this situation, the equation can be written as

wᵗ(x − x₀) = 0,  where  w = μ_i − μ_j  and  x₀ = (1/2)(μ_i + μ_j) − (σ² / ‖μ_i − μ_j‖²) ln [P(w_i)/P(w_j)] (μ_i − μ_j).

This equation defines a hyperplane through the point x₀ and orthogonal to the vector w. Because w = μ_i − μ_j, the hyperplane separating R_i and R_j is orthogonal to the line linking the means. If P(w_i) = P(w_j), the point x₀ is halfway between the means and the hyperplane is the perpendicular bisector of the line between the means (Fig. 1). If P(w_i) ≠ P(w_j), the point x₀ shifts away from the more likely mean.

Case 2: Σ_i = Σ
Another case occurs when the covariance matrices for all the classes are identical. This corresponds to a situation where the samples fall into hyperellipsoidal clusters of equal size and shape, with the cluster of the ith class centered about the mean vector μ_i. Both the |Σ_i| and the (d/2) ln 2π terms can again be ignored because they are independent of i. This leads to the simplified discriminant function

g_i(x) = −(1/2)(x − μ_i)ᵗ Σ⁻¹ (x − μ_i) + ln P(w_i).

If the prior probabilities P(w_i) are equal for all classes, then the ln P(w_i) term can be ignored; if they are unequal, the decision will be biased in favour of the a priori more likely class.
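The linear machine of Case 1 above is simple enough to code directly. The following is a small illustrative sketch, assuming Σ_i = σ²I with invented means, priors and σ²; it evaluates g_i(x) = w_iᵗx + w_i0 for every class and assigns x to the class with the largest value.

```python
import numpy as np

# Illustrative parameters for a 3-class problem with Sigma_i = (sigma^2) I
means = np.array([[0.0, 0.0],
                  [3.0, 0.0],
                  [0.0, 3.0]])         # mu_i, one row per class (made up)
priors = np.array([0.5, 0.3, 0.2])     # P(w_i), made up
sigma2 = 1.5                           # shared variance sigma^2, made up

# Case 1 linear machine: g_i(x) = w_i^T x + w_i0
W = means / sigma2                                        # w_i = mu_i / sigma^2
w0 = -np.sum(means ** 2, axis=1) / (2 * sigma2) + np.log(priors)

def classify(x):
    g = W @ x + w0          # discriminant value g_i(x) for every class
    return np.argmax(g)     # assign x to the class with the largest g_i(x)

print(classify(np.array([2.0, 1.0])))   # index of the winning class
```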
Case 3: Σ_i arbitrary
In the general multivariate Gaussian case, where the covariance matrices are different for each class, the only term that can be dropped from the initial discriminant function is the (d/2) ln 2π term. The resulting discriminant function is quadratic:

g_i(x) = xᵗ W_i x + w_iᵗ x + w_i0,

where W_i = −(1/2) Σ_i⁻¹, w_i = Σ_i⁻¹ μ_i, and w_i0 = −(1/2) μ_iᵗ Σ_i⁻¹ μ_i − (1/2) ln |Σ_i| + ln P(w_i).

This leads to hyperquadric decision boundaries.

Example: Given a set of data from a distribution with two classes w₁ and w₂, both with prior probability 0.5, find the discriminant functions and the decision boundary. [Data table omitted.]
Since the points in each class have the same variance, we use the Case 1 equations. From the data, the means are approximately μ₁ = 0.44 and μ₂ = 0.543, and the shared variance is σ² = 52.62. The discriminant functions are then

g₁(x) = −(x − 0.44)² / (2 × 52.62) + ln(0.5)
g₂(x) = −(x − 0.543)² / (2 × 52.62) + ln(0.5),

and the decision boundary x₀ lies halfway between the means, at 0.492, because the two classes have the same prior probability.

4. What do you understand by continuous and discrete features in Bayes decision theory? [MODEL QUESTION]
Answer:
Continuous features: Allowing the use of more than one feature just means that we replace the scalar x with the feature vector Y, where Y lies in a d-dimensional Euclidean space called the feature space. Allowing more than two states of nature provides a useful generalization for a small notational expense. Allowing more actions also opens up the possibility of rejection, i.e., refusing to make a decision in cases that are too close; this can be very useful if being indecisive is not too costly. The loss function states exactly how costly each chosen action is and is used to convert a probability determination into a decision. Loss functions enable us to treat situations in which certain errors are more costly than others, although we will often only be looking at cases where all errors are equally costly.
Putting this together, let {x₁, ..., x_c} be the finite set of c states of nature and let {k₁, ..., k_a} be the finite set of a possible actions. The loss function λ(k_i|x_j) describes the loss incurred for taking action k_i when the state of nature is x_j. Let Y be a d-component vector-valued random variable and let p(Y|x_j) be the conditional probability density function for Y with x_j being the true state of nature. As discussed before, P(x_j) is the prior probability that nature is in state x_j; therefore, by using Bayes' formula we can find the posterior probability P(x_j|Y):

P(x_j|Y) = p(Y|x_j) P(x_j) / p(Y),  where  p(Y) = Σ_j p(Y|x_j) P(x_j).

Now suppose we observe a particular feature vector Y and we decide to take action k_i. If the true state of nature is x_j, then from the definition of the loss function above we will incur the loss λ(k_i|x_j). Because P(x_j|Y) is the probability that the true state of nature is x_j, the expected loss associated with taking action k_i is

R(k_i|Y) = Σ_j λ(k_i|x_j) P(x_j|Y).

In decision theory terminology, an expected loss is called a risk, and R(k_i|Y) is called the conditional risk. So whenever we have an observation Y, we can minimize the expected loss by choosing the action that minimizes the conditional risk. To minimize the overall risk, compute the conditional risk for every action i = 1, ..., a and then select the action k_i for which R(k_i|Y) is minimum. The resulting minimum overall risk is called the Bayes risk and is denoted by R*.
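To make the risk computation concrete, here is a short sketch (an illustration, not from the text) that evaluates the conditional risk R(k_i|Y) = Σ_j λ(k_i|x_j) P(x_j|Y) for an invented loss matrix and posterior vector, including a hypothetical third "reject" action, and then selects the minimum-risk action.

```python
import numpy as np

# Posterior probabilities P(x_j | Y) for c = 2 states of nature (illustrative)
posterior = np.array([0.7, 0.3])

# Loss matrix lambda(k_i | x_j): rows = actions, columns = true states.
# Actions: decide state 1, decide state 2, reject (refuse to decide).
loss = np.array([[0.0, 2.0],    # deciding state 1
                 [1.0, 0.0],    # deciding state 2
                 [0.3, 0.3]])   # rejection carries a small fixed cost

# Conditional risk R(k_i | Y) = sum_j lambda(k_i | x_j) P(x_j | Y)
risk = loss @ posterior
best_action = np.argmin(risk)   # Bayes rule: take the minimum-risk action
print(risk, best_action)
```

With these made-up numbers the rejection action wins whenever the posteriors are close, which is exactly the "too close to call" behaviour described above.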
Discrete Features: In many practical applications the components of the feature vector are binary, ternary, or higher integer valued, so that Y can assume one of m discrete values {v₁, ..., v_m}. In such cases the integrals over the probability density functions become sums of the form Σ P(Y|x_j), where we understand that the summation is over all values of Y in the discrete distribution. Bayes' formula then involves probabilities, rather than probability densities, so we have

P(x_j|Y) = P(Y|x_j) P(x_j) / P(Y),  where  P(Y) = Σ_j P(Y|x_j) P(x_j).

PARAMETERS ESTIMATION METHODS

Multiple Choice Type Questions

1. The maximum likelihood estimate is [MODEL QUESTION]
a) a minimum of the likelihood, not necessarily in the parameter space
b) a maximum of the likelihood, in the parameter space
c) a maximum of the likelihood, not necessarily in the parameter space
d) a minimum of the likelihood, in the parameter space
Answer: (b)

Short Answer Type Questions

1. What are parameters? [MODEL QUESTION]
Answer:
Often in machine learning we use a model to describe the process that results in the data that are observed. For example, we may use a random forest model to classify whether customers may cancel a subscription from a service (known as churn modelling), or we may use a linear model to predict the revenue that will be generated for a company depending on how much it spends on advertising (this would be an example of linear regression). Each model contains its own set of parameters that ultimately define what the model looks like.
For a linear model we can write this as y = mx + c. In this example x could represent the advertising spend and y might be the revenue generated; m and c are parameters for this model. Different values for these parameters give different lines. [Figure: three linear models with different parameter values.]
So parameters define a blueprint for the model. It is only when specific values are chosen for the parameters that we get an instantiation of the model that describes a given phenomenon.

2. What is maximum likelihood estimation? [MODEL QUESTION]
Answer:
In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference.
If the likelihood function is differentiable, the derivative test for determining maxima can be applied. In some cases the first-order conditions of the likelihood function can be solved explicitly; for instance, the ordinary least squares estimator maximizes the likelihood of the linear regression model. Under most circumstances, however, numerical methods will be necessary to find the maximum of the likelihood function. The likelihood function can also be non-convex, with multiple local maxima, requiring the use of heuristic global optimization techniques.

3. What is the Expectation-Maximization method? [MODEL QUESTION]
Answer:
In most real-life machine learning problems it is very common that many relevant features are available for building a model, but only a small portion of them are observable. Since we do not have values for the unobserved (latent) variables, the Expectation-Maximization algorithm uses the existing data to determine the optimum values for these variables and then finds the model parameters.

Long Answer Type Questions

1. Explain maximum likelihood estimation. [MODEL QUESTION]
Answer:
There are many methods for estimating unknown parameters from data. We will first consider the Maximum Likelihood Estimate (MLE), which answers the question: for which parameter value does the observed data have the biggest probability?
The MLE is an example of a point estimate because it gives a single value for the unknown parameter (later our estimates will involve intervals and probabilities). Two advantages of the MLE are that it is often easy to compute and that it agrees with our intuition in simple examples. We will explain the MLE through a series of examples.

Example 1: A coin is flipped 100 times. Given that there were 55 heads, find the maximum likelihood estimate for the probability p of heads on a single toss.
Before actually solving the problem, let's establish some notation and terms. We can think of counting the number of heads in 100 tosses as an experiment. The probability of getting 55 heads in this experiment is the binomial probability

P(55 heads) = C(100, 55) p^55 (1 − p)^45.

The probability of getting 55 heads depends on the value of p, so let's include p in the notation by using conditional probability:

P(55 heads | p) = C(100, 55) p^55 (1 − p)^45.

You should read P(55 heads | p) as "the probability of 55 heads given p", or, more precisely, as "the probability of 55 heads given that the probability of heads on a single toss is p".
Here are some standard terms we will use as we do statistics.
- Experiment: flip the coin 100 times and count the number of heads.
- Data: the data is the result of the experiment. In this case it is "55 heads".
- Parameter(s) of interest: we are interested in the value of the parameter p.
- Likelihood, or likelihood function: this is P(data | p). Note it is a function of both the data and the parameter p. In this case the likelihood is P(55 heads | p) = C(100, 55) p^55 (1 − p)^45.
Look carefully at the definition. One typical source of confusion is to mistake the likelihood P(data | p) for P(p | data). We know from our earlier work with Bayes' theorem that P(data | p) and P(p | data) are usually very different.
Definition: Given data, the maximum likelihood estimate (MLE) for the parameter p is the value of p that maximizes the likelihood P(data | p). That is, the MLE is the value of p for which the data is most likely.
Answer: For the problem at hand, we saw above that the likelihood is P(55 heads | p) = C(100, 55) p^55 (1 − p)^45. We will write p̂ for the MLE. We use calculus to find it by taking the derivative of the likelihood function and setting it to 0:

d/dp P(55 heads | p) = C(100, 55) [55 p^54 (1 − p)^45 − 45 p^55 (1 − p)^44] = 0.

Solving this for p we get

55 p^54 (1 − p)^45 = 45 p^55 (1 − p)^44
55 (1 − p) = 45 p
55 = 100 p,

so the MLE is p̂ = 0.55.
Notes: 1. The MLE for p turned out to be exactly the fraction of heads we saw in our data. 2. The MLE is computed from the data; that is, it is a statistic. 3. Officially you should check that the critical point is indeed a maximum. You can do this with the second derivative test.

Log likelihood: It is often easier to work with the natural log of the likelihood function. For short this is simply called the log likelihood. Since ln(x) is an increasing function, the maxima of the likelihood and the log likelihood coincide.

Example 2: Redo the previous example using the log likelihood.
Answer: We had the likelihood P(55 heads | p) = C(100, 55) p^55 (1 − p)^45. Therefore the log likelihood is

ln P(55 heads | p) = ln C(100, 55) + 55 ln(p) + 45 ln(1 − p).

Maximizing the likelihood is the same as maximizing the log likelihood. We check that calculus gives us the same answer as before:

d/dp (log likelihood) = 55/p − 45/(1 − p) = 0  ⟹  55 (1 − p) = 45 p  ⟹  p̂ = 0.55.
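The closed-form answer p̂ = 0.55 can also be checked numerically. The sketch below (an illustration added here, not part of the original notes) evaluates the binomial log likelihood on a grid of p values and confirms where the maximum occurs.

```python
import numpy as np
from scipy.special import comb

n, k = 100, 55                       # 100 tosses, 55 heads
p = np.linspace(0.01, 0.99, 9801)    # grid of candidate parameter values

# log likelihood: log C(100, 55) + 55 log p + 45 log(1 - p)
log_lik = np.log(comb(n, k)) + k * np.log(p) + (n - k) * np.log(1 - p)

p_hat = p[np.argmax(log_lik)]
print(p_hat)    # ~0.55, matching the calculus result
```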
Maximum likelihood for continuous distributions: For continuous distributions we use the probability density function to define the likelihood. We show this in a few examples, and below we also explain how this is analogous to what we did in the discrete case.

Example 3 (light bulbs): The lifetime of Badger brand light bulbs is modeled by an exponential distribution with (unknown) parameter λ. We test 5 bulbs and find they have lifetimes of 2, 3, 1, 3 and 4 years, respectively. What is the MLE for λ?
Answer: We need to be careful with our notation. With five different values it is best to use subscripts. Let X_i be the lifetime of the ith bulb and let x_i be the value X_i takes. Then each X_i has pdf f_{X_i}(x_i) = λ e^(−λ x_i). We assume the lifetimes of the bulbs are independent, so the joint pdf is the product of the individual densities:

f(x₁, x₂, x₃, x₄, x₅ | λ) = (λ e^(−λx₁))(λ e^(−λx₂))(λ e^(−λx₃))(λ e^(−λx₄))(λ e^(−λx₅)) = λ⁵ e^(−λ(x₁ + x₂ + x₃ + x₄ + x₅)).

Note that we write this as a conditional density, since it depends on λ. Viewing the data as fixed and λ as a variable, this density is the likelihood function. Our data had values x₁ = 2, x₂ = 3, x₃ = 1, x₄ = 3, x₅ = 4, so the likelihood and log likelihood functions with this data are

f(2, 3, 1, 3, 4 | λ) = λ⁵ e^(−13λ),  ln f(2, 3, 1, 3, 4 | λ) = 5 ln(λ) − 13λ.

Finally we use calculus to find the MLE:

d/dλ (log likelihood) = 5/λ − 13 = 0  ⟹  λ̂ = 5/13.

Example 4 (normal distributions): Suppose the data x₁, x₂, ..., x_n is drawn from a N(μ, σ²) distribution, where μ and σ are unknown. Find the maximum likelihood estimate for the pair (μ, σ²).
Answer: Let's be precise and phrase this in terms of random variables and densities. Let uppercase X₁, ..., X_n be i.i.d. N(μ, σ²) random variables and let lowercase x_i be the value X_i takes. The density for each X_i is

f_{X_i}(x_i) = (1 / (√(2π) σ)) e^(−(x_i − μ)² / (2σ²)).

Since the X_i are independent, their joint pdf is the product of the individual pdfs:

f(x₁, ..., x_n | μ, σ) = (1 / (√(2π) σ))^n exp( −Σ_i (x_i − μ)² / (2σ²) ).

For the fixed data x₁, ..., x_n, the likelihood is this same expression viewed as a function of (μ, σ), and the log likelihood is

ln f(x₁, ..., x_n | μ, σ) = −n ln(√(2π)) − n ln(σ) − Σ_i (x_i − μ)² / (2σ²).

Example (why we use densities as likelihoods): Suppose two light bulbs have lifetimes x₁ = 2 years and x₂ = 3 years. Find the value of λ that maximizes the likelihood of this data.
Answer: The main paradox to deal with is that for a continuous distribution the probability of a single value, say x₁ = 2, is zero. We resolve this paradox by remembering that a single measurement really means a range of values; for example, we might check the light bulb once a day, so the data x₁ = 2 years really means x₁ is somewhere in a range of 1 day around 2 years. If the range is small we call it dx₁. The probability that X₁ is in the range is approximated by f_{X₁}(x₁|λ) dx₁; this is the usual relationship between density and probability for small ranges. The data x₂ is treated in exactly the same way. Since the data is collected independently, the joint probability is the product of the individual probabilities. Stated carefully,

P(X₁ in range, X₂ in range | λ) ≈ f_{X₁}(x₁|λ) dx₁ · f_{X₂}(x₂|λ) dx₂.

Finally, using the values x₁ = 2 and x₂ = 3 and the formula for an exponential density, we have

P(X₁ in range, X₂ in range | λ) = λ e^(−2λ) dx₁ · λ e^(−3λ) dx₂ = λ² e^(−5λ) dx₁ dx₂.

Now that we have a genuine probability we can look for the value of λ that maximizes it. Looking at the formula above, we see that the factor dx₁ dx₂ plays no role in finding the maximum. So for the MLE we drop it and simply call the density the likelihood:

likelihood = f(x₁, x₂ | λ) = λ² e^(−5λ).

The value of λ that maximizes this is found just as in the example above; it is λ̂ = 2/5.
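Both continuous examples have simple closed forms: for the exponential model the MLE is λ̂ = n / Σ x_i, and for the normal model the MLEs are the sample mean and the (1/n) sample variance. The sketch below is an illustrative numerical check, reusing the light-bulb lifetimes from Example 3 and an invented data set for the normal case.

```python
import numpy as np

lifetimes = np.array([2.0, 3.0, 1.0, 3.0, 4.0])   # data from the light bulb example

# Exponential log likelihood: n*log(lam) - lam*sum(x); the MLE is n / sum(x)
lam_closed = len(lifetimes) / lifetimes.sum()      # 5/13, about 0.385

lam_grid = np.linspace(0.01, 2.0, 2000)
log_lik = len(lifetimes) * np.log(lam_grid) - lam_grid * lifetimes.sum()
lam_numeric = lam_grid[np.argmax(log_lik)]
print(lam_closed, lam_numeric)                     # both close to 5/13

# Normal case: maximizing the log likelihood gives the sample mean and the
# biased (1/n) sample variance as the MLEs of mu and sigma^2.
x = np.array([1.2, 0.7, 2.1, 1.5, 0.9])            # illustrative data only
mu_hat, var_hat = x.mean(), x.var()                # np.var uses 1/n by default
print(mu_hat, var_hat)
```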
3. What is a Gaussian mixture model (GMM)? [MODEL QUESTION]
Answer:
The Gaussian mixture model is a clustering algorithm that is used to discover the underlying groups in data. It can be understood as a probabilistic model in which a Gaussian distribution is assumed for each group, with a mean and a covariance that define its parameters; a GMM thus consists of two kinds of parameters, mean vectors (μ) and covariance matrices (Σ). A Gaussian distribution is a continuous distribution that takes on a bell-shaped curve; another name for the Gaussian distribution is the normal distribution. [Figure: an example of a Gaussian mixture model.]

4. What is the expectation-maximization (EM) method in relation to GMM? [MODEL QUESTION]
Answer:
In Gaussian mixture models, the expectation-maximization (EM) method is used to estimate the mixture model parameters; expectation is termed E and maximization is termed M. EM is a two-step iterative algorithm. In the expectation (E) step, we use the current parameter estimates to compute, for each data point, the expected (posterior) probability that it belongs to each Gaussian component. In the maximization (M) step, we update the parameters of each Gaussian (means, covariances and mixing weights) using maximum likelihood estimates weighted by those probabilities. This iterative process is repeated until the Gaussian parameters converge. [Figure: the two-step iteration; E-step: update the latent variables, M-step: update the hypothesis (parameters).]

5. What are the key steps of using Gaussian mixture models? [MODEL QUESTION]
Answer:
The following are the steps in using Gaussian mixture models:
- Determining a covariance matrix that defines how each Gaussian is related to the others; the more similar two Gaussians are, the closer their means will be.
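As an illustration of fitting a Gaussian mixture with the EM iterations described above, the sketch below uses scikit-learn's GaussianMixture on synthetic two-cluster data (assuming scikit-learn is installed); the data and the choice of two components are made up for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data: two Gaussian clusters in 2-D (illustrative only)
cluster1 = rng.normal(loc=[0.0, 0.0], scale=0.8, size=(200, 2))
cluster2 = rng.normal(loc=[4.0, 4.0], scale=1.2, size=(200, 2))
X = np.vstack([cluster1, cluster2])

# Fit a 2-component GMM; internally this runs the E-step / M-step iterations
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)

print(gmm.weights_)              # mixing proportions
print(gmm.means_)                # estimated mean vectors (mu)
print(gmm.covariances_)          # estimated covariance matrices (Sigma)
labels = gmm.predict(X)          # hard cluster assignments from the posteriors
resp = gmm.predict_proba(X[:5])  # E-step responsibilities for a few points
print(labels[:5], resp)
```

The estimated means land near the two cluster centres, and predict_proba exposes exactly the per-point component probabilities that the E-step computes.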
