Lecture 6 - Regression

Uploaded by

Samuel LI
Copyright
© © All Rights Reserved
Available Formats
Download as PDF or read online on Scribd
0% found this document useful (0 votes)
18 views

Lecture 6 - Regression

Uploaded by

Samuel LI
Copyright
© © All Rights Reserved
Available Formats
Download as PDF or read online on Scribd
You are on page 1/ 6
2181S Introduction to Data Mining
Lecture 6: Regression

Regression
* Regression is a method for finding the line in a scatter plot that best predicts the dependent variable (on the y-axis) from different values of the independent variable (on the x-axis).
* In this lecture, we will talk about:
  – some different forms of regression
  – significance of regression
  – interpreting regression results in SPSS
  – other forms of correlation

Types of Regression
* Regarding the equation of the regression line:
  – Linear regression: the regression equation is always a straight line, y = a + bx
  – Non-linear regression: we can derive a non-linear regression line, such as y = px² + qx + r
* Regarding the number of independent variables:
  – Single regression: one independent variable (only one x)
  – Multiple regression: more than one independent variable; linear multiple regression takes the form y = b1x1 + b2x2 + … + bnxn + a (a is the y-intercept when all xi are 0)

Types of Regression (summary)

                             | Linear (y = a + bx)          | Non-linear (e.g. log, exponential, polynomial)
  1 independent variable     | Single linear regression     | Single non-linear regression
  >1 independent variables   | Multiple linear regression   | Multiple non-linear regression

Non-Linear Regression in Excel
* [Figure: scatter plot with a fitted polynomial trendline; Excel reports the trendline equation, approximately y = 0.25x³ − 2.9829x² + 8.9571x − 4.2]

Multiple Regression in SPSS
* Example: a person's salary in a company is expected to be related to these four factors:
  – age
  – # working hours
  – # customers dealt with
  – # awards earned

Finding Coefficients in SPSS (Multiple Linear Regression)
* Results:
  – Equation: salary = 500.3 × (# awards) + 43.3 × (# customers) + 30.5 × (# hours) − 4.5 × (age) + 309.8
  – Salary is not very related to age; the basic salary is about $300.

Back to Simple Linear Regression
* We are quite concerned with:
  – How good is the estimate from the regression line for predicting the y-value of each data point?
* The R² coefficient can give you a rough idea of how strongly the numbers are related.
  – However, sometimes the R² value can be misleading!
* Hence, we need some more mathematical ways to help us check whether the two variables are correlated or not.

Drawback of the R² Statistic
* A few wrong data points with extreme values may cause the regression line and the R² value to change dramatically.
  – e.g. with the data (1,1), (2,4), (3,3), (4,4), (5,6), R² is high, but changing (5,6) to (5,0) causes R² to collapse to nearly 0.

Accuracy of the Samples
* The slope and correlation obtained are calculated from the samples, hence they are only estimates of the relation between the two variables.
  – If other samples are obtained from the population, we may get different values for the slope and correlation coefficient.
  – If, unfortunately, we get extreme samples, the slope and correlation coefficient may actually give wrong conclusions. For example, the samples may show correlation between two variables that are actually independent.
  – Of course, this chance should be small.

Determining the Accuracy
* So, how good is this estimate of the correlation?
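The R² drawback just described can be checked directly. Below is a minimal sketch (not part of the original slides) using NumPy with the slide's five data points; the R² values printed are computed here rather than taken from the slide, so they may differ from the figure quoted there:

```python
import numpy as np

def r_squared(x, y):
    """Pearson r squared for a simple linear fit of y on x."""
    r = np.corrcoef(x, y)[0, 1]
    return r * r

x = np.array([1, 2, 3, 4, 5], dtype=float)
y_clean   = np.array([1, 4, 3, 4, 6], dtype=float)  # last point is (5, 6), as on the slide
y_outlier = np.array([1, 4, 3, 4, 0], dtype=float)  # (5, 6) replaced by the extreme (5, 0)

print(round(r_squared(x, y_clean), 3))    # → 0.758 (a reasonably strong linear relation)
print(round(r_squared(x, y_outlier), 3))  # → 0.03  (one extreme point destroys it)
```

A single extreme point swings R² from a strong fit to almost none, which is why the significance checks that follow are needed rather than trusting R² alone.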
  – Generally, if the slope calculated from the samples is far from 0, there is a very high chance that the two variables are correlated (hence it is safe to claim so).
* We can find out the range of values for the slope such that this range has a certain accuracy.
  – Usually we use 95% accuracy (a confidence interval), meaning that 95% of the time the prediction about the range of slopes is accurate.
  – If this range does not include 0, we are 95% confident that the two variables are correlated.

Determining the Accuracy (cont.)
* In fact, if we take all the different sets of samples from a population and compute their slopes, the different values of the slope form a normal distribution.
  – For samples with fewer items (N < 30), the distribution is called the t-distribution with (N − 1) degrees of freedom.
  – The t-distribution looks like the normal distribution, except that with smaller degrees of freedom the tails are heavier (see next slide).
* We can then perform a t-test to check whether the slope found from our sample is safe enough to conclude that the variables are correlated.

What the t-distribution looks like
* Different t-distribution curves with degrees of freedom (df) from 1 to 8:
  – a t-distribution is always symmetric about 0
  – the larger the df, the closer it is to the normal distribution

Steps for Performing a t-test
* First, we assume that the two variables are independent (i.e., slope = 0).
  – This is known as the null hypothesis.
* Next, we calculate the value of t using the formula:

      t = (X − μ) / (S / √N)

  where
      X = our sampled slope
      μ = the slope in the null hypothesis (0)
      S = standard deviation
      N = number of samples

  – This is actually a normalization to the t-curve with (N − 1) degrees of freedom, where N is the number of valid data items in the sample.

Illustrating with an Example
* With these sample data, do you think X and Y are correlated? (slope = 1.2)
  – What is the chance of choosing these samples if X and Y were actually independent?

Illustrating with an Example (cont.)
* [Figure: a t-curve with the 95% confidence interval marked. If the t value lies within the interval [−c, c], we should accept the null hypothesis; if it falls outside the 95% confidence interval, the null hypothesis can be safely rejected. The c value can be found from a table or from Excel.]

Steps for Performing a t-test (cont.)
* Thirdly, we need to determine a significance level for rejecting the null hypothesis.
  – If we want a 95% confidence level, the significance level (call it α) will be 1 − 0.95 = 0.05.
* Then, find the critical value c such that the areas at the two tails of the t-curve total α.
  – c actually measures how extreme t has to be before the sampled slope can give a wrong decision:
    + in other words, the null hypothesis is true but we reject it
    + i.e., our sampled slope indicates the two variables are correlated while they are actually independent

Steps for Performing a t-test (cont.)
* Finally, if t < −c or t > c (i.e., t is located at either tail of the curve), we can safely reject the null hypothesis.
  – This means the two variables are correlated; our conclusion is wrong with probability lower than α.
* Otherwise, we must accept the null hypothesis, meaning the two variables can be independent.
  – We cannot safely claim the variables are correlated.
  – There is a chance greater than we can tolerate (i.e., > α) that the variables are actually independent.

An Alternative Way
* Instead of finding the value of c from α, we can find the probability P that the t-curve has a value larger than t or smaller than −t.
  – Note that you need to sum up both tail areas of the t-curve (that is, multiply one tail area by 2).
  – P is the probability of wrongly rejecting the null hypothesis when it is actually true (i.e., of drawing a sample at least this extreme by chance).
  – So, if P is smaller than α, the chance of a wrong decision is within our tolerance limit.
  – This means we can safely reject the null hypothesis: the two variables are correlated.
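This two-tailed P-value procedure can be sketched in a few lines with SciPy. The data points below are made up for illustration (they are not the slide's example), and note one refinement: SciPy tests a regression slope against a t-distribution with N − 2 degrees of freedom, since both the slope and the intercept are estimated:

```python
from scipy import stats

# Made-up sample data with an apparent upward trend
x = [1, 2, 3, 4, 5, 6]
y = [1.1, 2.3, 2.9, 4.2, 4.8, 6.1]

res = stats.linregress(x, y)

t_stat = res.slope / res.stderr        # t = (sampled slope - 0) / SE of the slope
df = len(x) - 2                        # degrees of freedom for the slope test
p = 2 * stats.t.sf(abs(t_stat), df)    # sum of both tail areas of the t-curve

alpha = 0.05                           # significance level for 95% confidence
print(p < alpha)                       # True -> safely reject the null hypothesis
```

`linregress` already reports this same two-sided value as `res.pvalue`, so the manual computation above only mirrors the steps on the slides.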
Illustrating the P value
* [Figure: a t-curve with the areas beyond ±t shaded. The area beyond the blue line shows the values that cause a wrong decision (the null hypothesis is true but we reject it); its probability is P (two-sided). If this area is within the significance level, i.e., the maximum tolerable chance of a wrong decision, we can safely reject the null hypothesis.]

Conclusion
* Crosstable vs. scatter plot:
  – Both can show the relation of 2 variables.
  – Use a crosstable when:
    + you want to find out the exact percentage breakdown
    + there is a small number of possible values for the variables
  – Use a scatter plot when:
    + you want to predict the value of one variable using another variable (i.e., obtain a regression line)
    + the variable(s) take continuous values or a very large number of possible values (e.g. integers 0–100)
      (if the variables have only 2 or 3 possible values, the dots will be superimposed on each other)
* The correlation we used is known as the Pearson correlation coefficient.
  – It is a type of parametric correlation.
  – It works best on observational statistical data (not experimental data), in which we cannot manipulate both variables.
  – The distribution of these variables usually obeys the Normal distribution.
* If the distribution of one or more of the variables is known to be non-Normal, use Spearman's correlation.
  – Spearman's correlation is a type of non-parametric correlation, available in SPSS.
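The Pearson/Spearman distinction can be seen with a small sketch (not from the slides; the data are made up). For a relation that is monotone but far from linear and far from Normal, the rank-based Spearman coefficient captures the relation better than Pearson:

```python
from scipy import stats

x = [1, 2, 3, 4, 5, 6, 7]
# Roughly exponential growth: monotone, but strongly non-linear and skewed
y = [2.7, 7.4, 20.1, 54.6, 148.4, 403.4, 1096.6]

pearson, _ = stats.pearsonr(x, y)     # parametric: assumes a linear, roughly Normal relation
spearman, _ = stats.spearmanr(x, y)   # non-parametric: works on ranks only

print(round(spearman, 3))   # → 1.0 (a perfect monotone relation)
print(pearson < spearman)   # → True: Pearson understates the relation here
```

Since every increase in x comes with an increase in y, the ranks agree perfectly and Spearman's coefficient is 1, while Pearson's is pulled down by the non-linearity.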
