2181S
Introduction to Data Mining
Lecture 6 Regression
Regression
• Regression is a method for finding the line in a scatter plot that best predicts the dependent variable (on the y-axis) from different values of the independent variable (on the x-axis)
• In this lecture, we will talk about:
– Some different forms of regression
– Significance of regression
– Interpreting regression results in SPSS
– Other forms of correlation
Types of Regression
• Different types of regression:
– Regarding the equation of the regression line:
+ Linear regression: the regression equation is always a straight line: y = a + bx (a fit of this form is sketched in code after this slide)
+ Non-linear regression: we can derive a non-linear regression line, such as y = px² + qx + r
– Regarding the number of independent variables:
+ Single regression: one independent variable, only one x
+ Multiple regression: more than one independent variable; linear multiple regression takes the form y = b₁x₁ + b₂x₂ + … + bₙxₙ + a (a is the y-intercept, the value of y when all xᵢ are 0)
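The slides use SPSS and Excel, but the same fit is easy to reproduce in code. Below is a minimal Python sketch of single linear regression; the data points are made up for illustration, since the slides give none here.

```python
import numpy as np

# Made-up sample: y is roughly 2 + 3x plus a little noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 7.9, 11.2, 13.8, 17.1])

# Fit the straight line y = a + bx; polyfit returns the slope first.
b, a = np.polyfit(x, y, deg=1)
print(f"y = {a:.2f} + {b:.2f}x")
```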
Types of Regression

                          | Linear                     | Non-linear (e.g. log,
                          | (expressed as y = a + bx)  | exponential, polynomial)
 1 independent variable   | Single linear regression   | Single non-linear regression
 >1 independent variables | Multiple linear regression | Multiple non-linear regression
Non-Linear Regression in Excel
• Example: fitting a polynomial trendline to a scatter plot gives a non-linear regression equation such as y = 0.25x³ − 2.9829x² + 8.9571x − 4.2 (the same kind of fit is sketched in code below)
[Figure: scatter plot with the fitted polynomial trendline]

Multiple Regression in SPSS
• Example: a person's salary in a company is expected to be related to these four factors:
– Age
– # Working Hours
– # customers dealt with
– # awards earned
[Figure: SPSS data view of the salary data set]
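The Excel trendline above can be reproduced in Python. The (x, y) points below are hypothetical, generated to follow the slide's equation, since the original chart data is not recoverable.

```python
import numpy as np

# Hypothetical points generated from the slide's trendline plus small "noise".
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = 0.25 * x**3 - 2.9829 * x**2 + 8.9571 * x - 4.2
y += np.array([0.1, -0.2, 0.15, -0.1, 0.05, -0.05])

# A degree-3 polynomial fit, the same kind of fit Excel's polynomial
# trendline performs.
p, q, r, s = np.polyfit(x, y, deg=3)
print(f"y = {p:.4f}x^3 + {q:.4f}x^2 + {r:.4f}x + {s:.4f}")
```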
Finding Coefficients in SPSS: Multiple Linear Regression
• Results (an equivalent least-squares fit is sketched in code below):
– Equation: salary = 500.3 × (# awards) + 43.3 × (# customers) + 30.5 × (# hours) − 4.5 × (age) + 309.8
– Salary is not strongly related to age; the basic salary is about $300
[Figure: SPSS coefficients output table]
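As a sketch of what SPSS computes here, multiple linear regression is an ordinary least-squares problem. The salary rows below are invented for illustration; only the form of the equation matches the slide.

```python
import numpy as np

# Invented rows of (awards, customers, hours, age) and salary.
X = np.array([[1, 20, 40, 25],
              [3, 35, 45, 40],
              [0, 10, 38, 22],
              [2, 28, 42, 30],
              [4, 40, 50, 45],
              [1, 15, 36, 28]], dtype=float)
y = np.array([2000.0, 3500.0, 1200.0, 2800.0, 4500.0, 1700.0])

# Append a column of ones so the last coefficient is the intercept a.
A = np.hstack([X, np.ones((X.shape[0], 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
b1, b2, b3, b4, a = coef
print(f"salary = {b1:.1f}(awards) + {b2:.1f}(customers) + "
      f"{b3:.1f}(hours) + {b4:.1f}(age) + {a:.1f}")
```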
Back to Simple Linear Regression
• We are quite concerned with:
– How good is the estimate from the regression line for predicting the y-value of each data point?
• The R² coefficient can give you a rough idea of how strongly the numbers are related
– However, sometimes the R² value can be misleading!
• Hence, we need some more mathematical ways to help us check whether the two variables are correlated or not
Drawback of the R² Statistic
• A few wrong data points with extreme values may cause the regression line and the R² value to change dramatically
– e.g. with the data (1,1), (2,2), (3,3), (4,4), (5,6), R² = 0.973, but changing (5,6) to (5,0) drops R² to 0 (verified in the code sketch below)
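The numbers on this slide can be checked directly with a few lines of Python:

```python
import numpy as np

def r_squared(x, y):
    # R^2 here is the square of the Pearson correlation coefficient.
    return np.corrcoef(x, y)[0, 1] ** 2

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1, 2, 3, 4, 6], dtype=float)
print(f"R^2 = {r_squared(x, y):.3f}")   # 0.973

y[-1] = 0                               # one extreme value: (5,6) -> (5,0)
print(f"R^2 = {r_squared(x, y):.3f}")   # 0.000
```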
Accuracy of the Samples
• The slope and correlation obtained are calculated from the samples, hence they are only an estimate of the relation between the two variables
– If other samples are obtained from the population, we may get another value for the slope and correlation coefficient
– If, unfortunately, we get extreme samples, the slope and correlation coefficient may actually give wrong conclusions. For example, the samples may show correlation between two independent variables
– Of course, this chance should be small
Determining the Accuracy
• So, how good is this estimate of the correlation?
– Generally, if the slope calculated from the samples is far from 0, there is a very high chance that the two variables are correlated (hence it is safe to claim so)
• We can find the range of values for the slope such that this range has a certain accuracy (see the code sketch below)
– Usually we use 95% accuracy (a confidence interval), meaning that 95% of the time the prediction about the range of slopes is accurate
– If this range does not include 0, we are 95% confident that the two variables are correlated
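A sketch of this confidence-interval check in Python, with made-up data. One caveat: scipy's slope test uses N − 2 degrees of freedom, while these slides simplify to N − 1.

```python
import numpy as np
from scipy import stats

# Made-up sample; the slide does not fix a particular data set here.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.3, 8.8])

res = stats.linregress(x, y)
df = len(x) - 2                    # scipy's convention for a slope test
c = stats.t.ppf(0.975, df)         # two-tailed 95% critical value
lo = res.slope - c * res.stderr
hi = res.slope + c * res.stderr
print(f"slope = {res.slope:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
# If this interval excludes 0, we are 95% confident x and y are correlated.
```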
Determining the Accuracy
• In fact, if we found all the different sets of samples from a population and computed their slopes, the different values of the slope would form a normal distribution (simulated in the code sketch below)
– For samples with fewer items (N < 30), the distribution is called the t-distribution with (N−1) degrees of freedom
– The t-distribution actually looks like the normal distribution, except that with smaller degrees of freedom the tails are heavier and the peak is lower (see next slide)
– We can then perform a t-test to check whether the slope found from our sample is safe enough to conclude that the variables are correlated
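The claim that repeated sampling produces a whole distribution of slopes is easy to simulate. A quick sketch drawing many samples from a hypothetical population whose true slope is 1:

```python
import numpy as np

rng = np.random.default_rng(0)
true_slope, n_samples, n_points = 1.0, 10_000, 10

slopes = np.empty(n_samples)
for i in range(n_samples):
    # Each iteration draws a fresh sample from the same population.
    x = rng.uniform(0, 10, n_points)
    y = true_slope * x + rng.normal(0, 2, n_points)
    slopes[i] = np.polyfit(x, y, 1)[0]

print(f"mean of sampled slopes = {slopes.mean():.3f}")  # close to 1.0
print(f"std dev of slopes      = {slopes.std():.3f}")   # sampling variability
```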
What the t-distribution looks like
• Different t-distribution curves with degrees of freedom (df) from 1 to 8
[Figure: t-distribution curves for df = 1 to 8; the t-distribution is always symmetric, and a bigger df brings it closer to the normal distribution]
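The shape difference can be verified numerically: the t-distribution's peak height approaches the normal's as df grows. A small check with scipy:

```python
from scipy import stats

# pdf at 0 (the peak): small df -> lower peak and heavier tails;
# as df grows, the t-distribution approaches the standard normal.
for df in (1, 2, 4, 8, 30):
    print(f"df={df:2d}: t pdf(0) = {stats.t.pdf(0, df):.4f}")
print(f"normal pdf(0) = {stats.norm.pdf(0):.4f}")   # 0.3989
```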
Steps for Performing a t-test
• First, we assume that the two variables are independent (i.e., slope = 0)
– This is known as the null hypothesis
• Next, we calculate the value of t using the formula (a worked example follows this slide):

  t = (X − μ) / (S / √N)

  X = our sampled slope
  μ = the slope in the null hypothesis (0)
  S = standard deviation
  N = number of samples

– It is actually a normalization to the t-curve with (N−1) degrees of freedom, where N is the number of valid data items in the sample
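Plugging hypothetical numbers into this formula; the slide does not give S or N, so those values below are assumptions:

```python
import math

X = 1.2    # sampled slope (matching the next slide's example)
mu = 0.0   # slope under the null hypothesis
S = 1.5    # standard deviation (assumed)
N = 10     # number of valid data items (assumed)

t = (X - mu) / (S / math.sqrt(N))
print(f"t = {t:.3f}")   # compare against the t-curve with N-1 df
```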
Illustrating with an Example
• With these sample data, do you think X and Y are correlated? (Slope = 1.2)
– What is the chance of choosing these samples if X and Y are actually independent?
Illustrating with an Example
[Figure: t-curve showing the 95% confidence interval. If the t value falls within [−c, c], we should accept the null hypothesis; here t lies outside the 95% confidence interval, so the null hypothesis can be safely rejected. The c value can be found from a table or from Excel.]
Steps for Performing a t-test
• Thirdly, we need to determine a significance level for rejecting the null hypothesis
– If we want a 95% confidence level, the significance level (call it α) will be 1 − 0.95 = 0.05
• Then, find the critical value c such that the areas at the two tails of the t-curve total α
– c measures how extreme t has to be before the sampled slope can give a wrong decision
+ In other words, the null hypothesis is true but we reject it
+ i.e., our sampled slope indicates the two variables are correlated while they are actually independent
Steps for Performing a t-test
Finally, if t c>0 (ie., tlocated at
either end of the curve), we can safely reject the
null hypothesis
~ This means the two numbers are correlated. Our
conclusion is wrong with a probability lower than a.
Otherwise, we must accept the null hypothesis,
meaning the two variables can be independent.
~ We cannot safely claim the variables are correlated,
— There can be a chance more than we can tolerate
(e.,> a) that the variables are actually independent
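The critical value c and the reject/accept decision can be sketched with scipy, continuing the assumed numbers from the earlier sketch:

```python
from scipy import stats

alpha, N = 0.05, 10
df = N - 1                           # the slides use N-1 degrees of freedom
c = stats.t.ppf(1 - alpha / 2, df)   # two-tailed critical value
print(f"c = {c:.3f}")                # 2.262 for df = 9

t = 2.530                            # t value from the earlier sketch
if abs(t) > c:
    print("reject the null hypothesis: the variables are correlated")
else:
    print("cannot reject: the variables may be independent")
```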
An Alternative Way
• Instead of finding the value of c from α, we can find the probability P that the t-curve takes a value larger than t or smaller than −t (computed in the code sketch below)
– Note that you need to sum up both tail areas of the t-curve (that is, multiply one tail area by 2)
– P is the probability that the null hypothesis is true while we wrongly reject it
– So, if P is smaller than α, the chance of a wrong decision is within our tolerance limit
– This means we can safely reject the null hypothesis: the two variables are correlated
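The P value route, continuing the same assumed numbers:

```python
from scipy import stats

t, df, alpha = 2.530, 9, 0.05
P = 2 * stats.t.sf(abs(t), df)   # sum of both tail areas (2-sided)
print(f"P = {P:.4f}")
if P < alpha:
    print("P < alpha: safely reject the null hypothesis")
```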
Illustrating the P value
[Figure: t-curve with the tail areas beyond ±t shaded. The shaded area covers the values that cause a wrong decision (the null hypothesis is true but we reject it); its total probability is P (2-sided). The significance level α is the maximum probability of a wrong decision we can tolerate.]
Conclusion
• Crosstable vs Scatter Plot
– Both can show the relation between 2 variables
– Use a crosstable when:
+ You want to find out the exact percentage breakdown
+ There is a small number of possible values for the variables
– Use a scatter plot when:
+ You want to predict the value of one variable using another variable (i.e., obtain a regression line)
+ The variable(s) take continuous values or a very large number of possible values (e.g. integers 0 ~ 100)
+ If the variables have only 2 or 3 possible values, the dots will be superimposed on each other
• The correlation we used is known as the Pearson correlation coefficient
– A type of parametric correlation
– It works best on statistical data (not test/experimental data), in which we cannot manipulate both variables
– The distribution of these variables usually obeys the Normal distribution
– If the distribution of one or more of the variables is known to be non-Normal, use Spearman's correlation (compared in the code sketch below)
– Spearman's correlation is a type of non-parametric correlation, available in SPSS
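A short scipy comparison of the two coefficients; the data is synthetic, with one variable deliberately non-Normal, so Spearman's rank correlation is the safer choice:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = np.exp(x) + rng.normal(scale=0.1, size=100)   # monotonic, non-Normal y

r, p_r = stats.pearsonr(x, y)
rho, p_rho = stats.spearmanr(x, y)
print(f"Pearson  r   = {r:.3f} (p = {p_r:.2g})")
print(f"Spearman rho = {rho:.3f} (p = {p_rho:.2g})")
# Spearman works on ranks, so it does not require Normally distributed data.
```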