
Applied Business Statistics, 7th ed.

by Ken Black

Chapter 12

Simple Regression Analysis
and Correlation

Copyright 2011 John Wiley & Sons, Inc. 1
Learning Objectives
Explain the purpose of regression analysis and the meaning
of independent versus dependent variables.
Compute the equation of a simple regression line from a
sample of data, and interpret the slope and intercept of the
equation.
Estimate values of Y to forecast outcomes using the
regression model.
Understand residual analysis in testing the assumptions
and in examining the fit underlying the regression line.
Compute a standard error of the estimate and interpret
its meaning.
Compute a coefficient of determination and interpret it.
Test hypotheses about the slope of the regression model
and interpret the results.
Copyright 2011 John Wiley & Sons, Inc. 2
Correlation
Correlation is a measure of the degree of relatedness of
variables.
Coefficient of Correlation (r) - applicable only if both
variables being analyzed have at least an interval level
of data.

Copyright 2011 John Wiley & Sons, Inc. 3


Pearson Product-Moment
Correlation Coefficient
$$r \;=\; \frac{SS_{XY}}{\sqrt{SS_{XX}\, SS_{YY}}}
  \;=\; \frac{\sum (X - \bar{X})(Y - \bar{Y})}
             {\sqrt{\sum (X - \bar{X})^2 \,\sum (Y - \bar{Y})^2}}
  \;=\; \frac{\sum XY - \frac{(\sum X)(\sum Y)}{n}}
             {\sqrt{\left[\sum X^2 - \frac{(\sum X)^2}{n}\right]\left[\sum Y^2 - \frac{(\sum Y)^2}{n}\right]}}$$

$$-1 \;\le\; r \;\le\; 1$$

Copyright 2011 John Wiley & Sons, Inc. 4


Degrees of Correlation

The term r is a measure of the linear correlation of two variables.
r ranges from -1 through 0 to +1.
Positive correlation: as one variable increases, the other variable increases.
Negative correlation: as one variable increases, the other variable decreases.
No correlation: the value of r is close to 0.
The closer r is to +1 or -1, the stronger the correlation between the dependent and the independent variables.

Copyright 2011 John Wiley & Sons, Inc. 5


Three Degrees of Correlation

[Scatter plots illustrating negative correlation (r < 0), positive correlation (r > 0), and no correlation (r = 0)]
Copyright 2011 John Wiley & Sons, Inc. 6


Computation of r for
the Economics Example (Part 1)
Day    Interest Rate (X)    Futures Index (Y)    X^2        Y^2        XY
1 7.43 221 55.205 48,841 1,642.03
2 7.48 222 55.950 49,284 1,660.56
3 8.00 226 64.000 51,076 1,808.00
4 7.75 225 60.063 50,625 1,743.75
5 7.60 224 57.760 50,176 1,702.40
6 7.63 223 58.217 49,729 1,701.49
7 7.68 223 58.982 49,729 1,712.64
8 7.67 226 58.829 51,076 1,733.42
9 7.59 226 57.608 51,076 1,715.34
10 8.07 235 65.125 55,225 1,896.45
11 8.03 233 64.481 54,289 1,870.99
12 8.00 241 64.000 58,081 1,928.00
Summations 92.93 2,725 720.220 619,207 21,115.07

Copyright 2011 John Wiley & Sons, Inc. 7


Computation of r
Economics Example (Part 2)

$$r \;=\; \frac{\sum XY - \frac{(\sum X)(\sum Y)}{n}}
             {\sqrt{\left[\sum X^2 - \frac{(\sum X)^2}{n}\right]\left[\sum Y^2 - \frac{(\sum Y)^2}{n}\right]}}
  \;=\; \frac{21{,}115.07 - \frac{(92.93)(2{,}725)}{12}}
             {\sqrt{\left[720.22 - \frac{(92.93)^2}{12}\right]\left[619{,}207 - \frac{(2{,}725)^2}{12}\right]}}
  \;=\; .815
$$
Copyright 2011 John Wiley & Sons, Inc. 8
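
If you want to verify the arithmetic yourself, here is a minimal Python sketch (standard library only) that applies the computational form of r to the interest-rate and futures-index data from the table above:

```python
import math

# Interest rate (X) and futures index (Y) from the economics example table
x = [7.43, 7.48, 8.00, 7.75, 7.60, 7.63, 7.68, 7.67, 7.59, 8.07, 8.03, 8.00]
y = [221, 222, 226, 225, 224, 223, 223, 226, 226, 235, 233, 241]
n = len(x)

# Computational (shortcut) form of the Pearson product-moment correlation
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)

ss_xy = sum_xy - sum_x * sum_y / n
ss_xx = sum_x2 - sum_x ** 2 / n
ss_yy = sum_y2 - sum_y ** 2 / n

r = ss_xy / math.sqrt(ss_xx * ss_yy)
print(round(r, 3))   # 0.815, matching the slide
```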
Computation of r
Economics Example (Part 2)
Is r = 0.815 high or low?
What can we conclude about the variables of
interest?

Copyright 2011 John Wiley & Sons, Inc. 9


Regression

Regression analysis is the process of constructing a mathematical model or function that can be used to predict or determine one variable by another variable or variables.

Copyright 2011 John Wiley & Sons, Inc. 10


Simple Regression Analysis

Bivariate (two-variable) linear regression is the most elementary regression model.
The dependent variable, the variable to be predicted, is usually called Y.
The independent variable, the predictor or explanatory variable, is usually called X.
Usually the first step in this analysis is to construct a scatter plot of the data.
Nonlinear relationships and regression models with more than one independent variable can be explored by using multiple regression models.

Copyright 2011 John Wiley & Sons, Inc. 11


Regression Models
Deterministic regression model -- produces an exact output:
$$\hat{y} \;=\; \beta_0 + \beta_1 x$$
Probabilistic regression model -- includes an error term:
$$y \;=\; \beta_0 + \beta_1 x + \epsilon$$
β0 and β1 are population parameters.
β0 and β1 are estimated by the sample statistics b0 and b1.

Copyright 2011 John Wiley & Sons, Inc. 12


Equation of the Simple Regression Line

$$\hat{y} \;=\; b_0 + b_1 x$$
where: b0 = the sample intercept
       b1 = the sample slope
       ŷ  = the predicted value of y

Copyright 2011 John Wiley & Sons, Inc. 13


Least Squares Analysis
Least squares analysis is a process whereby a
regression model is developed by producing the
minimum sum of the squared error values
The vertical distance from each point to the line is
the error of the prediction.
The least squares regression line is the regression line
that results in the smallest sum of errors squared.

Copyright 2011 John Wiley & Sons, Inc. 14


Least Squares Analysis

$$b_1 \;=\; \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}
  \;=\; \frac{\sum XY - \frac{(\sum X)(\sum Y)}{n}}{\sum X^2 - \frac{(\sum X)^2}{n}}
  \;=\; \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}$$

$$b_0 \;=\; \bar{Y} - b_1\bar{X} \;=\; \frac{\sum Y}{n} - b_1\frac{\sum X}{n}$$

Copyright 2011 John Wiley & Sons, Inc. 15


Least Squares Analysis

$$SS_{XY} \;=\; \sum (X - \bar{X})(Y - \bar{Y}) \;=\; \sum XY - \frac{(\sum X)(\sum Y)}{n}$$

$$SS_{XX} \;=\; \sum (X - \bar{X})^2 \;=\; \sum X^2 - \frac{(\sum X)^2}{n}$$

$$b_1 \;=\; \frac{SS_{XY}}{SS_{XX}}$$

$$b_0 \;=\; \bar{Y} - b_1\bar{X} \;=\; \frac{\sum Y}{n} - b_1\frac{\sum X}{n}$$

Copyright 2011 John Wiley & Sons, Inc. 16


Solving for b1 and b0 of the Regression
Line: Airline Cost Example (Part 1)
Number of Passengers (X)    Cost ($1,000) (Y)    X^2        XY

61 4.28 3,721 261.08


63 4.08 3,969 257.04
67 4.42 4,489 296.14
69 4.17 4,761 287.73
70 4.48 4,900 313.60
74 4.30 5,476 318.20
76 4.82 5,776 366.32
81 4.70 6,561 380.70
86 5.11 7,396 439.46
91 5.13 8,281 466.83
95 5.64 9,025 535.80
97 5.56 9,409 539.32

ΣX = 930    ΣY = 56.69    ΣX^2 = 73,764    ΣXY = 4,462.22

Copyright 2011 John Wiley & Sons, Inc. 17


Solving for b1 and b0 of the Regression
Line: Airline Cost Example (Part 2)

$$SS_{XY} \;=\; \sum XY - \frac{(\sum X)(\sum Y)}{n} \;=\; 4{,}462.22 - \frac{(930)(56.69)}{12} \;=\; 68.745$$

$$SS_{XX} \;=\; \sum X^2 - \frac{(\sum X)^2}{n} \;=\; 73{,}764 - \frac{(930)^2}{12} \;=\; 1{,}689$$

$$b_1 \;=\; \frac{SS_{XY}}{SS_{XX}} \;=\; \frac{68.745}{1{,}689} \;=\; .0407$$

$$b_0 \;=\; \frac{\sum Y}{n} - b_1\frac{\sum X}{n} \;=\; \frac{56.69}{12} - (.0407)\frac{930}{12} \;=\; 1.57$$

$$\hat{Y} \;=\; 1.57 + .0407X$$

Copyright 2011 John Wiley & Sons, Inc. 18
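
A short Python sketch (standard library only) reproduces these slope and intercept calculations from the raw airline data in the table above:

```python
# Number of passengers (X) and cost in $1,000s (Y) from the airline cost table
x = [61, 63, 67, 69, 70, 74, 76, 81, 86, 91, 95, 97]
y = [4.28, 4.08, 4.42, 4.17, 4.48, 4.30, 4.82, 4.70, 5.11, 5.13, 5.64, 5.56]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

ss_xy = sum_xy - sum_x * sum_y / n      # 68.745
ss_xx = sum_x2 - sum_x ** 2 / n         # 1,689
b1 = ss_xy / ss_xx                      # about 0.0407
b0 = sum_y / n - b1 * (sum_x / n)       # about 1.57

print(f"y-hat = {b0:.4f} + {b1:.4f} x")
```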


Airline Cost: Excel Summary Output
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.94820033

R Square 0.89908386

Adjusted R Square 0.88899225

Standard Error 0.17721746

Observations 12

ANOVA
  df SS MS F Significance F

Regression 1 2.79803 2.79803 89.092179 2.7E-06

Residual 10 0.31406 0.03141

Total 11 3.11209      

  Coefficients Standard Error t Stat P-value


Intercept 1.56979278 0.33808 4.64322 0.0009175
Number of Passengers 0.0407016 0.00431 9.43887 2.692E-06

Copyright 2011 John Wiley & Sons, Inc. 19


MINITAB Regression Analysis of
the Airline Cost Example
The regression equation is
Cost = 1.57 + 0.0407 Number of Passengers

Predictor Coef StDev T P


Constant 1.5698 0.3381 4.64 0.001
Number o 0.040702 0.004312 9.44 0.000

S = 0.1772 R-Sq = 89.9% R-Sq(adj) = 88.9%

Analysis of Variance

Source DF SS MS F P
Regression 1 2.7980 2.7980 89.09 0.000
Residual Error 10 0.3141 0.0314
Total 11 3.1121

Obs Number o Cost Fit StDev Fit Residual St Resid


1 61.0 4.2800 4.0526 0.0876 0.2274 1.48
2 63.0 4.0800 4.1340 0.0808 -0.0540 -0.34
3 67.0 4.4200 4.2968 0.0683 0.1232 0.75
4 69.0 4.1700 4.3782 0.0629 -0.2082 -1.26
5 70.0 4.4800 4.4189 0.0605 0.0611 0.37
6 74.0 4.3000 4.5817 0.0533 -0.2817 -1.67
7 76.0 4.8200 4.6631 0.0516 0.1569 0.93
8 81.0 4.7000 4.8666 0.0533 -0.1666 -0.99
9 86.0 5.1100 5.0701 0.0629 0.0399 0.24
10 91.0 5.1300 5.2736 0.0775 -0.1436 -0.90
11 95.0 5.6400 5.4364 0.0912 0.2036 1.34
12 97.0 5.5600 5.5178 0.0984 0.0422 0.29

Copyright 2011 John Wiley & Sons, Inc. 20


Airline Cost: MINITAB Summary Output

Copyright 2011 John Wiley & Sons, Inc. 21


Residual Analysis

A residual is the difference between the actual y value and the predicted ŷ value.
It reflects the error of the regression line at any given point.

Copyright 2011 John Wiley & Sons, Inc. 22


Residual Analysis: Airline Cost Example

Number of Passengers (X)    Cost ($1,000) (Y)    Predicted Value (Ŷ)    Residual (Y − Ŷ)

61 4.28 4.053 .227


63 4.08 4.134 -.054
67 4.42 4.297 .123
69 4.17 4.378 -.208
70 4.48 4.419 .061
74 4.30 4.582 -.282
76 4.82 4.663 .157
81 4.70 4.867 -.167
86 5.11 5.070 .040
91 5.13 5.274 -.144
95 5.64 5.436 .204
97 5.56 5.518 .042

Σ(Y − Ŷ) = −.001

Copyright 2011 John Wiley & Sons, Inc. 23
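
As a quick check of the residual column, the following Python sketch recomputes Y − Ŷ from the rounded regression equation ŷ = 1.57 + 0.0407x:

```python
# Recompute the residuals Y - Y-hat from the rounded line y-hat = 1.57 + 0.0407x
x = [61, 63, 67, 69, 70, 74, 76, 81, 86, 91, 95, 97]
y = [4.28, 4.08, 4.42, 4.17, 4.48, 4.30, 4.82, 4.70, 5.11, 5.13, 5.64, 5.56]
b0, b1 = 1.57, 0.0407

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
for xi, yi, e in zip(x, y, residuals):
    print(f"{xi:3d}  {yi:5.2f}  {e:+.3f}")

# With the rounded coefficients the residuals sum to roughly zero (-.001)
print("sum of residuals:", round(sum(residuals), 3))
```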


Residual Analysis for Number of Passengers

Outliers: data points that lie apart from the rest of the points. They can produce large residuals and affect the regression line.
Copyright 2011 John Wiley & Sons, Inc. 24
Using Residuals to Test the Assumptions of
the Regression Model

The assumptions of the regression model:
The model is linear.
The error terms have constant variances.
The error terms are independent.
The error terms are normally distributed.

Copyright 2011 John Wiley & Sons, Inc. 25
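
These assumptions are usually checked graphically from the residuals. Below is a minimal sketch of such diagnostics in Python, assuming matplotlib is installed; it mimics the spirit of the MINITAB graphic diagnostics used in the next demonstration rather than their exact panels:

```python
import matplotlib.pyplot as plt

# Airline cost data and the fitted line (rounded coefficients from the slides)
x = [61, 63, 67, 69, 70, 74, 76, 81, 86, 91, 95, 97]
y = [4.28, 4.08, 4.42, 4.17, 4.48, 4.30, 4.82, 4.70, 5.11, 5.13, 5.64, 5.56]
b0, b1 = 1.57, 0.0407
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))

# Residuals vs. fitted values: curvature suggests nonlinearity,
# a funnel shape suggests non-constant variance
ax1.scatter(fitted, residuals)
ax1.axhline(0, linestyle="--")
ax1.set_xlabel("Fitted value")
ax1.set_ylabel("Residual")

# Histogram of residuals: a rough check of the normality assumption
ax2.hist(residuals, bins=5)
ax2.set_xlabel("Residual")

plt.tight_layout()
plt.show()
```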


Demonstration Problem 12.2

Compute the residuals for Demonstration Problem 12.1, in which a regression model was developed to predict the number of full-time equivalent workers (FTEs) by the number of beds in a hospital. Analyze the residuals by using MINITAB graphic diagnostics.

Copyright 2011 John Wiley & Sons, Inc. 26


Demonstration Problem 12.2 – MINITAB
Computations for Residuals

Copyright 2011 John Wiley & Sons, Inc. 27


Demonstration Problem 12.2 – MINITAB
Plots of Residuals

Copyright 2011 John Wiley & Sons, Inc. 28


Another Residual Analysis:
Predicting the production of carrots in the U.S. by the
total production of sweet corn.

Copyright 2011 John Wiley & Sons, Inc. 29


Standard Error of the Estimate

Residuals represent errors of estimation for individual points.
A more useful measurement of error is the standard error of the estimate.
The standard error of the estimate, denoted se, is a standard deviation of the error of the regression model.

Copyright 2011 John Wiley & Sons, Inc. 30


Standard Error of the Estimate

Sum of squares of error:
$$SSE \;=\; \sum (Y - \hat{Y})^2 \;=\; \sum Y^2 - b_0\sum Y - b_1\sum XY$$

Standard error of the estimate:
$$s_e \;=\; \sqrt{\frac{SSE}{n - 2}}$$

Copyright 2011 John Wiley & Sons, Inc. 31


Determining SSE for the
Airline Cost Example

Number of Passengers (X)    Cost ($1,000) (Y)    Residual (Y − Ŷ)    (Y − Ŷ)^2

61 4.28 .227 .05153


63 4.08 -.054 .00292
67 4.42 .123 .01513
69 4.17 -.208 .04326
70 4.48 .061 .00372
74 4.30 -.282 .07952
76 4.82 .157 .02465
81 4.70 -.167 .02789
86 5.11 .040 .00160
91 5.13 -.144 .02074
95 5.64 .204 .04162
97 5.56 .042 .00176

Σ(Y − Ŷ) = −.001        Σ(Y − Ŷ)^2 = .31434

Sum of squares of error = SSE = .31434

Copyright 2011 John Wiley & Sons, Inc. 32


Determining SSE for the Airline Cost
Example – MINITAB Output

SSE = 0.3141

Copyright 2011 John Wiley & Sons, Inc. 33


Standard Error of the Estimate for
the Airline Cost Example

Sum of squares of error:
$$SSE \;=\; \sum (Y - \hat{Y})^2 \;=\; 0.31434$$

Standard error of the estimate:
$$s_e \;=\; \sqrt{\frac{SSE}{n - 2}} \;=\; \sqrt{\frac{0.31434}{10}} \;=\; 0.1773$$
Copyright 2011 John Wiley & Sons, Inc. 34
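
The same quantities can be reproduced with a few lines of Python (standard library only), using the rounded coefficients from the slides:

```python
import math

# SSE and the standard error of the estimate for the airline cost example
x = [61, 63, 67, 69, 70, 74, 76, 81, 86, 91, 95, 97]
y = [4.28, 4.08, 4.42, 4.17, 4.48, 4.30, 4.82, 4.70, 5.11, 5.13, 5.64, 5.56]
b0, b1 = 1.57, 0.0407
n = len(x)

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
se = math.sqrt(sse / (n - 2))   # standard error of the estimate

print(round(sse, 5))   # about .314
print(round(se, 4))    # about 0.177
```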
Standard Error of the Estimate for
the Airline Cost Example

Copyright 2011 John Wiley & Sons, Inc. 35


Coefficient of Determination

The coefficient of determination is the proportion of variability of the dependent variable (y) accounted for or explained by the independent variable (x).
The coefficient of determination ranges from 0 to 1.
An r² of zero means that the predictor accounts for none of the variability of the dependent variable and that there is no regression prediction of y by x.
An r² of 1 means perfect prediction of y by x: 100% of the variability of y is accounted for by x.

Copyright 2011 John Wiley & Sons, Inc. 36


Coefficient of Determination

$$SS_{YY} \;=\; \sum (Y - \bar{Y})^2 \;=\; \sum Y^2 - \frac{(\sum Y)^2}{n}$$

$$SS_{YY} \;=\; \text{explained variation} + \text{unexplained variation} \;=\; SSR + SSE$$

$$1 \;=\; \frac{SSR}{SS_{YY}} + \frac{SSE}{SS_{YY}}$$

$$r^2 \;=\; \frac{SSR}{SS_{YY}} \;=\; 1 - \frac{SSE}{SS_{YY}} \;=\; 1 - \frac{SSE}{\sum Y^2 - \frac{(\sum Y)^2}{n}}$$

$$0 \;\le\; r^2 \;\le\; 1$$
Copyright 2011 John Wiley & Sons, Inc. 37


Coefficient of Determination for
the Airline Cost Example

$$SSE \;=\; 0.31434$$

$$SS_{YY} \;=\; \sum Y^2 - \frac{(\sum Y)^2}{n} \;=\; 270.9251 - \frac{(56.69)^2}{12} \;=\; 3.11209$$

$$r^2 \;=\; 1 - \frac{SSE}{SS_{YY}} \;=\; 1 - \frac{.31434}{3.11209} \;=\; .899$$

89.9% of the variability of the cost of flying a Boeing 737 is accounted for by the number of passengers.
Copyright 2011 John Wiley & Sons, Inc. 38
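
A short Python check of r², computing SSE and SSYY from the airline data with the rounded coefficients from the slides:

```python
# r^2 = 1 - SSE / SSyy for the airline cost example
x = [61, 63, 67, 69, 70, 74, 76, 81, 86, 91, 95, 97]
y = [4.28, 4.08, 4.42, 4.17, 4.48, 4.30, 4.82, 4.70, 5.11, 5.13, 5.64, 5.56]
b0, b1 = 1.57, 0.0407
n = len(y)

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
ss_yy = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n   # total variation in y

r_sq = 1 - sse / ss_yy
print(round(r_sq, 3))   # about .899
```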


Coefficient of Determination for
the Airline Cost Example

Copyright 2011 John Wiley & Sons, Inc. 39


Hypothesis Tests for the Slope
of the Regression Model
A hypothesis test can be conducted on the sample
slope of the regression model to determine whether
the population slope is significantly different from
zero.
Using this non-regression model (the ȳ model) as a worst case, the researcher can analyze the regression line to determine whether it adds a more significant amount of predictability of y than does the ȳ model.

Copyright 2011 John Wiley & Sons, Inc. 40


Hypothesis Tests for the Slope
of the Regression Model
As the slope of the regression line diverges from zero, the regression model is adding predictability that the ȳ line is not generating.
Testing the slope of the regression line to determine
whether the slope is different from zero is important.
If the slope is not different from zero, the regression
line is doing nothing more than the average line of y
predicting y.

Copyright 2011 John Wiley & Sons, Inc. 41


Hypothesis Tests for the Slope
of the Regression Model

Two-tailed test:   H0: β1 = 0    Ha: β1 ≠ 0
One-tailed tests:  H0: β1 = 0    Ha: β1 > 0   (or Ha: β1 < 0)

$$t \;=\; \frac{b_1 - \beta_1}{s_b}, \qquad
s_b \;=\; \frac{s_e}{\sqrt{SS_{XX}}}, \qquad
s_e \;=\; \sqrt{\frac{SSE}{n - 2}}, \qquad
SS_{XX} \;=\; \sum X^2 - \frac{(\sum X)^2}{n}$$

where β1 is the hypothesized slope and df = n − 2.
Copyright 2011 John Wiley & Sons, Inc. 42
Hypothesis Test: Airline Cost Example

df = n − 2 = 12 − 2 = 10        α = .05
H0: β1 = 0        Ha: β1 ≠ 0
t.025,10 = ±2.228

If |t| > 2.228, reject H0.
If −2.228 ≤ t ≤ 2.228, do not reject H0.

Copyright 2011 John Wiley & Sons, Inc. 43


Hypothesis Test: Airline Cost Example

|t| = 9.44 > 2.228


so reject H0

Note:
P-value = 0.000

Copyright 2011 John Wiley & Sons, Inc. 44


Hypothesis Test: Airline Cost Example

The t value calculated from the sample slope falls in the rejection region, and the p-value (2.692E-06 in the Excel output) is far smaller than α = .05.
The null hypothesis that the population slope is zero is rejected.
This linear regression model adds significantly more predictive information than the ȳ model (no regression).

Copyright 2011 John Wiley & Sons, Inc. 45
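
A Python sketch of this t test, using the rounded values reported on the earlier slides (b1, se, and SSXX):

```python
import math

# t test of the slope, using the rounded values reported on the slides
b1 = 0.0407      # sample slope
se = 0.1773      # standard error of the estimate
ss_xx = 1689.0   # SSxx = sum(x^2) - (sum x)^2 / n
n = 12

s_b = se / math.sqrt(ss_xx)   # standard error of the slope
t = (b1 - 0) / s_b            # test statistic for H0: beta1 = 0

print(round(s_b, 5))   # about 0.00431
print(round(t, 2))     # about 9.43 with rounded inputs (9.44 in the MINITAB output)
# |t| > 2.228, so H0 is rejected at alpha = .05
```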


Testing the Overall Model

It is common in regression analysis to compute an F test to determine the overall significance of the model.
In multiple regression, this test determines whether at least one of the regression coefficients (from multiple predictors) is different from zero.
Simple regression provides only one predictor and only one regression coefficient to test.
Because the regression coefficient is the slope of the regression line, the F test for overall significance tests the same thing as the t test in simple regression.

Copyright 2011 John Wiley & Sons, Inc. 46


Testing the Overall Model

H0: β1 = 0        Ha: β1 ≠ 0        α = .05
df_reg = k = 1
df_err = n − k − 1 = 12 − 1 − 1 = 10
F.05,1,10 = 4.96

If F > 4.96, reject H0.
If F ≤ 4.96, do not reject H0.

Copyright 2011 John Wiley & Sons, Inc. 47


Testing the Overall Model

F = 89.09 > 4.96


so reject H0

Note:
P-value = 0.000

Copyright 2011 John Wiley & Sons, Inc. 48


Testing the Overall Model

The difference between this value (89.09) and the value obtained by squaring the t statistic (88.92) is due to rounding error.
The probability of obtaining an F value this large or larger by chance, if there is no regression prediction in this model, is .000 according to the ANOVA output (the p-value).

Copyright 2011 John Wiley & Sons, Inc. 49
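
A Python sketch of the F test built from the identity SSYY = SSR + SSE, using the sums of squares reported above:

```python
# F test for overall significance, built from the identity SSyy = SSR + SSE
ss_yy = 3.11209    # total sum of squares (from the slides)
sse = 0.31434      # error sum of squares
ssr = ss_yy - sse  # regression sum of squares

k, n = 1, 12                 # one predictor, twelve observations
ms_reg = ssr / k             # regression mean square
ms_err = sse / (n - k - 1)   # error mean square

f = ms_reg / ms_err
print(round(f, 2))   # about 89.0 (the outputs show 89.09 from unrounded sums)
# F > 4.96, so H0 is rejected at alpha = .05
```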


Estimation

One of the main uses of regression analysis is as a prediction tool.
If the regression function is a good model, the researcher can use the regression equation to determine values of the dependent variable from various values of the independent variable.
In simple regression analysis, a point estimate prediction of y can be made by substituting the associated value of x into the regression equation and solving for y.

Copyright 2011 John Wiley & Sons, Inc. 50


Point Estimation for the Airline
Cost Example

$$\hat{Y} \;=\; 1.57 + 0.0407X$$

For X = 73:
$$\hat{Y} \;=\; 1.57 + 0.0407(73) \;=\; 4.5411, \text{ or } \$4{,}541.10$$

Copyright 2011 John Wiley & Sons, Inc. 51
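
The same point estimate in Python:

```python
# Point estimate of the cost of a flight with 73 passengers
b0, b1 = 1.57, 0.0407
x0 = 73

y_hat = b0 + b1 * x0
print(round(y_hat, 4))   # 4.5411, i.e. about $4,541.10
```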


Confidence Interval to Estimate Y:
Airline Cost Example

$$\hat{Y} \;\pm\; t_{\alpha/2,\,n-2}\; s_e \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_{XX}}}$$

where x0 = a particular value of x and
$$SS_{XX} \;=\; \sum x^2 - \frac{(\sum x)^2}{n}$$

For x0 = 73 and a 95% confidence level:

$$4.5411 \;\pm\; (2.228)(0.1773)\sqrt{\frac{1}{12} + \frac{(73 - 77.5)^2}{73{,}764 - \frac{(930)^2}{12}}}
\;=\; 4.5411 \;\pm\; .1220$$

$$4.4191 \;\le\; E(Y_{73}) \;\le\; 4.6631$$

Copyright 2011 John Wiley & Sons, Inc. 52
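
A Python sketch of this confidence interval, using the rounded inputs from the slides:

```python
import math

# 95% confidence interval for the mean cost at x0 = 73 passengers
# (y-hat, t value, se, x-bar, and SSxx are the rounded values from the slides)
y_hat = 4.5411
t_crit = 2.228    # t(.025, 10)
se = 0.1773
n = 12
x_bar = 77.5      # 930 / 12
ss_xx = 1689.0
x0 = 73

margin = t_crit * se * math.sqrt(1 / n + (x0 - x_bar) ** 2 / ss_xx)
print(round(y_hat - margin, 4), round(y_hat + margin, 4))   # about 4.4191 to 4.6631
```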


Confidence Interval to Estimate the
Average Value of Y for some Values of X:
Airline Cost Example

X      Ŷ ± margin           Confidence Interval
62     4.0934 ± .1876       3.9058 to 4.2810
68     4.3376 ± .1461       4.1915 to 4.4837
73     4.5411 ± .1220       4.4191 to 4.6631
85     5.0295 ± .1349       4.8946 to 5.1644
90     5.2330 ± .1656       5.0674 to 5.3986

Copyright 2011 John Wiley & Sons, Inc. 53


Prediction Interval to Estimate Y
for a given value of X

$$\hat{Y} \;\pm\; t_{\alpha/2,\,n-2}\; s_e \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_{XX}}}$$

where x0 = a particular value of x and
$$SS_{XX} \;=\; \sum x^2 - \frac{(\sum x)^2}{n}$$

Copyright 2011 John Wiley & Sons, Inc. 54
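
The analogous Python sketch for the prediction interval; note the extra 1 under the radical, which makes this interval wider than the confidence interval at the same x0:

```python
import math

# 95% prediction interval for a single flight with x0 = 73 passengers
# (same rounded inputs as the confidence interval; note the extra "1 +")
y_hat, t_crit, se = 4.5411, 2.228, 0.1773
n, x_bar, ss_xx, x0 = 12, 77.5, 1689.0, 73

margin = t_crit * se * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / ss_xx)
print(round(y_hat - margin, 4), round(y_hat + margin, 4))
# Wider than the confidence interval at the same x0, as expected
```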


Minitab Intervals for Estimation

Copyright 2011 John Wiley & Sons, Inc. 55


Minitab Intervals for Estimation

Copyright 2011 John Wiley & Sons, Inc. 56


Forecasting Using the Trend Line Equation
Time-series data is useful in predicting future values.

Copyright 2011 John Wiley & Sons, Inc. 57


Developing a Forecasting Trend Line
Future values can be forecast by using the equation of
the trend line.
Recoding the time units may be necessary to produce a meaningful trend line with smaller numbers: for example, instead of 2002, 2003, 2004, ..., use 1, 2, 3, ... (forecasts must then be made on the same recoded scale).

Copyright 2011 John Wiley & Sons, Inc. 58
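
A Python sketch of this recoding; the yearly sales figures below are invented purely to illustrate the mechanics, not taken from the textbook:

```python
# Fitting a trend line after recoding years 2002, 2003, ... to 1, 2, ...
# The yearly values below are invented purely to illustrate the mechanics.
years = [2002, 2003, 2004, 2005, 2006]
sales = [12.1, 13.4, 14.0, 15.2, 16.1]

t = list(range(1, len(years) + 1))   # recoded time periods 1..5
n = len(t)

sum_t, sum_y = sum(t), sum(sales)
num = sum(ti * yi for ti, yi in zip(t, sales)) - sum_t * sum_y / n
den = sum(ti ** 2 for ti in t) - sum_t ** 2 / n
b1 = num / den
b0 = sum_y / n - b1 * sum_t / n

# Forecast for 2008: substitute the recoded period (7), not the raw year
forecast_2008 = b0 + b1 * 7
print(round(forecast_2008, 2))
```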


Interpreting Minitab Output

Copyright 2011 John Wiley & Sons, Inc. 59


Interpreting Excel Output

Copyright 2011 John Wiley & Sons, Inc. 60
