SDA Book
Gangineni Dhananjhay
Chapter 1 : Introduction
following questions
manner.
do this later on). The study produces data (or information). Statistical techniques are then applied to make sense of the data and the world around us. Thus, in a nutshell, Statistics is the art and science of learning from data.
2 Statistical Enquiry
Every statistical enquiry has four distinct stages viz :
on the above data but in such a way that the conclusions are
Eg : If we want to know the living standard of average Indians,
subject).
Data set : In a study, the totality of the values of all the variables for
undertaken.
Eg : For the death penalty example above, the population will
students.
In this case, you may want to predict, say, the true percentage of
size
sample is a good representative of the population. Generally,
of a random phenomenon.
population.
But, once the sample is collected and the value of the statistic is
Chapter 2 : Sampling Techniques
4 Data Collection
As mentioned in Chapter 1, one of the most important components of
Association
2. Explanatory variable (X) : This is the independent variable or the
Eg : (i) There are many factors which influence the ranking of IIMA (or of any institution per se), like salary of graduating students (say, s), infrastructure (say, i), research output of faculty (say, r) etc. So, ranking is the response variable, while s, i and r are the explanatory variables.
1. Later on we will see that association may not necessarily imply causation. However, we will go by this statement for the time being for the sake of defining response and explanatory variables.
2. Explanatory variables are also known as covariates or predictors.
(ii) Whether taking Aspirin reduces the chance of heart attacks. Here
concept, we will now learn about the two main types of statistical
studies.
Statistical Studies
issue :
transgenic mice, specially bred to be susceptible to cancer; 100
were not exposed. After 18 months, it was found that the brain tumor rate for the mice exposed to cell phone radiation was twice as high as the brain tumor rate for the unexposed mice.
did not have the eye cancer. The patients' cell phone use was
variables for each of them without doing anything to them. So, these are observational studies.
since it is the one that is used more often and specifically in business
the other hand, a sample which does not have proper representation
about current happenings. So, here the population will be all the
subjects (or units) in it so that you can sample from it - this list is called
the sampling frame - this is applicable when the population is finite (i.e has a
For example, if you want to know some particular characteristics
design. The sampling design should be such that the resulting sample
(that too from PGP, PGP- ABM, FPM etc), faculties (from all ranks),
staff members, workers (from all departments) etc. However, if for
convenience, you only select the diners from the KLMDC canteen on a
dinner).
indicator since the results will not account for a huge chunk of the
population (say those who don't have access to internet or are not
a sampling design which gives each and every population unit an equal
chance of being included in the sample will likely result in a much
size N , is one in which each possible sample of that size (i.e n) has
Eg: Suppose a campus club can select two of its officers to attend the
Coordinator (A). So, here the sampling frame will be . The club
person blindly selects two slips from the hat - this is the sampling
design.
(randomly) selecting two slips ensures that each of the above sample
Random number tables : For large populations, there are better, more
generators) as follows :
Stop the process once you have reached the desired sample
size.
for the auditors to examine all the crop fields (since there are so
{01, 02, ..., 70}. The auditor generates 2 digits at a time from a random
number generator until he/she has 10 unique two digit numbers from
select the same field more than once). Suppose the selected numbers
are {05, 09, 15, 23, 44, 46, 52, 59, 63, 69} - the official should then check
(SWR) since a particular unit can be selected more than once. However, if any unit
can be selected/included in the sample only once, it is known as sampling without
replacement (SWOR). Although SWR is a valid way of selecting a simple random
sample, it is SWOR that is used most often. In fact, unless explicitly stated, a simple
random sample implies a sample selected without replacement.
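To make the distinction concrete, here is a minimal Python sketch using the auditor example above (the field numbering {01, ..., 70} comes from the example; the use of Python's random module is my choice, not the book's):

    import random

    fields = list(range(1, 71))  # the 70 crop fields, numbered 01-70

    # SWOR: each field can be selected at most once
    swor_sample = random.sample(fields, 10)

    # SWR: the same field may be selected more than once
    swr_sample = [random.choice(fields) for _ in range(10)]

    print(sorted(swor_sample))
    print(sorted(swr_sample))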
done at the Taj Mahal. For all practical purposes, the visitors (to the
population. So, to select a SRS from the steady stream of visitors, the
One way of doing this may be to record the entry ticket number of
every visitor in a computer (as soon as they buy the ticket) and
once a ticket number has been drawn, that number is not included in
Sampling with and without replacement
unit and so on. Thus, the subsequent draws are independent and
times.
lected, it is not replaced back and the second sample unit is drawn
unit) and so on. Naturally, the subsequent draws are dependent
once.
Sample Surveys
Once a sample is selected, there are different ways in which one can
collect data from the sample units. Some of the commonly used
selecting a representative sample from the population and collecting
One of the main issues with sample surveys is that, often the
population over others. Then the results of the sample are not
reasons as follows :
1. Sampling Bias : As the term suggests, this kind of bias results
population.
strong views about the survey issues. The subjects who do par-
Nearly all major surveys suffer from some non-response biases
happened !!
Accuracy of Sample Surveys
into account the ratio of the sample size (say, n) and the population size (say, N) - this ratio is known as the sampling fraction. If n/N > 0.05 (i.e the sample size is larger than 5% of the population size), we need to correct the above formula by what is known as the finite population correction, resulting in the following modified formula.
have data from the whole population, we will know the parameter
values and hence the question of making an error does not arise.
Eg 1. After the disastrous earthquake and tsunami that struck the east coast of Japan, a poll asked the question "Do you think that the city of Miyagi (one of the worst hit areas) will ever completely recover from the effects of the earthquake, or not ?" The poll questioned 700 Japanese adults out of whom 42% responded Yes, will.
Here, we can use the simpler formula because it is obvious that the sampling fraction is less than 0.05 (700 divided by 127 million - the population of Japan - is way less than 0.05!). So, the margin of error will be
Thus, in the population of all adult Japanese, it is highly likely that between 38.2% and 45.8% believed that Miyagi will completely recover from the
effects of the earthquake. It is amazing that such a small sample of 700 can
accurately predict the population percentage - this is the power of random sampling
and statistical inferential techniques.
of Japan is 5000. Then the sampling fraction will be 700/5000 = 0.14 > 0.05. So, the finite population correction applies, and we get a narrower interval compared to the one we got earlier (using the simpler formula).
But this does not mean that the wider interval is wrong. In fact, the wider
Note : The first (simpler) formula is generally used when we are dealing with an
infinite population or when the population size, although finite, is so large (compared
to the sample) that for all practical purposes it can be treated as infinite.
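As a small illustration, here is the approximate 95% margin-of-error computation in Python, including the finite population correction branch (treating 1/√n as the margin-of-error formula follows the examples above; the function name is mine):

    import math

    def margin_of_error(n, N=None):
        """Approximate 95% margin of error for a sample proportion."""
        moe = 1 / math.sqrt(n)  # simpler formula: very large population
        if N is not None and n / N > 0.05:
            # finite population correction for large sampling fractions
            moe *= math.sqrt((N - n) / (N - 1))
        return moe

    print(margin_of_error(700))          # Japan example: ~0.038
    print(margin_of_error(700, N=5000))  # hypothetical N = 5000: narrower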
Cluster Sampling
Often a sampling frame consisting of ALL the population units is hard
You can use a map to label and number the city blocks (which are
a simple random sample of 1% of the blocks. Now you can select each
Stratified Sampling
week that IIMA students spend in the library and also how it
compares between PGP I, PGP II, PGPX and FPM. You can easily
those will be your strata. If you want a sample size of, say 40, you
can select a simple random sample of size 10 from each strata (or
student bodies).
income/wealth of the various pols. Then you can treat each pol as a
each. For each such sample of households, you can calculate the
mean income/wealth.
should have access to the sampling frame and the strata into which
Allocation Schemes
is given by

n_h = n (N_h / N)

This scheme is appropriate when the within-stratum variances and the cost per unit of sampling are approximately equal across the different strata.

Under Neyman allocation, the sample size for stratum h is given by

n_h = n ( N_h σ_h / Σ_{h=1}^{H} N_h σ_h )
h=1 h h
Note : (i) Clearly, when the stratum variances are equal, proportional and Neyman
allocation will result in the same allocation scheme.
(ii) Usually we do not know the population standard deviations. In those cases, we
typically use estimates based on some prior information.
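Both allocation schemes are easy to code; a quick Python sketch (the stratum sizes and standard deviations below are made-up numbers, only for illustration):

    def proportional_allocation(n, N_h):
        N = sum(N_h)
        return [round(n * Nh / N) for Nh in N_h]

    def neyman_allocation(n, N_h, sigma_h):
        denom = sum(Nh * sh for Nh, sh in zip(N_h, sigma_h))
        return [round(n * Nh * sh / denom) for Nh, sh in zip(N_h, sigma_h)]

    # hypothetical strata: 3 branches with sizes and (estimated) std devs
    N_h = [1200, 600, 200]
    sigma_h = [4.0, 8.0, 12.0]
    print(proportional_allocation(300, N_h))    # [180, 90, 30]
    print(neyman_allocation(300, N_h, sigma_h)) # [120, 120, 60]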
Eg. Suppose a survey carried out by HSBC in all its branches in India
Suppose, for a particular survey, you want to select 300 employees
in total. Then what would be the sample sizes for each stratum
under
(ii) Neyman allocation :
References
1. Repacholi, M. H. (1997), Radio frequency field exposure and cancer, Environ. Health Prospect, 105 : 1565-1568.
12(1) : 7-12.
3. Hepworth, S. J., et al. (2006), Mobile phone use and risk of glioma
Chapter 3 : Exploratory Analysis
6 Summarizing data
Any statistical experiment/survey usually generates a lot of raw
chapter we will deal with the two main ways of summarizing data
below.
Types of Data/Variable
Note
identification.
Graphical Summaries
As the saying goes, a picture is worth a thousand words !
The two most common ways of summarizing a categorical variable are (i)
Bar graphs
and (ii) Pie charts.
1. Bar graph : It displays a vertical bar for each category, the height of the bar representing the frequency or percentage of observations in the category.
2. Pie chart : It is a circle with a slice of pie for each category, the size of the slice representing the percentage of observations in that category.
[Figure: Bar graph showing Percent by energy Source - Coal, Petroleum, Natural Gas, Nuclear, Hydro Power, Other.]
look at the possible values (of a variable) and count/list the number of
values in each category. We can then divide the number in each category by the total number (in all the categories combined) to get a
Eg : The following table shows the number/frequency of shark
attacks in some (shark infested) states of the U.S and also in some
other countries.
example, out of 701 total shark attacks, 365 were in Florida. Thus, the
Note : You can also view this data in a slightly different manner with Region being
the units and # shark attacks being the discrete variable.
Region         Frequency   Proportion        Percentage
Florida           365      365/701 = .52        52%
Hawaii             60          .086              8.6%
California         40          .057              5.7%
Australia          94          .134             13.4%
Brazil             66          .094              9.4%
South Africa       76          .108             10.8%
Total             701         1.00             100
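The Proportion and Percentage columns are just each frequency divided by the total; a quick Python check using the counts from the table:

    counts = {"Florida": 365, "Hawaii": 60, "California": 40,
              "Australia": 94, "Brazil": 66, "South Africa": 76}
    total = sum(counts.values())  # 701
    for region, freq in counts.items():
        print(f"{region:12s} {freq:4d} {freq/total:.3f} {100*freq/total:.1f}%")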
the possible values into a set of intervals and displays the number of observations in each.
For a discrete variable, a histogram usually has one bar for each value; for a continuous variable, we divide the range into intervals and have a separate bar for each. The intervals should
in one serving of that food. The table in the following page lists 20
Cereal                   Sodium (mg)   Sugar
Frosted mini wheats           0           7
Raisin bran                 210          12
All bran                    260           5
Apple Jacks                 125          14
Capt Crunch                 220          12
Cheerios                    290           1
Cinnamon Toast Crunch       210          13
Crackling Oat Bran          140          10
Crispix                     220           3
Frosted Flakes              200          11
Froot Loops                 125          13
Grape Nuts                  170           3
Honey Nut Cheerios          250          10
Life                        150           6
Oatmeal Raisin Crisp        170          10
Honey Smacks                 70          15
Special K                   230           3
Wheaties                    200           3
Corn flakes                 290           2
Honeycomb                   180          11
For the above data, there are many possible values for the amount
histogram. The table in the next page shows the intervals and the
Interval    Frequency   Proportion   Percentage
0-39            1          0.05          5.0
40-79           1          0.05          5.0
80-119          0          0.00          0.0
120-159         4          0.20         20.0
160-199         3          0.15         15.0
200-239         7          0.35         35.0
240-279         2          0.10         10.0
280-319         2          0.10         10.0
Total          20          1.00        100.0
cereals had Sodium content between 120 and 159 mg. Two possible histograms for the cereal data are shown below with two different choices of intervals.

[Figure: Two histograms of the Sodium content (mg), frequency on the vertical axis, with different interval widths.]
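The interval frequencies in the table can be reproduced with a short sketch (sodium values copied from the cereal table; numpy's histogram is my choice of tool):

    import numpy as np

    sodium = [0, 210, 260, 125, 220, 290, 210, 140, 220, 200,
              125, 170, 250, 150, 170, 70, 230, 200, 290, 180]

    bins = np.arange(0, 321, 40)   # 0-39, 40-79, ..., 280-319
    counts, edges = np.histogram(sodium, bins=bins)
    for lo, hi, c in zip(edges[:-1], edges[1:] - 1, counts):
        print(f"{lo:3.0f}-{hi:3.0f}: {c:2d}  ({c/len(sodium):.2f})")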
Graphs for Time Varying Variables
Often we come across variables which are measured over time, for
example, our blood pressure when measured every month for one
year, the monthly temperature of a place measured for the last two
present. These plots are known as Time plots. Following is a time plot of the average annual temperature (i.e the average of the daily temperatures over a year) of Central Park in New York City.
Figure 3: Time plot of average annual temperature of
Central Park, NY
a clear increasing trend over the 100 year period...a tell-tale sign of
global warming!
the frequency table and the histogram can give us an idea of the
distribution of a variable.
The one thing we look for in a distribution is its shape. Some of
5. Left skewed : Distribution whose left tail is longer than the right
tail.
Eg : Life span (large majority of people live at least 65 years but
exam.
6. Right skewed : Distribution whose right tail is longer than the left
tail.
by a temple (large majority donates a standard amount but
Numerical Summaries
statistical data since they sort of provide a bird's eye view of the overall data pattern and often indicate the next step in the analytical
Measures of Center
1. Mean : The mean (or average) of a variable is just the sum of its
if you have a set of values, say {x1, x2, ..., xn} with frequencies {f1, f2, ..., fn}, the mean is given by x̄ = (f1x1 + f2x2 + ... + fnxn)/(f1 + f2 + ... + fn).
2. Median : The median of a variable is the midpoint of its
Eg: For the cereal data set, the observations (amount of sugar) are {1, 2, 3, 3, 3, 3, 5, 6, 7, 10, 10, 10, 11, 11, 12, 12, 13, 13, 14, 15}. Here the mean will be 164/20 = 8.2.
To calculate the median, we first arrange the observations in increasing order (as is done above) and calculate the mean of the two middle observations (since n = 20 is even) : the median is (10 + 10)/2 = 10.
Eg : The following table depicts the per capita CO2 emissions for some countries

Country          CO2 emission
China                2.3
India                1.1
United States       19.7
Indonesia            1.2
Brazil               1.8
Russia               9.8
Pakistan             0.7
Bangladesh           0.2
Do you spot an outlier ? Here the mean is 4.6 while the median is 1.5.
outliers (which virtually drag the mean towards themselves). However,
observations and hence is not affected (thus remains low). So, we say
that median is resistant to extreme observations while the mean is not.
to ; however,
Similarly, if the CO2 emission of the U.S increased to, say, 25, the mean would have shot up to 5.26 but the median would still remain the same - this is because this change will not affect the middle observations and hence the median.
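A quick Python check of this resistance property, using the CO2 values from the table above:

    import statistics

    co2 = [2.3, 1.1, 19.7, 1.2, 1.8, 9.8, 0.7, 0.2]
    print(statistics.mean(co2), statistics.median(co2))   # 4.6  1.5

    co2_bigger = [25 if x == 19.7 else x for x in co2]    # U.S. jumps to 25
    # mean rises, median unchanged
    print(statistics.mean(co2_bigger), statistics.median(co2_bigger))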
Often an extremely large value in the right(left) hand tail (of the
data distribu- tion) may drag the mean towards the right (left)
have
Right-skewed distribution
Left-skewed distribution
In a nutshell, the mean is always drawn towards the longer tail
that we will learn later - so, the use of mean or median would
depend on the data we have.
Measures of Spread
variables may have the same mean and median but very different
maximum value is 19.7 and the minimum is 0.2. So, the range is 19.7 - 0.2 = 19.5.
Unfortunately, range is also sensitive to the presence of outliers.
For example, if the CO2 emission for U.S was, say, 25, the range
Eg : For the cereal data, the mean (of Sodium) is x̄ = 185.5. Honey Nut Cheerios has sodium content 250 mg; so its deviation (from the mean) will be 250 - 185.5 = 64.5. The standard deviation works out to

s = √[ ((0 - 185.5)² + (70 - 185.5)² + ... + (290 - 185.5)²) / 19 ] = 71.25

This sort of means that the typical difference between an observation and the mean is 71.25.
Note :

Variance = s² = Σ_{i=1}^{n} (x_i - x̄)² / (n - 1)
s = 0 only when all the observations have the same value (i.e same as the mean). Eg: if all the cereals had Sodium content 100 mg, the mean would have been 100, hence each deviation (about the mean) would have been 0, resulting in a 0 standard deviation.
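The whole computation takes two lines in Python (statistics.stdev uses the same n - 1 divisor as the formula above):

    import statistics

    sodium = [0, 210, 260, 125, 220, 290, 210, 140, 220, 200,
              125, 170, 250, 150, 170, 70, 230, 200, 290, 180]
    print(statistics.mean(sodium))   # 185.5
    print(statistics.stdev(sodium))  # ~71.25 (sample standard deviation)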
Measures of Position
a location such that half of the data falls above it and the other half below it.
Eg : Suppose a student scores 1200 (out of 1600) in the GRE and is
told that his score is at the 90th percentile. This implies that 90% of
those who took the exam scored between the minimum score and
1200 i.e only 10% of the scores were higher than his.
First quartile (Q1) : Lowest 25% of the observations fall below it i.e p = 25.
Thus, the quartiles split the distribution into 4 distinct parts, each
median ( Q2).
3. Consider the observations below the median. The first quartile (Q1)
Eg. Let us find the quartiles of the sodium values in the 20 cereals. The Sodium values, arranged in increasing order, are : {0, 70, 125, 125, 140, 150, 170, 170, 180, 200, 200, 210, 210, 220, 220, 230, 250, 260, 290, 290}. Thus, we have
Median = (200 + 200)/2 = 200
Q1 = 145 (median of the lower 10 observations)
Q3 = 225 (median of the upper 10 observations)
Inter Quartile Range (IQR) : This is the distance between the third and the first quartiles i.e IQR = Q3 - Q1, i.e it is the range for the middle 50% of the data. For the Sodium data, IQR = 225 - 145 = 80.
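The book's median-of-halves definition of the quartiles can be coded directly, as below (note that library defaults, e.g numpy's interpolation rule, may give slightly different quartiles):

    import statistics

    sodium = sorted([0, 210, 260, 125, 220, 290, 210, 140, 220, 200,
                     125, 170, 250, 150, 170, 70, 230, 200, 290, 180])

    n = len(sodium)
    lower, upper = sodium[:n // 2], sodium[(n + 1) // 2:]
    q1 = statistics.median(lower)
    q2 = statistics.median(sodium)
    q3 = statistics.median(upper)
    print(q1, q2, q3, q3 - q1)   # 145  200.0  225  IQR = 80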
As with range and standard deviation, IQR increases with the spread
290), both the range and standard deviation would have increased
as Box-plot.
[Figure: Box-plot of the Sodium values for the cereal data.]
A Box-plot has the following features :
rest of the data except for potential outliers. These lines are
called whiskers.
For the Box-plot of the Cereal data, the box extends from 145 to 225 (i.e from Q1 to Q3),
with the larger part of the box and the longer whisker usually has
The Mode
Mode is the value with the largest frequency i.e the value that occurs most often. For a categorical variable, the mode is the category that corresponds to the highest percentage/frequency.
Note :
Chapter 4 : Sampling Distributions
7 Introduction
Statistical inference is all about using a sample to predict characteristics
has been a topic of heated debate for quite some time. You can
select from the current batch. In this case, the parameter is the true
The two most commonly used statistics are the sample mean (x̄) and
8 How Does it Work ?
Since sample statistics are based on samples, they will have different values for different samples. Thus, a statistic is also a random variable and will have a probability distribution that will assign probabilities to the possible values it can take for the different samples. The probability distribution of a sample statistic is called a sampling distribution.
sample). If you construct a histogram of those values, what you will
(of females).
officer (at a particular posting). For that purpose, suppose you select
random samples of 20 IAS officers from the pool of all IAS officers
currently in service. For each sample, you calculate the average tenure
and draw the histogram (of the sample average tenure value of an IAS
officer), what you will have is precisely the sampling distribution of the
possible samples and calculate the sample statistic value for each. In fact,
in a real-life study, we can only collect one sample. However, the theory
of sampling distribution will tell us how much a statistic would vary from
sample to sample and will help us to predict how close a statistic will
Necessary tools
of sampling distribution.
sample are x̄ and s respectively². If it can be assumed that the
of the observations would fall within 3 standard
9 Sampling Distribution of the Sample Mean
Since means or averages are so ubiquitous in Statistics, it is useful
vary from sample to sample and how close they will be to the
6
0
Suppose you draw samples of size n from a population with mean μ and standard deviation σ and calculate the sample mean (x̄) for each. Then the means will have a sampling distribution which will be centered around the true population mean μ and will have a standard deviation (or standard error) σ/√n. In fact, if n ≥ 30, then this distribution will be approximately normal with mean μ and standard error σ/√n.
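A small simulation illustrating this result (the exponential population is my arbitrary choice of a skewed, non-normal population):

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma, n = 5.0, 5.0, 30      # exponential(scale=5): mean = sd = 5

    # 10,000 samples of size n; one sample mean per sample
    means = rng.exponential(scale=mu, size=(10_000, n)).mean(axis=1)

    print(means.mean())              # close to mu
    print(means.std(ddof=1))         # close to sigma / sqrt(n) ~ 0.913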
In the light of this result, let us consider the following example :
stall (located adjacent to the boundary wall of IIM, a little to the left of the
main gate) vary from day to day. The daily sales figures fluctuate with
wants to calculate the mean daily sales for the week to check how he is
doing.
Q1. What would the mean daily sale figures for the week center around
?
Q2. How much variability would you expect in the mean daily sales
Thus, if Rambhai were to observe the mean daily sales for several
What will be the sampling distribution ? Will his mean daily sales
for the month vary more or less than the mean daily sales for the
week ?
His mean daily sales for the month will be centered around the same true mean (Rs 900) but with a smaller standard error. Thus, the mean daily sales for the month will tend to vary less than the mean daily sales for the week and thus be closer to the true mean sales of Rs 900. In fact, since n = 30, CLT holds and the sampling distribution of the mean daily sales for the month is approximately normal with mean Rs 900 and standard error σ/√30.
Clearly there is variability in the mean daily sales from
Q4. What is the probability that the mean daily sales of the month will
standard deviation
Note : (i) If n is sufficiently large such that both np and n(1 - p) are at least 10, then, due to CLT, the sampling distribution of p̂ is approximately normal with mean p and standard error √(p(1 - p)/n).
the top business schools across India, about 55% went abroad for
school and it turns out to be XLRI which has about 350 students
b) What is the probability that at least 50% of the 350 XLRI students
c) What is the probability that at most 70% of the 350 XLRI students
d) What is the probability that between 50% and 70% of the 350
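Under the normal approximation with p = 0.55 and n = 350 (the figures of the example), the three probabilities can be computed as follows (scipy is my choice of tool):

    import math
    from scipy.stats import norm

    p, n = 0.55, 350
    se = math.sqrt(p * (1 - p) / n)   # ~0.0266

    print(1 - norm.cdf(0.50, loc=p, scale=se))   # (b) P(p_hat >= 0.50)
    print(norm.cdf(0.70, loc=p, scale=se))       # (c) P(p_hat <= 0.70)
    print(norm.cdf(0.70, loc=p, scale=se)
          - norm.cdf(0.50, loc=p, scale=se))     # (d) between 0.50 and 0.70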
Note :
(i) Clearly, the standard error will decrease as you increase your sample size.
For example, if you select a larger B school which has 500 students,
population proportion.
the Empirical rule. For example, in the above example, nearly all the
Chapter 5 : Confidence Intervals
11 Introduction
The process of making decisions and predictions about one or more
population parameter.
Eg: The interval (5.1, 5.9) may be a 95% confidence interval of the
0.90, 0.95 and 0.99 being the commonly used values and is denoted
confidence levels.
Thus, if (a, b) is a 95% confidence interval of a parameter, say θ, then we can be 95% confident that (a, b) contains the true unknown value of θ in the population.
What this statement really implies is that, if we go on collecting a
large number of
random samples from the population and form a 95% confidence interval
from each
of those, then, in the long run, about 95% of those intervals would
contain the true population parameter and the rest 5% will not.
where the margin of error measures the accuracy with which the
Sample statistic : p̂ (the sample proportion)
Estimated se of p̂ : √(p̂(1 - p̂)/n)
Margin of error : Z_{α/2} × √(p̂(1 - p̂)/n)
where Z_{α/2} is that value of the standard normal variable above which the area under a (standard normal) curve is α/2. So, for a 90%, 95% and 99% confidence interval, Z_{α/2} will be 1.645, 1.96 and 2.575 respectively - these are quantiles of the standard normal distribution.
Note 1. (i) For fixed confidence level, as sample size increases, the standard error and hence the margin of error decrease. So, the confidence interval gets narrower.
(ii) For fixed sample size, as confidence level increases, the Z-score increases; hence the margin of error increases. Thus the confidence interval gets wider.
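A sketch of the interval computation (the function name and the illustrative inputs are mine):

    import math

    def proportion_ci(p_hat, n, z=1.96):
        """95% CI for a population proportion (normal approximation)."""
        se = math.sqrt(p_hat * (1 - p_hat) / n)
        moe = z * se
        return p_hat - moe, p_hat + moe

    # hypothetical numbers: 4% of a sample of 500 students
    print(proportion_ci(0.04, 500))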
career choices. 4% said they want to take the plunge and start their
own companies even if that meant giving up lucrative job offers from
of final year graduating students.
p will be :
Sample statistic : X̄ (the sample mean)
Standard error (se) of X̄ : σ/√n
Estimated se of X̄ : S/√n, where S is the sample standard deviation.
Margin of error : t_{α/2} × S/√n
2.Population distribution
section of a t-table :

Confidence Level
           90%      95%      99%
Right tail probability
df        t.05    t.025    t.005
1        6.314   12.706   63.657
2        2.920    4.303    9.925
3        2.353    3.182    5.841
4        2.132    2.776    4.604
5        2.015    2.571    4.032
6        1.943    2.447    3.707
7        1.895    2.365    3.499
8        1.860    2.306    3.355
9        1.833    2.262    3.250
10       1.812    2.228    3.169
standard deviation was 2.617. Find a 95% confidence interval of the
TV.
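A sketch of the t-based interval (the sample mean and sample size of the TV example are not fully shown above, so the inputs here are hypothetical placeholders around the stated s = 2.617):

    import math
    from scipy.stats import t

    def mean_ci(xbar, s, n, conf=0.95):
        tcrit = t.ppf(1 - (1 - conf) / 2, df=n - 1)
        moe = tcrit * s / math.sqrt(n)
        return xbar - moe, xbar + moe

    # hypothetical: xbar = 3.0 hours of TV, s = 2.617, n = 25
    print(mean_ci(3.0, 2.617, 25))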
need to determine the sample size that will yield that precision. This is
Now we will discuss how this is done for the case of proportion.
The sample size (n) for which a confidence interval of a population
As usual, the Z-score depends on the confidence level (i.e Z = 1.96 for a
the opportunity. To estimate this proportion in the IIMs with
sample size if
Note 3. i) p = 0.5 will always result in a larger (i.e more conservative) sample
size estimate.
ii) Always round the sample size estimate to the next higher integer.
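The usual determination formula n = Z² p(1 - p)/m² (m being the desired margin of error) as a sketch, with the conservative p = 0.5 default of Note 3:

    import math

    def sample_size_for_proportion(margin, p=0.5, z=1.96):
        """n needed so the CI for a proportion has the given margin of error."""
        n = z**2 * p * (1 - p) / margin**2
        return math.ceil(n)   # always round up (Note 3 ii)

    print(sample_size_for_proportion(0.04))         # conservative: p = 0.5
    print(sample_size_for_proportion(0.04, p=0.3))  # smaller n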
Chapter 6 : Hypotheses Tests for one Population
15 Introduction
Significance tests (or Tests of Hypotheses) are an integral part of
sided.
IIMA student spends more than 15 hours per week solving
sided alternatives are also possible.
p.
the parameter). This closeness is measured in terms of the
standard error
of the point estimate. Thus, the test statistic is given by
null hypothesis based on the available data. The smaller the p-value, the stronger is the evidence against the null hypothesis and vice versa.
The smallness of the p-value is measured with respect to the
1. Assumptions :
Sample size (n) should be large enough such that np0 ≥ 10 and n(1 - p0) ≥ 10.
2. Hypotheses :
Null : p = p0
Alternative :
3. Test statistic: Z = (p̂ - p0) / √(p0(1 - p0)/n)
4. P-value : The p-value is computed based on the alternative as follows :
If Ha : p > p0, the p-value will be the right tailed area above the observed value of the test statistic (Zobs) under the standard normal curve.
5. Drawing conclusion : We will reject H0 if the p-value is less than the significance level α, and fail to reject H0 otherwise.
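Steps 3-5 bundled into a sketch (the inputs in the usage line are hypothetical, not the book's example):

    import math
    from scipy.stats import norm

    def one_proportion_test(p_hat, p0, n, alternative="two-sided"):
        se = math.sqrt(p0 * (1 - p0) / n)   # se under H0
        z = (p_hat - p0) / se
        if alternative == "greater":
            p_value = 1 - norm.cdf(z)
        elif alternative == "less":
            p_value = norm.cdf(z)
        else:
            p_value = 2 * (1 - norm.cdf(abs(z)))
        return z, p_value

    # hypothetical: 25% of 100 sampled managers are female, H0: p0 = 0.30
    print(one_proportion_test(0.25, 0.30, 100, alternative="less"))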
are female in the Indian corporate sector has been pretty low,
problem.
managers. Let us
1. Assumptions :
2. Hypotheses :
Null :
Alternative :
3. Test statistic:
This implies that the sample proportion (0.25) falls
standard errors
reject H0.
From the normal tables, the p-value will be
1. Assumptions :
2. Hypotheses :
Null : μ = μ0
Alternative :
3. Test statistic: t = (x̄ - μ0) / (S/√n)
As for confidence intervals, our test statistic follows a t distribution with n - 1 degrees of freedom under H0. The p-value is computed based on the alternative as follows :
If Ha : μ < μ0, the p-value will be the left tailed area below the observed value of the test statistic under a t_{n-1} curve.
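scipy bundles all of these steps into one call; a sketch (the weekly library hours below are hypothetical stand-ins for the actual sample):

    from scipy.stats import ttest_1samp

    # hypothetical weekly library hours for 10 students; H0: mu = 15
    hours = [12.5, 14.0, 9.5, 16.0, 11.0, 13.5, 10.0, 15.5, 12.0, 13.0]
    result = ttest_1samp(hours, popmean=15, alternative="less")
    print(result.statistic, result.pvalue)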
Vikram Sarabhai library per week. Recently, the library has been
spending less time in the library. Accordingly a random sample of
shifting of the library has adversely impacted the study time in
population).
1. Assumptions :
2. Hypotheses :
Null :
Alternative :
3. Test statistic:
Following is a small part of the t table :
5. Drawing conclusion :
goes down.
as possible. The probability of rejecting H0 when it is false is called
the power of the test. It is given by :
Chapter 7 : Comparison of Two Populations
19 Introduction
In the previous lectures, we learned how to construct confidence
compare population means. Here we will only deal with techniques for
Confidence Intervals
1. Notation :
2. Assumptions :
These will ensure that the sampling distribution of (p̂1 - p̂2) is approximately normal under the CLT. The standard error is given by

se(p̂1 - p̂2) = √( p̂1(1 - p̂1)/n1 + p̂2(1 - p̂2)/n2 )
n1
4. Observations :
that
different.
523 males and 498 females were randomly sampled and each of them
was asked the question "Do you believe in miracles ?". The
Belief in miracles
Gender    Yes    No    Total
Male      225    298    523
Female    276    222    498
1. Assumptions : Here the assumptions are satisfied because (i)
the samples of males and females were chosen randomly and ii)
2. Structure : Here
p̂1 = 225/523 = 0.430, p̂2 = 276/498 = 0.554, se(p̂1 - p̂2) = √(0.430 × 0.570/523 + 0.554 × 0.446/498) = 0.031
Hypothesis Tests
two groups can be carried out. As for confidence intervals, the general
2. Assumptions : same as for confidence intervals.
H0 : p1 - p2 = 0
Ha :
4. Test statistic : The test statistic is given by

Z = (p̂1 - p̂2 - 0) / √( p̄(1 - p̄)(1/n1 + 1/n2) ) ~ N(0, 1)

where p̄ is the pooled sample proportion. The p-value will be the corresponding one (for one sided alternative) or two (for two sided alternative) tailed area under the standard normal curve.
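A sketch of this pooled test on the miracles data above (statsmodels' proportions_ztest implements exactly this pooled Z statistic):

    from statsmodels.stats.proportion import proportions_ztest

    # males: 225 of 523 said Yes; females: 276 of 498
    z, p_value = proportions_ztest(count=[225, 276], nobs=[523, 498])
    print(z, p_value)   # two-sided by default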
2. Hypotheses :
H0 :
Ha :
Thus, p̂1 - p̂2 falls about 4 (estimated) standard errors below the null value of 0.
Chapter 8 : Analysis of Variance (ANOVA)
21 Introduction
In Chapter 7, we have learnt how to compare two groups (or
ANOVA.
(population) average starting salaries across the top five IIMs (IIMA,
groups are called factors while the individual categories are called the
levels. So, in the above example, there is one factor (IIMs) with levels
chapter.
22 One-way ANOVA
ANOVA is basically a hypotheses testing problem. As with any
hypotheses test, it is based on certain assumptions as follows :
the population means of the response for the g groups (say 1, 2, ..., g
will respectively be

H0 : μ1 = μ2 = ... = μg
Ha : at least two of the μi are different

F = (Variability between groups) / (Variability within groups) = σ_b² / σ_w²   (3)
Observations :
within groups, will be the evidence against the null and the
we will not know the values of σ_b² and σ_w² in the above ratio. So, we replace those by their estimates, MSB (the mean square between) and MSE (the mean square error).
Now, MSB and MSE can again be represented as the ratios of the corresponding sums of squares and degrees of freedom, namely SSB/(g - 1) and SSE/(N - g) respectively, N being the total combined sample size from all the groups. Thus, replacing these values in (2), the test statistic becomes

F = (SSB/(g - 1)) / (SSE/(N - g))   (5)
which follows an F distribution with (g - 1, N - g) degrees of freedom.
P-values : Since the F distribution is only defined on the positive axis, the p-value will always be the right tailed area above the observed F value.
We can use the F table at the back of the book for that.
Here SST is the Sum of Squares Total and is equal to SSB + SSE.
Low Cal and Low Carb) by the amount of weight loss they induce
these regimens and measure their weights before and after (they take
the diet) and take the difference of the same. It can be assumed that
population) by these 3 diets are the same or not. Thus, the null and alternative hypotheses will be
H0 : μ1 = μ2 = μ3 vs Ha : at least two of the μi differ
Here, n1 = n2 = n3 = 10, N = 30 and g = 3.
Moreover, it can be shown that SSB = 122.1 and SST = 182.85. Thus,
Df of SSB = g - 1 = 2
Df of SSE = N - g = 27
Df of SST = N - 1 = 29
SSE = SST - SSB = 182.85 - 122.1 = 60.75
MSB = 122.1/2 = 61.05
MSE = 60.75/27 = 2.25
F = 61.05/2.25 = 27.13
Df of F = (2, 27)
[ANOVA table rows: Between groups, Within groups (Error), Total - with the corresponding df, sums of squares and mean squares.]
P-value : From the F table at the back of the book, we have, for df (2, 27), the right tailed area above 5.49 is .01. Hence the area to the right of 27.13 will be even smaller, i.e our p-value will be < .01.
Notations :
Total sample size in all the groups combined : N = Σ_{i=1}^{g} n_i
Overall sample mean : Ȳ = (1/N) Σ_{i=1}^{g} Σ_{j=1}^{n_i} Y_ij
Sum of squares between (SSB) : Σ_{i=1}^{g} n_i (Ȳ_i - Ȳ)²
Sum of squares within (or error, SSE) : Σ_{i=1}^{g} Σ_{j=1}^{n_i} (Y_ij - Ȳ_i)²
Sum of squares total (SST) : Σ_{i=1}^{g} Σ_{j=1}^{n_i} (Y_ij - Ȳ)²
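These formulas translate directly into code; a sketch on made-up weight-loss figures for the three diets (the numbers are illustrative, not the book's dataset):

    import numpy as np

    groups = [np.array([4.2, 5.1, 6.0, 5.5, 4.8]),   # Low Fat (hypothetical)
              np.array([3.0, 2.5, 3.8, 3.1, 2.9]),   # Low Cal
              np.array([1.2, 2.0, 1.5, 1.8, 1.1])]   # Low Carb

    N = sum(len(y) for y in groups)
    g = len(groups)
    grand_mean = np.concatenate(groups).mean()

    ssb = sum(len(y) * (y.mean() - grand_mean) ** 2 for y in groups)
    sse = sum(((y - y.mean()) ** 2).sum() for y in groups)
    F = (ssb / (g - 1)) / (sse / (N - g))
    print(F)   # compare with scipy.stats.f_oneway(*groups)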
Chapter 9 : Simple Linear Regression
23 Introduction
One of the most fundamental aspects of statistical practice is to analyze
variables using the known value of the other. The branch of Statistics that
groups 40 and 79 and calculates their muscle mass. Using the tools of
regression analysis, you can help her figure out the true underlying
women in that age range using the information from the above 60
women.
like education (e), urbanization (u) and income (i) levels of that
region. Using regression analysis, we can predict the crime rate of a
are the
response variables.
examples, , and
variables.
Yn) where each pair corresponds to a unit and n is the sample size.
Whether there are any unusual points which fall well apart from
Figure 5. shows the scatterplot of muscle mass against age of the
60 women.
[Figure 5: Scatterplot of Muscle mass against Age.]
Observations :
1. The points show a strong decreasing trend i.e muscle mass and age are strongly negatively associated.
2. Age and muscle mass seem to have a linear relationship i.e a straight line would describe the trend of the points quite well.
3. We do not see any point which falls well apart from the general trend of the points, i.e there do not seem to be any outliers or influential observations.
25 Correlation
When X and Y have an approximately linear relationship, we can
Population and Sample Correlation
There are two different quantities that might be called the correlation
If X and Y have a positive association (as one goes up, the other
If X and Y have a negative association (as one goes up, the other
Correlation coefficient does not depend on the units of the variables nor on their
identities (i.e response or explanatory) - this is a big advantage of
correlation coefficients.
X̄ = (1/n) Σ_{i=1}^{n} X_i ,   S_X = √[ (1/(n-1)) Σ_{i=1}^{n} (X_i - X̄)² ]

and similarly for Ȳ and S_Y. The sample correlation coefficient is then

r = (1/(n-1)) Σ_{i=1}^{n} ((X_i - X̄)/S_X) ((Y_i - Ȳ)/S_Y)
The closer the correlation is to -1 or +1, the stronger the linear
association is between the two variables. For sample data, this means
Since r is quite close to -1, we conclude that age and muscle mass have
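In code, r is a single call; a small illustration on made-up age/muscle-mass pairs:

    import numpy as np

    age = np.array([43, 48, 56, 61, 67, 73])        # hypothetical pairs
    muscle = np.array([106, 98, 89, 90, 76, 71])

    r = np.corrcoef(age, muscle)[0, 1]
    print(r)   # strongly negative, as in the muscle-mass example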
The scatterplot on the right clearly shows that X and Y have a non-
linear (parabolic) relationship which cannot be quantified through
the correlation coefficient. So, r = .06 does not have a meaning in
this context and can even be misleading. On the other hand, the fact
that r = .07 for the first graph makes sense because it reflects the
scatter of the points. Thus, in
Figure 7: Two very different data sets with sample correlations near
zero (r = 0.07 on the left, and r = 0.06 on the right).
about.
Outliers
The presence of outliers in a set of sample data can greatly influence the
value of r. For example, in Figure 8, the addition of one extra observation
changes the sample correlation from r = 0.90 to r = 0.47. In fact, if the new
point is a genuine outlier, the new pattern is not linear anymore. Thus,
correlation coefficient may not be meaningful when a dataset contains one or
more outliers.
further by finding the equation of the straight line that best describes
Figure 8: Effect of an outlier on the sample correlation. The data sets on the left
example.
all Indian adults whose height is 65 inches. Similarly the weights of all Indians who are 70 inches tall (i.e X = 70) will vary around some mean, say μY(70), which we expect to be greater than μY(65) (since height and weight are assumed to have a positive association).
What we're going to assume about the population is that X and μY(X) are related by a straight line, as shown in the following figure. This is
We write this relationship as

μY(X) = α + βX   (6)

where β is the slope and α is the y-intercept for the population. Since α and β are
between X and μY(X) will not be linear. For example, Figure 9. plots
their age.
[Figure 9: Scatterplot of strontium.ratio against age for the fossil data.]
the above data. So, it's important to think about whether the linearity
assumption is at least somewhat sensible before deciding to conduct a study or
analyze data using simple linear regression. - this is where scatterplots come
into play.
Remember: Variability !!
Y = α + βX ?   NO!
but this is unrealistic, because every individual with the same value of
X will not have the same value of Y. (Do all PGPX students who study the same amount of time end up with the same CGPA/starting salary? Do all companies having the same manpower generate the same revenue per year?)
pattern.
Statistical Model
We have n pairs of data points, viz. (X1, Y1), (X2, Y2),...,(Xn, Yn), where

Y = α + βX + ε = μY(X) + ε   (7)
with mean 0 and variance σ²), i.e for a particular value of X = x, Y is assumed to have a normal distribution with mean α + βx and variance σ². However, since we do not have population data, α and β are unknown and will be estimated from the sample. This procedure, known as Least Squares Estimation, is done in such a way that the resulting straight line has the best possible fit to the given sample data.
(X1, Y1), (X2, Y2),...,(Xn, Yn), we have to find the best possible estimates of α and β (say, α̂ and β̂). We will use the sample data to get these estimates. The resulting line (also known as the least squares regression line) is then

Ŷi = α̂ + β̂Xi , i = 1, 2, ..., n   (8)

and the observed values are related to the fitted values by the equation

Yi = Ŷi + ei , i = 1, 2, ..., n   (9)
Here (e1 , e2 , ..., en ) are the observed errors and are known as the residuals.
β̂ = r (S_Y / S_X)   (10)
α̂ = Ȳ - β̂ X̄   (11)
Given α̂ and β̂, the least squares regression line (or the estimated line) is given by

Ŷ(x) = α̂ + β̂x   (12)
relationship in the sample, which will usually be close, but not exactly
Thus, it makes sense to use the various parts of the regression equation,
Interpretation of Coefficients
α̂ and β̂ are the sample Y-intercept and slope, while α and β are their population counterparts. The sign of the slope indicates the direction of change while the value tells us
Note : Unlike the correlation, the value of the slope will depend on the units in which
X and Y are measured.
For the muscle mass example, the sample summary statistics are

X̄ = 59.98, S_X = 11.80, Ȳ = 84.97, S_Y = 16.21, r = -.866
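Plugging these summary statistics into (10) and (11) gives the fitted line; in code:

    x_bar, s_x = 59.98, 11.80
    y_bar, s_y = 84.97, 16.21
    r = -0.866

    beta_hat = r * s_y / s_x               # ~ -1.19 per year of age
    alpha_hat = y_bar - beta_hat * x_bar   # ~ 156.3
    print(alpha_hat, beta_hat)

    age = 56                               # Mrs. Tripathi's age (example below)
    print(alpha_hat + beta_hat * age)      # predicted muscle mass ~ 89.7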
Figure 11. depicts the scatterplot of the age-muscle mass data with
Interpretation
For each one year increase in the age of a woman, her predicted muscle mass will decrease by 1.19 units.
Since the y-intercept is 156.3, a woman with 0 yrs of age (i.e a new born
[Figure 11: Scatterplot of Muscle mass against Age with the fitted least squares line.]
Making Predictions
One of the most important use of the least squares regression
equation is to predict unknown values of the response variable (Y )
from given or known values of the explanatory variables (X). We can
predict the value of Y for any particular value of X by simply
plugging that value of X into the regression equation and seeing what
we get for Y . However, we need to keep in mind that this X value
should come from a subject who is similar to the subjects sampled in
the original data set. Otherwise, we may run the risk of
extrapolation.
If we are predicting for an in-sample subject, the prediction will not
One of the women sampled for the Muscle-mass data was Mrs.
Tripathi who is 56 years old. So, her predicted muscle mass will be
Figure 12: Visualization of a prediction using the regression equation (solid line).
Residuals
Residual_i = Y_i - Ŷ_i   (13)
In this way, we can obtain the residuals of all the sampled subjects.
Suppose the actual muscle mass of Mrs. Tripathi is 97. Then her residual will be 97 - 89.7 = 7.3.
Predictive Ability
from given or known values of the explanatory variables (X). Figure 13.
predictions.
predicted values (on the line), better is the predictive ability of the
Figure 13: Visual interpretation of predictive ability. The
Coefficient of Determination
We have already seen that the correlation coefficient of age and
muscle mass is
-0.866. Hence the coefficient of determination for the least squares regression will be R² = (-0.866)² = 0.75.
So, we conclude that the least squares regression line has a pretty
good predictive ability since R2 is quite high. Moreover, the above
regression line results in 75% less error in predicting muscle mass
compared to using Ȳ alone.
R2 can also be interpreted as the amount of variability in the
of the variability in muscle mass can be explained by age. So,
whichever way we see it, the least squares prediction equation does
Extrapolation
range of the sample values i.e significantly smaller than the smallest
Just as outliers can greatly influence the correlation, they can also
equation.
equation can fit the data very poorly, and any results we obtain might
be unreliable. So it's always a good idea to look at a scatterplot to
make sure there are no outliers when doing simple linear regression.
observation.
Chapter 10 : Parameter Estimation
Basic idea
μY(X) = α + βX
Example 1. In the age-muscle mass example, β = 0 would imply that muscle mass and age are not linearly related.
Here we will learn to perform a Regression t test for β. For that, we need
Hypotheses
association.
Test statistic

t = (β̂ - 0) / se(β̂) ~ t_{n-2} under H0   (14)
P-values
Definition 1. The p-value is the probability of getting a test statistic value at least
as extreme as the one observed, assuming H0 is true.
Although it looks like a standard normal distribution, a t
normal.
for us, we often have to use a t table like the one in the back of our
textbook to try to figure out the p-value. A typical t table, like the one
Figure 15: Two-tailed probability of a t distribution, df = 30.
to different df values. Within the appropriate row, the table shows the
probability for any test statistic value we want, and we then double
this right tail probability to get the p-value (for a two-sided test).
Right-Tail Probability
df    0.10    0.05    0.025    0.010    0.005    0.001
1    3.078   6.314   12.706   31.821   63.657   318.309
2    1.886   2.920    4.303    6.965    9.925    22.327
3    1.638   2.353    3.182    4.541    5.841    10.215
4    1.533   2.132    2.776    3.747    4.604     7.173
5    1.476   2.015    2.571    3.365    4.032     5.893
If instead Ha is true, then our test statistic will probably be farther
from zero, and so our p-value will probably be smaller, which
also makes sense.
Bear in mind that if software says that the p-value for a two-
answer as 0.074.
Decision
level (often 0.05) and making a decision the same way we always do:
Note : If Ha : β > 0 and we fail to reject H0, it MAY NOT imply that Y and X are independent (maybe β < 0 i.e Y and X have a negative association). We can only conclude that, given the data, there is no strong evidence that Y and X have a positive association.
Age-muscle mass revisited
Example 3. Lets go through all the steps of a two-sided regression t-test for the
age-muscle mass example.
Assumptions
Age-muscle mass observations for different women can be assumed to be independent.
Mean muscle mass and age are linearly associated. (vide Fig 1).
Hypotheses
linearly related.
Test Statistic
The SPSS output for the muscle mass data is given below
Since the sample size is 60, the degrees of freedom (df) of the above t statistic will be 60 - 2 = 58.
P-value
under a t curve with 58 df. The relevant t scores for this degrees of
Right-Tail Probability
Decision
Note : If we had tested Ha : β < 0, the one sided p-value would also have been ≈ 0 (since the t statistic is negative); we still would have rejected H0 and concluded that there is strong evidence to believe that age and muscle mass are negatively associated.
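For completeness, here is how the whole regression t test is produced in one call (scipy's linregress; the data arrays are hypothetical stand-ins for the 60 women):

    from scipy.stats import linregress

    age = [43, 48, 56, 61, 67, 71, 74, 52, 58, 65]     # hypothetical subset
    muscle = [106, 101, 89, 93, 76, 72, 68, 95, 90, 80]

    fit = linregress(age, muscle)
    print(fit.slope, fit.intercept)   # beta_hat, alpha_hat
    print(fit.stderr)                 # se(beta_hat)
    print(fit.pvalue)                 # two-sided p-value for H0: beta = 0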
are exactly the same as those used for the regression t test which we
Formula

β̂ ± t_{α/2, n-2} × se(β̂)

where the standard error of β̂ is the same as the one used in the t statistic.
Some t tables, including the one we will use, provide a second set of
appropriate t-score for a confidence interval. Simply find the row for
level, and the number in the body of the table is the t-score that
Confidence Level
Interpretation
is that we are 95% confident that the true value of β is between (lower
that are reasonable based on the data. We can fine-tune what we mean
most and at least . Thus, both the two sided significance test and
slope.
Note : We rarely perform inferential procedures for α because often its interpretation is not realistic. However, hypotheses tests and confidence interval procedures for α work exactly the same way as those for β.
Chapter 11 : Regression Diagnostics
29 Introduction
In fitting a linear regression model to a given data set, we have to
Now, some (or all) of these assumptions may not hold for a
Analysis.
Residual Analysis
It turns out that if a regression model fails to satisfy some
assumptions for a given data set, it gets reflected very clearly in the
residuals of the fitted model. So, an examination of the residuals is
a very effective way of checking the appropriateness of a regression
model for a particular data set. This is implemented by plotting the
residuals or standardized residuals (an improved version of the
residuals) against the covariate(s) (X) or the fitted/predicted values
(Y ).
For example, one of our assumptions for the population regression model is that ε_i ~ N(0, σ²) (where the ε_i's are the errors). If a regression equation fits the data well, the residuals e_i's should also tend to be independently distributed about 0 with constant
variance σ². Thus, an examination of the residuals should give us a
pretty good idea whether the above assumption has been satisfied by
Type of Departures
observations) can be hard to check once the data has been collected.
Non-linearity of data
appropriate for the data, the residuals, when plotted against X, tend to
Figure 19. shows the residual plot for the age-muscle mass data.
[Figure 19: Standardized residuals plotted against Age for the age-muscle mass data.]
Clearly, the residuals roughly follow a random pattern above and below
the 0-line.
Thus the linear regression model seems to be appropriate for this
data set.
Figure 20. shows the non-linear fossil data (with the fitted least squares line) along with the plot of the residuals against X (age).
[Figure 20: strontium.ratio against age with the fitted line (left) and the residuals against age (right).]
pattern about 0, indicating that a straight line fit is not at all
error variance (V(ε_i) = σ²) has been satisfied. We have the following rule of thumb :
If the error variance is constant, the residuals will be randomly scattered about 0.
If the error variance increases with X, the residuals will have increasing spread as X increases.
If the error variance decreases with X, the residuals will have decreasing spread as X increases.
pattern of the residuals with age (X). This implies that the error
assumption.
Non-normality of errors
One of the most basic assumptions we made for our regression model
of outliers. Figure 21. shows the histogram and box plots of standardized residuals for the age-muscle mass data. The plots
serious. Moreover, two sided tests and confidence intervals are
[Figure 21: Histogram and box plot of the standardized residuals for the age-muscle mass data.]
that deviates substantially from linearity suggests that the normality assumption may not hold. Following is such a plot for the age-muscle mass data.
[Figure: Normal quantile plot of the standardized residuals (Standardized residual against Expected normal value).]
Note 4. Residuals (or standardized residuals) can also be used to check for other
qualities of the dataset, like correlated errors and the presence of outliers (or influential points). If the data are collected in such a way that there are some inherent
dependence between successive observations (longitudinal study, spatial data etc), the
same will be reflected in the residuals. In that case, the residuals will reflect a definite
trend when plotted against the time variable. On the other hand, if the errors terms
are independent, the residuals will fluctuate more or less randomly around 0.
There are various ways of checking outliers. One rule of thumb is that, if the
absolute value of the standardized residual for an observation is more than 3, that
observation can be assumed to be an outlier.
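That rule of thumb as a sketch (residuals are standardized by their sample standard deviation here, a simplification of the leverage-adjusted standardization most software uses):

    import numpy as np

    def flag_outliers(y, y_hat, threshold=3.0):
        resid = np.asarray(y) - np.asarray(y_hat)
        std_resid = resid / resid.std(ddof=1)   # simplified standardization
        return np.where(np.abs(std_resid) > threshold)[0]

    # usage: indices = flag_outliers(observed, fitted)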
Chapter 12 : Multiple Regression
30 Introduction
So far we have used a single explanatory variable (X) in the linear
the true variability of the response and hence the resulting regression
Example 5.
The crime rate (number of crimes per 1000 residents)(say Y ) of a particular region
can depend on a lot of factors like the percentage of residents who are well educated
(say X1), the level of urbanization (X2), the average income of the residents (X3) etc.
Thus, in order to accurately predict the true crime rate of a region, we should take all
these factors into account because each of these give us some information about the
crime rate in that region.
General Form
where β̂_k is the least squares estimate of β_k (k = 1, ..., p), obtained by minimizing the sum of squares of residuals - this is similar to what was done for
Interpretation of Coefficients
regression model have the same spirit as those for the simple linear
regression set up, the association pattern between the response and a
Example 6.
We have data on crime rate (Y), percent with high school education
and median income (X3) (in thousands of Dollars) for all the
. Specifically, the predicted crime rate of a county
increase in the education rate i.e a county will be safer if more of its
it is.
rate - for 1 thousand Dollar increase in the median income of the
predictor will remain the same for any value of the other predictors.
Note 5. A basic difference between multiple and simple regression models is that for
the former, in order to interpret the effect of a predictor, we fix (or control for) the
other predictors but for the latter, we ignore any other possible predictors in order
to interpret the effect of a particular predictor.
In the above example, suppose we regress Y only on X2 (Urbanization). The
regression equation is given by
Y = 24.54 + 0.562X2
Here education and income have been altogether ignored, NOT controlled. The slope of
X2 has also changed (decreased from 0.697 to 0.562). Thus, ignoring and controlling
for a variable have different impact on the regression model.
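A sketch of fitting both models (statsmodels' formula interface; the file name is a placeholder since the Florida crime dataset itself is not reproduced here):

    import pandas as pd
    import statsmodels.formula.api as smf

    # hypothetical file with columns: crime, education, urbanization, income
    df = pd.read_csv("florida_crime.csv")

    full = smf.ols("crime ~ education + urbanization + income", data=df).fit()
    print(full.params)        # intercept and the three partial slopes

    urban_only = smf.ols("crime ~ urbanization", data=df).fit()
    print(urban_only.params)  # slope changes: ignoring != controlling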
Inferential Procedures
R² can only increase when additional predictor variables are added to the model. So, a better measure of fit is the adjusted R², given by

R²_a = 1 - ((n - 1)/(n - p - 1)) (1 - R²)
In fact, R²_a can even decrease with the addition of a predictor variable.
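The adjustment in code (the inputs are illustrative):

    def adjusted_r2(r2, n, p):
        """Adjusted R-squared for n observations and p predictors."""
        return 1 - (n - 1) / (n - p - 1) * (1 - r2)

    print(adjusted_r2(0.47, n=67, p=2))   # hypothetical values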
Example 7.
total variation in crime rate. Now, let us test the amount by which
with crime rate) and add on the other predictors. The following
between (U, I) and (U, E). However, if we use adjusted R 2, we have the
Clearly, R²_a identifies the model with urbanization and education as a better one.
As in the simple linear regression set up, we can test for the
model (16). The necessary assumptions are the same as that for a
The observations (Yi, Xi1, ..., Xip) are independent of each other.
Normality of the errors (vis-a-vis the response) for any given value
Errors (vis-a-vis the response) have the same standard deviation for
For testing the significance of a particular predictor, say Xk, the null and alternative hypotheses are given by

H0 : β_k = 0 vs Ha : β_k ≠ 0

We reject H0 if the p-value < α and fail to reject H0 otherwise, where the p-values are obtained in the usual manner.
Example 8.
be tentatively satisfied. Let us test whether income (X3) has any
and education (X1). Thus our hypotheses will be
The estimates and standard errors for the various predictors are
Predictor        Estimate   Standard Error
Intercept         59.715       28.59
Education         -0.467        0.554
Urbanization       0.697        0.129
Income            -0.383        0.941
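From this table, each predictor's t statistic is just estimate/se - e.g for income, t = -0.383/0.941 ≈ -0.41; a quick check:

    estimates = {"Education": (-0.467, 0.554),
                 "Urbanization": (0.697, 0.129),
                 "Income": (-0.383, 0.941)}
    for name, (b, se) in estimates.items():
        print(f"{name:12s} t = {b / se:6.2f}")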
P-value : For the above value of the test statistic and degrees of
β̂_k ± t_{α/2, n-p-1} × se(β̂_k)
Example 9.
From the t-table, we have t0.025,63 = 2.0. Thus the 95% confidence interval of β3 will be -0.383 ± 2.0 × 0.941 = (-2.265, 1.499).
Note 6. Two sided tests and confidence intervals are robust to the violation of the
normality assumption for large sample sizes (n > 30).
31 Regression Diagnostics
As for simple linear regression, we can use residual analysis to verify
nice tool to test for the overall appropriateness of the model. For
example,
can help us to test for linearity of the regression model and the
distributional assumption.
Example 10. For the crime data set, the residual analysis is given below.
[Figure: Standardized residuals against the fitted values.]
2. The residual plot corresponding to education is shown below
[Figures: Standardized residuals against Education, and against Urbanization.]
Here also the residuals fluctuate more or less randomly about 0
4. The following figure shows the residual plot against income.
[Figure: Standardized residuals against Income.]
[Figures: Histogram and quantile plot of the residuals.]
The above plots indicate that the normal distributional
Note 7. Residuals can also be used to test for the presence of effects which were not
included in the original regression model. For example, if the residual plot against X1² and/or X1X2 depicts a systematic non-random pattern, it is indicative of the presence
of a quadratic or an interaction effect.
[Figures: Standardized residuals against the squared and interaction terms.]
Both the patterns are pretty random (same is true for the
that the predictors do not interact in their effect on crime rate i.e
32 Multicollinearity
As we have studied so far, multiple regression analysis deals with
may be correlated with each other - in fact they are, as the following correlation matrix shows
Non-correlated Predictors
In order to view multicollinearity in the proper context and appreciate
between size of work force (X1) and level of bonus pay (X2) on
1. Y = 23.5 + 5.375X1
2. Y = 27.25 + 9.25X2
3. Y = 0.375 + 5.375X1 + 9.250X2
Comparing (1) and (3), we see that the coefficient of X1 remains the
same (5.375) whether or not X2 is in the model. In other words, the
association pattern (strength and direction) between X1 and Y is
independent of X2. Clearly this is because X1 and X2 are
uncorrelated.
Same is true for X2 since its coefficient (9.25) remains the same
whether or not X1 is in the model (compare (2) and (3)). Thus the
influence pattern of X2 on Y is independent of the presence (or
absence) of X1.
Analyzing Multicollinearity
Example 11. Now let us test for and analyze the effects of multicollinearity for the
Florida crime dataset.
income).
Clearly a definite positive linear trend is visible in all the plots, with correlations 0.791, 0.793 and 0.731 respectively, which are pretty high. Last but not least, the R² values obtained by regressing each of the predictors on the other two are also quite high
Urbanization ~ education + income : R² = 65.43%.
[Figures: Pairwise scatterplots of the three predictors - (e) education vs urbanization, (f) education vs income, and (g) urbanization vs income.]
Although multicollinearity may affect different aspects of the fit of a
Model variables     Coefficient of X1   Coefficient of X2
X1                       1.486                 -
X2                         -                 0.562
X1, X2                  -0.583               0.683
X1, X3                   1.052                 -
X2, X3                     -                 0.642
X1, X2, X3              -0.467               0.697
the various models, still it is far from being constant. Same is true for
the coefficient
32.3 Measuring Multicollinearity
known as the variance inflation factor or VIF. The VIF of a variable, say
Xj , is defined as
VIF_j = 1/(1 - R_j²)

where R_j² is the coefficient of determination for the regression of X_j on the remaining predictors.
Method : Start with all the predictors. Examine if any predictor(s) have VIF > 4. Drop the one with the highest VIF. Keep repeating the exercise until all remaining predictors have VIF ≤ 4.
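A sketch of this VIF screen (statsmodels provides variance_inflation_factor; the data file is the same placeholder as before):

    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    from statsmodels.tools import add_constant

    df = pd.read_csv("florida_crime.csv")   # placeholder file, as before
    X = add_constant(df[["education", "urbanization", "income"]])
    vifs = {col: variance_inflation_factor(X.values, i)
            for i, col in enumerate(X.columns) if col != "const"}
    print(vifs)   # drop the predictor with the largest VIF if it exceeds 4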
Example 12. For the Florida crime data set, we have the following MINITAB output
enough to warrant deletion of any of the variables.
Note 8. Dropping a variable from a model based on VIF does not mean that the
variable has no effect on the response. It just means that its effect is adequately
explained by the remaining variables.