
Statistical Data Analysis

Gangineni Dhananjhay
Chapter 1 : Introduction

1 What is Statistics? Why do we need it?


In everyday life, we always face scenarios (that matter to us) and want to make sense of them - this often boils down to asking ourselves a question and seeking an answer that satisfies it. In order to do that, we have to analyze the available information in an objective and appropriate manner. For example, we may be interested in the following questions:

Does excessive use of social media make a person more lonely/isolated?

Does smoking cause lung cancer?

Are astrology predictions better than mere guessing?

How likely are you to win a lottery?

Does Aspirin reduce the chance of heart attack?

Statistics is probably the only scientific discipline that helps us answer questions like those above in a data-based, scientific manner.

Every statistical study is motivated by a question (like one of the above) that directly relates to reality. In order to answer that question satisfactorily, we have to design a suitable study (we will learn how to do this later on). The study produces data (or information). Statistical techniques are then applied to make sense of the data, and the conclusions drawn (by using those statistical methods on the dataset) are the answers to the question we started off with.

Definition: Statistics is the art and science of designing studies, analyzing the data those studies produce and drawing objective conclusions based on the analysis. Its ultimate goal is to translate data into knowledge that would better help us in understanding the world around us. Thus, in a nutshell, Statistics is the art and science of learning from data.

2 Statistical Enquiry
Every statistical enquiry has four distinct stages, viz. :

1. The Question : Here you should give careful thought to the question that needs to be answered by the study. This is because the exact nature of the question influences all the later stages of the study.

2. Design : This deals with formulating a proper and realistic plan or

experiment which would generate the data we want and thus

would lead us to answer the question we started off with.

3. Description : This deals with exploring and summarizing

patterns in the raw data obtained from an experiment.

4. Inference : This deals with making predictions and decisions based

on the above data but in such a way that the conclusions are

applicable to the whole population, not merely those in the dataset.

Hopefully, these stages will give you a satisfactory answer to the

question you started off with !

3 Some Important terms


Following are some important terms/concepts that every statistician should be familiar with.

Variable : It is any characteristic that is observed for subjects/units in a statistical study. As the term suggests, it is something that varies across the different subjects.

Eg : If we want to know the living standard of average Indians,

we may select a representative sample of Indians and record

their (i) monthly income, (ii) number of dependent family members, (iii) monthly expenditure on food and other necessities etc. All these are variables for this study.

Observations : The actual data values we observe for a variable (or

a particular subject) are referred to as the observations for that

variable (or that particular subject).

Eg : Suppose the number of dependent family members of a

sample of 5 Indians are (2, 3, 5, 1, 0) - these are the observations

(or observed values) for this variable.

Data set : In a study, the totality of the values of all the variables for

all the selected sample units constitute the dataset

corresponding to that particular study.

Eg : In the above study, the totality of the values of monthly

incomes, number of family members etc for all the sampled

Indians constitute our data set.

Subject : A unit on which we make observations or measurements on variables, i.e on which we collect data. Generally, a subject is a person but can also be a state, country etc. Basically, the data we collect should be characteristics of the subject.

Eg : If you ask a sample of IIMA students whether they support the death penalty or not, the subjects will be the sampled students; if we collect data on the Gross Domestic Product (GDP) of different countries, the subjects will be the countries; if we collect data on the amount (or percentage) of rainfall each Indian state has received so far, the subjects will be the Indian states.

Population : Set of all subjects of interest. A population

generally depends on the purpose of the study being

undertaken.

Eg : For the death penalty example above, the population will

be all the students currently enrolled at IIMA.

Sample : It is a part of the population on which data is actually


collected.

Any statistical study uses the sample to learn about the


population.

Eg : In the above example, suppose you asked 20 students about

their views on death penalty. So, the sample will be those 20

students.

In this case, you may want to predict, say, the true percentage of PGP I students who support death penalty based on a sample of size 20.

Simple Random Sample : A sample selected in such a way that

every unit in the population has an equal chance of being

included in the sample - such a sample is a good representative of the population. Generally,

in the statistical field, if not stated otherwise, a sample is

always a random sample.

Eg : In the above example, your sample will be a simple random

sample if each of the students currently enrolled at IIMA had an

equal chance of being included in your sample (of size 20).

Random variable : It is a numerical measurement of the outcome

of a random phenomenon.

Eg : The feedback (yes/no) of a randomly chosen student in your


sample.

Parameter : A parameter is a numerical summary of the

population and hence describes the characteristics of a

population. A parameter can be calculated only if we have

information from the whole population. Generally, a parameter

is regarded as an unknown quantity.

Eg : The percentage of IIMA students who support death penalty.

Statistic : It is a numerical summary of a sample. A statistic

describes a sample as a parameter describes a population. In

fact, we estimate a parameter of a population using analogous

sample statistics calculated from samples selected from that

population.

Eg : In this case, the statistic will be the percentage of affirmative

response in your sample. Thus, if 8 out of the 20 students you

interviewed support death penalty, the statistic will be (8/20) × 100% = 40%.

Note : There is a subtle link between random variables and sample

statistics. Before a sample is collected, a sample statistic (say, the

sample mean) is a random variable because its value is unknown.

But, once the sample is collected and the value of the statistic is

calculated (based on the sample values), it no longer remains a

random variable. This is because, once the sample values are

known, anything related to those sample values (like, the sample

statistic) becomes known too - the randomness vanishes and

everything becomes deterministic. This holds for observations in

general too. Always remember that an observation is a random

variable before it is observed. It loses its randomness and becomes a

true observation once it is observed.

Chapter 2 : Sampling Techniques

4 Data Collection
As mentioned in Chapter 1, one of the most important components of

any statistical investigation is to formulate a realistic and sound plan

for data collection. A data set based on a well-formulated study is a

good representation of the population and hence statistical

inferences based on it would be believable and would help us

understand the population better. On the other hand, if a study is not

well designed, the results will likely be meaningless and/or

misleading. There are different types of statistical studies. However,

before going into the details, we need to understand the concept of

association between variables, a key concept in Statistics.

Association

One of the most fundamental aspects of statistical practice is to analyze and interpret the relationship between different variables in the population. What makes this interesting is that relationships or association patterns between variables are often as diverse as the variables themselves. Some variables may have a pretty simple relationship while others may have a much more complicated pattern. Generally speaking, when two variables are associated, one influences the other.[1] Thus, we have the following two types of variables :

1. Response variable (Y) : This is the outcome or the dependent variable.

2. Explanatory variable (X) : This is the independent variable or the variable which explains or is related to the outcome.[2]

Eg : (i) There are many factors which influence the ranking of IIMA (or of any institution per se), like salary of graduating students (say, s), placement success (say, p), percentage of international students (say, i), research output of faculty (say, r) etc. So, ranking is the response variable, while s, p, i and r constitute the explanatory variables.
[1] Later on we will see that association may not necessarily imply causation. However, we will go by this statement for the time being for the sake of defining response and explanatory variables.
[2] Explanatory variables are also known as covariates or predictors.

(ii) Whether taking Aspirin reduces the chance of heart attacks. Here the explanatory variable is whether someone takes Aspirin, while the response variable is whether that person suffers a heart attack.

Similarly, you can find innumerable examples from day-to-day life where association is at play between two variables. Given this concept, we will now learn about the two main types of statistical studies.

Statistical Studies

Broadly speaking, there are two major types of statistical studies :

Experiment : Here a researcher conducts an experiment by

assigning subjects to certain experimental conditions and then

observing outcomes on the response variable. The experimental

conditions correspond to some assigned values on the

explanatory variables and are often called treatments.

Observational Study : Here the researcher samples some subjects

and just observes values of the response and explanatory

variables for them without doing any experiment i.e without

assigning the subjects to treatments.

Eg : Nowadays, as cell phones are becoming all-pervasive, a growing concern is whether heavy use of cell phones may increase a person's risk of getting cancer. Several studies have explored this issue :

1. Study 1 : An Australian study (Repacholi, 1997) used 200 transgenic mice, specially bred to be susceptible to cancer; 100 mice were exposed an hour every day to the same frequency of microwaves as transmitted from a cell phone. The other 100 mice were not exposed. After 18 months, it was found that the brain tumor rate for the mice exposed to cell phone radiation was twice as high as the brain tumor rate for the unexposed mice.

Clearly, this is an experiment; subjects (mice) were assigned to two different experimental conditions (or treatments) which were categories of the explanatory variable "whether exposed to radiation?". The treatments were compared with respect to the response "whether developed tumor/cancer?" having categories (yes, no).

2. Study 2 : A German study (Stang et al., 2001) compared 118 patients with a rare form of eye cancer, called uveal melanoma, to 475 healthy patients who did not have the eye cancer. The patients' cell phone use was measured using a questionnaire. It was found that the eye cancer patients used cell phones more often, on an average.

In the above study, the researchers sampled some subjects and merely observed the values of the response and explanatory variables for each of them without doing anything to them. So, these are observational studies.

So, the main difference between experimental and observational

studies is that in the former, it is the researcher who decides on the

allocation of treatments to the subjects while in the latter, it is

predetermined and the researcher is just an observer.

In this course, we will mainly concentrate on observational studies

since it is the one that is used more often and specifically in business

applications. One of the foundations of virtually any observational

study is a good sampling plan i.e selection of a sample that is a good

representative of the population. Only then will the inferences drawn from that sample be applicable to the population as a whole. On

the other hand, a sample which does not have proper representation

of the population is likely to give us erroneous/invalid results. Now,

we will discuss some techniques of selecting a good sample.

How to Select a Good Sample ?

In order to select a good, representative sample from the

population, we first need to identify the population of interest. This

is because, the population depends on the statistical study (or the


1
3
study question) itself. For example, many news papers collect

regular surveys/polls from their readers regarding their views

about current happenings. So, here the population will be all the

subscribers to that newspaper/s. On the other hand, if your target of

inference is the proportion of IIMA students who are vegetarians,

the population would be all the students of IIMA.

Sampling Frame and Sampling Design

Once the population is identified, the second step is to compile a list of

subjects (or units) in it so that you can sample from it - this list is called

the sampling frame - this is applicable when the population is finite (i.e has a

fixed size, say N).

For example, if you want to know some particular characteristics

of registered voters of Gujarat (say, what proportion of them own a

vehicle), a possible sampling frame would be the list of all registered

voters (of Gujarat) maintained by the Election Commission.

Once you have a sampling frame, you need to decide on a method

of drawing samples from it. This method is known as the sampling

design. The sampling design should be such that the resulting sample

is a good representative of the population i.e it should reflect all

the characteristics or nuances of the population.

For example, if your population is everyone who works/studies in IIMA, a good sample should have a representative mixture of students (that too from PGP, PGP-ABM, FPM etc), faculty (from all ranks), staff members, workers (from all departments) etc. However, if, for

convenience, you only select the diners from the KLMDC canteen on a

given day, it may give you misleading information about the

population since it will definitely not contain a significant part of the

IIMA people (a large chunk of staff members do not go to KLMDC for

dinner).

In a bigger canvas, if you want to gauge the national mood about

an important issue, then the verdict of an online poll by a leading

newspaper/media (say TOI, NDTV etc) will NOT be an accurate

indicator since the results will not account for a huge chunk of the

population (say those who don't have access to the internet or are not

subscribers/viewers of the above newspapers).

Thus, the bottom-line is that you have to be very careful in

selecting a sample that would accurately reflect (or be a good


representative of) the population.

Simple Random Sampling

Based on the discussion above, it should be clear that a sample selected out of convenience will NOT be a good, representative sample. Rather, a sampling design which gives each and every population unit an equal chance of being included in the sample will likely result in a much better sample (in terms of accurately representing the population). Such a sampling design, which is guided by chance rather than convenience, is known as simple random sampling.

A simple random sample of n subjects from a finite population, say of

size N , is one in which each possible sample of that size (i.e n) has

the same chance of being selected.

Eg: Suppose a campus club can select two of its officers to attend the club's annual conference in Bangalore. The five officers are President (P), Vice-President (V), Secretary (S), Treasurer (T) and Activity Coordinator (A). So, here the sampling frame will be {P, V, S, T, A}. The club members want the selection process to be fair - so they decide to select a simple random sample of size 2. The 5 names are written on identical slips of paper, placed in a hat, mixed, and then a neutral person blindly selects two slips from the hat - this is the sampling design.

Thus the possible samples of the two officers will be {P,V}, {P,S}, {P,T}, {P,A}, {V,S}, {V,T}, {V,A}, {S,T}, {S,A} and {T,A}.

There are 10 possible samples and the process of blindly (randomly) selecting two slips ensures that each of the above samples has an equal chance of occurring. Thus the chance of selecting any one of the above samples is 1/10. Moreover, since each of the 5 officers appears in 4 of the above samples, the chance of each of them going to Bangalore is 4/10 = 2/5.

Random number tables : For large populations, there are better, more automated ways of selecting random samples. One of the popular methods is using random number tables (or random number generators) as follows :

Number the subjects/units in the sampling frame.

Generate random numbers of the same length/digits as the above numbers (this can easily be done through freely available software like www.random.org).

Select those units from the sampling frame whose numbers

were generated in the above step.

Stop the process once you have reached the desired sample
size.

Eg: State agriculture officials regularly visit different districts to

check the productivity of crop fields. However, it is not possible

for the auditors to examine all the crop fields (since there are so

many of those in a particular district). So, they select a random

sample of the fields and check their productivity.

Suppose a particular district has 70 large crop fields and the

auditor wants to randomly select 10 of those. Thus, the sampling frame consists of the 70 fields {01, 02, ..., 70}. The auditor generates 2 digits at a time from a random number generator until he/she has 10 unique two-digit numbers from 01 to 70. Any number from 71 to 99 is rejected, and if any

number appears twice, it is ignored (since it does not make sense to

select the same field more than once). Suppose the selected numbers

are {05, 09, 15, 23, 44, 46, 52, 59, 63, 69} - the official should then check

the productivity of the fields with the above numbers.

The Income Tax department is said to use a similar method to

randomly select those tax return accounts that it should audit.
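As an illustration, the auditor's procedure can be mimicked in a few lines of Python (a minimal sketch, not from the text; the function name and the use of random.randint in place of a printed random number table are our own choices):

import random

def srs_by_rejection(population_size, sample_size):
    # Read off 2-digit random numbers (00-99), as from a random number table.
    selected = []
    while len(selected) < sample_size:
        number = random.randint(0, 99)
        if number == 0 or number > population_size:
            continue  # e.g. 00 and 71-99 are rejected
        if number in selected:
            continue  # duplicates are ignored
        selected.append(number)
    return sorted(selected)

print(srs_by_rejection(70, 10))  # 10 distinct field labels between 01 and 70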

Note : The above selection procedure is known as sampling with replacement

(SWR) since a particular unit can be selected more than once. However, if any unit
can be selected/included in the sample only once, it is known as sampling without
replacement (SWOR). Although SWR is a valid way of selecting a simple random
sample, it is SWOR that is used most often. In fact, unless explicitly stated, a simple
random sample implies a sample selected without replacement.

SRS from an infinite population

Often the population of interest is either countably infinite or so

large that for all practical purposes it must be treated as infinite. In

that case, it is virtually impossible to make a list of all the population

units and draw a simple random sample of those. So the simple

random sampling scheme is modified as follows :

1. Each element of the sample should come from the population.

2. The elements are selected independently of each other.

Suppose officials at the Ministry of Tourism want to get visitor


feedback about the ongoing maintenance/renovation work being

done at the Taj Mahal. For all practical purposes, the visitors (to the

Taj Mahal) can be assumed to be elements of an infinite

population. So, to select a SRS from the steady stream of visitors, the

tourism officials can just select visitors independently of each other.

One way of doing this may be to record the entry ticket number of

every visitor in a computer (as soon as they buy the ticket) and

draw a number randomly out of those every few minutes. The

bearer of that ticket can be asked to fill in a short survey. Moreover,

once a ticket number has been drawn, that number is not included in

the pool for the subsequent draws.

Sampling with and without replacement

Broadly speaking, there are two types of simple random sampling as


follows :

SRS with replacement (SRSWR) : As the term suggests, here we randomly select a population unit, note its identity and then return it to the population before selecting the second unit and so on. Thus, the subsequent draws are independent and a particular population unit can appear in the sample multiple times.

SRS without replacement (SRSWOR) : Here, once a sample unit is selected, it is not replaced and the second sample unit is drawn from the remaining population (i.e population without the first unit) and so on. Naturally, the subsequent draws are dependent and a particular population unit can appear in the sample at most once.

Unless stated otherwise, a simple random sample would

effectively mean that the sample is selected without replacement.
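Both schemes are available in Python's standard library; the snippet below (our own illustration, reusing the club-officer example) shows the difference:

import random

population = ["P", "V", "S", "T", "A"]  # the five club officers

# SRSWR : draws are independent; an officer may appear more than once.
print(random.choices(population, k=2))  # e.g. ['T', 'T'] is possible

# SRSWOR : once selected, an officer cannot be drawn again.
print(random.sample(population, k=2))   # e.g. ['P', 'S']; no repeats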

Sample Surveys

Once a sample is selected, there are different ways in which one can

collect data from the sample units. Some of the commonly used

methods are personal interview, telephonic interview and self-

administered questionnaire. In majority of the cases though, these data

are results of sample surveys. As the term suggests, a sample survey is based on selecting a representative sample from the population and collecting data on those units. Clearly, sample surveys are observational studies.

Potential Biases in Sample Surveys

One of the main issues with sample surveys is that, often the

responses from the sample tend to favor some parts of the

population over others. Then the results of the sample are not

representative of the population and are said to be biased. For

example, for the population of adults in your home town, results of

an opinion survey may be biased in a liberal direction if you sample

only educators or biased in the conservative direction if you sample

only business owners. Bias can occur in a sample due to various

reasons as follows :

1. Sampling Bias : As the term suggests, this kind of bias results

from a flaw in the sampling method, most likely if the sample is

non-random (like the one mentioned above). Another way it can

occur is due to under-coverage - having a sampling frame that

lacks representation from parts of the population. Responses by

those not in the sampling frame might be quite different from

those in it thus leading to misleading conclusions about the

population.

Eg : A telephone survey (vis-a-vis the sampling frame on which it

is based) will not reach homeless people or prison inmates;

incidentally, these groups of people may have very different

views about life in general. Similarly, online surveys by

newspapers suffer from serious under-coverage.

2. Non-response bias: This kind of bias results when some of the

sampled subjects cannot be reached or refuse to participate. In fact, the subjects who are willing to participate may be different from the overall sample in some way, perhaps having strong views about the survey issues. The subjects who do participate may not respond to some questions, resulting in non-response bias due to missing data.

Nearly all major surveys suffer from some non-response bias, ranging from 20-30% (Eurobarometer in the UK) to 6-7% (Current Population Surveys in the US).

3. Response bias : This kind of bias results from the actual responses. The responses of subjects may differ based on the particular manner in which the interviewer asks questions; subjects can often lie because they think that their responses may be socially unacceptable. In fact, the wording of questions can greatly affect the responses - it is always preferable to word questions in a direct, clear and understandable manner and avoid wordy and confusing questions.

Eg: A Roper poll (Newsweek, July 25, 1994) asked a sample of adult Americans : "Does it seem possible or does it seem impossible to you that the Nazi extermination of Jews never happened?" 22% said that it was possible the Holocaust never happened!! The Roper organization later admitted that the question was worded in a confusing manner. When they asked instead : "Does it seem possible to you that the Nazi extermination of the Jews never happened, or do you feel certain that it happened?", only 1% said that it is possible that it never happened!!

Accuracy of Sample Surveys

Since a sample is only a small part of the population, estimates

obtained from a sample survey are always reported with a certain

margin of error. In calculating the margin of error, we need to take

into account the ratio of the sample size (say, n) and the population

size (say, N ) - this ratio is known as the sampling fraction. Based on the

value of the sampling fraction, we have two different versions of the

margin of error viz


If $\frac{n}{N} \le 0.05$ (i.e the sample size is at most 5% of the population size), the margin of error is given by

$$\frac{1}{\sqrt{n}} \times 100\%$$

If $\frac{n}{N} > 0.05$ (i.e the sample size is larger than 5% of the population size), we need to correct the above formula by what is known as the finite population correction, resulting in the following modified expression of the margin of error:

$$\frac{1}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}} \times 100\%$$

Clearly, the above expression tends to $\frac{1}{\sqrt{n}} \times 100\%$ as $N \to \infty$ and becomes 0 if $N = n$
i.e the sample is the whole population. This is intuitive because, if we

have data from the whole population, we will know the parameter

values and hence the question of making an error does not arise.

Eg 1. After the disastrous earthquake and tsunami that struck the east

coast of Japan in March 2011, a Gallup poll conducted on CNN asked

the question Do you think that the city of Miyagi (one of the worst hit areas)

will ever completely recover from the effects of the hurricane, or not ? The poll

questioned 700 Japanese adults out of whom 42% responded Yes, will

completely recover, 56% said No, will not and 2% responded no

opinion. What will be the margin of error ?

Here, we can use the simpler formula because it is obvious that the

sampling fraction is less than 0.05 (700 divided by 127 million, the population of Japan, is way less than 0.05!). So, the margin of error will be $\frac{1}{\sqrt{700}} \times 100\% \approx 3.8\%$.

Thus, in the population of all adult Japanese, it is highly likely that between 38.2% and 45.8% believed that Miyagi will completely recover from the effects of the earthquake. It is amazing that such a small sample of 700 can accurately predict the population percentage - this is the power of random sampling and statistical inferential techniques.

Suppose, for the sake of argument, we assume that the population of Japan is 5000. Then the sampling fraction will be 700/5000 = 0.14 > 0.05. So, the corrected margin of error will be $\frac{1}{\sqrt{700}} \sqrt{\frac{5000-700}{5000-1}} \times 100\% \approx 3.5\%$.

Hence, we will conclude that in the population of all Japanese, it is highly likely that between 38.5% and 45.5% believed Miyagi will completely recover from the effects of the hurricane. However, this is a narrower interval compared to the one we got earlier (using the simpler formula). But this does not mean that the wider interval is wrong. In fact, the wider interval is a conservative interval.

Note : The first (simpler) formula is generally used when we are dealing with an
infinite population or when the population size, although finite, is so large (compared
to the sample) that for all practical purposes it can be treated as infinite.
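The two formulas can be wrapped in a small Python helper (a sketch with our own function name) and checked against the Japan poll above:

import math

def margin_of_error(n, N=None):
    # Simple formula, with the finite population correction when n/N > 0.05.
    moe = (1 / math.sqrt(n)) * 100
    if N is not None and n / N > 0.05:
        moe *= math.sqrt((N - n) / (N - 1))
    return moe

print(round(margin_of_error(700), 1))          # 3.8 (sampling fraction tiny)
print(round(margin_of_error(700, N=5000), 1))  # 3.5 (with the correction)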

5 Some Good Observational Studies


There are different ways of sampling which can result in a well-designed observational study. Some of the commonly used ones are :

Cluster Sampling

Often a sampling frame consisting of ALL the population units is hard to come by. In that case, a population can be divided into a large number of clusters and a simple random sample of a pre-specified number of clusters can be selected. The cluster random sample will consist of all the units in those clusters.

Eg : Suppose you would like to sample about 1% of the families in your city. You can use a map to label and number the city blocks (which are the clusters) and can select a simple random sample of 1% of the blocks. Now you can select each

and every family in those blocks - those will be your observations.

The sectors of Chandigarh in Punjab or the blocks of Salt Lake City in

Kolkata may be taken as good examples of clusters.

Stratified Sampling

Here, a population is divided into separate groups or strata and a simple random sample is selected from each stratum.

Eg : Suppose you want to estimate the mean number of hours a

week that IIMA students spend in the library and also how it

compares between PGP I, PGP II, PGPX and FPM. You can easily

identify all these groups of students using the registration records -

those will be your strata. If you want a sample size of, say 40, you

can select a simple random sample of size 10 from each strata (or

you can take it to be proportional to the size of the respective

student bodies).

Historically, part of the old city of Ahmedabad is neatly divided

into pols, mainly along religious lines (Jain, Hindu, Muslim,

Buddhist etc). Suppose you want to compare the average

income/wealth of the various pols. Then you can treat each pol as a stratum and select a random sample of households from

each. For each such sample of households, you can calculate the

mean income/wealth.

Stratified sampling ensures that you have adequate

representation from each stratum you want to compare. However, you should have access to the sampling frame and the stratum into which

each subject belongs. The main difference between cluster and

stratified sampling is that in the former the within-cluster variability is

large while the between-cluster variability is small; however in

stratified sampling, the within-strata variability is small but the

between-strata variability is large.

Allocation Schemes

Naturally, an important issue with stratified sampling is to formulate a

way to decide on the sample size to be chosen from each stratum.

Various allocation schemes have been formulated for this purpose.

We will explain the two most common ones here.

1. Proportional allocation : Here the sample size ($n_h$) allocated to stratum $h$ is given by

$$n_h = n \left( \frac{N_h}{N} \right)$$

This scheme is appropriate when the within-stratum variances and the cost per unit of sampling are approximately equal across the different strata.

2. Neyman allocation : Here the sample size ($n_h$) allocated to stratum $h$ is given by

$$n_h = n \left( \frac{N_h \sigma_h}{\sum_{h=1}^{H} N_h \sigma_h} \right)$$

where $\sigma_h$ is the population standard deviation for stratum $h$. Clearly, $n_h$ is an increasing function of the stratum size and the standard deviation of the $h$th stratum.

Note : (i) Clearly, when the stratum variances are equal, proportional and Neyman
allocation will result in the same allocation scheme.
(ii) Usually we do not know the population standard deviations. In those cases, we
typically use estimates based on some prior information.

Eg. Suppose a survey carried out by HSBC in all its branches in India revealed the following information about the age distribution and background of the executives.

Rank            Mean age   SD of age   Stratum size
Asst Manager    23         3.5         1500
Manager         27         3.0         1200
Senior Manager  31         2.8          850
AVP             35         4.0          400
VP              41         3.7          200

Suppose, for a particular survey, you want to select 300 employees in total. Then what would be the sample sizes for each stratum under

(i) Proportional allocation ?

(ii) Neyman allocation ?
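As a worked sketch (our own; rounding the fractional sample sizes to whole employees is left to the reader), both allocations can be computed in Python directly from the formulas above:

strata = {  # rank : (stratum size N_h, SD of age sigma_h)
    "Asst Manager": (1500, 3.5),
    "Manager": (1200, 3.0),
    "Senior Manager": (850, 2.8),
    "AVP": (400, 4.0),
    "VP": (200, 3.7),
}
n = 300
N = sum(Nh for Nh, _ in strata.values())            # 4150
total = sum(Nh * sd for Nh, sd in strata.values())  # sum of N_h * sigma_h
for rank, (Nh, sd) in strata.items():
    print(rank, round(n * Nh / N, 1), round(n * Nh * sd / total, 1))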

References

1. Repacholi, M. H. (1997), Radio frequency field exposure and cancer, Environ. Health Perspect., 105 : 1565-1568.

2. Stang, A., et al. (2001), The possible role of radio frequency radiation in the development of uveal melanoma, Epidemiology, 12(1) : 7-12.

3. Hepworth, S. J., et al. (2006), Mobile phone use and risk of glioma in adults : case control study, BMJ, 332 : 883-887.

Chapter 3 : Exploratory Analysis

6 Summarizing data
Any statistical experiment/survey usually generates a lot of raw

data. In order to make sense of it, we need to somehow summarize

it using some numerical and/or graphical measures. The procedure

of doing this is broadly known as Descriptive Statistics. In this

chapter we will deal with the two main ways of summarizing data

i.e numerical and graphical summaries. However, prior to that, we

need to understand the different types of variables as explained

below.

Types of Data/Variable

A variable can be of different types (resulting in different types of data

and hence the methods to analyze those) as follows :

1. Quantitative : A variable is quantitative if the observations on it take numerical values that represent different magnitudes. Eg : Number of friends you have, your height, weight etc.

Quantitative variables can be of two types :

Discrete : A quantitative variable is discrete if the possible

values belong to a set of distinct numbers like 1, 2, 3,... Eg :

Number of pets you have.

Continuous : A quantitative variable is continuous if the

possible values belong to an interval. Eg : Your exact height

since it may literally be any number between, say, 4ft and 7ft.

2. Categorical : A variable is categorical if each of its observations

belongs to any one of a set of categories. Eg : Your religious

affiliation : {no religion, Hindu, Muslim, Buddhist, Jain, Christian

etc}. Categorical variables can be of two types :

Ordinal : A categorical variable is ordinal if it has ordered


categories.

Eg : Level of education having categories {no education,

middle school, high school, undergraduate, graduate}.

Nominal : A categorical variable is nominal if it has un-ordered


categories.

Eg : Your favorite color may be any of {red, yellow, green,


crimson etc}.

Note

1. Sometimes a continuous variable may be simplified into a

categorical one for ease of measurement.

Eg : Length of hair, although continuous, may be categorized as

{very short, short, medium, long, very long}.

2. Not all variables that are numbers are quantitative.

Eg : Section numbers of courses, zip codes, passport and Aadhar card numbers, for example, do not measure the magnitude of anything although they have numerical values; you cannot calculate numerical summaries out of those (does the average passport number of a bunch of people make any sense ??). In fact, these are just convenient numerical labels used for identification.

Graphical Summaries

One of the best ways to summarize (and understand) a large chunk of

raw data is to represent it pictorially. As you know, a picture is worth a

thousand words !

Graphs for Categorical Variables

The two most common ways of summarizing a categorical variable are (i)
Bar graphs
and (ii) Pie charts.

1. Bar graph : It displays a vertical bar for each category, the height

of the bar representing the percentage of observations in that category.

2. Pie chart : It is a circle with a slice of pie for each category, the

size of the slice representing the percentage of observations in

that category.

Eg : Suppose the following table shows the different sources of electricity in India and the corresponding percent uses of each source.

Source      Coal   Hydropower   Natural Gas   Nuclear   Petroleum   Other   Total
Percentage  51     6            16            21        3           3       100
The corresponding bar graph and pie chart are given below.

Figure 1: Bar graph of electricity usage

Figure 2: Pie chart of electricity usage

Clearly, sources of electricity with higher percent use have higher bars/larger slices of pie. However, bar graphs are superior since it is easier to distinguish between categories, especially when two categories are very close (in terms of percentage/frequency).

Graphs for Quantitative Variables

The most common graphical summary of quantitative variables is the Histogram. However, before going on to histograms, we need to know about frequency tables. For large data-sets, it is often useful to look at the possible values (of a variable) and count/list the number of occurrences of each. For categorical variables, this is just the number of values in each category. We can then divide the number in each category by the total number (in all the categories combined) to get a

proportion (or percentage) for each category.

Eg : The following table shows the number/frequency of shark attacks in some (shark-infested) states of the U.S and also in some other countries.

Here Region is the categorical variable, with six categories. For example, out of 701 total shark attacks, 365 were in Florida. Thus, the proportion of shark attacks in Florida is 365/701 = 0.52.

Note : You can also view this data in a slightly different manner with Region being the units and # shark attacks being the discrete variable.

Region        Frequency   Proportion       Percentage
Florida       365         365/701 = .52    52%
Hawaii        60          .086             8.6%
California    40          .057             5.7%
Australia     94          .134             13.4%
Brazil        66          .094             9.4%
South Africa  76          .108             10.8%
Total         701         1.00             100%

However, for quantitative variables, a frequency table usually divides the possible values into a set of intervals and displays the number of observations (or frequencies) in each interval. A Histogram is a graph which uses bars to represent the frequencies (or proportions) of the possible outcomes of a quantitative variable.

For discrete variables, a histogram usually has one bar for each possible value. For a particular value, the height of the bar is proportional to the frequency or relative frequency of that value.

For continuous variables or discrete variables with a large number of

unique values, it is convenient to segregate the values into different

intervals and have a separate bar for each. The intervals should

ideally have the same width.

Eg : Nutritional labels on packaged foods give us information

about the amount of fat, cholesterol, sodium, vitamins etc contained

in one serving of that food. The following table lists 20

popular brands of cereals and the amounts of sodium and sugar

contained in a single serving (about 3/4 cup) as listed on the label.

Cereal                  Sodium (mg)   Sugar
Frosted Mini Wheats     0             7
Raisin Bran             210           12
All Bran                260           5
Apple Jacks             125           14
Capt Crunch             220           12
Cheerios                290           1
Cinnamon Toast Crunch   210           13
Crackling Oat Bran      140           10
Crispix                 220           3
Frosted Flakes          200           11
Froot Loops             125           13
Grape Nuts              170           3
Honey Nut Cheerios      250           10
Life                    150           6
Oatmeal Raisin Crisp    170           10
Honey Smacks            70            15
Special K               230           3
Wheaties                200           3
Corn Flakes             290           2
Honeycomb               180           11

For the above data, there are many possible values for the amount

of Sodium (0-319). So, it is best to form intervals before drawing the

histogram. The following table shows the intervals and the

corresponding frequencies for the cereal data.

Interval   Frequency   Proportion   Percentage
0-39       1           0.05         5.0
40-79      1           0.05         5.0
80-119     0           0.00         0.0
120-159    4           0.20         20.0
160-199    3           0.15         15.0
200-239    7           0.35         35.0
240-279    2           0.10         10.0
280-319    2           0.10         10.0
Total      20          1.00         100.0

For example, the frequency 4 in the interval 120-159 implies that 4

cereals had Sodium content between 120 and 159 mg. Two possible

histograms for the cereal data are shown below with two different

interval structures (the first histogram uses the above intervals).


[Two histograms of the Sodium (mg) values, drawn with the two different interval structures.]
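A sketch of how such a histogram could be drawn in Python (assuming matplotlib is available; the bin edges follow the frequency table above):

import matplotlib.pyplot as plt

sodium = [0, 210, 260, 125, 220, 290, 210, 140, 220, 200,
          125, 170, 250, 150, 170, 70, 230, 200, 290, 180]
plt.hist(sodium, bins=range(0, 360, 40), edgecolor="black")  # 0-39, 40-79, ...
plt.xlabel("Sodium (mg)")
plt.ylabel("Frequency")
plt.show()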

Graphs for Time Varying Variables

Often we come across variables which are measured over time, for

example, our blood pressure when measured every month for one

year, the monthly temperature of a place measured for the last two

years etc. It is helpful to plot the observations on these variables over

time to understand whether any trend (generally long term) is

present. These plots are known as Time plots. Following is a time plot

showing the variation of the annual average temperature (average of

the daily temperature for a year) of Central Park in New York City
between 1901 and 2000.

Figure 3: Time plot of average annual temperature of
Central Park, NY

Although the observations have considerable fluctuations, there is a clear increasing trend over the 100-year period...a tell-tale sign of

global warming!

Data Distribution and Shape

The distribution of a variable is explained by the values that the

variable takes and the frequency of occurrence of each value. So,

the frequency table and the histogram can give us an idea of the

distribution of a variable.

The one thing we look for in a distribution is its shape. Some of

the common shapes we may encounter are :

1. Unimodal : Distribution with one mode.

2. Bimodal : Distribution with two modes.

3. Multimodal : Distribution with more than two modes.

4. Symmetric : Two sides of a distribution about a central point

mirror images of each other. Eg : IQ, height, weight.

5. Left skewed : Distribution whose left tail is longer than the right
tail.
Eg : Life span (large majority of people live at least 65 years but

some die at a young age); distribution of scores in an easy PS

exam.

6. Right skewed : Distribution whose right tail is longer than the left
tail.

Eg : Distribution of scores in a difficult PS exam, amount of


donation received by a temple (a large majority donates a standard amount but a few wealthy individuals may donate a huge amount).

Numerical Summaries

Generally, graphical summaries are the first step in analyzing statistical data since they provide a bird's-eye view of the overall data pattern and often indicate the next step in the analytical procedure. This next step is often calculating the numerical summaries, which are basically some values reflecting some important characteristics of the data set.

The numerical summaries are of three types viz (i) Measures of center, (ii) Measures of spread and (iii) Measures of position.

Measures of Center

There are basically two types of measures of center as follows :

1. Mean : The mean (or average) of a variable is just the sum of its observations divided by the number of observations. However, if you have a set of values, say $\{x_1, x_2, ..., x_n\}$ with frequencies $\{f_1, f_2, ..., f_n\}$, then the mean will be

$$m = \frac{\sum_{i=1}^{n} x_i f_i}{\sum_{i=1}^{n} f_i} \quad (1)$$

2. Median : The median of a variable is the midpoint of its

observations when ordered in increasing or decreasing order.

Note : When the number of observations (say, n) is odd, median

is exactly the middle observation (i.e the (n + 1)/2th observation).

However, when the number of observations (n) is even, median is

the average of the two middle observations (i.e the average of

the n/2th and (n/2) + 1th observation).

Eg: For the cereal data set, the observations (amount of sugar) are {1, 2, 3, 3, 3, 3, 5, 6, 7, 10, 10, 10, 11, 11, 12, 12, 13, 13, 14, 15}. Here the mean will be 164/20 = 8.2.
To calculate the median, we first arrange the observations in increasing order (as is done above) and calculate the mean of the two middle-most observations. So, the median will be (10 + 10)/2 = 10.
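Both calculations are easy to verify with Python's standard library (our own snippet):

import statistics

sugar = [1, 2, 3, 3, 3, 3, 5, 6, 7, 10, 10, 10, 11, 11, 12, 12, 13, 13, 14, 15]
print(statistics.mean(sugar))    # 8.2
print(statistics.median(sugar))  # 10.0 (average of the two middle values)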

Some important points to note :

Median does not depend on the actual values of the observations

while mean does. This is precisely why mean is highly influenced

by outliers (observations falling way above or below the rest of

the data) while the median is not.

Eg : The following table depicts the per capita CO2 emissions for
some countries

Country         CO2 emission
China           2.3
India           1.1
United States   19.7
Indonesia       1.2
Brazil          1.8
Russia          9.8
Pakistan        0.7
Bangladesh      0.2

Do you spot any outlier(s)? The mean CO2 emission is 36.8/8 = 4.6 while the median is (1.2 + 1.8)/2 = 1.5.

Conclusion : The mean is so high compared to the median due to the outliers (which virtually drag the mean towards themselves). However, the median only depends on the relative position of the observations and hence is not affected (thus remains low). So, we say that the median is resistant to extreme observations while the mean is not.

For example, if the CO2 emission of the US had been the same as that of Russia (9.8), the mean would have decreased to 26.9/8 = 3.36; however, the median would have remained the same at 1.5.

Similarly, if the CO2 emission of the U.S increased to, say, 25, the mean would have shot up to 42.1/8 = 5.26 but the median would still remain the same - this is because this change will not affect the relative positions of the observations.

Suppose Brazil's CO2 emission is 1.0 instead of 1.8. Then the new ordering would be {0.2, 0.7, 1, 1.1, 1.2, 2.3, 9.8, 19.7}. So, the new median would be (1.1 + 1.2)/2 = 1.15. In the old data, Brazil was between Indonesia and China; however in the new data, it is between India and Pakistan. Thus, its relative position has changed, resulting in the change in the median.

The shape of a distribution influences whether the mean should be larger or smaller than the median. Often an extremely large value in the right (left) hand tail of the data distribution may drag the mean towards the right (left) so much that it may fall above (below) the median. Thus, we have

Symmetric distribution : mean = median.

Right-skewed distribution : mean > median.

Left-skewed distribution : mean < median.

In a nutshell, the mean is always drawn towards the longer tail of the distribution.

However, the mean also has some advantages over the median that we will learn later - so, the use of mean or median would depend on the data we have.

Measures of Spread

Measures of center only tell us about the average or middle-most value of the distribution. Another key feature of a distribution is the amount of spread of the observations. This is important because two

variables may have the same mean and median but very different

spread of the individual observations. There are two commonly used

measures of spread viz

1. Range : It is simply the difference between the largest and smallest values of a variable. Eg: For the CO2 emission data, the maximum value is 19.7 and the minimum is 0.2. So, the range is 19.7 - 0.2 = 19.5.

Unfortunately, the range is also sensitive to the presence of outliers. For example, if the CO2 emission for the U.S was, say, 25, the range would have shot up to 25 - 0.2 = 24.8.

2. Standard Deviation : This is the most popular measure of spread

since it uses all the observations in the data. Basically, it

computes the deviation of each observation from the mean and

then combines all the deviations into a single number which

reflects the overall spread in the data.

Let there be $n$ observations, $x_i$ be the $i$th observation and $\bar{x} = \frac{x_1 + ... + x_n}{n}$ be the mean. Then, the standard deviation is given by

$$s = \sqrt{\frac{(x_1 - \bar{x})^2 + ... + (x_n - \bar{x})^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$$

Eg : For the cereal data, the mean (of Sodium) is $\bar{x} = 185.5$. Honey Nut Cheerios has sodium content 250 mg; so its deviation (from the mean) will be 250 - 185.5 = 64.5. Similarly, the deviation of Honey Smacks (Sodium content 70 mg) will be 70 - 185.5 = -115.5. Once we have the deviations of all the cereals, the standard deviation will be

$$s = \sqrt{\frac{(0 - 185.5)^2 + (70 - 185.5)^2 + ... + (290 - 185.5)^2}{19}} = 71.25 \quad (2)$$
This sort of means that the typical difference between an observation and the mean is 71.25.
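The same computation can be verified in Python (a sketch; statistics.stdev uses the same n - 1 divisor as formula (2)):

import statistics

sodium = [0, 210, 260, 125, 220, 290, 210, 140, 220, 200,
          125, 170, 250, 150, 170, 70, 230, 200, 290, 180]
print(statistics.mean(sodium))   # 185.5
print(statistics.stdev(sodium))  # ~71.25, the sample standard deviation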

Note

The square of the standard deviation ($s$) is known as the variance, i.e

$$\text{Variance} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$$

The sum of all the deviations is 0 since $\sum_{i=1}^{n}(x_i - \bar{x}) = n\bar{x} - n\bar{x} = 0$.

Clearly, the greater the spread of the data, the larger will be $s$.

s = 0 only when all the observations have the same value (i.e same as the mean). Eg: if all the cereals had Sodium content 100 mg, the mean would have been 100, hence each deviation (about the mean) would have been 0, resulting in a standard deviation of 0.

Since s uses the mean, it is also sensitive to outliers; this is

intuitive because an outlier would have large deviation (and

hence very large squared deviation) and hence increases the

standard deviation.

Measures of Position

As the term suggests, these are some positional landmarks of a

distribution in that they divide the distribution into some well

defined areas (in terms of the proportion of observations that fall

above and/or below these values).

For example, the median is also a measure of position since it specifies

a location such that half of the data falls above it and the other half below it.

In fact, the median is a special case of a general set of measures of

position called the percentiles.

Definition : The pth percentile is a value such that p% of the

observations fall below it.

Eg : Suppose a student scores 1200 (out of 1600) in the GRE and is told that his score is at the 90th percentile. This implies that 90% of those who took the exam scored between the minimum score and 1200, i.e only 10% of the scores were higher than his.

Thus, it is clear that the median is the 50th percentile. In fact, the three popular percentiles are :

First quartile (Q1) : Lowest 25% of the observations fall below it, i.e p = 25.

Second quartile (Q2)/median : 50% of the observations fall below it, i.e p = 50.

Third quartile (Q3) : 75% of the observations fall below it (or highest 25% of the observations fall above it), i.e p = 75.

Thus, the quartiles split the distribution into 4 distinct parts, each

containing 25% of the observations.

Finding the Quartiles

1. Arrange the data in increasing order.

2. Identify the median (Q2).

3. Consider the observations below the median. The first quartile (Q1) will be the median of these observations.

4. Consider the observations above the median. The third quartile (Q3) will be the median of these observations.

Eg. Let us find the quartiles of the sodium values in the 20 cereals. The Sodium values, arranged in increasing order, are : {0, 70, 125, 125, 140, 150, 170, 170, 180, 200, 200, 210, 210, 220, 220, 230, 250, 260, 290, 290}. Thus, we have

Median = (200 + 200)/2 = 200

Q1 = (140 + 150)/2 = 145

Q3 = (220 + 230)/2 = 225
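The procedure translates directly into Python (a sketch of the text's median-of-halves method; note that statistical software often uses slightly different quartile conventions):

def median(values):
    v, n = sorted(values), len(values)
    mid = n // 2
    return v[mid] if n % 2 else (v[mid - 1] + v[mid]) / 2

def quartiles(values):
    v = sorted(values)
    half = len(v) // 2  # excludes the middle value when n is odd
    return median(v[:half]), median(v), median(v[-half:])

sodium = [0, 70, 125, 125, 140, 150, 170, 170, 180, 200,
          200, 210, 210, 220, 220, 230, 250, 260, 290, 290]
print(quartiles(sodium))  # (145.0, 200.0, 225.0)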

Note : Quartiles also indicate the shape of the distribution. As a rule of thumb, if the distance between Q1 and Q2 is larger (smaller) than that between Q2 and Q3, the distribution is left (right) skewed.

For the Cereal data, Q2 - Q1 = 200 - 145 = 55 while Q3 - Q2 = 225 - 200 = 25. Hence the distribution of the Sodium values is left skewed.

The quartiles are also used to define a measure of spread known as the Inter Quartile Range (IQR) : This is the distance between the third and the first quartiles, i.e IQR = Q3 - Q1, i.e it is the range for the middle 50% of the data.

Eg : For the Sodium data, IQR = 225 - 145 = 80.

As with range and standard deviation, IQR increases with the spread

of the data. However, compared to the range and the standard

deviation, the IQR is much more resistant to outliers. This is because, it

is based on the quartiles, which in turn are resistant to outliers (as

they depend only on the relative positions of observations). So, for

highly skewed distributions (or distributions with outliers), it is better

to use the IQR rather than range or standard deviation.

For example, if the maximum Sodium content was 1000 (instead of

290), both the range and standard deviation would have increased

but the IQR would have remained the same.

Five Number Summary and Box-plot

The five number summary of a dataset/distribution consists of the minimum value, first quartile, median, third quartile and the maximum value. These are the basis of a graphical display known as the Box-plot.
Eg : For the Sodium data, the box-plot is given below :

Figure 4: Boxplot of Sodium content in Cereals

A Box-plot has the following features :

The box of a box-plot goes from Q1 to Q3 i.e it contains the

central 50% of the distribution.

A line inside the box marks the median.

Lines extending from the box in either direction encompasses the

rest of the data except for potential outliers. These lines are

called whiskers.

Outliers are shown separately, generally using the * symbol.

For the Box-plot of the Cereal data, the box extends from 145 to 225, i.e it contains the central 50% of the observations. The only outlier (Frosted Mini Wheats, with 0 mg) is shown as the dot at the extreme bottom. Lastly, the two whiskers extend to 70 and 290, which are the minimum and maximum values excluding the outlier.
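Such a box-plot can be reproduced in Python (assuming matplotlib; its default whisker rule of 1.5 x IQR likewise flags the 0 mg value as an outlier):

import matplotlib.pyplot as plt

sodium = [0, 70, 125, 125, 140, 150, 170, 170, 180, 200,
          200, 210, 210, 220, 220, 230, 250, 260, 290, 290]
plt.boxplot(sodium)
plt.ylabel("Sodium (mg)")
plt.show()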

Note : A boxplot indicates whether the data is skewed - the side

with the larger part of the box and the longer whisker usually has

the skew in that direction.

The Mode

Mode is the value with the largest frequency i.e the value that occurs

most frequently in a distribution. Usually, it is used with regard to a

categorical variable to denote the category that has the highest

frequency. For quantitative variables, it is generally used for discrete

variables taking a small number of possible values.

Eg : (i) For the Electricity source data, Coal is the mode since it corresponds to the highest percentage/frequency.

(ii) For the Shark attack data, Florida is the mode.

(iii) For the Sodium content example, the interval 200-239 (or 200-250) corresponds to the mode since the highest proportion of cereals has Sodium content belonging to this interval.

Note :

The Mode need not be near or at the center of the distribution

- so it is not really a measure of center; nor does it tell us

anything about the spread of the distribution. So, it is usually

classified as a measure of position.

For perfectly symmetric distributions, Mean = Median =


Mode.

Chapter 4 : Sampling Distributions

7 Introduction
Statistical inference is all about using a sample to predict characteristics

of a population. The sample characteristics are summarized using the sample statistics while the characteristics of a population are summarized by the population parameters. Thus, statistical inference boils down to estimating a population parameter using an analogous sample statistic.
Eg 1. The traditionally low percentage of female students in IIMA

has been a topic of heated debate for quite some time. You can

estimate the percentage of female students in the current batch (i.e

PGP, PGPX, FPM, FDP, AFP combined) using the corresponding

percentage (of females) in a random sample of students you may

select from the current batch. In this case, the parameter is the true

(unknown) proportion of females in the population of all students

currently enrolled in IIMA while the statistic is the sample proportion

of female students in your sample.

The two most commonly used statistics are the sample mean ($\bar{x}$) and the sample proportion ($\hat{p}$). The corresponding parameters are the population mean ($\mu$) and the population proportion ($p$).

Clearly, in order for the inferential procedure to be good, the

sample statistic should be pretty close to the unknown population

parameter. In order to ensure that, we have to study the sampling

distribution of sample statistics.

8 How Does it Work ?
Since sample statistics are based on samples, they will have different

values for different samples. Thus, a statistic is also a random variable and will have a probability distribution that will assign probabilities to the possible values it can take for the different samples. The probability distribution of a sample statistic is called a sampling distribution.

Eg 1. contd Suppose the true proportion of females in the current


batch of students is p = 0.21. Suppose you select all possible random
samples of 20 students - each of those samples will yield a value of
the sample proportion (of females in that

sample). If you construct a histogram of those values, what you will

get is precisely the sampling distribution of the sample proportion

(of females).

Eg 2. IAS Officers : A few years ago, a female IAS officer was suspended by the UP government for her actions against the sand mafias. Suppose you want to estimate the average tenure of an IAS officer (at a particular posting). For that purpose, suppose you select random samples of 20 IAS officers from the pool of all IAS officers

currently in service. For each sample, you calculate the average tenure

of an officer. If you continue doing this for a large number of samples

and draw the histogram (of the sample average tenure value of an IAS

officer), what you will have is precisely the sampling distribution of the

sample mean tenure of IAS officers1.

Having said all the above, in reality, it is impossible to collect all

possible samples and calculate the sample statistic value for each. In fact,

in a real-life study, we can only collect one sample. However, the theory

of sampling distribution will tell us how much a statistic would vary from

sample to sample and will help us to predict how close a statistic will

be to the parameter it estimates.

Necessary tools

Before delving into some examples, let us briefly discuss a couple of important concepts which are essential for a proper understanding of sampling distributions.

1. Empirical rule : Suppose the mean and standard deviation of a sample are x̄ and s respectively². If it can be assumed that the sample comes from a distribution that is approximately bell-shaped and symmetric, then

About 68% of the observations would fall within 1 standard deviation of the mean i.e within (x̄ − s, x̄ + s).

About 95% of the observations would fall within 2 standard deviations of the mean i.e within (x̄ − 2s, x̄ + 2s).

About 99.7% of the observations would fall within 3 standard deviations of the mean i.e within (x̄ − 3s, x̄ + 3s).

¹ Here the population is the set of ALL IAS officers currently on duty; the parameter is the true average tenure of an IAS officer which is unknown (since we cannot survey each and every IAS officer). In fact, a recent Harvard study has concluded that the average tenure of an IAS officer is only 16 months !

² s is known as the sample standard deviation and is given by s = √( (1/(n−1)) Σᵢ (xᵢ − x̄)² ), xᵢ, i = 1, ..., n being the sample units and n the sample size.

This is important to know because a lot of variables observed

in reality have approximately bell shaped distributions.

2. Central Limit Theorem (CLT) : This is a very well known result in Statistics which basically says that as the sample size increases (i.e as we take larger and larger samples), the sampling distribution of the sample statistic (mean or proportion) tends to a Normal distribution. In fact, this holds true even if the population (from which the sample is drawn) is moderately skewed or discrete. The only restriction is that the mean and standard deviation of the population distribution should exist. For all practical purposes, n ≥ 30 is good enough to ensure the normality of the sample mean (X̄) if the population distribution is not too skewed. Needless to say, the larger the sample size, the better the approximation will be. (A small simulation sketch follows.)
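To make the CLT concrete, here is a minimal simulation sketch in Python (numpy only; the skewed exponential "population" and the sample sizes are illustrative assumptions, not from the text):

```python
import numpy as np

rng = np.random.default_rng(42)

# A right-skewed "population": exponential with mean 10 (illustrative choice)
population = rng.exponential(scale=10, size=100_000)

for n in (2, 30, 200):
    # Draw 10,000 random samples of size n and record each sample mean
    sample_means = np.array([
        rng.choice(population, size=n).mean() for _ in range(10_000)
    ])
    # As n grows, the mean of the sample means stays near the population mean,
    # the spread shrinks like sigma/sqrt(n), and a histogram of sample_means
    # looks increasingly bell-shaped -- the CLT in action.
    print(f"n={n:3d}: mean of x-bar = {sample_means.mean():6.2f}, "
          f"se = {sample_means.std():5.2f}, "
          f"sigma/sqrt(n) = {population.std() / np.sqrt(n):5.2f}")
```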

Now, we will discuss the sampling distribution of sample means and

proportions in some detail.

9 Sampling Distribution of the Sample Mean
Since means or averages are so ubiquitous in Statistics, it is useful

to learn about their sampling distribution i.e how sample means

vary from sample to sample and how close they will be to the

population mean in repeated sampling from the population. However, in reality, it is impossible to collect many samples and obtain the distribution of the sample means manually. The following result provides us with an automated way of achieving the same even when we collect only one sample.

Suppose you draw samples of size n from a population with mean μ and standard deviation σ and calculate the sample mean (x̄) for each. Then the means will have a sampling distribution which will be centered around the true population mean μ and will have a standard deviation (or standard error) σ/√n. In fact, if n ≥ 30, then this distribution will be approximately normal with mean μ and standard error σ/√n.
In the light of this result, let us consider the following example :

Eg 3. Rambhai's income : The sales of food and drink at Rambhai's stall (located adjacent to the boundary wall of IIM, a little to the left of the main gate) vary from day to day. The daily sales figures fluctuate with mean μ = Rs 900 and standard deviation σ = Rs 300. Suppose Rambhai wants to calculate the mean daily sales for the week to check how he is doing.

Q1. What would the mean daily sale figures for the week center around
?

Q2. How much variability would you expect in the mean daily sales

figures for the week ? Interpret.

Thus, if Rambhai were to observe the mean daily sales for several weeks, those will center around μ = Rs 900 with a standard deviation (standard error) of σ/√7 = 300/√7 ≈ Rs 113 (taking a week as n = 7 days).

Q3. Suppose Rambhai now wants to look at the monthly sales. What will be the sampling distribution ? Will his mean daily sales for the month vary more or less than the mean daily sales for the week ?

His mean daily sales for the month will be centered around μ = Rs 900 with standard error σ/√30 = 300/√30 ≈ Rs 54.8 (taking a month as n = 30 days). Thus, the mean daily sales for the month will tend to vary less than the mean daily sales for the week and thus be closer to the true mean sales of Rs 900. In fact, since n = 30, the CLT holds and the sampling distribution of the mean daily sales for the month is approximately normal with mean Rs 900 and standard error Rs 54.8.

Clearly there is less variability in the mean daily sales from month-to-month than from week-to-week, and less from week-to-week than there is from day-to-day in the daily sales.

Q4. What is the probability that the mean daily sales of the month will

be between 800 and 1000 Rupees ?
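A quick numerical check of Q4 (a sketch assuming the approximate N(900, 54.8²) sampling distribution derived above; scipy supplies the normal CDF):

```python
from scipy.stats import norm

mu, sigma, n = 900, 300, 30
se = sigma / n ** 0.5                    # standard error of the monthly mean, about 54.77

# P(800 < X-bar < 1000) under the approximate normal sampling distribution
prob = norm.cdf(1000, loc=mu, scale=se) - norm.cdf(800, loc=mu, scale=se)
print(round(prob, 3))                    # about 0.932
```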

10 Sampling Distribution of the Sample Proportion
If p̂ is the sample proportion for a random sample of size n drawn from a population with proportion p, then p̂ has mean p and standard deviation √( p(1 − p)/n ).

Note : (i) If n is sufficiently large such that both np and n(1 − p) are at least 10, then this sampling distribution is approximately normal due to the CLT.

(ii) The standard deviation of a sampling distribution is known as the standard error. So, the standard error of the sampling distribution of p̂ is √( p(1 − p)/n ).

Eg 4. Internship : Suppose out of all first year students enrolled in the top business schools across India, about 55% went abroad for summer internship last year. Suppose you randomly select a business school and it turns out to be XLRI which has about 350 students enrolled in the first year.

a) What is the sampling distribution of the proportion of students (out of the 350) who go abroad for internship ?

b) What is the probability that at least 50% of the 350 XLRI students

will go abroad for internship this year ?

c) What is the probability that at most 70% of the 350 XLRI students

will go abroad for internship this year ?

d) What is the probability that between 50% and 70% of the 350

XLRI students will go abroad for internship this year ?
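The following sketch (Python with scipy; all numbers come from the example above) works out a)-d) under the normal approximation:

```python
from scipy.stats import norm

p, n = 0.55, 350
se = (p * (1 - p) / n) ** 0.5                # a) p-hat is approx N(0.55, se^2); se is about 0.027

print(1 - norm.cdf(0.50, loc=p, scale=se))   # b) P(p-hat >= 0.50), about 0.97
print(norm.cdf(0.70, loc=p, scale=se))       # c) P(p-hat <= 0.70), about 1.00
print(norm.cdf(0.70, loc=p, scale=se)
      - norm.cdf(0.50, loc=p, scale=se))     # d) P(0.50 < p-hat < 0.70), about 0.97
```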

Note :
(i) Clearly, the standard error will decrease as you increase your sample size. For example, if you select a larger B school which has 500 students, the standard error of p̂ will be √(0.55 × 0.45/500) ≈ 0.022 (down from about 0.027). Thus, the smaller the standard error, the closer the sample proportion will be to the population proportion.

(ii) If the sampling distribution is approximately normal, we can use the Empirical rule. For example, in the above example, nearly all the sample proportions (corresponding to all possible samples/B-schools of size 350) will lie within 0.55 ± 3 × 0.027, i.e roughly between 0.47 and 0.63.

Chapter 5 : Confidence Intervals

11 Introduction
The process of making decisions and predictions about one or more

population parameters using the corresponding sample statistics

(obtained from a randomly selected representative sample from the

population) is called Statistical Inference. Broadly, there are three ways

of making statistical inference as follows :

1. Point Estimation : Here we put forward a single estimate (usually

the sample statistic obtained from a random sample) for the

population parameter.

Eg : The proportion of vegetarians in a random sample of 50

IIMA students can be a point estimate of the corresponding

proportion in the population of all IIMA students.

2. Interval Estimation : Here we form an interval containing the

most plausible values of the population parameter and within

which the parameter is believed to lie with a certain confidence.

Unlike point estimates, interval estimates give us an idea of the

precision of our estimates.

Eg: The interval (5.1, 5.9) may be a 95% confidence interval of the

true average height of an Indian adult.

3. Hypotheses Tests : This is an inferential procedure that yields a decision on whether a claim about the value of a parameter (framed in terms of hypotheses) is supported by data observed from a random sample.

An important part of any confidence interval is the Confidence level - it is the level of confidence with which the interval actually contains the true parameter value. It is usually chosen to be very close to 1, with 0.90, 0.95 and 0.99 being the commonly used values, and is denoted by 1 − α, where α can be 0.1, 0.05 or 0.01 respectively for the above confidence levels.

Thus, if (a, b) is a 95% confidence interval of a parameter, say θ, then we can be 95% confident that (a, b) contains the true unknown value of θ in the population. What this statement really implies is that, if we go on collecting a large number of random samples from the population and form a 95% confidence interval from each

of those, then, in the long run, about 95% of those intervals would

contain the true population parameter and the rest 5% will not.

Confidence intervals generally have the form :

point estimate ± margin of error

where the margin of error measures the accuracy with which the sample statistic estimates the unknown population parameter.

Now, we will separately discuss confidence intervals for population

means and proportions.

12 Confidence interval for population proportion


Unknown parameter : p (the population proportion)

Sample statistic : p̂ (the sample proportion)

Standard error (se) of p̂ : √( p(1 − p)/n )

Estimated se of p̂ : √( p̂(1 − p̂)/n )

Margin of error : Z_{α/2} √( p̂(1 − p̂)/n )

where Z_{α/2} is that value of the standard normal variable above which the area under a (standard normal) curve is α/2. So, for a 90%, 95% and 99% confidence interval, the corresponding Z_{α/2} will be Z_{0.05}, Z_{0.025} and Z_{0.005} - these values are 1.645, 1.96 and 2.58 respectively. Thus the resulting 100(1 − α)% confidence interval of p will be

p̂ ± Z_{α/2} √( p̂(1 − p̂)/n )

However, the above confidence interval is valid only under the following assumptions :

1. Random sample of binary (success/failure) observations from the population.

2. # successes ≥ 10; # failures ≥ 10.

Note 1. (i) For fixed confidence level, as the sample size increases, the standard error decreases and hence the margin of error decreases. So, the confidence interval gets narrower.
(ii) For fixed sample size, as the confidence level increases, the Z-score increases; hence the margin of error increases. Thus the confidence interval gets wider.

Eg : NDTV randomly selected 10,000 final year students across different management schools in India and asked them about their career choices. 4% said they want to take the plunge and start their own companies even if that meant giving up lucrative job offers from established MNCs. Find a 99% confidence interval of the true population proportion of management students in India who want to work on their start-ups.

Here p̂ = 0.04 and n = 10,000, so the estimated se is √(0.04 × 0.96/10000) ≈ 0.002 and the 99% confidence interval is 0.04 ± 2.58 × 0.002 = (0.035, 0.045).

Conclusion : We can be 99% confident that between 3.5% and 4.5% of final year management students want to work on their start-ups right after graduating.

A 95% confidence interval of p will be : 0.04 ± 1.96 × 0.002 = (0.036, 0.044)

A 90% confidence interval of p will be : 0.04 ± 1.645 × 0.002 = (0.037, 0.043)

Thus, comparing the above three confidence intervals, it is obvious that the intervals get wider with increasing confidence level.
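A small helper that reproduces these intervals (a sketch in Python; the NDTV numbers and the three confidence levels are taken from the example above):

```python
from scipy.stats import norm

def prop_ci(p_hat, n, conf=0.95):
    """100*conf % confidence interval for a population proportion."""
    z = norm.ppf(1 - (1 - conf) / 2)             # Z_{alpha/2}
    me = z * (p_hat * (1 - p_hat) / n) ** 0.5    # margin of error
    return p_hat - me, p_hat + me

for level in (0.90, 0.95, 0.99):
    lo, hi = prop_ci(0.04, 10_000, level)
    print(f"{level:.0%} CI: ({lo:.4f}, {hi:.4f})")   # intervals widen with the level
```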

13 Confidence interval for population mean


Unknown parameter : μ (the population mean)

Sample statistic : X̄ (the sample mean)

Standard error (se) of X̄ : σ/√n

Estimated se of X̄ : S/√n, where S is the sample standard deviation.

Margin of error : t_{α/2, n−1} × S/√n

Note 2. A t-distribution is similar to the standard normal distribution but with thicker tails and it approaches the standard normal distribution as the degrees of freedom increase. Here, t_{α/2, n−1} is that value of the t-variate such that the area above it under a t curve with n − 1 degrees of freedom is α/2.

The resulting 100(1 − α)% confidence interval of μ will be

X̄ ± t_{α/2, n−1} × S/√n

The above confidence interval is valid under the following assumptions :

1. Random sample

2. Population distribution approximately normal

Following is a section of a t-table :

Confidence Level :        90%      95%      99%
Right tail probability
df        t.05     t.025    t.005
1         6.314    12.706   63.657
2         2.920    4.303    9.925
3         2.353    3.182    5.841
4         2.132    2.776    4.604
5         2.015    2.571    4.032
6         1.943    2.447    3.707
7         1.895    2.365    3.499
8         1.860    2.306    3.355
9         1.833    2.262    3.250
10        1.812    2.228    3.169

Eg : A recent survey asked 899 randomly selected college students "On an average day, about how many hours do you watch TV ?" The sample mean was 2.865 and the standard deviation was 2.617. Find a 95% confidence interval of the population mean number of hours per day college students watch TV.

Here the estimated se is 2.617/√899 ≈ 0.087, and for such a large df (898) the t-value is essentially the Z-value. So the 95% confidence interval is 2.865 ± 1.96 × 0.087 = (2.69, 3.04).

Thus, we are 95% confident that the average number of hours an Indian college student watches TV per day is between 2.69 and 3.04.

(i) 90% confidence interval of μ : 2.865 ± 1.645 × 0.087 = (2.72, 3.01)

(ii) 95% confidence interval of μ : 2.865 ± 1.96 × 0.087 = (2.69, 3.04)
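The same interval via scipy (a sketch; the summary statistics are from the survey example, and `t.interval` uses the exact t quantile rather than 1.96):

```python
from scipy.stats import t

n, xbar, s = 899, 2.865, 2.617
se = s / n ** 0.5

# 95% t-based confidence interval for the population mean
lo, hi = t.interval(0.95, n - 1, loc=xbar, scale=se)
print(f"({lo:.2f}, {hi:.2f})")   # about (2.69, 3.04)
```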

14 Sample size determination


In many real-life studies, we often know the level of precision and

need to determine the sample size that will yield that precision. This is

doable because precision is measured by the margin of error of a

confidence interval which in turn is a function of the sample size.

Now we will discuss how this is done for the case of proportion.
7
5
The sample size (n) for which a confidence interval of a population proportion has margin of error m is

n = p(1 − p) Z² / m²

As usual, the Z-score depends on the confidence level (i.e Z = 1.96 for a 95% confidence interval). The value of p can either be determined from past experience/studies or can be assumed to be 0.5.

Eg : A study by a social sciences institute concluded that 19% of university students in India do not want to go abroad for jobs even if they are presented with the opportunity. To estimate this proportion in the IIMs with precision 0.05 with a 95% confidence interval, what should be your sample size if

a) You have absolutely no idea of the proportion in the IIMs.
Take p = 0.5 : n = 0.5 × 0.5 × (1.96/0.05)² = 384.2, i.e a sample of 385 students.

b) You use the social science study as your guideline.
Take p = 0.19 : n = 0.19 × 0.81 × (1.96/0.05)² = 236.5, i.e a sample of 237 students.

Now, what will be the sample size (with p = 0.5) for a

90% confidence interval : n = 0.25 × (1.645/0.05)² = 270.6, i.e 271.

99% confidence interval : n = 0.25 × (2.58/0.05)² = 665.6, i.e 666.

Note 3. i) p = 0.5 will always result in a larger (i.e more conservative) sample
size estimate.
ii) Always round the sample size estimate to the next higher integer.
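A one-line helper for this calculation (a sketch; `math.ceil` implements the round-up rule of Note 3):

```python
import math
from scipy.stats import norm

def sample_size(m, conf=0.95, p=0.5):
    """Smallest n giving margin of error m for a proportion CI."""
    z = norm.ppf(1 - (1 - conf) / 2)
    return math.ceil(p * (1 - p) * (z / m) ** 2)

print(sample_size(0.05))           # 385  (no prior guess, p = 0.5)
print(sample_size(0.05, p=0.19))   # 237  (using the 19% prior study)
```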

Chapter 6 : Hypotheses Tests for one Population

15 Introduction
Significance tests (or Tests of Hypotheses) are an integral part of statistical inference alongside point and interval estimation. Any significance test procedure has five (5) distinct steps viz.

1. Making assumptions : simple random sampling etc.

2. Constructing hypotheses : Each significance test is composed of two hypotheses

Null hypothesis (H0) : It is a statement that specifies a particular value for the parameter (in our case, p or μ) which is pre-determined from experience and/or prior belief.

Eg : You believe that on an average an IIMA student spends 15 hours per week discussing/solving cases outside of class i.e the null hypothesis will be H0 : μ = 15.

Alternative hypothesis (Ha) : It states that the population parameter takes values in some alternative parameter space (than what is stated by the null). It may be one or two sided.

Eg : Suppose on the contrary, your friend believes that an IIMA student spends more than 15 hours per week solving cases i.e the alternative hypothesis will be Ha : μ > 15.

The above alternative is a one-sided one. However, two-sided alternatives (Ha : μ ≠ 15) are also possible.

Here μ is the population mean number of hours an IIMA student spends solving cases per week outside of class. Similar one/two-sided hypotheses can be framed for the population proportion p.

3. Determining the test statistic : A test statistic measures how close the point estimate of the population parameter is to the null hypothesis value (of the parameter). This closeness is measured in terms of the standard error of the point estimate. Thus, the test statistic is given by

(point estimate − null value) / standard error

4. P-values : p-values represent the amount of evidence against the null hypothesis based on the available data. The smaller the p-value, the stronger the evidence against the null hypothesis, and vice versa. The smallness of the p-value is measured with respect to the significance level (α).

5. Drawing conclusion : In order to come to a definite conclusion (about rejecting or not rejecting the null hypothesis), we compare the p-value with the significance level (denoted by α). The significance level is usually set at 0.05, 0.1 or 0.01 and it is chosen by the statistician. We would reject H0 at a given significance level α, if

p-value < α

and fail to reject H0 otherwise (i.e if p-value ≥ α).

Now we will separately discuss significance tests for the population proportion (p) and the population mean (μ).

16 Significance Tests for Population Proportion


Let us go through the steps sequentially :

1. Assumptions :

Simple random sample.

Sample size (n) should be large enough such that np0 ≥ 10, n(1 − p0) ≥ 10.

2. Hypotheses :

Null : p = p0

Alternative : p > p0, p < p0 or p ≠ p0

where p0 is known as the null value of p.

3. Test statistic:

Z = (p̂ − p0) / √( p0(1 − p0)/n ) ~ N(0, 1) under H0

4. P-value : The p-value would depend on the direction of the alternative as follows :

If Ha : p > p0, the p-value will be the right tailed area above the observed value of the test statistic (Zobs) under the standard normal curve.

If Ha : p < p0, the p-value will be the left tailed area below the observed value of the test statistic under the standard normal curve.

If Ha : p ≠ p0, the p-value will be the two tailed area beyond the observed value of the test statistic under the standard normal curve. Since the normal curve is symmetric, it can also be calculated as twice the one-tailed area above (or below) the observed value of the test statistic.

5. Drawing conclusion : We will reject H0 if p-value < α and fail to reject H0 otherwise.

Eg 1: Female managers : Traditionally, the percentage of managers who

are female in the Indian corporate sector has been pretty low,

about 18%. The HRD ministry wants to know whether the

percentage has improved during recent times. Accordingly, a

random sample of 100 managers were chosen and 25 of them were

females. Perform an appropriate test of hypotheses for the above

problem.

Let p be the unknown population proportion of female

managers. Let us

go through the steps one by one.

1. Assumptions :

Since the 100 managers were chosen randomly, the random sampling assumption is preserved.

Here the null value is p0 = 0.18. So np0 = 18 and n(1 − p0) = 82, both of which exceed 10.

Thus, all our assumptions are satisfied and we can proceed with the test.

2. Hypotheses :

Null : H0 : p = 0.18

Alternative : Ha : p > 0.18 (since we want to know whether the percentage has improved)

3. Test statistic:

Z = (0.25 − 0.18) / √( 0.18 × 0.82/100 ) = 0.07/0.038 = 1.82

This implies that the sample proportion (0.25) falls 1.82 standard errors above the null value 0.18.

Now, we will check whether this is good enough evidence to reject H0.

4. P-value : The observed value of our test statistic is Zobs = 1.82. Since the alternative is one-sided (>), the p-value will be the right tailed area above 1.82 under the Z curve i.e P(Z > 1.82).

From the normal tables, the p-value will be 1 − 0.966 = 0.034.

5. Conclusion : Let us choose a significance level of 0.05. Since p-value = 0.034 < 0.05, we reject H0 at α = 0.05. Thus, there is significant evidence (at α = 0.05) that the proportion of female managers in the Indian corporate sector has increased in recent times.
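The same test in a few lines (a sketch using statsmodels' `proportions_ztest`; by default it builds the standard error from the sample proportion, so `prop_var=0.18` is passed to match the textbook formula that uses p0):

```python
from statsmodels.stats.proportion import proportions_ztest

# 25 female managers out of a random sample of 100; H0: p = 0.18 vs Ha: p > 0.18
z, pval = proportions_ztest(count=25, nobs=100, value=0.18,
                            alternative='larger', prop_var=0.18)
print(round(z, 2), round(pval, 3))   # 1.82, 0.034 -> reject H0 at alpha = 0.05
```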

17 Significance Tests for Population Mean


As for proportions, significance tests for means also have five distinct
steps viz.

1. Assumptions :

Simple random sample.

Population distribution approximately normal.

2. Hypotheses :

Null : μ = μ0

Alternative : μ > μ0, μ < μ0 or μ ≠ μ0

where μ0 is the null value of μ.

3. Test statistic:

t = (X̄ − μ0) / (S/√n) ~ t_{n−1} under H0

where X̄ and S are respectively the point estimates of μ and σ, and t_{n−1} denotes a t distribution with (n − 1) degrees of freedom.

As for confidence intervals, our test statistic follows a t distribution under H0 since we replace σ by S.

4. P-value : The p-value would depend on the direction of the alternative as follows :

If Ha : μ > μ0, the p-value will be the right tailed area above the observed value of the test statistic under a t_{n−1} curve.

If Ha : μ < μ0, the p-value will be the left tailed area below the observed value of the test statistic under a t_{n−1} curve.

If Ha : μ ≠ μ0, the p-value will be the two tailed area beyond the observed value of the test statistic under a t_{n−1} curve. Since the t distribution is symmetric, it can also be calculated as twice the one-tailed area above (or below) the observed value of the test statistic.

5. Drawing conclusion : We will reject H0 if p-value < α and fail to reject H0 otherwise.

Eg 2. Based on past records, it is generally believed that on an average, a typical IIMA student spends about 25 hours in the Vikram Sarabhai library per week. Recently, the library has been shifted to a new location in KLMDC which is further away from the dorms. As a result, the administration feels that students may be spending less time in the library. Accordingly a random sample of 41 students was selected and the average number of hours they spend in the library came out to be 16.78 with a standard deviation of 5.17. Carry out an appropriate test of hypotheses for the above problem to test whether the shifting of the library has adversely impacted the study time in the population of all students. (You may assume that study times/week approximately follow a normal distribution in the population).

1. Assumptions :

41 students were selected randomly - so this is a simple random sample.

Population distribution of study times is Normal (as stated).

2. Hypotheses :

Null : H0 : μ = 25

Alternative : Ha : μ < 25

where μ is the population mean number of hours/week a student spends in the library.

3. Test statistic:

t = (16.78 − 25) / (5.17/√41) = −8.22/0.807 = −10.18

which follows a t distribution with 40 df under H0. This implies that the sample mean (16.78) falls 10.18 standard errors below the null value of 25. Now, we will check whether this is good enough evidence to reject H0.

4. P-value : Since the alternative is <, the p-value will be the left tailed area below −10.18 under the t curve with 40 df i.e P(t40 < −10.18).

Following is a small part of the t table :

Area to the right
df     .100     .050     .025     .010     .005     .001
40     1.303    1.684    2.021    2.423    2.704    3.307
50     1.299    1.676    2.009    2.403    2.678    3.261

5. Drawing conclusion : Since 10.18 is far beyond 3.307 (the t.001 value for 40 df), the p-value is smaller than 0.001 i.e practically 0. So we reject H0 at all the commonly used significance levels and conclude that there is strong evidence that the shifting of the library has reduced the mean weekly study time below 25 hours.
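With raw data one would simply call `scipy.stats.ttest_1samp`; from the summary statistics alone, the sketch below (numbers from the library example) reproduces the test:

```python
from scipy.stats import t

n, xbar, s, mu0 = 41, 16.78, 5.17, 25
tstat = (xbar - mu0) / (s / n ** 0.5)   # about -10.18
pval = t.cdf(tstat, df=n - 1)           # left-tailed p-value for Ha: mu < 25
print(round(tstat, 2), pval)            # p-value is essentially 0 -> reject H0
```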

18 Errors in Hypotheses Tests


Two types of errors may result from drawing conclusions from
hypotheses tests :

Type I error : We commit a type I error when we mistakenly reject H0 when it is true. In that case,

P(type I error) = significance level (α)

Thus, we control the probability of type I error by our choice of the significance level. In practice, α = 0.05 is most common.

Eg 2 : Suppose for the library example, you fix the significance level to be 0.05. Then, the probability that we will conclude Ha : μ < 25 when in fact the average study time has not decreased (i.e H0 is true) is 0.05. In other words, the probability of taking the correct decision i.e retaining H0 : μ = 25 is 0.95. As α decreases, we are less likely to reject H0 i.e P(type I error) goes down.

Type II error : We commit a type II error when we fail to reject H0 even when it is false. Its probability is usually denoted by β.

When H0 is false, we would want the probability of rejecting it to be as high as possible. The probability of rejecting H0 when it is false is called the power of the test. It is given by :

Power = 1 − P(type II error) = 1 − β

Obviously, the higher the power, the better is our test.

Eg 3 : Suppose for the female managers example, the true proportion of managers who are females has increased significantly from 0.18. However, we still conclude the null hypothesis (i.e H0 : p = 0.18) with probability 0.01. Then the power of our test will be 1 − β = 1 − 0.01 = 0.99.

Chapter 7 : Comparison of Two Populations

19 Introduction
In the previous lectures, we learned how to construct confidence intervals and test hypotheses corresponding to one population. Now, we shall extend the above methods to two groups or populations i.e we will learn how to construct confidence intervals and hypothesis tests on the difference of proportions corresponding to two populations. By doing this, we can compare the characteristics of the subjects belonging to these two groups.

Eg. We want to compare the proportions of males and females in India who believe in miracles. Here the two groups/populations are respectively the populations of all males and all females in India, while the characteristic we want to compare (between the two populations) is belief in miracles.
The two groups mentioned above can be compared with regard to a categorical or quantitative outcome. For categorical outcomes, we compare population proportions and for quantitative outcomes, we compare population means. Here we will only deal with techniques for comparing population proportions across two groups.

20 Comparing two proportions

Confidence Intervals

1. Notation :

p1 (p2) : population proportion of success in the first (second) group.

n1 (n2) : sizes of the random samples drawn from the first (second) population.

p̂1 (p̂2) : corresponding sample proportions in the first (second) group.

2. Assumptions :

Independent random samples from the two groups.

Large enough sample sizes so that in each sample there are at least 10 successes and 10 failures.

These will ensure that the sampling distribution of (p̂1 − p̂2) is approximately normal under the CLT.

3. Structure : As for one proportion, the confidence interval for (p1 − p2) is obtained by adding and subtracting a margin of error to its point estimate (p̂1 − p̂2). As before, the margin of error is the product of the Z-score and the estimated standard error of (p̂1 − p̂2) given by

se(p̂1 − p̂2) = √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 )

So, for a 100(1 − α)% confidence interval, the margin of error will be Z_{α/2} × se(p̂1 − p̂2), resulting in the confidence interval

(p̂1 − p̂2) ± Z_{α/2} √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 )
4. Observations :

If the confidence interval contains only positive (negative) values, we can conclude (with the appropriate confidence) that p1 > p2 (p1 < p2).

If the confidence interval contains 0, we conclude (with the appropriate confidence) that p1 and p2 are not significantly different.

Eg 1. Let us go back to the belief-in-miracles example. Suppose 523 males and 498 females were randomly sampled and each of them was asked the question "Do you believe in miracles?". The following table summarizes the observations. We want to compare the population proportions of males and females who believe in miracles using a 95% confidence interval.

Belief in miracles
Gender     Yes    No     Total
Male       225    298    523
Female     276    222    498

(The above table is called a 2 x 2 contingency table. It cross-classifies the observations

corresponding to the response and explanatory variables ).

1. Assumptions : Here the assumptions are satisfied because (i) the samples of males and females were chosen randomly and (ii) the numbers of successes (belief in miracles) and failures (no belief in miracles) in each sample (male and female) are much higher than 10.

2. Structure : Here

p̂1 = 225/523 = 0.430, p̂2 = 276/498 = 0.554, se(p̂1 − p̂2) = √( 0.430 × 0.570/523 + 0.554 × 0.446/498 ) = 0.031

Hence the required 95% confidence interval of p1 − p2 will be

(0.430 − 0.554) ± 1.96 × 0.031 = −0.124 ± 0.061 = (−0.185, −0.063)

3. Conclusion : Since the above confidence interval only contains negative numbers, we can be 95% confident that the population proportion of males who believe in miracles is between 0.063 and 0.185 (i.e 6.3 to 18.5 percentage points) lower than the population proportion of females who believe in miracles.

Hypothesis Tests

Significance test is another avenue through which the comparison of

two groups can be carried out. As for confidence intervals, the general

methodology for significance tests is perfectly analogous for the one

and two sample case.

1. Notation : same as for confidence intervals.

2. Assumptions : same as for confidence intervals.

3. Hypotheses : Similar to the case for a single proportion, we have to formulate two hypotheses for (p1 − p2) : a null hypothesis based on our experience and an alternative one challenging our belief. These can be expressed as:

H0 : p1 − p2 = 0

Ha : p1 − p2 ≠ 0 (or > 0 or < 0 for one-sided alternatives)

4. Test statistic : The test statistic is given by

Z = (p̂1 − p̂2 − 0) / √( p̂(1 − p̂)(1/n1 + 1/n2) ) ~ N(0, 1)

where p̂ is the pooled estimate of p i.e the common population proportion of success given by

p̂ = total # of successes in the two samples / total sample size

The above standard error is obtained from the estimated standard error of p̂1 − p̂2 by replacing both p̂1 and p̂2 by p̂. Analogous to the one proportion case, the above test statistic measures the number of (estimated) standard errors that separate p̂1 − p̂2 from the null value 0.

5. P-values : As in the one proportion case, the p-value will be the

one (for one sided alternative) or two (for two sided alternative)

tailed probability of values even more extreme than the

observed test statistic value.

6. Rejection rule : As before, we would reject H0 at significance level α if p-value < α and would fail to reject H0 otherwise.

Eg 2. Let us go back to the belief in miracles example. We want to

test whether there is any difference in the population proportion of

males and females who believe in miracles.

1. Assumptions : All the assumptions have been satisfied.

2. Hypotheses :

H0 : p1 − p2 = 0 (i.e p1 = p2)

Ha : p1 − p2 ≠ 0

3. Test statistic : The total number of believers in the two samples taken together is 225 + 276 = 501 and the total sample size is 523 + 498 = 1021. Hence, the pooled sample estimate of the proportion of believers will be p̂ = 501/1021 = 0.491. Hence the test statistic will be

Z = (0.430 − 0.554) / √( 0.491 × 0.509 × (1/523 + 1/498) ) = −0.124/0.031 = −3.96

Thus, p̂1 − p̂2 falls about 3.96 (estimated) standard errors below the null value of 0.

4. P-value : Since our alternative is two-sided and our test statistic value is −3.96, the p-value will be the area under the standard normal curve above 3.96 and below −3.96, which is double the area below −3.96.

So, the p-value is 2 × P(Z < −3.96) ≈ 0.0001.

5. Conclusion : Since our p-value is less than all the commonly used significance levels, we reject H0 and conclude that there is strong evidence that the proportions of male and female believers in miracles are different in the population. In fact, since the test statistic is negative, the population proportion of male believers should be lower than the population proportion of female believers.

Observe that the conclusions we have drawn from significance

tests and confi- dence intervals match perfectly.
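Both the interval and the test can be reproduced with statsmodels (a sketch; the counts come from the contingency table above, and the CI is built from the unpooled standard error as in the text):

```python
from statsmodels.stats.proportion import proportions_ztest

# Belief in miracles: 225/523 males, 276/498 females
z, pval = proportions_ztest(count=[225, 276], nobs=[523, 498])  # pooled, two-sided
print(round(z, 2), pval)   # about -3.96, p-value well below 0.001

# 95% CI for p1 - p2 using the unpooled standard error
p1, p2 = 225 / 523, 276 / 498
se = (p1 * (1 - p1) / 523 + p2 * (1 - p2) / 498) ** 0.5
diff = p1 - p2
print(f"({diff - 1.96 * se:.3f}, {diff + 1.96 * se:.3f})")   # about (-0.185, -0.063)
```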

Chapter 8 : Analysis of Variance (ANOVA)

21 Introduction
In Chapter 7, we have learnt how to compare two groups (or

populations) with respect to a categorical response variable.

However, in reality, we may have to compare a characteristic across

multiple (i.e more than 2) groups. The statistical technique through

which this is accomplished is known as Analysis of Variance or

ANOVA.

Eg 1 : A recent news article in the Times of India reported that in the current placement season, the average starting (domestic) salaries of graduating PGP students at IIMA have increased by a whopping 28%. Suppose, in view of this news, you want to compare the true (population) average starting salaries across the top five IIMs (IIMA, IIMB, IIMC, IIML and IIMK). Thus, we are comparing 5 groups with respect to a quantitative response variable (salaries).

In ANOVA, the categorical explanatory variables identifying the groups are called factors while the individual categories are the levels. So, in the above example, there is one factor (IIM) with 5 levels (or categories) : IIMA, IIMB, IIMC, IIML and IIMK. When there is only one factor, the technique is called one-way ANOVA, which will be the topic of this chapter.

22 One-way ANOVA
ANOVA is basically a hypotheses testing problem. As with any hypotheses test, it is based on certain assumptions as follows :

Simple random samples from each group (or population).

The response should follow an approximate normal distribution in the population. However, this is mainly important for small samples (n < 30).

As mentioned above, suppose we want to compare a quantitative response variable across g groups. This is tantamount to comparing the population means of the response for the g groups (say μ1, μ2, ..., μg). Thus, the null and alternative hypotheses will respectively be

H0 : μ1 = μ2 = ... = μg

Ha : at least two of the means differ

In order to test the above hypotheses i.e whether there is any significant difference in the population means, the following test statistic is used

F = Variability between groups / Variability within groups = σ²_b / σ²_w     (3)
Observations :

The higher the variability between groups relative to the variability within groups, the stronger will be the evidence against the null, and the larger the test statistic (i.e F) value will be. Thus, we will reject H0 and would conclude that the means cannot be assumed to be equal.

If the variability between groups is similar to the variability within groups (i.e F ≈ 1), we would fail to reject H0 and conclude that the means (μ1, μ2, ..., μg) are not significantly different from each other.

Since the population variances between and within groups are generally unknown, we will not know the values of σ²_b and σ²_w in (3). So, we replace those by their estimates, Mean Squares Between (MSB) and Mean Squares Within (MSW) (also known as Mean Squares Error (MSE)). Thus, the test statistic becomes

F = MSB / MSW     (4)

Now, MSB and MSE can again be represented as the ratios of the corresponding Sums of Squares and their degrees of freedom:

MSB = SSB/(g − 1),    MSW = SSW/(N − g)

Here, SSB is known as the Sums of Squares Between while SSE (= SSW) is the Sums of Squares of Errors, with degrees of freedom g − 1 and N − g respectively, N being the total combined sample size from all the groups. Thus, replacing these values in (4), the test statistic becomes

F = [SSB/(g − 1)] / [SSE/(N − g)]     (5)

which follows an F distribution with (g − 1, N − g) degrees of freedom.

P-values : Since the F distribution is only defined on the positive axis, the p-value will always be the right tailed area above the observed F value. We can use the F table at the back of the book for that.

Decision rule : As always, we will reject H0 if p-value < α and would fail to reject H0 otherwise.
The output of an ANOVA procedure is summarized in an ANOVA table which has the following form :

Source                   Df       Sum of squares    Mean squares        F
Between groups           g − 1    SSB               MSB = SSB/(g−1)     F = MSB/MSE
Within groups (Error)    N − g    SSE               MSE = SSE/(N−g)
Total                    N − 1    SST

Here SST is the Sum of Squares Total and is equal to SSB + SSE.

Eg 2 : Suppose a researcher wants to compare 3 diet regimens (Low Fat, Low Cal and Low Carb) by the amount of weight loss they induce among subjects. Accordingly, she randomizes 10 subjects into each of these regimens, measures their weights before and after (they take the diet) and takes the difference of the same. It can be assumed that weight loss has a normal distribution in the population.

So, the explanatory variable (or factor) is diet regimen which has 3 categories i.e Low Fat, Low Cal and Low Carb, while the response is weight loss. The assumptions are valid because

The researcher randomizes the subjects into the 3 regimens.

Weight loss has a normal distribution in the population.


We want to test whether the mean weight losses induced (in the population) by these 3 diets are the same or not. Thus, the null and alternative hypotheses will respectively be

H0 : μ1 = μ2 = μ3    Ha : at least two of the means differ

Here, n1 = n2 = n3 = 10, N = 30 and g = 3.
Moreover, it can be shown that SSB = 122.1 and SST = 182.85. Thus,

Df of SSB = g − 1 = 2

Df of SSE = N − g = 27

Df of SST = N − 1 = 29

SSE = SST − SSB = 182.85 − 122.1 = 60.75

MSB = 122.1/2 = 61.05

MSE = 60.75/27 = 2.25

F = 61.05/2.25 = 27.13

Df of F = (2, 27)

Based on the above values, the ANOVA table will be :

Source                   Df    Sum of squares    Mean squares    F
Between groups           2     122.1             61.05           27.13
Within groups (Error)    27    60.75             2.25
Total                    29    182.85

P-value : From the F table at the back of the book, we have, for df (2, 27), the right tailed area above 5.49 is .01. Hence the area to the right of 27.13 will be even smaller, i.e our p-value will be less than .01 (practically 0).

Decision : Since the p-value is approximately 0, we would reject H0 at all the commonly used significance levels i.e α = .01, .05, .1.
Conclusion : At α = .01, .05, .1, we have strong evidence to believe that the mean weight losses induced by the 3 diet regimens are not the same (or the mean weight loss induced by at least one diet regimen is significantly different from the rest).
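With the raw weight-loss measurements one would simply call `scipy.stats.f_oneway(low_fat, low_cal, low_carb)`; from the summary sums of squares given above, the F statistic and p-value can be recovered directly (a sketch):

```python
from scipy.stats import f

SSB, SST, g, N = 122.1, 182.85, 3, 30
SSE = SST - SSB                               # 60.75
F_stat = (SSB / (g - 1)) / (SSE / (N - g))    # 61.05 / 2.25 = 27.13
pval = f.sf(F_stat, g - 1, N - g)             # right-tailed area under F(2, 27)
print(round(F_stat, 2), pval)                 # 27.13, practically 0
```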

Notations :

Let Yij be the value of the response variable for the jth observation in the ith group, i = 1, 2, ..., g; j = 1, 2, ..., ni. Then,

Sample mean for the ith group : Ȳi = (1/ni) Σ_{j=1}^{ni} Yij , i = 1, 2, ..., g.

Overall sample mean : Ȳ = (1/N) Σ_{i=1}^{g} Σ_{j=1}^{ni} Yij .

Total sample size in all the groups combined : N = Σ_{i=1}^{g} ni .

Sum of squares between (SSB) : Σ_{i=1}^{g} ni (Ȳi − Ȳ)² .

Sum of squares within (or error, SSE) : Σ_{i=1}^{g} Σ_{j=1}^{ni} (Yij − Ȳi)² .

Sum of squares total (SST) : Σ_{i=1}^{g} Σ_{j=1}^{ni} (Yij − Ȳ)² .

Chapter 9 : Simple Linear Regression

23 Introduction
One of the most fundamental aspects of statistical practice is to analyze and interpret the relationship between different variables in the population. What makes this interesting is that relationships or association patterns between variables are often as diverse as the variables themselves. Some variables may have a pretty simple relationship while others may have a more complicated pattern. Moreover, once the relationship between two variables has been determined, it is often of interest to predict the unknown value of one of those variables using the known value of the other. The branch of Statistics that deals with this problem is known as Regression Analysis.

Eg 1. Medical practitioners have long hypothesized that a person's muscle mass decreases with age. To explore this relationship in women, a nutritionist randomly selects 60 women between ages 40 and 79 and measures their muscle mass. Using the tools of regression analysis, you can help her figure out the true underlying relationship between age and muscle mass in the population of ALL women in that age range using the information from the above 60 women.

Eg 2. The crime rate of a region may depend on various factors like education (e), urbanization (u) and income (i) levels of that region. Using regression analysis, we can predict the crime rate of a particular region for given values of the above variables.

Eg 3. The price of houses/apartments in a particular city may depend on different factors like location (l), number of bedrooms (b), proximity to different attractions (metros, shopping malls) (a), etc. Using regression analysis, we can predict the price of an apartment yet to be built based on given values of the above factors.

The first step in any regression analysis exercise is to identify the concerned variables as response and covariates (or explanatory variables).

Response variable (or Y) : This is the outcome or the dependent variable. In examples 1, 2 and 3 above, muscle mass, crime rate and house price are the response variables.

Explanatory variable (or X) : This is the independent variable or the variable which explains or is related to the outcome. In the above examples, age; education, urbanization and income levels; and location, number of bedrooms and proximity to attractions are respectively the explanatory variables.

Note : Whether a variable would be deemed response or explanatory often depends on the type of study. For example, if we want to know the relationship between the muscle mass of a woman and her chance of having osteoarthritis, muscle mass becomes the explanatory variable while having osteoarthritis (or not) is the response.

24 Plotting the data : Scatterplots


Generally the sample data will come as pairs i.e (X1, Y1), (X2, Y2), ..., (Xn, Yn) where each pair corresponds to a unit and n is the sample size.
Once the data is collected, it is advisable to plot those using a

scatterplot to get a first hand visual impression of the association

pattern between them. Every sample unit is represented by a point

in the scatterplot. A scatterplot tells us

How (and whether) the response and predictor variables are

related to each other.

If related, whether the relationship pattern can be reasonably

approximated by a straight line.

Whether there are any unusual points which fall well apart from the general trend of the points (outliers or influential points).

Figure 5 shows the scatterplot of muscle mass against age of the 60 women.

Figure 5: Plot of muscle mass against age of 60 women

Observations :

1. The points show a strong decreasing trend i.e muscle mass and age seem to have a negative association. So, older women tend to have lower muscle mass on an average.

2. Age and muscle mass seem to have a linear relationship i.e the above trend can be approximated by a straight line reasonably well.

3. We do not see any point which falls well apart from the general trend of the points i.e there do not seem to be any outliers or influential observations.

25 Correlation
When X and Y have an approximately linear relationship, we can

actually go ahead and measure the strength of that relationship

with a quantity called the correlation coefficient.

Population and Sample Correlation

There are two different quantities that might be called the correlation (or correlation coefficient) viz

The population correlation, ρ, measures the strength of the association (between X and Y) in the population.

If we have a sample from a population, the sample correlation r measures the strength of the association in that sample. The formula for the sample correlation is

r = (1/(n − 1)) Σ_{i=1}^{n} ((Xi − X̄)/SX) ((Yi − Ȳ)/SY)

Here X̄ (Ȳ) are the sample means of the X (Y) values while SX (SY) are the corresponding sample standard deviations¹. Having said that, we will never actually calculate it by hand (any statistical software will do it for us). Naturally, we seldom know the value of ρ (since we never really have the population data), so we typically estimate it with the value of r.

Properties of the Correlation Coefficient

The correlation coefficient always takes values between −1 and +1.

If X and Y have a positive association (as one goes up, the other tends to go up), then their correlation is positive.

If X and Y have a negative association (as one goes up, the other tends to go down), then their correlation is negative.

If X and Y have no linear association (in the population), then their population correlation is zero. However, due to random variation, their sample correlation r will almost never be exactly zero but will be close to zero.

The correlation coefficient does not depend on the units of the variables nor on their identities (i.e response or explanatory) - this is a big advantage of correlation coefficients.


¹ X̄ = (1/n) Σ_{i=1}^{n} Xi ,  SX = √( (1/(n−1)) Σ_{i=1}^{n} (Xi − X̄)² )

The closer the correlation is to −1 or +1, the stronger the linear association is between the two variables. For sample data, this means that the closer r is to −1 or 1, the closer the points on a scatterplot adhere to a straight line pattern, as shown in Figure 6.

Figure 6: Visual interpretation of the correlation. The data sets on the left and right have r = 0.75 and r = 0.98, respectively.

For the age-muscle mass example, r = −0.866. So, we conclude that

Since r is negative, age and muscle mass have a negative linear relationship i.e as age increases, muscle mass tends to decrease.

Since r is quite close to −1, we conclude that age and muscle mass have a strong negative linear relationship.
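In practice r is computed by software (a sketch; `age` and `muscle_mass` below are hypothetical placeholders, since the nutritionist's 60 observations are not reproduced in the text):

```python
import numpy as np

# Hypothetical placeholder data standing in for the real 60 observations
age = np.array([41, 49, 53, 58, 63, 67, 71, 76], dtype=float)
muscle_mass = np.array([112, 104, 99, 95, 88, 82, 75, 70], dtype=float)

r = np.corrcoef(age, muscle_mass)[0, 1]   # sample correlation coefficient
print(round(r, 3))                        # strongly negative, near -1
```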

Caution : Linear Relationships only

The correlation is only useful for measuring relationships that are linear. Figure 7 shows two data sets both of which have r nearly zero. The scatterplot on the right clearly shows that X and Y have a non-linear (parabolic) relationship which cannot be quantified through the correlation coefficient. So, r = 0.06 does not have a meaning in this context and can even be misleading. On the other hand, the fact that r = 0.07 for the first graph makes sense because it reflects the scatter of the points. Thus, in a nutshell, it is always a good idea to look at the scatterplot of the data first to see if the correlation is even a useful quantity to talk about.

Figure 7: Two very different data sets with sample correlations near zero (r = 0.07 on the left, and r = 0.06 on the right).

Outliers
The presence of outliers in a set of sample data can greatly influence the
value of r. For example, in Figure 8, the addition of one extra observation
changes the sample correlation from r = 0.90 to r = 0.47. In fact, if the new
point is a genuine outlier, the new pattern is not linear anymore. Thus,
correlation coefficient may not be meaningful when a dataset contains one or
more outliers.

26 Linear Regression Analysis

Concepts and Setup

When two variables, X and Y are linearly associated, we can go a step

1
2
2
further by finding the equation of the straight line that best describes

this pattern. Once this line is obtained, it can be used to predict an

unknown value of the response variable

(Y) from a known value of the explanatory variable (X). The

procedure of doing this is known as Linear Regression Analysis.


Figure 8: Effect of an outlier on the sample correlation. The data sets on the left and right have r = 0.90 and r = 0.47, respectively, despite differing by only a single observation.

To properly understand concepts about regression, we first need

to understand how populations and samples relate to each other in

the context of regression. Let us begin by considering a quantitative

response variable Y and quantitative explanatory variable X. Each

individual in the population has a value of X and a value of Y . It is

easiest to think about the relationship between X and Y with an

example.

Suppose, in the population of Indian adults, X is height (in inches), and Y is weight (in kilos). Let us consider only those people who are 65 inches tall (X = 65). Obviously all of them will not weigh the same, but their weights will vary around some mean value, which we will denote as, say, μY(65). (We use μ to indicate the mean, Y to indicate what it is the mean of, and 65 to indicate that it only refers to individuals with X = 65) i.e μY(65) is the population mean weight of all Indian adults whose height is 65 inches. Similarly the weights of all Indians who are 70 inches tall (i.e X = 70) will vary around some mean, say μY(70), which we expect to be greater than μY(65) (since height and weight are assumed to have a positive association).

What we're going to assume about the population is that X and μY(X) are related by a straight line, as shown in the following figure. This is why the technique is called linear regression.

We write this relationship as

μY(X) = α + βX     (6)

where β is the slope and α is the y-intercept for the population. Since α and β are population parameters, they are usually assumed to be unknown.

Is the Relationship Linear?

There are many real-world situations in which the relationship between X and μY(X) will not be linear. For example, Figure 9 plots 106 measurements of strontium isotopes found in fossil shells against their age.

Figure 9: An example of a non linear relationship

Clearly, it would be absurd to even try to fit a linear regression line to the above data. So, it's important to think about whether the linearity assumption is at least somewhat sensible before deciding to conduct a study or analyze data using simple linear regression - this is where scatterplots come into play.

Remember: Variability !!

Why do we need to write μY(X) in (6) ? It's tempting to write something like

Y = α + βX,     NO!

but this is unrealistic, because every individual with the same value of X will not have the same value of Y. (Do all PGPX students who study the same amount of time end up with the same CGPA/starting salary ? Do all companies having the same manpower generate the same revenue/year ?)

Instead, there will generally be some amount of variability in the values of Y for the same value of X = x. (Later on, we will make a further assumption that these Y values vary according to a certain distribution.)

27 Simple Linear Regression


This is the simplest (but also one of the most commonly used) form of regression analysis where our ultimate goal is to find the best-fitting straight line through a set of data points having a linear pattern.

Statistical Model

Let X and Y respectively be the explanatory and response variables. We have n pairs of data points viz (X1, Y1), (X2, Y2), ..., (Xn, Yn) where each pair corresponds to a sample unit. X and Y are assumed to be related in the population as

Y = α + βX + ε = μY(X) + ε     (7)

where the ε's are (unknown) error terms which are assumed to be independently and identically distributed (i.i.d) as N(0, σ²) (Normal with mean 0 and variance σ²) i.e for a particular value of X = x, Y is assumed to have a normal distribution with mean α + βx and variance σ². However, since we do not have population data, α and β are unknown and will be estimated from the sample. This procedure, known as Least Squares Estimation, is done in such a way that the resulting straight line has the best possible fit to the given sample data.

Least Squares Estimation

In order to find the "best fitting" straight line through a given sample data (say, (X1, Y1), (X2, Y2), ..., (Xn, Yn)), we have to find the best possible estimates of α and β (say, α̂ and β̂). We will use the sample data to get these estimates. The resulting line (also known as the least squares regression line) is the best possible estimate of the population regression line (7) (for the given sample) and is given by

Ŷi = α̂ + β̂Xi , i = 1, 2, ..., n     (8)

where Yi and Ŷi are related by the equation

Yi = Ŷi + ei , i = 1, 2, ..., n     (9)

and are obtained by minimizing the sum of squares of the above


residuals i.e
n
S = ) ei2
i=1
n
= )(Yi Y i )2
i=1
n 2
= ) (Yi Xi )
i=1

with respect to and . On doing so, we have


n
)(Yi Y
)(Xi X
= ) (10)
i=1
n
2
)(Xi X
)
i=1

= Y (11)

X

which are known as the least squares estimates of and .

Given α̂ and β̂, the least squares regression line (or the estimated line) is given by

Ŷ(x) = α̂ + β̂x     (12)

where Ŷ(x) is the predicted value of Y at X = x, while α̂ and β̂ are the sample y-intercept and slope respectively.

The least squares regression equation (12) describes the X-Y relationship in the sample, which will usually be close, but not exactly equal, to the relationship in the whole population, as shown in Figure 10. Thus, it makes sense to use the various parts of the regression equation, which we calculate based on the sample data, to estimate the corresponding parts of the equation that describes the population relationship, which we don't actually know.



Figure 10: Regression equation (solid line) as an estimate of the unknown population relationship (dotted line).

Interpretation of Coefficients

α̂ and β̂ are the sample Y-intercept and slope while α and β are their population analogues - these are called the regression coefficients. Since we will mainly deal with the estimated (or least squares) regression model, let us explore the meaning of (α̂, β̂). The interpretation of (α, β) will be analogous in the population context.

Interpreting the slope

The slope tells us how a change in X affects Y. Specifically, the sign of the slope indicates the direction of change while its magnitude tells us the amount of change:

If β̂ is positive, Ŷ(X) increases with X, which indicates that X and Y have a positive association.

If β̂ is negative, Ŷ(X) decreases with X, which indicates that X and Y have a negative association.

β̂ represents the amount by which Ŷ(X) changes when X is increased by 1 unit.

Note : Unlike the correlation, the value of the slope will depend on the units in which
X and Y are measured.

Interpreting the y-intercept

The sample y-intercept α̂ is the predicted value of Y when X = 0 i.e α̂ = Ŷ(0). However, we have to be careful here. If X = 0 doesn't make sense, or if X = 0 is outside the range of our data (so that talking about what happens there would be extrapolation), then we should not interpret the y-intercept.

Muscle Mass revisited

For the muscle mass example, the sample summary statistics are

X̄ = 59.98, SX = 11.80, Ȳ = 84.97, SY = 16.21, r = −0.866

Now, in order to calculate β̂, we use the following alternative formulation

β̂ = r (SY/SX)

Plugging in the values above, we have β̂ = −0.866 × (16.21/11.80) = −1.19, which in turn gives us α̂ = Ȳ − β̂X̄ = 84.97 + 1.19 × 59.98 = 156.3.

Thus, the least squares regression equation is given by

Ŷ = 156.3 − 1.19X

Figure 11 depicts the scatterplot of the age-muscle mass data with the above least squares regression line superimposed.

Interpretation

Since the slope is −1.19, muscle mass and age have a negative association. More specifically, for every 1 year increase in the age of a woman, her predicted muscle mass will decrease by 1.19 units.

Since the y-intercept is 156.3, a woman with 0 yrs of age (i.e a new born girl child) is predicted to have a muscle mass of 156.3. (Obviously this is absurd and a gross extrapolation - so, in this case, we should not interpret the intercept).

Figure 11: Least squares regression line fitted to the age-muscle mass data

Making Predictions

One of the most important uses of the least squares regression equation is to predict unknown values of the response variable (Y) from given or known values of the explanatory variable (X). We can predict the value of Y for any particular value of X by simply plugging that value of X into the regression equation and seeing what we get for Ŷ. However, we need to keep in mind that this X value should come from a subject who is similar to the subjects sampled in the original data set. Otherwise, we may run the risk of extrapolation.
If we are predicting for an in-sample subject, the prediction will not be exactly equal to the actual Y value of that subject. This is because of the variability that is inherent in our model (think of the εi's in the population regression model). All in all, the prediction is just our single-number best guess for a Y value at a particular X value. Figure 12 illustrates the idea of prediction.

One of the women sampled for the muscle-mass data was Mrs. Tripathi who is 56 years old. So, her predicted muscle mass will be Ŷ = 156.3 − 1.19 × 56 ≈ 89.7.
Figure 12: Visualization of a prediction using the regression equation (solid line). Here, the predicted value for X = 3.8 is Ŷ = 2.3.

Residuals

Let Yi be the actual response value for an in-sample subject, say subject i, while Ŷi is the corresponding predicted value obtained by plugging the corresponding X value into the least squares regression line. Then, the residual for subject i is given by

Residual_i = Yi − Ŷi .     (13)

In this way, we can obtain the residuals of all the sampled subjects. Clearly, the closer the residuals are to 0, the better the prediction.

Suppose the actual muscle mass of Mrs. Tripathi is 97. Then her residual will be 97 − 89.7 = 7.3.
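Fit, prediction and residual can all be done in a few lines (a sketch; `age` and `mass` are again hypothetical placeholders for the unpublished raw data, and on the actual sample the printed coefficients should come out near α̂ = 156.3, β̂ = −1.19):

```python
import numpy as np

# Hypothetical placeholder data; the real 60 observations are not listed in the text
age = np.array([41, 45, 49, 53, 56, 58, 63, 67, 71, 76], dtype=float)
mass = np.array([110, 105, 103, 96, 97, 93, 88, 84, 77, 71], dtype=float)

beta_hat, alpha_hat = np.polyfit(age, mass, deg=1)   # least squares slope and intercept
print(alpha_hat, beta_hat)

y_hat_56 = alpha_hat + beta_hat * 56                 # prediction at age 56
residual = 97 - y_hat_56                             # residual if the observed mass is 97
print(y_hat_56, residual)
```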

Predictive Ability

As mentioned above, one of the most important uses of the regression equation is to predict unknown values of the response variable (Y) from given or known values of the explanatory variable (X). Figure 13 illustrates regression equations that are good and bad at making predictions. It is obvious that the closer the observed values are to the corresponding predicted values (on the line), the better the predictive ability of the least squares regression line.

Figure 13: Visual interpretation of predictive ability. The regression on the left has better predictive ability than the regression on the right.

Coefficient of Determination

We can quantify predictive ability using the coefficient of determination, R², which is just the square of the correlation coefficient r i.e R² = r². Clearly R² would range from 0 to 1.

Basically R² tells us how much better we are doing by regressing Y on X rather than just using Ȳ to predict Y.

The better the predictive ability of our fitted regression model, the closer R² is to 1.

The poorer the predictive ability of our fitted regression model, the closer R² is to 0.

We have already seen that the correlation coefficient of age and muscle mass is −0.866. Hence the coefficient of determination for the least squares regression will be

R² = (−0.866)² = 0.75

So, we conclude that the least squares regression line has a pretty good predictive ability since R² is quite high. Moreover, the above regression line results in 75% less error in predicting muscle mass compared to Ȳ.

R² can also be interpreted as the amount of variability in the response Y explained by the linear regression of Y on X. So for the above example, we can also say that 75% of the variability in muscle mass can be explained by age. So, whichever way we see it, the least squares prediction equation does a pretty decent job in predicting muscle mass from age.

Extrapolation

As mentioned above, we should avoid making predictions for a new subject (i.e out-of-sample) whose X value is outside the range of the sample values i.e significantly smaller than the smallest X value or significantly larger than the largest X value in the sample. Making a prediction at an X value which is significantly outside the range of the X values in the data is called Extrapolation, and it leads to predictions that are unreliable or even ridiculous.

If we have daily temperature data of Ahmedabad from 1984-2015, making a prediction for 2016 would technically be extrapolation, but it might be okay for most practical situations. However, we probably would not want to use that data to make a prediction for 2300.

Outliers and their effect

Just as outliers can greatly influence the correlation, they can also greatly influence the regression equation. Figure 14. illustrates the dramatic effect that a single outlier can have on the regression equation.

When the data contains one or more outliers, the regression equation can fit the data very poorly, and any results we obtain might be unreliable. So it's always a good idea to look at a scatterplot to make sure there are no outliers when doing simple linear regression.

[Figure 14: Effect of an outlier on the regression equation. The data sets yield regression equations that differ substantially despite differing by only a single observation.]

Chapter 10 : Parameter Estimation

28 Inferences on the parameters

Basic idea

One of the most basic questions we should address in any regression analysis problem is whether Y and X are linearly associated. Specifically, in a linear regression problem, this question translates into checking whether or not β1 = 0.

For example, if β1 = 0, we have

μY (X) = β0.

Thus, Y and X have no linear association between them.

Example 1. In the age-muscle mass example, β1 = 0 would imply that muscle mass and age are not linearly related.

There are two distinct ways of making inferences about parameters in general, viz. hypotheses tests and confidence intervals. These will be discussed next.

Hypotheses Tests for β1

Here we will learn to perform a Regression t test for β1. For that, we need to go through the following steps :

Hypotheses

For testing whether Y and X are linearly associated, we have the following two hypotheses with respect to β1 :

Null hypothesis H0 : β1 = 0, i.e Y and X have no linear association.

Alternative hypothesis Ha : β1 ≠ 0, i.e Y and X are linearly related.

The above alternative hypothesis is a two-sided one. Depending on the situation, we can also use the one-sided alternative hypotheses Ha : β1 > 0 (i.e positive linear association) or Ha : β1 < 0 (i.e negative linear association).

Example 2. In the age-muscle mass example, we might want to test Ha : β1 < 0 since age and muscle mass generally have a negative association.
Test statistic

The test statistic for testing the above hypotheses is given by

t = (β̂1 − 0) / se(β̂1) ~ tn−2 under H0  (14)

where se(β̂1) is the estimated standard error of β̂1. Any statistical software will give us the value of the above test statistic.

P-values

Remember the definition of the p-value:

Definition 1. The p-value is the probability of getting a test statistic value at least
as extreme as the one observed, assuming H0 is true.

For calculating the p-value, we need to keep the alternative hypothesis in mind. Suppose we observe a test statistic value of, say, t = 2.2 for a t distribution with 30 degrees of freedom. If the alternative hypothesis is of a > (<) type, then the p-value will be the area above (below) 2.2 under a t curve with 30 degrees of freedom. However, if the alternative is two-sided (i.e β1 ≠ 0), the p-value will be the combined area above 2.2 and below −2.2 for a t distribution with 30 degrees of freedom. This is represented graphically in Figure 15. Since the t distribution is symmetric, we typically find the probability for just one of these tails, usually the one on the right, and then double it to get the p-value.
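If software is at hand, these tail areas can be computed directly. A small sketch, assuming SciPy is installed, for the observed t = 2.2 with 30 degrees of freedom:

from scipy import stats

t_obs, df = 2.2, 30

p_right = stats.t.sf(t_obs, df)     # right-tail area above 2.2 (about 0.018)
p_left = stats.t.cdf(-t_obs, df)    # left-tail area below -2.2 (same, by symmetry)
p_two_sided = 2 * p_right           # two-sided p-value (about 0.036)

print(p_right, p_two_sided)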
Here's a quick review of some of the properties of the t distribution:

It is symmetric and centered at zero, with both a positive and a


negative tail.

Its exact shape is determined by its degrees of freedom.

Although it looks like a standard normal distribution, a t distribution has thicker tails than the normal. However, as the degrees of freedom get larger, the t gets closer to a standard normal.

If we don't have access to statistical software to calculate p-values for us, we often have to use a t table like the one in the back of our textbook to try to figure out the p-value.

[Figure 15: Two-tailed probability of a t distribution, df = 30.]

A typical t table, like the one shown in Figure 16, has rows corresponding to different df values. Within the appropriate row, the table shows the test statistic values that correspond to certain right tail probabilities. We can use this information to figure out an approximate right tail probability for any test statistic value we want, and we then double this right tail probability to get the p-value (for a two-sided test).

Right-Tail Probability
df    0.10    0.05    0.025    0.010    0.005    0.001
1     3.078   6.314   12.706   31.821   63.657   318.309
2     1.886   2.920   4.303    6.965    9.925    22.327
3     1.638   2.353   3.182    4.541    5.841    10.215
4     1.533   2.132   2.776    3.747    4.604    7.173
5     1.476   2.015   2.571    3.365    4.032    5.893

Figure 16: First few rows of a t table.

We can see that the p-value behaves as it should:

If H0 is true, our test statistic will probably be somewhat close to zero, and our p-value will probably be large, which makes sense.

If instead Ha is true, then our test statistic will probably be farther from zero, and so our p-value will probably be smaller, which also makes sense.

The smaller the p-value, the stronger the evidence against H0, i.e we are more likely to reject H0 for small p-values.
Bear in mind that if software says that the p-value for a two-sided regression t test is 0.074, that means it calculated a right-tail probability of 0.037 and already doubled it before reporting the answer as 0.074.

Decision

Finally, we need to make a decision about whether H0 is reasonable, or whether we have enough evidence against H0 to believe Ha instead. We do this by comparing the p-value to our chosen significance level α (often 0.05) and making a decision the same way we always do:

If the p-value is less than or equal to α, we reject H0 and believe Ha instead.

If the p-value is greater than α, we fail to reject H0 and conclude that H0 is reasonable based on the data.

In the context of our actual hypotheses (with a two-sided alternative), this means the following:

If we reject H0, then we're concluding that Y is linearly dependent on X.

If we fail to reject H0, then we're concluding that it's reasonable that Y does not linearly depend on X, which would mean that there would be no point in trying to use X to predict Y.

Note : If Ha : β1 > 0 and we fail to reject H0, it MAY NOT imply that Y and X are independent (maybe β1 < 0, i.e Y and X have a negative association). We can only conclude that, given the data, there is no strong evidence that Y and X have a positive association.
Age-muscle mass revisited

Example 3. Let's go through all the steps of a two-sided regression t-test for the age-muscle mass example.

Assumptions

Since the nutritionist selected the 60 women randomly, the

random sampling assumption is satisfied.

Age-muscle mass observations for different women can be assumed to be independent of each other.

Mean muscle mass and age are linearly associated. (vide Fig 1).

It can be assumed that the muscle mass values have a normal

distribution in the population.

Vertical spread of the muscle mass values can be assumed to be

approximately similar for different age values (vide Fig 1) i.e

constant variance assumption seems to be satisfied.

Hypotheses

H0 : β1 = 0, i.e muscle mass and age have no linear association.

Ha : β1 ≠ 0, i.e muscle mass and age are linearly related.

Test Statistic

The SPSS output for the muscle mass data is given below

Predictor   Estimate   St. Error   t-value   P(>|t|)
Intercept   156.35     5.51        28.36     <2e-16 ***
Age         -1.19      0.0902      -13.19    <2e-16 ***

Table 1: Parameter estimates for muscle mass data.

Thus, the test statistic would be

t = −1.19 / 0.0902 ≈ −13.19.

Since the sample size is 60, the degrees of freedom (df) of the above statistic will be 60 − 2 = 58.

P-value

Since the alternative hypothesis is two-sided, the p-value will be the combined area above 13.19 and below −13.19 under a t curve with 58 df. The relevant t scores for this degrees of freedom are as follows :
Right-Tail Probability
df    0.100   0.050   0.025   0.010   0.005   0.001
58    1.296   1.672   2.002   2.392   2.663   3.237

Table 2: Right tail probabilities for t58.

So, the p-value would be the combined two-tail area beyond ±13.19; since 13.19 is far beyond the largest tabled t score (3.237, for a right-tail probability of 0.001), the p-value is smaller than 2 × 0.001 = 0.002, i.e approximately 0.

Decision

Since our p-value is extremely low (approximately 0), we reject H0 and conclude that there is strong evidence of linear association between age and muscle mass.
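A short sketch of the same calculation in Python (assuming SciPy), using the estimate and standard error from Table 1:

from scipy import stats

b1, se_b1, n = -1.19, 0.0902, 60

t = (b1 - 0) / se_b1                  # test statistic: about -13.19
df = n - 2                            # 58 degrees of freedom
p_value = 2 * stats.t.sf(abs(t), df)  # two-sided p-value: essentially 0

print(t, df, p_value)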

Figure 17. shows which conclusions and interpretations go together for a (two-sided) regression t test.

t value far from 0              | t value near 0
Small p-value                   | Large p-value
Evidence against H0 (for Ha)    | No evidence against H0 (for Ha)
Reject H0                       | Fail to reject H0
Evidence that Y depends on X    | No evidence that Y depends on X

Figure 17: Results and interpretations of a (two-sided) regression t test.

Note : If we had tested Ha : β1 < 0, the one-sided p-value would also have been approximately 0 (since the t statistic is negative); we still would have rejected H0 and concluded that there is strong evidence to believe that age and muscle mass are negatively associated.

Confidence Interval for β1
Hypotheses tests simply test whether it's reasonable that β1 = 0. Instead, it might be interesting to figure out the set of all reasonable values of β1. We do this by constructing a confidence interval for β1.

The assumptions required to construct a confidence interval for β1 are exactly the same as those used for the regression t test, which we have already discussed.

Formula

A 100(1 − α)% confidence interval for β1 is given by

( β̂1 − tα/2,n−2 se(β̂1) , β̂1 + tα/2,n−2 se(β̂1) ).  (15)

where the standard error of β̂1 is the same as the one used in the t statistic. The t-score is a number from the t table that depends on two things:

It depends on the desired confidence level. Higher confidence levels require wider intervals, which mean larger t-scores. 95% is the most commonly chosen confidence level.

It depends on the degrees of freedom. For simple linear regression, the degrees of freedom is n − 2 since there are 2 parameters (β0 and β1) to estimate. For a given confidence level, the t-score decreases as the degrees of freedom increases, i.e the confidence interval gets narrower (hence, more precise) as the sample size increases.

Some t tables, including the one we will use, provide a second set of

column headings called Confidence Level, as shown in Figure 18.

These are designed to streamline the process of finding the

appropriate t-score for a confidence interval. Simply find the row for

the appropriate df and the column for the appropriate confidence

level, and the number in the body of the table is the t-score that

should be used in constructing the confidence interval.

Confidence Level
      80%     90%     95%      98%      99%      99.8%
Right-Tail Probability
df    0.10    0.05    0.025    0.010    0.005    0.001
1     3.078   6.314   12.706   31.821   63.657   318.309
2     1.886   2.920   4.303    6.965    9.925    22.327
3     1.638   2.353   3.182    4.541    5.841    10.215
4     1.533   2.132   2.776    3.747    4.604    7.173
5     1.476   2.015   2.571    3.365    4.032    5.893

Figure 18: First few rows of a t table, with headings for confidence level.

Interpretation

The standard interpretation of a confidence interval (let's say 95%) for β1 is that we are 95% confident that the true value of β1 is between (lower number) and (higher number). Loosely speaking, it sometimes helps to just think of a confidence interval for β1 as the set of possible values of β1 that are reasonable based on the data. We can fine-tune what we mean by reasonable by adjusting the confidence level.

Example 4. Age-muscle mass revisited.

From the hypotheses test we concluded that there is strong evidence of a linear association between age and muscle mass, i.e β1 ≠ 0. Let us now figure out the reasonable values of β1 by calculating a 95% confidence interval of the same.

From Table 2. we have t0.025,58 ≈ 2.00 and se(β̂1) = 0.0902. So, a 95% confidence interval of β1 would be

−1.19 ± 2.00 × 0.0902 = (−1.37, −1.01)

i.e as age increases by 1 year, the average muscle mass will decrease by at most 1.37 units and at least 1.01 units. Thus, both the two-sided significance test and the confidence interval give us the same conclusion regarding the slope.
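The interval above can be reproduced with a few lines of Python (a sketch; the critical value 2.002 is the tabled t0.025,58):

b1, se_b1, t_crit = -1.19, 0.0902, 2.002

margin = t_crit * se_b1       # about 0.18
lower = b1 - margin           # about -1.37
upper = b1 + margin           # about -1.01

print(lower, upper)           # the interval excludes 0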

Note : We rarely perform inferential procedures for β0 because often its interpretation is not realistic. However, hypotheses tests and confidence interval procedures for β0 work exactly the same way as those for β1.

Chapter 11 : Regression Diagnostics

29 Introduction
In fitting a linear regression model to a given data set, we have to make the following assumptions.

1. The sample has been selected through simple random sampling.

2. Observations corresponding to the sample units are independent of each other.

3. The mean response μY (X) has a linear association with X in the population.

4. Y values corresponding to any particular X value have a normal distribution in the population.

5. Y values corresponding to any particular X value have the same spread (or standard deviation).

Now, some (or all) of these assumptions may not hold for a

particular data set. In that case, it will be fallacious to use a linear

regression model to draw inferences about that data. Regression

Diagnostics refers to the procedure of checking whether a linear

regression model is appropriate for a particular dataset (in the

sense that the model satisfies the assumptions on which it is

based). This is achieved through a procedure called Residual

Analysis.

Residual Analysis
It turns out that if a regression model fails to satisfy some assumptions for a given data set, it gets reflected very clearly in the residuals of the fitted model. So, an examination of the residuals is a very effective way of checking the appropriateness of a regression model for a particular data set. This is implemented by plotting the residuals or standardized residuals (an improved version of the residuals) against the covariate(s) (X) or the fitted/predicted values (Ŷ).

For example, one of our assumptions for the population regression model is that εi ~ N(0, σ2) (where the εi's are the errors). If a regression equation fits the data well, the residuals ei's should also tend to be independently distributed about 0 with constant variance σ2. Thus, an examination of the residuals should give us a pretty good idea whether the above assumption has been satisfied by the regression model. This is the basis for residual analysis.

Types of Departures

The first two assumptions (randomization and independence of the

observations) can be hard to check once the data has been collected.

So, it is important to design studies/surveys carefully to ensure that

these two assumptions are valid. However, residual analysis can be

used to check the last three assumptions (namely linearity, normality

and homoscedasticity) as detailed below :

Non-linearity of data

When Y and X have a linear pattern, i.e a linear regression model is appropriate for the data, the residuals, when plotted against X, tend to be randomly scattered above and below 0, having no particular pattern. Figure 19. shows the residual plot for the age-muscle mass data.
[Figure 19: Residual plot for the age-muscle mass data (standardized residuals against age).]

Clearly, the residuals roughly follow a random pattern above and below
the 0-line.

Thus the linear regression model seems to be appropriate for this
data set.
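A residual plot like Figure 19 can be produced along the following lines. This is a sketch assuming NumPy and Matplotlib are available; the age and muscle-mass arrays below are hypothetical stand-ins, not the actual data:

import numpy as np
import matplotlib.pyplot as plt

age = np.array([43.0, 71.0, 57.0, 49.0, 66.0, 54.0])            # hypothetical
muscle_mass = np.array([106.0, 77.0, 96.0, 105.0, 84.0, 98.0])  # hypothetical

# Least squares fit (np.polyfit returns slope first, then intercept)
b1, b0 = np.polyfit(age, muscle_mass, 1)
residuals = muscle_mass - (b0 + b1 * age)

# Crude standardization (true standardized residuals also adjust for leverage)
std_resid = residuals / residuals.std(ddof=2)

plt.scatter(age, std_resid)
plt.axhline(0)                      # the 0-line
plt.xlabel("Age")
plt.ylabel("Standardized residuals")
plt.show()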

Figure 20. shows the non-linear fossil data (with the fitted least

squares regression line) and the corresponding residual plot against X

(age).
[Figure 20: Nonlinear data and accompanying residual plot. (a) Fossil data: strontium.ratio against age, with the fitted least squares line; (b) residual plot against age.]

It is clear that the residuals follow a systematic (non-random) pattern about 0, indicating that a straight line fit is not at all appropriate for this data.

Thus, appropriateness of a linear regression model (for a dataset) will be indicated


by a random scattering of residuals about 0 while inappropriateness of a linear fit
will be indicated by a systematic (non-random) pattern of residuals about 0.

Non-constant error variance

A residual plot also indicates whether the assumption of constant error variance (V(εi) = σ2) has been satisfied. We have the following rule of thumb :

If the error variance is constant, the residuals will be randomly

scattered about 0.

If the error variance increases with X, the residuals will have increasing spread as X increases.

If the error variance decreases with X, the residuals will have decreasing spread as X increases.

In Figure 19. we do not see any particular increasing or decreasing pattern of the residuals with age (X). This implies that the error variance may be independent of X and hence constant. So, we conclude that the linear regression model (fitted to the age-muscle mass data) seems to satisfy the constant variance assumption.

Non-normality of errors

One of the most basic assumptions we made for our regression model is the assumption of normality of the error terms. A lot of important results in linear regression analysis follow from this assumption. Although minor departures from normality are not an issue (and are often expected in most cases), major departures do create problems in the fitting and reliability of the estimates. Thus, it is of utmost importance to verify whether the fitted linear regression model satisfies this assumption. There are a couple of ways of checking this assumption, as given below :

Box-plots or histograms of residuals (or standardized residuals) convey important information about the shape of the error distribution and the presence of outliers. Figure 21. shows the histogram and box plot of standardized residuals for the age-muscle mass data. The plots seem to indicate a slight right skewness, but it doesn't seem to be serious. Moreover, two-sided tests and confidence intervals are robust to violations of the normality assumption. So, the conclusions we have drawn before are still valid.

[Figure 21: Histogram (a) and boxplot (b) of standardized residuals for the age-muscle mass data.]

A popular tool for assessing normality of the error distribution is to construct normal probability plots of the residuals. Here, each residual is plotted against its expected value under normality. A linear plot suggests normality, whereas a plot that deviates substantially from linearity suggests that the normality assumption may not be valid. Figure 22. shows the normal probability plot for the residuals of the age-muscle mass data.

[Figure 22: Normal probability plot for the age-muscle mass data (standardized residuals against expected values under normality).]

Since the pattern is pretty linear, the error distribution can be

assumed to be normal. However, if the normality assumption is

not satisfied, it is immediately reflected in the normal probability

plots i.e. those tend to be nonlinear.
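A normal probability plot can be drawn with SciPy's probplot; a sketch, using a hypothetical array of standardized residuals:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

std_resid = np.array([-1.2, 0.3, 0.8, -0.5, 1.6, -0.1, 0.4])  # hypothetical

# Plots ordered residuals against their expected values under normality;
# a roughly straight line supports the normality assumption.
stats.probplot(std_resid, dist="norm", plot=plt)
plt.show()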

Note 4. Residuals (or standardized residuals) can also be used to check for other qualities of the dataset, like correlated errors and the presence of outliers (or influential points). If the data are collected in such a way that there is some inherent dependence between successive observations (longitudinal study, spatial data etc), the same will be reflected in the residuals. In that case, the residuals will show a definite trend when plotted against the time variable. On the other hand, if the error terms are independent, the residuals will fluctuate more or less randomly around 0.
There are various ways of checking for outliers. One rule of thumb is that if the absolute value of the standardized residual for an observation is more than 3, that observation can be assumed to be an outlier.

Chapter 12 : Multiple Regression

30 Introduction
So far, we have used a single explanatory variable (X) in the linear regression model to predict the unknown value of the response (Y). However, in many real-life applications, the response (or outcome) of a process can depend on more than one explanatory variable. In those situations, we should ideally take ALL the explanatory variables into account to estimate (or predict) the unknown value of the response. Failing to do so would evidently result in a loss of information about the true variability of the response, and hence the resulting regression model will not be accurate enough for practical purposes.

Example 5.

The crime rate (number of crimes per 1000 residents)(say Y ) of a particular region
can depend on a lot of factors like the percentage of residents who are well educated
(say X1), the level of urbanization (X2), the average income of the residents (X3) etc.
Thus, in order to accurately predict the true crime rate of a region, we should take all
these factors into account because each of these give us some information about the
crime rate in that region.

General Form

Suppose we have p explanatory variables, X1, X2, ..., Xp, corresponding to the response variable Y. Then the (population) multiple regression model is given by

Yi = β0 + β1Xi1 + β2Xi2 + ... + βpXip + εi,  i = 1, 2, ..., n  (16)

where the εi's have a normal distribution with mean 0 and constant variance σ2, while n is the number of subjects/units. This is just an extension of the simple linear regression set up to p predictors.

The (least squares) predicted regression model is given by

Ŷ = β̂0 + β̂1X1 + β̂2X2 + ... + β̂pXp

where β̂k is the least squares estimate of βk (k = 0, 1, ..., p), obtained by minimizing the sum of squares of residuals - this is similar to what was done for simple linear regression. Clearly, β̂0 is the estimated y-intercept while (β̂1, ..., β̂p) are the estimated slopes corresponding to the predictors (X1, X2, ..., Xp).
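In practice the estimates β̂0, β̂1, ..., β̂p come from software. A sketch using statsmodels (the data frame below is a hypothetical toy, not the Florida crime data):

import pandas as pd
import statsmodels.formula.api as smf

florida = pd.DataFrame({                        # hypothetical toy data
    "crime":        [104.0, 38.0, 52.0, 70.0, 96.0, 24.0],
    "education":    [58.0, 79.0, 70.0, 65.0, 62.0, 82.0],
    "urbanization": [93.0, 25.0, 46.0, 62.0, 81.0, 18.0],
    "income":       [22.0, 31.0, 27.0, 25.0, 23.0, 33.0],
})

# Least squares fit of Y on X1, X2, X3
model = smf.ols("crime ~ education + urbanization + income", data=florida).fit()

print(model.params)     # estimated intercept and slopes
print(model.summary())  # standard errors, t tests, R^2, etc.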

Interpretation of Coefficients

The meaning and interpretation of the parameters of the multiple

regression model have the same spirit as those for the simple linear

regression one. However, in order to interpret the effect of a predictor,

we have to control for the others. This is because, in a multiple (linear)

regression set up, the association pattern between the response and a

predictor is NOT affected by any other predictor. This is so because the

predictors and the response are related in an additive manner.

Example 6.

We have data on crime rate (Y), percent with high school education (X1), percentage of residents living in an urban environment (X2) and median income (X3) (in thousands of Dollars) for all the counties of Florida, USA. Software generates the following estimated multiple regression model

Ŷ = 59.715 − 0.467X1 + 0.697X2 − 0.383X3

We can interpret the parameters as follows :

Effect of Education : Since the slope of education is −0.467, the crime rate of a county is negatively related to education, controlling for urbanization and income. Specifically, the predicted crime rate of a county decreases by 0.467 for a 1 percent increase in the education rate, i.e a county will be safer if more of its residents are educated.

Effect of Urbanization : Controlling for education and income, the crime rate of a county is positively related to urbanization (since the slope is 0.697). In fact, the predicted crime rate of a county increases by 0.697 for a 1 percent increase in the urbanization rate, i.e the more urbanized the county, the less safe it is.

Effect of Income : Since the slope of income is −0.383, controlling for education and urbanization, the median income of a county is negatively related to its crime rate - for a 1 thousand Dollar increase in the median income of the residents, the crime rate decreases by 0.383, i.e the wealthier the residents of a county, the safer it is.

Moreover, for the above regression model, the effect/slope of any

predictor will remain the same for any value of the other predictors.

For example, the slope of education would remain -0.467, no matter

what values we assume for urbanization and income.

Note 5. A basic difference between multiple and simple regression models is that for the former, in order to interpret the effect of a predictor, we fix (or control for) the other predictors, but for the latter, we ignore any other possible predictors in order to interpret the effect of a particular predictor.

In the above example, suppose we regress Y only on X2 (Urbanization). The regression equation is given by

Ŷ = 24.54 + 0.562X2

Here education and income have been altogether ignored, NOT controlled. The slope of X2 has also changed (decreased from 0.697 to 0.562). Thus, ignoring and controlling for a variable have different impacts on the regression model.

Inferential Procedures

Now we will discuss some inferential procedures that can be

performed on multiple regression models.

Coefficient of Multiple Determination

This coefficient measures the proportion of variation in Y that is

simultaneously explained by the set of predictors (X1, ..., Xp). As in the

simple linear regression set up, R2 ranges from 0 to 1 with higher

values of R2 indicating a better fitting model and vice versa.

R2 can only increase when additional predictor variables are added to the model. However, increasing the predictors will also increase the number of parameters and hence the computational cost. In order to achieve a trade-off between these two factors, an adjusted coefficient of multiple determination has been proposed, given by

Ra2 = 1 − ((n − 1)/(n − p − 1))(1 − R2)

In fact, Ra2 can even decrease with the addition of a predictor variable in the regression model if the new predictor does not result in a significant improvement of the model fit.
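A one-line check of this formula (a sketch, using the crime-data values quoted below: R2 = 0.473, n = 67 counties, p = 3 predictors):

def adjusted_r2(r2, n, p):
    """Adjusted coefficient of multiple determination."""
    return 1 - ((n - 1) / (n - p - 1)) * (1 - r2)

print(adjusted_r2(0.473, 67, 3))   # about 0.448, matching the table below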

Example 7.

For the crime data, R2 can be shown to be .473, i.e Education, Urbanization and Income taken together explain about 47.3% of the total variation in crime rate. Now, let us examine the amount by which R2 increases as we include more and more predictors. We start by only including Urbanization (since it has the highest correlation with crime rate) and add on the other predictors. The following table shows the R2 values for each case.

Predictors U (U, I) (U, E) (U, I, E)

R2 0.459 0.467 0.471 0.473

It is clear that once we have included urbanization, income and education do not add a significant amount of information (on the variability of crime rate). This may be because of multicollinearity - the correlation coefficient between (U, I) is 0.731 and between (U, E) is 0.791, i.e there is a large overlap of information (on crime rate) between (U, I) and (U, E). However, if we use adjusted R2, we have the following modified table

Predictors U (U, I) (U, E) (U, I, E)

Ra2 .4507 .4503 .4545 .4479

Clearly, Ra2 identifies the model with urbanization and education as a better model than the one with all the predictors.

T-tests of Regression Coefficients

As in the simple linear regression set up, we can test for the significance of the slope parameters of the multiple regression model (16). The necessary assumptions are the same as those for a simple linear regression model, i.e

Data is obtained through random sampling.

The observations (Yi, Xi1, ..., Xip) are independent of each other.

Linearity of the population regression model.

Normality of the errors (vis-a-vis the response) for any given value of the predictors.

Errors (vis-a-vis the response) have the same standard deviation for any given value of the predictors.

For testing the significance of a particular predictor, say Xk, the null and alternative hypotheses are given by

H0 : βk = 0 vs Ha : βk ≠ 0 (or > 0 or < 0)

The above hypotheses test for the dependence between Xk and Y controlling for the other predictors. The test statistic and corresponding p-values will be given by any statistical software. However, the degrees of freedom of the test statistic is n − p − 1 since there are p + 1 parameters in the multiple regression model.

As before, the decision rule will be

p-value ≤ α : reject H0.
p-value > α : fail to reject H0.

where the p-values are obtained in the usual manner.

Example 8.

For the Florida crime dataset, the above assumptions seem to be tentatively satisfied. Let us test whether income (X3) has any effect in predicting crime rate controlling for urbanization (X2) and education (X1). Thus our hypotheses will be

H0 : β3 = 0 vs Ha : β3 ≠ 0.

The estimates and standard errors for the various predictors are given in the following table. Thus, the test statistic will be

t = −0.383 / 0.941 ≈ −0.41.

Since there are 67 counties, the degrees of freedom of the above statistic will be 67 − 3 − 1 = 63.
Predictor      Estimate   Standard Error
Intercept      59.715     28.59
Education      -0.467     0.554
Urbanization   0.697      0.129
Income         -0.383     0.941

P-value : For the above value of the test statistic and degrees of freedom, the p-value can be shown to be much higher than 0.2.

Conclusion : At a significance level of 0.05, we will fail to reject H0 since our p-value is greater than 0.05. Thus, we conclude that there is little evidence of any association between crime rate and income controlling for urbanization and education, i.e income information of residents does not add significantly to our knowledge of crime rate if we already have information on urbanization and education rates of a county.

Confidence Intervals of Regression Coefficients

A 100(1 − α)% confidence interval of βk is given by

β̂k ± tα/2,n−p−1 se(β̂k)

Example 9.

From the t-table, we have t0.025,63 ≈ 2.0. Thus the 95% confidence interval of β3 will be

−0.383 ± 2.0 × 0.941 = (−2.265, 1.499)

Since the confidence interval contains 0, we are 95% confident that β3 is not significantly different from 0, i.e controlling for urbanization and education, income doesn't seem to influence the crime rate of a county. Thus, both the two-sided significance test and the confidence interval give us the same conclusion about the slope.
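Both the t test and the confidence interval for β3 can be reproduced as follows (a sketch assuming SciPy, with the estimate and standard error from the table above):

from scipy import stats

b3, se_b3, n, p = -0.383, 0.941, 67, 3
df = n - p - 1                           # 63 degrees of freedom

t = b3 / se_b3                           # about -0.41
p_value = 2 * stats.t.sf(abs(t), df)     # about 0.69, well above 0.05

t_crit = stats.t.ppf(0.975, df)          # about 2.0
ci = (b3 - t_crit * se_b3, b3 + t_crit * se_b3)   # about (-2.26, 1.50)

print(t, p_value, ci)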

Note 6. Two sided tests and confidence intervals are robust to the violation of the
normality assumption for large sample sizes (n > 30).

31 Regression Diagnostics
As for simple linear regression, we can use residual analysis to verify

the assumptions of the multiple regression model, specifically those

relating to normality, linearity and constant variance. Thus, this is a

nice tool to test for the overall appropriateness of the model. For

example,

Plot of residuals (or standardized residuals) against fitted values

can help us to test for linearity of the regression model and the

constancy of error variances.

Plot of the residuals against each of the predictors can be used to

check whether the response is linearly related to those predictors

controlling for the others.

Boxplots/histograms and normal probability plots of the

residuals can be used to check for the validity of the normal

distributional assumption.

Example 10. For the crime data set, the residual analysis is given below.

1. The plot of the standardized residuals against the fitted values (Ŷ) is shown below.


[Plot: standardized residuals against fitted values.]

Since there is no definite pattern, we can conclude that the linearity and constant variance assumptions are valid.

2. The residual plot corresponding to education is shown below.

[Plot: standardized residuals against education.]

The residuals fluctuate more or less randomly about 0 with no noticeable trend or variation. Thus we conclude that crime rate can be assumed to be linearly related to education, controlling for urbanization and income.

3. The residual plot corresponding to urbanization is shown below.

[Plot: standardized residuals against urbanization.]

Here also the residuals fluctuate more or less randomly about 0 with no noticeable trend or variation. Thus crime rate can be assumed to be linearly related to urbanization, controlling for education and income.

4. The following figure shows the residual plot against income.

[Plot: standardized residuals against income.]

Here there seems to be a slight decrease in the spread of the residuals with increasing income. This suggests that crime rate may have a smaller variation when the incomes of residents in a county are higher, for given values of urbanization and education. However, this pattern is not very strong (and is only visible for very high incomes) - so we can tentatively assume a linear relation between crime rate and income controlling for education and urbanization.

5. The histogram and normal probability plots of the standardized residuals are shown below.

[Plots: (a) histogram of standardized residuals; (b) normal probability plot.]
The above plots indicate that the normal distributional assumptions (for the error/response terms) may have been violated. However, two-sided tests and confidence intervals of the slope parameters are robust against violation of this assumption. Thus, the conclusions we have drawn earlier regarding the effects of the explanatory variables on the crime rate still hold.

Note 7. Residuals can also be used to test for the presence of effects which were not included in the original regression model. For example, if the residual plot against X1² and/or X1X2 depicts a systematic non-random pattern, it is indicative of the presence of a quadratic or an interaction effect.

The standardized residual plots corresponding to the product/interaction terms (income × urbanization) and (income × education) are shown below.


[Plots: standardized residuals against the interaction terms; (c) income × urbanization; (d) income × education.]

Both the patterns are pretty random (the same is true for the residual plot against urbanization × education). Thus, we conclude that the predictors do not interact in their effect on crime rate, i.e no significant interaction effect is present in the crime dataset.

32 Multicollinearity
As we have studied so far, multiple regression analysis deals with the study and analysis of association and influence patterns between a given set of predictors and a response of interest. However, a phenomenon that is frequently encountered is that some (or all) of the predictors may have strong associations/correlations with each other, which often has implications for our inferential procedures. This phenomenon is known as Multicollinearity, and we will look at it in some detail in this section. As a motivating example, we can look at the Florida crime dataset. As we have seen, the level of education, urbanization and income of a particular county influences the crime rate of that county. However, it is also intuitive that the predictors themselves may be correlated with each other - in fact they are, as the following correlation matrix shows

              crime   education   urbanization   income
crime         1.000   0.467       0.677          0.434
education     0.467   1.000       0.791          0.793
urbanization  0.677   0.791       1.000          0.731
income        0.434   0.793       0.731          1.000

Now we will analyze multicollinearity and its effect on inferences for

the crime data set.

Non-correlated Predictors

In order to view multicollinearity in the proper context and appreciate its significance, let us start by analyzing a situation where the predictors are totally uncorrelated.

Let us consider an experiment which looks at the relation between the size of the work force (X1) and the level of bonus pay (X2) on productivity (Y). X1 and X2 are totally uncorrelated, i.e r = 0. Y was regressed on each/both of these predictors and the results are given below

1. Ŷ = 23.5 + 5.375X1

2. Ŷ = 27.25 + 9.25X2

3. Ŷ = 0.375 + 5.375X1 + 9.250X2

Comparing (1) and (3), we see that the coefficient of X1 remains the
same (5.375) whether or not X2 is in the model. In other words, the
association pattern (strength and direction) between X1 and Y is
independent of X2. Clearly this is because X1 and X2 are
uncorrelated.
Same is true for X2 since its coefficient (9.25) remains the same
whether or not X1 is in the model (compare (2) and (3)). Thus the
influence pattern of X2 on Y is independent of the presence (or
absence) of X1.

Analyzing Multicollinearity
Example 11. Now let us test for and analyze the effects of multicollinearity for the
Florida crime dataset.

The figures below depict the scatterplots of (education, urbanization), (education, income) and (urbanization, income).

Clearly, a definite positive linear trend is visible in all the plots, with education and income depicting the strongest relationship. Moreover, the correlation coefficients between the above pairs of variables are 0.791, 0.793 and 0.731 respectively, which are pretty high. Last but not least, the coefficients of multiple determination (R2) resulting from regressing each of the predictors on the other two are also quite high :

Education (X1) on urbanization (X2) + income (X3); R2 = 72.43%.

Urbanization on education + income; R2 = 65.43%.

Income on education + urbanization; R2 = 65.71%.

From the above results it is clear that multicollinearity exists in our


dataset.

[Scatterplots of the predictor pairs: (e) education against urbanization; (f) education against income; (g) urbanization against income.]
Although multicollinearity may affect different aspects of the fit of a

regression model, here we will discuss its effect on regression

coefficients only since it is often the most visible one.

Effect on Regression Coefficients

The following table shows the estimated slope coefficients for education (X1) and urbanization (X2) (when regressed on crime rate, Y) in the context of different models containing other predictors

Model variables   β̂1       β̂2
X1                1.486     -
X2                -         0.562
X1, X2            -0.583    0.682
X1, X3            1.052     -
X2, X3            -         0.642
X1, X2, X3        -0.467    0.697

If we just look at the values of β̂1 (the estimated coefficient of education) across the different models, it is clear that it varies from as high as 1.486 (model with only education) to as low as -0.583 when urbanization is included. This implies that the strength and even the direction of the association of education with crime rate depends on urbanization and income.

Although β̂2 (the estimated coefficient of urbanization) is much more stable across the various models, it is still far from being constant. The same is true for the coefficient of income, β̂3 (not shown). Thus, the nature of association between each of the predictors and crime rate varies quite a bit depending on whether the other predictors are included in the model or not. This is a direct effect of multicollinearity. Thus, in a multiple regression setting, a regression coefficient only reflects the partial effect of the corresponding predictor on the response conditional on the other predictors, NEVER an absolute one.

32.3 Measuring Multicollinearity

We now wrap up our discussion with a measure of multicollinearity known as the variance inflation factor or VIF. The VIF of a variable, say Xj, is defined as

VIFj = 1 / (1 − Rj2)

where Rj2 is the coefficient of determination for the regression of Xj on the remaining independent variables. Clearly, when Xj does not depend (linearly) on the other variables, Rj2 = 0 and hence VIFj = 1. However, in reality it is always the case that Rj2 > 0 and VIFj > 1. The more severe the multicollinearity, the higher the VIF. There is some debate about the cut-off for VIF so that Xj can be dropped from the model. However, we will take a conservative value of 4 as our cut-off.

Method : Start with all the predictors. Examine if any predictor(s) have VIF > 4. Drop the one with the highest VIF. Keep repeating the exercise until all independent variables in the model have VIF < 4.
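The screening loop can be sketched in Python with statsmodels' variance_inflation_factor (the predictor values below are hypothetical stand-ins, not the Florida data):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = pd.DataFrame({                        # hypothetical predictor values
    "education":    [58.0, 79.0, 70.0, 65.0, 62.0, 82.0, 74.0],
    "urbanization": [93.0, 25.0, 46.0, 62.0, 81.0, 18.0, 55.0],
    "income":       [22.0, 31.0, 27.0, 25.0, 23.0, 33.0, 28.0],
})

def drop_high_vif(X, cutoff=4.0):
    """Repeatedly drop the predictor with the highest VIF until all are below the cutoff."""
    X = X.copy()
    while True:
        Xc = sm.add_constant(X)
        # compute VIFs, skipping the constant column (index 0)
        vifs = pd.Series(
            [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])],
            index=X.columns,
        )
        if vifs.max() <= cutoff:
            return X, vifs
        X = X.drop(columns=vifs.idxmax())

kept, vifs = drop_high_vif(X)
print(vifs)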

Example 12. For the Florida crime data set, we have the following MINITAB output

Predictor      Estimate   Standard Error   T       P-value   VIF
Intercept      59.715     28.59            2.09    0.041
Education      -0.467     0.554            -0.84   0.403     3.627
Urbanization   0.697      0.129            5.40    0.000     2.893
Income         -0.383     0.941            -0.41   0.685     2.916

Based on the VIF values, we conclude that although there is

multicollinearity among the predictors, the values are not strong

enough to warrant deletion of any of the variables.

Note 8. Dropping a variable from a model based on VIF does not mean that the
variable has no effect on the response. It just means that its effect is adequately
explained by the remaining variables.
