Problem Set On Desc Stats Regression - PGDBA
(PGDBA)
by
Reference Book:
Statistics for Managers using Microsoft Excel, Latest edition (8th), by Levine, Stephan & Szabat,
Pearson Education.
Case: “Market Analytics at YouGo Cabs”, by Sarkar and Pethe, published by IIMC Case Research
Centre, March 2020, Case Number: IIMC-CRC-2019-09.
You must read and come prepared to the first class. We need sincere cooperation of all students in
this regard. An Excel file corresponding to this Handout will be emailed to you for you to practice
examples in the handout.
Organizing & Making Sense of Business Data
(Readings: Book by Levine et al, Sec 2.1 – 2.6, Sec 3.1 – 3.4, 3.6)
Data collected are of two types: (1) Categorical (Qualitative) and (2) Numerical (Quantitative),
which may be further divided into two types (2a) discrete and (2b) continuous. Categorical data
usually take non-numerical ‘text’ values – examples being Gender of customer, salesperson,
Location of a factory or store, Preferences for food or beverage, Bond rating. Discrete data usually
arise out of counting, for example, Number of employees, Number of credit cards or savings
accounts, Number of customers visiting a store, Number of units of product in inventory.
Continuous data arise out of measuring things like Product life, Waiting time at a check-in or
check-out counter, Market share of a product, Sales cost, Inventory value, Shipment weight etc.
Sometimes people also talk about data being collected on different measurement scales.
Before one does statistical analysis like regression etc., one needs to numerically code the non-
numerical categorical data. Such numerically coded categorical data are said to be measured on
either ‘nominal scale’ or ‘ordinal scale’. For example, one may numerically code gender =
‘female’ as ‘1’ and gender = ‘male’ as ‘0’. Similarly, one may numerically code different values
of the variable Internet Provider as follows: aircel as ‘1’, airtel as ‘2’, BSNL as ‘3’, Idea cellular
as ‘4’, etc. These numbers assigned to different ‘text’ values are numbers in name only; ‘1’ for
female and ‘0’ for male do not mean females are better than males; similarly ‘2’ for airtel and ‘3’
for BSNL do not mean BSNL is better than airtel. These numerical codes are just like jersey
numbers of soccer or cricket players of a team. Thus coded numbers for a categorical variable are
said to be measurements on the (i) nominal scale. Sometimes the values of a categorical variable are
naturally ordered, i.e., one value is perceptually ‘higher’ or ‘better’ than another; for example, sales
employee designation takes the values Director (sales), Manager (sales), Assistant Manager (sales),
Executive (sales), etc. In such a case, numerical codes such as Director = ‘1’, Manager = ‘2’,
Assistant Manager = ‘3’, Executive = ‘4’, etc. have an ordinal meaning. The numbers assigned to
categorical variables such as employee designation, course grades and bond rating are said to be
measurements on the (ii) ordinal scale.
Quantitative data are naturally numbers and they are said to be measured either on the ‘ratio
scale’ or the ‘interval scale’. Variables like Age, Weight, Sales, Profit, Demand, Counts have an
absolute zero as a possible value. Temperature on the Kelvin scale also has an absolute zero as a value.
For such variables the ratio of two values is meaningful (like “one value is twice
another value”), and their values are said to be measured on the ‘ratio scale’. For example, 6 units
of profit is three times 2 units of profit. Variables without an absolute zero, such as temperature in
degrees Celsius, are measured on the ‘interval scale’: differences between values are meaningful but
ratios are not.
The nominal and ordinal scales are known as non-metric scales, and the interval and ratio
scales are known as metric scales.
Frequency Distribution Tables, Histograms, Stem & Leaf Diagrams
Dataset 2 on Magnetron Tube Life, manufactured by a different machine:
Calculation Steps for Situation 0 (Optional).
Exponential: X ~ Exp(θ) with probability density function $f(x) = \frac{1}{\theta} e^{-x/\theta}$, $x > 0$; here θ = 10.
$P(X \le 3) = \int_0^3 \frac{1}{\theta} e^{-x/\theta}\,dx = \left[-e^{-x/\theta}\right]_0^3 = 1 - \exp\left(-\frac{3}{\theta}\right) = 1 - \exp\left(-\frac{3}{10}\right) \approx 0.26$;
$0.1 = p = P(X \le c) = 1 - \exp\left(-\frac{c}{\theta}\right) = 1 - \exp\left(-\frac{c}{10}\right) \implies c = -10 \ln(0.9) \approx 1$ (yr).
Normal: Y ~ N(mean = 10, sd = 3); $P(Y \le 3) = P\left(Z \le \frac{3-10}{3}\right) = P(Z \le -2.333)$;
$0.1 = p = P(Y \le c) = P\left(Z \le \frac{c-10}{3}\right) \implies \frac{c-10}{3} = -z_{0.1}$ = 10th percentile of N(0,1).
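The calculation steps for Situation 0 can also be checked numerically. The following is a minimal Python sketch, not part of the handout's Excel workflow; the helper names `exp_cdf` and `exp_quantile` are illustrative, and `statistics.NormalDist` from the Python standard library stands in for the normal-distribution functions of a spreadsheet.

```python
import math
from statistics import NormalDist

THETA = 10.0          # mean life (years) under the exponential model
MEAN, SD = 10.0, 3.0  # parameters under the normal model

def exp_cdf(x, theta):
    """P(X <= x) for X ~ Exp(theta) with density (1/theta) * exp(-x/theta)."""
    return 1.0 - math.exp(-x / theta)

def exp_quantile(p, theta):
    """Value c with P(X <= c) = p, obtained by inverting exp_cdf."""
    return -theta * math.log(1.0 - p)

normal = NormalDist(mu=MEAN, sigma=SD)

p_exp = exp_cdf(3, THETA)          # probability of failure within 3 years (exponential)
c_exp = exp_quantile(0.1, THETA)   # period within which 10% fail (exponential)
p_norm = normal.cdf(3)             # probability of failure within 3 years (normal)
c_norm = normal.inv_cdf(0.1)       # period within which 10% fail (normal)

print(round(p_exp, 2), round(c_exp, 2), round(p_norm, 2), round(c_norm, 2))
```

The two models share the same mean of 10 years yet give very different probabilities and cut-off periods, which is the point of the lesson that follows.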
Lesson from Situation 0: Unless we correctly identify - through sampled data on the lifetime of
a product or its component - what underlying probability distribution of the lifetime is, we may
end up with wrong calculations. For example, when it is really exponential distribution and
warranty period calculation is done assuming normal distribution then the manufacturer would
incur huge losses; on the other hand when it is really normal distribution and warranty period
calculation is done assuming exponential distribution then the manufacturer would offer much
smaller warranty period (than what he could have offered) and end up losing customers to its
competitors. Therefore, it is extremely important to try to figure out correctly the underlying
probability distribution of a variable. It can be done by constructing a frequency distribution table
of a randomly sampled data set and therefrom a histogram, which consists of about six or seven
steps and is thus laborious. There is a simple and easy way of constructing a “histogram-type”
diagram called ‘stem and leaf diagram’. Thus a stem and leaf diagram is a very useful tool.
Situation 1 (Frequency distribution table and Histogram): Daily demand data for a certain
product (in appropriate units) are as follows: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37,
38, 41, 43, 44, 46, 53, 58. How to summarize data in a table, in an organized manner?
(iii) Compute Class Interval-width: 9.2 (= 46/5, where 46 = 58 − 12 is the range and 5 is the number of classes); another choice: 10 (round up 9.2)
(iv) Choose Class Boundaries (limits): 12, 21.2, 30.4, 39.6, 48.8, 58 with intervals (11.9, 21.2],
(21.2, 30.4], (30.4, 39.6], (39.6, 48.8], (48.8, 58]; another choice 9, 19, 29, 39, 49, 59 with intervals
(9, 19], (19, 29], (29, 39], (39, 49], (49, 59]
(v) Compute Class Midpoints: 14, 24, 34, 44, 54 (optional step)
Midpoint   Upper Limit   Class Interval              Frequency   Relative Frequency
14         19            More than 9 and up to 19    3           0.15
24         29            More than 19 and up to 29   6           0.30
34         39            More than 29 and up to 39   5           0.25
44         49            More than 39 and up to 49   4           0.20
54         59            More than 49 and up to 59   2           0.10
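The table above can be reproduced with a few lines of code. This is a minimal Python sketch, assuming the class limits 9, 19, ..., 59 and half-open classes (lower, upper]; the function name `frequency_table` is illustrative only.

```python
demand = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
          32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
boundaries = [9, 19, 29, 39, 49, 59]  # the "another choice" limits from step (iv)

def frequency_table(values, limits):
    """Count values falling in each half-open class (lower, upper]."""
    rows = []
    for lower, upper in zip(limits[:-1], limits[1:]):
        count = sum(1 for v in values if lower < v <= upper)
        rows.append({
            "midpoint": (lower + upper) / 2,
            "upper": upper,
            "frequency": count,
            "relative": count / len(values),
        })
    return rows

table = frequency_table(demand, boundaries)
for row in table:
    print(row)
```

Plotting the frequencies (or relative frequencies) as bars over the class intervals gives the histogram.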
As shown above, construction of a histogram involves many steps and is laborious. However, there
is a simple and easy way of constructing a “histogram-type” diagram called a stem and leaf
diagram, which works very well for most practical purposes. A stem-and-leaf diagram also helps one
form clusters or groups for assigning appraisal grades.
Situation 2 (Stem and leaf diagram): Monthly sales of a company in 10 areas are as follows:
How will you present this dataset graphically, in an organized manner? One easy way to do it is
to construct a stem-leaf diagram as follows: First write 41 = 4*10 + 1*1, 24 = 2*10 + 4*1, 32 =
3*10 + 2*1, … , 21 = 2*10 + 1*1. Then draw a vertical line, and to the left of the vertical line
write the possible tens-place digits of the numbers (called “stem” digits) and to the right of the vertical
line, across from each stem, write the corresponding units-place digits of the data (called the “leaf” digits).
Stem-and-Leaf Display
Stem unit: 10
2 | 1 4 4 6 7 7
3 | 0 2 8
4 | 1
If to the above dataset we add two more data values, say, 3 and 100, then 3 = 0*10 + 3*1, 100 =
10*10 + 0*1, we will add stem values 0, 1,5,6,7,8,9,10 to the left of the vertical line and write the
corresponding leaf values if available otherwise keep them blank (as some stems can be leafless!)
A stem-and-leaf display organizes data into groups (called stems) so that the values within each
group (the leaves) branch out to the right on each row. It is like a histogram, rotated clockwise
by 90°, shown as a vertical picture, where the class-intervals are defined naturally by the “stem”
digits and the “leaf” digits corresponding to a stem complete the description of the individual data-
values represented by a frequency bar or column. When is a histogram preferable to a stem-leaf
display?
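The construction described for Situation 2 can be sketched in code. This is a minimal Python illustration, assuming non-negative whole-number data and a stem unit of 10; the function name `stem_and_leaf` is hypothetical, and the ten sales values are the ones that can be read off the stem-and-leaf display shown for Situation 2.

```python
from collections import defaultdict

def stem_and_leaf(values, stem_unit=10):
    """Return {stem: [leaves]} with every stem from the smallest to the largest present."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, stem_unit)
        leaves[stem].append(leaf)
    stems = range(min(leaves), max(leaves) + 1)
    return {s: leaves.get(s, []) for s in stems}

# Values read off the display (tens digit = stem, units digit = leaf).
sales = [21, 24, 24, 26, 27, 27, 30, 32, 38, 41]
display = stem_and_leaf(sales)
for stem, leaf_digits in display.items():
    print(f"{stem:>2} | {''.join(str(d) for d in leaf_digits)}")

# Adding 3 and 100 introduces stems 0 to 10, several of them leafless.
extended = stem_and_leaf(sales + [3, 100])
```

Leafless stems are kept on purpose, so that gaps in the data remain visible just as empty classes do in a histogram.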
Situation 3 (Novartis Sales). Annual Net sales data (in Rs. Crore) for Novartis is shown below:
Annual Net Sales (in Rs. Crore) for Novartis
Year Mar-90 Mar-91 Mar-92 Mar-93 Mar-94 Mar-95 Mar-96 Mar-97 Mar-98 Mar-99
Sales 241.52 266.03 326.85 383.37 429.98 454.75 494.09 584.87 646.38 722.55
Year Mar-00 Mar-01 Mar-02 Mar-03 Mar-04 Mar-05 Mar-06 Mar-07 Mar-08
Sales 807.02 445.64 465.41 479.72 514.87 481.84 538.23 553.34 569.54
Construct a stem-leaf diagram. What should be the stem-digit place and leaf-digit place?
Situation 4 (Financial Returns). Suppose the following table shows investment returns on
different assets (for example, stocks). A number such as 0.52 means 52% return. Prepare a stem-
leaf diagram for this dataset.
Asset No. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Return 0.52 0.32 0.25 0.24 0.31 0.24 0.21 0.37 0.35 0.24 0.13 0.13 0.02 0.08 0.05 0.01 0.01
Situation 5 (Performance Grading, an application of stem-leaf diagram): Performance scores
(out of 100) of fifteen employees turned out to be as follows:
20, 90, 25, 95, 30, 75, 50, 70, 55, 70, 50, 70, 55, 55, 50.
You are required to grade the scores as “Excellent”, “Good”, “Average”, “Poor”. How will you
do it?
Uses of Stem and Leaf Diagram: (a) As mentioned before (Situation 0), histogram construction
is greatly useful to figure out correctly the underlying probability distribution of a variable from
sampled data. However, construction of a histogram is laborious. A ‘stem and leaf diagram’
immensely simplifies construction of a “histogram-type” diagram; it can easily be done by hand;
and thus is a very valuable tool. (b) A ‘stem and leaf diagram’ can be effectively used to find
natural clusters or groups of data values and then assign grades to scores such as “Excellent”,
“Good”, “Average”, “Poor” (Situation 5). This can be used by a manager for appraisal of
employees, by a sports manager for evaluation of players, by a teacher for assessment of students.
Situation 5A. One important measure of quality of service provided by any organization is the
speed with which it responds to customer complaints. For a large carpet company, the number of
days between the receipt and the resolution of a complaint, over a 6-month period, is given by the
following data:
68 33 23 20 26 36 22 30 52 4
27 5 10 13 14 1 25 26 29 28
29 32 4 12 5 26 31 35 61 29
Construct an appropriate stem-leaf diagram, as discussed in class. What proportion of the data
values correspond to the stem digit “2” ?
Measures of Central Value: Data may often be summarized through a few numbers. First concept
is what the central value of the data is. This is computed by some kind of “Mean or Average” (such
as the arithmetic mean (AM), geometric mean (GM), harmonic mean (HM), trimmed mean (TM)),
Median, Mode. Examples below show when the easy-to-calculate AM is not appropriate to use.
Normally, the mode needs to be used for categorical data where we wish to know the most common
category (for example, most common letter grade obtained by examinees; most used mode of
transport in a country, bus, train, car, others).
For n non-negative real numbers X1, X2, ..., Xn, the geometric mean $\bar{X}_G$ is defined as the n-th root
of their product, i.e.,
$\bar{X}_G = \left(\prod_{i=1}^{n} X_i\right)^{1/n}$.
The harmonic mean of the positive real numbers X1, X2, ..., Xn is defined as
$\bar{X}_H = \dfrac{1}{\frac{1}{n}\left(\frac{1}{X_1} + \frac{1}{X_2} + \cdots + \frac{1}{X_n}\right)}$.
The trimmed mean is designed to reduce the effects of “outliers” (unusually small or large values)
on the calculated average. First one removes the p/2 proportion of the data from each end and then
calculates arithmetic mean of the remaining observations and it is a trimmed mean with a total of
p proportion or 100p% of trimming done. It may be denoted as TM(p). If the trimming is only on
the largest (smallest) values then it is called right-sided (left-sided) trimmed mean.
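The different "averages" defined above can be compared on a small dataset. The following is a minimal Python sketch: the data are hypothetical, `trimmed_mean` is an illustrative helper implementing TM(p) as described (a proportion p/2 removed from each end), and `geometric_mean` and `harmonic_mean` come from the standard `statistics` module.

```python
import math
from statistics import geometric_mean, harmonic_mean, mean

def trimmed_mean(values, p):
    """TM(p): drop a proportion p/2 of the sorted data from each end, then average."""
    data = sorted(values)
    k = math.floor(len(data) * p / 2 + 1e-12)  # observations removed per end
    kept = data[k:len(data) - k]
    return mean(kept)

scores = [2, 4, 4, 5, 6, 7, 8, 9, 10, 95]   # hypothetical data with one outlier

am = mean(scores)
gm = geometric_mean(scores)
hm = harmonic_mean(scores)
tm = trimmed_mean(scores, 0.2)               # removes 1 value from each end

print(am, round(gm, 3), round(hm, 3), tm)
```

For positive data the harmonic mean never exceeds the geometric mean, which never exceeds the arithmetic mean; the trimmed mean is far less affected by the single large value than the arithmetic mean.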
If all the observations X1, X2, ..., Xn are distinct, then the arithmetic mean puts equal weight 1/n
(the relative frequency) on all n observations in the formula $\bar{X} = \sum_{i=1}^{n} \frac{1}{n} X_i$. If for a given dataset
X1, X2, ..., Xn, some data values occur more than once and thus only k values $X_1^*, X_2^*, \ldots, X_k^*$ among
the X1, X2, ..., Xn are distinct, with relative frequencies r1, r2, …, rk, then the arithmetic mean of the n
observations X1, X2, ..., Xn is the weighted average of the k values $X_1^*, X_2^*, \ldots, X_k^*$ with the non-constant
relative frequencies r1, r2, …, rk as weights, and is given by
$\bar{X} = \sum_{j=1}^{k} r_j X_j^*$.
Measures of Variation: The second summary measure quantifies the variation or dispersion or
diversity in the data {X1, X2, …, Xn}. This is computed by the Range, Inter-Quartile Range (IQR),
Standard Deviation (i.e., square root of Variance), and the Coefficient of Variation (CV).
Calculation of CV, defined by SD/AM, is meaningful for data collected on the ratio scale such as
sales, profit, demand, counts. For calculation of the sample variance $s_n^2$, we will use the divisor (n−1)
as in $s_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$, where n is the sample size.
Note: $s_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right) = \frac{n}{n-1}\left[\left(\frac{1}{n}\sum_{i=1}^{n} X_i^2\right) - \bar{X}^2\right]$.
Because the deviations $(X_i - \bar{X})$ always sum to zero, only (n−1) of them are
independent pieces of information (deviations). Intuitively, this is why while calculating the
"average" of the squared deviations, one divides by (n−1) and not n. Note that while calculating
the sample mean $\bar{X}$, we divide by n because the numerator of $\bar{X}$ is based on X1, X2, …, Xn, and
the number of independent pieces of information among these values is n. This intuitive logic will
come in handy when we later consider estimating the variance of linear regression model errors,
in which case the divisor in the estimator would be not n, not (n−1), but instead
(n−k−1), where k = number of non-constant explanatory variables in the model.
Inter Quartile Range (IQR): Note that the Median (the 50th percentile) divides the data into two
halves, and it is defined as the (n+1)/2-th ranked value. In a similar manner, one may define
quartiles to divide data into 4 quarters. First, we sort the data X1, X2, …, Xn in increasing
order as X(1) ≤ X(2) ≤ … ≤ X(n). Then, by generalizing the definition of the Median (50th percentile)
one may define the first quartile (25th percentile) as Q1 = (n+1)/4-th ranked value and the third quartile
(75th percentile) as Q3 = 3(n+1)/4-th ranked value, with the Median being the second quartile. Thus Q1
(or Q3) can be viewed as the Median of the first (or second) half of the data. The IQR is then defined
as (Q3 − Q1) = the range of the middle 50% of the data. Like the Range = max (X1, X2, …, Xn) − min (X1, X2,
…, Xn), the IQR is much easier to calculate than the SD, but unlike both the Range and the SD, it is not
affected by possible outliers (unusually small or large values) present in the data. As the Median is a
robust (unaffected by outliers) measure of the central value, the IQR is a robust measure of
variation in the data.
Calculation of IQR Illustrated for n=8 (for data: 30, 32, 9, 10, 16, 22, 23, 27):
First sort the data in increasing order: 9, 10, 16, 22, 23, 27, 30, 32.
(n+1)/4 = 2.25, Q1 = 2.25-th ranked value = X(2) + 0.25×(X(3) − X(2)) = 0.25×X(3) + 0.75×X(2).
3(n+1)/4 = 6.75, Q3 = 6.75-th ranked value = X(6) + 0.75×(X(7) − X(6)) = 0.75×X(7) + 0.25×X(6).
For this example, IQR = Q3 − Q1 = 0.75×(X(7) − X(2)) + 0.25×(X(6) − X(3)).
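The interpolation rule used in this illustration can be written out as code. This is a minimal Python sketch of the (n+1)/4 and 3(n+1)/4 ranked-value definition; the function names `ranked_value` and `quartiles` are illustrative and not from the textbook or Excel.

```python
import math

def ranked_value(sorted_data, rank):
    """Value at a fractional 1-based rank, by linear interpolation."""
    n = len(sorted_data)
    if rank <= 1:
        return sorted_data[0]
    if rank >= n:
        return sorted_data[-1]
    lower = math.floor(rank)
    fraction = rank - lower
    below = sorted_data[lower - 1]
    above = sorted_data[lower]
    return below + fraction * (above - below)

def quartiles(values):
    """Q1 and Q3 as the (n+1)/4-th and 3(n+1)/4-th ranked values."""
    data = sorted(values)
    n = len(data)
    return ranked_value(data, (n + 1) / 4), ranked_value(data, 3 * (n + 1) / 4)

data = [30, 32, 9, 10, 16, 22, 23, 27]
q1, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q3, iqr)  # 11.5 29.25 17.75
```

For this dataset, Python's built-in `statistics.quantiles(data, n=4, method="exclusive")` returns the same Q1 and Q3, since its "exclusive" method also places the cut points at multiples of (n+1)/4.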
Variants of IQR Formulae: There are variants of the method discussed here, as shown in the
table below. Our definition corresponds to what the software Minitab uses.
One reason such variants naturally arise is as follows. Note that Q2 = Median of the whole data
set, divides it into two halves. Then Q1 (or Q3) is calculated as the median of the first (or second)
half of the data. Now where should Q2 belong: to the first half of the data set, the second half,
both, or neither? Some people include and some exclude the
Q2 value in (from) the first half and the second half of the data when calculating Q1 and Q3.
Excel Function for Quartile Calculation: For calculating quartiles, define Q1= (n+1)/4-ranked
value and Q3=3(n+1)/4-ranked value. We will use interpolation as needed and we will not follow
the textbook. In Excel (version 2010 and later) one may calculate Q1= (n+1)/4-ranked value,
Q3=3(n+1)/4-ranked values by using the QUARTILE.EXC function.
CV vs ‘Robust’ CV: The CV is unit-free and is used to compare variation in multiple situations
where the means are widely different. If one wishes to compare variation in multiple situations
where central values are quite different and undesirable outliers are possibly present in the
datasets, one may define a ‘Robust’ measure of CV as the IQR divided by the median, since IQR
is a ‘Robust’ measure of Variability and Median is a ‘Robust’ measure of the Central Value. Here,
‘Robust’ means ‘not affected by values that are unusually small or unusually large relative to the other observations’.
Consideration of Robust CV rather than CV may often give a different answer while checking
relative variability in multiple datasets. Which one to choose will depend on the objective of the
user.
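The difference between the two measures shows up clearly when a single outlier is introduced. The sketch below is a minimal Python illustration on hypothetical data; `cv` and `robust_cv` are illustrative names, and the quartiles are taken from `statistics.quantiles` with `method="exclusive"`, which places Q1 and Q3 at the (n+1)/4-th and 3(n+1)/4-th ranked values.

```python
from statistics import mean, median, quantiles, stdev

def cv(values):
    """Coefficient of variation: sample SD divided by the arithmetic mean."""
    return stdev(values) / mean(values)

def robust_cv(values):
    """'Robust' CV: IQR divided by the median."""
    q1, _, q3 = quantiles(values, n=4, method="exclusive")
    return (q3 - q1) / median(values)

clean = [10, 11, 12, 13, 14, 15, 16, 17, 18]          # hypothetical data
with_outlier = [10, 11, 12, 13, 14, 15, 16, 17, 180]  # same data, one extreme value

for name, data in [("clean", clean), ("with outlier", with_outlier)]:
    print(name, round(cv(data), 3), round(robust_cv(data), 3))
```

In this example the robust CV is identical for the two datasets while the ordinary CV changes several-fold, which is why the two criteria can rank datasets differently.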
There are two more numerical summary measures of a data set, namely, skewness and kurtosis
which are often explored for effective statistical inference (estimation and testing of hypothesis).
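For a sample of n observations X1, X2, …, Xn, the sample skewness may be defined by
$\dfrac{\frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})^3}{\left[\frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})^2\right]^{3/2}}$.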
The above skewness measure takes the average of the cubes of the deviations in the numerator and
then standardizes it by dividing by the cube of the standard deviation and is therefore unit-free.
Skewness indicates the direction and relative magnitude of a distribution's asymmetry, compared
with a (symmetric) normal distribution. In most cases it is assumed that the data come from a normal
distribution (having a skewness of zero), i.e., that the data are symmetric about the mean. But the underlying
distribution may not be normal. For example, daily (net) returns for a Nifty (or Sensex)
company stock collected over a long period, say, a year, will usually be skewed to the left (with a
longer left tail), i.e., negatively skewed. As a second example, household income distribution in a
country like India is skewed to the right (with a longer right tail), i.e., positively skewed. For modelling
stock market data properly it is imperative that one examines the shape of the data distribution and
quantifies probable asymmetry through sample skewness. If the skewness of the underlying
probability distribution is much different from zero, then standard statistical inference procedures
such as a confidence interval (with, say, 95% ‘level of confidence’) for a mean are unlikely to
provide the desired ‘level of confidence’.
Kurtosis reflects the
presence of outliers or unusually large values (magnitude-wise), in relation to a standard normal
distribution. For example, daily (net) returns for a Nifty (or Sensex) company stock collected
over a long period, say, five years, will usually have a high kurtosis value.
For a sample of n observations X1, X2, …, Xn, the sample kurtosis may be defined by
$\dfrac{\frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})^4}{\left[\frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})^2\right]^{2}}$. (*)
The above kurtosis measure (*) takes the average of the fourth powers of the deviations in the
numerator and then standardizes it by dividing by the fourth power of the standard deviation.
Consequently, its value is unit-free. It indicates how much a distribution deviates from a normal
distribution in the left and right tails. It is often assumed that the data come as a sample from a
normal distribution or population (having a kurtosis value of three). But the underlying distribution may
not be normal. For example, daily (net) returns for a Nifty (or Sensex) company stock collected
over a long period, say, five years, will usually have a high kurtosis value, much larger than 3.
Often one talks about the ‘excess kurtosis’ of an underlying probability distribution (i.e., in excess of
3, where 3 is the kurtosis value for the family of normal distributions), defined by
$\dfrac{\frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})^4}{\left[\frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})^2\right]^{2}} - 3$.
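The skewness and kurtosis measures described in this section, with the (n−1) divisor in both numerator and denominator, can be computed directly. This is a minimal Python sketch on hypothetical data; the function names are illustrative, and the values may differ from those of spreadsheet functions, which can apply other small-sample adjustments.

```python
def central_moment(values, power):
    """Average of (X_i - mean)**power using the (n - 1) divisor, as in the text."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** power for x in values) / (n - 1)

def sample_skewness(values):
    """Third central moment divided by the cube of the standard deviation."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def sample_kurtosis(values):
    """Fourth central moment divided by the fourth power of the standard deviation."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

def excess_kurtosis(values):
    """Kurtosis in excess of 3, the value for normal distributions."""
    return sample_kurtosis(values) - 3

symmetric = [1, 2, 3, 4, 5]         # hypothetical, symmetric about 3
right_skewed = [1, 1, 2, 2, 3, 20]  # hypothetical, one large value

print(sample_skewness(symmetric), sample_kurtosis(symmetric))
print(sample_skewness(right_skewed), excess_kurtosis(right_skewed))
```

The symmetric dataset has zero skewness, while the dataset with one large value has positive skewness (a longer right tail).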
[Application of skewness and kurtosis in the stock market investment: Read the article “Use
these two tools to look for less risky, promising stocks” by Sameer Bhardwaj, Economic Times
Bureau, Updated: Dec 10, 2018, 09.49AM IST at the website:
https://economictimes.indiatimes.com/wealth/invest/use-these-two-tools-to-look-for-less-risky-promising-stocks/printarticle/66997140.cms
They summarized as follows: “Positive skewness and low kurtosis promise high returns”. ]
Situation 6: During a recruitment interview, 10 members of the HR department are present as
interviewers and are asked to give ratings on the interviewees. In the past, the HR-Head observed
that some of the interviewers had shown regional biases while giving their ratings. (a) Before the
interview takes place, the HR Head is required to declare how he or she would calculate the
“average” of the ratings to be given by the interviewers. What would you recommend? (b) For
two interviewees, the ratings (on a scale of 1 to 10) data have turned out to be:
Situation 6B: (May 27, 2016) “TCS CEO N Chandrasekaran's pay rises 20% in FY16 to Rs
25.6 crore, 459 times company's median remuneration”
( http://economictimes.indiatimes.com/tech/ites/tcs-ceo-n-chandrasekarans-pay-rises-20-in-fy16-to-rs-25-6-crore-459-times-companys-median-remuneration/articleshow/52450273.cms )
Here the company's median (not mean) remuneration was considered because the interest was in knowing
how much more N Chandrasekaran was paid compared to an ‘average’ or ‘regular’ employee.
Situation 6C: (Nov 27, 2014) “Australia batsman Hughes dies from head injury”
(http://timesofindia.indiatimes.com/sports/off-the-field/Australia-batsman-Hughes-dies-from-head-injury/articleshow/45292785.cms) … … Questions about the response time of
ambulances dispatched to the stadium were also raised. The head of New South Wales
Ambulance was to be hauled before the state health minister Jillian Skinner on Thursday after the
ambulance authority issued conflicting statements about their response times. The arrival of the
first ambulance took 15 minutes, NSW Ambulance clarified in a statement on Wednesday. The
state's median response time for the highest priority "life-threatening cases" was just under
eight minutes in 2013-14, according the authority's statistics. …
Situation 6D: (Sep 13, 2016) “U.S. Household Incomes Surged 5.2% in 2015, First Gain Since
2007” (http://www.wsj.com/articles/u-s-household-incomes-surged-5-2-in-2015-ending-slide-
1473776295 ) The median household income—the level at which half are above and half are
below—rose 5.2%, or $2,798, to $56,516, from a year earlier, after adjusting for inflation, the
Census Bureau said Tuesday. …
Situation 6E: (Jan 3, 2018) “Believe in the average rather than in extremes” by Uma Shashikant
(https://www.pressreader.com/india/the-times-of-india-mumbai-edition/20180108/282278140723458;
or, https://economictimes.indiatimes.com/wealth/invest/why-average-return-from-investments-is-what-you-should-expect-and-be-happy-with/articleshow/62296843.cms):
If someone told us that we should be happy making average returns on our investments, it would
probably be unacceptable to us. The lure of doing better or beating the average is something we
simply cannot let go of. The very notion of earning above average returns on a consistent basis is
a statistical challenge. … … … Fund managers tend to compare their performances with the
median returns made by their peer group. … … .
Situation 7: Suppose one invests Rs 100 thousand in stocks. The investment “grew” as shown in
the table given below. How would you calculate the “average” annual return on investment at the
end of (i) year 3 and (ii) year 4? [Let Vt = value of investment at time t. Then gross return at time
t is defined as = Vt / Vt-1, and net return at time t is defined as (Vt – Vt-1)/Vt-1 = gross return -1.]
                  Value of Investment   Annual Gross   Annual Net
                  (in 1000 Rs)          Return         Return
Start of Year 1   100
End of Year 1     50                    0.5            -0.5
End of Year 2     85                    1.7            0.7
End of Year 3     42.5                  0.5            -0.5
End of Year 4     72.25                 1.7            0.7
                                        AM = 1.1       0.1
                                        GM = 0.922     -0.078
Note: The “CAGR” (compound annual growth rate) is a widely used measure in the business
world. It is nothing but a short-cut formula for the geometric mean of net return (rate), which may
not be computed directly and is indirectly defined as “the geometric mean of gross return − 1”.
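The AM and GM rows of the Situation 7 table, and the CAGR described in the note, can be reproduced as follows. This is a minimal Python sketch using the investment values from the table; the variable names are illustrative.

```python
from statistics import geometric_mean, mean

values = [100, 50, 85, 42.5, 72.25]  # investment value at the start and at each year end

# Gross return in year t is V_t / V_(t-1); net return is gross return - 1.
gross = [later / earlier for earlier, later in zip(values[:-1], values[1:])]
net = [g - 1 for g in gross]

am_gross = mean(gross)
gm_gross = geometric_mean(gross)
cagr = gm_gross - 1                      # "geometric mean of gross return - 1"

# Direct CAGR formula: (final value / initial value) ** (1 / years) - 1
years = len(values) - 1
cagr_direct = (values[-1] / values[0]) ** (1 / years) - 1

print(round(am_gross, 3), round(gm_gross, 3), round(cagr, 3), round(cagr_direct, 3))
```

Compounding the initial 100 at the arithmetic-mean net return of 10% for four years would give about 146.4, not the actual 72.25; compounding at the geometric-mean rate of about −7.8% reproduces 72.25, which is why the GM is the appropriate "average" here.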
End of Year 0 1 2 3 4
Calculate the average annual net return (or compound annual growth rate) over the first 3 years.
Situation 8A: The value of an investment of 1 lac rupees made on January 1, 2010 became 3.6
lac rupees on July 1, 2014. Calculate the most appropriate "average" annual return, on July 1, 2014.
Situation 9: (i) An investor X invests Rs 100,000 in a mutual fund every month for three months,
and the prices paid per unit in the three months were Rs 12, Rs 10, and Rs 8 respectively. Then the average
price per unit the investor X paid is .............................(?)
(ii) Another investor Y purchased 100,000 units of another mutual fund every month for three months, with
the prices paid per unit being Rs 10, Rs 9, and Rs 8 respectively. Then the average price per unit
the investor Y paid is .............................(?)
(iii) An investor invests 25%, 30% and 45% of his total investment in a mutual fund at three
different time periods at prices (per share of the mutual fund) of 10, 8 and 6 dollars respectively. Then
the average price per unit the investor paid is .............................(?)
(iv) An investor invested 40% of his total investment in a mutual fund in the first quarter at a price (per
unit-share of the mutual fund) of 50 rupees and invested the rest in the second quarter at a price of x rupees.
The average acquisition cost per unit the investor paid for the total investment came out to be 60
rupees. Then, what is the value of x?
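The four parts of Situation 9 differ in what is held fixed: the amount of money invested or the number of units bought. The following minimal Python sketch makes that distinction explicit; the function names are illustrative, and the results can be compared with the answer key.

```python
def average_price_fixed_amounts(weights, prices):
    """Average cost per unit when money shares `weights` are spent at `prices`.

    Total money / total units = sum(w) / sum(w_i / p_i): a weighted harmonic mean.
    """
    total = sum(weights)
    return total / sum(w / p for w, p in zip(weights, prices))

def average_price_fixed_units(units, prices):
    """Average cost per unit when `units` are bought at `prices`: a weighted arithmetic mean."""
    return sum(u * p for u, p in zip(units, prices)) / sum(units)

part_i = average_price_fixed_amounts([100_000] * 3, [12, 10, 8])
part_ii = average_price_fixed_units([100_000] * 3, [10, 9, 8])
part_iii = average_price_fixed_amounts([0.25, 0.30, 0.45], [10, 8, 6])

# Part (iv): solve 60 = 1 / (0.4/50 + 0.6/x) for x.
part_iv = 0.6 / (1 / 60 - 0.4 / 50)

print(round(part_i, 2), round(part_ii, 2), round(part_iii, 3), round(part_iv, 2))
```

When equal amounts of money are invested each period, more units are bought when the price is low, so the average cost per unit (the harmonic mean) is below the simple average of the prices.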
What is “SIP” ?: “Mutual funds garner record high AUM of Rs 7,304 crore via SIPs in May”
( https://economictimes.indiatimes.com/wealth/invest/mutual-funds-garner-record-high-aum-of-rs-7304-crore-via-sips-in-may/canararobecoshowsp_dp/64600837.cms ; June 15, 2018)
“Mutual funds collected Rs 7,304 crore through SIPs in May, nine per cent higher than the
collection in April, showed Amfi data. Total SIP accounts stood at 2.23 crore.
AMFI data shows that the mutual fund industry had added about 9.58 lakh SIP accounts
each month on an average during the FY 2018-19, with an average SIP size of about Rs 3,275 per
SIP account.
SIP has been gaining popularity among mutual fund investors, as it helps in Rupee Cost
Averaging and also in investing in a disciplined manner without worrying about market volatility
and timing the market, it adds.”
Situation 10 (Weighted Average): An investor purchased shares of a certain stock from the BSE
on four different trading days. The purchase data are given as follows:
(i) What is the ‘average’ acquisition cost per share? Why? (ii) What is the simple average cost per
share?
[Note: ‘Closing Price’ of a stock (or its ‘Future’) at NSE is the weighted average of its transaction
prices (or transaction prices of Futures contracts) traded in the last half-an-hour of the trading.
Volumes of shares transacted in the last half-an-hour are taken into account.]
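The difference between a weighted average and a simple average of prices can be seen in a small example. This is a minimal Python sketch on hypothetical purchase data (the numbers are not those of Situation 10), with the quantity bought on each day used as the weight.

```python
# Hypothetical purchases: (number of shares, price per share in Rs).
purchases = [(100, 50.0), (300, 40.0), (50, 60.0), (550, 30.0)]

total_shares = sum(shares for shares, _ in purchases)
total_cost = sum(shares * price for shares, price in purchases)

# Weighted average: total money paid divided by total shares acquired.
weighted_average_cost = total_cost / total_shares

# Simple average: ignores how many shares were bought at each price.
simple_average_price = sum(price for _, price in purchases) / len(purchases)

print(weighted_average_cost, simple_average_price)
```

Because most of these hypothetical shares were bought at the lower prices, the weighted average cost is well below the simple average of the four prices.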
Situation 11 (Deciding the “Best Team”): Three teams of sales executives A, B, C operate in
cities with similar economic and social conditions, while team D operates in the city of Mumbai,
where making sales is easier because of higher economic opportunities and a higher level of
consumerism. The sales achieved by the different sales teams are given below.
(i) Among the teams A, B & C, which team performed the best?
(ii) Between the teams B & D, which team performed better?
Usually, one would consider high “average” and low variation to be better.
Situation 12 (Stock Picking). Suppose today is 19 Oct 2011. Price data, over the period 3 Jan –
18 Oct 2011, have been collected for the stocks CIPLA, Tata Motors, Unitech -- graphs of which
are shown below:
(i) From the graph, which stock price seems to be most “volatile”/fluctuating (and hence most
risky)? Which seems least “volatile”? (ii) Which are really most and least volatile ?
[Standard deviation of a variable cannot capture comparative or relative variation.]
[Note: $E(X) = \sum_{i=1}^{k} x_i P(X = x_i)$; $E(X^2) = \sum_{i=1}^{k} x_i^2 P(X = x_i)$; $\mathrm{Var}(X) = E(X^2) - (E(X))^2$. This
short-cut formula for the variance of a random variable X may be intuitively understood from the fact
that for a dataset X1, X2, …, Xn, the sample variance is approximately equal to the average of the squared Xi values minus
the square of the average of the Xi values, as shown below:
$s_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2 = \frac{n}{n-1}\cdot\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2 = \frac{n}{n-1}\cdot\frac{1}{n}\left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right)$
$= \frac{n}{n-1}\left[\left(\frac{1}{n}\sum_{i=1}^{n} X_i^2\right) - \bar{X}^2\right] \approx (1)\left[\left(\frac{1}{n}\sum_{i=1}^{n} X_i^2\right) - \bar{X}^2\right]$ = "$E(X^2) - (E(X))^2$". ]
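The short-cut formula in the note can be verified numerically for any discrete distribution. The sketch below is a minimal Python illustration; the distribution is hypothetical (not taken from the handout) and `expectation` is an illustrative helper name.

```python
def expectation(outcomes, probabilities, power=1):
    """E(X**power) for a discrete random variable."""
    return sum((x ** power) * p for x, p in zip(outcomes, probabilities))

# Hypothetical discrete distribution (not taken from the handout).
outcomes = [-5, 0, 5, 10]
probabilities = [0.1, 0.2, 0.4, 0.3]

e_x = expectation(outcomes, probabilities)
e_x2 = expectation(outcomes, probabilities, power=2)

var_shortcut = e_x2 - e_x ** 2                       # E(X^2) - (E(X))^2
var_definition = sum(((x - e_x) ** 2) * p            # E[(X - E(X))^2]
                     for x, p in zip(outcomes, probabilities))

print(e_x, var_shortcut, var_definition)
```

Dividing the standard deviation (the square root of the variance) by the expected value gives a risk-to-return ratio of the kind asked for in the exercises that follow.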
Situation 13A (Exercise). You plan to invest either in a corporate bond fund or in a common stock
fund. The following table presents expected annual percentage of return of each of these
investments under various economic conditions and the probability that each of those economic
conditions would occur.
Economic Condition   Probability   Corporate Bond Fund   Common Stock Fund
Recession            0.1           -7                    -30
Stagnation           0.2           3                     -10
Slow Growth          0.3           8                     10
Moderate Growth      0.3           10                    15
High Growth          0.1           12                    35
Then the risk to return ratio, for corporate bond fund is given by …. …
While the quality of employees working under the three managers can be assumed to be uniform,
the same cannot be said about the way the managers evaluate the employees reporting to them.
Manager A is known to give appraisal ratings leniently, Manager C is known to be very demanding
and it is quite hard to get high appraisal ratings from him. Manager B is neither lenient nor strict.
Assume that the underlying distribution of the appraisal ratings under each manager is normal.
Which employees are the best/worst performers? Data on appraisal ratings are given below:
[Note: “sd” = standard deviation (for the 10 numbers corresponding to the respective manager) and
“mean” is the corresponding average. “Grand Mean” is the overall average of all 30 numbers, and
“Grand Stdev” is the overall standard deviation of all 30 numbers. Step 1: Standardized Rating
= (Actual rating − Group Mean)/Group SD; Step 2: Final Rating = Standardized Rating*Grand
SD + Grand Mean. Note that Step 2 only helps in making the numbers obtained in Step 1
(approximately) belong to the original range of raw ratings, namely 0 to 100 here. Step 2 does not
change the ranks of the Standardized Ratings obtained in Step 1. It is carried out so that a layman
can understand the Final Rating values better than the Standardized Rating values. This method would
work well if the numbers of employees under the managers, nA, nB, nC, are “large” (say at least 20);
they need not be equal.]
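The two steps in the note can be sketched in code. This is a minimal Python illustration with hypothetical ratings (five per manager, purely to keep the example short; the note recommends larger groups); the dictionary layout and variable names are assumptions made for the illustration.

```python
from statistics import mean, stdev

# Hypothetical appraisal ratings (0 to 100) given by three managers.
ratings = {
    "A": [85, 88, 90, 92, 95],   # lenient manager
    "B": [60, 65, 70, 75, 80],   # neither lenient nor strict
    "C": [40, 45, 50, 52, 58],   # demanding manager
}

all_ratings = [r for group in ratings.values() for r in group]
grand_mean, grand_sd = mean(all_ratings), stdev(all_ratings)

final = {}
for manager, group in ratings.items():
    group_mean, group_sd = mean(group), stdev(group)
    for i, rating in enumerate(group):
        standardized = (rating - group_mean) / group_sd             # Step 1
        final[(manager, i)] = standardized * grand_sd + grand_mean  # Step 2

best = max(final, key=final.get)
worst = min(final, key=final.get)
print(best, round(final[best], 1), worst, round(final[worst], 1))
```

After the adjustment every manager's group has the same mean and standard deviation, so an employee is judged relative to peers rated by the same manager rather than by the raw score.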
Situation 15 (Application of IQR): Two teams of sales executives operate in two different areas
with different economic and social conditions. Monthly sales achieved by these teams are given
below.
Team 1: 2 4 21 24 31 33 35 41 43 46 48 49 50 50 54 55 56 67 69
Team 2: 9 10 16 22 23 27 32 32 33 34 39 41 41 45 46 48 50 55 64 66
Which team performed better? Why ?
[Discussion: If one would like to ignore abnormal performance of employees, possibly due to ill
health, absenteeism or unusually high talent, and would rather compare the ‘normal’
performance of the ‘regular’ members of the two teams, then the ‘robust’ cv should be
the criterion to go by.]
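The summary measures for the two teams can be checked with a short Python sketch. Two assumptions are made here: quartiles are interpolated at position p(n + 1), as Excel's QUARTILE.EXC does, and the ‘robust’ cv is taken to be IQR/median, which is what the figures in the answer key suggest.

```python
from statistics import mean, median, quantiles, stdev

team1 = [2, 4, 21, 24, 31, 33, 35, 41, 43, 46, 48, 49, 50, 50, 54, 55, 56, 67, 69]
team2 = [9, 10, 16, 22, 23, 27, 32, 32, 33, 34, 39, 41, 41, 45, 46, 48, 50, 55, 64, 66]

def summarize(sales):
    # method="exclusive" interpolates at position p*(n+1) in the sorted data.
    q1, _, q3 = quantiles(sales, n=4, method="exclusive")
    iqr = q3 - q1
    return {
        "mean": mean(sales),
        "sd": stdev(sales),                 # sample sd (divisor n - 1)
        "median": median(sales),
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "cv": stdev(sales) / mean(sales),
        "robust_cv": iqr / median(sales),   # assumed definition: IQR / median
    }

s1, s2 = summarize(team1), summarize(team2)
for label, s in (("Team 1", s1), ("Team 2", s2)):
    print(label, {k: round(v, 2) for k, v in s.items()})
```

The ordinary cv is nearly the same for both teams, while the robust version, which is not affected by the few extreme values, separates them more clearly.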
Situation 15A (Exc). Suppose investment returns on different assets of an investor are as
follows:
0.32, 0.25, 0.24, 0.31, 0.25, 0.21, 0.37, 0.35, 0.13
Then, calculate the inter-quartile range of the returns, as a measure of variation.
Situation 16 (Exercise). The sample mean and sample standard deviation $s_n$, where
$s_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})^2$, of the age (X) of the 24 sales executives
of a company were 35 years and 9 years respectively. If a new sales executive aged 30 years is
recruited by the company, what will be the new mean and standard deviation of all the sales
executives?
[ Hint: $s_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n}X_i^2 - n\bar{X}^2\right) = \frac{n}{n-1}\left(\frac{1}{n}\sum_{i=1}^{n}X_i^2 - \bar{X}^2\right)$.]
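The identity in the hint lets the mean and standard deviation be updated without the raw ages. A minimal Python sketch (the function name `add_observation` is illustrative, not from the handout):

```python
from math import sqrt

def add_observation(n, mean_n, sd_n, x_new):
    """Update the sample mean and sd when one new value joins n old ones."""
    sum_x = n * mean_n
    # Hint identity: sum of squares = (n - 1) * s^2 + n * mean^2
    sum_sq = (n - 1) * sd_n**2 + n * mean_n**2
    n1 = n + 1
    mean_new = (sum_x + x_new) / n1
    sum_sq_new = sum_sq + x_new**2
    var_new = (sum_sq_new - n1 * mean_new**2) / (n1 - 1)
    return mean_new, sqrt(var_new)

new_mean, new_sd = add_observation(n=24, mean_n=35, sd_n=9, x_new=30)
print(round(new_mean, 2), round(new_sd, 2))
```

With the figures of Situation 16 this gives a new mean of 34.8 years and a new standard deviation of about 8.87 years.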
Situation 17 (Exercise). The (arithmetic) mean of the following set of scores, out of 10, in an
MBA course Quiz is 4:
1, ?, 2, ?, 3, 4, 6, 7, 8.
Given that the sum of squares of deviations from the mean is 52, find (i) the sample standard
deviation $s_n$, where $s_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})^2$, and (ii) the two missing values.
ANSWER KEY to
Situation (0). Exponential: (a) 0.26, (b) ≈ 1; Normal: (a) 0.01, (b) ≈ 6
Exponential: $X \sim \mathrm{Exp}(\theta)$ distribution with probability density function $f(x) = \frac{1}{\theta}e^{-x/\theta}$, $x > 0$;
here $\theta = 10$;
$P(X \le 3) = \int_0^3 \frac{1}{\theta}e^{-x/\theta}\,dx = \left[-e^{-x/\theta}\right]_0^3 = 1 - \exp\left(-\frac{3}{\theta}\right) = 1 - \exp\left(-\frac{3}{10}\right) \approx 0.26$;
$0.1 = p = P(X \le c) = 1 - \exp\left(-\frac{c}{\theta}\right) = 1 - \exp\left(-\frac{c}{10}\right) \Rightarrow c = -10\ln(0.9) \approx 1$ (yr).
Normal: $Y \sim N(\text{mean}=10, \text{sd}=3)$; $P(Y \le 3) = P\left(Z \le \frac{3-10}{3}\right) = P(Z \le -2.333)$ =
NORMSDIST(-2.333) ≈ 1%
$0.1 = p = P(Y \le c) = P\left(Z \le \frac{c-10}{3}\right) \Rightarrow \frac{c-10}{3} = -z_{0.1}$ = 10th percentile of N(0,1)
= NORMSINV(0.1) = −1.2816 $\Rightarrow c = -3z_{0.1} + 10 \approx 6$ (years)
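The four numbers above can be reproduced with the Python standard library; `NormalDist.cdf` and `NormalDist.inv_cdf` play the roles of Excel's NORMSDIST and NORMSINV after standardization. This is only a check of the arithmetic, with θ = 10, mean 10 and sd 3 as stated in the answer.

```python
from math import exp, log
from statistics import NormalDist

theta = 10                       # mean life under the exponential model
p_exp = 1 - exp(-3 / theta)      # P(X <= 3)
c_exp = -theta * log(1 - 0.1)    # c such that P(X <= c) = 0.1

life = NormalDist(mu=10, sigma=3)
p_norm = life.cdf(3)             # P(Y <= 3)
c_norm = life.inv_cdf(0.1)       # 10th percentile of Y

print(round(p_exp, 4), round(c_exp, 4))
print(round(p_norm, 4), round(c_norm, 4))
```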
(8A) 0.3292
9(i) 9.73;
(ii) 9,
(iii) 7.273,
(iv) 69.23 [ $60 = \frac{1}{0.4\left(\frac{1}{50}\right) + 0.6\left(\frac{1}{x}\right)} \Rightarrow x = \frac{0.6}{\frac{1}{60} - 0.008} = 69.23077$ ]
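9(i): Hint
Month   Price   Investment   Number of Units
1       12      100,000      100,000/12
2       10      100,000      100,000/10
3       8       100,000      100,000/8
Total           300,000      100,000(1/12 + 1/10 + 1/8)
Average cost per unit = $\frac{300{,}000}{100{,}000\left(\frac{1}{12}+\frac{1}{10}+\frac{1}{8}\right)} = \frac{3}{\frac{1}{12}+\frac{1}{10}+\frac{1}{8}} = \frac{1}{\frac{1}{3}\left(\frac{1}{12}+\frac{1}{10}+\frac{1}{8}\right)}$; for ratio data (here
price) when numerator is constant then HM of ratio data is the ‘average’.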
9(ii). Hint
Month   Price   Investment           Number of Units
1       10      100,000*10           100,000
2       9       100,000*9            100,000
3       8       100,000*8            100,000
Total           100,000*(10+9+8)     300,000
Average cost per unit = $\frac{100{,}000(10+9+8)}{300{,}000} = \frac{10+9+8}{3}$; for ratio data (here price) when
denominator is constant then AM of ratio data is the ‘average’.
9(iii). Hint
Month   Price   Investment   Number of Units
1       10      X(.25)       X(.25)/10
2       8       X(.30)       X(.30)/8
3       6       X(.45)       X(.45)/6
Total           X            X(.25/10 + .30/8 + .45/6)
Average cost per unit = $\frac{X}{X\left(\frac{.25}{10}+\frac{.30}{8}+\frac{.45}{6}\right)} = \frac{1}{(.25)\left(\frac{1}{10}\right)+(.30)\left(\frac{1}{8}\right)+(.45)\left(\frac{1}{6}\right)}$; for ratio data (here price)
when both numerator and denominator are changing then weighted HM of ratio data is
the ‘average’.
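The three hints for 9(i) to 9(iii), and the equation used in 9(iv), reduce to a few lines of arithmetic. A small Python sketch (the helper name `harmonic_mean` is illustrative; equal weights are assumed when none are given):

```python
def harmonic_mean(prices, weights=None):
    """Weighted harmonic mean; weights must sum to 1 (equal weights by default)."""
    if weights is None:
        weights = [1 / len(prices)] * len(prices)
    return 1 / sum(w / p for w, p in zip(weights, prices))

avg_i = harmonic_mean([12, 10, 8])                        # equal investment each month
avg_ii = sum([10, 9, 8]) / 3                              # equal units each month
avg_iii = harmonic_mean([10, 8, 6], [0.25, 0.30, 0.45])   # investment shares as weights
x_iv = 0.6 / (1 / 60 - 0.4 / 50)                          # solve 60 = 1/(0.4/50 + 0.6/x)

print(round(avg_i, 2), avg_ii, round(avg_iii, 3), round(x_iv, 2))
```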
(10) (i) 98.44 [$= \bar{X}_W = \sum_{i=1}^{k} r_i X_i^{*}$ = 220*0.11 + 170*0.15 + 110*0.22 + 47*0.52];
(13A). 0.81
(14A, omitted)
(15) Team 1: mean = 40.95, sd=18.46, median=46, Q1=31, Q3=54, IQR=23, cv=0.45, “robust” cv
= 0.50; Team 2: mean=36.65, sd=16.08, median=36.5, Q1=24, Q3=47.5, IQR=23.5, cv = 0.44,
“robust” cv=0.64.
[If one would like to ignore abnormal performance of employees, possibly due to ill health,
absenteeism or unusually high talent, and would rather compare the ‘normal’ performance
of the ‘regular’ members of the two teams, then the ‘robust’ cv should be the criterion
to go by.]
(16) $\sum_{i=1}^{n} X_i^2 - n\bar{X}_n^2 = (n-1)s_n^2 \Rightarrow \sum_{i=1}^{n} X_i^2 = (n-1)s_n^2 + n\bar{X}_n^2$; Now,
$\sum_{i=1}^{n+1} X_i^2 = \left\{\sum_{i=1}^{n} X_i^2\right\} + X_{n+1}^2 = \left\{(n-1)s_n^2 + n\bar{X}_n^2\right\} + X_{n+1}^2 = \{23(9^2) + 24(35^2)\} + 30^2$;
$n\,s_{n+1}^2 = \sum_{i=1}^{n+1} X_i^2 - (n+1)\bar{X}_{n+1}^2 = \left(\{23(9^2) + 24(35^2)\} + 30^2\right) - 25(34.8^2)$,
where $s_{n+1}$ is the new sd.
17. (i) 2.55 [$= \sqrt{\frac{\sum_{i=1}^{n}(X_i-\bar{X})^2}{n-1}} = \sqrt{52/(9-1)}$ ]
(ii) 4, 1;
[Let the missing values be denoted by a and b. Then, $\sum_{i=1}^{n} X_i^2 = 179 + (a^2+b^2)$; $\bar{X} = 4 \Rightarrow a + b = 5$;
and $52 = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 = 179 + (a^2+b^2) - 9(4^2) = 35 + (a^2+b^2)$, so that $a^2 + b^2 = 17$;
together with $a + b = 5$ this gives the values 4 and 1. ]
Correlation and Regression Analysis in Business Forecasting
Readings: Book by Levine et al, Covariance & Correlation Coefficient (Sec 3.5, 5.2), Simple
Linear Regression (Sec 13.1-13.6), Multiple Linear Regression [Sec 14.1, 14.2, 14.6 (Dummy Var
Reg), 15.1 (Polynomial Reg), 15.2]
For a given value X = x, Y can take many possible values, but the predicted value of Y remains
the same, namely a + bx; hence, for a given X = x, the regression error (or prediction error)
e = Y − a − bx can also take many possible values. We usually assume that the probability
distribution of the regression errors is ‘normal’ (bell-shaped histogram) with mean 0 and variance σ².
Since the prediction errors are sometimes positive and sometimes negative, on average they are
expected to be 0; additionally, it is generally assumed that the variance of the prediction errors
remains constant at σ², irrespective of the value of x. This σ² is referred to as the regression
error variance.
Situation 0 (Optional). [ Nifty-rupee correlation hints at more pain for stock investors (ET,
2nd July 2018) ]
(https://economictimes.indiatimes.com/markets/stocks/news/nifty-rupee-correlation-hints-at-more-pain-for-stock-investors/articleshow/64820842.cms)
“If the historical correlation between the Rupee and the Nifty were any guide of future
trends, pressure would mount on the Nifty in the near-term. The 90-day rolling correlation between
the Rupee (against the US dollar) and the Nifty currently stood at a negative 0.34, compared with
the 10-year average of negative 0.47, according to Bloomberg data compiled by ETIG.
The rupee has depreciated 6.7 per cent since the beginning of the year, becoming the worst
performing currency in Asia in the process. But the Nifty rose 1.7 per cent in the same period,
standing out among the emerging markets that have been hit by the dollar gains.
A positive correlation means the two securities move in the same direction with one being
the highest level of relationship — both go up or down together at the same rate — and a negative
correlation means both move in the opposite direction with minus one being highest level of
negative correlation — both move in the opposite direction at the same rate.
The current reading of the correlation is nearly one standard deviation away from the mean.
According to the mean reversion theory, the variable eventually moves back towards the mean or
average and the probability of mean reversion is higher if the variable is farther away from mean
or, in statistical terminology, when standard deviation is higher. Therefore, in case of mean
reversion, it could put more pressure on the Nifty performance.
The Indian rupee hit its weakest level ever of 69.09 per dollar last Thursday. Economists
are predicting more pain in the coming days, with Barclays forecasting the rupee at 72 to a dollar,
and DBS sees it at 71 by next June.
In the past seven years, whenever rupee depreciated more than 5 per cent, the Nifty return
has been in the range of negative 4-14 per cent. For instance, when in rupee depreciated between
April 2015 and February 2016 by 7 per cent to reach a low of 68.71, the Nifty dropped 14 per cent
in the same period.
Investors who closely keep an eye on correlation to hedge their portfolio would be
increasing their weightage to the IT and defensive stocks such as Kotak Mahindra, HDFC Bank,
Mahindra & Mahindra and Asian Paints. IT and select defensive stocks have typically
outperformed the Nifty during the period of Rupee depreciation.”
[(Optional) Exercise: Verify the calculations of Situation 0, and if needed see the Answer Key
given later]
Situation 1 (Demand Prediction, Method of Least Squares): A company has observed weekly
demand y (in thousands) for its product at three different prices x (in Rs hundreds) over three weeks
at a particular locality of a big city as shown below:
The sales manager wishes to find out how demand changes with price. To do this the business
analyst determines the line (y = a + bx) of best fit to these data points (x1,y1) = (5,50), (x2,y2) =
(10,30), (x3,y3) = (15,20) using the “LEAST SQUARES” method under the simple linear regression
(SLR) model: $Y_i = a + bX_i + e_i$, i = 1, 2, …, n, where the errors $e_i$ are assumed to arise
independently from a population with mean 0 and standard deviation σ. This method considers the sum
of squared vertical deviations of the observed (x = price, y = demand) points from a line of fit y = a + bx,
given by
$f(a,b) = [y_1 - (a + bx_1)]^2 + [y_2 - (a + bx_2)]^2 + [y_3 - (a + bx_3)]^2 = [50 - (a + 5b)]^2 + [30 - (a + 10b)]^2 + [20 - (a + 15b)]^2$.
One then determines the (a, b) that minimizes f(a,b).
(i) How “strong” is the linear relationship between demand and price? Quantify it. [−0.98]
(ii) Determine the slope and y-intercept of the best fitting line. [Demand = 63.333 − 3*Price]
(iv) How good is the fit (on a finite benchmark)? [Adjusted R² = 0.93]
(v) The manager wants to sell 40 units of the product in a particular week to reduce inventory
size. What price should be set to achieve the goal? [price = 7.78]
(vi) To check “independence” of regression errors (for time series data), calculate the
Durbin-Watson statistic (DW) value. [Note that the DW value may seem unreasonable in this problem
because we have taken only the minimum possible number of observations here.] [DW = 3]
(vii) Calculate the MAPE (mean absolute percentage error) based on the dataset (“training”
dataset) from which the model is estimated. [MAPE = 7.59%]
n = 3, $\sum X_i = 30$, $\sum Y_i = 100$, $\sum X_i^2 = 350$, $\sum X_i Y_i = 850$, $\sum Y_i^2 = 3800$, $\bar{X} = 10$, $\bar{Y} = 33.33$
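As a cross-check of the bracketed answers in Situation 1, the whole calculation can be sketched in plain Python from the three data points. MSE is taken as SSE/(n − 2), and MAPE as the average of |error|/actual on the training data; variable names are illustrative.

```python
from math import sqrt

x = [5, 10, 15]      # price (Rs hundreds)
y = [50, 30, 20]     # weekly demand (thousands)
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = sxy / sxx                    # least-squares slope
a = y_bar - b * x_bar            # least-squares intercept
r = sxy / sqrt(sxx * syy)        # correlation coefficient

resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in resid)
mse = sse / (n - 2)
r2 = 1 - sse / syy
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)
dw = sum((resid[i] - resid[i - 1]) ** 2 for i in range(1, n)) / sse
mape = 100 * sum(abs(e) / yi for e, yi in zip(resid, y)) / n
price_for_40 = (40 - a) / b      # invert the fitted line at demand = 40

print(round(a, 3), round(b, 3), round(r, 2), round(mse, 2))
print(round(adj_r2, 2), round(dw, 2), round(mape, 2), round(price_for_40, 2))
```

The residuals sum to zero, as they must for a least-squares line with an intercept, which is one of the properties checked in the next block of the handout.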
Checking Some Properties of Prediction Errors for y=a+bx model:
$\hat{y}_i = \hat{a} + \hat{b}x_i$ =
(iv) $R^2 = 1 - \frac{SSE}{\sum (Y_i-\bar{Y})^2}$ =
(iv) Adj $R^2 = 1 - [1 - R^2] \times \left(\frac{n-1}{n-2}\right)$ =
(vi) Durbin-Watson statistic, denoted by $DW = \frac{\sum_{i=2}^{n}(e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}$ =
Situation 2: (Analytics in Dispute Resolution) A large power utility firm employing thousands
of workers has been accused of discriminating against its female managers. The accusation is based
on a random sample of 100 managers. The mean annual salary of 38 female managers is $76,189,
whereas the mean annual salary of the 62 male managers is $97,832. It would appear to a layman
that male managers are being paid more than the female managers.
In rebuttal, the Managing Director of the firm points out that the company has a strict policy of
equal pay for equal work and that the difference may be due to some other factors. Accordingly,
he found and recorded the number of years of education and the number of years of experience for
each of the 100 managers in the sample. Also recorded are the salary and gender (0=female,
1=male). The Managing Director wanted to know whether a business analytics (regression
analysis) technique can help him resolve this effectively. Do you think the Managing Director
needs to provide data on some other factors/variables to resolve the dispute more effectively?
(i) How “strong” is the linear relationship between salary and gender? Quantify it.
(ii) Find the slope and y-intercept of the best fitting line of salary on gender.
(iii) How good is the fit? What is the interpretation of the slope of the best fitting line of salary on
gender?
Consider the following output from the MLR of salary on education, experience and gender:
(iv) (a) How good is the fit? (b) Quantify the relationship between salary and the “best” linear
combination of the 3 explanatory variables education, experience and gender.
(v) What does the 95% CI for coefficient of gender imply? [At 5% level of significance (alpha),
we can conclude that Gender is redundant (i.e., b3 = 0)]
Consider the following output from MLR of salary on education, experience, gender and
appraisal:
(vi) (a) How good is the fit? (b) Quantify the relationship between salary and the “best” linear
combination of the 4 explanatory variables education, experience, gender and appraisal.
(vii) Calculate the F statistic used to test whether all the explanatory variables (education,
experience, gender and appraisal) are redundant or not.
Consider the following output from MLR of salary on education, experience and appraisal:
(ix) Suppose a female manager has educational level = 15 and experience = 10 years and appraisal
rating = 7. (a) What is the “mean” salary for “similar” employees?
Situation 3 (Salary Prediction, Exercise). The salary (in appropriate units of money) of an
employee (Y) of a certain company is found to depend upon the number of months of his/her
experience (X). Suppose data on 9 employees of this company are as given below:
X 27 53 41 59 33 34 49 31 56
Suppose the HR manager is trying to recruit someone with 45 months of experience. What salary
should he/she offer to the interviewee, if found suitable?
n = 9, $\sum X_i = 383$, $\sum Y_i = 2495$, $\sum X_i^2 = 17443$, $\sum X_i Y_i = 114417$, $\sum Y_i^2 = 757257$
(i) How “strong” is the linear relationship between salary and experience? Quantify it.
(ii) Determine the slope and y-intercept of the best fitting line?
(iv) How good is the fit?
(v) Calculate the MAPE (mean absolute percentage error) based on the “training” sample using
which the model is estimated.
(vi) (a) Does the intercept estimate have an interpretation? (b) What is the interpretation of the
95% CI for slope [5.12, 9.29] ? (c) Is “experience” useful as an explanatory variable at 5% level?
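When only the summary sums are available, as in Situation 3, the least-squares quantities can be computed directly from them. A minimal sketch (the function name `slr_from_sums` is hypothetical):

```python
from math import sqrt

def slr_from_sums(n, sum_x, sum_y, sum_x2, sum_xy, sum_y2):
    """Simple linear regression of Y on X from summary sums only."""
    sxx = sum_x2 - sum_x ** 2 / n
    syy = sum_y2 - sum_y ** 2 / n
    sxy = sum_xy - sum_x * sum_y / n
    b = sxy / sxx                       # slope
    a = sum_y / n - b * sum_x / n       # intercept
    r = sxy / sqrt(sxx * syy)           # correlation
    mse = (syy - b * sxy) / (n - 2)     # SSE / (n - 2)
    return a, b, r, mse

a, b, r, mse = slr_from_sums(9, 383, 2495, 17443, 114417, 757257)
offer = a + b * 45                      # predicted salary at 45 months
print(round(a, 2), round(b, 2), round(r, 4), round(mse, 2), round(offer))
```

The predicted salary at 45 months of experience comes out at about 295 units of money.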
Situation 4 (Exercise): India is the largest market of two wheeler motorcycles and scooters.
Higher mileage is one of the important decision making parameters considered by customers in
India. A manufacturer always claims certain mileage for its product. This mileage is arrived at by
testing under ‘standard’ conditions. Some of these standard conditions are optimum speed
(approximately 50 kmph), only driver, appropriate air pressure of tyres and appropriate engine
temperature. In most of the day-to-day driving one experiences deviations from these standard
conditions. There are different automobile magazines and web portals, which compile user-
mileages from different motorcycle owners. These user mileages are collected under city drive
conditions and highway drive conditions. These user-mileage values vary across cities and
highways due to different road and traffic conditions. The following table contains data on mileage
given by manufacturer (X), city drive mileage, and highway drive mileage.
Ser. No.   Company Mileage   User Mileage in City   User Mileage on Highway
1          50                39                     41
2          55                41                     45
3          58                43                     45
4          60                46                     47
5          60                50                     53
6          63                46                     51
7          70                49                     62
8          74                54                     60
9          77                60                     67
10         79                57                     66
11         80                55                     70
12         82                60                     71
13         88                64                     74
14         95                68                     81
City:
n = 14, $\sum X_i = 991$, $\sum Y_i = 732$, $\sum X_i^2 = 72497$, $\sum X_i Y_i = 53318$, $\sum Y_i^2 = 39294$
Highway:
n = 14, $\sum X_i = 991$, $\sum Y_i = 833$, $\sum X_i^2 = 72497$, $\sum X_i Y_i = 61130$, $\sum Y_i^2 = 51617$
(i) (a) How “strong” is the linear relationship between “company mileage” and “user city
mileage”? Quantify it. (b) How “strong” is the linear relationship between “company mileage” and
“user highway mileage”?
(ii) Determine the slope and y-intercept of the best fitting line for the regression of (a) “user city
mileage” on “company mileage” and (b) “user highway mileage” on “company mileage”.
(iii) Which prediction of user-mileage do you expect to be more accurate -- for city or for highway?
[Hint. Consider Adjusted R²]
(iv) If a new motorcycle is being launched in the market with manufacturer mileage of X0= 100
kmpl, estimate the “mean” user mileage under both conditions.
(v) If a new motorcycle is being launched in the market with manufacturer mileage of X0= 90
kmpl, estimate the “mean” user mileage under both conditions.
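The two regressions of Situation 4 can be fitted from the table with a short Python sketch. The helper names `fit` and `predict` are illustrative, and adjusted R² is computed with k = 1 explanatory variable.

```python
from math import sqrt

company = [50, 55, 58, 60, 60, 63, 70, 74, 77, 79, 80, 82, 88, 95]
city    = [39, 41, 43, 46, 50, 46, 49, 54, 60, 57, 55, 60, 64, 68]
highway = [41, 45, 45, 47, 53, 51, 62, 60, 67, 66, 70, 71, 74, 81]

def fit(x, y):
    """Least-squares line of y on x, with r and adjusted R-square."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / sqrt(sxx * syy)
    adj_r2 = 1 - (1 - r ** 2) * (n - 1) / (n - 2)
    return {"intercept": intercept, "slope": slope, "r": r, "adj_r2": adj_r2}

def predict(model, x0):
    return model["intercept"] + model["slope"] * x0

city_fit, hw_fit = fit(company, city), fit(company, highway)
for x0 in (100, 90):
    print(x0, round(predict(city_fit, x0)), round(predict(hw_fit, x0)))
```

Note that X0 = 100 lies outside the observed range of company mileage (50 to 95), so that estimate is an extrapolation and should be read with more caution than the one at X0 = 90.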
Situation 5 (“Forecasting at Trent” Case) (Exercise): Trent is a major Indian Business Group
retail-operations company that owns and manages a number of retail chains in India. Established
in 1998, Trent runs a lifestyle chain, one of India’s largest and fastest growing chains of lifestyle
retail stores (Westside), a hypermarket chain (Star India Bazaar), a books and music chain
(Landmark), and a complete family fashion store (Fashion Yatra).
When Atul Jain, an MBA, joined the company (Trent) as a business analyst on April 1, 2008, his
boss Gaurav Rai assigned him the task of forecasting “Net Sales” (NS) for the company for the
June-end quarter of 2008. Atul learnt a few things from his Business Statistics course. He requested
his boss Gaurav for past Trent data on its Net Sales and a few other “explanatory” variables such
as Advertising Expense (AE) and Personnel Cost (PC). Before starting his analysis Atul arranged
the data in an Excel file as shown in Table 1. As can be seen from Table 1, Atul created two more
“explanatory” variables, variation in which he wished to exploit to capture/explain the variation in
Net Sales. These variables are LAG1, the previous quarter Net Sales, and TIME, sequential
counting of the quarters starting from December 2003 as ‘1’. [LAG1 means previous data value
with a lag of one time period; LAG2 means previous data value with a lag of two time periods and
so on.] The LAG1 value corresponding to Dec-03 was obtained from the knowledge of Net sales
figure of 40.13 in previous quarter Sep-03. Atul also found out from his boss that the company
budgeted/planned to spend Rs 10 crore and Rs 11 crore on AE and PC respectively during the
June-end quarter of 2008.
To find the “best” model, Atul formed a Training dataset (14 quarters of data till Mar-07), and the
predictive ability (MAPE) of the models is checked on the “Hold-Out” or “Validation” dataset
consisting of the last 4 quarters. The “best” model is then re-estimated on the basis of the entire
dataset of 18 quarters and is used to predict Net Sales for the Jun-08 quarter.
Table 1: Atul’s Data
(i) Atul found PC to be potentially the best single predictor of net sales. Do you agree? Explain.
(ii) If no data were available on the explanatory variable(s), one might still try a regression model
using “Time” as the explanatory variable and then improve upon that by capturing “seasonal (quarterly)”
effects, if any. Given the following graph of the sales data over time, what kind of models should
Atul try?
[Figure: Net Sales plotted over time, by quarter, Jun-04 to Mar-07]
(iii) Atul tried various regression models using the training dataset and prepared the following
summary table. Check whether his calculations are correct. Note that for predicting Sep 2007
quarter Atul expanded the training dataset from Dec 2003 to June 2007 to re-estimate model
coefficients, for predicting Dec 2007 quarter Atul expanded the training dataset from Dec 2003 to
Sep 2007 to re-estimate model coefficients, for predicting Mar 2008 quarter Atul expanded the
training dataset from Dec 2003 to Dec 2007.
[SLR-AE means simple linear regression model with AE as explanatory variable, MLR-AE-PC
means multiple linear regression model with AE and PC as explanatory variables, DVR-Time
means Dummy Variable regression model with Time and Q1, Q2, Q3 as explanatory variables,
MLR-PC-PC^2 means multiple linear regression model with PC and PC^2 as explanatory
variables. MAPE (Mean Absolute Percentage Error) calculation: One may use the same estimated
model based on the training dataset of 14 observations to predict for the Jun-07, Sep-07, Dec-07 and
Mar-08 quarters.]
(iv) Re-estimate the “best” model on the basis of the entire dataset of all 18 quarters and thereby
predict the net sales for June-2008 quarter. [Sales = 2.395 +0.577*Lag1+2.575*AE+3.258*PC or Sales =
0.6*Lag1+2.666*AE+3.166*PC; Find values for Lag1 = ??, AE= ??, PC =??].
Problems for Practice:
Situation 7: Let Yi = return for a certain stock on day i, and Xi = return for Sensex on day i, over
20 trading days. Summarized data are given below:
$\sum X_i = 12.89$, $\sum Y_i = 25.67$, $\sum X_i^2 = 153.22$, $\sum X_i Y_i = 142.77$, $\sum Y_i^2 = 251.72$
Suppose one fits the regression model Y = bX, with zero intercept, to the data. Calculate the
estimated slope.
[Hint. (a) Multiply the regression equation ($y_i = b x_i$) by the explanatory variable $x_i$ and
then sum over i = 1,…,n to get one equation in one unknown; or (b) minimize the sum of squared
errors $\sum_{i=1}^{n}(y_i - b x_i)^2$ w.r.t. b. Either way, one obtains the following
equation to solve: $\sum_{i=1}^{n} x_i y_i = b\left(\sum_{i=1}^{n} x_i^2\right)$.]
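The equation in the hint gives the zero-intercept slope as the ratio of two of the given sums. A tiny Python check, using only the summarized data of Situation 7:

```python
sum_x2 = 153.22   # sum of X_i squared
sum_xy = 142.77   # sum of X_i * Y_i

# Normal equation for Y = bX:  sum(x*y) = b * sum(x^2)
b_zero_intercept = sum_xy / sum_x2
print(round(b_zero_intercept, 4))
```

Only two of the five sums are needed here; the others would be required if an intercept were included in the model.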
Situation 8 (Exc): Let Yi = advertising expense for company B in quarter i, and Xi = advertising
expense for company A in quarter i, over 15 quarters. Suppose a simple linear regression of Y on
X gives adjusted R-square value of 0.6. Then calculate the absolute value of the correlation
coefficient between X and Y.
[Hint. Adj $R^2 = 1 - (1 - R^2)\left(\frac{n-1}{n-k-1}\right) \iff R^2 = 1 - (1 - \text{Adj } R^2)\left(\frac{n-k-1}{n-1}\right) \Rightarrow |r| = \sqrt{R^2}$ ]
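The conversion in the hint can be sketched as a small Python function (the name `abs_r_from_adj_r2` is illustrative). The last step, |r| = √R², is valid only for simple linear regression, i.e. k = 1.

```python
from math import sqrt

def abs_r_from_adj_r2(adj_r2, n, k=1):
    """Recover |r| from adjusted R-square for a regression with k regressors."""
    r2 = 1 - (1 - adj_r2) * (n - k - 1) / (n - 1)
    return sqrt(r2)   # equals |r| only when k = 1

r8 = abs_r_from_adj_r2(0.6, n=15)
print(round(r8, 4))
```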
Situation 9 (Exc): Let Yi = advertising expense for company B in quarter i, and Xi = advertising
expense for company A in quarter i, over 3 quarters. Let ei denote the prediction error in quarter i,
in the simple linear regression of Y on X. Suppose that x1 = 3, x2 = 2, x3 = 1, in appropriate units;
and a fitted simple linear regression of Y on X gives value of prediction error e1 = 0.1 for quarter
1. Then the value of (e3 – e2) is given by … .
[Hint. For simple linear regression with k=1, errors satisfy (k+1)=2 conditions $\sum_{i=1}^{n} e_i = 0$
and $\sum_{i=1}^{n} x_i e_i = 0$. Here, n=3, e1 + e2 + e3 = 0 and 3e1 + 2e2 + e3 = 0]
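Substituting e1 = 0.1 into the two conditions of the hint leaves two equations in two unknowns; the elimination can be written out as:

```latex
\begin{aligned}
e_2 + e_3 &= -e_1 = -0.1, \\
2e_2 + e_3 &= -3e_1 = -0.3, \\
\text{subtracting:}\quad e_2 &= -0.2, \qquad e_3 = -0.1 - e_2 = 0.1, \\
e_3 - e_2 &= 0.1 - (-0.2) = 0.3 .
\end{aligned}
```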
Situation 10 (Exc): The following sales and price data (in appropriate monetary units) have been
collected from a range of discount stores selling a particular product and the regression results
obtained are as given below:
Then it follows that if price increases by 2 units, the sales of the product is expected to ... …?
Situation 11 (Exc): The starting salary (Y) after graduation for marketing majors in a Business
School is thought to be related to their grade point average (GPA) in major courses. For randomly
selected graduates of the last batch the following dataset was obtained:
Graduate No.   1     2     3     4     5     6     7
GPA            3.3   2.6   3.4   2.9   3.8   2.2   3.5
Salary         34    30    33    30    36    28    35
The regression of starting salary on GPA produces an Adjusted R Square value of 0.92. Then calculate
the absolute value of the correlation coefficient between salary and GPA.
[Hint. Adj $R^2 = 1 - (1 - R^2)\left(\frac{n-1}{n-k-1}\right) \iff R^2 = 1 - (1 - \text{Adj } R^2)\left(\frac{n-k-1}{n-1}\right) \Rightarrow |r| = \sqrt{R^2}$ ]
Situation 12 (Exc): The total monthly sales of a certain product (Y) is found to depend upon the
number of sales professionals employed by the company (X) in that month. The following
summarized information is available:
To examine the relationship between total sales and number of sales professionals, one may
consider the simple linear regression model: Y=a+bX+error
(ii) Suppose the number of sales professionals employed by the company for the current month
was found out to be 45. Predict the value of sales for the current month.
[ Hint. Adj $R^2 = 1 - \frac{SSE/(n-k-1)}{TSS/(n-1)} = 1 - [1 - R^2] \times \left(\frac{n-1}{n-k-1}\right)$ ]
(v) Calculate the estimated standard deviation of the regression equation error.
df = (n-k-1).]
(vi) Find the regression equation of the total sales figures on the number of sales professionals
based on the above data. Suppose the Durbin Watson statistic value for the regression is 1.5915.
Then the approximate correlation between the residual and its first lag is given by:
[ Hint. $DW = \frac{\sum_{i=2}^{n}(e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2} = \frac{\sum_{i=2}^{n} e_i^2}{\sum_{i=1}^{n} e_i^2} + \frac{\sum_{i=2}^{n} e_{i-1}^2}{\sum_{i=1}^{n} e_i^2} - 2\,\frac{\sum_{i=2}^{n} e_i e_{i-1}}{\sum_{i=1}^{n} e_i^2}$ ]
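The decomposition in the hint leads to the usual large-sample approximation linking DW to the lag-1 correlation of the residuals. A short worked version, under the assumption that n is large enough for the first two ratios to be close to 1:

```latex
\begin{aligned}
\frac{\sum_{i=2}^{n} e_i^2}{\sum_{i=1}^{n} e_i^2} \approx 1, \qquad
\frac{\sum_{i=2}^{n} e_{i-1}^2}{\sum_{i=1}^{n} e_i^2} \approx 1, \qquad
r_1 &= \frac{\sum_{i=2}^{n} e_i e_{i-1}}{\sum_{i=1}^{n} e_i^2}
\;\Longrightarrow\; DW \approx 2(1 - r_1), \\
r_1 &\approx 1 - \frac{DW}{2} = 1 - \frac{1.5915}{2} = 0.20425 \approx 0.2043 .
\end{aligned}
```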
Situation 13 (Exc). Consider the following ANOVA table obtained as part of a simple linear
regression (that includes an intercept term) analysis output:
ANOVA
df SS MS F Significance F
Total 9 30.4
(ii) Calculate the value of adjusted R2 using the information of the ANOVA table.
[ Hint. Adj $R^2 = 1 - \frac{SSE/(n-k-1)}{TSS/(n-1)} = 1 - [1 - R^2] \times \left(\frac{n-1}{n-k-1}\right)$ ]
(iii) Calculate the estimated value of standard deviation of the regression model error using
the above ANOVA table.
[ Hint. estimated standard deviation of the regression equation error = $\sqrt{MSE}$ where
MSE = Mean Square Error = SSE/(n-k-1)]
(iv) Calculate the sample variance of the response/dependent variable from the above data.
[ Hint. TSS/(n−1); why?]
Situation 14 (Exc). Suppose the following two estimated regression equations are obtained from
data on monthly demand of a product (y) and its monthly price (x), over 20 months, in appropriate
units: (a) x + 6y = 4 and (b) 3x + 4y = 8. One of the regression equations is obtained by regressing
y on x and the other by regressing x on y. Then calculate:
(i) The correlation coefficient between monthly demand and monthly price.
[ Hint. Usually, it is price that impacts demand. But if the demand continues to remain
subdued, it would induce the seller to change the price. Thus, demand (y) may also impact
price (x), and one may regress x on y to see the effect of y on x.
Regression of y on x, and that of x on y, respectively produce the equations:
$y = \bar{y} + r\frac{S_y}{S_x}(x - \bar{x})$ and $x = \bar{x} + r\frac{S_x}{S_y}(y - \bar{y}) \iff y - \left(\frac{1}{r}\frac{S_y}{S_x}\right)x = \bar{y} - \left(\frac{1}{r}\frac{S_y}{S_x}\right)\bar{x}$.
Here, we are given the final forms of two estimated equations without any raw or
summarized data. We need to figure out which one is the regression equation of y on x and
which one of x on y.
Supposing (a) is the regression of y on x and (b) that of x on y,
(−1/6)(−4/3) = r² ⟹ 4/18 = r² ⟹ r = −√(4/18); it is negative since r has the same sign as
the slope of the regression line. If you start with supposing the opposite then r² would come
out as (−6)(−3/4) = 18/4 > 1, which is not possible. ]
[ Hint. $r\frac{S_y}{S_x}$ for y on x, and $r\frac{S_x}{S_y}$ for x on y ]
(iii) $\bar{x}$, the average monthly price of the product, and $\bar{y}$, the average monthly demand of the
product.
[Hint. The point $(\bar{x}, \bar{y})$ falls on the estimated simple linear regression line of y on x [i.e.,
$y = \bar{y} + r\frac{S_y}{S_x}(x - \bar{x})$ ] as well as the estimated simple linear regression line of x on y [i.e.,
$x = \bar{x} + r\frac{S_x}{S_y}(y - \bar{y})$ ]. Hence just solve the two equations x + 6y = 4 and 3x + 4y = 8 to obtain
$(\bar{x}, \bar{y})$.]
15. (omitted)
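The reasoning used for Situation 14 (try both assignments, keep the one with r² ≤ 1, then intersect the two lines to get the means) can be sketched in Python. The function name `analyse` and the (p, q, c) encoding of a line px + qy = c are choices made for this illustration only; if both assignments happened to give r² ≤ 1 the sketch would simply keep the first.

```python
from fractions import Fraction
from math import sqrt

def analyse(line1, line2):
    """Each line is (p, q, c), meaning p*x + q*y = c."""
    for y_on_x, x_on_y in ((line1, line2), (line2, line1)):
        b_yx = Fraction(-y_on_x[0], y_on_x[1])   # slope when solved for y
        b_xy = Fraction(-x_on_y[1], x_on_y[0])   # slope when solved for x
        r2 = b_yx * b_xy                         # product of slopes = r^2
        if 0 <= r2 <= 1:
            break
    else:
        raise ValueError("neither assignment gives a valid r^2")
    r = sqrt(r2) if b_yx > 0 else -sqrt(r2)      # r takes the sign of the slopes
    # Both lines pass through (x_bar, y_bar): solve by Cramer's rule.
    (p1, q1, c1), (p2, q2, c2) = line1, line2
    det = p1 * q2 - p2 * q1
    x_bar = Fraction(c1 * q2 - c2 * q1, det)
    y_bar = Fraction(p1 * c2 - p2 * c1, det)
    return {"b_yx": b_yx, "b_xy": b_xy, "r": r, "x_bar": x_bar, "y_bar": y_bar}

res14 = analyse((1, 6, 4), (3, 4, 8))   # (a) x + 6y = 4, (b) 3x + 4y = 8
print(res14["b_yx"], res14["b_xy"], round(res14["r"], 4),
      float(res14["x_bar"]), float(res14["y_bar"]))
```

Exact fractions are used for the slopes and means so that the check r² ≤ 1 is not affected by rounding.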
16 (Exc). Suppose the following two estimated regression equations are obtained from monthly
data on demand (y) of a product and its price (x), over n=11 months, in appropriate units: (a) 3x
+ 4y = 6 and (b) x + 3y = 3. One of these regression equations is obtained by regressing y on x and
the other by regressing x on y.
[ Hint. Supposing (a) is the regression of y on x and (b) that of x on y,
(−3/4)(−3) = r² ⟹ r² = 9/4 > 1; this is not possible. If you start with supposing the opposite
then r² would come out as (−4/3)(−1/3) = 4/9 ⟹ r = −√(4/9) = −2/3; it is negative
since r has the same sign as the slope of the regression line]
(ii) Calculate the absolute value of the estimated regression line slope for demand on price.
[ Hint. x+ 3y= 3 is the estimated regression line for demand (y) on price (x) ]
[ Hint. The point $(\bar{x}, \bar{y})$ falls on the estimated simple linear regression line of y on x [i.e.,
$y = \bar{y} + r\frac{S_y}{S_x}(x - \bar{x})$ ] as well as the estimated simple linear regression line of x on y [i.e.,
$x = \bar{x} + r\frac{S_x}{S_y}(y - \bar{y})$ ]. Hence just solve the two equations 3x + 4y = 6 and x + 3y = 3 to
obtain $(\bar{x}, \bar{y})$.]
(iv) Calculate the ratio of the sample standard deviation for demand to that for price.
[ Hint. $\frac{S_y}{S_x} = \left(r\frac{S_y}{S_x}\right)/r = (-1/3)/(-2/3)$ ]
(v) Suppose $\sum_{i=1}^{n} x_i^2 = 24$, $\left(\sum_{i=1}^{n} x_i\right)/n = 1.2$, (n = 11). Then, calculate the sample variance for price.
[ Hint. $s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right)$ ]
(vi). Calculate the ‘coefficient of determination’ for the regression of price on demand.
[Hint. R2 = r2 ]
(vii) Suppose sample variance for demand is 0.2040. Then calculate the estimated variance of the
regression equation error for regressing demand on price.
[Hint. MSE = SSE/(n-k-1), $SSE = \sum_{i=1}^{n}(Y_i - \bar{Y})^2 - \hat{b}^2\sum_{i=1}^{n}(X_i - \bar{X})^2$; $\sum_{i=1}^{n}(Y_i - \bar{Y})^2 = (n-1)s_y^2$; $s_y^2 = 0.2040$; $\sum_{i=1}^{n}(X_i - \bar{X})^2$ is calculated in (v)]
ANSWER KEY to
• In the second paragraph, the opening price of rupee against the US dollar on 01/01/2018
and 02/07/2018 has been used to compute the depreciation. Also, the lowest Nifty index
on 01/01/2018 and 02/07/2018 has been used to compute the rise in Nifty.
• In the fifth paragraph, 69.09 is the maximum value of rupee against US dollar on
28/06/2018.
• In the sixth paragraph, the depreciation in rupee has been computed using the lowest value
of rupee on 30/04/2015 and 29/02/2016. The data for the computation has been obtained
from https://m.in.investing.com/currencies/usd-inr-chart. Also, the drop in Nifty has been
computed using the opening index on 30/04/2015 and 29/02/2016.
Nifty Open High Low Close
29-Feb-16 7050.45 7094.6 6825.8 6987.05
30-Apr-15 8224.5 8229.4 8144.75 8181.5
14.3% 13.8% 16.2% 14.6%
Situation 1. (i) −0.98, (ii) −3, 63.333, (iii) MSE = 16.67, (iv) Adjusted R2 = 0.93, (v) price=7.78,
(vi) DW = 3; (vii) MAPE = 7.59%
(2). (i) r = 0.36 (gender), r = 0.82 (experience), (ii) a = 76,189, b = 21,643, (iii) Adj R² = 0.12, (iv)
(a) Adj R² = 0.68, (b) “Multiple correlation coefficient” = 0.83 [= correlation between Y and $\hat{Y}$,
which is the best linear combination (function) of explanatory variables]; (v) gender can be said
to be redundant as an explanatory variable with 95% ‘level of confidence’ since the ‘P-value’ =
0.62 > 0.05; (vi) (a) Adj R² = 0.92, (b) “Multiple correlation coefficient” = 0.96 [= correlation
between Y and $\hat{Y}$, which is the best linear combination (function) of explanatory variables]. (vii)
F = 284.69; (viii) gender can be said to be redundant as an explanatory variable with 95% ‘level
of confidence’ since the ‘P-value’ = 0.3 > 0.05; (ix) (a) 87163;
(3). (i) r = 0.95, (ii) a= -29.27, b=7.20, (iii) MSE = 890.75; (iv) Adj R2 = 0.89, (v) MAPE =
10%, (vi) (b) If a candidate’s experience is 1 month higher, we are ‘95% confident’ that the salary
would be higher by a number between 5.12 and 9.29 (units of money); (c) Experience is useful at
the 5% ‘level’ since the ‘P-value’ = 8.01262E-05 < 0.05.
(4) (i)(a) 0.971; (i) (b) 0.986; (ii) (a) City: intercept=6.986, slope=0.640; (b) Highway:
intercept=-5.774, slope=0.922; (iii) For Highway (why?); (iv) 71 (City); 86 (Highway); (v) 65
(City); 77 (Highway)
5(i) corr(Net Sales, PC) = 0.97; (iv) Net sales = 2.395 + 2.575*AE + 3.258*PC + 0.577*Lag1; Net
Sales value for Jun-08 quarter is 138.32 (for AE=10, PC=11, LAG1=128.9 for June-2008 quarter)
[one may re-estimate this model by setting the constant term to zero and see how different the
predicted Net Sales value for the Jun-08 quarter from this model is].
(8) 0.7928 [Adj $R^2 = 1 - (1 - R^2)\left(\frac{n-1}{n-k-1}\right) \iff R^2 = 1 - (1 - \text{Adj } R^2)\left(\frac{n-k-1}{n-1}\right) \Rightarrow |r| = \sqrt{R^2}$ ]
(9) 0.3
(11). 0.97
12(i) 0.9513; (ii) 295, (Y = −29.27 + 7.20*X, X=45); (iii) 0.9049; (iv) 0.8914; (v) 29.8454; (vi)
0.2043
14(i) -0.4714; (ii) (a) -1/6, (b) -4/3.; (iii) 𝑦𝑦� = 0.2857, 𝑥𝑥̅ = 2.2857;
15 (a) y= 93.7273 + 4(x) + 7.3579(z); (b) y=97.0947+ 4x +7.3529(z) - 4.3579(z2); (c) 97555.48;
(d) 0.8015
16 (i) −2/3; (ii) 1/3 (absolute value); (iii) 3/5; (iv) 1/2; (v) 0.816; (vi) 4/9; (vii) 0.1259 (check)
0.1385; $SSE = \sum_{i=1}^{n}(Y_i - \bar{Y})^2 - \hat{b}^2\sum_{i=1}^{n}(X_i - \bar{X})^2 = 2.04 - (-1/3)^2(8.16) = 1.133333$; MSE =
SSE/(n-k-1) = 1.133333/(11-1-1) = 0.125926.
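The chain of calculations for parts (v) and (vii) of problem 16 can be replayed in Python. The inputs are those stated in the exercise and the answers above (n = 11, slope of demand on price = −1/3, r = −2/3); the last line prints an independent check using SSE = (1 − r²)·TSS.

```python
n = 11
r = -2 / 3                 # correlation, part (i)
b_yx = -1 / 3              # slope of demand (y) on price (x), part (ii)

sum_x2, x_bar = 24, 1.2
var_x = (sum_x2 - n * x_bar ** 2) / (n - 1)   # part (v): sample variance of price
var_y = 0.2040                                # given in part (vii)

tss = (n - 1) * var_y                         # sum of squared deviations of y
sxx = (n - 1) * var_x                         # sum of squared deviations of x
sse = tss - b_yx ** 2 * sxx
mse = sse / (n - 1 - 1)                       # k = 1 regressor

print(round(var_x, 3), round(sse, 6), round(mse, 6))
print(round((1 - r ** 2) * tss, 6))           # same SSE via 1 - r^2
```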
APPENDIX 1 (Optional)
(0) Let Z be a standard normal variate. For any α lying between 0 and 1, $z_\alpha$ is a real number
such that $P(Z > z_\alpha) = \alpha$. Tables showing $z_\alpha$ against α have already been constructed and are
available. For example, $z_{0.005} = 2.58$, $z_{0.01} = 2.33$, $z_{0.025} = 1.96$, $z_{0.05} = 1.645$, $z_{0.10} = 1.28$
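The tabulated values can be reproduced without tables. Since $z_\alpha$ is defined through the upper-tail probability, it is the (1 − α) quantile of N(0,1); the helper name `z_upper` is illustrative.

```python
from statistics import NormalDist

def z_upper(alpha):
    """Return z such that P(Z > z) = alpha for a standard normal Z."""
    return NormalDist().inv_cdf(1 - alpha)

table = {alpha: round(z_upper(alpha), 3) for alpha in (0.005, 0.01, 0.025, 0.05, 0.10)}
for alpha, z in table.items():
    print(alpha, z)
```

In Excel the same values come from =NORMSINV(1−α), or equivalently −NORMSINV(α) by symmetry.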
(1) Chi-square distribution: It is used to approximate the distribution of the sample variance.
Let Z1 ,....Z n be independent random variables each having a standard normal distribution. Then
Y = Z12 + .... + Z 2n is said to follow chi-square ( χ 2 ) with n degrees of freedom (df). The pdf
(probability density function) of Y is given by
y n
1 − −1
f(y) = n
e 2
y 2
if 0 < y < ∞
2 Γ( )
2 n
2
where Γ(α) represents the gamma function. For α > 0, $\Gamma(\alpha) = \int_0^\infty x^{\alpha-1} e^{-x}\,dx$ is called the gamma
function.
• Γ(1/2) = √π, Γ(1) = 1,
• Γ(α) = (α−1)Γ(α−1) for α >1 and
• Γ(n) = (n−1)! for a positive integer n.
Result: The expected value and variance of a $\chi^2_n$ random variable are given by:
$E(\chi^2_n) = n$, $\mathrm{Var}(\chi^2_n) = 2n$.
For given n and α, let $\chi^2_{\alpha;n}$ denote the real number which is exceeded with probability α by a
chi-square random variable having n df. Tables are available showing $\chi^2_{\alpha;n}$ for various values of
α and n.
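The density and the two moment results can be checked numerically. The sketch below codes the chi-square pdf from its formula (with `math.gamma` for Γ) and approximates the total probability, mean and variance by a midpoint rule. The function names, the choice n = 5, and the integration settings are illustrative assumptions.

```python
import math


def chi2_pdf(y: float, n: int) -> float:
    """Chi-square density with n df, coded directly from the pdf formula."""
    if y <= 0:
        return 0.0
    return math.exp(-y / 2) * y ** (n / 2 - 1) / (2 ** (n / 2) * math.gamma(n / 2))


def chi2_moments(n: int, upper: float = 100.0, steps: int = 100_000):
    """Midpoint-rule approximations of total probability, mean and variance."""
    h = upper / steps
    total = mean = second = 0.0
    for i in range(steps):
        y = (i + 0.5) * h
        w = chi2_pdf(y, n) * h      # probability mass of this small slice
        total += w
        mean += y * w
        second += y * y * w
    return total, mean, second - mean ** 2


total, mean, var = chi2_moments(5)
# These should be close to 1, n = 5 and 2n = 10 respectively.
print(total, mean, var)
```

The upper limit of 100 is adequate for small n only; for larger df the integration range would need to grow.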
(2) t distribution: It is used to approximate the distribution of the ratio of sample mean to sample
standard deviation (i.e., reciprocal of the coefficient of variation). It is heavily used in construction
of interval-estimates and in testing of hypothesis, for example in linear regression analysis. The t
distribution is symmetric, bell-shaped like the standard normal distribution but its tails are much
thicker.
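Let Z be a standard normal variate and Y be a chi-square variate with n df, such that Z and Y are
independent. Let $t = \dfrac{Z}{\sqrt{Y/n}}$. Then the random variable t has what is called the t-distribution with
n df. The pdf of t is
$$f(t) = \frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{n}\;\Gamma\left(\frac{1}{2}\right)\Gamma\left(\frac{n}{2}\right)} \cdot \frac{1}{\left(1 + \frac{t^2}{n}\right)^{\frac{n+1}{2}}}, \quad -\infty < t < \infty.$$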
Result: The expected value and variance of a $t_n$ random variable are given by:
$E(t_n) = 0$, if n > 1,
$\mathrm{Var}(t_n) = \dfrac{n}{n-2}$, if n > 2.
For any given α and n, let $t_{\alpha;n}$ be the real number which is exceeded with probability α by a random
variable following the t distribution with n df. Available tables show $t_{\alpha;n}$ against α and n. The t
distribution converges to the standard normal distribution as n → ∞. Hence it can be shown that $t_{\alpha;n} \to z_\alpha$
as n → ∞ with α held fixed.
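Both the thicker tails and the convergence to the standard normal can be seen by evaluating the two densities side by side. The sketch below codes the t pdf on the log scale with `math.lgamma`, so that large df do not overflow; the function names are illustrative.

```python
import math


def t_pdf(t: float, n: int) -> float:
    """Density of the t distribution with n df (uses Gamma(1/2) = sqrt(pi))."""
    log_const = (math.lgamma((n + 1) / 2) - math.lgamma(n / 2)
                 - 0.5 * math.log(n * math.pi))
    return math.exp(log_const - (n + 1) / 2 * math.log1p(t * t / n))


def normal_pdf(z: float) -> float:
    """Standard normal density."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)


# Thicker tails: at t = 3 the t density with 5 df exceeds the normal density.
print(t_pdf(3.0, 5), normal_pdf(3.0))

# Convergence: the gap at the centre shrinks as the df grow.
for n in (1, 5, 30, 1000):
    print(n, abs(t_pdf(0.0, n) - normal_pdf(0.0)))
```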
(3) F distribution: It is used to approximate the distribution of the ratio of sample variances for
data coming from two populations, and thereby to compare the variability of the populations (e.g.,
volatility of returns for two financial assets).
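Let $Y_1$ and $Y_2$ be independent random variables having $\chi^2$ distributions with $n_1$ and $n_2$ df,
respectively. Then $F = \dfrac{Y_1/n_1}{Y_2/n_2}$ follows the F-distribution with $n_1$ and $n_2$ df. Here $n_1$ is referred to
as the numerator df and $n_2$ as the denominator df. The pdf of the F-distribution is given by
$$g(f) = \frac{\left(\frac{n_1}{n_2}\right)^{n_1/2}}{B\left(\frac{n_1}{2}, \frac{n_2}{2}\right)} \cdot \frac{f^{\frac{n_1}{2}-1}}{\left(1 + \frac{n_1}{n_2} f\right)^{\frac{n_1+n_2}{2}}}, \quad \text{if } 0 < f < \infty,$$
where
$$B(a, b) = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a+b)}, \quad a > 0,\ b > 0.$$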
Result:
(i). The expected value and variance of an $F_{n_1,n_2}$ random variable are given by:
$E(F_{n_1,n_2}) = \dfrac{n_2}{n_2 - 2}$, if $n_2 > 2$;
$\mathrm{Var}(F_{n_1,n_2}) = \dfrac{2 n_2^2 (n_1 + n_2 - 2)}{n_1 (n_2 - 2)^2 (n_2 - 4)}$, if $n_2 > 4$.
For any given α, $n_1$ and $n_2$, $F_{\alpha;n_1,n_2}$ is the real number which is exceeded with probability α by a
random variable following the F distribution having $n_1$ and $n_2$ degrees of freedom. Available tables
show $F_{\alpha;n_1,n_2}$ against α, $n_1$ and $n_2$.
(ii). $F_{1-\alpha;n_2,n_1} = \dfrac{1}{F_{\alpha;n_1,n_2}}$
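The construction of F from two independent chi-square variates, each itself a sum of squared standard normals, can be mimicked by simulation and compared with the formula for E(F). This is a rough sketch only: the seed, the sample size, the df values (6 and 12) and all function names are arbitrary illustrative choices.

```python
import random


def chi2_draw(rng: random.Random, n: int) -> float:
    """One chi-square(n) draw as a sum of n squared standard normals."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))


def f_draw(rng: random.Random, n1: int, n2: int) -> float:
    """One F(n1, n2) draw as the ratio of two scaled chi-square draws."""
    return (chi2_draw(rng, n1) / n1) / (chi2_draw(rng, n2) / n2)


def f_mean(n2: int) -> float:
    """E(F) = n2 / (n2 - 2), valid for n2 > 2."""
    return n2 / (n2 - 2)


def f_var(n1: int, n2: int) -> float:
    """Var(F) from the result above, valid for n2 > 4."""
    return 2 * n2 ** 2 * (n1 + n2 - 2) / (n1 * (n2 - 2) ** 2 * (n2 - 4))


rng = random.Random(2024)          # fixed seed so the run is repeatable
n1, n2 = 6, 12
draws = [f_draw(rng, n1, n2) for _ in range(20_000)]
sample_mean = sum(draws) / len(draws)
print(sample_mean, f_mean(n2), f_var(n1, n2))
```

The simulated mean should land near $n_2/(n_2 - 2)$; it does not depend on $n_1$, as the formula indicates.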
APPENDIX 2 (Optional)
Assume that the $e_i$'s are independent and normally distributed, $e_i \sim N(0, \sigma^2)$.
Then, if we treat the $X_i$ values as fixed (not random), $Y_i \sim N(a + bX_i, \sigma^2)$ and are independent.
Define $\bar{X} = \dfrac{\sum X_i}{n}$, $\bar{Y} = \dfrac{\sum Y_i}{n}$,
$$SS_X = \sum (X_i - \bar{X})^2 = \left(\sum X_i^2 - \frac{(\sum X_i)^2}{n}\right); \quad SS_Y = \sum (Y_i - \bar{Y})^2 = \left(\sum Y_i^2 - \frac{(\sum Y_i)^2}{n}\right)$$
$$\hat{b} = r\left(\frac{S_Y}{S_X}\right) \tag{3}$$
$$\hat{a} = \bar{Y} - \hat{b}\bar{X} = \bar{Y} - \left(r\,\frac{S_Y}{S_X}\right)\bar{X} = \text{estimate of intercept } a \tag{4}$$
$$\hat{Y} = \hat{a} + \hat{b}X = \bar{Y} + \hat{b}(X - \bar{X}) \tag{5}$$
Note that the regression line goes through the point $(\bar{X}, \bar{Y})$.
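Equations (3) to (5) translate directly into a few lines of code. The sketch below uses a small made-up data set (not from the handout) and takes $S_X$, $S_Y$ to be the sample standard deviations with divisor n − 1; the ratio $S_Y/S_X$ is the same with divisor n. The function name `fit_line` is illustrative.

```python
import math


def fit_line(xs, ys):
    """Least-squares intercept and slope via equations (3) and (4)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    r = s_xy / math.sqrt(ss_x * ss_y)          # sample correlation
    s_x = math.sqrt(ss_x / (n - 1))            # sample standard deviations
    s_y = math.sqrt(ss_y / (n - 1))
    b_hat = r * s_y / s_x                      # equation (3)
    a_hat = y_bar - b_hat * x_bar              # equation (4)
    return a_hat, b_hat, r, x_bar, y_bar


# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a_hat, b_hat, r, x_bar, y_bar = fit_line(xs, ys)
print(a_hat, b_hat, r)

# Equation (5): the fitted value at X = x_bar equals y_bar.
print(a_hat + b_hat * x_bar, y_bar)
```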
If $Y = (b^*)X$, with $a = 0$, then
$$\hat{b}^* = \frac{\sum Y_i X_i}{\sum X_i^2} \tag{6}$$
and $\sigma^2$ estimated by $\hat{\sigma}^2 = \dfrac{\sum (Y_i - \hat{Y}_i)^2}{n-2}$ [= Mean Squared Error (MSE)] in (7) gives $\hat{V}(\hat{b})$.
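$$\hat{a} = \bar{Y} - \hat{b}\bar{X} = \frac{\sum Y_i}{n} - \sum (c_i \bar{X}) Y_i = \sum \left(\frac{1}{n} - c_i \bar{X}\right) Y_i = \sum d_i Y_i, \quad \text{where } d_i = \left(\frac{1}{n} - c_i \bar{X}\right)$$
Here the $c_i$ are the weights in $\hat{b} = \sum c_i Y_i$, i.e. $c_i = (X_i - \bar{X}) / \sum (X_j - \bar{X})^2$, so that $\sum c_i = 0$ and $\sum c_i^2 = 1 / \sum (X_i - \bar{X})^2$.
Therefore,
$$V(\hat{a}) = \sum d_i^2\, \sigma^2 = \left(\frac{1}{n} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2}\right) \sigma^2 \tag{8}$$
and $\sigma^2$ estimated by $\hat{\sigma}^2 = \dfrac{\sum (Y_i - \hat{Y}_i)^2}{n-2}$ in (8) gives $\hat{V}(\hat{a})$.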
$$\mathrm{cov}(\hat{a}, \hat{b}) = \mathrm{cov}\left(\sum d_i Y_i, \sum c_i Y_i\right) = \sum_{i=1}^{n} c_i d_i\, \sigma^2 = \left(-\sum c_i^2 \bar{X}\right) \sigma^2 = \sigma^2 \left(\frac{-\bar{X}}{\sum (X_i - \bar{X})^2}\right) \tag{9}$$
$$\mathrm{corr}(\hat{a}, \hat{b}) = \frac{-\bar{X}}{\sqrt{\dfrac{\sum X_i^2}{n}}} \tag{9a}$$
$$\hat{a} \mp t_{(n-2),\alpha/2} \sqrt{\hat{V}(\hat{a})} \tag{10a}$$
where $t_{(n-2),\alpha/2}$ = 100(1 − α/2)-th percentile of the t distribution with (n−2) 'degrees of freedom'.
If α = .05, then $t_{(n-2),\alpha/2}$ = 97.5th percentile of the t distribution with (n−2) degrees of freedom.
$$\hat{b} \mp t_{(n-2),\alpha/2} \sqrt{\hat{V}(\hat{b})} \tag{10b}$$
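The interval formulas (10a) and (10b) can be sketched as follows. The code assumes the standard result $V(\hat{b}) = \sigma^2 / \sum (X_i - \bar{X})^2$ for the slope, uses equation (8) for the intercept, and takes the critical value as an input rather than computing it. The data are made up, and 3.182 is the usual table value of the 97.5th percentile of the t distribution with 3 df. All names are illustrative.

```python
import math


def slr_intervals(xs, ys, t_crit):
    """Estimates, estimated variances and the intervals (10a) and (10b)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    # S_xy / SS_X is algebraically the same slope as r * S_Y / S_X.
    b_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ss_x
    a_hat = y_bar - b_hat * x_bar
    sse = sum((y - (a_hat + b_hat * x)) ** 2 for x, y in zip(xs, ys))
    mse = sse / (n - 2)                          # estimate of sigma^2
    var_b = mse / ss_x                           # estimated V(b_hat)
    var_a = mse * (1 / n + x_bar ** 2 / ss_x)    # estimated V(a_hat), eq. (8)
    half_a = t_crit * math.sqrt(var_a)
    half_b = t_crit * math.sqrt(var_b)
    return {
        "a_hat": a_hat, "b_hat": b_hat, "mse": mse,
        "var_a": var_a, "var_b": var_b,
        "ci_a": (a_hat - half_a, a_hat + half_a),
        "ci_b": (b_hat - half_b, b_hat + half_b),
    }


# Hypothetical data; n = 5 gives n - 2 = 3 df, alpha = 0.05.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
res = slr_intervals(xs, ys, t_crit=3.182)
print(res["ci_a"], res["ci_b"])
```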
$$\sum (Y_i - \bar{Y})^2 = \sum (Y_i - \hat{Y}_i + \hat{Y}_i - \bar{Y})^2 = \sum (Y_i - \hat{Y}_i)^2 + \sum (\hat{Y}_i - \bar{Y})^2$$
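Why divide SSE by (n−2) to estimate $\sigma^2$ for the simple linear regression model?
$\sum e_i = \sum e_i (1) = \sum (Y_i - \hat{Y}_i) = 0$;
$\sum e_i X_i = \sum (Y_i - \hat{Y}_i)(X_i - \bar{X}) = \sum \left((Y_i - \bar{Y}) - \hat{b}(X_i - \bar{X})\right)(X_i - \bar{X}) = 0$
The n estimated errors thus satisfy two linear conditions, so only (n−2) of them are free to vary.
Note: For the multiple linear regression model containing a constant term and k predictors ($X_1$,
…, $X_k$), the estimated regression errors will satisfy (k+1) conditions given by
$\sum e_i = 0$ and $\sum e_i X_{ji} = 0$ for j = 1, …, k, which is why SSE is divided by (n−k−1) in that case.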
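The two conditions on the estimated errors, and the split of the total sum of squares into error and regression parts, are easy to verify on a small example. The data and function name below are made up for illustration.

```python
def residual_checks(xs, ys):
    """Return sum(e), sum(e*x), and the three sums of squares SST, SSE, SSR."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    b_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ss_x
    a_hat = y_bar - b_hat * x_bar
    fitted = [a_hat + b_hat * x for x in xs]
    resid = [y - f for y, f in zip(ys, fitted)]
    sum_e = sum(resid)                                  # first condition
    sum_ex = sum(e * x for e, x in zip(resid, xs))      # second condition
    sst = sum((y - y_bar) ** 2 for y in ys)
    sse = sum(e ** 2 for e in resid)
    ssr = sum((f - y_bar) ** 2 for f in fitted)
    return sum_e, sum_ex, sst, sse, ssr


# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
sum_e, sum_ex, sst, sse, ssr = residual_checks(xs, ys)
print(sum_e, sum_ex)        # both zero up to rounding error
print(sst, sse + ssr)       # equal up to rounding error
```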
$$Y_0 = \hat{a} + \hat{b}X_0 = \bar{Y} + \hat{b}(X_0 - \bar{X}) = \sum \left(\frac{1}{n} + c_i (X_0 - \bar{X})\right) Y_i = \sum f_i Y_i, \quad \text{where } f_i = \left(\frac{1}{n} + c_i (X_0 - \bar{X})\right)$$
$$V(Y_0) = \left(\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum (X_i - \bar{X})^2}\right) \sigma^2 \tag{12a}$$
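$$\bar{Y} + \hat{b}(X_0 - \bar{X}) \mp t_{(n-2),\alpha/2} \sqrt{\left(\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum (X_i - \bar{X})^2}\right) \hat{\sigma}^2} \tag{12b}$$
$$V(Y_0^*) = \left(1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum (X_i - \bar{X})^2}\right) \sigma^2 \tag{13a}$$
$$\left(\bar{Y} + \hat{b}(X_0 - \bar{X})\right) \mp t_{(n-2),\alpha/2} \sqrt{\left(1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum (X_i - \bar{X})^2}\right) \hat{\sigma}^2} \tag{13b}$$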
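Intervals (12b) and (13b) differ only by the extra "1" inside the square root, which the usual reading attributes to the variance of a single new observation around the fitted line. The sketch below computes both half-widths from summary quantities. The function name and the numerical inputs are hypothetical, and the critical value 3.182 is the table value for 3 df at α = 0.05.

```python
import math


def interval_half_widths(x0, n, x_bar, ss_x, mse, t_crit):
    """Half-widths of the intervals in (12b) and (13b) at the point X0."""
    core = 1 / n + (x0 - x_bar) ** 2 / ss_x          # bracket in (12a)
    half_12b = t_crit * math.sqrt(core * mse)        # from (12b)
    half_13b = t_crit * math.sqrt((1 + core) * mse)  # from (13b)
    return half_12b, half_13b


# Hypothetical summary values: n = 5, x_bar = 3, SS_X = 10, MSE = 0.107/3.
n, x_bar, ss_x, mse, t_crit = 5, 3.0, 10.0, 0.107 / 3, 3.182
for x0 in (3.0, 4.0, 5.0):
    print(x0, interval_half_widths(x0, n, x_bar, ss_x, mse, t_crit))

at_mean = interval_half_widths(3.0, n, x_bar, ss_x, mse, t_crit)
at_edge = interval_half_widths(5.0, n, x_bar, ss_x, mse, t_crit)
```

Both intervals are narrowest at $X_0 = \bar{X}$ and widen as $X_0$ moves away from it, and (13b) is always wider than (12b).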
$$R^2 = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2} = 1 - \frac{\sum (Y_i - \hat{Y}_i)^2}{\sum (Y_i - \bar{Y})^2} \tag{14a}$$
For k = 1 (i.e., SLR),
$$R^2 = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2} = \hat{b}^2 \left(\frac{\sum (X_i - \bar{X})^2}{\sum (Y_i - \bar{Y})^2}\right) = \left(r^2\, \frac{S_Y^2}{S_X^2}\right)\left(\frac{S_X^2}{S_Y^2}\right) = r^2 \tag{14b}$$
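The equality of the two forms of $R^2$ in (14a), and the identity $R^2 = r^2$ for simple linear regression in (14b), can be confirmed numerically. The data and function name are made up for illustration.

```python
import math


def r_squared_three_ways(xs, ys):
    """R^2 from SSR/SST, from 1 - SSE/SST, and as the squared correlation."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)     # this is SST
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b_hat = s_xy / ss_x
    a_hat = y_bar - b_hat * x_bar
    fitted = [a_hat + b_hat * x for x in xs]
    ssr = sum((f - y_bar) ** 2 for f in fitted)
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    r = s_xy / math.sqrt(ss_x * ss_y)
    return ssr / ss_y, 1 - sse / ss_y, r ** 2


# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
from_ssr, from_sse, r_sq = r_squared_three_ways(xs, ys)
print(from_ssr, from_sse, r_sq)   # all three agree up to rounding error
```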
$$\mathrm{Adj}(R^2) = 1 - \frac{SSE/(n-k-1)}{SST/(n-1)} = 1 - \frac{SSE\,(n-1)}{SST\,(n-k-1)} \quad (\le R^2) \tag{15}$$
$$\mathrm{Adj}(R^2) = 1 - (1 - R^2)\left(\frac{n-1}{n-k-1}\right) \tag{16}$$
$$\text{Minimum value of } \mathrm{Adj}(R^2) = 1 - (1 - 0)\left(\frac{n-1}{n-k-1}\right) = \left(\frac{-k}{n-k-1}\right) \tag{17}$$
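Equation (16) and its inverse (the form used to recover $R^2$, and hence |r|, from a reported adjusted $R^2$) are one-liners. The sketch below also evaluates the minimum in (17); the function names and the example values of n and k are illustrative.

```python
def adj_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 from R^2, sample size n and k predictors, eq. (16)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def r2_from_adj(adj: float, n: int, k: int) -> float:
    """Invert equation (16) to recover R^2 from the adjusted R^2."""
    return 1 - (1 - adj) * (n - k - 1) / (n - 1)


# Example with hypothetical values: n = 20 observations, k = 3 predictors.
print(adj_r2(0.9, 20, 3))            # never larger than the R^2 supplied

# Minimum of Adj(R^2), attained at R^2 = 0, equals -k/(n-k-1), eq. (17).
print(adj_r2(0.0, 11, 1), -1 / (11 - 1 - 1))

# Round trip: converting to adjusted R^2 and back returns the original R^2.
print(r2_from_adj(adj_r2(0.75, 30, 2), 30, 2))
```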
Durbin-Watson Statistic Tests for “Independence” of Errors for Time Series Data:
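$$DW = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2} = \frac{\sum_{i=2}^{n} e_i^2}{\sum_{i=1}^{n} e_i^2} + \frac{\sum_{i=2}^{n} e_{i-1}^2}{\sum_{i=1}^{n} e_i^2} - \frac{\sum_{i=2}^{n} 2 e_i e_{i-1}}{\sum_{i=1}^{n} e_i^2}$$
$$\approx \left(1 + 1 - 2\, \frac{\sum_{i=2}^{n} (e_i - \bar{e})(e_{i-1} - \bar{e})}{\sqrt{\sum_{i=1}^{n} (e_i - \bar{e})^2}\, \sqrt{\sum_{i=1}^{n} (e_{i-1} - \bar{e})^2}}\right) \approx 2\left(1 - \mathrm{Corr}(e_i, e_{i-1})\right) \tag{18}$$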
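The statistic and the approximation in (18) can be compared on a toy residual series. In the sketch below the lag-1 correlation uses the common simplification of dividing by $\sum (e_i - \bar{e})^2$, which approximates the product of the two square roots in (18). The residual series and function names are made up for illustration.

```python
def durbin_watson(e):
    """DW = sum_{i=2}^{n} (e_i - e_{i-1})^2 / sum_{i=1}^{n} e_i^2."""
    num = sum((e[i] - e[i - 1]) ** 2 for i in range(1, len(e)))
    den = sum(v * v for v in e)
    return num / den


def lag1_corr(e):
    """Approximate lag-1 autocorrelation of the residual series."""
    n = len(e)
    e_bar = sum(e) / n
    num = sum((e[i] - e_bar) * (e[i - 1] - e_bar) for i in range(1, n))
    den = sum((v - e_bar) ** 2 for v in e)
    return num / den


# Hypothetical residuals that flip sign every period (negative autocorrelation).
alternating = [1.0, -1.0] * 5
dw = durbin_watson(alternating)
rho = lag1_corr(alternating)
print(dw, 2 * (1 - rho))      # DW is close to 2(1 - rho), well above 2

# Residuals that never change give DW = 0 (extreme positive dependence).
print(durbin_watson([1.0, 1.0, 1.0, 1.0]))
```

Values of DW near 2 correspond to a lag-1 correlation near zero, values towards 0 to positive autocorrelation, and values towards 4 to negative autocorrelation.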