
This Is Only For Practice and Will Not Be Graded

The document discusses several statistical analysis techniques including principal component analysis, factor analysis, discriminant analysis, logistic regression, and multidimensional scaling. Key points include performing PCA on a dataset of countries' athletic records, explaining variance in a factor analysis model, classifying companies into risk groups using discriminant analysis, interpreting coefficients in a multinomial logistic regression, and evaluating a logistic regression model for space shuttle thermal distress.

Uploaded by

Vikash Kumar

This is only for practice and will not be graded

1. The excerpt below is from a data set that contains the athletic records of 55 countries
for various athletic events. The minimum time recorded by the country for a given
event is recorded in the table.
Serial                race100m  race200m  race400m  race800m  race1500m  race5000m  race10000m  marathon
number  Country       (in sec)  (in sec)  (in sec)  (in min)  (in min)   (in min)   (in min)    (in min)
1       Argentina     10.39     20.81     46.84     1.81      3.7        14.04      29.36       137.72
2       Australia     10.31     20.06     44.84     1.74      3.57       13.28      27.66       128.3
3       Austria       10.44     20.81     46.82     1.79      3.6        13.26      27.72       135.9
4       Belgium       10.34     20.68     45.04     1.73      3.6        13.22      27.45       129.95
5       Bermuda       10.28     20.58     45.91     1.8       3.75       14.68      30.55       146.62
6       Brazil        10.22     20.43     45.21     1.73      3.66       13.62      28.62       133.13
7       Burma         10.64     21.52     48.3      1.8       3.85       14.45      30.28       139.95
8       Canada        10.17     20.22     45.68     1.76      3.63       13.55      28.09       130.15
(other records not shown)

A principal component analysis is performed on this data by considering the correlation
matrix of the numeric columns of athletic records shown above. The analysis output is shown
below.
Eigenvalues:

Component         1      2      3      4      5      6      7      8
Eigenvalue     6.00   1.04   0.55   0.14   0.11   0.08   0.06   0.02

Eigenvectors:

Variable                   1      2      3      4      5      6      7      8
race100m.in.sec.       -0.32  -0.38  -0.56  -0.48   0.42   0.05  -0.08  -0.14
race200m.in.sec.       -0.15  -0.84   0.51   0.07  -0.10  -0.06   0.01   0.04
race400m.in.sec.       -0.37  -0.09  -0.43   0.06  -0.79  -0.05   0.17   0.10
race800m.in.min.       -0.39   0.03  -0.15   0.64   0.30  -0.49  -0.26   0.12
race1500m.in.min.      -0.39   0.07   0.07   0.35   0.22   0.51   0.58  -0.24
race5000m.in.min.      -0.39   0.18   0.23  -0.05  -0.19   0.31  -0.66  -0.44
race10000m.in.min.     -0.39   0.19   0.21  -0.24   0.10   0.24  -0.08   0.80
racemarathon.in.min.   -0.37   0.27   0.34  -0.41  -0.00  -0.58   0.34  -0.25

a) What would be a rationale for working with the correlation matrix instead of the
covariance matrix?
b) What is the sum of the variances of all the principal components?
c) What is the maximum percentage of total variance that can be explained by a single
principal component?
d) What is the minimum number of principal components needed to explain at least 90%
of the total variance?
e) Compute the second principal component score for Australia.
f) The correlation matrix computed using all the principal component score columns
need not be an identity matrix. True or False? Briefly justify your answer.
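
For parts (b) to (d), the share of variance carried by each component follows directly from the eigenvalues in the output. The snippet below is a minimal sketch of that arithmetic, not a full PCA; the function names are illustrative. Part (e) needs the column means and standard deviations of all 55 countries, which the excerpt does not show, so it is not attempted here.

```python
# Eigenvalues of the correlation matrix, copied from the output table.
eigenvalues = [6.00, 1.04, 0.55, 0.14, 0.11, 0.08, 0.06, 0.02]

# Each principal component has variance equal to its eigenvalue, so the
# total variance is the sum of the eigenvalues.
total_variance = sum(eigenvalues)


def explained_shares(values):
    """Fraction of the total variance carried by each component."""
    total = sum(values)
    return [v / total for v in values]


def components_needed(values, threshold):
    """Smallest k whose first k components reach `threshold` of the variance."""
    total = sum(values)
    running = 0.0
    for k, v in enumerate(values, start=1):
        running += v
        # Small tolerance guards against floating-point round-off.
        if running / total >= threshold - 1e-12:
            return k
    return len(values)


shares = explained_shares(eigenvalues)
k90 = components_needed(eigenvalues, 0.90)

print(f"total variance: {total_variance:.2f}")
print("shares:", [round(s, 4) for s in shares])
print("components for 90%:", k90)
```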
2. An exploratory factor analysis is carried out using three observed variables
$(X_1, X_2, X_3)$. Suppose that the three variables have been centered and scaled so that
each has mean 0 and variance 1. Suppose that a single-factor solution is estimated and
let the factor be denoted by $\varphi$. The factor loadings (i.e. the correlations of $\varphi$ with each of
the variables $X_1, X_2, X_3$) are estimated to be 0.9, 0.5 and 0.8 respectively.

a. Write down the mathematical formulation of this model and state the
accompanying assumptions.
b. What percentage of the total variance (i.e. $V(X_1) + V(X_2) + V(X_3)$) is
explained by the model?
c. For variable $X_3$, what percentage of its variance is explained by the factor?
d. It is found that the squared multiple correlation (smc) for the second variable is
90%. Based on your answer to part (c), what can you conclude about the
adequacy of a single-factor model? What would you conclude if the smc had been
25%?
e. According to this model, what is the correlation between $X_1$ and $X_3$?
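
The quantities in parts (b), (c) and (e) can be checked numerically. The sketch below assumes the usual orthogonal one-factor model, $X_i = \lambda_i \varphi + \varepsilon_i$ with a unit-variance factor and mutually uncorrelated errors; under those assumptions the communality of $X_i$ is $\lambda_i^2$ and the implied correlation between $X_i$ and $X_j$ is $\lambda_i \lambda_j$. The variable and function names are illustrative.

```python
# Estimated loadings of the single factor on the standardized variables.
loadings = {"X1": 0.9, "X2": 0.5, "X3": 0.8}

# Communality: the part of each unit variance explained by the factor.
communalities = {name: lam ** 2 for name, lam in loadings.items()}

# Each standardized variable has variance 1, so the total variance is 3.
total_variance = float(len(loadings))
share_explained = sum(communalities.values()) / total_variance


def implied_correlation(a, b):
    """Correlation between two variables implied by the one-factor model."""
    return loadings[a] * loadings[b]


print("communalities:", {k: round(v, 2) for k, v in communalities.items()})
print(f"share of total variance explained: {share_explained:.4f}")
print(f"implied corr(X1, X3): {implied_correlation('X1', 'X3'):.2f}")
```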

3. A bank that provides loans to private companies is looking to use discriminant
analysis to classify its borrowers into High risk (A) and Low risk (B) categories,
based on two key financial ratios of the borrower, namely debt service coverage
(DSC) and liquidation coverage (LC). Roughly, debt service coverage is the ratio of
operating cash flow to the principal + interest payments the company needs to make
during a year. Liquidation coverage is a measure of how much the bank may be able
to recover by liquidating the company (in the event the company goes bankrupt). It is
assumed that within each group DSC and LC follow normal distributions and that they
are mutually independent (and hence also uncorrelated, i.e. Correlation(DSC, LC) = 0).
The parameters of the normal distributions are given below:

        Group A             Group B
        Mean    Variance    Mean    Variance
DSC     0.8     0.64        1.5     0.64
LC      0.75    0.81        1.2     0.81

a) A company that has taken a loan from the bank has DSC = 1 and LC = 1. To which
category would you classify the company based on the Mahalanobis method? (Clearly
show the main steps of your approach.)
b) A risk manager who has past experience lending to companies similar to that in (a)
believes that there is a 60% chance that such a company belongs to the low risk
category. Based on this prior information and using the fact that DSC = 1, LC = 1, to
which category would you classify the company? What is the posterior probability of
such a company belonging to group A?
c) An analyst suggests that the variance of DSC for Group A should be changed to 0.36.
How would this change your answer in part (b)?
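
The steps in parts (a) to (c) can be sketched as follows. Because DSC and LC are assumed independent within each group, the covariance matrices are diagonal and the squared Mahalanobis distance reduces to a sum of squared standardized differences. The posterior uses the full normal density, so the same code also covers unequal variances as in part (c). All function names are illustrative, and the prior of 0.4 for group A is simply one minus the 60% stated for the low risk group.

```python
import math


def mahalanobis_sq(x, mean, var):
    """Squared Mahalanobis distance for a diagonal covariance matrix."""
    return sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))


def log_density(x, mean, var):
    """Log density of independent normals with the given means and variances."""
    log_det = sum(math.log(v) for v in var)
    d2 = mahalanobis_sq(x, mean, var)
    return -0.5 * (len(x) * math.log(2 * math.pi) + log_det + d2)


def posterior_a(x, group_a, group_b, prior_a):
    """Posterior probability of group A by Bayes' rule."""
    weight_a = prior_a * math.exp(log_density(x, *group_a))
    weight_b = (1 - prior_a) * math.exp(log_density(x, *group_b))
    return weight_a / (weight_a + weight_b)


x = (1.0, 1.0)                            # (DSC, LC) of the company
group_a = ((0.8, 0.75), (0.64, 0.81))     # (means, variances), high risk
group_b = ((1.5, 1.2), (0.64, 0.81))      # (means, variances), low risk

d2_a = mahalanobis_sq(x, *group_a)
d2_b = mahalanobis_sq(x, *group_b)
post_a = posterior_a(x, group_a, group_b, prior_a=0.4)

# Part (c): same calculation with the DSC variance of group A set to 0.36.
group_a_alt = ((0.8, 0.75), (0.36, 0.81))
post_a_alt = posterior_a(x, group_a_alt, group_b, prior_a=0.4)

print(f"d^2 to A: {d2_a:.4f}, d^2 to B: {d2_b:.4f}")
print(f"P(A | x) with original variances: {post_a:.4f}")
print(f"P(A | x) with Var(DSC | A) = 0.36: {post_a_alt:.4f}")
```

With unequal variances the normalizing constants no longer cancel between the groups, which is why the sketch works with the full density rather than the distances alone.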

4. A multi-logit model is built in order to classify observations into 3 categories: 1, 2 and 3.
Let us denote the response variable by Y and the explanatory variable by x. The
estimated model equations are as follows (note that the logarithm is to the base e):

$$\log\left(\frac{P(Y=1)}{P(Y=2)}\right) = 1 + 2x$$

$$\log\left(\frac{P(Y=3)}{P(Y=2)}\right) = 0.5 - x$$

a) Interpret the coefficient of x in the first equation.
b) What is the base or reference category used in the model?
c) If x = 0.3, then what are P(Y=1), P(Y=2) and P(Y=3)?
d) If the base category had been Y=1, then what would the model equations have been?
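
For part (c), the two log-odds equations plus the constraint that the three probabilities sum to 1 determine every category probability. The snippet below is a minimal sketch of that conversion for the estimated equations; the function name is illustrative.

```python
import math


def category_probabilities(x):
    """Category probabilities implied by the two estimated logit equations."""
    eta_1 = 1.0 + 2.0 * x    # log(P(Y=1) / P(Y=2))
    eta_3 = 0.5 - x          # log(P(Y=3) / P(Y=2))
    # The comparison category has linear predictor 0, i.e. exp(0) = 1.
    denom = 1.0 + math.exp(eta_1) + math.exp(eta_3)
    return {
        1: math.exp(eta_1) / denom,
        2: 1.0 / denom,
        3: math.exp(eta_3) / denom,
    }


probs = category_probabilities(0.3)
for category, p in probs.items():
    print(f"P(Y={category}) = {p:.4f}")
```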
5. For the 23 space shuttle flights before the Challenger mission disaster in 1986, the
table below shows the temperature (in degrees Fahrenheit) at the time of flight and
whether at least one primary O-ring suffered thermal distress (1 = thermal distress, 0 =
no thermal distress).

Flight Temperature Thermal Distress
1 66 0
2 70 1
3 69 0
4 68 0
5 67 0
6 72 0
7 73 0
8 70 0
9 57 1
10 63 1
11 70 1
12 78 0
13 67 0
14 53 1
15 67 0
16 75 0
17 70 0
18 81 0
19 76 0
20 79 0
21 75 1
22 76 0
23 58 1

A logistic regression model is formulated as follows:

$$\operatorname{logit}\big(P(\text{Thermal Distress} = 1)\big) = \alpha + \beta \cdot \text{Temperature}$$

The corresponding output from the estimation is shown below:

Coefficients:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)   15.0429      7.3786    2.039    0.0415 *
Temperature   -0.2322      0.1082   -2.145    0.0320 *
a) According to the model, increasing the temperature by 1 degree Fahrenheit would
decrease the predicted probability of thermal distress by 0.2322. True or False?
Briefly explain your answer.
b) What would have been the values of $\alpha'$ and $\beta'$ if the model had been formulated as

$$\operatorname{logit}\big(P(\text{Thermal Distress} = 0)\big) = \alpha' + \beta' \cdot \text{Temperature}$$

c) Estimate the probability of thermal distress at 31 degrees, the temperature at the time
of the Challenger flight.
d) At what temperature does the estimated probability equal 0.5?
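
Parts (c) and (d) both come down to moving between the logit scale and the probability scale. The sketch below plugs the reported estimates into the inverse logit and its inverse; the function names are illustrative. Note that 31 degrees lies well below the coldest observed flight (53 degrees), so the value in part (c) is an extrapolation of the fitted curve.

```python
import math

# Estimates copied from the coefficient table.
ALPHA = 15.0429   # intercept
BETA = -0.2322    # slope on temperature (per degree Fahrenheit)


def prob_distress(temperature):
    """Estimated P(thermal distress = 1) via the inverse logit."""
    eta = ALPHA + BETA * temperature
    return 1.0 / (1.0 + math.exp(-eta))


def temperature_at(probability):
    """Temperature at which the fitted probability equals `probability`."""
    logit = math.log(probability / (1.0 - probability))
    return (logit - ALPHA) / BETA


p31 = prob_distress(31.0)
t50 = temperature_at(0.5)

print(f"P(distress) at 31 F: {p31:.4f}")
print(f"temperature with P = 0.5: {t50:.2f} F")
```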

6. Answer TRUE or FALSE, with appropriate reasons.

a. For the correlation matrix of three variables (X, Y, Z), if one eigenvalue is 0, then the
three variables are independent.
b. For a problem involving classification of subjects into one of 5 different classes A, B, C, D,
E based on measurements $(X_1, X_2, X_3, X_4, X_5, X_6)$, Fisher's LDA results in 5 linear
discriminant functions.
c. Quadratic discriminant analysis is nothing but the Mahalanobis method when the number of
groups is more than 2.
d. In MDS, the higher the STRESS value, the better the model fit.
e. In factor analysis, the lower the communality, the worse the model fit.
f. Suppose that the eigenvalues of the 2-dimensional correlation matrix for variables (X, Y)
are 2 and 0. Then the scatter plot of X versus Y must be an exact straight line.
g. Fisher's discriminant analysis and the Mahalanobis method are essentially two equivalent
methods for the classification problem. No matter which method we choose, we will
arrive at exactly the same rule for classification.
7. Suggested practice from book
a) problems 4.9, 4.10
b) problems 5.3, 5.7
c) problem 7.6
d) problems 12.3, 12.4 , 12.9, 13.3.
