STA3022 Test2 Solutions

The document discusses a statistical analysis of factors that influence whether people consider themselves lucky or unlucky. It describes a survey that collected data on 62 students, including whether they consider themselves lucky, their age, gender, history of competition wins, and economics courses completed. A discriminant analysis was conducted to identify which variables distinguish between those who say they are lucky versus unlucky. The analysis found that the model could significantly discriminate between the two groups. It also provides details on evaluating the model, such as calculating hit rates and classification accuracy for each group.

Uploaded by

alutakaunda

UNIVERSITY OF CAPE TOWN

DEPARTMENT OF STATISTICAL SCIENCES


STA3022F
TEST 2

Question 1 [5 marks]
(a) What is test-retest reliability? (1)
(b) What is internal consistency reliability? (1)
(c) How do you measure internal consistency? Provide three formulas or explanations, not just
the names of the methods. (3)

Answer to Q1
(a) One of
A reliable measuring instrument in this context is one that gives consistent scores when used
repeatedly.
Or
There should be high correlations between test scores taken over multiple trials.

(b) The group of questions is internally-consistent or reliable if they are able to measure the same
underlying construct.

(c) Cronbach's alpha:
$\alpha = \frac{k}{k-1} \times \frac{\mathrm{var}(Q_1 + \cdots + Q_k) - \{\mathrm{var}(Q_1) + \cdots + \mathrm{var}(Q_k)\}}{\mathrm{var}(Q_1 + \cdots + Q_k)}$
$\alpha$-if-deleted: recalculate Cronbach's alpha with each question left out in turn.
Item-total correlation: calculate the correlation between each question and the sum of all
the other questions.
Half a mark for each name and half a mark for each formula/description.
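
The three measures in (c) can be sketched in a few lines of Python. This is only an illustration: the response data and the function names are hypothetical, and each inner list is assumed to hold one question's scores across the same respondents.

```python
from statistics import mean, variance

def cronbach_alpha(items):
    """items: list of k lists, one list of scores per question."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]   # Q1 + ... + Qk per respondent
    var_total = variance(totals)
    var_items = sum(variance(q) for q in items)
    return k / (k - 1) * (var_total - var_items) / var_total

def alpha_if_deleted(items):
    """Cronbach's alpha recomputed with each question left out in turn."""
    return [cronbach_alpha(items[:i] + items[i + 1:]) for i in range(len(items))]

def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def item_total_correlations(items):
    """Correlation of each question with the sum of all the other questions."""
    result = []
    for i, q in enumerate(items):
        rest = [sum(s) for s in zip(*(items[:i] + items[i + 1:]))]
        result.append(pearson(q, rest))
    return result

# Hypothetical scores: 3 questions answered by 5 respondents.
scores = [
    [4, 3, 5, 2, 4],
    [5, 3, 4, 2, 5],
    [4, 2, 5, 1, 4],
]
print(cronbach_alpha(scores))
print(alpha_if_deleted(scores))
print(item_total_correlations(scores))
```

Sample variances are used throughout; using population variances instead would give the same alpha, because the divisor cancels in the ratio.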

Question 2 [16 marks]


(a) In the painters data set in the R package MASS the subjective assessment, on a 0 to 20 integer
scale, of 54 classical painters is given. The painters were assessed on four characteristics:
composition, drawing, colour and expression. Calculate the Euclidean distance between the
following two samples:
> painters[1:2,]
         Composition Drawing Colour Expression
Da Udine          10       8     16          3
Da Vinci          15      16      4         14
(3)
(b) Why is there no need to scale the data set before calculating the Euclidean distance? (1)
(c) Define 𝑠𝑡𝑟𝑒𝑠𝑠 and explain how it is used. (5)
(d) Explain step by step how to perform hierarchical clustering with the centroid method. (7)

Answer to Q2
(a) $d_{12} = \sqrt{\sum_{j=1}^{p} (x_{1j} - x_{2j})^2} = \sqrt{(10 - 15)^2 + (8 - 16)^2 + (16 - 4)^2 + (3 - 14)^2}$
$= \sqrt{(-5)^2 + (-8)^2 + (12)^2 + (-11)^2} = \sqrt{25 + 64 + 144 + 121} = \sqrt{354} = 18.8$
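
As a quick check of the arithmetic in (a), the same distance can be computed in Python. The helper name `euclidean` is just an illustrative choice; the scores are copied from the two painters rows in the question.

```python
from math import sqrt

# Composition, Drawing, Colour, Expression (from the question).
da_udine = [10, 8, 16, 3]
da_vinci = [15, 16, 4, 14]

def euclidean(x, y):
    """Square root of the sum of squared coordinate differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

d12 = euclidean(da_udine, da_vinci)
print(round(d12, 1))  # 18.8, i.e. sqrt(354)
```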

(b) All variables are measured on the same 0 to 20 integer scale.

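(c) $stress = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} (d_{ij} - \delta_{ij})^2$
The aim of MDS is to find a representation of the samples so that the dissimilarities between
them in the plot, given by $\delta_{ij}$, match the given dissimilarities $d_{ij}$ as closely as possible
(optimally). Stress measures the mismatch between the two sets of dissimilarities, so the
configuration with the smallest stress is the one sought.
If the symbols are reversed, no marks are deducted as long as the descriptions are correct.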
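
The stress formula in (c) can be illustrated with a small Python sketch. It assumes the plot is a list of 2-D coordinates and that the given dissimilarities are stored in a square matrix; the example values are hypothetical.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def stress(config, d):
    """Sum over all pairs i < j of (d_ij - delta_ij)^2.

    config: plotted coordinates, one tuple per sample
    d:      matrix of given dissimilarities, d[i][j]
    """
    n = len(config)
    total = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            delta_ij = dist(config[i], config[j])   # dissimilarity in the plot
            total += (d[i][j] - delta_ij) ** 2
    return total

# Hypothetical example: a 3-4-5 triangle reproduces these dissimilarities exactly.
config = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
given = [
    [0, 3, 4],
    [3, 0, 5],
    [4, 5, 0],
]
print(stress(config, given))  # 0.0
```

An MDS routine would move the points in `config` around to make this value as small as possible.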

(d) Start with each object in its own cluster.
Merge the two closest clusters.
Repeat
    Calculate the dissimilarity between the newly merged cluster and each other cluster,
    by calculating the distance between the cluster means (centroids).
    Merge the two closest clusters.
Until all objects are merged into the same cluster.
Use the clustering tree to obtain the clusters, by cutting the tree at a specific height or
into a specific number of clusters.
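
The steps in (d) can be sketched in plain Python as follows. This is a minimal illustration, not a library implementation: the data points are hypothetical, the function names are made up for this example, and cluster distances are taken as ordinary Euclidean distances between cluster means.

```python
from math import dist

def centroid(points):
    """Mean of a list of points, coordinate by coordinate."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def centroid_clustering(data):
    """Return the merge history as (members_a, members_b, distance) tuples."""
    clusters = [[i] for i in range(len(data))]   # each object starts in its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters whose means are closest.
        for a in range(len(clusters) - 1):
            for b in range(a + 1, len(clusters)):
                ca = centroid([data[i] for i in clusters[a]])
                cb = centroid([data[i] for i in clusters[b]])
                d = dist(ca, cb)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Hypothetical 2-D observations.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
merges = centroid_clustering(points)
for left, right, height in merges:
    print(left, right, round(height, 2))
```

The merge heights play the role of the clustering tree: keeping only the merges below a chosen height, or stopping once a chosen number of clusters remains, gives the final grouping.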

Question 3 [17 marks]

The current study aims to identify what factors make some people believe that they are lucky and others
believe that they are unlucky. The study is based on a survey of 62 STA3022F students who answered the
following questions in an online questionnaire (possible responses for categorical variables are given in
brackets).

1. Do you consider yourself to be a lucky person? (Yes/No)


2. What is your age?
3. What is your gender? (1 = Male; 0 = Female)
4. Have you ever won a competition before? (1 = Yes; 0 = No)
5. How many economics courses have you completed?

A discriminant analysis model has been constructed with the aim of identifying which, if any, of the four
independent variables are able to distinguish between the two groups (groups labelled as “Yes” and “No”).
Questions:

a) Write down the discriminant function. (2)

b) Which groups is the discriminant model able to significantly discriminate between? Provide
statistical evidence at the 5% level to support your answer. Clearly state all null and alternative
hypotheses. (4)

c) Use the cut-off value rule to classify Respondent 4. Clearly indicate the classification rule. Is this a
correct classification? (5.5)

d) Compare the overall hit rate with two chance criteria and use these comparisons to evaluate the
overall quality of the discriminant model (4)

e) Evaluate whether the discriminant model is better at predicting some groups than others. (Hint:
Calculate the correct classification rate for each group)
(1.5)

Q3-a) Write down the discriminant function.


$Z_1 = 0.254 - 2.948 \times Q2 + 0.085 \times Q3d + 1.383 \times Q4d - 0.011 \times Q5$
[½] [½] [½] [½]
Q3-b)

$H_0$: There is no difference between the centroids of the “Yes” and “No” categories.

$H_1$: There is a difference between the centroids of the “Yes” and “No” categories.

First we need to calculate the squared distance between the centroids:

$d^2 = (-1.0242 - 1.0974)^2 = 4.501187$   [½] [½]

$F_{yes,no} = \frac{(n - 1 - p)\, n_1 n_2}{p (n - 2)(n_1 + n_2)} d^2 = \frac{(62 - 1 - 4) \times 34 \times 28}{4 \times (62 - 2) \times (34 + 28)} \times 4.501187 = 16.41481$   [½] (ratio) [½] (answer)

$F_{critical} = F_{p,\, n-1-p,\, \alpha} = F_{4,\, 62-1-4,\, 0.05} = F_{4,\, 57,\, 0.05} = 2.533$

Since the calculated F is greater than the critical F value, we reject $H_0$: the centroids are
significantly different from each other at the 5% significance level.   [½] (comparison) [½] (conclusion)

(Alternatively, students may say that the calculated F is very high.)
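
The calculation in Q3-b) can be reproduced with a short Python sketch. The variable names are illustrative; the group sizes, centroids and critical value are taken from the memo rather than recomputed from data.

```python
n_yes, n_no = 34, 28                    # group sizes for "Yes" and "No"
n = n_yes + n_no                        # 62 respondents
p = 4                                   # number of independent variables
z_bar_yes, z_bar_no = -1.0242, 1.0974   # group centroids

# Squared distance between the two centroids.
d2 = (z_bar_yes - z_bar_no) ** 2

# F statistic for the difference between the two centroids.
f_calc = (n - 1 - p) * n_yes * n_no / (p * (n - 2) * (n_yes + n_no)) * d2

f_crit = 2.533                          # F(4, 57) at alpha = 0.05, as quoted in the memo
reject_h0 = f_calc > f_crit

print(round(d2, 6), round(f_calc, 5), reject_h0)
```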

Q3-c) First we need to calculate the cut-off value:

$Cut\text{-}off = \frac{n_1 \bar{Z}_2 + n_2 \bar{Z}_1}{n_1 + n_2} = \frac{34 \times 1.0974 + 28 \times (-1.0242)}{34 + 28} = 0.1392581$   [½] (ratio) [½] (answer)

Then we need to specify the rule:

If Z < 0.1392581 then classify as “Yes”; otherwise classify as “No”.   [½]

Calculate the Z value for the 4th respondent:

$Z_4 = 0.254 - 2.948 \times 20 + 0.085 \times 1 + 1.383 \times 0 - 0.011 \times 2 = -58.643$   [½] [½] [½] [½] [½]

Since $Z_4 < 0.1392581$ and the centroid for “Yes” is negative, classify Respondent 4 as “Yes”.   [½] [½]

Therefore it is a correct classification.   [½]
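
The cut-off rule in Q3-c) can be sketched in Python as below. The function and argument names are hypothetical; the coefficients, centroids, group sizes and Respondent 4's values (age 20, male, no competition win, 2 economics courses) are those used in the memo's calculation.

```python
def discriminant_score(age, male, won, econ):
    """Z = 0.254 - 2.948*Q2 + 0.085*Q3d + 1.383*Q4d - 0.011*Q5."""
    return 0.254 - 2.948 * age + 0.085 * male + 1.383 * won - 0.011 * econ

n_yes, n_no = 34, 28
z_bar_yes, z_bar_no = -1.0242, 1.0974

# Each centroid is weighted by the size of the *other* group.
cutoff = (n_yes * z_bar_no + n_no * z_bar_yes) / (n_yes + n_no)

def classify(z):
    """The "Yes" centroid is negative, so scores below the cut-off are "Yes"."""
    return "Yes" if z < cutoff else "No"

z4 = discriminant_score(age=20, male=1, won=0, econ=2)
print(round(cutoff, 7), round(z4, 3), classify(z4))
```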

Q3-d) Evaluate the hit-rate.   [½]

$Hit\text{-}rate = \frac{28 + 24}{62} = 83.87\%$   [½]

$H_{max} = \max(34/62,\ 28/62) = 54.84\%$   [½] [½]

$H_{prop} = \left(\frac{34}{62}\right)^2 + \left(\frac{28}{62}\right)^2 = 50.47\%$   [½] [½]

The hit-rate is greater than both $H_{max}$ and $H_{prop}$, which indicates a good hit-rate.   [½] [½]
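
The comparison in Q3-d) can be checked numerically with the sketch below. The variable names are illustrative, and the counts are the ones used in the memo.

```python
correct_yes, correct_no = 28, 24   # correctly classified respondents per group
n_yes, n_no = 34, 28               # observed group sizes
n = n_yes + n_no

hit_rate = (correct_yes + correct_no) / n
h_max = max(n_yes / n, n_no / n)              # maximum chance criterion
h_prop = (n_yes / n) ** 2 + (n_no / n) ** 2   # proportional chance criterion

print(f"hit rate = {hit_rate:.2%}, H_max = {h_max:.2%}, H_prop = {h_prop:.2%}")
print(hit_rate > h_max and hit_rate > h_prop)
```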

Q3-e) Evaluate the hit-rate for each category:

$Hit\text{-}rate(yes) = \frac{28}{34} = 82.4\%$   [½]

$Hit\text{-}rate(no) = \frac{24}{28} = 85.7\%$   [½]

Both correct classification rates are similar and very good, so the model predicts the two groups about equally well.   [½]
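
The per-group rates in Q3-e) follow directly from a classification table. In this sketch the dictionary layout is an illustrative choice, and the misclassified counts (6 and 4) are not printed in the memo; they are inferred from the group totals of 34 and 28.

```python
# Rows are observed groups, inner keys are predicted groups.
confusion = {
    "Yes": {"Yes": 28, "No": 6},
    "No": {"Yes": 4, "No": 24},
}

# Correct classification rate per group: diagonal count / row total.
group_rates = {
    group: row[group] / sum(row.values())
    for group, row in confusion.items()
}

for group, rate in group_rates.items():
    print(f"{group}: {rate:.1%}")
```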

Q4-a) Interpret the Classification Tree and define an appropriate decision rule for selecting a
positive return.
(1) If CACL <= 1.1694 & ROCS <= 4.4486 & WCTA <= -0.3326, then classify as Not Fail   [½]
(2) If CACL <= 1.1694 & ROCS <= 4.4486 & WCTA > -0.3326, then classify as Fail   [½]
(3) If CACL <= 1.1694 & ROCS > 4.4486, then classify as Not Fail   [½]
(4) If CACL > 1.1694 & CLTA <= 0.70635 & Sales <= 3091.5, then classify as Fail   [½]
(5) If CACL > 1.1694 & CLTA <= 0.70635 & Sales > 3091.5, then classify as Not Fail   [½]
(6) If CACL > 1.1694 & CLTA > 0.70635, then classify as Fail   [½]

Q4-b)
Firm  SALES  ROCS   CLTA  CACL  WCTA   FAIL
2     16149  -1.07  1.22  0.62  -0.46  0

CACL = 0.62 < 1.1694 & ROCS = -1.07 < 4.4486 & WCTA = -0.46 < -0.3326   [½] [½] [½]

Therefore classify as Not Fail.   [½]
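
The six decision rules and their application to Firm 2 can be written as a small Python function. The function name and dictionary keys are hypothetical; the thresholds are the split points quoted in Q4-a).

```python
def classify_firm(firm):
    """Walk the classification tree using the split points from Q4-a)."""
    if firm["CACL"] <= 1.1694:
        if firm["ROCS"] <= 4.4486:
            # Rules (1) and (2)
            return "Not Fail" if firm["WCTA"] <= -0.3326 else "Fail"
        return "Not Fail"                                            # Rule (3)
    if firm["CLTA"] <= 0.70635:
        # Rules (4) and (5)
        return "Fail" if firm["SALES"] <= 3091.5 else "Not Fail"
    return "Fail"                                                    # Rule (6)

# Firm 2 from the table in Q4-b).
firm_2 = {"SALES": 16149, "ROCS": -1.07, "CLTA": 1.22, "CACL": 0.62, "WCTA": -0.46}
print(classify_firm(firm_2))  # Not Fail
```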

Q4-c)

12 12 12


2 2
29 31
𝐷𝐼1 = 1 − (( ) + ( ) ) = 0.4994444
60 60

OR
30 2 30 2
𝐷𝐼1 = 1 − (( ) + ( ) ) = 0.5
60 60

5
The variable is chosen according to the reduction in the DI. The variable that creates the maximum
reduction in the index is chosen for splitting the node. 12
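
The diversity index and the reduction used to choose a split can be sketched in Python. The function names are hypothetical, and the child-node counts in the reduction example are invented purely to show the calculation; the reduction is computed here as the parent's index minus the size-weighted average of the children's indices, which is one common convention.

```python
def diversity_index(counts):
    """1 minus the sum of squared class proportions in a node."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def di_reduction(parent, children):
    """Parent DI minus the size-weighted average DI of the child nodes."""
    n = sum(parent)
    weighted = sum(sum(child) / n * diversity_index(child) for child in children)
    return diversity_index(parent) - weighted

root_di = diversity_index([29, 31])   # root node from Q4-c)
print(round(root_di, 7))              # 0.4994444

# Hypothetical split of the root node into two child nodes.
print(di_reduction([29, 31], [[25, 3], [4, 28]]))
```

A tree-growing routine would evaluate this reduction for every candidate split and keep the one with the largest value.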

Q4-d)

Bonsai techniques check several stopping criteria as the tree grows, so growth is halted before the tree is fully grown.   [½]
Pruning techniques let the tree grow fully and then prune it back.   [½]

Q4-e)

Classification Table
                        Predicted Groups
                   Fail          NotF          Total
Observed   Fail    21+2+2=25     29-25=4       29
Groups     NotF    31-28=3       2+4+22=28     31
           Total   28            32            60

OR

Classification Table
                        Predicted Groups
                   Fail          NotF          Total
Observed   Fail    21+2+2=25     1+1+3=5       30
Groups     NotF    2+0+0=2       2+4+22=28     30
           Total   27            33            60

OR

Classification Table
                        Predicted Groups
                   Fail          NotF          Total
Observed   Fail    21+2+2=25     1+1+3=5       29
Groups     NotF    2+0+0=2       2+4+22=28     31
           Total   27            33            60

In each table, [½] marks are awarded for the group labels, the cell counts and the totals.
