Introduction To Data Analysis Solutions

This document provides an introduction to data analysis concepts including probability, distributions, hypothesis testing, and other statistical techniques. It includes 8 questions with examples and explanations of how to calculate probabilities, determine median and quartiles from a histogram, perform a chi-square test of independence, and other statistical analyses. The key concepts covered are probabilities, distributions, hypothesis testing, and applying statistical methods to analyze data.

Uploaded by

Oumaima Ziat
Copyright
© All Rights Reserved

Introduction to Data Analysis

Correction
Q1:

a) P(26-39 / +3 accidents) = 25/(25+25+50) = 25/100 = 0.25
b) P(2-3 accidents / 18-25) = 150/(150 + 100 + 50) = 150/300 = 0.5
c) P(0-1 accidents / 26-39 or 40-55) = (150 + 250)/(150+25+25 + 250+125+25) = 400/600 ≈ 0.67
d) P(40-55) = (250+125+25)/1000 = 0.4
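
The four ratios above can be checked with a short script. This is only a sketch: the cell counts are read back from the fractions in the solution, the labels are illustrative, and the grand total of 1000 is taken from part d) as given because the full table is not reproduced here.

```python
# Counts recovered from the fractions in the solution
# (outer key: age group, inner key: number of accidents).
counts = {
    "18-25": {"0-1": 100, "2-3": 150, "+3": 50},
    "26-39": {"0-1": 150, "2-3": 25, "+3": 25},
    "40-55": {"0-1": 250, "2-3": 125, "+3": 25},
}
GRAND_TOTAL = 1000  # assumption: value used in part d) of the solution

# a) condition on the "+3 accidents" column
p_a = counts["26-39"]["+3"] / sum(row["+3"] for row in counts.values())

# b) condition on the 18-25 row
p_b = counts["18-25"]["2-3"] / sum(counts["18-25"].values())

# c) condition on the union of the 26-39 and 40-55 rows
older = ["26-39", "40-55"]
p_c = sum(counts[a]["0-1"] for a in older) / sum(
    sum(counts[a].values()) for a in older
)

# d) marginal probability of the 40-55 row
p_d = sum(counts["40-55"].values()) / GRAND_TOTAL

print(p_a, p_b, round(p_c, 2), p_d)
```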

Q2:

1) 0.75 of the employees are satisfactory (S).
2) 0.25 of the employees are unsatisfactory (S’).
3) 0.8 of the satisfactory employees had previous work experience (E), which means 0.2 of them had no work experience (E’).
4) 0.15 of the unsatisfactory employees had no work experience, which means 0.85 of them had work experience.

Probability Tree:

S (0.75) ➔ E (0.8) or E’ (0.2)
S’ (0.25) ➔ E (0.85) or E’ (0.15)

Bayes' theorem: P(A/B) = P(B/A)xP(A)/P(B)

The question is: what is the probability that a person is unsatisfactory given that they have previous work experience? So here we are looking for the probability of S’ given E.

P(S’/E) = P(E/S’)xP(S’)/P(E)

P(E/S’) = 0.85 (using the tree)

P(S’) = 0.25

P(E) = P(E/S)xP(S) + P(E/S’)xP(S’) (law of total probability, because P(E) is not given directly)

P(E) = 0.8x0.75 + 0.85x0.25 = 0.6 + 0.2125 = 0.8125

P(S’/E) = 0.85x0.25/0.8125 = 0.2125/0.8125 ≈ 0.26
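
The same Bayes calculation can be written as a few lines of Python. This is a minimal sketch; the variable names are illustrative and the inputs are the four probabilities stated in the question.

```python
# Inputs from the question
p_s = 0.75               # P(S): employee is satisfactory
p_s_not = 1 - p_s        # P(S'): employee is unsatisfactory
p_e_given_s = 0.80       # P(E/S): experience given satisfactory
p_e_given_s_not = 0.85   # P(E/S'): experience given unsatisfactory

# Law of total probability for P(E)
p_e = p_e_given_s * p_s + p_e_given_s_not * p_s_not

# Bayes' theorem for P(S'/E)
p_s_not_given_e = p_e_given_s_not * p_s_not / p_e

print(round(p_e, 4), round(p_s_not_given_e, 2))
```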

Q3:

Let random variable X denote the height (in inches) of male soccer players. The figure below shows the histogram of
the heights of 100 male soccer players. Using this histogram:

I. The sample size is 100 because we have 100 male soccer players.
II. Q2 = Median. We add up the relative frequencies bin by bin until we reach 0.5 of the data or more:

0.05 < 0.5
0.05 + 0.03 = 0.08 < 0.5
0.08 + 0.15 = 0.23 < 0.5
0.23 + 0.4 = 0.63 >= 0.5 ➔ the median (Q2) is within this 4th bin, so we take the midpoint of that bin, which is 66.95.

III. The value of Q1 is 66.95 because we reach 0.25 of the data in the 4th bin (0.23 < 0.25 <= 0.63). For Q3 we need to add one more bin to reach 0.75 of the data or more ➔ Q3 = 68.95.
IQR = Q3 – Q1 = 68.95 – 66.95 = 2

IV. The mean or the expected value is E(X) = ∑ x_i P(X = x_i) = 60.95x0.05 + 62.95x0.03 + … + 74.95x0.01

Q4:

Fair game means the expected value of the winnings/losses equals zero.

a) Finding the value of N:

∑ x_i P(X = x_i) = 0 ➔ P(Winning)xN + P(Losing)x(-1) = 0 (because we lose one dollar).

P(Winning) = P(HHH or TTT) = 0.5x0.5x0.5 + 0.5x0.5x0.5 (the probability of heads or tails is 0.5)

P(Winning) = 0.25
P(Losing) = 1 – P(Winning) = 1 – 0.25 = 0.75
P(Winning)xN + P(Losing)x(-1) = 0 ➔ 0.25xN + 0.75x(-1) = 0 ➔ N = 3

b) Standard deviation SD = sqrt(Variance)

Variance = E(X^2) – E(X)^2

E(X) = 0 (fair game) ➔ E(X)^2 = 0

E(X^2) = 0.25xN^2 + 0.75x(-1)^2 = 0.25x9 + 0.75x1 = 3

SD = sqrt(0.25x9 + 0.75x1) = sqrt(3) ≈ 1.73
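
As a cross-check, the fair-game payout and the standard deviation can be obtained by enumerating the eight outcomes of three coin flips. This is a sketch under the solution's assumptions (a win means HHH or TTT, a loss costs one dollar); the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product
from math import sqrt

# All 8 equally likely outcomes of three fair coin flips
outcomes = list(product("HT", repeat=3))

# A win is three identical faces (HHH or TTT)
wins = [o for o in outcomes if len(set(o)) == 1]
p_win = Fraction(len(wins), len(outcomes))
p_lose = 1 - p_win

# Fair game: p_win * N + p_lose * (-1) = 0, solved for N
n_payout = p_lose / p_win

# Mean, variance and standard deviation of the net gain
expected = p_win * n_payout + p_lose * (-1)
variance = p_win * n_payout**2 + p_lose * (-1) ** 2 - expected**2
sd = sqrt(variance)

print(p_win, n_payout, expected, variance, round(sd, 2))
```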

Q5:
Expected Earnings Per Roll (Excluding Rolling a 6):

• Expected earnings per roll E = (1+2+3+4+5)/5=3

Expected Profit Function for N Rolls:

• Probability of not rolling a 6 in N rolls: (5/6)^N
• Total expected earnings from N rolls: N×E = N×3.
• Expected profit for N rolls: 3×N×(5/6)^N − 3.

To find the value of N that maximizes the expected profit, we can evaluate it for different values of N, for example N = {1, 2, 3, …, 15}, and then choose the N that gives the maximum value. In this exercise N = 6 (N = 5 gives exactly the same expected profit, since 5×(5/6)^5 = 6×(5/6)^6).
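
The search over N can be sketched as follows. The profit function is the one written in the solution, including its constant term of 3; exact fractions are used here (an illustrative choice) so that ties are not hidden by floating-point rounding.

```python
from fractions import Fraction


def expected_profit(n):
    """Expected profit for n rolls, using the formula from the solution."""
    return 3 * n * Fraction(5, 6) ** n - 3


# Evaluate the candidates N = 1..15 and keep every N that attains the maximum
profits = {n: expected_profit(n) for n in range(1, 16)}
best = max(profits.values())
best_n = [n for n, value in profits.items() if value == best]

print(best_n)  # [5, 6]
```

With this formula N = 5 and N = 6 tie, so either value maximizes the expected profit.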

Q6:

The obtained scores are normally distributed with mean = 82 and SD = 6. The question is how many students had scores between 76 and 88, so we need P(76 < X < 88).

We use the Z-score transformation: P(76 < X < 88) = P((76-82)/6 < Z < (88-82)/6) = P(-1 < Z < 1) = P(Z < 1) – P(Z < -1)

= P(Z < 1) – (1 – P(Z < 1))

= 0.84 – (1 – 0.84) = 0.68

So about 68% of the students had scores between 76 and 88.
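
The table lookup can be replaced by the normal CDF from Python's standard library. This is a minimal sketch using `statistics.NormalDist` (available from Python 3.8); the variable names are illustrative.

```python
from statistics import NormalDist

# Score distribution from the question: mean 82, standard deviation 6
scores = NormalDist(mu=82, sigma=6)

# Z-scores of the two bounds
z_low = (76 - scores.mean) / scores.stdev
z_high = (88 - scores.mean) / scores.stdev

# P(76 < X < 88) as a difference of two CDF values
p_between = scores.cdf(88) - scores.cdf(76)

print(z_low, z_high, round(p_between, 2))
```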

Q7:
To solve this problem, we'll perform a hypothesis test for the difference between two proportions and then calculate a
95% confidence interval.

i. Hypothesis Test
1) State the hypotheses.
• Null Hypothesis (H0): The proportion of men using smartphones (Pm) is less than or equal
to that of women (Pw), i.e., Pm≤Pw.
• Alternative Hypothesis (H1): The proportion of men using smartphones is greater than that
of women, i.e., Pm>Pw.

2) Calculate Sample Proportions:

Pm = 379/973 ≈ 0.39
Pw = 404/1304 ≈ 0.31

3) Standard Error and Z-score:

• In this case, we don't use the pooled proportion since the null hypothesis does not assume the proportions are equal.
• Calculate the standard error (SE) using the unpooled formula, with nm = 973 and nw = 1304:

SE = sqrt(Pm(1 – Pm)/nm + Pw(1 – Pw)/nw) ≈ 0.020

• Compute the Z-score Z = (Pm - Pw)/SE

Z = 3.94 (computed with the unrounded proportions)

4) P-value:
P-value = P(Z > 3.94) = 1 – P(Z < 3.94) ≈ 0.00004
5) Decision:
Since the P-value < 0.05, we reject H0.

ii. 95% Confidence Interval.

CI = point estimate +/- 1.96*SE = point estimate +/- ME

ME = 1.96 * 0.020 = 0.039
lower = (Pm – Pw) - 0.039 = 0.08 - 0.039 = 0.041
upper = (Pm – Pw) + 0.039 = 0.08 + 0.039 = 0.119

CI = [0.04, 0.12]
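
The test and the interval can be reproduced with the standard library alone. This sketch assumes the sample counts are 379 of 973 men and 404 of 1304 women, which is the reading of the fractions that reproduces Z = 3.94; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Assumed sample counts (smartphone users, sample size)
x_m, n_m = 379, 973    # men
x_w, n_w = 404, 1304   # women

p_m = x_m / n_m
p_w = x_w / n_w

# Unpooled standard error of the difference in proportions
se = sqrt(p_m * (1 - p_m) / n_m + p_w * (1 - p_w) / n_w)

# One-sided test of H1: Pm > Pw
z = (p_m - p_w) / se
p_value = 1 - NormalDist().cdf(z)

# Two-sided 95% confidence interval for Pm - Pw
z_crit = NormalDist().inv_cdf(0.975)  # about 1.96
margin = z_crit * se
ci = (p_m - p_w - margin, p_m - p_w + margin)

print(round(z, 2), p_value, [round(bound, 2) for bound in ci])
```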

Q8:
H0: There is no inconsistency between the observed and the expected counts. The
observed counts follow the same distribution as the expected counts.
HA: There is an inconsistency between the observed and the expected counts. The
observed counts do not follow the same distribution as the expected counts.

Observed distribution:
Underweight BMI < 18.5 = 20
Normal Weight BMI 18.5-24.9 = 932
Overweight BMI 25.0-29.9 = 1374
Obese BMI > 30 = 1000
Total = 3326

Expected distribution:
Underweight BMI < 18.5 = 0.02 * 3326 = 66.52
Normal Weight BMI 18.5-24.9 = 0.39 * 3326 = 1297.14
Overweight BMI 25.0-29.9 = 0.36 * 3326 = 1197.36
Obese BMI > 30 = 0.23 * 3326 = 764.98

Calculation of chi-square:

χ² = ∑(Observed – Expected)^2/Expected

χ² = 233.58

The degrees of freedom (df) equal the number of categories minus 1, here df = 4 – 1 = 3.

From the chi-square table with df = 3 and a 5% level of significance, the critical value is 7.81.

Since χ² >> 7.81, we reject H0, which means the observed counts do not follow the same distribution as the expected counts.
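
The goodness-of-fit statistic can be recomputed in a few lines. This is a sketch in plain Python; the category labels are illustrative and the critical value 7.81 is the table value quoted in the solution rather than something the code derives.

```python
# Observed counts per BMI category
observed = {
    "Underweight": 20,
    "Normal weight": 932,
    "Overweight": 1374,
    "Obese": 1000,
}

# Expected proportions per category
expected_share = {
    "Underweight": 0.02,
    "Normal weight": 0.39,
    "Overweight": 0.36,
    "Obese": 0.23,
}

n = sum(observed.values())
expected = {k: share * n for k, share in expected_share.items()}

# Chi-square statistic: sum of (O - E)^2 / E over all categories
chi2 = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)
df = len(observed) - 1

CRITICAL_5PCT_DF3 = 7.81  # table value quoted in the solution
reject_h0 = chi2 > CRITICAL_5PCT_DF3

print(n, round(chi2, 2), df, reject_h0)
```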

Q9:

H0: Gender and opinion are independent.
HA: Gender and opinion are dependent.

For each cell in the table, calculate the expected frequency: Eij = (row total) * (column total)/total

Each cell shows observed count / expected count:

               Agree          No opinion   Disagree     Row total
Males          75 / 95.2      10 / 8.74    85 / 66.06   170
Females        121 / 100.8    8 / 9.26     51 / 69.94   180
Column Total   196            18           136          350

Chi-Squared Statistic:
χ² = ∑(Observed – Expected)^2/Expected. In this example χ² = 19.25

Degrees of Freedom:

df = (number of rows – 1)*(number of columns – 1)

df = (2 – 1)*(3 – 1) = 2

The critical value is χ²(df = 2, 0.05) = 5.99.

Since χ² = 19.25 > 5.99, and the p-value corresponding to this χ² is very small (using Python I got 6.6117999886364e-05), we reject H0.
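
The expected counts, the statistic and the p-value can be reproduced as follows. This is a sketch in plain Python, not necessarily the script used for the solution; it relies on the fact that for df = 2 the chi-square upper tail probability has the closed form exp(-x/2), so no statistics library is needed.

```python
from math import exp

# Observed counts: rows = Males, Females; columns = Agree, No opinion, Disagree
observed = [
    [75, 10, 85],
    [121, 8, 51],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count for each cell: (row total) * (column total) / total
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Chi-square statistic over all cells
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Upper tail probability; this closed form is valid only for df = 2
p_value = exp(-chi2 / 2)

print(round(chi2, 2), df, p_value)
```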
