Individual assignment for data science

The document outlines an individual assignment for a Master's program in Data Science, focusing on statistical methods for analyzing data. It includes various sampling techniques for studying kidney disease prevalence, probability questions related to alcohol consumption, t-tests and ANOVA for comparing test scores, regression analysis for predicting student performance, and discussions on the Central Limit Theorem and properties of good estimators. Additionally, it requires writing the Bernoulli distribution as an exponential family, highlighting the application of statistical concepts in real-world scenarios.

Uploaded by

teshager8922
Emerald International College

MSc in Data Science (2025)

Individual Assignment (Statistics for Data Science)

1) A hospital in Ethiopia wants to study the prevalence of a rare kidney disease among its
patients. Since testing all patients is too costly and time-consuming, the researchers decide to
use a sampling method. They consider the following options:

1. Simple Random Sampling (SRS): Selecting 200 patients randomly from the hospital
database.
2. Stratified Sampling: Dividing patients into age groups (under 30, 30-50, above 50)
and randomly selecting 100 patients from each group.
3. Cluster Sampling: Choosing 3 departments in the hospital at random and testing all
patients in those departments.
4. Systematic Sampling: Selecting every 10th patient who visits the hospital for a
check-up.

Questions:
(a) Which sampling method ensures proportional representation of different age groups?
(b) If the researchers want to minimize the cost of data collection while still getting a
representative sample, which method might be the best choice?
(c) What potential bias could arise if they use cluster sampling?

2) The following table shows the frequency of alcohol consumption by age group among a group of
150 adults surveyed:

a) What is the probability that a randomly selected person is in the 30-49 age group?
b) What is the probability that a randomly selected person has consumed alcohol 50
or more times?
c) Given that a person is in the 18-29 age group, what is the probability that they
have consumed alcohol 10-49 times?
d) What is the probability that a randomly selected person is 50+ years old and has
consumed alcohol 50 or more times?
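
The four parts above follow the usual marginal, joint, and conditional probability rules for a two-way frequency table. As a sketch only (the table's cell counts do not appear in this copy, so the symbols below are placeholders introduced here, not values from the assignment): let N = 150 be the total, n_{a\cdot} the row total for age group a, n_{\cdot c} the column total for consumption category c, and n_{ac} the count in the corresponding cell.

```latex
\begin{align*}
P(\text{age} = a) &= \frac{n_{a\cdot}}{N}, &
P(\text{consumption} = c) &= \frac{n_{\cdot c}}{N}, \\[4pt]
P(\text{age} = a \,\cap\, \text{consumption} = c) &= \frac{n_{ac}}{N}, &
P(\text{consumption} = c \mid \text{age} = a) &= \frac{n_{ac}}{n_{a\cdot}}.
\end{align*}
```

Parts (a) and (b) use the marginal forms, part (c) the conditional form, and part (d) the joint form.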

3) We have two groups (A and B) with different test scores, and we want to check if their means are
significantly different.

# Generate sample data
set.seed(123) # For reproducibility
group_A <- c(75, 80, 85, 78, 82, 77, 83, 79, 81, 76) # Group A scores
group_B <- c(68, 72, 70, 74, 71, 69, 73, 75, 70, 72) # Group B scores

# Perform two-sample t-test
t_test_result <- t.test(group_A, group_B, alternative = "two.sided", var.equal = TRUE)

###############output###############

Questions Based on the Output

1. What is the null hypothesis (H₀) for this t-test?
2. What is the t-statistic value?
3. What is the p-value, and what does it indicate about statistical significance at α =
0.05?
4. What are the sample means for group A and group B?
5. What is the 95% confidence interval for the difference in means?
6. Based on the results, should we reject the null hypothesis? Why?
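
Because the call above sets var.equal = TRUE, the quantities asked about come from the pooled-variance (Student) two-sample t-test. The block below is a sketch of that standard formulation. The symbols (\bar{x}, s, n, s_p) are notation introduced here rather than taken from the assignment, and with ten scores per group the degrees of freedom are 10 + 10 - 2 = 18.

```latex
\begin{align*}
H_0 &: \mu_A = \mu_B \qquad \text{vs.} \qquad H_1 : \mu_A \neq \mu_B \\[4pt]
s_p^2 &= \frac{(n_A - 1)\,s_A^2 + (n_B - 1)\,s_B^2}{n_A + n_B - 2} \\[4pt]
t &= \frac{\bar{x}_A - \bar{x}_B}{s_p \sqrt{\dfrac{1}{n_A} + \dfrac{1}{n_B}}},
\qquad \text{df} = n_A + n_B - 2 \\[4pt]
95\%\ \text{CI} &: \; (\bar{x}_A - \bar{x}_B) \;\pm\; t_{0.975,\,\text{df}} \; s_p \sqrt{\frac{1}{n_A} + \frac{1}{n_B}}
\end{align*}
```

H₀ is rejected at α = 0.05 when the two-sided p-value is below 0.05, which is equivalent to the 95% confidence interval excluding zero.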

4) We have test scores from three different study methods (A, B, and C), and we want to check if
there is a significant difference in their means.

# Generate sample data
set.seed(123) # For reproducibility
group <- rep(c("A", "B", "C"), each = 10) # 3 groups
scores <- c(75, 80, 85, 78, 82, 77, 83, 79, 81, 76,
            68, 72, 70, 74, 71, 69, 73, 75, 70, 72,
            88, 90, 85, 87, 91, 86, 89, 92, 88, 90)

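
The questions that follow refer to ANOVA output and pairwise comparisons, but the calls that produce them do not appear in this copy. A minimal sketch of one way to obtain such output is given here, assuming a one-way aov() fit followed by TukeyHSD(). The assignment does not confirm that these functions were used, and the names study_data, anova_model, and tukey_result are illustrative.

```r
# Data as given in the assignment
group <- rep(c("A", "B", "C"), each = 10)
scores <- c(75, 80, 85, 78, 82, 77, 83, 79, 81, 76,
            68, 72, 70, 74, 71, 69, 73, 75, 70, 72,
            88, 90, 85, 87, 91, 86, 89, 92, 88, 90)

# Assumption: group is treated as a factor in a one-way design
study_data <- data.frame(group = factor(group), scores = scores)

# One-way ANOVA: tests H0 that all three group means are equal
anova_model <- aov(scores ~ group, data = study_data)
print(summary(anova_model))

# Tukey HSD post-hoc test: pairwise mean differences with adjusted p-values
tukey_result <- TukeyHSD(anova_model)
print(tukey_result)
```

The summary() table reports the F-statistic and its p-value, and the TukeyHSD() table lists each pair (B-A, C-A, C-B) with its mean difference, confidence interval, and adjusted p-value.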
Questions Based on the Output

1. What is the null hypothesis (H₀) for this ANOVA test?
2. What is the F-statistic value, and what does it indicate?
3. What is the p-value, and what does it suggest about the means of the groups at α =
0.05?

Questions Based on the Output

1. Which group pairs show statistically significant differences?
2. What is the mean difference between groups A and B? Is it statistically significant?
3. If a pairwise comparison has a p-value greater than 0.05, what does it mean in terms
of significance?

5) Let's assume we have a dataset where we predict student test scores based on study
hours and sleep hours.
Questions Based on the Output

1. What is the regression equation based on the model output?
2. What does the coefficient for study_hours (4.7854) indicate about its relationship
with test scores?
3. Is sleep_hours a statistically significant predictor? Why or why not?
4. What is the R-squared value, and what does it tell us about the model’s fit?
5. If another student studies for 8 hours and sleeps for 6 hours, what would the model
predict for their test score?
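
The general shape of the fitted equation and of the prediction in question 5 can be sketched as follows. Only the study_hours coefficient (4.7854) is stated in this copy, so the intercept \hat{\beta}_0 and the sleep_hours coefficient \hat{\beta}_2 are left symbolic here rather than guessed.

```latex
\begin{align*}
\widehat{\text{test\_score}} &= \hat{\beta}_0 + 4.7854 \cdot \text{study\_hours} + \hat{\beta}_2 \cdot \text{sleep\_hours} \\[4pt]
\widehat{\text{test\_score}}\,\big|_{\,\text{study\_hours}=8,\ \text{sleep\_hours}=6}
&= \hat{\beta}_0 + 4.7854 \times 8 + \hat{\beta}_2 \times 6 \\
&= \hat{\beta}_0 + 38.2832 + 6\,\hat{\beta}_2
\end{align*}
```

Read this way, each additional study hour is associated with an increase of about 4.79 points in the predicted score, holding sleep hours fixed.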

6) Discuss the Central Limit Theorem.
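
As a starting point for the discussion, the classical (Lindeberg-Lévy) form of the theorem can be stated as follows. This is the standard textbook statement, not text from the assignment, and it assumes independent, identically distributed observations with finite, non-zero variance.

```latex
\text{Let } X_1, \dots, X_n \text{ be i.i.d. with } E[X_i] = \mu \text{ and } 0 < \operatorname{Var}(X_i) = \sigma^2 < \infty,
\text{ and let } \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i . \text{ Then}
\[
\frac{\sqrt{n}\,\left(\bar{X}_n - \mu\right)}{\sigma} \;\xrightarrow{\;d\;}\; N(0, 1)
\qquad \text{as } n \to \infty .
\]
```

Informally, for large n the sample mean is approximately N(\mu, \sigma^2 / n) whatever the shape of the underlying distribution.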

7) Discuss the properties of a good estimator.
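
The properties usually covered under this heading can be summarised in symbols as below. This is a sketch of the standard definitions, with \hat{\theta}_n denoting an estimator of \theta based on n observations (notation introduced here); the assignment itself does not specify which properties it expects.

```latex
\begin{align*}
\text{Unbiasedness:} \quad & E[\hat{\theta}_n] = \theta \\[2pt]
\text{Consistency:} \quad & \hat{\theta}_n \xrightarrow{\;p\;} \theta \quad \text{as } n \to \infty \\[2pt]
\text{Relative efficiency:} \quad & \operatorname{Var}(\hat{\theta}_1) \le \operatorname{Var}(\hat{\theta}_2)
  \ \text{ for unbiased } \hat{\theta}_1, \hat{\theta}_2 \ \Rightarrow\ \hat{\theta}_1 \text{ is at least as efficient} \\[2pt]
\text{Mean squared error:} \quad & \operatorname{MSE}(\hat{\theta}_n) = \operatorname{Var}(\hat{\theta}_n) + \bigl(E[\hat{\theta}_n] - \theta\bigr)^2
\end{align*}
```

Sufficiency, meaning that the estimator is built from a statistic capturing all the information in the sample about \theta, is often listed alongside these.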

8) Write the Bernoulli distribution as a member of the exponential family.
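
One common way to set this up is sketched below, assuming the canonical form f(x \mid \eta) = h(x)\exp\{\eta\,T(x) - A(\eta)\}. Other textbooks use slightly different but equivalent parameterisations, and the symbols \eta, T, A, and h are the usual conventions rather than notation taken from the assignment.

```latex
\begin{align*}
f(x \mid p) &= p^{x}(1 - p)^{1 - x}, \qquad x \in \{0, 1\},\ 0 < p < 1 \\[2pt]
&= \exp\bigl\{ x \log p + (1 - x)\log(1 - p) \bigr\} \\[2pt]
&= \exp\Bigl\{ x \log\frac{p}{1 - p} + \log(1 - p) \Bigr\} \\[6pt]
\eta &= \log\frac{p}{1 - p}, \qquad T(x) = x, \qquad
A(\eta) = -\log(1 - p) = \log\bigl(1 + e^{\eta}\bigr), \qquad h(x) = 1
\end{align*}
```

The natural parameter \eta is the log-odds, and differentiating A(\eta) gives e^{\eta}/(1 + e^{\eta}) = p, the Bernoulli mean.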
