Individual assignment for data science
1) A hospital in Ethiopia wants to study the prevalence of a rare kidney disease among its
patients. Since testing all patients is too costly and time-consuming, the researchers decide to
use a sampling method. They consider the following options:
1. Simple Random Sampling (SRS): Selecting 200 patients randomly from the hospital
database.
2. Stratified Sampling: Dividing patients into age groups (under 30, 30-50, above 50)
and randomly selecting 100 patients from each group.
3. Cluster Sampling: Choosing 3 departments in the hospital at random and testing all
patients in those departments.
4. Systematic Sampling: Selecting every 10th patient who visits the hospital for a
check-up.
Questions:
(a) Which sampling method ensures proportional representation of different age groups?
(b) If the researchers want to minimize the cost of data collection while still getting a
representative sample, which method might be the best choice?
(c) What potential bias could arise if they use cluster sampling?
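To make the four options concrete, the following is a minimal R sketch of how each sample could be drawn. The patient register is entirely hypothetical: the data frame `patients`, its size of 3000, the ten department names, and the use of `id` order as a stand-in for visit order are all assumptions made for illustration, not details from the hospital scenario.

```r
set.seed(42)  # fixes the random draws so the sketch is reproducible

# Hypothetical patient register (every column here is an illustrative assumption)
n <- 3000
patients <- data.frame(
  id = seq_len(n),
  age_group = sample(c("under 30", "30-50", "above 50"), n, replace = TRUE),
  department = sample(paste0("dept_", 1:10), n, replace = TRUE)
)

# 1. Simple random sampling: 200 patients drawn from the whole register
srs <- patients[sample(n, 200), ]

# 2. Stratified sampling: 100 patients drawn at random within each age group
strata <- split(patients, patients$age_group)
stratified <- do.call(rbind, lapply(strata, function(g) g[sample(nrow(g), 100), ]))

# 3. Cluster sampling: 3 departments drawn at random, every patient in them kept
chosen <- sample(unique(patients$department), 3)
cluster <- patients[patients$department %in% chosen, ]

# 4. Systematic sampling: every 10th patient after a random starting point
start <- sample(10, 1)
systematic <- patients[seq(start, n, by = 10), ]

# Compare how the age groups are represented under each design
table(srs$age_group)
table(stratified$age_group)
```

Comparing the two tables shows the design difference: the stratified sample fixes the count per age group, while the simple random sample leaves it to chance.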
2) The following table shows the frequency of alcohol consumption by age group among a group of
150 adults surveyed:
a) What is the probability that a randomly selected person is in the 30-49 age group?
b) What is the probability that a randomly selected person has consumed alcohol 50
or more times?
c) Given that a person is in the 18-29 age group, what is the probability that they
have consumed alcohol 10-49 times?
d) What is the probability that a randomly selected person is 50+ years old and has
consumed alcohol 50 or more times?
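Parts (a) to (d) rest on three standard definitions: marginal, joint, and conditional probability. As a notational sketch, write A for an age group, C for a consumption category, n(·) for the matching count in the table, and N = 150 for the number of adults surveyed. These symbols are labels introduced here and do not appear in the table itself.

```latex
P(A) = \frac{n(A)}{N}, \qquad
P(A \cap C) = \frac{n(A \cap C)}{N}, \qquad
P(C \mid A) = \frac{P(A \cap C)}{P(A)} = \frac{n(A \cap C)}{n(A)}
```

Parts (a) and (b) use the marginal form, part (d) the joint form, and part (c) the conditional form, where the denominator is the age-group total rather than N.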
3) We have two groups (A and B) with different test scores, and we want to check if their means are
significantly different.
# Test scores for the two groups
group_A <- c(75, 80, 85, 78, 82, 77, 83, 79, 81, 76) # Group A scores
group_B <- c(68, 72, 70, 74, 71, 69, 73, 75, 70, 72) # Group B scores
# Perform a two-sided, two-sample t-test assuming equal variances
t_test_result <- t.test(group_A, group_B, alternative = "two.sided", var.equal = TRUE)
# Display the test statistic, degrees of freedom, and p-value
print(t_test_result)
###############output###############
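To show what `var.equal = TRUE` does, the pooled-variance t statistic can also be computed by hand and compared with the result from `t.test`. This is a minimal sketch using the same scores; the names `pooled_var`, `se_diff`, `t_manual`, `p_manual`, and `t_builtin` are introduced here for illustration.

```r
group_A <- c(75, 80, 85, 78, 82, 77, 83, 79, 81, 76)
group_B <- c(68, 72, 70, 74, 71, 69, 73, 75, 70, 72)

n_A <- length(group_A)
n_B <- length(group_B)

# Pooled variance: the two sample variances weighted by their degrees of freedom
pooled_var <- ((n_A - 1) * var(group_A) + (n_B - 1) * var(group_B)) / (n_A + n_B - 2)

# Standard error of the difference between the two means
se_diff <- sqrt(pooled_var * (1 / n_A + 1 / n_B))

# t statistic, degrees of freedom, and two-sided p-value
t_manual <- (mean(group_A) - mean(group_B)) / se_diff
df <- n_A + n_B - 2
p_manual <- 2 * pt(-abs(t_manual), df)

# Cross-check against the built-in test
t_builtin <- t.test(group_A, group_B, alternative = "two.sided", var.equal = TRUE)
all.equal(t_manual, unname(t_builtin$statistic))
all.equal(p_manual, t_builtin$p.value)
```

Both `all.equal` calls should return `TRUE`, confirming that the built-in test uses the pooled standard error with n_A + n_B - 2 degrees of freedom.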
4) We have test scores from three different study methods (A, B, and C), and we want to check if
there is a significant difference in their means.
scores <- c(75, 80, 85, 78, 82, 77, 83, 79, 81, 76, 68, 72, 70, 74, 71, 69, 73, 75, 70, 72,
88, 90, 85, 87, 91, 86, 89, 92, 88, 90)
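The vector above does not record which study method each score belongs to. One plausible way to finish the analysis is a one-way ANOVA, sketched below under the assumption that the first ten scores belong to method A, the next ten to method B, and the last ten to method C. The names `method` and `anova_result` are introduced here for illustration.

```r
scores <- c(75, 80, 85, 78, 82, 77, 83, 79, 81, 76,
            68, 72, 70, 74, 71, 69, 73, 75, 70, 72,
            88, 90, 85, 87, 91, 86, 89, 92, 88, 90)

# Assumption: scores are listed in blocks of ten, in the order A, B, C
method <- factor(rep(c("A", "B", "C"), each = 10))

# One-way ANOVA: does the mean score differ across study methods?
anova_result <- aov(scores ~ method)
summary(anova_result)

# Group means, useful when interpreting the F test
tapply(scores, method, mean)
```

If the F test is significant, a follow-up such as `TukeyHSD(anova_result)` would indicate which pairs of methods differ.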
Questions Based on the Output
5) Let's assume we have a dataset where we predict student test scores based on study
hours and sleep hours.
Questions Based on the Output