CS001-B03 - Exploratory Data Analysis 20
India
Instructions:
● Read each question carefully before answering. Make sure you understand what is being
asked before you start writing your answer.
● Show your work and explain your reasoning. Even if your final answer is incorrect, you
can still receive partial credit if you show your work and explain your thought process.
● Write clearly and legibly. Make sure your handwriting is neat and easy to read. If the
grader can't read your writing, you may lose points even if your answer is correct.
● Manage your time effectively. Allocate your time wisely to ensure that you have enough
time to answer all the questions. Don't spend too much time on one question at the
expense of other questions.
● Review your answers before submitting your exam. Take a few minutes to check your
work and make sure you have answered all the questions. Make any necessary
corrections or additions before submitting your exam.
SECTION-A
Question 1: Which of the following plots is most suitable for visualizing the relationship
between two numerical variables? (1.5 marks)
a) Scatterplot b) Histogram
c) Boxplot d) Bar chart
Question 2: Which of the following is NOT a measure of central tendency? (1.5 marks)
a) Mean b) Median
c) Mode d) Standard deviation
Question 3: What type of hypothesis test is most appropriate when comparing the means of
more than two groups? (1.5 marks)
a) t-test b) Chi-squared test
c) ANOVA d) Pearson correlation
Question 4: In a positively skewed distribution, which of the following is true? (1.5 marks)
a) Mean < Median < Mode b) Mode < Median < Mean
c) Mean = Median = Mode d) Median < Mean < Mode
Question 6: Which of the following is NOT a common technique for outlier detection? (1.5 marks)
Each question consists of two statements, namely, Assertion (A) and Reason (R). For selecting the
correct answer, use the following code:
(a) Both Assertion (A) and Reason (R) are true and Reason (R) is a correct explanation of
Assertion (A).
(b) Both Assertion (A) and Reason (R) are true but Reason (R) is not a correct explanation of
Assertion (A).
(c) Assertion (A) is true and Reason (R) is false.
(d) Assertion (A) is false and Reason (R) is true.
Question 1: Assertion (A): Histograms and boxplots are suitable for visualizing the distribution
of numerical variables.
Reason (R): Imputation methods rely on assumptions about the data to estimate
missing values, which can lead to biased results if the assumptions are
incorrect.
Question 2: Assertion (A): Histograms and boxplots are suitable for visualizing the distribution
of numerical variables.
Reason (R): Histograms and boxplots provide insights into the central tendency,
dispersion, and shape of the data distribution.
Question 1: Match the following statistical measures with their corresponding definitions:
A. Mean
B. Variance
C. Standard deviation
D. Mode
1. A measure of central tendency that represents the typical or average value of a dataset.
2. A measure of the spread or variability of a dataset around its mean.
3. A measure of the average deviation of a dataset from its mean.
4. The most frequently occurring value in a dataset.
Question 2: Match the following statistical techniques with their corresponding applications:
A. Boxplot
B. ANOVA
C. Principal Component Analysis (PCA)
D. T-test
1. Used to identify the presence of statistically significant differences in means between two
groups.
2. Used to visualize the distribution, spread, and skewness of a dataset and to identify
outliers.
3. Used to reduce the dimensionality of a dataset by transforming the original features into
a smaller set of new features.
4. Used to test for the presence of statistically significant differences in means across
multiple groups.
Question 3: Match the following statistical measures with their corresponding formulas:
A. Mean
B. Variance
C. Standard deviation
D. Correlation coefficient
i. ∑xi/n
ii. √(∑(xi - x̄)²/(n-1))
iii. ∑(xi - x̄)²/(n-1)
iv. ∑[(xi - x̄)(yi - ȳ)]/[(n-1)sxsy]
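The formulas in Question 3 can be checked numerically. The following is a minimal Python sketch, assuming the sample (n-1) versions of the variance and standard deviation as written in the options above. The data values and function names are hypothetical and chosen only for illustration.

```python
import math
import statistics


def mean(values):
    # Sum of the observations divided by the number of observations.
    return sum(values) / len(values)


def sample_variance(values):
    # Sum of squared deviations from the mean, divided by (n - 1).
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def sample_std(values):
    # Square root of the sample variance.
    return math.sqrt(sample_variance(values))


def pearson_r(xs, ys):
    # Sum of cross-products of deviations, divided by (n - 1) * sx * sy.
    mx, my = mean(xs), mean(ys)
    cross = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cross / ((len(xs) - 1) * sample_std(xs) * sample_std(ys))


# Hypothetical illustration data (not taken from the exam).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

print("mean(x)     =", mean(x))
print("variance(x) =", sample_variance(x))
print("std(x)      =", round(sample_std(x), 4))
print("r(x, y)     =", round(pearson_r(x, y), 4))

# Cross-check against the standard library implementations.
print("matches statistics.variance:",
      math.isclose(sample_variance(x), statistics.variance(x)))
```

Note that summing the raw deviations (xi - x̄) without squaring them always gives zero, which is why the variance squares each deviation first.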
Question 4: Match the following statistical techniques with their corresponding advantages:
A. Parametric tests
B. Non-parametric tests
C. Correlation analysis
D. Regression analysis
SECTION-B
2. A survey is conducted to determine the average amount of time that people spend
watching TV per week. A sample of 50 people is selected and the mean time spent
watching TV per week is found to be 25 hours with a standard deviation of 4 hours. What
is the 95% confidence interval for the population mean time spent watching TV per
week? What can you infer about the population mean time spent watching TV per week?
3. A new drug is developed to reduce high blood pressure. A clinical trial is conducted to
test the efficacy of the drug. A random sample of 50 patients is selected and the mean
reduction in blood pressure is found to be 10 mmHg with a standard deviation of 2
mmHg. What can you infer about the population mean reduction in blood pressure? Is
the mean reduction statistically significantly different from zero?
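The two numerical questions above can be approached as sketched below in Python. This is one possible approach, not a model answer: it assumes a large-sample normal approximation (n = 50) instead of the t distribution, and for the drug trial it assumes the null hypothesis is a mean reduction of zero, which the question does not state. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(sample_mean, sample_sd, n, confidence=0.95):
    # Standard error of the mean.
    se = sample_sd / math.sqrt(n)
    # Two-sided critical value from the standard normal distribution
    # (about 1.96 for 95% confidence).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * se
    return sample_mean - margin, sample_mean + margin


# TV survey: n = 50, mean = 25 hours, standard deviation = 4 hours.
ci_low, ci_high = mean_confidence_interval(25, 4, 50)
print(f"95% CI for mean TV time: ({ci_low:.2f}, {ci_high:.2f}) hours")

# Drug trial: n = 50, mean reduction = 10 mmHg, standard deviation = 2 mmHg.
# Assumed null hypothesis: the population mean reduction is 0.
se_drug = 2 / math.sqrt(50)
z_stat = (10 - 0) / se_drug
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
print(f"test statistic = {z_stat:.2f}, two-sided p-value = {p_value:.3g}")
```

Using the t distribution with 49 degrees of freedom instead would widen the interval slightly; the conclusion for the drug trial would not change given how large the test statistic is.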
1. What is the difference between a categorical variable and a numerical variable? Give an
example of each.
2. What is a scatter plot, and what type of data is it used to visualize? How can you interpret
the relationship between variables in a scatterplot?
3. What is a boxplot, and what type of data is it used to visualize? How can you interpret
the information provided by a boxplot?
4. What is the mean, and how is it calculated? What are some limitations of using the mean
as a measure of central tendency?
5. What is the standard deviation, and how is it calculated? How can you interpret the value
of the standard deviation in relation to the distribution of the data?
6. What is an outlier, and why can it be a problem in data analysis? What are some
common methods used to detect outliers in a dataset?
7. What is the interquartile range (IQR), and how can it be used to detect outliers in a
dataset? How does the IQR differ from the standard deviation as a measure of spread?
8. What is a normal distribution, and why is it important in data analysis? How can you
determine if a dataset follows a normal distribution?
9. What is a skewed distribution, and what are the different types of skewness that can
occur in a dataset? How can you interpret the skewness of a dataset in relation to its
distribution?
10. What is a correlation coefficient, and how is it calculated? What is the range of possible
values for a correlation coefficient, and how can you interpret the strength and direction
of the correlation?
11. What is a scatterplot matrix, and how is it used to visualize correlations between features
in a dataset? What information can you obtain from a scatterplot matrix that you can't
get from individual scatterplots?
12. What is a null hypothesis, and why is it used in hypothesis testing? What is the
alternative hypothesis, and how does it relate to the null hypothesis?
13. What is a t-test, and what type of data is it used to analyze? What is the difference
between a one-sample t-test and a two-sample t-test?
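Question 7 above asks how the IQR can be used to detect outliers. A minimal Python sketch of the common 1.5 × IQR rule follows. The data values and the function name are hypothetical, and the quartiles use the default ("exclusive") method of `statistics.quantiles`; other quartile conventions can give slightly different fences.

```python
import statistics


def iqr_outliers(values, k=1.5):
    # Quartiles Q1, Q2 (median), Q3 using the library's default method.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    # Points beyond k * IQR from the quartiles are flagged as outliers.
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    flagged = [v for v in values if v < lower_fence or v > upper_fence]
    return flagged, (lower_fence, upper_fence)


# Hypothetical illustration data with one unusually large value.
data = [12, 14, 15, 15, 16, 17, 18, 19, 45]
outliers, fences = iqr_outliers(data)
print("fences:", fences)
print("outliers:", outliers)
```

Because the IQR depends only on the middle 50% of the data, the extreme value 45 does not inflate it, whereas the same value would noticeably increase the standard deviation.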
SECTION-C
Case 1:
● What is the null hypothesis and alternative hypothesis for this experiment?
● Calculate the sample proportion of customers who made a purchase after receiving the
email.
● Calculate the standard error of the proportion.
● Calculate the test statistic and corresponding p-value.
● Should the null hypothesis be rejected? What is the conclusion of the hypothesis test?
Case 2:
A company is conducting a study to determine if there is a relationship between age and income.
They collect data from 200 people and record their age (in years) and income (in thousands of
dollars). The data is shown below:
● Calculate the sample mean and sample standard deviation for age and income.
● Create a scatterplot of the data. What is the direction of the relationship between age and
income? Is the relationship linear or nonlinear?
● Calculate the correlation coefficient between age and income. What is the strength and
direction of the correlation?
● Conduct a hypothesis test to determine if there is a significant linear relationship
between age and income. Use a 5% level of significance.
● What is the equation of the regression line for predicting income based on age? What is
the predicted income for someone who is 50 years old?
Case 3:
A hospital is conducting a study to compare the effectiveness of two different treatments for a
certain medical condition. They randomly assign 50 patients to receive Treatment A and 50
patients to receive Treatment B. The results are shown below:
Treatment A    Treatment B
10             15
12             14
15             18
16             20
18             17
20             16
● Calculate the sample mean and sample standard deviation for Treatment A and
Treatment B.
● Conduct a two-sample t-test to determine if there is a significant difference in the mean
effectiveness of Treatment A and Treatment B. Use a 5% level of significance.
● What is the null hypothesis and alternative hypothesis for this experiment?
● Calculate the test statistic and corresponding p-value.
● Should the null hypothesis be rejected? What is the conclusion of the hypothesis test?
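The Case 3 steps can be sketched in Python as follows. This is a hedged illustration, not a model answer: it uses only the six values per treatment printed in the table (the case text mentions 50 patients per group), it assumes equal population variances (pooled two-sample t-test), and the critical value 2.228 is the two-sided 5% value for 10 degrees of freedom taken from a standard t table. The function name is hypothetical.

```python
import math
import statistics

treatment_a = [10, 12, 15, 16, 18, 20]
treatment_b = [15, 14, 18, 20, 17, 16]


def pooled_t_statistic(a, b):
    # Pooled two-sample t statistic, assuming equal population variances.
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    t = (statistics.mean(a) - statistics.mean(b)) / se
    return t, na + nb - 2


print("mean A =", round(statistics.mean(treatment_a), 3),
      "sd A =", round(statistics.stdev(treatment_a), 3))
print("mean B =", round(statistics.mean(treatment_b), 3),
      "sd B =", round(statistics.stdev(treatment_b), 3))

t_stat, df = pooled_t_statistic(treatment_a, treatment_b)

# Assumed table value: two-sided 5% critical value of t with 10 df.
T_CRIT = 2.228
reject_null = abs(t_stat) > T_CRIT
print(f"t = {t_stat:.3f}, df = {df}, reject H0 at 5%: {reject_null}")
```

If SciPy is available, `scipy.stats.ttest_ind(treatment_a, treatment_b)` computes the same pooled statistic together with an exact p-value.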