
CS001-B03 - Exploratory Data Analysis 20

Here are the key points to address in this case: - The sample proportion of customers who made a purchase after receiving the email is 20/100 = 0.2 - To test if this is significantly different from the expected population proportion, a one-sample proportions test (also called a z-test) can be used - The null hypothesis would be that the sample proportion is equal to the expected population proportion - The alternative hypothesis would be that the sample proportion is different than the expected population proportion - Calculate the test statistic and p-value and compare to the significance level (usually α=0.05) to determine if the null hypothesis can be rejected - If the p-value is less than 0.05

Uploaded by

Viswa Spiritual

Antern Learning

India

CS001-B03 - Exploratory Data Analysis (20%)


Test Paper

1.5 hours 50 marks

Instructions:

● Read each question carefully before answering. Make sure you understand what is being
asked before you start writing your answer.
● Show your work and explain your reasoning. Even if your final answer is incorrect, you
can still receive partial credit if you show your work and explain your thought process.
● Write clearly and legibly. Make sure your handwriting is neat and easy to read. If the
grader can't read your writing, you may lose points even if your answer is correct.
● Manage your time effectively. Allocate your time wisely to ensure that you have enough
time to answer all the questions. Don't spend too much time on one question at the
expense of other questions.
● Review your answers before submitting your exam. Take a few minutes to check your
work and make sure you have answered all the questions. Make any necessary
corrections or additions before submitting your exam.

SECTION-A

Question 1: Which of the following plots is most suitable for visualizing the relationship
between two numerical variables? (1.5 marks)

a) Scatterplot b) Histogram
c) Boxplot d) Bar chart

Question 2: Which of the following is NOT a measure of central tendency? (1.5 marks)

a) Mean b) Median
c) Mode d) Standard deviation

Question 3: What type of hypothesis test is most appropriate when comparing the means of
more than two groups? (1.5 marks)
a) t-test b) Chi-squared test
c) ANOVA d) Pearson correlation

Question 4: In a positively skewed distribution, which of the following is true? (1.5 marks)

a) Mean < Median < Mode b) Mode < Median < Mean
c) Mean = Median = Mode d) Median < Mean < Mode

Question 5: Which dimensionality reduction technique is most commonly used for
transforming data into a lower-dimensional space while preserving as much variance as
possible? (1.5 marks)

a) Principal Component Analysis (PCA)
b) Independent Component Analysis (ICA)
c) t-Distributed Stochastic Neighbor Embedding (t-SNE)
d) Feature selection using mutual information

Question 6: Which of the following is NOT a common technique for outlier detection? (1.5
marks)

a) Z-score method b) Tukey's fences (IQR method)
c) Box-Cox transformation d) DBSCAN clustering

Assertion-and-Reason Type (3 marks)

Each question consists of two statements, namely, Assertion (A) and Reason (R). To select the
correct answer, use the following code:

(a) Both Assertion (A) and Reason (R) are true and Reason (R) is a correct explanation of
Assertion (A).
(b) Both Assertion (A) and Reason (R) are true but Reason (R) is not a correct explanation of
Assertion (A).
(c) Assertion (A) is true and Reason (R) is false.
(d) Assertion (A) is false and Reason (R) is true.

Question 1: Assertion (A): Histograms and boxplots are suitable for visualizing the distribution of
numerical variables.
Reason (R): Imputation methods rely on assumptions about the data to estimate
missing values, which can lead to biased results if the assumptions are
incorrect.

Question 2: Assertion (A): Histograms and boxplots are suitable for visualizing the distribution
of numerical variables.
Reason (R): Histograms and boxplots provide insights into the central tendency,
dispersion, and shape of the data distribution.

Matching Question: (6 marks)

Question 1: Match the following statistical measures with their corresponding definitions:

A. Mean
B. Variance
C. Standard deviation
D. Mode

1. A measure of central tendency that represents the typical or average value of a dataset.
2. A measure of the spread or variability of a dataset around its mean.
3. A measure of the average deviation of a dataset from its mean.
4. The most frequently occurring value in a dataset.

Question 2: Match the following statistical techniques with their corresponding applications:

A. Boxplot
B. ANOVA
C. Principal Component Analysis (PCA)
D. T-test

1. Used to identify the presence of statistically significant differences in means between two
groups.
2. Used to visualize the distribution, spread, and skewness of a dataset and to identify
outliers.
3. Used to reduce the dimensionality of a dataset by transforming the original features into
a smaller set of new features.
4. Used to test for the presence of statistically significant differences in means across
multiple groups.

Question 3: Match the following statistical measures with their corresponding formulas:

A. Mean
B. Variance
C. Standard deviation
D. Correlation coefficient

i. ∑xi/n
ii. √(∑(xi - x̄)²/(n-1))
iii. ∑(xi - x̄)²/(n-1)
iv. ∑[(xi - x̄)(yi - ȳ)]/[(n-1)·sx·sy]

Question 4: Match the following statistical techniques with their corresponding advantages:

A. Parametric tests
B. Non-parametric tests
C. Correlation analysis
D. Regression analysis

i. More powerful than nonparametric tests when assumptions are met.
ii. Less sensitive to outliers and non-normality of data.
iii. Used to measure the strength and direction of the linear relationship between two variables.
iv. Used to predict the value of a dependent variable based on the value of one or more
independent variables.

SECTION-B

Inference based questions: (6 marks)

1. The average salary of employees in a company is $50,000 with a standard deviation of
$5,000. A random sample of 25 employees is selected from the company and their
average salary is found to be $47,500. What can you infer about the population mean
salary and the reliability of the sample mean?

2. A survey is conducted to determine the average amount of time that people spend
watching TV per week. A sample of 50 people is selected and the mean time spent
watching TV per week is found to be 25 hours with a standard deviation of 4 hours. What
is the 95% confidence interval for the population mean time spent watching TV per
week? What can you infer about the population mean time spent watching TV per week?

3. A new drug is developed to reduce high blood pressure. A clinical trial is conducted to
test the efficacy of the drug. A random sample of 50 patients is selected and the mean
reduction in blood pressure is found to be 10 mmHg with a standard deviation of 2
mmHg. What can you infer about the population mean reduction in blood pressure? Is
the sample mean statistically significant?
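
The quantities these inference questions rely on (standard error, z statistic, confidence interval) can be sketched as below. This is an illustrative sketch rather than a marking scheme: the helper names are hypothetical, and it assumes the normal approximation is acceptable; with a sample standard deviation and n = 50, a t-based interval would be slightly wider.

```python
import math
from statistics import NormalDist


def standard_error(sd, n):
    """Standard error of the sample mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)


def z_statistic(sample_mean, hypothesized_mean, sd, n):
    """How many standard errors the sample mean lies from the hypothesized mean."""
    return (sample_mean - hypothesized_mean) / standard_error(sd, n)


def two_sided_p_value(z):
    """Two-sided p-value under the standard normal distribution."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


def z_confidence_interval(sample_mean, sd, n, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z_crit * standard_error(sd, n)
    return sample_mean - margin, sample_mean + margin


# Question 1: population mean 50,000, sd 5,000, n = 25, sample mean 47,500.
z1 = z_statistic(47_500, 50_000, 5_000, 25)
print("Q1 z statistic:", z1, "p-value:", two_sided_p_value(z1))

# Question 2: sample mean 25 hours, sd 4 hours, n = 50.
print("Q2 95% interval:", z_confidence_interval(25, 4, 50))
```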

Short answer based questions: (18 marks)

1. What is the difference between a categorical variable and a numerical variable? Give an
example of each.

2. What is a scatterplot, and what type of data is it used to visualize? How can you interpret
the relationship between variables in a scatterplot?

3. What is a boxplot, and what type of data is it used to visualize? How can you interpret
the information provided by a boxplot?

4. What is the mean, and how is it calculated? What are some limitations of using the mean
as a measure of central tendency?

5. What is the standard deviation, and how is it calculated? How can you interpret the value
of the standard deviation in relation to the distribution of the data?

6. What is an outlier, and why can it be a problem in data analysis? What are some
common methods used to detect outliers in a dataset?

7. What is the interquartile range (IQR), and how can it be used to detect outliers in a
dataset? How does the IQR differ from the standard deviation as a measure of spread?

8. What is a normal distribution, and why is it important in data analysis? How can you
determine if a dataset follows a normal distribution?

9. What is a skewed distribution, and what are the different types of skewness that can
occur in a dataset? How can you interpret the skewness of a dataset in relation to its
distribution?

10. What is a correlation coefficient, and how is it calculated? What is the range of possible
values for a correlation coefficient, and how can you interpret the strength and direction
of the correlation?

11. What is a scatterplot matrix, and how is it used to visualize correlations between features
in a dataset? What information can you obtain from a scatterplot matrix that you can't
get from individual scatterplots?

12. What is a null hypothesis, and why is it used in hypothesis testing? What is the
alternative hypothesis, and how does it relate to the null hypothesis?

13. What is a t-test, and what type of data is it used to analyze? What is the difference
between a one-sample t-test and a two-sample t-test?
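
For short-answer question 7, the IQR rule (Tukey's fences) can be sketched as follows. The data values and the function name are hypothetical, chosen only for illustration, and the multiplier 1.5 is the conventional choice rather than something stated in this paper. Quartile conventions differ between tools; this sketch uses the default method of Python's `statistics.quantiles`.

```python
from statistics import quantiles


def tukey_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles, default 'exclusive' method
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical sample with one obviously extreme value.
sample = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
print(tukey_outliers(sample))
```

Because the fences are built from quartiles, a single extreme value barely moves them, whereas the same value inflates the standard deviation considerably.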

SECTION-C

Case based questions: (14 marks)

Case 1:

A company is trying to determine the effectiveness of a new marketing campaign. They
randomly select 100 customers and send them a promotional email. Of these 100 customers, 20
make a purchase. The company wants to know if the proportion of customers who make a
purchase after receiving the email is significantly different from the proportion who make a
purchase without receiving the email. Use a 5% level of significance.

● What is the null hypothesis and alternative hypothesis for this experiment?
● Calculate the sample proportion of customers who made a purchase after receiving the
email.
● Calculate the standard error of the proportion.
● Calculate the test statistic and corresponding p-value.
● Should the null hypothesis be rejected? What is the conclusion of the hypothesis test?
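
The steps listed for Case 1 can be sketched as a one-sample z-test for a proportion. The case does not state the purchase proportion without the email, so the baseline `p0 = 0.10` below is a purely hypothetical placeholder, and the function name is also an assumption. The standard error here is computed under the null hypothesis; the standard error based on the sample proportion, sqrt(p̂(1 - p̂)/n), would be used for a confidence interval instead.

```python
import math
from statistics import NormalDist


def one_sample_proportion_ztest(successes, n, p0):
    """Two-sided z-test of H0: p = p0 against H1: p != p0."""
    p_hat = successes / n
    se_null = math.sqrt(p0 * (1 - p0) / n)  # standard error assuming H0 is true
    z = (p_hat - p0) / se_null
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_hat, se_null, z, p_value


# 20 purchases out of 100 emails; p0 = 0.10 is a HYPOTHETICAL baseline.
p_hat, se, z, p_value = one_sample_proportion_ztest(20, 100, p0=0.10)
alpha = 0.05
print(f"p_hat={p_hat:.2f}, SE={se:.3f}, z={z:.3f}, p-value={p_value:.5f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```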

Case 2:

A company is conducting a study to determine if there is a relationship between age and income.
They collect data from 200 people and record their age (in years) and income (in thousands of
dollars). The data is shown below:

Age (years)    Income (thousands of dollars)
25             40
30             50
35             60
40             55
45             70
50             65
55             75
60             80

● Calculate the sample mean and sample standard deviation for age and income.
● Create a scatterplot of the data. What is the direction of the relationship between age and
income? Is the relationship linear or nonlinear?
● Calculate the correlation coefficient between age and income. What is the strength and
direction of the correlation?
● Conduct a hypothesis test to determine if there is a significant linear relationship
between age and income. Use a 5% level of significance.
● What is the equation of the regression line for predicting income based on age? What is
the predicted income for someone who is 50 years old?
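
The Case 2 calculations can be sketched as below, using only the eight rows listed in the table (the case text mentions 200 people, but only these rows are shown). The helper names are hypothetical. The critical value 2.447 is the tabulated two-sided 5% value of the t distribution with n - 2 = 6 degrees of freedom.

```python
import math

ages = [25, 30, 35, 40, 45, 50, 55, 60]
incomes = [40, 50, 60, 55, 70, 65, 75, 80]  # thousands of dollars


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation with the n - 1 denominator."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def sums_of_squares(x, y):
    """Return Sxx, Syy and Sxy, the centered sums of squares and cross-products."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxx, syy, sxy


def pearson_r(x, y):
    sxx, syy, sxy = sums_of_squares(x, y)
    return sxy / math.sqrt(sxx * syy)


def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    sxx, _, sxy = sums_of_squares(x, y)
    slope = sxy / sxx
    return slope, mean(y) - slope * mean(x)


r = pearson_r(ages, incomes)
n = len(ages)
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)  # test of H0: rho = 0
T_CRITICAL_DF6 = 2.447  # two-sided, alpha = 0.05, df = 6
slope, intercept = fit_line(ages, incomes)

print("r:", r, "t:", t_stat, "significant:", abs(t_stat) > T_CRITICAL_DF6)
print("predicted income at age 50:", intercept + slope * 50)
```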

Case 3:

A hospital is conducting a study to compare the effectiveness of two different treatments for a
certain medical condition. They randomly assign 50 patients to receive Treatment A and 50
patients to receive Treatment B. The results are shown below:

Treatment A    Treatment B
10             15
12             14
15             18
16             20
18             17
20             16

● Calculate the sample mean and sample standard deviation for Treatment A and
Treatment B.
● Conduct a two-sample t-test to determine if there is a significant difference in the mean
effectiveness of Treatment A and Treatment B. Use a 5% level of significance.
● What is the null hypothesis and alternative hypothesis for this experiment?
● Calculate the test statistic and corresponding p-value.
● Should the null hypothesis be rejected? What is the conclusion of the hypothesis test?
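
A pooled-variance (Student's) two-sample t-test on the six values listed per treatment can be sketched as below; the case text mentions 50 patients per arm, but only these values are shown. The function name is hypothetical, and equal variances are an assumption of this sketch (Welch's test is the usual alternative when that is doubtful). The critical value 2.228 is the tabulated two-sided 5% value for 6 + 6 - 2 = 10 degrees of freedom; an exact p-value would need a t-distribution CDF, which the standard library does not provide.

```python
import math
from statistics import mean, variance

treatment_a = [10, 12, 15, 16, 18, 20]
treatment_b = [15, 14, 18, 20, 17, 16]


def pooled_t_statistic(a, b):
    """Student's two-sample t statistic and degrees of freedom (equal variances assumed)."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    # variance() uses the n - 1 denominator, i.e. the sample variance.
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / df
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / se, df


t_stat, df = pooled_t_statistic(treatment_a, treatment_b)
T_CRITICAL_DF10 = 2.228  # two-sided, alpha = 0.05, df = 10

print("t:", t_stat, "df:", df)
print("Reject H0" if abs(t_stat) > T_CRITICAL_DF10 else "Fail to reject H0")
```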
