Data Science Question Bank Updated - Google Docs
1. List out and explain the various phases of the data science process.
2. Why is goal definition important in a Data Science project? How
does domain expertise help in this process? Explain the first step of
the data science process: setting the research goal and project charter.
3. What is the data retrieval phase? What are the common sources of
data in Data Science?
4. Explain the data preparation phase, including data cleaning,
transformation, and combining data.
5. Explain various types of data entry errors and how to fix them in
the data preparation phase.
6. What is data cleaning, and why is it crucial? Discuss handling
missing data, outliers, and feature scaling.
7. What is Exploratory Data Analysis (EDA)? How do visualization
techniques help in understanding data? OR Explain the data exploration
phase in the data science process.
8. What is data modeling in Data Science? What is a hold out sample?
How does cross-validation improve model performance?
Unit II Introduction to statistics
3. Inferential Statistics
Subject    Mean  Standard deviation
Maths      65    10
English    60    12
Economics  57    14
20. In a small business firm, two typists are employed: Typist A and Typist B.
Typist A types, on average, 30 pages per day with a standard
deviation of 6. Typist B types, on average, 45 pages per day with a
standard deviation of 10. Which typist shows greater consistency in
output?
21. The weights of the male population follow a normal distribution
with a mean of 70 kg and a standard deviation of 15 kg. What would
the mean and standard deviation of a sample of 50 men be if a
researcher examined their records?
22. A distribution has a mean of 69 and a standard deviation of 420.
Find the mean and standard deviation if a sample of 80 is drawn
from the distribution.
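One way to check answers to Questions 21 and 22 is sketched below. It assumes both questions ask for the sampling distribution of the sample mean, so the mean is unchanged and the standard deviation becomes the standard error σ/√n. The function name `sampling_distribution` is illustrative only.

```python
import math

def sampling_distribution(pop_mean, pop_sd, n):
    """Return (mean, standard error) of the sample mean for samples of size n."""
    return pop_mean, pop_sd / math.sqrt(n)

# Question 21: population mean 70 kg, sd 15 kg, sample of 50
mean_21, se_21 = sampling_distribution(70, 15, 50)

# Question 22: population mean 69, sd 420, sample of 80
mean_22, se_22 = sampling_distribution(69, 420, 80)

print(f"Q21: mean = {mean_21}, standard error = {se_21:.3f}")
print(f"Q22: mean = {mean_22}, standard error = {se_22:.3f}")
```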
23. A nutritionist claims that the average sugar content in a brand of cereal is
less than 10 grams per serving. A random sample of 30 cereal boxes
shows an average sugar content of 9.5 grams with a standard deviation
of 1.2 grams. At a 5% significance level (α = 0.05), test whether the
nutritionist's claim is supported.
24. A manufacturer claims that the average lifespan of its LED bulbs is at
least 25,000 hours. A consumer protection agency tests 40 randomly
selected bulbs and finds an average lifespan of 24,500 hours with a
standard deviation of 1,200 hours. At a 5% significance level (α =
0.05), test whether the agency’s data contradicts the manufacturer’s
claim.
25. A soft drink company claims that the average sugar content in its cola is
39 grams per can. A health organization collects a random sample of
50 cans and finds the average sugar content is 40 grams, with a
standard deviation of 2 grams. At a 1% significance level (α = 0.01),
test if the actual sugar content is different from 39 grams.
26. A company manufacturing automobiles finds that tyre-life is normally
distributed with a mean of 40,000 km and standard deviation of 3000
km. It is believed that a change in the production process will result in a
better product and the company has developed a new tyre. A sample of
100 new tyres has been selected. The company has found that the mean
life of these new tyres is 40,900 km. Can it be concluded that the new
tyre is significantly better than the old one at a significance level of
0.01?
Hint: we are interested in testing whether there has been an
increase in the mean life of tyres, that is, whether the mean life of the
new tyre has increased beyond 40,000 km.
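The tyre-life question can be checked with a right-tailed one-sample z-test, sketched below. It assumes the population standard deviation of 3000 km still applies to the new tyres, with H0: μ = 40,000 and H1: μ > 40,000. The variable names are illustrative.

```python
import math
from statistics import NormalDist

mu0 = 40_000     # mean tyre life under the null hypothesis (km)
sigma = 3_000    # population standard deviation, assumed unchanged (km)
n = 100          # sample size
xbar = 40_900    # observed sample mean (km)
alpha = 0.01     # significance level

# Test statistic: distance of the sample mean from mu0 in standard errors
z = (xbar - mu0) / (sigma / math.sqrt(n))

# Right-tailed critical value and p-value from the standard normal distribution
z_crit = NormalDist().inv_cdf(1 - alpha)
p_value = 1 - NormalDist().cdf(z)

reject_h0 = z > z_crit
print(f"z = {z:.3f}, critical value = {z_crit:.3f}, p-value = {p_value:.5f}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

Questions 23 to 25 use a sample standard deviation instead of a known σ, so a t-test is the usual choice there; the z-test above is not a drop-in replacement for small samples.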
28. Following are the runs scored by two batsmen in 5 cricket matches. Who
is more consistent in scoring runs?
Batsman A: 38 47 34 18 33
Batsman B: 37 35 41 27 35
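Consistency is commonly compared with the coefficient of variation (standard deviation divided by mean, as a percentage); the lower value indicates the more consistent batsman. The sketch below assumes the population standard deviation is intended. Using the sample standard deviation changes the numbers but not the ranking here.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """Population standard deviation as a percentage of the mean."""
    return pstdev(values) / mean(values) * 100

runs = {
    "A": [38, 47, 34, 18, 33],
    "B": [37, 35, 41, 27, 35],
}

cv = {name: coefficient_of_variation(scores) for name, scores in runs.items()}
more_consistent = min(cv, key=cv.get)  # lowest CV wins

for name, value in cv.items():
    print(f"Batsman {name}: CV = {value:.2f}%")
print(f"More consistent: Batsman {more_consistent}")
```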
30. For the given observations {23, 24, 56, 55, 28, 38, 48}, calculate:
● Skewness
● Kurtosis
● Determine the type of kurtosis
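A minimal sketch of the moment-based definitions follows, assuming population moments (division by n) and kurtosis measured against the normal-distribution value of 3. Textbooks differ on sample corrections, so the helper names and the convention used here are assumptions.

```python
def central_moment(values, k):
    """k-th central moment, dividing by n (population convention)."""
    m = sum(values) / len(values)
    return sum((x - m) ** k for x in values) / len(values)

def skewness(values):
    """Moment coefficient of skewness: m3 / m2^(3/2)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """Moment coefficient of kurtosis: m4 / m2^2 (a normal distribution gives 3)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

def kurtosis_type(k, tol=1e-9):
    """Classify kurtosis relative to the normal value of 3."""
    if abs(k - 3) <= tol:
        return "mesokurtic"
    return "leptokurtic" if k > 3 else "platykurtic"

data = [23, 24, 56, 55, 28, 38, 48]
skew = skewness(data)
kurt = kurtosis(data)
print(f"skewness = {skew:.4f}, kurtosis = {kurt:.4f}, type = {kurtosis_type(kurt)}")
```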
31. Given the weights of five persons: 120, 140, 150, 160, and 180, find the
following:
● Mean
● Median
● Mode
● Standard deviation
● Variance
● Interquartile range
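These measures can be cross-checked with Python's `statistics` module, as sketched below. The sketch assumes sample (n − 1) formulas for standard deviation and variance, and uses the module's default "exclusive" quartile method; other quartile conventions can give a different interquartile range.

```python
import statistics as st

weights = [120, 140, 150, 160, 180]

mean = st.mean(weights)
median = st.median(weights)
modes = st.multimode(weights)      # returns every value when none repeats
sample_sd = st.stdev(weights)      # divides by n - 1
sample_var = st.variance(weights)  # divides by n - 1

# Quartiles with the default 'exclusive' method
q1, q2, q3 = st.quantiles(weights, n=4)
iqr = q3 - q1

print(f"mean = {mean}, median = {median}")
print(f"modes = {modes} (no unique mode if every value appears once)")
print(f"sample sd = {sample_sd:.3f}, sample variance = {sample_var}")
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```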
33. Given the observations {6, 8, 10, 12, 14, 16, 18, 20, 22, 24}, calculate the
following:
● Mean
● Median
● Standard deviation
● Variance
● Skewness
● Kurtosis
● Lower quartile
● Upper quartile
● Middle quartile
● Interquartile range
● Range
34. Calculate Sample mean, sample variance, sample skewness and sample
kurtosis from the following grouped data:
Class interval  Frequency
2-4             3
4-6             4
6-8             2
8-10            1
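For grouped data, each class is usually represented by its midpoint and weighted by its frequency. The sketch below shows the sample mean and sample variance under that assumption, reading the first column as class intervals and the second as frequencies. Skewness and kurtosis follow the same frequency-weighted pattern with third and fourth powers.

```python
intervals = [(2, 4), (4, 6), (6, 8), (8, 10)]
freqs = [3, 4, 2, 1]

# Represent every class by its midpoint
midpoints = [(lo + hi) / 2 for lo, hi in intervals]
n = sum(freqs)

# Frequency-weighted sample mean
grouped_mean = sum(f * x for f, x in zip(freqs, midpoints)) / n

# Frequency-weighted sample variance (n - 1 in the denominator)
grouped_var = sum(
    f * (x - grouped_mean) ** 2 for f, x in zip(freqs, midpoints)
) / (n - 1)

print(f"n = {n}, mean = {grouped_mean:.2f}, sample variance = {grouped_var:.4f}")
```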
Examples:
KNN Classification
6. Given the following dataset with two features (X1, X2) and class labels:
X1 X2 Class
1 1 A
2 2 A
3 3 B
6 6 B
7. You have a dataset of fruits classified based on their weight and size.
Weight (g)  Size (cm)  Fruit
150         8          Apple
180         10         Apple
200         12         Orange
220         14         Orange
Classify a fruit with weight 190g and size 11cm using KNN (K=3).
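A minimal KNN sketch for the fruit question follows, assuming Euclidean distance on the raw, unscaled features and a simple majority vote. The helper name `knn_classify` is illustrative.

```python
import math
from collections import Counter

# (weight in g, size in cm) -> label
training_data = [
    ((150, 8), "Apple"),
    ((180, 10), "Apple"),
    ((200, 12), "Orange"),
    ((220, 14), "Orange"),
]

def knn_classify(query, data, k):
    """Return the majority label among the k training points nearest to query."""
    nearest = sorted(data, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

predicted = knn_classify((190, 11), training_data, k=3)
print(f"Predicted class: {predicted}")
```

Because the features are not scaled, weight dominates the distance; standardizing both features first is a common refinement.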
KNN Regression
8. A dataset provides the exam scores of students based on study hours:
Study hours  Exam score
1            45
2            50
3            55
4            60
5            65
Predict the score for a student who studies 3.5 hours using KNN
regression with K=3.
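KNN regression averages the targets of the K nearest points. The sketch below assumes absolute distance on study hours. Note that 2 hours and 5 hours are equally far from 3.5, so the third neighbour is a tie; this sketch breaks it by keeping the point that appears first in the data (2 hours), which is only one possible convention.

```python
def knn_regress(query, data, k):
    """Average the targets of the k points whose x is closest to query.

    Python's sort is stable, so ties in distance keep their original order.
    """
    nearest = sorted(data, key=lambda point: abs(point[0] - query))[:k]
    return sum(y for _, y in nearest) / k

# (study hours, exam score)
scores = [(1, 45), (2, 50), (3, 55), (4, 60), (5, 65)]

prediction = knn_regress(3.5, scores, k=3)
print(f"Predicted score for 3.5 hours: {prediction}")
```

With the tie resolved toward 5 hours instead, the three neighbours would be 3, 4, and 5 hours and the average would be 60, so stating the tie-breaking rule is part of a complete answer.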
Size (sq. ft.)  Price
1000            50
1500            70
2000            90
2500            110
Predict the price for a 1750 sq. ft. house using KNN regression with
K=2.
Ad spend (Lakhs)  Sales revenue
1                 10
2                 20
3                 30
4                 40
Fit a linear regression model and estimate the sales revenue if the ad spend is
2.5 Lakhs.
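Simple linear regression can be fitted by hand with the least-squares formulas: slope = Sxy / Sxx and intercept = ȳ − slope · x̄. The sketch below assumes the first column of the table is ad spend and the second is sales revenue; the helper name `fit_line` is illustrative. The same function applies to the other regression tables in this section.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

ad_spend = [1, 2, 3, 4]      # assumed: first column of the table
revenue = [10, 20, 30, 40]   # assumed: second column of the table

slope, intercept = fit_line(ad_spend, revenue)
predicted_revenue = intercept + slope * 2.5

print(f"revenue = {intercept:.2f} + {slope:.2f} * ad_spend")
print(f"Predicted revenue at 2.5: {predicted_revenue:.2f}")
```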
Study hours  Score
2            50
4            60
6            70
8            80
Calculate the linear regression equation and predict the score for a student who
studies 5 hours.
Years of experience  Salary
1                    3
3                    6
5                    9
Find the regression equation and predict the salary for an employee with 4
years of experience.
Data Visualization
Filters:
1. What is the purpose of filters in data visualization, and how do they
enhance the viewer's experience?
2. Can you explain the difference between static and interactive filters
in data visualizations?
Python Libraries for Visualization:
1. Compare and contrast Matplotlib and Seaborn. What are the
primary use cases for each library?
2. How would you choose the right visualization library for a given
data science project? Provide examples where you would prefer one
over the other.
Matplotlib:
1. What are the basic components of a Matplotlib plot?
2. How can you create a multi-line plot using Matplotlib?
3. Describe the box plot in detail.
Seaborn:
1. Describe how Seaborn integrates with other Python libraries. What
advantages does this offer?
2. Provide an example of a complex data visualization that can be
more easily generated with Seaborn than with Matplotlib.
Data Science Ethics
Doing Good Data Science:
1. What are the ethical considerations a data scientist must keep in
mind when designing a new algorithm?
2. How can data scientists ensure their work contributes positively to
society?
Owners of the Data:
1. Discuss the implications of data ownership in the context of
personal versus corporate data.
2. How does the concept of data ownership affect data access for
scientific research?
Valuing Different Aspects of Privacy:
1. What challenges arise when balancing individual privacy with the
benefits of big data analytics?
2. Provide examples of privacy-preserving techniques in data science.
The Five Cs of Data Science:
1. Define the "Five Cs" in the context of ethical data science practices.
2. How can adhering to the Five Cs improve the outcome of a data
science project?
Diversity and Inclusion:
1. Why is diversity important in data science teams and data
collection?
2. Discuss an example where lack of diversity in data collection led to
biased outcomes.