Data Science Question Bank Updated - Google Docs

The document is a comprehensive question bank for a Data Science course, covering topics such as the definition and significance of Data Science, its interdisciplinary fields, the roles of Data Scientists, and the Data Science process. It includes sections on statistics, machine learning concepts, and data visualization, along with practical numerical problems and ethical considerations in data science. Each unit contains detailed questions aimed at assessing knowledge and understanding of the subject matter.

Uploaded by

Piyush Patil

CE0630-Data Science Question Bank

Unit I: Introduction to Data Science

Introduction to Data Science

1. Define Data Science and explain its significance. How does it
differ from traditional data analysis and Business Intelligence?
2. What are the key interdisciplinary fields that contribute to Data
Science? Explain their roles.
3. What are the main responsibilities of a Data Scientist? How do
they differ from Data Analysts and Data Engineers?
4. What key skills are required to become a Data Scientist? Discuss
both technical and non-technical skills.
5. How does Data Science contribute to business decision-making?
Give real-world examples.
6. List the important use cases of Data Science.
7. Differentiate between a Data Engineer, a Data Analyst, and a Data
Scientist.
8. Discuss some major applications of Data Science in healthcare,
finance, e-commerce, and social media analytics.
9. Explain the relationship between Data Science and Big Data.
What are the key characteristics (6Vs) of Big Data?
10. What is the role of Machine Learning in Data Science?
Differentiate between Supervised, Unsupervised, and
Reinforcement Learning.
11. How is Data Science used in fraud detection, recommendation
systems, and predictive maintenance? Explain with examples.
12. Discuss the role of Data Science in social media analytics,
customer segmentation, and personalized marketing.
13. Discuss the application areas of Data Science.
14. Discuss the use cases of Data Science.

Data Science Process Overview

1. List out and explain the various phases of the data science process.
2. Why is goal definition important in a Data Science project? How
does domain expertise help in this process? Explain the first step of
the data science process: Setting the Research Goal & Project Charter.
3. What is the data retrieval phase? What are the common sources of
data in Data Science?
4. Explain the data preparation phase, including data cleaning,
transformation, and combining data.
5. Explain various types of data entry errors and how to fix them in
the data preparation phase.
6. What is data cleaning, and why is it crucial? Discuss handling
missing data, outliers, and feature scaling.
7. What is Exploratory Data Analysis (EDA)? How do visualization
techniques help in understanding data? OR Explain the data exploration
phase in the data science process.
8. What is data modeling in Data Science? What is a hold out sample?
How does cross-validation improve model performance?

Unit II: Introduction to Statistics

1. Basics of Statistics: Descriptive Statistics

1. What is statistics? How is it used in Data Science?
2. Differentiate between Descriptive and Inferential Statistics with
examples.
3. Explain the concepts of Population and Sample in statistics. Why is
sampling important?
4. What are the different types of variables in statistics? Give examples.
5. Define Measures of Central Tendency. Why is it important in data
science? Explain Mean, Median, and Mode with examples.
6. What are Measures of Variability? Discuss Range, Variance, and
Standard Deviation.
7. Define the Coefficient of Variation (CV) and explain its significance.
8. What is Skewness? How does it indicate the shape of a distribution?
9. What is Kurtosis? How does it describe the characteristics of a
probability distribution?

3. Inferential Statistics

10. What is the Normal Distribution, and why is it important in statistics?
11. Explain Hypothesis Testing and its importance in statistics.
12. What is the Central Limit Theorem (CLT)? Why is it important in
inferential statistics?
13. What is a Confidence Interval? How is it calculated?
14. What is a t-test? Explain its applications with examples.
15. Differentiate between Type I and Type II errors in hypothesis testing.

Numerical Problems

15. The number of points scored by two teams in a hockey match is
given below. With the help of the Coefficient of Variation, determine
which team is more consistent.

16. The coefficients of variation of two series X and Y are 55.43% and
48.86%, and their standard deviations are 25.5 and 24.43, respectively.
Find the means of series X and Y.
17. The standard deviation and mean of the data are 8.5 and 14.5
respectively. Find the coefficient of variation.
18. If the mean and coefficient of variation of the data are 13 and 38
respectively, find the standard deviation.
19. The mean and standard deviation of marks obtained by 40 students of a
class in three subjects (Mathematics, English, and Economics) are given
below. Which of the three subjects shows the highest variation and
which shows the lowest variation in marks?

Subject      Mean   Standard deviation

Maths        65     10

English      60     12

Economics    57     14

20. In a small business firm, two typists are employed: Typist A and Typist B.
Typist A types, on average, 30 pages per day with a standard
deviation of 6. Typist B types, on average, 45 pages per day with a
standard deviation of 10. Which typist shows greater consistency in
output?
21. The weight of the male population follows a normal distribution with
a mean of 70 kg and a standard deviation of 15 kg. What would the
mean and standard deviation of the sample mean be if a researcher
examined the records of a sample of 50 men?
22. A distribution has a mean of 69 and a standard deviation of 420.
Find the mean and standard deviation if a sample of 80 is drawn
from the distribution.
23. A nutritionist claims that the average sugar content in a brand of cereal is
less than 10 grams per serving. A random sample of 30 cereal boxes
shows an average sugar content of 9.5 grams with a standard deviation
of 1.2 grams. At a 5% significance level (α = 0.05), test whether the
nutritionist's claim is supported.
24. A manufacturer claims that the average lifespan of its LED bulbs is at
least 25,000 hours. A consumer protection agency tests 40 randomly
selected bulbs and finds an average lifespan of 24,500 hours with a
standard deviation of 1,200 hours. At a 5% significance level (α =
0.05), test whether the agency’s data contradicts the manufacturer’s
claim.
25. A soft drink company claims that the average sugar content in its cola is
39 grams per can. A health organization collects a random sample of
50 cans and finds the average sugar content is 40 grams, with a
standard deviation of 2 grams. At a 1% significance level (α = 0.01),
test if the actual sugar content is different from 39 grams.
26. A company manufacturing automobiles finds that tyre-life is normally
distributed with a mean of 40,000 km and standard deviation of 3000
km. It is believed that a change in the production process will result in a
better product and the company has developed a new tyre. A sample of
100 new tyres has been selected. The company has found that the mean
life of these new tyres is 40,900 km. Can it be concluded that the new
tyre is significantly better than the old one at a significance level of
0.01?
Hint: We are interested in testing whether there has been an
increase in the mean life of tyres, that is, whether the mean life of the
new tyre has increased beyond 40,000 km.
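
The hint for the tyre-life problem describes an upper-tailed one-sample z-test. The following is a minimal sketch, not a model answer: it assumes the stated population standard deviation of 3,000 km still applies to the new tyres, and the function name `upper_tailed_z_test` is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def upper_tailed_z_test(sample_mean, mu0, sigma, n, alpha):
    """One-sample z-test of H0: mu = mu0 against H1: mu > mu0 (sigma known)."""
    standard_error = sigma / sqrt(n)
    z = (sample_mean - mu0) / standard_error
    z_critical = NormalDist().inv_cdf(1 - alpha)  # upper-tail critical value
    p_value = 1 - NormalDist().cdf(z)             # P(Z >= z) under H0
    return z, z_critical, p_value, z > z_critical


# Figures from the tyre-life problem: mu0 = 40,000 km, sigma = 3,000 km,
# n = 100 new tyres, sample mean = 40,900 km, alpha = 0.01.
z, z_critical, p_value, reject_h0 = upper_tailed_z_test(
    40_900, 40_000, 3_000, 100, 0.01
)
print(f"z = {z:.2f}, critical z = {z_critical:.3f}, "
      f"p = {p_value:.4f}, reject H0: {reject_h0}")
```

Here z = 3.00 exceeds the critical value of about 2.326, so under these assumptions H0 is rejected at the 0.01 level. The lower-tailed and two-tailed problems in this list need a different rejection region, and where only a sample standard deviation is given a t-test may be the more appropriate choice.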
28. The runs scored by two batsmen in 5 cricket matches are given below. Who
is more consistent in scoring runs?

Batsman A: 38 47 34 18 33

Batsman B: 37 35 41 27 35
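
Consistency questions like this one are usually answered by comparing coefficients of variation, CV = (standard deviation / mean) × 100, where the lower CV is the more consistent series. The sketch below assumes the population standard deviation (`statistics.pstdev`); a course that uses the sample standard deviation would get slightly different percentages. The helper name `coefficient_of_variation` is illustrative.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """CV as a percentage, using the population standard deviation (assumption)."""
    return pstdev(values) / mean(values) * 100


batsman_a = [38, 47, 34, 18, 33]
batsman_b = [37, 35, 41, 27, 35]

cv_a = coefficient_of_variation(batsman_a)
cv_b = coefficient_of_variation(batsman_b)
more_consistent = "A" if cv_a < cv_b else "B"

print(f"CV of A = {cv_a:.2f}%, CV of B = {cv_b:.2f}%, "
      f"more consistent: Batsman {more_consistent}")
```

With these figures the CV of Batsman B (about 13%) is lower than that of Batsman A (about 28%), so B is the more consistent scorer. The same helper applies to the hockey, typist, and subject-marks problems once their data are entered.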

29. Find the skewness of the data {2, 4, 6, 6} using:

Skewness = 3 × (Mean - Median) / S.D.
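
The formula above is Pearson's second (median) skewness coefficient. The sketch below applies it to the four values in the question. It assumes S.D. means the population standard deviation; using the sample standard deviation instead would give a smaller magnitude.

```python
from statistics import mean, median, pstdev


def pearson_median_skewness(values):
    """Pearson's second skewness coefficient: 3 * (mean - median) / SD."""
    return 3 * (mean(values) - median(values)) / pstdev(values)


data = [2, 4, 6, 6]
skewness = pearson_median_skewness(data)
print(f"mean = {mean(data)}, median = {median(data)}, "
      f"SD = {pstdev(data):.4f}, skewness = {skewness:.4f}")
```

The mean (4.5) lies below the median (5), so the coefficient is negative (about -0.90), which indicates a left-skewed distribution.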

30. For the given observations {23, 24, 56, 55, 28, 38, 48}, calculate:
● Skewness
● Kurtosis
● Determine the type of kurtosis

31. Given the weights of five persons (120, 140, 150, 160, and 180), find the
following:
● Mean
● Median
● Mode
● Standard deviation
● Variance
● Interquartile range

32. A random sample of n = 500 observations from a binomial population
produced x = 240 successes.

● Find a point estimate for p and construct a 95% confidence interval.
● Find a 90% confidence interval for p.
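
One common way to approach this problem is the large-sample (Wald) interval, p̂ ± z·sqrt(p̂(1 − p̂)/n). The sketch below assumes that method; other intervals, such as Wilson's, give slightly different bounds. The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(successes, n, confidence):
    """Wald (normal-approximation) confidence interval for a proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided critical value
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, p_hat - margin, p_hat + margin


p_hat, low_95, high_95 = proportion_confidence_interval(240, 500, 0.95)
_, low_90, high_90 = proportion_confidence_interval(240, 500, 0.90)

print(f"point estimate = {p_hat:.2f}")
print(f"95% CI = ({low_95:.4f}, {high_95:.4f})")
print(f"90% CI = ({low_90:.4f}, {high_90:.4f})")
```

The point estimate is 0.48. The 90% interval is narrower than the 95% interval because it uses a smaller critical value (about 1.645 instead of 1.960).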

33. Given the observations {6, 8, 10, 12, 14, 16, 18, 20, 22, 24}, calculate the
following:

● Mean
● Median
● Standard deviation
● Variance
● Skewness
● Kurtosis
● Lower quartile
● Upper quartile
● Middle quartile
● Interquartile range
● Range

34. Calculate the sample mean, sample variance, sample skewness, and sample
kurtosis from the following grouped data.
35. Calculate the population skewness and population kurtosis from the following
grouped data, and explain the type of kurtosis and skewness of the data.

Class Interval Frequency

2-4 3

4-6 4

6-8 2

8-10 1
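
For grouped data, each class is usually represented by its midpoint and weighted by its frequency. The sketch below follows that convention for the table above. It computes the mean, the sample variance (divisor n − 1), and the moment-based population skewness and kurtosis. Bias-corrected sample skewness and kurtosis use extra correction factors that are not shown here, so treat this as an outline, not the expected answer format.

```python
# (lower bound, upper bound, frequency) for each class interval in the table
classes = [(2, 4, 3), (4, 6, 4), (6, 8, 2), (8, 10, 1)]

midpoints = [(low + high) / 2 for low, high, _ in classes]
freqs = [freq for _, _, freq in classes]
n = sum(freqs)

grouped_mean = sum(f * x for f, x in zip(freqs, midpoints)) / n


def central_moment(order):
    """Population central moment of the given order (divisor n)."""
    return sum(f * (x - grouped_mean) ** order
               for f, x in zip(freqs, midpoints)) / n


population_variance = central_moment(2)
sample_variance = population_variance * n / (n - 1)
skewness = central_moment(3) / central_moment(2) ** 1.5
kurtosis = central_moment(4) / central_moment(2) ** 2

print(f"mean = {grouped_mean:.2f}, sample variance = {sample_variance:.4f}")
print(f"population skewness = {skewness:.4f}, population kurtosis = {kurtosis:.4f}")
```

With this table the skewness is positive (about 0.51, right-skewed) and the kurtosis is below 3 (about 2.37), which is conventionally described as platykurtic.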

Unit III: Machine Learning - Introduction and Concepts

1. What is Machine Learning? Explain the modeling process in Machine
Learning. What are its key steps?
2. Explain the following key terminologies in Machine Learning: Features,
Target, Training Data, Testing Data, Overfitting, and Underfitting.
3. What are the four main phases in the machine learning modeling
process?
4. What is the role of feature engineering in Machine Learning?
5. What is the role of model training in machine learning? What is its
significance?
6. What is model selection and validation, and why is it important in
machine learning? How is model scoring used to assess the effectiveness
of a machine learning model?
7. What are some key methods for validating a Machine Learning model?
8. How does a trained model make predictions on new observations?
9. List out and explain the role of various Python tools used in machine
learning for data science.
10. What is Supervised Learning? How does it work? Explain the differences
between Regression and Classification in Supervised Learning.
11. What is a Naïve Bayes classifier? Explain it in the context of the case
study on handwritten digit recognition.
12. What is Unsupervised Learning? How does it differ from Supervised
Learning?
13. Explain linear regression with suitable examples for supervised learning.
14. How can a confusion matrix help evaluate the performance of a
classification model?
15. What role does principal component analysis (PCA) play in unsupervised
learning? OR How does PCA help in reducing input variables while
maintaining important information?
16. What are Clustering Algorithms? Explain their applications in Data
Science. Explain the K-means clustering algorithm with suitable examples.
17. What are the key evaluation metrics for classification and
regression models? Provide examples.

Examples:

KNN Classification

6. Given the following dataset with two features (X1, X2) and class labels:

X1 X2 Class

1 1 A

2 2 A

3 3 B

6 6 B

Using KNN with K=3, classify a new data point (4,4).

7. You have a dataset of fruits classified based on their weight and size.

Weight (g) Size (cm) Fruit Type

150 8 Apple

180 10 Apple

200 12 Orange

220 14 Orange

Classify a fruit with weight 190g and size 11cm using KNN (K=3).
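
The two KNN classification problems above can be checked with a few lines of code. The sketch below assumes Euclidean distance on the raw, unscaled features and a simple majority vote. `knn_classify` is a hypothetical helper, not a library function.

```python
from collections import Counter
from math import dist


def knn_classify(points, query, k):
    """points: list of (feature_tuple, label). Majority vote among the k nearest."""
    nearest = sorted(points, key=lambda p: dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Dataset from the first problem (X1, X2, Class)
xy_points = [((1, 1), "A"), ((2, 2), "A"), ((3, 3), "B"), ((6, 6), "B")]
# Dataset from the fruit problem (weight in g, size in cm, fruit type)
fruits = [((150, 8), "Apple"), ((180, 10), "Apple"),
          ((200, 12), "Orange"), ((220, 14), "Orange")]

label_xy = knn_classify(xy_points, (4, 4), k=3)
label_fruit = knn_classify(fruits, (190, 11), k=3)
print(label_xy, label_fruit)
```

For the point (4, 4) the three nearest neighbours are (3, 3), (2, 2), and (6, 6), giving two votes for B. For the fruit, the neighbours at 180 g, 200 g, and 220 g give two votes for Orange. Note that without feature scaling the weight column dominates the distance in the fruit problem.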

KNN Regression

8. A dataset provides the exam scores of students based on study hours:

Hours Studied Exam Score

1 45

2 50

3 55

4 60

5 65

Predict the score for a student who studies 3.5 hours using KNN
regression with K=3.

9. Given house price data:

Area (sq. ft.) Price (₹ in Lakhs)

1000 50

1500 70

2000 90

2500 110

Predict the price for a 1750 sq. ft. house using KNN regression with
K=2.
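
KNN regression predicts the average target value of the k nearest samples. The sketch below applies that idea to the two problems above. It assumes absolute distance on the single feature and breaks ties by dataset order, which is an arbitrary choice. `knn_regress` is a hypothetical helper.

```python
def knn_regress(samples, query, k):
    """samples: list of (x, y) pairs. Returns the mean y of the k nearest x values."""
    # sorted() is stable, so equidistant samples keep their dataset order.
    nearest = sorted(samples, key=lambda s: abs(s[0] - query))[:k]
    return sum(y for _, y in nearest) / k


exam_scores = [(1, 45), (2, 50), (3, 55), (4, 60), (5, 65)]
house_prices = [(1000, 50), (1500, 70), (2000, 90), (2500, 110)]

predicted_score = knn_regress(exam_scores, 3.5, k=3)
predicted_price = knn_regress(house_prices, 1750, k=2)
print(predicted_score, predicted_price)
```

For the house, the two nearest areas are 1500 and 2000 sq. ft., so the prediction is (70 + 90) / 2 = 80 Lakhs. For the exam problem, 3 and 4 hours are the nearest, but 2 and 5 hours are tied at a distance of 1.5. This sketch picks 2 hours and predicts 55, whereas picking 5 hours would give 60, so a written answer should state how the tie is resolved.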

Linear Regression

10. Given the dataset of house prices:

Area (sq. ft.) Price (₹ in Lakhs)

1000 50

1500 75

2000 100

Find the linear regression equation (y = mx + c) and predict the price of
a 1250 sq. ft. house.

11. A company’s advertisement spending (₹ Lakhs) and sales revenue (₹
Crores) are given below:

Ad Spend (₹ Lakhs) Sales Revenue (₹ Crores)

1 10

2 20

3 30

4 40

Fit a linear regression model and estimate the sales revenue if the ad spend is
₹2.5 Lakhs.

12. Given student study hours and exam scores:

Hours Studied Score

2 50

4 60

6 70
8 80

Calculate the linear regression equation and predict the score for a student who
studies 5 hours.

13. A company records employee experience and salary:

Experience (Years) Salary (₹ Lakhs)

1 3

3 6

5 9

Find the regression equation and predict the salary for an employee with 4
years of experience.
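
The four linear regression problems above all ask for the least-squares line y = mx + c. The following is a minimal sketch using the closed-form formulas m = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and c = ȳ − m·x̄. The helper names `fit_line` and `predict` are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = m*x + c for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


def predict(xs, ys, x_new):
    m, c = fit_line(xs, ys)
    return m * x_new + c


# Data copied from the four problems above
house_price = predict([1000, 1500, 2000], [50, 75, 100], 1250)   # ₹ Lakhs
sales_revenue = predict([1, 2, 3, 4], [10, 20, 30, 40], 2.5)     # ₹ Crores
exam_score = predict([2, 4, 6, 8], [50, 60, 70, 80], 5)
salary = predict([1, 3, 5], [3, 6, 9], 4)                        # ₹ Lakhs

print(fit_line([2, 4, 6, 8], [50, 60, 70, 80]))
print(house_price, sales_revenue, exam_score, salary)
```

Every dataset here lies exactly on a straight line, so the fitted lines pass through all the points: y = 0.05x for the house prices, y = 10x for ad spend, y = 5x + 40 for study hours, and y = 1.5x + 1.5 for experience.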

Unit IV: Data Visualization

Data visualization options – Filters – Python libraries for visualization –
Matplotlib – Seaborn
Data Science Ethics – Doing good data science – Owners of the data –
Valuing different aspects of privacy – Getting informed consent – The
Five Cs – Diversity – Inclusion – Future Trends.

Data Visualization
Filters:
1. What is the purpose of filters in data visualization, and how do they
enhance the viewer's experience?
2. Can you explain the difference between static and interactive filters
in data visualizations?
Python Libraries for Visualization:
1. Compare and contrast Matplotlib and Seaborn. What are the
primary use cases for each library?
2. How would you choose the right visualization library for a given
data science project? Provide examples where you would prefer one
over the other.
Matplotlib:
1. What are the basic components of a Matplotlib plot?
2. How can you create a multi-line plot using Matplotlib?
3. Describe the box plot in detail.
Seaborn:
1. Describe how Seaborn integrates with other Python libraries. What
advantages does this offer?
2. Provide an example of a complex data visualization that can be
more easily generated with Seaborn than with Matplotlib.
Data Science Ethics
Doing Good Data Science:
1. What are the ethical considerations a data scientist must keep in
mind when designing a new algorithm?
2. How can data scientists ensure their work contributes positively to
society?
Owners of the Data:
1. Discuss the implications of data ownership in the context of
personal versus corporate data.
2. How does the concept of data ownership affect data access for
scientific research?
Valuing Different Aspects of Privacy:
1. What challenges arise when balancing individual privacy with the
benefits of big data analytics?
2. Provide examples of privacy-preserving techniques in data science.
The Five Cs of Data Science:
1. Define the "Five Cs" in the context of ethical data science practices.
2. How can adhering to the Five Cs improve the outcome of a data
science project?
Diversity and Inclusion:
1. Why is diversity important in data science teams and data
collection?
2. Discuss an example where lack of diversity in data collection led to
biased outcomes.

Note: The Question Bank is for reference purposes. Mid-Semester and
End-Semester Exam question papers will be drawn from the syllabus
mentioned in the Course file.
