Question Paper DSBDA

Dsbda Handbook TE 22-23 Sem II

big data analytics (RMD engineering college)


Savitribai Phule Pune University


Third Year Computer Engineering (2019 Course)
310251: Data Science and Big Data Analytics
Teaching Scheme: TH: 03 Hours/Week
Credit: 03
Examination Scheme: In-Sem (Paper): 30 Marks, End-Sem (Paper): 70 Marks
Prerequisite Courses: Discrete Mathematics (210241), Database Management Systems (310341)
Companion Course: Data Science and Big Data Analytics Laboratory (310256)
Course Objectives:

1. To understand the need for Data Science and Big Data
2. To understand computational statistics in Data Science
3. To study and understand the different technologies used for Big Data processing
4. To understand and apply data modelling strategies
5. To learn Data Analytics using Python programming
6. To be conversant with advances in analytics

Course Outcomes:
After completion of the course, learners should be able to
CO1: Analyze needs and challenges for Data Science and Big Data Analytics
CO2: Apply statistics for Big Data Analytics
CO3: Apply the lifecycle of Big Data analytics to real world problems
CO4: Implement Big Data Analytics using Python programming
CO5: Implement data visualization using visualization tools in Python programming
CO6: Design and implement Big Databases using the Hadoop ecosystem

Unit I: Introduction 07 Hours
Basics and need of Data Science and Big Data, Applications of Data Science, Data explosion, 5
V’s of Big Data, Relationship between Data Science and Information Science, Business
intelligence versus Data Science, Data Science Life Cycle, Data: Data Types, Data Collection.
Need of Data wrangling, Methods: Data Cleaning, Data Integration, Data Reduction, Data
Transformation, Data Discretization.
Unit II: Statistical Inference 07 Hours
Need of statistics in Data Science and Big Data Analytics, Measures of Central Tendency: Mean,
Median, Mode, Mid-range. Measures of Dispersion: Range, Variance, Mean Deviation, Standard
Deviation. Bayes theorem, Basics and need of hypothesis and hypothesis testing, Pearson
Correlation, Sample Hypothesis testing, Chi-Square Tests, t-test.
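
The measures of central tendency and dispersion listed in this unit can be tried out with Python's standard `statistics` module. The following is a minimal sketch, not part of the syllabus: the sample values are hypothetical and were chosen only so that the results are easy to verify by hand, and the population (divide by n) forms of variance and standard deviation are assumed.

```python
import statistics

# Hypothetical sample, chosen only so the results are easy to check by hand.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Measures of central tendency
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
mid_range = (min(data) + max(data)) / 2

# Measures of dispersion (population versions: divide by n)
value_range = max(data) - min(data)
variance = statistics.pvariance(data)
std_dev = statistics.pstdev(data)
mean_deviation = sum(abs(x - mean) for x in data) / len(data)

print("mean, median, mode, mid-range:", mean, median, mode, mid_range)
print("range, variance, std dev, mean deviation:",
      value_range, variance, std_dev, mean_deviation)
```

For a sample rather than a whole population, `statistics.variance` and `statistics.stdev` (which divide by n - 1) would be the usual choice.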


Unit III: Big Data Analytics Life Cycle 07 Hours


Introduction to Big Data, sources of Big Data, Data Analytic Lifecycle: Introduction, Phase 1:
Discovery, Phase 2: Data Preparation, Phase 3: Model Planning, Phase 4: Model Building, Phase
5: Communication results, Phase 6: Operationalize

Unit IV: Predictive Big Data Analytics with Python 07 Hours


Introduction, Essential Python Libraries, Basic examples. Data Preprocessing: Removing
Duplicates, Transformation of Data using function or mapping, replacing values, Handling Missing
Data. Analytics Types: Predictive, Descriptive and Prescriptive. Association Rules: Apriori
Algorithm, FP growth. Regression: Linear Regression, Logistic Regression. Classification: Naïve
Bayes, Decision Trees. Introduction to Scikit-learn, Installations, Dataset, matplotlib, filling
missing values, Regression and Classification using Scikit-learn.
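
The preprocessing steps named in this unit (removing duplicates, transforming data with a function, replacing values, and handling missing data) can be illustrated without any third-party library. This is a plain-Python sketch under stated assumptions: the records, the `"N/A"` placeholder, and the helper names `remove_duplicates` and `preprocess` are all hypothetical, and mean imputation is only one of several ways to fill missing values. In practice the course libraries (pandas, scikit-learn) provide equivalent operations.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"name": "Asha", "city": "pune", "score": 72},
    {"name": "Ravi", "city": "Mumbai", "score": None},
    {"name": "Asha", "city": "pune", "score": 72},   # exact duplicate
    {"name": "Meera", "city": "N/A", "score": 88},
]


def remove_duplicates(rows):
    """Keep the first occurrence of each record, preserving order."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def preprocess(rows):
    rows = remove_duplicates(rows)
    # Replace a placeholder value with None so it is treated as missing.
    rows = [{**r, "city": None if r["city"] == "N/A" else r["city"]} for r in rows]
    # Transform with a function: normalise city names to title case.
    rows = [{**r, "city": r["city"].title() if r["city"] else r["city"]} for r in rows]
    # Fill missing scores with the mean of the observed scores.
    known = [r["score"] for r in rows if r["score"] is not None]
    fill = sum(known) / len(known)
    return [{**r, "score": fill if r["score"] is None else r["score"]} for r in rows]


clean = preprocess(records)
for row in clean:
    print(row)
```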

Unit V: Big Data Analytics and Model Evaluation 07 Hours


Clustering Algorithms: K-Means, Hierarchical Clustering, Time-series analysis. Introduction to
Text Analysis: Text-preprocessing, Bag of words, TF-IDF and topics. Need and Introduction to
social network analysis, Introduction to business analysis. Model Evaluation and Selection:
Metrics for Evaluating Classifier Performance, Holdout Method and Random Subsampling,
Parameter Tuning and Optimization, Result Interpretation, Clustering and Time-series analysis
using Scikit-learn, sklearn.metrics, Confusion matrix, AUC-ROC Curves, Elbow plot.
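
The confusion matrix and the classifier metrics derived from it can be computed by hand for a binary problem. The sketch below is illustrative only: the labels are hypothetical, the function name `confusion_counts` is an assumption, and `sklearn.metrics` offers ready-made equivalents that would normally be used in the lab.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Hypothetical true labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("accuracy, precision, recall:", accuracy, precision, recall)
```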

Unit VI: Data Visualization and Hadoop 07 Hours
Introduction to Data Visualization, Challenges to Big data visualization, Types of data
visualization, Data Visualization Techniques, Visualizing Big Data, Tools used in Data
Visualization, Hadoop ecosystem, Map Reduce, Pig, Hive, Analytical techniques used in Big data
visualization. Data Visualization using Python: Line plot, Scatter plot, Histogram, Density plot,
Box-plot.
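
The MapReduce model mentioned in this unit can be sketched in a single Python process to show the three stages (map, shuffle, reduce). This is a toy illustration under stated assumptions, not Hadoop code: the function names and the two input lines are hypothetical, and a real Hadoop job distributes these stages across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


# Hypothetical input split into lines, as a mapper would receive it.
lines = ["big data big ideas", "data science"]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```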

Text Books:
1. David Dietrich, Barry Heller, “Data Science and Big Data Analytics”, EMC Education
Services, Wiley publication, 2012, ISBN 0-07-120413-X.
2. Jiawei Han, Micheline Kamber, and Jian Pei, “Data Mining: Concepts and Techniques”,
Elsevier Publishers, Third Edition, ISBN: 9780123814791, 9780123814807.


Reference Books:
1. EMC Education Services, “Data Science and Big Data Analytics: Discovering,
Analyzing, Visualizing and Presenting Data”
2. DT Editorial Services, “Big Data, Black Book”, DT Editorial Services, ISBN:
9789351197577, 2016 Edition.
3. Chirag Shah, “A Hands-On Introduction to Data Science”, Cambridge University
Press, (2020), ISBN: 978-1-108-47244-9.
4. Wes McKinney, “Python for Data Analysis”, O'Reilly Media, ISBN: 978-1-449-31979-3
5. Trent Hauck, “Scikit-learn Cookbook”, Packt Publishing, ISBN: 9781787286382
6. Jenny Kim, Benjamin Bengfort, “Data Analytics with Hadoop”, O'Reilly Media, Inc., ISBN:
9781491913703.
7. Venkat Ankam, “Big Data Analytics”, Packt Publishing, ISBN: 9781785884696
e-Books:
• An Introduction to Statistical Learning by Gareth James:
https://www.ime.unicamp.br/~dias/Intoduction%20to%20Statistical%20Learning.pdf
• Python Data Science Handbook by Jake VanderPlas:
https://tanthiamhuat.files.wordpress.com/2018/04/pythondatasciencehandbook.pdf
• Introducing Data Science by Davy Cielen, Manning Publications
• Data Visualisation: A Handbook for Data Driven Design by Andy Kirk
• An Introduction to Data Science:
https://docs.google.com/file/d/0B6iefdnF22XQeVZDSkxjZ0Z5VUE/edit?pli=1
• Hadoop Tutorial:
https://www.tutorialspoint.com/hadoop/hadoop_tutorial.pdf?utm_source=7_&utm_medium=affiliate&utm_content=5f34cd37cdf1050001b09537&utm_campaign=Admitad&utm_term=761c575424fc4a6b48d02f72157eb578
• Learning with Python; How to Think Like a Computer Scientist:
http://openbookproject.net/thinkcs/python/english3e/
• Python for Everybody:
http://do1.dr-chuck.com/pythonlearn/EN_us/pythonlearn.pdf
• Scikit-learn Tutorial:
https://scikit-learn.org/stable/

MOOCs Courses links:


• Computer Science and Engineering - NOC:Data Science for Engineers
• Computer Science and Engineering - NOC:Python for Data Science
• Computer Science and Engineering - NOC:Data Mining
• Computer Science and Engineering - NOC:Big Data Computing
• Big Data Computing - Course


UNIT WISE QUESTION BANK


Unit I: Introduction to Data Science and Big Data

Question No.  Questions
1  Compare BI vs. Data Science.
2  Explain the Data Analytic Life Cycle.
3  What is the relationship between Data Science and Information Science?
4  Explain Business Intelligence in detail.
5  What is meant by Data Science? Explain in detail.
6  Explain the Data Science Life Cycle.
7  What are data types, and what is the need for data wrangling?
8  Write short notes on:
   (a) Data Cleaning
   (b) Data Integration
   (c) Data Reduction
   (d) Data Transformation
   (e) Data Discretization


Unit II: Statistical Inference

Question No.  Questions
1  What is the need for statistics in Data Science and Big Data Analytics?
2  Explain the Measures of Central Tendency.
3  What are the Measures of Dispersion?
4  Explain the following terms:
   1) Range, Variance
   2) Mean Deviation
   3) Standard Deviation
5  State and explain Bayes theorem.
6  What is the need for hypotheses and hypothesis testing?
7  What is Pearson Correlation?
8  What is the Chi-Square Test?
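
As a study aid for the Pearson correlation question, the coefficient can be computed directly from its definition: the sum of the products of the deviations from the means, divided by the product of the square roots of the summed squared deviations. This is a minimal sketch, not a model answer. The function name `pearson_r` and the study-hours and marks data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: product of the root sums of squared deviations.
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical data: hours studied and marks obtained.
hours = [1, 2, 3, 4, 5]
marks = [2, 4, 5, 4, 5]

r = pearson_r(hours, marks)
print(round(r, 4))
```

A value near +1 or -1 indicates a strong linear relationship, and a value near 0 indicates little or no linear relationship.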


Unit III: Big Data Analytics Life Cycle

Question No.  Questions
1  Discuss the following in detail:
   a. Conventional challenges in big data
   b. Nature of data
2  Describe any five characteristics of Big Data.
3  Define the different inferences in big data analytics.
4  Define big data. Why is big data required? How does a traditional BI environment differ from a big data environment?
5  What are the challenges with big data?
6  What are the three characteristics of big data? Explain the differences between BI and Data Science.
7  Describe the current analytical architecture for data scientists.
8  Describe the challenges of Big Data.
9  What is big data analytics? Explain in detail with an example.
10  Describe the challenges of Big Data.


Unit IV: Predictive Big Data Analytics with Python

Question No.  Questions
1  Discuss looping statements with an example: (i) while (ii) for (iii) range
2  Write a Python function to sum the numbers in a list.
3  Write the features of Python. Give its advantages and disadvantages.
4  What is the difference between a module and a package?
5  Explain Python operators in detail.
6  Write a Python program to illustrate variable-length keyword arguments.
7  Write a Python program to perform a linear search.
8  What is linear regression?
9  Define FP growth. Explain in detail.
10  Explain the Apriori algorithm.
11  Explain logistic regression.
12  Describe the Naïve Bayes algorithm.
13  What are the types of Naïve Bayes model?
14  Explain decision trees.
15  Explain the scikit-learn library.
16  Write a short note on matplotlib.
17  Write the difference between regression and classification.
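
As a study aid for the programming questions in this unit (summing a list, variable-length keyword arguments, and linear search), one possible set of answers is sketched below. These are illustrative only, not model answers, and the function names are arbitrary.

```python
def sum_list(numbers):
    """Sum the numbers in a list using an explicit for loop."""
    total = 0
    for number in numbers:
        total += number
    return total


def describe(**details):
    """Variable-length keyword arguments arrive as a dict named `details`."""
    return ", ".join(f"{key}={value}" for key, value in details.items())


def linear_search(items, target):
    """Return the index of the first match, or -1 if target is absent."""
    for index, item in enumerate(items):
        if item == target:
            return index
    return -1


print(sum_list([1, 2, 3]))
print(describe(name="Asha", unit=4))
print(linear_search([7, 3, 9], 9))
```

Python's built-in `sum` does the same job as `sum_list`; the explicit loop is shown because the question bank also asks about looping statements.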


Unit V: Big Data Analytics and Model Evaluation

Question No.  Questions
1  What is the K-means algorithm?
2  What is hierarchical clustering?
3  Write the difference between hierarchical clustering and the K-means algorithm.
4  Explain time series analysis.
5  Define bag of words.
6  What is TF-IDF?
7  What are social network analysis and business analysis?
8  What are the types of model selection?
9  How are models evaluated?
10  Define the holdout method.
11  What is the random sub-sampling method?
12  Write short notes on the confusion matrix.
13  Define the AUC-ROC curve.
14  Describe the Elbow plot.
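
As a study aid for the bag-of-words and TF-IDF questions, the sketch below computes the plain textbook form: term frequency (count divided by document length) multiplied by inverse document frequency, log(N / df). The three documents and the function name `tf_idf` are hypothetical. Library implementations may differ: scikit-learn's `TfidfVectorizer`, for example, applies a smoothed IDF and normalisation by default, so its values will not match these.

```python
import math
from collections import Counter

# Hypothetical corpus of three short documents.
docs = [
    "big data needs big storage",
    "data science uses statistics",
    "statistics needs data",
]


def tf_idf(documents):
    """Return one {term: weight} dict per document (unsmoothed TF-IDF)."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    scores = []
    for tokens in tokenised:
        counts = Counter(tokens)  # the bag-of-words representation
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


weights = tf_idf(docs)
for doc, row in zip(docs, weights):
    print(doc, "->", {term: round(w, 3) for term, w in row.items()})
```

A term such as "data" that occurs in every document gets weight 0, while a term concentrated in one document, such as "big", is weighted highest there.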


Unit VI: Data Visualization and Hadoop

Question No.  Questions
1  Introduce data visualization.
2  What are the challenges in data visualization?
3  What are the types of data visualization?
4  How is big data visualized?
5  What kinds of tools are used in data visualization?
6  Explain the Hadoop ecosystem.
7  Define MapReduce.
8  Describe Pig.
9  Write the difference between Pig and MapReduce.
10  Explain Hive.
11  What are the analytical techniques used in big data visualization?
12  What is a line plot?
13  Write about the scatter plot.
14  Define histogram.
15  What is a density plot?


Department of Computer Engineering


SET A
A.Y. 2022-23 (Semester-II)
UNIT TEST I EXAM
Class TE
Subject: Data Science and Big Data Analytics (310251) Date: 27/03/2023
Time: 1Hr Maximum Marks: 30
Instructions to Candidates:
1. Attempt Questions Q.1 OR Q.2, Q.3 OR Q.4.
2. Neat diagrams must be drawn wherever necessary.
3. Assume suitable data, if necessary.

Q. No.  Questions  Marks
1  A. Define big data analytics. Identify three areas or domains in which data science is being used and describe how.  5
   B. Differentiate between analysis and analytics. Discuss the importance of big data analytics.  5
   C. What is data science? Identify three areas or domains in which data science is being used and describe how.  5
OR
2  A. List and explain the sources of Big Data.  5
   B. How does data science relate to and differ from business intelligence?  5
   C. List and explain the various stages of the Data Science Life Cycle.  5
3  A. What is a population, and how is it different from a sample?  5
   B. With reference to the skewness of data, explain the empirical relation between mean, mode and median.  5
   C. Here are 19 scores:
      5, 7, 10, 15, 19, 21, 21, 22, 22, 23, 23, 23, 23, 23, 24, 24, 24, 24, 25
      Calculate 1.5*IQR below the first quartile and above the third quartile. How many data points are low outliers or high outliers?  5
OR
4  A. What is a t-test? What are the types of t-test? Explain in terms of the number of variables and degrees of freedom, with examples.  5
   B. Define hypothesis. What is hypothesis testing?  5
   C. What is the difference between the null and alternative hypotheses? Give one example of each.  5
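
The interquartile-range question on the 19 scores can be checked with a short script. This is a minimal sketch, not an official solution: it assumes the median-of-halves convention for quartiles (the median is left out of both halves when the count is odd), and other quartile conventions can give different fences. The function name `iqr_fences` is illustrative.

```python
from statistics import median

scores = [5, 7, 10, 15, 19, 21, 21, 22, 22, 23,
          23, 23, 23, 23, 24, 24, 24, 24, 25]


def iqr_fences(values):
    """Return (Q1, Q3, lower fence, upper fence) using median-of-halves quartiles."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # With an odd count, skip the median itself when forming the upper half.
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    q1, q3 = median(lower), median(upper)
    step = 1.5 * (q3 - q1)
    return q1, q3, q1 - step, q3 + step


q1, q3, low_fence, high_fence = iqr_fences(scores)
low_outliers = [x for x in scores if x < low_fence]
high_outliers = [x for x in scores if x > high_fence]

print("Q1, Q3:", q1, q3)
print("fences:", low_fence, high_fence)
print("low outliers:", low_outliers)
print("high outliers:", high_outliers)
```

Under this convention Q1 = 19 and Q3 = 24, so 1.5*IQR = 7.5 and the fences are 11.5 and 31.5. The scores 5, 7 and 10 fall below the lower fence, and none exceed the upper fence.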


Department of Computer Engineering


A.Y. 2022-23 (Semester-II) SET B

UNIT TEST I EXAM


Class TE
Subject: Data Science and Big Data Analytics (310251) Date: 27/03/2023
Time: 1Hr Maximum Marks: 30

Instructions to Candidates:
1. Attempt Questions Q.1 OR Q.2, Q.3 OR Q.4,
2. Neat diagrams must be drawn wherever necessary
3. Assume suitable data, if necessary

Q. No.  Questions  Marks
1  A. Explain the 5 V’s of Big Data.  5
   B. Draw the Data Analytics Life Cycle and briefly explain its phases.  5
   C. Why is data wrangling important for data science?  5
OR
2  A. Differentiate between structured and unstructured data.  5
   B. What is Data Explosion?  5
   C. List the data storage formats and explain any two of them.  5
3  A. A researcher has exam results for a sample of students who took a training course for a national exam. The researcher wants to know if trained students score above the national average of 850.  5
      i) Define the Null Hypothesis and the Alternative Hypothesis.
      ii) Is it a one-tailed or two-tailed hypothesis? Justify your answer.
   B. Define Type I and Type II errors. Give an example to differentiate between the two types of error.  5
   C. What is the Chi-square test? Explain its significance in data analytics.  5
OR
4  A. In which of the following cases could you use a paired-samples t-test? Give a proper reason for your answer.  5
      (a) When comparing the same participant's performance before and after training
      (b) When comparing two separate groups of people
   B. What is a p-value? Explain the significance of the p-value in hypothesis testing.  5
   C. Distinguish between one-tailed and two-tailed hypotheses; draw a diagram to support your answer.  5


Department of Computer Engineering


A.Y. 2022-23 (Semester-II) SET A
PRELIM EXAMINATION
Class TE
Subject: Data Science and Big Data Analytics (310251) Date: / / 2023
Time: 2(1/2) Hr Maximum Marks: 70

Instructions to Candidates:
1. Attempt Questions Q.1 OR Q.2, Q.3 OR Q.4, Q.5 OR Q.6, Q.7 OR Q.8
2. Neat diagrams must be drawn wherever necessary
3. Assume suitable data, if necessary

Question no.  Question  Marks
1  a. Explain the Data Science Life Cycle.  6
   b. What is the relationship between Data Science and Information Science?  6
   c. Write short notes on:  8
      i. Data Cleaning
      ii. Data Integration
      iii. Data Reduction
      iv. Data Transformation
OR
2  a. What are the three characteristics of big data? Explain the differences between BI and Data Science.  6
   b. Discuss the following in detail:  6
      i. Communication Results
      ii. Operationalize
   c. Explain in detail the sources of big data.  8

3  a. Explain logistic regression.  6
   b. Differentiate between data analytics types.  6
   c. Explain decision trees in detail.  4
OR
4  a. Explain Linear Regression with a diagram.  6
   b. Write short notes on:  6
      i. Apriori algorithm
      ii. FP growth
   c. Explain Data Preprocessing with a suitable example.  4

5  a. Describe:  6
      i. K-means clustering
      ii. Hierarchical Clustering
   b. What is TF-IDF? Explain with an example.  6
   c. Explain the need for and introduction to social network analysis.  4
OR
6  a. What is parameter tuning and optimization?  4
   b. Write short notes on:  6
      i. Holdout Method
      ii. Random Subsampling
   c. Explain the confusion matrix in detail.  6

7  a. Explain data visualization and the challenges to big data visualization.  6
   b. Describe:  6
      i. Line plot
      ii. Density plot
      iii. Box-plot
   c. What are MapReduce, Pig and Hive?  6
OR
8  a. Explain the types of data visualization.  6
   b. Explain data visualization techniques.  6
   c. What are the analytical techniques used in big data visualization?  6


Department of Computer Engineering


A.Y. 2022-23 (Semester-II) SET B
PRELIM EXAMINATION
Class TE
Subject: Data Science and Big Data Analytics (310251) Date: / / 2023
Time: 2(1/2) Hr Maximum Marks: 70

Instructions to Candidates:
1. Attempt Questions Q.1 OR Q.2, Q.3 OR Q.4, Q.5 OR Q.6, Q.7 OR Q.8
2. Neat diagrams must be drawn wherever necessary
3. Assume suitable data, if necessary

Question no.  Question  Marks
1  a. What are the three characteristics of big data? Explain the differences between BI and Data Science.  8
   b. Discuss the following in detail:  6
      i. Communication Results
      ii. Operationalize
   c. Explain in detail the sources of big data.  6
OR
2  a. Explain the Data Science Life Cycle.  6
   b. What is the relationship between Data Science and Information Science?  6
   c. Write short notes on:  8
      i. Data Cleaning
      ii. Data Integration
      iii. Data Reduction
      iv. Data Discretization

3  a. Explain Linear Regression.  6
   b. Write short notes on:  6
      i. Apriori algorithm
      ii. FP growth
   c. Explain removing duplicates, transformation of data using a function or mapping, replacing values, and handling missing data.  4
OR
4  a. Explain logistic regression.  6
   b. Differentiate between data analytics types.  4
   c. Explain the Naïve Bayes algorithm in detail.  6

5  a. Describe:  6
      i. K-means clustering
      ii. Hierarchical Clustering
   b. Explain text preprocessing in detail.  4
   c. Explain Bag of words and TF-IDF.  6
OR
6  a. What is time series analysis? Explain in detail.  4
   b. Write short notes on:  6
      i. Holdout Method
      ii. Random Subsampling
   c. Explain the Elbow method in detail.  6

7  a. Explain the types of data visualization.  6
   b. Explain data visualization techniques.  6
   c. What are the analytical techniques used in big data visualization?  6
OR
8  a. Explain data visualization and the challenges to big data visualization.  6
   b. Describe:  6
      i. Line plot
      ii. Density plot
      iii. Histogram
   c. What are MapReduce, Pig and Hive?  6
