Question Bank
UNIT-1
TWO MARK
1. Define data science and explain two benefits of applying it in various industries.
2. What are the three main facets of data in data science? Give examples of each.
5. Explain the importance of retrieving data from multiple sources in data science.
7. Define exploratory data analysis (EDA) and its significance in data science.
8. How does data exploration help in identifying patterns and anomalies in the dataset?
13. Explain the concept of data warehousing and its role in data management.
15. What are the basic statistical descriptions of data, and why are they important?
17. Explain the concept of median and when it is preferred over the mean.
19. Define standard deviation and describe its importance in measuring data variability.
20. Explain the concept of variance and its relationship with standard deviation.
21. Describe the process of defining research goals in a data science project.
22. What are the common challenges faced during the data retrieval phase in data science projects?
23. Discuss the significance of exploratory data analysis in understanding the dataset.
24. Explain the difference between supervised and unsupervised learning in building models.
28. What factors should be considered while selecting statistical measures for data analysis?
16-MARK
1. Discuss the benefits of data science in modern society, providing examples of its applications
across various industries. Explain how these applications contribute to improving decision-
making processes. (16 marks)
2. Define the facets of data in the context of data science and elaborate on their significance in
data analysis. Provide real-world examples to illustrate each facet. (16 marks)
3. Explain the data science process in detail, highlighting each step from defining research goals
to presenting findings and building applications. Discuss the importance of each step in ensuring
the success of a data science project. (16 marks)
4. Describe the challenges associated with retrieving data for data science projects. Discuss
strategies to overcome these challenges, emphasizing the importance of data quality and
relevance. (16 marks)
5. Discuss the various techniques involved in data preparation for analysis in data science
projects. Explain how data preprocessing, cleaning, and transformation contribute to the overall
success of a data science project. (16 marks)
6. Explore the role of exploratory data analysis (EDA) in understanding the characteristics of a
dataset. Provide examples of common EDA techniques and discuss their applications in real-
world scenarios. (16 marks)
8. Discuss the significance of presenting findings in data science projects. Explain how effective
data visualization techniques can enhance the communication of insights derived from data
analysis. Provide examples of visualization tools and their applications. (16 marks)
9. Define data mining and its role in extracting valuable insights from large datasets. Discuss the
different data mining techniques and algorithms commonly used in practice, providing examples
of their applications. (16 marks)
10. Describe the concept of data warehousing and its importance in managing and analyzing large
volumes of data. Discuss the architecture of a typical data warehouse and the components
involved. (16 marks)
11. Explain the basic statistical descriptions of data, including measures of central tendency and
variability. Discuss how these statistical measures are used to summarize and interpret data in
data science projects. (16 marks)
12. Discuss the challenges associated with data preparation in data science projects and strategies
to address them effectively. Explain the importance of data quality assurance and data governance
in ensuring the reliability of analysis results. (16 marks)
13. Explore the impact of outliers on statistical descriptions of data. Discuss techniques for
identifying and handling outliers in data analysis, highlighting their implications for decision-
making processes. (16 marks)
14. Compare and contrast different data mining techniques, such as classification, clustering,
and association rule mining. Discuss their respective strengths, weaknesses, and applications in
real-world scenarios. (16 marks)
16. Evaluate the importance of defining clear research goals in data science projects. Discuss how
well-defined research goals contribute to project success and the effective utilization of
resources. (16 marks)
17. Discuss the challenges and opportunities associated with exploratory data analysis (EDA) in
data science projects. Explain how EDA techniques can uncover patterns and relationships in data,
aiding in hypothesis generation and model building. (16 marks)
18. Assess the impact of data quality on the outcomes of data science projects. Discuss common
sources of data quality issues and strategies to ensure data quality throughout the data lifecycle.
(16 marks)
19. Explore the ethical considerations and challenges associated with data mining in data science
projects. Discuss issues such as privacy, bias, and fairness, and propose strategies to mitigate
these concerns. (16 marks)
20. Critically analyze the role of data visualization in communicating insights derived from data
analysis. Discuss the principles of effective data visualization design and the factors to consider
when selecting visualization techniques. (16 marks)
UNIT-2
TWO MARKS
1. What is the difference between Python Shell and Jupyter Notebook in terms of their interface and functionality?
2. Explain the purpose of IPython magic commands and provide an example of how they can be used in Jupyter Notebook.
3. Define NumPy arrays and explain their significance in data manipulation tasks.
4. What are universal functions (ufuncs) in NumPy, and how do they facilitate computation on arrays?
5. Discuss the concept of aggregations in the context of NumPy arrays.
6. How does fancy indexing differ from basic indexing in NumPy arrays? Provide an example of fancy indexing.
7. Explain the process of sorting arrays in NumPy.
8. Define structured data and discuss its importance in data manipulation tasks.
9. Describe the role of Pandas in data manipulation and analysis.
10. Explain the concept of data indexing and selection in Pandas.
11. How does Pandas handle missing data? Discuss the methods available for handling missing values.
12. What is
Answers:
1. Python Shell is an interactive command-line interface where users can execute Python
code line by line, while Jupyter Notebook is a web-based interactive computing
environment that allows users to create and share documents containing live code,
equations, visualizations, and narrative text. Jupyter Notebook provides a more
dynamic and visually appealing interface compared to the Python Shell.
2. IPython magic commands are special commands that provide additional functionality
and shortcuts in Jupyter Notebook. For example, the %timeit magic command can be
used to measure the execution time of a Python statement or expression. By typing
%timeit followed by the statement or expression in a Jupyter Notebook cell and
running the cell, the time taken to execute the code will be displayed.
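As a small sketch: outside a notebook, the standard-library `timeit` module gives a comparable measurement to the `%timeit` magic (the statement being timed here is purely illustrative):

```python
import timeit

# In a Jupyter cell you would simply run:  %timeit sum(range(1000))
# Outside IPython, the timeit module performs a similar measurement:
elapsed = timeit.timeit("sum(range(1000))", number=10_000)
print(f"10,000 runs took {elapsed:.4f} s")
```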
3. NumPy arrays are multi-dimensional arrays that allow for efficient computation on
large datasets. They are the core data structure used in numerical computing with
Python. NumPy arrays provide a way to perform mathematical operations on
entire datasets without the need for explicit loops, making them significantly faster
than traditional Python lists.
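A minimal sketch of this vectorized behaviour (assumes NumPy is installed; the values are illustrative):

```python
import numpy as np

prices = np.array([10.0, 20.0, 30.0])
# One expression operates on every element at once; a plain list would
# need a loop or a comprehension such as [p * 2 for p in prices].
doubled = prices * 2
print(doubled)  # [20. 40. 60.]
```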
4. Universal functions (ufuncs) in NumPy are functions that operate element-wise on
NumPy arrays. They are designed to efficiently handle computations across entire
arrays, making them significantly faster than equivalent Python functions applied
element-wise in loops. Examples of ufuncs include addition, subtraction,
multiplication, and trigonometric functions like sine and cosine.
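For instance, a short sketch of two ufuncs applied element-wise (array values are illustrative):

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

print(np.add(a, b))             # [11 22 33] -- equivalent to a + b
print(np.sin(np.array([0.0])))  # [0.]
```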
5. Aggregations in the context of NumPy arrays involve performing mathematical
operations that result in a single value summarizing the entire array or a specific
axis of the array. Examples of aggregations include computing the sum, mean,
median, minimum, and maximum of an array.
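A brief sketch of aggregations over the whole array and along a specific axis (values are illustrative):

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(m.sum())         # 21      -- whole array
print(m.sum(axis=0))   # [5 7 9] -- column sums
print(m.mean(axis=1))  # [2. 5.] -- row means
```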
6. Fancy indexing in NumPy allows arrays to be indexed with arrays of indices or boolean arrays. Unlike basic indexing, which selects individual elements or contiguous slices of an array, fancy indexing selects multiple, arbitrarily positioned elements at once, based on the contents of the index or mask array.
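A short sketch of both forms of fancy indexing (array values are illustrative):

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

print(arr[[0, 2, 4]])  # [10 30 50] -- index array
print(arr[arr > 25])   # [30 40 50] -- boolean mask
```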
16-MARK
1. **Explain the key features of the Python Shell and Jupyter Notebook. How do they aid in
data manipulation tasks?**
2. **Discuss the role of IPython Magic Commands in data analysis. Provide examples of
how they can be used to enhance productivity in a Jupyter Notebook environment.**
3. **Describe the structure and functions of NumPy arrays. How do they differ from Python
lists in terms of performance and functionality?**
4. **What are Universal Functions (ufuncs) in NumPy? Explain with examples how they are
used for efficient computation.**
6. **Explain the concept of fancy indexing in NumPy. How can it be used for advanced data
manipulation and selection?**
7. **Describe the methods available for sorting arrays in NumPy. How do these methods
improve data manipulation tasks?**
8. **Discuss the importance of structured data in NumPy. How does the use of structured
arrays facilitate data analysis? Provide examples.**
9. **Explain how data manipulation is performed using Pandas. Discuss the advantages of
using Pandas for data manipulation compared to traditional methods.**
10. **Describe the techniques available for data indexing and selection in Pandas. Provide
examples to illustrate how these techniques can be applied in data analysis.**
11. **Discuss the methods for handling missing data in Pandas. How can missing data be
identified and addressed? Provide examples.**
12. **Explain hierarchical indexing in Pandas. How does it support complex data structures
and multi-level data analysis?**
14. **Explain the process of aggregation and grouping in Pandas. How do these techniques
assist in summarizing and analyzing data?**
15. **Discuss the various string operations available in Pandas. Provide examples of how
string manipulation can be performed on Pandas DataFrames.**
16. **Describe the methods for working with time series data in Pandas. How can time series
analysis be performed using Pandas functionalities?**
17. **Explain the techniques for achieving high performance in data manipulation using
Pandas. Discuss the role of vectorization and efficient data handling practices.**
18. **Compare and contrast the data manipulation capabilities of NumPy and Pandas. In
what scenarios would you prefer one over the other?**
20. **Explain the concept of combining datasets in Pandas. Provide examples of how
datasets can be concatenated, merged, and joined.**
UNIT-3
1. **Describe the complete modeling process in machine learning, from data collection to
model deployment.**
2. **Compare and contrast the different types of machine learning: supervised, unsupervised,
and semi-supervised learning.**
3. **Discuss the key characteristics of supervised learning. Provide examples of algorithms
and real-world applications.**
4. **Explain the concept of unsupervised learning. Discuss the techniques used and their
applications with examples.**
5. **Describe semi-supervised learning and its advantages. Provide examples of scenarios
where this type of learning is beneficial.**
6. **Explain the process of classification in machine learning. Include a discussion on
different classification algorithms and their applications.**
7. **Discuss regression analysis in machine learning. Explain various regression techniques
and their use cases.**
8. **Describe the clustering process in detail. Discuss different clustering algorithms and
their practical applications.**
9. **What are outliers in data? Explain the significance of outlier detection and analysis in
machine learning.**
10. **Provide a comprehensive overview of outlier analysis techniques. Discuss their
advantages and limitations.**
11. **Compare classification and regression in terms of methodology, use cases, and
evaluation metrics.**
12. **Explain the role of training, validation, and test datasets in the machine learning
process.**
13. **Discuss the concept of model overfitting and underfitting. Provide strategies to prevent
these issues.**
14. **Describe the k-means clustering algorithm. Explain its working, advantages, and
limitations with examples.**
15. **Discuss the application of decision trees in classification and regression. Explain how
they are constructed and used.**
16. **Explain the concept and applications of the k-nearest neighbors (k-NN) algorithm in
detail.**
17. **Describe how support vector machines (SVM) are used for classification tasks. Discuss
their advantages and limitations.**
UNIT-4
1. **What is the difference between a scatter plot and a line plot?**
2. **Explain the creation and customization of simple line plots using Matplotlib.
Provide examples with multiple lines and different styles.**
4. **Describe the methods for visualizing errors in data plots. Include examples of error
bars and shaded regions.**
5. **Explain how to create and interpret density and contour plots in Matplotlib.
Provide examples with different datasets.**
7. **Explain the importance and usage of legends in Matplotlib plots. Provide examples
of different positioning and customization options.**
8. **Describe the different ways to apply and customize colors in Matplotlib plots.
Include examples of color maps and color bars.**
10. **Discuss the use of text and annotation in Matplotlib plots. Provide examples of
adding labels, titles, and annotations to different parts of a plot.**
11. **Explain the various customization options available in Matplotlib. Include examples
of customizing axes, grid lines, and plot styles.**
13. **Discuss the use of Basemap for plotting geographic data. Include examples of
different types of geographic visualizations.**
14. **Explain how Seaborn can be used for advanced data visualization. Provide examples
of different types of plots available in Seaborn.**
15. **Compare and contrast Matplotlib and Seaborn in terms of functionality and ease of
use. Provide examples of plots created with both libraries.**
16. **Explain how to create and customize histograms using Seaborn. Provide examples
of different customization options.**
17. **Describe the process of creating heatmaps using Seaborn. Include examples
of customization and interpretation of the heatmap.**
18. **Discuss the various types of plots available in Seaborn for visualizing
distributions. Provide examples and explain their applications.**
19. **Explain the creation and customization of categorical plots using Seaborn.
Provide examples of bar plots, box plots, and violin plots.**
20. **Describe the integration of Matplotlib and Seaborn in a single project. Provide
examples of how both libraries can be used together for comprehensive data
visualization.**
UNIT-5
3. **What is a key programming tip for dealing with large data sets?**
12. **What is a common method for reducing the size of a large data set?**
13. **Name one technique for optimizing memory usage when working with large
data sets.**
14. **What is the main benefit of using distributed computing for large data sets?**
17. **Why is it important to use efficient algorithms when working with large data sets?**
18. **Name a Python library commonly used for handling large data sets.**
28. **What is the role of batch processing in handling large data sets?**
2. **Explain various techniques for handling large data sets. Include examples of tools
and methods used in practice.**
3. **Describe programming tips and best practices for dealing with large data sets in
Python. Provide code examples to illustrate your points.**
4. **Detail the steps involved in predicting malicious URLs. Explain the data
preparation, model building, and evaluation processes.**
6. **Discuss the tools and techniques needed for handling large data sets. Include
examples of software and platforms commonly used.**
7. **Explain the importance of a well-defined research question in data analysis. How does
it influence the subsequent steps of the data analysis process?**
8. **Describe the data preparation process for handling large data sets. Include data
cleaning, transformation, and feature engineering.**
9. **Explain the concept of model building in data science. Discuss the steps involved
and provide examples of different types of models.**
10. **Discuss the significance of presenting and automating data handling processes.
Provide examples of tools and methods used for presentation and automation.**
11. **Compare and contrast batch processing and stream processing for handling large
data sets. Discuss the advantages and disadvantages of each.**
12. **Explain the role of distributed computing in handling large data. Discuss
popular frameworks such as Apache Hadoop and Apache Spark.**
13. **Discuss the importance of data partitioning and sharding in handling large volumes
of data. Provide examples of how they are implemented.**
14. **Describe the ETL process in data handling. Explain each step and its importance
in managing large data sets.**
17. **Explain the role of feature selection and feature engineering in model building.
Provide examples of techniques used for feature selection.**
19. **Describe the role of data visualization in handling large data sets. Provide examples of
tools and techniques used for effective visualization.**
20. **Explain the process of conducting a case study in data science. Provide a detailed
example of a case study, including research question, data preparation, model building, and
presentation.**