Question Bank (1)

The document contains a comprehensive question bank for a Data Science Fundamentals course, covering various topics such as data science definitions, processes, data mining, and statistical measures. It includes both two-mark and sixteen-mark questions aimed at assessing knowledge and understanding of key concepts in data science. Additionally, it addresses tools and libraries like Python, NumPy, and Pandas, along with their applications in data manipulation and analysis.

OCS353 Data Science Fundamentals - Question Bank

Computer Science and Engineering (Madras Institute of Technology, Anna University)


Downloaded by Gayathri Ramasamy


Studocu is not sponsored or endorsed by any college or university



OCS353 DATA SCIENCE FUNDAMENTALS

UNIT-1

TWO MARKS

1. Define data science and explain two benefits of applying it in various industries.

2. What are the three main facets of data in data science? Give examples of each.

3. Describe the steps involved in the data science process.

4. Why is it essential to define research goals in data science projects?

5. Explain the importance of retrieving data from multiple sources in data science.

6. What does data preparation involve in the context of data science?

7. Define exploratory data analysis (EDA) and its significance in data science.

8. How does data exploration help in identifying patterns and anomalies in the dataset?

9. Briefly explain the process of building a model in data science.

10. Why is it necessary to present findings in data science projects?

11. Define data mining and provide an example of its application.

12. What techniques are commonly used in data mining?

13. Explain the concept of data warehousing and its role in data management.

14. How does data warehousing differ from traditional databases?

15. What are the basic statistical descriptions of data, and why are they important?

16. Define mean and describe its significance in data analysis.

17. Explain the concept of median and when it is preferred over the mean.

18. What is mode, and how is it useful in data analysis?

19. Define standard deviation and describe its importance in measuring data variability.

20. Explain the concept of variance and its relationship with standard deviation.

21. Describe the process of defining research goals in a data science project.

22. What are the common challenges faced during the data retrieval phase in data science projects?

23. Discuss the significance of exploratory data analysis in understanding the dataset.

24. Explain the difference between supervised and unsupervised learning in building models.

Downloaded by Gayathri Ramasamy


25. How does data presentation aid in decision-making processes?

26. Provide an example of a real-world application of data mining.

27. Describe the architecture of a typical data warehouse.

28. What factors should be considered while selecting statistical measures for data analysis?

29. Explain the role of data preparation in ensuring data quality.

30. Discuss the impact of outliers on statistical descriptions of data.

16-MARK


1. Discuss the benefits of data science in modern society, providing examples of its applications
across various industries. Explain how these applications contribute to improving decision-
making processes. (16 marks)

2. Define the facets of data in the context of data science and elaborate on their significance in
data analysis. Provide real-world examples to illustrate each facet. (16 marks)

3. Explain the data science process in detail, highlighting each step from defining research goals
to presenting findings and building applications. Discuss the importance of each step in ensuring
the success of a data science project. (16 marks)

4. Describe the challenges associated with retrieving data for data science projects. Discuss
strategies to overcome these challenges, emphasizing the importance of data quality and
relevance. (16 marks)

5. Discuss the various techniques involved in data preparation for analysis in data science
projects. Explain how data preprocessing, cleaning, and transformation contribute to the overall
success of a data science project. (16 marks)

6. Explore the role of exploratory data analysis (EDA) in understanding the characteristics of a
dataset. Provide examples of common EDA techniques and discuss their applications in real-
world scenarios. (16 marks)



7. Explain the process of building a model in data science, discussing different modeling
techniques and their applications. Compare and contrast supervised and unsupervised learning
approaches, providing examples of each. (16 marks)

8. Discuss the significance of presenting findings in data science projects. Explain how effective
data visualization techniques can enhance the communication of insights derived from data
analysis. Provide examples of visualization tools and their applications. (16 marks)

9. Define data mining and its role in extracting valuable insights from large datasets. Discuss the
different data mining techniques and algorithms commonly used in practice, providing examples
of their applications. (16 marks)

10. Describe the concept of data warehousing and its importance in managing and analyzing large
volumes of data. Discuss the architecture of a typical data warehouse and the components
involved. (16 marks)

11. Explain the basic statistical descriptions of data, including measures of central tendency and
variability. Discuss how these statistical measures are used to summarize and interpret data in
data science projects. (16 marks)

12. Discuss the challenges associated with data preparation in data science projects and strategies
to address them effectively. Explain the importance of data quality assurance and data governance
in ensuring the reliability of analysis results. (16 marks)

13. Explore the impact of outliers on statistical descriptions of data. Discuss techniques for
identifying and handling outliers in data analysis, highlighting their implications for decision-
making processes. (16 marks)

14. Compare and contrast different data mining techniques, such as classification, clustering,
and association rule mining. Discuss their respective strengths, weaknesses, and applications in
real-world scenarios. (16 marks)



15. Analyze the role of data warehousing in supporting business intelligence and decision-making
processes. Discuss how data warehouses integrate data from disparate sources and enable
complex analytical queries. (16 marks)

16. Evaluate the importance of defining clear research goals in data science projects. Discuss how
well-defined research goals contribute to project success and the effective utilization of
resources. (16 marks)

17. Discuss the challenges and opportunities associated with exploratory data analysis (EDA) in
data science projects. Explain how EDA techniques can uncover patterns and relationships in data,
aiding in hypothesis generation and model building. (16 marks)

18. Assess the impact of data quality on the outcomes of data science projects. Discuss common
sources of data quality issues and strategies to ensure data quality throughout the data lifecycle.
(16 marks)

19. Explore the ethical considerations and challenges associated with data mining in data science
projects. Discuss issues such as privacy, bias, and fairness, and propose strategies to mitigate
these concerns. (16 marks)

20. Critically analyze the role of data visualization in communicating insights derived from data
analysis. Discuss the principles of effective data visualization design and the factors to consider
when selecting visualization techniques. (16 marks)

UNIT-2

TWO MARKS

1. What is the difference between Python Shell and Jupyter Notebook in terms of their interface and functionality?
2. Explain the purpose of IPython magic commands and provide an example of how they can be used in Jupyter Notebook.
3. Define NumPy arrays and explain their significance in data manipulation tasks.
4. What are universal functions (ufuncs) in NumPy, and how do they facilitate computation on arrays?
5. Discuss the concept of aggregations in the context of NumPy arrays.
6. How does fancy indexing differ from basic indexing in NumPy arrays? Provide an example of fancy indexing.
7. Explain the process of sorting arrays in NumPy.
8. Define structured data and discuss its importance in data manipulation tasks.
9. Describe the role of Pandas in data manipulation and analysis.
10. Explain the concept of data indexing and selection in Pandas.
11. How does Pandas handle missing data? Discuss the methods available for handling missing values.
12. What is hierarchical indexing in Pandas, and how is it useful in organizing and accessing data?
13. Discuss the process of combining datasets in Pandas, highlighting different methods and functions.
14. Explain the concept of aggregation in Pandas and provide an example of its application.
15. What are string operations in Pandas, and how are they used for data manipulation?
16. Describe the process of working with time series data in Pandas.
17. How does Pandas ensure high performance in data manipulation tasks?
18. Compare and contrast Python Shell and Jupyter Notebook in terms of their features and advantages.
19. Discuss three commonly used IPython magic commands and their respective functionalities.
20. Explain how NumPy arrays facilitate efficient computation on large datasets compared to traditional Python lists.
21. Describe the role of ufuncs in NumPy and provide examples of common ufuncs.
22. What are some practical applications of aggregations in data analysis and statistics?
23. Explain the concept of fancy indexing in NumPy and its advantages over basic indexing.
24. Discuss the importance of data sorting in data manipulation tasks and provide examples of scenarios where sorting is necessary.
25. How does hierarchical indexing improve data organization and accessibility in Pandas?
26. Compare and contrast different methods for handling missing data in Pandas.
27. Discuss the advantages of combining datasets in Pandas for comprehensive data analysis.
28. Provide examples of aggregation functions commonly used in Pandas and their applications.
29. How are string operations in Pandas beneficial for data cleaning and preprocessing tasks?
30. Discuss the significance of high-performance computing in data manipulation and analysis tasks.
ANSWERS

1. Python Shell is an interactive command-line interface where users can execute Python
code line by line, while Jupyter Notebook is a web-based interactive computing
environment that allows users to create and share documents containing live code,
equations, visualizations, and narrative text. Jupyter Notebook provides a more
dynamic and visually appealing interface compared to the Python Shell.
2. IPython magic commands are special commands that provide additional functionality
and shortcuts in Jupyter Notebook. For example, the %timeit magic command can be
used to measure the execution time of a Python statement or expression. By typing
%timeit followed by the statement or expression in a Jupyter Notebook cell and
running the cell, the time taken to execute the code will be displayed.
3. NumPy arrays are multi-dimensional arrays that allow for efficient computation on
large datasets. They are the core data structure used in numerical computing with
Python. NumPy arrays provide a way to perform mathematical operations on
entire datasets without the need for explicit loops, making them significantly faster
than traditional Python lists.
4. Universal functions (ufuncs) in NumPy are functions that operate element-wise on
NumPy arrays. They are designed to efficiently handle computations across entire
arrays, making them significantly faster than equivalent Python functions applied
element-wise in loops. Examples of ufuncs include addition, subtraction,
multiplication, and trigonometric functions like sine and cosine.
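A minimal sketch of ufunc behaviour (the array values here are illustrative):

```python
import numpy as np

# Ufuncs apply element-wise across the whole array at once,
# with no explicit Python loop.
x = np.array([1.0, 2.0, 3.0])
y = np.array([10.0, 20.0, 30.0])

total = np.add(x, y)                        # element-wise addition: [11. 22. 33.]
sines = np.sin(np.array([0.0, np.pi / 2]))  # trigonometric ufunc: [0. 1.]
```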
5. Aggregations in the context of NumPy arrays involve performing mathematical
operations that result in a single value summarizing the entire array or a specific
axis of the array. Examples of aggregations include computing the sum, mean,
median, minimum, and maximum of an array.
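A short illustration of array and axis-wise aggregations (the 2x3 array is made up):

```python
import numpy as np

# Aggregations reduce an array, or one axis of it, to summary values.
a = np.array([[1, 2, 3],
              [4, 5, 6]])

print(a.sum())          # 21 - the whole array
print(a.mean(axis=0))   # [2.5 3.5 4.5] - column means
print(a.max(axis=1))    # [3 6] - row maxima
```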
6. Fancy indexing in NumPy allows for indexing arrays with arrays of indices or
boolean arrays. Unlike basic indexing, which selects individual elements or slices of
an array, fancy indexing selects multiple elements or slices simultaneously based on
the values or positions specified in the index array. For example, array[[0, 2, 4]]
selects the elements at positions 0, 2, and 4 in the array.
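The two forms of fancy indexing mentioned above, sketched with illustrative data:

```python
import numpy as np

arr = np.arange(10, 20)        # [10 11 ... 19]

# Fancy indexing: pass an array of positions...
picked = arr[[0, 2, 4]]        # [10 12 14]

# ...or a boolean mask of the same shape.
evens = arr[arr % 2 == 0]      # [10 12 14 16 18]
```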
7. Sorting arrays in NumPy can be done using the np.sort() function, which returns a
sorted copy of the input array. Alternatively, arrays can be sorted in-place using the
array.sort() method. Sorting can be performed along a specified axis or flattened
to sort the entire array.
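The three sorting variants described above, with made-up values:

```python
import numpy as np

a = np.array([[3, 1, 2],
              [9, 7, 8]])

sorted_rows = np.sort(a, axis=1)   # sorted copy, row by row; `a` is untouched
flat = np.sort(a, axis=None)       # flatten first, then sort the whole array

b = np.array([3, 1, 2])
b.sort()                           # in-place: b becomes [1 2 3]
```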
8. Structured data refers to data that is organized in a predefined format, typically
consisting of rows and columns with named fields or attributes. Structured data is
commonly used in databases, spreadsheets, and other tabular formats. In data
manipulation tasks, structured data allows for efficient storage, retrieval, and
analysis of information.
9. Pandas is a Python library for data manipulation and analysis that provides data
structures and functions for working with structured data. It is built on top of
NumPy and provides additional features for handling labeled data, missing data, and
time series data, making it a powerful tool for data scientists and analysts.
10. Data indexing and selection in Pandas involves accessing and manipulating data
within Pandas data structures, such as Series and DataFrame objects. It allows
for selecting specific rows, columns, or subsets of data based on labels, indices,
or boolean conditions.
11. Pandas provides several methods for handling missing data, including dropping
missing values (dropna()), filling missing values with specified values (fillna()),
and interpolating missing values based on neighboring values (interpolate()).
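The three missing-data methods side by side (the Series values are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

dropped = s.dropna()       # remove the missing entries entirely
filled = s.fillna(0.0)     # replace each NaN with a constant
interp = s.interpolate()   # linear fill from neighbours: NaNs become 2.0 and 4.0
```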
12. Hierarchical indexing in Pandas, also known as multi-level indexing, allows for
organizing and accessing data in multiple dimensions using multiple index levels. It
enables more complex data structures than traditional flat indexing, making it
easier to represent and analyze hierarchical data.
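A small hierarchical-indexing sketch; the (city, year) pairs and values are invented for illustration:

```python
import pandas as pd

# Two index levels: city and year.
idx = pd.MultiIndex.from_tuples(
    [("NY", 2023), ("NY", 2024), ("LA", 2023)],
    names=["city", "year"],
)
pop = pd.Series([8.3, 8.4, 3.9], index=idx)

ny = pop.loc["NY"]           # partial indexing: every NY row
one = pop.loc[("NY", 2024)]  # a single element by full key
```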
13. Combining datasets in Pandas involves merging or concatenating multiple datasets
into a single dataset. This can be done using functions like pd.concat() and
pd.merge(), or the DataFrame.join() method, depending on the desired outcome and
the structure of the datasets.
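A minimal example of the two most common combining operations (the frames are illustrative):

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
right = pd.DataFrame({"key": ["a", "b"], "y": [3, 4]})

merged = pd.merge(left, right, on="key")              # column-based join on "key"
stacked = pd.concat([left, left], ignore_index=True)  # row-wise concatenation
```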
14. Aggregation in Pandas involves computing summary statistics or aggregating data
based on one or more grouping variables. For example, the groupby() function can
be used to group data by one or more variables and then apply aggregation
functions like sum(), mean(), median(), etc.
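A groupby-then-aggregate sketch; the team/score data is made up:

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["A", "A", "B", "B"],
    "score": [10, 20, 5, 15],
})

# Split the rows by team, then aggregate each group.
totals = df.groupby("team")["score"].sum()
means = df.groupby("team")["score"].mean()
```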
15. String operations in Pandas involve performing text processing tasks on string
data within Pandas Series or DataFrame objects. This includes tasks such as
splitting strings, extracting substrings, replacing values, and performing regular
expression operations.
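A few of the vectorized .str operations mentioned above, applied to deliberately messy sample data:

```python
import pandas as pd

names = pd.Series([" Alice ", "BOB", "charlie"])

cleaned = names.str.strip().str.lower()           # " Alice " -> "alice"
starts_c = cleaned.str.startswith("c")            # boolean mask per element
parts = pd.Series(["a_b", "c_d"]).str.split("_")  # each value becomes a list
```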
16. Working with time series data in Pandas involves analyzing and manipulating
datasets where the index represents dates or timestamps. Pandas provides specialized
data structures and functions for working with time series data, including resampling,
shifting, and rolling window operations.
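The resampling, shifting, and rolling operations in one short sketch (hourly values invented; assumes a pandas version that accepts the lowercase "h" frequency alias):

```python
import pandas as pd

# Six hourly readings with a datetime index.
idx = pd.date_range("2024-01-01", periods=6, freq="h")
ts = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

two_hourly = ts.resample("2h").sum()   # downsample: sums of each 2-hour bin
shifted = ts.shift(1)                  # lag the series by one observation
rolling = ts.rolling(window=3).mean()  # 3-point moving average
```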
17. Pandas ensures high performance in data manipulation tasks through efficient
algorithms and data structures optimized for numerical and labeled data
processing. Additionally, Pandas leverages underlying libraries like NumPy and
Cython for computational tasks, further enhancing performance.
18. Python Shell is an interactive command-line interface that allows users to execute
Python code line by line, while Jupyter Notebook is a web-based interactive
computing environment that allows users to create and share documents
containing live code, equations, visualizations, and narrative text. Jupyter Notebook provides a
more dynamic and visually appealing interface compared to the Python Shell.
19. Three commonly used IPython magic commands include %matplotlib for enabling
inline plotting, %timeit for measuring the execution time of code snippets, and
%load_ext for loading and activating external Python modules or extensions.
20. NumPy arrays facilitate efficient computation on large datasets compared to
traditional Python lists by providing vectorized operations and optimized algorithms
for numerical computing. This allows for faster execution of mathematical
operations and reduces the need for explicit loops, leading to improved performance.
21. Ufuncs in NumPy are functions that operate element-wise on NumPy arrays,
allowing for efficient computation across entire arrays without the need for explicit
loops. Common ufuncs include arithmetic operations (e.g., addition, subtraction),
trigonometric functions (e.g., sine, cosine), and mathematical functions (e.g.,
exponentiation, logarithm).
22. Aggregations in data analysis and statistics are used to summarize data and
compute summary statistics that provide insights into the dataset's characteristics.
Practical applications of aggregations include computing measures of central
tendency (e.g., mean, median), measures of dispersion (e.g., standard deviation,
range), and other summary statistics for descriptive analysis and inference.
23. Fancy indexing in NumPy allows for indexing arrays with arrays of indices or
boolean arrays, enabling advanced indexing operations not possible with basic
indexing. Its advantages over basic indexing include the ability to select multiple
elements or slices simultaneously and the flexibility to specify complex
selection criteria using boolean arrays.
24. Data sorting is important in data manipulation tasks as it allows for organizing data
in a specified order, making it easier to analyze and interpret. Examples of scenarios
where sorting is necessary include ranking data, identifying outliers, and preparing
data for visualization or further analysis.
25. Hierarchical indexing improves data organization and accessibility in Pandas by
enabling multi-level indexing of Series and DataFrame objects. It allows for
representing and analyzing data with multiple dimensions and levels of
granularity, facilitating more complex data structures and analyses.
26. Different methods for handling missing data in Pandas include dropping missing
values (dropna()), filling missing values with specified values (fillna()), and
interpolating missing values based on neighboring values (interpolate()). The
choice of method depends on the nature of the data and the specific requirements of
the analysis.
27. Combining datasets in Pandas allows for comprehensive data analysis by merging or
concatenating multiple datasets into a single dataset. This enables comparing and
analyzing data from different sources, identifying relationships between variables, and
gaining insights that may not be possible with individual datasets alone.
28. Examples of aggregation functions commonly used in Pandas include sum(), mean(),
median(), min(), max(), count(), and std(). These functions are applied to
grouped data using the groupby() function to compute summary statistics or perform
aggregations based on one or more grouping variables.
29. String operations in Pandas are beneficial for data cleaning and preprocessing tasks as
they allow for manipulating and transforming string data within Pandas Series or
DataFrame objects. Examples of string operations include splitting strings, extracting
substrings, replacing values, and performing regular expression operations.



30. High-performance computing is significant in data manipulation and analysis tasks as
it enables faster execution of algorithms and processing of large datasets. This leads
to improved productivity, faster insights, and more efficient utilization of
computational resources, ultimately enhancing the overall data analysis workflow.

16-MARK

1. **Explain the key features of the Python Shell and Jupyter Notebook. How do they aid in
data manipulation tasks?**

2. **Discuss the role of IPython Magic Commands in data analysis. Provide examples of
how they can be used to enhance productivity in a Jupyter Notebook environment.**

3. **Describe the structure and functions of NumPy arrays. How do they differ from Python
lists in terms of performance and functionality?**

4. **What are Universal Functions (ufuncs) in NumPy? Explain with examples how they are
used for efficient computation.**

5. **Illustrate the process of aggregations in NumPy. Provide examples of different
aggregation operations and their applications.**

6. **Explain the concept of fancy indexing in NumPy. How can it be used for advanced data
manipulation and selection?**

7. **Describe the methods available for sorting arrays in NumPy. How do these methods
improve data manipulation tasks?**

8. **Discuss the importance of structured data in NumPy. How does the use of structured
arrays facilitate data analysis? Provide examples.**

9. **Explain how data manipulation is performed using Pandas. Discuss the advantages of
using Pandas for data manipulation compared to traditional methods.**

10. **Describe the techniques available for data indexing and selection in Pandas. Provide
examples to illustrate how these techniques can be applied in data analysis.**

11. **Discuss the methods for handling missing data in Pandas. How can missing data be
identified and addressed? Provide examples.**

12. **Explain hierarchical indexing in Pandas. How does it support complex data structures
and multi-level data analysis?**



13. **Describe the methods available for combining datasets in Pandas. Provide examples of
different types of joins and merges.**

14. **Explain the process of aggregation and grouping in Pandas. How do these techniques
assist in summarizing and analyzing data?**

15. **Discuss the various string operations available in Pandas. Provide examples of how
string manipulation can be performed on Pandas DataFrames.**

16. **Describe the methods for working with time series data in Pandas. How can time series
analysis be performed using Pandas functionalities?**

17. **Explain the techniques for achieving high performance in data manipulation using
Pandas. Discuss the role of vectorization and efficient data handling practices.**

18. **Compare and contrast the data manipulation capabilities of NumPy and Pandas. In
what scenarios would you prefer one over the other?**

19. **Discuss the computation on arrays in NumPy. Provide examples of mathematical
operations and their applications in data manipulation.**

20. **Explain the concept of combining datasets in Pandas. Provide examples of how
datasets can be concatenated, merged, and joined.**


UNIT-3

### Two-Mark Questions

1. **What is the modeling process in machine learning?**


2. **Define supervised learning.**
3. **What is unsupervised learning?**
4. **Explain semi-supervised learning.**
5. **What is classification in machine learning?**
6. **Define regression in the context of machine learning.**
7. **What is clustering?**
8. **Describe what an outlier is in data analysis.**
9. **What is outlier analysis?**
10. **Name two types of machine learning algorithms.**
11. **What is the main goal of supervised learning?**
12. **How does unsupervised learning differ from supervised learning?**
13. **Give an example of a semi-supervised learning application.**
14. **What is the difference between classification and regression?**
15. **What is a common use case for clustering?**
16. **Name one method used for outlier detection.**
17. **What is the role of a training dataset in supervised learning?**
18. **How does a regression model differ from a classification model?**



19. **What is the purpose of a validation dataset in machine learning?**
20. **Explain the term "model overfitting".**
21. **What is the purpose of a test dataset?**
22. **Give an example of a classification algorithm.**
23. **What is a common metric used to evaluate regression models?**
24. **Name one algorithm used for clustering.**
25. **What is the purpose of feature scaling?**
26. **What does the term "label" refer to in supervised learning?**
27. **Define the term "feature" in the context of machine learning.**
28. **What is the k-nearest neighbors (k-NN) algorithm used for?**
29. **What is meant by the term "data normalization"?**
30. **What is principal component analysis (PCA) used for in machine learning?**

### Sixteen-Mark Questions

1. **Describe the complete modeling process in machine learning, from data collection to
model deployment.**
2. **Compare and contrast the different types of machine learning: supervised, unsupervised,
and semi-supervised learning.**
3. **Discuss the key characteristics of supervised learning. Provide examples of algorithms
and real-world applications.**
4. **Explain the concept of unsupervised learning. Discuss the techniques used and their
applications with examples.**
5. **Describe semi-supervised learning and its advantages. Provide examples of scenarios
where this type of learning is beneficial.**
6. **Explain the process of classification in machine learning. Include a discussion on
different classification algorithms and their applications.**
7. **Discuss regression analysis in machine learning. Explain various regression techniques
and their use cases.**
8. **Describe the clustering process in detail. Discuss different clustering algorithms and
their practical applications.**
9. **What are outliers in data? Explain the significance of outlier detection and analysis in
machine learning.**
10. **Provide a comprehensive overview of outlier analysis techniques. Discuss their
advantages and limitations.**
11. **Compare classification and regression in terms of methodology, use cases, and
evaluation metrics.**
12. **Explain the role of training, validation, and test datasets in the machine learning
process.**
13. **Discuss the concept of model overfitting and underfitting. Provide strategies to prevent
these issues.**
14. **Describe the k-means clustering algorithm. Explain its working, advantages, and
limitations with examples.**
15. **Discuss the application of decision trees in classification and regression. Explain how
they are constructed and used.**
16. **Explain the concept and applications of the k-nearest neighbors (k-NN) algorithm in
detail.**
17. **Describe how support vector machines (SVM) are used for classification tasks. Discuss
their advantages and limitations.**



18. **Explain the process of linear regression. Include a discussion on how it is used to
predict continuous outcomes.**
19. **Discuss principal component analysis (PCA) and its role in dimensionality reduction.
Provide examples of its application.**
20. **Explain the process and importance of feature engineering in machine learning.
Discuss different techniques and their impact on model performance.**

UNIT-4

### Two-Mark Questions

1. **What is Matplotlib used for in Python?**

2. **How do you import Matplotlib in a Python script?**

3. **What is a simple line plot?**

4. **What command is used to create a simple scatter plot in Matplotlib?**

5. **Define what a histogram represents in data visualization.**

6. **What is the purpose of a legend in a plot?**

7. **How can you change the color of a line in a Matplotlib plot?**

8. **What is a subplot in Matplotlib?**

9. **What is the function of text and annotation in a plot?**

10. **How can you customize a Matplotlib plot?**

11. **What is a density plot?**

12. **What are contour plots used for?**

13. **How do you create three-dimensional plots in Matplotlib?**

14. **What is Basemap used for in data visualization?**

15. **What type of data visualization does Seaborn specialize in?**

16. **How do you import Seaborn in a Python script?**

17. **What function is used to create a simple line plot in Matplotlib?**

18. **What is the difference between a scatter plot and a line plot?**

19. **Why are error bars used in plots?**



20. **How do you add a title to a Matplotlib plot?**

21. **What command is used to create a histogram in Matplotlib?**

22. **How do you add a legend to a Matplotlib plot?**

23. **What function is used to create subplots in Matplotlib?**

24. **How can you change the style of a plot in Matplotlib?**

25. **What are the advantages of using Seaborn over Matplotlib?**

26. **How do you plot geographic data with Basemap?**

27. **What is the purpose of using colors in data visualization?**

28. **What function is used to add text annotations in Matplotlib?**

29. **What are the different types of plots available in Seaborn?**

30. **How can you create a heatmap using Seaborn?**

### Sixteen-Mark Questions

1. **Describe the process of importing and setting up Matplotlib in a Python
environment. Include examples of basic setup and customization.**

2. **Explain the creation and customization of simple line plots using Matplotlib.
Provide examples with multiple lines and different styles.**

3. **Discuss the creation of simple scatter plots in Matplotlib. Explain how to
customize markers, colors, and sizes.**

4. **Describe the methods for visualizing errors in data plots. Include examples of error
bars and shaded regions.**

5. **Explain how to create and interpret density and contour plots in Matplotlib.
Provide examples with different datasets.**

6. **Discuss the creation of histograms using Matplotlib. Explain how to customize
bins, colors, and other properties.**

7. **Explain the importance and usage of legends in Matplotlib plots. Provide examples
of different positioning and customization options.**

8. **Describe the different ways to apply and customize colors in Matplotlib plots.
Include examples of color maps and color bars.**



9. **Explain the concept of subplots in Matplotlib. Provide examples of creating
and customizing multiple subplots in a single figure.**

10. **Discuss the use of text and annotation in Matplotlib plots. Provide examples of
adding labels, titles, and annotations to different parts of a plot.**

11. **Explain the various customization options available in Matplotlib. Include examples
of customizing axes, grid lines, and plot styles.**

12. **Describe the process of creating three-dimensional plots using Matplotlib. Provide examples with surface plots and wireframes.**
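A minimal sketch for question 12: the same Gaussian bump rendered as a surface and as a wireframe. The function is an arbitrary example; `projection="3d"` is available without extra imports in recent Matplotlib versions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-2, 2, 30)
X, Y = np.meshgrid(x, x)
Z = np.exp(-(X**2 + Y**2))  # a Gaussian bump

fig = plt.figure()
ax1 = fig.add_subplot(1, 2, 1, projection="3d")
ax1.plot_surface(X, Y, Z, cmap="viridis")           # filled surface
ax2 = fig.add_subplot(1, 2, 2, projection="3d")
ax2.plot_wireframe(X, Y, Z, rstride=3, cstride=3)   # wireframe of the same data
fig.savefig("surface3d.png")
```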

13. **Discuss the use of Basemap for plotting geographic data. Include examples of
different types of geographic visualizations.**

14. **Explain how Seaborn can be used for advanced data visualization. Provide examples
of different types of plots available in Seaborn.**

15. **Compare and contrast Matplotlib and Seaborn in terms of functionality and ease of
use. Provide examples of plots created with both libraries.**

16. **Explain how to create and customize histograms using Seaborn. Provide examples
of different customization options.**

17. **Describe the process of creating heatmaps using Seaborn. Include examples
of customization and interpretation of the heatmap.**

18. **Discuss the various types of plots available in Seaborn for visualizing
distributions. Provide examples and explain their applications.**

19. **Explain the creation and customization of categorical plots using Seaborn.
Provide examples of bar plots, box plots, and violin plots.**
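A sketch for question 19, assuming Seaborn and Pandas are installed: the three categorical plot types side by side on synthetic two-group data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "group": np.repeat(["A", "B"], 50),
    "value": np.concatenate([rng.normal(0, 1, 50), rng.normal(1, 1.5, 50)]),
})

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
sns.barplot(data=df, x="group", y="value", ax=axes[0])     # group means with error bars
sns.boxplot(data=df, x="group", y="value", ax=axes[1])     # quartiles and outliers
sns.violinplot(data=df, x="group", y="value", ax=axes[2])  # full density shape
fig.savefig("categorical.png")
```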

20. **Describe the integration of Matplotlib and Seaborn in a single project. Provide
examples of how both libraries can be used together for comprehensive data
visualization.**

UNIT-5

### Two-Mark Questions

1. **What is one common problem when handling large volumes of data?**

2. **Name a technique used to handle large data sets.**

3. **What is a key programming tip for dealing with large data sets?**

4. **What is the goal of predicting malicious URLs?**

5. **What is a recommender system?**

6. **Name one tool used for handling large volumes of data.**

7. **What is the first step in the data preparation process?**

8. **Why is data cleaning important when handling large data sets?**

9. **What is model building in the context of data science?**

10. **What is the purpose of automating data handling processes?**

11. **Define the term "research question" in data analysis.**

12. **What is a common method for reducing the size of a large data set?**

13. **Name one technique for optimizing memory usage when working with large
data sets.**
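Questions 12–13 touch on shrinking large data sets in memory; one common Pandas technique is downcasting numeric dtypes. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Columns stored as 64-bit values often fit in a much smaller dtype.
df = pd.DataFrame({
    "count": np.arange(1_000, dtype="int64"),
    "score": np.linspace(0, 1, 1_000).astype("float64"),
})

before = df.memory_usage(deep=True).sum()
df["count"] = pd.to_numeric(df["count"], downcast="integer")  # int64 -> int16 here
df["score"] = pd.to_numeric(df["score"], downcast="float")    # float64 -> float32
after = df.memory_usage(deep=True).sum()

assert after < before  # downcasting roughly halves the numeric footprint
```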

14. **What is the main benefit of using distributed computing for large data sets?**

15. **How does a data warehouse help in handling large data?**

16. **What is data streaming?**

17. **Why is it important to use efficient algorithms when working with large data sets?**

18. **Name a Python library commonly used for handling large data sets.**

19. **What is the role of data visualization in handling large data?**

20. **What is a case study in the context of data science?**

21. **Define data partitioning.**

22. **What is the significance of feature selection in model building?**

23. **Name one challenge in building a recommender system.**

24. **What is the purpose of cross-validation in model building?**

25. **What does ETL stand for in data processing?**
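ETL stands for Extract, Transform, Load; a standard-library-only sketch of the three steps, using an in-memory CSV and SQLite as stand-ins for a real source file and target warehouse:

```python
import csv
import io
import sqlite3

# Extract: read raw rows (an in-memory CSV stands in for a source file).
raw = io.StringIO("name,amount\nalice,10\nbob,-3\ncarol,7\n")
rows = list(csv.DictReader(raw))

# Transform: fix types and filter out invalid records.
clean = [(r["name"], int(r["amount"])) for r in rows if int(r["amount"]) > 0]

# Load: insert the cleaned rows into the target database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (name TEXT, amount INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?)", clean)
total = con.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 17
```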

26. **What is data sharding?**

27. **Name one cloud-based tool for handling large data.**

28. **What is the role of batch processing in handling large data sets?**

29. **Why is scalability important in handling large data?**

30. **What is the significance of presentation in data analysis?**

### Sixteen-Mark Questions

1. **Discuss the common problems encountered when handling large volumes of data. Provide examples and suggest potential solutions.**

2. **Explain various techniques for handling large data sets. Include examples of tools
and methods used in practice.**

3. **Describe programming tips and best practices for dealing with large data sets in
Python. Provide code examples to illustrate your points.**
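One programming tip question 3 could be answered with: stream the data in fixed-size chunks instead of loading it all at once, so only one chunk is in memory at a time. A minimal standard-library sketch, with `StringIO` standing in for a large file opened with `open(path)`:

```python
import io

def stream_sums(lines, chunk_size=3):
    """Sum numeric lines chunk by chunk instead of reading everything at once."""
    chunk, totals = [], []
    for line in lines:
        chunk.append(int(line))
        if len(chunk) == chunk_size:
            totals.append(sum(chunk))  # process, then discard the chunk
            chunk = []
    if chunk:                          # handle the final partial chunk
        totals.append(sum(chunk))
    return totals

fake_file = io.StringIO("\n".join(str(i) for i in range(1, 8)))
result = stream_sums(fake_file, chunk_size=3)
print(result)  # [6, 15, 7]
```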

4. **Detail the steps involved in predicting malicious URLs. Explain the data
preparation, model building, and evaluation processes.**

5. **Explain the process of building a recommender system. Discuss different algorithms and techniques used, and provide a case study example.**

6. **Discuss the tools and techniques needed for handling large data sets. Include
examples of software and platforms commonly used.**

7. **Explain the importance of a well-defined research question in data analysis. How does
it influence the subsequent steps of the data analysis process?**

8. **Describe the data preparation process for handling large data sets. Include data
cleaning, transformation, and feature engineering.**

9. **Explain the concept of model building in data science. Discuss the steps involved
and provide examples of different types of models.**

10. **Discuss the significance of presenting and automating data handling processes.
Provide examples of tools and methods used for presentation and automation.**

11. **Compare and contrast batch processing and stream processing for handling large
data sets. Discuss the advantages and disadvantages of each.**

12. **Explain the role of distributed computing in handling large data. Discuss
popular frameworks such as Apache Hadoop and Apache Spark.**

13. **Discuss the importance of data partitioning and sharding in handling large volumes
of data. Provide examples of how they are implemented.**

14. **Describe the ETL process in data handling. Explain each step and its importance
in managing large data sets.**

15. **Explain how memory optimization techniques can improve the efficiency of handling
large data sets. Provide examples of specific techniques.**

16. **Discuss the challenges and solutions in building a scalable recommender system. Provide examples of different scaling strategies.**

17. **Explain the role of feature selection and feature engineering in model building.
Provide examples of techniques used for feature selection.**

18. **Discuss the significance of cross-validation in model building and evaluation. Explain different cross-validation techniques with examples.**
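The idea behind k-fold cross-validation in question 18 can be shown without any ML library: split the indices into k folds and, on each iteration, hold one fold out for testing and train on the rest. A hypothetical standard-library sketch:

```python
def kfold_indices(n, k):
    """Split range(n) into k folds; yield (train, test) index lists."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# Every sample appears in exactly one test fold across the k iterations.
splits = list(kfold_indices(10, 5))
print(len(splits))  # 5
```

In practice each `(train, test)` pair would be fed to a model fit/score loop, and the k scores averaged.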

19. **Describe the role of data visualization in handling large data sets. Provide examples of
tools and techniques used for effective visualization.**

20. **Explain the process of conducting a case study in data science. Provide a detailed
example of a case study, including research question, data preparation, model building, and
presentation.**
