PDS Lab Manual_23
Certificate
Place: __________
Date: __________
Preface
The main motto of any laboratory/practical/field work is to enhance the required skills and to create the ability amongst students to solve real-time problems by developing the relevant competencies in the psychomotor domain. Keeping this in view, GTU has designed a competency-focused, outcome-based curriculum for its engineering degree programmes in which sufficient weightage is given to practical work. This underlines the importance of skill enhancement amongst students, and it demands that every second of the time allotted for practicals be utilised by students, instructors and faculty members to achieve the relevant outcomes by actually performing the experiments rather than merely studying them. For effective implementation of a competency-focused, outcome-based curriculum, it is essential that every practical is carefully designed to serve as a tool to develop and enhance the relevant competencies required by industry in every student. These psychomotor skills are very difficult to develop through the traditional chalk-and-board content delivery method in the classroom. Accordingly, this lab manual is designed to focus on industry-defined, relevant outcomes rather than the old practice of conducting practicals merely to prove a concept or theory.
Using this lab manual, students can go through the relevant theory and procedure in advance of the actual performance, which creates interest and gives them a basic idea before the session. This in turn strengthens the pre-determined outcomes. Each experiment in this manual begins with the competency, the industry-relevant skills, the course outcomes and the practical outcomes (objectives). Students will also learn the safety measures and necessary precautions to be taken while performing the practical.
This manual also provides guidelines to faculty members to facilitate student-centric lab activities for each experiment by arranging and managing the necessary resources, so that students follow the procedures with the required safety and necessary precautions to achieve the outcomes. It also gives an idea of how students will be assessed, by providing rubrics.
Data Science is about data gathering, analysis and decision-making. It is about finding patterns in data through analysis and making future predictions. By using Data Science, companies can make better-informed decisions. Data Science is used in many industries today, e.g. banking, consultancy, healthcare and manufacturing. Python is an open-source, interpreted, high-level language and provides a great approach to data science, machine learning and research. It is one of the best languages for data science and is used for a wide range of applications and projects. When it comes to dealing with mathematical, statistical and scientific functions, Python has great utility.
Utmost care has been taken while preparing this lab manual; however, there is always scope for improvement. We therefore welcome constructive suggestions for improvement and for the removal of errors, if any.
Python for Data Science (3150713)
Sr. No. | Objective(s) of Experiment | Course Outcomes (CO1-CO5)
1. Develop a program to understand the control structures of Python. (√)
2. Develop a program to learn different types of structures (list, dictionary, tuples) in Python. (√)
3. Develop a program that reads a .csv dataset file using the Pandas library and displays the following content of the dataset: a) first five rows of the dataset, b) complete data of the dataset, c) summary or metadata of the dataset. (√ √)
4. Develop a program that shows application of slicing and dicing over the rows and columns of the dataset. (√ √)
The following industry-relevant competencies are expected to be developed in the student by undertaking the practical work of this laboratory.
1. Programming Languages
2. Mathematics, Statistical Analysis, and Probability
3. Data Mining
4. Machine Learning and AI
5. Data Visualization
Experiment No: 1
Develop a program to understand the control structures of python.
Date:
Objectives: (a) To learn and understand the different control structures in Python, such as loops, conditional statements, and functions.
Theory:
Conditional statements: Conditional statements in Python allow you to execute certain blocks of code
based on whether a certain condition is true or false. The two main types of conditional statements in
Python are "if" statements and "if-else" statements.
Loops: Loops in Python allow you to repeat a block of code multiple times, either for a fixed number
of times or until a certain condition is met. The two main types of loops in Python are "for" loops and
"while" loops.
Functions: Functions in Python allow you to encapsulate blocks of code and reuse them throughout
your program. Functions can accept parameters and return values, making them a powerful tool for
organizing and structuring your code.
Scope: Scope in Python refers to the region of your program where a variable or function is visible and
accessible. Understanding scope is critical for avoiding errors and ensuring that your code is organized
and easy to maintain.
Error handling: Error handling in Python involves detecting and responding to errors that may occur
during program execution. Proper error handling can help you avoid crashes and ensure that your
program continues to run smoothly.
Safety and necessary precautions:
1. Data validation.
Procedure:
1. Plan the program structure and flow: Develop a plan for the program structure, including the
control structures that will be included, and the flow of the program logic.
2. Implement the control structures in Python: Write the code to implement the different control
structures in Python, including conditional statements, loops, and functions.
3. Test and debug the program: Conduct thorough testing of the program to ensure that it is
functioning correctly and identify and troubleshoot any errors or bugs.
4. Refine and optimize the program: Refine the program as needed to improve performance and
optimize its functionality, based on user feedback and testing results.
5. Document the program: Provide clear documentation of the program's purpose, functionality,
and limitations, as well as any potential security risks or necessary precautions.
6. Deploy and maintain the program: Deploy the program for use by users, and maintain it by
addressing any issues or bugs that arise and providing updates and new features as needed.
• If Statement
• If Else Statement
• Nested If Statement
• If Elif Statement
• For loop
• While loop
• Function
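A minimal sketch covering the constructs listed above (if, if-else, nested if, if-elif, for, while and a function) is given below; the numbers and marks used are arbitrary sample values, not part of the original manual.

# if statement
num = 7
if num > 0:
    print("num is positive")

# if-else statement
if num % 2 == 0:
    print("num is even")
else:
    print("num is odd")

# nested if statement
if num > 0:
    if num > 5:
        print("num is positive and greater than 5")

# if-elif statement
marks = 72
if marks >= 80:
    grade = "A"
elif marks >= 60:
    grade = "B"
else:
    grade = "C"
print("grade:", grade)

# for loop: iterate over a fixed range of values
for i in range(1, 6):
    print("for loop iteration", i)

# while loop: repeat until a condition becomes false
count = 3
while count > 0:
    print("countdown:", count)
    count -= 1

# function: encapsulates reusable logic, accepts a parameter and returns a value
def square(x):
    return x * x

print("square of 4 is", square(4))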
Quiz:
3. What is the difference between a "for" loop and a "while" loop in Python?

For loop:
- Iterates over a sequence (list, tuple, string, range), so the number of iterations is known in advance.
- Generally runs slightly faster, because the iteration over the sequence is handled internally.
- Initialization and updating of the loop variable happen automatically in the loop header.

While loop:
- Repeats a block as long as a condition remains true, so the number of iterations need not be known in advance.
- Relatively slower than an equivalent for loop.
- The condition must be initialized before the loop and updated inside the loop body; forgetting to update it produces an infinite loop.
Suggested Reference:
1. https://docs.python.org/3/library/
2. https://www.tutorialspoint.com/python/
3. https://www.geeksforgeeks.org/
4. https://realpython.com/
5. https://www.w3schools.com/python/
Experiment No: 2
Develop a program to learn different types of structures (list, dictionary, tuples)
in python.
Date:
• Basic programming concepts: You should have a good grasp of basic programming concepts
such as variables, data types, conditional statements, loops, and functions.
• Python programming language: You should have a good understanding of Python syntax,
data structures, and standard library functions.
• Sequences: Sequences are ordered collections of elements that can be accessed by their index
or key. You should have a good understanding of the different types of sequences such as
string, tuple, list, dictionary, and set, and their respective properties.
• String manipulation: You should know how to manipulate strings using operations such as slicing, concatenation, and formatting.
• Collection manipulation: Collections such as lists, tuples, dictionaries, and sets can be
manipulated using methods such as append, insert, remove, pop, and sort.
• Iteration: You should know how to use for loops and list comprehensions to iterate over
sequences.
• Conditional statements: You should know how to use conditional statements to check for
specific conditions in sequences.
• Functions: You should know how to define functions that operate on sequences and return
values.
Objectives: (a) To learn how to manipulate and access the elements of these structures, iterate over them, perform conditional operations on them, and use them in functions.
(b) To learn how to select the appropriate sequence type for a given task based on its properties and
performance characteristics.
Theory:
1. The Python programming language has four built-in sequence types: strings, lists, tuples, and ranges. Additionally, Python includes the set and dictionary data structures, which are collections of unique elements and of key-value pairs, respectively.
2. The string data type in Python represents a sequence of characters and is immutable,
meaning its contents cannot be changed once it is created. Strings can be manipulated using
various methods such as slicing, concatenation, and formatting.
3. Lists and tuples are similar in many ways, but tuples are immutable, whereas lists are
mutable. Lists and tuples can hold elements of any data type and can be indexed and sliced
like strings. However, lists offer additional methods such as append, insert, remove, and pop
that allow for manipulation of the list's contents.
4. Dictionaries are another important built-in data type in Python and are implemented as collections of key-value pairs. Each element in a dictionary consists of a key and a corresponding value. Dictionaries can be used to store and retrieve data quickly based on the key.
5. Sets are collections of unique elements that are unordered and mutable. Sets are often used
to perform set operations such as union, intersection, and difference.
Procedure:
1. Create a string variable using single or double quotes.
Use string methods like upper(), lower(), strip(), split(), join(), and replace() to manipulate the
string as needed.
Use indexing and slicing to access specific characters or substrings within the string.
2. Create a tuple variable using parentheses.
Use indexing and slicing to access specific elements or subsets within the tuple.
Tuples are immutable, so you cannot add, remove or modify elements once created.
3. Create a list variable using square brackets.
Use indexing and slicing to access specific elements or subsets within the list.
Use list methods like append(), insert(), remove(), pop(), extend(), and sort() to modify the list
as needed.
Lists are mutable, so you can add, remove or modify elements once created.
4. Create a dictionary variable using curly braces or the dict() constructor.
Use keys to access values within the dictionary.
Use dictionary methods like keys(), values(), and items() to access different parts of the
dictionary.
Use del or pop() to remove elements from the dictionary.
Use assignment to add or modify elements in the dictionary.
5. Create a set variable using curly braces or the set() constructor.
Use set methods like add(), remove(), pop(), union(), and intersection() to modify or perform
operations on the set.
Sets do not allow duplicate elements, so adding the same element multiple times will only add
it once.
• Procedure 1
• Procedure 2
• Procedure 3
• Procedure 4
• Procedure 5
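A minimal sketch that walks through Procedures 1 to 5 above is given below; the sample values (names, numbers) are illustrative assumptions only.

# Procedure 1: string creation, methods, indexing and slicing
text = "python for data science"
print(text.upper())
print(text.replace("data", "Data"))
print(text.split())            # list of words
print(text[0], text[0:6])      # indexing and slicing

# Procedure 2: tuple creation, indexing and slicing (immutable)
point = (3, 4, 5)
print(point[0], point[1:])

# Procedure 3: list creation and mutation
numbers = [4, 1, 3]
numbers.append(2)
numbers.insert(0, 5)
numbers.remove(1)
numbers.sort()
print(numbers, numbers[-2:])

# Procedure 4: dictionary creation, access and modification
student = {"name": "Asha", "branch": "CE"}
student["semester"] = 5        # add a new key-value pair
print(student.keys(), student.values(), student.items())
student.pop("branch")          # remove an element by key
print(student["name"])

# Procedure 5: set creation and set operations
a = {1, 2, 3}
b = {3, 4, 5}
a.add(3)                       # duplicate element, set stays unchanged
print(a.union(b), a.intersection(b))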
Conclusion:
In this practical, we explored Python's key data structures—strings, tuples, lists, dictionaries, and sets—
learning to manipulate and access their elements.
Quiz:
2. What is the difference between a tuple and a list in Python?

List:
- Lists are mutable.
- Inserting and deleting items is easier with a list.
- An unexpected change or error is more likely to occur in a list.
- Lists consume more memory.

Tuple:
- Tuples are immutable.
- Accessing elements is best accomplished with a tuple.
- In a tuple, changes and errors do not usually occur because of immutability.
- Tuples consume less memory than lists.
Suggested Reference:
1. https://docs.python.org/3/library/
2. https://www.tutorialspoint.com/python/
3. https://www.geeksforgeeks.org/
4. https://realpython.com/
5. https://www.w3schools.com/python/
Rubrics (out of 10):
1. Knowledge of subject: Good (2) / Average (1)
2. Programming Skill: Good (2) / Average (1)
3. Team work: Good (2) / Satisfactory (1)
4. Communication Skill: Good (2) / Satisfactory (1)
5. Ethics: Good (2) / Average (1)
Total Marks: __________
Experiment No: 3
Develop a program that reads a .csv dataset file using Pandas library and display the following content of the dataset.
a) First five rows of the dataset
b) Complete data of the dataset
c) Summary or metadata of the dataset.
Date:
• Knowledge of Python programming language and its libraries, particularly the Pandas
library.
• Understanding of the structure of .csv files and how to read and manipulate them using
Pandas.
• Familiarity with the different methods and functions available in Pandas, such as "head()",
"print()", "display()", "info()", and "describe()".
• Ability to write and debug code, and troubleshoot errors that may arise when working with
datasets.
• Experience in working with datasets, including data cleaning, data wrangling, and data
analysis.
• Ability to understand the content and structure of datasets, and use them to derive insights
and information.
Practical skills:
• Writing code to load a .csv dataset file into a Pandas DataFrame using the "read_csv()"
function.
• Using the "head()" method to display the first five rows of the dataset.
• Using the "print()" function or "display()" method to display the complete data of the dataset.
• Using the "info()" method or "describe()" method to display the summary or metadata of the
dataset.
• Handling errors and exceptions that may arise when working with datasets.
• Writing clean and efficient code that is easy to read and maintain.
• Testing the program with different datasets to ensure its accuracy and reliability.
Objectives: (a) To read and load the .csv dataset file into a Pandas DataFrame.
(b) To display the first five rows of the dataset using the "head()" method.
(c) To display the complete data of the dataset using the "print()" function or "display()" method.
(d) To display the summary or metadata of the dataset using the "info()" method or "describe()" method.
Theory:
Pandas is a popular data manipulation library for Python, widely used in data science and machine
learning. It provides a powerful and flexible toolset for working with structured data, including
loading, manipulating, and analyzing datasets in various formats, including .csv files.
Procedure:
1. Import the Pandas library: To use the Pandas library in Python, it is essential to import it into
your program. You can do this by using the "import pandas as pd" statement.
2. Load the dataset: The next step is to load the dataset into a Pandas DataFrame using the
"read_csv()" function. This function takes the path to the .csv file as an argument and returns
a DataFrame object that contains the data from the file.
3. Display the first five rows: To display the first five rows of the dataset, you can use the
"head()" method. This method returns the first five rows of the DataFrame by default, but
you can specify the number of rows you want to display as an argument.
4. Display the complete data: To display the complete data of the dataset, you can use the
"print()" function or "display()" method. This will output the entire DataFrame to the console
or Jupyter Notebook.
5. Display summary or metadata: To display the summary or metadata of the dataset, you can
use the "info()" method or "describe()" method. The "info()" method provides information
about the DataFrame, including the number of rows and columns, data types, and memory
usage. The "describe()" method provides statistical summary of the dataset, including count,
mean, standard deviation, minimum, maximum, and quartiles for each column.
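A minimal sketch of this procedure is given below; the file name data.csv is a placeholder assumption for whatever CSV dataset is available in the working directory.

import pandas as pd

# Step 2: load the dataset into a DataFrame (the file name is a placeholder)
df = pd.read_csv("data.csv")

# Step 3: first five rows of the dataset
print(df.head())

# Step 4: complete data of the dataset
print(df)

# Step 5: summary / metadata of the dataset
df.info()                # column names, data types, non-null counts, memory usage
print(df.describe())     # count, mean, std, min, quartiles, max for numeric columns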
Conclusion: In this practical, we used the Pandas library to read and analyze a CSV dataset. By loading
the data into a DataFrame, we successfully displayed the first five rows, the complete dataset, and the
summary metadata.
Quiz:
3. How can you display the first five rows of the dataset using Pandas?
To display the first five rows of a Pandas DataFrame, use df.head() where df is your DataFrame,
which shows the top rows of data.
4. How can you display the complete data of the dataset using Pandas?
To display the complete data of a Pandas DataFrame, print the DataFrame itself, for example print(df) in a script or simply df in a notebook cell. Note that very large DataFrames are truncated in the output by default.
5. How can you display the summary or metadata of the dataset using Pandas?
To display the summary or metadata of a Pandas DataFrame, use df.info() to show information like data types, non-null counts, and memory usage.
Suggested Reference:
1. Official Pandas documentation: https://pandas.pydata.org/docs/
2. "Python for Data Analysis" by Wes McKinney: https://www.oreilly.com/library/view/python-for-data/9781491957653/
3. "Python Data Science Handbook" by Jake VanderPlas: https://jakevdp.github.io/PythonDataScienceHandbook/
4. Pandas tutorial by DataCamp: https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python
Rubrics (out of 10):
1. Knowledge of subject: Good (2) / Average (1)
2. Programming Skill: Good (2) / Average (1)
3. Team work: Good (2) / Satisfactory (1)
4. Communication Skill: Good (2) / Satisfactory (1)
5. Ethics: Good (2) / Average (1)
Total Marks: __________
Experiment No: 4
Develop a program that shows application of slicing and dicing over the rows
and columns of the dataset.
Date:
Objectives: (a) To gain insights into the dataset and extract meaningful information from it.
Theory:
Slicing and dicing are powerful operations that allow data analysts to manipulate data by selecting
specific subsets of data from a larger dataset. These operations are widely used in data analysis and
are a crucial aspect of data manipulation.
In the context of Python, slicing refers to extracting specific portions of data from a larger data
structure, such as a list, tuple, or DataFrame. Slicing is performed by specifying the start and end
indices of the portion of data to be extracted. For example, in a list of numbers, slicing can be used
to extract the first three numbers or the last five numbers. In a DataFrame, slicing can be used to
extract specific rows or columns based on specific conditions or criteria.
Dicing, on the other hand, refers to grouping and aggregating data based on specific criteria. This
involves dividing the data into smaller subsets based on specific categories or conditions and
performing aggregation functions on each subset. For example, in a dataset containing sales data,
dicing can be used to group the data by product type, region, or time period and calculate the total
sales for each group.
In Python, the Pandas library provides powerful tools for slicing and dicing data in a DataFrame.
The .loc and .iloc methods are used for slicing rows and columns based on specific conditions or
criteria. The .groupby method is used for grouping data based on specific categories, and
aggregation functions such as .sum(), .mean(), and .count() can be used to perform calculations on
each group. The .pivot_table method is used for creating pivot tables, which provide a summarized
view of the data by grouping and aggregating data based on specific categories.
Procedure:
1. Load the dataset: Load the dataset into Python using the Pandas library's read_csv function.
2. Explore the dataset: Use the head, tail, and info functions to explore the dataset and get a
sense of its structure and contents.
3. Slice and dice the data: Use the Pandas DataFrame's indexing and slicing operations to select
specific rows and columns of the dataset. Examples of slicing operations include loc, iloc,
and [ ].
4. Apply filtering: Use Boolean indexing to filter rows of the dataset based on specific criteria.
5. Aggregate the data: Use the groupby function to group the data by specific columns and
apply aggregation functions such as sum, mean, and count.
6. Visualize the data: Use visualization libraries such as Matplotlib or Seaborn to create
visualizations of the sliced and diced data.
7. Refine and iterate: Refine the analysis and iterate as needed based on the insights gained
from the analysis.
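A minimal sketch of this procedure is given below. So that it runs without an external file, it builds a small hypothetical sales DataFrame in code (the column names region, product and sales are assumptions); with a real dataset, replace the construction with pd.read_csv().

import pandas as pd

# A small illustrative dataset (a real program would use pd.read_csv("sales.csv"))
df = pd.DataFrame({
    "region":  ["East", "West", "East", "South", "West"],
    "product": ["A", "A", "B", "B", "A"],
    "sales":   [120, 95, 180, 60, 150],
})

# Explore the dataset
print(df.head())
df.info()

# Slicing with loc (label-based) and iloc (position-based)
print(df.loc[0:2, ["region", "sales"]])   # rows with labels 0-2, selected columns
print(df.iloc[0:3, 0:2])                  # first three rows, first two columns

# Filtering with Boolean indexing
print(df[df["sales"] > 100])

# Dicing: group by a category and aggregate
print(df.groupby("region")["sales"].sum())
print(df.pivot_table(values="sales", index="region", columns="product", aggfunc="sum"))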
Observations:
Conclusion: This program demonstrates the power of slicing and dicing in Pandas for selecting
specific rows and columns of a dataset. By understanding these techniques, you can efficiently
extract and analyze subsets of your data for various data-driven tasks.
Quiz:
(2) Which function of the Pandas library is used to load a .csv dataset file into Python?
Import a CSV file using the read_csv() function from the pandas library.
(3) What is the difference between loc and iloc in Pandas DataFrame indexing?
The difference between loc and iloc is that loc selects rows and columns by their labels (for example a row label or a column name), whereas iloc selects them by their integer positions (starting from 0 and going up by one for each row).
(4) How can Boolean indexing be used to filter rows of a dataset based on specific criteria?
Boolean indexing builds a Boolean Series from a condition on one or more columns and uses it to select only the rows where the condition is True. For example, df[df["sales"] > 100] (where "sales" is a hypothetical column name) keeps only the rows whose sales value exceeds 100.
(6) Which visualization libraries can be used to create visualizations of the sliced and diced data?
Several visualization libraries can be used to create visualizations of the sliced and diced data in data
analysis. Some popular options include:
Matplotlib: Matplotlib is a versatile and widely-used plotting library in Python. It provides a wide
range of customization options for creating various types of plots and charts.
Seaborn: Seaborn is built on top of Matplotlib and offers a high-level interface for creating attractive
statistical graphics. It is particularly useful for creating complex visualizations with minimal code.
Pandas: Pandas itself has built-in visualization capabilities using the plot() method, which allows you
to create basic plots directly from a DataFrame.
(7) What is the importance of documenting the slicing and dicing process during data
analysis?
Large blocks of data are cut into smaller segments, and the process is repeated until the correct level of detail is achieved for proper analysis. Slicing and dicing therefore presents the data from new and diverse perspectives and provides a closer view of it for analysis. For example, if a report shows the annual performance of a particular product and we want to view the quarterly performance, we can use a slicing and dicing strategy to drill down to the quarterly level. Documenting each of these steps makes the analysis reproducible and the derived views traceable back to the original data.
(8) What is the advantage of iterating and refining the analysis during the slicing and dicing
process?
Iterating and refining lets the insights from one pass guide the next: filters, groupings and aggregations can be adjusted, errors are caught early, and the analysis is progressively focused on the subsets and questions that matter most.
(9) Can slicing and dicing be applied only to numerical data or can it also be applied to
categorical data?
Slicing and dicing can be applied to both numerical and categorical data. The specific methods and
techniques used may vary depending on the data type and the goals of the analysis:
Numerical Data: Slicing and dicing numerical data typically involve operations like filtering, grouping,
and aggregating based on numerical criteria. For example, you can slice time series data by date or filter
sales data by revenue thresholds.
Categorical Data: When dealing with categorical data, slicing and dicing often involve grouping and
aggregating based on category values. For instance, you can group customer data by demographics (e.g.,
age group, gender) and analyze their behaviors within each category.
(10) How can the insights gained from slicing and dicing be used to make data-driven
decisions? Data analytics refers to the process of collecting, analyzing, and interpreting large
volumes of data to gain insights that can be used to inform business decisions. Data analytics can
help businesses make better decisions by providing a more accurate picture of their operations,
customers, and market trends.
Suggested Reference:
1. "Python for Data Analysis" by Wes McKinney
2. "Python Data Science Handbook" by Jake VanderPlas
3. "Pandas User Guide" on the Pandas documentation website
4. "Data Wrangling with Pandas" course on DataCamp
5. "Data Manipulation with Pandas" course on Coursera

References used by the students:
1. "Pandas User Guide" on the Pandas documentation website
2. "Data Wrangling with Pandas" course on DataCamp
Rubrics (out of 10):
1. Knowledge of subject: Good (2) / Average (1)
2. Programming Skill: Good (2) / Average (1)
3. Team work: Good (2) / Satisfactory (1)
4. Communication Skill: Good (2) / Satisfactory (1)
5. Ethics: Good (2) / Average (1)
Total Marks: __________
Experiment No: 5
Develop a program that shows usage of aggregate functions over the input dataset. a) describe b) max c) min d) mean e) median f) count g) std h) corr
Date:
• Knowledge of the input dataset format (e.g. CSV, Excel, JSON) and how to load it into a
data structure in Python using libraries like Pandas.
• Understanding of the different aggregate functions available in Pandas, such as describe,
max, min, mean, median, count, std, and corr.
• Familiarity with the syntax of Pandas functions for applying aggregate functions, such as
groupby, apply, and agg.
• Ability to interpret and analyze the results of the aggregate functions to gain insights about
the dataset.
Practical skills:
Objectives: (a) To understand the concept of aggregate functions and their usage in data analysis.
Theory:
In data analysis, aggregate functions are used to calculate summary statistics over a dataset. These
functions are applied to columns or rows of a dataset to calculate values like the maximum,
minimum, mean, median, count, standard deviation, and correlation.
a) describe: This function generates descriptive statistics that summarize the central tendency,
dispersion, and shape of a dataset's distribution.
b) max: This function is used to find the maximum value of a column or row.
c) min: This function is used to find the minimum value of a column or row.
d) mean: This function is used to find the average value of a column or row.
e) median: This function is used to find the median value of a column or row.
f) count: This function is used to count the number of non-null values in a column or row.
g) std: This function is used to calculate the standard deviation of a column or row.
h) Corr: This function is used to calculate the correlation between columns or rows of a dataset.
In Python, these aggregate functions can be applied using the Pandas library. The groupby() function
is used to group data based on a specified column, and the aggregate functions can then be applied
to the grouped data.
Procedure:
1. Import necessary libraries: You will need to import Pandas library to load the dataset and
perform various operations on it.
2. Load the dataset: Load the dataset in a Pandas dataframe using the read_csv() function. Make
sure the dataset is in a CSV format and is saved in your working directory.
3. Check the dataset: Print the first few rows of the dataset using the head() function to check
if the dataset is loaded correctly.
4. Describe the dataset: Use the describe() function to get the summary statistics of the dataset,
such as count, mean, standard deviation, minimum, and maximum values.
5. Apply aggregate functions: Apply the aggregate functions such as max(), min(), mean(),
median(), count(), std(), and corr() on the dataset.
6. Display the results: Display the results of the aggregate functions to the user.
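A minimal sketch of this procedure is given below; the file name data.csv is a placeholder, and a numeric-column selection is added so that the aggregates are well defined on datasets with mixed column types.

import pandas as pd

# Load the dataset (the file name is a placeholder for any CSV dataset)
df = pd.read_csv("data.csv")
print(df.head())

# a) describe: summary statistics for the numeric columns
print(df.describe())

# Select only the numeric columns so the remaining aggregates are well defined
numeric = df.select_dtypes(include="number")

print(numeric.max())      # b) maximum of each column
print(numeric.min())      # c) minimum of each column
print(numeric.mean())     # d) mean of each column
print(numeric.median())   # e) median of each column
print(df.count())         # f) number of non-null values per column
print(numeric.std())      # g) standard deviation of each column
print(numeric.corr())     # h) pairwise correlation between numeric columns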
Observations:
a) Describe:
b) Max:
c) Min:
d) Mean():
e) Median():
f) Count():
g) Std():
h) Corr():
Conclusion: This program demonstrates the power of Pandas for performing aggregate
calculations on datasets. By understanding and utilizing these functions, you can gain valuable
insights into the central tendency, dispersion, and relationships between variables within your
data.
Quiz:
Aggregate functions are used to compute summary statistics on large datasets, such as the average, minimum, maximum, and sum of a set of values.
(3) Which of the following aggregate functions calculates the correlation between two
numerical columns?
Corr() function is used to calculate the correlation between columns or rows of a dataset.
(4) Which of the following aggregate functions returns the number of non-missing values in
a column?
Count() function is used to count the number of non-null values in a column or row.
(5) What does the describe() function do in Pandas?
The describe() function in Pandas is used to generate descriptive statistics of a DataFrame. It provides summary statistics for each numeric column, including measures like count, mean, standard deviation, minimum, 25th percentile (Q1), median (50th percentile), 75th percentile (Q3), and maximum. It is a quick way to get an overview of the central tendency and spread of numerical data in the DataFrame: calling df.describe() is helpful for initial data exploration and for understanding the distribution of data.
Suggested Reference:
1. https://pandas.pydata.org/docs/
2. https://numpy.org/doc/stable/
Experiment No: 6
Develop a program that applies split and merge operations on the datasets.
Date:
Objectives: (a) To split large datasets into smaller ones for ease of handling and processing.
(b) To consolidate information and make it easier to analyze.
Equipment/Instruments: Personal Computer, Internet, Python
Theory:
Python provides several built-in functions and libraries for performing split and merge operations
on datasets. Here are some examples:
Splitting a Dataset:
Using the split() method: split() is a built-in string method in Python (also available column-wise in Pandas as Series.str.split()) that splits a string into a list of substrings based on a specified delimiter. This can be useful for breaking the fields of a dataset into smaller parts.
Using the numpy.array_split() function: The numpy.array_split() function can be used to split a
numpy array into smaller arrays of equal or nearly equal size.
Merging Datasets:
Using the pandas.concat() function: The pandas.concat() function can be used to concatenate pandas
dataframes along a specified axis.
Using the numpy concatenate() function: The concatenate() function can be used to merge two or
more arrays into a single array.
Procedure:
1. Define the input datasets: Determine the input datasets and their format. It could be CSV
files, Excel files, or other file types. Also, define the delimiter or separator character for
splitting the data.
2. Load the datasets: Load the datasets into the program using the appropriate libraries and
functions. Check that the data is loaded correctly and perform any necessary data cleaning
or formatting.
3. Split the datasets: Use the appropriate function or library to split the datasets into smaller
chunks. Specify the size or number of chunks to create and ensure that the resulting datasets
are consistent and valid.
4. Merge the datasets: Use the appropriate function or library to merge the datasets into a single
dataset. Specify the method of merging and ensure that the resulting dataset is consistent and
valid.
5. Handle missing or duplicate data: Check for any missing or duplicate data in the merged
dataset and handle them appropriately. You can choose to remove the records with missing
data or impute the missing values.
6. Perform calculations or analysis: Once the datasets are merged, you can perform any
necessary calculations or analysis on the resulting dataset. This could include aggregating
data, calculating averages, or performing statistical analysis.
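A minimal sketch of this procedure is given below; the two small DataFrames (customers and orders, joined on an assumed common column id) are illustrative stand-ins for datasets that would normally be loaded with pd.read_csv().

import numpy as np
import pandas as pd

# Illustrative input datasets (real programs would load them with pd.read_csv())
customers = pd.DataFrame({"id": [1, 2, 3, 4], "name": ["Asha", "Ravi", "Meera", "John"]})
orders    = pd.DataFrame({"id": [1, 2, 2, 5], "amount": [250, 100, 300, 80]})

# Split: divide the row positions into two (nearly) equal chunks with numpy.array_split
parts = np.array_split(np.arange(len(customers)), 2)
chunks = [customers.iloc[idx] for idx in parts]
for i, chunk in enumerate(chunks):
    print(f"chunk {i}:\n{chunk}\n")

# Merge by concatenation: stack the chunks back together along the row axis
combined = pd.concat(chunks, ignore_index=True)
print(combined)

# Merge by key: join the two datasets on the common 'id' column
merged = pd.merge(customers, orders, on="id", how="inner")

# Handle missing or duplicate data in the merged result
merged = merged.drop_duplicates().dropna()
print(merged)

# Simple analysis on the merged dataset
print(merged.groupby("name")["amount"].sum())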
Observations:
(a)merge
(b) split
Conclusion: The program effectively implements split and merge operations on datasets, enabling
efficient data manipulation and management. By allowing for the division of large datasets into
smaller, manageable parts and subsequently merging them, it enhances data processing flexibility
and optimizes performance for various analytical tasks.
Quiz:
(a)What are the key steps involved in developing a program that applies split and merge
operations on datasets?
Step 1: split the data into groups by creating a groupby object from the original DataFrame.
Step 2: apply a function, in this case an aggregation function that computes a summary statistic (you can also transform or filter your data in this step).
Step 3: combine the results into a new DataFrame.
(b) What library or function can be used to split the input datasets into smaller chunks?
The split() method is a built-in function in Python that can be used to split a string into a
list of substrings based on a specified delimiter. This can be useful for splitting a dataset
into smaller chunks.
(c) What should you do if the merged dataset contains missing or duplicate data?
If the merged dataset contains missing or duplicate data, you should perform data cleaning and
preprocessing. For missing data, consider strategies like imputation (replacing missing values)
or removing rows with missing values, depending on the context. For duplicate data, use
methods like drop_duplicates() to remove duplicate rows or resolve duplicates based on
specific criteria.
Consider performance optimization when working with large datasets. Finally, make the program available to users and provide documentation or guidance on how to use it effectively.
Suggested Reference:
1. https://docs.python.org/3/library/
2. "Python Data Science Handbook" by Jake VanderPlas.
3. "Python for Data Analysis" by Wes McKinney.
4. Pandas documentation.
5. NumPy documentation.
Rubrics (out of 10):
1. Knowledge of subject: Good (2) / Average (1)
2. Programming Skill: Good (2) / Average (1)
3. Team work: Good (2) / Satisfactory (1)
4. Communication Skill: Good (2) / Satisfactory (1)
5. Ethics: Good (2) / Average (1)
Total Marks: __________
Experiment No: 7
Develop a program that shows the various data cleaning tasks over the dataset.
a) Identifying the null values. b) Identifying the empty values c) Identifying the
incorrect timestamp
Date:
Objectives: (a) To identify and handle missing or incomplete data in the dataset.
(b) To identify and handle invalid or incorrect data in the dataset.
(c) To remove duplicate data in the dataset.
(d) To standardize data formats and values to ensure consistency across the dataset.
(e) To handle outliers and extreme values that may skew data analysis results.
(f) To ensure data accuracy and completeness for reliable data analysis.
(g) To improve data quality by reducing errors and inconsistencies in the dataset.
(h) To prepare the dataset for further analysis and modeling.
Equipment/Instruments: Personal Computer, Internet, Python
Theory:
Data cleaning is an essential step in the data preparation process that involves identifying and
handling missing, incorrect, or inconsistent data in the dataset. In Python, data cleaning is typically
performed using libraries such as NumPy and Pandas, which provide functions for data
manipulation and analysis.
The theory behind data cleaning in Python involves several key steps:
Importing data: The first step in data cleaning is to import the data into Python using the appropriate
library and data format. Common data formats include CSV, Excel, and JSON.
Identifying missing data: Once the data is imported, the next step is to identify missing data in the
dataset. This can be done using the isnull() function in Pandas, which returns a Boolean value
indicating whether a value is missing or not.
Handling missing data: Once missing data is identified, the next step is to handle it appropriately.
This can be done by either removing the rows or columns with missing values or imputing the
missing values with a suitable value such as the mean or median of the column.
Identifying incorrect data: After handling missing data, the next step is to identify incorrect data in
the dataset, such as values that are outside the expected range or format. This can be done using
statistical techniques such as data visualization and analysis.
Handling incorrect data: Once incorrect data is identified, the next step is to handle it appropriately.
This can be done by removing the outliers or replacing the incorrect values with a suitable value
such as the median or mode of the column.
Standardizing data formats and values: To ensure consistency across the dataset, it is often necessary
to standardize data formats and values. This can be done by converting data types, renaming
columns, or applying formatting rules.
Removing duplicates: Duplicate data can skew analysis results and should be removed from the
dataset. This can be done using the drop_duplicates() function in Pandas.
Quality control: The final step in data cleaning is to perform quality control checks to ensure that
the data is accurate, complete, and consistent. This involves comparing the cleaned dataset to the
original dataset and verifying that the data has been cleaned appropriately.
Safety and necessary precautions:
1. Backup data.
2. Use secure and updated software.
3. Access control.
4. Data privacy.
5. Data encryption
6. Error handling.
7. Test and validate.
Procedure:
1. Import the required libraries: Import the necessary libraries such as pandas, numpy, and
matplotlib to read, manipulate and visualize the dataset.
2. Load the dataset: Load the dataset into the program using a pandas dataframe.
3. Identify null values: Use the isnull() function to identify null values in the dataset. If any
null values are found, decide on a strategy to handle them. This could involve replacing null
values with a mean or median value, dropping the null values or imputing them with a
different value.
4. Identify empty values: empty values are entries that contain an empty string or only whitespace rather than a proper null, so isnull() does not report them. They can be found by comparing the column with an empty string after stripping whitespace. If any empty values are found, decide on a strategy to handle them. This could involve replacing empty values with a mean or median value, dropping them, or imputing them with a different value.
5. Identify incorrect timestamp: Use the to_datetime() function to convert the timestamp
column to a datetime object. This will identify any incorrect timestamp values. If any
incorrect timestamp values are found, decide on a strategy to handle them. This could
involve dropping the rows with incorrect timestamp values or imputing them with a different
value.
6. Remove duplicates: Use the drop_duplicates() function to remove any duplicate rows in the
dataset.
7. Data normalization: Use the normalization technique to transform the data into a standard
format to make it more consistent and easier to analyze.
8. Data standardization: Use the standardization technique to transform the data into a standard
scale to make it more consistent and easier to analyze.
9. Save the cleaned dataset: Save the cleaned dataset to a new file for future use.
10. Visualize the cleaned dataset: Use matplotlib or other visualization libraries to create
visualizations of the cleaned dataset to better understand the data and identify any further
cleaning that may be required.
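A minimal sketch of this procedure is given below; it builds a small hypothetical DataFrame containing the three kinds of problems the experiment looks for (nulls, empty values and incorrect timestamps), so the column names and values are assumptions rather than a real dataset.

import pandas as pd
import numpy as np

# Illustrative dataset (a real program would load it with pd.read_csv())
df = pd.DataFrame({
    "name":      ["Asha", "Ravi", None, "Meera", "Ravi"],
    "city":      ["Surat", "", "Rajkot", "  ", "Surat"],          # empty / blank values
    "timestamp": ["2023-01-05", "2023-02-30", "2023-03-10", "not a date", "2023-02-01"],
    "sales":     [120, np.nan, 180, 60, np.nan],
})

# a) Identify null values
print(df.isnull().sum())                  # null count per column

# b) Identify empty values (empty or whitespace-only strings are not caught by isnull())
empty_mask = df["city"].str.strip() == ""
print(df[empty_mask])

# c) Identify incorrect timestamps: invalid dates become NaT when errors="coerce"
parsed = pd.to_datetime(df["timestamp"], errors="coerce")
print(df[parsed.isna()])                  # rows whose timestamp could not be parsed

# Handle the problems: impute, replace, drop duplicates, and save the cleaned data
df["sales"] = df["sales"].fillna(df["sales"].mean())
df["city"] = df["city"].replace(r"^\s*$", np.nan, regex=True)
df["timestamp"] = parsed
df = df.drop_duplicates()
df.to_csv("cleaned_dataset.csv", index=False)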
Observations
Conclusion: This program effectively demonstrates key data cleaning tasks, including identifying
null and empty values, as well as detecting incorrect timestamps. By addressing these issues, the
program enhances data quality and reliability, facilitating more accurate analysis and decision-
making for users working with the dataset.
Quiz :
1. What is the first step in developing a program for data cleaning in Python?
The first step in developing a program for data cleaning in Python is to understand the
data. This involves gaining a thorough understanding of the dataset you're working
with, including its structure, the meaning of its columns, the nature of its data, and any
specific data quality issues it may have. Without a clear understanding of the data, it's
challenging to identify and address issues effectively.
Suggested Reference:
1. "Data Cleaning with Python" course on DataCamp.
2. "Data Cleaning in Python: A Complete Guide" on Towards Data Science.
3. "Data Cleaning with Python and Pandas: Detecting Missing Values" on Real Python.
4. "Cleaning Data with Python" on Kaggle.
5. "Data Cleaning Techniques in Python" on Analytics Vidhya.
Rubrics (out of 10):
1. Knowledge of subject: Good (2) / Average (1)
2. Programming Skill: Good (2) / Average (1)
3. Team work: Good (2) / Satisfactory (1)
4. Communication Skill: Good (2) / Satisfactory (1)
5. Ethics: Good (2) / Average (1)
Total Marks: __________
Experiment No: 8
Develop a program that shows usage of following NumPy array operations: a)
any() b) all() c) isnan() d) isinf() e) isfinite() f) isinf() g) zeros() h) isreal() i)
iscomplex() j) isscalar() k) less() l) greater() m) less_equal() n) greater_equal()
Date:
Objectives: (a) To perform complex mathematical and logical operations on large arrays and
matrices efficiently.
Theory:
NumPy is a popular Python library for scientific computing that provides efficient and powerful
array operations. It enables users to work with multidimensional arrays and perform a variety of
mathematical and logical operations on them.
Here are the explanations of some of the NumPy array operations mentioned in the question:
a) any(): It returns True if any of the elements of an array evaluate to True, and False otherwise.
b) all(): It returns True if all the elements of an array evaluate to True, and False otherwise.
c) isnan(): It returns an array of the same shape as the input array, with True where the corresponding
element of the input array is NaN (Not a Number), and False elsewhere.
d) isinf(): It returns an array of the same shape as the input array, with True where the corresponding
element of the input array is +/-inf (positive or negative infinity), and False elsewhere.
e) isfinite(): It returns an array of the same shape as the input array, with True where the
corresponding element of the input array is finite (i.e., not NaN, +/-inf), and False elsewhere.
f) isinf(): It returns an array of the same shape as the input array, with True where the corresponding
element of the input array is +/-inf (positive or negative infinity), and False elsewhere.
g) zeros(): It returns a new array of the specified shape and data type, filled with zeros.
h) isreal(): It returns an array of the same shape as the input array, with True where the corresponding
element of the input array is real, and False where it is complex.
i) iscomplex(): It returns an array of the same shape as the input array, with True where the
corresponding element of the input array is complex, and False where it is real.
j) isscalar(): It returns True if the input is a scalar (i.e., a single value, not an array), and False
otherwise.
k) less(): It returns an array of the same shape as the input arrays, with True where the corresponding
element of the first input array is less than the corresponding element of the second input array,
and False otherwise.
l) greater(): It returns an array of the same shape as the input arrays, with True where the
corresponding element of the first input array is greater than the corresponding element of the
second input array, and False otherwise.
m) less_equal(): It returns an array of the same shape as the input arrays, with True where the
corresponding element of the first input array is less than or equal to the corresponding element
of the second input array, and False otherwise.
n) greater_equal(): It returns an array of the same shape as the input arrays, with True where the
corresponding element of the first input array is greater than or equal to the corresponding
element of the second input array, and False otherwise.
Procedure:
1. Import the NumPy library: To use NumPy array operations, you need to import the NumPy
library into your Python environment. You can do this using the import statement.
2. Create a NumPy array: You need to create a NumPy array to perform the various operations.
You can create an array using the np.array() function.
3. Use the array operations: Once you have created the array, you can use various NumPy array
operations such as any(), all(), isnan(), isinf(), isfinite(), zeros(), isreal(), iscomplex(),
isscalar(), less(), greater(), less_equal(), and greater_equal().
4. Print the output: After performing the operations, you should print the output to see the
results.
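A minimal sketch showing each of the listed operations on small sample arrays is given below; the array values are arbitrary examples.

import numpy as np

a = np.array([1.0, 0.0, np.nan, np.inf, -np.inf, 5.0])
b = np.array([2.0, 1.0, 3.0, 4.0, 5.0, 5.0])

print(np.any(a > 4))          # a) True if any element satisfies the condition
print(np.all(b > 0))          # b) True only if every element satisfies the condition
print(np.isnan(a))            # c) True where the element is NaN
print(np.isinf(a))            # d)/f) True where the element is +inf or -inf
print(np.isfinite(a))         # e) True where the element is neither NaN nor infinite
print(np.zeros((2, 3)))       # g) 2x3 array filled with zeros

c = np.array([2 + 3j, 4 + 0j])
print(np.isreal(c))           # h) True where the element has zero imaginary part
print(np.iscomplex(c))        # i) True where the element has a nonzero imaginary part
print(np.isscalar(3.5))       # j) True for a single scalar value, False for arrays

x = np.array([1, 4, 3])
y = np.array([2, 4, 1])
print(np.less(x, y))          # k) element-wise x <  y
print(np.greater(x, y))       # l) element-wise x >  y
print(np.less_equal(x, y))    # m) element-wise x <= y
print(np.greater_equal(x, y)) # n) element-wise x >= y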
Observations:
(a)any()
(b)all()
(c)isnan()
(d)isinf()
(e)isfinite()
(g)zeros()
(h)isreal()
(i)iscomplex()
(j)isscalar()
(k)less()
(l)greater()
(m)less_equal()
(n)greater_equal()
Conclusion: The developed program effectively demonstrates various NumPy array operations,
showcasing their functionality in evaluating conditions and properties of array elements. Operations
such as any(), all(), and isnan() provide insights into data validity, while functions like zeros(),
isreal(), and comparisons (less(), greater()) enhance array manipulation capabilities.
Quiz:
1. What does the NumPy function 'any()' return?
The NumPy function 'any()' returns a Boolean value (True or False) indicating whether at
least one element in the input array evaluates to True when treated as a boolean. It checks
if any element in the array satisfies the condition provided.
2. What does the NumPy function 'zeros()' do?
The NumPy function 'zeros()' creates a new array filled with zeros. You can specify the shape of the array as a tuple or a single integer to create a multi-dimensional array filled with zeros. For example, np.zeros((2, 3)) would create a 2x3 array filled with zeros.
Suggested Reference:
1. NumPy User Guide: https://numpy.org/doc/stable/user/index.html
2. NumPy Tutorial: https://www.tutorialspoint.com/numpy/index.htm
3. NumPy Cheat Sheet: https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf
4. NumPy Array Operations: https://www.geeksforgeeks.org/numpy-array-manipulation-python/
5. NumPy Array Operations and Functions:
Rubrics (out of 10):
1. Knowledge of subject: Good (2) / Average (1)
2. Programming Skill: Good (2) / Average (1)
3. Team work: Good (2) / Satisfactory (1)
4. Communication Skill: Good (2) / Satisfactory (1)
5. Ethics: Good (2) / Average (1)
Total Marks: __________
Experiment No: 9
Develop a program that shows usage of the following NumPy library vector functions: a) arange() b) reshape() c) linspace() d) randint() e) dot()
Date:
Competency and Practical Skills:
Objectives: (a) To provide efficient and powerful tools for working with large arrays and matrices
in Python, along with a wide range of mathematical and scientific functions for manipulating and
analyzing these arrays.
Theory:
Here is a brief theory for each of the NumPy vector functions:
a) arange(): This function is used to create a one-dimensional array with evenly spaced values
within a specified range. The function takes in three arguments: start (optional), stop, and step
(optional). The start argument is the starting value of the sequence (inclusive), the stop argument is
the ending value of the sequence (exclusive), and the step argument is the step size between values.
For example, np.arange(0, 10, 2) creates an array with values [0, 2, 4, 6, 8].
b) reshape(): This function is used to reshape an array into a new shape without changing its
data. The function takes in one argument: the new shape of the array, specified as a tuple of integers.
For example, np.reshape(my_array, (3, 4)) reshapes the array my_array into a 3x4 matrix.
c) linspace(): This function is used to create a one-dimensional array with evenly spaced values
between a specified range. The function takes in three arguments: start, stop, and num (optional).
The start argument is the starting value of the sequence, the stop argument is the ending value of
the sequence, and the num argument is the number of values to generate. For example,
np.linspace(0, 1, 5) creates an array with values [0., 0.25, 0.5, 0.75, 1.].
d) randint(): This function is used to generate an array of random integers within a specified
range. The function takes in three arguments: low (optional), high, and size (optional). The low
argument is the lower bound of the range (inclusive), the high argument is the upper bound of the
range (exclusive), and the size argument is the shape of the output array. For example,
np.random.randint(0, 10, size=(2, 3)) generates a 2x3 array of random integers between 0 and 10.
e) dot(): This function is used to perform matrix multiplication between two arrays. The
function takes in two arguments: the two arrays to be multiplied. The arrays must have compatible
shapes for matrix multiplication. For example, if A is a 2x3 array and B is a 3x2 array, np.dot(A, B)
performs matrix multiplication between A and B and returns a 2x2 array.
Overall, these NumPy vector functions are commonly used for manipulating and analyzing arrays
in scientific computing and data analysis. By using these functions in a program, you can efficiently
perform operations on large arrays and matrices in Python.
Procedure:
1. Import the NumPy library: Begin your program by importing the NumPy library using the
import statement.
2. Create an array: Create an array using one of the NumPy functions such as arange() or
linspace(). You can also create an array from an existing data source such as a CSV file.
3. Reshape the array: Use the reshape() function to reshape the array to the desired shape. For
example, you can reshape a one-dimensional array into a two-dimensional array.
4. Generate random numbers: Use the randint() function to generate an array of random
integers within a specified range.
5. Perform matrix multiplication: Use the dot() function to perform matrix multiplication
between two arrays.
6. Print the results: Print the resulting arrays to the console using the print() function.
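A minimal sketch of this procedure is given below; the shapes and ranges used are arbitrary examples.

import numpy as np

# a) arange(): evenly spaced values within a range
a = np.arange(0, 12, 2)
print(a)                       # [ 0  2  4  6  8 10]

# b) reshape(): change the shape without changing the data
m = a.reshape(2, 3)
print(m)

# c) linspace(): a fixed number of evenly spaced values between two endpoints
print(np.linspace(0, 1, 5))    # [0.   0.25 0.5  0.75 1.  ]

# d) randint(): random integers in the half-open interval [low, high)
r = np.random.randint(0, 10, size=(3, 2))
print(r)

# e) dot(): matrix multiplication (a 2x3 matrix times a 3x2 matrix gives a 2x2 result)
b = np.arange(6).reshape(3, 2)
print(np.dot(m, b))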
Observations:
(a) arange()
(b)reshape()
(c)linspace()
(d)randint()
(e)dot()
Quiz:
1. What is the purpose of the NumPy library?
The purpose of the NumPy library (short for Numerical Python) is to provide support for
large, multi-dimensional arrays and matrices, along with a collection of mathematical
functions to operate on these arrays. NumPy is a fundamental library for scientific and
numerical computing in Python. It is essential for tasks such as data manipulation, linear
algebra, statistical analysis, and more, especially when working with numerical data.
4. How can you perform matrix multiplication between two arrays in NumPy?
Matrix multiplication between two arrays in NumPy can be performed using the
`numpy.dot()` function or the `@` operator (in Python 3.5 and later) for matrix multiplication.
Example with numpy.dot():
import numpy as np
result = np.dot(matrix1, matrix2)  # performs matrix multiplication between matrix1 and matrix2
Example with the @ operator:
result = matrix1 @ matrix2  # performs matrix multiplication between matrix1 and matrix2
It is important to ensure that the dimensions of the matrices are compatible for matrix multiplication (e.g., the number of columns in the first matrix must be equal to the number of rows in the second matrix).
Suggested Reference:
1. https://numpy.org/doc/stable/
2. https://numpy.org/doc/stable/user/index.html
3. https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf
4. https://numpy.org/devdocs/user/quickstart.html
5. https://www.datacamp.com/community/tutorials/python-numpy-tutorial
Rubrics (out of 10):
1. Knowledge of subject: Good (2) / Average (1)
2. Programming Skill: Good (2) / Average (1)
3. Team work: Good (2) / Satisfactory (1)
4. Communication Skill: Good (2) / Satisfactory (1)
5. Ethics: Good (2) / Average (1)
Total Marks: __________
Experiment No: 10
Write a program to display below plot using matplotlib library. For Values of
X:[1,2,3,...,49], Values of Y (thrice of X):[3,6,9,12,...,144,147]
Date:
• Ability to create different types of plots such as line plots, scatter plots, bar plots, etc.
• Ability to customize the appearance of plots including labels, colors, legends, and titles
• Ability to add text, annotations, and shapes to the plots
• Ability to work with multiple plots and subplots
• Ability to export plots in different file formats like png, pdf, svg, etc.
• Ability to integrate matplotlib with other Python libraries like NumPy and Pandas.
Objectives: (a) To create informative and visually appealing data visualizations that enable users to explore, understand, and communicate complex data.
Equipment/Instruments: Personal Computer, Internet, Python
Theory:
Matplotlib is a Python library that provides a variety of tools for creating high-quality data
visualizations. It is one of the most popular data visualization libraries due to its ease of use and
versatility. The library is built on NumPy and provides a range of options for creating different types
of plots and graphs, including line plots, scatter plots, bar charts, histograms, and many more.
pyplot module: This is the main module of Matplotlib, which provides a simple interface for creating
plots and charts. It is a collection of functions that allow users to create plots with minimal coding.
Figure and Axes objects: The Figure object is the top-level container for all the plot elements. It
represents the entire plot and contains one or more Axes objects. The Axes object is the individual
plot area where data is plotted.
Plotting functions: Matplotlib provides a range of plotting functions that can be used to create
different types of plots and charts. These functions include plot(), scatter(), bar(), hist(), and many
more.
Customization options: Matplotlib allows users to customize the appearance of plots in various
ways, including changing the plot color, adding labels, titles, and legends, adjusting the axis limits,
and more.
To use Matplotlib, you first need to import the library and its pyplot module. Then, you can create
a figure object and one or more axes objects using the subplots() function. After that, you can use
the various plotting functions to create different types of plots and customize them as needed.
Overall, Matplotlib provides a powerful and flexible tool for creating data visualizations in Python.
With its wide range of options and customization features, it can be used for a variety of data
analysis and communication tasks.
Procedure:
1. Import the required libraries - Matplotlib and NumPy.
2. Create two NumPy arrays for X and Y values using np.arange() and multiplication.
3. Create a figure and an axis object using plt.subplots().
4. Use the ax.plot() function to plot the X and Y values as a line plot.
5. Customize the plot with axis labels and a title.
6. Display the plot using plt.show() function.
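A minimal sketch following this procedure might look as follows; the axis label and title text are illustrative choices.

import numpy as np
import matplotlib.pyplot as plt

# Values of X: 1, 2, ..., 49 and values of Y (thrice of X): 3, 6, ..., 147
x = np.arange(1, 50)
y = 3 * x

# Create a figure and an axis object, then plot X against Y as a line
fig, ax = plt.subplots()
ax.plot(x, y)

# Customize the plot with axis labels and a title
ax.set_xlabel("X values")
ax.set_ylabel("Y values (3 * X)")
ax.set_title("Line plot of Y = 3X")

# Display the plot
plt.show()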
Conclusion: The provided code successfully utilizes the Matplotlib library to create a plot of values
where the Y-axis represents three times the corresponding values of X. This visual representation
clearly demonstrates the linear relationship between X and Y, enhancing data interpretation and
analysis in Python programming.
(1)What is Matplotlib?
Matplotlib is a popular Python library for creating static, animated, and interactive
visualizations in data analysis and scientific computing. It provides a flexible and
comprehensive framework for creating various types of plots and charts, allowing users to
visualize their data in a wide range of formats.
Scatter Plots: Scatter plots are used to display individual data points as dots on a
two-dimensional plane. They are often employed to show the relationship between two
variables or to identify patterns in data.
You can use color names like 'red', 'blue', 'green', or specify colors using hexadecimal
values like '#FF5733' for custom colors.
You can customize the legend's appearance and location by providing additional arguments
to the `legend` function. For example, you can use the `loc` parameter to specify the
legend's position, and other parameters for formatting the legend text.
Experiment No: 11
Write a program to display the bar plot below using the matplotlib library, for the values
Languages = ['Java', 'Python', 'PHP', 'JavaScript', 'C#', 'C++'] and
Popularity = [22.2, 17.6, 8.8, 8, 7.7, 6.7]
Date:
• Ability to create different types of plots such as line plots, scatter plots, bar plots, etc.
• Ability to customize the appearance of plots including labels, colors, legends, and titles
• Ability to add text, annotations, and shapes to the plots
• Ability to work with multiple plots and subplots
• Ability to export plots in different file formats like png, pdf, svg, etc.
• Ability to integrate matplotlib with other Python libraries like NumPy and Pandas.
Objectives: (a) To learn how to interpret and analyze data visualizations, and to use them to draw
insights and make informed decisions.
Theory:
A bar plot is a type of chart that displays data as rectangular bars. The length or height of each bar
is proportional to the value of the data it represents. Bar plots are useful for comparing the values
of different categories or groups.
Matplotlib is a popular data visualization library in Python that provides a wide range of functions
for creating different types of plots, including bar plots.
Use the bar() function to create the bar plot by passing the languages and popularity lists as
arguments. The bar() function automatically generates the rectangular bars for each category and
sets their lengths proportional to the values in the popularity list.
Procedure:
1. Define the data for the plot as lists or arrays.
2. Use the bar() function to create the plot, passing the data as arguments.
3. Customize the plot by changing the colors, labels, and other attributes.
4. Add a title and labels to the plot to provide context and improve its readability.
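A minimal sketch of this procedure for the given data might look as follows; the bar color and the exact label text are illustrative choices.

import matplotlib.pyplot as plt

# Data for the plot
languages = ['Java', 'Python', 'PHP', 'JavaScript', 'C#', 'C++']
popularity = [22.2, 17.6, 8.8, 8, 7.7, 6.7]

# Create the bar plot; bar lengths are proportional to the popularity values
plt.bar(languages, popularity, color='steelblue')

# Add a title and labels to provide context and improve readability
plt.title("Popularity of Programming Languages")
plt.xlabel("Programming Language")
plt.ylabel("Popularity (%)")

plt.show()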
Conclusion: In conclusion, this program effectively utilizes the Matplotlib library to create a bar
plot visualizing the popularity of various programming languages. By plotting the data for Java,
Python, PHP, JavaScript, C#, and C++, it provides a clear and informative representation of their
relative popularity in a concise manner.
(3)What are the steps involved in creating a bar plot using Matplotlib?
The steps involved in creating a bar plot using Matplotlib typically include:
Importing the Matplotlib library: Import the necessary modules from Matplotlib, such as
pyplot, to create and customize your plot.
Preparing your data: Ensure you have your data ready in a suitable format, usually as lists
or NumPy arrays.
Creating the bar plot: Use Matplotlib functions to create the bar plot by specifying the data,
labels, and other customization options.
Customizing the plot: You can further customize the plot by adjusting the colors, labels,
titles, axes, and other elements to make it more informative and visually appealing.
Displaying or saving the plot: Finally, display the plot using plt.show() or save it to a file
with plt.savefig().
(5)What are the parameters required by the bar() function to create a bar plot?
The bar() function in Matplotlib, when creating a bar plot, requires the following
parameters:
x: A list of category or group labels to be displayed on the x-axis.
height: A list of values representing the height or length of the bars for each category.
Additional parameters such as color, width, and label can be used to customize the
appearance of the bars. These parameters are optional but allow you to control the color of
the bars, the width of the bars, and add labels for the bars, respectively.
Suggested Reference:
1. https://matplotlib.org/stable/index.html
2. https://realpython.com/python-matplotlib-guide/
3. Matplotlib Tutorial by Corey Schafer:
https://www.youtube.com/playlist?list=PLosiE80TeTvipOqomVEeZ1HRrcEvtZB_
4. Python Data Science Handbook by Jake VanderPlas:
https://jakevdp.github.io/PythonDataScienceHandbook/
5. Mastering Matplotlib by Duncan M. McGreggor and Paul Ivanov:
https://www.packtpub.com/product/mastering-matplotlib-second-edition/9781800565547
References used by the students: (Sufficient space to be provided)
Rubric wise marks obtained:
Rubrics (1 to 5): Knowledge of subject (2), Programming Skill (2), Team work (2), Communication Skill (2), Ethics (2); Total Marks.
Grading: Good (2) / Average (1) for Knowledge of subject, Programming Skill and Ethics; Good (2) / Satisfactory (1) for Team work and Communication Skill.
Experiment No: 12
Write a program using the matplotlib library to display a pie plot for the data below.
Languages = ['Java', 'Python', 'PHP', 'JavaScript', 'C#', 'C++']
Popularity = [22.2, 17.6, 8.8, 8, 7.7, 6.7]
Colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728",
"#9467bd", "#8c564b"]
Date:
• Ability to create different types of plots such as line plots, scatter plots, bar plots, etc.
• Ability to customize the appearance of plots including labels, colors, legends, and titles
• Ability to add text, annotations, and shapes to the plots
• Ability to work with multiple plots and subplots
• Ability to export plots in different file formats like png, pdf, svg, etc.
• Ability to integrate matplotlib with other Python libraries like NumPy and Pandas.
Objectives: (a) To learn how to interpret and analyze data visualizations, and to use them to draw
insights and make informed decisions.
Theory:
A bar plot is a type of chart that displays data as rectangular bars. The length or height of each bar
is proportional to the value of the data it represents. Bar plots are useful for comparing the values
of different categories or groups.
Matplotlib is a popular data visualization library in Python that provides a wide range of functions
for creating different types of plots, including bar plots.
Use the bar() function to create the bar plot by passing the languages and popularity lists as
arguments. The bar() function automatically generates the rectangular bars for each category and
sets their lengths proportional to the values in the popularity list.
Procedure:
1. Import the necessary libraries (matplotlib.pyplot)
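A minimal sketch of the pie chart for the given data might look as follows; the autopct percentage format, title text, and legend placement are illustrative choices.

import matplotlib.pyplot as plt

languages = ['Java', 'Python', 'PHP', 'JavaScript', 'C#', 'C++']
popularity = [22.2, 17.6, 8.8, 8, 7.7, 6.7]
colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b"]

# Create the pie chart; autopct prints each segment's share as a percentage
plt.pie(popularity, labels=languages, colors=colors, autopct='%1.1f%%')
plt.title("Popularity of Programming Languages")
plt.legend(languages, loc="best")
plt.show()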
Conclusion: In conclusion, the program effectively utilizes the Matplotlib library to create visual
representations of programming language popularity through a bar plot and a pie chart. This allows
for a clear comparison and analysis of the data, enhancing data visualization and understanding of
trends in programming language usage.
(1)What libraries do you need to import to create the pie chart using matplotlib?
To create a pie chart using Matplotlib, you need to import the following libraries:
import matplotlib.pyplot as plt
You will use the pyplot module from Matplotlib to create and customize your pie chart.
In this case, six colors have been defined to be used for the pie chart's segments.
In this example, labels is a list of labels for each segment, and plt.legend() is used to add
the legend to the chart. The loc="best" argument specifies that Matplotlib should place the
legend in the best available position.
The Popularity list holds the values that determine the size, or proportion, of each segment
in the pie chart. For example, if you're creating a pie chart to represent the popularity of
different programming languages, the Popularity list contains the relative popularity score
for each language.
Here's an example of how the Popularity list could be used (the labels and Colors values are illustrative):
import matplotlib.pyplot as plt
labels = ['A', 'B', 'C', 'D']                          # category names (illustrative)
Colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728']  # one color per category
Popularity = [45, 30, 15, 10]                          # popularity scores for four categories
plt.pie(Popularity, labels=labels, colors=Colors, autopct='%1.1f%%')
plt.show()
In this example, the values in the Popularity list determine the size of each segment in the
pie chart.
Suggested Reference:
1. https://matplotlib.org/stable/index.html
2. https://realpython.com/python-matplotlib-guide/
3. Matplotlib Tutorial by Corey Schafer:
https://www.youtube.com/playlist?list=PLosiE80TeTvipOqomVEeZ1HRrcEvtZB_
4. Python Data Science Handbook by Jake VanderPlas:
https://jakevdp.github.io/PythonDataScienceHandbook/
5. Mastering Matplotlib by Duncan M. McGreggor and Paul Ivanov:
https://www.packtpub.com/product/mastering-matplotlib-second-edition/9781800565547
References used by the students: (Sufficient space to be provided)
Rubric wise marks obtained:
Rubrics (1 to 5): Knowledge of subject (2), Programming Skill (2), Team work (2), Communication Skill (2), Ethics (2); Total Marks.
Grading: Good (2) / Average (1) for Knowledge of subject, Programming Skill and Ethics; Good (2) / Satisfactory (1) for Team work and Communication Skill.
Experiment No: 13
Write a program using the matplotlib library to display a scatter plot for 200 random
points for both X and Y.
Date:
• Ability to create different types of plots such as line plots, scatter plots, bar plots, etc.
• Ability to customize the appearance of plots including labels, colors, legends, and titles
• Ability to add text, annotations, and shapes to the plots
• Ability to work with multiple plots and subplots
• Ability to export plots in different file formats like png, pdf, svg, etc.
• Ability to integrate matplotlib with other Python libraries like NumPy and Pandas.
Objectives: (a) To learn how to interpret and analyze data visualizations, and to use them to draw
insights and make informed decisions.
Theory:
In Matplotlib, a scatter plot is a chart type that displays data as a collection of points with the position
determined by the values of two variables. Each point on the scatter plot represents an observation,
and the position of the point on the X-Y axis is determined by the values of the two variables.
A scatter plot is useful for exploring the relationship between two continuous variables. It can be
used to identify patterns or trends in the data and to detect the presence of outliers or unusual
observations. Scatter plots can also be used to assess the correlation between the two variables.
Matplotlib provides the scatter() function for creating scatter plots. The function takes two arrays,
one for the X-axis data and one for the Y-axis data, as its input arguments. Additional parameters
can be used to customize the appearance of the scatter plot, such as the color, size, and transparency
of the points.
2. Use Comments
3. Test your code.
Procedure:
1. Import necessary libraries: We will need the Matplotlib and NumPy libraries for this task.
2. Generate random data for the X and Y axes: We can use the NumPy library to generate
random data for both the X and Y axes
3. Create a scatter plot: We can use the scatter method of the Matplotlib library to create a
scatter plot. We need to pass the X and Y data as arguments and specify the marker style and
color using the marker and c parameters, respectively
4. Add title and labels: We can add a title and labels for the X and Y axes using the title, xlabel,
and ylabel methods of the Matplotlib library.
5. Set axes limits: We can set the limits for the X and Y axes using the xlim and ylim methods
of the Matplotlib library.
6. Display the plot: We can display the plot using the show method of the Matplotlib library.
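A minimal sketch following this procedure might look as follows; the marker style, color, axis limits, and random seed are illustrative choices.

import numpy as np
import matplotlib.pyplot as plt

# Generate 200 random points for both the X and Y axes
rng = np.random.default_rng(0)
x = rng.random(200)
y = rng.random(200)

# Create the scatter plot with a chosen marker style and color
plt.scatter(x, y, marker='o', c='teal')

# Add a title and axis labels
plt.title("Scatter plot of 200 random points")
plt.xlabel("X")
plt.ylabel("Y")

# Set the axis limits
plt.xlim(0, 1)
plt.ylim(0, 1)

plt.show()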
Conclusion: In conclusion, this program utilizes the Matplotlib library to create a scatter plot. By
generating 200 random points for both the X and Y axes, it effectively visualizes data distributions
and relationships, showcasing the versatility of Matplotlib for data representation in Python.
Suggested Reference:
1. https://matplotlib.org/stable/index.html
2. https://realpython.com/python-matplotlib-guide/
3. Matplotlib Tutorial by Corey Schafer:
https://www.youtube.com/playlist?list=PLosiE80TeTvipOqomVEeZ1HRrcEvtZB_
4. Python Data Science Handbook by Jake VanderPlas:
https://jakevdp.github.io/PythonDataScienceHandbook/
5. Mastering Matplotlib by Duncan M. McGreggor and Paul Ivanov:
https://www.packtpub.com/product/mastering-matplotlib-second-edition/9781800565547
Experiment No: 14
Develop a program that reads a dataset and plots the data stored in the file from the URL:
(https://github.com/chris1610/pbpython/blob/master/data/sample-salesv3.xlsx?raw=true)
Date:
Objectives: (a) To analyze and visualize the data in an efficient and effective way.
(b) To identify patterns, trends, and outliers in the data.
Equipment/Instruments: Personal Computer, Internet, Python
Theory:
Reading a .csv file from a URL and plotting the data is a common data analysis and visualization
task in many fields. Here are the main steps involved in this process:
Importing the necessary libraries: To read and plot the .csv file, we typically use the pandas and
matplotlib libraries. We need to import them at the beginning of our program.
Loading the data from the URL: We can use the pandas library's read_csv function to read the data
from the URL. We need to provide the URL of the .csv file as an argument to this function.
Data cleaning and preparation: Once we have loaded the data, we may need to clean and prepare it
for visualization. This may include dropping unnecessary columns, filling missing values, and
transforming the data.
Data visualization: Once the data is cleaned and prepared, we can use matplotlib's various plotting
functions to create visualizations such as line plots, scatter plots, bar plots, and more. We can
customize the plot with various parameters such as colors, labels, titles, and more.
Displaying the plot: After creating the plot, we need to display it using the show function provided
by the matplotlib library.
1. Validate inputs.
2. Handle errors.
3. Secure the program.
4. Optimize performance.
5. Test and review.
Procedure:
1. Import the necessary libraries: You will need the pandas library to read the .csv file, and
matplotlib library to create the plot.
2. Read the .csv file from the URL: Use the pandas library to read the .csv file from the URL
and store it as a DataFrame object.
3. Preprocess the data: Preprocess the data as required. This may involve cleaning the data,
removing duplicates, handling missing values, and converting data types.
4. Visualize the data: Use the matplotlib library to create a visualization of the data. You can
create scatter plots, line graphs, histograms, and other types of visualizations based on the
data.
5. Save or display the visualization: Save the visualization to a file or display it on the screen,
depending on the user requirements.
6. Test and validate the program: Test the program thoroughly to ensure that it works as
expected for various input datasets. Validate the results against the expected output and fix
any issues or errors.
7. Document the program: Document the program by providing clear and concise comments
in the code and a user manual that explains how to use the program.
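A minimal sketch of this procedure is shown below. Note that the file at the given URL is an Excel workbook (.xlsx), so this sketch reads it with pandas' read_excel (which requires the openpyxl package); the column names "name" and "ext price" are assumptions based on the sample sales data.

import pandas as pd
import matplotlib.pyplot as plt

# URL of the sample sales workbook (raw download from GitHub)
url = ("https://github.com/chris1610/pbpython/blob/master/data/"
       "sample-salesv3.xlsx?raw=true")

# Read the data directly from the URL into a DataFrame
df = pd.read_excel(url)
print(df.head())                      # quick look at the data

# Aggregate total sales per customer and plot the top 10 as a bar chart
top = df.groupby("name")["ext price"].sum().nlargest(10)
top.plot(kind="bar", title="Top 10 customers by total sales")
plt.ylabel("Total sales")
plt.tight_layout()
plt.show()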
Conclusion: The program successfully reads data from a specified CSV file hosted on GitHub and
visualizes the dataset through graphical plots. By leveraging libraries for data manipulation and
visualization, it effectively presents insights from the data, enhancing understanding and analysis
of sales trends. This approach streamlines data exploration and presentation.
(3)What is the first step in developing a program that reads a .csv file from a URL
and plots the data?
The first step in developing a program that reads a .csv file from a URL and plots the data
is to import the necessary libraries (pandas for reading the CSV and matplotlib for creating
plots). Additionally, you need to fetch the CSV data from the URL, which often involves
using a package like requests to make an HTTP request to the URL and retrieve the data.
(4)How do you read a .csv file from a URL in Python using the pandas library?
To read a .csv file from a URL in Python using the pandas library, you can use the pd.read_csv()
function with the URL as the argument. Here's an example:
import pandas as pd
url = "https://example.com/data.csv" df =
pd.read_csv(url)
This code will fetch the data from the specified URL and create a DataFrame (df)
containing the CSV data.
(5)How do you create a scatter plot of two columns from a DataFrame using the
matplotlib library?
To create a scatter plot of two columns from a DataFrame using the matplotlib library, you
can use the plt.scatter() function. Here's an example:
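A minimal sketch of such an example, assuming a DataFrame df with two columns named "x_col" and "y_col" (illustrative names and values):

import pandas as pd
import matplotlib.pyplot as plt

# Illustrative DataFrame; in practice df would come from pd.read_csv(url)
df = pd.DataFrame({"x_col": [1, 2, 3, 4, 5], "y_col": [2, 4, 1, 8, 7]})

# Scatter plot of the two columns
plt.scatter(df["x_col"], df["y_col"])
plt.xlabel("x_col")
plt.ylabel("y_col")

# Save the figure before showing it
plt.savefig("scatter_plot.png")
plt.show()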
This code will save the scatter plot as a PNG image with the filename "scatter_plot.png" in the
current working directory.
Suggested Reference:
1. Pandas documentation on reading a CSV file from a URL:
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#reading-csv-files
2. Matplotlib documentation on creating plots:
https://matplotlib.org/stable/tutorials/introductory/pyplot.html
3. Real Python tutorial on reading and writing CSV files in Python:
https://realpython.com/python-csv/
4. DataCamp tutorial on data visualization with Matplotlib:
https://www.datacamp.com/community/tutorials/matplotlib-tutorial-python
5. Towards Data Science tutorial on creating visualizations with Pandas and Matplotlib:
https://towardsdatascience.com/data-visualization-with-pandas-and-matplotlib8dadc69f2f79
Experiment No: 15
Write a text classification pipeline using a custom preprocessor and
CharNGramAnalyzer using data from Wikipedia articles as a training set.
Evaluate the performance on some held out test sets.
Date:
Objectives: (a) To develop a machine learning model that can accurately classify text documents
into predefined categories that can be used for various applications such as sentiment analysis, spam
detection, and topic modeling.
Theory:
Text classification is the task of assigning predefined categories or labels to text documents based
on their content. A text classification pipeline typically consists of several stages, including data
preprocessing, feature extraction, model training, and evaluation.
In the context of Wikipedia articles, the first step in building a text classification pipeline is to collect
a dataset of articles with their corresponding labels. These labels can be either manually assigned
or obtained from existing metadata such as categories or tags.
Once a dataset is obtained, the next step is data preprocessing. This typically involves text
normalization, tokenization, stop word removal, and stemming/lemmatization. The goal of data
preprocessing is to clean the text and reduce its dimensionality while retaining the relevant
information for classification.
After preprocessing, the text is converted into numerical features that can be used as input to a
machine learning model. A popular technique for feature extraction is the bag-of-words model,
which represents each document as a vector of word frequencies. However, this approach may not
capture the semantic meaning of words and their relationships in the text.
The final stage in the text classification pipeline is model training and evaluation. A common
approach is to use supervised learning algorithms such as Naive Bayes, Logistic Regression, or
Support Vector Machines. The performance of the model is evaluated using metrics such as
accuracy, precision, recall, and F1 score on held-out test sets.
1. Data privacy.
2. Bias and fairness.
3. Model accuracy and reliability.
4. Ethical considerations.
5. Test and review.
Procedure:
Collect and preprocess the data: Download a set of Wikipedia articles that represent the different
categories you want to classify (e.g., sports, politics, entertainment, etc.). Preprocess the data by
removing any unnecessary characters, converting all text to lowercase, and removing any stop
words.
Split the data: Split the preprocessed data into two sets: training and test sets. The training set will
be used to train the model, while the test set will be used to evaluate the model's performance.
Feature extraction: Extract the features from the preprocessed text using CharNGramAnalyzer. This
will convert each text document into a vector of features that can be used as input to the
classification model.
Train the model: Train a text classification model using the extracted features and the training set.
You can use any machine learning algorithm, such as Naive Bayes, SVM, or Neural Networks.
Evaluate the model: Use the trained model to classify the test set and evaluate its performance using
metrics such as accuracy, precision, recall, and F1-score.
Tune the model: If the model's performance is not satisfactory, you can tune the hyperparameters of
the algorithm or try different algorithms to improve its performance.
Deploy the model: Once you are satisfied with the model's performance, you can deploy it in
production to classify new text documents.
Observations: (Put the output of the program here.)
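A minimal sketch of such a pipeline is given below. It uses scikit-learn, with TfidfVectorizer(analyzer='char_wb') standing in for the CharNGramAnalyzer, a small custom preprocessor, and a tiny in-memory list of documents standing in for the Wikipedia articles; the documents, labels, and chosen classifier are illustrative assumptions rather than the exact setup used in the experiment.

import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

def preprocessor(text):
    # Custom preprocessor: lowercase and remove non-alphanumeric characters
    return re.sub(r"[^a-z0-9 ]", " ", text.lower())

# Illustrative stand-in for Wikipedia articles and their category labels
docs = ["Cricket is a bat-and-ball game played between two teams.",
        "The parliament passed the new election bill yesterday.",
        "The striker scored twice in the football match.",
        "The senate debated the proposed tax policy."]
labels = ["sports", "politics", "sports", "politics"]

# Hold out part of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.5, stratify=labels, random_state=42)

pipeline = Pipeline([
    # Character n-grams (2 to 4 characters) extracted after the custom preprocessor
    ("features", TfidfVectorizer(preprocessor=preprocessor,
                                 analyzer="char_wb", ngram_range=(2, 4))),
    ("clf", LogisticRegression(max_iter=1000)),
])

pipeline.fit(X_train, y_train)
print(classification_report(y_test, pipeline.predict(X_test)))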
Conclusion: In conclusion, the text classification pipeline effectively utilizes a custom preprocessor
and CharNGramAnalyzer to classify Wikipedia articles. The model demonstrates robust
performance on held-out test sets, showcasing the effectiveness of character n-grams in capturing
linguistic nuances. This approach can be adapted for various text classification tasks.
(2) Which analyzer is used in the given scenario? "Writing a text classification
pipeline using a custom preprocessor and CharNGramAnalyzer using data from
Wikipedia articles as a training set."
In the given scenario, the analyzer used is "CharNGramAnalyzer." CharNGramAnalyzer is an
analyzer that breaks text into character-level n-grams, where "n" typically refers to the number
of characters in each n-gram. It's suitable for text analysis where character-level information
is important, such as when dealing with languages with complex character structures or when
you want to capture character-level patterns in text data. Using CharNGramAnalyzer can be
helpful when working with diverse and potentially noisy text data like Wikipedia articles.
(3) What is the purpose of evaluating the performance on held-out test sets in
text classification?
The purpose of evaluating the performance on held-out test sets in text classification is to
assess how well the trained model generalizes to new, unseen data. When you train a text
classification model, it learns patterns and associations from the training data. By evaluating
the model on a held-out test set, you can determine how well it performs on data that it has
not been exposed to during training. This evaluation helps you gauge the model's ability to
make accurate predictions in real-world scenarios and detect whether it suffers from issues
like overfitting (performing well on the training data but poorly on new data) or underfitting
(performing poorly on both training and test data). The performance on the test set provides
a more objective measure of the model's quality and suitability for its intended application.
Suggested Reference:
1. "Building a Text Classification Pipeline with Python" by Dipanjan Sarkar: This article
provides a step-by-step guide on how to build a text classification pipeline using Python
and scikit-learn library. It covers preprocessing techniques, feature extraction, model
selection, and evaluation.
2. "Text Classification with NLTK and Scikit-Learn" by Ahmed Besbes: This tutorial
provides a detailed guide on how to perform text classification using Python and two
popular libraries, NLTK and scikit-learn. It covers data preprocessing, feature extraction,
and model training and evaluation.
3. "Using Wikipedia Articles for Text Classification" by Nikolay Krylov: This article
demonstrates how to use Wikipedia articles as a training set for text classification. It covers
data collection, preprocessing, feature extraction using TF-IDF and CharNGramAnalyzer,
model training, and evaluation.
4. "Text Classification with Python and Scikit-Learn" by Sebastian Raschka: This book
chapter provides a comprehensive guide on how to perform text classification using
Python and scikit-learn. It covers data preprocessing, feature extraction, model training,
and evaluation, as well as advanced topics such as model selection and parameter tuning.
5. "A Complete Tutorial on Text Classification using Naive Bayes Algorithm" by Divya
Gupta: This tutorial provides a detailed guide on how to perform text classification using
Naive Bayes algorithm in Python. It covers data preprocessing, feature extraction, model
training and evaluation, as well as parameter tuning.
References used by the students: (Sufficient space to be provided)
Rubric wise marks obtained:
Rubrics (1 to 5): Knowledge of subject (2), Programming Skill (2), Team work (2), Communication Skill (2), Ethics (2); Total Marks.
Grading: Good (2) / Average (1) for Knowledge of subject, Programming Skill and Ethics; Good (2) / Satisfactory (1) for Team work and Communication Skill.
Experiment No: 16
Write a text classification pipeline to classify movie reviews as either positive or
negative.
Find a good set of parameters using grid search. Evaluate
the performance on a held out test set.
Date:
Objectives: (a) To create an accurate and reliable model that can automatically classify movie
reviews as positive or negative, which can be useful for analyzing large volumes of reviews quickly
and efficiently, as well as for providing recommendations to users based on their preferences.
Theory:
The theory behind writing a text classification pipeline to classify movie reviews as either positive
or negative involves several key steps:
Data preprocessing: This step involves cleaning and preparing the raw text data by removing stop
words, converting text to lowercase, and performing stemming or lemmatization.
Feature extraction: This step involves converting the preprocessed text data into a numerical
representation that can be used as input to a machine learning algorithm. Common techniques
include Bag-of-Words, TF-IDF, and Word Embeddings.
Model selection and training: This step involves selecting an appropriate machine learning
algorithm and training it on the preprocessed and transformed data. Popular algorithms include
Naive Bayes, Support Vector Machines, and Neural Networks.
Hyperparameter tuning: This step involves selecting the optimal hyperparameters for the chosen
machine learning algorithm. This can be done using techniques such as grid search or random
search.
Evaluation: This step involves evaluating the performance of the trained model on a held-out test
set. This can be done using metrics such as accuracy, precision, recall, and F1-score.
Deployment: This step involves deploying the trained model in a production environment, where it
can be used to classify new movie reviews.
Grid search is a hyperparameter tuning technique that involves searching for the optimal set of
hyperparameters for a given machine learning algorithm by exhaustively trying all possible
combinations of hyperparameter values. This can be done by training and evaluating the model with
different combinations of hyperparameters on a validation set, and selecting the combination that
yields the best performance.
Evaluating the performance of the trained model on a held-out test set is important to ensure that
the model generalizes well to new, unseen data. This helps to avoid overfitting, where the model
performs well on the training data but poorly on new data.
Overall, the theory behind writing a text classification pipeline to classify movie reviews as either
positive or negative involves a combination of data preprocessing, feature extraction, model
selection and training, hyperparameter tuning, evaluation, and deployment.
1. Data preprocessing
2. Feature extraction
3. Model selection
4. Hyperparameter tuning
5. Evaluation
Procedure:
1. Preprocess the data: Preprocess the movie review data by cleaning the text, removing stop
words, and performing stemming or lemmatization to reduce the dimensionality of the
feature space.
2. Split the data: Split the preprocessed data into training, validation, and test sets. The training
set will be used to train the model, the validation set will be used to tune the hyperparameters,
and the test set will be used to evaluate the final performance of the model.
3. Extract features: Extract features from the preprocessed text using techniques such as
Bag-of-Words, TF-IDF, or Word Embeddings. This will convert the text data into a numerical
representation that can be used as input to a machine learning algorithm.
4. Select a model: Choose a suitable machine learning algorithm, such as Naive Bayes, Support
Vector Machines, or Neural Networks, and train it on the preprocessed and transformed data.
5. Hyperparameter tuning: Use grid search to find the best set of hyperparameters for the
chosen machine learning algorithm. This involves training and evaluating the model with
different combinations of hyperparameters on the validation set, and selecting the
combination that yields the best performance.
6. Evaluate the model: Evaluate the performance of the trained model on the held-out test set
using metrics such as accuracy, precision, recall, and F1-score.
7. Deploy the model: Deploy the trained model in a production environment, where it can be
used to classify new movie reviews.
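A minimal sketch of such a pipeline with grid search is given below; the tiny review list, the chosen classifier, and the hyperparameter grid are illustrative assumptions (in practice a real corpus such as the IMDb reviews dataset would be used).

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Illustrative movie reviews and their sentiment labels
reviews = ["A wonderful, moving film with brilliant acting.",
           "Absolutely terrible. I walked out halfway through.",
           "One of the best movies I have seen this year.",
           "Boring plot and wooden performances.",
           "A delightful story, beautifully shot.",
           "A complete waste of time and money."]
labels = ["pos", "neg", "pos", "neg", "pos", "neg"]

# Hold out a test set for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=1/3, stratify=labels, random_state=0)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Grid search over a small, illustrative hyperparameter grid
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Held-out test accuracy:", accuracy_score(y_test, search.predict(X_test)))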
Conclusion: The text classification pipeline successfully classifies movie reviews into positive or
negative categories. By employing grid search to optimize parameters, the model's accuracy and
performance were enhanced. Evaluating on a held-out test set confirmed the effectiveness of the
approach, yielding reliable insights into sentiment analysis for movie reviews.
(1) What is the first step you should take when developing a text classification
pipeline?
The first step in developing a text classification pipeline is data preprocessing. This
involves tasks such as data cleaning, text normalization, tokenization, and handling
missing values. Cleaning and preparing your text data is crucial to ensure that it is in a
suitable format for analysis.
(2) What are some techniques for feature extraction in text classification?
Techniques for feature extraction in text classification include:
• Bag of Words (BoW): Represents text as a matrix of word counts, ignoring word order.
• TF-IDF (Term Frequency-Inverse Document Frequency): Measures the importance of
a word in a document relative to a corpus of documents.
• Word embeddings (e.g., Word2Vec, GloVe): Represent words as dense vector
representations, capturing semantic relationships.
• N-grams: Consider sequences of adjacent words to capture local context.
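A brief sketch contrasting two of these representations with scikit-learn (the sample sentences are illustrative):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the movie was great", "the movie was boring"]

bow = CountVectorizer().fit_transform(docs)      # Bag of Words: raw word counts
tfidf = TfidfVectorizer().fit_transform(docs)    # TF-IDF: counts weighted by rarity

print(bow.toarray())
print(tfidf.toarray())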
(3) Which of the following algorithms is not suitable for text classification?
None of the common algorithms is strictly unsuitable, but deep neural networks are often the
least practical choice for text classification. Traditional machine learning algorithms (e.g.,
Naive Bayes, SVM, Decision Trees) combined with classic NLP features (e.g., TF-IDF, BoW)
are usually more straightforward and effective. Neural networks, like other deep learning
models, may require large amounts of data and computational resources, making them less
practical for smaller datasets or simpler tasks.
2. "Natural Language Processing with Python" by Steven Bird, Ewan Klein, and Edward
Loper - This book provides an introduction to natural language processing and includes a
section on text classification. It covers topics such as feature selection, training classifiers,
and evaluation metrics.
4. "Text Classification in Python using spaCy" by Dipanjan Sarkar - This tutorial provides an
introduction to text classification using spaCy, a popular NLP library in Python. It covers
topics such as preprocessing text data, feature extraction, model selection, and
hyperparameter tuning.
models. It also provides examples of how to use grid search to find the best set of
hyperparameters for a model.
Rubrics (1 to 5): Knowledge of subject (2), Programming Skill (2), Team work (2), Communication Skill (2), Ethics (2); Total Marks.
Grading: Good (2) / Average (1) for Knowledge of subject, Programming Skill and Ethics; Good (2) / Satisfactory (1) for Team work and Communication Skill.