20CAI213 DATA SCIENCE LABORATORY Manual 2024
(UGC-Autonomous)
Student Manual
20CAI213 Data Science Laboratory B. Tech III Year II Semester
Experiments
Experiment 1. Create NumPy arrays from Python Data Structures, Intrinsic NumPy objects and Random Functions.
Experiment 2. Manipulation of NumPy arrays: Indexing, Slicing, Reshaping, Joining and Splitting.
Experiment 3. Computation on NumPy arrays using Universal Functions and Mathematical methods.
Experiment 4. Import a CSV file and perform various Statistical and Comparison operations on rows/columns.
Experiment 5. Load an image file and do crop and flip operation using NumPy Indexing.
Experiment 6. Write a program to compute summary statistics such as mean, median, mode, standard deviation and variance of the given different types of data.
Experiment 7. Create Pandas Series and DataFrame from various inputs.
Experiment 8. Import any CSV file to Pandas DataFrame and perform the following:
a. Visualize the first and last 10 records
b. Get the shape, index and column details.
c. Select/Delete the records(rows)/columns based on conditions.
d. Perform ranking and sorting operations.
e. Do required statistical operations on the given columns.
f. Find the count and uniqueness of the given categorical values.
g. Rename single/multiple columns
Experiment 9. Import any CSV file to Pandas DataFrame and perform the following:
a. Handle missing data by detecting and dropping/filling missing values.
b. Transform data using apply() and map() method.
c. Detect and filter outliers.
d. Perform Vectorized String operations on Pandas Series.
e. Visualize data using Line Plots, Bar Plots, Histograms, Density Plots and Scatter Plots
Experiment 10. Write a program to demonstrate Linear Regression analysis with residual plots on a given data set
Experiment 11. Write a program to implement the Naïve Bayesian classifier for a sample training data set stored as a .CSV file. Compute the accuracy of the classifier, considering few test data sets
Experiment 12. Write a program to implement k-Nearest Neighbour algorithm to classify the iris data set. Print both correct and wrong predictions using Python ML library classes
Experiment 13. Write a program to implement k-Means clustering algorithm to cluster the set of data stored in .CSV file. Compare the results of various “k” values for the quality of clustering.
1. Institution Vision
To become a globally recognized research and academic institution and thereby contribute to
technological and socio-economic development of the nation.
2. Institution Mission
To foster a culture of excellence in research, innovation, entrepreneurship, rational thinking and
civility by providing necessary resources for generation, dissemination and
utilization of knowledge and in the process create an ambience for practice-based learning to
the youth for success in their careers.
Mission 3: Enrich employability and entrepreneurial skills in the field of AI & DS through
experiential and self-directed learning.
3. Program Outcomes
PO1: Engineering knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals, and an engineering specialization to the solution of complex engineering problems.
PO2: Problem analysis: Identify, formulate, review research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of mathematics, natural
sciences, and engineering sciences.
PO3: Design/development of solutions: Design solutions for complex engineering problems and
design system components or processes that meet the specified needs with appropriate consideration
for the public health and safety, and the cultural, societal, and environmental considerations.
PO4: Conduct investigations of complex problems: Use research-based knowledge and research
methods including design of experiments, analysis and interpretation of data, and synthesis of the
information to provide valid conclusions.
PO5: Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modeling to complex engineering activities with an
understanding of the limitations.
PO6: The engineer and society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to the
professional engineering practice.
PO7: Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of, and need for
sustainable development.
PO8: Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.
PO9: Individual and team work: Function effectively as an individual, and as a member or leader in
diverse teams, and in multidisciplinary settings.
PO10: Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend and write
effective reports and design documentation, make effective presentations, and give and receive clear
instructions.
PO11: Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one's own work, as a member and leader in
a team, to manage projects and in multidisciplinary environments.
PO12: Life-long learning: Recognize the need for, and have the preparation and ability to engage in
independent and life-long learning in the broadest context of technological change.
5. Lab Syllabus
List of Programs:
1. Create NumPy arrays from Python Data Structures, Intrinsic NumPy objects and Random
Functions.
2. Manipulation of NumPy arrays- Indexing, Slicing, Reshaping, Joining and Splitting.
3. Computation on NumPy arrays using Universal Functions and Mathematical methods.
4. Import a CSV file and perform various Statistical and Comparison operations on rows/columns.
5. Load an image file and do crop and flip operation using NumPy Indexing.
6. Write a program to compute summary statistics such as mean, median, mode, standard
deviation and variance of the given different types of data.
7. Create Pandas Series and DataFrame from various inputs.
8. Import any CSV file to Pandas DataFrame and perform the following:
a. Visualize the first and last 10 records
b. Get the shape, index and column details.
c. Select/Delete the records(rows)/columns based on conditions.
d. Perform ranking and sorting operations.
e. Do required statistical operations on the given columns.
f. Find the count and uniqueness of the given categorical values.
g. Rename single/multiple columns
9. Import any CSV file to Pandas DataFrame and perform the following:
a. Handle missing data by detecting and dropping/ filling missing values.
b. Transform data using apply() and map() method.
c. Detect and filter outliers.
d. Perform Vectorized String operations on Pandas Series.
e. Visualize data using Line Plots, Bar Plots, Histograms, Density Plots and Scatter Plots
10. Write a program to demonstrate Linear Regression analysis with residual plots on a given data
set
11. Write a program to implement the Naïve Bayesian classifier for a sample training data set
stored as a .CSV file. Compute the accuracy of the classifier, considering few test data sets
12. Write a program to implement k-Nearest Neighbour algorithm to classify the iris data set.
Print both correct and wrong predictions using Python ML library classes
13. Write a program to implement k-Means clustering algorithm to cluster the set of data stored in
.CSV file. Compare the results of various “k” values for the quality of clustering.
Course Outcomes:
Upon successful completion of the course, students will be able to
1. Illustrate the use of various data structures.
2. Analyze and manipulate Data using Numpy and Pandas.
3. Create static, animated, and interactive visualizations using Matplotlib.
4. Understand the implementation procedures for the machine learning algorithms.
5. Identify and apply Machine Learning algorithms to solve real-world problems using
appropriate data sets.
Text Book(s)
1. Wes McKinney, “Python for Data Analysis: Data Wrangling with Pandas, NumPy, and
IPython”, O'Reilly, 2nd Edition, 2018.
2. Jake VanderPlas, “Python Data Science Handbook: Essential Tools for Working with Data”,
O'Reilly, 2017.
Reference Books
1. Y. Daniel Liang, “Introduction to Programming using Python”, Pearson, 2012.
2. Francois Chollet, “Deep Learning with Python”, 1/e, Manning Publications Company, 2017.
3. Peter Wentworth, Jeffrey Elkner, Allen B. Downey and Chris Meyers, “How to Think Like
a Computer Scientist: Learning with Python 3”, 3rd edition, Available at
https://www.ict.ru.ac.za/Resources/cspw/thinkcspy3/thinkcspy3.pdf
4. Paul Barry, “Head First Python: A Brain-Friendly Guide”, 2nd Edition, O'Reilly, 2016.
5. Daniel Y. Chen, “Pandas for Everyone: Python Data Analysis”, Pearson Education, 2019.
Mode of Evaluation: Continuous Internal Evaluation and End Semester Examination
Experiments
Experiment -1
Question: To create numpy arrays from python data structures, intrinsic numpy objects
and random functions.
Aim:
To create NumPy arrays from Python Data Structures, Intrinsic NumPy objects and
Random Functions.
Algorithm:
Step 1: Install NumPy and import it as np for ease of use.
Step 2: Create NumPy arrays with different data types and learn their usage.
Step 3: NumPy has built-in functions for creating arrays from scratch. Practice them:
1. zeros(shape) will create an array filled with 0 values with the specified shape. The
default dtype is float64.
2. arange() will create arrays with regularly incrementing values.
>> np.arange(10)
Output: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>> np.arange(2, 10, dtype=float)
Output: array([2., 3., 4., 5., 6., 7., 8., 9.])
3. linspace() will create arrays with a specified number of elements, spaced equally
between the specified beginning and end values.
4. indices() will create a set of arrays (stacked as a one-higher dimensioned array), one
per dimension, with each representing variation in that dimension.
>> np.indices((3,3))
Output: array([[[0, 0, 0], [1, 1, 1], [2, 2, 2]], [[0, 1, 2], [0, 1, 2], [0, 1, 2]]])
Step 5: Learn the most commonly used random functions in NumPy and practice them.
>> np.random.rand()
Output: 0.6926529371565405
>> np.random.rand(3,2)
[0.92730776, 0.98828719]])
2. Practice functions like shuffle(), permutation(), uniform(), seed(), choice() etc.
Source Code:
# Import NumPy
import numpy as np
# zeros()
np.zeros((2,3))
output: array([[0., 0., 0.], [0., 0., 0.]])
# arange()
np.arange(10)
output: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
# linspace()
np.linspace(1,26,3)
output: array([ 1. , 13.5, 26. ])
# indices()
np.indices((3,3))
output: array([[[0,0,0],[1,1,1],[2,2,2]],[[0,1,2],[0,1,2],[0,1,2]]])
# random.rand(): the values differ on every run
np.random.rand()
output: 0.6926529371565405
np.random.rand(3,2)
Theory:
A. numpy.zeros(shape, dtype=float, order='C', *, like=None)
Return a new array of given shape and type, filled with zeros.
Parameters:
shape : int or tuple of ints
dtype : data-type, optional
The desired data-type for the array, e.g., numpy.int8. Default is numpy.float64.
like : array_like, optional
Reference object to allow the creation of arrays which are not NumPy arrays. If an array-
like passed in as like supports the __array_function__ protocol, the result will be defined
by it. In this case, it ensures the creation of an array object compatible with that passed in
via this argument.
Eg: np.zeros(5)
array([ 0., 0., 0., 0., 0.])
arange(stop): Values are generated within the half-open interval [0, stop) (in other words,
the interval including start but excluding stop).
arange(start, stop): Values are generated within the half-open interval [start, stop).
arange(start, stop, step) Values are generated within the half-open interval [start, stop),
with spacing between values given by step.
For integer arguments the function is roughly equivalent to the Python built-in range, but returns
an ndarray rather than a range instance.
When using a non-integer step, such as 0.1, it is often better to use numpy.linspace.
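The advice above can be illustrated with a short comparison. This is a minimal sketch; the interval and step values are chosen only for illustration.

```python
import numpy as np

# arange with a fractional step: the number of elements is derived from
# (stop - start) / step, so floating-point round-off can affect it.
a = np.arange(0.0, 1.0, 0.1)

# linspace takes the number of samples instead of a step and
# includes both end points by default.
b = np.linspace(0.0, 1.0, 11)

print(len(a), a[-1])        # stop value 1.0 is excluded
print(len(b), b[0], b[-1])  # both 0.0 and 1.0 are included
```

With linspace() the number of elements is stated directly, which avoids having to reason about round-off at the end of the interval.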
Parameters:
start : integer or real, optional
Start of interval. The interval includes this value. The default start value is 0.
stop : integer or real
End of interval. The interval does not include this value, except in some cases where step
is not an integer and floating point round-off affects the length of out.
step : integer or real, optional
Spacing between values. For any output out, this is the distance between two adjacent
values, out[i+1] - out[i]. The default step size is 1. If step is specified as a position
argument, start must also be given.
dtype : dtype, optional
The type of the output array. If dtype is not given, infer the data type from the other input
arguments.
like : array_like, optional
Reference object to allow the creation of arrays which are not NumPy arrays. If an array-
like passed in as like supports the __array_function__ protocol, the result will be defined
by it. In this case, it ensures the creation of an array object compatible with that passed in
via this argument.
Compute an array where the subarrays contain index values 0, 1, … varying only along the
corresponding axis.
Parameters:
dimensions : sequence of ints
dtype : dtype, optional
sparse : boolean, optional
Returns:
grid : one ndarray or tuple of ndarrays
If sparse is False:
Returns one array of grid indices, grid.shape = (len(dimensions),) + tuple(dimensions).
If sparse is True:
Returns a tuple of arrays, with grid[i].shape = (1, ..., 1, dimensions[i], 1, ..., 1) with
dimensions[i] in the ith place
Eg: x = np.arange(20).reshape(5, 4)
row, col = np.indices((2, 3))
x[row, col]
array([[0, 1, 2],
[4, 5, 6]])
Create an array of the given shape and populate it with random samples from a uniform
distribution over [0, 1).
Parameters:
d0, d1, …, dn : int, optional
Returns:
out : ndarray, shape (d0, d1, ..., dn)
Random values.
Eg: np.random.rand(3,2)
array([[ 0.14022471, 0.96360618], #random
[ 0.37601032, 0.25528411], #random
[ 0.49313049, 0.94909878]])
V. random.shuffle(x)
This function only shuffles the array along the first axis of a multi-dimensional array. The order
of sub-arrays is changed but their contents remain the same.
Parameters:
x : ndarray or MutableSequence
Returns:
None
Eg: arr = np.arange(10)
np.random.shuffle(arr)
arr
[1 7 5 2 9 4 3 6 0 8]
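The other random functions named in the algorithm (seed(), permutation(), uniform(), choice()) can be tried in the same way. The sketch below uses the legacy np.random interface that the rest of this experiment uses; the seed value 42 and the array sizes are arbitrary choices for illustration.

```python
import numpy as np

np.random.seed(42)                          # fix the seed so runs are repeatable

arr = np.arange(10)
perm = np.random.permutation(arr)           # shuffled copy; arr itself is unchanged
unif = np.random.uniform(1.0, 5.0, size=4)  # floats drawn from [1.0, 5.0)
pick = np.random.choice(arr, size=3, replace=False)  # three distinct elements

np.random.seed(42)
perm_again = np.random.permutation(arr)     # same seed, same permutation

print(perm)
print(unif)
print(pick)
print(np.array_equal(perm, perm_again))
```

Unlike shuffle(), which reorders the array in place and returns None, permutation() returns a new array and leaves its input untouched.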
Result:
Practiced with Numpy Array and familiarized different random functions in it.
Experiment -2
Aim:
Manipulation of NumPy arrays- Indexing, Slicing, Reshaping, Joining and Splitting.
Algorithm:
Step 2: Use Numpy arrays and perform various operations by referring documentation
(i) Indexing: array[index]. Indexes in NumPy arrays start at 0, meaning that the first
element has index 0 and the second has index 1.
(ii) Slicing: taking elements from one given index to another given index, written as
array[start:stop].
(iii) Reshaping: reshape() changes the shape of an array, that is, the number of
dimensions and the number of elements along each dimension. The total number of
elements stays the same.
(iv) Joining: putting the contents of two or more arrays into a single array.
We pass a sequence of arrays that we want to join to the concatenate() function,
along with the axis.
(v) Splitting: the reverse operation of joining. We use array_split() for splitting
arrays; we pass it the array we want to split and the number of splits.
Step 4: Try different Numpy functions
Source Code:
import numpy as np
# Indexing: indexes in NumPy arrays start with 0, so the first element has index 0
# and the second has index 1.
# (The array definitions are missing from this listing; the contents used below
# are assumed so that the printed outputs can be reproduced.)
arr = np.array([1, 2, 3, 4])
print(arr[2])
Output: 3
# Slicing: taking elements from one given index to another given index.
arr = np.array([10, 20, 31, 45, 25, 60])
print(arr[2:5])
Output: [31 45 25]
# Reshaping: reshape() changes the shape of the array; the total number of
# elements stays the same.
arr = np.arange(1, 13)
newarr = arr.reshape(3,4)
print(newarr)
Output: [[ 1 2 3 4]
[ 5 6 7 8]
[ 9 10 11 12]]
# Joining: we pass a sequence of arrays that we want to join to the
# concatenate() function, along with the axis.
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.concatenate((arr1, arr2))
print(arr)
Output: [1 2 3 4 5 6]
# Splitting: the reverse of joining. We pass array_split() the array we want
# to split and the number of splits.
newarr = np.array_split(arr, 3)
print(newarr)
Output: [array([1, 2]), array([3, 4]), array([5, 6])]
Theory:
Python Numpy Array Indexing:
ndarrays can be indexed using the standard Python x[obj] syntax, where x is the array and obj the
selection. There are different kinds of indexing available depending on obj: basic indexing,
advanced indexing, and field access.
Most of the following examples show the use of indexing when referencing data in an array. The
examples work just as well when assigning to an array. See Assigning values to indexed arrays
for specific examples and explanations on how assignments work.
Note that in Python, x[(exp1, exp2, ..., expN)] is equivalent to x[exp1, exp2, ..., expN]; the latter
is just syntactic sugar for the former.
Basic indexing
Single element indexing works exactly like that for other standard Python sequences. It is 0-
based, and accepts negative indices for indexing from the end of the array.
>>>x = np.arange(10)
>>>x[2]
Output:
2
>>>x[-2]
Output:
8
It is not necessary to separate each dimension's index into its own set of square brackets.
>>>x.shape = (2, 5) # now x is 2-dimensional
>>>x[1, 3]
Output:
8
>>>x[1, -1]
Output:
9
Note that if one indexes a multidimensional array with fewer indices than dimensions, one gets a
subdimensional array. For example:
>>>x[0]
Output:
array([0, 1, 2, 3, 4])
NumPy array slicing is used to extract a portion of the data from an array. As with Python
sequences, a slice is written with colons as start:stop:step and runs from the start index up to,
but not including, the stop index. The same selection can be expressed with a slice object, built
from the start, stop and step parameters by the built-in slice() function and passed to the array
as the index. Unlike list slicing, a NumPy slice is a view of the original data, not a copy.
Example:
>>>import numpy as np
>>>arr = np.array([1, 2, 3, 4, 5, 6, 7])
>>>print(arr[1:5])
Output:
[2 3 4 5]
Reshaping a NumPy array means changing the shape of the given array. The shape tells the number
of dimensions and the number of elements along each dimension. By reshaping an array we can add
or remove dimensions or change the number of elements in each dimension, as long as the total
number of elements stays the same. To reshape a NumPy array we use the reshape method of the
given array.
Syntax : array.reshape(shape)
Example:
>>>import numpy as np
>>>arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
>>>newarr = arr.reshape(4, 3)
>>>print(newarr)
Output:
[[ 1 2 3]
[ 4 5 6]
[ 7 8 9]
[10 11 12]]
In SQL we join tables based on a key, whereas in NumPy we join arrays by axes. We pass a
sequence of arrays that we want to join to the concatenate() function, along with the axis. If axis
is not explicitly passed, it is taken as 0.
Example:
>>>import numpy as np
>>>arr1 = np.array([1, 2, 3])
>>>arr2 = np.array([4, 5, 6])
>>>arr = np.concatenate((arr1, arr2))
>>>print(arr)
Output:
[1 2 3 4 5 6]
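The effect of the axis argument is easier to see with two-dimensional arrays. The sketch below uses two small arrays chosen only for illustration.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

rows = np.concatenate((a, b))          # axis=0 (default): b is stacked below a
cols = np.concatenate((a, b), axis=1)  # axis=1: b is placed to the right of a

print(rows.shape)  # (4, 2)
print(cols.shape)  # (2, 4)
print(cols)
```

The arrays must have the same shape along every axis except the one being joined.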
Joining merges multiple arrays into one and splitting breaks one array into multiple.
We use array_split() for splitting arrays; we pass it the array we want to split and the number of
splits.
Example:
>>>import numpy as np
>>>arr = np.array([1, 2, 3, 4, 5, 6])
>>>newarr = np.array_split(arr, 3)
>>>print(newarr)
Output:
[array([1, 2]), array([3, 4]), array([5, 6])]
Result:
Practiced with Numpy Array and familiarized with different operations on Numpy array.
Experiment -3
Aim:
Computation of NumPy arrays using universal functions and mathematical methods.
Algorithm:
Step 2: Use Numpy arrays and perform various operations by referring documentation.
Step 4: Trigonometric functions like sin, cos, tan and their inverses.
Step 7: Apply various statistical functions (mean, median, mode, std, var etc.) on values in an np
array
Source Code:
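# Importing numpy as np
import numpy as np
# Reduce with multiplication: 2 * 3 * 5
np.multiply.reduce([2,3,5])
Output : 30
# Identity element of multiplication
A=np.multiply.identity
Output : 1
# Element-wise power and square root
x1 = np.arange(6)
np.power(x1, 3)
Output : array([  0,   1,   8,  27,  64, 125])
np.sqrt([1,4,9])
Output : array([1., 2., 3.])
# Trigonometric functions
np.sin(np.pi/2.)
Output : 1.0
np.cos(0)
Output : 1.0
# Bitwise and logical operations
np.bitwise_or(13, 16)
Output : 29
np.invert(np.array([True, False]))
Output : array([False,  True])
# Applying the greater-than operator on elements of two different arrays
np.greater([4,2],[2,2])
Output : array([ True, False])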
Theory:
NumPy universal functions (ufuncs) are mathematical functions that operate on NumPy arrays
and perform element-wise operations on the data values. They are instances of the numpy.ufunc
class. Some of the basic mathematical operations are called internally when we use certain
operators. For example, when we write x + y for two arrays, it internally invokes the
numpy.add() universal function.
We can even create our own universal functions using the frompyfunc() function.
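As a minimal sketch of frompyfunc(), the example below wraps an ordinary Python function as a ufunc. The function clipped_add and the cap of 10 are hypothetical, chosen only for illustration.

```python
import numpy as np

def clipped_add(x, y):
    """Add two scalars and cap the result at 10."""
    return min(x + y, 10)

# np.frompyfunc(func, nin, nout): here 2 inputs and 1 output
clipped_add_ufunc = np.frompyfunc(clipped_add, 2, 1)

result = clipped_add_ufunc(np.array([1, 5, 9]), np.array([2, 5, 9]))

print(type(clipped_add_ufunc))  # <class 'numpy.ufunc'>
print(result)                   # element-wise results
print(result.dtype)             # frompyfunc always returns object arrays
```

Because the result has dtype object, it is usually converted with astype() before further numeric work.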
Syntax:
Example:
import numpy as np
rad = np.deg2rad(data)
# hypotenuse
b=3
h=6
print('hypotenuse value for the right angled triangle:')
print(np.hypot(b, h))
Output:
Example:
import numpy as np
data = np.array([10.2,34,56,7.90])
print('Minimum and maximum data values from the array: ')
print(np.amin(data))
print(np.amax(data))
Output:
Result:
Familiarized with different Universal functions in Numpy and performed various
mathematical operations with it.
Experiment -4
Question: Import a CSV file and perform various Statistical and Comparison operations
on rows/columns.
Aim:
Import a CSV file and perform various Statistical and Comparison operations on
rows/columns.
Algorithm:
Step 2: Import this file to the drive and to colab for practicing different operations.
Step 3: Import Numpy and Pandas to use data frames and apply various statistical operations
Step 4: Display column titles of imported CSV file and start analyzing data in it.
Step 5: Apply functions like info(), head(), describe() etc. and learn their uses.
Step 7: Choose appropriate feature from the dataset and find mean, median, mode, std and
variance
Source code:
import pandas as pd
df.sort_values('Lscore').head() # sorting
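The steps of this experiment can be sketched as follows. The CSV content is hypothetical and is read from an in-memory string so that the example runs without a file; only the column name Lscore comes from the listing above, while Mscore and the student names are assumptions.

```python
import io
import pandas as pd

# Hypothetical data standing in for the lab's CSV file.
csv_text = """name,Mscore,Lscore
Asha,72,65
Ravi,58,81
Meena,90,77
Kiran,64,64
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.columns.tolist())   # column titles
print(df.describe())         # summary of the numeric columns

# Column-wise statistics
col_mean = df["Lscore"].mean()
col_median = df["Lscore"].median()
col_std = df["Lscore"].std()   # sample standard deviation (ddof=1)

# Row-wise statistic across the two score columns
df["avg"] = df[["Mscore", "Lscore"]].mean(axis=1)

# Comparison operations produce boolean Series that can filter rows
better_in_l = df["Lscore"] > df["Mscore"]
above_mean = df[df["Lscore"] > col_mean]

print(df.sort_values("Lscore").head())
```

With a real file, pd.read_csv() is given the file path instead of the StringIO object.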
Theory:
CSV is a common file format, widely used in domains such as financial services. Most
applications let you import and export data in CSV format.
A good understanding of the CSV format therefore helps you handle the data
you work with every day.
What is a CSV?
CSV (Comma Separated Values) is a simple file format used to store tabular data,
such as a spreadsheet or database. A CSV file stores tabular data (numbers and text) in plain text.
Each line of the file is a data record. Each record consists of one or more fields, separated by
commas; the use of the comma as a field separator is the source of the name of this
file format.
Working with CSV Files
Working with CSV files is straightforward. However, depending
on your workflow, there can be caveats that you may want to watch out for.
If you have a CSV file, you can open it in Excel without much trouble. Open Excel, then open and
find the CSV file to work with (or right-click on the CSV file and choose Open in Excel). After
you open the file, you will notice that the data is simply plain text placed into different cells.
If you wish to save your current workbook as a CSV file, you have to use the
following commands:
Pandas can read datasets in different file formats such as CSV, HTML and XLSX.
For the CSV file format the read_csv() function, for the Excel (XLSX) file format the read_excel()
function and for the HTML file format the read_html() function of pandas are used.
Result:
Learned to import csv files and started to work on selected attribute for various statistical
studies.
Experiment-5
Question: Load an image file and do crop and flip operation using NumPy Indexing.
Aim: To load an image file and do crop and flip operation using NumPy Indexing.
Algorithm:
Numpy
Matplotlib
Pillow
Step3: Import Image module of pillow library to crop and flip the image easily.
Step7: Do the same image and apply rotate function to rotate in specified angle and display.
Source Code:
import numpy as np
import matplotlib.pyplot as plt
import cv2
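Cropping and flipping with NumPy indexing can be sketched as below. A small synthetic array is used in place of a real image so that the example runs without a file; with an actual image, a loader such as plt.imread() or cv2.imread() would return an array with the same (height, width, channels) layout. The crop bounds are arbitrary.

```python
import numpy as np

# Synthetic 4x6 RGB "image" (assumed stand-in for a loaded image file).
img = np.arange(4 * 6 * 3, dtype=np.uint8).reshape(4, 6, 3)

cropped = img[1:3, 2:5]      # crop: rows 1-2 and columns 2-4
flipped_ud = img[::-1]       # reverse the rows: vertical flip
flipped_lr = img[:, ::-1]    # reverse the columns: horizontal flip

print(img.shape, cropped.shape)
print(flipped_ud.shape, flipped_lr.shape)
```

All three results are views of the original array; call .copy() on them if the original image must not be affected by later changes.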
Result:
Loaded an image file and performed crop and flip operation using NumPy Indexing.
Experiment-6
Question: Compute summary statistics such as mean, median, mode, standard deviation, and
variance of the given different types of data.
Aim: A program to compute summary statistics such as mean, median, mode, standard
deviation, and variance of the given different types of data.
Step 4: Learn role of summary statistics on data set through application on small csv file
Source Code:
#importing pandas
import pandas as pd
median1=data["housing_median_age"].median()
mode1=data["housing_median_age"].mode()
std1=data["housing_median_age"].std()
var1=data["housing_median_age"].var()
print("median is " + str(median1))
print("mode is " + str(mode1))
print("std is " + str(std1))
print("var is " + str(var1))
THEORY:
Statistics is concerned with collecting and analyzing data. It includes methods for
collecting samples, describing the data, and drawing conclusions from it. NumPy is the
fundamental package for scientific computing in Python, and its statistical functions support
this kind of analysis.
NumPy contains various statistical functions that are used to perform statistical data analysis.
These functions are useful for finding the maximum or minimum of the elements of an array. They
are also used to compute basic statistics like standard deviation, variance, etc.
Mean
The mean is calculated by adding all the items of the array and dividing the sum by
the number of elements. We can also mention the axis along which the mean is to
be calculated.
import numpy as np
a = np.array([5,6,7])
print(a)
print(np.mean(a))
Output
[5 6 7]
6.0
Median
Median is the middle element of the array. The formula differs for odd and even
sets.
It can calculate the median for both one-dimensional and multi-dimensional arrays.
Median separates the higher and lower range of data values.
import numpy as np
a = np.array([5,6,7])
print(a)
print(np.median(a))
Output
[5 6 7]
6.0
Mode
It can calculate the mode for both one-dimensional and multi-dimensional arrays.
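NumPy itself has no dedicated mode function, so the mode is usually taken from pandas (Series.mode()) or SciPy (scipy.stats.mode). A minimal NumPy-only sketch, using np.unique() on sample values chosen for illustration, is:

```python
import numpy as np

a = np.array([5, 6, 7, 6, 5, 6])

# Count how often each distinct value occurs, then pick the most frequent one.
values, counts = np.unique(a, return_counts=True)
mode = values[np.argmax(counts)]   # on ties, the smallest value is returned

print(values, counts)
print(mode)  # 6
```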
Standard Deviation
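Standard deviation is the square root of the average of the squared deviations from the mean.
The formula for standard deviation is:
$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$
where $\mu$ is the mean of the $N$ values $x_i$.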
import numpy as np
a = np.array([5,6,7])
print(a)
print(np.std(a))
Output
[5 6 7]
0.816496580927726
Variance
In probability theory and statistics, variance is the expectation of the squared deviation of
a random variable from its population mean or sample mean. Variance is a measure
of dispersion, meaning it is a measure of how far a set of numbers is spread out from their
average value.
import numpy as np
a = np.array([[1,2],[3,4]])
a
np.var(a)
Output
array([[1,2],
[3,4]])
1.25
Summary
These functions are useful for performing statistical calculations on the array elements. NumPy
statistical functions further increase the scope of the use of the NumPy library. The objective of
statistical functions is to eliminate the need to remember lengthy formulas. It makes processing
more user-friendly.
Result:
A program to compute summary statistics such as mean, median, mode, standard deviation, and
variance of the given different types of data is successfully executed.
Experiment-7
Question: Create Pandas Series and DataFrame from various types of inputs
Aim: To create Pandas Series and DataFrame from various types of inputs.
Algorithm:
Step 2: Create data frame with pandas library from multiple series of data
Step 4: Create data frame from the dictionary, list and list of list
Step 6: Create pandas data frame using DataFrame() function and add data to it.
Step 9: Load any small csv file to pandas data frame and display it.
Source Code:
# import pandas
import pandas as pd
# Series from Python lists
author = ['jitender', 'purnima', 'arpit', 'jyoti']
auth_series = pd.Series(author)
print(auth_series)
article = [210, 211, 114, 178]
article_series = pd.Series(article)
# DataFrame from a dictionary of Series
frame = {'author': auth_series, 'article': article_series}
result = pd.DataFrame(frame)
print(result)
# Add a column from another Series
age = [21, 21, 24, 23]
result['age'] = pd.Series(age)
print(result)
# DataFrame from a list of tuples
ls = list(zip(author, article, age))
df1 = pd.DataFrame(ls, columns=['author', 'article', 'age'])
print(df1)
# Save the DataFrame to a CSV file and load it back
df1.to_csv("exp7.csv")
df2 = pd.read_csv("exp7.csv")
df2
THEORY:
PANDAS SERIES:
A Series is a list-like structure in pandas which can hold integer values, string values,
floating-point values and more. A Series is displayed as a list of values with an index which,
by default, runs from 0 to n-1, where n is the number of values in the Series. The main
difference between a Series and a DataFrame is that a Series contains a single list of values
with an index, whereas a DataFrame can be made of more than one Series. In other words, a
DataFrame is a collection of Series that can be used to analyse the data.
Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer,
string, float, python objects, etc.). The axis labels are collectively called index. A Pandas Series
is comparable to a single column in an Excel sheet.
Labels need not be unique but must be a hashable type. The object supports both integer and label-
based indexing and provides a host of methods for performing operations involving the index.
PANDAS DATAFRAME:
Pandas is a python package designed for fast and flexible data processing, manipulation and
analysis. Pandas has a number of fundamental data structures (a data management and storage
format). If you are working with two-dimensional labelled data, which is data that has both
columns and rows with row headers — similar to a spreadsheet table, then the DataFrame is the
data structure that you will use with Pandas.
In the real world, a Pandas DataFrame is usually created by loading a dataset from existing
storage, which can be a SQL database, a CSV file or an Excel file. A Pandas DataFrame can also be
created from lists, a dictionary, a list of dictionaries etc.
Result:
Created Pandas Series and DataFrame from various types of inputs.
Experiment-8
Question: Import any CSV file to Pandas DataFrame and perform the following:
a. Visualize the first and last 10 records
b. Get the shape, index and column details.
c. Select/Delete the records(rows)/columns based on conditions.
d. Perform ranking and sorting operations.
e. Do required statistical operations on the given columns.
f. Find the count and uniqueness of the given categorical values.
g. Rename single/multiple columns
Aim: Familiarizing some basic operations on CSV file with Pandas Data Frame
Algorithm:
Step 1: Create simple panda series from CSV and visualize the first and last 10 records.
Step 2: From the CSV imported in step 1 get the shape, index and column details.
Step 6: In the attached CSV data find the count and uniqueness of the given categorical values.
Source Code:
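The tasks (a) to (g) of this experiment can be sketched as below. The records are hypothetical and are built in memory so that the example runs without a file; with a real CSV, the DataFrame would come from pd.read_csv() instead. All column names are assumptions.

```python
import pandas as pd

# Hypothetical records standing in for the imported CSV file.
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meena", "Kiran", "Divya"],
    "dept": ["CSE", "ECE", "CSE", "EEE", "ECE"],
    "marks": [72, 58, 90, 64, 81],
})

print(df.head(10))                      # a. first 10 records (all 5 here)
print(df.tail(10))                      #    last 10 records
print(df.shape, df.index, df.columns)   # b. shape, index and column details

high = df[df["marks"] > 70]             # c. select rows based on a condition
no_dept = df.drop(columns=["dept"])     #    delete a column

df["rank"] = df["marks"].rank(ascending=False)        # d. ranking
by_marks = df.sort_values("marks", ascending=False)   #    sorting

print(df["marks"].describe())           # e. statistical operations on a column

dept_counts = df["dept"].value_counts() # f. count of each categorical value
n_depts = df["dept"].nunique()          #    number of distinct values

renamed = df.rename(columns={"name": "student", "marks": "score"})  # g.
```

rename() and drop() return new DataFrames here; the original df is only changed by the explicit assignment to df["rank"].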
Result:
Familiarizing some basic operations on CSV file with Pandas Data Frame is successfully
completed.
Experiment-9
Question:
Import any CSV file to Pandas DataFrame and perform the following:
a. Handle missing data by detecting and dropping/ filling missing values.
b. Transform data using apply() and map() method.
c. Detect and filter outliers.
d. Perform Vectorized String operations on Pandas Series.
e. Visualize data using Line Plots, Bar Plots, Histograms, Density Plots and Scatter Plots
AIM:
ALGORITHM:
import pandas as pd
# df is assumed to hold the dataset listed under DATASET below, loaded with pd.read_csv()
print(df.dtypes)
# describe
print(df.describe())
print(df["owner"])
print(df.head())
print(df.tail())
df2 = df[0:3]
print(df2)
copied_data = df.copy()
print(copied_data)
print(copied_data.dropna())
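Tasks (a) to (d) of this experiment can be sketched as follows. The small DataFrame is hypothetical and only imitates the kind of columns a vehicle dataset contains; the column names name and km_driven, the fill values and the 1.5 x IQR outlier rule are assumptions made for illustration. Plotting (task e) is not covered here.

```python
import numpy as np
import pandas as pd

# Hypothetical records with missing values and one extreme reading.
df = pd.DataFrame({
    "name": ["  maruti swift", "Hyundai i20 ", "honda city", "Tata Nexon", "ford figo"],
    "owner": ["First Owner", "Second Owner", None, "First Owner", "First Owner"],
    "km_driven": [40000.0, 45000.0, np.nan, 52000.0, 900000.0],
})

# a. Detect, drop and fill missing values
missing_per_col = df.isna().sum()
dropped = df.dropna()
filled = df.fillna({"owner": "Unknown", "km_driven": df["km_driven"].median()})

# b. Transform data with apply() and map()
filled["km_thousands"] = filled["km_driven"].apply(lambda km: km / 1000)
filled["owner_code"] = filled["owner"].map(
    {"First Owner": 1, "Second Owner": 2, "Unknown": 0})

# c. Detect and filter outliers with the interquartile range (IQR) rule
q1, q3 = filled["km_driven"].quantile([0.25, 0.75])
iqr = q3 - q1
inliers = filled[filled["km_driven"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# d. Vectorized string operations on a Series
filled["name"] = filled["name"].str.strip().str.title()

print(missing_per_col)
print(inliers)
```

Filling with the median rather than the mean keeps the extreme reading from distorting the fill value.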
DATASET:
https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho/download?datasetVersionNumber=3
Result:
Familiarizing more operations on CSV file with Pandas Data Frame is successfully completed.
Experiment-10
Question: Demonstrate Linear Regression analysis with residual plots on a given data set
Linear regression:
Linear regression analysis is used to predict the value of a variable based on the value of another
variable. The variable you want to predict is called the dependent variable. The variable you are
using to predict the other variable's value is called the independent variable.
This form of analysis estimates the coefficients of the linear equation, involving one or more
independent variables that best predict the value of the dependent variable. Linear regression fits
a straight line or surface that minimizes the discrepancies between predicted and actual output
values. There are simple linear regression calculators that use a “least squares” method to
discover the best-fit line for a set of paired data. You then estimate the value of X (dependent
variable) from Y (independent variable).
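The least-squares fit described above can be sketched with NumPy alone. The paired data below are hypothetical, and the closed-form slope and intercept formulas are the standard ones for a single independent variable; this is an illustration, not the manual's own listing.

```python
import numpy as np

# Hypothetical paired data: x is the independent variable, y the dependent one.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# Closed-form least squares for the line y = slope * x + intercept
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# Predictions and the coefficient of determination (R squared)
y_pred = slope * x + intercept
ss_res = np.sum((y - y_pred) ** 2)  # squared error left after the fit
ss_tot = np.sum((y - y_mean) ** 2)  # squared error of predicting the mean
r_squared = 1 - ss_res / ss_tot

print("slope:", slope)
print("intercept:", intercept)
print("R squared:", r_squared)
```

An R squared value close to 1 means the line explains most of the variation in y.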
Residuals Plot:
A residual plot displays the fitted values against the residual values of a regression model. This
type of plot is often used to assess whether a linear regression model is appropriate for a given
dataset and to check the residuals for heteroscedasticity.
Residuals, in the context of regression models, are the difference between the observed value of
the target variable (y) and the predicted value (ŷ), i.e. the error of the prediction. The residuals
plot shows the residuals on the vertical axis and the predicted values of the dependent variable on
the horizontal axis, allowing you to detect regions within the target that may be susceptible to
more or less error.
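A residuals plot of this kind can be sketched with NumPy and Matplotlib. The data below are synthetic (a straight line plus random noise, loosely shaped like an experience versus salary relation), and the output file name residuals_plot.png is an arbitrary choice; both are assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data (an assumption): a linear trend plus random noise.
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 30)
y = 9000 * x + 26000 + rng.normal(0, 5000, size=x.size)

# Fit a straight line by least squares and compute the residuals
slope, intercept = np.polyfit(x, y, 1)
fitted = slope * x + intercept
residuals = y - fitted  # observed minus predicted

# Residuals on the vertical axis, fitted values on the horizontal axis
fig, ax = plt.subplots()
ax.scatter(fitted, residuals)
ax.axhline(0, linestyle="--")  # reference line at zero error
ax.set_xlabel("Fitted value")
ax.set_ylabel("Residual (observed - predicted)")
ax.set_title("Residuals plot")
fig.savefig("residuals_plot.png")
```

If the points scatter evenly around the zero line with no visible pattern, a linear model is a reasonable choice; a funnel shape suggests heteroscedasticity.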
Visualizer: ResidualsPlot
import pandas as pd
# Reading dataset
data = pd.read_csv("/content/salary.csv")
data.head()
output :
   YearsExperience   Salary
0              1.1  39343.0
1              1.3  46205.0
2              1.5  37731.0
3              2.0  43525.0
4              2.2  39891.0
output :
0.9569566641435086
Result:
EXPERIMENT - 11
Question: Implement the Naïve Bayesian classifier for a sample training data set stored as a
.CSV file. Compute the accuracy of the classifier, considering a few test data sets.
AIM:
To implement the Naïve Bayesian classifier for a sample training data set stored as a
.CSV file and compute the accuracy of the classifier, considering a few test data sets.
ALGORITHM:
Step 5: Split the dataset into 40% testing data and 60% training data.
Step 7: Import the metrics module and compute the accuracy_score of the model.
Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem.
It is not a single algorithm but a family of algorithms where all of them share a common
principle, i.e. every pair of features being classified is independent of each other.
Bayes’ Theorem
Bayes' Theorem finds the probability of an event occurring given the probability of another
event that has already occurred. Bayes' theorem is stated mathematically as the following
equation:
P(y | X) = P(X | y) P(y) / P(X)
where y is the class variable and X is a dependent feature vector (of size n):
X = (x1, x2, ..., xn)
Just to be clear, an example of a feature vector and corresponding class variable can be: (refer to
the 1st row of the dataset)
X = (Rainy, Hot, High, False)
y = No
So basically, P(y|X) here means, the probability of “Not playing golf” given that the weather
conditions are “Rainy outlook”, “Temperature is hot”, “high humidity” and “no wind”.
Naive assumption
Now, it's time to add a naive assumption to Bayes' theorem: independence among
the features. So now, we split the evidence into its independent parts.
Now, if any two events A and B are independent, then,
P(A,B) = P(A)P(B)
Hence, we reach the result:
P(y | x1, ..., xn) = P(x1 | y) P(x2 | y) ... P(xn | y) P(y) / (P(x1) P(x2) ... P(xn))
Since the denominator is the same for every class, it is enough to compare the numerators
P(y) P(x1 | y) ... P(xn | y) across the classes.
So, finally, we are left with the task of calculating P(y) and P(xi | y).
Please note that P(y) is also called the class probability and P(xi | y) is called the conditional
probability.
The different naive Bayes classifiers differ mainly by the assumptions they make regarding the
distribution of P(xi | y).
Let us try to apply the above formula manually on our weather dataset. For this, we need to do
some precomputations on our dataset.
We need to find P(xi | yj) for each xi in X and yj in y. All these calculations have been
demonstrated in the tables below:
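The same kind of manual computation can be sketched in plain Python. The six records below are a hypothetical toy table, not the manual's weather dataset, and no smoothing is applied, so a feature value never seen with a class gives that class a probability of zero.

```python
from collections import Counter

# Hypothetical toy records: ((outlook, wind), play). Not the manual's dataset.
data = [
    (("Sunny", "Calm"), "Yes"),
    (("Sunny", "Windy"), "No"),
    (("Rainy", "Calm"), "No"),
    (("Rainy", "Windy"), "No"),
    (("Overcast", "Calm"), "Yes"),
    (("Sunny", "Calm"), "Yes"),
]

# Precompute the counts needed for P(y) and P(xi | y)
class_counts = Counter(label for _, label in data)
feature_counts = Counter()
for features, label in data:
    for i, value in enumerate(features):
        feature_counts[(i, value, label)] += 1


def prior(label):
    """Class probability P(y)."""
    return class_counts[label] / len(data)


def conditional(i, value, label):
    """Conditional probability P(xi = value | y = label)."""
    return feature_counts[(i, value, label)] / class_counts[label]


def score(features, label):
    """P(y) times the product of P(xi | y); proportional to P(y | X)."""
    result = prior(label)
    for i, value in enumerate(features):
        result *= conditional(i, value, label)
    return result


query = ("Sunny", "Calm")
scores = {label: score(query, label) for label in class_counts}
total = sum(scores.values())
# Normalizing the scores so they sum to 1 gives the class probabilities
posteriors = {label: s / total for label, s in scores.items()}
prediction = max(posteriors, key=posteriors.get)
print(posteriors, prediction)
```

The scores are normalized by their sum rather than divided by the evidence term, which is enough because only the relative size of the scores decides the predicted class.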
CODE:
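The manual's listing is not reproduced here. The following is a minimal sketch that assumes scikit-learn's GaussianNB and its bundled iris data in place of the .CSV file; the 60:40 split and accuracy_score follow Steps 5 and 7 above, while the choice of dataset, the Gaussian variant and random_state=1 are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Assumed stand-in for the CSV file: the iris data bundled with scikit-learn.
X, y = load_iris(return_X_y=True)

# Step 5: 40% of the rows for testing, 60% for training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)

# Train a Gaussian naive Bayes model and predict the test labels
model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Step 7: accuracy is the fraction of test rows predicted correctly
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")
```

For data held in a CSV file, the load_iris call would be replaced by pd.read_csv followed by a split into feature columns and a label column.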
EXPERIMENT - 12
Question: Implement k-Nearest Neighbour algorithm to classify the iris data set. Print both correct
and wrong predictions using Python ML library classes.
AIM:
To implement the k-Nearest Neighbour algorithm to classify the iris data set and print both
correct and wrong predictions using Python ML library classes.
ALGORITHM:
Step 4: Split the data in an 80:20 ratio, 80% for training and the remaining 20% for testing.
SOURCE CODE:
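# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
# Generate sample data. The manual omits this call, so the argument
# values below are assumptions chosen only to make the script runnable.
X, y = make_blobs(n_samples=500, n_features=2, centers=4, cluster_std=1.5, random_state=4)
# The 'seaborn' style was renamed 'seaborn-v0_8' in matplotlib 3.6
plt.style.use('seaborn-v0_8')
plt.figure(figsize=(10, 10))
# Scatter plot of the two features, coloured by class label
plt.scatter(X[:, 0], X[:, 1], c=y, marker='*', s=100, edgecolors='black')
plt.show()
# Split into training and test sets (default 75:25 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)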
KNN ALGORITHM
Introduction
The K-NN algorithm can be used for regression as well as classification, but it is mostly used
for classification problems.
The following two properties define KNN well:
1. Lazy learning algorithm − KNN is a lazy learning algorithm because it does not have a
specialized training phase and uses all of the training data at classification time.
2. Non-parametric learning algorithm − KNN is also a non-parametric learning algorithm
because it doesn't assume anything about the underlying data.
At the training phase the KNN algorithm just stores the dataset; when it gets new data,
it assigns that data to the category whose stored examples are most similar to it.
Example: Suppose we have an image of a creature that looks similar to both a cat and a dog,
and we want to know whether it is a cat or a dog. For this identification we can use the KNN
algorithm, as it works on a similarity measure. Our KNN model compares the features of the
new image with those of the cat and dog images and, based on the most similar features,
puts it in either the cat or the dog category.
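Applied to the iris data set named in the question, the idea can be sketched as follows. This assumes scikit-learn's bundled iris data and KNeighborsClassifier; n_neighbors=5 and random_state=0 are arbitrary choices, not values taken from the manual.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# 80:20 split: 80% of the rows for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0
)

# "Training" only stores the data; the work happens at prediction time
knn = KNeighborsClassifier(n_neighbors=5)  # k = 5 is an assumed value
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Separate the correct predictions from the wrong ones
correct, wrong = [], []
for features, actual, predicted in zip(X_test, y_test, y_pred):
    record = (
        features.tolist(),
        str(iris.target_names[actual]),
        str(iris.target_names[predicted]),
    )
    (correct if actual == predicted else wrong).append(record)

print("Correct predictions (features, actual, predicted):")
for record in correct:
    print(record)
print("Wrong predictions (features, actual, predicted):")
for record in wrong:
    print(record)

accuracy = len(correct) / len(y_test)
print(f"Accuracy: {accuracy:.3f}")
```

Trying several values of n_neighbors and comparing the accuracy is a simple way to see how k affects the result.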
EXPERIMENT - 13
Question: Implement k-Means clustering algorithm to cluster the set of data stored in .CSV file.
Compare the results of various “k” values for the quality of clustering.
AIM:
To implement the k-Means clustering algorithm to cluster the set of data stored in a .CSV file and
compare the results of various “k” values for the quality of clustering.
ALGORITHM:
predefined k clusters.
Step 4: Calculate the variance and place a new centroid for each cluster.
Step 5: Repeat step 3, that is, reassign each data point to the new closest
centroid.
Step 6: If any reassignment occurs, go to step 4; otherwise finish.
Step 7: The model is ready.
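The comparison of several “k” values asked for in the question can be sketched with scikit-learn. The data here are synthetic blobs from make_blobs standing in for the .CSV file, and inertia and the silhouette score are two commonly used quality measures chosen for illustration; the manual does not say which measure to use.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the CSV data (an assumption): 300 points, 3 blobs.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

results = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = model.fit_predict(X)
    # inertia_: sum of squared distances of points to their closest centroid
    # silhouette: between -1 and 1, higher means better separated clusters
    results[k] = (model.inertia_, silhouette_score(X, labels))
    print(f"k={k}  inertia={results[k][0]:.1f}  silhouette={results[k][1]:.3f}")

# Pick the k with the highest silhouette score
best_k = max(results, key=lambda k: results[k][1])
print("Best k by silhouette score:", best_k)
```

Inertia tends to keep falling as k grows, so it is usually read with the elbow method, while the silhouette score can be compared directly across k values.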