Revised as per CBSE Curriculum, 2021-22
Informatics Practices
Class XII (CBSE Board)
UNIT-1: Data Handling using Pandas & Data Visualization
Chapter 3: Data Handling using Pandas-II
Descriptive Statistics & Advanced Operations on DataFrames
Visit: www.ip4you.blogspot.com for more...
Open Teaching-Learning Material
Kendriya Vidyalaya Khanapara
e-mail: [email protected]

Expected Learning Outcome:
CBSE Syllabus (2021-22) covered in this presentation:
- Descriptive Statistics: max, min, count, sum, mean, median, mode, quartile, standard deviation, variance.
- DataFrame operations: Aggregation, group by, Sorting, Deleting and Renaming Index, Pivoting.
- Handling missing values - dropping and filling.
- Importing/Exporting data between a MySQL database and Pandas.
In this presentation you will learn about data handling using Pandas and its basic concepts like...
- What are Descriptive Statistics?
- Applying various statistical functions on a DataFrame.
- Advanced operations on DataFrames: Aggregation, Grouping, Sorting, Pivoting etc.
- How to handle missing values in a DataFrame: Dropping & Filling.
- Importing/Exporting data between a MySQL database and a Pandas DataFrame.

Introduction to Descriptive Statistics
- Statistics, being a branch of Mathematics, deals with the collection, organization, analysis and interpretation of data.
- Descriptive statistics involves summarizing and organizing bulk data so that interpretation or analysis becomes easy. In other words, it refers to the methods used to get a statistical idea about the data.
- Pandas provides various statistical functions for data analysis, which can be applied on data sets.
- Apart from statistical functions, Pandas also provides some aggregate functions to summarize data.
- These functions/methods can be applied on rows or columns of a DataFrame to get summarized or aggregated values.

Pandas Aggregate Functions
Aggregate Functions | Purpose
count()   | Counts the total number of non-NaN values.
sum()     | Calculates the sum of a given set of numbers.
min()     | Finds the minimum value of a given data set.
max()     | Finds the maximum value of a given data set.

Statistical Functions | Purpose
mean()     | Calculates the arithmetic mean (average) of a given set of numbers.
median()   | Calculates the median (middle value) of a given set of numbers.
mode()     | Calculates the mode (most repeated value) of a given set of numbers.
std()      | Calculates the standard deviation of a given set of numbers.
var()      | Calculates the variance of a given set of numbers.
quantile() | Calculates the quantiles (e.g. quartiles) of a given set of numbers.

How to apply Aggregate & Statistical functions
Consider the following data set of students' scores in the Unit Test (UT), Half Yearly (HY) and Session Ending Exam (SEE) of three students in five subjects.

import pandas as pd
marks={'Name':['Amar','Amar','Amar','Akbar','Akbar','Akbar','Anthony','Anthony','Anthony'],
       'Exam':['UT','HY','SEE','UT','HY','SEE','UT','HY','SEE'],
       'English':[22,21,14,20,23,22,23,33,24],
       'Hindi':[21,22,19,25,35,28,39,24,41],
       'Science':[18,17,15,22,21,19,20,24,23],
       'Maths':[20,22,24,24,25,23,15,33,32],
       'SST':[21,24,23,19,15,13,22,34,32]}
df= pd.DataFrame(marks)
print(df)
We can apply aggregate and statistical functions on a DataFrame row-wise (axis=1) or column-wise (axis=0).

How to apply Aggregate & Statistical functions
Aggregate functions can be applied on a DataFrame in the following ways, as per requirement (see the sketch after this table):

Application                       | How to apply                                        | Remarks
On the whole DataFrame            | DF.function()  e.g. df.sum()                        | Default axis is 0, so the function is applied along rows, on all columns.
On selected column(s)             | DF['Col'].function()  e.g. df['Science'].sum()      | Applied on the selected column(s) only.
On selected rows                  | DF.iloc[<rows>].function()  e.g. df.iloc[2:6].sum() | Applied on rows selected using the loc/iloc method.
On a subset of the DataFrame      | DF.iloc[<rows>, <cols>].function()  e.g. df.iloc[3:5,2:6].sum() | Applied on rows and columns selected using the loc/iloc method.
On a conditional selection of rows| DF[<condition>].function()  e.g. df[df['Exam']=='UT'].sum() | Applied on the set of rows filtered by the condition.
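A minimal sketch showing each of the above ways of applying an aggregate function, assuming the marks DataFrame built earlier; numeric_only=True is added here (it is not on the slide) so that the string columns Name and Exam are skipped:

# All calls below assume df created from the marks dictionary above
print(df.sum(numeric_only=True))                     # whole DataFrame, column-wise totals
print(df['Science'].sum())                           # a single selected column
print(df.iloc[2:6].sum(numeric_only=True))           # selected rows (positions 2 to 5)
print(df.iloc[3:5, 2:6].sum())                       # subset of rows and columns
print(df[df['Exam']=='UT'].sum(numeric_only=True))   # rows filtered on a condition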
Note: You can give the axis=1 setting with these aggregate functions for row-wise (along row) application on the DataFrame.

Aggregate Function: count()
count():
Pandas count() is used to count the number of non-NaN values of a DataFrame or Series. It works with strings also.

<DF>.count(axis=0/1, numeric_only=True/False)
axis=0 for column-wise and 1 for row-wise. Default is 0.
numeric_only=True: ignores non-numeric columns. False is the default.
# counting values column-wise (axis=0)
print(df.count())
# counting values for numeric columns only
print(df.count(numeric_only=True))
# counting values row-wise (axis=1)
print(df.count(axis=1))
# counting values of the 'English' column
print(df['English'].count())
# counting values having more than 20 marks in Science
df1=df[df['Science']>20].count()
print(df1['Science'])
Aggregate Function: sum()
sum():
Pandas sum() is used to get the total of non-null values of a DataFrame or Series. For string data types it concatenates the strings.

<DF>.sum(axis=0/1, skipna=True/False, numeric_only=True/False)
axis=0 for column-wise and 1 for row-wise. Default is 0.
skipna=True: ignores NaN values. Default is True.
numeric_only=True: ignores non-numeric columns.
# column-wise sum of numeric columns only
print(df.sum(numeric_only=True))
# row-wise sum (axis=1)
print(df.sum(axis=1))
# sum of 'English' column
print(df['English'].sum())
# sum of 'Hindi' to 'Maths' columns
print(df.loc[:, 'Hindi':'Maths'].sum())

Aggregate Function: min()
min():
Pandas min() is used to get the minimum values of a DataFrame or Series.

<DF>.min(axis=0/1, skipna=True/False, numeric_only=True/False)
axis=0 for column-wise and 1 for row-wise. Default is 0.
skipna=True: ignores NaN values. Default is True.
numeric_only=True: ignores non-numeric columns.
# column-wise minimum (axis=0)
print(df.min())
# column-wise minimum of numeric columns only
print(df.min(numeric_only=True))
# Student-wise (row-wise) minimum marks
print(df.min(axis=1))
# Minimum in 'English' subject
print(df['English'].min())
# Minimum from 'Hindi' to 'Maths' subjects
print(df.loc[:, 'Hindi':'Maths'].min())
Aggregate Function: max()
max():
Pandas max() is used to get the maximum values of a DataFrame or Series.

<DF>.max(axis=0/1, skipna=True/False, numeric_only=True/False)
axis=0 for column-wise and 1 for row-wise. Default is 0.
skipna=True: ignores NaN values. Default is True.
numeric_only=True: ignores non-numeric columns.
# column-wise maximum (axis=0)
print(df.max(axis=0))
# column-wise maximum of numeric columns only
print(df.max(numeric_only=True))
# Student-wise (row-wise) maximum marks
print(df.max(axis=1))
# Maximum in 'Science' subject
print(df['Science'].max())
# Maximum from 'Hindi' to 'Maths' subjects
print(df.loc[:, 'Hindi':'Maths'].max())
Statistical Function: mean()
mean():
mean() is used to get the arithmetic mean (average) of a DataFrame or Series.

<DF>.mean(axis=0/1, skipna=True/False, numeric_only=True/False)
axis=0 for column-wise and 1 for row-wise. Default is 0.
skipna=True: ignores NaN values. Default is True.
numeric_only=True: ignores non-numeric columns.
# column-wise average (axis=0)
print(df.mean(numeric_only=True))
# Student-wise average (axis=1)
print(df.mean(axis=1, numeric_only=True))
# Average of 'Science' subject
print(df['Science'].mean())
# Average of Amar in all exams
print(df[df['Name']=='Amar'].mean(numeric_only=True))
Statistical Function: median()
median(): Middle value in a sorted list or data set.
The median() function is used to calculate the median or middle value of a given set of numbers. The data set can be a Series or a DataFrame. For an even-sized data set, the median is the average of the two middle values.
For example, if the total size is 5 (odd), the median is the middle value, e.g. 8; if the total size is 6 (even), the median is the average of the two middle values, e.g. (8+10)/2 = 9.
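A small illustrative sketch of the odd/even case (the values here are assumed for illustration, not taken from the marks data):

import pandas as pd
s_odd  = pd.Series([2, 5, 8, 10, 12])        # 5 values -> median is the middle value, 8
s_even = pd.Series([2, 5, 8, 10, 12, 15])    # 6 values -> median is the average of 8 and 10
print(s_odd.median())    # 8.0
print(s_even.median())   # 9.0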
# column-wise median (axis=0)
print(df.median(numeric_only=True))
# OR print(df.median(axis=0, numeric_only=True))
# Student-wise (along row) median (axis=1)
print(df.median(axis=1, numeric_only=True))
# Median of 'Science' column
print(df['Science'].median())
Visit www.ip4you.blogspot.com for more teaching-learning materials...

Statistical Function: mode()
mode(): Most frequent value in the data set.
The mode() function is used to calculate the mode, i.e. the most repeated value, of a given data set as a Series or DataFrame. If more than one value occurs with the same highest frequency, all such values are reported as modes.
For example, if the frequency of both 6 and 10 is 2, the modes will be 6 and 10.
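A small illustrative sketch (values assumed for illustration):

import pandas as pd
s = pd.Series([6, 3, 10, 6, 8, 10])   # 6 and 10 both appear twice
print(s.mode())                       # returns a Series containing both modes: 6 and 10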
<DF>.mode(axis=0/1, numeric_only=True/False, dropna=True/False)
[Slide shows a sample DataFrame and the output of df.mode(axis=0, numeric_only=True): the Hindi column has 24 and 35 with equal highest frequency, so both appear (as two rows) in the result, with NaN filling the second row of the other columns.]
# Subject-wise most frequent marks
print(df.mode(axis=0, numeric_only=True))
# Row-wise most occurring marks
print(df.mode(axis=1, numeric_only=True))

Statistical Function: std()
std(): Computing spread around the mean.
Standard deviation is a quantity which tells the variation (difference) of the members of a data set from its mean value. A low standard deviation indicates that the values are close to the mean of the data set, while a high standard deviation indicates that the values are spread out over a wider range.
The std() function calculates the standard deviation of a given set of numbers as a Series or DataFrame.
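A small illustrative sketch comparing a low-spread and a high-spread set of values (data assumed for illustration):

import pandas as pd
low  = pd.Series([19, 20, 20, 21])    # values close to the mean (20)
high = pd.Series([5, 15, 25, 35])     # values spread far from the mean (20)
print(low.std())     # small standard deviation (about 0.82)
print(high.std())    # large standard deviation (about 12.91)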
<DF>.std(axis=0/1, skipna=True/False, numeric_only=True/False)
[Slide output: column-wise standard deviations, e.g. English 4.26, Hindi 7.60, Science 2.35, Maths 4.71, SST 4.56, followed by the row-wise standard deviations.]
# Subject-wise (column) standard deviation
print(df.std(axis=0, numeric_only=True))
# Row-wise standard deviation
print(df.std(axis=1, numeric_only=True))
Statistical Function: var()
var(): Computing the variability of a group from its mean.
Variance is a numerical value that describes the variability of observations from their arithmetic mean; like the standard deviation, it tells how far the individuals in a group are spread out from the mean. Variance is nothing but the average of squared deviations (and the standard deviation is its square root). The var() function calculates the variance of a given set of numbers as a Series or DataFrame.
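A small sketch showing the relation between var() and std() (data assumed for illustration):

import pandas as pd
s = pd.Series([5, 15, 25, 35])
print(s.var())        # 166.67 (average of squared deviations, with ddof=1)
print(s.std())        # 12.91
print(s.std() ** 2)   # equals s.var(): variance is the square of the standard deviation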
<DF>.var(axis=0/1, skipna=True/False, numeric_only=True/False)
# Row-wise variance
print(df.var(axis=1, numeric_only=True))
# Column-wise variance
print(df.var(axis=0, numeric_only=True))
[Slide output: column-wise variances, e.g. Hindi 57.78, Science 5.50, Maths 22.19, SST 20.75.]

Statistical Function: quantile()
What is a Quantile?
Consider a data set and its median (middle value): the median divides the data set into 2 equal parts, with 50% of the values less than the median and 50% of the values greater than the median.
So, if we divide the data set into 4 equal parts of 25% each, the cut points are called Quartiles (quarters), and if into 5 parts of 20% each, they are called Quintiles.
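A small sketch computing quartile cut points of a simple data set (values assumed for illustration):

import pandas as pd
s = pd.Series([2, 4, 6, 8, 10, 12, 14, 16])
# q=0.25, 0.50 and 0.75 give the three quartile cut points; q=0.5 is the median
print(s.quantile(0.25))                  # 5.5
print(s.quantile(0.50))                  # 9.0
print(s.quantile([0.25, 0.50, 0.75]))    # all three at once, as a Series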
[Figure: the data set split into four parts of 25% each, corresponding to q = 0.25, 0.50, 0.75 and 1.0.]
When a data set is divided into 100 parts (1% each), the cut points are called Percentiles.

Statistical Function: quantile()
quantile(): Dividing a data set into equal proportions.
The Pandas quantile() function is used to get the quantiles. It gives the quantile of each column or row of the DataFrame. By default, it returns the second quartile (the median) of all numeric values.

<DF>.quantile(q=<values>, axis=0/1, numeric_only=True/False)
q=<values>: quantile value(s) such as 0.25/0.5/0.75/1.0. Default is 0.5.
axis=0 for column-wise and 1 for row-wise. Default is 0.
numeric_only=True: ignores non-numeric columns.
# Column-wise quartile (q=0.5) on axis=0
print(df.quantile(numeric_only=True))
# Row-wise quartile (q=0.5)
print(df.quantile(q=0.5, axis=1, numeric_only=True))
# Column-wise all four quartiles
print(df.quantile(q=[0.25, 0.50, 0.75, 1.0], axis=0, numeric_only=True))
[Slide output: the quartile table for each subject, with an annotation interpreting the 0.75 quartile of the SST marks.]

Other Statistical Functions:
describe(): All-in-one descriptive statistics function.
The Pandas describe() function is used to get descriptive statistical analysis like count, mean, standard deviation, minimum, maximum, quantiles etc. through one function call.
For columns having string data types, describe() returns the count of values, the number of unique entries and the most frequent string (see the sketch after the example below).
import pandas as pd
dct={'Phy':[60,65,70,67],
     'Chem':[34,55,32,46],
     'Maths':[45,56,65,75]}
df=pd.DataFrame(dct,
     index=['Amar','Akbar','Anthony','Manpreet'])
print(df.describe())

[Slide output: count, mean, std, min, 25%, 50%, 75% and max for each of the Phy, Chem and Maths columns.]
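A minimal sketch of describe() on string data, assuming a small hypothetical DataFrame with an object (string) column (not part of the original slide data):

import pandas as pd
sdf = pd.DataFrame({'Grade': ['A', 'B', 'A', 'C']})
# For object (string) columns, describe() reports count, unique, top (most frequent) and freq
print(sdf['Grade'].describe())
# count     4
# unique    3
# top       A
# freq      2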
Visit www.ip4you.blogspot.com for more teaching-learning materials...

Advanced operations on DataFrame
Pandas offers some advanced operations which can be applied on a DataFrame for summarizing, ordering and reshaping data as per need. These operations include the following:
- Aggregation - summarizing data
- Handling Missing Data - removing and managing missing values
- Pivoting - reshaping data

Visit www.ip4you.blogspot.com for more teaching-learning materials...

Advanced operations: Aggregation
aggregate() or agg(): Applying multiple aggregate functions.
The term 'Aggregation' refers to summarization of a data set to provide a single numeric value. The aggregate functions like count(), sum(), min(), max(), median(), mode() etc., as discussed earlier, are used to get a single value in terms of count, sum, minimum etc.
Sometimes it is required to apply multiple aggregate functions on the same data set. Pandas offers the aggregate() or agg() function to apply one or more aggregate functions at a time on the same data set.

<DF>.aggregate(<function or list of functions>, axis=0/1)
# refer to the DataFrame (Phy/Chem/Maths) created earlier
# Applying max() function on axis=0
print(df.aggregate('max'))    # same as df.max()
# Applying max() & sum() functions on axis=1
print(df.aggregate(['max','sum'], axis=1))
# Applying max() & sum() functions on axis=0
print(df.agg(['max','sum']))
# Applying max() & min() on 'Phy' column
print(df['Phy'].aggregate(['max','min']))

Advanced operations: Sorting
Sorting of DataFrames:
The term 'Sorting' refers to arranging a data set in a specified order, i.e. ascending (low to high) or descending (high to low). A DataFrame can be sorted on the basis of the values of columns or on row/column labels. By default, sorting is done on the row index (label) in ascending order.
Pandas provides the following two methods for sorting DataFrames:

sort_values()  Sorts the data frame in ascending or descending order as per the values of the specified column(s).
sort_index()   Sorts the data frame in ascending or descending order as per labels, i.e. row indexes (axis=0) or column indexes (axis=1).

These functions return a new sorted DataFrame; to sort in place you can give the inplace=True argument.

Advanced operations: Sorting
sort_values(): Sorting on values of columns.
The sort_values() function sorts the data frame as per the values of the given column(s).

<DF>.sort_values(by=<column(s)>, axis=0/1, ascending=True/False)
by=<column(s)>: refers to the column(s) used for sorting; multiple columns are given as a list.
axis=0 for sorting rows and 1 for sorting columns.
ascending=True for ascending order and False for descending order.

[Slide shows the original Phy/Chem/Maths DataFrame for reference.]
print(df.sort_values('Chem'))
print(df.sort_values(by='Akbar', axis=1))
print(df.sort_values('Phy', ascending=False))
print(df.sort_values(['Phy','Chem']))
Chem’ Maths
Anthony 32 65
Akbar 55 56
Manpreet 46 15
Amar 34 45
For same marks of Phy (Primary col.), record are arranged as per marks of Chem (secondary col.)Advanced operations: Sorting
sort_index(): Sorting on indexes/labels.
The sort_index() function sorts the data frame as per the labels of the given axis.

<DF>.sort_index(axis=0/1, ascending=True/False)
axis=0 for sorting by row labels and 1 for sorting by column labels.
ascending=True for ascending order and False for descending order.
print(df.sort_index())
print(df.sort_index(ascending=False))
print(df.sort_index(axis=1))
print(df.sort_index(axis=1, ascending=False))
[Slide shows the outputs of the above sort_index() calls on the Phy/Chem/Maths DataFrame.]

Advanced operations: Reindexing
reindex(): Reordering of rows and columns.
As you know, row indexes and column indexes can be arranged in ascending or descending order by using the sort_index() function. But if you want to arrange rows or columns in any user-defined order, then the reindex() method can be used.

<DF>.reindex(index=[<row order>], columns=[<column order>])
# Reindexing rows only
print(df.reindex(index=[2,4,3,1]))
# Reindexing rows and columns both
print(df.reindex(index=[2,4,3,1],
      columns=['Eng','Chem','Phy','Maths']))
[Slide output: rows reordered as 2, 4, 3, 1 (Akbar, Manpreet, Anthony, Amar) and columns reordered as Eng, Chem, Phy, Maths, demonstrating reordering of rows & columns.]

Advanced operations: Grouping
What is Grouping?
Grouping is a process to classify records (subsets of records) on some criteria for group-wise operations.
Consider the following data collection of marks in the Unit Test, Half Yearly and Session Ending Exam of three students, and think about the following requirements:
- Prepare a student-wise report with the total of marks in all exams.
- Prepare an exam-wise report with the mean of marks scored by students.

[Slide shows the two reports: student-wise totals (Amar 944, Akbar 916, Anthony 912) and exam-wise means (UT 314.67, HY 305.33, ...).]

You may observe that these reports require applying an aggregate function (sum/mean) on student-wise/exam-wise groups (subsets) of records instead of the whole data frame.

Advanced operations: Grouping
Grouping of Records
The Pandas groupby() function is used to create groups (based on some criteria) and then apply aggregate functions on the groups, instead of the whole data set. groupby() works on a split-apply-combine strategy (see the sketch after this list):
- Split the original DataFrame into groups by creating a GroupBy object.
- Apply the required aggregate function on each group.
- Combine the results to form a new DataFrame.
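A minimal sketch of the split-apply-combine steps, assuming the small Name/Exam/Marks table shown on this slide (this DataFrame is not built elsewhere in the deck):

import pandas as pd
marks_df = pd.DataFrame({
    'Name':  ['Amar','Amar','Amar','Akbar','Akbar','Akbar','Anthony','Anthony','Anthony'],
    'Exam':  ['UT','HY','SEE']*3,
    'Marks': [350, 304, 290, 346, 310, 260, 295, 312, 305]})
# Split into groups on Name, apply sum() on the Marks of each group, combine into one result
print(marks_df.groupby('Name')['Marks'].sum())
# Akbar      916
# Amar       944
# Anthony    912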
[Slide figure: the Name/Exam/Marks table is split into three groups (Amar, Akbar, Anthony), sum() is applied on each group, and the results are combined into a Name/Total Marks table: Amar 944, Akbar 916, Anthony 912.]

Visit www.ip4you.blogspot.com for more teaching-learning materials...

Advanced operations: Grouping
groupby(): Grouping of records

<GroupObject>=<DF>.groupby(by=<column(s)>, axis=0/1)

The Pandas groupby() function creates groups internally, which can be stored in a GroupBy object. Once the group object has been created, we can apply the following commonly used functions on it to get more information about the groups:

<GroupObj>.groups                     Displays the list of groups created.
<GroupObj>.get_group(<value>)         Displays the group for the given value.
<GroupObj>[<column>].<agg function>   Applies the given aggregate function on each group and displays the results.
<GroupObj>.count()                    Counts the non-NaN values of each column of each group.
<GroupObj>.size()                     Displays the size of each group.
<GroupObj>.first()                    Displays the first row of each group.

Advanced operations: Grouping
groupby(): Example

go=df.groupby('Name')        # df here is the Name/Exam/Marks DataFrame shown on the slide
print(go.groups)             # e.g. {'Akbar': [3, 4, 5], 'Amar': [0, 1, 2], 'Anthony': [6, 7, 8]}
print(go.size())
print(go.count())
print(go.get_group("Amar"))
print(go['Marks'].sum())     # Akbar 916, Amar 944, Anthony 912

# Multiple aggregate functions can also be applied on the group object.
# You can also write df.groupby('Name') directly in place of go.
print(df.groupby('Name')['Marks'].sum())
print(df.groupby('Name')['Marks'].aggregate(['sum','mean','std']))
# Output: sum, mean and std for each Name, e.g. Akbar 916, 305.33, 43.19

Advanced operations: Missing Data
Handling Missing Data:
A DataFrame may contain multiple rows, and each row can have multiple values for the corresponding columns. If a value corresponding to a column is not available, it is considered a missing value and is denoted by the None or NaN (Not a Number) constant.
In the real world, during data collection, some columns may not be applicable to some individuals; e.g. a salary column is irrelevant for an unemployed person and may be left blank. This missing data may generate misleading/inaccurate information during data analysis and should be handled properly. NaN values may be deleted or replaced by some relevant values to get a correct analysis.
Pandas provides the following methods for handling NaN values:

isnull()   Checks which rows or columns have NaN values.
dropna()   Drops (deletes) the rows/columns having NaN (missing) values.
fillna()   Fills some estimated value in place of NaN (missing) values.

Advanced operations: Missing Data
isnull(): Checking for NaN (missing) values.
The Pandas isnull() function returns True wherever a value is NaN (missing), otherwise it returns False. It can be applied on a Series, a DataFrame or a column of a DataFrame. We can also count the NaN values by chaining the sum() method with isnull().
import pandas as pd
import numpy as np
dct={'Name':['Amar','Akbar','Anthony','Manpreet'],
     'Phy':[60,None,70,45],
     'Chem':[68,65,None,56],
     'Maths':[45,56,65,65],
     'Eng':[45,np.nan,64,np.nan]}
df=pd.DataFrame(dct)

print(df)
print(df.isnull())
print(df['Eng'].isnull())
# getting count/sum of NaN values
print(df.isnull().sum())         # column-wise count of NaN values
print(df.isnull().sum().sum())   # total NaN values in the DataFrame -> 4

Advanced operations: Missing Data
dropna(): Deleting rows/columns having NaN values.
The Pandas dropna() function deletes rows or columns containing NaN (missing) values.

<DF>.dropna(axis=0/1, how='any'/'all')
axis=0 deletes rows and axis=1 deletes columns.
how='any': deletes a row/column if any NaN value is found. With how='all', a row/column is deleted only when all of its values are NaN.
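A small sketch contrasting how='any' and how='all', assuming a row made entirely of NaN values (added here only for illustration):

import pandas as pd
import numpy as np
d = pd.DataFrame({'Phy':[60, np.nan, np.nan],
                  'Chem':[68, 65, np.nan]})
print(d.dropna(how='any'))   # keeps only row 0, which has no NaN at all
print(d.dropna(how='all'))   # drops only row 2, where every value is NaN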
In the example DataFrame there is no row or column whose values are all NaN, so the how='all' setting has no effect here.

print(df.dropna())
# OR (axis=0 for row deletion)
print(df.dropna(axis=0, how='any'))

print(df.dropna(axis=1))
# OR (axis=1 for column deletion)
print(df.dropna(axis=1, how='any'))

[Slide output: row deletion keeps only the row with no NaN (Amar); column deletion keeps only the columns with no NaN (Name and Maths).]

Advanced operations: Missing Data
fillna(): Replacing NaN values with other values.
The Pandas fillna() function replaces NaN values with any other specified values. You can also define different values for different columns by providing a filler dictionary.

<DF>.fillna(<number> | <filler dictionary>)
<number>: fills all NaN values with the given value.
<filler dictionary>: fills values as per the given dictionary of column:value pairs.
[Slide example: a DataFrame with columns Name, Age, City and Marks containing several NaN values.]

print(df.fillna(0))
# fills every NaN with 0

print(df.fillna({'Name':'NA',
                 'Age':0,
                 'City':'Guwahati',
                 'Marks':20}))
# fills the NaN values in each column with the value given for it in the dictionary

Visit www.ip4you.blogspot.com for more teaching-learning materials...

Advanced operations: Pivoting
Pivoting of DataFrame:
Pivoting is a process of reshaping/rearranging data by rotating rows and columns, and aggregating values, to present a summarized report in a different view for the user.
Consider the following sales data of an electronics shop and two reports (Brand-wise & Item-wise) generated from the DataFrame:
Item  Brand  Qty
TV    SONY   65
AC    SONY   32
PC    HP     45
TV    LG     25
AC    LG     25
PC    SONY   15
WM    LG     32
WM    SONY   38

[Slide shows the Brand-wise report (Items as columns, Brands as rows) and the Item-wise report (Brands as columns, Items as rows) built from this data.]

To generate the above reshaped & summarized reports,
Pandas offers the following two functions-
pivot()        Reshapes data when a single value is found for each row-column combination.
pivot_table()  Reshapes data when multiple values are found for a row-column combination.

Advanced operations: Pivoting
pivot(): Reshapes data on defined indexes and values.

<DF>.pivot(index=<column>, columns=<column>, values=<column(s)>)
index=<column>: specifies the column which will be placed as the row index of the pivot table.
columns=<column>: specifies the column whose values will be used as the columns of the pivot.
values=<column(s)>: specifies the column(s) whose values will be filled across the rows and columns of the pivot. If the values parameter is not provided, all remaining columns will be used as values. Pandas will use the NaN constant for any missing values.
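The sales DataFrame used in these pivoting examples is not constructed anywhere in the deck; a minimal sketch, assuming a subset of the Item/Brand/Qty rows shown in the table above:

import pandas as pd
sales = pd.DataFrame({'Item': ['TV','AC','PC','TV','AC','PC'],
                      'Brand':['SONY','SONY','HP','LG','LG','SONY'],
                      'Qty':  [65, 32, 45, 25, 25, 15]})
# Brands become the row index, Items become the columns, Qty fills the cells;
# combinations with no data (e.g. HP-TV) appear as NaN
print(sales.pivot(index='Brand', columns='Item', values='Qty'))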
[Slide output: the pivoted table with Brands as rows and Items as columns; missing combinations appear as NaN. You can apply fillna() to replace the NaN values with any number.]

print(df.pivot(index='Brand', columns='Item', values='Qty'))

Advanced operations: Pivoting
pivot_table(): Reshapes data on defined indexes.
Consider the DataFrame shown on this slide for creating a pivot table. Notice that there are multiple values (Qty) for the TV-LG combination. When you try to create the pivot using the pivot() method, Pandas will raise a ValueError, because there are two values for one cell (i.e. 25 and 35) and pivot() cannot decide which value to take.
In such cases Pandas offers the pivot_table() method, which resolves the multiple-values problem using an aggregate function like sum, min, max, mean etc. By default the mean is used when no aggregate function is given.
df.pivot(index='Brand', columns='Item', values='Qty')   # raises ValueError

[Slide figure: the sales data now has a Year column and two TV-LG rows, so the pivot() method is unable to resolve multiple values (25 and 35) for the same row-column combination, while pivot_table() uses an aggregate function to handle the duplicate entries for a cell.]

Advanced operations: Pivoting
pivot_table(): Reshapes data on defined indexes.

<DF>.pivot_table(index=<column>, columns=<column>, values=<column(s)>, aggfunc=<function>)
index=<column>: specifies the column which will be placed as the row index of the pivot table.
columns=<column>: specifies the column whose values will be used as the columns of the pivot.
values=<column(s)>: specifies the column(s) whose values will be filled across the rows and columns of the pivot.
aggfunc=<function>: specifies the aggregate function (sum, min, max, mean etc.) used to handle multiple values. The default is mean.
print(df.pivot_table(index='Brand', columns='Item', values='Qty'))
print(df.pivot_table(index='Brand', columns='Item', values='Qty',
                     aggfunc='sum'))
[Slide output: the pivot tables produced with the default mean and with aggfunc='sum'; the duplicate TV-LG values are averaged in the first table and added in the second.]

Importing & Exporting Data Between
Python Pandas and MySQL
As you are aware, MySQL is a Relational Database Management System (RDBMS) based on Structured Query Language (SQL). A database in MySQL may contain several tables holding data in a 2D structure of rows and columns. Since a Pandas DataFrame is also a 2D structure, we can import/export data between Pandas and a MySQL database. Importing/exporting data through a CSV file is also common, but a CSV file is neither secure nor a direct interface to MySQL.
Python offers various ways to connect to a MySQL database, but the following three methods are very common.
Method-1: Using the pymysql connector - provided for Python; less support for executing parameterized queries.
Method-2: Using the mysql.connector package - provided by Oracle Corp. (MySQL distributor); supports parameterized queries while reading from a MySQL database.
Method-3: Using the sqlalchemy package - SQLAlchemy is an ORM-based add-on toolkit that handles database connections and packages data in a form compatible with Pandas; it can also be used with pymysql/mysql.connector etc.

Method-3 is recommended for handling Pandas-MySQL connectivity as per the syllabus.

Importing & Exporting Data Between
Python Pandas and MySQL
Installing and using the pymysql package library
There are many ways to connect a MySQL database with a Python application using different libraries, but for exchanging data between a Pandas DataFrame and MySQL we need the following two library packages installed in the Python environment.

pymysql     A Python database-connection driver library which handles the connection between Python and the MySQL database.   How to install: pip install pymysql
sqlalchemy  An add-on toolkit used to interact with the MySQL database by creating a connection (engine) from the connection details.   How to install: pip install sqlalchemy

Note: You can import/export data using pymysql only, but using the add-on library sqlalchemy along with pymysql makes the job easier with minimum code/steps.

Pandas offers the following two functions to handle a MySQL database/table:
- read_sql(): reads/imports data from MySQL into a DataFrame.
- to_sql(): writes/exports data from a DataFrame to a MySQL table.

Importing & Exporting Data Between
Python Pandas and MySQL
Steps to Import/Export Data between Pandas & MySQL
After installing the required packages (pymysql & sqlalchemy), the following two steps are required to connect a MySQL database to a Python program.

Step 1: Importing libraries (you can also use mysql.connector in place of pymysql)
1. pandas: for handling Pandas data structures.        import pandas as pd
2. pymysql: the MySQL connection driver library.       import pymysql
3. sqlalchemy: for database connections (engine).      import sqlalchemy

Step 2: Establishing the connection (engine)
sqlalchemy offers the create_engine() method to establish a connection to the MySQL database using the MySQL credentials and database name.

<engine>=sqlalchemy.create_engine('mysql+pymysql://<user>:<password>@localhost/<database>')
<connection>=<engine>.connect()

# Establishing a connection to the school database with MySQL user root (password tiger).
engine= sqlalchemy.create_engine('mysql+pymysql://root:tiger@localhost/school')
con= engine.connect()

Importing & Exporting Data Between
Python Pandas and MySQL
Importing (Reading) records from MySQL to a DataFrame
Pandas offers the read_sql() method to import data from a MySQL table.

<DF>=pandas.read_sql(<SQL query> | <table name>, <connection>)

Suppose the student table of the school database is to be imported. The MySQL user is root and the password is tiger.
# importing packages
import pandas as pd
import pymysql
import sqlalchemy as alc
# Establishing connection to the school database.
engine= alc.create_engine('mysql+pymysql://root:tiger@localhost/school?charset=utf8')
con= engine.connect()
df=pd.read_sql('select * from student', con)
# OR df=pd.read_sql('student', con)

Note: if you are using a MySQL version below 5.6, you must specify the character set (?charset=utf8) with the database name to avoid an unknown character set error.
Visit www.ip4you.blogspot.com for more teaching-learning materials...

Importing & Exporting Data Between
Python Pandas and MySQL
Exporting (Writing) data from a DataFrame to MySQL
Pandas offers the to_sql() method to export data from a DataFrame to a MySQL table.

<DF>.to_sql(<table name>, <connection>, index=True/False, if_exists='fail'/'append'/'replace')
<table name>: refers to the MySQL table in which data is to be written.
<connection>: the connection object created through create_engine().
index=True: by default, the row index of the DataFrame is added as an index column in the MySQL table. With False, the row index is not added.
if_exists=: defines the action to be taken if the table already exists. By default an error is reported ('fail'); you can also append to ('append') or overwrite ('replace') the table.
# Import statements to be written here (pandas, pymysql, sqlalchemy as alc)
dct={'Name':['Amar','Akbar','Anthony'],
     'Total':[320,360,380]}
df=pd.DataFrame(dct)
print(df)
engine=alc.create_engine('mysql+pymysql://root:tiger@localhost/school?charset=utf8')
con=engine.connect()
df.to_sql('student', con, if_exists='replace')

Importing & Exporting Data Between Python
and MySQL (Alternative Method)
Alternative method to import/export data to MySQL
In the previous examples we used the sqlalchemy library to make the connection to the MySQL database using the create_engine() method, which automatically handles the data flow compatible with Pandas' to_sql() and read_sql() methods. So, if you want to import/export data directly from/to a Pandas DataFrame, sqlalchemy plays a very important role.
But you can also apply the general methodology to handle database connections and data flow programmatically, using the pymysql or mysql.connector package library methods, without using the sqlalchemy toolkit. In fact, this method can be used to import/export data from any data object like variables, lists, tuples or DataFrames in Python.
The program depicted in the next two slides does the same reading/writing job through MySQL commands without using the sqlalchemy library.
Kindly note that while exporting (writing) data to a MySQL table using this method, you must create a compatible table using the MySQL CREATE command within the Python program, or a compatible MySQL table must already exist before executing the MySQL INSERT command.

This slide may be used as additional reference material.

Importing & Exporting Data Between
Python and MySQL (Alternative Method)
Exporting (Writing) records from a DataFrame to MySQL

# Program to export data from a DataFrame to MySQL without using sqlalchemy
import pandas as pd
import pymysql
dct={'Name':['Amar','Akbar','Anthony'], 'Total':[320,360,380]}
df=pd.DataFrame(dct)
print(df)
# Create connection to the database
con=pymysql.connect(host='localhost', user='root',
                    password='tiger', db='school')
cur=con.cursor()
# Create the table by executing a MySQL query
cur.execute("Create table student (name char(30), total int(3));")
# Insert each DataFrame row into the table
for i,row in df.iterrows():
    sql="INSERT INTO student VALUES ('%s','%s');" %(row['Name'], row['Total'])
    cur.execute(sql)
con.commit()
con.close()
# The exported records can be verified in MySQL with: select * from student;

This slide may be used as additional reference material.

Importing & Exporting Data Between
Python and MySQL (Alternative Method)
Importing (Reading) records from MySQL to a DataFrame
# Program to import data from MySQL into a DataFrame without using sqlalchemy
import pandas as pd
import pymysql
# Connection to the database
con = pymysql.connect(host='localhost',
                      user='root',
                      password='tiger',
                      db='school')
cur=con.cursor()
cur.execute("select * from student;")
rows=cur.fetchall()
# Create DataFrame using the fetched records
df=pd.DataFrame(rows, columns=['Name','Total'])
print(df)
con.close()

This slide may be used as additional reference material.
"Education is not the learning of facts, but the training of the mind to think."
Visit www.ip4you.blogspot.com for more....