Pandas 2 Complete Notes Class XII

Uploaded by asayushsingh638

Python Pandas II

Introduction: Basic operations on a DataFrame: descriptive
statistics, pivoting, handling missing data, combining/merging,
etc.

Descriptive statistics are used to summarise the given data. In
other words, they refer to the methods which are used to get
some basic idea about the data.

Iteration over a DataFrame:-

(i) The iterrows() method iterates over a DataFrame row-wise:-
We can iterate over a DataFrame row-wise, where each row's
values are returned as a Series type object.

import pandas as pd
import numpy as np
df=pd.DataFrame({'Population':[10927986,12691836,4631392,4328063],
                 'Hospital':[189,208,149,157],
                 'School':[7916,8508,7226,7617]},
                index=['Delhi','Mumbai','Kolkata','Chennai'])
print(df)
for (row, rowSeries) in df.iterrows():
    print("Row index :", row)
    print("Containing :")
    print(rowSeries)

Row index : Delhi


Containing :
Population 10927986
Hospital 189
School 7916
Name: Delhi, dtype: int64
Row index : Mumbai
Containing :
Population 12691836
Hospital 208
School 8508
Name: Mumbai, dtype: int64
Row index : Kolkata
Containing :
Population 4631392
Hospital 149
School 7226
Name: Kolkata, dtype: int64
Row index : Chennai
Containing :
Population 4328063
Hospital 157
School 7617
Name: Chennai, dtype: int64

(ii) The iteritems() method iterates over a DataFrame
column-wise:-

for (column, ColSeries) in df.iteritems():
    print("Col index :", column)
    print("Containing :")
    print(ColSeries)
Col index : Population
Containing :
Delhi 10927986
Mumbai 12691836
Kolkata 4631392
Chennai 4328063
Name: Population, dtype: int64
Col index : Hospital
Containing :
Delhi 189
Mumbai 208
Kolkata 149
Chennai 157
Name: Hospital, dtype: int64
Col index : School
Containing :
Delhi 7916
Mumbai 8508
Kolkata 7226
Chennai 7617
Name: School, dtype: int64
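Note that iteritems() was deprecated and then removed in pandas 2.0; the equivalent method there is items(), which yields the same (column, Series) pairs. A small runnable sketch using the same city data:

```python
import pandas as pd

df = pd.DataFrame({'Population': [10927986, 12691836, 4631392, 4328063],
                   'Hospital': [189, 208, 149, 157],
                   'School': [7916, 8508, 7226, 7617]},
                  index=['Delhi', 'Mumbai', 'Kolkata', 'Chennai'])

# items() replaces iteritems() in pandas 2.x: it yields (column, Series) pairs
col_sums = {}
for column, col_series in df.items():
    col_sums[column] = col_series.sum()   # e.g. total hospitals across cities
```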

Binary Operations on DataFrames

df1=pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])

df2=pd.DataFrame([[51,12,32],[41,55,62],[17,88,None]])

(i) Addition: +, add() and radd()
df1+df2 or df1.add(df2)
(ii) Subtraction: -, sub() and rsub()
df1-df2 or df1.sub(df2)
(iii) Multiplication: *, mul() and rmul()
df1*df2 or df1.mul(df2)
(iv) Division: /, div() and rdiv()
df1/df2 or df1.div(df2)
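Because df2 contains a missing value (None becomes NaN), the plain operators propagate NaN into the result. The method forms accept a fill_value parameter that substitutes for a missing operand; a minimal sketch:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df2 = pd.DataFrame([[51, 12, 32], [41, 55, 62], [17, 88, None]])

plain = df1 + df2                    # NaN wherever an operand is missing
filled = df1.add(df2, fill_value=0)  # treat the missing df2 value as 0 instead
```

With fill_value, the cell where df2 holds NaN becomes 9 + 0 = 9.0 rather than NaN; if a cell is missing in both frames, the result stays NaN.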

Descriptive Statistics with Pandas


(i)DataFrame.max() is used to calculate the maximum
values from the DataFrame.
print(df.max())
If we want to output maximum value for the columns having
only numeric values, then we can set the parameter
numeric_only=True in the max() method
print(df.max(numeric_only=True))

(ii) DataFrame.min() is used to display the minimum
values from the DataFrame.
print(df.min())
(iii) DataFrame.sum() will display the sum of the values
from the DataFrame.
print(df.sum())
print(df['Maths'].sum())
To total the marks of a particular student (a row), sum along
axis=1 instead of naming a column.
(iv) DataFrame.count() will display the total number of
values for each column or row of a DataFrame. To count
the rows we need to use the argument axis=1.
print(df.count())
(v) DataFrame.mean() will display the mean (average)
of the values of each column of a DataFrame.
df.mean()
(vi) DataFrame.median() will display the middle value of
the data, i.e. the median of the values of each column of a
DataFrame.
print(df.median())
(vii) DataFrame.mode() will display the mode, i.e. the value
that appears the most number of times in the data.
df.mode()
Quartile
DataFrame.quantile() is used to get the quartiles. Quartiles
divide the data into four equal parts: the first quartile is the
25% point (parameter q=.25), the second quartile is the 50%
point (the median), and the third quartile is the 75% point
(parameter q=.75). By default, it displays the second
quartile (median) of all numeric values.

df.quantile()
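All three quartiles can be requested at once by passing a list of q values. A small sketch on a hypothetical Marks column:

```python
import pandas as pd

df = pd.DataFrame({'Marks': [10, 20, 30, 40, 50]})

# one row per requested quantile: 25%, 50% (median) and 75%
q = df.quantile([0.25, 0.5, 0.75])
```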

DataFrame.var() is used to display the variance. It is the
average of the squared differences from the mean.

df[['Maths','Science','S. St','Hindi','Eng']].var()

DataFrame.std() returns the standard deviation of the values.
Standard deviation is calculated as the square root of the
variance.

df[['Maths','Science','S. St','Hindi','Eng']].std()

DataFrame.describe() displays the descriptive
statistical values in a single command. These values help us
describe a set of data in a DataFrame.
The describe() function gives the following information for a
DataFrame:
• count Count of non-NA values in a column
• mean Computed mean of the values in a column
• std Standard deviation of the values in a column
• min Minimum value in a column
• 25%, 50%, 75% Percentiles of the values in the column
• max Maximum value in a column

df2=pd.DataFrame([[51,12,32],[41,55,62],[17,88,None]])
df2.describe()
0 1 2
count 3.000000 3.000000 2.000000
mean 36.333333 51.666667 47.000000
std 17.473790 38.109491 21.213203
min 17.000000 12.000000 32.000000
25% 29.000000 33.500000 39.500000
50% 41.000000 55.000000 47.000000
75% 46.000000 71.500000 54.500000
max 51.000000 88.000000 62.000000

Data Aggregation :- Aggregation means to transform the
dataset and produce a single numeric value from an array.
Aggregation can be applied to one or more columns together.
Aggregate functions are max(), min(), sum(), count(), std(),
var().

import pandas as pd
df=pd.DataFrame(marksUT)   # marksUT: a dictionary of students' unit-test marks
print(df)
>>> df.aggregate('max')
Name Zuhaire
UT 3
Maths 24
Science 25
S.St 25
Hindi 25
Eng 24
dtype: object
>>>df.aggregate(['max','count'])
Name UT Maths Science S.St Hindi Eng
max Zuhaire 3 24 25 25 25 24
count 12 12 12 12 12 12 12
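The same idea on a small self-contained frame (hypothetical marks, just for illustration): aggregate() accepts a list of function names and returns one row per function:

```python
import pandas as pd

df = pd.DataFrame({'Maths': [22, 21, 14], 'Science': [21, 20, 19]})

# one result row per aggregate function, one column per data column
agg = df.aggregate(['max', 'min', 'sum'])
```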

Sorting a DataFrame
Sorting refers to arranging data elements in a specified order,
which can be either ascending or descending. Pandas provides
the sort_values() function to sort the data values of a
DataFrame.

DataFrame.sort_values(by, axis=0, ascending=True)

Here, a column list (by), an axis argument (0 for rows and 1 for
columns) and the order of sorting (ascending = True or False)
are passed as arguments. By default, sorting is done along the
rows (axis=0) in ascending order.

print(df.sort_values(by=['Name']))
>>> print(df.sort_values(by=['Name']))
Name UT Maths Science S.St Hindi Eng
6 Ashravy 1 23 19 20 15 22
7 Ashravy 2 24 22 24 17 21
8 Ashravy 3 12 25 19 21 23
9 Mishti 1 15 22 25 22 22
10 Mishti 2 18 21 25 24 23
11 Mishti 3 17 18 20 25 20
0 Raman 1 22 21 18 20 21
1 Raman 2 21 20 17 22 24
2 Raman 3 14 19 15 24 23
5 Zu haire 3 22 18 19 23 13
3 Zuhaire 1 20 17 22 24 19
4 Zuhaire 2 23 15 21 25 15

print(df.sort_values(by=['Science']))

print(df.sort_values(by=['Eng'],ascending=False))
A DataFrame can be sorted based on multiple columns.
>>> print(df.sort_values(by=['Science','Hindi']))
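When sorting on multiple columns, ascending can also be a list giving a separate order per column. A sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Mishti', 'Raman', 'Mishti', 'Raman'],
                   'Eng': [22, 21, 23, 24]})

# Name ascending (A->Z), but Eng descending within each name
out = df.sort_values(by=['Name', 'Eng'], ascending=[True, False])
```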

The groupby() Function
The groupby() function is used to split the data into groups based
on some criteria. Pandas objects like a DataFrame can be split
on any of their axes.
In other words, the duplicate values in the same field are
grouped together to form groups.

Step 1: Split the data into groups by creating a groupby
object from the original DataFrame.
Step 2: Apply the required function.
Step 3: Combine the results to form a new DataFrame.

g1=df.groupby('Name')
Note:- Python created groups based on the column's values but did
not display the grouped data, as groupby() returns an object.

df1=df.groupby('Name')

>>> df1.groups (lists the groups created)

{'Ashravy': [6, 7, 8], 'Mishti': [9, 10, 11], 'Raman': [0, 1, 2],
'Zu haire': [5], 'Zuhaire': [3, 4]}

# Displaying a group's data, i.e., the rows belonging to that group

df1.get_group('Mishti')

Name UT Maths Science S.St Hindi Eng


9 Mishti 1 15 22 25 22 22
10 Mishti 2 18 21 25 24 23
11 Mishti 3 17 18 20 25 20

df1.get_group('Raman')
df1=df.groupby(['Name', 'UT'])

>>> df1.first()
Maths Science S.St Hindi Eng
Name UT
Ashravy 1 23 19 20 15 22
2 24 22 24 17 21
3 12 25 19 21 23
Mishti 1 15 22 25 22 22
2 18 21 25 24 23
3 17 18 20 25 20
Raman 1 22 21 18 20 21
2 21 20 17 22 24
3 14 19 15 24 23
Zu haire 3 22 18 19 23 13
Zuhaire 1 20 17 22 24 19
2 23 15 21 25 15

#Displaying the size of each group (output shown for the earlier grouping df1=df.groupby('Name'))

>>> df1.size()
Name
Ashravy 3
Mishti 3
Raman 3
Zu haire 1
Zuhaire 2
dtype: int64

>>> df1.count()

UT Maths Science S.St Hindi Eng


Name
Ashravy 3 3 3 3 3 3
Mishti 3 3 3 3 3 3
Raman 3 3 3 3 3 3
Zu haire 1 1 1 1 1 1
Zuhaire 2 2 2 2 2 2
df.groupby(['UT']).aggregate('mean')   # in pandas >= 2, select the numeric columns first so the Name column is skipped
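A minimal runnable version of the same pattern (hypothetical small frame), selecting the numeric column before aggregating so that non-numeric columns do not get in the way:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Raman', 'Mishti', 'Raman', 'Mishti'],
                   'UT': [1, 1, 2, 2],
                   'Maths': [22, 15, 21, 18]})

# group by UT, then average only the numeric Maths column
avg = df.groupby('UT')[['Maths']].aggregate('mean')
```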

Altering the Index :- We use indexing to access the elements


of a DataFrame. It is used for fast retrieval of data. By default,
a numeric index starting from 0 is created as a row index.

When we slice the data, we get the original index which is not
continuous.

We create a new continuous index alongside this using the


reset_index() function.

>>> a=df[df.UT == 1]
>>> a
Name UT Maths Science S.St Hindi Eng
0 Raman 1 22 21 18 20 21
3 Zuhaire 1 20 17 22 24 19
6 Ashravy 1 23 19 20 15 22
9 Mishti 1 15 22 25 22 22

>>> a.reset_index(inplace=True)
>>> a
index Name UT Maths Science S.St Hindi Eng
0 0 Raman 1 22 21 18 20 21
1 3 Zuhaire 1 20 17 22 24 19
2 6 Ashravy 1 23 19 20 15 22
3 9 Mishti 1 15 22 25 22 22

A new continuous index is created while the original one is also


intact. We can drop the original index by using the drop
function.

a.drop(columns=['index'],inplace=True)

>>> a
Name UT Maths Science S.St Hindi Eng
0 Raman 1 22 21 18 20 21
1 Zuhaire 1 20 17 22 24 19
2 Ashravy 1 23 19 20 15 22
3 Mishti 1 15 22 25 22 22

We can change the index to some other column of the data.

a.set_index('Name',inplace=True)
>>> a
UT Maths Science S.St Hindi Eng
Name
Raman 1 22 21 18 20 21
Zuhaire 1 20 17 22 24 19
Ashravy 1 23 19 20 15 22
Mishti 1 15 22 25 22 22

We can revert to the previous index.


a.reset_index('Name', inplace = True)
>>> a
Name UT Maths Science S.St Hindi Eng
0 Raman 1 22 21 18 20 21
1 Zuhaire 1 20 17 22 24 19
2 Ashravy 1 23 19 20 15 22
3 Mishti 1 15 22 25 22 22

Reshaping Data :
For reshaping data, two basic functions are available in Pandas:
pivot() and pivot_table().

(A) Pivot: The pivot() function is used to reshape the data and
create a new DataFrame from the original one.
Pivoting is a summary technique that works on tabular
data.

import pandas as pd
d1={'Tutor':['Tahira','Gurjot','Anusha','Jacob','Venkat'],
    'Class':[28,36,41,32,40],
    'Country':['USA','UK','Japan','USA','Brazil']}
df=pd.DataFrame(d1)
>>> df
Tutor Class Country
0 Tahira 28 USA
1 Gurjot 36 UK
2 Anusha 41 Japan
3 Jacob 32 USA
4 Venkat 40 Brazil
>>> df.pivot(index='Country', columns='Tutor',values='Class')

Tutor Anusha Gurjot Jacob Tahira Venkat


Country
Brazil NaN NaN NaN NaN 40.0
Japan 41.0 NaN NaN NaN NaN
UK NaN 36.0 NaN NaN NaN
USA NaN NaN 32.0 28.0 NaN

The following arguments work with the pivot() function:

index: stores the column name about which the
information is to be summarised (will become the rows
in the result)
columns: stores the column name whose distinct values
will each become a column in the summary
information (will become the columns in the result)
values: stores the column name whose data will be
displayed for each index/column combination
(will become the cells in the result)

df.pivot(index='Country', columns='Tutor', values='Class').fillna(0)
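Putting the pieces together as one runnable sketch (the Tutor data from above), with fillna(0) replacing the NaN cells of the pivoted result:

```python
import pandas as pd

d1 = {'Tutor': ['Tahira', 'Gurjot', 'Anusha', 'Jacob', 'Venkat'],
      'Class': [28, 36, 41, 32, 40],
      'Country': ['USA', 'UK', 'Japan', 'USA', 'Brazil']}
df = pd.DataFrame(d1)

# rows labelled by Country, one column per Tutor, Class values as cells
pv = df.pivot(index='Country', columns='Tutor', values='Class').fillna(0)
```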

Using the pivot_table() function:- If there are multiple entries of a
column's value for the same index (row) value, pivot() raises an
error. Hence, before we use pivot(), we should ensure that the
data does not have rows with duplicate values for the specified
columns.

ontutD={'Tutor':['Tahira','Gurjot','Anusha','Jacob','Venkat',
                 'Tahira','Gurjot','Anusha','Jacob','Venkat',
                 'Tahira','Gurjot','Anusha','Jacob','Venkat',
                 'Tahira','Gurjot','Anusha','Jacob','Venkat'],
        'Classes':[28,36,41,32,40,36,40,36,40,46,24,30,44,40,32,36,32,36,41,38],
        'Quarter':[1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4],
        'Country':['USA','UK','Japan','USA','Brazil','USA','USA','Japan','Brazil','USA',
                   'Brazil','USA','UK','Brazil','USA','Japan','Japan','Brazil','UK','USA']}

df1=pd.DataFrame(ontutD)
>>> df1
Tutor Classes Quarter Country
0 Tahira 28 1 USA
1 Gurjot 36 1 UK
2 Anusha 41 1 Japan
3 Jacob 32 1 USA
4 Venkat 40 1 Brazil
5 Tahira 36 2 USA
6 Gurjot 40 2 USA
7 Anusha 36 2 Japan
8 Jacob 40 2 Brazil
9 Venkat 46 2 USA
10 Tahira 24 3 Brazil
11 Gurjot 30 3 USA
12 Anusha 44 3 UK
13 Jacob 40 3 Brazil
14 Venkat 32 3 USA
15 Tahira 36 4 Japan
16 Gurjot 32 4 Japan
17 Anusha 36 4 Brazil
18 Jacob 41 4 UK
19 Venkat 38 4 USA

For data having multiple values for the same row and column
combination we can use another pivoting function, the
pivot_table() function.

The pivot_table() function, like pivot(), also produces a pivoted
table, but it differs from pivot() in two ways:
(i) It does not raise errors for multiple entries of a row,
column combination.
(ii) It aggregates the multiple entries present for a row-
column combination. We need to specify what type of
aggregation we want (sum, mean, etc.).
Parameters:
index contains the column name for rows.
columns contains the column name for columns.
values contains the column names for the data of the
pivoted table.
aggfunc contains the function as per which the data is to be
aggregated. By default the mean is computed.

>>> df1.pivot_table(index='Country', columns='Tutor',
values='Classes', aggfunc='mean')

(Passing a list of functions, e.g. aggfunc=['sum','max','mean'],
produces one sub-table per function.)

Tutor Anusha Gurjot Jacob Tahira Venkat


Country
Brazil 36.0 NaN 40.0 24.0 40.000000
Japan 38.5 32.0 NaN 36.0 NaN
UK 44.0 36.0 41.0 NaN NaN
USA NaN 35.0 32.0 32.0 38.666667
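The aggregation behaviour is easiest to see on a tiny frame with a deliberate duplicate row/column combination (hypothetical values):

```python
import pandas as pd

df = pd.DataFrame({'Tutor': ['Tahira', 'Tahira', 'Gurjot'],
                   'Country': ['USA', 'USA', 'UK'],
                   'Classes': [28, 36, 40]})

# (USA, Tahira) occurs twice; pivot() would raise, pivot_table() averages
pt = df.pivot_table(index='Country', columns='Tutor',
                    values='Classes', aggfunc='mean')
```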

Filling NaN values :- Where NaN values appear in the pivoted
result, we can fill them in using the fillna(0) method.

df1.pivot_table(index='Country', columns='Tutor', values='Classes').fillna(0)

Handling missing values : - If a value corresponding to a


column is not present, it is considered to be a missing value. A
missing value is denoted by NaN.

Missing values create a lot of problems during data analysis
and have to be handled properly. The two most common
ways of handling missing values are:
i) drop the object having missing values,
ii) fill in the missing value

df2=pd.DataFrame([[51,12,32],[41,np.nan,55,62],[17,88,None]])

Checking Missing Values :- isnull() checks whether any
value is missing in the DataFrame. This function checks
every value and returns True where the value is missing,
otherwise it returns False.

>>> df2.isnull()
0 1 2 3
0 False False False True
1 False True False False
2 False False True True

print(df['Tutor'].isnull())
print(df['Country'].isnull())

To check whether a column (attribute) has a missing value
anywhere in the dataset, the any() function is used. It returns
True in case of a missing value, else it returns False.

print(df.isnull().any())
Tutor False
Class False
Country False
dtype: bool

(These calls use the Tutor DataFrame df from the Pivot section;
on the numeric df2 the column labels are 0, 1, 2, 3.)

The any() function can also be used for a particular attribute:

print(df2[1].isnull().any())

Checking Missing Values in a Larger Dataset

marksUT={'Name':['Raman','Raman','Raman','Raman',
                 'Zuhaire','Zuhaire','Zuhaire','Zuhaire',
                 'Ashravy','Ashravy','Ashravy','Ashravy',
                 'Mishti','Mishti','Mishti','Mishti'],
         'UT':[1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4],
         'Maths':[22,21,14,np.nan,20,23,22,19,23,24,12,15,15,18,17,14],
         'Science':[21,20,19,np.nan,17,15,18,20,19,22,25,20,22,21,18,20],
         'S.St':[18,17,15,19,22,21,19,17,20,24,19,20,25,25,20,19],
         'Hindi':[20,22,24,18,24,25,23,21,15,17,21,20,22,24,25,20],
         'Eng':[21,24,23,np.nan,19,15,13,16,22,21,23,17,22,23,20,18]}

df = pd.DataFrame(marksUT)

print(df.isnull())
print(df['Science'].isnull())
print(df.isnull().any())

To find the number of NaN values corresponding to each


attribute.

print(df.isnull().sum())

To find the total number of NaN in the whole dataset, we can


use-
print(df.isnull().sum().sum())
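As a runnable check of those two counting idioms, on a tiny frame with two deliberately missing values:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Maths': [22, np.nan, 14], 'Eng': [21, 24, np.nan]})

per_col = df.isnull().sum()            # NaN count per column
total = int(df.isnull().sum().sum())   # NaN count in the whole frame
```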

Dropping Missing Values :- Dropping will remove the entire
row (object) having the missing value(s). We can use the
dropna() function to drop rows containing NaN values.
>>> a=df[df.Name=='Raman']
>>> a
Name UT Maths Science S.St Hindi Eng
0 Raman 1 22.0 21.0 18 20 21.0
1 Raman 2 21.0 20.0 17 22 24.0
2 Raman 3 14.0 19.0 15 24 23.0
3 Raman 4 NaN NaN 19 18 NaN

a.dropna(inplace=True, how='any')
>>> a
Name UT Maths Science S.St Hindi Eng
0 Raman 1 22.0 21.0 18 20 21.0
1 Raman 2 21.0 20.0 17 22 24.0
2 Raman 3 14.0 19.0 15 24 23.0
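dropna() also accepts how='all', which drops a row only when every value in it is missing. A small sketch contrasting how='any' and how='all' on hypothetical data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, np.nan, np.nan], 'B': [4, 5, np.nan]})

dropped_any = df.dropna(how='any')  # drop rows with at least one NaN
dropped_all = df.dropna(how='all')  # drop only rows that are entirely NaN
```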

Joining, Merging and Concatenation of DataFrames


(A) Joining :- We can use the pandas DataFrame.append()
method to merge two DataFrames. It appends rows of the
second DataFrame at the end of the first DataFrame. Columns
not present in the first DataFrame are added as new columns.

df=pd.DataFrame([[1, 2, 3], [4, 5], [6]], columns=['C1', 'C2', 'C3'],
                index=['R1', 'R2', 'R3'])
>>> df
C1 C2 C3
R1 1 2.0 3.0
R2 4 5.0 NaN
R3 6 NaN NaN

>>> df1=pd.DataFrame([[10, 20], [30], [40, 50]],
                     columns=['C2', 'C5'], index=['R4', 'R2', 'R5'])
>>> df1
C2 C5
R4 10 20.0
R2 30 NaN
R5 40 50.0
dfnew=df.append(df1)
>>> dfnew
C1 C2 C3 C5
R1 1.0 2.0 3.0 NaN
R2 4.0 5.0 NaN NaN
R3 6.0 NaN NaN NaN
R4 NaN 10.0 NaN 20.0
R2 NaN 30.0 NaN NaN
R5 NaN 40.0 NaN 50.0

To make the column labels appear in sorted order we can set the
parameter sort=True; with sort=False the column labels keep
their original (unsorted) order.
dFrame2 = df1.append(df, sort=True)

The parameter ignore_index of append()method may be set to


True, when we do not want to use row index labels. By default,
ignore_index = False.

dFrame1 = df.append(df1, ignore_index=True)


>>> dFrame1
C1 C2 C3 C5
0 1.0 2.0 3.0 NaN
1 4.0 5.0 NaN NaN
2 6.0 NaN NaN NaN
3 NaN 10.0 NaN 20.0
4 NaN 30.0 NaN NaN
5 NaN 40.0 NaN 50.0
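Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0; pd.concat() produces the same result. A sketch of the example above rewritten with concat (using explicit None for the cells that ragged rows would pad with NaN):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, None], [6, None, None]],
                  columns=['C1', 'C2', 'C3'], index=['R1', 'R2', 'R3'])
df1 = pd.DataFrame([[10, 20], [30, None], [40, 50]],
                   columns=['C2', 'C5'], index=['R4', 'R2', 'R5'])

# pd.concat stacks the rows like append() did; missing columns become NaN
dfnew = pd.concat([df, df1])
```

As with append(), pd.concat() also takes ignore_index=True to renumber the rows.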

Importing a CSV file to a DataFrame : We can create a


DataFrame by importing data from CSV files where values are
separated by commas.

Suppose some marks data is stored in a CSV file, such as the
C:\Users\Ashutosh\Desktop\data.csv read below.

We can load the data from the data.csv file into a DataFrame,
say marks using Pandas read_csv() function.

marks = pd.read_csv(r"C:\Users\Ashutosh\Desktop\data.csv",
sep =",", header=0)

• The first parameter to the read_csv() is the name of the


comma separated data file along with its path.

• The parameter sep specifies whether the values are
separated by a comma, semicolon, tab, or any other character.
The default value for sep is a comma (',').
• The parameter header specifies the number of the row whose
values are to be used as the column names. It also marks the
start of the data to be fetched. Header=0 implies that column
names are inferred from the first line of the file. By default,
header=0.

The names parameter is used to specify our own labels for the
columns of the DataFrame:

marks = pd.read_csv(r"C:\Users\Ashutosh\Desktop\data.csv",
sep =",", names=['RNo','StudentName', 'Sub1','Sub2'])
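Since the actual CSV file is not included in these notes, here is a self-contained sketch that reads the same kind of data from an in-memory string via io.StringIO (hypothetical rows, standing in for the file path above):

```python
import pandas as pd
from io import StringIO

# StringIO behaves like an open file, so read_csv accepts it in place of a path
csv_text = StringIO("RNo,StudentName,Sub1,Sub2\n"
                    "1,Raman,22,21\n"
                    "2,Mishti,15,22\n")
marks = pd.read_csv(csv_text, sep=",", header=0)
```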

Exporting a DataFrame to a CSV file :- We can use the
to_csv() function to save a DataFrame to a text or CSV file.

df1.to_csv(r"C:\Users\Ashutosh\Desktop\data12.csv", sep=",")

By default to_csv() writes the column labels as the first row of
the file; pass index=False if the row index should not be written.
