Python Code
Python Library/Package- A Python library is a collection of modules that together cater to a specific type of need or application. The advantage of using libraries is that we can directly use their functions/methods to perform a specific task instead of rewriting the code for that purpose. A library is used by placing the import statement
import libraryname
at the top of the Python code/script file.
Data Structure- A data structure is an arrangement of data that permits efficient access and modification.
A Series is a one-dimensional pandas data structure that stores data along with its index labels. For example, a Series may hold the values 1, 'def', 10.5, 'Jkl', True against the index labels a, b, c, d, e; since the data is heterogeneous, all of it is stored as object type.
Creating a Series- A series object can be created by calling the Series() method in the following ways-
a) Create an empty Series- A Series object not containing any elements is an empty Series. It can be
created as follows-
import pandas as pd
s1=pd.Series()
print(s1)
o/p-
Series([], dtype: float64)
b) Create a Series from an array without index- A numpy 1D array can be used to create a Series object as shown below-
import pandas as pd
import numpy as np
a1=np.array(['hello', 'world', 'good', np.NaN])
s1=pd.Series(a1)
print(s1)
o/p-
0 hello
1 world
2 good
3 nan
dtype: object
c) Create a Series from an array with index- The default index of a Series object can be changed and specified by the programmer by using the index parameter and enclosing the indexes in square brackets. The number of elements of the array must match the number of indexes specified, otherwise Python gives an error.
#Creating a Series object using numpy array and specifying index
import pandas as pd
import numpy as np
a1=np.array(['hello', 'world', 'good', 'morning'])
s1=pd.Series(a1, index=[101, 111, 121, 131])
print(s1)
o/p-
101 hello
111 world
121 good
131 morning
dtype: object
d) Create a Series from a dictionary- Each element of the dictionary contains a key:value pair. The keys of the dictionary become the indexes of the Series object and the values of the dictionary become the data.
#4 Creating a Series object from dictionary
import pandas as pd
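# the dictionary values are an assumption consistent with the output shown below
d1={101:'hello', 111:'world', 121:'good', 131:'morning'}
s1=pd.Series(d1)
print(s1)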
o/p-
101 hello
111 world
121 good
131 morning
dtype: object
e) Create a Series from a dictionary, reordering the index- When creating a Series object from a dictionary, we can specify which elements of the dictionary we want to include in the Series object, and in which order, by passing the index argument to the Series() method.
If any key of the dictionary is missing in the index argument, then that element is not added to the Series object.
If the index argument contains a key not present in the dictionary, then a value of NaN is assigned to that particular index.
The order in which the index arguments are specified determines the order of the elements in the Series object.
#5 Creating a Series object from dictionary reordering the index
import pandas as pd
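# the dictionary and index list are assumptions consistent with the output shown below
d1={101:'hello', 111:'world', 121:'good', 131:'morning'}
s1=pd.Series(d1, index=[131, 111, 121, 199])
print(s1)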
o/p-
131 morning
111 world
121 good
199 NaN
dtype: object
f) Create a Series from a scalar value- A Series object can be created from a single value, i.e. a scalar value; that scalar value is repeated as many times as there are entries in the index argument.
#6 Creating a Series object from scalar value
import pandas as pd
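# the scalar value and indexes are taken from the output shown below
s1=pd.Series(7, index=[101, 111, 121])
print(s1)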
o/p-
101 7
111 7
121 7
dtype: int64
g) Create a Series from a List- A Series object can be created from a list as shown below.
#7 Creating a Series object from list
import pandas as pd
L=['abc', 'def', 'ghi', 'jkl']
s1=pd.Series(L)
print(s1)
o/p-
0 abc
1 def
2 ghi
3 jkl
dtype: object
h) Create a Series from a numpy array (using various array creation methods) - A Series object can be created from a numpy array as shown below. All the numpy array creation methods can be used to create a Series object.
#7a Creating a Series object from numpy arrays
import pandas as pd
import numpy as np
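# Parts a-d below are a sketch (values assumed) matching the s1-s4 outputs shown further down
#a. Create an array from a list of values
a1=np.array([2, 4, 7, 10, 13.5, 20.4])
s1=pd.Series(a1)
print('s1=', s1)
#b. Create an array of 10 zeros; index labels 101-110 are specified
a2=np.zeros(10)
s2=pd.Series(a2, index=range(101, 111))
print('s2=', s2)
#c. Create an array of 5 ones
a3=np.ones(5)
s3=pd.Series(a3)
print('s3=', s3)
#d. Create an array of values from 1.1 to 1.7 in steps of 0.1
a4=np.arange(1.1, 1.8, 0.1)
s4=pd.Series(a4)
print('s4=', s4)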
#e. Create an array of 4 elements which are linearly spaced between 1 and 10 (both inclusive)
a5=np.linspace(1,10,4)
s5=pd.Series(a5)
print('s5=', s5)
#f. Create an array of characters from a string using np.fromiter()
a6=np.fromiter('helloworld', dtype='U1')
s6=pd.Series(a6)
print('s6=', s6)
o/p:
s1= 0 2.0
1 4.0
2 7.0
3 10.0
4 13.5
5 20.4
dtype: float64
s2= 101 0.0
102 0.0
103 0.0
104 0.0
105 0.0
106 0.0
107 0.0
108 0.0
109 0.0
110 0.0
dtype: float64
s3= 0 1.0
1 1.0
2 1.0
3 1.0
4 1.0
dtype: float64
s4= 0 1.1
1 1.2
2 1.3
3 1.4
4 1.5
5 1.6
6 1.7
dtype: float64
s5= 0 1.0
1 4.0
2 7.0
3 10.0
dtype: float64
s6= 0 h
1 e
2 l
3 l
4 o
5 w
6 o
7 r
8 l
9 d
dtype: object
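The output blocks that follow show different ways of accessing elements of a Series by label and by integer position. A minimal sketch that reproduces them, assuming the Series below; every statement here is an assumption chosen to match the outputs:
import pandas as pd
s1=pd.Series(['hello', 'world', 'good', 'morning'], index=['abc', 'def', 'ghi', 'jkl'])
print(s1['def'])                     # world
print(s1['jkl'])                     # morning
print(s1.get('def'))                 # world
print(s1.get('jkl'))                 # morning
print(s1.get('xyz'))                 # None - label not present, no default given
print(s1.get('xyz', 'nice day'))     # the default value is returned for a missing label
print(s1[1])                         # world - access by integer position
print(s1[0])                         # hello
print(s1[3])                         # morning
x=s1[1]
print(type(x))                       # a single position gives a str
print(x)
y=s1[[1,3]]
print(type(y))                       # a list of positions gives a Series
print('y=\n', y)
z=s1[1:]
print('z=\n', z)
m=s1[1:3]
print('m=\n', m)
x=s1['ghi']
print(type(x))                       # a single label also gives a str
print(x)
y=s1[['abc','ghi']]
print(type(y))
print('y=\n', y)
z=s1['def':'jkl']                    # a label slice includes the end label
print('z=\n', z)
m=s1['def':'ghi']
print('m=\n', m)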
o/p-
world
morning
o/p-
world
morning
None
nice day
o/p-
world
o/p-
hello
morning
o/p-
<class 'str'>
world
<class 'pandas.core.series.Series'>
y=
def world
jkl morning
dtype: object
z=
def world
ghi good
jkl morning
dtype: object
m=
def world
ghi good
dtype: object
o/p-
<class 'str'>
good
<class 'pandas.core.series.Series'>
y=
abc hello
ghi good
dtype: object
z=
def world
ghi good
jkl morning
dtype: object
m=
def world
ghi good
dtype: object
L=[101, 111, 121, 131, 141, 151, 161, 171, 181, 191, 201, 211]
s=pd.Series(L)
x=s.head()
print('x=\n', x)
y=s.head(3)
print('y=\n', y)
o/p:
x=
0 101
1 111
2 121
3 131
4 141
dtype: int64
y=
0 101
1 111
2 121
dtype: int64
L=[101, 111, 121, 131, 141, 151, 161, 171, 181, 191, 201, 211]
s=pd.Series(L)
x=s.tail()
print('x=\n', x)
y=s.tail(3)
print('y=\n', y)
o/p:
x=
7 171
8 181
9 191
10 201
11 211
dtype: int64
y=
9 191
10 201
11 211
dtype: int64
4. Indexing/Slicing a Series object-
The index [] operator can be used to perform indexing and slicing operations on a Series object. The index[]
operator can accept either-
a) Index/labels
b) Integer index positions
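i) Using a single label- A single label can be passed inside the index operator [], which returns the corresponding value. A sketch consistent with the output below (the Series values are assumptions):
import pandas as pd
s1=pd.Series([101,102,103,104,105,106], index=['a','b','c','d','e','f'])
print(s1['b'])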
o/p:
102
ii) Using multiple labels- We can pass multiple labels, in any order, that are present in the Series object. The multiple labels must be passed as a list, i.e. the labels must be separated by commas and enclosed in double square brackets. Passing a label that is not present in the Series object should be avoided: at present it gives NaN as the value, but in future versions it will be treated as an error.
# 17 indexing a Series object multiple labels
import pandas as pd
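# the Series values are assumptions consistent with the output below
s1=pd.Series([101,102,103,104,105,106], index=['a','b','c','d','e','f'])
print(s1[['b','a','f']])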
o/p:
b 102
a 101
f 106
dtype: int64
iii) Using slice notation startlabel:endlabel- Inside the index operator we can pass startlabel:endlabel. Here, contrary to the usual slice concept, all the items from the startlabel up to and including the endlabel are returned.
# 18 indexing a Series object using startlabel:endlabel
import pandas as pd
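# the Series values are assumptions consistent with the output below
s1=pd.Series([101,102,103,104,105,106], index=['a','b','c','d','e','f'])
print(s1['b':'e'])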
o/p:
b 102
c 103
d 104
e 105
dtype: int64
Slice concept-
The basic concept of slicing using integer index positions is common to Python objects such as strings, lists, tuples, Series, DataFrames etc. A slice creates a new object using elements of an existing object. It is created as:
ExistingObjectName[start : stop : step] where start, stop, step are integers
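# the Series s being sliced below is assumed to be (consistent with the outputs):
s=pd.Series([101,111,121,131,141,151], index=['a','b','c','d','e','f'])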
x=s[1: :2]
print('x=\n', x)
y=s[-1: :-1]
print('y=\n', y)
z=s[1: -2: 2]
print('z=\n', z)
o/p:
x=
b 111
d 131
f 151
dtype: int64
y=
f 151
e 141
d 131
c 121
b 111
a 101
dtype: int64
z=
b 111
d 131
dtype: int64
s['c'] = 555
s[['f','a']] = [666,777]
print('s=\n', s)
s['b':'d']=[0,1,2]
print('s=\n', s)
o/p:
s=
a 777
b 111
c 555
d 131
e 141
f 666
dtype: int64
s=
a 777
b 0
c 1
d 2
e 141
f 666
dtype: int64
s.at['d'] = 999
s.iat[-1] = 777
print('s=\n', s)
o/p:
s=
a 101
b 111
c 121
d 999
e 141
f 777
dtype: int64
s.loc['b'] = 9
s.loc['e':'f'] = [8,7]
print('s=\n', s)
s.iloc[1: :2] = [33,44,55]
print('s=\n', s)
o/p:
s=
a 101
b 9
c 121
d 131
e 8
f 7
dtype: int64
s=
a 101
b 33
c 121
d 44
e 8
f 55
dtype: int64
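A slice can also appear on the left-hand side of an assignment; a sketch consistent with the output below (the statement and Series values are assumptions):
s=pd.Series([101,111,121,131,141,151], index=['a','b','c','d','e','f'])
s[1: :2]=[1,2,3]
print('s=\n', s)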
o/p:
s=
a 101
b 1
c 121
d 2
e 141
f 3
dtype: int64
Vector Operations: When an operation is performed on a Series object, all the elements of that object take part in the operation. Such operations are known as vector operations. A Series object supports vector operations, and the result returned is also a Series object.
Data Alignment: When vector/arithmetic operations are performed between two Series objects, the data is aligned based on the common/matching labels/indexes of the two Series objects. Where the data does not align, because of incompatible data or because a corresponding index is not available in one of the objects, a result of NaN is usually assigned at that position.
#25 Vector Operations on Series object
import pandas as pd
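# the Series s1 is assumed (consistent with the output below):
s1=pd.Series([21,5,11,3], index=['a','b','c','d'])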
s2=s1*2
print('s2=\n', s2)
s3=s1>10
print('s3=\n', s3)
o/p:
s2=
a 42
b 10
c 22
d 6
dtype: int64
s3=
a True
b False
c True
d False
dtype: bool
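In the data-alignment example below, two Series objects with partially overlapping indexes are added; the Series values here are assumptions chosen to match the output:
s1=pd.Series([3,5,9], index=['a','b','c'])
s2=pd.Series([2,9,4,6], index=['b','c','d','e'])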
s3=s1+s2
print('s3=\n', s3)
o/p:
s3=
a NaN
b 7.0
c 18.0
d NaN
e NaN
dtype: float64
a) del seriesobject[index]
The del command deletes the element whose index/label is specified.
b) seriesobject.drop(labels=[list_of_indexes], inplace=False)
The drop() method can be used to delete one or more elements by passing a list of labels/indexes to the labels parameter. If the inplace parameter is not passed or is False, then a new Series object with the elements deleted is returned. If inplace=True is passed, then the current object is modified in place.
c) seriesobject.pop(index)
The pop() method is passed the index of the element that is to be deleted. This method returns the value of the element being deleted.
import pandas as pd
s1=pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
del s1['b']
print('s1=',s1)
s1.drop(labels=['c','e'], inplace=True)
print('s1=',s1)
x=s1.pop('d')
print('Deleted value:',x)
print('s1=',s1)
o/p:
s1= a 1
c 3
d 4
e 5
dtype: int64
s1= a 1
d 4
dtype: int64
Deleted value: 4
s1= a 1
dtype: int64
8. Boolean Indexing in Series object-
The index operator [], the loc and the iloc properties can be passed a boolean list containing the same
number of elements as that in the Series object. Wherever the True values are present, those elements are
selected and returned back as another Series object.
Inside the index operator a relational expression can also be passed that gives a boolean Series object having
the same number of elements as the Series object itself. If different boolean expressions are combined then
instead of using the 'and' operator the 'bitwise and' i.e. '&' is used, instead of 'or' operator the 'bitwise or'
i.e. '|' is used and instead of the 'not' operator the 'bitwise not' i.e. '~' is used.
Whenever multiple relational expressions are used for Boolean indexing then individual expressions must be
enclosed in parentheses since the bitwise and(&), or(|), not(~) operators have higher precedence than the
relational operators.
#Boolean indexing
import pandas as pd
s1=pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
s2=s1[[True,False,True,False,True]]
print('s2=',s2)
s3=s1[s1>=3]
print('s3=',s3)
s4=s1[(s1>=2)&(s1<=4)]
print('s4=',s4)
s5=s1[(s1<2)|(s1>4)]
print('s5=',s5)
s6=s1[~(s1==3)]
print('s6=',s6)
o/p:
s2= a 1
c 3
e 5
dtype: int64
s3= c 3
d 4
e 5
dtype: int64
s4= b 2
c 3
d 4
dtype: int64
s5= a 1
e 5
dtype: int64
s6= a 1
b 2
d 4
e 5
dtype: int64
o/p:
True
3. is_monotonic_increasing
Returns True if the values in the object are monotonically increasing.
d1={'a':9, 'b':7, 'c':5, 'd':1}
s1=pd.Series(d1)
print(s1.is_monotonic_increasing)
o/p:
False
4. ndim
Returns the number of dimensions of the underlying data; a Series object by definition has only one dimension, i.e. 1.
d1={'a':9, 'b':1, 'c':7, 'd':2}
s1=pd.Series(d1)
print(s1.ndim)
o/p:
1
5. shape, size
The shape property returns a tuple (n,) containing a single element, which is the number of elements in the Series object; size returns the number of elements itself.
d1={'a':9, 'b':1, 'c':7, 'd':2}
s1=pd.Series(d1)
print(s1.shape)
print(s1.size)
o/p:
(4,)
4
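The c= and d= outputs below show addition without and with fill_value (obj1.add(obj2, fill_value=None)); a sketch with assumed Series values that reproduces them:
a = pd.Series([1, 6, 1, 20], index=['a', 'b', 'c', 'd'])
b = pd.Series([10, 10, np.NaN], index=['a', 'b', 'e'])
c = a.add(b)
print('c=\n', c)
d = a.add(b, fill_value=100)
print('d=\n', d)
o/p: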
c=
a 11.0
b 16.0
c NaN
d NaN
e NaN
dtype: float64
d=
a 11.0
b 16.0
c 101.0
d 120.0
e NaN
dtype: float64
8. obj1.radd(obj2, fill_value=None)
Reverse addition; technically computes obj2 + obj1, similar to the add function.
9. agg(), aggregate()
Both agg() and aggregate() methods are used to perform aggregate functions over a Series object. The aggregate operation, viz. 'min', 'max', 'sum', 'mean', 'median', 'mode', is passed as a string parameter to the agg()/aggregate() method.
a = pd.Series([23,11,2,7,7,2], index=['a','b','c','d','e','f'])
b = a.agg('min')
print('b=', b)
c = a.agg('sum')
print('c=', c)
d = a.aggregate('mean')
print('d=', d)
e = a.aggregate('mode')   # (assumed; produces the e= output below)
print('e=\n', e)
f = a.aggregate('median')
print('f=', f)
o/p:
b= 2
c= 52
d= 8.666666666666666
e=
0 2
1 7
dtype: int64
f= 7.0
10. obj1.count()
Returns the number of non-NA/null observations in the Series.
import pandas as pd
import numpy as np
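# the Series used in the original example is not shown; any Series with three non-NaN values gives this output (values assumed)
a = pd.Series([2, 4, np.NaN, 5, np.NaN])
print(a.count())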
o/p:
3
11. obj1.div(obj2, fill_value=None)
obj1.divide(obj2, fill_value=None)
obj1.truediv(obj2, fill_value=None)
obj1.rtruediv(obj2, fill_value=None)
obj1.rdiv(obj2, fill_value=None)
Return the floating-point division of Series elements, index-wise. rdiv does reverse division, i.e. obj2/obj1.
fill_value: fills existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing, the result will be missing.
#30 divide Function on Series object
import pandas as pd
import numpy as np
a = pd.Series([10, 12, 13, np.NaN], index=['a', 'b', 'c', 'd'])
b = pd.Series([2, 5, 6, np.nan], index=['a', 'b', 'd', 'e'])
c = a.div(b)
print('c=\n', c)
d = a.divide(b, fill_value=3)
print('d=\n', d)
e = a.rdiv(b)
print('e=\n', e)
o/p:
c=
a 5.0
b 2.4
c NaN
d NaN
e NaN
dtype: float64
d=
a 5.000000
b 2.400000
c 4.333333
d 0.500000
e NaN
dtype: float64
e=
a 0.200000
b 0.416667
c NaN
d NaN
e NaN
dtype: float64
12. Relational functions on Series objects:
obj1.eq(obj2, fill_value=None)
obj1.ne(obj2, fill_value=None)
obj1.gt(obj2, fill_value=None)
obj1.ge(obj2, fill_value=None)
obj1.lt(obj2, fill_value=None)
obj1.le(obj2, fill_value=None)
The eq, ne, gt, ge, lt, le functions are similar to the relational operators ==, !=, >, >=, <, <=. These functions return a Series object which compares the elements index-wise and contains True/False. If only one value is missing or NaN, then the result is False. In addition, fill_value fills existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing/NaN, the result will be False.
x = pd.Series([12,2,3,4,np.NaN], index=['a', 'b', 'c', 'd', 'f'])
y = pd.Series([1,9,4,5,np.NaN], index=['a', 'b', 'd', 'e', 'f'])
m = x.eq(y)
print('m=\n', m)
n = x.gt(y)
print('n=\n', n)
p=x.le(y)
print('p=\n', p)
q=x.le(y,fill_value=10)
print('q=\n', q)
o/p:
m=
a False
b False
c False
d True
e False
f False
dtype: bool
n=
a True
b False
c False
d False
e False
f False
dtype: bool
p=
a False
b True
c False
d True
e False
f False
dtype: bool
q=
a False
b True
c True
d True
e False
f False
dtype: bool
13. equals()
Returns a single True value if all the elements of both Series objects match index-wise. If both Series objects contain NaN at the same position, it still evaluates to True. Otherwise it returns False.
x = pd.Series([1,2,3,4,np.NaN], index=['a', 'b', 'c', 'd', 'e'])
y = pd.Series([1,2,3,4,np.NaN], index=['a', 'b', 'c', 'd', 'e'])
print(x.equals(y))
o/p:
True
14. fillna(value, inplace=False)
Fills NaN values with the value passed. If the inplace parameter is not passed or is False, it returns another Series object with the NaN values filled with the value that was passed. If inplace=True is passed, then the current object itself is modified.
x = pd.Series([1,2,3,4,np.NaN], index=['a', 'b', 'c', 'd', 'e'])
y = x.fillna(0)
print('x=', x)
print('y=', y)
z = pd.Series([2,2,np.NaN], index=['a', 'b', 'c'])
z.fillna(99,inplace=True)
print('z=', z)
o/p:
x= a 1.0
b 2.0
c 3.0
d 4.0
e NaN
dtype: float64
y= a 1.0
b 2.0
c 3.0
d 4.0
e 0.0
dtype: float64
z= a 2.0
b 2.0
c 99.0
dtype: float64
15. obj1.floordiv(obj2, fill_value=None)
obj1.rfloordiv(obj2, fill_value=None)
Performs an action similar to //, i.e. integer/floor division, with the addition of the fill_value argument.
x = pd.Series([10,7,9,5,np.NaN], index=['a', 'b', 'c', 'd', 'e'])
y = pd.Series([3,2,5,4,7], index=['a', 'b', 'c', 'd', 'e'])
z = x.floordiv(y)
print('z=',z)
o/p:
z= a 3.0
b 3.0
c 1.0
d 1.0
e NaN
dtype: float64
16. max(), min(), mean(), median(), mode(), sum()
max finds the maximum, min the minimum, mean the mean, median the median and sum the sum. mode finds the mode; it can return multiple values and therefore returns another Series object as the result. If any data is NaN, it is not counted while doing the calculation.
a = pd.Series([2,1,2,1,np.NaN], index=['a','b','c','d','e'])
b = a.min()
print('b=', b)
c = a.max()
print('c=', c)
d = a.mean()
print('d=', d)
e = a.median()
print('e=', e)
f = a.mode()   # (assumed; produces the f= output below)
print('f=\n', f)
g = a.sum()
print('g=', g)
o/p:
b= 1.0
c= 2.0
d= 1.5
e= 1.5
f=
0 1.0
1 2.0
dtype: float64
g= 6.0
17. obj1.mul(obj2, fill_value=None)
obj1.multiply(obj2, fill_value=None)
Similar to the * operator, with the addition of the fill_value argument.
a = pd.Series([2,1,3,4,np.NaN], index=['a','b','c','d','e'])
b = pd.Series([4,5,7,np.NaN], index=['a','b','d','e'])
c= a.mul(b)
print('c=\n', c)
o/p:
c=
a 8.0
b 5.0
c NaN
d 28.0
e NaN
dtype: float64
18. nlargest(), nsmallest()
If no parameter is passed, it returns the top 5 largest/smallest elements of the Series object. If an integer parameter x is passed, then it returns the top x largest/smallest elements of the Series object.
a = pd.Series([2,1,3,4,7,10, 19, 21, 8, np.NaN])
b = a.nlargest()
print('b=\n', b)
c=a.nsmallest(3)
print('c=\n', c)
o/p:
b=
7 21.0
6 19.0
5 10.0
8 8.0
4 7.0
dtype: float64
c=
1 1.0
0 2.0
2 3.0
dtype: float64
19. obj1.pow(obj2, fill_value=None)
Similar to the ** operator, with the option of using the fill_value parameter to fill NaN values.
a = pd.Series([2,1,3], index=['a','b','c'])
b = pd.Series([3,4,2], index=['a','b','d'])
c = a.pow(b)
print('c=\n', c)
o/p:
c=
a 8.0
b 1.0
c NaN
d NaN
dtype: float64
20. obj1.prod(), obj1.product()
Returns the product of the values in the Series object.
a = pd.Series([2,4,3], index=['a','b','c'])
b = a.prod()
print('b=', b)
o/p:
b= 24
21. obj1.round(decimals=0)
Rounds each value in a Series to the given number of decimals. The decimals parameter has a default value of 0, i.e. if no parameter is specified it rounds to integers. If decimals is negative, it specifies the number of positions to the left of the decimal point.
a = pd.Series([212.542,452.987,327.192], index=['a','b','c'])
b = a.round()
print('b=\n', b)
c = a.round(2)
print('c=\n', c)
d = a.round(-2)
print('d=\n', d)
o/p:
b=
a 213.0
b 453.0
c 327.0
dtype: float64
c=
a 212.54
b 452.99
c 327.19
dtype: float64
d=
a 200.0
b 500.0
c 300.0
dtype: float64
22. obj1.std(ddof=1)
std() without any parameters takes the default ddof parameter as 1 and calculates the sample standard deviation, sqrt(sum((x - mean)^2)/(n-1)). To calculate the population standard deviation, sqrt(sum((x - mean)^2)/n), use obj.std(ddof=0).
a = pd.Series([9, 2, 5, 4])
b = a.std()       #calculates sample standard deviation
print('b=', b)
c = a.std(ddof=0) #calculates population standard deviation
print('c=', c)
o/p:
b= 2.943920288775949
c= 2.5495097567963922
23. obj1.var(ddof=1)
var() without any parameters calculates the sample variance; var(ddof=0) calculates the population variance.
# (example code assumed; the same Series a reproduces the output below)
b = a.var()
print('b=', b)
c = a.var(ddof=0)
print('c=', c)
o/p:
b= 8.666666666666666
c= 6.5
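The c= and d= outputs below are consistent with subtraction and reverse subtraction (obj1.sub(obj2) / obj1.rsub(obj2)); a sketch with assumed values:
x = pd.Series([9, 2, np.NaN, 4])
y = pd.Series([1, 5, 3, 4])
c = x.sub(y)
print('c=\n', c)
d = x.rsub(y)
print('d=\n', d)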
o/p:
c=
0 8.0
1 -3.0
2 NaN
3 0.0
dtype: float64
d=
0 -8.0
1 3.0
2 NaN
3 0.0
dtype: float64
s1=pd.Series([1,None,3,4,np.NaN], index=['a','b','c','d','e'])
s2=s1.dropna()
print('s1=\n',s1)
print('s2=\n',s2)
s1.dropna(inplace=True)
print('s1=\n',s1)
o/p:
s1=
a 1.0
b NaN
c 3.0
d 4.0
e NaN
dtype: float64
s2=
a 1.0
c 3.0
d 4.0
dtype: float64
s1=
a 1.0
c 3.0
d 4.0
dtype: float64
s1=pd.Series([1,np.NaN,np.NaN,np.NaN,5], index=['a','b','c','d','e'])
s2=s1.fillna(0)
print('s2=\n',s2)
s3=s1.fillna(method='ffill')
print('s3=\n',s3)
s4=s1.fillna(method='bfill')
print('s4=\n',s4)
s1.fillna(1.5,limit=2,inplace=True)
print('s1=\n',s1)
o/p:
s2=
a 1.0
b 0.0
c 0.0
d 0.0
e 5.0
dtype: float64
s3=
a 1.0
b 1.0
c 1.0
d 1.0
e 5.0
dtype: float64
s4=
a 1.0
b 5.0
c 5.0
d 5.0
e 5.0
dtype: float64
s1=
a 1.0
b 1.5
c 1.5
d NaN
e 5.0
dtype: float64
DataFrame
A DataFrame is a two-dimensional data structure in the python pandas library which stores heterogeneous
(different kinds of) data in different columns.
In a DataFrame, the column labels run along axis=1 (the columns) and the row labels/index run along axis=0 (the rows).
For using the DataFrame object we must import the pandas library by using the statement:
import pandas as pd
Creating a DataFrame
The DataFrame() method is primarily used to create a DataFrame. It can accept different kinds of input, and there are many different ways of creating a DataFrame, some of which are:
1. Creating an empty DataFrame
df1=pd.DataFrame()
print('df1=\n',df1)
o/p:
df1=
Empty DataFrame
Columns: []
Index: []
2. Creating a DataFrame from a List of Lists
A two-dimensional nested list can be used to create a DataFrame. The columns parameter is used to pass the names of the columns as a list.
#2 Creating a DataFrame from List of Lists
import pandas as pd
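# the nested list is an assumption consistent with the output below
L = [['abc', 15], ['def', 16], ['ghi', 17]]
df1 = pd.DataFrame(L, columns=['name', 'age'])
print('df1=\n', df1)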
o/p:
df1=
name age
0 abc 15
1 def 16
2 ghi 17
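3. Creating a DataFrame from a Dictionary of Lists/Arrays/Series
The keys of the dictionary become the column names and the values become the column data. The heading number and the first example here are a sketch reconstructed to match the df1= output further below:
#3 Creating a DataFrame from a dictionary (sketch)
import pandas as pd
import numpy as np
d1 = {'name':['abc','def','ghi'], 'age':[15,16,17]}
df1 = pd.DataFrame(d1)
print('df1=\n', df1)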
a1 = np.array(['jkl','mno','pqr'])
a2 = np.array([20,21,22])
d2 = { 'name' : a1, 'age' : a2 }
df2 = pd.DataFrame(d2, index=['r1', 'r2', 'r3'])
print('df2=\n', df2)
s1 = pd.Series(['stu','vw', 'xyz'])
s2 = pd.Series([23, 24, 25])
d3 = {'name':s1, 'age' : s2}
df3 = pd.DataFrame(d3)
print('df3=\n', df3)
o/p:
df1=
name age
0 abc 15
1 def 16
2 ghi 17
df2=
name age
r1 jkl 20
r2 mno 21
r3 pqr 22
df3=
name age
0 stu 23
1 vw 24
2 xyz 25
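4. Creating a DataFrame from a List of Dictionaries
Each dictionary in the list becomes one row; a key missing from a dictionary gives NaN in that column. A sketch with assumed values that reproduces the output below:
#4 Creating a DataFrame from a list of dictionaries (sketch)
import pandas as pd
d1 = [{'name':'abc', 'age':15}, {'name':'def', 'age':16, 'class':5}]
df1 = pd.DataFrame(d1)
print('df1=\n', df1)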
o/p:
df1=
age class name
0 15 NaN abc
1 16 5.0 def
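5. Creating a DataFrame from a List of Dictionaries and specifying a row index
The same list of dictionaries with row labels passed through the index parameter (sketch):
df1 = pd.DataFrame(d1, index=['r1', 'r2'])
print('df1=\n', df1)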
o/p:
df1=
age class name
r1 15 NaN abc
r2 16 5.0 def
6. Creating a DataFrame from a List of Dictionaries and specifying row / column index
Similar to the previous two examples, we can use index=[list_of_row_labels] and columns=[list_of_column_labels] to specify the row index as well as the column index.
While specifying the column labels we have the flexibility of specifying only a limited list of column names, in which case only the columns appearing in the list appear in the DataFrame object.
Another flexibility is that if any additional column name is specified which does not exist in any of the dictionaries, then that column is created in the DataFrame object and all the values under that column appear as NaN.
#6 Creating a DataFrame from List of Dictionary and row / column index
import pandas as pd
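# the list of dictionaries is an assumption consistent with the output below
d1 = [{'name':'abc', 'age':15}, {'name':'def', 'age':16, 'class':5}]
df1 = pd.DataFrame(d1, index=['r1','r2'], columns=['name','age'])
print('df1=\n', df1)
df2 = pd.DataFrame(d1, index=['r1','r2'], columns=['name','age','marks'])
print('df2=\n', df2)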
o/p:
df1=
name age
r1 abc 15
r2 def 16
df2=
name age marks
r1 abc 15 NaN
r2 def 16 NaN
The parameter sep='char' can be used to specify the character used to separate the column values; by default it is the comma (,).
The parameter index_col=int can be used to specify the column from which the row labels are to be taken. An int is specified giving the column number that contains the row labels; the first column has index 0, the second column has index 1, and so on.
Similar to importing of data from a csv file, data from a DataFrame object can be exported to a csv file
using the to_csv() method. The to_csv() method has many parameters to control the kind of data
exported. The parameter index=False will not export the index as a column in the csv file. The
parameter header=False will omit writing of the column names to the csv file being exported.
df1.to_csv('newfile1.csv')
df1.to_csv('newfile2.csv',index=False, header=False)
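A sketch of the read_csv() calls consistent with the output below (the file names and their contents are assumptions):
df1 = pd.read_csv('students1.csv')
print('df1=\n', df1)
df2 = pd.read_csv('students2.csv', index_col=0)
print('df2=\n', df2)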
o/p:
df1=
name age
0 abc 15
1 def 16
2 ghi 17
df2=
name age hometown
regno
101 abc 15 lll
111 def 16 mmm
121 ghi 17 nnn
Common properties/attributes of DataFrames
Assume DataFrame df1 is as defined below:
dict1={'students':['abc', 'def','ghi'],
'marks': [24.5, 27.5, 30],
'sports': ['cricket', 'badminton', 'football']}
df1=pd.DataFrame(dict1,index=['I','II','III'])
print('df1=\n',df1)
df1=
students marks sports
I abc 24.5 cricket
II def 27.5 badminton
III ghi 30.0 football
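#1. Selecting a single column using the square bracket notation (sketch consistent with the output below)
print('Student column is :\n', df1['students'])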
o/p:
Student column is :
I abc
II def
III ghi
Name: students, dtype: object
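#2. Selecting a single column using the dot notation (sketch consistent with the output below)
print('Marks column is :\n', df1.marks)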
o/p:
Marks column is :
I 24.5
II 27.5
III 30.0
Name: marks, dtype: float64
The square bracket notation (df1['students'], df1[2017]) can be used when the column names are strings ('students') or numbers (2017). The dot notation can only be used when the column name is a string (df1.marks). Hence, in general, we use the square bracket notation for all cases.
#3. Selecting subset of rows/columns from Dataframe using row names and column names
print('Displaying subset:\n', df1.loc['I':'II', 'students':'marks'])
o/p:
Displaying subset:
students marks
I abc 24.5
II def 27.5
<dataframe>.loc[ <startrow> : <endrow> , <startcolumn> : <endcolumn> ]
Used to access a subset of the dataframe using row index names and column index names. Note that loc uses square brackets, not parentheses.
#4. Selecting subset of rows/columns using row numbers and column numbers
print('Displaying subset using row and column index numbers:\n', df1.iloc[0:2, 1:3])
o/p:
Displaying subset using row and column index numbers:
marks sports
I 24.5 cricket
II 27.5 badminton
<dataframe>.iloc[ <startrow index> : <endrow index> : <step value> ,
<startcolumn index> : <endcolumn index> : <step value> ]
Used to access a subset of the dataframe using row index numbers and column index numbers via a row slice and a column slice. If the step value is not written, it is assumed to be 1.
#5. Selecting/Accessing individual value using column name and row name
print("Value in row I column student is:\n", df1.students['I'])
# after dot there is column name and inside square bracket the row name
o/p:
Value in row I column student is:
abc
#6. Selecting/Accessing individual value using column name and row number
print("Value in row 0 column sports is:\n", df1.sports[0])
# after dot there is column name and inside square bracket the row number
o/p:
Value in row 0 column sports is:
cricket
#7. Selecting/Accessing individual value using at attribute i.e. using row name and column
name
print("Accessing individual value using at attribute:\n", df1.at['II','students'])
# after .at there is square bracket and then inside it row name and column name
o/p:
Accessing individual value using at attribute:
def
#8. Selecting/Accessing individual value using iat attribute i.e. using numeric row index and
column index
print("Accessing individual value using iat attribute:\n", df1.iat[2,2])
# after .iat there is square bracket and then inside it row number and column number
o/p:
Accessing individual value using iat attribute:
football
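An individual value can be changed using any of the four access methods; a sketch consistent with the output below (the exact statements are assumptions):
df1.students['I'] = 'xyz'
df1.at['II', 'students'] = 'pqr'
df1.sports[0] = 'chess'
df1.iat[2, 2] = 'carrom'
print('df1(After updating values)=\n', df1)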
o/p:
df1(After updating values)=
students marks sports
I xyz 24.5 chess
II pqr 27.5 badminton
III ghi 30.0 carrom
All the four methods described previously to access individual values of a DataFrame can be
used to also change an individual value of a DataFrame.
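A new column can be added by assigning a value to a new column name; a sketch consistent with the output below:
df1['hometown'] = 'otp'
print('df1(After adding column hometown)=\n', df1)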
o/p:
df1(After adding column hometown)=
students marks sports hometown
I xyz 24.5 chess otp
II pqr 27.5 badminton otp
III ghi 30.0 carrom otp
The value 'otp' is appearing across all rows of the DataFrame.
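An existing column can be updated by assigning a list of values, one per row (sketch):
df1['hometown'] = ['ottapalam', 'shoranur', 'palakkad']
print('df1(After updating column hometown)=\n', df1)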
o/p:
df1(After updating column hometown)=
students marks sports hometown
I xyz 24.5 chess ottapalam
II pqr 27.5 badminton shoranur
III ghi 30.0 carrom palakkad
The value ['ottapalam', 'shoranur', 'palakkad'] is appearing across the rows of the DataFrame.
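A new row can be added by assigning to a new row label using loc; a sketch consistent with the output below:
df1.loc['IV'] = 'rrr'
print('df1(After adding row IV )=\n', df1)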
o/p:
df1(After adding row IV )=
students marks sports hometown
I xyz 24.5 chess ottapalam
II pqr 27.5 badminton shoranur
III ghi 30 carrom palakkad
IV rrr rrr rrr rrr
The value 'rrr' is appearing across the columns of the DataFrame.
Only the at and loc accessors can be used to add as well as modify an entire row, since they are the ones that accept a row label.
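An existing row can be updated by assigning a list of values to its label (sketch consistent with the output below):
df1.loc['IV'] = ['mno', 25.5, 'football', 'delhi']
print('df1(After updating row 3 )=\n', df1)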
o/p:
df1(After updating row 3 )=
students marks sports hometown
I xyz 24.5 chess ottapalam
II pqr 27.5 badminton shoranur
III ghi 30 carrom palakkad
IV mno 25.5 football delhi
The value ['mno', 25.5, 'football', 'delhi'] is appearing across the columns of the DataFrame.
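A sketch of the del and drop() statements consistent with the output below (the three ways of deleting a column are explained after the output):
del df1['hometown']
print('df1(After deleting column hometown )=\n', df1)
df1.drop(['sports'], axis=1, inplace=True)
print('df1(After deleting column sports )=\n', df1)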
s1=df1.pop('marks')
print('Deleted column is:\n', s1)
print('df1(After deleting column marks )=\n', df1)
o/p:
df1(After deleting column hometown )=
students marks sports
I xyz 24.5 chess
II pqr 27.5 badminton
III ghi 30 carrom
IV mno 25.5 football
df1(After deleting column sports )=
students marks
I xyz 24.5
II pqr 27.5
III ghi 30
IV mno 25.5
Deleted column is:
I 24.5
II 27.5
III 30
IV 25.5
Name: marks, dtype: object
df1(After deleting column marks )=
students
I xyz
II pqr
III ghi
IV mno
There are three ways of deleting a column of a DataFrame:
a) using the python del command as:
del dataframeobject[columnname]
b) using the dataframe drop() method
The drop command can be used to delete rows (axis=0) or columns(axis=1).
The first parameter is a list containing either the row index names or the column index
names.
The parameter inplace=True is used to modify/delete the dataframe df1 itself. If this
parameter is not specified or is False then the dataframe df1 is not modified, instead it
returns a new dataframe with the row or column deleted.
c) Using the pop('columnname') method
The pop() method is used to delete a single column from a DataFrame. In addition, the
column that was deleted is returned back as a Series object.
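Rows can be deleted with drop() using axis=0; a sketch consistent with the output below (assuming df1 has been re-created with all four rows and three columns):
df1.drop(['II', 'III'], axis=0, inplace=True)
print('df1(After deleting row 1 and 2 )=\n', df1)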
o/p:
df1(After deleting row 1 and 2 )=
students marks sports
I xyz 24.5 chess
IV mno 25.5 football
The drop command can be used to delete rows (axis=0) or columns(axis=1).
If multiple rows are to be deleted then the first parameter must contain the list of row names to be
deleted.
If a positive value n is passed to the head function, it retrieves the top n rows. If a negative n is passed to the head function, it returns all the rows except the last n rows.
Similarly, if a positive value n is passed to the tail function, it retrieves the bottom n rows of the DataFrame. If a negative n is passed to the tail function, all the rows except the first n rows are retrieved.
These functions are useful for quickly verifying the data, for example after sorting or adding rows.
d={'students':['a', 'b','c','d','e','f','g','h','i','j'],
'marks': [25,21,8,9,15,29,np.NaN,25,24,30]}
df1=pd.DataFrame(d)
df2=df1.head()
print('df2=\n',df2)
df3=df1.head(-7)
print('df3=\n',df3)
print(df1.tail(2))
o/p:
df2=
students marks
0 a 25.0
1 b 21.0
2 c 8.0
3 d 9.0
4 e 15.0
df3=
students marks
0 a 25.0
1 b 21.0
2 c 8.0
students marks
8 i 24.0
9 j 30.0
The length of the Boolean array passed must match the number or rows/indexes of the DataFrame
otherwise an error is thrown. The Boolean array can also be a Series object which can be derived
from applying a relational operator to one or more columns of the DataFrame. Different relational
expressions can be combined using the bitwise and (&), or (|), not (~) operators. When using the
bitwise operators the individual relational expressions must be enclosed in parentheses () as the
bitwise operators have higher precedence than the relational operators.
d={'students':['a', 'b','c','d','e','f','g','h','i','j'],
'marks': [25,21,8,9,15,29,np.NaN,25,24,30],
'hobby': ['mm','nn','oo','pp','qq','rr','ss','t','uu','vv']}
df1=pd.DataFrame(d)
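A sketch of Boolean-indexing statements consistent with the outputs below (the exact boolean lists and conditions are assumptions):
df2 = df1[[True,True,False,False,False,False,False,False,True,True]]
print('df2=\n', df2)
df3 = df1.loc[[True,True,False,False,False,False,False,False,False,False]]
print('df3=\n', df3)
df4 = df1.iloc[[False,False,False,False,False,False,False,False,True,True]]
print('df4=\n', df4)
df5 = df1[df1['marks'] >= 29]
print('df5=\n', df5)
df6 = df1[(df1['marks'] >= 21) & (df1['marks'] <= 24)]
print('df6=\n', df6)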
o/p:
df2=
students marks hobby
0 a 25.0 mm
1 b 21.0 nn
8 i 24.0 uu
9 j 30.0 vv
df3=
students marks hobby
0 a 25.0 mm
1 b 21.0 nn
df4=
students marks hobby
8 i 24.0 uu
9 j 30.0 vv
df5=
students marks hobby
5 f 29.0 rr
9 j 30.0 vv
df6=
students marks hobby
1 b 21.0 nn
8 i 24.0 uu
When using any of the iteration methods we usually work on a copy of the DataFrame's data, so we must not modify any of the DataFrame's values during iteration, as those changes are not reflected/updated in the original DataFrame.
#18. Iterating directly over a DataFrame
Iterating directly over a DataFrame gives the column names.
import pandas as pd
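# sketch reconstructed from the output below; the DataFrame contents are taken from that output
d = {'name':['abc','def','ghi'], 'age':[19,20,21], 'hobby':['reading','playing','gardening']}
df1 = pd.DataFrame(d, index=['s1','s2','s3'])
print(df1)
print('Using iteritems')
for cname, cseries in df1.iteritems():   # iteritems() iterates column-wise; newer pandas versions name it items()
    print('cname:', cname)
    print('cseries:\n', cseries)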
o/p:
name age hobby
s1 abc 19 reading
s2 def 20 playing
s3 ghi 21 gardening
Using iteritems
cname: name
cseries:
s1 abc
s2 def
s3 ghi
Name: name, dtype: object
cname: age
cseries:
s1 19
s2 20
s3 21
Name: age, dtype: int64
cname: hobby
cseries:
s1 reading
s2 playing
s3 gardening
Name: hobby, dtype: object
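Iterating with iterrows() gives each row label together with the row's values as a Series; a sketch consistent with the output below (assuming the same DataFrame):
import pandas as pd
d = {'name':['abc','def','ghi'], 'age':[19,20,21], 'hobby':['reading','playing','gardening']}
df1 = pd.DataFrame(d, index=['s1','s2','s3'])
print(df1)
print('Using iterrows')
for rname, rseries in df1.iterrows():
    print('rname:', rname)
    print('rseries:\n', rseries)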
o/p:
name age hobby
s1 abc 19 reading
s2 def 20 playing
s3 ghi 21 gardening
Using iterrows
rname: s1
rseries:
name abc
age 19
hobby reading
Name: s1, dtype: object
rname: s2
rseries:
name def
age 20
hobby playing
Name: s2, dtype: object
rname: s3
rseries:
name ghi
age 21
hobby gardening
Name: s3, dtype: object
The first element of the named tuple is the row label and the remaining elements are the values
under different columns for that particular row.
#Using itertuples
import pandas as pd
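# sketch reconstructed from the output below
d = {'name':['abc','def','ghi'], 'age':[19,20,21], 'hobby':['reading','playing','gardening']}
df1 = pd.DataFrame(d, index=['s1','s2','s3'])
print(df1)
print('Using itertuples')
for row in df1.itertuples():
    print(row)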
o/p:
name age hobby
s1 abc 19 reading
s2 def 20 playing
s3 ghi 21 gardening
Using itertuples
Pandas(Index='s1', name='abc', age=19, hobby='reading')
Pandas(Index='s2', name='def', age=20, hobby='playing')
Pandas(Index='s3', name='ghi', age=21, hobby='gardening')