DSL Pandas

Pandas provides useful data structures like Series and DataFrames for data analysis. Series are one-dimensional arrays with an associated index, while DataFrames are two-dimensional data structures that can be thought of as tables with labeled columns. Pandas allows easy creation, manipulation, and access of Series and DataFrames for tasks like data selection, grouping, pivoting, and statistics.


Data Science Lab

Pandas

DataBase and Data Mining Group Andrea Pasini, Elena Baralis

Introduction to Pandas

▪ Pandas
▪ Provides useful data structures (Series and
DataFrames) and data analysis tools
▪ Based on Numpy arrays

▪ Tools:
▪ Managing tables and series
• data selection
• grouping, pivoting
▪ Managing missing data
▪ Statistics on data

2
Pandas Series

▪ Series: 1-dimensional sequence of homogeneous elements
▪ Elements are associated with an explicit index
▪ Index elements can be either strings or integers
▪ Examples:

index    1     2     3
values   0.3   0.5   0.8

index    '3-July'   '4-July'   '5-July'
values   0.3        0.5        0.8

3
Pandas Series

▪ Creation from list

▪ When not specified, index is set automatically
with a progressive number

In [1]: import pandas as pd

s1 = pd.Series([2.0, 3.1, 4.5])
print(s1)

Out[1]: 0 2.0
1 3.1
2 4.5

4
Pandas Series

▪ Creation from list, specifying index

In [1]: pd.Series([2.0, 3.1, 4.5], index=['a', 'b', 'c'])

Out[1]: a 2.0
b 3.1
c 4.5

5
Pandas Series

▪ Creation from dictionary

▪ keys define the index
In [1]: pd.Series({'c':2.0, 'b':3.1, 'a':4.5})

Out[1]: c 2.0
b 3.1
a 4.5

6
Pandas Series

▪ Obtaining values and index from a Series

In [1]: s1 = pd.Series([2.0, 3.1, 4.5], index=['a', 'b', 'c'])

print(s1.values) # Numpy array
print(s1.index)

Out[1]: [2.  3.1 4.5]
Index(['a', 'b', 'c'], dtype='object')

▪ Index is a custom Python object defined in Pandas
7
Pandas Series

▪ Accessing Series elements

▪ Access by Index
▪ Explicit: the one specified while creating a Series
▪ Use the Series.loc attribute
▪ Implicit: number associated to the element order
(similarly to Numpy arrays)
▪ Use the Series.iloc attribute
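
The difference between the two access modes is easiest to see when the explicit index is made of integers that do not coincide with the positions. The following is a minimal sketch that reuses the index 1, 2, 3 from the earlier Series example; the variable names are illustrative only.

```python
import pandas as pd

# Integer labels 1, 2, 3 that differ from the positions 0, 1, 2
s = pd.Series([0.3, 0.5, 0.8], index=[1, 2, 3])

by_label = s.loc[1]      # explicit index: the element labelled 1
by_position = s.iloc[1]  # implicit index: the second element

print(by_label, by_position)
```

Here `loc[1]` and `iloc[1]` return different elements, which is why it helps to always state which of the two indices is meant.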

8
Pandas Series

▪ Accessing Series elements

In [1]: s1 = pd.Series([2.0, 3.1, 4.5], index=['a', 'b', 'c'])

print(s1.loc['a']) # With explicit index
print(s1.iloc[0]) # With implicit index
s1.loc['b'] = 10 # Allows editing values
print(f"Series:\n{s1}")

Out[1]: 2.0
2.0
Series:
a 2.0
b 10.0
c 4.5
dtype: float64

9
Pandas Series

▪ Accessing Series elements: slicing

In [1]: s1 = pd.Series([2.0, 3.1, 4.5], index=['a', 'b', 'c'])

print(s1.loc['b':'c']) # explicit index (stop element included)
print(s1.iloc[1:3]) # implicit index (stop element excluded)

Out[1]: b 3.1
c 4.5

b 3.1
c 4.5

10
Pandas Series

▪ Accessing Series elements: masking

In [1]: s1 = pd.Series([2.0, 3.1, 4.5], index=['a', 'b', 'c'])

print(s1[(s1>2) & (s1<10)])

Out[1]: b 3.1
c 4.5

11
Pandas Series

▪ Accessing Series elements: fancy indexing

In [1]: s1 = pd.Series([2.0, 3.1, 4.5], index=['a', 'b', 'c'])

print(s1.loc[['a', 'c']])

Out[1]: a 2.0
c 4.5

12
Pandas DataFrame

▪ DataFrame: 2-dimensional array
▪ Can be thought of as a table where columns are Series objects that share the same index
▪ Each column has a name
▪ Example:

Index     'Price'   'Quantity'   'Liters'
'Water'   1.0       5            1.5
'Beer'    1.4       10           0.3
'Wine'    5.0       8            1

13
Pandas DataFrame

▪ Creation from Series

▪ Use a dictionary to set column names

In [1]: price = pd.Series([1.0, 1.4, 5], index=['a', 'b', 'c'])

quantity = pd.Series([5, 10, 8], index=['a', 'b', 'c'])
liters = pd.Series([1.5, 0.3, 1], index=['a', 'b', 'c'])
df = pd.DataFrame({'Price':price, 'Quantity':quantity,
'Liters':liters})
print(df)

Out[1]: Price Quantity Liters

a 1.0 5 1.5
b 1.4 10 0.3
c 5.0 8 1.0

14
Pandas DataFrame

▪ Creation from list of dictionaries

▪ Each dictionary is associated to a row
▪ Index is automatically set to a progressive number
▪ Example:
In [1]: dic_list = [{'c1':i, 'c2':2*i} for i in range(3)]
df = pd.DataFrame(dic_list)
print(df)

Out[1]: c1 c2
0 0 0
1 1 2
2 2 4

15
Pandas DataFrame

▪ Creation from 2D Numpy array

▪ Example:
In [1]: import numpy as np
arr = np.arange(6).reshape((3,2))
df = pd.DataFrame(arr, columns=['c1', 'c2'],
index=['a', 'b', 'c'])
print(df)

Out[1]: c1 c2
a 0 1
b 2 3
c 4 5

16
Pandas DataFrame

▪ Obtaining column names and index from a DataFrame
Index Price Quantity Liters

a 1.0 5 1.5

b 1.4 10 0.3

c 5.0 8 1

In [2]: print(df.columns) # Index object with column names

print(df.index) # Index object

Out[2]: Index(['Price', 'Quantity', 'Liters'], dtype='object')

Index(['a', 'b', 'c'], dtype='object')

17
Pandas DataFrame

▪ Accessing DataFrame data

▪ Get a 2D Numpy array
Index Price Quantity Liters

a 1.0 5 1.5

b 1.4 10 0.3

c 5.0 8 1

In [2]: print(df.values) # Numpy array with data

Out[2]: [[ 1.   5.   1.5]
 [ 1.4 10.   0.3]
 [ 5.   8.   1. ]]

18
Pandas DataFrame

▪ Accessing DataFrames
▪ Access a DataFrame column
▪ Access rows and columns with indexing
▪ df.loc
• Explicit index
• Slicing, masking, fancy indexing
▪ df.iloc
• Implicit index
▪ Whether a copy or a view is returned depends on the context
▪ It is usually difficult to predict which one you will get
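
Because it is hard to predict whether a selection is a copy or a view, one common way to stay safe is to request a copy explicitly when the selection should be independent, and to write through a single `loc` call when the original should change. This is a sketch under those assumptions, not a rule stated in the slides; the DataFrame and the name `cheap` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Price': [1.0, 1.4, 5.0], 'Quantity': [5, 10, 8]},
                  index=['a', 'b', 'c'])

# Explicit copy of the selection: later edits cannot reach df
cheap = df.loc[df['Price'] < 2].copy()
cheap['Quantity'] = 0

# To edit the original instead, assign through one .loc call
df.loc[df['Price'] < 2, 'Price'] = 0.0

print(df)
print(cheap)
```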
19
Pandas DataFrame

▪ Accessing DataFrame columns

▪ Returns a Series with column data
Index Price Quantity Liters

a 1.0 5 1.5

b 1.4 10 0.3

c 5.0 8 1

In [1]: df['Quantity']

Out[1]: a 5
b 10
c 8

20
Pandas DataFrame

▪ Accessing single DataFrame row by index

▪ loc (explicit), iloc (implicit)
▪ Return a Series with an element for each column
In [1]: print(df.loc['a']) # Get the first row (explicit)
print(df.iloc[0]) # Get the first row

Out[1]: Price 1.0

Quantity 5.0
Liters 1.5

Price 1.0
Quantity 5.0
Liters 1.5
21
Pandas DataFrame

▪ Accessing DataFrames with slicing

▪ Allows selecting rows and columns

In [1]: print(df.loc['b':'c', 'Quantity':'Liters'])

Out[1]: Quantity Liters
b 10 0.3
c 8 1.0

22
Pandas DataFrame

▪ Accessing DataFrames with masking

▪ Select rows based on a condition
Index Price Quantity Liters

a 1.0 5 1.5

b 1.4 10 0.3

c 5.0 8 1

In [1]: mask = (df['Quantity']<10) & (df['Liters']>1)

df.loc[mask, 'Quantity':] # Use mask and slicing

Out[1]: Quantity Liters

a 5 1.5

23
Pandas DataFrame

▪ Accessing DataFrames with fancy indexing

▪ To select columns...
Index Price Quantity Liters

a 1.0 5 1.5

b 1.4 10 0.3

c 5.0 8 1

In [1]: mask = (df['Quantity']<10) & (df['Liters']>1)

df.loc[mask, ['Price','Liters']] # Use mask and fancy

Out[1]: Price Liters

a 1.0 1.5

24
Pandas DataFrame

▪ Accessing DataFrames with fancy indexing

▪ To select rows and columns...
Index Price Quantity Liters

a 1.0 5 1.5

b 1.4 10 0.3

c 5.0 8 1

In [1]: df.loc[['a', 'c'], ['Price','Liters']]

Out[1]: Price Liters

a 1.0 1.5
c 5.0 1.0

25
Pandas DataFrame

▪ Assign value to selected items

In [1]: df.loc[['a', 'c'], ['Price','Liters']] = 0

Index Price Quantity Liters

a 0.0 5 0.0

b 1.4 10 0.3

c 0.0 8 0.0

26
Pandas DataFrame

▪ Add new column to DataFrame

▪ DataFrame is modified in place

Before:
Index   Price   Quantity   Liters
a       0.0     5          0.0
b       1.4     10         0.3
c       0.0     8          0.0

After:
Index   Price   Quantity   Liters   Available
a       0.0     5          0.0      True
b       1.4     10         0.3      False
c       0.0     8          0.0      True

In [1]: df['Available'] = pd.Series([True, False, True],

index=['a', 'b', 'c'])

▪ If the DataFrame already has a column with the specified name, then this is replaced
27
Pandas DataFrame

▪ Add new column to DataFrame

▪ It is also possible to assign a list directly

Before:
Index   Price   Quantity   Liters
a       0.0     5          0.0
b       1.4     10         0.3
c       0.0     8          0.0

After:
Index   Price   Quantity   Liters   Available
a       0.0     5          0.0      True
b       1.4     10         0.3      False
c       0.0     8          0.0      True

In [1]: df['Available'] = [True, False, True]

28
Pandas DataFrame

▪ Drop column(s)
▪ Returns a copy of the updated DataFrame
Index Price Quantity Liters Available

a 1.0 5 1.5 True

b 1.4 10 0.3 False

c 5.0 8 1 True

In [1]: df = df.drop(columns=['Quantity', 'Liters'])

29
Pandas DataFrame

▪ Rename column(s)
▪ Use a dictionary which maps old names to new names
▪ Returns a copy of the updated DataFrame

Before:
Index   Price   Quantity   Liters   Available
a       1.0     5          1.5      True
b       1.4     10         0.3      False
c       5.0     8          1        True

After:
Index   Price   nItems   [L]   Available
a       1.0     5        1.5   True
b       1.4     10       0.3   False
c       5.0     8        1     True

In [1]: df = df.rename(columns={'Quantity': 'nItems',

'Liters': '[L]'})

30
Computation with Pandas

▪ Unary operations on Series and DataFrames

▪ exponentiation, logarithms, ...
▪ Operations between Series and DataFrames
▪ Operations are performed element-wise, being
aware of their indices
▪ Aggregations (min, max, std, ...)

31
Computation with Pandas

▪ Unary operations on Series and DataFrames

▪ Works with any Numpy ufunc
▪ The operation is applied to each element of the
Series/DataFrame
▪ Examples:
▪ res = my_series/4 + 1
▪ res = np.abs(my_series)
▪ res = np.exp(my_dataframe)
▪ res = np.sin(my_series/4)
▪ ...

32
Computation with Pandas

▪ Operations between Series (+,-,*,/)
▪ Applied element-wise after aligning indices
▪ Index elements which do not match are set to NaN (Not a Number)
▪ After index alignment, the index in the result is sorted
▪ Example:
▪ res = my_series1 + my_series2

my_series1      my_series2      res
Index           Index           Index
b    3          a    1          a    2
a    1          b    3          b    6
c    10         d    30         c    NaN
                                d    NaN
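
The alignment rule can be checked directly with the two Series from the example. The sketch below also shows the method form `add()` with `fill_value`, which is not covered in the slides: it treats a missing operand as the given value instead of producing NaN.

```python
import pandas as pd

my_series1 = pd.Series([3, 1, 10], index=['b', 'a', 'c'])
my_series2 = pd.Series([1, 3, 30], index=['a', 'b', 'd'])

# Plain operator: labels present on only one side give NaN
res = my_series1 + my_series2

# Method form: a missing operand is replaced by fill_value before adding
res_filled = my_series1.add(my_series2, fill_value=0)

print(res)
print(res_filled)
```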
33
Computation with Pandas

▪ Operations between DataFrames
▪ Applied element-wise after aligning indices and columns
▪ Example (align index): the index in the result is sorted
▪ res = my_dataframe1 + my_dataframe2

my_dataframe1:
Index   Total   Quantity
b       3       4
a       1       2
c       10      20

my_dataframe2:
Index   Total   Quantity
a       1       2
b       3       4
d       30      40

res:
Index   Total   Quantity
a       2       4
b       6       8
c       NaN     NaN
d       NaN     NaN

34
Computation with Pandas

▪ Operations between DataFrames
▪ Example (align columns): columns in the result are sorted
▪ res = my_dataframe1 + my_dataframe2

my_dataframe1:
Index   Total   Quantity
a       1       2
b       3       4
c       5       6

my_dataframe2:
Index   Total   Price
a       1       2
b       3       4
c       5       6

res:
Index   Price   Quantity   Total
a       NaN     NaN        2
b       NaN     NaN        6
c       NaN     NaN        10

35
Computation with Pandas

▪ Operations between DataFrames and Series
▪ The operation is applied between the Series and each row of the DataFrame
▪ Follows broadcasting rules
▪ Example:
▪ res = my_dataframe1 + my_series1

my_dataframe1:
Index   Total   Quantity
a       1       2
b       3       4
c       5       6

my_series1:
Index
Total      1
Quantity   2

res:
Index   Total   Quantity
a       2       4
b       4       6
c       6       8
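
The row-wise broadcasting above works because the Series index is matched against the column names. A minimal sketch of that case follows, together with the column-wise variant through the method form `sub(..., axis=0)`; the `row_offset` Series is a made-up example, not part of the slides.

```python
import pandas as pd

my_dataframe1 = pd.DataFrame({'Total': [1, 3, 5], 'Quantity': [2, 4, 6]},
                             index=['a', 'b', 'c'])
my_series1 = pd.Series({'Total': 1, 'Quantity': 2})

# Default: Series labels are matched against the column names,
# so the Series is added to every row
res = my_dataframe1 + my_series1

# To match the Series against the row index instead, use the
# method form with axis=0 (here: subtract a per-row offset)
row_offset = pd.Series({'a': 1, 'b': 3, 'c': 5})
res_rows = my_dataframe1.sub(row_offset, axis=0)

print(res)
print(res_rows)
```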

36
Computation with Pandas

▪ Pandas Series and DataFrames allow performing aggregations
▪ mean, std, min, max, sum
▪ Examples
In [1]: my_series.mean() # Return the mean of Series elements

▪ For DataFrames, aggregate functions are applied column-wise and return a Series
In [1]: my_df.mean() # Return a Series

37
Computation with Pandas

▪ Example of aggregations with DataFrames: z-score normalization
In [1]: mean_series = df.mean()
std_series = df.std()
df_norm = (df-mean_series)/std_series

df:
Index   Total   Quantity
a       1       2
b       3       4
c       5       6

mean_series:
Total      3.0
Quantity   4.0

std_series:
Total      2.0
Quantity   2.0
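
The normalization above can be run end to end on the small table shown. This sketch only adds the DataFrame construction; note that `std()` in pandas computes the sample standard deviation (ddof=1) by default, which is what gives 2.0 for both columns.

```python
import pandas as pd

df = pd.DataFrame({'Total': [1, 3, 5], 'Quantity': [2, 4, 6]},
                  index=['a', 'b', 'c'])

mean_series = df.mean()   # one mean per column
std_series = df.std()     # sample standard deviation (ddof=1) per column

# The two Series are broadcast over the rows, matching on column names
df_norm = (df - mean_series) / std_series

print(df_norm)
```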

38
Missing values

▪ Represented with sentinel values
▪ None: Python null value
▪ np.nan: Numpy Not a Number
▪ None is a Python object:
▪ np.array([4, None, 5]) has dtype=object
▪ np.nan is a floating point number
▪ np.array([4, np.nan, 5]) has dtype=float64
▪ Using nan gives better performance when performing numerical computations
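
The dtype difference between the two sentinels can be verified in a few lines. The sketch below also shows one consequence not spelled out in the slides: NaN propagates through ordinary aggregations, while NumPy's nan-aware functions such as `np.nansum` skip it.

```python
import numpy as np

obj_arr = np.array([4, None, 5])      # falls back to generic Python objects
flt_arr = np.array([4, np.nan, 5])    # stays a native floating point array

print(obj_arr.dtype, flt_arr.dtype)

# NaN propagates through ordinary aggregations;
# the nan-aware variants ignore it
plain_sum = flt_arr.sum()
safe_sum = np.nansum(flt_arr)

print(plain_sum, safe_sum)
```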
39
Missing values

▪ Pandas supports both None and NaN, and automatically converts between them when appropriate
▪ Example:
In [1]: pd.Series([4, None, 5, np.nan])

Out[1]: 0 4.0
1 NaN
2 5.0
3 NaN
dtype: float64

40
Missing values

▪ Operating on missing values (for Series and DataFrames)
▪ isnull()
▪ Return a boolean mask indicating null values
▪ notnull()
▪ Return a boolean mask indicating not null values
▪ dropna()
▪ Return filtered data with null values removed
▪ fillna()
▪ Return new data with missing values filled or imputed

41
Missing values

▪ Operating on missing values: isnull, notnull

▪ Return a new Series/DataFrame with the same
shape as the input
In [1]: s1 = pd.Series([4, None, 5, np.nan])
s1.isnull()

Out[1]: 0 False
1 True
2 False
3 True
dtype: bool

42
Missing values

▪ Operating on missing values: dropna

▪ For Series it removes null elements
In [1]: s1 = pd.Series([4, None, 5, np.nan])
s1.dropna()

Out[1]: 0 4.0
2 5.0
dtype: float64

43
Missing values

▪ Operating on missing values: dropna

▪ For DataFrames it removes rows that contain at least one missing value (default behaviour)

Before:
Index   Total   Quantity
a       1       2
b       3       NaN
c       5       6

After:
Index   Total   Quantity
a       1       2
c       5       6

▪ Alternatively, it is possible to remove columns

dropped_df = df.dropna(axis='columns')
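
The two dropping directions can be compared on the table above. The sketch also uses `how='all'`, an option of `dropna` not mentioned in the slides, which only removes rows where every value is missing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Total': [1, 3, 5], 'Quantity': [2, np.nan, 6]},
                  index=['a', 'b', 'c'])

rows_dropped = df.dropna()                 # removes row 'b'
cols_dropped = df.dropna(axis='columns')   # removes column 'Quantity'

# how='all' drops a row only when all of its values are missing,
# so nothing is removed here
all_missing = df.dropna(how='all')

print(rows_dropped)
print(cols_dropped)
```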

44
Missing values

▪ Operating on missing values: fillna

▪ Fill null fields with a specified value (for both
Series and DataFrames)
In [1]: s1 = pd.Series([4, None, 5, np.nan])
s1.fillna(0)

Out[1]: 0 4.0
1 0.0
2 5.0
3 0.0
dtype: float64

45
Missing values

▪ Operating on missing values: fillna

▪ The parameter method allows specifying different filling techniques
▪ ffill: propagate last valid observation forward
▪ bfill: use next valid observation to fill gap
▪ Since pandas 2.1 the method parameter of fillna is deprecated: call s1.ffill() or s1.bfill() directly
In [1]: s1 = pd.Series([4, None, 5, np.nan])
s1.ffill()

Out[1]: 0 4.0
1 4.0
2 5.0
3 5.0
dtype: float64
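
Both filling directions are also available as the dedicated Series methods `ffill()` and `bfill()`, which behave the same way on old and recent pandas versions. A minimal sketch on the same Series:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([4, None, 5, np.nan])

forward = s1.ffill()    # each gap takes the last valid value before it
backward = s1.bfill()   # each gap takes the next valid value after it

print(forward)
print(backward)
```

Note that the last element stays NaN after `bfill()`, since no valid observation follows it.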

46
Notebook Examples

▪ 4-Pandas Examples.ipynb
▪ 1. Accessing DataFrames and Series

47
Combining Pandas objects

▪ Pandas provides 2 methods for combining Series and DataFrames
▪ concat()
▪ Concatenate a sequence of Series/DataFrames
▪ append()
▪ Append a Series/DataFrame to the specified object (removed in pandas 2.0, use concat() instead)

48
Combining Pandas objects

▪ Concatenating 2 Series
▪ Index is preserved, even if duplicated
In [1]: s1 = pd.Series(['a', 'b'], index=[1,2])
s2 = pd.Series(['c', 'd'], index=[1,2])
pd.concat((s1, s2))

Out[1]: 1 a
2 b
1 c
2 d
dtype: object

49
Combining Pandas objects

▪ Concatenating 2 Series
▪ To avoid duplicates use ignore_index
In [1]: s1 = pd.Series(['a', 'b'], index=[1,2])
s2 = pd.Series(['c', 'd'], index=[1,2])
pd.concat((s1, s2), ignore_index=True)

Out[1]: 0 a
1 b
2 c
3 d
dtype: object

50
Combining Pandas objects

▪ Concatenating 2 DataFrames
▪ Concatenate vertically by default
In [1]: pd.concat((df1, df2))

df1:
Index   Total   Quantity
a       1       2
b       3       4

df2:
Index   Total   Quantity
c       5       6
d       7       8

Result:
Index   Total   Quantity
a       1       2
b       3       4
c       5       6
d       7       8

51
Combining Pandas objects

▪ Concatenating 2 DataFrames
▪ Missing columns are filled with NaN
In [1]: pd.concat((df1, df2))

df1:
Index   Total   Quantity
a       1       2
b       3       4

df2:
Index   Total   Quantity   Liters
c       5       6          1
d       7       8          2

Result:
Index   Total   Quantity   Liters
a       1       2          NaN
b       3       4          NaN
c       5       6          1
d       7       8          2

52
Combining Pandas objects

▪ The append() method is a shortcut for concatenating DataFrames
▪ Returns the result of the concatenation
▪ It was removed in pandas 2.0: on recent versions use pd.concat()

In [1]: df_concat = df1.append(df2) # pandas < 2.0 only

is equivalent to:
In [1]: df_concat = pd.concat((df1, df2))
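
Since `pd.concat()` is the form that works on every pandas version, here is a small runnable sketch of it, reusing the values of the earlier concatenation example with a `Liters` column present in only one of the inputs.

```python
import pandas as pd

df1 = pd.DataFrame({'Total': [1, 3], 'Quantity': [2, 4]}, index=['a', 'b'])
df2 = pd.DataFrame({'Total': [5, 7], 'Quantity': [6, 8], 'Liters': [1, 2]},
                   index=['c', 'd'])

# Rows of df2 are placed below those of df1;
# 'Liters' is NaN for the rows that come from df1
df_concat = pd.concat((df1, df2))

print(df_concat)
```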

53
Combining Pandas objects

▪ Joining DataFrames with relational algebra: merge()
▪ Column(s) with the same name in the two DataFrames are used as key
▪ Depending on the DataFrames, a one-to-one, many-to-one or many-to-many join can be performed
▪ Indices of the input DataFrames are discarded

In [1]: joined_df = pd.merge(df1, df2)

54
Combining Pandas objects

▪ Examples

one-to-one

df1:                df2:                 Result:
Index   k1   c2     Index   k1   c3      Index   k1   c2   c3
i1      0    a      i1      1    b1      0       0    a    a1
i2      1    b      i2      0    a1      1       1    b    b1

many-to-one

df1:                df2:                 Result:
Index   k1   c2     Index   k1   c3      Index   k1   c2   c3
i1      0    a      i1      1    b1      0       0    a    a1
i2      1    b      i2      0    a1      1       0    c    a1
i3      0    c                           2       1    b    b1
i4      1    d                           3       1    d    b1
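
The many-to-one case can be reproduced with the tables above. This sketch passes the key explicitly through the `on` parameter rather than relying on shared column names, and inspects the result through a dictionary so that it does not depend on the row order of the join; the name `c3_by_c2` is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'k1': [0, 1, 0, 1], 'c2': ['a', 'b', 'c', 'd']},
                   index=['i1', 'i2', 'i3', 'i4'])
df2 = pd.DataFrame({'k1': [1, 0], 'c3': ['b1', 'a1']}, index=['i1', 'i2'])

# on= names the join key explicitly
joined_df = pd.merge(df1, df2, on='k1')

# Which c3 value ended up next to each c2 value
c3_by_c2 = joined_df.set_index('c2')['c3'].to_dict()

print(joined_df)
print(c3_by_c2)
```

The original indices i1..i4 are gone in the result, which gets a fresh progressive index.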

55
Grouping data

▪ Pandas provides the equivalent of the SQL group by statement
▪ It allows the following operations:
▪ Iterating on groups
▪ Aggregating the values of each group (mean, min, max, ...)
▪ Filtering groups according to a condition

56
Grouping data

▪ Applying group by
▪ Specify the column to group by (key)
▪ Obtain a DataFrameGroupBy object
df = pd.DataFrame({'k' : ['a','b','a','b'],
'c1': [2,10,3,15], 'c2' : [4,20,5,30]})
grouped_df = df.groupby('k') # 2 groups: 'a' and 'b'

df:
Index   k   c1   c2
0       a   2    4
1       b   10   20
2       a   3    5
3       b   15   30

Rows arranged by group:
Index   k   c1   c2
0       a   2    4
2       a   3    5
1       b   10   20
3       b   15   30
57
Grouping data

▪ Iterating on groups
▪ Each group is a subset of the original DataFrame
In [1]: for key, group_df in grouped_df:
    print(key)
    print(group_df)

Out[1]: a
   k  c1  c2
0  a   2   4
2  a   3   5
b
   k  c1  c2
1  b  10  20
3  b  15  30
58
Grouping data

▪ Aggregating by group (min, max, mean, std)
▪ The output is a DataFrame with the result of the aggregation for each group
▪ The index of the result is the key of each group
In [1]: grouped_df.mean() # Mean, separately for each group

Out[1]: k c1 c2
a 2.5 4.5
b 12.5 25.0

59
Grouping data

▪ Aggregating a single column by group
▪ The output is a Series with the result of the aggregation for each group
In [1]: grouped_df['c1'].mean()

Out[1]: k
a 2.5
b 12.5
Name: c1, dtype: float64

60
Grouping data

▪ Filtering data by group
▪ The filter is expressed with a lambda function working with each group DataFrame (x)
In [1]: # Keep groups for which column c1 has a mean > 5
grouped_df.filter(lambda x: x['c1'].mean()>5)

Out[1]: k c1 c2
1 b 10 20
3 b 15 30

▪ Group a (rows 0 and 2): mean = 2.5, filtered out
▪ Group b (rows 1 and 3): mean = 12.5, kept in the result
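
The three group operations (iterating, aggregating, filtering) can be tried together on the same small DataFrame. This is a minimal sketch; the dictionary `rows_per_group` is only a convenient way to show which rows fall in each group.

```python
import pandas as pd

df = pd.DataFrame({'k': ['a', 'b', 'a', 'b'],
                   'c1': [2, 10, 3, 15], 'c2': [4, 20, 5, 30]})
grouped_df = df.groupby('k')

# Iterating: collect the row labels that belong to each group
rows_per_group = {key: list(group_df.index) for key, group_df in grouped_df}

# Aggregating: one row per group, indexed by the key
means = grouped_df.mean()

# Filtering: keep the rows of the groups whose c1 mean exceeds 5
kept = grouped_df.filter(lambda x: x['c1'].mean() > 5)

print(rows_per_group)
print(means)
print(kept)
```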
61
Pivoting

▪ Pivoting allows inspecting relationships within a dataset
▪ Suppose we have the following dataset, which shows failures for sensors of a given type and class during some test:
df = pd.DataFrame({'type':['a','b','b','a','b','a','b','a'],
'class':[3,2,3,3,2,1,1,2],
'fail':[1,1,1,0,1,0,0,0]})

Index   type   class   fail
0       a      3       1
1       b      2       1
2       b      3       1
3       a      3       0
4       b      2       1
5       a      1       0
6       b      1       0
7       a      2       0

62
Pivoting

In [1]: df.pivot_table('fail', index='type',
columns='class', aggfunc='sum')

▪ Shows the number of failures for all the combinations of type and class

Out[1]: class 1 2 3
type
a 0 0 1
b 0 2 1

▪ Example: 2 sensors of type b and class 2 had some failure
63
Pivoting

In [1]: df.pivot_table('fail', index='type',
columns='class', aggfunc='mean')

▪ Shows the average number of failures for all the combinations of type and class

Out[1]: class 1 2 3
type
a 0.0 0.0 0.5
b 0.0 1.0 1.0

▪ Example: 50% of sensors of type a and class 3 had some failure
64
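
One way to see what a pivot table computes is to rebuild it by hand: group on both keys, aggregate, and then move the inner key from the row index to the columns with `unstack()`. The sketch below does this for the sum of failures; it is an illustration of the relationship, not a statement about how `pivot_table` is implemented.

```python
import pandas as pd

df = pd.DataFrame({'type': ['a', 'b', 'b', 'a', 'b', 'a', 'b', 'a'],
                   'class': [3, 2, 3, 3, 2, 1, 1, 2],
                   'fail': [1, 1, 1, 0, 1, 0, 0, 0]})

pivot = df.pivot_table('fail', index='type', columns='class', aggfunc='sum')

# Same numbers built by hand: group on both keys, aggregate, then
# move 'class' from the row index to the columns
by_hand = df.groupby(['type', 'class'])['fail'].sum().unstack()

print(pivot)
print(by_hand)
```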
Multi-Index

▪ Multi-Index allows specifying an index hierarchy for
▪ Series
▪ DataFrames
▪ Example: index a Series by city and year

index (city)   Rome   Rome   Turin   Turin
index (year)   2018   2019   2018    2019
values         10     13     7       9

65
Multi-Index

▪ Building a multi-indexed Series

In [1]: ix = [['Rome', 'Rome', 'Turin', 'Turin'],
['2018', '2019', '2018', '2019']]
s1 = pd.Series([10,13,7,9], index=ix)
s1 = s1.sort_index() # Multi-Index must be sorted
# if you want to use slicing
print(s1)

Out[1]: Rome 2018 10

2019 13
Turin 2018 7
2019 9

66
Multi-Index

▪ Naming index levels

In [1]: s1.index.names=['city', 'year']
print(s1)

Out[1]: city year

Rome 2018 10
2019 13
Turin 2018 7
2019 9

67
Multi-Index

▪ Accessing index levels
▪ Slicing and simple indexing are allowed
▪ Slicing on index levels follows Numpy rules
In [1]: print(s1.loc['Rome']) # Outer index level
print(s1.loc[:,'2018']) # All cities, only 2018

Out[1]: year
2018 10
2019 13

city
Rome 10
Turin 7

68
Multi-Index

▪ Accessing index levels (Examples)

In [1]: print(s1.loc['Turin', '2018':'2019'])
print(s1[s1>10]) # Masking

Out[1]: city year
Turin 2018 7
2019 9

city year
Rome 2019 13

69
Multi-Index

▪ Multi-indexed DataFrame
▪ Specify a multi-index for rows
▪ Columns can be multi-indexed as well

               Humidity      Temperature
               max    min    max    min
Turin   2018   33     48     6      33
        2019   35     45     5      35
Rome    2018   40     59     2      33
        2019   41     57     3      34

70
Multi-Index

▪ Multi-indexed DataFrame: creation

In [1]: ix = [['Rome', 'Rome', 'Turin', 'Turin'],
['2018', '2019', '2018', '2019']]
cols = [['c1','c1','c2','c2'],['a','b','a','b']]
data = np.arange(16).reshape((4,4))
df = pd.DataFrame(data, index=ix, columns=cols)
print(df)

Out[1]: c1 c2
a b a b
Rome 2018 0 1 2 3
2019 4 5 6 7
Turin 2018 8 9 10 11
2019 12 13 14 15
71
Multi-Index

▪ Multi-indexed DataFrame: access with outer index level
In [1]: print(df['c1']) # Access by column
print(df.loc['Rome', 'c1']) # Access rows and cols

Out[1]: a b
Rome 2018 0 1
2019 4 5
Turin 2018 8 9
2019 12 13

a b
2018 0 1
2019 4 5

72
Multi-Index

▪ Multi-indexed DataFrame: access with outer and inner index levels
In [1]: df['c1', 'a'] # Access by column

Out[1]: Rome 2018 0
2019 4
Turin 2018 8
2019 12

73
Multi-Index

▪ Multi-indexed DataFrame: access with outer and inner index levels
In [1]: ix = pd.IndexSlice
df.loc[ix['Rome', '2018'], ix['c1':'c2', 'a']]

Out[1]: c1 a 0
c2 a 2
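
`pd.IndexSlice` also accepts a bare `:` to mean "every label on this level", which makes it possible to select on an inner level only. A minimal sketch on the same DataFrame follows; the names `rows`, `idx` and `year_2019_col_a` are illustrative.

```python
import numpy as np
import pandas as pd

rows = [['Rome', 'Rome', 'Turin', 'Turin'], ['2018', '2019', '2018', '2019']]
cols = [['c1', 'c1', 'c2', 'c2'], ['a', 'b', 'a', 'b']]
df = pd.DataFrame(np.arange(16).reshape((4, 4)), index=rows, columns=cols)

idx = pd.IndexSlice
# All cities but only year 2019; all outer columns but only inner column 'a'
year_2019_col_a = df.loc[idx[:, '2019'], idx[:, 'a']]

print(year_2019_col_a)
```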

74
Multi-Index

▪ Reset Index: transform index to DataFrame columns
In [1]: df.index.names = ['city', 'year']
df_reset = df.reset_index()
print(df_reset)

Out[1]: city year c1 c2

a b a b
0 Rome 2018 0 1 2 3
1 Rome 2019 4 5 6 7
2 Turin 2018 8 9 10 11
3 Turin 2019 12 13 14 15

75
Multi-Index

▪ Set Index: transform columns to Multi-Index
▪ Inverse function of reset_index()

In [1]: df_reset.set_index(['city', 'year'])

Before (df_reset):
    city    year   c1       c2
                   a    b   a    b
0   Rome    2018   0    1   2    3
1   Rome    2019   4    5   6    7
2   Turin   2018   8    9   10   11
3   Turin   2019   12   13  14   15

After:
city    year   c1       c2
               a    b   a    b
Rome    2018   0    1   2    3
        2019   4    5   6    7
Turin   2018   8    9   10   11
        2019   12   13  14   15

76
Multi-Index

▪ Unstack: transform multi-indexed Series to a DataFrame
myseries.unstack()

Before (myseries):
city    year
Rome    2018   0
        2019   4
Turin   2018   8
        2019   12

After:
        2018   2019
Rome    0      4
Turin   8      12

77
Multi-Index

▪ Stack: inverse function of unstack
▪ From DataFrame to multi-indexed Series

mydataframe.stack()

Before (mydataframe):
        2018   2019
Rome    0      4
Turin   8      12

After:
city    year
Rome    2018   0
        2019   4
Turin   2018   8
        2019   12
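
The two operations can be checked as a round trip: unstacking the Series and stacking the result should give back the original values and index. A minimal sketch with the values from the tables above:

```python
import pandas as pd

ix = [['Rome', 'Rome', 'Turin', 'Turin'], ['2018', '2019', '2018', '2019']]
myseries = pd.Series([0, 4, 8, 12], index=ix)
myseries.index.names = ['city', 'year']

mydataframe = myseries.unstack()   # inner level 'year' becomes the columns
back = mydataframe.stack()         # and moves back into the row index

print(mydataframe)
print(back)
```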

78
Multi-Index

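▪ Aggregates on multi-indices
▪ Allowed by passing the level parameter
▪ Level specifies the row granularity at which the result is computed
▪ The level parameter of aggregations was removed in pandas 2.0: on recent versions use my_dataframe.groupby(level='city').max()
my_dataframe.max(level='city') # pandas < 2.0 only

Before:
city    year   c1       c2
               a    b   a    b
Rome    2018   0    1   2    3
        2019   4    5   6    7
Turin   2018   8    9   10   11
        2019   12   13  14   15

After:
city    c1       c2
        a    b   a    b
Rome    4    5   6    7
Turin   12   13  14   15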

79
Multi-Index

▪ Aggregates on multi-indices
my_dataframe.max(level='year') # pandas < 2.0 only, otherwise my_dataframe.groupby(level='year').max()

Before:
city    year   c1       c2
               a    b   a    b
Rome    2018   0    1   2    3
        2019   4    5   6    7
Turin   2018   8    9   10   11
        2019   12   13  14   15

After:
year   c1       c2
       a    b   a    b
2018   8    9   10   11
2019   12   13  14   15

80
Multi-Index

▪ Aggregates on multi-indices
▪ Can also aggregate columns
▪ Specify axis=1
my_dataframe.max(axis=1, level=0) # pandas < 2.0 only

Before:
city    year   c1       c2
               a    b   a    b
Rome    2018   0    1   2    3
        2019   4    5   6    7
Turin   2018   8    9   10   11
        2019   12   13  14   15

After:
city    year   c1   c2
Rome    2018   1    3
Rome    2019   5    7
Turin   2018   9    11
Turin   2019   13   15
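
On pandas 2.0 and later, aggregations such as `max()` no longer accept a `level` argument. A sketch of an equivalent that should run on old and new versions alike is to group on the index level first; for the column-wise case the DataFrame is transposed, grouped on the outer column level, and transposed back. This is one possible formulation, not the form used in the slides.

```python
import numpy as np
import pandas as pd

ix = [['Rome', 'Rome', 'Turin', 'Turin'], ['2018', '2019', '2018', '2019']]
cols = [['c1', 'c1', 'c2', 'c2'], ['a', 'b', 'a', 'b']]
my_dataframe = pd.DataFrame(np.arange(16).reshape((4, 4)),
                            index=ix, columns=cols)
my_dataframe.index.names = ['city', 'year']

max_by_city = my_dataframe.groupby(level='city').max()
max_by_year = my_dataframe.groupby(level='year').max()

# Column-wise: transpose, group on the outer column level, transpose back
max_by_outer_col = my_dataframe.T.groupby(level=0).max().T

print(max_by_city)
print(max_by_year)
print(max_by_outer_col)
```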

81
Loading DataFrames

▪ Load DataFrame from csv file
▪ Allows specifying the column delimiter (sep)
▪ Automatically read header from first line of the file (after skipping the specified number of rows)
▪ Column data types are inferred
df = pd.read_csv('./mycsv.csv', sep=',', skiprows=1)

mycsv.csv:
MyTitle
c1,c2,c3
0,1,2
3,4,5
6,7,8

Resulting DataFrame:
    c1   c2   c3
0   0    1    2
1   3    4    5
2   6    7    8
82
Loading DataFrames

▪ Load DataFrame from csv file
▪ If it contains null values, you can specify how to recognize them
▪ The string 'NaN' is automatically recognized
df = pd.read_csv('./mycsv.csv', sep=',',
na_values=['no info', 'x'])

mycsv.csv:
c1,c2,c3
0,no info,2
3,4,5
6,x,NaN

Resulting DataFrame:
    c1   c2    c3
0   0    NaN   2
1   3    4     5
2   6    NaN   NaN
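
The loading options can be tried without creating a file, since `read_csv` also accepts a file-like object. In this sketch an in-memory string stands in for mycsv.csv (an assumption made only to keep the example self-contained) and combines a title line to skip with custom null markers.

```python
import io
import pandas as pd

# In-memory stand-in for mycsv.csv, so no file on disk is needed
csv_text = "MyTitle\nc1,c2,c3\n0,no info,2\n3,4,5\n6,x,NaN\n"

df = pd.read_csv(io.StringIO(csv_text), sep=',', skiprows=1,
                 na_values=['no info', 'x'])

print(df)
print(df.dtypes)
```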

83
Loading DataFrames

▪ Save DataFrame to csv

df.to_csv('./savedcsv.csv', sep=',')

DataFrame:
    c1   c2    c3
0   0    NaN   2
1   3    4     5
2   6    NaN   NaN

savedcsv.csv (the first column is the index, missing values are written as empty fields):
,c1,c2,c3
0,0,,2.0
1,3,4.0,5.0
2,6,,

▪ Use index=False to avoid writing the index

df.to_csv('./savedcsv.csv', sep=',', index=False)
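
The effect of `index=False` is easy to inspect because `to_csv` returns the CSV text as a string when no path is given. A minimal sketch with a made-up two-column DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'c1': [0, 3], 'c2': [1.5, None]})

with_index = df.to_csv()                # path omitted: returns the CSV text
without_index = df.to_csv(index=False)

print(with_index)
print(without_index)
```

With the index, the header line starts with an empty field for the unnamed index column; without it, only the data columns are written.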

84
Loading DataFrames

▪ Load DataFrame from json file

df = pd.read_json('./myjson.json')

myjson.json:
{"c1":{"0":0, "1":3, "2":6},
"c2":{"0":null, "1":4, "2":null},
"c3":{"0":2, "1":5, "2":null}}

Resulting DataFrame:
    c1   c2    c3
0   0    NaN   2
1   3    4     5
2   6    NaN   NaN

▪ Use df.to_json(path) to save a DataFrame in json format
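
A JSON round trip can be sketched in memory: `to_json` returns a string when no path is given, and `read_json` accepts a file-like object. The DataFrame below is a reduced version of the one in the example.

```python
import io
import pandas as pd

df = pd.DataFrame({'c1': [0, 3, 6], 'c2': [None, 4, None]})

json_text = df.to_json()                        # no path: returns a string
df_back = pd.read_json(io.StringIO(json_text))  # missing values come back as NaN

print(json_text)
print(df_back)
```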

85
Loading DataFrames

▪ Many other data formats are supported
▪ Excel, HTML, HDF5, SAS, ...
▪ Check the pandas documentation
▪ https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

86
Notebook Examples

▪ 4-Pandas Examples.ipynb
▪ 2. Working with Pandas and spatial data
