DSL Pandas
Pandas
▪ Pandas
▪ Provides useful data structures (Series and DataFrames) and data analysis tools
▪ Based on NumPy arrays
▪ Tools:
▪ Managing tables and series
• data selection
• grouping, pivoting
▪ Managing missing data
▪ Statistics on data
Pandas Series
Out[1]: 0 2.0
1 3.1
2 4.5
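The output above is what a Series built from a plain list looks like. A minimal sketch, assuming the values shown on the slide and the conventional `pd` alias; the variable name `s1` is only illustrative:

```python
import pandas as pd

# Build a Series from a list; pandas assigns the implicit index 0, 1, 2
s1 = pd.Series([2.0, 3.1, 4.5])

print(s1)
print(s1.values)  # the underlying NumPy array
print(s1.index)   # the implicit integer index
```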
Out[1]: 2.0
2.0
Series:
'a' 2.0
'b' 10
'c' 4.5
Out[1]: b 3.1
c 4.5
b 3.1
c 4.5
Out[1]: b 3.1
c 4.5
Out[1]: a 2.0
c 4.5
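The Series outputs above are consistent with label access, slicing, masking and fancy indexing on a Series with an explicit index. The original commands did not survive, so the following is a sketch under assumptions: the values 2.0, 3.1 and 4.5 with labels 'a', 'b', 'c', and a mask threshold of 3.0 chosen only for illustration.

```python
import pandas as pd

# Series with an explicit (label) index
s = pd.Series([2.0, 3.1, 4.5], index=['a', 'b', 'c'])

first = s['a']          # access a single element by label
sliced = s['b':'c']     # label slicing: the stop label is included
masked = s[s > 3.0]     # boolean mask (threshold is an assumption)
fancy = s[['a', 'c']]   # fancy indexing with a list of labels

print(first, sliced, masked, fancy, sep='\n')
```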
Pandas DataFrame
▪ Example:
Index 'Price' 'Quantity' 'Liters'
'Wine' 5.0 8 1
Out[1]: c1 c2
0 0 0
1 1 2
2 2 4
Out[1]: c1 c2
a 0 1
b 2 3
c 4 5
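The two outputs above can be obtained, for instance, from a dictionary of columns and from a 2D NumPy array. This is a sketch: the construction commands are not in the slides, and the names `df_dict`, `arr` and `df_arr` are hypothetical.

```python
import numpy as np
import pandas as pd

# From a dict of columns: rows get the implicit index 0, 1, 2
df_dict = pd.DataFrame({'c1': [0, 1, 2], 'c2': [0, 2, 4]})

# From a 2D NumPy array, with explicit row labels and column names
arr = np.arange(6).reshape(3, 2)   # [[0, 1], [2, 3], [4, 5]]
df_arr = pd.DataFrame(arr, index=['a', 'b', 'c'], columns=['c1', 'c2'])

print(df_dict)
print(df_arr)
```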
Index Price Quantity Liters
a     1.0   5        1.5
b     1.4   10       0.3
c     5.0   8        1
▪ Accessing DataFrames
▪ Access a DataFrame column
▪ Access rows and columns with indexing
▪ df.loc
• Explicit index
• Slicing, masking, fancy indexing
▪ df.iloc
• Implicit index
▪ Whether a copy or a view is returned depends on the context
▪ It is usually hard to predict which one you will get
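The access modes listed above can be sketched as follows, assuming the Price/Quantity/Liters table used in these slides. The variable names and the price threshold are illustrative. Because it is hard to know whether a selection is a view or a copy, the last line takes an explicit copy before any modification.

```python
import pandas as pd

df = pd.DataFrame(
    {'Price': [1.0, 1.4, 5.0], 'Quantity': [5, 10, 8], 'Liters': [1.5, 0.3, 1.0]},
    index=['a', 'b', 'c'],
)

quantity = df['Quantity']                        # a column, returned as a Series
row_a = df.loc['a']                              # a row, by explicit label
sub_loc = df.loc['a':'b', ['Price', 'Liters']]   # label slice (stop included) + fancy indexing
sub_iloc = df.iloc[0:2, [0, 2]]                  # same cells by position (stop excluded)
cheap = df.loc[df['Price'] < 2.0]                # boolean mask on the rows

safe = df.loc[:, ['Price']].copy()               # explicit copy: safe to modify
```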
In [1]: df['Quantity']
Out[1]: a 5
b 10
c 8
Price 1.0
Quantity 5.0
Liters 1.5
a 0.0 5 0.0
b 1.4 10 0.3
c 0.0 8 0.0
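The table above shows Price and Liters set to 0 for rows 'a' and 'c'. The command that produced it is not in the slides; one way to obtain the same result, offered as a sketch, is an assignment through `.loc`:

```python
import pandas as pd

df = pd.DataFrame(
    {'Price': [1.0, 1.4, 5.0], 'Quantity': [5, 10, 8], 'Liters': [1.5, 0.3, 1.0]},
    index=['a', 'b', 'c'],
)

# Select rows 'a' and 'c' and columns Price and Liters, then overwrite them
df.loc[['a', 'c'], ['Price', 'Liters']] = 0

print(df)
```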
▪ Drop column(s)
▪ Returns a copy of the updated DataFrame
Index Price Quantity Liters Available
c 5.0 8 1 True
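A minimal sketch of dropping columns, assuming the table above. Only the `Available` value of row 'c' is visible in the slide, so the other two flags are made up, and the choice of columns to drop is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'Price': [1.0, 1.4, 5.0], 'Quantity': [5, 10, 8],
     'Liters': [1.5, 0.3, 1.0],
     'Available': [False, True, True]},   # first two flags are assumptions
    index=['a', 'b', 'c'],
)

# drop returns a new DataFrame; df itself is left untouched
dropped = df.drop(columns=['Liters', 'Available'])

print(dropped.columns.tolist())
print(df.columns.tolist())
```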
▪ Rename column(s)
▪ Use a dictionary that maps old names to new names
▪ Returns a copy of the updated DataFrame
Before: Index Price Quantity Liters Available
After:  Index Price nItems   [L]    Available
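The renaming shown in the headers above (Quantity to nItems, Liters to [L]) can be sketched with `rename` and a dictionary. The row values and the `Available` flags are assumptions used only to have a concrete DataFrame.

```python
import pandas as pd

df = pd.DataFrame(
    {'Price': [1.0, 1.4, 5.0], 'Quantity': [5, 10, 8],
     'Liters': [1.5, 0.3, 1.0],
     'Available': [False, True, True]},   # illustrative values
    index=['a', 'b', 'c'],
)

# Keys are the old column names, values the new ones
renamed = df.rename(columns={'Quantity': 'nItems', 'Liters': '[L]'})

print(renamed.columns.tolist())
```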
Computation with Pandas
my_series1    my_series2    res
b   3         a   1         a   2
a   1         b   3         b   6
c   10        d   30        c   NaN
                            d   NaN
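The table above is consistent with an element-wise sum that aligns the two Series on their index labels. A sketch, assuming the operation is addition; the `fill_value` variant at the end is an addition not shown in the slides.

```python
import pandas as pd

my_series1 = pd.Series([3, 1, 10], index=['b', 'a', 'c'])
my_series2 = pd.Series([1, 3, 30], index=['a', 'b', 'd'])

# Operands are aligned on index labels, not on position;
# labels found in only one operand produce NaN
res = my_series1 + my_series2

# Optional: treat a missing operand as 0 instead of propagating NaN
res_filled = my_series1.add(my_series2, fill_value=0)

print(res)
```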
my_dataframe1    my_dataframe2    res
b   3   4        a   1   2        a   2    4
a   1   2        b   3   4        b   6    8
c   10  20       d   30  40       c   NaN  NaN
                                  d   NaN  NaN
Index Total Quantity    Index Total Price    Index Price Quantity Total
a     1     2           a     1     2        a     NaN   NaN      2
b     3     4           b     3     4        b     NaN   NaN      6
c     5     6           c     5     6        c     NaN   NaN      10
a   1   2        Total     1        a   2   4
b   3   4        Quantity  2        b   4   6
c   5   6                           c   6   8
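The three tables above match the sum of a DataFrame and a Series whose index holds the column names. A sketch, assuming the columns are named Total and Quantity as the Series labels suggest; `row` is a hypothetical name.

```python
import pandas as pd

df = pd.DataFrame({'Total': [1, 3, 5], 'Quantity': [2, 4, 6]},
                  index=['a', 'b', 'c'])
row = pd.Series({'Total': 1, 'Quantity': 2})

# The Series index is matched against the DataFrame columns,
# then the Series is added to every row
res = df + row

print(res)
```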
Out[1]: 0 4.0
1 NaN
2 5.0
3 NaN
dtype: float64
Missing values
Out[1]: 0 False
1 True
2 False
3 True
dtype: bool
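A sketch of how a Series with missing values and the boolean mask above can be produced. The input list is an assumption that matches the printed values; both `None` and `np.nan` end up as NaN in a float Series.

```python
import numpy as np
import pandas as pd

# None and np.nan are both stored as NaN; the dtype becomes float64
s = pd.Series([4, None, 5, np.nan])

mask = s.isnull()       # True where the value is missing
present = s.notnull()   # the complement of the mask

print(s)
print(mask)
```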
Out[1]: 0 4.0
2 5.0
dtype: float64
a   1   2          a   1   2
b   3   NaN        c   5   6
c   5   6
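Removing missing values, as in the outputs above, can be sketched with `dropna`. The column names Total and Quantity are assumptions (the slide shows no header), and the `axis=1` variant is an addition for comparison.

```python
import numpy as np
import pandas as pd

s = pd.Series([4, None, 5, np.nan])
s_clean = s.dropna()               # keeps the entries with index 0 and 2

df = pd.DataFrame({'Total': [1, 3, 5], 'Quantity': [2, np.nan, 6]},
                  index=['a', 'b', 'c'])
rows_clean = df.dropna()           # drops row 'b', which contains a NaN
cols_clean = df.dropna(axis=1)     # drops column 'Quantity' instead

print(s_clean)
print(rows_clean)
```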
Out[1]: 0 4.0
1 0.0
2 5.0
3 0.0
dtype: float64
Out[1]: 0 4.0
1 4.0
2 5.0
3 5.0
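The last two outputs correspond to filling missing values with a constant and with the previous valid value. A sketch, assuming the same four-element Series; the slides do not show the exact calls, and `ffill` is used here as the forward-fill method.

```python
import numpy as np
import pandas as pd

s = pd.Series([4, None, 5, np.nan])

zero_filled = s.fillna(0)    # replace every NaN with 0
forward_filled = s.ffill()   # propagate the last valid value forward

print(zero_filled)
print(forward_filled)
```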
Notebook Examples
▪ 4-Pandas Examples.ipynb
▪ 1. Accessing DataFrames and Series
Combining Pandas objects
▪ Concatenating 2 Series
▪ Index is preserved, even if duplicated
In [1]: s1 = pd.Series(['a', 'b'], index=[1,2])
s2 = pd.Series(['c', 'd'], index=[1,2])
pd.concat((s1, s2))
Out[1]: 1 a
2 b
1 c
2 d
dtype: object
▪ Concatenating 2 Series
▪ To avoid duplicates use ignore_index
In [1]: s1 = pd.Series(['a', 'b'], index=[1,2])
s2 = pd.Series(['c', 'd'], index=[1,2])
pd.concat((s1, s2), ignore_index=True)
Out[1]: 0 a
1 b
2 c
3 d
dtype: object
▪ Concatenating 2 DataFrames
▪ Concatenate vertically by default
In [1]: pd.concat((df1, df2))
df1                     df2                     pd.concat((df1, df2))
Index Total Quantity    Index Total Quantity    Index Total Quantity
a     1     2           c     5     6           a     1     2
b     3     4           d     7     8           b     3     4
                                                c     5     6
                                                d     7     8
▪ Concatenating 2 DataFrames
▪ Missing columns are filled with NaN
In [1]: pd.concat((df1, df2))
c 5 6 1 d 7 8 2
d 7 8 2
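A sketch of concatenating two DataFrames whose columns do not fully match. The table on the slide is only partly legible, so the column name `Extra` and the exact layout are assumptions; the point is that the rows coming from `df1` get NaN in the column they lack.

```python
import pandas as pd

df1 = pd.DataFrame({'Total': [1, 3], 'Quantity': [2, 4]}, index=['a', 'b'])
df2 = pd.DataFrame({'Total': [5, 7], 'Quantity': [6, 8], 'Extra': [1, 2]},
                   index=['c', 'd'])   # 'Extra' is a hypothetical column name

# Rows are stacked vertically; columns are the union of both inputs
df_concat = pd.concat((df1, df2))

print(df_concat)
```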
is equivalent to:
In [1]: df_concat = pd.concat((df1, df2))
▪ Examples
one-to-one
Index k1 c2      Index k1 c3      Index k1 c2 c3
i1    0  a       i1    1  b1      0     0  a  a1
i2    1  b       i2    0  a1      1     1  b  b1
many-to-one
Index k1 c2      Index k1 c3      Index k1 c2 c3
i1    0  a       i1    1  b1      0     0  a  a1
i2    1  b       i2    0  a1      1     0  c  a1
i3    0  c                        2     1  b  b1
i4    1  d                        3     1  d  b1
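The many-to-one example above can be sketched with `pd.merge` on the shared column `k1`. The names `left`, `right` and `merged` are hypothetical. The order of the result rows may differ between pandas versions, so the sketch does not rely on it.

```python
import pandas as pd

left = pd.DataFrame({'k1': [0, 1, 0, 1], 'c2': ['a', 'b', 'c', 'd']},
                    index=['i1', 'i2', 'i3', 'i4'])
right = pd.DataFrame({'k1': [1, 0], 'c3': ['b1', 'a1']},
                     index=['i1', 'i2'])

# Join on the common column k1; each left row picks up the matching c3.
# The original indices are discarded and replaced by 0..n-1.
merged = pd.merge(left, right, on='k1')

print(merged)
```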
Grouping data
▪ Applying group by
▪ Specify the column to group by (the key)
▪ Obtain a DataFrameGroupBy object
df = pd.DataFrame({'k' : ['a','b','a','b'],
'c1': [2,10,3,15], 'c2' : [4,20,5,30]})
grouped_df = df.groupby('k') # 2 groups: 'a' and 'b'
df                      grouped_df
Index k  c1  c2         Index k  c1  c2
0     a  2   4          0     a  2   4
1     b  10  20         2     a  3   5
2     a  3   5          1     b  10  20
3     b  15  30         3     b  15  30
▪ Iterating on groups
▪ Each group is a subset of the original DataFrame
In [1]: for key, group_df in grouped_df:
            print(key)
            print(group_df)
Out[1]: a
           k  c1  c2
        0  a   2   4
        2  a   3   5
        b
           k  c1  c2
        1  b  10  20
        3  b  15  30
Out[1]:      c1    c2
        k
        a    2.5   4.5
        b   12.5  25.0
Out[1]: k
        a     2.5
        b    12.5
        Name: c1, dtype: float64
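The two outputs above are per-group means: first over every column, then over `c1` only. The calls themselves are missing from the slides, so this is a sketch of one way to get them from the `grouped_df` object defined earlier.

```python
import pandas as pd

df = pd.DataFrame({'k': ['a', 'b', 'a', 'b'],
                   'c1': [2, 10, 3, 15], 'c2': [4, 20, 5, 30]})
grouped_df = df.groupby('k')

means = grouped_df.mean()            # DataFrame: one row per group, one column per value column
c1_means = grouped_df['c1'].mean()   # Series: mean of c1 only

print(means)
print(c1_means)
```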
Group a: c1 mean = 2.5, filtered out
Group b: c1 mean = 12.5, kept in the result
Out[1]:    k  c1  c2
        1  b  10  20
        3  b  15  30
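Filtering whole groups, as in the result above, can be sketched with `filter` and a function that receives each group as a DataFrame. The threshold of 5 is an assumption: any value between 2.5 and 12.5 gives the same outcome.

```python
import pandas as pd

df = pd.DataFrame({'k': ['a', 'b', 'a', 'b'],
                   'c1': [2, 10, 3, 15], 'c2': [4, 20, 5, 30]})

# x is the sub-DataFrame of one group; the group is kept only if the
# function returns True (threshold 5 is illustrative)
filtered = df.groupby('k').filter(lambda x: x['c1'].mean() > 5)

print(filtered)
```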
Pivoting
Input rows (index, type, class, failure):
1  b  2  1
2  b  3  1
3  a  3  0
4  b  2  1
5  a  1  0
6  b  1  0
7  a  2  0
Out[1]:
class  1  2  3
type
a      0  0  1
b      0  2  1
2 sensors of type b and class 2 had some failure
Out[1]:
class  1    2    3
type
a      0.0  0.0  0.5
b      0.0  1.0  1.0
50% of sensors of type a and class 3 had some failure
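The two pivot results above (failure counts, then failure rates) can be sketched with `pivot_table`. Two assumptions are needed: the third column is called `fails` here (its real name is not visible), and the first input row, missing from the slide, is taken to be type a, class 3, with a failure, which is the only value consistent with both outputs.

```python
import pandas as pd

df = pd.DataFrame({
    'type':  ['a', 'b', 'b', 'a', 'b', 'a', 'b', 'a'],
    'class': [3, 2, 3, 3, 2, 1, 1, 2],
    'fails': [1, 1, 1, 0, 1, 0, 0, 0],   # row 0 is an assumption (see above)
})

# Rows: sensor type, columns: sensor class, cells: aggregated 'fails'
counts = df.pivot_table(values='fails', index='type', columns='class',
                        aggfunc='sum')
rates = df.pivot_table(values='fails', index='type', columns='class',
                       aggfunc='mean')

print(counts)
print(rates)
```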
Multi-Index
city    Rome  Rome  Turin  Turin
year    2018  2019  2018   2019
values  10    13    7      9
Out[1]: year
        2018    10
        2019    13
        city
        Rome     10
        Turin     7
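A sketch of a multi-indexed Series with the city/year values above, and of selections that return the two outputs shown (all years for Rome, all cities for 2018). The construction through `MultiIndex.from_tuples` and the variable names are assumptions.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('Rome', 2018), ('Rome', 2019), ('Turin', 2018), ('Turin', 2019)],
    names=['city', 'year'],
)
s = pd.Series([10, 13, 7, 9], index=idx)

rome = s.loc['Rome']             # outer level: Series indexed by year
y2018 = s.loc[:, 2018]           # inner level: Series indexed by city
single = s.loc[('Rome', 2019)]   # both levels: a single value

print(rome)
print(y2018)
```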
city year
Rome 2019 13
▪ Multi-indexed DataFrame
▪ Specify a multi-index for rows
▪ Columns can be multi-indexed as well
Humidity Temperature
Turin 2018 33 48 6 33
2019 35 45 5 35
Rome 2018 40 59 2 33
2019 41 57 3 34
Out[1]: c1 c2
a b a b
Rome 2018 0 1 2 3
2019 4 5 6 7
Turin 2018 8 9 10 11
2019 12 13 14 15
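The DataFrame printed above has a two-level index on both rows and columns. One way to build it, as a sketch: the level names `city` and `year` come from later slides, while `from_product` and `np.arange` are assumptions about how the values 0 to 15 were generated.

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_product([['Rome', 'Turin'], [2018, 2019]],
                                  names=['city', 'year'])
cols = pd.MultiIndex.from_product([['c1', 'c2'], ['a', 'b']])

# 4 x 4 table filled with 0..15, row by row
my_dataframe = pd.DataFrame(np.arange(16).reshape(4, 4),
                            index=rows, columns=cols)

print(my_dataframe)
```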
Out[1]:              a   b
        Rome  2018   0   1
              2019   4   5
        Turin 2018   8   9
              2019  12  13
                     a   b
        2018         0   1
        2019         4   5
Out[1]: c1  a    0
        c2  a    2
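The selections on the multi-indexed DataFrame (a whole outer column, an outer row plus outer column, and one row restricted to the inner column 'a') can be sketched as below. The original commands are missing, so these calls are assumptions that return the same values as the slide outputs.

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_product([['Rome', 'Turin'], [2018, 2019]],
                                  names=['city', 'year'])
cols = pd.MultiIndex.from_product([['c1', 'c2'], ['a', 'b']])
my_dataframe = pd.DataFrame(np.arange(16).reshape(4, 4),
                            index=rows, columns=cols)

c1 = my_dataframe['c1']                    # outer column level: columns a, b
rome_c1 = my_dataframe.loc['Rome', 'c1']   # outer row level and outer column level
# Full row key, then every outer column restricted to inner column 'a'
a_cols = my_dataframe.loc[('Rome', 2018), pd.IndexSlice[:, 'a']]

print(c1)
print(rome_c1)
print(a_cols)
```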
city year
2019 4 Rome 0 4
2019 12
mydataframe.stack()
city year
Rome 0 4 2019 4
2019 12
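The fragments above come from reshaping between a multi-indexed layout and a wide table. A sketch of `unstack` and `stack`, assuming a Series indexed by city and year with the values 0, 4, 8, 12 visible in the slides:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([['Rome', 'Turin'], [2018, 2019]],
                                 names=['city', 'year'])
s = pd.Series([0, 4, 8, 12], index=idx)

wide = s.unstack()    # inner level (year) becomes the columns: one row per city
long = wide.stack()   # columns go back into the index: (city, year) Series

print(wide)
print(long)
```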
▪ Aggregates on multi-indices
▪ Allowed by passing the level parameter (removed in pandas 2.0, where groupby(level=...) does the same job)
▪ Level specifies the row granularity at which the result is computed
my_dataframe.max(level='city')  # pandas >= 2.0: my_dataframe.groupby(level='city').max()
Result:
         c1      c2
         a   b   a   b
Rome     4   5   6   7
Turin   12  13  14  15
▪ Aggregates on multi-indices
my_dataframe.max(level='year')  # pandas >= 2.0: my_dataframe.groupby(level='year').max()
Result:
         c1      c2
         a   b   a   b
2018     8   9  10  11
2019    12  13  14  15
▪ Aggregates on multi-indices
▪ Can also aggregate columns
▪ Specify axis=1
my_dataframe.max(axis=1, level=0)  # pandas >= 2.0: my_dataframe.T.groupby(level=0).max().T
Result:
             c1  c2
Rome  2018    1   3
      2019    5   7
Turin 2018    9  11
      2019   13  15
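The `level` argument of reductions such as `max` is no longer available in pandas 2.0 and later. The same aggregates can be sketched with `groupby(level=...)`; transposing for the column case is one possible approach, not necessarily the one intended by the slides.

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_product([['Rome', 'Turin'], [2018, 2019]],
                                  names=['city', 'year'])
cols = pd.MultiIndex.from_product([['c1', 'c2'], ['a', 'b']])
my_dataframe = pd.DataFrame(np.arange(16).reshape(4, 4),
                            index=rows, columns=cols)

by_city = my_dataframe.groupby(level='city').max()   # one row per city
by_year = my_dataframe.groupby(level='year').max()   # one row per year
# Column aggregate: group the transposed table on its outer level, then flip back
by_outer_col = my_dataframe.T.groupby(level=0).max().T

print(by_city)
print(by_year)
print(by_outer_col)
```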
Loading DataFrames
mycsv.csv:
MyTitle
c1,c2,c3
0,1,2
3,4,5
6,7,8
Resulting DataFrame:
   c1  c2  c3
0   0   1   2
1   3   4   5
2   6   7   8
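The file above has a title line before the header, so it has to be skipped while loading. A sketch with `read_csv`; the file content is embedded as a string so that the example runs on its own, and `skiprows=1` is an assumption about how the slide handled the title.

```python
import io
import pandas as pd

csv_text = "MyTitle\nc1,c2,c3\n0,1,2\n3,4,5\n6,7,8\n"

# skiprows=1 ignores the title line, so the next line is used as the header.
# With a real file, pass its path instead: pd.read_csv('mycsv.csv', skiprows=1)
df = pd.read_csv(io.StringIO(csv_text), skiprows=1)

print(df)
```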
mycsv.csv
c1,c2,c3 c1 c2 c3
DataFrame:
   c1  c2   c3
0   0  NaN  2
1   3  4    5
2   6  NaN  NaN
savedcsv.csv:
c1,c2,c3
0,0,,2
1,3,4,5
2,6,,
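Saving works in the opposite direction: missing values are written as empty fields and the index is written as the first column. A sketch with `to_csv`, which returns the CSV text when no path is given. Numbers in columns that contain NaN are stored as floats, so the exact formatting can differ from the slide.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'c1': [0, 3, 6], 'c2': [np.nan, 4, np.nan], 'c3': [2, 5, np.nan]}
)

# No path: the CSV content is returned as a string.
# To write a file, pass a name instead, e.g. df.to_csv('savedcsv.csv')
csv_text = df.to_csv()
lines = csv_text.splitlines()

for line in lines:
    print(line)
```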
myjson.csv
{"c1":{"0":0, "1":3, "2":6}, c1 c2 c3
2 6 NaN NaN
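JSON files in the column-oriented layout shown above can be loaded with `read_json`. Only the `c1` entry is visible in the slide, so the `c2` and `c3` entries below are assumptions, chosen to be consistent with the last row of the table (6, NaN, NaN). The text is wrapped in `StringIO` so that the sketch runs without a file.

```python
import io
import pandas as pd

json_text = (
    '{"c1": {"0": 0, "1": 3, "2": 6},'
    ' "c2": {"0": null, "1": 4, "2": null},'   # assumed content
    ' "c3": {"0": 2, "1": 5, "2": null}}'      # assumed content
)

# Outer keys become columns, inner keys become the row index;
# JSON null is loaded as NaN
df = pd.read_json(io.StringIO(json_text))

print(df)
```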
Notebook Examples
▪ 4-Pandas Examples.ipynb
▪ 2. Working with Pandas and spatial data