Tuxdoc.com Effective Pandas Patterns for Data Manipulation Treading on Python Matt Harrison Independently Published 2021

Effective Pandas is a guide by Matt Harrison focused on data manipulation using the Pandas library. The book covers installation, data structures, series, operators, aggregate methods, conversion methods, and manipulation methods, providing practical exercises throughout. It serves as a resource for those looking to enhance their skills in data analysis with Pandas.

Effective Pandas

Patterns for Data Manipulation




Matt Harrison

Technical Editors: Lawrence Gray, Alexandre Batisse, Edward Krueger,

hairysun.com
COPYRIGHT © 2021

While every precaution has been taken in the preparation of this book, the publisher and author
assume no responsibility for errors or omissions, or for damages resulting from the use of the
information contained herein.
Contents

1 Introduction 3
1.1 Who this book is for . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Data in this Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Hints, Tables, and Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Installation 5
2.1 Anaconda . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Pip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Jupyter Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3 Data Structures 11
3.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

4 Series Introduction 13
4.1 The index abstraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.2 The pandas Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.3 The NaN value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.4 Optional Integer Support for NaN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.5 Similar to NumPy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

4.6 Categorical Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

5 Series Deep Dive 23


5.1 Loading the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.2 Series Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

6 Operators (& Dunder Methods) 27


6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.2 Dunder Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.3 Index Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
6.4 Broadcasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
6.5 Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

6.6 Operator Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
6.7 Chaining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
6.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
6.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

7 Aggregate Methods 33
7.1 Aggregations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
7.2 Count and Mean of an Attribute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
7.3 .agg and Aggregation Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
7.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
7.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

8 Conversion Methods 39
8.1 Automatic Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
8.2 Memory Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
8.3 String and Category Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
8.4 Ordered Categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
8.5 Converting to Other Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
8.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
8.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

9 Manipulation Methods 45
9.1 .apply and .where . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
9.2 If Else with Pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
9.3 Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
9.4 Filling In Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
9.5 Interpolating Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
9.6 Clipping Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
9.7 Sorting Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
9.8 Sorting the Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
9.9 Dropping Duplicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
9.10 Ranking Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
9.11 Replacing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
9.12 Binning Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
9.13 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
9.14 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

10 Indexing Operations 63
10.1 Prepping the Data and Renaming the Index . . . . . . . . . . . . . . . . . . . . . . . . . 63
10.2 Resetting the Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
10.3 The .loc Attribute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
10.4 The .iloc Attribute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
10.5 Heads and Tails . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
10.6 Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
10.7 Filtering Index Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
10.8 Reindexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
10.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
10.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

11 String Manipulation 81
11.1 Strings and Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
11.2 Categorical Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
11.3 The .str Accessor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
11.4 Searching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
11.5 Splitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
11.6 Optimizing .apply with Cython . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
11.7 Replacing Text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
11.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
11.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

12 Date and Time Manipulation 93


12.1 Date Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
12.2 Loading UTC Time Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
12.3 Loading Local Time Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
12.4 Converting Local time to UTC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
12.5 Converting to Epochs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
12.6 Manipulating Dates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
12.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
12.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

13 Dates in the Index 105


13.1 Finding Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
13.2 Filling In Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
13.3 Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
13.4 Dropping Missing Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
13.5 Shifting Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
13.6 Rolling Average . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
13.7 Resampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
13.8 Gathering Aggregate Values (But Keeping Index) . . . . . . . . . . . . . . . . . . . . 115
13.9 Groupby Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
13.10 Cumulative Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
13.11 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
13.12 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

14 Plotting with a Series 123


14.1 Plotting in Jupyter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
14.2 The .plot Attribute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
14.3 Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
14.4 Box Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
14.5 Kernel Density Estimation Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
14.6 Line Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
14.7 Line Plots with Multiple Aggregations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
14.8 Bar Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
14.9 Pie Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
14.10 Styling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
14.11 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
14.12 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

15 Categorical Manipulation 135
15.1 Categorical Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
15.2 Frequency Counts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
15.3 Benefits of Categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
15.4 Conversion to Ordinal Categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
15.5 The .cat Accessor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
15.6 Category Gotchas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
15.7 Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
15.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
15.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

16 Dataframes 143
16.1 Database and Spreadsheet Analogues . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
16.2 A Simple Python Version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
16.3 Dataframes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
16.4 Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
16.5 Dataframe Axis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
16.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
16.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

17 Similarities with Series and DataFrame 151
17.1 Getting the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
17.2 Viewing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
17.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
17.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157

18 Math Methods in DataFrames 159


18.1 Index Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
18.2 Duplicate Index Entries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
18.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
18.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

19 Looping and Aggregation 163


19.1 For Loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
19.2 Aggregations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
19.3 The .apply Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
19.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
19.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169

20 Columns Types, .assign, and Memory Usage 171
20.1 Conversion Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
20.2 Memory Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
20.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
20.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173

21 Creating and Updating Columns 175


21.1 Loading the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
21.2 More Column Cleanup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
21.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186

21.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186

22 Dealing with Missing and Duplicated Data 187


22.1 Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
22.2 Duplicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
22.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
22.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191

23 Sorting Columns and Indexes 193


23.1 Sorting Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
23.2 Sorting Column Order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
23.3 Setting and Sorting the Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
23.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
23.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197

24 Filtering and Indexing Operations 199


24.1 Renaming an Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
24.2 Resetting the Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
24.3 Dataframe Indexing, Filtering, & Querying . . . . . . . . . . . . . . . . . . . . . . . . . 200
24.4 Indexing by Position . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
24.5 Indexing by Name . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
24.6 Filtering with Functions & .loc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
24.7 .query vs .loc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
24.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
24.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211

25 Plotting with Dataframes 213


25.1 Lines Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
25.2 Bar Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
25.3 Scatter Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
25.4 Area Plots and Stacked Bar Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
25.5 Column Distributions with KDEs, Histograms, and Boxplots . . . . . . . . . . . . 224
25.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
25.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229

26 Reshaping Dataframes with Dummies 231


26.1 Dummy Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
26.2 Undoing Dummy Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
26.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
26.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235

27 Reshaping By Pivoting and Grouping 237


27.1 A Basic Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
27.2 Using a Custom Aggregation Function . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
27.3 Multiple Aggregations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
27.4 Per Column Aggregations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
27.5 Grouping by Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
27.6 Grouping with Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
27.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256

27.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256

28 More Aggregations 257


28.1 Aggregations while Keeping Rows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
28.2 Filtering Parts of Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
28.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
28.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261

29 Cross-tabulation Deep Dive 263


29.1 Cross-tabulation Summaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
29.2 Adding Margins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
29.3 Normalizing Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
29.4 Hierarchical Columns with Cross Tabulations . . . . . . . . . . . . . . . . . . . . . . . 265
29.5 Heatmaps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
29.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
29.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266

30 Melting, Transposing, and Stacking Data 267


30.1 Melting Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
30.2 Un-melting Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
30.3 Transposing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
30.4 Stacking & Unstacking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
30.5 Stacking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
30.6 Flattening Hierarchical Indexes and Columns . . . . . . . . . . . . . . . . . . . . . . . . . 278
30.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
30.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282

31 Working with Time Series 283


31.1 Loading the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
31.2 Adding Timezone Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
31.3 Exploring the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
31.4 Slicing Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
31.5 Missing Timeseries Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
31.6 Exploring Seasonality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
31.7 Resampling Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
31.8 Rules with Offset Aliases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
31.9 Combining Offset Aliases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
31.10 Anchored Offset Aliases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
31.11 Resampling to Finer-grain Frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
31.12 Grouping a Date Column with pd.Grouper . . . . . . . . . . . . . . . . . . . . . . . . . 298
31.13 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
31.14 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300

32 Joining Dataframes 301


32.1 Adding Rows to Dataframes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
32.2 Adding Columns to Dataframes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
32.3 Joins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
32.4 Join Indicators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306

32.5 Merge Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
32.6 Joining Data Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
32.7 Dirty Devil Flow and Weather Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
32.8 Joining Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
32.9 Validating Joined Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
32.10 Visualization of Merged Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
32.11 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
32.12 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312

33 Exporting Data 315


33.1 Dirty Devil Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
33.2 Reading and Writing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
33.3 Creating CSV Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
33.4 Exporting to Excel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
33.5 Feather . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
33.6 SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
33.7 JSON . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
33.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
33.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324

34 Styling Dataframes 327


34.1 Loading the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
34.2 Sparklines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
34.3 The .style Attribute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
34.4 Formatting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
34.5 Embedding Bar Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
34.6 Highlighting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
34.7 Heatmaps and Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
34.8 Captions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
34.9 CSS Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
34.10 Stickiness and Hiding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
34.11 Hiding the Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
34.12 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
34.13 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334

35 Debugging Pandas 339


35.1 Checking if Dataframes are Equal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
35.2 Debugging Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
35.3 Debugging Chains Part II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
35.4 Debugging Chains Part III . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
35.5 Debugging Chains Part IV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
35.6 Debugging Apply (and Friends) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
35.7 Memory Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
35.8 Timing Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
35.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
35.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356

36 Summary 357

About the Author 359

Index 361

Also Available 377

One more thing 379

Foreword

Python is easy to learn. You can learn the basics in a day and be productive. With only an
understanding of Python, moving to pandas can be difficult or confusing. It borrows some ideas
from NumPy that are not common in the wider Python ecosystem. This book is meant to aid you
in mastering pandas.
I have taught Python and pandas to many people over the years, in large corporate
environments, small startups, and in Python and Data Science conferences. I have seen what trips
people up, and confuses them. With the correct background, an attitude of acceptance, and a deep
breath, much of this confusion evaporates.
Having said this, pandas is an excellent tool. Many use it around the world to great success. I
hope to empower you to do this as well.
Cheers!
Matt

Chapter 1
Introduction

I have been using Python in some professional capacity or another since the turn of the century.
One of the trends that I have seen in that time is the uptake of Python for various aspects of data
science: gathering data, cleaning data, analysis, machine learning, and visualization. The pandas
library has seen much uptake in this area.
pandas¹ is a data analysis library for Python that has exploded in popularity over the past years.
The website describes it like this:

“pandas is an open-source, BSD-licensed library providing high-performance, easy-to-use
data structures and data analysis tools for the Python programming language.”
-pandas.pydata.org

My description of pandas is: pandas is an in-memory analysis tool, which has SQL-like
constructs, essential statistical and analytic support, as well as graphing capability. Because pandas
is built on top of Cython and NumPy, it has less memory overhead and runs quicker than pure
Python code. Many people use pandas to replace Excel, perform ETL (extract, transform, load
processing to move data from one place to another), process tabular data, load CSV or JSON files,
prep for machine learning, and more. Though it grew out of the financial sector (for time series
analysis), it is now a general-purpose data manipulation library.
With its NumPy lineage, pandas adopts some NumPy’isms that regular Python programmers
may not be aware of or familiar with. Yes, one could go out and use Cython to perform fast typed
data analysis with a Python-like dialect, but with pandas, you don’t need to. This work is done for
you. If you use pandas and the vectorized operations, you are getting close to C-level speeds for
numeric work but writing Python.
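
To make the idea of a vectorized operation concrete, here is a minimal sketch. The data (100,000 made-up prices) and the 1.08 multiplier are assumptions chosen only for illustration; they are not from the book. The loop version asks the Python interpreter to handle every element, while the vectorized version is a single expression that pandas hands off to compiled code:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100,000 invented prices (illustration only).
prices = pd.Series(np.arange(100_000, dtype="float64"))

# Pure-Python approach: the interpreter touches every element one at a time.
looped = [price * 1.08 for price in prices]

# Vectorized approach: one expression applied to the whole series at once.
vectorized = prices * 1.08

# Both approaches produce the same numbers.
same = np.allclose(looped, vectorized.to_numpy())
print(same)
```

If you time the two approaches (for example with %timeit in Jupyter), the vectorized form should come out well ahead on data of this size, though the exact ratio depends on your machine.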

1.1 Who this book is for

This guide is intended to introduce pandas and patterns for best practices. If you work with tabular
data and need capabilities beyond Excel, this is for you. This book covers many (but not all) aspects
of the library, as well as some gotchas or details that may be counter-intuitive or even non-pythonic
to longtime users of Python.
This book assumes a basic knowledge of Python. The author has written Illustrated Guide to
Python 3 that provides all the background necessary.
¹ pandas (http://pandas.pydata.org) refers to itself in lowercase, so this book will follow suit. When I’m referring
to specific code, I will set it in a monospace font.

1.2 Data in this Book

Every attempt has been made to use data that illustrates real-world pandas usage. As a visual
learner, I appreciate seeing where data is coming and going. As such, I try to shy away from just
showing tables of random numbers that have no meaning. I will show best practices gleaned from
years of using pandas.
I have selected a variety of datasets to show that the advice given in this book is applicable in
most situations you may encounter.

1.3 Hints, Tables, and Images

The hints, tables, and graphics found in this book have been collected over my years of using
pandas. They come from hang-ups, notes, and cheat sheets that I have developed after using
pandas and teaching others how to use the library.
In the physical version of this book, there is an index that has also been battle-tested during
development. Inevitably, when I was doing analysis for consulting or clients, I would check that
the index had the information I needed. If it didn’t, I added it.
If you enjoy this book, please consider writing a review on Amazon. That is one of the best
ways to thank an author.

Chapter 2
Installation

This book will use Python 3 throughout! Please do not use Python 2 unless you have a compelling
reason to. Python 3 is the future of the language, and the current pandas releases do not support
Python 2.

2.1 Anaconda

With that out of the way, let’s address the installation of pandas. The easiest and least painful
way to install pandas on most platforms is to use the Anaconda distribution². Anaconda is a
meta-distribution of Python, which contains many additional packages that have traditionally
been annoying to install unless you have the necessary toolchains to compile Fortran and C code.
Anaconda allows you to skip the compile step because it provides binaries for most platforms. The
Anaconda distribution itself is freely available, though commercial support is available as well.
After installing the Anaconda package, you should have a conda executable. Running the
following command will install pandas:
$ conda install pandas

Note
This book shows commands run from the UNIX command prompt. They are prefixed by
the prompt $. Unless otherwise noted, these commands will run on the Windows command
prompt as well. Do not type the prompt. It is included to distinguish commands run via a
terminal or command prompt from Python code.

We can verify that this works by trying to import the pandas package:
$ python
>>> import pandas
>>> pandas.__version__
'1.3.2'

Note
The command above shows a Python prompt, >>>. Do not type the Python prompt. It is
included to make it easy to distinguish Python code from the output of Python code. For
example, the output of the above, '1.3.2', does not have the prompt in front of it. The book
also includes the secondary Python prompt, ..., for code that is longer than a single line.
Note that Jupyter does not use the Python prompt in its cells.

If the library successfully imports, you should be good to go.

2.2 Pip

If you aren’t using Anaconda, I recommend you use pip³ to install pandas. The pandas library will
install on Windows, Mac, and Linux via pip.
It may be necessary to prepare the operating system for building pandas from source by
installing dependencies and the proper header files for Python. On Ubuntu, this is straightforward;
other environments may be different:
$ sudo apt-get install build-essential python-all-dev
Using virtualenv⁴ will alleviate the need for superuser access during installation. Because
virtualenv uses pip, it can download and install newer releases of pandas if the version found
on the distribution is lagging.
On Mac and Linux platforms, the following commands create a virtualenv sandbox and install
the latest pandas in it (assuming that the prerequisite files are also installed):
$ python3 -m venv pandas-env
$ source pandas-env/bin/activate
(pandas-env)$ pip install pandas
Once you have pandas installed, confirm that you can import the library and check the version:
$ source pandas-env/bin/activate
(pandas-env)$ python
>>> import pandas
>>> pandas.__version__
'1.3.2'
On Windows, you will open a Command Prompt and run the following to create a virtual
environment:
> python -m venv pandas-env
> pandas-env\Scripts\activate
(pandas-env)> pip install pandas

Note
The Windows command prompt, >, is shown in the previous command. Do not type it. Only
type the commands following the prompt.

Try to import the library and check the version:


(pandas-env)> python
>>> import pandas
>>> pandas.__version__
'1.3.2'

² https://anaconda.com/downloads
³ http://pip-installer.org/
⁴ http://www.virtualenv.org

Figure 2.1: Jupyter home page.

2.3 Jupyter Overview

I recommend you use Jupyter (or a program that connects to it) as a data exploration tool. I use
Jupyter classic, though there are other options: JupyterLab, connecting to Jupyter via PyCharm,
VSCode, Emacs, as well as Google Colab. Jupyter classic will give you basic functionality and is
included in many cloud environments.
Jupyter notebook is an environment for combining interactive coding and text in a web browser.
This allows us to easily share code and narrative around that code. An example that was popular
in the scientific community was the discovery of gravitational waves.⁵
The name Jupyter is a rebranding of an open-source project previously known as IPython
Notebook. The rebranding was to emphasize that although the backend is written in Python,
Jupyter supports various kernels to run other languages, including Julia (the "Ju" portion), Python
("pyt"), and R ("er"), all popular data science programming languages.
The architecture of Jupyter includes a server running various kernels. Using a notebook we can
interact with a kernel. Typically we use a web browser to do this, but other interfaces exist, such
as an emacs mode (ein), PyCharm, or VSCode.
To install Jupyter, type:
$ pip install notebook
Once Jupyter is installed, launch it with this command:
$ jupyter notebook
Then navigate to http://localhost:8888 and you should be presented with the Jupyter home
page.
Click on the dropdown button on the right that says ”New” and select Python 3.
At this point, you are presented with a notebook with an empty cell. Jupyter is a modal
environment. There are two modes, command mode and edit mode. Command mode is for
creating and manipulating cells. Edit mode is for changing what is inside of a single cell.

⁵ https://losc.ligo.org/s/events/GW150914/GW150914_tutorial.html

Figure 2.2: Creating a Python 3 Jupyter notebook.

There are many commands for both modes. If you are in command mode (and you will know
that because the box around the cell is blue), you can type "h", and it will bring up a pop-up with
the keyboard shortcuts for both command and edit mode. Don’t worry about memorizing all of
them. Here are the commands you will be using most of the time in command mode:
• h - Bring up help (ESC to dismiss)
• a - Create cell above
• b - Create cell below
• x - Cut cell
• c - Copy cell
• v - Paste cell below
• Enter - Go into Edit Mode
• m - Change cell type to Markdown
• y - Change cell type to code
• ii - Interrupt kernel
• 00 - Restart kernel
• Ctrl-Enter - Execute cell
When you click on a cell or type Enter, you go into edit mode. You will see that the outline turns
green if you are in edit mode. In edit mode, you have basic editing functionality. A few keys to
know:
• Ctrl-Enter - Run cell (execute Python code, render Markdown)
• ESC - Go back to command mode
• TAB - Tab completion
• Shift-TAB - Bring up tooltip (ESC to dismiss)

Figure 2.3: Running a cell in Jupyter with basic Python commands.

2.4 Summary

In this chapter, we saw how to set up a Python environment using Anaconda or Pip. We also
introduced the Jupyter notebook. I recommend that you get comfortable with Jupyter. Not only is
it free and open-source, but many large cloud providers also offer Jupyter in their environments.

2.5 Exercises

1. Install pandas on your machine (using Anaconda or pip).
2. Install Jupyter on your machine.
3. Launch Jupyter and run the following in a cell:
import pandas
pandas.show_versions()
Chapter 3
Data Structures

One of the keys to understanding pandas is to understand the data model. At the core of pandas
are two data structures. The most widely used data structures are the Series and the DataFrame for
dealing with array data and tabular data. This table shows their analogs in the spreadsheet and
database world.
Data Structure   Dimensionality   Spreadsheet Analog   Database Analog   Linear Algebra
Series           1D               Column               Column            Column Vector
DataFrame        2D               Single Sheet         Table             Matrix


Figure 3.1: Different dimensions of pandas data structures

An analogy with the spreadsheet world illustrates the basic differences between these types. A
DataFrame is similar to a sheet with rows and columns, while a Series is similar to a single column
of data (when we refer to a column of data in this text, we are referring to a Series).
Diving into these core data structures a little more is helpful because a bit of understanding
goes a long way towards better use of the library. We will spend a good portion of time discussing
the Series and DataFrame. Both the Series and DataFrame share features. For example, they both have
an index, which we will need to examine to understand how pandas works.
Also, because the DataFrame can be thought of as a collection of columns that are really Series
objects, it is imperative that we have a comprehensive study of the Series first. Additionally (and
perhaps odd to some), we will see this when we iterate over rows, and the rows are represented as
Series (however, if you find yourself consistently dealing with rows instead of columns, you are
probably not using pandas in an optimal way).
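
The relationship described above can be checked directly. The following is a small sketch with invented data (the artist and songs column names and their values are assumptions for illustration). It shows that a single column pulled from a DataFrame is a Series, and that each row produced by .iterrows is a Series as well:

```python
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame({
    "artist": ["Paul", "John", "George", "Ringo"],
    "songs": [145, 142, 38, 13],
})

# Pulling out one column gives back a Series.
column = df["songs"]
print(type(column).__name__)

# Iterating over rows also yields Series objects (one per row).
row_types = {type(row).__name__ for _, row in df.iterrows()}
print(row_types)

# Column-wise work is usually the better fit: one call, no Python loop.
total = df["songs"].sum()
print(total)
```

The last line is the style to prefer: operate on the whole column rather than looping over rows.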
Some have compared the data structures to Python lists or dictionaries, and I think this is a
stretch that doesn’t provide much benefit. Mapping the list and dictionary methods on top of
pandas’ data structures just leads to confusion.

3.1 Summary

The pandas library includes two main data structures and associated functions for manipulating
them. This book will focus on the Series and DataFrame. First, we will look at the Series as the
DataFrame can be considered a collection of columns represented as Series objects.

Figure 3.2: Figure showing the relation between the main data structures in pandas. Namely, that a
dataframe can have one or many series.

3.2 Exercises

1. If you had a spreadsheet with data, which pandas data structure would you use to hold the
data? Why?
2. If you had a database with data, which pandas data structure would you use to hold the data?
Why?

Chapter 4
Series Introduction

A Series is used to model one-dimensional data. The Series object also has a few more bits of data,
including an index and a name. A common idea through pandas is the notion of an axis. Because
a series is one-dimensional, it has a single axis: the index.
Below is a table of counts of songs artists composed. We will use this to explore the series:

Artist  Data
0       145
1       142
2       38
3       13

If you wanted to represent this data in pure Python, you could use a data structure similar to
the one that follows. The dictionary, series, has a list of the data points stored under the 'data'
key. In addition to an entry in the dictionary for the actual data, there is an explicit entry for the
corresponding index values for the data (in the 'index' key), as well as an entry for the name of the
data (in the 'name' key):
>>> series = {
...     'index': [0, 1, 2, 3],
...     'data': [145, 142, 38, 13],
...     'name': 'songs'
... }
The get function defined below can pull items out of this data structure based on the index:
>>> def get(series, idx):
...     value_idx = series['index'].index(idx)
...     return series['data'][value_idx]

>>> get(series, 1)
142

Note
The code samples in this book are shown as if they were typed directly into an interpreter.
Lines starting with >>> and ... are interpreter markers for the input prompt and continuation
prompt respectively. Lines that are not prefixed by one of those sequences are the output from
the interpreter after running the code.
In Jupyter (and IPython) you do not see the prompts. I include them to help distinguish
between code and output.
The Python interpreter will print the return value of the last invocation (even if the print
statement is missing) automatically. If you desire to use the code samples found in this book,
leave the interpreter prompts out.

4.1 The index abstraction

This double abstraction of the index seems unnecessary at first glance, since a list already has integer
indexes. But there is a trick up pandas’ sleeves. By allowing non-integer values, the data structure
values, the data structure
supports other index types such as strings, dates, as well as arbitrarily ordered indices, or even
duplicate index values.
Below is an example that has string values for the index:
>>> songs = {
...     'index': ['Paul', 'John', 'George', 'Ringo'],
...     'data': [145, 142, 38, 13],
...     'name': 'counts'
... }

>>> get(songs, 'John')
142
The index is a core feature of pandas’ data structures given the library’s past in analysis of
financial data or time-series data. Many of the operations performed on a Series operate directly
on the index or by index lookup.

4.2 The pandas Series

With that background in mind, let’s look at how to create a Series in pandas. It is easy to create a
Series object from a list:
>>> import pandas as pd
>>> songs2 = pd.Series([145, 142, 38, 13],
...     name='counts')

>>> songs2
0    145
1    142
2     38
3     13
Name: counts, dtype: int64
When the interpreter prints our series, pandas makes a best effort to format it for the current
terminal size. The series is one-dimensional. However, this looks like it is two-dimensional. The
leftmost column is the index, which contains entries for the index. The index is not part of the
values. The generic name for an index is an axis, and the values of the index (0, 1, 2, 3) are
called axis labels. The data (145, 142, 38, and 13) is also called the values of the series. The
two-dimensional structure in pandas, a DataFrame, has two axes, one for the rows and another for the
columns.
The rightmost column in the output contains the values of the series (145, 142, 38, and 13). In
this case, they are integers (the console representation says dtype: int64, dtype meaning data type,
and int64 meaning 64-bit integer), but in general, the values of a Series can hold strings, floats,
booleans, or arbitrary Python objects. To get the best speed (and to leverage vectorized operations),
the values should be of the same type, though this is not required.

Figure 4.1: The parts of a Series.

It is easy to inspect the index of a series (or data frame), as it is an attribute of the object:
>>> songs2.index
RangeIndex(start=0, stop=4, step=1)

The default values for an index are monotonically increasing integers. songs2 has an integer-
based index.
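
As a quick check of the parts described above, this sketch pulls the index, the values, and the name back out of a series. The variable names labels, values, and name are my own, chosen for illustration:

```python
import pandas as pd

songs2 = pd.Series([145, 142, 38, 13], name="counts")

# The three parts of a Series: the index (axis labels), the values, and the name.
labels = list(songs2.index)    # default labels are 0, 1, 2, 3
values = songs2.to_numpy()     # the underlying data as a NumPy array
name = songs2.name

print(labels)
print(values)
print(name)

# Even though the printed form looks like two columns, there is only one axis.
print(songs2.ndim)
```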

Note
The index can be string-based as well, in which case pandas indicates that the datatype for the
index is object (not string):
>>> songs3 = pd.Series([145, 142, 38, 13],
...     name='counts',
...     index=['Paul', 'John', 'George', 'Ringo'])
Note that the dtype that we see when we print a Series is the type of the values, not the
index. Even though this looks two-dimensional, remember that the index is not part of the
values:
>>> songs3
Paul      145
John      142
George     38
Ringo      13
Name: counts, dtype: int64
When we inspect the index attribute, we see that the dtype is object:
>>> songs3.index
Index(['Paul', 'John', 'George', 'Ringo'], dtype='object')
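
With a string index in place, lookups and arithmetic go through the labels rather than the positions. Here is a minimal sketch. The second series, extra, and its values are invented for illustration; .loc is the label-based accessor in pandas and plays the role that the get function played for the dictionary version:

```python
import pandas as pd

songs = pd.Series([145, 142, 38, 13],
                  index=["Paul", "John", "George", "Ringo"],
                  name="counts")

# Label-based lookup, the pandas counterpart of get(songs, 'John').
john = songs.loc["John"]
print(john)

# Hypothetical second series: different order, and 'George' is missing.
extra = pd.Series([1, 2, 3], index=["Ringo", "Paul", "John"])

# Arithmetic lines values up by index label, not by position.
combined = songs + extra
print(combined)
```

Because George has no partner in extra, pandas has nothing to add for that label and marks the result as missing.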

The actual data (or values) for a series does not have to be numeric or homogeneous. We can insert
Python objects into a series:
>>> class Foo:
...     pass

>>> ringo = pd.Series(
...     ['Richard', 'Starkey', 13, Foo()],
...     name='ringo')

>>> ringo
0                           Richard
1                           Starkey
2                                13
3    <__main__.Foo object at 0x...>
Name: ringo, dtype: object
In the above case, the dtype (datatype) of the Series is object (meaning a Python object). This can
be good or bad.
The object data type is also used for a series with string values. In addition, it is also used
for values that have heterogeneous or mixed types. If you have just numeric data in a series, you
wouldn’t want it stored as a Python object, but rather as an int64 or float64, which allow you to do
vectorized numeric operations.
If you have time data and it says it has the object type, you probably have strings for the dates.
Using strings instead of date types is bad as you don’t get the date operations that you would get
if the type were datetime64[ns]. A series with string data, on the other hand, has the type of object.
Don’t worry; we will see how to convert types later in the book.
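
As a small preview, and only a sketch, here is one way to inspect the type of a series and to turn date strings into real dates. The dates are invented, and pd.to_datetime is just one of several conversion routes, so treat this as an assumption about how you might do it rather than the method the book settles on:

```python
import pandas as pd

numbers = pd.Series([145, 142, 38, 13])
mixed = pd.Series(["Richard", "Starkey", 13])
date_strings = pd.Series(["2021-01-01", "2021-06-15"])  # invented dates

# Homogeneous integers get a numeric type; mixed values fall back to object.
print(numbers.dtype)
print(mixed.dtype)

# Converting the strings gives a datetime64 series.
as_dates = pd.to_datetime(date_strings)
print(as_dates.dtype)

# Date operations (via .dt) are available only after the conversion.
years = as_dates.dt.year.tolist()
print(years)
```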

4.3 The NaN value


A value that may be familiar to NumPy users, but not Python users in general, is NaN. When pandas
determines that a series holds numeric values but cannot find a number to represent an entry, it
will use NaN. This value stands for Not A Number and is usually ignored in arithmetic operations
(similar to NULL in SQL).
Here is a series that has NaN in it:
>>> import numpy as np
>>> nan_series = pd.Series([2, np.nan],
...     index=['Ono', 'Clapton'])
>>> nan_series
Ono        2.0
Clapton    NaN
dtype: float64

Note
One thing to note is that the type of this series is float64, not int64! The type is a float because
float64 supports NaN, which int64 does not. When pandas sees numeric data (2) as well as the
np.nan, it coerces the 2 to a float value.

Below is an example of how pandas ignores NaN. The .count method, which counts the number of
values in a series, disregards NaN. In this case, it indicates that the count of items in the series is one,
one for the value of 2 at index location Ono, ignoring the NaN value at index location Clapton:
>>> nan_series.count()
1

You can inspect the number of entries (including missing values) with the .size property:
>>> nan_series.size
2
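
The same skipping behavior shows up in aggregations. A minimal sketch using the series from above (the variable names total, strict_total, and missing are mine):

```python
import numpy as np
import pandas as pd

nan_series = pd.Series([2, np.nan], index=["Ono", "Clapton"])

# By default, aggregations skip NaN.
total = nan_series.sum()
print(total)

# Passing skipna=False lets the missing value propagate instead.
strict_total = nan_series.sum(skipna=False)
print(strict_total)

# .isna marks the missing entries; summing the booleans counts them.
missing = nan_series.isna().sum()
print(missing)
```

The difference between .size and .count gives the same number of missing entries as the last line.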
Note
If you load data from a CSV file, an empty value for an otherwise numeric column will become
NaN. Later, we will see how methods such as .fillna and .dropna deal with NaN.

None, NaN, nan, <NA>, and null are synonyms in this book when referring to empty or missing data
found in a pandas series or dataframe.
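
To see the CSV behavior without needing a file on disk, this sketch reads a small invented CSV from a string. The column names and values are assumptions for illustration; io.StringIO simply stands in for a real file:

```python
import io

import pandas as pd

# Hypothetical CSV content; George's song count is left empty.
csv_text = "artist,songs\nPaul,145\nGeorge,\nRingo,13\n"
df = pd.read_csv(io.StringIO(csv_text))

songs = df["songs"]
print(songs)

# The empty field became NaN, so the column is float64 rather than int64.
print(songs.dtype)

# Two common ways to handle it: fill in a value, or drop the entry.
filled = songs.fillna(0)
dropped = songs.dropna()
print(filled.tolist())
print(len(dropped))
```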

4.4 Optional Integer Support for NaN

The int64 type does not support missing data. Many considered that a wart of pandas. As of
pandas 0.24, there is optional support for another integer type that can hold missing values, denoted
as <NA> below. The documentation calls this type the nullable integer type. When you create a series,
you can pass in dtype='Int64' (note the capitalization):
>>> nan_series2 = pd.Series([2, None],
...     index=['Ono', 'Clapton'],
...     dtype='Int64')
>>> nan_series2
Ono           2
Clapton    <NA>
dtype: Int64
Operations on these series still ignore NaN or <NA>:
>>> nan_series2.count()
1

Note
You can use the .astype method to convert columns to the nullable integer type. Just use the
string 'Int64' as the type:
>>> nan_series.astype('Int64')
Ono           2
Clapton    <NA>
dtype: Int64

I generally ignore 'Int64' as I tend to clean up missing data. Also, when you ingest data in pandas,
most functions use 'int64' (in lowercase) by default.
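
If you do want to keep integers alongside missing values, the following sketch contrasts the two behaviors. The variable names regular and nullable are mine, chosen for illustration:

```python
import pandas as pd

# Without an explicit dtype, the missing entry forces a float series.
regular = pd.Series([2, None], index=["Ono", "Clapton"])
print(regular.dtype)

# With the nullable integer type, the 2 stays an integer.
nullable = pd.Series([2, None], index=["Ono", "Clapton"], dtype="Int64")
print(nullable.dtype)

# Arithmetic keeps the integer type and leaves the missing entry missing.
bumped = nullable + 1
print(bumped)
```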

4.5 Similar to NumPy

The Series object behaves similarly to a NumPy array. As shown below, both types respond to
index operations:
>>> import numpy as np
>>> numpy_ser = np.array([145, 142, 38, 13])
>>> songs3[1]
142
>>> numpy_ser[1]
142
They both have methods in common:

Figure 4.2: Filtering a series with a boolean array.

>>> songs3.mean()
84.5
>>> numpy_ser.mean()
84.5
They also both have a notion of a boolean array. A boolean array is a series with the same index
as the series you are working with that has boolean values, and it can be used as a mask to filter
out items. Normal Python lists do not support such fancy index operations, like sticking a list into
an index operation.
In this example, we will make a mask:
>>> mask = songs3 > songs3.median()  # boolean array
boolean array
>>> mask
Paul True
John True
George False
Ringo False
Name: counts, dtype: bool
Once we have a mask, we can use that as a filter. We just need to pass the mask into an index
operation. If the mask has a True value for a given index, the value is kept. Otherwise, the value is
dropped. The mask above represents the locations that have a value higher than the median value
of the series.
>>> songs3[mask]
Paul    145
John    142
Name: counts, dtype: int64
NumPy also has filtering by boolean arrays, but lacks the .median method on an array. Instead,
NumPy provides a median function in the NumPy namespace. The equivalent version in NumPy
looks like this:
>>> numpy_ser[numpy_ser > np.median(numpy_ser)]
array([145, 142])

Note
Both NumPy and pandas have adopted the convention of using import statements in
combination with an as statement to rename their imports to two-letter acronyms. This is called
aliasing:
>>> import pandas as pd
>>> import numpy as np
Renaming imports provides a slight typing benefit (four fewer characters) while still
allowing the user to be explicit with their namespaces.
Be careful, as you may see the following cast about in code samples, blogs, or
documentation:
>>> from pandas import *
Though you see star imports frequently used in examples online, I would advise not to use
star imports. I never use them in my book examples or code that I write for clients. They have
the potential to clobber items in your namespace and make tracing the source of a definition
more difficult (especially if you have multiple star imports). As the Zen of Python states,
“Explicit is better than implicit”⁶.

4.6 Categorical Data

When you load data, you can indicate that the data is categorical. If we know that our data is
limited to a few values, we might want to use categorical data. Categorical values have a few
benefits:

• Use less memory than strings
• Improve performance
• Can have an ordering
• Can perform operations on categories
• Enforce membership on values

⁶ Type import this into an interpreter to see the Zen of Python. Or search for "PEP 20".

Categories are not limited to strings; we can also convert numbers or datetime values to categorical
data.
To create a category, we pass dtype="category" into the Series constructor. Alternatively, we can
call the .astype("category") method on a series:
>>> s = pd.Series(['m', 'l', 'xs', 's', 'xl'], dtype='category')
ry ')
>>> s
0 m
1 l
2 xs
3 s
4 xl
dtype: category
Categories (5, object): ['l', 'm', 's', 'xl', 'xs']
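
The memory benefit listed earlier is easy to check. This sketch repeats the five sizes 10,000 times (an arbitrary amount chosen for illustration) and compares the two representations with .memory_usage(deep=True). The exact byte counts depend on your pandas version, so none are shown here:

```python
import pandas as pd

# Hypothetical data: the same five labels repeated many times.
sizes = pd.Series(["m", "l", "xs", "s", "xl"] * 10_000)
as_category = sizes.astype("category")

# deep=True asks pandas to include the memory held by the strings themselves.
string_bytes = sizes.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(string_bytes, category_bytes)
print(category_bytes < string_bytes)
```

The saving comes from storing each distinct label once and keeping only small integer codes per row, so it grows with the amount of repetition in the data.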
If this series represents the size, there is a natural ordering as a small is less than a medium. By
default, categories don’t have an ordering. We can verify this by inspecting the .cat attribute that
has various properties:
>>> s.cat.ordered
False
To convert a non-categorical series to an ordered category, we can create a type with the
CategoricalDtype constructor and the appropriate parameters. Then we pass this type into the
.astype method:
>>> s2 = pd.Series(['m', 'l', 'xs', 's', 'xl'])
>>> size_type = pd.api.types.CategoricalDtype(
...     categories=['s', 'm', 'l'], ordered=True)
>>> s3 = s2.astype(size_type)
>>> s3
0      m
1      l
2    NaN
3      s
4    NaN
dtype: category
Categories (3, object): ['s' < 'm' < 'l']
In this case, we limited the categories to just 's', 'm', and 'l', but the data had values that were
not in those categories. Converting the data to a category type replaces those extra values with NaN.
If we have ordered categories, we can do comparisons on them:
>>> s3 > 's'
0 True
1 True
2 False
3 False
4 False
dtype: bool
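
Ordering pays off in more than comparisons. In this sketch (the data is invented, and the names ordered_values, smallest, largest, and rejected are mine), sorting and min/max follow the category order rather than alphabetical order, and assigning a value outside the categories is refused:

```python
import pandas as pd

size_type = pd.api.types.CategoricalDtype(
    categories=["s", "m", "l"], ordered=True)
sizes = pd.Series(["m", "l", "s", "l"]).astype(size_type)

# Sorting follows the declared order (s < m < l), not the alphabet.
ordered_values = sizes.sort_values().tolist()
print(ordered_values)

# min and max respect the same ordering.
smallest, largest = sizes.min(), sizes.max()
print(smallest, largest)

# Membership is enforced: a value outside the categories is rejected.
# The exception type has differed between pandas versions, so catch both.
try:
    sizes.iloc[0] = "xxl"
    rejected = False
except (TypeError, ValueError):
    rejected = True
print(rejected)
```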
The prior example created a new Series from existing data that was not categorical. We can also
add ordering information to categorical data. We just need to make sure that we specify all of the
members of the category or pandas will throw a ValueError:
>>> s.cat.reorder_categories(['xs', 's', 'm', 'l', 'xl'],
...     ordered=True)
0 m
1 l
2 xs
3 s
4 xl
dtype: category
Categories (5, object): ['xs' < 's' < 'm' < 'l' < 'xl']
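To see the ValueError mentioned above, here is a minimal sketch that first leaves out two of the existing categories and then supplies the complete list. The error_message variable is only there to capture the failure for inspection.

```python
import pandas as pd

s = pd.Series(['m', 'l', 'xs', 's', 'xl'], dtype='category')

error_message = None
try:
    # 'xs' and 'xl' are existing categories but are missing from this list
    s.cat.reorder_categories(['s', 'm', 'l'], ordered=True)
except ValueError as err:
    error_message = str(err)

# Supplying every existing category succeeds and returns a new series
ordered = s.cat.reorder_categories(
    ['xs', 's', 'm', 'l', 'xl'], ordered=True)

print(error_message)
print(ordered.cat.ordered)
```

Note that the call returns a new series; the original s remains unordered.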

Note
String and datetime series have a str and dt attribute that allow us to perform common
operations specific to that type. If we convert these types to categorical types, we can still
use the str or dt attributes on them:
>>> s3.str.upper()
0 M
1 L
2 NaN
3 S
4 NaN
dtype: object

Method                                     Description
pd.Series(data=None, index=None,           Create a series from data (sequence, dictionary, or
  dtype=None, name=None, copy=False)       scalar).
s.index                                    Access index of series.
s.astype(dtype, errors='raise')            Cast a series to dtype. To ignore errors (and return
                                           original object) use errors='ignore'.
s[boolean_array]                           Return values from s where boolean_array is True.
s.cat.ordered                              Determine if a categorical series is ordered.
s.cat.reorder_categories(new_categories,   Add categories (potentially ordered) to the series.
  ordered=False)                           new_categories must include all categories.
Table 4.1: Series Overview Attributes and Methods

4.7 Summary

The Series object is a one-dimensional data structure. It can hold numerical data, time data,
strings, or arbitrary Python objects. If you are dealing with numeric data, using pandas rather
than a Python list will benefit you. Pandas is faster, consumes less memory, and comes with built-in
methods that are very useful to manipulate the data. Also, the index abstraction allows for
accessing values by position or label. A Series can also have empty values and has some similarities
to NumPy arrays. It is the primary workhorse of pandas; mastering it will pay dividends.

4.8 Exercises

1. Using Jupyter, create a series with the temperature values for the last seven days. Filter out
the values below the mean.
2. Using Jupyter, create a series with your favorite colors. Use a categorical type.
Chapter 5
Series Deep Dive

There are many operations you can do with a Series. In this chapter, we will introduce many of
them.
We will pull data from the US Fuel Economy website (see footnote 7). This site has data on the efficiency of
makes and models of cars sold in the US since 1984.

5.1 Loading the Data

I have a copy of this data in my GitHub repository. One of the nice features of pandas is that the
read_csv function can accept not only URLs but also ZIP files. Because this ZIP file contains only
a single file, we can use this function. If it were a ZIP file with multiple files, we would need to
decompress the data to pull out the file we were interested in.
The first columns in the dataset we will investigate are city08 and highway08, which provide
information on miles per gallon usage while driving around in the city and highway respectively:
>>> import pandas as pd
>>> url = 'https://github.com/mattharrison/datasets/raw/master/data/' \
...     'vehicles.csv.zip'
>>> df = pd.read_csv(url)
>>> city_mpg = df.city08
>>> highway_mpg = df.highway08
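The single-file ZIP behavior can be tried without network access. This is a minimal sketch that writes a tiny, made-up CSV into a ZIP archive in a temporary directory and reads it back. The file name toy_vehicles.csv.zip and its three rows are assumptions for illustration, not the real dataset.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Made-up rows that mimic two columns of the real dataset
csv_text = "city08,highway08\n19,25\n9,14\n23,33\n"

with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "toy_vehicles.csv.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("toy_vehicles.csv", csv_text)

    # read_csv infers ZIP compression from the .zip extension and
    # reads the archive because it holds exactly one file
    toy = pd.read_csv(zip_path)

print(toy)
```

For an archive with several members, one option is to open the member of interest with zipfile.ZipFile(...).open(name) and pass that file object to read_csv.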

Let’s look at the data:


>>> city_mpg
0 19
1 9
2 23
3 10
4 17
..
41139 19
41140 20
41141 18
41142 18
41143 16
Name: city08, Length: 41144, dtype: int64

7 https://www.fueleconomy.gov/feg/download.shtml

Figure 5.1: Jupyter will pop up a list of options for completions when you hit TAB following a period.

>>> highway_mpg
0 25
1 14
2 33
3 12
4 23
..
41139 26
41140 28
41141 24
41142 24
41143 21
Name: highway08, Length: 41144, dtype: int64
It looks like each series has around 40,000 integer entries. Because the type of this series is int64,
we know that none of the values are missing.
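The link between int64 and the absence of missing values can be checked on a toy series. In this sketch (values are made up), adding a missing value forces pandas to fall back to float64, because NaN is a floating-point value. The nullable 'Int64' extension type (capital I) is shown as a contrast, since it can hold integers and missing values together.

```python
import numpy as np
import pandas as pd

complete = pd.Series([19, 9, 23])          # no gaps: stays int64
with_gap = pd.Series([19, np.nan, 23])     # NaN forces float64
nullable = pd.Series([19, None, 23], dtype='Int64')  # nullable integers

print(complete.dtype, complete.isna().sum())
print(with_gap.dtype, with_gap.isna().sum())
print(nullable.dtype, nullable.isna().sum())
```

So the reasoning in the text holds for the NumPy-backed int64 type (lowercase i), which is what read_csv produced here.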

5.2 Series Attributes

The pandas library provides a lot of functionality. The built-in dir function will list the attributes
of an object. Let’s examine how many attributes there are on a series:
>>> len(dir(city_mpg))
457
Wow! There are over 400 attributes on a series. In contrast, a Python list or dictionary has
around 40 attributes. Do not fret; you will not need to memorize all of these if you get comfortable
with a tool like Jupyter. If you have a Series object, you can hit TAB after a period, and it will pop
up a list of completions. (Other tools are also able to do this for Python objects).
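Outside of Jupyter, the same exploration can be done in plain Python. The sketch below counts the attributes on a small made-up series, compares the count with a list and a dictionary, and splits the names by whether they start with an underscore. The exact numbers depend on the installed pandas and Python versions, so no output is shown.

```python
import pandas as pd

ser = pd.Series([19, 9, 23], name='city08')  # toy data

series_attrs = dir(ser)

# Names with a leading underscore are dunder or private attributes
private = [name for name in series_attrs if name.startswith('_')]
public = [name for name in series_attrs if not name.startswith('_')]

print(len(series_attrs), len(dir([])), len(dir({})))
print(len(public), len(private))
print(public[:10])
```

Calling help(ser.mean), or typing ser.mean? in Jupyter, shows the documentation for any name found this way.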
What functionality do all of these attributes provide? Here is a summary. There are many ways
to categorize these, and I’m roughly going to do it by what the result of the method is:

• Dunder methods (.__add__, .__iter__, etc) provide many numeric operations, looping,
attribute access, and index access. For the numeric operations, these return Series.
• Corresponding operator methods for many of the numeric operations allow us to tweak the
behavior (there is an .add method in addition to .__add__).
• Aggregate methods and properties which reduce or aggregate the values in a series down to
a single scalar value. The .mean, .max, and .sum methods and .is_monotonic property are all
examples.
• Conversion methods. Some of these start with .to_ and export the data to other formats.
• Manipulation methods such as .sort_values, .drop_duplicates, that return Series objects with
the same index.
• Indexing and accessor methods and attributes such as .loc and .iloc. These return Series or
scalars.
• String manipulation methods using .str.
• Date manipulation methods using .dt.
• Plotting methods using .plot.
• Categorical manipulation methods using .cat.
• Transformation methods such as .unstack and .reset_index, .agg, .transform.
• Attributes such as .index and .dtype.
• A bunch of private attributes that we will ignore (around 130 of them).
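A few of the groups in the list above can be illustrated on two tiny, made-up series. This sketch contrasts the + operator with the .add method (which accepts a fill_value parameter), and shows one aggregate, one conversion, and one indexing call. The variable names are illustrative only.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20], index=['x', 'y'])

# Dunder method via the operator: 'z' has no partner in b, so it is NaN
via_operator = a + b

# Operator method: same addition, but a missing partner counts as 0
via_method = a.add(b, fill_value=0)

total = a.sum()         # aggregate: series down to a scalar
as_list = a.to_list()   # conversion: export to a Python list
first = a.loc['x']      # indexing: look up a value by label

print(via_operator)
print(via_method)
print(total, as_list, first)
```

The fill_value parameter is the kind of tweak that the operator form cannot express, which is why both spellings exist.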

We will cover many of these in the following chapters.

5.3 Summary

In this chapter, we introduced the notion that pandas objects have a large number of attributes and
methods. Do not let this overwhelm you. You don’t need to memorize all of the methods.

5.4 Exercises

1. Explore the documentation for five attributes of a series from Jupyter.
2. How many attributes are found on the .str attribute? Look at the documentation for three
of them.
3. How many attributes are found on the .dt attribute? Look at the documentation for three of
them.
Chapter 6
Operators (& Dunder Methods)

6.1 Introduction

This chapter will review some of the operators and magic or dunder methods found in series. In
short, these are the protocols that determine how the Python language reacts to operations. For
example, when you use the + operation, Python is dispatching to the .__add__ method. When you
use a loop with a for statement, Python dispatches to the .__iter__ method.
This will not be a deep treatise on the dunder methods (double underscore methods) or magic
methods.
Let’s look at how this works with a pandas series.

6.2 Dunder Methods

Here is an example in pure Python. When you run this code:


>>> 2 + 4
6
Under the covers, Python runs this:
>>> (2).__add__(4)
6
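The same dispatch applies to any class that defines these methods. As a sketch, the hypothetical Mileage class below (not part of pandas or of this book) implements .__add__ and .__iter__, so instances respond to the + operator and to iteration.

```python
class Mileage:
    """Hypothetical toy container used only to illustrate dunder methods."""

    def __init__(self, *values):
        self.values = list(values)

    def __add__(self, other):
        # Called for `self + other`: add element by element
        return Mileage(*(a + b for a, b in zip(self.values, other.values)))

    def __iter__(self):
        # Called by for loops, list(), and unpacking
        return iter(self.values)


city = Mileage(19, 9, 23)
highway = Mileage(25, 14, 33)

combined = city + highway      # dispatches to Mileage.__add__
same = city.__add__(highway)   # explicit spelling of the line above

print(list(combined))          # list() dispatches to Mileage.__iter__
```

A Series works the same way, with far more of these methods already implemented.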

A Python integer object has a .__add__ method that responds to the + operation. Because a Series
object has this method, you can call + on it. There is also a .__truediv__ method that supports
division.
One way to calculate the average of the two series is the following:
>>> (city_mpg + highway_mpg)/2
0 22.0
1 11.5
2 28.0
3 11.0
4 20.0
...
41139 22.5
41140 24.0
41141 21.0
41142 21.0
41143 18.5
Length: 41144, dtype: float64
Note that the type of the result is float64.
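To make the dispatch concrete, this sketch repeats the average on the first three rows of each series shown earlier and spells out the dunder calls that the operators trigger. In Python 3 the / operator dispatches to .__truediv__. The names with_operators and with_dunders are illustrative.

```python
import pandas as pd

# First three values of city_mpg and highway_mpg from the output above
city = pd.Series([19, 9, 23])
highway = pd.Series([25, 14, 33])

with_operators = (city + highway) / 2
with_dunders = city.__add__(highway).__truediv__(2)

print(with_operators)
print(with_operators.equals(with_dunders))
```

Both inputs are int64, but true division always produces floating-point results, which is why the averaged series is float64.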
