Effective Pandas
Patterns for Data Manipulation
Treading on Python
Matt Harrison
Independently Published 2021
hairysun.com
COPYRIGHT © 2021
While every precaution has been taken in the preparation of this book, the publisher and author
assume no responsibility for errors or omissions, or for damages resulting from the use of the
information contained herein.
Contents
1 Introduction
1.1 Who this book is for
1.2 Data in this Book
1.3 Hints, Tables, and Images
2 Installation
2.1 Anaconda
2.2 Pip
2.3 Jupyter Overview
2.4 Summary
2.5 Exercises
3 Data Structures
3.1 Summary
3.2 Exercises
4 Series Introduction
4.1 The index abstraction
4.2 The pandas Series
4.3 The NaN value
4.4 Optional Integer Support for NaN
4.5 Similar to NumPy
4.6 Categorical Data
4.7 Summary
4.8 Exercises
6.6 Operator Methods
6.7 Chaining
6.8 Summary
6.9 Exercises
7 Aggregate Methods
7.1 Aggregations
7.2 Count and Mean of an Attribute
7.3 .agg and Aggregation Strings
7.4 Summary
7.5 Exercises
8 Conversion Methods
8.1 Automatic Conversion
8.2 Memory Usage
8.3 String and Category Types
8.4 Ordered Categories
8.5 Converting to Other Types
8.6 Summary
8.7 Exercises
9 Manipulation Methods
9.1 .apply and .where
9.2 If Else with Pandas
9.3 Missing Data
9.4 Filling In Missing Data
9.5 Interpolating Data
9.6 Clipping Data
9.7 Sorting Values
9.8 Sorting the Index
9.9 Dropping Duplicates
9.10 Ranking Data
9.11 Replacing Data
9.12 Binning Data
9.13 Summary
9.14 Exercises
10 Indexing Operations
10.1 Prepping the Data and Renaming the Index
10.2 Resetting the Index
10.3 The .loc Attribute
10.4 The .iloc Attribute
10.5 Heads and Tails
10.6 Sampling
10.7 Filtering Index Values
10.8 Reindexing
10.9 Summary
10.10 Exercises
11 String Manipulation
11.1 Strings and Objects
11.2 Categorical Strings
11.3 The .str Accessor
11.4 Searching
11.5 Splitting
11.6 Optimizing .apply with Cython
11.7 Replacing Text
11.8 Summary
11.9 Exercises
15 Categorical Manipulation
15.1 Categorical Data
15.2 Frequency Counts
15.3 Benefits of Categories
15.4 Conversion to Ordinal Categories
15.5 The .cat Accessor
15.6 Category Gotchas
15.7 Generalization
15.8 Summary
15.9 Exercises
16 Dataframes
16.1 Database and Spreadsheet Analogues
16.2 A Simple Python Version
16.3 Dataframes
16.4 Construction
16.5 Dataframe Axis
16.6 Summary
16.7 Exercises
20 Column Types, .assign, and Memory Usage
20.1 Conversion Methods
20.2 Memory Usage
20.3 Summary
20.4 Exercises
21.4 Exercises
27.8 Exercises
30.3 Transposing Data
30.4 Stacking & Unstacking
30.5 Stacking
30.6 Flattening Hierarchical Indexes and Columns
30.7 Summary
30.8 Exercises
32.5 Merge Validation
32.6 Joining Data Example
32.7 Dirty Devil Flow and Weather Data
32.8 Joining Data
32.9 Validating Joined Data
32.10 Visualization of Merged Data
32.11 Summary
32.12 Exercises
36 Summary
About the Author
Index
Foreword
Python is easy to learn. You can learn the basics in a day and be productive. With only an
understanding of Python, moving to pandas can be difficult or confusing. It borrows some ideas
from NumPy that are not common in the wider Python ecosystem. This book is meant to aid you
in mastering pandas.
I have taught Python and pandas to many people over the years, in large corporate
environments, small startups, and in Python and Data Science conferences. I have seen what trips
people up, and confuses them. With the correct background, an attitude of acceptance, and a deep
breath, much of this confusion evaporates.
Having said this, pandas is an excellent tool. Many use it around the world to great success. I
hope to empower you to do this as well.
Cheers!
Matt
Chapter 1
Introduction
I have been using Python in some professional capacity or another since the turn of the century.
One of the trends that I have seen in that time is the uptake of Python for various aspects of data
science: gathering data, cleaning data, analysis, machine learning, and visualization. The pandas
library has seen much uptake in this area.
pandas¹ is a data analysis library for Python that has exploded in popularity over the past years.
The website describes it like this:
My description of pandas is: pandas is an in-memory analysis tool, which has SQL-like
constructs, essential statistical and analytic support, as well as graphing capability. Because pandas
is built on top of Cython and NumPy, it has less memory overhead and runs quicker than pure
Python code. Many people use pandas to replace Excel, perform ETL (extract transform load
processing to move data from one place to another), process tabular data, load CSV or JSON files,
prep for machine learning, and more. Though it grew out of the financial sector (for time series
analysis), it is now a general-purpose data manipulation library.
With its NumPy lineage, pandas adopts some NumPy’isms that regular Python programmers
may not be aware of or familiar with. Yes, one could go out and use Cython to perform fast typed
data analysis with a Python-like dialect, but with pandas, you don’t need to. This work is done for
you. If you use pandas and the vectorized operations, you are getting close to C-level speeds for
numeric work but writing Python.
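To make the idea of vectorized operations concrete, here is a minimal sketch that computes the same total twice: once with an interpreted Python loop and once with a single vectorized expression. The data and the variable names (values, loop_total, vector_total) are made up for illustration and do not come from the book's datasets.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100,000 floating-point values.
values = pd.Series(np.arange(100_000, dtype="float64"))

# Pure-Python approach: the interpreter visits every element one at a time.
loop_total = 0.0
for v in values:
    loop_total += v * 2

# Vectorized approach: the multiplication and the sum each run over the
# whole array in compiled code, with no Python-level loop.
vector_total = (values * 2).sum()

# Both approaches arrive at the same answer.
print(np.isclose(loop_total, vector_total))
```

Timing the two versions (for example with %timeit in Jupyter) is a simple way to see the difference on your own machine.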
1.1 Who this book is for
This guide is intended to introduce pandas and patterns for best practices. If you work with tabular
data and need capabilities beyond Excel, this is for you. This book covers many (but not all) aspects
of the library, as well as some gotchas or details that may be counter-intuitive or even non-pythonic
to longtime users of Python.
This book assumes a basic knowledge of Python. The author has written Illustrated Guide to
Python 3 that provides all the background necessary.
1 pandas (http://pandas.pydata.org) refers to itself in lowercase, so this book will follow suit. When I’m referring
to specific code, I will set it in a monospace font.
1.2 Data in this Book
Every attempt has been made to use data that illustrates real-world pandas usage. As a visual
learner, I appreciate seeing where data is coming and going. As such, I try to shy away from just
showing tables of random numbers that have no meaning. I will show best practices gleaned from
years of using pandas.
I have selected a variety of datasets to show that the advice given in this book is applicable in
most situations you may encounter.
1.3 Hints, Tables, and Images
The hints, tables, and graphics found in this book have been collected over my years of using
pandas. They come from hang-ups, notes, and cheat sheets that I have developed after using
pandas and teaching others how to use the library.
In the physical version of this book, there is an index that has also been battle-tested during
development. Inevitably, when I was doing analysis for consulting or clients, I would check that
the index had the information I needed. If it didn’t, I added it.
If you enjoy this book, please consider writing a review on Amazon. That is one of the best
ways to thank an author.
Chapter 2
Installation
This book will use Python 3 throughout! Please do not use Python 2 unless you have a compelling
reason to. Python 3 is the future of the language, and the current pandas releases do not support
Python 2.
2.1 Anaconda
Note
This book shows commands run from the UNIX command prompt. They are prefixed by
the prompt $. Unless otherwise noted, these commands will run on the Windows command
prompt as well. Do not type the prompt. It is included to distinguish commands run via a
terminal or command prompt from Python code.
We can verify that this works by trying to import the pandas package:
$ python
>>> import pandas
>>> pandas.__version__
'1.3.2'
Note
The command above shows a Python prompt, >>>. Do not type the Python prompt. It is
included to make it easy to distinguish Python code from the output of Python code. For
example, the output of the above, '1.3.2' does not have the prompt in front of it. The book
also includes the secondary Python prompt, ... for code that is longer than a single line.
Note that Jupyter does not use the Python prompt in its cells.
2.2 Pip
If you aren’t using Anaconda, I recommend you use pip³ to install pandas. The pandas library will
install on Windows, Mac, and Linux via pip.
It may be necessary to prepare the operating system for building pandas from source by
installing dependencies and the proper header files for Python. On Ubuntu, this is straightforward;
other environments may be different:
$ sudo apt-get install build-essential python-all-dev
Using virtualenv⁴ will alleviate the need for superuser access during installation. Because
virtualenv uses pip, it can download and install newer releases of pandas if the version found
on the distribution is lagging.
On Mac and Linux platforms, the following commands create a virtualenv sandbox and install
the latest pandas in it (assuming that the prerequisite files are also installed):
$ python3 -m venv pandas-env
$ source pandas-env/bin/activate
(pandas-env)$ pip install pandas
Once you have pandas installed, confirm that you can import the library and check the version:
$ source pandas-env/bin/activate
(pandas-env)$ python
>>> import pandas
>>> pandas.__version__
'1.3.2'
On Windows, you will open a Command Prompt and run the following to create a virtual
environment:
> python -m venv pandas-env
> pandas-env\Scripts\activate
(pandas-env)> pip install pandas
Note
The Windows command prompt, >, is shown in the previous command. Do not type it. Only
type the commands following the prompt.
2 https://anaconda.com/downloads
3 http://pip-installer.org/
4 http://www.virtualenv.org
2.3 Jupyter Overview
I recommend you use Jupyter (or a program that connects to it) as a data exploration tool. I use
Jupyter classic, though there are other options: JupyterLab, connecting to Jupyter via PyCharm,
VSCode, Emacs, as well as Google Colab. Jupyter classic will give you basic functionality and is
included in many cloud environments.
Jupyter notebook is an environment for combining interactive coding and text in a web browser.
This allows us to easily share code and narrative around that code. An example that was popular
in the scientific community was the discovery of gravitational waves.⁵
The name Jupyter is a rebranding of an open-source project previously known as IPython
Notebook. The rebranding was to emphasize that although the backend is written in Python,
Jupyter supports various kernels to run other languages, including Julia (the “Ju” portion), Python
(“pyt”), and R (“er”), all popular data science programming languages.
The architecture of Jupyter includes a server running various kernels. Using a notebook we can
interact with a kernel. Typically we use a web browser to do this, but other interfaces exist, such
as an emacs mode (ein), PyCharm, or VSCode.
To install Jupyter, type:
$ pip install notebook
Once Jupyter is installed, launch it with this command:
$ jupyter notebook
Then navigate to http://localhost:8888 and you should be presented with the Jupyter home
page.
Click on the dropdown button on the right that says ”New” and select Python 3.
At this point, you are presented with a notebook with an empty cell. Jupyter is a modal
environment. There are two modes, command mode and edit mode. Command mode is for
creating and manipulating cells. Edit mode is for changing what is inside of a single cell.
5 https://losc.ligo.org/s/events/GW150914/GW150914_tutorial.html
There are many commands for both modes. If you are in command mode (and you will know
that because the box around the cell is blue), you can type “h”, and it will bring up a pop-up with
the keyboard shortcuts for both command and edit mode. Don’t worry about memorizing all of
them. Here are the commands you will be using most of the time in command mode:
• h - Bring up help (ESC to dismiss)
• a - Create cell above
• b - Create cell below
• x - Cut cell
• c - Copy cell
• v - Paste cell below
• Enter - Go into Edit Mode
• m - Change cell type to Markdown
• y - Change cell type to code
• ii - Interrupt kernel
• 00 - Restart kernel
• Ctrl-Enter - Execute cell
When you click on a cell or type Enter, you go into edit mode. You will see that the outline turns
green if you are in edit mode. In edit mode, you have basic editing functionality. A few keys to
know:
• Ctrl-Enter - Run cell (execute Python code, render Markdown)
• ESC - Go back to command mode
• TAB - Tab completion
• Shift-TAB - Bring up tooltip (ESC to dismiss)
2.4 Summary
In this chapter, we saw how to set up a Python environment using Anaconda or Pip. We also
introduced the Jupyter notebook. I recommend that you get comfortable with Jupyter. Not only is
it free and open-source, but many large cloud providers also offer Jupyter in their environments.
2.5 Exercises
Chapter 3
Data Structures
One of the keys to understanding pandas is to understand the data model. At the core of pandas
are two data structures. The most widely used data structures are the Series and the DataFrame for
dealing with array data and tabular data. This table shows their analogs in the spreadsheet and
database world.
Data Structure  Dimensionality  Spreadsheet Analog  Database Analog  Linear Algebra
Series          1D              Column              Column           Column Vector
… objects, it is imperative that we have a comprehensive study of the Series first. Additionally (and
perhaps odd to some), we will see this when we iterate over rows, and the rows are represented as
Series (however, if you find yourself consistently dealing with rows instead of columns, you are
probably not using pandas in an optimal way).
Some have compared the data structures to Python lists or dictionaries, and I think this is a
stretch that doesn’t provide much benefit. Mapping the list and dictionary methods on top of
pandas’ data structures just leads to confusion.
3.1 Summary
The pandas library includes two main data structures and associated functions for manipulating
them. This book will focus on the Series and DataFrame. First, we will look at the Series as the
DataFrame can be considered a collection of columns represented as Series objects.
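The relationship between the two structures can be checked directly. The following is a small sketch with a made-up two-column DataFrame (the column names and values are illustrative only): pulling out a column or a single row both yield a Series.

```python
import pandas as pd

# Hypothetical two-column table, used only for illustration.
df = pd.DataFrame({'name': ['Paul', 'John'], 'songs': [145, 142]})

column = df['songs']   # selecting one column gives a Series
row = df.iloc[0]       # selecting one row by position also gives a Series

print(type(column).__name__)
print(type(row).__name__)
```

Note that the row Series mixes a string and an integer, which is one reason column-oriented work tends to be the better fit for pandas.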
Figure 3.2: Figure showing the relation between the main data structures in pandas. Namely, that a
dataframe can have one or many series.
3.2 Exercises
Chapter 4
Series Introduction
A Series is used to model one-dimensional data. The Series object also has a few more bits of data,
including an index and a name. A common idea through pandas is the notion of an axis. Because
a series is one-dimensional, it has a single axis: the index.
Below is a table of counts of songs artists composed. We will use this to explore the series:
Artist Data
0 145
1 142
2 38
3 13
If you wanted to represent this data in pure Python, you could use a data structure similar to
the one that follows. The dictionary, series, has a list of the data points stored under the 'data'
key. In addition to an entry in the dictionary for the actual data, there is an explicit entry for the
corresponding index values for the data (in the 'index' key), as well as an entry for the name of the
data (in the 'name' key):
>>> series = {
...     'index':[0, 1, 2, 3],
...     'data':[145, 142, 38, 13],
...     'name':'songs'
... }
The get function defined below can pull items out of this data structure based on the index:
>>> def get(series, idx):
...     value_idx = series['index'].index(idx)
...     return series['data'][value_idx]
>>> get(series, 1)
142
Note
The code samples in this book are shown as if they were typed directly into an interpreter.
Lines starting with >>> and ... are interpreter markers for the input prompt and continuation
prompt respectively. Lines that are not prefixed by one of those sequences are the output from
the interpreter after running the code.
In Jupyter (and IPython) you do not see the prompts. I include them to help distinguish
between code and output.
The Python interpreter will print the return value of the last invocation (even if the print
statement is missing) automatically. If you desire to use the code samples found in this book,
leave the interpreter prompts out.
This double abstraction of the index seems unnecessary at first glance, since a list already has integer
indexes. But there is a trick up pandas’ sleeves. By allowing non-integer values, the data structure
supports other index types such as strings, dates, as well as arbitrarily ordered indices, or even
duplicate index values.
Below is an example that has string values for the index:
>>> songs = {
...     'index':['Paul', 'John', 'George', 'Ringo'],
...     'data':[145, 142, 38, 13],
...     'name':'counts'
... }
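As a sketch of why the explicit index is useful, the get function from earlier works unchanged with string labels. The get_all helper and the dupes dictionary below are hypothetical additions, not from the book; they show one way this pure-Python model could cope with duplicate index values.

```python
# Pure-Python stand-in for a series, using the 'index'/'data'/'name' layout.
songs = {
    'index': ['Paul', 'John', 'George', 'Ringo'],
    'data': [145, 142, 38, 13],
    'name': 'counts',
}

def get(series, idx):
    """Return the data point stored under the label idx (first match)."""
    value_idx = series['index'].index(idx)  # position of the label
    return series['data'][value_idx]

def get_all(series, idx):
    """Hypothetical helper: return every value whose label equals idx."""
    return [val for label, val in zip(series['index'], series['data'])
            if label == idx]

# Made-up example with a repeated label.
dupes = {'index': ['a', 'a', 'b'], 'data': [1, 2, 3], 'name': 'dupes'}

print(get(songs, 'John'))
print(get_all(dupes, 'a'))
```

The list .index method only finds the first matching label, which is why a separate helper is needed once labels repeat.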
4.2 The pandas Series
With that background in mind, let’s look at how to create a Series in pandas. It is easy to create a
Series object from a list:
>>> import pandas as pd
>>> songs2 = pd.Series([145, 142, 38, 13],
...     name='counts')
>>> songs2
0 145
1 142
2 38
3 13
Name: counts, dtype: int64
When the interpreter prints our series, pandas makes a best effort to format it for the current
terminal size. The series is one-dimensional. However, this looks like it is two-dimensional. The
leftmost column is the index, which contains entries for the index. The index is not part of the
values. The generic name for an index is an axis, and the values of the index (0, 1, 2, 3) are
called axis labels. The data (145, 142, 38, and 13) is also called the values of the series. The
two-dimensional structure in pandas, a DataFrame, has two axes, one for the rows and another for
the columns.
The rightmost column in the output contains the values of the series: 145, 142, 38, and 13. In
this case, they are integers (the console representation says dtype: int64, dtype meaning data type,
and int64 meaning 64-bit integer), but in general, the values of a Series can hold strings, floats,
booleans, or arbitrary Python objects. To get the best speed (and to leverage vectorized operations),
the values should be of the same type, though this is not required.
It is easy to inspect the index of a series (or data frame), as it is an attribute of the object:
>>> songs2.index
RangeIndex(start=0, stop=4, step=1)
The default values for an index are monotonically increasing integers. songs2 has an integer-
based index.
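The pieces of a series described in this section can each be pulled out separately. This is a brief sketch that rebuilds songs2 and inspects its axis labels, values, name, and number of dimensions.

```python
import pandas as pd

songs2 = pd.Series([145, 142, 38, 13], name='counts')

labels = list(songs2.index)   # the axis labels of the default index
values = songs2.to_numpy()    # the values as a NumPy array

print(labels)
print(values)
print(songs2.name)
print(songs2.ndim)   # number of axes
```

Even though the printed series looks like a two-column table, .ndim confirms there is only one axis.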
Note
The index can be string-based as well, in which case pandas indicates that the datatype for the
index is object (not string):
>>> songs3 = pd.Series([145, 142, 38, 13],
...     name='counts',
...     index=['Paul', 'John', 'George', 'Ringo'])
Note that the dtype that we see when we print a Series is the type of the values, not the
index. Even though this looks two-dimensional, remember that the index is not part of the
values:
>>> songs3
Paul 145
John 142
George 38
Ringo 13
Name: counts, dtype: int64
When we inspect the index attribute, we see that the dtype is object:
>>> songs3.index
Index(['Paul', 'John', 'George', 'Ringo'], dtype='object')
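A string-based index lets you look up values by label. The short sketch below rebuilds songs3 and uses .loc for a label lookup; .loc is covered in detail later in the book, so treat this as a preview rather than the full story.

```python
import pandas as pd

songs3 = pd.Series([145, 142, 38, 13],
                   name='counts',
                   index=['Paul', 'John', 'George', 'Ringo'])

# Look up a value by its index label.
john_count = songs3.loc['John']

# The dtype of the series describes the values, not the index.
print(john_count)
print(songs3.dtype)
```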
The actual data (or values) for a series does not have to be numeric or homogeneous. We can insert
Python objects into a series:
>>> class Foo:
...     pass
>>> ringo = pd.Series(
...     ['Richard', 'Starkey', 13, Foo()],
...     name='ringo')
>>> ringo
0 Richard
1 Starkey
2 13
3 <__main__.Foo instance at 0x...>
Name: ringo, dtype: object
In the above case, the dtype (datatype) of the Series is object (meaning a Python object). This can
be good or bad.
The object data type is also used for a series with string values. In addition, it is also used
for values that have heterogeneous or mixed types. If you have just numeric data in a series, you
wouldn’t want it stored as a Python object, but rather as an int64 or float64, which allow you to do
vectorized numeric operations.
If you have time data and it says it has the object type, you probably have strings for the dates.
Using strings instead of date types is bad as you don’t get the date operations that you would get
if the type were datetime64[ns]. A series with string data, on the other hand, has the type of object.
Don’t worry; we will see how to convert types later in the book.
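To make the dtype discussion concrete, here is a small sketch with made-up values (the variable names are assumptions for illustration). It shows a mixed series falling back to object, a numeric series keeping an integer type, and date strings gaining date operations only after conversion with pd.to_datetime.

```python
import pandas as pd

mixed = pd.Series(['Richard', 'Starkey', 13])   # heterogeneous values
numbers = pd.Series([145, 142, 38, 13])         # homogeneous integers
print(mixed.dtype, numbers.dtype)

# Dates stored as strings offer no date operations until they are converted.
date_strings = pd.Series(['2021-01-15', '2021-02-20'])
dates = pd.to_datetime(date_strings)
print(dates.dtype)
print(dates.dt.month.tolist())
```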
Note
One thing to note is that the type of this series is float64, not int64! The type is a float because
float64 supports NaN, which int64 does not. When pandas sees numeric data (2) as well as the
np.nan, it coerces the 2 to a float value.
Below is an example of how pandas ignores NaN. The .count method, which counts the number of
values in a series, disregards NaN. In this case, it indicates that the count of items in the series is one,
one for the value of 2 at index location Ono, ignoring the NaN value at index location Clapton:
>>> nan_series.count()
1
You can inspect the number of entries (including missing values) with the .size property:
>>> nan_series.size
2
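The behavior described in this section can be reproduced with a small stand-in series. This is only a sketch: nan_demo is a hypothetical series built to match the labels and values mentioned in the text (2 at Ono, a missing value at Clapton), not necessarily the book's exact definition.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: one numeric value and one missing value.
nan_demo = pd.Series([2, np.nan], index=['Ono', 'Clapton'])

print(nan_demo.dtype)     # the integer 2 is stored as a float
print(nan_demo.count())   # counts non-missing entries only
print(nan_demo.size)      # counts every entry, missing or not
```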
Note
If you load data from a CSV file, an empty value for an otherwise numeric column will become
NaN. Later, we will see how methods such as .fillna and .dropna deal with NaN.
None, NaN, nan, <NA>, and null are synonyms in this book when referring to empty or missing data
found in a pandas series or dataframe.
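As an illustration of the note about CSV files, the sketch below reads a tiny in-memory CSV with one empty numeric cell. The data and the names csv_text and beatles are invented for the example.

```python
import io
import pandas as pd

# The second data row has an empty value in the numeric column.
csv_text = "name,count\nPaul,145\nJohn,\nGeorge,38\n"
beatles = pd.read_csv(io.StringIO(csv_text))
print(beatles['count'])

# Two common ways to handle the missing entry.
filled = beatles['count'].fillna(0)   # replace NaN with a value
dropped = beatles['count'].dropna()   # discard the NaN row
```

Because of the single empty cell, the whole column is read as float64 rather than int64.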
4.4 Optional Integer Support for NaN
Note
You can use the .astype method to convert columns to the nullable integer type. Just use the
string 'Int64' as the type:
>>> nan_series.astype('Int64')
Ono           2
Clapton    <NA>
dtype: Int64
I generally ignore 'Int64' as I tend to clean up missing data. Also, when you ingest data in pandas,
most functions use 'int64' (in lowercase) by default.
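A minimal sketch of the nullable integer type follows. The series floats is a hypothetical stand-in with the same shape as the example above; the point is that the capitalized 'Int64' keeps integers while still marking the missing entry.

```python
import numpy as np
import pandas as pd

floats = pd.Series([2, np.nan], index=['Ono', 'Clapton'])  # float64 because of NaN
nullable = floats.astype('Int64')                          # capital "I": nullable integer

print(nullable.dtype)
print(nullable.isna().tolist())
total = nullable.sum()   # missing values are skipped by default
```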
4.5 Similar to NumPy
>>> songs3.mean()
84.5
>>> numpy_ser.mean()
84.5
They also both have a notion of a boolean array. A boolean array is a series with the same index
as the series you are working with that has boolean values, and it can be used as a mask to filter
out items. Normal Python lists do not support such fancy index operations, like sticking a list into
an index operation.
In this example, we will make a mask:
>>> mask = songs3 > songs3.median()  # boolean array
>>> mask
Paul True
John True
George False
Ringo False
Name: counts, dtype: bool
Once we have a mask, we can use that as a filter. We just need to pass the mask into an index
operation. If the mask has a True value for a given index, the value is kept. Otherwise, the value is
dropped. The mask above represents the locations that have a value higher than the median value
of the series.
>>> songs3[mask]
Paul    145
John    142
Name: counts, dtype: int64
NumPy also has filtering by boolean arrays, but lacks the .median method on an array. Instead,
NumPy provides a median function in the NumPy namespace. The equivalent version in NumPy
looks like this:
>>> numpy_ser[numpy_ser > np.median(numpy_ser)]
array([145, 142])
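The comparison between pandas, NumPy, and plain lists can be sketched side by side. The variable names here are illustrative; the values are the same song counts used in this chapter.

```python
import numpy as np
import pandas as pd

counts = pd.Series([145, 142, 38, 13],
                   index=['Paul', 'John', 'George', 'Ringo'], name='counts')
counts_arr = counts.to_numpy()

# pandas: the median is a method on the series.
pandas_result = counts[counts > counts.median()]
# NumPy: the median is a function in the numpy namespace.
numpy_result = counts_arr[counts_arr > np.median(counts_arr)]

# A plain Python list rejects a list of booleans as an index.
plain = [145, 142, 38, 13]
bool_list = [True, True, False, False]
list_error = None
try:
    plain[bool_list]
except TypeError as err:
    list_error = err
```

Both filtered results contain the two values above the median, while the list lookup raises a TypeError.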
Note
Both NumPy and pandas have adopted the convention of using import statements in
combination with an as statement to rename their imports to two letter acronyms. This is called
aliasing:
>>> import pandas as pd
>>> import numpy as np
Renaming imports provides a slight typing benefit (four fewer characters) while still
allowing the user to be explicit with their namespaces.
Be careful, as you may see the following cast about in code samples, blogs, or
documentation:
>>> from pandas import *
Though you see star imports frequently used in examples online, I would advise not to use
star imports. I never use them in my book examples or code that I write for clients. They have
the potential to clobber items in your namespace and make tracing the source of a definition
more difficult (especially if you have multiple star imports). As the Zen of Python states,
“Explicit is better than implicit.”⁶
4.6 Categorical Data
Categories are not limited to strings; we can also convert numbers or datetime values to categorical
data.
To create a category, we pass dtype="category" into the Series constructor. Alternatively, we can
call the .astype("category") method on a series:
6 Type import this into an interpreter to see the Zen of Python, or search for “PEP 20”.
>>> s = pd.Series(['m', 'l', 'xs', 's', 'xl'], dtype='category')
>>> s
0 m
1 l
2 xs
3 s
4 xl
dtype: category
Categories (5, object): ['l', 'm', 's', 'xl', 'xs']
If this series represents the size, there is a natural ordering as a small is less than a medium. By
default, categories don’t have an ordering. We can verify this by inspecting the .cat attribute that
has various properties:
>>> s.cat.ordered
False
To convert a non-categorical series to an ordered category, we can create a type with the
CategoricalDtype constructor and the appropriate parameters. Then we pass this type into the
.astype method:
>>> s2 = pd.Series(['m', 'l', 'xs', 's', 'xl'])
>>> size_type = pd.api.types.CategoricalDtype(
...     categories=['s', 'm', 'l'], ordered=True)
>>> s3 = s2.astype(size_type)
>>> s3
0 m
1 l
2 NaN
3 s
4 NaN
dtype: category
Categories (3, object): ['s' < 'm' < 'l']
In this case, we limited the categories to just 's', 'm', and 'l', but the data had values that were
not in those categories. Converting the data to a category type replaces those extra values with NaN.
If we have ordered categories, we can do comparisons on them:
>>> s3 > 's'
0 True
1 True
2 False
3 False
4 False
dtype: bool
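The ordered-category behavior can be checked with a short sketch. The variable names are illustrative. It also shows, as a contrast, that an unordered categorical rejects the same comparison.

```python
import pandas as pd

sizes = pd.Series(['m', 'l', 'xs', 's', 'xl'])
size_type = pd.api.types.CategoricalDtype(
    categories=['s', 'm', 'l'], ordered=True)
ordered_sizes = sizes.astype(size_type)

missing = ordered_sizes.isna()             # 'xs' and 'xl' are not listed categories
bigger_than_small = ordered_sizes > 's'    # missing entries compare as False
ranked = ordered_sizes.dropna().sort_values()  # sorts by category order, not alphabetically

# Without an ordering, inequality comparisons are refused.
unordered = sizes.astype('category')
unordered_error = None
try:
    unordered > 's'
except TypeError as err:
    unordered_error = err
```

Sorting follows the declared order ('s', then 'm', then 'l') rather than alphabetical order.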
The prior example created a new Series from existing data that was not categorical. We can also
add ordering information to categorical data. We just need to make sure that we specify all of the
members of the category or pandas will throw a ValueError:
>>> s.cat.reorder_categories(['xs','s','m','l', 'xl'],
...                          ordered=True)
0     m
1     l
2    xs
3     s
4    xl
dtype: category
Categories (5, object): ['xs' < 's' < 'm' < 'l' < 'xl']
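The requirement to list every category can be seen in a brief sketch (variable names are illustrative). Leaving out 'xs' and 'xl' raises a ValueError, while the full list succeeds and enables comparisons.

```python
import pandas as pd

sizes = pd.Series(['m', 'l', 'xs', 's', 'xl'], dtype='category')

# Omitting existing categories is rejected.
reorder_error = None
try:
    sizes.cat.reorder_categories(['s', 'm', 'l'], ordered=True)
except ValueError as err:
    reorder_error = err

# Listing every category works and attaches the ordering.
full = sizes.cat.reorder_categories(['xs', 's', 'm', 'l', 'xl'], ordered=True)
smaller_than_medium = full < 'm'
```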
Note
String and datetime series have a str and dt attribute that allow us to perform common
operations specific to that type. If we convert these types to categorical types, we can still
use the str or dt attributes on them:
>>> s3.str.upper()
0 M
1 L
2 NaN
3 S
4 NaN
dtype: object
Method                                      Description
pd.Series(data=None, index=None,            Create a series from data (sequence, dictionary, or
    dtype=None, name=None, copy=False)      scalar).
s.index                                     Access index of series.
s.astype(dtype, errors='raise')             Cast a series to dtype. To ignore errors (and return
                                            original object) use errors='ignore'.
s[boolean_array]                            Return values from s where boolean_array is True.
s.cat.ordered                               Determine if a categorical series is ordered.
s.cat.reorder_categories(new_categories,    Reorder the categories of the series (optionally marking
    ordered=False)                          them as ordered). new_categories must include all
                                            categories.
Table 4.1: Series Overview Attributes and Methods
4.7 Summary
4.8 Exercises
1. Using Jupyter, create a series with the temperature values for the last seven days. Filter out
the values below the mean.
2. Using Jupyter, create a series with your favorite colors. Use a categorical type.
Chapter 5
Series Deep Dive
There are many operations you can do with a Series. In this chapter, we will introduce many of
them.
We will pull data from the US Fuel Economy website⁷. This site has data on the efficiency of
makes and models of cars sold in the US since 1984.
5.1 Loading the Data
I have a copy of this data in my GitHub repository. One of the nice features of pandas is that the
read_csv function can accept not only URLs but also ZIP files. Because this ZIP file contains only
a single file, we can use this function. If it was a ZIP file with multiple files, we would need to
decompress the data to pull out the file we were interested in.
The first columns in the dataset we will investigate are city08 and highway08, which provide
information on miles per gallon usage while driving around in the city and highway respectively:
>>> import pandas as pd
>>> url = 'https://github.com/mattharrison/datasets/raw/master/data/' \
...     'vehicles.csv.zip'
>>> df = pd.read_csv(url)
>>> city_mpg = df.city08
>>> highway_mpg = df.highway08
7 https://www.fueleconomy.gov/feg/download.shtml
Figure 5.1: Jupyter will pop up a list of options for completions when you hit TAB following a period.
>>> highway_mpg
0 25
1 14
2 33
3 12
4 23
..
41139 26
41140 28
41141 24
41142 24
41143 21
Name: highway08, Length: 41144, dtype: int64
It looks like each series has around 40,000 integer entries. Because the type of this series is int64,
we know that none of the values are missing.
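For readers without network access, the ZIP-reading behavior can be tried offline. The sketch below is an assumption-laden stand-in: it writes a tiny, made-up CSV into a temporary single-file ZIP archive (the file name vehicles_sample.csv.zip and the five rows are invented) and reads it back with read_csv.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Invented sample rows that use the two column names discussed above.
csv_text = "city08,highway08\n19,25\n9,14\n23,33\n10,12\n17,23\n"

with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "vehicles_sample.csv.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("vehicles_sample.csv", csv_text)
    # read_csv infers ZIP compression from the file extension.
    df_sample = pd.read_csv(zip_path)

highway_sample = df_sample.highway08
print(highway_sample.dtype)
print(highway_sample.isna().sum())   # an int64 column holds no missing values
```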
5.2 Series Attributes
The pandas library provides a lot of functionality. The built-in dir function will list the attributes
of an object. Let’s examine how many attributes there are on a series:
>>> len(dir(city_mpg))
457
Wow! There are over 400 attributes on a series. In contrast, a Python list or dictionary has
around 40 attributes. Do not fret; you will not need to memorize all of these if you get comfortable
with a tool like Jupyter. If you have a Series object, you can hit TAB after a period, and it will pop
up a list of completions. (Other tools are also able to do this for Python objects).
What functionality do all of these attributes provide? Here is a summary. There are many ways
to categorize these, and I’m roughly going to do it by what the result of the method is:
• Dunder methods (.__add__, .__iter__, etc) provide many numeric operations, looping,
attribute access, and index access. For the numeric operations, these return Series.
• Corresponding operator methods for many of the numeric operations allow us to tweak the
behavior (there is an .add method in addition to .__add__).
• Aggregate methods and properties which reduce or aggregate the values in a series down to
a single scalar value. The .mean, .max, and .sum methods and .is_monotonic property are all
examples.
• Conversion methods. Some of these start with .to_ and export the data to other formats.
• Manipulation methods such as .sort_values, .drop_duplicates, that return Series objects with
the same index.
• Indexing and accessor methods and attributes such as .loc and .iloc. These return Series or
scalars.
• String manipulation methods using .str.
• Date manipulation methods using .dt.
• Plotting methods using .plot.
• Categorical manipulation methods using .cat.
• Transformation methods such as .unstack and .reset_index, .agg, .transform.
• Attributes such as .index and .dtype.
• A bunch of private attributes that we will ignore (around 130 of them).
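One way to explore these groups yourself is to bucket the names returned by dir. This is only a sketch; the bucketing rules (double underscores for dunder, a single leading underscore for private) are a simplification chosen for the example, and the exact counts depend on the pandas version installed.

```python
import pandas as pd

ser = pd.Series([1, 2, 3])
names = dir(ser)

# Split the attribute names into three rough groups.
dunder = [n for n in names if n.startswith('__') and n.endswith('__')]
private = [n for n in names if n.startswith('_') and n not in dunder]
public = [n for n in names if not n.startswith('_')]

print(len(names), len(dunder), len(private), len(public))
print(len(dir([])), len(dir({})))   # list and dict, for comparison
```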
5.3 Summary
In this chapter, we introduced the notion that pandas objects have a large number of attributes and
methods. Do not let this overwhelm you. You don’t need to memorize all of the methods.
5.4 Exercises
6.1 Introduction
This chapter will review some of the operators and magic or dunder methods found in series. In
short, these are the protocols that determine how the Python language reacts to operations. For
example, when you use the + operation, Python is dispatching to the .__add__ method. When you
use a loop with a for statement, Python dispatches to the .__iter__ method.
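The dispatch described above can be seen with a toy class. Playlist is a hypothetical container invented for this sketch; it exists only to show that + calls .__add__ and that a for loop calls .__iter__.

```python
class Playlist:
    """Hypothetical container used only to show operator dispatch."""

    def __init__(self, counts):
        self.counts = list(counts)

    def __add__(self, other):
        # Called when Python evaluates `self + other`.
        return Playlist(a + b for a, b in zip(self.counts, other.counts))

    def __iter__(self):
        # Called when Python loops over the object.
        return iter(self.counts)


early = Playlist([1, 2, 3])
late = Playlist([10, 20, 30])

combined = early + late               # dispatches to Playlist.__add__
totals = [c for c in combined]        # dispatches to Playlist.__iter__
same_as_plus = (1).__add__(2)         # built-in integers follow the same protocol
```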
This will not be a deep treatise on the dunder methods (double underscore methods) or magic
methods.
Let’s look at how this works with a pandas series.
6.2 Dunder Methods
A Python integer object that has a .__add__ method responds to the + operation. Because a Series
object has this method, you can call + on it. There is also a .__truediv__ method (.__div__ in
Python 2) that supports division.
One way to calculate the average of the two series is the following:
>>> (city_mpg + highway_mpg)/2
0 22.0
1 11.5
2 28.0
3 11.0
4 20.0
         ...
41139    22.5
41140    24.0
41141    21.0
41142    21.0
41143    18.5
Length: 41144, dtype: float64
Note that the type of the result is float64.
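As a small-scale sketch of the same calculation, the following uses five made-up rows in place of the full dataset (the names city and highway are illustrative). It shows that the operator form, the explicit dunder calls, and the .add and .div methods agree.

```python
import pandas as pd

# Invented sample values standing in for the city and highway series.
city = pd.Series([19, 9, 23, 10, 17])
highway = pd.Series([25, 14, 33, 12, 23])

by_operator = (city + highway) / 2
by_dunder = city.__add__(highway).__truediv__(2)   # what the operators dispatch to
by_method = city.add(highway).div(2)               # operator methods

print(by_operator.dtype)   # division of integers produces floats
```

The method forms also accept extra parameters, such as fill_value on .add, which the bare operators cannot take.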