.. _cookbook:
{{ header }}
********
Cookbook
********
This is a repository for *short and sweet* examples and links for useful pandas
recipes.
We encourage users to add to this documentation.
Adding interesting links and/or inline examples to this section is a great *First
Pull Request*.
Simplified, condensed, new-user friendly, in-line examples have been inserted where
possible to augment the Stack-Overflow and GitHub links. Many of the links contain
expanded information, above what the in-line examples offer.
pandas (``pd``) and NumPy (``np``) are the only two abbreviated imported modules.
The rest are kept explicitly imported for newer users.
.. _cookbook.idioms:

Idioms
------
.. ipython:: python
df = pd.DataFrame(
{"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}
)
df
if-then...
**********
Use pandas ``where`` after you've set up a mask:

.. ipython:: python
df_mask = pd.DataFrame(
{"AAA": [True] * 4, "BBB": [False] * 4, "CCC": [True, False] * 2}
)
df.where(df_mask, -1000)
An if-then-else using NumPy's ``where()``:

.. ipython:: python
df = pd.DataFrame(
{"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}
)
df
df["logic"] = np.where(df["AAA"] > 5, "high", "low")
df
Splitting
*********
Split a frame with a boolean criterion:

.. ipython:: python
df = pd.DataFrame(
{"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}
)
df
df[df.AAA <= 5]
df[df.AAA > 5]
Building criteria
*****************
.. ipython:: python
df = pd.DataFrame(
{"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}
)
df
Select rows with data closest to a certain value using ``argsort``:

.. ipython:: python
df = pd.DataFrame(
{"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}
)
df
aValue = 43.0
df.loc[(df.CCC - aValue).abs().argsort()]
.. ipython:: python
df = pd.DataFrame(
{"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}
)
df
Dynamically reduce a list of criteria using ``functools.reduce``:

.. ipython:: python

   import functools

   # example criteria; any list of boolean Series works here
   Crit1 = df.AAA <= 5.5
   Crit2 = df.BBB == 10.0
   Crit3 = df.CCC > -40.0
   CritList = [Crit1, Crit2, Crit3]
   AllCrit = functools.reduce(lambda x, y: x & y, CritList)
   df[AllCrit]
.. _cookbook.selection:
Selection
---------
Dataframes
**********
.. ipython:: python
df = pd.DataFrame(
{"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}
)
df
Use ``loc`` for label-oriented slicing and ``iloc`` for positional slicing :issue:`2904`
.. ipython:: python
df = pd.DataFrame(
{"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]},
index=["foo", "bar", "boo", "kar"],
)
.. ipython:: python
df.iloc[0:3] # Positional
df.loc["bar":"kar"] # Label
# Generic
df[0:3]
df["bar":"kar"]
Ambiguity arises when an index consists of integers with a non-zero start or
non-unit increment.
.. ipython:: python
data = {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -
50]}
df2 = pd.DataFrame(data=data, index=[1, 2, 3, 4]) # Note index starts at 1.
df2.iloc[1:3] # Position-oriented
df2.loc[1:3] # Label-oriented
.. ipython:: python
df = pd.DataFrame(
{"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}
)
df
New columns
***********
Efficiently and dynamically creating new columns using ``DataFrame.map``:

.. ipython:: python

   df = pd.DataFrame({"AAA": [1, 2, 1, 3], "BBB": [1, 1, 2, 2]})
   df
   # the inputs the element-wise mapping needs
   source_cols = df.columns.copy()
   new_cols = [str(x) + "_cat" for x in source_cols]
   categories = {1: "Alpha", 2: "Beta", 3: "Charlie"}
   df[new_cols] = df[source_cols].map(categories.get)
   df
.. ipython:: python
df = pd.DataFrame(
{"AAA": [1, 1, 1, 2, 2, 2, 3, 3], "BBB": [2, 1, 3, 4, 5, 1, 2, 3]}
)
df
Keep other columns when using ``min()`` with ``groupby``:

.. ipython:: python
df.loc[df.groupby("AAA")["BBB"].idxmin()]
df.sort_values(by="BBB").groupby("AAA", as_index=False).first()
.. _cookbook.multi_index:
Multiindexing
-------------
.. ipython:: python
df = pd.DataFrame(
{
"row": [0, 1, 2],
"One_X": [1.1, 1.1, 1.1],
"One_Y": [1.2, 1.2, 1.2],
"Two_X": [1.11, 1.11, 1.11],
"Two_Y": [1.22, 1.22, 1.22],
}
)
df
# As Labelled Index
df = df.set_index("row")
df
# With Hierarchical Columns
df.columns = pd.MultiIndex.from_tuples([tuple(c.split("_")) for c in df.columns])
df
# Now stack & Reset
df = df.stack(0, future_stack=True).reset_index(1)
df
# And fix the labels (Notice the label 'level_1' got added automatically)
df.columns = ["Sample", "All_X", "All_Y"]
df
Arithmetic
**********
Performing arithmetic with a MultiIndex that needs broadcasting:

.. ipython:: python
cols = pd.MultiIndex.from_tuples(
[(x, y) for x in ["A", "B", "C"] for y in ["O", "I"]]
)
df = pd.DataFrame(np.random.randn(2, 6), index=["n", "m"], columns=cols)
df
df = df.div(df["C"], level=1)
df
Slicing
*******
.. ipython:: python
coords = [("AA", "one"), ("AA", "six"), ("BB", "one"), ("BB", "two"), ("BB", "six")]
index = pd.MultiIndex.from_tuples(coords)
df = pd.DataFrame([11, 22, 33, 44, 55], index, ["MyData"])
df
To take the cross section of the 1st level and 1st axis of the index:

.. ipython:: python

   # note: level and axis are optional, and default to zero
   df.xs("BB", level=0, axis=0)

...and now the 2nd level of the 1st axis:

.. ipython:: python

   df.xs("six", level=1, axis=0)

Slicing a MultiIndex with ``xs``, method #2, using a ``slice(None)`` selector
(the student/course frame below is a reconstructed setup):

.. ipython:: python

   import itertools

   rows = list(itertools.product(["Ada", "Quinn", "Violet"], ["Comp", "Math", "Sci"]))
   headr = list(itertools.product(["Exams", "Labs"], ["I", "II"]))
   indx = pd.MultiIndex.from_tuples(rows, names=["Student", "Course"])
   cols = pd.MultiIndex.from_tuples(headr)  # notice these are un-named
   data = [[70 + x + y + (x * y) % 3 for x in range(4)] for y in range(9)]
   df = pd.DataFrame(data, indx, cols)
   df

   All = slice(None)
   df.loc["Violet"]
   df.loc[(All, "Math"), All]
   df.loc[(slice("Ada", "Quinn"), "Math"), All]
   df.loc[(All, "Math"), ("Exams")]
   df.loc[(All, "Math"), (All, "II")]
Sorting
*******
Sort by specific column or an ordered list of columns, with a MultiIndex:

.. ipython:: python

   df.sort_values(by=("Labs", "II"), ascending=False)
Levels
******
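
A common task at this level is flattening hierarchical columns; a minimal
sketch (with invented data):

.. ipython:: python

   cols = pd.MultiIndex.from_tuples([("One", "X"), ("One", "Y"), ("Two", "X")])
   dfl = pd.DataFrame(np.random.randn(2, 3), columns=cols)
   # join the per-level labels into flat strings
   dfl.columns = ["_".join(c) for c in dfl.columns]
   dfl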
.. _cookbook.missing_data:
Missing data
------------
.. ipython:: python
df = pd.DataFrame(
np.random.randn(6, 1),
index=pd.date_range("2013-08-01", periods=6, freq="B"),
columns=list("A"),
)
df.loc[df.index[3], "A"] = np.nan
df
df.bfill()
Replace
*******
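The usual tool here is ``Series.replace``; a minimal sketch (values invented),
including a regex replacement with a backreference:

.. ipython:: python

   s = pd.Series(["six", "7", "six", "7", "six"])
   s.replace({"six": 6, "7": 7})
   # regex=True treats dict keys as patterns; \1 and \2 are captured groups
   s2 = pd.Series(["A-1", "B-2", "C-3"])
   s2.replace({r"([A-Z])-(\d)": r"\2_\1"}, regex=True)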
.. _cookbook.grouping:
Grouping
--------
Unlike ``agg``, ``apply``'s callable is passed a sub-DataFrame, which gives you
access to all the columns.
.. ipython:: python
df = pd.DataFrame(
{
"animal": "cat dog cat fish dog cat cat".split(),
"size": list("SSMMMLL"),
"weight": [8, 10, 11, 1, 20, 12, 12],
"adult": [False] * 5 + [True] * 2,
}
)
df
`Using get_group
<https://stackoverflow.com/questions/14734533/how-to-access-pandas-groupby-dataframe-by-key>`__
.. ipython:: python
gb = df.groupby("animal")
gb.get_group("cat")
Apply to different items in a group:

.. ipython:: python

   def GrowUp(x):
       avg_weight = sum(x[x["size"] == "S"].weight * 1.5)
       avg_weight += sum(x[x["size"] == "M"].weight * 1.25)
       avg_weight += sum(x[x["size"] == "L"].weight)
       avg_weight /= len(x)
       return pd.Series(["L", avg_weight, True], index=["size", "weight", "adult"])

   # apply the function to each animal group
   expected_df = gb.apply(GrowUp, include_groups=False)
   expected_df
`Expanding apply
<https://stackoverflow.com/questions/14542145/reductions-down-a-column-in-pandas>`__
.. ipython:: python

   import functools

   S = pd.Series([i / 100.0 for i in range(1, 11)])

   def red(x):
       # cumulative compound return over the expanding window
       return functools.reduce(lambda p, r: p * (1 + r), x, 1.0)

   S.expanding().apply(red, raw=True)
Replacing some values with the mean of the rest of a group:

.. ipython:: python

   # a small numeric frame; negative values get replaced per group
   df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, -1, 1, 2]})
   gb = df.groupby("A")

   def replace(g):
       mask = g < 0
       return g.where(~mask, g[~mask].mean())

   gb.transform(replace)
Sort groups by aggregated data:

.. ipython:: python
df = pd.DataFrame(
{
"code": ["foo", "bar", "baz"] * 2,
"data": [0.16, -0.21, 0.33, 0.45, -0.59, 0.62],
"flag": [False, True] * 3,
}
)
code_groups = df.groupby("code")
agg_n_sort_order = code_groups[["data"]].transform("sum").sort_values(by="data")
sorted_df = df.loc[agg_n_sort_order.index]
sorted_df
Create multiple aggregated columns, mixing named aggregations with a custom
function (the resampled series here is a reconstructed setup):

.. ipython:: python

   rng = pd.date_range(start="2014-10-07", periods=10, freq="2min")
   ts = pd.Series(data=list(range(10)), index=rng)

   def MyCust(x):
       if len(x) > 2:
           return x.iloc[1] * 1.234
       return pd.NaT

   mhc = {"Mean": "mean", "Max": "max", "Custom": MyCust}
   ts.resample("5min").apply(mhc)
Create a value counts column and reassign back to the DataFrame:

.. ipython:: python
df = pd.DataFrame(
{"Color": "Red Red Red Blue".split(), "Value": [100, 150, 50, 50]}
)
df
df["Counts"] = df.groupby(["Color"]).transform(len)
df
Shift groups of the values in a column based on the index:

.. ipython:: python

df = pd.DataFrame(
{"line_race": [10, 10, 8, 10, 10, 8], "beyer": [99, 102, 103, 103, 88, 100]},
index=[
"Last Gunfighter",
"Last Gunfighter",
"Last Gunfighter",
"Paynter",
"Paynter",
"Paynter",
],
)
df
df["beyer_shifted"] = df.groupby(level=0)["beyer"].shift(1)
df
Select the row with the maximum value from each group:

.. ipython:: python
df = pd.DataFrame(
{
"host": ["other", "other", "that", "this", "this"],
"service": ["mail", "web", "mail", "mail", "web"],
"no": [1, 2, 1, 2, 1],
}
).set_index(["host", "service"])
mask = df.groupby(level=0).agg("idxmax")
df_count = df.loc[mask["no"]].reset_index()
df_count
Splitting
*********
`Splitting a frame
<https://stackoverflow.com/questions/13353233/best-way-to-split-a-dataframe-given-an-edge/15449992#15449992>`__
.. ipython:: python
df = pd.DataFrame(
data={
"Case": ["A", "A", "A", "B", "A", "A", "B", "A", "A"],
"Data": np.random.randn(9),
}
)
dfs = list(
zip(
*df.groupby(
(1 * (df["Case"] == "B"))
.cumsum()
.rolling(window=3, min_periods=1)
.median()
)
)
)[-1]
dfs[0]
dfs[1]
dfs[2]
.. _cookbook.pivot:
Pivot
*****
The :ref:`Pivot <reshaping.pivot>` docs.

Partial sums and subtotals:

.. ipython:: python
df = pd.DataFrame(
data={
"Province": ["ON", "QC", "BC", "AL", "AL", "MN", "ON"],
"City": [
"Toronto",
"Montreal",
"Vancouver",
"Calgary",
"Edmonton",
"Winnipeg",
"Windsor",
],
"Sales": [13, 6, 16, 8, 4, 3, 1],
}
)
table = pd.pivot_table(
df,
values=["Sales"],
index=["Province"],
columns=["City"],
aggfunc="sum",
margins=True,
)
table.stack("City", future_stack=True)
.. ipython:: python
grades = [48, 99, 75, 80, 42, 80, 72, 68, 36, 78]
df = pd.DataFrame(
{
"ID": ["x%d" % r for r in range(10)],
"Gender": ["F", "M", "F", "M", "F", "M", "F", "M", "M", "M"],
"ExamYear": [
"2007",
"2007",
"2007",
"2008",
"2008",
"2008",
"2008",
"2009",
"2009",
"2009",
],
"Class": [
"algebra",
"stats",
"bio",
"algebra",
"algebra",
"stats",
"stats",
"algebra",
"bio",
"bio",
],
"Participated": [
"yes",
"yes",
"yes",
"yes",
"no",
"yes",
"yes",
"yes",
"yes",
"yes",
],
"Passed": ["yes" if x > 50 else "no" for x in grades],
"Employed": [
True,
True,
True,
False,
False,
False,
False,
True,
True,
False,
],
"Grade": grades,
}
)
df.groupby("ExamYear").agg(
{
"Participated": lambda x: x.value_counts()["yes"],
"Passed": lambda x: sum(x == "yes"),
"Employed": lambda x: sum(x),
"Grade": lambda x: sum(x) / len(x),
}
)
.. ipython:: python
df = pd.DataFrame(
{"value": np.random.randn(36)},
index=pd.date_range("2011-01-01", freq="ME", periods=36),
)
pd.pivot_table(
    df, index=df.index.month, columns=df.index.year, values="value", aggfunc="sum"
)
Apply
*****
Turning embedded lists into a MultiIndex frame:

.. ipython:: python
df = pd.DataFrame(
data={
"A": [[2, 4, 8, 16], [100, 200], [10, 20, 30]],
"B": [["a", "b", "c"], ["jj", "kk"], ["ccc"]],
},
index=["I", "II", "III"],
)
def SeriesFromSubList(aList):
    return pd.Series(aList)
df_orgz = pd.concat(
{ind: row.apply(SeriesFromSubList) for ind, row in df.iterrows()}
)
df_orgz
Rolling apply with a DataFrame returning a scalar:

.. ipython:: python
df = pd.DataFrame(
data=np.random.randn(2000, 2) / 10000,
index=pd.date_range("2001-01-01", periods=2000),
columns=["A", "B"],
)
df

def gm(df, const):
    # compound the summed daily returns over the window, then scale by const
    v = ((((df["A"] + df["B"]) + 1).cumprod()) - 1) * const
    return v.iloc[-1]

s = pd.Series(
{
df.index[i]: gm(df.iloc[i: min(i + 51, len(df) - 1)], 5)
for i in range(len(df) - 50)
}
)
s
Rolling apply to multiple columns where the function returns a scalar
(Volume Weighted Average Price):
.. ipython:: python

   # build a frame with the columns vwap needs (illustrative data)
   rng = pd.date_range(start="2014-01-01", periods=100)
   df = pd.DataFrame(
       {
           "Close": np.random.randn(len(rng)) + 10,
           "Volume": np.random.randint(100, 2000, len(rng)),
       },
       index=rng,
   )

   def vwap(bars):
       return (bars.Close * bars.Volume).sum() / bars.Volume.sum()

   window = 5
   s = pd.concat(
       [
           (pd.Series(vwap(df.iloc[i: i + window]), index=[df.index[i + window]]))
           for i in range(len(df) - window)
       ]
   )
   s.round(2)
Timeseries
----------
`Between times
<https://stackoverflow.com/questions/14539992/pandas-drop-rows-outside-of-time-range>`__
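
For the daily-window case, ``DataFrame.between_time`` does this directly; a
minimal sketch with invented data:

.. ipython:: python

   rng = pd.date_range("2013-01-01", periods=24, freq="h")
   ts = pd.DataFrame({"value": range(24)}, index=rng)
   # keep only rows whose clock time falls inside the window
   ts.between_time("09:00", "12:00")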
`Constructing a datetime range that excludes weekends and includes only certain
times
<https://stackoverflow.com/a/24014440>`__
`Vectorized Lookup
<https://stackoverflow.com/questions/13893227/vectorized-look-up-of-values-in-pandas-dataframe>`__
Turn a matrix with hours in columns and days in rows into a continuous row sequence
in the form of a time series.
`How to rearrange a Python pandas DataFrame?
<https://stackoverflow.com/questions/15432659/how-to-rearrange-a-python-pandas-dataframe>`__
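
A sketch of one way to do this (invented data): stack the hour columns into the
row index, then combine day and hour into a single ``DatetimeIndex``:

.. ipython:: python

   days = pd.date_range("2013-01-01", periods=3, freq="D")
   hours = [0, 6, 12, 18]
   mat = pd.DataFrame(np.random.randn(3, 4), index=days, columns=hours)
   ts = mat.stack(future_stack=True)
   # merge the day and hour levels into one timestamp per observation
   ts.index = ts.index.get_level_values(0) + pd.to_timedelta(
       ts.index.get_level_values(1), unit="h"
   )
   ts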
.. _cookbook.resample:
Resampling
**********
Using ``pd.Grouper`` and another grouping to create subgroups, then apply a
custom function :issue:`3791`
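
A minimal sketch of the pattern (invented data): group by a time frequency via
``pd.Grouper`` plus a second key, then ``apply`` a custom function to each
subgroup:

.. ipython:: python

   df = pd.DataFrame(
       {
           "Branch": list("AAAB"),
           "Date": pd.to_datetime(
               ["2013-01-01", "2013-01-15", "2013-10-01", "2013-10-02"]
           ),
           "Quantity": [1, 3, 5, 8],
       }
   )
   # one subgroup per (month, Branch) pair
   df.groupby([pd.Grouper(key="Date", freq="ME"), "Branch"])["Quantity"].apply(
       lambda x: x.max() - x.min()
   )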
.. _cookbook.merge:
Merge
-----
Self-join of a DataFrame :issue:`2996`:

.. ipython:: python
df = pd.DataFrame(
data={
"Area": ["A"] * 5 + ["C"] * 2,
"Bins": [110] * 2 + [160] * 3 + [40] * 2,
"Test_0": [0, 1, 0, 1, 2, 0, 1],
"Data": np.random.randn(7),
}
)
df
df["Test_1"] = df["Test_0"] - 1
pd.merge(
df,
df,
left_on=["Bins", "Area", "Test_0"],
right_on=["Bins", "Area", "Test_1"],
suffixes=("_L", "_R"),
)
.. _cookbook.plotting:
Plotting
--------
`Plotting a heatmap
<https://stackoverflow.com/questions/17050202/plot-timeseries-of-histograms-in-python>`__
`Generate Embedded plots in excel files using Pandas, Vincent and xlsxwriter
<https://pandas-xlsxwriter-charts.readthedocs.io/>`__
Boxplot for each quartile of a stratifying variable:

.. ipython:: python
df = pd.DataFrame(
{
"stratifying_var": np.random.uniform(0, 100, 20),
"price": np.random.normal(100, 5, 20),
}
)
df["quartiles"] = pd.qcut(
df["stratifying_var"], 4, labels=["0-25%", "25-50%", "50-75%", "75-100%"]
)
@savefig quartile_boxplot.png
df.boxplot(column="price", by="quartiles")
Data in/out
-----------
.. _cookbook.csv:
CSV
***
Reading a file that is compressed but not by ``gzip``/``bz2`` (the native
compressed formats which ``read_csv`` understands). This example shows a
``WinZipped`` file, but is a general application of opening the file within a
context manager and using that handle to read.
`See here
<https://stackoverflow.com/questions/17789907/pandas-convert-winzipped-csv-file-to-data-frame>`__
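
A sketch of the pattern, assuming an archive ``data.zip`` containing a single
member ``inner.csv`` (both names invented):

.. code-block:: python

   import zipfile

   # open the archive, then hand the member's file handle to read_csv
   with zipfile.ZipFile("data.zip") as zf:
       with zf.open("inner.csv") as handle:
           df = pd.read_csv(handle)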
.. _cookbook.csv.multiple_files:

Reading multiple files to create a single DataFrame
****************************************************

The best way to combine multiple files into a single DataFrame is to read the
individual frames one by one, put all of the individual frames into a list, and
then combine the frames in the list using :func:`pd.concat`:
.. ipython:: python
for i in range(3):
    data = pd.DataFrame(np.random.randn(10, 4))
    data.to_csv("file_{}.csv".format(i))
You can use the same approach to read all files matching a pattern. Here is an
example using ``glob``:
.. ipython:: python
import glob
import os
files = glob.glob("file_*.csv")
result = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
Finally, this strategy will work with the other ``pd.read_*(...)`` functions
described in the :ref:`io docs<io>`.
.. ipython:: python
:suppress:
for i in range(3):
    os.remove("file_{}.csv".format(i))
.. ipython:: python
i = pd.date_range("20000101", periods=10000)
df = pd.DataFrame({"year": i.year, "month": i.month, "day": i.day})
df.head()
Skip rows between the header and the data:

.. ipython:: python
data = """;;;;
;;;;
;;;;
;;;;
;;;;
;;;;
;;;;
;;;;
;;;;
;;;;
date;Param1;Param2;Param4;Param5
;m²;°C;m²;m
;;;;
01.01.1990 00:00;1;1;2;3
01.01.1990 01:00;5;3;4;5
01.01.1990 02:00;9;5;6;7
01.01.1990 03:00;13;7;8;9
01.01.1990 04:00;17;9;10;11
01.01.1990 05:00;21;11;12;13
"""
.. ipython:: python

   from io import StringIO

   pd.read_csv(
       StringIO(data),
       sep=";",
       skiprows=[11, 12],
       index_col=0,
       parse_dates=True,
       header=10,
   )
.. _cookbook.sql:
SQL
***
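
The :ref:`SQL <io.sql>` docs cover this area in depth; a minimal round-trip
sketch using the standard library's ``sqlite3``:

.. code-block:: python

   import sqlite3

   # throwaway in-memory database for illustration
   con = sqlite3.connect(":memory:")
   pd.DataFrame({"a": [1, 2], "b": [3, 4]}).to_sql("tbl", con, index=False)
   df = pd.read_sql("SELECT * FROM tbl", con)
   con.close()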
.. _cookbook.excel:
Excel
*****
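
A minimal round-trip sketch (``out.xlsx`` is an invented filename; an engine
such as ``openpyxl`` must be installed):

.. code-block:: python

   df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
   df.to_excel("out.xlsx", sheet_name="Sheet1", index=False)
   roundtrip = pd.read_excel("out.xlsx", sheet_name="Sheet1")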
.. _cookbook.html:
HTML
****
`Reading HTML tables from a server that cannot handle the default request
header <https://stackoverflow.com/a/18939272/564538>`__
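
A sketch of the workaround: fetch the page yourself with a custom
``User-Agent``, then hand the text to ``read_html`` (the URL is a placeholder):

.. code-block:: python

   import io
   import urllib.request

   url = "https://example.com/page-with-tables.html"  # placeholder
   req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
   with urllib.request.urlopen(req) as resp:
       html = resp.read().decode("utf-8")
   tables = pd.read_html(io.StringIO(html))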
.. _cookbook.hdf:
HDFStore
********
.. ipython:: python
df = pd.DataFrame(np.random.randn(8, 3))
store = pd.HDFStore("test.h5")
store.put("df", df)
.. ipython:: python
:suppress:
store.close()
os.remove("test.h5")
.. ipython:: python

   # the previous store was closed and its file removed, so open a fresh one
   df = pd.DataFrame(np.random.randn(8, 3))
   store = pd.HDFStore("test.h5")
   store["test"] = df
.. ipython:: python
   :suppress:

   store.close()
   os.remove("test.h5")
.. _cookbook.binary:
Binary files
************
pandas readily accepts NumPy record arrays, if you need to read in a binary
file consisting of an array of C structs. For example, given this C program
in a file called ``main.c`` compiled with ``gcc main.c -std=gnu99`` on a
64-bit machine,
.. code-block:: c

   #include <stdio.h>
   #include <stdint.h>

   /* struct layout matching the offsets/formats used below:
      int32 at 0, double at 8, float at 16 (with padding) */
   typedef struct _Data
   {
       int32_t count;
       double avg;
       float scale;
   } Data;

   int main(int argc, const char *argv[])
   {
       size_t n = 10;
       Data d[n];

       /* fill the array with sample values */
       for (size_t i = 0; i < n; ++i)
       {
           d[i].count = (int32_t) i;
           d[i].avg = i + 1.0;
           d[i].scale = (float) i + 2.0f;
       }

       /* write the raw structs to disk */
       FILE *file = fopen("binary.dat", "wb");
       fwrite(d, sizeof(Data), n, file);
       fclose(file);

       return 0;
   }
the following Python code will read the binary file ``'binary.dat'`` into a
pandas ``DataFrame``, where each element of the struct corresponds to a column
in the frame:
.. code-block:: python
   names = "count", "avg", "scale"
   # note that the offsets are larger than the size of the type because of
   # struct padding
   offsets = 0, 8, 16
   formats = "i4", "f8", "f4"
   dt = np.dtype(
       {"names": names, "offsets": offsets, "formats": formats}, align=True
   )
   df = pd.DataFrame(np.fromfile("binary.dat", dt))
.. note::

   The offsets of the structure elements may differ depending on the
   architecture of the machine on which the file was created; a raw binary
   file format like this is not portable across platforms.
Computation
-----------
Correlation
***********
Often it's useful to obtain the lower (or upper) triangular form of a
correlation matrix calculated from :meth:`DataFrame.corr`. This can be achieved
by passing a boolean mask to ``where`` as follows:
.. ipython:: python
df = pd.DataFrame(np.random.random(size=(100, 5)))
corr_mat = df.corr()
mask = np.tril(np.ones_like(corr_mat, dtype=np.bool_), k=-1)
corr_mat.where(mask)
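
``DataFrame.corr`` also accepts a callable for ``method``. The next example
relies on a ``distcorr`` helper that is not shown above; the definition below
is a rough sketch of the (biased) sample distance correlation, not a vetted
statistical routine:

.. ipython:: python

   def distcorr(x, y):
       # pairwise absolute differences
       a = np.abs(x[:, None] - x[None, :])
       b = np.abs(y[:, None] - y[None, :])
       # double-center each distance matrix
       A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
       B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
       dcov = np.sqrt((A * B).mean())
       dvar_x = np.sqrt((A * A).mean())
       dvar_y = np.sqrt((B * B).mean())
       return dcov / np.sqrt(dvar_x * dvar_y)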
.. ipython:: python
df = pd.DataFrame(np.random.normal(size=(100, 3)))
df.corr(method=distcorr)
Timedeltas
----------
.. ipython:: python

   import datetime

   # a small series of timestamps to operate on
   s = pd.Series(pd.date_range("2012-01-01", periods=3, freq="D"))
   s - s.max()
   s.max() - s
   s - datetime.datetime(2011, 1, 1, 3, 5)
   s + datetime.timedelta(minutes=5)
   datetime.datetime(2011, 1, 1, 3, 5) - s
   datetime.timedelta(minutes=5) + s
Timedelta columns live alongside datetime columns in a frame; a sketch (the
frame here is invented) showing the resulting dtypes:

.. ipython:: python

   deltas = pd.Series([datetime.timedelta(days=i) for i in range(3)])
   df = pd.DataFrame({"A": s, "B": deltas})
   df["New Dates"] = df["A"] + df["B"]
   df
   df.dtypes
`Another example
<https://stackoverflow.com/questions/15683588/iterating-through-a-pandas-dataframe>`__
Values can be set to ``NaT`` using ``np.nan``, similar to datetime:

.. ipython:: python
y = s - s.shift()
y
y[1] = np.nan
y
To create a dataframe from every combination of some given values, like R's
``expand.grid()`` function, we can create a dict where the keys are column
names and the values are lists of the data values:
.. ipython:: python

   import itertools

   def expand_grid(data_dict):
       rows = itertools.product(*data_dict.values())
       return pd.DataFrame.from_records(rows, columns=data_dict.keys())

   df = expand_grid(
       {"height": [60, 70], "weight": [100, 140, 180], "sex": ["Male", "Female"]}
   )
   df
Constant series
---------------

To assess whether a series has a constant value, we can check if
``series.nunique() <= 1``. However, a more performant approach, one that does
not count all of the unique values first, is:

.. ipython:: python

   v = s.to_numpy()
   is_constant = v.shape[0] == 0 or (v[0] == v).all()
This approach assumes that the series does not contain missing values.
For the case that we would drop NA values, we can simply remove those values first:
.. ipython:: python

   v = s.dropna().to_numpy()
   is_constant = v.shape[0] == 0 or (v[0] == v).all()
If missing values are considered distinct from any other value, then one could use:
.. ipython:: python

   v = s.to_numpy()
   is_constant = v.shape[0] == 0 or (v[0] == v).all() or not pd.notna(v).any()
(Note that this example does not disambiguate between ``np.nan``, ``pd.NA`` and
``None``.)