pandas: powerful Python data analysis toolkit
Release 0.25.2
CHAPTER
ONE
WHAT'S NEW IN 0.25.2
These are the changes in pandas 0.25.2. See the release notes for a full changelog including other versions of pandas.
1.1 Bug fixes
1.1.1 Indexing
1.1.2 I/O
• Fix regression in notebook display where <th> tags were missing for DataFrame.index values (GH28204).
• Fix regression in to_csv() where writing a Series or DataFrame indexed by an IntervalIndex would incorrectly raise a TypeError (GH28210).
• Fix to_csv() with ExtensionArray with list-like values (GH28840).
1.1.3 Groupby/resample/rolling
1.2 Contributors
A total of 6 people contributed patches to this release. People with a “+” by their names contributed a
patch for the first time.
• Felix Divo +
• Jeremy Schendel
• Joris Van den Bossche
• MeeseeksMachine
• Tom Augspurger
• jbrockmendel
CHAPTER
TWO
INSTALLATION
The easiest way to install pandas is to install it as part of the Anaconda distribution, a cross-platform
distribution for data analysis and scientific computing. This is the recommended installation method for
most users.
Instructions for installing from source, PyPI, ActivePython, various Linux distributions, or a development
version are also provided.
Installing pandas and the rest of the NumPy and SciPy stack can be a little difficult for inexperienced users.
The simplest way to install not only pandas, but Python and the most popular packages that make up
the SciPy stack (IPython, NumPy, Matplotlib, …) is with Anaconda, a cross-platform (Linux, Mac OS X,
Windows) Python distribution for data analytics and scientific computing.
After running the installer, the user will have access to pandas and the rest of the SciPy stack without
needing to install anything else, and without needing to wait for any software to be compiled.
Installation instructions for Anaconda can be found here.
A full list of the packages available as part of the Anaconda distribution can be found here.
Another advantage to installing Anaconda is that you don't need admin rights to install it. Anaconda can
install in the user's home directory, which makes it trivial to delete Anaconda if you decide to uninstall it
(just delete that folder).
The previous section outlined how to get pandas installed as part of the Anaconda distribution. However
this approach means you will install well over one hundred packages and involves downloading the installer
which is a few hundred megabytes in size.
If you want more control over which packages are installed, or have limited internet bandwidth, then
installing pandas with Miniconda may be a better solution.
Conda is the package manager that the Anaconda distribution is built upon. It is a package manager that is
both cross-platform and language agnostic (it can play a similar role to a pip and virtualenv combination).
Miniconda allows you to create a minimal, self-contained Python installation, and then use the Conda
command to install additional packages.
First you will need Conda to be installed; downloading and running the Miniconda installer will do this
for you. The installer can be found here.
The next step is to create a new conda environment. A conda environment is like a virtualenv that allows
you to specify a specific version of Python and set of libraries. Run the following commands from a terminal
window:
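The commands themselves are not shown here; a minimal sketch, reusing the environment name name_of_my_env that appears below, would be:

# Create a new, minimal environment containing only Python
conda create -n name_of_my_env python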
This will create a minimal environment with only Python installed in it. To put yourself inside this
environment, run:
activate name_of_my_env
The final step required is to install pandas. This can be done with the following command:
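The command is the standard conda install, run from inside the activated environment:

conda install pandas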
If you need packages that are available to pip but not conda, then install pip, and then use pip to install
those packages:
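A typical sequence (the package name below is only an illustration) would be:

# Install pip into the environment, then use it for packages conda does not provide
conda install pip
pip install some_pip_only_package   # hypothetical package name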
Installation instructions for ActivePython can be found here. Versions 2.7 and 3.5 include pandas.
The commands in this table will install pandas for Python 3 from your distribution. To install pandas for
Python 2, you may need to use the python-pandas package.
However, the packages in the Linux package managers are often a few versions behind, so to get the newest
version of pandas, it's recommended to install using the pip or conda methods described above.
See the contributing guide for complete instructions on building from the git source tree. Further, see creating
a development environment if you wish to create a pandas development environment.
pandas is equipped with an exhaustive set of unit tests, covering about 97% of the code base as of this writing.
To run it on your machine to verify that everything is working (and that you have all of the dependencies,
soft and hard, installed), make sure you have pytest >= 4.0.2 and Hypothesis >= 3.58, then run:
>>> pd.test()
running: pytest --skip-slow --skip-network C:\Users\TP\Anaconda3\envs\py36\lib\site-packages\pandas
..................................................................S......
........S................................................................
.........................................................................
2.4 Dependencies
• numexpr: for accelerating certain numerical operations. numexpr uses multiple cores as well as smart
chunking and caching to achieve large speedups. If installed, must be Version 2.6.2 or higher.
• bottleneck: for accelerating certain types of nan evaluations. bottleneck uses specialized cython
routines to achieve large speedups. If installed, must be Version 1.2.1 or higher.
Note: You are highly encouraged to install these libraries, as they provide speed improvements, especially
when working with large data sets.
pandas has many optional dependencies that are only used for specific methods. For example, pandas.read_hdf()
requires the pytables package. If the optional dependency is not installed, pandas will raise an ImportError
when the method requiring that dependency is called.
One of the following combinations of libraries is needed to use the top-level read_html() function:
Changed in version 0.23.0.
• BeautifulSoup4 and html5lib
• BeautifulSoup4 and lxml
• BeautifulSoup4 and html5lib and lxml
• Only lxml, although see HTML Table Parsing for reasons as to why you should probably not take this
approach.
Warning:
• If you install BeautifulSoup4 you must install either lxml or html5lib or both. read_html() will
not work with only BeautifulSoup4 installed.
• You are highly encouraged to read HTML Table Parsing gotchas. It explains issues surrounding
the installation and usage of the above three libraries.
CHAPTER
THREE
GETTING STARTED
3.1 Package overview
pandas is a Python package providing fast, flexible, and expressive data structures designed to make working
with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building
block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of
becoming the most powerful and flexible open source data analysis / manipulation tool available
in any language. It is already well on its way toward this goal.
pandas is well suited for many different kinds of data:
• Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
• Ordered and unordered (not necessarily fixed-frequency) time series data.
• Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
• Any other form of observational / statistical data sets. The data actually need not be labeled at all to
be placed into a pandas data structure
The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle
the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For
R users, DataFrame provides everything that R’s data.frame provides and much more. pandas is built on
top of NumPy and is intended to integrate well within a scientific computing environment with many other
3rd party libraries.
Here are just a few of the things that pandas does well:
• Easy handling of missing data (represented as NaN) in floating point as well as non-floating point
data
• Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional
objects
• Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the
user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you
in computations
• Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for
both aggregating and transforming data
• Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures
into DataFrame objects
• Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
The best way to think about the pandas data structures is as flexible containers for lower dimensional data.
For example, DataFrame is a container for Series, and Series is a container for scalars. We would like to be
able to insert and remove objects from these containers in a dictionary-like fashion.
Also, we would like sensible default behaviors for the common API functions which take into account the
typical orientation of time series and cross-sectional data sets. When using ndarrays to store 2- and 3-
dimensional data, a burden is placed on the user to consider the orientation of the data set when writing
functions; axes are considered more or less equivalent (except when C- or Fortran-contiguousness matters
for performance). In pandas, the axes are intended to lend more semantic meaning to the data; i.e., for a
particular data set there is likely to be a “right” way to orient the data. The goal, then, is to reduce the
amount of mental effort required to code up data transformations in downstream functions.
For example, with tabular data (DataFrame) it is more semantically helpful to think of the index (the rows)
and the columns rather than axis 0 and axis 1. Iterating through the columns of the DataFrame thus results
in more readable code:
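The loop itself is not shown above; a minimal sketch, assuming the DataFrame is named df, is:

# Iterate over column labels; each column comes back as a Series
for col in df.columns:
    series = df[col]
    # do something with series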
All pandas data structures are value-mutable (the values they contain can be altered) but not always size-
mutable. The length of a Series cannot be changed, but, for example, columns can be inserted into a
DataFrame. However, the vast majority of methods produce new objects and leave the input data untouched.
In general we like to favor immutability where sensible.
The first stop for pandas issues and ideas is the Github Issue Tracker. If you have a general question, pandas
community experts can answer through Stack Overflow.
3.1.4 Community
pandas is actively supported today by a community of like-minded individuals around the world who con-
tribute their valuable time and energy to help make open source pandas possible. Thanks to all of our
contributors.
If you’re interested in contributing, please visit the contributing guide.
pandas is a NumFOCUS sponsored project. This will help ensure the success of development of pandas as
a world-class open-source project, and makes it possible to donate to the project.
The governance process that pandas project has used informally since its inception in 2008 is formalized in
Project Governance documents. The documents clarify how decisions are made and how the various elements
of our community interact, including the relationship between open source collaborative development and
work that may be funded by for-profit or non-profit entities.
Wes McKinney is the Benevolent Dictator for Life (BDFL).
The list of the Core Team members and more detailed information can be found on the people’s page of the
governance repo.
Information about current institutional partners can be found on the pandas website.
3.1.8 License
Copyright (c) 2008-2012, AQR Capital Management, LLC, Lambda Foundry, Inc. and PyData Development Team
* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
* Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
3.2 10 minutes to pandas
This is a short introduction to pandas, geared mainly for new users. You can see more complex recipes in
the Cookbook.
Customarily, we import as follows:
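The import statements are not reproduced here; the conventional aliases used throughout the rest of this section are:

import numpy as np
import pandas as pd

# The Series displayed below was presumably created along these lines
# (only the first two values are visible, so the remaining values are an assumption):
s = pd.Series([1, 3, 5, np.nan, 6, 8])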
In [4]: s
Out[4]:
0 1.0
1 3.0
Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:
In [6]: dates
Out[6]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
In [8]: df
Out[8]:
A B C D
2013-01-01 1.363830 0.773775 -2.275815 -0.733332
2013-01-02 1.775862 -0.867224 1.214459 1.846233
2013-01-03 0.747247 0.166559 -0.694008 -0.686417
2013-01-04 0.347708 0.060473 -1.393855 -1.249877
2013-01-05 0.578869 0.782500 0.045559 -0.490189
2013-01-06 0.587491 -0.675389 -0.433293 -0.583037
In [10]: df2
Out[10]:
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
In [11]: df2.dtypes
Out[11]:
A float64
B datetime64[ns]
If you’re using IPython, tab completion for column names (as well as public attributes) is automatically
enabled. Here’s a subset of the attributes that will be completed:
As you can see, the columns A, B, C, and D are automatically tab completed. E is there as well; the rest of
the attributes have been truncated for brevity.
In [14]: df.tail(3)
Out[14]:
A B C D
2013-01-04 0.347708 0.060473 -1.393855 -1.249877
2013-01-05 0.578869 0.782500 0.045559 -0.490189
2013-01-06 0.587491 -0.675389 -0.433293 -0.583037
Display the index, columns:
In [15]: df.index
Out[15]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
In [16]: df.columns
Out[16]: Index(['A', 'B', 'C', 'D'], dtype='object')
DataFrame.to_numpy() gives a NumPy representation of the underlying data. Note that this can be an
expensive operation when your DataFrame has columns with different data types, which comes down to a
fundamental difference between pandas and NumPy: NumPy arrays have one dtype for the entire ar-
ray, while pandas DataFrames have one dtype per column. When you call DataFrame.to_numpy(),
pandas will find the NumPy dtype that can hold all of the dtypes in the DataFrame. This may end up being
object, which requires casting every value to a Python object.
For df, our DataFrame of all floating-point values, DataFrame.to_numpy() is fast and doesn’t require copying
data.
In [17]: df.to_numpy()
Out[17]:
array([[ 1.36382983, 0.77377516, -2.2758147 , -0.73333194],
[ 1.77586181, -0.86722409, 1.21445884, 1.84623251],
[ 0.74724665, 0.1665589 , -0.69400832, -0.68641742],
[ 0.34770833, 0.06047343, -1.39385516, -1.24987716],
[ 0.57886924, 0.78250045, 0.04555881, -0.49018939],
[ 0.58749057, -0.67538866, -0.43329251, -0.58303716]])
For df2, the DataFrame with multiple dtypes, DataFrame.to_numpy() is relatively expensive.
In [18]: df2.to_numpy()
Out[18]:
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
dtype=object)
Note: DataFrame.to_numpy() does not include the index or column labels in the output.
In [19]: df.describe()
Out[19]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.900168 0.040116 -0.589492 -0.316103
std 0.549803 0.698714 1.198627 1.091824
min 0.347708 -0.867224 -2.275815 -1.249877
25% 0.581025 -0.491423 -1.218893 -0.721603
50% 0.667369 0.113516 -0.563650 -0.634727
75% 1.209684 0.621971 -0.074154 -0.513401
max 1.775862 0.782500 1.214459 1.846233
In [20]: df.T
Out[20]:
2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06
A 1.363830 1.775862 0.747247 0.347708 0.578869 0.587491
B 0.773775 -0.867224 0.166559 0.060473 0.782500 -0.675389
C -2.275815 1.214459 -0.694008 -1.393855 0.045559 -0.433293
D -0.733332 1.846233 -0.686417 -1.249877 -0.490189 -0.583037
Sorting by an axis:
In [21]: df.sort_index(axis=1, ascending=False)
Out[21]:
D C B A
2013-01-01 -0.733332 -2.275815 0.773775 1.363830
2013-01-02 1.846233 1.214459 -0.867224 1.775862
2013-01-03 -0.686417 -0.694008 0.166559 0.747247
2013-01-04 -1.249877 -1.393855 0.060473 0.347708
2013-01-05 -0.490189 0.045559 0.782500 0.578869
2013-01-06 -0.583037 -0.433293 -0.675389 0.587491
Sorting by values:
In [22]: df.sort_values(by='B')
Out[22]:
A B C D
2013-01-02 1.775862 -0.867224 1.214459 1.846233
2013-01-06 0.587491 -0.675389 -0.433293 -0.583037
2013-01-04 0.347708 0.060473 -1.393855 -1.249877
2013-01-03 0.747247 0.166559 -0.694008 -0.686417
2013-01-01 1.363830 0.773775 -2.275815 -0.733332
2013-01-05 0.578869 0.782500 0.045559 -0.490189
3.2.3 Selection
Note: While standard Python / Numpy expressions for selecting and setting are intuitive and come in
handy for interactive work, for production code, we recommend the optimized pandas data access methods,
.at, .iat, .loc and .iloc.
See the indexing documentation Indexing and Selecting Data and MultiIndex / Advanced Indexing.
Getting
In [25]: df['20130102':'20130104']
Out[25]:
A B C D
2013-01-02 1.775862 -0.867224 1.214459 1.846233
2013-01-03 0.747247 0.166559 -0.694008 -0.686417
2013-01-04 0.347708 0.060473 -1.393855 -1.249877
Selection by label
In [26]: df.loc[dates[0]]
Out[26]:
A 1.363830
B 0.773775
C -2.275815
D -0.733332
Name: 2013-01-01 00:00:00, dtype: float64
Selection by position
In [32]: df.iloc[3]
Out[32]:
A 0.347708
B 0.060473
C -1.393855
D -1.249877
Name: 2013-01-04 00:00:00, dtype: float64
In [35]: df.iloc[1:3, :]
Out[35]:
A B C D
2013-01-02 1.775862 -0.867224 1.214459 1.846233
2013-01-03 0.747247 0.166559 -0.694008 -0.686417
In [37]: df.iloc[1, 1]
Out[37]: -0.8672240914412521
In [38]: df.iat[1, 1]
Out[38]: -0.8672240914412521
Boolean indexing
In [43]: df2
Out[43]:
A B C D E
2013-01-01 1.363830 0.773775 -2.275815 -0.733332 one
2013-01-02 1.775862 -0.867224 1.214459 1.846233 one
2013-01-03 0.747247 0.166559 -0.694008 -0.686417 two
2013-01-04 0.347708 0.060473 -1.393855 -1.249877 three
2013-01-05 0.578869 0.782500 0.045559 -0.490189 four
2013-01-06 0.587491 -0.675389 -0.433293 -0.583037 three
Setting
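The construction of s1 is not shown; a sketch consistent with the display below:

# Values 1 through 6 on a daily index starting 2013-01-02
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))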
In [46]: s1
Out[46]:
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
2013-01-06 5
2013-01-07 6
Freq: D, dtype: int64
In [47]: df['F'] = s1
In [49]: df.iat[0, 1] = 0
In [51]: df
Out[51]:
A B C D F
2013-01-01 0.000000 0.000000 -2.275815 5 NaN
2013-01-02 1.775862 -0.867224 1.214459 5 1.0
2013-01-03 0.747247 0.166559 -0.694008 5 2.0
2013-01-04 0.347708 0.060473 -1.393855 5 3.0
2013-01-05 0.578869 0.782500 0.045559 5 4.0
2013-01-06 0.587491 -0.675389 -0.433293 5 5.0
In [54]: df2
Out[54]:
A B C D F
2013-01-01 0.000000 0.000000 -2.275815 -5 NaN
2013-01-02 -1.775862 -0.867224 -1.214459 -5 -1.0
2013-01-03 -0.747247 -0.166559 -0.694008 -5 -2.0
2013-01-04 -0.347708 -0.060473 -1.393855 -5 -3.0
2013-01-05 -0.578869 -0.782500 -0.045559 -5 -4.0
2013-01-06 -0.587491 -0.675389 -0.433293 -5 -5.0
3.2.4 Missing data
pandas primarily uses the value np.nan to represent missing data. It is by default not included in compu-
tations. See the Missing Data section.
Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.
In [57]: df1
Out[57]:
A B C D F E
2013-01-01 0.000000 0.000000 -2.275815 5 NaN 1.0
2013-01-02 1.775862 -0.867224 1.214459 5 1.0 1.0
2013-01-03 0.747247 0.166559 -0.694008 5 2.0 NaN
2013-01-04 0.347708 0.060473 -1.393855 5 3.0 NaN
In [58]: df1.dropna(how='any')
Out[58]:
A B C D F E
2013-01-02 1.775862 -0.867224 1.214459 5 1.0 1.0
In [59]: df1.fillna(value=5)
Out[59]:
A B C D F E
2013-01-01 0.000000 0.000000 -2.275815 5 5.0 1.0
2013-01-02 1.775862 -0.867224 1.214459 5 1.0 1.0
2013-01-03 0.747247 0.166559 -0.694008 5 2.0 5.0
2013-01-04 0.347708 0.060473 -1.393855 5 3.0 5.0
In [60]: pd.isna(df1)
Out[60]:
A B C D F E
2013-01-01 False False False False True False
2013-01-02 False False False False False False
2013-01-03 False False False False False True
2013-01-04 False False False False False True
3.2.5 Operations
Stats
In [61]: df.mean()
Out[61]:
A 0.672863
B -0.088847
C -0.589492
D 5.000000
F 3.000000
dtype: float64
In [62]: df.mean(1)
Out[62]:
2013-01-01 0.681046
2013-01-02 1.624619
2013-01-03 1.443959
2013-01-04 1.402865
2013-01-05 2.081386
2013-01-06 1.895762
Freq: D, dtype: float64
Operating with objects that have different dimensionality and need alignment. In addition, pandas auto-
matically broadcasts along the specified dimension.
In [64]: s
Out[64]:
2013-01-01 NaN
2013-01-02 NaN
2013-01-03 1.0
2013-01-04 3.0
2013-01-05 5.0
2013-01-06 NaN
Freq: D, dtype: float64
Apply
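The examples under this heading are missing; a sketch of applying functions to the data, using the df from above:

# Cumulative sum down each column
df.apply(np.cumsum)

# Spread (max minus min) of each column
df.apply(lambda x: x.max() - x.min())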
Histogramming
In [69]: s
Out[69]:
0 3
1 2
2 4
3 5
4 1
5 1
6 3
7 2
8 4
9 0
dtype: int64
In [70]: s.value_counts()
Out[70]:
4 2
3 2
2 2
1 2
5 1
0 1
dtype: int64
String Methods
Series is equipped with a set of string processing methods in the str attribute that make it easy to operate
on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses
regular expressions by default (and in some cases always uses them). See more at Vectorized String Methods.
In [71]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
In [72]: s.str.lower()
Out[72]:
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
3.2.6 Merge
Concat
pandas provides various facilities for easily combining together Series and DataFrame objects with vari-
ous kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type
operations.
See the Merging section.
Concatenating pandas objects together with concat():
In [74]: df
Out[74]:
0 1 2 3
0 2.215176 -1.315908 -0.317116 -0.842397
1 2.213106 -0.596732 0.409649 1.286960
2 1.006929 -1.120177 0.230672 0.537003
3 -2.502802 0.785754 -0.228391 0.063769
4 -0.469153 -0.620792 1.997251 -0.360238
5 1.116982 -3.165350 -1.410497 -0.466197
6 0.712124 0.581404 0.715763 1.044883
7 -2.015823 0.212701 0.841041 0.089602
8 -1.633435 0.535489 -1.146539 0.930138
9 -0.219756 -0.913111 -0.769806 -0.355672
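The step that produced pieces is not shown; it presumably split the frame into row-wise chunks:

# Break the DataFrame into three pieces
pieces = [df[:3], df[3:7], df[7:]]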
In [76]: pd.concat(pieces)
Out[76]:
0 1 2 3
0 2.215176 -1.315908 -0.317116 -0.842397
1 2.213106 -0.596732 0.409649 1.286960
2 1.006929 -1.120177 0.230672 0.537003
3 -2.502802 0.785754 -0.228391 0.063769
4 -0.469153 -0.620792 1.997251 -0.360238
5 1.116982 -3.165350 -1.410497 -0.466197
6 0.712124 0.581404 0.715763 1.044883
7 -2.015823 0.212701 0.841041 0.089602
8 -1.633435 0.535489 -1.146539 0.930138
9 -0.219756 -0.913111 -0.769806 -0.355672
Join
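The construction of left and right, and the merge call itself, are missing; a sketch consistent with the displays below:

left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

# SQL-style join on the shared key column
pd.merge(left, right, on='key')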
In [79]: left
Out[79]:
key lval
0 foo 1
1 foo 2
In [80]: right
Out[80]:
key rval
0 foo 4
1 foo 5
In [84]: left
Out[84]:
key lval
0 foo 1
1 bar 2
In [85]: right
Out[85]:
key rval
0 foo 4
1 bar 5
Append
In [88]: df
Out[88]:
A B C D
0 0.292785 -0.036327 -0.303377 0.132313
1 -1.437443 -0.521953 0.693930 -0.556417
2 2.246185 -0.175934 1.702784 2.475101
3 0.040067 -0.987877 -0.741536 1.142627
4 -1.657556 -0.812830 0.587969 1.347117
5 0.433323 -0.045170 -0.694659 -0.954585
6 -1.258443 -0.380086 1.610889 0.841358
7 -0.660989 0.519592 -0.360741 -1.441424
In [89]: s = df.iloc[3]
3.2.7 Grouping
By “group by” we are referring to a process involving one or more of the following steps:
• Splitting the data into groups based on some criteria
• Applying a function to each group independently
• Combining the results into a data structure
See the Grouping section.
In [92]: df
Out[92]:
A B C D
0 foo one -1.593566 1.125380
1 bar one 0.854244 1.315419
2 foo two 1.218171 1.511499
3 bar three -0.028648 0.726054
4 foo two 1.632905 0.664038
5 bar two -0.637469 -0.707051
6 foo one -0.512159 -1.576752
7 foo three 0.312766 0.831920
Grouping and then applying the sum() function to the resulting groups.
In [93]: df.groupby('A').sum()
Out[93]:
C D
A
bar 0.188127 1.334423
foo 1.058118 2.556085
Grouping by multiple columns forms a hierarchical index, and again we can apply the sum function.
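The call itself is not shown; presumably:

# Group by two columns, producing a hierarchical (MultiIndex) result
df.groupby(['A', 'B']).sum()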
3.2.8 Reshaping
Stack
In [99]: df2
Out[99]:
A B
first second
bar one 1.374920 0.502088
two -0.304574 -1.021713
baz one 0.713245 0.315322
two 0.591517 0.543840
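The step that produced stacked is not shown; stack() "compresses" a level of the column labels into the index, so presumably:

stacked = df2.stack()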
In [101]: stacked
Out[101]:
first second
bar one A 1.374920
B 0.502088
two A -0.304574
B -1.021713
baz one A 0.713245
With a “stacked” DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack()
is unstack(), which by default unstacks the last level:
In [102]: stacked.unstack()
Out[102]:
A B
first second
bar one 1.374920 0.502088
two -0.304574 -1.021713
baz one 0.713245 0.315322
two 0.591517 0.543840
In [103]: stacked.unstack(1)
Out[103]:
second one two
first
bar A 1.374920 -0.304574
B 0.502088 -1.021713
baz A 0.713245 0.591517
B 0.315322 0.543840
In [104]: stacked.unstack(0)
Out[104]:
first bar baz
second
one A 1.374920 0.713245
B 0.502088 0.315322
two A -0.304574 0.591517
B -1.021713 0.543840
Pivot tables
In [106]: df
Out[106]:
A B C D E
0 one A foo -0.278531 0.004556
1 one B foo 0.940889 -2.481391
2 two C foo 1.256053 1.125308
3.2.9 Time series
pandas has simple, powerful, and efficient functionality for performing resampling operations during fre-
quency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but
not limited to, financial applications. See the Time Series section.
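The construction of ts is missing; a sketch consistent with the single 5-minute bucket shown below (100 one-second observations) might be:

rng = pd.date_range('1/1/2012', periods=100, freq='S')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)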
In [110]: ts.resample('5Min').sum()
Out[110]:
2012-01-01 26405
Freq: 5T, dtype: int64
In [113]: ts
Out[113]:
2012-03-06 1.106290
2012-03-07 -1.917986
In [115]: ts_utc
Out[115]:
2012-03-06 00:00:00+00:00 1.106290
2012-03-07 00:00:00+00:00 -1.917986
2012-03-08 00:00:00+00:00 1.099233
2012-03-09 00:00:00+00:00 -1.180562
2012-03-10 00:00:00+00:00 -0.140211
Freq: D, dtype: float64
In [116]: ts_utc.tz_convert('US/Eastern')
Out[116]:
2012-03-05 19:00:00-05:00 1.106290
2012-03-06 19:00:00-05:00 -1.917986
2012-03-07 19:00:00-05:00 1.099233
2012-03-08 19:00:00-05:00 -1.180562
2012-03-09 19:00:00-05:00 -0.140211
Freq: D, dtype: float64
In [119]: ts
Out[119]:
2012-01-31 0.471911
2012-02-29 -1.370620
2012-03-31 -0.267019
2012-04-30 0.111729
2012-05-31 0.611299
Freq: M, dtype: float64
In [120]: ps = ts.to_period()
In [121]: ps
Out[121]:
2012-01 0.471911
2012-02 -1.370620
2012-03 -0.267019
2012-04 0.111729
2012-05 0.611299
Freq: M, dtype: float64
In [122]: ps.to_timestamp()
Out[122]:
2012-01-01 0.471911
2012-02-01 -1.370620
2012-03-01 -0.267019
2012-04-01 0.111729
2012-05-01 0.611299
Freq: MS, dtype: float64
Converting between period and timestamp enables some convenient arithmetic functions to be used. In the
following example, we convert a quarterly frequency with year ending in November to 9am of the end of the
month following the quarter end:
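The code performing this conversion is not shown; a sketch matching the description and the hourly index displayed below (the exact period range is an assumption):

prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
ts = pd.Series(np.random.randn(len(prng)), index=prng)

# Convert each quarter-end period to the following month,
# then to hourly frequency at 9am of that month's first day
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9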
In [126]: ts.head()
Out[126]:
1990-03-01 09:00 -1.315293
1990-06-01 09:00 -0.487557
1990-09-01 09:00 -0.778802
1990-12-01 09:00 -0.221309
1991-03-01 09:00 0.174483
Freq: H, dtype: float64
3.2.10 Categoricals
pandas can include categorical data in a DataFrame. For full docs, see the categorical introduction and the
API documentation.
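The construction of the frame and the conversion to a categorical dtype are missing; a sketch consistent with the output below:

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                   "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})

# Convert the raw grades to a categorical dtype
df["grade"] = df["raw_grade"].astype("category")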
In [129]: df["grade"]
Out[129]:
0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
Reorder the categories and simultaneously add the missing categories (methods under Series.cat return
a new Series by default).
In [132]: df["grade"]
Out[132]:
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
In [133]: df.sort_values(by="grade")
Out[133]:
id raw_grade grade
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
4 5 a very good
In [134]: df.groupby("grade").size()
Out[134]:
grade
very bad 1
bad 0
medium 0
good 2
very good 3
dtype: int64
3.2.11 Plotting
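The matplotlib setup is not shown; since plt is used further down, the conventional import is assumed:

import matplotlib.pyplot as plt

plt.close('all')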
In [135]: ts = pd.Series(np.random.randn(1000),
.....: index=pd.date_range('1/1/2000', periods=1000))
.....:
In [137]: ts.plot()
Out[137]: <matplotlib.axes._subplots.AxesSubplot at 0x127495750>
On a DataFrame, the plot() method is a convenience to plot all of the columns with labels:
In [138]: df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
.....: columns=['A', 'B', 'C', 'D'])
.....:
In [139]: df = df.cumsum()
In [140]: plt.figure()
Out[140]: <Figure size 640x480 with 0 Axes>
In [141]: df.plot()
Out[141]: <matplotlib.axes._subplots.AxesSubplot at 0x1277a0a90>
In [142]: plt.legend(loc='best')
Out[142]: <matplotlib.legend.Legend at 0x1277a5310>
3.2.12 Getting data in/out
CSV
In [143]: df.to_csv('foo.csv')
In [144]: pd.read_csv('foo.csv')
Out[144]:
Unnamed: 0 A B C D
0 2000-01-01 0.714384 1.118691 0.214306 -0.209375
1 2000-01-02 1.099535 -0.094588 1.519586 1.103485
2 2000-01-03 2.212839 -0.425442 0.268937 1.276245
3 2000-01-04 2.695004 -2.043315 1.337847 1.970893
4 2000-01-05 2.669479 -2.612468 2.095337 2.351292
.. ... ... ... ... ...
995 2002-09-22 -32.014002 -33.223408 28.122550 55.528126
996 2002-09-23 -31.418475 -33.747331 27.812849 55.601151
HDF5
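The read/write calls under this heading are missing; the usual pattern (requires the pytables package) is roughly:

# Write to an HDF5 store, then read it back
df.to_hdf('foo.h5', 'df')
pd.read_hdf('foo.h5', 'df')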
Excel
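Likewise for Excel; roughly:

# Write to an xlsx file, then read it back
df.to_excel('foo.xlsx', sheet_name='Sheet1')
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])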
3.2.13 Gotchas
If you are attempting to perform an operation you might see an exception like:
3.3 Essential basic functionality
Here we discuss a lot of the essential functionality common to the pandas data structures. Here's how to
create some of the objects used in the examples from the previous section:
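The creation block itself is missing; a sketch consistent with the objects displayed later in this section (a 5-element Series labelled a to e, and an 8-row DataFrame with columns A, B, C on a daily index starting 2000-01-01):

index = pd.date_range('1/1/2000', periods=8)
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=['A', 'B', 'C'])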
To view a small sample of a Series or DataFrame object, use the head() and tail() methods. The default
number of elements to display is five, but you may pass a custom number.
In [4]: long_series = pd.Series(np.random.randn(1000))
In [5]: long_series.head()
Out[5]:
0 -1.353625
1 0.446611
2 1.600544
3 -0.602700
4 0.096528
dtype: float64
In [6]: long_series.tail(3)
Out[6]:
997 -0.933084
998 -1.895481
999 -1.109266
dtype: float64
pandas objects have a number of attributes enabling you to access the metadata
• shape: gives the axis dimensions of the object, consistent with ndarray
• Axis labels
– Series: index (only axis)
– DataFrame: index (rows) and columns
Note, these attributes can be safely assigned to!
In [7]: df[:2]
Out[7]:
A B C
2000-01-01 0.717501 -0.936966 0.634588
2000-01-02 -0.712106 -0.600773 0.676949
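Between these two displays the column labels were evidently lowercased; the missing step was presumably an assignment like:

# Axis labels can be assigned to directly
df.columns = [x.lower() for x in df.columns]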
In [9]: df
Out[9]:
a b c
2000-01-01 0.717501 -0.936966 0.634588
2000-01-02 -0.712106 -0.600773 0.676949
2000-01-03 1.578377 -0.108295 0.317850
2000-01-04 1.020407 -0.220174 0.719467
2000-01-05 0.423406 0.742788 -0.047665
2000-01-06 1.516605 -0.287212 -0.086036
2000-01-07 0.411242 0.988457 -1.187729
2000-01-08 -0.053553 0.930622 1.334235
Pandas objects (Index, Series, DataFrame) can be thought of as containers for arrays, which hold the actual
data and do the actual computation. For many types, the underlying array is a numpy.ndarray. However,
pandas and 3rd party libraries may extend NumPy’s type system to add support for custom arrays (see
dtypes).
To get the actual data inside an Index or Series, use the .array property.
In [10]: s.array
Out[10]:
<PandasArray>
[ -1.245150154666457, -1.1212737834640922, 0.14371758135167426,
1.6052453659926817, 0.49430022914548555]
Length: 5, dtype: float64
In [11]: s.index.array
Out[11]:
<PandasArray>
['a', 'b', 'c', 'd', 'e']
Length: 5, dtype: object
array will always be an ExtensionArray. The exact details of what an ExtensionArray is and why pandas
uses them is a bit beyond the scope of this introduction. See dtypes for more.
If you know you need a NumPy array, use to_numpy() or numpy.asarray().
In [12]: s.to_numpy()
Out[12]: array([-1.24515015, -1.12127378, 0.14371758, 1.60524537, 0.49430023])
In [13]: np.asarray(s)
Out[13]: array([-1.24515015, -1.12127378, 0.14371758, 1.60524537, 0.49430023])
When the Series or Index is backed by an ExtensionArray, to_numpy() may involve copying data and
coercing values. See dtypes for more.
to_numpy() gives some control over the dtype of the resulting numpy.ndarray. For example, consider
datetimes with timezones. NumPy doesn’t have a dtype to represent timezone-aware datetimes, so there are
two possibly useful representations:
1. An object-dtype numpy.ndarray with Timestamp objects, each with the correct tz
2. A datetime64[ns] -dtype numpy.ndarray, where the values have been converted to UTC and the
timezone discarded
Timezones may be preserved with dtype=object
In [15]: ser.to_numpy(dtype=object)
Out[15]:
array([Timestamp('2000-01-01 00:00:00+0100', tz='CET', freq='D'),
Timestamp('2000-01-02 00:00:00+0100', tz='CET', freq='D')],
dtype=object)
In [16]: ser.to_numpy(dtype="datetime64[ns]")
Out[16]:
array(['1999-12-31T23:00:00.000000000', '2000-01-01T23:00:00.000000000'],
dtype='datetime64[ns]')
Getting the “raw data” inside a DataFrame is possibly a bit more complex. When your DataFrame only has
a single data type for all the columns, DataFrame.to_numpy() will return the underlying data:
In [17]: df.to_numpy()
Out[17]:
array([[ 0.71750128, -0.93696585, 0.63458795],
[-0.71210567, -0.60077296, 0.67694886],
[ 1.57837673, -0.10829478, 0.31785043],
[ 1.02040735, -0.22017439, 0.71946677],
[ 0.42340608, 0.74278775, -0.04766533],
[ 1.51660469, -0.28721192, -0.08603582],
If a DataFrame contains homogeneously-typed data, the ndarray can actually be modified in-place, and
the changes will be reflected in the data structure. For heterogeneous data (e.g. some of the DataFrame’s
columns are not all the same dtype), this will not be the case. The values attribute itself, unlike the axis
labels, cannot be assigned to.
Note: When working with heterogeneous data, the dtype of the resulting ndarray will be chosen to
accommodate all of the data involved. For example, if strings are involved, the result will be of object dtype.
If there are only floats and integers, the resulting array will be of float dtype.
In the past, pandas recommended Series.values or DataFrame.values for extracting the data from a
Series or DataFrame. You’ll still find references to these in old code bases and online. Going forward, we
recommend avoiding .values and using .array or .to_numpy(). .values has the following drawbacks:
1. When your Series contains an extension type, it’s unclear whether Series.values returns a NumPy
array or the extension array. Series.array will always return an ExtensionArray, and will never
copy data. Series.to_numpy() will always return a NumPy array, potentially at the cost of copying
/ coercing values.
2. When your DataFrame contains a mixture of data types, DataFrame.values may involve copying data
and coercing values to a common dtype, a relatively expensive operation. DataFrame.to_numpy(),
being a method, makes it clearer that the returned NumPy array may not be a view on the same data
in the DataFrame.
pandas has support for accelerating certain types of binary numerical and boolean operations using the
numexpr and bottleneck libraries.
These libraries are especially useful when dealing with large data sets, and provide large speedups. numexpr
uses smart chunking, caching, and multiple cores. bottleneck is a set of specialized cython routines that
are especially fast when dealing with arrays that have nans.
Here is a sample (using 100 column x 100,000 row DataFrames):
You are highly encouraged to install both libraries. See the section Recommended Dependencies for more
installation info.
These are both enabled by default; you can control this by setting the options:
New in version 0.20.0.
pd.set_option('compute.use_bottleneck', False)
pd.set_option('compute.use_numexpr', False)
With binary operations between pandas data structures, there are two key points of interest:
• Broadcasting behavior between higher- (e.g. DataFrame) and lower-dimensional (e.g. Series) objects.
• Missing data in computations.
We will demonstrate how to manage these issues independently, though they can be handled simultaneously.
DataFrame has the methods add(), sub(), mul(), div() and related functions radd(), rsub(), … for
carrying out binary operations. For broadcasting behavior, Series input is of primary interest. Using these
functions, you can match on either the index or the columns via the axis keyword:
In [18]: df = pd.DataFrame({
....: 'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
....: 'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
....: 'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
....:
In [19]: df
Out[19]:
one two three
a -0.466716 0.029137 NaN
b -0.965699 1.967624 -0.384693
c 0.544505 0.612498 0.566326
d NaN -0.191558 0.106820
Series and Index also support the divmod() builtin. This function takes the floor division and modulo
operation at the same time returning a two-tuple of the same type as the left hand side. For example:
In [29]: s = pd.Series(np.arange(10))
In [30]: s
Out[30]:
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
dtype: int64
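The call that produced div and rem is not shown; given the outputs below it was presumably:

# Floor division and modulo in a single step
div, rem = divmod(s, 3)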
In [32]: div
Out[32]:
0 0
1 0
2 0
3 1
4 1
5 1
6 2
7 2
8 2
9 3
dtype: int64
In [33]: rem
Out[33]:
0 0
1 1
2 2
3 0
4 1
5 2
6 0
7 1
8 2
9 0
dtype: int64
In [35]: idx
Out[35]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
In [37]: div
Out[37]: Int64Index([0, 0, 0, 1, 1, 1, 2, 2, 2, 3], dtype='int64')
In [38]: rem
Out[38]: Int64Index([0, 1, 2, 0, 1, 2, 0, 1, 2, 0], dtype='int64')
We can also do elementwise divmod():
In [39]: div, rem = divmod(s, [2, 2, 3, 3, 4, 4, 5, 5, 6, 6])
In [40]: div
Out[40]:
0 0
1 0
2 0
3 1
4 1
5 1
6 1
7 1
8 1
9 1
dtype: int64
In [41]: rem
Out[41]:
0 0
1 1
2 2
3 0
4 0
5 1
6 1
7 2
8 2
9 3
dtype: int64
In Series and DataFrame, the arithmetic functions have the option of inputting a fill_value, namely a value
to substitute when at most one of the values at a location are missing. For example, when adding two
DataFrame objects, you may wish to treat NaN as 0 unless both DataFrames are missing that value, in
which case the result will be NaN (you can later replace NaN with some other value using fillna if you
wish).
In [42]: df
Out[42]:
one two three
a -0.466716 0.029137 NaN
b -0.965699 1.967624 -0.384693
c 0.544505 0.612498 0.566326
d NaN -0.191558 0.106820
In [43]: df2
Out[43]:
one two three
a -0.466716 0.029137 1.000000
b -0.965699 1.967624 -0.384693
c 0.544505 0.612498 0.566326
d NaN -0.191558 0.106820
In [44]: df + df2
Out[44]:
one two three
a -0.933433 0.058274 NaN
b -1.931397 3.935247 -0.769386
c 1.089011 1.224997 1.132652
d NaN -0.383116 0.213640
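The example actually passing fill_value is missing here; it would read:

# NaN in only one operand is treated as 0; NaN in both stays NaN
df.add(df2, fill_value=0)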
Flexible comparisons
Series and DataFrame have the binary comparison methods eq, ne, lt, gt, le, and ge whose behavior is
analogous to the binary arithmetic operations described above:
In [46]: df.gt(df2)
Out[46]:
one two three
a False False False
b False False False
c False False False
d False False False
In [47]: df2.ne(df)
Out[47]:
one two three
a False False True
b False False False
c False False False
d True False False
These operations produce a pandas object of the same type as the left-hand-side input that is of dtype bool.
These boolean objects can be used in indexing operations, see the section on Boolean indexing.
Boolean reductions
You can apply the reductions: empty, any(), all(), and bool() to provide a way to summarize a boolean
result.
In [48]: (df > 0).all()
Out[48]:
one False
two False
three False
dtype: bool
You can test if a pandas object is empty, via the empty property.
In [51]: df.empty
Out[51]: False
In [52]: pd.DataFrame(columns=list('ABC')).empty
Out[52]: True
To evaluate single-element pandas objects in a boolean context, use the method bool():
In [53]: pd.Series([True]).bool()
Out[53]: True
In [54]: pd.Series([False]).bool()
Out[54]: False
In [55]: pd.DataFrame([[True]]).bool()
Out[55]: True
In [56]: pd.DataFrame([[False]]).bool()
Out[56]: False
You might be tempted to write:
>>> if df:
...     pass
Or
>>> df and df2
These will both raise errors, as you are trying to compare multiple values:
ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().
Often you may find that there is more than one way to compute the same result. As a simple example,
consider df + df and df * 2. To test that these two computations produce the same result, given the tools
shown above, you might imagine using (df + df == df * 2).all(). But in fact, this expression is False:
In [57]: df + df == df * 2
Out[57]:
one two three
a True True False
b True True True
c True True True
d False True True
So, NDFrames (such as Series and DataFrames) have an equals() method for testing equality, with NaNs
in corresponding locations treated as equal.
Note that the Series or DataFrame index needs to be in the same order for equality to be True:
In [61]: df1 = pd.DataFrame({'col': ['foo', 0, np.nan]})
In [63]: df1.equals(df2)
Out[63]: False
In [64]: df1.equals(df2.sort_index())
Out[64]: True
You can conveniently perform element-wise comparisons when comparing a pandas data structure with a
scalar value:
In [65]: pd.Series(['foo', 'bar', 'baz']) == 'foo'
Out[65]:
0 True
1 False
2 False
dtype: bool
Note that this is different from the NumPy behavior where a comparison can be broadcast:
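The NumPy snippet is missing; the behavior being contrasted is the broadcast comparison:

# NumPy broadcasts the length-1 array across the comparison
np.array([1, 2, 3]) == np.array([2])
# -> array([False,  True, False])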
A problem occasionally arising is the combination of two similar data sets where values in one are preferred
over the other. An example would be two data series representing a particular economic indicator where one is
considered to be of “higher quality”. However, the lower quality series might extend further back in history
or have more complete data coverage. As such, we would like to combine two DataFrame objects where
missing values in one DataFrame are conditionally filled with like-labeled values from the other DataFrame.
The function implementing this operation is combine_first(), which we illustrate:
In [71]: df1 = pd.DataFrame({'A': [1., np.nan, 3., 5., np.nan],
....: 'B': [np.nan, 2., 3., np.nan, 6.]})
....:
In [73]: df1
Out[73]:
A B
0 1.0 NaN
1 NaN 2.0
2 3.0 3.0
3 5.0 NaN
4 NaN 6.0
In [74]: df2
Out[74]:
A B
0 5.0 NaN
1 2.0 NaN
2 4.0 3.0
3 NaN 4.0
4 3.0 6.0
5 7.0 8.0
In [75]: df1.combine_first(df2)
Out[75]:
A B
0 1.0 NaN
1 2.0 2.0
2 3.0 3.0
3 5.0 4.0
4 3.0 6.0
5 7.0 8.0
The combine_first() method above calls the more general DataFrame.combine(). This method takes
another DataFrame and a combiner function, aligns the input DataFrame and then passes the combiner
function pairs of Series (i.e., columns whose names are the same).
So, for instance, to reproduce combine_first() as above:
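The code is not shown; a sketch using DataFrame.combine() with an explicit combiner function:

def combiner(x, y):
    # Prefer values from x, falling back to y where x is missing
    return np.where(pd.isna(x), y, x)

df1.combine(df2, combiner)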
There exists a large number of methods for computing descriptive statistics and other related operations on
Series, DataFrame. Most of these are aggregations (hence producing a lower-dimensional result) like sum(),
mean(), and quantile(), but some of them, like cumsum() and cumprod(), produce an object of the same
size. Generally speaking, these methods take an axis argument, just like ndarray.{sum, std, …}, but the
axis can be specified by name or integer:
• Series: no axis argument needed
• DataFrame: “index” (axis=0, default), “columns” (axis=1)
For example:
In [77]: df
Out[77]:
one two three
a -0.466716 0.029137 NaN
b -0.965699 1.967624 -0.384693
c 0.544505 0.612498 0.566326
d NaN -0.191558 0.106820
In [78]: df.mean(0)
Out[78]:
one -0.295970
two 0.604425
three 0.096151
dtype: float64
In [79]: df.mean(1)
Out[79]:
a -0.218790
b 0.205744
c 0.574443
d -0.042369
dtype: float64
All such methods have a skipna option signaling whether to exclude missing data (True by default):
In [80]: df.sum(0, skipna=False)
Out[80]:
one NaN
two 2.417701
three NaN
dtype: float64
In [83]: ts_stand.std()
Out[83]:
one 1.0
two 1.0
three 1.0
dtype: float64
In [85]: xs_stand.std(1)
Out[85]:
a 1.0
b 1.0
c 1.0
d 1.0
dtype: float64
Note that methods like cumsum() and cumprod() preserve the location of NaN values. This is somewhat
different from expanding() and rolling(). For more details please see this note.
In [86]: df.cumsum()
Out[86]:
one two three
a -0.466716 0.029137 NaN
b -1.432415 1.996760 -0.384693
c -0.887909 2.609259 0.181633
d NaN 2.417701 0.288453
Here is a quick reference summary table of common functions. Each also takes an optional level parameter
which applies only if the object has a hierarchical index.
Function Description
count Number of non-NA observations
sum Sum of values
mean Mean of values
mad Mean absolute deviation
median Arithmetic median of values
min Minimum
max Maximum
mode Mode
abs Absolute Value
prod Product of values
std Bessel-corrected sample standard deviation
var Unbiased variance
sem Standard error of the mean
skew Sample skewness (3rd moment)
kurt Sample kurtosis (4th moment)
quantile Sample quantile (value at %)
cumsum Cumulative sum
cumprod Cumulative product
cummax Cumulative maximum
cummin Cumulative minimum
Note that by chance some NumPy methods, like mean, std, and sum, will exclude NAs on Series input by
default:
In [87]: np.mean(df['one'])
Out[87]: -0.29596980138380163
In [88]: np.mean(df['one'].to_numpy())
Out[88]: nan
Series.nunique() will return the number of unique non-NA values in a Series:
In [91]: series[10:20] = 5
In [92]: series.nunique()
Out[92]: 11
There is a convenient describe() function which computes a variety of summary statistics about a Series
or the columns of a DataFrame (excluding NAs of course):
In [95]: series.describe()
In [98]: frame.describe()
Out[98]:
a b c d e
count 500.000000 500.000000 500.000000 500.000000 500.000000
mean -0.051793 0.014006 -0.088499 -0.065968 0.099254
std 0.946751 0.953261 1.012176 1.021467 0.939771
min -2.778772 -2.932532 -3.194228 -2.490294 -2.640602
25% -0.701138 -0.632182 -0.716483 -0.787017 -0.546624
50% -0.075766 0.010519 -0.061189 -0.076329 0.095151
75% 0.620752 0.681123 0.642695 0.673043 0.743923
max 2.470484 2.415173 2.952825 2.850750 3.220515
In [101]: s.describe()
Note that on a mixed-type DataFrame object, describe() will restrict the summary to include only numer-
ical columns or, if none are, only categorical columns:
In [103]: frame.describe()
Out[103]:
b
count 4.000000
mean 1.500000
std 1.290994
min 0.000000
25% 0.750000
50% 1.500000
75% 2.250000
max 3.000000
This behavior can be controlled by providing a list of types as include/exclude arguments. The special
value all can also be used:
In [104]: frame.describe(include=['object'])
Out[104]:
a
count 4
unique 2
top Yes
freq 2
In [105]: frame.describe(include=['number'])
Out[105]:
b
count 4.000000
mean 1.500000
std 1.290994
min 0.000000
25% 0.750000
50% 1.500000
75% 2.250000
max 3.000000
In [106]: frame.describe(include='all')
Out[106]:
a b
count 4 4.000000
unique 2 NaN
top Yes NaN
freq 2 NaN
mean NaN 1.500000
std NaN 1.290994
min NaN 0.000000
25% NaN 0.750000
50% NaN 1.500000
75% NaN 2.250000
max NaN 3.000000
That feature relies on select_dtypes. Refer to there for details about accepted inputs.
The idxmin() and idxmax() functions on Series and DataFrame compute the index labels with the minimum
and maximum corresponding values:
In [107]: s1 = pd.Series(np.random.randn(5))
In [108]: s1
Out[108]:
0 -1.882481
1 0.007554
2 0.851716
3 -0.395161
4 -0.304605
dtype: float64
In [111]: df1
Out[111]:
A B C
0 0.029120 -1.109406 -0.690599
1 -1.083192 0.419577 -1.165015
2 1.112228 0.169718 -1.251593
3 -0.313268 -0.517805 -0.736270
4 -0.354983 0.601455 0.452575
In [112]: df1.idxmin(axis=0)
Out[112]:
A 1
B 0
C 2
dtype: int64
In [113]: df1.idxmax(axis=1)
Out[113]:
0 A
1 B
2 A
3 A
4 B
dtype: object
When there are multiple rows (or columns) matching the minimum or maximum value, idxmin() and
idxmax() return the first matching index:
In [114]: df3 = pd.DataFrame([2, 1, 1, 3, np.nan], columns=['A'], index=list('edcba'))
In [115]: df3
Out[115]:
A
e 2.0
d 1.0
c 1.0
b 3.0
a NaN
In [116]: df3['A'].idxmin()
Out[116]: 'd'
Note: idxmin and idxmax are called argmin and argmax in NumPy.
The value_counts() Series method and top-level function computes a histogram of a 1D array of values.
It can also be used as a function on regular arrays:
In [117]: data = np.random.randint(0, 7, size=50)
In [118]: data
Out[118]:
array([1, 1, 0, 4, 1, 3, 4, 5, 2, 3, 0, 6, 5, 1, 5, 4, 5, 2, 6, 3, 6, 6,
1, 2, 0, 3, 3, 3, 4, 3, 2, 4, 4, 0, 3, 2, 5, 3, 5, 5, 6, 5, 1, 3,
6, 5, 0, 6, 1, 1])
In [119]: s = pd.Series(data)
In [120]: s.value_counts()
Out[120]:
3 10
5 9
1 8
6 7
4 6
2 5
0 5
dtype: int64
In [121]: pd.value_counts(data)
Out[121]:
3 10
5 9
1 8
6 7
4 6
2 5
0 5
dtype: int64
Similarly, you can get the most frequently occurring value(s) (the mode) of the values in a Series or
DataFrame:
In [123]: s5.mode()
Out[123]:
0 3
1 7
dtype: int64
In [125]: df5.mode()
Out[125]:
A B
0 1 1
1 4 2
Continuous values can be discretized using the cut() (bins based on values) and qcut() (bins based on
sample quantiles) functions:
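The setup that produced factor in the next display is missing; presumably:

arr = np.random.randn(20)

# Bin the values into 4 equal-width intervals
factor = pd.cut(arr, 4)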
In [128]: factor
Out[128]:
[(0.631, 1.318], (0.631, 1.318], (0.631, 1.318], (-0.0556, 0.631], (-0.742, -0.0556], ..., (-0.742, -0.0556], (-0.742, -0.0556], (-0.742, -0.0556], (-0.0556, 0.631], (0.631, 1.318]]
Length: 20
Categories (4, interval[float64]): [(-1.432, -0.742] < (-0.742, -0.0556] < (-0.0556, 0.631] < (0.631, 1.318]]
In [130]: factor
Out[130]:
[(0, 1], (0, 1], (0, 1], (0, 1], (-1, 0], ..., (-1, 0], (-1, 0], (-1, 0], (0, 1], (1, 5]]
Length: 20
Categories (4, interval[int64]): [(-5, -1] < (-1, 0] < (0, 1] < (1, 5]]
qcut() computes sample quantiles. For example, we could slice up some normally distributed data into
equal-size quartiles like so:
In [131]: arr = np.random.randn(30)
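The intervening step was presumably the qcut() call that produced the quartiles shown below:

factor = pd.qcut(arr, [0, .25, .5, .75, 1])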
In [133]: factor
Out[133]:
[(-0.69, 0.0118], (0.874, 2.622], (0.0118, 0.874], (-0.69, 0.0118], (0.874, 2.622], ..., (-1.7619999999999998, -0.69], (0.874, 2.622], (0.0118, 0.874], (0.874, 2.622], (0.0118, 0.874]]
Length: 30
Categories (4, interval[float64]): [(-1.7619999999999998, -0.69] < (-0.69, 0.0118] < (0.0118, 0.874] < (0.874, 2.622]]
In [134]: pd.value_counts(factor)
Out[134]:
(0.874, 2.622] 8
(-1.7619999999999998, -0.69] 8
(0.0118, 0.874] 7
(-0.69, 0.0118] 7
dtype: int64
We can also pass infinite values to define the bins:
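The code is missing; a sketch:

arr = np.random.randn(20)

# Use -inf and +inf as the outer bin edges
factor = pd.cut(arr, [-np.inf, 0, np.inf])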
In [137]: factor
Out[137]:
[(-inf, 0.0], (0.0, inf], (-inf, 0.0], (-inf, 0.0], (0.0, inf], ..., (0.0, inf], (0.0, inf], (0.0, inf], (-inf, 0.0], (0.0, inf]]
Length: 20
Categories (2, interval[float64]): [(-inf, 0.0] < (0.0, inf]]
To apply your own or another library's functions to pandas objects, you should be aware of the methods
below. The appropriate method to use depends on whether your function expects to operate on an entire
DataFrame or Series, row- or column-wise, or elementwise.
1. Tablewise Function Application: pipe()
2. Row or Column-wise Function Application: apply()
3. Aggregation API : agg() and transform()
4. Applying Elementwise Functions: applymap()
DataFrames and Series can of course just be passed into functions. However, if the function needs to be
called in a chain, consider using the pipe() method. Compare the following
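The first half of the comparison (plain nested calls) is missing; assuming f, g, and h are functions that take and return DataFrames, it reads roughly:

>>> # f, g, and h are functions taking and returning DataFrames
>>> f(g(h(df), arg1=1), arg2=2, arg3=3)

which is equivalent to: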
>>> (df.pipe(h)
... .pipe(g, arg1=1)
... .pipe(f, arg2=2, arg3=3))
Pandas encourages the second style, which is known as method chaining. pipe makes it easy to use your
own or another library’s functions in method chains, alongside pandas’ methods.
In the example above, the functions f, g, and h each expected the DataFrame as the first positional argument.
What if the function you wish to apply takes its data as, say, the second argument? In this case, provide
pipe with a tuple of (callable, data_keyword). .pipe will route the DataFrame to the argument specified
in the tuple.
For example, we can fit a regression using statsmodels. Their API expects a formula first and a DataFrame
as the second argument, data. We pass in the function, keyword pair (sm.ols, 'data') to pipe:
In [138]: import statsmodels.formula.api as sm
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.49e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
"""
The pipe method is inspired by unix pipes and more recently dplyr and magrittr, which have introduced the
popular (%>%) (read pipe) operator for R. The implementation of pipe here is quite clean and feels right at
home in python. We encourage you to view the source code of pipe().
Arbitrary functions can be applied along the axes of a DataFrame using the apply() method, which, like
the descriptive statistics methods, takes an optional axis argument:
In [141]: df.apply(np.mean)
Out[141]:
one -0.295970
two 0.604425
three 0.096151
dtype: float64
In [144]: df.apply(np.cumsum)
Out[144]:
one two three
a -0.466716 0.029137 NaN
b -1.432415 1.996760 -0.384693
c -0.887909 2.609259 0.181633
d NaN 2.417701 0.288453
In [145]: df.apply(np.exp)
Out[145]:
one two three
a 0.627058 1.029566 NaN
b 0.380717 7.153656 0.680659
c 1.723756 1.845035 1.761782
d NaN 0.825672 1.112734
The apply() method will also dispatch on a string method name.
In [146]: df.apply('mean')
Out[146]:
one -0.295970
two 0.604425
three 0.096151
dtype: float64
You may also pass additional arguments and keyword arguments to the apply() method. For instance,
consider the following function you would like to apply:
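The function is not shown; a hypothetical example matching the description, with the extra arguments passed through apply():

def subtract_and_divide(x, sub, divide=1):
    return (x - sub) / divide

# Positional arguments go in args=, keyword arguments are passed directly
df.apply(subtract_and_divide, args=(5,), divide=3)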
Another useful feature is the ability to pass Series methods to carry out some Series operation on each
column or row:
In [150]: tsdf
Out[150]:
A B C
2000-01-01 0.752721 -0.364275 0.134481
2000-01-02 0.826213 1.208276 -0.154643
2000-01-03 0.041831 0.479876 -0.956380
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 -1.549128 1.145240 0.320799
2000-01-09 0.507849 -1.142744 -0.155781
2000-01-10 -0.169863 1.813758 0.771396
In [151]: tsdf.apply(pd.Series.interpolate)
Out[151]:
A B C
2000-01-01 0.752721 -0.364275 0.134481
2000-01-02 0.826213 1.208276 -0.154643
2000-01-03 0.041831 0.479876 -0.956380
2000-01-04 -0.276361 0.612949 -0.700945
2000-01-05 -0.594552 0.746021 -0.445509
2000-01-06 -0.912744 0.879094 -0.190073
2000-01-07 -1.230936 1.012167 0.065363
2000-01-08 -1.549128 1.145240 0.320799
2000-01-09 0.507849 -1.142744 -0.155781
2000-01-10 -0.169863 1.813758 0.771396
Finally, apply() takes an argument raw which is False by default, which converts each row or column into
a Series before applying the function. When set to True, the passed function will instead receive an ndarray
object, which has positive performance implications if you do not need the indexing functionality.
Aggregation API
In [154]: tsdf
Out[154]:
A B C
2000-01-01 -0.096512 0.747788 0.465973
2000-01-02 1.596492 -1.093022 -0.571837
2000-01-03 0.090216 -1.615127 -0.025061
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 -0.051882 -0.369806 -1.363744
2000-01-09 2.243620 -1.063587 0.448386
2000-01-10 0.096756 0.781555 -1.912402
Using a single function is equivalent to apply(). You can also pass named methods as strings. These will
return a Series of the aggregated output:
In [155]: tsdf.agg(np.sum)
Out[155]:
A 3.878691
B -2.612201
C -2.958685
dtype: float64
In [156]: tsdf.agg('sum')
Out[156]:
A 3.878691
B -2.612201
C -2.958685
dtype: float64
In [158]: tsdf.A.agg('sum')
Out[158]: 3.8786905103802636
You can pass multiple aggregation arguments as a list. The results of each of the passed functions will be a
row in the resulting DataFrame. These are naturally named from the aggregation function.
In [159]: tsdf.agg(['sum'])
Out[159]:
A B C
sum 3.878691 -2.612201 -2.958685
Passing a named function will yield that name for the row:
Passing a dictionary of column names to a scalar or a list of scalars to DataFrame.agg allows you to
customize which functions are applied to which columns. Note that the results are not in any particular
order; you can use an OrderedDict instead to guarantee ordering.
Passing a list-like will generate a DataFrame output. You will get a matrix-like output of all of the aggregators.
The output will consist of all unique functions. Those that are not noted for a particular column will be
NaN:
Mixed dtypes
When presented with mixed dtypes that cannot aggregate, .agg will only take the valid aggregations. This
is similar to how groupby .agg works.
In [168]: mdf.dtypes
Out[168]:
A int64
B float64
C object
D datetime64[ns]
dtype: object
Custom describe
With .agg() it is possible to easily create a custom describe function, similar to the built-in describe function.
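A sketch along these lines (the percentile helpers are illustrative):
from functools import partial

q_25 = partial(pd.Series.quantile, q=0.25)
q_25.__name__ = '25%'
q_75 = partial(pd.Series.quantile, q=0.75)
q_75.__name__ = '75%'

tsdf.agg(['count', 'mean', 'std', 'min', q_25, 'median', q_75, 'max'])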
Transform API
In [178]: tsdf
Out[178]:
A B C
2000-01-01 -0.646172 0.316546 0.442399
2000-01-02 -0.656399 -2.103519 0.373859
2000-01-03 0.479981 -0.047075 -0.002847
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 1.029689 0.548916 0.960356
2000-01-09 0.800346 0.604359 0.324512
2000-01-10 -1.402090 -0.915210 0.541518
Transform the entire frame. .transform() allows input functions as: a NumPy function, a string function
name or a user defined function.
In [179]: tsdf.transform(np.abs)
Out[179]:
A B C
2000-01-01 0.646172 0.316546 0.442399
2000-01-02 0.656399 2.103519 0.373859
2000-01-03 0.479981 0.047075 0.002847
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 1.029689 0.548916 0.960356
2000-01-09 0.800346 0.604359 0.324512
2000-01-10 1.402090 0.915210 0.541518
In [180]: tsdf.transform('abs')
Out[180]:
A B C
2000-01-01 0.646172 0.316546 0.442399
2000-01-02 0.656399 2.103519 0.373859
2000-01-03 0.479981 0.047075 0.002847
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 1.029689 0.548916 0.960356
2000-01-09 0.800346 0.604359 0.324512
2000-01-10 1.402090 0.915210 0.541518
Passing a single function to .transform() with a Series will yield a single Series in return.
In [183]: tsdf.A.transform(np.abs)
Out[183]:
2000-01-01 0.646172
2000-01-02 0.656399
2000-01-03 0.479981
2000-01-04 NaN
2000-01-05 NaN
2000-01-06 NaN
2000-01-07 NaN
2000-01-08 1.029689
2000-01-09 0.800346
2000-01-10 1.402090
Freq: D, Name: A, dtype: float64
Passing multiple functions will yield a column MultiIndexed DataFrame. The first level will be the original
frame column names; the second level will be the names of the transforming functions.
Passing multiple functions to a Series will yield a DataFrame. The resulting column names will be the
transforming functions.
Passing a dict of lists will generate a MultiIndexed DataFrame with these selective transforms.
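Sketches of these three forms on the tsdf frame above:
# multiple functions -> column MultiIndex (original column name, function name)
tsdf.transform([np.abs, lambda x: x + 1])

# multiple functions on a single Series -> DataFrame
tsdf['A'].transform([np.abs, lambda x: x + 1])

# dict of lists -> MultiIndexed DataFrame with selective transforms
tsdf.transform({'A': np.abs, 'B': [lambda x: x + 1, 'sqrt']})  # sqrt of negatives yields NaN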
Since not all functions can be vectorized (accept NumPy arrays and return another array or value), the
methods applymap() on DataFrame and analogously map() on Series accept any Python function taking a
single value and returning a single value. For example:
In [188]: df4
Out[188]:
one two three
a -0.466716 0.029137 NaN
b -0.965699 1.967624 -0.384693
c 0.544505 0.612498 0.566326
In [190]: df4['one'].map(f)
Out[190]:
a 19
b 19
c 17
d 3
Name: one, dtype: int64
In [191]: df4.applymap(f)
Out[191]:
one two three
a 19 20 3
b 19 18 19
c 17 17 18
d 3 20 18
Series.map() has an additional feature; it can be used to easily “link” or “map” values defined by a
secondary series. This is closely related to merging/joining functionality:
In [192]: s = pd.Series(['six', 'seven', 'six', 'seven', 'six'],
.....: index=['a', 'b', 'c', 'd', 'e'])
.....:
In [194]: s
Out[194]:
a six
b seven
c six
d seven
e six
dtype: object
In [195]: s.map(t)
Out[195]:
a 6.0
b 7.0
c 6.0
d 7.0
e 6.0
dtype: float64
reindex() is the fundamental data alignment method in pandas. It is used to implement nearly all other
features relying on label-alignment functionality. To reindex means to conform the data to match a given
set of labels along a particular axis. This accomplishes several things:
In [197]: s
Out[197]:
a 1.910501
b -0.247105
c -1.134167
d -1.591918
e 0.243080
dtype: float64
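For instance, reindexing this Series to a different set of labels might look like (the label list is illustrative):
s.reindex(['e', 'b', 'f', 'd'])   # 'f' is not in the original index, so it shows up as NaN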
Note that the Index objects containing the actual axis labels can be shared between objects. So if we have
a Series and a DataFrame, the following can be done:
In [202]: rs = s.reindex(df.index)
In [203]: rs
Out[203]:
a 1.910501
b -0.247105
c -1.134167
d -1.591918
dtype: float64
Note: When writing performance-sensitive code, there is a good reason to spend some time becoming a
reindexing ninja: many operations are faster on pre-aligned data. Adding two unaligned DataFrames
internally triggers a reindexing step. For exploratory analysis you will hardly notice the difference (because
reindex has been heavily optimized), but when CPU cycles matter sprinkling a few explicit reindex calls
here and there can have an impact.
You may wish to take an object and reindex its axes to be labeled the same as another object. While the
syntax for this is straightforward albeit verbose, it is a common enough operation that the reindex_like()
method is available to make this simpler:
In [207]: df2
Out[207]:
one two
a -0.466716 0.029137
b -0.965699 1.967624
c 0.544505 0.612498
In [208]: df3
Out[208]:
one two
a -0.170747 -0.840616
b -0.669729 1.097871
c 0.840475 -0.257255
In [209]: df.reindex_like(df2)
Out[209]:
one two
a -0.466716 0.029137
b -0.965699 1.967624
c 0.544505 0.612498
The align() method is the fastest way to simultaneously align two objects. It supports a join argument
(related to joining and merging):
• join='outer': take the union of the indexes (default)
• join='left': use the calling object’s index
• join='right': use the passed object’s index
• join='inner': intersect the indexes
It returns a tuple with both of the reindexed Series:
In [210]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
In [211]: s1 = s[:4]
In [212]: s2 = s[1:]
In [213]: s1.align(s2)
Out[213]:
(a -1.362711
b 0.385943
c -0.538184
d 0.502612
e NaN
dtype: float64, a NaN
b 0.385943
c -0.538184
d 0.502612
e -1.491718
dtype: float64)
You can also pass an axis option to only align on the specified axis:
If you pass a Series to DataFrame.align(), you can choose to align both objects either on the DataFrame’s
index or columns using the axis argument:
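Sketches of both options (the object names are illustrative):
# align only along the index (rows)
df.align(df2, join='inner', axis=0)

# align a Series on the DataFrame's columns
df.align(df2.iloc[0], axis=1)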
reindex() takes an optional parameter method which is a filling method chosen from the following table:
Method Action
pad / ffill Fill values forward
bfill / backfill Fill values backward
nearest Fill from the nearest index value
In [222]: ts
Out[222]:
2000-01-03 -0.267818
2000-01-04 0.951544
2000-01-05 1.220097
2000-01-06 1.083557
2000-01-07 -0.556464
2000-01-08 0.614753
2000-01-09 0.617631
2000-01-10 -0.131812
Freq: D, dtype: float64
In [223]: ts2
Out[223]:
2000-01-03 -0.267818
2000-01-06 1.083557
2000-01-09 0.617631
dtype: float64
In [224]: ts2.reindex(ts.index)
Out[224]:
2000-01-03 -0.267818
2000-01-04 NaN
2000-01-05 NaN
2000-01-06 1.083557
2000-01-07 NaN
2000-01-08 NaN
2000-01-09 0.617631
2000-01-10 NaN
Freq: D, dtype: float64
Passing method='ffill' fills these gaps forward while reindexing:
ts2.reindex(ts.index, method='ffill')
2000-01-03 -0.267818
2000-01-04 -0.267818
2000-01-05 -0.267818
2000-01-06 1.083557
2000-01-07 1.083557
2000-01-08 1.083557
2000-01-09 0.617631
2000-01-10 0.617631
Freq: D, dtype: float64
In [228]: ts2.reindex(ts.index).fillna(method='ffill')
Out[228]:
2000-01-03 -0.267818
2000-01-04 -0.267818
2000-01-05 -0.267818
2000-01-06 1.083557
2000-01-07 1.083557
2000-01-08 1.083557
2000-01-09 0.617631
2000-01-10 0.617631
Freq: D, dtype: float64
When a fill method is used, reindex() will raise a ValueError if the index is not monotonically increasing or decreasing. fillna() and interpolate() will not perform any checks on the order of the index.
The limit and tolerance arguments provide additional control over filling while reindexing. Limit specifies
the maximum count of consecutive matches:
In contrast, tolerance specifies the maximum distance between the index and indexer values:
Notice that when used on a DatetimeIndex, TimedeltaIndex or PeriodIndex, tolerance will be coerced into a Timedelta if possible. This allows you to specify tolerance with appropriate strings.
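Sketches of both arguments on the ts/ts2 example above:
ts2.reindex(ts.index, method='ffill', limit=1)            # at most one consecutive fill
ts2.reindex(ts.index, method='ffill', tolerance='1 day')  # only fill within one day of the target label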
A method closely related to reindex is the drop() function. It removes a set of labels from an axis:
In [231]: df
Out[231]:
one two three
a -0.466716 0.029137 NaN
b -0.965699 1.967624 -0.384693
c 0.544505 0.612498 0.566326
d NaN -0.191558 0.106820
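For example (a sketch on the df frame above):
df.drop(['a'], axis=0)    # drop row label 'a'
df.drop(['one'], axis=1)  # drop column 'one'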
Note that the following also works, but is a bit less obvious / clean:
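A sketch of that alternative using index set operations:
df.reindex(df.index.difference(['a']))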
The rename() method allows you to relabel an axis based on some mapping (a dict or Series) or an arbitrary
function.
In [235]: s
Out[235]:
a -1.362711
b 0.385943
c -0.538184
d 0.502612
e -1.491718
dtype: float64
In [236]: s.rename(str.upper)
Out[236]:
A -1.362711
B 0.385943
C -0.538184
D 0.502612
E -1.491718
dtype: float64
If you pass a function, it must return a value when called with any of the labels (and must produce a set of
unique values). A dict or Series can also be used:
If the mapping doesn’t include a column/index label, it isn’t renamed. Note that extra labels in the mapping
don’t throw an error.
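For example (a sketch on the df frame above):
df.rename(columns={'one': 'foo', 'three': 'baz'},
          index={'a': 'apple', 'b': 'banana', 'd': 'durian'})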
New in version 0.21.0.
DataFrame.rename() also supports an “axis-style” calling convention, where you specify a single mapper
and the axis to apply that mapping to.
In [238]: df.rename({'one': 'foo', 'two': 'bar'}, axis='columns')
Out[238]:
foo bar three
a -0.466716 0.029137 NaN
b -0.965699 1.967624 -0.384693
c 0.544505 0.612498 0.566326
d NaN -0.191558 0.106820
In [240]: s.rename("scalar-name")
Out[240]:
a -1.362711
b 0.385943
c -0.538184
d 0.502612
e -1.491718
Name: scalar-name, dtype: float64
In [242]: df
Out[242]:
x y
let num
a 1 1 10
2 2 20
b 1 3 30
2 4 40
c 1 5 50
2 6 60
In [244]: df.rename_axis(index=str.upper)
Out[244]:
x y
LET NUM
a 1 1 10
2 2 20
b 1 3 30
2 4 40
c 1 5 50
2 6 60
3.3.8 Iteration
The behavior of basic iteration over pandas objects depends on the type. When iterating over a Series, it is
regarded as array-like, and basic iteration produces the values. DataFrames follow the dict-like convention
of iterating over the “keys” of the objects.
In short, basic iteration (for i in object) produces:
• Series: values
• DataFrame: column labels
Thus, for example, iterating over a DataFrame gives you the column names:
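For example (a sketch):
for col in df:
    print(col)   # prints each column label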
Pandas objects also have the dict-like items() method to iterate over the (key, value) pairs.
To iterate over the rows of a DataFrame, you can use the following methods:
• iterrows(): Iterate over the rows of a DataFrame as (index, Series) pairs. This converts the rows to
Series objects, which can change the dtypes and has some performance implications.
• itertuples(): Iterate over the rows of a DataFrame as namedtuples of the values. This is a lot faster
than iterrows(), and is in most cases preferable to use to iterate over the values of a DataFrame.
Warning: Iterating through pandas objects is generally slow. In many cases, iterating manually over
the rows is not needed and can be avoided with one of the following approaches:
• Look for a vectorized solution: many operations can be performed using built-in methods or NumPy
functions, (boolean) indexing, …
• When you have a function that cannot work on the full DataFrame/Series at once, it is better to
use apply() instead of iterating over the values. See the docs on function application.
• If you need to do iterative manipulations on the values but performance is important, consider
writing the inner loop with cython or numba. See the enhancing performance section for some
examples of this approach.
Warning: You should never modify something you are iterating over. This is not guaranteed to work
in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it
will have no effect!
For example, in the following case setting the value has no effect:
In [247]: df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
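The modifying loop itself is not shown here; it is presumably of this form (a sketch), and it leaves df unchanged because iterrows() hands out copies:
for index, row in df.iterrows():
    row['a'] = 10   # modifies a copy, not df itself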
In [249]: df
Out[249]:
a b
0 1 a
1 2 b
2 3 c
items
Consistent with the dict-like interface, items() iterates through key-value pairs:
• Series: (index, scalar value) pairs
• DataFrame: (column, Series) pairs
For example:
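A sketch of items() on a DataFrame:
for label, ser in df.items():
    print(label)
    print(ser)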
iterrows
iterrows() allows you to iterate through the rows of a DataFrame as Series objects. It returns an iterator
yielding each index value along with a Series containing the data in each row:
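For example (a sketch):
for row_index, row in df.iterrows():
    print(row_index, row, sep='\n')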
Note: Because iterrows() returns a Series for each row, it does not preserve dtypes across the rows
(dtypes are preserved across columns for DataFrames). For example,
In [253]: df_orig.dtypes
Out[253]:
int int64
float float64
dtype: object
In [255]: row
Out[255]:
int 1.0
float 1.5
Name: 0, dtype: float64
All values in row, returned as a Series, are now upcast to floats, including the original integer value in column 'int':
In [256]: row['int'].dtype
Out[256]: dtype('float64')
In [257]: df_orig['int'].dtype
Out[257]: dtype('int64')
To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples
of the values and which is generally much faster than iterrows().
In [259]: print(df2)
x y
0 1 4
1 2 5
2 3 6
In [260]: print(df2.T)
0 1 2
x 1 2 3
y 4 5 6
In [262]: print(df2_t)
0 1 2
x 1 2 3
y 4 5 6
itertuples
The itertuples() method will return an iterator yielding a namedtuple for each row in the DataFrame.
The first element of the tuple will be the row’s corresponding index value, while the remaining values are
the row values.
For instance:
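A sketch of the kind of loop meant here:
for row in df.itertuples():
    print(row)   # e.g. Pandas(Index=0, a=1, b='a')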
This method does not convert the row to a Series object; it merely returns the values inside a namedtuple.
Therefore, itertuples() preserves the data type of the values and is generally faster than iterrows().
Note: The column names will be renamed to positional names if they are invalid Python identifiers,
repeated, or start with an underscore. With a large number of columns (>255), regular tuples are returned.
Series has an accessor to succinctly return datetime like properties for the values of the Series, if it is a
datetime/period like Series. This will return a Series, indexed like the existing Series.
# datetime
In [264]: s = pd.Series(pd.date_range('20130101 09:10:12', periods=4))
In [265]: s
Out[265]:
0 2013-01-01 09:10:12
1 2013-01-02 09:10:12
2 2013-01-03 09:10:12
3 2013-01-04 09:10:12
dtype: datetime64[ns]
In [266]: s.dt.hour
Out[266]:
0 9
1 9
2 9
3 9
dtype: int64
In [267]: s.dt.second
Out[267]:
0 12
1 12
2 12
3 12
dtype: int64
In [268]: s.dt.day
Out[268]:
0 1
1 2
2 3
3 4
dtype: int64
This enables nice expressions like this:
In [269]: s[s.dt.day == 2]
Out[269]:
1 2013-01-02 09:10:12
dtype: datetime64[ns]
In [271]: stz
Out[271]:
0 2013-01-01 09:10:12-05:00
1 2013-01-02 09:10:12-05:00
2 2013-01-03 09:10:12-05:00
3 2013-01-04 09:10:12-05:00
dtype: datetime64[ns, US/Eastern]
In [272]: stz.dt.tz
Out[272]: <DstTzInfo 'US/Eastern' LMT-1 day, 19:04:00 STD>
You can also chain these types of operations:
In [273]: s.dt.tz_localize('UTC').dt.tz_convert('US/Eastern')
Out[273]:
You can also format datetime values as strings with Series.dt.strftime() which supports the same format
as the standard strftime().
# DatetimeIndex
In [274]: s = pd.Series(pd.date_range('20130101', periods=4))
In [275]: s
Out[275]:
0 2013-01-01
1 2013-01-02
2 2013-01-03
3 2013-01-04
dtype: datetime64[ns]
In [276]: s.dt.strftime('%Y/%m/%d')
Out[276]:
0 2013/01/01
1 2013/01/02
2 2013/01/03
3 2013/01/04
dtype: object
# PeriodIndex
In [277]: s = pd.Series(pd.period_range('20130101', periods=4))
In [278]: s
Out[278]:
0 2013-01-01
1 2013-01-02
2 2013-01-03
3 2013-01-04
dtype: period[D]
In [279]: s.dt.strftime('%Y/%m/%d')
Out[279]:
0 2013/01/01
1 2013/01/02
2 2013/01/03
3 2013/01/04
dtype: object
The .dt accessor works for period and timedelta dtypes.
# period
In [280]: s = pd.Series(pd.period_range('20130101', periods=4, freq='D'))
In [281]: s
Out[281]:
0 2013-01-01
1 2013-01-02
2 2013-01-03
3 2013-01-04
dtype: period[D]
In [282]: s.dt.year
Out[282]:
0 2013
1 2013
2 2013
3 2013
dtype: int64
In [283]: s.dt.day
Out[283]:
0 1
1 2
2 3
3 4
dtype: int64
# timedelta
In [284]: s = pd.Series(pd.timedelta_range('1 day 00:00:05', periods=4, freq='s'))
In [285]: s
Out[285]:
0 1 days 00:00:05
1 1 days 00:00:06
2 1 days 00:00:07
3 1 days 00:00:08
dtype: timedelta64[ns]
In [286]: s.dt.days
Out[286]:
0 1
1 1
2 1
3 1
dtype: int64
In [287]: s.dt.seconds
Out[287]:
0 5
1 6
2 7
3 8
dtype: int64
In [288]: s.dt.components
Out[288]:
days hours minutes seconds milliseconds microseconds nanoseconds
0 1 0 0 5 0 0 0
1 1 0 0 6 0 0 0
2 1 0 0 7 0 0 0
3 1 0 0 8 0 0 0
Note: Series.dt will raise a TypeError if you access it with non-datetime-like values.
Series is equipped with a set of string processing methods that make it easy to operate on each element of
the array. Perhaps most importantly, these methods exclude missing/NA values automatically. These are
accessed via the Series’s str attribute and generally have names matching the equivalent (scalar) built-in
string methods. For example:
In [290]: s.str.lower()
Out[290]:
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
Powerful pattern-matching methods are provided as well, but note that pattern-matching generally uses
regular expressions by default (and in some cases always uses them).
Please see Vectorized String Methods for a complete description.
3.3.11 Sorting
Pandas supports three kinds of sorting: sorting by index labels, sorting by column values, and sorting by a
combination of both.
By index
The Series.sort_index() and DataFrame.sort_index() methods are used to sort a pandas object by its
index levels.
In [291]: df = pd.DataFrame({
.....: 'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
.....: 'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
.....: 'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
.....:
In [293]: unsorted_df
Out[293]:
three two one
a NaN 0.498142 -0.579017
d -1.636064 -0.051421 NaN
c 0.439120 -1.425223 -1.211745
b 1.021821 0.226007 1.192046
# DataFrame
In [294]: unsorted_df.sort_index()
Out[294]:
three two one
a NaN 0.498142 -0.579017
b 1.021821 0.226007 1.192046
c 0.439120 -1.425223 -1.211745
d -1.636064 -0.051421 NaN
In [295]: unsorted_df.sort_index(ascending=False)
Out[295]:
three two one
d -1.636064 -0.051421 NaN
c 0.439120 -1.425223 -1.211745
b 1.021821 0.226007 1.192046
a NaN 0.498142 -0.579017
In [296]: unsorted_df.sort_index(axis=1)
Out[296]:
one three two
a -0.579017 NaN 0.498142
d NaN -1.636064 -0.051421
c -1.211745 0.439120 -1.425223
b 1.192046 1.021821 0.226007
# Series
In [297]: unsorted_df['three'].sort_index()
Out[297]:
a NaN
b 1.021821
c 0.439120
d -1.636064
Name: three, dtype: float64
By values
The Series.sort_values() method is used to sort a Series by its values. The DataFrame.sort_values()
method is used to sort a DataFrame by its column or row values. The optional by parameter to DataFrame.
sort_values() may used to specify one or more columns to use to determine the sorted order.
In [299]: df1.sort_values(by='two')
Out[299]:
one two three
0 2 1 5
2 1 2 3
1 1 3 4
3 1 4 2
These methods have special treatment of NA values via the na_position argument:
In [301]: s[2] = np.nan
In [302]: s.sort_values()
Out[302]:
0 A
3 Aaba
1 B
4 Baca
6 CABA
8 cat
7 dog
2 NaN
5 NaN
dtype: object
In [303]: s.sort_values(na_position='first')
Out[303]:
2 NaN
5 NaN
0 A
3 Aaba
1 B
4 Baca
6 CABA
8 cat
7 dog
dtype: object
Strings passed as the by parameter to DataFrame.sort_values() may refer to either columns or index level
names.
# Build MultiIndex
In [304]: idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 2),
.....: ('b', 2), ('b', 1), ('b', 1)])
.....:
In [305]: idx.names = ['first', 'second']
# Build DataFrame
In [306]: df_multi = pd.DataFrame({'A': np.arange(6, 0, -1)},
.....: index=idx)
.....:
In [307]: df_multi
Out[307]:
A
first second
a 1 6
2 5
2 4
b 2 3
1 2
1 1
Note: If a string matches both a column name and an index level name then a warning is issued and the
column takes precedence. This will result in an ambiguity error in a future version.
searchsorted
Series has the searchsorted() method, which works similarly to numpy.ndarray.searchsorted().
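A minimal sketch (the values are illustrative):
ser = pd.Series([1, 2, 3])
ser.searchsorted([0, 3])                # array([0, 2])
ser.searchsorted([1, 3], side='right')  # array([1, 3])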
smallest / largest values
Series has the nsmallest() and nlargest() methods which return the smallest or largest n values. For a large Series this can be much faster than sorting the entire Series and calling head(n) on the result.
In [316]: s = pd.Series(np.random.permutation(10))
In [317]: s
Out[317]:
0 3
1 0
2 2
3 6
4 9
5 7
6 4
7 1
8 8
9 5
dtype: int64
In [318]: s.sort_values()
Out[318]:
1 0
7 1
2 2
0 3
6 4
9 5
3 6
5 7
8 8
4 9
dtype: int64
In [319]: s.nsmallest(3)
Out[319]:
1 0
7 1
2 2
dtype: int64
In [320]: s.nlargest(3)
Out[320]:
4 9
8 8
5 7
dtype: int64
DataFrame also has the nlargest and nsmallest methods.
In [321]: df = pd.DataFrame({'a': [-2, -1, 1, 10, 8, 11, -1],
.....: 'b': list('abdceff'),
.....: 'c': [1.0, 2.0, 4.0, 3.2, np.nan, 3.0, 4.0]})
.....:
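A sketch of these methods on the frame just constructed:
df.nlargest(3, 'a')
df.nlargest(5, ['a', 'c'])
df.nsmallest(3, 'a')
df.nsmallest(5, ['a', 'c'])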
You must be explicit about sorting when the column is a MultiIndex, and fully specify all levels to by.
In [326]: df1.columns = pd.MultiIndex.from_tuples([('a', 'one'),
.....: ('a', 'two'),
.....: ('b', 'three')])
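The sort call itself would then name a full column tuple, presumably something like:
df1.sort_values(by=('a', 'one'))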
3.3.12 Copying
The copy() method on pandas objects copies the underlying data (though not the axis indexes, since they
are immutable) and returns a new object. Note that it is seldom necessary to copy objects. For
example, there are only a handful of ways to alter a DataFrame in-place:
• Inserting, deleting, or modifying a column.
• Assigning to the index or columns attributes.
• For homogeneous data, directly modifying the values via the values attribute or advanced indexing.
To be clear, no pandas method has the side effect of modifying your data; almost every method returns a
new object, leaving the original object untouched. If the data is modified, it is because you did so explicitly.
3.3.13 dtypes
For the most part, pandas uses NumPy arrays and dtypes for Series or individual columns of a DataFrame.
NumPy provides support for float, int, bool, timedelta64[ns] and datetime64[ns] (note that NumPy
does not support timezone-aware datetimes).
Pandas and third-party libraries extend NumPy’s type system in a few places. This section describes the
extensions pandas has made internally. See Extension types for how to write your own extension that works
with pandas. See Extension data types for a list of third-party libraries that have implemented an extension.
The following table lists all of pandas extension types. See the respective documentation sections for more
on each type.
Finally, arbitrary objects may be stored using the object dtype, but should be avoided to the extent possible
(for performance and for interoperability with other libraries and methods; see object conversion).
A convenient dtypes attribute for DataFrame returns a Series with the data type of each column.
In [328]: dft = pd.DataFrame({'A': np.random.rand(3),
.....: 'B': 1,
.....: 'C': 'foo',
.....: 'D': pd.Timestamp('20010102'),
.....: 'E': pd.Series([1.0] * 3).astype('float32'),
.....: 'F': False,
.....: 'G': pd.Series([1] * 3, dtype='int8')})
.....:
In [329]: dft
Out[329]:
A B C D E F G
0 0.537956 1 foo 2001-01-02 1.0 False 1
1 0.623232 1 foo 2001-01-02 1.0 False 1
2 0.558728 1 foo 2001-01-02 1.0 False 1
In [330]: dft.dtypes
Out[330]:
A float64
B int64
C object
D datetime64[ns]
E float32
F bool
G int8
dtype: object
On a Series object, use the dtype attribute.
In [331]: dft['A'].dtype
Out[331]: dtype('float64')
If a pandas object contains data with multiple dtypes in a single column, the dtype of the column will be
chosen to accommodate all of the data types (object is the most general).
# these ints are coerced to floats
In [332]: pd.Series([1, 2, 3, 4, 5, 6.])
Out[332]:
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
dtype: float64
# string data forces an object dtype
In [333]: pd.Series([1, 2, 3, 6., 'foo'])
Out[333]:
0 1
1 2
2 3
3 6
4 foo
dtype: object
The number of columns of each type in a DataFrame can be found by calling DataFrame.dtypes.
value_counts().
In [334]: dft.dtypes.value_counts()
Out[334]:
datetime64[ns] 1
object 1
bool 1
float64 1
int64 1
int8 1
float32 1
dtype: int64
Numeric dtypes will propagate and can coexist in DataFrames. If a dtype is passed (either directly via the dtype keyword, a passed ndarray, or a passed Series), then it will be preserved in DataFrame operations.
Furthermore, different numeric dtypes will NOT be combined. The following example will give you a taste.
In [335]: df1 = pd.DataFrame(np.random.randn(8, 1), columns=['A'], dtype='float32')
In [336]: df1
Out[336]:
A
0 1.111131
1 0.459178
2 0.633793
3 1.315839
4 0.075656
5 0.526596
6 0.442085
7 -0.224085
In [337]: df1.dtypes
Out[337]:
A float32
dtype: object
In [339]: df2
Out[339]:
A B C
0 -0.809082 0.268481 0
1 -2.189453 0.991689 255
2 0.447021 -0.273606 0
3 0.072876 0.848166 1
In [340]: df2.dtypes
Out[340]:
A float16
B float64
C uint8
dtype: object
defaults
By default integer types are int64 and float types are float64, regardless of platform (32-bit or 64-bit).
The following will all result in int64 dtypes.
In [341]: pd.DataFrame([1, 2], columns=['a']).dtypes
Out[341]:
a int64
dtype: object
upcasting
Types can potentially be upcasted when combined with other types, meaning they are promoted from the
current type (e.g. int to float).
In [345]: df3 = df1.reindex_like(df2).fillna(value=0.0) + df2
In [346]: df3
Out[346]:
A B C
0 0.302049 0.268481 0.0
1 -1.730275 0.991689 255.0
2 1.080814 -0.273606 0.0
3 1.388715 0.848166 1.0
4 -0.286405 -0.340956 255.0
5 -0.003677 -0.602481 0.0
6 0.345284 -0.552552 0.0
In [347]: df3.dtypes
Out[347]:
A float32
B float64
C float64
dtype: object
DataFrame.to_numpy() will return the lower-common-denominator of the dtypes, meaning the dtype that
can accommodate ALL of the types in the resulting homogeneous dtyped NumPy array. This can force
some upcasting.
In [348]: df3.to_numpy().dtype
Out[348]: dtype('float64')
astype
You can use the astype() method to explicitly convert dtypes from one to another. These will by default
return a copy, even if the dtype was unchanged (pass copy=False to change this behavior). In addition,
they will raise an exception if the astype operation is invalid.
Upcasting is always according to the numpy rules. If two different dtypes are involved in an operation, then
the more general one will be used as the result of the operation.
In [349]: df3
Out[349]:
A B C
0 0.302049 0.268481 0.0
1 -1.730275 0.991689 255.0
2 1.080814 -0.273606 0.0
3 1.388715 0.848166 1.0
4 -0.286405 -0.340956 255.0
5 -0.003677 -0.602481 0.0
6 0.345284 -0.552552 0.0
7 1.038611 0.324815 255.0
In [350]: df3.dtypes
Out[350]:
A float32
B float64
C float64
dtype: object
# conversion of dtypes
In [351]: df3.astype('float32').dtypes
Out[351]:
A float32
B float32
C float32
dtype: object
Convert a subset of columns to a specified type using astype().
In [352]: dft = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
In [354]: dft
Out[354]:
a b c
0 1 4 7
1 2 5 8
2 3 6 9
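The conversion step is not shown here; presumably it is something of this form (an assumption consistent with the dtypes below):
dft[['a', 'b']] = dft[['a', 'b']].astype(np.uint8)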
In [355]: dft.dtypes
Out[355]:
a uint8
b uint8
c int64
dtype: object
New in version 0.19.0.
Convert certain columns to a specific dtype by passing a dict to astype().
In [356]: dft1 = pd.DataFrame({'a': [1, 0, 1], 'b': [4, 5, 6], 'c': [7, 8, 9]})
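The dict-based conversion applied before the next output is presumably something like (an assumption consistent with the dtypes below):
dft1 = dft1.astype({'a': np.bool_, 'c': np.float64})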
In [358]: dft1
Out[358]:
a b c
0 True 4 7.0
1 False 5 8.0
2 True 6 9.0
In [359]: dft1.dtypes
Out[359]:
a bool
b int64
c float64
dtype: object
Note: When trying to convert a subset of columns to a specified type using astype() and loc(), upcasting
occurs.
loc() tries to fit in what we are assigning to the current dtypes, while [] will overwrite them taking the
dtype from the right hand side. Therefore the following piece of code produces the unintended result.
In [360]: dft = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
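A sketch of the difference, using the dft frame just created:
dft.loc[:, ['a', 'b']].astype(np.uint8).dtypes   # uint8, as requested

dft.loc[:, ['a', 'b']] = dft.loc[:, ['a', 'b']].astype(np.uint8)
dft.dtypes   # still int64: assigning through loc() upcast back to the existing dtypes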
object conversion
pandas offers various functions to try to force conversion of types from the object dtype to other types.
In cases where the data is already of the correct type, but stored in an object array, the DataFrame.
infer_objects() and Series.infer_objects() methods can be used to soft convert to the correct type.
In [364]: import datetime
In [366]: df = df.T
In [367]: df
Out[367]:
0 1 2
0 1 a 2016-03-02
1 2 b 2016-03-02
In [368]: df.dtypes
Out[368]:
0 object
1 object
2 datetime64[ns]
dtype: object
Because the data was transposed the original inference stored all columns as object, which infer_objects
will correct.
In [369]: df.infer_objects().dtypes
Out[369]:
0 int64
1 object
2 datetime64[ns]
dtype: object
The following functions are available for one dimensional object arrays or scalars to perform hard conversion
of objects to a specified type:
• to_numeric() (conversion to numeric dtypes)
• to_datetime() (conversion to datetime objects)
• to_timedelta() (conversion to Timedelta objects)
In [370]: m = ['1.1', 2, 3]
In [371]: pd.to_numeric(m)
Out[371]: array([1.1, 2. , 3. ])
In [374]: pd.to_datetime(m)
Out[374]: DatetimeIndex(['2016-07-09', '2016-03-02'], dtype='datetime64[ns]', freq=None)
In [376]: pd.to_timedelta(m)
Out[376]: TimedeltaIndex(['0 days 00:00:00.000005', '1 days 00:00:00'], dtype='timedelta64[ns]', freq=None)
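Note that m is presumably redefined before the datetime and timedelta calls; values consistent with the outputs above would be (assumptions):
m = ['2016-07-09', datetime.datetime(2016, 3, 2)]   # for to_datetime()
m = ['5us', pd.Timedelta('1day')]                   # for to_timedelta()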
To force a conversion, we can pass in an errors argument, which specifies how pandas should deal with
elements that cannot be converted to desired dtype or object. By default, errors='raise', meaning that
any errors encountered will be raised during the conversion process. However, if errors='coerce', these
errors will be ignored and pandas will convert problematic elements to pd.NaT (for datetime and timedelta)
or np.nan (for numeric). This might be useful if you are reading in data which is mostly of the desired
dtype (e.g. numeric, datetime), but occasionally has non-conforming elements intermixed that you want to
represent as missing:
In [380]: m = ['apple', 2, 3]
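A sketch of errors='coerce' on this list:
pd.to_numeric(m, errors='coerce')   # array([nan, 2., 3.])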
The errors parameter has a third option of errors='ignore', which will simply return the passed in data
if it encounters any errors with the conversion to a desired data type:
In [387]: m = ['apple', 2, 3]
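A sketch of errors='ignore' on this list:
pd.to_numeric(m, errors='ignore')   # returns array(['apple', 2, 3], dtype=object) unchanged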
In addition to object conversion, to_numeric() provides another argument downcast, which gives the option
of downcasting the newly (or already) numeric data to a smaller dtype, which can conserve memory:
In [391]: m = ['1', 2, 3]
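A sketch of downcast on this list:
pd.to_numeric(m, downcast='integer')   # smallest signed integer dtype (e.g. int8)
pd.to_numeric(m, downcast='float')     # smallest float dtype (e.g. float32)
pd.to_numeric(m, downcast='unsigned')  # smallest unsigned integer dtype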
In [397]: df = pd.DataFrame([
.....: ['2016-07-09', datetime.datetime(2016, 3, 2)]] * 2, dtype='O')
.....:
In [398]: df
Out[398]:
0 1
0 2016-07-09 2016-03-02 00:00:00
1 2016-07-09 2016-03-02 00:00:00
In [399]: df.apply(pd.to_datetime)
Out[399]:
0 1
0 2016-07-09 2016-03-02
1 2016-07-09 2016-03-02
In [401]: df
Out[401]:
0 1 2
0 1.1 2 3
1 1.1 2 3
In [402]: df.apply(pd.to_numeric)
Out[402]:
0 1 2
0 1.1 2 3
1 1.1 2 3
In [404]: df
Out[404]:
0 1
0 5us 1 days 00:00:00
1 5us 1 days 00:00:00
In [405]: df.apply(pd.to_timedelta)
Out[405]:
0 1
0 00:00:00.000005 1 days
1 00:00:00.000005 1 days
gotchas
Performing selection operations on integer type data can easily upcast the data to floating. The dtype
of the input data will be preserved in cases where nans are not introduced. See also Support for integer NA.
In [406]: dfi = df3.astype('int32')
In [407]: dfi['E'] = 1
In [408]: dfi
Out[408]:
A B C E
0 0 0 0 1
1 -1 0 255 1
2 1 0 0 1
3 1 0 1 1
4 0 0 255 1
5 0 0 0 1
6 0 0 0 1
7 1 0 255 1
In [409]: dfi.dtypes
Out[409]:
A int32
B int32
C int32
E int64
dtype: object
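The construction of casted is not shown here; it is presumably a boolean selection such as casted = dfi[dfi > 0] (and, for the float example further below, casted = dfa[df2 > 0]); both are assumptions consistent with the NaN patterns shown.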
In [411]: casted
Out[411]:
A B C E
0 NaN NaN NaN 1
1 NaN NaN 255.0 1
2 1.0 NaN NaN 1
3 1.0 NaN 1.0 1
4 NaN NaN 255.0 1
5 NaN NaN NaN 1
6 NaN NaN NaN 1
7 1.0 NaN 255.0 1
In [412]: casted.dtypes
Out[412]:
A float64
B float64
C float64
E int64
dtype: object
Float dtypes, on the other hand, are unchanged by the same kind of selection.
In [413]: dfa = df3.copy()
In [415]: dfa.dtypes
Out[415]:
A float32
B float64
C float64
dtype: object
In [417]: casted
Out[417]:
A B C
0 NaN 0.268481 NaN
1 NaN 0.991689 255.0
2 1.080814 NaN NaN
3 1.388715 0.848166 1.0
4 NaN NaN 255.0
5 NaN NaN NaN
6 NaN NaN NaN
7 1.038611 0.324815 255.0
In [418]: casted.dtypes
Out[418]:
A float32
B float64
C float64
dtype: object
In [424]: df
Out[424]:
string int64 uint8 float64 bool1 bool2 dates category tdeltas uint64 other_dates tz_aware_dates
0 a 1 3 4.0 True False 2019-10-23 14:27:32.265347 A NaT 3 2013-01-01 2013-01-01 00:00:00-05:00
1 b 2 4 5.0 False True 2019-10-24 14:27:32.265347 B 1 days 4 2013-01-02 2013-01-02 00:00:00-05:00
2 c 3 5 6.0 True False 2019-10-25 14:27:32.265347 C 1 days 5 2013-01-03 2013-01-03 00:00:00-05:00
In [425]: df.dtypes
Out[425]:
string object
int64 int64
uint8 uint8
float64 float64
bool1 bool
bool2 bool
dates datetime64[ns]
category category
tdeltas timedelta64[ns]
uint64 uint64
other_dates datetime64[ns]
tz_aware_dates datetime64[ns, US/Eastern]
dtype: object
select_dtypes() has two parameters include and exclude that allow you to say “give me the columns with these dtypes” (include) and/or “give me the columns without these dtypes” (exclude).
For example, to select bool columns:
In [426]: df.select_dtypes(include=[bool])
Out[426]:
bool1 bool2
0 True False
1 False True
2 True False
You can also pass the name of a dtype in the NumPy dtype hierarchy:
In [427]: df.select_dtypes(include=['bool'])
Out[427]:
bool1 bool2
0 True False
1 False True
2 True False
In [429]: df.select_dtypes(include=['object'])
Out[429]:
string
0 a
1 b
2 c
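You can also combine include and exclude, for example (a sketch):
df.select_dtypes(include=['number', 'bool'], exclude=['unsignedinteger'])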
To see all the child dtypes of a generic dtype like numpy.number you can define a function that returns a
tree of child dtypes:
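The helper itself is not shown here; a recursive sketch that produces such a tree:
def subdtypes(dtype):
    # walk the NumPy scalar-type class hierarchy
    subs = dtype.__subclasses__()
    if not subs:
        return dtype
    return [dtype, [subdtypes(dt) for dt in subs]]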
In [431]: subdtypes(np.generic)
Out[431]:
[numpy.generic,
[[numpy.number,
[[numpy.integer,
[[numpy.signedinteger,
[numpy.int8,
numpy.int16,
numpy.int32,
numpy.int64,
numpy.int64,
numpy.timedelta64]],
[numpy.unsignedinteger,
[numpy.uint8,
numpy.uint16,
numpy.uint32,
numpy.uint64,
numpy.uint64]]]],
[numpy.inexact,
[[numpy.floating,
[numpy.float16, numpy.float32, numpy.float64, numpy.float128]],
[numpy.complexfloating,
[numpy.complex64, numpy.complex128, numpy.complex256]]]]]],
[numpy.flexible,
[[numpy.character, [numpy.bytes_, numpy.str_]],
[numpy.void, [numpy.record]]]],
numpy.bool_,
numpy.datetime64,
numpy.object_]]
Note: Pandas also defines the types category, and datetime64[ns, tz], which are not integrated into
the normal NumPy hierarchy and won’t show up with the above function.
3.4 Intro to data structures
We’ll start with a quick, non-comprehensive overview of the fundamental data structures in pandas to get you started. The fundamental behavior about data types, indexing, and axis labeling / alignment applies across all of the objects. To get started, import NumPy and load pandas into your namespace:
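A sketch of the conventional imports assumed throughout:
import numpy as np
import pandas as pd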
Here is a basic tenet to keep in mind: data alignment is intrinsic. The link between labels and data will
not be broken unless done so explicitly by you.
We’ll give a brief intro to the data structures, then consider all of the broad categories of functionality and
methods in separate sections.
3.4.1 Series
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point
numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method
to create a Series is to call:
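The basic call has the form (data and index are placeholders):
s = pd.Series(data, index=index)
Here data can be a Python dict, an ndarray, or a scalar value, and index is a list of axis labels. The Series shown below was presumably created from a length-5 ndarray, e.g. pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e']).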
In [4]: s
Out[4]:
a 0.544055
b 0.457937
c -0.055260
d 0.487673
e 0.021212
dtype: float64
In [5]: s.index
Out[5]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
In [6]: pd.Series(np.random.randn(5))
Out[6]:
0 -0.269995
1 -0.453290
2 0.118795
3 0.658569
4 -0.208094
dtype: float64
Note: pandas supports non-unique index values. If an operation that does not support duplicate index
values is attempted, an exception will be raised at that time. The reason for being lazy is nearly all
performance-based (there are many instances in computations, like parts of GroupBy, where the index is not
used).
From dict
Series can be instantiated from dicts:
In [8]: pd.Series(d)
Out[8]:
b 1
a 0
c 2
dtype: int64
Note: When the data is a dict, and an index is not passed, the Series index will be ordered by the dict’s
insertion order, if you’re using Python version >= 3.6 and Pandas version >= 0.23.
If you’re using Python < 3.6 or Pandas < 0.23, and an index is not passed, the Series index will be the
lexically ordered list of dict keys.
In the example above, if you were on a Python version lower than 3.6 or a Pandas version lower than 0.23,
the Series would be ordered by the lexical order of the dict keys (i.e. ['a', 'b', 'c'] rather than ['b',
'a', 'c']).
If an index is passed, the values in data corresponding to the labels in the index will be pulled out.
In [9]: d = {'a': 0., 'b': 1., 'c': 2.}
In [10]: pd.Series(d)
Out[10]:
a 0.0
b 1.0
c 2.0
dtype: float64
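Passing an index selects and orders the values by those labels (a sketch):
pd.Series(d, index=['b', 'c', 'd', 'a'])   # 'd' is not a key of the dict, so it becomes NaN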
Note: NaN (not a number) is the standard missing data marker used in pandas.
Series is ndarray-like
Series acts very similarly to a ndarray, and is a valid argument to most NumPy functions. However,
operations such as slicing will also slice the index.
In [13]: s[0]
Out[13]: 0.544055375065062
In [14]: s[:3]
Out[14]:
a 0.544055
b 0.457937
c -0.055260
dtype: float64
In [17]: np.exp(s)
Out[17]:
a 1.722980
b 1.580809
c 0.946239
d 1.628523
e 1.021439
dtype: float64
In [18]: s.dtype
Out[18]: dtype('float64')
This is often a NumPy dtype. However, pandas and 3rd-party libraries extend NumPy’s type system in a few
places, in which case the dtype would be a ExtensionDtype. Some examples within pandas are categorical
and Nullable integer data type. See dtypes for more.
If you need the actual array backing a Series, use Series.array.
In [19]: s.array
Out[19]:
<PandasArray>
Accessing the array can be useful when you need to do some operation without the index (to disable automatic
alignment, for example).
Series.array will always be an ExtensionArray. Briefly, an ExtensionArray is a thin wrapper around one
or more concrete arrays like a numpy.ndarray. Pandas knows how to take an ExtensionArray and store it
in a Series or a column of a DataFrame. See dtypes for more.
While Series is ndarray-like, if you need an actual ndarray, then use Series.to_numpy().
In [20]: s.to_numpy()
Out[20]: array([ 0.54405538, 0.45793694, -0.05526003, 0.48767342, 0.02121215])
Even if the Series is backed by a ExtensionArray, Series.to_numpy() will return a NumPy ndarray.
Series is dict-like
A Series is like a fixed-size dict in that you can get and set values by index label:
In [21]: s['a']
Out[21]: 0.544055375065062
In [23]: s
Out[23]:
a 0.544055
b 0.457937
c -0.055260
d 0.487673
e 12.000000
dtype: float64
In [24]: 'e' in s
Out[24]: True
In [25]: 'f' in s
Out[25]: False
If a label is not contained, an exception is raised:
>>> s['f']
KeyError: 'f'
Using the get method, a missing label will return None or a specified default:
In [26]: s.get('f')
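A sketch with a default value:
s.get('f', np.nan)   # returns nan instead of raising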
When working with raw NumPy arrays, looping through value-by-value is usually not necessary. The same
is true when working with Series in pandas. Series can also be passed into most NumPy methods expecting
an ndarray.
In [28]: s + s
Out[28]:
a 1.088111
b 0.915874
c -0.110520
d 0.975347
e 24.000000
dtype: float64
In [29]: s * 2
Out[29]:
a 1.088111
b 0.915874
c -0.110520
d 0.975347
e 24.000000
dtype: float64
In [30]: np.exp(s)
Out[30]:
a 1.722980
b 1.580809
c 0.946239
d 1.628523
e 162754.791419
dtype: float64
A key difference between Series and ndarray is that operations between Series automatically align the data
based on label. Thus, you can write computations without giving consideration to whether the Series involved
have the same labels.
The result of an operation between unaligned Series will have the union of the indexes involved. If a label
is not found in one Series or the other, the result will be marked as missing NaN. Being able to write code
without doing any explicit data alignment grants immense freedom and flexibility in interactive data analysis
and research. The integrated data alignment features of the pandas data structures set pandas apart from
the majority of related tools for working with labeled data.
Note: In general, we chose to make the default result of operations between differently indexed objects
yield the union of the indexes in order to avoid loss of information. Having an index label, though the data
is missing, is typically important information as part of a computation. You of course have the option of
dropping labels with missing data via the dropna function.
Name attribute
In [33]: s
Out[33]:
0 0.482656
1 0.041324
2 0.376516
3 0.386480
4 0.884489
Name: something, dtype: float64
In [34]: s.name
Out[34]: 'something'
The Series name will be assigned automatically in many cases, in particular when taking 1D slices of
DataFrame as you will see below.
New in version 0.18.0.
You can rename a Series with the pandas.Series.rename() method.
In [35]: s2 = s.rename("different")
In [36]: s2.name
Out[36]: 'different'
3.4.2 DataFrame
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can
think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly
used pandas object. Like Series, DataFrame accepts many different kinds of input:
• Dict of 1D ndarrays, lists, dicts, or Series
• 2-D numpy.ndarray
• Structured or record ndarray
• A Series
• Another DataFrame
Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments.
If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting
DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed
index.
If axis labels are not passed, they will be constructed from the input data based on common sense rules.
Note: When the data is a dict, and columns is not specified, the DataFrame columns will be ordered by
the dict’s insertion order, if you are using Python version >= 3.6 and Pandas >= 0.23.
If you are using Python < 3.6 or Pandas < 0.23, and columns is not specified, the DataFrame columns will
be the lexically ordered list of dict keys.
The resulting index will be the union of the indexes of the various Series. If there are any nested dicts,
these will first be converted to Series. If no columns are passed, the columns will be the ordered list of dict
keys.
In [37]: d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
....: 'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
....:
In [38]: df = pd.DataFrame(d)
In [39]: df
Out[39]:
one two
a 1.0 1.0
b 2.0 2.0
c 3.0 3.0
d NaN 4.0
Note: When a particular set of columns is passed along with a dict of data, the passed columns override
the keys in the dict.
In [42]: df.index
Out[42]: Index(['a', 'b', 'c', 'd'], dtype='object')
In [43]: df.columns
Out[43]: Index(['one', 'two'], dtype='object')
From dict of ndarrays / lists
The ndarrays must all be the same length. If an index is passed, it must also be the same length as the arrays. If no index is passed, the result will be range(n), where n is the array length.
In [44]: d = {'one': [1., 2., 3., 4.],
....: 'two': [4., 3., 2., 1.]}
....:
In [45]: pd.DataFrame(d)
Out[45]:
one two
0 1.0 4.0
1 2.0 3.0
2 3.0 2.0
3 4.0 1.0
From structured or record array
In [49]: pd.DataFrame(data)
Out[49]:
A B C
0 1 2.0 b'Hello'
1 2 3.0 b'World'
Note: DataFrame is not intended to work exactly like a 2-dimensional NumPy ndarray.
From a list of dicts
In [52]: data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
In [53]: pd.DataFrame(data2)
Out[53]:
a b c
0 1 2 NaN
1 5 10 20.0
From a Series
The result will be a DataFrame with the same index as the input Series, and with one column whose name
is the original name of the Series (only if no other column name provided).
Missing data
Much more will be said on this topic in the Missing data section. To construct a DataFrame with missing
data, we use np.nan to represent missing values. Alternatively, you may pass a numpy.MaskedArray as the
data argument to the DataFrame constructor, and its masked entries will be considered missing.
Alternate constructors
DataFrame.from_dict
DataFrame.from_dict takes a dict of dicts or a dict of array-like sequences and returns a DataFrame. It
operates like the DataFrame constructor except for the orient parameter which is 'columns' by default,
but which can be set to 'index' in order to use the dict keys as row labels.
If you pass orient='index', the keys will be the row labels. In this case, you can also pass the desired
column names:
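A sketch of both orientations (the data here is illustrative):
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
pd.DataFrame.from_dict(data)                                                    # keys become columns
pd.DataFrame.from_dict(data, orient='index', columns=['one', 'two', 'three'])   # keys become row labels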
DataFrame.from_records
DataFrame.from_records takes a list of tuples or an ndarray with structured dtype. It works analogously
to the normal DataFrame constructor, except that the resulting DataFrame index may be a specific field of
the structured dtype. For example:
In [59]: data
Out[59]:
array([(1, 2., b'Hello'), (2, 3., b'World')],
dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])
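The call itself is not shown here; it would presumably look like:
pd.DataFrame.from_records(data, index='C')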
You can treat a DataFrame semantically like a dict of like-indexed Series objects. Getting, setting, and
deleting columns works with the same syntax as the analogous dict operations:
In [61]: df['one']
Out[61]:
a 1.0
b 2.0
c 3.0
d NaN
Name: one, dtype: float64
In [64]: df
Out[64]:
one two three flag
a 1.0 1.0 1.0 False
b 2.0 2.0 4.0 False
c 3.0 3.0 9.0 True
d NaN 4.0 NaN False
In [67]: df
Out[67]:
one flag
a 1.0 False
b 2.0 False
c 3.0 True
d NaN False
When inserting a scalar value, it will naturally be propagated to fill the column:
In [69]: df
Out[69]:
one flag foo
a 1.0 False bar
b 2.0 False bar
c 3.0 True bar
d NaN False bar
When inserting a Series that does not have the same index as the DataFrame, it will be conformed to the
DataFrame’s index:
In [71]: df
Out[71]:
one flag foo one_trunc
a 1.0 False bar 1.0
b 2.0 False bar 2.0
c 3.0 True bar NaN
d NaN False bar NaN
You can insert raw ndarrays but their length must match the length of the DataFrame’s index.
By default, columns get inserted at the end. The insert function is available to insert at a particular
location in the columns:
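For example (a sketch matching the frame shown below):
df.insert(1, 'bar', df['one'])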
In [73]: df
Out[73]:
one bar flag foo one_trunc
a 1.0 1.0 False bar 1.0
b 2.0 2.0 False bar 2.0
c 3.0 3.0 True bar NaN
d NaN NaN False bar NaN
Inspired by dplyr’s mutate verb, DataFrame has an assign() method that allows you to easily create new
columns that are potentially derived from existing columns.
In [74]: iris = pd.read_csv('data/iris.data')
In [75]: iris.head()
Out[75]:
SepalLength SepalWidth PetalLength PetalWidth Name
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
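A sketch of assign() on this frame:
iris.assign(sepal_ratio=iris['SepalWidth'] / iris['SepalLength']).head()
# or, equivalently, with a callable
iris.assign(sepal_ratio=lambda x: x['SepalWidth'] / x['SepalLength']).head()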
assign always returns a copy of the data, leaving the original DataFrame untouched.
Passing a callable, as opposed to an actual value to be inserted, is useful when you don’t have a reference to
the DataFrame at hand. This is common when using assign in a chain of operations. For example, we can
limit the DataFrame to just those observations with a Sepal Length greater than 5, calculate the ratio, and
plot:
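A sketch of that chained pattern:
(iris.query('SepalLength > 5')
     .assign(SepalRatio=lambda x: x.SepalWidth / x.SepalLength,
             PetalRatio=lambda x: x.PetalWidth / x.PetalLength)
     .plot(kind='scatter', x='SepalRatio', y='PetalRatio'))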
Since a function is passed in, the function is computed on the DataFrame being assigned to. Importantly,
this is the DataFrame that’s been filtered to those rows with sepal length greater than 5. The filtering
happens first, and then the ratio calculations. This is an example where we didn’t have a reference to the
filtered DataFrame available.
The function signature for assign is simply **kwargs. The keys are the column names for the new fields,
and the values are either a value to be inserted (for example, a Series or NumPy array), or a function of
one argument to be called on the DataFrame. A copy of the original DataFrame is returned, with the new
values inserted.
Changed in version 0.23.0.
Starting with Python 3.6 the order of **kwargs is preserved. This allows for dependent assignment, where
an expression later in **kwargs can refer to a column created earlier in the same assign().
In the second expression, x['C'] will refer to the newly created column, that’s equal to dfa['A'] +
dfa['B'].
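A sketch of dependent assignment (requires Python >= 3.6; the frame is illustrative):
dfa = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
dfa.assign(C=lambda x: x['A'] + x['B'],
           D=lambda x: x['A'] + x['C'])   # 'C' is created first, then reused for 'D'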
To write code compatible with all versions of Python, split the assignment in two.
Warning: Dependent assignment may subtly change the behavior of your code between Python 3.6
and older versions of Python.
If you wish to write code that supports versions of python before and after 3.6, you’ll need to take care
when passing assign expressions that
• Update an existing column
• Refer to the newly updated column in the same assign
For example, we’ll update column “A” and then refer to it when creating “B”.
>>> dependent = pd.DataFrame({"A": [1, 1, 1]})
>>> dependent.assign(A=lambda x: x["A"] + 1, B=lambda x: x["A"] + 2)
For Python 3.5 and earlier the expression creating B refers to the “old” value of A, [1, 1, 1]. The
output is then
A B
0 2 3
1 2 3
2 2 3
For Python 3.6 and later, the expression creating A refers to the “new” value of A, [2, 2, 2], which
results in
A B
0 2 4
1 2 4
2 2 4
Indexing / selection
Row selection, for example, returns a Series whose index is the columns of the DataFrame:
In [83]: df.loc['b']
Out[83]:
one 2
bar 2
flag False
foo bar
one_trunc 2
Name: b, dtype: object
In [84]: df.iloc[2]
Out[84]:
one 3
bar 3
flag True
foo bar
one_trunc NaN
Name: c, dtype: object
For a more exhaustive treatment of sophisticated label-based indexing and slicing, see the section on indexing.
We will address the fundamentals of reindexing / conforming to new sets of labels in the section on reindexing.
Data alignment between DataFrame objects automatically aligns on both the columns and the index (row labels). Again, the resulting object will have the union of the column and row labels.
In [87]: df + df2
Out[87]:
When doing an operation between DataFrame and Series, the default behavior is to align the Series index
on the DataFrame columns, thus broadcasting row-wise. For example:
In [88]: df - df.iloc[0]
Out[88]:
A B C D
0 0.000000 0.000000 0.000000 0.000000
1 0.823010 -1.407428 1.810920 0.036966
2 -0.836963 -0.013977 3.361133 -0.321214
3 0.215511 0.136409 2.241683 -1.609052
4 0.010095 -1.006592 2.969119 -1.362727
5 1.268823 -1.691757 2.108988 -2.197647
6 0.192297 0.926025 1.937398 -0.193649
7 -1.637169 1.843618 2.244498 -2.147770
8 2.009625 -2.136301 2.017400 1.580465
9 0.027738 1.062145 3.472820 -1.503278
In the special case of working with time series data, if the DataFrame index contains dates, the broadcasting
will be column-wise:
In [89]: index = pd.date_range('1/1/2000', periods=8)
In [91]: df
Out[91]:
A B C
2000-01-01 -0.334372 1.607388 -2.433251
2000-01-02 -0.196343 1.455554 -0.134245
2000-01-03 0.351420 1.148340 -1.691404
2000-01-04 -0.493614 0.073640 0.182179
2000-01-05 -0.203285 -0.010826 -1.539934
2000-01-06 -1.415928 -0.245523 -0.167154
2000-01-07 -0.223275 -1.167290 2.137569
2000-01-08 0.120902 -0.185518 1.927187
In [92]: type(df['A'])
Out[92]: pandas.core.series.Series
In [93]: df - df['A']
Out[93]:
Warning:
df - df['A']
is now deprecated and will be removed in a future release. The preferred way to replicate this behavior
is
df.sub(df['A'], axis=0)
For explicit control over the matching and broadcasting behavior, see the section on flexible binary operations.
Operations with scalars are just as you would expect:
In [94]: df * 5 + 2
Out[94]:
A B C
2000-01-01 0.328140 10.036938 -10.166254
2000-01-02 1.018287 9.277769 1.328775
2000-01-03 3.757099 7.741701 -6.457019
2000-01-04 -0.468070 2.368199 2.910897
2000-01-05 0.983577 1.945869 -5.699668
2000-01-06 -5.079639 0.772383 1.164231
2000-01-07 0.883623 -3.836452 12.687845
2000-01-08 2.604509 1.072411 11.635937
In [95]: 1 / df
Out[95]:
A B C
2000-01-01 -2.990682 0.622128 -0.410973
2000-01-02 -5.093138 0.687024 -7.449068
2000-01-03 2.845600 0.870822 -0.591225
2000-01-04 -2.025875 13.579631 5.489096
2000-01-05 -4.919213 -92.369255 -0.649379
2000-01-06 -0.706251 -4.072930 -5.982514
2000-01-07 -4.478772 -0.856685 0.467821
2000-01-08 8.271174 -5.390317 0.518891
In [96]: df ** 4
Out[96]:
A B C
2000-01-01 0.012500 6.675478e+00 35.054796
2000-01-02 0.001486 4.488623e+00 0.000325
2000-01-03 0.015251 1.738931e+00 8.184445
2000-01-04 0.059368 2.940682e-05 0.001102
2000-01-05 0.001708 1.373695e-08 5.623517
2000-01-06 4.019430 3.633894e-03 0.000781
2000-01-07 0.002485 1.856589e+00 20.877601
2000-01-08 0.000214 1.184521e-03 13.794178
Boolean operators work as well:
In [97]: df1 = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1]}, dtype=bool)
In [102]: -df1
Out[102]:
a b
0 False True
1 True False
2 False False
Transposing
To transpose, access the T attribute (also the transpose function), similar to an ndarray:
Elementwise NumPy ufuncs (log, exp, sqrt, …) and various other NumPy functions can be used with no
issues on Series and DataFrame, assuming the data within are numeric:
In [104]: np.exp(df)
Out[104]:
A B C
2000-01-01 0.715788 4.989759 0.087751
2000-01-02 0.821731 4.286857 0.874376
2000-01-03 1.421084 3.152955 0.184261
2000-01-04 0.610416 1.076419 1.199829
2000-01-05 0.816046 0.989232 0.214395
2000-01-06 0.242700 0.782295 0.846069
2000-01-07 0.799894 0.311209 8.478801
2000-01-08 1.128514 0.830674 6.870160
In [105]: np.asarray(df)
Out[105]:
array([[-0.33437191, 1.60738755, -2.43325073],
[-0.19634262, 1.45555386, -0.13424497],
[ 0.35141973, 1.1483402 , -1.69140383],
[-0.49361395, 0.0736397 , 0.18217936],
[-0.20328455, -0.01082611, -1.53993362],
[-1.41592783, -0.24552349, -0.16715382],
[-0.22327549, -1.16729044, 2.13756903],
[ 0.12090182, -0.18551784, 1.92718745]])
DataFrame is not intended to be a drop-in replacement for ndarray as its indexing semantics and data model
are quite different in places from an n-dimensional array.
Series implements __array_ufunc__, which allows it to work with NumPy’s universal functions.
The ufunc is applied to the underlying array in a Series.
In [107]: np.exp(ser)
Out[107]:
0 2.718282
1 7.389056
2 20.085537
Changed in version 0.25.0: When multiple Series are passed to a ufunc, they are aligned before performing
the operation.
Like other parts of the library, pandas will automatically align labeled inputs as part of a ufunc with
multiple inputs. For example, using numpy.remainder() on two Series with differently ordered labels will
align before the operation.
In [108]: ser1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
In [110]: ser1
Out[110]:
a 1
b 2
c 3
dtype: int64
In [111]: ser2
Out[111]:
b 1
a 3
c 5
dtype: int64
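The aligning call itself is not shown here; a sketch of the behavior described:
np.remainder(ser1, ser2)
# aligned on labels first: a -> 1 % 3 = 1, b -> 2 % 1 = 0, c -> 3 % 5 = 3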
In [114]: ser3
Out[114]:
b 2
c 4
d 6
dtype: int64
When a binary ufunc is applied to a Series and an Index, the Series implementation takes precedence and a Series is returned.
NumPy ufuncs are safe to apply to Series backed by non-ndarray arrays, for example SparseArray (see
Sparse calculation). If possible, the ufunc is applied without converting the underlying data to an ndarray.
Console display
Very large DataFrames will be truncated to display them in the console. You can also get a summary using
info(). (Here I am reading a CSV version of the baseball dataset from the plyr R package):
In [119]: baseball = pd.read_csv('data/baseball.csv')
In [120]: print(baseball)
id player year stint team lg g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp
0 88641 womacto01 2006 2 CHN NL 19 50 6 14 1 0 1 2.0 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0
1 88643 schilcu01 2006 1 BOS AL 31 2 0 1 0 0 0 0.0 0.0 0.0 0 1.0 0.0 0.0 0.0 0.0 0.0
.. ... ... ... ... ... .. .. ... .. ... ... ... .. ... ... ... .. ... ... ... ... ... ...
98 89533 aloumo01 2007 1 NYN NL 87 328 51 112 19 1 13 49.0 3.0 0.0 27 30.0 5.0 2.0 0.0 3.0 13.0
99 89534 alomasa02 2007 1 NYN NL 8 22 1 3 1 0 0 0.0 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0 0.0
In [121]: baseball.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 23 columns):
id 100 non-null int64
player 100 non-null object
year 100 non-null int64
stint 100 non-null int64
team 100 non-null object
lg 100 non-null object
g 100 non-null int64
ab 100 non-null int64
r 100 non-null int64
h 100 non-null int64
X2b 100 non-null int64
You can change how much to print on a single row by setting the display.width option:
You can adjust the max width of the individual columns by setting display.max_colwidth
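Sketches of both options, plus a plausible definition of the datafile dict used below (the dict contents are an assumption consistent with the output):
pd.set_option('display.width', 40)          # default is 80
pd.set_option('display.max_colwidth', 100)  # default is 50

datafile = {'filename': ['filename_01', 'filename_02'],
            'path': ['media/user_name/storage/folder_01/filename_01',
                     'media/user_name/storage/folder_02/filename_02']}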
In [128]: pd.DataFrame(datafile)
Out[128]:
filename path
0 filename_01 media/user_name/storage/fo...
1 filename_02 media/user_name/storage/fo...
In [130]: pd.DataFrame(datafile)
Out[130]:
filename path
0 filename_01 media/user_name/storage/folder_01/filename_01
1 filename_02 media/user_name/storage/folder_02/filename_02
You can also disable this feature via the expand_frame_repr option. This will print the table in one block.
If a DataFrame column label is a valid Python variable name, the column can be accessed like an attribute:
In [131]: df = pd.DataFrame({'foo1': np.random.randn(5),
.....: 'foo2': np.random.randn(5)})
.....:
In [132]: df
Out[132]:
foo1 foo2
0 0.175701 0.688798
1 1.857269 0.874645
2 1.062442 -0.593865
3 0.780121 0.422091
4 0.357684 -0.227634
In [133]: df.foo1
Out[133]:
0 0.175701
1 1.857269
2 1.062442
3 0.780121
4 0.357684
Name: foo1, dtype: float64
The columns are also connected to the IPython completion mechanism so they can be tab-completed:
Comparison with R / R libraries
Since pandas aims to provide a lot of the data manipulation and analysis functionality that people use R for,
this page was started to provide a more detailed look at the R language and its many third party libraries
as they relate to pandas. In comparisons with R and CRAN libraries, we care about the following things:
• Functionality / flexibility: what can/cannot be done with each tool
• Performance: how fast are operations. Hard numbers/benchmarks are preferable
• Ease-of-use: Is one tool easier/harder to use (you may have to be the judge of this, given side-by-side
code comparisons)
This page is also here to offer a bit of a translation guide for users of these R packages.
For transfer of DataFrame objects from pandas to R, one option is to use HDF5 files, see External compatibility
for an example.
Quick reference
We’ll start off with a quick reference guide pairing some common R operations using dplyr with pandas
equivalents.
R pandas
dim(df) df.shape
head(df) df.head()
slice(df, 1:10) df.iloc[:9]
filter(df, col1 == 1, col2 == 1) df.query('col1 == 1 & col2 == 1')
df[df$col1 == 1 & df$col2 == 1,] df[(df.col1 == 1) & (df.col2 == 1)]
select(df, col1, col2) df[['col1', 'col2']]
select(df, col1:col3) df.loc[:, 'col1':'col3']
select(df, -(col1:col3)) df.drop(cols_to_drop, axis=1) but see1
distinct(select(df, col1)) df[['col1']].drop_duplicates()
distinct(select(df, col1, col2)) df[['col1', 'col2']].drop_duplicates()
sample_n(df, 10) df.sample(n=10)
sample_frac(df, 0.01) df.sample(frac=0.01)
Sorting
R pandas
arrange(df, col1, col2) df.sort_values(['col1', 'col2'])
arrange(df, desc(col1)) df.sort_values('col1', ascending=False)
Transforming
R pandas
select(df, col_one = col1) df.rename(columns={'col1': 'col_one'})['col_one']
rename(df, col_one = col1) df.rename(columns={'col1': 'col_one'})
mutate(df, c=a-b) df.assign(c=df.a-df.b)
R pandas
summary(df) df.describe()
gdf <- group_by(df, col1) gdf = df.groupby('col1')
summarise(gdf, avg=mean(col1, na.rm=TRUE)) df.groupby('col1').agg({'col1': 'mean'})
summarise(gdf, total=sum(col1)) df.groupby('col1').sum()
Base R
1 R's shorthand for a subrange of columns (select(df, col1:col3)) can be approached in pandas if you have the list of columns, for example df[cols[1:3]] or df.drop(cols[1:3]), but doing this by column name is a bit messy.
In R you can also select columns of a data.frame by integer location:
df <- data.frame(matrix(rnorm(1000), ncol=100))
df[, c(1:10, 25:30, 40, 50:100)]
In [5]: n = 30
aggregate
In R you may want to split data into subsets and compute the mean for each. Using a data.frame called df
and splitting it into groups by1 and by2:
df <- data.frame(
v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99),
by1 = c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12),
by2 = c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA))
aggregate(x=df[, c("v1", "v2")], by=list(df$by1, df$by2), FUN = mean)
In [9]: df = pd.DataFrame(
...: {'v1': [1, 3, 5, 7, 8, 3, 5, np.nan, 4, 5, 7, 9],
...: 'v2': [11, 33, 55, 77, 88, 33, 55, np.nan, 44, 55, 77, 99],
...: 'by1': ["red", "blue", 1, 2, np.nan, "big", 1, 2, "red", 1, np.nan, 12],
...: 'by2': ["wet", "dry", 99, 95, np.nan, "damp", 95, 99, "red", 99, np.nan,
...: np.nan]})
...:
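The pandas version groups on both keys and takes the mean; a minimal sketch:
g = df.groupby(['by1', 'by2'])
g[['v1', 'v2']].mean()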
match / %in%
A common way to select data in R is using %in% which is defined using the function match. The operator
%in% is used to return a logical vector indicating if there is a match or not:
s <- 0:4
s %in% c(2,4)
The match function returns a vector of the positions of matches of its first argument in its second:
s <- 0:4
match(s, c(2,4))
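In pandas, isin() plays the role of %in%, and an index lookup can stand in for match(); a sketch:
s = pd.Series(np.arange(5))
s.isin([2, 4])                    # boolean mask, analogous to %in%
pd.Index([2, 4]).get_indexer(s)   # positions of matches (-1 where absent), akin to match()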
tapply
tapply is similar to aggregate, but data can be in a ragged array, since the subclass sizes are possibly
irregular. Using a data.frame called baseball, and retrieving information based on the array team:
baseball <-
data.frame(team = gl(5, 5,
labels = paste("Team", LETTERS[1:5])),
player = sample(letters, 25),
batting.average = runif(25, .200, .400))
tapply(baseball$batting.average, baseball$team,
max)
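In pandas the analogous operation can be expressed with pivot_table(); a sketch, building a small baseball frame that mirrors the R example:
baseball = pd.DataFrame(
    {'team': ['Team %d' % (x + 1) for x in range(5)] * 5,
     'player': list('abcdefghijklmnopqrstuvwxy'),
     'batting avg': np.random.uniform(.200, .400, 25)})

baseball.pivot_table(values='batting avg', columns='team', aggfunc=np.max)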
subset
The query() method is similar to the base R subset function. In R you might want to get the rows of a
data.frame where one column’s values are less than another column’s values:
In pandas, there are a few ways to perform subsetting. You can use query() or pass an expression as if it
were an index/slice as well as standard boolean indexing:
In [18]: df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})
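For example, to keep the rows where a is less than or equal to b, a sketch of the equivalent forms:
df.query('a <= b')      # query() with an expression string
df[df.a <= df.b]        # standard boolean indexing
df.loc[df.a <= df.b]    # the same, via .loc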
with
An expression using a data.frame called df in R with the columns a and b would be evaluated using with
like so:
In pandas the equivalent expression, using the eval() method, would be:
In [22]: df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})
Out[23]:
0 1.582100
1 1.102172
2 -0.621243
3 0.059273
4 -1.744880
5 0.915561
6 -2.643852
7 -0.519496
8 0.733243
9 -2.914023
dtype: float64
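The Series above is the kind of result produced by evaluating a column expression; a minimal sketch of the eval() form (the exact expression evaluated in the session is an assumption):
df.eval('a + b')   # equivalent to df.a + df.b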
plyr
plyr is an R library for the split-apply-combine strategy for data analysis. The functions revolve around
three data structures in R, a for arrays, l for lists, and d for data.frame. The table below shows how
these data structures could be mapped in Python.
R Python
array list
lists dictionary or list of objects
data.frame dataframe
ddply
require(plyr)
df <- data.frame(
x = runif(120, 1, 168),
y = runif(120, 7, 334),
z = runif(120, 1.7, 20.7),
month = rep(c(5,6,7,8),30),
(continues on next page)
In pandas the equivalent expression, using the groupby() method, would be:
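A minimal sketch, assuming df also carries the week column from the full R example:
df.groupby(['month', 'week']).agg({'x': ['mean', 'std']})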
reshape / reshape2
melt.array
An expression using a 3 dimensional array called a in R where you want to melt it into a data.frame:
melt.list
An expression using a list called a in R where you want to melt it into a data.frame:
In Python, this list would be a list of tuples, so the DataFrame() constructor would convert it to a DataFrame
as required.
In [31]: pd.DataFrame(a)
Out[31]:
0 1
0 0 1.0
1 1 2.0
2 2 3.0
3 3 4.0
4 4 NaN
For more details and examples see the Intro to Data Structures documentation.
melt.data.frame
An expression using a data.frame called cheese in R where you want to reshape the data.frame:
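In pandas the analogous reshape uses pd.melt(); a sketch, assuming a cheese frame with 'first' and 'last' identifier columns:
cheese = pd.DataFrame({'first': ['John', 'Mary'],
                       'last': ['Doe', 'Bo'],
                       'height': [5.5, 6.0],
                       'weight': [130, 150]})
pd.melt(cheese, id_vars=['first', 'last'])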
cast
In R acast is an expression using a data.frame called df in R to cast into a higher dimensional array:
df <- data.frame(
x = runif(12, 1, 168),
y = runif(12, 7, 334),
z = runif(12, 1.7, 20.7),
month = rep(c(5,6,7),4),
week = rep(c(1,2), 6)
)
Similarly for dcast which uses a data.frame called df in R to aggregate information based on Animal and
FeedType:
df <- data.frame(
Animal = c('Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
'Animal2', 'Animal3'),
FeedType = c('A', 'B', 'A', 'A', 'B', 'B', 'A'),
Amount = c(10, 7, 4, 2, 5, 6, 2)
)
Python can approach this in two different ways. Firstly, similar to above using pivot_table():
In [38]: df = pd.DataFrame({
....: 'Animal': ['Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
....: 'Animal2', 'Animal3'],
....: 'FeedType': ['A', 'B', 'A', 'A', 'B', 'B', 'A'],
....: 'Amount': [10, 7, 4, 2, 5, 6, 2],
....: })
....:
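A sketch of both approaches: the pivot_table() call and, secondly, an equivalent groupby():
df.pivot_table(values='Amount', index='Animal', columns='FeedType', aggfunc='sum')
df.groupby(['Animal', 'FeedType'])['Amount'].sum()   # the second, groupby-based approach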
For more details and examples see the reshaping documentation or the groupby documentation.
factor
cut(c(1,2,3,4,5,6), 3)
factor(c(1,2,3,2,2,3))
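In pandas, these map to pd.cut() and the category dtype; a minimal sketch:
pd.cut(pd.Series([1, 2, 3, 4, 5, 6]), 3)          # bin into 3 intervals, like cut()
pd.Series([1, 2, 3, 2, 2, 3]).astype('category')  # categorical values, like factor()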
Since many potential pandas users have some familiarity with SQL, this page is meant to provide some
examples of how various SQL operations would be performed using pandas.
If you’re new to pandas, you might want to first read through 10 Minutes to pandas to familiarize yourself
with the library.
As is customary, we import pandas and NumPy as follows:
Most of the examples will utilize the tips dataset found within pandas tests. We’ll read the data into a
DataFrame called tips and assume we have a database table of the same name and structure.
In [5]: tips.head()
Out[5]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
SELECT
In SQL, selection is done using a comma-separated list of columns you’d like to select (or a * to select all
columns):
With pandas, column selection is done by passing a list of column names to your DataFrame:
Calling the DataFrame without the list of column names would display all columns (akin to SQL’s *).
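A sketch equivalent to SELECT total_bill, tip, smoker, time FROM tips LIMIT 5:
tips[['total_bill', 'tip', 'smoker', 'time']].head(5)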
WHERE
SELECT *
FROM tips
WHERE time = 'Dinner'
LIMIT 5;
DataFrames can be filtered in multiple ways; the most intuitive of which is using boolean indexing.
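For example, a filter equivalent to the SQL above:
tips[tips['time'] == 'Dinner'].head(5)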
The above statement is simply passing a Series of True/False objects to the DataFrame, returning all rows
with True.
In [8]: is_dinner = tips['time'] == 'Dinner'
In [9]: is_dinner.value_counts()
Out[9]:
True 176
False 68
Name: time, dtype: int64
In [10]: tips[is_dinner].head(5)
Out[10]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
Just like SQL’s OR and AND, multiple conditions can be passed to a DataFrame using | (OR) and & (AND).
-- tips by parties of at least 5 diners OR bill total was more than $45
SELECT *
FROM tips
WHERE size >= 5 OR total_bill > 45;
# tips by parties of at least 5 diners OR bill total was more than $45
In [12]: tips[(tips['size'] >= 5) | (tips['total_bill'] > 45)]
Out[12]:
total_bill tip sex smoker day time size
59 48.27 6.73 Male No Sat Dinner 4
125 29.80 4.20 Female No Thur Lunch 6
141 34.30 6.70 Male No Thur Lunch 6
142 41.19 5.00 Male No Thur Lunch 5
143 27.05 5.00 Female No Thur Lunch 6
155 29.85 5.14 Female No Sun Dinner 5
156 48.17 5.00 Male No Sun Dinner 6
170 50.81 10.00 Male Yes Sat Dinner 3
182 45.35 3.50 Male Yes Sun Dinner 3
185 20.69 5.00 Male No Sun Dinner 5
187 30.46 2.00 Male Yes Sun Dinner 5
212 48.33 9.00 Male No Sat Dinner 4
216 28.15 3.00 Male Yes Sat Dinner 5
In [14]: frame
Out[14]:
col1 col2
0 A F
1 B NaN
2 NaN G
3 C H
4 D I
Assume we have a table of the same structure as our DataFrame above. We can see only the records where col2 IS NULL with the following query:
SELECT *
FROM frame
WHERE col2 IS NULL;
In [15]: frame[frame['col2'].isna()]
Out[15]:
col1 col2
1 B NaN
Getting items where col1 IS NOT NULL can be done with notna().
SELECT *
FROM frame
WHERE col1 IS NOT NULL;
In [16]: frame[frame['col1'].notna()]
Out[16]:
col1 col2
0 A F
1 B NaN
3 C H
4 D I
GROUP BY
In pandas, SQL’s GROUP BY operations are performed using the similarly named groupby() method.
groupby() typically refers to a process where we’d like to split a dataset into groups, apply some function
(typically aggregation), and then combine the groups together.
A common SQL operation would be getting the count of records in each group throughout a dataset. For
instance, a query getting us the number of tips left by sex:
In [17]: tips.groupby('sex').size()
Out[17]:
sex
Female 87
Male 157
dtype: int64
Notice that in the pandas code we used size() and not count(). This is because count() applies the
function to each column, returning the number of not null records within each.
In [18]: tips.groupby('sex').count()
Out[18]:
total_bill tip smoker day time size
sex
Female 87 87 87 87 87 87
Male 157 157 157 157 157 157
In [19]: tips.groupby('sex')['total_bill'].count()
Out[19]:
sex
Female 87
Male 157
Name: total_bill, dtype: int64
Multiple functions can also be applied at once. For instance, say we’d like to see how tip amount differs
by day of the week - agg() allows you to pass a dictionary to your grouped DataFrame, indicating which
functions to apply to specific columns.
Grouping by more than one column is done by passing a list of columns to the groupby() method.
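A sketch of both patterns:
# average tip and number of records per day
tips.groupby('day').agg({'tip': np.mean, 'day': np.size})
# grouping by more than one column
tips.groupby(['smoker', 'day']).agg({'tip': [np.size, np.mean]})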
JOIN
JOINs can be performed with join() or merge(). By default, join() will join the DataFrames on their
indices. Each method has parameters allowing you to specify the type of join to perform (LEFT, RIGHT,
INNER, FULL) or the columns to join on (column names or indices).
Assume we have two database tables of the same name and structure as our DataFrames.
Now let’s go over the various types of JOINs.
INNER JOIN
SELECT *
FROM df1
INNER JOIN df2
ON df1.key = df2.key;
merge() also offers parameters for cases when you’d like to join one DataFrame’s column with another
DataFrame’s index.
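A sketch of both forms: merging on a shared column, and joining a column against the other frame's index:
pd.merge(df1, df2, on='key')                  # INNER JOIN by default
indexed_df2 = df2.set_index('key')
pd.merge(df1, indexed_df2, left_on='key', right_index=True)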
RIGHT JOIN
FULL JOIN
pandas also allows for FULL JOINs, which display both sides of the dataset, whether or not the joined
columns find a match. As of writing, FULL JOINs are not supported in all RDBMS (MySQL).
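A sketch of the right and full joins via the how keyword:
pd.merge(df1, df2, on='key', how='right')   # RIGHT JOIN
pd.merge(df1, df2, on='key', how='outer')   # FULL JOIN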
UNION
SQL's UNION is similar to UNION ALL; however, UNION will remove duplicate rows.
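A minimal sketch of both:
pd.concat([df1, df2])                    # UNION ALL
pd.concat([df1, df2]).drop_duplicates()  # UNION (duplicates removed)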
-- MySQL
SELECT * FROM tips
ORDER BY tip DESC
LIMIT 10 OFFSET 5;
In [36]: (tips.assign(rnk=tips.groupby(['day'])['total_bill']
....: .rank(method='first', ascending=False))
....: .query('rnk < 3')
....: .sort_values(['day', 'rnk']))
....:
Out[36]:
total_bill tip sex smoker day time size rnk
95 40.17 4.73 Male Yes Fri Dinner 4 1.0
90 28.97 3.00 Male Yes Fri Dinner 2 2.0
170 50.81 10.00 Male Yes Sat Dinner 3 1.0
(continues on next page)
Let's find the tips with rank < 3 per gender group, restricted to tips < 2. Notice that when using rank(method='min'),
the rnk_min column remains the same for identical tip amounts (as with Oracle's RANK() function).
UPDATE
UPDATE tips
SET tip = tip*2
WHERE tip < 2;
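In pandas the same update is a conditional assignment with .loc; a minimal sketch:
tips.loc[tips['tip'] < 2, 'tip'] *= 2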
DELETE
In pandas we select the rows that should remain instead of deleting them.
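For example, to mimic DELETE FROM tips WHERE tip > 9, a minimal sketch:
tips = tips.loc[tips['tip'] <= 9]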
{{ header }}
For potential users coming from SAS this page is meant to demonstrate how different SAS operations would
be performed in pandas.
If you’re new to pandas, you might want to first read through 10 Minutes to pandas to familiarize yourself
with the library.
As is customary, we import pandas and NumPy as follows:
Note: Throughout this tutorial, the pandas DataFrame will be displayed by calling df.head(), which
displays the first N (default 5) rows of the DataFrame. This is often used in interactive work (e.g. Jupyter
notebook or terminal) - the equivalent in SAS would be:
Data structures
pandas SAS
DataFrame data set
column variable
row observation
groupby BY-group
NaN .
DataFrame / Series
A DataFrame in pandas is analogous to a SAS data set - a two-dimensional data source with labeled columns
that can be of different types. As will be shown in this document, almost any operation that can be applied
to a data set using SAS's DATA step can also be accomplished in pandas.
A Series is the data structure that represents one column of a DataFrame. SAS doesn’t have a separate data
structure for a single column, but in general, working with a Series is analogous to referencing a column in
the DATA step.
Index
Every DataFrame and Series has an Index - which are labels on the rows of the data. SAS does not have an
exactly analogous concept. A data set’s rows are essentially unlabeled, other than an implicit integer index
that can be accessed during the DATA step (_N_).
In pandas, if no index is specified, an integer index is also used by default (first row = 0, second row = 1,
and so on). While using a labeled Index or MultiIndex can enable sophisticated analyses and is ultimately
an important part of pandas to understand, for this comparison we will essentially ignore the Index and just
treat the DataFrame as a collection of columns. Please see the indexing documentation for much more on
how to use an Index effectively.
A SAS data set can be built from specified values by placing the data after a datalines statement and
specifying the column names.
data df;
input x y;
datalines;
1 2
3 4
5 6
;
run;
A pandas DataFrame can be constructed in many different ways, but for a small number of values, it is often
convenient to specify it as a Python dictionary, where the keys are the column names and the values are the
data.
In [4]: df
Out[4]:
x y
0 1 2
1 3 4
2 5 6
Like SAS, pandas provides utilities for reading in data from many formats. The tips dataset, found within
the pandas tests (csv) will be used in many of the following examples.
SAS provides PROC IMPORT to read csv data into a data set.
In [7]: tips.head()
Out[7]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
Like PROC IMPORT, read_csv can take a number of parameters to specify how the data should be parsed.
For example, if the data was instead tab delimited, and did not have column names, the pandas command
would be:
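A sketch of that call (the file name is illustrative):
tips = pd.read_csv('tips.csv', sep='\t', header=None)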
In addition to text/csv, pandas supports a variety of other data formats such as Excel, HDF5, and SQL
databases. These are all read via a pd.read_* function. See the IO documentation for more details.
Exporting data
Similarly in pandas, the opposite of read_csv is to_csv(), and other data formats follow a similar api.
tips.to_csv('tips2.csv')
Data operations
Operations on columns
In the DATA step, arbitrary math expressions can be used on new or existing columns.
data tips;
set tips;
total_bill = total_bill - 2;
new_bill = total_bill / 2;
run;
pandas provides similar vectorized operations by specifying the individual Series in the DataFrame. New
columns can be assigned in the same way.
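A sketch mirroring the DATA step above, producing the new_bill column shown below:
tips['total_bill'] = tips['total_bill'] - 2
tips['new_bill'] = tips['total_bill'] / 2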
In [10]: tips.head()
Out[10]:
total_bill tip sex smoker day time size new_bill
0 14.99 1.01 Female No Sun Dinner 2 7.495
1 8.34 1.66 Male No Sun Dinner 3 4.170
2 19.01 3.50 Male No Sun Dinner 3 9.505
3 21.68 3.31 Male No Sun Dinner 2 10.840
4 22.59 3.61 Female No Sun Dinner 4 11.295
Filtering
data tips;
set tips;
if total_bill > 10;
run;
data tips;
set tips;
where total_bill > 10;
/* equivalent in this case - where happens before the
DATA step begins and can also be used in PROC statements */
run;
DataFrames can be filtered in multiple ways; the most intuitive of which is using boolean indexing
If/then logic
data tips;
set tips;
format bucket $4.;
The same operation in pandas can be accomplished using the where method from numpy.
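A sketch, assuming a 9.50 cutoff consistent with the bucket values shown below:
tips['bucket'] = np.where(tips['total_bill'] < 9.50, 'low', 'high')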
In [13]: tips.head()
Out[13]:
total_bill tip sex smoker day time size bucket
0 14.99 1.01 Female No Sun Dinner 2 high
1 8.34 1.66 Male No Sun Dinner 3 low
2 19.01 3.50 Male No Sun Dinner 3 high
3 21.68 3.31 Male No Sun Dinner 2 high
4 22.59 3.61 Female No Sun Dinner 4 high
Date functionality
data tips;
set tips;
format date1 date2 date1_plusmonth mmddyy10.;
date1 = mdy(1, 15, 2013);
date2 = mdy(2, 15, 2015);
date1_year = year(date1);
date2_month = month(date2);
* shift date to beginning of next interval;
date1_next = intnx('MONTH', date1, 1);
* count intervals between dates;
months_between = intck('MONTH', date1, date2);
run;
The equivalent pandas operations are shown below. In addition to these functions pandas supports other
Time Series features not available in Base SAS (such as resampling and custom offsets) - see the timeseries
documentation for more details.
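A sketch of the steps leading up to the months_between calculation below:
tips['date1'] = pd.Timestamp('2013-01-15')
tips['date2'] = pd.Timestamp('2015-02-15')
tips['date1_year'] = tips['date1'].dt.year
tips['date2_month'] = tips['date2'].dt.month
tips['date1_next'] = tips['date1'] + pd.offsets.MonthBegin()  # beginning of next month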
In [19]: tips['months_between'] = (
....: tips['date2'].dt.to_period('M') - tips['date1'].dt.to_period('M'))
....:
Selection of columns
SAS provides keywords in the DATA step to select, drop, and rename columns.
data tips;
set tips;
keep sex total_bill tip;
run;
data tips;
set tips;
drop sex;
run;
data tips;
set tips;
rename total_bill=total_bill_2;
run;
# drop
In [22]: tips.drop('sex', axis=1).head()
Out[22]:
total_bill tip smoker day time size
0 14.99 1.01 No Sun Dinner 2
1 8.34 1.66 No Sun Dinner 3
2 19.01 3.50 No Sun Dinner 3
3 21.68 3.31 No Sun Dinner 2
4 22.59 3.61 No Sun Dinner 4
# rename
In [23]: tips.rename(columns={'total_bill': 'total_bill_2'}).head()
Out[23]:
total_bill_2 tip sex smoker day time size
Sorting by values
pandas objects have a sort_values() method, which takes a list of columns to sort by.
In [25]: tips.head()
Out[25]:
total_bill tip sex smoker day time size
67 1.07 1.00 Female Yes Sat Dinner 1
92 3.75 1.00 Female Yes Fri Dinner 2
111 5.25 1.00 Female No Sat Dinner 1
145 6.35 1.50 Female No Thur Lunch 2
135 6.51 1.25 Female No Thur Lunch 2
String processing
Length
SAS determines the length of a character string with the LENGTHN and LENGTHC functions. LENGTHN
excludes trailing blanks and LENGTHC includes trailing blanks.
data _null_;
set tips;
put(LENGTHN(time));
put(LENGTHC(time));
run;
Python determines the length of a character string with the len function. len includes trailing blanks. Use
len and rstrip to exclude trailing blanks.
In [26]: tips['time'].str.len().head()
Out[26]:
67 6
92 6
111 6
145 5
135 5
Name: time, dtype: int64
In [27]: tips['time'].str.rstrip().str.len().head()
Out[27]:
67 6
92 6
111 6
145 5
135 5
Name: time, dtype: int64
Find
SAS determines the position of a character in a string with the FINDW function. FINDW takes the string
defined by the first argument and searches for the first position of the substring you supply as the second
argument.
data _null_;
set tips;
put(FINDW(sex,'ale'));
run;
Python determines the position of a character in a string with the find function. find searches for the first
position of the substring. If the substring is found, the function returns its position. Keep in mind that
Python indexes are zero-based and the function will return -1 if it fails to find the substring.
In [28]: tips['sex'].str.find("ale").head()
Out[28]:
67 3
92 3
111 3
145 3
135 3
Name: sex, dtype: int64
Substring
SAS extracts a substring from a string based on its position with the SUBSTR function.
data _null_;
set tips;
put(substr(sex,1,1));
run;
With pandas you can use [] notation to extract a substring from a string by position locations. Keep in
mind that Python indexes are zero-based.
In [29]: tips['sex'].str[0:1].head()
Out[29]:
67 F
92 F
111 F
145 F
135 F
Name: sex, dtype: object
Scan
The SAS SCAN function returns the nth word from a string. The first argument is the string you want to
parse and the second argument specifies which word you want to extract.
data firstlast;
input String $60.;
First_Name = scan(string, 1);
Last_Name = scan(string, -1);
datalines2;
John Smith;
Jane Cook;
;;;
run;
Python extracts a substring from a string based on its text by using regular expressions. There are much
more powerful approaches, but this just shows a simple approach.
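A minimal sketch of the split-based approach; note that taking column 1 of the split yields the actual last word, whereas the session output below evidently kept the first token for both columns:
firstlast = pd.DataFrame({'String': ['John Smith', 'Jane Cook']})
split = firstlast['String'].str.split(' ', expand=True)
firstlast['First_Name'] = split[0]
firstlast['Last_Name'] = split[1]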
In [33]: firstlast
Out[33]:
String First_Name Last_Name
0 John Smith John John
1 Jane Cook Jane Jane
The SAS UPCASE LOWCASE and PROPCASE functions change the case of the argument.
data firstlast;
input String $60.;
string_up = UPCASE(string);
string_low = LOWCASE(string);
string_prop = PROPCASE(string);
datalines2;
John Smith;
Jane Cook;
;;;
run;
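The pandas equivalents use the .str accessor; a sketch producing columns like those shown below:
firstlast = pd.DataFrame({'String': ['John Smith', 'Jane Cook']})
firstlast['string_up'] = firstlast['String'].str.upper()
firstlast['string_low'] = firstlast['String'].str.lower()
firstlast['string_prop'] = firstlast['String'].str.title()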
In [38]: firstlast
Out[38]:
String string_up string_low string_prop
0 John Smith JOHN SMITH john smith John Smith
1 Jane Cook JANE COOK jane cook Jane Cook
Merging
In [40]: df1
Out[40]:
key value
0 A 0.970270
1 B 0.438338
2 C 0.402038
3 D -0.629526
In [42]: df2
Out[42]:
key value
0 B -0.600662
1 D -0.478682
2 D -0.546915
3 E -0.054672
In SAS, data must be explicitly sorted before merging. Different types of joins are accomplished using the
in= dummy variables to track whether a match was found in one or both input frames.
pandas DataFrames have a merge() method, which provides similar functionality. Note that the data does
not have to be sorted ahead of time, and different join types are accomplished via the how keyword.
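A sketch of the four join types via the how keyword:
inner_join = df1.merge(df2, on=['key'], how='inner')
left_join = df1.merge(df2, on=['key'], how='left')
right_join = df1.merge(df2, on=['key'], how='right')
outer_join = df1.merge(df2, on=['key'], how='outer')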
In [44]: inner_join
Out[44]:
key value_x value_y
0 B 0.438338 -0.600662
1 D -0.629526 -0.478682
2 D -0.629526 -0.546915
In [46]: left_join
Out[46]:
key value_x value_y
0 A 0.970270 NaN
1 B 0.438338 -0.600662
2 C 0.402038 NaN
3 D -0.629526 -0.478682
4 D -0.629526 -0.546915
In [48]: right_join
Out[48]:
key value_x value_y
0 B 0.438338 -0.600662
1 D -0.629526 -0.478682
2 D -0.629526 -0.546915
3 E NaN -0.054672
In [50]: outer_join
Out[50]:
key value_x value_y
0 A 0.970270 NaN
1 B 0.438338 -0.600662
2 C 0.402038 NaN
3 D -0.629526 -0.478682
4 D -0.629526 -0.546915
5 E NaN -0.054672
Missing data
Like SAS, pandas has a representation for missing data - which is the special float value NaN (not a number).
Many of the semantics are the same, for example missing data propagates through numeric operations, and
is ignored by default for aggregations.
In [51]: outer_join
Out[51]:
key value_x value_y
0 A 0.970270 NaN
1 B 0.438338 -0.600662
2 C 0.402038 NaN
3 D -0.629526 -0.478682
4 D -0.629526 -0.546915
5 E NaN -0.054672
In [53]: outer_join['value_x'].sum()
Out[53]: 0.5515934095727003
One difference is that missing data cannot be compared to its sentinel value. For example, in SAS you could
do this to filter missing values.
data outer_join_nulls;
set outer_join;
if value_x = .;
run;
data outer_join_no_nulls;
set outer_join;
if value_x ^= .;
run;
Which doesn’t work in pandas. Instead, the pd.isna or pd.notna functions should be used for comparisons.
In [54]: outer_join[pd.isna(outer_join['value_x'])]
Out[54]:
key value_x value_y
5 E NaN -0.054672
In [55]: outer_join[pd.notna(outer_join['value_x'])]
Out[55]:
key value_x value_y
0 A 0.970270 NaN
1 B 0.438338 -0.600662
2 C 0.402038 NaN
3 D -0.629526 -0.478682
4 D -0.629526 -0.546915
pandas also provides a variety of methods to work with missing data - some of which would be challenging to
express in SAS. For example, there are methods to drop all rows with any missing values, replacing missing
values with a specified value, like the mean, or forward filling from previous rows. See the missing data
documentation for more.
In [56]: outer_join.dropna()
Out[56]:
key value_x value_y
1 B 0.438338 -0.600662
3 D -0.629526 -0.478682
4 D -0.629526 -0.546915
In [57]: outer_join.fillna(method='ffill')
Out[57]:
key value_x value_y
0 A 0.970270 NaN
1 B 0.438338 -0.600662
2 C 0.402038 -0.600662
3 D -0.629526 -0.478682
4 D -0.629526 -0.546915
5 E -0.629526 -0.054672
In [58]: outer_join['value_x'].fillna(outer_join['value_x'].mean())
Out[58]:
0 0.970270
1 0.438338
2 0.402038
3 -0.629526
4 -0.629526
5 0.110319
Name: value_x, dtype: float64
GroupBy
Aggregation
SAS’s PROC SUMMARY can be used to group by one or more key variables and compute aggregations on
numeric columns.
pandas provides a flexible groupby mechanism that allows similar aggregations. See the groupby documen-
tation for more details and examples.
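A sketch of the aggregation that produces tips_summed below:
tips_summed = tips.groupby(['sex', 'smoker'])[['total_bill', 'tip']].sum()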
In [60]: tips_summed.head()
(continues on next page)
Transformation
In SAS, if the group aggregations need to be used with the original frame, it must be merged back together.
For example, to subtract the mean for each observation by smoker group.
data tips;
merge tips(in=a) smoker_means(in=b);
by smoker;
adj_total_bill = total_bill - group_bill;
if a and b;
run;
pandas groupby provides a transform mechanism that allows these type of operations to be succinctly
expressed in one operation.
In [61]: gb = tips.groupby('smoker')['total_bill']
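A sketch of the transform step that builds adj_total_bill from gb:
tips['adj_total_bill'] = tips['total_bill'] - gb.transform('mean')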
In [63]: tips.head()
Out[63]:
total_bill tip sex smoker day time size adj_total_bill
67 1.07 1.00 Female Yes Sat Dinner 1 -17.686344
92 3.75 1.00 Female Yes Fri Dinner 2 -15.006344
111 5.25 1.00 Female No Sat Dinner 1 -11.938278
145 6.35 1.50 Female No Thur Lunch 2 -10.838278
135 6.51 1.25 Female No Thur Lunch 2 -10.678278
By group processing
In addition to aggregation, pandas groupby can be used to replicate most other by group processing from
SAS. For example, this DATA step reads the data by sex/smoker group and filters to the first entry for each.
data tips_first;
set tips;
by sex smoker;
if FIRST.sex or FIRST.smoker then output;
run;
Other Considerations
Disk vs memory
pandas operates exclusively in memory, where a SAS data set exists on disk. This means that the size of
data able to be loaded in pandas is limited by your machine’s memory, but also that the operations on that
data may be faster.
If out of core processing is needed, one possibility is the dask.dataframe library (currently in development),
which provides a subset of pandas functionality for an on-disk DataFrame.
Data interop
pandas provides a read_sas() method that can read SAS data saved in the XPORT or SAS7BDAT binary
format.
df = pd.read_sas('transport-file.xpt')
df = pd.read_sas('binary-file.sas7bdat')
You can also specify the file format directly. By default, pandas will try to infer the file format based on its
extension.
df = pd.read_sas('transport-file.xpt', format='xport')
df = pd.read_sas('binary-file.sas7bdat', format='sas7bdat')
XPORT is a relatively limited format and the parsing of it is not as optimized as some of the other pandas
readers. An alternative way to interop data between SAS and pandas is to serialize to csv.
{{ header }}
For potential users coming from Stata this page is meant to demonstrate how different Stata operations
would be performed in pandas.
If you’re new to pandas, you might want to first read through 10 Minutes to pandas to familiarize yourself
with the library.
As is customary, we import pandas and NumPy as follows. This means that we can refer to the libraries as
pd and np, respectively, for the rest of the document.
Note: Throughout this tutorial, the pandas DataFrame will be displayed by calling df.head(), which
displays the first N (default 5) rows of the DataFrame. This is often used in interactive work (e.g. Jupyter
notebook or terminal) – the equivalent in Stata would be:
list in 1/5
Data structures
pandas Stata
DataFrame data set
column variable
row observation
groupby bysort
NaN .
DataFrame / Series
A DataFrame in pandas is analogous to a Stata data set – a two-dimensional data source with labeled columns
that can be of different types. As will be shown in this document, almost any operation that can be applied to a data set in Stata can also be accomplished in pandas.
Index
Every DataFrame and Series has an Index – labels on the rows of the data. Stata does not have an exactly
analogous concept. In Stata, a data set’s rows are essentially unlabeled, other than an implicit integer index
that can be accessed with _n.
In pandas, if no index is specified, an integer index is also used by default (first row = 0, second row = 1,
and so on). While using a labeled Index or MultiIndex can enable sophisticated analyses and is ultimately
an important part of pandas to understand, for this comparison we will essentially ignore the Index and just
treat the DataFrame as a collection of columns. Please see the indexing documentation for much more on
how to use an Index effectively.
A Stata data set can be built from specified values by placing the data after an input statement and
specifying the column names.
input x y
1 2
3 4
5 6
end
A pandas DataFrame can be constructed in many different ways, but for a small number of values, it is often
convenient to specify it as a Python dictionary, where the keys are the column names and the values are the
data.
In [4]: df
Out[4]:
x y
0 1 2
1 3 4
2 5 6
Like Stata, pandas provides utilities for reading in data from many formats. The tips data set, found within
the pandas tests (csv) will be used in many of the following examples.
Stata provides import delimited to read csv data into a data set in memory. If the tips.csv file is in the
current working directory, we can import it as follows.
The pandas method is read_csv(), which works similarly. Additionally, it will automatically download the
data set if presented with a url.
In [7]: tips.head()
Out[7]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
Like import delimited, read_csv() can take a number of parameters to specify how the data should be
parsed. For example, if the data were instead tab delimited, did not have column names, and existed in the
current working directory, the pandas command would be:
Pandas can also read Stata data sets in .dta format with the read_stata() function.
df = pd.read_stata('data.dta')
In addition to text/csv and Stata files, pandas supports a variety of other data formats such as Excel, SAS,
HDF5, Parquet, and SQL databases. These are all read via a pd.read_* function. See the IO documentation
for more details.
Exporting data
tips.to_csv('tips2.csv')
Pandas can also export to Stata file format with the DataFrame.to_stata() method.
tips.to_stata('tips2.dta')
Data operations
Operations on columns
In Stata, arbitrary math expressions can be used with the generate and replace commands on new or
existing columns. The drop command drops the column from the data set.
pandas provides similar vectorized operations by specifying the individual Series in the DataFrame. New
columns can be assigned in the same way. The DataFrame.drop() method drops a column from the
DataFrame.
In [10]: tips.head()
Out[10]:
total_bill tip sex smoker day time size new_bill
0 14.99 1.01 Female No Sun Dinner 2 7.495
1 8.34 1.66 Male No Sun Dinner 3 4.170
2 19.01 3.50 Male No Sun Dinner 3 9.505
3 21.68 3.31 Male No Sun Dinner 2 10.840
4 22.59 3.61 Female No Sun Dinner 4 11.295
Filtering
DataFrames can be filtered in multiple ways; the most intuitive of which is using boolean indexing.
If/then logic
The same operation in pandas can be accomplished using the where method from numpy.
In [14]: tips.head()
Out[14]:
total_bill tip sex smoker day time size bucket
0 14.99 1.01 Female No Sun Dinner 2 high
1 8.34 1.66 Male No Sun Dinner 3 low
2 19.01 3.50 Male No Sun Dinner 3 high
3 21.68 3.31 Male No Sun Dinner 2 high
4 22.59 3.61 Female No Sun Dinner 4 high
Date functionality
The equivalent pandas operations are shown below. In addition to these functions, pandas supports other
Time Series features not available in Stata (such as time zone handling and custom offsets) – see the timeseries
documentation for more details.
Selection of columns
drop sex
The same operations are expressed in pandas below. Note that in contrast to Stata, these operations do not
happen in place. To make these changes persist, assign the operation back to a variable.
# keep
In [22]: tips[['sex', 'total_bill', 'tip']].head()
Out[22]:
sex total_bill tip
0 Female 14.99 1.01
1 Male 8.34 1.66
2 Male 19.01 3.50
3 Male 21.68 3.31
4 Female 22.59 3.61
# drop
In [23]: tips.drop('sex', axis=1).head()
Out[23]:
total_bill tip smoker day time size
0 14.99 1.01 No Sun Dinner 2
1 8.34 1.66 No Sun Dinner 3
2 19.01 3.50 No Sun Dinner 3
3 21.68 3.31 No Sun Dinner 2
4 22.59 3.61 No Sun Dinner 4
# rename
In [24]: tips.rename(columns={'total_bill': 'total_bill_2'}).head()
Out[24]:
total_bill_2 tip sex smoker day time size
0 14.99 1.01 Female No Sun Dinner 2
1 8.34 1.66 Male No Sun Dinner 3
2 19.01 3.50 Male No Sun Dinner 3
3 21.68 3.31 Male No Sun Dinner 2
4 22.59 3.61 Female No Sun Dinner 4
Sorting by values
pandas objects have a DataFrame.sort_values() method, which takes a list of columns to sort by.
In [26]: tips.head()
Out[26]:
total_bill tip sex smoker day time size
67 1.07 1.00 Female Yes Sat Dinner 1
92 3.75 1.00 Female Yes Fri Dinner 2
111 5.25 1.00 Female No Sat Dinner 1
145 6.35 1.50 Female No Thur Lunch 2
135 6.51 1.25 Female No Thur Lunch 2
String processing
Stata determines the length of a character string with the strlen() and ustrlen() functions for ASCII
and Unicode strings, respectively.
Python determines the length of a character string with the len function. In Python 3, all strings are
Unicode strings. len includes trailing blanks. Use len and rstrip to exclude trailing blanks.
In [27]: tips['time'].str.len().head()
Out[27]:
67 6
92 6
111 6
145 5
135 5
Name: time, dtype: int64
In [28]: tips['time'].str.rstrip().str.len().head()
Out[28]:
67 6
92 6
111 6
145 5
135 5
Name: time, dtype: int64
Stata determines the position of a character in a string with the strpos() function. This takes the string
defined by the first argument and searches for the first position of the substring you supply as the second
argument.
Python determines the position of a character in a string with the find() function. find searches for the
first position of the substring. If the substring is found, the function returns its position. Keep in mind that
Python indexes are zero-based and the function will return -1 if it fails to find the substring.
In [29]: tips['sex'].str.find("ale").head()
Out[29]:
67 3
92 3
111 3
145 3
135 3
Name: sex, dtype: int64
Stata extracts a substring from a string based on its position with the substr() function.
With pandas you can use [] notation to extract a substring from a string by position locations. Keep in
mind that Python indexes are zero-based.
In [30]: tips['sex'].str[0:1].head()
Out[30]:
67 F
92 F
111 F
145 F
135 F
Name: sex, dtype: object
The Stata word() function returns the nth word from a string. The first argument is the string you want to
parse and the second argument specifies which word you want to extract.
clear
input str20 string
"John Smith"
"Jane Cook"
end
Python extracts a substring from a string based on its text by using regular expressions. There are much
more powerful approaches, but this just shows a simple approach.
In [34]: firstlast
Out[34]:
string First_Name Last_Name
0 John Smith John John
1 Jane Cook Jane Jane
Changing case
The Stata strupper(), strlower(), strproper(), ustrupper(), ustrlower(), and ustrtitle() functions
change the case of ASCII and Unicode strings, respectively.
clear
input str20 string
"John Smith"
"Jane Cook"
end
In [39]: firstlast
Out[39]:
string upper lower title
0 John Smith JOHN SMITH john smith John Smith
1 Jane Cook JANE COOK jane cook Jane Cook
Merging
In [41]: df1
Out[41]:
key value
0 A -1.456807
1 B 1.779988
2 C -1.602829
3 D 0.944411
In [43]: df2
Out[43]:
key value
0 B 0.750030
1 D 0.349022
2 D 1.433851
3 E -0.720900
In Stata, to perform a merge, one data set must be in memory and the other must be referenced as a file
name on disk. In contrast, Python must have both DataFrames already in memory.
By default, Stata performs an outer join, where all observations from both data sets are left in memory after
the merge. One can keep only observations from the initial data set, the merged data set, or the intersection
of the two by using the values created in the _merge variable.
preserve
* Left join
merge 1:n key using df2.dta
keep if _merge == 1
* Right join
restore, preserve
merge 1:n key using df2.dta
keep if _merge == 2
* Inner join
restore, preserve
merge 1:n key using df2.dta
keep if _merge == 3
* Outer join
restore
merge 1:n key using df2.dta
pandas DataFrames have a DataFrame.merge() method, which provides similar functionality. Note that
different join types are accomplished via the how keyword.
In [45]: inner_join
Out[45]:
key value_x value_y
0 B 1.779988 0.750030
1 D 0.944411 0.349022
2 D 0.944411 1.433851
In [47]: left_join
Out[47]:
key value_x value_y
0 A -1.456807 NaN
1 B 1.779988 0.750030
2 C -1.602829 NaN
3 D 0.944411 0.349022
4 D 0.944411 1.433851
In [49]: right_join
Out[49]:
key value_x value_y
0 B 1.779988 0.750030
1 D 0.944411 0.349022
(continues on next page)
In [51]: outer_join
Out[51]:
key value_x value_y
0 A -1.456807 NaN
1 B 1.779988 0.750030
2 C -1.602829 NaN
3 D 0.944411 0.349022
4 D 0.944411 1.433851
5 E NaN -0.720900
Missing data
Like Stata, pandas has a representation for missing data – the special float value NaN (not a number). Many
of the semantics are the same; for example missing data propagates through numeric operations, and is
ignored by default for aggregations.
In [52]: outer_join
Out[52]:
key value_x value_y
0 A -1.456807 NaN
1 B 1.779988 0.750030
2 C -1.602829 NaN
3 D 0.944411 0.349022
4 D 0.944411 1.433851
5 E NaN -0.720900
In [54]: outer_join['value_x'].sum()
Out[54]: 0.6091740246349007
One difference is that missing data cannot be compared to its sentinel value. For example, in Stata you
could do this to filter missing values.
This doesn’t work in pandas. Instead, the pd.isna() or pd.notna() functions should be used for compar-
isons.
In [55]: outer_join[pd.isna(outer_join['value_x'])]
Out[55]:
key value_x value_y
5 E NaN -0.7209
In [56]: outer_join[pd.notna(outer_join['value_x'])]
Out[56]:
key value_x value_y
0 A -1.456807 NaN
1 B 1.779988 0.750030
2 C -1.602829 NaN
3 D 0.944411 0.349022
4 D 0.944411 1.433851
Pandas also provides a variety of methods to work with missing data – some of which would be challenging
to express in Stata. For example, there are methods to drop all rows with any missing values, replacing
missing values with a specified value, like the mean, or forward filling from previous rows. See the missing
data documentation for more.
# Drop rows with any missing value
In [57]: outer_join.dropna()
Out[57]:
key value_x value_y
1 B 1.779988 0.750030
3 D 0.944411 0.349022
4 D 0.944411 1.433851
# Fill forwards
In [58]: outer_join.fillna(method='ffill')
Out[58]:
key value_x value_y
0 A -1.456807 NaN
1 B 1.779988 0.750030
2 C -1.602829 0.750030
3 D 0.944411 0.349022
4 D 0.944411 1.433851
5 E 0.944411 -0.720900
GroupBy
Aggregation
Stata’s collapse can be used to group by one or more key variables and compute aggregations on numeric
columns.
pandas provides a flexible groupby mechanism that allows similar aggregations. See the groupby documen-
tation for more details and examples.
In [61]: tips_summed.head()
Out[61]:
total_bill tip
sex smoker
Female No 869.68 149.77
Yes 527.27 96.74
Male No 1725.75 302.00
Yes 1217.07 183.07
Transformation
In Stata, if the group aggregations need to be used with the original data set, one would usually use bysort
with egen(). For example, to subtract the mean for each observation by smoker group.
pandas groupby provides a transform mechanism that allows these type of operations to be succinctly
expressed in one operation.
In [62]: gb = tips.groupby('smoker')['total_bill']
In [64]: tips.head()
Out[64]:
total_bill tip sex smoker day time size adj_total_bill
67 1.07 1.00 Female Yes Sat Dinner 1 -17.686344
92 3.75 1.00 Female Yes Fri Dinner 2 -15.006344
111 5.25 1.00 Female No Sat Dinner 1 -11.938278
145 6.35 1.50 Female No Thur Lunch 2 -10.838278
135 6.51 1.25 Female No Thur Lunch 2 -10.678278
By group processing
In addition to aggregation, pandas groupby can be used to replicate most other bysort processing from
Stata. For example, the following example lists the first observation in the current sort order by sex/smoker
group.
Other considerations
Disk vs memory
Pandas and Stata both operate exclusively in memory. This means that the size of data able to be loaded
in pandas is limited by your machine’s memory. If out of core processing is needed, one possibility is the
dask.dataframe library, which provides a subset of pandas functionality for an on-disk DataFrame.
{{ header }}
3.6 Tutorials
This is a guide to many pandas tutorials, geared mainly for new users.
The goal of this 2015 cookbook (by Julia Evans) is to give you some concrete examples for getting started
with pandas. These are examples with real-world data, and all the bugs and weirdness that entails. For the
table of contents, see the pandas-cookbook GitHub repository.
This guide is an introduction to the data analysis process using the Python data ecosystem and an interesting
open dataset. There are four sections covering selected topics such as munging data, aggregating data, visualizing
data and time series.
Practice your skills with real data sets and exercises. For more resources, please visit the main repository.
Modern pandas
Tutorial series written in 2016 by Tom Augspurger. The source may be found in the GitHub repository
TomAugspurger/effective-pandas.
• Modern Pandas
• Method Chaining
• Indexes
• Performance
• Tidy Data
• Visualization
• Timeseries
Video tutorials
Various tutorials
FOUR
USER GUIDE
The User Guide covers all of pandas by topic area. Each of the subsections introduces a topic (such as
“working with missing data”), and discusses how pandas approaches the problem, with many examples
throughout.
Users brand-new to pandas should start with 10min.
Further information on any specific method can be obtained in the API reference. {{ header }}
The pandas I/O API is a set of top level reader functions accessed like pandas.read_csv() that generally
return a pandas object. The corresponding writer functions are object methods that are accessed like
DataFrame.to_csv(). Below is a table containing available readers and writers.
Note: For examples that use the StringIO class, make sure you import it according to your Python
version, i.e. from StringIO import StringIO for Python 2 and from io import StringIO for Python 3.
The workhorse function for reading text files (a.k.a. flat files) is read_csv(). See the cookbook for some
advanced strategies.
Parsing options
Basic
header [int or list of ints, default 'infer'] Row number(s) to use as the column names, and the start of the
data. Default behavior is to infer the column names: if no names are passed the behavior is identical
to header=0 and column names are inferred from the first line of the file, if column names are passed
explicitly then the behavior is identical to header=None. Explicitly pass header=0 to be able to replace
existing names.
The header can be a list of ints that specify row locations for a MultiIndex on the columns e.g. [0,1,
3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note
that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0
denotes the first line of data rather than the first line of the file.
names [array-like, default None] List of column names to use. If file contains no header row, then you should
explicitly pass header=None. Duplicates in this list are not allowed.
index_col [int, str, sequence of int / str, or False, default None] Column(s) to use as the row labels of
the DataFrame, either given as string name or column index. If a sequence of int / str is given, a
MultiIndex is used.
Note: index_col=False can be used to force pandas to not use the first column as the index, e.g.
when you have a malformed file with delimiters at the end of each line.
usecols [list-like or callable, default None] Return a subset of the columns. If list-like, all elements must
either be positional (i.e. integer indices into the document columns) or strings that correspond to
column names provided either by the user in names or inferred from the document header row(s). For
example, a valid list-like usecols parameter would be [0, 1, 2] or ['foo', 'bar', 'baz'].
Element order is ignored, so usecols=[0, 1] is the same as [1, 0]. To instantiate a DataFrame
from data with element order preserved use pd.read_csv(data, usecols=['foo', 'bar'])[['foo',
'bar']] for columns in ['foo', 'bar'] order or pd.read_csv(data, usecols=['foo',
'bar'])[['bar', 'foo']] for ['bar', 'foo'] order.
If callable, the callable function will be evaluated against the column names, returning names where
the callable function evaluates to True:
In [1]: from io import StringIO, BytesIO
In [3]: pd.read_csv(StringIO(data))
Out[3]:
col1 col2 col3
0 a b 1
1 a b 2
2 c d 3
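A sketch of the callable form, assuming data is the same three-column CSV text shown above:
data = 'col1,col2,col3\na,b,1\na,b,2\nc,d,3'
pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ['COL1', 'COL3'])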
dtype [Type name or dict of column -> type, default None] Data type for data or columns. E.g. {'a':
np.float64, 'b': np.int32} (unsupported with engine='python'). Use str or object together with
suitable na_values settings to preserve and not interpret dtype.
New in version 0.20.0: support for the Python parser.
engine [{'c', 'python'}] Parser engine to use. The C engine is faster while the Python engine is currently
more feature-complete.
converters [dict, default None] Dict of functions for converting values in certain columns. Keys can either
be integers or column labels.
true_values [list, default None] Values to consider as True.
In [6]: pd.read_csv(StringIO(data))
Out[6]:
col1 col2 col3
0 a b 1
1 a b 2
2 c d 3
na_values [scalar, str, list-like, or dict, default None] Additional strings to recognize as NA/NaN. If dict
passed, specific per-column NA values. See na values const below for a list of the values interpreted as
NaN by default.
keep_default_na [boolean, default True] Whether or not to include the default NaN values when parsing
the data. Depending on whether na_values is passed in, the behavior is as follows:
• If keep_default_na is True, and na_values are specified, na_values is appended to the default
NaN values used for parsing.
• If keep_default_na is True, and na_values are not specified, only the default NaN values are used
for parsing.
• If keep_default_na is False, and na_values are specified, only the NaN values specified na_values
are used for parsing.
• If keep_default_na is False, and na_values are not specified, no strings will be parsed as NaN.
Note that if na_filter is passed in as False, the keep_default_na and na_values parameters will be
ignored.
na_filter [boolean, default True] Detect missing value markers (empty strings and the value of na_values).
In data without any NAs, passing na_filter=False can improve the performance of reading a large
file.
verbose [boolean, default False] Indicate number of NA values placed in non-numeric columns.
skip_blank_lines [boolean, default True] If True, skip over blank lines rather than interpreting as NaN
values.
Datetime handling
parse_dates [boolean or list of ints or names or list of lists or dict, default False.]
• If True -> try parsing the index.
• If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
• If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
• If {'foo': [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’. A fast-path exists for
iso8601-formatted dates.
infer_datetime_format [boolean, default False] If True and parse_dates is enabled for a column, at-
tempt to infer the datetime format to speed up the processing.
keep_date_col [boolean, default False] If True and parse_dates specifies combining multiple columns
then keep the original columns.
date_parser [function, default None] Function to use for converting a sequence of string columns to an
array of datetime instances. The default uses dateutil.parser.parser to do the conversion. pandas
will try to call date_parser in three different ways, advancing to the next if an exception occurs: 1)
Pass one or more arrays (as defined by parse_dates) as arguments; 2) concatenate (row-wise) the
string values from the columns defined by parse_dates into a single array and pass that; and 3) call
date_parser once for each row using one or more strings (corresponding to the columns defined by
parse_dates) as arguments.
dayfirst [boolean, default False] DD/MM format dates, international and European format.
cache_dates [boolean, default True] If True, use a cache of unique, converted dates to apply the datetime
conversion. May produce significant speed-up when parsing duplicate date strings, especially ones with
timezone offsets.
New in version 0.25.0.
Iteration
iterator [boolean, default False] Return TextFileReader object for iteration or getting chunks with
get_chunk().
chunksize [int, default None] Return TextFileReader object for iteration. See iterating and chunking below.
compression [{'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'] For on-the-fly decompres-
sion of on-disk data. If ‘infer’, then use gzip, bz2, zip, or xz if filepath_or_buffer is a string ending
in ‘.gz’, ‘.bz2’, ‘.zip’, or ‘.xz’, respectively, and no decompression otherwise. If using ‘zip’, the ZIP file
must contain only one data file to be read in. Set to None for no decompression.
New in version 0.18.1: support for ‘zip’ and ‘xz’ compression.
Changed in version 0.24.0: ‘infer’ option added and set to default.
thousands [str, default None] Thousands separator.
decimal [str, default '.'] Character to recognize as decimal point. E.g. use ',' for European data.
float_precision [string, default None] Specifies which converter the C engine should use for floating-point
values. The options are None for the ordinary converter, high for the high-precision converter, and
round_trip for the round-trip converter.
lineterminator [str (length 1), default None] Character to break file into lines. Only valid with C parser.
quotechar [str (length 1)] The character used to denote the start and end of a quoted item. Quoted items
can include the delimiter and it will be ignored.
quoting [int or csv.QUOTE_* instance, default 0] Control field quoting behavior per csv.QUOTE_* constants.
Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
doublequote [boolean, default True] When quotechar is specified and quoting is not QUOTE_NONE, indicate
whether or not to interpret two consecutive quotechar elements inside a field as a single quotechar
element.
escapechar [str (length 1), default None] One-character string used to escape delimiter when quoting is
QUOTE_NONE.
comment [str, default None] Indicates remainder of line should not be parsed. If found at the beginning of
a line, the line will be ignored altogether. This parameter must be a single character. Like empty lines
(as long as skip_blank_lines=True), fully commented lines are ignored by the parameter header but
not by skiprows. For example, if comment='#', parsing ‘#empty\na,b,c\n1,2,3’ with header=0 will
result in ‘a,b,c’ being treated as the header.
encoding [str, default None] Encoding to use for UTF when reading/writing (e.g. 'utf-8'). List of Python
standard encodings.
dialect [str or csv.Dialect instance, default None] If provided, this parameter will override values (default
or not) for the following parameters: delimiter, doublequote, escapechar, skipinitialspace, quotechar,
and quoting. If it is necessary to override values, a ParserWarning will be issued. See csv.Dialect
documentation for more details.
Error handling
error_bad_lines [boolean, default True] Lines with too many fields (e.g. a csv line with too many
commas) will by default cause an exception to be raised, and no DataFrame will be returned. If False,
then these “bad lines” will be dropped from the DataFrame that is returned. See bad lines below.
warn_bad_lines [boolean, default True] If error_bad_lines is False, and warn_bad_lines is True, a
warning for each “bad line” will be output.
You can indicate the data type for the whole DataFrame or individual columns:
In [8]: data = ('a,b,c,d\n'
...: '1,2,3,4\n'
...: '5,6,7,8\n'
...: '9,10,11')
...:
In [9]: print(data)
a,b,c,d
1,2,3,4
5,6,7,8
9,10,11
In [11]: df
Out[11]:
a b c d
0 1 2 3 4
1 5 6 7 8
2 9 10 11 NaN
In [12]: df['a'][0]
Out[12]: '1'
In [13]: df = pd.read_csv(StringIO(data),
....: dtype={'b': object, 'c': np.float64, 'd': 'Int64'})
....:
In [14]: df.dtypes
Out[14]:
a int64
b object
c float64
d Int64
dtype: object
Fortunately, pandas offers more than one way to ensure that your column(s) contain only one dtype. If
you’re unfamiliar with these concepts, you can see here to learn more about dtypes, and here to learn more
about object conversion in pandas.
For instance, you can use the converters argument of read_csv():
In [15]: data = ("col_1\n"
....: "1\n"
....: "2\n"
....: "'A'\n"
....: "4.22")
....:
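A sketch of the converters call that yields the all-string column shown below:
df = pd.read_csv(StringIO(data), converters={'col_1': str})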
In [17]: df
Out[17]:
col_1
0 1
1 2
2 'A'
3 4.22
In [18]: df['col_1'].apply(type).value_counts()
Out[18]:
<class 'str'> 4
Name: col_1, dtype: int64
Or you can use the to_numeric() function to coerce the dtypes after reading in the data,
In [19]: df2 = pd.read_csv(StringIO(data))
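A sketch of the coercion step applied before the display below:
df2['col_1'] = pd.to_numeric(df2['col_1'], errors='coerce')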
In [21]: df2
Out[21]:
col_1
0 1.00
1 2.00
2 NaN
3 4.22
In [22]: df2['col_1'].apply(type).value_counts()
Out[22]:
<class 'float'> 4
Name: col_1, dtype: int64
which will convert all valid parsing to floats, leaving the invalid parsing as NaN.
Ultimately, how you deal with reading in columns containing mixed dtypes depends on your specific needs.
In the case above, if you wanted to NaN out the data anomalies, then to_numeric() is probably your best
option. However, if you wanted for all the data to be coerced, no matter the type, then using the converters
argument of read_csv() would certainly be worth trying.
New in version 0.20.0: support for the Python parser.
The dtype option is supported by the ‘python’ engine.
Note: In some cases, reading in abnormal data with columns containing mixed dtypes will result in an
inconsistent dataset. If you rely on pandas to infer the dtypes of your columns, the parsing engine will go
and infer the dtypes for different chunks of the data, rather than the whole dataset at once. Consequently,
you can end up with column(s) with mixed dtypes. For example,
In [23]: col_1 = list(range(500000)) + ['a', 'b'] + list(range(500000))
In [25]: df.to_csv('foo.csv')
In [27]: mixed_df['col_1'].apply(type).value_counts()
Out[27]:
In [28]: mixed_df['col_1'].dtype
Out[28]: dtype('O')
will result in mixed_df containing an int dtype for certain chunks of the column, and str for others due
to the mixed dtypes from the data that was read in. It is important to note that the overall column will be
marked with a dtype of object, which is used for columns with mixed dtypes.
In [30]: pd.read_csv(StringIO(data))
Out[30]:
col1 col2 col3
0 a b 1
1 a b 2
2 c d 3
In [31]: pd.read_csv(StringIO(data)).dtypes
Out[31]:
col1 object
col2 object
col3 int64
dtype: object
Note: With dtype='category', the resulting categories will always be parsed as strings (object dtype).
If the categories are numeric they can be converted using the to_numeric() function, or as appropriate,
another converter such as to_datetime().
When dtype is a CategoricalDtype with homogeneous categories (all numeric, all datetimes, etc.), the
conversion is done automatically.
In [39]: df = pd.read_csv(StringIO(data), dtype='category')
In [40]: df.dtypes
Out[40]:
col1 category
col2 category
col3 category
dtype: object
In [41]: df['col3']
Out[41]:
0 1
1 2
2 3
Name: col3, dtype: category
Categories (3, object): [1, 2, 3]
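A sketch of the conversion step that turns the string categories above into the numeric categories shown below (assigning to .cat.categories works in this pandas version):
df['col3'].cat.categories = pd.to_numeric(df['col3'].cat.categories)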
In [43]: df['col3']
Out[43]:
0 1
1 2
2 3
Name: col3, dtype: category
Categories (3, int64): [1, 2, 3]
A file may or may not have a header row. pandas assumes the first row should be used as the column names:
In [44]: data = ('a,b,c\n'
....: '1,2,3\n'
....: '4,5,6\n'
....: '7,8,9')
....:
In [45]: print(data)
a,b,c
1,2,3
4,5,6
7,8,9
In [46]: pd.read_csv(StringIO(data))
Out[46]:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
By specifying the names argument in conjunction with header you can indicate other names to use and
whether or not to throw away the header row (if any):
In [47]: print(data)
a,b,c
1,2,3
4,5,6
7,8,9
Out[49]:
foo bar baz
0 a b c
1 1 2 3
2 4 5 6
3 7 8 9
If the header is in a row other than the first, pass the row number to header. This will skip the preceding
rows:
Note: Default behavior is to infer the column names: if no names are passed the behavior is identical to
header=0 and column names are inferred from the first non-blank line of the file, if column names are passed
explicitly then the behavior is identical to header=None.
If the file or header contains duplicate names, pandas will by default distinguish between them so as to
prevent overwriting data:
In [53]: pd.read_csv(StringIO(data))
Out[53]:
a b a.1
0 0 1 2
1 3 4 5
There is no more duplicate data because mangle_dupe_cols=True by default, which modifies a series of
duplicate columns ‘X’, …, ‘X’ to become ‘X’, ‘X.1’, …, ‘X.N’. If mangle_dupe_cols=False, duplicate data
can arise:
To prevent users from encountering this problem with duplicate data, a ValueError exception is raised if
mangle_dupe_cols != True:
The usecols argument allows you to select any subset of the columns in a file, either using the column
names, position numbers or a callable:
New in version 0.20.0: support for callable usecols arguments
In [54]: data = 'a,b,c,d\n1,2,3,foo\n4,5,6,bar\n7,8,9,baz'
In [55]: pd.read_csv(StringIO(data))
Out[55]:
a b c d
0 1 2 3 foo
1 4 5 6 bar
2 7 8 9 baz
In this case, the callable is specifying that we exclude the “a” and “c” columns from the output.
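A sketch of such a call, reusing the data defined above:
pd.read_csv(StringIO(data), usecols=lambda x: x not in ['a', 'c'])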
If the comment parameter is specified, then completely commented lines will be ignored. By default, com-
pletely blank lines will be ignored as well.
In [60]: data = ('\n'
....: 'a,b,c\n'
....: ' \n'
....: '# commented line\n'
....: '1,2,3\n'
....: '\n'
....: '4,5,6')
....:
In [61]: print(data)
a,b,c
# commented line
1,2,3
4,5,6
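Reading this data while ignoring the commented line could look like (a sketch):
pd.read_csv(StringIO(data), comment='#')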
Warning: The presence of ignored lines might create ambiguities involving line numbers; the param-
eter header uses row numbers (ignoring commented/empty lines), while skiprows uses line numbers
(including commented/empty lines):
In [65]: data = ('#comment\n'
....: 'a,b,c\n'
....: 'A,B,C\n'
....: '1,2,3')
....:
If both header and skiprows are specified, header will be relative to the end of skiprows. For example:
In [70]: print(data)
# empty
# second empty line
# third emptyline
X,Y,Z
1,2,3
A,B,C
1,2.,4.
5.,NaN,10.0
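Combining both for the data above might look like this (a sketch; the exact offsets are assumptions): skiprows removes the first four raw lines, and header is then counted from what remains.
pd.read_csv(StringIO(data), comment='#', skiprows=4, header=1)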
Comments
In [72]: print(open('tmp.csv').read())
ID,level,category
Patient1,123000,x # really unpleasant
Patient2,23000,y # wouldn't take his medicine
Patient3,1234018,z # awesome
In [73]: df = pd.read_csv('tmp.csv')
In [74]: df
Out[74]:
ID level category
0 Patient1 123000 x # really unpleasant
1 Patient2 23000 y # wouldn't take his medicine
2 Patient3 1234018 z # awesome
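The cleaned-up frame shown next could be produced by passing comment (a sketch):
df = pd.read_csv('tmp.csv', comment='#')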
In [76]: df
Out[76]:
ID level category
0 Patient1 123000 x
1 Patient2 23000 y
2 Patient3 1234018 z
The encoding argument should be used for encoded unicode data, which will result in byte strings being
decoded to unicode in the result:
In [77]: data = (b'word,length\n'
....: b'Tr\xc3\xa4umen,7\n'
....: b'Gr\xc3\xbc\xc3\x9fe,5')
....:
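One way to read these bytes, which are UTF-8 encoded, is via a BytesIO buffer (a sketch, not necessarily the exact call used here):
from io import BytesIO
df = pd.read_csv(BytesIO(data), encoding='utf-8')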
In [80]: df
Out[80]:
word length
0 Träumen 7
1 Grüße 5
In [81]: df['word'][1]
Out[81]: 'Grüße'
Some formats which encode all characters as multiple bytes, like UTF-16, won’t parse correctly at all without
specifying the encoding. Full list of Python standard encodings.
If a file has one more column of data than the number of column names, the first column will be used as the
DataFrame’s row names:
In [83]: pd.read_csv(StringIO(data))
Out[83]:
a b c
4 apple bat 5.7
8 orange cow 10.0
Ordinarily, you can achieve this behavior using the index_col option.
There are some exception cases when a file has been prepared with delimiters at the end of each data line,
confusing the parser. To explicitly disable the index column inference and discard the last column, pass
index_col=False:
In [86]: data = ('a,b,c\n'
....: '4,apple,bat,\n'
....: '8,orange,cow,')
....:
In [87]: print(data)
a,b,c
4,apple,bat,
8,orange,cow,
In [88]: pd.read_csv(StringIO(data))
Out[88]:
a b c
4 apple bat NaN
8 orange cow NaN
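Passing index_col=False for the same data discards that inference and drops the empty trailing column (a sketch):
pd.read_csv(StringIO(data), index_col=False)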
In [91]: print(data)
a,b,c
4,apple,bat,
8,orange,cow,
Date Handling
To better facilitate working with datetime data, read_csv() uses the keyword arguments parse_dates and
date_parser to allow users to specify a variety of columns and date/time formats to turn the input text
data into datetime objects.
The simplest case is to just pass in parse_dates=True:
# Use a column as an index, and parse it as dates.
In [94]: df = pd.read_csv('foo.csv', index_col=0, parse_dates=True)
In [95]: df
Out[95]:
A B C
date
2009-01-01 a 1 2
2009-01-02 b 3 4
2009-01-03 c 4 5
It is often the case that we may want to store date and time data separately, or store various date fields
separately. The parse_dates keyword can be used to specify a combination of columns to parse the dates
and/or times from.
You can specify a list of column lists to parse_dates, the resulting date columns will be prepended to the
output (so as to not affect the existing column order) and the new column names will be the concatenation
of the component column names:
In [97]: print(open('tmp.csv').read())
KORD,19990127, 19:00:00, 18:56:00, 0.8100
KORD,19990127, 20:00:00, 19:56:00, 0.0100
KORD,19990127, 21:00:00, 20:56:00, -0.5900
KORD,19990127, 21:00:00, 21:18:00, -0.9900
KORD,19990127, 22:00:00, 21:56:00, -0.5900
KORD,19990127, 23:00:00, 22:56:00, -0.5900
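The frame shown next could be produced by a call along these lines (a sketch, assuming the file above has no header row):
df = pd.read_csv('tmp.csv', header=None, parse_dates=[[1, 2], [1, 3]])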
In [99]: df
Out[99]:
1_2 1_3 0 4
0 1999-01-27 19:00:00 1999-01-27 18:56:00 KORD 0.81
1 1999-01-27 20:00:00 1999-01-27 19:56:00 KORD 0.01
2 1999-01-27 21:00:00 1999-01-27 20:56:00 KORD -0.59
3 1999-01-27 21:00:00 1999-01-27 21:18:00 KORD -0.99
4 1999-01-27 22:00:00 1999-01-27 21:56:00 KORD -0.59
5 1999-01-27 23:00:00 1999-01-27 22:56:00 KORD -0.59
By default the parser removes the component date columns, but you can choose to retain them via the
keep_date_col keyword:
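A sketch of such a call for the same file:
df = pd.read_csv('tmp.csv', header=None, parse_dates=[[1, 2], [1, 3]],
                 keep_date_col=True)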
In [101]: df
Out[101]:
1_2 1_3 0 1 2 3 4
0 1999-01-27 19:00:00 1999-01-27 18:56:00 KORD 19990127 19:00:00 18:56:00 0.81
1 1999-01-27 20:00:00 1999-01-27 19:56:00 KORD 19990127 20:00:00 19:56:00 0.01
2 1999-01-27 21:00:00 1999-01-27 20:56:00 KORD 19990127 21:00:00 20:56:00 -0.59
3 1999-01-27 21:00:00 1999-01-27 21:18:00 KORD 19990127 21:00:00 21:18:00 -0.99
4 1999-01-27 22:00:00 1999-01-27 21:56:00 KORD 19990127 22:00:00 21:56:00 -0.59
5 1999-01-27 23:00:00 1999-01-27 22:56:00 KORD 19990127 23:00:00 22:56:00 -0.59
Note that if you wish to combine multiple columns into a single date column, a nested list must be used.
In other words, parse_dates=[1, 2] indicates that the second and third columns should each be parsed as
separate date columns while parse_dates=[[1, 2]] means the two columns should be parsed into a single
column.
You can also use a dict to specify custom name columns:
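A sketch of the dict form:
date_spec = {'nominal': [1, 2], 'actual': [1, 3]}
df = pd.read_csv('tmp.csv', header=None, parse_dates=date_spec)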
In [104]: df
Out[104]:
nominal actual 0 4
0 1999-01-27 19:00:00 1999-01-27 18:56:00 KORD 0.81
1 1999-01-27 20:00:00 1999-01-27 19:56:00 KORD 0.01
2 1999-01-27 21:00:00 1999-01-27 20:56:00 KORD -0.59
3 1999-01-27 21:00:00 1999-01-27 21:18:00 KORD -0.99
4 1999-01-27 22:00:00 1999-01-27 21:56:00 KORD -0.59
5 1999-01-27 23:00:00 1999-01-27 22:56:00 KORD -0.59
It is important to remember that if multiple text columns are to be parsed into a single date column, then
a new column is prepended to the data. The index_col specification is based off of this new set of columns
rather than the original data columns:
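A sketch, where index_col=0 refers to the prepended 'nominal' column rather than to the original first column:
df = pd.read_csv('tmp.csv', header=None, parse_dates=date_spec, index_col=0)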
In [107]: df
Out[107]:
actual 0 4
nominal
1999-01-27 19:00:00 1999-01-27 18:56:00 KORD 0.81
1999-01-27 20:00:00 1999-01-27 19:56:00 KORD 0.01
1999-01-27 21:00:00 1999-01-27 20:56:00 KORD -0.59
1999-01-27 21:00:00 1999-01-27 21:18:00 KORD -0.99
1999-01-27 22:00:00 1999-01-27 21:56:00 KORD -0.59
1999-01-27 23:00:00 1999-01-27 22:56:00 KORD -0.59
Note: If a column or index contains an unparsable date, the entire column or index will be returned
unaltered as an object data type. For non-standard datetime parsing, use to_datetime() after pd.read_csv.
Note: read_csv has a fast_path for parsing datetime strings in ISO 8601 format, e.g. "2000-01-
01T00:01:02+00:00" and similar variations. If you can arrange for your data to store datetimes in this
format, load times will be significantly faster; speedups of ~20x have been observed.
Note: When passing a dict as the parse_dates argument, the order of the columns prepended is not
guaranteed, because dict objects do not impose an ordering on their keys. On Python 2.7+ you may use
collections.OrderedDict instead of a regular dict if this matters to you. Because of this, when using a dict for
‘parse_dates’ in conjunction with the index_col argument, it’s best to specify index_col as a column label
rather than as an index on the resulting frame.
Finally, the parser allows you to specify a custom date_parser function to take full advantage of the
flexibility of the date parsing API:
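A sketch using one of the helpers from date_converters.py (mentioned below):
import pandas.io.date_converters as conv
df = pd.read_csv('tmp.csv', header=None, parse_dates=date_spec,
                 date_parser=conv.parse_date_time)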
In [109]: df
Out[109]:
nominal actual 0 4
0 1999-01-27 19:00:00 1999-01-27 18:56:00 KORD 0.81
1 1999-01-27 20:00:00 1999-01-27 19:56:00 KORD 0.01
2 1999-01-27 21:00:00 1999-01-27 20:56:00 KORD -0.59
3 1999-01-27 21:00:00 1999-01-27 21:18:00 KORD -0.99
4 1999-01-27 22:00:00 1999-01-27 21:56:00 KORD -0.59
5 1999-01-27 23:00:00 1999-01-27 22:56:00 KORD -0.59
Pandas will try to call the date_parser function in three different ways. If an exception is raised, the next
one is tried:
1. date_parser is first called with one or more arrays as arguments, as defined using parse_dates (e.g.,
date_parser(['2013', '2013'], ['1', '2'])).
2. If #1 fails, date_parser is called with all the columns concatenated row-wise into a single array (e.g.,
date_parser(['2013 1', '2013 2'])).
3. If #2 fails, date_parser is called once for every row with one or more string arguments from
the columns indicated with parse_dates (e.g., date_parser('2013', '1') for the first row,
date_parser('2013', '2') for the second, etc.).
Note that performance-wise, you should try these methods of parsing dates in order:
1. Try to infer the format using infer_datetime_format=True (see section below).
2. If you know the format, use pd.to_datetime(): date_parser=lambda x: pd.to_datetime(x,
format=...).
3. If you have a really non-standard format, use a custom date_parser function. For optimal perfor-
mance, this should be vectorized, i.e., it should accept arrays as arguments.
You can explore the date parsing functionality in date_converters.py and add your own. We would
love to turn this module into a community supported set of date/time parsers. To get you started,
date_converters.py contains functions to parse dual date and time columns, year/month/day columns,
and year/month/day/hour/minute/second columns. It also contains a generic_parser function so you can
curry it with a function that deals with a single date rather than the entire array.
Pandas cannot natively represent a column or index with mixed timezones. If your CSV file contains
columns with a mixture of timezones, the default result will be an object-dtype column with strings, even
with parse_dates.
In [112]: df['a']
Out[112]:
0 2000-01-01 00:00:00+05:00
1 2000-01-01 00:00:00+06:00
Name: a, dtype: object
To parse the mixed-timezone values as a datetime column, pass a partially-applied to_datetime() with
utc=True as the date_parser.
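A sketch, with the mixed-timezone data assumed from the output below:
content = 'a\n2000-01-01T00:00:00+05:00\n2000-01-01T00:00:00+06:00'
df = pd.read_csv(StringIO(content), parse_dates=['a'],
                 date_parser=lambda col: pd.to_datetime(col, utc=True))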
In [114]: df['a']
Out[114]:
0 1999-12-31 19:00:00+00:00
1 1999-12-31 18:00:00+00:00
Name: a, dtype: datetime64[ns, UTC]
If you have parse_dates enabled for some or all of your columns, and your datetime strings are all formatted
the same way, you may get a large speed up by setting infer_datetime_format=True. If set, pandas will
attempt to guess the format of your datetime strings, and then use a faster means of parsing the strings.
5-10x parsing speeds have been observed. pandas will fallback to the usual parsing if either the format
cannot be guessed or the format that was guessed cannot properly parse the entire column of strings. So in
general, infer_datetime_format should not have any negative consequences if enabled.
Here are some examples of datetime strings that can be guessed (All representing December 30th, 2011 at
00:00:00):
• “20111230”
• “2011/12/30”
• “20111230 00:00:00”
• “12/30/2011 00:00:00”
• “30/Dec/2011 00:00:00”
• “30/December/2011 00:00:00”
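The frame below could be read with format inference enabled (a sketch, reusing foo.csv from the earlier example):
df = pd.read_csv('foo.csv', index_col=0, parse_dates=True,
                 infer_datetime_format=True)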
In [116]: df
Out[116]:
A B C
date
2009-01-01 a 1 2
2009-01-02 b 3 4
2009-01-03 c 4 5
While US date formats tend to be MM/DD/YYYY, many international formats use DD/MM/YYYY instead.
For convenience, a dayfirst keyword is provided:
In [117]: print(open('tmp.csv').read())
date,value,cat
1/6/2000,5,a
2/6/2000,10,b
3/6/2000,15,c
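A sketch of reading this file with and without dayfirst:
pd.read_csv('tmp.csv', parse_dates=[0])                 # 1/6/2000 -> January 6th
pd.read_csv('tmp.csv', parse_dates=[0], dayfirst=True)  # 1/6/2000 -> June 1st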
The parameter float_precision can be specified in order to use a specific floating-point converter during
parsing with the C engine. The options are the ordinary converter, the high-precision converter, and the
round-trip converter (which is guaranteed to round-trip values after writing to a file). For example:
In [120]: val = '0.3066101993807095471566981359501369297504425048828125'
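A sketch of comparing the converters against the original value:
data = 'a,b,c\n1,2,{0}'.format(val)
abs(pd.read_csv(StringIO(data), float_precision='high')['c'][0] - float(val))
abs(pd.read_csv(StringIO(data), float_precision='round_trip')['c'][0] - float(val))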
Thousand separators
For large numbers that have been written with a thousands separator, you can set the thousands keyword
to a string of length 1 so that integers will be parsed correctly:
By default, numbers with a thousands separator will be parsed as strings:
In [125]: print(open('tmp.csv').read())
ID|level|category
Patient1|123,000|x
Patient2|23,000|y
Patient3|1,234,018|z
In [127]: df
Out[127]:
ID level category
0 Patient1 123,000 x
1 Patient2 23,000 y
2 Patient3 1,234,018 z
In [128]: df.level.dtype
Out[128]: dtype('O')
The thousands keyword allows integers to be parsed correctly:
In [129]: print(open('tmp.csv').read())
ID|level|category
Patient1|123,000|x
Patient2|23,000|y
Patient3|1,234,018|z
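The parsed frame below could come from a call like (a sketch):
df = pd.read_csv('tmp.csv', sep='|', thousands=',')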
In [131]: df
Out[131]:
ID level category
0 Patient1 123000 x
1 Patient2 23000 y
2 Patient3 1234018 z
In [132]: df.level.dtype
Out[132]: dtype('int64')
NA values
To control which values are parsed as missing values (which are signified by NaN), specify a string in
na_values. If you specify a list of strings, then all values in it are considered to be missing values. If
you specify a number (a float, like 5.0 or an integer like 5), the corresponding equivalent values will also
imply a missing value (in this case effectively [5.0, 5] are recognized as NaN).
To completely override the default values that are recognized as missing, specify keep_default_na=False.
The default NaN recognized values are ['-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A',
'#N/A', 'N/A', 'n/a', 'NA', '#NA', 'NULL', 'null', 'NaN', '-NaN', 'nan', '-nan', ''].
Let us consider some examples:
pd.read_csv('path_to_file.csv', na_values=[5])
In the example above 5 and 5.0 will be recognized as NaN, in addition to the defaults. A string will first be
interpreted as a numerical 5, then as a NaN.
pd.read_csv('path_to_file.csv', na_values=["Nope"])
The default values, in addition to the string "Nope" are recognized as NaN.
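keep_default_na can be combined with na_values; for example, to treat only empty fields as missing (a sketch):
pd.read_csv('path_to_file.csv', keep_default_na=False, na_values=[''])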
Infinity
inf like values will be parsed as np.inf (positive infinity), and -inf as -np.inf (negative infinity). These
will ignore the case of the value, meaning Inf will also be parsed as np.inf.
Returning Series
Using the squeeze keyword, the parser will return output with a single column as a Series:
In [133]: print(open('tmp.csv').read())
level
Patient1,123000
Patient2,23000
Patient3,1234018
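The Series shown next could come from a call like (a sketch):
output = pd.read_csv('tmp.csv', squeeze=True)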
In [135]: output
Out[135]:
Patient1 123000
Patient2 23000
Patient3 1234018
Name: level, dtype: int64
In [136]: type(output)
Out[136]: pandas.core.series.Series
Boolean values
The common values True, False, TRUE, and FALSE are all recognized as boolean. Occasionally you might
want to recognize other values as being boolean. To do this, use the true_values and false_values options
as follows:
In [137]: data = ('a,b,c\n'
.....: '1,Yes,2\n'
.....: '3,No,4')
.....:
In [138]: print(data)
a,b,c
1,Yes,2
3,No,4
In [139]: pd.read_csv(StringIO(data))
Out[139]:
a b c
0 1 Yes 2
1 3 No 4
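A sketch of mapping Yes/No to booleans:
pd.read_csv(StringIO(data), true_values=['Yes'], false_values=['No'])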
Some files may have malformed lines with too few fields or too many. Lines with too few fields will have NA
values filled in the trailing fields. Lines with too many fields will raise an error by default:
In [142]: pd.read_csv(StringIO(data))
---------------------------------------------------------------------------
ParserError Traceback (most recent call last)
<ipython-input-142-6388c394e6b8> in <module>
----> 1 pd.read_csv(StringIO(data))
683 )
684
--> 685 return _read(filepath_or_buffer, kwds)
686
687 parser_f.__name__ = name
~/sandbox/pandas-doc/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()
~/sandbox/pandas-doc/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()
~/sandbox/pandas-doc/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()
~/sandbox/pandas-doc/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()
~/sandbox/pandas-doc/pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()
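The frame shown next reflects skipping the offending line; a call along these lines (a sketch) would produce it:
pd.read_csv(StringIO(data), error_bad_lines=False)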
Out[29]:
a b c
0 1 2 3
1 8 9 10
You can also use the usecols parameter to eliminate extraneous column data that appear in some lines but
not others:
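A sketch of such a call:
pd.read_csv(StringIO(data), usecols=[0, 1, 2])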
Out[30]:
a b c
0 1 2 3
1 4 5 6
2 8 9 10
Dialect
The dialect keyword gives greater flexibility in specifying the file format. By default it uses the Excel
dialect but you can specify either the dialect name or a csv.Dialect instance.
Suppose you had data with unenclosed quotes:
In [143]: print(data)
label1,label2,label3
index1,"a,c,e
index2,b,d,f
By default, read_csv uses the Excel dialect and treats the double quote as the quote character, which causes
it to fail when it finds a newline before it finds the closing double quote.
We can get around this using dialect:
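A sketch, turning off quote handling via a csv.Dialect instance:
import csv
dia = csv.excel()
dia.quoting = csv.QUOTE_NONE
pd.read_csv(StringIO(data), dialect=dia)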
Another common dialect option is skipinitialspace, to skip any whitespace after a delimiter:
In [150]: data = 'a, b, c\n1, 2, 3\n4, 5, 6'
In [151]: print(data)
a, b, c
1, 2, 3
4, 5, 6
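A sketch:
pd.read_csv(StringIO(data), skipinitialspace=True)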
Quotes (and other escape characters) in embedded fields can be handled in any number of ways. One way
is to use backslashes; to properly parse this data, you should pass the escapechar option:
In [153]: data = 'a,b\n"hello, \\"Bob\\", nice to see you",5'
In [154]: print(data)
a,b
"hello, \"Bob\", nice to see you",5
While read_csv() reads delimited data, the read_fwf() function works with data files that have known
and fixed column widths. The function parameters to read_fwf are largely the same as read_csv with two
extra parameters, and a different usage of the delimiter parameter:
• colspecs: A list of pairs (tuples) giving the extents of the fixed-width fields of each line as half-open
intervals (i.e., [from, to)). The string value 'infer' can be used to instruct the parser to try detecting the
column specifications from the first 100 rows of the data. The default behavior, if not specified, is to infer.
• widths: A list of field widths which can be used instead of ‘colspecs’ if the intervals are contiguous.
• delimiter: Characters to consider as filler characters in the fixed-width file. Can be used to specify
the filler character of the fields if it is not spaces (e.g., ‘~’).
In [156]: print(open('bar.csv').read())
id8141 360.242940 149.910199 11950.7
id1594 444.953632 166.985655 11788.4
id1849 364.136849 183.628767 11806.2
id1230 413.836124 184.375703 11916.8
id1948 502.953953 173.237159 12468.3
In order to parse this file into a DataFrame, we simply need to supply the column specifications to the
read_fwf function along with the file name:
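A sketch; the exact character positions are assumptions based on the layout above:
colspecs = [(0, 6), (8, 20), (21, 33), (34, 43)]   # half-open intervals per field
df = pd.read_fwf('bar.csv', colspecs=colspecs, header=None, index_col=0)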
In [159]: df
Out[159]:
1 2 3
0
id8141 360.242940 149.910199 11950.7
id1594 444.953632 166.985655 11788.4
id1849 364.136849 183.628767 11806.2
id1230 413.836124 184.375703 11916.8
id1948 502.953953 173.237159 12468.3
Note how the parser automatically picks column names X.<column number> when header=None argument
is specified. Alternatively, you can supply just the column widths for contiguous columns:
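A sketch, with the field widths assumed from the layout above:
widths = [6, 14, 13, 10]
df = pd.read_fwf('bar.csv', widths=widths, header=None)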
In [162]: df
Out[162]:
0 1 2 3
0 id8141 360.242940 149.910199 11950.7
1 id1594 444.953632 166.985655 11788.4
2 id1849 364.136849 183.628767 11806.2
3 id1230 413.836124 184.375703 11916.8
4 id1948 502.953953 173.237159 12468.3
The parser will take care of extra white spaces around the columns so it’s ok to have extra separation between
the columns in the file.
By default, read_fwf will try to infer the file’s colspecs by using the first 100 rows of the file. It can do
it only in cases when the columns are aligned and correctly separated by the provided delimiter (default
delimiter is whitespace).
In [164]: df
Indexes
Consider a file with one less entry in the header than the number of data columns:
In [167]: print(open('foo.csv').read())
A,B,C
20090101,a,1,2
20090102,b,3,4
20090103,c,4,5
In this special case, read_csv assumes that the first column is to be used as the index of the DataFrame:
In [168]: pd.read_csv('foo.csv')
Out[168]:
A B C
20090101 a 1 2
20090102 b 3 4
20090103 c 4 5
Note that the dates weren’t automatically parsed. In that case you would need to do as before:
In [170]: df.index
Out[170]: DatetimeIndex(['2009-01-01', '2009-01-02', '2009-01-03'], dtype='datetime64[ns]', freq=None)
In [171]: print(open('data/mindex_ex.csv').read())
year,indiv,zit,xit
1977,"A",1.2,.6
1977,"B",1.5,.5
1977,"C",1.7,.8
1978,"A",.2,.06
1978,"B",.7,.2
1978,"C",.8,.3
1978,"D",.9,.5
1978,"E",1.4,.9
1979,"C",.2,.15
1979,"D",.14,.05
1979,"E",.5,.15
1979,"F",1.2,.5
1979,"G",3.4,1.9
1979,"H",5.4,2.7
1979,"I",6.4,1.2
The index_col argument to read_csv can take a list of column numbers to turn multiple columns into a
MultiIndex for the index of the returned object:
In [172]: df = pd.read_csv("data/mindex_ex.csv", index_col=[0, 1])
In [173]: df
Out[173]:
zit xit
year indiv
1977 A 1.20 0.60
B 1.50 0.50
C 1.70 0.80
1978 A 0.20 0.06
B 0.70 0.20
C 0.80 0.30
D 0.90 0.50
E 1.40 0.90
1979 C 0.20 0.15
D 0.14 0.05
E 0.50 0.15
F 1.20 0.50
G 3.40 1.90
H 5.40 2.70
I 6.40 1.20
In [174]: df.loc[1978]
Out[174]:
zit xit
indiv
A 0.2 0.06
B 0.7 0.20
C 0.8 0.30
D 0.9 0.50
E 1.4 0.90
By specifying list of row locations for the header argument, you can read in a MultiIndex for the columns.
Specifying non-consecutive rows will skip the intervening rows.
In [175]: from pandas.util.testing import makeCustomDataframe as mkdf
In [177]: df.to_csv('mi.csv')
In [178]: print(open('mi.csv').read())
C0,,C_l0_g0,C_l0_g1,C_l0_g2
C1,,C_l1_g0,C_l1_g1,C_l1_g2
C2,,C_l2_g0,C_l2_g1,C_l2_g2
C3,,C_l3_g0,C_l3_g1,C_l3_g2
R0,R1,,,
R_l0_g0,R_l1_g0,R0C0,R0C1,R0C2
R_l0_g1,R_l1_g1,R1C0,R1C1,R1C2
R_l0_g2,R_l1_g2,R2C0,R2C1,R2C2
R_l0_g3,R_l1_g3,R3C0,R3C1,R3C2
R_l0_g4,R_l1_g4,R4C0,R4C1,R4C2
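Reading it back with a list of header rows and index columns could look like (a sketch):
pd.read_csv('mi.csv', header=[0, 1, 2, 3], index_col=[0, 1])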
read_csv is capable of inferring delimited (not necessarily comma-separated) files, as pandas uses the csv.
Sniffer class of the csv module. For this, you have to specify sep=None.
In [182]: print(open('tmp2.sv').read())
:0:1:2:3
0:1.1214905765122583:-1.1011663421613171:-1.2725711408453018:0.8434589457722285
1:0.8739661419816901:-1.1622548707272122:0.12618578996106738:0.5057848504967111
2:0.6695152369722812:0.4833977900441433:-0.4383565886430891:-0.13952146077085656
3:1.6678766138462109:0.906356209978661:0.8603041052486606:-0.009413710135323125
4:-0.8075485015292924:-0.7848128653629299:-1.3155155066668116:0.6875244729698119
5:-0.1572352664979729:0.30339976035788174:-0.36340691002502046:-0.5526511482544121
6:0.41442095212262187:0.17517103850750262:-0.5295157789486404:-0.06745694327155764
7:1.058814717443789:-0.11789792502832808:-1.8534207864364352:-0.7018494437516053
8:0.26239634172416604:-1.7245959745828128:0.2765803759042711:1.0730241342647273
9:0.6352851164219758:-2.1785482358583024:0.3120437647651685:1.5723784501068536
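A sketch of sniffing the ':' delimiter; sep=None requires the python engine:
pd.read_csv('tmp2.sv', sep=None, engine='python')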
It’s best to use concat() to combine multiple files. See the cookbook for an example.
Suppose you wish to iterate through a (potentially very large) file lazily rather than reading the entire file
into memory, such as the following:
In [184]: print(open('tmp.sv').read())
|0|1|2|3
0|1.1214905765122583|-1.1011663421613171|-1.2725711408453018|0.8434589457722285
1|0.8739661419816901|-1.1622548707272122|0.12618578996106738|0.5057848504967111
2|0.6695152369722812|0.4833977900441433|-0.4383565886430891|-0.13952146077085656
3|1.6678766138462109|0.906356209978661|0.8603041052486606|-0.009413710135323125
4|-0.8075485015292924|-0.7848128653629299|-1.3155155066668116|0.6875244729698119
5|-0.1572352664979729|0.30339976035788174|-0.36340691002502046|-0.5526511482544121
6|0.41442095212262187|0.17517103850750262|-0.5295157789486404|-0.06745694327155764
7|1.058814717443789|-0.11789792502832808|-1.8534207864364352|-0.7018494437516053
8|0.26239634172416604|-1.7245959745828128|0.2765803759042711|1.0730241342647273
9|0.6352851164219758|-2.1785482358583024|0.3120437647651685|1.5723784501068536
In [186]: table
Out[186]:
Unnamed: 0 0 1 2 3
0 0 1.121491 -1.101166 -1.272571 0.843459
1 1 0.873966 -1.162255 0.126186 0.505785
2 2 0.669515 0.483398 -0.438357 -0.139521
3 3 1.667877 0.906356 0.860304 -0.009414
4 4 -0.807549 -0.784813 -1.315516 0.687524
5 5 -0.157235 0.303400 -0.363407 -0.552651
6 6 0.414421 0.175171 -0.529516 -0.067457
7 7 1.058815 -0.117898 -1.853421 -0.701849
8 8 0.262396 -1.724596 0.276580 1.073024
9 9 0.635285 -2.178548 0.312044 1.572378
By specifying a chunksize to read_csv, the return value will be an iterable object of type TextFileReader:
In [187]: reader = pd.read_csv('tmp.sv', sep='|', chunksize=4)
In [188]: reader
Out[188]: <pandas.io.parsers.TextFileReader at 0x13eb31a90>
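Iterating over the reader yields the data chunk by chunk; get_chunk, shown next, would typically be called on a fresh reader created with iterator=True (a sketch):
for chunk in reader:
    print(chunk)
reader = pd.read_csv('tmp.sv', sep='|', iterator=True)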
In [191]: reader.get_chunk(5)
Out[191]:
Unnamed: 0 0 1 2 3
0 0 1.121491 -1.101166 -1.272571 0.843459
1 1 0.873966 -1.162255 0.126186 0.505785
2 2 0.669515 0.483398 -0.438357 -0.139521
3 3 1.667877 0.906356 0.860304 -0.009414
4 4 -0.807549 -0.784813 -1.315516 0.687524
Under the hood pandas uses a fast and efficient parser implemented in C as well as a Python implementation
which is currently more feature-complete. Where possible pandas uses the C parser (specified as engine='c'),
but may fall back to Python if C-unsupported options are specified. Currently, C-unsupported options
include:
• sep other than a single character (e.g. regex separators)
• skipfooter
• sep=None with delim_whitespace=False
Specifying any of the above options will produce a ParserWarning unless the python engine is selected
explicitly using engine='python'.
df = pd.read_csv('https://download.bls.gov/pub/time.series/cu/cu.item',
sep='\t')
S3 URLs are handled as well but require installing the S3Fs library:
df = pd.read_csv('s3://pandas-test/tips.csv')
If your S3 bucket requires credentials you will need to set them as environment variables or in the ~/.aws/
credentials config file, refer to the S3Fs documentation on credentials.
The Series and DataFrame objects have an instance method to_csv which allows storing the contents of
the object as a comma-separated-values file. The function takes a number of arguments. Only the first is
required.
• path_or_buf: A string path to the file to write or a file object. If a file object it must be opened with
newline=’’
• sep : Field delimiter for the output file (default “,”)
The DataFrame object has an instance method to_string which allows control over the string representation
of the object. All arguments are optional:
• buf default None, for example a StringIO object
• columns default None, which columns to write
• col_space default None, minimum width of each column.
• na_rep default NaN, representation of NA value
• formatters default None, a dictionary (by column) of functions each of which takes a single argument
and returns a formatted string
• float_format default None, a function which takes a single (float) argument and returns a formatted
string; to be applied to floats in the DataFrame.
• sparsify default True, set to False for a DataFrame with a hierarchical index to print every MultiIndex
key at each row.
• index_names default True, will print the names of the indices
• index default True, will print the index (ie, row labels)
• header default True, will print the column labels
• justify default left, will print column headers left- or right-justified
The Series object also has a to_string method, but with only the buf, na_rep, float_format arguments.
There is also a length argument which, if set to True, will additionally output the length of the Series.
4.1.2 JSON
Writing JSON
A Series or DataFrame can be converted to a valid JSON string. Use to_json with optional parameters:
• path_or_buf : the pathname or buffer to write the output. This can be None, in which case a JSON
string is returned
• orient :
Series:
– default is index
– allowed values are {split, records, index}
DataFrame:
– default is columns
– allowed values are {split, records, index, columns, values, table}
The format of the JSON string
split dict like {index -> [index], columns -> [columns], data -> [values]}
records list like [{column -> value}, … , {column -> value}]
index dict like {index -> {column -> value}}
columns dict like {column -> {index -> value}}
values just the values array
• date_format : string, type of date conversion, ‘epoch’ for timestamp, ‘iso’ for ISO8601.
• double_precision : The number of decimal places to use when encoding floating point values, default
10.
• force_ascii : force encoded string to be ASCII, default True.
• date_unit : The time unit to encode to, governs timestamp and ISO8601 precision. One of ‘s’, ‘ms’,
‘us’ or ‘ns’ for seconds, milliseconds, microseconds and nanoseconds respectively. Default ‘ms’.
• default_handler : The handler to call if an object cannot otherwise be converted to a suitable format
for JSON. Takes a single argument, which is the object to convert, and returns a serializable object.
• lines : If records orient, then will write each record per line as json.
Note NaN’s, NaT’s and None will be converted to null and datetime objects will be converted based on the
date_format and date_unit parameters.
Orient options
There are a number of different options for the format of the resulting JSON file / string. Consider the
following DataFrame and Series:
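They could be constructed along these lines (a sketch matching the values shown below):
dfjo = pd.DataFrame(dict(A=range(1, 4), B=range(4, 7), C=range(7, 10)),
                    columns=list('ABC'), index=list('xyz'))
sjo = pd.Series(dict(x=15, y=16, z=17), name='D')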
In [196]: dfjo
Out[196]:
A B C
x 1 4 7
y 2 5 8
z 3 6 9
In [198]: sjo
Out[198]:
x 15
y 16
z 17
Name: D, dtype: int64
Column oriented (the default for DataFrame) serializes the data as nested JSON objects with column
labels acting as the primary index:
In [199]: dfjo.to_json(orient="columns")
Out[199]: '{"A":{"x":1,"y":2,"z":3},"B":{"x":4,"y":5,"z":6},"C":{"x":7,"y":8,"z":9}}'
Index oriented (the default for Series) similar to column oriented but the index labels are now primary:
In [200]: dfjo.to_json(orient="index")
Out[200]: '{"x":{"A":1,"B":4,"C":7},"y":{"A":2,"B":5,"C":8},"z":{"A":3,"B":6,"C":9}}'
In [201]: sjo.to_json(orient="index")
Out[201]: '{"x":15,"y":16,"z":17}'
Record oriented serializes the data to a JSON array of column -> value records, index labels are not
included. This is useful for passing DataFrame data to plotting libraries, for example the JavaScript library
d3.js:
In [202]: dfjo.to_json(orient="records")
Out[202]: '[{"A":1,"B":4,"C":7},{"A":2,"B":5,"C":8},{"A":3,"B":6,"C":9}]'
In [203]: sjo.to_json(orient="records")
Out[203]: '[15,16,17]'
Value oriented is a bare-bones option which serializes to nested JSON arrays of values only, column and
index labels are not included:
In [204]: dfjo.to_json(orient="values")
Out[204]: '[[1,4,7],[2,5,8],[3,6,9]]'
Split oriented serializes to a JSON object containing separate entries for values, index and columns. Name
is also included for Series:
In [205]: dfjo.to_json(orient="split")
Out[205]: '{"columns":["A","B","C"],"index":["x","y","z"],"data":[[1,4,7],[2,5,8],[3,6,9]]}'
In [206]: sjo.to_json(orient="split")
Out[206]: '{"name":"D","index":["x","y","z"],"data":[15,16,17]}'
Table oriented serializes to the JSON Table Schema, allowing for the preservation of metadata including
but not limited to dtypes and index names.
Note: Any orient option that encodes to a JSON object will not preserve the ordering of index and column
labels during round-trip serialization. If you wish to preserve label ordering use the split option as it uses
ordered containers.
Date handling
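The strings below could come from calls along these lines (a sketch, assuming a frame dfd with a date column):
import numpy as np
dfd = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))
dfd['date'] = pd.Timestamp('20130101')
json = dfd.to_json(date_format='iso')                   # ISO 8601, millisecond precision
json = dfd.to_json(date_format='iso', date_unit='us')   # microsecond precision
json = dfd.to_json(date_format='epoch', date_unit='s')  # epoch seconds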
In [211]: json
Out[211]: '{"date":{"0":"2013-01-01T00:00:00.000Z","1":"2013-01-01T00:00:00.000Z","2":"2013-01-01T00:00:00.000Z","3":"2013-01-01T00:00:00.000Z","4":"2013-01-01T00:00:00.000Z"},"B":{"0":0.3903383957,"1":-0.5223681486,"2":2.0249145293,"3":2.1144885256,"4":0.5337588359},"A":{"0":-1.0954121534,"1":-0.147141856,"2":0.6305826658,"3":1.5730764249,"4":0.6200376615}}'
In [213]: json
Out[213]: '{"date":{"0":"2013-01-01T00:00:00.000000Z","1":"2013-01-01T00:00:00.000000Z","2":"2013-01-01T00:00:00.000000Z","3":"2013-01-01T00:00:00.000000Z","4":"2013-01-01T00:00:00.000000Z"},"B":{"0":0.3903383957,"1":-0.5223681486,"2":2.0249145293,"3":2.1144885256,"4":0.5337588359},"A":{"0":-1.0954121534,"1":-0.147141856,"2":0.6305826658,"3":1.5730764249,"4":0.6200376615}}'
In [215]: json
Out[215]: '{"date":{"0":1356998400,"1":1356998400,"2":1356998400,"3":1356998400,"4":1356998400},"B":{"0":0.3903383957,"1":-0.5223681486,"2":2.0249145293,"3":2.1144885256,"4":0.5337588359},"A":{"0":-1.0954121534,"1":-0.147141856,"2":0.6305826658,"3":1.5730764249,"4":0.6200376615}}'
In [221]: dfj2.to_json('test.json')
…"1356998400000":0.009310115,"1357084800000":-1.2591311739,"1357171200000":1.7549089729,"1357257600000":0.9464922966,"1357344000000":-0.5276761509},"date":{"1356998400000":1356998400000,"1357084800000":1356998400000,"1357171200000":1356998400000,"1357257600000":1356998400000,"1357344000000":1356998400000},"ints":{"1356998400000":0,"1357084800000":1,"1357171200000":2,"1357257600000":3,"1357344000000":4},"bools":{"1356998400000":true,"1357084800000":true,"1357171200000":true,"1357257600000":true,"1357344000000":true}}
Fallback behavior
If the JSON serializer cannot handle the container contents directly it will fall back in the following manner:
• if the dtype is unsupported (e.g. np.complex) then the default_handler, if provided, will be called
for each value, otherwise an exception is raised.
• if an object is unsupported it will attempt the following:
– check if the object has defined a toDict method and call it. A toDict method should return a
dict which will then be JSON serialized.
– invoke the default_handler if one was provided.
– convert the object to a dict by traversing its contents. However this will often fail with an
OverflowError or give unexpected results.
In general the best approach for unsupported objects or dtypes is to provide a default_handler. For
example:
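A sketch: complex values are not supported natively, but a simple default_handler such as str lets the frame serialize:
pd.DataFrame([1.0, 2.0, complex(1.0, 2.0)]).to_json(default_handler=str)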
Reading JSON
Reading a JSON string to a pandas object can take a number of parameters. The parser will try to parse a
DataFrame if typ is not supplied or is None. To explicitly force Series parsing, pass typ='series'.
• filepath_or_buffer : a VALID JSON string or file handle / StringIO. The string could be a URL.
Valid URL schemes include http, ftp, S3, and file. For file URLs, a host is expected. For instance, a
local file could be file://localhost/path/to/table.json
• typ : type of object to recover (series or frame), default ‘frame’
• orient :
Series :
– default is index
– allowed values are {split, records, index}
DataFrame
– default is columns
– allowed values are {split, records, index, columns, values, table}
The format of the JSON string
split dict like {index -> [index], columns -> [columns], data -> [values]}
records list like [{column -> value}, … , {column -> value}]
index dict like {index -> {column -> value}}
columns dict like {column -> {index -> value}}
values just the values array
table adhering to the JSON Table Schema
• dtype : if True, infer dtypes, if a dict of column to dtype, then use those, if False, then don’t infer
dtypes at all, default is True, apply only to the data.
• convert_axes : boolean, try to convert the axes to the proper dtypes, default is True
• convert_dates : a list of columns to parse for dates; If True, then try to parse date-like columns,
default is True.
• keep_default_dates : boolean, default True. If parsing dates, then parse the default date-like
columns.
• numpy : direct decoding to NumPy arrays. default is False; Supports numeric data only, although
labels may be non-numeric. Also note that the JSON ordering MUST be the same for each term if
numpy=True.
• precise_float : boolean, default False. Set to enable usage of higher precision (strtod) function when
decoding string to double values. Default (False) is to use fast but less precise builtin functionality.
• date_unit : string, the timestamp unit to detect if converting dates. Default None. By default the
timestamp precision will be detected, if this is not desired then pass one of ‘s’, ‘ms’, ‘us’ or ‘ns’ to force
timestamp precision to seconds, milliseconds, microseconds or nanoseconds respectively.
• lines : reads file as one json object per line.
• encoding : The encoding to use to decode py3 bytes.
• chunksize : when used in combination with lines=True, return a JsonReader which reads in
chunksize lines per iteration.
The parser will raise one of ValueError/TypeError/AssertionError if the JSON is not parseable.
If a non-default orient was used when encoding to JSON be sure to pass the same option here so that
decoding produces sensible results, see Orient Options for an overview.
Data conversion
The default of convert_axes=True, dtype=True, and convert_dates=True will try to parse the axes, and
all of the data into appropriate types, including dates. If you need to override specific dtypes, pass a dict to
dtype. convert_axes should only be set to False if you need to preserve string-like numbers (e.g. ‘1’, ‘2’)
in an axes.
Note: Large integer values may be converted to dates if convert_dates=True and the data and / or
column labels appear ‘date-like’. The exact threshold depends on the date_unit specified. ‘date-like’ means
that the column label meets one of the following criteria:
• it ends with '_at'
• it ends with '_time'
• it begins with 'timestamp'
• it is 'modified'
• it is 'date'
Warning: When reading JSON data, automatic coercing into dtypes has some quirks:
• an index can be reconstructed in a different order from serialization, that is, the returned order is
not guaranteed to be the same as before serialization
• a column that was float data will be converted to integer if it can be done safely, e.g. a column
of 1.
• bool columns will be converted to integer on reconstruction
Thus there are times where you may want to specify specific dtypes via the dtype keyword argument.
In [224]: pd.read_json(json)
Out[224]:
date B A
In [225]: pd.read_json('test.json')
Out[225]:
A B date ints bools
2013-01-01 2.002793 0.009310 2013-01-01 0 True
2013-01-02 -1.128420 -1.259131 2013-01-01 1 True
2013-01-03 -0.267123 1.754909 2013-01-01 2 True
2013-01-04 0.059081 0.946492 2013-01-01 3 True
2013-01-05 0.212602 -0.527676 2013-01-01 4 True
Don’t convert any data (but still convert axes and dates):
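A sketch (assuming the test.json file written earlier):
pd.read_json('test.json', dtype=object).dtypes   # every column stays object dtype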
In [229]: si
Out[229]:
0 1 2 3
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
In [230]: si.index
In [231]: si.columns
Out[231]: Int64Index([0, 1, 2, 3], dtype='int64')
In [234]: sij
Out[234]:
0 1 2 3
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
In [235]: sij.index
Out[235]: Index(['0', '1', '2', '3'], dtype='object')
In [236]: sij.columns
Out[236]: Index(['0', '1', '2', '3'], dtype='object')
Dates written in nanoseconds need to be read back in nanoseconds:
In [237]: json = dfj2.to_json(date_unit='ns')
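The two frames below could result from reads like these (a sketch; the first forces the wrong unit so the dates stay as raw integers, the second uses the matching unit):
dfju = pd.read_json(json, date_unit='ms')   # wrong unit: dates remain integers
dfju = pd.read_json(json, date_unit='ns')   # correct unit: parsed back to datetimes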
In [239]: dfju
Out[239]:
A B date ints bools
1356998400000000000 2.002793 0.009310 1356998400000000000 0 True
1357084800000000000 -1.128420 -1.259131 1356998400000000000 1 True
1357171200000000000 -0.267123 1.754909 1356998400000000000 2 True
1357257600000000000 0.059081 0.946492 1356998400000000000 3 True
1357344000000000000 0.212602 -0.527676 1356998400000000000 4 True
In [241]: dfju
Out[241]:
A B date ints bools
2013-01-01 2.002793 0.009310 2013-01-01 0 True
2013-01-02 -1.128420 -1.259131 2013-01-01 1 True
2013-01-03 -0.267123 1.754909 2013-01-01 2 True
2013-01-04 0.059081 0.946492 2013-01-01 3 True
2013-01-05 0.212602 -0.527676 2013-01-01 4 True
Note: This supports numeric data only. Index and columns labels may be non-numeric, e.g. strings, dates
etc.
If numpy=True is passed to read_json an attempt will be made to sniff an appropriate dtype during deseri-
alization and to subsequently decode directly to NumPy arrays, bypassing the need for intermediate Python
objects.
This can provide speedups if you are deserialising a large amount of numeric data:
Warning: Direct NumPy decoding makes a number of assumptions and may fail or produce unexpected
output if these assumptions are not satisfied:
• data is numeric.
• data is uniform. The dtype is sniffed from the first value decoded. A ValueError may be raised,
or incorrect output may be produced if this condition is not satisfied.
• labels are ordered. Labels are only read from the first container, it is assumed that each subsequent
row / column has been encoded in the same order. This should be satisfied if the data was encoded
using to_json but may not be the case if the JSON is from another source.
Normalization
pandas provides a utility function to take a dict or list of dicts and normalize this semi-structured data into
a flat table.
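The data being normalized below could look like this (a sketch inferred from the output):
from pandas.io.json import json_normalize
data = [{'id': 1, 'name': {'first': 'Coleen', 'last': 'Volk'}},
        {'name': {'given': 'Mose', 'family': 'Regner'}},
        {'id': 2, 'name': 'Faye Raker'}]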
In [255]: json_normalize(data)
Out[255]:
id name.first name.last name.given name.family name
0 1.0 Coleen Volk NaN NaN NaN
1 NaN NaN NaN Mose Regner NaN
2 2.0 NaN NaN NaN NaN Faye Raker
The max_level parameter provides more control over which level to end normalization. With max_level=1
the following snippet normalizes until the first nesting level of the provided dict.
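A sketch with hypothetical nested data:
data = [{'CreatedBy': {'Name': 'User001'},
         'Lookup': {'TextField': 'Some text',
                    'UserField': {'Id': 'ID001', 'Name': 'Name001'}},
         'Image': {'a': 'b'}}]
json_normalize(data, max_level=1)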
In [262]: df
Out[262]:
a b
0 1 2
1 3 4
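The frame above and the reader below could come from line-delimited JSON like this (a sketch); iterating over the reader yields the one-row chunks printed after it:
jsonl = '{"a": 1, "b": 2}\n{"a": 3, "b": 4}'
df = pd.read_json(jsonl, lines=True)
reader = pd.read_json(StringIO(jsonl), lines=True, chunksize=1)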
In [265]: reader
Out[265]: <pandas.io.json._json.JsonReader at 0x13ed13b10>
a b
0 1 2
a b
1 3 4
Table schema
In [268]: df
Out[268]:
A B C
idx
0 1 a 2016-01-01
1 2 b 2016-01-02
2 3 c 2016-01-03
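The (truncated) string that follows could be produced by (a sketch):
df.to_json(orient='table', date_format='iso')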
,→"C":"2016-01-01T00:00:00.000Z"},{"idx":1,"A":2,"B":"b","C":"2016-01-02T00:00:00.000Z"},
,→{"idx":2,"A":3,"B":"c","C":"2016-01-03T00:00:00.000Z"}]}'
The schema field contains the fields key, which itself contains a list of column name to type pairs, including
the Index or MultiIndex (see below for a list of types). The schema field also contains a primaryKey field
if the (Multi)index is unique.
The second field, data, contains the serialized data with the records orient. The index is included, and any
datetimes are ISO 8601 formatted, as required by the Table Schema spec.
The full list of types supported are described in the Table Schema spec. This table shows the mapping from
pandas types:
• The schema object contains a pandas_version field. This contains the version of pandas’ dialect of
the schema, and will be incremented with each revision.
• All dates are converted to UTC when serializing. Even timezone naive values, which are treated as
UTC with an offset of 0.
In [272]: build_table_schema(s)
Out[272]:
{'fields': [{'name': 'index', 'type': 'integer'},
{'name': 'values', 'type': 'datetime'}],
'primaryKey': ['index'],
'pandas_version': '0.20.0'}
• datetimes with a timezone (before serializing), include an additional field tz with the time zone name
(e.g. 'US/Central').
In [274]: build_table_schema(s_tz)
Out[274]:
{'fields': [{'name': 'index', 'type': 'integer'},
{'name': 'values', 'type': 'datetime', 'tz': 'US/Central'}],
'primaryKey': ['index'],
'pandas_version': '0.20.0'}
• Periods are converted to timestamps before serialization, and so have the same behavior of being con-
verted to UTC. In addition, periods will contain an additional field freq with the period’s frequency,
e.g. 'A-DEC'.
In [276]: build_table_schema(s_per)
Out[276]:
{'fields': [{'name': 'index', 'type': 'datetime', 'freq': 'A-DEC'},
{'name': 'values', 'type': 'integer'}],
'primaryKey': ['index'],
'pandas_version': '0.20.0'}
• Categoricals use the any type and an enum constraint listing the set of possible values. Additionally,
an ordered field is included:
In [278]: build_table_schema(s_cat)
Out[278]:
{'fields': [{'name': 'index', 'type': 'integer'},
In [280]: build_table_schema(s_dupe)
Out[280]:
{'fields': [{'name': 'index', 'type': 'integer'},
{'name': 'values', 'type': 'integer'}],
'pandas_version': '0.20.0'}
• The primaryKey behavior is the same with MultiIndexes, but in this case the primaryKey is an array:
In [282]: build_table_schema(s_multi)
Out[282]:
{'fields': [{'name': 'level_0', 'type': 'string'},
{'name': 'level_1', 'type': 'integer'},
{'name': 'values', 'type': 'integer'}],
'primaryKey': FrozenList(['level_0', 'level_1']),
'pandas_version': '0.20.0'}
In [284]: df
Out[284]:
foo bar baz qux
idx
0 1 a 2018-01-01 a
1 2 b 2018-01-02 b
2 3 c 2018-01-03 c
3 4 d 2018-01-04 c
In [285]: df.dtypes
Out[285]:
foo int64
bar object
baz datetime64[ns]
qux category
dtype: object
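new_df below could be obtained by round-tripping through the table orient (a sketch):
df.to_json('test.json', orient='table')
new_df = pd.read_json('test.json', orient='table')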
In [288]: new_df
Out[288]:
foo bar baz qux
idx
0 1 a 2018-01-01 a
1 2 b 2018-01-02 b
2 3 c 2018-01-03 c
3 4 d 2018-01-04 c
In [289]: new_df.dtypes
Out[289]:
foo int64
bar object
baz datetime64[ns]
qux category
dtype: object
Please note that the literal string ‘index’ as the name of an Index is not round-trippable, nor are any names
beginning with 'level_' within a MultiIndex. These are used by default in DataFrame.to_json() to
indicate missing values and the subsequent read cannot distinguish the intent.
In [293]: print(new_df.index.name)
None
4.1.3 HTML
Warning: We highly encourage you to read the HTML Table Parsing gotchas below regarding the
issues surrounding the BeautifulSoup4/html5lib/lxml parsers.
The top-level read_html() function can accept an HTML string/file/URL and will parse HTML tables into
list of pandas DataFrames. Let’s look at a few examples.
Note: read_html returns a list of DataFrame objects, even if there is only a single table contained in the
HTML content.
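The list shown next could come from a call like this (a sketch; the FDIC failed-bank list is the page used in this example):
url = 'https://www.fdic.gov/bank/individual/failed/banklist.html'
dfs = pd.read_html(url)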
In [296]: dfs
Out[296]:
[                                           Bank Name          City  ST   CERT                Acquiring Institution       Closing Date       Updated Date
 0                               The Enloe State Bank        Cooper  TX  10716                   Legend Bank, N. A.       May 31, 2019    August 22, 2019
 1                Washington Federal Bank for Savings       Chicago  IL  30570                   Royal Savings Bank  December 15, 2017      July 24, 2019
 2    The Farmers and Merchants State Bank of Argonia       Argonia  KS  17719                          Conway Bank   October 13, 2017    August 12, 2019
 3                                Fayette County Bank    Saint Elmo  IL   1802            United Fidelity Bank, fsb       May 26, 2017   January 29, 2019
 4   Guaranty Bank, (d/b/a BestBank in Georgia & Mi...     Milwaukee  WI  30003  First-Citizens Bank & Trust Company        May 5, 2017     March 22, 2018
 ..                                               ...           ...  ..    ...                                  ...                ...                ...
 551                                Superior Bank, FSB      Hinsdale  IL  32646                Superior Federal, FSB      July 27, 2001    August 19, 2014
 552                               Malta National Bank         Malta  OH   6629                    North Valley Bank        May 3, 2001  November 18, 2002
 553                   First Alliance Bank & Trust Co.    Manchester  NH  34264  Southern New Hampshire Bank & Trust   February 2, 2001  February 18, 2003
 554                 National State Bank of Metropolis    Metropolis  IL   3815              Banterra Bank of Marion  December 14, 2000     March 17, 2005
 555                                  Bank of Honolulu      Honolulu  HI  21029                   Bank of the Orient   October 13, 2000     March 17, 2005
Note: The data from the above URL changes every Monday so the resulting data above and the data
below may be slightly different.
Read in the content of the file from the above URL and pass it to read_html as a string:
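A sketch, assuming the page has been saved to a local path file_path:
with open(file_path, 'r') as f:
    dfs = pd.read_html(f.read())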
In [298]: dfs
Out[298]:
[                                    Bank Name          City  ST   CERT                Acquiring Institution      Closing Date       Updated Date
 0    Banks of Wisconsin d/b/a Bank of Kenosha       Kenosha  WI  35386                North Shore Bank, FSB      May 31, 2013       May 31, 2013
 1                        Central Arizona Bank    Scottsdale  AZ  34527                   Western State Bank      May 14, 2013       May 20, 2013
 2                                Sunrise Bank      Valdosta  GA  58185                         Synovus Bank      May 10, 2013       May 21, 2013
 3                       Pisgah Community Bank     Asheville  NC  58701                   Capital Bank, N.A.      May 10, 2013       May 14, 2013
 4                         Douglas County Bank  Douglasville  GA  21649                  Hamilton State Bank    April 26, 2013       May 16, 2013
 ..                                        ...           ...  ..    ...                                  ...               ...                ...
 500                         Superior Bank, FSB     Hinsdale  IL  32646                Superior Federal, FSB     July 27, 2001       June 5, 2012
 501                        Malta National Bank        Malta  OH   6629                    North Valley Bank       May 3, 2001  November 18, 2002
 502            First Alliance Bank & Trust Co.   Manchester  NH  34264  Southern New Hampshire Bank & Trust  February 2, 2001  February 18, 2003
 503          National State Bank of Metropolis   Metropolis  IL   3815              Banterra Bank of Marion December 14, 2000     March 17, 2005
 504                           Bank of Honolulu     Honolulu  HI  21029                   Bank of the Orient  October 13, 2000     March 17, 2005
In [301]: dfs
Out[301]:
[                                   Bank Name        City  ST   CERT  Acquiring Institution  Closing Date  Updated Date
 0  Banks of Wisconsin d/b/a Bank of Kenosha     Kenosha  WI  35386  North Shore Bank, FSB  May 31, 2013  May 31, 2013
 1                      Central Arizona Bank  Scottsdale  AZ  34527     Western State Bank  May 14, 2013  May 20, 2013
 2                              Sunrise Bank    Valdosta  GA  58185           Synovus Bank  May 10, 2013  May 21, 2013
 3                     Pisgah Community Bank   Asheville  NC  58701     Capital Bank, N.A.  May 10, 2013  May 14, 2013
Note: The following examples are not run by the IPython evaluator due to the fact that having so many
network-accessing functions slows down the documentation build. If you spot an error or an example that
doesn’t run, please do not hesitate to report it over on pandas GitHub issues page.
Specify a header row (by default <th> or <td> elements located within a <thead> are used to form the
column index, if multiple rows are contained within <thead> then a MultiIndex is created); if specified, the
header row is taken from the data minus the parsed header elements (<th> elements).
Specify a number of rows to skip using a list (xrange (Python 2 only) works as well):
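Sketches of both calls (url is a hypothetical page containing tables):
dfs = pd.read_html(url, header=0)
dfs = pd.read_html(url, skiprows=range(2))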
url_mcc = 'https://en.wikipedia.org/wiki/Mobile_country_code'
dfs = pd.read_html(url_mcc, match='Telekom Albania', header=0,
converters={'MNC': str})
Read in pandas to_html output (with some loss of floating point precision):
df = pd.DataFrame(np.random.randn(2, 2))
s = df.to_html(float_format='{0:.40g}'.format)
dfin = pd.read_html(s, index_col=0)
The lxml backend will raise an error on a failed parse if that is the only parser you provide. If you only have
a single parser you can provide just a string, but it is considered good practice to pass a list with one string
if, for example, the function expects a sequence of strings. You may use:
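A sketch:
dfs = pd.read_html(url, flavor='lxml')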
However, if you have bs4 and html5lib installed and pass None or ['lxml', 'bs4'] then the parse will most
likely succeed. Note that as soon as a parse succeeds, the function will return.
DataFrame objects have an instance method to_html which renders the contents of the DataFrame as an
HTML table. The function arguments are as in the method to_string described above.
Note: Not all of the possible options for DataFrame.to_html are shown here for brevity’s sake. See
to_html() for the full set of options.
In [303]: df
Out[303]:
0 1
0 -1.050304 1.131622
1 -0.692581 -1.174172
In [305]: print(df.to_html(columns=[0]))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>-1.050304</td>
</tr>
<tr>
<th>1</th>
<td>-0.692581</td>
</tr>
</tbody>
</table>
HTML:
float_format takes a Python callable to control the precision of floating point values:
In [306]: print(df.to_html(float_format='{0:.10f}'.format))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
<th>1</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>-1.0503044154</td>
<td>1.1316218324</td>
</tr>
<tr>
<th>1</th>
<td>-0.6925807265</td>
<td>-1.1741715747</td>
</tr>
</tbody>
</table>
HTML:
bold_rows will make the row labels bold by default, but you can turn that off:
In [307]: print(df.to_html(bold_rows=False))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
<th>1</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>-1.050304</td>
<td>1.131622</td>
</tr>
<tr>
<td>1</td>
<td>-0.692581</td>
<td>-1.174172</td>
</tr>
</tbody>
</table>
The classes argument provides the ability to give the resulting HTML table CSS classes. Note that these classes are appended to the existing 'dataframe' class.
The render_links argument provides the ability to add hyperlinks to cells that contain URLs.
New in version 0.24.
In [309]: url_df = pd.DataFrame({
.....: 'name': ['Python', 'Pandas'],
.....: 'url': ['https://www.python.org/', 'http://pandas.pydata.org']})
.....:
In [310]: print(url_df.to_html(render_links=True))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>url</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Python</td>
<td><a href="https://www.python.org/" target="_blank">https://www.python.org/</a></
,→td>
</tr>
<tr>
<th>1</th>
<td>Pandas</td>
<td><a href="http://pandas.pydata.org" target="_blank">http://pandas.pydata.org</a>
,→</td>
(continues on next page)
HTML:
Finally, the escape argument allows you to control whether the “<”, “>” and “&” characters are escaped in the
resulting HTML (by default it is True). So to get the HTML without escaped characters pass escape=False
Escaped:
In [312]: print(df.to_html())
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>a</th>
<th>b</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>&</td>
<td>1.254800</td>
</tr>
<tr>
<th>1</th>
<td><</td>
<td>1.131996</td>
</tr>
<tr>
<th>2</th>
<td>></td>
<td>-1.311021</td>
</tr>
</tbody>
</table>
Not escaped:
In [313]: print(df.to_html(escape=False))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>a</th>
<th>b</th>
</tr>
</thead>
<tbody>
Note: Some browsers may not show a difference in the rendering of the previous two HTML tables.
There are some versioning issues surrounding the libraries that are used to parse HTML tables in the top-level
pandas io function read_html.
Issues with lxml
• Benefits
– lxml is very fast.
– lxml requires Cython to install correctly.
• Drawbacks
– lxml does not make any guarantees about the results of its parse unless it is given strictly valid
markup.
– In light of the above, we have chosen to allow you, the user, to use the lxml backend, but this
backend will use html5lib if lxml fails to parse
– It is therefore highly recommended that you install both BeautifulSoup4 and html5lib, so that
you will still get a valid result (provided everything else is valid) even if lxml fails.
Issues with BeautifulSoup4 using lxml as a backend
• The above issues hold here as well since BeautifulSoup4 is essentially just a wrapper around a parser
backend.
Issues with BeautifulSoup4 using html5lib as a backend
• Benefits
– html5lib is far more lenient than lxml and consequently deals with real-life markup in a much
saner way rather than just, e.g., dropping an element without notifying you.
– html5lib generates valid HTML5 markup from invalid markup automatically. This is extremely
important for parsing HTML tables, since it guarantees a valid document. However, that does
NOT mean that it is “correct”, since the process of fixing markup does not have a single definition.
– html5lib is pure Python and requires no additional build steps beyond its own installation.
• Drawbacks
– The biggest drawback to using html5lib is that it is slow as molasses. However consider the fact
that many tables on the web are not big enough for the parsing algorithm runtime to matter. It
is more likely that the bottleneck will be in the process of reading the raw text from the URL
over the web, i.e., IO (input-output). For very large tables, this might not be true.
The read_excel() method can read Excel 2003 (.xls) files using the xlrd Python module. Excel 2007+
(.xlsx) files can be read using either xlrd or openpyxl. The to_excel() instance method is used for saving
a DataFrame to Excel. Generally the semantics are similar to working with csv data. See the cookbook for
some advanced strategies.
In the most basic use-case, read_excel takes a path to an Excel file, and the sheet_name indicating which
sheet to parse.
# Returns a DataFrame
pd.read_excel('path_to_file.xls', sheet_name='Sheet1')
ExcelFile class
To facilitate working with multiple sheets from the same file, the ExcelFile class can be used to wrap the
file and can be passed into read_excel. There will be a performance benefit for reading multiple sheets as
the file is read into memory only once.
xlsx = pd.ExcelFile('path_to_file.xls')
df = pd.read_excel(xlsx, 'Sheet1')
The sheet_names property will generate a list of the sheet names in the file.
The primary use-case for an ExcelFile is parsing multiple sheets with different parameters:
data = {}
# For when Sheet1's format differs from Sheet2
with pd.ExcelFile('path_to_file.xls') as xls:
data['Sheet1'] = pd.read_excel(xls, 'Sheet1', index_col=None,
na_values=['NA'])
data['Sheet2'] = pd.read_excel(xls, 'Sheet2', index_col=1)
Note that if the same parsing parameters are used for all sheets, a list of sheet names can simply be passed
to read_excel with no loss in performance.
ExcelFile can also be called with an xlrd.book.Book object as a parameter. This allows the user to control
how the excel file is read. For example, sheets can be loaded on demand by calling xlrd.open_workbook()
with on_demand=True.
import xlrd
xlrd_book = xlrd.open_workbook('path_to_file.xls', on_demand=True)
with pd.ExcelFile(xlrd_book) as xls:
df1 = pd.read_excel(xls, 'Sheet1')
df2 = pd.read_excel(xls, 'Sheet2')
Specifying sheets
# Returns a DataFrame
pd.read_excel('path_to_file.xls', 'Sheet1', index_col=None, na_values=['NA'])
# Returns a DataFrame
pd.read_excel('path_to_file.xls', 0, index_col=None, na_values=['NA'])
# Returns a DataFrame
pd.read_excel('path_to_file.xls')
read_excel can read more than one sheet, by setting sheet_name to either a list of sheet names, a list of
sheet positions, or None to read all sheets. Sheets can be specified by sheet index or sheet name, using an
integer or string, respectively.
Reading a MultiIndex
read_excel can read a MultiIndex index, by passing a list of columns to index_col and a MultiIndex
column by passing a list of rows to header. If either the index or columns have serialized level names those
will be read in as well by specifying the rows/columns that make up the levels.
For example, to read in a MultiIndex index without names:
In [315]: df.to_excel('path_to_file.xlsx')
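Reading it back presumably passes index_col to mark the first two columns as the index; a minimal sketch
(the file name matches the call above):
df = pd.read_excel('path_to_file.xlsx', index_col=[0, 1])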
In [317]: df
Out[317]:
a b
a c 1 5
d 2 6
b c 3 7
d 4 8
If the index has level names, they will be parsed as well, using the same parameters.
In [319]: df.to_excel('path_to_file.xlsx')
In [321]: df
Out[321]:
a b
lvl1 lvl2
If the source file has both MultiIndex index and columns, lists specifying each should be passed to index_col
and header:
In [323]: df.to_excel('path_to_file.xlsx')
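A sketch of the corresponding read, passing lists to both index_col and header (file name assumed from the
call above):
df = pd.read_excel('path_to_file.xlsx', index_col=[0, 1], header=[0, 1])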
In [325]: df
Out[325]:
c1 a
c2 b d
lvl1 lvl2
a c 1 5
d 2 6
b c 3 7
d 4 8
It is often the case that users will insert columns to do temporary computations in Excel and you may not
want to read in those columns. read_excel takes a usecols keyword to allow you to specify a subset of
columns to parse.
Deprecated since version 0.24.0.
Passing in an integer for usecols has been deprecated. Please pass in a list of ints from 0 to usecols
inclusive instead.
If usecols is an integer, then it is assumed to indicate the last column to be parsed.
You can also specify a comma-delimited set of Excel columns and ranges as a string:
If usecols is a list of integers, then it is assumed to be the file column indices to be parsed.
If usecols is a list of strings, it is assumed that each string corresponds to a column name provided either
by the user in names or inferred from the document header row(s). Those strings define which columns will
be parsed:
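For illustration, sketches of the three usecols forms described above (the file and column names are
placeholders):
# a comma-delimited set of Excel columns and ranges
pd.read_excel('path_to_file.xls', 'Sheet1', usecols='A,C:E')
# a list of file column indices
pd.read_excel('path_to_file.xls', 'Sheet1', usecols=[0, 2, 3])
# a list of column names
pd.read_excel('path_to_file.xls', 'Sheet1', usecols=['foo', 'bar'])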
Parsing dates
Datetime-like values are normally automatically converted to the appropriate dtype when reading the excel
file. But if you have a column of strings that look like dates (but are not actually formatted as dates in
excel), you can use the parse_dates keyword to parse those strings to datetimes:
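A sketch, assuming a column of date-like strings named 'date_strings':
pd.read_excel('path_to_file.xls', 'Sheet1', parse_dates=['date_strings'])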
Cell converters
It is possible to transform the contents of Excel cells via the converters option. For instance, to convert a
column to boolean:
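A sketch, assuming a column named 'MyBools':
pd.read_excel('path_to_file.xls', 'Sheet1', converters={'MyBools': bool})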
This option handles missing values and treats exceptions in the converters as missing data. Transformations
are applied cell by cell rather than to the column as a whole, so the array dtype is not guaranteed. For
instance, a column of integers with missing values cannot be transformed to an array with integer dtype,
because NaN is strictly a float. You can manually mask missing data to recover integer dtype:
def cfun(x):
    return int(x) if x else -1
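The converter can then be applied to a hypothetical integer column, here called 'MyInts':
pd.read_excel('path_to_file.xls', 'Sheet1', converters={'MyInts': cfun})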
Dtype specifications
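As an alternative to converters, the type for an entire column can be specified with the dtype keyword,
which takes a dictionary mapping column names to types. A sketch with placeholder column names:
pd.read_excel('path_to_file.xls', dtype={'MyInts': 'int64', 'MyText': str})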
To write a DataFrame object to a sheet of an Excel file, you can use the to_excel instance method. The
arguments are largely the same as to_csv described above, the first argument being the name of the excel
file, and the optional second argument the name of the sheet to which the DataFrame should be written. For
example:
df.to_excel('path_to_file.xlsx', sheet_name='Sheet1')
Files with a .xls extension will be written using xlwt and those with a .xlsx extension will be written
using xlsxwriter (if available) or openpyxl.
The DataFrame will be written in a way that tries to mimic the REPL output. The index_label will be
placed in the second row instead of the first. You can place it in the first row by setting the merge_cells
option in to_excel() to False:
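A sketch:
df.to_excel('path_to_file.xlsx', index_label='label', merge_cells=False)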
In order to write separate DataFrames to separate sheets in a single Excel file, one can pass an ExcelWriter.
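A minimal sketch, assuming two frames df1 and df2:
with pd.ExcelWriter('path_to_file.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Sheet1')
    df2.to_excel(writer, sheet_name='Sheet2')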
Note: Wringing a little more performance out of read_excel Internally, Excel stores all numeric data as
floats. Because this can produce unexpected behavior when reading in data, pandas defaults to trying to
convert integers to floats if it doesn’t lose information (1.0 --> 1). You can pass convert_float=False to
disable this behavior, which may give a slight performance improvement.
Pandas supports writing Excel files to buffer-like objects such as StringIO or BytesIO using ExcelWriter.
bio = BytesIO()
writer = pd.ExcelWriter(bio, engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1')
writer.save()
# Seek to the beginning and read to copy the workbook to a variable in memory
bio.seek(0)
workbook = bio.read()
Note: engine is optional but recommended. Setting the engine determines the version of workbook produced.
Setting engine='xlwt' will produce an Excel 2003-format workbook (xls). Using either 'openpyxl'
or 'xlsxwriter' will produce an Excel 2007-format workbook (xlsx). If omitted, an Excel 2007-formatted
workbook is produced.
df.to_excel('path_to_file.xlsx', sheet_name='Sheet1')
The look and feel of Excel worksheets created from pandas can be modified using the following parameters
on the DataFrame’s to_excel method.
• float_format : Format string for floating point numbers (default None).
• freeze_panes : A tuple of two integers representing the bottommost row and rightmost column to
freeze. Each of these parameters is one-based, so (1, 1) will freeze the first row and first column (default
None).
Using the Xlsxwriter engine provides many options for controlling the format of an Excel worksheet created
with the to_excel method. Excellent examples can be found in the Xlsxwriter documentation here:
https://xlsxwriter.readthedocs.io/working_with_pandas.html
# Returns a DataFrame
pd.read_excel('path_to_file.ods', engine='odf')
Note: Currently pandas only supports reading OpenDocument spreadsheets. Writing is not implemented.
4.1.6 Clipboard
A handy way to grab data is to use the read_clipboard() method, which takes the contents of the clipboard
buffer and passes them to the read_csv method. For instance, you can copy the following text to the
clipboard (CTRL-C on many operating systems):
A B C
x 1 4 p
y 2 5 q
z 3 6 r
The to_clipboard method can be used to write the contents of a DataFrame to the clipboard. Following
which you can paste the clipboard contents into other applications (CTRL-V on many operating systems).
Here we illustrate writing a DataFrame into clipboard and reading it back.
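A sketch of such a round trip (using the small frame shown above):
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': ['p', 'q', 'r']},
                  index=['x', 'y', 'z'])
df.to_clipboard()
pd.read_clipboard()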
We can see that we got the same content back, which we had earlier written to the clipboard.
Note: You may need to install xclip or xsel (with PyQt5, PyQt4 or qtpy) on Linux to use these methods.
4.1.7 Pickling
All pandas objects are equipped with to_pickle methods which use Python's pickle module to save data
structures to disk using the pickle format.
In [326]: df
Out[326]:
c1 a
c2 b d
lvl1 lvl2
a c 1 5
d 2 6
b c 3 7
d 4 8
In [327]: df.to_pickle('foo.pkl')
The read_pickle function in the pandas namespace can be used to load any pickled pandas object (or any
other pickled object) from file:
In [328]: pd.read_pickle('foo.pkl')
Out[328]:
c1 a
c2 b d
lvl1 lvl2
a c 1 5
d 2 6
b c 3 7
d 4 8
Warning: Loading pickled data received from untrusted sources can be unsafe.
See: https://docs.python.org/3/library/pickle.html
Warning: read_pickle() is only guaranteed backwards compatible back to pandas version 0.20.3
In [329]: df = pd.DataFrame({
.....: 'A': np.random.randn(1000),
.....: 'B': 'foo',
.....: 'C': pd.date_range('20130101', periods=1000, freq='s')})
.....:
In [330]: df
Out[330]:
A B C
0 -0.053113 foo 2013-01-01 00:00:00
1 0.348832 foo 2013-01-01 00:00:01
2 -0.162729 foo 2013-01-01 00:00:02
3 -1.269943 foo 2013-01-01 00:00:03
4 -0.481824 foo 2013-01-01 00:00:04
.. ... ... ...
995 -1.001718 foo 2013-01-01 00:16:35
996 -0.471336 foo 2013-01-01 00:16:36
997 -0.071712 foo 2013-01-01 00:16:37
998 0.578273 foo 2013-01-01 00:16:38
999 0.595708 foo 2013-01-01 00:16:39
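The round trips below work with compressed pickle files. A sketch of the explicit form, with an assumed
file name; in the later examples the compression type is instead inferred from the .gz and .bz2 extensions:
df.to_pickle('data.pkl.compress', compression='gzip')
rt = pd.read_pickle('data.pkl.compress', compression='gzip')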
In [333]: rt
Out[333]:
A B C
0 -0.053113 foo 2013-01-01 00:00:00
1 0.348832 foo 2013-01-01 00:00:01
2 -0.162729 foo 2013-01-01 00:00:02
3 -1.269943 foo 2013-01-01 00:00:03
4 -0.481824 foo 2013-01-01 00:00:04
.. ... ... ...
995 -1.001718 foo 2013-01-01 00:16:35
996 -0.471336 foo 2013-01-01 00:16:36
997 -0.071712 foo 2013-01-01 00:16:37
998 0.578273 foo 2013-01-01 00:16:38
999 0.595708 foo 2013-01-01 00:16:39
In [336]: rt
Out[336]:
A B C
0 -0.053113 foo 2013-01-01 00:00:00
1 0.348832 foo 2013-01-01 00:00:01
2 -0.162729 foo 2013-01-01 00:00:02
3 -1.269943 foo 2013-01-01 00:00:03
4 -0.481824 foo 2013-01-01 00:00:04
.. ... ... ...
995 -1.001718 foo 2013-01-01 00:16:35
996 -0.471336 foo 2013-01-01 00:16:36
997 -0.071712 foo 2013-01-01 00:16:37
998 0.578273 foo 2013-01-01 00:16:38
999 0.595708 foo 2013-01-01 00:16:39
In [337]: df.to_pickle("data.pkl.gz")
In [338]: rt = pd.read_pickle("data.pkl.gz")
In [339]: rt
Out[339]:
A B C
0 -0.053113 foo 2013-01-01 00:00:00
1 0.348832 foo 2013-01-01 00:00:01
2 -0.162729 foo 2013-01-01 00:00:02
3 -1.269943 foo 2013-01-01 00:00:03
4 -0.481824 foo 2013-01-01 00:00:04
.. ... ... ...
995 -1.001718 foo 2013-01-01 00:16:35
996 -0.471336 foo 2013-01-01 00:16:36
997 -0.071712 foo 2013-01-01 00:16:37
998 0.578273 foo 2013-01-01 00:16:38
999 0.595708 foo 2013-01-01 00:16:39
In [340]: df["A"].to_pickle("s1.pkl.bz2")
In [341]: rt = pd.read_pickle("s1.pkl.bz2")
In [342]: rt
Out[342]:
0 -0.053113
1 0.348832
2 -0.162729
3 -1.269943
4 -0.481824
...
995 -1.001718
996 -0.471336
997 -0.071712
998 0.578273
999 0.595708
Name: A, Length: 1000, dtype: float64
4.1.8 msgpack
pandas supports the msgpack format for object serialization. This is a lightweight portable binary format,
similar to binary JSON, that is highly space efficient, and provides good performance both on the writing
(serialization), and reading (deserialization).
Warning: The msgpack format is deprecated as of 0.25 and will be removed in a future version. It is
recommended to use pyarrow for on-the-wire transmission of pandas objects.
Warning: read_msgpack() is only guaranteed backwards compatible back to pandas version 0.20.3
In [344]: df.to_msgpack('foo.msg')
In [345]: pd.read_msgpack('foo.msg')
Out[345]:
A B
0 0.541029 0.554672
1 0.150831 0.503287
2 0.834267 0.881894
3 0.706066 0.726912
4 0.639300 0.067928
You can pass a list of objects and you will receive them back on deserialization.
In [348]: pd.read_msgpack('foo.msg')
In [351]: pd.read_msgpack('foo.msg')
Out[351]:
[ A B
0 0.541029 0.554672
1 0.150831 0.503287
2 0.834267 0.881894
3 0.706066 0.726912
4 0.639300 0.067928, 'foo', array([1, 2, 3]), 2013-01-01 0.753932
2013-01-02 0.676180
2013-01-03 0.924728
2013-01-04 0.338661
2013-01-05 0.592241
Freq: D, dtype: float64, A B
0 0.541029 0.554672
1 0.150831 0.503287
Unlike other io methods, to_msgpack is available both on a per-object basis, df.to_msgpack(), and as the
top-level pd.to_msgpack(...), where you can pack arbitrary collections of Python lists, dicts, scalars, etc.,
intermixed with pandas objects.
In [353]: pd.read_msgpack('foo2.msg')
Out[353]:
{'dict': ({'df': A B
0 0.541029 0.554672
1 0.150831 0.503287
2 0.834267 0.881894
3 0.706066 0.726912
4 0.639300 0.067928},
{'string': 'foo'},
{'scalar': 1.0},
{'s': 2013-01-01 0.753932
2013-01-02 0.676180
2013-01-03 0.924728
2013-01-04 0.338661
2013-01-05 0.592241
Freq: D, dtype: float64})}
Read/write API
In [354]: df.to_msgpack()
Out[354]: b"\x84\xa3typ\xadblock_
,→manager\xa5klass\xa9DataFrame\xa4axes\x92\x86\xa3typ\xa5index\xa5klass\xa5Index\xa4name\xc0\xa5dtype\x
,→index\xa5klass\xaaRangeIndex\xa4name\xc0\xa5start\x00\xa4stop\x05\xa4step\x01\xa6blocks\x91\x86\xa4loc
,→`)\x0c3kN\xc3?\xac\xa1:JQ\xb2\xea?\x8c|\xa87\x17\x98\xe6?\xf3H\x83*&u\xe4?\xd4S\xff
,→{\xe0\xbf\xe1?\xd3'2\xea\xed\x1a\xe0?6\x00'gy8\xec?S\x98/\xe7\xdcB\xe7?
,→`\xdbr\xed\xbac\xb1?
,→\xa5shape\x92\x02\x05\xa5dtype\xa7float64\xa5klass\xaaFloatBlock\xa8compress\xc0"
Furthermore you can concatenate the strings to produce a list of the original objects.
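4.1.9 HDF5 (PyTables)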
HDFStore is a dict-like object which reads and writes pandas objects using the high performance HDF5
format, via the excellent PyTables library. See the cookbook for some advanced strategies.
Warning: pandas requires PyTables >= 3.0.0. There is an indexing bug in PyTables < 3.2 which may
appear when querying stores using an index. If you see a subset of results being returned, upgrade to
PyTables >= 3.2. Stores created previously will need to be rewritten using the updated version.
In [357]: print(store)
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
Objects can be written to the file just like adding key-value pairs to a dict:
In [362]: store['df'] = df
In [363]: store
Out[363]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
In [367]: store
Out[367]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
In [369]: store
Out[369]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
In [370]: store.is_open
Out[370]: False
# Working with, and automatically closing the store using a context manager
In [371]: with pd.HDFStore('store.h5') as store:
.....: store.keys()
.....:
Read/write API
HDFStore supports a top-level API using read_hdf for reading and to_hdf for writing, similar to how
read_csv and to_csv work.
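A minimal sketch of the top-level API (file and key names are placeholders):
df_tl = pd.DataFrame({'A': list(range(5)), 'B': list(range(5))})
df_tl.to_hdf('store_tl.h5', 'table', append=True)
pd.read_hdf('store_tl.h5', 'table', where=['index>2'])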
HDFStore will by default not drop rows that are all missing. This behavior can be changed by setting
dropna=True.
In [375]: df_with_missing = pd.DataFrame({'col1': [0, np.nan, 2],
.....: 'col2': [1, np.nan, np.nan]})
.....:
In [376]: df_with_missing
Out[376]:
col1 col2
0 0.0 1.0
1 NaN NaN
2 2.0 NaN
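A sketch of the dropna behavior (the file name is a placeholder):
df_with_missing.to_hdf('file.h5', 'df_with_missing', format='table', mode='w')
pd.read_hdf('file.h5', 'df_with_missing')    # the all-NaN row is kept
df_with_missing.to_hdf('file.h5', 'df_with_missing', format='table', mode='w', dropna=True)
pd.read_hdf('file.h5', 'df_with_missing')    # the all-NaN row is dropped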
Fixed format
The examples above show storing using put, which writes the HDF5 to PyTables in a fixed array format,
called the fixed format. These types of stores are not appendable once written (though you can simply
remove them and rewrite). Nor are they queryable; they must be retrieved in their entirety. They also do
not support dataframes with non-unique column names. The fixed format stores offer very fast writing and
slightly faster reading than table stores. This format is specified by default when using put or to_hdf or
by format='fixed' or format='f'.
Warning: A fixed format will raise a TypeError if you try to retrieve using a where:
>>> pd.DataFrame(np.random.randn(10, 2)).to_hdf('test_fixed.h5', 'df')
>>> pd.read_hdf('test_fixed.h5', 'df', where='index>5')
TypeError: cannot pass a where specification when reading a fixed format.
this store must be selected in its entirety
Table format
HDFStore supports another PyTables format on disk, the table format. Conceptually a table is shaped very
much like a DataFrame, with rows and columns. A table may be appended to in the same or other sessions.
In addition, delete and query type operations are supported. This format is specified by format='table'
or format='t' to append or put or to_hdf.
This format can be set as an option as well pd.set_option('io.hdf.default_format','table') to enable
put/append/to_hdf to by default store in the table format.
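A minimal sketch of appending to a table store (assuming a DataFrame df and an open store like the one
created below):
store.append('df', df[:4])
store.append('df', df[4:])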
In [381]: store = pd.HDFStore('store.h5')
In [386]: store
Out[386]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
Note: You can also create a table by passing format='table' or format='t' to a put operation.
Hierarchical keys
Keys to a store can be specified as a string. These can be in a hierarchical path-name like format (e.g.
foo/bar/bah), which will generate a hierarchy of sub-stores (or Groups in PyTables parlance). Keys can be
specified without the leading ‘/’ and are always absolute (e.g. ‘foo’ refers to ‘/foo’). Removal operations
can remove everything in the sub-store and below, so be careful.
In [389]: store.put('foo/bar/bah', df)
In [392]: store
Out[392]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
In [395]: store
Out[395]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
You can walk through the group hierarchy using the walk method which will yield a tuple for each group
key along with the relative keys of its contents.
New in version 0.24.0.
In [396]: for (path, subgroups, subkeys) in store.walk():
   .....:     for subgroup in subgroups:
   .....:         print('GROUP: {}/{}'.format(path, subgroup))
   .....:     for subkey in subkeys:
   .....:         key = '/'.join([path, subkey])
   .....:         print('KEY: {}'.format(key))
   .....:         print(store.get(key))
   .....:
GROUP: /foo
KEY: /df
A B C
2000-01-01 0.263806 1.913465 -0.274536
2000-01-02 0.283334 1.798001 -0.053258
2000-01-03 0.799684 -0.733715 1.205089
2000-01-04 0.131478 0.100995 -0.764260
2000-01-05 1.891112 -1.410251 0.752883
2000-01-06 -0.274852 -0.667027 -0.688782
2000-01-07 0.621607 -1.300199 0.050119
2000-01-08 -0.999591 -0.320658 -1.922640
GROUP: /foo/bar
Warning: Hierarchical keys cannot be retrieved as dotted (attribute) access as described above for
items stored under the root node.
In [8]: store.foo.bar.bah
AttributeError: 'HDFStore' object has no attribute 'foo'
# you can directly access the actual PyTables node but by using the root node
In [9]: store.root.foo.bar.bah
Out[9]:
/foo/bar/bah (Group) ''
children := ['block0_items' (Array), 'block0_values' (Array), 'axis0' (Array),
,→'axis1' (Array)]
Storing types
Storing mixed-dtype data is supported. Strings are stored as fixed-width using the maximum size of the
appended column. Subsequent attempts at appending longer strings will raise a ValueError.
Passing min_itemsize={'values': size} as a parameter to append will set a larger minimum for the
string columns. Storing floats, strings, ints, bools, datetime64 are currently supported. For string
columns, passing nan_rep = 'nan' to append will change the default nan representation on disk (which
converts to/from np.nan); this defaults to nan.
In [398]: df_mixed = pd.DataFrame({'A': np.random.randn(8),
.....: 'B': np.random.randn(8),
In [399]: df_mixed.loc[df_mixed.index[3:5],
.....: ['A', 'B', 'string', 'datetime64']] = np.nan
.....:
In [402]: df_mixed1
Out[402]:
A B C string int bool datetime64
0 0.894171 -1.452159 -0.105646 string 1 True 2001-01-02
1 -1.539066 1.018959 0.028593 string 1 True 2001-01-02
2 -0.114019 -0.087476 0.693070 string 1 True 2001-01-02
3 NaN NaN -0.646571 NaN 1 True NaT
4 NaN NaN -0.174558 NaN 1 True NaT
5 -2.110838 -1.234633 -1.257271 string 1 True 2001-01-02
6 -0.704558 0.463419 -0.917264 string 1 True 2001-01-02
7 -0.929182 0.841053 0.414183 string 1 True 2001-01-02
In [403]: df_mixed1.dtypes.value_counts()
Out[403]:
float64 2
datetime64[ns] 1
int64 1
object 1
bool 1
float32 1
dtype: int64
Storing MultiIndex DataFrames as tables is very similar to storing/selecting from homogeneous index
DataFrames.
In [405]: index = pd.MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'],
.....: ['one', 'two', 'three']],
.....: codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3],
.....: [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
.....: names=['foo', 'bar'])
.....:
In [407]: df_mi
Out[407]:
A B C
foo bar
foo one 0.072648 -0.851494 0.140402
two -0.568937 0.439050 2.531582
three 0.539277 -1.398668 0.740635
bar one -1.892064 -0.830925 1.775692
two 2.183350 -1.565258 1.016985
baz two -0.476773 -0.566776 -0.665680
three 0.935387 0.551846 0.786999
qux one 0.481318 0.001118 -0.005084
two 0.238900 -1.888197 -0.943224
three -0.761786 -0.706338 -0.594234
In [409]: store.select('df_mi')
Out[409]:
A B C
foo bar
foo one 0.072648 -0.851494 0.140402
two -0.568937 0.439050 2.531582
three 0.539277 -1.398668 0.740635
bar one -1.892064 -0.830925 1.775692
two 2.183350 -1.565258 1.016985
baz two -0.476773 -0.566776 -0.665680
three 0.935387 0.551846 0.786999
qux one 0.481318 0.001118 -0.005084
two 0.238900 -1.888197 -0.943224
three -0.761786 -0.706338 -0.594234
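The narrower result below is consistent with querying on the index level names, e.g. (a sketch):
store.select('df_mi', 'foo=bar')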
A B C
foo bar
bar one -1.892064 -0.830925 1.775692
two 2.183350 -1.565258 1.016985
Querying
Querying a table
select and delete operations have an optional criterion that can be specified to select/delete only a subset
of the data. This allows one to have a very large on-disk table and retrieve only a portion of the data.
A query is specified using the Term class under the hood, as a boolean expression.
• index and columns are supported indexers of a DataFrames.
• if data_columns are specified, these can be used as additional indexers.
Valid comparison operators are:
=, ==, !=, >, >=, <, <=
Valid boolean expressions are combined with:
• | : or
• & : and
• ( and ) : for grouping
These rules are similar to how boolean expressions are used in pandas for indexing.
Note:
• = will be automatically expanded to the comparison operator ==
• ~ is the not operator, but can only be used in very limited circumstances
• If a list/tuple of expressions is passed they will be combined via &
Note: Passing a string to a query by interpolating it into the query expression is not recommended. Simply
assign the string of interest to a variable and use that variable in an expression. For example, do this
string = "HolyMoly'"
store.select('df', 'index == string')
instead of this
string = "HolyMoly'"
store.select('df', 'index == %s' % string)
The latter will not work and will raise a SyntaxError.Note that there’s a single quote followed by a double
quote in the string variable.
If you must interpolate, use the '%r' format specifier
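For example:
store.select('df', 'index == %r' % string)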
The columns keyword can be supplied to select a list of columns to be returned; this is equivalent to passing
a 'columns=list_of_columns_to_filter':
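A sketch:
store.select('df', "columns=['A', 'B']")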
start and stop parameters can be specified to limit the total search space. These are in terms of the total
number of rows in a table.
Note: select will raise a ValueError if the query expression has an unknown variable reference. Usually
this means that you are trying to select on a column that is not a data_column.
select will raise a SyntaxError if the query expression is not valid.
Using timedelta64[ns]
You can store and query using the timedelta64[ns] type. Terms can be specified in the format:
<float>(<unit>), where float may be signed (and fractional), and unit can be D,s,ms,us,ns for the
timedelta. Here’s an example:
In [419]: dftd
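A sketch of storing and querying such data, assuming the dftd frame above holds a timedelta column C:
store.append('dftd', dftd, data_columns=True)
store.select('dftd', "C<'-3.5D'")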
Indexing
You can create/modify an index for a table with create_table_index after data is already in the table (after
an append/put operation). Creating a table index is highly encouraged. This will speed your queries a
great deal when you use a select with the indexed dimension as the where.
Note: Indexes are automagically created on the indexables and any data columns you specify. This
behavior can be turned off by passing index=False to append.
In [425]: i = store.root.df.table.cols.index.index
Oftentimes when appending large amounts of data to a store, it is useful to turn off index creation for each
append, then recreate at the end.
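A sketch of that pattern (file and frame names are placeholders); the storer displays below show the table
before and after the index is created:
st = pd.HDFStore('appends.h5', mode='w')
st.append('df', df_1, data_columns=['B'], index=False)
st.append('df', df_2, data_columns=['B'], index=False)
# create the index on the data column only when all appends are done
st.create_table_index('df', columns=['B'], optlevel=9, kind='full')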
In [432]: st.get_storer('df').table
Out[432]:
/df/table (Table(20,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
"B": Float64Col(shape=(), dflt=0.0, pos=2)}
byteorder := 'little'
chunkshape := (2730,)
In [434]: st.get_storer('df').table
Out[434]:
/df/table (Table(20,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
"B": Float64Col(shape=(), dflt=0.0, pos=2)}
byteorder := 'little'
chunkshape := (2730,)
autoindex := True
colindexes := {
"B": Index(9, full, shuffle, zlib(1)).is_csi=True}
In [435]: st.close()
You can designate (and index) certain columns that you want to be able to perform queries (other than
the indexable columns, which you can always query). For instance say you want to perform this common
operation, on-disk, and return just the frame that matches this query. You can specify data_columns =
True to force all columns to be data_columns.
In [436]: df_dc = df.copy()
In [442]: df_dc
Out[442]:
A B C string string2
2000-01-01 0.263806 1.913465 -0.274536 foo cool
2000-01-02 0.283334 1.000000 1.000000 foo cool
2000-01-03 0.799684 1.000000 1.000000 foo cool
2000-01-04 0.131478 0.100995 -0.764260 foo cool
2000-01-05 1.891112 -1.410251 0.752883 NaN cool
2000-01-06 -0.274852 -0.667027 -0.688782 NaN cool
2000-01-07 0.621607 -1.300199 0.050119 foo cool
2000-01-08 -0.999591 -0.320658 -1.922640 bar cool
# on-disk operations
In [443]: store.append('df_dc', df_dc, data_columns=['B', 'C', 'string', 'string2'])
# getting creative
In [445]: store.select('df_dc', 'B > 0 & C > 0 & string == foo')
Out[445]:
A B C string string2
2000-01-02 0.283334 1.0 1.0 foo cool
2000-01-03 0.799684 1.0 1.0 foo cool
Iterator
Note: You can also use the iterator with read_hdf which will open, then automatically close the store
when finished iterating.
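A sketch of iterating with read_hdf (table-format stores only):
for df in pd.read_hdf('store.h5', 'df', chunksize=3):
    print(df)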
Note that the chunksize keyword applies to the source rows. So if you are doing a query, the chunksize
will subdivide the total rows in the table, the query will be applied, and an iterator returned on potentially
unequal sized chunks.
Here is a recipe for generating a query and using it to create equal sized return chunks.
In [450]: dfeq
Out[450]:
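A sketch of such a recipe, assuming the dfeq frame above holds a single 'number' column with values 1
through 10:
def chunks(values, size):
    return [values[i:i + size] for i in range(0, len(values), size)]

store.append('dfeq', dfeq, data_columns=['number'])
evens = [2, 4, 6, 8, 10]
coordinates = store.select_as_coordinates('dfeq', 'number=evens')
for c in chunks(coordinates, 2):
    print(store.select('dfeq', where=c))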
Advanced queries
To retrieve a single indexable or data column, use the method select_column. This will, for example, enable
you to get the index very quickly. These return a Series of the result, indexed by the row number. These
do not currently accept the where selector.
In [456]: store.select_column('df_dc', 'index')
Out[456]:
0 2000-01-01
1 2000-01-02
2 2000-01-03
3 2000-01-04
4 2000-01-05
5 2000-01-06
6 2000-01-07
7 2000-01-08
Name: index, dtype: datetime64[ns]
Selecting coordinates
Sometimes you want to get the coordinates (a.k.a the index locations) of your query. This returns an
Int64Index of the resulting locations. These coordinates can also be passed to subsequent where operations.
In [458]: df_coord = pd.DataFrame(np.random.randn(1000, 2),
.....: index=pd.date_range('20000101', periods=1000))
.....:
In [461]: c
Out[461]:
Int64Index([732, 733, 734, 735, 736, 737, 738, 739, 740, 741,
...
990, 991, 992, 993, 994, 995, 996, 997, 998, 999],
dtype='int64', length=268)
Sometimes your query can involve creating a list of rows to select. Usually this mask would be a resulting
index from an indexing operation. This example selects the months of a DatetimeIndex which are 5.
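A sketch (the frame name is a placeholder):
df_mask = pd.DataFrame(np.random.randn(1000, 2),
                       index=pd.date_range('20000101', periods=1000))
store.append('df_mask', df_mask)
c = store.select_column('df_mask', 'index')
where = c[pd.DatetimeIndex(c).month == 5].index
store.select('df_mask', where=where)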
Storer object
If you want to inspect the stored object, retrieve via get_storer. You could use this programmatically to
say get the number of rows in an object.
In [468]: store.get_storer('df_dc').nrows
Out[468]: 8
The methods append_to_multiple and select_as_multiple can perform appending/selecting from mul-
tiple tables at once. The idea is to have one table (call it the selector table) that you index most/all of the
columns, and perform your queries. The other table(s) are data tables with an index matching the selector
table’s index. You can then perform a very fast query on the selector table, yet get lots of data back. This
method is similar to having a very wide table, but enables more efficient queries.
The append_to_multiple method splits a given single DataFrame into multiple tables according to d, a
dictionary that maps the table names to a list of ‘columns’ you want in that table. If None is used in place
of a list, that table will have the remaining unspecified columns of the given DataFrame. The argument
selector defines which table is the selector table (which you can make queries from). The argument dropna
will drop rows from the input DataFrame to ensure tables are synchronized. This means that if a row for
one of the tables being written to is entirely np.NaN, that row will be dropped from all tables.
If dropna is False, THE USER IS RESPONSIBLE FOR SYNCHRONIZING THE TABLES.
Remember that entirely np.NaN rows are not written to the HDFStore, so if you choose to call dropna=False,
some tables may have more rows than others, and therefore select_as_multiple may not work or it may
return unexpected results.
In [469]: df_mt = pd.DataFrame(np.random.randn(8, 6),
.....: index=pd.date_range('1/1/2000', periods=8),
.....: columns=['A', 'B', 'C', 'D', 'E', 'F'])
.....:
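A sketch of the split append that would produce the two tables selected below (the extra 'foo' column is
assumed):
df_mt['foo'] = 'bar'
store.append_to_multiple({'df1_mt': ['A', 'B'], 'df2_mt': None},
                         df_mt, selector='df1_mt')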
In [473]: store
Out[473]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
In [475]: store.select('df2_mt')
Out[475]:
C D E F foo
2000-01-01 0.519195 -1.005739 -1.126444 -0.331994 bar
2000-01-02 -0.220187 -2.576683 -1.531751 1.545878 bar
2000-01-03 0.164228 -0.920410 0.184463 -0.179357 bar
2000-01-04 -1.493748 0.016206 -1.594605 -0.077495 bar
2000-01-05 -0.818602 -0.271994 1.188345 0.345087 bar
2000-01-06 -1.514329 -1.344535 -1.243543 0.231915 bar
2000-01-07 -0.702845 2.420812 0.309805 -1.101996 bar
2000-01-08 0.441337 -0.846243 -0.984939 -1.159608 bar
# as a multiple
In [476]: store.select_as_multiple(['df1_mt', 'df2_mt'], where=['A>0', 'B>0'],
.....: selector='df1_mt')
.....:
Out[476]:
Empty DataFrame
Columns: [A, B, C, D, E, F, foo]
Index: []
You can delete from a table selectively by specifying a where. In deleting rows, it is important to understand
that PyTables deletes rows by erasing the rows, then moving the following data. Thus deleting can potentially
be a very expensive operation depending on the orientation of your data. To get optimal performance, it’s
worthwhile to have the dimension you are deleting be the first of the indexables.
Data is ordered (on the disk) in terms of the indexables. Here’s a simple use case. You store panel-type
data, with dates in the major_axis and ids in the minor_axis. The data is then interleaved like this:
• date_1
– id_1
– id_2
– .
– id_n
• date_2
– id_1
– .
– id_n
It should be clear that a delete operation on the major_axis will be fairly quick, as one chunk is removed,
then the following data moved. On the other hand a delete operation on the minor_axis will be very
expensive. In this case it would almost certainly be faster to rewrite the table using a where that selects all
but the missing data.
Warning: Please note that HDF5 DOES NOT RECLAIM SPACE in the h5 files automatically.
Thus, repeatedly deleting (or removing nodes) and adding again, WILL TEND TO INCREASE
THE FILE SIZE.
To repack and clean the file, use ptrepack.
Compression
PyTables allows the stored data to be compressed. This applies to all kinds of stores, not just tables. Two
parameters are used to control compression: complevel and complib.
complevel specifies if and how hard data is to be compressed. complevel=0 and complevel=None
disable compression and 0<complevel<10 enables compression.
complib specifies which compression library to use. If nothing is specified the default library zlib
is used. A compression library usually optimizes for either good compression rates or speed and the
results will depend on the type of data. Which type of compression to choose depends on your specific
needs and data. The list of supported compression libraries:
• zlib: The default compression library. A classic in terms of compression, achieves good
compression rates but is somewhat slow.
• lzo: Fast compression and decompression.
• bzip2: Good compression rates.
• blosc: Fast compression and decompression.
New in version 0.20.2: Support for alternative blosc compressors:
• blosc:blosclz This is the default compressor for blosc
• blosc:lz4: A compact, very popular and fast compressor.
• blosc:lz4hc: A tweaked version of LZ4, produces better compression ratios at the expense
of speed.
• blosc:snappy: A popular compressor used in many places.
• blosc:zlib: A classic; somewhat slower than the previous ones, but achieving better
compression ratios.
• blosc:zstd: An extremely well balanced codec; it provides the best compression ratios
among the others above, and at reasonably fast speed.
If complib is defined as something other than the listed libraries a ValueError exception is
issued.
Note: If the library specified with the complib option is missing on your platform, compression defaults
to zlib without further ado.
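For example, compression can be enabled for every object written to a store when the store is opened (the
file name is a placeholder):
store_compressed = pd.HDFStore('store_compressed.h5', complevel=9,
                               complib='blosc:blosclz')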
Or on-the-fly compression (this only applies to tables) in stores where compression is not enabled:
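A sketch:
store.append('df', df, complib='zlib', complevel=5)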
ptrepack
PyTables offers better write performance when tables are compressed after they are written, as opposed to
turning on compression at the very beginning. You can use the supplied PyTables utility ptrepack. In
addition, ptrepack can change compression levels after the fact.
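A typical invocation looks like this (input and output file names are placeholders):
ptrepack --chunkshape=auto --propindexes --complevel=9 --complib=blosc in.h5 out.h5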
Furthermore ptrepack in.h5 out.h5 will repack the file to allow you to reuse previously deleted space.
Alternatively, one can simply remove the file and write again, or use the copy method.
Caveats
Warning: HDFStore is not threadsafe for writing. The underlying PyTables only supports concurrent
reads (via threading or processes). If you need reading and writing at the same time, you need to
serialize these operations in a single thread in a single process. You will corrupt your data otherwise. See
(GH2397) for more information.
• If you use locks to manage write access between multiple processes, you may want to use fsync()
before releasing write locks. For convenience you can use store.flush(fsync=True) to do this for
you.
• Once a table is created, columns (DataFrame) are fixed; only exactly the same columns can be
appended
• Be aware that timezones (e.g., pytz.timezone('US/Eastern')) are not necessarily equal across time-
zone versions. So if data is localized to a specific timezone in the HDFStore using one version of a
timezone library and that data is updated with another version, the data will be converted to UTC
since these timezones are not considered equal. Either use the same version of timezone library or use
tz_convert with the updated timezone definition.
Warning: PyTables will show a NaturalNameWarning if a column name cannot be used as an attribute
selector. Natural identifiers contain only letters, numbers, and underscores, and may not begin with a
number. Other identifiers cannot be used in a where clause and are generally a bad idea.
DataTypes
HDFStore will map an object dtype to the PyTables underlying dtype. This means the following types are
known to work:
Categorical data
You can write data that contains category dtypes to a HDFStore. Queries work the same as if it was an
object array. However, the category dtyped data is stored in a more efficient manner.
In [477]: dfcat = pd.DataFrame({'A': pd.Series(list('aabbcdba')).astype('category'),
.....: 'B': np.random.randn(8)})
.....:
In [478]: dfcat
Out[478]:
A B
0 a -0.182915
1 a -1.740077
2 b -0.227312
3 b -0.351706
4 c -0.484542
5 d 1.150367
6 b -1.223195
7 a 1.022514
In [479]: dfcat.dtypes
Out[479]:
A category
B float64
dtype: object
In [483]: result
Out[483]:
A B
2 b -0.227312
3 b -0.351706
4 c -0.484542
6 b -1.223195
In [484]: result.dtypes
Out[484]:
A category
B float64
dtype: object
String columns
min_itemsize
The underlying implementation of HDFStore uses a fixed column width (itemsize) for string columns. A
string column itemsize is calculated as the maximum of the length of data (for that column) that is passed
to the HDFStore in the first append. Subsequent appends may introduce a string for a column larger
than the column can hold, in which case an Exception will be raised (otherwise you could have a silent
truncation of these columns, leading to loss of information). In the future we may relax this and allow a
user-specified truncation to occur.
Pass min_itemsize on the first table creation to a-priori specify the minimum length of a particular string
column. min_itemsize can be an integer, or a dict mapping a column name to an integer. You can pass
values as a key to allow all indexables or data_columns to have this min_itemsize.
Passing a min_itemsize dict will cause all passed columns to be created as data_columns automatically.
Note: If you are not passing any data_columns, then the min_itemsize will be the maximum of the
length of any string passed
In [486]: dfs
Out[486]:
A B
0 foo bar
1 foo bar
2 foo bar
3 foo bar
4 foo bar
In [488]: store.get_storer('dfs').table
Out[488]:
/dfs/table (Table(5,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": StringCol(itemsize=30, shape=(2,), dflt=b'', pos=1)}
byteorder := 'little'
chunkshape := (963,)
autoindex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
In [490]: store.get_storer('dfs2').table
Out[490]:
/dfs2/table (Table(5,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": StringCol(itemsize=3, shape=(1,), dflt=b'', pos=1),
"A": StringCol(itemsize=30, shape=(), dflt=b'', pos=2)}
byteorder := 'little'
chunkshape := (1598,)
autoindex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"A": Index(6, medium, shuffle, zlib(1)).is_csi=False}
nan_rep
String columns will serialize a np.nan (a missing value) with the nan_rep string representation. This defaults
to the string value nan. You could inadvertently turn an actual nan value into a missing value.
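A sketch showing both behaviors for the dfss frame below (the alternative representation '_nan_' is just an
example):
store.append('dfss', dfss)                     # default nan_rep='nan': the string 'nan' round-trips as NaN
store.append('dfss2', dfss, nan_rep='_nan_')   # a different sentinel keeps the literal string 'nan'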
In [492]: dfss
Out[492]:
A
0 foo
1 bar
2 nan
In [494]: store.select('dfss')
Out[494]:
A
0 foo
1 bar
2 NaN
In [496]: store.select('dfss2')
Out[496]:
A
0 foo
1 bar
2 nan
External compatibility
HDFStore writes table format objects in specific formats suitable for producing loss-less round trips to
pandas objects. For external compatibility, HDFStore can read native PyTables format tables.
It is possible to write an HDFStore object that can easily be imported into R using the rhdf5 library (Package
website). Create a table format store like this:
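A sketch, assuming the df_for_r frame shown just below:
store_export = pd.HDFStore('export.h5')
store_export.append('df_for_r', df_for_r, data_columns=df_for_r.columns)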
In [498]: df_for_r.head()
Out[498]:
first second class
0 0.253517 0.473526 0
1 0.232906 0.331008 0
2 0.069221 0.532945 1
3 0.290835 0.069538 1
4 0.912722 0.346792 0
In [501]: store_export
Out[501]:
<class 'pandas.io.pytables.HDFStore'>
File path: export.h5
In R this file can be read into a data.frame object using the rhdf5 library. The following example function
reads the corresponding column names and data values from the values and assembles them into a
data.frame:
# Load values and column names for all datasets from corresponding nodes and
# insert them into one data.frame object.
library(rhdf5)
return(data)
}
Note: The R function lists the entire HDF5 file’s contents and assembles the data.frame object from all
matching nodes, so use this only as a starting point if you have stored multiple DataFrame objects to a single
HDF5 file.
Performance
• The tables format comes with a writing performance penalty as compared to fixed stores. The benefit
is the ability to append/delete and query (potentially very large amounts of data). Write times are
generally longer as compared with regular stores. Query times can be quite fast, especially on an
indexed axis.
• You can pass chunksize=<int> to append, specifying the write chunksize (default is 50000). This will
significantly lower your memory usage on writing.
• You can pass expectedrows=<int> to the first append, to set the TOTAL number of rows that
PyTables will expect. This will optimize read/write performance.
• Duplicate rows can be written to tables, but are filtered out in selection (with the last items being
selected; thus a table is unique on major, minor pairs)
• A PerformanceWarning will be raised if you are attempting to store types that will be pickled by
PyTables (rather than stored as endemic types). See Here for more information and some solutions.
4.1.10 Feather
In [503]: df
Out[503]:
   a  b  c    d     e  f          g                         h                             i
0  a  1  3  4.0  True  a 2013-01-01 2013-01-01 00:00:00-05:00 2013-01-01 00:00:00.000000000
In [504]: df.dtypes
Out[504]:
a object
b int64
c uint8
d float64
e bool
f category
g datetime64[ns]
h datetime64[ns, US/Eastern]
i datetime64[ns]
dtype: object
Write to a feather file.
In [505]: df.to_feather('example.feather')
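Read back from the feather file; a minimal sketch:
result = pd.read_feather('example.feather')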
In [507]: result
Out[507]:
   a  b  c    d     e  f          g                         h                             i
0  a  1  3  4.0  True  a 2013-01-01 2013-01-01 00:00:00-05:00 2013-01-01 00:00:00.000000000
# we preserve dtypes
In [508]: result.dtypes
Out[508]:
a object
b int64
c uint8
d float64
e bool
f category
g datetime64[ns]
h datetime64[ns, US/Eastern]
i datetime64[ns]
dtype: object
4.1.11 Parquet
Note: These engines (pyarrow and fastparquet) are very similar and should read/write nearly identical
parquet format files. Currently pyarrow does not support timedelta data, and fastparquet>=0.1.4 supports
timezone aware datetimes. These libraries differ by having different underlying dependencies (fastparquet
uses numba, while pyarrow uses a C library).
In [510]: df
Out[510]:
a b c d e f g
0 a 1 3 4.0 True 2013-01-01 2013-01-01 00:00:00-05:00
In [511]: df.dtypes
Out[511]:
a object
b int64
c uint8
d float64
e bool
f datetime64[ns]
g datetime64[ns, US/Eastern]
dtype: object
Write to a parquet file.
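A minimal sketch using the pyarrow engine (the file name is a placeholder):
df.to_parquet('example_pa.parquet', engine='pyarrow')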
(Traceback output omitted: in the environment used to build these docs, the write and read calls using the
fastparquet engine raised ImportError: fastparquet is required for parquet support, because fastparquet was
not installed.)
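Reading back preserves the dtypes, as the output below shows; a minimal sketch of the read (file name
assumed):
result = pd.read_parquet('example_pa.parquet', engine='pyarrow')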
In [516]: result.dtypes
Out[516]:
a object
b int64
c uint8
d float64
e bool
f datetime64[ns]
g datetime64[ns, US/Eastern]
dtype: object
(A further ImportError traceback from the missing fastparquet engine is omitted here.)
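The narrower result below is consistent with reading only a subset of columns; a sketch (file name assumed):
result = pd.read_parquet('example_pa.parquet', engine='pyarrow', columns=['a', 'b'])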
In [519]: result.dtypes
Out[519]:
a object
b int64
dtype: object
Handling indexes
Serializing a DataFrame to parquet may include the implicit index as one or more columns in the output file.
Thus, this code:
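A minimal sketch of such a write (frame and file names are placeholders):
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.to_parquet('test.parquet', engine='pyarrow')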
creates a parquet file with three columns if you use pyarrow for serialization: a, b, and __index_level_0__.
If you’re using fastparquet, the index may or may not be written to the file.
This unexpected extra column causes some databases like Amazon Redshift to reject the file, because that
column doesn’t exist in the target table.
If you want to omit a dataframe’s indexes when writing, pass index=False to to_parquet():
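A sketch:
df.to_parquet('test.parquet', index=False)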
This creates a parquet file with just the two expected columns, a and b. If your DataFrame has a custom
index, you won’t get it back when you load this file into a DataFrame.
Passing index=True will always write the index, even if that’s not the underlying engine’s default behavior.
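A sketch of a partitioned write consistent with the description below (names are placeholders):
df = pd.DataFrame({'a': [0, 0, 1, 1], 'b': [0, 1, 0, 1]})
df.to_parquet(fname='test', engine='pyarrow', partition_cols=['a'], compression=None)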
The fname specifies the parent directory to which data will be saved. The partition_cols are the column
names by which the dataset will be partitioned. Columns are partitioned in the order they are given. The
partition splits are determined by the unique values in the partition columns. The above example creates a
partitioned dataset that may look like:
test
├── a=0
│   ├── 0bac803e32dc42ae83fddfd029cbdebc.parquet
│   └── ...
└── a=1
    ├── e6ab24a4f45147b49b54a662f0c412a3.parquet
    └── ...
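4.1.12 SQL queries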
The pandas.io.sql module provides a collection of query wrappers to both facilitate data retrieval and
to reduce dependency on DB-specific API. Database abstraction is provided by SQLAlchemy if installed.
In addition you will need a driver library for your database. Examples of such drivers are psycopg2 for
PostgreSQL or pymysql for MySQL. For SQLite this is included in Python’s standard library by default.
You can find an overview of supported drivers for each SQL dialect in the SQLAlchemy docs.
If SQLAlchemy is not installed, a fallback is only provided for sqlite (and for mysql for backwards compati-
bility, but this is deprecated and will be removed in a future version). This mode requires a Python database
adapter which respects the Python DB-API.
See also some cookbook examples for some advanced strategies.
The key functions are:
read_sql_table(table_name, con[, schema, …])     Read SQL database table into a DataFrame.
read_sql_query(sql, con[, index_col, …])         Read SQL query into a DataFrame.
read_sql(sql, con[, index_col, …])               Read SQL query or database table into a DataFrame.
DataFrame.to_sql(self, name, con[, schema, …])   Write records stored in a DataFrame to a SQL database.
pandas.read_sql_table
Notes
Any datetime values with time zone information will be converted to UTC.
pandas.read_sql_query
Notes
Any datetime values with time zone information parsed via the parse_dates parameter will be converted
to UTC.
pandas.read_sql
pandas.DataFrame.to_sql
if_exists [{‘fail’, ‘replace’, ‘append’}, default ‘fail’] How to behave if the table already
exists.
• fail: Raise a ValueError.
• replace: Drop the table before inserting new values.
• append: Insert new values to the existing table.
index [bool, default True] Write DataFrame index as a column. Uses index_label as
the column name in the table.
index_label [string or sequence, default None] Column label for index column(s). If
None is given (default) and index is True, then the index names are used. A sequence
should be given if the DataFrame uses MultiIndex.
chunksize [int, optional] Rows will be written in batches of this size at a time. By
default, all rows will be written at once.
dtype [dict, optional] Specifying the datatype for columns. The keys should be the
column names and the values should be the SQLAlchemy types or strings for the
sqlite3 legacy mode.
method [{None, ‘multi’, callable}, default None] Controls the SQL insertion clause
used:
• None : Uses standard SQL INSERT clause (one per row).
• ‘multi’: Pass multiple values in a single INSERT clause.
• callable with signature (pd_table, conn, keys, data_iter).
Details and a sample callable implementation can be found in the section insert
method.
New in version 0.24.0.
Raises
ValueError When the table already exists and if_exists is ‘fail’ (the default).
Notes
Timezone aware datetime columns will be written as Timestamp with timezone type with
SQLAlchemy if supported by the database. Otherwise, the datetimes will be stored as timezone
unaware timestamps local to the original timezone.
New in version 0.24.0.
Examples
Specify the dtype (especially useful for integers with missing values). Notice that while pandas is forced
to store the data as floating point, the database supports nullable integers. When fetching the data
with Python, we get back integer scalars.
The function read_sql() is a convenience wrapper around read_sql_table() and read_sql_query() and
will delegate to the specific function depending on the provided input (database table name or SQL query).
Table names do not need to be quoted if they have special characters.
In the following example, we use the SQLite SQL database engine. You can use a temporary SQLite database
where data are stored in “memory”.
To connect with SQLAlchemy you use the create_engine() function to create an engine object from
database URI. You only need to create the engine once per database you are connecting to. For more
information on create_engine() and the URI formatting, see the examples below and the SQLAlchemy
documentation
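A minimal sketch using an in-memory SQLite database:
from sqlalchemy import create_engine
# Create your engine.
engine = create_engine('sqlite:///:memory:')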
If you want to manage your own connections you can pass one of those instead:
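A sketch of passing an explicit connection:
with engine.connect() as conn, conn.begin():
    data = pd.read_sql_table('data', conn)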
Writing DataFrames
Assuming the following data is in a DataFrame data, we can insert it into the database using to_sql().
In [527]: data
Out[527]:
id Date Col_1 Col_2 Col_3
0 26 2010-10-18 X 27.50 True
1 42 2010-10-19 Y -12.50 False
2 63 2010-10-20 Z 5.73 True
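A sketch of the insert:
data.to_sql('data', engine)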
With some databases, writing large DataFrames can result in errors due to packet size limitations being
exceeded. This can be avoided by setting the chunksize parameter when calling to_sql. For example, the
following writes data to the database in batches of 1000 rows at a time:
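A sketch (the table name is a placeholder):
data.to_sql('data_chunked', engine, chunksize=1000)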
to_sql() will try to map your data to an appropriate SQL data type based on the dtype of the data. When
you have columns of dtype object, pandas will try to infer the data type.
You can always override the default type by specifying the desired SQL type of any of the columns by using
the dtype argument. This argument needs a dictionary mapping column names to SQLAlchemy types (or
strings for the sqlite3 fallback mode). For example, specifying to use the sqlalchemy String type instead of
the default Text type for string columns:
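A sketch:
from sqlalchemy.types import String
data.to_sql('data_dtype', engine, dtype={'Col_1': String})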
Note: Due to the limited support for timedelta’s in the different database flavors, columns with type
timedelta64 will be written as integer values as nanoseconds to the database and a warning will be raised.
Note: Columns of category dtype will be converted to the dense representation as you would get with
np.asarray(categorical) (e.g. for string categories this gives an array of strings). Because of this, reading
the database table back in does not generate a categorical.
Using SQLAlchemy, to_sql() is capable of writing datetime data that is timezone naive or timezone aware.
However, the resulting data stored in the database ultimately depends on the supported data type for
datetime data of the database system being used.
The following table lists supported data types for datetime data for some common databases. Other database
dialects may have different data types for datetime data.
When writing timezone aware data to databases that do not support timezones, the data will be written as
timezone naive timestamps that are in local time with respect to the timezone.
read_sql_table() is also capable of reading datetime data that is timezone aware or naive. When reading
TIMESTAMP WITH TIME ZONE types, pandas will convert the data to UTC.
Insertion method
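A custom insertion callable can be passed via the method argument of to_sql(); it receives the pandas
SQLTable wrapper, a connection, the column names, and an iterator of row tuples. The sketch below is
only an illustration (the function and table names are hypothetical, and a real implementation might use a
database-specific bulk-load facility instead):
def insert_rows(table, conn, keys, data_iter):
    # 'table.table' is assumed to be the underlying SQLAlchemy Table object
    rows = [dict(zip(keys, row)) for row in data_iter]
    conn.execute(table.table.insert(), rows)

# data.to_sql('data_custom', engine, method=insert_rows)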
Reading tables
read_sql_table() will read a database table given the table name and optionally a subset of columns to
read.
Note: In order to use read_sql_table(), you must have the SQLAlchemy optional dependency installed.
You can also specify the name of the column as the DataFrame index, and specify a subset of columns to be
read.
In [533]: pd.read_sql_table('data', engine, index_col='id')
Out[533]:
index Date Col_1 Col_2 Col_3
id
26 0 2010-10-18 X 27.50 True
42 1 2010-10-19 Y -12.50 False
63 2 2010-10-20 Z 5.73 True
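The truncated output below is consistent with selecting a subset of columns, e.g. (a sketch):
pd.read_sql_table('data', engine, columns=['Col_1', 'Col_2'])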
0 X 27.50
1 Y -12.50
2 Z 5.73
And you can explicitly force columns to be parsed as dates:
If needed you can explicitly specify a format string, or a dict of arguments to pass to pandas.to_datetime():
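Sketches of both forms, using the 'Date' column from the frame above:
pd.read_sql_table('data', engine, parse_dates=['Date'])
pd.read_sql_table('data', engine, parse_dates={'Date': '%Y-%m-%d'})
pd.read_sql_table('data', engine,
                  parse_dates={'Date': {'format': '%Y-%m-%d %H:%M:%S'}})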
Schema support
Reading from and writing to different schemas is supported through the schema keyword in the
read_sql_table() and to_sql() functions. Note however that this depends on the database flavor (sqlite
does not have schemas). For example:
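A sketch (the schema name is a placeholder):
df.to_sql('table', engine, schema='other_schema')
pd.read_sql_table('table', engine, schema='other_schema')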
Querying
You can query using raw SQL in the read_sql_query() function. In this case you must use the SQL
variant appropriate for your database. When using SQLAlchemy, you can also pass SQLAlchemy Expression
language constructs, which are database-agnostic.
In [537]: pd.read_sql_query("SELECT id, Col_1, Col_2 FROM data WHERE id = 42;", engine)
Out[537]:
id Col_1 Col_2
0 42 Y -12.5
The read_sql_query() function supports a chunksize argument. Specifying this will return an iterator
through chunks of the query result:
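A sketch (table name assumed from the chunked write above):
for chunk in pd.read_sql_query("SELECT * FROM data_chunked", engine, chunksize=5):
    print(chunk)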
You can also run a plain query without creating a DataFrame with execute(). This is useful for queries that
don’t return values, such as INSERT. This is functionally equivalent to calling execute on the SQLAlchemy
engine or db connection object. Again, you must use the SQL syntax variant appropriate for your database.
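A sketch (table name and values are placeholders):
from pandas.io import sql
sql.execute('SELECT * FROM table_name', engine)
sql.execute('INSERT INTO table_name VALUES(?, ?, ?)', engine,
            params=[('id', 1.23, 'M')])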
To connect with SQLAlchemy you use the create_engine() function to create an engine object from
database URI. You only need to create the engine once per database you are connecting to.
engine = create_engine('postgresql://scott:tiger@localhost:5432/mydatabase')
engine = create_engine('oracle://scott:[email protected]:1521/sidname')
engine = create_engine('mssql+pyodbc://mydsn')
# sqlite://<nohostname>/<path>
# where <path> is relative:
engine = create_engine('sqlite:///foo.db')
If you have an SQLAlchemy description of your database you can express where conditions using SQLAlchemy
expressions
You can combine SQLAlchemy expressions with parameters passed to read_sql() using sqlalchemy.
bindparam()
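A sketch of both techniques, describing the 'data' table used above (the column types are assumptions):
import datetime as dt
import sqlalchemy as sa

metadata = sa.MetaData()
data_table = sa.Table('data', metadata,
                      sa.Column('index', sa.Integer),
                      sa.Column('Date', sa.DateTime),
                      sa.Column('Col_1', sa.String),
                      sa.Column('Col_2', sa.Float),
                      sa.Column('Col_3', sa.Boolean))

# where condition expressed with SQLAlchemy expressions
pd.read_sql(sa.select([data_table]).where(data_table.c.Col_3 == True), engine)

# the same, parameterized with sqlalchemy.bindparam()
expr = sa.select([data_table]).where(data_table.c.Date > sa.bindparam('date'))
pd.read_sql(expr, engine, params={'date': dt.datetime(2010, 10, 18)})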
Sqlite fallback
The use of sqlite is supported without using SQLAlchemy. This mode requires a Python database adapter
which respects the Python DB-API.
You can create connections like so:
import sqlite3
con = sqlite3.connect(':memory:')
data.to_sql('data', con)
pd.read_sql_query("SELECT * FROM data", con)
Warning: Starting in 0.20.0, pandas has split off Google BigQuery support into the separate package
pandas-gbq. You can pip install pandas-gbq to get it.
The method to_stata() will write a DataFrame into a .dta file. The format version of this file is always
115 (Stata 12).
In [550]: df.to_stata('stata.dta')
Stata data files have limited data type support; only strings with 244 or fewer characters, int8, int16,
int32, float32 and float64 can be stored in .dta files. Additionally, Stata reserves certain values to
represent missing data. Exporting a non-missing value that is outside of the permitted range in Stata for a
particular data type will retype the variable to the next larger size. For example, int8 values are restricted
to lie between -127 and 100 in Stata, and so variables with values above 100 will trigger a conversion to
int16. nan values in floating point data types are stored as the basic missing data type (. in Stata).
Note: It is not possible to export missing data values for integer data types.
The Stata writer gracefully handles other data types including int64, bool, uint8, uint16, uint32 by
casting to the smallest supported type that can represent the data. For example, data with a type of uint8
will be cast to int8 if all values are less than 100 (the upper bound for non-missing int8 data in Stata), or,
if values are outside of this range, the variable is cast to int16.
Warning: Conversion from int64 to float64 may result in a loss of precision if int64 values are larger
than 2**53.
Warning: StataWriter and to_stata() only support fixed width strings containing up to 244 char-
acters, a limitation imposed by the version 115 dta file format. Attempting to write Stata dta files with
strings longer than 244 characters raises a ValueError.
The top-level function read_stata will read a dta file and return either a DataFrame or a StataReader that
can be used to read the file incrementally.
In [551]: pd.read_stata('stata.dta')
Out[551]:
index A B
0 0 -2.802620 -1.031153
1 1 -0.471722 -1.004288
2 2 -0.809833 1.537958
3 3 -0.833349 2.502008
4 4 -1.016559 -0.583782
5 5 -0.369422 0.146956
6 6 -0.815559 -1.032447
7 7 0.676818 -0.410341
8 8 1.171674 2.227913
9 9 0.764637 1.540163
Specifying a chunksize yields a StataReader instance that can be used to read chunksize lines from the
file at a time. The StataReader object can be used as an iterator.
In [552]: reader = pd.read_stata('stata.dta', chunksize=3)
For more fine-grained control, use iterator=True and specify chunksize with each call to read().
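Sketches of both patterns (do_something is a hypothetical stand-in for any per-chunk processing):
reader = pd.read_stata('stata.dta', chunksize=3)
for chunk in reader:
    do_something(chunk)

reader = pd.read_stata('stata.dta', iterator=True)
chunk1 = reader.read(5)
chunk2 = reader.read(5)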
Note: read_stata() and StataReader support .dta formats 113-115 (Stata 10-12), 117 (Stata 13), and
118 (Stata 14).
Note: Setting preserve_dtypes=False will upcast to the standard pandas data types: int64 for all
integer types and float64 for floating point data. By default, the Stata data types are preserved when
importing.
Categorical data
Categorical data can be exported to Stata data files as value labeled data. The exported data consists of
the underlying category codes as integer data values and the categories as value labels. Stata does not have
an explicit equivalent to a Categorical and information about whether the variable is ordered is lost when
exporting.
Warning: Stata only supports string value labels, and so str is called on the categories when exporting
data. Exporting Categorical variables with non-string categories produces a warning, and can result in a
loss of information if the str representations of the categories are not unique.
Labeled data can similarly be imported from Stata data files as Categorical variables using the keyword
argument convert_categoricals (True by default). The keyword argument order_categoricals (True
by default) determines whether imported Categorical variables are ordered.
Note: When importing categorical data, the values of the variables in the Stata data file are not preserved
since Categorical variables always use integer data types between -1 and n-1 where n is the number
of categories. If the original values in the Stata data file are required, these can be imported by setting
convert_categoricals=False, which will import original data (but not the variable labels). The original
values can be matched to the imported categorical data since there is a simple mapping between the original
Stata data values and the category codes of imported Categorical variables: missing values are assigned code
-1, and the smallest original value is assigned 0, the second smallest is assigned 1 and so on until the largest
original value is assigned the code n-1.
Note: Stata supports partially labeled series. These series have value labels for some but not all data
values. Importing a partially labeled series will produce a Categorical with string categories for the values
that are labeled and numeric categories for values with no label.
The top-level function read_sas() can read (but not write) SAS xport (.XPT) and (since v0.18.0)
SAS7BDAT (.sas7bdat) format files.
SAS files only contain two value types: ASCII text and floating point values (usually 8 bytes but sometimes
truncated). For xport files, there is no automatic type conversion to integers, dates, or categoricals. For
SAS7BDAT files, the format codes may allow date variables to be automatically converted to dates. By
default the whole file is read and returned as a DataFrame.
Specify a chunksize or use iterator=True to obtain reader objects (XportReader or SAS7BDATReader) for
incrementally reading the file. The reader objects also have attributes that contain additional information
about the file and its variables.
Read a SAS7BDAT file:
df = pd.read_sas('sas_data.sas7bdat')
def do_something(chunk):
pass
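A sketch of incremental reading, assuming the same sas_data.sas7bdat file and the do_something placeholder above:
rdr = pd.read_sas('sas_data.sas7bdat', chunksize=100000)
for chunk in rdr:
    do_something(chunk)  # process each chunk (a DataFrame) as it is read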
The specification for the xport file format is available from the SAS web site.
No official documentation is available for the SAS7BDAT format.
pandas itself only supports IO with a limited set of file formats that map cleanly to its tabular data model.
For reading and writing other file formats into and from pandas, we recommend these packages from the
broader community.
netCDF
xarray provides data structures inspired by the pandas DataFrame for working with multi-dimensional
datasets, with a focus on the netCDF file format and easy conversion to and from pandas.
This is an informal comparison of various IO methods, using pandas 0.20.3. Timings are machine dependent
and small differences should be ignored.
In [1]: sz = 1000000
In [2]: df = pd.DataFrame({'A': np.random.randn(sz), 'B': [1] * sz})
In [3]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
A 1000000 non-null float64
B 1000000 non-null int64
dtypes: float64(1), int64(1)
memory usage: 15.3 MB
import os
import sqlite3

sz = 1000000
df = pd.DataFrame({'A': np.random.randn(sz), 'B': [1] * sz})
def test_sql_write(df):
if os.path.exists('test.sql'):
os.remove('test.sql')
sql_db = sqlite3.connect('test.sql')
df.to_sql(name='test_table', con=sql_db)
sql_db.close()
def test_sql_read():
sql_db = sqlite3.connect('test.sql')
pd.read_sql_query("select * from test_table", sql_db)
sql_db.close()
def test_hdf_fixed_write(df):
df.to_hdf('test_fixed.hdf', 'test', mode='w')
def test_hdf_fixed_read():
pd.read_hdf('test_fixed.hdf', 'test')
def test_hdf_fixed_write_compress(df):
df.to_hdf('test_fixed_compress.hdf', 'test', mode='w', complib='blosc')
def test_hdf_fixed_read_compress():
pd.read_hdf('test_fixed_compress.hdf', 'test')
def test_hdf_table_write(df):
df.to_hdf('test_table.hdf', 'test', mode='w', format='table')
def test_hdf_table_read():
pd.read_hdf('test_table.hdf', 'test')
def test_hdf_table_write_compress(df):
df.to_hdf('test_table_compress.hdf', 'test', mode='w',
complib='blosc', format='table')
def test_hdf_table_read_compress():
pd.read_hdf('test_table_compress.hdf', 'test')
def test_csv_write(df):
df.to_csv('test.csv', mode='w')
def test_csv_read():
pd.read_csv('test.csv', index_col=0)
def test_feather_write(df):
df.to_feather('test.feather')
def test_feather_read():
pd.read_feather('test.feather')
def test_pickle_write(df):
df.to_pickle('test.pkl')
def test_pickle_read():
pd.read_pickle('test.pkl')
def test_pickle_write_compress(df):
df.to_pickle('test.pkl.compress', compression='xz')
def test_pickle_read_compress():
pd.read_pickle('test.pkl.compress', compression='xz')
When writing, the top three functions in terms of speed are test_pickle_write, test_feather_write
and test_hdf_fixed_write_compress.
When reading, the top three are test_feather_read, test_pickle_read and test_hdf_fixed_read.
In [18]: %timeit test_sql_read()
1.35 s ± 14.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Note: The Python and NumPy indexing operators [] and attribute operator . provide quick and easy access
to pandas data structures across a wide range of use cases. This makes interactive work intuitive, as there’s
little new to learn if you already know how to deal with Python dictionaries and NumPy arrays. However,
since the type of the data to be accessed isn’t known in advance, directly using standard operators has some
optimization limits. For production code, we recommend that you take advantage of the optimized pandas
data access methods exposed in this chapter.
Warning: Whether a copy or a reference is returned for a setting operation may depend on the context.
This is sometimes called chained assignment and should be avoided. See Returning a View versus Copy.
Warning: Indexing on an integer-based Index with floats has been clarified in 0.18.0, for a summary
of the changes, see here.
See the MultiIndex / Advanced Indexing for MultiIndex and more advanced indexing documentation.
See the cookbook for some advanced strategies.
Object selection has had a number of user-requested additions in order to support more explicit location
based indexing. Pandas now supports three types of multi-axis indexing.
• .loc is primarily label based, but may also be used with a boolean array. .loc will raise KeyError
when the items are not found. Allowed inputs are:
– A single label, e.g. 5 or 'a' (Note that 5 is interpreted as a label of the index. This use is not an
integer position along the index.).
– A list or array of labels ['a', 'b', 'c'].
– A slice object with labels 'a':'f' (Note that contrary to usual python slices, both the start
and the stop are included, when present in the index! See Slicing with labels and Endpoints are
inclusive.)
– A boolean array
– A callable function with one argument (the calling Series or DataFrame) and that returns valid
output for indexing (one of the above).
New in version 0.18.1.
See more at Selection by Label.
• .iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with
a boolean array. .iloc will raise IndexError if a requested indexer is out-of-bounds, except slice
indexers which allow out-of-bounds indexing. (this conforms with Python/NumPy slice semantics).
Allowed inputs are:
– An integer e.g. 5.
– A list or array of integers [4, 3, 0].
– A slice object with ints 1:7.
– A boolean array.
– A callable function with one argument (the calling Series or DataFrame) and that returns valid
output for indexing (one of the above).
New in version 0.18.1.
See more at Selection by Position, Advanced Indexing and Advanced Hierarchical.
• .loc, .iloc, and also [] indexing can accept a callable as indexer. See more at Selection By Callable.
Getting values from an object with multi-axes selection uses the following notation (using .loc as an example,
but the following applies to .iloc as well). Any of the axes accessors may be the null slice :. Axes left out
of the specification are assumed to be :, e.g. p.loc['a'] is equivalent to p.loc['a', :, :].
4.2.2 Basics
As mentioned when introducing the data structures in the last section, the primary function of indexing
with [] (a.k.a. __getitem__ for those familiar with implementing class behavior in Python) is selecting out
lower-dimensional slices. The following table shows return type values when indexing pandas objects with
[]:
Here we construct a simple time series data set to use for illustrating the indexing functionality:
In [1]: dates = pd.date_range('1/1/2000', periods=8)
In [2]: df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
In [3]: df
Out[3]:
A B C D
2000-01-01 -1.157426 -0.096491 0.999344 -1.482012
2000-01-02 0.189291 0.926828 -0.029095 1.776600
2000-01-03 -1.334294 2.085399 -0.633036 0.208208
2000-01-04 -1.723333 -0.355486 -0.143959 0.177635
2000-01-05 1.071746 -0.516876 -0.382709 0.888600
2000-01-06 -0.156260 -0.720254 -0.837161 -0.426902
2000-01-07 -0.354174 0.510804 0.156535 0.294767
2000-01-08 -1.448608 -1.191084 -0.128338 -0.687717
Note: None of the indexing functionality is time series specific unless specifically stated.
Thus, as per above, we have the most basic indexing using []:
In [4]: s = df['A']
In [5]: s[dates[5]]
Out[5]: -0.15625969875302725
You can pass a list of columns to [] to select columns in that order. If a column is not contained in the
DataFrame, an exception will be raised. Multiple columns can also be set in this manner:
In [6]: df
Out[6]:
A B C D
2000-01-01 -1.157426 -0.096491 0.999344 -1.482012
2000-01-02 0.189291 0.926828 -0.029095 1.776600
2000-01-03 -1.334294 2.085399 -0.633036 0.208208
2000-01-04 -1.723333 -0.355486 -0.143959 0.177635
2000-01-05 1.071746 -0.516876 -0.382709 0.888600
2000-01-06 -0.156260 -0.720254 -0.837161 -0.426902
2000-01-07 -0.354174 0.510804 0.156535 0.294767
2000-01-08 -1.448608 -1.191084 -0.128338 -0.687717
In [7]: df[['B', 'A']] = df[['A', 'B']]
In [8]: df
Out[8]:
A B C D
2000-01-01 -0.096491 -1.157426 0.999344 -1.482012
2000-01-02 0.926828 0.189291 -0.029095 1.776600
2000-01-03 2.085399 -1.334294 -0.633036 0.208208
2000-01-04 -0.355486 -1.723333 -0.143959 0.177635
2000-01-05 -0.516876 1.071746 -0.382709 0.888600
2000-01-06 -0.720254 -0.156260 -0.837161 -0.426902
2000-01-07 0.510804 -0.354174 0.156535 0.294767
2000-01-08 -1.191084 -1.448608 -0.128338 -0.687717
You may find this useful for applying a transform (in-place) to a subset of the columns.
Warning: pandas aligns all AXES when setting Series and DataFrame from .loc and .iloc.
This will not modify df because the column alignment is before value assignment.
In [9]: df[['A', 'B']]
Out[9]:
A B
2000-01-01 -0.096491 -1.157426
2000-01-02 0.926828 0.189291
2000-01-03 2.085399 -1.334294
2000-01-04 -0.355486 -1.723333
2000-01-05 -0.516876 1.071746
2000-01-06 -0.720254 -0.156260
2000-01-07 0.510804 -0.354174
2000-01-08 -1.191084 -1.448608
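A minimal sketch of the idiom this warning is about, assuming the df above; assigning raw values avoids the column alignment, so the swap actually happens:
df.loc[:, ['B', 'A']] = df[['A', 'B']]             # no-op: columns are aligned before value assignment
df.loc[:, ['B', 'A']] = df[['A', 'B']].to_numpy()  # swaps the values, since a plain array carries no labels to align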
You may access an index on a Series or a column of a DataFrame directly as an attribute:
In [14]: sa = pd.Series([1, 2, 3], index=list('abc'))
In [15]: dfa = df.copy()
In [16]: sa.b
Out[16]: 2
In [17]: dfa.A
Out[17]:
2000-01-01 -1.157426
2000-01-02 0.189291
2000-01-03 -1.334294
2000-01-04 -1.723333
2000-01-05 1.071746
2000-01-06 -0.156260
2000-01-07 -0.354174
2000-01-08 -1.448608
Freq: D, Name: A, dtype: float64
In [18]: sa.a = 5
In [19]: sa
Out[19]:
a 5
b 2
c 3
dtype: int64
In [21]: dfa.A = list(range(len(dfa.index)))  # ok, if 'A' already exists
In [22]: dfa['A'] = list(range(len(dfa.index)))  # use this form to create a new column
In [23]: dfa
Out[23]:
A B C D
2000-01-01 0 -0.096491 0.999344 -1.482012
2000-01-02 1 0.926828 -0.029095 1.776600
2000-01-03 2 2.085399 -0.633036 0.208208
2000-01-04 3 -0.355486 -0.143959 0.177635
2000-01-05 4 -0.516876 -0.382709 0.888600
2000-01-06 5 -0.720254 -0.837161 -0.426902
2000-01-07 6 0.510804 0.156535 0.294767
2000-01-08 7 -1.191084 -0.128338 -0.687717
Warning:
• You can use this access only if the index element is a valid Python identifier, e.g. s.1 is not allowed.
See here for an explanation of valid identifiers.
• The attribute will not be available if it conflicts with an existing method name, e.g. s.min is not
allowed.
• Similarly, the attribute will not be available if it conflicts with any of the following list: index,
major_axis, minor_axis, items.
• In any of these cases, standard indexing will still work, e.g. s['1'], s['min'], and s['index']
will access the corresponding element or column.
If you are using the IPython environment, you may also use tab-completion to see these accessible attributes.
You can also assign a dict to a row of a DataFrame:
In [24]: x = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 4, 5]})
In [25]: x.iloc[1] = {'x': 9, 'y': 99}
In [26]: x
Out[26]:
x y
0 1 3
1 9 99
2 3 5
You can use attribute access to modify an existing element of a Series or column of a DataFrame, but be
careful; if you try to use attribute access to create a new column, it creates a new attribute rather than a
new column. In 0.21.0 and later, this will raise a UserWarning:
In [3]: df
Out[3]:
one
0 1.0
1 2.0
2 3.0
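A minimal sketch of this pitfall, using a throwaway frame df_new (a name assumed here for illustration):
df_new = pd.DataFrame({'one': [1., 2., 3.]})
df_new.two = [4, 5, 6]      # sets an attribute named "two", not a column; a UserWarning is emitted
df_new['two'] = [4, 5, 6]   # use item assignment to actually create the column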
The most robust and consistent way of slicing ranges along arbitrary axes is described in the Selection by
Position section detailing the .iloc method. For now, we explain the semantics of slicing using the []
operator.
With Series, the syntax works exactly as with an ndarray, returning a slice of the values and the corresponding
labels:
In [27]: s[:5]
Out[27]:
2000-01-01 -1.157426
2000-01-02 0.189291
2000-01-03 -1.334294
2000-01-04 -1.723333
2000-01-05 1.071746
Freq: D, Name: A, dtype: float64
In [28]: s[::2]
Out[28]:
2000-01-01 -1.157426
2000-01-03 -1.334294
2000-01-05 1.071746
2000-01-07 -0.354174
Freq: 2D, Name: A, dtype: float64
In [29]: s[::-1]
Out[29]:
2000-01-08 -1.448608
2000-01-07 -0.354174
2000-01-06 -0.156260
2000-01-05 1.071746
2000-01-04 -1.723333
2000-01-03 -1.334294
2000-01-02 0.189291
2000-01-01 -1.157426
Freq: -1D, Name: A, dtype: float64
Note that setting works as well:
In [30]: s2 = s.copy()
In [31]: s2[:5] = 0
In [32]: s2
Out[32]:
2000-01-01 0.000000
2000-01-02 0.000000
2000-01-03 0.000000
2000-01-04 0.000000
2000-01-05 0.000000
2000-01-06 -0.156260
2000-01-07 -0.354174
2000-01-08 -1.448608
Freq: D, Name: A, dtype: float64
With DataFrame, slicing inside of [] slices the rows. This is provided largely as a convenience since it is
such a common operation.
In [33]: df[:3]
Out[33]:
A B C D
2000-01-01 -1.157426 -0.096491 0.999344 -1.482012
2000-01-02 0.189291 0.926828 -0.029095 1.776600
2000-01-03 -1.334294 2.085399 -0.633036 0.208208
In [34]: df[::-1]
Out[34]:
A B C D
2000-01-08 -1.448608 -1.191084 -0.128338 -0.687717
2000-01-07 -0.354174 0.510804 0.156535 0.294767
2000-01-06 -0.156260 -0.720254 -0.837161 -0.426902
2000-01-05 1.071746 -0.516876 -0.382709 0.888600
2000-01-04 -1.723333 -0.355486 -0.143959 0.177635
2000-01-03 -1.334294 2.085399 -0.633036 0.208208
2000-01-02 0.189291 0.926828 -0.029095 1.776600
2000-01-01 -1.157426 -0.096491 0.999344 -1.482012
Warning: Whether a copy or a reference is returned for a setting operation may depend on the context.
This is sometimes called chained assignment and should be avoided. See Returning a View versus Copy.
Warning:
.loc is strict when you present slicers that are not compatible (or convertible) with the index
type. For example using integers in a DatetimeIndex. These will raise a TypeError.
In [36]: dfl
Out[36]:
A B C D
2013-01-01 -2.100928 1.025928 -0.007973 -1.336035
2013-01-02 0.915382 -0.130655 0.022627 -1.425459
2013-01-03 0.271939 0.169543 0.692513 -1.231139
2013-01-04 1.692870 0.783855 -0.721626 1.698994
2013-01-05 -0.349882 1.586451 1.454199 0.149458
In [4]: dfl.loc[2:3]
TypeError: cannot do slice indexing on <class 'pandas.tseries.index.DatetimeIndex'> with these indexers [2] of <type 'int'>
Strings used in slicing can be converted to the type of the index, leading to natural slicing.
In [37]: dfl.loc['20130102':'20130104']
Out[37]:
A B C D
2013-01-02 0.915382 -0.130655 0.022627 -1.425459
2013-01-03 0.271939 0.169543 0.692513 -1.231139
2013-01-04 1.692870 0.783855 -0.721626 1.698994
Warning: Starting in 0.21.0, pandas will show a FutureWarning when indexing with a list containing missing
labels. In the future this will raise a KeyError. See Using loc with missing keys in a list is deprecated.
pandas provides a suite of methods in order to have purely label based indexing. This is a strict inclusion
based protocol. Every label asked for must be in the index, or a KeyError will be raised. When slicing,
both the start bound AND the stop bound are included, if present in the index. Integers are valid labels,
but they refer to the label and not the position.
The .loc attribute is the primary access method. The following are valid inputs:
• A single label, e.g. 5 or 'a' (Note that 5 is interpreted as a label of the index. This use is not an
integer position along the index.).
• A list or array of labels ['a', 'b', 'c'].
• A slice object with labels 'a':'f' (Note that contrary to usual python slices, both the start and the
stop are included, when present in the index! See Slicing with labels.
• A boolean array.
• A callable, see Selection By Callable.
In [38]: s1 = pd.Series(np.random.randn(6), index=list('abcdef'))
In [39]: s1
Out[39]:
a 1.018393
b -0.565651
c -1.590951
d -0.651777
e 0.456407
f 1.134054
dtype: float64
In [40]: s1.loc['c':]
Out[40]:
c -1.590951
d -0.651777
e 0.456407
f 1.134054
dtype: float64
In [41]: s1.loc['b']
Out[41]: -0.5656513106035832
Note that setting works as well:
In [42]: s1.loc['c':] = 0
In [43]: s1
Out[43]:
a 1.018393
b -0.565651
c 0.000000
d 0.000000
e 0.000000
f 0.000000
dtype: float64
With a DataFrame:
In [44]: df1 = pd.DataFrame(np.random.randn(6, 4),
....: index=list('abcdef'),
....: columns=list('ABCD'))
....:
In [45]: df1
Out[45]:
A B C D
a 1.695493 1.632303 -1.726092 0.486227
b -0.625187 0.386616 -0.048112 -2.598355
c -0.871135 -0.209156 0.004590 0.449006
d 0.573428 0.697186 -2.442512 -1.423556
e -0.304997 0.672794 0.954090 1.323584
f 0.015720 -0.815293 0.164562 0.576599
In [48]: df1.loc['a']
Out[48]:
A 1.695493
B 1.632303
C -1.726092
D 0.486227
Name: a, dtype: float64
When using .loc with slices, if both the start and the stop labels are present in the index, then elements
located between the two (including them) are returned:
If at least one of the two is absent, but the index is sorted, and can be compared against start and stop
labels, then slicing will still work as expected, by selecting labels which rank between the two:
In [53]: s = pd.Series(list('abcde'), index=[0, 3, 2, 5, 4])
In [54]: s.sort_index()
Out[54]:
0 a
2 c
3 b
4 e
5 d
dtype: object
In [55]: s.sort_index().loc[1:6]
Out[55]:
2 c
3 b
4 e
5 d
dtype: object
However, if at least one of the two is absent and the index is not sorted, an error will be raised (since doing
otherwise would be computationally expensive, as well as potentially ambiguous for mixed type indexes).
For instance, in the above example, s.loc[1:6] would raise KeyError.
For the rationale behind this behavior, see Endpoints are inclusive.
Warning: Whether a copy or a reference is returned for a setting operation may depend on the context.
This is sometimes called chained assignment and should be avoided. See Returning a View versus Copy.
Pandas provides a suite of methods in order to get purely integer based indexing. The semantics follow
closely Python and NumPy slicing. These are 0-based indexing. When slicing, the start bound is included,
while the upper bound is excluded. Trying to use a non-integer, even a valid label will raise an IndexError.
The .iloc attribute is the primary access method. The following are valid inputs:
• An integer e.g. 5.
• A list or array of integers [4, 3, 0].
• A slice object with ints 1:7.
• A boolean array.
• A callable, see Selection By Callable.
In [57]: s1
Out[57]:
0 -0.813837
2 1.020094
4 -0.538755
6 -0.273898
8 1.374350
dtype: float64
In [58]: s1.iloc[:3]
Out[58]:
0 -0.813837
2 1.020094
4 -0.538755
dtype: float64
In [59]: s1.iloc[3]
Out[59]: -0.2738980637291465
Note that setting works as well:
In [60]: s1.iloc[:3] = 0
In [61]: s1
Out[61]:
0 0.000000
2 0.000000
4 0.000000
6 -0.273898
8 1.374350
dtype: float64
With a DataFrame:
In [63]: df1
Out[63]:
0 2 4 6
0 -0.769208 -0.094955 -0.339642 1.131238
2 -1.165074 0.191823 -0.424832 0.641310
4 -1.389117 0.367491 2.164790 1.126079
6 1.550817 0.826973 -0.677486 2.087563
8 0.117134 -0.855723 0.082120 1.276149
10 0.270969 1.210188 -0.988631 -1.253327
0 2 4 6
0 -0.769208 -0.094955 -0.339642 1.131238
2 -1.165074 0.191823 -0.424832 0.641310
4 -1.389117 0.367491 2.164790 1.126079
In [67]: df1.iloc[1:3, :]
Out[67]:
0 2 4 6
2 -1.165074 0.191823 -0.424832 0.641310
4 -1.389117 0.367491 2.164790 1.126079
In [70]: df1.iloc[1]
Out[70]:
0 -1.165074
2 0.191823
4 -0.424832
6 0.641310
Name: 2, dtype: float64
In [71]: x = list('abcdef')
In [72]: x
Out[72]: ['a', 'b', 'c', 'd', 'e', 'f']
In [73]: x[4:10]
Out[73]: ['e', 'f']
In [74]: x[8:10]
Out[74]: []
In [75]: s = pd.Series(x)
In [76]: s
Out[76]:
0 a
1 b
2 c
3 d
4 e
5 f
dtype: object
In [77]: s.iloc[4:10]
Out[77]:
4 e
5 f
dtype: object
In [78]: s.iloc[8:10]
Out[78]: Series([], dtype: object)
Note that using slices that go out of bounds can result in an empty axis (e.g. an empty DataFrame being
returned).
In [79]: dfl = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))
In [80]: dfl
Out[80]:
A B
0 -1.141824 0.487982
1 1.110710 1.775785
2 -0.158929 1.256688
3 0.480722 0.545781
4 -1.214729 0.259405
0 0.487982
1 1.775785
2 1.256688
3 0.545781
4 0.259405
In [83]: dfl.iloc[4:6]
Out[83]:
A B
4 -1.214729 0.259405
A single indexer that is out of bounds will raise an IndexError. A list of indexers where any element is out
of bounds will raise an IndexError.
>>> dfl.iloc[:, 4]
IndexError: single positional indexer is out-of-bounds
In [85]: df1
Out[85]:
A B C D
a -1.898204 0.933280 0.410757 -1.209116
b 1.072207 -2.076376 -0.032087 -1.179905
c 0.819041 0.169362 0.395066 1.793339
d 0.620358 0.687095 0.924752 -0.953211
e 0.272744 -0.264613 -0.299304 0.828769
f 1.384847 1.408420 -0.599304 1.455457
a -1.898204 0.933280
b 1.072207 -2.076376
c 0.819041 0.169362
d 0.620358 0.687095
e 0.272744 -0.264613
f 1.384847 1.408420
Using these methods / indexers, you can chain data selection operations without using a temporary variable.
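A sketch of such a chain; bb, the file path, and the column names (year, team, r) are assumptions for illustration, not part of the examples above:
bb = pd.read_csv('data/baseball.csv', index_col='id')
(bb.groupby(['year', 'team']).sum()
   .loc[lambda df: df['r'] > 100])   # callable indexer applied to the intermediate result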
Warning: Starting in 0.20.0, the .ix indexer is deprecated, in favor of the more strict .iloc and .loc
indexers.
.ix offers a lot of magic on the inference of what the user wants to do. To wit, .ix can decide to index
positionally OR via labels depending on the data type of the index. This has caused quite a bit of user
confusion over the years.
The recommended methods of indexing are:
• .loc if you want to label index.
• .iloc if you want to positionally index.
In [94]: dfd
Out[94]:
A B
a 1 4
b 2 5
c 3 6
Previous behavior, where you wish to get the 0th and the 2nd elements from the index in the ‘A’ column.
Using .loc. Here we will select the appropriate indexes from the index, then use label indexing.
This can also be expressed using .iloc, by explicitly getting locations on the indexers, and using positional
indexing to select things.
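One way to write the two approaches just described, assuming the dfd frame shown above:
dfd.loc[dfd.index[[0, 2]], 'A']              # select the labels first, then index by label
dfd.iloc[[0, 2], dfd.columns.get_loc('A')]   # purely positional equivalent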
Warning: Starting in 0.21.0, using .loc or [] with a list with one or more missing labels, is deprecated,
in favor of .reindex.
In prior versions, using .loc[list-of-labels] would work as long as at least 1 of the keys was found
(otherwise it would raise a KeyError). This behavior is deprecated and will show a warning message pointing
to this section. The recommended alternative is to use .reindex().
For example.
In [99]: s
Out[99]:
0 1
1 2
2 3
dtype: int64
Previous behavior, selecting with a list containing a missing label (s.loc[[1, 2, 3]]):
Out[4]:
1    2.0
2    3.0
3    NaN
dtype: float64
Current behavior: the same call returns the result above, but also shows a FutureWarning pointing to this section.
Reindexing
The idiomatic way to achieve selecting potentially not-found elements is via .reindex(). See also the section
on reindexing.
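For example, a minimal sketch assuming the s shown above:
s.reindex([1, 2, 3])   # labels not present in the index appear with NaN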
Alternatively, if you want to select only valid keys, the following is idiomatic and efficient; it is guaranteed
to preserve the dtype of the selection.
In [103]: s.loc[s.index.intersection(labels)]
Out[103]:
1 2
2 3
dtype: int64
Having a duplicated index will raise for a .reindex():
In [104]: s = pd.Series(np.arange(4), index=['a', 'a', 'b', 'c'])
In [105]: labels = ['c', 'd']
In [17]: s.reindex(labels)
ValueError: cannot reindex from a duplicate axis
Generally, you can intersect the desired labels with the current axis, and then reindex.
In [106]: s.loc[s.index.intersection(labels)].reindex(labels)
Out[106]:
c 3.0
d NaN
dtype: float64
However, this would still raise if the resulting index is duplicated:
In [41]: labels = ['a', 'd']
In [42]: s.loc[s.index.intersection(labels)].reindex(labels)
ValueError: cannot reindex from a duplicate axis
A random selection of rows or columns from a Series or DataFrame can be made with the sample() method. The method
will sample rows by default, and accepts a specific number of rows/columns to return, or a fraction of rows.
In [107]: s = pd.Series([0, 1, 2, 3, 4, 5])
# Without replacement (default):
In [112]: s.sample(n=6, replace=False)
Out[112]:
3 3
2 2
4 4
0 0
5 5
1 1
dtype: int64
# With replacement:
In [113]: s.sample(n=6, replace=True)
Out[113]:
0 0
2 2
0 0
3 3
1 1
2 2
dtype: int64
By default, each row has an equal probability of being selected, but if you want rows to have different
probabilities, you can pass the sample function sampling weights as weights. These weights can be a list,
a NumPy array, or a Series, but they must be of the same length as the object you are sampling. Missing
values will be treated as a weight of zero, and inf values are not allowed. If weights do not sum to 1, they
will be re-normalized by dividing all weights by the sum of the weights. For example:
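A minimal sketch, assuming the six-element s above and a hypothetical weight list:
example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]   # one weight per row; zeros mean "never selected"
s.sample(n=3, weights=example_weights)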
When applied to a DataFrame, you can use a column of the DataFrame as sampling weights (provided you
are sampling rows and not columns) by simply passing the name of the column as a string.
sample also allows users to sample columns instead of rows using the axis argument.
Finally, one can also set a seed for sample’s random number generator using the random_state argument,
which will accept either an integer (as a seed) or a NumPy RandomState object.
In [123]: df4 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})
# With a given seed, the sample will always draw the same rows.
In [124]: df4.sample(n=2, random_state=2)
Out[124]:
col1 col2
2 3 4
1 2 3
The .loc/[] operations can perform enlargement when setting a non-existent key for that axis.
In the Series case this is effectively an appending operation.
In [126]: se = pd.Series([1, 2, 3])
In [127]: se
Out[127]:
0 1
1 2
2 3
dtype: int64
In [128]: se[5] = 5.
In [129]: se
In [130]: dfi = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['A', 'B'])
In [131]: dfi
Out[131]:
A B
0 0 1
1 2 3
2 4 5
In [132]: dfi.loc[:, 'C'] = dfi.loc[:, 'A']
In [133]: dfi
Out[133]:
A B C
0 0 1 0
1 2 3 2
2 4 5 4
In [134]: dfi.loc[3] = 5
In [135]: dfi
Out[135]:
A B C
0 0 1 0
1 2 3 2
2 4 5 4
3 5 5 5
Since indexing with [] must handle a lot of cases (single-label access, slicing, boolean indexing, etc.), it has
a bit of overhead in order to figure out what you’re asking for. If you only want to access a scalar value, the
fastest way is to use the at and iat methods, which are implemented on all of the data structures.
Similarly to loc, at provides label-based scalar lookups, while iat provides integer-based lookups analogously
to iloc.
In [136]: s.iat[5]
Out[136]: 5
In [138]: df.iat[3, 0]
Out[138]: -1.7233332966009836
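The label-based counterparts, assuming the s and df objects used just above:
s.at[5]               # label-based scalar lookup on the Series
df.at[dates[5], 'A']  # label-based scalar lookup on the DataFrame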
You can also set using these same indexers.
In [139]: df.at[dates[5], 'E'] = 7
In [140]: df.iat[3, 0] = 7
at may enlarge the object in-place if the indexer is missing:
In [141]: df.at[dates[-1] + pd.Timedelta('1 day'), 0] = 7
In [142]: df
Out[142]:
A B C D E 0
2000-01-01 -1.157426 -0.096491 0.999344 -1.482012 NaN NaN
2000-01-02 0.189291 0.926828 -0.029095 1.776600 NaN NaN
2000-01-03 -1.334294 2.085399 -0.633036 0.208208 NaN NaN
2000-01-04 7.000000 -0.355486 -0.143959 0.177635 NaN NaN
2000-01-05 1.071746 -0.516876 -0.382709 0.888600 NaN NaN
2000-01-06 -0.156260 -0.720254 -0.837161 -0.426902 7.0 NaN
2000-01-07 -0.354174 0.510804 0.156535 0.294767 NaN NaN
2000-01-08 -1.448608 -1.191084 -0.128338 -0.687717 NaN NaN
2000-01-09 NaN NaN NaN NaN NaN 7.0
Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, &
for and, and ~ for not. These must be grouped by using parentheses, since by default Python will evaluate
an expression such as df.A > 2 & df.B < 3 as df.A > (2 & df.B) < 3, while the desired evaluation order
is (df.A > 2) & (df.B < 3).
Using a boolean vector to index a Series works exactly as in a NumPy ndarray:
In [143]: s = pd.Series(range(-3, 4))
In [144]: s
Out[144]:
0 -3
1 -2
2 -1
3 0
4 1
5 2
6 3
dtype: int64
In [145]: s[s > 0]
Out[145]:
4    1
5    2
6    3
dtype: int64
List comprehensions and the map method of Series can also be used to produce more complex criteria:
In [149]: df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
.....: 'b': ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
.....: 'c': np.random.randn(7)})
.....:
In [150]: criterion = df2['a'].map(lambda x: x.startswith('t'))
In [151]: df2[criterion]
Out[151]:
a b c
2 two y 0.344065
3 three x 1.275247
4 two y 1.303763
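An equivalent (though slower) pure-Python criterion could be built with a list comprehension, assuming the df2 above:
df2[[x.startswith('t') for x in df2['a']]]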
# Multiple criteria
In [153]: df2[criterion & (df2['b'] == 'x')]
Out[153]:
a b c
3 three x 1.275247
With the choice methods Selection by Label, Selection by Position, and Advanced Indexing you may select
along more than one axis using boolean vectors combined with other indexing expressions.
Consider the isin() method of Series, which returns a boolean vector that is true wherever the Series
elements exist in the passed list. This allows you to select rows where one or more columns have values you
want:
In [155]: s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')
In [156]: s
Out[156]:
4 0
3 1
2 2
1 3
0 4
dtype: int64
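For example, assuming the s just created:
s.isin([2, 4, 6])      # boolean mask
s[s.isin([2, 4, 6])]   # rows whose values appear in the list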
In [162]: s_mi
Out[162]:
0 a 0
b 1
c 2
1 a 3
b 4
c 5
dtype: int64
In [167]: df.isin(values)
Out[167]:
vals ids ids2
0 True True True
1 False True False
2 True False False
Oftentimes you’ll want to match certain values with certain columns. Just make values a dict where the
key is the column, and the value is a list of items you want to check for.
In [169]: df.isin(values)
Out[169]:
vals ids ids2
0 True True False
1 False True False
2 True False False
3 False False False
Combine DataFrame's isin with the any() and all() methods to quickly select subsets of your data that
meet given criteria. To select a row where each column meets its own criterion:
In [170]: values = {'ids': ['a', 'b'], 'ids2': ['a', 'c'], 'vals': [1, 3]}
In [171]: row_mask = df.isin(values).all(1)
In [172]: df[row_mask]
Out[172]:
vals ids ids2
0 1 a a
Selecting values from a Series with a boolean vector generally returns a subset of the data. To guarantee
that selection output has the same shape as the original data, you can use the where method in Series and
DataFrame.
To return only the selected rows:
Selecting values from a DataFrame with a boolean criterion now also preserves input data shape. where is
used under the hood as the implementation. The code below is equivalent to df.where(df < 0).
In addition, where takes an optional other argument for replacement of values where the condition is False,
in the returned copy.
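For example, assuming the s and df objects used above:
s.where(s > 0)          # same shape as s; non-matching entries become NaN
df.where(df < 0, -df)   # entries where the condition is False are replaced by -df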
You may wish to set values based on some boolean criteria. This can be done intuitively like so:
In [177]: s2 = s.copy()
In [178]: s2[s2 < 0] = 0
In [179]: s2
Out[179]:
4 0
3 1
2 2
1 3
0 4
dtype: int64
By default, where returns a modified copy of the data. There is an optional parameter inplace so that the
original data can be modified without creating a copy:
In [183]: df_orig = df.copy()
In [184]: df_orig.where(df > 0, -df, inplace=True)
In [185]: df_orig
Out[185]:
A B C D
2000-01-01 0.986205 1.719758 2.230079 0.439106
2000-01-02 2.397242 0.124508 1.493995 0.237058
2000-01-03 1.482014 0.429889 0.782186 0.389666
2000-01-04 0.480306 1.051903 0.987736 0.182060
2000-01-05 0.379467 0.273248 0.138556 0.881904
2000-01-06 0.514897 0.117796 0.108906 1.142649
2000-01-07 1.349120 0.316880 0.128845 1.352644
2000-01-08 0.161458 0.739064 0.165377 1.495080
Note: The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m, df2)
is equivalent to np.where(m, df1, df2).
Alignment
Furthermore, where aligns the input boolean condition (ndarray or DataFrame), such that partial selection
with setting is possible. This is analogous to partial setting via .loc (but on the contents rather than the
axis labels).
In [187]: df2 = df.copy()
In [188]: df2[df2[1:4] > 0] = 3
In [189]: df2
Out[189]:
A B C D
2000-01-01 -0.986205 -1.719758 -2.230079 -0.439106
2000-01-02 -2.397242 3.000000 3.000000 -0.237058
2000-01-03 -1.482014 3.000000 -0.782186 3.000000
2000-01-04 -0.480306 -1.051903 -0.987736 -0.182060
2000-01-05 -0.379467 0.273248 -0.138556 0.881904
2000-01-06 -0.514897 -0.117796 -0.108906 -1.142649
2000-01-07 -1.349120 0.316880 -0.128845 -1.352644
2000-01-08 0.161458 0.739064 0.165377 -1.495080
Where can also accept axis and level parameters to align the input when performing the where.
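A sketch of the axis argument, assuming the df2 above; the replacement Series is aligned along the row axis:
df2.where(df2 > 0, df2['A'], axis='index')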
Mask
mask() is the inverse boolean operation of where.
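For example, assuming the s and df objects used above:
s.mask(s >= 0)    # hide values where the condition is True
df.mask(df >= 0)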
DataFrame objects have a query() method that allows selection using an expression.
You can get the value of the frame where column b has values between the values of columns a and c. For
example:
In [198]: n = 10
In [199]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))
In [200]: df
Out[200]:
a b c
0 0.564354 0.446221 0.031135
1 0.902823 0.612305 0.538181
# pure python
In [201]: df[(df.a < df.b) & (df.b < df.c)]
Out[201]:
a b c
3 0.194429 0.253263 0.923684
# query
In [202]: df.query('(a < b) & (b < c)')
Out[202]:
a b c
3 0.194429 0.253263 0.923684
Do the same thing but fall back on a named index if there is no column with the name a.
In [203]: df = pd.DataFrame(np.random.randint(n / 2, size=(n, 2)), columns=list('bc'))
In [204]: df.index.name = 'a'
In [205]: df
Out[205]:
b c
a
0 4 0
1 3 1
2 4 3
3 0 2
4 1 4
5 4 4
6 4 2
7 2 0
8 0 0
9 1 3
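A minimal usage sketch: with no column named a, the index name a resolves in the expression.
df.query('a < b and b < c')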
If instead you don't want to, or cannot, name your index, you can use the name index in your query expression:
In [207]: df = pd.DataFrame(np.random.randint(n, size=(n, 2)), columns=list('bc'))
In [208]: df
Out[208]:
b c
0 4 5
1 4 1
2 0 8
3 4 7
4 0 0
5 8 5
6 1 2
7 4 3
8 1 0
9 6 3
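For example, referring to the unnamed index by the special name index:
df.query('index < b < c')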
Note: If the name of your index overlaps with a column name, the column name is given precedence. For
example,
In [210]: df = pd.DataFrame({'a': np.random.randint(5, size=5)})
In [211]: df.index.name = 'a'
In [212]: df.query('a > 2') # uses the column 'a', not the index
Out[212]:
a
a
0 4
1 4
2 3
You can still use the index in a query expression by using the special identifier 'index':
In [213]: df.query('index > 2')
If for some reason you have a column named index, then you can refer to the index as ilevel_0 as well,
but at this point you should consider renaming your columns to something less ambiguous.
You can also use the levels of a DataFrame with a MultiIndex as if they were columns in the frame:
In [214]: n = 10
In [215]: colors = np.random.choice(['red', 'green'], size=n)
In [216]: foods = np.random.choice(['eggs', 'ham'], size=n)
In [217]: colors
Out[217]:
array(['red', 'red', 'green', 'green', 'red', 'red', 'green', 'green',
'green', 'red'], dtype='<U5')
In [218]: foods
Out[218]:
array(['eggs', 'ham', 'ham', 'ham', 'ham', 'ham', 'eggs', 'eggs', 'ham',
'eggs'], dtype='<U4')
In [219]: index = pd.MultiIndex.from_arrays([colors, foods], names=['color', 'food'])
In [220]: df = pd.DataFrame(np.random.randn(n, 2), index=index)
In [221]: df
Out[221]:
0 1
color food
red eggs 0.397240 1.722883
ham 0.634589 1.761948
green ham 1.191222 -0.748678
ham -0.013401 -0.982325
red ham 0.272726 1.042615
ham 0.267082 0.191461
green eggs -0.435659 -0.035917
eggs 0.194931 0.970348
ham 2.187055 0.383666
red eggs -0.812383 -0.497327
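For example, a level name can be used directly in the expression:
df.query('color == "red"')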
If the levels of the MultiIndex are unnamed, you can refer to them using the special names ilevel_0, ilevel_1, and so on (short for "index level 0", "index level 1", …):
In [223]: df.index.names = [None, None]
In [224]: df
Out[224]:
0 1
red eggs 0.397240 1.722883
ham 0.634589 1.761948
green ham 1.191222 -0.748678
ham -0.013401 -0.982325
red ham 0.272726 1.042615
ham 0.267082 0.191461
green eggs -0.435659 -0.035917
eggs 0.194931 0.970348
ham 2.187055 0.383666
A use case for query() is when you have a collection of DataFrame objects that have a subset of column
names (or index levels/names) in common. You can pass the same query to both frames without having to
specify which frame you're interested in querying.
In [226]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))
In [227]: df
Out[227]:
a b c
0 0.483974 0.645639 0.413412
1 0.611039 0.585546 0.848970
2 0.523271 0.811649 0.517849
3 0.947506 0.143525 0.055154
4 0.934891 0.214973 0.271028
5 0.832143 0.777114 0.572133
6 0.304056 0.712288 0.960006
7 0.965451 0.803696 0.866318
8 0.965355 0.383391 0.647743
9 0.639263 0.218103 0.886788
In [228]: df2 = pd.DataFrame(np.random.rand(n + 2, 3), columns=df.columns)
In [229]: df2
Out[229]:
a b c
0 0.563472 0.298326 0.361543
1 0.021200 0.761846 0.279478
2 0.274321 0.127032 0.025433
3 0.789059 0.154680 0.999703
4 0.195936 0.042450 0.475367
5 0.970329 0.053024 0.293762
6 0.877607 0.352530 0.300746
7 0.259895 0.666779 0.920354
8 0.861035 0.176572 0.638339
9 0.083984 0.834057 0.673247
10 0.190267 0.647251 0.586836
11 0.395322 0.575815 0.184662
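A sketch of sharing one expression across both frames; the expression string is a hypothetical example:
expr = '0.0 <= a <= c <= 0.5'
[frame.query(expr) for frame in (df, df2)]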
In [232]: df = pd.DataFrame(np.random.randint(n, size=(n, 3)), columns=list('abc'))
In [233]: df
Out[233]:
a b c
0 4 1 0
1 9 3 7
2 6 7 9
3 1 9 4
4 6 9 9
5 1 1 2
6 6 6 6
7 9 6 3
8 0 7 9
9 3 8 3
query() also supports special use of Python’s in and not in comparison operators, providing a succinct
syntax for calling the isin method of a Series or DataFrame.
# get all rows where columns "a" and "b" have overlapping values
In [239]: df = pd.DataFrame({'a': list('aabbccddeeff'), 'b': list('aaaabbbbcccc'),
.....: 'c': np.random.randint(5, size=12),
.....: 'd': np.random.randint(9, size=12)})
.....:
In [240]: df
Out[240]:
a b c d
0 a a 1 5
1 a a 1 1
2 b a 4 1
3 b a 4 3
4 c b 4 5
5 c b 0 0
6 d b 0 4
7 d b 3 3
8 e c 4 5
9 e c 2 0
10 f c 2 2
11 f c 4 7
In [241]: df.query('a in b')
Out[241]:
a b c d
0 a a 1 5
1 a a 1 1
2 b a 4 1
3 b a 4 3
4 c b 4 5
5 c b 0 0
# pure Python
In [244]: df[~df.a.isin(df.b)]
Out[244]:
a b c d
6 d b 0 4
7 d b 3 3
8 e c 4 5
9 e c 2 0
10 f c 2 2
11 f c 4 7
You can combine this with other expressions for very succinct queries:
# rows where cols a and b have overlapping values
# and col c's values are less than col d's
In [245]: df.query('a in b and c < d')
Out[245]:
a b c d
0 a a 1 5
4 c b 4 5
# pure Python
In [246]: df[df.b.isin(df.a) & (df.c < df.d)]
Out[246]:
a b c d
0 a a 1 5
4 c b 4 5
6 d b 0 4
8 e c 4 5
11 f c 4 7
Note: in and not in are evaluated in Python, since numexpr has no equivalent of this operation.
However, only the in/not in expression itself is evaluated in vanilla Python. For example, in the
expression
df.query('a in b + c + d')
(b + c + d) is evaluated by numexpr and then the in operation is evaluated in plain Python. In general,
any operations that can be evaluated using numexpr will be.
Comparing a list of values to a column using ==/!= works similarly to in/not in.
In [247]: df.query('b == ["a", "b", "c"]')
Out[247]:
a b c d
0 a a 1 5
1 a a 1 1
2 b a 4 1
3 b a 4 3
4 c b 4 5
5 c b 0 0
6 d b 0 4
7 d b 3 3
8 e c 4 5
9 e c 2 0
10 f c 2 2
11 f c 4 7
# pure Python
In [248]: df[df.b.isin(["a", "b", "c"])]
Out[248]:
a b c d
0 a a 1 5
1 a a 1 1
2 b a 4 1
3 b a 4 3
4 c b 4 5
5 c b 0 0
6 d b 0 4
7 d b 3 3
8 e c 4 5
9 e c 2 0
10 f c 2 2
11 f c 4 7
# rows where c is not equal to 1 or 2
In [250]: df.query('c != [1, 2]')
Out[250]:
a b c d
2 b a 4 1
3 b a 4 3
4 c b 4 5
5 c b 0 0
6 d b 0 4
7 d b 3 3
8 e c 4 5
11 f c 4 7
# using in/not in
In [251]: df.query('[1, 2] in c')
Out[251]:
a b c d
0 a a 1 5
1 a a 1 1
9 e c 2 0
10 f c 2 2
# pure Python
In [253]: df[df.c.isin([1, 2])]
Out[253]:
a b c d
0 a a 1 5
1 a a 1 1
9 e c 2 0
10 f c 2 2
Boolean operators
You can negate boolean expressions with the word not or the ~ operator.
In [254]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))
In [255]: df['bools'] = np.random.rand(len(df)) > 0.5
In [256]: df.query('~bools')
Out[256]:
a b c bools
0 0.499604 0.981560 0.137759 False
4 0.104751 0.782568 0.977198 False
7 0.973197 0.245000 0.977406 False
9 0.805940 0.451425 0.070470 False
In [257]: df.query('not bools')
Out[257]:
a b c bools
0 0.499604 0.981560 0.137759 False
4 0.104751 0.782568 0.977198 False
7 0.973197 0.245000 0.977406 False
9 0.805940 0.451425 0.070470 False
Of course, expressions can be arbitrarily complex too:
# short query syntax
In [259]: shorter = df.query('a < b < c and (not bools) or bools > 2')
# equivalent in pure Python
In [260]: longer = df[(df.a < df.b) & (df.b < df.c) & (~df.bools) | (df.bools > 2)]
In [261]: shorter
Out[261]:
a b c bools
4 0.104751 0.782568 0.977198 False
In [262]: longer
Out[262]:
a b c bools
4 0.104751 0.782568 0.977198 False
Performance of query()
DataFrame.query() using numexpr is slightly faster than Python for large frames.
Note: You will only see the performance benefits of using the numexpr engine with DataFrame.query() if
your frame has more than approximately 200,000 rows.
The accompanying performance plot (not reproduced here) was created using a DataFrame with 3 columns, each
containing floating point values generated using numpy.random.randn().
If you want to identify and remove duplicate rows in a DataFrame, there are two methods that will help:
duplicated and drop_duplicates. Each takes as an argument the columns to use to identify duplicated
rows.
• duplicated returns a boolean vector whose length is the number of rows, and which indicates whether
a row is duplicated.
• drop_duplicates removes duplicate rows.
By default, the first observed row of a duplicate set is considered unique, but each method has a keep
parameter to specify targets to be kept.
• keep='first' (default): mark / drop duplicates except for the first occurrence.
• keep='last': mark / drop duplicates except for the last occurrence.
• keep=False: mark / drop all duplicates.
In [264]: df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
.....: 'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x'],
.....: 'c': np.random.randn(7)})
.....:
In [265]: df2
Out[265]:
a b c
0 one x -0.086643
1 one y -0.862428
2 two x 1.155986
3 two y -0.583644
4 two x -1.416461
5 three x 0.799196
6 four x -2.063856
In [266]: df2.duplicated('a')
Out[266]:
0 False
1 True
2 False
3 True
4 True
5 False
6 False
dtype: bool
In [268]: df2.duplicated('a', keep=False)
Out[268]:
0 True
1 True
2 True
3 True
4 True
5 False
6 False
dtype: bool
In [269]: df2.drop_duplicates('a')
Out[269]:
a b c
0 one x -0.086643
2 two x 1.155986
5 three x 0.799196
6 four x -2.063856
To drop duplicates by index value, use Index.duplicated then perform slicing. The same set of options is available for the keep parameter.
In [274]: df3 = pd.DataFrame({'a': np.arange(6), 'b': np.random.randn(6)},
.....:                       index=['a', 'a', 'b', 'c', 'b', 'a'])
.....:
In [275]: df3
Out[275]:
a b
a 0 -0.457673
a 1 0.315795
b 2 -0.013959
c 3 -0.376069
b 4 -0.715356
a 5 1.802760
In [276]: df3.index.duplicated()
Out[276]: array([False, True, False, False, True, True])
In [277]: df3[~df3.index.duplicated()]
Out[277]:
a b
a 0 -0.457673
b 2 -0.013959
c 3 -0.376069
In [278]: df3[~df3.index.duplicated(keep='last')]
Out[278]:
a b
c 3 -0.376069
b 4 -0.715356
a 5 1.802760
In [279]: df3[~df3.index.duplicated(keep=False)]
Out[279]:
a b
c 3 -0.376069
Each of Series or DataFrame have a get method which can return a default value.
In [280]: s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
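For example, assuming the s just created:
s.get('a')               # equivalent to s['a']
s.get('x', default=-1)   # returns -1 because 'x' is not in the index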
Sometimes you want to extract a set of values given a sequence of row labels and column labels, and the
lookup method allows for this and returns a NumPy array. For instance:
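A minimal sketch, using a hypothetical frame dflookup constructed just for this illustration:
dflookup = pd.DataFrame(np.random.rand(20, 4), columns=['A', 'B', 'C', 'D'])
dflookup.lookup(list(range(0, 10, 2)), ['B', 'C', 'A', 'B', 'D'])  # one value per (row label, column label) pair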
The pandas Index class and its subclasses can be viewed as implementing an ordered multiset. Duplicates
are allowed. However, if you try to convert an Index object with duplicate entries into a set, an exception
will be raised.
Index also provides the infrastructure necessary for lookups, data alignment, and reindexing. The easiest
way to create an Index directly is to pass a list or other sequence to Index:
In [285]: index = pd.Index(['e', 'd', 'a', 'b'])
In [286]: index
Out[286]: Index(['e', 'd', 'a', 'b'], dtype='object')
In [288]: index = pd.Index(['e', 'd', 'a', 'b'], name='something')
In [289]: index.name
Out[289]: 'something'
The name, if set, will be shown in the console display:
In [290]: index = pd.Index(list(range(5)), name='rows')
In [291]: columns = pd.Index(['A', 'B', 'C'], name='cols')
In [292]: df = pd.DataFrame(np.random.randn(5, 3), index=index, columns=columns)
In [293]: df
Out[293]:
cols A B C
rows
0 -1.539526 0.083551 1.819217
1 -0.556258 -1.013751 0.804958
2 1.138849 -0.913212 2.047493
3 -0.894783 1.059103 -0.605857
4 -1.096832 0.217643 2.122047
In [294]: df['A']
Out[294]:
rows
0 -1.539526
1 -0.556258
2 1.138849
3 -0.894783
4 -1.096832
Name: A, dtype: float64
Setting metadata
Indexes are “mostly immutable”, but it is possible to set and change their metadata, like the index name (or,
for MultiIndex, levels and codes).
You can use the rename, set_names, set_levels, and set_codes methods to set these attributes directly. They
default to returning a copy; however, you can specify inplace=True to have the data change in place.
See Advanced Indexing for usage of MultiIndexes.
In [295]: ind = pd.Index([1, 2, 3])
In [296]: ind.rename("apple")
Out[296]: Int64Index([1, 2, 3], dtype='int64', name='apple')
In [297]: ind
Out[297]: Int64Index([1, 2, 3], dtype='int64')
In [298]: ind.set_names(["apple"], inplace=True)
In [299]: ind.name = "bob"
In [300]: ind
Out[300]: Int64Index([1, 2, 3], dtype='int64', name='bob')
set_names, set_levels, and set_codes also take an optional level argument.
In [301]: index = pd.MultiIndex.from_product([range(3), ['one', 'two']], names=['first', 'second'])
In [302]: index
Out[302]:
MultiIndex([(0, 'one'),
(0, 'two'),
(1, 'one'),
(1, 'two'),
(2, 'one'),
(2, 'two')],
names=['first', 'second'])
In [303]: index.levels[1]
Out[303]: Index(['one', 'two'], dtype='object', name='second')
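For example, replacing only the second level (a sketch using the index above):
index.set_levels(['a', 'b'], level=1)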
The two main operations are union (|) and intersection (&). These can be directly called as instance
methods or used via overloaded operators. Difference is provided via the .difference() method.
In [305]: a = pd.Index(['c', 'b', 'a'])
In [306]: b = pd.Index(['c', 'e', 'd'])
In [307]: a | b
Out[307]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
In [308]: a & b
Out[308]: Index(['c'], dtype='object')
In [309]: a.difference(b)
Out[309]: Index(['a', 'b'], dtype='object')
Also available is the symmetric_difference (^) operation, which returns elements that appear in either
idx1 or idx2, but not in both. This is equivalent to the Index created by idx1.difference(idx2).
union(idx2.difference(idx1)), with duplicates dropped.
In [310]: idx1 = pd.Index([1, 2, 3, 4])
In [311]: idx2 = pd.Index([2, 3, 4, 5])
In [312]: idx1.symmetric_difference(idx2)
Out[312]: Int64Index([1, 5], dtype='int64')
Note: The resulting index from a set operation will be sorted in ascending order.
When performing Index.union() between indexes with different dtypes, the indexes must be cast to a
common dtype. Typically, though not always, this is object dtype. The exception is when performing a
union between integer and float data. In this case, the integer values are converted to float.
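For example:
pd.Index([0, 1, 2]).union(pd.Index([0.5, 1.5]))
# Float64Index([0.0, 0.5, 1.0, 1.5, 2.0], dtype='float64')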
Missing values
Important: Even though Index can hold missing values (NaN), it should be avoided if you do not want
any unexpected results. For example, some operations exclude missing values implicitly.
In [317]: idx1 = pd.Index([1, np.nan, 3, 4])
In [318]: idx1
Out[318]: Float64Index([1.0, nan, 3.0, 4.0], dtype='float64')
In [319]: idx1.fillna(2)
Out[319]: Float64Index([1.0, 2.0, 3.0, 4.0], dtype='float64')
In [320]: idx2 = pd.DatetimeIndex([pd.Timestamp('2011-01-01'), pd.NaT, pd.Timestamp('2011-01-03')])
In [321]: idx2
Out[321]: DatetimeIndex(['2011-01-01', 'NaT', '2011-01-03'], dtype='datetime64[ns]', freq=None)
In [322]: idx2.fillna(pd.Timestamp('2011-01-02'))
Out[322]: DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], dtype='datetime64[ns]', freq=None)
Occasionally you will load or create a data set into a DataFrame and want to add an index after you’ve
already done so. There are a couple of different ways.
Set an index
DataFrame has a set_index() method which takes a column name (for a regular Index) or a list of column
names (for a MultiIndex). To create a new, re-indexed DataFrame:
data = pd.DataFrame({'a': ['bar', 'bar', 'foo', 'foo'],
                     'b': ['one', 'two', 'one', 'two'],
                     'c': ['z', 'y', 'x', 'w'],
                     'd': [1., 2., 3., 4.]})
In [323]: data
Out[323]:
a b c d
0 bar one z 1.0
1 bar two y 2.0
2 foo one x 3.0
3 foo two w 4.0
In [324]: indexed1 = data.set_index('c')
In [325]: indexed1
Out[325]:
a b d
c
z bar one 1.0
y bar two 2.0
x foo one 3.0
w foo two 4.0
The append keyword option allows you to keep the existing index and append the given columns to a MultiIndex:
In [330]: frame
Out[330]:
c d
c a b
z bar one z 1.0
y bar two y 2.0
x foo one x 3.0
w foo two w 4.0
Other options in set_index allow you to not drop the index columns or to add the index in-place (without
creating a new object):
In [332]: data.set_index(['a', 'b'], inplace=True)
In [333]: data
Out[333]:
c d
a b
bar one z 1.0
two y 2.0
foo one x 3.0
two w 4.0
As a convenience, there is a new function on DataFrame called reset_index() which transfers the index
values into the DataFrame’s columns and sets a simple integer index. This is the inverse operation of
set_index().
In [334]: data
Out[334]:
c d
a b
bar one z 1.0
two y 2.0
foo one x 3.0
two w 4.0
In [335]: data.reset_index()
Out[335]:
a b c d
0 bar one z 1.0
1 bar two y 2.0
2 foo one x 3.0
3 foo two w 4.0
The output is more similar to a SQL table or a record array. The names for the columns derived from the
index are the ones stored in the names attribute.
You can use the level keyword to remove only a portion of the index:
In [336]: frame
Out[336]:
c d
c a b
z bar one z 1.0
y bar two y 2.0
x foo one x 3.0
w foo two w 4.0
In [337]: frame.reset_index(level=1)
Out[337]:
a c d
c b
z one bar z 1.0
y two bar y 2.0
x one foo x 3.0
w two foo w 4.0
reset_index takes an optional parameter drop which if true simply discards the index, instead of putting
index values in the DataFrame’s columns.
If you create an index yourself, you can just assign it to the index field:
data.index = index
When setting values in a pandas object, care must be taken to avoid what is called chained indexing. Here
is an example.
In [339]: dfmi
Out[339]:
one two
first second first second
0 a b c d
1 e f g h
2 i j k l
3 m n o p
In [340]: dfmi['one']['second']
Out[340]:
0 b
1 f
2 j
3 n
Name: second, dtype: object
In [341]: dfmi.loc[:, ('one', 'second')]
Out[341]:
0    b
1    f
2    j
3    n
Name: (one, second), dtype: object
These both yield the same results, so which should you use? It is instructive to understand the order of
operations on these and why method 2 (.loc) is much preferred over method 1 (chained []).
dfmi['one'] selects the first level of the columns and returns a DataFrame that is singly-indexed. Then
another Python operation, dfmi_with_one['second'], selects the series indexed by 'second'. This is indicated
by the variable dfmi_with_one because pandas sees these operations as separate events: separate calls to
__getitem__ that must be treated as linear operations, happening one after another.
Contrast this to df.loc[:,('one','second')] which passes a nested tuple of (slice(None),('one',
'second')) to a single call to __getitem__. This allows pandas to deal with this as a single entity.
Furthermore this order of operations can be significantly faster, and allows one to index both axes if so
desired.
The problem in the previous section is just a performance issue. What’s up with the SettingWithCopy
warning? We don’t usually throw warnings around when you do something that might cost a few extra
milliseconds!
But it turns out that assigning to the product of chained indexing has inherently unpredictable results. To
see this, think about how the Python interpreter executes this code:
dfmi['one']['second'] = value
# becomes
dfmi.__getitem__('one').__setitem__('second', value)
See that __getitem__ in there? Outside of simple cases, it’s very hard to predict whether it will return a
view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees), and
therefore whether the __setitem__ will modify dfmi or a temporary object that gets thrown out immediately
afterward. That’s what SettingWithCopy is warning you about!
Note: You may be wondering whether we should be concerned about the loc property in the first example.
But dfmi.loc is guaranteed to be dfmi itself with modified indexing behavior, so dfmi.loc.__getitem__ /
dfmi.loc.__setitem__ operate on dfmi directly. Of course, dfmi.loc.__getitem__(idx) may be a view
or a copy of dfmi.
Sometimes a SettingWithCopy warning will arise at times when there’s no obvious chained indexing going
on. These are the bugs that SettingWithCopy is designed to catch! Pandas is probably trying to warn you
that you’ve done this:
def do_something(df):
foo = df[['bar', 'baz']] # Is foo a view? A copy? Nobody knows!
# ... many lines here ...
# We don't know whether this will modify df or not!
foo['quux'] = value
return foo
Yikes!
When you use chained indexing, the order and type of the indexing operation partially determine whether
the result is a slice into the original object, or a copy of the slice.
Pandas has the SettingWithCopyWarning because assigning to a copy of a slice is frequently not intentional,
but a mistake caused by chained indexing returning a copy where a slice was expected.
If you would like pandas to be more or less trusting about assignment to a chained indexing expression, you
can set the option mode.chained_assignment to one of these values:
• 'warn', the default, means a SettingWithCopyWarning is printed.
• 'raise' means pandas will raise a SettingWithCopyException you have to deal with.
• None will suppress the warnings entirely.
>>> pd.set_option('mode.chained_assignment','warn')
>>> dfb[dfb.a.str.startswith('o')]['c'] = 42
Traceback (most recent call last)
...
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
This is the correct access method:
In [344]: dfc = pd.DataFrame({'A': ['aaa', 'bbb', 'ccc'], 'B': [1, 2, 3]})
In [345]: dfc.loc[0, 'A'] = 11
In [346]: dfc
Out[346]:
A B
0 11 1
1 bbb 2
2 ccc 3
This can work at times, but it is not guaranteed to, and therefore should be avoided:
In [347]: dfc = dfc.copy()
In [348]: dfc['A'][0] = 111
In [349]: dfc
Out[349]:
A B
0 111 1
1 bbb 2
2 ccc 3
>>> pd.set_option('mode.chained_assignment','raise')
>>> dfc.loc[0]['A'] = 1111
Traceback (most recent call last)
...
SettingWithCopyException:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
Warning: The chained assignment warnings / exceptions are aiming to inform the user of a possibly
invalid assignment. There may be false positives; situations where a chained assignment is inadvertently
reported.
This section covers indexing with a MultiIndex and other advanced indexing features.
See the Indexing and Selecting Data for general indexing documentation.
Warning: Whether a copy or a reference is returned for a setting operation may depend on the context.
This is sometimes called chained assignment and should be avoided. See Returning a View versus Copy.
Hierarchical / Multi-level indexing is very exciting as it opens the door to some quite sophisticated data
analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you
to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures
like Series (1d) and DataFrame (2d).
In this section, we will show what exactly we mean by “hierarchical” indexing and how it integrates with all
of the pandas indexing functionality described above and in prior sections. Later, when discussing group by
and pivoting and reshaping data, we’ll show non-trivial applications to illustrate how it aids in structuring
data for analysis.
See the cookbook for some advanced strategies.
Changed in version 0.24.0: MultiIndex.labels has been renamed to MultiIndex.codes and MultiIndex.
set_labels to MultiIndex.set_codes.
The MultiIndex object is the hierarchical analogue of the standard Index object which typically stores the
axis labels in pandas objects. You can think of MultiIndex as an array of tuples where each tuple is unique.
A MultiIndex can be created from a list of arrays (using MultiIndex.from_arrays()), an array of tuples
(using MultiIndex.from_tuples()), a crossed set of iterables (using MultiIndex.from_product()), or a
DataFrame (using MultiIndex.from_frame()). The Index constructor will attempt to return a MultiIndex
when it is passed a list of tuples. The following examples demonstrate different ways to initialize MultiIn-
dexes.
In [1]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
...: ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
...:
In [2]: tuples = list(zip(*arrays))
In [3]: tuples
Out[3]:
[('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')]
In [4]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
In [5]: index
Out[5]:
MultiIndex([('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')],
names=['first', 'second'])
In [6]: s = pd.Series(np.random.randn(8), index=index)
In [7]: s
Out[7]:
first second
bar one 0.941492
two -0.234130
baz one -0.796227
two -0.058695
foo one 0.090387
two -0.543332
qux one -0.619806
two -0.570133
dtype: float64
When you want every pairing of the elements in two iterables, it can be easier to use the MultiIndex.
from_product() method:
In [8]: iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]
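For example, mirroring the arrays used above:
pd.MultiIndex.from_product(iterables, names=['first', 'second'])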
You can also construct a MultiIndex from a DataFrame directly, using the method MultiIndex.
from_frame(). This is a complementary method to MultiIndex.to_frame().
New in version 0.24.0.
In [10]: df = pd.DataFrame([['bar', 'one'], ['bar', 'two'],
....: ['foo', 'one'], ['foo', 'two']],
....: columns=['first', 'second'])
....:
In [11]: pd.MultiIndex.from_frame(df)
Out[11]:
MultiIndex([('bar', 'one'),
('bar', 'two'),
('foo', 'one'),
('foo', 'two')],
names=['first', 'second'])
As a convenience, you can pass a list of arrays directly into Series or DataFrame to construct a MultiIndex
automatically:
In [12]: arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
....: np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
....:
In [13]: s = pd.Series(np.random.randn(8), index=arrays)
In [14]: s
Out[14]:
bar one 0.245235
two 0.016378
baz one -0.347328
two 0.600062
foo one -1.300454
two -0.813428
qux one 1.462932
two -1.206173
dtype: float64
All of the MultiIndex constructors accept a names argument which stores string names for the levels them-
selves. If no names are provided, None will be assigned:
In [17]: df.index.names
Out[17]: FrozenList([None, None])
This index can back any axis of a pandas object, and the number of levels of the index is up to you:
In [18]: df = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=index)
In [19]: df
Out[19]:
first bar baz foo qux
second one two one two one two one two
A 0.771686 -1.515175 0.803511 -1.127083 -0.597308 -0.605145 0.556247 -0.632974
B 1.802824 -1.471259 0.326461 1.536432 -0.081085 -0.147244 0.101228 0.450182
C 1.319947 0.675934 2.202010 -0.521456 0.021556 0.789515 -0.895903 1.202173
It’s worth keeping in mind that there’s nothing preventing you from using tuples as atomic labels on an axis:
The reason that the MultiIndex matters is that it can allow you to do grouping, selection, and reshaping
operations as we will describe below and in subsequent areas of the documentation. As you will see in
later sections, you can find yourself working with hierarchically-indexed data without creating a MultiIndex
explicitly yourself. However, when loading data from a file, you may wish to generate your own MultiIndex
when preparing the data set.
The method get_level_values() will return a vector of the labels for each location at a particular level:
In [23]: index.get_level_values(0)
Out[23]: Index(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], dtype='object', name='first')
In [24]: index.get_level_values('second')
Out[24]: Index(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'], dtype='object', name='second')
One of the important features of hierarchical indexing is that you can select data by a “partial” label
identifying a subgroup in the data. Partial selection “drops” levels of the hierarchical index in the result in
a completely analogous way to selecting a column in a regular DataFrame:
In [25]: df['bar']
Out[25]:
second one two
A 0.771686 -1.515175
B 1.802824 -1.471259
C 1.319947 0.675934
In [27]: df['bar']['one']
Out[27]:
A 0.771686
B 1.802824
C 1.319947
Name: one, dtype: float64
In [28]: s['qux']
Out[28]:
one 1.462932
two -1.206173
dtype: float64
See Cross-section with hierarchical index for how to select on a deeper level.
Defined levels
The MultiIndex keeps all the defined levels of an index, even if they are not actually used. When slicing an
index, you may notice this. For example:
In [29]: df.columns.levels # original MultiIndex
Out[29]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])
In [34]: new_mi.levels
Out[34]: FrozenList([['foo', 'qux'], ['one', 'two']])
Operations between differently-indexed objects having MultiIndex on the axes will work as you expect; data
alignment will work the same as an Index of tuples:
In [35]: s + s[:-2]
Out[35]:
bar one 0.490471
two 0.032756
baz one -0.694656
two 1.200123
foo one -2.600908
two -1.626856
qux one NaN
two NaN
dtype: float64
In [36]: s + s[::2]
Out[36]:
bar one 0.490471
two NaN
baz one -0.694656
two NaN
foo one -2.600908
two NaN
qux one 2.925863
two NaN
dtype: float64
The reindex() method of Series/DataFrames can be called with another MultiIndex, or even a list or
array of tuples:
In [37]: s.reindex(index[:3])
Out[37]:
first second
bar one 0.245235
two 0.016378
baz one -0.347328
dtype: float64
Syntactically integrating MultiIndex in advanced indexing with .loc is a bit challenging, but we’ve made
every effort to do so. In general, MultiIndex keys take the form of tuples. For example, the following works
as you would expect:
In [39]: df = df.T
In [40]: df
Out[40]:
A B C
first second
bar one 0.771686 1.802824 1.319947
two -1.515175 -1.471259 0.675934
baz one 0.803511 0.326461 2.202010
two -1.127083 1.536432 -0.521456
foo one -0.597308 -0.081085 0.021556
two -0.605145 -0.147244 0.789515
qux one 0.556247 0.101228 -0.895903
You don’t have to specify all levels of the MultiIndex by passing only the first elements of the tuple. For
example, you can use “partial” indexing to get all elements with bar in the first level as follows:
df.loc['bar']
This is a shortcut for the slightly more verbose notation df.loc[('bar',),] (equivalent to df.loc['bar',]
in this example).
“Partial” slicing also works quite nicely.
In [43]: df.loc['baz':'foo']
Out[43]:
A B C
first second
baz one 0.803511 0.326461 2.202010
two -1.127083 1.536432 -0.521456
foo one -0.597308 -0.081085 0.021556
two -0.605145 -0.147244 0.789515
Note: It is important to note that tuples and lists are not treated identically in pandas when it comes to
indexing. Whereas a tuple is interpreted as one multi-level key, a list is used to specify several keys. Or in
other words, tuples go horizontally (traversing levels), lists go vertically (scanning levels).
Importantly, a list of tuples indexes several complete MultiIndex keys, whereas a tuple of lists refer to
several values within a level:
In [47]: s = pd.Series([1, 2, 3, 4, 5, 6],
....: index=pd.MultiIndex.from_product([["A", "B"], ["c", "d", "e"]]))
....:
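For illustration, here is a minimal sketch of that difference, using the Series s just constructed:
s.loc[[('A', 'c'), ('B', 'd')]]     # a list of tuples: two complete MultiIndex keys
s.loc[(['A', 'B'], ['c', 'd'])]     # a tuple of lists: the product of the values within each level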
Using slicers
Warning: You should specify all axes in the .loc specifier, meaning the indexer for the index and
for the columns. There are some ambiguous cases where the passed indexer could be mis-interpreted as
indexing both axes, rather than into say the MultiIndex for the rows.
You should do this:
df.loc[(slice('A1', 'A3'), ...), :] # noqa: E999
In [54]: dfmi
Out[54]:
lvl0 a b
lvl1 bar foo bah foo
A0 B0 C0 D0 1 0 3 2
D1 5 4 7 6
C1 D0 9 8 11 10
D1 13 12 15 14
C2 D0 17 16 19 18
... ... ... ... ...
A3 B1 C1 D1 237 236 239 238
C2 D0 241 240 243 242
D1 245 244 247 246
C3 D0 249 248 251 250
D1 253 252 255 254
You can use pandas.IndexSlice to facilitate a more natural syntax using :, rather than using slice(None).
It is possible to perform quite complicated selections using this method on multiple axes at the same time.
In [58]: dfmi.loc['A1', (slice(None), 'foo')]
Out[58]:
lvl0 a b
lvl1 foo foo
B0 C0 D0 64 66
D1 68 70
C1 D0 72 74
D1 76 78
C2 D0 80 82
D1 84 86
C3 D0 88 90
D1 92 94
B1 C0 D0 96 98
D1 100 102
C1 D0 104 106
D1 108 110
C2 D0 112 114
D1 116 118
C3 D0 120 122
D1 124 126
A2 B0 C1 D0 136 138
D1 140 142
C3 D0 152 154
D1 156 158
B1 C1 D0 168 170
D1 172 174
C3 D0 184 186
D1 188 190
A3 B0 C1 D0 200 202
D1 204 206
C3 D0 216 218
D1 220 222
B1 C1 D0 232 234
D1 236 238
C3 D0 248 250
D1 252 254
Using a boolean indexer you can provide selection related to the values.
You can also specify the axis argument to .loc to interpret the passed slicers on a single axis.
Furthermore, you can set the values using the following methods.
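For instance, using the dfmi frame shown above (a sketch along the lines of the original examples; idx and mask are illustrative names):
idx = pd.IndexSlice
dfmi.loc[idx[:, :, ['C1', 'C3']], idx[:, 'foo']]        # IndexSlice-based selection on both axes
mask = dfmi[('a', 'foo')] > 200
dfmi.loc[idx[mask, :, ['C1', 'C3']], idx[:, 'foo']]     # combined with a boolean indexer
dfmi.loc(axis=0)[:, :, ['C1', 'C3']]                    # interpret the slicers on the row axis only
df2 = dfmi.copy()
df2.loc(axis=0)[:, :, ['C1', 'C3']] = -10               # set values through the same machinery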
In [65]: df2
Out[65]:
lvl0 a b
lvl1 bar foo bah foo
A0 B0 C0 D0 1 0 3 2
D1 5 4 7 6
C1 D0 -10 -10 -10 -10
D1 -10 -10 -10 -10
C2 D0 17 16 19 18
... ... ... ... ...
A3 B1 C1 D1 -10 -10 -10 -10
C2 D0 241 240 243 242
D1 245 244 247 246
C3 D0 -10 -10 -10 -10
D1 -10 -10 -10 -10
In [68]: df2
Out[68]:
lvl0 a b
lvl1 bar foo bah foo
Cross-section
The xs() method of DataFrame additionally takes a level argument to make selecting data at a particular
level of a MultiIndex easier.
In [69]: df
Out[69]:
A B C
first second
bar one 0.771686 1.802824 1.319947
two -1.515175 -1.471259 0.675934
baz one 0.803511 0.326461 2.202010
two -1.127083 1.536432 -0.521456
foo one -0.597308 -0.081085 0.021556
two -0.605145 -0.147244 0.789515
qux one 0.556247 0.101228 -0.895903
two -0.632974 0.450182 1.202173
You can also select on the columns with xs, by providing the axis argument.
In [72]: df = df.T
You can pass drop_level=False to xs to retain the level that was selected.
Compare the above with the result using drop_level=True (the default value).
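For instance (a sketch; df here is the transposed frame from In [72], whose columns carry the ('first', 'second') MultiIndex):
df.xs('one', level='second', axis=1)                       # select on the columns
df.xs(('one', 'bar'), level=('second', 'first'), axis=1)   # select on multiple levels at once
df.xs('one', level='second', axis=1, drop_level=False)     # retain the selected level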
Using the parameter level in the reindex() and align() methods of pandas objects is useful to broadcast
values across a level. For instance:
In [79]: midx = pd.MultiIndex(levels=[['zero', 'one'], ['x', 'y']],
....: codes=[[1, 1, 0, 0], [1, 0, 1, 0]])
....:
In [81]: df
Out[81]:
0 1
one y -0.346466 0.214630
x -1.468255 1.365136
zero y -0.092330 -0.395188
x -0.998030 0.017783
In [83]: df2
Out[83]:
0 1
one -0.907361 0.789883
zero -0.545180 -0.188703
# aligning
In [85]: df_aligned, df2_aligned = df.align(df2, level=0)
In [86]: df_aligned
Out[86]:
0 1
one y -0.346466 0.214630
x -1.468255 1.365136
zero y -0.092330 -0.395188
x -0.998030 0.017783
In [87]: df2_aligned
Out[87]:
0 1
The reorder_levels() method generalizes the swaplevel method, allowing you to permute the hierarchical
index levels in one step:
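A minimal sketch (the small two-level Series ser below is constructed here purely for illustration):
ser = pd.Series(range(4),
                index=pd.MultiIndex.from_product([['a', 'b'], [1, 2]],
                                                 names=['letter', 'number']))
ser.swaplevel(0, 1)                        # swap the two index levels
ser.reorder_levels(['number', 'letter'])   # the same permutation via reorder_levels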
The rename() method is used to rename the labels of a MultiIndex, and is typically used to rename the
columns of a DataFrame. The columns argument of rename allows a dictionary to be specified that includes
only the columns you wish to rename.
This method can also be used to rename specific labels of the main index of the DataFrame.
The rename_axis() method is used to rename the name of a Index or MultiIndex. In particular, the names
of the levels of a MultiIndex can be specified, which is useful if reset_index() is later used to move the
values from the MultiIndex to a column.
Note that the columns of a DataFrame are an index, so that using rename_axis with the columns argument
will change the name of that index.
In [94]: df.rename_axis(columns="Cols").columns
Out[94]: RangeIndex(start=0, stop=2, step=1, name='Cols')
Both rename and rename_axis support specifying a dictionary, Series or a mapping function to map
labels/names to new values.
For MultiIndex-ed objects to be indexed and sliced effectively, they need to be sorted. As with any index,
you can use sort_index().
In [95]: import random
In [96]: random.shuffle(tuples)
In [98]: s
Out[98]:
baz one -1.164671
qux one -0.923714
foo one -0.936684
baz two -0.850529
qux two 0.371209
foo two 0.662579
bar two 1.537426
one -0.899351
dtype: float64
In [99]: s.sort_index()
Out[99]:
bar one -0.899351
two 1.537426
baz one -1.164671
two -0.850529
foo one -0.936684
two 0.662579
qux one -0.923714
two 0.371209
dtype: float64
In [100]: s.sort_index(level=0)
Out[100]:
bar one -0.899351
two 1.537426
baz one -1.164671
two -0.850529
foo one -0.936684
two 0.662579
qux one -0.923714
two 0.371209
dtype: float64
In [101]: s.sort_index(level=1)
Out[101]:
bar one -0.899351
baz one -1.164671
foo one -0.936684
qux one -0.923714
bar two 1.537426
baz two -0.850529
foo two 0.662579
qux two 0.371209
dtype: float64
You may also pass a level name to sort_index if the MultiIndex levels are named.
In [102]: s.index.set_names(['L1', 'L2'], inplace=True)
In [103]: s.sort_index(level='L1')
Out[103]:
L1 L2
bar one -0.899351
two 1.537426
baz one -1.164671
two -0.850529
foo one -0.936684
two 0.662579
qux one -0.923714
two 0.371209
dtype: float64
In [104]: s.sort_index(level='L2')
Out[104]:
L1 L2
bar one -0.899351
baz one -1.164671
foo one -0.936684
qux one -0.923714
bar two 1.537426
baz two -0.850529
foo two 0.662579
qux two 0.371209
dtype: float64
On higher dimensional objects, you can sort any of the other axes by level if they have a MultiIndex:
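A hedged sketch, assuming a DataFrame whose columns carry a two-level MultiIndex:
df.sort_index(axis=1, level=1)            # sort the columns by their second level
df.sort_index(axis=1, level='second')     # or by level name, if the column levels are named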
Indexing will work even if the data are not sorted, but will be rather inefficient (and show a
PerformanceWarning). It will also return a copy of the data rather than a view:
In [108]: dfm
Out[108]:
jolie
jim joe
0 x 0.468958
x 0.330760
1 z 0.394711
y 0.247768
In [4]: dfm.loc[(1, 'z')]
PerformanceWarning: indexing past lexsort depth may impact performance.
Out[4]:
jolie
jim joe
1 z 0.64094
Furthermore, if you try to index something that is not fully lexsorted, this can raise:
The is_lexsorted() method on a MultiIndex shows if the index is sorted, and the lexsort_depth property
returns the sort depth:
In [109]: dfm.index.is_lexsorted()
Out[109]: False
In [110]: dfm.index.lexsort_depth
Out[110]: 1
In [111]: dfm = dfm.sort_index()
In [112]: dfm
Out[112]:
jolie
jim joe
0 x 0.468958
x 0.330760
1 y 0.247768
z 0.394711
In [113]: dfm.index.is_lexsorted()
Out[113]: True
In [114]: dfm.index.lexsort_depth
Out[114]: 2
And now selection works as expected.
Similar to NumPy ndarrays, pandas Index, Series, and DataFrame also provides the take() method that
retrieves elements along a given axis at the given indices. The given indices must be either a list or an
ndarray of integer index positions. take will also accept negative integers as relative positions to the end of
the object.
In [116]: index = pd.Index(np.random.randint(0, 1000, 10))
In [117]: index
Out[117]: Int64Index([680, 761, 606, 142, 515, 764, 803, 789, 939, 518], dtype='int64')
In [119]: index[positions]
Out[119]: Int64Index([680, 518, 142], dtype='int64')
In [120]: index.take(positions)
Out[120]: Int64Index([680, 518, 142], dtype='int64')
In [122]: ser.iloc[positions]
Out[122]:
0 2.280385
9 0.956264
3 2.762272
dtype: float64
In [123]: ser.take(positions)
Out[123]:
0 2.280385
9 0.956264
3 2.762272
dtype: float64
For DataFrames, the given indices should be a 1d list or ndarray that specifies row or column positions.
In [124]: frm = pd.DataFrame(np.random.randn(5, 3))
Out[132]:
0 2.007833
1 -0.768178
dtype: float64
Finally, as a small note on performance, because the take method handles a narrower range of inputs, it
can offer performance that is a good deal faster than fancy indexing.
In [135]: random.shuffle(indexer)
We have discussed MultiIndex in the previous sections pretty extensively. Documentation about
DatetimeIndex and PeriodIndex are shown here, and documentation about TimedeltaIndex is found
here.
In the following sub-sections we will highlight some other index types.
CategoricalIndex
CategoricalIndex is a type of index that is useful for supporting indexing with duplicates. This is a
container around a Categorical and allows efficient indexing and storage of an index with a large number
of duplicated elements.
In [139]: from pandas.api.types import CategoricalDtype
In [142]: df
Out[142]:
A B
0 0 a
1 1 a
2 2 b
3 3 b
4 4 c
5 5 a
In [143]: df.dtypes
Out[143]:
A int64
B category
dtype: object
In [144]: df.B.cat.categories
Out[144]: Index(['c', 'a', 'b'], dtype='object')
Setting the index will create a CategoricalIndex.
In [146]: df2.index
Out[146]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')
Indexing with __getitem__/.iloc/.loc works similarly to an Index with duplicates. The indexers must
be in the category or the operation will raise a KeyError.
In [147]: df2.loc['a']
Out[147]:
A
B
a 0
a 1
a 5
In [148]: df2.loc['a'].index
Out[148]: CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')
Sorting the index will sort by the order of the categories (recall that we created the index with
CategoricalDtype(list('cab')), so the sorted order is cab).
In [149]: df2.sort_index()
Out[149]:
A
B
c 4
a 0
a 1
a 5
b 2
b 3
Groupby operations on the index will preserve the index nature as well.
In [150]: df2.groupby(level=0).sum()
Out[150]:
A
B
c 4
a 6
b 5
In [151]: df2.groupby(level=0).sum().index
Out[151]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')
Reindexing operations will return a resulting index based on the type of the passed indexer. Passing a
list will return a plain-old Index; indexing with a Categorical will return a CategoricalIndex, indexed
according to the categories of the passed Categorical dtype. This allows one to arbitrarily index these
even with values not in the categories, similarly to how you can reindex any pandas index.
In [152]: df2.reindex(['a', 'e'])
Out[152]:
A
B
a 0.0
a 1.0
a 5.0
e NaN
Warning: Reshaping and Comparison operations on a CategoricalIndex must have the same cate-
gories or a TypeError will be raised.
In [9]: df3 = pd.DataFrame({'A': np.arange(6), 'B': pd.Series(list('aabbca')).astype('category')})
In [11]: df3.index
Out[11]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['a', 'b', 'c'], ordered=False, name='B', dtype='category')
Warning: Indexing on an integer-based Index with floats has been clarified in 0.18.0, for a summary
of the changes, see here.
Int64Index is a fundamental basic index in pandas. This is an immutable array implementing an ordered,
sliceable set. Prior to 0.18.0, the Int64Index would provide the default index for all NDFrame objects.
RangeIndex is a sub-class of Int64Index added in version 0.18.0, now providing the default index for all
NDFrame objects. RangeIndex is an optimized version of Int64Index that can represent a monotonic ordered
set. These are analogous to Python range types.
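For example (a small sketch):
pd.RangeIndex(5)                  # RangeIndex(start=0, stop=5, step=1)
pd.Series(['a', 'b', 'c']).index  # the default index of a new object is a RangeIndex
pd.Index([1, 2, 3])               # an explicit integer index is an Int64Index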
Float64Index
In [157]: indexf
Out[157]: Float64Index([1.5, 2.0, 3.0, 4.5, 5.0], dtype='float64')
In [159]: sf
Out[159]:
1.5 0
2.0 1
3.0 2
4.5 3
5.0 4
dtype: int64
Scalar selection for [] and .loc will always be label based. An integer will match an equal float index (e.g. 3
is equivalent to 3.0).
In [160]: sf[3]
Out[160]: 2
In [161]: sf[3.0]
Out[161]: 2
In [162]: sf.loc[3]
Out[162]: 2
In [163]: sf.loc[3.0]
Out[163]: 2
The only positional indexing is via iloc.
In [164]: sf.iloc[3]
Out[164]: 3
A scalar index that is not found will raise a KeyError. Slicing is primarily on the values of the index when
using [], ix, or loc, and always positional when using iloc. The exception is when the slice is boolean, in
which case it will always be positional.
In [165]: sf[2:4]
Out[165]:
2.0 1
3.0 2
dtype: int64
In [166]: sf.loc[2:4]
Out[166]:
2.0 1
3.0 2
dtype: int64
In [167]: sf.iloc[2:4]
Out[167]:
3.0 2
4.5 3
dtype: int64
In float indexes, slicing using floats is allowed.
In [168]: sf[2.1:4.6]
Out[168]:
3.0 2
4.5 3
dtype: int64
In [169]: sf.loc[2.1:4.6]
Out[169]:
3.0 2
4.5 3
dtype: int64
In non-float indexes, slicing using floats will raise a TypeError.
In [1]: pd.Series(range(5))[3.5]
TypeError: the label [3.5] is not a proper indexer for this index type (Int64Index)
In [1]: pd.Series(range(5))[3.5:4.5]
TypeError: the slice start [3.5] is not a proper indexer for this index type (Int64Index)
Warning: Using a scalar float indexer for .iloc has been removed in 0.18.0, so the following will raise
a TypeError:
In [3]: pd.Series(range(5)).iloc[3.0]
TypeError: cannot do positional indexing on <class 'pandas.indexes.range.RangeIndex'> with these indexers [3.0] of <type 'float'>
Here is a typical use-case for using this type of indexing. Imagine that you have a somewhat irregular
timedelta-like indexing scheme, but the data is recorded as floats. This could, for example, be millisecond
offsets.
In [171]: dfir
Out[171]:
A B
0.0 -0.242414 2.329011
250.0 0.764004 0.234595
500.0 2.148210 -0.555984
750.0 -1.494413 -1.635996
1000.0 0.806457 -0.952539
1000.4 0.671243 1.005818
1250.5 -0.658081 0.439328
1500.6 -1.203373 0.245992
1750.7 1.498864 -0.599532
2000.8 -0.985225 -1.469645
2250.9 -0.364937 0.107360
Selection operations then will always work on a value basis, for all selection operators.
In [172]: dfir[0:1000.4]
Out[172]:
A B
0.0 -0.242414 2.329011
250.0 0.764004 0.234595
500.0 2.148210 -0.555984
750.0 -1.494413 -1.635996
1000.0 0.806457 -0.952539
1000.4 0.671243 1.005818
In [174]: dfir.loc[1000.4]
Out[174]:
A 0.671243
B 1.005818
Name: 1000.4, dtype: float64
You could retrieve the first 1 second (1000 ms) of data as such:
In [175]: dfir[0:1000]
Out[175]:
A B
0.0 -0.242414 2.329011
250.0 0.764004 0.234595
500.0 2.148210 -0.555984
750.0 -1.494413 -1.635996
1000.0 0.806457 -0.952539
In [176]: dfir.iloc[0:5]
Out[176]:
A B
0.0 -0.242414 2.329011
250.0 0.764004 0.234595
500.0 2.148210 -0.555984
750.0 -1.494413 -1.635996
1000.0 0.806457 -0.952539
IntervalIndex
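The frame used in the next example is not shown being built; it was presumably constructed along these lines (a sketch):
df = pd.DataFrame({'A': [1, 2, 3, 4]},
                  index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4]))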
In [178]: df
Out[178]:
A
(0, 1] 1
(1, 2] 2
(2, 3] 3
(3, 4] 4
Label based indexing via .loc along the edges of an interval works as you would expect, selecting that
particular interval.
In [179]: df.loc[2]
Out[179]:
A 2
Trying to select an Interval that is not exactly contained in the IntervalIndex will raise a KeyError.
Selecting all Intervals that overlap a given Interval can be performed using the overlaps() method to
create a boolean indexer.
In [184]: idxr = df.index.overlaps(pd.Interval(0.5, 2.5))
In [185]: idxr
Out[185]: array([ True, True, True, False])
In [186]: df[idxr]
Out[186]:
A
(0, 1] 1
(1, 2] 2
(2, 3] 3
cut() and qcut() both return a Categorical object, and the bins they create are stored as an
IntervalIndex in its .categories attribute.
In [187]: c = pd.cut(range(4), bins=2)
In [188]: c
Out[188]:
[(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]]
Categories (2, interval[float64]): [(-0.003, 1.5] < (1.5, 3.0]]
In [189]: c.categories
Out[189]:
IntervalIndex([(-0.003, 1.5], (1.5, 3.0]],
closed='right',
dtype='interval[float64]')
cut() also accepts an IntervalIndex for its bins argument, which enables a useful pandas idiom. First,
We call cut() with some data and bins set to a fixed number, to generate the bins. Then, we pass the
values of .categories as the bins argument in subsequent calls to cut(), supplying new data which will
be binned into the same bins.
Any value which falls outside all bins will be assigned a NaN value.
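A sketch of that idiom, reusing the c created above (the new data values here are illustrative):
pd.cut([0, 3, 5, 1], bins=c.categories)   # 5 falls outside every bin and becomes NaN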
If we need intervals on a regular frequency, we can use the interval_range() function to create an
IntervalIndex using various combinations of start, end, and periods. The default frequency for
interval_range is a 1 for numeric intervals, and calendar day for datetime-like intervals:
In [191]: pd.interval_range(start=0, end=5)
Out[191]:
IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]],
closed='right',
dtype='interval[int64]')
The freq parameter can be used to specify non-default frequencies, and can utilize a variety of frequency aliases
with datetime-like intervals:
In [194]: pd.interval_range(start=0, periods=5, freq=1.5)
Out[194]:
IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0], (6.0, 7.5]],
closed='right',
dtype='interval[float64]')
Additionally, the closed parameter can be used to specify which side(s) the intervals are closed on. Intervals
are closed on the right side by default.
In [197]: pd.interval_range(start=0, end=4, closed='both')
Out[197]:
IntervalIndex([[0, 1], [1, 2], [2, 3], [3, 4]],
closed='both',
dtype='interval[int64]')
In [200]: pd.interval_range(pd.Timestamp('2018-01-01'),
.....: pd.Timestamp('2018-02-28'), periods=3)
.....:
Out[200]:
IntervalIndex([(2018-01-01, 2018-01-20 08:00:00], (2018-01-20 08:00:00, 2018-02-08 16:00:00], (2018-02-08 16:00:00, 2018-02-28]],
closed='right',
dtype='interval[datetime64[ns]]')
Integer indexing
Label-based indexing with integer axis labels is a thorny topic. It has been discussed heavily on mailing lists
and among various members of the scientific Python community. In pandas, our general viewpoint is that
labels matter more than integer locations. Therefore, with an integer axis index only label-based indexing
is possible with the standard tools like .loc. The following code will generate exceptions:
In [201]: s = pd.Series(range(5))
In [202]: s[-1]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-202-76c3dce40054> in <module>
----> 1 s[-1]
~/sandbox/pandas-doc/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()
~/sandbox/pandas-doc/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()
~/sandbox/pandas-doc/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
~/sandbox/pandas-doc/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
~/sandbox/pandas-doc/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
KeyError: -1
In [204]: df
Out[204]:
0 1 2 3
0 -1.133558 -0.029021 -0.694313 0.001625
1 0.603428 2.165940 -0.484352 -0.005155
2 1.473773 0.636391 0.557850 -0.894238
3 -0.805963 -1.781931 1.888285 -0.953362
4 -0.112449 1.227497 -0.567709 2.458089
In [205]: df.loc[-2:]
Out[205]:
0 1 2 3
0 -1.133558 -0.029021 -0.694313 0.001625
1 0.603428 2.165940 -0.484352 -0.005155
2 1.473773 0.636391 0.557850 -0.894238
3 -0.805963 -1.781931 1.888285 -0.953362
4 -0.112449 1.227497 -0.567709 2.458089
This deliberate decision was made to prevent ambiguities and subtle bugs (many users reported finding bugs
when the API change was made to stop “falling back” on position-based indexing).
If the index of a Series or DataFrame is monotonically increasing or decreasing, then the bounds of a label-
based slice can be outside the range of the index, much like slice indexing a normal Python list. Mono-
tonicity of an index can be tested with the is_monotonic_increasing() and is_monotonic_decreasing()
attributes.
In [206]: df = pd.DataFrame(index=[2, 3, 3, 4, 5], columns=['data'], data=list(range(5)))
In [207]: df.index.is_monotonic_increasing
Out[207]: True
In [211]: df.index.is_monotonic_increasing
Out[211]: False
2 0
3 1
1 2
4 3
In [214]: weakly_monotonic
Out[214]: Index(['a', 'b', 'c', 'c'], dtype='object')
In [215]: weakly_monotonic.is_monotonic_increasing
Out[215]: True
Compared with standard Python sequence slicing in which the slice endpoint is not inclusive, label-based
slicing in pandas is inclusive. The primary reason for this is that it is often not possible to easily determine
the “successor” or next element after a particular label in an index. For example, consider the following
Series:
In [218]: s
Out[218]:
a 0.580640
b 0.138820
c -1.068494
d 1.649395
e -0.598832
f -0.322914
dtype: float64
Suppose we wished to slice from c to e, using integers this would be accomplished as such:
In [219]: s[2:5]
Out[219]:
c -1.068494
d 1.649395
e -0.598832
dtype: float64
However, if you only had c and e, determining the next element in the index can be somewhat complicated.
For example, the following does not work:
s.loc['c':'e' + 1]
A very common use case is to limit a time series to start and end at two specific dates. To enable this, we
made the design choice to make label-based slicing include both endpoints:
In [220]: s.loc['c':'e']
Out[220]:
c -1.068494
d 1.649395
e -0.598832
dtype: float64
This is most definitely a “practicality beats purity” sort of thing, but it is something to watch out for if you
expect label-based slicing to behave exactly in the way that standard Python integer slicing works.
The different indexing operation can potentially change the dtype of a Series.
In [221]: series1 = pd.Series([1, 2, 3])
In [222]: series1.dtype
Out[222]: dtype('int64')
In [224]: res.dtype
Out[224]: dtype('float64')
In [225]: res
Out[225]:
0 1.0
4 NaN
dtype: float64
In [226]: series2 = pd.Series([True])
In [227]: series2.dtype
Out[227]: dtype('bool')
In [229]: res.dtype
Out[229]: dtype('O')
In [230]: res
Out[230]:
0 True
1 NaN
2 NaN
dtype: object
This is because the (re)indexing operations above silently insert NaNs and the dtype changes accordingly.
This can cause some issues when using numpy ufuncs such as numpy.logical_and.
See this old issue for a more detailed discussion.
pandas provides various facilities for easily combining together Series or DataFrame with various kinds of
set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.
The concat() function (in the main pandas namespace) does all of the heavy lifting of performing concate-
nation operations along an axis while performing optional set logic (union or intersection) of the indexes (if
any) on the other axes. Note that I say “if any” because there is only a single possible axis of concatenation
for Series.
Before diving into all of the details of concat and what it can do, here is a simple example:
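A minimal sketch (these two small frames are stand-ins for the df1, df2, … pieces used throughout this section):
df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[2, 3])
result = pd.concat([df1, df2])   # df2's rows are stacked below df1's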
Like its sibling function on ndarrays, numpy.concatenate, pandas.concat takes a list or dict of
homogeneously-typed objects and concatenates them with some configurable handling of “what to do with
the other axes”:
• objs : a sequence or mapping of Series or DataFrame objects. If a dict is passed, the sorted keys will
be used as the keys argument, unless it is passed, in which case the values will be selected (see below).
Any None objects will be dropped silently unless they are all None in which case a ValueError will be
raised.
• axis : {0, 1, …}, default 0. The axis to concatenate along.
• join : {‘inner’, ‘outer’}, default ‘outer’. How to handle indexes on other axis(es). Outer for union and
inner for intersection.
• ignore_index : boolean, default False. If True, do not use the index values on the concatenation axis.
The resulting axis will be labeled 0, …, n - 1. This is useful if you are concatenating objects where the
concatenation axis does not have meaningful indexing information. Note the index values on the other
axes are still respected in the join.
• keys : sequence, default None. Construct hierarchical index using the passed keys as the outermost
level. If multiple levels passed, should contain tuples.
• levels : list of sequences, default None. Specific levels (unique values) to use for constructing a
MultiIndex. Otherwise they will be inferred from the keys.
• names : list, default None. Names for the levels in the resulting hierarchical index.
• verify_integrity : boolean, default False. Check whether the new concatenated axis contains du-
plicates. This can be very expensive relative to the actual data concatenation.
• copy : boolean, default True. If False, do not copy data unnecessarily.
Without a little bit of context many of these arguments don’t make much sense. Let’s revisit the above
example. Suppose we wanted to associate specific keys with each of the pieces of the chopped up DataFrame.
We can do this using the keys argument:
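A sketch, assuming df1, df2 and df3 are the three pieces from the original example:
result = pd.concat([df1, df2, df3], keys=['x', 'y', 'z'])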
As you can see (if you’ve read the rest of the documentation), the resulting object’s index has a hierarchical
index. This means that we can now select out each chunk by key:
In [7]: result.loc['y']
Out[7]:
A B C D
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
It’s not a stretch to see how this can be very useful. More detail on this functionality below.
Note: It is worth noting that concat() (and therefore append()) makes a full copy of the data, and that
constantly reusing this function can create a significant performance hit. If you need to use the operation
over several datasets, use a list comprehension.
When gluing together multiple DataFrames, you have a choice of how to handle the other axes (other than
the one being concatenated). This can be done in the following two ways:
• Take the union of them all, join='outer'. This is the default option as it results in zero information
loss.
• Take the intersection, join='inner'.
Here is an example of each of these methods. First, the default join='outer' behavior:
Lastly, suppose we just wanted to reuse the exact index from the original DataFrame:
A useful shortcut to concat() are the append() instance methods on Series and DataFrame. These methods
actually predated concat. They concatenate along axis=0, namely the index:
In the case of DataFrame, the indexes must be disjoint but the columns do not need to be:
Note: Unlike the append() method of Python lists, which appends in place and returns None, append() here
does not modify df1; it returns a new copy with df2 appended.
For DataFrame objects which don’t have a meaningful index, you may wish to append them and ignore the
fact that they may have overlapping indexes. To do this, use the ignore_index argument:
You can concatenate a mix of Series and DataFrame objects. The Series will be transformed to DataFrame
with the column name as the name of the Series.
Note: Since we’re concatenating a Series to a DataFrame, we could have achieved the same result with
DataFrame.assign(). To concatenate an arbitrary number of pandas objects (DataFrame or Series), use
concat.
A fairly common use of the keys argument is to override the column names when creating a new DataFrame
based on existing Series. Notice how the default behaviour consists of letting the resulting DataFrame
inherit the parent Series' name, when one exists.
Through the keys argument we can override the existing column names.
You can also pass a dict to concat in which case the dict keys will be used for the keys argument (unless
other keys are specified):
The MultiIndex created has levels that are constructed from the passed keys and the index of the DataFrame
pieces:
In [32]: result.index.levels
Out[32]: FrozenList([['z', 'y'], [4, 5, 6, 7, 8, 9, 10, 11]])
If you wish to specify other levels (as will occasionally be the case), you can do so using the levels argument:
In [34]: result.index.levels
Out[34]: FrozenList([['z', 'y', 'x', 'w'], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]])
This is fairly esoteric, but it is actually necessary for implementing things like GroupBy where the order of
a categorical variable is meaningful.
While not especially efficient (since a new object must be created), you can append a single row to a
DataFrame by passing a Series or dict to append, which returns a new DataFrame as above.
You should use ignore_index with this method to instruct DataFrame to discard its index. If you wish to
preserve the index, you should construct an appropriately-indexed DataFrame and append or concatenate
those objects.
You can also pass a list of dicts or Series:
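A hedged sketch, assuming df1 is a small frame with columns A through D:
s2 = pd.Series(['X0', 'X1', 'X2', 'X3'], index=['A', 'B', 'C', 'D'])
df1.append(s2, ignore_index=True)                   # append a single row held in a Series

dicts = [{'A': 1, 'B': 2, 'C': 3, 'X': 4},
         {'A': 5, 'B': 6, 'C': 7, 'Y': 8}]
df1.append(dicts, ignore_index=True, sort=False)    # append several rows at once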
pandas has full-featured, high performance in-memory join operations idiomatically very similar to re-
lational databases like SQL. These methods perform significantly better (in some cases well over an order
of magnitude better) than other open source implementations (like base::merge.data.frame in R). The
reason for this is careful algorithmic design and the internal layout of the data in DataFrame.
See the cookbook for some advanced strategies.
Users who are familiar with SQL but new to pandas might be interested in a comparison with SQL.
pandas provides a single function, merge(), as the entry point for all standard database join operations
between DataFrame or named Series objects:
Note: Support for specifying index levels as the on, left_on, and right_on parameters was added in
version 0.23.0. Support for merging named Series objects was added in version 0.24.0.
The return type will be the same as left. If left is a DataFrame or named Series and right is a subclass
of DataFrame, the return type will still be DataFrame.
merge is a function in the pandas namespace, and it is also available as a DataFrame instance method
merge(), with the calling DataFrame being implicitly considered the left object in the join.
The related join() method, uses merge internally for the index-on-index (by default) and column(s)-on-
index join. If you are joining on index only, you may wish to use DataFrame.join to save yourself some
typing.
Experienced users of relational databases like SQL will be familiar with the terminology used to describe join
operations between two SQL-table like structures (DataFrame objects). There are several cases to consider
which are very important to understand:
• one-to-one joins: for example when joining two DataFrame objects on their indexes (which must
contain unique values).
• many-to-one joins: for example when joining an index (unique) to one or more columns in a different
DataFrame.
• many-to-many joins: joining columns on columns.
Note: When joining columns on columns (potentially a many-to-many join), any indexes on the passed
DataFrame objects will be discarded.
It is worth spending some time understanding the result of the many-to-many join case. In SQL / standard
relational algebra, if a key combination appears more than once in both tables, the resulting table will have the
Cartesian product of the associated data. Here is a very basic example with one unique key combination:
Here is a more complicated example with multiple join keys. Only the keys appearing in left and right
are present (the intersection), since how='inner' by default.
The how argument to merge specifies how to determine which keys are to be included in the resulting table.
If a key combination does not appear in either the left or right tables, the values in the joined table will
be NA. Here is a summary of the how options and their SQL equivalent names:
Warning: Joining / merging on duplicate keys can cause a returned frame that is the multiplication
of the row dimensions, which may result in memory overflow. It is the user's responsibility to manage
duplicate values in keys before joining large DataFrames.
If the user is aware of the duplicates in the right DataFrame but wants to ensure there are no duplicates in
the left DataFrame, one can use the validate='one_to_many' argument instead, which will not raise an
exception.
merge() accepts the argument indicator. If True, a Categorical-type column called _merge will be added
to the output object that takes on values:
The indicator argument will also accept string arguments, in which case the indicator function will use the
value of the passed string as the name for the indicator column.
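The _merge column takes on the values left_only, right_only and both. A small sketch (the frames here are illustrative):
df1 = pd.DataFrame({'col1': [0, 1], 'col_left': ['a', 'b']})
df2 = pd.DataFrame({'col1': [1, 2, 2], 'col_right': [2, 2, 2]})
pd.merge(df1, df2, on='col1', how='outer', indicator=True)                 # adds a '_merge' column
pd.merge(df1, df2, on='col1', how='outer', indicator='indicator_column')   # custom column name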
Merge dtypes
In [60]: left
Out[60]:
key v1
0 1 10
In [62]: right
Out[62]:
key v1
0 1 20
1 2 30
key v1
0 1 10
1 1 20
2 2 30
In [71]: left
Out[71]:
X Y
0 foo three
1 bar three
2 foo two
3 foo three
4 foo one
5 foo one
6 bar one
7 foo two
8 foo one
9 bar two
In [72]: left.dtypes
Out[72]:
X category
Y object
dtype: object
The right frame.
In [73]: right = pd.DataFrame({'X': pd.Series(['foo', 'bar'],
....: dtype=CategoricalDtype(['foo', 'bar'])),
....: 'Z': [1, 2]})
....:
In [74]: right
Out[74]:
X Z
0 foo 1
1 bar 2
In [75]: right.dtypes
Out[75]:
X category
Z int64
dtype: object
The merged result:
In [76]: result = pd.merge(left, right, how='outer')
In [77]: result
Out[77]:
X Y Z
0 foo three 1
1 foo two 1
2 foo three 1
3 foo one 1
4 foo one 1
5 foo two 1
6 foo one 1
7 bar three 2
8 bar one 2
9 bar two 2
In [78]: result.dtypes
Out[78]:
X category
Y object
Z int64
dtype: object
Note: The category dtypes must be exactly the same, meaning the same categories and the ordered
attribute. Otherwise the result will coerce to object dtype.
Note: Merging on category dtypes that are the same can be quite performant compared to object dtype
merging.
Joining on index
DataFrame.join() is a convenient method for combining the columns of two potentially differently-indexed
DataFrames into a single result DataFrame. Here is a very basic example:
The data alignment here is on the indexes (row labels). This same behavior can be achieved using merge
plus additional arguments instructing it to use the indexes:
join() takes an optional on argument which may be a column or multiple column names, which specifies
that the passed DataFrame is to be aligned on that column in the DataFrame. These two function calls are
completely equivalent:
left.join(right, on=key_or_keys)
pd.merge(left, right, left_on=key_or_keys, right_index=True,
how='left', sort=False)
Obviously you can choose whichever form you find more convenient. For many-to-one joins (where one of
the DataFrame’s is already indexed by the join key), using join may be more convenient. Here is a simple
example:
To join on multiple keys, the passed DataFrame must have a MultiIndex:
Now this can be joined by passing the two key column names:
The default for DataFrame.join is to perform a left join (essentially a “VLOOKUP” operation, for Excel users),
which uses only the keys found in the calling DataFrame. Other join types, for example inner join, can be
just as easily performed:
As you can see, this drops any rows where there was no match.
You can join a singly-indexed DataFrame with a level of a MultiIndexed DataFrame. The level will match
on the name of the index of the singly-indexed frame against a level name of the MultiIndexed frame.
This is equivalent but less verbose and more memory efficient / faster than this.
This is supported in a limited way, provided that the index for the right argument is completely used in the
join, and is a subset of the indices in the left argument, as in this example:
In [100]: leftindex = pd.MultiIndex.from_product([list('abc'), list('xy'), [1, 2]],
.....: names=['abc', 'xy', 'num'])
.....:
In [102]: left
Out[102]:
v1
abc xy num
a x 1 0
2 1
y 1 2
2 3
b x 1 4
2 5
y 1 6
2 7
c x 1 8
2 9
y 1 10
2 11
In [105]: right
Out[105]:
v2
abc xy
a x 100
y 200
b x 300
y 400
c x 500
y 600
Note: When DataFrames are merged on a string that matches an index level in both frames, the index
level is preserved as an index level in the resulting DataFrame.
Note: When DataFrames are merged using only some of the levels of a MultiIndex, the extra levels will be
dropped from the resulting merge. In order to preserve those levels, use reset_index on those level names
to move those levels to columns prior to doing the merge.
Note: If a string matches both a column name and an index level name, then a warning is issued and the
column takes precedence. This will result in an ambiguity error in a future version.
The merge suffixes argument takes a tuple or list of strings to append to overlapping column names in the
input DataFrames to disambiguate the result columns:
A list or tuple of DataFrames can also be passed to join() to join them together on their indexes.
Another fairly common situation is to have two like-indexed (or similarly indexed) Series or DataFrame
objects and wanting to “patch” values in one object from values for matching indices in the other. Here is
an example:
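The method being described here is combine_first(); a minimal sketch with two small frames:
df1 = pd.DataFrame([[np.nan, 3.0, 5.0],
                    [-4.6, np.nan, np.nan],
                    [np.nan, 7.0, np.nan]])
df2 = pd.DataFrame([[-42.6, np.nan, -8.2],
                    [-5.0, 1.6, 4.0]], index=[1, 2])
df1.combine_first(df2)   # take values from df2 only where df1 is missing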
Note that this method only takes values from the right DataFrame if they are missing in the left DataFrame.
A related method, update(), alters non-NA values in place:
In [129]: df1.update(df2)
A merge_ordered() function allows combining time series and other ordered data. In particular it has an
optional fill_method keyword to fill/interpolate missing data:
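A sketch (the frames and the left_by grouping column are illustrative):
left = pd.DataFrame({'k': ['K0', 'K1', 'K1', 'K2'],
                     'lv': [1, 2, 3, 4],
                     's': ['a', 'b', 'c', 'd']})
right = pd.DataFrame({'k': ['K1', 'K2', 'K4'], 'rv': [1, 2, 3]})
pd.merge_ordered(left, right, fill_method='ffill', left_by='s')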
Merging asof
In [135]: trades
Out[135]:
time ticker price quantity
0 2016-05-25 13:30:00.023 MSFT 51.95 75
1 2016-05-25 13:30:00.038 MSFT 51.95 155
2 2016-05-25 13:30:00.048 GOOG 720.77 100
3 2016-05-25 13:30:00.048 GOOG 720.92 100
4 2016-05-25 13:30:00.048 AAPL 98.00 100
In [136]: quotes
Out[136]:
time ticker bid ask
0 2016-05-25 13:30:00.023 GOOG 720.50 720.93
1 2016-05-25 13:30:00.023 MSFT 51.95 51.96
2 2016-05-25 13:30:00.030 MSFT 51.97 51.98
3 2016-05-25 13:30:00.041 MSFT 51.99 52.00
4 2016-05-25 13:30:00.048 GOOG 720.50 720.93
5 2016-05-25 13:30:00.049 AAPL 97.99 98.01
6 2016-05-25 13:30:00.072 GOOG 720.50 720.88
7 2016-05-25 13:30:00.075 MSFT 52.01 52.03
By default we are taking the asof of the quotes.
In [137]: pd.merge_asof(trades, quotes,
.....: on='time',
.....: by='ticker')
.....:
Out[137]:
time ticker price quantity bid ask
0 2016-05-25 13:30:00.023 MSFT 51.95 75 51.95 51.96
1 2016-05-25 13:30:00.038 MSFT 51.95 155 51.97 51.98
2 2016-05-25 13:30:00.048 GOOG 720.77 100 720.50 720.93
3 2016-05-25 13:30:00.048 GOOG 720.92 100 720.50 720.93
4 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN
We only asof within 2ms between the quote time and the trade time.
In [138]: pd.merge_asof(trades, quotes,
.....: on='time',
.....: by='ticker',
.....: tolerance=pd.Timedelta('2ms'))
We only asof within 10ms between the quote time and the trade time and we exclude exact matches on time.
Note that though we exclude the exact matches (of the quotes), prior quotes do propagate to that point in
time.
In [1]: df
Out[1]:
date variable value
0 2000-01-03 A 0.259126
1 2000-01-04 A 0.774683
2 2000-01-05 A 0.713390
3 2000-01-03 B -0.837638
4 2000-01-04 B 0.064089
5 2000-01-05 B -0.034934
6 2000-01-03 C -0.452395
7 2000-01-04 C 2.488398
8 2000-01-05 C 1.429389
9 2000-01-03 D 0.547623
10 2000-01-04 D 0.008959
11 2000-01-05 D 1.097237
For the curious here is how the above DataFrame was created:
import pandas.util.testing as tm

tm.N = 3

def unpivot(frame):
    N, K = frame.shape
    data = {'value': frame.to_numpy().ravel('F'),
            'variable': np.asarray(frame.columns).repeat(N),
            'date': np.tile(np.asarray(frame.index), K)}
    return pd.DataFrame(data, columns=['date', 'variable', 'value'])

df = unpivot(tm.makeTimeDataFrame())
But suppose we wish to do time series operations with the variables. A better representation would be where
the columns are the unique variables and an index of dates identifies individual observations. To reshape
the data into this form, we use the DataFrame.pivot() method (also implemented as a top level function
pivot()):
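For the df shown above, that call would look roughly like:
df.pivot(index='date', columns='variable', values='value')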
If the values argument is omitted, and the input DataFrame has more than one column of values which are
not used as column or index inputs to pivot, then the resulting “pivoted” DataFrame will have hierarchical
columns whose topmost level indicates the respective value column:
In [6]: pivoted
Out[6]:
               value                                   value2
variable           A         B         C         D         A         B         C         D
date
2000-01-03  0.259126 -0.837638 -0.452395  0.547623  0.518252 -1.675276 -0.904790  1.095246
2000-01-04  0.774683  0.064089  2.488398  0.008959  1.549366  0.128178  4.976796  0.017919
2000-01-05  0.713390 -0.034934  1.429389  1.097237  1.426780 -0.069868  2.858779  2.194473
In [7]: pivoted['value2']
Out[7]:
variable A B C D
date
2000-01-03 0.518252 -1.675276 -0.904790 1.095246
2000-01-04 1.549366 0.128178 4.976796 0.017919
2000-01-05 1.426780 -0.069868 2.858779 2.194473
Note that this returns a view on the underlying data in the case where the data are homogeneously-typed.
Note: pivot() will error with a ValueError: Index contains duplicate entries, cannot reshape
if the index/column pair is not unique. In this case, consider using pivot_table() which is a generalization
of pivot that can handle duplicate values for one index/column pair.
Closely related to the pivot() method are the related stack() and unstack() methods available on Series
and DataFrame. These methods are designed to work together with MultiIndex objects (see the section on
hierarchical indexing). Here are essentially what these methods do:
• stack: “pivot” a level of the (possibly hierarchical) column labels, returning a DataFrame with an
index with a new inner-most level of row labels.
• unstack: (inverse operation of stack) “pivot” a level of the (possibly hierarchical) row index to the
column axis, producing a reshaped DataFrame with a new inner-most level of column labels.
The clearest way to explain is by example. Let’s take a prior example data set from the hierarchical indexing
section:
In [12]: df2
Out[12]:
A B
first second
The stack function “compresses” a level in the DataFrame’s columns to produce either:
• A Series, in the case of a simple column Index.
• A DataFrame, in the case of a MultiIndex in the columns.
If the columns have a MultiIndex, you can choose which level to stack. The stacked level becomes the new
lowest level in a MultiIndex on the columns:
In [14]: stacked
Out[14]:
first second
bar one A -1.427767
B 1.011174
two A -0.227837
B 0.260297
baz one A -0.664499
B -1.085553
two A -1.392521
B -0.426500
dtype: float64
With a “stacked” DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack
is unstack, which by default unstacks the last level:
In [15]: stacked.unstack()
Out[15]:
A B
first second
bar one -1.427767 1.011174
two -0.227837 0.260297
baz one -0.664499 -1.085553
two -1.392521 -0.426500
In [16]: stacked.unstack(1)
Out[16]:
second one two
first
bar A -1.427767 -0.227837
B 1.011174 0.260297
baz A -0.664499 -1.392521
B -1.085553 -0.426500
In [17]: stacked.unstack(0)
Out[17]:
first bar baz
second
one A -1.427767 -0.664499
B 1.011174 -1.085553
two A -0.227837 -1.392521
B 0.260297 -0.426500
If the indexes have names, you can use the level names instead of specifying the level numbers:
In [18]: stacked.unstack('second')
Out[18]:
second one two
first
bar A -1.427767 -0.227837
B 1.011174 0.260297
baz A -0.664499 -1.392521
B -1.085553 -0.426500
Notice that the stack and unstack methods implicitly sort the index levels involved. Hence a call to stack
and then unstack, or vice versa, will result in a sorted copy of the original DataFrame or Series:
In [19]: index = pd.MultiIndex.from_product([[2, 1], ['a', 'b']])
In [21]: df
Out[21]:
A
2 a -1.727109
b -1.966122
1 a -0.004308
b 0.627249
Multiple levels
You may also stack or unstack more than one level at a time by passing a list of levels, in which case the
end result is as if each level in the list were processed individually.
In [23]: columns = pd.MultiIndex.from_tuples([
....: ('A', 'cat', 'long'), ('B', 'cat', 'long'),
....: ('A', 'dog', 'short'), ('B', 'dog', 'short')],
....: names=['exp', 'animal', 'hair_length']
....: )
....:
In [25]: df
Out[25]:
exp A B A B
animal cat cat dog dog
hair_length long long short short
0 -2.260817 0.023953 -1.328037 -0.091360
1 -0.063483 0.691641 1.062240 -1.912934
2 -0.967661 1.438160 1.796396 0.364482
3 0.384514 0.774313 -1.215737 1.533214
# df.stack(level=['animal', 'hair_length'])
# from above is equivalent to:
In [27]: df.stack(level=[1, 2])
Out[27]:
exp A B
animal hair_length
0 cat long -2.260817 0.023953
dog short -1.328037 -0.091360
1 cat long -0.063483 0.691641
dog short 1.062240 -1.912934
2 cat long -0.967661 1.438160
dog short 1.796396 0.364482
3 cat long 0.384514 0.774313
dog short -1.215737 1.533214
Missing data
These functions are intelligent about handling missing data and do not expect each subgroup within the
hierarchical index to have the same set of labels. They also can handle the index being unsorted (but you
can make it sorted by calling sort_index, of course). Here is a more complex example:
In [32]: df2
Out[32]:
exp A B A
animal cat dog cat dog
first second
bar one -1.436953 0.561814 -0.346880 -0.546060
two -1.993054 -0.689365 -0.877031 1.507935
baz one -1.866380 0.043384 0.252683 -0.479004
foo one -1.613149 -0.622599 0.291003 0.792238
two 0.807151 -0.758613 -2.393856 1.098272
qux two -0.411186 1.584705 -1.042868 -0.295906
As mentioned above, stack can be called with a level argument to select which level in the columns to
stack:
In [33]: df2.stack('exp')
Out[33]:
animal cat dog
first second exp
bar one A -1.436953 -0.546060
B -0.346880 0.561814
two A -1.993054 1.507935
B -0.877031 -0.689365
baz one A -1.866380 -0.479004
B 0.252683 0.043384
foo one A -1.613149 0.792238
B 0.291003 -0.622599
two A 0.807151 1.098272
B -2.393856 -0.758613
qux two A -0.411186 -0.295906
B -1.042868 1.584705
In [34]: df2.stack('animal')
Out[34]:
exp A B
In [36]: df3
Out[36]:
exp B
animal dog cat
first second
bar one 0.561814 -0.346880
two -0.689365 -0.877031
foo one -0.622599 0.291003
qux two 1.584705 -1.042868
In [37]: df3.unstack()
Out[37]:
exp B
animal dog cat
second one two one two
first
bar 0.561814 -0.689365 -0.346880 -0.877031
foo -0.622599 NaN 0.291003 NaN
qux NaN 1.584705 NaN -1.042868
New in version 0.18.0.
Alternatively, unstack takes an optional fill_value argument, for specifying the value of missing data.
In [38]: df3.unstack(fill_value=-1e9)
Out[38]:
exp B
animal dog cat
second one two one two
first
bar 5.618142e-01 -6.893650e-01 -3.468802e-01 -8.770315e-01
foo -6.225992e-01 -1.000000e+09 2.910029e-01 -1.000000e+09
qux -1.000000e+09 1.584705e+00 -1.000000e+09 -1.042868e+00
With a MultiIndex
Unstacking when the columns are a MultiIndex is also careful about doing the right thing:
In [39]: df[:3].unstack(0)
Out[39]:
exp A B A
animal cat dog cat dog
first bar baz bar baz bar baz bar baz
second
one -1.436953 -1.86638 0.561814 0.043384 -0.346880 0.252683 -0.546060 -0.479004
two -1.993054 NaN -0.689365 NaN -0.877031 NaN 1.507935 NaN
In [40]: df2.unstack(1)
Out[40]:
exp A B A
animal cat dog cat dog
second one two one two one two one two
first
bar -1.436953 -1.993054 0.561814 -0.689365 -0.346880 -0.877031 -0.546060 1.507935
baz -1.866380 NaN 0.043384 NaN 0.252683 NaN -0.479004 NaN
foo -1.613149 0.807151 -0.622599 -0.758613 0.291003 -2.393856 0.792238 1.098272
qux NaN -0.411186 NaN 1.584705 NaN -1.042868 NaN -0.295906
The top-level melt() function and the corresponding DataFrame.melt() are useful to massage a DataFrame
into a format where one or more columns are identifier variables, while all other columns, considered measured
variables, are “unpivoted” to the row axis, leaving just two non-identifier columns, “variable” and “value”.
The names of those columns can be customized by supplying the var_name and value_name parameters.
For instance,
In [41]: cheese = pd.DataFrame({'first': ['John', 'Mary'],
....: 'last': ['Doe', 'Bo'],
....: 'height': [5.5, 6.0],
....: 'weight': [130, 150]})
....:
In [42]: cheese
Out[42]:
first last height weight
0 John Doe 5.5 130
1 Mary Bo 6.0 150
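For instance (a sketch using the cheese frame just shown):
cheese.melt(id_vars=['first', 'last'])
cheese.melt(id_vars=['first', 'last'], var_name='quantity')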
In [47]: dft
Out[47]:
A1970 A1980 B1970 B1980 X id
0 a d 2.5 3.2 -2.013095 0
1 b e 1.2 1.3 -1.711797 1
2 c f 0.7 0.1 0.975018 2
It should be no shock that combining pivot / stack / unstack with GroupBy and the basic Series and
DataFrame statistical functions can produce some very expressive and fast data manipulations.
In [49]: df
Out[49]:
exp A B A
animal cat dog cat dog
first second
bar one -1.436953 0.561814 -0.346880 -0.546060
In [50]: df.stack().mean(1).unstack()
Out[50]:
animal cat dog
first second
bar one -0.891917 0.007877
two -1.435043 0.409285
baz one -0.806849 -0.217810
two 0.753528 -0.298146
foo one -0.661073 0.084819
two -0.793352 0.169830
qux one 0.074211 -0.438394
two -0.727027 0.644399
In [52]: df.stack().groupby(level=1).mean()
Out[52]:
exp A B
second
one -0.743827 0.031543
two 0.180838 -0.499969
In [53]: df.mean().unstack(0)
Out[53]:
exp A B
animal
cat -0.859011 -0.262870
dog 0.296022 -0.205557
While pivot() provides general purpose pivoting with various data types (strings, numerics, etc.), pandas
also provides pivot_table() for pivoting with aggregation of numeric data.
The function pivot_table() can be used to create spreadsheet-style pivot tables. See the cookbook for some
advanced strategies.
It takes a number of arguments:
• data: a DataFrame object.
• values: a column or a list of columns to aggregate.
• index: a column, Grouper, array which has the same length as data, or list of them. Keys to group by
on the pivot table index. If an array is passed, it is used in the same manner as column values.
• columns: a column, Grouper, array which has the same length as data, or list of them. Keys to group
by on the pivot table column. If an array is passed, it is used in the same manner as column values.
• aggfunc: function to use for aggregation, defaulting to numpy.mean.
Consider a data set like this:
In [56]: df
Out[56]:
A B C D E F
0 one A foo 1.772891 -1.594231 2013-01-01
1 one B foo 0.498050 -1.165583 2013-02-01
2 two C foo 0.476510 0.451407 2013-03-01
3 three A bar 0.877022 -1.724523 2013-04-01
4 one B bar -0.301831 2.097509 2013-05-01
5 one C bar -0.733711 1.078343 2013-06-01
6 two A foo 0.826360 -0.394125 2013-07-01
7 three B foo 0.466766 -0.726083 2013-08-01
8 one C foo 1.634589 0.490827 2013-09-01
9 one A bar 0.958132 -0.559443 2013-10-01
10 two B bar -0.970543 -0.395876 2013-11-01
11 three C bar -0.932172 -0.927091 2013-12-01
12 one A foo -1.273567 -0.417776 2013-01-15
13 one B foo 0.270030 -0.708679 2013-02-15
14 two C foo -1.388749 -1.557855 2013-03-15
15 three A bar 2.090035 -1.742504 2013-04-15
16 one B bar -0.815984 -0.201498 2013-05-15
17 one C bar -0.020855 0.070382 2013-06-15
18 two A foo -0.108292 -0.749683 2013-07-15
19 three B foo 0.093700 1.434318 2013-08-15
20 one C foo -0.851281 1.226799 2013-09-15
21 one A bar -2.154727 -0.706492 2013-10-15
Also, you can use Grouper for index and columns keywords. For detail of Grouper, see Grouping with a
Grouper specification.
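The table object printed below is not shown being created; it was presumably built roughly as follows (with values omitted, both numeric columns D and E are aggregated with the default mean):
table = pd.pivot_table(df, index=['A', 'B'], columns=['C'])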
You can render a nice output of the table omitting the missing values by calling to_string if you wish:
In [63]: print(table.to_string(na_rep=''))
D E
C bar foo bar foo
A B
one A -0.598297 0.249662 -0.632968 -1.006004
B -0.558907 0.384040 0.948005 -0.937131
C -0.377283 0.391654 0.574362 0.858813
three A 1.483528 -1.733513
B 0.280233 0.354117
C -0.954274 -0.504514
two A 0.359034 -0.571904
B -0.545728 -0.048381
C -0.456119 -0.553224
Note that pivot_table is also available as an instance method on DataFrame, i.e. DataFrame.pivot_table().
Adding margins
If you pass margins=True to pivot_table, special All columns and rows will be added with partial group
aggregates across the categories on the rows and columns:
Use crosstab() to compute a cross-tabulation of two (or more) factors. By default crosstab computes a
frequency table of the factors unless an array of values and an aggregation function are passed.
It takes a number of arguments
• index: array-like, values to group by in the rows.
• columns: array-like, values to group by in the columns.
• values: array-like, optional, array of values to aggregate according to the factors.
• aggfunc: function, optional, If no values array is passed, computes a frequency table.
• rownames: sequence, default None, must match number of row arrays passed.
• colnames: sequence, default None, if passed, must match number of column arrays passed.
• margins: boolean, default False, Add row/column margins (subtotals)
• normalize: boolean, {‘all’, ‘index’, ‘columns’}, or {0,1}, default False. Normalize by dividing all
values by the sum of values.
Any Series passed will have their name attributes used unless row or column names for the cross-tabulation
are specified.
For example:
In [65]: foo, bar, dull, shiny, one, two = 'foo', 'bar', 'dull', 'shiny', 'one', 'two'
In [71]: df
Out[71]:
A B C
0 1 3 1.0
1 2 3 1.0
2 2 4 NaN
3 2 4 1.0
4 2 4 1.0
Normalization
normalize can also normalize values within each row or within each column:
crosstab can also be passed a third Series and an aggregation function (aggfunc) that will be applied to
the values of the third Series within each group defined by the first two Series:
Adding margins
4.5.7 Tiling
The cut() function computes groupings for the values of the input array and is often used to transform
continuous variables to discrete or categorical variables:
In [80]: ages = np.array([10, 15, 13, 12, 23, 25, 28, 59, 60])
Categories (3, interval[float64]): [(9.95, 26.667] < (26.667, 43.333] < (43.333, 60.0]]
If the bins keyword is an integer, then equal-width bins are formed. Alternatively we can specify custom
bin-edges:
In [83]: c
Out[83]:
[(0, 18], (0, 18], (0, 18], (0, 18], (18, 35], (18, 35], (18, 35], (35, 70], (35, 70]]
Categories (3, interval[int64]): [(0, 18] < (18, 35] < (35, 70]]
To convert a categorical variable into a “dummy” or “indicator” DataFrame, for example a column in a
DataFrame (a Series) which has k distinct values, you can derive a DataFrame containing k columns of 1s and
0s using get_dummies():
In [85]: pd.get_dummies(df['key'])
Out[85]:
a b c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
5 0 1 0
Sometimes it’s useful to prefix the column names, for example when merging the result with the original
DataFrame:
In [86]: dummies = pd.get_dummies(df['key'], prefix='key')
In [87]: dummies
Out[87]:
key_a key_b key_c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
5 0 1 0
In [88]: df[['data1']].join(dummies)
Out[88]:
data1 key_a key_b key_c
0 0 0 1 0
1 1 0 1 0
2 2 1 0 0
3 3 0 0 1
4 4 1 0 0
5 5 0 1 0
This function is often used along with discretization functions like cut:
In [90]: values
Out[90]:
array([-0.7350435 , 1.42920375, -1.11519984, -0.97015174, -1.19270064,
0.02125661, -0.20556342, -0.66677255, 2.12277401, -0.10814128])
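A sketch of that combination (the bin edges here are illustrative; values falling outside every bin produce all-zero indicator rows):
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
pd.get_dummies(pd.cut(values, bins))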
In [94]: pd.get_dummies(df)
Out[94]:
C A_a A_b B_b B_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
All non-object columns are included untouched in the output. You can control the columns that are encoded
with the columns keyword.
Notice that the B column is still included in the output, it just hasn’t been encoded. You can drop B before
calling get_dummies if you don’t want to include it in the output.
As with the Series version, you can pass values for the prefix and prefix_sep. By default the column
name is used as the prefix, and ‘_’ as the prefix separator. You can specify prefix and prefix_sep in 3
ways:
• string: Use the same value for prefix or prefix_sep for each column to be encoded.
• list: Must be the same length as the number of columns being encoded.
• dict: Mapping column name to prefix.
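The outputs below were likely produced by calls along these lines (a sketch of the three forms; the prefix values match the column names shown):
simple = pd.get_dummies(df, prefix='new_prefix')                       # one string for all columns
from_list = pd.get_dummies(df, prefix=['from_A', 'from_B'])            # one prefix per encoded column
from_dict = pd.get_dummies(df, prefix={'B': 'from_B', 'A': 'from_A'})  # mapping column -> prefix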
In [97]: simple
Out[97]:
C new_prefix_a new_prefix_b new_prefix_b new_prefix_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
In [99]: from_list
Out[99]:
C from_A_a from_A_b from_B_b from_B_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
In [101]: from_dict
Out[101]:
C from_A_a from_A_b from_B_b from_B_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
In [103]: pd.get_dummies(s)
Out[103]:
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
4 1 0 0
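The Out[104] listing below (columns b and c only) presumably comes from dropping the first level, e.g.:
pd.get_dummies(s, drop_first=True)   # keep k-1 of the k levels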
Out[104]:
b c
0 0 0
1 1 0
2 0 1
3 0 0
4 0 0
When a column contains only one level, it will be omitted in the result.
In [105]: df = pd.DataFrame({'A': list('aaaaa'), 'B': list('ababc')})
In [106]: pd.get_dummies(df)
Out[106]:
A_a B_a B_b B_c
0 1 1 0 0
1 1 0 1 0
2 1 1 0 0
3 1 0 1 0
4 1 0 0 1
In [111]: x
Out[111]:
0 A
1 A
2 NaN
3 B
4 3.14
5 inf
dtype: object
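The labels/uniques pair shown below is what pd.factorize() returns; the call was presumably:
labels, uniques = pd.factorize(x)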
In [113]: labels
Out[113]: array([ 0, 0, -1, 1, 2, 3])
In [114]: uniques
Out[114]: Index(['A', 'B', 3.14, inf], dtype='object')
Note that factorize is similar to numpy.unique, but differs in its handling of NaN:
Note: The following numpy.unique will fail under Python 3 with a TypeError because of an ordering bug.
See also here.
Note: If you just want to handle one column as a categorical variable (like R’s factor), you can use
df["cat_col"] = pd.Categorical(df["col"]) or df["cat_col"] = df["col"].astype("category").
For full docs on Categorical, see the Categorical introduction and the API documentation.
4.5.10 Examples
In this section, we will review frequently asked questions and examples. The column names and relevant
column values are named to correspond with how this DataFrame will be pivoted in the answers below.
In [116]: n = 20
Suppose we wanted to pivot df such that the col values are columns, row values are the index, and the mean of val0 are the values. In particular, the resulting DataFrame should look like:

col   col0   col1   col2   col3  col4
row
row0  0.77  0.605    NaN  0.860  0.65
row2  0.13    NaN  0.395  0.500  0.25
row3   NaN  0.310    NaN  0.545   NaN
row4   NaN  0.100  0.395  0.760  0.24
This solution uses pivot_table(). Also note that aggfunc='mean' is the default. It is included here to be
explicit.
In [122]: df.pivot_table(
.....: values='val0', index='row', columns='col', aggfunc='mean')
.....:
Out[122]:
col col0 col1 col2 col3 col4
row
row0 0.77 0.605 NaN 0.860 0.65
row2 0.13 NaN 0.395 0.500 0.25
row3 NaN 0.310 NaN 0.545 NaN
row4 NaN 0.100 0.395 0.760 0.24
Note that we can also replace the missing values by using the fill_value parameter.
In [123]: df.pivot_table(
.....: values='val0', index='row', columns='col', aggfunc='mean', fill_value=0)
Also note that we can pass in other aggregation functions as well. For example, we can also pass in sum.
In [124]: df.pivot_table(
.....: values='val0', index='row', columns='col', aggfunc='sum', fill_value=0)
.....:
Out[124]:
col col0 col1 col2 col3 col4
row
row0 0.77 1.21 0.00 0.86 0.65
row2 0.13 0.00 0.79 0.50 0.50
row3 0.00 0.31 0.00 1.09 0.00
row4 0.00 0.10 0.79 1.52 0.24
Another aggregation we can do is calculate the frequency with which the columns and rows occur together, a.k.a. “cross tabulation”. To do this, we can pass size to the aggfunc parameter.
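A sketch of that cross-tabulation via pivot_table (size counts the rows in each cell):
df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')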
We can also perform multiple aggregations. For example, to perform both a sum and mean, we can pass in
a list to the aggfunc argument.
In [126]: df.pivot_table(
.....: values='val0', index='row', columns='col', aggfunc=['mean', 'sum'])
.....:
Out[126]:
mean sum
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 0.77 0.605 NaN 0.860 0.65 0.77 1.21 NaN 0.86 0.65
row2 0.13 NaN 0.395 0.500 0.25 0.13 NaN 0.79 0.50 0.50
row3 NaN 0.310 NaN 0.545 NaN NaN 0.31 NaN 1.09 NaN
row4 NaN 0.100 0.395 0.760 0.24 NaN 0.10 0.79 1.52 0.24
Note that to aggregate over multiple value columns, we can pass in a list to the values parameter.
In [127]: df.pivot_table(
.....: values=['val0', 'val1'], index='row', columns='col', aggfunc=['mean'])
.....:
Out[127]:
mean
val0 val1
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 0.77 0.605 NaN 0.860 0.65 0.01 0.745 NaN 0.010 0.02
row2 0.13 NaN 0.395 0.500 0.25 0.45 NaN 0.34 0.440 0.79
row3 NaN 0.310 NaN 0.545 NaN NaN 0.230 NaN 0.075 NaN
row4 NaN 0.100 0.395 0.760 0.24 NaN 0.070 0.42 0.300 0.46
Note that to subdivide over multiple columns, we can pass in a list to the columns parameter.
In [128]: df.pivot_table(
.....: values=['val0'], index='row', columns=['item', 'col'], aggfunc=['mean'])
.....:
Out[128]:
mean
val0
item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
row
row0 NaN NaN NaN 0.77 NaN NaN NaN NaN NaN 0.605 0.86 0.65
row2 0.35 NaN 0.37 NaN NaN 0.44 NaN NaN 0.13 NaN 0.50 0.13
row3 NaN NaN NaN NaN 0.31 NaN 0.81 NaN NaN NaN 0.28 NaN
row4 0.15 0.64 NaN NaN 0.10 0.64 0.88 0.24 NaN NaN NaN NaN
In [132]: df
Out[132]:
keys values
0 panda1 [eats, shoots]
1 panda2 [shoots, leaves]
2 panda3 [eats, leaves]
We can ‘explode’ the values column, transforming each list-like to a separate row, by using explode().
This will replicate the index values from the original row:
In [133]: df['values'].explode()
Out[133]:
0 eats
0 shoots
1 shoots
1 leaves
2 eats
2 leaves
Name: values, dtype: object
In [134]: df.explode('values')
Out[134]:
keys values
0 panda1 eats
0 panda1 shoots
1 panda2 shoots
1 panda2 leaves
2 panda3 eats
2 panda3 leaves
Series.explode() will replace empty lists with np.nan and preserve scalar entries. The dtype of the
resulting Series is always object.
In [135]: s = pd.Series([[1, 2, 3], 'foo', [], ['a', 'b']])
In [136]: s
Out[136]:
0 [1, 2, 3]
1 foo
2 []
3 [a, b]
dtype: object
In [137]: s.explode()
Out[137]:
0 1
0 2
0 3
1 foo
2 NaN
3 a
3 b
dtype: object
Here is a typical use case. You have comma-separated strings in a column and want to expand this.
In [139]: df
Out[139]:
Creating a long-form DataFrame is now straightforward using explode and chained operations:
In [140]: df.assign(var1=df.var1.str.split(',')).explode('var1')
Out[140]:
var1 var2
0 a 1
0 b 1
0 c 1
1 d 2
1 e 2
1 f 2
Series and Index are equipped with a set of string processing methods that make it easy to operate on each
element of the array. Perhaps most importantly, these methods exclude missing/NA values automatically.
These are accessed via the str attribute and generally have names matching the equivalent (scalar) built-in
string methods:
In [1]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
In [2]: s.str.lower()
Out[2]:
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
In [3]: s.str.upper()
Out[3]:
0 A
1 B
2 C
3 AABA
4 BACA
5 NaN
6 CABA
7 DOG
8 CAT
dtype: object
In [4]: s.str.len()
Out[4]:
0 1.0
1 1.0
2 1.0
3 4.0
4 4.0
5 NaN
6 4.0
7 3.0
8 3.0
dtype: float64
In [5]: idx = pd.Index([' jack', 'jill ', ' jesse ', 'frank'])
In [6]: idx.str.strip()
Out[6]: Index(['jack', 'jill', 'jesse', 'frank'], dtype='object')
In [7]: idx.str.lstrip()
Out[7]: Index(['jack', 'jill ', 'jesse ', 'frank'], dtype='object')
In [8]: idx.str.rstrip()
Out[8]: Index([' jack', 'jill', ' jesse', 'frank'], dtype='object')
The string methods on Index are especially useful for cleaning up or transforming DataFrame columns. For
instance, you may have columns with leading or trailing whitespace:
In [10]: df
Out[10]:
Column A Column B
0 -0.061542 0.528296
1 -0.952849 2.392803
2 -1.524186 2.642204
In [12]: df.columns.str.lower()
Out[12]: Index([' column a ', ' column b '], dtype='object')
These string methods can then be used to clean up the columns as needed. Here we are removing leading
and trailing whitespaces, lower casing all names, and replacing any remaining whitespaces with underscores:
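The cleanup that produces the column_a/column_b names below is likely a chain along these lines:
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')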
In [14]: df
Out[14]:
column_a column_b
Note: If you have a Series where lots of elements are repeated (i.e. the number of unique elements in
the Series is a lot smaller than the length of the Series), it can be faster to convert the original Series to
one of type category and then use .str.<method> or .dt.<property> on that. The performance difference
comes from the fact that, for Series of type category, the string operations are done on the .categories
and not on each element of the Series.
Please note that a Series of type category with string .categories has some limitations in comparison
to Series of type string (e.g. you can’t add strings to each other: s + " " + s won’t work if s is a Series
of type category). Also, .str methods which operate on elements of type list are not available on such a
Series.
Warning: Before v.0.25.0, the .str-accessor did only the most rudimentary type checks. Starting with
v.0.25.0, the type of the Series is inferred and the allowed types (i.e. strings) are enforced more rigorously.
Generally speaking, the .str accessor is intended to work only on strings. With very few exceptions,
other uses are not supported, and may be disabled at a later point.
In [16]: s2.str.split('_')
Out[16]:
0 [a, b, c]
1 [c, d, e]
2 NaN
3 [f, g, h]
dtype: object
In [18]: s2.str.split('_').str[1]
Out[18]:
0 b
1 d
2 NaN
3 g
dtype: object
It is easy to expand this to return a DataFrame using expand.
rsplit is similar to split except it works in the reverse direction, i.e., from the end of the string to the
beginning of the string:
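A sketch of both variants on the s2 Series used above ('a_b_c', 'c_d_e', np.nan, 'f_g_h'):
s2.str.split('_', expand=True)          # one column per piece
s2.str.rsplit('_', expand=True, n=1)    # split only once, starting from the right
(The s3 listing that follows belongs to the str.replace() examples, where a regular expression substitutes matching prefixes with 'XX-XX '.)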
In [23]: s3
Out[23]:
0 A
1 B
2 C
3 Aaba
4 Baca
5
6 NaN
7 CABA
8 dog
9 cat
dtype: object
0 A
1 B
2 C
3 XX-XX ba
4 XX-XX ca
5
6 NaN
7 XX-XX BA
8 XX-XX
9 XX-XX t
dtype: object
Some caution must be taken to keep regular expressions in mind! For example, the following code will cause
trouble because of the regular expression meaning of $:
# Consider the following badly formatted financial data
In [25]: dollars = pd.Series(['12', '-$10', '$10,000'])
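A naive dollars.str.replace('$', '') changes nothing, because '$' is interpreted as the end-of-string anchor. Escaping the dollar sign, likely what produced Out[30] below, strips the sign from '-$10':
dollars.str.replace(r'-\$', '-')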
Out[30]:
0 12
1 -10
2 $10,000
dtype: object
New in version 0.20.0.
The replace method can also take a callable as replacement. It is called on every pat using re.sub(). The
callable should expect one positional argument (a regex object) and return a string.
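A minimal sketch of a callable replacement (here reversing each lowercase word; the data is hypothetical):
pat = r'[a-z]+'
def repl(m):
    return m.group(0)[::-1]   # reverse the matched text
pd.Series(['foo 123', 'bar baz', np.nan]).str.replace(pat, repl)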
In [37]: import re
Including a flags argument when calling replace with a compiled regular expression object will raise a
ValueError.
4.6.2 Concatenation
There are several ways to concatenate a Series or Index, either with itself or others, all based on cat() (and Index.str.cat, respectively).
In [42]: s.str.cat(sep=',')
Out[42]: 'a,b,c,d'
If not specified, the keyword sep for the separator defaults to the empty string, sep='':
In [43]: s.str.cat()
Out[43]: 'abcd'
By default, missing values are ignored. Using na_rep, they can be given a representation:
In [44]: t = pd.Series(['a', 'b', np.nan, 'd'])
In [45]: t.str.cat(sep=',')
Out[45]: 'a,b,d'
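Adding na_rep is then a small variation (a sketch):
t.str.cat(sep=',', na_rep='-')   # 'a,b,-,d'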
The first argument to cat() can be a list-like object, provided that it matches the length of the calling
Series (or Index).
Missing values on either side will result in missing values in the result as well, unless na_rep is specified:
In [48]: s.str.cat(t)
Out[48]:
0 aa
1 bb
2 NaN
3 dd
dtype: object
In [51]: s
Out[51]:
0 a
1 b
2 c
3 d
dtype: object
In [52]: d
Out[52]:
0 1
0 a a
1 b b
2 NaN c
3 d d
3 ddd
dtype: object
In [55]: s
Out[55]:
0 a
1 b
2 c
3 d
dtype: object
In [56]: u
Out[56]:
1 b
3 d
0 a
2 c
dtype: object
In [57]: s.str.cat(u)
Out[57]:
0 ab
1 bd
2 ca
3 dc
dtype: object
Warning: If the join keyword is not passed, the method cat() will currently fall back to the behavior
before version 0.23.0 (i.e. no alignment), but a FutureWarning will be raised if any of the involved
indexes differ, since this default will change to join='left' in a future version.
The usual options are available for join (one of 'left', 'outer', 'inner', 'right'). In particular,
alignment also means that the different lengths do not need to coincide anymore.
In [59]: v = pd.Series(['z', 'a', 'b', 'd', 'e'], index=[-1, 0, 1, 3, 4])
In [60]: s
Out[60]:
0 a
1 b
2 c
3 d
dtype: object
In [61]: v
Out[61]:
-1 z
0 a
1 b
3 d
4 e
dtype: object
In [65]: s
Out[65]:
0 a
1 b
2 c
3 d
dtype: object
In [66]: f
Out[66]:
0 1
3 d d
2 NaN c
1 b b
0 a a
Several array-like items (specifically: Series, Index, and 1-dimensional variants of np.ndarray) can be
combined in a list-like container (including iterators, dict-views, etc.).
In [68]: s
Out[68]:
0 a
1 b
2 c
3 d
dtype: object
In [69]: u
Out[69]:
1 b
3 d
0 a
2 c
dtype: object
2 c-ca
3 dddc
4 -e--
dtype: object
If using join='right' on a list-like of others that contains different indexes, the union of these indexes
will be used as the basis for the final concatenation:
In [73]: u.loc[[3]]
Out[73]:
3 d
dtype: object
You can use [] notation to directly index by position locations. If you index past the end of the string, the
result will be a NaN.
In [76]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan,
....: 'CABA', 'dog', 'cat'])
....:
In [77]: s.str[0]
Out[77]:
0 A
1 B
2 C
3 A
4 B
5 NaN
6 C
7 d
8 c
dtype: object
In [78]: s.str[1]
Out[78]:
0 NaN
1 NaN
2 NaN
3 a
4 a
5 NaN
6 A
7 o
8 a
dtype: object
Warning: In version 0.18.0, extract gained the expand argument. When expand=False it returns a
Series, Index, or DataFrame, depending on the subject and regular expression pattern (same behavior
as pre-0.18.0). When expand=True it always returns a DataFrame, which is more consistent and less
confusing from the perspective of a user. expand=True is the default since version 0.23.0.
The extract method accepts a regular expression with at least one capture group.
Extracting a regular expression with more than one group returns a DataFrame with one column per group.
Elements that do not match return a row filled with NaN. Thus, a Series of messy strings can be “converted”
into a like-indexed Series or DataFrame of cleaned-up or more useful strings, without necessitating get() to
access tuples or re.match objects. The dtype of the result is always object, even if no match is found and
the result only contains NaN.
Named groups (using the (?P<name>...) regex syntax) can also be used. Note that any capture group names in the regular expression will be used for column names; otherwise capture group numbers will be used.
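A sketch of both forms on a small messy Series (the patterns and data are illustrative only):
s = pd.Series(['a1', 'b2', 'c3'])
s.str.extract(r'([ab])(\d)', expand=False)                      # columns 0 and 1
s.str.extract(r'(?P<letter>[ab])(?P<digit>\d)', expand=False)   # columns named letter and digit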
Extracting a regular expression with one group returns a DataFrame with one column if expand=True.
Calling on an Index with a regex with exactly one capture group returns a DataFrame with one column if
expand=True.
In [84]: s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"])
In [85]: s
Out[85]:
A11 a1
B22 b2
C33 c3
dtype: object
Calling on an Index with a regex with more than one capture group returns a DataFrame if expand=True.
The table below summarizes the behavior of extract(expand=False) (input subject in first column, number
of groups in regex in first row)
In [90]: s
Out[90]:
A a1a2
B b1
C c1
dtype: object
Unlike extract, which returns only the first match, the extractall method returns every match. The result of extractall is always a DataFrame with a MultiIndex on its rows. The last level of the MultiIndex is named match and indicates the order in the subject.
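The setup for the extractall listing below is not shown; it was presumably something like this (the name two_groups and the index values are assumptions chosen to match the output):
two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'
s = pd.Series(['a1a2', 'b1', 'c1'], index=['A', 'B', 'C'])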
In [93]: s.str.extractall(two_groups)
Out[93]:
letter digit
match
A 0 a 1
1 a 2
B 0 b 1
C 0 c 1
When each subject string in the Series has exactly one match, extractall(pat).xs(0, level='match') gives the same result as extract(pat):
In [95]: s
Out[95]:
0 a3
1 b3
In [97]: extract_result
Out[97]:
letter digit
0 a 3
1 b 3
2 c 2
In [99]: extractall_result
Out[99]:
letter digit
match
0 0 a 3
1 0 b 3
2 0 c 2
The distinction between match and contains is strictness: match relies on strict re.match, while contains
relies on re.search.
Methods like match, contains, startswith, and endswith take an extra na argument so missing values can
be considered True or False:
In [106]: s4 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
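For example (a sketch; na=False makes missing entries count as non-matches):
s4.str.contains('A', na=False)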
You can extract dummy variables from string columns. For example if they are separated by a '|':
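A sketch of the Series and Index variants (the '|'-separated values are assumptions chosen to match the MultiIndex shown below):
s = pd.Series(['a', 'a|b', np.nan, 'a|c'])
s.str.get_dummies(sep='|')          # DataFrame of 0/1 indicator columns a, b, c
idx = pd.Index(['a', 'a|b', np.nan, 'a|c'])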
In [111]: idx.str.get_dummies(sep='|')
Out[111]:
MultiIndex([(1, 0, 0),
(1, 1, 0),
(0, 0, 0),
(1, 0, 1)],
names=['a', 'b', 'c'])
Method Description
cat() Concatenate strings
split() Split strings on delimiter
rsplit() Split strings on delimiter working from the end of the string
get() Index into each element (retrieve i-th element)
join() Join strings in each element of the Series with passed separator
get_dummies() Split strings on the delimiter returning DataFrame of dummy variables
contains() Return boolean array if each string contains pattern/regex
replace() Replace occurrences of pattern/regex/string with some other string or the return
value of a callable given the occurrence
repeat() Duplicate values (s.str.repeat(3) equivalent to x * 3)
pad() Add whitespace to left, right, or both sides of strings
center() Equivalent to str.center
ljust() Equivalent to str.ljust
rjust() Equivalent to str.rjust
zfill() Equivalent to str.zfill
wrap() Split long strings into lines with length less than a given width
slice() Slice each string in the Series
slice_replace() Replace slice in each string with passed value
count() Count occurrences of pattern
startswith() Equivalent to str.startswith(pat) for each element
endswith() Equivalent to str.endswith(pat) for each element
findall() Compute list of all occurrences of pattern/regex for each string
match() Call re.match on each element, returning matched groups as list
In this section, we will discuss missing (also referred to as NA) values in pandas.
Note: The choice of using NaN internally to denote missing data was largely for simplicity and performance
reasons. It differs from the MaskedArray approach of, for example, scikits.timeseries. We are hopeful
that NumPy will soon be able to provide a native NA type solution (similar to R) performant enough to be
used in pandas.
As data comes in many shapes and forms, pandas aims to be flexible with regard to handling missing data.
While NaN is the default missing value marker for reasons of computational speed and convenience, we need
to be able to easily detect this value with data of different types: floating point, integer, boolean, and general
object. In many cases, however, the Python None will arise and we wish to also consider that “missing” or
“not available” or “NA”.
Note: If you want to consider inf and -inf to be “NA” in computations, you can set pandas.options.
mode.use_inf_as_na = True.
In [4]: df
Out[4]:
one two three four five
a 0.070821 -0.093641 0.014099 bar True
c 0.702961 0.870484 -1.521966 bar True
e 0.219802 1.546457 -0.174262 bar True
f 0.831417 1.332054 0.438532 bar True
h 0.786117 -0.304618 0.956171 bar True
In [5]: df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
In [6]: df2
Out[6]:
one two three four five
a 0.070821 -0.093641 0.014099 bar True
b NaN NaN NaN NaN NaN
c 0.702961 0.870484 -1.521966 bar True
d NaN NaN NaN NaN NaN
e 0.219802 1.546457 -0.174262 bar True
f 0.831417 1.332054 0.438532 bar True
g NaN NaN NaN NaN NaN
h 0.786117 -0.304618 0.956171 bar True
To make detecting missing values easier (and across different array dtypes), pandas provides the isna() and
notna() functions, which are also methods on Series and DataFrame objects:
In [7]: df2['one']
Out[7]:
a 0.070821
b NaN
c 0.702961
d NaN
e 0.219802
f 0.831417
g NaN
h 0.786117
Name: one, dtype: float64
In [8]: pd.isna(df2['one'])
Out[8]:
a False
b True
c False
d True
e False
f False
g True
h False
Name: one, dtype: bool
In [9]: df2['four'].notna()
Out[9]:
a True
b False
c True
d False
e True
f True
g False
h True
Name: four, dtype: bool
In [10]: df2.isna()
Out[10]:
one two three four five
a False False False False False
b True True True True True
c False False False False False
d True True True True True
e False False False False False
f False False False False False
g True True True True True
h False False False False False
Warning: One has to be mindful that in Python (and NumPy), the nan's don’t compare equal, but
None's do. Note that pandas/NumPy uses the fact that np.nan != np.nan, and treats None like np.nan.
In [11]: None == None # noqa: E711
Out[11]: True
Because NaN is a float, a column of integers with even one missing value is cast to floating-point dtype (see Support for integer NA for more). pandas provides a nullable integer array, which can be used by explicitly requesting the dtype:
Alternatively, the string alias dtype='Int64' (note the capital "I") can be used.
See Nullable integer data type for more.
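A minimal sketch of requesting the nullable dtype explicitly:
pd.Series([1, 2, np.nan, 4], dtype=pd.Int64Dtype())
pd.Series([1, 2, np.nan, 4], dtype="Int64")   # string alias, equivalent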
Datetimes
For datetime64[ns] types, NaT represents missing values. This is a pseudo-native sentinel value that can be
represented by NumPy in a singular dtype (datetime64[ns]). pandas objects provide compatibility between
NaT and NaN.
In [15]: df2 = df.copy()
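The timestamp column shown below was presumably added along these lines:
df2['timestamp'] = pd.Timestamp('20120101')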
In [17]: df2
Out[17]:
one two three four five timestamp
a 0.070821 -0.093641 0.014099 bar True 2012-01-01
c 0.702961 0.870484 -1.521966 bar True 2012-01-01
e 0.219802 1.546457 -0.174262 bar True 2012-01-01
f 0.831417 1.332054 0.438532 bar True 2012-01-01
h 0.786117 -0.304618 0.956171 bar True 2012-01-01
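The NaN/NaT pattern in the next listing suggests an assignment such as:
df2.loc[['a', 'c', 'h'], ['one', 'timestamp']] = np.nan   # np.nan in a datetime column becomes NaT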
In [19]: df2
Out[19]:
one two three four five timestamp
a NaN -0.093641 0.014099 bar True NaT
c NaN 0.870484 -1.521966 bar True NaT
e 0.219802 1.546457 -0.174262 bar True 2012-01-01
f 0.831417 1.332054 0.438532 bar True 2012-01-01
h NaN -0.304618 0.956171 bar True NaT
In [20]: df2.dtypes.value_counts()
Out[20]:
float64 3
datetime64[ns] 1
object 1
bool 1
dtype: int64
You can insert missing values by simply assigning to containers. The actual missing value used will be chosen
based on the dtype.
For example, numeric containers will always use NaN regardless of the missing value type chosen:
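The float Series below was presumably built like this (assigning None to a numeric container stores NaN):
s = pd.Series([1., 2., 3.])
s.loc[0] = None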
In [23]: s
Out[23]:
0 NaN
1 2.0
2 3.0
dtype: float64
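For object containers, the value you assign is kept as given; the next listing likely comes from something like:
s = pd.Series(["a", "b", "c"])
s.loc[0] = None
s.loc[1] = np.nan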
In [27]: s
Out[27]:
0 None
1 NaN
2 c
dtype: object
Missing values propagate naturally through arithmetic operations between pandas objects.
In [28]: a
Out[28]:
one two
a NaN -0.093641
c NaN 0.870484
e 0.219802 1.546457
f 0.831417 1.332054
h 0.831417 -0.304618
In [29]: b
Out[29]:
one two three
a NaN -0.093641 0.014099
c NaN 0.870484 -1.521966
e 0.219802 1.546457 -0.174262
f 0.831417 1.332054 0.438532
h NaN -0.304618 0.956171
In [30]: a + b
Out[30]:
one three two
a NaN NaN -0.187283
c NaN NaN 1.740968
e 0.439605 NaN 3.092914
f 1.662833 NaN 2.664108
h NaN NaN -0.609235
The descriptive statistics and computational methods discussed in the data structure overview (and listed
here and here) are all written to account for missing data. For example:
• When summing data, NA (missing) values will be treated as zero.
• If the data are all NA, the result will be 0.
• Cumulative methods like cumsum() and cumprod() ignore NA values by default, but preserve them in
the resulting arrays. To override this behaviour and include NA values, use skipna=False.
In [31]: df
Out[31]:
one two three
a NaN -0.093641 0.014099
c NaN 0.870484 -1.521966
e 0.219802 1.546457 -0.174262
f 0.831417 1.332054 0.438532
h NaN -0.304618 0.956171
In [32]: df['one'].sum()
Out[32]: 1.0512188858617544
In [33]: df.mean(1)
Out[33]:
a -0.039771
c -0.325741
e 0.530666
f 0.867334
h 0.325777
dtype: float64
In [34]: df.cumsum()
Out[34]:
one two three
a NaN -0.093641 0.014099
c NaN 0.776843 -1.507866
e 0.219802 2.323300 -1.682128
f 1.051219 3.655354 -1.243596
h NaN 3.350736 -0.287425
In [35]: df.cumsum(skipna=False)
Out[35]:
one two three
a NaN -0.093641 0.014099
c NaN 0.776843 -1.507866
e NaN 2.323300 -1.682128
f NaN 3.655354 -1.243596
h NaN 3.350736 -0.287425
Warning: This behavior is now standard as of v0.22.0 and is consistent with the default in numpy;
previously sum/prod of all-NA or empty Series/DataFrames would return NaN. See v0.22.0 whatsnew
for more.
In [37]: pd.Series([]).sum()
Out[37]: 0.0
The product of an empty or all-NA Series or column of a DataFrame is 1.
In [38]: pd.Series([np.nan]).prod()
Out[38]: 1.0
In [39]: pd.Series([]).prod()
Out[39]: 1.0
NA groups in GroupBy are automatically excluded. This behavior is consistent with R, for example:
In [40]: df
Out[40]:
one two three
In [41]: df.groupby('one').mean()
Out[41]:
two three
one
0.219802 1.546457 -0.174262
0.831417 1.332054 0.438532
See the groupby section here for more information.
pandas objects are equipped with various data manipulation methods for dealing with missing data.
fillna() can “fill in” NA values with non-NA data in a couple of ways, which we illustrate:
Replace NA with a scalar value
In [42]: df2
Out[42]:
one two three four five timestamp
a NaN -0.093641 0.014099 bar True NaT
c NaN 0.870484 -1.521966 bar True NaT
e 0.219802 1.546457 -0.174262 bar True 2012-01-01
f 0.831417 1.332054 0.438532 bar True 2012-01-01
h NaN -0.304618 0.956171 bar True NaT
In [43]: df2.fillna(0)
Out[43]:
one two three four five timestamp
a 0.000000 -0.093641 0.014099 bar True 0
c 0.000000 0.870484 -1.521966 bar True 0
e 0.219802 1.546457 -0.174262 bar True 2012-01-01 00:00:00
f 0.831417 1.332054 0.438532 bar True 2012-01-01 00:00:00
h 0.000000 -0.304618 0.956171 bar True 0
In [44]: df2['one'].fillna('missing')
Out[44]:
a missing
c missing
e 0.219802
f 0.831417
h missing
Name: one, dtype: object
Fill gaps forward or backward
Using the same filling arguments as reindexing, we can propagate non-NA values forward or backward:
In [45]: df
Out[45]:
one two three
a NaN -0.093641 0.014099
c NaN 0.870484 -1.521966
e 0.219802 1.546457 -0.174262
f 0.831417 1.332054 0.438532
h NaN -0.304618 0.956171
In [46]: df.fillna(method='pad')
Out[46]:
one two three
a NaN -0.093641 0.014099
c NaN 0.870484 -1.521966
e 0.219802 1.546457 -0.174262
f 0.831417 1.332054 0.438532
h 0.831417 -0.304618 0.956171
Limit the amount of filling
If we only want consecutive gaps filled up to a certain number of data points, we can use the limit keyword:
In [47]: df
Out[47]:
one two three
a NaN -0.093641 0.014099
c NaN 0.870484 -1.521966
e NaN NaN NaN
f NaN NaN NaN
h NaN -0.304618 0.956171
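A sketch with limit=1, which fills at most one consecutive NaN after each valid observation:
df.fillna(method='pad', limit=1)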
Method Action
pad / ffill Fill values forward
bfill / backfill Fill values backward
With time series data, using pad/ffill is extremely common so that the “last known value” is available at
every time point.
ffill() is equivalent to fillna(method='ffill') and bfill() is equivalent to fillna(method='bfill').
You can also fillna using a dict or Series that is alignable. The labels of the dict or index of the Series must
match the columns of the frame you wish to fill. The use case of this is to fill a DataFrame with the mean
of that column.
In [49]: dff = pd.DataFrame(np.random.randn(10, 3), columns=list('ABC'))
In [53]: dff
Out[53]:
A B C
0 0.632107 -0.123200 0.579811
1 -0.617833 0.730289 -1.930809
2 -1.795903 -2.012672 -0.884042
3 NaN 0.809658 0.727889
4 NaN NaN 1.683552
5 -1.134942 NaN NaN
6 -1.654372 -0.175245 NaN
7 0.332654 -1.208013 NaN
8 0.028692 0.139178 0.877023
9 0.780292 0.682156 0.993475
In [54]: dff.fillna(dff.mean())
Out[54]:
A B C
0 0.632107 -0.123200 0.579811
1 -0.617833 0.730289 -1.930809
2 -1.795903 -2.012672 -0.884042
3 -0.428663 0.809658 0.727889
4 -0.428663 -0.144731 1.683552
5 -1.134942 -0.144731 0.292414
6 -1.654372 -0.175245 0.292414
7 0.332654 -1.208013 0.292414
8 0.028692 0.139178 0.877023
9 0.780292 0.682156 0.993475
In [55]: dff.fillna(dff.mean()['B':'C'])
Out[55]:
A B C
0 0.632107 -0.123200 0.579811
1 -0.617833 0.730289 -1.930809
2 -1.795903 -2.012672 -0.884042
3 NaN 0.809658 0.727889
4 NaN -0.144731 1.683552
5 -1.134942 -0.144731 0.292414
6 -1.654372 -0.175245 0.292414
7 0.332654 -1.208013 0.292414
8 0.028692 0.139178 0.877023
9 0.780292 0.682156 0.993475
This gives the same result as above, but this time the ‘fill’ value, which is a Series, is aligned to the columns of the frame.
You may wish to simply exclude labels from a data set which refer to missing data. To do this, use dropna():
In [57]: df
Out[57]:
one two three
a NaN -0.093641 0.014099
c NaN 0.870484 -1.521966
e NaN 0.000000 0.000000
f NaN 0.000000 0.000000
h NaN -0.304618 0.956171
In [58]: df.dropna(axis=0)
Out[58]:
Empty DataFrame
Columns: [one, two, three]
Index: []
In [59]: df.dropna(axis=1)
Out[59]:
two three
a -0.093641 0.014099
c 0.870484 -1.521966
e 0.000000 0.000000
f 0.000000 0.000000
h -0.304618 0.956171
In [60]: df['one'].dropna()
Out[60]: Series([], Name: one, dtype: float64)
An equivalent dropna() is available for Series. DataFrame.dropna has considerably more options than Series.dropna, which can be examined in the API.
4.7.7 Interpolation
Both Series and DataFrame objects have interpolate() that, by default, performs linear interpolation at
missing data points.
In [61]: ts
Out[61]:
2000-01-31 0.469112
2000-02-29 NaN
2000-03-31 NaN
2000-04-28 NaN
2000-05-31 NaN
...
2007-12-31 -6.950267
2008-01-31 -7.904475
2008-02-29 -6.441779
2008-03-31 -8.184940
2008-04-30 -9.011531
Freq: BM, Length: 100, dtype: float64
In [62]: ts.count()
Out[62]: 66
In [63]: ts.plot()
Out[63]: <matplotlib.axes._subplots.AxesSubplot at 0x14bbd5690>
In [64]: ts.interpolate()
Out[64]:
2000-01-31 0.469112
2000-02-29 0.434469
2000-03-31 0.399826
2000-04-28 0.365184
2000-05-31 0.330541
...
2007-12-31 -6.950267
2008-01-31 -7.904475
2008-02-29 -6.441779
2008-03-31 -8.184940
2008-04-30 -9.011531
Freq: BM, Length: 100, dtype: float64
In [65]: ts.interpolate().count()
Out[65]: 100
In [66]: ts.interpolate().plot()
Out[66]: <matplotlib.axes._subplots.AxesSubplot at 0x14a829590>
Out[67]:
2000-01-31 0.469112
2000-02-29 NaN
2002-07-31 -5.785037
2005-01-31 NaN
2008-04-30 -9.011531
dtype: float64
In [68]: ts2.interpolate()
Out[68]:
2000-01-31 0.469112
2000-02-29 -2.657962
2002-07-31 -5.785037
2005-01-31 -7.398284
2008-04-30 -9.011531
dtype: float64
In [69]: ts2.interpolate(method='time')
Out[69]:
2000-01-31 0.469112
2000-02-29 0.270241
2002-07-31 -5.785037
2005-01-31 -7.190866
2008-04-30 -9.011531
dtype: float64
For a floating-point index, use method='values':
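The Series below was presumably constructed with an uneven float index, e.g.:
ser = pd.Series([0., np.nan, 10.], index=[0., 1., 10.])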
In [70]: ser
Out[70]:
0.0 0.0
1.0 NaN
10.0 10.0
dtype: float64
In [71]: ser.interpolate()
Out[71]:
0.0 0.0
1.0 5.0
10.0 10.0
dtype: float64
In [72]: ser.interpolate(method='values')
Out[72]:
0.0 0.0
1.0 1.0
10.0 10.0
dtype: float64
You can also interpolate with a DataFrame:
In [73]: df = pd.DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
....: 'B': [.25, np.nan, np.nan, 4, 12.2, 14.4]})
....:
In [74]: df
Out[74]:
A B
0 1.0 0.25
1 2.1 NaN
2 NaN NaN
3 4.7 4.00
4 5.6 12.20
5 6.8 14.40
In [75]: df.interpolate()
Out[75]:
A B
0 1.0 0.25
1 2.1 1.50
2 3.4 2.75
3 4.7 4.00
4 5.6 12.20
5 6.8 14.40
The method argument gives access to fancier interpolation methods. If you have scipy installed, you can
pass the name of a 1-d interpolation routine to method. You’ll want to consult the full scipy interpolation
documentation and reference guide for details. The appropriate interpolation method will depend on the
type of data you are working with.
• If you are dealing with a time series that is growing at an increasing rate, method='quadratic' may
be appropriate.
• If you have values approximating a cumulative distribution function, then method='pchip' should
work well.
• To fill missing values with the goal of smooth plotting, consider method='akima'.
In [76]: df.interpolate(method='barycentric')
Out[76]:
A B
0 1.00 0.250
1 2.10 -7.660
2 3.53 -4.515
3 4.70 4.000
4 5.60 12.200
5 6.80 14.400
In [77]: df.interpolate(method='pchip')
Out[77]:
A B
0 1.00000 0.250000
1 2.10000 0.672808
2 3.43454 1.928950
3 4.70000 4.000000
4 5.60000 12.200000
5 6.80000 14.400000
In [78]: df.interpolate(method='akima')
Out[78]:
A B
0 1.000000 0.250000
1 2.100000 -0.873316
2 3.406667 0.320034
3 4.700000 4.000000
4 5.600000 12.200000
5 6.800000 14.400000
When interpolating via a polynomial or spline approximation, you must also specify the degree or order of
the approximation:
In [79]: df.interpolate(method='spline', order=2)
Out[79]:
A B
0 1.000000 0.250000
1 2.100000 -0.428598
2 3.404545 1.206900
3 4.700000 4.000000
4 5.600000 12.200000
5 6.800000 14.400000
In [81]: np.random.seed(2)
In [83]: missing = np.array([4, 13, 14, 15, 16, 17, 18, 20, 29])
In [87]: df.plot()
Out[87]: <matplotlib.axes._subplots.AxesSubplot at 0x14a517d90>
Another use case is interpolation at new values. Suppose you have 100 observations from some distribution.
And let’s suppose that you’re particularly interested in what’s happening around the middle. You can mix
pandas’ reindex and interpolate methods to interpolate at the new values.
# interpolate at new_index
In [89]: new_index = ser.index | pd.Index([49.25, 49.5, 49.75, 50.25, 50.5, 50.75])
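The interpolated Series sliced below was presumably built by reindexing onto the enlarged index and then interpolating (the interpolation method is an assumption):
interp_s = ser.reindex(new_index).interpolate(method='pchip')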
In [91]: interp_s[49:51]
Out[91]:
49.00 0.471410
49.25 0.476841
49.50 0.481780
49.75 0.485998
50.00 0.489266
50.25 0.491814
50.50 0.493995
50.75 0.495763
51.00 0.497074
dtype: float64
Interpolation limits
Like other pandas fill methods, interpolate() accepts a limit keyword argument. Use this argument to
limit the number of consecutive NaN values filled since the last valid observation:
In [92]: ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan,
....: np.nan, 13, np.nan, np.nan])
....:
In [93]: ser
Out[93]:
0 NaN
1 NaN
2 5.0
3 NaN
4 NaN
5 NaN
6 13.0
7 NaN
8 NaN
dtype: float64
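The listing below, with at most one value filled on each side of a valid observation and filling towards earlier positions, is consistent with a call such as:
ser.interpolate(limit=1, limit_direction='backward')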
Out[96]:
0 NaN
1 5.0
2 5.0
3 NaN
4 NaN
5 11.0
6 13.0
7 NaN
8 NaN
dtype: float64
8 NaN
dtype: float64
In [103]: ser.replace(0, 5)
Out[103]:
0 5.0
1 1.0
2 2.0
3 3.0
4 4.0
dtype: float64
Instead of replacing with specified values, you can treat all given values as missing and interpolate over them:
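A sketch, reusing the ser ([0, 1, 2, 3, 4]) from the listing above: treat 1, 2 and 3 as missing and pad over them:
ser.replace([1, 2, 3], method='pad')   # 0, 0, 0, 0, 4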
Note: Python strings prefixed with the r character such as r'hello world' are so-called “raw” strings. They have different semantics regarding backslashes than strings without this prefix: backslashes in raw strings are taken literally rather than interpreted as escape sequences, e.g. r'\n' is a backslash followed by the letter n, whereas '\n' is a single newline character. You should read about them if this is unclear.
In [109]: d = {'a': list(range(4)), 'b': list('ab..'), 'c': ['a', 'b', np.nan, 'd']}
In [110]: df = pd.DataFrame(d)
Now do it with a regular expression that removes surrounding whitespace (regex -> regex):
Same as the previous example, but use a regular expression for searching instead (dict of regex -> dict):
You can pass nested dictionaries of regular expressions that use regex=True:
You can also use the group of a regular expression match when replacing (dict of regex -> dict of regex),
this works for lists as well.
You can pass a list of regular expressions, of which those that match will be replaced with a scalar (list of
regex -> regex).
All of the regular expression examples can also be passed with the to_replace argument as the regex
argument. In this case the value argument must be passed explicitly by name or regex must be a nested
dictionary. The previous example, in this case, would then be:
This can be convenient if you do not want to pass regex=True every time you want to use a regular expression.
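The replace variants described above are not shown in this extract; they were likely along these lines (the '.' placeholders come from the df built from d above):
df.replace(r'\s*\.\s*', np.nan, regex=True)                    # regex -> value
df.replace({'b': {'b': r''}}, regex=True)                      # nested dict of regexes
df.replace([r'\.', r'(a)'], ['dot', r'\1stuff'], regex=True)   # list of regexes, with a backreference
df.replace(regex=r'\s*\.\s*', value=np.nan)                    # same, via the regex keyword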
Note: Anywhere in the above replace examples that you see a regular expression a compiled regular
expression is valid as well.
In [127]: df[1].dtype
Out[127]: dtype('float64')
You can also operate on the DataFrame in place:
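For example (a sketch):
df.replace(1.5, np.nan, inplace=True)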
Warning: When replacing multiple bool or datetime64 objects, the first argument to replace
(to_replace) must match the type of the value being replaced. For example,
>>> s = pd.Series([True, False, True])
>>> s.replace({'a string': 'new value', True: False}) # raises
TypeError: Cannot compare types 'ndarray(dtype=bool)' and 'str'
will raise a TypeError because one of the dict keys is not of the correct type for replacement.
However, when replacing a single object such as,
In [129]: s = pd.Series([True, False, True])
the original NDFrame object will be returned untouched. We’re working on unifying this API, but for
backwards compatibility reasons we cannot break the latter behavior. See GH6354 for more details.
While pandas supports storing arrays of integer and boolean type, these types are not capable of storing
missing data. Until we can switch to using a native NA type in NumPy, we’ve established some “casting
rules”. When a reindexing operation introduces missing data, the Series will be cast according to the rules
introduced in the table below.
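The table itself is not reproduced in this extract; the casting rules are, roughly:

data type   cast to when NAs are introduced
integer     float
boolean     object
float       no cast
object      no cast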
For example:
In [131]: s = pd.Series(np.random.randn(5), index=[0, 2, 4, 6, 7])
In [132]: s > 0
Out[132]:
0 True
2 True
4 True
6 True
7 True
dtype: bool
In [135]: crit
Out[135]:
0 True
1 NaN
2 True
3 NaN
4 True
5 NaN
6 True
7 True
dtype: object
In [136]: crit.dtype
Out[136]: dtype('O')
Ordinarily NumPy will complain if you try to use an object array (even if it contains boolean values) instead
of a boolean array to get or set values from an ndarray (e.g. selecting values based on some criteria). If a
boolean vector contains NAs, an exception will be generated:
In [138]: reindexed[crit]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-138-0dac417a4890> in <module>
----> 1 reindexed[crit]
~/sandbox/pandas-doc/pandas/core/common.py in is_bool_indexer(key)
128 if not lib.is_bool_array(key):
129 if isna(key).any():
--> 130 raise ValueError(na_msg)
131 return False
132 return True
However, these can be filled in using fillna() and it will work fine:
In [139]: reindexed[crit.fillna(False)]
Out[139]:
0 0.126504
2 0.696198
4 0.697416
6 0.601516
7 0.003659
dtype: float64
In [140]: reindexed[crit.fillna(True)]
Out[140]:
0 0.126504
1 0.000000
2 0.696198
3 0.000000
4 0.697416
5 0.000000
6 0.601516
7 0.003659
dtype: float64
Pandas provides a nullable integer dtype, but you must explicitly request it when creating the series or
column. Notice that we use a capital “I” in the dtype="Int64".
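The Series below was presumably created with something like:
s = pd.Series([0, 1, np.nan, 3, 4], dtype="Int64")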
In [142]: s
Out[142]:
0 0
1 1
2 NaN
3 3
4 4
dtype: Int64
This is an introduction to pandas categorical data type, including a short comparison with R’s factor.
Categoricals are a pandas data type corresponding to categorical variables in statistics. A categorical variable
takes on a limited, and usually fixed, number of possible values (categories; levels in R). Examples are gender,
social class, blood type, country affiliation, observation time or rating via Likert scales.
In contrast to statistical categorical variables, categorical data might have an order (e.g. ‘strongly agree’ vs
‘agree’ or ‘first observation’ vs. ‘second observation’), but numerical operations (additions, divisions, …) are
not possible.
All values of categorical data are either in categories or np.nan. Order is defined by the order of categories,
not lexical order of the values. Internally, the data structure consists of a categories array and an integer
array of codes which point to the real value in the categories array.
The categorical data type is useful in the following cases:
• A string variable consisting of only a few different values. Converting such a string variable to a
categorical variable will save some memory, see here.
• The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting
to a categorical and specifying an order on the categories, sorting and min/max will use the logical
order instead of the lexical order, see here.
• As a signal to other Python libraries that this column should be treated as a categorical variable (e.g.
to use suitable statistical methods or plot types).
See also the API docs on categoricals.
Series creation
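The categorical Series shown below can be created by passing dtype="category" directly, e.g.:
s = pd.Series(["a", "b", "c", "a"], dtype="category")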
In [2]: s
Out[2]:
0 a
1 b
2 c
3 a
dtype: category
Categories (3, object): [a, b, c]
In [5]: df
Out[5]:
A B
0 a a
1 b b
2 c c
3 a a
By using special functions, such as cut(), which groups data into discrete bins. See the example on tiling
in the docs.
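A sketch of that pattern (the column names, bin edges and labels are assumptions chosen to match the value/group frame below):
import numpy as np
df = pd.DataFrame({'value': np.random.randint(0, 100, 20)})
labels = ["{0} - {1}".format(i, i + 9) for i in range(0, 100, 10)]
df['group'] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)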
In [9]: df.head(10)
Out[9]:
value group
In [11]: s = pd.Series(raw_cat)
In [12]: s
Out[12]:
0 NaN
1 b
2 c
3 NaN
dtype: category
Categories (3, object): [b, c, d]
In [15]: df
Out[15]:
A B
0 a NaN
1 b b
2 c c
3 a NaN
In [16]: df.dtypes
Out[16]:
A object
B category
dtype: object
DataFrame creation
Similar to the previous section where a single column was converted to categorical, all columns in a DataFrame
can be batch converted to categorical either during or after construction.
This can be done during construction by specifying dtype="category" in the DataFrame constructor:
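For example, a sketch consistent with the per-column categories shown below:
df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')}, dtype="category")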
In [18]: df.dtypes
Out[18]:
A category
B category
dtype: object
Note that the categories present in each column differ; the conversion is done column by column, so only
labels present in a given column are categories:
In [19]: df['A']
Out[19]:
0 a
1 b
2 c
3 a
Name: A, dtype: category
Categories (3, object): [a, b, c]
In [20]: df['B']
Out[20]:
0 b
1 c
2 c
3 d
Name: B, dtype: category
Categories (3, object): [b, c, d]
New in version 0.23.0.
Analogously, all columns in an existing DataFrame can be batch converted using DataFrame.astype():
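A sketch of the batch conversion:
df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})
df_cat = df.astype('category')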
In [23]: df_cat.dtypes
Out[23]:
A category
B category
dtype: object
In [25]: df_cat['B']
Out[25]:
0 b
1 c
2 c
3 d
Name: B, dtype: category
Categories (3, object): [b, c, d]
Controlling behavior
In the examples above where we passed dtype='category', we used the default behavior:
1. Categories are inferred from the data.
2. Categories are unordered.
To control those behaviors, instead of passing 'category', use an instance of CategoricalDtype.
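A sketch consistent with the ordered result shown below (the category list is an assumption):
from pandas.api.types import CategoricalDtype
s = pd.Series(["a", "b", "c", "a"])
cat_type = CategoricalDtype(categories=["b", "c", "d"], ordered=True)
s_cat = s.astype(cat_type)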
In [30]: s_cat
Out[30]:
0 NaN
1 b
2 c
3 NaN
dtype: category
Categories (3, object): [b < c < d]
Similarly, a CategoricalDtype can be used with a DataFrame to ensure that categories are consistent among
all columns.
In [31]: from pandas.api.types import CategoricalDtype
In [35]: df_cat['A']
Out[35]:
0 a
1 b
2 c
3 a
Name: A, dtype: category
Categories (4, object): [a < b < c < d]
In [36]: df_cat['B']
Out[36]:
0 b
1 c
2 c
3 d
Name: B, dtype: category
Categories (4, object): [a < b < c < d]
Note: To perform table-wise conversion, where all labels in the entire DataFrame are used as categories
for each column, the categories parameter can be determined programmatically by categories = pd.
unique(df.to_numpy().ravel()).
If you already have codes and categories, you can use the from_codes() constructor to save the factorize
step during normal constructor mode:
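The splitter passed below was presumably a small array of integer codes, e.g.:
splitter = np.random.choice([0, 1], 5, p=[0.5, 0.5])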
In [38]: s = pd.Series(pd.Categorical.from_codes(splitter,
....: categories=["train", "test"]))
....:
To get back to the original Series or NumPy array, use Series.astype(original_dtype) or np.
asarray(categorical):
In [39]: s = pd.Series(["a", "b", "c", "a"])
In [40]: s
Out[40]:
0 a
1 b
2 c
3 a
dtype: object
In [41]: s2 = s.astype('category')
In [42]: s2
Out[42]:
0 a
1 b
2 c
3 a
dtype: category
Categories (3, object): [a, b, c]
In [43]: s2.astype(str)
Out[43]:
0 a
1 b
2 c
3 a
dtype: object
In [44]: np.asarray(s2)
Out[44]: array(['a', 'b', 'c', 'a'], dtype=object)
Note: In contrast to R’s factor function, categorical data is not converting input values to strings; categories
will end up the same data type as the original values.
Note: In contrast to R’s factor function, there is currently no way to assign/change labels at creation time.
Use categories to change the categories after creation time.
4.8.2 CategoricalDtype
In [48]: CategoricalDtype()
Out[48]: CategoricalDtype(categories=None, ordered=None)
A CategoricalDtype can be used in any place pandas expects a dtype. For example pandas.read_csv(),
pandas.DataFrame.astype(), or in the Series constructor.
Note: As a convenience, you can use the string 'category' in place of a CategoricalDtype when you
want the default behavior of the categories being unordered, and equal to the set values present in the array.
In other words, dtype='category' is equivalent to dtype=CategoricalDtype().
Equality semantics
Two instances of CategoricalDtype compare equal whenever they have the same categories and order.
When comparing two unordered categoricals, the order of the categories is not considered.
In [49]: c1 = CategoricalDtype(['a', 'b', 'c'], ordered=False)
In [52]: c1 == 'category'
Out[52]: True
4.8.3 Description
Using describe() on categorical data will produce similar output to a Series or DataFrame of type string.
In [53]: cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
In [55]: df.describe()
Out[55]:
cat s
count 3 3
unique 2 2
top c c
freq 2 2
In [56]: df["cat"].describe()
Out[56]:
count 3
unique 2
top c
freq 2
Name: cat, dtype: object
Categorical data has a categories and an ordered property, which list their possible values and whether the ordering matters or not. These properties are exposed as s.cat.categories and s.cat.ordered. If you don’t manually specify categories and ordering, they are inferred from the passed arguments.
In [57]: s = pd.Series(["a", "b", "c", "a"], dtype="category")
In [58]: s.cat.categories
Out[58]: Index(['a', 'b', 'c'], dtype='object')
In [59]: s.cat.ordered
Out[59]: False
It’s also possible to pass in the categories in a specific order:
In [60]: s = pd.Series(pd.Categorical(["a", "b", "c", "a"],
....: categories=["c", "b", "a"]))
....:
In [61]: s.cat.categories
Out[61]: Index(['c', 'b', 'a'], dtype='object')
In [62]: s.cat.ordered
Out[62]: False
Note: New categorical data are not automatically ordered. You must explicitly pass ordered=True to
indicate an ordered Categorical.
Note: The result of unique() is not always the same as Series.cat.categories, because Series.
unique() has a couple of guarantees, namely that it returns categories in the order of appearance, and it
only includes values that are actually present.
In [63]: s = pd.Series(list('babc')).astype(CategoricalDtype(list('abcd')))
In [64]: s
Out[64]:
0 b
1 a
2 b
3 c
dtype: category
Categories (4, object): [a, b, c, d]
# categories
In [65]: s.cat.categories
Out[65]: Index(['a', 'b', 'c', 'd'], dtype='object')
# uniques
In [66]: s.unique()
Out[66]:
[b, a, c]
Categories (3, object): [b, a, c]
Renaming categories
Renaming categories is done by assigning new values to the Series.cat.categories property or by using
the rename_categories() method:
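The intermediate listings below show the result after each step; the statements were likely along these lines:
s = pd.Series(["a", "b", "c", "a"], dtype="category")
s.cat.categories = ["Group %s" % g for g in s.cat.categories]   # assign new names in place
s = s.cat.rename_categories([1, 2, 3])                          # rename via a list
s = s.cat.rename_categories({1: 'x', 2: 'y', 3: 'z'})           # rename via a dict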
In [68]: s
Out[68]:
0 a
1 b
2 c
3 a
dtype: category
Categories (3, object): [a, b, c]
In [70]: s
Out[70]:
0 Group a
1 Group b
2 Group c
3 Group a
dtype: category
Categories (3, object): [Group a, Group b, Group c]
In [72]: s
Out[72]:
0 1
1 2
2 3
3 1
dtype: category
Categories (3, int64): [1, 2, 3]
In [74]: s
Out[74]:
0 x
1 y
2 z
3 x
dtype: category
Categories (3, object): [x, y, z]
Note: In contrast to R’s factor, categorical data can have categories of other types than string.
Note: Be aware that assigning new categories is an inplace operation, while most other operations under Series.cat by default return a new Series of dtype category.
In [75]: try:
....: s.cat.categories = [1, 1, 1]
....: except ValueError as e:
....: print("ValueError:", str(e))
....:
ValueError: Categorical categories must be unique
In [76]: try:
....: s.cat.categories = [1, 2, np.nan]
....: except ValueError as e:
....: print("ValueError:", str(e))
....:
ValueError: Categorial categories cannot be null
In [78]: s.cat.categories
Out[78]: Index(['x', 'y', 'z', 4], dtype='object')
In [79]: s
Out[79]:
0 x
1 y
2 z
3 x
dtype: category
Categories (4, object): [x, y, z, 4]
Removing categories
Removing categories can be done by using the remove_categories() method. Values which are removed
are replaced by np.nan.:
In [80]: s = s.cat.remove_categories([4])
In [81]: s
Out[81]:
0 x
1 y
2 z
In [83]: s
Out[83]:
0 a
1 b
2 a
dtype: category
Categories (4, object): [a, b, c, d]
In [84]: s.cat.remove_unused_categories()
Out[84]:
0 a
1 b
2 a
dtype: category
Categories (2, object): [a, b]
Setting categories
If you want to remove and add new categories in one step (which has some speed advantage), or simply set the categories to a predefined scale, use set_categories().
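The two listings below were presumably produced along these lines:
s = pd.Series(["one", "two", "four", "-"], dtype="category")
s = s.cat.set_categories(["one", "two", "three", "four"])   # '-' becomes NaN, 'three' is added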
In [86]: s
Out[86]:
0 one
1 two
2 four
3 -
dtype: category
Categories (4, object): [-, four, one, two]
In [88]: s
Out[88]:
0 one
1 two
2 four
Note: Be aware that Categorical.set_categories() cannot know whether some category is omitted intentionally or because it is misspelled or (under Python 3) due to a type difference (e.g., NumPy S1 dtype and Python strings). This can result in surprising behaviour!
If categorical data is ordered (s.cat.ordered == True), then the order of the categories has a meaning and
certain operations are possible. If the categorical is unordered, .min()/.max() will raise a TypeError.
In [89]: s = pd.Series(pd.Categorical(["a", "b", "c", "a"], ordered=False))
In [90]: s.sort_values(inplace=True)
In [92]: s.sort_values(inplace=True)
In [93]: s
Out[93]:
0 a
3 a
1 b
2 c
dtype: category
Categories (3, object): [a < b < c]
In [96]: s.cat.as_unordered()
Out[96]:
0 a
3 a
1 b
2 c
dtype: category
Categories (3, object): [a, b, c]
Sorting will use the order defined by categories, not any lexical order present on the data type. This is even
true for strings and numeric data:
In [97]: s = pd.Series([1, 2, 3, 1], dtype="category")
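The [2 < 3 < 1] ordering shown below suggests a step such as:
s = s.cat.set_categories([2, 3, 1], ordered=True)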
In [99]: s
Out[99]:
0 1
1 2
2 3
3 1
dtype: category
Categories (3, int64): [2 < 3 < 1]
In [100]: s.sort_values(inplace=True)
In [101]: s
Out[101]:
1 2
2 3
0 1
3 1
dtype: category
Categories (3, int64): [2 < 3 < 1]
Reordering
Reordering the categories is possible via the Categorical.reorder_categories() and the Categorical.
set_categories() methods. For Categorical.reorder_categories(), all old categories must be included
in the new categories and no new categories are allowed. This will necessarily make the sort order the same
as the categories order.
In [103]: s = pd.Series([1, 2, 3, 1], dtype="category")
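The reordering step itself is not shown; it was presumably:
s = s.cat.reorder_categories([2, 3, 1], ordered=True)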
In [105]: s
Out[105]:
0 1
1 2
2 3
3 1
dtype: category
Categories (3, int64): [2 < 3 < 1]
In [106]: s.sort_values(inplace=True)
In [107]: s
Out[107]:
1 2
2 3
0 1
3 1
dtype: category
Categories (3, int64): [2 < 3 < 1]
Note: Note the difference between assigning new categories and reordering the categories: the first renames
categories and therefore the individual values in the Series, but if the first position was sorted last, the
renamed value will still be sorted last. Reordering means that the way values are sorted is different afterwards,
but not that individual values in the Series are changed.
Note: If the Categorical is not ordered, Series.min() and Series.max() will raise TypeError. Numeric
operations like +, -, *, / and operations based on them (e.g. Series.median(), which would need to compute
the mean between two values if the length of an array is even) do not work and raise a TypeError.
A categorical dtyped column will participate in a multi-column sort in a similar manner to other columns.
The ordering of the categorical is determined by the categories of that column.
4.8.6 Comparisons
Note: Any “non-equality” comparisons of categorical data with a Series, np.array, list or categorical
data with different categories or ordering will raise a TypeError because custom categories ordering could
be interpreted in two ways: one with taking into account the ordering and one without.
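The cat, cat_base and cat_base2 objects inspected below were presumably built with a shared ordered dtype, e.g.:
from pandas.api.types import CategoricalDtype
cat = pd.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))
cat_base = pd.Series([2, 2, 2]).astype(CategoricalDtype([3, 2, 1], ordered=True))
cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))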
In [116]: cat
Out[116]:
0 1
1 2
2 3
dtype: category
Categories (3, int64): [3 < 2 < 1]
In [117]: cat_base
Out[117]:
0 2
1 2
2 2
dtype: category
Categories (3, int64): [3 < 2 < 1]
In [118]: cat_base2
Out[118]:
0 2
1 2
2 2
dtype: category
Categories (1, int64): [2]
Comparing to a categorical with the same categories and ordering or to a scalar works:
In [119]: cat > cat_base
Out[119]:
0 True
1 False
2 False
dtype: bool
In [123]: cat == 2
Out[123]:
0 False
1 True
2 False
dtype: bool
This doesn’t work because the categories are not the same:
In [124]: try:
.....: cat > cat_base2
.....: except TypeError as e:
.....: print("TypeError:", str(e))
.....:
TypeError: Categoricals can only be compared if 'categories' are the same. Categories are different lengths
If you want to do a “non-equality” comparison of a categorical series with a list-like object which is not
categorical data, you need to be explicit and convert the categorical data back to the original values:
In [125]: base = np.array([1, 2, 3])
In [126]: try:
.....: cat > base
.....: except TypeError as e:
.....: print("TypeError:", str(e))
.....:
TypeError: Cannot compare a Categorical for op __gt__ with type <class 'numpy.ndarray'>.
If you want to compare values, use 'np.asarray(cat) <op> other'.
In [130]: c1 == c2
Out[130]: array([ True, True])
4.8.7 Operations
Apart from Series.min(), Series.max() and Series.mode(), the following operations are possible with
categorical data:
Series methods like Series.value_counts() will use all categories, even if some categories are not present
in the data:
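The counts below, which include categories with zero occurrences, are consistent with a Series such as:
s = pd.Series(pd.Categorical(["a", "b", "c", "c"], categories=["c", "a", "b", "d"]))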
In [132]: s.value_counts()
Out[132]:
c 2
b 1
a 1
In [135]: df.groupby("cats").mean()
Out[135]:
values
cats
a 1.0
b 2.0
c 4.0
d NaN
Pivot tables:
The optimized pandas data access methods .loc, .iloc, .at, and .iat work as normal. The only difference is the return type (for getting) and that only values already in categories can be assigned.
Getting
If the slicing operation returns either a DataFrame or a column of type Series, the category dtype is
preserved.
In [142]: idx = pd.Index(["h", "i", "j", "k", "l", "m", "n"])
In [146]: df.iloc[2:4, :]
Out[146]:
cats values
j b 2
k b 2
An example where the category type is not preserved is if you take one single row: the resulting Series is
of dtype object:
Returning a single item from categorical data will also return the value, not a categorical of length “1”.
In [151]: df.iat[0, 0]
Out[151]: 'a'
Note: This is in contrast to R’s factor function, where factor(c(1,2,3))[1] returns a single-value factor.
To get a single value Series of type category, you pass in a list with a single value:
The accessors .dt and .str will work if the s.cat.categories are of an appropriate type:
In [155]: str_s = pd.Series(list('aabb'))
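The categorical version was presumably obtained with:
str_cat = str_s.astype('category')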
In [157]: str_cat
Out[157]:
0 a
1 a
2 b
3 b
dtype: category
Categories (2, object): [a, b]
In [158]: str_cat.str.contains("a")
Out[158]:
0 True
1 True
2 False
3 False
dtype: bool
In [161]: date_cat
Out[161]:
0 2015-01-01
1 2015-01-02
2 2015-01-03
3 2015-01-04
4 2015-01-05
dtype: category
Categories (5, datetime64[ns]): [2015-01-01, 2015-01-02, 2015-01-03, 2015-01-04, 2015-01-05]
In [162]: date_cat.dt.day
Out[162]:
0 1
1 2
2 3
3 4
4 5
dtype: int64
Note: The returned Series (or DataFrame) is of the same type as if you used the .str.<method> /
.dt.<method> on a Series of that type (and not of type category!).
That means that methods and properties on the accessors of a Series return the same values as methods and properties on the accessors of this Series transformed to one of type category:
In [163]: ret_s = str_s.str.contains("a")
Note: The work is done on the categories and then a new Series is constructed. This has some
performance implication if you have a Series of type string, where lots of elements are repeated (i.e. the
number of unique elements in the Series is a lot smaller than the length of the Series). In this case it can
be faster to convert the original Series to one of type category and use .str.<method> or .dt.<property>
on that.
Setting
Setting values in a categorical column (or Series) works as long as the value is included in the categories:
In [167]: idx = pd.Index(["h", "i", "j", "k", "l", "m", "n"])
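The frame shown below, with 'b'/2 at rows j and k, is consistent with a construction and assignment along these lines:
cats = pd.Categorical(["a"] * 7, categories=["a", "b"])
df = pd.DataFrame({"cats": cats, "values": [1] * 7}, index=idx)
df.iloc[2:4, :] = [["b", 2], ["b", 2]]   # 'b' is already a category, so this works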
In [172]: df
Out[172]:
cats values
h a 1
i a 1
j b 2
k b 2
l a 1
m a 1
n a 1
In [173]: try:
.....: df.iloc[2:4, :] = [["c", 3], ["c", 3]]
.....: except ValueError as e:
.....: print("ValueError:", str(e))
.....:
ValueError: Cannot setitem on a Categorical with a new category, set the categories first
Setting values by assigning categorical data will also check that the categories match:
In [174]: df.loc["j":"k", "cats"] = pd.Categorical(["a", "a"], categories=["a", "b"])
In [175]: df
Out[175]:
cats values
h a 1
i a 1
j a 2
k a 2
l a 1
m a 1
n a 1
In [176]: try:
.....: df.loc["j":"k", "cats"] = pd.Categorical(["b", "b"],
.....: categories=["a", "b", "c"])
In [180]: df
Out[180]:
a b
0 1 a
1 b a
2 b b
3 1 b
4 1 a
In [181]: df.dtypes
Out[181]:
a object
b object
dtype: object
Merging
You can concat two DataFrames containing categorical data together, but the categories of these categoricals
need to be the same:
In [182]: cat = pd.Series(["a", "b"], dtype="category")
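The concatenation below was presumably along these lines:
df = pd.DataFrame({"cats": cat, "vals": [1, 2]})
res = pd.concat([df, df])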
In [186]: res
Out[186]:
cats vals
0 a 1
1 b 2
0 a 1
1 b 2
In [187]: res.dtypes
Out[187]:
cats category
vals int64
dtype: object
In this case the categories are not the same, and therefore an error is raised:
In [190]: try:
.....: pd.concat([df, df_different])
.....: except ValueError as e:
.....: print("ValueError:", str(e))
.....:
Unioning
If you want to combine categoricals that do not necessarily have the same categories, the union_categoricals() function will combine a list-like of categoricals. By default, the resulting categories will be ordered as they appear in the data. If you want the categories to be lexsorted, use the sort_categories=True argument.
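A sketch of union_categoricals() (from pandas.api.types), combining categoricals with different categories:
from pandas.api.types import union_categoricals
a = pd.Categorical(["b", "c"])
b = pd.Categorical(["a", "b"])
union_categoricals([a, b])                        # categories ordered by appearance: [b, c, a]
union_categoricals([a, b], sort_categories=True)  # categories lexsorted: [a, b, c]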
union_categoricals also works with the “easy” case of combining two categoricals with the same categories and order information (e.g. categoricals that you could also append to each other).
The below raises TypeError because the categories are ordered and not identical.
union_categoricals() also works with a CategoricalIndex, or Series containing categorical data, but
note that the resulting array will always be a plain Categorical:
Note: union_categoricals may recode the integer codes for categories when combining categoricals. This
is likely what you want, but if you are relying on the exact numbering of the categories, be aware.
In [205]: c1 = pd.Categorical(["b", "c"])
In [207]: c1
Out[207]:
[b, c]
Categories (2, object): [b, c]
# "b" is coded to 0
In [208]: c1.codes
Out[208]: array([0, 1], dtype=int8)
In [209]: c2
Out[209]:
[a, b]
Categories (2, object): [a, b]
# "b" is coded to 1
In [210]: c2.codes
Out[210]: array([0, 1], dtype=int8)
In [212]: c
Out[212]:
[b, c, a, b]
Categories (3, object): [b, c, a]
Concatenation
This section describes concatenations specific to category dtype. See Concatenating objects for general
description.
By default, Series or DataFrame concatenation which contains the same categories results in category
dtype, otherwise results in object dtype. Use .astype or union_categoricals to get category result.
# same categories
In [214]: s1 = pd.Series(['a', 'b'], dtype='category')
# different categories
In [217]: s3 = pd.Series(['b', 'c'], dtype='category')
0 b
1 c
dtype: category
Categories (3, object): [a, b, c]
You can write data that contains category dtypes to a HDFStore. See here for an example and caveats.
It is also possible to write data to and read data from Stata format files. See here for an example and
caveats.
Writing to a CSV file will convert the data, effectively removing any information about the categorical
(categories and ordering). So if you read back the CSV file you have to convert the relevant columns back
to category and assign the right categories and category ordering.
In [221]: import io
In [227]: df.to_csv(csv)
In [229]: df2.dtypes
Out[229]:
Unnamed: 0 int64
cats object
vals int64
dtype: object
In [230]: df2["cats"]
Out[230]:
0 very good
1 good
2 good
3 very good
4 very good
5 bad
Name: cats, dtype: object
In [233]: df2.dtypes
Out[233]:
Unnamed: 0 int64
cats category
vals int64
dtype: object
In [234]: df2["cats"]
Out[234]:
0 very good
1 good
2 good
3 very good
4 very good
5 bad
Name: cats, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
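One way the conversion back could be written (a sketch; the category list mirrors the output above):

from pandas.api.types import CategoricalDtype

cats_dtype = CategoricalDtype(categories=["very bad", "bad", "medium", "good", "very good"],
                              ordered=True)
df2["cats"] = df2["cats"].astype(cats_dtype)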
The same holds for writing to a SQL database with to_sql.
pandas primarily uses the value np.nan to represent missing data. It is by default not included in
computations. See the Missing Data section.
Missing values should not be included in the Categorical’s categories, only in the values. Instead, it is
understood that NaN is different, and is always a possibility. When working with the Categorical’s codes,
missing values will always have a code of -1.
In [235]: s = pd.Series(["a", "b", np.nan, "a"], dtype="category")
3 a
dtype: category
Categories (2, object): [a, b]
In [237]: s.cat.codes
Out[237]:
0 0
1 1
2 -1
3 0
dtype: int8
Methods for working with missing data, e.g. isna(), fillna(), dropna(), all work normally:
In [238]: s = pd.Series(["a", "b", np.nan], dtype="category")
In [239]: s
Out[239]:
0 a
1 b
2 NaN
dtype: category
Categories (2, object): [a, b]
In [240]: pd.isna(s)
Out[240]:
0 False
1 False
2 True
dtype: bool
In [241]: s.fillna("a")
Out[241]:
0 a
1 b
2 a
dtype: category
Categories (2, object): [a, b]
4.8.12 Gotchas
Memory usage
The memory usage of a Categorical is proportional to the number of categories plus the length of the data.
In contrast, an object dtype is a constant times the length of the data.
In [242]: s = pd.Series(['foo', 'bar'] * 1000)
# object dtype
In [243]: s.nbytes
Out[243]: 16000
# category dtype
In [244]: s.astype('category').nbytes
Out[244]: 2016
Note: If the number of categories approaches the length of the data, the Categorical will use nearly the
same or more memory than an equivalent object dtype representation.
In [245]: s = pd.Series(['foo%04d' % i for i in range(2000)])
# object dtype
In [246]: s.nbytes
Out[246]: 16000
# category dtype
In [247]: s.astype('category').nbytes
Out[247]: 20000
Currently, categorical data and the underlying Categorical are implemented as a Python object and not as
a low-level NumPy array dtype. This leads to some problems.
NumPy itself doesn’t know about the new dtype:
In [248]: try:
.....: np.dtype("category")
.....: except TypeError as e:
.....: print("TypeError:", str(e))
.....:
TypeError: data type "category" not understood
In [250]: try:
.....: np.dtype(dtype)
.....: except TypeError as e:
.....: print("TypeError:", str(e))
.....:
TypeError: data type not understood
In [256]: try:
.....: np.sum(s)
.....: except TypeError as e:
.....: print("TypeError:", str(e))
.....:
TypeError: Categorical cannot perform the operation sum
dtype in apply
Pandas currently does not preserve the dtype in apply functions: if you apply along rows you get a Series
of object dtype (just as getting a row and then accessing one element returns a basic type), and applying along
columns will also convert to object. NaN values are unaffected. You can use fillna to handle missing values
before applying a function.
In [257]: df = pd.DataFrame({"a": [1, 2, 3, 4],
.....: "b": ["a", "b", "c", "d"],
.....: "cats": pd.Categorical([1, 2, 3, 2])})
.....:
cats category
dtype: object
Categorical index
CategoricalIndex is a type of index that is useful for supporting indexing with duplicates. This is a
container around a Categorical and allows efficient indexing and storage of an index with a large number
of duplicated elements. See the advanced indexing docs for a more detailed explanation.
Setting the index will create a CategoricalIndex:
In [260]: cats = pd.Categorical([1, 2, 3, 4], categories=[4, 2, 3, 1])
In [264]: df.index
Out[264]: CategoricalIndex([1, 2, 3, 4], categories=[4, 2, 3, 1], ordered=False, dtype='category')
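A minimal sketch of selection with a CategoricalIndex (the frame below is hypothetical):

df_ci = pd.DataFrame({"values": [1, 2, 3, 4]},
                     index=pd.CategoricalIndex(["a", "a", "b", "b"]))
df_ci.loc["a"]          # returns every row labeled "a"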
Side effects
Constructing a Series from a Categorical will not copy the input Categorical. This means that changes
to the Series will in most cases change the original Categorical:
In [268]: cat
Out[268]:
[1, 2, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]
In [269]: s.iloc[0:2] = 10
In [270]: cat
Out[270]:
[10, 10, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]
In [273]: cat
Out[273]:
[5, 5, 3, 5]
Categories (5, int64): [1, 2, 3, 4, 5]
In [276]: cat
Out[276]:
[1, 2, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]
In [277]: s.iloc[0:2] = 10
In [278]: cat
Out[278]:
[1, 2, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]
Note: This also happens in some cases when you supply a NumPy array instead of a Categorical: using
an int array (e.g. np.array([1,2,3,4])) will exhibit the same behavior, while using a string array (e.g.
np.array(["a","b","c","a"])) will not.
{{ header }}
Note: IntegerArray is currently experimental. Its API or implementation may change without warning.
In Missing Data, we saw that pandas primarily uses NaN to represent missing data. Because NaN is a float,
this forces an array of integers with any missing values to become floating point. In some cases, this may
not matter much. But if your integer column is, say, an identifier, casting to float can be problematic. Some
integers cannot even be represented as floating point numbers.
Pandas can represent integer data with possibly missing values using arrays.IntegerArray. This is an
extension type implemented within pandas. It is not the default dtype for integers, and will not be inferred;
you must explicitly pass the dtype into array() or Series:
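A sketch of a construction that produces the array shown below, using the nullable "Int64" dtype:

import numpy as np
import pandas as pd

arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype())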
In [2]: arr
Out[2]:
<IntegerArray>
[1, 2, NaN]
Length: 3, dtype: Int64
Or use the string alias "Int64" (note the capital "I", to differentiate from NumPy's 'int64' dtype):
This array can be stored in a DataFrame or Series like any NumPy array.
In [4]: pd.Series(arr)
Out[4]:
0 1
1 2
2 NaN
dtype: Int64
You can also pass the list-like object to the Series constructor with the dtype.
In [6]: s
Out[6]:
0 1
1 2
2 NaN
dtype: Int64
By default (if you don’t specify dtype), NumPy is used, and you’ll end up with a float64 dtype Series:
Operations involving an integer array will behave similarly to NumPy arrays. Missing values will be
propagated, and the data will be coerced to another dtype if needed.
# arithmetic
In [8]: s + 1
Out[8]:
0 2
1 3
2 NaN
dtype: Int64
# comparison
In [9]: s == 1
Out[9]:
0 True
1 False
2 False
dtype: bool
# indexing
In [10]: s.iloc[1:3]
Out[10]:
1 2
2 NaN
dtype: Int64
In [14]: df
Out[14]:
A B C
0 1 1 a
1 2 1 a
2 NaN 3 b
In [15]: df.dtypes
Out[15]:
A Int64
B int64
C object
dtype: object
These dtypes can be merged, reshaped and cast.
In [16]: pd.concat([df[['A']], df[['B', 'C']]], axis=1).dtypes
Out[16]:
A Int64
B int64
C object
dtype: object
In [17]: df['A'].astype(float)
Out[17]:
0 1.0
1 2.0
2 NaN
Name: A, dtype: float64
Reduction and groupby operations such as ‘sum’ work as well.
In [18]: df.sum()
Out[18]:
A 3
B 5
C aab
dtype: object
In [19]: df.groupby('B').A.sum()
Out[19]:
B
1 3
3 0
Name: A, dtype: Int64
{{ header }}
4.10 Visualization
In [2]: plt.close('all')
We provide the basics in pandas to easily create decent looking plots. See the ecosystem section for visual-
ization libraries that go beyond the basics documented here.
We will demonstrate the basics, see the cookbook for some advanced strategies.
The plot method on Series and DataFrame is just a simple wrapper around plt.plot():
In [3]: ts = pd.Series(np.random.randn(1000),
...: index=pd.date_range('1/1/2000', periods=1000))
...:
In [5]: ts.plot()
Out[5]: <matplotlib.axes._subplots.AxesSubplot at 0x13fc51490>
If the index consists of dates, it calls gcf().autofmt_xdate() to try to format the x-axis nicely as per
above.
On DataFrame, plot() is a convenience to plot all of the columns with labels:
In [7]: df = df.cumsum()
In [8]: plt.figure();
In [9]: df.plot();
You can plot one column versus another using the x and y keywords in plot():
Note: For more formatting and styling options, see formatting below.
Plotting methods allow for a handful of plot styles other than the default line plot. These methods can be
provided as the kind keyword argument to plot(), and include:
• ‘bar’ or ‘barh’ for bar plots
• ‘hist’ for histogram
• ‘box’ for boxplot
• ‘kde’ or ‘density’ for density plots
• ‘area’ for area plots
• ‘scatter’ for scatter plots
• ‘hexbin’ for hexagonal bin plots
• ‘pie’ for pie plots
For example, a bar plot can be created the following way:
In [13]: plt.figure();
In [14]: df.iloc[5].plot(kind='bar');
You can also create these other plots using the methods DataFrame.plot.<kind> instead of providing the
kind keyword argument. This makes it easier to discover plot methods and the specific arguments they use:
In [15]: df = pd.DataFrame()
In addition to these kinds, there are the DataFrame.hist() and DataFrame.boxplot() methods, which use a
separate interface.
Finally, there are several plotting functions in pandas.plotting that take a Series or DataFrame as an
argument. These include:
• Scatter Matrix
• Andrews Curves
• Parallel Coordinates
• Lag Plot
• Autocorrelation Plot
• Bootstrap Plot
• RadViz
Plots may also be adorned with errorbars or tables.
Bar plots
For labeled, non-time series data, you may wish to produce a bar plot:
In [17]: plt.figure();
In [18]: df.iloc[5].plot.bar()
Out[18]: <matplotlib.axes._subplots.AxesSubplot at 0x14eadd910>
In [21]: df2.plot.bar();
In [22]: df2.plot.bar(stacked=True);
In [23]: df2.plot.barh(stacked=True);
Histograms
In [25]: plt.figure();
In [26]: df4.plot.hist(alpha=0.5)
Out[26]: <matplotlib.axes._subplots.AxesSubplot at 0x14f3e8ad0>
A histogram can be stacked using stacked=True. Bin size can be changed using the bins keyword.
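A sketch of a stacked histogram with a custom bin count, assuming the df4 used in the example above:

df4.plot.hist(stacked=True, bins=20)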
In [27]: plt.figure();
You can pass other keywords supported by matplotlib hist. For example, horizontal and cumulative his-
tograms can be drawn by orientation='horizontal' and cumulative=True.
In [29]: plt.figure();
See the hist method and the matplotlib hist documentation for more.
The existing interface DataFrame.hist to plot histograms can still be used.
In [31]: plt.figure();
In [32]: df['A'].diff().hist()
Out[32]: <matplotlib.axes._subplots.AxesSubplot at 0x14f986890>
Box plots
In [38]: df.plot.box()
Out[38]: <matplotlib.axes._subplots.AxesSubplot at 0x15022c950>
Boxplots can be colorized by passing the color keyword. You can pass a dict whose keys are boxes, whiskers,
medians and caps. If some keys are missing from the dict, default colors are used for the corresponding artists.
Boxplot also has a sym keyword to specify the fliers' style.
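For instance, a colorized boxplot might look like the following sketch (df is assumed to hold numeric columns):

color = {'boxes': 'DarkGreen', 'whiskers': 'DarkOrange',
         'medians': 'DarkBlue', 'caps': 'Gray'}
df.plot.box(color=color, sym='r+')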
When you pass other types of arguments via the color keyword, they will be passed directly to matplotlib for
the colorization of all the boxes, whiskers, medians and caps.
The colors are applied to every box to be drawn. If you want more complicated colorization, you can get
each drawn artist by passing return_type.
Also, you can pass other keywords supported by matplotlib boxplot. For example, horizontal and custom-
positioned boxplot can be drawn by vert=False and positions keywords.
See the boxplot method and the matplotlib boxplot documentation for more.
The existing interface DataFrame.boxplot to plot boxplots can still be used.
In [43]: plt.figure();
In [44]: bp = df.boxplot()
You can create a stratified boxplot using the by keyword argument to create groupings. For instance,
In [46]: df['X'] = pd.Series(['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'])
In [47]: plt.figure();
In [48]: bp = df.boxplot(by='X')
You can also pass a subset of columns to plot, as well as group by multiple columns:
In [50]: df['X'] = pd.Series(['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'])
In [51]: df['Y'] = pd.Series(['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'])
In [52]: plt.figure();
In boxplot, the return type can be controlled by the return_type keyword. The valid choices are {"axes",
"dict", "both", None}. Faceting, created by DataFrame.boxplot with the by keyword, will affect the
output type as well:
In [54]: np.random.seed(1234)
In [58]: bp = df_box.boxplot(by='g')
The subplots above are split by the numeric columns first, then the value of the g column. Below the subplots
are first split by the value of g, then by the numeric columns.
In [59]: bp = df_box.groupby('g').boxplot()
Area plot
You can create area plots with Series.plot.area() and DataFrame.plot.area(). Area plots are stacked
by default. To produce a stacked area plot, each column must contain either all positive or all negative values.
When input data contains NaN, it will be automatically filled with 0. If you want to drop or fill with different
values, use dataframe.dropna() or dataframe.fillna() before calling plot.
In [61]: df.plot.area();
To produce an unstacked plot, pass stacked=False. Alpha value is set to 0.5 unless otherwise specified:
In [62]: df.plot.area(stacked=False);
Scatter plot
Scatter plot can be drawn by using the DataFrame.plot.scatter() method. Scatter plot requires numeric
columns for the x and y axes. These can be specified by the x and y keywords.
To plot multiple column groups on a single axes, repeat the plot method specifying the target ax. It is
recommended to specify the color and label keywords to distinguish each group.
The keyword c may be given as the name of a column to provide colors for each point:
You can pass other keywords supported by matplotlib scatter. The example below shows a bubble chart
using a column of the DataFrame as the bubble size.
See the scatter method and the matplotlib scatter documentation for more.
You can create hexagonal bin plots with DataFrame.plot.hexbin(). Hexbin plots can be a useful alternative
to scatter plots if your data are too dense to plot each point individually.
A useful keyword argument is gridsize; it controls the number of hexagons in the x-direction, and defaults
to 100. A larger gridsize means more, smaller bins.
By default, a histogram of the counts around each (x, y) point is computed. You can specify alternative
aggregations by passing values to the C and reduce_C_function arguments. C specifies the value at each
(x, y) point and reduce_C_function is a function of one argument that reduces all the values in a bin to
a single number (e.g. mean, max, sum, std). In this example the positions are given by columns a and b,
while the value is given by column z. The bins are aggregated with NumPy’s max function.
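A sketch matching that description (column names a, b and z follow the text; the data itself is made up):

df = pd.DataFrame(np.random.randn(1000, 2), columns=['a', 'b'])
df['z'] = np.random.uniform(0, 3, 1000)
df.plot.hexbin(x='a', y='b', C='z', reduce_C_function=np.max, gridsize=25)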
See the hexbin method and the matplotlib hexbin documentation for more.
Pie plot
You can create a pie plot with DataFrame.plot.pie() or Series.plot.pie(). If your data includes any
NaN, they will be automatically filled with 0. A ValueError will be raised if there are any negative values
in your data.
For pie plots it’s best to use square figures, i.e. a figure aspect ratio 1. You can create the figure with equal
width and height, or force the aspect ratio to be equal after plotting by calling ax.set_aspect('equal')
on the returned axes object.
Note that a pie plot with a DataFrame requires that you either specify a target column by the y argument
or subplots=True. When y is specified, a pie plot of the selected column will be drawn. If subplots=True is
specified, pie plots for each column are drawn as subplots. A legend will be drawn in each pie plot by
default; specify legend=False to hide it.
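A small sketch of both forms (the frame below is hypothetical):

df = pd.DataFrame(3 * np.random.rand(4, 2), index=['a', 'b', 'c', 'd'], columns=['x', 'y'])
df.plot.pie(y='x')                            # pie of a single column
df.plot.pie(subplots=True, figsize=(8, 4))    # one pie per column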
You can use the labels and colors keywords to specify the labels and colors of each wedge.
Warning: Most pandas plots use the label and color arguments (note the lack of “s” on those). To
be consistent with matplotlib.pyplot.pie() you must use labels and colors.
If you want to hide wedge labels, specify labels=None. If fontsize is specified, the value will be applied to
wedge labels. Also, other keywords supported by matplotlib.pyplot.pie() can be used.
If you pass values whose sum is less than 1.0, matplotlib draws a semicircle.
Pandas tries to be pragmatic about plotting DataFrames or Series that contain missing data. Missing
values are dropped, left out, or filled depending on the plot type.
If any of these defaults are not what you want, or if you want to be explicit about how missing values are
handled, consider using fillna() or dropna() before plotting.
These functions can be imported from pandas.plotting and take a Series or DataFrame as an argument.
You can create a scatter plot matrix using the scatter_matrix method in pandas.plotting:
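A sketch of such a call (random data; columns a through d are made up):

from pandas.plotting import scatter_matrix

df = pd.DataFrame(np.random.randn(1000, 4), columns=['a', 'b', 'c', 'd'])
scatter_matrix(df, alpha=0.2, figsize=(6, 6), diagonal='kde')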
Density plot
You can create density plots using the Series.plot.kde() and DataFrame.plot.kde() methods.
In [87]: ser.plot.kde()
Out[87]: <matplotlib.axes._subplots.AxesSubplot at 0x1471045d0>
Andrews curves
Andrews curves allow one to plot multivariate data as a large number of curves that are created using the
attributes of samples as coefficients for Fourier series, see the Wikipedia entry for more information. By
coloring these curves differently for each class it is possible to visualize data clustering. Curves belonging to
samples of the same class will usually be closer together and form larger structures.
Note: The “Iris” dataset is available here.
In [88]: from pandas.plotting import andrews_curves
In [90]: plt.figure()
Out[90]: <Figure size 640x480 with 0 Axes>
Parallel coordinates
Parallel coordinates is a plotting technique for plotting multivariate data, see the Wikipedia entry for an
introduction. Parallel coordinates allows one to see clusters in data and to estimate other statistics visually.
Using parallel coordinates, points are represented as connected line segments. Each vertical line represents
one attribute. One set of connected line segments represents one data point. Points that tend to cluster will
appear closer together.
In [92]: from pandas.plotting import parallel_coordinates
In [94]: plt.figure()
Out[94]: <Figure size 640x480 with 0 Axes>
Lag plot
Lag plots are used to check if a data set or time series is random. Random data should not exhibit any
structure in the lag plot. Non-random structure implies that the underlying data are not random. The lag
argument may be passed, and when lag=1 the plot is essentially data[:-1] vs. data[1:].
In [97]: plt.figure()
Out[97]: <Figure size 640x480 with 0 Axes>
In [100]: lag_plot(data)
Out[100]: <matplotlib.axes._subplots.AxesSubplot at 0x14aeb2710>
Autocorrelation plot
Autocorrelation plots are often used for checking randomness in time series. This is done by computing
autocorrelations for data values at varying time lags. If the time series is random, such autocorrelations should
be near zero for any and all time-lag separations. If the time series is non-random then one or more of the
autocorrelations will be significantly non-zero. The horizontal lines displayed in the plot correspond to the 95%
and 99% confidence bands. The dashed line is the 99% confidence band. See the Wikipedia entry for more about
autocorrelation plots.
In [102]: plt.figure()
Out[102]: <Figure size 640x480 with 0 Axes>
In [105]: autocorrelation_plot(data)
Out[105]: <matplotlib.axes._subplots.AxesSubplot at 0x145c97590>
Bootstrap plot
Bootstrap plots are used to visually assess the uncertainty of a statistic, such as mean, median, midrange,
etc. A random subset of a specified size is selected from a data set, the statistic in question is computed
for this subset and the process is repeated a specified number of times. Resulting plots and histograms are
what constitutes the bootstrap plot.
RadViz
RadViz is a way of visualizing multi-variate data. It is based on a simple spring tension minimization
algorithm. Basically you set up a bunch of points in a plane. In our case they are equally spaced on a
unit circle. Each point represents a single attribute. You then pretend that each sample in the data set is
attached to each of these points by a spring, the stiffness of which is proportional to the numerical value of
that attribute (they are normalized to unit interval). The point in the plane, where our sample settles to
(where the forces acting on our sample are at an equilibrium) is where a dot representing our sample will
be drawn. Depending on which class that sample belongs to, it will be colored differently. See the R package
Radviz for more information.
Note: The “Iris” dataset is available here.
In [109]: from pandas.plotting import radviz
In [111]: plt.figure()
Out[111]: <Figure size 640x480 with 0 Axes>
From version 1.5 and up, matplotlib offers a range of pre-configured plotting styles. Setting the style can be
used to easily give plots the general look that you want. Setting the style is as easy as calling
matplotlib.style.use(my_plot_style) before creating your plot. For example, you could write
matplotlib.style.use('ggplot') for ggplot-style plots.
You can see the various available style names at matplotlib.style.available and it’s very easy to try
them out.
Most plotting methods have a set of keyword arguments that control the layout and formatting of the
returned plot:
In [113]: plt.figure();
For each kind of plot (e.g. line, bar, scatter) any additional keyword arguments are passed along to the
corresponding matplotlib function (ax.plot(), ax.bar(), ax.scatter()). These can be used to control
additional styling, beyond what pandas provides.
You may set the legend argument to False to hide the legend, which is shown by default.
In [116]: df = df.cumsum()
In [117]: df.plot(legend=False)
Out[117]: <matplotlib.axes._subplots.AxesSubplot at 0x14a278510>
Scales
In [118]: ts = pd.Series(np.random.randn(1000),
.....: index=pd.date_range('1/1/2000', periods=1000))
.....:
In [119]: ts = np.exp(ts.cumsum())
In [120]: ts.plot(logy=True)
Out[120]: <matplotlib.axes._subplots.AxesSubplot at 0x14a63f0d0>
To plot some columns of a DataFrame on the secondary y-axis, give the column names to the secondary_y keyword:
In [123]: plt.figure()
Out[123]: <Figure size 640x480 with 0 Axes>
Note that columns plotted on the secondary y-axis are automatically marked with “(right)” in the legend.
To turn off the automatic marking, use the mark_right=False keyword:
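A sketch combining both keywords (columns A and B are assumed to exist in df):

df.plot(secondary_y=['A', 'B'], mark_right=False)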
In [127]: plt.figure()
Out[127]: <Figure size 640x480 with 0 Axes>
pandas includes automatic tick resolution adjustment for regular frequency time-series data. For limited
cases where pandas cannot infer the frequency information (e.g., in an externally created twinx), you can
choose to suppress this behavior for alignment purposes.
Here is the default behavior, notice how the x-axis tick labeling is performed:
In [129]: plt.figure()
Out[129]: <Figure size 640x480 with 0 Axes>
In [130]: df.A.plot()
Out[130]: <matplotlib.axes._subplots.AxesSubplot at 0x149b267d0>
In [132]: df.A.plot(x_compat=True)
Out[132]: <matplotlib.axes._subplots.AxesSubplot at 0x13c3baed0>
If you have more than one plot that needs to be suppressed, the use method in
pandas.plotting.plot_params can be used in a with statement:
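A sketch of the with-statement form (a df with columns A and B is assumed from the preceding examples):

with pd.plotting.plot_params.use('x_compat', True):
    df['A'].plot(color='r')
    df['B'].plot(color='g')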
In [133]: plt.figure()
Out[133]: <Figure size 640x480 with 0 Axes>
Subplots
Each Series in a DataFrame can be plotted on a different axis with the subplots keyword:
The layout of subplots can be specified by the layout keyword. It can accept (rows, columns). The layout
keyword can be used in hist and boxplot also. If the input is invalid, a ValueError will be raised.
The number of axes which can be contained by rows x columns specified by layout must be larger than
the number of required subplots. If layout can contain more axes than required, blank axes are not drawn.
Similar to a NumPy array’s reshape method, you can use -1 for one dimension to automatically calculate
the number of rows or columns needed, given the other.
The required number of columns (3) is inferred from the number of series to plot and the given number of
rows (2).
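A sketch of the -1 shorthand (a hypothetical 6-column DataFrame, so layout=(2, -1) resolves to 2 rows by 3 columns):

df6 = pd.DataFrame(np.random.randn(100, 6))
df6.plot(subplots=True, layout=(2, -1), figsize=(8, 6))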
You can pass multiple axes created beforehand as a list-like via the ax keyword. This allows more complicated
layouts. The passed axes must be the same number as the subplots being drawn.
When multiple axes are passed via the ax keyword, the layout, sharex and sharey keywords don't affect
the output. You should explicitly pass sharex=False and sharey=False, otherwise you will see a warning.
Horizontal and vertical error bars can be supplied to the xerr and yerr keyword arguments to plot(). The
error values can be specified using a variety of formats:
• As a DataFrame or dict of errors with column names matching the columns attribute of the plotting
DataFrame or matching the name attribute of the Series.
• As a str indicating which of the columns of plotting DataFrame contain the error values.
• As raw values (list, tuple, or np.ndarray). Must be the same length as the plotting
DataFrame/Series.
Asymmetrical error bars are also supported, however raw error values must be provided in this case. For an
M-length Series, an Mx2 array should be provided indicating lower and upper (or left and right) errors. For
an MxN DataFrame, asymmetrical errors should be in an Mx2xN array.
Here is an example of one way to easily plot group means with standard deviations from the raw data.
# Generate the data
In [153]: ix3 = pd.MultiIndex.from_arrays([
.....: ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
.....: ['foo', 'foo', 'bar', 'bar', 'foo', 'foo', 'bar', 'bar']],
.....: names=['letter', 'word'])
.....:
# Group by index labels and take the means and standard deviations
# for each group
In [155]: gp3 = df3.groupby(level=('letter', 'word'))
In [158]: means
Out[158]:
data1 data2
letter word
a bar 3.5 6.0
foo 2.5 5.5
b bar 2.5 5.5
foo 3.0 4.5
In [159]: errors
Out[159]:
data1 data2
letter word
a bar 0.707107 1.414214
foo 0.707107 0.707107
b bar 0.707107 0.707107
foo 1.414214 0.707107
# Plot
In [160]: fig, ax = plt.subplots()
Plotting tables
Plotting with matplotlib table is now supported in DataFrame.plot() and Series.plot() with a table
keyword. The table keyword can accept bool, DataFrame or Series. The simple way to draw a table is to
specify table=True. Data will be transposed to meet matplotlib’s default layout.
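A sketch of the table=True form (the frame and axes setup are made up for illustration):

import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1)
df = pd.DataFrame(np.random.rand(5, 3), columns=['a', 'b', 'c'])
ax.get_xaxis().set_visible(False)   # hide the x-axis ticks to leave room for the table
df.plot(table=True, ax=ax)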
Also, you can pass a different DataFrame or Series to the table keyword. The data will be drawn as
displayed in the print method (not transposed automatically). If required, it should be transposed manually
as seen in the example below.
There also exists a helper function pandas.plotting.table, which creates a table from a DataFrame or
Series, and adds it to a matplotlib.Axes instance. This function can accept keywords which the
matplotlib table has.
In [169]: from pandas.plotting import table
Note: You can get table instances on the axes using axes.tables property for further decorations. See the
matplotlib table documentation for more.
Colormaps
A potential issue when plotting a large number of columns is that it can be difficult to distinguish some
series due to repetition in the default colors. To remedy this, DataFrame plotting supports the use of the
colormap argument, which accepts either a Matplotlib colormap or a string that is a name of a colormap
registered with Matplotlib. A visualization of the default matplotlib colormaps is available here.
As matplotlib does not directly support colormaps for line-based plots, the colors are selected based on an
even spacing determined by the number of columns in the DataFrame. There is no consideration made for
background color, so some colormaps will produce lines that are not easily visible.
To use the cubehelix colormap, we can pass colormap='cubehelix'.
In [173]: df = pd.DataFrame(np.random.randn(1000, 10), index=ts.index)
In [174]: df = df.cumsum()
In [175]: plt.figure()
Out[175]: <Figure size 640x480 with 0 Axes>
In [176]: df.plot(colormap='cubehelix')
In [178]: plt.figure()
Out[178]: <Figure size 640x480 with 0 Axes>
In [179]: df.plot(colormap=cm.cubehelix)
Out[179]: <matplotlib.axes._subplots.AxesSubplot at 0x152a997d0>
Colormaps can also be used in other plot types, like bar charts:
In [180]: dd = pd.DataFrame(np.random.randn(10, 10)).applymap(abs)
In [181]: dd = dd.cumsum()
In [182]: plt.figure()
Out[182]: <Figure size 640x480 with 0 Axes>
In [183]: dd.plot.bar(colormap='Greens')
Out[183]: <matplotlib.axes._subplots.AxesSubplot at 0x152d6fdd0>
In some situations it may still be preferable or necessary to prepare plots directly with matplotlib, for instance
when a certain type of plot or customization is not (yet) supported by pandas. Series and DataFrame objects
behave like arrays and can therefore be passed directly to matplotlib functions without explicit casts.
pandas also automatically registers formatters and locators that recognize date indices, thereby extending
date and time support to practically all plot types available in matplotlib. Although this formatting does
not provide the same level of refinement you would get when plotting via pandas, it can be faster when
plotting a large number of points.
In [188]: price = pd.Series(np.random.randn(150).cumsum(),
.....: index=pd.date_range('2000-1-1', periods=150, freq='B'))
.....:
In [189]: ma = price.rolling(20).mean()
In [191]: plt.figure()
Out[191]: <Figure size 640x480 with 0 Axes>
Warning: The rplot trellis plotting interface has been removed. Please use external packages like
seaborn for similar but more refined functionality and refer to our 0.18.1 documentation here for how to
convert to using it.
{{ header }}
Percent change
Series and DataFrame have a method pct_change() to compute the percent change over a given number
of periods (using fill_method to fill NA/null values before computing the percent change).
In [2]: ser.pct_change()
Out[2]:
0 NaN
1 -3.499440
2 -1.209115
3 0.755930
4 -1.107920
5 -8.278259
6 4.389155
7 -1.588397
dtype: float64
In [4]: df.pct_change(periods=3)
Out[4]:
0 1 2 3
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 -1.032320 -0.144445 0.009776 -1.063561
4 0.262689 2.337771 -0.795219 -0.475762
5 -1.922600 -0.225164 -0.450228 -1.658606
6 -16.918561 -1.257664 2.236560 -17.931375
7 -1.660564 -1.197864 -8.654122 -0.311955
8 -2.167621 -1.274100 -2.128553 -0.511969
9 0.152402 -5.672460 -1.180728 -1.336598
Covariance
Series.cov() can be used to compute covariance between series (excluding missing values).
In [5]: s1 = pd.Series(np.random.randn(1000))
In [6]: s2 = pd.Series(np.random.randn(1000))
In [7]: s1.cov(s2)
Out[7]: -0.01982050306978479
Analogously, DataFrame.cov() can be used to compute pairwise covariances among the series in the DataFrame, also
excluding NA/null values.
Note: Assuming the missing data are missing at random this results in an estimate for the covariance
matrix which is unbiased. However, for many applications this estimate may not be acceptable because the
estimated covariance matrix is not guaranteed to be positive semi-definite. This could lead to estimated
correlations having absolute values which are greater than one, and/or a non-invertible covariance matrix.
See Estimation of covariance matrices for more details.
In [9]: frame.cov()
Out[9]:
a b c d e
a 0.993079 -0.041165 -0.000453 -0.034713 0.015692
b -0.041165 1.003421 -0.012468 -0.004746 -0.039112
c -0.000453 -0.012468 1.016472 -0.032166 -0.021802
d -0.034713 -0.004746 -0.032166 1.074440 0.022702
e 0.015692 -0.039112 -0.021802 0.022702 1.001368
DataFrame.cov also supports an optional min_periods keyword that specifies the required minimum number
of observations for each column pair in order to have a valid result.
In [10]: frame = pd.DataFrame(np.random.randn(20, 3), columns=['a', 'b', 'c'])
In [13]: frame.cov()
Out[13]:
a b c
a 1.246580 0.142555 -0.347796
b 0.142555 0.395455 -0.093032
c -0.347796 -0.093032 1.303810
In [14]: frame.cov(min_periods=12)
Out[14]:
a b c
a 1.246580 NaN -0.347796
b NaN 0.395455 -0.093032
c -0.347796 -0.093032 1.303810
Correlation
Correlation may be computed using the corr() method. Using the method parameter, several methods for
computing correlations are provided:
All of these are currently computed using pairwise complete observations. Wikipedia has articles covering
the above correlation coefficients:
• Pearson correlation coefficient
• Kendall rank correlation coefficient
• Spearman’s rank correlation coefficient
Note: Please see the caveats associated with this method of calculating correlation matrices in the
covariance section.
In [23]: frame.corr()
Out[23]:
a b c
a 1.000000 0.199704 0.057058
b 0.199704 1.000000 0.087441
c 0.057058 0.087441 1.000000
In [24]: frame.corr(min_periods=12)
Out[24]:
a b c
a 1.000000 NaN 0.057058
# histogram intersection
In [25]: def histogram_intersection(a, b):
....: return np.minimum(np.true_divide(a, a.sum()),
....: np.true_divide(b, b.sum())).sum()
....:
In [26]: frame.corr(method=histogram_intersection)
Out[26]:
a b c
a 1.000000 -0.837485 -4.851772
b -0.837485 1.000000 -36.602246
c -4.851772 -36.602246 1.000000
A related method corrwith() is implemented on DataFrame to compute the correlation between like-labeled
Series contained in different DataFrame objects.
In [27]: index = ['a', 'b', 'c', 'd', 'e']
In [31]: df1.corrwith(df2)
Out[31]:
one -0.445528
two -0.351807
three -0.905237
four 0.351228
dtype: float64
Data ranking
The rank() method produces a data ranking with ties being assigned the mean of the ranks (by default) for
the group:
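A sketch of the construction behind the example below: a Series with one tied pair, so the tied values share the mean of their ranks.

s = pd.Series(np.random.randn(5), index=list('abcde'))
s['d'] = s['b']    # create a tie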
In [35]: s.rank()
Out[35]:
a 1.0
b 2.5
c 4.0
d 2.5
e 5.0
dtype: float64
rank() is also a DataFrame method and can rank either the rows (axis=0) or the columns (axis=1). NaN
values are excluded from the ranking.
In [36]: df = pd.DataFrame(np.random.randn(10, 6))
In [38]: df
Out[38]:
0 1 2 3 4 5
0 -0.706923 -0.165747 3.230001 0.251850 3.230001 -1.656365
1 1.049660 0.183541 0.966606 -0.193403 0.966606 2.218584
2 -0.505413 1.414760 0.898012 0.984841 0.898012 1.808877
3 1.836020 -1.449018 0.123195 1.262849 0.123195 0.120509
4 -3.715824 0.277952 -0.101871 0.019478 -0.101871 0.456955
5 -1.013384 -0.966352 -0.525177 -0.873757 NaN -1.858289
6 -0.313772 -1.143538 1.034543 0.569028 NaN -0.083523
7 -1.009067 0.825404 0.122760 0.911427 NaN 1.128538
8 -0.335681 0.722818 0.102682 -1.701922 NaN -1.660909
9 1.571059 1.169795 -0.988364 0.586451 NaN -0.999830
In [39]: df.rank(1)
Out[39]:
0 1 2 3 4 5
0 2.0 3.0 5.5 4.0 5.5 1.0
1 5.0 2.0 3.5 1.0 3.5 6.0
2 1.0 5.0 2.5 4.0 2.5 6.0
3 6.0 1.0 3.5 5.0 3.5 2.0
4 1.0 5.0 2.5 4.0 2.5 6.0
5 2.0 3.0 5.0 4.0 NaN 1.0
6 2.0 1.0 5.0 4.0 NaN 3.0
7 1.0 3.0 2.0 4.0 NaN 5.0
8 3.0 5.0 4.0 1.0 NaN 2.0
9 5.0 4.0 2.0 3.0 NaN 1.0
rank optionally takes a parameter ascending which by default is true; when false, data is reverse-ranked,
with larger values assigned a smaller rank.
rank supports different tie-breaking methods, specified with the method parameter:
• average : average rank of tied group
• min : lowest rank in the group
For working with data, a number of window functions are provided for computing common window or rolling
statistics. Among these are count, sum, mean, median, correlation, variance, covariance, standard deviation,
skewness, and kurtosis.
The rolling() and expanding() functions can be used directly from DataFrameGroupBy objects, see the
groupby docs.
Note: The API for window statistics is quite similar to the way one works with GroupBy objects, see the
documentation here.
We work with rolling, expanding and exponentially weighted data through the corresponding objects,
Rolling, Expanding and EWM.
In [40]: s = pd.Series(np.random.randn(1000),
....: index=pd.date_range('1/1/2000', periods=1000))
....:
In [41]: s = s.cumsum()
In [42]: s
Out[42]:
2000-01-01 -0.976840
2000-01-02 -1.301961
2000-01-03 -1.621681
2000-01-04 -1.526089
2000-01-05 -1.816793
...
2002-09-22 0.175963
2002-09-23 -0.049734
2002-09-24 1.187367
2002-09-25 1.189690
2002-09-26 1.045869
Freq: D, Length: 1000, dtype: float64
In [44]: r
Out[44]: Rolling [window=60,center=False,axis=0]
Generally these methods all have the same interface. They all accept the following arguments:
• window: size of moving window
• min_periods: threshold of non-null data points to require (otherwise result is NA)
• center: boolean, whether to set the labels at the center (default is False)
We can then call methods on these rolling objects. These return like-indexed objects:
In [45]: r.mean()
Out[45]:
2000-01-01 NaN
2000-01-02 NaN
2000-01-03 NaN
2000-01-04 NaN
2000-01-05 NaN
...
2002-09-22 -1.233792
2002-09-23 -1.193738
2002-09-24 -1.133761
2002-09-25 -1.079222
2002-09-26 -1.001532
Freq: D, Length: 1000, dtype: float64
In [46]: s.plot(style='k--')
Out[46]: <matplotlib.axes._subplots.AxesSubplot at 0x127785a50>
In [47]: r.mean().plot(style='k')
Out[47]: <matplotlib.axes._subplots.AxesSubplot at 0x127785a50>
They can also be applied to DataFrame objects. This is really just syntactic sugar for applying the moving
window operator to all of the DataFrame’s columns:
In [49]: df = df.cumsum()
In [50]: df.rolling(window=60).sum().plot(subplots=True)
Out[50]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x13b785110>,
<matplotlib.axes._subplots.AxesSubplot object at 0x1228a5e50>,
<matplotlib.axes._subplots.AxesSubplot object at 0x122939510>,
<matplotlib.axes._subplots.AxesSubplot object at 0x122992690>],
dtype=object)
Method summary
Method Description
count() Number of non-null observations
sum() Sum of values
mean() Mean of values
median() Arithmetic median of values
min() Minimum
max() Maximum
std() Bessel-corrected sample standard deviation
var() Unbiased variance
skew() Sample skewness (3rd moment)
kurt() Sample kurtosis (4th moment)
quantile() Sample quantile (value at %)
apply() Generic apply
cov() Unbiased covariance (binary)
corr() Correlation (binary)
The apply() function takes an extra func argument and performs generic rolling computations. The func
argument should be a single function that produces a single value from an ndarray input. Suppose we wanted
to compute the mean absolute deviation on a rolling basis:
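A sketch of such a rolling mean absolute deviation (the Series s is assumed to be time-indexed, as in the earlier examples):

def mad(x):
    return np.fabs(x - x.mean()).mean()

s.rolling(window=60).apply(mad, raw=True).plot(style='k')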
Rolling windows
Passing win_type to .rolling generates a generic rolling window computation that is weighted according to
the win_type. The following methods are available:
Method Description
sum() Sum of values
mean() Mean of values
The weights used in the window are specified by the win_type keyword. The list of recognized types are the
scipy.signal window functions:
• boxcar
• triang
• blackman
• hamming
• bartlett
• parzen
• bohman
• blackmanharris
• nuttall
• barthann
• kaiser (needs beta)
• gaussian (needs std)
• general_gaussian (needs power, width)
• slepian (needs width)
• exponential (needs tau).
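A sketch of a weighted window (the Series below is made up; any win_type from the list works analogously, provided SciPy is installed):

ser = pd.Series(np.random.randn(10), index=pd.date_range('1/1/2000', periods=10))
ser.rolling(window=5, win_type='triang').mean()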
In [56]: ser.rolling(window=5).mean()
Out[56]:
2000-01-01 NaN
2000-01-02 NaN
2000-01-03 NaN
2000-01-04 NaN
2000-01-05 0.278863
2000-01-06 0.363741
2000-01-07 -0.014066
2000-01-08 0.093474
2000-01-09 0.047512
2000-01-10 -0.358661
Freq: D, dtype: float64
For some windowing functions, additional parameters must be specified:
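For example, a Gaussian window needs its standard deviation (a sketch, reusing the ser above):

ser.rolling(window=5, win_type='gaussian').mean(std=0.1)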
Note: For .sum() with a win_type, there is no normalization done to the weights for the window. Passing
custom weights of [1, 1, 1] will yield a different result than passing weights of [2, 2, 2], for example.
When passing a win_type instead of explicitly specifying the weights, the weights are already normalized so
that the largest weight is 1.
In contrast, the nature of the .mean() calculation is such that the weights are normalized with respect to
each other. Weights of [1, 1, 1] and [2, 2, 2] yield the same result.
Time-aware rolling
In [59]: dft
Out[59]:
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:01 1.0
2013-01-01 09:00:02 2.0
2013-01-01 09:00:03 NaN
2013-01-01 09:00:04 4.0
This is a regular frequency index. Using an integer window parameter works to roll along the window
frequency.
In [60]: dft.rolling(2).sum()
Out[60]:
B
2013-01-01 09:00:00 NaN
2013-01-01 09:00:01 1.0
2013-01-01 09:00:02 3.0
2013-01-01 09:00:03 NaN
2013-01-01 09:00:04 NaN
In [62]: dft.rolling('2s').sum()
Out[62]:
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:01 1.0
2013-01-01 09:00:02 3.0
2013-01-01 09:00:03 2.0
2013-01-01 09:00:04 4.0
Using a non-regular, but still monotonic index, rolling with an integer window does not impart any special
calculation.
In [63]: dft = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]},
....: index=pd.Index([pd.Timestamp('20130101 09:00:00'),
....: pd.Timestamp('20130101 09:00:02'),
....: pd.Timestamp('20130101 09:00:03'),
....: pd.Timestamp('20130101 09:00:05'),
....: pd.Timestamp('20130101 09:00:06')],
....: name='foo'))
....:
In [64]: dft
Out[64]:
B
foo
2013-01-01 09:00:00 0.0
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 2.0
2013-01-01 09:00:05 NaN
2013-01-01 09:00:06 4.0
In [65]: dft.rolling(2).sum()
Out[65]:
B
foo
2013-01-01 09:00:00 NaN
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 3.0
2013-01-01 09:00:05 NaN
2013-01-01 09:00:06 NaN
Using the time-specification generates variable windows for this sparse data.
In [66]: dft.rolling('2s').sum()
Out[66]:
B
foo
2013-01-01 09:00:00 0.0
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 3.0
2013-01-01 09:00:05 NaN
2013-01-01 09:00:06 4.0
Furthermore, we now allow an optional on parameter to specify a column (rather than the default of the
index) in a DataFrame.
In [67]: dft = dft.reset_index()
In [68]: dft
Out[68]:
foo B
0 2013-01-01 09:00:00 0.0
1 2013-01-01 09:00:02 1.0
2 2013-01-01 09:00:03 2.0
3 2013-01-01 09:00:05 NaN
4 2013-01-01 09:00:06 4.0
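A sketch of rolling on that column rather than on the index (dft as shown above):

dft.rolling('2s', on='foo').sum()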
For example, having the right endpoint open is useful in many problems that require that there is no
contamination from present information back to past information. This allows the rolling window to compute
statistics “up to that point in time”, but not including that point in time.
In [75]: df
Out[75]:
x right both left neither
2013-01-01 09:00:01 1 1.0 1.0 NaN NaN
2013-01-01 09:00:02 1 2.0 2.0 1.0 1.0
2013-01-01 09:00:03 1 2.0 3.0 2.0 1.0
2013-01-01 09:00:04 1 2.0 3.0 2.0 1.0
2013-01-01 09:00:06 1 1.0 2.0 1.0 NaN
Currently, this feature is only implemented for time-based windows. For fixed windows, the closed parameter
cannot be set and the rolling window will always have both endpoints closed.
Using .rolling() with a time-based index is quite similar to resampling. They both operate and perform
reductive operations on time-indexed pandas objects.
When using .rolling() with an offset, the offset is a time-delta. Take a backwards-in-time looking window,
and aggregate all of the values in that window (including the end-point, but not the start-point). This is
the new value at that point in the result. These are variable sized windows in time-space for each point of
the input. You will get a same sized result as the input.
When using .resample() with an offset, construct a new index that is the frequency of the offset. For each
frequency bin, aggregate points from the input within a backwards-in-time looking window that fall in that
bin. The result of this aggregation is the output for that frequency point. The windows are fixed size in the
frequency space. Your result will have the shape of a regular frequency between the min and the max of the
original input object.
To summarize, .rolling() is a time-based window operation, while .resample() is a frequency-based
window operation.
Centering windows
By default the labels are set to the right edge of the window, but a center keyword is available so the labels
can be set at the center.
In [76]: ser.rolling(window=5).mean()
Out[76]:
2000-01-01 NaN
2000-01-02 NaN
2000-01-03 NaN
2000-01-04 NaN
2000-01-05 0.278863
2000-01-06 0.363741
2000-01-07 -0.014066
2000-01-08 0.093474
2000-01-09 0.047512
2000-01-10 -0.358661
Freq: D, dtype: float64
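A sketch of the centered variant for comparison (same ser as above):

ser.rolling(window=5, center=True).mean()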
cov() and corr() can compute moving window statistics about two Series or any combination of
DataFrame/Series or DataFrame/DataFrame. Here is the behavior in each case:
• two Series: compute the statistic for the pairing.
• DataFrame/Series: compute the statistics for each column of the DataFrame with the passed Series,
thus returning a DataFrame.
• DataFrame/DataFrame: by default compute the statistic for matching column names, returning a
DataFrame. If the keyword argument pairwise=True is passed then computes the statistic for each
pair of columns, returning a MultiIndexed DataFrame whose index are the dates in question (see the
next section).
For example:
In [79]: df = df.cumsum()
In [81]: df2.rolling(window=5).corr(df2['B'])
Out[81]:
A B C D
2000-01-01 NaN NaN NaN NaN
2000-01-02 NaN NaN NaN NaN
2000-01-03 NaN NaN NaN NaN
2000-01-04 NaN NaN NaN NaN
2000-01-05 -0.285702 1.0 0.001386 -0.644391
2000-01-06 -0.382557 1.0 0.314561 -0.663137
2000-01-07 -0.758014 1.0 0.922098 -0.928193
2000-01-08 0.089482 1.0 0.921002 -0.826439
2000-01-09 0.556369 1.0 0.898814 -0.441671
2000-01-10 0.796184 1.0 0.802376 0.455321
2000-01-11 0.598143 1.0 0.132695 0.837587
2000-01-12 0.092580 1.0 0.483847 0.534910
2000-01-13 0.700883 1.0 -0.346985 0.714960
2000-01-14 0.457040 1.0 -0.614097 0.601374
2000-01-15 -0.199679 1.0 -0.759084 0.319158
2000-01-16 -0.470263 1.0 -0.765565 -0.007433
2000-01-17 -0.519380 1.0 -0.789353 -0.038121
2000-01-18 -0.182654 1.0 -0.568321 -0.656316
2000-01-19 0.289694 1.0 -0.574918 -0.642129
2000-01-20 0.047778 1.0 -0.626036 -0.841054
In financial data analysis and other fields it’s common to compute covariance and correlation matrices for a
collection of time series. Often one is also interested in moving-window covariance and correlation matrices.
This can be done by passing the pairwise keyword argument, which in the case of DataFrame inputs will
yield a MultiIndexed DataFrame whose index are the dates in question. In the case of a single DataFrame
argument the pairwise argument can even be omitted:
Note: Missing values are ignored and each entry is computed using the pairwise complete observations.
Please see the covariance section for caveats associated with this method of calculating covariance and
correlation matrices.
In [83]: covs.loc['2002-09-22':]
Out[83]:
B C D
2002-09-22 A -6.321005 -3.463368 3.549138
B 12.840797 1.109490 -5.797594
C 1.109490 1.893422 -0.882697
2002-09-23 A -8.255508 -3.651426 3.961313
B 13.553810 1.630004 -5.883064
C 1.630004 1.897993 -1.000379
2002-09-24 A -9.907739 -4.008386 4.363742
B 14.472223 2.014068 -5.992416
C 2.014068 1.986915 -1.090262
2002-09-25 A -11.462223 -4.333622 4.887435
B 15.350662 2.486750 -6.205751
C 2.486750 2.063034 -1.256018
2002-09-26 A -13.446182 -4.847647 5.782338
B 16.726803 3.168894 -6.881905
C 3.168894 2.234866 -1.562877
In [85]: correls.loc['2002-09-22':]
Out[85]:
A B C D
2002-09-22 A 1.000000 -0.525402 -0.749681 0.431575
B -0.525402 1.000000 0.225011 -0.660517
C -0.749681 0.225011 1.000000 -0.261891
D 0.431575 -0.660517 -0.261891 1.000000
2002-09-23 A 1.000000 -0.633521 -0.748795 0.456818
B -0.633521 1.000000 0.321374 -0.652272
C -0.748795 0.321374 1.000000 -0.296397
D 0.456818 -0.652272 -0.296397 1.000000
2002-09-24 A 1.000000 -0.697422 -0.761498 0.477549
B -0.697422 1.000000 0.375592 -0.643730
C -0.761498 0.375592 1.000000 -0.316090
D 0.477549 -0.643730 -0.316090 1.000000
2002-09-25 A 1.000000 -0.749655 -0.773133 0.510224
B -0.749655 1.000000 0.441891 -0.645289
C -0.773133 0.441891 1.000000 -0.356259
D 0.510224 -0.645289 -0.356259 1.000000
2002-09-26 A 1.000000 -0.802958 -0.791964 0.560321
B -0.802958 1.000000 0.518293 -0.667628
C -0.791964 0.518293 1.000000 -0.414793
D 0.560321 -0.667628 -0.414793 1.000000
You can efficiently retrieve the time series of correlations between two columns by reshaping and indexing:
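A sketch of one such reshaping, assuming correls is the pairwise rolling correlation result shown above (a MultiIndexed frame of date and column):

correls.unstack(1)[('A', 'C')].plot()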
4.11.3 Aggregation
Once the Rolling, Expanding or EWM objects have been created, several methods are available to perform
multiple computations on the data. These operations are similar to the aggregating API, groupby API, and
resample API.
In [89]: r
Out[89]: Rolling [window=60,min_periods=1,center=False,axis=0]
We can aggregate by passing a function to the entire DataFrame, or select a Series (or multiple Series) via
standard __getitem__.
In [90]: r.aggregate(np.sum)
Out[90]:
A B C
2000-01-01 1.697962 0.139101 -0.658131
2000-01-02 1.956254 0.333698 -3.074300
2000-01-03 2.226001 -0.657883 -6.222614
2000-01-04 1.882546 -1.518917 -6.271144
2000-01-05 2.049197 -2.031094 -6.393307
... ... ... ...
2002-09-22 5.024121 8.047617 12.955634
2002-09-23 3.556230 8.547739 13.261263
2002-09-24 2.352943 9.139912 12.724014
2002-09-25 1.206819 7.741372 11.936045
2002-09-26 -0.064820 5.847811 11.188268
In [91]: r['A'].aggregate(np.sum)
Out[91]:
2000-01-01 1.697962
2000-01-02 1.956254
2000-01-03 2.226001
2000-01-04 1.882546
2000-01-05 2.049197
...
2002-09-22 5.024121
2002-09-23 3.556230
2002-09-24 2.352943
2002-09-25 1.206819
2002-09-26 -0.064820
Freq: D, Name: A, Length: 1000, dtype: float64
With windowed Series you can also pass a list of functions to do aggregation with, outputting a DataFrame:
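A sketch of passing a list of functions to a windowed Series (r as created above):

r['A'].agg([np.sum, np.mean, np.std])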
On a windowed DataFrame, you can pass a list of functions to apply to each column, which produces an
aggregated result with a hierarchical index:
Passing a dict of functions has different behavior by default, see the next section.
By passing a dict to aggregate you can apply a different aggregation to the columns of a DataFrame:
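A sketch of a per-column dict of aggregations (r as above; the lambda is just an example reducer):

r.agg({'A': np.sum, 'B': lambda x: np.std(x, ddof=1)})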
The function names can also be strings. In order for a string to be valid, it must be implemented on the
windowed object.
Furthermore you can pass a nested dict to indicate different aggregations on different columns.
A common alternative to rolling statistics is to use an expanding window, which yields the value of the
statistic with all the data available up to that point in time.
These follow a similar interface to .rolling, with the .expanding method returning an Expanding object.
As these calculations are a special case of rolling statistics, they are implemented in pandas such that the
following two calls are equivalent:
In [98]: df.rolling(window=len(df), min_periods=1).mean()[:5]
Out[98]:
A B C D
2000-01-01 -1.294374 0.554789 -0.493573 -1.044350
2000-01-02 -1.685776 0.306997 -0.491494 -1.785642
2000-01-03 -1.896213 0.797296 -0.378842 -2.447085
2000-01-04 -1.894460 1.091742 -0.467363 -2.590780
2000-01-05 -1.788451 1.095136 -0.690874 -2.647958
In [99]: df.expanding(min_periods=1).mean()[:5]
Out[99]:
A B C D
2000-01-01 -1.294374 0.554789 -0.493573 -1.044350
2000-01-02 -1.685776 0.306997 -0.491494 -1.785642
2000-01-03 -1.896213 0.797296 -0.378842 -2.447085
2000-01-04 -1.894460 1.091742 -0.467363 -2.590780
2000-01-05 -1.788451 1.095136 -0.690874 -2.647958
These have a similar set of methods to .rolling methods.
Method summary
Function Description
count() Number of non-null observations
sum() Sum of values
mean() Mean of values
median() Arithmetic median of values
min() Minimum
max() Maximum
std() Unbiased standard deviation
var() Unbiased variance
skew() Unbiased skewness (3rd moment)
kurt() Unbiased kurtosis (4th moment)
quantile() Sample quantile (value at %)
apply() Generic apply
cov() Unbiased covariance (binary)
corr() Correlation (binary)
Aside from not having a window parameter, these functions have the same interfaces as their .rolling
counterparts. Like above, the parameters they all accept are:
• min_periods: threshold of non-null data points to require. Defaults to minimum needed to compute
statistic. No NaNs will be output once min_periods non-null data points have been seen.
• center: boolean, whether to set the labels at the center (default is False).
Note: The output of the .rolling and .expanding methods does not contain a NaN if there are at least
min_periods non-null values in the current window. For example:
In [100]: sn = pd.Series([1, 2, np.nan, 3, np.nan, 4])
In [101]: sn
Out[101]:
0 1.0
1 2.0
2 NaN
3 3.0
4 NaN
5 4.0
dtype: float64
In [102]: sn.rolling(2).max()
Out[102]:
0 NaN
1 2.0
2 NaN
3 NaN
4 NaN
5 NaN
dtype: float64
In [105]: sn.cumsum()
Out[105]:
0 1.0
1 3.0
2 NaN
3 6.0
4 NaN
5 10.0
dtype: float64
In [106]: sn.cumsum().fillna(method='ffill')
Out[106]:
0 1.0
1 3.0
2 3.0
3 6.0
4 6.0
5 10.0
dtype: float64
An expanding window statistic will be more stable (and less responsive) than its rolling window counterpart
as the increasing window size decreases the relative impact of an individual data point. As an example, here
is the mean() output for the previous time series dataset:
In [107]: s.plot(style='k--')
Out[107]: <matplotlib.axes._subplots.AxesSubplot at 0x13c6ca050>
In [108]: s.expanding().mean().plot(style='k')
Out[108]: <matplotlib.axes._subplots.AxesSubplot at 0x13c6ca050>
A related set of functions are exponentially weighted versions of several of the above statistics. A similar
interface to .rolling and .expanding is accessed through the .ewm method to receive an EWM object. A
number of expanding EW (exponentially weighted) methods are provided:
Function Description
mean() EW moving average
var() EW moving variance
std() EW moving standard deviation
corr() EW moving correlation
cov() EW moving covariance
where x_t is the input, y_t is the result and the w_i are the weights.
The EW functions support two variants of exponential weights. The default, adjust=True, uses the weights
w_i = (1 − α)^i. When adjust=False is specified, moving averages are calculated as
y_0 = x_0,
y_t = (1 − α) y_{t−1} + α x_t,
which is sometimes written in terms of α′ = 1 − α as y_t = α′ y_{t−1} + (1 − α′) x_t.
The difference between the above two variants arises because we are dealing with series which have finite
history. Consider a series of infinite history, with adjust=True:
Noting that the denominator is a geometric series with initial term equal to 1 and a ratio of 1 − α we have
which is the same expression as adjust=False above and therefore shows the equivalence of the two variants
for infinite series. When adjust=False, we have y0 = x0 and yt = αxt + (1 − α)yt−1 . Therefore, there is
an assumption that x0 is not an ordinary value but rather an exponentially weighted moment of the infinite
series up to that point.
One must have 0 < α ≤ 1, and while since version 0.18.0 it has been possible to pass α directly, it’s often
easier to think about either the span, center of mass (com) or half-life of an EW moment:
α = 2 / (s + 1),              for span s ≥ 1
α = 1 / (1 + c),              for center of mass c ≥ 0
α = 1 − exp(log(0.5) / h),    for half-life h > 0
One must specify precisely one of span, center of mass, half-life and alpha to the EW functions:
• Span corresponds to what is commonly called an “N-day EW moving average”.
• Center of mass has a more physical interpretation and can be thought of in terms of span: c =
(s − 1)/2.
• Half-life is the period of time for the exponential weight to reduce to one half.
In [110]: s.ewm(span=20).mean().plot(style='k')
Out[110]: <matplotlib.axes._subplots.AxesSubplot at 0x13c6e4c50>
EWM has a min_periods argument, which has the same meaning it does for all the .expanding and
.rolling methods: no output values will be set until at least min_periods non-null values are encountered
in the (expanding) window.
EWM also has an ignore_na argument, which determines how intermediate null values affect the calculation
of the weights. When ignore_na=False (the default), weights are calculated based on absolute positions,
so that intermediate null values affect the result. When ignore_na=True, weights are calculated by ignoring
intermediate null values. For example, assuming adjust=True, if ignore_na=False, the weighted average
of 3, NaN, 5 would be calculated as
((1 − α)^2 · 3 + 1 · 5) / ((1 − α)^2 + 1).
Whereas if ignore_na=True, the weighted average would be calculated as
((1 − α) · 3 + 1 · 5) / ((1 − α) + 1).
The var(), std(), and cov() functions have a bias argument, specifying whether the result should con-
tain biased or unbiased statistics. For example, if bias=True, ewmvar(x) is calculated as ewmvar(x) =
ewma(x**2) - ewma(x)**2; whereas if bias=False (the default), the biased variance statistics are scaled
by debiasing factors
(Σ_{i=0}^t w_i)^2 / ((Σ_{i=0}^t w_i)^2 − Σ_{i=0}^t w_i^2).
(For w_i = 1, this reduces to the usual N / (N − 1) factor, with N = t + 1.) See Weighted Sample Variance
on Wikipedia for further details. {{ header }}
By “group by” we are referring to a process involving one or more of the following steps:
• Splitting the data into groups based on some criteria.
• Applying a function to each group independently.
• Combining the results into a data structure.
Out of these, the split step is the most straightforward. In fact, in many situations we may wish to split the
data set into groups and do something with those groups. In the apply step, we might wish to do one of the
following:
• Aggregation: compute a summary statistic (or statistics) for each group. Some examples:
– Compute group sums or means.
– Compute group sizes / counts.
• Transformation: perform some group-specific computations and return a like-indexed object. Some
examples:
– Standardize data (zscore) within a group.
– Filling NAs within groups with a value derived from each group.
• Filtration: discard some groups, according to a group-wise computation that evaluates True or False.
Some examples:
– Discard data that belongs to groups with only a few members.
– Filter out data based on the group sum or mean.
• Some combination of the above: GroupBy will examine the results of the apply step and try to return
a sensibly combined result if it doesn’t fit into either of the above two categories.
Since the set of object instance methods on pandas data structures is generally rich and expressive, we
often simply want to invoke, say, a DataFrame function on each group. The name GroupBy should be quite
familiar to those who have used a SQL-based tool (or itertools), where a GROUP BY clause splits a table
by one or more columns and computes an aggregate for each group.
We aim to make operations like this natural and easy to express using pandas. We’ll address each area of
GroupBy functionality then provide some non-trivial examples / use cases.
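As a rough sketch of how such a SQL-style aggregation maps onto pandas (the frame and column names here are hypothetical):

import pandas as pd

sales = pd.DataFrame({'Column1': ['a', 'a', 'b'],
                      'Column2': ['x', 'y', 'x'],
                      'Column3': [1.0, 2.0, 3.0],
                      'Column4': [10, 20, 30]})
# split on Column1/Column2, apply a mean and a sum per group, combine the results into a frame
sales.groupby(['Column1', 'Column2']).agg({'Column3': 'mean', 'Column4': 'sum'})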
See the cookbook for some advanced strategies.
pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping
of labels to group names. To create a GroupBy object (more on what the GroupBy object is later), you may
do the following:
In [2]: df
Out[2]:
class order max_speed
falcon bird Falconiformes 389.0
parrot bird Psittaciformes 24.0
lion mammal Carnivora 80.2
monkey mammal Primates NaN
leopard mammal Carnivora 58.0
# default is axis=0
In [3]: grouped = df.groupby('class')
Note: A string passed to groupby may refer to either a column or an index level. If a string matches both
a column name and an index level name, a ValueError will be raised.
In [7]: df
Out[7]:
A B C D
0 foo one 0.359465 -1.079191
1 bar one -1.554635 0.249527
2 foo two 0.266660 0.810474
3 bar three 0.982955 0.093005
4 foo two 1.619710 0.809906
5 bar two -1.090548 -0.713776
6 foo one -1.058445 0.737929
7 foo three 0.565722 0.323839
On a DataFrame, we obtain a GroupBy object by calling groupby(). We could naturally group by either
the A or B columns, or both:
In [12]: grouped.sum()
Out[12]:
C D
A
bar -1.662228 -0.371244
foo 1.753112 1.602958
These will split the DataFrame on its index (rows). We could also split by the columns:
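A sketch of splitting along the columns instead (the mapping function is illustrative):

def get_letter_type(letter):
    # map each column label to 'vowel' or 'consonant'
    return 'vowel' if letter.lower() in 'aeiou' else 'consonant'

grouped = df.groupby(get_letter_type, axis=1)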
pandas Index objects support duplicate values. If a non-unique index is used as the group key in a groupby
operation, all values for the same index value will be considered to be in one group and thus the output of
aggregation functions will only contain unique index values:
In [15]: lst = [1, 2, 3, 1, 2, 3]
In [18]: grouped.first()
Out[18]:
1 1
2 2
3 3
dtype: int64
In [19]: grouped.last()
Out[19]:
1 10
2 20
3 30
dtype: int64
In [20]: grouped.sum()
Out[20]:
1 11
2 22
3 33
dtype: int64
Note that no splitting occurs until it’s needed. Creating the GroupBy object only verifies that you’ve
passed a valid mapping.
Note: Many kinds of complicated data manipulations can be expressed in terms of GroupBy operations
(though they can't be guaranteed to be the most efficient). You can get quite creative with the label
mapping functions.
GroupBy sorting
By default the group keys are sorted during the groupby operation. You may however pass sort=False for
potential speedups:
In [21]: df2 = pd.DataFrame({'X': ['B', 'B', 'A', 'A'], 'Y': [1, 2, 3, 4]})
In [22]: df2.groupby(['X']).sum()
Out[22]:
Y
X
A 7
B 3
Note that groupby will preserve the order in which observations are sorted within each group. For example,
the groups created by groupby() below are in the order they appeared in the original DataFrame:
In [24]: df3 = pd.DataFrame({'X': ['A', 'B', 'A', 'B'], 'Y': [1, 4, 3, 2]})
In [25]: df3.groupby(['X']).get_group('A')
Out[25]:
X Y
0 A 1
2 A 3
In [26]: df3.groupby(['X']).get_group('B')
Out[26]:
X Y
1 B 4
3 B 2
The groups attribute is a dict whose keys are the computed unique groups and corresponding values being
the axis labels belonging to each group. In the above example we have:
In [27]: df.groupby('A').groups
Out[27]:
{'bar': Int64Index([1, 3, 5], dtype='int64'),
'foo': Int64Index([0, 2, 4, 6, 7], dtype='int64')}
In [30]: grouped.groups
Out[30]:
{('bar', 'one'): Int64Index([1], dtype='int64'),
('bar', 'three'): Int64Index([3], dtype='int64'),
('bar', 'two'): Int64Index([5], dtype='int64'),
('foo', 'one'): Int64Index([0, 6], dtype='int64'),
('foo', 'three'): Int64Index([7], dtype='int64'),
('foo', 'two'): Int64Index([2, 4], dtype='int64')}
In [31]: len(grouped)
Out[31]: 6
GroupBy will tab complete column names (and other attributes):
In [32]: df
Out[32]:
height weight gender
2000-01-01 67.539208 153.593088 male
In [33]: gb = df.groupby('gender')
With hierarchically-indexed data, it’s quite natural to group by one of the levels of the hierarchy.
Let’s create a Series with a two-level MultiIndex.
In [35]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
....: ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
....:
In [38]: s
Out[38]:
first second
bar one 0.994442
two -1.007082
baz one 0.541666
two 1.649827
foo one -0.218738
two 1.589639
qux one -0.811123
two 0.299770
dtype: float64
In [40]: grouped.sum()
Out[40]:
first
bar -0.012640
baz 2.191493
foo 1.370901
qux -0.511354
dtype: float64
If the MultiIndex has names specified, these can be passed instead of the level number:
In [41]: s.groupby(level='second').sum()
Out[41]:
second
one 0.506247
two 2.532153
dtype: float64
The aggregation functions such as sum will take the level parameter directly. Additionally, the resulting
index will be named according to the chosen level:
In [42]: s.sum(level='second')
Out[42]:
second
one 0.506247
two 2.532153
dtype: float64
A DataFrame may be grouped by a combination of columns and index levels by specifying the column names
as strings and the index levels as pd.Grouper objects.
In [46]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
....: ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
....:
In [49]: df
Out[49]:
A B
first second
bar one 1 0
two 1 1
baz one 1 2
two 1 3
foo one 2 4
two 2 5
qux one 3 6
two 3 7
The following example groups df by the second index level and the A column.
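A sketch of that call, using the df defined just above:

df.groupby([pd.Grouper(level='second'), 'A']).sum()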
Once you have created the GroupBy object from a DataFrame, you might want to do something different
for each of the columns. Thus, using [] similar to getting a column from a DataFrame, you can do:
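A minimal sketch of that column selection, assuming the earlier df with columns A through D:

grouped = df.groupby(['A'])
grouped_C = grouped['C']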
This is mainly syntactic sugar for the alternative and much more verbose:
In [56]: df['C'].groupby(df['A'])
Out[56]: <pandas.core.groupby.generic.SeriesGroupBy object at 0x14280af90>
Additionally this method avoids recomputing the internal grouping information derived from the passed key.
With the GroupBy object in hand, iterating through the grouped data is very natural and functions similarly
to itertools.groupby():
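A minimal sketch of that iteration:

grouped = df.groupby('A')
for name, group in grouped:
    print(name)
    print(group)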
In the case of grouping by multiple keys, the group name will be a tuple:
In [60]: grouped.get_group('bar')
Out[60]:
A B C D
1 bar one 0.288024 -1.467627
3 bar three 0.745006 0.788612
5 bar two 2.621557 0.125725
4.12.4 Aggregation
Once the GroupBy object has been created, several methods are available to perform a computation on the
grouped data. These operations are similar to the aggregating API , window functions API , and resample
API .
An obvious one is aggregation via the aggregate() or equivalently agg() method:
In [63]: grouped.aggregate(np.sum)
Out[63]:
C D
A
bar 3.654587 -0.553289
foo 2.738230 3.725545
In [65]: grouped.aggregate(np.sum)
Out[65]:
C D
A B
bar one 0.288024 -1.467627
three 0.745006 0.788612
two 2.621557 0.125725
foo one 1.070581 2.086976
three 0.362188 -0.068011
two 1.305461 1.706580
As you can see, the result of the aggregation will have the group names as the new index along the grouped
axis. In the case of multiple keys, the result is a MultiIndex by default, though this can be changed by using
the as_index option:
In [66]: grouped = df.groupby(['A', 'B'], as_index=False)
In [67]: grouped.aggregate(np.sum)
Out[67]:
A B C D
0 bar one 0.288024 -1.467627
1 bar three 0.745006 0.788612
2 bar two 2.621557 0.125725
3 foo one 1.070581 2.086976
4 foo three 0.362188 -0.068011
5 foo two 1.305461 1.706580
Another simple aggregation example is to compute the size of each group. This is included in GroupBy as
the size method. It returns a Series whose index are the group names and whose values are the sizes of
each group.
In [70]: grouped.size()
Out[70]:
A B
bar one 1
three 1
two 1
foo one 2
three 1
two 2
dtype: int64
In [71]: grouped.describe()
Out[71]:
Note: Aggregation functions will not return the groups that you are aggregating over if they are named
columns, when as_index=True, the default. The grouped columns will be the indices of the returned object.
Passing as_index=False will return the groups that you are aggregating over, if they are named columns.
Aggregating functions are the ones that reduce the dimension of the returned objects. Some common
aggregating functions are tabulated below:
Function Description
mean() Compute mean of groups
sum() Compute sum of group values
size() Compute group sizes
count() Compute count of group
std() Standard deviation of groups
var() Compute variance of groups
sem() Standard error of the mean of groups
describe() Generates descriptive statistics
first() Compute first of group values
last() Compute last of group values
nth() Take nth value, or a subset if n is a list
min() Compute min of group values
max() Compute max of group values
The aggregating functions above will exclude NA values. Any function which reduces a Series to a scalar
value is an aggregation function and will work; a trivial example is df.groupby('A').agg(lambda ser:
1). Note that nth() can act as a reducer or a filter, see here.
With grouped Series you can also pass a list or dict of functions to do aggregation with, outputting a
DataFrame:
On a grouped DataFrame, you can pass a list of functions to apply to each column, which produces an
aggregated result with a hierarchical index:
In [74]: grouped.agg([np.sum, np.mean, np.std])
Out[74]:
C D
sum mean std sum mean std
A
bar 3.654587 1.218196 1.236638 -0.553289 -0.184430 1.159655
foo 2.738230 0.547646 0.737211 3.725545 0.745109 0.775629
The resulting aggregations are named for the functions themselves. If you need to rename, then you can add
in a chained operation for a Series like this:
In [75]: (grouped['C'].agg([np.sum, np.mean, np.std])
....: .rename(columns={'sum': 'foo',
....: 'mean': 'bar',
....: 'std': 'baz'}))
....:
Out[75]:
foo bar baz
A
bar 3.654587 1.218196 1.236638
foo 2.738230 0.547646 0.737211
Note: In general, the output column names should be unique. You can’t apply the same function (or two
functions with the same name) to the same column.
In [77]: grouped['C'].agg(['sum', 'sum'])
---------------------------------------------------------------------------
SpecificationError Traceback (most recent call last)
<ipython-input-77-7be02859f395> in <module>
----> 1 grouped['C'].agg(['sum', 'sum'])
~/sandbox/pandas-doc/pandas/core/groupby/generic.py in _aggregate_multiple_funcs(self, arg, _level)
Pandas does allow you to provide multiple lambdas. In this case, pandas will mangle the name of the
(nameless) lambda functions, appending _<i> to each subsequent lambda.
Named aggregation
In [80]: animals
Out[80]:
kind height weight
0 cat 9.1 7.9
1 dog 6.0 7.5
2 cat 9.5 9.9
3 dog 34.0 198.0
In [82]: animals.groupby("kind").agg(
....: min_height=('height', 'min'),
....: max_height=('height', 'max'),
....: average_weight=('weight', np.mean),
....: )
....:
Out[82]:
min_height max_height average_weight
kind
cat 9.1 9.5 8.90
dog 6.0 34.0 102.75
If your desired output column names are not valid Python identifiers, construct a dictionary and unpack the
keyword arguments:
In [83]: animals.groupby("kind").agg(**{
....: 'total weight': pd.NamedAgg(column='weight', aggfunc=sum),
....: })
....:
Out[83]:
total weight
kind
cat 17.8
dog 205.5
Additional keyword arguments are not passed through to the aggregation functions. Only pairs of (column,
aggfunc) should be passed as **kwargs. If your aggregation function requires additional arguments,
partially apply them with functools.partial().
Note: For Python 3.5 and earlier, the order of **kwargs in a function was not preserved. This means
that the output column ordering would not be consistent. To ensure consistent ordering, the keys (and so
output columns) will always be sorted on Python 3.5 and earlier.
Named aggregation is also valid for Series groupby aggregations. In this case there’s no column selection, so
the values are just the functions.
In [84]: animals.groupby("kind").height.agg(
....: min_height='min',
....: max_height='max',
....: )
By passing a dict to aggregate you can apply a different aggregation to the columns of a DataFrame:
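A sketch of a dict-based aggregation on the earlier df:

df.groupby('A').agg({'C': 'sum', 'D': 'std'})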
The function names can also be strings. In order for a string to be valid it must be either implemented on
GroupBy or available via dispatching:
Some common aggregations, currently only sum, mean, std, and sem, have optimized Cython implementa-
tions:
In [87]: df.groupby('A').sum()
Out[87]:
C D
A
bar 3.654587 -0.553289
foo 2.738230 3.725545
4.12.5 Transformation
The transform method returns an object that is indexed the same (same size) as the one being grouped.
The transform function must:
• Return a result that is either the same size as the group chunk or broadcastable to the size of the group
chunk (e.g., a scalar, grouped.transform(lambda x: x.iloc[-1])).
• Operate column-by-column on the group chunk. The transform is applied to the first group chunk
using chunk.apply.
• Not perform in-place operations on the group chunk. Group chunks should be treated as immutable,
and changes to a group chunk may produce unexpected results. For example, when using fillna,
inplace must be False (grouped.transform(lambda x: x.fillna(inplace=False))).
• (Optionally) operates on the entire group chunk. If this is supported, a fast path is used starting from
the second chunk.
For example, suppose we wished to standardize the data within each group:
In [89]: index = pd.date_range('10/1/1999', periods=1100)
In [92]: ts.head()
Out[92]:
2000-01-08 0.865232
2000-01-09 0.835242
2000-01-10 0.801536
2000-01-11 0.786370
2000-01-12 0.777867
Freq: D, dtype: float64
In [93]: ts.tail()
Out[93]:
2002-09-30 0.360614
2002-10-01 0.382288
2002-10-02 0.383086
2002-10-03 0.365846
2002-10-04 0.374030
Freq: D, dtype: float64
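The standardization step itself is elided above; a minimal sketch of it, assuming the ts series built above:

key = lambda x: x.year
zscore = lambda x: (x - x.mean()) / x.std()
transformed = ts.groupby(key).transform(zscore)
grouped = ts.groupby(key)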
In [96]: grouped.mean()
Out[96]:
2000 0.460436
2001 0.477878
2002 0.440377
dtype: float64
In [97]: grouped.std()
Out[97]:
2000 0.215566
2001 0.111128
2002 0.130943
dtype: float64
# Transformed Data
In [98]: grouped_trans = transformed.groupby(lambda x: x.year)
In [99]: grouped_trans.mean()
Out[99]:
2000 2.368656e-15
2001 3.565451e-16
2002 -2.629265e-16
dtype: float64
In [100]: grouped_trans.std()
Out[100]:
2000 1.0
2001 1.0
2002 1.0
dtype: float64
We can also visually compare the original and transformed data sets.
In [102]: compare.plot()
Out[102]: <matplotlib.axes._subplots.AxesSubplot at 0x13c6d1250>
Transformation functions that have lower dimension outputs are broadcast to match the shape of the input
array.
Alternatively, the built-in methods could be used to produce the same outputs.
Another common data transform is to replace missing data with the group mean.
In [107]: data_df
Out[107]:
A B C
0 NaN -0.730428 -0.856715
1 1.774779 -1.499567 0.513143
2 0.246662 2.480413 0.241526
3 0.581335 0.954109 -2.689853
4 0.226786 0.622744 0.331039
.. ... ... ...
995 0.682520 0.378388 1.313710
996 1.568709 -0.332316 0.832525
997 NaN -0.636547 -0.044304
998 -0.404024 -0.463095 -1.191003
999 0.875355 NaN -1.597643
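The grouping key and the fillna transform are elided above; a plausible sketch, assuming a hypothetical country label for each row:

countries = np.array(['US', 'UK', 'GR', 'JP'])
key = countries[np.random.randint(0, 4, data_df.shape[0])]
grouped = data_df.groupby(key)
transformed = grouped.transform(lambda x: x.fillna(x.mean()))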
We can verify that the group means have not changed in the transformed data and that the transformed
data contains no NAs.
Note: Some functions will automatically transform the input when applied to a GroupBy object, returning
an object of the same shape as the original. Passing as_index=False will not affect these transformation
methods.
For example: fillna, ffill, bfill, shift.
In [119]: grouped.ffill()
Out[119]:
A B C
0 NaN -0.730428 -0.856715
In [121]: df_re
Out[121]:
A B
0 1 0
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
7 1 7
8 1 8
9 1 9
10 5 10
11 5 11
12 5 12
13 5 13
14 5 14
15 5 15
16 5 16
17 5 17
18 5 18
19 5 19
In [122]: df_re.groupby('A').rolling(4).B.mean()
Out[122]:
A
1 0 NaN
1 NaN
2 NaN
3 1.5
4 2.5
5 3.5
6 4.5
7 5.5
8 6.5
9 7.5
5 10 NaN
11 NaN
12 NaN
13 11.5
14 12.5
15 13.5
16 14.5
17 15.5
18 16.5
19 17.5
Name: B, dtype: float64
The expanding() method will accumulate a given operation (sum() in the example) for all the members of
each particular group.
In [123]: df_re.groupby('A').expanding().sum()
Out[123]:
A B
A
1 0 1.0 0.0
1 2.0 1.0
2 3.0 3.0
3 4.0 6.0
4 5.0 10.0
5 6.0 15.0
6 7.0 21.0
7 8.0 28.0
8 9.0 36.0
9 10.0 45.0
5 10 5.0 10.0
11 10.0 21.0
12 15.0 33.0
13 20.0 46.0
14 25.0 60.0
15 30.0 75.0
16 35.0 91.0
17 40.0 108.0
18 45.0 126.0
19 50.0 145.0
Suppose you want to use the resample() method to get a daily frequency in each group of your dataframe
and wish to complete the missing values with the ffill() method.
In [125]: df_re
Out[125]:
group val
date
2016-01-03 1 5
2016-01-10 1 6
2016-01-17 2 7
2016-01-24 2 8
In [126]: df_re.groupby('group').resample('1D').ffill()
Out[126]:
group val
group date
1 2016-01-03 1 5
2016-01-04 1 5
2016-01-05 1 5
2016-01-06 1 5
2016-01-07 1 5
2016-01-08 1 5
2016-01-09 1 5
2016-01-10 1 6
2 2016-01-17 2 7
2016-01-18 2 7
2016-01-19 2 7
2016-01-20 2 7
2016-01-21 2 7
2016-01-22 2 7
2016-01-23 2 7
2016-01-24 2 8
4.12.6 Filtration
The filter method returns a subset of the original object. Suppose we want to take only elements that
belong to groups with a group sum greater than 2.
The argument of filter must be a function that, applied to the group as a whole, returns True or False.
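A minimal sketch of such a filter:

sf = pd.Series([1, 1, 2, 3, 3, 3])
sf.groupby(sf).filter(lambda x: x.sum() > 2)   # keeps only the elements whose group sum exceeds 2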
Another useful operation is filtering out elements that belong to groups with only a couple members.
Alternatively, instead of dropping the offending groups, we can return a like-indexed object where the groups
that do not pass the filter are filled with NaNs.
For DataFrames with multiple columns, filters should explicitly specify a column as the filter criterion.
Note: Some functions when applied to a groupby object will act as a filter on the input, returning a
reduced shape of the original (and potentially eliminating groups), but with the index unchanged. Passing
as_index=False will not affect these transformation methods.
For example: head, tail.
In [134]: dff.groupby('B').head(2)
Out[134]:
A B C
0 0 a 0
1 1 a 1
2 2 b 2
3 3 b 3
6 6 c 6
7 7 c 7
When doing an aggregation or transformation, you might just want to call an instance method on each data
group. This is pretty easy to do by passing lambda functions:
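A sketch of the lambda-based form referred to above:

grouped = df.groupby('A')
grouped.agg(lambda x: x.std())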
But, it’s rather verbose and can be untidy if you need to pass additional arguments. Using a bit of metapro-
gramming cleverness, GroupBy now has the ability to “dispatch” method calls to the groups:
In [137]: grouped.std()
Out[137]:
C D
A
bar 1.236638 1.159655
foo 0.737211 0.775629
What is actually happening here is that a function wrapper is being generated. When invoked, it takes any
passed arguments and invokes the function with any arguments on each group (in the above example, the
std function). The results are then combined together much in the style of agg and transform (it actually
uses apply to infer the gluing, documented next). This enables some operations to be carried out rather
succinctly:
In [138]: tsdf = pd.DataFrame(np.random.randn(1000, 3),
.....: index=pd.date_range('1/1/2000', periods=1000),
.....: columns=['A', 'B', 'C'])
.....:
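The intermediate steps are elided above; a plausible sketch, assuming every other row is set to missing before grouping by year:

tsdf.iloc[::2] = np.nan
grouped = tsdf.groupby(lambda x: x.year)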
In [141]: grouped.fillna(method='pad')
Out[141]:
A B C
2000-01-01 NaN NaN NaN
2000-01-02 1.125163 -0.383261 0.284227
2000-01-03 1.125163 -0.383261 0.284227
2000-01-04 0.606730 -0.880692 -0.944393
2000-01-05 0.606730 -0.880692 -0.944393
... ... ... ...
2002-09-22 -1.409950 -0.444110 -0.553993
2002-09-23 -1.409950 -0.444110 -0.553993
2002-09-24 -1.158956 0.108182 3.050277
2002-09-25 -1.158956 0.108182 3.050277
2002-09-26 0.061621 0.301882 -1.085504
In this example, we chopped the collection of time series into yearly chunks then independently called fillna
on the groups.
The nlargest and nsmallest methods work on Series style groupbys:
In [142]: s = pd.Series([9, 8, 7, 5, 19, 1, 4.2, 3.3])
In [143]: g = pd.Series(list('abababab'))
In [144]: gb = s.groupby(g)
In [145]: gb.nlargest(3)
Out[145]:
a 4 19.0
0 9.0
2 7.0
b 1 8.0
3 5.0
7 3.3
dtype: float64
In [146]: gb.nsmallest(3)
Out[146]:
a 6 4.2
2 7.0
0 9.0
b 5 1.0
7 3.3
3 5.0
dtype: float64
Some operations on the grouped data might not fit into either the aggregate or transform categories. Or,
you may simply want GroupBy to infer how to combine the results. For these, use the apply function, which
can be substituted for both aggregate and transform in many standard use cases. However, apply can
handle some exceptional use cases, for example:
In [147]: df
Out[147]:
A B C D
0 foo one -0.297872 0.630951
1 bar one 0.288024 -1.467627
2 foo two 0.049068 0.077597
3 bar three 0.745006 0.788612
4 foo two 1.256393 1.628982
5 bar two 2.621557 0.125725
6 foo one 1.368453 1.456025
7 foo three 0.362188 -0.068011
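The grouping and the function f are elided above; a minimal sketch consistent with the output below:

grouped = df.groupby('A')['C']

def f(group):
    # return the original values alongside the values demeaned within each group
    return pd.DataFrame({'original': group,
                         'demeaned': group - group.mean()})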
In [152]: grouped.apply(f)
Out[152]:
original demeaned
0 -0.297872 -0.845518
1 0.288024 -0.930172
2 0.049068 -0.498578
3 0.745006 -0.473189
4 1.256393 0.708747
5 2.621557 1.403361
6 1.368453 0.820807
7 0.362188 -0.185458
apply on a Series can operate on a returned value from the applied function, that is itself a series, and
possibly upcast the result to a DataFrame:
In [153]: def f(x):
.....: return pd.Series([x, x ** 2], index=['x', 'x^2'])
.....:
In [154]: s = pd.Series(np.random.rand(5))
In [155]: s
Out[155]:
0 0.798954
1 0.191223
2 0.584617
3 0.692853
4 0.064370
dtype: float64
In [156]: s.apply(f)
Out[156]:
x x^2
0 0.798954 0.638328
1 0.191223 0.036566
2 0.584617 0.341777
3 0.692853 0.480045
4 0.064370 0.004144
Note: apply can act as a reducer, transformer, or filter function, depending on exactly what is passed to
it and on exactly what you are grouping. Depending on the path taken, the grouped column(s) may be
included in the output, and they may also be set as the index.
In [157]: df
Out[157]:
A B C D
0 foo one -0.297872 0.630951
1 bar one 0.288024 -1.467627
2 foo two 0.049068 0.077597
3 bar three 0.745006 0.788612
4 foo two 1.256393 1.628982
5 bar two 2.621557 0.125725
6 foo one 1.368453 1.456025
7 foo three 0.362188 -0.068011
Suppose we wish to compute the standard deviation grouped by the A column. There is a slight problem,
namely that we don’t care about the data in column B. We refer to this as a “nuisance” column. If the passed
aggregation function can’t be applied to some columns, the troublesome columns will be (silently) dropped.
Thus, this does not pose any problems:
In [158]: df.groupby('A').std()
Out[158]:
C D
A
Note: Any column of object dtype, even if it contains numerical values such as Decimal objects, is considered
a "nuisance" column. Such columns are excluded from aggregate functions automatically in groupby.
If you do wish to include decimal or object columns in an aggregation with other non-nuisance data types,
you must do so explicitly.
# ...but cannot be combined with standard data types or they will be excluded
In [162]: df_dec.groupby(['id'])[['int_column', 'dec_column']].sum()
Out[162]:
int_column
id
1 4
2 6
# Use .agg function to aggregate over standard and "nuisance" data types
# at the same time
In [163]: df_dec.groupby(['id']).agg({'int_column': 'sum', 'dec_column': 'sum'})
Out[163]:
int_column dec_column
id
1 4 0.75
2 6 0.55
When using a Categorical grouper (as a single grouper, or as part of multiple groupers), the observed
keyword controls whether to return a cartesian product of all possible grouper values (observed=False) or
only those grouper values that are observed (observed=True).
Show all values:
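A minimal sketch (the toy Series is an assumption for illustration):

cat = pd.Categorical(['a', 'a', 'a'], categories=['a', 'b'])
s = pd.Series([1, 1, 1]).groupby(cat, observed=False).count()   # includes the unobserved category 'b'
pd.Series([1, 1, 1]).groupby(cat, observed=True).count()        # only the observed category 'a'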
The dtype of the resulting group index will always include all of the categories that were grouped.
In [167]: s.index.dtype
Out[167]: CategoricalDtype(categories=['a', 'b'], ordered=False)
If there are any NaN or NaT values in the grouping key, these will be automatically excluded. In other words,
there will never be an “NA group” or “NaT group”. This was not the case in older versions of pandas, but
users were generally discarding the NA group anyway (and supporting it was an implementation headache).
Categorical variables represented as instance of pandas’s Categorical class can be used as group keys. If
so, the order of the levels will be preserved:
You may need to specify a bit more data to properly group. You can use the pd.Grouper to provide this
local control.
In [173]: df
Out[173]:
Branch Buyer Quantity Date
0 A Carl 1 2013-01-01 13:00:00
1 A Mark 3 2013-01-01 13:05:00
2 A Carl 5 2013-10-01 20:00:00
3 A Carl 1 2013-10-02 10:00:00
4 A Joe 8 2013-10-01 20:00:00
5 A Joe 1 2013-10-02 10:00:00
6 A Joe 9 2013-12-02 12:00:00
7 B Carl 3 2013-12-02 14:00:00
Groupby a specific column with the desired frequency. This is like resampling.
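A sketch of that grouping, using the df with the Date column shown above:

df.groupby([pd.Grouper(freq='1M', key='Date'), 'Buyer']).sum()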
You have an ambiguous specification in that you have a named index and a column that could be potential
groupers.
In [175]: df = df.set_index('Date')
Just like for a DataFrame or Series you can call head and tail on a groupby:
In [179]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
In [180]: df
Out[180]:
A B
0 1 2
1 1 4
2 5 6
In [181]: g = df.groupby('A')
In [182]: g.head(1)
Out[182]:
A B
0 1 2
2 5 6
In [183]: g.tail(1)
Out[183]:
A B
1 1 4
2 5 6
This shows the first or last n rows from each group.
To select from a DataFrame or Series the nth item, use nth(). This is a reduction method, and will return
a single row (or no row) per group if you pass an int for n:
In [184]: df = pd.DataFrame([[1, np.nan], [1, 4], [5, 6]], columns=['A', 'B'])
In [185]: g = df.groupby('A')
In [186]: g.nth(0)
Out[186]:
B
A
1 NaN
5 6.0
In [187]: g.nth(-1)
Out[187]:
B
A
1 4.0
5 6.0
In [188]: g.nth(1)
Out[188]:
B
A
1 4.0
If you want to select the nth not-null item, use the dropna kwarg. For a DataFrame this should be either
'any' or 'all' just like you would pass to dropna:
# nth(0) is the same as g.first()
In [189]: g.nth(0, dropna='any')
Out[189]:
B
A
1 4.0
5 6.0
In [190]: g.first()
Out[190]:
B
A
1 4.0
5 6.0
# nth(-1) is the same as g.last()
In [191]: g.nth(-1, dropna='any')
Out[191]:
B
A
1 4.0
5 6.0
In [192]: g.last()
Out[192]:
B
A
1 4.0
5 6.0
In [196]: g.nth(0)
Out[196]:
A B
0 1 NaN
2 5 6.0
In [197]: g.nth(-1)
Out[197]:
A B
1 1 4.0
2 5 6.0
You can also select multiple rows from each group by specifying multiple nth values as a list of ints.
In [198]: business_dates = pd.date_range(start='4/1/2014', end='6/30/2014', freq='B')
# get the first, 4th, and last date index for each month
In [200]: df.groupby([df.index.year, df.index.month]).nth([0, 3, -1])
Out[200]:
a b
2014 4 1 1
4 1 1
4 1 1
5 1 1
5 1 1
5 1 1
6 1 1
6 1 1
To see the order in which each row appears within its group, use the cumcount method:
In [201]: dfg = pd.DataFrame(list('aaabba'), columns=['A'])
In [202]: dfg
Out[202]:
A
0 a
1 a
2 a
3 b
4 b
5 a
In [203]: dfg.groupby('A').cumcount()
Out[203]:
0 0
1 1
2 2
3 0
4 1
5 3
dtype: int64
In [204]: dfg.groupby('A').cumcount(ascending=False)
Out[204]:
0 3
1 2
2 1
3 1
4 0
5 0
dtype: int64
Enumerate groups
In [206]: dfg
Out[206]:
A
0 a
1 a
2 a
3 b
4 b
5 a
In [207]: dfg.groupby('A').ngroup()
Out[207]:
0 0
1 0
2 0
3 1
4 1
5 0
dtype: int64
In [208]: dfg.groupby('A').ngroup(ascending=False)
Out[208]:
0 1
1 1
2 1
3 0
4 0
5 1
dtype: int64
Plotting
Groupby also works with some plotting methods. For example, suppose we suspect that some features in a
DataFrame may differ by group; in this case, the values in column 1 where the group is "B" are 3 higher on
average.
In [209]: np.random.seed(1234)
In [213]: df.groupby('g').boxplot()
Out[213]:
A AxesSubplot(0.1,0.15;0.363636x0.75)
B AxesSubplot(0.536364,0.15;0.363636x0.75)
dtype: object
The result of calling boxplot is a dictionary whose keys are the values of our grouping column g (“A” and
“B”). The values of the resulting dictionary can be controlled by the return_type keyword of boxplot. See
the visualization documentation for more.
In [214]: n = 1000
In [216]: df.head(2)
Out[216]:
Store Product Revenue Quantity
0 Store_2 Product_1 26.12 1
1 Store_2 Product_1 28.86 1
Piping can also be expressive when you want to deliver a grouped object to some arbitrary function, for
example:
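A minimal sketch of such a pipeline (the helper function is hypothetical):

def mean(groupby):
    # take a GroupBy object and reduce each group to its mean
    return groupby.mean()

df.groupby(['Store', 'Product']).pipe(mean)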
where mean takes a GroupBy object and finds the mean of the Revenue and Quantity columns respectively
for each Store-Product combination. The mean function can be any function that takes in a GroupBy object;
the .pipe will pass the GroupBy object as a parameter into the function you specify.
4.12.10 Examples
Regrouping by factor
Regroup columns of a DataFrame according to their sum, and sum the aggregated ones.
In [221]: df
Out[221]:
a b c d
0 1 0 1 2
1 0 1 0 3
2 0 0 0 4
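A sketch of the regrouping described above:

df.groupby(df.sum(), axis=1).sum()   # columns with equal column sums are grouped and summed together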
Multi-column factorization
By using ngroup(), we can extract information about the groups in a way similar to factorize() (as
described further in the reshaping API ) but which applies naturally to multiple columns of mixed type
and different sources. This can be useful as an intermediate categorical-like step in processing, when the
relationships between the group rows are more important than their content, or as input to an algorithm
which only accepts the integer encoding. (For more information about support in pandas for full categorical
data, see the Categorical introduction and the API documentation.)
In [223]: dfg = pd.DataFrame({"A": [1, 1, 2, 3, 2], "B": list("aaaba")})
In [224]: dfg
Out[224]:
A B
0 1 a
1 1 a
2 2 a
3 3 b
4 2 a
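A sketch of the factorization itself:

dfg.groupby(['A', 'B']).ngroup()   # one integer label per distinct (A, B) combination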
Resampling produces new hypothetical samples (resamples) from already existing observed data or from a
model that generates data. These new samples are similar to the pre-existing samples.
In order for resample() to work on indices that are non-datetimelike, the following procedure can be utilized.
In the following examples, df.index // 5 returns an integer array which is used to determine which rows
are selected for each group in the groupby operation.
Note: The example below shows how we can downsample by consolidating samples into fewer samples.
Here, by using df.index // 5, we are aggregating the samples into bins. By applying the std() function, we
aggregate the information contained in many samples into a small subset of values, namely their standard
deviation, thereby reducing the number of samples.
In [228]: df
Out[228]:
0 1
0 -0.793893 0.321153
1 0.342250 1.618906
2 -0.975807 1.918201
3 -0.810847 -1.405919
4 -1.977759 0.461659
5 0.730057 -1.316938
6 -0.751328 0.528290
7 -0.257759 -1.081009
8 0.505895 -1.701948
9 -1.006349 0.020208
In [229]: df.index // 5
Out[229]: Int64Index([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype='int64')
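A sketch of the downsampling step described in the note above:

df.groupby(df.index // 5).std()   # one row of standard deviations per bin of five consecutive samples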
Group DataFrame columns, compute a set of metrics and return a named Series. The Series name is used
as the name for the column index. This is especially useful in conjunction with reshaping operations such
as stacking in which the column index name will be used as the name of the inserted column:
In [231]: df = pd.DataFrame({'a': [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
.....: 'b': [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
.....: 'c': [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
.....: 'd': [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1]})
.....:
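The metric computation that produces result is elided; a minimal sketch consistent with the output below:

def compute_metrics(x):
    result = {'b_sum': x['b'].sum(), 'c_mean': x['c'].mean()}
    return pd.Series(result, name='metrics')

result = df.groupby('a').apply(compute_metrics)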
In [234]: result
Out[234]:
metrics b_sum c_mean
a
0 2.0 0.5
1 2.0 0.5
2 2.0 0.5
In [235]: result.stack()
Out[235]:
a metrics
0 b_sum 2.0
c_mean 0.5
1 b_sum 2.0
c_mean 0.5
2 b_sum 2.0
c_mean 0.5
dtype: float64
pandas contains extensive capabilities and features for working with time series data for all domains. Using
the NumPy datetime64 and timedelta64 dtypes, pandas has consolidated a large number of features from
other Python libraries like scikits.timeseries as well as created a tremendous amount of new functionality
for manipulating time series data.
For example, pandas supports:
Parsing time series information from various sources and formats
In [3]: dti
Out[3]: DatetimeIndex(['2018-01-01', '2018-01-01', '2018-01-01'], dtype='datetime64[ns]',
,→ freq=None)
In [5]: dti
Out[5]:
DatetimeIndex(['2018-01-01 00:00:00', '2018-01-01 01:00:00',
'2018-01-01 02:00:00'],
dtype='datetime64[ns]', freq='H')
In [7]: dti
Out[7]:
DatetimeIndex(['2018-01-01 00:00:00+00:00', '2018-01-01 01:00:00+00:00',
'2018-01-01 02:00:00+00:00'],
dtype='datetime64[ns, UTC]', freq='H')
In [8]: dti.tz_convert('US/Pacific')
Out[8]:
DatetimeIndex(['2017-12-31 16:00:00-08:00', '2017-12-31 17:00:00-08:00',
'2017-12-31 18:00:00-08:00'],
dtype='datetime64[ns, US/Pacific]', freq='H')
Resampling or converting a time series to a particular frequency
In [9]: idx = pd.date_range('2018-01-01', periods=5, freq='H')
In [11]: ts
Out[11]:
2018-01-01 00:00:00 0
2018-01-01 01:00:00 1
2018-01-01 02:00:00 2
2018-01-01 03:00:00 3
2018-01-01 04:00:00 4
Freq: H, dtype: int64
In [12]: ts.resample('2H').mean()
Out[12]:
2018-01-01 00:00:00 0.5
2018-01-01 02:00:00 2.5
2018-01-01 04:00:00 4.0
Freq: 2H, dtype: float64
Performing date and time arithmetic with absolute or relative time increments
In [13]: friday = pd.Timestamp('2018-01-05')
In [14]: friday.day_name()
Out[14]: 'Friday'
# Add 1 day
In [15]: saturday = friday + pd.Timedelta('1 day')
In [16]: saturday.day_name()
Out[16]: 'Saturday'
In [18]: monday.day_name()
Out[18]: 'Monday'
pandas provides a relatively compact and self-contained set of tools for performing the above tasks and more.
4.13.1 Overview
Concept       Scalar Class   Array Class      pandas Data Type                       Primary Creation Method
Date times    Timestamp      DatetimeIndex    datetime64[ns] or datetime64[ns, tz]   to_datetime or date_range
Time deltas   Timedelta      TimedeltaIndex   timedelta64[ns]                        to_timedelta or timedelta_range
Time spans    Period         PeriodIndex      period[freq]                           Period or period_range
Date offsets  DateOffset     None             None                                   DateOffset
For time series data, it’s conventional to represent the time component in the index of a Series or DataFrame
so manipulations can be performed with respect to the time element.
However, Series and DataFrame can directly also support the time component as data itself.
Series and DataFrame have extended data type support and functionality for datetime, timedelta and
Period data when passed into those constructors. DateOffset data however will be stored as object data.
In [21]: pd.Series(pd.period_range('1/1/2011', freq='M', periods=3))
Out[21]:
0 2011-01
1 2011-02
2 2011-03
dtype: period[M]
In [25]: pd.Timedelta(pd.NaT)
Out[25]: NaT
In [26]: pd.Period(pd.NaT)
Out[26]: NaT
Timestamped data is the most basic type of time series data that associates values with points in time. For
pandas objects, it means using the points in time to construct the index.
In [28]: pd.Timestamp(datetime.datetime(2012, 5, 1))
Out[28]: Timestamp('2012-05-01 00:00:00')
In [29]: pd.Timestamp('2012-05-01')
Out[29]: Timestamp('2012-05-01 00:00:00')
In [30]: pd.Timestamp(2012, 5, 1)
In [35]: type(ts.index)
Out[35]: pandas.core.indexes.datetimes.DatetimeIndex
In [36]: ts.index
Out[36]: DatetimeIndex(['2012-05-01', '2012-05-02', '2012-05-03'],␣
,→dtype='datetime64[ns]', freq=None)
In [37]: ts
Out[37]:
2012-05-01 1.212707
2012-05-02 -2.219105
2012-05-03 0.930394
dtype: float64
In [40]: type(ts.index)
Out[40]: pandas.core.indexes.period.PeriodIndex
In [41]: ts.index
Out[41]: PeriodIndex(['2012-01', '2012-02', '2012-03'], dtype='period[M]', freq='M')
In [42]: ts
Out[42]:
2012-01 0.703754
2012-02 0.580511
2012-03 -0.135776
Freq: M, dtype: float64
pandas allows you to capture both representations and convert between them. Under the hood, pan-
das represents timestamps using instances of Timestamp and sequences of timestamps using instances of
DatetimeIndex. For regular time spans, pandas uses Period objects for scalar values and PeriodIndex for
sequences of spans. Better support for irregular intervals with arbitrary start and end points is forthcoming
in future releases.
To convert a Series or list-like object of date-like objects e.g. strings, epochs, or a mixture, you can use
the to_datetime function. When passed a Series, this returns a Series (with the same index), while a
list-like is converted to a DatetimeIndex:
In [43]: pd.to_datetime(pd.Series(['Jul 31, 2009', '2010-01-10', None]))
Out[43]:
0 2009-07-31
1 2010-01-10
2 NaT
dtype: datetime64[ns]
Warning: You see in the above example that dayfirst isn’t strict, so if a date can’t be parsed with
the day being first it will be parsed as if dayfirst were False.
If you pass a single string to to_datetime, it returns a single Timestamp. Timestamp can also accept string
input, but it doesn’t accept string parsing options like dayfirst or format, so use to_datetime if these are
required.
In [47]: pd.to_datetime('2010/11/12')
Out[47]: Timestamp('2010-11-12 00:00:00')
In [48]: pd.Timestamp('2010/11/12')
Out[48]: Timestamp('2010-11-12 00:00:00')
You can also use the DatetimeIndex constructor directly:
The string ‘infer’ can be passed in order to set the frequency of the index as the inferred frequency upon
creation:
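A minimal sketch of both calls (the dates are illustrative):

pd.DatetimeIndex(['2018-01-01', '2018-01-03', '2018-01-05'])
pd.DatetimeIndex(['2018-01-01', '2018-01-03', '2018-01-05'], freq='infer')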
In addition to the required datetime string, a format argument can be passed to ensure specific parsing.
This could also potentially speed up the conversion considerably.
In [51]: pd.to_datetime('2010/11/12', format='%Y/%m/%d')
Out[51]: Timestamp('2010-11-12 00:00:00')
In [54]: pd.to_datetime(df)
Out[54]:
0 2015-02-04 02:00:00
1 2016-03-05 03:00:00
dtype: datetime64[ns]
You can pass only the columns that you need to assemble.
pd.to_datetime looks for standard designations of the datetime component in the column names, including:
• required: year, month, day
• optional: hour, minute, second, millisecond, microsecond, nanosecond
Invalid data
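The examples for this subsection are elided; a minimal sketch, using errors='coerce' to turn unparseable values into NaT:

pd.to_datetime(['2009/07/31', 'asd'], errors='coerce')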
Epoch timestamps
pandas supports converting integer or float epoch times to Timestamp and DatetimeIndex. The default unit
is nanoseconds, since that is how Timestamp objects are stored internally. However, epochs are often stored
in another unit which can be specified. These are computed from the starting point specified by the origin
parameter.
In [58]: pd.to_datetime([1349720105, 1349806505, 1349892905,
....: 1349979305, 1350065705], unit='s')
....:
Out[58]:
DatetimeIndex(['2012-10-08 18:15:05', '2012-10-09 18:15:05',
'2012-10-10 18:15:05', '2012-10-11 18:15:05',
'2012-10-12 18:15:05'],
dtype='datetime64[ns]', freq=None)
In [61]: pd.DatetimeIndex([1262347200000000000]).tz_localize('US/Pacific')
Out[61]: DatetimeIndex(['2010-01-01 12:00:00-08:00'], dtype='datetime64[ns, US/Pacific]',
,→ freq=None)
Warning: Conversion of float epoch times can lead to inaccurate and unexpected results. Python
floats have about 15 decimal digits of precision. Rounding during conversion from float to a high-precision
Timestamp is unavoidable. The only way to achieve exact precision is to use a fixed-width type (e.g. an
int64).
In [62]: pd.to_datetime([1490195805.433, 1490195805.433502912], unit='s')
Out[62]: DatetimeIndex(['2017-03-22 15:16:45.433000088', '2017-03-22 15:16:45.
,→433502913'], dtype='datetime64[ns]', freq=None)
See also:
Using the origin Parameter
To invert the operation from above, namely, to convert from a Timestamp to a ‘unix’ epoch:
In [65]: stamps
Out[65]:
DatetimeIndex(['2012-10-08 18:15:05', '2012-10-09 18:15:05',
'2012-10-10 18:15:05', '2012-10-11 18:15:05'],
dtype='datetime64[ns]', freq='D')
We subtract the epoch (midnight at January 1, 1970 UTC) and then floor divide by the “unit” (1 second).
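A sketch of that conversion, using the stamps index above:

(stamps - pd.Timestamp('1970-01-01')) // pd.Timedelta('1s')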
The default is set at origin='unix', which defaults to 1970-01-01 00:00:00. Commonly called ‘unix
epoch’ or POSIX time.
To generate an index with timestamps, you can use either the DatetimeIndex or Index constructor and
pass in a list of datetime objects:
In [71]: index
Out[71]: DatetimeIndex(['2012-05-01', '2012-05-02', '2012-05-03'], dtype='datetime64[ns]
,→', freq=None)
In [73]: index
Out[73]: DatetimeIndex(['2012-05-01', '2012-05-02', '2012-05-03'], dtype='datetime64[ns]
,→', freq=None)
In practice this becomes very cumbersome because we often need a very long index with a large number of
timestamps. If we need timestamps on a regular frequency, we can use the date_range() and bdate_range()
functions to create a DatetimeIndex. The default frequency for date_range is a calendar day while the
default for bdate_range is a business day:
In [77]: index
Out[77]:
DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
'2011-01-05', '2011-01-06', '2011-01-07', '2011-01-08',
'2011-01-09', '2011-01-10',
...
'2011-12-23', '2011-12-24', '2011-12-25', '2011-12-26',
'2011-12-27', '2011-12-28', '2011-12-29', '2011-12-30',
'2011-12-31', '2012-01-01'],
dtype='datetime64[ns]', length=366, freq='D')
In [79]: index
Out[79]:
DatetimeIndex(['2011-01-03', '2011-01-04', '2011-01-05', '2011-01-06',
'2011-01-07', '2011-01-10', '2011-01-11', '2011-01-12',
'2011-01-13', '2011-01-14',
Convenience functions like date_range and bdate_range can utilize a variety of frequency aliases:
In [80]: pd.date_range(start, periods=1000, freq='M')
Out[80]:
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-30',
'2011-05-31', '2011-06-30', '2011-07-31', '2011-08-31',
'2011-09-30', '2011-10-31',
...
'2093-07-31', '2093-08-31', '2093-09-30', '2093-10-31',
'2093-11-30', '2093-12-31', '2094-01-31', '2094-02-28',
'2094-03-31', '2094-04-30'],
dtype='datetime64[ns]', length=1000, freq='M')
bdate_range can also generate a range of custom frequency dates by using the weekmask and holidays
parameters. These parameters will only be used if a custom frequency string is passed.
In [88]: weekmask = 'Mon Wed Fri'
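A sketch of the rest of that example (the holiday dates and the start/end bounds are assumptions carried over from the earlier ranges):

holidays = [datetime.datetime(2011, 1, 5), datetime.datetime(2011, 3, 14)]
pd.bdate_range(start, end, freq='C', weekmask=weekmask, holidays=holidays)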
Since pandas represents timestamps in nanosecond resolution, the time span that can be represented using
a 64-bit integer is limited to approximately 584 years:
In [92]: pd.Timestamp.min
Out[92]: Timestamp('1677-09-21 00:12:43.145225')
In [93]: pd.Timestamp.max
Out[93]: Timestamp('2262-04-11 23:47:16.854775807')
See also:
Representing out-of-bounds spans
4.13.6 Indexing
One of the main uses for DatetimeIndex is as an index for pandas objects. The DatetimeIndex class contains
many time series related optimizations:
• A large range of dates for various offsets are pre-computed and cached under the hood in order to make
generating subsequent date ranges very fast (just have to grab a slice).
• Fast shifting using the shift and tshift method on pandas objects.
• Unioning of overlapping DatetimeIndex objects with the same frequency is very fast (important for
fast data alignment).
• Quick access to date fields via properties such as year, month, etc.
• Regularization functions like snap and very fast asof logic.
DatetimeIndex objects have all the basic functionality of regular Index objects, and a smorgasbord of
advanced time series specific methods for easy frequency processing.
See also:
Reindexing methods
Note: While pandas does not force you to have a sorted date index, some of these methods may have
unexpected or incorrect behavior if the dates are unsorted.
DatetimeIndex can be used like a regular index and offers all of its intelligent functionality like selection,
slicing, etc.
In [94]: rng = pd.date_range(start, end, freq='BM')
In [96]: ts.index
Out[96]:
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-29',
'2011-05-31', '2011-06-30', '2011-07-29', '2011-08-31',
'2011-09-30', '2011-10-31', '2011-11-30', '2011-12-30'],
dtype='datetime64[ns]', freq='BM')
In [97]: ts[:5].index
Out[97]:
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-29',
'2011-05-31'],
dtype='datetime64[ns]', freq='BM')
In [98]: ts[::2].index
Out[98]:
DatetimeIndex(['2011-01-31', '2011-03-31', '2011-05-31', '2011-07-29',
'2011-09-30', '2011-11-30'],
dtype='datetime64[ns]', freq='2BM')
Dates and strings that parse to timestamps can be passed as indexing parameters:
In [99]: ts['1/31/2011']
Out[99]: 0.3135034497740378
In [101]: ts['10/31/2011':'12/31/2011']
Out[101]:
2011-10-31 -0.042894
2011-11-30 1.321441
2011-12-30 -0.625640
Freq: BM, dtype: float64
To provide convenience for accessing longer time series, you can also pass in the year or year and month as
strings:
In [102]: ts['2011']
Out[102]:
2011-01-31 0.313503
2011-02-28 1.055889
2011-03-31 0.714651
2011-04-29 0.612442
2011-05-31 -0.170038
2011-06-30 -0.828809
2011-07-29 0.458806
2011-08-31 0.764740
2011-09-30 1.917429
2011-10-31 -0.042894
2011-11-30 1.321441
2011-12-30 -0.625640
Freq: BM, dtype: float64
In [103]: ts['2011-6']
Out[103]:
2011-06-30 -0.828809
Freq: BM, dtype: float64
This type of slicing will work on a DataFrame with a DatetimeIndex as well. Since the partial string selection
is a form of label slicing, the endpoints will be included. This would include matching times on an included
date:
In [104]: dft = pd.DataFrame(np.random.randn(100000, 1), columns=['A'],
.....: index=pd.date_range('20130101', periods=100000, freq='T'))
.....:
In [105]: dft
Out[105]:
A
2013-01-01 00:00:00 0.480217
2013-01-01 00:01:00 0.791875
2013-01-01 00:02:00 1.414514
2013-01-01 00:03:00 -0.801147
2013-01-01 00:04:00 0.462656
... ...
2013-03-11 10:35:00 0.486569
2013-03-11 10:36:00 -0.586561
2013-03-11 10:37:00 -0.403601
2013-03-11 10:38:00 -0.228277
2013-03-11 10:39:00 -0.267201
In [106]: dft['2013']
Out[106]:
A
2013-01-01 00:00:00 0.480217
2013-01-01 00:01:00 0.791875
2013-01-01 00:02:00 1.414514
2013-01-01 00:03:00 -0.801147
2013-01-01 00:04:00 0.462656
... ...
2013-03-11 10:35:00 0.486569
2013-03-11 10:36:00 -0.586561
2013-03-11 10:37:00 -0.403601
In [107]: dft['2013-1':'2013-2']
Out[107]:
A
2013-01-01 00:00:00 0.480217
2013-01-01 00:01:00 0.791875
2013-01-01 00:02:00 1.414514
2013-01-01 00:03:00 -0.801147
2013-01-01 00:04:00 0.462656
... ...
2013-02-28 23:55:00 1.819527
2013-02-28 23:56:00 0.891281
2013-02-28 23:57:00 -0.516058
2013-02-28 23:58:00 -1.350302
2013-02-28 23:59:00 1.475049
This specifies a stop time that includes all of the times on the last day:
In [108]: dft['2013-1':'2013-2-28']
Out[108]:
A
2013-01-01 00:00:00 0.480217
2013-01-01 00:01:00 0.791875
2013-01-01 00:02:00 1.414514
2013-01-01 00:03:00 -0.801147
2013-01-01 00:04:00 0.462656
... ...
2013-02-28 23:55:00 1.819527
2013-02-28 23:56:00 0.891281
2013-02-28 23:57:00 -0.516058
2013-02-28 23:58:00 -1.350302
2013-02-28 23:59:00 1.475049
This specifies an exact stop time (and is not the same as the above):
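A sketch of that slice:

dft['2013-1':'2013-2-28 00:00:00']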
In [112]: dft2
Out[112]:
A
2013-01-01 00:00:00 a 0.030508
b 0.201088
2013-01-01 12:00:00 a -0.822650
b -0.159673
2013-01-02 00:00:00 a 0.715939
b 0.635435
2013-01-02 12:00:00 a 0.071542
b 0.539646
2013-01-03 00:00:00 a -0.743837
b 1.319587
2013-01-03 12:00:00 a -0.501123
b -0.492347
2013-01-04 00:00:00 a -0.357006
b -0.252463
2013-01-04 12:00:00 a -1.140700
b -1.367172
2013-01-05 00:00:00 a 0.048871
b -0.400048
2013-01-05 12:00:00 a 1.325801
b 0.651751
In [113]: dft2.loc['2013-01-05']
Out[113]:
A
2013-01-05 00:00:00 a 0.048871
b -0.400048
2013-01-05 12:00:00 a 1.325801
b 0.651751
In [118]: df
Out[118]:
0
2019-01-01 00:00:00-08:00 0
In [121]: series_minute.index.resolution
Out[121]: 'minute'
A timestamp string with minute resolution (or more accurate) gives a scalar instead, i.e. it is not cast to
a slice.
In [123]: series_minute['2011-12-31 23:59']
Out[123]: 1
In [126]: series_second.index.resolution
Out[126]: 'second'
Warning: However, if the string is treated as an exact match, the selection in DataFrame's [] will
be column-wise and not row-wise, see Indexing Basics. For example dft_minute['2011-12-31 23:59']
will raise KeyError as '2011-12-31 23:59' has the same resolution as the index and there is no column
with such a name:
To always have unambiguous selection, whether the row is treated as a slice or a single selection, use
.loc.
In [130]: dft_minute.loc['2011-12-31 23:59']
Out[130]:
a 1
b 4
Name: 2011-12-31 23:59:00, dtype: int64
Note also that DatetimeIndex resolution cannot be less precise than day.
In [131]: series_monthly = pd.Series([1, 2, 3],
.....: pd.DatetimeIndex(['2011-12', '2012-01', '2012-02']))
.....:
In [132]: series_monthly.index.resolution
Out[132]: 'day'
Exact indexing
As discussed in previous section, indexing a DatetimeIndex with a partial string depends on the “accuracy”
of the period, in other words how specific the interval is in relation to the resolution of the index. In contrast,
indexing with Timestamp or datetime objects is exact, because the objects have exact meaning. These also
follow the semantics of including both endpoints.
These Timestamp and datetime objects have exact hours, minutes, and seconds, even though they were
not explicitly specified (they are 0). Fully specified times, with no defaults, can be used in the same way.
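A sketch of both forms of exact slicing on the dft frame above:

dft[datetime.datetime(2013, 1, 1):datetime.datetime(2013, 2, 28)]
# with no defaults, i.e. fully specified times
dft[datetime.datetime(2013, 1, 1, 10, 12, 0):datetime.datetime(2013, 2, 28, 10, 12, 0)]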
A truncate() convenience function is provided that is similar to slicing. Note that truncate assumes a
0 value for any unspecified date component in a DatetimeIndex in contrast to slicing which returns any
partially matching dates:
In [136]: rng2 = pd.date_range('2011-01-01', '2012-01-01', freq='W')
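A sketch of the elided series and the truncate call that the slice below is compared against:

ts2 = pd.Series(np.random.randn(len(rng2)), index=rng2)
ts2.truncate(before='2011-11', after='2011-12')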
In [139]: ts2['2011-11':'2011-12']
Out[139]:
2011-11-06 0.524645
2011-11-13 0.027878
2011-11-20 -0.914186
2011-11-27 0.672453
2011-12-04 -0.698538
2011-12-11 1.048716
2011-12-18 0.158099
2011-12-25 1.605115
Freq: W-SUN, dtype: float64
Even complicated fancy indexing that breaks the DatetimeIndex frequency regularity will result in a
DatetimeIndex, although frequency is lost:
There are several time/date properties that one can access from Timestamp or a collection of timestamps
like a DatetimeIndex.
Property Description
year The year of the datetime
month The month of the datetime
day The days of the datetime
hour The hour of the datetime
minute The minutes of the datetime
second The seconds of the datetime
microsecond The microseconds of the datetime
nanosecond The nanoseconds of the datetime
date Returns datetime.date (does not contain timezone information)
time Returns datetime.time (does not contain timezone information)
timetz Returns datetime.time as local time with timezone information
dayofyear The ordinal day of year
weekofyear The week ordinal of the year
week The week ordinal of the year
dayofweek The number of the day of the week with Monday=0, Sunday=6
weekday The number of the day of the week with Monday=0, Sunday=6
weekday_name The name of the day in a week (ex: Friday)
quarter Quarter of the date: Jan-Mar = 1, Apr-Jun = 2, etc.
days_in_month The number of days in the month of the datetime
is_month_start Logical indicating if first day of month (defined by frequency)
is_month_end Logical indicating if last day of month (defined by frequency)
is_quarter_start Logical indicating if first day of quarter (defined by frequency)
is_quarter_end Logical indicating if last day of quarter (defined by frequency)
is_year_start Logical indicating if first day of year (defined by frequency)
is_year_end Logical indicating if last day of year (defined by frequency)
is_leap_year Logical indicating if the date belongs to a leap year
Furthermore, if you have a Series with datetimelike values, then you can access these properties via the
.dt accessor, as detailed in the section on .dt accessors.
In the preceding examples, frequency strings (e.g. 'D') were used to specify a frequency that defined:
• how the date times in DatetimeIndex were spaced when using date_range()
• the frequency of a Period or PeriodIndex
These frequency strings map to a DateOffset object and its subclasses. A DateOffset is similar to a
Timedelta that represents a duration of time but follows specific calendar duration rules. For example,
a Timedelta day will always increment datetimes by 24 hours, while a DateOffset day will increment
datetimes to the same time the next day whether a day represents 23, 24 or 25 hours due to daylight
savings time. However, all DateOffset subclasses that are an hour or smaller (Hour, Minute, Second,
Milli, Micro, Nano) behave like Timedelta and respect absolute time.
The basic DateOffset acts similar to dateutil.relativedelta (relativedelta documentation) that shifts
a date time by the corresponding calendar duration specified. The arithmetic operator (+) or the apply
method can be used to perform the shift.
# This particular day contains a day light savings time transition
In [141]: ts = pd.Timestamp('2016-10-30 00:00:00', tz='Europe/Helsinki')
In [145]: friday.day_name()
Out[145]: 'Friday'
In [147]: two_business_days.apply(friday)
Out[147]: Timestamp('2018-01-09 00:00:00')
DateOffsets additionally have rollforward() and rollback() methods for moving a date forward or
backward respectively to a valid offset date relative to the offset. For example, business offsets will roll dates
that land on the weekends (Saturday and Sunday) forward to Monday since business offsets operate on the
weekdays.
In [150]: ts = pd.Timestamp('2018-01-06 00:00:00')
In [151]: ts.day_name()
Out[151]: 'Saturday'
# Date is brought to the closest offset date first and then the hour is added
In [154]: ts + offset
Out[154]: Timestamp('2018-01-08 10:00:00')
These operations preserve time (hour, minute, etc) information by default. To reset time to midnight, use
normalize() before or after applying the operation (depending on whether you want the time information
included in the operation).
In [155]: ts = pd.Timestamp('2014-01-01 09:00')
In [157]: day.apply(ts)
Out[157]: Timestamp('2014-01-02 09:00:00')
In [158]: day.apply(ts).normalize()
Out[158]: Timestamp('2014-01-02 00:00:00')
In [161]: hour.apply(ts)
Out[161]: Timestamp('2014-01-01 23:00:00')
In [162]: hour.apply(ts).normalize()
Out[162]: Timestamp('2014-01-01 00:00:00')
Parametric offsets
Some of the offsets can be “parameterized” when created to result in different behaviors. For example, the
Week offset for generating weekly data accepts a weekday parameter which results in the generated dates
always lying on a particular day of the week:
In [164]: d = datetime.datetime(2008, 8, 18, 9, 0)
In [165]: d
Out[165]: datetime.datetime(2008, 8, 18, 9, 0)
In [166]: d + pd.offsets.Week()
Out[166]: Timestamp('2008-08-25 09:00:00')
In [167]: d + pd.offsets.Week(weekday=4)
Out[167]: Timestamp('2008-08-22 09:00:00')
In [168]: (d + pd.offsets.Week(weekday=4)).weekday()
Out[168]: 4
In [169]: d - pd.offsets.Week()
Out[169]: Timestamp('2008-08-11 09:00:00')
The normalize option will be effective for addition and subtraction.
In [170]: d + pd.offsets.Week(normalize=True)
In [171]: d - pd.offsets.Week(normalize=True)
Out[171]: Timestamp('2008-08-11 00:00:00')
Another example is parameterizing YearEnd with the specific ending month:
In [172]: d + pd.offsets.YearEnd()
Out[172]: Timestamp('2008-12-31 09:00:00')
In [173]: d + pd.offsets.YearEnd(month=6)
Out[173]: Timestamp('2009-06-30 09:00:00')
Offsets can be used with either a Series or DatetimeIndex to apply the offset to each element.
In [174]: rng = pd.date_range('2012-01-01', '2012-01-03')
In [175]: s = pd.Series(rng)
In [176]: rng
Out[176]: DatetimeIndex(['2012-01-01', '2012-01-02', '2012-01-03'],␣
,→dtype='datetime64[ns]', freq='D')
In [178]: s + pd.DateOffset(months=2)
Out[178]:
0 2012-03-01
1 2012-03-02
2 2012-03-03
dtype: datetime64[ns]
In [179]: s - pd.DateOffset(months=2)
Out[179]:
0 2011-11-01
1 2011-11-02
2 2011-11-03
dtype: datetime64[ns]
If the offset class maps directly to a Timedelta (Day, Hour, Minute, Second, Micro, Milli, Nano) it can be
used exactly like a Timedelta - see the Timedelta section for more examples.
In [180]: s - pd.offsets.Day(2)
Out[180]:
0 2011-12-30
1 2011-12-31
2 2012-01-01
dtype: datetime64[ns]
In [182]: td
Out[182]:
0 3 days
1 3 days
2 3 days
dtype: timedelta64[ns]
In [183]: td + pd.offsets.Minute(15)
Out[183]:
0 3 days 00:15:00
1 3 days 00:15:00
2 3 days 00:15:00
dtype: timedelta64[ns]
Note that some offsets (such as BQuarterEnd) do not have a vectorized implementation. They can still be
used but may calculate significantly slower and will show a PerformanceWarning.
The CDay or CustomBusinessDay class provides a parametric BusinessDay class which can be used to create
customized business day calendars which account for local holidays and local weekend conventions.
As an interesting example, let’s look at Egypt where a Friday-Saturday weekend is observed.
In [189]: dt + 2 * bday_egypt
Out[189]: Timestamp('2013-05-05 00:00:00')
Holiday calendars can be used to provide the list of holidays. See the holiday calendar section for more
information.
Monthly offsets that respect a certain holiday calendar can be defined in the usual way.
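A sketch of the elided setup used below (the calendar import is from pandas; the specific date is an assumption chosen near the New Year's Day holiday):

from pandas.tseries.holiday import USFederalHolidayCalendar
dt = datetime.datetime(2013, 12, 17)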
In [196]: bmth_us = pd.offsets.CustomBusinessMonthBegin(
.....: calendar=USFederalHolidayCalendar())
.....:
In [198]: dt + bmth_us
Out[198]: Timestamp('2014-01-02 00:00:00')
Note: The frequency string 'C' is used to indicate that a CustomBusinessDay DateOffset is used. It is
important to note that since CustomBusinessDay is a parameterised type, instances of CustomBusinessDay
may differ, and this is not detectable from the 'C' frequency string. The user therefore needs to ensure that
the 'C' frequency string is used consistently within the user's application.
Business hour
The BusinessHour class provides a business hour representation on BusinessDay, allowing you to use
specific start and end times.
By default, BusinessHour uses 9:00 - 17:00 as business hours. Adding BusinessHour will increment
Timestamp by hourly frequency. If target Timestamp is out of business hours, move to the next busi-
ness hour then increment it. If the result exceeds the business hours end, the remaining hours are added to
the next business day.
In [200]: bh = pd.offsets.BusinessHour()
In [201]: bh
Out[201]: <BusinessHour: BH=09:00-17:00>
# 2014-08-01 is Friday
In [202]: pd.Timestamp('2014-08-01 10:00').weekday()
Out[202]: 4
# If the result is on the end time, move to the next business day
In [205]: pd.Timestamp('2014-08-01 16:00') + bh
Out[205]: Timestamp('2014-08-04 09:00:00')
In [210]: bh
Out[210]: <BusinessHour: BH=11:00-20:00>
In [215]: bh
Out[215]: <BusinessHour: BH=17:00-09:00>
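The inputs that produced the two reprs above are omitted in this excerpt; BusinessHour can be parameterized
with custom start and end times along these lines (a sketch):

# business hours of 11:00 - 20:00
bh = pd.offsets.BusinessHour(start='11:00', end='20:00')

# business hours may also pass midnight
bh = pd.offsets.BusinessHour(start='17:00', end='09:00')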
In [228]: dt + bhour_us
Out[228]: Timestamp('2014-01-17 16:00:00')
# Monday is skipped because it's a holiday, business hour starts from 10:00
In [231]: dt + bhour_mon * 2
Out[231]: Timestamp('2014-01-21 10:00:00')
Offset aliases
A number of string aliases are given to useful common time series frequencies. We will refer to these aliases
as offset aliases.
Alias Description
B business day frequency
C custom business day frequency
D calendar day frequency
W weekly frequency
M month end frequency
SM semi-month end frequency (15th and end of month)
BM business month end frequency
CBM custom business month end frequency
MS month start frequency
SMS semi-month start frequency (1st and 15th)
BMS business month start frequency
CBMS custom business month start frequency
Q quarter end frequency
BQ business quarter end frequency
QS quarter start frequency
BQS business quarter start frequency
A, Y year end frequency
BA, BY business year end frequency
AS, YS year start frequency
BAS, BYS business year start frequency
BH business hour frequency
H hourly frequency
T, min minutely frequency
S secondly frequency
L, ms milliseconds
U, us microseconds
N nanoseconds
Combining aliases
As we have seen previously, the alias and the offset instance are fungible in most functions:
In [232]: pd.date_range(start, periods=5, freq='B')
Out[232]:
DatetimeIndex(['2011-01-03', '2011-01-04', '2011-01-05', '2011-01-06',
'2011-01-07'],
dtype='datetime64[ns]', freq='B')
Anchored offsets
Alias Description
W-SUN weekly frequency (Sundays). Same as ‘W’
W-MON weekly frequency (Mondays)
W-TUE weekly frequency (Tuesdays)
W-WED weekly frequency (Wednesdays)
W-THU weekly frequency (Thursdays)
W-FRI weekly frequency (Fridays)
W-SAT weekly frequency (Saturdays)
(B)Q(S)-DEC quarterly frequency, year ends in December. Same as ‘Q’
(B)Q(S)-JAN quarterly frequency, year ends in January
(B)Q(S)-FEB quarterly frequency, year ends in February
(B)Q(S)-MAR quarterly frequency, year ends in March
(B)Q(S)-APR quarterly frequency, year ends in April
(B)Q(S)-MAY quarterly frequency, year ends in May
(B)Q(S)-JUN quarterly frequency, year ends in June
(B)Q(S)-JUL quarterly frequency, year ends in July
(B)Q(S)-AUG quarterly frequency, year ends in August
(B)Q(S)-SEP quarterly frequency, year ends in September
(B)Q(S)-OCT quarterly frequency, year ends in October
(B)Q(S)-NOV quarterly frequency, year ends in November
(B)A(S)-DEC annual frequency, anchored end of December. Same as ‘A’
These can be used as arguments to date_range, bdate_range, constructors for DatetimeIndex, as well as
various other timeseries-related functions in pandas.
For those offsets that are anchored to the start or end of specific frequency (MonthEnd, MonthBegin, WeekEnd,
etc), the following rules apply to rolling forward and backwards.
When n is not 0, if the given date is not on an anchor point, it is snapped to the next (previous) anchor
point, and moved |n|-1 additional steps forwards or backwards.
In [236]: pd.Timestamp('2014-01-02') + pd.offsets.MonthBegin(n=1)
Out[236]: Timestamp('2014-02-01 00:00:00')
If the given date is on an anchor point, it is moved |n| points forwards or backwards.
In [242]: pd.Timestamp('2014-01-01') + pd.offsets.MonthBegin(n=1)
Out[242]: Timestamp('2014-02-01 00:00:00')
Holidays and calendars provide a simple way to define holiday rules to be used with CustomBusinessDay or
in other analysis that requires a predefined set of holidays. The AbstractHolidayCalendar class provides
all the necessary methods to return a list of holidays and only rules need to be defined in a specific holiday
calendar class. Furthermore, the start_date and end_date class attributes determine over what date range
holidays are generated. These should be overwritten on the AbstractHolidayCalendar class to have the
range apply to all calendar subclasses. USFederalHolidayCalendar is the only calendar that exists and
primarily serves as an example for developing other calendars.
For holidays that occur on fixed dates (e.g., US Memorial Day or July 4th) an observance rule determines
when that holiday is observed if it falls on a weekend or some other non-observed day. Defined observance
rules are:
Rule Description
nearest_workday move Saturday to Friday and Sunday to Monday
sunday_to_monday move Sunday to following Monday
next_monday_or_tuesday move Saturday to Monday and Sunday/Monday to Tuesday
previous_friday move Saturday and Sunday to previous Friday
next_monday move Saturday and Sunday to following Monday
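The calendar instance cal used below is not defined in this excerpt. A minimal sketch of a holiday calendar
subclass that would produce the holidays shown below (the specific rules are illustrative):

from pandas import DateOffset
from pandas.tseries.holiday import (AbstractHolidayCalendar, Holiday,
                                    USMemorialDay, nearest_workday, MO)

class ExampleCalendar(AbstractHolidayCalendar):
    rules = [
        USMemorialDay,
        Holiday('July 4th', month=7, day=4, observance=nearest_workday),
        Holiday('Columbus Day', month=10, day=2,
                offset=DateOffset(weekday=MO(2))),
    ]

# cal.holidays() is used below to list the holidays generated by these rules
cal = ExampleCalendar()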
In [263]: AbstractHolidayCalendar.end_date
Out[263]: Timestamp('2030-12-31 00:00:00')
These dates can be overwritten by setting the attributes as datetime/Timestamp/string.
In [264]: AbstractHolidayCalendar.start_date = datetime.datetime(2012, 1, 1)
In [266]: cal.holidays()
Out[266]: DatetimeIndex(['2012-05-28', '2012-07-04', '2012-10-08'], dtype='datetime64[ns]', freq=None)
Every calendar class is accessible by name using the get_calendar function which returns a holiday
class instance. Any imported calendar class will automatically be available by this function. Also,
HolidayCalendarFactory provides an easy interface to create calendars that are combinations of calen-
dars or calendars with additional rules.
In [267]: from pandas.tseries.holiday import get_calendar, HolidayCalendarFactory,\
.....: USLaborDay
.....:
In [269]: cal.rules
Out[269]:
[Holiday: Memorial Day (month=5, day=31, offset=<DateOffset: weekday=MO(-1)>),
Holiday: July 4th (month=7, day=4, observance=<function nearest_workday at 0x144b099e0>),
In [271]: new_cal.rules
Out[271]:
[Holiday: Labor Day (month=9, day=1, offset=<DateOffset: weekday=MO(+1)>),
Holiday: Memorial Day (month=5, day=31, offset=<DateOffset: weekday=MO(-1)>),
Holiday: July 4th (month=7, day=4, observance=<function nearest_workday at 0x144b099e0>),
Shifting / lagging
One may want to shift or lag the values in a time series back and forward in time. The method for this is
shift(), which is available on all of the pandas objects.
In [272]: ts = pd.Series(range(len(rng)), index=rng)
In [273]: ts = ts[:5]
The shift method accepts a freq argument, which can be a DateOffset class or other timedelta-like
object, or an offset alias:
In [275]: ts.shift(5, freq=pd.offsets.BDay())
Out[275]:
2012-01-06 0
2012-01-09 1
2012-01-10 2
Freq: B, dtype: int64
Note that with tshift, the leading entry is no longer NaN because the data is not being realigned.
Frequency conversion
The primary function for changing frequencies is the asfreq() method. For a DatetimeIndex, this is
basically just a thin, but convenient wrapper around reindex() which generates a date_range and calls
reindex.
In [278]: dr = pd.date_range('1/1/2010', periods=3, freq=3 * pd.offsets.BDay())
In [280]: ts
Out[280]:
2010-01-01 -0.014261
2010-01-06 -0.760069
2010-01-11 0.028879
Freq: 3B, dtype: float64
In [281]: ts.asfreq(pd.offsets.BDay())
Out[281]:
2010-01-01 -0.014261
2010-01-04 NaN
2010-01-05 NaN
2010-01-06 -0.760069
2010-01-07 NaN
2010-01-08 NaN
2010-01-11 0.028879
Freq: B, dtype: float64
asfreq provides a further convenience so you can specify an interpolation method for any gaps that may
appear after the frequency conversion.
Related to asfreq and reindex is fillna(), which is documented in the missing data section.
DatetimeIndex can be converted to an array of Python native datetime.datetime objects using the
to_pydatetime method.
4.13.10 Resampling
Warning: The interface to .resample has changed in 0.18.0 to be more groupby-like and hence more
flexible. See the whatsnew docs for a comparison with prior versions.
Pandas has a simple, powerful, and efficient functionality for performing resampling operations during fre-
quency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but
not limited to, financial applications.
resample() is a time-based groupby, followed by a reduction method on each of its groups. See some
cookbook examples for some advanced strategies.
Starting in version 0.18.1, the resample() function can be used directly from DataFrameGroupBy objects,
see the groupby docs.
Note: .resample() is similar to using a rolling() operation with a time-based offset, see a discussion
here.
Basics
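The series ts used in the examples below is not constructed in this excerpt; a sketch of the kind of input
assumed (one hundred seconds of random integer data):

import numpy as np
import pandas as pd

rng = pd.date_range('1/1/2012', periods=100, freq='S')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)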
In [285]: ts.resample('5Min').sum()
Out[285]:
2012-01-01 25003
Freq: 5T, dtype: int64
The resample function is very flexible and allows you to specify many different parameters to control the
frequency conversion and resampling operation.
Any function available via dispatching is available as a method of the returned object, including sum, mean,
std, sem, max, min, median, first, last, ohlc:
In [286]: ts.resample('5Min').mean()
Out[286]:
2012-01-01 250.03
Freq: 5T, dtype: float64
In [287]: ts.resample('5Min').ohlc()
Out[287]:
open high low close
2012-01-01 114 499 0 381
In [288]: ts.resample('5Min').max()
Out[288]:
2012-01-01 499
Freq: 5T, dtype: int64
For downsampling, closed can be set to ‘left’ or ‘right’ to specify which end of the interval is closed:
In [289]: ts.resample('5Min', closed='right').mean()
Out[289]:
2011-12-31 23:55:00 114.00000
2012-01-01 00:00:00 251.40404
Freq: 5T, dtype: float64
Warning: The default values for label and closed are ‘left’ for all frequency offsets except for ‘M’, ‘A’,
‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’, which all have a default of ‘right’.
This might unintentionally lead to looking ahead, where the value for a later time is pulled back to a
previous time, as in the following example with the BusinessDay frequency:
In [294]: s = pd.date_range('2000-01-01', '2000-01-05').to_series()
In [295]: s.iloc[2] = pd.NaT
In [296]: s.dt.weekday_name
Out[296]:
2000-01-01 Saturday
2000-01-02 Sunday
2000-01-03 NaN
2000-01-04 Tuesday
2000-01-05 Wednesday
Freq: D, dtype: object
The axis parameter can be set to 0 or 1 and allows you to resample the specified axis for a DataFrame.
kind can be set to ‘timestamp’ or ‘period’ to convert the resulting index to/from timestamp and time span
representations. By default resample retains the input representation.
convention can be set to ‘start’ or ‘end’ when resampling period data (detail below). It specifies how low
frequency periods are converted to higher frequency periods.
Upsampling
For upsampling, you can specify a way to upsample and the limit parameter to interpolate over the gaps
that are created:
# from secondly to every 250 milliseconds
In [299]: ts[:2].resample('250L').asfreq()
Out[299]:
2012-01-01 00:00:00.000 114.0
2012-01-01 00:00:00.250 NaN
2012-01-01 00:00:00.500 NaN
2012-01-01 00:00:00.750 NaN
2012-01-01 00:00:01.000 119.0
Freq: 250L, dtype: float64
In [300]: ts[:2].resample('250L').ffill()
Out[300]:
2012-01-01 00:00:00.000 114
2012-01-01 00:00:00.250 114
2012-01-01 00:00:00.500 114
2012-01-01 00:00:00.750 114
2012-01-01 00:00:01.000 119
Freq: 250L, dtype: int64
In [301]: ts[:2].resample('250L').ffill(limit=2)
Out[301]:
2012-01-01 00:00:00.000 114.0
2012-01-01 00:00:00.250 114.0
2012-01-01 00:00:00.500 114.0
2012-01-01 00:00:00.750 NaN
2012-01-01 00:00:01.000 119.0
Freq: 250L, dtype: float64
Sparse resampling
Sparse timeseries are the ones where you have a lot fewer points relative to the amount of time you are
looking to resample. Naively upsampling a sparse series can potentially generate lots of intermediate values.
When you don’t want to use a method to fill these values, e.g. fill_method is None, then intermediate
values will be filled with NaN.
Since resample is a time-based groupby, the following is a method to efficiently resample only the groups
that are not all NaN.
In [304]: ts.resample('3T').sum()
Out[304]:
2014-01-01 00:00:00 0
2014-01-01 00:03:00 0
2014-01-01 00:06:00 0
We can instead only resample those groups where we have points as follows:
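The code for that step is not shown in this excerpt; one way to do it, sketched here, is to group by rounded
timestamps instead of resampling (the round helper and the '3T' bucket size are illustrative):

from functools import partial
from pandas.tseries.frequencies import to_offset

def round(t, freq):
    # snap a timestamp down to the start of its frequency bucket
    freq = to_offset(freq)
    return pd.Timestamp((t.value // freq.delta.value) * freq.delta.value)

# only the groups that actually contain points are computed
ts.groupby(partial(round, freq='3T')).sum()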
Aggregation
Similar to the aggregating API , groupby API , and the window functions API , a Resampler can be selectively
resampled.
Resampling a DataFrame, the default will be to act on all columns with the same function.
In [310]: r = df.resample('3T')
In [311]: r.mean()
On a resampled DataFrame, you can pass a list of functions to apply to each column, which produces an
aggregated result with a hierarchical index:
By passing a dict to aggregate you can apply a different aggregation to the columns of a DataFrame:
The function names can also be strings. In order for a string to be valid it must be implemented on the
resampled object:
Furthermore, you can also specify multiple aggregation functions for each column separately.
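The concrete calls are not reproduced in this excerpt; a sketch of what they might look like, assuming the
resampled DataFrame r from above has numeric columns 'A' and 'B':

import numpy as np

# a list of functions produces a hierarchical column index
r['A'].agg([np.sum, np.mean, np.std])

# a dict applies a different aggregation per column
r.agg({'A': np.sum, 'B': lambda x: np.std(x, ddof=1)})

# string names must be implemented on the resampled object
r.agg({'A': 'sum', 'B': 'std'})

# multiple aggregation functions per column
r.agg({'A': ['sum', 'std'], 'B': ['mean', 'std']})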
If a DataFrame does not have a datetimelike index, but instead you want to resample based on a datetimelike
column in the frame, it can be passed to the on keyword.
In [319]: df = pd.DataFrame({'date': pd.date_range('2015-01-01', freq='W', periods=5),
.....: 'a': np.arange(5)},
.....: index=pd.MultiIndex.from_arrays([
.....: [1, 2, 3, 4, 5],
.....: pd.date_range('2015-01-01', freq='W', periods=5)],
.....: names=['v', 'd']))
.....:
In [320]: df
Out[320]:
date a
v d
1 2015-01-04 2015-01-04 0
2 2015-01-11 2015-01-11 1
3 2015-01-18 2015-01-18 2
4 2015-01-25 2015-01-25 3
5 2015-02-01 2015-02-01 4
With the Resampler object in hand, iterating through the grouped data is very natural and functions
similarly to itertools.groupby():
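For example (a self-contained sketch, not the document's original data):

import pandas as pd

small = pd.Series(range(6),
                  index=pd.to_datetime(['2017-01-01T00:00:00',
                                        '2017-01-01T00:30:00',
                                        '2017-01-01T00:31:00',
                                        '2017-01-01T01:00:00',
                                        '2017-01-01T03:00:00',
                                        '2017-01-01T03:05:00']))
resampled = small.resample('H')

# iterate over (group label, group) pairs, like itertools.groupby
for name, group in resampled:
    print("Group:", name)
    print("-" * 27)
    print(group, end="\n\n")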
Regular intervals of time are represented by Period objects in pandas while sequences of Period objects are
collected in a PeriodIndex, which can be created with the convenience function period_range.
Period
A Period represents a span of time (e.g., a day, a month, a quarter, etc). You can specify the span via freq
keyword using a frequency alias like below. Because freq represents a span of Period, it cannot be negative
like “-3D”.
In [326]: pd.Period('2012', freq='A-DEC')
Out[326]: Period('2012', 'A-DEC')
In [331]: p + 1
Out[331]: Period('2013', 'A-DEC')
In [332]: p - 3
Out[332]: Period('2009', 'A-DEC')
In [334]: p + 2
Out[334]: Period('2012-05', '2M')
In [335]: p - 1
Out[335]: Period('2011-11', '2M')
In [338]: p + pd.offsets.Hour(2)
Out[338]: Period('2014-07-01 11:00', 'H')
In [339]: p + datetime.timedelta(minutes=120)
Out[339]: Period('2014-07-01 11:00', 'H')
In [1]: p + pd.offsets.Minute(5)
Traceback
...
ValueError: Input has different freq from Period(freq=H)
If Period has other frequencies, only the same offsets can be added. Otherwise, ValueError will be raised.
In [342]: p + pd.offsets.MonthEnd(3)
Out[342]: Period('2014-10', 'M')
In [1]: p + pd.offsets.MonthBegin(3)
Traceback
...
ValueError: Input has different freq from Period(freq=M)
Taking the difference of Period instances with the same frequency will return the number of frequency units
between them:
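For example (a sketch; the exact repr of the result differs between pandas versions):

# the two periods are ten A-DEC units apart
pd.Period('2012', freq='A-DEC') - pd.Period('2002', freq='A-DEC')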
Regular sequences of Period objects can be collected in a PeriodIndex, which can be constructed using the
period_range convenience function:
In [345]: prng
Out[345]:
PeriodIndex(['2011-01', '2011-02', '2011-03', '2011-04', '2011-05', '2011-06',
'2011-07', '2011-08', '2011-09', '2011-10', '2011-11', '2011-12',
'2012-01'],
dtype='period[M]', freq='M')
Passing a multiplied frequency outputs a sequence of Period objects, each spanning the multiplied length.
If start or end are Period objects, they will be used as anchor endpoints for a PeriodIndex with frequency
matching that of the PeriodIndex constructor.
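Sketches of both cases (the particular dates are illustrative):

# a multiplied frequency: each Period spans three months
pd.period_range(start='2014-01', freq='3M', periods=4)

# Period objects as anchor endpoints
pd.period_range(start=pd.Period('2017Q1', freq='Q'),
                end=pd.Period('2017Q2', freq='Q'), freq='M')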
Just like DatetimeIndex, a PeriodIndex can also be used to index pandas objects:
In [350]: ps
Out[350]:
2011-01 0.818145
2011-02 0.267439
2011-03 -0.133309
2011-04 0.662254
2011-05 -0.688393
2011-06 -0.731718
2011-07 -1.190946
2011-08 -1.889384
2011-09 0.171033
PeriodIndex supports addition and subtraction with the same rule as Period.
In [351]: idx = pd.period_range('2014-07-01 09:00', periods=5, freq='H')
In [352]: idx
Out[352]:
PeriodIndex(['2014-07-01 09:00', '2014-07-01 10:00', '2014-07-01 11:00',
'2014-07-01 12:00', '2014-07-01 13:00'],
dtype='period[H]', freq='H')
In [355]: idx
Out[355]: PeriodIndex(['2014-07', '2014-08', '2014-09', '2014-10', '2014-11'], dtype='period[M]', freq='M')
PeriodIndex has its own dtype named period, refer to Period Dtypes.
Period dtypes
In [358]: pi
Out[358]: PeriodIndex(['2016-01', '2016-02', '2016-03'], dtype='period[M]', freq='M')
In [359]: pi.dtype
Out[359]: period[M]
The period dtype can be used in .astype(...). It allows one to change the freq of a PeriodIndex like
.asfreq() and convert a DatetimeIndex to PeriodIndex like to_period():
# convert to DatetimeIndex
In [361]: pi.astype('datetime64[ns]')
Out[361]: DatetimeIndex(['2016-01-01', '2016-02-01', '2016-03-01'], dtype='datetime64[ns]', freq='MS')
# convert to PeriodIndex
In [362]: dti = pd.date_range('2011-01-01', freq='M', periods=3)
In [363]: dti
Out[363]: DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31'], dtype='datetime64[ns]', freq='M')
In [364]: dti.astype('period[M]')
Out[364]: PeriodIndex(['2011-01', '2011-02', '2011-03'], dtype='period[M]', freq='M')
You can pass in dates and strings to Series and DataFrame with PeriodIndex, in the same manner as
DatetimeIndex. For details, refer to DatetimeIndex Partial String Indexing.
In [365]: ps['2011-01']
Out[365]: 0.8181454127659131
In [367]: ps['10/31/2011':'12/31/2011']
Out[367]:
2011-10 0.368582
2011-11 0.356459
2011-12 0.556248
Freq: M, dtype: float64
Passing a string representing a lower frequency than PeriodIndex returns partial sliced data.
In [368]: ps['2011']
Out[368]:
2011-01 0.818145
2011-02 0.267439
2011-03 -0.133309
2011-04 0.662254
2011-05 -0.688393
2011-06 -0.731718
2011-07 -1.190946
2011-08 -1.889384
2011-09 0.171033
2011-10 0.368582
2011-11 0.356459
2011-12 0.556248
Freq: M, dtype: float64
In [370]: dfp
Out[370]:
A
2013-01-01 09:00 -0.169165
2013-01-01 09:01 2.314478
2013-01-01 09:02 2.452822
2013-01-01 09:03 -0.913482
2013-01-01 09:04 -0.635358
... ...
2013-01-01 18:55 0.478960
2013-01-01 18:56 -1.173443
2013-01-01 18:57 0.084316
2013-01-01 18:58 0.564841
2013-01-01 18:59 -1.468316
The frequency of Period and PeriodIndex can be converted via the asfreq method. Let’s start with the
fiscal year 2011, ending in December:
In [374]: p
Out[374]: Period('2011', 'A-DEC')
We can convert it to a monthly frequency. Using the how parameter, we can specify whether to return the
starting or ending month:
In [375]: p.asfreq('M', how='start')
Out[375]: Period('2011-01', 'M')
In [380]: p.asfreq('A-NOV')
Out[380]: Period('2012', 'A-NOV')
Note that since we converted to an annual frequency that ends the year in November, the monthly period
of December 2011 is actually in the 2012 A-NOV period.
Period conversions with anchored frequencies are particularly useful for working with various quarterly data
common to economics, business, and other fields. Many organizations define quarters relative to the month
in which their fiscal year starts and ends. Thus, first quarter of 2011 could start in 2010 or a few months
into 2011. Via anchored frequencies, pandas works for all quarterly frequencies Q-JAN through Q-DEC.
Q-DEC define regular calendar quarters:
In [381]: p = pd.Period('2012Q1', freq='Q-DEC')
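The remaining output for this example is not shown in this excerpt; with a Q-DEC period, the calendar
quarter boundaries can be inspected along these lines (a sketch):

p = pd.Period('2012Q1', freq='Q-DEC')

p.asfreq('D', 's')   # first day of the quarter: Period('2012-01-01', 'D')
p.asfreq('D', 'e')   # last day of the quarter: Period('2012-03-31', 'D')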
Timestamped data can be converted to PeriodIndex-ed data using to_period and vice-versa using
to_timestamp:
In [387]: rng = pd.date_range('1/1/2012', periods=5, freq='M')
In [389]: ts
Out[389]:
2012-01-31 0.886762
2012-02-29 0.034219
2012-03-31 -0.120191
2012-04-30 -1.448144
2012-05-31 -0.316413
Freq: M, dtype: float64
In [390]: ps = ts.to_period()
In [391]: ps
Out[391]:
2012-01 0.886762
2012-02 0.034219
2012-03 -0.120191
2012-04 -1.448144
2012-05 -0.316413
Freq: M, dtype: float64
In [392]: ps.to_timestamp()
Out[392]:
2012-01-01 0.886762
2012-02-01 0.034219
2012-03-01 -0.120191
2012-04-01 -1.448144
2012-05-01 -0.316413
Freq: MS, dtype: float64
Remember that ‘s’ and ‘e’ can be used to return the timestamps at the start or end of the period:
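For example, using the monthly ps from above (a sketch):

ps.to_timestamp('D', how='s')   # timestamps at the start of each period
ps.to_timestamp('D', how='e')   # timestamps at the end of each period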
Converting between period and timestamp enables some convenient arithmetic functions to be used. In the
following example, we convert a quarterly frequency with year ending in November to 9am of the end of the
month following the quarter end:
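The construction code is not shown in this excerpt; a sketch of how an index like the one below could be
produced (random data, Q-NOV quarters):

import numpy as np
import pandas as pd

prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
ts = pd.Series(np.random.randn(len(prng)), prng)

# 9am of the end of the month following the quarter end
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9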
In [397]: ts.head()
Out[397]:
1990-03-01 09:00 1.136089
1990-06-01 09:00 0.413044
1990-09-01 09:00 -0.815750
1990-12-01 09:00 0.473114
1991-03-01 09:00 -0.242094
Freq: H, dtype: float64
If you have data that is outside of the Timestamp bounds, see Timestamp limitations, then you can use a
PeriodIndex and/or Series of Periods to do computations.
In [399]: span
Out[399]:
PeriodIndex(['1215-01-01', '1215-01-02', '1215-01-03', '1215-01-04',
'1215-01-05', '1215-01-06', '1215-01-07', '1215-01-08',
'1215-01-09', '1215-01-10',
...
'1380-12-23', '1380-12-24', '1380-12-25', '1380-12-26',
'1380-12-27', '1380-12-28', '1380-12-29', '1380-12-30',
'1380-12-31', '1381-01-01'],
dtype='period[D]', length=60632, freq='D')
In [401]: s
Out[401]:
0 20121231
1 20141130
2 99991231
dtype: int64
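The helper conv used below is not defined in this excerpt; one plausible definition, which interprets integers
of the form YYYYMMDD as daily Periods:

def conv(x):
    # e.g. 20121231 -> Period('2012-12-31', 'D')
    return pd.Period(year=x // 10000, month=x // 100 % 100,
                     day=x % 100, freq='D')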
In [403]: s.apply(conv)
Out[403]:
0 2012-12-31
1 2014-11-30
2 9999-12-31
dtype: period[D]
In [404]: s.apply(conv)[2]
Out[404]: Period('9999-12-31', 'D')
These can easily be converted to a PeriodIndex:
In [406]: span
Out[406]: PeriodIndex(['2012-12-31', '2014-11-30', '9999-12-31'], dtype='period[D]', freq='D')
pandas provides rich support for working with timestamps in different time zones using the pytz and
dateutil libraries or datetime.timezone objects from the standard library.
To localize these dates to a time zone (assign a particular time zone to a naive date), you can use the
tz_localize method or the tz keyword argument in date_range(), Timestamp, or DatetimeIndex. You
can either pass pytz or dateutil time zone objects or Olson time zone database strings. Olson time
zone strings will return pytz time zone objects by default. To return dateutil time zone objects, append
dateutil/ before the string.
• In pytz you can find a list of common (and less common) time zones using from pytz import
common_timezones, all_timezones.
• dateutil uses the OS time zones so there isn’t a fixed list available. For common zones, the names
are the same as pytz.
# pytz
In [410]: rng_pytz = pd.date_range('3/6/2012 00:00', periods=3, freq='D',
.....: tz='Europe/London')
.....:
In [411]: rng_pytz.tz
Out[411]: <DstTzInfo 'Europe/London' LMT-1 day, 23:59:00 STD>
# dateutil
In [412]: rng_dateutil = pd.date_range('3/6/2012 00:00', periods=3, freq='D')
In [414]: rng_dateutil.tz
Out[414]: tzfile('/usr/share/zoneinfo/Europe/London')
In [416]: rng_utc.tz
Out[416]: tzutc()
# datetime.timezone
In [417]: rng_utc = pd.date_range('3/6/2012 00:00', periods=3, freq='D',
.....: tz=datetime.timezone.utc)
.....:
In [418]: rng_utc.tz
Out[418]: datetime.timezone.utc
Note that the UTC time zone is a special case in dateutil and should be constructed explicitly as an instance
of dateutil.tz.tzutc. You can also construct other time zone objects explicitly first.
# pytz
In [420]: tz_pytz = pytz.timezone('Europe/London')
# dateutil
In [424]: tz_dateutil = dateutil.tz.gettz('Europe/London')
To convert a time zone aware pandas object from one time zone to another, you can use the tz_convert
method.
In [427]: rng_pytz.tz_convert('US/Eastern')
Out[427]:
DatetimeIndex(['2012-03-05 19:00:00-05:00', '2012-03-06 19:00:00-05:00',
'2012-03-07 19:00:00-05:00'],
dtype='datetime64[ns, US/Eastern]', freq='D')
Note: When using pytz time zones, DatetimeIndex will construct a different time zone object than a
Timestamp for the same time zone input. A DatetimeIndex can hold a collection of Timestamp objects that
may have different UTC offsets and cannot be succinctly represented by one pytz time zone instance while
one Timestamp represents one point in time with a specific UTC offset.
In [429]: dti.tz
Out[429]: <DstTzInfo 'US/Pacific' LMT-1 day, 16:07:00 STD>
In [431]: ts.tz
Out[431]: <DstTzInfo 'US/Pacific' PST-1 day, 16:00:00 STD>
Warning: Be wary of conversions between libraries. For some time zones, pytz and dateutil have
different definitions of the zone. This is more of a problem for unusual time zones than for ‘standard’
zones like US/Eastern.
Warning: Be aware that a time zone definition across versions of time zone libraries may not be
considered equal. This may cause problems when working with stored data that is localized using one
version and operated on with a different version. See here for how to handle such a situation.
Warning: For pytz time zones, it is incorrect to pass a time zone object directly into the datetime.
datetime constructor (e.g., datetime.datetime(2011, 1, 1, tzinfo=pytz.timezone('US/Eastern'))).
Instead, the datetime needs to be localized using the localize method on the pytz time zone object.
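For example (a sketch):

import datetime
import pytz

# correct: use the pytz time zone's localize method on a naive datetime
pytz.timezone('US/Eastern').localize(datetime.datetime(2011, 1, 1))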
Under the hood, all timestamps are stored in UTC. Values from a time zone aware DatetimeIndex or
Timestamp will have their fields (day, hour, minute, etc.) localized to the time zone. However, timestamps
with the same UTC value are still considered to be equal even if they are in different time zones:
In [432]: rng_eastern = rng_utc.tz_convert('US/Eastern')
In [434]: rng_eastern[2]
Out[434]: Timestamp('2012-03-07 19:00:00-0500', tz='US/Eastern', freq='D')
In [435]: rng_berlin[2]
Out[435]: Timestamp('2012-03-08 01:00:00+0100', tz='Europe/Berlin', freq='D')
In [441]: result
Out[441]:
2013-01-01 00:00:00+00:00 0
2013-01-02 00:00:00+00:00 2
2013-01-03 00:00:00+00:00 4
Freq: D, dtype: int64
In [442]: result.index
Out[442]:
DatetimeIndex(['2013-01-01 00:00:00+00:00', '2013-01-02 00:00:00+00:00',
'2013-01-03 00:00:00+00:00'],
dtype='datetime64[ns, UTC]', freq='D')
To remove time zone information, use tz_localize(None) or tz_convert(None). tz_localize(None) will
remove the time zone yielding the local time representation. tz_convert(None) will remove the time zone
after converting to UTC time.
In [443]: didx = pd.date_range(start='2014-08-01 09:00', freq='H',
.....: periods=3, tz='US/Eastern')
.....:
In [444]: didx
Out[444]:
DatetimeIndex(['2014-08-01 09:00:00-04:00', '2014-08-01 10:00:00-04:00',
'2014-08-01 11:00:00-04:00'],
dtype='datetime64[ns, US/Eastern]', freq='H')
In [445]: didx.tz_localize(None)
Out[445]:
DatetimeIndex(['2014-08-01 09:00:00', '2014-08-01 10:00:00',
'2014-08-01 11:00:00'],
dtype='datetime64[ns]', freq='H')
In [446]: didx.tz_convert(None)
Out[446]:
DatetimeIndex(['2014-08-01 13:00:00', '2014-08-01 14:00:00',
'2014-08-01 15:00:00'],
dtype='datetime64[ns]', freq='H')
tz_localize may not be able to determine the UTC offset of a timestamp because daylight savings time
(DST) in a local time zone causes some times to occur twice within one day (“clocks fall back”). The
following options are available:
• 'raise': Raises a pytz.AmbiguousTimeError (the default behavior)
• 'infer': Attempt to determine the correct offset based on the monotonicity of the timestamps
• 'NaT': Replaces ambiguous times with NaT
• bool: True represents a DST time, False represents non-DST time. An array-like of bool values is
supported for a sequence of times.
In [2]: rng_hourly.tz_localize('US/Eastern')
AmbiguousTimeError: Cannot infer dst time from Timestamp('2011-11-06 01:00:00'), try using the 'ambiguous' argument
In [451]: rng_hourly.tz_localize('US/Eastern', ambiguous='infer')
Out[451]:
DatetimeIndex(['2011-11-06 00:00:00-04:00', '2011-11-06 01:00:00-04:00',
'2011-11-06 01:00:00-05:00', '2011-11-06 02:00:00-05:00'],
dtype='datetime64[ns, US/Eastern]', freq=None)
A DST transition may also shift the local time ahead by 1 hour creating nonexistent local times (“clocks
spring forward”). The behavior of localizing a timeseries with nonexistent times can be controlled by the
nonexistent argument. The following options are available:
• 'raise': Raises a pytz.NonExistentTimeError (the default behavior)
• 'NaT': Replaces nonexistent times with NaT
• 'shift_forward': Shifts nonexistent times forward to the closest real time
• 'shift_backward': Shifts nonexistent times backward to the closest real time
• timedelta object: Shifts nonexistent times by the timedelta duration
In [2]: dti.tz_localize('Europe/Warsaw')
NonExistentTimeError: 2015-03-29 02:30:00
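The index dti in this example is not constructed in this excerpt; a sketch that deliberately contains a
nonexistent local time and shows two of the options:

dti = pd.date_range(start='2015-03-29 02:30:00', periods=3, freq='H')

dti.tz_localize('Europe/Warsaw', nonexistent='shift_forward')
dti.tz_localize('Europe/Warsaw', nonexistent='NaT')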
A Series with time zone naive values is represented with a dtype of datetime64[ns].
In [459]: s_naive
Out[459]:
0 2013-01-01
1 2013-01-02
2 2013-01-03
dtype: datetime64[ns]
A Series with time zone aware values is represented with a dtype of datetime64[ns, tz] where tz is
the time zone.
In [461]: s_aware
Out[461]:
0 2013-01-01 00:00:00-05:00
1 2013-01-02 00:00:00-05:00
2 2013-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]
The time zone information of both of these Series can be manipulated via the .dt accessor, see the dt
accessor section.
For example, to localize and convert a naive stamp to time zone aware:
In [462]: s_naive.dt.tz_localize('UTC').dt.tz_convert('US/Eastern')
Out[462]:
0 2012-12-31 19:00:00-05:00
1 2013-01-01 19:00:00-05:00
2 2013-01-02 19:00:00-05:00
dtype: datetime64[ns, US/Eastern]
Time zone information can also be manipulated using the astype method. This method can localize and
convert time zone naive timestamps or convert time zone aware timestamps.
# localize and convert a naive time zone
In [463]: s_naive.astype('datetime64[ns, US/Eastern]')
Out[463]:
0 2012-12-31 19:00:00-05:00
1 2013-01-01 19:00:00-05:00
2 2013-01-02 19:00:00-05:00
dtype: datetime64[ns, US/Eastern]
In [464]: s_aware.astype('datetime64[ns]')
Out[464]:
0 2013-01-01 05:00:00
1 2013-01-02 05:00:00
2 2013-01-03 05:00:00
dtype: datetime64[ns]
Note: Using Series.to_numpy() on a Series returns a NumPy array of the data. NumPy does not
currently support time zones (even though it is printing in the local time zone!), therefore an object array
of Timestamps is returned for time zone aware data:
In [466]: s_naive.to_numpy()
Out[466]:
array(['2013-01-01T00:00:00.000000000', '2013-01-02T00:00:00.000000000',
'2013-01-03T00:00:00.000000000'], dtype='datetime64[ns]')
In [467]: s_aware.to_numpy()
Out[467]:
array([Timestamp('2013-01-01 00:00:00-0500', tz='US/Eastern', freq='D'),
Timestamp('2013-01-02 00:00:00-0500', tz='US/Eastern', freq='D'),
Timestamp('2013-01-03 00:00:00-0500', tz='US/Eastern', freq='D')],
dtype=object)
By converting to an object array of Timestamps, it preserves the time zone information. For example, when
converting back to a Series:
In [468]: pd.Series(s_aware.to_numpy())
Out[468]:
0 2013-01-01 00:00:00-05:00
1 2013-01-02 00:00:00-05:00
2 2013-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]
However, if you want an actual NumPy datetime64[ns] array (with the values converted to UTC) instead
of an array of objects, you can specify the dtype argument:
In [469]: s_aware.to_numpy(dtype='datetime64[ns]')
Out[469]:
array(['2013-01-01T05:00:00.000000000', '2013-01-02T05:00:00.000000000',
'2013-01-03T05:00:00.000000000'], dtype='datetime64[ns]')
Timedeltas are differences in times, expressed in different units, e.g. days, hours, minutes, seconds. They
can be both positive and negative.
Timedelta is a subclass of datetime.timedelta, and behaves in a similar manner, but allows compatibility
with np.timedelta64 types as well as a host of custom representation, parsing, and attributes.
4.14.1 Parsing
# strings
In [2]: pd.Timedelta('1 days')
Out[2]: Timedelta('1 days 00:00:00')
# like datetime.timedelta
# note: these MUST be specified as keyword arguments
In [6]: pd.Timedelta(days=1, seconds=1)
Out[6]: Timedelta('1 days 00:00:01')
# from a datetime.timedelta/np.timedelta64
In [8]: pd.Timedelta(datetime.timedelta(days=1, seconds=1))
Out[8]: Timedelta('1 days 00:00:01')
# a NaT
In [11]: pd.Timedelta('nan')
Out[11]: NaT
In [12]: pd.Timedelta('nat')
Out[12]: NaT
In [14]: pd.Timedelta('P0DT0H0M0.000000123S')
Out[14]: Timedelta('0 days 00:00:00.000000123')
New in version 0.23.0: Added constructor for ISO 8601 Duration strings
DateOffsets (Day, Hour, Minute, Second, Milli, Micro, Nano) can also be used in construction.
In [15]: pd.Timedelta(pd.offsets.Second(2))
Out[15]: Timedelta('0 days 00:00:02')
to_timedelta
Using the top-level pd.to_timedelta, you can convert a scalar, array, list, or Series from a recognized
timedelta format / value into a Timedelta type. It will construct Series if the input is a Series, a scalar if
the input is scalar-like, otherwise it will output a TimedeltaIndex.
You can parse a single string to a Timedelta:
In [17]: pd.to_timedelta('1 days 06:05:01.00003')
Out[17]: Timedelta('1 days 06:05:01.000030')
In [18]: pd.to_timedelta('15.5us')
Out[18]: Timedelta('0 days 00:00:00.000015')
or a list/array of strings:
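For example (a sketch):

pd.to_timedelta(['1 days 06:05:01.00003', '15.5us', 'nan'])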
Timedelta limitations
Pandas represents Timedeltas in nanosecond resolution using 64 bit integers. As such, the 64 bit integer
limits determine the Timedelta limits.
In [22]: pd.Timedelta.min
Out[22]: Timedelta('-106752 days +00:12:43.145224')
In [23]: pd.Timedelta.max
Out[23]: Timedelta('106751 days 23:47:16.854775')
4.14.2 Operations
You can operate on Series/DataFrames and construct timedelta64[ns] Series through subtraction opera-
tions on datetime64[ns] Series, or Timestamps.
In [24]: s = pd.Series(pd.date_range('2012-1-1', periods=3, freq='D'))
In [27]: df
Out[27]:
A B
0 2012-01-01 0 days
1 2012-01-02 1 days
2 2012-01-03 2 days
In [29]: df
Out[29]:
A B C
0 2012-01-01 0 days 2012-01-01
1 2012-01-02 1 days 2012-01-03
2 2012-01-03 2 days 2012-01-05
In [30]: df.dtypes
Out[30]:
A datetime64[ns]
B timedelta64[ns]
C datetime64[ns]
dtype: object
In [31]: s - s.max()
Out[31]:
0 -2 days
1 -1 days
2 0 days
dtype: timedelta64[ns]
In [32]: s - datetime.datetime(2011, 1, 1, 3, 5)
Out[32]:
0   364 days 20:55:00
1   365 days 20:55:00
2   366 days 20:55:00
dtype: timedelta64[ns]
In [33]: s + datetime.timedelta(minutes=5)
Out[33]:
0 2012-01-01 00:05:00
1 2012-01-02 00:05:00
2 2012-01-03 00:05:00
dtype: datetime64[ns]
In [34]: s + pd.offsets.Minute(5)
Out[34]:
0 2012-01-01 00:05:00
1 2012-01-02 00:05:00
2 2012-01-03 00:05:00
dtype: datetime64[ns]
In [36]: y = s - s[0]
In [37]: y
Out[37]:
0 0 days
1 1 days
2 2 days
dtype: timedelta64[ns]
In [38]: y = s - s.shift()
In [39]: y
Out[39]:
0 NaT
1 1 days
2 1 days
dtype: timedelta64[ns]
Operands can also appear in a reversed order (a singular object operated with a Series):
In [42]: s.max() - s
Out[42]:
0 2 days
1 1 days
2 0 days
dtype: timedelta64[ns]
In [43]: datetime.datetime(2011, 1, 1, 3, 5) - s
Out[43]:
0 -365 days +03:05:00
1 -366 days +03:05:00
2 -367 days +03:05:00
dtype: timedelta64[ns]
In [44]: datetime.timedelta(minutes=5) + s
Out[44]:
0 2012-01-01 00:05:00
1 2012-01-02 00:05:00
2 2012-01-03 00:05:00
dtype: datetime64[ns]
min, max and the corresponding idxmin, idxmax operations are supported on frames:
In [45]: A = s - pd.Timestamp('20120101') - pd.Timedelta('00:05:05')
In [48]: df
Out[48]:
A B
0 -1 days +23:54:55 -1 days
1 0 days 23:54:55 -1 days
2 1 days 23:54:55 -1 days
In [49]: df.min()
Out[49]:
A -1 days +23:54:55
B -1 days +00:00:00
dtype: timedelta64[ns]
In [50]: df.min(axis=1)
Out[50]:
0 -1 days
1 -1 days
2 -1 days
dtype: timedelta64[ns]
In [51]: df.idxmin()
Out[51]:
A 0
B 0
dtype: int64
In [52]: df.idxmax()
Out[52]:
A 2
B 0
dtype: int64
min, max, idxmin, idxmax operations are supported on Series as well. A scalar result will be a Timedelta.
In [53]: df.min().max()
Out[53]: Timedelta('-1 days +23:54:55')
In [54]: df.min(axis=1).min()
Out[54]: Timedelta('-1 days +00:00:00')
In [55]: df.min().idxmax()
Out[55]: 'A'
In [56]: df.min(axis=1).idxmin()
Out[56]: 0
You can fillna on timedeltas, passing a timedelta to get a particular value.
In [57]: y.fillna(pd.Timedelta(0))
Out[57]:
0 0 days
1 0 days
2 1 days
dtype: timedelta64[ns]
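The scalar td1 used below is not constructed in this excerpt; one value that reproduces the repr shown
(a sketch):

td1 = pd.Timedelta('-1 days 02:00:03')   # -> Timedelta('-2 days +21:59:57')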
In [61]: td1
Out[61]: Timedelta('-2 days +21:59:57')
In [62]: -1 * td1
Out[62]: Timedelta('1 days 02:00:03')
In [63]: - td1
Out[63]: Timedelta('1 days 02:00:03')
In [64]: abs(td1)
Out[64]: Timedelta('1 days 02:00:03')
4.14.3 Reductions
Numeric reduction operations for timedelta64[ns] will return Timedelta objects. As usual, NaT values are
skipped during evaluation.
In [65]: y2 = pd.Series(pd.to_timedelta(['-1 days +00:00:05', 'nat',
....: '-1 days +00:00:05', '1 days']))
....:
In [66]: y2
Out[66]:
0 -1 days +00:00:05
1 NaT
2 -1 days +00:00:05
3 1 days 00:00:00
dtype: timedelta64[ns]
In [67]: y2.mean()
Out[67]: Timedelta('-1 days +16:00:03.333333')
In [68]: y2.median()
Out[68]: Timedelta('-1 days +00:00:05')
In [69]: y2.quantile(.1)
Out[69]: Timedelta('-1 days +00:00:05')
In [70]: y2.sum()
Out[70]: Timedelta('-1 days +00:00:10')
4.14.4 Frequency conversion
Timedelta Series, TimedeltaIndex, and Timedelta scalars can be converted to other ‘frequencies’ by dividing
by another timedelta, or by astyping to a specific timedelta type. These operations yield Series and propagate
NaT -> nan. Note that division by the NumPy scalar is true division, while astyping is equivalent to floor
division.
In [71]: december = pd.Series(pd.date_range('20121201', periods=4))
In [76]: td
Out[76]:
0 31 days 00:00:00
1 31 days 00:00:00
2 31 days 00:05:03
3 NaT
dtype: timedelta64[ns]
# to days
In [77]: td / np.timedelta64(1, 'D')
Out[77]:
0 31.000000
1 31.000000
2 31.003507
3 NaN
dtype: float64
In [78]: td.astype('timedelta64[D]')
Out[78]:
0 31.0
1 31.0
2 31.0
3 NaN
dtype: float64
# to seconds
In [79]: td / np.timedelta64(1, 's')
Out[79]:
0 2678400.0
1 2678400.0
2 2678703.0
3 NaN
dtype: float64
In [80]: td.astype('timedelta64[s]')
Out[80]:
0 2678400.0
1 2678400.0
2 2678703.0
3 NaN
dtype: float64
4.14.5 Attributes
You can access various components of the Timedelta or TimedeltaIndex directly using the attributes days,
seconds, microseconds, nanoseconds. These are identical to the values returned by datetime.timedelta,
in that, for example, the .seconds attribute represents the number of seconds >= 0 and < 1 day. These
are signed according to whether the Timedelta is signed.
These operations can also be directly accessed via the .dt property of the Series as well.
Note: Note that the attributes are NOT the displayed values of the Timedelta. Use .components to
retrieve the displayed values.
For a Series:
In [89]: td.dt.days
Out[89]:
0 31.0
1 31.0
2 31.0
3 NaN
dtype: float64
In [90]: td.dt.seconds
Out[90]:
0 0.0
1 0.0
2 303.0
3 NaN
dtype: float64
You can access the value of the fields for a scalar Timedelta directly.
In [91]: tds = pd.Timedelta('31 days 5 min 3 sec')
In [92]: tds.days
Out[92]: 31
In [93]: tds.seconds
Out[93]: 303
In [94]: (-tds).seconds
Out[94]: 86097
You can use the .components property to access a reduced form of the timedelta. This returns a DataFrame
indexed similarly to the Series. These are the displayed values of the Timedelta.
In [95]: td.dt.components
Out[95]:
days hours minutes seconds milliseconds microseconds nanoseconds
0 31.0 0.0 0.0 0.0 0.0 0.0 0.0
1 31.0 0.0 0.0 0.0 0.0 0.0 0.0
2 31.0 0.0 5.0 3.0 0.0 0.0 0.0
3 NaN NaN NaN NaN NaN NaN NaN
In [96]: td.dt.components.seconds
Out[96]:
0 0.0
1 0.0
2 3.0
3 NaN
4.14.6 TimedeltaIndex
To generate an index with time delta, you can use either the TimedeltaIndex or the timedelta_range()
constructor.
Using TimedeltaIndex you can pass string-like, Timedelta, timedelta, or np.timedelta64 objects. Passing
np.nan/pd.NaT/nat will represent missing values.
The string ‘infer’ can be passed in order to set the frequency of the index as the inferred frequency upon
creation:
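Sketches of both forms of construction:

import datetime
import numpy as np
import pandas as pd

# mixed string-like, np.timedelta64, and datetime.timedelta inputs
pd.TimedeltaIndex(['1 days', '1 days, 00:00:05',
                   np.timedelta64(2, 'D'),
                   datetime.timedelta(days=2, seconds=2)])

# let pandas infer a regular frequency on creation
pd.TimedeltaIndex(['0 days', '10 days', '20 days'], freq='infer')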
Similar to date_range(), you can construct regular ranges of a TimedeltaIndex using timedelta_range().
The default frequency for timedelta_range is calendar day:
Various combinations of start, end, and periods can be used with timedelta_range:
In [101]: pd.timedelta_range(start='1 days', end='5 days')
Out[101]: TimedeltaIndex(['1 days', '2 days', '3 days', '4 days', '5 days'], dtype='timedelta64[ns]', freq='D')
Similarly to the other datetime-like indices, DatetimeIndex and PeriodIndex, you can use TimedeltaIndex
as the index of pandas objects.
In [107]: s = pd.Series(np.arange(100),
.....: index=pd.timedelta_range('1 days', periods=100, freq='h'))
.....:
In [108]: s
2 days 11:00:00 35
2 days 12:00:00 36
2 days 13:00:00 37
2 days 14:00:00 38
2 days 15:00:00 39
2 days 16:00:00 40
2 days 17:00:00 41
2 days 18:00:00 42
2 days 19:00:00 43
2 days 20:00:00 44
2 days 21:00:00 45
2 days 22:00:00 46
2 days 23:00:00 47
Freq: H, dtype: int64
Operations
Finally, the combination of TimedeltaIndex with DatetimeIndex allow certain combination operations that
are NaT preserving:
In [113]: tdi = pd.TimedeltaIndex(['1 days', pd.NaT, '2 days'])
In [114]: tdi.to_list()
Out[114]: [Timedelta('1 days 00:00:00'), NaT, Timedelta('2 days 00:00:00')]
In [116]: dti.to_list()
Out[116]:
[Timestamp('2013-01-01 00:00:00', freq='D'),
Timestamp('2013-01-02 00:00:00', freq='D'),
Timestamp('2013-01-03 00:00:00', freq='D')]
Conversions
Similarly to frequency conversion on a Series above, you can convert these indices to yield another Index.
In [119]: tdi / np.timedelta64(1, 's')
Out[119]: Float64Index([86400.0, nan, 172800.0], dtype='float64')
In [120]: tdi.astype('timedelta64[s]')
Out[120]: Float64Index([86400.0, nan, 172800.0], dtype='float64')
Scalar type ops work as well. These can potentially return a different type of index.
# adding a timedelta and a date -> datelike
In [121]: tdi + pd.Timestamp('20130101')
Out[121]: DatetimeIndex(['2013-01-02', 'NaT', '2013-01-03'], dtype='datetime64[ns]', freq=None)
4.14.7 Resampling
In [126]: s.resample('D').mean()
Out[126]:
1 days 11.5
2 days 35.5
3 days 59.5
4 days 83.5
5 days 97.5
Freq: D, dtype: float64
4.15 Styling
np.random.seed(24)
df = pd.DataFrame({'A': np.linspace(1, 10, 10)})
df = pd.concat([df, pd.DataFrame(np.random.randn(10, 4), columns=list('BCDE'))],
axis=1)
df.iloc[0, 2] = np.nan
[3]: df.style
[3]: <pandas.io.formats.style.Styler at 0x119f9e710>
Note: The DataFrame.style attribute is a property that returns a Styler object. Styler has a _repr_html_
method defined on it so it is rendered automatically. If you want the actual HTML back for further
processing or for writing to a file, call the .render() method, which returns a string.
The above output looks very similar to the standard DataFrame HTML representation. But we’ve done
some work behind the scenes to attach CSS classes to each cell. We can view these by calling the .render
method.
[4]: df.style.highlight_null().render().split('\n')[:10]
[4]: ['<style type="text/css" >',
' #T_d2a38a0c_f5cb_11e9_af35_186590cd1c87row0_col2 {',
' background-color: red;',
' }</style><table id="T_d2a38a0c_f5cb_11e9_af35_186590cd1c87" ><thead> <tr> <th class="blank level0" ></th> <th class="col_heading level0 col0" >A</th> <th class="col_heading level0 col1" >B</th> <th class="col_heading level0 col2" >C</th> <th class="col_heading level0 col3" >D</th> <th class="col_heading level0 col4" >E</th> </tr></thead><tbody>',
' <tr>',
' <th id="T_d2a38a0c_f5cb_11e9_af35_186590cd1c87level0_row0" class="row_heading level0 row0" >0</th>',
The row0_col2 is the identifier for that particular cell. We’ve also prepended each row/column identifier
with a UUID unique to each DataFrame so that the style from one doesn’t collide with the styling from
another within the same notebook or page (you can set the uuid if you’d like to tie together the styling of
two DataFrames).
When writing style functions, you take care of producing the CSS attribute / value pairs you want. Pandas
matches those up with the CSS classes that identify each cell.
Let’s write a simple style function that will color negative numbers red and positive numbers black.
In this case, the cell’s style depends only on its own value. That means we should use the Styler.applymap
method, which works elementwise.
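The style function color_negative_red is not defined in this excerpt; one plausible definition (a sketch):

def color_negative_red(val):
    """Color a scalar red if it is negative, black otherwise."""
    color = 'red' if val < 0 else 'black'
    return 'color: %s' % color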
[6]: s = df.style.applymap(color_negative_red)
s
[6]: <pandas.io.formats.style.Styler at 0x11a5b9790>
Notice the similarity with the standard df.applymap, which operates on DataFrames elementwise. We want
you to be able to reuse your existing knowledge of how to interact with DataFrames.
Notice also that our function returned a string containing the CSS attribute and value, separated by a colon.
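The function highlight_max used below is likewise not defined in this excerpt; one plausible column-wise
definition (a sketch):

def highlight_max(s):
    """Highlight the maximum value in a Series with a yellow background."""
    is_max = s == s.max()
    return ['background-color: yellow' if v else '' for v in is_max]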
[8]: df.style.apply(highlight_max)
[8]: <pandas.io.formats.style.Styler at 0x11a5e4a50>
In this case the input is a Series, one column at a time. Notice that the output shape of highlight_max
matches the input shape, an array with len(s) items.
We encourage you to use method chains to build up a style piecewise, before finally rendering at the end of
the chain.
[9]: df.style.\
applymap(color_negative_red).\
apply(highlight_max)
[9]: <pandas.io.formats.style.Styler at 0x11a5edd50>
When using Styler.apply(func, axis=None), the function must return a DataFrame with the same index
and column labels.
Style functions should return strings with one or more CSS attribute: value pairs, delimited by semicolons.
Use
• Styler.applymap(func) for elementwise styles
• Styler.apply(func, axis=0) for columnwise styles
• Styler.apply(func, axis=1) for rowwise styles
• Styler.apply(func, axis=None) for tablewise styles
And crucially the input and output shapes of func must match. If x is the input then func(x).shape ==
x.shape.
Both Styler.apply, and Styler.applymap accept a subset keyword. This allows you to apply styles to
specific rows or columns, without having to code that logic into your style function.
The value passed to subset behaves similar to slicing a DataFrame.
• A scalar is treated as a column label
• A list (or series or numpy array)
• A tuple is treated as (row_indexer, column_indexer)
Consider using pd.IndexSlice to construct the tuple for the last one.
For row and column slicing, any valid indexer to .loc will work.
[13]: df.style.applymap(color_negative_red,
subset=pd.IndexSlice[2:5, ['B', 'D']])
[13]: <pandas.io.formats.style.Styler at 0x11a61ad10>
We distinguish the display value from the actual value in Styler. To control the display value, the text
that is printed in each cell, use Styler.format. Cells can be formatted according to a format spec string or
a callable that takes a single value and returns a string.
[14]: df.style.format("{:.2%}")
[14]: <pandas.io.formats.style.Styler at 0x11a61a510>
Finally, we expect certain styling functions to be common enough that we’ve included a few “built-in” to
the Styler, so you don’t have to write them yourself.
[17]: df.style.highlight_null(null_color='red')
[17]: <pandas.io.formats.style.Styler at 0x11a630110>
You can create “heatmaps” with the background_gradient method. These require matplotlib, and we’ll
use Seaborn to get a nice colormap.
cm = sns.light_palette("green", as_cmap=True)
s = df.style.background_gradient(cmap=cm)
s
[18]: <pandas.io.formats.style.Styler at 0x1a1c4d2fd0>
Styler.background_gradient takes the keyword arguments low and high. Roughly speaking these extend
the range of your data by low and high percent so that when we convert the colors, the colormap’s entire
range isn’t used. This is useful so that you can actually read the text still.
[21]: df.style.highlight_max(axis=0)
[21]: <pandas.io.formats.style.Styler at 0x1a1c64c290>
Use Styler.set_properties when the style doesn’t actually depend on the values.
Bar charts
New in version 0.20.0 is the ability to further customize the bar chart: you can now have the df.style.bar
be centered on zero or a midpoint value (in addition to the already existing way of having the min value at
the left side of the cell), and you can pass a list of [color_negative, color_positive].
Here’s how you can change the above with the new align='mid' option:
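A sketch of such a call (the subset columns are assumed, not taken from this excerpt):

df.style.bar(subset=['A', 'B'], align='mid', color=['#d65f5f', '#5fba7d'])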
The following example aims to give a highlight of the behavior of the new align options:
from IPython.display import HTML

# Test series
test1 = pd.Series([-100, -60, -30, -20], name='All Negative')
test2 = pd.Series([10, 20, 50, 100], name='All Positive')
test3 = pd.Series([-10, -5, 0, 90], name='Both Pos and Neg')

head = """
<table>
    <thead>
        <th>Align</th>
        <th>All Negative</th>
        <th>All Positive</th>
        <th>Both Pos and Neg</th>
    </thead>
    <tbody>
"""

aligns = ['left', 'zero', 'mid']
for align in aligns:
    row = "<tr><th>{}</th>".format(align)
    for serie in [test1, test2, test3]:
        s = serie.copy()
        s.name = ''
        row += "<td>{}</td>".format(s.to_frame().style.bar(align=align,
                                                           color=['#d65f5f', '#5fba7d'],
                                                           width=100).render())
    row += '</tr>'
    head += row

head += """
</tbody>
</table>"""

HTML(head)
[25]: <IPython.core.display.HTML object>
Say you have a lovely style built up for a DataFrame, and now you want to apply the same style to a
second DataFrame. Export the style with df1.style.export, and use it on the second DataFrame with
df2.style.use.
Notice that you’re able to share the styles even though they’re data aware. The styles are re-evaluated on the
new DataFrame they’ve been used upon.
You’ve seen a few methods for data-driven styling. Styler also provides a few other options for styles that
don’t depend on the data.
• precision
• captions
• table-wide styles
• hiding the index or columns
Each of these can be specified in two ways:
• A keyword argument to Styler.__init__
• A call to one of the .set_ or .hide_ methods, e.g. .set_caption or .hide_columns
The best method to use depends on the context. Use the Styler constructor when building many styled
DataFrames that should all share the same properties. For interactive use, the .set_ and .hide_ methods
are more convenient.
Precision
You can control the precision of floats using pandas’ regular display.precision option.
[29]: df.style\
.applymap(color_negative_red)\
.apply(highlight_max)\
.set_precision(2)
[29]: <pandas.io.formats.style.Styler at 0x1a1c68b990>
Setting the precision only affects the printed number; the full-precision values are always passed to your style
functions. You can always use df.round(2).style if you’d prefer to round from the start.
Captions
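For example, a caption can be attached to any Styler (a sketch):

df.style.set_caption('Colormaps, with a caption.')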
Table styles
The next option you have are “table styles”. These are styles that apply to the table as a whole, but don’t
look at the data. Certain stylings, including pseudo-selectors like :hover, can only be used this way.
def hover(hover_color="#ffff99"):
return dict(selector="tr:hover",
props=[("background-color", "%s" % hover_color)])
styles = [
hover(),
dict(selector="th", props=[("font-size", "150%"),
("text-align", "center")]),
dict(selector="caption", props=[("caption-side", "bottom")])
]
html = (df.style.set_table_styles(styles)
.set_caption("Hover to highlight."))
html
[31]: <pandas.io.formats.style.Styler at 0x1a1c684190>
table_styles should be a list of dictionaries. Each dictionary should have the selector and props keys.
The value for selector should be a valid CSS selector. Recall that all the styles are already attached to an
id, unique to each Styler. This selector is in addition to that id. The value for props should be a list of
tuples of ('attribute', 'value').
table_styles are extremely flexible, but not as fun to type out by hand. We hope to collect some useful
ones either in pandas, or preferably in a new package that builds on top of the tools here.
The index can be hidden from rendering by calling Styler.hide_index. Columns can be hidden from
rendering by calling Styler.hide_columns and passing in the name of a column, or a slice of columns.
[32]: df.style.hide_index()
[32]: <pandas.io.formats.style.Styler at 0x1a1c68b550>
[33]: df.style.hide_columns(['C','D'])
[33]: <pandas.io.formats.style.Styler at 0x1a1c674710>
CSS classes
Limitations
Terms
• Style function: a function that’s passed into Styler.apply or Styler.applymap and returns values
like 'css attribute: value'
• Builtin style functions: style functions that are methods on Styler
• table style: a dictionary with the two keys selector and props. selector is the CSS selector that
props will apply to. props is a list of (attribute, value) tuples. A list of table styles passed into
Styler.
[36]: np.random.seed(25)
cmap = cmap=sns.diverging_palette(5, 250, as_cmap=True)
bigdf = pd.DataFrame(np.random.randn(20, 25)).cumsum()
bigdf.style.background_gradient(cmap, axis=1)\
.set_properties(**{'max-width': '80px', 'font-size': '1pt'})\
.set_caption("Hover to magnify")\
.set_precision(2)\
.set_table_styles(magnify())
[36]: <pandas.io.formats.style.Styler at 0x1a1c641890>
[37]: df.style.\
applymap(color_negative_red).\
apply(highlight_max).\
to_excel('styled.xlsx', engine='openpyxl')
4.15.9 Extensibility
The core of pandas is, and will remain, its “high-performance, easy-to-use data structures”. With that in
mind, we hope that DataFrame.style accomplishes two goals
• Provide an API that is pleasing to use interactively and is “good enough” for many tasks
• Provide the foundations for dedicated libraries to build on
If you build a great library on top of this, let us know and we’ll link to it.
Subclassing
If the default template doesn’t quite suit your needs, you can subclass Styler and extend or override the
template. We’ll show an example of extending the default template to insert a custom header before each
table.
Now that we’ve created a template, we need to set up a subclass of Styler that knows about it.
Notice that we include the original loader in our environment’s loader. That’s because we extend the original
template, so the Jinja environment needs to be able to find it.
Now we can use that custom styler. Its __init__ takes a DataFrame.
[41]: MyStyler(df)
[41]: <__main__.MyStyler at 0x1a1d3b0e10>
Our custom template accepts a table_title keyword. We can provide the value in the .render method.
For convenience, we provide the Styler.from_custom_template method that does the same as the custom
subclass.
HTML(structure)
[44]: <IPython.core.display.HTML object>
4.16.1 Overview
pandas has an options system that lets you customize some aspects of its behaviour, display-related options
being those the user is most likely to adjust.
Options have a full “dotted-style”, case-insensitive name (e.g. display.max_rows). You can get/set options
directly as attributes of the top-level options attribute:
In [2]: pd.options.display.max_rows
Out[2]: 60
In [4]: pd.options.display.max_rows
Out[4]: 999
The API is composed of 5 relevant functions, available directly from the pandas namespace:
• get_option() / set_option() - get/set the value of a single option.
• reset_option() - reset one or more options to their default value.
• describe_option() - print the descriptions of one or more options.
• option_context() - execute a codeblock with a set of options that revert to prior settings after
execution.
Note: Developers can check out pandas/core/config.py for more information.
All of the functions above accept a regexp pattern (re.search style) as an argument, and so passing in a
substring will work - as long as it is unambiguous:
In [5]: pd.get_option("display.max_rows")
Out[5]: 999
In [7]: pd.get_option("display.max_rows")
Out[7]: 101
In [9]: pd.get_option("display.max_rows")
Out[9]: 102
The following will not work because it matches multiple option names, e.g. display.max_colwidth,
display.max_rows, display.max_columns:
In [10]: try:
....: pd.get_option("column")
....: except KeyError as e:
....: print(e)
Note: Using this form of shorthand may cause your code to break if new options with similar names are
added in future versions.
You can get a list of available options and their descriptions with describe_option. When called with no
argument describe_option will print out the descriptions for all available options.
As described above, get_option() and set_option() are available from the pandas namespace. To change
an option, call set_option('option regex', new_value).
In [11]: pd.get_option('mode.sim_interactive')
Out[11]: False
In [13]: pd.get_option('mode.sim_interactive')
Out[13]: True
In [14]: pd.get_option("display.max_rows")
Out[14]: 60
In [16]: pd.get_option("display.max_rows")
Out[16]: 999
In [17]: pd.reset_option("display.max_rows")
In [18]: pd.get_option("display.max_rows")
Out[18]: 60
In [19]: pd.reset_option("^display")
The option_context context manager has been exposed through the top-level API, allowing you to execute
code with given option values. Option values are restored automatically when you exit the with block:
In [20]: with pd.option_context("display.max_rows", 10, "display.max_columns", 5):
....: print(pd.get_option("display.max_rows"))
....: print(pd.get_option("display.max_columns"))
....:
10
5
In [21]: print(pd.get_option("display.max_rows"))
60
In [22]: print(pd.get_option("display.max_columns"))
0
Using startup scripts for the Python/IPython environment to import pandas and set options makes working
with pandas more efficient. To do this, create a .py or .ipy script in the startup directory of the desired
profile. An example where the startup folder is in a default ipython profile can be found at:
$IPYTHONDIR/profile_default/startup
More information can be found in the ipython documentation. An example startup script for pandas is
displayed below:
import pandas as pd
pd.set_option('display.max_rows', 999)
pd.set_option('precision', 5)
In [24]: pd.set_option('max_rows', 7)
In [25]: df
Out[25]:
0 1
0 -0.228248 -2.069612
1 0.610144 0.423497
2 1.117887 -0.274242
3 1.741812 -0.447501
4 -1.255427 0.938164
5 -0.468346 -1.254720
6 0.124824 0.756502
In [26]: pd.set_option('max_rows', 5)
In [27]: df
Out[27]:
0 1
0 -0.228248 -2.069612
1 0.610144 0.423497
.. ... ...
5 -0.468346 -1.254720
6 0.124824 0.756502
[7 rows x 2 columns]
In [28]: pd.reset_option('max_rows')
Once display.max_rows is exceeded, the display.min_rows option determines how many rows are
shown in the truncated repr.
In [29]: pd.set_option('max_rows', 8)
In [30]: pd.set_option('max_rows', 4)
In [32]: df
Out[32]:
0 1
0 0.241440 0.497426
1 4.108693 0.821121
.. ... ...
5 0.396520 -0.314617
6 -0.593756 1.149501
[7 rows x 2 columns]
In [34]: df
Out[34]:
0 1
0 1.335566 0.302629
1 -0.454228 0.514371
.. ... ...
7 -1.221429 1.804477
8 0.180410 0.553164
[9 rows x 2 columns]
In [35]: pd.reset_option('max_rows')
In [36]: pd.reset_option('min_rows')
display.expand_frame_repr allows the representation of a wide DataFrame to stretch across multiple lines
("pages"), wrapping over the full set of columns, rather than being truncated.
In [39]: df
In [41]: df
Out[41]:
          0         1         2         3         4         5         6         7         8         9
0  1.033029 -0.329002 -1.151003 -0.426522 -0.148147  1.501437  0.869598 -1.087091  0.664221  0.734885
1 -1.061366 -0.108517 -1.850404  0.330488 -0.315693 -1.350002 -0.698171  0.239951 -0.552949  0.299527
2  0.552664 -0.840443 -0.312271  2.144678  0.121106 -0.846829  0.060462 -1.338589  1.132746  0.370305
3  1.085806  0.902179  0.390296  0.975509  0.191574 -0.662209 -1.023515 -0.448175 -2.505458  1.825994
4 -1.714067 -0.076640 -1.317567 -2.025594 -0.082245 -0.304667 -0.159724  0.548947 -0.618375  0.378794
In [42]: pd.reset_option('expand_frame_repr')
display.large_repr lets you select whether to display dataframes that exceed max_columns or max_rows
as a truncated frame, or as a summary.
In [44]: pd.set_option('max_rows', 5)
In [46]: df
Out[46]:
          0         1         2         3         4         5         6         7         8         9
0  0.513251 -0.334844 -0.283520  0.538424  0.057251  0.159088 -2.374403  0.058520  0.376546 -0.135480
In [48]: df
Out[48]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 10 columns):
0 10 non-null float64
1 10 non-null float64
2 10 non-null float64
3 10 non-null float64
4 10 non-null float64
5 10 non-null float64
6 10 non-null float64
7 10 non-null float64
8 10 non-null float64
9 10 non-null float64
dtypes: float64(10)
memory usage: 928.0 bytes
In [49]: pd.reset_option('large_repr')
In [50]: pd.reset_option('max_rows')
display.max_colwidth sets the maximum width of columns. Cells of this length or longer will be truncated
with an ellipsis.
In [53]: df
Out[53]:
0 1 2 3
0 foo bar bim uncomfortably long string
1 horse cow banana apple
In [54]: pd.set_option('max_colwidth', 6)
In [55]: df
Out[55]:
0 1 2 3
0 foo bar bim un...
In [56]: pd.reset_option('max_colwidth')
In [59]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 10 columns):
0 10 non-null float64
1 10 non-null float64
2 10 non-null float64
3 10 non-null float64
4 10 non-null float64
5 10 non-null float64
6 10 non-null float64
7 10 non-null float64
8 10 non-null float64
9 10 non-null float64
dtypes: float64(10)
memory usage: 928.0 bytes
In [60]: pd.set_option('max_info_columns', 5)
In [61]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Columns: 10 entries, 0 to 9
dtypes: float64(10)
memory usage: 928.0 bytes
In [62]: pd.reset_option('max_info_columns')
display.max_info_rows: df.info() will usually show null-counts for each column. For large frames this
can be quite slow. max_info_rows and max_info_cols limit this null check to frames with smaller
dimensions than specified. Note that you can pass null_counts=True to df.info() to force showing the
null-counts for a particular frame.
In [64]: df
Out[64]:
0 1 2 3 4 5 6 7 8 9
0 1.0 1.0 1.0 1.0 0.0 NaN 0.0 0.0 1.0 1.0
1 NaN NaN 1.0 NaN 1.0 1.0 0.0 0.0 0.0 0.0
2 0.0 NaN 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
3 1.0 NaN 1.0 1.0 NaN 1.0 1.0 0.0 NaN 1.0
In [66]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 10 columns):
0 6 non-null float64
1 5 non-null float64
2 9 non-null float64
3 5 non-null float64
4 8 non-null float64
5 6 non-null float64
6 10 non-null float64
7 8 non-null float64
8 7 non-null float64
9 8 non-null float64
dtypes: float64(10)
memory usage: 928.0 bytes
In [67]: pd.set_option('max_info_rows', 5)
In [68]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 10 columns):
0 float64
1 float64
2 float64
3 float64
4 float64
5 float64
6 float64
7 float64
8 float64
9 float64
dtypes: float64(10)
memory usage: 928.0 bytes
In [69]: pd.reset_option('max_info_rows')
display.precision sets the output display precision in terms of decimal places. This is only a suggestion.
In [71]: pd.set_option('precision', 7)
In [72]: df
Out[72]:
0 1 2 3 4
0 0.4940359 1.5462776 0.2468575 -1.4687758 1.1470999
1 0.0955570 -1.1074387 -0.1762861 -0.9827557 2.0866827
2 -0.3446237 -2.0020792 0.3032344 -0.8298748 1.2887694
3 0.1349255 -1.7786006 -0.5007915 -1.0881616 -0.7578556
4 -0.6437449 -2.0087845 0.1962629 -0.8758964 -0.8936092
In [73]: pd.set_option('precision', 4)
In [74]: df
Out[74]:
0 1 2 3 4
0 0.4940 1.5463 0.2469 -1.4688 1.1471
1 0.0956 -1.1074 -0.1763 -0.9828 2.0867
2 -0.3446 -2.0021 0.3032 -0.8299 1.2888
3 0.1349 -1.7786 -0.5008 -1.0882 -0.7579
4 -0.6437 -2.0088 0.1963 -0.8759 -0.8936
display.chop_threshold sets the level at which pandas rounds to zero when it displays a Series or DataFrame.
This setting does not change the precision at which the number is stored.
In [76]: pd.set_option('chop_threshold', 0)
In [77]: df
Out[77]:
0 1 2 3 4 5
0 0.7519 1.8969 -0.6291 1.8121 -2.0563 0.5627
1 -0.5821 -0.0740 -0.9865 -0.5947 -0.3148 -0.3469
2 0.4114 2.3264 -0.6341 -0.1544 -1.7493 -2.5196
3 1.3912 -1.3293 -0.7456 0.0213 0.9109 0.3153
4 1.8662 -0.1825 -1.8283 0.1390 0.1195 -0.8189
5 -0.3326 -0.5864 1.7345 -0.6128 -1.3934 0.2794
In [79]: df
Out[79]:
0 1 2 3 4 5
0 0.7519 1.8969 -0.6291 1.8121 -2.0563 0.5627
1 -0.5821 0.0000 -0.9865 -0.5947 0.0000 0.0000
2 0.0000 2.3264 -0.6341 0.0000 -1.7493 -2.5196
3 1.3912 -1.3293 -0.7456 0.0000 0.9109 0.0000
4 1.8662 0.0000 -1.8283 0.0000 0.0000 -0.8189
5 0.0000 -0.5864 1.7345 -0.6128 -1.3934 0.0000
In [80]: pd.reset_option('chop_threshold')
display.colheader_justify controls the justification of the headers. The options are 'right' and 'left'.
In [81]: df = pd.DataFrame(np.array([np.random.randn(6),
....: np.random.randint(1, 9, 6) * .1,
....: np.zeros(6)]).T,
....: columns=['A', 'B', 'C'], dtype='float')
....:
In [83]: df
Out[83]:
A B C
0 -1.8222 0.7 0.0
1 0.4270 0.2 0.0
2 0.4070 0.5 0.0
3 -0.8443 0.1 0.0
4 -0.5598 0.6 0.0
5 -0.6005 0.6 0.0
In [85]: df
Out[85]:
A B C
0 -1.8222 0.7 0.0
1 0.4270 0.2 0.0
2 0.4070 0.5 0.0
3 -0.8443 0.1 0.0
4 -0.5598 0.6 0.0
5 -0.6005 0.6 0.0
In [86]: pd.reset_option('colheader_justify')
pandas also allows you to set how numbers are displayed in the console. This option is not set through the
set_option API.
Use the set_eng_float_format function to alter the floating-point formatting of pandas objects to produce
a particular format.
For instance:
In [87]: import numpy as np
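The formatting call and the Series used below are not reproduced here; a minimal sketch (the index labels and random values are illustrative) would be:
>>> pd.set_eng_float_format(accuracy=3, use_eng_prefix=True)
>>> s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])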
In [90]: s / 1.e3
Out[90]:
a 1.615m
b -104.292u
c -1.455m
d 927.500u
e -17.229u
dtype: float64
In [91]: s / 1.e6
Out[91]:
a 1.615u
b -104.292n
c -1.455u
d 927.500n
e -17.229n
dtype: float64
To round floats on a case-by-case basis, you can also use Series.round() and DataFrame.round().
Warning: Enabling this option will affect the performance for printing of DataFrame and Series (about
2 times slower). Use only when it is actually required.
Some East Asian countries use Unicode characters whose width corresponds to two Latin characters. If a
DataFrame or Series contains these characters, the default output mode may not align them properly.
Note: In the rendered documentation, screen captures accompany each output to show how the results actually display in the console.
In [93]: df
Out[93]:
   国籍     名前
0  UK  Alice
1  日本    しのぶ
Enabling display.unicode.east_asian_width allows pandas to check each character’s “East Asian Width”
property. These characters can be aligned properly by setting this option to True. However, this will result
in longer render times than the standard len function.
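The option itself is enabled with an ordinary set_option call, for example:
>>> pd.set_option('display.unicode.east_asian_width', True)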
In [95]: df
Out[95]:
   国籍    名前
0    UK   Alice
1  日本  しのぶ
In addition, Unicode characters whose width is “Ambiguous” can either be 1 or 2 characters wide depending
on the terminal setting or encoding. The option display.unicode.ambiguous_as_wide can be used to
handle the ambiguity.
By default, an “Ambiguous” character’s width, such as “¡” (inverted exclamation) in the example below, is
taken to be 1.
In [97]: df
Out[97]:
a b
0 xxx yyy
1 ¡¡ ¡¡
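The second output below reflects enabling the option, roughly:
>>> pd.set_option('display.unicode.ambiguous_as_wide', True)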
In [99]: df
Out[99]:
a b
0 xxx yyy
1 ¡¡ ¡¡
In this part of the tutorial, we will investigate how to speed up certain functions operating on pandas
DataFrames using three different techniques: Cython, Numba and pandas.eval(). We will see a speed im-
provement of ~200x when we use Cython and Numba on a test function operating row-wise on the DataFrame.
Using pandas.eval() we will speed up a sum by a factor of ~2.
For many use cases writing pandas in pure Python and NumPy is sufficient. In some computationally heavy
applications, however, it can be possible to achieve sizable speed-ups by offloading work to Cython.
This tutorial assumes you have refactored as much as possible in Python, for example by trying to remove
for-loops and making use of NumPy vectorization. It’s always worth optimising in Python first.
This tutorial walks through a “typical” process of cythonizing a slow computation. We use an example from
the Cython documentation but in the context of pandas. Our final cythonized solution is around 100 times
faster than the pure Python solution.
Pure Python
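The setup for this example is not reproduced in full here; a sketch of the pure-Python baseline (column names and sizes are illustrative, following the Cython-documentation example this section builds on) is:
>>> df = pd.DataFrame({'a': np.random.randn(1000),
...                    'b': np.random.randn(1000),
...                    'N': np.random.randint(100, 1000, 1000),
...                    'x': 'x'})
>>> def f(x):
...     return x * (x - 1)
>>> def integrate_f(a, b, N):
...     s = 0
...     dx = (b - a) / N
...     for i in range(N):
...         s += f(a + i * dx)
...     return s * dx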
In [2]: df
Out[2]:
a b N x
0 0.286932 -1.108820 468 x
1 0.816278 -0.591357 796 x
2 -0.555291 0.240515 941 x
3 0.819424 -0.555205 760 x
4 1.179587 -0.379999 519 x
.. ... ... ... ..
995 0.498704 0.962552 352 x
996 0.376858 2.123431 543 x
997 -1.407083 1.844740 687 x
998 0.304963 0.234843 925 x
999 1.679794 -0.947443 291 x
But clearly this isn’t fast enough for us. Let’s take a look and see where the time is spent during this
operation (limited to the most time consuming four calls) using the prun ipython magic function:
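The profiling call itself would look roughly like this, assuming the pure-Python f and integrate_f sketched above:
>>> %prun -l 4 df.apply(lambda x: integrate_f(x['a'], x['b'], x['N']), axis=1)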
By far the majority of time is spent inside either integrate_f or f, hence we'll concentrate our efforts on
cythonizing these two functions.
Note: In Python 2 replacing the range with its generator counterpart (xrange) would mean the range
line would vanish. In Python 3 range is already lazy.
Plain Cython
First we’re going to need to import the Cython magic function to ipython:
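Assuming Cython is installed, loading the magic is a single line:
>>> %load_ext Cython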
Now, let’s simply copy our functions over to Cython as is (the suffix is here to distinguish between function
versions):
In [7]: %%cython
   ...: def f_plain(x):
   ...:     return x * (x - 1)
   ...: def integrate_f_plain(a, b, N):
   ...:     s = 0
   ...:     dx = (b - a) / N
   ...:     for i in range(N):
   ...:         s += f_plain(a + i * dx)
   ...:     return s * dx
   ...:
Note: If you’re having trouble pasting the above into your ipython, you may need to be using bleeding
edge ipython for paste to play well with cell magics.
Already this has shaved a third off, not too bad for a simple copy and paste.
Adding type
In [8]: %%cython
   ...: cdef double f_typed(double x) except? -2:
   ...:     return x * (x - 1)
   ...: cpdef double integrate_f_typed(double a, double b, int N):
   ...:     cdef int i
   ...:     cdef double s, dx
   ...:     s = 0
   ...:     dx = (b - a) / N
   ...:     for i in range(N):
   ...:         s += f_typed(a + i * dx)
   ...:     return s * dx
   ...:
Now, we’re talking! It’s now over ten times faster than the original python implementation, and we haven’t
really modified the code. Let’s have another look at what’s eating up time:
Using ndarray
It’s calling series… a lot! It’s creating a Series from each row, and get-ting from both the index and the series
(three times for each row). Function calls are expensive in Python, so maybe we could minimize these by
cythonizing the apply part.
Note: We are now passing ndarrays into the Cython function, fortunately Cython plays very nicely with
NumPy.
In [10]: %%cython
   ....: cimport numpy as np
   ....: import numpy as np
   ....: cdef double f_typed(double x) except? -2:
   ....:     return x * (x - 1)
   ....: cpdef double integrate_f_typed(double a, double b, int N):
   ....:     cdef int i
   ....:     cdef double s, dx
   ....:     s = 0
   ....:     dx = (b - a) / N
   ....:     for i in range(N):
   ....:         s += f_typed(a + i * dx)
   ....:     return s * dx
   ....: cpdef np.ndarray[double] apply_integrate_f(np.ndarray col_a, np.ndarray col_b,
   ....:                                            np.ndarray col_N):
   ....:     assert (col_a.dtype == np.float
   ....:             and col_b.dtype == np.float and col_N.dtype == np.int)
   ....:     cdef Py_ssize_t i, n = len(col_N)
   ....:     assert (len(col_a) == len(col_b) == n)
   ....:     cdef np.ndarray[double] res = np.empty(n)
   ....:     for i in range(len(col_a)):
   ....:         res[i] = integrate_f_typed(col_a[i], col_b[i], col_N[i])
   ....:     return res
   ....:
The implementation is simple: it allocates a results array and loops over the rows, applying our
integrate_f_typed and storing each result in that array.
Warning: You cannot pass a Series directly as an ndarray typed parameter to a Cython function.
Instead pass the actual ndarray using Series.to_numpy(). The reason is that the Cython definition
is specific to an ndarray and not the passed Series.
So, do not do this:
apply_integrate_f(df['a'], df['b'], df['N'])
Note: Loops like this would be extremely slow in Python, but in Cython looping over NumPy arrays is
fast.
We’ve gotten another big improvement. Let’s check again where the time is spent:
As one might expect, the majority of the time is now spent in apply_integrate_f, so if we wanted to squeeze
out any more efficiency we must continue to concentrate our efforts here.
There is still hope for improvement. Here’s an example of using some more advanced Cython techniques:
In [12]: %%cython
   ....: cimport cython
   ....: cimport numpy as np
   ....: import numpy as np
   ....: cdef double f_typed(double x) except? -2:
   ....:     return x * (x - 1)
   ....: cpdef double integrate_f_typed(double a, double b, int N):
   ....:     cdef int i
   ....:     cdef double s, dx
   ....:     s = 0
   ....:     dx = (b - a) / N
   ....:     for i in range(N):
   ....:         s += f_typed(a + i * dx)
   ....:     return s * dx
   ....: @cython.boundscheck(False)
   ....: @cython.wraparound(False)
   ....: cpdef np.ndarray[double] apply_integrate_f_wrap(np.ndarray[double] col_a,
   ....:                                                 np.ndarray[double] col_b,
   ....:                                                 np.ndarray[int] col_N):
   ....:     cdef int i, n = len(col_N)
   ....:     assert len(col_a) == len(col_b) == n
   ....:     cdef np.ndarray[double] res = np.empty(n)
   ....:     for i in range(n):
   ....:         res[i] = integrate_f_typed(col_a[i], col_b[i], col_N[i])
   ....:     return res
   ....:
Even faster, with the caveat that a bug in our Cython code (an off-by-one error, for example) might cause
a segfault because memory access isn’t checked. For more about boundscheck and wraparound, see the
Cython docs on compiler directives.
A recent alternative to statically compiling Cython code is to use a dynamic just-in-time (JIT) compiler, Numba.
Numba gives you the power to speed up your applications with high performance functions written directly in
Python. With a few annotations, array-oriented and math-heavy Python code can be just-in-time compiled
to native machine instructions, similar in performance to C, C++ and Fortran, without having to switch
languages or Python interpreters.
Numba works by generating optimized machine code using the LLVM compiler infrastructure at import
time, runtime, or statically (using the included pycc tool). Numba supports compilation of Python to run
on either CPU or GPU hardware, and is designed to integrate with the Python scientific software stack.
Note: You will need to install Numba. This is easy with conda, by using: conda install numba, see
installing using miniconda.
Note: As of Numba version 0.20, pandas objects cannot be passed directly to Numba-compiled functions.
Instead, one must pass the NumPy array underlying the pandas object to the Numba-compiled function as
demonstrated below.
Jit
We demonstrate how to use Numba to just-in-time compile our code. We simply take the plain Python code
from above and annotate with the @jit decorator.
import numba
import numpy as np
import pandas as pd


@numba.jit
def f_plain(x):
    return x * (x - 1)


@numba.jit
def integrate_f_numba(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f_plain(a + i * dx)
    return s * dx


@numba.jit
def apply_integrate_f_numba(col_a, col_b, col_N):
    n = len(col_N)
    result = np.empty(n, dtype='float64')
    assert len(col_a) == len(col_b) == n
    for i in range(n):
        result[i] = integrate_f_numba(col_a[i], col_b[i], col_N[i])
    return result


def compute_numba(df):
    result = apply_integrate_f_numba(df['a'].to_numpy(),
                                     df['b'].to_numpy(),
                                     df['N'].to_numpy())
    return pd.Series(result, index=df.index, name='result')
Note that we directly pass NumPy arrays to the Numba function. compute_numba is just a wrapper that
provides a nicer interface by passing/returning pandas objects.
Vectorize
Numba can also be used to write vectorized functions that do not require the user to explicitly loop over
the observations of a vector; a vectorized function will be applied to each row automatically. Consider the
following toy example of doubling each observation:
import numba


def double_every_value_nonumba(x):
    return x * 2


@numba.vectorize
def double_every_value_withnumba(x):  # noqa E501
    return x * 2
Caveats
Note: Numba will execute on any function, but can only accelerate certain classes of functions.
Numba is best at accelerating functions that apply numerical functions to NumPy arrays. When passed a
function that only uses operations it knows how to accelerate, it will execute in nopython mode.
If Numba is passed a function that includes something it doesn’t know how to work with – a category that
currently includes sets, lists, dictionaries, or string functions – it will revert to object mode. In object
mode, Numba will execute but your code will not speed up significantly. If you would prefer that Numba
throw an error if it cannot compile a function in a way that speeds up your code, pass Numba the argument
nopython=True (e.g. @numba.jit(nopython=True)). For more on troubleshooting Numba modes, see the
Numba troubleshooting page.
Read more in the Numba docs.
The top-level function pandas.eval() implements expression evaluation of Series and DataFrame objects.
Note: To benefit from using eval() you need to install numexpr. See the recommended dependencies
section for more details.
The point of using eval() for expression evaluation rather than plain Python is two-fold: 1) large DataFrame
objects are evaluated more efficiently and 2) large arithmetic and boolean expressions are evaluated all at
once by the underlying engine (by default numexpr is used for evaluation).
Note: You should not use eval() for simple expressions or for expressions involving small DataFrames.
In fact, eval() is many orders of magnitude slower for smaller expressions/objects than plain ol’ Python.
A good rule of thumb is to only use eval() when you have a DataFrame with more than 10,000 rows.
eval() supports all arithmetic expressions supported by the engine in addition to some extensions available
only in pandas.
Note: The larger the frame and the larger the expression the more speedup you will see from using eval().
Supported syntax
eval() examples
Now let’s compare adding them together using plain ol’ Python versus eval():
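The frames used in these timings are created earlier in the original document; a sketch of that kind of setup (the sizes are illustrative) is:
>>> nrows, ncols = 20000, 100
>>> df1, df2, df3, df4 = [pd.DataFrame(np.random.randn(nrows, ncols))
...                       for _ in range(4)]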
In [17]: %timeit (df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)
148 ms +- 5.16 ms per loop (mean +- std. dev. of 7 runs, 10 loops each)
In [18]: %timeit pd.eval('(df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)')
10.4 ms +- 463 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
In [19]: s = pd.Series(np.random.randn(50))
should be performed in Python. An exception will be raised if you try to perform any boolean/bitwise
operations with scalar operands that are not of type bool or np.bool_. Again, you should perform these
kinds of operations in plain Python.
In addition to the top level pandas.eval() function you can also evaluate an expression in the “context” of
a DataFrame.
Any expression that is a valid pandas.eval() expression is also a valid DataFrame.eval() expression, with
the added benefit that you don’t have to prefix the name of the DataFrame to the column(s) you’re interested
in evaluating.
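For example, a column expression can be written without the frame prefix (a minimal sketch):
>>> df = pd.DataFrame(np.random.randn(5, 2), columns=['a', 'b'])
>>> df.eval('a + b')          # equivalent to pd.eval('df.a + df.b')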
In addition, you can perform assignment of columns within an expression. This allows for formulaic evalua-
tion. The assignment target can be a new column name or an existing column name, and it must be a valid
Python identifier.
New in version 0.18.0.
The inplace keyword determines whether this assignment will be performed on the original DataFrame or
return a copy with the new column.
Warning: For backwards compatibility, inplace defaults to True if not specified. This will change in a
future version of pandas - if your code depends on an inplace assignment you should update to explicitly
set inplace=True.
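The frame shown below reflects a sequence of assignments along these lines (a sketch; the original setup is not reproduced here):
>>> df = pd.DataFrame(dict(a=range(5), b=range(5, 10)))
>>> df.eval('c = a + b', inplace=True)
>>> df.eval('d = a + b + c', inplace=True)
>>> df.eval('a = 1', inplace=True)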
In [28]: df
Out[28]:
a b c d
0 1 5 5 10
1 1 6 7 14
2 1 7 9 18
3 1 8 11 22
4 1 9 13 26
When inplace is set to False, a copy of the DataFrame with the new or modified columns is returned and
the original frame is unchanged.
In [29]: df
Out[29]:
a b c d
0 1 5 5 10
1 1 6 7 14
2 1 7 9 18
3 1 8 11 22
4 1 9 13 26
In [31]: df
Out[31]:
a b c d
0 1 5 5 10
1 1 6 7 14
2 1 7 9 18
3 1 8 11 22
4 1 9 13 26
New in version 0.18.0.
As a convenience, multiple assignments can be performed by using a multi-line string.
In [32]: df.eval("""
....: c = a + b
....: d = a + b + c
....: a = 1""", inplace=False)
....:
Out[32]:
a b c d
0 1 5 6 12
1 1 6 7 14
2 1 7 8 16
3 1 8 9 18
4 1 9 10 20
In [36]: df['a'] = 1
In [37]: df
Out[37]:
a b c d
0 1 5 5 10
1 1 6 7 14
2 1 7 9 18
3 1 8 11 22
4 1 9 13 26
In [41]: df
Out[41]:
a b
3 3 8
4 4 9
Warning: Unlike with eval, the default value for inplace for query is False. This is consistent with
prior versions of pandas.
Local variables
You must explicitly reference any local variable that you want to use in an expression by placing the @
character in front of the name. For example,
In [42]: df = pd.DataFrame(np.random.randn(5, 2), columns=list('ab'))
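For instance, continuing with the df just created, a local array can be referenced with @ (a sketch):
>>> newcol = np.random.randn(len(df))
>>> df.query('b < @newcol')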
With pandas.eval() you cannot use the @ prefix at all, because it isn’t defined in that context. pandas will
let you know this if you try to use @ in a top-level call to pandas.eval(). For example,
In [49]: a, b = 1, 2
File "/Users/taugspurger/miniconda3/envs/pandas-doc/lib/python3.7/site-packages/
,→IPython/core/interactiveshell.py", line 3326, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
In this case, you should simply refer to the variables like you would in standard Python.
pandas.eval() parsers
There are two different parsers and two different engines you can use as the backend.
The default 'pandas' parser allows a more intuitive syntax for expressing query-like operations (comparisons,
conjunctions and disjunctions). In particular, the precedence of the & and | operators is made equal to the
precedence of the corresponding boolean operations and and or.
For example, the above conjunction can be written without parentheses. Alternatively, you can use the
'python' parser to enforce strict Python semantics.
In [52]: expr = '(df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)'
In [54]: expr_no_parens = 'df1 > 0 & df2 > 0 & df3 > 0 & df4 > 0'
In [56]: np.all(x == y)
Out[56]: True
The same expression can be “anded” together with the word and as well:
In [57]: expr = '(df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)'
In [59]: expr_with_ands = 'df1 > 0 and df2 > 0 and df3 > 0 and df4 > 0'
In [61]: np.all(x == y)
Out[61]: True
The and and or operators here have the same precedence that they would in vanilla Python.
pandas.eval() backends
There’s also the option to make eval() operate identically to plain ol’ Python.
Note: Using the 'python' engine is generally not useful, except for testing other evaluation engines against
it. You will achieve no performance benefits using eval() with engine='python' and in fact may incur a
performance hit.
You can see this by using pandas.eval() with the 'python' engine. It is a bit slower (not by much) than
evaluating the same expression in Python.
pandas.eval() performance
eval() is intended to speed up certain kinds of operations. In particular, those operations involving complex
expressions with large DataFrame/Series objects should see a significant performance benefit. Here is a plot
showing the running time of pandas.eval() as a function of the size of the frame involved in the computation.
The two lines are two different engines.
Note: Operations with smallish objects (around 15k-20k rows) are faster using plain Python:
This plot was created using a DataFrame with 3 columns each containing floating point values generated
using numpy.random.randn().
Expressions that would result in an object dtype or involve datetime operations (because of NaT) must be
evaluated in Python space. The main reason for this behavior is to maintain backwards compatibility with
versions of NumPy < 1.7. In those versions of NumPy a call to ndarray.astype(str) will truncate any
strings that are more than 60 characters in length. Second, we can’t pass object arrays to numexpr thus
string comparisons must be evaluated in Python space.
The upshot is that this only applies to object-dtype expressions. So, if you have an expression, for example
In [64]: df = pd.DataFrame({'strings': np.repeat(list('cba'), 3),
....: 'nums': np.repeat(range(3), 3)})
....:
In [65]: df
Out[65]:
strings nums
0 c 0
1 c 0
2 c 0
3 b 1
4 b 1
5 b 1
6 a 2
7 a 2
8 a 2
Note: SparseSeries and SparseDataFrame have been deprecated. Their purpose is served equally well
by a Series or DataFrame with sparse values. See Migrating for tips on migrating.
pandas provides data structures for efficiently storing sparse data. These are not necessarily sparse in the
typical “mostly 0” sense. Rather, you can view these objects as being “compressed” where any data matching a
specific value (NaN / missing value, though any value can be chosen, including 0) is omitted. The compressed
values are not actually stored in the array.
In [1]: arr = np.random.randn(10)
In [3]: ts = pd.Series(pd.SparseArray(arr))
In [4]: ts
Out[4]:
0 1.485629
1 0.328995
2 NaN
Notice the dtype, Sparse[float64, nan]. The nan means that elements in the array that are nan aren’t
actually stored, only the non-nan elements are. Those non-nan elements have a float64 dtype.
The sparse objects exist for memory efficiency reasons. Suppose you had a large, mostly NA DataFrame:
In [5]: df = pd.DataFrame(np.random.randn(10000, 4))
In [8]: sdf.head()
Out[8]:
0 1 2 3
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
In [9]: sdf.dtypes
Out[9]:
0 Sparse[float64, nan]
1 Sparse[float64, nan]
2 Sparse[float64, nan]
3 Sparse[float64, nan]
dtype: object
In [10]: sdf.sparse.density
Out[10]: 0.0002
As you can see, the density (% of values that have not been “compressed”) is extremely low. This sparse
object takes up much less memory on disk (pickled) and in the Python interpreter.
In [11]: 'dense : {:0.2f} bytes'.format(df.memory_usage().sum() / 1e3)
Out[11]: 'dense : 320.13 bytes'
4.18.1 SparseArray
SparseArray is a ExtensionArray for storing an array of sparse values (see dtypes for more on extension
arrays). It is a 1-dimensional ndarray-like object storing only values distinct from the fill_value:
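The array shown below could be constructed along these lines (a sketch with illustrative values):
>>> arr = np.random.randn(10)
>>> arr[2:5] = np.nan
>>> arr[7] = np.nan
>>> sparr = pd.SparseArray(arr)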
In [17]: sparr
Out[17]:
[0.6814937596731665, -0.3936934801139613, nan, nan, nan, 1.1656790567319866, -0.
,→4627293996578389, nan, -0.12411975873999016, 1.979438803899594]
Fill: nan
IntIndex
Indices: array([0, 1, 5, 6, 8, 9], dtype=int32)
In [18]: np.asarray(sparr)
Out[18]:
array([ 0.68149376, -0.39369348, nan, nan, nan,
1.16567906, -0.4627294 , nan, -0.12411976, 1.9794388 ])
4.18.2 SparseDtype
In [19]: sparr.dtype
Out[19]: Sparse[float64, nan]
In [20]: pd.SparseDtype(np.dtype('datetime64[ns]'))
Out[20]: Sparse[datetime64[ns], NaT]
The default fill value for a given NumPy dtype is the “missing” value for that dtype, though it may be
overridden.
In [21]: pd.SparseDtype(np.dtype('datetime64[ns]'),
....: fill_value=pd.Timestamp('2017-01-01'))
....:
Out[21]: Sparse[datetime64[ns], 2017-01-01 00:00:00]
Finally, the string alias 'Sparse[dtype]' may be used to specify a sparse dtype in many places
In [24]: s.sparse.density
Out[24]: 0.5
In [25]: s.sparse.fill_value
Out[25]: 0
This accessor is available only on data with SparseDtype, and on the Series class itself for creating a Series
with sparse data from a scipy COO matrix.
New in version 0.25.0.
A .sparse accessor has been added for DataFrame as well. See Sparse accessor for more.
You can apply NumPy ufuncs to SparseArray and get a SparseArray as a result.
In [27]: np.abs(arr)
Out[27]:
[1.0, nan, nan, 2.0, nan]
Fill: nan
IntIndex
Indices: array([0, 3], dtype=int32)
The ufunc is also applied to fill_value. This is needed to get the correct dense result.
In [28]: arr = pd.SparseArray([1., -1, -1, -2., -1], fill_value=-1)
In [29]: np.abs(arr)
Out[29]:
[1.0, 1, 1, 2.0, 1]
Fill: 1
IntIndex
Indices: array([0, 3], dtype=int32)
In [30]: np.abs(arr).to_dense()
4.18.5 Migrating
In older versions of pandas, the SparseSeries and SparseDataFrame classes (documented below) were the
preferred way to work with sparse data. With the advent of extension arrays, these subclasses are no longer
needed. Their purpose is better served by using a regular Series or DataFrame with sparse values instead.
Note: There’s no performance or memory penalty to using a Series or DataFrame with sparse values,
rather than a SparseSeries or SparseDataFrame.
This section provides some guidance on migrating your code to the new style. As a reminder, you can use
the python warnings module to control warnings. But we recommend modifying your code, rather than
ignoring the warning.
Construction
From an array-like, use the regular Series or DataFrame constructors with SparseArray values.
# Previous way
>>> pd.SparseDataFrame({"A": [0, 1]})
# New way
In [31]: pd.DataFrame({"A": pd.SparseArray([0, 1])})
Out[31]:
A
0 0
1 1
# Previous way
>>> from scipy import sparse
>>> mat = sparse.eye(3)
>>> df = pd.SparseDataFrame(mat, columns=['A', 'B', 'C'])
# New way
In [32]: from scipy import sparse
In [35]: df.dtypes
Out[35]:
A Sparse[float64, 0.0]
B Sparse[float64, 0.0]
C Sparse[float64, 0.0]
dtype: object
Conversion
From sparse to dense, use the .sparse accessors
In [36]: df.sparse.to_dense()
Out[36]:
A B C
0 1.0 0.0 0.0
1 0.0 1.0 0.0
2 0.0 0.0 1.0
In [37]: df.sparse.to_coo()
Out[37]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in COOrdinate format>
From dense to sparse, use DataFrame.astype() with a SparseDtype.
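A sketch of the conversion that produces the frame shown below:
>>> dense = pd.DataFrame({"A": [1, 0, 0, 1]})
>>> dtype = pd.SparseDtype(int, fill_value=0)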
In [40]: dense.astype(dtype)
Out[40]:
A
0 1
1 0
2 0
3 1
Sparse Properties
Sparse-specific properties, like density, are available on the .sparse accessor.
In [41]: df.sparse.density
Out[41]: 0.3333333333333333
General differences
In a SparseDataFrame, all columns were sparse. A DataFrame can have a mixture of sparse and dense
columns. As a consequence, assigning new columns to a DataFrame with sparse values will not automatically
convert the input to be sparse.
# Previous Way
>>> df = pd.SparseDataFrame({"A": [0, 1]})
>>> df['B'] = [0, 0] # implicitly becomes Sparse
>>> df['B'].dtype
Sparse[int64, nan]
Instead, you’ll need to ensure that the values being assigned are sparse
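A sketch of the two assignments whose dtypes are shown below:
>>> df = pd.DataFrame({"A": pd.SparseArray([0, 1])})
>>> df['B'] = [0, 0]                  # a plain list: the new column stays dense
>>> df['B'] = pd.SparseArray([0, 0])  # assign sparse values explicitly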
In [44]: df['B'].dtype
Out[44]: dtype('int64')
In [46]: df['B'].dtype
Out[46]: Sparse[int64, 0]
Use DataFrame.sparse.from_spmatrix() to create a DataFrame with sparse values from a sparse matrix.
New in version 0.25.0.
In [47]: from scipy.sparse import csr_matrix
In [51]: sp_arr
Out[51]:
<1000x5 sparse matrix of type '<class 'numpy.float64'>'
with 464 stored elements in Compressed Sparse Row format>
In [53]: sdf.head()
Out[53]:
0 1 2 3 4
0 0.0 0.921245 0.0 0.000000 0.000000
1 0.0 0.000000 0.0 0.000000 0.000000
2 0.0 0.000000 0.0 0.000000 0.000000
3 0.0 0.000000 0.0 0.000000 0.000000
4 0.0 0.000000 0.0 0.936826 0.918174
In [54]: sdf.dtypes
Out[54]:
0 Sparse[float64, 0.0]
1 Sparse[float64, 0.0]
2 Sparse[float64, 0.0]
3 Sparse[float64, 0.0]
4 Sparse[float64, 0.0]
dtype: object
All sparse formats are supported, but matrices that are not in COOrdinate format will be converted, copying
data as needed. To convert back to a sparse SciPy matrix in COO format, you can use the
DataFrame.sparse.to_coo() method:
In [55]: sdf.sparse.to_coo()
Out[55]:
In [58]: s
Out[58]:
A B C D
1 2 a 0 3.0
1 NaN
1 b 0 1.0
1 3.0
2 1 b 0 NaN
1 NaN
dtype: float64
In [59]: ss = s.astype('Sparse')
In [60]: ss
Out[60]:
A B C D
1 2 a 0 3.0
1 NaN
1 b 0 1.0
1 3.0
2 1 b 0 NaN
1 NaN
dtype: Sparse[float64, nan]
In the example below, we transform the Series to a sparse representation of a 2-d array by specifying
that the first and second MultiIndex levels define labels for the rows and the third and fourth levels define
labels for the columns. We also specify that the column and row labels should be sorted in the final sparse
representation.
In [61]: A, rows, columns = ss.sparse.to_coo(row_levels=['A', 'B'],
....: column_levels=['C', 'D'],
....: sort_labels=True)
....:
In [62]: A
Out[62]:
<3x4 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in COOrdinate format>
In [63]: A.todense()
Out[63]:
matrix([[0., 0., 1., 3.],
[3., 0., 0., 0.],
[0., 0., 0., 0.]])
In [64]: rows
Out[64]: [(1, 1), (1, 2), (2, 1)]
In [65]: columns
Out[65]: [('a', 0), ('a', 1), ('b', 0), ('b', 1)]
Specifying different row and column labels (and not sorting them) yields a different sparse matrix:
In [66]: A, rows, columns = ss.sparse.to_coo(row_levels=['A', 'B', 'C'],
....: column_levels=['D'],
....: sort_labels=False)
....:
In [67]: A
Out[67]:
<3x2 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in COOrdinate format>
In [68]: A.todense()
Out[68]:
matrix([[3., 0.],
[1., 3.],
[0., 0.]])
In [69]: rows
Out[69]: [(1, 2, 'a'), (1, 1, 'b'), (2, 1, 'b')]
In [70]: columns
Out[70]: [0, 1]
A convenience method Series.sparse.from_coo() is implemented for creating a Series with sparse values
from a scipy.sparse.coo_matrix.
In [71]: from scipy import sparse
In [73]: A
Out[73]:
<3x4 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in COOrdinate format>
In [74]: A.todense()
Out[74]:
In [75]: ss = pd.Series.sparse.from_coo(A)
In [76]: ss
Out[76]:
0 2 1.0
3 2.0
1 0 3.0
dtype: Sparse[float64, nan]
Specifying dense_index=True will result in an index that is the Cartesian product of the row and columns
coordinates of the matrix. Note that this will consume a significant amount of memory (relative to
dense_index=False) if the sparse matrix is large (and sparse) enough.
In [78]: ss_dense
Out[78]:
0 0 NaN
1 NaN
2 1.0
3 2.0
1 0 3.0
1 NaN
2 NaN
3 NaN
2 0 NaN
1 NaN
2 NaN
3 NaN
dtype: Sparse[float64, nan]
The SparseSeries and SparseDataFrame classes are deprecated. Visit their API pages for usage.
The memory usage of a DataFrame (including the index) is shown when calling info(). A configuration
option, display.memory_usage (see the list of options), specifies if the DataFrame’s memory usage will be
displayed when invoking the df.info() method.
For example, the memory usage of the DataFrame below is shown when calling info():
In [2]: n = 5000
In [4]: df = pd.DataFrame(data)
In [6]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
int64 5000 non-null int64
float64 5000 non-null float64
datetime64[ns] 5000 non-null datetime64[ns]
timedelta64[ns] 5000 non-null timedelta64[ns]
complex128 5000 non-null complex128
object 5000 non-null object
bool 5000 non-null bool
categorical 5000 non-null category
dtypes: bool(1), category(1), complex128(1), datetime64[ns](1), float64(1), int64(1),␣
,→object(1), timedelta64[ns](1)
The + symbol indicates that the true memory usage could be higher, because pandas does not count the
memory used by values in columns with dtype=object.
Passing memory_usage='deep' will enable a more accurate memory usage report, accounting for the full
usage of the contained objects. This is optional as it can be expensive to do this deeper introspection.
In [7]: df.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
int64 5000 non-null int64
float64 5000 non-null float64
datetime64[ns] 5000 non-null datetime64[ns]
timedelta64[ns] 5000 non-null timedelta64[ns]
complex128 5000 non-null complex128
object 5000 non-null object
bool 5000 non-null bool
categorical 5000 non-null category
dtypes: bool(1), category(1), complex128(1), datetime64[ns](1), float64(1), int64(1),␣
,→object(1), timedelta64[ns](1)
By default the display option is set to True but can be explicitly overridden by passing the memory_usage
argument when invoking df.info().
The memory usage of each column can be found by calling the memory_usage() method. This returns a
Series with an index represented by column names and memory usage of each column shown in bytes. For
the DataFrame above, the memory usage of each column and the total memory usage can be found with the
memory_usage method:
In [8]: df.memory_usage()
Out[8]:
Index 128
int64 40000
float64 40000
datetime64[ns] 40000
timedelta64[ns] 40000
complex128 80000
object 40000
bool 5000
categorical 10920
dtype: int64
In [10]: df.memory_usage(index=False)
Out[10]:
int64 40000
float64 40000
datetime64[ns] 40000
timedelta64[ns] 40000
complex128 80000
object 40000
bool 5000
categorical 10920
dtype: int64
The memory usage displayed by the info() method utilizes the memory_usage() method to determine the
memory usage of a DataFrame while also formatting the output in human-readable units (base-2 represen-
tation; i.e. 1KB = 1024 bytes).
See also Categorical Memory Usage.
pandas follows the NumPy convention of raising an error when you try to convert something to a bool. This
happens in an if-statement or when using the boolean operations: and, or, and not. It is not clear what
the result of the following code should be:
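For example, a construct along these lines is ambiguous (a sketch):
>>> if pd.Series([False, True, False]):
...     print("I was true")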
Should it be True because it’s not zero-length, or False because there are False values? It is unclear, so
instead, pandas raises a ValueError:
You need to explicitly choose what you want to do with the DataFrame, e.g. use any(), all() or empty().
Alternatively, you might want to compare if the pandas object is None:
To evaluate single-element pandas objects in a boolean context, use the method bool():
In [11]: pd.Series([True]).bool()
Out[11]: True
In [12]: pd.Series([False]).bool()
Out[12]: False
In [13]: pd.DataFrame([[True]]).bool()
Out[13]: True
In [14]: pd.DataFrame([[False]]).bool()
Out[14]: False
Bitwise boolean
Bitwise boolean operators like == and != return a boolean Series, which is almost always what you want
anyway.
>>> s = pd.Series(range(5))
>>> s == 4
0 False
1 False
2 False
3 False
4 True
dtype: bool
Using the Python in operator on a Series tests for membership in the index, not membership among the
values.
In [16]: 2 in s
Out[16]: False
In [17]: 'b' in s
Out[17]: True
If this behavior is surprising, keep in mind that using in on a Python dictionary tests keys, not values, and
Series are dict-like. To test for membership in the values, use the method isin():
In [18]: s.isin([2])
Out[18]:
a False
b False
c True
d False
e False
dtype: bool
In [19]: s.isin([2]).any()
Out[19]: True
For DataFrames, likewise, in applies to the column axis, testing for membership in the list of column names.
Choice of NA representation
For lack of NA (missing) support from the ground up in NumPy and Python in general, we were given the
difficult choice between either:
• A masked array solution: an array of data and an array of boolean values indicating whether a value
is there or is missing.
• Using a special sentinel value, bit pattern, or set of sentinel values to denote NA across the dtypes.
For many reasons we chose the latter. After years of production use it has proven, at least in my opinion,
to be the best decision given the state of affairs in NumPy and Python in general. The special value NaN
(Not-A-Number) is used everywhere as the NA value, and there are API functions isna and notna which can
be used across the dtypes to detect NA values.
However, it comes with a couple of trade-offs which I most certainly have not ignored.
In the absence of high performance NA support being built into NumPy from the ground up, the primary
casualty is the ability to represent NAs in integer arrays. For example:
In [20]: s = pd.Series([1, 2, 3, 4, 5], index=list('abcde'))
In [21]: s
Out[21]:
a 1
b 2
c 3
d 4
e 5
dtype: int64
In [22]: s.dtype
Out[22]: dtype('int64')
In [24]: s2
Out[24]:
a 1.0
b 2.0
c 3.0
f NaN
u NaN
dtype: float64
In [25]: s2.dtype
Out[25]: dtype('float64')
This trade-off is made largely for memory and performance reasons, and also so that the resulting Series
continues to be “numeric”.
If you need to represent integers with possibly missing values, use one of the nullable-integer extension dtypes
provided by pandas
• Int8Dtype
• Int16Dtype
• Int32Dtype
• Int64Dtype
In [26]: s_int = pd.Series([1, 2, 3, 4, 5], index=list('abcde'),
....: dtype=pd.Int64Dtype())
....:
In [27]: s_int
Out[27]:
a 1
b 2
c 3
d 4
e 5
dtype: Int64
In [28]: s_int.dtype
Out[28]: Int64Dtype()
In [30]: s2_int
Out[30]:
a 1
b 2
c 3
f NaN
u NaN
dtype: Int64
In [31]: s2_int.dtype
Out[31]: Int64Dtype()
See Nullable integer data type for more.
NA type promotions
When introducing NAs into an existing Series or DataFrame via reindex() or some other means, boolean
and integer types will be promoted to a different dtype in order to store the NAs: integer is cast to float64
and boolean is cast to object, while floating and object dtypes are unchanged.
While this may seem like a heavy trade-off, I have found very few cases where this is an issue in practice i.e.
storing values greater than 2**53. Some explanation for the motivation is in the next section.
Many people have suggested that NumPy should simply emulate the NA support present in the more domain-
specific statistical programming language R. Part of the reason is the NumPy type hierarchy:
Typeclass              Dtypes
numpy.floating         float16, float32, float64, float128
numpy.integer          int8, int16, int32, int64
numpy.unsignedinteger  uint8, uint16, uint32, uint64
numpy.object_          object_
numpy.bool_            bool_
numpy.character        string_, unicode_
The R language, by contrast, only has a handful of built-in data types: integer, numeric (floating-point),
character, and boolean. NA types are implemented by reserving special bit patterns for each type to be
used as the missing value. While doing this with the full NumPy type hierarchy would be possible, it would
be a more substantial trade-off (especially for the 8- and 16-bit data types) and implementation undertaking.
An alternate approach is that of using masked arrays. A masked array is an array of data with an associated
boolean mask denoting whether each value should be considered NA or not. I am personally not in love with
this approach as I feel that overall it places a fairly heavy burden on the user and the library implementer.
Additionally, it exacts a fairly high performance cost when working with numerical data compared with the
simple approach of using NaN. Thus, I have chosen the Pythonic “practicality beats purity” approach and
traded integer NA capability for a much simpler approach of using a special value in float and object arrays
to denote NA, and promoting integer arrays to floating when NAs must be introduced.
For Series and DataFrame objects, var() normalizes by N-1 to produce unbiased estimates of the sample
variance, while NumPy’s var normalizes by N, which measures the variance of the sample. Note that cov()
normalizes by N-1 in both pandas and NumPy.
4.19.5 Thread-safety
As of pandas 0.11, pandas is not 100% thread safe. The known issues relate to the copy() method. If you
are doing a lot of copying of DataFrame objects shared among threads, we recommend holding locks inside
the threads where the data copying occurs.
See this link for more information.
Occasionally you may have to deal with data that were created on a machine with a different byte order
than the one on which you are running Python. A common symptom of this issue is an error like:
Traceback
...
ValueError: Big-endian buffer not supported on little-endian compiler
To deal with this issue you should convert the underlying NumPy array to the native system byte order
before passing it to Series or DataFrame constructors using something similar to the following:
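A sketch of producing a native-byte-order array (newx) from big-endian data:
>>> x = np.array(list(range(10)), '>i4')    # big-endian data
>>> newx = x.byteswap().newbyteorder()      # convert to native byte order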
In [34]: s = pd.Series(newx)
See the NumPy documentation on byte order for more details.
4.20 Cookbook
This is a repository for short and sweet examples and links for useful pandas recipes. We encourage users to
add to this documentation.
Adding interesting links and/or inline examples to this section is a great First Pull Request.
Simplified, condensed, new-user friendly, in-line examples have been inserted where possible to augment the
Stack-Overflow and GitHub links. Many of the links contain expanded information, above what the in-line
examples offer.
pandas (pd) and NumPy (np) are the only two abbreviated imported modules. The rest are kept explicitly
imported for newer users.
These examples are written for Python 3. Minor tweaks might be necessary for earlier Python versions.
4.20.1 Idioms
In [2]: df
Out[2]:
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
if-then…
In [4]: df
Out[4]:
AAA BBB CCC
0 4 10 100
1 5 -1 50
2 6 -1 -30
3 7 -1 -50
In [6]: df
Out[6]:
AAA BBB CCC
0 4 10 100
1 5 555 555
2 6 555 555
3 7 555 555
In [8]: df
Out[8]:
AAA BBB CCC
0 4 2000 2000
In [12]: df
Out[12]:
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
In [14]: df
Out[14]:
AAA BBB CCC logic
0 4 10 100 low
1 5 20 50 low
2 6 30 -30 high
3 7 40 -50 high
Splitting
In [16]: df
Out[16]:
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
Building criteria
In [20]: df
Out[20]:
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
In [24]: df
Out[24]:
AAA BBB CCC
0 0.1 10 100
1 5.0 20 50
2 0.1 30 -30
3 0.1 40 -50
In [26]: df
Out[26]:
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
In [30]: df
Out[30]:
AAA BBB CCC
0 4 10 100
1 5 20 50
In [38]: df[AllCrit]
Out[38]:
AAA BBB CCC
0 4 10 100
4.20.2 Selection
DataFrames
In [40]: df
Out[40]:
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
# Generic
In [44]: df.iloc[0:3]
Out[44]:
AAA BBB CCC
foo 4 10 100
bar 5 20 50
boo 6 30 -30
In [45]: df.loc['bar':'kar']
Out[45]:
AAA BBB CCC
bar 5 20 50
boo 6 30 -30
kar 7 40 -50
Ambiguity arises when an index consists of integers with a non-zero start or non-unit increment.
In [46]: data = {'AAA': [4, 5, 6, 7],
....: 'BBB': [10, 20, 30, 40],
....: 'CCC': [100, 50, -30, -50]}
....:
2 5 20 50
3 6 30 -30
Using the inverse operator (~) to take the complement of a mask
In [50]: df = pd.DataFrame({'AAA': [4, 5, 6, 7],
....: 'BBB': [10, 20, 30, 40],
....: 'CCC': [100, 50, -30, -50]})
....:
In [51]: df
Out[51]:
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
New columns
In [54]: df
Out[54]:
AAA BBB CCC
0 1 1 2
1 2 1 1
2 1 2 3
3 3 2 1
In [59]: df
Out[59]:
AAA BBB CCC AAA_cat BBB_cat CCC_cat
0 1 1 2 Alpha Alpha Beta
1 2 1 1 Beta Alpha Alpha
In [61]: df
Out[61]:
AAA BBB
0 1 2
1 1 1
2 1 3
3 2 4
4 2 5
5 2 1
6 3 2
7 3 3
In [62]: df.loc[df.groupby("AAA")["BBB"].idxmin()]
Out[62]:
AAA BBB
1 1 1
5 2 1
6 3 2
4.20.3 MultiIndexing
In [65]: df
Out[65]:
row One_X One_Y Two_X Two_Y
0 0 1.1 1.2 1.11 1.22
1 1 1.1 1.2 1.11 1.22
2 2 1.1 1.2 1.11 1.22
# As Labelled Index
In [66]: df = df.set_index('row')
In [67]: df
Out[67]:
One_X One_Y Two_X Two_Y
row
0 1.1 1.2 1.11 1.22
1 1.1 1.2 1.11 1.22
2 1.1 1.2 1.11 1.22
In [69]: df
Out[69]:
One Two
X Y X Y
row
0 1.1 1.2 1.11 1.22
1 1.1 1.2 1.11 1.22
2 1.1 1.2 1.11 1.22
In [71]: df
Out[71]:
level_1 X Y
row
0 One 1.10 1.20
0 Two 1.11 1.22
1 One 1.10 1.20
1 Two 1.11 1.22
2 One 1.10 1.20
2 Two 1.11 1.22
# And fix the labels (Notice the label 'level_1' got added automatically)
In [72]: df.columns = ['Sample', 'All_X', 'All_Y']
In [73]: df
Arithmetic
In [76]: df
Out[76]:
A B C
O I O I O I
n 1.891290 0.239601 0.271233 -3.008824 -1.061154 -0.772010
m -0.046401 1.346871 0.489157 0.354291 -0.537453 -1.290894
In [78]: df
Out[78]:
A B C
O I O I O I
n -1.782296 -0.310360 -0.255602 3.897391 1.0 1.0
m 0.086335 -1.043363 -0.910138 -0.274454 1.0 1.0
Slicing
In [79]: coords = [('AA', 'one'), ('AA', 'six'), ('BB', 'one'), ('BB', 'two'),
....: ('BB', 'six')]
....:
In [82]: df
Out[82]:
To take a cross section of the 1st level along the 1st axis (the index):
In [92]: df
Out[92]:
Exams Labs
I II I II
Student Course
Ada Comp 70 71 72 73
Math 71 73 75 74
Sci 72 75 75 75
Quinn Comp 73 74 75 76
Math 74 76 78 77
Sci 75 78 78 78
Violet Comp 76 77 78 79
Math 77 79 81 80
Sci 78 81 81 81
In [94]: df.loc['Violet']
Out[94]:
Exams Labs
I II I II
Course
Comp 76 77 78 79
Math 77 79 81 80
Sci 78 81 81 81
Sorting
Levels
In [102]: df
Out[102]:
A
2013-08-01 0.260588
2013-08-02 -1.252128
2013-08-05 0.033727
2013-08-06 NaN
2013-08-07 1.939833
2013-08-08 -0.425615
In [103]: df.reindex(df.index[::-1]).ffill()
Out[103]:
A
2013-08-08 -0.425615
2013-08-07 1.939833
2013-08-06 1.939833
2013-08-05 0.033727
2013-08-02 -1.252128
2013-08-01 0.260588
cumsum reset at NaN values
Replace
4.20.5 Grouping
In [105]: df
Out[105]:
animal size weight adult
0 cat S 8 False
1 dog S 10 False
2 cat M 11 False
3 fish M 1 False
4 dog M 20 False
5 cat L 12 True
6 cat L 12 True
In [107]: gb = df.groupby(['animal'])
In [108]: gb.get_group('cat')
Out[108]:
animal size weight adult
0 cat S 8 False
2 cat M 11 False
5 cat L 12 True
6 cat L 12 True
In [111]: expected_df
Out[111]:
size weight adult
animal
cat L 12.4375 True
dog L 20.0000 True
fish L 1.2500 True
Expanding apply
In [117]: gb = df.groupby('A')
In [119]: gb.transform(replace)
Out[119]:
B
0 1.0
1 -1.0
2 1.5
3 1.5
In [124]: sorted_df
Out[124]:
code data flag
1 bar -0.21 True
4 bar -0.59 False
0 foo 0.16 False
3 foo 0.45 True
2 baz 0.33 False
5 baz 0.62 True
In [129]: ts.resample("5min").apply(mhc)
Out[129]:
Mean 2014-10-07 00:00:00 1
2014-10-07 00:05:00 3.5
2014-10-07 00:10:00 6
2014-10-07 00:15:00 8.5
Max 2014-10-07 00:00:00 2
2014-10-07 00:05:00 4
2014-10-07 00:10:00 7
2014-10-07 00:15:00 9
Custom 2014-10-07 00:00:00 1.234
2014-10-07 00:05:00 NaT
2014-10-07 00:10:00 7.404
2014-10-07 00:15:00 NaT
dtype: object
In [130]: ts
Out[130]:
2014-10-07 00:00:00 0
2014-10-07 00:02:00 1
2014-10-07 00:04:00 2
2014-10-07 00:06:00 3
2014-10-07 00:08:00 4
2014-10-07 00:10:00 5
2014-10-07 00:12:00 6
2014-10-07 00:14:00 7
2014-10-07 00:16:00 8
2014-10-07 00:18:00 9
Freq: 2T, dtype: int64
Create a value counts column and reassign back to the DataFrame
In [132]: df
Out[132]:
Color Value
0 Red 100
1 Red 150
2 Red 50
3 Blue 50
In [134]: df
Out[134]:
Color Value Counts
0 Red 100 3
1 Red 150 3
2 Red 50 3
3 Blue 50 1
In [136]: df
Out[136]:
line_race beyer
Last Gunfighter 10 99
Last Gunfighter 10 102
Last Gunfighter 8 103
Paynter 10 103
Paynter 10 88
Paynter 8 100
In [138]: df
Out[138]:
line_race beyer beyer_shifted
Last Gunfighter 10 99 NaN
Last Gunfighter 10 102 99.0
Last Gunfighter 8 103 102.0
Paynter 10 103 NaN
Paynter 10 88 103.0
Paynter 8 100 88.0
In [142]: df_count
Out[142]:
host service no
0 other web 2
1 that mail 1
2 this mail 2
3: Int64Index([2], dtype='int64'),
4: Int64Index([3, 4, 5], dtype='int64'),
5: Int64Index([6], dtype='int64'),
6: Int64Index([7, 8], dtype='int64')}
Expanding data
Splitting
Splitting a frame
Create a list of dataframes, split using a delineation based on logic included in rows.
In [146]: df = pd.DataFrame(data={'Case': ['A', 'A', 'A', 'B', 'A', 'A', 'B', 'A',
.....: 'A'],
.....: 'Data': np.random.randn(9)})
.....:
In [148]: dfs[0]
Out[148]:
Case Data
0 A -0.265881
1 A -0.862987
2 A -0.250195
3 B 0.901293
In [149]: dfs[1]
Out[149]:
Case Data
4 A -1.350468
5 A -1.720622
6 B -1.815265
In [150]: dfs[2]
Out[150]:
Case Data
7 A -0.690951
8 A -0.180020
Pivot
In [153]: table.stack('City')
Out[153]:
Sales
Province City
AL All 12.0
Calgary 8.0
Edmonton 4.0
BC All 16.0
Vancouver 16.0
MN All 3.0
Winnipeg 3.0
ON All 14.0
Toronto 13.0
Windsor 1.0
QC All 6.0
Montreal 6.0
All All 51.0
Calgary 8.0
Edmonton 4.0
Montreal 6.0
Toronto 13.0
Vancouver 16.0
Windsor 1.0
Winnipeg 3.0
In [154]: grades = [48, 99, 75, 80, 42, 80, 72, 68, 36, 78]
Apply
In [162]: df_orgz
Out[162]:
0 1 2 3
I A 2 4 8 16.0
B a b c NaN
II A 100 200 NaN NaN
B jj kk NaN NaN
III A 10 20 30 NaN
B ccc NaN NaN NaN
In [164]: df
Out[164]:
A B
2001-01-01 0.000031 -0.000011
2001-01-02 -0.000041 -0.000094
2001-01-03 0.000038 0.000089
2001-01-04 -0.000113 -0.000026
2001-01-05 -0.000178 0.000128
... ... ...
2006-06-19 -0.000009 0.000142
2006-06-20 0.000047 -0.000137
2006-06-21 0.000005 0.000090
2006-06-22 0.000081 -0.000111
2006-06-23 -0.000032 0.000057
In [167]: s
Out[167]:
2001-01-01 0.005415
2001-01-02 0.004827
2001-01-03 0.005830
2001-01-04 0.004434
2001-01-05 0.005509
...
2006-04-30 0.000306
2006-05-01 -0.000446
2006-05-02 0.000145
2006-05-03 -0.000512
2006-05-04 -0.000856
Length: 1950, dtype: float64
In [170]: df
Out[170]:
Open Close Volume
2014-01-01 -1.809586 -3.104481 1753
2014-01-02 -1.090265 -1.349723 940
2014-01-03 1.770653 -0.578485 1485
2014-01-04 2.343727 -0.073312 1238
2014-01-05 -0.157785 0.087135 1103
... ... ... ...
2014-04-06 -1.661246 -0.436527 1453
2014-04-07 -0.618636 -0.568683 181
2014-04-08 -0.980303 1.437674 824
2014-04-09 0.422785 -0.857118 695
2014-04-10 0.001692 -0.371867 184
In [172]: window = 5
In [174]: s.round(2)
Out[174]:
2014-01-06 -1.16
2014-01-07 -0.45
2014-01-08 -0.24
2014-01-09 -0.15
2014-01-10 -0.04
...
2014-04-06 0.29
2014-04-07 0.03
2014-04-08 -0.55
2014-04-09 0.14
2014-04-10 0.05
Length: 95, dtype: float64
4.20.6 Timeseries
Between times
Using indexer between time
Constructing a datetime range that excludes weekends and includes only certain times
Vectorized Lookup
Aggregation and plotting time series
Turn a matrix with hours in columns and days in rows into a continuous row sequence in the form of a time
series. How to rearrange a Python pandas DataFrame?
Dealing with duplicates when reindexing a timeseries to a specified frequency
Calculate the first day of the month for each entry in a DatetimeIndex
In [176]: dates.to_period(freq='M').to_timestamp()
Out[176]:
DatetimeIndex(['2000-01-01', '2000-01-01', '2000-01-01', '2000-01-01',
'2000-01-01'],
dtype='datetime64[ns]', freq=None)
Resampling
4.20.7 Merge
In [181]: df
Out[181]:
A B C
0 0.057755 -0.258675 0.465064
1 0.947265 -0.015106 0.986157
2 1.040532 -1.977669 1.020800
3 0.671853 -1.328706 0.375361
4 -0.573784 -3.245119 0.621043
5 -0.196414 -0.753543 -0.718978
6 0.057755 -0.258675 0.465064
7 0.947265 -0.015106 0.986157
8 1.040532 -1.977669 1.020800
9 0.671853 -1.328706 0.375361
10 -0.573784 -3.245119 0.621043
11 -0.196414 -0.753543 -0.718978
In [183]: df
Out[183]:
Area Bins Test_0 Data
0 A 110 0 0.804083
1 A 110 1 -1.073984
2 A 160 0 -1.481424
3 A 160 1 0.633964
4 A 160 2 0.439679
5 C 40 0 0.335582
6 C 40 1 -0.986495
4.20.8 Plotting
In [186]: df = pd.DataFrame(
.....: {'stratifying_var': np.random.uniform(0, 100, 20),
.....: 'price': np.random.normal(100, 5, 20)})
.....:
CSV
The best way to combine multiple files into a single DataFrame is to read the individual frames one by one,
put all of the individual frames into a list, and then combine the frames in the list using pd.concat():
You can use the same approach to read all files matching a pattern. Here is an example using glob:
In [193]: import os
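A sketch of the pattern-matching read (the file names are illustrative):
>>> import glob
>>> files = glob.glob('data/file_*.csv')
>>> frames = [pd.read_csv(f) for f in files]
>>> result = pd.concat(frames, ignore_index=True)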
Finally, this strategy will work with the other pd.read_*(...) functions described in the io docs.
In [198]: df.head()
Out[198]:
year month day
0 2000 1 1
1 2000 1 2
2 2000 1 3
3 2000 1 4
4 2000 1 5
SQL
Excel
HTML
Reading HTML tables from a server that cannot handle the default request header
HDFStore
In [210]: store.get_storer('df').attrs.my_attribute
Out[210]: {'A': 10}
Binary files
pandas readily accepts NumPy record arrays, if you need to read in a binary file consisting of an array of C
structs. For example, given this C program in a file called main.c compiled with gcc main.c -std=gnu99
on a 64-bit machine,
#include <stdio.h>
#include <stdint.h>
return 0;
}
the following Python code will read the binary file 'binary.dat' into a pandas DataFrame, where each
element of the struct corresponds to a column in the frame:
# note that the offsets are larger than the size of the type because of
# struct padding
names = 'count', 'avg', 'scale'  # field names for the struct members (assumed from the example)
offsets = 0, 8, 16
formats = 'i4', 'f8', 'f4'
dt = np.dtype({'names': names, 'offsets': offsets, 'formats': formats},
              align=True)
df = pd.DataFrame(np.fromfile('binary.dat', dt))
Note: The offsets of the structure elements may be different depending on the architecture of the machine
on which the file was created. Using a raw binary file format like this for general data storage is not
recommended, as it is not cross-platform. We recommend either HDF5 or msgpack, both of which are
supported by pandas’ IO facilities.
4.20.10 Computation
Correlation
Often it’s useful to obtain the lower (or upper) triangular form of a correlation matrix calculated from
DataFrame.corr(). This can be achieved by passing a boolean mask to where as follows:
In [214]: corr_mat.where(mask)
Out[214]:
0 1 2 3 4
0 NaN NaN NaN NaN NaN
1 0.058568 NaN NaN NaN NaN
2 -0.036451 -0.014375 NaN NaN NaN
3 -0.141872 -0.069392 0.015754 NaN NaN
4 0.173949 0.087820 -0.124553 0.100906 NaN
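The mask itself is not reproduced above; one way it might be built (a sketch on hypothetical data) is to keep
only the strictly lower triangle:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random(size=(100, 5)))
corr_mat = df.corr()
# True below the diagonal, False elsewhere; where() turns the False cells into NaN
mask = np.tril(np.ones(corr_mat.shape, dtype=bool), k=-1)
lower = corr_mat.where(mask)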
The method argument within DataFrame.corr can accept a callable in addition to the named correlation
types. Here we compute the distance correlation matrix for a DataFrame object.
In [217]: df.corr(method=distcorr)
Out[217]:
0 1 2
0 1.000000 0.169628 0.202669
1 0.169628 1.000000 0.208482
2 0.202669 0.208482 1.000000
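The distcorr function used above is not reproduced in this extract; the sketch below only illustrates the
callable mechanism, here with a plain Pearson coefficient computed by NumPy:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.normal(size=(100, 3)))

def pearson(x, y):
    # any callable taking two 1-D arrays and returning a scalar is accepted by method=
    return np.corrcoef(x, y)[0, 1]

df.corr(method=pearson)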
4.20.11 Timedeltas
In [220]: s - s.max()
Out[220]:
0 -2 days
1 -1 days
2 0 days
dtype: timedelta64[ns]
In [221]: s.max() - s
Out[221]:
0 2 days
1 1 days
2 0 days
dtype: timedelta64[ns]
In [222]: s - datetime.datetime(2011, 1, 1, 3, 5)
Out[222]:
0 364 days 20:55:00
1 365 days 20:55:00
2 366 days 20:55:00
dtype: timedelta64[ns]
In [223]: s + datetime.timedelta(minutes=5)
Out[223]:
0 2012-01-01 00:05:00
1 2012-01-02 00:05:00
2 2012-01-03 00:05:00
dtype: datetime64[ns]
In [224]: datetime.datetime(2011, 1, 1, 3, 5) - s
Out[224]:
0 -365 days +03:05:00
1 -366 days +03:05:00
2 -367 days +03:05:00
dtype: timedelta64[ns]
In [225]: datetime.timedelta(minutes=5) + s
Out[225]:
0 2012-01-01 00:05:00
1 2012-01-02 00:05:00
2 2012-01-03 00:05:00
dtype: datetime64[ns]
Adding and subtracting deltas and dates
In [226]: deltas = pd.Series([datetime.timedelta(days=i) for i in range(3)])
In [228]: df
Out[228]:
A B
0 2012-01-01 0 days
1 2012-01-02 1 days
2 2012-01-03 2 days
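The intermediate assignments are not shown above; a reconstruction consistent with the outputs (the start
date of the range is assumed) is:
df = pd.DataFrame({'A': pd.Series(pd.date_range('2012-1-1', periods=3, freq='D')),
                   'B': deltas})
df['New Dates'] = df['A'] + df['B']      # shift each date forward by its delta
df['Delta'] = df['A'] - df['New Dates']  # the difference comes back as a timedelta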
In [231]: df
Out[231]:
A B New Dates Delta
0 2012-01-01 0 days 2012-01-01 0 days
1 2012-01-02 1 days 2012-01-03 -1 days
2 2012-01-03 2 days 2012-01-05 -2 days
In [232]: df.dtypes
Out[232]:
A datetime64[ns]
B timedelta64[ns]
New Dates datetime64[ns]
Delta timedelta64[ns]
dtype: object
Another example
Values can be set to NaT using np.nan, similar to datetime
In [233]: y = s - s.shift()
In [234]: y
Out[234]:
0 NaT
1 1 days
2 1 days
dtype: timedelta64[ns]
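The step between these two outputs is not shown; assigning np.nan to an element of a timedelta Series is
enough to produce a NaT, for example (assuming numpy is imported as np):
y[1] = np.nan   # the timedelta at position 1 becomes NaT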
In [236]: y
Out[236]:
0 NaT
1 NaT
2 1 days
dtype: timedelta64[ns]
To globally provide aliases for axis names, one can define these 2 functions:
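The two function definitions did not survive extraction. A sketch of the idea, which relies on the private
_AXIS_NUMBERS / _AXIS_ALIASES class attributes present in pandas 0.25 (so not a stable public API), might be:
def set_axis_alias(cls, axis, alias):
    # register `alias` as another name for an existing axis of `cls`
    if axis not in cls._AXIS_NUMBERS:
        raise Exception("invalid axis [%s] for alias [%s]" % (axis, alias))
    cls._AXIS_ALIASES[alias] = axis

def clear_axis_alias(cls, axis, alias):
    # remove a previously registered alias
    if axis not in cls._AXIS_NUMBERS:
        raise Exception("invalid axis [%s] for alias [%s]" % (axis, alias))
    cls._AXIS_ALIASES.pop(alias, None)

set_axis_alias(pd.DataFrame, 'columns', 'myaxis2')
df2 = pd.DataFrame(np.random.randn(3, 2), columns=['c1', 'c2'],
                   index=['i1', 'i2', 'i3'])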
In [241]: df2.sum(axis='myaxis2')
Out[241]:
i1 -2.646263
i2 -1.737358
i3 -0.200793
dtype: float64
To create a DataFrame from every combination of some given values, like R’s expand.grid() function, we
can create a dict where the keys are column names and the values are lists of the data values:
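A sketch of such a helper (the sample values are chosen to match the output shown below):
import itertools
import pandas as pd

def expand_grid(data_dict):
    # one row per combination of the value lists (Cartesian product)
    rows = itertools.product(*data_dict.values())
    return pd.DataFrame.from_records(rows, columns=data_dict.keys())

df = expand_grid({'height': [60, 70],
                  'weight': [100, 140, 180],
                  'sex': ['Male', 'Female']})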
In [245]: df
Out[245]:
height weight sex
0 60 100 Male
1 60 100 Female
2 60 140 Male
3 60 140 Female
4 60 180 Male
5 60 180 Female
FIVE
PANDAS ECOSYSTEM
Increasingly, packages are being built on top of pandas to address specific needs in data preparation, analysis
and visualization. This is encouraging because it means pandas is not only helping users to handle their
data tasks but also that it provides a better starting point for developers to build powerful and more focused
data tools. The creation of libraries that complement pandas’ functionality also allows pandas development
to remain focused on its original requirements.
This is an inexhaustive list of projects that build on pandas in order to provide tools in the PyData space.
For a list of projects that depend on pandas, see the libraries.io usage page for pandas or search pypi for
pandas.
We’d like to make it easier for users to find these projects. If you know of other substantial projects that
you feel should be on this list, please let us know.
5.1.1 Statsmodels
Statsmodels is the prominent Python “statistics and econometrics library” and it has a long-standing special
relationship with pandas. Statsmodels provides powerful statistics, econometrics, analysis and modeling
functionality that is out of pandas’ scope. Statsmodels leverages pandas objects as the underlying data
container for computation.
5.1.2 sklearn-pandas
5.1.3 Featuretools
Featuretools is a Python library for automated feature engineering built on top of pandas. It excels at
transforming temporal and relational datasets into feature matrices for machine learning using reusable
feature engineering “primitives”. Users can contribute their own primitives in Python and share them with
the rest of the community.
5.2 Visualization
5.2.1 Altair
Altair is a declarative statistical visualization library for Python. With Altair, you can spend more time
understanding your data and its meaning. Altair’s API is simple, friendly and consistent and built on
top of the powerful Vega-Lite JSON specification. This elegant simplicity produces beautiful and effective
visualizations with a minimal amount of code. Altair works with Pandas DataFrames.
5.2.2 Bokeh
Bokeh is a Python interactive visualization library for large datasets that natively uses the latest web
technologies. Its goal is to provide elegant, concise construction of novel graphics in the style of Protovis/D3,
while delivering high-performance interactivity over large data to thin clients.
5.2.3 seaborn
5.2.4 yhat/ggpy
Hadley Wickham’s ggplot2 is a foundational exploratory visualization package for the R language. Based
on “The Grammar of Graphics” it provides a powerful, declarative and extremely general way to generate
bespoke plots of any kind of data. It’s really quite incredible. Various implementations to other languages
are available, but a faithful implementation for Python users has long been missing. Although still young
(as of Jan-2014), the yhat/ggpy project has been progressing quickly in that direction.
5.2.6 Plotly
Plotly’s Python API enables interactive figures and web shareability. Maps, 2D, 3D, and live-streaming
graphs are rendered with WebGL and D3.js. The library supports plotting directly from a pandas DataFrame
and cloud-based collaboration. Users of matplotlib, ggplot for Python, and Seaborn can convert figures into
interactive web-based plots. Plots can be drawn in IPython Notebooks , edited with R or MATLAB, modified
in a GUI, or embedded in apps and dashboards. Plotly is free for unlimited sharing, and has cloud, offline,
or on-premise accounts for private use.
5.2.7 QtPandas
Spun off from the main pandas library, the qtpandas library enables DataFrame visualization and manipulation
in PyQt4 and PySide applications.
5.3 IDE
5.3.1 IPython
IPython is an interactive command shell and distributed computing environment. IPython tab completion
works with Pandas methods and also attributes like DataFrame columns.
Jupyter Notebook is a web application for creating Jupyter notebooks. A Jupyter notebook is a JSON
document containing an ordered list of input/output cells which can contain code, text, mathematics, plots
and rich media. Jupyter notebooks can be converted to a number of open standard output formats (HTML,
HTML presentation slides, LaTeX, PDF, ReStructuredText, Markdown, Python) through ‘Download As’ in
the web interface and jupyter convert in a shell.
Pandas DataFrames implement _repr_html_ and _repr_latex_ methods which are utilized by Jupyter
Notebook for displaying (abbreviated) HTML or LaTeX tables. LaTeX output is properly escaped. (Note:
HTML tables may or may not be compatible with non-HTML Jupyter output formats.)
See Options and Settings and Available Options for pandas display settings.
5.3.3 quantopian/qgrid
qgrid is “an interactive grid for sorting and filtering DataFrames in IPython Notebook” built with SlickGrid.
5.3.4 Spyder
Spyder is a cross-platform PyQt-based IDE combining the editing, analysis, debugging and profiling
functionality of a software development tool with the data exploration, interactive execution, deep inspection
and rich visualization capabilities of a scientific environment like MATLAB or RStudio.
Its Variable Explorer allows users to view, manipulate and edit pandas Index, Series, and DataFrame objects
like a “spreadsheet”, including copying and modifying values, sorting, displaying a “heatmap”, converting
data types and more. Pandas objects can also be renamed, duplicated, have new columns added, be copied/pasted
to/from the clipboard (as TSV), and be saved/loaded to/from a file. Spyder can also import data from a
variety of plain text and binary files or the clipboard into a new pandas DataFrame via a sophisticated
import wizard.
Most pandas classes, methods and data attributes can be autocompleted in Spyder’s Editor and IPython
Console, and Spyder’s Help pane can retrieve and render Numpydoc documentation on pandas objects in
rich text with Sphinx both automatically and on-demand.
5.4 API
5.4.1 pandas-datareader
5.4.2 quandl/Python
Quandl API for Python wraps the Quandl REST API to return Pandas DataFrames with timeseries indexes.
5.4.3 pydatastream
PyDatastream is a Python interface to the Thomson Dataworks Enterprise (DWE/Datastream) SOAP API
to return indexed Pandas DataFrames with financial data. This package requires valid credentials for this
API (which is not free).
5.4.4 pandaSDMX
pandaSDMX is a library to retrieve and acquire statistical data and metadata disseminated in SDMX
2.1, an ISO-standard widely used by institutions such as statistics offices, central banks, and international
organisations. pandaSDMX can expose datasets and related structural metadata including data flows,
codelists, and data structure definitions as pandas Series or MultiIndexed DataFrames.
5.4.5 fredapi
fredapi is a Python interface to the Federal Reserve Economic Data (FRED) provided by the Federal Reserve
Bank of St. Louis. It works with both the FRED database and ALFRED database that contains point-in-
time data (i.e. historic data revisions). fredapi provides a wrapper in Python to the FRED HTTP API,
and also provides several convenient methods for parsing and analyzing point-in-time data from ALFRED.
fredapi makes use of pandas and returns data in a Series or DataFrame. This module requires a FRED API
key that you can obtain for free on the FRED website.
5.5.1 Geopandas
Geopandas extends pandas data objects to include geographic information which supports geometric operations.
If your work entails maps and geographical coordinates, and you love pandas, you should take a close
look at Geopandas.
5.5.2 xarray
xarray brings the labeled data power of pandas to the physical sciences by providing N-dimensional variants
of the core pandas data structures. It aims to provide a pandas-like and pandas-compatible toolkit for
analytics on multi-dimensional arrays, rather than the tabular data for which pandas excels.
5.6 Out-of-core
5.6.1 Blaze
Blaze provides a standard API for doing computations with various in-memory and on-disk backends:
NumPy, Pandas, SQLAlchemy, MongoDB, PyTables, PySpark.
5.6.2 Dask
Dask is a flexible parallel computing library for analytics. Dask provides a familiar DataFrame interface for
out-of-core, parallel and distributed computing.
5.6.3 Dask-ML
Dask-ML enables parallel and distributed machine learning using Dask alongside existing machine learning
libraries like Scikit-Learn, XGBoost, and TensorFlow.
5.6.4 Koalas
Koalas provides a familiar pandas DataFrame interface on top of Apache Spark. It enables users to leverage
multi-cores on one machine or a cluster of machines to speed up or scale their DataFrame code.
5.6.5 Odo
Odo provides a uniform API for moving data between different formats. It uses pandas’ own read_csv for
CSV IO and leverages many existing packages such as PyTables, h5py, and pymongo to move data between
non-pandas formats. Its graph-based approach is also extensible by end users for custom formats that may
be too specific for the core of odo.
5.6.6 Ray
Pandas on Ray is an early stage DataFrame library that wraps Pandas and transparently distributes the data
and computation. The user does not need to know how many cores their system has, nor do they need to
specify how to distribute the data. In fact, users can continue using their previous Pandas notebooks while
experiencing a considerable speedup from Pandas on Ray, even on a single machine. Only a modification
of the import statement is needed, as we demonstrate below. Once you’ve changed your import statement,
you’re ready to use Pandas on Ray just like you would Pandas.
# import pandas as pd
import ray.dataframe as pd
5.6.7 Vaex
Vaex is a Python library for out-of-core DataFrames (similar to pandas), used to visualize and explore big
tabular datasets. It can calculate statistics such as mean, sum, count, standard deviation etc. on an
N-dimensional grid at up to a billion (10⁹) objects/rows per second. Visualization is done using histograms,
density plots and 3d volume rendering, allowing interactive exploration of big data. Vaex uses memory
mapping, zero memory copy policy and lazy computations for best performance (no memory wasted).
• vaex.from_pandas
• vaex.to_pandas_df
5.7.1 Engarde
Engarde is a lightweight library used to explicitly state your assumptions about your datasets and check
that they’re actually true.
Pandas provides an interface for defining extension types to extend NumPy’s type system. The following
libraries implement that interface to provide types not found in NumPy or pandas, which work well with
pandas’ data containers.
5.8.1 cyberpandas
Cyberpandas provides an extension type for storing arrays of IP Addresses. These arrays can be stored
inside pandas’ Series and DataFrame.
5.9 Accessors
A directory of projects providing extension accessors. This is for users to discover new accessors and for
library authors to coordinate on the namespace.
SIX
API REFERENCE
This page gives an overview of all public pandas objects, functions and methods. All classes and functions
exposed in pandas.* namespace are public.
Some subpackages are public which include pandas.errors, pandas.plotting, and pandas.testing. Public
functions in pandas.io and pandas.tseries submodules are mentioned in the documentation. The
pandas.api.types subpackage holds some public functions related to data types in pandas.
Warning: The pandas.core, pandas.compat, and pandas.util top-level modules are PRIVATE.
Stable functionality in such modules is not guaranteed.
6.1 Input/output
6.1.1 Pickling
read_pickle(path[, compression]) Load pickled pandas object (or any object) from
file.
pandas.read_pickle
pandas.read_pickle(path, compression=’infer’)
Load pickled pandas object (or any object) from file.
Warning: Loading pickled data received from untrusted sources can be unsafe. See here.
Parameters
path [str] File path where the pickled object will be loaded.
compression [{‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}, default ‘infer’] For on-the-fly
decompression of on-disk data. If ‘infer’, then use gzip, bz2, xz or zip if path ends in
‘.gz’, ‘.bz2’, ‘.xz’, or ‘.zip’ respectively, and no decompression otherwise. Set to None
for no decompression.
New in version 0.20.0.
Returns
See also:
Notes
Examples
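Only the clean-up lines of the original example survive below; a round trip along these lines (a sketch)
would precede them:
>>> original_df = pd.DataFrame({"foo": range(5), "bar": range(5, 10)})
>>> original_df.to_pickle("./dummy.pkl")
>>> unpickled_df = pd.read_pickle("./dummy.pkl")
>>> unpickled_df
   foo  bar
0    0    5
1    1    6
2    2    7
3    3    8
4    4    9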
>>> import os
>>> os.remove("./dummy.pkl")
pandas.read_table
keep_date_col [bool, default False] If True and parse_dates specifies combining mul-
tiple columns then keep the original columns.
date_parser [function, optional] Function to use for converting a sequence of string
columns to an array of datetime instances. The default uses dateutil.parser.parser
to do the conversion. Pandas will try to call date_parser in three different
ways, advancing to the next if an exception occurs: 1) Pass one or more arrays (as
defined by parse_dates) as arguments; 2) concatenate (row-wise) the string values
from the columns defined by parse_dates into a single array and pass that; and 3)
call date_parser once for each row using one or more strings (corresponding to the
columns defined by parse_dates) as arguments.
dayfirst [bool, default False] DD/MM format dates, international and European for-
mat.
cache_dates [boolean, default True] If True, use a cache of unique, converted dates
to apply the datetime conversion. May produce significant speed-up when parsing
duplicate date strings, especially ones with timezone offsets.
New in version 0.25.0.
iterator [bool, default False] Return TextFileReader object for iteration or getting
chunks with get_chunk().
chunksize [int, optional] Return TextFileReader object for iteration. See the IO Tools
docs for more information on iterator and chunksize.
compression [{‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}, default ‘infer’] For on-the-fly
decompression of on-disk data. If ‘infer’ and filepath_or_buffer is path-like, then
detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, or ‘.xz’ (otherwise
no decompression). If using ‘zip’, the ZIP file must contain only one data file to be
read in. Set to None for no decompression.
New in version 0.18.1: support for ‘zip’ and ‘xz’ compression.
thousands [str, optional] Thousands separator.
decimal [str, default ‘.’] Character to recognize as decimal point (e.g. use ‘,’ for Euro-
pean data).
lineterminator [str (length 1), optional] Character to break file into lines. Only valid
with C parser.
quotechar [str (length 1), optional] The character used to denote the start and end of
a quoted item. Quoted items can include the delimiter and it will be ignored.
quoting [int or csv.QUOTE_* instance, default 0] Control field quoting behavior per
csv.QUOTE_* constants. Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1),
QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
doublequote [bool, default True] When quotechar is specified and quoting is not
QUOTE_NONE, indicate whether or not to interpret two consecutive quotechar ele-
ments INSIDE a field as a single quotechar element.
escapechar [str (length 1), optional] One-character string used to escape other charac-
ters.
comment [str, optional] Indicates remainder of line should not be parsed. If found at
the beginning of a line, the line will be ignored altogether. This parameter must
be a single character. Like empty lines (as long as skip_blank_lines=True), fully
commented lines are ignored by the parameter header but not by skiprows. For
Examples
pandas.read_csv
keep_date_col [bool, default False] If True and parse_dates specifies combining mul-
tiple columns then keep the original columns.
date_parser [function, optional] Function to use for converting a sequence of string
columns to an array of datetime instances. The default uses dateutil.parser.parser
to do the conversion. Pandas will try to call date_parser in three different
ways, advancing to the next if an exception occurs: 1) Pass one or more arrays (as
defined by parse_dates) as arguments; 2) concatenate (row-wise) the string values
from the columns defined by parse_dates into a single array and pass that; and 3)
call date_parser once for each row using one or more strings (corresponding to the
columns defined by parse_dates) as arguments.
dayfirst [bool, default False] DD/MM format dates, international and European for-
mat.
cache_dates [boolean, default True] If True, use a cache of unique, converted dates
to apply the datetime conversion. May produce significant speed-up when parsing
duplicate date strings, especially ones with timezone offsets.
New in version 0.25.0.
iterator [bool, default False] Return TextFileReader object for iteration or getting
chunks with get_chunk().
chunksize [int, optional] Return TextFileReader object for iteration. See the IO Tools
docs for more information on iterator and chunksize.
compression [{‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}, default ‘infer’] For on-the-fly
decompression of on-disk data. If ‘infer’ and filepath_or_buffer is path-like, then
detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, or ‘.xz’ (otherwise
no decompression). If using ‘zip’, the ZIP file must contain only one data file to be
read in. Set to None for no decompression.
New in version 0.18.1: support for ‘zip’ and ‘xz’ compression.
thousands [str, optional] Thousands separator.
decimal [str, default ‘.’] Character to recognize as decimal point (e.g. use ‘,’ for Euro-
pean data).
lineterminator [str (length 1), optional] Character to break file into lines. Only valid
with C parser.
quotechar [str (length 1), optional] The character used to denote the start and end of
a quoted item. Quoted items can include the delimiter and it will be ignored.
quoting [int or csv.QUOTE_* instance, default 0] Control field quoting behavior per
csv.QUOTE_* constants. Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1),
QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
doublequote [bool, default True] When quotechar is specified and quoting is not
QUOTE_NONE, indicate whether or not to interpret two consecutive quotechar ele-
ments INSIDE a field as a single quotechar element.
escapechar [str (length 1), optional] One-character string used to escape other charac-
ters.
comment [str, optional] Indicates remainder of line should not be parsed. If found at
the beginning of a line, the line will be ignored altogether. This parameter must
be a single character. Like empty lines (as long as skip_blank_lines=True), fully
commented lines are ignored by the parameter header but not by skiprows. For
Examples
pandas.read_fwf
Examples
pandas.read_msgpack
read_msgpack is deprecated and will be removed in a future version. It is recommended to use pyarrow
for on-the-wire transmission of pandas objects.
Parameters
path_or_buf [str, path object or file-like object] Any valid string path is acceptable.
The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For
file URLs, a host is expected.
If you want to pass in a path object, pandas accepts any os.PathLike.
By file-like object, we refer to objects with a read() method, such as a file handler
(e.g. via builtin open function) or StringIO.
encoding [Encoding for decoding msgpack str type]
iterator [boolean, if True, return an iterator to the unpacker] (default is False)
Returns
obj [same type as object stored in file]
Notes
6.1.3 Clipboard
pandas.read_clipboard
pandas.read_clipboard(sep=’\s+’, **kwargs)
Read text from clipboard and pass to read_csv. See read_csv for the full argument list
Parameters
sep [str, default ‘\s+’] A string or regex delimiter. The default of ‘\s+’ denotes one or
more whitespace characters.
Returns
parsed [DataFrame]
6.1.4 Excel
read_excel(io[, sheet_name, header, names, …]) Read an Excel file into a pandas DataFrame.
ExcelFile.parse(self[, sheet_name, header, …]) Parse specified sheet(s) into a DataFrame
pandas.read_excel
• If str, then indicates comma separated list of Excel column letters and column
ranges (e.g. “A:E” or “A,C,E:F”). Ranges are inclusive of both sides.
• If list of int, then indicates list of column numbers to be parsed.
• If list of string, then indicates list of column names to be parsed.
New in version 0.24.0.
• If callable, then evaluate each column name against it and parse the column if the
callable returns True.
New in version 0.24.0.
squeeze [bool, default False] If the parsed data only contains one column then return
a Series.
dtype [Type name or dict of column -> type, default None] Data type for data or
columns. E.g. {‘a’: np.float64, ‘b’: np.int32} Use object to preserve data as stored
in Excel and not interpret dtype. If converters are specified, they will be applied
INSTEAD of dtype conversion.
New in version 0.20.0.
engine [str, default None] If io is not a buffer or path, this must be set to identify io.
Acceptable values are None or xlrd.
converters [dict, default None] Dict of functions for converting values in certain
columns. Keys can either be integers or column labels, values are functions that
take one input argument, the Excel cell content, and return the transformed con-
tent.
true_values [list, default None] Values to consider as True.
New in version 0.19.0.
false_values [list, default None] Values to consider as False.
New in version 0.19.0.
skiprows [list-like] Rows to skip at the beginning (0-indexed).
nrows [int, default None] Number of rows to parse.
New in version 0.23.0.
na_values [scalar, str, list-like, or dict, default None] Additional strings to recognize
as NA/NaN. If dict passed, specific per-column NA values. By default the follow-
ing values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’,
‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’,
‘n/a’, ‘nan’, ‘null’.
keep_default_na [bool, default True] If na_values are specified and keep_default_na
is False the default NaN values are overridden, otherwise they’re appended to.
verbose [bool, default False] Indicate number of NA values placed in non-numeric
columns.
parse_dates [bool, list-like, or dict, default False] The behavior is as follows:
• bool. If True -> try parsing the index.
• list of int or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a
separate date column.
• list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date
column.
• dict, e.g. {‘foo’ : [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’
If a column or index contains an unparseable date, the entire column or index will
be returned unaltered as an object data type. For non-standard datetime parsing,
use pd.to_datetime after pd.read_excel.
Note: A fast-path exists for iso8601-formatted dates.
date_parser [function, optional] Function to use for converting a sequence of string
columns to an array of datetime instances. The default uses dateutil.parser.parser
to do the conversion. Pandas will try to call date_parser in three different
ways, advancing to the next if an exception occurs: 1) Pass one or more arrays (as
defined by parse_dates) as arguments; 2) concatenate (row-wise) the string values
from the columns defined by parse_dates into a single array and pass that; and 3)
call date_parser once for each row using one or more strings (corresponding to the
columns defined by parse_dates) as arguments.
thousands [str, default None] Thousands separator for parsing string columns to nu-
meric. Note that this parameter is only necessary for columns stored as TEXT
in Excel, any numeric columns will automatically be parsed, regardless of display
format.
comment [str, default None] Comments out remainder of line. Pass a character or
characters to this argument to indicate comments in the input file. Any data between
the comment string and the end of the current line is ignored.
skip_footer [int, default 0] Alias of skipfooter.
Deprecated since version 0.23.0: Use skipfooter instead.
skipfooter [int, default 0] Rows at the end to skip (0-indexed).
convert_float [bool, default True] Convert integral floats to int (i.e., 1.0 –> 1). If
False, all numeric data will be read in as floats: Excel stores all numbers as floats
internally.
mangle_dupe_cols [bool, default True] Duplicate columns will be specified as ‘X’,
‘X.1’, …’X.N’, rather than ‘X’…’X’. Passing in False will cause data to be overwritten
if there are duplicate names in the columns.
**kwds [optional] Optional keyword arguments can be passed to TextFileReader.
Returns
DataFrame or dict of DataFrames DataFrame from the passed in Excel file. See
notes in sheet_name argument for more information on when a dict of DataFrames
is returned.
See also:
Examples
The file can be read using the file name as string or an open file object:
Index and header can be specified via the index_col and header arguments
True, False, and NA values, and thousands separators have defaults, but can be explicitly specified,
too. Supply the values you would like as strings or lists of strings!
Comment lines in the excel input file can be skipped using the comment kwarg
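The corresponding calls are not reproduced in this extract; sketches of what they look like (the file name
'tmp.xlsx' and the NA strings are placeholders):
pd.read_excel('tmp.xlsx', index_col=0)                     # file name as a string
pd.read_excel(open('tmp.xlsx', 'rb'), sheet_name='Sheet1') # or an open file object
pd.read_excel('tmp.xlsx', index_col=None, header=None)     # no index, no header row
pd.read_excel('tmp.xlsx', index_col=0,
              na_values=['string1', 'string2'])            # extra strings read as NaN
pd.read_excel('tmp.xlsx', index_col=0, comment='#')        # skip lines starting with '#'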
pandas.ExcelFile.parse
ExcelWriter(path[, engine, date_format, …]) Class for writing DataFrame objects into excel
sheets, default is to use xlwt for xls, openpyxl for
xlsx.
pandas.ExcelWriter
Notes
Examples
Default usage:
Attributes
None
Methods
None
6.1.5 JSON
read_json([path_or_buf, orient, typ, dtype, …]) Convert a JSON string to pandas object.
pandas.read_json
can be produced by to_json() with a corresponding orient value. The set of possible
orients is:
• 'split' : dict like {index -> [index], columns -> [columns], data ->
[values]}
• 'records' : list like [{column -> value}, ... , {column -> value}]
• 'index' : dict like {index -> {column -> value}}
• 'columns' : dict like {column -> {index -> value}}
• 'values' : just the values array
The allowed and default values depend on the value of the typ parameter.
• when typ == 'series',
– allowed orients are {'split','records','index'}
– default is 'index'
– The Series index must be unique for orient 'index'.
• when typ == 'frame',
– allowed orients are {'split','records','index', 'columns','values',
'table'}
– default is 'columns'
– The DataFrame index must be unique for orients 'index' and 'columns'.
– The DataFrame columns must be unique for orients 'index', 'columns', and
'records'.
New in version 0.23.0: ‘table’ as an allowed value for the orient argument
typ [{‘frame’, ‘series’}, default ‘frame’] The type of object to recover.
dtype [bool or dict, default None] If True, infer dtypes; if a dict of column to dtype,
then use those; if False, then don’t infer dtypes at all, applies only to the data.
For all orient values except 'table', default is True.
Changed in version 0.25.0: Not applicable for orient='table'.
convert_axes [bool, default None] Try to convert the axes to the proper dtypes.
For all orient values except 'table', default is True.
Changed in version 0.25.0: Not applicable for orient='table'.
convert_dates [bool or list of str, default True] List of columns to parse for dates. If
True, then try to parse datelike columns. A column label is datelike if
• it ends with '_at',
• it ends with '_time',
• it begins with 'timestamp',
• it is 'modified', or
• it is 'date'.
keep_default_dates [bool, default True] If parsing dates, then parse the default date-
like columns.
numpy [bool, default False] Direct decoding to numpy arrays. Supports numeric data
only, but non-numeric column and index labels are supported. Note also that the
JSON ordering MUST be the same for each term if numpy=True.
precise_float [bool, default False] Set to enable usage of higher precision (strtod)
function when decoding string to double values. Default (False) is to use fast but
less precise builtin functionality.
date_unit [str, default None] The timestamp unit to detect if converting dates. The
default behaviour is to try and detect the correct precision, but if this is not desired
then pass one of ‘s’, ‘ms’, ‘us’ or ‘ns’ to force parsing only seconds, milliseconds,
microseconds or nanoseconds respectively.
encoding [str, default is ‘utf-8’] The encoding to use to decode py3 bytes.
New in version 0.19.0.
lines [bool, default False] Read the file as a json object per line.
New in version 0.19.0.
chunksize [int, optional] Return JsonReader object for iteration. See the line-delimited
json docs for more information on chunksize. This can only be passed if lines=True.
If this is None, the file will be read into memory all at once.
New in version 0.21.0.
compression [{‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}, default ‘infer’] For on-the-fly
decompression of on-disk data. If ‘infer’, then use gzip, bz2, zip or xz if path_or_buf
is a string ending in ‘.gz’, ‘.bz2’, ‘.zip’, or ‘xz’, respectively, and no decompression
otherwise. If using ‘zip’, the ZIP file must contain only one data file to be read in.
Set to None for no decompression.
New in version 0.21.0.
Returns
Series or DataFrame The type returned depends on the value of typ.
See also:
Notes
Specific to orient='table', if a DataFrame with a literal Index name of index gets written with
to_json(), the subsequent read operation will incorrectly set the Index name to None. This is because
index is also used by DataFrame.to_json() to denote a missing Index name, and the subsequent
read_json() operation cannot distinguish between the two. The same limitation is encountered with
a MultiIndex and any names beginning with 'level_'.
Examples
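The DataFrame used throughout these examples is not shown in this extract; the outputs below are
consistent with a frame along these lines:
>>> df = pd.DataFrame([['a', 'b'], ['c', 'd']],
...                   index=['row 1', 'row 2'],
...                   columns=['col 1', 'col 2'])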
>>> df.to_json(orient='split')
'{"columns":["col 1","col 2"],
"index":["row 1","row 2"],
"data":[["a","b"],["c","d"]]}'
>>> pd.read_json(_, orient='split')
col 1 col 2
row 1 a b
row 2 c d
>>> df.to_json(orient='index')
'{"row 1":{"col 1":"a","col 2":"b"},"row 2":{"col 1":"c","col 2":"d"}}'
>>> pd.read_json(_, orient='index')
col 1 col 2
row 1 a b
row 2 c d
Encoding/decoding a Dataframe using 'records' formatted JSON. Note that index labels are not
preserved with this encoding.
>>> df.to_json(orient='records')
'[{"col 1":"a","col 2":"b"},{"col 1":"c","col 2":"d"}]'
>>> pd.read_json(_, orient='records')
col 1 col 2
0 a b
1 c d
>>> df.to_json(orient='table')
'{"schema": {"fields": [{"name": "index", "type": "string"},
{"name": "col 1", "type": "string"},
{"name": "col 2", "type": "string"}],
"primaryKey": "index",
"pandas_version": "0.20.0"},
"data": [{"index": "row 1", "col 1": "a", "col 2": "b"},
{"index": "row 2", "col 1": "c", "col 2": "d"}]}'
pandas.io.json.json_normalize
Parameters
data [dict or list of dicts] Unserialized JSON objects.
record_path [str or list of str, default None] Path in each object to list of records. If
not passed, data will be assumed to be an array of records.
meta [list of paths (str or list of str), default None] Fields to use as metadata for each
record in resulting table.
meta_prefix [str, default None] If True, prefix records with dotted (?) path, e.g.
foo.bar.field if meta is [‘foo’, ‘bar’].
record_prefix [str, default None] If True, prefix records with dotted (?) path, e.g.
foo.bar.field if path to records is [‘foo’, ‘bar’].
errors [{‘raise’, ‘ignore’}, default ‘raise’] Configures error handling.
• ‘ignore’ : will ignore KeyError if keys listed in meta are not always present.
• ‘raise’ : will raise KeyError if keys listed in meta are not always present.
New in version 0.20.0.
sep [str, default ‘.’] Nested records will generate names separated by sep. e.g., for sep=’.’,
{‘foo’: {‘bar’: 0}} -> foo.bar.
New in version 0.20.0.
max_level [int, default None] Max number of levels (depth of dict) to normalize. If
None, normalizes all levels.
New in version 0.25.0.
Returns
frame [DataFrame]
Normalize semi-structured JSON data into a flat table.
Examples
Returns normalized data with columns prefixed with the given string.
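The example code is not reproduced in this extract; a sketch of typical usage (the sample records are
hypothetical):
>>> from pandas.io.json import json_normalize
>>> data = [{'state': 'Florida',
...          'counties': [{'name': 'Dade', 'population': 12345},
...                       {'name': 'Broward', 'population': 40000}]}]
>>> json_normalize(data, record_path='counties', meta=['state'],
...                record_prefix='county_')  # columns: county_name, county_population, state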
pandas.io.json.build_table_schema
Notes
See _as_json_table_type for conversion types. Timedeltas are converted to ISO8601 duration format
with 9 decimal places after the seconds field for nanosecond precision.
Categoricals are converted to the any dtype, and use the enum field constraint to list the allowed
values. The ordered attribute is included in an ordered field.
Examples
>>> df = pd.DataFrame(
... {'A': [1, 2, 3],
... 'B': ['a', 'b', 'c'],
... 'C': pd.date_range('2016-01-01', freq='d', periods=3),
... }, index=pd.Index(range(3), name='idx'))
>>> build_table_schema(df)
{'fields': [{'name': 'idx', 'type': 'integer'},
{'name': 'A', 'type': 'integer'},
{'name': 'B', 'type': 'string'},
{'name': 'C', 'type': 'datetime'}],
'pandas_version': '0.20.0',
'primaryKey': ['idx']}
6.1.6 HTML
read_html(io[, match, flavor, header, …]) Read HTML tables into a list of DataFrame ob-
jects.
pandas.read_html
is a valid attribute dictionary because the ‘id’ HTML tag attribute is a valid HTML
attribute for any HTML tag as per this document.
is not a valid attribute dictionary because ‘asdf’ is not a valid HTML attribute even
if it is a valid XML attribute. Valid HTML 4.01 table attributes can be found here.
A working draft of the HTML 5 spec can be found here. It contains the latest
information on table attributes for the modern web.
parse_dates [bool, optional] See read_csv() for more details.
thousands [str, optional] Separator to use to parse thousands. Defaults to ','.
encoding [str or None, optional] The encoding used to decode the web page. Defaults
to None. None preserves the previous encoding behavior, which depends on the
underlying parser library (e.g., the parser library will try to use the encoding provided
by the document).
decimal [str, default ‘.’] Character to recognize as decimal point (e.g. use ‘,’ for Euro-
pean data).
New in version 0.19.0.
converters [dict, default None] Dict of functions for converting values in certain
columns. Keys can either be integers or column labels, values are functions that
take one input argument, the cell (not column) content, and return the transformed
content.
New in version 0.19.0.
na_values [iterable, default None] Custom NA values
New in version 0.19.0.
keep_default_na [bool, default True] If na_values are specified and keep_default_na
is False the default NaN values are overridden, otherwise they’re appended to
New in version 0.19.0.
displayed_only [bool, default True] Whether elements with “display: none” should
be parsed
New in version 0.23.0.
Returns
dfs [list of DataFrames]
See also:
read_csv
Notes
Before using this function you should read the gotchas about the HTML parsing libraries.
Expect to do some cleanup after you call this function. For example, you might need to manually assign
column names if the column names are converted to NaN when you pass the header=0 argument. We
try to assume as little as possible about the structure of the table and push the idiosyncrasies of the
HTML contained in the table to the user.
This function searches for <table> elements and only for <tr> and <th> rows and <td> elements
within each <tr> or <th> element in the table. <td> stands for “table data”. This function attempts
to properly handle colspan and rowspan attributes. If the function has a <thead> argument, it is
used to construct the header, otherwise the function attempts to find the header within the body (by
putting rows with only <th> elements into the header).
New in version 0.21.0.
Similar to read_csv() the header argument is applied after skiprows is applied.
This function will always return a list of DataFrame or it will fail, e.g., it will not return an empty list.
Examples
See the read_html documentation in the IO section of the docs for some examples of reading in HTML
tables.
read_hdf(path_or_buf[, key, mode]) Read from the store, close it if we opened it.
HDFStore.put(self, key, value[, format, append]) Store object in HDFStore
HDFStore.append(self, key, value[, format, …]) Append to Table in file.
HDFStore.get(self, key) Retrieve pandas object stored in file
HDFStore.select(self, key[, where, start, …]) Retrieve pandas object stored in file, optionally
based on where criteria
HDFStore.info(self) Print detailed information on the store.
HDFStore.keys(self) Return a (potentially unordered) list of the keys
corresponding to the objects stored in the HDFStore.
HDFStore.groups(self) return a list of all the top-level nodes (that are not
themselves a pandas storage object)
HDFStore.walk(self[, where]) Walk the pytables group hierarchy for pandas ob-
jects
pandas.read_hdf
errors [str, default ‘strict’] Specifies how encoding and decoding errors are to be han-
dled. See the errors argument for open() for a full list of options.
**kwargs Additional keyword arguments passed to HDFStore.
Returns
item [object] The selected object. Return type depends on the object stored.
See also:
Examples
pandas.HDFStore.put
pandas.HDFStore.append
Notes
Does not check if data being appended overlaps with existing data in the table, so be careful
pandas.HDFStore.get
HDFStore.get(self, key)
Retrieve pandas object stored in file
Parameters
key [object]
Returns
obj [same type as object stored in file]
pandas.HDFStore.select
pandas.HDFStore.info
HDFStore.info(self )
Print detailed information on the store.
New in version 0.21.0.
Returns
str
pandas.HDFStore.keys
HDFStore.keys(self )
Return a (potentially unordered) list of the keys corresponding to the objects stored in the HDFStore.
These are ABSOLUTE path-names (e.g. have the leading ‘/’).
Returns
list
pandas.HDFStore.groups
HDFStore.groups(self )
return a list of all the top-level nodes (that are not themselves a pandas storage object)
Returns
list
pandas.HDFStore.walk
HDFStore.walk(self, where=’/’)
Walk the pytables group hierarchy for pandas objects
This generator will yield the group path, subgroups and pandas object names for each group. Any
non-pandas PyTables objects that are not a group will be ignored.
The where group itself is listed first (preorder), then each of its child groups (following an alphanumer-
ical order) is also traversed, following the same procedure.
New in version 0.24.0.
Parameters
where [str, optional] Group where to start walking. If not supplied, the root group is
used.
Yields
6.1.8 Feather
read_feather(path[, columns, use_threads]) Load a feather-format object from the file path.
pandas.read_feather
6.1.9 Parquet
read_parquet(path[, engine, columns]) Load a parquet object from the file path, returning
a DataFrame.
pandas.read_parquet
6.1.10 SAS
pandas.read_sas
6.1.11 SQL
read_sql_table(table_name, con[, schema, …]) Read SQL database table into a DataFrame.
read_sql_query(sql, con[, index_col, …]) Read SQL query into a DataFrame.
read_sql(sql, con[, index_col, …]) Read SQL query or database table into a
DataFrame.
pandas.read_gbq
configuration [dict, optional] Query config parameters for job processing. For example:
configuration = {‘query’: {‘useQueryCache’: False}}
For more information see BigQuery REST API Reference.
credentials [google.auth.credentials.Credentials, optional] Credentials for accessing
Google APIs. Use this parameter to override default credentials, such as to use
Compute Engine google.auth.compute_engine.Credentials or Service Account
google.oauth2.service_account.Credentials directly.
New in version 0.8.0 of pandas-gbq.
New in version 0.24.0.
use_bqstorage_api [bool, default False] Use the BigQuery Storage API to download
query results quickly, but at an increased cost. To use this API, first enable it in the
Cloud Console. You must also have the bigquery.readsessions.create permission on
the project you are billing queries to.
This feature requires version 0.10.0 or later of the pandas-gbq package. It also
requires the google-cloud-bigquery-storage and fastavro packages.
New in version 0.25.0.
private_key [str, deprecated] Deprecated in pandas-gbq version 0.8.0. Use the
credentials parameter and google.oauth2.service_account.Credentials.
from_service_account_info() or google.oauth2.service_account.
Credentials.from_service_account_file() instead.
Service account private key in JSON format. Can be file path or string contents.
This is useful for remote server authentication (eg. Jupyter/IPython notebook on
remote host).
verbose [None, deprecated] Deprecated in pandas-gbq version 0.4.0. Use the logging
module to adjust verbosity instead.
Returns
df: DataFrame DataFrame representing results of query.
See also:
6.1.13 STATA
pandas.read_stata
Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, and
file. For file URLs, a host is expected. A local file could be: file://localhost/
path/to/table.dta.
If you want to pass in a path object, pandas accepts any os.PathLike.
By file-like object, we refer to objects with a read() method, such as a file handler
(e.g. via builtin open function) or StringIO.
convert_dates [boolean, defaults to True] Convert date variables to DataFrame time
values.
convert_categoricals [boolean, defaults to True] Read value labels and convert
columns to Categorical/Factor variables.
encoding [string, None or encoding] Encoding used to parse the files. None defaults to
latin-1.
index_col [string, optional, default: None] Column to set as index.
convert_missing [boolean, defaults to False] Flag indicating whether to convert miss-
ing values to their Stata representations. If False, missing values are replaced with
nan. If True, columns containing missing values are returned with object data types
and missing values are represented by StataMissingValue objects.
preserve_dtypes [boolean, defaults to True] Preserve Stata datatypes. If False, nu-
meric data are upcast to pandas default types for foreign data (float64 or int64).
columns [list or None] Columns to retain. Columns will be returned in the given order.
None returns all columns.
order_categoricals [boolean, defaults to True] Flag indicating whether converted cat-
egorical data are ordered.
chunksize [int, default None] Return StataReader object for iterations, returns chunks
with given number of lines.
iterator [boolean, default False] Return StataReader object.
Returns
DataFrame or StataReader
See also:
Examples
>>> df = pd.read_stata('filename.dta')
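Reading in chunks works the same way (do_something is a placeholder for your own processing):
>>> itr = pd.read_stata('filename.dta', chunksize=10000)
>>> for chunk in itr:
...     do_something(chunk)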
pandas.io.stata.StataReader.data
StataReader.data(self, **kwargs)
Read observations from Stata file, converting them into a dataframe
Deprecated: This is a legacy method. Use read in new code.
Parameters
convert_dates [boolean, defaults to True] Convert date variables to DataFrame time
values.
convert_categoricals [boolean, defaults to True] Read value labels and convert
columns to Categorical/Factor variables.
index_col [string, optional, default: None] Column to set as index.
convert_missing [boolean, defaults to False] Flag indicating whether to convert miss-
ing values to their Stata representations. If False, missing values are replaced with
nan. If True, columns containing missing values are returned with object data types
and missing values are represented by StataMissingValue objects.
preserve_dtypes [boolean, defaults to True] Preserve Stata datatypes. If False, nu-
meric data are upcast to pandas default types for foreign data (float64 or int64).
columns [list or None] Columns to retain. Columns will be returned in the given order.
None returns all columns.
order_categoricals [boolean, defaults to True] Flag indicating whether converted cat-
egorical data are ordered.
Returns
DataFrame
pandas.io.stata.StataReader.data_label
StataReader.data_label
Return data label of Stata file.
pandas.io.stata.StataReader.value_labels
StataReader.value_labels(self )
Return a dict associating each variable name with a dict that maps each value to its corresponding label.
Returns
dict
pandas.io.stata.StataReader.variable_labels
StataReader.variable_labels(self )
Return variable labels as a dict, associating each variable name with corresponding label.
Returns
dict
pandas.io.stata.StataWriter.write_file
StataWriter.write_file(self )
melt(frame[, id_vars, value_vars, var_name, …]) Unpivot a DataFrame from wide format to long for-
mat, optionally leaving identifier variables set.
pivot(data[, index, columns, values]) Return reshaped DataFrame organized by given in-
dex / column values.
pivot_table(data[, values, index, columns, …]) Create a spreadsheet-style pivot table as a
DataFrame.
crosstab(index, columns[, values, rownames, …]) Compute a simple cross tabulation of two (or more)
factors.
cut(x, bins[, right, labels, retbins, …]) Bin values into discrete intervals.
qcut(x, q[, labels, retbins, precision, …]) Quantile-based discretization function.
merge(left, right[, how, on, left_on, …]) Merge DataFrame or named Series objects with a
database-style join.
merge_ordered(left, right[, on, left_on, …]) Perform merge with optional filling/interpolation
designed for ordered data like time series data.
merge_asof(left, right[, on, left_on, …]) Perform an asof merge.
concat(objs[, axis, join, join_axes, …]) Concatenate pandas objects along a particular axis
with optional set logic along the other axes.
get_dummies(data[, prefix, prefix_sep, …]) Convert categorical variable into dummy/indicator
variables.
factorize(values[, sort, order, …]) Encode the object as an enumerated type or categorical variable.
unique(values) Hash table-based unique.
wide_to_long(df, stubnames, i, j[, sep, suffix]) Wide panel to long format.
pandas.melt
Parameters
frame [DataFrame]
id_vars [tuple, list, or ndarray, optional] Column(s) to use as identifier variables.
value_vars [tuple, list, or ndarray, optional] Column(s) to unpivot. If not specified,
uses all columns that are not set as id_vars.
var_name [scalar] Name to use for the ‘variable’ column. If None it uses frame.columns.name or ‘variable’.
value_name [scalar, default ‘value’] Name to use for the ‘value’ column.
col_level [int or string, optional] If columns are a MultiIndex then use this level to
melt.
Returns
DataFrame Unpivoted DataFrame.
See also:
DataFrame.melt
pivot_table
DataFrame.pivot
Series.explode
Examples
pandas.pivot
Raises
ValueError: When there are any index, columns combinations with multiple values.
DataFrame.pivot_table when you need to aggregate.
See also:
DataFrame.pivot_table Generalization of pivot that can handle duplicate values for one in-
dex/column pair.
DataFrame.unstack Pivot based on the index values instead of a column.
Notes
For finer-tuned control, see hierarchical indexing documentation along with the related stack/unstack
methods.
Examples
Notice that the first two rows are the same for our index and columns arguments.
pandas.pivot_table
observed [boolean, default False] This only applies if any of the groupers are Categoricals.
If True: only show observed values for categorical groupers. If False: show all
values for categorical groupers.
Changed in version 0.25.0.
Returns
DataFrame
See also:
Examples
The next example aggregates by taking the mean across multiple columns.
We can also calculate multiple types of aggregations for any given value column.
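A sketch of those two cases on a small hypothetical frame:
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "bar", "bar"],
                   "B": ["one", "two", "one", "two"],
                   "D": [1, 2, 3, 4],
                   "E": [5, 6, 7, 8]})

# mean of several value columns at once
pd.pivot_table(df, values=['D', 'E'], index=['A'], aggfunc=np.mean)

# a different set of aggregations per value column
pd.pivot_table(df, values=['D', 'E'], index=['A'],
               aggfunc={'D': np.mean, 'E': [min, max, np.mean]})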
pandas.crosstab
Notes
Any Series passed will have their name attributes used unless row or column names for the cross-
tabulation are specified.
Any input passed containing Categorical data will have all of its categories included in the cross-
tabulation, even if the actual data does not contain any instances of a particular category.
In the event that there aren’t overlapping indexes an empty DataFrame will be returned.
Examples
Here ‘c’ and ‘f’ are not represented in the data and will not be shown in the output because dropna is
True by default. Set dropna=False to preserve categories with no data.
pandas.cut
duplicates [{default ‘raise’, ‘drop’}, optional] If bin edges are not unique, raise ValueError
or drop non-uniques.
New in version 0.23.0.
Returns
out [Categorical, Series, or ndarray] An array-like object representing the respective bin
for each value of x. The type depends on the value of labels.
• True (default) : returns a Series for Series x or a Categorical for all other inputs.
The values stored within are Interval dtype.
• sequence of scalars : returns a Series for Series x or a Categorical for all other
inputs. The values stored within are whatever the type in the sequence is.
• False : returns an ndarray of integers.
bins [numpy.ndarray or IntervalIndex.] The computed or specified bins. Only returned
when retbins=True. For scalar or sequence bins, this is an ndarray with the computed
bins. If set duplicates=drop, bins will drop non-unique bin. For an IntervalIndex bins,
this is equal to bins.
See also:
qcut Discretize variable into equal-sized buckets based on rank or based on sample quantiles.
Categorical Array type for storing data that come from a fixed set of values.
Series One-dimensional array with axis labels (including time series).
IntervalIndex Immutable Index implementing an ordered, sliceable set.
Notes
Any NA values will be NA in the result. Out of bounds values will be NA in the resulting Series or
Categorical object.
Examples
Discovers the same bins, but assigns them specific labels. Notice that the returned Categorical’s
categories are labels and that it is ordered.
Passing a Series as an input returns a Series with mapping value. It is used to map numerically to
intervals based on bins.
Passing an IntervalIndex for bins results in those categories exactly. Notice that values not covered by
the IntervalIndex are set to NaN. 0 is to the left of the first bin (which is closed on the right), and 1.5
falls between two bins.
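Sketches of the cases described above:
import numpy as np
import pandas as pd

# three equal-width bins with explicit, ordered labels
pd.cut(np.array([1, 7, 5, 4, 6, 3]), bins=3, labels=["bad", "medium", "good"])

# a Series input returns a Series whose values map into the bins
pd.cut(pd.Series(np.array([2, 4, 6, 8, 10]), index=['a', 'b', 'c', 'd', 'e']), bins=3)

# an IntervalIndex as bins: values outside any interval become NaN
bins = pd.IntervalIndex.from_tuples([(0, 1), (2, 3), (4, 5)])
pd.cut([0, 0.5, 1.5, 2.5, 4.5], bins)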
pandas.qcut
Notes
Examples
>>> pd.qcut(range(5), 4)
... # doctest: +ELLIPSIS
[(-0.001, 1.0], (-0.001, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0]]
Categories (4, interval[float64]): [(-0.001, 1.0] < (1.0, 2.0] ...
pandas.merge
suffixes [tuple of (str, str), default (‘_x’, ‘_y’)] Suffix to apply to overlapping column
names in the left and right side, respectively. To raise an exception on overlapping
columns use (False, False).
copy [bool, default True] If False, avoid copy if possible.
indicator [bool or str, default False] If True, adds a column to output DataFrame
called “_merge” with information on the source of each row. If string, column with
information on source of each row will be added to output DataFrame, and column
will be named value of string. Information column is Categorical-type and takes
on a value of “left_only” for observations whose merge key only appears in ‘left’
DataFrame, “right_only” for observations whose merge key only appears in ‘right’
DataFrame, and “both” if the observation’s merge key is found in both.
validate [str, optional] If specified, checks if merge is of specified type.
• “one_to_one” or “1:1”: check if merge keys are unique in both left and right
datasets.
• “one_to_many” or “1:m”: check if merge keys are unique in left dataset.
• “many_to_one” or “m:1”: check if merge keys are unique in right dataset.
• “many_to_many” or “m:m”: allowed, but does not result in checks.
New in version 0.21.0.
Returns
DataFrame A DataFrame of the two merged objects.
See also:
Notes
Support for specifying index levels as the on, left_on, and right_on parameters was added in version
0.23.0 Support for merging named Series objects was added in version 0.24.0
Examples
Merge df1 and df2 on the lkey and rkey columns. The value columns have the default suffixes, _x and
_y, appended.
Merge DataFrames df1 and df2 with specified left and right suffixes appended to any overlapping
columns.
Merge DataFrames df1 and df2, but raise an exception if the DataFrames have any overlapping columns.
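Sketches of the three merges described (df1/df2 are small hypothetical frames sharing a 'value' column):
df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [5, 6, 7, 8]})

df1.merge(df2, left_on='lkey', right_on='rkey')                # default _x / _y suffixes
df1.merge(df2, left_on='lkey', right_on='rkey',
          suffixes=('_left', '_right'))                        # explicit suffixes
df1.merge(df2, left_on='lkey', right_on='rkey',
          suffixes=(False, False))                             # raises on the overlapping column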
pandas.merge_ordered
left_on [label or list, or array-like] Field names to join on in left DataFrame. Can be
a vector or list of vectors of the length of the DataFrame to use a particular vector
as the join key instead of columns
right_on [label or list, or array-like] Field names to join on in right DataFrame or
vector/list of vectors per left_on docs
left_by [column name or list of column names] Group left DataFrame by group columns
and merge piece by piece with right DataFrame
right_by [column name or list of column names] Group right DataFrame by group
columns and merge piece by piece with left DataFrame
fill_method [{‘ffill’, None}, default None] Interpolation method for data
suffixes [Sequence, default is (“_x”, “_y”)] A length-2 sequence where each element is
optionally a string indicating the suffix to add to overlapping column names in left
and right respectively. Pass a value of None instead of a string to indicate that the
column name from left or right should be left as-is, with no suffix. At least one of
the values must not be None.
Changed in version 0.25.0.
how [{‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘outer’]
• left: use only keys from left frame (SQL: left outer join)
• right: use only keys from right frame (SQL: right outer join)
• outer: use union of keys from both frames (SQL: full outer join)
• inner: use intersection of keys from both frames (SQL: inner join)
New in version 0.19.0.
Returns
merged [DataFrame] The output type will be the same as ‘left’, if it is a subclass of
DataFrame.
See also:
merge
merge_asof
Examples
>>> A                              >>> B
      key  lvalue group                 key  rvalue
0       a       1     a            0      b       1
1       c       2     a            1      c       2
2       e       3     a            2      d       3
3       a       1     b
4       c       2     b
5       e       3     b
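The call that produces the merged result is not shown in this extract; it is along the lines of:
pd.merge_ordered(A, B, fill_method='ffill', left_by='group')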
pandas.merge_asof
by [column name or list of column names] Match on these columns before performing
merge operation.
left_by [column name] Field names to match on in the left DataFrame.
New in version 0.19.2.
right_by [column name] Field names to match on in the right DataFrame.
New in version 0.19.2.
suffixes [2-length sequence (tuple, list, …)] Suffix to apply to overlapping column names
in the left and right side, respectively.
tolerance [integer or Timedelta, optional, default None] Select asof tolerance within
this range; must be compatible with the merge index.
allow_exact_matches [boolean, default True]
• If True, allow matching with the same ‘on’ value (i.e. less-than-or-equal-to /
greater-than-or-equal-to)
• If False, don’t match the same ‘on’ value (i.e., strictly less-than / strictly greater-
than)
direction [‘backward’ (default), ‘forward’, or ‘nearest’] Whether to search for prior,
subsequent, or closest matches.
New in version 0.20.0.
Returns
merged [DataFrame]
See also:
merge
merge_ordered
Examples
>>> quotes
time ticker bid ask
0 2016-05-25 13:30:00.023 GOOG 720.50 720.93
1 2016-05-25 13:30:00.023 MSFT 51.95 51.96
2 2016-05-25 13:30:00.030 MSFT 51.97 51.98
3 2016-05-25 13:30:00.041 MSFT 51.99 52.00
4 2016-05-25 13:30:00.048 GOOG 720.50 720.93
5 2016-05-25 13:30:00.049 AAPL 97.99 98.01
6 2016-05-25 13:30:00.072 GOOG 720.50 720.88
7 2016-05-25 13:30:00.075 MSFT 52.01 52.03
>>> trades
time ticker price quantity
0 2016-05-25 13:30:00.023 MSFT 51.95 75
1 2016-05-25 13:30:00.038 MSFT 51.95 155
2 2016-05-25 13:30:00.048 GOOG 720.77 100
3 2016-05-25 13:30:00.048 GOOG 720.92 100
4 2016-05-25 13:30:00.048 AAPL 98.00 100
We only asof within 2ms between the quote time and the trade time
We only asof within 10ms between the quote time and the trade time and we exclude exact matches
on time. However prior data will propagate forward
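The calls behind the two descriptions above are assumed to look roughly like this, using the trades and quotes frames printed earlier (their construction is not shown in this extraction):
>>> within_2ms = pd.merge_asof(trades, quotes,
...                            on='time', by='ticker',
...                            tolerance=pd.Timedelta('2ms'))
>>> within_10ms = pd.merge_asof(trades, quotes,
...                             on='time', by='ticker',
...                             tolerance=pd.Timedelta('10ms'),
...                             allow_exact_matches=False)   # exclude exact time matches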
pandas.concat
Notes
Examples
Clear the existing index and reset it in the result by setting the ignore_index option to True.
Add a hierarchical index at the outermost level of the data with the keys option.
Label the index keys you create with the names option.
Combine DataFrame objects with overlapping columns and return everything. Columns outside the
intersection will be filled with NaN values.
Combine DataFrame objects with overlapping columns and return only those that are shared by passing
inner to the join keyword argument.
Prevent the result from including duplicate index values with the verify_integrity option.
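A small runnable sketch of the basic concat behaviour described above (the names s1 and s2 are illustrative, not the original example):
>>> s1 = pd.Series(['a', 'b'])
>>> s2 = pd.Series(['c', 'd'])
>>> pd.concat([s1, s2], ignore_index=True)   # reset the index in the result
0    a
1    b
2    c
3    d
dtype: object
>>> pd.concat([s1, s2], keys=['s1', 's2'])   # add a hierarchical index with keys
s1  0    a
    1    b
s2  0    c
    1    d
dtype: object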
pandas.get_dummies
Examples
>>> s = pd.Series(list('abca'))
>>> pd.get_dummies(s)
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
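The s1 used in the next call is not defined in this extraction; judging from the output (a missing value encoded as an all-zero row), it is assumed to be something like:
>>> s1 = ['a', 'b', np.nan]   # assumed setup, not in the original extraction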
>>> pd.get_dummies(s1)
a b
0 1 0
1 0 1
2 0 0
>>> pd.get_dummies(pd.Series(list('abcaa')))
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
4 1 0 0
pandas.factorize
Note: Even if there’s a missing value in values, uniques will not contain an entry
for it.
See also:
Examples
These examples all show factorize as a top-level method like pd.factorize(values). The results are
identical for methods like Series.factorize().
With sort=True, the uniques will be sorted, and labels will be shuffled so that the relationship is
maintained.
Missing values are indicated in labels with na_sentinel (-1 by default). Note that missing values are
never included in uniques.
Thus far, we’ve only factorized lists (which are internally coerced to NumPy arrays). When factorizing
pandas objects, the type of uniques will differ. For Categoricals, a Categorical is returned.
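A basic sketch of the top-level call described above; the labels and uniques follow directly from factorize's order-of-appearance encoding:
>>> labels, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'])
>>> labels
array([0, 0, 1, 2, 0])
>>> uniques
array(['b', 'a', 'c'], dtype=object)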
pandas.unique
pandas.unique(values)
Hash table-based unique. Uniques are returned in order of appearance. This does NOT sort.
Significantly faster than numpy.unique. Includes NA values.
Parameters
values [1d array-like]
Returns
numpy.ndarray or ExtensionArray The return can be:
• Index : when the input is an Index
• Categorical : when the input is a Categorical dtype
• ndarray : when the input is a Series/ndarray
Return numpy.ndarray or ExtensionArray.
See also:
Index.unique
Series.unique
Examples
>>> pd.unique(pd.Series([pd.Timestamp('20160101'),
... pd.Timestamp('20160101')]))
array(['2016-01-01T00:00:00.000000000'], dtype='datetime64[ns]')
>>> pd.unique(list('baabc'))
array(['b', 'a', 'c'], dtype=object)
>>> pd.unique(pd.Series(pd.Categorical(list('baabc'))))
[b, a, c]
Categories (3, object): [b, a, c]
>>> pd.unique(pd.Series(pd.Categorical(list('baabc'),
... categories=list('abc'))))
[b, a, c]
Categories (3, object): [b, a, c]
>>> pd.unique(pd.Series(pd.Categorical(list('baabc'),
... categories=list('abc'),
... ordered=True)))
[b, a, c]
Categories (3, object): [a < b < c]
An array of tuples
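The example that belonged to this caption was lost; it is assumed to have been along these lines:
>>> pd.unique([('a', 'b'), ('b', 'a'), ('a', 'c'), ('b', 'a')])
array([('a', 'b'), ('b', 'a'), ('a', 'c')], dtype=object)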
pandas.wide_to_long
negated character class ‘\D+’. You can also further disambiguate suffixes, for exam-
ple, if your wide variables are of the form A-one, B-two,.., and you have an unrelated
column A-rating, you can ignore the last one by specifying suffix=’(!?one|two)’
New in version 0.20.0.
Changed in version 0.23.0: When all suffixes are numeric, they are cast to
int64/float64.
Returns
DataFrame A DataFrame that contains each stub name as a variable, with new index
(i, j).
Notes
All extra variables are left untouched. This simply uses pandas.melt under the hood, but is hard-coded
to “do the right thing” in a typical case.
Examples
>>> np.random.seed(123)
>>> df = pd.DataFrame({"A1970" : {0 : "a", 1 : "b", 2 : "c"},
... "A1980" : {0 : "d", 1 : "e", 2 : "f"},
... "B1970" : {0 : 2.5, 1 : 1.2, 2 : .7},
... "B1980" : {0 : 3.2, 1 : 1.3, 2 : .1},
... "X" : dict(zip(range(3), np.random.randn(3)))
... })
>>> df["id"] = df.index
>>> df
A1970 A1980 B1970 B1980 X id
0 a d 2.5 3.2 -1.085631 0
1 b e 1.2 1.3 0.997345 1
2 c f 0.7 0.1 0.282978 2
>>> pd.wide_to_long(df, ["A", "B"], i="id", j="year")
... # doctest: +NORMALIZE_WHITESPACE
X A B
id year
0 1970 -1.085631 a 2.5
1 1970 0.997345 b 1.2
2 1970 0.282978 c 0.7
0 1980 -1.085631 d 3.2
1 1980 0.997345 e 1.3
2 1980 0.282978 f 0.1
>>> df = pd.DataFrame({
... 'famid': [1, 1, 1, 2, 2, 2, 3, 3, 3],
... 'birth': [1, 2, 3, 1, 2, 3, 1, 2, 3],
... 'ht1': [2.8, 2.9, 2.2, 2, 1.8, 1.9, 2.2, 2.3, 2.1],
... 'ht2': [3.4, 3.8, 2.9, 3.2, 2.8, 2.4, 3.3, 3.4, 2.9]
... })
>>> df
famid birth ht1 ht2
0 1 1 2.8 3.4
1 1 2 2.9 3.8
2 1 3 2.2 2.9
3 2 1 2.0 3.2
4 2 2 1.8 2.8
5 2 3 1.9 2.4
6 3 1 2.2 3.3
7 3 2 2.3 3.4
8 3 3 2.1 2.9
Going from long back to wide just takes some creative use of unstack
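The long frame l used below is not defined in this extraction; it is assumed to come from a call like:
>>> l = pd.wide_to_long(df, stubnames='ht', i=['famid', 'birth'], j='age')  # assumed step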
>>> w = l.unstack()
>>> w.columns = w.columns.map('{0[0]}{0[1]}'.format)
>>> w.reset_index()
famid birth ht1 ht2
0 1 1 2.8 3.4
1 1 2 2.9 3.8
2 1 3 2.2 2.9
3 2 1 2.0 3.2
4 2 2 1.8 2.8
5 2 3 1.9 2.4
6 3 1 2.2 3.3
7 3 2 2.3 3.4
8 3 3 2.1 2.9
>>> np.random.seed(0)
>>> df = pd.DataFrame({'A(weekly)-2010': np.random.rand(3),
... 'A(weekly)-2011': np.random.rand(3),
... 'B(weekly)-2010': np.random.rand(3),
... 'B(weekly)-2011': np.random.rand(3),
... 'X' : np.random.randint(3, size=3)})
>>> df['id'] = df.index
>>> df # doctest: +NORMALIZE_WHITESPACE, +ELLIPSIS
A(weekly)-2010 A(weekly)-2011 B(weekly)-2010 B(weekly)-2011 X id
0 0.548814 0.544883 0.437587 0.383442 0 0
1 0.715189 0.423655 0.891773 0.791725 1 1
2 0.602763 0.645894 0.963663 0.528895 1 2
If we have many columns, we could also use a regex to find our stubnames and pass that list on to
wide_to_long
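A sketch of the regex-based stubname extraction just mentioned, assuming the wide columns shown in the previous frame:
>>> stubnames = sorted(
...     set([match[0] for match in df.columns.str.findall(
...         r'[A-B]\(.*\)').values if match != []])
... )
>>> list(stubnames)
['A(weekly)', 'B(weekly)']
>>> melted = pd.wide_to_long(df, stubnames, i='id', j='year', sep='-')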
All of the above examples have integers as suffixes. It is possible to have non-integers as suffixes.
>>> df = pd.DataFrame({
... 'famid': [1, 1, 1, 2, 2, 2, 3, 3, 3],
... 'birth': [1, 2, 3, 1, 2, 3, 1, 2, 3],
... 'ht_one': [2.8, 2.9, 2.2, 2, 1.8, 1.9, 2.2, 2.3, 2.1],
... 'ht_two': [3.4, 3.8, 2.9, 3.2, 2.8, 2.4, 3.3, 3.4, 2.9]
... })
>>> df
famid birth ht_one ht_two
0 1 1 2.8 3.4
1 1 2 2.9 3.8
2 1 3 2.2 2.9
3 2 1 2.0 3.2
4 2 2 1.8 2.8
5 2 3 1.9 2.4
6 3 1 2.2 3.3
7 3 2 2.3 3.4
8 3 3 2.1 2.9
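The call that this truncated example was leading up to is assumed to use sep and a non-numeric suffix pattern, roughly:
>>> long_format = pd.wide_to_long(df, stubnames='ht', i=['famid', 'birth'],
...                               j='age', sep='_', suffix=r'\w+')   # assumed call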
pandas.isna
pandas.isna(obj)
Detect missing values for an array-like object.
This function takes a scalar or array-like object and indicates whether values are missing (NaN in
numeric arrays, None or NaN in object arrays, NaT in datetimelike).
Parameters
obj [scalar or array-like] Object to check for null or missing values.
Returns
bool or array-like of bool For scalar input, returns a scalar boolean. For array input,
returns an array of boolean indicating whether each corresponding element is missing.
See also:
Examples
>>> pd.isna('dog')
False
>>> pd.isna(np.nan)
True
For Series and DataFrame, the same type is returned, containing booleans.
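The df in the next call is assumed to be a small frame containing a None, for example:
>>> df = pd.DataFrame([['ant', 'bee', 'cat'], ['dog', None, 'fly']])  # assumed setup
>>> pd.isna(df)
       0      1      2
0  False  False  False
1  False   True  False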
>>> pd.isna(df[1])
0 False
1 True
Name: 1, dtype: bool
pandas.isnull
pandas.isnull(obj)
Detect missing values for an array-like object.
This function takes a scalar or array-like object and indicates whether values are missing (NaN in
numeric arrays, None or NaN in object arrays, NaT in datetimelike).
Parameters
obj [scalar or array-like] Object to check for null or missing values.
Returns
bool or array-like of bool For scalar input, returns a scalar boolean. For array input,
returns an array of boolean indicating whether each corresponding element is missing.
See also:
Examples
>>> pd.isna('dog')
False
>>> pd.isna(np.nan)
True
For Series and DataFrame, the same type is returned, containing booleans.
>>> pd.isna(df[1])
0 False
1 True
Name: 1, dtype: bool
pandas.notna
pandas.notna(obj)
Detect non-missing values for an array-like object.
This function takes a scalar or array-like object and indicates whether values are valid (not missing,
which is NaN in numeric arrays, None or NaN in object arrays, NaT in datetimelike).
Parameters
obj [array-like or object value] Object to check for not null or non-missing values.
Returns
bool or array-like of bool For scalar input, returns a scalar boolean. For array input,
returns an array of boolean indicating whether each corresponding element is valid.
See also:
Examples
>>> pd.notna('dog')
True
>>> pd.notna(np.nan)
False
For Series and DataFrame, the same type is returned, containing booleans.
>>> pd.notna(df[1])
0 True
1 False
Name: 1, dtype: bool
pandas.notnull
pandas.notnull(obj)
Detect non-missing values for an array-like object.
This function takes a scalar or array-like object and indicates whether values are valid (not missing,
which is NaN in numeric arrays, None or NaN in object arrays, NaT in datetimelike).
Parameters
obj [array-like or object value] Object to check for not null or non-missing values.
Returns
bool or array-like of bool For scalar input, returns a scalar boolean. For array input,
returns an array of boolean indicating whether each corresponding element is valid.
See also:
Examples
>>> pd.notna('dog')
True
>>> pd.notna(np.nan)
False
For Series and DataFrame, the same type is returned, containing booleans.
>>> pd.notna(df[1])
0 True
1 False
Name: 1, dtype: bool
pandas.to_numeric
Examples
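The example code for this section was lost; a minimal sketch of typical usage:
>>> s = pd.Series(['1.0', '2', -3])
>>> pd.to_numeric(s)
0    1.0
1    2.0
2   -3.0
dtype: float64
>>> pd.to_numeric(s, downcast='float')
0    1.0
1    2.0
2   -3.0
dtype: float32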
pandas.to_datetime
Examples
Assembling a datetime from multiple columns of a DataFrame. The keys can be common abbreviations
like [‘year’, ‘month’, ‘day’, ‘minute’, ‘second’, ‘ms’, ‘us’, ‘ns’] or plurals of the same.
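A sketch of the column-assembly case just described:
>>> df = pd.DataFrame({'year': [2015, 2016],
...                    'month': [2, 3],
...                    'day': [4, 5]})
>>> pd.to_datetime(df)
0   2015-02-04
1   2016-03-05
dtype: datetime64[ns]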
If a date does not meet the timestamp limitations, passing errors=’ignore’ will return the original input
instead of raising any exception.
Passing errors=’coerce’ will force an out-of-bounds date to NaT, in addition to forcing non-dates (or
non-parseable dates) to NaT.
Passing infer_datetime_format=True can often speed up parsing if the strings are not exactly in
ISO 8601 format but are in a regular format.
Warning: For float arg, precision rounding might happen. To prevent unexpected behavior use
a fixed-width exact type.
pandas.to_timedelta
Parameters
arg [str, timedelta, list-like or Series] The data to be converted to timedelta.
unit [str, default ‘ns’] Denotes the unit of the arg. Possible values: (‘Y’, ‘M’, ‘W’, ‘D’,
‘days’, ‘day’, ‘hours’, ‘hour’, ‘hr’, ‘h’, ‘m’, ‘minute’, ‘min’, ‘minutes’, ‘T’, ‘S’,
‘seconds’, ‘sec’, ‘second’, ‘ms’, ‘milliseconds’, ‘millisecond’, ‘milli’, ‘millis’, ‘L’,
‘us’, ‘microseconds’, ‘microsecond’, ‘micro’, ‘micros’, ‘U’, ‘ns’, ‘nanoseconds’,
‘nano’, ‘nanos’, ‘nanosecond’, ‘N’).
box [bool, default True]
• If True returns a Timedelta/TimedeltaIndex of the results.
• If False returns a numpy.timedelta64 or numpy.darray of values of dtype
timedelta64[ns].
Deprecated since version 0.25.0: Use Series.to_numpy() or Timedelta.
to_timedelta64() instead to get an ndarray of values or numpy.timedelta64, re-
spectively.
errors [{‘ignore’, ‘raise’, ‘coerce’}, default ‘raise’]
• If ‘raise’, then invalid parsing will raise an exception.
• If ‘coerce’, then invalid parsing will be set as NaT.
• If ‘ignore’, then invalid parsing will return the input.
Returns
timedelta64 or numpy.array of timedelta64 Output type returned if parsing succeeded.
See also:
Examples
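The examples for this section were lost; typical usage is assumed to look like:
>>> pd.to_timedelta('1 days 06:05:01.00003')
Timedelta('1 days 06:05:01.000030')
>>> pd.to_timedelta(np.arange(5), unit='s')
TimedeltaIndex(['00:00:00', '00:00:01', '00:00:02', '00:00:03', '00:00:04'],
               dtype='timedelta64[ns]', freq=None)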
pandas.date_range
Notes
Of the four parameters start, end, periods, and freq, exactly three must be specified. If freq is
omitted, the resulting DatetimeIndex will have periods linearly spaced elements between start and
end (closed on both sides).
To learn more about the frequency strings, please see this link.
Examples
Specify start, end, and periods; the frequency is generated automatically (linearly spaced).
Other Parameters
Changed the freq (frequency) to 'M' (month end frequency).
closed controls whether to include start and end that are on the boundary. The default includes
boundary points on either end.
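A sketch of the linearly spaced case described above:
>>> pd.date_range(start='2018-04-24', end='2018-04-27', periods=3)
DatetimeIndex(['2018-04-24 00:00:00', '2018-04-25 12:00:00',
               '2018-04-27 00:00:00'],
              dtype='datetime64[ns]', freq=None)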
pandas.bdate_range
tz [string or None] Time zone name for returning localized DatetimeIndex, for example
Asia/Beijing.
normalize [bool, default False] Normalize start/end dates to midnight before generating
date range.
name [string, default None] Name of the resulting DatetimeIndex.
weekmask [string or None, default None] Weekmask of valid business days, passed to
numpy.busdaycalendar, only used when custom frequency strings are passed. The
default value None is equivalent to ‘Mon Tue Wed Thu Fri’.
New in version 0.21.0.
holidays [list-like or None, default None] Dates to exclude from the set of valid business
days, passed to numpy.busdaycalendar, only used when custom frequency strings
are passed.
New in version 0.21.0.
closed [string, default None] Make the interval closed with respect to the given fre-
quency to the ‘left’, ‘right’, or both sides (None).
**kwargs For compatibility. Has no effect on the result.
Returns
DatetimeIndex
Notes
Of the four parameters: start, end, periods, and freq, exactly three must be specified. Specifying
freq is a requirement for bdate_range. Use date_range if specifying freq is not desired.
To learn more about the frequency strings, please see this link.
Examples
Note how the two weekend days are skipped in the result.
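A sketch of the weekend-skipping behaviour just noted (2018-01-06 and 2018-01-07 fall on a weekend):
>>> pd.bdate_range(start='1/1/2018', end='1/08/2018')
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-08'],
              dtype='datetime64[ns]', freq='B')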
pandas.period_range
freq [string or DateOffset, optional] Frequency alias. By default the freq is taken from
start or end if those are Period objects. Otherwise, the default is "D" for daily
frequency.
name [string, default None] Name of the resulting PeriodIndex
Returns
prng [PeriodIndex]
Notes
Of the three parameters: start, end, and periods, exactly two must be specified.
To learn more about the frequency strings, please see this link.
Examples
If start or end are Period objects, they will be used as anchor endpoints for a PeriodIndex with
frequency matching that of the period_range constructor.
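A sketch of a basic call with a monthly frequency (the exact repr wrapping is approximate):
>>> pd.period_range(start='2017-01-01', end='2018-01-01', freq='M')
PeriodIndex(['2017-01', '2017-02', '2017-03', '2017-04', '2017-05',
             '2017-06', '2017-07', '2017-08', '2017-09', '2017-10',
             '2017-11', '2017-12', '2018-01'],
            dtype='period[M]', freq='M')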
pandas.timedelta_range
Notes
Of the four parameters start, end, periods, and freq, exactly three must be specified. If freq is
omitted, the resulting TimedeltaIndex will have periods linearly spaced elements between start and
end (closed on both sides).
To learn more about the frequency strings, please see this link.
Examples
The closed parameter specifies which endpoint is included. The default behavior is to include both
endpoints.
The freq parameter specifies the frequency of the TimedeltaIndex. Only fixed frequencies can be
passed, non-fixed frequencies such as ‘M’ (month end) will raise.
Specify start, end, and periods; the frequency is generated automatically (linearly spaced).
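A sketch of a basic call with the default daily frequency:
>>> pd.timedelta_range(start='1 day', periods=4)
TimedeltaIndex(['1 days', '2 days', '3 days', '4 days'],
               dtype='timedelta64[ns]', freq='D')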
pandas.infer_freq
pandas.infer_freq(index, warn=True)
Infer the most likely frequency given the input index. If the frequency is uncertain, a warning will be
printed.
Parameters
index [DatetimeIndex or TimedeltaIndex] if passed a Series will use the values of the
series (NOT THE INDEX)
warn [boolean, default True]
Returns
str or None None if no discernible frequency. Raises a TypeError if the index is not
datetime-like, and a ValueError if there are fewer than three values.
pandas.interval_range
IntervalIndex An Index of intervals that are all closed on the same side.
Notes
Of the four parameters start, end, periods, and freq, exactly three must be specified. If freq is
omitted, the resulting IntervalIndex will have periods linearly spaced elements between start and
end, inclusively.
To learn more about datetime-like frequency strings, please see this link.
Examples
>>> pd.interval_range(start=pd.Timestamp('2017-01-01'),
... end=pd.Timestamp('2017-01-04'))
IntervalIndex([(2017-01-01, 2017-01-02], (2017-01-02, 2017-01-03],
               (2017-01-03, 2017-01-04]],
              closed='right', dtype='interval[datetime64[ns]]')
The freq parameter specifies the frequency between the left and right endpoints of the individual
intervals within the IntervalIndex. For numeric start and end, the frequency must also be numeric.
Similarly, for datetime-like start and end, the frequency must be convertible to a DateOffset.
>>> pd.interval_range(start=pd.Timestamp('2017-01-01'),
... periods=3, freq='MS')
IntervalIndex([(2017-01-01, 2017-02-01], (2017-02-01, 2017-03-01],
(2017-03-01, 2017-04-01]],
closed='right', dtype='interval[datetime64[ns]]')
Specify start, end, and periods; the frequency is generated automatically (linearly spaced).
The closed parameter specifies which endpoints of the individual intervals within the IntervalIndex
are closed.
eval(expr[, parser, engine, truediv, …]) Evaluate a Python expression as a string using various
backends.
pandas.eval
parser [string, default ‘pandas’, {‘pandas’, ‘python’}] The parser to use to construct
the syntax tree from the expression. The default of 'pandas' parses code slightly
different than standard Python. Alternatively, you can parse an expression using the
'python' parser to retain strict Python semantics. See the enhancing performance
documentation for more details.
engine [string or None, default ‘numexpr’, {‘python’, ‘numexpr’}] The engine used to
evaluate the expression. Supported engines are
• None : tries to use numexpr, falls back to python
• 'numexpr': This default engine evaluates pandas objects using numexpr
for large speed ups in complex expressions with large frames.
• 'python': Performs operations as if you had eval’d in top level python.
This engine is generally not that useful.
More backends may be available in the future.
truediv [bool, optional] Whether to use true division, like in Python >= 3
local_dict [dict or None, optional] A dictionary of local variables, taken from locals()
by default.
global_dict [dict or None, optional] A dictionary of global variables, taken from glob-
als() by default.
resolvers [list of dict-like or None, optional] A list of objects implementing the
__getitem__ special method that you can use to inject an additional collection
of namespaces to use for variable lookup. For example, this is used in the query()
method to inject the DataFrame.index and DataFrame.columns variables that refer
to their respective DataFrame instance attributes.
level [int, optional] The number of prior stack frames to traverse and add to the current
scope. Most users will not need to change this parameter.
target [object, optional, default None] This is the target object for assignment. It is
used when there is variable assignment in the expression. If so, then target must
support item assignment with string keys, and if a copy is being returned, it must
also support .copy().
inplace [bool, default False] If target is provided, and the expression mutates target,
whether to modify target inplace. Otherwise, return a copy of target with the mu-
tation.
Returns
ndarray, numeric scalar, DataFrame, Series
Raises
ValueError There are many instances where such an error can be raised:
• target=None, but the expression is multiline.
• The expression is multiline, but not all of them have item assignment. An example
of such an arrangement is this:
a = b + 1
a + 2
Here, there are expressions on different lines, making it multiline, but the last line
has no variable assigned to the output of a + 2.
• inplace=True, but the expression is missing item assignment.
• Item assignment is provided, but the target does not support string item assign-
ment.
• Item assignment is provided and inplace=False, but the target does not support
the .copy() method
See also:
DataFrame.query
DataFrame.eval
Notes
The dtypes of any objects involved in an arithmetic % operation are recursively cast to float64.
See the enhancing performance documentation for more details.
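This section's examples were lost; a minimal sketch of evaluating an expression against a DataFrame (the frame and column names are illustrative):
>>> df = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})   # assumed example data
>>> pd.eval('df.a + df.b')
0    11
1    22
dtype: int64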
6.2.7 Hashing
pandas.util.hash_array
pandas.util.hash_pandas_object
encoding [string, default ‘utf8’] encoding for data & key when strings
hash_key [string key to encode, default to _default_hash_key]
categorize [bool, default True] Whether to first categorize object arrays before hashing.
This is more efficient when the array contains duplicate values.
New in version 0.20.0.
Returns
Series of uint64, same length as the object
6.2.8 Testing
test([extra_args])
pandas.test
pandas.test(extra_args=None)
6.3 Series
6.3.1 Constructor
Series([data, index, dtype, name, copy, …]) One-dimensional ndarray with axis labels (includ-
ing time series).
pandas.Series
dtype [str, numpy.dtype, or ExtensionDtype, optional] Data type for the output Series.
If not specified, this will be inferred from data. See the user guide for more usages.
copy [bool, default False] Copy input data.
Attributes
pandas.Series.T
Series.T
Return the transpose, which is by definition self.
pandas.Series.array
Series.array
The ExtensionArray of the data backing this Series or Index.
New in version 0.24.0.
Returns
ExtensionArray An ExtensionArray of the values stored within. For extension
types, this is the actual array. For NumPy native types, this is a thin (no copy)
wrapper around numpy.ndarray.
.array differs from .values, which may require converting the data to a different form.
See also:
Notes
This table lays out the different array types for each extension dtype within pandas.
For any 3rd-party extension types, the array type will be an ExtensionArray.
For all remaining dtypes, .array will be an arrays.PandasArray wrapping the actual
ndarray stored within. If you absolutely need a NumPy array (possibly with copying / coercing
data), then use Series.to_numpy() instead.
Examples
For regular NumPy types like int and float, a PandasArray is returned.
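A sketch of the plain-integer case just described:
>>> ser = pd.Series([1, 2, 3])
>>> ser.array
<PandasArray>
[1, 2, 3]
Length: 3, dtype: int64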
pandas.Series.asobject
Series.asobject
Return object Series which contains boxed values.
Deprecated since version 0.23.0: Use astype(object) instead.
this is an internal non-public method
pandas.Series.at
Series.at
Access a single value for a row/column label pair.
Similar to loc, in that both provide label-based lookups. Use at if you only need to get or set a
single value in a DataFrame or Series.
Raises
KeyError When label does not exist in DataFrame
See also:
Examples
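The df used below is assumed to be a small labelled frame such as the following (the original setup was lost in extraction):
>>> df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
...                   index=[4, 5, 6], columns=['A', 'B', 'C'])   # assumed setup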
>>> df.loc[5].at['B']
4
pandas.Series.axes
Series.axes
Return a list of the row axis labels.
pandas.Series.base
Series.base
Return the base object if the memory of the underlying data is shared.
Deprecated since version 0.23.0.
pandas.Series.blocks
Series.blocks
Internal property, property synonym for as_blocks().
Deprecated since version 0.21.0.
pandas.Series.data
Series.data
Return the data pointer of the underlying data.
Deprecated since version 0.23.0.
pandas.Series.dtype
Series.dtype
Return the dtype object of the underlying data.
pandas.Series.dtypes
Series.dtypes
Return the dtype object of the underlying data.
pandas.Series.flags
Series.flags
pandas.Series.ftype
Series.ftype
Return if the data is sparse|dense.
Deprecated since version 0.25.0: Use dtype() instead.
pandas.Series.ftypes
Series.ftypes
Return if the data is sparse|dense.
Deprecated since version 0.25.0: Use dtypes() instead.
pandas.Series.hasnans
Series.hasnans
Return if I have any nans; enables various perf speedups.
pandas.Series.iat
Series.iat
Access a single value for a row/column pair by integer position.
Similar to iloc, in that both provide integer-based lookups. Use iat if you only need to get or
set a single value in a DataFrame or Series.
Raises
IndexError When integer position is out of bounds
See also:
Examples
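The df used below is assumed to be defined along these lines (the original setup was lost in extraction):
>>> df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
...                   columns=['A', 'B', 'C'])   # assumed setup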
>>> df.iat[1, 2]
1
>>> df.iat[1, 2] = 10
>>> df.iat[1, 2]
10
>>> df.loc[0].iat[1]
2
pandas.Series.iloc
Series.iloc
Purely integer-location based indexing for selection by position.
.iloc[] is primarily integer position based (from 0 to length-1 of the axis), but may also be
used with a boolean array.
Allowed inputs are:
• An integer, e.g. 5.
• A list or array of integers, e.g. [4, 3, 0].
• A slice object with ints, e.g. 1:7.
• A boolean array.
• A callable function with one argument (the calling Series or DataFrame) and that returns
valid output for indexing (one of the above). This is useful in method chains, when you don’t
have a reference to the calling object, but would like to base your selection on some value.
.iloc will raise IndexError if a requested indexer is out-of-bounds, except slice indexers which
allow out-of-bounds indexing (this conforms with python/numpy slice semantics).
See more at Selection by Position.
See also:
Examples
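The df in these examples is assumed to be built from a list of dicts, along these lines:
>>> mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
...           {'a': 100, 'b': 200, 'c': 300, 'd': 400},
...           {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000}]   # assumed setup
>>> df = pd.DataFrame(mydict)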
>>> type(df.iloc[0])
<class 'pandas.core.series.Series'>
>>> df.iloc[0]
a 1
b 2
c 3
d 4
Name: 0, dtype: int64
>>> df.iloc[[0]]
a b c d
0 1 2 3 4
>>> type(df.iloc[[0]])
<class 'pandas.core.frame.DataFrame'>
>>> df.iloc[:3]
a b c d
0 1 2 3 4
1 100 200 300 400
2 1000 2000 3000 4000
With a callable, useful in method chains. The x passed to the lambda is the DataFrame being
sliced. This selects the rows whose index label is even.
>>> df.iloc[0, 1]
2
pandas.Series.imag
Series.imag
Return imag value of vector.
pandas.Series.index
Series.index
The index (axis labels) of the Series.
pandas.Series.is_copy
Series.is_copy
Return the copy.
pandas.Series.is_monotonic
Series.is_monotonic
Return boolean if values in the object are monotonic_increasing.
New in version 0.19.0.
Returns
bool
pandas.Series.is_monotonic_decreasing
Series.is_monotonic_decreasing
Return boolean if values in the object are monotonic_decreasing.
New in version 0.19.0.
Returns
bool
pandas.Series.is_monotonic_increasing
Series.is_monotonic_increasing
Return boolean if values in the object are monotonic_increasing.
New in version 0.19.0.
Returns
bool
pandas.Series.is_unique
Series.is_unique
Return boolean if values in the object are unique.
Returns
bool
pandas.Series.itemsize
Series.itemsize
Return the size of the dtype of the item of the underlying data.
Deprecated since version 0.23.0.
pandas.Series.ix
Series.ix
A primarily label-location based indexer, with integer position fallback.
Warning: Starting in 0.20.0, the .ix indexer is deprecated, in favor of the more strict .iloc and .loc
indexers.
.ix[] supports mixed integer and label based access. It is primarily label based, but will fall
back to integer positional access unless the corresponding axis is of integer type.
.ix is the most general indexer and will support any of the inputs in .loc and .iloc. .ix
also supports floating point label schemes. .ix is exceptionally useful when dealing with mixed
positional and label based hierarchical indexes.
However, when an axis is integer based, ONLY label based access and not positional access is
supported. Thus, in such cases, it’s usually better to be explicit and use .iloc or .loc.
See more at Advanced Indexing.
pandas.Series.loc
Series.loc
Access a group of rows and columns by label(s) or a boolean array.
.loc[] is primarily label based, but may also be used with a boolean array.
Allowed inputs are:
• A single label, e.g. 5 or 'a', (note that 5 is interpreted as a label of the index, and never as
an integer position along the index).
• A list or array of labels, e.g. ['a', 'b', 'c'].
• A slice object with labels, e.g. 'a':'f'.
Warning: Note that contrary to usual python slices, both the start and the stop are
included
• A boolean array of the same length as the axis being sliced, e.g. [True, False, True].
• A callable function with one argument (the calling Series or DataFrame) and that returns
valid output for indexing (one of the above)
See more at Selection by Label
Raises
KeyError: when any items are not found
See also:
Examples
Getting values
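The first lookups below are assumed to run against a frame like this (later outputs reflect intermediate assignments that are not shown in this extraction):
>>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
...                   index=['cobra', 'viper', 'sidewinder'],
...                   columns=['max_speed', 'shield'])   # assumed setup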
>>> df.loc['viper']
max_speed 4
shield 5
Name: viper, dtype: int64
Slice with labels for row and single label for column. As mentioned above, note that both the
start and stop of the slice are included.
Setting values
Set value for all items matching the list of labels
>>> df.loc['cobra'] = 10
>>> df
max_speed shield
cobra 10 10
viper 4 50
sidewinder 7 50
Slice with integer labels for rows. As mentioned above, note that both the start and stop of the
slice are included.
>>> df.loc[7:9]
max_speed shield
7 1 2
8 4 5
9 7 8
>>> tuples = [
... ('cobra', 'mark i'), ('cobra', 'mark ii'),
... ('sidewinder', 'mark i'), ('sidewinder', 'mark ii'),
... ('viper', 'mark ii'), ('viper', 'mark iii')
... ]
>>> index = pd.MultiIndex.from_tuples(tuples)
>>> values = [[12, 2], [0, 4], [10, 20],
... [1, 4], [7, 1], [16, 36]]
>>> df = pd.DataFrame(values, columns=['max_speed', 'shield'], index=index)
>>> df
max_speed shield
cobra mark i 12 2
mark ii 0 4
sidewinder mark i 10 20
mark ii 1 4
viper mark ii 7 1
mark iii 16 36
>>> df.loc['cobra']
max_speed shield
mark i 12 2
mark ii 0 4
Single label for row and column. Similar to passing in a tuple, this returns a Series.
Single tuple for the index with a single label for the column
pandas.Series.name
Series.name
Return name of the Series.
pandas.Series.nbytes
Series.nbytes
Return the number of bytes in the underlying data.
pandas.Series.ndim
Series.ndim
Number of dimensions of the underlying data, by definition 1.
pandas.Series.real
Series.real
Return the real value of vector.
pandas.Series.shape
Series.shape
Return a tuple of the shape of the underlying data.
pandas.Series.size
Series.size
Return the number of elements in the underlying data.
pandas.Series.strides
Series.strides
Return the strides of the underlying data.
Deprecated since version 0.23.0.
pandas.Series.values
Series.values
Return Series as ndarray or ndarray-like depending on the dtype.
Returns
numpy.ndarray or ndarray-like
See also:
Examples
>>> pd.Series(list('aabc')).values
array(['a', 'a', 'b', 'c'], dtype=object)
>>> pd.Series(list('aabc')).astype('category').values
[a, a, b, c]
Categories (3, object): [a, b, c]
empty
Methods
pandas.Series.abs
Series.abs(self )
Return a Series/DataFrame with absolute numeric value of each element.
This function only applies to elements that are all numeric.
Returns
abs Series/DataFrame containing the absolute value of each element.
See also:
Notes
For complex inputs, 1.2 + 1j, the absolute value is √(a² + b²).
Examples
Select rows with data closest to certain value using argsort (from StackOverflow).
>>> df = pd.DataFrame({
... 'a': [4, 5, 6, 7],
... 'b': [10, 20, 30, 40],
... })
pandas.Series.add
Series.radd
Examples
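The example code here was lost; a sketch of adding two Series with fill_value:
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> a.add(b, fill_value=0)
a    2.0
b    1.0
c    1.0
d    1.0
e    NaN
dtype: float64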
pandas.Series.add_prefix
Series.add_prefix(self, prefix)
Prefix labels with string prefix.
For Series, the row labels are prefixed. For DataFrame, the column labels are prefixed.
Parameters
prefix [str] The string to add before each label.
Returns
Series or DataFrame New Series or DataFrame with updated labels.
See also:
Examples
>>> s.add_prefix('item_')
item_0 1
item_1 2
item_2 3
item_3 4
dtype: int64
>>> df.add_prefix('col_')
col_A col_B
0 1 3
1 2 4
2 3 5
3 4 6
pandas.Series.add_suffix
Series.add_suffix(self, suffix)
Suffix labels with string suffix.
For Series, the row labels are suffixed. For DataFrame, the column labels are suffixed.
Parameters
suffix [str] The string to add after each label.
Returns
Series or DataFrame New Series or DataFrame with updated labels.
See also:
Examples
>>> s.add_suffix('_item')
0_item 1
1_item 2
2_item 3
3_item 4
dtype: int64
>>> df.add_suffix('_col')
A_col B_col
0 1 3
1 2 4
2 3 5
3 4 6
pandas.Series.agg
Notes
Examples
>>> s.agg('min')
1
pandas.Series.aggregate
Notes
Examples
>>> s.agg('min')
1
pandas.Series.align
fill_value [scalar, default np.NaN] Value to use for missing values. Defaults to NaN,
but can be any “compatible” value
method [{‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None] Method to use for filling
holes in the reindexed Series: pad / ffill propagate the last valid observation forward to the
next valid one; backfill / bfill use the NEXT valid observation to fill the gap.
limit [int, default None] If method is specified, this is the maximum number of
consecutive NaN values to forward/backward fill. In other words, if there is a gap
with more than this number of consecutive NaNs, it will only be partially filled.
If method is not specified, this is the maximum number of entries along the entire
axis where NaNs will be filled. Must be greater than 0 if not None.
fill_axis [{0 or ‘index’}, default 0] Filling axis, method and limit
broadcast_axis [{0 or ‘index’}, default None] Broadcast values along this axis, if
aligning two objects of different dimensions
Returns
(left, right) [(Series, type of other)] Aligned objects.
pandas.Series.all
Examples
Series
DataFrames
Create a dataframe from a dictionary.
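The dictionary-built frame referred to here is assumed to be:
>>> df = pd.DataFrame({'col1': [True, True], 'col2': [True, False]})   # assumed setup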
>>> df.all()
col1 True
col2 False
dtype: bool
>>> df.all(axis='columns')
0 True
1 False
dtype: bool
>>> df.all(axis=None)
False
pandas.Series.any
Parameters
axis [{0 or ‘index’, 1 or ‘columns’, None}, default 0] Indicate which axis or axes
should be reduced.
• 0 / ‘index’ : reduce the index, return a Series whose index is the original column
labels.
• 1 / ‘columns’ : reduce the columns, return a Series whose index is the original
index.
• None : reduce all axes, return a scalar.
bool_only [bool, default None] Include only boolean columns. If None, will attempt
to use everything, then use only boolean data. Not implemented for Series.
skipna [bool, default True] Exclude NA/null values. If the entire row/column is NA
and skipna is True, then the result will be False, as for an empty row/column. If
skipna is False, then NA are treated as True, because these are not equal to zero.
level [int or level name, default None] If the axis is a MultiIndex (hierarchical), count
along a particular level, collapsing into a scalar.
**kwargs [any, default None] Additional keywords have no effect but might be ac-
cepted for compatibility with NumPy.
Returns
scalar or Series If level is specified, then, Series is returned; otherwise, scalar is
returned.
See also:
Examples
Series
For Series input, the output is a scalar indicating whether any element is True.
DataFrame
Whether each column contains at least one True element (the default).
>>> df = pd.DataFrame({"A": [1, 2], "B": [0, 2], "C": [0, 0]})
>>> df
A B C
0 1 0 0
1 2 2 0
>>> df.any()
A True
B True
C False
dtype: bool
>>> df.any(axis='columns')
0 True
1 True
dtype: bool
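The next axis='columns' call gives a different answer because, in the original example, df is redefined in between; the missing redefinition is assumed to be roughly:
>>> df = pd.DataFrame({"A": [True, False], "B": [1, 0]})   # assumed intermediate step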
>>> df.any(axis='columns')
0 True
1 False
dtype: bool
>>> df.any(axis=None)
True
>>> pd.DataFrame([]).any()
Series([], dtype: bool)
pandas.Series.append
Notes
Iteratively appending to a Series can be more computationally intensive than a single concatenate.
A better solution is to append values to a list and then concatenate the list with the original Series
all at once.
Examples
>>> s1.append(s3)
0 1
1 2
2 3
3 4
4 5
5 6
dtype: int64
pandas.Series.apply
Examples
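The s used throughout these examples is assumed to be a Series of city temperatures:
>>> s = pd.Series([20, 21, 12], index=['London', 'New York', 'Helsinki'])   # assumed setup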
>>> s.apply(lambda x: x ** 2)
London 400
New York 441
Helsinki 144
dtype: int64
Define a custom function that needs additional positional arguments and pass these additional
arguments using the args keyword.
Define a custom function that takes keyword arguments and pass these arguments to apply.
>>> s.apply(np.log)
London 2.995732
New York 3.044522
Helsinki 2.484907
dtype: float64
pandas.Series.argmax
numpy.argmax Return indices of the maximum values along the given axis.
DataFrame.idxmax Return index of first occurrence of maximum over requested axis.
Series.idxmin Return index label of the first occurrence of minimum of values.
Notes
This method is the Series version of ndarray.argmax. This method returns the label of the
maximum, while ndarray.argmax returns the position. To get the position, use series.values.
argmax().
Examples
>>> s.idxmax()
'C'
If skipna is False and there is an NA value in the data, the function returns nan.
>>> s.idxmax(skipna=False)
nan
pandas.Series.argmin
numpy.argmin Return indices of the minimum values along the given axis.
DataFrame.idxmin Return index of first occurrence of minimum over requested axis.
Series.idxmax Return index label of the first occurrence of maximum of values.
Notes
This method is the Series version of ndarray.argmin. This method returns the label of the
minimum, while ndarray.argmin returns the position. To get the position, use series.values.
argmin().
Examples
>>> s.idxmin()
'A'
If skipna is False and there is an NA value in the data, the function returns nan.
>>> s.idxmin(skipna=False)
nan
pandas.Series.argsort
numpy.ndarray.argsort
pandas.Series.as_blocks
Series.as_blocks(self, copy=True)
Convert the frame to a dict of dtype -> Constructor Types that each has a homogeneous dtype.
Deprecated since version 0.21.0.
NOTE: the dtypes of the blocks WILL BE PRESERVED HERE (unlike in
as_matrix)
Parameters
copy [boolean, default True]
Returns
values [a dict of dtype -> Constructor Types]
pandas.Series.as_matrix
Series.as_matrix(self, columns=None)
Convert the frame to its Numpy-array representation.
Deprecated since version 0.23.0: Use DataFrame.values() instead.
Parameters
columns [list, optional, default:None] If None, return all columns, otherwise, returns
specified columns.
Returns
values [ndarray] If the caller is heterogeneous and contains booleans or objects, the
result will be of dtype=object. See Notes.
See also:
DataFrame.values
Notes
pandas.Series.asfreq
how [{‘start’, ‘end’}, default end] For PeriodIndex only, see PeriodIndex.asfreq
normalize [bool, default False] Whether to reset output index to midnight
fill_value [scalar, optional] Value to use for missing values, applied during upsam-
pling (note this does not fill NaNs that already were present).
New in version 0.20.0.
Returns
converted [same type as caller]
See also:
reindex
Notes
To learn more about the frequency strings, please see this link.
Examples
>>> df.asfreq(freq='30S')
s
2000-01-01 00:00:00 0.0
2000-01-01 00:00:30 NaN
2000-01-01 00:01:00 NaN
2000-01-01 00:01:30 NaN
2000-01-01 00:02:00 2.0
2000-01-01 00:02:30 NaN
2000-01-01 00:03:00 3.0
pandas.Series.asof
Notes
Examples
>>> s.asof(20)
2.0
For a sequence where, a Series is returned. The first value is NaN, because the first element of
where is before the first index value.
Missing values are not considered. The following is 2.0, not NaN, even though NaN is at the
index location for 30.
>>> s.asof(30)
2.0
pandas.Series.astype
Examples
Create a DataFrame:
>>> df.astype('int32').dtypes
col1 int32
col2 int32
dtype: object
Create a series:
>>> ser.astype('category')
0 1
1 2
dtype: category
Categories (2, int64): [1, 2]
Note that using copy=False and changing data on a new pandas object may propagate changes:
>>> s1 = pd.Series([1,2])
>>> s2 = s1.astype('int64', copy=False)
>>> s2[0] = 10
>>> s1 # note that s1[0] has changed too
0 10
1 2
dtype: int64
pandas.Series.at_time
Examples
>>> ts.at_time('12:00')
A
2018-04-09 12:00:00 2
2018-04-10 12:00:00 4
pandas.Series.autocorr
Series.autocorr(self, lag=1)
Compute the lag-N autocorrelation.
This method computes the Pearson correlation between the Series and its shifted self.
Parameters
lag [int, default 1] Number of lags to apply before performing autocorrelation.
Returns
float The Pearson correlation between self and self.shift(lag).
See also:
Notes
Examples
pandas.Series.between
Notes
This function is equivalent to (left <= ser) & (ser <= right)
Examples
>>> s.between(1, 4)
0 True
1 False
2 True
3 False
4 False
dtype: bool
pandas.Series.between_time
Series or DataFrame
Raises
TypeError If the index is not a DatetimeIndex
See also:
Examples
You get the times that are not between two times by setting start_time later than end_time:
pandas.Series.bfill
pandas.Series.bool
Series.bool(self )
Return the bool of a single element PandasObject.
This must be a boolean scalar value, either True or False. Raise a ValueError if the PandasObject
does not have exactly 1 element, or that element is not boolean
Returns
bool Same single boolean value converted to bool type.
pandas.Series.cat
Series.cat()
Accessor object for categorical properties of the Series values.
Be aware that assigning to categories is an inplace operation, while all methods return new cate-
gorical data per default (but can be called with inplace=True).
Parameters
data [Series or CategoricalIndex]
Examples
>>> s.cat.categories
>>> s.cat.categories = list('abc')
>>> s.cat.rename_categories(list('cab'))
>>> s.cat.reorder_categories(list('cab'))
>>> s.cat.add_categories(['d','e'])
>>> s.cat.remove_categories(['d'])
>>> s.cat.remove_unused_categories()
>>> s.cat.set_categories(list('abcde'))
>>> s.cat.as_ordered()
>>> s.cat.as_unordered()
pandas.Series.clip
Returns
Series or DataFrame Same type as calling object with the values outside the clip
boundaries replaced.
Examples
>>> data = {'col_0': [9, -3, 0, -1, 5], 'col_1': [-2, -7, 6, 8, -5]}
>>> df = pd.DataFrame(data)
>>> df
col_0 col_1
0 9 -2
1 -3 -7
2 0 6
3 -1 8
4 5 -5
>>> df.clip(-4, 6)
col_0 col_1
0 6 -2
1 -3 -4
2 0 6
3 -1 6
4 5 -4
Clips using specific lower and upper thresholds per column element:
pandas.Series.clip_lower
Elements below the threshold will be changed to match the threshold value(s). Threshold can be
a single value or an array, in the latter case it performs the truncation element-wise.
Parameters
threshold [numeric or array-like] Minimum value allowed. All values below thresh-
old will be set to this value.
• float : every value is compared to threshold.
• array-like : The shape of threshold should match the object it’s compared to.
When self is a Series, threshold should be the same length. When self is a DataFrame,
threshold should be 2-D and the same shape as self for axis=None, or 1-D and the
same length as the axis being compared.
axis [{0 or ‘index’, 1 or ‘columns’}, default 0] Align self with threshold along the
given axis.
inplace [bool, default False] Whether to perform the operation in place on the data.
New in version 0.21.0.
Returns
Series or DataFrame Original data with values trimmed.
See also:
Examples
Series clipping element-wise using an array of thresholds. threshold should be the same length as
the Series.
>>> df.clip(lower=3)
A B
0 3 3
1 3 4
2 5 6
Or to an array of values. By default, threshold should be the same shape as the DataFrame.
Control how threshold is broadcast with axis. In this case threshold should be the same length as
the axis specified by axis.
pandas.Series.clip_upper
threshold should be 2-D and the same shape as self for axis=None, or 1-D and the
same length as the axis being compared.
axis [{0 or ‘index’, 1 or ‘columns’}, default 0] Align object with threshold along the
given axis.
inplace [bool, default False] Whether to perform the operation in place on the data.
New in version 0.21.0.
Returns
Series or DataFrame Original data with values trimmed.
See also:
Examples
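The s and elemwise_thresholds used below are assumed to be:
>>> s = pd.Series([1, 2, 3, 4, 5])                 # assumed setup
>>> elemwise_thresholds = [5, 4, 3, 2, 1]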
>>> s.clip(upper=3)
0 1
1 2
2 3
3 3
4 3
dtype: int64
>>> s.clip(upper=elemwise_thresholds)
0 1
1 2
2 3
3 2
4 1
dtype: int64
pandas.Series.combine
Series.combine_first Combine Series values, choosing the calling Series’ values first.
Examples
Now, to combine the two datasets and view the highest speeds of the birds across the two datasets
In the previous example, the resulting value for duck is missing, because the maximum of a NaN
and a float is a NaN. So, in the example, we set fill_value=0, so the maximum value returned
will be the value from some dataset.
pandas.Series.combine_first
Series.combine_first(self, other)
Combine Series values, choosing the calling Series’s values first.
Parameters
other [Series] The value(s) to be combined with the Series.
Returns
Series The result of combining the Series with the other object.
See also:
Notes
Examples
pandas.Series.compound
numeric_only [bool, default None] Include only float, int, boolean columns. If
None, will attempt to use everything, then use only numeric data. Not implemented
for Series.
**kwargs Additional keyword arguments to be passed to the function.
Returns
scalar or Series (if level specified)
pandas.Series.compress
numpy.ndarray.compress
pandas.Series.copy
Series.copy(self, deep=True)
Make a copy of this object’s indices and data.
When deep=True (default), a new object will be created with a copy of the calling object’s data
and indices. Modifications to the data or indices of the copy will not be reflected in the original
object (see notes below).
When deep=False, a new object will be created without copying the calling object’s data or index
(only references to the data and index are copied). Any changes to the data of the original will
be reflected in the shallow copy (and vice versa).
Parameters
deep [bool, default True] Make a deep copy, including a copy of the data and the
indices. With deep=False neither the indices nor the data are copied.
Returns
copy [Series or DataFrame] Object type matches caller.
Notes
When deep=True, data is copied but actual Python objects will not be copied recursively, only
the reference to the object. This is in contrast to copy.deepcopy in the Standard Library, which
recursively copies object data (see examples below).
While Index objects are copied when deep=True, the underlying numpy array is not copied for
performance reasons. Since Index is immutable, the underlying data can be safely shared and a
copy is not needed.
Examples
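The objects compared below are assumed to be set up as:
>>> s = pd.Series([1, 2], index=["a", "b"])   # assumed setup
>>> shallow = s.copy(deep=False)
>>> deep = s.copy()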
>>> s is shallow
False
>>> s.values is shallow.values and s.index is shallow.index
True
>>> s is deep
False
>>> s.values is deep.values or s.index is deep.index
False
Updates to the data shared by shallow copy and original is reflected in both; deep copy remains
unchanged.
>>> s[0] = 3
>>> shallow[1] = 4
>>> s
a 3
b 4
dtype: int64
>>> shallow
a 3
b 4
dtype: int64
>>> deep
a 1
b 2
dtype: int64
Note that when copying an object containing Python objects, a deep copy will copy the data, but
will not do so recursively. Updating a nested data object will be reflected in the deep copy.
pandas.Series.corr
Examples
pandas.Series.count
Series.count(self, level=None)
Return number of non-NA/null observations in the Series.
Parameters
level [int or level name, default None] If the axis is a MultiIndex (hierarchical),
count along a particular level, collapsing into a smaller Series.
Returns
int or Series (if level specified) Number of non-null values in the Series.
Examples
pandas.Series.cov
Examples
pandas.Series.cummax
Returns
scalar or Series
See also:
Examples
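The Series and DataFrame used in this and the following cummin/cumprod/cumsum examples are assumed to be:
>>> s = pd.Series([2, np.nan, 5, -1, 0])            # assumed setup
>>> df = pd.DataFrame([[2.0, 1.0],
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                   columns=list('AB'))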
Series
>>> s.cummax()
0 2.0
1 NaN
2 5.0
3 5.0
4 5.0
dtype: float64
>>> s.cummax(skipna=False)
0 2.0
1 NaN
2 NaN
3 NaN
4 NaN
dtype: float64
DataFrame
By default, iterates over rows and finds the maximum in each column. This is equivalent to
axis=None or axis='index'.
>>> df.cummax()
A B
0 2.0 1.0
1 3.0 NaN
2 3.0 1.0
To iterate over columns and find the maximum in each row, use axis=1
>>> df.cummax(axis=1)
A B
0 2.0 2.0
1 3.0 NaN
2 1.0 1.0
pandas.Series.cummin
Examples
Series
>>> s.cummin()
0 2.0
1 NaN
2 2.0
3 -1.0
4 -1.0
dtype: float64
>>> s.cummin(skipna=False)
0 2.0
1 NaN
2 NaN
3 NaN
4 NaN
dtype: float64
DataFrame
By default, iterates over rows and finds the minimum in each column. This is equivalent to
axis=None or axis='index'.
>>> df.cummin()
A B
0 2.0 1.0
1 2.0 NaN
2 1.0 0.0
To iterate over columns and find the minimum in each row, use axis=1
>>> df.cummin(axis=1)
A B
0 2.0 1.0
1 3.0 NaN
2 1.0 0.0
pandas.Series.cumprod
Examples
Series
>>> s.cumprod()
0 2.0
1 NaN
2 10.0
3 -10.0
4 -0.0
dtype: float64
>>> s.cumprod(skipna=False)
0 2.0
1 NaN
2 NaN
3 NaN
4 NaN
dtype: float64
DataFrame
By default, iterates over rows and finds the product in each column. This is equivalent to
axis=None or axis='index'.
>>> df.cumprod()
A B
0 2.0 1.0
1 6.0 NaN
2 6.0 0.0
To iterate over columns and find the product in each row, use axis=1
>>> df.cumprod(axis=1)
A B
0 2.0 2.0
1 3.0 NaN
2 1.0 0.0
pandas.Series.cumsum
Parameters
axis [{0 or ‘index’, 1 or ‘columns’}, default 0] The index or the name of the axis.
0 is equivalent to None or ‘index’.
skipna [boolean, default True] Exclude NA/null values. If an entire row/column
is NA, the result will be NA.
*args, **kwargs : Additional keywords have no effect but might be accepted for
compatibility with NumPy.
Returns
scalar or Series
See also:
Examples
Series
>>> s.cumsum()
0 2.0
1 NaN
2 7.0
3 6.0
4 6.0
dtype: float64
>>> s.cumsum(skipna=False)
0 2.0
1 NaN
2 NaN
3 NaN
4 NaN
dtype: float64
DataFrame
By default, iterates over rows and finds the sum in each column. This is equivalent to axis=None
or axis='index'.
>>> df.cumsum()
A B
0 2.0 1.0
1 5.0 NaN
2 6.0 1.0
To iterate over columns and find the sum in each row, use axis=1
>>> df.cumsum(axis=1)
A B
0 2.0 3.0
1 3.0 NaN
2 1.0 1.0
pandas.Series.describe
to object columns submit the numpy.object data type. Strings can also be
used in the style of select_dtypes (e.g. df.describe(include=['O'])).
To select pandas categorical columns, use 'category'
• None (default) : The result will include all numeric columns.
exclude [list-like of dtypes or None (default), optional,] A black list of data types
to omit from the result. Ignored for Series. Here are the options:
• A list-like of dtypes : Excludes the provided data types from the result. To
exclude numeric types submit numpy.number. To exclude object columns
submit the data type numpy.object. Strings can also be used in the style of
select_dtypes (e.g. df.describe(include=['O'])). To exclude pandas
categorical columns, use 'category'
• None (default) : The result will exclude nothing.
Returns
Series or DataFrame Summary statistics of the Series or Dataframe provided.
See also:
Notes
For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50
and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The
50 percentile is the same as the median.
For object data (e.g. strings or timestamps), the result’s index will include count, unique, top,
and freq. The top is the most common value. The freq is the most common value’s frequency.
Timestamps also include the first and last items.
If multiple object values have the highest count, then the count and top results will be arbitrarily
chosen from among those with the highest count.
For mixed data types provided via a DataFrame, the default is to return only an analysis of
numeric columns. If the dataframe consists only of object and categorical data without any
numeric columns, the default is to return an analysis of both the object and categorical columns.
If include='all' is provided as an option, the result will include a union of attributes of each
type.
The include and exclude parameters can be used to limit which columns in a DataFrame are
analyzed for the output. The parameters are ignored when analyzing a Series.
Examples
>>> s = pd.Series([
... np.datetime64("2000-01-01"),
... np.datetime64("2010-01-01"),
... np.datetime64("2010-01-01")
... ])
>>> s.describe()
count 3
unique 2
top 2010-01-01 00:00:00
freq 2
first 2000-01-01 00:00:00
last 2010-01-01 00:00:00
dtype: object
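The mixed-type df described in the following calls is assumed to be constructed as:
>>> df = pd.DataFrame({'categorical': pd.Categorical(['d', 'e', 'f']),
...                    'numeric': [1, 2, 3],
...                    'object': ['a', 'b', 'c']})   # assumed setup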
>>> df.describe(include='all')
categorical numeric object
count 3 3.0 3
unique 3 NaN 3
top f NaN c
freq 1 NaN 1
mean NaN 2.0 NaN
std NaN 1.0 NaN
min NaN 1.0 NaN
25% NaN 1.5 NaN
50% NaN 2.0 NaN
75% NaN 2.5 NaN
max NaN 3.0 NaN
>>> df.numeric.describe()
count 3.0
mean 2.0
std 1.0
min 1.0
25% 1.5
50% 2.0
75% 2.5
max 3.0
Name: numeric, dtype: float64
>>> df.describe(include=['category'])
categorical
count 3
unique 3
top f
freq 1
>>> df.describe(exclude=[np.number])
categorical object
count 3 3
unique 3 3
top f c
freq 1 1
>>> df.describe(exclude=[np.object])
categorical numeric
count 3 3.0
unique 3 NaN
top f NaN
freq 1 NaN
mean NaN 2.0
std NaN 1.0
min NaN 1.0
25% NaN 1.5
50% NaN 2.0
75% NaN 2.5
max NaN 3.0
pandas.Series.diff
Series.diff(self, periods=1)
First discrete difference of element.
Calculates the difference of a Series element compared with another element in the Series (default
is element in previous row).
Parameters
periods [int, default 1] Periods to shift for calculating difference, accepts negative
values.
Returns
Series First differences of the Series.
See also:
Examples
>>> s.diff(periods=3)
0 NaN
1 NaN
2 NaN
3 2.0
4 4.0
5 6.0
dtype: float64
>>> s.diff(periods=-1)
0 0.0
1 -1.0
2 -1.0
3 -2.0
4 -3.0
5 NaN
dtype: float64
pandas.Series.div
fill_value [None or float value, default None (NaN)] Fill existing missing (NaN)
values, and any new element needed for successful Series alignment, with this
value before computation. If data in both corresponding Series locations is
missing the result will be missing.
level [int or name] Broadcast across a level, matching Index values on the passed
MultiIndex level.
Returns
Series The result of the operation.
See also:
Series.rtruediv
Examples
pandas.Series.divide
level [int or name] Broadcast across a level, matching Index values on the passed
MultiIndex level.
Returns
Series The result of the operation.
See also:
Series.rtruediv
Examples
pandas.Series.divmod
Series.rdivmod
pandas.Series.dot
Series.dot(self, other)
Compute the dot product between the Series and the columns of other.
This method computes the dot product between the Series and another Series, or between the Series and
each column of a DataFrame, or between the Series and each column of an array.
It can also be called using self @ other in Python >= 3.5.
Parameters
other [Series, DataFrame or array-like] The other object to compute the dot prod-
uct with its columns.
Returns
scalar, Series or numpy.ndarray The dot product of the Series and other if other is a
Series; a Series of dot products between the Series and each column of other if other is a
DataFrame; or a numpy.ndarray of dot products between the Series and each column of the
numpy array.
See also:
Notes
The Series and other have to share the same index if other is a Series or a DataFrame.
Examples
pandas.Series.drop
Examples
Drop labels B and C.
pandas.Series.drop_duplicates
Examples
With the ‘keep’ parameter, the selection behaviour of duplicated values can be changed. The
value ‘first’ keeps the first occurrence for each set of duplicated entries. The default value of keep
is ‘first’.
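The animal-name Series used below is assumed to be:
>>> s = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama', 'hippo'],
...               name='animal')   # assumed setup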
>>> s.drop_duplicates()
0 lama
1 cow
3 beetle
5 hippo
Name: animal, dtype: object
The value ‘last’ for parameter ‘keep’ keeps the last occurrence for each set of duplicated entries.
>>> s.drop_duplicates(keep='last')
1 cow
3 beetle
4 lama
5 hippo
Name: animal, dtype: object
The value False for parameter ‘keep’ discards all sets of duplicated entries. Setting the value of
‘inplace’ to True performs the operation inplace and returns None.
pandas.Series.droplevel
level [int, str, or list-like] If a string is given, it must be the name of a level. If list-like,
elements must be names or positional indexes of levels.
axis [{0 or ‘index’, 1 or ‘columns’}, default 0]
Returns
Series or DataFrame The object with the requested index / column level(s) removed.
Examples
>>> df = pd.DataFrame([
... [1, 2, 3, 4],
... [5, 6, 7, 8],
... [9, 10, 11, 12]
... ]).set_index([0, 1]).rename_axis(['a', 'b'])
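The column MultiIndex shown in the next display (level_1/level_2) is assumed to be attached by a step like this, which is missing from the extraction:
>>> df.columns = pd.MultiIndex.from_tuples([
...     ('c', 'e'), ('d', 'f')
... ], names=['level_1', 'level_2'])   # assumed intermediate step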
>>> df
level_1 c d
level_2 e f
a b
1 2 3 4
5 6 7 8
9 10 11 12
>>> df.droplevel('a')
level_1 c d
level_2 e f
b
2 3 4
6 7 8
10 11 12
pandas.Series.dropna
axis [{0 or ‘index’}, default 0] There is only one axis to drop values from.
inplace [bool, default False] If True, do operation inplace and return None.
**kwargs Not in use.
Returns
Series Series with NA entries dropped from it.
See also:
Examples
>>> ser.dropna()
0 1.0
1 2.0
dtype: float64
>>> ser.dropna(inplace=True)
>>> ser
0 1.0
1 2.0
dtype: float64
pandas.Series.dt
Series.dt()
Accessor object for datetimelike properties of the Series values.
Examples
>>> s.dt.hour
>>> s.dt.second
>>> s.dt.quarter
Returns a Series indexed like the original Series. Raises TypeError if the Series does not contain
datetimelike values.
pandas.Series.duplicated
Series.duplicated(self, keep=’first’)
Indicate duplicate Series values.
Duplicated values are indicated as True values in the resulting Series. Either all duplicates, all
except the first or all except the last occurrence of duplicates can be indicated.
Parameters
keep [{‘first’, ‘last’, False}, default ‘first’]
• ‘first’ : Mark duplicates as True except for the first occurrence.
• ‘last’ : Mark duplicates as True except for the last occurrence.
• False : Mark all duplicates as True.
Returns
Series Series indicating whether each value has occurred in the preceding values.
See also:
Examples
By default, for each set of duplicated values, the first occurrence is set on False and all others on
True:
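The animals Series and the default call are assumed to look like this (the original setup was lost in extraction):
>>> animals = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama'])   # assumed setup
>>> animals.duplicated()
0    False
1    False
2     True
3    False
4     True
dtype: bool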
which is equivalent to
>>> animals.duplicated(keep='first')
0 False
1 False
2 True
3 False
4 True
dtype: bool
By using ‘last’, the last occurrence of each set of duplicated values is set on False and all others
on True:
>>> animals.duplicated(keep='last')
0 True
1 False
2 True
3 False
4 False
dtype: bool
>>> animals.duplicated(keep=False)
0 True
1 False
2 True
3 False
4 True
dtype: bool
pandas.Series.eq
Series.eq(self, other, level=None, fill_value=None, axis=0)
pandas.Series.equals
Series.equals(self, other)
Test whether two objects contain the same elements.
This function allows two Series or DataFrames to be compared against each other to see if they
have the same shape and elements. NaNs in the same location are considered equal. The column
headers do not need to have the same type, but the elements within the columns must be the
same dtype.
Parameters
other [Series or DataFrame] The other Series or DataFrame to be compared with
the first.
Returns
bool True if all elements are the same in both objects, False otherwise.
See also:
Series.eq Compare two Series objects of the same length and return a Series where each element
is True if the element in each Series is equal, False otherwise.
DataFrame.eq Compare two DataFrame objects of the same shape and return a DataFrame
where each element is True if the respective element in each DataFrame is equal, False
otherwise.
assert_series_equal Return True if left and right Series are equal, False otherwise.
assert_frame_equal Return True if left and right DataFrames are equal, False otherwise.
numpy.array_equal Return True if two arrays have the same shape and elements, False other-
wise.
Notes
This function requires that the elements have the same dtype as their respective elements in the
other Series or DataFrame. However, the column labels do not need to have the same type, as
long as they are still considered equal.
Examples
DataFrames df and exactly_equal have the same types and values for their elements and column
labels, which will return True.
DataFrames df and different_column_type have the same element types and values, but have
different types for the column labels, which will still return True.
DataFrames df and different_data_type have different types for the same values for their ele-
ments, and will return False even though their column labels are the same values and types.
pandas.Series.ewm
Notes
Exactly one of center of mass, span, half-life, and alpha must be provided. Allowed values and
relationship between the parameters are specified in the parameter descriptions above; see the
link at the end of this section for a detailed explanation.
When adjust is True (default), weighted averages are calculated using weights (1-alpha)**(n-1),
(1-alpha)**(n-2), …, 1-alpha, 1.
When adjust is False, weighted averages are calculated recursively as:
weighted_average[0] = arg[0]; weighted_average[i] = (1-alpha)*weighted_average[i-1]
+ alpha*arg[i].
When ignore_na is False (default), weights are based on absolute positions. For example, the
weights of x and y used in calculating the final weighted average of [x, None, y] are (1-alpha)**2
and 1 (if adjust is True), and (1-alpha)**2 and alpha (if adjust is False).
When ignore_na is True (reproducing pre-0.15.0 behavior), weights are based on relative posi-
tions. For example, the weights of x and y used in calculating the final weighted average of [x,
None, y] are 1-alpha and 1 (if adjust is True), and 1-alpha and alpha (if adjust is False).
More details can be found at http://pandas.pydata.org/pandas-docs/stable/user_guide/
computation.html#exponentially-weighted-windows
Examples
>>> df.ewm(com=0.5).mean()
B
0 0.000000
1 0.750000
2 1.615385
3 1.615385
4 3.670213
pandas.Series.expanding
Notes
By default, the result is set to the right edge of the window. This can be changed to the center
of the window by setting center=True.
Examples
>>> df.expanding(2).sum()
B
0 NaN
1 1.0
2 3.0
3 3.0
4 7.0
pandas.Series.explode
Series.explode(self ) → ’Series’
Transform each element of a list-like to a row, replicating the index values.
New in version 0.25.0.
Returns
Series Exploded lists to rows; index will be duplicated for these rows.
See also:
Notes
This routine will explode list-likes including lists, tuples, Series, and np.ndarray. The result dtype
of the subset rows will be object. Scalars will be returned unchanged. Empty list-likes will result
in a np.nan for that row.
Examples
>>> s.explode()
0 1
0 2
0 3
1 foo
2 NaN
3 3
3      4
dtype: object
pandas.Series.factorize
Note: Even if there’s a missing value in values, uniques will not contain an
entry for it.
See also:
Examples
These examples all show factorize as a top-level method like pd.factorize(values). The results
are identical for methods like Series.factorize().
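For instance, a basic call (a sketch; the exact array repr may vary slightly by platform):
>>> labels, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'])
>>> labels
array([0, 0, 1, 2, 0])
>>> uniques
array(['b', 'a', 'c'], dtype=object)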
With sort=True, the uniques will be sorted, and labels will be shuffled so that the relationship
is maintained.
Missing values are indicated in labels with na_sentinel (-1 by default). Note that missing values
are never included in uniques.
Thus far, we’ve only factorized lists (which are internally coerced to NumPy arrays). When
factorizing pandas objects, the type of uniques will differ. For Categoricals, a Categorical is
returned.
pandas.Series.ffill
pandas.Series.fillna
value [scalar, dict, Series, or DataFrame] Value to use to fill holes (e.g. 0), al-
ternately a dict/Series/DataFrame of values specifying which value to use for
each index (for a Series) or column (for a DataFrame). Values not in the
dict/Series/DataFrame will not be filled. This value cannot be a list.
method [{‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None] Method to use for
filling holes in reindexed Series pad / ffill: propagate last valid observation
forward to next valid backfill / bfill: use next valid observation to fill gap.
axis [{0 or ‘index’}] Axis along which to fill missing values.
inplace [bool, default False] If True, fill in-place. Note: this will modify any other
views on this object (e.g., a no-copy slice for a column in a DataFrame).
limit [int, default None] If method is specified, this is the maximum number of
consecutive NaN values to forward/backward fill. In other words, if there is a
gap with more than this number of consecutive NaNs, it will only be partially
filled. If method is not specified, this is the maximum number of entries along
the entire axis where NaNs will be filled. Must be greater than 0 if not None.
downcast [dict, default is None] A dict of item->dtype of what to downcast if
possible, or the string ‘infer’ which will try to downcast to an appropriate equal
type (e.g. float64 to int64 if possible).
Returns
Series Object with missing values filled.
See also:
Examples
>>> df.fillna(0)
A B C D
0 0.0 2.0 0.0 0
1 3.0 4.0 0.0 1
2 0.0 0.0 0.0 5
3 0.0 3.0 0.0 4
>>> df.fillna(method='ffill')
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 3.0 4.0 NaN 5
3 3.0 3.0 NaN 4
Replace all NaN elements in columns ‘A’, ‘B’, ‘C’, and ‘D’ with 0, 1, 2, and 3 respectively.
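A sketch of that call, assuming the same df as in the outputs above:
>>> values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
>>> df.fillna(value=values)
     A    B    C  D
0  0.0  2.0  2.0  0
1  3.0  4.0  2.0  1
2  0.0  1.0  2.0  5
3  0.0  3.0  2.0  4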
pandas.Series.filter
DataFrame.loc
Notes
The items, like, and regex parameters are enforced to be mutually exclusive.
axis defaults to the info axis that is used when indexing with [].
Examples
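A minimal sketch (illustrative data; assumes pandas is imported as pd):
>>> df = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
...                   index=['mouse', 'rabbit'],
...                   columns=['one', 'two', 'three'])
>>> df.filter(items=['one', 'three'])   # select columns by name
        one  three
mouse     1      3
rabbit    4      6
>>> df.filter(regex='e$', axis=1)       # select columns whose name ends in 'e'
        one  three
mouse     1      3
rabbit    4      6
>>> df.filter(like='bbi', axis=0)       # select rows whose label contains 'bbi'
        one  two  three
rabbit    4    5      6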
pandas.Series.first
Series.first(self, offset)
Convenience method for subsetting initial periods of time series data based on a date offset.
Parameters
offset [string, DateOffset, dateutil.relativedelta]
Returns
subset [same type as caller]
Raises
TypeError If the index is not a DatetimeIndex
See also:
Examples
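A sketch of the setup the output below assumes (an index sampled every other day):
>>> i = pd.date_range('2018-04-09', periods=4, freq='2D')
>>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
>>> ts
            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4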
>>> ts.first('3D')
A
2018-04-09 1
2018-04-11 2
Notice that the data for the first 3 calendar days were returned, not the first 3 days observed in the
dataset, and therefore data for 2018-04-13 was not returned.
pandas.Series.first_valid_index
Series.first_valid_index(self )
Return index for first non-NA/null value.
Returns
scalar [type of index]
Notes
If all elements are non-NA/null, returns None. Also returns None for empty Series/DataFrame.
pandas.Series.floordiv
Series.rfloordiv
Examples
pandas.Series.from_array
pandas.Series.ge
fill_value [None or float value, default None (NaN)] Fill existing missing (NaN)
values, and any new element needed for successful Series alignment, with this
value before computation. If data in both corresponding Series locations is
missing the result will be missing.
level [int or name] Broadcast across a level, matching Index values on the passed
MultiIndex level.
Returns
Series The result of the operation.
See also:
Series.None
pandas.Series.get
pandas.Series.get_dtype_counts
Series.get_dtype_counts(self )
Return counts of unique dtypes in this object.
Deprecated since version 0.25.0.
Use .dtypes.value_counts() instead.
Returns
dtype [Series] Series with the count of columns with each dtype.
See also:
Examples
>>> df.get_dtype_counts()
float64 1
int64 1
object 1
dtype: int64
pandas.Series.get_ftype_counts
Series.get_ftype_counts(self )
Return counts of unique ftypes in this object.
Deprecated since version 0.23.0.
This is useful for SparseDataFrame or for DataFrames containing sparse arrays.
Returns
dtype [Series] Series with the count of columns with each type and sparsity
(dense/sparse).
See also:
Examples
pandas.Series.get_value
scalar value
pandas.Series.get_values
Series.get_values(self )
Same as values (but handles sparseness conversions); is a view.
Deprecated since version 0.25.0: Use Series.to_numpy() or Series.array instead.
Returns
numpy.ndarray Data of the Series.
pandas.Series.groupby
Returns
DataFrameGroupBy or SeriesGroupBy Depends on the calling object and
returns groupby object that contains information about the groups.
See also:
resample Convenience method for frequency conversion and resampling of time series.
Notes
Examples
Hierarchical Indexes
We can groupby different levels of a hierarchical index using the level parameter:
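A sketch of that usage (mirroring the Falcon/Parrot example used elsewhere in this reference; values are illustrative):
>>> arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],
...           ['Captive', 'Wild', 'Captive', 'Wild']]
>>> index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))
>>> ser = pd.Series([390., 350., 30., 20.], index=index, name='Max Speed')
>>> ser.groupby(level=0).mean()   # group on the 'Animal' level
Animal
Falcon    370.0
Parrot     25.0
Name: Max Speed, dtype: float64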
pandas.Series.gt
Series.None
pandas.Series.head
Series.head(self, n=5)
Return the first n rows.
This function returns the first n rows for the object based on position. It is useful for quickly
testing if your object has the right type of data in it.
Parameters
n [int, default 5] Number of rows to select.
Returns
obj_head [same type as caller] The first n rows of the caller object.
See also:
Examples
>>> df.head()
animal
0 alligator
1 bee
2 falcon
3 lion
4 monkey
>>> df.head(3)
animal
0 alligator
1 bee
2 falcon
pandas.Series.hist
pandas.Series.idxmax
numpy.argmax Return indices of the maximum values along the given axis.
DataFrame.idxmax Return index of first occurrence of maximum over requested axis.
Series.idxmin Return index label of the first occurrence of minimum of values.
Notes
This method is the Series version of ndarray.argmax. This method returns the label of the
maximum, while ndarray.argmax returns the position. To get the position, use series.values.
argmax().
Examples
>>> s.idxmax()
'C'
If skipna is False and there is an NA value in the data, the function returns nan.
>>> s.idxmax(skipna=False)
nan
pandas.Series.idxmin
numpy.argmin Return indices of the minimum values along the given axis.
DataFrame.idxmin Return index of first occurrence of minimum over requested axis.
Series.idxmax Return index label of the first occurrence of maximum of values.
Notes
This method is the Series version of ndarray.argmin. This method returns the label of the
minimum, while ndarray.argmin returns the position. To get the position, use series.values.
argmin().
Examples
>>> s.idxmin()
'A'
If skipna is False and there is an NA value in the data, the function returns nan.
>>> s.idxmin(skipna=False)
nan
pandas.Series.infer_objects
Series.infer_objects(self )
Attempt to infer better dtypes for object columns.
Attempts soft conversion of object-dtyped columns, leaving non-object and unconvertible columns
unchanged. The inference rules are the same as during normal Series/DataFrame construction.
New in version 0.21.0.
Returns
converted [same type as input object]
See also:
Examples
>>> df.dtypes
A object
dtype: object
>>> df.infer_objects().dtypes
A int64
dtype: object
pandas.Series.interpolate
Notes
The ‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’ and ‘akima’ methods are wrappers around
the respective SciPy implementations of similar names. These use the actual numerical values
of the index. For more information on their behavior, see the SciPy documentation and SciPy
tutorial.
Examples
Filling in NaN in a Series by padding, but filling at most two consecutive NaN at a time.
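A sketch of such a call (illustrative values; assumes numpy is imported as np):
>>> s = pd.Series([np.nan, 'single_one', np.nan,
...                'fill_two_more', np.nan, np.nan, np.nan,
...                4.71, np.nan])
>>> s.interpolate(method='pad', limit=2)   # forward fill, at most 2 NaNs per gap
0              NaN
1       single_one
2       single_one
3    fill_two_more
4    fill_two_more
5    fill_two_more
6              NaN
7             4.71
8             4.71
dtype: object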
Filling in NaN in a Series via polynomial interpolation or splines: Both ‘polynomial’ and ‘spline’
methods require that you also specify an order (int).
Fill the DataFrame forward (that is, going down) along each column using linear interpolation.
Note how the last entry in column ‘a’ is interpolated differently, because there is no entry after
it to use for interpolation. Note how the first entry in column ‘b’ remains NaN, because there is
no entry before it to use for interpolation.
>>> df = pd.DataFrame([(0.0, np.nan, -1.0, 1.0),
... (np.nan, 2.0, np.nan, np.nan),
... (2.0, 3.0, np.nan, 9.0),
... (np.nan, 4.0, -4.0, 16.0)],
... columns=list('abcd'))
>>> df
a b c d
0 0.0 NaN -1.0 1.0
1 NaN 2.0 NaN NaN
2 2.0 3.0 NaN 9.0
3 NaN 4.0 -4.0 16.0
>>> df.interpolate(method='linear', limit_direction='forward', axis=0)
a b c d
0 0.0 NaN -1.0 1.0
1 1.0 2.0 -2.0 5.0
2  2.0  3.0 -3.0   9.0
3  2.0  4.0 -4.0  16.0
pandas.Series.isin
Series.isin(self, values)
Check whether values are contained in Series.
Return a boolean Series showing whether each element in the Series matches an element in the
passed sequence of values exactly.
Parameters
values [set or list-like] The sequence of values to test. Passing in a single string
will raise a TypeError. Instead, turn a single string into a list of one element.
New in version 0.18.1: Support for values as a set.
Returns
Series Series of booleans indicating if each element is in values.
Raises
TypeError
• If values is a string
See also:
Examples
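A sketch of the setup assumed by the outputs below:
>>> s = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama',
...                'hippo'], name='animal')
>>> s.isin(['cow', 'lama'])
0     True
1     True
2     True
3    False
4     True
5    False
Name: animal, dtype: bool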
Passing a single string as s.isin('lama') will raise an error. Use a list of one element instead:
>>> s.isin(['lama'])
0 True
1 False
2 True
3 False
4 True
5 False
Name: animal, dtype: bool
pandas.Series.isna
Series.isna(self )
Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None or
numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters
such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.
options.mode.use_inf_as_na = True).
Returns
Series Mask of bool values for each element in Series that indicates whether an
element is not an NA value.
See also:
Examples
>>> df.isna()
age born name toy
0 False True False True
1 False False False False
2 True False False False
>>> ser.isna()
0 False
1 False
2 True
dtype: bool
pandas.Series.isnull
Series.isnull(self )
Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None or
numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters
such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.
options.mode.use_inf_as_na = True).
Returns
Series Mask of bool values for each element in Series that indicates whether an
element is not an NA value.
See also:
Examples
>>> df.isna()
age born name toy
0 False True False True
1 False False False False
2 True False False False
>>> ser.isna()
0 False
1 False
2 True
dtype: bool
pandas.Series.item
Series.item(self )
Return the first element of the underlying data as a python scalar.
Returns
scalar The first element of the Series.
pandas.Series.items
Series.items(self )
Lazily iterate over (index, value) tuples.
This method returns an iterable tuple (index, value). This is convenient if you want to create a
lazy iterator.
Returns
iterable Iterable of tuples containing the (index, value) pairs from a Series.
See also:
Examples
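A minimal sketch:
>>> s = pd.Series(['A', 'B', 'C'])
>>> for index, value in s.items():
...     print("Index : {}, Value : {}".format(index, value))
Index : 0, Value : A
Index : 1, Value : B
Index : 2, Value : C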
pandas.Series.iteritems
Series.iteritems(self )
Lazily iterate over (index, value) tuples.
This method returns an iterable tuple (index, value). This is convenient if you want to create a
lazy iterator.
Returns
iterable Iterable of tuples containing the (index, value) pairs from a Series.
See also:
Examples
pandas.Series.keys
Series.keys(self )
Return alias for index.
Returns
Index Index of the Series.
pandas.Series.kurt
numeric_only [bool, default None] Include only float, int, boolean columns. If
None, will attempt to use everything, then use only numeric data. Not imple-
mented for Series.
**kwargs Additional keyword arguments to be passed to the function.
Returns
scalar or Series (if level specified)
pandas.Series.kurtosis
pandas.Series.last
Series.last(self, offset)
Convenience method for subsetting final periods of time series data based on a date offset.
Parameters
offset [string, DateOffset, dateutil.relativedelta]
Returns
subset [same type as caller]
Raises
TypeError If the index is not a DatetimeIndex
See also:
Examples
>>> ts.last('3D')
A
2018-04-13 3
2018-04-15 4
Notice that the data for the last 3 calendar days were returned, not the last 3 observed days in the dataset,
and therefore data for 2018-04-11 was not returned.
pandas.Series.last_valid_index
Series.last_valid_index(self )
Return index for last non-NA/null value.
Returns
scalar [type of index]
Notes
If all elements are non-NA/null, returns None. Also returns None for empty Series/DataFrame.
pandas.Series.le
Series.None
pandas.Series.lt
Series.None
pandas.Series.mad
pandas.Series.map
Notes
When arg is a dictionary, values in Series that are not in the dictionary (as keys) are converted
to NaN. However, if the dictionary is a dict subclass that defines __missing__ (i.e. provides a
method for default values), then this default is used rather than NaN.
Examples
map accepts a dict or a Series. Values that are not found in the dict are converted to NaN,
unless the dict has a default value (e.g. defaultdict):
To avoid applying the function to missing values (and keep them as NaN) na_action='ignore'
can be used:
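A sketch of that call, assuming a Series with a missing value (numpy imported as np):
>>> s = pd.Series(['cat', 'dog', np.nan, 'rabbit'])
>>> s.map('I am a {}'.format, na_action='ignore')   # NaN is left untouched
0       I am a cat
1       I am a dog
2              NaN
3    I am a rabbit
dtype: object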
pandas.Series.mask
Notes
The mask method is an application of the if-then idiom. For each element in the calling
DataFrame, if cond is False the element is used; otherwise the corresponding element from
the DataFrame other is used.
The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m,
df2) is equivalent to np.where(m, df1, df2).
For further details and examples see the mask documentation in indexing.
Examples
>>> s = pd.Series(range(5))
>>> s.where(s > 0)
0 NaN
1 1.0
2 2.0
3 3.0
4 4.0
dtype: float64
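The complementary mask call on the same Series (a sketch): elements where the condition is True are replaced.
>>> s.mask(s > 0)
0    0.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64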
pandas.Series.max
Parameters
axis [{index (0)}] Axis for the function to be applied on.
skipna [bool, default True] Exclude NA/null values when computing the result.
level [int or level name, default None] If the axis is a MultiIndex (hierarchical),
count along a particular level, collapsing into a scalar.
numeric_only [bool, default None] Include only float, int, boolean columns. If
None, will attempt to use everything, then use only numeric data. Not imple-
mented for Series.
**kwargs Additional keyword arguments to be passed to the function.
Returns
scalar or Series (if level specified)
See also:
Examples
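The outputs below assume a MultiIndexed Series along these lines (a sketch; the same shape of data also fits the min and sum examples later in this reference):
>>> idx = pd.MultiIndex.from_arrays([
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> s
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64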
>>> s.max()
8
>>> s.max(level='blooded')
blooded
warm 4
cold 8
Name: legs, dtype: int64
>>> s.max(level=0)
blooded
warm 4
cold 8
Name: legs, dtype: int64
pandas.Series.mean
skipna [bool, default True] Exclude NA/null values when computing the result.
level [int or level name, default None] If the axis is a MultiIndex (hierarchical),
count along a particular level, collapsing into a scalar.
numeric_only [bool, default None] Include only float, int, boolean columns. If
None, will attempt to use everything, then use only numeric data. Not imple-
mented for Series.
**kwargs Additional keyword arguments to be passed to the function.
Returns
scalar or Series (if level specified)
pandas.Series.median
pandas.Series.memory_usage
Examples
>>> s = pd.Series(range(3))
>>> s.memory_usage()
152
Not including the index gives the size of the rest of the data, which is necessarily smaller:
>>> s.memory_usage(index=False)
24
pandas.Series.min
Parameters
axis [{index (0)}] Axis for the function to be applied on.
skipna [bool, default True] Exclude NA/null values when computing the result.
level [int or level name, default None] If the axis is a MultiIndex (hierarchical),
count along a particular level, collapsing into a scalar.
numeric_only [bool, default None] Include only float, int, boolean columns. If
None, will attempt to use everything, then use only numeric data. Not imple-
mented for Series.
**kwargs Additional keyword arguments to be passed to the function.
Returns
scalar or Series (if level specified)
See also:
Examples
>>> s.min()
0
>>> s.min(level='blooded')
blooded
warm 2
cold 0
Name: legs, dtype: int64
>>> s.min(level=0)
blooded
warm 2
cold 0
Name: legs, dtype: int64
pandas.Series.mod
Parameters
other [Series or scalar value]
fill_value [None or float value, default None (NaN)] Fill existing missing (NaN)
values, and any new element needed for successful Series alignment, with this
value before computation. If data in both corresponding Series locations is
missing the result will be missing.
level [int or name] Broadcast across a level, matching Index values on the passed
MultiIndex level.
Returns
Series The result of the operation.
See also:
Series.rmod
Examples
pandas.Series.mode
Series.mode(self, dropna=True)
Return the mode(s) of the dataset.
Always returns Series even if only one value is returned.
Parameters
dropna [bool, default True] Don’t consider counts of NaN/NaT.
New in version 0.24.0.
Returns
Series Modes of the Series in sorted order.
pandas.Series.mul
Series.rmul
Examples
pandas.Series.multiply
Series.rmul
Examples
pandas.Series.ne
Series.None
pandas.Series.nlargest
Notes
Examples
>>> s.nlargest()
France 65000000
Italy 59000000
Malta 434000
Maldives 434000
Brunei 434000
dtype: int64
The n largest elements where n=3. Default keep value is ‘first’ so Malta will be kept.
>>> s.nlargest(3)
France 65000000
Italy 59000000
Malta 434000
dtype: int64
The n largest elements where n=3 and keeping the last duplicates. Brunei will be kept since it is
the last with value 434000 based on the index order.
The n largest elements where n=3 with all duplicates kept. Note that the returned Series has five
elements due to the three duplicates.
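Sketches of those two calls, using the population Series shown in the output above:
>>> s.nlargest(3, keep='last')
France      65000000
Italy       59000000
Brunei        434000
dtype: int64
>>> s.nlargest(3, keep='all')
France      65000000
Italy       59000000
Malta         434000
Maldives      434000
Brunei        434000
dtype: int64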
pandas.Series.nonzero
Series.nonzero(self )
Return the integer indices of the elements that are non-zero.
Deprecated since version 0.24.0: Please use .to_numpy().nonzero() as a replacement.
This method is equivalent to calling numpy.nonzero on the series data. For compatibility with
NumPy, the return value is the same (a tuple with an array of indices for each dimension), but
it will always be a one-item tuple because series only have one dimension.
Returns
numpy.ndarray Indices of elements that are non-zero.
See also:
numpy.nonzero
Examples
pandas.Series.notna
Series.notna(self )
Detect existing (non-missing) values.
Return a boolean same-sized object indicating if the values are not NA. Non-missing values get
mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values
(unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or
numpy.NaN, get mapped to False values.
Returns
Series Mask of bool values for each element in Series that indicates whether an
element is not an NA value.
See also:
Examples
>>> df.notna()
age born name toy
0 True False True False
1 True True True True
2 False True True True
>>> ser.notna()
0 True
1 True
2 False
dtype: bool
pandas.Series.notnull
Series.notnull(self )
Detect existing (non-missing) values.
Return a boolean same-sized object indicating if the values are not NA. Non-missing values get
mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values
(unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or
numpy.NaN, get mapped to False values.
Returns
Series Mask of bool values for each element in Series that indicates whether an
element is not an NA value.
See also:
Examples
>>> df.notna()
age born name toy
0 True False True False
1 True True True True
2 False True True True
>>> ser.notna()
0 True
1 True
2 False
dtype: bool
pandas.Series.nsmallest
Notes
Faster than .sort_values().head(n) for small n relative to the size of the Series object.
Examples
>>> s.nsmallest()
Monserat 5200
Nauru 11300
Tuvalu 11300
Anguilla 11300
Iceland 337000
dtype: int64
The n smallest elements where n=3. Default keep value is ‘first’ so Nauru and Tuvalu will be
kept.
>>> s.nsmallest(3)
Monserat 5200
Nauru 11300
Tuvalu 11300
dtype: int64
The n smallest elements where n=3 and keeping the last duplicates. Anguilla and Tuvalu will be
kept since they are the last with value 11300 based on the index order.
The n smallest elements where n=3 with all duplicates kept. Note that the returned Series has
four elements due to the three duplicates.
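A sketch of the keep='all' call, using the Series shown in the output above:
>>> s.nsmallest(3, keep='all')
Monserat     5200
Nauru       11300
Tuvalu      11300
Anguilla    11300
dtype: int64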
pandas.Series.nunique
Series.nunique(self, dropna=True)
Return number of unique elements in the object.
Excludes NA values by default.
Parameters
dropna [bool, default True] Don’t include NaN in the count.
Returns
int
See also:
Examples
>>> s.nunique()
4
pandas.Series.pct_change
Examples
Series
>>> s.pct_change()
0 NaN
1 0.011111
2 -0.065934
dtype: float64
>>> s.pct_change(periods=2)
0 NaN
1 NaN
2 -0.055556
dtype: float64
See the percentage change in a Series where filling NAs with last valid observation forward to
next valid.
>>> s.pct_change(fill_method='ffill')
0 NaN
1 0.011111
2 0.000000
3 -0.065934
dtype: float64
DataFrame
Percentage change in French franc, Deutsche Mark, and Italian lira from 1980-01-01 to 1980-03-
01.
>>> df = pd.DataFrame({
... 'FR': [4.0405, 4.0963, 4.3149],
... 'GR': [1.7246, 1.7482, 1.8519],
... 'IT': [804.74, 810.01, 860.13]},
... index=['1980-01-01', '1980-02-01', '1980-03-01'])
>>> df
                FR      GR      IT
1980-01-01  4.0405  1.7246  804.74
1980-02-01  4.0963  1.7482  810.01
1980-03-01  4.3149  1.8519  860.13
>>> df.pct_change()
FR GR IT
1980-01-01 NaN NaN NaN
1980-02-01 0.013810 0.013684 0.006549
1980-03-01 0.053365 0.059318 0.061876
Percentage of change in GOOG and APPL stock volume. Shows computing the percentage
change between columns.
>>> df = pd.DataFrame({
... '2016': [1769950, 30586265],
... '2015': [1500923, 40912316],
... '2014': [1371819, 41403351]},
... index=['GOOG', 'APPL'])
>>> df
2016 2015 2014
GOOG 1769950 1500923 1371819
APPL 30586265 40912316 41403351
>>> df.pct_change(axis='columns')
2016 2015 2014
GOOG NaN -0.151997 -0.086016
APPL NaN 0.337604 0.012002
pandas.Series.pipe
DataFrame.apply
DataFrame.applymap
Series.map
Notes
Use .pipe when chaining together functions that expect Series, DataFrames or GroupBy objects.
Instead of writing
>>> f(g(h(df), arg1=a), arg2=b, arg3=c)
you can write
>>> (df.pipe(h)
... .pipe(g, arg1=a)
... .pipe(f, arg2=b, arg3=c)
... )
If you have a function that takes the data as (say) the second argument, pass a tuple indicating
which keyword expects the data. For example, suppose f takes its data as arg2:
>>> (df.pipe(h)
... .pipe(g, arg1=a)
... .pipe((f, 'arg2'), arg1=a, arg3=c)
... )
pandas.Series.plot
pandas.Series.pop
Series.pop(self, item)
Return item and drop from frame. Raise KeyError if not found.
Parameters
item [str] Label of column to be popped.
Returns
Series
Examples
>>> df.pop('class')
0 bird
1 bird
2 mammal
3 mammal
Name: class, dtype: object
>>> df
name max_speed
0 falcon 389.0
1 parrot 24.0
2 lion 80.5
3 monkey NaN
pandas.Series.pow
Series.rpow
Examples
pandas.Series.prod
Examples
>>> pd.Series([]).prod()
1.0
>>> pd.Series([]).prod(min_count=1)
nan
Thanks to the skipna parameter, min_count handles all-NA and empty series identically.
>>> pd.Series([np.nan]).prod()
1.0
>>> pd.Series([np.nan]).prod(min_count=1)
nan
pandas.Series.product
Examples
>>> pd.Series([]).prod()
1.0
>>> pd.Series([]).prod(min_count=1)
nan
Thanks to the skipna parameter, min_count handles all-NA and empty series identically.
>>> pd.Series([np.nan]).prod()
1.0
>>> pd.Series([np.nan]).prod(min_count=1)
nan
pandas.Series.ptp
pandas.Series.put
numpy.ndarray.put
pandas.Series.quantile
q [float or array-like, default 0.5 (50% quantile)] 0 <= q <= 1, the quantile(s) to
compute.
interpolation [{‘linear’, ‘lower’, ‘higher’, ‘midpoint’, ‘nearest’}] New in version
0.18.0.
This optional parameter specifies the interpolation method to use, when the
desired quantile lies between two data points i and j:
• linear: i + (j - i) * fraction, where fraction is the fractional part of the index
surrounded by i and j.
• lower: i.
• higher: j.
• nearest: i or j whichever is nearest.
• midpoint: (i + j) / 2.
Returns
float or Series If q is an array, a Series will be returned where the index is q and
the values are the quantiles, otherwise a float will be returned.
See also:
core.window.Rolling.quantile
numpy.percentile
Examples
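A minimal sketch:
>>> s = pd.Series([1, 2, 3, 4])
>>> s.quantile(.5)
2.5
>>> s.quantile([.25, .5, .75])
0.25    1.75
0.50    2.50
0.75    3.25
dtype: float64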
pandas.Series.radd
level [int or name] Broadcast across a level, matching Index values on the passed
MultiIndex level.
Returns
Series The result of the operation.
See also:
Series.add
Examples
pandas.Series.rank
Examples
The following example shows how the method behaves with the above parameters:
• default_rank: this is the default behaviour obtained without using any parameter.
• max_rank: setting method = 'max' the records that have the same values are ranked using
the highest rank (e.g.: since ‘cat’ and ‘dog’ are both in the 2nd and 3rd position, rank 3 is
assigned.)
• NA_bottom: choosing na_option = 'bottom', if there are records with NaN values they
are placed at the bottom of the ranking.
• pct_rank: when setting pct = True, the ranking is expressed as percentile rank.
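A sketch of such a frame and the four calls described above (illustrative data; numpy imported as np):
>>> df = pd.DataFrame(data={'Animal': ['cat', 'penguin', 'dog',
...                                    'spider', 'snake'],
...                         'Number_legs': [4, 2, 4, 8, np.nan]})
>>> df['default_rank'] = df['Number_legs'].rank()
>>> df['max_rank'] = df['Number_legs'].rank(method='max')
>>> df['NA_bottom'] = df['Number_legs'].rank(na_option='bottom')
>>> df['pct_rank'] = df['Number_legs'].rank(pct=True)
>>> df
    Animal  Number_legs  default_rank  max_rank  NA_bottom  pct_rank
0      cat          4.0           2.5       3.0        2.5     0.625
1  penguin          2.0           1.0       1.0        1.0     0.250
2      dog          4.0           2.5       3.0        2.5     0.625
3   spider          8.0           4.0       4.0        4.0     1.000
4    snake          NaN           NaN       NaN        5.0       NaN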
pandas.Series.ravel
Series.ravel(self, order=’C’)
Return the flattened underlying data as an ndarray.
Returns
numpy.ndarray or ndarray-like Flattened data of the Series.
See also:
numpy.ndarray.ravel
pandas.Series.rdiv
Series.truediv
Examples
pandas.Series.rdivmod
Series.divmod
pandas.Series.reindex
index [array-like, optional] New labels / index to conform to, should be specified
using keywords. Preferably an Index object to avoid duplicating data
method [{None, ‘backfill’/’bfill’, ‘pad’/’ffill’, ‘nearest’}] Method to use for fill-
ing holes in reindexed DataFrame. Please note: this is only applicable to
DataFrames/Series with a monotonically increasing/decreasing index.
• None (default): don’t fill gaps
• pad / ffill: propagate last valid observation forward to next valid
• backfill / bfill: use next valid observation to fill gap
• nearest: use nearest valid observations to fill gap
copy [bool, default True] Return a new object, even if the passed indexes are the
same.
level [int or name] Broadcast across a level, matching Index values on the passed
MultiIndex level.
fill_value [scalar, default np.NaN] Value to use for missing values. Defaults to
NaN, but can be any “compatible” value.
limit [int, default None] Maximum number of consecutive elements to forward or
backward fill.
tolerance [optional] Maximum distance between original and new labels for in-
exact matches. The values of the index at the matching locations must satisfy
the equation abs(index[indexer] - target) <= tolerance.
Tolerance may be a scalar value, which applies the same tolerance to all values,
or list-like, which applies variable tolerance per element. List-like includes list,
tuple, array, Series, and must be the same size as the index and its dtype must
exactly match the index’s type.
New in version 0.21.0: (list-like tolerance)
Returns
Series with changed index.
See also:
Examples
Create a new index and reindex the dataframe. By default values in the new index that do not
have corresponding records in the dataframe are assigned NaN.
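A sketch of that step (the browser names are illustrative):
>>> index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
>>> df = pd.DataFrame({'http_status': [200, 200, 404, 404, 301],
...                    'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]},
...                   index=index)
>>> new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10', 'Chrome']
>>> df.reindex(new_index)
               http_status  response_time
Safari               404.0           0.07
Iceweasel              NaN            NaN
Comodo Dragon          NaN            NaN
IE10                 404.0           0.08
Chrome               200.0           0.02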
We can fill in the missing values by passing a value to the keyword fill_value. Because the
index is not monotonically increasing or decreasing, we cannot use arguments to the keyword
method to fill the NaN values.
To further illustrate the filling functionality in reindex, we will create a dataframe with a mono-
tonically increasing index (for example, a sequence of dates).
The index entries that did not have a value in the original data frame (for example, ‘2009-12-29’)
are by default filled with NaN. If desired, we can fill in the missing values using one of several
options.
For example, to back-propagate the last valid value to fill the NaN values, pass bfill as an
argument to the method keyword.
Please note that the NaN value present in the original dataframe (at index value 2010-01-03) will
not be filled by any of the value propagation schemes. This is because filling while reindexing
does not look at dataframe values, but only compares the original and desired indexes. If you do
want to fill in the NaN values present in the original dataframe, use the fillna() method.
See the user guide for more.
pandas.Series.reindex_like
Notes
Examples
>>> df1
temp_celsius temp_fahrenheit windspeed
2014-02-12 24.3 75.7 high
2014-02-13 31.0 87.8 high
2014-02-14 22.0 71.6 medium
2014-02-15 35.0 95.0 medium
>>> df2
temp_celsius windspeed
2014-02-12 28.0 low
2014-02-13 30.0 low
2014-02-15 35.1 medium
>>> df2.reindex_like(df1)
temp_celsius temp_fahrenheit windspeed
2014-02-12 28.0 NaN low
2014-02-13 30.0 NaN low
2014-02-14           NaN              NaN       NaN
2014-02-15          35.1              NaN    medium
pandas.Series.rename
Examples
pandas.Series.rename_axis
Notes
Examples
Series
DataFrame
MultiIndex
>>> df.rename_axis(columns=str.upper)
LIMBS num_legs num_arms
type name
mammal dog 4 0
cat 4 0
monkey 2 2
pandas.Series.reorder_levels
Series.reorder_levels(self, order)
Rearrange index levels using input order.
May not drop or duplicate levels.
Parameters
order [list of int representing new level order] (reference level by number or key)
Returns
type of caller (new object)
pandas.Series.repeat
Examples
pandas.Series.replace
• dict:
– Dicts can be used to specify different replacement values for different ex-
isting values. For example, {'a': 'b', 'y': 'z'} replaces the value
‘a’ with ‘b’ and ‘y’ with ‘z’. To use a dict in this way the value parameter
should be None.
– For a DataFrame a dict can specify that different values should be replaced
in different columns. For example, {'a': 1, 'b': 'z'} looks for the
value 1 in column ‘a’ and the value ‘z’ in column ‘b’ and replaces these
values with whatever is specified in value. The value parameter should
not be None in this case. You can treat this as a special case of passing
two lists except that you are specifying the column to search in.
– For a DataFrame nested dictionaries, e.g., {'a': {'b': np.nan}}, are
read as follows: look in column ‘a’ for the value ‘b’ and replace it with
NaN. The value parameter should be None to use a nested dict in this
way. You can nest regular expressions as well. Note that column names
(the top-level dictionary keys in a nested dictionary) cannot be regular
expressions.
• None:
– This means that the regex argument must be a string, compiled regular
expression, or list, dict, ndarray or Series of such elements. If value is also
None then this must be a nested dictionary or Series.
See the examples section for examples of each of these.
value [scalar, dict, list, str, regex, default None] Value to replace any values match-
ing to_replace with. For a DataFrame a dict of values can be used to specify
which value to use for each column (columns not in the dict will not be filled).
Regular expressions, strings and lists or dicts of such objects are also allowed.
inplace [bool, default False] If True, in place. Note: this will modify any other
views on this object (e.g. a column from a DataFrame). Returns the caller if
this is True.
limit [int, default None] Maximum size gap to forward or backward fill.
regex [bool or same types as to_replace, default False] Whether to interpret
to_replace and/or value as regular expressions. If this is True then to_replace
must be a string. Alternatively, this could be a regular expression or a list,
dict, or array of regular expressions in which case to_replace must be None.
method [{‘pad’, ‘ffill’, ‘bfill’, None}] The method to use for replacement when
to_replace is a scalar, list or tuple and value is None.
Changed in version 0.23.0: Added to DataFrame.
Returns
Series Object after replacement.
Raises
AssertionError
• If regex is not a bool and to_replace is not None.
TypeError
• If to_replace is a dict and value is not a list, dict, ndarray, or Series
Notes
• Regex substitution is performed under the hood with re.sub. The rules for substitution for
re.sub are the same.
• Regular expressions will only substitute on strings, meaning you cannot provide, for example,
a regular expression matching floating point numbers and expect the columns in your frame
that have a numeric dtype to be matched. However, if those floating point numbers are
strings, then you can do this.
• This method has a lot of options. You are encouraged to experiment and play with this
method to gain intuition about how it works.
• When dict is used as the to_replace value, it is like key(s) in the dict are the to_replace
part and value(s) in the dict are the value parameter.
Examples
List-like ‘to_replace‘
dict-like ‘to_replace‘
Note that when replacing multiple bool or datetime64 objects, the data types in the to_replace
parameter must match the data type of the value being replaced:
This raises a TypeError because one of the dict keys is not of the correct type for replacement.
Compare the behavior of s.replace({'a': None}) and s.replace('a', None) to understand
the peculiarities of the to_replace parameter:
When one uses a dict as the to_replace value, it is like the value(s) in the dict are equal to
the value parameter. s.replace({'a': None}) is equivalent to s.replace(to_replace={'a':
None}, value=None, method=None):
When value=None and to_replace is a scalar, list or tuple, replace uses the method parameter
(default ‘pad’) to do the replacement. So this is why the ‘a’ values are being replaced by 10 in
rows 1 and 2 and ‘b’ in row 4 in this case. The command s.replace('a', None) is actually
equivalent to s.replace(to_replace='a', value=None, method='pad'):
pandas.Series.resample
closed [{‘right’, ‘left’}, default None] Which side of bin interval is closed. The
default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’,
‘BQ’, and ‘W’ which all have a default of ‘right’.
label [{‘right’, ‘left’}, default None] Which bin edge label to label bucket with.
The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’,
‘BQ’, and ‘W’ which all have a default of ‘right’.
convention [{‘start’, ‘end’, ‘s’, ‘e’}, default ‘start’] For PeriodIndex only, controls
whether to use the start or end of rule.
kind [{‘timestamp’, ‘period’}, optional, default None] Pass ‘timestamp’ to con-
vert the resulting index to a DateTimeIndex or ‘period’ to convert it to a
PeriodIndex. By default the input representation is retained.
loffset [timedelta, default None] Adjust the resampled time labels.
limit [int, default None] Maximum size gap when reindexing with fill_method.
Deprecated since version 0.18.0.
base [int, default 0] For frequencies that evenly subdivide 1 day, the “origin” of
the aggregated intervals. For example, for ‘5min’ frequency, base could range
from 0 through 4. Defaults to 0.
on [str, optional] For a DataFrame, column to use instead of index for resampling.
Column must be datetime-like.
New in version 0.19.0.
level [str or int, optional] For a MultiIndex, level (name or number) to use for
resampling. level must be datetime-like.
New in version 0.19.0.
Returns
Resampler object
See also:
Notes
Examples
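The examples below assume a minute-frequency Series along these lines (a sketch):
>>> index = pd.date_range('1/1/2000', periods=9, freq='T')
>>> series = pd.Series(range(9), index=index)
>>> series
2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8
Freq: T, dtype: int64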
Downsample the series into 3 minute bins and sum the values of the timestamps falling into a
bin.
>>> series.resample('3T').sum()
2000-01-01 00:00:00 3
2000-01-01 00:03:00 12
2000-01-01 00:06:00 21
Freq: 3T, dtype: int64
Downsample the series into 3 minute bins as above, but label each bin using the right edge
instead of the left. Please note that the value in the bucket used as the label is not included in
the bucket, which it labels. For example, in the original series the bucket 2000-01-01 00:03:00
contains the value 3, but the summed value in the resampled bucket with the label 2000-01-01
00:03:00 does not include 3 (if it did, the summed value would be 6, not 3). To include this
value close the right side of the bin interval as illustrated in the example below this one.
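A sketch of labelling by the right edge:
>>> series.resample('3T', label='right').sum()
2000-01-01 00:03:00     3
2000-01-01 00:06:00    12
2000-01-01 00:09:00    21
Freq: 3T, dtype: int64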
Downsample the series into 3 minute bins as above, but close the right side of the bin interval.
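A sketch of additionally closing the right side of the bin interval:
>>> series.resample('3T', label='right', closed='right').sum()
2000-01-01 00:00:00     0
2000-01-01 00:03:00     6
2000-01-01 00:06:00    15
2000-01-01 00:09:00    15
Freq: 3T, dtype: int64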
Upsample the series into 30 second bins and fill the NaN values using the pad method.
>>> series.resample('30S').pad()[0:5]
2000-01-01 00:00:00 0
2000-01-01 00:00:30 0
2000-01-01 00:01:00 1
2000-01-01 00:01:30 1
2000-01-01 00:02:00 2
Freq: 30S, dtype: int64
Upsample the series into 30 second bins and fill the NaN values using the bfill method.
>>> series.resample('30S').bfill()[0:5]
2000-01-01 00:00:00 0
2000-01-01 00:00:30 1
2000-01-01 00:01:00 1
2000-01-01 00:01:30 2
2000-01-01 00:02:00 2
Freq: 30S, dtype: int64
For a Series with a PeriodIndex, the keyword convention can be used to control whether to use
the start or end of rule.
Resample a year by quarter using ‘start’ convention. Values are assigned to the first quarter of
the period.
Resample quarters by month using ‘end’ convention. Values are assigned to the last month of
the period.
For DataFrame objects, the keyword on can be used to specify the column instead of the index
for resampling.
For a DataFrame with MultiIndex, the keyword level can be used to specify on which level the
resampling needs to take place.
pandas.Series.reset_index
Examples
>>> s.reset_index()
idx foo
0 a 1
1 b 2
2 c 3
3 d 4
>>> s.reset_index(name='values')
idx values
0 a 1
1 b 2
2 c 3
3 d 4
>>> s.reset_index(drop=True)
0 1
1 2
2 3
3 4
Name: foo, dtype: int64
To update the Series in place, without generating a new one set inplace to True. Note that it
also requires drop=True.
>>> s2.reset_index(level='a')
a foo
b
one bar 0
two bar 1
one baz 2
two baz 3
If level is not set, all levels are removed from the Index.
>>> s2.reset_index()
a b foo
0 bar one 0
1 bar two 1
2 baz one 2
3 baz two 3
pandas.Series.rfloordiv
Series.floordiv
Examples
pandas.Series.rmod
Series.mod
Examples
pandas.Series.rmul
Series.mul
Examples
pandas.Series.rolling
See also:
Notes
By default, the result is set to the right edge of the window. This can be changed to the center
of the window by setting center=True.
To learn more about the offsets & frequency strings, please see this link.
The recognized win_types are:
• boxcar
• triang
• blackman
• hamming
• bartlett
• parzen
• bohman
• blackmanharris
• nuttall
• barthann
• kaiser (needs beta)
• gaussian (needs std)
• general_gaussian (needs power, width)
• slepian (needs width)
• exponential (needs tau), center is set to None.
If win_type=None all points are evenly weighted. To learn more about different window types
see scipy.signal window functions.
Examples
Rolling sum with a window length of 2, using the ‘triang’ window type.
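A sketch of that call, assuming the df with column B used in the outputs below (numpy imported as np):
>>> df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]})
>>> df.rolling(2, win_type='triang').sum()   # weighted by a triangular window
     B
0  NaN
1  0.5
2  1.5
3  NaN
4  NaN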
Rolling sum with a window length of 2, min_periods defaults to the window length.
>>> df.rolling(2).sum()
B
0 NaN
1 1.0
2 3.0
3 NaN
4 NaN
>>> df
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 2.0
2013-01-01 09:00:05 NaN
2013-01-01 09:00:06 4.0
Contrasting to an integer rolling window, this will roll a variable length window corresponding
to the time period. The default for min_periods is 1.
>>> df.rolling('2s').sum()
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 3.0
2013-01-01 09:00:05    NaN
2013-01-01 09:00:06    4.0
pandas.Series.round
Examples
pandas.Series.rpow
See also:
Series.pow
Examples
pandas.Series.rsub
Series.sub
Examples
pandas.Series.rtruediv
Series.truediv
Examples
pandas.Series.sample
Examples
Extract 3 random elements from the Series df['num_legs']: Note that we use random_state
to ensure the reproducibility of the examples.
Using a DataFrame column as weights. Rows with larger value in the num_specimen_seen
column are more likely to be sampled.
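Sketches of those two calls (illustrative data; outputs as produced with this fixed random_state):
>>> df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
...                    'num_wings': [2, 0, 0, 0],
...                    'num_specimen_seen': [10, 2, 1, 8]},
...                   index=['falcon', 'dog', 'spider', 'fish'])
>>> df['num_legs'].sample(n=3, random_state=1)
fish      0
spider    8
falcon    2
Name: num_legs, dtype: int64
>>> df.sample(n=2, weights='num_specimen_seen', random_state=1)
        num_legs  num_wings  num_specimen_seen
falcon         2          2                 10
fish           0          0                  8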
pandas.Series.searchsorted
numpy.searchsorted
Notes
Examples
>>> x.searchsorted(4)
3
>>> x.searchsorted('bread')
1
pandas.Series.sem
pandas.Series.set_axis
Returns
renamed [Series or None] An object of same type as caller if inplace=False,
None otherwise.
See also:
Examples
Series
>>> s
0 1
1 2
2 3
dtype: int64
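A sketch of relabeling that Series' index:
>>> s.set_axis(['a', 'b', 'c'], axis=0, inplace=False)
a    1
b    2
c    3
dtype: int64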
DataFrame
pandas.Series.set_value
pandas.Series.shift
fill_value [object, optional] The scalar value to use for newly introduced missing
values. The default depends on the dtype of self. For numeric data, np.nan is
used. For datetime, timedelta, or period data, etc. NaT is used. For extension
dtypes, self.dtype.na_value is used.
Changed in version 0.24.0.
Returns
Series Copy of input object, shifted.
See also:
Examples
>>> df.shift(periods=3)
Col1 Col2 Col3
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 10.0 13.0 17.0
4 20.0 23.0 27.0
pandas.Series.skew
pandas.Series.slice_shift
Notes
While the slice_shift is faster than shift, you may pay for it later during alignment.
pandas.Series.sort_index
Examples
Sort Descending
>>> s.sort_index(ascending=False)
4 d
3 a
2 b
1 c
dtype: object
Sort Inplace
>>> s.sort_index(inplace=True)
>>> s
1 c
2 b
3 a
4 d
dtype: object
By default NaNs are put at the end, but use na_position to place them at the beginning
pandas.Series.sort_values
Examples
>>> s.sort_values(ascending=True)
1 1.0
2 3.0
4 5.0
3 10.0
0 NaN
dtype: float64
>>> s.sort_values(ascending=False)
3 10.0
4 5.0
2 3.0
1 1.0
0 NaN
dtype: float64
>>> s.sort_values(na_position='first')
0 NaN
1 1.0
2 3.0
4 5.0
3 10.0
dtype: float64
>>> s.sort_values()
3 a
1 b
4 c
2 d
0 z
dtype: object
pandas.Series.sparse
Series.sparse()
Accessor object for sparse-dtype data stored in the Series.
pandas.Series.squeeze
Series.squeeze(self, axis=None)
Squeeze 1 dimensional axis objects into scalars.
Series or DataFrames with a single element are squeezed to a scalar. DataFrames with a single
column or a single row are squeezed to a Series. Otherwise the object is unchanged.
This method is most useful when you don’t know if your object is a Series or DataFrame, but
you do know it has just a single column. In that case you can safely call squeeze to ensure you
have a Series.
Parameters
axis [{0 or ‘index’, 1 or ‘columns’, None}, default None] A specific axis to squeeze.
By default, all length-1 axes are squeezed.
New in version 0.20.0.
Returns
DataFrame, Series, or scalar The projection after squeezing axis or all the
axes.
See also:
Examples
>>> even_primes.squeeze()
2
Squeezing objects with more than one value in every axis does nothing:
>>> odd_primes.squeeze()
1 3
2 5
3 7
dtype: int64
Slicing a single column will produce a DataFrame with the columns having only one value:
>>> df_a.squeeze('columns')
0 1
1 3
Name: a, dtype: int64
Slicing a single row from a single column will produce a single scalar DataFrame:
>>> df_0a.squeeze('rows')
a 1
Name: 0, dtype: int64
>>> df_0a.squeeze()
1
pandas.Series.std
pandas.Series.str
Series.str()
Vectorized string functions for Series and Index. NAs stay NA unless handled otherwise by a
particular method. Patterned after Python’s string methods, with some inspiration from R’s
stringr package.
Examples
>>> s.str.split('_')
>>> s.str.replace('_', '')
pandas.Series.sub
Series.rsub
Examples
pandas.Series.subtract
Series.rsub
Examples
pandas.Series.sum
Parameters
axis [{index (0)}] Axis for the function to be applied on.
skipna [bool, default True] Exclude NA/null values when computing the result.
level [int or level name, default None] If the axis is a MultiIndex (hierarchical),
count along a particular level, collapsing into a scalar.
numeric_only [bool, default None] Include only float, int, boolean columns. If
None, will attempt to use everything, then use only numeric data. Not imple-
mented for Series.
min_count [int, default 0] The required number of valid values to perform the
operation. If fewer than min_count non-NA values are present the result will
be NA.
New in version 0.22.0: Added with the default being 0. This means the sum of
an all-NA or empty Series is 0, and the product of an all-NA or empty Series
is 1.
**kwargs Additional keyword arguments to be passed to the function.
Returns
scalar or Series (if level specified)
See also:
Examples
>>> s.sum()
14
>>> s.sum(level='blooded')
blooded
warm 6
cold 8
Name: legs, dtype: int64
>>> s.sum(level=0)
blooded
warm 6
cold 8
Name: legs, dtype: int64
This can be controlled with the min_count parameter. For example, if you’d like the sum of an
empty series to be NaN, pass min_count=1.
>>> pd.Series([]).sum(min_count=1)
nan
Thanks to the skipna parameter, min_count handles all-NA and empty series identically.
>>> pd.Series([np.nan]).sum()
0.0
>>> pd.Series([np.nan]).sum(min_count=1)
nan
pandas.Series.swapaxes
pandas.Series.swaplevel
pandas.Series.tail
Series.tail(self, n=5)
Return the last n rows.
This function returns last n rows from the object based on position. It is useful for quickly
verifying data, for example, after sorting or appending rows.
Parameters
n [int, default 5] Number of rows to select.
Returns
type of caller The last n rows of the caller object.
See also:
Examples
>>> df.tail()
animal
4 monkey
5 parrot
6 shark
7 whale
8 zebra
>>> df.tail(3)
animal
6 shark
7 whale
8 zebra
pandas.Series.take
taken [same type as caller] An array-like containing the elements taken from the
object.
See also:
Examples
We may take elements using negative integers for positive indices, starting from the end of the
object, just like with Python lists.
pandas.Series.to_clipboard
Notes
Examples
We can omit the index by passing the keyword index and setting it to False.
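A sketch of both calls (clipboard access needs a system clipboard, hence the skip markers):
>>> df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['A', 'B', 'C'])
>>> df.to_clipboard(sep=',')  # doctest: +SKIP
... # Wrote the following to the system clipboard:
... # ,A,B,C
... # 0,1,2,3
... # 1,4,5,6
>>> df.to_clipboard(sep=',', index=False)  # doctest: +SKIP
... # Wrote the following to the system clipboard:
... # A,B,C
... # 1,2,3
... # 4,5,6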
pandas.Series.to_csv
Examples
pandas.Series.to_dense
Series.to_dense(self )
Return dense representation of Series/DataFrame (as opposed to sparse).
Deprecated since version 0.25.0.
Returns
Series Dense representation of the Series.
pandas.Series.to_dict
Examples
pandas.Series.to_excel
encoding [str, optional] Encoding of the resulting excel file. Only necessary for
xlwt, other writers support unicode natively.
inf_rep [str, default ‘inf’] Representation for infinity (there is no native represen-
tation for infinity in Excel).
verbose [bool, default True] Display more information in the error logs.
freeze_panes [tuple of int (length 2), optional] Specifies the one-based bottom-
most row and rightmost column that is to be frozen.
New in version 0.20.0..
See also:
Notes
For compatibility with to_csv(), to_excel serializes lists and dicts to strings before writing.
Once a workbook has been saved it is not possible to write further data without rewriting the whole
workbook.
Examples
>>> df1.to_excel("output.xlsx",
... sheet_name='Sheet_name_1') # doctest: +SKIP
If you wish to write to more than one sheet in the workbook, it is necessary to specify an
ExcelWriter object:
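A sketch of that pattern, assuming df1 from the example above:
>>> df2 = df1.copy()
>>> with pd.ExcelWriter('output.xlsx') as writer:  # doctest: +SKIP
...     df1.to_excel(writer, sheet_name='Sheet_name_1')
...     df2.to_excel(writer, sheet_name='Sheet_name_2')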
To set the library that is used to write the Excel file, you can pass the engine keyword (the
default engine is automatically chosen depending on the file extension):
pandas.Series.to_frame
Series.to_frame(self, name=None)
Convert Series to DataFrame.
Parameters
name [object, default None] The passed name should substitute for the series
name (if it has one).
Returns
DataFrame DataFrame representation of Series.
Examples
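A minimal sketch:
>>> s = pd.Series(['a', 'b', 'c'], name='vals')
>>> s.to_frame()
  vals
0    a
1    b
2    c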
pandas.Series.to_hdf
• ‘table’: Table format. Write as a PyTables Table structure which may per-
form worse but allow more flexible operations like searching / selecting sub-
sets of the data.
append [bool, default False] For Table formats, append the input data to the
existing.
data_columns [list of columns or True, optional] List of columns to create as
indexed data columns for on-disk queries, or True to use all columns. By
default only the axes of the object are indexed. See Query via data columns.
Applicable only to format=’table’.
complevel [{0-9}, optional] Specifies a compression level for data. A value of 0
disables compression.
complib [{‘zlib’, ‘lzo’, ‘bzip2’, ‘blosc’}, default ‘zlib’] Specifies the compression
library to be used. As of v0.20.2 these additional compressors for Blosc are
supported (default if no compressor specified: ‘blosc:blosclz’): {‘blosc:blosclz’,
‘blosc:lz4’, ‘blosc:lz4hc’, ‘blosc:snappy’, ‘blosc:zlib’, ‘blosc:zstd’}. Specifying a
compression library which is not available issues a ValueError.
fletcher32 [bool, default False] If applying compression use the fletcher32 check-
sum.
dropna [bool, default False] If true, ALL nan rows will not be written to store.
errors [str, default ‘strict’] Specifies how encoding and decoding errors are to be
handled. See the errors argument for open() for a full list of options.
See also:
Examples
>>> import os
>>> os.remove('data.h5')
pandas.Series.to_json
read_json
Examples
Encoding/decoding a Dataframe using 'records' formatted JSON. Note that index labels are
not preserved with this encoding.
>>> df.to_json(orient='records')
'[{"col 1":"a","col 2":"b"},{"col 1":"c","col 2":"d"}]'
>>> df.to_json(orient='index')
'{"row 1":{"col 1":"a","col 2":"b"},"row 2":{"col 1":"c","col 2":"d"}}'
>>> df.to_json(orient='columns')
'{"col 1":{"row 1":"a","row 2":"c"},"col 2":{"row 1":"b","row 2":"d"}}'
>>> df.to_json(orient='values')
'[["a","b"],["c","d"]]'
>>> df.to_json(orient='table')
'{"schema": {"fields": [{"name": "index", "type": "string"},
{"name": "col 1", "type": "string"},
{"name": "col 2", "type": "string"}],
"primaryKey": "index",
"pandas_version": "0.20.0"},
"data": [{"index": "row 1", "col 1": "a", "col 2": "b"},
{"index": "row 2", "col 1": "c", "col 2": "d"}]}'
pandas.Series.to_latex
Examples
pandas.Series.to_list
Series.to_list(self )
Return a list of the values.
These are each a scalar type, which is a Python scalar (for str, int, float) or a pandas scalar (for
Timestamp/Timedelta/Interval/Period)
Returns
list
See also:
numpy.ndarray.tolist
pandas.Series.to_msgpack
pandas.Series.to_numpy
Notes
The returned array will be the same up to equality (values equal in self will be equal in the
returned array; likewise for values that are not equal). When self contains an ExtensionArray,
the dtype may be different. For example, for a category-dtype Series, to_numpy() will return a
NumPy array and the categorical dtype will be lost.
For NumPy dtypes, this will be a reference to the actual data stored in this Series or Index
(assuming copy=False). Modifying the result in place will modify the data stored in the Series
or Index (not that we recommend doing that).
For extension types, to_numpy() may require copying data and coercing the result to a NumPy
type (possibly object), which may be expensive. When you need a no-copy reference to the
underlying data, Series.array should be used instead.
This table lays out the different dtypes and default return types of to_numpy() for various dtypes
within pandas.
Examples
Specify the dtype to control how datetime-aware data is represented. Use dtype=object to return
an ndarray of pandas Timestamp objects, each with the correct tz.
>>> ser.to_numpy(dtype="datetime64[ns]")
... # doctest: +ELLIPSIS
array(['1999-12-31T23:00:00.000000000', '2000-01-01T23:00:00...'],
dtype='datetime64[ns]')
pandas.Series.to_period
pandas.Series.to_pickle
read_pickle Load pickled pandas object (or any object) from file.
DataFrame.to_hdf Write DataFrame to an HDF5 file.
Examples
>>> import os
>>> os.remove("./dummy.pkl")
pandas.Series.to_sparse
pandas.Series.to_sql
Notes
Timezone aware datetime columns will be written as Timestamp with timezone type with
SQLAlchemy if supported by the database. Otherwise, the datetimes will be stored as timezone
unaware timestamps local to the original timezone.
New in version 0.24.0.
Examples
Specify the dtype (especially useful for integers with missing values). Notice that while pandas is
forced to store the data as floating point, the database supports nullable integers. When fetching
the data with Python, we get back integer scalars.
pandas.Series.to_string
pandas.Series.to_timestamp
pandas.Series.to_xarray
Series.to_xarray(self )
Return an xarray object from the pandas object.
Returns
xarray.DataArray or xarray.Dataset Data in the pandas structure converted
to Dataset if the object is a DataFrame, or a DataArray if the object is a Series.
See also:
Notes
Examples
>>> df.to_xarray()
<xarray.Dataset>
Dimensions: (index: 4)
Coordinates:
* index (index) int64 0 1 2 3
Data variables:
name (index) object 'falcon' 'parrot' 'lion' 'monkey'
class (index) object 'bird' 'bird' 'mammal' 'mammal'
max_speed (index) float64 389.0 24.0 80.5 nan
num_legs (index) int64 2 2 4 4
>>> df['max_speed'].to_xarray()
<xarray.DataArray 'max_speed' (index: 4)>
array([389. , 24. , 80.5, nan])
Coordinates:
* index (index) int64 0 1 2 3
>>> df_multiindex.to_xarray()
<xarray.Dataset>
Dimensions: (animal: 2, date: 2)
Coordinates:
* date (date) datetime64[ns] 2018-01-01 2018-01-02
* animal (animal) object 'falcon' 'parrot'
Data variables:
speed (date, animal) int64 350 18 361 15
pandas.Series.tolist
Series.tolist(self )
Return a list of the values.
These are each a scalar type, which is a Python scalar (for str, int, float) or a pandas scalar (for
Timestamp/Timedelta/Interval/Period)
Returns
list
See also:
numpy.ndarray.tolist
pandas.Series.transform
• function
• string function name
• list of functions and/or function names, e.g. [np.exp. 'sqrt']
• dict of axis labels -> functions, function names or list of such.
axis [{0 or ‘index’}] Parameter needed for compatibility with DataFrame.
*args Positional arguments to pass to func.
**kwargs Keyword arguments to pass to func.
Returns
Series A Series that must have the same length as self.
Raises
ValueError [If the returned Series has a different length than self.]
See also:
Examples
Even though the resulting Series must have the same length as the input Series, it is possible to
provide several input functions:
>>> s = pd.Series(range(3))
>>> s
0 0
1 1
2 2
dtype: int64
>>> s.transform([np.sqrt, np.exp])
sqrt exp
0 0.000000 1.000000
1 1.000000 2.718282
2 1.414214 7.389056
pandas.Series.transpose
pandas.Series.truediv
Series.rtruediv
Examples
pandas.Series.truncate
Notes
If the index being truncated contains only datetime values, before and after may be specified as
strings instead of Timestamps.
Examples
>>> df.truncate(before=pd.Timestamp('2016-01-05'),
... after=pd.Timestamp('2016-01-10')).tail()
A
2016-01-09 23:59:56 1
2016-01-09 23:59:57 1
2016-01-09 23:59:58 1
2016-01-09 23:59:59 1
2016-01-10 00:00:00 1
Because the index is a DatetimeIndex containing only dates, we can specify before and after as
strings. They will be coerced to Timestamps before truncation.
Note that truncate assumes a 0 value for any unspecified time component (midnight). This
differs from partial string slicing, which returns any partially matching dates.
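A sketch of the string-based call; the df is an assumption consistent with the output above (one column of ones over a per-second DatetimeIndex):
>>> df = pd.DataFrame({'A': 1},
...                   index=pd.date_range('2016-01-01', '2016-01-10', freq='s'))
>>> df.truncate('2016-01-05', '2016-01-10').tail()
                     A
2016-01-09 23:59:56  1
2016-01-09 23:59:57  1
2016-01-09 23:59:58  1
2016-01-09 23:59:59  1
2016-01-10 00:00:00  1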
pandas.Series.tshift
Notes
If freq is not specified then tries to use the freq or inferred_freq attributes of the index. If neither
of those attributes exist, a ValueError is thrown
pandas.Series.tz_convert
Raises
TypeError If the axis is tz-naive.
pandas.Series.tz_localize
Examples
>>> s = pd.Series([1],
... index=pd.DatetimeIndex(['2018-09-15 01:30:00']))
>>> s.tz_localize('CET')
2018-09-15 01:30:00+02:00 1
dtype: int64
Be careful with DST changes. When there is sequential data, pandas can infer the DST time:
In some cases, inferring the DST is impossible. In such cases, you can pass an ndarray to the
ambiguous parameter to set the DST explicitly
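A sketch of both cases, assuming numpy is imported as np (the timestamps are illustrative):
>>> s = pd.Series(range(7),
...               index=pd.DatetimeIndex(['2018-10-28 01:30:00',
...                                       '2018-10-28 02:00:00',
...                                       '2018-10-28 02:30:00',
...                                       '2018-10-28 02:00:00',
...                                       '2018-10-28 02:30:00',
...                                       '2018-10-28 03:00:00',
...                                       '2018-10-28 03:30:00']))
>>> s.tz_localize('CET', ambiguous='infer')
2018-10-28 01:30:00+02:00    0
2018-10-28 02:00:00+02:00    1
2018-10-28 02:30:00+02:00    2
2018-10-28 02:00:00+01:00    3
2018-10-28 02:30:00+01:00    4
2018-10-28 03:00:00+01:00    5
2018-10-28 03:30:00+01:00    6
dtype: int64
>>> s = pd.Series(range(3),
...               index=pd.DatetimeIndex(['2018-10-28 01:20:00',
...                                       '2018-10-28 02:36:00',
...                                       '2018-10-28 03:46:00']))
>>> s.tz_localize('CET', ambiguous=np.array([True, True, False]))
2018-10-28 01:20:00+02:00    0
2018-10-28 02:36:00+02:00    1
2018-10-28 03:46:00+01:00    2
dtype: int64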
If the DST transition causes nonexistent times, you can shift these dates forward or backward
with a timedelta object or 'shift_forward' or 'shift_backward'.
>>> s = pd.Series(range(2),
...               index=pd.DatetimeIndex(['2015-03-29 02:30:00',
...                                       '2015-03-29 03:30:00']))
>>> s.tz_localize('Europe/Warsaw', nonexistent='shift_forward')
2015-03-29 03:00:00+02:00    0
2015-03-29 03:30:00+02:00    1
dtype: int64
>>> s.tz_localize('Europe/Warsaw', nonexistent='shift_backward')
2015-03-29 01:59:59.999999999+01:00    0
2015-03-29 03:30:00+02:00              1
dtype: int64
>>> s.tz_localize('Europe/Warsaw', nonexistent=pd.Timedelta('1H'))
2015-03-29 03:30:00+02:00    0
2015-03-29 03:30:00+02:00    1
dtype: int64
pandas.Series.unique
Series.unique(self )
Return unique values of Series object.
Uniques are returned in order of appearance. Hash table-based unique, therefore does NOT sort.
Returns
ndarray or ExtensionArray The unique values returned as a NumPy array.
See Notes.
See also:
Notes
Returns the unique values as a NumPy array. In case of an extension-array backed Series, a new
ExtensionArray of that type with just the unique values is returned. This includes
• Categorical
• Period
• Datetime with Timezone
• Interval
• Sparse
• IntegerNA
See Examples section.
Examples
>>> pd.Series(pd.Categorical(list('baabc'))).unique()
[b, a, c]
Categories (3, object): [b, a, c]
pandas.Series.unstack
Examples
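The series in the examples below might be constructed as follows (a sketch consistent with the outputs shown):
>>> s = pd.Series([1, 2, 3, 4],
...               index=pd.MultiIndex.from_product([['one', 'two'],
...                                                 ['a', 'b']]))
>>> s
one  a    1
     b    2
two  a    3
     b    4
dtype: int64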
>>> s.unstack(level=-1)
a b
one 1 2
two 3 4
>>> s.unstack(level=0)
one two
a 1 3
b 2 4
pandas.Series.update
Series.update(self, other)
Modify Series in place using non-NA values from passed Series. Aligns on index.
Parameters
other [Series]
Examples
If other contains NaNs the corresponding values are not updated in the original Series.
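A minimal sketch, assuming numpy is imported as np (the values are illustrative):
>>> s = pd.Series([1, 2, 3])
>>> s.update(pd.Series([4, np.nan, 6]))
>>> s
0    4
1    2
2    6
dtype: int64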
pandas.Series.valid
pandas.Series.value_counts
Examples
With normalize set to True, returns the relative frequency by dividing all values by the sum of
values.
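For illustration, assuming the same series as in the bins and dropna examples below (a sketch):
>>> s = pd.Series([3, 1, 2, 3, 4, np.nan])
>>> s.value_counts(normalize=True)
3.0    0.4
4.0    0.2
2.0    0.2
1.0    0.2
dtype: float64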
bins
Bins can be useful for going from a continuous variable to a categorical variable; instead of
counting unique occurrences of values, divide the index into the specified number of half-open bins.
>>> s.value_counts(bins=3)
(2.0, 3.0] 2
(0.996, 2.0] 2
(3.0, 4.0] 1
dtype: int64
dropna
With dropna set to False we can also see NaN index values.
>>> s.value_counts(dropna=False)
3.0 2
NaN 1
4.0 1
2.0 1
1.0 1
dtype: int64
pandas.Series.var
pandas.Series.view
Series.view(self, dtype=None)
Create a new view of the Series.
This function will return a new Series with a view of the same underlying values in memory,
optionally reinterpreted with a new data type. The new data type must preserve the same size
in bytes so as not to cause index misalignment.
Parameters
dtype [data type] Data type object or one of their string representations.
Returns
Series A new Series object as a view of the same data in memory.
See also:
numpy.ndarray.view Equivalent numpy function to create a new view of the same data in
memory.
Notes
Series are instantiated with dtype=float64 by default. While numpy.ndarray.view() will return
a view with the same data type as the original array, Series.view() (without specified dtype)
will try using float64 and may fail if the original data type size in bytes is not the same.
Examples
The 8 bit signed integer representation of -1 is 0b11111111, but the same bytes represent 255 if
read as an 8 bit unsigned integer:
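The example below assumes a small int8 series such as (a sketch consistent with the output shown):
>>> s = pd.Series([-2, -1, 0, 1, 2], dtype='int8')
>>> s
0   -2
1   -1
2    0
3    1
4    2
dtype: int8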
>>> us = s.view('uint8')
>>> us
0 254
1 255
2 0
3 1
4 2
dtype: uint8
pandas.Series.where
Notes
The where method is an application of the if-then idiom. For each element in the calling
DataFrame, if cond is True the element is used; otherwise the corresponding element from the
DataFrame other is used.
The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m,
df2) is equivalent to np.where(m, df1, df2).
For further details and examples see the where documentation in indexing.
Examples
>>> s = pd.Series(range(5))
>>> s.where(s > 0)
0 NaN
1 1.0
2 2.0
3 3.0
4 4.0
dtype: float64
pandas.Series.xs
Notes
Examples
>>> df.xs('mammal')
num_legs num_wings
animal locomotion
cat walks 4 0
dog walks 4 0
bat flies 2 2
6.3.2 Attributes
Axes
pandas.Series.empty
Series.empty
6.3.3 Conversion
Series.astype(self, dtype[, copy, errors]) Cast a pandas object to a specified dtype dtype.
Series.infer_objects(self) Attempt to infer better dtypes for object columns.
Series.copy(self[, deep]) Make a copy of this object’s indices and data.
Series.bool(self) Return the bool of a single element PandasObject.
pandas.Series.__array__
Series.__array__(self, dtype=None)
Return the values as a NumPy array.
Users should not call this directly. Rather, it is invoked by numpy.array() and numpy.asarray().
Parameters
dtype [str or numpy.dtype, optional] The dtype to use for the resulting NumPy array.
By default, the dtype is inferred from the data.
Returns
numpy.ndarray The values in the series converted to a numpy.ndarray with the
specified dtype.
See also:
Examples
Or the values may be localized to UTC and the tzinfo discarded with dtype='datetime64[ns]'.
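A sketch of that call, assuming numpy is imported as np and a tz-aware series (the dates and tz are illustrative):
>>> ser = pd.Series(pd.date_range('2000', periods=2, tz="CET"))
>>> np.asarray(ser, dtype="datetime64[ns]")
array(['1999-12-31T23:00:00.000000000', '2000-01-01T23:00:00.000000000'],
      dtype='datetime64[ns]')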
Series.get(self, key[, default]) Get item from object for given key (ex: DataFrame
column).
Series.at Access a single value for a row/column label pair.
Series.iat Access a single value for a row/column pair by integer position.
Series.loc Access a group of rows and columns by label(s) or
a boolean array.
Series.iloc Purely integer-location based indexing for selection
by position.
Series.__iter__(self) Return an iterator of the values.
Series.items(self) Lazily iterate over (index, value) tuples.
Series.iteritems(self) Lazily iterate over (index, value) tuples.
Series.keys(self) Return alias for index.
Series.pop(self, item) Return item and drop from frame.
Series.item(self) Return the first element of the underlying data as
a python scalar.
Series.xs(self, key[, axis, level, drop_level]) Return cross-section from the Series/DataFrame.
pandas.Series.__iter__
Series.__iter__(self )
Return an iterator of the values.
These are each a scalar type, which is a Python scalar (for str, int, float) or a pandas scalar (for
Timestamp/Timedelta/Interval/Period)
Returns
iterator
For more information on .at, .iat, .loc, and .iloc, see the indexing documentation.
Series.add(self, other[, level, fill_value, …]) Return Addition of series and other, element-wise
(binary operator add).
Series.sub(self, other[, level, fill_value, …]) Return Subtraction of series and other, element-
wise (binary operator sub).
Series.mul(self, other[, level, fill_value, …]) Return Multiplication of series and other, element-
wise (binary operator mul).
Series.div(self, other[, level, fill_value, …]) Return Floating division of series and other,
element-wise (binary operator truediv).
Series.truediv(self, other[, level, …]) Return Floating division of series and other,
element-wise (binary operator truediv).
Series.floordiv(self, other[, level, …]) Return Integer division of series and other, element-
wise (binary operator floordiv).
Series.mod(self, other[, level, fill_value, …]) Return Modulo of series and other, element-wise
(binary operator mod).
Series.pow(self, other[, level, fill_value, …]) Return Exponential power of series and other,
element-wise (binary operator pow).
Series.align(self, other[, join, axis, …]) Align two objects on their axes with the specified
join method for each axis Index.
Series.drop(self[, labels, axis, index, …]) Return Series with specified index labels removed.
Series.droplevel(self, level[, axis]) Return DataFrame with requested index / column
level(s) removed.
Series.drop_duplicates(self[, keep, inplace]) Return Series with duplicate values removed.
6.3.13 Accessors
Pandas provides dtype-specific methods under various accessors. These are separate namespaces within
Series that only apply to specific data types.
Datetimelike properties
Series.dt can be used to access the values of the series as datetimelike and return several properties. These
can be accessed like Series.dt.<property>.
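For example (a minimal sketch; the dates are illustrative):
>>> s = pd.Series(pd.date_range('2000-01-01', periods=3, freq='D'))
>>> s.dt.day
0    1
1    2
2    3
dtype: int64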
Datetime properties
pandas.Series.dt.date
Series.dt.date
Returns numpy array of python datetime.date objects (namely, the date part of Timestamps without
timezone information).
pandas.Series.dt.time
Series.dt.time
Returns numpy array of datetime.time. The time part of the Timestamps.
pandas.Series.dt.timetz
Series.dt.timetz
Returns numpy array of datetime.time also containing timezone information. The time part of the
Timestamps.
pandas.Series.dt.year
Series.dt.year
The year of the datetime.
pandas.Series.dt.month
Series.dt.month
The month as January=1, December=12.
pandas.Series.dt.day
Series.dt.day
The days of the datetime.
pandas.Series.dt.hour
Series.dt.hour
The hours of the datetime.
pandas.Series.dt.minute
Series.dt.minute
The minutes of the datetime.
pandas.Series.dt.second
Series.dt.second
The seconds of the datetime.
pandas.Series.dt.microsecond
Series.dt.microsecond
The microseconds of the datetime.
pandas.Series.dt.nanosecond
Series.dt.nanosecond
The nanoseconds of the datetime.
pandas.Series.dt.week
Series.dt.week
The week ordinal of the year.
pandas.Series.dt.weekofyear
Series.dt.weekofyear
The week ordinal of the year.
pandas.Series.dt.dayofweek
Series.dt.dayofweek
The day of the week with Monday=0, Sunday=6.
Return the day of the week. It is assumed the week starts on Monday, which is denoted by 0, and ends
on Sunday, which is denoted by 6. This method is available both on Series with datetime values (using
the dt accessor) and on DatetimeIndex.
Returns
Series or Index Containing integers indicating the day number.
See also:
Series.dt.dayofweek Alias.
Series.dt.weekday Alias.
Series.dt.day_name Returns the name of the day of the week.
Examples
pandas.Series.dt.weekday
Series.dt.weekday
The day of the week with Monday=0, Sunday=6.
Return the day of the week. It is assumed the week starts on Monday, which is denoted by 0, and ends
on Sunday, which is denoted by 6. This method is available both on Series with datetime values (using
the dt accessor) and on DatetimeIndex.
Returns
Series or Index Containing integers indicating the day number.
See also:
Series.dt.dayofweek Alias.
Series.dt.weekday Alias.
Series.dt.day_name Returns the name of the day of the week.
Examples
pandas.Series.dt.dayofyear
Series.dt.dayofyear
The ordinal day of the year.
pandas.Series.dt.quarter
Series.dt.quarter
The quarter of the date.
pandas.Series.dt.is_month_start
Series.dt.is_month_start
Indicates whether the date is the first day of the month.
Returns
Series or array For Series, returns a Series with boolean values. For DatetimeIndex,
returns a boolean array.
See also:
is_month_start Return a boolean indicating whether the date is the first day of the month.
is_month_end Return a boolean indicating whether the date is the last day of the month.
Examples
This method is available on Series with datetime values under the .dt accessor, and directly on DatetimeIndex.
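A sketch of both properties on a small series (the dates are illustrative):
>>> s = pd.Series(pd.date_range("2018-02-27", periods=3))
>>> s.dt.is_month_start
0    False
1    False
2     True
dtype: bool
>>> s.dt.is_month_end
0    False
1     True
2    False
dtype: bool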
pandas.Series.dt.is_month_end
Series.dt.is_month_end
Indicates whether the date is the last day of the month.
Returns
Series or array For Series, returns a Series with boolean values. For DatetimeIndex,
returns a boolean array.
See also:
is_month_start Return a boolean indicating whether the date is the first day of the month.
is_month_end Return a boolean indicating whether the date is the last day of the month.
Examples
This method is available on Series with datetime values under the .dt accessor, and directly on DatetimeIndex.
pandas.Series.dt.is_quarter_start
Series.dt.is_quarter_start
Indicator for whether the date is the first day of a quarter.
Returns
is_quarter_start [Series or DatetimeIndex] The same type as the original data with
boolean values. Series will have the same name and index. DatetimeIndex will
have the same name.
See also:
Examples
This method is available on Series with datetime values under the .dt accessor, and directly on DatetimeIndex.
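The idx in the example below (and in the is_quarter_end example that follows) might be constructed as follows (a sketch consistent with the outputs shown):
>>> idx = pd.date_range('2017-03-30', periods=4)
>>> idx
DatetimeIndex(['2017-03-30', '2017-03-31', '2017-04-01', '2017-04-02'],
              dtype='datetime64[ns]', freq='D')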
>>> idx.is_quarter_start
array([False, False, True, False])
pandas.Series.dt.is_quarter_end
Series.dt.is_quarter_end
Indicator for whether the date is the last day of a quarter.
Returns
is_quarter_end [Series or DatetimeIndex] The same type as the original data with
boolean values. Series will have the same name and index. DatetimeIndex will
have the same name.
See also:
Examples
This method is available on Series with datetime values under the .dt accessor, and directly on DatetimeIndex.
>>> idx.is_quarter_end
array([False, True, False, False])
pandas.Series.dt.is_year_start
Series.dt.is_year_start
Indicate whether the date is the first day of a year.
Returns
Series or DatetimeIndex The same type as the original data with boolean values.
Series will have the same name and index. DatetimeIndex will have the same
name.
See also:
Examples
This method is available on Series with datetime values under the .dt accessor, and directly on DatetimeIndex.
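The dates and idx in the example below (and in the is_year_end example that follows) might be constructed as follows (a sketch consistent with the outputs shown):
>>> dates = pd.Series(pd.date_range("2017-12-30", periods=3))
>>> dates
0   2017-12-30
1   2017-12-31
2   2018-01-01
dtype: datetime64[ns]
>>> idx = pd.date_range("2017-12-30", periods=3)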
>>> dates.dt.is_year_start
0 False
1 False
2 True
dtype: bool
>>> idx.is_year_start
array([False, False, True])
pandas.Series.dt.is_year_end
Series.dt.is_year_end
Indicate whether the date is the last day of the year.
Returns
Series or DatetimeIndex The same type as the original data with boolean values.
Series will have the same name and index. DatetimeIndex will have the same
name.
See also:
Examples
This method is available on Series with datetime values under the .dt accessor, and directly on DatetimeIndex.
>>> dates.dt.is_year_end
0 False
1 True
2 False
dtype: bool
>>> idx.is_year_end
array([False, True, False])
pandas.Series.dt.is_leap_year
Series.dt.is_leap_year
Boolean indicator if the date belongs to a leap year.
A leap year is a year, which has 366 days (instead of 365) including 29th of February as an intercalary
day. Leap years are years which are multiples of four with the exception of years divisible by 100 but
not by 400.
Returns
Series or ndarray Booleans indicating if dates belong to a leap year.
Examples
This method is available on Series with datetime values under the .dt accessor, and directly on DatetimeIndex.
pandas.Series.dt.daysinmonth
Series.dt.daysinmonth
The number of days in the month.
pandas.Series.dt.days_in_month
Series.dt.days_in_month
The number of days in the month.
pandas.Series.dt.tz
Series.dt.tz
Return timezone, if any.
Returns
datetime.tzinfo, pytz.tzinfo.BaseTZInfo, dateutil.tz.tz.tzfile, or None
Returns None when the array is tz-naive.
pandas.Series.dt.freq
Series.dt.freq
Datetime methods
pandas.Series.dt.to_period
Examples
pandas.Series.dt.to_pydatetime
Series.dt.to_pydatetime(self )
Return the data as an array of native Python datetime objects.
Timezone information is retained if present.
Warning: Python’s datetime uses microsecond resolution, which is lower than pandas (nanosec-
ond). The values are truncated.
Returns
numpy.ndarray Object dtype array containing native Python datetime objects.
See also:
Examples
>>> s.dt.to_pydatetime()
array([datetime.datetime(2018, 3, 10, 0, 0),
datetime.datetime(2018, 3, 11, 0, 0)], dtype=object)
>>> s.dt.to_pydatetime()
array([datetime.datetime(2018, 3, 10, 0, 0),
datetime.datetime(2018, 3, 10, 0, 0)], dtype=object)
pandas.Series.dt.tz_localize
• ‘coerce’ will return NaT if the timestamp can not be converted to the specified
time zone. Use nonexistent='NaT' instead.
Deprecated since version 0.24.0.
Returns
Same type as self Array/Index converted to the specified time zone.
Raises
TypeError If the Datetime Array/Index is tz-aware and tz is not None.
See also:
Examples
With tz=None, we can remove the time zone information while keeping the local time (not converted
to UTC):
>>> tz_aware.tz_localize(None)
DatetimeIndex(['2018-03-01 09:00:00', '2018-03-02 09:00:00',
'2018-03-03 09:00:00'],
dtype='datetime64[ns]', freq='D')
Be careful with DST changes. When there is sequential data, pandas can infer the DST time:
>>> s = pd.to_datetime(pd.Series(['2018-10-28 01:30:00',
...                               '2018-10-28 02:00:00',
...                               '2018-10-28 02:30:00',
...                               '2018-10-28 02:00:00',
...                               '2018-10-28 02:30:00',
...                               '2018-10-28 03:00:00',
...                               '2018-10-28 03:30:00']))
>>> s.dt.tz_localize('CET', ambiguous='infer')
0   2018-10-28 01:30:00+02:00
1   2018-10-28 02:00:00+02:00
2   2018-10-28 02:30:00+02:00
3   2018-10-28 02:00:00+01:00
4   2018-10-28 02:30:00+01:00
5   2018-10-28 03:00:00+01:00
6   2018-10-28 03:30:00+01:00
dtype: datetime64[ns, CET]
In some cases, inferring the DST is impossible. In such cases, you can pass an ndarray to the ambiguous
parameter to set the DST explicitly
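A sketch of the explicit case, assuming numpy is imported as np (the timestamps are illustrative):
>>> s = pd.to_datetime(pd.Series(['2018-10-28 01:20:00',
...                               '2018-10-28 02:36:00',
...                               '2018-10-28 03:46:00']))
>>> s.dt.tz_localize('CET', ambiguous=np.array([True, True, False]))
0   2018-10-28 01:20:00+02:00
1   2018-10-28 02:36:00+02:00
2   2018-10-28 03:46:00+01:00
dtype: datetime64[ns, CET]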
If the DST transition causes nonexistent times, you can shift these dates forward or backward with a
timedelta object or 'shift_forward' or 'shift_backward'.
>>> s = pd.to_datetime(pd.Series(['2015-03-29 02:30:00',
...                               '2015-03-29 03:30:00']))
>>> s.dt.tz_localize('Europe/Warsaw', nonexistent='shift_forward')
0   2015-03-29 03:00:00+02:00
1   2015-03-29 03:30:00+02:00
dtype: datetime64[ns, 'Europe/Warsaw']
>>> s.dt.tz_localize('Europe/Warsaw', nonexistent='shift_backward')
0   2015-03-29 01:59:59.999999999+01:00
1   2015-03-29 03:30:00+02:00
dtype: datetime64[ns, 'Europe/Warsaw']
>>> s.dt.tz_localize('Europe/Warsaw', nonexistent=pd.Timedelta('1H'))
0   2015-03-29 03:30:00+02:00
1   2015-03-29 03:30:00+02:00
dtype: datetime64[ns, 'Europe/Warsaw']
pandas.Series.dt.tz_convert
Examples
With the tz parameter, we can change the DatetimeIndex to other time zones:
>>> dti
DatetimeIndex(['2014-08-01 09:00:00+02:00',
'2014-08-01 10:00:00+02:00',
'2014-08-01 11:00:00+02:00'],
dtype='datetime64[ns, Europe/Berlin]', freq='H')
>>> dti.tz_convert('US/Central')
DatetimeIndex(['2014-08-01 02:00:00-05:00',
'2014-08-01 03:00:00-05:00',
'2014-08-01 04:00:00-05:00'],
dtype='datetime64[ns, US/Central]', freq='H')
With tz=None, we can remove the timezone (after converting to UTC if necessary):
>>> dti
DatetimeIndex(['2014-08-01 09:00:00+02:00',
'2014-08-01 10:00:00+02:00',
'2014-08-01 11:00:00+02:00'],
dtype='datetime64[ns, Europe/Berlin]', freq='H')
>>> dti.tz_convert(None)
DatetimeIndex(['2014-08-01 07:00:00',
'2014-08-01 08:00:00',
'2014-08-01 09:00:00'],
dtype='datetime64[ns]', freq='H')
pandas.Series.dt.normalize
Examples
pandas.Series.dt.strftime
Examples
pandas.Series.dt.round
freq [str or Offset] The frequency level to round the index to. Must be a fixed fre-
quency like ‘S’ (second) not ‘ME’ (month end). See frequency aliases for a list of
possible freq values.
ambiguous [‘infer’, bool-ndarray, ‘NaT’, default ‘raise’] Only relevant for Date-
timeIndex:
• ‘infer’ will attempt to infer fall dst-transition hours based on order
• bool-ndarray where True signifies a DST time, False designates a non-DST
time (note that this flag is only applicable for ambiguous times)
• ‘NaT’ will return NaT where there are ambiguous times
• ‘raise’ will raise an AmbiguousTimeError if there are ambiguous times
New in version 0.24.0.
nonexistent [‘shift_forward’, ‘shift_backward’, ‘NaT’, timedelta, default ‘raise’] A
nonexistent time does not exist in a particular timezone where clocks moved for-
ward due to DST.
• ‘shift_forward’ will shift the nonexistent time forward to the closest existing
time
• ‘shift_backward’ will shift the nonexistent time backward to the closest ex-
isting time
• ‘NaT’ will return NaT where there are nonexistent times
• timedelta objects will shift nonexistent times by the timedelta
• ‘raise’ will raise an NonExistentTimeError if there are nonexistent times
New in version 0.24.0.
Returns
DatetimeIndex, TimedeltaIndex, or Series Index of the same type for a Date-
timeIndex or TimedeltaIndex, or a Series with the same index for a Series.
Raises
ValueError if the freq cannot be converted.
Examples
DatetimeIndex
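The rng used below might be constructed as follows (a sketch; rounding also works directly on the DatetimeIndex):
>>> rng = pd.date_range('2018-01-01 11:59:00', periods=3, freq='min')
>>> rng.round('H')
DatetimeIndex(['2018-01-01 12:00:00', '2018-01-01 12:00:00',
               '2018-01-01 12:00:00'],
              dtype='datetime64[ns]', freq=None)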
Series
>>> pd.Series(rng).dt.round("H")
0 2018-01-01 12:00:00
1 2018-01-01 12:00:00
2 2018-01-01 12:00:00
dtype: datetime64[ns]
pandas.Series.dt.floor
Examples
DatetimeIndex
Series
>>> pd.Series(rng).dt.floor("H")
0 2018-01-01 11:00:00
1 2018-01-01 12:00:00
2 2018-01-01 12:00:00
dtype: datetime64[ns]
pandas.Series.dt.ceil
Examples
DatetimeIndex
Series
>>> pd.Series(rng).dt.ceil("H")
0 2018-01-01 12:00:00
1 2018-01-01 12:00:00
2 2018-01-01 13:00:00
dtype: datetime64[ns]
pandas.Series.dt.month_name
Examples
pandas.Series.dt.day_name
Examples
Period properties
Series.dt.qyear
Series.dt.start_time
Series.dt.end_time
pandas.Series.dt.qyear
Series.dt.qyear
pandas.Series.dt.start_time
Series.dt.start_time
pandas.Series.dt.end_time
Series.dt.end_time
Timedelta properties
pandas.Series.dt.days
Series.dt.days
Number of days for each element.
pandas.Series.dt.seconds
Series.dt.seconds
Number of seconds (>= 0 and less than 1 day) for each element.
pandas.Series.dt.microseconds
Series.dt.microseconds
Number of microseconds (>= 0 and less than 1 second) for each element.
pandas.Series.dt.nanoseconds
Series.dt.nanoseconds
Number of nanoseconds (>= 0 and less than 1 microsecond) for each element.
pandas.Series.dt.components
Series.dt.components
Return a Dataframe of the components of the Timedeltas.
Returns
DataFrame
Examples
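A minimal sketch, assuming numpy is imported as np (the timedeltas are illustrative):
>>> s = pd.Series(pd.to_timedelta(np.arange(5), unit='s'))
>>> s.dt.components
   days  hours  minutes  seconds  milliseconds  microseconds  nanoseconds
0     0      0        0        0             0             0            0
1     0      0        0        1             0             0            0
2     0      0        0        2             0             0            0
3     0      0        0        3             0             0            0
4     0      0        0        4             0             0            0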
Timedelta methods
pandas.Series.dt.to_pytimedelta
Series.dt.to_pytimedelta(self )
Return an array of native datetime.timedelta objects.
Python's standard datetime library uses a different representation for timedeltas. This method converts
a Series of pandas Timedeltas to datetime.timedelta format with the same length as the original Series.
Returns
a [numpy.ndarray] 1D array containing data with datetime.timedelta type.
See also:
datetime.timedelta
Examples
>>> s.dt.to_pytimedelta()
array([datetime.timedelta(0), datetime.timedelta(1),
datetime.timedelta(2), datetime.timedelta(3),
datetime.timedelta(4)], dtype=object)
pandas.Series.dt.total_seconds
Examples
Series
>>> s.dt.total_seconds()
0 0.0
1 86400.0
2 172800.0
3 259200.0
4 345600.0
dtype: float64
TimedeltaIndex
>>> idx.total_seconds()
Float64Index([0.0, 86400.0, 172800.0, 259200.00000000003, 345600.0],
dtype='float64')
String handling
Series.str can be used to access the values of the series as strings and apply several methods to it. These
can be accessed like Series.str.<function/property>.
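For example (a minimal sketch; the values are illustrative):
>>> pd.Series(['A', 'b', 'C']).str.lower()
0    a
1    b
2    c
dtype: object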
pandas.Series.str.capitalize
Series.str.capitalize(self )
Convert strings in the Series/Index to be capitalized.
Equivalent to str.capitalize().
Returns
Series/Index of objects
See also:
Examples
>>> s.str.lower()
0 lower
1 capitals
2 this is a sentence
3 swapcase
dtype: object
>>> s.str.upper()
0 LOWER
1 CAPITALS
2 THIS IS A SENTENCE
3 SWAPCASE
dtype: object
>>> s.str.title()
0 Lower
1 Capitals
2 This Is A Sentence
3 Swapcase
dtype: object
>>> s.str.capitalize()
0 Lower
1 Capitals
2 This is a sentence
3 Swapcase
dtype: object
>>> s.str.swapcase()
0 LOWER
1 capitals
2 THIS IS A SENTENCE
3 sWaPcAsE
dtype: object
pandas.Series.str.casefold
Series.str.casefold(self )
Convert strings in the Series/Index to be casefolded.
New in version 0.25.0.
Equivalent to str.casefold().
Returns
Series/Index of objects
See also:
Examples
>>> s.str.lower()
0 lower
1 capitals
2 this is a sentence
3 swapcase
dtype: object
>>> s.str.upper()
0 LOWER
1 CAPITALS
2 THIS IS A SENTENCE
3 SWAPCASE
dtype: object
>>> s.str.title()
0 Lower
1 Capitals
2 This Is A Sentence
3 Swapcase
dtype: object
>>> s.str.capitalize()
0 Lower
1 Capitals
2 This is a sentence
3 Swapcase
dtype: object
>>> s.str.swapcase()
0 LOWER
1 capitals
2 THIS IS A SENTENCE
3 sWaPcAsE
dtype: object
pandas.Series.str.cat
If others is not passed, then all values in the Series/Index are concatenated into a single string with a
given sep.
Parameters
others [Series, Index, DataFrame, np.ndarray or list-like] Series, Index, DataFrame,
np.ndarray (one- or two-dimensional) and other list-likes of strings must have the
same length as the calling Series/Index, with the exception of indexed objects (i.e.
Series/Index/DataFrame) if join is not None.
If others is a list-like that contains a combination of Series, Index or np.ndarray
(1-dim), then all elements will be unpacked and must satisfy the above criteria
individually.
If others is None, the method returns the concatenation of all strings in the calling
Series/Index.
sep [str, default ‘’] The separator between the different elements/columns. By default
the empty string ‘’ is used.
na_rep [str or None, default None] Representation that is inserted for all missing
values:
• If na_rep is None, and others is None, missing values in the Series/Index are
omitted from the result.
• If na_rep is None, and others is not None, a row containing a missing value
in any of the columns (before concatenation) will have a missing value in the
result.
join [{‘left’, ‘right’, ‘outer’, ‘inner’}, default None] Determines the join-style between
the calling Series/Index and any Series/Index/DataFrame in others (objects with-
out an index need to match the length of the calling Series/Index). If None,
alignment is disabled, but this option will be removed in a future version of pan-
das and replaced with a default of ‘left’. To disable alignment, use .values on any
Series/Index/DataFrame in others.
New in version 0.23.0.
Returns
str, Series or Index If others is None, str is returned, otherwise a Series/Index
(same type as caller) of objects is returned.
See also:
Examples
When not passing others, all values are concatenated into a single string:
By default, NA values in the Series are ignored. Using na_rep, they can be given a representation:
If others is specified, corresponding values are concatenated with the separator. Result will be a Series
of strings.
Missing values will remain missing in the result, but can again be represented using na_rep
Series with different indexes can be aligned before concatenation. The join-keyword works as in other
methods.
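A sketch of the first four cases, assuming numpy is imported as np (the values are illustrative):
>>> s = pd.Series(['a', 'b', np.nan, 'd'])
>>> s.str.cat(sep=' ')
'a b d'
>>> s.str.cat(sep=' ', na_rep='?')
'a b ? d'
>>> s.str.cat(['A', 'B', 'C', 'D'], sep=',')
0    a,A
1    b,B
2    NaN
3    d,D
dtype: object
>>> s.str.cat(['A', 'B', 'C', 'D'], sep=',', na_rep='-')
0    a,A
1    b,B
2    -,C
3    d,D
dtype: object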
pandas.Series.str.center
pandas.Series.str.contains
Examples
Specifying na to be False instead of NaN replaces NaN values with False. If Series or Index does not
contain NaN values the resultant dtype will be bool, otherwise, an object dtype.
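The s1 in the example below might be defined as follows; a sketch of the na=False case described above is included (the values are assumptions consistent with the outputs shown):
>>> s1 = pd.Series(['Mouse', 'dog', 'house and parrot', '23', np.NaN])
>>> s1.str.contains('og', na=False, regex=True)
0    False
1     True
2    False
3    False
4    False
dtype: bool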
>>> import re
>>> s1.str.contains('PARROT', flags=re.IGNORECASE, regex=True)
0 False
1 False
2 True
3 False
4 NaN
dtype: object
Ensure pat is not a literal pattern when regex is set to True. Note in the following example one might
expect only s2[1] and s2[3] to return True. However, ‘.0’ as a regex matches any character followed by
a 0.
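A sketch of that example (the values are assumptions):
>>> s2 = pd.Series(['40', '40.0', '41', '41.0', '35'])
>>> s2.str.contains('.0', regex=True)
0     True
1     True
2    False
3     True
4    False
dtype: bool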
pandas.Series.str.count
Notes
Some characters need to be escaped when passing in pat. eg. '$' has a special meaning in regex and
must be escaped when finding this literal character.
Examples
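A minimal sketch of the escaping described in the notes (the values are illustrative):
>>> pd.Series(['$', 'B', 'Aab$', '$$ca', 'C$B$', 'cat']).str.count('\\$')
0    1
1    0
2    1
3    2
4    2
5    0
dtype: int64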
pandas.Series.str.decode
pandas.Series.str.encode
pandas.Series.str.endswith
Examples
>>> s.str.endswith('t')
0 True
1 False
2 False
3 NaN
dtype: object
pandas.Series.str.extract
Examples
A pattern with two groups will return a DataFrame with two columns. Non-matches will be NaN.
>>> s.str.extract(r'([ab])?(\d)')
0 1
0 a 1
1 b 2
2 NaN 3
>>> s.str.extract(r'(?P<letter>[ab])(?P<digit>\d)')
letter digit
0 a 1
1 b 2
2 NaN NaN
A pattern with one group will return a DataFrame with one column if expand=True.
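A sketch of the one-group case with expand=True; the series is an assumption consistent with the outputs above:
>>> s = pd.Series(['a1', 'b2', 'c3'])
>>> s.str.extract(r'[ab](\d)', expand=True)
     0
0    1
1    2
2  NaN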
pandas.Series.str.extractall
See also:
Examples
A pattern with one group will return a DataFrame with one column. Indices with no matches will not
appear in the result.
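A sketch of this case; the series is an assumption consistent with the outputs below:
>>> s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])
>>> s.str.extractall(r"[ab](\d)")
        0
  match
A 0     1
  1     2
B 0     1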
Capture group names are used for column names of the result.
>>> s.str.extractall(r"[ab](?P<digit>\d)")
digit
match
A 0 1
1 2
B 0 1
A pattern with two groups will return a DataFrame with two columns.
>>> s.str.extractall(r"(?P<letter>[ab])(?P<digit>\d)")
letter digit
match
A 0 a 1
1 a 2
B 0 b 1
>>> s.str.extractall(r"(?P<letter>[ab])?(?P<digit>\d)")
letter digit
match
A 0 a 1
1 a 2
B 0 b 1
C 0 NaN 1
pandas.Series.str.find
pandas.Series.str.findall
count Count occurrences of pattern or regular expression in each string of the Series/Index.
extractall For each string in the Series, extract groups from all matches of regular expression and
return a DataFrame with one row for each match and one column for each group.
re.findall The equivalent re function to all non-overlapping matches of pattern or regular expression
in string, as a list of strings.
Examples
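The series assumed by the examples below might be (a sketch consistent with the outputs shown):
>>> s = pd.Series(['Lion', 'Monkey', 'Rabbit'])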
>>> s.str.findall('Monkey')
0 []
1 [Monkey]
2 []
dtype: object
On the other hand, the search for the pattern ‘MONKEY’ doesn’t return any match:
>>> s.str.findall('MONKEY')
0 []
1 []
2 []
dtype: object
Flags can be added to the pattern or regular expression. For instance, to find the pattern ‘MONKEY’
ignoring the case:
>>> import re
>>> s.str.findall('MONKEY', flags=re.IGNORECASE)
0 []
1 [Monkey]
2 []
dtype: object
When the pattern matches more than one string in the Series, all matches are returned:
>>> s.str.findall('on')
0 [on]
1 [on]
2 []
dtype: object
Regular expressions are supported too. For instance, the search for all the strings ending with the
word ‘on’ is shown next:
>>> s.str.findall('on$')
0 [on]
1 []
2 []
dtype: object
If the pattern is found more than once in the same string, then a list of multiple strings is returned:
>>> s.str.findall('b')
0 []
1 []
2 [b, b]
dtype: object
pandas.Series.str.get
Series.str.get(self, i)
Extract element from each component at specified position.
Extract element from lists, tuples, or strings in each element in the Series/Index.
Parameters
i [int] Position of element to extract.
Returns
Series or Index
Examples
>>> s = pd.Series(["String",
... (1, 2, 3),
... ["a", "b", "c"],
... 123,
... -456,
... {1: "Hello", "2": "World"}])
>>> s
0 String
1 (1, 2, 3)
2 [a, b, c]
3 123
4 -456
5 {1: 'Hello', '2': 'World'}
dtype: object
>>> s.str.get(1)
0 t
1 2
2 b
3 NaN
4 NaN
5 Hello
dtype: object
>>> s.str.get(-1)
0 g
1 3
2 c
3 NaN
4 NaN
5 None
dtype: object
pandas.Series.str.index
pandas.Series.str.join
Series.str.join(self, sep)
Join lists contained as elements in the Series/Index with passed delimiter.
If the elements of a Series are lists themselves, join the content of these lists using the delimiter passed
to the function. This function is an equivalent to str.join().
Parameters
sep [str] Delimiter to use between list entries.
Returns
Series/Index: object The list entries concatenated by intervening occurrences of
the delimiter.
Raises
AttributeError If the supplied Series contains neither strings nor lists.
See also:
Notes
If any of the list items is not a string object, the result of the join will be NaN.
Examples
Join all lists using a ‘-‘. The lists containing object(s) of types other than str will produce a NaN.
>>> s.str.join('-')
0 lion-elephant-zebra
1 NaN
2 NaN
pandas.Series.str.len
Series.str.len(self )
Compute the length of each element in the Series/Index. The element may be a sequence (such as a
string, tuple or list) or a collection (such as a dictionary).
Returns
Series or Index of int A Series or Index of integer values indicating the length of
each element in the Series or Index.
See also:
Examples
Returns the length (number of characters) in a string. Returns the number of entries for dictionaries,
lists or tuples.
>>> s = pd.Series(['dog',
... '',
... 5,
... {'foo' : 'bar'},
... [2, 3, 5, 7],
... ('one', 'two', 'three')])
>>> s
0 dog
1
2 5
3 {'foo': 'bar'}
4 [2, 3, 5, 7]
5 (one, two, three)
dtype: object
>>> s.str.len()
0 3.0
1 0.0
2 NaN
3 1.0
4 4.0
5 3.0
dtype: float64
pandas.Series.str.ljust
pandas.Series.str.lower
Series.str.lower(self )
Convert strings in the Series/Index to lowercase.
Equivalent to str.lower().
Returns
Series/Index of objects
See also:
Examples
>>> s.str.lower()
0 lower
1 capitals
2 this is a sentence
3 swapcase
dtype: object
>>> s.str.upper()
0 LOWER
1 CAPITALS
2 THIS IS A SENTENCE
3 SWAPCASE
dtype: object
>>> s.str.title()
0 Lower
1 Capitals
2 This Is A Sentence
3 Swapcase
dtype: object
>>> s.str.capitalize()
0 Lower
1 Capitals
2 This is a sentence
3 Swapcase
dtype: object
>>> s.str.swapcase()
0 LOWER
1 capitals
2 THIS IS A SENTENCE
3 sWaPcAsE
dtype: object
pandas.Series.str.lstrip
Series.str.lstrip(self, to_strip=None)
Remove leading characters.
Strip whitespaces (including newlines) or a set of specified characters from each string in the Se-
ries/Index from left side. Equivalent to str.lstrip().
Parameters
to_strip [str or None, default None] Specifying the set of characters to be removed.
All combinations of this set of characters will be stripped. If None then whites-
paces are removed.
Returns
Series/Index of objects
See also:
Examples
>>> s.str.strip()
0 1. Ant.
1 2. Bee!
2 3. Cat?
3 NaN
dtype: object
>>> s.str.lstrip('123.')
0 Ant.
1 Bee!\n
2 Cat?\t
3 NaN
dtype: object
pandas.Series.str.match
pandas.Series.str.normalize
Series.str.normalize(self, form)
Return the Unicode normal form for the strings in the Series/Index. For more information on the
forms, see the unicodedata.normalize().
Parameters
form [{‘NFC’, ‘NFKC’, ‘NFD’, ‘NFKD’}] Unicode form
Returns
normalized [Series/Index of objects]
pandas.Series.str.pad
Series.str.rjust Fills the left side of strings with an arbitrary character. Equivalent to Series.
str.pad(side='left').
Series.str.ljust Fills the right side of strings with an arbitrary character. Equivalent to Series.
str.pad(side='right').
Series.str.center Fills both sides of strings with an arbitrary character. Equivalent to Series.
str.pad(side='both').
Series.str.zfill Pad strings in the Series/Index by prepending ‘0’ character. Equivalent to Series.
str.pad(side='left', fillchar='0').
Examples
>>> s.str.pad(width=10)
0 caribou
1 tiger
dtype: object
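The series above might be defined as follows; a sketch of the side and fillchar parameters is included (the values are illustrative):
>>> s = pd.Series(["caribou", "tiger"])
>>> s.str.pad(width=10, side='right', fillchar='-')
0    caribou---
1    tiger-----
dtype: object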
pandas.Series.str.partition
Examples
>>> s.str.partition()
        0  1             2
0   Linda     van der Berg
1  George      Pitt-Rivers
>>> s.str.rpartition()
               0  1            2
0  Linda van der            Berg
1         George     Pitt-Rivers
>>> s.str.partition('-')
                    0  1       2
0  Linda van der Berg
1         George Pitt  -  Rivers
>>> idx.str.partition()
MultiIndex([('X', ' ', '123'),
('Y', ' ', '999')],
dtype='object')
>>> idx.str.partition(expand=False)
Index([('X', ' ', '123'), ('Y', ' ', '999')], dtype='object')
pandas.Series.str.repeat
Series.str.repeat(self, repeats)
Duplicate each string in the Series or Index.
Parameters
repeats [int or sequence of int] Same value for all (int) or different value per (se-
quence).
Returns
Series or Index of object Series or Index of repeated string objects specified by
input parameter repeats.
Examples
>>> s.str.repeat(repeats=2)
0 aa
1 bb
2 cc
dtype: object
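A sketch of the per-element form (the values are illustrative):
>>> s = pd.Series(['a', 'b', 'c'])
>>> s.str.repeat(repeats=[1, 2, 3])
0      a
1     bb
2    ccc
dtype: object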
pandas.Series.str.replace
Notes
When pat is a compiled regex, all flags should be included in the compiled regex. Use of case, flags, or
regex=False with a compiled regex will raise an error.
Examples
When pat is a string and regex is True (the default), the given pat is compiled as a regex. When repl
is a string, it replaces matching regex patterns as with re.sub(). NaN value(s) in the Series are left
as is:
When pat is a string and regex is False, every pat is replaced with repl as with str.replace():
When repl is a callable, it is called on every pat using re.sub(). The callable should expect one
positional argument (a regex object) and return a string.
To get the idea:
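A minimal sketch of a callable repl, assuming numpy is imported as np (the lambda and values are illustrative):
>>> repl = lambda m: m.group(0).upper()
>>> pd.Series(['foo', 'fuz', np.nan]).str.replace('f', repl)
0    Foo
1    Fuz
2    NaN
dtype: object
A compiled regex, with its flags included at compile time, can also be passed as pat: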
>>> import re
>>> regex_pat = re.compile(r'FUZ', flags=re.IGNORECASE)
>>> pd.Series(['foo', 'fuz', np.nan]).str.replace(regex_pat, 'bar')
0 foo
1 bar
2 NaN
dtype: object
pandas.Series.str.rfind
pandas.Series.str.rindex
pandas.Series.str.rjust
pandas.Series.str.rpartition
Parameters
sep [str, default whitespace] String to split on.
pat [str, default whitespace] Deprecated since version 0.24.0: Use sep instead
expand [bool, default True] If True, return DataFrame/MultiIndex expanding di-
mensionality. If False, return Series/Index.
Returns
DataFrame/MultiIndex or Series/Index of objects
See also:
Examples
>>> s.str.partition()
        0  1             2
0   Linda     van der Berg
1  George      Pitt-Rivers
>>> s.str.rpartition()
               0  1            2
0  Linda van der            Berg
1         George     Pitt-Rivers
>>> s.str.partition('-')
                    0  1       2
0  Linda van der Berg
1         George Pitt  -  Rivers
>>> idx.str.partition()
MultiIndex([('X', ' ', '123'),
('Y', ' ', '999')],
dtype='object')
>>> idx.str.partition(expand=False)
Index([('X', ' ', '123'), ('Y', ' ', '999')], dtype='object')
pandas.Series.str.rstrip
Series.str.rstrip(self, to_strip=None)
Remove trailing characters.
Strip whitespaces (including newlines) or a set of specified characters from each string in the Se-
ries/Index from right side. Equivalent to str.rstrip().
Parameters
to_strip [str or None, default None] Specifying the set of characters to be removed.
All combinations of this set of characters will be stripped. If None then whites-
paces are removed.
Returns
Series/Index of objects
See also:
Examples
>>> s.str.strip()
0 1. Ant.
1 2. Bee!
2 3. Cat?
3 NaN
dtype: object
>>> s.str.lstrip('123.')
0 Ant.
1 Bee!\n
2 Cat?\t
3 NaN
dtype: object
pandas.Series.str.slice
Examples
>>> s.str.slice(start=1)
0 oala
1 ox
2 hameleon
dtype: object
>>> s.str.slice(stop=2)
0 ko
1 fo
2 ch
dtype: object
>>> s.str.slice(step=2)
0 kaa
1 fx
2 caeen
dtype: object
>>> s.str[0:5:3]
0 kl
1 f
2 cm
dtype: object
pandas.Series.str.slice_replace
repl [str, optional] String for replacement. If not specified (None), the sliced region
is replaced with an empty string.
Returns
Series or Index Same type as the original object.
See also:
Examples
Specify just start, meaning replace start until the end of the string with repl.
Specify just stop, meaning the start of the string to stop is replaced with repl, and the rest of the string
is included.
Specify start and stop, meaning the slice from start to stop is replaced with repl. Everything before or
after start and stop is included as is.
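A sketch of the start/stop case (the values are illustrative):
>>> s = pd.Series(['a', 'ab', 'abc', 'abdc', 'abcde'])
>>> s.str.slice_replace(1, 3, 'X')
0       aX
1       aX
2       aX
3      aXc
4     aXde
dtype: object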
pandas.Series.str.split
Notes
Examples
>>> s.str.split()
0 [this, is, a, regular, sentence]
1 [https://docs.python.org/3/tutorial/index.html]
2 NaN
dtype: object
Without the n parameter, the outputs of rsplit and split are identical.
>>> s.str.rsplit()
0 [this, is, a, regular, sentence]
1 [https://docs.python.org/3/tutorial/index.html]
2 NaN
dtype: object
The n parameter can be used to limit the number of splits on the delimiter. The outputs of split and
rsplit are different.
>>> s.str.split(n=2)
0 [this, is, a regular sentence]
1 [https://docs.python.org/3/tutorial/index.html]
2 NaN
dtype: object
>>> s.str.rsplit(n=2)
0 [this is a, regular, sentence]
1 [https://docs.python.org/3/tutorial/index.html]
2 NaN
dtype: object
When using expand=True, the split elements will expand out into separate columns. If NaN is present,
it is propagated throughout the columns during the split.
>>> s.str.split(expand=True)
                                               0     1     2        3         4
0                                           this    is     a  regular  sentence
1  https://docs.python.org/3/tutorial/index.html  None  None     None      None
2                                            NaN   NaN   NaN      NaN       NaN
For slightly more complex use cases like splitting the html document name from a url, a combination
of parameter settings can be used.
>>> s = pd.Series(["1+1=2"])
pandas.Series.str.rsplit
Notes
Examples
>>> s.str.split()
0 [this, is, a, regular, sentence]
1 [https://docs.python.org/3/tutorial/index.html]
2 NaN
dtype: object
Without the n parameter, the outputs of rsplit and split are identical.
>>> s.str.rsplit()
0 [this, is, a, regular, sentence]
1 [https://docs.python.org/3/tutorial/index.html]
2 NaN
dtype: object
The n parameter can be used to limit the number of splits on the delimiter. The outputs of split and
rsplit are different.
>>> s.str.split(n=2)
0 [this, is, a regular sentence]
1 [https://docs.python.org/3/tutorial/index.html]
2 NaN
dtype: object
>>> s.str.rsplit(n=2)
0 [this is a, regular, sentence]
1 [https://docs.python.org/3/tutorial/index.html]
2 NaN
dtype: object
When using expand=True, the split elements will expand out into separate columns. If NaN is present,
it is propagated throughout the columns during the split.
>>> s.str.split(expand=True)
                                               0     1     2        3         4
0                                           this    is     a  regular  sentence
1  https://docs.python.org/3/tutorial/index.html  None  None     None      None
2                                            NaN   NaN   NaN      NaN       NaN
For slightly more complex use cases like splitting the html document name from a url, a combination
of parameter settings can be used.
>>> s = pd.Series(["1+1=2"])
pandas.Series.str.startswith
Examples
>>> s.str.startswith('b')
0 True
1 False
2 False
3 NaN
dtype: object
pandas.Series.str.strip
Series.str.strip(self, to_strip=None)
Remove leading and trailing characters.
Strip whitespaces (including newlines) or a set of specified characters from each string in the Se-
ries/Index from left and right sides. Equivalent to str.strip().
Parameters
to_strip [str or None, default None] Specifying the set of characters to be removed.
All combinations of this set of characters will be stripped. If None then whites-
paces are removed.
Returns
Series/Index of objects
See also:
Examples
>>> s.str.strip()
0 1. Ant.
1 2. Bee!
2 3. Cat?
3 NaN
dtype: object
>>> s.str.lstrip('123.')
0 Ant.
1 Bee!\n
2 Cat?\t
3 NaN
dtype: object
pandas.Series.str.swapcase
Series.str.swapcase(self )
Convert strings in the Series/Index to be swapcased.
Equivalent to str.swapcase().
Returns
Series/Index of objects
See also:
Series.str.title Converts first character of each word to uppercase and remaining to lowercase.
Series.str.capitalize Converts first character to uppercase and remaining to lowercase.
Series.str.swapcase Converts uppercase to lowercase and lowercase to uppercase.
Series.str.casefold Removes all case distinctions in the string.
Examples
>>> s.str.lower()
0 lower
1 capitals
2 this is a sentence
3 swapcase
dtype: object
>>> s.str.upper()
0 LOWER
1 CAPITALS
2 THIS IS A SENTENCE
3 SWAPCASE
dtype: object
>>> s.str.title()
0 Lower
1 Capitals
2 This Is A Sentence
3 Swapcase
dtype: object
>>> s.str.capitalize()
0 Lower
1 Capitals
2 This is a sentence
3 Swapcase
dtype: object
>>> s.str.swapcase()
0 LOWER
1 capitals
2 THIS IS A SENTENCE
3 sWaPcAsE
dtype: object
pandas.Series.str.title
Series.str.title(self )
Convert strings in the Series/Index to titlecase.
Equivalent to str.title().
Returns
Series/Index of objects
See also:
Examples
>>> s.str.lower()
0 lower
1 capitals
2 this is a sentence
3 swapcase
dtype: object
>>> s.str.upper()
0 LOWER
1 CAPITALS
2 THIS IS A SENTENCE
3 SWAPCASE
dtype: object
>>> s.str.title()
0 Lower
1 Capitals
2 This Is A Sentence
3 Swapcase
dtype: object
>>> s.str.capitalize()
0 Lower
1 Capitals
2 This is a sentence
3 Swapcase
dtype: object
>>> s.str.swapcase()
0 LOWER
1 capitals
2 THIS IS A SENTENCE
3 sWaPcAsE
dtype: object
pandas.Series.str.translate
Series.str.translate(self, table)
Map all characters in the string through the given mapping table. Equivalent to standard str.
translate().
Parameters
table [dict] table is a mapping of Unicode ordinals to Unicode ordinals, strings, or
None. Unmapped characters are left untouched. Characters mapped to None are
deleted. str.maketrans() is a helper function for making translation tables.
Returns
Series or Index
pandas.Series.str.upper
Series.str.upper(self )
Convert strings in the Series/Index to uppercase.
Equivalent to str.upper().
Returns
Series/Index of objects
See also:
Examples
>>> s.str.lower()
0 lower
1 capitals
2 this is a sentence
3 swapcase
dtype: object
>>> s.str.upper()
0 LOWER
1 CAPITALS
2 THIS IS A SENTENCE
3 SWAPCASE
dtype: object
>>> s.str.title()
0 Lower
1 Capitals
2 This Is A Sentence
3 Swapcase
dtype: object
>>> s.str.capitalize()
0 Lower
1 Capitals
2 This is a sentence
3 Swapcase
dtype: object
>>> s.str.swapcase()
0 LOWER
1 capitals
2 THIS IS A SENTENCE
3 sWaPcAsE
dtype: object
pandas.Series.str.wrap
Wrap long strings in the Series/Index to be formatted in paragraphs with length less than a given width.
This method has the same keyword parameters and defaults as textwrap.TextWrapper.
Parameters
width [int] Maximum line width.
expand_tabs [bool, optional] If True, tab characters will be expanded to spaces
(default: True).
replace_whitespace [bool, optional] If True, each whitespace character (as defined
by string.whitespace) remaining after tab expansion will be replaced by a single
space (default: True).
drop_whitespace [bool, optional] If True, whitespace that, after wrapping, happens
to end up at the beginning or end of a line is dropped (default: True).
break_long_words [bool, optional] If True, then words longer than width will be
broken in order to ensure that no lines are longer than width. If it is false, long
words will not be broken, and some lines may be longer than width (default:
True).
break_on_hyphens [bool, optional] If True, wrapping will occur preferably on
whitespace and right after hyphens in compound words, as it is customary in
English. If false, only whitespaces will be considered as potentially good places
for line breaks, but you need to set break_long_words to false if you want truly
insecable words (default: True).
Returns
Series or Index
Notes
Internally, this method uses a textwrap.TextWrapper instance with default settings. To achieve
behavior matching R’s stringr library str_wrap function, use the arguments:
• expand_tabs = False
• replace_whitespace = True
• drop_whitespace = True
• break_long_words = False
• break_on_hyphens = False
Examples
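A minimal sketch (the strings and width are illustrative):
>>> s = pd.Series(['line to be wrapped', 'another line to be wrapped'])
>>> s.str.wrap(12)
0             line to be\nwrapped
1    another line\nto be\nwrapped
dtype: object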
pandas.Series.str.zfill
Series.str.zfill(self, width)
Pad strings in the Series/Index by prepending ‘0’ characters.
Strings in the Series/Index are padded with ‘0’ characters on the left of the string to reach a total
string length width. Strings in the Series/Index with length greater or equal to width are unchanged.
Parameters
width [int] Minimum length of resulting string; strings with length less than width
will be prepended with '0' characters.
Returns
Series/Index of objects
See also:
Notes
Differs from str.zfill() which has special handling for ‘+’/’-‘ in the string.
Examples
Note that 10 and NaN are not strings, therefore they are converted to NaN. The minus sign in '-1' is
treated as a regular character and the zero is added to the left of it (str.zfill() would have moved
it to the left). 1000 remains unchanged as it is longer than width.
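The series assumed by the example below might be (a sketch consistent with the output shown):
>>> s = pd.Series(['-1', '1', '1000', 10, np.nan])
>>> s
0      -1
1       1
2    1000
3      10
4     NaN
dtype: object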
>>> s.str.zfill(3)
0 0-1
1 001
2 1000
3 NaN
4 NaN
dtype: object
pandas.Series.str.isalnum
Series.str.isalnum(self )
Check whether all characters in each string are alphanumeric.
This is equivalent to running the Python string method str.isalnum() for each element of the Se-
ries/Index. If a string has zero characters, False is returned for that check.
Returns
Series or Index of bool Series or Index of boolean values with the same length as
the original Series/Index.
See also:
Examples
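The example series used throughout these checks (s1, s3 and s5 below) might be defined as follows (a sketch consistent with the outputs shown):
>>> s1 = pd.Series(['one', 'one1', '1', ''])
>>> s3 = pd.Series(['23', '³', '⅕', ''])
>>> s5 = pd.Series(['leopard', 'Golden Eagle', 'SNAKE', ''])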
>>> s1.str.isalpha()
0 True
1 False
2 False
3 False
dtype: bool
>>> s1.str.isnumeric()
0 False
1 False
2 True
3 False
dtype: bool
>>> s1.str.isalnum()
0 True
1 True
2 True
3 False
dtype: bool
Note that checks against characters mixed with any additional punctuation or whitespace will evaluate
to false for an alphanumeric check.
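For example (a sketch; the values are illustrative):
>>> s2 = pd.Series(['A B', '1.5', '3,000'])
>>> s2.str.isalnum()
0    False
1    False
2    False
dtype: bool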
The s3.str.isdecimal method checks for characters used to form numbers in base 10.
>>> s3.str.isdecimal()
0 True
1 False
2 False
3 False
dtype: bool
The s.str.isdigit method is the same as s3.str.isdecimal but also includes special digits, like
superscripted and subscripted digits in unicode.
>>> s3.str.isdigit()
0 True
1 True
2 False
3 False
dtype: bool
The s.str.isnumeric method is the same as s3.str.isdigit but also includes other characters that
can represent quantities such as unicode fractions.
>>> s3.str.isnumeric()
0 True
1 True
2 True
3 False
dtype: bool
>>> s5.str.islower()
0 True
1 False
2 False
3 False
dtype: bool
>>> s5.str.isupper()
0 False
1 False
2 True
3 False
dtype: bool
The s5.str.istitle method checks for whether all words are in title case (whether only the first
letter of each word is capitalized). Words are assumed to be any sequence of non-numeric characters
separated by whitespace characters.
>>> s5.str.istitle()
0 False
1 True
2 False
3 False
dtype: bool
pandas.Series.str.isalpha
Series.str.isalpha(self )
Check whether all characters in each string are alphabetic.
This is equivalent to running the Python string method str.isalpha() for each element of the Se-
ries/Index. If a string has zero characters, False is returned for that check.
Returns
Series or Index of bool Series or Index of boolean values with the same length as
the original Series/Index.
See also:
Examples
>>> s1.str.isalpha()
0 True
1 False
2 False
3 False
dtype: bool
>>> s1.str.isnumeric()
0 False
1 False
2 True
3 False
dtype: bool
>>> s1.str.isalnum()
0 True
1 True
2 True
3 False
dtype: bool
Note that checks against characters mixed with any additional punctuation or whitespace will evaluate
to false for an alphanumeric check.
The s3.str.isdecimal method checks for characters used to form numbers in base 10.
>>> s3.str.isdecimal()
0 True
1 False
2 False
3 False
dtype: bool
The s.str.isdigit method is the same as s3.str.isdecimal but also includes special digits, like
superscripted and subscripted digits in unicode.
>>> s3.str.isdigit()
0 True
1 True
2 False
3 False
dtype: bool
The s.str.isnumeric method is the same as s3.str.isdigit but also includes other characters that
can represent quantities such as unicode fractions.
>>> s3.str.isnumeric()
0 True
1 True
2 True
3 False
dtype: bool
>>> s5.str.islower()
0 True
1 False
2 False
3 False
dtype: bool
>>> s5.str.isupper()
0 False
1 False
2 True
3 False
dtype: bool
The s5.str.istitle method checks for whether all words are in title case (whether only the first
letter of each word is capitalized). Words are assumed to be any sequence of non-numeric characters
separated by whitespace characters.
>>> s5.str.istitle()
0 False
1 True
2 False
3 False
dtype: bool
pandas.Series.str.isdigit
Series.str.isdigit(self )
Check whether all characters in each string are digits.
This is equivalent to running the Python string method str.isdigit() for each element of the Se-
ries/Index. If a string has zero characters, False is returned for that check.
Returns
Series or Index of bool Series or Index of boolean values with the same length as
the original Series/Index.
See also:
Examples
>>> s1.str.isalpha()
0 True
1 False
2 False
3 False
dtype: bool
>>> s1.str.isnumeric()
0 False
1 False
2 True
3 False
dtype: bool
>>> s1.str.isalnum()
0 True
1 True
2 True
3 False
dtype: bool
Note that checks against characters mixed with any additional punctuation or whitespace will evaluate
to false for an alphanumeric check.
The s3.str.isdecimal method checks for characters used to form numbers in base 10.
>>> s3.str.isdecimal()
0 True
1 False
2 False
3 False
dtype: bool
The s.str.isdigit method is the same as s3.str.isdecimal but also includes special digits, like
superscripted and subscripted digits in unicode.
>>> s3.str.isdigit()
0 True
1 True
2 False
3 False
dtype: bool
The s.str.isnumeric method is the same as s3.str.isdigit but also includes other characters that
can represent quantities such as unicode fractions.
>>> s3.str.isnumeric()
0 True
1 True
2 True
3 False
dtype: bool
>>> s5.str.islower()
0 True
1 False
2 False
3 False
dtype: bool
>>> s5.str.isupper()
0 False
1 False
2 True
3 False
dtype: bool
The s5.str.istitle method checks for whether all words are in title case (whether only the first
letter of each word is capitalized). Words are assumed to be any sequence of non-numeric characters
separated by whitespace characters.
>>> s5.str.istitle()
0 False
1 True
2 False
3 False
dtype: bool
pandas.Series.str.isspace
Series.str.isspace(self )
Check whether all characters in each string are whitespace.
This is equivalent to running the Python string method str.isspace() for each element of the Se-
ries/Index. If a string has zero characters, False is returned for that check.
Returns
Series or Index of bool Series or Index of boolean values with the same length as
the original Series/Index.
See also:
Examples
>>> s1.str.isalpha()
0 True
1 False
2 False
3 False
dtype: bool
>>> s1.str.isnumeric()
0 False
1 False
2 True
3 False
dtype: bool
>>> s1.str.isalnum()
0 True
1 True
2 True
3 False
dtype: bool
Note that checks against characters mixed with any additional punctuation or whitespace will evaluate
to false for an alphanumeric check.
The s3.str.isdecimal method checks for characters used to form numbers in base 10.
>>> s3.str.isdecimal()
0 True
1 False
2 False
3 False
dtype: bool
The s.str.isdigit method is the same as s3.str.isdecimal but also includes special digits, like
superscripted and subscripted digits in unicode.
>>> s3.str.isdigit()
0 True
1 True
2 False
3 False
dtype: bool
The s.str.isnumeric method is the same as s3.str.isdigit but also includes other characters that
can represent quantities such as unicode fractions.
>>> s3.str.isnumeric()
0 True
1 True
2 True
3 False
dtype: bool
>>> s5.str.islower()
0 True
1 False
2 False
3 False
dtype: bool
>>> s5.str.isupper()
0 False
1 False
2 True
3 False
dtype: bool
The s5.str.istitle method checks for whether all words are in title case (whether only the first
letter of each word is capitalized). Words are assumed to be any sequence of non-numeric characters
separated by whitespace characters.
>>> s5.str.istitle()
0 False
1 True
2 False
3 False
dtype: bool
pandas.Series.str.islower
Series.str.islower(self )
Check whether all characters in each string are lowercase.
This is equivalent to running the Python string method str.islower() for each element of the Se-
ries/Index. If a string has zero characters, False is returned for that check.
Returns
Series or Index of bool Series or Index of boolean values with the same length as
the original Series/Index.
See also:
Examples
>>> s1.str.isalpha()
0 True
1 False
2 False
3 False
dtype: bool
>>> s1.str.isnumeric()
0 False
1 False
2 True
3 False
dtype: bool
>>> s1.str.isalnum()
0 True
1 True
2 True
3 False
dtype: bool
Note that checks against characters mixed with any additional punctuation or whitespace will evaluate
to false for an alphanumeric check.
The s3.str.isdecimal method checks for characters used to form numbers in base 10.
>>> s3.str.isdecimal()
0 True
1 False
2 False
3 False
dtype: bool
The s.str.isdigit method is the same as s3.str.isdecimal but also includes special digits, like
superscripted and subscripted digits in unicode.
>>> s3.str.isdigit()
0 True
1 True
2 False
3 False
dtype: bool
The s.str.isnumeric method is the same as s3.str.isdigit but also includes other characters that
can represent quantities such as unicode fractions.
>>> s3.str.isnumeric()
0 True
1 True
2 True
3 False
dtype: bool
>>> s5.str.islower()
0 True
1 False
2 False
3 False
dtype: bool
>>> s5.str.isupper()
0 False
1 False
2 True
3 False
dtype: bool
The s5.str.istitle method checks for whether all words are in title case (whether only the first
letter of each word is capitalized). Words are assumed to be any sequence of non-numeric characters
separated by whitespace characters.
>>> s5.str.istitle()
0 False
1 True
2 False
3 False
dtype: bool
pandas.Series.str.isupper
Series.str.isupper(self )
Check whether all characters in each string are uppercase.
This is equivalent to running the Python string method str.isupper() for each element of the Se-
ries/Index. If a string has zero characters, False is returned for that check.
Returns
Series or Index of bool Series or Index of boolean values with the same length as
the original Series/Index.
See also:
Examples
>>> s1.str.isalpha()
0 True
1 False
2 False
3 False
dtype: bool
>>> s1.str.isnumeric()
0 False
1 False
2 True
3 False
dtype: bool
>>> s1.str.isalnum()
0 True
1 True
2 True
3 False
dtype: bool
Note that checks against characters mixed with any additional punctuation or whitespace will evaluate
to false for an alphanumeric check.
The s3.str.isdecimal method checks for characters used to form numbers in base 10.
>>> s3.str.isdecimal()
0 True
1 False
2 False
3 False
dtype: bool
The s.str.isdigit method is the same as s3.str.isdecimal but also includes special digits, like
superscripted and subscripted digits in unicode.
>>> s3.str.isdigit()
0 True
1 True
2 False
3 False
dtype: bool
The s.str.isnumeric method is the same as s3.str.isdigit but also includes other characters that
can represent quantities such as unicode fractions.
>>> s3.str.isnumeric()
0 True
1 True
2 True
3 False
dtype: bool
>>> s5.str.islower()
0 True
1 False
2 False
3 False
dtype: bool
>>> s5.str.isupper()
0 False
1 False
2 True
3 False
dtype: bool
The s5.str.istitle method checks whether all words are in title case (only the first letter of each
word is capitalized). Words are assumed to be any sequence of non-numeric characters separated by
whitespace characters.
>>> s5.str.istitle()
0 False
1 True
2 False
3 False
dtype: bool
pandas.Series.str.istitle
Series.str.istitle(self )
Check whether all characters in each string are titlecase.
This is equivalent to running the Python string method str.istitle() for each element of the
Series/Index. If a string has zero characters, False is returned for that check.
Returns
Series or Index of bool Series or Index of boolean values with the same length as
the original Series/Index.
See also:
Examples
>>> s1.str.isalpha()
0 True
1 False
2 False
3 False
dtype: bool
>>> s1.str.isnumeric()
0 False
1 False
2 True
3 False
dtype: bool
>>> s1.str.isalnum()
0 True
1 True
2 True
3 False
dtype: bool
Note that checks against characters mixed with any additional punctuation or whitespace will evaluate
to false for an alphanumeric check.
The s3.str.isdecimal method checks for characters used to form numbers in base 10.
>>> s3.str.isdecimal()
0 True
1 False
2 False
3 False
dtype: bool
The s.str.isdigit method is the same as s3.str.isdecimal but also includes special digits, like
superscripted and subscripted digits in unicode.
>>> s3.str.isdigit()
0 True
1 True
2 False
3 False
dtype: bool
The s.str.isnumeric method is the same as s3.str.isdigit but also includes other characters that
can represent quantities such as unicode fractions.
>>> s3.str.isnumeric()
0 True
1 True
2 True
3 False
dtype: bool
>>> s5.str.islower()
0 True
1 False
2 False
3 False
dtype: bool
>>> s5.str.isupper()
0 False
1 False
2 True
3 False
dtype: bool
The s5.str.istitle method checks whether all words are in title case (only the first letter of each
word is capitalized). Words are assumed to be any sequence of non-numeric characters separated by
whitespace characters.
>>> s5.str.istitle()
0 False
1 True
2 False
3 False
dtype: bool
pandas.Series.str.isnumeric
Series.str.isnumeric(self )
Check whether all characters in each string are numeric.
This is equivalent to running the Python string method str.isnumeric() for each element of the
Series/Index. If a string has zero characters, False is returned for that check.
Returns
Series or Index of bool Series or Index of boolean values with the same length as
the original Series/Index.
See also:
Examples
>>> s1.str.isalpha()
0 True
1 False
2 False
3 False
dtype: bool
>>> s1.str.isnumeric()
0 False
1 False
2 True
3 False
dtype: bool
>>> s1.str.isalnum()
0 True
1 True
2 True
3 False
dtype: bool
Note that checks against characters mixed with any additional punctuation or whitespace will evaluate
to false for an alphanumeric check.
The s3.str.isdecimal method checks for characters used to form numbers in base 10.
>>> s3.str.isdecimal()
0 True
1 False
2 False
3 False
dtype: bool
The s.str.isdigit method is the same as s3.str.isdecimal but also includes special digits, like
superscripted and subscripted digits in unicode.
>>> s3.str.isdigit()
0 True
1 True
2 False
3 False
dtype: bool
The s.str.isnumeric method is the same as s3.str.isdigit but also includes other characters that
can represent quantities such as unicode fractions.
>>> s3.str.isnumeric()
0 True
1 True
2 True
3 False
dtype: bool
>>> s5.str.islower()
0 True
1 False
2 False
3 False
dtype: bool
>>> s5.str.isupper()
0 False
1 False
2 True
3 False
dtype: bool
The s5.str.istitle method checks whether all words are in title case (only the first letter of each
word is capitalized). Words are assumed to be any sequence of non-numeric characters separated by
whitespace characters.
>>> s5.str.istitle()
0 False
1 True
2 False
3 False
dtype: bool
pandas.Series.str.isdecimal
Series.str.isdecimal(self )
Check whether all characters in each string are decimal.
This is equivalent to running the Python string method str.isdecimal() for each element of the
Series/Index. If a string has zero characters, False is returned for that check.
Returns
Series or Index of bool Series or Index of boolean values with the same length as
the original Series/Index.
See also:
Examples
>>> s1.str.isalpha()
0 True
1 False
2 False
3 False
dtype: bool
>>> s1.str.isnumeric()
0 False
1 False
2 True
3 False
dtype: bool
>>> s1.str.isalnum()
0 True
1 True
2 True
3 False
dtype: bool
Note that checks against characters mixed with any additional punctuation or whitespace will evaluate
to false for an alphanumeric check.
The s3.str.isdecimal method checks for characters used to form numbers in base 10.
>>> s3.str.isdecimal()
0 True
1 False
2 False
3 False
dtype: bool
The s.str.isdigit method is the same as s3.str.isdecimal but also includes special digits, like
superscripted and subscripted digits in unicode.
>>> s3.str.isdigit()
0 True
1 True
2 False
3 False
dtype: bool
The s.str.isnumeric method is the same as s3.str.isdigit but also includes other characters that
can represent quantities such as unicode fractions.
>>> s3.str.isnumeric()
0 True
1 True
2 True
3 False
dtype: bool
>>> s5.str.islower()
0 True
1 False
2 False
3 False
dtype: bool
>>> s5.str.isupper()
0 False
1 False
2 True
3 False
dtype: bool
The s5.str.istitle method checks whether all words are in title case (only the first letter of each
word is capitalized). Words are assumed to be any sequence of non-numeric characters separated by
whitespace characters.
>>> s5.str.istitle()
0 False
1 True
2 False
3 False
dtype: bool
pandas.Series.str.get_dummies
Series.str.get_dummies(self, sep=’|’)
Split each string in the Series by sep and return a DataFrame of dummy/indicator variables.
Parameters
sep [str, default “|”] String to split on.
Returns
DataFrame Dummy variables corresponding to values of the Series.
See also:
Examples
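The example itself is not included in this extract; a minimal sketch of splitting on the default
separator might look like:
>>> pd.Series(['a|b', 'a', 'a|c']).str.get_dummies()
   a  b  c
0  1  1  0
1  1  0  0
2  1  0  1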
Categorical accessor
Categorical-dtype specific methods and attributes are available under the Series.cat accessor.
pandas.Series.cat.categories
Series.cat.categories
The categories of this categorical.
Setting assigns new values to each category (effectively a rename of each individual category).
The assigned value has to be a list-like object. All items must be unique and the number of items in
the new categories must be the same as the number of items in the old categories.
Assigning to categories is an inplace operation!
Raises
ValueError If the new categories do not validate as categories, or if the number of
new categories does not equal the number of old categories.
See also:
rename_categories
reorder_categories
add_categories
remove_categories
remove_unused_categories
set_categories
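A brief sketch of the rename-by-assignment behaviour described above (the example is not part of this
extract):
>>> s = pd.Series(['a', 'b', 'a'], dtype='category')
>>> s.cat.categories
Index(['a', 'b'], dtype='object')
>>> s.cat.categories = ['x', 'y']  # renames each category in place
>>> s
0    x
1    y
2    x
dtype: category
Categories (2, object): [x, y]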
pandas.Series.cat.ordered
Series.cat.ordered
Whether the categories have an ordered relationship.
pandas.Series.cat.codes
Series.cat.codes
Return Series of codes as well as the index.
pandas.Series.cat.rename_categories
reorder_categories
add_categories
remove_categories
remove_unused_categories
set_categories
Examples
For dict-like new_categories, extra keys are ignored and categories not in the dictionary are passed
through
pandas.Series.cat.reorder_categories
rename_categories
add_categories
remove_categories
remove_unused_categories
set_categories
pandas.Series.cat.add_categories
rename_categories
reorder_categories
remove_categories
remove_unused_categories
set_categories
pandas.Series.cat.remove_categories
rename_categories
reorder_categories
add_categories
remove_unused_categories
set_categories
pandas.Series.cat.remove_unused_categories
rename_categories
reorder_categories
add_categories
remove_categories
set_categories
pandas.Series.cat.set_categories
inplace [bool, default False] Whether or not to reorder the categories in-place or
return a copy of this categorical with reordered categories.
Returns
Categorical with reordered categories or None if inplace.
Raises
ValueError If new_categories does not validate as categories
See also:
rename_categories
reorder_categories
add_categories
remove_categories
remove_unused_categories
pandas.Series.cat.as_ordered
pandas.Series.cat.as_unordered
Sparse accessor
Sparse-dtype specific methods and attributes are provided under the Series.sparse accessor.
pandas.Series.sparse.npoints
Series.sparse.npoints
The number of non-fill_value points.
Examples
pandas.Series.sparse.density
Series.sparse.density
The percentage of non-fill_value points, as a decimal.
Examples
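The example is empty in this extract; a minimal illustration (a fill_value of 0 is assumed for the
integer sparse dtype):
>>> s = pd.Series([0, 0, 1, 0, 2], dtype="Sparse[int]")
>>> s.sparse.density
0.4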
pandas.Series.sparse.fill_value
Series.sparse.fill_value
Elements in data that are fill_value are not stored.
For memory savings, this should be the most common value in the array.
pandas.Series.sparse.sp_values
Series.sparse.sp_values
An ndarray containing the non-fill_value values.
Examples
pandas.Series.sparse.from_coo
Examples
pandas.Series.sparse.to_coo
Use row_levels and column_levels to determine the row and column coordinates respectively.
row_levels and column_levels are the names (labels) or numbers of the levels. {row_levels,
column_levels} must be a partition of the MultiIndex level names (or numbers).
Parameters
row_levels [tuple/list]
column_levels [tuple/list]
sort_labels [bool, default False] Sort the row and column labels before forming the
sparse matrix.
Returns
y [scipy.sparse.coo_matrix]
rows [list (row labels)]
columns [list (column labels)]
Examples
6.3.14 Plotting
Series.plot is both a callable method and a namespace attribute for specific plotting methods of the form
Series.plot.<kind>.
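For example, for a Series s the following two calls produce the same plot (an illustrative sketch,
not part of the original text):
>>> s = pd.Series([1, 3, 2])
>>> ax = s.plot(kind='line')
>>> ax = s.plot.line()  # same plot via the namespace attribute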
pandas.Series.plot.area
Examples
>>> df = pd.DataFrame({
... 'sales': [3, 2, 3, 9, 10, 6],
... 'signups': [5, 5, 6, 12, 14, 13],
... 'visits': [20, 42, 28, 62, 81, 50],
... }, index=pd.date_range(start='2018/01/01', end='2018/07/01',
... freq='M'))
>>> ax = df.plot.area()
Area plots are stacked by default. To produce an unstacked plot, pass stacked=False:
>>> ax = df.plot.area(stacked=False)
>>> ax = df.plot.area(y='sales')
>>> df = pd.DataFrame({
... 'sales': [3, 2, 3],
... 'visits': [20, 42, 28],
... 'day': [1, 2, 3],
... })
>>> ax = df.plot.area(x='day')
pandas.Series.plot.bar
Examples
Basic plot.
Plot a whole dataframe to a bar plot. Each column is assigned a distinct color, and each row is nested
in a group along the horizontal axis.
Instead of nesting, the figure can be split by column with subplots=True. In this case, a
numpy.ndarray of matplotlib.axes.Axes is returned.
pandas.Series.plot.barh
Examples
Basic example
pandas.Series.plot.box
Examples
Draw a box plot from a DataFrame with four columns of randomly generated data.
pandas.Series.plot.density
Examples
Given a Series of points randomly sampled from an unknown distribution, estimate its PDF using KDE
with automatic bandwidth determination and plot the results, evaluating them at 1000 equally spaced
points (default):
A scalar bandwidth can be specified. Using a small bandwidth value can lead to over-fitting, while
using a large bandwidth value may result in under-fitting:
>>> ax = s.plot.kde(bw_method=0.3)
>>> ax = s.plot.kde(bw_method=3)
Finally, the ind parameter determines the evaluation points for the plot of the estimated PDF:
>>> df = pd.DataFrame({
... 'x': [1, 2, 2.5, 3, 3.5, 4, 5],
... 'y': [4, 4, 4.5, 5, 5.5, 6, 6],
... })
>>> ax = df.plot.kde()
A scalar bandwidth can be specified. Using a small bandwidth value can lead to over-fitting, while
using a large bandwidth value may result in under-fitting:
>>> ax = df.plot.kde(bw_method=0.3)
>>> ax = df.plot.kde(bw_method=3)
Finally, the ind parameter determines the evaluation points for the plot of the estimated PDF:
pandas.Series.plot.hist
Examples
When we roll a die 6000 times, we expect to get each value around 1000 times. But when we roll
two dice and sum the result, the distribution is going to be quite different. A histogram illustrates
those distributions.
>>> df = pd.DataFrame(
... np.random.randint(1, 7, 6000),
... columns = ['one'])
>>> df['two'] = df['one'] + np.random.randint(1, 7, 6000)
>>> ax = df.plot.hist(bins=12, alpha=0.5)
pandas.Series.plot.kde
Examples
Given a Series of points randomly sampled from an unknown distribution, estimate its PDF using KDE
with automatic bandwidth determination and plot the results, evaluating them at 1000 equally spaced
points (default):
A scalar bandwidth can be specified. Using a small bandwidth value can lead to over-fitting, while
using a large bandwidth value may result in under-fitting:
>>> ax = s.plot.kde(bw_method=0.3)
>>> ax = s.plot.kde(bw_method=3)
Finally, the ind parameter determines the evaluation points for the plot of the estimated PDF:
>>> df = pd.DataFrame({
... 'x': [1, 2, 2.5, 3, 3.5, 4, 5],
... 'y': [4, 4, 4.5, 5, 5.5, 6, 6],
... })
>>> ax = df.plot.kde()
A scalar bandwidth can be specified. Using a small bandwidth value can lead to over-fitting, while
using a large bandwidth value may result in under-fitting:
>>> ax = df.plot.kde(bw_method=0.3)
>>> ax = df.plot.kde(bw_method=3)
Finally, the ind parameter determines the evaluation points for the plot of the estimated PDF:
pandas.Series.plot.line
Examples
The following example shows the populations for some animals over the years.
>>> df = pd.DataFrame({
... 'pig': [20, 18, 489, 675, 1776],
... 'horse': [4, 25, 281, 600, 1900]
... }, index=[1990, 1997, 2003, 2009, 2014])
>>> lines = df.plot.line()
pandas.Series.plot.pie
Series.plot.pie(self, **kwargs)
Generate a pie plot.
A pie plot is a proportional representation of the numerical data in a column. This function
wraps matplotlib.pyplot.pie() for the specified column. If no column reference is passed and
subplots=True a pie plot is drawn for each numerical column independently.
Parameters
y [int or label, optional] Label or position of the column to plot. If not provided,
subplots=True argument must be passed.
**kwds Keyword arguments to pass on to DataFrame.plot().
Returns
matplotlib.axes.Axes or np.ndarray of them A NumPy array is returned when
subplots is True.
See also:
Examples
In the example below we have a DataFrame with information about the planets' mass and radius. We
pass the 'mass' column to the pie function to get a pie plot.
Series.hist(self[, by, ax, grid, …]) Draw histogram of the input series using matplotlib.
6.3.16 Sparse
pandas.SparseSeries.to_coo
Parameters
row_levels [tuple/list]
column_levels [tuple/list]
sort_labels [bool, default False] Sort the row and column labels before forming the
sparse matrix.
Returns
y [scipy.sparse.coo_matrix]
rows [list (row labels)]
columns [list (column labels)]
Examples
pandas.SparseSeries.from_coo
Returns
s [SparseSeries]
Examples
{{ header }}
6.4 DataFrame
6.4.1 Constructor
pandas.DataFrame
Examples
>>> df.dtypes
col1 int64
col2 int64
dtype: object
Attributes
pandas.DataFrame.T
DataFrame.T
Transpose index and columns.
Reflect the DataFrame over its main diagonal by writing rows as columns and vice-versa. The
property T is an accessor to the method transpose().
Parameters
copy [bool, default False] If True, the underlying data is copied. Otherwise (de-
fault), no copy is made if possible.
*args, **kwargs Additional keywords have no effect but might be accepted for
compatibility with numpy.
Returns
DataFrame The transposed DataFrame.
See also:
Notes
Transposing a DataFrame with mixed dtypes will result in a homogeneous DataFrame with the
object dtype. In such a case, a copy of the data is always made.
Examples
When the dtype is homogeneous in the original DataFrame, we get a transposed DataFrame with
the same dtype:
>>> df1.dtypes
col1 int64
col2 int64
dtype: object
>>> df1_transposed.dtypes
0 int64
1 int64
dtype: object
When the DataFrame has mixed dtypes, we get a transposed DataFrame with the object dtype:
>>> df2.dtypes
name object
score float64
employed bool
kids int64
dtype: object
>>> df2_transposed.dtypes
0 object
1 object
dtype: object
pandas.DataFrame.at
DataFrame.at
Access a single value for a row/column label pair.
Similar to loc, in that both provide label-based lookups. Use at if you only need to get or set a
single value in a DataFrame or Series.
Raises
KeyError When label does not exist in DataFrame
See also:
Examples
>>> df.loc[5].at['B']
4
pandas.DataFrame.axes
DataFrame.axes
Return a list representing the axes of the DataFrame.
It has the row axis labels and column axis labels as the only members. They are returned in that
order.
Examples
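The example is not reproduced in this extract; it would look along these lines:
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df.axes
[RangeIndex(start=0, stop=2, step=1), Index(['col1', 'col2'], dtype='object')]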
pandas.DataFrame.blocks
DataFrame.blocks
Internal property, property synonym for as_blocks().
Deprecated since version 0.21.0.
pandas.DataFrame.columns
DataFrame.columns
The column labels of the DataFrame.
pandas.DataFrame.dtypes
DataFrame.dtypes
Return the dtypes in the DataFrame.
This returns a Series with the data type of each column. The result’s index is the original
DataFrame’s columns. Columns with mixed types are stored with the object dtype. See the
User Guide for more.
Returns
pandas.Series The data type of each column.
See also:
Examples
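The example is empty in this extract; a minimal sketch:
>>> df = pd.DataFrame({'float': [1.0],
...                    'int': [1],
...                    'datetime': [pd.Timestamp('20180310')],
...                    'string': ['foo']})
>>> df.dtypes
float              float64
int                  int64
datetime    datetime64[ns]
string              object
dtype: object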
pandas.DataFrame.empty
DataFrame.empty
Indicator whether DataFrame is empty.
True if DataFrame is entirely empty (no items), meaning any of the axes are of length 0.
Returns
bool If DataFrame is empty, return True, if not return False.
See also:
Series.dropna
DataFrame.dropna
Notes
If DataFrame contains only NaNs, it is still not considered empty. See the example below.
Examples
If we only have NaNs in our DataFrame, it is not considered empty! We will need to drop the
NaNs to make the DataFrame empty:
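The examples themselves are not reproduced in this extract; a minimal sketch (assuming numpy is
imported as np):
>>> df_empty = pd.DataFrame({'A': []})
>>> df_empty.empty
True
>>> df = pd.DataFrame({'A': [np.nan]})
>>> df.empty
False
>>> df.dropna().empty
True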
pandas.DataFrame.ftypes
DataFrame.ftypes
Return the ftypes (indication of sparse/dense and dtype) in DataFrame.
Deprecated since version 0.25.0: Use dtypes() instead.
This returns a Series with the data type of each column. The result’s index is the original
DataFrame’s columns. Columns with mixed types are stored with the object dtype. See the
User Guide for more.
Returns
pandas.Series The data type and indication of sparse/dense of each column.
See also:
Notes
Sparse data should have the same dtypes as its dense representation.
Examples
pandas.DataFrame.iat
DataFrame.iat
Access a single value for a row/column pair by integer position.
Similar to iloc, in that both provide integer-based lookups. Use iat if you only need to get or
set a single value in a DataFrame or Series.
Raises
IndexError When integer position is out of bounds
See also:
Examples
>>> df.iat[1, 2]
1
>>> df.iat[1, 2] = 10
>>> df.iat[1, 2]
10
>>> df.loc[0].iat[1]
2
pandas.DataFrame.iloc
DataFrame.iloc
Purely integer-location based indexing for selection by position.
.iloc[] is primarily integer position based (from 0 to length-1 of the axis), but may also be
used with a boolean array.
Allowed inputs are:
• An integer, e.g. 5.
• A list or array of integers, e.g. [4, 3, 0].
• A slice object with ints, e.g. 1:7.
• A boolean array.
• A callable function with one argument (the calling Series or DataFrame) and that returns
valid output for indexing (one of the above). This is useful in method chains, when you
don’t have a reference to the calling object, but would like to base your selection on some
value.
.iloc will raise IndexError if a requested indexer is out-of-bounds, except slice indexers which
allow out-of-bounds indexing (this conforms with python/numpy slice semantics).
See more at Selection by Position.
See also:
Examples
>>> type(df.iloc[0])
<class 'pandas.core.series.Series'>
>>> df.iloc[0]
a 1
b 2
c 3
d 4
Name: 0, dtype: int64
>>> df.iloc[[0]]
a b c d
0 1 2 3 4
>>> type(df.iloc[[0]])
<class 'pandas.core.frame.DataFrame'>
>>> df.iloc[:3]
a b c d
0 1 2 3 4
1 100 200 300 400
2 1000 2000 3000 4000
With a callable, useful in method chains. The x passed to the lambda is the DataFrame being
sliced. This selects the rows whose index label is even:
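The accompanying example does not appear in this extract; with the df shown above it would read:
>>> df.iloc[lambda x: x.index % 2 == 0]
      a     b     c     d
0     1     2     3     4
2  1000  2000  3000  4000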
>>> df.iloc[0, 1]
2
pandas.DataFrame.index
DataFrame.index
The index (row labels) of the DataFrame.
pandas.DataFrame.is_copy
DataFrame.is_copy
Return the copy.
pandas.DataFrame.ix
DataFrame.ix
A primarily label-location based indexer, with integer position fallback.
Warning: Starting in 0.20.0, the .ix indexer is deprecated, in favor of the more strict .iloc and
.loc indexers.
.ix[] supports mixed integer and label based access. It is primarily label based, but will fall
back to integer positional access unless the corresponding axis is of integer type.
.ix is the most general indexer and will support any of the inputs in .loc and .iloc. .ix
also supports floating point label schemes. .ix is exceptionally useful when dealing with mixed
positional and label based hierarchical indexes.
However, when an axis is integer based, ONLY label based access and not positional access is
supported. Thus, in such cases, it’s usually better to be explicit and use .iloc or .loc.
See more at Advanced Indexing.
pandas.DataFrame.loc
DataFrame.loc
Access a group of rows and columns by label(s) or a boolean array.
.loc[] is primarily label based, but may also be used with a boolean array.
Allowed inputs are:
• A single label, e.g. 5 or 'a', (note that 5 is interpreted as a label of the index, and never
as an integer position along the index).
• A list or array of labels, e.g. ['a', 'b', 'c'].
• A slice object with labels, e.g. 'a':'f'.
Warning: Note that contrary to usual python slices, both the start and the stop are
included
• A boolean array of the same length as the axis being sliced, e.g. [True, False, True].
• A callable function with one argument (the calling Series or DataFrame) and that returns
valid output for indexing (one of the above)
See more at Selection by Label
Raises
KeyError: when any items are not found
See also:
Examples
Getting values
>>> df.loc['viper']
max_speed 4
shield 5
Name: viper, dtype: int64
Slice with labels for row and single label for column. As mentioned above, note that both the
start and stop of the slice are included.
Setting values
Set value for all items matching the list of labels
>>> df.loc['cobra'] = 10
>>> df
max_speed shield
cobra 10 10
viper 4 50
sidewinder 7 50
Slice with integer labels for rows. As mentioned above, note that both the start and stop of the
slice are included.
>>> df.loc[7:9]
max_speed shield
7 1 2
8 4 5
9 7 8
>>> tuples = [
... ('cobra', 'mark i'), ('cobra', 'mark ii'),
... ('sidewinder', 'mark i'), ('sidewinder', 'mark ii'),
... ('viper', 'mark ii'), ('viper', 'mark iii')
... ]
>>> index = pd.MultiIndex.from_tuples(tuples)
>>> values = [[12, 2], [0, 4], [10, 20],
... [1, 4], [7, 1], [16, 36]]
>>> df = pd.DataFrame(values, columns=['max_speed', 'shield'], index=index)
>>> df
max_speed shield
cobra mark i 12 2
mark ii 0 4
sidewinder mark i 10 20
mark ii 1 4
viper mark ii 7 1
mark iii 16 36
>>> df.loc['cobra']
max_speed shield
mark i 12 2
mark ii 0 4
Single label for row and column. Similar to passing in a tuple, this returns a Series.
Single tuple for the index with a single label for the column
pandas.DataFrame.ndim
DataFrame.ndim
Return an int representing the number of axes / array dimensions.
Return 1 if Series. Otherwise return 2 if DataFrame.
See also:
Examples
pandas.DataFrame.shape
DataFrame.shape
Return a tuple representing the dimensionality of the DataFrame.
See also:
ndarray.shape
Examples
pandas.DataFrame.size
DataFrame.size
Return an int representing the number of elements in this object.
Return the number of rows if Series. Otherwise return the number of rows times number of
columns if DataFrame.
See also:
Examples
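The example is empty in this extract; a minimal illustration:
>>> s = pd.Series({'a': 1, 'b': 2, 'c': 3})
>>> s.size
3
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df.size
4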
pandas.DataFrame.style
DataFrame.style
Property returning a Styler object containing methods for building a styled HTML representation
of the DataFrame.
See also:
io.formats.style.Styler
pandas.DataFrame.values
DataFrame.values
Return a Numpy representation of the DataFrame.
Only the values in the DataFrame will be returned, the axes labels will be removed.
Returns
numpy.ndarray The values of the DataFrame.
See also:
Notes
The dtype will be a lower-common-denominator dtype (implicit upcasting); that is to say if the
dtypes (even of numeric types) are mixed, the one that accommodates all will be chosen. Use
this with care if you are not dealing with the blocks.
e.g. If the dtypes are float16 and float32, dtype will be upcast to float32. If dtypes are int32 and
uint8, dtype will be upcast to int32. By numpy.find_common_type() convention, mixing int64
and uint64 will result in a float64 dtype.
Examples
A DataFrame where all columns are the same type (e.g., int64) results in an array of the same
type.
A DataFrame with mixed type columns (e.g., str/object, int64, float32) results in an ndarray of
the broadest type that accommodates these mixed types (e.g., object).
Methods
pandas.DataFrame.abs
DataFrame.abs(self )
Return a Series/DataFrame with absolute numeric value of each element.
This function only applies to elements that are all numeric.
Returns
abs Series/DataFrame containing the absolute value of each element.
See also:
Notes
For complex inputs, e.g. 1.2 + 1j, the absolute value is √(a² + b²).
Examples
Select rows with data closest to certain value using argsort (from StackOverflow).
>>> df = pd.DataFrame({
... 'a': [4, 5, 6, 7],
... 'b': [10, 20, 30, 40],
... 'c': [100, 50, -30, -50]
... })
>>> df
a b c
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
>>> df.loc[(df.c - 43).abs().argsort()]
a b c
1 5 20 50
0 4 10 100
2 6 30 -30
3 7 40 -50
pandas.DataFrame.add
Notes
Examples
Add a scalar using the operator version, which returns the same result.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.div(10)
angles degrees
circle 0.0 36.0
triangle 0.3 18.0
rectangle 0.4 36.0
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df - [1, 2]
angles degrees
circle -1 358
triangle 2 178
rectangle 3 358
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.add_prefix
DataFrame.add_prefix(self, prefix)
Prefix labels with string prefix.
For Series, the row labels are prefixed. For DataFrame, the column labels are prefixed.
Parameters
prefix [str] The string to add before each label.
Returns
Series or DataFrame New Series or DataFrame with updated labels.
See also:
Examples
>>> s.add_prefix('item_')
item_0 1
item_1 2
item_2 3
item_3 4
dtype: int64
>>> df.add_prefix('col_')
col_A col_B
0 1 3
1 2 4
2 3 5
3 4 6
pandas.DataFrame.add_suffix
DataFrame.add_suffix(self, suffix)
Suffix labels with string suffix.
For Series, the row labels are suffixed. For DataFrame, the column labels are suffixed.
Parameters
suffix [str] The string to add after each label.
Returns
Series or DataFrame New Series or DataFrame with updated labels.
See also:
Examples
>>> s.add_suffix('_item')
0_item 1
1_item 2
2_item 3
3_item 4
dtype: int64
>>> df.add_suffix('_col')
A_col B_col
0 1 3
1 2 4
2 3 5
3 4 6
pandas.DataFrame.agg
Notes
Examples
pandas.DataFrame.aggregate
Notes
Examples
pandas.DataFrame.align
pandas.DataFrame.all
Parameters
axis [{0 or ‘index’, 1 or ‘columns’, None}, default 0] Indicate which axis or axes
should be reduced.
• 0 / ‘index’ : reduce the index, return a Series whose index is the original
column labels.
• 1 / ‘columns’ : reduce the columns, return a Series whose index is the original
index.
• None : reduce all axes, return a scalar.
bool_only [bool, default None] Include only boolean columns. If None, will
attempt to use everything, then use only boolean data. Not implemented for
Series.
skipna [bool, default True] Exclude NA/null values. If the entire row/column
is NA and skipna is True, then the result will be True, as for an empty
row/column. If skipna is False, then NA are treated as True, because these
are not equal to zero.
level [int or level name, default None] If the axis is a MultiIndex (hierarchical),
count along a particular level, collapsing into a Series.
**kwargs [any, default None] Additional keywords have no effect but might be
accepted for compatibility with NumPy.
Returns
Series or DataFrame If level is specified, then, DataFrame is returned; other-
wise, Series is returned.
See also:
Examples
Series
DataFrames
Create a dataframe from a dictionary.
>>> df.all()
col1 True
col2 False
dtype: bool
>>> df.all(axis='columns')
0 True
1 False
dtype: bool
>>> df.all(axis=None)
False
pandas.DataFrame.any
**kwargs [any, default None] Additional keywords have no effect but might be
accepted for compatibility with NumPy.
Returns
Series or DataFrame If level is specified, then, DataFrame is returned; other-
wise, Series is returned.
See also:
Examples
Series
For Series input, the output is a scalar indicating whether any element is True.
DataFrame
Whether each column contains at least one True element (the default).
>>> df = pd.DataFrame({"A": [1, 2], "B": [0, 2], "C": [0, 0]})
>>> df
A B C
0 1 0 0
1 2 2 0
>>> df.any()
A True
B True
C False
dtype: bool
>>> df.any(axis='columns')
0 True
1 True
dtype: bool
>>> df.any(axis='columns')
0 True
1 False
dtype: bool
>>> df.any(axis=None)
True
>>> pd.DataFrame([]).any()
Series([], dtype: bool)
pandas.DataFrame.append
DataFrame
See also:
Notes
If a list of dict/series is passed and the keys are all contained in the DataFrame’s index, the order
of the columns in the resulting DataFrame will be unchanged.
Iteratively appending rows to a DataFrame can be more computationally intensive than a single
concatenate. A better solution is to append those rows to a list and then concatenate the list
with the original DataFrame all at once.
Examples
The following examples, while not recommended ways to generate a DataFrame, show two approaches to
building a DataFrame from multiple data sources.
Less efficient:
>>> df = pd.DataFrame(columns=['A'])
>>> for i in range(5):
... df = df.append({'A': i}, ignore_index=True)
>>> df
A
0 0
1 1
2 2
3 3
4 4
More efficient:
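The more efficient variant is not shown in this extract; building the rows first and concatenating
once would look like:
>>> pd.concat([pd.DataFrame([i], columns=['A']) for i in range(5)],
...           ignore_index=True)
   A
0  0
1  1
2  2
3  3
4  4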
pandas.DataFrame.apply
Notes
In the current implementation apply calls func twice on the first column/row to decide whether
it can take a fast or slow code path. This can lead to unexpected behavior if func has side-effects,
as they will take effect twice for the first column/row.
Examples
Using a numpy universal function (in this case the same as np.sqrt(df)):
>>> df.apply(np.sqrt)
A B
0 2.0 3.0
1 2.0 3.0
2 2.0 3.0
Returning a Series inside the function is similar to passing result_type='expand'. The resulting
column names will be the Series index.
Passing result_type='broadcast' will ensure the same shape result, whether list-like or scalar
is returned by the function, and broadcast it along the axis. The resulting column names will be
the originals.
pandas.DataFrame.applymap
DataFrame.applymap(self, func)
Apply a function to a Dataframe elementwise.
This method applies a function that accepts and returns a scalar to every element of a DataFrame.
Parameters
func [callable] Python function, returns a single value from a single value.
Returns
DataFrame Transformed DataFrame.
See also:
Notes
In the current implementation applymap calls func twice on the first column/row to decide
whether it can take a fast or slow code path. This can lead to unexpected behavior if func has
side-effects, as they will take effect twice for the first column/row.
Examples
Note that a vectorized version of func often exists, which will be much faster. You could square
each number elementwise.
>>> df ** 2
0 1
0 1.000000 4.494400
1 11.262736 20.857489
pandas.DataFrame.as_blocks
DataFrame.as_blocks(self, copy=True)
Convert the frame to a dict of dtype -> Constructor Types that each has a homogeneous dtype.
Deprecated since version 0.21.0.
Parameters
copy [boolean, default True]
Returns
values [a dict of dtype -> Constructor Types]
pandas.DataFrame.as_matrix
DataFrame.as_matrix(self, columns=None)
Convert the frame to its Numpy-array representation.
Deprecated since version 0.23.0: Use DataFrame.values() instead.
Parameters
columns [list, optional, default:None] If None, return all columns, otherwise, re-
turns specified columns.
Returns
values [ndarray] If the caller is heterogeneous and contains booleans or objects,
the result will be of dtype=object. See Notes.
See also:
DataFrame.values
Notes
pandas.DataFrame.asfreq
reindex
Notes
To learn more about the frequency strings, please see this link.
Examples
>>> df.asfreq(freq='30S')
s
2000-01-01 00:00:00 0.0
2000-01-01 00:00:30 NaN
2000-01-01 00:01:00 NaN
2000-01-01 00:01:30 NaN
2000-01-01 00:02:00 2.0
2000-01-01 00:02:30 NaN
2000-01-01 00:03:00 3.0
pandas.DataFrame.asof
Notes
Examples
>>> s.asof(20)
2.0
When where is a sequence, a Series is returned. The first value is NaN, because the first element of
where is before the first index value:
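The corresponding example is missing from this extract; assuming the s used above is
pd.Series([1, 2, np.nan, 4], index=[10, 20, 30, 40]), as in the upstream docstring, it would read:
>>> s.asof([5, 20])
5     NaN
20    2.0
dtype: float64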
Missing values are not considered. The following is 2.0, not NaN, even though NaN is at the
index location for 30.
>>> s.asof(30)
2.0
pandas.DataFrame.assign
DataFrame.assign(self, **kwargs)
Assign new columns to a DataFrame.
Returns a new object with all original columns in addition to new ones. Existing columns that
are re-assigned will be overwritten.
Parameters
**kwargs [dict of {str: callable or Series}] The column names are keywords. If
the values are callable, they are computed on the DataFrame and assigned to
the new columns. The callable must not change input DataFrame (though
pandas doesn’t check it). If the values are not callable, (e.g. a Series, scalar,
or array), they are simply assigned.
Returns
DataFrame A new DataFrame with the new columns in addition to all the ex-
isting columns.
Notes
Assigning multiple columns within the same assign is possible. For Python 3.6 and above, later
items in ‘**kwargs’ may refer to newly created or modified columns in ‘df’; items are computed
and assigned into ‘df’ in order. For Python 3.5 and below, the order of keyword arguments is not
specified, you cannot refer to newly created or modified columns. All items are computed first,
and then assigned in alphabetical order.
Changed in version 0.23.0: Keyword argument order is maintained for Python 3.6 and later.
Examples
Alternatively, the same behavior can be achieved by directly referencing an existing Series or
sequence:
In Python 3.6+, you can create multiple columns within the same assign where one of the columns
depends on another one defined within the same assign:
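The examples themselves are not included in this extract; a sketch of the chained-column case on a
small illustrative DataFrame:
>>> df = pd.DataFrame({'temp_c': [17.0, 25.0]},
...                   index=['Portland', 'Berkeley'])
>>> df.assign(temp_f=lambda x: x.temp_c * 9 / 5 + 32,
...           temp_k=lambda x: (x.temp_f + 459.67) * 5 / 9)
          temp_c  temp_f  temp_k
Portland    17.0    62.6  290.15
Berkeley    25.0    77.0  298.15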
pandas.DataFrame.astype
Examples
Create a DataFrame:
>>> df.astype('int32').dtypes
col1 int32
col2 int32
dtype: object
Create a series:
>>> ser.astype('category')
0 1
1 2
dtype: category
Categories (2, int64): [1, 2]
Note that using copy=False and changing data on a new pandas object may propagate changes:
>>> s1 = pd.Series([1,2])
>>> s2 = s1.astype('int64', copy=False)
>>> s2[0] = 10
>>> s1 # note that s1[0] has changed too
0 10
1 2
dtype: int64
pandas.DataFrame.at_time
Examples
>>> ts.at_time('12:00')
A
2018-04-09 12:00:00 2
2018-04-10 12:00:00 4
pandas.DataFrame.between_time
Examples
You get the times that are not between two times by setting start_time later than end_time:
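The example is not reproduced in this extract; a minimal sketch:
>>> i = pd.date_range('2018-04-09', periods=4, freq='1D20min')
>>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
>>> ts.between_time('0:15', '0:45')
                     A
2018-04-10 00:20:00  2
2018-04-11 00:40:00  3
>>> ts.between_time('0:45', '0:15')
                     A
2018-04-09 00:00:00  1
2018-04-12 01:00:00  4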
pandas.DataFrame.bfill
pandas.DataFrame.bool
DataFrame.bool(self )
Return the bool of a single element PandasObject.
This must be a boolean scalar value, either True or False. A ValueError is raised if the PandasObject
does not have exactly 1 element, or if that element is not boolean.
Returns
bool Same single boolean value converted to bool type.
pandas.DataFrame.boxplot
grid [bool, default True] Setting this to True will show the grid.
figsize [A tuple (width, height) in inches] The size of the figure to create in mat-
plotlib.
layout [tuple (rows, columns), optional] For example, (3, 5) will display the sub-
plots in 3 rows and 5 columns, starting from the top-left.
return_type [{‘axes’, ‘dict’, ‘both’} or None, default ‘axes’] The kind of object
to return. The default is axes.
• ‘axes’ returns the matplotlib axes the boxplot is drawn on.
• ‘dict’ returns a dictionary whose values are the matplotlib Lines of the box-
plot.
• ‘both’ returns a namedtuple with the axes and dict.
• when grouping with by, a Series mapping columns to return_type is re-
turned.
If return_type is None, a NumPy array of axes with the same shape as
layout is returned.
**kwds All other plotting keyword arguments to be passed to matplotlib.
pyplot.boxplot().
Returns
result See Notes.
See also:
Notes
Examples
Boxplots can be created for every column in the dataframe by df.boxplot() or indicating the
columns to be used:
>>> np.random.seed(1234)
>>> df = pd.DataFrame(np.random.randn(10,4),
... columns=['Col1', 'Col2', 'Col3', 'Col4'])
>>> boxplot = df.boxplot(column=['Col1', 'Col2', 'Col3'])
Boxplots of variable distributions grouped by the values of a third variable can be created using
the option by. For instance:
A list of strings (i.e. ['X', 'Y']) can be passed to boxplot in order to group the data by
combination of the variables in the x-axis:
>>> df = pd.DataFrame(np.random.randn(10,3),
... columns=['Col1', 'Col2', 'Col3'])
>>> df['X'] = pd.Series(['A', 'A', 'A', 'A', 'A',
... 'B', 'B', 'B', 'B', 'B'])
>>> df['Y'] = pd.Series(['A', 'B', 'A', 'B', 'A',
... 'B', 'A', 'B', 'A', 'B'])
>>> boxplot = df.boxplot(column=['Col1', 'Col2'], by=['X', 'Y'])
Additional formatting can be done to the boxplot, like suppressing the grid (grid=False), ro-
tating the labels in the x-axis (i.e. rot=45) or changing the fontsize (i.e. fontsize=15):
The parameter return_type can be used to select the type of element returned by boxplot. When
return_type='axes' is selected, the matplotlib axes on which the boxplot is drawn are returned:
If return_type is None, a NumPy array of axes with the same shape as layout is returned:
pandas.DataFrame.clip
Examples
>>> data = {'col_0': [9, -3, 0, -1, 5], 'col_1': [-2, -7, 6, 8, -5]}
>>> df = pd.DataFrame(data)
>>> df
col_0 col_1
0 9 -2
1 -3 -7
2 0 6
3 -1 8
4 5 -5
>>> df.clip(-4, 6)
col_0 col_1
0 6 -2
1 -3 -4
2 0 6
3 -1 6
4 5 -4
Clips using specific lower and upper thresholds per column element:
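The example itself is missing from this extract; using the df defined above and a per-row Series of
thresholds, it would read:
>>> t = pd.Series([2, -4, -1, 6, 3])
>>> df.clip(t, t + 4, axis=0)
   col_0  col_1
0      6      2
1     -3     -4
2      0      3
3      6      8
4      5      3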
pandas.DataFrame.clip_lower
Examples
Series clipping element-wise using an array of thresholds. threshold should be the same length as
the Series.
>>> df.clip(lower=3)
A B
0 3 3
1 3 4
2 5 6
Or to an array of values. By default, threshold should be the same shape as the DataFrame.
Control how threshold is broadcast with axis. In this case threshold should be the same length
as the axis specified by axis.
pandas.DataFrame.clip_upper
Examples
>>> s.clip(upper=3)
0 1
1 2
2 3
3 3
4 3
dtype: int64
>>> s.clip(upper=elemwise_thresholds)
0 1
1 2
2 3
3 2
4 1
dtype: int64
pandas.DataFrame.combine
Examples
Using fill_value fills Nones prior to passing the column to the merge function.
However, if the same element in both dataframes is None, that None is preserved
Example that demonstrates the use of overwrite and behavior when the axis differ between the
dataframes.
>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1], }, index=[1, 2])
>>> df2.combine(df1, take_smaller)
A B C
0 0.0 NaN NaN
1 0.0 3.0 NaN
2 NaN 3.0 NaN
pandas.DataFrame.combine_first
DataFrame.combine_first(self, other)
Update null elements with value in the same location in other.
Combine two DataFrame objects by filling null values in one DataFrame with non-null values
from other DataFrame. The row and column indexes of the resulting DataFrame will be the
union of the two.
Parameters
other [DataFrame] Provided DataFrame to use to fill null values.
Returns
DataFrame
See also:
Examples
Null values still persist if the location of that null value does not exist in other:
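The example is not shown in this extract; a minimal sketch:
>>> df1 = pd.DataFrame({'A': [None, 0], 'B': [4, None]})
>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1]}, index=[1, 2])
>>> df1.combine_first(df2)
     A    B    C
0  NaN  4.0  NaN
1  0.0  3.0  1.0
2  NaN  3.0  1.0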
pandas.DataFrame.compound
pandas.DataFrame.copy
DataFrame.copy(self, deep=True)
Make a copy of this object’s indices and data.
When deep=True (default), a new object will be created with a copy of the calling object’s data
and indices. Modifications to the data or indices of the copy will not be reflected in the original
object (see notes below).
When deep=False, a new object will be created without copying the calling object’s data or
index (only references to the data and index are copied). Any changes to the data of the original
will be reflected in the shallow copy (and vice versa).
Parameters
deep [bool, default True] Make a deep copy, including a copy of the data and the
indices. With deep=False neither the indices nor the data are copied.
Returns
copy [Series or DataFrame] Object type matches caller.
Notes
When deep=True, data is copied but actual Python objects will not be copied recursively, only
the reference to the object. This is in contrast to copy.deepcopy in the Standard Library, which
recursively copies object data (see examples below).
While Index objects are copied when deep=True, the underlying numpy array is not copied for
performance reasons. Since Index is immutable, the underlying data can be safely shared and a
copy is not needed.
Examples
>>> s is shallow
False
>>> s.values is shallow.values and s.index is shallow.index
True
>>> s is deep
False
>>> s.values is deep.values or s.index is deep.index
False
Updates to the data shared by the shallow copy and the original are reflected in both; the deep copy
remains unchanged.
>>> s[0] = 3
>>> shallow[1] = 4
>>> s
a 3
b 4
dtype: int64
Note that when copying an object containing Python objects, a deep copy will copy the data,
but will not do so recursively. Updating a nested data object will be reflected in the deep copy.
pandas.DataFrame.corr
DataFrame.corrwith
Series.corr
Examples
pandas.DataFrame.corrwith
DataFrame.corr
pandas.DataFrame.count
Examples
>>> df = pd.DataFrame({"Person":
... ["John", "Myla", "Lewis", "John", "Myla"],
... "Age": [24., np.nan, 21., 33, 26],
... "Single": [False, True, True, True, False]})
>>> df
Person Age Single
0 John 24.0 False
1 Myla NaN True
2 Lewis 21.0 True
3 John 33.0 True
4 Myla 26.0 False
>>> df.count()
Person 5
Age 4
Single 5
dtype: int64
>>> df.count(axis='columns')
0 3
1 2
2 3
3 3
4 3
dtype: int64
pandas.DataFrame.cov
DataFrame.cov(self, min_periods=None)
Compute pairwise covariance of columns, excluding NA/null values.
Compute the pairwise covariance among the series of a DataFrame. The returned data frame is
the covariance matrix of the columns of the DataFrame.
Both NA and null values are automatically excluded from the calculation. (See the note below
about bias from missing values.) A threshold can be set for the minimum number of observations
for each value created. Comparisons with observations below this threshold will be returned as
NaN.
This method is generally used for the analysis of time series data to understand the relationship
between different measures across time.
Parameters
min_periods [int, optional] Minimum number of observations required per pair
of columns to have a valid result.
Returns
DataFrame The covariance matrix of the series of the DataFrame.
See also:
Notes
Returns the covariance matrix of the DataFrame’s time series. The covariance is normalized by
N-1.
For DataFrames that have Series that are missing data (assuming that data is missing at random)
the returned covariance matrix will be an unbiased estimate of the variance and covariance
between the member Series.
However, for many applications this estimate may not be acceptable because the estimated covariance
matrix is not guaranteed to be positive semi-definite. This could lead to estimated correlations
having absolute values greater than one, and/or a non-invertible covariance matrix.
See Estimation of covariance matrices for more details.
Examples
>>> np.random.seed(42)
>>> df = pd.DataFrame(np.random.randn(1000, 5),
... columns=['a', 'b', 'c', 'd', 'e'])
>>> df.cov()
a b c d e
a 0.998438 -0.020161 0.059277 -0.008943 0.014144
b -0.020161 1.059352 -0.008543 -0.024738 0.009826
c 0.059277 -0.008543 1.010670 -0.001486 -0.000271
d -0.008943 -0.024738 -0.001486 0.921297 -0.013692
e 0.014144 0.009826 -0.000271 -0.013692 0.977795
>>> np.random.seed(42)
>>> df = pd.DataFrame(np.random.randn(20, 3),
... columns=['a', 'b', 'c'])
>>> df.loc[df.index[:5], 'a'] = np.nan
>>> df.loc[df.index[5:10], 'b'] = np.nan
>>> df.cov(min_periods=12)
a b c
a 0.316741 NaN -0.150812
b NaN 1.248003 0.191417
c -0.150812 0.191417 0.895202
pandas.DataFrame.cummax
Series or DataFrame
See also:
Examples
Series
>>> s.cummax()
0 2.0
1 NaN
2 5.0
3 5.0
4 5.0
dtype: float64
>>> s.cummax(skipna=False)
0 2.0
1 NaN
2 NaN
3 NaN
4 NaN
dtype: float64
DataFrame
By default, iterates over rows and finds the maximum in each column. This is equivalent to
axis=None or axis='index'.
>>> df.cummax()
A B
0 2.0 1.0
1 3.0 NaN
2 3.0 1.0
To iterate over columns and find the maximum in each row, use axis=1
>>> df.cummax(axis=1)
A B
0 2.0 2.0
1 3.0 NaN
2 1.0 1.0
pandas.DataFrame.cummin
Examples
Series
>>> s.cummin()
0 2.0
1 NaN
2 2.0
3 -1.0
4 -1.0
dtype: float64
>>> s.cummin(skipna=False)
0 2.0
1 NaN
2 NaN
3 NaN
4 NaN
dtype: float64
DataFrame
By default, iterates over rows and finds the minimum in each column. This is equivalent to
axis=None or axis='index'.
>>> df.cummin()
A B
0 2.0 1.0
1 2.0 NaN
2 1.0 0.0
To iterate over columns and find the minimum in each row, use axis=1
>>> df.cummin(axis=1)
A B
0 2.0 1.0
1 3.0 NaN
2 1.0 0.0
pandas.DataFrame.cumprod
Examples
Series
>>> s.cumprod()
0 2.0
1 NaN
2 10.0
3 -10.0
4 -0.0
dtype: float64
>>> s.cumprod(skipna=False)
0 2.0
1 NaN
2 NaN
3 NaN
4 NaN
dtype: float64
DataFrame
By default, iterates over rows and finds the product in each column. This is equivalent to
axis=None or axis='index'.
>>> df.cumprod()
A B
0 2.0 1.0
1 6.0 NaN
2 6.0 0.0
To iterate over columns and find the product in each row, use axis=1
>>> df.cumprod(axis=1)
A B
0 2.0 2.0
1 3.0 NaN
2 1.0 0.0
pandas.DataFrame.cumsum
Parameters
axis [{0 or ‘index’, 1 or ‘columns’}, default 0] The index or the name of the axis.
0 is equivalent to None or ‘index’.
skipna [boolean, default True] Exclude NA/null values. If an entire row/column
is NA, the result will be NA.
*args, **kwargs : Additional keywords have no effect but might be accepted for
compatibility with NumPy.
Returns
Series or DataFrame
See also:
Examples
Series
>>> s.cumsum()
0 2.0
1 NaN
2 7.0
3 6.0
4 6.0
dtype: float64
>>> s.cumsum(skipna=False)
0 2.0
1 NaN
2 NaN
3 NaN
4 NaN
dtype: float64
DataFrame
By default, iterates over rows and finds the sum in each column. This is equivalent to axis=None
or axis='index'.
>>> df.cumsum()
A B
0 2.0 1.0
1 5.0 NaN
2 6.0 1.0
To iterate over columns and find the sum in each row, use axis=1
>>> df.cumsum(axis=1)
A B
0 2.0 3.0
1 3.0 NaN
2 1.0 1.0
pandas.DataFrame.describe
to object columns submit the numpy.object data type. Strings can also be
used in the style of select_dtypes (e.g. df.describe(include=['O'])).
To select pandas categorical columns, use 'category'
• None (default) : The result will include all numeric columns.
exclude [list-like of dtypes or None (default), optional,] A black list of data types
to omit from the result. Ignored for Series. Here are the options:
• A list-like of dtypes : Excludes the provided data types from the result. To
exclude numeric types submit numpy.number. To exclude object columns
submit the data type numpy.object. Strings can also be used in the style of
select_dtypes (e.g. df.describe(include=['O'])). To exclude pandas
categorical columns, use 'category'
• None (default) : The result will exclude nothing.
Returns
Series or DataFrame Summary statistics of the Series or Dataframe provided.
See also:
Notes
For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50
and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The
50 percentile is the same as the median.
For object data (e.g. strings or timestamps), the result’s index will include count, unique, top,
and freq. The top is the most common value. The freq is the most common value’s frequency.
Timestamps also include the first and last items.
If multiple object values have the highest count, then the count and top results will be arbitrarily
chosen from among those with the highest count.
For mixed data types provided via a DataFrame, the default is to return only an analysis of
numeric columns. If the dataframe consists only of object and categorical data without any
numeric columns, the default is to return an analysis of both the object and categorical columns.
If include='all' is provided as an option, the result will include a union of attributes of each
type.
The include and exclude parameters can be used to limit which columns in a DataFrame are
analyzed for the output. The parameters are ignored when analyzing a Series.
Examples
>>> s = pd.Series([
... np.datetime64("2000-01-01"),
... np.datetime64("2010-01-01"),
... np.datetime64("2010-01-01")
... ])
>>> s.describe()
count 3
unique 2
top 2010-01-01 00:00:00
freq 2
first 2000-01-01 00:00:00
last 2010-01-01 00:00:00
dtype: object
>>> df.describe(include='all')
categorical numeric object
count 3 3.0 3
unique 3 NaN 3
top f NaN c
freq 1 NaN 1
mean NaN 2.0 NaN
std NaN 1.0 NaN
min NaN 1.0 NaN
25% NaN 1.5 NaN
50% NaN 2.0 NaN
75% NaN 2.5 NaN
max NaN 3.0 NaN
>>> df.numeric.describe()
count 3.0
mean 2.0
std 1.0
min 1.0
25% 1.5
50% 2.0
75% 2.5
max 3.0
Name: numeric, dtype: float64
>>> df.describe(include=['category'])
categorical
count 3
unique 3
top f
freq 1
>>> df.describe(exclude=[np.number])
categorical object
count 3 3
unique 3 3
top f c
freq 1 1
>>> df.describe(exclude=[np.object])
categorical numeric
count 3 3.0
unique 3 NaN
top f NaN
freq 1 NaN
mean NaN 2.0
std NaN 1.0
min NaN 1.0
25% NaN 1.5
50% NaN 2.0
75% NaN 2.5
max NaN 3.0
pandas.DataFrame.diff
Returns
DataFrame
See also:
Examples
>>> df.diff()
a b c
0 NaN NaN NaN
1 1.0 0.0 3.0
2 1.0 1.0 5.0
3 1.0 1.0 7.0
4 1.0 2.0 9.0
5 1.0 3.0 11.0
>>> df.diff(axis=1)
a b c
0 NaN 0.0 0.0
1 NaN -1.0 3.0
2 NaN -1.0 7.0
3 NaN -1.0 13.0
4 NaN 0.0 20.0
5 NaN 2.0 28.0
>>> df.diff(periods=3)
a b c
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 3.0 2.0 15.0
4 3.0 4.0 21.0
5 3.0 6.0 27.0
>>> df.diff(periods=-1)
a b c
0 -1.0 0.0 -3.0
1 -1.0 -1.0 -5.0
2 -1.0 -1.0 -7.0
3 -1.0 -2.0 -9.0
4 -1.0 -3.0 -11.0
5 NaN NaN NaN
pandas.DataFrame.div
Notes
Examples
Add a scalar using the operator version, which returns the same result.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.div(10)
angles degrees
circle 0.0 36.0
triangle 0.3 18.0
rectangle 0.4 36.0
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df - [1, 2]
angles degrees
circle -1 358
triangle 2 178
rectangle 3 358
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.divide
Notes
Examples
Add a scalar using the operator version, which returns the same result.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.div(10)
angles degrees
circle 0.0 36.0
triangle 0.3 18.0
rectangle 0.4 36.0
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df - [1, 2]
angles degrees
circle -1 358
triangle 2 178
rectangle 3 358
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.dot
DataFrame.dot(self, other)
Compute the matrix multiplication between the DataFrame and other.
This method computes the matrix product between the DataFrame and the values of an other
Series, DataFrame or a numpy array.
It can also be called using self @ other in Python >= 3.5.
Parameters
other [Series, DataFrame or array-like] The other object to compute the matrix
product with.
Returns
Series or DataFrame If other is a Series, return the matrix product between
self and other as a Series. If other is a DataFrame or a numpy.array, return the
matrix product of self and other in a DataFrame or a np.array.
See also:
Notes
The dimensions of DataFrame and other must be compatible in order to compute the matrix
multiplication. In addition, the column names of DataFrame and the index of other must contain
the same values, as they will be aligned prior to the multiplication.
The dot method for Series computes the inner product, instead of the matrix product here.
Examples
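The DataFrame used below is defined earlier in the upstream docstring and is not shown here; a
definition consistent with the outputs would be, for example:
>>> df = pd.DataFrame([[0, 1, -2, -1], [1, 1, 1, 1]])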
>>> other = pd.DataFrame([[0, 1], [1, 2], [-1, -1], [2, 0]])
>>> df.dot(other)
0 1
0 1 4
1 2 2
>>> df @ other
0 1
0 1 4
1 2 2
>>> arr = np.array([[0, 1], [1, 2], [-1, -1], [2, 0]])
>>> df.dot(arr)
0 1
0 1 4
1 2 2
Note how shuffling of the objects does not change the result.
pandas.DataFrame.drop
level [int or level name, optional] For MultiIndex, level from which the labels will
be removed.
inplace [bool, default False] If True, do operation inplace and return None.
errors [{‘ignore’, ‘raise’}, default ‘raise’] If ‘ignore’, suppress error and only ex-
isting labels are dropped.
Returns
DataFrame DataFrame without the removed index or column labels.
Raises
KeyError If any of the labels is not found in the selected axis.
See also:
Examples
Drop columns
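The example is missing from this extract; a minimal sketch (assuming numpy is imported as np):
>>> df = pd.DataFrame(np.arange(12).reshape(3, 4),
...                   columns=['A', 'B', 'C', 'D'])
>>> df.drop(['B', 'C'], axis=1)
   A   D
0  0   3
1  4   7
2  8  11
The same result can be obtained with the columns keyword: df.drop(columns=['B', 'C']).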
pandas.DataFrame.drop_duplicates
pandas.DataFrame.droplevel
Examples
>>> df = pd.DataFrame([
... [1, 2, 3, 4],
... [5, 6, 7, 8],
... [9, 10, 11, 12]
... ]).set_index([0, 1]).rename_axis(['a', 'b'])
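>>> # The upstream docstring also assigns MultiIndex columns at this point;
>>> # that line appears to have been lost in extraction. The output below is
>>> # consistent with, for example:
>>> df.columns = pd.MultiIndex.from_tuples(
...     [('c', 'e'), ('d', 'f')], names=['level_1', 'level_2'])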
>>> df
level_1 c d
level_2 e f
a b
1 2 3 4
5 6 7 8
9 10 11 12
>>> df.droplevel('a')
level_1 c d
level_2 e f
b
2 3 4
6 7 8
10 11 12
pandas.DataFrame.dropna
Examples
>>> df.dropna()
name toy born
1 Batman Batmobile 1940-04-25
>>> df.dropna(axis='columns')
name
0 Alfred
1 Batman
2 Catwoman
>>> df.dropna(how='all')
name toy born
0 Alfred NaN NaT
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT
>>> df.dropna(thresh=2)
name toy born
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT
>>> df.dropna(inplace=True)
>>> df
name toy born
1 Batman Batmobile 1940-04-25
pandas.DataFrame.duplicated
pandas.DataFrame.eq
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN !=
NaN ).
Examples
>>> df == 100
cost revenue
A False True
B False False
C True False
>>> df.eq(100)
cost revenue
A False True
B False False
C True False
When other is a Series, the columns of a DataFrame are aligned with the index of other and
broadcast:
When comparing to an arbitrary sequence, the number of columns must match the number of
elements in other:
>>> df.gt(other)
cost revenue
A False False
B False False
C False True
D False False
pandas.DataFrame.equals
DataFrame.equals(self, other)
Test whether two objects contain the same elements.
This function allows two Series or DataFrames to be compared against each other to see if they
have the same shape and elements. NaNs in the same location are considered equal. The column
headers do not need to have the same type, but the elements within the columns must be the
same dtype.
Parameters
other [Series or DataFrame] The other Series or DataFrame to be compared with
the first.
Returns
bool True if all elements are the same in both objects, False otherwise.
See also:
Series.eq Compare two Series objects of the same length and return a Series where each element
is True if the element in each Series is equal, False otherwise.
DataFrame.eq Compare two DataFrame objects of the same shape and return a DataFrame
where each element is True if the respective element in each DataFrame is equal, False
otherwise.
assert_series_equal Return True if left and right Series are equal, False otherwise.
assert_frame_equal Return True if left and right DataFrames are equal, False otherwise.
numpy.array_equal Return True if two arrays have the same shape and elements, False other-
wise.
Notes
This function requires that the elements have the same dtype as their respective elements in the
other Series or DataFrame. However, the column labels do not need to have the same type, as
long as they are still considered equal.
Examples
DataFrames df and exactly_equal have the same types and values for their elements and column
labels, which will return True.
DataFrames df and different_column_type have the same element types and values, but have
different types for the column labels, which will still return True.
DataFrames df and different_data_type have different types for the same values for their ele-
ments, and will return False even though their column labels are the same values and types.
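The DataFrames referenced above are not constructed in this extract; a minimal sketch of the first
and last cases:
>>> df = pd.DataFrame({1: [10], 2: [20]})
>>> exactly_equal = pd.DataFrame({1: [10], 2: [20]})
>>> df.equals(exactly_equal)
True
>>> different_data_type = pd.DataFrame({1: [10.0], 2: [20.0]})
>>> df.equals(different_data_type)
False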
pandas.DataFrame.eval
Notes
For more details see the API documentation for eval(). For detailed examples see enhancing
performance with eval.
Examples
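The example is empty in this extract; a minimal sketch:
>>> df = pd.DataFrame({'A': range(1, 6), 'B': range(10, 0, -2)})
>>> df.eval('A + B')
0    11
1    10
2     9
3     8
4     7
dtype: int64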
pandas.DataFrame.ewm
Notes
Exactly one of center of mass, span, half-life, and alpha must be provided. Allowed values and
relationship between the parameters are specified in the parameter descriptions above; see the
link at the end of this section for a detailed explanation.
When adjust is True (default), weighted averages are calculated using weights (1-alpha)**(n-1),
(1-alpha)**(n-2), …, 1-alpha, 1.
When adjust is False, weighted averages are calculated recursively as:
weighted_average[0] = arg[0]; weighted_average[i] = (1-alpha)*weighted_average[i-1]
+ alpha*arg[i].
When ignore_na is False (default), weights are based on absolute positions. For example, the
weights of x and y used in calculating the final weighted average of [x, None, y] are (1-alpha)**2
and 1 (if adjust is True), and (1-alpha)**2 and alpha (if adjust is False).
When ignore_na is True (reproducing pre-0.15.0 behavior), weights are based on relative posi-
tions. For example, the weights of x and y used in calculating the final weighted average of [x,
None, y] are 1-alpha and 1 (if adjust is True), and 1-alpha and alpha (if adjust is False).
More details can be found at http://pandas.pydata.org/pandas-docs/stable/user_guide/
computation.html#exponentially-weighted-windows
Examples
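The example below assumes a single-column frame with one missing value (a construction consistent with the output shown, assuming pandas as pd and NumPy as np):
>>> df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]})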
>>> df.ewm(com=0.5).mean()
B
0 0.000000
1 0.750000
2 1.615385
3 1.615385
4 3.670213
pandas.DataFrame.expanding
Notes
By default, the result is set to the right edge of the window. This can be changed to the center
of the window by setting center=True.
Examples
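The example below assumes the same kind of single-column frame with one missing value (consistent with the output shown):
>>> df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]})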
>>> df.expanding(2).sum()
B
0 NaN
1 1.0
2 3.0
3 3.0
4 7.0
pandas.DataFrame.explode
Notes
This routine will explode list-likes including lists, tuples, Series, and np.ndarray. The result dtype
of the subset rows will be object. Scalars will be returned unchanged. Empty list-likes will result
in a np.nan for that row.
Examples
>>> df = pd.DataFrame({'A': [[1, 2, 3], 'foo', [], [3, 4]], 'B': 1})
>>> df
A B
0 [1, 2, 3] 1
1 foo 1
2 [] 1
3 [3, 4] 1
>>> df.explode('A')
A B
0 1 1
0 2 1
0 3 1
1 foo 1
2 NaN 1
3 3 1
3 4 1
pandas.DataFrame.ffill
pandas.DataFrame.fillna
limit [int, default None] If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.
downcast [dict, default is None] A dict of item->dtype of what to downcast if
possible, or the string ‘infer’ which will try to downcast to an appropriate equal
type (e.g. float64 to int64 if possible).
Returns
DataFrame Object with missing values filled.
See also:
Examples
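The examples below assume a frame along these lines (a hypothetical construction consistent with the outputs shown, assuming pandas as pd and NumPy as np):
>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],
...                    [3, 4, np.nan, 1],
...                    [np.nan, np.nan, np.nan, 5],
...                    [np.nan, 3, np.nan, 4]],
...                   columns=list('ABCD'))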
>>> df.fillna(0)
A B C D
0 0.0 2.0 0.0 0
1 3.0 4.0 0.0 1
2 0.0 0.0 0.0 5
3 0.0 3.0 0.0 4
>>> df.fillna(method='ffill')
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 3.0 4.0 NaN 5
3 3.0 3.0 NaN 4
Replace all NaN elements in columns ‘A’, ‘B’, ‘C’, and ‘D’ with 0, 1, 2, and 3 respectively.
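A sketch of that call, using the frame above:
>>> values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
>>> df.fillna(value=values)
     A    B    C  D
0  0.0  2.0  2.0  0
1  3.0  4.0  2.0  1
2  0.0  1.0  2.0  5
3  0.0  3.0  2.0  4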
pandas.DataFrame.filter
DataFrame.loc
Notes
The items, like, and regex parameters are enforced to be mutually exclusive.
axis defaults to the info axis that is used when indexing with [].
Examples
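A minimal sketch of the three selection modes (hypothetical frame, assuming pandas as pd and NumPy as np):
>>> df = pd.DataFrame(np.array(([1, 2, 3], [4, 5, 6])),
...                   index=['mouse', 'rabbit'],
...                   columns=['one', 'two', 'three'])
>>> df.filter(items=['one', 'three'])
        one  three
mouse     1      3
rabbit    4      6
>>> df.filter(regex='e$', axis=1)
        one  three
mouse     1      3
rabbit    4      6
>>> df.filter(like='bbi', axis=0)
        one  two  three
rabbit    4    5      6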
pandas.DataFrame.first
DataFrame.first(self, offset)
Convenience method for subsetting initial periods of time series data based on a date offset.
Parameters
offset [string, DateOffset, dateutil.relativedelta]
Returns
subset [same type as caller]
Raises
TypeError If the index is not a DatetimeIndex
See also:
Examples
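The example below assumes a time-indexed frame along these lines (consistent with the output shown):
>>> i = pd.date_range('2018-04-09', periods=4, freq='2D')
>>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
>>> ts
            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4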
>>> ts.first('3D')
A
2018-04-09 1
2018-04-11 2
Notice that the data for the first 3 calendar days were returned, not the first 3 days observed in the dataset, and therefore data for 2018-04-13 was not returned.
pandas.DataFrame.first_valid_index
DataFrame.first_valid_index(self )
Return index for first non-NA/null value.
Returns
scalar [type of index]
Notes
If all elements are non-NA/null, returns None. Also returns None for empty Series/DataFrame.
pandas.DataFrame.floordiv
Notes
Examples
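The arithmetic examples below assume frames along these lines (a hypothetical construction consistent with the outputs shown):
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])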
Add a scalar with the operator version, which returns the same results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.div(10)
angles degrees
circle 0.0 36.0
triangle 0.3 18.0
rectangle 0.4 36.0
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.from_dict
Examples
>>> data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
>>> pd.DataFrame.from_dict(data)
col_1 col_2
0 3 a
1 2 b
2 1 c
3 0 d
>>> data = {'row_1': [3, 2, 1, 0], 'row_2': ['a', 'b', 'c', 'd']}
>>> pd.DataFrame.from_dict(data, orient='index')
0 1 2 3
row_1 3 2 1 0
row_2 a b c d
When using the ‘index’ orientation, the column names can be specified manually:
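For instance:
>>> pd.DataFrame.from_dict(data, orient='index',
...                        columns=['A', 'B', 'C', 'D'])
       A  B  C  D
row_1  3  2  1  0
row_2  a  b  c  d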
pandas.DataFrame.from_items
pandas.DataFrame.from_records
index [string, list of fields, array-like] Field of array to use as the index, alternately
a specific set of input labels to use
exclude [sequence, default None] Columns or fields to exclude
columns [sequence, default None] Column names to use. If the passed data do
not have names associated with them, this argument provides names for the
columns. Otherwise this argument indicates the order of the columns in the
result (any names not found in the data will become all-NA columns)
coerce_float [boolean, default False] Attempt to convert values of non-string,
non-numeric objects (like decimal.Decimal) to floating point, useful for SQL
result sets
nrows [int, default None] Number of rows to read if data is an iterator
Returns
DataFrame
pandas.DataFrame.ge
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN !=
NaN ).
Examples
>>> df == 100
cost revenue
A False True
B False False
C True False
>>> df.eq(100)
cost revenue
A False True
B False False
C True False
When other is a Series, the columns of a DataFrame are aligned with the index of other and
broadcast:
When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:
>>> df.gt(other)
cost revenue
A False False
B False False
C False True
D False False
pandas.DataFrame.get
pandas.DataFrame.get_dtype_counts
DataFrame.get_dtype_counts(self )
Return counts of unique dtypes in this object.
Deprecated since version 0.25.0.
Use .dtypes.value_counts() instead.
Returns
dtype [Series] Series with the count of columns with each dtype.
See also:
Examples
>>> df.get_dtype_counts()
float64 1
int64 1
object 1
dtype: int64
pandas.DataFrame.get_ftype_counts
DataFrame.get_ftype_counts(self )
Return counts of unique ftypes in this object.
Deprecated since version 0.23.0.
This is useful for SparseDataFrame or for DataFrames containing sparse arrays.
Returns
dtype [Series] Series with the count of columns with each type and sparsity
(dense/sparse).
See also:
Examples
pandas.DataFrame.get_value
pandas.DataFrame.get_values
DataFrame.get_values(self )
Return an ndarray after converting sparse values to dense.
Deprecated since version 0.25.0: Use np.asarray(..) or DataFrame.values() instead.
This is the same as .values for non-sparse data. For sparse data contained in a SparseArray,
the data are first converted to a dense representation.
Returns
numpy.ndarray Numpy representation of DataFrame.
See also:
Examples
>>> df.get_values()
array([[1, True, 1.0], [2, False, 2.0]], dtype=object)
>>> df.get_values()
array([[ 1., 1.],
[nan, 2.],
[nan, 3.]])
pandas.DataFrame.groupby
Parameters
by [mapping, function, label, or list of labels] Used to determine the groups for the
groupby. If by is a function, it’s called on each value of the object’s index. If a
dict or Series is passed, the Series or dict VALUES will be used to determine
the groups (the Series’ values are first aligned; see .align() method). If an
ndarray is passed, the values are used as-is to determine the groups. A label or
list of labels may be passed to group by the columns in self. Notice that a
tuple is interpreted as a (single) key.
axis [{0 or ‘index’, 1 or ‘columns’}, default 0] Split along rows (0) or columns (1).
level [int, level name, or sequence of such, default None] If the axis is a MultiIndex
(hierarchical), group by a particular level or levels.
as_index [bool, default True] For aggregated output, return object with group
labels as the index. Only relevant for DataFrame input. as_index=False is
effectively “SQL-style” grouped output.
sort [bool, default True] Sort group keys. Get better performance by turning this
off. Note this does not influence the order of observations within each group.
Groupby preserves the order of rows within each group.
group_keys [bool, default True] When calling apply, add group keys to index to
identify pieces.
squeeze [bool, default False] Reduce the dimensionality of the return type if pos-
sible, otherwise return a consistent type.
observed [bool, default False] This only applies if any of the groupers are Cate-
goricals. If True: only show observed values for categorical groupers. If False:
show all values for categorical groupers.
New in version 0.23.0.
**kwargs Optional, only accepts keyword argument ‘mutated’ and is passed to
groupby.
Returns
DataFrameGroupBy or SeriesGroupBy Depends on the calling object and
returns groupby object that contains information about the groups.
See also:
resample Convenience method for frequency conversion and resampling of time series.
Notes
Examples
Hierarchical Indexes
We can groupby different levels of a hierarchical index using the level parameter:
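A minimal sketch (hypothetical frame, assuming pandas is imported as pd):
>>> arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],
...           ['Captive', 'Wild', 'Captive', 'Wild']]
>>> index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))
>>> df = pd.DataFrame({'Max Speed': [390., 350., 30., 20.]}, index=index)
>>> df.groupby(level=0).mean()
         Max Speed
Animal
Falcon       370.0
Parrot        25.0
>>> df.groupby(level='Type').mean()
         Max Speed
Type
Captive      210.0
Wild         185.0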
pandas.DataFrame.gt
level [int or label] Broadcast across a level, matching Index values on the passed
MultiIndex level.
Returns
DataFrame of bool Result of the comparison.
See also:
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN !=
NaN ).
Examples
>>> df == 100
cost revenue
A False True
B False False
C True False
>>> df.eq(100)
cost revenue
A False True
B False False
C True False
When other is a Series, the columns of a DataFrame are aligned with the index of other and
broadcast:
When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:
>>> df.gt(other)
cost revenue
A False False
B False False
C False True
D False False
pandas.DataFrame.head
DataFrame.head(self, n=5)
Return the first n rows.
This function returns the first n rows for the object based on position. It is useful for quickly
testing if your object has the right type of data in it.
Parameters
n [int, default 5] Number of rows to select.
Returns
obj_head [same type as caller] The first n rows of the caller object.
See also:
Examples
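The outputs below are consistent with a frame along these lines (hypothetical construction):
>>> df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion',
...                               'monkey', 'parrot', 'shark', 'whale',
...                               'zebra']})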
>>> df.head()
animal
0 alligator
1 bee
2 falcon
3 lion
4 monkey
>>> df.head(3)
animal
0 alligator
1 bee
2 falcon
pandas.DataFrame.hist
Examples
This example draws a histogram based on the length and width of some animals, displayed in
three bins
>>> df = pd.DataFrame({
... 'length': [1.5, 0.5, 1.2, 0.9, 3],
... 'width': [0.7, 0.2, 0.15, 0.2, 1.1]
... }, index= ['pig', 'rabbit', 'duck', 'chicken', 'horse'])
>>> hist = df.hist(bins=3)
pandas.DataFrame.idxmax
See also:
Series.idxmax
Notes
pandas.DataFrame.idxmin
Series.idxmin
Notes
pandas.DataFrame.infer_objects
DataFrame.infer_objects(self )
Attempt to infer better dtypes for object columns.
Attempts soft conversion of object-dtyped columns, leaving non-object and unconvertible columns
unchanged. The inference rules are the same as during normal Series/DataFrame construction.
New in version 0.21.0.
Returns
converted [same type as input object]
See also:
Examples
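The example below assumes an object column that holds only integers after slicing (a hypothetical construction consistent with the output shown):
>>> df = pd.DataFrame({"A": ["a", 1, 2, 3]})
>>> df = df.iloc[1:]
>>> df
   A
1  1
2  2
3  3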
>>> df.dtypes
A object
dtype: object
>>> df.infer_objects().dtypes
A int64
dtype: object
pandas.DataFrame.info
Examples
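The outputs below are consistent with a frame along these lines (hypothetical construction):
>>> int_values = [1, 2, 3, 4, 5]
>>> text_values = ['alpha', 'beta', 'gamma', 'delta', 'epsilon']
>>> float_values = [0.0, 0.25, 0.5, 0.75, 1.0]
>>> df = pd.DataFrame({"int_col": int_values, "text_col": text_values,
...                    "float_col": float_values})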
>>> df.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
int_col 5 non-null int64
text_col 5 non-null object
float_col 5 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 248.0+ bytes
Prints a summary of the column count and dtypes, but not per-column information:
>>> df.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Columns: 3 entries, int_col to float_col
dtypes: float64(1), int64(1), object(1)
memory usage: 248.0+ bytes
Pipe the output of DataFrame.info to a buffer instead of sys.stdout, get the buffer content and write it to a text file:
>>> import io
>>> buffer = io.StringIO()
>>> df.info(buf=buffer)
>>> s = buffer.getvalue()
>>> with open("df_info.txt", "w",
... encoding="utf-8") as f: # doctest: +SKIP
... f.write(s)
The memory_usage parameter allows deep introspection mode, especially useful for big DataFrames and fine-tuning memory optimization:
>>> df.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 3 columns):
column_1 1000000 non-null object
column_2 1000000 non-null object
column_3 1000000 non-null object
dtypes: object(3)
memory usage: 188.8 MB
pandas.DataFrame.insert
pandas.DataFrame.interpolate
Series or DataFrame Returns the same object type as the caller, interpolated
at some or all NaN values.
See also:
Notes
The ‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’ and ‘akima’ methods are wrappers around
the respective SciPy implementations of similar names. These use the actual numerical values
of the index. For more information on their behavior, see the SciPy documentation and SciPy
tutorial.
Examples
Filling in NaN in a Series by padding, but filling at most two consecutive NaN at a time.
>>> s = pd.Series([np.nan, "single_one", np.nan,
... "fill_two_more", np.nan, np.nan, np.nan,
... 4.71, np.nan])
>>> s
0 NaN
1 single_one
2 NaN
3 fill_two_more
4 NaN
5 NaN
6 NaN
7 4.71
8 NaN
dtype: object
Filling in NaN in a Series via polynomial interpolation or splines: Both ‘polynomial’ and ‘spline’
methods require that you also specify an order (int).
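For instance, a sketch with a hypothetical Series (assuming pandas as pd and NumPy as np):
>>> s = pd.Series([0, 2, np.nan, 8])
>>> s.interpolate(method='polynomial', order=2)
0    0.000000
1    2.000000
2    4.666667
3    8.000000
dtype: float64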
Fill the DataFrame forward (that is, going down) along each column using linear interpolation.
Note how the last entry in column ‘a’ is interpolated differently, because there is no entry after
it to use for interpolation. Note how the first entry in column ‘b’ remains NaN, because there is
no entry before it to use for interpolation.
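A sketch of that case (hypothetical frame):
>>> df = pd.DataFrame([(0.0, np.nan, -1.0, 1.0),
...                    (np.nan, 2.0, np.nan, np.nan),
...                    (2.0, 3.0, np.nan, 9.0),
...                    (np.nan, 4.0, -4.0, 16.0)],
...                   columns=list('abcd'))
>>> df.interpolate(method='linear', limit_direction='forward', axis=0)
     a    b    c     d
0  0.0  NaN -1.0   1.0
1  1.0  2.0 -2.0   5.0
2  2.0  3.0 -3.0   9.0
3  2.0  4.0 -4.0  16.0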
pandas.DataFrame.isin
DataFrame.isin(self, values)
Whether each element in the DataFrame is contained in values.
Parameters
values [iterable, Series, DataFrame or dict] The result will only be true at a
location if all the labels match. If values is a Series, that’s the index. If values
is a dict, the keys must be the column names, which must match. If values is
a DataFrame, then both the index and column labels must match.
Returns
DataFrame DataFrame of booleans showing whether each element in the
DataFrame is contained in values.
See also:
Examples
When values is a list check whether every value in the DataFrame is present in the list (which
animals have 0 or 2 legs or wings)
When values is a dict, we can pass values to check for each column separately:
When values is a Series or DataFrame the index and column must match. Note that ‘falcon’
does not match based on the number of legs in df2.
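A minimal sketch of the list, dict and DataFrame cases (hypothetical frames df and df2):
>>> df = pd.DataFrame({'num_legs': [2, 4], 'num_wings': [2, 0]},
...                   index=['falcon', 'dog'])
>>> df.isin([0, 2])
        num_legs  num_wings
falcon      True       True
dog        False       True
>>> df.isin({'num_wings': [0, 3]})
        num_legs  num_wings
falcon     False      False
dog        False       True
>>> df2 = pd.DataFrame({'num_legs': [8, 3], 'num_wings': [0, 2]},
...                    index=['spider', 'falcon'])
>>> df.isin(df2)
        num_legs  num_wings
falcon     False       True
dog        False      False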
pandas.DataFrame.isna
DataFrame.isna(self )
Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters
such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.
options.mode.use_inf_as_na = True).
Returns
DataFrame Mask of bool values for each element in DataFrame that indicates whether an element is an NA value.
See also:
Examples
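The outputs below are consistent with a frame and Series along these lines (hypothetical constructions, assuming pandas as pd and NumPy as np):
>>> df = pd.DataFrame({'age': [5, 6, np.NaN],
...                    'born': [pd.NaT, pd.Timestamp('1939-05-27'),
...                             pd.Timestamp('1940-04-25')],
...                    'name': ['Alfred', 'Batman', ''],
...                    'toy': [None, 'Batmobile', 'Joker']})
>>> ser = pd.Series([5, 6, np.NaN])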
>>> df.isna()
age born name toy
0 False True False True
1 False False False False
2 True False False False
>>> ser.isna()
0 False
1 False
2 True
dtype: bool
pandas.DataFrame.isnull
DataFrame.isnull(self )
Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters
such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.
options.mode.use_inf_as_na = True).
Returns
DataFrame Mask of bool values for each element in DataFrame that indicates whether an element is an NA value.
See also:
Examples
>>> df.isna()
age born name toy
0 False True False True
1 False False False False
2 True False False False
>>> ser.isna()
0 False
1 False
2 True
dtype: bool
pandas.DataFrame.items
DataFrame.items(self )
Iterator over (column name, Series) pairs.
Iterates over the DataFrame columns, returning a tuple with the column name and the content
as a Series.
Yields
label [object] The column names for the DataFrame being iterated over.
content [Series] The column entries belonging to each label, as a Series.
See also:
Examples
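A short sketch (hypothetical frame):
>>> df = pd.DataFrame({'species': ['bear', 'bear', 'marsupial'],
...                    'population': [1864, 22000, 80000]},
...                   index=['panda', 'polar', 'koala'])
>>> for label, content in df.items():
...     print('label:', label)
...
label: species
label: population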
pandas.DataFrame.iteritems
DataFrame.iteritems(self )
Iterator over (column name, Series) pairs.
Iterates over the DataFrame columns, returning a tuple with the column name and the content
as a Series.
Returns
label [object] The column names for the DataFrame being iterated over.
content [Series] The column entries belonging to each label, as a Series.
See also:
Examples
pandas.DataFrame.iterrows
DataFrame.iterrows(self )
Iterate over DataFrame rows as (index, Series) pairs.
Yields
index [label or tuple of label] The index of the row. A tuple for a MultiIndex.
data [Series] The data of the row as a Series.
it [generator] A generator that iterates over the rows of the frame.
See also:
Notes
1. Because iterrows returns a Series for each row, it does not preserve dtypes across the rows
(dtypes are preserved across columns for DataFrames). For example, see the sketch after this list.
To preserve dtypes while iterating over the rows, it is better to use itertuples() which
returns namedtuples of the values and which is generally faster than iterrows.
2. You should never modify something you are iterating over. This is not guaranteed to work
in all cases. Depending on the data types, the iterator returns a copy and not a view, and
writing to it will have no effect.
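A sketch of the dtype behaviour mentioned in note 1 (hypothetical frame):
>>> df = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])
>>> row = next(df.iterrows())[1]
>>> row
int      1.0
float    1.5
Name: 0, dtype: float64
>>> print(row['int'].dtype)
float64
>>> print(df['int'].dtype)
int64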
pandas.DataFrame.itertuples
Notes
The column names will be renamed to positional names if they are invalid Python identifiers,
repeated, or start with an underscore. With a large number of columns (>255), regular tuples
are returned.
Examples
By setting the index parameter to False we can remove the index as the first element of the tuple:
With the name parameter set we set a custom name for the yielded namedtuples:
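A minimal sketch of all three variants (hypothetical frame):
>>> df = pd.DataFrame({'num_legs': [4, 2], 'num_wings': [0, 2]},
...                   index=['dog', 'hawk'])
>>> for row in df.itertuples():
...     print(row)
...
Pandas(Index='dog', num_legs=4, num_wings=0)
Pandas(Index='hawk', num_legs=2, num_wings=2)
>>> for row in df.itertuples(index=False):
...     print(row)
...
Pandas(num_legs=4, num_wings=0)
Pandas(num_legs=2, num_wings=2)
>>> for row in df.itertuples(name='Animal'):
...     print(row)
...
Animal(Index='dog', num_legs=4, num_wings=0)
Animal(Index='hawk', num_legs=2, num_wings=2)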
pandas.DataFrame.join
Notes
Parameters on, lsuffix, and rsuffix are not supported when passing a list of DataFrame objects.
Support for specifying index levels as the on parameter was added in version 0.23.0.
Examples
>>> df
key A
0 K0 A0
1 K1 A1
2 K2 A2
3 K3 A3
4 K4 A4
5 K5 A5
>>> other
key B
0 K0 B0
1 K1 B1
2 K2 B2
If we want to join using the key columns, we need to set key to be the index in both df and other.
The joined DataFrame will have key as its index.
>>> df.set_index('key').join(other.set_index('key'))
A B
key
K0 A0 B0
K1 A1 B1
K2 A2 B2
K3 A3 NaN
K4 A4 NaN
K5 A5 NaN
Another option to join using the key columns is to use the on parameter. DataFrame.join
always uses other’s index but we can use any column in df. This method preserves the original
DataFrame’s index in the result.
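For instance:
>>> df.join(other.set_index('key'), on='key')
  key   A    B
0  K0  A0   B0
1  K1  A1   B1
2  K2  A2   B2
3  K3  A3  NaN
4  K4  A4  NaN
5  K5  A5  NaN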
pandas.DataFrame.keys
DataFrame.keys(self )
Get the ‘info axis’ (see Indexing for more)
This is index for Series, columns for DataFrame.
Returns
Index Info axis.
pandas.DataFrame.kurt
pandas.DataFrame.kurtosis
pandas.DataFrame.last
DataFrame.last(self, offset)
Convenience method for subsetting final periods of time series data based on a date offset.
Parameters
offset [string, DateOffset, dateutil.relativedelta]
Returns
subset [same type as caller]
Raises
TypeError If the index is not a DatetimeIndex
See also:
Examples
>>> ts.last('3D')
A
2018-04-13 3
2018-04-15 4
Notice that the data for the last 3 calendar days were returned, not the last 3 observed days in the dataset, and therefore data for 2018-04-11 was not returned.
pandas.DataFrame.last_valid_index
DataFrame.last_valid_index(self )
Return index for last non-NA/null value.
Returns
scalar [type of index]
Notes
If all elements are non-NA/null, returns None. Also returns None for empty Series/DataFrame.
pandas.DataFrame.le
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN !=
NaN ).
Examples
>>> df == 100
cost revenue
A False True
B False False
C True False
>>> df.eq(100)
cost revenue
A False True
B False False
C True False
When other is a Series, the columns of a DataFrame are aligned with the index of other and
broadcast:
When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:
>>> df.gt(other)
cost revenue
A False False
B False False
C False True
D False False
pandas.DataFrame.lookup
Notes
Akin to:
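Roughly the following sketch, where row_labels and col_labels are the label sequences passed to lookup:
result = [df.loc[row, col]
          for row, col in zip(row_labels, col_labels)]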
Examples
pandas.DataFrame.lt
level [int or label] Broadcast across a level, matching Index values on the passed
MultiIndex level.
Returns
DataFrame of bool Result of the comparison.
See also:
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN !=
NaN ).
Examples
>>> df == 100
cost revenue
A False True
B False False
C True False
>>> df.eq(100)
cost revenue
A False True
B False False
C True False
When other is a Series, the columns of a DataFrame are aligned with the index of other and
broadcast:
When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:
>>> df.gt(other)
cost revenue
A False False
B False False
C False True
D False False
pandas.DataFrame.mad
pandas.DataFrame.mask
Notes
The mask method is an application of the if-then idiom. For each element in the calling
DataFrame, if cond is False the element is used; otherwise the corresponding element from
the DataFrame other is used.
The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m,
df2) is equivalent to np.where(m, df1, df2).
For further details and examples see the mask documentation in indexing.
Examples
>>> s = pd.Series(range(5))
>>> s.where(s > 0)
0 NaN
1 1.0
2 2.0
3 3.0
4 4.0
dtype: float64
pandas.DataFrame.max
Parameters
axis [{index (0), columns (1)}] Axis for the function to be applied on.
skipna [bool, default True] Exclude NA/null values when computing the result.
level [int or level name, default None] If the axis is a MultiIndex (hierarchical),
count along a particular level, collapsing into a Series.
numeric_only [bool, default None] Include only float, int, boolean columns. If
None, will attempt to use everything, then use only numeric data. Not imple-
mented for Series.
**kwargs Additional keyword arguments to be passed to the function.
Returns
Series or DataFrame (if level specified)
See also:
Examples
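The outputs below are consistent with a Series along these lines (hypothetical construction):
>>> idx = pd.MultiIndex.from_arrays([
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)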
>>> s.max()
8
>>> s.max(level='blooded')
blooded
warm 4
cold 8
Name: legs, dtype: int64
>>> s.max(level=0)
blooded
warm 4
cold 8
Name: legs, dtype: int64
pandas.DataFrame.mean
pandas.DataFrame.median
skipna [bool, default True] Exclude NA/null values when computing the result.
level [int or level name, default None] If the axis is a MultiIndex (hierarchical),
count along a particular level, collapsing into a Series.
numeric_only [bool, default None] Include only float, int, boolean columns. If
None, will attempt to use everything, then use only numeric data. Not imple-
mented for Series.
**kwargs Additional keyword arguments to be passed to the function.
Returns
Series or DataFrame (if level specified)
pandas.DataFrame.melt
melt
pivot_table
DataFrame.pivot
Series.explode
Examples
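A short sketch (hypothetical frame):
>>> df = pd.DataFrame({'A': ['a', 'b', 'c'],
...                    'B': [1, 3, 5],
...                    'C': [2, 4, 6]})
>>> df.melt(id_vars=['A'], value_vars=['B'])
   A variable  value
0  a        B      1
1  b        B      3
2  c        B      5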
pandas.DataFrame.memory_usage
Examples
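The outputs below are consistent with a frame built from 5000 values of each dtype (a hypothetical construction, assuming NumPy as np):
>>> dtypes = ['int64', 'float64', 'complex128', 'object', 'bool']
>>> data = dict([(t, np.ones(shape=5000).astype(t)) for t in dtypes])
>>> df = pd.DataFrame(data)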
>>> df.memory_usage()
Index 128
int64 40000
float64 40000
complex128 80000
object 40000
bool 5000
dtype: int64
>>> df.memory_usage(index=False)
int64 40000
float64 40000
complex128 80000
object 40000
bool 5000
dtype: int64
>>> df.memory_usage(deep=True)
Index 128
int64 40000
float64 40000
complex128 80000
object 160000
bool 5000
dtype: int64
Use a Categorical for efficient storage of an object-dtype column with many repeated values.
>>> df['object'].astype('category').memory_usage(deep=True)
5216
pandas.DataFrame.merge
• right: use only keys from right frame, similar to a SQL right outer join;
preserve key order.
• outer: use union of keys from both frames, similar to a SQL full outer join;
sort keys lexicographically.
• inner: use intersection of keys from both frames, similar to a SQL inner join;
preserve the order of the left keys.
on [label or list] Column or index level names to join on. These must be found in
both DataFrames. If on is None and not merging on indexes then this defaults
to the intersection of the columns in both DataFrames.
left_on [label or list, or array-like] Column or index level names to join on in the
left DataFrame. Can also be an array or list of arrays of the length of the left
DataFrame. These arrays are treated as if they are columns.
right_on [label or list, or array-like] Column or index level names to join on in
the right DataFrame. Can also be an array or list of arrays of the length of
the right DataFrame. These arrays are treated as if they are columns.
left_index [bool, default False] Use the index from the left DataFrame as the
join key(s). If it is a MultiIndex, the number of keys in the other DataFrame
(either the index or a number of columns) must match the number of levels.
right_index [bool, default False] Use the index from the right DataFrame as the
join key. Same caveats as left_index.
sort [bool, default False] Sort the join keys lexicographically in the result
DataFrame. If False, the order of the join keys depends on the join type
(how keyword).
suffixes [tuple of (str, str), default (‘_x’, ‘_y’)] Suffix to apply to overlapping
column names in the left and right side, respectively. To raise an exception on
overlapping columns use (False, False).
copy [bool, default True] If False, avoid copy if possible.
indicator [bool or str, default False] If True, adds a column to output DataFrame
called “_merge” with information on the source of each row. If string, column
with information on source of each row will be added to output DataFrame,
and column will be named value of string. Information column is Categorical-
type and takes on a value of “left_only” for observations whose merge key only
appears in ‘left’ DataFrame, “right_only” for observations whose merge key
only appears in ‘right’ DataFrame, and “both” if the observation’s merge key
is found in both.
validate [str, optional] If specified, checks if merge is of specified type.
• “one_to_one” or “1:1”: check if merge keys are unique in both left and right
datasets.
• “one_to_many” or “1:m”: check if merge keys are unique in left dataset.
• “many_to_one” or “m:1”: check if merge keys are unique in right dataset.
• “many_to_many” or “m:m”: allowed, but does not result in checks.
New in version 0.21.0.
Returns
DataFrame A DataFrame of the two merged objects.
See also:
Notes
Support for specifying index levels as the on, left_on, and right_on parameters was added in
version 0.23.0 Support for merging named Series objects was added in version 0.24.0
Examples
Merge df1 and df2 on the lkey and rkey columns. The value columns have the default suffixes,
_x and _y, appended.
Merge DataFrames df1 and df2 with specified left and right suffixes appended to any overlapping
columns.
Merge DataFrames df1 and df2, but raise an exception if the DataFrames have any overlapping
columns.
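A sketch of the first case (hypothetical frames):
>>> df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
...                     'value': [1, 2, 3, 5]})
>>> df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
...                     'value': [5, 6, 7, 8]})
>>> df1.merge(df2, left_on='lkey', right_on='rkey')
  lkey  value_x rkey  value_y
0  foo        1  foo        5
1  foo        1  foo        8
2  foo        5  foo        5
3  foo        5  foo        8
4  bar        2  bar        6
5  baz        3  baz        7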
pandas.DataFrame.min
Parameters
axis [{index (0), columns (1)}] Axis for the function to be applied on.
skipna [bool, default True] Exclude NA/null values when computing the result.
level [int or level name, default None] If the axis is a MultiIndex (hierarchical),
count along a particular level, collapsing into a Series.
numeric_only [bool, default None] Include only float, int, boolean columns. If
None, will attempt to use everything, then use only numeric data. Not imple-
mented for Series.
**kwargs Additional keyword arguments to be passed to the function.
Returns
Series or DataFrame (if level specified)
See also:
DataFrame.idxmax Return the index of the maximum over the requested axis.
Examples
>>> s.min()
0
>>> s.min(level='blooded')
blooded
warm 2
cold 0
Name: legs, dtype: int64
>>> s.min(level=0)
blooded
warm 2
cold 0
Name: legs, dtype: int64
pandas.DataFrame.mod
level [int or label] Broadcast across a level, matching Index values on the passed
MultiIndex level.
fill_value [float or None, default None] Fill existing missing (NaN) values, and
any new element needed for successful DataFrame alignment, with this value
before computation. If data in both corresponding DataFrame locations is
missing the result will be missing.
Returns
DataFrame Result of the arithmetic operation.
See also:
Notes
Examples
Add a scalar with the operator version, which returns the same results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.div(10)
angles degrees
circle 0.0 36.0
triangle 0.3 18.0
rectangle 0.4 36.0
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df - [1, 2]
angles degrees
circle -1 358
triangle 2 178
rectangle 3 358
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.mode
Returns
DataFrame The modes of each column or row.
See also:
Examples
By default, missing values are not considered, and the modes of wings are 0 and 2. The second row of species and legs contains NaN, because they have only one mode, but the DataFrame has two rows.
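The outputs below are consistent with a frame along these lines (hypothetical construction, assuming NumPy as np):
>>> df = pd.DataFrame([('bird', 2, 2),
...                    ('mammal', 4, np.nan),
...                    ('arthropod', 8, 0),
...                    ('bird', 2, np.nan)],
...                   index=('falcon', 'horse', 'spider', 'ostrich'),
...                   columns=('species', 'legs', 'wings'))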
>>> df.mode()
species legs wings
0 bird 2.0 0.0
1 NaN NaN 2.0
With dropna=False, NaN values are considered and they can be the mode (like for wings).
>>> df.mode(dropna=False)
species legs wings
0 bird 2 NaN
Setting numeric_only=True, only the mode of numeric columns is computed, and columns of
other types are ignored.
>>> df.mode(numeric_only=True)
legs wings
0 2.0 0.0
1 NaN 2.0
To compute the mode over columns and not rows, use the axis parameter:
pandas.DataFrame.mul
Notes
Examples
Add a scalar with the operator version, which returns the same results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.div(10)
angles degrees
circle 0.0 36.0
triangle 0.3 18.0
rectangle 0.4 36.0
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df - [1, 2]
angles degrees
circle -1 358
triangle 2 178
rectangle 3 358
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.multiply
Notes
Examples
Add a scalar with the operator version, which returns the same results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.div(10)
angles degrees
circle 0.0 36.0
triangle 0.3 18.0
rectangle 0.4 36.0
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df - [1, 2]
angles degrees
circle -1 358
triangle 2 178
rectangle 3 358
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.ne
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN !=
NaN ).
Examples
>>> df == 100
cost revenue
A False True
B False False
C True False
>>> df.eq(100)
cost revenue
A False True
B False False
C True False
When other is a Series, the columns of a DataFrame are aligned with the index of other and
broadcast:
When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:
>>> df.gt(other)
cost revenue
A False False
B False False
C False True
D False False
pandas.DataFrame.nlargest
Notes
This function cannot be used with all column types. For example, when specifying columns with
object or category dtypes, TypeError is raised.
Examples
In the following example, we will use nlargest to select the three rows having the largest values
in column “population”.
To order by the largest values in column “population” and then “GDP”, we can specify multiple
columns like in the next example.
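A simplified sketch of both calls (a hypothetical frame, smaller than the one used in the original docs):
>>> df = pd.DataFrame({'population': [59000000, 65000000, 434000, 434000],
...                    'GDP': [1937894, 2583560, 4520, 12011]},
...                   index=['Italy', 'France', 'Malta', 'Maldives'])
>>> df.nlargest(3, 'population')
          population      GDP
France      65000000  2583560
Italy       59000000  1937894
Malta         434000     4520
>>> df.nlargest(3, ['population', 'GDP'])
          population      GDP
France      65000000  2583560
Italy       59000000  1937894
Maldives      434000    12011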
pandas.DataFrame.notna
DataFrame.notna(self )
Detect existing (non-missing) values.
Return a boolean same-sized object indicating if the values are not NA. Non-missing values get
mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values
(unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or
numpy.NaN, get mapped to False values.
Returns
DataFrame Mask of bool values for each element in DataFrame that indicates
whether an element is not an NA value.
See also:
Examples
>>> df.notna()
age born name toy
0 True False True False
1 True True True True
2 False True True True
>>> ser.notna()
0 True
1 True
2 False
dtype: bool
pandas.DataFrame.notnull
DataFrame.notnull(self )
Detect existing (non-missing) values.
Return a boolean same-sized object indicating if the values are not NA. Non-missing values get
mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values
(unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or
numpy.NaN, get mapped to False values.
Returns
DataFrame Mask of bool values for each element in DataFrame that indicates
whether an element is not an NA value.
See also:
Examples
>>> df.notna()
age born name toy
0 True False True False
1 True True True True
2 False True True True
>>> ser.notna()
0 True
1 True
2 False
dtype: bool
pandas.DataFrame.nsmallest
Examples
In the following example, we will use nsmallest to select the three rows having the smallest
values in column “a”.
To order by the smallest values in column “a” and then “c”, we can specify multiple columns like in the next example.
pandas.DataFrame.nunique
Examples
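The output below is consistent with a frame along these lines (hypothetical construction):
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})
>>> df.nunique()
A    3
B    1
dtype: int64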
>>> df.nunique(axis=1)
0 1
1 2
2 2
dtype: int64
pandas.DataFrame.pct_change
Examples
Series
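The first two outputs below are consistent with a Series along these lines (hypothetical construction):
>>> s = pd.Series([90, 91, 85])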
>>> s.pct_change()
0 NaN
1 0.011111
2 -0.065934
dtype: float64
>>> s.pct_change(periods=2)
0 NaN
1 NaN
2 -0.055556
dtype: float64
See the percentage change in a Series where NAs are filled with the last valid observation carried forward to the next valid observation.
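Here the Series is assumed to contain a missing value, e.g.:
>>> s = pd.Series([90, 91, None, 85])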
>>> s.pct_change(fill_method='ffill')
0 NaN
1 0.011111
2 0.000000
3 -0.065934
dtype: float64
DataFrame
Percentage change in French franc, Deutsche Mark, and Italian lira from 1980-01-01 to 1980-03-
01.
>>> df = pd.DataFrame({
... 'FR': [4.0405, 4.0963, 4.3149],
... 'GR': [1.7246, 1.7482, 1.8519],
... 'IT': [804.74, 810.01, 860.13]},
... index=['1980-01-01', '1980-02-01', '1980-03-01'])
>>> df
FR GR IT
1980-01-01 4.0405 1.7246 804.74
1980-02-01 4.0963 1.7482 810.01
1980-03-01 4.3149 1.8519 860.13
>>> df.pct_change()
FR GR IT
1980-01-01 NaN NaN NaN
1980-02-01 0.013810 0.013684 0.006549
1980-03-01 0.053365 0.059318 0.061876
Percentage change in GOOG and APPL stock volume, showing how to compute the percentage change between columns.
>>> df = pd.DataFrame({
... '2016': [1769950, 30586265],
... '2015': [1500923, 40912316],
... '2014': [1371819, 41403351]},
... index=['GOOG', 'APPL'])
>>> df
2016 2015 2014
GOOG 1769950 1500923 1371819
APPL 30586265 40912316 41403351
>>> df.pct_change(axis='columns')
2016 2015 2014
GOOG NaN -0.151997 -0.086016
APPL NaN 0.337604 0.012002
pandas.DataFrame.pipe
DataFrame.apply
DataFrame.applymap
Series.map
Notes
Use .pipe when chaining together functions that expect Series, DataFrames or GroupBy objects.
Instead of writing
>>> f(g(h(df), arg1=a), arg2=b, arg3=c)
You can write
>>> (df.pipe(h)
...  .pipe(g, arg1=a)
...  .pipe(f, arg2=b, arg3=c)
... )
If you have a function that takes the data as (say) the second argument, pass a tuple indicating
which keyword expects the data. For example, suppose f takes its data as arg2:
>>> (df.pipe(h)
... .pipe(g, arg1=a)
... .pipe((f, 'arg2'), arg1=a, arg3=c)
... )
pandas.DataFrame.pivot
DataFrame.pivot_table Generalization of pivot that can handle duplicate values for one in-
dex/column pair.
DataFrame.unstack Pivot based on the index values instead of a column.
Notes
For finer-tuned control, see hierarchical indexing documentation along with the related
stack/unstack methods.
Examples
Notice that the first two rows are the same for our index and columns arguments.
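A minimal sketch (hypothetical frame):
>>> df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
...                    'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
...                    'baz': [1, 2, 3, 4, 5, 6]})
>>> df.pivot(index='foo', columns='bar', values='baz')
bar  A  B  C
foo
one  1  2  3
two  4  5  6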
pandas.DataFrame.pivot_table
Examples
The next example aggregates by taking the mean across multiple columns.
We can also calculate multiple types of aggregations for any given value column.
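A simplified sketch (hypothetical frame, assuming NumPy as np):
>>> df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar'],
...                    'B': ['one', 'one', 'two', 'one', 'two'],
...                    'C': [1, 2, 3, 4, 5]})
>>> df.pivot_table(values='C', index=['A', 'B'], aggfunc=np.sum)
         C
A   B
bar one  4
    two  5
foo one  3
    two  3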
pandas.DataFrame.plot
pandas.DataFrame.pop
DataFrame.pop(self, item)
Return item and drop from frame. Raise KeyError if not found.
Parameters
item [str] Label of column to be popped.
Returns
Series
Examples
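The outputs below are consistent with a frame along these lines (hypothetical construction, assuming NumPy as np):
>>> df = pd.DataFrame([('falcon', 'bird', 389.0),
...                    ('parrot', 'bird', 24.0),
...                    ('lion', 'mammal', 80.5),
...                    ('monkey', 'mammal', np.nan)],
...                   columns=('name', 'class', 'max_speed'))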
>>> df.pop('class')
0 bird
1 bird
2 mammal
3 mammal
Name: class, dtype: object
>>> df
name max_speed
0 falcon 389.0
1 parrot 24.0
2 lion 80.5
3 monkey NaN
pandas.DataFrame.pow
Notes
Examples
Add a scalar with the operator version, which returns the same results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.div(10)
angles degrees
circle 0.0 36.0
triangle 0.3 18.0
rectangle 0.4 36.0
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df - [1, 2]
angles degrees
circle -1 358
triangle 2 178
rectangle 3 358
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.prod
Examples
>>> pd.Series([]).prod()
1.0
>>> pd.Series([]).prod(min_count=1)
nan
Thanks to the skipna parameter, min_count handles all-NA and empty series identically.
>>> pd.Series([np.nan]).prod()
1.0
>>> pd.Series([np.nan]).prod(min_count=1)
nan
pandas.DataFrame.product
Examples
>>> pd.Series([]).prod()
1.0
>>> pd.Series([]).prod(min_count=1)
nan
Thanks to the skipna parameter, min_count handles all-NA and empty series identically.
>>> pd.Series([np.nan]).prod()
1.0
>>> pd.Series([np.nan]).prod(min_count=1)
nan
pandas.DataFrame.quantile
Parameters
q [float or array-like, default 0.5 (50% quantile)] Value between 0 <= q <= 1, the
quantile(s) to compute.
axis [{0, 1, ‘index’, ‘columns’} (default 0)] Equals 0 or ‘index’ for row-wise, 1 or
‘columns’ for column-wise.
numeric_only [bool, default True] If False, the quantile of datetime and
timedelta data will be computed as well.
interpolation [{‘linear’, ‘lower’, ‘higher’, ‘midpoint’, ‘nearest’}] This optional pa-
rameter specifies the interpolation method to use, when the desired quantile
lies between two data points i and j:
• linear: i + (j - i) * fraction, where fraction is the fractional part of the index
surrounded by i and j.
• lower: i.
• higher: j.
• nearest: i or j whichever is nearest.
• midpoint: (i + j) / 2.
New in version 0.18.0.
Returns
Series or DataFrame
If q is an array, a DataFrame will be returned where the index is q,
the columns are the columns of self, and the values are the quantiles.
If q is a float, a Series will be returned where the index is the
columns of self and the values are the quantiles.
See also:
Examples
Specifying numeric_only=False will also compute the quantile of datetime and timedelta data.
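A sketch of both cases (hypothetical frames, assuming NumPy as np):
>>> df = pd.DataFrame(np.array([[1, 1], [2, 10], [3, 100], [4, 100]]),
...                   columns=['a', 'b'])
>>> df.quantile(.1)
a    1.3
b    3.7
Name: 0.1, dtype: float64
>>> df = pd.DataFrame({'A': [1, 2],
...                    'B': [pd.Timestamp('2010'),
...                          pd.Timestamp('2011')],
...                    'C': [pd.Timedelta('1 days'),
...                          pd.Timedelta('2 days')]})
>>> df.quantile(0.5, numeric_only=False)
A                    1.5
B    2010-07-02 12:00:00
C        1 days 12:00:00
Name: 0.5, dtype: object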
pandas.DataFrame.query
Notes
The result of the evaluation of this expression is first passed to DataFrame.loc and if that
fails because of a multidimensional key (e.g., a DataFrame) then the result will be passed to
DataFrame.__getitem__().
This method uses the top-level eval() function to evaluate the passed query.
The query() method uses a slightly modified Python syntax by default. For example, the &
and | (bitwise) operators have the precedence of their boolean cousins, and and or. This is
syntactically valid Python, however the semantics are different.
You can change the semantics of the expression by passing the keyword argument
parser='python'. This enforces the same semantics as evaluation in Python space. Likewise,
you can pass engine='python' to evaluate an expression using Python itself as a backend. This
is not recommended as it is inefficient compared to using numexpr as the engine.
The DataFrame.index and DataFrame.columns attributes of the DataFrame instance are placed
in the query namespace by default, which allows you to treat both the index and columns of the
frame as a column in the frame. The identifier index is used for the frame index; you can also
use the name of the index to identify it in a query. Please note that Python keywords may not
be used as identifiers.
For further details and examples see the query documentation in indexing.
Examples
For columns with spaces in their name, you can use backtick quoting.
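A short sketch (hypothetical frame):
>>> df = pd.DataFrame({'A': range(1, 6),
...                    'B': range(10, 0, -2),
...                    'C C': range(10, 5, -1)})
>>> df.query('A > B')
   A  B  C C
4  5  2    6
>>> df.query('B == `C C`')
   A   B  C C
0  1  10   10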
pandas.DataFrame.radd
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //,
%, **.
Parameters
other [scalar, sequence, Series, or DataFrame] Any single or multiple element data
structure, or list-like object.
axis [{0 or ‘index’, 1 or ‘columns’}] Whether to compare by the index (0 or ‘index’)
or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level [int or label] Broadcast across a level, matching Index values on the passed
MultiIndex level.
fill_value [float or None, default None] Fill existing missing (NaN) values, and
any new element needed for successful DataFrame alignment, with this value
before computation. If data in both corresponding DataFrame locations is
missing the result will be missing.
Returns
DataFrame Result of the arithmetic operation.
See also:
Notes
Examples
Add a scalar with the operator version, which returns the same results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.div(10)
angles degrees
circle 0.0 36.0
triangle 0.3 18.0
rectangle 0.4 36.0
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df - [1, 2]
angles degrees
circle -1 358
triangle 2 178
rectangle 3 358
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.rank
Examples
The following example shows how the method behaves with the above parameters:
• default_rank: this is the default behaviour obtained without using any parameter.
• max_rank: setting method = 'max' the records that have the same values are ranked using
the highest rank (e.g.: since ‘cat’ and ‘dog’ are both in the 2nd and 3rd position, rank 3 is
assigned.)
• NA_bottom: choosing na_option = 'bottom', if there are records with NaN values they
are placed at the bottom of the ranking.
• pct_rank: when setting pct = True, the ranking is expressed as percentile rank.
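A sketch of these four variants (hypothetical frame, assuming NumPy as np):
>>> df = pd.DataFrame(data={'Animal': ['cat', 'penguin', 'dog',
...                                    'spider', 'snake'],
...                         'Number_legs': [4, 2, 4, 8, np.nan]})
>>> df['default_rank'] = df['Number_legs'].rank()
>>> df['max_rank'] = df['Number_legs'].rank(method='max')
>>> df['NA_bottom'] = df['Number_legs'].rank(na_option='bottom')
>>> df['pct_rank'] = df['Number_legs'].rank(pct=True)
>>> df
    Animal  Number_legs  default_rank  max_rank  NA_bottom  pct_rank
0      cat          4.0           2.5       3.0        2.5     0.625
1  penguin          2.0           1.0       1.0        1.0     0.250
2      dog          4.0           2.5       3.0        2.5     0.625
3   spider          8.0           4.0       4.0        4.0     1.000
4    snake          NaN           NaN       NaN        5.0       NaN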
pandas.DataFrame.rdiv
Notes
Examples
Add a scalar with the operator version, which returns the same results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.div(10)
angles degrees
circle 0.0 36.0
triangle 0.3 18.0
rectangle 0.4 36.0
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df - [1, 2]
angles degrees
circle -1 358
triangle 2 178
rectangle 3 358
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.reindex
tolerance [optional] Maximum distance between original and new labels for in-
exact matches. The values of the index at the matching locations must satisfy
the equation abs(index[indexer] - target) <= tolerance.
Tolerance may be a scalar value, which applies the same tolerance to all values,
or list-like, which applies variable tolerance per element. List-like includes list,
tuple, array, Series, and must be the same size as the index and its dtype must
exactly match the index’s type.
New in version 0.21.0: (list-like tolerance)
Returns
DataFrame with changed index.
See also:
Examples
Create a new index and reindex the dataframe. By default values in the new index that do not
have corresponding records in the dataframe are assigned NaN.
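For instance, a sketch (hypothetical frame):
>>> index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
>>> df = pd.DataFrame({'http_status': [200, 200, 404, 404, 301],
...                    'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]},
...                   index=index)
>>> new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10', 'Chrome']
>>> df.reindex(new_index)
               http_status  response_time
Safari               404.0           0.07
Iceweasel              NaN            NaN
Comodo Dragon          NaN            NaN
IE10                 404.0           0.08
Chrome               200.0           0.02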
We can fill in the missing values by passing a value to the keyword fill_value. Because the
index is not monotonically increasing or decreasing, we cannot use arguments to the keyword
method to fill the NaN values.
To further illustrate the filling functionality in reindex, we will create a dataframe with a mono-
tonically increasing index (for example, a sequence of dates).
The index entries that did not have a value in the original data frame (for example, ‘2009-12-29’)
are by default filled with NaN. If desired, we can fill in the missing values using one of several
options.
For example, to back-propagate the last valid value to fill the NaN values, pass bfill as an
argument to the method keyword.
Please note that the NaN value present in the original dataframe (at index value 2010-01-03) will
not be filled by any of the value propagation schemes. This is because filling while reindexing
does not look at dataframe values, but only compares the original and desired indexes. If you do
want to fill in the NaN values present in the original dataframe, use the fillna() method.
See the user guide for more.
pandas.DataFrame.reindex_like
Conform the object to the same index on all axes. Optional filling logic, placing NaN in locations
having no value in the previous index. A new object is produced unless the new index is equivalent
to the current one and copy=False.
Parameters
other [Object of the same data type] Its row and column indices are used to define
the new indices of this object.
method [{None, ‘backfill’/’bfill’, ‘pad’/’ffill’, ‘nearest’}] Method to use for fill-
ing holes in reindexed DataFrame. Please note: this is only applicable to
DataFrames/Series with a monotonically increasing/decreasing index.
• None (default): don’t fill gaps
• pad / ffill: propagate last valid observation forward to next valid
• backfill / bfill: use next valid observation to fill gap
• nearest: use nearest valid observations to fill gap
copy [bool, default True] Return a new object, even if the passed indexes are the
same.
limit [int, default None] Maximum number of consecutive labels to fill for inexact
matches.
tolerance [optional] Maximum distance between original and new labels for in-
exact matches. The values of the index at the matching locations most satisfy
the equation abs(index[indexer] - target) <= tolerance.
Tolerance may be a scalar value, which applies the same tolerance to all values,
or list-like, which applies variable tolerance per element. List-like includes list,
tuple, array, Series, and must be the same size as the index and its dtype must
exactly match the index’s type.
New in version 0.21.0: (list-like tolerance)
Returns
Series or DataFrame Same type as caller, but with changed indices on each
axis.
See also:
Notes
Examples
>>> df1
temp_celsius temp_fahrenheit windspeed
2014-02-12 24.3 75.7 high
2014-02-13 31.0 87.8 high
2014-02-14 22.0 71.6 medium
2014-02-15 35.0 95.0 medium
>>> df2
temp_celsius windspeed
2014-02-12 28.0 low
2014-02-13 30.0 low
2014-02-15 35.1 medium
>>> df2.reindex_like(df1)
temp_celsius temp_fahrenheit windspeed
2014-02-12 28.0 NaN low
2014-02-13 30.0 NaN low
2014-02-14 NaN NaN NaN
2014-02-15 35.1 NaN medium
pandas.DataFrame.rename
Examples
>>> df.index
RangeIndex(start=0, stop=3, step=1)
>>> df.rename(index=str).index
Index(['0', '1', '2'], dtype='object')
pandas.DataFrame.rename_axis
Notes
Examples
Series
>>> s = pd.Series(["dog", "cat", "monkey"])
>>> s
0 dog
1 cat
2 monkey
dtype: object
>>> s.rename_axis("animal")
animal
0 dog
1 cat
2 monkey
dtype: object
DataFrame
>>> df = pd.DataFrame({"num_legs": [4, 4, 2],
... "num_arms": [0, 0, 2]},
... ["dog", "cat", "monkey"])
>>> df
num_legs num_arms
dog 4 0
cat 4 0
monkey 2 2
>>> df = df.rename_axis("animal")
>>> df
num_legs num_arms
animal
dog 4 0
cat 4 0
monkey 2 2
>>> df = df.rename_axis("limbs", axis="columns")
>>> df
limbs num_legs num_arms
animal
dog 4 0
cat 4 0
monkey 2 2
MultiIndex
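The MultiIndex example below is consistent with the frame above being given a MultiIndex along these lines (a hypothetical step):
>>> df.index = pd.MultiIndex.from_product([['mammal'],
...                                        ['dog', 'cat', 'monkey']],
...                                       names=['type', 'name'])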
>>> df.rename_axis(columns=str.upper)
LIMBS num_legs num_arms
type name
mammal dog 4 0
cat 4 0
monkey 2 2
pandas.DataFrame.reorder_levels
pandas.DataFrame.replace
Values of the DataFrame are replaced with other values dynamically. This differs from updating
with .loc or .iloc, which require you to specify a location to update with some value.
Parameters
to_replace [str, regex, list, dict, Series, int, float, or None] How to find the values
that will be replaced.
• numeric, str or regex:
– numeric: numeric values equal to to_replace will be replaced with value
– str: string exactly matching to_replace will be replaced with value
– regex: regexs matching to_replace will be replaced with value
• list of str, regex, or numeric:
– First, if to_replace and value are both lists, they must be the same length.
– Second, if regex=True then all of the strings in both lists will be inter-
preted as regexs otherwise they will match directly. This doesn’t matter
much for value since there are only a few possible substitution regexes you
can use.
– str, regex and numeric rules apply as above.
• dict:
– Dicts can be used to specify different replacement values for different ex-
isting values. For example, {'a': 'b', 'y': 'z'} replaces the value
‘a’ with ‘b’ and ‘y’ with ‘z’. To use a dict in this way the value parameter
should be None.
– For a DataFrame a dict can specify that different values should be replaced
in different columns. For example, {'a': 1, 'b': 'z'} looks for the
value 1 in column ‘a’ and the value ‘z’ in column ‘b’ and replaces these
values with whatever is specified in value. The value parameter should
not be None in this case. You can treat this as a special case of passing
two lists except that you are specifying the column to search in.
– For a DataFrame nested dictionaries, e.g., {'a': {'b': np.nan}}, are
read as follows: look in column ‘a’ for the value ‘b’ and replace it with
NaN. The value parameter should be None to use a nested dict in this
way. You can nest regular expressions as well. Note that column names
(the top-level dictionary keys in a nested dictionary) cannot be regular
expressions.
• None:
– This means that the regex argument must be a string, compiled regular
expression, or list, dict, ndarray or Series of such elements. If value is also
None then this must be a nested dictionary or Series.
See the examples section for examples of each of these.
value [scalar, dict, list, str, regex, default None] Value to replace any values match-
ing to_replace with. For a DataFrame a dict of values can be used to specify
which value to use for each column (columns not in the dict will not be filled).
Regular expressions, strings and lists or dicts of such objects are also allowed.
inplace [bool, default False] If True, in place. Note: this will modify any other
views on this object (e.g. a column from a DataFrame). Returns the caller if
this is True.
limit [int, default None] Maximum size gap to forward or backward fill.
regex [bool or same types as to_replace, default False] Whether to interpret
to_replace and/or value as regular expressions. If this is True then to_replace
must be a string. Alternatively, this could be a regular expression or a list,
dict, or array of regular expressions in which case to_replace must be None.
method [{‘pad’, ‘ffill’, ‘bfill’, None}] The method to use for replacement when to_replace is a scalar, list or tuple and value is None.
Changed in version 0.23.0: Added to DataFrame.
Returns
DataFrame Object after replacement.
Raises
AssertionError
• If regex is not a bool and to_replace is not None.
TypeError
• If to_replace is a dict and value is not a list, dict, ndarray, or Series
• If to_replace is None and regex is not compilable into a regular expression
or is a list, dict, ndarray, or Series.
• When replacing multiple bool or datetime64 objects and the arguments to to_replace do not match the type of the value being replaced
ValueError
• If a list or an ndarray is passed to to_replace and value but they are not
the same length.
See also:
Notes
• Regex substitution is performed under the hood with re.sub. The rules for substitution for
re.sub are the same.
• Regular expressions will only substitute on strings, meaning you cannot provide, for example,
a regular expression matching floating point numbers and expect the columns in your frame
that have a numeric dtype to be matched. However, if those floating point numbers are
strings, then you can do this.
• This method has a lot of options. You are encouraged to experiment and play with this
method to gain intuition about how it works.
• When dict is used as the to_replace value, it is like key(s) in the dict are the to_replace
part and value(s) in the dict are the value parameter.
Examples
List-like ‘to_replace‘
dict-like ‘to_replace‘
Note that when replacing multiple bool or datetime64 objects, the data types in the to_replace
parameter must match the data type of the value being replaced:
This raises a TypeError because one of the dict keys is not of the correct type for replacement.
Compare the behavior of s.replace({'a': None}) and s.replace('a', None) to understand
the peculiarities of the to_replace parameter:
When one uses a dict as the to_replace value, it is like the value(s) in the dict are equal to
the value parameter. s.replace({'a': None}) is equivalent to s.replace(to_replace={'a':
None}, value=None, method=None):
When value=None and to_replace is a scalar, list or tuple, replace uses the method parameter
(default ‘pad’) to do the replacement. So this is why the ‘a’ values are being replaced by 10 in
rows 1 and 2 and ‘b’ in row 4 in this case. The command s.replace('a', None) is actually
equivalent to s.replace(to_replace='a', value=None, method='pad'):
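A sketch of that comparison (hypothetical Series):
>>> s = pd.Series([10, 'a', 'a', 'b', 'a'])
>>> s.replace({'a': None})
0      10
1    None
2    None
3       b
4    None
dtype: object
>>> s.replace('a', None)
0    10
1    10
2    10
3     b
4     b
dtype: object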
pandas.DataFrame.resample
Convenience method for frequency conversion and resampling of time series. Object must have
a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or pass datetime-like
values to the on or level keyword.
Parameters
rule [DateOffset, Timedelta or str] The offset string or object representing target
conversion.
how [str] Method for down/re-sampling, defaults to ‘mean’ for downsampling.
Deprecated since version 0.18.0: The new syntax is .resample(...).mean(),
or .resample(...).apply(<func>)
axis [{0 or ‘index’, 1 or ‘columns’}, default 0] Which axis to use for up- or down-
sampling. For Series this will default to 0, i.e. along the rows. Must be
DatetimeIndex, TimedeltaIndex or PeriodIndex.
fill_method [str, default None] Filling method for upsampling.
Deprecated since version 0.18.0: The new syntax is .resample(...).
<func>(), e.g. .resample(...).pad()
closed [{‘right’, ‘left’}, default None] Which side of bin interval is closed. The
default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’,
‘BQ’, and ‘W’ which all have a default of ‘right’.
label [{‘right’, ‘left’}, default None] Which bin edge label to label bucket with.
The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’,
‘BQ’, and ‘W’ which all have a default of ‘right’.
convention [{‘start’, ‘end’, ‘s’, ‘e’}, default ‘start’] For PeriodIndex only, controls
whether to use the start or end of rule.
kind [{‘timestamp’, ‘period’}, optional, default None] Pass ‘timestamp’ to con-
vert the resulting index to a DateTimeIndex or ‘period’ to convert it to a
PeriodIndex. By default the input representation is retained.
loffset [timedelta, default None] Adjust the resampled time labels.
limit [int, default None] Maximum size gap when reindexing with fill_method.
Deprecated since version 0.18.0.
base [int, default 0] For frequencies that evenly subdivide 1 day, the “origin” of
the aggregated intervals. For example, for ‘5min’ frequency, base could range
from 0 through 4. Defaults to 0.
on [str, optional] For a DataFrame, column to use instead of index for resampling.
Column must be datetime-like.
New in version 0.19.0.
level [str or int, optional] For a MultiIndex, level (name or number) to use for
resampling. level must be datetime-like.
New in version 0.19.0.
Returns
Resampler object
See also:
Notes
Examples
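The examples below assume a minutely Series along these lines (consistent with the outputs shown):
>>> index = pd.date_range('1/1/2000', periods=9, freq='T')
>>> series = pd.Series(range(9), index=index)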
Downsample the series into 3 minute bins and sum the values of the timestamps falling into a
bin.
>>> series.resample('3T').sum()
2000-01-01 00:00:00 3
2000-01-01 00:03:00 12
2000-01-01 00:06:00 21
Freq: 3T, dtype: int64
Downsample the series into 3 minute bins as above, but label each bin using the right edge
instead of the left. Please note that the value in the bucket used as the label is not included in
the bucket, which it labels. For example, in the original series the bucket 2000-01-01 00:03:00
contains the value 3, but the summed value in the resampled bucket with the label 2000-01-01
00:03:00 does not include 3 (if it did, the summed value would be 6, not 3). To include this
value close the right side of the bin interval as illustrated in the example below this one.
Downsample the series into 3 minute bins as above, but close the right side of the bin interval.
Upsample the series into 30 second bins and fill the NaN values using the pad method.
>>> series.resample('30S').pad()[0:5]
2000-01-01 00:00:00 0
2000-01-01 00:00:30 0
2000-01-01 00:01:00 1
2000-01-01 00:01:30 1
2000-01-01 00:02:00 2
Freq: 30S, dtype: int64
Upsample the series into 30 second bins and fill the NaN values using the bfill method.
>>> series.resample('30S').bfill()[0:5]
2000-01-01 00:00:00 0
2000-01-01 00:00:30 1
2000-01-01 00:01:00 1
2000-01-01 00:01:30 2
2000-01-01 00:02:00 2
Freq: 30S, dtype: int64
For a Series with a PeriodIndex, the keyword convention can be used to control whether to use
the start or end of rule.
Resample a year by quarter using ‘start’ convention. Values are assigned to the first quarter of
the period.
Resample quarters by month using ‘end’ convention. Values are assigned to the last month of
the period.
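A minimal sketch of the ‘start’ convention, constructing a small annual-frequency series for illustration:
>>> s = pd.Series([1, 2], index=pd.period_range('2012', periods=2, freq='A'))
>>> s.resample('Q', convention='start').asfreq()
2012Q1    1.0
2012Q2    NaN
2012Q3    NaN
2012Q4    NaN
2013Q1    2.0
2013Q2    NaN
2013Q3    NaN
2013Q4    NaN
Freq: Q-DEC, dtype: float64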
For DataFrame objects, the keyword on can be used to specify the column instead of the index
for resampling.
For a DataFrame with MultiIndex, the keyword level can be used to specify on which level the
resampling needs to take place.
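A minimal sketch of the on keyword, using an illustrative frame with a datetime column (the result is assigned rather than printed):
>>> df = pd.DataFrame({'price': [10, 11, 9, 13, 14, 18, 17, 19],
...                    'volume': [50, 60, 40, 100, 50, 100, 40, 50]})
>>> df['week_starting'] = pd.date_range('01/01/2018', periods=8, freq='W')
>>> monthly = df.resample('M', on='week_starting').mean()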
pandas.DataFrame.reset_index
level [int, str, tuple, or list, default None] Only remove the given levels from the
index. Removes all levels by default.
drop [bool, default False] Do not try to insert index into dataframe columns. This
resets the index to the default integer index.
inplace [bool, default False] Modify the DataFrame in place (do not create a new
object).
col_level [int or str, default 0] If the columns have multiple levels, determines
which level the labels are inserted into. By default it is inserted into the first
level.
col_fill [object, default ‘’] If the columns have multiple levels, determines how
the other levels are named. If None then the index name is repeated.
Returns
DataFrame DataFrame with the new index.
See also:
Examples
When we reset the index, the old index is added as a column, and a new sequential index is used:
>>> df.reset_index()
index class max_speed
0 falcon bird 389.0
1 parrot bird 24.0
2 lion mammal 80.5
3 monkey mammal NaN
We can use the drop parameter to avoid the old index being added as a column:
>>> df.reset_index(drop=True)
class max_speed
0 bird 389.0
1 bird 24.0
2 mammal 80.5
3 mammal NaN
>>> df.reset_index(level='class')
class speed species
max type
name
falcon bird 389.0 fly
parrot bird 24.0 fly
lion mammal 80.5 run
monkey mammal NaN jump
If we are not dropping the index, by default, it is placed in the top level. We can place it in
another level:
When the index is inserted under another level, we can specify under which one with the param-
eter col_fill:
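A minimal sketch of these two options (results assigned rather than printed; df is assumed to be the multi-level-column frame from the example above):
>>> res = df.reset_index(level='class', col_level=1)
>>> res2 = df.reset_index(level='class', col_level=1, col_fill='species')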
pandas.DataFrame.rfloordiv
Notes
Examples
Add a scalar with the operator version, which returns the same results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.div(10)
angles degrees
circle 0.0 36.0
triangle 0.3 18.0
rectangle 0.4 36.0
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df - [1, 2]
angles degrees
circle -1 358
triangle 2 178
rectangle 3 358
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.rmod
Notes
Examples
Add a scalar with the operator version, which returns the same results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.div(10)
angles degrees
circle 0.0 36.0
triangle 0.3 18.0
rectangle 0.4 36.0
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df - [1, 2]
angles degrees
circle -1 358
triangle 2 178
rectangle 3 358
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.rmul
Notes
Examples
Add a scalar with the operator version, which returns the same results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.div(10)
angles degrees
circle 0.0 36.0
triangle 0.3 18.0
rectangle 0.4 36.0
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df - [1, 2]
angles degrees
circle -1 358
triangle 2 178
rectangle 3 358
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.rolling
Notes
By default, the result is set to the right edge of the window. This can be changed to the center
of the window by setting center=True.
To learn more about the offsets & frequency strings, please see this link.
The recognized win_types are:
• boxcar
• triang
• blackman
• hamming
• bartlett
• parzen
• bohman
• blackmanharris
• nuttall
• barthann
• kaiser (needs beta)
• gaussian (needs std)
• general_gaussian (needs power, width)
• slepian (needs width)
• exponential (needs tau), center is set to None.
If win_type=None all points are evenly weighted. To learn more about different window types
see scipy.signal window functions.
Examples
Rolling sum with a window length of 2, using the ‘triang’ window type.
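For illustration, assuming df is the single-column frame (B = [0, 1, 2, NaN, 4]) used in the examples below:
>>> df.rolling(2, win_type='triang').sum()
     B
0  NaN
1  0.5
2  1.5
3  NaN
4  NaN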
Rolling sum with a window length of 2, min_periods defaults to the window length.
>>> df.rolling(2).sum()
B
0 NaN
1 1.0
2 3.0
3 NaN
4 NaN
>>> df
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 2.0
2013-01-01 09:00:05 NaN
2013-01-01 09:00:06 4.0
Contrasting to an integer rolling window, this will roll a variable length window corresponding
to the time period. The default for min_periods is 1.
>>> df.rolling('2s').sum()
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 3.0
2013-01-01 09:00:05 NaN
2013-01-01 09:00:06 4.0
pandas.DataFrame.round
decimals [int, dict, Series] Number of decimal places to round each column to. If
an int is given, round each column to the same number of places. Otherwise
dict and Series round to variable numbers of places. Column names should be
in the keys if decimals is a dict-like, or in the index if decimals is a Series. Any
columns not included in decimals will be left as is. Elements of decimals which
are not columns of the input will be ignored.
*args Additional keywords have no effect but might be accepted for compatibility
with numpy.
**kwargs Additional keywords have no effect but might be accepted for compat-
ibility with numpy.
Returns
DataFrame A DataFrame with the affected columns rounded to the specified
number of decimal places.
See also:
Examples
By providing an integer each column is rounded to the same number of decimal places
>>> df.round(1)
dogs cats
0 0.2 0.3
1 0.0 0.7
2 0.7 0.0
3 0.2 0.2
With a dict, the number of places for specific columns can be specified with the column names
as key and the number of decimal places as value
Using a Series, the number of places for specific columns can be specified with the column names
as index and the number of decimal places as value
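A minimal sketch of both forms, using a small illustrative frame rather than the one above:
>>> df2 = pd.DataFrame({'dogs': [0.21, 0.01], 'cats': [0.32, 0.67]})
>>> df2.round({'dogs': 1, 'cats': 0})
   dogs  cats
0   0.2   0.0
1   0.0   1.0
>>> decimals = pd.Series([0, 1], index=['cats', 'dogs'])
>>> df2.round(decimals)
   dogs  cats
0   0.2   0.0
1   0.0   1.0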
pandas.DataFrame.rpow
Notes
Examples
Add a scalar with the operator version, which returns the same results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.div(10)
angles degrees
circle 0.0 36.0
triangle 0.3 18.0
rectangle 0.4 36.0
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df - [1, 2]
angles degrees
circle -1 358
triangle 2 178
rectangle 3 358
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.rsub
Notes
Examples
Add a scalar with the operator version, which returns the same results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.div(10)
angles degrees
circle 0.0 36.0
triangle 0.3 18.0
rectangle 0.4 36.0
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df - [1, 2]
angles degrees
circle -1 358
triangle 2 178
rectangle 3 358
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.rtruediv
Notes
Examples
Add a scalar with the operator version, which returns the same results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.div(10)
angles degrees
circle 0.0 36.0
triangle 0.3 18.0
rectangle 0.4 36.0
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df - [1, 2]
angles degrees
circle -1 358
triangle 2 178
rectangle 3 358
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.sample
Examples
Extract 3 random elements from the Series df['num_legs']: Note that we use random_state
to ensure the reproducibility of the examples.
Using a DataFrame column as weights. Rows with larger value in the num_specimen_seen
column are more likely to be sampled.
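A minimal sketch of both calls (outputs omitted; df is assumed to be the animals frame with num_legs and num_specimen_seen columns from the full example):
>>> sampled = df['num_legs'].sample(n=3, random_state=1)
>>> weighted = df.sample(n=2, weights='num_specimen_seen', random_state=1)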
pandas.DataFrame.select_dtypes
Notes
Examples
>>> df.select_dtypes(include='bool')
b
0 True
1 False
2 True
3 False
4 True
5 False
>>> df.select_dtypes(include=['float64'])
c
0 1.0
1 2.0
2 1.0
3 2.0
4 1.0
5 2.0
>>> df.select_dtypes(exclude=['int'])
b c
0 True 1.0
1 False 2.0
2 True 1.0
3 False 2.0
4 True 1.0
5 False 2.0
pandas.DataFrame.sem
Normalized by N-1 by default. This can be changed using the ddof argument.
Parameters
axis [{index (0), columns (1)}]
skipna [bool, default True] Exclude NA/null values. If an entire row/column is
NA, the result will be NA
level [int or level name, default None] If the axis is a MultiIndex (hierarchical),
count along a particular level, collapsing into a Series
ddof [int, default 1] Delta Degrees of Freedom. The divisor used in calculations
is N - ddof, where N represents the number of elements.
numeric_only [bool, default None] Include only float, int, boolean columns. If
None, will attempt to use everything, then use only numeric data. Not imple-
mented for Series.
Returns
Series or DataFrame (if level specified)
pandas.DataFrame.set_axis
Returns
renamed [Series or DataFrame, or None] An object of the same type as the caller if inplace=False, None otherwise.
See also:
Examples
Series
>>> s
0 1
1 2
2 3
dtype: int64
DataFrame
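For illustration, assuming the Series s above and a small DataFrame constructed here:
>>> s.set_axis(['a', 'b', 'c'], axis=0, inplace=False)
a    1
b    2
c    3
dtype: int64
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
>>> df.set_axis(['a', 'b', 'c'], axis='index', inplace=False)
   A  B
a  1  4
b  2  5
c  3  6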
pandas.DataFrame.set_index
Examples
>>> df.set_index('month')
year sale
month
1 2012 55
4 2014 40
pandas.DataFrame.set_value
pandas.DataFrame.shift
Examples
>>> df.shift(periods=3)
Col1 Col2 Col3
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 10.0 13.0 17.0
4 20.0 23.0 27.0
pandas.DataFrame.skew
pandas.DataFrame.slice_shift
Notes
While the slice_shift is faster than shift, you may pay for it later during alignment.
pandas.DataFrame.sort_index
pandas.DataFrame.sort_values
kind [{‘quicksort’, ‘mergesort’, ‘heapsort’}, default ‘quicksort’] Choice of sorting algorithm; ‘mergesort’ is the only stable algorithm. For DataFrames, this option is only applied when sorting on a single column or label.
na_position [{‘first’, ‘last’}, default ‘last’] Puts NaNs at the beginning if first;
last puts NaNs at the end.
Returns
sorted_obj [DataFrame or None] DataFrame with sorted values if inplace=False,
None otherwise.
Examples
>>> df = pd.DataFrame({
... 'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
... 'col2': [2, 1, 9, 8, 7, 4],
... 'col3': [0, 1, 9, 4, 2, 3],
... })
>>> df
col1 col2 col3
0 A 2 0
1 A 1 1
2 B 9 9
3 NaN 8 4
4 D 7 2
5 C 4 3
Sort by col1
>>> df.sort_values(by=['col1'])
col1 col2 col3
0 A 2 0
1 A 1 1
2 B 9 9
5 C 4 3
4 D 7 2
3 NaN 8 4
Sort Descending
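Continuing with the df defined above, a descending sort might look like:
>>> df.sort_values(by='col1', ascending=False)
  col1  col2  col3
4    D     7     2
5    C     4     3
2    B     9     9
0    A     2     0
1    A     1     1
3  NaN     8     4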
pandas.DataFrame.sparse
DataFrame.sparse()
DataFrame accessor for sparse data.
New in version 0.25.0.
pandas.DataFrame.squeeze
DataFrame.squeeze(self, axis=None)
Squeeze 1 dimensional axis objects into scalars.
Series or DataFrames with a single element are squeezed to a scalar. DataFrames with a single
column or a single row are squeezed to a Series. Otherwise the object is unchanged.
This method is most useful when you don’t know if your object is a Series or DataFrame, but
you do know it has just a single column. In that case you can safely call squeeze to ensure you
have a Series.
Parameters
axis [{0 or ‘index’, 1 or ‘columns’, None}, default None] A specific axis to squeeze.
By default, all length-1 axes are squeezed.
New in version 0.20.0.
Returns
DataFrame, Series, or scalar The projection after squeezing axis or all the
axes.
See also:
Examples
>>> even_primes.squeeze()
2
Squeezing objects with more than one value in every axis does nothing:
>>> odd_primes.squeeze()
1 3
2 5
3 7
dtype: int64
Slicing a single column will produce a DataFrame with the columns having only one value:
>>> df_a.squeeze('columns')
0 1
1 3
Name: a, dtype: int64
Slicing a single row from a single column will produce a single scalar DataFrame:
>>> df_0a.squeeze('rows')
a 1
Name: 0, dtype: int64
>>> df_0a.squeeze()
1
pandas.DataFrame.stack
DataFrame.unstack Unstack prescribed level(s) from index axis onto column axis.
DataFrame.pivot Reshape dataframe from long format to wide format.
DataFrame.pivot_table Create a spreadsheet-style pivot table as a DataFrame.
Notes
The function is named by analogy with a collection of books being reorganized from being side
by side on a horizontal position (the columns of the dataframe) to being stacked vertically on
top of each other (in the index of the dataframe).
Examples
>>> df_single_level_cols
weight height
cat 0 1
dog 2 3
>>> df_single_level_cols.stack()
cat weight 0
height 1
dog weight 2
height 3
dtype: int64
>>> df_multi_level_cols1
weight
kg pounds
cat 1 2
dog 2 4
>>> df_multi_level_cols1.stack()
weight
cat kg 1
pounds 2
dog kg 2
pounds 4
Missing values
It is common to have missing values when stacking a dataframe with multi-level columns, as
the stacked dataframe typically has more values than the original dataframe. Missing values are
filled with NaNs:
>>> df_multi_level_cols2
weight height
kg m
cat 1.0 2.0
dog 3.0 4.0
>>> df_multi_level_cols2.stack()
height weight
cat kg NaN 1.0
m 2.0 NaN
dog kg NaN 3.0
m 4.0 NaN
>>> df_multi_level_cols2.stack(0)
kg m
cat height NaN 2.0
weight 1.0 NaN
dog height NaN 4.0
weight 3.0 NaN
>>> df_multi_level_cols2.stack([0, 1])
cat height m 2.0
weight kg 1.0
dog height m 4.0
weight kg 3.0
dtype: float64
Note that rows where all values are missing are dropped by default but this behaviour can be
controlled via the dropna keyword parameter:
>>> df_multi_level_cols3
weight height
kg m
cat NaN 1.0
dog 2.0 3.0
>>> df_multi_level_cols3.stack(dropna=False)
height weight
cat kg NaN NaN
m 1.0 NaN
dog kg NaN 2.0
m 3.0 NaN
pandas.DataFrame.std
pandas.DataFrame.sub
level [int or label] Broadcast across a level, matching Index values on the passed
MultiIndex level.
fill_value [float or None, default None] Fill existing missing (NaN) values, and
any new element needed for successful DataFrame alignment, with this value
before computation. If data in both corresponding DataFrame locations is
missing the result will be missing.
Returns
DataFrame Result of the arithmetic operation.
See also:
Notes
Examples
Add a scalar with the operator version, which returns the same results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.div(10)
angles degrees
circle 0.0 36.0
triangle 0.3 18.0
rectangle 0.4 36.0
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df - [1, 2]
angles degrees
circle -1 358
triangle 2 178
rectangle 3 358
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.subtract
level [int or label] Broadcast across a level, matching Index values on the passed
MultiIndex level.
fill_value [float or None, default None] Fill existing missing (NaN) values, and
any new element needed for successful DataFrame alignment, with this value
before computation. If data in both corresponding DataFrame locations is
missing the result will be missing.
Returns
DataFrame Result of the arithmetic operation.
See also:
Notes
Examples
Add a scalar with the operator version, which returns the same results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.div(10)
angles degrees
circle 0.0 36.0
triangle 0.3 18.0
rectangle 0.4 36.0
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df - [1, 2]
angles degrees
circle -1 358
triangle 2 178
rectangle 3 358
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.sum
Parameters
axis [{index (0), columns (1)}] Axis for the function to be applied on.
skipna [bool, default True] Exclude NA/null values when computing the result.
level [int or level name, default None] If the axis is a MultiIndex (hierarchical),
count along a particular level, collapsing into a Series.
numeric_only [bool, default None] Include only float, int, boolean columns. If
None, will attempt to use everything, then use only numeric data. Not imple-
mented for Series.
min_count [int, default 0] The required number of valid values to perform the
operation. If fewer than min_count non-NA values are present the result will
be NA.
New in version 0.22.0: Added with the default being 0. This means the sum of
an all-NA or empty Series is 0, and the product of an all-NA or empty Series
is 1.
**kwargs Additional keyword arguments to be passed to the function.
Returns
Series or DataFrame (if level specified)
See also:
Examples
>>> s.sum()
14
>>> s.sum(level='blooded')
blooded
warm 6
cold 8
Name: legs, dtype: int64
>>> s.sum(level=0)
blooded
warm 6
cold 8
Name: legs, dtype: int64
This can be controlled with the min_count parameter. For example, if you’d like the sum of an
empty series to be NaN, pass min_count=1.
>>> pd.Series([]).sum(min_count=1)
nan
Thanks to the skipna parameter, min_count handles all-NA and empty series identically.
>>> pd.Series([np.nan]).sum()
0.0
>>> pd.Series([np.nan]).sum(min_count=1)
nan
pandas.DataFrame.swapaxes
pandas.DataFrame.swaplevel
Changed in version 0.18.1: The indexes i and j are now optional, and default to
the two innermost levels of the index.
pandas.DataFrame.tail
DataFrame.tail(self, n=5)
Return the last n rows.
This function returns the last n rows from the object based on position. It is useful for quickly verifying data, for example, after sorting or appending rows.
Parameters
n [int, default 5] Number of rows to select.
Returns
type of caller The last n rows of the caller object.
See also:
Examples
>>> df.tail()
animal
4 monkey
5 parrot
6 shark
7 whale
8 zebra
>>> df.tail(3)
animal
6 shark
7 whale
8 zebra
pandas.DataFrame.take
Examples
We may take elements using negative integers for positive indices, starting from the end of the
object, just like with Python lists.
pandas.DataFrame.to_clipboard
Notes
Examples
We can omit the index by passing the keyword index and setting it to False.
pandas.DataFrame.to_csv
Examples
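A minimal sketch, using a small illustrative frame:
>>> df = pd.DataFrame({'name': ['Raphael', 'Donatello'],
...                    'mask': ['red', 'purple'],
...                    'weapon': ['sai', 'bo staff']})
>>> df.to_csv(index=False)
'name,mask,weapon\nRaphael,red,sai\nDonatello,purple,bo staff\n'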
pandas.DataFrame.to_dense
DataFrame.to_dense(self )
Return dense representation of Series/DataFrame (as opposed to sparse).
pandas.DataFrame.to_dict
Examples
>>> df.to_dict('series')
{'col1': row1 1
row2 2
Name: col1, dtype: int64,
'col2': row1 0.50
row2 0.75
Name: col2, dtype: float64}
>>> df.to_dict('split')
{'index': ['row1', 'row2'], 'columns': ['col1', 'col2'],
'data': [[1, 0.5], [2, 0.75]]}
>>> df.to_dict('records')
[{'col1': 1, 'col2': 0.5}, {'col1': 2, 'col2': 0.75}]
>>> df.to_dict('index')
{'row1': {'col1': 1, 'col2': 0.5}, 'row2': {'col1': 2, 'col2': 0.75}}
>>> dd = defaultdict(list)
>>> df.to_dict('records', into=dd)
[defaultdict(<class 'list'>, {'col1': 1, 'col2': 0.5}),
defaultdict(<class 'list'>, {'col1': 2, 'col2': 0.75})]
pandas.DataFrame.to_excel
Note that creating an ExcelWriter object with a file name that already exists will result in the contents of the existing file being erased.
Parameters
excel_writer [str or ExcelWriter object] File path or existing ExcelWriter.
sheet_name [str, default ‘Sheet1’] Name of sheet which will contain DataFrame.
na_rep [str, default ‘’] Missing data representation.
float_format [str, optional] Format string for floating point numbers. For ex-
ample float_format="%.2f" will format 0.1234 to 0.12.
columns [sequence or list of str, optional] Columns to write.
header [bool or list of str, default True] Write out the column names. If a list of
string is given it is assumed to be aliases for the column names.
index [bool, default True] Write row names (index).
index_label [str or sequence, optional] Column label for index column(s) if de-
sired. If not specified, and header and index are True, then the index names
are used. A sequence should be given if the DataFrame uses MultiIndex.
startrow [int, default 0] Upper left cell row to dump data frame.
startcol [int, default 0] Upper left cell column to dump data frame.
engine [str, optional] Write engine to use, ‘openpyxl’ or ‘xlsxwriter’. You can also
set this via the options io.excel.xlsx.writer, io.excel.xls.writer, and
io.excel.xlsm.writer.
merge_cells [bool, default True] Write MultiIndex and Hierarchical Rows as
merged cells.
encoding [str, optional] Encoding of the resulting excel file. Only necessary for
xlwt, other writers support unicode natively.
inf_rep [str, default ‘inf’] Representation for infinity (there is no native represen-
tation for infinity in Excel).
verbose [bool, default True] Display more information in the error logs.
freeze_panes [tuple of int (length 2), optional] Specifies the one-based bottom-
most row and rightmost column that is to be frozen.
New in version 0.20.0.
See also:
Notes
For compatibility with to_csv(), to_excel serializes lists and dicts to strings before writing. Once a workbook has been saved it is not possible to write further data without rewriting the whole workbook.
Examples
>>> df1.to_excel("output.xlsx",
... sheet_name='Sheet_name_1') # doctest: +SKIP
If you wish to write to more than one sheet in the workbook, it is necessary to specify an
ExcelWriter object:
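A hedged sketch of this pattern (df2 is a hypothetical second frame):
>>> with pd.ExcelWriter('output.xlsx') as writer:  # doctest: +SKIP
...     df1.to_excel(writer, sheet_name='Sheet_name_1')
...     df2.to_excel(writer, sheet_name='Sheet_name_2')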
To set the library that is used to write the Excel file, you can pass the engine keyword (the
default engine is automatically chosen depending on the file extension):
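For example (illustrative only):
>>> df1.to_excel('output1.xlsx', engine='xlsxwriter')  # doctest: +SKIP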
pandas.DataFrame.to_feather
DataFrame.to_feather(self, fname)
Write out the binary feather-format for DataFrames.
New in version 0.20.0.
Parameters
fname [str] string file path
pandas.DataFrame.to_gbq
chunksize [int, optional] Number of rows to be inserted in each chunk from the
dataframe. Set to None to load the whole dataframe at once.
reauth [bool, default False] Force Google BigQuery to re-authenticate the user.
This is useful if multiple accounts are used.
if_exists [str, default ‘fail’] Behavior when the destination table exists. Value
can be one of:
'fail' If table exists, do nothing.
'replace' If table exists, drop it, recreate it, and insert data.
'append' If table exists, insert data. Create if does not exist.
auth_local_webserver [bool, default False] Use the local webserver flow instead
of the console flow when getting user credentials.
New in version 0.2.0 of pandas-gbq.
table_schema [list of dicts, optional] List of BigQuery table fields to which the DataFrame columns conform, e.g. [{'name': 'col1', 'type': 'STRING'},...]. If a schema is not provided, it will be generated according to the dtypes of the DataFrame columns. See the BigQuery API documentation on available names of a field.
New in version 0.3.1 of pandas-gbq.
location [str, optional] Location where the load job should run. See the BigQuery
locations documentation for a list of available locations. The location must
match that of the target dataset.
New in version 0.5.0 of pandas-gbq.
progress_bar [bool, default True] Use the library tqdm to show the progress bar
for the upload, chunk by chunk.
New in version 0.5.0 of pandas-gbq.
credentials [google.auth.credentials.Credentials, optional] Credentials for access-
ing Google APIs. Use this parameter to override default credentials, such as to
use Compute Engine google.auth.compute_engine.Credentials or Service
Account google.oauth2.service_account.Credentials directly.
New in version 0.8.0 of pandas-gbq.
New in version 0.24.0.
verbose [bool, deprecated] Deprecated in pandas-gbq version 0.4.0. Use the log-
ging module to adjust verbosity instead.
private_key [str, deprecated] Deprecated in pandas-gbq version 0.8.0. Use
the credentials parameter and google.oauth2.service_account.
Credentials.from_service_account_info() or google.oauth2.
service_account.Credentials.from_service_account_file() instead.
Service account private key in JSON format. Can be file path or string con-
tents. This is useful for remote server authentication (eg. Jupyter/IPython
notebook on remote host).
See also:
pandas.DataFrame.to_hdf
errors [str, default ‘strict’] Specifies how encoding and decoding errors are to be
handled. See the errors argument for open() for a full list of options.
See also:
Examples
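A minimal sketch (requires PyTables; the file is removed again below):
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])
>>> df.to_hdf('data.h5', key='df', mode='w')
>>> pd.read_hdf('data.h5', 'df')
   A  B
a  1  4
b  2  5
c  3  6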
>>> import os
>>> os.remove('data.h5')
pandas.DataFrame.to_html
Parameters
buf [StringIO-like, optional] Buffer to write to.
columns [sequence, optional, default None] The subset of columns to write.
Writes all columns by default.
col_space [str or int, optional] The minimum width of each column in CSS length
units. An int is assumed to be px units.
New in version 0.25.0: Ability to use str.
header [bool, optional] Whether to print column labels, default True.
index [bool, optional, default True] Whether to print index (row) labels.
na_rep [str, optional, default ‘NaN’] String representation of NAN to use.
formatters [list or dict of one-param. functions, optional] Formatter functions to
apply to columns’ elements by position or name. The result of each function
must be a unicode string. List must be of length equal to the number of
columns.
float_format [one-parameter function, optional, default None] Formatter func-
tion to apply to columns’ elements if they are floats. The result of this function
must be a unicode string.
sparsify [bool, optional, default True] Set to False for a DataFrame with a hier-
archical index to print every multiindex key at each row.
index_names [bool, optional, default True] Prints the names of the indexes.
justify [str, default None] How to justify the column labels. If None uses the
option from the print configuration (controlled by set_option), ‘right’ out of
the box. Valid values are
• left
• right
• center
• justify
• justify-all
• start
• end
• inherit
• match-parent
• initial
• unset.
max_rows [int, optional] Maximum number of rows to display in the console.
min_rows [int, optional] The number of rows to display in the console in a trun-
cated repr (when number of rows is above max_rows).
max_cols [int, optional] Maximum number of columns to display in the console.
show_dimensions [bool, default False] Display DataFrame dimensions (number
of rows by number of columns).
decimal [str, default ‘.’] Character recognized as decimal separator, e.g. ‘,’ in
Europe.
New in version 0.18.0.
bold_rows [bool, default True] Make the row labels bold in the output.
classes [str or list or tuple, default None] CSS class(es) to apply to the resulting
html table.
escape [bool, default True] Convert the characters <, >, and & to HTML-safe
sequences.
notebook [{True, False}, default False] Whether the generated HTML is for
IPython Notebook.
border [int] A border=border attribute is included in the opening <table> tag.
Default pd.options.display.html.border.
New in version 0.19.0.
table_id [str, optional] A css id is included in the opening <table> tag if specified.
New in version 0.23.0.
render_links [bool, default False] Convert URLs to HTML links.
New in version 0.24.0.
Returns
str (or unicode, depending on data and options) String representation of
the dataframe.
See also:
pandas.DataFrame.to_json
read_json
Examples
Encoding/decoding a Dataframe using 'records' formatted JSON. Note that index labels are
not preserved with this encoding.
>>> df.to_json(orient='records')
'[{"col 1":"a","col 2":"b"},{"col 1":"c","col 2":"d"}]'
>>> df.to_json(orient='index')
'{"row 1":{"col 1":"a","col 2":"b"},"row 2":{"col 1":"c","col 2":"d"}}'
>>> df.to_json(orient='columns')
'{"col 1":{"row 1":"a","row 2":"c"},"col 2":{"row 1":"b","row 2":"d"}}'
>>> df.to_json(orient='values')
'[["a","b"],["c","d"]]'
>>> df.to_json(orient='table')
'{"schema": {"fields": [{"name": "index", "type": "string"},
{"name": "col 1", "type": "string"},
{"name": "col 2", "type": "string"}],
"primaryKey": "index",
"pandas_version": "0.20.0"},
"data": [{"index": "row 1", "col 1": "a", "col 2": "b"},
{"index": "row 2", "col 1": "c", "col 2": "d"}]}'
pandas.DataFrame.to_latex
Examples
pandas.DataFrame.to_msgpack
pandas.DataFrame.to_numpy
Examples
With heterogeneous data, the lowest common type will have to be used.
For a mix of numeric and non-numeric types, the output array will have object dtype.
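For instance, mixing integer and float columns yields a float array (a minimal sketch):
>>> pd.DataFrame({'A': [1, 2], 'B': [3.0, 4.5]}).to_numpy()
array([[1. , 3. ],
       [2. , 4.5]])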
pandas.DataFrame.to_parquet
fname [str] File path or Root Directory path. Will be used as Root Directory
path while writing a partitioned dataset.
Changed in version 0.24.0.
engine [{‘auto’, ‘pyarrow’, ‘fastparquet’}, default ‘auto’] Parquet library to use. If
‘auto’, then the option io.parquet.engine is used. The default io.parquet.
engine behavior is to try ‘pyarrow’, falling back to ‘fastparquet’ if ‘pyarrow’
is unavailable.
compression [{‘snappy’, ‘gzip’, ‘brotli’, None}, default ‘snappy’] Name of the
compression to use. Use None for no compression.
index [bool, default None] If True, include the dataframe’s index(es) in the file
output. If False, they will not be written to the file. If None, the behavior
depends on the chosen engine.
New in version 0.24.0.
partition_cols [list, optional, default None] Column names by which to partition the dataset. Columns are partitioned in the order they are given.
New in version 0.24.0.
**kwargs Additional arguments passed to the parquet library. See pandas io for
more details.
See also:
Notes
Examples
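A minimal sketch (requires a parquet engine such as pyarrow or fastparquet):
>>> df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
>>> df.to_parquet('df.parquet.gzip', compression='gzip')  # doctest: +SKIP
>>> pd.read_parquet('df.parquet.gzip')  # doctest: +SKIP
   col1  col2
0     1     3
1     2     4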
pandas.DataFrame.to_period
Parameters
freq [str, default None] Frequency of the PeriodIndex.
axis [{0 or ‘index’, 1 or ‘columns’}, default 0] The axis to convert (the index by
default).
copy [bool, default True] If False then underlying input data is not copied.
Returns
TimeSeries with PeriodIndex
pandas.DataFrame.to_pickle
read_pickle Load pickled pandas object (or any object) from file.
DataFrame.to_hdf Write DataFrame to an HDF5 file.
DataFrame.to_sql Write DataFrame to a SQL database.
DataFrame.to_parquet Write a DataFrame to the binary parquet format.
Examples
>>> import os
>>> os.remove("./dummy.pkl")
pandas.DataFrame.to_records
Examples
If the DataFrame index has no label then the recarray field name is set to ‘index’. If the index
has a label then this is used as the field name:
>>> df.to_records(index=False)
rec.array([(1, 0.5 ), (2, 0.75)],
dtype=[('A', '<i8'), ('B', '<f8')])
>>> df.to_records(index_dtypes="<S2")
rec.array([(b'a', 1, 0.5 ), (b'b', 2, 0.75)],
dtype=[('I', 'S2'), ('A', '<i8'), ('B', '<f8')])
pandas.DataFrame.to_sparse
kind [{‘block’, ‘integer’}, default ‘block’] The kind of the SparseIndex tracking
where data is not equal to the fill value:
• ‘block’ tracks only the locations and sizes of blocks of data.
• ‘integer’ keeps an array with all the locations of the data.
In most cases ‘block’ is recommended, since it’s more memory efficient.
Returns
SparseDataFrame The sparse representation of the DataFrame.
See also:
Examples
pandas.DataFrame.to_stata
Examples
pandas.DataFrame.to_string
• end
• inherit
• match-parent
• initial
• unset.
max_rows [int, optional] Maximum number of rows to display in the console.
min_rows [int, optional] The number of rows to display in the console in a trun-
cated repr (when number of rows is above max_rows).
max_cols [int, optional] Maximum number of columns to display in the console.
show_dimensions [bool, default False] Display DataFrame dimensions (number
of rows by number of columns).
decimal [str, default ‘.’] Character recognized as decimal separator, e.g. ‘,’ in
Europe.
New in version 0.18.0.
line_width [int, optional] Width to wrap a line in characters.
Returns
str (or unicode, depending on data and options) String representation of
the dataframe.
See also:
Examples
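A minimal sketch with a small illustrative frame:
>>> d = {'col1': [1, 2, 3], 'col2': [4, 5, 6]}
>>> df = pd.DataFrame(d)
>>> print(df.to_string())
   col1  col2
0     1     4
1     2     5
2     3     6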
pandas.DataFrame.to_timestamp
copy [bool, default True] If False then underlying input data is not copied.
Returns
DataFrame with DatetimeIndex
pandas.DataFrame.to_xarray
DataFrame.to_xarray(self )
Return an xarray object from the pandas object.
Returns
xarray.DataArray or xarray.Dataset Data in the pandas structure converted
to Dataset if the object is a DataFrame, or a DataArray if the object is a Series.
See also:
Notes
Examples
>>> df.to_xarray()
<xarray.Dataset>
Dimensions: (index: 4)
Coordinates:
* index (index) int64 0 1 2 3
Data variables:
name (index) object 'falcon' 'parrot' 'lion' 'monkey'
class (index) object 'bird' 'bird' 'mammal' 'mammal'
max_speed (index) float64 389.0 24.0 80.5 nan
num_legs (index) int64 2 2 4 4
>>> df['max_speed'].to_xarray()
<xarray.DataArray 'max_speed' (index: 4)>
array([389. , 24. , 80.5, nan])
Coordinates:
* index (index) int64 0 1 2 3
>>> df_multiindex.to_xarray()
<xarray.Dataset>
Dimensions: (animal: 2, date: 2)
Coordinates:
* date (date) datetime64[ns] 2018-01-01 2018-01-02
* animal (animal) object 'falcon' 'parrot'
Data variables:
speed (date, animal) int64 350 18 361 15
pandas.DataFrame.transform
Examples
Even though the resulting DataFrame must have the same length as the input DataFrame, it is
possible to provide several input functions:
>>> s = pd.Series(range(3))
>>> s
0 0
1 1
2 2
dtype: int64
>>> s.transform([np.sqrt, np.exp])
sqrt exp
0 0.000000 1.000000
1 1.000000 2.718282
2 1.414214 7.389056
pandas.DataFrame.transpose
copy [bool, default False] If True, the underlying data is copied. Otherwise (de-
fault), no copy is made if possible.
*args, **kwargs Additional keywords have no effect but might be accepted for
compatibility with numpy.
Returns
DataFrame The transposed DataFrame.
See also:
Notes
Transposing a DataFrame with mixed dtypes will result in a homogeneous DataFrame with the
object dtype. In such a case, a copy of the data is always made.
Examples
When the dtype is homogeneous in the original DataFrame, we get a transposed DataFrame with
the same dtype:
>>> df1.dtypes
col1 int64
col2 int64
dtype: object
>>> df1_transposed.dtypes
0 int64
1 int64
dtype: object
When the DataFrame has mixed dtypes, we get a transposed DataFrame with the object dtype:
>>> df2.dtypes
name object
score float64
employed bool
kids int64
dtype: object
>>> df2_transposed.dtypes
0 object
1 object
dtype: object
pandas.DataFrame.truediv
Notes
Examples
Add a scalar with the operator version, which returns the same results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.div(10)
angles degrees
circle 0.0 36.0
triangle 0.3 18.0
rectangle 0.4 36.0
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df - [1, 2]
angles degrees
circle -1 358
triangle 2 178
rectangle 3 358
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.truncate
Notes
If the index being truncated contains only datetime values, before and after may be specified as
strings instead of Timestamps.
Examples
>>> df.truncate(before=pd.Timestamp('2016-01-05'),
... after=pd.Timestamp('2016-01-10')).tail()
A
Because the index is a DatetimeIndex containing only dates, we can specify before and after as
strings. They will be coerced to Timestamps before truncation.
Note that truncate assumes a 0 value for any unspecified time component (midnight). This
differs from partial string slicing, which returns any partially matching dates.
pandas.DataFrame.tshift
Notes
If freq is not specified, then this method tries to use the freq or inferred_freq attributes of the index. If neither of those attributes exists, a ValueError is raised.
pandas.DataFrame.tz_convert
pandas.DataFrame.tz_localize
• ‘shift_forward’ will shift the nonexistent time forward to the closest existing
time
• ‘shift_backward’ will shift the nonexistent time backward to the closest ex-
isting time
• ‘NaT’ will return NaT where there are nonexistent times
• timedelta objects will shift nonexistent times by the timedelta
• ‘raise’ will raise an NonExistentTimeError if there are nonexistent times
New in version 0.24.0.
Returns
Series or DataFrame Same type as the input.
Raises
TypeError If the TimeSeries is tz-aware and tz is not None.
Examples
>>> s = pd.Series([1],
... index=pd.DatetimeIndex(['2018-09-15 01:30:00']))
>>> s.tz_localize('CET')
2018-09-15 01:30:00+02:00 1
dtype: int64
Be careful with DST changes. When there is sequential data, pandas can infer the DST time:
In some cases, inferring the DST is impossible. In such cases, you can pass an ndarray to the
ambiguous parameter to set the DST explicitly
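A minimal sketch of passing an explicit boolean array (values chosen for illustration):
>>> s = pd.Series(range(3), index=pd.DatetimeIndex([
...     '2018-10-28 01:20:00', '2018-10-28 02:36:00', '2018-10-28 03:46:00']))
>>> s.tz_localize('CET', ambiguous=np.array([True, True, False]))
2018-10-28 01:20:00+02:00    0
2018-10-28 02:36:00+02:00    1
2018-10-28 03:46:00+01:00    2
dtype: int64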
If the DST transition causes nonexistent times, you can shift these dates forward or backwards with a timedelta object or ‘shift_forward’ or ‘shift_backward’.
>>> s = pd.Series(range(2),
...               index=pd.DatetimeIndex(['2015-03-29 02:30:00',
...                                       '2015-03-29 03:30:00']))
>>> s.tz_localize('Europe/Warsaw', nonexistent='shift_forward')
2015-03-29 03:00:00+02:00    0
2015-03-29 03:30:00+02:00    1
dtype: int64
>>> s.tz_localize('Europe/Warsaw', nonexistent='shift_backward')
2015-03-29 01:59:59.999999999+01:00    0
2015-03-29 03:30:00+02:00              1
dtype: int64
>>> s.tz_localize('Europe/Warsaw', nonexistent=pd.Timedelta('1H'))
2015-03-29 03:30:00+02:00    0
2015-03-29 03:30:00+02:00    1
dtype: int64
pandas.DataFrame.unstack
Examples
>>> s.unstack(level=-1)
a b
one 1.0 2.0
two 3.0 4.0
>>> s.unstack(level=0)
one two
a 1.0 3.0
b 2.0 4.0
>>> df = s.unstack(level=0)
>>> df.unstack()
one a 1.0
b 2.0
two a 3.0
b 4.0
dtype: float64
pandas.DataFrame.update
Examples
The DataFrame’s length does not increase as a result of the update, only values at matching
index/column labels are updated.
If other contains NaNs the corresponding values are not updated in the original dataframe.
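A minimal sketch with small illustrative frames:
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [400, 500, 600]})
>>> new_df = pd.DataFrame({'B': [4, np.nan, 6]})
>>> df.update(new_df)
>>> df
   A      B
0  1    4.0
1  2  500.0
2  3    6.0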
pandas.DataFrame.var
pandas.DataFrame.where
Notes
The where method is an application of the if-then idiom. For each element in the calling
DataFrame, if cond is True the element is used; otherwise the corresponding element from the
DataFrame other is used.
The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m,
df2) is equivalent to np.where(m, df1, df2).
For further details and examples see the where documentation in indexing.
Examples
>>> s = pd.Series(range(5))
>>> s.where(s > 0)
0 NaN
1 1.0
2 2.0
3 3.0
4 4.0
dtype: float64
pandas.DataFrame.xs
Notes
Examples
>>> df.xs('mammal')
num_legs num_wings
animal locomotion
cat walks 4 0
dog walks 4 0
bat flies 2 2
Axes
6.4.3 Conversion
DataFrame.astype(self, dtype[, copy, errors]) Cast a pandas object to a specified dtype dtype.
DataFrame.infer_objects(self) Attempt to infer better dtypes for object columns.
DataFrame.copy(self[, deep]) Make a copy of this object’s indices and data.
DataFrame.isna(self) Detect missing values.
DataFrame.notna(self) Detect existing (non-missing) values.
DataFrame.bool(self) Return the bool of a single element PandasObject.
pandas.DataFrame.__iter__
DataFrame.__iter__(self )
Iterate over info axis.
Returns
iterator Info axis as iterator.
For more information on .at, .iat, .loc, and .iloc, see the indexing documentation.
DataFrame.add(self, other[, axis, level, …]) Get Addition of dataframe and other, element-wise
(binary operator add).
DataFrame.sub(self, other[, axis, level, …]) Get Subtraction of dataframe and other, element-
wise (binary operator sub).
DataFrame.mul(self, other[, axis, level, …]) Get Multiplication of dataframe and other,
element-wise (binary operator mul).
DataFrame.div(self, other[, axis, level, …]) Get Floating division of dataframe and other,
element-wise (binary operator truediv).
DataFrame.truediv(self, other[, axis, …]) Get Floating division of dataframe and other,
element-wise (binary operator truediv).
DataFrame.floordiv(self, other[, axis, …]) Get Integer division of dataframe and other,
element-wise (binary operator floordiv).
DataFrame.mod(self, other[, axis, level, …]) Get Modulo of dataframe and other, element-wise
(binary operator mod).
DataFrame.pow(self, other[, axis, level, …]) Get Exponential power of dataframe and other,
element-wise (binary operator pow).
DataFrame.apply(self, func[, axis, …]) Apply a function along an axis of the DataFrame.
DataFrame.applymap(self, func) Apply a function to a Dataframe elementwise.
DataFrame.pipe(self, func, \*args, \*\*kwargs) Apply func(self, *args, **kwargs).
DataFrame.agg(self, func[, axis]) Aggregate using one or more operations over the
specified axis.
DataFrame.aggregate(self, func[, axis]) Aggregate using one or more operations over the
specified axis.
DataFrame.transform(self, func[, axis]) Call func on self producing a DataFrame with
transformed values and that has the same axis
length as self.
DataFrame.groupby(self[, by, axis, level, …]) Group DataFrame or Series using a mapper or by
a Series of columns.
DataFrame.rolling(self, window[, …]) Provide rolling window calculations.
DataFrame.append(self, other[, …]) Append rows of other to the end of caller, returning
a new object.
DataFrame.assign(self, \*\*kwargs) Assign new columns to a DataFrame.
DataFrame.join(self, other[, on, how, …]) Join columns of another DataFrame.
DataFrame.merge(self, right[, how, on, …]) Merge DataFrame or named Series objects with a
database-style join.
DataFrame.update(self, other[, join, …]) Modify in place using non-NA values from another
DataFrame.
6.4.13 Plotting
DataFrame.plot is both a callable method and a namespace attribute for specific plotting methods of the
form DataFrame.plot.<kind>.
pandas.DataFrame.plot.area
Examples
>>> df = pd.DataFrame({
... 'sales': [3, 2, 3, 9, 10, 6],
... 'signups': [5, 5, 6, 12, 14, 13],
... 'visits': [20, 42, 28, 62, 81, 50],
... }, index=pd.date_range(start='2018/01/01', end='2018/07/01',
... freq='M'))
>>> ax = df.plot.area()
Area plots are stacked by default. To produce an unstacked plot, pass stacked=False:
>>> ax = df.plot.area(stacked=False)
>>> ax = df.plot.area(y='sales')
>>> df = pd.DataFrame({
... 'sales': [3, 2, 3],
... 'visits': [20, 42, 28],
... 'day': [1, 2, 3],
... })
>>> ax = df.plot.area(x='day')
pandas.DataFrame.plot.bar
Examples
Basic plot.
Plot a whole dataframe to a bar plot. Each column is assigned a distinct color, and each row is nested
in a group along the horizontal axis.
Instead of nesting, the figure can be split by column with subplots=True. In this case, a numpy.
ndarray of matplotlib.axes.Axes are returned.
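A minimal sketch of the basic case described above (illustrative data):
>>> df = pd.DataFrame({'lab': ['A', 'B', 'C'], 'val': [10, 30, 20]})
>>> ax = df.plot.bar(x='lab', y='val', rot=0)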
pandas.DataFrame.plot.barh
Examples
Basic example
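A minimal sketch with illustrative data:
>>> df = pd.DataFrame({'lab': ['A', 'B', 'C'], 'val': [10, 30, 20]})
>>> ax = df.plot.barh(x='lab', y='val')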
pandas.DataFrame.plot.box
A consideration when using this chart is that the box and the whiskers can overlap, which is very
common when plotting small sets of data.
Parameters
by [string or sequence] Column in the DataFrame to group by.
**kwds [optional] Additional keywords are documented in DataFrame.plot().
Returns
matplotlib.axes.Axes or numpy.ndarray of them
See also:
Examples
Draw a box plot from a DataFrame with four columns of randomly generated data.
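A minimal sketch with randomly generated data:
>>> data = np.random.randn(25, 4)
>>> df = pd.DataFrame(data, columns=list('ABCD'))
>>> ax = df.plot.box()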
pandas.DataFrame.plot.density
Examples
Given a Series of points randomly sampled from an unknown distribution, estimate its PDF using KDE
with automatic bandwidth determination and plot the results, evaluating them at 1000 equally spaced
points (default):
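For illustration (KDE requires SciPy), a small Series constructed here:
>>> s = pd.Series([1, 2, 2.5, 3, 3.5, 4, 5])
>>> ax = s.plot.kde()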
A scalar bandwidth can be specified. Using a small bandwidth value can lead to over-fitting, while
using a large bandwidth value may result in under-fitting:
>>> ax = s.plot.kde(bw_method=0.3)
>>> ax = s.plot.kde(bw_method=3)
Finally, the ind parameter determines the evaluation points for the plot of the estimated PDF:
>>> df = pd.DataFrame({
... 'x': [1, 2, 2.5, 3, 3.5, 4, 5],
... 'y': [4, 4, 4.5, 5, 5.5, 6, 6],
... })
>>> ax = df.plot.kde()
A scalar bandwidth can be specified. Using a small bandwidth value can lead to over-fitting, while
using a large bandwidth value may result in under-fitting:
>>> ax = df.plot.kde(bw_method=0.3)
>>> ax = df.plot.kde(bw_method=3)
Finally, the ind parameter determines the evaluation points for the plot of the estimated PDF:
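For example, evaluating at six chosen points (illustrative values):
>>> ax = df.plot.kde(ind=[1, 2, 3, 4, 5, 6])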
pandas.DataFrame.plot.hexbin
Parameters
x [int or str] The column label or position for x points.
y [int or str] The column label or position for y points.
C [int or str, optional] The column label or position for the value of (x, y) point.
reduce_C_function [callable, default np.mean] Function of one argument that re-
duces all the values in a bin to a single number (e.g. np.mean, np.max, np.sum,
np.std).
gridsize [int or tuple of (int, int), default 100] The number of hexagons in the x-
direction. The corresponding number of hexagons in the y-direction is chosen in
a way that the hexagons are approximately regular. Alternatively, gridsize can be
a tuple with two elements specifying the number of hexagons in the x-direction
and the y-direction.
**kwds Additional keyword arguments are documented in DataFrame.plot().
Returns
matplotlib.AxesSubplot The matplotlib Axes on which the hexbin is plotted.
See also:
Examples
The following examples are generated with random data from a normal distribution.
>>> n = 10000
>>> df = pd.DataFrame({'x': np.random.randn(n),
... 'y': np.random.randn(n)})
>>> ax = df.plot.hexbin(x='x', y='y', gridsize=20)
The next example uses C and np.sum as reduce_C_function. Note that the ‘observations’ values range from 1 to 5, but the resulting plot shows values above 25; this is because of the reduce_C_function.
>>> n = 500
>>> df = pd.DataFrame({
... 'coord_x': np.random.uniform(-3, 3, size=n),
... 'coord_y': np.random.uniform(30, 50, size=n),
... 'observations': np.random.randint(1,5, size=n)
... })
>>> ax = df.plot.hexbin(x='coord_x',
... y='coord_y',
... C='observations',
... reduce_C_function=np.sum,
... gridsize=10,
... cmap="viridis")
pandas.DataFrame.plot.hist
Examples
When we roll a die 6000 times, we expect each value to occur around 1000 times. But when we roll two dice and sum the result, the distribution is going to be quite different. A histogram illustrates those distributions.
>>> df = pd.DataFrame(
... np.random.randint(1, 7, 6000),
... columns = ['one'])
>>> df['two'] = df['one'] + np.random.randint(1, 7, 6000)
>>> ax = df.plot.hist(bins=12, alpha=0.5)
pandas.DataFrame.plot.kde
Examples
Given a Series of points randomly sampled from an unknown distribution, estimate its PDF using KDE
with automatic bandwidth determination and plot the results, evaluating them at 1000 equally spaced
points (default):
A scalar bandwidth can be specified. Using a small bandwidth value can lead to over-fitting, while
using a large bandwidth value may result in under-fitting:
>>> ax = s.plot.kde(bw_method=0.3)
>>> ax = s.plot.kde(bw_method=3)
Finally, the ind parameter determines the evaluation points for the plot of the estimated PDF:
>>> df = pd.DataFrame({
... 'x': [1, 2, 2.5, 3, 3.5, 4, 5],
... 'y': [4, 4, 4.5, 5, 5.5, 6, 6],
... })
>>> ax = df.plot.kde()
A scalar bandwidth can be specified. Using a small bandwidth value can lead to over-fitting, while
using a large bandwidth value may result in under-fitting:
>>> ax = df.plot.kde(bw_method=0.3)
>>> ax = df.plot.kde(bw_method=3)
Finally, the ind parameter determines the evaluation points for the plot of the estimated PDF:
pandas.DataFrame.plot.line
Examples
The following example shows the populations for some animals over the years.
>>> df = pd.DataFrame({
... 'pig': [20, 18, 489, 675, 1776],
... 'horse': [4, 25, 281, 600, 1900]
... }, index=[1990, 1997, 2003, 2009, 2014])
>>> lines = df.plot.line()
pandas.DataFrame.plot.pie
DataFrame.plot.pie(self, **kwargs)
Generate a pie plot.
A pie plot is a proportional representation of the numerical data in a column. This function
wraps matplotlib.pyplot.pie() for the specified column. If no column reference is passed and
subplots=True a pie plot is drawn for each numerical column independently.
Parameters
y [int or label, optional] Label or position of the column to plot. If not provided,
subplots=True argument must be passed.
**kwds Keyword arguments to pass on to DataFrame.plot().
Returns
matplotlib.axes.Axes or np.ndarray of them A NumPy array is returned when
subplots is True.
See also:
Examples
In the example below we have a DataFrame with information about the planets’ mass and radius. We pass the ‘mass’ column to the pie function to get a pie plot.
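A minimal sketch with illustrative values for three planets:
>>> df = pd.DataFrame({'mass': [0.330, 4.87, 5.97],
...                    'radius': [2439.7, 6051.8, 6378.1]},
...                   index=['Mercury', 'Venus', 'Earth'])
>>> plot = df.plot.pie(y='mass', figsize=(5, 5))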
pandas.DataFrame.plot.scatter
Examples
Let’s see how to draw a scatter plot using coordinates from the values in a DataFrame’s columns.
>>> df = pd.DataFrame([[5.1, 3.5, 0], [4.9, 3.0, 0], [7.0, 3.2, 1],
... [6.4, 3.2, 1], [5.9, 3.0, 2]],
... columns=['length', 'width', 'species'])
>>> ax1 = df.plot.scatter(x='length',
... y='width',
... c='DarkBlue')
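The points can also be coloured by the values of a column, for example (illustrative only):
>>> ax2 = df.plot.scatter(x='length',
...                       y='width',
...                       c='species',
...                       colormap='viridis')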
DataFrame.boxplot(self[, column, by, ax, …]) Make a box plot from DataFrame columns.
DataFrame.hist(data[, column, by, grid, …]) Make a histogram of the DataFrame’s.
Sparse-dtype specific methods and attributes are provided under the DataFrame.sparse accessor.
pandas.DataFrame.sparse.density
DataFrame.sparse.density
Ratio of non-sparse points to total (dense) data points represented in the DataFrame.
pandas.DataFrame.sparse.from_spmatrix
Examples
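A minimal sketch (requires SciPy):
>>> import scipy.sparse
>>> mat = scipy.sparse.eye(3)
>>> pd.DataFrame.sparse.from_spmatrix(mat)
     0    1    2
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  0.0  0.0  1.0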
pandas.DataFrame.sparse.to_coo
sparse.to_coo(self )
Return the contents of the frame as a sparse SciPy COO matrix.
New in version 0.25.0.
Returns
coo_matrix [scipy.sparse.spmatrix] If the caller is heterogeneous and contains
booleans or objects, the result will be of dtype=object. See Notes.
Notes
The dtype will be the lowest-common-denominator type (implicit upcasting); that is to say if the
dtypes (even of numeric types) are mixed, the one that accommodates all will be chosen.
e.g. If the dtypes are float16 and float32, dtype will be upcast to float32. By numpy.find_common_type
convention, mixing int64 and uint64 will result in a float64 dtype.
pandas.DataFrame.sparse.to_dense
sparse.to_dense(self )
Convert a DataFrame with sparse values to dense.
New in version 0.25.0.
Returns
DataFrame A DataFrame with the same values stored as dense arrays.
Examples
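A minimal sketch:
>>> df = pd.DataFrame({'A': pd.SparseArray([0, 1, 0])})
>>> df.sparse.to_dense()
   A
0  0
1  1
2  0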
6.4.16 Sparse
pandas.SparseDataFrame.to_coo
SparseDataFrame.to_coo(self )
Return the contents of the frame as a sparse SciPy COO matrix.
New in version 0.25.0.
Returns
coo_matrix [scipy.sparse.spmatrix] If the caller is heterogeneous and contains
booleans or objects, the result will be of dtype=object. See Notes.
Notes
The dtype will be the lowest-common-denominator type (implicit upcasting); that is to say if the
dtypes (even of numeric types) are mixed, the one that accommodates all will be chosen.
e.g. If the dtypes are float16 and float32, dtype will be upcast to float32. By numpy.find_common_type
convention, mixing int64 and uint64 will result in a float64 dtype.
For most data types, pandas uses NumPy arrays as the concrete objects contained within an Index, Series, or DataFrame.
For some data types, pandas extends NumPy’s type system.
Pandas and third-party libraries can extend NumPy’s type system (see Extension types). The top-level
array() method can be used to create a new array, which may be stored in a Series, Index, or as a column
in a DataFrame.
6.5.1 pandas.array
For all other cases, NumPy’s usual inference rules will be used.
copy [bool, default True] Whether to copy the data, even if not necessary. Depending
on the type of data, creating the new array may require copying data, even if
copy=False.
Returns
ExtensionArray The newly created array.
Raises
ValueError When data is not 1-dimensional.
See also:
Notes
Omitting the dtype argument means pandas will attempt to infer the best array type from the values
in the data. As new array types are added by pandas and 3rd party libraries, the “best” array type
may change. We recommend specifying dtype to ensure that
1. the correct array type for the data is returned
2. the returned array type doesn’t change as new extension types are added by pandas and third-
party libraries
Additionally, if the underlying memory representation of the returned array matters, we recommend
specifying the dtype as a concrete object rather than a string alias or allowing it to be inferred. For
example, a future version of pandas or a 3rd-party library may include a dedicated ExtensionArray for
string data. In this event, the following would no longer return an arrays.PandasArray backed by a
NumPy array.
This would instead return the new ExtensionArray dedicated for string data. If you really need the
new array to be backed by a NumPy array, specify that in the dtype.
Or use the dedicated constructor for the array you’re expecting, and wrap that in a PandasArray
Examples
Because omitting the dtype passes the data through to NumPy, a mixture of valid integers and NA will
return a floating-point NumPy array.
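For instance (a minimal sketch):
>>> pd.array([1, 2, np.nan])
<PandasArray>
[1.0, 2.0, nan]
Length: 3, dtype: float64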
data must be 1-dimensional. A ValueError is raised when the input has the wrong dimensionality.
>>> pd.array(1)
Traceback (most recent call last):
...
ValueError: Cannot pass scalar '1' to 'pandas.array'.
NumPy cannot natively represent timezone-aware datetimes. Pandas supports this with the arrays.
DatetimeArray extension array, which can hold timezone-naive or timezone-aware values.
Timestamp, a subclass of datetime.datetime, is pandas’ scalar type for timezone-naive or timezone-aware
datetime data.
pandas.Timestamp
class pandas.Timestamp
Pandas replacement for the Python datetime.datetime object.
Timestamp is the pandas equivalent of Python’s datetime and is interchangeable with it in most cases.
It’s the type used for the entries that make up a DatetimeIndex, and other timeseries oriented data
structures in pandas.
Parameters
ts_input [datetime-like, str, int, float] Value to be converted to Timestamp.
freq [str, DateOffset] Offset which Timestamp will have.
tz [str, pytz.timezone, dateutil.tz.tzfile or None] Time zone for time which Timestamp
will have.
unit [str] Unit used for conversion if ts_input is of type int or float. The valid values
are ‘D’, ‘h’, ‘m’, ‘s’, ‘ms’, ‘us’, and ‘ns’. For example, ‘s’ means seconds and ‘ms’
means milliseconds.
year, month, day [int] New in version 0.19.0.
hour, minute, second, microsecond [int, optional, default 0] New in version
0.19.0.
Notes
There are essentially three calling conventions for the constructor. The primary form accepts four
parameters. They can be passed by position or keyword.
The other two forms mimic the parameters from datetime.datetime. They can be passed by either
position or keyword, but not both mixed together.
Examples
>>> pd.Timestamp('2017-01-01T12')
Timestamp('2017-01-01 12:00:00')
This converts an int representing a Unix-epoch in units of seconds and for a particular timezone
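For example (illustrative epoch values):
>>> pd.Timestamp(1513393355.5, unit='s')
Timestamp('2017-12-16 03:02:35.500000')
>>> pd.Timestamp(1513393355, unit='s', tz='US/Pacific')
Timestamp('2017-12-15 19:02:35-0800', tz='US/Pacific')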
Using the other two forms that mimic the API for datetime.datetime:
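For example:
>>> pd.Timestamp(2017, 1, 1, 12)
Timestamp('2017-01-01 12:00:00')
>>> pd.Timestamp(year=2017, month=1, day=1, hour=12)
Timestamp('2017-01-01 12:00:00')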
Attributes
pandas.Timestamp.asm8
Timestamp.asm8
Return numpy datetime64 format in nanoseconds.
pandas.Timestamp.dayofweek
Timestamp.dayofweek
Return the day of the week.
pandas.Timestamp.dayofyear
Timestamp.dayofyear
Return the day of the year.
pandas.Timestamp.days_in_month
Timestamp.days_in_month
Return the number of days in the month.
pandas.Timestamp.daysinmonth
Timestamp.daysinmonth
Return the number of days in the month.
pandas.Timestamp.freqstr
Timestamp.freqstr
Return the frequency string associated with the Timestamp, if any.
pandas.Timestamp.is_leap_year
Timestamp.is_leap_year
Return True if year is a leap year.
pandas.Timestamp.is_month_end
Timestamp.is_month_end
Return True if date is last day of month.
pandas.Timestamp.is_month_start
Timestamp.is_month_start
Return True if date is first day of month.
pandas.Timestamp.is_quarter_end
Timestamp.is_quarter_end
Return True if date is last day of the quarter.
pandas.Timestamp.is_quarter_start
Timestamp.is_quarter_start
Return True if date is first day of the quarter.
pandas.Timestamp.is_year_end
Timestamp.is_year_end
Return True if date is last day of the year.
pandas.Timestamp.is_year_start
Timestamp.is_year_start
Return True if date is first day of the year.
pandas.Timestamp.quarter
Timestamp.quarter
Return the quarter of the year.
pandas.Timestamp.resolution
Timestamp.resolution
Return the resolution describing the smallest difference between two times that can be represented by a Timestamp object.
pandas.Timestamp.tz
Timestamp.tz
Alias for tzinfo
pandas.Timestamp.week
Timestamp.week
Return the week number of the year.
pandas.Timestamp.weekday_name
Timestamp.weekday_name
Deprecated since version 0.23.0: Use Timestamp.day_name() instead
pandas.Timestamp.weekofyear
Timestamp.weekofyear
Return the week number of the year.
day
fold
freq
hour
microsecond
minute
month
nanosecond
second
tzinfo
value
year
Methods
pandas.Timestamp.astimezone
Timestamp.astimezone(self, tz)
Convert tz-aware Timestamp to another time zone.
Parameters
tz [str, pytz.timezone, dateutil.tz.tzfile or None] Time zone for time which Times-
tamp will be converted to. None will remove timezone holding UTC time.
Returns
converted [Timestamp]
Raises
TypeError If Timestamp is tz-naive.
pandas.Timestamp.ceil
pandas.Timestamp.combine
pandas.Timestamp.ctime
Timestamp.ctime()
Return ctime() style string.
pandas.Timestamp.date
Timestamp.date()
Return date object with same year, month and day.
pandas.Timestamp.day_name
Timestamp.day_name(self, locale=None)
Return the day name of the Timestamp with specified locale.
Parameters
locale [string, default None (English locale)] locale determining the language in
which to return the day name
Returns
day_name [string]
New in version 0.23.0.
pandas.Timestamp.dst
Timestamp.dst()
Return self.tzinfo.dst(self).
pandas.Timestamp.floor
pandas.Timestamp.fromisoformat
Timestamp.fromisoformat()
string -> datetime from datetime.isoformat() output
pandas.Timestamp.fromordinal
pandas.Timestamp.fromtimestamp
classmethod Timestamp.fromtimestamp(ts)
timestamp[, tz] -> tz’s local time from POSIX timestamp.
pandas.Timestamp.isocalendar
Timestamp.isocalendar()
Return a 3-tuple containing ISO year, week number, and weekday.
pandas.Timestamp.isoweekday
Timestamp.isoweekday()
Return the day of the week represented by the date. Monday == 1 … Sunday == 7
pandas.Timestamp.month_name
Timestamp.month_name(self, locale=None)
Return the month name of the Timestamp with specified locale.
Parameters
locale [string, default None (English locale)] locale determining the language in
which to return the month name
Returns
month_name [string]
New in version 0.23.0.
pandas.Timestamp.normalize
Timestamp.normalize(self )
Normalize Timestamp to midnight, preserving tz information.
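For example (illustrative value):
>>> ts = pd.Timestamp('2017-01-01 12:34:56')
>>> ts.normalize()
Timestamp('2017-01-01 00:00:00')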
pandas.Timestamp.now
classmethod Timestamp.now(tz=None)
Return new Timestamp object representing current time local to tz.
Parameters
tz [str or timezone object, default None] Timezone to localize to
pandas.Timestamp.replace
pandas.Timestamp.round
pandas.Timestamp.strftime
Timestamp.strftime()
format -> strftime() style string.
pandas.Timestamp.strptime
pandas.Timestamp.time
Timestamp.time()
Return time object with same time but with tzinfo=None.
pandas.Timestamp.timestamp
Timestamp.timestamp()
Return POSIX timestamp as float.
pandas.Timestamp.timetuple
Timestamp.timetuple()
Return time tuple, compatible with time.localtime().
pandas.Timestamp.timetz
Timestamp.timetz()
Return time object with same time and tzinfo.
pandas.Timestamp.to_datetime64
Timestamp.to_datetime64()
Return a numpy.datetime64 object with ‘ns’ precision.
pandas.Timestamp.to_julian_date
Timestamp.to_julian_date(self )
Convert TimeStamp to a Julian Date. 0 Julian date is noon January 1, 4713 BC.
pandas.Timestamp.to_numpy
Timestamp.to_numpy()
Convert the Timestamp to a NumPy datetime64.
New in version 0.25.0.
This is an alias method for Timestamp.to_datetime64(). The dtype and copy parameters are
available here only for compatibility. Their values will not affect the return value.
Returns
numpy.datetime64
See also:
pandas.Timestamp.to_period
Timestamp.to_period(self, freq=None)
Return a Period of which this timestamp is an observation.
pandas.Timestamp.to_pydatetime
Timestamp.to_pydatetime()
Convert a Timestamp object to a native Python datetime object.
If warn=True, issue a warning if nanoseconds is nonzero.
pandas.Timestamp.today
pandas.Timestamp.toordinal
Timestamp.toordinal()
Return proleptic Gregorian ordinal. January 1 of year 1 is day 1.
pandas.Timestamp.tz_convert
Timestamp.tz_convert(self, tz)
Convert tz-aware Timestamp to another time zone.
Parameters
tz [str, pytz.timezone, dateutil.tz.tzfile or None] Time zone for time which Timestamp will be converted to. None will remove timezone holding UTC time.
Returns
converted [Timestamp]
Raises
TypeError If Timestamp is tz-naive.
pandas.Timestamp.tz_localize
pandas.Timestamp.tzname
Timestamp.tzname()
Return self.tzinfo.tzname(self).
pandas.Timestamp.utcfromtimestamp
classmethod Timestamp.utcfromtimestamp(ts)
Construct a naive UTC datetime from a POSIX timestamp.
pandas.Timestamp.utcnow
classmethod Timestamp.utcnow()
Return a new Timestamp representing UTC day and time.
pandas.Timestamp.utcoffset
Timestamp.utcoffset()
Return self.tzinfo.utcoffset(self).
pandas.Timestamp.utctimetuple
Timestamp.utctimetuple()
Return UTC time tuple, compatible with time.localtime().
pandas.Timestamp.weekday
Timestamp.weekday()
Return the day of the week represented by the date. Monday == 0 … Sunday == 6
isoformat
Properties
pandas.Timestamp.day
Timestamp.day
pandas.Timestamp.fold
Timestamp.fold
pandas.Timestamp.hour
Timestamp.hour
pandas.Timestamp.max
pandas.Timestamp.microsecond
Timestamp.microsecond
pandas.Timestamp.min
pandas.Timestamp.minute
Timestamp.minute
pandas.Timestamp.month
Timestamp.month
pandas.Timestamp.nanosecond
Timestamp.nanosecond
pandas.Timestamp.second
Timestamp.second
pandas.Timestamp.tzinfo
Timestamp.tzinfo
pandas.Timestamp.value
Timestamp.value
pandas.Timestamp.year
Timestamp.year
Methods
pandas.Timestamp.freq
Timestamp.freq
pandas.Timestamp.isoformat
Timestamp.isoformat(self, sep=’T’)
A collection of timestamps may be stored in an arrays.DatetimeArray. For timezone-aware data, the .dtype of a DatetimeArray is a DatetimeTZDtype. For timezone-naive data, np.dtype("datetime64[ns]") is used.
If the data are tz-aware, then every value in the array must have the same timezone.
pandas.arrays.DatetimeArray
Warning: DatetimeArray is currently experimental, and its API may change without warn-
ing. In particular, DatetimeArray.dtype is expected to change to always be an instance of an
ExtensionDtype subclass.
Parameters
values [Series, Index, DatetimeArray, ndarray] The datetime data.
For DatetimeArray values (or a Series or Index boxing one), dtype and freq will
be extracted from values, with precedence given to
dtype [numpy.dtype or DatetimeTZDtype] Note that the only NumPy dtype allowed
is ‘datetime64[ns]’.
freq [str or Offset, optional]
copy [bool, default False] Whether to copy the underlying array of values.
Attributes
None
Methods
None
pandas.DatetimeTZDtype
Examples
>>> pd.DatetimeTZDtype(tz='UTC')
datetime64[ns, UTC]
>>> pd.DatetimeTZDtype(tz='dateutil/US/Central')
datetime64[ns, tzfile('/usr/share/zoneinfo/US/Central')]
Attributes
pandas.DatetimeTZDtype.unit
DatetimeTZDtype.unit
The precision of the datetime data.
pandas.DatetimeTZDtype.tz
DatetimeTZDtype.tz
The timezone.
Methods
None
NumPy can natively represent timedeltas. Pandas provides Timedelta for symmetry with Timestamp.
pandas.Timedelta
class pandas.Timedelta
Represents a duration, the difference between two dates or times.
Timedelta is the pandas equivalent of python’s datetime.timedelta and is interchangeable with it in
most cases.
Parameters
value [Timedelta, timedelta, np.timedelta64, string, or integer]
unit [str, optional] Denote the unit of the input, if input is an integer. Default ‘ns’.
Possible values: {'Y', 'M', 'W', 'D', 'days', 'day', 'hours', 'hour', 'hr', 'h', 'm',
'minute', 'min', 'minutes', 'T', 'S', 'seconds', 'sec', 'second', 'ms', 'milliseconds',
'millisecond', 'milli', 'millis', 'L', 'us', 'microseconds', 'microsecond', 'micro',
'micros', 'U', 'ns', 'nanoseconds', 'nano', 'nanos', 'nanosecond', 'N'}
**kwargs Available kwargs: {days, seconds, microseconds, milliseconds, minutes,
hours, weeks}. Values for construction in compat with datetime.timedelta.
Numpy ints and floats will be coerced to python ints and floats.
Notes
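The constructor accepts a string, a number with a unit, or keyword components; a brief sketch (example values only, assuming pandas is imported as pd):
>>> pd.Timedelta('2 days 3 hours')
Timedelta('2 days 03:00:00')
>>> pd.Timedelta(5, unit='s')
Timedelta('0 days 00:00:05')
>>> pd.Timedelta(days=2, hours=3)
Timedelta('2 days 03:00:00')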
Attributes
pandas.Timedelta.asm8
Timedelta.asm8
Return a numpy timedelta64 array scalar view.
Provides access to the array scalar view (i.e. a combination of the value and the units) associated with the Timedelta.
Examples
pandas.Timedelta.components
Timedelta.components
Return a components namedtuple-like.
pandas.Timedelta.days
Timedelta.days
Number of days.
pandas.Timedelta.delta
Timedelta.delta
Return the timedelta in nanoseconds (ns), for internal compatibility.
Returns
int Timedelta in nanoseconds.
Examples
pandas.Timedelta.microseconds
Timedelta.microseconds
Number of microseconds (>= 0 and less than 1 second).
pandas.Timedelta.nanoseconds
Timedelta.nanoseconds
Return the number of nanoseconds (n), where 0 <= n < 1 microsecond.
Returns
int Number of nanoseconds.
See also:
Timedelta.components Return all attributes with assigned values (i.e. days, hours, minutes,
seconds, milliseconds, microseconds, nanoseconds).
Examples
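For example (string input chosen for illustration):
>>> td = pd.Timedelta('1 days 2 min 3 us 42 ns')
>>> td.nanoseconds
42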
pandas.Timedelta.resolution
Timedelta.resolution
Return a string representing the lowest timedelta resolution.
Each timedelta has a defined resolution that represents the lowest OR most granular level of
precision. Each level of resolution is represented by a short string as defined below:
Resolution: Return value
• Days: ‘D’
• Hours: ‘H’
• Minutes: ‘T’
• Seconds: ‘S’
• Milliseconds: ‘L’
• Microseconds: ‘U’
• Nanoseconds: ‘N’
Returns
str Timedelta resolution.
Examples
pandas.Timedelta.resolution_string
Timedelta.resolution_string
Return a string representing the lowest timedelta resolution.
Each timedelta has a defined resolution that represents the lowest OR most granular level of
precision. Each level of resolution is represented by a short string as defined below:
Resolution: Return value
• Days: ‘D’
• Hours: ‘H’
• Minutes: ‘T’
• Seconds: ‘S’
• Milliseconds: ‘L’
• Microseconds: ‘U’
• Nanoseconds: ‘N’
Returns
str Timedelta resolution.
Examples
pandas.Timedelta.seconds
Timedelta.seconds
Number of seconds (>= 0 and less than 1 day).
freq
is_populated
value
Methods
pandas.Timedelta.ceil
Timedelta.ceil(self, freq)
Return a new Timedelta ceiled to this resolution.
Parameters
freq [a freq string indicating the ceiling resolution]
pandas.Timedelta.floor
Timedelta.floor(self, freq)
Return a new Timedelta floored to this resolution.
Parameters
freq [a freq string indicating the flooring resolution]
pandas.Timedelta.isoformat
Timedelta.isoformat()
Format the Timedelta as an ISO 8601 Duration string like P[n]Y[n]M[n]DT[n]H[n]M[n]S, where the [n]s
are replaced by the values. See https://en.wikipedia.org/wiki/ISO_8601#Durations
New in version 0.20.0.
Returns
formatted [str]
See also:
Timestamp.isoformat
Notes
The longest component is days, whose value may be larger than 365. Every component is always
included, even if its value is 0. Pandas uses nanosecond precision, so up to 9 decimal places may
be included in the seconds component. Trailing 0’s are removed from the seconds component
after the decimal. We do not 0 pad components, so it’s …T5H…, not …T05H…
Examples
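An illustrative duration (example values):
>>> td = pd.Timedelta(days=6, minutes=50, seconds=3,
...                   milliseconds=10, microseconds=10, nanoseconds=12)
>>> td.isoformat()
'P6DT0H50M3.010010012S'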
pandas.Timedelta.round
Timedelta.round(self, freq)
Round the Timedelta to the specified resolution
Parameters
freq [a freq string indicating the rounding resolution]
Returns
a new Timedelta rounded to the given resolution of freq
Raises
ValueError if the freq cannot be converted
pandas.Timedelta.to_numpy
Timedelta.to_numpy()
Convert the Timedelta to a NumPy timedelta64.
New in version 0.25.0.
This is an alias method for Timedelta.to_timedelta64(). The dtype and copy parameters are
available here only for compatibility. Their values will not affect the return value.
Returns
numpy.timedelta64
See also:
pandas.Timedelta.to_pytimedelta
Timedelta.to_pytimedelta()
Convert a pandas Timedelta object into a python timedelta object.
Timedelta objects are internally saved as numpy timedelta64[ns] dtype. Use to_pytimedelta() to
convert to object dtype.
Returns
Notes
pandas.Timedelta.to_timedelta64
Timedelta.to_timedelta64()
Return a numpy.timedelta64 object with ‘ns’ precision.
pandas.Timedelta.total_seconds
Timedelta.total_seconds()
Total duration of timedelta in seconds (to ns precision).
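For example (illustrative value):
>>> pd.Timedelta('1 days 2 hours').total_seconds()
93600.0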
pandas.Timedelta.view
Timedelta.view()
Array view compatibility.
Properties
pandas.Timedelta.freq
Timedelta.freq
pandas.Timedelta.is_populated
Timedelta.is_populated
pandas.Timedelta.max
pandas.Timedelta.min
pandas.Timedelta.value
Timedelta.value
Methods
pandas.arrays.TimedeltaArray
Warning: TimedeltaArray is currently experimental, and its API may change without warning. In
particular, TimedeltaArray.dtype is expected to change to be an instance of an ExtensionDtype
subclass.
Parameters
values [array-like] The timedelta data.
dtype [numpy.dtype] Currently, only numpy.dtype("timedelta64[ns]") is ac-
cepted.
freq [Offset, optional]
copy [bool, default False] Whether to copy the underlying array of data.
Attributes
None
Methods
None
6.5.5 Period
pandas.Period
class pandas.Period
Represents a period of time
Parameters
value [Period or str, default None] The time period represented (e.g., ‘4Q2005’)
freq [str, default None] One of pandas period strings or corresponding objects
year [int, default None]
month [int, default 1]
quarter [int, default None]
day [int, default 1]
hour [int, default 0]
minute [int, default 0]
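For instance (example periods; values chosen for illustration):
>>> pd.Period('2012-05', freq='M')
Period('2012-05', 'M')
>>> pd.Period('2012-05-01', freq='D')
Period('2012-05-01', 'D')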
Attributes
pandas.Period.day
Period.day
Get day of the month that a Period falls on.
Returns
int
See also:
Examples
pandas.Period.dayofweek
Period.dayofweek
Day of the week the period lies in, with Monday=0 and Sunday=6.
If the period frequency is lower than daily (e.g. hourly), and the period spans over multiple days,
the day at the start of the period is used.
If the frequency is higher than daily (e.g. monthly), the last day of the period is used.
Returns
int Day of the week.
See also:
Examples
For periods that span over multiple days, the day at the beginning of the period is returned.
For periods with a frequency higher than days, the last day of the period is returned.
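For example, an hourly period falling on a Sunday (illustrative value):
>>> per = pd.Period('2017-12-31 22:00', 'H')
>>> per.dayofweek
6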
pandas.Period.dayofyear
Period.dayofyear
Return the day of the year.
This attribute returns the day of the year on which the particular date occurs. The return value
ranges between 1 to 365 for regular years and 1 to 366 for leap years.
Returns
int The day of year.
See also:
Examples
pandas.Period.days_in_month
Period.days_in_month
Get the total number of days in the month that this period falls on.
Returns
int
See also:
Examples
>>> p = pd.Period('2018-2-17')
>>> p.days_in_month
28
>>> pd.Period('2018-03-01').days_in_month
31
>>> p = pd.Period('2016-2-17')
>>> p.days_in_month
29
pandas.Period.daysinmonth
Period.daysinmonth
Get the total number of days of the month that the Period falls in.
Returns
int
See also:
Examples
pandas.Period.hour
Period.hour
Get the hour of the day component of the Period.
Returns
int The hour as an integer, between 0 and 23.
See also:
Examples
pandas.Period.minute
Period.minute
Get minute of the hour component of the Period.
Returns
int The minute as an integer, between 0 and 59.
See also:
Examples
pandas.Period.qyear
Period.qyear
Fiscal year the Period lies in according to its starting-quarter.
The year and the qyear of the period will be the same if the fiscal and calendar years are the
same. When they are not, the fiscal year can be different from the calendar year of the period.
Returns
int The fiscal year of the period.
See also:
Examples
If the natural and fiscal year are the same, qyear and year will be the same.
If the fiscal year starts in April (Q-MAR), the first quarter of 2018 will start in April 2017. year
will then be 2017, but qyear will be the fiscal year, 2018.
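An illustrative fiscal-quarter example (values chosen for illustration):
>>> per = pd.Period('2018Q1', freq='Q-MAR')
>>> per.start_time
Timestamp('2017-04-01 00:00:00')
>>> per.qyear
2018
>>> per.year
2017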
pandas.Period.second
Period.second
Get the second component of the Period.
Returns
int The second of the Period (ranges from 0 to 59).
See also:
Examples
pandas.Period.start_time
Period.start_time
Get the Timestamp for the start of the period.
Returns
Timestamp
See also:
Examples
>>> period = pd.Period('2012-01-01', freq='D')
>>> period.start_time
Timestamp('2012-01-01 00:00:00')
>>> period.end_time
Timestamp('2012-01-01 23:59:59.999999999')
pandas.Period.week
Period.week
Get the week of the year on the given Period.
Returns
int
See also:
Examples
pandas.Period.weekday
Period.weekday
Day of the week the period lies in, with Monday=0 and Sunday=6.
If the period frequency is lower than daily (e.g. hourly), and the period spans over multiple days,
the day at the start of the period is used.
If the frequency is higher than daily (e.g. monthly), the last day of the period is used.
Returns
int Day of the week.
See also:
Examples
For periods that span over multiple days, the day at the beginning of the period is returned.
For periods with a frequency higher than days, the last day of the period is returned.
end_time
freq
freqstr
is_leap_year
month
ordinal
quarter
weekofyear
year
Methods
pandas.Period.asfreq
Period.asfreq()
Convert Period to desired frequency, either at the start or end of the interval
Parameters
freq [string]
how [{‘E’, ‘S’, ‘end’, ‘start’}, default ‘end’] Start or end of the timespan
Returns
resampled [Period]
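For example (illustrative annual period):
>>> p = pd.Period('2017', freq='A')
>>> p.asfreq('M')
Period('2017-12', 'M')
>>> p.asfreq('M', how='start')
Period('2017-01', 'M')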
pandas.Period.strftime
Period.strftime()
Returns the string representation of the Period, depending on the selected fmt. fmt must be
a string containing one or several directives. The method recognizes the same directives as the
time.strftime() function of the standard Python distribution, as well as the specific additional
directives %f, %F, %q. (formatting & docs originally from scikits.timeries)
Notes
(1) The %f directive is the same as %y if the frequency is not quarterly. Otherwise, it corresponds
to the ‘fiscal’ year, as defined by the qyear attribute.
(2) The %F directive is the same as %Y if the frequency is not quarterly. Otherwise, it corresponds
to the ‘fiscal’ year, as defined by the qyear attribute.
(3) The %p directive only affects the output hour field if the %I directive is used to parse the
hour.
(4) The range really is 0 to 61; this accounts for leap seconds and the (very rare) double leap
seconds.
(5) The %U and %W directives are only used in calculations when the day of the week and the
year are specified.
Examples
pandas.Period.to_timestamp
Period.to_timestamp()
Return the Timestamp representation of the Period at the target frequency at the specified end
(how) of the Period
Parameters
freq [string or DateOffset] Target frequency. Default is ‘D’ if self.freq is week or
longer and ‘S’ otherwise
how [str, default ‘S’ (start)] ‘S’, ‘E’. Can be aliased as case insensitive ‘Start’,
‘Finish’, ‘Begin’, ‘End’
Returns
Timestamp
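For example (illustrative monthly period; the default how='S' gives the start):
>>> p = pd.Period('2017-01', freq='M')
>>> p.to_timestamp()
Timestamp('2017-01-01 00:00:00')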
now
Properties
pandas.Period.end_time
Period.end_time
pandas.Period.freq
Period.freq
pandas.Period.freqstr
Period.freqstr
pandas.Period.is_leap_year
Period.is_leap_year
pandas.Period.month
Period.month
pandas.Period.ordinal
Period.ordinal
pandas.Period.quarter
Period.quarter
pandas.Period.weekofyear
Period.weekofyear
pandas.Period.year
Period.year
Methods
pandas.Period.now
Period.now()
A collection of periods may be stored in an arrays.PeriodArray. Every period in a PeriodArray must
have the same freq.
arrays.PeriodArray(values[, freq, dtype, copy]) Pandas ExtensionArray for storing Period data.
pandas.arrays.PeriodArray
Notes
Attributes
None
Methods
None
pandas.PeriodDtype
class pandas.PeriodDtype
An ExtensionDtype for Period data.
This is not an actual numpy dtype, but a duck type.
Parameters
freq [str or DateOffset] The frequency of this PeriodDtype
Examples
>>> pd.PeriodDtype(freq='D')
period[D]
>>> pd.PeriodDtype(freq=pd.offsets.MonthEnd())
period[M]
Attributes
pandas.PeriodDtype.freq
PeriodDtype.freq
The frequency object of this PeriodDtype.
Methods
None
pandas.Interval
class pandas.Interval
Immutable object implementing an Interval, a bounded slice-like interval.
New in version 0.20.0.
Parameters
left [orderable scalar] Left bound for the interval.
right [orderable scalar] Right bound for the interval.
closed [{‘right’, ‘left’, ‘both’, ‘neither’}, default ‘right’] Whether the interval is closed
on the left-side, right-side, both or neither. See the Notes for more detailed
explanation.
See also:
IntervalIndex An Index of Interval objects that are all closed on the same side.
cut Convert continuous data into discrete bins (Categorical of Interval objects).
qcut Convert continuous data into bins (Categorical of Interval objects) based on quantiles.
Period Represents a period of time.
Notes
The parameters left and right must be from the same type, you must be able to compare them and
they must satisfy left <= right.
A closed interval (in mathematics denoted by square brackets) contains its endpoints, i.e. the closed
interval [0, 5] is characterized by the conditions 0 <= x <= 5. This is what closed='both' stands
for. An open interval (in mathematics denoted by parentheses) does not contain its endpoints, i.e. the
open interval (0, 5) is characterized by the conditions 0 < x < 5. This is what closed='neither'
stands for. Intervals can also be half-open or half-closed, i.e. [0, 5) is described by 0 <= x < 5
(closed='left') and (0, 5] is described by 0 < x <= 5 (closed='right').
Examples
>>> iv = pd.Interval(left=0, right=5)
>>> iv
Interval(0, 5, closed='right')
>>> 2.5 in iv
True
>>> 0 in iv
False
>>> 5 in iv
True
>>> 0.0001 in iv
True
>>> iv.length
5
You can operate with + and * over an Interval and the operation is applied to each of its bounds, so
the result depends on the type of the bound elements
>>> shifted_iv = iv + 3
>>> shifted_iv
Interval(3, 8, closed='right')
>>> extended_iv = iv * 10.0
>>> extended_iv
Interval(0.0, 50.0, closed='right')
Attributes
pandas.Interval.closed
Interval.closed
Whether the interval is closed on the left-side, right-side, both or neither
pandas.Interval.closed_left
Interval.closed_left
Check if the interval is closed on the left side.
For the meaning of closed and open see Interval.
Returns
bool True if the Interval is closed on the left-side, else False.
pandas.Interval.closed_right
Interval.closed_right
Check if the interval is closed on the right side.
For the meaning of closed and open see Interval.
Returns
bool True if the Interval is closed on the left-side, else False.
pandas.Interval.is_empty
Interval.is_empty
Indicates if an interval is empty, meaning it contains no points.
New in version 0.25.0.
Returns
bool or ndarray A boolean indicating if a scalar Interval is empty, or a
boolean ndarray positionally indicating if an Interval in an IntervalArray
or IntervalIndex is empty.
Examples
pandas.Interval.left
Interval.left
Left bound for the interval
pandas.Interval.length
Interval.length
Return the length of the Interval
pandas.Interval.mid
Interval.mid
Return the midpoint of the Interval
pandas.Interval.open_left
Interval.open_left
Check if the interval is open on the left side.
For the meaning of closed and open see Interval.
Returns
bool True if the Interval is closed on the left-side, else False.
pandas.Interval.open_right
Interval.open_right
Check if the interval is open on the right side.
For the meaning of closed and open see Interval.
Returns
bool True if the Interval is closed on the left-side, else False.
pandas.Interval.right
Interval.right
Right bound for the interval
Methods
pandas.Interval.overlaps
Interval.overlaps()
Check whether two Interval objects overlap.
Two intervals overlap if they share a common point, including closed endpoints. Intervals that
only have an open endpoint in common do not overlap.
New in version 0.24.0.
Parameters
other [Interval] The interval to check against for an overlap.
Returns
bool True if the two intervals overlap, else False.
See also:
Examples
>>> i1 = pd.Interval(0, 2)
>>> i2 = pd.Interval(1, 3)
>>> i1.overlaps(i2)
True
>>> i3 = pd.Interval(4, 5)
>>> i1.overlaps(i3)
False
Properties
pandas.arrays.IntervalArray
class pandas.arrays.IntervalArray
Pandas array for interval data that are closed on the same side.
New in version 0.24.0.
Parameters
data [array-like (1-dimensional)] Array-like containing Interval objects from which to
build the IntervalArray.
closed [{‘left’, ‘right’, ‘both’, ‘neither’}, default ‘right’] Whether the intervals are
closed on the left-side, right-side, both or neither.
dtype [dtype or None, default None] If None, dtype will be inferred.
New in version 0.23.0.
copy [bool, default False] Copy the input data.
verify_integrity [bool, default True] Verify that the IntervalArray is valid.
See also:
Notes
Examples
Attributes
pandas.arrays.IntervalArray.left
IntervalArray.left
Return the left endpoints of each Interval in the IntervalArray as an Index
pandas.arrays.IntervalArray.right
IntervalArray.right
Return the right endpoints of each Interval in the IntervalArray as an Index
pandas.arrays.IntervalArray.closed
IntervalArray.closed
Whether the intervals are closed on the left-side, right-side, both or neither
pandas.arrays.IntervalArray.mid
IntervalArray.mid
Return the midpoint of each Interval in the IntervalArray as an Index
pandas.arrays.IntervalArray.length
IntervalArray.length
Return an Index with entries denoting the length of each Interval in the IntervalArray
pandas.arrays.IntervalArray.is_empty
IntervalArray.is_empty
Indicates if an interval is empty, meaning it contains no points.
New in version 0.25.0.
Returns
bool or ndarray A boolean indicating if a scalar Interval is empty, or a
boolean ndarray positionally indicating if an Interval in an IntervalArray
or IntervalIndex is empty.
Examples
pandas.arrays.IntervalArray.is_non_overlapping_monotonic
IntervalArray.is_non_overlapping_monotonic
Return True if the IntervalArray is non-overlapping (no Intervals share points) and is either
monotonic increasing or monotonic decreasing, else False
Methods
from_arrays(left, right[, closed, copy, dtype]) Construct from two arrays defining the left and
right bounds.
from_tuples(data[, closed, copy, dtype]) Construct an IntervalArray from an array-like of
tuples
from_breaks(breaks[, closed, copy, dtype]) Construct an IntervalArray from an array of
splits.
contains(self, other) Check elementwise if the Intervals contain the
value.
pandas.arrays.IntervalArray.from_arrays
Notes
Each element of left must be less than or equal to the right element at the same position. If an
element is missing, it must be missing in both left and right. A TypeError is raised when using
an unsupported type for left or right. At the moment, ‘category’, ‘object’, and ‘string’ subtypes
are not supported.
Examples
pandas.arrays.IntervalArray.from_tuples
Examples
pandas.arrays.IntervalArray.from_breaks
Examples
pandas.arrays.IntervalArray.contains
IntervalArray.contains(self, other)
Check elementwise if the Intervals contain the value.
Return a boolean mask whether the value is contained in the Intervals of the IntervalArray.
New in version 0.25.0.
Parameters
other [scalar] The value to check whether it is contained in the Intervals.
Returns
boolean array
See also:
Examples
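A sketch of elementwise containment (example intervals):
>>> arr = pd.arrays.IntervalArray.from_tuples([(0, 1), (1, 3), (2, 4)])
>>> arr.contains(0.5)
array([ True, False, False])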
pandas.arrays.IntervalArray.overlaps
IntervalArray.overlaps(self, other)
Check elementwise if an Interval overlaps the values in the IntervalArray.
Two intervals overlap if they share a common point, including closed endpoints. Intervals that
only have an open endpoint in common do not overlap.
New in version 0.24.0.
Parameters
other [Interval] Interval to check against for an overlap.
Returns
ndarray Boolean array positionally indicating where an overlap occurs.
See also:
Examples
pandas.arrays.IntervalArray.set_closed
IntervalArray.set_closed(self, closed)
Return an IntervalArray identical to the current one, but closed on the specified side
New in version 0.24.0.
Parameters
closed [{‘left’, ‘right’, ‘both’, ‘neither’}] Whether the intervals are closed on the
left-side, right-side, both or neither.
Returns
new_index [IntervalArray]
Examples
pandas.arrays.IntervalArray.to_tuples
IntervalArray.to_tuples(self, na_tuple=True)
Return an ndarray of tuples of the form (left, right)
Parameters
na_tuple [boolean, default True] Returns NA as a tuple if True, (nan, nan), or
just as the NA value itself if False, nan.
New in version 0.23.0.
Returns
tuples: ndarray
pandas.IntervalDtype
class pandas.IntervalDtype
An ExtensionDtype for Interval data.
This is not an actual numpy dtype, but a duck type.
Parameters
subtype [str, np.dtype] The dtype of the Interval bounds.
Examples
>>> pd.IntervalDtype(subtype='int64')
interval[int64]
Attributes
pandas.IntervalDtype.subtype
IntervalDtype.subtype
The dtype of the Interval bounds.
Methods
None
numpy.ndarray cannot natively represent integer data with missing values. Pandas provides this through
arrays.IntegerArray.
pandas.arrays.IntegerArray
Warning: IntegerArray is currently experimental, and its API or internal implementation may
change without warning.
Examples
String aliases for the dtypes are also available. They are capitalized.
Attributes
None
Methods
None
pandas.Int8Dtype
class pandas.Int8Dtype
An ExtensionDtype for int8 integer data.
Attributes
None
Methods
None
pandas.Int16Dtype
class pandas.Int16Dtype
An ExtensionDtype for int16 integer data.
Attributes
None
Methods
None
pandas.Int32Dtype
class pandas.Int32Dtype
An ExtensionDtype for int32 integer data.
Attributes
None
Methods
None
pandas.Int64Dtype
class pandas.Int64Dtype
An ExtensionDtype for int64 integer data.
Attributes
None
Methods
None
pandas.UInt8Dtype
class pandas.UInt8Dtype
An ExtensionDtype for uint8 integer data.
Attributes
None
Methods
None
pandas.UInt16Dtype
class pandas.UInt16Dtype
An ExtensionDtype for uint16 integer data.
Attributes
None
Methods
None
pandas.UInt32Dtype
class pandas.UInt32Dtype
An ExtensionDtype for uint32 integer data.
Attributes
None
Methods
None
pandas.UInt64Dtype
class pandas.UInt64Dtype
An ExtensionDtype for uint64 integer data.
Attributes
None
Methods
None
Pandas defines a custom data type for representing data that can take only a limited, fixed set of values.
The dtype of a Categorical can be described by a pandas.api.types.CategoricalDtype.
pandas.CategoricalDtype
Categorical
Notes
This class is useful for specifying the type of a Categorical independent of the values. See
CategoricalDtype for more.
Examples
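For instance (example categories; output shown as rendered by a Series of this dtype):
>>> t = pd.CategoricalDtype(categories=['b', 'a'], ordered=True)
>>> pd.Series(['a', 'b', 'a', 'c'], dtype=t)
0      a
1      b
2      a
3    NaN
dtype: category
Categories (2, object): [b < a]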
Attributes
pandas.CategoricalDtype.categories
CategoricalDtype.categories
An Index containing the unique categories allowed.
pandas.CategoricalDtype.ordered
CategoricalDtype.ordered
Whether the categories have an ordered relationship.
Methods
None
pandas.Categorical
Categoricals can take on only a limited, and usually fixed, number of possible values (categories). In
contrast to statistical categorical variables, a Categorical might have an order, but numerical operations
(additions, divisions, …) are not possible.
All values of the Categorical are either in categories or np.nan. Assigning values outside of categories
will raise a ValueError. Order is defined by the order of the categories, not lexical order of the values.
Parameters
values [list-like] The values of the categorical. If categories are given, values not in
categories will be replaced with NaN.
categories [Index-like (unique), optional] The unique categories for this categorical.
If not given, the categories are assumed to be the unique values of values (sorted,
if possible, otherwise in the order in which they appear).
ordered [bool, default False] Whether or not this categorical is treated as an ordered
categorical. If True, the resulting categorical will be ordered. An ordered categorical
respects, when sorted, the order of its categories attribute (which in turn is the
categories argument, if provided).
dtype [CategoricalDtype] An instance of CategoricalDtype to use for this categor-
ical
New in version 0.21.0.
Raises
ValueError If the categories do not validate.
TypeError If an explicit ordered=True is given but no categories and the values are
not sortable.
See also:
Notes
Examples
Ordered Categoricals can be sorted according to the custom order of the categories and can have a min
and max value.
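For example (illustrative ordered categorical):
>>> c = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'], ordered=True,
...                    categories=['c', 'b', 'a'])
>>> c
[a, b, c, a, b, c]
Categories (3, object): [c < b < a]
>>> c.min()
'c'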
Attributes
pandas.Categorical.categories
Categorical.categories
The categories of this categorical.
Setting assigns new values to each category (effectively a rename of each individual category).
The assigned value has to be a list-like object. All items must be unique and the number of items
in the new categories must be the same as the number of items in the old categories.
Assigning to categories is an in-place operation!
Raises
ValueError If the new categories do not validate as categories or if the number
of new categories is unequal the number of old categories
See also:
rename_categories
reorder_categories
add_categories
remove_categories
remove_unused_categories
set_categories
pandas.Categorical.codes
Categorical.codes
The category codes of this categorical.
Level codes are an array of integers which are the positions of the real values in the categories
array.
There is no setter; use the other categorical methods and the normal item setter to change values
in the categorical.
pandas.Categorical.ordered
Categorical.ordered
Whether the categories have an ordered relationship.
pandas.Categorical.dtype
Categorical.dtype
The CategoricalDtype for this instance
Methods
from_codes(codes[, categories, ordered, dtype]) Make a Categorical type from codes and categories or dtype.
__array__(self[, dtype]) The numpy array interface.
pandas.Categorical.from_codes
Examples
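For instance (example codes and dtype):
>>> dtype = pd.CategoricalDtype(['a', 'b'], ordered=True)
>>> pd.Categorical.from_codes(codes=[0, 1, 0, 1], dtype=dtype)
[a, b, a, b]
Categories (2, object): [a < b]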
pandas.Categorical.__array__
Categorical.__array__(self, dtype=None)
The numpy array interface.
Returns
numpy.array A numpy array of either the specified dtype or, if dtype==None
(default), the same dtype as categorical.categories.dtype.
The alternative Categorical.from_codes() constructor can be used when you have the categories and
integer codes already:
Categorical.from_codes(codes[, categories, …]) Make a Categorical type from codes and categories
or dtype.
np.asarray(categorical) works by implementing the array interface. Be aware that this converts the
Categorical back to a NumPy array, so categories and order information is not preserved!
A Categorical can be stored in a Series or DataFrame. To create a Series of dtype category, use cat =
s.astype(dtype) or Series(..., dtype=dtype) where dtype is either
• the string 'category'
• an instance of CategoricalDtype.
If the Series is of dtype CategoricalDtype, Series.cat can be used to change the categorical data. See
Categorical accessor for more.
Data where a single value is repeated many times (e.g. 0 or NaN) may be stored efficiently as a SparseArray.
pandas.SparseArray
data.dtype na_value
float np.nan
int 0
bool False
datetime64 pd.NaT
timedelta64 pd.NaT
The fill value is potentially specified in three ways. In order of precedence, these
are
1. The fill_value argument
2. dtype.fill_value if fill_value is None and dtype is a SparseDtype
3. data.dtype.fill_value if fill_value is None and dtype is not a SparseDtype
and data is a SparseArray.
kind [{‘integer’, ‘block’}, default ‘integer’] The type of storage for sparse locations.
• ‘block’: Stores a block and block_length for each contiguous span of sparse
values. This is best when sparse data tends to be clumped together, with
large regions of fill-value values between sparse values.
• ‘integer’: uses an integer to store the location of each sparse value.
dtype [np.dtype or SparseDtype, optional] The dtype to use for the SparseArray. For
numpy dtypes, this determines the dtype of self.sp_values. For SparseDtype,
this determines self.sp_values and self.fill_value.
copy [bool, default False] Whether to explicitly copy the incoming data array.
Attributes
None
Methods
None
pandas.SparseDtype
dtype na_value
float np.nan
int 0
bool False
datetime64 pd.NaT
timedelta64 pd.NaT
Attributes
None
Methods
None
The Series.sparse accessor may be used to access sparse-specific attributes and methods if the Series
contains sparse values. See Sparse accessor for more.
6.6 Panel
Panel was removed in 0.25.0. For prior documentation, see the 0.24 documentation
6.7.1 Index
Many of these methods or variants thereof are available on the objects that contain an index
(Series/DataFrame) and those should most likely be used before calling these methods directly.
pandas.Index
class pandas.Index
Immutable ndarray implementing an ordered, sliceable set. The basic object storing axis labels for all
pandas objects.
Parameters
data [array-like (1-dimensional)]
dtype [NumPy dtype (default: object)] If dtype is None, we find the dtype that best
fits the data. If an actual dtype is provided, we coerce to that dtype if it’s safe.
Otherwise, an error will be raised.
copy [bool] Make a copy of input ndarray
name [object] Name to be stored in the index
tupleize_cols [bool (default: True)] When True, attempt to create a MultiIndex if
possible
See also:
Notes
Examples
>>> pd.Index(list('abc'))
Index(['a', 'b', 'c'], dtype='object')
Attributes
pandas.Index.T
Index.T
Return the transpose, which is by definition self.
pandas.Index.array
Index.array
The ExtensionArray of the data backing this Series or Index.
New in version 0.24.0.
Returns
ExtensionArray An ExtensionArray of the values stored within. For extension
types, this is the actual array. For NumPy native types, this is a thin (no copy)
wrapper around numpy.ndarray.
.array differs from .values, which may require converting the data to a different form.
See also:
Notes
This table lays out the different array types for each extension dtype within pandas.
For any 3rd-party extension types, the array type will be an ExtensionArray.
For all remaining dtypes .array will be a arrays.NumpyExtensionArray wrapping the actual
ndarray stored within. If you absolutely need a NumPy array (possibly with copying / coercing
data), then use Series.to_numpy() instead.
Examples
For regular NumPy types like int and float, a PandasArray is returned.
pandas.Index.asi8
Index.asi8
Integer representation of the values.
Returns
ndarray An ndarray with int64 dtype.
pandas.Index.base
Index.base
Return the base object if the memory of the underlying data is shared.
Deprecated since version 0.23.0.
pandas.Index.data
Index.data
Return the data pointer of the underlying data.
Deprecated since version 0.23.0.
pandas.Index.dtype
Index.dtype
Return the dtype object of the underlying data.
pandas.Index.dtype_str
Index.dtype_str
Return the dtype str of the underlying data.
Deprecated since version 0.25.0.
pandas.Index.flags
Index.flags
pandas.Index.hasnans
Index.hasnans
Return True if there are any NaNs; this enables various performance speedups.
pandas.Index.inferred_type
Index.inferred_type
Return a string of the type inferred from the values.
pandas.Index.is_monotonic
Index.is_monotonic
Alias for is_monotonic_increasing.
pandas.Index.is_monotonic_decreasing
Index.is_monotonic_decreasing
Return if the index is monotonic decreasing (only equal or decreasing) values.
Examples
pandas.Index.is_monotonic_increasing
Index.is_monotonic_increasing
Return if the index is monotonic increasing (only equal or increasing) values.
Examples
pandas.Index.is_unique
Index.is_unique
Return if the index has unique values.
pandas.Index.itemsize
Index.itemsize
Return the size of the dtype of the item of the underlying data.
Deprecated since version 0.23.0.
pandas.Index.nbytes
Index.nbytes
Return the number of bytes in the underlying data.
pandas.Index.ndim
Index.ndim
Number of dimensions of the underlying data, by definition 1.
pandas.Index.nlevels
Index.nlevels
Number of levels.
pandas.Index.shape
Index.shape
Return a tuple of the shape of the underlying data.
pandas.Index.size
Index.size
Return the number of elements in the underlying data.
pandas.Index.strides
Index.strides
Return the strides of the underlying data.
Deprecated since version 0.23.0.
pandas.Index.values
Index.values
Return an array representing the data in the Index.
Returns
array: numpy.ndarray or ExtensionArray
See also:
empty
has_duplicates
is_all_dates
name
names
Methods
pandas.Index.all
Index.all(self, *args, **kwargs)
Return whether all elements are True. A single element array-like result may be converted to bool.
See also:
Notes
Not a Number (NaN), positive infinity and negative infinity evaluate to True because these are
not equal to zero.
Examples
all
True, because nonzero integers are considered True.
>>> pd.Index([1, 2, 3]).all()
True
any
True, because 1 is considered True.
>>> pd.Index([0, 0, 1]).any()
True
pandas.Index.any
Index.any(self, *args, **kwargs)
Return whether any element is True.
See also:
Notes
Not a Number (NaN), positive infinity and negative infinity evaluate to True because these are
not equal to zero.
Examples
pandas.Index.append
Index.append(self, other)
Append a collection of Index options together.
Parameters
other [Index or list/tuple of indices]
Returns
appended [Index]
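For example (illustrative indexes):
>>> idx = pd.Index([1, 2, 3])
>>> idx.append(pd.Index([4, 5]))
Int64Index([1, 2, 3, 4, 5], dtype='int64')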
pandas.Index.argmax
numpy.ndarray.argmax
pandas.Index.argmin
numpy.ndarray.argmin
pandas.Index.argsort
Examples
>>> idx = pd.Index(['b', 'a', 'd', 'c'])
>>> order = idx.argsort()
>>> order
array([1, 0, 3, 2])
>>> idx[order]
Index(['a', 'b', 'c', 'd'], dtype='object')
pandas.Index.asof
Index.asof(self, label)
Return the label from the index, or, if not present, the previous one.
Assuming that the index is sorted, return the passed index label if it is in the index, or return
the previous index label if the passed one is not in the index.
Parameters
label [object] The label up to which the method returns the latest index label.
Returns
object The passed label if it is in the index. The previous label if the passed label
is not in the sorted index or NaN if there is no such label.
See also:
Examples
>>> idx = pd.Index(['2013-12-31', '2014-01-02', '2014-01-03'])
If the label is in the index, the method returns the passed label.
>>> idx.asof('2014-01-02')
'2014-01-02'
If all of the labels in the index are later than the passed label, NaN is returned.
>>> idx.asof('1999-01-02')
nan
pandas.Index.asof_locs
As in the asof function, if the label (a particular entry in where) is not in the index, the latest
index label up to the passed label is chosen and its index returned.
If all of the labels in the index are later than a label in where, -1 is returned.
mask is used to ignore NA values in the index during calculation.
Parameters
where [Index] An Index consisting of an array of timestamps.
mask [array-like] Array of booleans denoting where values in the original data are
not NA.
Returns
numpy.ndarray An array of locations (indices) of the labels from the Index which
correspond to the return values of the asof function for every element in where.
pandas.Index.astype
pandas.Index.contains
Index.contains(self, key)
Return a boolean indicating whether the provided key is in the index.
Deprecated since version 0.25.0: Use key in index instead of index.contains(key).
Returns
bool
pandas.Index.copy
Notes
In most cases, there should be no functional difference from using deep, but if deep is passed it
will attempt to deepcopy.
pandas.Index.delete
Index.delete(self, loc)
Make new Index with passed location(-s) deleted.
Returns
new_index [Index]
pandas.Index.difference
Examples
pandas.Index.drop
pandas.Index.drop_duplicates
Index.drop_duplicates(self, keep=’first’)
Return Index with duplicate values removed.
Parameters
keep [{‘first’, ‘last’, False}, default ‘first’]
• ‘first’ : Drop duplicates except for the first occurrence.
• ‘last’ : Drop duplicates except for the last occurrence.
• False : Drop all duplicates.
Returns
deduplicated [Index]
See also:
Examples
The keep parameter controls which duplicate values are removed. The value ‘first’ keeps the first
occurrence for each set of duplicated entries. The default value of keep is ‘first’.
>>> idx = pd.Index(['lama', 'cow', 'lama', 'beetle', 'lama', 'hippo'])
>>> idx.drop_duplicates(keep='first')
Index(['lama', 'cow', 'beetle', 'hippo'], dtype='object')
The value ‘last’ keeps the last occurrence for each set of duplicated entries.
>>> idx.drop_duplicates(keep='last')
Index(['cow', 'beetle', 'lama', 'hippo'], dtype='object')
>>> idx.drop_duplicates(keep=False)
Index(['cow', 'beetle', 'hippo'], dtype='object')
pandas.Index.droplevel
Index.droplevel(self, level=0)
Return index with requested level(s) removed.
If resulting index has only 1 level left, the result will be of Index type, not MultiIndex.
New in version 0.23.1: (support for non-MultiIndex)
Parameters
level [int, str, or list-like, default 0] If a string is given, it must be the name of a
level. If list-like, elements must be names or indexes of levels.
Returns
Index or MultiIndex
pandas.Index.dropna
Index.dropna(self, how=’any’)
Return Index without NA/NaN values
Parameters
how [{‘any’, ‘all’}, default ‘any’] If the Index is a MultiIndex, drop the value when
any or all levels are NaN.
Returns
valid [Index]
pandas.Index.duplicated
Index.duplicated(self, keep=’first’)
Indicate duplicate index values.
Duplicated values are indicated as True values in the resulting array. Either all duplicates, all
except the first, or all except the last occurrence of duplicates can be indicated.
Parameters
keep [{‘first’, ‘last’, False}, default ‘first’] The value or values in a set of duplicates
to mark as missing.
Examples
>>> idx = pd.Index(['lama', 'cow', 'lama', 'beetle', 'lama'])
By default, for each set of duplicated values, the first occurrence is set to False and all others to
True:
>>> idx.duplicated()
array([False, False, True, False, True])
which is equivalent to
>>> idx.duplicated(keep='first')
array([False, False, True, False, True])
By using ‘last’, the last occurrence of each set of duplicated values is set on False and all others
on True:
>>> idx.duplicated(keep='last')
array([ True, False, True, False, False])
>>> idx.duplicated(keep=False)
array([ True, False, True, False, True])
pandas.Index.equals
Index.equals(self, other)
Determine if two Index objects contain the same elements.
Returns
bool True if “other” is an Index and it has the same elements as calling index;
False otherwise.
pandas.Index.factorize
Note: Even if there’s a missing value in values, uniques will not contain an
entry for it.
See also:
Examples
These examples all show factorize as a top-level method like pd.factorize(values). The results
are identical for methods like Series.factorize().
With sort=True, the uniques will be sorted, and labels will be shuffled so that the relationship
is maintained.
Missing values are indicated in labels with na_sentinel (-1 by default). Note that missing values
are never included in uniques.
Thus far, we’ve only factorized lists (which are internally coerced to NumPy arrays). When
factorizing pandas objects, the type of uniques will differ. For Categoricals, a Categorical is
returned.
pandas.Index.fillna
pandas.Index.format
pandas.Index.get_duplicates
Index.get_duplicates(self )
Extract duplicated index elements.
Deprecated since version 0.23.0: Use idx[idx.duplicated()].unique() instead
Returns a sorted list of index elements which appear more than once in the index.
Returns
array-like List of duplicated indexes.
See also:
Examples
Note that for a DatetimeIndex, it does not return a list but a new DatetimeIndex:
pandas.Index.get_indexer
target [Index]
method [{None, ‘pad’/’ffill’, ‘backfill’/’bfill’, ‘nearest’}, optional]
• default: exact matches only.
• pad / ffill: find the PREVIOUS index value if no exact match.
• backfill / bfill: use NEXT index value if no exact match
• nearest: use the NEAREST index value if no exact match. Tied distances
are broken by preferring the larger index value.
limit [int, optional] Maximum number of consecutive labels in target to match
for inexact matches.
tolerance [optional] Maximum distance between original and new labels for inexact
matches. The values of the index at the matching locations must satisfy
the equation abs(index[indexer] - target) <= tolerance.
Tolerance may be a scalar value, which applies the same tolerance to all values,
or list-like, which applies variable tolerance per element. List-like includes list,
tuple, array, Series, and must be the same size as the index and its dtype must
exactly match the index’s type.
New in version 0.21.0: (list-like tolerance)
Returns
indexer [ndarray of int] Integers from 0 to n - 1 indicating that the index at these
positions matches the corresponding target values. Missing values in the target
are marked by -1.
Examples
Notice that the return value is an array of locations in index and x is marked by -1, as it is not
in index.
pandas.Index.get_indexer_for
pandas.Index.get_indexer_non_unique
Index.get_indexer_non_unique(self, target)
Compute indexer and mask for new index given the current index. The indexer should be then
used as an input to ndarray.take to align the current data to the new index.
Parameters
target [Index]
Returns
indexer [ndarray of int] Integers from 0 to n - 1 indicating that the index at these
positions matches the corresponding target values. Missing values in the target
are marked by -1.
missing [ndarray of int] An indexer into the target of the values not found. These
correspond to the -1 in the indexer array.
pandas.Index.get_level_values
Index.get_level_values(self, level)
Return an Index of values for requested level.
This is primarily useful to get an individual level of values from a MultiIndex, but is provided on
Index as well for compatibility.
Parameters
level [int or str] It is either the integer position or the name of the level.
Returns
Index Calling object, as there is only one level in the Index.
See also:
Notes
Examples
>>> idx = pd.Index(list('abc'))
>>> idx.get_level_values(0)
Index(['a', 'b', 'c'], dtype='object')
pandas.Index.get_loc
Examples
pandas.Index.get_slice_bound
pandas.Index.get_value
Returns
scalar A value in the Series with the index of the key value in self.
pandas.Index.get_values
Index.get_values(self )
Return Index data as an numpy.ndarray.
Deprecated since version 0.25.0: Use Index.to_numpy() or Index.array instead.
Returns
numpy.ndarray A one-dimensional numpy array of the Index values.
See also:
Examples
pandas.Index.groupby
Index.groupby(self, values)
Group the index labels by a given array of values.
Parameters
pandas.Index.holds_integer
Index.holds_integer(self )
Whether the type is an integer type.
pandas.Index.identical
Index.identical(self, other)
Similar to equals, but check that other comparable attributes are also equal.
Returns
bool If two Index objects have equal elements and same type True, otherwise
False.
pandas.Index.insert
pandas.Index.intersection
Examples
pandas.Index.is_
Index.is_(self, other)
More flexible, faster check like is but that works through views.
Note: this is not the same as Index.identical(), which checks that metadata is also the same.
Parameters
other [object] other object to compare against.
Returns
True if both have same underlying data, False otherwise [bool]
pandas.Index.is_categorical
Index.is_categorical(self )
Check if the Index holds categorical data.
Returns
boolean True if the Index is categorical.
See also:
Examples
pandas.Index.is_type_compatible
Index.is_type_compatible(self, kind)
Whether the index type is compatible with the provided type.
pandas.Index.isin
Notes
In the case of MultiIndex you must either specify values as a list-like object containing tuples that
are the same length as the number of levels, or specify level. Otherwise it will raise a ValueError.
If level is specified:
• if it is the name of one and only one index level, use that level;
• otherwise it should be a number indicating level position.
Examples
Check whether each index value is in a list of values.
>>> idx = pd.Index([1, 2, 3])
>>> idx.isin([1, 4])
array([ True, False, False])
The same check works for a DatetimeIndex; for a MultiIndex, pass level to check membership within a
single level (for example, whether the strings in a 'color' level are in a list of colors).
>>> dates = ['2000-03-11', '2000-03-12', '2000-03-13']
>>> dti = pd.to_datetime(dates)
>>> dti.isin(['2000-03-11'])
array([ True, False, False])
pandas.Index.isna
Index.isna(self )
Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None,
numpy.NaN or pd.NaT, get mapped to True values. Everything else get mapped to False values.
Characters such as empty strings ‘’ or numpy.inf are not considered NA values (unless you set
pandas.options.mode.use_inf_as_na = True).
New in version 0.20.0.
Returns
numpy.ndarray A boolean array of whether my values are NA.
See also:
Examples
pandas.Index.isnull
Index.isnull(self )
Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None,
numpy.NaN or pd.NaT, get mapped to True values. Everything else get mapped to False values.
Characters such as empty strings ‘’ or numpy.inf are not considered NA values (unless you set
pandas.options.mode.use_inf_as_na = True).
New in version 0.20.0.
Returns
numpy.ndarray A boolean array of whether my values are NA.
See also:
Examples
pandas.Index.item
Index.item(self )
Return the first element of the underlying data as a python scalar.
Returns
scalar The first element of the Index.
pandas.Index.join
sort [boolean, default False] Sort the join keys lexicographically in the result Index.
If False, the order of the join keys depends on the join type (how keyword).
New in version 0.20.0.
Returns
join_index, (left_indexer, right_indexer)
pandas.Index.map
pandas.Index.max
Examples
pandas.Index.memory_usage
Index.memory_usage(self, deep=False)
Memory usage of the values
Parameters
deep [bool] Introspect the data deeply, interrogate object dtypes for system-level
memory consumption
Returns
bytes used
See also:
numpy.ndarray.nbytes
Notes
Memory usage does not include memory consumed by elements that are not components of the
array if deep=False or if used on PyPy
pandas.Index.min
Examples
pandas.Index.notna
Index.notna(self )
Detect existing (non-missing) values.
Return a boolean same-sized object indicating if the values are not NA. Non-missing values get
mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values
(unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or
numpy.NaN, get mapped to False values.
New in version 0.20.0.
Returns
numpy.ndarray Boolean array to indicate which entries are not NA.
See also:
Examples
Show which entries in an Index are not NA. The result is an array.
pandas.Index.notnull
Index.notnull(self )
Detect existing (non-missing) values.
Return a boolean same-sized object indicating if the values are not NA. Non-missing values get
mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values
(unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or
numpy.NaN, get mapped to False values.
New in version 0.20.0.
Returns
numpy.ndarray Boolean array to indicate which entries are not NA.
See also:
Examples
Show which entries in an Index are not NA. The result is an array.
pandas.Index.nunique
Index.nunique(self, dropna=True)
Return number of unique elements in the object.
Excludes NA values by default.
Parameters
dropna [bool, default True] Don’t include NaN in the count.
Returns
int
See also:
Examples
>>> s = pd.Series([1, 3, 5, 7, 7])
>>> s.nunique()
4
pandas.Index.putmask
numpy.ndarray.putmask
pandas.Index.ravel
Index.ravel(self, order=’C’)
Return an ndarray of the flattened values of the underlying data.
Returns
numpy.ndarray Flattened array.
See also:
numpy.ndarray.ravel
pandas.Index.reindex
pandas.Index.rename
Examples
pandas.Index.repeat
Examples
pandas.Index.searchsorted
numpy.searchsorted
Notes
Examples
>>> x = pd.Series([1, 2, 3])
>>> x.searchsorted(4)
3
>>> x = pd.Series(['apple', 'bread', 'bread', 'cheese', 'milk'])
>>> x.searchsorted('bread')
1
pandas.Index.set_names
Examples
pandas.Index.set_value
Notes
pandas.Index.shift
Notes
This method is only implemented for datetime-like index classes, i.e., DatetimeIndex, PeriodIndex
and TimedeltaIndex.
Examples
The default value of freq is the freq attribute of the index, which is ‘MS’ (month start) in this
example.
>>> month_starts = pd.date_range('2011-01-01', '2011-05-01', freq='MS')
>>> month_starts.shift(10)
DatetimeIndex(['2011-11-01', '2011-12-01', '2012-01-01', '2012-02-01',
'2012-03-01'],
dtype='datetime64[ns]', freq='MS')
pandas.Index.slice_indexer
Notes
This function assumes that the data is sorted, so use at your own peril
Examples
This is a method on all index types. For example you can do:
pandas.Index.slice_locs
Notes
Examples
pandas.Index.sort
pandas.Index.sort_values
ascending [bool, default True] Should the index values be sorted in an ascending
order.
Returns
sorted_index [pandas.Index] Sorted copy of the index.
indexer [numpy.ndarray, optional] The indices that the index itself was sorted by.
See also:
Examples
>>> idx = pd.Index([10, 100, 1, 1000])
>>> idx.sort_values()
Int64Index([1, 10, 100, 1000], dtype='int64')
Sort values in descending order, and also get the indices idx was sorted by.
pandas.Index.sortlevel
pandas.Index.str
Index.str()
Vectorized string functions for Series and Index. NAs stay NA unless handled otherwise by a
particular method. Patterned after Python’s string methods, with some inspiration from R’s
stringr package.
Examples
>>> s.str.split('_')
>>> s.str.replace('_', '')
pandas.Index.summary
Index.summary(self, name=None)
Return a summarized representation.
Deprecated since version 0.23.0.
pandas.Index.symmetric_difference
Notes
symmetric_difference contains elements that appear in either idx1 or idx2 but not both.
Equivalent to the Index created by idx1.difference(idx2) | idx2.difference(idx1) with
duplicates dropped.
Examples
pandas.Index.take
numpy.ndarray.take
pandas.Index.to_flat_index
Index.to_flat_index(self )
Identity method.
New in version 0.24.0.
This is implemented for compatibility with subclass implementations when chaining.
Returns
pd.Index Caller.
See also:
pandas.Index.to_frame
name [object, default None] The passed name should substitute for the index
name (if it has one).
Returns
DataFrame DataFrame containing the original Index data.
See also:
Examples
>>> idx = pd.Index(['Ant', 'Bear', 'Cow'], name='animal')
>>> idx.to_frame(index=False)
animal
0 Ant
1 Bear
2 Cow
pandas.Index.to_list
Index.to_list(self )
Return a list of the values.
These are each a scalar type, which is a Python scalar (for str, int, float) or a pandas scalar (for
Timestamp/Timedelta/Interval/Period)
Returns
list
See also:
numpy.ndarray.tolist
pandas.Index.to_native_types
pandas.Index.to_numpy
Notes
The returned array will be the same up to equality (values equal in self will be equal in the
returned array; likewise for values that are not equal). When self contains an ExtensionArray,
the dtype may be different. For example, for a category-dtype Series, to_numpy() will return a
NumPy array and the categorical dtype will be lost.
For NumPy dtypes, this will be a reference to the actual data stored in this Series or Index
(assuming copy=False). Modifying the result in place will modify the data stored in the Series
or Index (not that we recommend doing that).
For extension types, to_numpy() may require copying data and coercing the result to a NumPy
type (possibly object), which may be expensive. When you need a no-copy reference to the
underlying data, Series.array should be used instead.
This table lays out the different dtypes and default return types of to_numpy() for various dtypes
within pandas.
Examples
Specify the dtype to control how datetime-aware data is represented. Use dtype=object to return
an ndarray of pandas Timestamp objects, each with the correct tz, or dtype="datetime64[ns]" to
return an ndarray of native datetime64 values with the timezone converted to UTC and then dropped.
>>> ser = pd.Series(pd.date_range('2000', periods=2, tz="CET"))
>>> ser.to_numpy(dtype="datetime64[ns]")
... # doctest: +ELLIPSIS
array(['1999-12-31T23:00:00.000000000', '2000-01-01T23:00:00...'],
dtype='datetime64[ns]')
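For context, ser above can be taken to be a timezone-aware Series such as the one below; with dtype=object, each element stays a Timestamp with its tz (a sketch; the exact Timestamp repr, including the freq field, may differ by version):
>>> ser = pd.Series(pd.date_range('2000', periods=2, tz="CET"))
>>> ser.to_numpy(dtype=object)
array([Timestamp('2000-01-01 00:00:00+0100', tz='CET', freq='D'),
       Timestamp('2000-01-02 00:00:00+0100', tz='CET', freq='D')],
      dtype=object)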
pandas.Index.to_series
pandas.Index.tolist
Index.tolist(self )
Return a list of the values.
These are each a scalar type, which is a Python scalar (for str, int, float) or a pandas scalar (for
Timestamp/Timedelta/Interval/Period)
Returns
list
See also:
numpy.ndarray.tolist
pandas.Index.transpose
pandas.Index.union
Parameters
other [Index or array-like]
sort [bool or None, default None] Whether to sort the resulting Index.
• None : Sort the result, except when
1. self and other are equal.
2. self or other has length 0.
3. Some values in self or other cannot be compared. A RuntimeWarning is
issued in this case.
• False : do not sort the result.
New in version 0.24.0.
Changed in version 0.24.1: Changed the default value from True to None (with-
out change in behaviour).
Returns
union [Index]
Examples
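For illustration, the union of two integer indexes:
>>> idx1 = pd.Index([1, 2, 3, 4])
>>> idx2 = pd.Index([3, 4, 5, 6])
>>> idx1.union(idx2)
Int64Index([1, 2, 3, 4, 5, 6], dtype='int64')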
pandas.Index.unique
Index.unique(self, level=None)
Return unique values in the index. Uniques are returned in order of appearance, this does NOT
sort.
Parameters
level [int or str, optional, default None] Only return values from specified level
(for MultiIndex)
New in version 0.23.0.
Returns
Index without duplicates
See also:
unique
Series.unique
pandas.Index.value_counts
Examples
With normalize set to True, returns the relative frequency by dividing all values by the sum of
values.
bins
Bins can be useful for going from a continuous variable to a categorical variable; instead of
counting unique occurrences of values, divide the index into the specified number of half-open bins.
>>> s.value_counts(bins=3)
(2.0, 3.0] 2
(0.996, 2.0] 2
(3.0, 4.0] 1
dtype: int64
dropna
With dropna set to False we can also see NaN index values.
>>> s.value_counts(dropna=False)
3.0 2
NaN 1
4.0 1
pandas.Index.where
is_boolean
is_floating
is_integer
is_interval
is_lexsorted_for_tuple
is_mixed
is_numeric
is_object
view
Properties
pandas.Index.has_duplicates
Index.has_duplicates
pandas.Index.is_all_dates
Index.is_all_dates
pandas.Index.name
Index.name = None
pandas.Index.names
Index.names
pandas.Index.empty
Index.empty
pandas.Index.is_boolean
Index.is_boolean(self )
pandas.Index.is_floating
Index.is_floating(self )
pandas.Index.is_integer
Index.is_integer(self )
pandas.Index.is_interval
Index.is_interval(self )
pandas.Index.is_mixed
Index.is_mixed(self )
pandas.Index.is_numeric
Index.is_numeric(self )
pandas.Index.is_object
Index.is_object(self )
pandas.Index.is_lexsorted_for_tuple
Index.is_lexsorted_for_tuple(self, tup)
Missing values
Index.fillna(self[, value, downcast]) Fill NA/NaN values with the specified value
Index.dropna(self[, how]) Return Index without NA/NaN values
Index.isna(self) Detect missing values.
Index.notna(self) Detect existing (non-missing) values.
Conversion
pandas.Index.view
Index.view(self, cls=None)
Sorting
Index.argsort(self, \*args, \*\*kwargs) Return the integer indices that would sort the in-
dex.
Index.searchsorted(self, value[, side, sorter]) Find indices where elements should be inserted to
maintain order.
Index.sort_values(self[, return_indexer, …]) Return a sorted copy of the index.
Time-specific operations
Selecting
Index.asof(self, label) Return the label from the index, or, if not present,
the previous one.
Index.asof_locs(self, where, mask) Find the locations (indices) of the labels from the
index for every entry in the where argument.
Index.contains(self, key) (DEPRECATED) Return a boolean indicating
whether the provided key is in the index.
Index.get_duplicates(self) (DEPRECATED) Extract duplicated index ele-
ments.
pandas.RangeIndex
class pandas.RangeIndex
Immutable Index implementing a monotonic integer range.
RangeIndex is a memory-saving special case of Int64Index limited to representing monotonic ranges.
Using RangeIndex may in some instances improve computing speed.
This is the default index type used by DataFrame and Series when no explicit index is provided by the
user.
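For instance, a Series constructed without an explicit index gets a RangeIndex (illustrative sketch):
>>> pd.Series(['a', 'b', 'c']).index
RangeIndex(start=0, stop=3, step=1)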
Parameters
start [int (default: 0), or other RangeIndex instance] If int and “stop” is not given,
interpreted as “stop” instead.
stop [int (default: 0)]
step [int (default: 1)]
name [object, optional] Name to be stored in the index
copy [bool, default False] Unused, accepted for homogeneity with other index types.
See also:
Attributes
pandas.RangeIndex.start
RangeIndex.start
The value of the start parameter (0 if this was not supplied)
pandas.RangeIndex.stop
RangeIndex.stop
The value of the stop parameter
pandas.RangeIndex.step
RangeIndex.step
The value of the step parameter (1 if this was not supplied)
Methods
pandas.RangeIndex.from_range
pandas.Int64Index
class pandas.Int64Index
Immutable ndarray implementing an ordered, sliceable set. The basic object storing axis labels for all
pandas objects. Int64Index is a special case of Index with purely integer labels.
Parameters
Notes
Attributes
None
Methods
None
pandas.UInt64Index
class pandas.UInt64Index
Immutable ndarray implementing an ordered, sliceable set. The basic object storing axis labels for all
pandas objects. UInt64Index is a special case of Index with purely unsigned integer labels.
Parameters
data [array-like (1-dimensional)]
dtype [NumPy dtype (default: uint64)]
copy [bool] Make a copy of input ndarray
name [object] Name to be stored in the index
See also:
Notes
Attributes
None
Methods
None
pandas.Float64Index
class pandas.Float64Index
Immutable ndarray implementing an ordered, sliceable set. The basic object storing axis labels for all
pandas objects. Float64Index is a special case of Index with purely float labels.
Parameters
data [array-like (1-dimensional)]
dtype [NumPy dtype (default: float64)]
copy [bool] Make a copy of input ndarray
name [object] Name to be stored in the index
See also:
Notes
Attributes
None
Methods
None
6.7.3 CategoricalIndex
pandas.CategoricalIndex
class pandas.CategoricalIndex
Index based on an underlying Categorical.
CategoricalIndex, like Categorical, can only take on a limited, and usually fixed, number of possible
values (categories). Also, like Categorical, it might have an order, but numerical operations (additions,
divisions, …) are not possible.
Parameters
data [array-like (1-dimensional)] The values of the categorical. If categories are given,
values not in categories will be replaced with NaN.
categories [index-like, optional] The categories for the categorical. Items need to be
unique. If the categories are not given here (and also not in dtype), they will be
inferred from the data.
ordered [bool, optional] Whether or not this categorical is treated as an ordered cat-
egorical. If not given here or in dtype, the resulting categorical will be unordered.
dtype [CategoricalDtype or the string “category”, optional] If CategoricalDtype,
cannot be used together with categories or ordered.
New in version 0.21.0.
copy [bool, default False] Make a copy of input ndarray.
name [object, optional] Name to be stored in the index.
Raises
ValueError If the categories do not validate.
TypeError If an explicit ordered=True is given but no categories and the values are
not sortable.
See also:
Notes
Examples
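A minimal illustration (the repr may wrap differently depending on terminal width):
>>> pd.CategoricalIndex(['a', 'b', 'c', 'a', 'b', 'c'])
CategoricalIndex(['a', 'b', 'c', 'a', 'b', 'c'], categories=['a', 'b', 'c'], ordered=False, dtype='category')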
Attributes
codes
categories
ordered
Methods
pandas.CategoricalIndex.rename_categories
reorder_categories
add_categories
remove_categories
remove_unused_categories
set_categories
Examples
For dict-like new_categories, extra keys are ignored and categories not in the dictionary are
passed through
pandas.CategoricalIndex.reorder_categories
Parameters
new_categories [Index-like] The categories in new order.
ordered [bool, optional] Whether or not the categorical is treated as an ordered
categorical. If not given, do not change the ordered information.
inplace [bool, default False] Whether or not to reorder the categories inplace or
return a copy of this categorical with reordered categories.
Returns
cat [Categorical with reordered categories or None if inplace.]
Raises
ValueError If the new categories do not contain all old category items or any
new ones
See also:
rename_categories
add_categories
remove_categories
remove_unused_categories
set_categories
pandas.CategoricalIndex.add_categories
rename_categories
reorder_categories
remove_categories
remove_unused_categories
set_categories
pandas.CategoricalIndex.remove_categories
rename_categories
reorder_categories
add_categories
remove_unused_categories
set_categories
pandas.CategoricalIndex.remove_unused_categories
rename_categories
reorder_categories
add_categories
remove_categories
set_categories
pandas.CategoricalIndex.set_categories
rename_categories
reorder_categories
add_categories
remove_categories
remove_unused_categories
pandas.CategoricalIndex.as_ordered
pandas.CategoricalIndex.as_unordered
pandas.CategoricalIndex.map
CategoricalIndex.map(self, mapper)
Map values using input correspondence (a dict, Series, or function).
Maps the values (their categories, not the codes) of the index to new categories. If the mapping
correspondence is one-to-one the result is a CategoricalIndex which has the same order property
as the original, otherwise an Index is returned.
If a dict or Series is used any unmapped category is mapped to NaN. Note that if this happens
an Index will be returned.
Parameters
mapper [function, dict, or Series] Mapping correspondence.
Returns
pandas.CategoricalIndex or pandas.Index Mapped index.
See also:
Examples
If a dict is used, all unmapped categories are mapped to NaN and the result is an Index:
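For illustration, with a small categorical index and a dict that omits one category:
>>> idx = pd.CategoricalIndex(['a', 'b', 'c'])
>>> idx.map({'a': 'first', 'b': 'second'})
Index(['first', 'second', nan], dtype='object')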
Categorical components
CategoricalIndex.codes
CategoricalIndex.categories
CategoricalIndex.ordered
CategoricalIndex.rename_categories(self, …) Rename categories.
CategoricalIndex.reorder_categories(self, …) Reorder categories as specified in new_categories.
CategoricalIndex.add_categories(self, …) Add new categories.
CategoricalIndex.remove_categories(self, …) Remove the specified categories.
CategoricalIndex.remove_unused_categories(…) Remove categories which are not used.
CategoricalIndex.set_categories(self, …) Set the categories to the specified new_categories.
CategoricalIndex.as_ordered(self, \*args, …) Set the Categorical to be ordered.
CategoricalIndex.as_unordered(self, \*args, …) Set the Categorical to be unordered.
pandas.CategoricalIndex.codes
CategoricalIndex.codes
pandas.CategoricalIndex.categories
CategoricalIndex.categories
pandas.CategoricalIndex.ordered
CategoricalIndex.ordered
pandas.CategoricalIndex.equals
CategoricalIndex.equals(self, other)
Determine if two CategoricalIndex objects contain the same elements.
Returns
bool If two CategoricalIndex objects have equal elements True, otherwise False.
6.7.4 IntervalIndex
pandas.IntervalIndex
class pandas.IntervalIndex
Immutable index of intervals that are closed on the same side.
New in version 0.20.0.
Parameters
data [array-like (1-dimensional)] Array-like containing Interval objects from which to
build the IntervalIndex.
closed [{‘left’, ‘right’, ‘both’, ‘neither’}, default ‘right’] Whether the intervals are
closed on the left-side, right-side, both or neither.
dtype [dtype or None, default None] If None, dtype will be inferred.
New in version 0.23.0.
copy [bool, default False] Copy the input data.
name [object, optional] Name to be stored in the index.
verify_integrity [bool, default True] Verify that the IntervalIndex is valid.
See also:
Notes
Examples
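A typical way to build one is via pd.interval_range (a sketch; the repr layout may differ slightly by version):
>>> pd.interval_range(start=0, end=5)
IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]],
              closed='right',
              dtype='interval[int64]')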
Attributes
pandas.IntervalIndex.left
IntervalIndex.left
Return the left endpoints of each Interval in the IntervalIndex as an Index
pandas.IntervalIndex.right
IntervalIndex.right
Return the right endpoints of each Interval in the IntervalIndex as an Index
pandas.IntervalIndex.closed
IntervalIndex.closed
Whether the intervals are closed on the left-side, right-side, both or neither
pandas.IntervalIndex.mid
IntervalIndex.mid
Return the midpoint of each Interval in the IntervalIndex as an Index
pandas.IntervalIndex.length
IntervalIndex.length
Return an Index with entries denoting the length of each Interval in the IntervalIndex
pandas.IntervalIndex.is_empty
IntervalIndex.is_empty
Indicates if an interval is empty, meaning it contains no points.
New in version 0.25.0.
Returns
bool or ndarray A boolean indicating if a scalar Interval is empty, or a
boolean ndarray positionally indicating if an Interval in an IntervalArray
or IntervalIndex is empty.
Examples
pandas.IntervalIndex.is_non_overlapping_monotonic
IntervalIndex.is_non_overlapping_monotonic
Return True if the IntervalIndex is non-overlapping (no Intervals share points) and is either
monotonic increasing or monotonic decreasing, else False
pandas.IntervalIndex.is_overlapping
IntervalIndex.is_overlapping
Return True if the IntervalIndex has overlapping intervals, else False.
Two intervals overlap if they share a common point, including closed endpoints. Intervals that
only have an open endpoint in common do not overlap.
New in version 0.24.0.
Returns
bool Boolean indicating if the IntervalIndex has overlapping intervals.
See also:
Examples
pandas.IntervalIndex.values
IntervalIndex.values
Return the IntervalIndex’s data as an IntervalArray.
Methods
from_arrays(left, right[, closed, name, …]) Construct from two arrays defining the left and
right bounds.
from_tuples(data[, closed, name, copy, dtype]) Construct an IntervalIndex from an array-like of
tuples
from_breaks(breaks[, closed, name, copy, Construct an IntervalIndex from an array of
dtype]) splits.
contains(self, other) Check elementwise if the Intervals contain the
value.
overlaps(self, other) Check elementwise if an Interval overlaps the
values in the IntervalIndex.
set_closed(self, closed) Return an IntervalIndex identical to the current
one, but closed on the specified side
to_tuples(self[, na_tuple]) Return an Index of tuples of the form (left, right)
pandas.IntervalIndex.from_arrays
Notes
Each element of left must be less than or equal to the right element at the same position. If an
element is missing, it must be missing in both left and right. A TypeError is raised when using
an unsupported type for left or right. At the moment, ‘category’, ‘object’, and ‘string’ subtypes
are not supported.
Examples
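For illustration, building an IntervalIndex from matching left and right arrays (repr layout may vary):
>>> pd.IntervalIndex.from_arrays([0, 1, 2], [1, 2, 3])
IntervalIndex([(0, 1], (1, 2], (2, 3]],
              closed='right',
              dtype='interval[int64]')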
pandas.IntervalIndex.from_tuples
See also:
Examples
pandas.IntervalIndex.from_breaks
Examples
pandas.IntervalIndex.contains
IntervalIndex.contains(self, other)
Check elementwise if the Intervals contain the value.
Return a boolean mask whether the value is contained in the Intervals of the IntervalIndex.
New in version 0.25.0.
Parameters
other [scalar] The value to check whether it is contained in the Intervals.
Returns
boolean array
See also:
Examples
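For illustration, with right-closed intervals only the first one contains 0.5:
>>> intervals = pd.IntervalIndex.from_tuples([(0, 1), (1, 3), (2, 4)])
>>> intervals.contains(0.5)
array([ True, False, False])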
pandas.IntervalIndex.overlaps
IntervalIndex.overlaps(self, other)
Check elementwise if an Interval overlaps the values in the IntervalIndex.
Two intervals overlap if they share a common point, including closed endpoints. Intervals that
only have an open endpoint in common do not overlap.
New in version 0.24.0.
Parameters
other [Interval] Interval to check against for an overlap.
Returns
ndarray Boolean array positionally indicating where an overlap occurs.
See also:
Examples
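For illustration, checking which intervals overlap a query Interval:
>>> intervals = pd.IntervalIndex.from_tuples([(0, 1), (1, 3), (2, 4)])
>>> intervals.overlaps(pd.Interval(0.5, 1.5))
array([ True,  True, False])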
pandas.IntervalIndex.set_closed
IntervalIndex.set_closed(self, closed)
Return an IntervalIndex identical to the current one, but closed on the specified side
New in version 0.24.0.
Parameters
closed [{‘left’, ‘right’, ‘both’, ‘neither’}] Whether the intervals are closed on the
left-side, right-side, both or neither.
Returns
new_index [IntervalIndex]
Examples
pandas.IntervalIndex.to_tuples
IntervalIndex.to_tuples(self, na_tuple=True)
Return an Index of tuples of the form (left, right)
Parameters
na_tuple [boolean, default True] Returns NA as a tuple if True, (nan, nan), or
just as the NA value itself if False, nan.
New in version 0.23.0.
Returns
tuples: Index
Examples
IntervalIndex components
IntervalIndex.from_arrays(left, right[, …]) Construct from two arrays defining the left and
right bounds.
IntervalIndex.from_tuples(data[, closed, …]) Construct an IntervalIndex from an array-like of
tuples
IntervalIndex.from_breaks(breaks[, closed, …]) Construct an IntervalIndex from an array of splits.
IntervalIndex.left Return the left endpoints of each Interval in the
IntervalIndex as an Index
IntervalIndex.right Return the right endpoints of each Interval in the
IntervalIndex as an Index
IntervalIndex.mid Return the midpoint of each Interval in the Inter-
valIndex as an Index
IntervalIndex.closed Whether the intervals are closed on the left-side,
right-side, both or neither
IntervalIndex.length Return an Index with entries denoting the length
of each Interval in the IntervalIndex
IntervalIndex.values Return the IntervalIndex’s data as an IntervalAr-
ray.
IntervalIndex.is_empty Indicates if an interval is empty, meaning it contains
no points.
IntervalIndex.is_non_overlapping_monotonic Return True if the IntervalIndex is non-overlapping
(no Intervals share points) and is either monotonic
increasing or monotonic decreasing, else False
IntervalIndex.is_overlapping Return True if the IntervalIndex has overlapping
intervals, else False.
IntervalIndex.get_loc(self, key, method, …) Get integer location, slice or boolean mask for re-
quested label.
IntervalIndex.get_indexer(self, target, …) Compute indexer and mask for new index given the
current index.
IntervalIndex.set_closed(self, closed) Return an IntervalIndex identical to the current
one, but closed on the specified side
pandas.IntervalIndex.get_loc
Examples
>>> index.get_loc(1.5)
1
If a label is in several intervals, you get the locations of all the relevant intervals.
>>> i3 = pd.Interval(0, 2)
>>> overlapping_index = pd.IntervalIndex([i1, i2, i3])
>>> overlapping_index.get_loc(0.5)
array([ True, False, True])
pandas.IntervalIndex.get_indexer
Examples
Notice that the return value is an array of locations in index and x is marked by -1, as it is not in
index.
6.7.5 MultiIndex
pandas.MultiIndex
class pandas.MultiIndex
A multi-level, or hierarchical, index object for pandas objects.
Parameters
levels [sequence of arrays] The unique labels for each level.
codes [sequence of arrays] Integers for each level designating which label at each
location.
New in version 0.24.0.
labels [sequence of arrays] Integers for each level designating which label at each
location.
Deprecated since version 0.24.0: Use codes instead
sortorder [optional int] Level of sortedness (must be lexicographically sorted by that
level).
names [optional sequence of objects] Names for each of the index levels. (name is
accepted for compat).
copy [bool, default False] Copy the meta-data.
verify_integrity [bool, default True] Check that the levels/codes are consistent and
valid.
See also:
Notes
Examples
A new MultiIndex is typically constructed using one of the helper methods MultiIndex.
from_arrays(), MultiIndex.from_product() and MultiIndex.from_tuples(). For example (using
.from_arrays):
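>>> arrays = [[1, 1, 2, 2], ['red', 'blue', 'red', 'blue']]
>>> pd.MultiIndex.from_arrays(arrays, names=('number', 'color'))
MultiIndex([(1,  'red'),
            (1, 'blue'),
            (2,  'red'),
            (2, 'blue')],
           names=['number', 'color'])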
See further examples for how to construct a MultiIndex in the doc strings of the mentioned helper
methods.
Attributes
pandas.MultiIndex.names
MultiIndex.names
Names of levels in MultiIndex
pandas.MultiIndex.nlevels
MultiIndex.nlevels
Integer number of levels in this MultiIndex.
pandas.MultiIndex.levshape
MultiIndex.levshape
A tuple with the length of each level.
levels
codes
Methods
pandas.MultiIndex.from_arrays
Examples
pandas.MultiIndex.from_tuples
index [MultiIndex]
See also:
Examples
pandas.MultiIndex.from_product
Examples
pandas.MultiIndex.from_frame
Examples
>>> pd.MultiIndex.from_frame(df)
MultiIndex([('HI', 'Temp'),
('HI', 'Precip'),
('NJ', 'Temp'),
('NJ', 'Precip')],
names=['a', 'b'])
pandas.MultiIndex.set_levels
Examples
pandas.MultiIndex.set_codes
Examples
pandas.MultiIndex.to_frame
DataFrame
pandas.MultiIndex.to_flat_index
MultiIndex.to_flat_index(self )
Convert a MultiIndex to an Index of Tuples containing the level values.
New in version 0.24.0.
Returns
pd.Index Index with the MultiIndex data represented in Tuples.
Notes
This method will simply return the caller if called by anything other than a MultiIndex.
Examples
pandas.MultiIndex.is_lexsorted
MultiIndex.is_lexsorted(self )
Return True if the codes are lexicographically sorted
Returns
bool
pandas.MultiIndex.sortlevel
pandas.MultiIndex.droplevel
MultiIndex.droplevel(self, level=0)
Return index with requested level(s) removed.
If resulting index has only 1 level left, the result will be of Index type, not MultiIndex.
New in version 0.23.1: (support for non-MultiIndex)
Parameters
level [int, str, or list-like, default 0] If a string is given, must be the name of a
level If list-like, elements must be names or indexes of levels.
Returns
Index or MultiIndex
pandas.MultiIndex.swaplevel
Examples
pandas.MultiIndex.reorder_levels
MultiIndex.reorder_levels(self, order)
Rearrange levels using input order. May not drop or duplicate levels
Returns
MultiIndex
pandas.MultiIndex.remove_unused_levels
MultiIndex.remove_unused_levels(self )
Create a new MultiIndex from the current that removes unused levels, meaning that they are not
expressed in the labels.
The resulting MultiIndex will have the same outward appearance, meaning the same .values and
ordering. It will also be .equals() to the original.
New in version 0.20.0.
Returns
MultiIndex
Examples
>>> mi[2:]
MultiIndex([(1, 'a'),
(1, 'b')],
)
The 0 from the first level is not represented and can be removed
pandas.IndexSlice
Notes
Examples
MultiIndex constructors
MultiIndex properties
pandas.MultiIndex.levels
MultiIndex.levels
pandas.MultiIndex.codes
MultiIndex.codes
MultiIndex components
pandas.MultiIndex.to_hierarchical
Examples
MultiIndex selecting
pandas.MultiIndex.get_loc
Notes
The key cannot be a slice, list of same-level labels, a boolean mask, or a sequence of such. If you want
to use those, use MultiIndex.get_locs() instead.
Examples
>>> mi.get_loc('b')
slice(1, 3, None)
pandas.MultiIndex.get_loc_level
Examples
>>> mi.get_loc_level('b')
(slice(1, 3, None), Index(['e', 'f'], dtype='object', name='B'))
pandas.MultiIndex.get_indexer
Examples
Notice that the return value is an array of locations in index and x is marked by -1, as it is not in
index.
pandas.MultiIndex.get_level_values
MultiIndex.get_level_values(self, level)
Return vector of label values for requested level, equal to the length of the index.
Parameters
level [int or str] level is either the integer position of the level in the MultiIndex, or
the name of the level.
Returns
values [Index] Values is a level of this MultiIndex converted to a single Index (or
subclass thereof).
Examples
Create a MultiIndex:
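The index used below is assumed to have been built along these lines (one construction consistent with the outputs that follow):
>>> mi = pd.MultiIndex.from_arrays((list('abc'), list('def')))
>>> mi.names = ['level_1', 'level_2']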
>>> mi.get_level_values(0)
Index(['a', 'b', 'c'], dtype='object', name='level_1')
>>> mi.get_level_values('level_2')
Index(['d', 'e', 'f'], dtype='object', name='level_2')
6.7.6 DatetimeIndex
pandas.DatetimeIndex
class pandas.DatetimeIndex
Immutable ndarray of datetime64 data, represented internally as int64, and which can be boxed to
Timestamp objects that are subclasses of datetime and carry metadata such as frequency information.
Parameters
data [array-like (1-dimensional), optional] Optional datetime-like data to construct
index with
copy [bool] Make a copy of input ndarray
freq [string or pandas offset object, optional] One of pandas date offset strings or cor-
responding objects. The string ‘infer’ can be passed in order to set the frequency
of the index as the inferred frequency upon creation
start [starting value, datetime-like, optional] If data is None, start is used as the start
point in generating regular timestamp data.
Deprecated since version 0.24.0.
periods [int, optional, > 0] Number of periods to generate, if generating index. Takes
precedence over end argument
Deprecated since version 0.24.0.
end [end time, datetime-like, optional] If periods is none, generated index will extend
to first conforming time on or just past end argument
Deprecated since version 0.24.0.
closed [string or None, default None] Make the interval closed with respect to the
given frequency to the ‘left’, ‘right’, or both sides (None)
Deprecated since version 0.24.0.
tz [pytz.timezone or dateutil.tz.tzfile]
ambiguous [‘infer’, bool-ndarray, ‘NaT’, default ‘raise’] When clocks moved back-
ward due to DST, ambiguous times may arise. For example in Central European
Time (UTC+01), when going from 03:00 DST to 02:00 non-DST, 02:30:00 local
time occurs both at 00:30:00 UTC and at 01:30:00 UTC. In such a situation, the
ambiguous parameter dictates how ambiguous times should be handled.
• ‘infer’ will attempt to infer fall dst-transition hours based on order
• bool-ndarray where True signifies a DST time, False signifies a non-DST time
(note that this flag is only applicable for ambiguous times)
• ‘NaT’ will return NaT where there are ambiguous times
• ‘raise’ will raise an AmbiguousTimeError if there are ambiguous times
name [object] Name to be stored in the index
dayfirst [bool, default False] If True, parse dates in data with the day first order
yearfirst [bool, default False] If True parse dates in data with the year first order
See also:
Notes
To learn more about the frequency strings, please see this link.
Creating a DatetimeIndex based on start, periods, and end has been deprecated in favor of
date_range().
Attributes
pandas.DatetimeIndex.year
DatetimeIndex.year
The year of the datetime.
pandas.DatetimeIndex.month
DatetimeIndex.month
The month as January=1, December=12.
pandas.DatetimeIndex.day
DatetimeIndex.day
The days of the datetime.
pandas.DatetimeIndex.hour
DatetimeIndex.hour
The hours of the datetime.
pandas.DatetimeIndex.minute
DatetimeIndex.minute
The minutes of the datetime.
pandas.DatetimeIndex.second
DatetimeIndex.second
The seconds of the datetime.
pandas.DatetimeIndex.microsecond
DatetimeIndex.microsecond
The microseconds of the datetime.
pandas.DatetimeIndex.nanosecond
DatetimeIndex.nanosecond
The nanoseconds of the datetime.
pandas.DatetimeIndex.date
DatetimeIndex.date
Returns numpy array of python datetime.date objects (namely, the date part of Timestamps
without timezone information).
pandas.DatetimeIndex.time
DatetimeIndex.time
Returns numpy array of datetime.time. The time part of the Timestamps.
pandas.DatetimeIndex.timetz
DatetimeIndex.timetz
Returns numpy array of datetime.time also containing timezone information. The time part of
the Timestamps.
pandas.DatetimeIndex.dayofyear
DatetimeIndex.dayofyear
The ordinal day of the year.
pandas.DatetimeIndex.weekofyear
DatetimeIndex.weekofyear
The week ordinal of the year.
pandas.DatetimeIndex.week
DatetimeIndex.week
The week ordinal of the year.
pandas.DatetimeIndex.dayofweek
DatetimeIndex.dayofweek
The day of the week with Monday=0, Sunday=6.
Return the day of the week. It is assumed the week starts on Monday, which is denoted by 0 and
ends on Sunday, which is denoted by 6. This method is available on both Series with datetime
values (using the dt accessor) and DatetimeIndex.
Returns
Series or Index Containing integers indicating the day number.
See also:
Series.dt.dayofweek Alias.
Series.dt.weekday Alias.
Series.dt.day_name Returns the name of the day of the week.
Examples
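A short sketch on a Series built from a daily date range (2016-12-31 was a Saturday, hence 5):
>>> s = pd.date_range('2016-12-31', '2017-01-02', freq='D').to_series()
>>> s.dt.dayofweek
2016-12-31    5
2017-01-01    6
2017-01-02    0
Freq: D, dtype: int64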
pandas.DatetimeIndex.weekday
DatetimeIndex.weekday
The day of the week with Monday=0, Sunday=6.
Return the day of the week. It is assumed the week starts on Monday, which is denoted by 0 and
ends on Sunday, which is denoted by 6. This method is available on both Series with datetime
values (using the dt accessor) and DatetimeIndex.
Returns
Series or Index Containing integers indicating the day number.
See also:
Series.dt.dayofweek Alias.
Series.dt.weekday Alias.
Series.dt.day_name Returns the name of the day of the week.
Examples
pandas.DatetimeIndex.quarter
DatetimeIndex.quarter
The quarter of the date.
pandas.DatetimeIndex.freq
DatetimeIndex.freq
Return the frequency object if it is set, otherwise None.
pandas.DatetimeIndex.freqstr
DatetimeIndex.freqstr
Return the frequency object as a string if it is set, otherwise None.
pandas.DatetimeIndex.is_month_start
DatetimeIndex.is_month_start
Indicates whether the date is the first day of the month.
Returns
Series or array For Series, returns a Series with boolean values. For DatetimeIn-
dex, returns a boolean array.
See also:
is_month_start Return a boolean indicating whether the date is the first day of the month.
is_month_end Return a boolean indicating whether the date is the last day of the month.
Examples
This method is available on Series with datetime values under the .dt accessor, and directly on
DatetimeIndex.
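For illustration, only the date that falls on the first of a month is flagged:
>>> s = pd.Series(pd.date_range("2018-02-27", periods=3))
>>> s.dt.is_month_start
0    False
1    False
2     True
dtype: bool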
pandas.DatetimeIndex.is_month_end
DatetimeIndex.is_month_end
Indicates whether the date is the last day of the month.
Returns
Series or array For Series, returns a Series with boolean values. For DatetimeIn-
dex, returns a boolean array.
See also:
is_month_start Return a boolean indicating whether the date is the first day of the month.
is_month_end Return a boolean indicating whether the date is the last day of the month.
Examples
This method is available on Series with datetime values under the .dt accessor, and directly on
DatetimeIndex.
pandas.DatetimeIndex.is_quarter_start
DatetimeIndex.is_quarter_start
Indicator for whether the date is the first day of a quarter.
Returns
is_quarter_start [Series or DatetimeIndex] The same type as the original data
with boolean values. Series will have the same name and index. DatetimeIndex
will have the same name.
See also:
Examples
This method is available on Series with datetime values under the .dt accessor, and directly on
DatetimeIndex.
>>> idx.is_quarter_start
array([False, False, True, False])
pandas.DatetimeIndex.is_quarter_end
DatetimeIndex.is_quarter_end
Indicator for whether the date is the last day of a quarter.
Returns
is_quarter_end [Series or DatetimeIndex] The same type as the original data
with boolean values. Series will have the same name and index. DatetimeIndex
will have the same name.
See also:
Examples
This method is available on Series with datetime values under the .dt accessor, and directly on
DatetimeIndex.
>>> idx.is_quarter_end
array([False, True, False, False])
pandas.DatetimeIndex.is_year_start
DatetimeIndex.is_year_start
Indicate whether the date is the first day of a year.
Returns
Series or DatetimeIndex The same type as the original data with boolean val-
ues. Series will have the same name and index. DatetimeIndex will have the
same name.
See also:
Examples
This method is available on Series with datetime values under the .dt accessor, and directly on
DatetimeIndex.
>>> dates.dt.is_year_start
0 False
1 False
2 True
dtype: bool
>>> idx.is_year_start
array([False, False, True])
pandas.DatetimeIndex.is_year_end
DatetimeIndex.is_year_end
Indicate whether the date is the last day of the year.
Returns
Series or DatetimeIndex The same type as the original data with boolean val-
ues. Series will have the same name and index. DatetimeIndex will have the
same name.
See also:
Examples
This method is available on Series with datetime values under the .dt accessor, and directly on
DatetimeIndex.
>>> dates.dt.is_year_end
0 False
1 True
2 False
dtype: bool
>>> idx.is_year_end
array([False, True, False])
pandas.DatetimeIndex.is_leap_year
DatetimeIndex.is_leap_year
Boolean indicator if the date belongs to a leap year.
A leap year is a year, which has 366 days (instead of 365) including 29th of February as an
intercalary day. Leap years are years which are multiples of four with the exception of years
divisible by 100 but not by 400.
Returns
Series or ndarray Booleans indicating if dates belong to a leap year.
Examples
This method is available on Series with datetime values under the .dt accessor, and directly on
DatetimeIndex.
pandas.DatetimeIndex.inferred_freq
DatetimeIndex.inferred_freq
Tries to return a string representing a frequency guess, generated by infer_freq. Returns None
if it can't autodetect the frequency.
tz
Methods
pandas.DatetimeIndex.normalize
Examples
pandas.DatetimeIndex.strftime
Parameters
date_format [str] Date format string (e.g. “%Y-%m-%d”).
Returns
Index Index of formatted strings.
See also:
Examples
pandas.DatetimeIndex.snap
DatetimeIndex.snap(self, freq=’S’)
Snap time stamps to nearest occurring frequency
Returns
DatetimeIndex
pandas.DatetimeIndex.tz_convert
Examples
With the tz parameter, we can change the DatetimeIndex to other time zones:
>>> dti
DatetimeIndex(['2014-08-01 09:00:00+02:00',
'2014-08-01 10:00:00+02:00',
'2014-08-01 11:00:00+02:00'],
dtype='datetime64[ns, Europe/Berlin]', freq='H')
>>> dti.tz_convert('US/Central')
DatetimeIndex(['2014-08-01 02:00:00-05:00',
'2014-08-01 03:00:00-05:00',
'2014-08-01 04:00:00-05:00'],
dtype='datetime64[ns, US/Central]', freq='H')
With the tz=None, we can remove the timezone (after converting to UTC if necessary):
>>> dti
DatetimeIndex(['2014-08-01 09:00:00+02:00',
'2014-08-01 10:00:00+02:00',
'2014-08-01 11:00:00+02:00'],
dtype='datetime64[ns, Europe/Berlin]', freq='H')
>>> dti.tz_convert(None)
DatetimeIndex(['2014-08-01 07:00:00',
'2014-08-01 08:00:00',
'2014-08-01 09:00:00'],
dtype='datetime64[ns]', freq='H')
pandas.DatetimeIndex.tz_localize
Examples
With the tz=None, we can remove the time zone information while keeping the local time (not
converted to UTC):
>>> tz_aware.tz_localize(None)
DatetimeIndex(['2018-03-01 09:00:00', '2018-03-02 09:00:00',
'2018-03-03 09:00:00'],
dtype='datetime64[ns]', freq='D')
Be careful with DST changes. When there is sequential data, pandas can infer the DST time:
>>> s = pd.to_datetime(pd.Series(['2018-10-28 01:30:00',
...                               '2018-10-28 02:00:00',
...                               '2018-10-28 02:30:00',
...                               '2018-10-28 02:00:00',
...                               '2018-10-28 02:30:00',
...                               '2018-10-28 03:00:00',
...                               '2018-10-28 03:30:00']))
>>> s.dt.tz_localize('CET', ambiguous='infer')
0   2018-10-28 01:30:00+02:00
1   2018-10-28 02:00:00+02:00
2   2018-10-28 02:30:00+02:00
3   2018-10-28 02:00:00+01:00
4   2018-10-28 02:30:00+01:00
5   2018-10-28 03:00:00+01:00
6   2018-10-28 03:30:00+01:00
dtype: datetime64[ns, CET]
In some cases, inferring the DST is impossible. In such cases, you can pass an ndarray to the
ambiguous parameter to set the DST explicitly
If the DST transition causes nonexistent times, you can shift these dates forward or backward
with a timedelta object or 'shift_forward' or 'shift_backward'.
>>> s = pd.to_datetime(pd.Series(['2015-03-29 02:30:00',
...                               '2015-03-29 03:30:00']))
>>> s.dt.tz_localize('Europe/Warsaw', nonexistent='shift_forward')
0   2015-03-29 03:00:00+02:00
1   2015-03-29 03:30:00+02:00
dtype: datetime64[ns, 'Europe/Warsaw']
>>> s.dt.tz_localize('Europe/Warsaw', nonexistent='shift_backward')
0   2015-03-29 01:59:59.999999999+01:00
1   2015-03-29 03:30:00+02:00
dtype: datetime64[ns, 'Europe/Warsaw']
>>> s.dt.tz_localize('Europe/Warsaw', nonexistent=pd.Timedelta('1H'))
0   2015-03-29 03:30:00+02:00
1   2015-03-29 03:30:00+02:00
dtype: datetime64[ns, 'Europe/Warsaw']
pandas.DatetimeIndex.round
Examples
DatetimeIndex
Series
>>> pd.Series(rng).dt.round("H")
0 2018-01-01 12:00:00
1 2018-01-01 12:00:00
2 2018-01-01 12:00:00
dtype: datetime64[ns]
pandas.DatetimeIndex.floor
Raises
ValueError if the freq cannot be converted.
Examples
DatetimeIndex
Series
>>> pd.Series(rng).dt.floor("H")
0 2018-01-01 11:00:00
1 2018-01-01 12:00:00
2 2018-01-01 12:00:00
dtype: datetime64[ns]
pandas.DatetimeIndex.ceil
• ‘shift_backward’ will shift the nonexistent time backward to the closest ex-
isting time
• ‘NaT’ will return NaT where there are nonexistent times
• timedelta objects will shift nonexistent times by the timedelta
• ‘raise’ will raise an NonExistentTimeError if there are nonexistent times
New in version 0.24.0.
Returns
DatetimeIndex, TimedeltaIndex, or Series Index of the same type for a
DatetimeIndex or TimedeltaIndex, or a Series with the same index for a Series.
Raises
ValueError if the freq cannot be converted.
Examples
DatetimeIndex
Series
>>> pd.Series(rng).dt.ceil("H")
0 2018-01-01 12:00:00
1 2018-01-01 12:00:00
2 2018-01-01 13:00:00
dtype: datetime64[ns]
pandas.DatetimeIndex.to_period
Examples
pandas.DatetimeIndex.to_perioddelta
pandas.DatetimeIndex.to_pydatetime
pandas.DatetimeIndex.to_series
Parameters
keep_tz [optional, defaults False] Return the data keeping the timezone.
If keep_tz is True:
If the timezone is not set, the resulting Series will have a datetime64[ns]
dtype.
Otherwise the Series will have a datetime64[ns, tz] dtype; the tz will be
preserved.
If keep_tz is False:
Series will have a datetime64[ns] dtype. TZ aware objects will have the tz
removed.
Changed in version 0.24: The default value will change to True in a future
release. You can set keep_tz=True to already obtain the future behaviour and
silence the warning.
index [Index, optional] index of resulting Series. If None, defaults to original
index
name [string, optional] name of resulting Series. If None, defaults to name of
original index
Returns
Series
pandas.DatetimeIndex.to_frame
Examples
>>> idx.to_frame(index=False)
animal
0 Ant
1 Bear
2 Cow
pandas.DatetimeIndex.month_name
Examples
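For illustration, month names for three month-end dates:
>>> idx = pd.date_range(start='2018-01', freq='M', periods=3)
>>> idx.month_name()
Index(['January', 'February', 'March'], dtype='object')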
pandas.DatetimeIndex.day_name
Examples
pandas.DatetimeIndex.mean
numpy.ndarray.mean
Series.mean Return the mean value in a Series.
Notes
mean is only defined for Datetime and Timedelta dtypes, not for Period.
Time/Date components
pandas.DatetimeIndex.tz
DatetimeIndex.tz
Selecting
pandas.DatetimeIndex.indexer_at_time
indexer_between_time, DataFrame.at_time
pandas.DatetimeIndex.indexer_between_time
indexer_at_time, DataFrame.between_time
Time-specific operations
Conversion
Methods
6.7.7 TimedeltaIndex
pandas.TimedeltaIndex
class pandas.TimedeltaIndex
Immutable ndarray of timedelta64 data, represented internally as int64, and which can be boxed to
timedelta objects
Parameters
data [array-like (1-dimensional), optional] Optional timedelta-like data to construct
index with
unit [str (D, h, m, s, ms, us, ns), optional] Denotes the unit of the data when it is
an integer/float number.
freq [string or pandas offset object, optional] One of pandas date offset strings or cor-
responding objects. The string ‘infer’ can be passed in order to set the frequency
of the index as the inferred frequency upon creation
copy [bool] Make a copy of input ndarray
start [starting value, timedelta-like, optional] If data is None, start is used as the
start point in generating regular timedelta data.
Deprecated since version 0.24.0.
periods [int, optional, > 0] Number of periods to generate, if generating index. Takes
precedence over end argument
Deprecated since version 0.24.0.
end [end time, timedelta-like, optional] If periods is none, generated index will extend
to first conforming time on or just past end argument
Deprecated since version 0.24.0.
closed [string or None, default None] Make the interval closed with respect to the
given frequency to the ‘left’, ‘right’, or both sides (None)
Deprecated since version 0.24.0.
name [object] Name to be stored in the index
See also:
Notes
To learn more about the frequency strings, please see this link.
Creating a TimedeltaIndex based on start, periods, and end has been deprecated in favor of
timedelta_range().
Attributes
pandas.TimedeltaIndex.days
TimedeltaIndex.days
Number of days for each element.
pandas.TimedeltaIndex.seconds
TimedeltaIndex.seconds
Number of seconds (>= 0 and less than 1 day) for each element.
pandas.TimedeltaIndex.microseconds
TimedeltaIndex.microseconds
Number of microseconds (>= 0 and less than 1 second) for each element.
pandas.TimedeltaIndex.nanoseconds
TimedeltaIndex.nanoseconds
Number of nanoseconds (>= 0 and less than 1 microsecond) for each element.
pandas.TimedeltaIndex.components
TimedeltaIndex.components
Return a dataframe of the components (days, hours, minutes, seconds, milliseconds, microsec-
onds, nanoseconds) of the Timedeltas.
Returns
a DataFrame
pandas.TimedeltaIndex.inferred_freq
TimedeltaIndex.inferred_freq
Tries to return a string representing a frequency guess, generated by infer_freq. Returns None
if it can't autodetect the frequency.
Methods
pandas.TimedeltaIndex.to_pytimedelta
pandas.TimedeltaIndex.to_series
pandas.TimedeltaIndex.round
Examples
DatetimeIndex
Series
>>> pd.Series(rng).dt.round("H")
0 2018-01-01 12:00:00
1 2018-01-01 12:00:00
2 2018-01-01 12:00:00
dtype: datetime64[ns]
pandas.TimedeltaIndex.floor
Examples
DatetimeIndex
Series
>>> pd.Series(rng).dt.floor("H")
0 2018-01-01 11:00:00
1 2018-01-01 12:00:00
2 2018-01-01 12:00:00
dtype: datetime64[ns]
pandas.TimedeltaIndex.ceil
freq [str or Offset] The frequency level to ceil the index to. Must be a fixed
frequency like ‘S’ (second) not ‘ME’ (month end). See frequency aliases for a
list of possible freq values.
ambiguous [‘infer’, bool-ndarray, ‘NaT’, default ‘raise’] Only relevant for Date-
timeIndex:
• ‘infer’ will attempt to infer fall dst-transition hours based on order
• bool-ndarray where True signifies a DST time, False designates a non-DST
time (note that this flag is only applicable for ambiguous times)
• ‘NaT’ will return NaT where there are ambiguous times
• ‘raise’ will raise an AmbiguousTimeError if there are ambiguous times
New in version 0.24.0.
nonexistent [‘shift_forward’, ‘shift_backward’, ‘NaT’, timedelta, default ‘raise’]
A nonexistent time does not exist in a particular timezone where clocks moved
forward due to DST.
• ‘shift_forward’ will shift the nonexistent time forward to the closest existing
time
• ‘shift_backward’ will shift the nonexistent time backward to the closest ex-
isting time
• ‘NaT’ will return NaT where there are nonexistent times
• timedelta objects will shift nonexistent times by the timedelta
• ‘raise’ will raise an NonExistentTimeError if there are nonexistent times
New in version 0.24.0.
Returns
DatetimeIndex, TimedeltaIndex, or Series Index of the same type for a
DatetimeIndex or TimedeltaIndex, or a Series with the same index for a Series.
Raises
ValueError if the freq cannot be converted.
Examples
DatetimeIndex
Series
>>> pd.Series(rng).dt.ceil("H")
0 2018-01-01 12:00:00
1 2018-01-01 12:00:00
2 2018-01-01 13:00:00
dtype: datetime64[ns]
pandas.TimedeltaIndex.to_frame
Examples
>>> idx.to_frame(index=False)
animal
0 Ant
1 Bear
2 Cow
pandas.TimedeltaIndex.mean
numpy.ndarray.mean
Series.mean Return the mean value in a Series.
Notes
mean is only defined for Datetime and Timedelta dtypes, not for Period.
Components
Conversion
Methods
6.7.8 PeriodIndex
pandas.PeriodIndex
class pandas.PeriodIndex
Immutable ndarray holding ordinal values indicating regular periods in time such as particular years,
quarters, months, etc.
Index keys are boxed to Period objects which carry the metadata (e.g., frequency information).
Parameters
data [array-like (1d integer np.ndarray or PeriodArray), optional] Optional period-
like data to construct index with
copy [bool] Make a copy of input ndarray
freq [string or period object, optional] One of pandas period strings or corresponding
objects
start [starting value, period-like, optional] If data is None, used as the start point in
generating regular period data.
Deprecated since version 0.24.0.
periods [int, optional, > 0] Number of periods to generate, if generating index. Takes
precedence over end argument
Deprecated since version 0.24.0.
end [end value, period-like, optional] If periods is none, generated index will extend
to first conforming period on or just past end argument
Deprecated since version 0.24.0.
year [int, array, or Series, default None]
month [int, array, or Series, default None]
quarter [int, array, or Series, default None]
Notes
Creating a PeriodIndex based on start, periods, and end has been deprecated in favor of
period_range().
Examples
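For illustration, building a quarterly PeriodIndex from year/quarter arrays (repr details may vary slightly by version):
>>> idx = pd.PeriodIndex(year=[2000, 2002], quarter=[1, 3])
>>> idx
PeriodIndex(['2000Q1', '2002Q3'], dtype='period[Q-DEC]', freq='Q-DEC')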
Attributes
pandas.PeriodIndex.day
PeriodIndex.day
The days of the period
pandas.PeriodIndex.dayofweek
PeriodIndex.dayofweek
The day of the week with Monday=0, Sunday=6
pandas.PeriodIndex.dayofyear
PeriodIndex.dayofyear
The ordinal day of the year
pandas.PeriodIndex.days_in_month
PeriodIndex.days_in_month
The number of days in the month
pandas.PeriodIndex.daysinmonth
PeriodIndex.daysinmonth
The number of days in the month
pandas.PeriodIndex.freq
PeriodIndex.freq
Return the frequency object if it is set, otherwise None.
pandas.PeriodIndex.freqstr
PeriodIndex.freqstr
Return the frequency object as a string if it is set, otherwise None.
pandas.PeriodIndex.hour
PeriodIndex.hour
The hour of the period
pandas.PeriodIndex.is_leap_year
PeriodIndex.is_leap_year
Logical indicating if the date belongs to a leap year
pandas.PeriodIndex.minute
PeriodIndex.minute
The minute of the period
pandas.PeriodIndex.month
PeriodIndex.month
The month as January=1, December=12
pandas.PeriodIndex.quarter
PeriodIndex.quarter
The quarter of the date
pandas.PeriodIndex.second
PeriodIndex.second
The second of the period
pandas.PeriodIndex.week
PeriodIndex.week
The week ordinal of the year
pandas.PeriodIndex.weekday
PeriodIndex.weekday
The day of the week with Monday=0, Sunday=6
pandas.PeriodIndex.weekofyear
PeriodIndex.weekofyear
The week ordinal of the year
pandas.PeriodIndex.year
PeriodIndex.year
The year of the period
end_time
qyear
start_time
Methods
pandas.PeriodIndex.asfreq
Examples
>>> pidx.asfreq('M')
PeriodIndex(['2010-12', '2011-12', '2012-12', '2013-12', '2014-12',
'2015-12'], dtype='period[M]', freq='M')
pandas.PeriodIndex.strftime
Parameters
date_format [str] Date format string (e.g. “%Y-%m-%d”).
Returns
Index Index of formatted strings.
See also:
Examples
pandas.PeriodIndex.to_timestamp
Properties
pandas.PeriodIndex.end_time
PeriodIndex.end_time
pandas.PeriodIndex.qyear
PeriodIndex.qyear
pandas.PeriodIndex.start_time
PeriodIndex.start_time
Methods
PeriodIndex.asfreq(self, \*args, \*\*kwargs) Convert the Period Array/Index to the specified fre-
quency freq.
PeriodIndex.strftime(self, \*args, \*\*kwargs) Convert to Index using specified date_format.
PeriodIndex.to_timestamp(self, \*args, …) Cast to DatetimeArray/Index.
6.8.1 DateOffset
pandas.tseries.offsets.DateOffset
Works exactly like relativedelta in terms of the keyword args you pass in. Use of the keyword n is
discouraged; you would be better off specifying n in the keywords you use, but regardless it is there
for you. n is needed for DateOffset subclasses.
DateOffsets work as follows. Each offset specifies a set of dates that conform to the DateOffset. For
example, Bday defines this set to be the set of dates that are weekdays (M-F). To test if a date is in
the set of a DateOffset dateOffset we can use the onOffset method: dateOffset.onOffset(date).
If a date is not on a valid date, the rollback and rollforward methods can be used to roll the date to
the nearest valid date before/after the date.
DateOffsets can be created to move dates forward a given number of valid dates. For example, Bday(2)
can be added to a date to move it two business days forward. If the date does not start on a valid
date, first it is moved to a valid date. Thus pseudo code is:
def __add__(date):
    date = rollback(date)  # does nothing if date is valid
    return date + <n number of periods>
When a date offset is created for a negative number of periods, the date is first rolled forward. The
pseudo code is:
def __add__(date):
    date = rollforward(date)  # does nothing if date is valid
    return date + <n number of periods>
Zero presents a problem. Should it roll forward or back? We arbitrarily have it rollforward:
date + BDay(0) == BDay.rollforward(date)
Since 0 is a bit weird, we suggest avoiding its use.
Parameters
n [int, default 1] The number of time periods the offset represents.
normalize [bool, default False] Whether to round the result of a DateOffset addition
down to the previous midnight.
**kwds Temporal parameters that add to or replace the offset value.
Parameters that add to the offset (like Timedelta):
• years
• months
• weeks
• days
• hours
• minutes
• seconds
• microseconds
• nanoseconds
Parameters that replace the offset value:
• year
• month
• day
• weekday
• hour
• minute
• second
• microsecond
• nanosecond
See also:
dateutil.relativedelta.relativedelta
Examples
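For illustration, adding a three-month DateOffset to a Timestamp:
>>> from pandas.tseries.offsets import DateOffset
>>> ts = pd.Timestamp('2017-01-01 09:10:11')
>>> ts + DateOffset(months=3)
Timestamp('2017-04-01 09:10:11')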
Attributes
pandas.tseries.offsets.DateOffset.base
DateOffset.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
name
nanos
rule_code
Methods
pandas.tseries.offsets.DateOffset.apply_index
DateOffset.apply_index(self, other)
Vectorized apply of DateOffset to DatetimeIndex; raises NotImplementedError for offsets without
a vectorized implementation.
Parameters
i [DatetimeIndex]
Returns
y [DatetimeIndex]
pandas.tseries.offsets.DateOffset.rollback
DateOffset.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.DateOffset.rollforward
DateOffset.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
apply
copy
isAnchored
onOffset
Properties
DateOffset.freqstr
DateOffset.kwds
DateOffset.name
DateOffset.nanos
DateOffset.normalize
DateOffset.rule_code
pandas.tseries.offsets.DateOffset.freqstr
DateOffset.freqstr
pandas.tseries.offsets.DateOffset.kwds
DateOffset.kwds
pandas.tseries.offsets.DateOffset.name
DateOffset.name
pandas.tseries.offsets.DateOffset.nanos
DateOffset.nanos
pandas.tseries.offsets.DateOffset.normalize
DateOffset.normalize = False
pandas.tseries.offsets.DateOffset.rule_code
DateOffset.rule_code
Methods
DateOffset.apply(self, other)
DateOffset.copy(self)
DateOffset.isAnchored(self)
DateOffset.onOffset(self, dt)
pandas.tseries.offsets.DateOffset.apply
DateOffset.apply(self, other)
pandas.tseries.offsets.DateOffset.copy
DateOffset.copy(self )
pandas.tseries.offsets.DateOffset.isAnchored
DateOffset.isAnchored(self )
pandas.tseries.offsets.DateOffset.onOffset
DateOffset.onOffset(self, dt)
6.8.2 BusinessDay
pandas.tseries.offsets.BusinessDay
Attributes
pandas.tseries.offsets.BusinessDay.base
BusinessDay.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
pandas.tseries.offsets.BusinessDay.offset
BusinessDay.offset
Alias for self._offset.
freqstr
kwds
name
nanos
rule_code
Methods
pandas.tseries.offsets.BusinessDay.rollback
BusinessDay.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.BusinessDay.rollforward
BusinessDay.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
apply
apply_index
copy
isAnchored
onOffset
Properties
BusinessDay.freqstr
BusinessDay.kwds
BusinessDay.name
BusinessDay.nanos
BusinessDay.normalize
BusinessDay.rule_code
pandas.tseries.offsets.BusinessDay.freqstr
BusinessDay.freqstr
pandas.tseries.offsets.BusinessDay.kwds
BusinessDay.kwds
pandas.tseries.offsets.BusinessDay.name
BusinessDay.name
pandas.tseries.offsets.BusinessDay.nanos
BusinessDay.nanos
pandas.tseries.offsets.BusinessDay.normalize
BusinessDay.normalize = False
pandas.tseries.offsets.BusinessDay.rule_code
BusinessDay.rule_code
Methods
BusinessDay.apply(self, other)
BusinessDay.apply_index(self, other)
BusinessDay.copy(self)
BusinessDay.isAnchored(self)
BusinessDay.onOffset(self, dt)
pandas.tseries.offsets.BusinessDay.apply
BusinessDay.apply(self, other)
pandas.tseries.offsets.BusinessDay.apply_index
BusinessDay.apply_index(self, other)
pandas.tseries.offsets.BusinessDay.copy
BusinessDay.copy(self )
pandas.tseries.offsets.BusinessDay.isAnchored
BusinessDay.isAnchored(self )
pandas.tseries.offsets.BusinessDay.onOffset
BusinessDay.onOffset(self, dt)
6.8.3 BusinessHour
BusinessHour([n, normalize, start, end, offset]) DateOffset subclass representing possibly n busi-
ness hours.
pandas.tseries.offsets.BusinessHour
Attributes
pandas.tseries.offsets.BusinessHour.base
BusinessHour.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
pandas.tseries.offsets.BusinessHour.next_bday
BusinessHour.next_bday
Used for moving to next business day.
pandas.tseries.offsets.BusinessHour.offset
BusinessHour.offset
Alias for self._offset.
freqstr
kwds
name
nanos
rule_code
Methods
pandas.tseries.offsets.BusinessHour.apply_index
BusinessHour.apply_index(self, other)
Vectorized apply of DateOffset to DatetimeIndex; raises NotImplementedError for offsets without
a vectorized implementation.
Parameters
i [DatetimeIndex]
Returns
y [DatetimeIndex]
pandas.tseries.offsets.BusinessHour.rollback
BusinessHour.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
pandas.tseries.offsets.BusinessHour.rollforward
BusinessHour.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
__call__
apply
copy
isAnchored
onOffset
Properties
BusinessHour.freqstr
BusinessHour.kwds
BusinessHour.name
BusinessHour.nanos
BusinessHour.normalize
BusinessHour.rule_code
pandas.tseries.offsets.BusinessHour.freqstr
BusinessHour.freqstr
pandas.tseries.offsets.BusinessHour.kwds
BusinessHour.kwds
pandas.tseries.offsets.BusinessHour.name
BusinessHour.name
pandas.tseries.offsets.BusinessHour.nanos
BusinessHour.nanos
pandas.tseries.offsets.BusinessHour.normalize
BusinessHour.normalize = False
pandas.tseries.offsets.BusinessHour.rule_code
BusinessHour.rule_code
Methods
BusinessHour.apply(self, other)
BusinessHour.copy(self)
BusinessHour.isAnchored(self)
BusinessHour.onOffset(self, dt)
pandas.tseries.offsets.BusinessHour.apply
BusinessHour.apply(self, other)
pandas.tseries.offsets.BusinessHour.copy
BusinessHour.copy(self )
pandas.tseries.offsets.BusinessHour.isAnchored
BusinessHour.isAnchored(self )
pandas.tseries.offsets.BusinessHour.onOffset
BusinessHour.onOffset(self, dt)
6.8.4 CustomBusinessDay
pandas.tseries.offsets.CustomBusinessDay
normalize [bool, default False] Normalize start/end dates to midnight before gener-
ating date range
weekmask [str, Default ‘Mon Tue Wed Thu Fri’] weekmask of valid business days,
passed to numpy.busdaycalendar
holidays [list] list/array of dates to exclude from the set of valid business days, passed
to numpy.busdaycalendar
calendar [pd.HolidayCalendar or np.busdaycalendar]
offset [timedelta, default timedelta(0)]
Attributes
pandas.tseries.offsets.CustomBusinessDay.base
CustomBusinessDay.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
pandas.tseries.offsets.CustomBusinessDay.offset
CustomBusinessDay.offset
Alias for self._offset.
freqstr
kwds
name
nanos
rule_code
Methods
pandas.tseries.offsets.CustomBusinessDay.apply_index
CustomBusinessDay.apply_index(self, i)
Vectorized apply of DateOffset to DatetimeIndex; raises NotImplementedError for offsets without
a vectorized implementation.
Parameters
i [DatetimeIndex]
Returns
y [DatetimeIndex]
pandas.tseries.offsets.CustomBusinessDay.rollback
CustomBusinessDay.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.CustomBusinessDay.rollforward
CustomBusinessDay.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
apply
copy
isAnchored
onOffset
Properties
CustomBusinessDay.freqstr
CustomBusinessDay.kwds
CustomBusinessDay.name
CustomBusinessDay.nanos
CustomBusinessDay.normalize
CustomBusinessDay.rule_code
pandas.tseries.offsets.CustomBusinessDay.freqstr
CustomBusinessDay.freqstr
pandas.tseries.offsets.CustomBusinessDay.kwds
CustomBusinessDay.kwds
pandas.tseries.offsets.CustomBusinessDay.name
CustomBusinessDay.name
pandas.tseries.offsets.CustomBusinessDay.nanos
CustomBusinessDay.nanos
pandas.tseries.offsets.CustomBusinessDay.normalize
CustomBusinessDay.normalize = False
pandas.tseries.offsets.CustomBusinessDay.rule_code
CustomBusinessDay.rule_code
Methods
CustomBusinessDay.apply(self, other)
CustomBusinessDay.copy(self)
CustomBusinessDay.isAnchored(self)
CustomBusinessDay.onOffset(self, dt)
pandas.tseries.offsets.CustomBusinessDay.apply
CustomBusinessDay.apply(self, other)
pandas.tseries.offsets.CustomBusinessDay.copy
CustomBusinessDay.copy(self )
pandas.tseries.offsets.CustomBusinessDay.isAnchored
CustomBusinessDay.isAnchored(self )
pandas.tseries.offsets.CustomBusinessDay.onOffset
CustomBusinessDay.onOffset(self, dt)
6.8.5 CustomBusinessHour
pandas.tseries.offsets.CustomBusinessHour
Attributes
pandas.tseries.offsets.CustomBusinessHour.base
CustomBusinessHour.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
pandas.tseries.offsets.CustomBusinessHour.next_bday
CustomBusinessHour.next_bday
Used for moving to next business day.
pandas.tseries.offsets.CustomBusinessHour.offset
CustomBusinessHour.offset
Alias for self._offset.
freqstr
kwds
name
nanos
rule_code
Methods
pandas.tseries.offsets.CustomBusinessHour.apply_index
CustomBusinessHour.apply_index(self, other)
Vectorized apply of DateOffset to DatetimeIndex; raises NotImplementedError for offsets without
a vectorized implementation.
Parameters
i [DatetimeIndex]
Returns
y [DatetimeIndex]
pandas.tseries.offsets.CustomBusinessHour.rollback
CustomBusinessHour.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
pandas.tseries.offsets.CustomBusinessHour.rollforward
CustomBusinessHour.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
__call__
apply
copy
isAnchored
onOffset
Properties
CustomBusinessHour.freqstr
CustomBusinessHour.kwds
CustomBusinessHour.name
CustomBusinessHour.nanos
CustomBusinessHour.normalize
CustomBusinessHour.rule_code
pandas.tseries.offsets.CustomBusinessHour.freqstr
CustomBusinessHour.freqstr
pandas.tseries.offsets.CustomBusinessHour.kwds
CustomBusinessHour.kwds
pandas.tseries.offsets.CustomBusinessHour.name
CustomBusinessHour.name
pandas.tseries.offsets.CustomBusinessHour.nanos
CustomBusinessHour.nanos
pandas.tseries.offsets.CustomBusinessHour.normalize
CustomBusinessHour.normalize = False
pandas.tseries.offsets.CustomBusinessHour.rule_code
CustomBusinessHour.rule_code
Methods
CustomBusinessHour.apply(self, other)
CustomBusinessHour.copy(self)
CustomBusinessHour.isAnchored(self)
CustomBusinessHour.onOffset(self, dt)
pandas.tseries.offsets.CustomBusinessHour.apply
CustomBusinessHour.apply(self, other)
pandas.tseries.offsets.CustomBusinessHour.copy
CustomBusinessHour.copy(self )
pandas.tseries.offsets.CustomBusinessHour.isAnchored
CustomBusinessHour.isAnchored(self )
pandas.tseries.offsets.CustomBusinessHour.onOffset
CustomBusinessHour.onOffset(self, dt)
6.8.6 MonthOffset
MonthOffset
Attributes
pandas.tseries.offsets.MonthOffset
class pandas.tseries.offsets.MonthOffset
Attributes
pandas.tseries.offsets.MonthOffset.base
MonthOffset.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
name
nanos
rule_code
Methods
pandas.tseries.offsets.MonthOffset.rollback
MonthOffset.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.MonthOffset.rollforward
MonthOffset.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
apply
apply_index
copy
isAnchored
onOffset
Properties
MonthOffset.freqstr
MonthOffset.kwds
MonthOffset.name
MonthOffset.nanos
MonthOffset.normalize
MonthOffset.rule_code
pandas.tseries.offsets.MonthOffset.freqstr
MonthOffset.freqstr
pandas.tseries.offsets.MonthOffset.kwds
MonthOffset.kwds
pandas.tseries.offsets.MonthOffset.name
MonthOffset.name
pandas.tseries.offsets.MonthOffset.nanos
MonthOffset.nanos
pandas.tseries.offsets.MonthOffset.normalize
MonthOffset.normalize = False
pandas.tseries.offsets.MonthOffset.rule_code
MonthOffset.rule_code
Methods
MonthOffset.apply(self, other)
MonthOffset.apply_index(self, other)
MonthOffset.copy(self)
MonthOffset.isAnchored(self)
MonthOffset.onOffset(self, dt)
pandas.tseries.offsets.MonthOffset.apply
MonthOffset.apply(self, other)
pandas.tseries.offsets.MonthOffset.apply_index
MonthOffset.apply_index(self, other)
pandas.tseries.offsets.MonthOffset.copy
MonthOffset.copy(self )
pandas.tseries.offsets.MonthOffset.isAnchored
MonthOffset.isAnchored(self )
pandas.tseries.offsets.MonthOffset.onOffset
MonthOffset.onOffset(self, dt)
6.8.7 MonthEnd
pandas.tseries.offsets.MonthEnd
class pandas.tseries.offsets.MonthEnd
DateOffset of one month end.
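A minimal usage sketch (expected pandas 0.25 behaviour):

import pandas as pd
from pandas.tseries.offsets import MonthEnd

ts = pd.Timestamp('2019-10-02')
print(ts + MonthEnd())              # 2019-10-31, rolls to the end of the current month
print(ts + MonthEnd(2))             # 2019-11-30
print(MonthEnd().rollforward(ts))   # 2019-10-31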
Attributes
pandas.tseries.offsets.MonthEnd.base
MonthEnd.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
name
nanos
rule_code
Methods
pandas.tseries.offsets.MonthEnd.rollback
MonthEnd.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.MonthEnd.rollforward
MonthEnd.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
apply
apply_index
copy
isAnchored
onOffset
Properties
MonthEnd.freqstr
MonthEnd.kwds
MonthEnd.name
MonthEnd.nanos
MonthEnd.normalize
MonthEnd.rule_code
pandas.tseries.offsets.MonthEnd.freqstr
MonthEnd.freqstr
pandas.tseries.offsets.MonthEnd.kwds
MonthEnd.kwds
pandas.tseries.offsets.MonthEnd.name
MonthEnd.name
pandas.tseries.offsets.MonthEnd.nanos
MonthEnd.nanos
pandas.tseries.offsets.MonthEnd.normalize
MonthEnd.normalize = False
pandas.tseries.offsets.MonthEnd.rule_code
MonthEnd.rule_code
Methods
MonthEnd.apply(self, other)
MonthEnd.apply_index(self, other)
MonthEnd.copy(self)
MonthEnd.isAnchored(self)
MonthEnd.onOffset(self, dt)
pandas.tseries.offsets.MonthEnd.apply
MonthEnd.apply(self, other)
pandas.tseries.offsets.MonthEnd.apply_index
MonthEnd.apply_index(self, other)
pandas.tseries.offsets.MonthEnd.copy
MonthEnd.copy(self )
pandas.tseries.offsets.MonthEnd.isAnchored
MonthEnd.isAnchored(self )
pandas.tseries.offsets.MonthEnd.onOffset
MonthEnd.onOffset(self, dt)
6.8.8 MonthBegin
pandas.tseries.offsets.MonthBegin
class pandas.tseries.offsets.MonthBegin
DateOffset of one month at beginning.
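A minimal usage sketch (expected pandas 0.25 behaviour; 'MS' is the frequency alias for this offset):

import pandas as pd
from pandas.tseries.offsets import MonthBegin

ts = pd.Timestamp('2019-10-02')
print(ts + MonthBegin())            # 2019-11-01, the next month start
print(ts - MonthBegin())            # 2019-10-01, the previous month start
print(pd.date_range('2019-01-31', periods=3, freq='MS'))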
Attributes
pandas.tseries.offsets.MonthBegin.base
MonthBegin.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
name
nanos
rule_code
Methods
pandas.tseries.offsets.MonthBegin.rollback
MonthBegin.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.MonthBegin.rollforward
MonthBegin.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
apply
apply_index
copy
isAnchored
onOffset
Properties
MonthBegin.freqstr
MonthBegin.kwds
MonthBegin.name
MonthBegin.nanos
MonthBegin.normalize
MonthBegin.rule_code
pandas.tseries.offsets.MonthBegin.freqstr
MonthBegin.freqstr
pandas.tseries.offsets.MonthBegin.kwds
MonthBegin.kwds
pandas.tseries.offsets.MonthBegin.name
MonthBegin.name
pandas.tseries.offsets.MonthBegin.nanos
MonthBegin.nanos
pandas.tseries.offsets.MonthBegin.normalize
MonthBegin.normalize = False
pandas.tseries.offsets.MonthBegin.rule_code
MonthBegin.rule_code
Methods
MonthBegin.apply(self, other)
MonthBegin.apply_index(self, other)
MonthBegin.copy(self)
MonthBegin.isAnchored(self)
MonthBegin.onOffset(self, dt)
pandas.tseries.offsets.MonthBegin.apply
MonthBegin.apply(self, other)
pandas.tseries.offsets.MonthBegin.apply_index
MonthBegin.apply_index(self, other)
pandas.tseries.offsets.MonthBegin.copy
MonthBegin.copy(self )
pandas.tseries.offsets.MonthBegin.isAnchored
MonthBegin.isAnchored(self )
pandas.tseries.offsets.MonthBegin.onOffset
MonthBegin.onOffset(self, dt)
6.8.9 BusinessMonthEnd
pandas.tseries.offsets.BusinessMonthEnd
class pandas.tseries.offsets.BusinessMonthEnd
DateOffset increments between business EOM dates.
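A minimal usage sketch (expected pandas 0.25 behaviour; 'BM' is the frequency alias for this offset):

import pandas as pd
from pandas.tseries.offsets import BusinessMonthEnd

# 2019-06-28 is the last weekday of June 2019 (the 30th is a Sunday).
print(pd.Timestamp('2019-06-15') + BusinessMonthEnd())     # 2019-06-28
print(pd.date_range('2019-06-01', periods=3, freq='BM'))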
Attributes
pandas.tseries.offsets.BusinessMonthEnd.base
BusinessMonthEnd.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
name
nanos
rule_code
Methods
pandas.tseries.offsets.BusinessMonthEnd.rollback
BusinessMonthEnd.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.BusinessMonthEnd.rollforward
BusinessMonthEnd.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
apply
apply_index
copy
isAnchored
onOffset
Properties
BusinessMonthEnd.freqstr
BusinessMonthEnd.kwds
BusinessMonthEnd.name
BusinessMonthEnd.nanos
BusinessMonthEnd.normalize
BusinessMonthEnd.rule_code
pandas.tseries.offsets.BusinessMonthEnd.freqstr
BusinessMonthEnd.freqstr
pandas.tseries.offsets.BusinessMonthEnd.kwds
BusinessMonthEnd.kwds
pandas.tseries.offsets.BusinessMonthEnd.name
BusinessMonthEnd.name
pandas.tseries.offsets.BusinessMonthEnd.nanos
BusinessMonthEnd.nanos
pandas.tseries.offsets.BusinessMonthEnd.normalize
BusinessMonthEnd.normalize = False
pandas.tseries.offsets.BusinessMonthEnd.rule_code
BusinessMonthEnd.rule_code
Methods
BusinessMonthEnd.apply(self, other)
BusinessMonthEnd.apply_index(self, other)
BusinessMonthEnd.copy(self)
BusinessMonthEnd.isAnchored(self)
BusinessMonthEnd.onOffset(self, dt)
pandas.tseries.offsets.BusinessMonthEnd.apply
BusinessMonthEnd.apply(self, other)
pandas.tseries.offsets.BusinessMonthEnd.apply_index
BusinessMonthEnd.apply_index(self, other)
pandas.tseries.offsets.BusinessMonthEnd.copy
BusinessMonthEnd.copy(self )
pandas.tseries.offsets.BusinessMonthEnd.isAnchored
BusinessMonthEnd.isAnchored(self )
pandas.tseries.offsets.BusinessMonthEnd.onOffset
BusinessMonthEnd.onOffset(self, dt)
6.8.10 BusinessMonthBegin
pandas.tseries.offsets.BusinessMonthBegin
class pandas.tseries.offsets.BusinessMonthBegin
DateOffset of one business month at beginning.
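A minimal usage sketch (expected pandas 0.25 behaviour):

import pandas as pd
from pandas.tseries.offsets import BusinessMonthBegin

print(pd.Timestamp('2019-06-15') + BusinessMonthBegin())   # 2019-07-01, first weekday of July
print(pd.Timestamp('2019-09-01') + BusinessMonthBegin())   # 2019-09-02, since Sep 1 2019 is a Sunday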
Attributes
pandas.tseries.offsets.BusinessMonthBegin.base
BusinessMonthBegin.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
name
nanos
rule_code
Methods
pandas.tseries.offsets.BusinessMonthBegin.rollback
BusinessMonthBegin.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.BusinessMonthBegin.rollforward
BusinessMonthBegin.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
apply
apply_index
copy
isAnchored
onOffset
Properties
BusinessMonthBegin.freqstr
BusinessMonthBegin.kwds
BusinessMonthBegin.name
BusinessMonthBegin.nanos
BusinessMonthBegin.normalize
BusinessMonthBegin.rule_code
pandas.tseries.offsets.BusinessMonthBegin.freqstr
BusinessMonthBegin.freqstr
pandas.tseries.offsets.BusinessMonthBegin.kwds
BusinessMonthBegin.kwds
pandas.tseries.offsets.BusinessMonthBegin.name
BusinessMonthBegin.name
pandas.tseries.offsets.BusinessMonthBegin.nanos
BusinessMonthBegin.nanos
pandas.tseries.offsets.BusinessMonthBegin.normalize
BusinessMonthBegin.normalize = False
pandas.tseries.offsets.BusinessMonthBegin.rule_code
BusinessMonthBegin.rule_code
Methods
BusinessMonthBegin.apply(self, other)
BusinessMonthBegin.apply_index(self, other)
BusinessMonthBegin.copy(self)
BusinessMonthBegin.isAnchored(self)
BusinessMonthBegin.onOffset(self, dt)
pandas.tseries.offsets.BusinessMonthBegin.apply
BusinessMonthBegin.apply(self, other)
pandas.tseries.offsets.BusinessMonthBegin.apply_index
BusinessMonthBegin.apply_index(self, other)
pandas.tseries.offsets.BusinessMonthBegin.copy
BusinessMonthBegin.copy(self )
pandas.tseries.offsets.BusinessMonthBegin.isAnchored
BusinessMonthBegin.isAnchored(self )
pandas.tseries.offsets.BusinessMonthBegin.onOffset
BusinessMonthBegin.onOffset(self, dt)
6.8.11 CustomBusinessMonthEnd
pandas.tseries.offsets.CustomBusinessMonthEnd
Attributes
pandas.tseries.offsets.CustomBusinessMonthEnd.base
CustomBusinessMonthEnd.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
pandas.tseries.offsets.CustomBusinessMonthEnd.cbday_roll
CustomBusinessMonthEnd.cbday_roll
Define default roll function to be called in apply method.
pandas.tseries.offsets.CustomBusinessMonthEnd.month_roll
CustomBusinessMonthEnd.month_roll
Define default roll function to be called in apply method.
pandas.tseries.offsets.CustomBusinessMonthEnd.offset
CustomBusinessMonthEnd.offset
Alias for self._offset.
freqstr
kwds
m_offset
name
nanos
rule_code
Methods
pandas.tseries.offsets.CustomBusinessMonthEnd.apply_index
CustomBusinessMonthEnd.apply_index(self, other)
Vectorized apply of DateOffset to DatetimeIndex; raises NotImplementedError for offsets without a vectorized implementation.
Parameters
i [DatetimeIndex]
Returns
y [DatetimeIndex]
pandas.tseries.offsets.CustomBusinessMonthEnd.rollback
CustomBusinessMonthEnd.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.CustomBusinessMonthEnd.rollforward
CustomBusinessMonthEnd.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
apply
copy
isAnchored
onOffset
Properties
CustomBusinessMonthEnd.freqstr
CustomBusinessMonthEnd.kwds
CustomBusinessMonthEnd.m_offset
CustomBusinessMonthEnd.name
CustomBusinessMonthEnd.nanos
CustomBusinessMonthEnd.normalize
CustomBusinessMonthEnd.rule_code
pandas.tseries.offsets.CustomBusinessMonthEnd.freqstr
CustomBusinessMonthEnd.freqstr
pandas.tseries.offsets.CustomBusinessMonthEnd.kwds
CustomBusinessMonthEnd.kwds
pandas.tseries.offsets.CustomBusinessMonthEnd.m_offset
CustomBusinessMonthEnd.m_offset
pandas.tseries.offsets.CustomBusinessMonthEnd.name
CustomBusinessMonthEnd.name
pandas.tseries.offsets.CustomBusinessMonthEnd.nanos
CustomBusinessMonthEnd.nanos
pandas.tseries.offsets.CustomBusinessMonthEnd.normalize
CustomBusinessMonthEnd.normalize = False
pandas.tseries.offsets.CustomBusinessMonthEnd.rule_code
CustomBusinessMonthEnd.rule_code
Methods
CustomBusinessMonthEnd.apply(self, other)
CustomBusinessMonthEnd.copy(self)
CustomBusinessMonthEnd.isAnchored(self)
CustomBusinessMonthEnd.onOffset(self, dt)
pandas.tseries.offsets.CustomBusinessMonthEnd.apply
CustomBusinessMonthEnd.apply(self, other)
pandas.tseries.offsets.CustomBusinessMonthEnd.copy
CustomBusinessMonthEnd.copy(self )
pandas.tseries.offsets.CustomBusinessMonthEnd.isAnchored
CustomBusinessMonthEnd.isAnchored(self )
pandas.tseries.offsets.CustomBusinessMonthEnd.onOffset
CustomBusinessMonthEnd.onOffset(self, dt)
6.8.12 CustomBusinessMonthBegin
pandas.tseries.offsets.CustomBusinessMonthBegin
Attributes
pandas.tseries.offsets.CustomBusinessMonthBegin.base
CustomBusinessMonthBegin.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
pandas.tseries.offsets.CustomBusinessMonthBegin.cbday_roll
CustomBusinessMonthBegin.cbday_roll
Define default roll function to be called in apply method.
pandas.tseries.offsets.CustomBusinessMonthBegin.month_roll
CustomBusinessMonthBegin.month_roll
Define default roll function to be called in apply method.
pandas.tseries.offsets.CustomBusinessMonthBegin.offset
CustomBusinessMonthBegin.offset
Alias for self._offset.
freqstr
kwds
m_offset
name
nanos
rule_code
Methods
pandas.tseries.offsets.CustomBusinessMonthBegin.apply_index
CustomBusinessMonthBegin.apply_index(self, other)
Vectorized apply of DateOffset to DatetimeIndex; raises NotImplementedError for offsets without a vectorized implementation.
Parameters
i [DatetimeIndex]
Returns
y [DatetimeIndex]
pandas.tseries.offsets.CustomBusinessMonthBegin.rollback
CustomBusinessMonthBegin.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.CustomBusinessMonthBegin.rollforward
CustomBusinessMonthBegin.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
apply
copy
isAnchored
onOffset
Properties
CustomBusinessMonthBegin.freqstr
CustomBusinessMonthBegin.kwds
CustomBusinessMonthBegin.m_offset
CustomBusinessMonthBegin.name
CustomBusinessMonthBegin.nanos
CustomBusinessMonthBegin.normalize
CustomBusinessMonthBegin.rule_code
pandas.tseries.offsets.CustomBusinessMonthBegin.freqstr
CustomBusinessMonthBegin.freqstr
pandas.tseries.offsets.CustomBusinessMonthBegin.kwds
CustomBusinessMonthBegin.kwds
pandas.tseries.offsets.CustomBusinessMonthBegin.m_offset
CustomBusinessMonthBegin.m_offset
pandas.tseries.offsets.CustomBusinessMonthBegin.name
CustomBusinessMonthBegin.name
pandas.tseries.offsets.CustomBusinessMonthBegin.nanos
CustomBusinessMonthBegin.nanos
pandas.tseries.offsets.CustomBusinessMonthBegin.normalize
CustomBusinessMonthBegin.normalize = False
pandas.tseries.offsets.CustomBusinessMonthBegin.rule_code
CustomBusinessMonthBegin.rule_code
Methods
CustomBusinessMonthBegin.apply(self, other)
CustomBusinessMonthBegin.copy(self)
CustomBusinessMonthBegin.isAnchored(self)
CustomBusinessMonthBegin.onOffset(self, dt)
pandas.tseries.offsets.CustomBusinessMonthBegin.apply
CustomBusinessMonthBegin.apply(self, other)
pandas.tseries.offsets.CustomBusinessMonthBegin.copy
CustomBusinessMonthBegin.copy(self )
pandas.tseries.offsets.CustomBusinessMonthBegin.isAnchored
CustomBusinessMonthBegin.isAnchored(self )
pandas.tseries.offsets.CustomBusinessMonthBegin.onOffset
CustomBusinessMonthBegin.onOffset(self, dt)
6.8.13 SemiMonthOffset
Attributes
pandas.tseries.offsets.SemiMonthOffset
Attributes
pandas.tseries.offsets.SemiMonthOffset.base
SemiMonthOffset.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
name
nanos
rule_code
Methods
pandas.tseries.offsets.SemiMonthOffset.rollback
SemiMonthOffset.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.SemiMonthOffset.rollforward
SemiMonthOffset.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
apply
apply_index
copy
isAnchored
onOffset
Properties
SemiMonthOffset.freqstr
SemiMonthOffset.kwds
SemiMonthOffset.name
SemiMonthOffset.nanos
SemiMonthOffset.normalize
SemiMonthOffset.rule_code
pandas.tseries.offsets.SemiMonthOffset.freqstr
SemiMonthOffset.freqstr
pandas.tseries.offsets.SemiMonthOffset.kwds
SemiMonthOffset.kwds
pandas.tseries.offsets.SemiMonthOffset.name
SemiMonthOffset.name
pandas.tseries.offsets.SemiMonthOffset.nanos
SemiMonthOffset.nanos
pandas.tseries.offsets.SemiMonthOffset.normalize
SemiMonthOffset.normalize = False
pandas.tseries.offsets.SemiMonthOffset.rule_code
SemiMonthOffset.rule_code
Methods
SemiMonthOffset.apply(self, other)
SemiMonthOffset.apply_index(self, other)
SemiMonthOffset.copy(self)
SemiMonthOffset.isAnchored(self)
SemiMonthOffset.onOffset(self, dt)
pandas.tseries.offsets.SemiMonthOffset.apply
SemiMonthOffset.apply(self, other)
pandas.tseries.offsets.SemiMonthOffset.apply_index
SemiMonthOffset.apply_index(self, other)
pandas.tseries.offsets.SemiMonthOffset.copy
SemiMonthOffset.copy(self )
pandas.tseries.offsets.SemiMonthOffset.isAnchored
SemiMonthOffset.isAnchored(self )
pandas.tseries.offsets.SemiMonthOffset.onOffset
SemiMonthOffset.onOffset(self, dt)
6.8.14 SemiMonthEnd
SemiMonthEnd([n, normalize, day_of_month]) Two DateOffsets per month repeating on the last day of the month and day_of_month.
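A minimal usage sketch (expected pandas 0.25 behaviour; day_of_month defaults to 15 and 'SM' is the frequency alias):

import pandas as pd
from pandas.tseries.offsets import SemiMonthEnd

ts = pd.Timestamp('2019-01-02')
print(ts + SemiMonthEnd())                                 # 2019-01-15
print(ts + SemiMonthEnd(day_of_month=20))                  # 2019-01-20
print(pd.date_range('2019-01-01', periods=4, freq='SM'))   # the 15th and the month end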
pandas.tseries.offsets.SemiMonthEnd
Attributes
pandas.tseries.offsets.SemiMonthEnd.base
SemiMonthEnd.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
name
nanos
rule_code
Methods
pandas.tseries.offsets.SemiMonthEnd.rollback
SemiMonthEnd.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.SemiMonthEnd.rollforward
SemiMonthEnd.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
apply
apply_index
copy
isAnchored
onOffset
Properties
SemiMonthEnd.freqstr
SemiMonthEnd.kwds
SemiMonthEnd.name
SemiMonthEnd.nanos
SemiMonthEnd.normalize
SemiMonthEnd.rule_code
pandas.tseries.offsets.SemiMonthEnd.freqstr
SemiMonthEnd.freqstr
pandas.tseries.offsets.SemiMonthEnd.kwds
SemiMonthEnd.kwds
pandas.tseries.offsets.SemiMonthEnd.name
SemiMonthEnd.name
pandas.tseries.offsets.SemiMonthEnd.nanos
SemiMonthEnd.nanos
pandas.tseries.offsets.SemiMonthEnd.normalize
SemiMonthEnd.normalize = False
pandas.tseries.offsets.SemiMonthEnd.rule_code
SemiMonthEnd.rule_code
Methods
SemiMonthEnd.apply(self, other)
SemiMonthEnd.apply_index(self, other)
SemiMonthEnd.copy(self)
SemiMonthEnd.isAnchored(self)
SemiMonthEnd.onOffset(self, dt)
pandas.tseries.offsets.SemiMonthEnd.apply
SemiMonthEnd.apply(self, other)
pandas.tseries.offsets.SemiMonthEnd.apply_index
SemiMonthEnd.apply_index(self, other)
pandas.tseries.offsets.SemiMonthEnd.copy
SemiMonthEnd.copy(self )
pandas.tseries.offsets.SemiMonthEnd.isAnchored
SemiMonthEnd.isAnchored(self )
pandas.tseries.offsets.SemiMonthEnd.onOffset
SemiMonthEnd.onOffset(self, dt)
6.8.15 SemiMonthBegin
SemiMonthBegin([n, normalize, day_of_month]) Two DateOffsets per month repeating on the first day of the month and day_of_month.
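A minimal usage sketch (expected pandas 0.25 behaviour; 'SMS' is the frequency alias):

import pandas as pd
from pandas.tseries.offsets import SemiMonthBegin

print(pd.Timestamp('2019-01-02') + SemiMonthBegin())        # 2019-01-15
print(pd.date_range('2019-01-01', periods=4, freq='SMS'))   # the 1st and the 15th of each month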
pandas.tseries.offsets.SemiMonthBegin
Attributes
pandas.tseries.offsets.SemiMonthBegin.base
SemiMonthBegin.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
name
nanos
rule_code
Methods
pandas.tseries.offsets.SemiMonthBegin.rollback
SemiMonthBegin.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.SemiMonthBegin.rollforward
SemiMonthBegin.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
apply
apply_index
copy
isAnchored
onOffset
Properties
SemiMonthBegin.freqstr
SemiMonthBegin.kwds
SemiMonthBegin.name
SemiMonthBegin.nanos
SemiMonthBegin.normalize
SemiMonthBegin.rule_code
pandas.tseries.offsets.SemiMonthBegin.freqstr
SemiMonthBegin.freqstr
pandas.tseries.offsets.SemiMonthBegin.kwds
SemiMonthBegin.kwds
pandas.tseries.offsets.SemiMonthBegin.name
SemiMonthBegin.name
pandas.tseries.offsets.SemiMonthBegin.nanos
SemiMonthBegin.nanos
pandas.tseries.offsets.SemiMonthBegin.normalize
SemiMonthBegin.normalize = False
pandas.tseries.offsets.SemiMonthBegin.rule_code
SemiMonthBegin.rule_code
Methods
SemiMonthBegin.apply(self, other)
SemiMonthBegin.apply_index(self, other)
SemiMonthBegin.copy(self)
SemiMonthBegin.isAnchored(self)
SemiMonthBegin.onOffset(self, dt)
pandas.tseries.offsets.SemiMonthBegin.apply
SemiMonthBegin.apply(self, other)
pandas.tseries.offsets.SemiMonthBegin.apply_index
SemiMonthBegin.apply_index(self, other)
pandas.tseries.offsets.SemiMonthBegin.copy
SemiMonthBegin.copy(self )
pandas.tseries.offsets.SemiMonthBegin.isAnchored
SemiMonthBegin.isAnchored(self )
pandas.tseries.offsets.SemiMonthBegin.onOffset
SemiMonthBegin.onOffset(self, dt)
6.8.16 Week
pandas.tseries.offsets.Week
Attributes
pandas.tseries.offsets.Week.base
Week.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
name
nanos
rule_code
Methods
pandas.tseries.offsets.Week.rollback
Week.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.Week.rollforward
Week.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
apply
apply_index
copy
isAnchored
onOffset
Properties
Week.freqstr
Week.kwds
Week.name
Week.nanos
Week.normalize
Week.rule_code
pandas.tseries.offsets.Week.freqstr
Week.freqstr
pandas.tseries.offsets.Week.kwds
Week.kwds
pandas.tseries.offsets.Week.name
Week.name
pandas.tseries.offsets.Week.nanos
Week.nanos
pandas.tseries.offsets.Week.normalize
Week.normalize = False
pandas.tseries.offsets.Week.rule_code
Week.rule_code
Methods
Week.apply(self, other)
Week.apply_index(self, other)
Week.copy(self)
Week.isAnchored(self)
Week.onOffset(self, dt)
pandas.tseries.offsets.Week.apply
Week.apply(self, other)
pandas.tseries.offsets.Week.apply_index
Week.apply_index(self, other)
pandas.tseries.offsets.Week.copy
Week.copy(self )
pandas.tseries.offsets.Week.isAnchored
Week.isAnchored(self )
pandas.tseries.offsets.Week.onOffset
Week.onOffset(self, dt)
6.8.17 WeekOfMonth
WeekOfMonth([n, normalize, week, weekday]) Describes monthly dates like “the Tuesday of the 2nd week of each month”.
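A minimal usage sketch (expected pandas 0.25 behaviour; week is zero-based and weekday uses 0 for Monday):

import pandas as pd
from pandas.tseries.offsets import WeekOfMonth

# week=1, weekday=1 -> the Tuesday of the 2nd week of each month
offset = WeekOfMonth(week=1, weekday=1)
print(pd.Timestamp('2019-08-01') + offset)    # 2019-08-13
print(pd.date_range('2019-08-01', periods=3, freq=offset))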
pandas.tseries.offsets.WeekOfMonth
Attributes
pandas.tseries.offsets.WeekOfMonth.base
WeekOfMonth.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
name
nanos
rule_code
Methods
pandas.tseries.offsets.WeekOfMonth.apply_index
WeekOfMonth.apply_index(self, other)
Vectorized apply of DateOffset to DatetimeIndex; raises NotImplementedError for offsets without a vectorized implementation.
Parameters
i [DatetimeIndex]
Returns
y [DatetimeIndex]
pandas.tseries.offsets.WeekOfMonth.rollback
WeekOfMonth.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.WeekOfMonth.rollforward
WeekOfMonth.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
apply
copy
isAnchored
onOffset
Properties
WeekOfMonth.freqstr
WeekOfMonth.kwds
WeekOfMonth.name
WeekOfMonth.nanos
WeekOfMonth.normalize
WeekOfMonth.rule_code
pandas.tseries.offsets.WeekOfMonth.freqstr
WeekOfMonth.freqstr
pandas.tseries.offsets.WeekOfMonth.kwds
WeekOfMonth.kwds
pandas.tseries.offsets.WeekOfMonth.name
WeekOfMonth.name
pandas.tseries.offsets.WeekOfMonth.nanos
WeekOfMonth.nanos
pandas.tseries.offsets.WeekOfMonth.normalize
WeekOfMonth.normalize = False
pandas.tseries.offsets.WeekOfMonth.rule_code
WeekOfMonth.rule_code
Methods
WeekOfMonth.apply(self, other)
WeekOfMonth.copy(self)
WeekOfMonth.isAnchored(self)
WeekOfMonth.onOffset(self, dt)
pandas.tseries.offsets.WeekOfMonth.apply
WeekOfMonth.apply(self, other)
pandas.tseries.offsets.WeekOfMonth.copy
WeekOfMonth.copy(self )
pandas.tseries.offsets.WeekOfMonth.isAnchored
WeekOfMonth.isAnchored(self )
pandas.tseries.offsets.WeekOfMonth.onOffset
WeekOfMonth.onOffset(self, dt)
6.8.18 LastWeekOfMonth
LastWeekOfMonth([n, normalize, weekday]) Describes monthly dates in the last week of the month, like “the last Tuesday of each month”.
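A minimal usage sketch (expected pandas 0.25 behaviour; weekday uses 0 for Monday):

import pandas as pd
from pandas.tseries.offsets import LastWeekOfMonth

# weekday=1 -> the last Tuesday of each month
offset = LastWeekOfMonth(weekday=1)
print(pd.Timestamp('2019-08-01') + offset)    # 2019-08-27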
pandas.tseries.offsets.LastWeekOfMonth
Attributes
pandas.tseries.offsets.LastWeekOfMonth.base
LastWeekOfMonth.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
name
nanos
rule_code
Methods
pandas.tseries.offsets.LastWeekOfMonth.apply_index
LastWeekOfMonth.apply_index(self, other)
Vectorized apply of DateOffset to DatetimeIndex; raises NotImplementedError for offsets without a vectorized implementation.
Parameters
i [DatetimeIndex]
Returns
y [DatetimeIndex]
pandas.tseries.offsets.LastWeekOfMonth.rollback
LastWeekOfMonth.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.LastWeekOfMonth.rollforward
LastWeekOfMonth.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
apply
copy
isAnchored
onOffset
Properties
LastWeekOfMonth.freqstr
LastWeekOfMonth.kwds
LastWeekOfMonth.name
LastWeekOfMonth.nanos
LastWeekOfMonth.normalize
LastWeekOfMonth.rule_code
pandas.tseries.offsets.LastWeekOfMonth.freqstr
LastWeekOfMonth.freqstr
pandas.tseries.offsets.LastWeekOfMonth.kwds
LastWeekOfMonth.kwds
pandas.tseries.offsets.LastWeekOfMonth.name
LastWeekOfMonth.name
pandas.tseries.offsets.LastWeekOfMonth.nanos
LastWeekOfMonth.nanos
pandas.tseries.offsets.LastWeekOfMonth.normalize
LastWeekOfMonth.normalize = False
pandas.tseries.offsets.LastWeekOfMonth.rule_code
LastWeekOfMonth.rule_code
Methods
LastWeekOfMonth.apply(self, other)
LastWeekOfMonth.copy(self)
LastWeekOfMonth.isAnchored(self)
LastWeekOfMonth.onOffset(self, dt)
pandas.tseries.offsets.LastWeekOfMonth.apply
LastWeekOfMonth.apply(self, other)
pandas.tseries.offsets.LastWeekOfMonth.copy
LastWeekOfMonth.copy(self )
pandas.tseries.offsets.LastWeekOfMonth.isAnchored
LastWeekOfMonth.isAnchored(self )
pandas.tseries.offsets.LastWeekOfMonth.onOffset
LastWeekOfMonth.onOffset(self, dt)
6.8.19 QuarterOffset
pandas.tseries.offsets.QuarterOffset
Attributes
pandas.tseries.offsets.QuarterOffset.base
QuarterOffset.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
name
nanos
rule_code
Methods
pandas.tseries.offsets.QuarterOffset.rollback
QuarterOffset.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.QuarterOffset.rollforward
QuarterOffset.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
apply
apply_index
copy
isAnchored
onOffset
Properties
QuarterOffset.freqstr
QuarterOffset.kwds
QuarterOffset.name
QuarterOffset.nanos
QuarterOffset.normalize
QuarterOffset.rule_code
pandas.tseries.offsets.QuarterOffset.freqstr
QuarterOffset.freqstr
pandas.tseries.offsets.QuarterOffset.kwds
QuarterOffset.kwds
pandas.tseries.offsets.QuarterOffset.name
QuarterOffset.name
pandas.tseries.offsets.QuarterOffset.nanos
QuarterOffset.nanos
pandas.tseries.offsets.QuarterOffset.normalize
QuarterOffset.normalize = False
pandas.tseries.offsets.QuarterOffset.rule_code
QuarterOffset.rule_code
Methods
QuarterOffset.apply(self, other)
QuarterOffset.apply_index(self, other)
QuarterOffset.copy(self)
QuarterOffset.isAnchored(self)
QuarterOffset.onOffset(self, dt)
pandas.tseries.offsets.QuarterOffset.apply
QuarterOffset.apply(self, other)
pandas.tseries.offsets.QuarterOffset.apply_index
QuarterOffset.apply_index(self, other)
pandas.tseries.offsets.QuarterOffset.copy
QuarterOffset.copy(self )
pandas.tseries.offsets.QuarterOffset.isAnchored
QuarterOffset.isAnchored(self )
pandas.tseries.offsets.QuarterOffset.onOffset
QuarterOffset.onOffset(self, dt)
6.8.20 BQuarterEnd
pandas.tseries.offsets.BQuarterEnd
Attributes
pandas.tseries.offsets.BQuarterEnd.base
BQuarterEnd.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
name
nanos
rule_code
Methods
pandas.tseries.offsets.BQuarterEnd.rollback
BQuarterEnd.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.BQuarterEnd.rollforward
BQuarterEnd.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
apply
apply_index
copy
isAnchored
onOffset
Properties
BQuarterEnd.freqstr
BQuarterEnd.kwds
BQuarterEnd.name
BQuarterEnd.nanos
BQuarterEnd.normalize
BQuarterEnd.rule_code
pandas.tseries.offsets.BQuarterEnd.freqstr
BQuarterEnd.freqstr
pandas.tseries.offsets.BQuarterEnd.kwds
BQuarterEnd.kwds
pandas.tseries.offsets.BQuarterEnd.name
BQuarterEnd.name
pandas.tseries.offsets.BQuarterEnd.nanos
BQuarterEnd.nanos
pandas.tseries.offsets.BQuarterEnd.normalize
BQuarterEnd.normalize = False
pandas.tseries.offsets.BQuarterEnd.rule_code
BQuarterEnd.rule_code
Methods
BQuarterEnd.apply(self, other)
BQuarterEnd.apply_index(self, other)
BQuarterEnd.copy(self)
BQuarterEnd.isAnchored(self)
BQuarterEnd.onOffset(self, dt)
pandas.tseries.offsets.BQuarterEnd.apply
BQuarterEnd.apply(self, other)
pandas.tseries.offsets.BQuarterEnd.apply_index
BQuarterEnd.apply_index(self, other)
pandas.tseries.offsets.BQuarterEnd.copy
BQuarterEnd.copy(self )
pandas.tseries.offsets.BQuarterEnd.isAnchored
BQuarterEnd.isAnchored(self )
pandas.tseries.offsets.BQuarterEnd.onOffset
BQuarterEnd.onOffset(self, dt)
6.8.21 BQuarterBegin
Attributes
pandas.tseries.offsets.BQuarterBegin
Attributes
pandas.tseries.offsets.BQuarterBegin.base
BQuarterBegin.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
name
nanos
rule_code
Methods
pandas.tseries.offsets.BQuarterBegin.rollback
BQuarterBegin.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.BQuarterBegin.rollforward
BQuarterBegin.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
apply
apply_index
copy
isAnchored
onOffset
Properties
BQuarterBegin.freqstr
BQuarterBegin.kwds
BQuarterBegin.name
BQuarterBegin.nanos
BQuarterBegin.normalize
BQuarterBegin.rule_code
pandas.tseries.offsets.BQuarterBegin.freqstr
BQuarterBegin.freqstr
pandas.tseries.offsets.BQuarterBegin.kwds
BQuarterBegin.kwds
pandas.tseries.offsets.BQuarterBegin.name
BQuarterBegin.name
pandas.tseries.offsets.BQuarterBegin.nanos
BQuarterBegin.nanos
pandas.tseries.offsets.BQuarterBegin.normalize
BQuarterBegin.normalize = False
pandas.tseries.offsets.BQuarterBegin.rule_code
BQuarterBegin.rule_code
Methods
BQuarterBegin.apply(self, other)
BQuarterBegin.apply_index(self, other)
BQuarterBegin.copy(self)
BQuarterBegin.isAnchored(self)
BQuarterBegin.onOffset(self, dt)
pandas.tseries.offsets.BQuarterBegin.apply
BQuarterBegin.apply(self, other)
pandas.tseries.offsets.BQuarterBegin.apply_index
BQuarterBegin.apply_index(self, other)
pandas.tseries.offsets.BQuarterBegin.copy
BQuarterBegin.copy(self )
pandas.tseries.offsets.BQuarterBegin.isAnchored
BQuarterBegin.isAnchored(self )
pandas.tseries.offsets.BQuarterBegin.onOffset
BQuarterBegin.onOffset(self, dt)
6.8.22 QuarterEnd
pandas.tseries.offsets.QuarterEnd
Attributes
pandas.tseries.offsets.QuarterEnd.base
QuarterEnd.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
name
nanos
rule_code
Methods
pandas.tseries.offsets.QuarterEnd.rollback
QuarterEnd.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.QuarterEnd.rollforward
QuarterEnd.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
apply
apply_index
copy
isAnchored
onOffset
Properties
QuarterEnd.freqstr
QuarterEnd.kwds
QuarterEnd.name
QuarterEnd.nanos
QuarterEnd.normalize
QuarterEnd.rule_code
pandas.tseries.offsets.QuarterEnd.freqstr
QuarterEnd.freqstr
pandas.tseries.offsets.QuarterEnd.kwds
QuarterEnd.kwds
pandas.tseries.offsets.QuarterEnd.name
QuarterEnd.name
pandas.tseries.offsets.QuarterEnd.nanos
QuarterEnd.nanos
pandas.tseries.offsets.QuarterEnd.normalize
QuarterEnd.normalize = False
pandas.tseries.offsets.QuarterEnd.rule_code
QuarterEnd.rule_code
Methods
QuarterEnd.apply(self, other)
QuarterEnd.apply_index(self, other)
QuarterEnd.copy(self)
QuarterEnd.isAnchored(self)
QuarterEnd.onOffset(self, dt)
pandas.tseries.offsets.QuarterEnd.apply
QuarterEnd.apply(self, other)
pandas.tseries.offsets.QuarterEnd.apply_index
QuarterEnd.apply_index(self, other)
pandas.tseries.offsets.QuarterEnd.copy
QuarterEnd.copy(self )
pandas.tseries.offsets.QuarterEnd.isAnchored
QuarterEnd.isAnchored(self )
pandas.tseries.offsets.QuarterEnd.onOffset
QuarterEnd.onOffset(self, dt)
6.8.23 QuarterBegin
Attributes
pandas.tseries.offsets.QuarterBegin
Attributes
pandas.tseries.offsets.QuarterBegin.base
QuarterBegin.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
name
nanos
rule_code
Methods
pandas.tseries.offsets.QuarterBegin.rollback
QuarterBegin.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.QuarterBegin.rollforward
QuarterBegin.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
apply
apply_index
copy
isAnchored
onOffset
Properties
QuarterBegin.freqstr
QuarterBegin.kwds
QuarterBegin.name
QuarterBegin.nanos
QuarterBegin.normalize
QuarterBegin.rule_code
pandas.tseries.offsets.QuarterBegin.freqstr
QuarterBegin.freqstr
pandas.tseries.offsets.QuarterBegin.kwds
QuarterBegin.kwds
pandas.tseries.offsets.QuarterBegin.name
QuarterBegin.name
pandas.tseries.offsets.QuarterBegin.nanos
QuarterBegin.nanos
pandas.tseries.offsets.QuarterBegin.normalize
QuarterBegin.normalize = False
pandas.tseries.offsets.QuarterBegin.rule_code
QuarterBegin.rule_code
Methods
QuarterBegin.apply(self, other)
QuarterBegin.apply_index(self, other)
QuarterBegin.copy(self)
QuarterBegin.isAnchored(self)
QuarterBegin.onOffset(self, dt)
pandas.tseries.offsets.QuarterBegin.apply
QuarterBegin.apply(self, other)
pandas.tseries.offsets.QuarterBegin.apply_index
QuarterBegin.apply_index(self, other)
pandas.tseries.offsets.QuarterBegin.copy
QuarterBegin.copy(self )
pandas.tseries.offsets.QuarterBegin.isAnchored
QuarterBegin.isAnchored(self )
pandas.tseries.offsets.QuarterBegin.onOffset
QuarterBegin.onOffset(self, dt)
6.8.24 YearOffset
pandas.tseries.offsets.YearOffset
Attributes
pandas.tseries.offsets.YearOffset.base
YearOffset.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
name
nanos
rule_code
Methods
pandas.tseries.offsets.YearOffset.rollback
YearOffset.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.YearOffset.rollforward
YearOffset.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
apply
apply_index
copy
isAnchored
onOffset
Properties
YearOffset.freqstr
YearOffset.kwds
YearOffset.name
YearOffset.nanos
YearOffset.normalize
YearOffset.rule_code
pandas.tseries.offsets.YearOffset.freqstr
YearOffset.freqstr
pandas.tseries.offsets.YearOffset.kwds
YearOffset.kwds
pandas.tseries.offsets.YearOffset.name
YearOffset.name
pandas.tseries.offsets.YearOffset.nanos
YearOffset.nanos
pandas.tseries.offsets.YearOffset.normalize
YearOffset.normalize = False
pandas.tseries.offsets.YearOffset.rule_code
YearOffset.rule_code
Methods
YearOffset.apply(self, other)
YearOffset.apply_index(self, other)
YearOffset.copy(self)
YearOffset.isAnchored(self)
YearOffset.onOffset(self, dt)
pandas.tseries.offsets.YearOffset.apply
YearOffset.apply(self, other)
pandas.tseries.offsets.YearOffset.apply_index
YearOffset.apply_index(self, other)
pandas.tseries.offsets.YearOffset.copy
YearOffset.copy(self )
pandas.tseries.offsets.YearOffset.isAnchored
YearOffset.isAnchored(self )
pandas.tseries.offsets.YearOffset.onOffset
YearOffset.onOffset(self, dt)
6.8.25 BYearEnd
pandas.tseries.offsets.BYearEnd
Attributes
pandas.tseries.offsets.BYearEnd.base
BYearEnd.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
name
nanos
rule_code
Methods
pandas.tseries.offsets.BYearEnd.rollback
BYearEnd.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.BYearEnd.rollforward
BYearEnd.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
apply
apply_index
copy
isAnchored
onOffset
Properties
BYearEnd.freqstr
BYearEnd.kwds
BYearEnd.name
BYearEnd.nanos
BYearEnd.normalize
BYearEnd.rule_code
pandas.tseries.offsets.BYearEnd.freqstr
BYearEnd.freqstr
pandas.tseries.offsets.BYearEnd.kwds
BYearEnd.kwds
pandas.tseries.offsets.BYearEnd.name
BYearEnd.name
pandas.tseries.offsets.BYearEnd.nanos
BYearEnd.nanos
pandas.tseries.offsets.BYearEnd.normalize
BYearEnd.normalize = False
pandas.tseries.offsets.BYearEnd.rule_code
BYearEnd.rule_code
Methods
BYearEnd.apply(self, other)
BYearEnd.apply_index(self, other)
BYearEnd.copy(self)
BYearEnd.isAnchored(self)
BYearEnd.onOffset(self, dt)
pandas.tseries.offsets.BYearEnd.apply
BYearEnd.apply(self, other)
pandas.tseries.offsets.BYearEnd.apply_index
BYearEnd.apply_index(self, other)
pandas.tseries.offsets.BYearEnd.copy
BYearEnd.copy(self )
pandas.tseries.offsets.BYearEnd.isAnchored
BYearEnd.isAnchored(self )
pandas.tseries.offsets.BYearEnd.onOffset
BYearEnd.onOffset(self, dt)
6.8.26 BYearBegin
pandas.tseries.offsets.BYearBegin
Attributes
pandas.tseries.offsets.BYearBegin.base
BYearBegin.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
name
nanos
rule_code
Methods
pandas.tseries.offsets.BYearBegin.rollback
BYearBegin.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.BYearBegin.rollforward
BYearBegin.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
apply
apply_index
copy
isAnchored
onOffset
Properties
BYearBegin.freqstr
BYearBegin.kwds
BYearBegin.name
BYearBegin.nanos
BYearBegin.normalize
BYearBegin.rule_code
pandas.tseries.offsets.BYearBegin.freqstr
BYearBegin.freqstr
pandas.tseries.offsets.BYearBegin.kwds
BYearBegin.kwds
pandas.tseries.offsets.BYearBegin.name
BYearBegin.name
pandas.tseries.offsets.BYearBegin.nanos
BYearBegin.nanos
pandas.tseries.offsets.BYearBegin.normalize
BYearBegin.normalize = False
pandas.tseries.offsets.BYearBegin.rule_code
BYearBegin.rule_code
Methods
BYearBegin.apply(self, other)
BYearBegin.apply_index(self, other)
BYearBegin.copy(self)
BYearBegin.isAnchored(self)
BYearBegin.onOffset(self, dt)
pandas.tseries.offsets.BYearBegin.apply
BYearBegin.apply(self, other)
pandas.tseries.offsets.BYearBegin.apply_index
BYearBegin.apply_index(self, other)
pandas.tseries.offsets.BYearBegin.copy
BYearBegin.copy(self )
pandas.tseries.offsets.BYearBegin.isAnchored
BYearBegin.isAnchored(self )
pandas.tseries.offsets.BYearBegin.onOffset
BYearBegin.onOffset(self, dt)
6.8.27 YearEnd
pandas.tseries.offsets.YearEnd
Attributes
pandas.tseries.offsets.YearEnd.base
YearEnd.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
name
nanos
rule_code
Methods
pandas.tseries.offsets.YearEnd.rollback
YearEnd.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.YearEnd.rollforward
YearEnd.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
apply
apply_index
copy
isAnchored
onOffset
Properties
YearEnd.freqstr
YearEnd.kwds
YearEnd.name
YearEnd.nanos
YearEnd.normalize
YearEnd.rule_code
pandas.tseries.offsets.YearEnd.freqstr
YearEnd.freqstr
pandas.tseries.offsets.YearEnd.kwds
YearEnd.kwds
pandas.tseries.offsets.YearEnd.name
YearEnd.name
pandas.tseries.offsets.YearEnd.nanos
YearEnd.nanos
pandas.tseries.offsets.YearEnd.normalize
YearEnd.normalize = False
pandas.tseries.offsets.YearEnd.rule_code
YearEnd.rule_code
Methods
YearEnd.apply(self, other)
YearEnd.apply_index(self, other)
YearEnd.copy(self)
YearEnd.isAnchored(self)
YearEnd.onOffset(self, dt)
pandas.tseries.offsets.YearEnd.apply
YearEnd.apply(self, other)
pandas.tseries.offsets.YearEnd.apply_index
YearEnd.apply_index(self, other)
pandas.tseries.offsets.YearEnd.copy
YearEnd.copy(self )
pandas.tseries.offsets.YearEnd.isAnchored
YearEnd.isAnchored(self )
pandas.tseries.offsets.YearEnd.onOffset
YearEnd.onOffset(self, dt)
6.8.28 YearBegin
pandas.tseries.offsets.YearBegin
Attributes
pandas.tseries.offsets.YearBegin.base
YearBegin.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
name
nanos
rule_code
Methods
pandas.tseries.offsets.YearBegin.rollback
YearBegin.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.YearBegin.rollforward
YearBegin.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
apply
apply_index
copy
isAnchored
onOffset
Properties
YearBegin.freqstr
YearBegin.kwds
YearBegin.name
YearBegin.nanos
YearBegin.normalize
YearBegin.rule_code
pandas.tseries.offsets.YearBegin.freqstr
YearBegin.freqstr
pandas.tseries.offsets.YearBegin.kwds
YearBegin.kwds
pandas.tseries.offsets.YearBegin.name
YearBegin.name
pandas.tseries.offsets.YearBegin.nanos
YearBegin.nanos
pandas.tseries.offsets.YearBegin.normalize
YearBegin.normalize = False
pandas.tseries.offsets.YearBegin.rule_code
YearBegin.rule_code
Methods
YearBegin.apply(self, other)
YearBegin.apply_index(self, other)
YearBegin.copy(self)
YearBegin.isAnchored(self)
YearBegin.onOffset(self, dt)
pandas.tseries.offsets.YearBegin.apply
YearBegin.apply(self, other)
pandas.tseries.offsets.YearBegin.apply_index
YearBegin.apply_index(self, other)
pandas.tseries.offsets.YearBegin.copy
YearBegin.copy(self )
pandas.tseries.offsets.YearBegin.isAnchored
YearBegin.isAnchored(self )
pandas.tseries.offsets.YearBegin.onOffset
YearBegin.onOffset(self, dt)
6.8.29 FY5253
pandas.tseries.offsets.FY5253
Attributes
pandas.tseries.offsets.FY5253.base
FY5253.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
name
nanos
rule_code
Methods
pandas.tseries.offsets.FY5253.apply_index
FY5253.apply_index(self, other)
Vectorized apply of DateOffset to DatetimeIndex; raises NotImplementedError for offsets without a vectorized implementation.
Parameters
i [DatetimeIndex]
Returns
y [DatetimeIndex]
pandas.tseries.offsets.FY5253.rollback
FY5253.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.FY5253.rollforward
FY5253.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
apply
copy
get_rule_code_suffix
get_year_end
isAnchored
onOffset
Properties
FY5253.freqstr
FY5253.kwds
FY5253.name
FY5253.nanos
FY5253.normalize
FY5253.rule_code
pandas.tseries.offsets.FY5253.freqstr
FY5253.freqstr
pandas.tseries.offsets.FY5253.kwds
FY5253.kwds
pandas.tseries.offsets.FY5253.name
FY5253.name
pandas.tseries.offsets.FY5253.nanos
FY5253.nanos
pandas.tseries.offsets.FY5253.normalize
FY5253.normalize = False
pandas.tseries.offsets.FY5253.rule_code
FY5253.rule_code
Methods
FY5253.apply(self, other)
FY5253.copy(self)
FY5253.get_rule_code_suffix(self)
FY5253.get_year_end(self, dt)
FY5253.isAnchored(self)
FY5253.onOffset(self, dt)
pandas.tseries.offsets.FY5253.apply
FY5253.apply(self, other)
pandas.tseries.offsets.FY5253.copy
FY5253.copy(self )
pandas.tseries.offsets.FY5253.get_rule_code_suffix
FY5253.get_rule_code_suffix(self )
pandas.tseries.offsets.FY5253.get_year_end
FY5253.get_year_end(self, dt)
pandas.tseries.offsets.FY5253.isAnchored
FY5253.isAnchored(self )
pandas.tseries.offsets.FY5253.onOffset
FY5253.onOffset(self, dt)
6.8.30 FY5253Quarter
pandas.tseries.offsets.FY5253Quarter
Attributes
pandas.tseries.offsets.FY5253Quarter.base
FY5253Quarter.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
name
nanos
rule_code
Methods
pandas.tseries.offsets.FY5253Quarter.apply_index
FY5253Quarter.apply_index(self, other)
Vectorized apply of DateOffset to DatetimeIndex; raises NotImplementedError for offsets without a vectorized implementation.
Parameters
i [DatetimeIndex]
Returns
y [DatetimeIndex]
pandas.tseries.offsets.FY5253Quarter.rollback
FY5253Quarter.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.FY5253Quarter.rollforward
FY5253Quarter.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
apply
copy
get_weeks
isAnchored
onOffset
year_has_extra_week
Properties
FY5253Quarter.freqstr
FY5253Quarter.kwds
FY5253Quarter.name
FY5253Quarter.nanos
FY5253Quarter.normalize
FY5253Quarter.rule_code
pandas.tseries.offsets.FY5253Quarter.freqstr
FY5253Quarter.freqstr
pandas.tseries.offsets.FY5253Quarter.kwds
FY5253Quarter.kwds
pandas.tseries.offsets.FY5253Quarter.name
FY5253Quarter.name
pandas.tseries.offsets.FY5253Quarter.nanos
FY5253Quarter.nanos
pandas.tseries.offsets.FY5253Quarter.normalize
FY5253Quarter.normalize = False
pandas.tseries.offsets.FY5253Quarter.rule_code
FY5253Quarter.rule_code
Methods
FY5253Quarter.apply(self, other)
FY5253Quarter.copy(self)
FY5253Quarter.get_weeks(self, dt)
FY5253Quarter.isAnchored(self)
FY5253Quarter.onOffset(self, dt)
FY5253Quarter.year_has_extra_week(self, dt)
pandas.tseries.offsets.FY5253Quarter.apply
FY5253Quarter.apply(self, other)
pandas.tseries.offsets.FY5253Quarter.copy
FY5253Quarter.copy(self )
pandas.tseries.offsets.FY5253Quarter.get_weeks
FY5253Quarter.get_weeks(self, dt)
pandas.tseries.offsets.FY5253Quarter.isAnchored
FY5253Quarter.isAnchored(self )
pandas.tseries.offsets.FY5253Quarter.onOffset
FY5253Quarter.onOffset(self, dt)
pandas.tseries.offsets.FY5253Quarter.year_has_extra_week
FY5253Quarter.year_has_extra_week(self, dt)
6.8.31 Easter
pandas.tseries.offsets.Easter
class pandas.tseries.offsets.Easter
DateOffset for the Easter holiday, using logic defined in dateutil.
Currently uses the revised method, which is valid for years 1583-4099.
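A minimal usage sketch (expected pandas 0.25 behaviour):

import pandas as pd
from pandas.tseries.offsets import Easter

print(pd.Timestamp('2019-01-01') + Easter())   # 2019-04-21, Easter Sunday 2019
print(pd.Timestamp('2019-05-01') + Easter())   # 2020-04-12, the next Easter after that date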
Attributes
pandas.tseries.offsets.Easter.base
Easter.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
name
nanos
rule_code
Methods
pandas.tseries.offsets.Easter.apply_index
Easter.apply_index(self, other)
Vectorized apply of DateOffset to DatetimeIndex; raises NotImplementedError for offsets without a vectorized implementation.
Parameters
i [DatetimeIndex]
Returns
y [DatetimeIndex]
pandas.tseries.offsets.Easter.rollback
Easter.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.Easter.rollforward
Easter.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
apply
copy
isAnchored
onOffset
Properties
Easter.freqstr
Easter.kwds
Easter.name
Easter.nanos
Easter.normalize
Easter.rule_code
pandas.tseries.offsets.Easter.freqstr
Easter.freqstr
pandas.tseries.offsets.Easter.kwds
Easter.kwds
pandas.tseries.offsets.Easter.name
Easter.name
pandas.tseries.offsets.Easter.nanos
Easter.nanos
pandas.tseries.offsets.Easter.normalize
Easter.normalize = False
pandas.tseries.offsets.Easter.rule_code
Easter.rule_code
Methods
Easter.apply(self, other)
Easter.copy(self)
Easter.isAnchored(self)
Easter.onOffset(self, dt)
pandas.tseries.offsets.Easter.apply
Easter.apply(self, other)
pandas.tseries.offsets.Easter.copy
Easter.copy(self )
pandas.tseries.offsets.Easter.isAnchored
Easter.isAnchored(self )
pandas.tseries.offsets.Easter.onOffset
Easter.onOffset(self, dt)
6.8.32 Tick
Tick([n, normalize])
Attributes
pandas.tseries.offsets.Tick
Attributes
pandas.tseries.offsets.Tick.base
Tick.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
delta
freqstr
kwds
name
nanos
rule_code
Methods
apply(self, other)
apply_index(self, other) Vectorized apply of DateOffset to DatetimeIndex; raises NotImplementedError for offsets without a vectorized implementation.
rollback(self, dt) Roll provided date backward to next offset only if not on offset.
rollforward(self, dt) Roll provided date forward to next offset only if not on offset.
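Tick is the base class for the fixed-duration offsets that follow (Day, Hour, Minute, Second and their sub-second counterparts). A minimal sketch of how such offsets combine and expose their duration (expected pandas 0.25 behaviour):

import pandas as pd
from pandas.tseries.offsets import Hour, Minute

ts = pd.Timestamp('2019-01-01 10:00')
print(ts + Hour(2))             # 2019-01-01 12:00:00
print(ts + Minute(90))          # 2019-01-01 11:30:00
print(Hour(1) + Minute(30))     # combined into a single 90-minute tick
print(Hour().nanos)             # 3600000000000, the fixed duration in nanoseconds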
pandas.tseries.offsets.Tick.apply
Tick.apply(self, other)
pandas.tseries.offsets.Tick.apply_index
Tick.apply_index(self, other)
Vectorized apply of DateOffset to DatetimeIndex; raises NotImplementedError for offsets without a vectorized implementation.
Parameters
i [DatetimeIndex]
Returns
y [DatetimeIndex]
pandas.tseries.offsets.Tick.rollback
Tick.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.Tick.rollforward
Tick.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
copy
isAnchored
onOffset
Properties
Tick.delta
Tick.freqstr
Tick.kwds
Tick.name
Tick.nanos
Tick.normalize
Tick.rule_code
pandas.tseries.offsets.Tick.delta
Tick.delta
pandas.tseries.offsets.Tick.freqstr
Tick.freqstr
pandas.tseries.offsets.Tick.kwds
Tick.kwds
pandas.tseries.offsets.Tick.name
Tick.name
pandas.tseries.offsets.Tick.nanos
Tick.nanos
pandas.tseries.offsets.Tick.normalize
Tick.normalize = False
pandas.tseries.offsets.Tick.rule_code
Tick.rule_code
Methods
Tick.copy(self)
Tick.isAnchored(self)
Tick.onOffset(self, dt)
pandas.tseries.offsets.Tick.copy
Tick.copy(self )
pandas.tseries.offsets.Tick.isAnchored
Tick.isAnchored(self )
pandas.tseries.offsets.Tick.onOffset
Tick.onOffset(self, dt)
6.8.33 Day
Day([n, normalize])
Attributes
pandas.tseries.offsets.Day
Attributes
pandas.tseries.offsets.Day.base
Day.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
delta
freqstr
kwds
name
nanos
rule_code
Methods
apply(self, other)
apply_index(self, other) Vectorized apply of DateOffset to DatetimeIndex; raises NotImplementedError for offsets without a vectorized implementation.
rollback(self, dt) Roll provided date backward to next offset only if not on offset.
rollforward(self, dt) Roll provided date forward to next offset only if not on offset.
pandas.tseries.offsets.Day.apply
Day.apply(self, other)
pandas.tseries.offsets.Day.apply_index
Day.apply_index(self, other)
Vectorized apply of DateOffset to DatetimeIndex; raises NotImplementedError for offsets without a vectorized implementation.
Parameters
i [DatetimeIndex]
Returns
y [DatetimeIndex]
pandas.tseries.offsets.Day.rollback
Day.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.Day.rollforward
Day.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
copy
isAnchored
onOffset
Properties
Day.delta
Day.freqstr
Day.kwds
Day.name
Day.nanos
Day.normalize
Day.rule_code
pandas.tseries.offsets.Day.delta
Day.delta
pandas.tseries.offsets.Day.freqstr
Day.freqstr
pandas.tseries.offsets.Day.kwds
Day.kwds
pandas.tseries.offsets.Day.name
Day.name
pandas.tseries.offsets.Day.nanos
Day.nanos
pandas.tseries.offsets.Day.normalize
Day.normalize = False
pandas.tseries.offsets.Day.rule_code
Day.rule_code
Methods
Day.copy(self)
Day.isAnchored(self)
Day.onOffset(self, dt)
pandas.tseries.offsets.Day.copy
Day.copy(self )
pandas.tseries.offsets.Day.isAnchored
Day.isAnchored(self )
pandas.tseries.offsets.Day.onOffset
Day.onOffset(self, dt)
6.8.34 Hour
Hour([n, normalize])
Attributes
pandas.tseries.offsets.Hour
Attributes
pandas.tseries.offsets.Hour.base
Hour.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
delta
freqstr
kwds
name
nanos
rule_code
Methods
apply(self, other)
apply_index(self, other) Vectorized apply of DateOffset to DatetimeIndex; raises NotImplementedError for offsets without a vectorized implementation.
rollback(self, dt) Roll provided date backward to next offset only if not on offset.
rollforward(self, dt) Roll provided date forward to next offset only if not on offset.
pandas.tseries.offsets.Hour.apply
Hour.apply(self, other)
pandas.tseries.offsets.Hour.apply_index
Hour.apply_index(self, other)
Vectorized apply of DateOffset to DatetimeIndex; raises NotImplementedError for offsets without a vectorized implementation.
Parameters
i [DatetimeIndex]
Returns
y [DatetimeIndex]
pandas.tseries.offsets.Hour.rollback
Hour.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.Hour.rollforward
Hour.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
copy
isAnchored
onOffset
Properties
Hour.delta
Hour.freqstr
Hour.kwds
Hour.name
Hour.nanos
Hour.normalize
Hour.rule_code
pandas.tseries.offsets.Hour.delta
Hour.delta
pandas.tseries.offsets.Hour.freqstr
Hour.freqstr
pandas.tseries.offsets.Hour.kwds
Hour.kwds
pandas.tseries.offsets.Hour.name
Hour.name
pandas.tseries.offsets.Hour.nanos
Hour.nanos
pandas.tseries.offsets.Hour.normalize
Hour.normalize = False
pandas.tseries.offsets.Hour.rule_code
Hour.rule_code
Methods
Hour.copy(self)
Hour.isAnchored(self)
Hour.onOffset(self, dt)
pandas.tseries.offsets.Hour.copy
Hour.copy(self)
pandas.tseries.offsets.Hour.isAnchored
Hour.isAnchored(self)
pandas.tseries.offsets.Hour.onOffset
Hour.onOffset(self, dt)
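A brief usage sketch, not part of the original reference entry (the timestamp and the multiple of two are illustrative), showing how an Hour offset shifts a timestamp and how its fixed duration is exposed through nanos and freqstr:
>>> import pandas as pd
>>> from pandas.tseries.offsets import Hour
>>> pd.Timestamp('2019-01-01 09:00') + Hour(2)   # shift forward by two hours
Timestamp('2019-01-01 11:00:00')
>>> Hour(2).nanos                                # fixed length in nanoseconds
7200000000000
>>> Hour(2).freqstr
'2H'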
6.8.35 Minute
Minute([n, normalize])
Attributes
pandas.tseries.offsets.Minute
Attributes
pandas.tseries.offsets.Minute.base
Minute.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
delta
freqstr
kwds
name
nanos
rule_code
Methods
apply(self, other)
pandas.tseries.offsets.Minute.apply
Minute.apply(self, other)
pandas.tseries.offsets.Minute.apply_index
Minute.apply_index(self, other)
Vectorized apply of DateOffset to DatetimeIndex; raises NotImplementedError for offsets without a vectorized implementation.
Parameters
i [DatetimeIndex]
Returns
y [DatetimeIndex]
pandas.tseries.offsets.Minute.rollback
Minute.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.Minute.rollforward
Minute.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
copy
isAnchored
onOffset
Properties
Minute.delta
Minute.freqstr
Minute.kwds
Minute.name
Minute.nanos
Minute.normalize
Minute.rule_code
pandas.tseries.offsets.Minute.delta
Minute.delta
pandas.tseries.offsets.Minute.freqstr
Minute.freqstr
pandas.tseries.offsets.Minute.kwds
Minute.kwds
pandas.tseries.offsets.Minute.name
Minute.name
pandas.tseries.offsets.Minute.nanos
Minute.nanos
pandas.tseries.offsets.Minute.normalize
Minute.normalize = False
pandas.tseries.offsets.Minute.rule_code
Minute.rule_code
Methods
Minute.copy(self)
Minute.isAnchored(self)
Minute.onOffset(self, dt)
pandas.tseries.offsets.Minute.copy
Minute.copy(self)
pandas.tseries.offsets.Minute.isAnchored
Minute.isAnchored(self)
pandas.tseries.offsets.Minute.onOffset
Minute.onOffset(self, dt)
6.8.36 Second
Second([n, normalize])
Attributes
pandas.tseries.offsets.Second
Attributes
pandas.tseries.offsets.Second.base
Second.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
delta
freqstr
kwds
name
nanos
rule_code
Methods
apply(self, other)
pandas.tseries.offsets.Second.apply
Second.apply(self, other)
pandas.tseries.offsets.Second.apply_index
Second.apply_index(self, other)
Vectorized apply of DateOffset to DatetimeIndex; raises NotImplementedError for offsets without a vectorized implementation.
Parameters
i [DatetimeIndex]
Returns
y [DatetimeIndex]
pandas.tseries.offsets.Second.rollback
Second.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.Second.rollforward
Second.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
copy
isAnchored
onOffset
Properties
Second.delta
Second.freqstr
Second.kwds
Second.name
Second.nanos
Second.normalize
Second.rule_code
pandas.tseries.offsets.Second.delta
Second.delta
pandas.tseries.offsets.Second.freqstr
Second.freqstr
pandas.tseries.offsets.Second.kwds
Second.kwds
pandas.tseries.offsets.Second.name
Second.name
pandas.tseries.offsets.Second.nanos
Second.nanos
pandas.tseries.offsets.Second.normalize
Second.normalize = False
pandas.tseries.offsets.Second.rule_code
Second.rule_code
Methods
Second.copy(self)
Second.isAnchored(self)
Second.onOffset(self, dt)
pandas.tseries.offsets.Second.copy
Second.copy(self)
pandas.tseries.offsets.Second.isAnchored
Second.isAnchored(self)
pandas.tseries.offsets.Second.onOffset
Second.onOffset(self, dt)
6.8.37 Milli
Milli([n, normalize])
Attributes
pandas.tseries.offsets.Milli
Attributes
pandas.tseries.offsets.Milli.base
Milli.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
delta
freqstr
kwds
name
nanos
rule_code
Methods
apply(self, other)
pandas.tseries.offsets.Milli.apply
Milli.apply(self, other)
pandas.tseries.offsets.Milli.apply_index
Milli.apply_index(self, other)
Vectorized apply of DateOffset to DatetimeIndex; raises NotImplementedError for offsets without a vectorized implementation.
Parameters
i [DatetimeIndex]
Returns
y [DatetimeIndex]
pandas.tseries.offsets.Milli.rollback
Milli.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.Milli.rollforward
Milli.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
copy
isAnchored
onOffset
Properties
Milli.delta
Milli.freqstr
Milli.kwds
Milli.name
Milli.nanos
Milli.normalize
Milli.rule_code
pandas.tseries.offsets.Milli.delta
Milli.delta
pandas.tseries.offsets.Milli.freqstr
Milli.freqstr
pandas.tseries.offsets.Milli.kwds
Milli.kwds
pandas.tseries.offsets.Milli.name
Milli.name
pandas.tseries.offsets.Milli.nanos
Milli.nanos
pandas.tseries.offsets.Milli.normalize
Milli.normalize = False
pandas.tseries.offsets.Milli.rule_code
Milli.rule_code
Methods
Milli.copy(self)
Milli.isAnchored(self)
Milli.onOffset(self, dt)
pandas.tseries.offsets.Milli.copy
Milli.copy(self)
pandas.tseries.offsets.Milli.isAnchored
Milli.isAnchored(self)
pandas.tseries.offsets.Milli.onOffset
Milli.onOffset(self, dt)
6.8.38 Micro
Micro([n, normalize])
Attributes
pandas.tseries.offsets.Micro
Attributes
pandas.tseries.offsets.Micro.base
Micro.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
delta
freqstr
kwds
name
nanos
rule_code
Methods
apply(self, other)
pandas.tseries.offsets.Micro.apply
Micro.apply(self, other)
pandas.tseries.offsets.Micro.apply_index
Micro.apply_index(self, other)
Vectorized apply of DateOffset to DatetimeIndex; raises NotImplementedError for offsets without a vectorized implementation.
Parameters
i [DatetimeIndex]
Returns
y [DatetimeIndex]
pandas.tseries.offsets.Micro.rollback
Micro.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.Micro.rollforward
Micro.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
copy
isAnchored
onOffset
Properties
Micro.delta
Micro.freqstr
Micro.kwds
Micro.name
Micro.nanos
Micro.normalize
Micro.rule_code
pandas.tseries.offsets.Micro.delta
Micro.delta
pandas.tseries.offsets.Micro.freqstr
Micro.freqstr
pandas.tseries.offsets.Micro.kwds
Micro.kwds
pandas.tseries.offsets.Micro.name
Micro.name
pandas.tseries.offsets.Micro.nanos
Micro.nanos
pandas.tseries.offsets.Micro.normalize
Micro.normalize = False
pandas.tseries.offsets.Micro.rule_code
Micro.rule_code
Methods
Micro.copy(self)
Micro.isAnchored(self)
Micro.onOffset(self, dt)
pandas.tseries.offsets.Micro.copy
Micro.copy(self)
pandas.tseries.offsets.Micro.isAnchored
Micro.isAnchored(self)
pandas.tseries.offsets.Micro.onOffset
Micro.onOffset(self, dt)
6.8.39 Nano
Nano([n, normalize])
Attributes
pandas.tseries.offsets.Nano
Attributes
pandas.tseries.offsets.Nano.base
Nano.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
delta
freqstr
kwds
name
nanos
rule_code
Methods
apply(self, other)
pandas.tseries.offsets.Nano.apply
Nano.apply(self, other)
pandas.tseries.offsets.Nano.apply_index
Nano.apply_index(self, other)
Vectorized apply of DateOffset to DatetimeIndex; raises NotImplementedError for offsets without a vectorized implementation.
Parameters
i [DatetimeIndex]
Returns
y [DatetimeIndex]
pandas.tseries.offsets.Nano.rollback
Nano.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.Nano.rollforward
Nano.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
__call__
copy
isAnchored
onOffset
Properties
Nano.delta
Nano.freqstr
Nano.kwds
Nano.name
Nano.nanos
Nano.normalize
Nano.rule_code
pandas.tseries.offsets.Nano.delta
Nano.delta
pandas.tseries.offsets.Nano.freqstr
Nano.freqstr
pandas.tseries.offsets.Nano.kwds
Nano.kwds
pandas.tseries.offsets.Nano.name
Nano.name
pandas.tseries.offsets.Nano.nanos
Nano.nanos
pandas.tseries.offsets.Nano.normalize
Nano.normalize = False
pandas.tseries.offsets.Nano.rule_code
Nano.rule_code
Methods
Nano.copy(self)
Nano.isAnchored(self)
Nano.onOffset(self, dt)
pandas.tseries.offsets.Nano.copy
Nano.copy(self)
pandas.tseries.offsets.Nano.isAnchored
Nano.isAnchored(self)
pandas.tseries.offsets.Nano.onOffset
Nano.onOffset(self, dt)
6.8.40 BDay
pandas.tseries.offsets.BDay
pandas.tseries.offsets.BDay
alias of pandas.tseries.offsets.BusinessDay
Properties
pandas.tseries.offsets.BDay.base
BDay.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
pandas.tseries.offsets.BDay.freqstr
BDay.freqstr
pandas.tseries.offsets.BDay.kwds
BDay.kwds
pandas.tseries.offsets.BDay.name
BDay.name
pandas.tseries.offsets.BDay.nanos
BDay.nanos
pandas.tseries.offsets.BDay.normalize
BDay.normalize = False
pandas.tseries.offsets.BDay.offset
BDay.offset
Alias for self._offset.
pandas.tseries.offsets.BDay.rule_code
BDay.rule_code
Methods
BDay.apply(self, other)
BDay.apply_index(self, other)
BDay.copy(self)
BDay.isAnchored(self)
BDay.onOffset(self, dt)
BDay.rollback(self, dt) Roll provided date backward to next offset only if not on offset.
BDay.rollforward(self, dt) Roll provided date forward to next offset only if not on offset.
pandas.tseries.offsets.BDay.apply
BDay.apply(self, other)
pandas.tseries.offsets.BDay.apply_index
BDay.apply_index(self, other)
pandas.tseries.offsets.BDay.copy
BDay.copy(self)
pandas.tseries.offsets.BDay.isAnchored
BDay.isAnchored(self)
pandas.tseries.offsets.BDay.onOffset
BDay.onOffset(self, dt)
pandas.tseries.offsets.BDay.rollback
BDay.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.BDay.rollforward
BDay.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
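A small sketch of the rollback/rollforward behaviour documented above (the date is illustrative; BDay is the BusinessDay alias named at the start of this section). A Saturday, which is not on offset, rolls forward to Monday and back to Friday:
>>> import pandas as pd
>>> from pandas.tseries.offsets import BDay
>>> saturday = pd.Timestamp('2019-08-31')
>>> BDay().onOffset(saturday)
False
>>> BDay().rollforward(saturday)
Timestamp('2019-09-02 00:00:00')
>>> BDay().rollback(saturday)
Timestamp('2019-08-30 00:00:00')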
6.8.41 BMonthEnd
pandas.tseries.offsets.BMonthEnd
pandas.tseries.offsets.BMonthEnd
alias of pandas.tseries.offsets.BusinessMonthEnd
Properties
pandas.tseries.offsets.BMonthEnd.base
BMonthEnd.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
pandas.tseries.offsets.BMonthEnd.freqstr
BMonthEnd.freqstr
pandas.tseries.offsets.BMonthEnd.kwds
BMonthEnd.kwds
pandas.tseries.offsets.BMonthEnd.name
BMonthEnd.name
pandas.tseries.offsets.BMonthEnd.nanos
BMonthEnd.nanos
pandas.tseries.offsets.BMonthEnd.normalize
BMonthEnd.normalize = False
pandas.tseries.offsets.BMonthEnd.rule_code
BMonthEnd.rule_code
Methods
BMonthEnd.apply(self, other)
BMonthEnd.apply_index(self, other)
BMonthEnd.copy(self)
BMonthEnd.isAnchored(self)
BMonthEnd.onOffset(self, dt)
BMonthEnd.rollback(self, dt) Roll provided date backward to next offset only if not on offset.
BMonthEnd.rollforward(self, dt) Roll provided date forward to next offset only if not on offset.
pandas.tseries.offsets.BMonthEnd.apply
BMonthEnd.apply(self, other)
pandas.tseries.offsets.BMonthEnd.apply_index
BMonthEnd.apply_index(self, other)
pandas.tseries.offsets.BMonthEnd.copy
BMonthEnd.copy(self)
pandas.tseries.offsets.BMonthEnd.isAnchored
BMonthEnd.isAnchored(self)
pandas.tseries.offsets.BMonthEnd.onOffset
BMonthEnd.onOffset(self, dt)
pandas.tseries.offsets.BMonthEnd.rollback
BMonthEnd.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.BMonthEnd.rollforward
BMonthEnd.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
6.8.42 BMonthBegin
pandas.tseries.offsets.BMonthBegin
pandas.tseries.offsets.BMonthBegin
alias of pandas.tseries.offsets.BusinessMonthBegin
Properties
pandas.tseries.offsets.BMonthBegin.base
BMonthBegin.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
pandas.tseries.offsets.BMonthBegin.freqstr
BMonthBegin.freqstr
pandas.tseries.offsets.BMonthBegin.kwds
BMonthBegin.kwds
pandas.tseries.offsets.BMonthBegin.name
BMonthBegin.name
pandas.tseries.offsets.BMonthBegin.nanos
BMonthBegin.nanos
pandas.tseries.offsets.BMonthBegin.normalize
BMonthBegin.normalize = False
pandas.tseries.offsets.BMonthBegin.rule_code
BMonthBegin.rule_code
Methods
BMonthBegin.apply(self, other)
BMonthBegin.apply_index(self, other)
BMonthBegin.copy(self)
BMonthBegin.isAnchored(self)
pandas.tseries.offsets.BMonthBegin.apply
BMonthBegin.apply(self, other)
pandas.tseries.offsets.BMonthBegin.apply_index
BMonthBegin.apply_index(self, other)
pandas.tseries.offsets.BMonthBegin.copy
BMonthBegin.copy(self)
pandas.tseries.offsets.BMonthBegin.isAnchored
BMonthBegin.isAnchored(self)
pandas.tseries.offsets.BMonthBegin.onOffset
BMonthBegin.onOffset(self, dt)
pandas.tseries.offsets.BMonthBegin.rollback
BMonthBegin.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.BMonthBegin.rollforward
BMonthBegin.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
6.8.43 CBMonthEnd
pandas.tseries.offsets.CBMonthEnd
pandas.tseries.offsets.CBMonthEnd
alias of pandas.tseries.offsets.CustomBusinessMonthEnd
Properties
pandas.tseries.offsets.CBMonthEnd.base
CBMonthEnd.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
pandas.tseries.offsets.CBMonthEnd.cbday_roll
CBMonthEnd.cbday_roll
Define default roll function to be called in apply method.
pandas.tseries.offsets.CBMonthEnd.freqstr
CBMonthEnd.freqstr
pandas.tseries.offsets.CBMonthEnd.kwds
CBMonthEnd.kwds
pandas.tseries.offsets.CBMonthEnd.m_offset
CBMonthEnd.m_offset
pandas.tseries.offsets.CBMonthEnd.month_roll
CBMonthEnd.month_roll
Define default roll function to be called in apply method.
pandas.tseries.offsets.CBMonthEnd.name
CBMonthEnd.name
pandas.tseries.offsets.CBMonthEnd.nanos
CBMonthEnd.nanos
pandas.tseries.offsets.CBMonthEnd.normalize
CBMonthEnd.normalize = False
pandas.tseries.offsets.CBMonthEnd.offset
CBMonthEnd.offset
Alias for self._offset.
pandas.tseries.offsets.CBMonthEnd.rule_code
CBMonthEnd.rule_code
Methods
CBMonthEnd.apply(self, other)
CBMonthEnd.apply_index(self, other) Vectorized apply of DateOffset to DatetimeIndex; raises NotImplementedError for offsets without a vectorized implementation.
CBMonthEnd.copy(self)
CBMonthEnd.isAnchored(self)
CBMonthEnd.onOffset(self, dt)
CBMonthEnd.rollback(self, dt) Roll provided date backward to next offset only if not on offset.
CBMonthEnd.rollforward(self, dt) Roll provided date forward to next offset only if not on offset.
pandas.tseries.offsets.CBMonthEnd.apply
CBMonthEnd.apply(self, other)
pandas.tseries.offsets.CBMonthEnd.apply_index
CBMonthEnd.apply_index(self, other)
Vectorized apply of DateOffset to DatetimeIndex; raises NotImplementedError for offsets without a vectorized implementation.
Parameters
i [DatetimeIndex]
Returns
y [DatetimeIndex]
pandas.tseries.offsets.CBMonthEnd.copy
CBMonthEnd.copy(self)
pandas.tseries.offsets.CBMonthEnd.isAnchored
CBMonthEnd.isAnchored(self)
pandas.tseries.offsets.CBMonthEnd.onOffset
CBMonthEnd.onOffset(self, dt)
pandas.tseries.offsets.CBMonthEnd.rollback
CBMonthEnd.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.CBMonthEnd.rollforward
CBMonthEnd.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
6.8.44 CBMonthBegin
pandas.tseries.offsets.CBMonthBegin
pandas.tseries.offsets.CBMonthBegin
alias of pandas.tseries.offsets.CustomBusinessMonthBegin
Properties
pandas.tseries.offsets.CBMonthBegin.base
CBMonthBegin.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
pandas.tseries.offsets.CBMonthBegin.cbday_roll
CBMonthBegin.cbday_roll
Define default roll function to be called in apply method.
pandas.tseries.offsets.CBMonthBegin.freqstr
CBMonthBegin.freqstr
pandas.tseries.offsets.CBMonthBegin.kwds
CBMonthBegin.kwds
pandas.tseries.offsets.CBMonthBegin.m_offset
CBMonthBegin.m_offset
pandas.tseries.offsets.CBMonthBegin.month_roll
CBMonthBegin.month_roll
Define default roll function to be called in apply method.
pandas.tseries.offsets.CBMonthBegin.name
CBMonthBegin.name
pandas.tseries.offsets.CBMonthBegin.nanos
CBMonthBegin.nanos
pandas.tseries.offsets.CBMonthBegin.normalize
CBMonthBegin.normalize = False
pandas.tseries.offsets.CBMonthBegin.offset
CBMonthBegin.offset
Alias for self._offset.
pandas.tseries.offsets.CBMonthBegin.rule_code
CBMonthBegin.rule_code
Methods
CBMonthBegin.apply(self, other)
CBMonthBegin.apply_index(self, other) Vectorized apply of DateOffset to DatetimeIndex; raises NotImplementedError for offsets without a vectorized implementation.
CBMonthBegin.copy(self)
CBMonthBegin.isAnchored(self)
CBMonthBegin.onOffset(self, dt)
CBMonthBegin.rollback(self, dt) Roll provided date backward to next offset only if not on offset.
CBMonthBegin.rollforward(self, dt) Roll provided date forward to next offset only if not on offset.
pandas.tseries.offsets.CBMonthBegin.apply
CBMonthBegin.apply(self, other)
pandas.tseries.offsets.CBMonthBegin.apply_index
CBMonthBegin.apply_index(self, other)
Vectorized apply of DateOffset to DatetimeIndex; raises NotImplementedError for offsets without a vectorized implementation.
Parameters
i [DatetimeIndex]
Returns
y [DatetimeIndex]
pandas.tseries.offsets.CBMonthBegin.copy
CBMonthBegin.copy(self)
pandas.tseries.offsets.CBMonthBegin.isAnchored
CBMonthBegin.isAnchored(self)
pandas.tseries.offsets.CBMonthBegin.onOffset
CBMonthBegin.onOffset(self, dt)
pandas.tseries.offsets.CBMonthBegin.rollback
CBMonthBegin.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.CBMonthBegin.rollforward
CBMonthBegin.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
6.8.45 CDay
pandas.tseries.offsets.CDay
pandas.tseries.offsets.CDay
alias of pandas.tseries.offsets.CustomBusinessDay
Properties
pandas.tseries.offsets.CDay.base
CDay.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
pandas.tseries.offsets.CDay.freqstr
CDay.freqstr
pandas.tseries.offsets.CDay.kwds
CDay.kwds
pandas.tseries.offsets.CDay.name
CDay.name
pandas.tseries.offsets.CDay.nanos
CDay.nanos
pandas.tseries.offsets.CDay.normalize
CDay.normalize = False
pandas.tseries.offsets.CDay.offset
CDay.offset
Alias for self._offset.
pandas.tseries.offsets.CDay.rule_code
CDay.rule_code
Methods
CDay.apply(self, other)
CDay.apply_index(self, i) Vectorized apply of DateOffset to DatetimeIndex; raises NotImplementedError for offsets without a vectorized implementation.
CDay.copy(self)
CDay.isAnchored(self)
CDay.onOffset(self, dt)
CDay.rollback(self, dt) Roll provided date backward to next offset only if not on offset.
CDay.rollforward(self, dt) Roll provided date forward to next offset only if not on offset.
pandas.tseries.offsets.CDay.apply
CDay.apply(self, other)
pandas.tseries.offsets.CDay.apply_index
CDay.apply_index(self, i)
Vectorized apply of DateOffset to DatetimeIndex; raises NotImplementedError for offsets without a vectorized implementation.
Parameters
i [DatetimeIndex]
Returns
y [DatetimeIndex]
pandas.tseries.offsets.CDay.copy
CDay.copy(self)
pandas.tseries.offsets.CDay.isAnchored
CDay.isAnchored(self)
pandas.tseries.offsets.CDay.onOffset
CDay.onOffset(self, dt)
pandas.tseries.offsets.CDay.rollback
CDay.rollback(self, dt)
Roll provided date backward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.CDay.rollforward
CDay.rollforward(self, dt)
Roll provided date forward to next offset only if not on offset.
Returns
Timestamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
6.9 Frequencies
6.9.1 pandas.tseries.frequencies.to_offset
pandas.tseries.frequencies.to_offset(freq)
Return DateOffset object from string or tuple representation or datetime.timedelta object
Parameters
freq [str, tuple, datetime.timedelta, DateOffset or None]
Returns
DateOffset None if freq is None.
Raises
ValueError If freq is an invalid frequency
See also:
DateOffset
Examples
>>> to_offset('5min')
<5 * Minutes>
>>> to_offset('1D1H')
<25 * Hours>
>>> to_offset(datetime.timedelta(days=1))
<Day>
>>> to_offset(Hour())
<Hour>
6.10 Window
pandas.core.window.Rolling.count
Rolling.count(self)
The rolling count of any non-NaN observations inside the window.
Returns
Series or DataFrame Returned object type is determined by the caller of the rolling
calculation.
See also:
Examples
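The example block is empty in this extraction; a minimal sketch (the Series values are illustrative) of the rolling non-NaN count with a window of two:
>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series([2, 3, np.nan, 10])
>>> s.rolling(2).count()
0    1.0
1    2.0
2    1.0
3    1.0
dtype: float64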
pandas.core.window.Rolling.sum
Examples
>>> s.rolling(3).sum()
0 NaN
1 NaN
2 6.0
3 9.0
4 12.0
dtype: float64
>>> s.expanding(3).sum()
0 NaN
1 NaN
2 6.0
3 10.0
4 15.0
dtype: float64
>>> df.rolling(3).sum()
A B
0 NaN NaN
1 NaN NaN
2 6.0 14.0
3 9.0 29.0
4 12.0 50.0
pandas.core.window.Rolling.mean
Examples
The example below shows a rolling mean calculation with a window size of three.
>>> s.rolling(3).mean()
0 NaN
1 NaN
2 2.0
3 3.0
dtype: float64
pandas.core.window.Rolling.median
Rolling.median(self, **kwargs)
Calculate the rolling median.
Parameters
**kwargs For compatibility with other rolling methods. Has no effect on the computed median.
Returns
Series or DataFrame Returned type is the same as the original object.
See also:
Examples
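A minimal sketch (the Series values are illustrative) of a rolling median over a window of three:
>>> import pandas as pd
>>> s = pd.Series([0, 1, 2, 3, 4])
>>> s.rolling(3).median()
0    NaN
1    NaN
2    1.0
3    2.0
4    3.0
dtype: float64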
pandas.core.window.Rolling.var
Notes
The default ddof of 1 used in Series.var() is different than the default ddof of 0 in numpy.var().
A minimum of 1 period is required for the rolling calculation.
Examples
>>> s.expanding(3).var()
0 NaN
1 NaN
2 0.333333
3 0.916667
4 0.800000
5 0.700000
6 0.619048
dtype: float64
pandas.core.window.Rolling.std
Notes
The default ddof of 1 used in Series.std is different than the default ddof of 0 in numpy.std.
A minimum of one period is required for the rolling calculation.
Examples
>>> s.expanding(3).std()
0 NaN
1 NaN
2 0.577350
3 0.957427
4 0.894427
5 0.836660
6 0.786796
dtype: float64
pandas.core.window.Rolling.min
Examples
pandas.core.window.Rolling.max
pandas.core.window.Rolling.corr
Notes
Examples
The below example shows a rolling calculation with a window size of four matching the equivalent
function call using numpy.corrcoef().
>>> v1 = [3, 3, 3, 5, 8]
>>> v2 = [3, 4, 4, 4, 8]
>>> fmt = "{0:.6f}" # limit the printed precision to 6 digits
>>> # numpy returns a 2X2 array, the correlation coefficient
>>> # is the number at entry [0][1]
>>> print(fmt.format(np.corrcoef(v1[:-1], v2[:-1])[0][1]))
0.333333
>>> print(fmt.format(np.corrcoef(v1[1:], v2[1:])[0][1]))
0.916949
>>> s1 = pd.Series(v1)
>>> s2 = pd.Series(v2)
>>> s1.rolling(4).corr(s2)
0 NaN
1 NaN
2 NaN
3 0.333333
4 0.916949
dtype: float64
The below example shows a similar rolling calculation on a DataFrame using the pairwise option.
>>> matrix = np.array([[51., 35.], [49., 30.], [47., 32.], [46., 31.], [50., 36.]])
pandas.core.window.Rolling.cov
pandas.core.window.Rolling.skew
Rolling.skew(self, **kwargs)
Unbiased rolling skewness.
Parameters
**kwargs Keyword arguments to be passed into func.
Returns
Series or DataFrame Return type is determined by the caller.
See also:
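No example survives in this extraction; the following is only a sketch of the call shape (the data are illustrative and the output is omitted):
>>> import pandas as pd
>>> s = pd.Series([1, 2, 4, 8, 16])
>>> s.rolling(4).skew()  # entries 0-2 are NaN; a full window of four values is needed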
pandas.core.window.Rolling.kurt
Rolling.kurt(self, **kwargs)
Calculate unbiased rolling kurtosis.
This function uses Fisher’s definition of kurtosis without bias.
Parameters
**kwargs Under Review.
Returns
Series or DataFrame Returned object type is determined by the caller of the rolling
calculation.
See also:
Notes
Examples
The example below will show a rolling calculation with a window size of four matching the equivalent
function call using scipy.stats.
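The example the text refers to is not reproduced here; the sketch below (illustrative data) compares the first full window against scipy.stats.kurtosis with bias=False, which uses the same unbiased Fisher definition:
>>> import pandas as pd
>>> import scipy.stats
>>> arr = [1, 2, 3, 4, 999]
>>> print("{0:.6f}".format(scipy.stats.kurtosis(arr[:-1], bias=False)))
-1.200000
>>> s = pd.Series(arr)
>>> s.rolling(4).kurt()  # the value at index 3 matches the scipy result above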
pandas.core.window.Rolling.apply
pandas.core.window.Rolling.aggregate
func [function, str, list or dict] Function to use for aggregating the data. If a function, must either work when passed a Series/DataFrame or when passed to Series/DataFrame.apply.
Accepted combinations are:
• function
• string function name
• list of functions and/or function names, e.g. [np.sum, 'mean']
• dict of axis labels -> functions, function names or list of such.
*args Positional arguments to pass to func.
**kwargs Keyword arguments to pass to func.
Returns
scalar, Series or DataFrame The return can be:
• scalar : when Series.agg is called with single function
• Series : when DataFrame.agg is called with a single function
• DataFrame : when DataFrame.agg is called with several functions
Return scalar, Series or DataFrame.
See also:
Series.rolling
DataFrame.rolling
Notes
Examples
>>> df.rolling(3).sum()
A B C
0 NaN NaN NaN
1 NaN NaN NaN
2 -2.655105 0.637799 -2.135068
3 -0.971785 -0.600366 -3.280224
4 -0.214334 -1.294599 -3.227500
5 1.514216 2.028250 -2.989060
6 1.074618 5.709767 -2.322600
7 2.718061 3.850718 0.256446
8 -0.289082 2.454418 1.416871
9 0.212668 0.403198 -0.093924
pandas.core.window.Rolling.quantile
See also:
Series.quantile Computes value at the given quantile over all data in Series.
DataFrame.quantile Computes values at the given quantile over requested axis in DataFrame.
Examples
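A minimal sketch (the Series values are illustrative) showing the effect of the interpolation argument on a rolling quantile:
>>> import pandas as pd
>>> s = pd.Series([1, 2, 3, 4])
>>> s.rolling(2).quantile(.4, interpolation='lower')
0    NaN
1    1.0
2    2.0
3    3.0
dtype: float64
>>> s.rolling(2).quantile(.4, interpolation='midpoint')
0    NaN
1    1.5
2    2.5
3    3.5
dtype: float64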
pandas.core.window.Window.mean
Examples
The example below shows a rolling mean calculation with a window size of three.
>>> s.rolling(3).mean()
0 NaN
1 NaN
2 2.0
3 3.0
dtype: float64
pandas.core.window.Window.sum
Examples
>>> s.rolling(3).sum()
0 NaN
1 NaN
2 6.0
3 9.0
4 12.0
dtype: float64
>>> s.expanding(3).sum()
0 NaN
1 NaN
2 6.0
3 10.0
4 15.0
dtype: float64
>>> df.rolling(3).sum()
A B
0 NaN NaN
1 NaN NaN
2 6.0 14.0
3 9.0 29.0
4 12.0 50.0
pandas.core.window.Expanding.count
Expanding.count(self, **kwargs)
The expanding count of any non-NaN observations inside the window.
Returns
Series or DataFrame Returned object type is determined by the caller of the expanding calculation.
See also:
Examples
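The example block is empty in this extraction; a minimal sketch (illustrative values) of the expanding non-NaN count:
>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series([2, 3, np.nan, 10])
>>> s.expanding().count()
0    1.0
1    2.0
2    2.0
3    3.0
dtype: float64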
pandas.core.window.Expanding.sum
Examples
>>> s.rolling(3).sum()
0 NaN
1 NaN
2 6.0
3 9.0
4 12.0
dtype: float64
>>> s.expanding(3).sum()
0 NaN
1 NaN
2 6.0
3 10.0
4 15.0
dtype: float64
>>> df.rolling(3).sum()
A B
0 NaN NaN
1 NaN NaN
2 6.0 14.0
3 9.0 29.0
4 12.0 50.0
pandas.core.window.Expanding.mean
Examples
The example below shows a rolling mean calculation with a window size of three.
>>> s.rolling(3).mean()
0 NaN
1 NaN
2 2.0
3 3.0
dtype: float64
pandas.core.window.Expanding.median
Expanding.median(self, **kwargs)
Calculate the expanding median.
Parameters
**kwargs For compatibility with other expanding methods. Has no effect on the
computed median.
Returns
Series or DataFrame Returned type is the same as the original object.
See also:
Examples
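A minimal sketch (illustrative values) of an expanding median with a minimum of three periods:
>>> import pandas as pd
>>> s = pd.Series([0, 1, 2, 3, 4])
>>> s.expanding(3).median()
0    NaN
1    NaN
2    1.0
3    1.5
4    2.0
dtype: float64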
pandas.core.window.Expanding.var
Parameters
ddof [int, default 1] Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
*args, **kwargs For NumPy compatibility. No additional arguments are used.
Returns
Series or DataFrame Returns the same object type as the caller of the expanding
calculation.
See also:
Notes
The default ddof of 1 used in Series.var() is different than the default ddof of 0 in numpy.var().
A minimum of 1 period is required for the rolling calculation.
Examples
>>> s.expanding(3).var()
0 NaN
1 NaN
2 0.333333
3 0.916667
4 0.800000
5 0.700000
6 0.619048
dtype: float64
pandas.core.window.Expanding.std
Notes
The default ddof of 1 used in Series.std is different than the default ddof of 0 in numpy.std.
A minimum of one period is required for the rolling calculation.
Examples
>>> s.expanding(3).std()
0 NaN
1 NaN
2 0.577350
3 0.957427
4 0.894427
5 0.836660
6 0.786796
dtype: float64
pandas.core.window.Expanding.min
Examples
pandas.core.window.Expanding.max
pandas.core.window.Expanding.corr
Notes
Examples
The below example shows a rolling calculation with a window size of four matching the equivalent
function call using numpy.corrcoef().
>>> v1 = [3, 3, 3, 5, 8]
>>> v2 = [3, 4, 4, 4, 8]
>>> fmt = "{0:.6f}" # limit the printed precision to 6 digits
>>> # numpy returns a 2X2 array, the correlation coefficient
>>> # is the number at entry [0][1]
>>> print(fmt.format(np.corrcoef(v1[:-1], v2[:-1])[0][1]))
0.333333
>>> print(fmt.format(np.corrcoef(v1[1:], v2[1:])[0][1]))
0.916949
>>> s1 = pd.Series(v1)
>>> s2 = pd.Series(v2)
>>> s1.rolling(4).corr(s2)
0 NaN
1 NaN
2 NaN
3 0.333333
4 0.916949
dtype: float64
The below example shows a similar rolling calculation on a DataFrame using the pairwise option.
>>> matrix = np.array([[51., 35.], [49., 30.], [47., 32.], [46., 31.], [50., 36.]])
pandas.core.window.Expanding.cov
pandas.core.window.Expanding.skew
Expanding.skew(self, **kwargs)
Unbiased expanding skewness.
Parameters
**kwargs Keyword arguments to be passed into func.
Returns
Series or DataFrame Return type is determined by the caller.
See also:
pandas.core.window.Expanding.kurt
Expanding.kurt(self, **kwargs)
Calculate unbiased expanding kurtosis.
This function uses Fisher’s definition of kurtosis without bias.
Parameters
**kwargs Under Review.
Returns
Series or DataFrame Returned object type is determined by the caller of the expanding calculation.
See also:
Notes
Examples
The example below will show an expanding calculation with a window size of four matching the
equivalent function call using scipy.stats.
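That example is not reproduced in this extraction; the sketch below (illustrative data) checks the first full window against scipy.stats.kurtosis with bias=False, the same unbiased Fisher definition:
>>> import pandas as pd
>>> import scipy.stats
>>> arr = [1, 2, 3, 4, 999]
>>> print("{0:.6f}".format(scipy.stats.kurtosis(arr[:-1], bias=False)))
-1.200000
>>> s = pd.Series(arr)
>>> s.expanding(4).kurt()  # the value at index 3 matches the scipy result above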
pandas.core.window.Expanding.apply
• True or None : the passed function will receive ndarray objects instead. If you
are just applying a NumPy reduction function this will achieve much better
performance.
The raw parameter is required and will show a FutureWarning if not passed. In
the future raw will default to False.
New in version 0.23.0.
*args, **kwargs Arguments and keyword arguments to be passed into func.
Returns
Series or DataFrame Return type is determined by the caller.
See also:
pandas.core.window.Expanding.aggregate
DataFrame.expanding.aggregate
DataFrame.rolling.aggregate
DataFrame.aggregate
Notes
Examples
>>> df.ewm(alpha=0.5).mean()
A B C
0 -2.385977 -0.102758 0.438822
1 -1.464856 0.569633 -0.490089
2 -0.207700 0.149687 -1.135379
3 -0.471677 -0.645305 -0.906555
4 -0.355635 -0.203033 -0.904111
5 1.076417 1.503943 -1.146293
6 -0.041654 1.925562 -0.588728
7 0.680292 0.132049 0.548693
8 0.067236 0.948257 0.163353
9 -0.286980 0.618493 -0.694496
pandas.core.window.Expanding.quantile
• higher: j.
• nearest: i or j whichever is nearest.
• midpoint: (i + j) / 2.
**kwargs: For compatibility with other expanding methods. Has no effect on the
result.
Returns
Series or DataFrame Returned object type is determined by the caller of the expanding calculation.
See also:
Series.quantile Computes value at the given quantile over all data in Series.
DataFrame.quantile Computes values at the given quantile over requested axis in DataFrame.
Examples
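A minimal sketch (illustrative values) of an expanding median via quantile(.5) with a minimum of two periods:
>>> import pandas as pd
>>> s = pd.Series([1, 2, 3, 4])
>>> s.expanding(2).quantile(.5)
0    NaN
1    1.5
2    2.0
3    2.5
dtype: float64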
pandas.core.window.EWM.mean
pandas.core.window.EWM.std
pandas.core.window.EWM.var
pandas.core.window.EWM.corr
pairwise [bool, default None] If False then only matching columns between self and
other will be used and the output will be a DataFrame. If True then all pairwise
combinations will be calculated and the output will be a MultiIndex DataFrame
in the case of DataFrame inputs. In the case of missing elements, only complete
pairwise observations will be used.
bias [bool, default False] Use a standard estimation bias correction.
**kwargs Keyword arguments to be passed into func.
Returns
Series or DataFrame Return type is determined by the caller.
See also:
pandas.core.window.EWM.cov
6.11 GroupBy
pandas.core.groupby.GroupBy.__iter__
GroupBy.__iter__(self)
Groupby iterator.
Returns
Generator yielding sequence of (name, subsetted object) for each group.
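A minimal sketch (the frame is illustrative) of iterating over the (name, group) pairs:
>>> import pandas as pd
>>> df = pd.DataFrame({'A': ['a', 'a', 'b'], 'B': [1, 2, 3]})
>>> for name, group in df.groupby('A'):
...     print(name, len(group))
a 2
b 1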
pandas.core.groupby.GroupBy.groups
GroupBy.groups
Dict {group name -> group labels}.
pandas.core.groupby.GroupBy.indices
GroupBy.indices
Dict {group name -> group indices}.
pandas.core.groupby.GroupBy.get_group
Grouper([key, level, freq, axis, sort]) A Grouper allows the user to specify a groupby instruction for a target object.
pandas.Grouper
Examples
>>> df.groupby(Grouper(key='A'))
Specify a resample operation on the level ‘date’ on the columns axis with a frequency of 60s
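A sketch of that instruction (it assumes a frame whose column index carries a 'date' level):
>>> df.groupby(Grouper(level='date', freq='60s', axis=1))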
Attributes
ax
groups
GroupBy.apply(self, func, *args, **kwargs) Apply function func group-wise and combine the results together.
GroupBy.agg(self, func, *args, **kwargs)
GroupBy.aggregate(self, func, *args, **kwargs)
GroupBy.transform(self, func, *args, **kwargs)
GroupBy.pipe(self, func, *args, **kwargs) Apply a function func with arguments to this GroupBy object and return the function's result.
pandas.core.groupby.GroupBy.apply
pipe Apply function to the full GroupBy object instead of to each group.
aggregate Apply aggregate function to the GroupBy object.
transform Apply function column-by-column to the GroupBy object.
Series.apply Apply a function to a Series.
DataFrame.apply Apply a function to each row or column of a DataFrame.
pandas.core.groupby.GroupBy.agg
pandas.core.groupby.GroupBy.aggregate
pandas.core.groupby.GroupBy.transform
pandas.core.groupby.GroupBy.pipe
>>> (df.groupby('group')
... .pipe(f)
... .pipe(g, arg1=a)
... .pipe(h, arg2=b, arg3=c))
Notes
Examples
To get the difference between each group's maximum and minimum value in one pass, you can do, for instance:
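(a sketch; the 'group' column name is illustrative)
>>> df.groupby('group').pipe(lambda x: x.max() - x.min())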
GroupBy.all(self[, skipna]) Return True if all values in the group are truthful, else False.
GroupBy.any(self[, skipna]) Return True if any value in the group is truthful, else False.
GroupBy.bfill(self[, limit]) Backward fill the values.
GroupBy.count(self) Compute count of group, excluding missing values.
GroupBy.cumcount(self[, ascending]) Number each item in each group from 0 to the length of that group - 1.
GroupBy.cummax(self[, axis]) Cumulative max for each group.
GroupBy.cummin(self[, axis]) Cumulative min for each group.
GroupBy.cumprod(self[, axis]) Cumulative product for each group.
GroupBy.cumsum(self[, axis]) Cumulative sum for each group.
GroupBy.ffill(self[, limit]) Forward fill the values.
GroupBy.first(self, **kwargs) Compute first of group values.
GroupBy.head(self[, n]) Return first n rows of each group.
GroupBy.last(self, **kwargs) Compute last of group values.
GroupBy.max(self, **kwargs) Compute max of group values.
GroupBy.mean(self, *args, **kwargs) Compute mean of groups, excluding missing values.
GroupBy.median(self, **kwargs) Compute median of groups, excluding missing values.
GroupBy.min(self, **kwargs) Compute min of group values.
GroupBy.ngroup(self[, ascending]) Number each group from 0 to the number of groups - 1.
GroupBy.nth(self, n[, dropna]) Take the nth row from each group if n is an int, or a subset of rows if n is a list of ints.
GroupBy.ohlc(self) Compute open, high, low and close values within each group, excluding missing values.
GroupBy.prod(self, **kwargs) Compute prod of group values.
GroupBy.rank(self[, method, ascending, …]) Provide the rank of values within each group.
GroupBy.pct_change(self[, periods, …]) Calculate pct_change of each value to previous entry in group.
GroupBy.size(self) Compute group sizes.
GroupBy.sem(self[, ddof]) Compute standard error of the mean of groups, excluding missing values.
pandas.core.groupby.GroupBy.all
GroupBy.all(self, skipna=True)
Return True if all values in the group are truthful, else False.
Parameters
skipna [bool, default True] Flag to ignore nan values during truth testing
Returns
bool
See also:
Series.groupby
DataFrame.groupby
pandas.core.groupby.GroupBy.any
GroupBy.any(self, skipna=True)
Return True if any value in the group is truthful, else False.
Parameters
skipna [bool, default True] Flag to ignore nan values during truth testing
Returns
bool
See also:
Series.groupby
DataFrame.groupby
pandas.core.groupby.GroupBy.bfill
GroupBy.bfill(self, limit=None)
Backward fill the values.
Parameters
limit [integer, optional] limit of how many values to fill
Returns
Series or DataFrame Object with missing values filled.
See also:
Series.backfill
DataFrame.backfill
Series.fillna
DataFrame.fillna
pandas.core.groupby.GroupBy.count
GroupBy.count(self )
Compute count of group, excluding missing values.
Returns
Series or DataFrame Count of values within each group.
See also:
Series.groupby
DataFrame.groupby
pandas.core.groupby.GroupBy.cumcount
GroupBy.cumcount(self, ascending=True)
Number each item in each group from 0 to the length of that group - 1.
Essentially this is equivalent to .apply(lambda x: pd.Series(np.arange(len(x)), x.index)).
Parameters
ascending [bool, default True] If False, number in reverse, from length of group - 1
to 0.
Returns
Series Sequence number of each element within each group.
See also:
Examples
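A minimal sketch (the frame is illustrative) of numbering rows within each group, in both directions:
>>> import pandas as pd
>>> df = pd.DataFrame([['a'], ['a'], ['a'], ['b'], ['b'], ['a']], columns=['A'])
>>> df.groupby('A').cumcount()
0    0
1    1
2    2
3    0
4    1
5    3
dtype: int64
>>> df.groupby('A').cumcount(ascending=False)
0    3
1    2
2    1
3    1
4    0
5    0
dtype: int64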
pandas.core.groupby.GroupBy.cummax
Series.groupby
DataFrame.groupby
pandas.core.groupby.GroupBy.cummin
Series.groupby
DataFrame.groupby
pandas.core.groupby.GroupBy.cumprod
Series or DataFrame
See also:
Series.groupby
DataFrame.groupby
pandas.core.groupby.GroupBy.cumsum
Series.groupby
DataFrame.groupby
pandas.core.groupby.GroupBy.ffill
GroupBy.ffill(self, limit=None)
Forward fill the values.
Parameters
limit [integer, optional] limit of how many values to fill
Returns
Series or DataFrame Object with missing values filled.
See also:
Series.pad
DataFrame.pad
Series.fillna
DataFrame.fillna
pandas.core.groupby.GroupBy.first
GroupBy.first(self, **kwargs)
Compute first of group values.
Returns
Series or DataFrame Computed first of values within each group.
pandas.core.groupby.GroupBy.head
GroupBy.head(self, n=5)
Return first n rows of each group.
Essentially equivalent to .apply(lambda x: x.head(n)), except ignores as_index flag.
Returns
Series or DataFrame
See also:
Series.groupby
DataFrame.groupby
Examples
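A minimal sketch (illustrative frame); note that, unlike nth, the original row index is preserved:
>>> import pandas as pd
>>> df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
>>> df.groupby('A').head(1)
   A  B
0  1  2
2  5  6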
pandas.core.groupby.GroupBy.last
GroupBy.last(self, **kwargs)
Compute last of group values.
Returns
Series or DataFrame Computed last of values within each group.
pandas.core.groupby.GroupBy.max
GroupBy.max(self, **kwargs)
Compute max of group values.
Returns
Series or DataFrame Computed max of values within each group.
pandas.core.groupby.GroupBy.mean
See also:
Series.groupby
DataFrame.groupby
Examples
Groupby one column and return the mean of the remaining columns in each group.
>>> df.groupby('A').mean()
B C
A
1 3.0 1.333333
2 4.0 1.500000
Groupby two columns and return the mean of the remaining column.
Groupby one column and return the mean of only a particular column in the group.
>>> df.groupby('A')['B'].mean()
A
1 3.0
2 4.0
Name: B, dtype: float64
pandas.core.groupby.GroupBy.median
GroupBy.median(self, **kwargs)
Compute median of groups, excluding missing values.
For multiple groupings, the result index will be a MultiIndex
Returns
Series or DataFrame Median of values within each group.
See also:
Series.groupby
DataFrame.groupby
pandas.core.groupby.GroupBy.min
GroupBy.min(self, **kwargs)
Compute min of group values.
Returns
Series or DataFrame Computed min of values within each group.
pandas.core.groupby.GroupBy.ngroup
GroupBy.ngroup(self, ascending=True)
Number each group from 0 to the number of groups - 1.
This is the enumerative complement of cumcount. Note that the numbers given to the groups match
the order in which the groups would be seen when iterating over the groupby object, not the order
they are first observed.
New in version 0.20.2.
Parameters
ascending [bool, default True] If False, number in reverse, from number of group - 1
to 0.
Returns
Series Unique numbers for each group.
See also:
Examples
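A minimal sketch (illustrative frame) contrasting ngroup with its enumerative complement cumcount, as described above:
>>> import pandas as pd
>>> df = pd.DataFrame({'A': ['a', 'b', 'a', 'b']})
>>> df.groupby('A').ngroup()
0    0
1    1
2    0
3    1
dtype: int64
>>> df.groupby('A').cumcount()  # the enumerative complement
0    0
1    0
2    1
3    1
dtype: int64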
pandas.core.groupby.GroupBy.nth
Series.groupby
DataFrame.groupby
Examples
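A minimal sketch (illustrative frame) of taking the nth row from each group, with and without dropping NaN rows first:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'A': [1, 1, 2, 1, 2],
...                    'B': [np.nan, 2, 3, 4, 5]}, columns=['A', 'B'])
>>> df.groupby('A').nth(0)
     B
A
1  NaN
2  3.0
>>> df.groupby('A').nth(0, dropna='any')  # first row without NaN in each group
     B
A
1  2.0
2  3.0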
pandas.core.groupby.GroupBy.ohlc
GroupBy.ohlc(self)
Compute open, high, low and close values within each group, excluding missing values.
For multiple groupings, the result index will be a MultiIndex
Returns
DataFrame Open, high, low and close values within each group.
See also:
Series.groupby
DataFrame.groupby
pandas.core.groupby.GroupBy.prod
GroupBy.prod(self, **kwargs)
Compute prod of group values.
Returns
Series or DataFrame Computed prod of values within each group.
pandas.core.groupby.GroupBy.rank
Series.groupby
DataFrame.groupby
pandas.core.groupby.GroupBy.pct_change
Series.groupby
DataFrame.groupby
pandas.core.groupby.GroupBy.size
GroupBy.size(self)
Compute group sizes.
Returns
Series Number of rows in each group.
See also:
Series.groupby
DataFrame.groupby
pandas.core.groupby.GroupBy.sem
GroupBy.sem(self, ddof=1)
Compute standard error of the mean of groups, excluding missing values.
For multiple groupings, the result index will be a MultiIndex.
Parameters
ddof [integer, default 1] degrees of freedom
Returns
Series or DataFrame Standard error of the mean of values within each group.
See also:
Series.groupby
DataFrame.groupby
pandas.core.groupby.GroupBy.std
Series.groupby
DataFrame.groupby
pandas.core.groupby.GroupBy.sum
GroupBy.sum(self, **kwargs)
Compute sum of group values.
Returns
Series or DataFrame Computed sum of values within each group.
pandas.core.groupby.GroupBy.var
Series.groupby
DataFrame.groupby
pandas.core.groupby.GroupBy.tail
GroupBy.tail(self, n=5)
Return last n rows of each group.
Essentially equivalent to .apply(lambda x: x.tail(n)), except ignores as_index flag.
Returns
Series or DataFrame
See also:
Series.groupby
DataFrame.groupby
Examples
The following methods are available in both SeriesGroupBy and DataFrameGroupBy objects, but may differ slightly: the DataFrameGroupBy version usually permits the specification of an axis argument, and often an argument indicating whether to restrict application to columns of a specific data type.
DataFrameGroupBy.all(self[, skipna]) Return True if all values in the group are truthful, else False.
DataFrameGroupBy.any(self[, skipna]) Return True if any value in the group is truthful, else False.
DataFrameGroupBy.bfill(self[, limit]) Backward fill the values.
DataFrameGroupBy.corr Compute pairwise correlation of columns, excluding NA/null values.
DataFrameGroupBy.count(self) Compute count of group, excluding missing values.
DataFrameGroupBy.cov Compute pairwise covariance of columns, excluding NA/null values.
DataFrameGroupBy.cummax(self[, axis]) Cumulative max for each group.
DataFrameGroupBy.cummin(self[, axis]) Cumulative min for each group.
DataFrameGroupBy.cumprod(self[, axis]) Cumulative product for each group.
DataFrameGroupBy.cumsum(self[, axis]) Cumulative sum for each group.
DataFrameGroupBy.describe(self, **kwargs) Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values.
DataFrameGroupBy.diff First discrete difference of element.
DataFrameGroupBy.ffill(self[, limit]) Forward fill the values.
DataFrameGroupBy.fillna Fill NA/NaN values using the specified method.
DataFrameGroupBy.filter(self, func[, dropna]) Return a copy of a DataFrame excluding elements from groups that do not satisfy the boolean criterion specified by func.
DataFrameGroupBy.hist Make a histogram of the DataFrame's columns.
DataFrameGroupBy.idxmax Return index of first occurrence of maximum over requested axis.
DataFrameGroupBy.idxmin Return index of first occurrence of minimum over requested axis.
DataFrameGroupBy.mad Return the mean absolute deviation of the values for the requested axis.
DataFrameGroupBy.nunique(self[, dropna]) Return DataFrame with number of distinct observations per group for each column.
DataFrameGroupBy.pct_change(self[, periods, …]) Calculate pct_change of each value to previous entry in group.
DataFrameGroupBy.plot Class implementing the .plot attribute for groupby objects.
DataFrameGroupBy.quantile(self[, q, …]) Return group values at the given quantile, a la numpy.percentile.
DataFrameGroupBy.rank(self[, method, …]) Provide the rank of values within each group.
DataFrameGroupBy.resample(self, rule, …) Provide resampling when using a TimeGrouper.
DataFrameGroupBy.shift(self[, periods, …]) Shift each group by periods observations.
DataFrameGroupBy.size(self) Compute group sizes.
pandas.core.groupby.DataFrameGroupBy.all
DataFrameGroupBy.all(self, skipna=True)
Return True if all values in the group are truthful, else False.
Parameters
skipna [bool, default True] Flag to ignore nan values during truth testing
Returns
bool
See also:
Series.groupby
DataFrame.groupby
pandas.core.groupby.DataFrameGroupBy.any
DataFrameGroupBy.any(self, skipna=True)
Return True if any value in the group is truthful, else False.
Parameters
skipna [bool, default True] Flag to ignore nan values during truth testing
Returns
bool
See also:
Series.groupby
DataFrame.groupby
pandas.core.groupby.DataFrameGroupBy.bfill
DataFrameGroupBy.bfill(self, limit=None)
Backward fill the values.
Parameters
limit [integer, optional] limit of how many values to fill
Returns
Series or DataFrame Object with missing values filled.
See also:
Series.backfill
DataFrame.backfill
Series.fillna
DataFrame.fillna
pandas.core.groupby.DataFrameGroupBy.corr
DataFrameGroupBy.corr
Compute pairwise correlation of columns, excluding NA/null values.
Parameters
method [{‘pearson’, ‘kendall’, ‘spearman’} or callable]
• pearson : standard correlation coefficient
• kendall : Kendall Tau correlation coefficient
• spearman : Spearman rank correlation
• callable: callable with input two 1d ndarrays and returning a float. Note that the returned matrix from corr will have 1 along the diagonals and will be symmetric regardless of the callable's behavior. New in version 0.24.0.
min_periods [int, optional] Minimum number of observations required per pair of columns to have a valid result. Currently only available for Pearson and Spearman correlation.
Returns
DataFrame Correlation matrix.
See also:
DataFrame.corrwith
Series.corr
Examples
pandas.core.groupby.DataFrameGroupBy.count
DataFrameGroupBy.count(self)
Compute count of group, excluding missing values.
Returns
DataFrame Count of values within each group.
pandas.core.groupby.DataFrameGroupBy.cov
DataFrameGroupBy.cov
Compute pairwise covariance of columns, excluding NA/null values.
Compute the pairwise covariance among the series of a DataFrame. The returned data frame is the
covariance matrix of the columns of the DataFrame.
Both NA and null values are automatically excluded from the calculation. (See the note below about
bias from missing values.) A threshold can be set for the minimum number of observations for each
value created. Comparisons with observations below this threshold will be returned as NaN.
This method is generally used for the analysis of time series data to understand the relationship between
different measures across time.
Parameters
min_periods [int, optional] Minimum number of observations required per pair of
columns to have a valid result.
Returns
DataFrame The covariance matrix of the series of the DataFrame.
See also:
Notes
Returns the covariance matrix of the DataFrame’s time series. The covariance is normalized by N-1.
For DataFrames that have Series that are missing data (assuming that data is missing at random) the
returned covariance matrix will be an unbiased estimate of the variance and covariance between the
member Series.
However, for many applications this estimate may not be acceptable because the estimate covariance
matrix is not guaranteed to be positive semi-definite. This could lead to estimate correlations having
absolute values which are greater than one, and/or a non-invertible covariance matrix. See Estimation
of covariance matrices for more details.
Examples
>>> np.random.seed(42)
>>> df = pd.DataFrame(np.random.randn(1000, 5),
... columns=['a', 'b', 'c', 'd', 'e'])
>>> df.cov()
a b c d e
a 0.998438 -0.020161 0.059277 -0.008943 0.014144
b -0.020161 1.059352 -0.008543 -0.024738 0.009826
c 0.059277 -0.008543 1.010670 -0.001486 -0.000271
d -0.008943 -0.024738 -0.001486 0.921297 -0.013692
e 0.014144 0.009826 -0.000271 -0.013692 0.977795
>>> np.random.seed(42)
>>> df = pd.DataFrame(np.random.randn(20, 3),
... columns=['a', 'b', 'c'])
>>> df.loc[df.index[:5], 'a'] = np.nan
>>> df.loc[df.index[5:10], 'b'] = np.nan
>>> df.cov(min_periods=12)
a b c
a 0.316741 NaN -0.150812
b NaN 1.248003 0.191417
c -0.150812 0.191417 0.895202
pandas.core.groupby.DataFrameGroupBy.cummax
Series.groupby
DataFrame.groupby
pandas.core.groupby.DataFrameGroupBy.cummin
Series.groupby
DataFrame.groupby
pandas.core.groupby.DataFrameGroupBy.cumprod
Series.groupby
DataFrame.groupby
pandas.core.groupby.DataFrameGroupBy.cumsum
Series.groupby
DataFrame.groupby
pandas.core.groupby.DataFrameGroupBy.describe
DataFrameGroupBy.describe(self, **kwargs)
Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s
distribution, excluding NaN values.
Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The
output will vary depending on what is provided. Refer to the notes below for more detail.
Parameters
percentiles [list-like of numbers, optional] The percentiles to include in the output.
All should fall between 0 and 1. The default is [.25, .5, .75], which returns
the 25th, 50th, and 75th percentiles.
include [‘all’, list-like of dtypes or None (default), optional] A white list of data types
to include in the result. Ignored for Series. Here are the options:
• ‘all’ : All columns of the input will be included in the output.
• A list-like of dtypes : Limits the results to the provided data types. To
limit the result to numeric types submit numpy.number. To limit it instead
to object columns submit the numpy.object data type. Strings can also be
used in the style of select_dtypes (e.g. df.describe(include=['O'])). To
select pandas categorical columns, use 'category'
Notes
For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50
and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50
percentile is the same as the median.
For object data (e.g. strings or timestamps), the result’s index will include count, unique, top,
and freq. The top is the most common value. The freq is the most common value’s frequency.
Timestamps also include the first and last items.
If multiple object values have the highest count, then the count and top results will be arbitrarily
chosen from among those with the highest count.
For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric
columns. If the dataframe consists only of object and categorical data without any numeric columns,
the default is to return an analysis of both the object and categorical columns. If include='all' is
provided as an option, the result will include a union of attributes of each type.
The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed
for the output. The parameters are ignored when analyzing a Series.
Examples
>>> s = pd.Series([
... np.datetime64("2000-01-01"),
... np.datetime64("2010-01-01"),
... np.datetime64("2010-01-01")
... ])
>>> s.describe()
count 3
unique 2
top 2010-01-01 00:00:00
freq 2
first 2000-01-01 00:00:00
last 2010-01-01 00:00:00
dtype: object
>>> df.describe(include='all')
categorical numeric object
count 3 3.0 3
unique 3 NaN 3
top f NaN c
freq 1 NaN 1
mean NaN 2.0 NaN
std NaN 1.0 NaN
min NaN 1.0 NaN
25% NaN 1.5 NaN
50% NaN 2.0 NaN
75% NaN 2.5 NaN
max NaN 3.0 NaN
>>> df.numeric.describe()
count 3.0
mean 2.0
std 1.0
min 1.0
25% 1.5
50% 2.0
75% 2.5
max 3.0
Name: numeric, dtype: float64
>>> df.describe(include=[np.number])
numeric
count 3.0
mean 2.0
std 1.0
min 1.0
25% 1.5
50% 2.0
75% 2.5
max 3.0
>>> df.describe(include=[np.object])
object
count 3
unique 3
top c
freq 1
>>> df.describe(include=['category'])
categorical
count 3
unique 3
top f
freq 1
>>> df.describe(exclude=[np.number])
categorical object
count 3 3
unique 3 3
top f c
freq 1 1
>>> df.describe(exclude=[np.object])
categorical numeric
count 3 3.0
unique 3 NaN
top f NaN
freq 1 NaN
mean NaN 2.0
std NaN 1.0
min NaN 1.0
25% NaN 1.5
50% NaN 2.0
75% NaN 2.5
max NaN 3.0
pandas.core.groupby.DataFrameGroupBy.diff
DataFrameGroupBy.diff
First discrete difference of element.
Calculates the difference of a DataFrame element compared with another element in the DataFrame
(default is the element in the same column of the previous row).
Parameters
periods [int, default 1] Periods to shift for calculating difference, accepts negative
values.
axis [{0 or ‘index’, 1 or ‘columns’}, default 0] Take difference over rows (0) or columns
(1).
New in version 0.16.1.
Returns
DataFrame
See also:
Examples
>>> df.diff()
a b c
0 NaN NaN NaN
1 1.0 0.0 3.0
2 1.0 1.0 5.0
3 1.0 1.0 7.0
4 1.0 2.0 9.0
5 1.0 3.0 11.0
>>> df.diff(axis=1)
a b c
0 NaN 0.0 0.0
1 NaN -1.0 3.0
2 NaN -1.0 7.0
3 NaN -1.0 13.0
4 NaN 0.0 20.0
5 NaN 2.0 28.0
>>> df.diff(periods=3)
a b c
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 3.0 2.0 15.0
4 3.0 4.0 21.0
5 3.0 6.0 27.0
>>> df.diff(periods=-1)
a b c
0 -1.0 0.0 -3.0
1 -1.0 -1.0 -5.0
2 -1.0 -1.0 -7.0
3 -1.0 -2.0 -9.0
4 -1.0 -3.0 -11.0
5 NaN NaN NaN
pandas.core.groupby.DataFrameGroupBy.ffill
DataFrameGroupBy.ffill(self, limit=None)
Forward fill the values.
Parameters
limit [integer, optional] limit of how many values to fill
Returns
Series or DataFrame Object with missing values filled.
See also:
Series.pad
DataFrame.pad
Series.fillna
DataFrame.fillna
pandas.core.groupby.DataFrameGroupBy.fillna
DataFrameGroupBy.fillna
Fill NA/NaN values using the specified method.
Parameters
value [scalar, dict, Series, or DataFrame] Value to use to fill holes (e.g. 0), alternately
a dict/Series/DataFrame of values specifying which value to use for each index (for
a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame
will not be filled. This value cannot be a list.
method [{'backfill', 'bfill', 'pad', 'ffill', None}, default None] Method to use for filling holes in reindexed Series. pad / ffill: propagate last valid observation forward to next valid. backfill / bfill: use next valid observation to fill gap.
axis [{0 or 'index', 1 or 'columns'}] Axis along which to fill missing values.
inplace [bool, default False] If True, fill in-place. Note: this will modify any other views on this object (e.g., a no-copy slice for a column in a DataFrame).
limit [int, default None] If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.
Examples
>>> df.fillna(0)
A B C D
0 0.0 2.0 0.0 0
1 3.0 4.0 0.0 1
2 0.0 0.0 0.0 5
3 0.0 3.0 0.0 4
>>> df.fillna(method='ffill')
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 3.0 4.0 NaN 5
3 3.0 3.0 NaN 4
Replace all NaN elements in columns 'A', 'B', 'C', and 'D' with 0, 1, 2, and 3 respectively.
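A sketch of that replacement, assuming the same frame as in the examples above:
>>> values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
>>> df.fillna(value=values)
     A    B    C  D
0  0.0  2.0  2.0  0
1  3.0  4.0  2.0  1
2  0.0  1.0  2.0  5
3  0.0  3.0  2.0  4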
pandas.core.groupby.DataFrameGroupBy.filter
Notes
Each subframe is endowed the attribute ‘name’ in case you need to know which group you are working
on.
Examples
pandas.core.groupby.DataFrameGroupBy.hist
DataFrameGroupBy.hist
Make a histogram of the DataFrame’s columns.
Examples
This example draws a histogram based on the length and width of some animals, displayed in three
bins
>>> df = pd.DataFrame({
... 'length': [1.5, 0.5, 1.2, 0.9, 3],
... 'width': [0.7, 0.2, 0.15, 0.2, 1.1]
...     })
>>> hist = df.hist(bins=3)  # doctest: +SKIP
pandas.core.groupby.DataFrameGroupBy.idxmax
DataFrameGroupBy.idxmax
Return index of first occurrence of maximum over requested axis. NA/null values are excluded.
Parameters
axis [{0 or ‘index’, 1 or ‘columns’}, default 0] 0 or ‘index’ for row-wise, 1 or ‘columns’
for column-wise
skipna [boolean, default True] Exclude NA/null values. If an entire row/column is
NA, the result will be NA.
Returns
Series Indexes of maxima along the specified axis.
Raises
ValueError
• If the row/column is empty
See also:
Series.idxmax
Notes
pandas.core.groupby.DataFrameGroupBy.idxmin
DataFrameGroupBy.idxmin
Return index of first occurrence of minimum over requested axis. NA/null values are excluded.
Parameters
axis [{0 or ‘index’, 1 or ‘columns’}, default 0] 0 or ‘index’ for row-wise, 1 or ‘columns’
for column-wise
skipna [boolean, default True] Exclude NA/null values. If an entire row/column is
NA, the result will be NA.
Returns
Series Indexes of minima along the specified axis.
Raises
ValueError
• If the row/column is empty
See also:
Series.idxmin
Notes
pandas.core.groupby.DataFrameGroupBy.mad
DataFrameGroupBy.mad
Return the mean absolute deviation of the values for the requested axis.
Parameters
axis [{index (0), columns (1)}] Axis for the function to be applied on.
skipna [bool, default True] Exclude NA/null values when computing the result.
level [int or level name, default None] If the axis is a MultiIndex (hierarchical), count
along a particular level, collapsing into a Series.
numeric_only [bool, default None] Include only float, int, boolean columns. If None,
will attempt to use everything, then use only numeric data. Not implemented for
Series.
**kwargs Additional keyword arguments to be passed to the function.
Returns
Series or DataFrame (if level specified)
pandas.core.groupby.DataFrameGroupBy.nunique
DataFrameGroupBy.nunique(self, dropna=True)
Return DataFrame with number of distinct observations per group for each column.
New in version 0.20.0.
Parameters
dropna [boolean, default True] Don’t include NaN in the counts.
Returns
nunique: DataFrame
Examples
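The grouped frame is not constructed on this page; a sketch consistent with the output shown is:
>>> df = pd.DataFrame({'id': ['spam', 'egg', 'egg', 'spam', 'ham', 'ham'],
...                    'value1': [1, 5, 5, 2, 5, 5],
...                    'value2': list('abbaxy')})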
>>> df.groupby('id').nunique()
id value1 value2
id
egg 1 1 1
ham 1 1 2
spam 1 2 1
pandas.core.groupby.DataFrameGroupBy.pct_change
Series.groupby
DataFrame.groupby
pandas.core.groupby.DataFrameGroupBy.plot
DataFrameGroupBy.plot
Class implementing the .plot attribute for groupby objects.
pandas.core.groupby.DataFrameGroupBy.quantile
Examples
>>> df = pd.DataFrame([
... ['a', 1], ['a', 2], ['a', 3],
... ['b', 1], ['b', 3], ['b', 5]
... ], columns=['key', 'val'])
>>> df.groupby('key').quantile()
val
key
a 2.0
b 3.0
pandas.core.groupby.DataFrameGroupBy.rank
Series.groupby
DataFrame.groupby
pandas.core.groupby.DataFrameGroupBy.resample
Examples
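The time-indexed frame used below is not shown on this page; a construction consistent with
the outputs (a sketch) is:
>>> idx = pd.date_range('1/1/2000', periods=4, freq='T')
>>> df = pd.DataFrame(data=4 * [range(2)],
...                   index=idx,
...                   columns=['a', 'b'])
>>> df.iloc[2, 0] = 5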
Downsample the DataFrame into 3 minute bins and sum the values of the timestamps falling into a
bin.
>>> df.groupby('a').resample('3T').sum()
a b
a
0 2000-01-01 00:00:00 0 2
2000-01-01 00:03:00 0 1
5 2000-01-01 00:00:00 5 1
>>> df.groupby('a').resample('30S').sum()
a b
a
0 2000-01-01 00:00:00 0 1
2000-01-01 00:00:30 0 0
2000-01-01 00:01:00 0 1
2000-01-01 00:01:30 0 0
2000-01-01 00:02:00 0 0
2000-01-01 00:02:30 0 0
2000-01-01 00:03:00 0 1
5 2000-01-01 00:02:00 5 1
>>> df.groupby('a').resample('M').sum()
a b
a
0 2000-01-31 0 3
5 2000-01-31 5 1
Downsample the series into 3 minute bins as above, but close the right side of the bin interval.
Downsample the series into 3 minute bins and close the right side of the bin interval, but label each
bin using the right edge instead of the left.
pandas.core.groupby.DataFrameGroupBy.shift
Series.groupby
DataFrame.groupby
pandas.core.groupby.DataFrameGroupBy.size
DataFrameGroupBy.size(self )
Compute group sizes.
Returns
Series Number of rows in each group.
See also:
Series.groupby
DataFrame.groupby
pandas.core.groupby.DataFrameGroupBy.skew
DataFrameGroupBy.skew
Return unbiased skew over requested axis, normalized by N-1.
Parameters
axis [{index (0), columns (1)}] Axis for the function to be applied on.
skipna [bool, default True] Exclude NA/null values when computing the result.
level [int or level name, default None] If the axis is a MultiIndex (hierarchical), count
along a particular level, collapsing into a Series.
numeric_only [bool, default None] Include only float, int, boolean columns. If None,
will attempt to use everything, then use only numeric data. Not implemented for
Series.
**kwargs Additional keyword arguments to be passed to the function.
Returns
Series or DataFrame (if level specified)
pandas.core.groupby.DataFrameGroupBy.take
DataFrameGroupBy.take
Return the elements in the given positional indices along an axis.
This means that we are not indexing according to actual values in the index attribute of the object.
We are indexing according to the actual position of the element in the object.
Parameters
indices [array-like] An array of ints indicating which positions to take.
axis [{0 or ‘index’, 1 or ‘columns’, None}, default 0] The axis on which to select
elements. 0 means that we are selecting rows, 1 means that we are selecting
columns.
is_copy [bool, default True] Whether to return a copy of the original object or not.
**kwargs For compatibility with numpy.take(). Has no effect on the output.
Returns
taken [same type as caller] An array-like containing the elements taken from the
object.
See also:
Examples
We may take elements using negative integers for positive indices, starting from the end of the object,
just like with Python lists.
pandas.core.groupby.DataFrameGroupBy.tshift
DataFrameGroupBy.tshift
Shift the time index, using the index’s frequency if available.
Parameters
periods [int] Number of periods to move, can be positive or negative
freq [DateOffset, timedelta, or time rule string, default None] Increment to use from
the tseries module or time rule (e.g. ‘EOM’)
axis [int or basestring] Corresponds to the axis that contains the Index
Returns
shifted [Series/DataFrame]
Notes
If freq is not specified then tries to use the freq or inferred_freq attributes of the index. If neither of
those attributes exist, a ValueError is thrown
The following methods are available only for SeriesGroupBy objects.
pandas.core.groupby.SeriesGroupBy.nlargest
SeriesGroupBy.nlargest
Return the largest n elements.
Parameters
n [int, default 5] Return this many descending sorted values.
keep [{‘first’, ‘last’, ‘all’}, default ‘first’] When there are duplicate values that cannot
all fit in a Series of n elements:
• first : return the first n occurrences in order of appearance.
• last : return the last n occurrences in reverse order of appearance.
• all : keep all occurrences. This can result in a Series of size larger than n.
Returns
Series The n largest values in the Series, sorted in decreasing order.
See also:
Notes
Faster than .sort_values(ascending=False).head(n) for small n relative to the size of the Series
object.
Examples
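The Series used in these and the nsmallest examples below is not constructed on this page; a
sketch consistent with the outputs shown is:
>>> countries_population = {"Italy": 59000000, "France": 65000000,
...                         "Malta": 434000, "Maldives": 434000,
...                         "Brunei": 434000, "Iceland": 337000,
...                         "Nauru": 11300, "Tuvalu": 11300,
...                         "Anguilla": 11300, "Monserat": 5200}
>>> s = pd.Series(countries_population)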
The n largest elements where n=3. Default keep value is ‘first’ so Malta will be kept.
>>> s.nlargest(3)
France 65000000
Italy 59000000
Malta 434000
dtype: int64
The n largest elements where n=3 and keeping the last duplicates. Brunei will be kept since it is the
last with value 434000 based on the index order.
The n largest elements where n=3 with all duplicates kept. Note that the returned Series has five
elements due to the three duplicates.
pandas.core.groupby.SeriesGroupBy.nsmallest
SeriesGroupBy.nsmallest
Return the smallest n elements.
Parameters
n [int, default 5] Return this many ascending sorted values.
keep [{‘first’, ‘last’, ‘all’}, default ‘first’] When there are duplicate values that cannot
all fit in a Series of n elements:
• first : return the first n occurrences in order of appearance.
• last : return the last n occurrences in reverse order of appearance.
• all : keep all occurrences. This can result in a Series of size larger than n.
Returns
Series The n smallest values in the Series, sorted in increasing order.
See also:
Notes
Faster than .sort_values().head(n) for small n relative to the size of the Series object.
Examples
>>> s.nsmallest()
Monserat 5200
Nauru 11300
Tuvalu 11300
Anguilla 11300
Iceland 337000
dtype: int64
The n smallest elements where n=3. Default keep value is ‘first’ so Nauru and Tuvalu will be kept.
>>> s.nsmallest(3)
Monserat 5200
Nauru 11300
Tuvalu 11300
dtype: int64
The n smallest elements where n=3 and keeping the last duplicates. Anguilla and Tuvalu will be kept
since they are the last with value 11300 based on the index order.
The n smallest elements where n=3 with all duplicates kept. Note that the returned Series has four
elements due to the three duplicates.
pandas.core.groupby.SeriesGroupBy.nunique
SeriesGroupBy.nunique(self, dropna=True)
Return number of unique elements in the group.
Returns
Series Number of unique values within each group.
pandas.core.groupby.SeriesGroupBy.unique
SeriesGroupBy.unique
Return unique values of Series object.
Uniques are returned in order of appearance. Hash table-based unique, therefore does NOT sort.
Returns
ndarray or ExtensionArray The unique values returned as a NumPy array. See
Notes.
See also:
Notes
Returns the unique values as a NumPy array. In case of an extension-array backed Series, a new
ExtensionArray of that type with just the unique values is returned. This includes
• Categorical
• Period
• Datetime with Timezone
• Interval
• Sparse
• IntegerNA
Examples
>>> pd.Series(pd.Categorical(list('baabc'))).unique()
[b, a, c]
Categories (3, object): [b, a, c]
pandas.core.groupby.SeriesGroupBy.value_counts
pandas.core.groupby.SeriesGroupBy.is_monotonic_increasing
SeriesGroupBy.is_monotonic_increasing
Return boolean if values in the object are monotonic_increasing.
New in version 0.19.0.
Returns
bool
pandas.core.groupby.SeriesGroupBy.is_monotonic_decreasing
SeriesGroupBy.is_monotonic_decreasing
Return boolean if values in the object are monotonic_decreasing.
New in version 0.19.0.
Returns
bool
The following methods are available only for DataFrameGroupBy objects.
pandas.core.groupby.DataFrameGroupBy.corrwith
DataFrameGroupBy.corrwith
Compute pairwise correlation between rows or columns of DataFrame with rows or columns of Series
or DataFrame. DataFrames are first aligned along both axes before computing the correlations.
Parameters
other [DataFrame, Series] Object with which to compute correlations.
axis [{0 or ‘index’, 1 or ‘columns’}, default 0] 0 or ‘index’ to compute column-wise, 1
or ‘columns’ for row-wise.
drop [bool, default False] Drop missing indices from result.
method [{‘pearson’, ‘kendall’, ‘spearman’} or callable]
• pearson : standard correlation coefficient
• kendall : Kendall Tau correlation coefficient
• spearman : Spearman rank correlation
• callable: callable with input two 1d ndarrays and returning a float
New in version 0.24.0.
Returns
Series Pairwise correlations.
See also:
DataFrame.corr
pandas.core.groupby.DataFrameGroupBy.boxplot
Examples
6.12 Resampling
pandas.core.resample.Resampler.__iter__
Resampler.__iter__(self )
Resampler iterator.
Returns
Generator yielding sequence of (name, subsetted object)
for each group.
See also:
GroupBy.__iter__
pandas.core.resample.Resampler.groups
Resampler.groups
Dict {group name -> group labels}.
pandas.core.resample.Resampler.indices
Resampler.indices
Dict {group name -> group indices}.
pandas.core.resample.Resampler.get_group
Resampler.apply(self, func, *args, **kwargs)    Aggregate using one or more operations over the specified axis.
Resampler.aggregate(self, func, *args, …)       Aggregate using one or more operations over the specified axis.
pandas.core.resample.Resampler.apply
DataFrame.groupby.aggregate
DataFrame.resample.transform
DataFrame.aggregate
Notes
Examples
>>> s = pd.Series([1, 2, 3, 4, 5],
...               index=pd.date_range('20130101', periods=5, freq='s'))
>>> s
2013-01-01 00:00:00 1
2013-01-01 00:00:01 2
2013-01-01 00:00:02 3
2013-01-01 00:00:03 4
2013-01-01 00:00:04 5
Freq: S, dtype: int64
>>> r = s.resample('2s')
>>> r
DatetimeIndexResampler [freq=<2 * Seconds>, axis=0, closed=left,
label=left, convention=start, base=0]
>>> r.agg(np.sum)
2013-01-01 00:00:00 3
2013-01-01 00:00:02 7
2013-01-01 00:00:04 5
Freq: 2S, dtype: int64
>>> r.agg(['sum','mean','max'])
sum mean max
2013-01-01 00:00:00 3 1.5 2
2013-01-01 00:00:02 7 3.5 4
2013-01-01 00:00:04 5 5.0 5
pandas.core.resample.Resampler.aggregate
Returns
scalar, Series or DataFrame The return can be:
• scalar : when Series.agg is called with single function
• Series : when DataFrame.agg is called with a single function
• DataFrame : when DataFrame.agg is called with several functions
Return scalar, Series or DataFrame.
See also:
DataFrame.groupby.aggregate
DataFrame.resample.transform
DataFrame.aggregate
Notes
Examples
>>> s = pd.Series([1, 2, 3, 4, 5],
...               index=pd.date_range('20130101', periods=5, freq='s'))
>>> s
2013-01-01 00:00:00 1
2013-01-01 00:00:01 2
2013-01-01 00:00:02 3
2013-01-01 00:00:03 4
2013-01-01 00:00:04 5
Freq: S, dtype: int64
>>> r = s.resample('2s')
>>> r
DatetimeIndexResampler [freq=<2 * Seconds>, axis=0, closed=left,
label=left, convention=start, base=0]
>>> r.agg(np.sum)
2013-01-01 00:00:00 3
2013-01-01 00:00:02 7
2013-01-01 00:00:04 5
Freq: 2S, dtype: int64
>>> r.agg(['sum','mean','max'])
sum mean max
2013-01-01 00:00:00 3 1.5 2
2013-01-01 00:00:02 7 3.5 4
2013-01-01 00:00:04 5 5.0 5
pandas.core.resample.Resampler.transform
Examples
pandas.core.resample.Resampler.pipe
>>> (df.groupby('group')
... .pipe(f)
... .pipe(g, arg1=a)
... .pipe(h, arg2=b, arg3=c))
Returns
object [the return type of func.]
See also:
Notes
Examples
To get the difference between each 2-day period’s maximum and minimum value in one pass, you can
do
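A minimal sketch of that call, assuming a hypothetical time-indexed frame df (output omitted):
>>> df.resample('2D').pipe(lambda x: x.max() - x.min())  # doctest: +SKIP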
6.12.3 Upsampling
pandas.core.resample.Resampler.ffill
Resampler.ffill(self, limit=None)
Forward fill the values.
Parameters
limit [integer, optional] limit of how many values to fill
Returns
An upsampled Series.
See also:
Series.fillna
DataFrame.fillna
pandas.core.resample.Resampler.backfill
Resampler.backfill(self, limit=None)
Backward fill the new missing values in the resampled data.
In statistics, imputation is the process of replacing missing data with substituted values [?]. When
resampling data, missing values may appear (e.g., when the resampling frequency is higher than the
original frequency). The backward fill will replace NaN values that appeared in the resampled data
with the next value in the original sequence. Missing values that existed in the original data will not
be modified.
Parameters
limit [integer, optional] Limit of how many values to fill.
Returns
Series, DataFrame An upsampled Series or DataFrame with backward filled NaN
values.
See also:
References
[?]
Examples
Resampling a Series:
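The Series being resampled is not constructed on this page; a sketch consistent with the outputs
shown is:
>>> s = pd.Series([1, 2, 3],
...               index=pd.date_range('20180101', periods=3, freq='h'))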
>>> s.resample('30min').backfill()
2018-01-01 00:00:00 1
2018-01-01 00:30:00 2
2018-01-01 01:00:00 2
2018-01-01 01:30:00 3
2018-01-01 02:00:00 3
Freq: 30T, dtype: int64
>>> s.resample('15min').backfill(limit=2)
2018-01-01 00:00:00 1.0
2018-01-01 00:15:00 NaN
2018-01-01 00:30:00 2.0
2018-01-01 00:45:00 2.0
2018-01-01 01:00:00 2.0
2018-01-01 01:15:00 NaN
2018-01-01 01:30:00 3.0
2018-01-01 01:45:00 3.0
2018-01-01 02:00:00 3.0
Freq: 15T, dtype: float64
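Resampling a DataFrame that has missing values; the frame is not shown on this page, and a
construction consistent with the outputs below (NumPy as np) would be:
>>> import numpy as np
>>> df = pd.DataFrame({'a': [2, np.nan, 6], 'b': [1, 3, 5]},
...                   index=pd.date_range('20180101', periods=3, freq='h'))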
>>> df.resample('30min').backfill()
a b
2018-01-01 00:00:00 2.0 1
2018-01-01 00:30:00 NaN 3
2018-01-01 01:00:00 NaN 3
2018-01-01 01:30:00 6.0 5
2018-01-01 02:00:00 6.0 5
>>> df.resample('15min').backfill(limit=2)
pandas.core.resample.Resampler.bfill
Resampler.bfill(self, limit=None)
Backward fill the new missing values in the resampled data.
In statistics, imputation is the process of replacing missing data with substituted values [?]. When
resampling data, missing values may appear (e.g., when the resampling frequency is higher than the
original frequency). The backward fill will replace NaN values that appeared in the resampled data
with the next value in the original sequence. Missing values that existed in the original data will not
be modified.
Parameters
limit [integer, optional] Limit of how many values to fill.
Returns
Series, DataFrame An upsampled Series or DataFrame with backward filled NaN
values.
See also:
References
[?]
Examples
Resampling a Series:
>>> s.resample('30min').backfill()
2018-01-01 00:00:00 1
2018-01-01 00:30:00 2
2018-01-01 01:00:00 2
2018-01-01 01:30:00 3
2018-01-01 02:00:00 3
Freq: 30T, dtype: int64
>>> s.resample('15min').backfill(limit=2)
2018-01-01 00:00:00 1.0
2018-01-01 00:15:00 NaN
2018-01-01 00:30:00 2.0
2018-01-01 00:45:00 2.0
2018-01-01 01:00:00 2.0
2018-01-01 01:15:00 NaN
2018-01-01 01:30:00 3.0
2018-01-01 01:45:00 3.0
2018-01-01 02:00:00 3.0
Freq: 15T, dtype: float64
>>> df.resample('30min').backfill()
a b
2018-01-01 00:00:00 2.0 1
2018-01-01 00:30:00 NaN 3
2018-01-01 01:00:00 NaN 3
2018-01-01 01:30:00 6.0 5
2018-01-01 02:00:00 6.0 5
>>> df.resample('15min').backfill(limit=2)
a b
2018-01-01 00:00:00 2.0 1.0
2018-01-01 00:15:00 NaN NaN
2018-01-01 00:30:00 NaN 3.0
pandas.core.resample.Resampler.pad
Resampler.pad(self, limit=None)
Forward fill the values.
Parameters
limit [integer, optional] limit of how many values to fill
Returns
An upsampled Series.
See also:
Series.fillna
DataFrame.fillna
pandas.core.resample.Resampler.nearest
Resampler.nearest(self, limit=None)
Resample by using the nearest value.
When resampling data, missing values may appear (e.g., when the resampling frequency is higher than
the original frequency). The nearest method will replace NaN values that appeared in the resampled
data with the value from the nearest member of the sequence, based on the index value. Missing values
that existed in the original data will not be modified. If limit is given, fill only this many values in
each direction for each of the original values.
Parameters
limit [int, optional] Limit of how many values to fill.
New in version 0.21.0.
Returns
Series or DataFrame An upsampled Series or DataFrame with NaN values filled
with their nearest value.
See also:
backfill Backward fill the new missing values in the resampled data.
pad Forward fill NaN values.
Examples
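The Series used below is not constructed on this page; a sketch consistent with the outputs shown is:
>>> s = pd.Series([1, 2],
...               index=pd.date_range('20180101', periods=2, freq='1h'))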
>>> s.resample('15min').nearest()
2018-01-01 00:00:00 1
2018-01-01 00:15:00 1
2018-01-01 00:30:00 2
2018-01-01 00:45:00 2
2018-01-01 01:00:00 2
Freq: 15T, dtype: int64
>>> s.resample('15min').nearest(limit=1)
2018-01-01 00:00:00 1.0
2018-01-01 00:15:00 1.0
2018-01-01 00:30:00 NaN
2018-01-01 00:45:00 2.0
2018-01-01 01:00:00 2.0
Freq: 15T, dtype: float64
pandas.core.resample.Resampler.fillna
References
[?]
Examples
Resampling a Series:
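The Series is not constructed on this page; a sketch consistent with the outputs shown is:
>>> s = pd.Series([1, 2, 3],
...               index=pd.date_range('20180101', periods=3, freq='h'))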
>>> s.resample("30min").asfreq()
2018-01-01 00:00:00 1.0
2018-01-01 00:30:00 NaN
2018-01-01 01:00:00 2.0
2018-01-01 01:30:00 NaN
2018-01-01 02:00:00 3.0
Freq: 30T, dtype: float64
>>> s.resample('30min').fillna("backfill")
2018-01-01 00:00:00 1
2018-01-01 00:30:00 2
2018-01-01 01:00:00 2
2018-01-01 01:30:00 3
2018-01-01 02:00:00 3
Freq: 30T, dtype: int64
>>> s.resample('30min').fillna("pad")
2018-01-01 00:00:00 1
2018-01-01 00:30:00 1
2018-01-01 01:00:00 2
2018-01-01 01:30:00 2
2018-01-01 02:00:00 3
Freq: 30T, dtype: int64
>>> s.resample('30min').fillna("nearest")
2018-01-01 00:00:00 1
2018-01-01 00:30:00 2
2018-01-01 01:00:00 2
2018-01-01 01:30:00 3
2018-01-01 02:00:00 3
Freq: 30T, dtype: int64
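The sm Series used next already contains a missing value before upsampling; a construction
consistent with the outputs shown (a sketch) is:
>>> sm = pd.Series([1, None, 3],
...                index=pd.date_range('20180101', periods=3, freq='h'))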
>>> sm.resample('30min').fillna('backfill')
2018-01-01 00:00:00 1.0
2018-01-01 00:30:00 NaN
2018-01-01 01:00:00 NaN
2018-01-01 01:30:00 3.0
2018-01-01 02:00:00 3.0
Freq: 30T, dtype: float64
>>> sm.resample('30min').fillna('pad')
2018-01-01 00:00:00 1.0
2018-01-01 00:30:00 1.0
2018-01-01 01:00:00 NaN
2018-01-01 01:30:00 NaN
2018-01-01 02:00:00 3.0
Freq: 30T, dtype: float64
>>> sm.resample('30min').fillna('nearest')
2018-01-01 00:00:00 1.0
2018-01-01 00:30:00 NaN
2018-01-01 01:00:00 NaN
DataFrame resampling is done column-wise. All the same options are available.
>>> df.resample('30min').fillna("bfill")
a b
2018-01-01 00:00:00 2.0 1
2018-01-01 00:30:00 NaN 3
2018-01-01 01:00:00 NaN 3
2018-01-01 01:30:00 6.0 5
2018-01-01 02:00:00 6.0 5
pandas.core.resample.Resampler.asfreq
Resampler.asfreq(self, fill_value=None)
Return the values at the new freq, essentially a reindex.
Parameters
fill_value [scalar, optional] Value to use for missing values, applied during upsam-
pling (note this does not fill NaNs that already were present).
New in version 0.20.0.
Returns
DataFrame or Series Values at the specified freq.
See also:
Series.asfreq
DataFrame.asfreq
pandas.core.resample.Resampler.interpolate
Notes
The ‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’ and ‘akima’ methods are wrappers around the
respective SciPy implementations of similar names. These use the actual numerical values of the index.
For more information on their behavior, see the SciPy documentation and SciPy tutorial.
Examples
Filling in NaN in a Series by padding, but filling at most two consecutive NaN at a time.
Filling in NaN in a Series via polynomial interpolation or splines: Both ‘polynomial’ and ‘spline’ methods
require that you also specify an order (int).
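A minimal sketch of the polynomial case (values chosen here only for illustration; requires SciPy):
>>> s = pd.Series([0, 2, np.nan, 8])
>>> s.interpolate(method='polynomial', order=2)
0    0.000000
1    2.000000
2    4.666667
3    8.000000
dtype: float64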
Fill the DataFrame forward (that is, going down) along each column using linear interpolation.
Note how the last entry in column ‘a’ is interpolated differently, because there is no entry after it to
use for interpolation. Note how the first entry in column ‘b’ remains NaN, because there is no entry
before it to use for interpolation.
pandas.core.resample.Resampler.count
Resampler.count(self, _method=’count’)
Compute count of group, excluding missing values.
Returns
Series or DataFrame Count of values within each group.
See also:
Series.groupby
DataFrame.groupby
pandas.core.resample.Resampler.nunique
Resampler.nunique(self, _method=’nunique’)
Return number of unique elements in the group.
Returns
Series Number of unique values within each group.
pandas.core.resample.Resampler.first
pandas.core.resample.Resampler.last
pandas.core.resample.Resampler.max
pandas.core.resample.Resampler.mean
Series.groupby
DataFrame.groupby
Examples
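The frame used in these examples is not constructed on this page; a sketch consistent with the
outputs shown (NumPy as np) is:
>>> import numpy as np
>>> df = pd.DataFrame({'A': [1, 1, 2, 1, 2],
...                    'B': [np.nan, 2, 3, 4, 5],
...                    'C': [1, 2, 1, 1, 2]}, columns=['A', 'B', 'C'])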
Groupby one column and return the mean of the remaining columns in each group.
>>> df.groupby('A').mean()
B C
A
1 3.0 1.333333
2 4.0 1.500000
Groupby two columns and return the mean of the remaining column.
Groupby one column and return the mean of only a particular column in the group.
>>> df.groupby('A')['B'].mean()
A
1 3.0
2 4.0
Name: B, dtype: float64
pandas.core.resample.Resampler.median
Series.groupby
DataFrame.groupby
pandas.core.resample.Resampler.min
pandas.core.resample.Resampler.ohlc
Series.groupby
DataFrame.groupby
pandas.core.resample.Resampler.prod
pandas.core.resample.Resampler.size
Resampler.size(self )
Compute group sizes.
Returns
Series Number of rows in each group.
See also:
Series.groupby
DataFrame.groupby
pandas.core.resample.Resampler.sem
Series.groupby
DataFrame.groupby
pandas.core.resample.Resampler.std
pandas.core.resample.Resampler.sum
pandas.core.resample.Resampler.var
pandas.core.resample.Resampler.quantile
Series.quantile
DataFrame.quantile
DataFrameGroupBy.quantile
6.13 Style
Styler(data[, precision, table_styles, …])       Helps style a DataFrame or Series according to the data with HTML and CSS.
Styler.from_custom_template(searchpath, name)    Factory function for creating a subclass of Styler with a custom template and Jinja environment.
pandas.io.formats.style.Styler
DataFrame.style
Notes
Most styling will be done by passing style functions into Styler.apply or Styler.applymap. Style
functions should return values with strings containing CSS 'attr: value' that will be applied to the
indicated cells.
If using in the Jupyter notebook, Styler has defined a _repr_html_ to automatically render itself.
Otherwise call Styler.render to get the generated HTML.
CSS classes are attached to the generated HTML
• Index and Column names include index_name and level<k> where k is its level in a MultiIndex
• Index label cells include
– row_heading
– row<n> where n is the numeric position of the row
– level<k> where k is the level in a MultiIndex
• Column label cells include
  – col_heading
  – col<n> where n is the numeric position of the column
  – level<k> where k is the level in a MultiIndex
• Blank cells include blank
• Data cells include data
Attributes
Methods
pandas.io.formats.style.Styler.apply
axis [{0 or ‘index’, 1 or ‘columns’, None}, default 0] apply to each column (axis=0
or 'index'), to each row (axis=1 or 'columns'), or to the entire DataFrame
at once with axis=None.
subset [IndexSlice] a valid indexer to limit data to before applying the function.
Consider using a pandas.IndexSlice
kwargs [dict] pass along to func
Returns
self [Styler]
Notes
The output shape of func should match the input, i.e. if x is the input row, column, or table
(depending on axis), then func(x).shape == x.shape should be true.
This is similar to DataFrame.apply, except that axis=None applies the function to the entire
DataFrame at once, rather than column-wise or row-wise.
Examples
pandas.io.formats.style.Styler.applymap
Styler.where
pandas.io.formats.style.Styler.background_gradient
Notes
Set text_color_threshold or tune low and high to keep the text legible by not using the entire
range of the color map. The range of the data is extended by low * (x.max() - x.min()) and
high * (x.max() - x.min()) before normalizing.
pandas.io.formats.style.Styler.bar
width [float, default 100] A number between 0 and 100. The largest value will cover
width percent of the cell’s width.
align [{‘left’, ‘zero’, ‘mid’}, default ‘left’] How to align the bars with the cells.
• ‘left’ : the min value starts at the left of the cell.
• ‘zero’ : a value of zero is located at the center of the cell.
• ‘mid’ : the center of the cell is at (max-min)/2, or if values are all negative
(positive) the zero is aligned at the right (left) of the cell.
New in version 0.20.0.
vmin [float, optional] Minimum bar value, defining the left hand limit of the bar
drawing range, lower values are clipped to vmin. When None (default): the
minimum value of the data will be used.
New in version 0.24.0.
vmax [float, optional] Maximum bar value, defining the right hand limit of the
bar drawing range, higher values are clipped to vmax. When None (default):
the maximum value of the data will be used.
New in version 0.24.0.
Returns
self [Styler]
pandas.io.formats.style.Styler.clear
Styler.clear(self )
Reset the styler, removing any previously applied styles. Returns None.
pandas.io.formats.style.Styler.export
Styler.export(self )
Export the styles applied to the current Styler.
Can be applied to a second style with Styler.use.
Returns
styles [list]
See also:
Styler.use
pandas.io.formats.style.Styler.format
Notes
Examples
pandas.io.formats.style.Styler.from_custom_template
pandas.io.formats.style.Styler.hide_columns
Styler.hide_columns(self, subset)
Hide columns from rendering.
New in version 0.23.0.
Parameters
subset [IndexSlice] An argument to DataFrame.loc that identifies which columns
are hidden.
Returns
self [Styler]
pandas.io.formats.style.Styler.hide_index
Styler.hide_index(self )
Hide any indices from rendering.
New in version 0.23.0.
Returns
self [Styler]
pandas.io.formats.style.Styler.highlight_max
pandas.io.formats.style.Styler.highlight_min
pandas.io.formats.style.Styler.highlight_null
Styler.highlight_null(self, null_color=’red’)
Shade the background null_color for missing values.
Parameters
null_color [str]
Returns
self [Styler]
pandas.io.formats.style.Styler.pipe
Notes
Like DataFrame.pipe(), this method can simplify the application of several user-defined
functions to a styler. Instead of writing the nested calls
f(g(df.style.set_precision(3), arg1=a), arg2=b, arg3=c)
users can write:
(df.style.set_precision(3)
   .pipe(g, arg1=a)
   .pipe(f, arg2=b, arg3=c))
In particular, this allows users to define functions that take a styler object, along with other
parameters, and return the styler after making styling changes (such as calling Styler.apply()
or Styler.set_properties()). Using .pipe, these user-defined style “transformations” can be
interleaved with calls to the built-in Styler interface.
Examples
The user-defined format_conversion function above can be called within a sequence of other
style modifications:
pandas.io.formats.style.Styler.render
Styler.render(self, **kwargs)
Render the built up styles to HTML.
Parameters
**kwargs Any additional keyword arguments are passed through to self.template.render.
This is useful when you need to provide additional variables for a custom template.
New in version 0.20.
Returns
rendered [str] The rendered HTML.
Notes
Styler objects have defined the _repr_html_ method which automatically calls self.render()
when it’s the last item in a Notebook cell. When calling Styler.render() directly, wrap the
result in IPython.display.HTML to view the rendered HTML in the notebook.
Pandas uses the following keys in render. Arguments passed in **kwargs take precedence, so
think carefully if you want to override them:
• head
• cellstyle
• body
• uuid
• precision
• table_styles
• caption
• table_attributes
pandas.io.formats.style.Styler.set_caption
Styler.set_caption(self, caption)
Set the caption on a Styler
Parameters
caption [str]
Returns
self [Styler]
pandas.io.formats.style.Styler.set_precision
Styler.set_precision(self, precision)
Set the precision used to render.
Parameters
precision [int]
Returns
self [Styler]
pandas.io.formats.style.Styler.set_properties
Examples
pandas.io.formats.style.Styler.set_table_attributes
Styler.set_table_attributes(self, attributes)
Set the table attributes.
These are the items that show up in the opening <table> tag in addition to the automatic (by
default) id.
Parameters
attributes [string]
Returns
self [Styler]
Examples
pandas.io.formats.style.Styler.set_table_styles
Styler.set_table_styles(self, table_styles)
Set the table styles on a Styler.
These are placed in a <style> tag before the generated HTML table.
Parameters
table_styles [list] Each individual table_style should be a dictionary with
selector and props keys. selector should be a CSS selector that the style
will be applied to (automatically prefixed by the table’s UUID) and props
should be a list of tuples with (attribute, value).
Returns
self [Styler]
Examples
pandas.io.formats.style.Styler.set_uuid
Styler.set_uuid(self, uuid)
Set the uuid for a Styler.
Parameters
uuid [str]
Returns
self [Styler]
pandas.io.formats.style.Styler.to_excel
Multiple sheets may be written to by specifying unique sheet_name. With all data written to
the file it is necessary to save the changes. Note that creating an ExcelWriter object with a file
name that already exists will result in the contents of the existing file being erased.
Parameters
excel_writer [str or ExcelWriter object] File path or existing ExcelWriter.
sheet_name [str, default ‘Sheet1’] Name of sheet which will contain DataFrame.
na_rep [str, default ‘’] Missing data representation.
float_format [str, optional] Format string for floating point numbers. For ex-
ample float_format="%.2f" will format 0.1234 to 0.12.
columns [sequence or list of str, optional] Columns to write.
header [bool or list of str, default True] Write out the column names. If a list of
string is given it is assumed to be aliases for the column names.
index [bool, default True] Write row names (index).
index_label [str or sequence, optional] Column label for index column(s) if de-
sired. If not specified, and header and index are True, then the index names
are used. A sequence should be given if the DataFrame uses MultiIndex.
startrow [int, default 0] Upper left cell row to dump data frame.
startcol [int, default 0] Upper left cell column to dump data frame.
engine [str, optional] Write engine to use, ‘openpyxl’ or ‘xlsxwriter’. You can also
set this via the options io.excel.xlsx.writer, io.excel.xls.writer, and
io.excel.xlsm.writer.
merge_cells [bool, default True] Write MultiIndex and Hierarchical Rows as
merged cells.
encoding [str, optional] Encoding of the resulting excel file. Only necessary for
xlwt, other writers support unicode natively.
inf_rep [str, default ‘inf’] Representation for infinity (there is no native represen-
tation for infinity in Excel).
verbose [bool, default True] Display more information in the error logs.
freeze_panes [tuple of int (length 2), optional] Specifies the one-based bottommost
row and rightmost column that is to be frozen.
New in version 0.20.0.
See also:
Notes
For compatibility with to_csv(), to_excel serializes lists and dicts to strings before writing.
Once a workbook has been saved it is not possible to write further data without rewriting the
whole workbook.
Examples
>>> df1.to_excel("output.xlsx",
... sheet_name='Sheet_name_1') # doctest: +SKIP
If you wish to write to more than one sheet in the workbook, it is necessary to specify an
ExcelWriter object:
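A sketch of that pattern, assuming two frames df1 and df2 already exist:
>>> with pd.ExcelWriter('output.xlsx') as writer:  # doctest: +SKIP
...     df1.to_excel(writer, sheet_name='Sheet_name_1')
...     df2.to_excel(writer, sheet_name='Sheet_name_2')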
To set the library that is used to write the Excel file, you can pass the engine keyword (the
default engine is automatically chosen depending on the file extension):
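For example (a sketch, assuming the xlsxwriter engine is installed):
>>> df1.to_excel('output1.xlsx', engine='xlsxwriter')  # doctest: +SKIP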
pandas.io.formats.style.Styler.use
Styler.use(self, styles)
Set the styles on the current Styler, possibly using styles from Styler.export.
Parameters
styles [list] list of style functions
Returns
self [Styler]
See also:
Styler.export
pandas.io.formats.style.Styler.where
Parameters
cond [callable] cond should take a scalar and return a boolean
value [str] applied when cond returns true
other [str] applied when cond returns false
subset [IndexSlice] a valid indexer to limit data to before applying the function.
Consider using a pandas.IndexSlice
kwargs [dict] pass along to cond
Returns
self [Styler]
See also:
Styler.applymap
Styler.env
Styler.template
Styler.loader
pandas.io.formats.style.Styler.env
pandas.io.formats.style.Styler.template
pandas.io.formats.style.Styler.loader
Styler.highlight_max(self[, subset, color, axis])   Highlight the maximum by shading the background.
Styler.highlight_min(self[, subset, color, axis])   Highlight the minimum by shading the background.
Styler.highlight_null(self[, null_color])           Shade the background null_color for missing values.
Styler.background_gradient(self[, cmap, …])         Color the background in a gradient according to the data in each column (optionally row).
Styler.bar(self[, subset, axis, color, …])          Draw bar chart in the cell backgrounds.
6.14 Plotting
andrews_curves(frame, class_column[, ax, …])    Generate a matplotlib plot of Andrews curves, for visualising clusters of multivariate data.
bootstrap_plot(series[, fig, size, samples])    Bootstrap plot on mean, median and mid-range statistics.
deregister_matplotlib_converters()              Remove pandas’ formatters and converters.
lag_plot(series[, lag, ax])                     Lag plot for time series.
parallel_coordinates(frame, class_column[, …])  Parallel coordinates plotting.
radviz(frame, class_column[, ax, color, …])     Plot a multidimensional dataset in 2D.
register_matplotlib_converters([explicit])      Register Pandas Formatters and Converters with matplotlib.
scatter_matrix(frame[, alpha, figsize, ax, …])  Draw a matrix of scatter plots.
6.14.1 pandas.plotting.andrews_curves
6.14.2 pandas.plotting.bootstrap_plot
Examples
>>> s = pd.Series(np.random.uniform(size=100))
>>> fig = pd.plotting.bootstrap_plot(s) # doctest: +SKIP
6.14.3 pandas.plotting.deregister_matplotlib_converters
pandas.plotting.deregister_matplotlib_converters()
Remove pandas’ formatters and converters
Removes the custom converters added by register(). This attempts to set the state of the reg-
istry back to the state before pandas registered its own units. Converters for pandas’ own types
like Timestamp and Period are removed completely. Converters for types pandas overwrites, like
datetime.datetime, are restored to their original value.
See also:
register_matplotlib_converters
6.14.4 pandas.plotting.lag_plot
6.14.5 pandas.plotting.parallel_coordinates
color [list or tuple, optional] Colors to use for the different classes
use_columns [bool, optional] If true, columns will be used as xticks
xticks [list or tuple, optional] A list of values to use for xticks
colormap [str or matplotlib colormap, default None] Colormap to use for line colors.
axvlines [bool, optional] If true, vertical lines will be added at each xtick
axvlines_kwds [keywords, optional] Options to be passed to axvline method for
vertical lines
sort_labels [bool, default False] Sort class_column labels, useful when assigning colors.
New in version 0.20.0.
kwds [keywords] Options to pass to matplotlib plotting method
Returns
matplotlib.axis.Axes
Examples
6.14.6 pandas.plotting.radviz
Examples
>>> df = pd.DataFrame({
... 'SepalLength': [6.5, 7.7, 5.1, 5.8, 7.6, 5.0, 5.4, 4.6,
... 6.7, 4.6],
... 'SepalWidth': [3.0, 3.8, 3.8, 2.7, 3.0, 2.3, 3.0, 3.2,
... 3.3, 3.6],
... 'PetalLength': [5.5, 6.7, 1.9, 5.1, 6.6, 3.3, 4.5, 1.4,
... 5.7, 1.0],
... 'PetalWidth': [1.8, 2.2, 0.4, 1.9, 2.1, 1.0, 1.5, 0.2,
... 2.1, 0.2],
... 'Category': ['virginica', 'virginica', 'setosa',
... 'virginica', 'virginica', 'versicolor',
... 'versicolor', 'setosa', 'virginica',
... 'setosa']
... })
>>> rad_viz = pd.plotting.radviz(df, 'Category') # doctest: +SKIP
6.14.7 pandas.plotting.register_matplotlib_converters
pandas.plotting.register_matplotlib_converters(explicit=True)
Register Pandas Formatters and Converters with matplotlib
This function modifies the global matplotlib.units.registry dictionary. Pandas adds custom con-
verters for
• pd.Timestamp
• pd.Period
• np.datetime64
• datetime.datetime
• datetime.date
• datetime.time
See also:
deregister_matplotlib_converters
6.14.8 pandas.plotting.scatter_matrix
Examples
pandas.describe_option
Parameters
pat [str] Regexp pattern. All matching keys will have their description displayed.
_print_desc [bool, default True] If True (default) the description(s) will be printed
to stdout. Otherwise, the description(s) will be returned as a unicode string (for
testing).
Returns
None by default, the description(s) as a unicode string if _print_desc
is False
Notes
compute.use_bottleneck : bool Use the bottleneck library to accelerate if it is installed, the default is True. Valid values: False, True [default: True] [currently: True]
compute.use_numexpr : bool Use the numexpr library to accelerate computation if it is installed, the default is True. Valid values: False, True [default: True] [currently: True]
display.chop_threshold : float or None If set to a float value, all float values smaller than the given threshold will be displayed as exactly 0 by repr and friends. [default: None] [currently: None]
display.colheader_justify : 'left'/'right' Controls the justification of column headers. Used by DataFrameFormatter. [default: right] [currently: right]
display.column_space No description available. [default: 12] [currently: 12]
display.date_dayfirst : boolean When True, prints and parses dates with the day first, eg 20/01/2005. [default: False] [currently: False]
display.date_yearfirst : boolean When True, prints and parses dates with the year first, eg 2005/01/20. [default: False] [currently: False]
display.encoding : str/unicode Defaults to the detected encoding of the console. Specifies the encoding to be used for strings returned by to_string, these are generally strings meant to be displayed on the console. [default: UTF-8] [currently: UTF-8]
display.expand_frame_repr : boolean Whether to print out the full DataFrame repr for wide DataFrames across multiple lines, max_columns is still respected, but the output will wrap-around across multiple "pages" if its width exceeds display.width. [default: True] [currently: True]
display.float_format : callable The callable should accept a floating point number and return a string with the desired format of the number. This is used in some places like SeriesFormatter. See formats.format.EngFormatter for an example. [default: None] [currently: None]
display.html.border : int A border=value attribute is inserted in the <table> tag for the DataFrame HTML repr. [default: 1] [currently: 1]
display.html.table_schema : boolean Whether to publish a Table Schema representation for frontends that support it. (default: False) [default: False] [currently: False]
display.html.use_mathjax : boolean When True, Jupyter notebook will process table contents using MathJax, rendering mathematical expressions enclosed by the dollar symbol. (default: True) [default: True] [currently: True]
display.large_repr : 'truncate'/'info' For DataFrames exceeding max_rows/max_cols, the repr (and HTML repr) can show a truncated table (the default from 0.13), or switch to the view from df.info() (the behaviour in earlier versions of pandas). [default: truncate] [currently: truncate]
display.latex.escape : bool This specifies if the to_latex method of a Dataframe escapes special characters. Valid values: False, True [default: True] [currently: True]
display.latex.longtable : bool This specifies if the to_latex method of a Dataframe uses the longtable format. Valid values: False, True [default: False] [currently: False]
display.latex.multicolumn : bool This specifies if the to_latex method of a Dataframe uses multicolumns to pretty-print MultiIndex columns. Valid values: False, True [default: True] [currently: True]
display.latex.multicolumn_format : bool This specifies if the to_latex method of a Dataframe uses multicolumns to pretty-print MultiIndex columns. Valid values: False, True [default: l] [currently: l]
display.latex.multirow : bool This specifies if the to_latex method of a Dataframe uses multirows to pretty-print MultiIndex rows. Valid values: False, True [default: False] [currently: False]
display.latex.repr : boolean Whether to produce a latex DataFrame representation for jupyter environments that support it. (default: False) [default: False] [currently: False]
display.max_categories : int This sets the maximum number of categories pandas should output when printing out a Categorical or a Series of dtype "category". [default: 8] [currently: 8]
display.max_columns : int If max_cols is exceeded, switch to truncate view. Depending on large_repr, objects are either centrally truncated or printed as a summary view. 'None' value means unlimited. In case python/IPython is running in a terminal and large_repr equals 'truncate' this can be set to 0 and pandas will auto-detect the width of the terminal and print a truncated object which fits the screen width. The IPython notebook, IPython qtconsole, or IDLE do not run in a terminal and hence it is not possible to do correct auto-detection. [default: 0] [currently: 0]
display.max_colwidth : int The maximum width in characters of a column in the repr of a pandas data structure. When the column overflows, a "…" placeholder is embedded in the output. [default: 50] [currently: 50]
display.max_info_columns : int max_info_columns is used in DataFrame.info method to decide if per column information will be printed. [default: 100] [currently: 100]
display.max_info_rows : int or None df.info() will usually show null-counts for each column. For large frames this can be quite slow. max_info_rows and max_info_cols limit this null check only to frames with smaller dimensions than specified. [default: 1690785] [currently: 1690785]
display.max_rows : int If max_rows is exceeded, switch to truncate view. Depending on large_repr, objects are either centrally truncated or printed as a summary view. 'None' value means unlimited. In case python/IPython is running in a terminal and large_repr equals 'truncate' this can be set to 0 and pandas will auto-detect the height of the terminal and print a truncated object which fits the screen height. The IPython notebook, IPython qtconsole, or IDLE do not run in a terminal and hence it is not possible to do correct auto-detection. [default: 60] [currently: 60]
display.max_seq_items : int or None When pretty-printing a long sequence, no more than max_seq_items will be printed. If items are omitted, they will be denoted by the addition of "…" to the resulting string. If set to None, the number of items to be printed is unlimited. [default: 100] [currently: 100]
display.memory_usage : bool, string or None This specifies if the memory usage of a DataFrame should be displayed when df.info() is called. Valid values: True, False, 'deep' [default: True] [currently: True]
display.min_rows : int The number of rows to show in a truncated view (when max_rows is exceeded). Ignored when max_rows is set to None or 0. When set to None, follows the value of max_rows. [default: 10] [currently: 10]
display.multi_sparse : boolean "Sparsify" MultiIndex display (don't display repeated elements in outer levels within groups). [default: True] [currently: True]
display.notebook_repr_html : boolean When True, IPython notebook will use html representation for pandas objects (if it is available). [default: True] [currently: True]
display.pprint_nest_depth : int Controls the number of nested levels to process when pretty-printing. [default: 3] [currently: 3]
display.precision : int Floating point output precision (number of significant digits). This is only a suggestion. [default: 6] [currently: 6]
display.show_dimensions : boolean or 'truncate' Whether to print out dimensions at the end of DataFrame repr. If 'truncate' is specified, only print out the dimensions if the frame is truncated (e.g. not display all rows and/or columns). [default: truncate] [currently: truncate]
display.unicode.ambiguous_as_wide : boolean Whether to use the Unicode East Asian Width to calculate the display text width. Enabling this may affect the performance. (default: False) [default: False] [currently: False]
display.unicode.east_asian_width : boolean Whether to use the Unicode East Asian Width to calculate the display text width. Enabling this may affect the performance. (default: False) [default: False] [currently: False]
display.width : int Width of the display in characters. In case python/IPython is running in a terminal this can be set to None and pandas will correctly auto-detect the width. Note that the IPython notebook, IPython qtconsole, or IDLE do not run in a terminal and hence it is not possible to correctly detect the width. [default: 80] [currently: 80]
io.excel.ods.reader : string The default Excel reader engine for 'ods' files. Available options: auto, odf. [default: auto] [currently: auto]
io.excel.xls.reader : string The default Excel reader engine for 'xls' files. Available options: auto, xlrd. [default: auto] [currently: auto]
io.excel.xls.writer : string The default Excel writer engine for 'xls' files. Available options: auto, xlwt. [default: auto] [currently: auto]
io.excel.xlsm.reader : string The default Excel reader engine for 'xlsm' files. Available options: auto, xlrd, openpyxl. [default: auto] [currently: auto]
io.excel.xlsm.writer : string The default Excel writer engine for 'xlsm' files. Available options: auto, openpyxl. [default: auto] [currently: auto]
io.excel.xlsx.reader : string The default Excel reader engine for 'xlsx' files. Available options: auto, xlrd, openpyxl. [default: auto] [currently: auto]
io.excel.xlsx.writer : string The default Excel writer engine for 'xlsx' files. Available options: auto, openpyxl, xlsxwriter. [default: auto] [currently: auto]
io.hdf.default_format : format Default writing format; if None, then put will default to 'fixed' and append will default to 'table'. [default: None] [currently: None]
io.hdf.dropna_table : boolean Drop ALL nan rows when appending to a table. [default: False] [currently: False]
io.parquet.engine : string The default parquet reader/writer engine. Available options: 'auto', 'pyarrow', 'fastparquet', the default is 'auto'. [default: auto] [currently: auto]
mode.chained_assignment : string Raise an exception, warn, or no action if trying to use chained assignment. The default is warn. [default: warn] [currently: warn]
pandas.reset_option
Parameters
pat [str/regex] If specified only options matching prefix* will be reset. Note: partial
matches are supported for convenience, but unless you use the full option name
(e.g. x.y.z.option_name), your code may break in future versions if new options
with similar names are introduced.
Returns
None
Notes
auto]io.excel.xlsm.reader : string The default Excel reader engine for ‘xlsm’ files. Available op-
tions: auto, xlrd, openpyxl. [default: auto] [currently: auto]io.excel.xlsm.writer : string The
default Excel writer engine for ‘xlsm’ files. Available options: auto, openpyxl. [default: auto]
[currently: auto]io.excel.xlsx.reader : string The default Excel reader engine for ‘xlsx’ files. Avail-
able options: auto, xlrd, openpyxl. [default: auto] [currently: auto]io.excel.xlsx.writer : string
The default Excel writer engine for ‘xlsx’ files. Available options: auto, openpyxl, xlsxwriter.
[default: auto] [currently: auto]io.hdf.default_format : format default format writing format,
if None, then put will default to ‘fixed’ and append will default to ‘table’ [default: None] [cur-
rently: None]io.hdf.dropna_table : boolean drop ALL nan rows when appending to a table [de-
fault: False] [currently: False]io.parquet.engine : string The default parquet reader/writer engine.
Available options: ‘auto’, ‘pyarrow’, ‘fastparquet’, the default is ‘auto’ [default: auto] [currently:
auto]mode.chained_assignment : string Raise an exception, warn, or no action if trying to use
chained assignment, The default is warn [default: warn] [currently: warn]mode.sim_interactive :
boolean Whether to simulate interactive mode for purposes of testing [default: False] [currently:
False]mode.use_inf_as_na : boolean True means treat None, NaN, INF, -INF as NA (old way),
False means None and NaN are null, but INF, -INF are not NA (new way). [default: False] [cur-
rently: False]mode.use_inf_as_null : boolean use_inf_as_null had been deprecated and will
be removed in a future version. Use use_inf_as_na instead. [default: False] [currently: False]
(Deprecated, use mode.use_inf_as_na instead.)plotting.backend : str The plotting backend to
use. The default value is “matplotlib”, the backend provided with pandas. Other backends can
be specified by providing the name of the module that implements the backend. [default: mat-
plotlib] [currently: matplotlib]plotting.matplotlib.register_converters : bool Whether to register
converters with matplotlib’s units registry for dates, times, datetimes, and Periods. Toggling to
False will remove the converters, restoring any converters that pandas overwrote. [default: True]
[currently: True]
pandas.get_option
• io.hdf.[default_format, dropna_table]
• io.parquet.[engine]
• mode.[chained_assignment, sim_interactive, use_inf_as_na, use_inf_as_null]
• plotting.[backend]
• plotting.matplotlib.[register_converters]
Parameters
pat [str] Regexp which should match a single option. Note: partial matches
are supported for convenience, but unless you use the full option name (e.g.
x.y.z.option_name), your code may break in future versions if new options with
similar names are introduced.
Returns
result [the value of the option]
Raises
OptionError [if no such option exists]
Notes
the to_latex method of a Dataframe uses escapes special characters. Valid values: False,True
[default: True] [currently: True]display.latex.longtable :bool This specifies if the to_latex method
of a Dataframe uses the longtable format. Valid values: False,True [default: False] [currently:
False]display.latex.multicolumn : bool This specifies if the to_latex method of a Dataframe uses
multicolumns to pretty-print MultiIndex columns. Valid values: False,True [default: True] [cur-
rently: True]display.latex.multicolumn_format : bool This specifies if the to_latex method of
a Dataframe uses multicolumns to pretty-print MultiIndex columns. Valid values: False,True
[default: l] [currently: l]display.latex.multirow : bool This specifies if the to_latex method of
a Dataframe uses multirows to pretty-print MultiIndex rows. Valid values: False,True [default:
False] [currently: False]display.latex.repr : boolean Whether to produce a latex DataFrame rep-
resentation for jupyter environments that support it. (default: False) [default: False] [currently:
False]display.max_categories : int This sets the maximum number of categories pandas should
output when printing out a Categorical or a Series of dtype “category”. [default: 8] [currently:
8]display.max_columns : int If max_cols is exceeded, switch to truncate view. Depending on
large_repr, objects are either centrally truncated or printed as a summary view. ‘None’ value
means unlimited.
In case python/IPython is running in a terminal and large_repr equals ‘truncate’ this can be
set to 0 and pandas will auto-detect the width of the terminal and print a truncated object
which fits the screen width. The IPython notebook, IPython qtconsole, or IDLE do not run
in a terminal and hence it is not possible to do correct auto-detection. [default: 0] [currently:
0]display.max_colwidth : int The maximum width in characters of a column in the repr of
a pandas data structure. When the column overflows, a “…” placeholder is embedded in the
output. [default: 50] [currently: 50]display.max_info_columns : int max_info_columns is used
in DataFrame.info method to decide if per column information will be printed. [default: 100]
[currently: 100]display.max_info_rows : int or None df.info() will usually show null-counts for
each column. For large frames this can be quite slow. max_info_rows and max_info_cols
limit this null check only to frames with smaller dimensions than specified. [default: 1690785]
[currently: 1690785]display.max_rows : int If max_rows is exceeded, switch to truncate view.
Depending on large_repr, objects are either centrally truncated or printed as a summary view.
‘None’ value means unlimited.
In case python/IPython is running in a terminal and large_repr equals ‘truncate’ this can be
set to 0 and pandas will auto-detect the height of the terminal and print a truncated object
which fits the screen height. The IPython notebook, IPython qtconsole, or IDLE do not run
in a terminal and hence it is not possible to do correct auto-detection. [default: 60] [currently:
60]display.max_seq_items : int or None when pretty-printing a long sequence, no more then
max_seq_items will be printed. If items are omitted, they will be denoted by the addition of
“…” to the resulting string.
If set to None, the number of items to be printed is unlimited. [default: 100] [currently: 100]dis-
play.memory_usage : bool, string or None This specifies if the memory usage of a DataFrame
should be displayed when df.info() is called. Valid values True,False,’deep’ [default: True] [cur-
rently: True]display.min_rows : int The numbers of rows to show in a truncated view (when
max_rows is exceeded). Ignored when max_rows is set to None or 0. When set to None,
follows the value of max_rows. [default: 10] [currently: 10]display.multi_sparse : boolean
“sparsify” MultiIndex display (don’t display repeated elements in outer levels within groups)
[default: True] [currently: True]display.notebook_repr_html : boolean When True, IPython
notebook will use html representation for pandas objects (if it is available). [default: True]
[currently: True]display.pprint_nest_depth : int Controls the number of nested levels to pro-
cess when pretty-printing [default: 3] [currently: 3]display.precision : int Floating point out-
put precision (number of significant digits). This is only a suggestion [default: 6] [currently:
6]display.show_dimensions : boolean or ‘truncate’ Whether to print out dimensions at the
end of DataFrame repr. If ‘truncate’ is specified, only print out the dimensions if the frame
is truncated (e.g. not display all rows and/or columns) [default: truncate] [currently: trun-
pandas.set_option
Parameters
pat [str] Regexp which should match a single option. Note: partial matches
are supported for convenience, but unless you use the full option name (e.g.
x.y.z.option_name), your code may break in future versions if new options with
similar names are introduced.
value [object] New value of option.
Returns
None
Raises
OptionError if no such option exists
Notes
display.expand_frame_repr : boolean
    Whether to print out the full DataFrame repr for wide DataFrames across multiple lines. max_columns is still respected, but the output will wrap-around across multiple "pages" if its width exceeds display.width. [default: True] [currently: True]
display.float_format : callable
    The callable should accept a floating point number and return a string with the desired format of the number. This is used in some places like SeriesFormatter. See formats.format.EngFormatter for an example. [default: None] [currently: None]
display.html.border : int
    A border=value attribute is inserted in the <table> tag for the DataFrame HTML repr. [default: 1] [currently: 1]
display.html.table_schema : boolean
    Whether to publish a Table Schema representation for frontends that support it. [default: False] [currently: False]
display.html.use_mathjax : boolean
    When True, Jupyter notebook will process table contents using MathJax, rendering mathematical expressions enclosed by the dollar symbol. [default: True] [currently: True]
display.large_repr : 'truncate'/'info'
    For DataFrames exceeding max_rows/max_cols, the repr (and HTML repr) can show a truncated table (the default from 0.13), or switch to the view from df.info() (the behaviour in earlier versions of pandas). [default: truncate] [currently: truncate]
display.latex.escape : bool
    This specifies if the to_latex method of a Dataframe escapes special characters. Valid values: False, True. [default: True] [currently: True]
display.latex.longtable : bool
    This specifies if the to_latex method of a Dataframe uses the longtable format. Valid values: False, True. [default: False] [currently: False]
display.latex.multicolumn : bool
    This specifies if the to_latex method of a Dataframe uses multicolumns to pretty-print MultiIndex columns. Valid values: False, True. [default: True] [currently: True]
display.latex.multicolumn_format : str
    The alignment used by the to_latex method of a Dataframe when pretty-printing MultiIndex columns as multicolumns. [default: l] [currently: l]
display.latex.multirow : bool
    This specifies if the to_latex method of a Dataframe uses multirows to pretty-print MultiIndex rows. Valid values: False, True. [default: False] [currently: False]
display.latex.repr : boolean
    Whether to produce a latex DataFrame representation for jupyter environments that support it. [default: False] [currently: False]
display.max_categories : int
    This sets the maximum number of categories pandas should output when printing out a Categorical or a Series of dtype "category". [default: 8] [currently: 8]
display.max_columns : int
    If max_cols is exceeded, switch to truncate view. Depending on large_repr, objects are either centrally truncated or printed as a summary view. 'None' value means unlimited. In case python/IPython is running in a terminal and large_repr equals 'truncate' this can be set to 0 and pandas will auto-detect the width of the terminal and print a truncated object which fits the screen width. The IPython notebook, IPython qtconsole, or IDLE do not run in a terminal and hence it is not possible to do correct auto-detection. [default: 0] [currently: 0]
display.max_colwidth : int
    The maximum width in characters of a column in the repr of a pandas data structure. When the column overflows, a "…" placeholder is embedded in the output. [default: 50] [currently: 50]
display.max_info_columns : int
    max_info_columns is used in DataFrame.info method to decide if per column information will be printed. [default: 100] [currently: 100]
display.max_info_rows : int or None
    df.info() will usually show null-counts for each column. For large frames this can be quite slow. max_info_rows and max_info_cols limit this null check only to frames with smaller dimensions than specified. [default: 1690785] [currently: 1690785]
display.max_rows : int
    If max_rows is exceeded, switch to truncate view. Depending on large_repr, objects are either centrally truncated or printed as a summary view. 'None' value means unlimited. In case python/IPython is running in a terminal and large_repr equals 'truncate' this can be set to 0 and pandas will auto-detect the height of the terminal and print a truncated object which fits the screen height. The IPython notebook, IPython qtconsole, or IDLE do not run in a terminal and hence it is not possible to do correct auto-detection. [default: 60] [currently: 60]
display.max_seq_items : int or None
    When pretty-printing a long sequence, no more than max_seq_items will be printed. If items are omitted, they will be denoted by the addition of "…" to the resulting string. If set to None, the number of items to be printed is unlimited. [default: 100] [currently: 100]
display.memory_usage : bool, string or None
    This specifies if the memory usage of a DataFrame should be displayed when df.info() is called. Valid values: True, False, 'deep'. [default: True] [currently: True]
display.min_rows : int
    The number of rows to show in a truncated view (when max_rows is exceeded). Ignored when max_rows is set to None or 0. When set to None, follows the value of max_rows. [default: 10] [currently: 10]
display.multi_sparse : boolean
    "Sparsify" MultiIndex display (don't display repeated elements in outer levels within groups). [default: True] [currently: True]
display.notebook_repr_html : boolean
    When True, IPython notebook will use html representation for pandas objects (if it is available). [default: True] [currently: True]
display.pprint_nest_depth : int
    Controls the number of nested levels to process when pretty-printing. [default: 3] [currently: 3]
display.precision : int
    Floating point output precision (number of significant digits). This is only a suggestion. [default: 6] [currently: 6]
display.show_dimensions : boolean or 'truncate'
    Whether to print out dimensions at the end of DataFrame repr. If 'truncate' is specified, only print out the dimensions if the frame is truncated (e.g. not display all rows and/or columns). [default: truncate] [currently: truncate]
display.unicode.ambiguous_as_wide : boolean
    Whether to use the Unicode East Asian Width to calculate the display text width. Enabling this may affect performance. [default: False] [currently: False]
display.unicode.east_asian_width : boolean
    Whether to use the Unicode East Asian Width to calculate the display text width. Enabling this may affect performance. [default: False] [currently: False]
display.width : int
    Width of the display in characters. In case python/IPython is running in a terminal this can be set to None and pandas will correctly auto-detect the width. Note that the IPython notebook, IPython qtconsole, or IDLE do not run in a terminal and hence it is not possible to correctly detect the width. [default: 80] [currently: 80]
io.excel.ods.reader : string
    The default Excel reader engine for 'ods' files. Available options: auto, odf. [default: auto] [currently: auto]
io.excel.xls.reader : string
    The default Excel reader engine for 'xls' files. Available options: auto, xlrd. [default: auto] [currently: auto]
io.excel.xls.writer : string
    The default Excel writer engine for 'xls' files. Available options: auto, xlwt. [default: auto] [currently: auto]
io.excel.xlsm.reader : string
    The default Excel reader engine for 'xlsm' files. Available options: auto, xlrd, openpyxl. [default: auto] [currently: auto]
io.excel.xlsm.writer : string
    The default Excel writer engine for 'xlsm' files. Available options: auto, openpyxl. [default: auto] [currently: auto]
io.excel.xlsx.reader : string
    The default Excel reader engine for 'xlsx' files. Available options: auto, xlrd, openpyxl. [default: auto] [currently: auto]
io.excel.xlsx.writer : string
    The default Excel writer engine for 'xlsx' files. Available options: auto, openpyxl, xlsxwriter. [default: auto] [currently: auto]
io.hdf.default_format : format
    Default format for writing; if None, put will default to 'fixed' and append will default to 'table'. [default: None] [currently: None]
io.hdf.dropna_table : boolean
    Drop ALL nan rows when appending to a table. [default: False] [currently: False]
io.parquet.engine : string
    The default parquet reader/writer engine. Available options: 'auto', 'pyarrow', 'fastparquet'; the default is 'auto'. [default: auto] [currently: auto]
mode.chained_assignment : string
    Raise an exception, warn, or take no action if trying to use chained assignment. The default is warn. [default: warn] [currently: warn]
mode.sim_interactive : boolean
    Whether to simulate interactive mode for purposes of testing. [default: False] [currently: False]
mode.use_inf_as_na : boolean
    True means treat None, NaN, INF, -INF as NA (old way), False means None and NaN are null, but INF, -INF are not NA (new way). [default: False] [currently: False]
mode.use_inf_as_null : boolean
    use_inf_as_null has been deprecated and will be removed in a future version. Use use_inf_as_na instead. [default: False] [currently: False] (Deprecated, use mode.use_inf_as_na instead.)
plotting.backend : str
    The plotting backend to use. The default value is "matplotlib", the backend provided with pandas. Other backends can be specified by providing the name of the module that implements the backend. [default: matplotlib] [currently: matplotlib]
plotting.matplotlib.register_converters : bool
    Whether to register converters with matplotlib's units registry for dates, times, datetimes, and Periods. Toggling to False will remove the converters, restoring any converters that pandas overwrote. [default: True] [currently: True]
pandas.option_context
class pandas.option_context(*args)
Context manager to temporarily set options in the with statement context.
You need to invoke as option_context(pat, val, [(pat, val), ...]).
Examples
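For illustration, a minimal sketch of temporarily overriding two display options (the option values shown are arbitrary):
>>> import pandas as pd
>>> with pd.option_context('display.max_rows', 10, 'display.max_columns', 5):
...     print(pd.get_option('display.max_rows'))
10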
testing.assert_frame_equal(left, right[, …]) Check that left and right DataFrame are equal.
testing.assert_series_equal(left, right[, …]) Check that left and right Series are equal.
testing.assert_index_equal(left, right, …) Check that left and right Index are equal.
pandas.testing.assert_frame_equal
When comparing two numbers, if the first number has magnitude less than 1e-
5, we compare the two numbers directly and check whether they are equivalent
within the specified precision. Otherwise, we compare the ratio of the second
number to the first number and check whether it is equivalent to 1 within the
specified precision.
check_names [bool, default True] Whether to check that the names attribute for
both the index and column attributes of the DataFrame is identical, i.e.
• left.index.names == right.index.names
• left.columns.names == right.columns.names
by_blocks [bool, default False] Specify how to compare internal data. If False,
compare by columns. If True, compare by blocks.
check_exact [bool, default False] Whether to compare numbers exactly.
check_datetimelike_compat [bool, default False] Compare datetime-like which is
comparable ignoring dtype.
check_categorical [bool, default True] Whether to compare internal Categorical
exactly.
check_like [bool, default False] If True, ignore the order of index & columns. Note:
index labels must match their respective rows (same as in columns) - same labels
must be with the same data.
obj [str, default ‘DataFrame’] Specify object name being compared, internally used
to show appropriate assertion message.
See also:
Examples
This example shows comparing two DataFrames that are equal but with columns of differing dtypes.
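A minimal sketch of that comparison (column values here are illustrative, not from the original example):
>>> import pandas as pd
>>> from pandas.testing import assert_frame_equal
>>> df1 = pd.DataFrame({'a': [1, 2, 3]})
>>> df2 = pd.DataFrame({'a': [1.0, 2.0, 3.0]})
>>> assert_frame_equal(df1, df2, check_dtype=False)  # passes: equal values, int64 vs float64
>>> assert_frame_equal(df1, df2)  # doctest: +SKIP    # would raise AssertionError: dtypes differ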
pandas.testing.assert_series_equal
pandas.testing.assert_index_equal
Parameters
left [Index]
right [Index]
exact [bool / string {‘equiv’}, default ‘equiv’] Whether to check the Index class, dtype
and inferred_type are identical. If ‘equiv’, then RangeIndex can be substituted
for Int64Index as well.
check_names [bool, default True] Whether to check the names attribute.
check_less_precise [bool or int, default False] Specify comparison precision. Only
used when check_exact is False. 5 digits (False) or 3 digits (True) after decimal
points are compared. If int, then specify the digits to compare
check_exact [bool, default True] Whether to compare numbers exactly.
check_categorical [bool, default True] Whether to compare internal Categorical
exactly.
obj [str, default ‘Index’] Specify object name being compared, internally used to show
appropriate assertion message
pandas.errors.DtypeWarning
exception pandas.errors.DtypeWarning
Warning raised when reading different dtypes in a column from a file.
Raised for a dtype incompatibility. This can happen whenever read_csv or read_table encounter non-
uniform dtypes in a column(s) of a given CSV file.
See also:
Notes
This warning is issued when dealing with larger files because the dtype checking happens per chunk
read.
Despite the warning, the CSV file is read with mixed types in a single column which will be an object
type. See the examples below to better understand this issue.
Examples
This example creates and reads a large CSV file with a column that contains int and str.
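A sketch of how such a file might be produced and read back (the column name 'a' and the block sizes are illustrative; reading the file back emits the DtypeWarning):
>>> import pandas as pd
>>> df = pd.DataFrame({'a': (['1'] * 100000 + ['X'] * 100000 + ['1'] * 100000)})
>>> df.to_csv('test.csv', index=False)
>>> df2 = pd.read_csv('test.csv')   # DtypeWarning: Columns (0) have mixed types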
Important to notice that df2 will contain both str and int for the same input, ‘1’.
>>> df2.iloc[262140, 0]
'1'
>>> type(df2.iloc[262140, 0])
<class 'str'>
>>> df2.iloc[262150, 0]
1
>>> type(df2.iloc[262150, 0])
<class 'int'>
One way to solve this issue is using the dtype parameter in the read_csv and read_table functions to
make the conversion explicit:
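For instance, assuming the test.csv written in the sketch above, forcing the column to str removes the ambiguity:
>>> df2 = pd.read_csv('test.csv', dtype={'a': str})
>>> type(df2.iloc[262150, 0])
<class 'str'>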
>>> import os
>>> os.remove('test.csv')
pandas.errors.EmptyDataError
exception pandas.errors.EmptyDataError
Exception that is thrown in pd.read_csv (by both the C and Python engines) when empty data or
header is encountered.
pandas.errors.OutOfBoundsDatetime
exception pandas.errors.OutOfBoundsDatetime
pandas.errors.ParserError
exception pandas.errors.ParserError
Exception that is raised by an error encountered in parsing file contents.
This is a generic error raised for errors encountered when functions like read_csv or read_html are
parsing contents of a file.
See also:
pandas.errors.ParserWarning
exception pandas.errors.ParserWarning
Warning raised when reading a file that doesn’t use the default ‘c’ parser.
Raised by pd.read_csv and pd.read_table when it is necessary to change parsers, generally from the
default ‘c’ parser to ‘python’.
It happens due to a lack of support or functionality for parsing a particular attribute of a CSV file
with the requested engine.
Currently, ‘c’ unsupported options include the following parameters:
1. sep other than a single character (e.g. regex separators)
2. skipfooter higher than 0
3. sep=None with delim_whitespace=False
The warning can be avoided by adding engine=’python’ as a parameter in pd.read_csv and
pd.read_table methods.
See also:
Examples
>>> import io
>>> csv = '''a;b;c
... 1;1,8
... 1;2,1'''
>>> df = pd.read_csv(io.StringIO(csv), sep='[;,]') # doctest: +SKIP
... # ParserWarning: Falling back to the 'python' engine...
pandas.errors.PerformanceWarning
exception pandas.errors.PerformanceWarning
Warning raised when there is a possible performance impact.
pandas.errors.UnsortedIndexError
exception pandas.errors.UnsortedIndexError
Error raised when attempting to get a slice of a MultiIndex, and the index has not been lexsorted.
Subclass of KeyError.
New in version 0.20.0.
pandas.errors.UnsupportedFunctionCall
exception pandas.errors.UnsupportedFunctionCall
Exception raised when attempting to call a numpy function on a pandas object, but that function is
not supported by the object e.g. np.cumsum(groupby_object).
pandas.api.types.union_categoricals
Notes
Examples
If you want to combine categoricals that do not necessarily have the same categories, union_categoricals
will combine a list-like of categoricals. The new categories will be the union of the categories being
combined.
By default, the resulting categories will be ordered as they appear in the categories of the data. If you
want the categories to be lexsorted, use sort_categories=True argument.
union_categoricals also works when combining two categoricals that have the same categories and order
information (i.e. categoricals that you could also simply append). Combining ordered categoricals whose
categories are not identical raises a TypeError.
union_categoricals also works with a CategoricalIndex, or Series containing categorical data, but note
that the resulting array will always be a plain Categorical
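For illustration, a minimal sketch of the basic case (the category values are arbitrary):
>>> import pandas as pd
>>> from pandas.api.types import union_categoricals
>>> a = pd.Categorical(["b", "c"])
>>> b = pd.Categorical(["a", "b"])
>>> union_categoricals([a, b])
[b, c, a, b]
Categories (3, object): [b, c, a]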
pandas.api.types.infer_dtype
pandas.api.types.infer_dtype()
Efficiently infer the type of a passed val, or list-like array of values. Return a string describing the
type.
Parameters
value [scalar, list, ndarray, or pandas type]
skipna [bool, default False] Ignore NaN values when inferring the type.
New in version 0.21.0.
Returns
string describing the common type of the input data.
Results can include:
• string
• unicode
• bytes
• floating
• integer
• mixed-integer
• mixed-integer-float
• decimal
• complex
• categorical
• boolean
• datetime64
• datetime
• date
• timedelta64
• timedelta
• time
• period
• mixed
Raises
TypeError if ndarray-like but cannot infer the dtype
Notes
Examples
>>> infer_dtype([pd.Timestamp('20130101')])
'datetime'
>>> infer_dtype([np.datetime64('2013-01-01')])
'datetime64'
>>> infer_dtype(pd.Series(list('aabc')).astype('category'))
'categorical'
pandas.api.types.pandas_dtype
pandas.api.types.pandas_dtype(dtype)
Convert input into a pandas only dtype object or a numpy dtype object.
Parameters
dtype [object to be converted]
Returns
np.dtype or a pandas dtype
Raises
TypeError if not a dtype
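Examples
For illustration:
>>> import pandas as pd
>>> pd.api.types.pandas_dtype('int64')
dtype('int64')
>>> pd.api.types.pandas_dtype('datetime64[ns]')
dtype('<M8[ns]')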
Dtype introspection
pandas.api.types.is_bool_dtype
pandas.api.types.is_bool_dtype(arr_or_dtype)
Check whether the provided array or dtype is of a boolean dtype.
Parameters
arr_or_dtype [array-like] The array or dtype to check.
Returns
boolean Whether or not the array or dtype is of a boolean dtype.
Notes
Examples
>>> is_bool_dtype(str)
False
pandas.api.types.is_categorical_dtype
pandas.api.types.is_categorical_dtype(arr_or_dtype)
Check whether an array-like or dtype is of the Categorical dtype.
Parameters
arr_or_dtype [array-like] The array-like or dtype to check.
Returns
boolean Whether or not the array-like or dtype is of the Categorical dtype.
Examples
>>> is_categorical_dtype(object)
False
>>> is_categorical_dtype(CategoricalDtype())
True
>>> is_categorical_dtype([1, 2, 3])
False
>>> is_categorical_dtype(pd.Categorical([1, 2, 3]))
True
>>> is_categorical_dtype(pd.CategoricalIndex([1, 2, 3]))
True
pandas.api.types.is_complex_dtype
pandas.api.types.is_complex_dtype(arr_or_dtype)
Check whether the provided array or dtype is of a complex dtype.
Parameters
arr_or_dtype [array-like] The array or dtype to check.
Returns
boolean Whether or not the array or dtype is of a complex dtype.
Examples
>>> is_complex_dtype(str)
False
>>> is_complex_dtype(int)
False
>>> is_complex_dtype(np.complex)
True
>>> is_complex_dtype(np.array(['a', 'b']))
False
>>> is_complex_dtype(pd.Series([1, 2]))
False
>>> is_complex_dtype(np.array([1 + 1j, 5]))
True
pandas.api.types.is_datetime64_any_dtype
pandas.api.types.is_datetime64_any_dtype(arr_or_dtype)
Check whether the provided array or dtype is of the datetime64 dtype.
Parameters
arr_or_dtype [array-like] The array or dtype to check.
Returns
boolean Whether or not the array or dtype is of the datetime64 dtype.
Examples
>>> is_datetime64_any_dtype(str)
False
>>> is_datetime64_any_dtype(int)
False
>>> is_datetime64_any_dtype(np.datetime64) # can be tz-naive
True
>>> is_datetime64_any_dtype(DatetimeTZDtype("ns", "US/Eastern"))
True
>>> is_datetime64_any_dtype(np.array(['a', 'b']))
False
>>> is_datetime64_any_dtype(np.array([1, 2]))
False
>>> is_datetime64_any_dtype(np.array([], dtype=np.datetime64))
True
>>> is_datetime64_any_dtype(pd.DatetimeIndex([1, 2, 3],
...                                           dtype=np.datetime64))
True
pandas.api.types.is_datetime64_dtype
pandas.api.types.is_datetime64_dtype(arr_or_dtype)
Check whether an array-like or dtype is of the datetime64 dtype.
Parameters
arr_or_dtype [array-like] The array-like or dtype to check.
Returns
boolean Whether or not the array-like or dtype is of the datetime64 dtype.
Examples
>>> is_datetime64_dtype(object)
False
>>> is_datetime64_dtype(np.datetime64)
True
>>> is_datetime64_dtype(np.array([], dtype=int))
False
>>> is_datetime64_dtype(np.array([], dtype=np.datetime64))
True
>>> is_datetime64_dtype([1, 2, 3])
False
pandas.api.types.is_datetime64_ns_dtype
pandas.api.types.is_datetime64_ns_dtype(arr_or_dtype)
Check whether the provided array or dtype is of the datetime64[ns] dtype.
Parameters
arr_or_dtype [array-like] The array or dtype to check.
Returns
boolean Whether or not the array or dtype is of the datetime64[ns] dtype.
Examples
>>> is_datetime64_ns_dtype(str)
False
>>> is_datetime64_ns_dtype(int)
False
>>> is_datetime64_ns_dtype(np.datetime64) # no unit
False
>>> is_datetime64_ns_dtype(DatetimeTZDtype("ns", "US/Eastern"))
True
>>> is_datetime64_ns_dtype(np.array(['a', 'b']))
False
>>> is_datetime64_ns_dtype(np.array([1, 2]))
False
pandas.api.types.is_datetime64tz_dtype
pandas.api.types.is_datetime64tz_dtype(arr_or_dtype)
Check whether an array-like or dtype is of a DatetimeTZDtype dtype.
Parameters
arr_or_dtype [array-like] The array-like or dtype to check.
Returns
boolean Whether or not the array-like or dtype is of a DatetimeTZDtype dtype.
Examples
>>> is_datetime64tz_dtype(object)
False
>>> is_datetime64tz_dtype([1, 2, 3])
False
>>> is_datetime64tz_dtype(pd.DatetimeIndex([1, 2, 3])) # tz-naive
False
>>> is_datetime64tz_dtype(pd.DatetimeIndex([1, 2, 3], tz="US/Eastern"))
True
pandas.api.types.is_extension_type
pandas.api.types.is_extension_type(arr)
Check whether an array-like is of a pandas extension class instance.
Extension classes include categoricals, pandas sparse objects (i.e. classes represented within the pandas
library and not ones external to it like scipy sparse matrices), and datetime-like arrays.
Parameters
arr [array-like] The array-like to check.
Returns
boolean Whether or not the array-like is of a pandas extension class instance.
Examples
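For illustration (a Categorical and a Series of categorical dtype both count as extension classes):
>>> import pandas as pd
>>> from pandas.api.types import is_extension_type
>>> is_extension_type(pd.Categorical(list('aabc')))
True
>>> is_extension_type(pd.Series(list('aabc')).astype('category'))
True
>>> is_extension_type([1, 2, 3])
False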
pandas.api.types.is_extension_array_dtype
pandas.api.types.is_extension_array_dtype(arr_or_dtype)
Check if an object is a pandas extension array type.
See the User Guide for more.
Parameters
arr_or_dtype [object] For array-like input, the .dtype attribute will be extracted.
Returns
bool Whether the arr_or_dtype is an extension array type.
Notes
This checks whether an object implements the pandas extension array interface. In pandas, this
includes:
• Categorical
• Sparse
• Interval
• Period
• DatetimeArray
• TimedeltaArray
Third-party libraries may implement arrays or types satisfying this interface as well.
Examples
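For illustration, both an extension array and its dtype qualify, while a plain NumPy-backed Series does not:
>>> import pandas as pd
>>> from pandas.api.types import is_extension_array_dtype
>>> arr = pd.Categorical(['a', 'b'])
>>> is_extension_array_dtype(arr)
True
>>> is_extension_array_dtype(arr.dtype)
True
>>> is_extension_array_dtype(pd.Series([1, 2]))
False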
pandas.api.types.is_float_dtype
pandas.api.types.is_float_dtype(arr_or_dtype)
Check whether the provided array or dtype is of a float dtype.
This function is internal and should not be exposed in the public API.
Parameters
arr_or_dtype [array-like] The array or dtype to check.
Returns
boolean Whether or not the array or dtype is of a float dtype.
Examples
>>> is_float_dtype(str)
False
>>> is_float_dtype(int)
False
>>> is_float_dtype(float)
True
>>> is_float_dtype(np.array(['a', 'b']))
False
>>> is_float_dtype(pd.Series([1, 2]))
False
pandas.api.types.is_int64_dtype
pandas.api.types.is_int64_dtype(arr_or_dtype)
Check whether the provided array or dtype is of the int64 dtype.
Parameters
arr_or_dtype [array-like] The array or dtype to check.
Returns
boolean Whether or not the array or dtype is of the int64 dtype.
Notes
Depending on system architecture, the return value of is_int64_dtype(int) will be True if the OS uses
64-bit integers and False if the OS uses 32-bit integers.
Examples
>>> is_int64_dtype(str)
False
>>> is_int64_dtype(np.int32)
False
>>> is_int64_dtype(np.int64)
True
>>> is_int64_dtype('int8')
False
>>> is_int64_dtype('Int8')
False
>>> is_int64_dtype(pd.Int64Dtype)
True
>>> is_int64_dtype(float)
False
>>> is_int64_dtype(np.uint64) # unsigned
False
>>> is_int64_dtype(np.array(['a', 'b']))
False
>>> is_int64_dtype(np.array([1, 2], dtype=np.int64))
True
>>> is_int64_dtype(pd.Index([1, 2.])) # float
False
>>> is_int64_dtype(np.array([1, 2], dtype=np.uint32)) # unsigned
False
pandas.api.types.is_integer_dtype
pandas.api.types.is_integer_dtype(arr_or_dtype)
Check whether the provided array or dtype is of an integer dtype.
Unlike in is_any_int_dtype, timedelta64 instances will return False.
Changed in version 0.24.0: The nullable Integer dtypes (e.g. pandas.Int64Dtype) are also considered
as integer by this function.
Parameters
arr_or_dtype [array-like] The array or dtype to check.
Returns
boolean Whether or not the array or dtype is of an integer dtype and not an instance
of timedelta64.
Examples
>>> is_integer_dtype(str)
False
>>> is_integer_dtype(int)
True
>>> is_integer_dtype(float)
False
>>> is_integer_dtype(np.uint64)
True
>>> is_integer_dtype('int8')
True
>>> is_integer_dtype('Int8')
True
>>> is_integer_dtype(pd.Int8Dtype)
True
>>> is_integer_dtype(np.datetime64)
False
>>> is_integer_dtype(np.timedelta64)
False
>>> is_integer_dtype(np.array(['a', 'b']))
False
>>> is_integer_dtype(pd.Series([1, 2]))
True
>>> is_integer_dtype(np.array([], dtype=np.timedelta64))
False
>>> is_integer_dtype(pd.Index([1, 2.])) # float
False
pandas.api.types.is_interval_dtype
pandas.api.types.is_interval_dtype(arr_or_dtype)
Check whether an array-like or dtype is of the Interval dtype.
Parameters
arr_or_dtype [array-like] The array-like or dtype to check.
Returns
boolean Whether or not the array-like or dtype is of the Interval dtype.
Examples
>>> is_interval_dtype(object)
False
>>> is_interval_dtype(IntervalDtype())
True
>>> is_interval_dtype([1, 2, 3])
False
>>>
>>> interval = pd.Interval(1, 2, closed="right")
>>> is_interval_dtype(interval)
False
>>> is_interval_dtype(pd.IntervalIndex([interval]))
True
pandas.api.types.is_numeric_dtype
pandas.api.types.is_numeric_dtype(arr_or_dtype)
Check whether the provided array or dtype is of a numeric dtype.
Parameters
arr_or_dtype [array-like] The array or dtype to check.
Returns
boolean Whether or not the array or dtype is of a numeric dtype.
Examples
>>> is_numeric_dtype(str)
False
>>> is_numeric_dtype(int)
True
>>> is_numeric_dtype(float)
True
>>> is_numeric_dtype(np.uint64)
True
>>> is_numeric_dtype(np.datetime64)
False
>>> is_numeric_dtype(np.timedelta64)
False
>>> is_numeric_dtype(np.array(['a', 'b']))
False
>>> is_numeric_dtype(pd.Series([1, 2]))
True
>>> is_numeric_dtype(pd.Index([1, 2.]))
True
pandas.api.types.is_object_dtype
pandas.api.types.is_object_dtype(arr_or_dtype)
Check whether an array-like or dtype is of the object dtype.
Parameters
arr_or_dtype [array-like] The array-like or dtype to check.
Returns
boolean Whether or not the array-like or dtype is of the object dtype.
Examples
>>> is_object_dtype(object)
True
>>> is_object_dtype(int)
False
>>> is_object_dtype(np.array([], dtype=object))
True
>>> is_object_dtype(np.array([], dtype=int))
False
>>> is_object_dtype([1, 2, 3])
False
pandas.api.types.is_period_dtype
pandas.api.types.is_period_dtype(arr_or_dtype)
Check whether an array-like or dtype is of the Period dtype.
Parameters
arr_or_dtype [array-like] The array-like or dtype to check.
Returns
boolean Whether or not the array-like or dtype is of the Period dtype.
Examples
>>> is_period_dtype(object)
False
>>> is_period_dtype(PeriodDtype(freq="D"))
True
>>> is_period_dtype([1, 2, 3])
False
pandas.api.types.is_signed_integer_dtype
pandas.api.types.is_signed_integer_dtype(arr_or_dtype)
Check whether the provided array or dtype is of a signed integer dtype.
Unlike in is_any_int_dtype, timedelta64 instances will return False.
Changed in version 0.24.0: The nullable Integer dtypes (e.g. pandas.Int64Dtype) are also considered
as integer by this function.
Parameters
arr_or_dtype [array-like] The array or dtype to check.
Returns
boolean Whether or not the array or dtype is of a signed integer dtype and not an
instance of timedelta64.
Examples
>>> is_signed_integer_dtype(str)
False
>>> is_signed_integer_dtype(int)
True
>>> is_signed_integer_dtype(float)
False
>>> is_signed_integer_dtype(np.uint64) # unsigned
False
>>> is_signed_integer_dtype('int8')
True
>>> is_signed_integer_dtype('Int8')
True
>>> is_signed_integer_dtype(pd.Int8Dtype)
True
>>> is_signed_integer_dtype(np.datetime64)
False
>>> is_signed_integer_dtype(np.timedelta64)
False
>>> is_signed_integer_dtype(np.array(['a', 'b']))
False
>>> is_signed_integer_dtype(pd.Series([1, 2]))
True
>>> is_signed_integer_dtype(np.array([], dtype=np.timedelta64))
False
>>> is_signed_integer_dtype(pd.Index([1, 2.])) # float
False
pandas.api.types.is_string_dtype
pandas.api.types.is_string_dtype(arr_or_dtype)
Check whether the provided array or dtype is of the string dtype.
Parameters
arr_or_dtype [array-like] The array or dtype to check.
Returns
boolean Whether or not the array or dtype is of the string dtype.
Examples
>>> is_string_dtype(str)
True
>>> is_string_dtype(object)
True
>>> is_string_dtype(int)
False
>>>
>>> is_string_dtype(np.array(['a', 'b']))
True
>>> is_string_dtype(pd.Series([1, 2]))
False
pandas.api.types.is_timedelta64_dtype
pandas.api.types.is_timedelta64_dtype(arr_or_dtype)
Check whether an array-like or dtype is of the timedelta64 dtype.
Parameters
arr_or_dtype [array-like] The array-like or dtype to check.
Returns
boolean Whether or not the array-like or dtype is of the timedelta64 dtype.
Examples
>>> is_timedelta64_dtype(object)
False
>>> is_timedelta64_dtype(np.timedelta64)
True
>>> is_timedelta64_dtype([1, 2, 3])
False
pandas.api.types.is_timedelta64_ns_dtype
pandas.api.types.is_timedelta64_ns_dtype(arr_or_dtype)
Check whether the provided array or dtype is of the timedelta64[ns] dtype.
This is a very specific dtype, so generic ones like np.timedelta64 will return False if passed into this
function.
Parameters
arr_or_dtype [array-like] The array or dtype to check.
Returns
boolean Whether or not the array or dtype is of the timedelta64[ns] dtype.
Examples
>>> is_timedelta64_ns_dtype(np.dtype('m8[ns]'))
True
>>> is_timedelta64_ns_dtype(np.dtype('m8[ps]')) # Wrong frequency
False
>>> is_timedelta64_ns_dtype(np.array([1, 2], dtype='m8[ns]'))
True
>>> is_timedelta64_ns_dtype(np.array([1, 2], dtype=np.timedelta64))
False
pandas.api.types.is_unsigned_integer_dtype
pandas.api.types.is_unsigned_integer_dtype(arr_or_dtype)
Check whether the provided array or dtype is of an unsigned integer dtype.
Changed in version 0.24.0: The nullable Integer dtypes (e.g. pandas.UInt64Dtype) are also considered
as integer by this function.
Parameters
arr_or_dtype [array-like] The array or dtype to check.
Returns
boolean Whether or not the array or dtype is of an unsigned integer dtype.
Examples
>>> is_unsigned_integer_dtype(str)
False
>>> is_unsigned_integer_dtype(int) # signed
False
>>> is_unsigned_integer_dtype(float)
False
>>> is_unsigned_integer_dtype(np.uint64)
True
>>> is_unsigned_integer_dtype('uint8')
True
>>> is_unsigned_integer_dtype('UInt8')
True
>>> is_unsigned_integer_dtype(pd.UInt8Dtype)
True
>>> is_unsigned_integer_dtype(np.array(['a', 'b']))
False
>>> is_unsigned_integer_dtype(pd.Series([1, 2])) # signed
False
>>> is_unsigned_integer_dtype(pd.Index([1, 2.])) # float
False
>>> is_unsigned_integer_dtype(np.array([1, 2], dtype=np.uint32))
True
pandas.api.types.is_sparse
pandas.api.types.is_sparse(arr)
Check whether an array-like is a 1-D pandas sparse array.
Check that the one-dimensional array-like is a pandas sparse array. Returns True if it is a pandas
sparse array, not another type of sparse array.
Parameters
arr [array-like] Array-like to check.
Returns
bool Whether or not the array-like is a pandas sparse array.
See also:
Examples
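For illustration (SparseArray is the pandas sparse container; an ordinary Series is not sparse):
>>> import pandas as pd
>>> from pandas.api.types import is_sparse
>>> is_sparse(pd.SparseArray([0, 0, 1, 0]))
True
>>> is_sparse(pd.Series([0, 1, 0]))
False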
Iterable introspection
pandas.api.types.is_dict_like
pandas.api.types.is_dict_like(obj)
Check if the object is dict-like.
Parameters
obj [The object to check]
Returns
is_dict_like [bool] Whether obj has dict-like properties.
Examples
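For illustration:
>>> from pandas.api.types import is_dict_like
>>> is_dict_like({1: 2})
True
>>> is_dict_like([1, 2, 3])
False
>>> is_dict_like("foo")
False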
pandas.api.types.is_file_like
pandas.api.types.is_file_like(obj)
Check if the object is a file-like object.
For objects to be considered file-like, they must be an iterator AND have either a read and/or write
method as an attribute.
Note: file-like objects must be iterable, but iterable objects need not be file-like.
New in version 0.20.0.
Parameters
obj [The object to check]
Returns
is_file_like [bool] Whether obj has file-like properties.
Examples
>>> import io
>>> buffer = io.StringIO("data")
>>> is_file_like(buffer)
True
>>> is_file_like([1, 2, 3])
False
pandas.api.types.is_list_like
pandas.api.types.is_list_like()
Check if the object is list-like.
Objects that are considered list-like are for example Python lists, tuples, sets, NumPy arrays, and
Pandas Series.
Strings and datetime objects, however, are not considered list-like.
Parameters
obj [The object to check]
allow_sets [boolean, default True] If this parameter is False, sets will not be consid-
ered list-like
New in version 0.24.0.
Returns
is_list_like [bool] Whether obj has list-like properties.
Examples
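For illustration (sets count as list-like by default, strings and scalars do not):
>>> from pandas.api.types import is_list_like
>>> is_list_like([1, 2, 3])
True
>>> is_list_like({1, 2, 3})
True
>>> is_list_like("foo")
False
>>> is_list_like(1)
False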
pandas.api.types.is_named_tuple
pandas.api.types.is_named_tuple(obj)
Check if the object is a named tuple.
Parameters
obj [The object to check]
Returns
is_named_tuple [bool] Whether obj is a named tuple.
Examples
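For illustration:
>>> from collections import namedtuple
>>> from pandas.api.types import is_named_tuple
>>> Point = namedtuple("Point", ["x", "y"])
>>> is_named_tuple(Point(1, 2))
True
>>> is_named_tuple((1, 2))
False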
pandas.api.types.is_iterator
pandas.api.types.is_iterator(obj)
Check if the object is an iterator.
For example, generators and other objects implementing __next__ are considered iterators, but lists, strings, and datetime objects are not.
Parameters
obj [The object to check]
Returns
is_iter [bool] Whether obj is an iterator.
Examples
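For illustration (an iterator over a list qualifies, the list itself does not):
>>> from pandas.api.types import is_iterator
>>> is_iterator(iter([1, 2, 3]))
True
>>> is_iterator([1, 2, 3])
False
>>> is_iterator("foo")
False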
Scalar introspection
api.types.is_bool()
api.types.is_categorical(arr) Check whether an array-like is a Categorical in-
stance.
api.types.is_complex()
api.types.is_datetimetz(arr) (DEPRECATED) Check whether an array-like is a
datetime array-like with a timezone component in
its dtype.
api.types.is_float()
api.types.is_hashable(obj) Return True if hash(obj) will succeed, False other-
wise.
api.types.is_integer()
api.types.is_interval()
api.types.is_number(obj) Check if the object is a number.
api.types.is_period(arr) (DEPRECATED) Check whether an array-like is a
periodical index.
api.types.is_re(obj) Check if the object is a regex pattern instance.
api.types.is_re_compilable(obj) Check if the object can be compiled into a regex
pattern instance.
api.types.is_scalar() Return True if given value is scalar.
pandas.api.types.is_bool
pandas.api.types.is_bool()
pandas.api.types.is_categorical
pandas.api.types.is_categorical(arr)
Check whether an array-like is a Categorical instance.
Parameters
arr [array-like] The array-like to check.
Returns
boolean Whether or not the array-like is of a Categorical instance.
Examples
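For illustration:
>>> import pandas as pd
>>> from pandas.api.types import is_categorical
>>> is_categorical(pd.Categorical([1, 2, 3]))
True
>>> is_categorical(pd.Series([1, 2, 3]).astype("category"))
True
>>> is_categorical([1, 2, 3])
False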
pandas.api.types.is_complex
pandas.api.types.is_complex()
pandas.api.types.is_datetimetz
pandas.api.types.is_datetimetz(arr)
Check whether an array-like is a datetime array-like with a timezone component in its dtype.
Deprecated since version 0.24.0.
Parameters
arr [array-like] The array-like to check.
Returns
boolean Whether or not the array-like is a datetime array-like with a timezone com-
ponent in its dtype.
Examples
For example, a tz-naive DatetimeIndex returns False, while a DatetimeIndex constructed with a timezone
(e.g. tz="US/Eastern") returns True.
The object need not be a DatetimeIndex object. It just needs to have a dtype which has a timezone
component.
pandas.api.types.is_float
pandas.api.types.is_float()
pandas.api.types.is_hashable
pandas.api.types.is_hashable(obj)
Return True if hash(obj) will succeed, False otherwise.
Some types will pass a test against collections.abc.Hashable but fail when they are actually hashed
with hash().
Distinguish between these and other types by trying the call to hash() and seeing if they raise Type-
Error.
Returns
bool
Examples
>>> a = ([],)
>>> isinstance(a, collections.abc.Hashable)
True
>>> is_hashable(a)
False
pandas.api.types.is_integer
pandas.api.types.is_integer()
pandas.api.types.is_interval
pandas.api.types.is_interval()
pandas.api.types.is_number
pandas.api.types.is_number(obj)
Check if the object is a number.
Returns True when the object is a number, and False if is not.
Parameters
obj [any type] The object to check if is a number.
Returns
is_number [bool] Whether obj is a number or not.
See also:
Examples
>>> pd.api.types.is_number(1)
True
>>> pd.api.types.is_number(7.15)
True
>>> pd.api.types.is_number(False)
True
>>> pd.api.types.is_number("foo")
False
>>> pd.api.types.is_number("5")
False
pandas.api.types.is_period
pandas.api.types.is_period(arr)
Check whether an array-like is a periodical index.
Deprecated since version 0.24.0.
Parameters
arr [array-like] The array-like to check.
Returns
boolean Whether or not the array-like is a periodical index.
Examples
pandas.api.types.is_re
pandas.api.types.is_re(obj)
Check if the object is a regex pattern instance.
Parameters
obj [The object to check]
Returns
is_regex [bool] Whether obj is a regex pattern.
Examples
>>> is_re(re.compile(".*"))
True
>>> is_re("foo")
False
pandas.api.types.is_re_compilable
pandas.api.types.is_re_compilable(obj)
Check if the object can be compiled into a regex pattern instance.
Parameters
obj [The object to check]
Returns
is_regex_compilable [bool] Whether obj can be compiled as a regex pattern.
Examples
>>> is_re_compilable(".*")
True
>>> is_re_compilable(1)
False
pandas.api.types.is_scalar
pandas.api.types.is_scalar()
Return True if given value is scalar.
Parameters
val [object] This includes:
• numpy array scalar (e.g. np.int64)
• Python builtin numerics
• Python builtin byte arrays and strings
• None
• datetime.datetime
• datetime.timedelta
• Period
• decimal.Decimal
• Interval
• DateOffset
• Fraction
• Number
Returns
bool Return True if given object is scalar, False otherwise
Examples
6.16 Extensions
These are primarily intended for library authors looking to extend pandas objects.
6.16.1 pandas.api.extensions.register_extension_dtype
pandas.api.extensions.register_extension_dtype(cls: Type[pandas.core.dtypes.base.ExtensionDtype])
→ Type[pandas.core.dtypes.base.ExtensionDtype]
Register an ExtensionType with pandas as class decorator.
New in version 0.24.0.
This enables operations like .astype(name) for the name of the ExtensionDtype.
Returns
callable A class decorator.
Examples
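A minimal sketch of the decorator in use. The dtype class below is hypothetical and, for brevity, omits the abstract methods (type, construct_from_string, …) a real implementation must also define:
>>> from pandas.api.extensions import ExtensionDtype, register_extension_dtype
>>> @register_extension_dtype
... class MyExtensionDtype(ExtensionDtype):
...     name = "myextension"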
6.16.2 pandas.api.extensions.register_dataframe_accessor
pandas.api.extensions.register_dataframe_accessor(name)
Register a custom accessor on DataFrame objects.
Parameters
name [str] Name under which the accessor should be registered. A warning is issued
if this name conflicts with a preexisting attribute.
Returns
callable A class decorator.
See also:
register_series_accessor, register_index_accessor
Notes
When accessed, your accessor will be initialized with the pandas object the user is interacting with,
so the signature must be def __init__(self, pandas_obj).
For consistency with pandas methods, you should raise an AttributeError if the data passed to your
accessor has an incorrect dtype.
Examples
import pandas as pd

@pd.api.extensions.register_dataframe_accessor("geo")
class GeoAccessor:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    @property
    def center(self):
        # return the geographic center point of this DataFrame
        lat = self._obj.latitude
        lon = self._obj.longitude
        return (float(lon.mean()), float(lat.mean()))

    def plot(self):
        # plot this array's data on a map, e.g., using Cartopy
        pass
6.16.3 pandas.api.extensions.register_series_accessor
pandas.api.extensions.register_series_accessor(name)
Register a custom accessor on Series objects.
Parameters
name [str] Name under which the accessor should be registered. A warning is issued
if this name conflicts with a preexisting attribute.
Returns
callable A class decorator.
See also:
register_dataframe_accessor, register_index_accessor
Notes
When accessed, your accessor will be initialized with the pandas object the user is interacting with,
so the signature must be def __init__(self, pandas_obj).
For consistency with pandas methods, you should raise an AttributeError if the data passed to your
accessor has an incorrect dtype.
Examples
import pandas as pd

@pd.api.extensions.register_dataframe_accessor("geo")
class GeoAccessor:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    @property
    def center(self):
        # return the geographic center point of this DataFrame
        lat = self._obj.latitude
        lon = self._obj.longitude
        return (float(lon.mean()), float(lat.mean()))

    def plot(self):
        # plot this array's data on a map, e.g., using Cartopy
        pass
6.16.4 pandas.api.extensions.register_index_accessor
pandas.api.extensions.register_index_accessor(name)
Register a custom accessor on Index objects.
Parameters
name [str] Name under which the accessor should be registered. A warning is issued
if this name conflicts with a preexisting attribute.
Returns
callable A class decorator.
See also:
register_dataframe_accessor, register_series_accessor
Notes
When accessed, your accessor will be initialized with the pandas object the user is interacting with,
so the signature must be def __init__(self, pandas_obj).
For consistency with pandas methods, you should raise an AttributeError if the data passed to your
accessor has an incorrect dtype.
Examples
import pandas as pd

@pd.api.extensions.register_dataframe_accessor("geo")
class GeoAccessor:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    @property
    def center(self):
        # return the geographic center point of this DataFrame
        lat = self._obj.latitude
        lon = self._obj.longitude
        return (float(lon.mean()), float(lat.mean()))

    def plot(self):
        # plot this array's data on a map, e.g., using Cartopy
        pass
6.16.5 pandas.api.extensions.ExtensionDtype
class pandas.api.extensions.ExtensionDtype
A custom data type, to be paired with an ExtensionArray.
New in version 0.23.0.
See also:
extensions.register_extension_dtype
extensions.ExtensionArray
Notes
The interface includes the following abstract methods that must be implemented by subclasses:
• type
• name
• construct_from_string
The following attributes influence the behavior of the dtype in pandas operations
• _is_numeric
• _is_boolean
Optionally one can override construct_array_type for construction with the name of this dtype via
the Registry. See extensions.register_extension_dtype().
• construct_array_type
The na_value class attribute can be used to set the default NA value for this type. numpy.nan is used
by default.
ExtensionDtypes are required to be hashable. The base class provides a default implementation, which
relies on the _metadata class attribute. _metadata should be a tuple containing the strings that define
your data type. For example, with PeriodDtype that’s the freq attribute.
If you have a parametrized dtype you should set the _metadata class property.
Ideally, the attributes in _metadata will match the parameters to your ExtensionDtype.__init__ (if
any). If any of the attributes in _metadata don’t implement the standard __eq__ or __hash__, the
default implementations here will not work.
Changed in version 0.24.0: Added _metadata, __hash__, and changed the default definition of __eq__.
This class does not inherit from ‘abc.ABCMeta’ for performance reasons. Methods and properties
required by the interface raise pandas.errors.AbstractMethodError and no register method is
provided for registering virtual subclasses.
Attributes
pandas.api.extensions.ExtensionDtype.kind
ExtensionDtype.kind
A character code (one of ‘biufcmMOSUV’), default ‘O’
This should match the NumPy dtype used when the array is converted to an ndarray, which is
probably ‘O’ for object if the extension type cannot be represented as a built-in NumPy type.
See also:
numpy.dtype.kind
pandas.api.extensions.ExtensionDtype.na_value
ExtensionDtype.na_value
Default NA value to use for this type.
This is used in e.g. ExtensionArray.take. This should be the user-facing “boxed” version of the
NA value, not the physical NA value for storage. e.g. for JSONArray, this is an empty dictionary.
pandas.api.extensions.ExtensionDtype.name
ExtensionDtype.name
A string identifying the data type.
Will be used for display in, e.g. Series.dtype
pandas.api.extensions.ExtensionDtype.names
ExtensionDtype.names
Ordered list of field names, or None if there are no fields.
This is for compatibility with NumPy arrays, and may be removed in the future.
pandas.api.extensions.ExtensionDtype.type
ExtensionDtype.type
The scalar type for the array, e.g. int
It’s expected ExtensionArray[item] returns an instance of ExtensionDtype.type for scalar
item, assuming that value is valid (not NA). NA values do not need to be instances of type.
Methods
pandas.api.extensions.ExtensionDtype.construct_array_type
classmethod ExtensionDtype.construct_array_type()
Return the array type associated with this dtype
Returns
type
pandas.api.extensions.ExtensionDtype.construct_from_string
Examples
For extension dtypes with arguments the following may be an adequate implementation.
>>> import re
>>> @classmethod
... def construct_from_string(cls, string):
... pattern = re.compile(r"^my_type\[(?P<arg_name>.+)\]$")
... match = pattern.match(string)
... if match:
... return cls(**match.groupdict())
... else:
... raise TypeError("Cannot construct a '{}' from "
... "'{}'".format(cls.__name__, string))
pandas.api.extensions.ExtensionDtype.is_dtype
Notes
6.16.6 pandas.api.extensions.ExtensionArray
class pandas.api.extensions.ExtensionArray
Abstract base class for custom 1-D array types.
pandas will recognize instances of this class as proper arrays with a custom type and will not attempt
to coerce them to objects. They may be stored directly inside a DataFrame or Series.
New in version 0.23.0.
Notes
The interface includes the following abstract methods that must be implemented by subclasses:
• _from_sequence
• _from_factorized
• __getitem__
• __len__
• dtype
• nbytes
• isna
• take
• copy
• _concat_same_type
A default repr displaying the type, (truncated) data, length, and dtype is provided. It can be customized
or replaced by overriding:
• __repr__ : A default repr for the ExtensionArray.
• _formatter : Print scalars inside a Series or DataFrame.
Some methods require casting the ExtensionArray to an ndarray of Python objects with self.
astype(object), which may be expensive. When performance is a concern, we highly recommend
overriding the following methods:
• fillna
• dropna
• unique
• factorize / _values_for_factorize
• argsort / _values_for_argsort
• searchsorted
The remaining methods implemented on this class should be performant, as they only compose abstract
methods. Still, a more efficient implementation may be available, and these methods can be overridden.
One can implement methods to handle array reductions.
• _reduce
One can implement methods to handle parsing from strings that will be used in methods such as
pandas.io.parsers.read_csv.
• _from_sequence_of_strings
This class does not inherit from ‘abc.ABCMeta’ for performance reasons. Methods and properties
required by the interface raise pandas.errors.AbstractMethodError and no register method is
provided for registering virtual subclasses.
ExtensionArrays are limited to 1 dimension.
They may be backed by none, one, or many NumPy arrays. For example, pandas.Categorical is an
extension array backed by two arrays, one for codes and one for categories. An array of IPv6 addresses
may be backed by a NumPy structured array with two fields, one for the lower 64 bits and one for the
upper 64 bits. Or they may be backed by some other storage type, like Python lists. Pandas makes
no assumptions on how the data are stored, just that it can be converted to a NumPy array. The
ExtensionArray interface does not impose any rules on how this data is stored. However, currently, the
backing data cannot be stored in attributes called .values or ._values to ensure full compatibility
with pandas internals. But other names as .data, ._data, ._items, … can be freely used.
If implementing NumPy’s __array_ufunc__ interface, pandas expects that
1. You defer by raising NotImplemented when any Series are present in inputs. Pandas will extract
the arrays and call the ufunc again.
2. You define a _HANDLED_TYPES tuple as an attribute on the class. Pandas inspect this to determine
whether the ufunc is valid for the types present.
See NumPy Universal Functions for more.
Attributes
pandas.api.extensions.ExtensionArray.dtype
ExtensionArray.dtype
An instance of ‘ExtensionDtype’.
pandas.api.extensions.ExtensionArray.nbytes
ExtensionArray.nbytes
The number of bytes needed to store this object in memory.
pandas.api.extensions.ExtensionArray.ndim
ExtensionArray.ndim
Extension Arrays are only allowed to be 1-dimensional.
pandas.api.extensions.ExtensionArray.shape
ExtensionArray.shape
Return a tuple of the array dimensions.
Methods
argsort(self[, ascending, kind]) Return the indices that would sort this array.
astype(self, dtype[, copy]) Cast to a NumPy array with ‘dtype’.
copy(self) Return a copy of the array.
dropna(self) Return ExtensionArray without NA values
factorize(self, na_sentinel) Encode the extension array as an enumerated
type.
fillna(self[, value, method, limit]) Fill NA/NaN values using the specified method.
isna(self) A 1-D array indicating if each value is missing.
ravel(self[, order]) Return a flattened view on this array.
repeat(self, repeats[, axis]) Repeat elements of a ExtensionArray.
searchsorted(self, value[, side, sorter]) Find indices where elements should be inserted
to maintain order.
shift(self, periods, fill_value) Shift values by desired number.
take(self, indices, allow_fill, fill_value) Take elements from an array.
unique(self) Compute the ExtensionArray of unique values.
_concat_same_type(to_concat) Concatenate multiple arrays of this dtype.
_formatter(self, boxed) Formatting function for scalar values.
_formatting_values(self) (DEPRECATED) An array of values to be
printed in, e.g.
_from_factorized(values, original) Reconstruct an ExtensionArray after factoriza-
tion.
_from_sequence(scalars[, dtype, copy]) Construct a new ExtensionArray from a se-
quence of scalars.
_from_sequence_of_strings(strings[, dtype, copy]) Construct a new ExtensionArray from a sequence of strings.
_ndarray_values Internal pandas method for lossy conversion to
a NumPy ndarray.
_reduce(self, name[, skipna]) Return a scalar result of performing the reduc-
tion operation.
_values_for_argsort(self) Return values for sorting.
_values_for_factorize(self) Return an array and missing value suitable for
factorization.
pandas.api.extensions.ExtensionArray.argsort
pandas.api.extensions.ExtensionArray.astype
pandas.api.extensions.ExtensionArray.copy
ExtensionArray.copy(self ) → pandas.core.dtypes.generic.ABCExtensionArray
Return a copy of the array.
Returns
ExtensionArray
pandas.api.extensions.ExtensionArray.dropna
ExtensionArray.dropna(self )
Return ExtensionArray without NA values
Returns
valid [ExtensionArray]
pandas.api.extensions.ExtensionArray.factorize
Note: uniques will not contain an entry for the NA value of the Extension-
Array if there are any missing values present in self.
See also:
Notes
pandas.api.extensions.ExtensionArray.fillna
pandas.api.extensions.ExtensionArray.isna
ExtensionArray.isna(self ) → ~ArrayLike
A 1-D array indicating if each value is missing.
Returns
na_values [Union[np.ndarray, ExtensionArray]] In most cases, this should return
a NumPy ndarray. For exceptional cases like SparseArray, where returning
an ndarray would be expensive, an ExtensionArray may be returned.
Notes
pandas.api.extensions.ExtensionArray.ravel
Notes
pandas.api.extensions.ExtensionArray.repeat
Examples
pandas.api.extensions.ExtensionArray.searchsorted
Parameters
value [array_like] Values to insert into self.
side [{‘left’, ‘right’}, optional] If ‘left’, the index of the first suitable location found
is given. If ‘right’, return the last such index. If there is no suitable index,
return either 0 or N (where N is the length of self ).
sorter [1-D array_like, optional] Optional array of integer indices that sort array
a into ascending order. They are typically the result of argsort.
Returns
array of ints Array of insertion points with the same shape as value.
See also:
pandas.api.extensions.ExtensionArray.shift
Notes
pandas.api.extensions.ExtensionArray.take
numpy.take
api.extensions.take
Notes
Examples
Here’s an example implementation, which relies on casting the extension array to object dtype.
This uses the helper method pandas.api.extensions.take().
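Such an implementation might look roughly like the sketch below; it assumes the subclass already provides astype, _from_sequence and a dtype with an na_value:

def take(self, indices, allow_fill=False, fill_value=None):
    from pandas.api.extensions import take

    # If the extension array is backed by an ndarray, that array could be
    # passed directly instead of coercing to object dtype.
    data = self.astype(object)

    if allow_fill and fill_value is None:
        fill_value = self.dtype.na_value

    # fill_value should be translated from the scalar type for the array
    # to the physical storage type for the data before calling take.
    result = take(data, indices, fill_value=fill_value, allow_fill=allow_fill)
    return self._from_sequence(result, dtype=self.dtype)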
pandas.api.extensions.ExtensionArray.unique
ExtensionArray.unique(self )
Compute the ExtensionArray of unique values.
Returns
uniques [ExtensionArray]
pandas.api.extensions.ExtensionArray._concat_same_type
pandas.api.extensions.ExtensionArray._formatter
pandas.api.extensions.ExtensionArray._formatting_values
ExtensionArray._formatting_values(self ) → numpy.ndarray
An array of values to be printed in, e.g. the Series repr
Deprecated since version 0.24.0: Use ExtensionArray._formatter() instead.
Returns
array [ndarray]
pandas.api.extensions.ExtensionArray._from_factorized
factorize
ExtensionArray.factorize
pandas.api.extensions.ExtensionArray._from_sequence
pandas.api.extensions.ExtensionArray._from_sequence_of_strings
pandas.api.extensions.ExtensionArray._ndarray_values
ExtensionArray._ndarray_values
Internal pandas method for lossy conversion to a NumPy ndarray.
This method is not part of the pandas interface.
The expectation is that this is cheap to compute, and is primarily used for interacting with our
indexers.
Returns
array [ndarray]
pandas.api.extensions.ExtensionArray._reduce
pandas.api.extensions.ExtensionArray._values_for_argsort
ExtensionArray._values_for_argsort(self ) → numpy.ndarray
Return values for sorting.
Returns
ndarray The transformed values should maintain the ordering between values
within the array.
See also:
ExtensionArray.argsort
pandas.api.extensions.ExtensionArray._values_for_factorize
Notes
6.16.7 pandas.arrays.PandasArray
Attributes
None
Methods
None
SEVEN
DEVELOPMENT
Table of contents:
• Where to start?
• Bug reports and enhancement requests
• Working with the code
– Version control, Git, and GitHub
– Getting started with Git
– Forking
– Creating a development environment
* Installing a C compiler
* Creating a Python environment
* Creating a Python environment (pip)
– Creating a branch
• Contributing to the documentation
– About the pandas documentation
– Updating a pandas docstring
– How to build the pandas documentation
* Requirements
* Building the documentation
* Building master branch documentation
• Contributing to the code base
– Code standards
– Optional dependencies
* C (cpplint)
* Python (PEP8 / black)
* Import formatting
* Backwards compatibility
– Testing with continuous integration
– Test-driven development/code writing
* Writing tests
* Transitioning to pytest
* Using pytest
* Using hypothesis
* Testing warnings
– Running the test suite
– Running the performance test suite
– Documenting your code
• Contributing your changes to pandas
– Committing your code
– Pushing your changes
– Review your code
– Finally, make the pull request
– Updating your pull request
– Delete your merged branch (optional)
All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are wel-
come.
If you are brand new to pandas or open-source development, we recommend going through the GitHub
“issues” tab to find issues that interest you. There are a number of issues listed under Docs and good first
issue where you could start out. Once you’ve found an interesting issue, you can return here to get your
development environment setup.
Feel free to ask questions on the mailing list or on Gitter.
Bug reports are an important part of making pandas more stable. Having a complete bug report will allow
others to reproduce the bug and provide insight into fixing. See this stackoverflow article and this blogpost
for tips on writing a good bug report.
Trying the bug-producing code out on the master branch is often a worthwhile exercise to confirm the bug
still exists. It is also worth searching existing bug reports and pull requests to see if the issue has already
been reported and/or fixed.
```python
>>> from pandas import DataFrame
>>> df = DataFrame(...)
...
```
2. Include the full version string of pandas and its dependencies. You can use the built-in function pd.show_versions() for this.
3. Explain why the current behavior is wrong/not desired and what you expect instead.
The issue will then show up to the pandas community and be open to comments/ideas from others.
Now that you have an issue you want to fix, enhancement to add, or documentation to improve, you need
to learn how to work with GitHub and the pandas code base.
To the new user, working with Git is one of the more daunting aspects of contributing to pandas. It can very
quickly become overwhelming, but sticking to the guidelines below will help keep the process straightforward
and mostly trouble free. As always, if you are having difficulties please feel free to ask for help.
The code is hosted on GitHub. To contribute you will need to sign up for a free GitHub account. We use
Git for version control to allow many people to work together on the project.
Some great resources for learning Git:
• the GitHub help pages.
• the NumPy documentation.
• Matthew Brett’s Pydagogue.
GitHub has instructions for installing git, setting up your SSH key, and configuring git. All these steps need
to be completed before you can work seamlessly between your local repository and GitHub.
Forking
You will need your own fork to work on the code. Go to the pandas project page and hit the Fork button.
You will want to clone your fork to your machine:
This creates the directory pandas-yourname and connects your repository to the upstream (main project)
pandas repository.
To test out code changes, you’ll need to build pandas from source, which requires a C compiler and Python
environment. If you’re making documentation changes, you can skip to Contributing to the documentation
but you won’t be able to build the documentation locally before pushing your changes.
Installing a C compiler
Pandas uses C extensions (mostly written using Cython) to speed up certain operations. To install pandas
from source, you need to compile these C extensions, which means you need a C compiler. This process
depends on which platform you’re using. Follow the CPython contributing guide for getting a compiler
installed. You don’t need to do any of the ./configure or make steps; you only need to install the compiler.
For Windows developers, when using Python 3.5 and later, it is sufficient to install Visual Studio 2017 with
the Python development workload and the Python native development tools option. Otherwise,
the following links may be helpful.
• https://blogs.msdn.microsoft.com/pythonengineering/2017/03/07/python-support-in-vs2017/
• https://blogs.msdn.microsoft.com/pythonengineering/2016/04/11/unable-to-find-vcvarsall-bat/
• https://github.com/conda/conda-recipes/wiki/Building-from-Source-on-Windows-32-bit-and-64-bit
• https://cowboyprogrammer.org/building-python-wheels-for-windows/
• https://blog.ionelmc.ro/2014/12/21/compiling-python-extensions-on-windows/
• https://support.enthought.com/hc/en-us/articles/204469260-Building-Python-extensions-with-Canopy
Let us know if you have any difficulties by opening an issue or reaching out on Gitter.
Now that you have a C compiler, create an isolated pandas development environment:
• Install either Anaconda or miniconda
• Make sure your conda is up to date (conda update conda)
• Make sure that you have cloned the repository
• cd to the pandas source directory
We’ll now kick off a three-step process:
1. Install the build dependencies
2. Build and install pandas
3. Install the optional dependencies
# Create and activate the build environment
conda env create -f environment.yml
conda activate pandas-dev
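To actually build and install pandas itself (steps 2 and 3 in the list above), commands along these lines are typically used; the exact invocation can vary between pandas versions:
# Build the C extensions in place and install pandas in development mode
python setup.py build_ext --inplace -j 4
python -m pip install -e .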
The conda env create command above creates the new environment, and does not touch any of your existing
environments, nor any existing Python installation. At this point you should be able to import pandas from
your locally built version.
To view your environments:
conda info -e
conda deactivate
If you aren’t using conda for your development environment, follow these instructions. You’ll need to have
at least python3.5 installed on your system.
Creating a branch
You want your master branch to reflect only production-ready code, so create a feature branch for making
your changes. For example:
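A typical sequence (shiny-new-feature being an example branch name) is:
git branch shiny-new-feature
git checkout shiny-new-feature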
This changes your working directory to the shiny-new-feature branch. Keep any changes in this branch
specific to one bug or feature so it is clear what the branch brings to pandas. You can have many
shiny-new-features and switch between them using the git checkout command.
When creating this branch, make sure your master branch is up to date with the latest upstream master
version. To update your local master branch, you can do:
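One way to do this (assuming the upstream remote configured earlier) is:
git checkout master
git pull upstream master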
When you want to update the feature branch with changes in master after you created the branch, check
the section on updating a PR.
Contributing to the documentation benefits everyone who uses pandas. We encourage you to help us improve
the documentation, and you don’t have to be an expert on pandas to do so! In fact, there are sections of the
docs that are worse off after being written by experts. If something in the docs doesn’t make sense to you,
updating the relevant section after you figure it out is a great way to ensure it will help the next person.
Documentation:
The documentation is written in reStructuredText, which is almost like writing in plain English, and built
using Sphinx. The Sphinx Documentation has an excellent introduction to reST. Review the Sphinx docs
to perform more complex changes to the documentation as well.
Some other important things to know about the docs:
• The pandas documentation consists of two parts: the docstrings in the code itself and the docs in this
folder doc/.
The docstrings provide a clear explanation of the usage of the individual functions, while the documentation
in this folder consists of tutorial-like overviews per topic together with some other information
(what’s new, installation, etc).
• The docstrings follow a pandas convention, based on the Numpy Docstring Standard. Follow the
pandas docstring guide for detailed instructions on how to write a correct docstring.
A Python docstring is a string used to document a Python module, class, function or method, so
programmers can understand what it does without having to read the details of the implementation.
Also, it is a common practice to generate online (html) documentation automatically from docstrings.
Sphinx serves this purpose.
The next example gives an idea of what a docstring looks like:
def add(num1, num2):
"""
Add up two integer numbers.
This function simply wraps the `+` operator, and does not
do anything interesting, except for illustrating what is
the docstring of a very simple function.
Parameters
----------
num1 : int
First number to add
num2 : int
Second number to add
Returns
-------
int
The sum of `num1` and `num2`
See Also
--------
subtract : Subtract one integer from another
Examples
--------
>>> add(2, 2)
4
>>> add(25, 0)
25
>>> add(10, -10)
0
"""
return num1 + num2
Some standards exist about docstrings, so they are easier to read, and they can be exported to other
formats such as html or pdf.
The first conventions every Python docstring should follow are defined in PEP-257.
As PEP-257 is quite open, some other standards exist on top of it. In the case of pandas, the
numpy docstring convention is followed. These conventions are explained in this document:
– numpydoc docstring guide (which is based in the original Guide to NumPy/SciPy documentation)
Writing a docstring
General rules
Docstrings must be defined with three double-quotes. No blank lines should be left before or after the
docstring. The text starts on the line after the opening quotes. The closing quotes have their own
line (meaning that they are not at the end of the last sentence).
On rare occasions reST styles like bold text or italics will be used in docstrings, but it is common to
have inline code, which is presented between backticks. The following are considered inline code:
– The name of a parameter
– Python code, a module, function, built-in, type, literal… (e.g. os, list, numpy.abs, datetime.date, True)
– A pandas class (in the form :class:`pandas.Series`)
– A pandas method (in the form :meth:`pandas.Series.sum`)
– A pandas function (in the form :func:`pandas.to_datetime`)
Note: To display only the last component of the linked class, method or function, prefix it with ~.
For example, :class:`~pandas.Series` will link to pandas.Series but only display the last part,
Series as the link text. See Sphinx cross-referencing syntax for details.
Good:
def add_values(arr):
"""
Add the values in `arr`.
Bad:
def func():
"""Some function.
There is a blank line between the docstring and the first line
of code `foo = 1`.
The closing quotes should be in the next line, not in this one."""
foo = 1
bar = 2
return foo + bar
The short summary is a single sentence that expresses what the function does in a concise way.
The short summary must start with a capital letter, end with a dot, and fit in a single line. It needs to
express what the object does without providing details. For functions and methods, the short summary
must start with an infinitive verb.
Good:
def astype(dtype):
"""
Cast Series type.
Bad:
def astype(dtype):
"""
Casts Series type.
def astype(dtype):
"""
Method to cast Series type.
def astype(dtype):
"""
Cast Series type
def astype(dtype):
"""
Cast Series type from its current type to the new type defined in
the parameter dtype.
The extended summary provides details on what the function does. It should not go into the details
of the parameters, or discuss implementation notes, which go in other sections.
A blank line is left between the short summary and the extended summary, and every paragraph in
the extended summary ends with a period.
The extended summary should provide details on why the function is useful and its use cases, if it
is not too generic.
def unstack():
"""
Pivot a row index to columns.
The index level will be automatically removed from the index when added
as columns.
"""
pass
Section 3: Parameters
The details of the parameters will be added in this section. This section has the title “Parameters”,
followed by a line with a hyphen under each letter of the word “Parameters”. A blank line is left before
the section title, but not after, and not between the line with the word “Parameters” and the one with
the hyphens.
After the title, each parameter in the signature must be documented, including *args and **kwargs,
but not self.
The parameters are defined by their name, followed by a space, a colon, another space, and the type
(or types). Note that the space between the name and the colon is important. Types are not defined
for *args and **kwargs, but must be defined for all other parameters. After the parameter definition,
it is required to have a line with the parameter description, which is indented, and can have multiple
lines. The description must start with a capital letter, and finish with a dot.
For keyword arguments with a default value, the default will be listed after a comma at the end of the
type. The exact form of the type in this case will be “int, default 0”. In some cases it may be useful to
explain what the default argument means, which can be added after a comma “int, default -1, meaning
all cpus”.
When the default value is None, it means that the value will not be used. In that case, instead of “str,
default None”, it is preferred to write “str, optional”. When None is a value being used, we keep
the form “str, default None”. For example, in df.to_csv(compression=None), None is not a value being
used; it means that compression is optional, and no compression is applied if not provided, so we write
“str, optional”. Only in cases like func(value=None), where None is used in the same way as 0 or foo
would be used, do we specify “str, int or None, default None”.
Good:
class Series:
def plot(self, kind, color='blue', **kwargs):
"""
Generate a plot.
Parameters
----------
kind : str
Kind of matplotlib plot.
color : str, default 'blue'
Color name or rgb code.
**kwargs
These parameters will be passed to the matplotlib plotting
function.
"""
pass
Bad:
class Series:
def plot(self, kind, **kwargs):
"""
Generate a plot.
Parameters
----------
kind: str
kind of matplotlib plot
"""
pass
Parameter types
When specifying the parameter types, Python built-in data types can be used directly (the Python
type is preferred to the more verbose string, integer, boolean, etc):
– int
– float
– str
– bool
For complex types, define the subtypes. For dict and tuple, as more than one type is present, we use
the brackets to help read the type (curly brackets for dict and normal brackets for tuple):
– list of int
– dict of {str : int}
– tuple of (str, int, int)
– tuple of (str,)
– set of str
When only a fixed set of values is allowed, list them in curly brackets, separated by commas (followed
by a space). If the values are ordinal, list them in that order. Otherwise, list the default value first,
if there is one:
– {0, 10, 25}
– {‘simple’, ‘advanced’}
– {‘low’, ‘medium’, ‘high’}
– {‘cat’, ‘dog’, ‘bird’}
If the type is defined in a Python module, the module must be specified:
– datetime.date
– datetime.datetime
– decimal.Decimal
If the type is in a package, the module must be also specified:
– numpy.ndarray
– scipy.sparse.coo_matrix
If the type is a pandas type, also specify pandas except for Series and DataFrame:
– Series
– DataFrame
– pandas.Index
– pandas.Categorical
– pandas.SparseArray
If the exact type is not relevant, but the value must be compatible with a numpy array, array-like can be
specified. If any type that can be iterated is accepted, iterable can be used:
– array-like
– iterable
If more than one type is accepted, separate them by commas, except the last two types, that need to
be separated by the word ‘or’:
– int or float
– float, decimal.Decimal or None
– str or list of str
If None is one of the accepted values, it always needs to be the last in the list.
For axis, the convention is to use something like:
– axis : {0 or ‘index’, 1 or ‘columns’, None}, default None
If the method returns a value, it will be documented in this section, and likewise if the method yields
its output.
The title of the section is defined in the same way as “Parameters”: the name “Returns”
or “Yields” followed by a line with as many hyphens as there are letters in the preceding word.
The documentation of the return value is similar to that of the parameters, but in this case no name is
provided, unless the method returns or yields more than one value (a tuple of values).
The types for “Returns” and “Yields” are the same as the ones for “Parameters”. Also, the
description must finish with a dot.
For example, with a single value:
def sample():
"""
Generate and return a random number.
Returns
-------
float
Random number generated.
"""
return np.random.random()
With more than one value:
import string
def random_letters():
"""
Generate and return a sequence of random letters.
Returns
-------
length : int
Length of the returned string.
letters : str
String of random letters.
"""
length = np.random.randint(1, 10)
letters = ''.join(np.random.choice(string.ascii_lowercase)
for i in range(length))
return length, letters
If the method yields its output:
def sample_values():
"""
Generate an infinite sequence of random numbers.
Yields
------
float
Random number generated.
"""
while True:
yield np.random.random()
This section is used to let users know about pandas functionality related to the one being documented.
In rare cases, if no related methods or functions can be found at all, this section can be skipped.
An obvious example would be the head() and tail() methods. As tail() does the equivalent of head()
but at the end of the Series or DataFrame instead of at the beginning, it is good to let the users know
about it.
To give an intuition on what can be considered related, here there are some examples:
– loc and iloc, as they do the same, but in one case providing indices and in the other positions
– max and min, as they do the opposite
– iterrows, itertuples and items, as a user looking for the method to iterate over columns can easily
end up at the method to iterate over rows, and vice-versa
– fillna and dropna, as both methods are used to handle missing values
– read_csv and to_csv, as they are complementary
– merge and join, as one is a generalization of the other
– astype and pandas.to_datetime, as users may be reading the documentation of astype to know
how to cast as a date, and the way to do it is with pandas.to_datetime
– where is related to numpy.where, as its functionality is based on it
When deciding what is related, you should mainly use your common sense and think about what can
be useful for the users reading the documentation, especially the less experienced ones.
When relating to other libraries (mainly numpy), use the name of the module first (not an alias like
np). If the function is in a module which is not the main one, like scipy.sparse, list the full module
(e.g. scipy.sparse.coo_matrix).
This section, like the previous ones, also has a header, “See Also” (note the capital S and A), followed
by the line with hyphens and preceded by a blank line.
After the header, we add a line for each related method or function, followed by a space, a colon,
another space, and a short description that illustrates what this method or function does, why it is
relevant in this context, and what the key differences are between the documented function and the
one being referenced. The description must also finish with a dot.
Note that in “Returns” and “Yields”, the description is located on the line below the type, but in
this section it is located on the same line, with a colon in between. If the description does not
fit on the same line, it can continue on the next ones, but it has to be indented there.
For example:
class Series:
def head(self):
"""
Return the first 5 elements of the Series.
Returns
-------
Series
Subset of the original series with the 5 first values.
See Also
--------
Series.tail : Return the last 5 elements of the Series.
Section 6: Notes
This is an optional section used for notes about the implementation of the algorithm. Or to document
technical aspects of the function behavior.
Feel free to skip it, unless you are familiar with the implementation of the algorithm, or you discover
some counter-intuitive behavior while writing the examples for the function.
This section follows the same format as the extended summary section.
Section 7: Examples
This is one of the most important sections of a docstring, despite being placed in the last position, as
people often understand concepts better with examples than with accurate explanations.
Examples in docstrings, besides illustrating the usage of the function or method, must be valid Python
code, that in a deterministic way returns the presented output, and that can be copied and run by
users.
They are presented as a session in the Python terminal. >>> is used to present code. … is used for
code continuing from the previous line. Output is presented immediately after the last line of code
generating the output (no blank lines in between). Comments describing the examples can be added
with blank lines before and after them.
The way to present examples is as follows:
1. Import required libraries (except numpy and pandas)
2. Create the data required for the example
3. Show a very basic example that gives an idea of the most common use case
4. Add examples with explanations that illustrate how the parameters can be used for extended
functionality
A simple example could be:
class Series:
def head(self, n=5):
"""
Return the first `n` elements of the Series.
Parameters
----------
n : int, default 5
Number of values to return.
Returns
-------
pandas.Series
Subset of the original series with the n first values.
See Also
--------
tail : Return the last n elements of the Series.
Examples
--------
>>> s = pd.Series(['Ant', 'Bear', 'Cow', 'Dog', 'Falcon',
... 'Lion', 'Monkey', 'Rabbit', 'Zebra'])
>>> s.head()
0 Ant
1 Bear
2 Cow
3 Dog
4 Falcon
dtype: object
With the `n` parameter, we can change the number of returned rows:
>>> s.head(n=3)
0 Ant
1 Bear
2 Cow
dtype: object
"""
return self.iloc[:n]
The examples should be as concise as possible. In cases where the complexity of the function requires
long examples, it is recommended to use blocks with bold headers. Use double stars ** to make text
bold, like in **this example**.
Code in examples is assumed to always start with these two lines which are not shown:
import numpy as np
import pandas as pd
Any other module used in the examples must be explicitly imported, one per line (as recommended
in PEP 8#imports) and avoiding aliases. Avoid excessive imports, but if needed, imports from the
standard library go first, followed by third-party libraries (like matplotlib).
When illustrating examples with a single Series use the name s, and if illustrating with a single
DataFrame use the name df. For indices, idx is the preferred name. If a set of homogeneous Series
or DataFrame is used, name them s1, s2, s3… or df1, df2, df3… If the data is not homogeneous, and
more than one structure is needed, name them with something meaningful, for example df_main and
df_to_join.
Data used in the example should be as compact as possible. The number of rows is recommended to
be around 4, but make it a number that makes sense for the specific example. For example, in the head
method it needs to be higher than 5, to show the example with the default values. If computing the
mean, we could use something like [1, 2, 3], so it is easy to see that the returned value is the mean.
For more complex examples (grouping for example), avoid using data without interpretation, like a
matrix of random numbers with columns A, B, C, D… Instead use a meaningful example, which
makes it easier to understand the concept. Unless required by the example, use names of animals and
numerical properties of them, to keep the examples consistent.
When calling the method, keyword arguments (head(n=3)) are preferred to positional arguments
(head(3)).
Good:
class Series:
def mean(self):
"""
Compute the mean of the input.
Examples
--------
>>> s = pd.Series([1, 2, 3])
>>> s.mean()
2.0
"""
pass
Examples
--------
>>> s = pd.Series([1, np.nan, 3])
>>> s.fillna(0)
[1, 0, 3]
"""
pass
def groupby_mean(self):
"""
Group by index and return mean.
Examples
--------
>>> s = pd.Series([380., 370., 24., 26],
... name='max_speed',
... index=['falcon', 'falcon', 'parrot', 'parrot'])
>>> s.groupby_mean()
index
falcon    375.0
parrot     25.0
Name: max_speed, dtype: float64
Examples
--------
>>> s = pd.Series(['Antelope', 'Lion', 'Zebra', np.nan])
>>> s.contains(pattern='a')
0 False
1 False
2 True
3 NaN
dtype: bool
**Case sensitivity**
**Missing values**
We can fill missing values in the output using the `na` parameter:
Bad:
Examples
--------
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randn(3, 3),
... columns=('a', 'b', 'c'))
>>> df.method(1)
21
>>> df.method(bar=14)
123
"""
pass
Getting the examples to pass the doctests in the validation script can sometimes be tricky. Here are some
attention points:
– Import all needed libraries (except for pandas and numpy, those are already imported as import
pandas as pd and import numpy as np) and define all variables you use in the example.
– Try to avoid using random data. However random data might be OK in some cases, like if
the function you are documenting deals with probability distributions, or if the amount of data
needed to make the function result meaningful is too much, such that creating it manually is very
cumbersome. In those cases, always use a fixed random seed to make the generated examples
predictable. Example:
>>> np.random.seed(42)
>>> df = pd.DataFrame({'normal': np.random.normal(100, 5, 20)})
– If you have a code snippet that wraps multiple lines, you need to use ‘…’ on the continued lines:
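For example (an illustrative DataFrame):
>>> df = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
...                   index=['a', 'b'],
...                   columns=['A', 'B', 'C'])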
– If you want to show a case where an exception is raised, you can do:
>>> pd.to_datetime(["712-01-01"])
Traceback (most recent call last):
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 712-01-01 00:00:00
It is essential to include the “Traceback (most recent call last):”, but for the actual error only
the error name is sufficient.
– If there is a small part of the result that can vary (e.g. a hash in an object representation), you
can use ... to represent this part.
If you want to show that s.plot() returns a matplotlib AxesSubplot object, this will fail the
doctest:
>>> s.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7efd0c0b0690>
Instead, it can be written as:
>>> s.plot()
<matplotlib.axes._subplots.AxesSubplot at ...>
Plots in examples
There are some methods in pandas returning plots. To render the plots generated by the examples in
the documentation, the .. plot:: directive exists.
To use it, place the next code after the “Examples” header as shown below. The plot will be generated
automatically when building the documentation.
class Series:
def plot(self):
"""
Generate a plot with the `Series` data.
Examples
--------
.. plot::
:context: close-figs
Sharing docstrings
Pandas has a system for sharing docstrings, with slight variations, between classes. This helps us keep
docstrings consistent, while keeping things clear for the user reading. It comes at the cost of some
complexity when writing.
Each shared docstring will have a base template with variables, like %(klass)s. The variables are filled in
later on using the Substitution decorator. Finally, docstrings can be appended to with the Appender
decorator.
In this example, we’ll create a parent docstring normally (this is like pandas.core.generic.NDFrame).
Then we’ll have two children (like pandas.core.series.Series and pandas.core.frame.DataFrame).
We’ll substitute the children’s class names in this docstring.
class Parent:
def my_function(self):
"""Apply my function to %(klass)s."""
...
class ChildA(Parent):
@Substitution(klass="ChildA")
@Appender(Parent.my_function.__doc__)
def my_function(self):
...
class ChildB(Parent):
@Substitution(klass="ChildB")
@Appender(Parent.my_function.__doc__)
def my_function(self):
...
The resulting docstrings are:
>>> print(Parent.my_function.__doc__)
Apply my function to %(klass)s.
>>> print(ChildA.my_function.__doc__)
Apply my function to ChildA.
>>> print(ChildB.my_function.__doc__)
Apply my function to ChildB.
You can substitute and append in one shot with something like:
@Appender(template % _shared_doc_kwargs)
def my_function(self):
...
where template may come from a module-level _shared_docs dictionary mapping function names
to docstrings. Wherever possible, we prefer using Appender and Substitution, since the docstring-writing
process is slightly closer to normal.
See pandas.core.generic.NDFrame.fillna for an example template, and pandas.core.series.Series.fillna
and pandas.core.frame.DataFrame.fillna for the filled versions.
• The tutorials make heavy use of the ipython directive sphinx extension. This directive lets you put
code in the documentation which will be run during the doc build. For example:
.. ipython:: python
x = 2
x**3
will be rendered as:
In [1]: x = 2
In [2]: x**3
Out[2]: 8
Almost all code examples in the docs are run (and the output saved) during the doc build. This
approach means that code examples will always be up to date, but it does make the doc building a bit
more complex.
• Our API documentation in doc/source/api.rst houses the auto-generated documentation from the
docstrings. For classes, there are a few subtleties around controlling which methods and attributes
have pages auto-generated.
We have two autosummary templates for classes.
1. _templates/autosummary/class.rst. Use this when you want to automatically generate a page
for every public method and attribute on the class. The Attributes and Methods sections will
be automatically added to the class’ rendered documentation by numpydoc. See DataFrame for
an example.
2. _templates/autosummary/class_without_autosummary. Use this when you want to pick a
subset of methods / attributes to auto-generate pages for. When using this template, you should
include an Attributes and Methods section in the class docstring. See CategoricalIndex for
an example.
Every method should be included in a toctree in api.rst, else Sphinx will emit a warning.
Note: The .rst files are used to automatically generate Markdown and HTML versions of the docs. For
this reason, please do not edit CONTRIBUTING.md directly, but instead make any changes to doc/source/
development/contributing.rst. Then, to generate CONTRIBUTING.md, use pandoc with the following
command:
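An invocation along these lines should work (the exact pandoc options used by the project are an assumption here):
pandoc doc/source/development/contributing.rst -t gfm -o CONTRIBUTING.md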
The utility script scripts/validate_docstrings.py can be used to get a csv summary of the API docu-
mentation, and also to validate common errors in the docstring of a specific class, function or method. The
summary also compares the list of methods documented in doc/source/api.rst (which is used to gener-
ate the API Reference page) and the actual public methods. This will identify methods documented in
doc/source/api.rst that are not actually class methods, and existing methods that are not documented
in doc/source/api.rst.
When improving a single function or method’s docstring, it is not necessarily needed to build the full
documentation (see next section). However, there is a script that checks a docstring (for example for the
DataFrame.mean method):
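For example:
python scripts/validate_docstrings.py pandas.DataFrame.mean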
This script will indicate some formatting errors if present, and will also run and test the examples included
in the docstring. Check the pandas docstring guide for a detailed guide on how to format the docstring.
The examples in the docstring (‘doctests’) must be valid Python code, that in a deterministic way returns
the presented output, and that can be copied and run by users. This can be checked with the script above,
and is also tested on Travis. A failing doctest will be a blocker for merging a PR. Check the examples section
in the docstring guide for some tips and tricks to get the doctests passing.
When doing a PR with a docstring update, it is good to post the output of the validation script in a comment
on GitHub.
Requirements
First, you need to have a development environment to be able to build pandas (see the docs on creating a
development environment above).
So how do you build the docs? Navigate to your local doc/ directory in the console and run:
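For example:
python make.py html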
Then you can find the HTML output in the folder doc/build/html/.
The first time you build the docs, it will take quite a while because it has to run all the code examples and
build all the generated docstring pages. In subsequent invocations, sphinx will try to only build the pages
that have been modified.
If you want to do a full clean build, do:
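For example:
python make.py clean
python make.py html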
You can tell make.py to compile only a single section of the docs, greatly reducing the turn-around time for
checking your changes.
# compile the docs with only a single section, relative to the "source" folder.
# For example, compiling only this guide (doc/source/development/contributing.rst)
python make.py clean
python make.py --single development/contributing.rst
For comparison, a full documentation build may take 15 minutes, but a single section may take 15 seconds.
Subsequent builds, which only process portions you have changed, will be faster.
You can also specify to use multiple cores to speed up the documentation build:
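For example (the flag name is an assumption; check python make.py --help for the exact option):
python make.py html --num-jobs 4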
Open the following file in a web browser to see the full documentation you just built:
doc/build/html/index.html
And you’ll have the satisfaction of seeing your new and improved documentation!
When pull requests are merged into the pandas master branch, the main parts of the documentation are
also built by Travis-CI. These docs are then hosted here, see also the Continuous Integration section.
Code Base:
• Code standards
• Optional dependencies
– C (cpplint)
– Python (PEP8 / black)
– Import formatting
– Backwards compatibility
• Testing with continuous integration
• Test-driven development/code writing
– Writing tests
– Transitioning to pytest
– Using pytest
– Using hypothesis
– Testing warnings
• Running the test suite
• Running the performance test suite
• Documenting your code
Code standards
Writing good code is not just about what you write. It is also about how you write it. During Continuous
Integration testing, several tools will be run to check your code for stylistic errors. Generating any warnings
will cause the test to fail. Thus, good style is a requirement for submitting code to pandas.
There is a tool in pandas to help contributors verify their changes before contributing them to the project:
./ci/code_checks.sh
The script verifies the linting of code files, looks for common mistake patterns (like missing spaces around
sphinx directives that prevent the documentation from rendering properly), and also validates the doctests.
It is possible to run the checks independently by using the parameters lint, patterns and doctests (e.g.
./ci/code_checks.sh lint).
In addition, because a lot of people use our library, it is important that we do not make sudden changes to
the code that could have the potential to break a lot of user code; that is, we need it to be as
backwards compatible as possible to avoid mass breakages.
Additional standards are outlined on the code style wiki page.
Optional dependencies
Optional dependencies (e.g. matplotlib) should be imported with the private helper pandas.compat.
_optional.import_optional_dependency. This ensures a consistent error message when the dependency
is not met.
All methods using an optional dependency should include a test asserting that an ImportError is raised
when the optional dependency is not found. This test should be skipped if the library is present.
All optional dependencies should be documented in Optional dependencies and the minimum required version
should be set in the pandas.compat._optional.VERSIONS dict.
C (cpplint)
pandas uses the Google standard. Google provides an open source style checker called cpplint, but we use
a fork of it that can be found here. Here are some of the more common cpplint issues:
• we restrict line-length to 80 characters to promote readability
• every header file must include a header guard to avoid name collisions if re-included
Continuous Integration will run the cpplint tool and report any stylistic errors in your code. Therefore, it is
helpful before submitting code to run the check yourself:
To make your commits compliant with this standard, you can install the ClangFormat tool, which can be
downloaded here. To configure, in your home directory, run the following command:
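Something like the following creates a configuration file based on the Google style (using clang-format's standard -style and -dump-config options):
clang-format -style=google -dump-config > .clang-format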
Then modify the file to ensure that any indentation width parameters are at least four. Once configured,
you can run the tool as follows:
clang-format modified-c-file
This will output what your file will look like if the changes are made, and to apply them, run the following
command:
clang-format -i modified-c-file
To run the tool on an entire directory, you can run the following analogous commands:
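For example (illustrative paths):
clang-format modified-c-directory/*.c modified-c-directory/*.h
clang-format -i modified-c-directory/*.c modified-c-directory/*.h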
Do note that this tool is best-effort, meaning that it will try to correct as many errors as possible, but it
may not correct all of them. Thus, it is recommended that you run cpplint to double check and make any
other style fixes manually.
pandas follows the PEP8 standard and uses Black and Flake8 to ensure a consistent code format throughout
the project.
Continuous Integration will run those tools and report any stylistic errors in your code. Therefore, it is
helpful before submitting code to run the check yourself:
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Running black pandas will auto-format your code. Additionally, many editors have plugins that will apply
black as you edit files.
Optionally, you may wish to set up pre-commit hooks to automatically run black and flake8 when you
make a git commit. This can be done by installing pre-commit (for example with pip install pre-commit)
and then running:
pre-commit install
from the root of the pandas repository. Now black and flake8 will be run each time you commit changes.
You can skip these checks with git commit --no-verify.
This command will catch any stylistic errors in your changes specifically, but beware it may not catch
all of them. For example, if you delete the only usage of an imported function, it is stylistically incorrect to
import an unused function. However, style-checking the diff will not catch this because the actual import is
not part of the diff. Thus, for completeness, you should run this command, though it will take longer:
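For example, piping the names of all changed Python files to flake8:
git diff upstream/master --name-only -- "*.py" | xargs -r flake8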
Note that on OSX, the -r flag is not available, so you have to omit it and run this slightly modified command:
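That is:
git diff upstream/master --name-only -- "*.py" | xargs flake8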
Windows does not support the xargs command (unless installed for example via the MinGW toolchain),
but one can imitate the behaviour as follows:
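One possible cmd.exe loop (illustrative) is:
for /f %i in ('git diff upstream/master --name-only -- "*.py"') do flake8 %i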
This will get all the files being changed by the PR (and ending with .py), and run flake8 on them, one
after the other.
Import formatting
pandas uses isort to standardise import ordering. You can run, for example:
isort pandas/io/pytables.py
to automatically format imports correctly. This will modify your local copy of the files.
The --recursive flag can be passed to sort all files in a directory.
You can then verify the changes look OK, then git commit and push them.
Backwards compatibility
Please try to maintain backward compatibility. pandas has lots of users with lots of existing code, so don’t
break it if at all possible. If you think breakage is required, clearly state why as part of the pull request.
Also, be careful when changing method signatures and add deprecation warnings where needed. Also, add
the deprecated sphinx directive to the deprecated functions or methods.
If a function with the same arguments as the one being deprecated exists, you can use pandas.util.
_decorators.deprecate:
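For example (a sketch; check the helper's signature in pandas.util._decorators for the exact arguments):
from pandas.util._decorators import deprecate

# old_func becomes a thin wrapper that warns and calls new_func
old_func = deprecate('old_func', new_func, '0.21.0')
Otherwise, the deprecation warning can be issued manually, as in the following example: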
import warnings
def old_func():
"""Summary of the function.
.. deprecated:: 0.21.0
Use new_func instead.
"""
warnings.warn('Use new_func instead.', FutureWarning, stacklevel=2)
new_func()
def new_func():
pass
The pandas test suite will run automatically on Travis-CI and Azure Pipelines continuous integration services,
once your pull request is submitted. However, if you wish to run the test suite on a branch prior to submitting
the pull request, then the continuous integration services need to be hooked to your GitHub repository.
Instructions are here for Travis-CI and Azure Pipelines.
A pull-request will be considered for merging when you have an all ‘green’ build. If any tests are failing,
then you will get a red ‘X’, where you can click through to see the individual failed tests. This is an example
of a green build.
Note: Each time you push to your fork, a new run of the tests will be triggered on the CI. You can
enable the auto-cancel feature, which removes any non-currently-running tests for that same pull-request,
for Travis-CI here.
pandas is serious about testing and strongly encourages contributors to embrace test-driven development
(TDD). This development process “relies on the repetition of a very short development cycle: first the
developer writes an (initially failing) automated test case that defines a desired improvement or new function,
then produces the minimum amount of code to pass that test.” So, before actually writing any code, you
should write your tests. Often the test can be taken from the original GitHub issue. However, it is always
worth considering additional use cases and writing corresponding tests.
Adding tests is one of the most common requests after code is pushed to pandas. Therefore, it is worth
getting in the habit of writing tests ahead of time so this is never an issue.
Like many packages, pandas uses pytest and the convenient extensions in numpy.testing.
Writing tests
All tests should go into the tests subdirectory of the specific package. This folder contains many current
examples of tests, and we suggest looking to these for inspiration. If your test requires working with files or
network connectivity, there is more information on the testing page of the wiki.
The pandas.util.testing module has many special assert functions that make it easier to make state-
ments about whether Series or DataFrame objects are equivalent. The easiest way to verify that your code
is correct is to explicitly construct the result you expect, then compare the actual result to the expected
correct result:
def test_pivot(self):
data = {
'index' : ['A', 'B', 'C', 'C', 'B', 'A'],
'columns' : ['One', 'One', 'One', 'Two', 'Two', 'Two'],
'values' : [1., 2., 3., 3., 2., 1.]
}
frame = DataFrame(data)
pivoted = frame.pivot(index='index', columns='columns', values='values')
expected = DataFrame({
'One' : {'A' : 1., 'B' : 2., 'C' : 3.},
'Two' : {'A' : 1., 'B' : 2., 'C' : 3.}
})
assert_frame_equal(pivoted, expected)
Transitioning to pytest
pandas’ existing test structure is mostly class-based, meaning that you will typically find tests wrapped in
a class.
class TestReallyCoolFeature:
pass
Going forward, we are moving to a more functional style using the pytest framework, which offers a richer
testing framework that will facilitate testing and developing. Thus, instead of writing test classes, we will
write test functions like this:
def test_really_cool_feature():
pass
Using pytest
Here is an example of a self-contained set of tests that illustrate multiple features that we like to use.
• functional style: tests are like test_* and only take arguments that are either fixtures or parameters
• pytest.mark can be used to set metadata on test functions, e.g. skip or xfail.
• using parametrize: allow testing of multiple cases
• to set a mark on a parameter, pytest.param(..., marks=...) syntax should be used
• fixture, code for object construction, on a per-test basis
• using bare assert for scalars and truth-testing
• tm.assert_series_equal (and its counterpart tm.assert_frame_equal), for pandas object comparisons.
• the typical pattern of constructing an expected and comparing versus the result
We would name this file test_cool_feature.py and put in an appropriate place in the pandas/tests/
structure.
import pytest
import numpy as np
import pandas as pd
@pytest.mark.parametrize(
'dtype', ['float32', pytest.param('int16', marks=pytest.mark.skip),
pytest.param('int32', marks=pytest.mark.xfail(
reason='to show how it works'))])
def test_mark(dtype):
assert str(np.dtype(dtype)) == 'float32'
@pytest.fixture
def series():
return pd.Series([1, 2, 3])
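The file would also contain the parametrized test_dtypes and test_series tests referenced in the output below; a sketch (not necessarily the exact code) could be:
import pandas.util.testing as tm

@pytest.mark.parametrize('dtype', ['int8', 'int16', 'int32', 'int64'])
def test_dtypes(dtype):
    assert str(np.dtype(dtype)) == dtype

@pytest.mark.parametrize('dtype', ['int8', 'int16', 'int32', 'int64'])
def test_series(series, dtype):
    # cast the fixture Series and compare against an explicitly constructed result
    result = series.astype(dtype)
    assert result.dtype == dtype
    tm.assert_series_equal(result, pd.Series([1, 2, 3], dtype=dtype))
A verbose pytest run of the module would then report results along these lines: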
tester.py::test_dtypes[int8] PASSED
tester.py::test_dtypes[int16] PASSED
tester.py::test_dtypes[int32] PASSED
tester.py::test_dtypes[int64] PASSED
tester.py::test_mark[float32] PASSED
tester.py::test_mark[int16] SKIPPED
tester.py::test_mark[int32] xfail
tester.py::test_series[int8] PASSED
tester.py::test_series[int16] PASSED
tester.py::test_series[int32] PASSED
tester.py::test_series[int64] PASSED
Tests that we have parametrized are now accessible via the test name; for example, we could run these with
-k int8 to sub-select only those tests which match int8:
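For example:
pytest -v test_cool_feature.py -k int8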
test_cool_feature.py::test_dtypes[int8] PASSED
test_cool_feature.py::test_series[int8] PASSED
Using hypothesis
Hypothesis is a library for property-based testing. Instead of explicitly parametrizing a test, you can describe
all valid inputs and let Hypothesis try to find a failing input. Even better, no matter how many random
examples it tries, Hypothesis always reports a single minimal counterexample to your assertions - often an
example that you would never have thought to test.
See Getting Started with Hypothesis for more of an introduction, then refer to the Hypothesis documentation
for details.
import json
from hypothesis import given, strategies as st
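# `any_json_value` is referenced below but not defined in this snippet; an
# illustrative strategy describing arbitrary JSON-serializable values is:
any_json_value = st.deferred(lambda: st.one_of(
    st.none(), st.booleans(), st.floats(allow_nan=False), st.text(),
    st.lists(any_json_value),
    st.dictionaries(st.text(), any_json_value)))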
@given(value=any_json_value)
def test_json_roundtrip(value):
result = json.loads(json.dumps(value))
assert value == result
This test shows off several useful features of Hypothesis, as well as demonstrating a good use-case: checking
properties that should hold over a large or complicated domain of inputs.
To keep the Pandas test suite running quickly, parametrized tests are preferred if the inputs or logic are
simple, with Hypothesis tests reserved for cases with complex logic or where there are too many combinations
of options or subtle interactions to test (or think of!) all of them.
Testing warnings
By default, one of pandas’ CI workers will fail if any unhandled warnings are emitted.
If your change involves checking that a warning is actually emitted, use
tm.assert_produces_warning(ExpectedWarning).
import pandas.util.testing as tm
df = pd.DataFrame()
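# A sketch of how the assertion is used: check that the operation emits the
# expected warning (df.some_operation() is a hypothetical method).
with tm.assert_produces_warning(FutureWarning):
    df.some_operation()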
We prefer this to the pytest.warns context manager because ours checks that the warning’s stacklevel is
set correctly. The stacklevel is what ensures the user’s file name and line number are printed in the warning,
rather than something internal to pandas. It represents the number of function calls from user code (e.g.
df.some_operation()) to the function that actually emits the warning. Our linter will fail the build if you
use pytest.warns in a test.
If you have a test that would emit a warning, but you aren’t actually testing the warning itself (say because
it’s going to be removed in the future, or because we’re matching a 3rd-party library’s behavior), then use
pytest.mark.filterwarnings to ignore the warning.
@pytest.mark.filterwarnings("ignore:msg:category")
def test_thing(self):
...
If the test generates a warning of class category whose message starts with msg, the warning will be ignored
and the test will pass.
If you need finer-grained control, you can use Python’s usual warnings module to control whether a warning
is ignored / raised at different places within a single test.
with warnings.catch_warnings():
warnings.simplefilter("ignore", FutureWarning)
# Or use warnings.filterwarnings(...)
The tests can then be run directly inside your Git clone (without having to install pandas) by typing:
pytest pandas
The test suite is exhaustive and takes around 20 minutes to run. Often it is worth running only a subset of
tests first around your changes before running the entire suite.
The easiest way to do this is with:
pytest pandas/tests/[test-module].py
pytest pandas/tests/[test-module].py::[TestClass]
pytest pandas/tests/[test-module].py::[TestClass]::[test_method]
Using pytest-xdist, one can speed up local testing on multicore machines. To use this feature, you will need
to install pytest-xdist via:
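pip install pytest-xdist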
Two scripts are provided to assist with this. These scripts distribute testing across 4 threads.
test_fast.sh
test_fast.bat
This can significantly reduce the time it takes to locally run tests before submitting a pull request.
For more, see the pytest documentation.
New in version 0.20.0.
Furthermore one can run
pd.test()
with an imported pandas to run the tests similarly.
Performance matters and it is worth considering whether your code has introduced performance regressions.
pandas is in the process of migrating to asv benchmarks to enable easy monitoring of the performance of
critical pandas operations. These benchmarks are all found in the pandas/asv_bench directory. asv supports
both python2 and python3.
To use all features of asv, you will need either conda or virtualenv. For more details please check the asv
installation webpage.
To install asv:
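asv is distributed on PyPI, so for example:
pip install asv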
If you need to run a benchmark, change your directory to asv_bench/ and run:
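A typical invocation (illustrative; see the asv documentation for the full set of options) is:
asv continuous -f 1.1 upstream/master HEAD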
You can replace HEAD with the name of the branch you are working on, and report benchmarks that changed
by more than 10%. The command uses conda by default for creating the benchmark environments. If you
want to use virtualenv instead, write:
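For example:
asv continuous -f 1.1 -E virtualenv upstream/master HEAD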
The -E virtualenv option should be added to all asv commands that run benchmarks. The default value
is defined in asv.conf.json.
Running the full benchmark suite can take up to one hour and use up to 3GB of RAM. Usually it is sufficient to
paste only a subset of the results into the pull request to show that the committed changes do not cause
unexpected performance regressions. You can run specific benchmarks using the -b flag, which takes a regular
expression. For example, this will only run tests from a pandas/asv_bench/benchmarks/groupby.py file:
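Something like:
asv continuous -f 1.1 upstream/master HEAD -b ^groupby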
If you want to only run a specific group of tests from a file, you can do it using . as a separator. For example:
This will display stderr from the benchmarks, and use your local python that comes from your $PATH.
Information on how to write a benchmark and how to use asv can be found in the asv documentation.
Changes should be reflected in the release notes located in doc/source/whatsnew/vx.y.z.rst. This file
contains an ongoing change log for each release. Add an entry to this file to document your fix, enhancement
or (unavoidable) breaking change. Make sure to include the GitHub issue number when adding your entry
(using :issue:`1234` where 1234 is the issue/pull request number).
If your code is an enhancement, it is most likely necessary to add usage examples to the existing documentation.
This can be done following the section regarding documentation above. Further, to let users know
when this feature was added, the versionadded directive is used. The sphinx syntax for that is:
.. versionadded:: 0.21.0
This will put the text New in version 0.21.0 wherever you put the sphinx directive. This should also be put
in the docstring when adding a new function or method (example) or a new keyword argument (example).
Keep style fixes to a separate commit to make your pull request more readable.
Once you’ve made changes, you can see them by typing:
git status
If you have created a new file, it is not being tracked by git. Add it by typing:
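git add path/to/file-you-added.py
Running git status again should then give something like: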
# On branch shiny-new-feature
#
# modified: /relative/path/to/file-you-added.py
#
Finally, commit your changes to your local repository with an explanatory message. Pandas uses a convention
for commit message prefixes and layout. Here are some common prefixes along with general guidelines for
when to use them:
• ENH: Enhancement, new functionality
• BUG: Bug fix
• DOC: Additions/updates to documentation
• TST: Additions/updates to tests
• BLD: Updates to the build process/scripts
• PERF: Performance improvement
• CLN: Code cleanup
The following defines how a commit message should be structured. Please reference the relevant GitHub
issues in your commit message using GH1234 or #1234. Either style is fine, but the former is generally
preferred:
• a subject line with < 80 chars.
• One blank line.
• Optionally, a commit message body.
Now you can commit your changes in your local repository:
git commit -m "your commit message goes here"
When you want your changes to appear publicly on your GitHub page, push your forked feature branch’s
commits:
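For example:
git push origin shiny-new-feature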
Here origin is the default name given to your remote repository on GitHub. You can see the remote
repositories:
git remote -v
If you added the upstream repository as described above you will see something like:
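It should look something like this (with yourname being your GitHub username):
origin  git@github.com:yourname/pandas.git (fetch)
origin  git@github.com:yourname/pandas.git (push)
upstream        git://github.com/pandas-dev/pandas.git (fetch)
upstream        git://github.com/pandas-dev/pandas.git (push)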
Now your code is on GitHub, but it is not yet a part of the pandas project. For that to happen, a pull
request needs to be submitted on GitHub.
When you’re ready to ask for a code review, file a pull request. Before you do, once again make sure that
you have followed all the guidelines outlined in this document regarding code style, tests, performance tests,
and documentation. You should also double check your branch changes against the branch it was based on:
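For example:
git fetch upstream
git diff upstream/master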
If everything looks good, you are ready to make a pull request. A pull request is how code from a local
repository becomes available to the GitHub community and can be looked at and eventually merged into
the master version. This pull request and its associated changes will eventually be committed to the master
branch and available in the next release. To submit a pull request:
1. Navigate to your repository on GitHub
2. Click on the Pull Request button
3. You can then click on Commits and Files Changed to make sure everything looks okay one last time
4. Write a description of your changes in the Preview Discussion tab
5. Click Send Pull Request.
This request then goes to the repository maintainers, and they will review the code.
Based on the review you get on your pull request, you will probably need to make some changes to the code.
In that case, you can make them in your branch, add a new commit to that branch, push it to GitHub, and
the pull request will be automatically updated. Pushing them to GitHub again is done by:
This will automatically update your pull request with the latest code and restart the Continuous Integration
tests.
Another reason you might need to update your pull request is to solve conflicts with changes that have been
merged into the master branch since you opened your pull request.
To do this, you need to “merge upstream master” in your branch:
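For example:
git checkout shiny-new-feature
git fetch upstream
git merge upstream/master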
If there are no conflicts (or they could be fixed automatically), a file with a default commit message will
open, and you can simply save and quit this file.
If there are merge conflicts, you need to solve those conflicts. See, for example,
https://help.github.com/articles/resolving-a-merge-conflict-using-the-command-line/ for an explanation of how
to do this. Once the conflicts are merged and the files where the conflicts were solved are added, you can run
git commit to save those fixes.
If you have uncommitted changes at the moment you want to update the branch with master, you will need
to stash them prior to updating (see the stash docs). This will effectively store your changes and they can
be reapplied after updating.
After the feature branch has been updated locally, you can now update your pull request by pushing to the
branch on GitHub:
Once your feature branch is accepted into upstream, you’ll probably want to get rid of the branch. First,
merge upstream master into your branch so git knows it is safe to delete your branch:
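For example:
git fetch upstream
git checkout master
git merge upstream/master
Then you can delete the local feature branch:
git branch -d shiny-new-feature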
Make sure you use a lower-case -d, or else git won’t warn you if your feature branch has not actually been
merged.
The branch will still exist on GitHub, so to delete it there do:
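For example:
git push origin --delete shiny-new-feature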
7.2 Internals
This section will provide a look into some of pandas internals. It’s primarily intended for developers of
pandas itself.
7.2.1 Indexing
In pandas there are a few objects implemented which can serve as valid containers for the axis labels:
• Index: the generic “ordered set” object, an ndarray of object dtype assuming nothing about its contents.
The labels must be hashable (and likely immutable) and unique. Populates a dict of label to
location in Cython to do O(1) lookups.
• Int64Index: a version of Index highly optimized for 64-bit integer data, such as time stamps
• Float64Index: a version of Index highly optimized for 64-bit float data
• MultiIndex: the standard hierarchical index object
• DatetimeIndex: An Index object with Timestamp boxed elements (impl are the int64 values)
• TimedeltaIndex: An Index object with Timedelta boxed elements (impl are the int64 values)
• PeriodIndex: An Index object with Period elements
There are functions that make the creation of a regular index easy:
• date_range: fixed frequency date range generated from a time rule or DateOffset. An ndarray of
Python datetime objects
• period_range: fixed frequency date range generated from a time rule or DateOffset. An ndarray of
Period objects, representing timespans
The motivation for having an Index class in the first place was to enable different implementations of
indexing. This means that it’s possible for you, the user, to implement a custom Index subclass that may
be better suited to a particular application than the ones provided in pandas.
From an internal implementation point of view, the relevant methods that an Index must define are one or
more of the following (depending on how incompatible the new object internals are with the Index functions):
• get_loc: returns an “indexer” (an integer, or in some cases a slice object) for a label
• slice_locs: returns the “range” to slice between two labels
• get_indexer: Computes the indexing vector for reindexing / data alignment purposes. See the source
/ docstrings for more on this
• get_indexer_non_unique: Computes the indexing vector for reindexing / data alignment purposes
when the index is non-unique. See the source / docstrings for more on this
• reindex: Does any pre-conversion of the input index then calls get_indexer
• union, intersection: computes the union or intersection of two Index objects
• insert: Inserts a new label into an Index, yielding a new object
• delete: Delete a label, yielding a new object
• drop: Deletes a set of labels
• take: Analogous to ndarray.take
MultiIndex
Internally, the MultiIndex consists of a few things: the levels, the integer codes (until version 0.24 named
labels), and the level names:
In [1]: index = pd.MultiIndex.from_product([range(3), ['one', 'two']],
   ...:                                     names=['first', 'second'])
   ...:
In [2]: index
In [3]: index.levels
In [4]: index.codes
In [5]: index.names
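Illustratively (the exact reprs vary by pandas version), for the index above levels is
FrozenList([[0, 1, 2], ['one', 'two']]), codes is FrozenList([[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]]),
and names is FrozenList(['first', 'second']).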
Values
Pandas extends NumPy’s type system with custom types, like Categorical or datetimes with a timezone,
so we have multiple notions of “values”. For 1-D containers (Index classes and Series) we have the following
convention:
• cls._ndarray_values is always a NumPy ndarray. Ideally, _ndarray_values is cheap to compute.
For example, for a Categorical, this returns the codes, not the array of objects.
• cls._values refers to the “best possible” array. This could be an ndarray, ExtensionArray, or an
Index subclass (note: we’re in the process of removing the index subclasses here so that it’s always an
ndarray or ExtensionArray).
So, for example, Series[category]._values is a Categorical, while Series[category].
_ndarray_values is the underlying codes.
This section has been moved to Subclassing pandas data structures.
While pandas provides a rich set of methods, containers, and data types, your needs may not be fully
satisfied. pandas offers a few options for extending its functionality.
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("geo")
class GeoAccessor:
    def __init__(self, pandas_obj):
        self._validate(pandas_obj)
        self._obj = pandas_obj

    @staticmethod
    def _validate(obj):
        # verify there is a column latitude and a column longitude
        if 'latitude' not in obj.columns or 'longitude' not in obj.columns:
            raise AttributeError("Must have 'latitude' and 'longitude'.")

    @property
    def center(self):
        # return the geographic center point of this DataFrame
        lat = self._obj.latitude
        lon = self._obj.longitude
        return (float(lon.mean()), float(lat.mean()))

    def plot(self):
        # plot this array's data on a map, e.g., using Cartopy
        pass
Now users can access your methods using the geo namespace:
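The usage example that belongs here did not survive extraction; a small sketch of what it would show (the values follow directly from the center definition above):

import pandas as pd

df = pd.DataFrame({'longitude': [0.0, 10.0],
                   'latitude': [0.0, 20.0]})

df.geo.center   # (5.0, 10.0): mean longitude, mean latitude
df.geo.plot()   # would plot these coordinates on a map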
This can be a convenient way to extend pandas objects without subclassing them. If you write a custom
accessor, make a pull request adding it to our ecosystem page.
We highly recommend validating the data in your accessor’s __init__. In our GeoAccessor, we validate
that the data contains the expected columns, raising an AttributeError when the validation fails. For a
Series accessor, you should validate the dtype if the accessor applies only to certain dtypes.
Pandas defines an interface for implementing data types and arrays that extend NumPy’s type system.
Pandas itself uses the extension system for some types that aren’t built into NumPy (categorical, period,
interval, datetime with timezone).
Libraries can define a custom array and data type. When pandas encounters these objects, they will be
handled properly (i.e. not converted to an ndarray of objects). Many methods like pandas.isna() will
dispatch to the extension type’s implementation.
If you’re building a library that implements the interface, please publicize it on Extension data types.
The interface consists of two classes:
• ExtensionDtype
• ExtensionArray
The ExtensionArray class provides all the array-like functionality. ExtensionArrays are limited to 1 dimension. An ExtensionArray is linked to an ExtensionDtype via the dtype attribute.
Pandas makes no restrictions on how an extension array is created via its __new__ or __init__, and puts
no restrictions on how you store your data. We do require that your array be convertible to a NumPy array,
even if this is relatively expensive (as it is for Categorical).
They may be backed by none, one, or many NumPy arrays. For example, pandas.Categorical is an
extension array backed by two arrays, one for codes and one for categories. An array of IPv6 addresses may
be backed by a NumPy structured array with two fields, one for the lower 64 bits and one for the upper 64
bits. Or they may be backed by some other storage type, like Python lists.
See the extension array source for the interface definition. The docstrings and comments contain guidance
for properly implementing the interface.
By default, there are no operators defined for the class ExtensionArray. There are two approaches for
providing operator support for your ExtensionArray:
1. Define each of the operators on your ExtensionArray subclass.
2. Use an operator implementation from pandas that depends on operators that are already defined on
the underlying elements (scalars) of the ExtensionArray.
Note: Regardless of the approach, you may want to set __array_priority__ if you want your implemen-
tation to be called when involved in binary operations with NumPy arrays.
For the first approach, you define selected operators, e.g., __add__, __le__, etc. that you want your
ExtensionArray subclass to support.
The second approach assumes that the underlying elements (i.e., scalar type) of the ExtensionArray have the
individual operators already defined. In other words, if your ExtensionArray named MyExtensionArray
is implemented so that each element is an instance of the class MyExtensionElement, then if the opera-
tors are defined for MyExtensionElement, the second approach will automatically define the operators for
MyExtensionArray.
A mixin class, ExtensionScalarOpsMixin, supports this second approach. If you are developing an ExtensionArray
subclass, for example MyExtensionArray, you can simply include ExtensionScalarOpsMixin as a parent class of
MyExtensionArray, and then call the methods _add_arithmetic_ops() and/or _add_comparison_ops()
to hook the operators into your MyExtensionArray class, as follows:
class MyExtensionArray(ExtensionArray, ExtensionScalarOpsMixin):
    pass


MyExtensionArray._add_arithmetic_ops()
MyExtensionArray._add_comparison_ops()
Note: Since pandas automatically calls the underlying operator on each element one-by-one, this
might not be as performant as implementing your own version of the associated operators directly on the
ExtensionArray.
For arithmetic operations, this implementation will try to reconstruct a new ExtensionArray with the result
of the element-wise operation. Whether or not that succeeds depends on whether the operation returns a
result that’s valid for the ExtensionArray. If an ExtensionArray cannot be reconstructed, an ndarray
containing the scalars is returned instead.
For ease of implementation and consistency with operations between pandas and NumPy ndarrays, we
recommend not handling Series and Indexes in your binary ops. Instead, you should detect these cases and
return NotImplemented. When pandas encounters an operation like op(Series, ExtensionArray), pandas
will
1. unbox the array from the Series (Series.array)
2. call result = op(values, ExtensionArray)
3. re-box the result in a Series
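A minimal sketch of that recommendation inside an ExtensionArray operator (the class name and the elided element-wise logic are illustrative, not part of the pandas API):

import pandas as pd
from pandas.api.extensions import ExtensionArray

class MyArray(ExtensionArray):
    # ... required interface methods omitted ...

    def __add__(self, other):
        # Defer to pandas for containers: pandas will unbox the Series/Index
        # and call back into this method with the unwrapped values.
        if isinstance(other, (pd.Series, pd.Index, pd.DataFrame)):
            return NotImplemented
        ...  # element-wise addition for scalars and arrays goes here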
Series implements __array_ufunc__. As part of the implementation, pandas unboxes the ExtensionArray
from the Series, applies the ufunc, and re-boxes it if necessary.
If applicable, we highly recommend that you implement __array_ufunc__ in your extension array to avoid
coercion to an ndarray. See the numpy documentation for an example.
As part of your implementation, we require that you defer to pandas when a pandas container (Series,
DataFrame, Index) is detected in inputs. If any of those is present, you should return NotImplemented.
Pandas will take care of unboxing the array from the container and re-calling the ufunc with the unwrapped
input.
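A sketch of the defer-to-pandas check at the top of such an __array_ufunc__ implementation (illustrative only; the remainder of the method depends on how your array stores its data):

import numpy as np
import pandas as pd

class MyArray2:
    # A stand-in for your ExtensionArray subclass.

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # If any pandas container is involved, return NotImplemented so that
        # pandas can unbox it and re-call the ufunc with the unwrapped inputs.
        if any(isinstance(x, (pd.Series, pd.Index, pd.DataFrame)) for x in inputs):
            return NotImplemented
        ...  # apply getattr(ufunc, method) to the unwrapped inputs and re-wrap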
We provide a test suite for ensuring that your extension arrays satisfy the expected behavior. To use the test
suite, you must provide several pytest fixtures and inherit from the base test class. The required fixtures are
found in https://github.com/pandas-dev/pandas/blob/master/pandas/tests/extension/conftest.py.
To use a test, subclass it:

from pandas.tests.extension import base


class TestConstructors(base.BaseConstructorsTests):
    pass
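A sketch of what two of the required fixtures might look like in your conftest.py (MyDtype and MyExtensionArray are hypothetical placeholders for your own classes; the full fixture list is in the conftest linked above):

import pytest

from my_library import MyDtype, MyExtensionArray  # hypothetical imports

@pytest.fixture
def dtype():
    # An instance of your ExtensionDtype, used throughout the base tests.
    return MyDtype()

@pytest.fixture
def data():
    # A length-100 array of your extension type, used throughout the base tests.
    return MyExtensionArray(range(100))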
Warning: There are some easier alternatives before considering subclassing pandas data structures.
1. Extensible method chains with pipe
2. Use composition. See here.
3. Extending by registering an accessor
4. Extending by extension type
This section describes how to subclass pandas data structures to meet more specific needs. There are two
points that need attention:
1. Override constructor properties.
2. Define original properties
Each data structure has several constructor properties for returning a new data structure as the result of an
operation. By overriding these properties, you can retain subclasses through pandas data manipulations.
There are 3 constructor properties to be defined:
• _constructor: Used when a manipulation result has the same dimensions as the original.
• _constructor_sliced: Used when a manipulation result has one lower dimension than the original,
such as slicing a single column of a DataFrame.
• _constructor_expanddim: Used when a manipulation result has one higher dimension than the original,
such as Series.to_frame().
The following table shows how pandas data structures define constructor properties by default.
The example below shows how to define SubclassedSeries and SubclassedDataFrame, overriding the constructor
properties.
class SubclassedSeries(pd.Series):

    @property
    def _constructor(self):
        return SubclassedSeries

    @property
    def _constructor_expanddim(self):
        return SubclassedDataFrame


class SubclassedDataFrame(pd.DataFrame):

    @property
    def _constructor(self):
        return SubclassedDataFrame

    @property
    def _constructor_sliced(self):
        return SubclassedSeries
>>> df = SubclassedDataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
>>> type(df)
<class '__main__.SubclassedDataFrame'>

>>> sliced1 = df[['A', 'B']]
>>> type(sliced1)
<class '__main__.SubclassedDataFrame'>

>>> sliced2 = df['A']
>>> type(sliced2)
<class '__main__.SubclassedSeries'>
To let original data structures have additional properties, you should let pandas know what properties are
added. pandas maps unknown properties to data names overriding __getattribute__. Defining original
properties can be done in one of 2 ways:
1. Define _internal_names and _internal_names_set for temporary properties which WILL NOT be
passed to manipulation results.
2. Define _metadata for normal properties which will be passed to manipulation results.
Below is an example defining two original properties, “internal_cache” as a temporary property and
“added_property” as a normal property:
class SubclassedDataFrame2(pd.DataFrame):

    # temporary properties
    _internal_names = pd.DataFrame._internal_names + ['internal_cache']
    _internal_names_set = set(_internal_names)

    # normal properties
    _metadata = ['added_property']
>>> df = SubclassedDataFrame2({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
>>> df
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9

>>> df.internal_cache = 'cached'
>>> df.added_property = 'property'

>>> df.internal_cache
cached
>>> df.added_property
property
Starting in 0.25 pandas can be extended with third-party plotting backends. The main idea is letting users
select a plotting backend other than the default one based on Matplotlib. For example:
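The example block that belongs here did not survive extraction; selecting a backend looks roughly like this (the module name is a placeholder):

import pandas as pd

pd.set_option('plotting.backend', 'my_plotting_module')  # hypothetical backend module
pd.Series([1, 2, 3]).plot()  # now dispatched to my_plotting_module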
The backend module can then use other visualization tools (Bokeh, Altair,…) to generate the plots.
Libraries implementing the plotting backend should use entry points to make their backend discoverable to
pandas. The key is "pandas_plotting_backends". For example, pandas registers the default “matplotlib”
backend as follows.
# in setup.py
setup(  # noqa: F821
    ...,
    entry_points={
        "pandas_plotting_backends": [
            "matplotlib = pandas:plotting._matplotlib",
        ],
    },
)
More information on how to implement a third-party plotting backend can be found at https://github.com/pandas-dev/pandas/blob/master/pandas/plotting/__init__.py#L1.
7.4 Developer
The Apache Parquet format provides key-value metadata at the file and column level, stored in the footer
of the Parquet file:
where KeyValue is
struct KeyValue {
1: required string key
2: optional string value
}
So that a pandas.DataFrame can be faithfully reconstructed, we store a pandas metadata key in the
FileMetaData with the value stored as:
Here, <c0>/<ci0> and so forth are dictionaries containing the metadata for each column, including the index
columns. This has JSON form:
{'name': column_name,
'field_name': parquet_column_name,
'pandas_type': pandas_type,
'numpy_type': numpy_type,
'metadata': metadata}
Note: Every index column is stored with a name matching the pattern __index_level_\d+__ and its
corresponding column information can be found with the following code snippet.
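A sketch of such a snippet, assuming the decoded pandas key-value metadata is available as a dict (the function and key access shown are assumptions for illustration):

import re

def index_column_metadata(pandas_metadata):
    # Return the per-column metadata entries describing index columns,
    # identified by the __index_level_\d+__ naming pattern.
    return [
        col for col in pandas_metadata['columns']
        if re.match(r'__index_level_\d+__', str(col['field_name']))
    ]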
Following this naming convention isn’t strictly necessary, but strongly suggested for compatibility with
Arrow.
Here’s an example of how the index metadata is structured in pyarrow:
{'index_columns': ['__index_level_0__'],
'column_indexes': [
{'name': None,
7.5 Roadmap
This page provides an overview of the major themes in pandas’ development. Each of these items requires a
relatively large amount of effort to implement. These may be achieved more quickly with dedicated funding
or interest from contributors.
An item being on the roadmap does not mean that it will necessarily happen, even with unlimited funding.
During the implementation period we may discover issues preventing the adoption of the feature.
Additionally, an item not being on the roadmap does not exclude it from inclusion in pandas. The roadmap
is intended for larger, fundamental changes to the project that are likely to take months or years of developer
time. Smaller-scoped items will continue to be tracked on our issue tracker.
7.5.1 Extensibility
Pandas Extension types allow for extending NumPy types with custom data types and array storage. Pandas
uses extension types internally, and provides an interface for 3rd-party libraries to define their own custom
data types.
Many parts of pandas still unintentionally convert data to a NumPy array. These problems are especially
pronounced for nested data.
We’d like to improve the handling of extension arrays throughout the library, making their behavior more
consistent with the handling of NumPy arrays. We’ll do this by cleaning up pandas’ internals and adding
new methods to the extension array interface.
Currently, pandas stores text data in an object-dtype NumPy array. The current implementation has
two primary drawbacks: First, object-dtype is not specific to strings: any Python object can be stored
in an object-dtype array, not just strings. Second, this is not efficient. The NumPy memory model isn’t
especially well-suited to variable-width text data.
To solve the first issue, we propose a new extension type for string data. This will initially be opt-in, with
users explicitly requesting dtype="string". The array backing this string dtype may initially be the current
implementation: an object-dtype NumPy array of Python strings.
To solve the second issue (performance), we’ll explore alternative in-memory array libraries (for example,
Apache Arrow). As part of the work, we may need to implement certain operations expected by pandas
users (for example, the algorithm used in Series.str.upper). That work may be done outside of pandas.
Apache Arrow is a cross-language development platform for in-memory data. The Arrow logical types are
closely aligned with typical pandas use cases.
We’d like to provide better-integrated support for Arrow memory and data types within pandas. This will
let us take advantage of its I/O capabilities and provide for better interoperability with other languages and
libraries using Arrow.
We’d like to replace pandas’ current internal data structures (a collection of 1- or 2-D arrays) with a simpler
collection of 1-D arrays.
pandas’ internal data model is quite complex. A DataFrame is made up of one or more 2-dimensional
“blocks”, with one or more blocks per dtype. This collection of 2-D arrays is managed by the BlockManager.
The primary benefit of the BlockManager is improved performance on certain operations (construction from
a 2D array, binary operations, reductions across the columns), especially for wide DataFrames. However,
the BlockManager substantially increases the complexity and maintenance burden of pandas.
By replacing the BlockManager we hope to achieve:
• Substantially simpler code
The code for getting and setting values in pandas’ data structures needs refactoring. In particular, we must
clearly separate code that converts keys (e.g., the argument to DataFrame.loc) to positions from code that
uses these positions to get or set values. This is related to the proposed BlockManager rewrite. Currently,
the BlockManager sometimes uses label-based, rather than position-based, indexing. We propose that it
should only work with positional indexing, and the translation of keys to positions should be entirely done
at a higher level.
Indexing is a complicated API with many subtleties. This refactor will require care and
attention. More details are discussed at https://github.com/pandas-dev/pandas/wiki/(Tentative)
-rules-for-restructuring-indexing-code
Numba is a JIT compiler for Python code. We’d like to provide ways for users to apply their own Numba-
jitted functions where pandas accepts user-defined functions (for example, Series.apply(), DataFrame.
apply(), DataFrame.applymap(), and in groupby and window contexts). This will improve the performance
of user-defined-functions in these operations by staying within compiled code.
We’d like to improve the content, structure, and presentation of the pandas documentation. Some specific
goals include
• Overhaul the HTML theme with a modern, responsive design (GH15556)
• Improve the “Getting Started” documentation, designing and writing learning paths for users with different
backgrounds (e.g. brand new to programming, familiar with other languages like R, already familiar
with Python).
• Improve the overall organization of the documentation and specific subsections of the documentation
to make navigation and finding content easier.
To improve the quality and consistency of pandas docstrings, we’ve developed tooling to check docstrings
in a variety of ways. https://github.com/pandas-dev/pandas/blob/master/scripts/validate_docstrings.py
contains the checks.
Like many other projects, pandas uses the numpydoc style for writing docstrings. With the collaboration
of the numpydoc maintainers, we’d like to move the checks to a package other than pandas so that other
projects can easily use them as well.
Pandas uses airspeed velocity to monitor for performance regressions. ASV itself is a fabulous tool, but
requires some additional work to be integrated into an open source project’s workflow.
The asv-runner organization, currently made up of pandas maintainers, provides tools built on top of ASV.
We have a physical machine for running a number of projects’ benchmarks, and tools for managing the
benchmark runs and reporting on results.
We’d like to fund improvements and maintenance of these tools to
• Be more stable. Currently, they’re maintained on the nights and weekends when a maintainer has free
time.
• Tune the system for benchmarks to improve stability, following https://pyperf.readthedocs.io/en/
latest/system.html
• Build a GitHub bot to request ASV runs before a PR is merged. Currently, the benchmarks are only
run nightly.
Pandas continues to evolve. The direction is primarily determined by community interest. Everyone is
welcome to review existing items on the roadmap and to propose a new item.
Each item on the roadmap should be a short summary of a larger design proposal. The proposal should
include
1. Short summary of the changes, which would be appropriate for inclusion in the roadmap if accepted.
2. Motivation for the changes.
3. An explanation of why the change is in scope for pandas.
4. Detailed design: Preferably with example-usage (even if not implemented yet) and API documentation
5. API Change: Any API changes that may result from the proposal.
That proposal may then be submitted as a GitHub issue, where the pandas maintainers can review and
comment on the design. The pandas mailing list should be notified of the proposal.
When there’s agreement that an implementation would be welcome, the roadmap should be updated to
include the summary and a link to the discussion issue.
EIGHT
RELEASE NOTES
This is the list of changes to pandas between each release. For full details, see the commit logs at
http://github.com/pandas-dev/pandas. For install and upgrade instructions, see Installation.
These are the changes in pandas 0.25.1. See release for a full changelog including other versions of pandas.
Some users may unknowingly have an incomplete Python installation lacking the lzma module from the
standard library. In this case, import pandas failed due to an ImportError (GH27575). Pandas will now
warn, rather than raise an ImportError, if the lzma module is not present. Any subsequent attempt to use
lzma methods will raise a RuntimeError. A possible fix for the lack of the lzma module is to ensure you have
the necessary libraries and then re-install Python. For example, on MacOS installing Python with pyenv
may lead to an incomplete Python installation due to unmet system dependencies at compilation time (like
xz). Compilation will succeed, but Python might fail at run time. The issue can be solved by installing the
necessary dependencies and then re-installing Python.
Bug fixes
Categorical
• Bug in Categorical.fillna() that would replace all values, not just those that are NaN (GH26215)
Datetimelike
Timezones
• Bug in Index where a numpy object array with a timezone aware Timestamp and np.nan would not
return a DatetimeIndex (GH27011)
Numeric
Conversion
• Improved the warnings for the deprecated methods Series.real() and Series.imag() (GH27610)
Interval
Indexing
• Bug in partial-string indexing returning a NumPy array rather than a Series when indexing with a
scalar like .loc['2015'] (GH27516)
• Break reference cycle involving Index and other index classes to allow garbage collection of index
objects without running the GC. (GH27585, GH27840)
• Fix regression in assigning values to a single column of a DataFrame with a MultiIndex columns
(GH27841).
• Fix regression in .ix fallback with an IntervalIndex (GH27865).
Missing
I/O
• Avoid calling S3File.s3 when reading parquet, as this was removed in s3fs version 0.3.0 (GH27756)
• Better error message when a negative header is passed in pandas.read_csv() (GH27779)
• Follow the min_rows display option (introduced in v0.25.0) correctly in the HTML repr in the notebook
(GH27991).
Plotting
• Added a pandas_plotting_backends entrypoint group for registering plot backends. See Plotting
backends for more (GH26747).
• Fixed the reinstatement of Matplotlib datetime converters after calling pandas.plotting.deregister_matplotlib_converters() (GH27481).
• Fix compatibility issue with matplotlib when passing a pandas Index to a plot call (GH27775).
Groupby/resample/rolling
Reshaping
• A KeyError is now raised if .unstack() is called on a Series or DataFrame with a flat Index passing
a name which is not the correct one (GH18303)
• Bug where merge_asof() could not merge Timedelta objects when passing the tolerance kwarg (GH27642)
• Bug in DataFrame.crosstab() where an error was raised when margins was set to True and normalize
was not False (GH27500)
• DataFrame.join() now suppresses the FutureWarning when the sort parameter is specified (GH21952)
• Bug in DataFrame.join() raising with readonly arrays (GH27943)
Sparse
Other
Contributors
A total of 7 people contributed patches to this release. People with a “+” by their names contributed a
patch for the first time.
• Felix Divo +
• Jeff Reback
• Jeremy Schendel
• Joris Van den Bossche
• MeeseeksMachine +
• Tom Augspurger
• jbrockmendel
Warning: Starting with the 0.25.x series of releases, pandas only supports Python 3.5.3 and higher.
See Dropping Python 2.7 for more details.
Warning: The minimum supported Python version will be bumped to 3.6 in a future release.
Warning: Panel has been fully removed. For N-D labeled data structures, please use xarray.
Warning: read_pickle() and read_msgpack() are only guaranteed backwards compatible back to
pandas version 0.20.3 (GH27082)
These are the changes in pandas 0.25.0. See release for a full changelog including other versions of pandas.
Enhancements
Pandas has added special groupby behavior, known as “named aggregation”, for naming the output columns
when applying multiple aggregation functions to specific columns (GH18366, GH26512).
In [1]: animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
...: 'height': [9.1, 6.0, 9.5, 34.0],
...: 'weight': [7.9, 7.5, 9.9, 198.0]})
...:
In [2]: animals
Out[2]:
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0
[4 rows x 3 columns]
In [3]: animals.groupby("kind").agg(
...: min_height=pd.NamedAgg(column='height', aggfunc='min'),
...: max_height=pd.NamedAgg(column='height', aggfunc='max'),
...: average_weight=pd.NamedAgg(column='weight', aggfunc=np.mean),
...: )
...:
Out[3]:
min_height max_height average_weight
kind
cat 9.1 9.5 8.90
dog 6.0 34.0 102.75
[2 rows x 3 columns]
Pass the desired column names as the **kwargs to .agg. The values of **kwargs should be tuples where
the first element is the column selection, and the second element is the aggregation function to apply. Pandas
provides the pandas.NamedAgg namedtuple to make it clearer what the arguments to the function are, but
plain tuples are accepted as well.
In [4]: animals.groupby("kind").agg(
...: min_height=('height', 'min'),
...: max_height=('height', 'max'),
...: average_weight=('weight', np.mean),
...: )
...:
Out[4]:
min_height max_height average_weight
kind
cat 9.1 9.5 8.90
dog 6.0 34.0 102.75
[2 rows x 3 columns]
Named aggregation is the recommended replacement for the deprecated “dict-of-dicts” approach to naming
the output of column-specific aggregations (Deprecate groupby.agg() with a dictionary when renaming).
A similar approach is now available for Series groupby objects as well. Because there’s no need for column
selection, the values can just be the functions to apply
In [5]: animals.groupby("kind").height.agg(
...: min_height="min",
...: max_height="max",
...: )
...:
Out[5]:
      min_height  max_height
kind
cat          9.1         9.5
dog          6.0        34.0
[2 rows x 2 columns]
This type of aggregation is the recommended alternative to the deprecated behavior when passing a dict to
a Series groupby aggregation (Deprecate groupby.agg() with a dictionary when renaming).
See Named aggregation for more.
You can now provide multiple lambda functions to a list-like aggregation in pandas.core.groupby.GroupBy.
agg (GH26430).
In [6]: animals.groupby('kind').height.agg([
...: lambda x: x.iloc[0], lambda x: x.iloc[-1]
...: ])
...:
Out[6]:
<lambda_0> <lambda_1>
kind
cat 9.1 9.5
dog 6.0 34.0
[2 rows x 2 columns]
In [7]: animals.groupby('kind').agg([
...: lambda x: x.iloc[0] - x.iloc[1],
...: lambda x: x.iloc[0] + x.iloc[1]
...: ])
...:
Out[7]:
height weight
<lambda_0> <lambda_1> <lambda_0> <lambda_1>
kind
cat -0.4 18.6 -2.0 17.8
dog -28.0 40.0 -190.5 205.5
[2 rows x 4 columns]
Previously, these raised a SpecificationError.
Printing of MultiIndex instances now shows tuples of each row and ensures that the tuple items are vertically
aligned, so it’s now easier to understand the structure of the MultiIndex. (GH13480):
The repr now looks like this:
Previously, outputting a MultiIndex printed all the levels and codes of the MultiIndex, which was visually
unappealing and made the output more difficult to navigate. For example (limiting the range to 5):
In the new repr, all values will be shown, if the number of rows is smaller than
options.display.max_seq_items (default: 100 items). Horizontally, the output will truncate, if it’s wider
than options.display.width (default: 80 characters).
Currently, the default display options of pandas ensure that when a Series or DataFrame has more than 60
rows, its repr gets truncated to this maximum of 60 rows (the display.max_rows option). However, this
still gives a repr that takes up a large part of the vertical screen estate. Therefore, a new option
display.min_rows is introduced with a default of 10, which determines the number of rows shown in the truncated
repr:
• For small Series or DataFrames, up to max_rows number of rows is shown (default: 60).
• For larger Series or DataFrames with a length above max_rows, only min_rows number of rows is shown
(default: 10, i.e. the first and last 5 rows).
This dual option allows one to still see the full content of relatively small objects (e.g. df.head(20) shows all
20 rows), while giving a brief repr for large objects.
To restore the previous behaviour of a single threshold, set pd.options.display.min_rows = None.
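For example (a small illustration of the option, not part of the original changelog):

import pandas as pd

pd.set_option('display.min_rows', 4)     # truncated reprs now show only 4 rows
pd.set_option('display.min_rows', None)  # restore the single max_rows threshold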
json_normalize() normalizes the provided input dict to all nested levels. The new max_level parameter
provides more control over which level to end normalization (GH23843):
In [10]: data = [{
....: 'CreatedBy': {'Name': 'User001'},
....: 'Lookup': {'TextField': 'Some text',
....: 'UserField': {'Id': 'ID001', 'Name': 'Name001'}},
....: 'Image': {'a': 'b'}
....: }]
....:
[1 rows x 4 columns]
Series and DataFrame have gained the DataFrame.explode() methods to transform list-likes to individual
rows. See section on Exploding list-like column in docs for more information (GH16538, GH10511)
Here is a typical use case: you have a comma-separated string in a column.
In [13]: df
Out[13]:
var1 var2
0 a,b,c 1
1 d,e,f 2
[2 rows x 2 columns]
In [14]: df.assign(var1=df.var1.str.split(',')).explode('var1')
Out[14]:
var1 var2
0 a 1
0 b 1
0 c 1
1 d 2
1 e 2
1 f 2
[6 rows x 2 columns]
Other enhancements
• DataFrame.plot() keywords logy, logx and loglog can now accept the value 'sym' for symlog
scaling. (GH24867)
• Added support for ISO week year format (‘%G-%V-%u’) when parsing datetimes using to_datetime()
(GH16607)
• Indexing of DataFrame and Series now accepts zerodim np.ndarray (GH24919)
• Timestamp.replace() now supports the fold argument to disambiguate DST transition times
(GH25017)
• DataFrame.at_time() and Series.at_time() now support datetime.time objects with timezones
(GH24043)
• DataFrame.pivot_table() now accepts an observed parameter which is passed to underlying calls
to DataFrame.groupby() to speed up grouping categorical data. (GH24923)
• Series.str has gained the Series.str.casefold() method to remove all case distinctions present in a
string (GH25405)
• DataFrame.set_index() now works for instances of abc.Iterator, provided their output is of the
same length as the calling frame (GH22484, GH24984)
• DatetimeIndex.union() now supports the sort argument. The behavior of the sort parameter
matches that of Index.union() (GH24994)
• RangeIndex.union() now supports the sort argument. If sort=False an unsorted Int64Index is
always returned. sort=None is the default and returns a monotonically increasing RangeIndex if
possible or a sorted Int64Index if not (GH24471)
• TimedeltaIndex.intersection() now also supports the sort keyword (GH24471)
• DataFrame.rename() now supports the errors argument to raise errors when attempting to rename
nonexistent keys (GH13473)
• Added Sparse accessor for working with a DataFrame whose values are sparse (GH25681)
• RangeIndex has gained start, stop, and step attributes (GH25710)
• datetime.timezone objects are now supported as arguments to timezone methods and constructors
(GH25065)
• DataFrame.query() and DataFrame.eval() now supports quoting column names with backticks to
refer to names with spaces (GH6508)
• merge_asof() now gives a clearer error message when merge keys are categoricals that are not
equal (GH26136)
• pandas.core.window.Rolling() supports exponential (or Poisson) window type (GH21303)
• Error message for missing required imports now includes the original import error’s text (GH23868)
• DatetimeIndex and TimedeltaIndex now have a mean method (GH24757)
• DataFrame.describe() now formats integer percentiles without decimal point (GH26660)
• Added support for reading SPSS .sav files using read_spss() (GH26537)
• Added new option plotting.backend to be able to select a plotting backend different than the exist-
ing matplotlib one. Use pandas.set_option('plotting.backend', '<backend-module>') where
<backend-module> is a library implementing the pandas plotting API (GH14130)
• pandas.offsets.BusinessHour supports multiple opening hours intervals (GH15481)
• read_excel() can now use openpyxl to read Excel files via the engine='openpyxl' argument. This
will become the default in a future release (GH11499)
• pandas.io.excel.read_excel() supports reading OpenDocument tables. Specify engine='odf' to
enable. Consult the IO User Guide for more details (GH9070)
• Interval, IntervalIndex, and IntervalArray have gained an is_empty attribute denoting if the
given interval(s) are empty (GH27219)
Indexing a DataFrame or Series with a DatetimeIndex with a date string with a UTC offset would previously
ignore the UTC offset. Now, the UTC offset is respected in indexing. (GH24076, GH16785)
In [16]: df
Out[16]:
0
2019-01-01 00:00:00-08:00 0
[1 rows x 1 columns]
Previous behavior:
New behavior:
[1 rows x 1 columns]
Constructing a MultiIndex with NaN levels or codes values < -1 was allowed previously. Now, construction
with codes values < -1 is not allowed, and NaN levels’ corresponding codes will be reassigned as -1. (GH19387)
Previous behavior:
New behavior:
In [18]: pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
....: codes=[[0, -1, 1, 2, 3, 4]])
....:
Out[18]:
MultiIndex([(nan,),
(nan,),
(nan,),
(nan,),
(128,),
( 2,)],
)
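The lead-in and setup for the next example appear to have been lost in extraction. The change concerns
DataFrameGroupBy.apply() evaluating the first group only once; the setup would have looked roughly like
this, with func simply printing the group key and returning the group unchanged:

import pandas as pd

df = pd.DataFrame({'a': ['x', 'y'], 'b': [1, 2]})

def func(group):
    print(group.name)  # show which group the function is called on
    return group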
In [21]: df
Out[21]:
a b
0 x 1
1 y 2
[2 rows x 2 columns]
Previous behavior:
In [3]: df.groupby('a').apply(func)
x
x
y
Out[3]:
a b
0 x 1
1 y 2
New behavior:
In [23]: df.groupby("a").apply(func)
x
y
Out[23]:
a b
0 x 1
1 y 2
[2 rows x 2 columns]
When passed DataFrames whose values are sparse, concat() will now return a Series or DataFrame with
sparse values, rather than a SparseDataFrame (GH25702).
Previous behavior:
New behavior:
This now matches the existing behavior of concat on Series with sparse values. concat() will continue to
return a SparseDataFrame when all the values are instances of SparseDataFrame.
This change also affects routines using concat() internally, like get_dummies(), which now returns a
DataFrame in all cases (previously a SparseDataFrame was returned if all the columns were dummy en-
coded, and a DataFrame otherwise).
Providing any SparseSeries or SparseDataFrame to concat() will cause a SparseSeries or
SparseDataFrame to be returned, as before.
Due to the lack of more fine-grained dtypes, Series.str so far only checked whether the data was of
object dtype. Series.str will now infer the dtype data within the Series; in particular, 'bytes'-only
data will raise an exception (except for Series.str.decode(), Series.str.get(), Series.str.len(),
Series.str.slice()), see GH23163, GH23011, GH23551.
Previous behavior:
In [2]: s
Out[2]:
0 b'a'
1 b'ba'
2 b'cba'
dtype: object
In [3]: s.str.startswith(b'a')
Out[3]:
0 True
1 False
2 False
dtype: bool
New behavior:
In [26]: s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)
In [27]: s
Out[27]:
0 b'a'
1 b'ba'
2 b'cba'
Length: 3, dtype: object
In [28]: s.str.startswith(b'a')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-28-ac784692b361> in <module>
----> 1 s.str.startswith(b'a')
Previously, columns that were categorical, but not the groupby key(s), would be converted to object dtype
during groupby operations. pandas now will preserve these dtypes. (GH18502)
In [29]: cat = pd.Categorical(["foo", "bar", "bar", "qux"], ordered=True)
In [30]: df = pd.DataFrame({'payload': [-1, -2, -1, -2], 'col': cat})
In [31]: df
Out[31]:
payload col
0 -1 foo
1 -2 bar
2 -1 bar
3 -2 qux
[4 rows x 2 columns]
In [32]: df.dtypes
Out[32]:
payload int64
col category
Length: 2, dtype: object
Previous Behavior:
In [5]: df.groupby('payload').first().col.dtype
Out[5]: dtype('O')
New Behavior:
In [33]: df.groupby('payload').first().col.dtype
Out[33]: CategoricalDtype(categories=['bar', 'foo', 'qux'], ordered=True)
When performing Index.union() operations between objects of incompatible dtypes, the result will be a
base Index of dtype object. This behavior holds true for unions between Index objects that previously
would have been prohibited. The dtype of empty Index objects will now be evaluated before performing
union operations rather than simply returning the other Index object. Index.union() can now be considered
commutative, such that A.union(B) == B.union(A) (GH23525).
Previous behavior:
New behavior:
In [34]: pd.period_range('19910905', periods=2).union(pd.Int64Index([1, 2, 3]))
Out[34]: Index([1991-09-05, 1991-09-06, 1, 2, 3], dtype='object')
The methods ffill, bfill, pad and backfill of DataFrameGroupBy previously included the group labels
in the return value, which was inconsistent with other groupby transforms. Now only the filled values are
returned. (GH21521)
In [37]: df
Out[37]:
a b
0 x 1
1 y 2
[2 rows x 2 columns]
Previous behavior:
In [3]: df.groupby("a").ffill()
Out[3]:
a b
0 x 1
1 y 2
New behavior:
In [38]: df.groupby("a").ffill()
Out[38]:
b
0 1
1 2
[2 rows x 1 columns]
DataFrame describe on an empty categorical / object column will return top and freq
When calling DataFrame.describe() with an empty categorical / object column, the ‘top’ and ‘freq’ columns
were previously omitted, which was inconsistent with the output for non-empty columns. Now the ‘top’ and
‘freq’ columns will always be included, with numpy.nan in the case of an empty DataFrame (GH26397)
In [40]: df
Out[40]:
Empty DataFrame
Columns: [empty_col]
Index: []
[0 rows x 1 columns]
Previous behavior:
In [3]: df.describe()
Out[3]:
empty_col
count 0
unique 0
New behavior:
In [41]: df.describe()
Out[41]:
empty_col
count 0
unique 0
top NaN
freq NaN
[4 rows x 1 columns]
Pandas has until now mostly defined string representations in a pandas object’s
__str__/__unicode__/__bytes__ methods, and called __str__ from the __repr__ method if a
specific __repr__ method was not found. This is not needed for Python 3. In Pandas 0.25, the string
representations of Pandas objects are now generally defined in __repr__, and calls to __str__ in general
representations of Pandas objects are now generally defined in __repr__, and calls to __str__ in general
now pass the call on to the __repr__, if a specific __str__ method doesn’t exist, as is standard for Python.
This change is backward compatible for direct usage of Pandas, but if you subclass Pandas objects and
give your subclasses specific __str__/__repr__ methods, you may have to adjust your __str__/__repr__
methods (GH26495).
Indexing methods for IntervalIndex have been modified to require exact matches only for Interval queries.
IntervalIndex methods previously matched on any overlapping Interval. Behavior with scalar points, e.g.
querying with an integer, is unchanged (GH16316).
In [43]: ii
Out[43]:
IntervalIndex([(0, 4], (1, 5], (5, 8]],
closed='right',
dtype='interval[int64]')
The in operator (__contains__) now only returns True for exact matches to Intervals in the
IntervalIndex, whereas this would previously return True for any Interval overlapping an Interval
in the IntervalIndex.
Previous behavior:
New behavior:
In [44]: pd.Interval(1, 2, closed='neither') in ii
Out[44]: False
Likewise, get_indexer() and get_indexer_non_unique() will also only return locations for exact matches
to Interval queries, with -1 denoting that an exact match was not found.
These indexing changes extend to querying a Series or DataFrame with an IntervalIndex index.
In [47]: s
Out[47]:
(0, 4] a
(1, 5] b
(5, 8] c
Length: 3, dtype: object
Selecting from a Series or DataFrame using [] (__getitem__) or loc now only returns exact matches for
Interval queries.
Previous behavior:
New behavior:
In [48]: s[pd.Interval(1, 5)]
Out[48]: 'b'
The overlaps() method can be used to create a boolean indexer that replicates the previous behavior of
returning overlapping matches.
In [50]: idxr = s.index.overlaps(pd.Interval(2, 3))
In [51]: idxr
Out[51]: array([ True, True, False])
In [52]: s[idxr]
Out[52]:
(0, 4] a
(1, 5] b
Length: 2, dtype: object
In [53]: s.loc[idxr]
Out[53]:
(0, 4] a
(1, 5] b
Length: 2, dtype: object
Applying a binary ufunc like numpy.power() now aligns the inputs when both are Series (GH23293).
In [54]: s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
In [55]: s2 = pd.Series([3, 4, 5], index=['d', 'c', 'b'])
In [56]: s1
Out[56]:
a 1
b 2
c 3
Length: 3, dtype: int64
In [57]: s2
Out[57]:
d 3
c 4
b 5
Length: 3, dtype: int64
Previous behavior
New behavior
This matches the behavior of other binary operations in pandas, like Series.add(). To retain the previous
behavior, convert the other Series to an array before applying the ufunc.
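For instance (a small illustration using s1 and s2 from the example above; not from the original changelog):

import numpy as np

np.power(s1, np.asarray(s2))  # positional, unaligned application as before 0.25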
Categorical.argsort() now places missing values at the end of the array, making it consistent with NumPy
and the rest of pandas (GH21801).
Previous behavior
In [3]: cat.argsort()
Out[3]: array([1, 2, 0])
In [4]: cat[cat.argsort()]
Out[4]:
[NaN, a, b]
categories (2, object): [a < b]
New behavior
In [61]: cat.argsort()
Out[61]: array([2, 0, 1])
In [62]: cat[cat.argsort()]
Out[62]:
[a, b, NaN]
Categories (2, object): [a < b]
Starting with Python 3.7 the key-order of dict is guaranteed. In practice, this has been true since Python
3.6. The DataFrame constructor now treats a list of dicts in the same way as it does a list of OrderedDict,
i.e. preserving the order of the dicts. This change applies only when pandas is running on Python>=3.6
(GH27309).
In [63]: data = [
....: {'name': 'Joe', 'state': 'NY', 'age': 18},
....: {'name': 'Jane', 'state': 'KY', 'age': 19, 'hobby': 'Minecraft'},
....: {'name': 'Jean', 'state': 'OK', 'age': 20, 'finances': 'good'}
....: ]
....:
Previous Behavior:
The columns were lexicographically sorted previously,
In [1]: pd.DataFrame(data)
Out[1]:
age finances hobby name state
0 18 NaN NaN Joe NY
1 19 NaN Minecraft Jane KY
2 20 good NaN Jean OK
New Behavior:
The column order now matches the insertion-order of the keys in the dict, considering all the records from
top to bottom. As a consequence, the column order of the resulting DataFrame has changed compared to
previous pandas versions.
In [64]: pd.DataFrame(data)
Out[64]:
name state age hobby finances
0 Joe NY 18 NaN NaN
1 Jane KY 19 Minecraft NaN
2 Jean OK 20 NaN good
[3 rows x 5 columns]
Due to dropping support for Python 2.7, a number of optional dependencies have updated minimum versions
(GH25725, GH24942, GH25752). Independently, some minimum supported versions of dependencies were
updated (GH23519, GH25554). If installed, we now require:
For optional libraries the general recommendation is to use the latest version. The following table lists the
lowest version per library that is currently being tested throughout the development of pandas. Optional
libraries below the lowest tested version may still work, but are not considered supported.
• DatetimeTZDtype will now standardize pytz timezones to a common timezone instance (GH24713)
• Timestamp and Timedelta scalars now implement the to_numpy() method as aliases to Timestamp.
to_datetime64() and Timedelta.to_timedelta64(), respectively. (GH24653)
• Timestamp.strptime() will now raise a NotImplementedError (GH25016)
• Comparing Timestamp with unsupported objects now returns NotImplemented instead of raising
TypeError. This implies that unsupported rich comparisons are delegated to the other object, and are
now consistent with Python 3 behavior for datetime objects (GH24011)
• Bug in DatetimeIndex.snap() which didn’t preserve the name of the input Index (GH25575)
• The arg argument in pandas.core.groupby.DataFrameGroupBy.agg() has been renamed to func
(GH26089)
• The arg argument in pandas.core.window._Window.aggregate() has been renamed to func
(GH26372)
• Most Pandas classes had a __bytes__ method, which was used for getting a python2-style bytestring
representation of the object. This method has been removed as a part of dropping Python2 (GH26447)
• The .str-accessor has been disabled for 1-level MultiIndex, use MultiIndex.to_flat_index() if
necessary (GH23679)
• Removed support of gtk package for clipboards (GH26563)
• Using an unsupported version of Beautiful Soup 4 will now raise an ImportError instead of a
ValueError (GH27063)
• Series.to_excel() and DataFrame.to_excel() will now raise a ValueError when saving timezone
aware data. (GH27008, GH7056)
• ExtensionArray.argsort() places NA values at the end of the sorted array. (GH21801)
• DataFrame.to_hdf() and Series.to_hdf() will now raise a NotImplementedError when saving a
MultiIndex with extension data types for a fixed format. (GH7775)
• Passing duplicate names in read_csv() will now raise a ValueError (GH17346)
Deprecations
Sparse subclasses
The SparseSeries and SparseDataFrame subclasses are deprecated. Their functionality is better-provided
by a Series or DataFrame with sparse values.
Previous way
In [66]: df.dtypes
Out[66]:
A Sparse[int64, nan]
Length: 1, dtype: object
New way
In [68]: df.dtypes
Out[68]:
A Sparse[int64, 0]
Length: 1, dtype: object
The memory usage of the two approaches is identical. See Migrating for more (GH19239).
msgpack format
The msgpack format is deprecated as of 0.25 and will be removed in a future version. It is recommended to
use pyarrow for on-the-wire transmission of pandas objects. (GH27084)
Other deprecations
• The deprecated .ix[] indexer now raises a more visible FutureWarning instead of
DeprecationWarning (GH26438).
• Deprecated the units=M (months) and units=Y (year) parameters for units of pandas.
to_timedelta(), pandas.Timedelta() and pandas.TimedeltaIndex() (GH16344)
• pandas.concat() has deprecated the join_axes-keyword. Instead, use DataFrame.reindex() or
DataFrame.reindex_like() on the result or on the inputs (GH21951)
• The SparseArray.values attribute is deprecated. You can use np.asarray(...) or the SparseArray.
to_dense() method instead (GH26421).
• The functions pandas.to_datetime() and pandas.to_timedelta() have deprecated the box key-
word. Instead, use to_numpy() or Timestamp.to_datetime64() or Timedelta.to_timedelta64().
(GH24416)
• The DataFrame.compound() and Series.compound() methods are deprecated and will be removed in
a future version (GH26405).
• The internal attributes _start, _stop and _step attributes of RangeIndex have been deprecated. Use
the public attributes start, stop and step instead (GH26581).
• The Series.ftype(), Series.ftypes() and DataFrame.ftypes() methods are deprecated and will
be removed in a future version. Instead, use Series.dtype() and DataFrame.dtypes() (GH26705).
• The Series.get_values(), DataFrame.get_values(), Index.get_values(), SparseArray.
get_values() and Categorical.get_values() methods are deprecated. One of np.asarray(..)
or to_numpy() can be used instead (GH19617).
• The ‘outer’ method on NumPy ufuncs, e.g. np.subtract.outer has been deprecated on Series
objects. Convert the input to an array with Series.array first (GH27186)
• Timedelta.resolution() is deprecated and replaced with Timedelta.resolution_string(). In a fu-
ture version, Timedelta.resolution() will be changed to behave like the standard library datetime.
timedelta.resolution (GH21344)
• read_table() has been undeprecated. (GH25220)
• Index.dtype_str is deprecated. (GH18262)
• Series.imag and Series.real are deprecated. (GH18262)
• Series.put() is deprecated. (GH18262)
• Index.item() and Series.item() is deprecated. (GH18262)
• The default value ordered=None in CategoricalDtype has been deprecated in favor of ordered=False.
When converting between categorical types ordered=True must be explicitly passed in order to be
preserved. (GH26336)
• Index.contains() is deprecated. Use key in index (__contains__) instead (GH17753).
• DataFrame.get_dtype_counts() is deprecated. (GH18262)
• Categorical.ravel() will return a Categorical instead of a np.ndarray (GH27199)
Performance improvements
• Significant speedup in SparseArray initialization that benefits most operations, fixing performance
regression introduced in v0.20.0 (GH24985)
• DataFrame.to_stata() is now faster when outputting data with any string or non-native endian
columns (GH25045)
• Improved performance of Series.searchsorted(). The speedup is especially large when the dtype is
int8/int16/int32 and the searched key is within the integer bounds for the dtype (GH22034)
• Improved performance of pandas.core.groupby.GroupBy.quantile() (GH20405)
• Improved performance of slicing and other selected operation on a RangeIndex (GH26565, GH26617,
GH26722)
• RangeIndex now performs standard lookup without instantiating an actual hashtable, hence saving
memory (GH16685)
• Improved performance of read_csv() by faster tokenizing and faster parsing of small float numbers
(GH25784)
• Improved performance of read_csv() by faster parsing of N/A and boolean values (GH25804)
• Improved performance of IntervalIndex.is_monotonic, IntervalIndex.
is_monotonic_increasing and IntervalIndex.is_monotonic_decreasing by removing conversion
to MultiIndex (GH24813)
• Improved performance of DataFrame.to_csv() when writing datetime dtypes (GH25708)
• Improved performance of read_csv() by much faster parsing of MM/YYYY and DD/MM/YYYY datetime
formats (GH25922)
• Improved performance of nanops for dtypes that cannot store NaNs. Speedup is particularly prominent
for Series.all() and Series.any() (GH25070)
• Improved performance of Series.map() for dictionary mappers on categorical series by mapping the
categories instead of mapping all values (GH23785)
• Improved performance of IntervalIndex.intersection() (GH24813)
• Improved performance of read_csv() by faster concatenating date columns without extra conversion
to string for integer/float zero and float NaN; by faster checking the string for the possibility of being
a date (GH25754)
• Improved performance of IntervalIndex.is_unique by removing conversion to MultiIndex
(GH24813)
• Restored performance of DatetimeIndex.__iter__() by re-enabling specialized code path (GH26702)
• Improved performance when building MultiIndex with at least one CategoricalIndex level
(GH22044)
• Improved performance by removing the need for a garbage collect when checking for
SettingWithCopyWarning (GH27031)
• For to_datetime(), the default value of the cache parameter was changed to True (GH26043)
• Improved performance of DatetimeIndex and PeriodIndex slicing given non-unique, monotonic data
(GH27136).
• Improved performance of pd.read_json() for index-oriented data. (GH26773)
• Improved performance of MultiIndex.shape() (GH27384).
Bug fixes
Categorical
• Bug in DataFrame.at() and Series.at() that would raise exception if the index was a
CategoricalIndex (GH20629)
• Fixed bug in comparison of ordered Categorical that contained missing values with a scalar which
sometimes incorrectly resulted in True (GH26504)
• Bug in DataFrame.dropna() when the DataFrame has a CategoricalIndex containing Interval
objects incorrectly raised a TypeError (GH25087)
Datetimelike
• Bug in to_datetime() which would raise an (incorrect) ValueError when called with a date far into
the future and the format argument specified instead of raising OutOfBoundsDatetime (GH23830)
• Bug in to_datetime() which would raise InvalidIndexError: Reindexing only valid with
uniquely valued Index objects when called with cache=True, with arg including at least two
different elements from the set {None, numpy.nan, pandas.NaT} (GH22305)
• Bug in DataFrame and Series where timezone aware data with dtype='datetime64[ns]' was not cast
to naive (GH25843)
• Improved Timestamp type checking in various datetime functions to prevent exceptions when using a
subclassed datetime (GH25851)
• Bug in Series and DataFrame repr where np.datetime64('NaT') and np.timedelta64('NaT') with
dtype=object would be represented as NaN (GH25445)
• Bug in to_datetime() which does not replace the invalid argument with NaT when error is set to
coerce (GH26122)
• Bug in adding DateOffset with nonzero month to DatetimeIndex would raise ValueError (GH26258)
• Bug in to_datetime() which raises unhandled OverflowError when called with mix of invalid dates
and NaN values with format='%Y%m%d' and error='coerce' (GH25512)
• Bug in isin() for datetimelike indexes; DatetimeIndex, TimedeltaIndex and PeriodIndex where
the levels parameter was ignored. (GH26675)
• Bug in to_datetime() which raises TypeError for format='%Y%m%d' when called for invalid integer
dates with length >= 6 digits with errors='ignore'
• Bug when comparing a PeriodIndex against a zero-dimensional numpy array (GH26689)
• Bug in constructing a Series or DataFrame from a numpy datetime64 array with a non-ns
unit and out-of-bound timestamps generating rubbish data, which will now correctly raise an
OutOfBoundsDatetime error (GH26206).
• Bug in date_range() with unnecessary OverflowError being raised for very large or very small dates
(GH26651)
• Bug where adding Timestamp to a np.timedelta64 object would raise instead of returning a Timestamp
(GH24775)
• Bug where comparing a zero-dimensional numpy array containing a np.datetime64 object to a
Timestamp would incorrectly raise TypeError (GH26916)
• Bug in to_datetime() which would raise ValueError: Tz-aware datetime.datetime cannot be
converted to datetime64 unless utc=True when called with cache=True, with arg including
datetime strings with different offset (GH26097)
Timedelta
Timezones
• Bug in DatetimeIndex.to_frame() where timezone aware data would be converted to timezone naive
data (GH25809)
• Bug in to_datetime() with utc=True and datetime strings that would apply previously parsed UTC
offsets to subsequent arguments (GH24992)
• Bug in Timestamp.tz_localize() and Timestamp.tz_convert() does not propagate freq (GH25241)
• Bug in Series.at() where setting Timestamp with timezone raises TypeError (GH25506)
• Bug in DataFrame.update() when updating with timezone aware data would return timezone naive
data (GH25807)
• Bug in to_datetime() where an uninformative RuntimeError was raised when passing a naive
Timestamp with datetime strings with mixed UTC offsets (GH25978)
• Bug in to_datetime() with unit='ns' would drop timezone information from the parsed argument
(GH26168)
• Bug in DataFrame.join() where joining a timezone aware index with a timezone aware column would
result in a column of NaN (GH26335)
• Bug in date_range() where ambiguous or nonexistent start or end times were not handled by the
ambiguous or nonexistent keywords respectively (GH27088)
• Bug in DatetimeIndex.union() when combining a timezone aware and timezone unaware
DatetimeIndex (GH21671)
• Bug when applying a numpy reduction function (e.g. numpy.minimum()) to a timezone aware Series
(GH15552)
Numeric
• Bug in to_numeric() in which large negative numbers were being improperly handled (GH24910)
• Bug in to_numeric() in which numbers were being coerced to float, even though errors was not
coerce (GH24910)
• Bug in to_numeric() in which invalid values for errors were being allowed (GH26466)
• Bug in format in which floating point complex numbers were not being formatted to proper display
precision and trimming (GH25514)
• Bug in error messages in DataFrame.corr() and Series.corr(). Added the possibility of using a
callable. (GH25729)
• Bug in Series.divmod() and Series.rdivmod() which would raise an (incorrect) ValueError rather
than return a pair of Series objects as result (GH25557)
• Raises a helpful exception when a non-numeric index is sent to interpolate() with methods which
require numeric index. (GH21662)
• Bug in eval() when comparing floats with scalar operators, for example: x < -0.1 (GH25928)
• Fixed bug where casting all-boolean array to integer extension array failed (GH25211)
• Bug in divmod with a Series object containing zeros incorrectly raising AttributeError (GH26987)
• Inconsistency in Series floor-division (//) and divmod filling positive//zero with NaN instead of Inf
(GH27321)
Conversion
• Bug in DataFrame.astype() when passing a dict of columns and types the errors parameter was
ignored. (GH25905)
Strings
• Bug in the __name__ attribute of several methods of Series.str, which were set incorrectly (GH23551)
• Improved error message when passing Series of wrong dtype to Series.str.cat() (GH22722)
Interval
Indexing
• Improved exception message when calling DataFrame.iloc() with a list of non-numeric objects
(GH25753).
• Improved exception message when calling .iloc or .loc with a boolean indexer with different length
(GH26658).
• Bug in KeyError exception message when indexing a MultiIndex with a non-existent key not displaying
the original key (GH27250).
• Bug in .iloc and .loc with a boolean indexer not raising an IndexError when too few items are
passed (GH26658).
• Bug in DataFrame.loc() and Series.loc() where KeyError was not raised for a MultiIndex when
the key was less than or equal to the number of levels in the MultiIndex (GH14885).
• Bug in which DataFrame.append() produced an erroneous warning indicating that a KeyError will
be thrown in the future when the data to be appended contains new columns (GH22252).
• Bug in which DataFrame.to_csv() caused a segfault for a reindexed data frame, when the indices
were single-level MultiIndex (GH26303).
• Fixed bug where assigning an arrays.PandasArray to a pandas.core.frame.DataFrame would raise an
error (GH26390)
• Allow keyword arguments for callable local reference used in the DataFrame.query() string (GH26426)
• Fixed a KeyError when indexing a MultiIndex level with a list containing exactly one label, which
is missing (GH27148)
• Bug which produced AttributeError on partial matching Timestamp in a MultiIndex (GH26944)
• Bug in Categorical and CategoricalIndex with Interval values when using the in operator
(__contains__) with objects that are not comparable to the values in the Interval (GH23705)
• Bug in DataFrame.loc() and DataFrame.iloc() on a DataFrame with a single timezone-aware date-
time64[ns] column incorrectly returning a scalar instead of a Series (GH27110)
• Bug in CategoricalIndex and Categorical incorrectly raising ValueError instead of TypeError
when a list is passed using the in operator (__contains__) (GH21729)
• Bug in setting a new value in a Series with a Timedelta object incorrectly casting the value to an
integer (GH22717)
• Bug in Series setting a new key (__setitem__) with a timezone-aware datetime incorrectly raising
ValueError (GH12862)
• Bug in DataFrame.iloc() when indexing with a read-only indexer (GH17192)
• Bug in Series setting an existing tuple key (__setitem__) with timezone-aware datetime values
incorrectly raising TypeError (GH20441)
Missing
MultiIndex
• Bug in which an incorrect exception was raised by Timedelta when testing membership of a MultiIndex
(GH24570)
I/O
• Bug in DataFrame.to_html() where values were truncated using display options instead of outputting
the full content (GH17004)
• Fixed bug in missing text when using to_clipboard() if copying utf-16 characters in Python 3 on
Windows (GH25040)
• Bug in read_json() for orient='table' when it tries to infer dtypes by default, which is not applicable
as dtypes are already defined in the JSON schema (GH21345)
• Bug in read_json() for orient='table' and float index, as it infers index dtype by default, which is
not applicable because index dtype is already defined in the JSON schema (GH25433)
• Bug in read_json() for orient='table' and string of float column names, as it makes a column name
type conversion to Timestamp, which is not applicable because column names are already defined in
the JSON schema (GH25435)
• Bug in json_normalize() for errors='ignore' where missing values in the input data were filled in the
resulting DataFrame with the string "nan" instead of numpy.nan (GH25468)
• DataFrame.to_html() now raises TypeError when using an invalid type for the classes parameter
instead of AssertionError (GH25608)
• Bug in DataFrame.to_string() and DataFrame.to_latex() that would lead to incorrect output when
the header keyword is used (GH16718)
• Bug in read_csv() not properly interpreting the UTF8 encoded filenames on Windows on Python
3.6+ (GH15086)
• Improved performance in pandas.read_stata() and pandas.io.stata.StataReader when converting
columns that have missing values (GH25772)
• Bug in DataFrame.to_html() where header numbers would ignore display options when rounding
(GH17280)
• Bug in read_hdf() where reading a table from an HDF5 file written directly with PyTables fails with
a ValueError when using a sub-selection via the start or stop arguments (GH11188)
• Bug in read_hdf() not properly closing store after a KeyError is raised (GH25766)
• Improved the explanation for the failure when value labels are repeated in Stata dta files and suggested
work-arounds (GH25772)
• Improved pandas.read_stata() and pandas.io.stata.StataReader to read incorrectly formatted
118 format files saved by Stata (GH25960)
• Improved the col_space parameter in DataFrame.to_html() to accept a string so CSS length values
can be set correctly (GH25941)
• Fixed bug in loading objects from S3 that contain # characters in the URL (GH25945)
• Adds use_bqstorage_api parameter to read_gbq() to speed up downloads of large data
frames. This feature requires version 0.10.0 of the pandas-gbq library as well as the
google-cloud-bigquery-storage and fastavro libraries. (GH26104)
• Fixed memory leak in DataFrame.to_json() when dealing with numeric data (GH24889)
• Bug in read_json() where date strings with Z were not converted to a UTC timezone (GH26168)
• Added cache_dates=True parameter to read_csv(), which allows unique dates to be cached when they
are parsed (GH25990)
• DataFrame.to_excel() now raises a ValueError when the caller’s dimensions exceed the limitations
of Excel (GH26051)
• Fixed bug in pandas.read_csv() where a BOM would result in incorrect parsing using
engine='python' (GH26545)
• read_excel() now raises a ValueError when input is of type pandas.io.excel.ExcelFile and
engine param is passed since pandas.io.excel.ExcelFile has an engine defined (GH26566)
• Bug while selecting from HDFStore with where='' specified (GH26610).
• Fixed bug in DataFrame.to_excel() where custom objects (i.e. PeriodIndex) inside merged cells were
not being converted into types safe for the Excel writer (GH27006)
• Bug in read_hdf() where reading a timezone aware DatetimeIndex would raise a TypeError
(GH11926)
• Bug in to_msgpack() and read_msgpack() which would raise a ValueError rather than a
FileNotFoundError for an invalid path (GH27160)
• Fixed bug in DataFrame.to_parquet() which would raise a ValueError when the dataframe had no
columns (GH27339)
• Allow parsing of PeriodDtype columns when using read_csv() (GH26934)
Plotting
Groupby/resample/rolling
Reshaping
• Bug in pandas.merge() which added the string "None" to a column name when None was assigned in
suffixes, instead of keeping the column name as-is (GH24782).
• Bug in merge() when merging by index name would sometimes result in an incorrectly numbered index
(missing index values are now assigned NA) (GH24212, GH25009)
• to_records() now accepts dtypes to its column_dtypes parameter (GH24895)
• Bug in concat() where order of OrderedDict (and dict in Python 3.6+) is not respected, when
passed in as objs argument (GH21510)
• Bug in pivot_table() where columns with NaN values are dropped even if dropna argument is False,
when the aggfunc argument contains a list (GH22159)
• Bug in concat() where the resulting freq of two DatetimeIndex with the same freq would be dropped
(GH3232).
• Bug in merge() where merging with equivalent Categorical dtypes was raising an error (GH22501)
• Bug in DataFrame instantiation with a dict of iterators or generators (e.g. pd.DataFrame({'A':
reversed(range(3))})) which raised an error (GH26349).
• Bug in DataFrame instantiating with a range (e.g. pd.DataFrame(range(3))) raised an error
(GH26342).
• Bug in DataFrame constructor when passing non-empty tuples would cause a segmentation fault
(GH25691)
• Bug in Series.apply() failed when the series is a timezone aware DatetimeIndex (GH25959)
• Bug in pandas.cut() where large bins could incorrectly raise an error due to an integer overflow
(GH26045)
• Bug in DataFrame.sort_index() where an error is thrown when a multi-indexed DataFrame is sorted
on all levels with the initial level sorted last (GH26053)
Sparse
• Significant speedup in SparseArray initialization that benefits most operations, fixing performance
regression introduced in v0.20.0 (GH24985)
• Bug in SparseFrame constructor where passing None as the data would cause default_fill_value to
be ignored (GH16807)
• Bug in SparseDataFrame when adding a column in which the length of values does not match length
of index, AssertionError is raised instead of raising ValueError (GH25484)
• Introduce a better error message in Series.sparse.from_coo() so it returns a TypeError for inputs
that are not coo matrices (GH26554)
• Bug in numpy.modf() on a SparseArray. Now a tuple of SparseArray is returned (GH26946).
Build Changes
ExtensionArray
Other
Contributors
Warning: The 0.24.x series of releases will be the last to support Python 2. Future feature releases
will support Python 3 only. See Dropping Python 2.7 for more.
These are the changes in pandas 0.24.2. See release for a full changelog including other versions of pandas.
Fixed regressions
• Fixed regression in factorize() when passing a custom na_sentinel value with sort=True
(GH25409).
• Fixed regression in DataFrame.to_csv() writing duplicate line endings with gzip compress (GH25311)
Bug fixes
I/O
• Better handling of terminal printing when the terminal dimensions are not known (GH25080)
• Bug in reading a HDF5 table-format DataFrame created in Python 2, in Python 3 (GH24925)
• Bug in reading a JSON with orient='table' generated by DataFrame.to_json() with index=False
(GH25170)
• Bug where float indexes could have misaligned values when printing (GH25061)
Categorical
• Bug where calling Series.replace() on categorical data could return a Series with incorrect dimen-
sions (GH24971)
Reshaping
• Bug in transform() where applying a function to a timezone aware column would return a timezone
naive result (GH24198)
• Bug in DataFrame.join() when joining on a timezone aware DatetimeIndex (GH23931)
Visualization
• Bug in Series.plot() where a secondary y axis could not be set to log scale (GH25545)
Other
• Bug in Series.is_unique() where single occurrences of NaN were not considered unique (GH25180)
• Bug in merge() when merging an empty DataFrame with an Int64 column or a non-empty DataFrame
with an Int64 column that is all NaN (GH25183)
• Bug in IntervalTree where a RecursionError occurs upon construction due to an overflow when
adding endpoints, which also causes IntervalIndex to crash during indexing operations (GH25485)
• Bug in Series.size raising for some extension-array-backed Series, rather than returning the size
(GH25580)
• Bug in resampling raising for nullable integer-dtype columns (GH25580)
Contributors
A total of 25 people contributed patches to this release. People with a “+” by their names contributed a
patch for the first time.
• Albert Villanova del Moral
• Arno Veenstra +
• chris-b1
• Devin Petersohn +
• EternalLearner42 +
• Flavien Lambert +
• gfyoung
• Gioia Ballin
• jbrockmendel
• Jeff Reback
• Jeremy Schendel
• Johan von Forstner +
• Joris Van den Bossche
• Josh
• Justin Zheng
• Kendall Masse
• Matthew Roeschke
• Max Bolingbroke +
• rbenes +
• Sterling Paramore +
• Tao He +
• Thomas A Caswell
• Tom Augspurger
• Vibhu Agarwal +
• William Ayd
• Zach Angell
Warning: The 0.24.x series of releases will be the last to support Python 2. Future feature releases
will support Python 3 only. See Dropping Python 2.7 for more.
These are the changes in pandas 0.24.1. See release for a full changelog including other versions of pandas.
See What’s new in 0.24.0 (January 25, 2019) for the 0.24.0 changelog.
API changes
The default sort value for Index.union() has changed from True to None (GH24959). The default behavior,
however, remains the same: the result is sorted, unless self and other are identical, either self or other
is empty, or the values cannot be compared (in which case a RuntimeWarning is raised).
Fixed regressions
• Fixed regression in DataFrame.to_dict() with records orient raising an AttributeError when the
DataFrame contained more than 255 columns, or wrongly converting column names that were not valid
python identifiers (GH24939, GH24940).
• Fixed regression in read_sql() when passing certain queries with MySQL/pymysql (GH24988).
• Fixed regression in Index.intersection incorrectly sorting the values by default (GH24959).
• Fixed regression in merge() when merging an empty DataFrame with multiple timezone-aware columns
on one of the timezone-aware columns (GH25014).
• Fixed regression in Series.rename_axis() and DataFrame.rename_axis() where passing None failed
to remove the axis name (GH25034)
• Fixed regression in to_timedelta() with box=False incorrectly returning a datetime64 object instead
of a timedelta64 object (GH24961)
• Fixed regression where custom hashable types could not be used as column keys in DataFrame.
set_index() (GH24969)
Bug fixes
Reshaping
• Bug in DataFrame.groupby() with Grouper when there is a time change (DST) and grouping frequency
is '1d' (GH24972)
Visualization
• Fixed the warning for implicitly registered matplotlib converters not showing. See Restore Matplotlib
datetime converter registration for more (GH24963).
Other
• Fixed AttributeError when printing a DataFrame’s HTML repr after accessing the IPython config
object (GH25036)
Contributors
A total of 7 people contributed patches to this release. People with a “+” by their names contributed a
patch for the first time.
• Alex Buchkovsky
• Roman Yurchak
• h-vetinari
• jbrockmendel
• Jeremy Schendel
• Joris Van den Bossche
• Tom Augspurger
Warning: The 0.24.x series of releases will be the last to support Python 2. Future feature releases
will support Python 3 only. See Dropping Python 2.7 for more details.
This is a major release from 0.23.4 and includes a number of API changes, new features, enhancements, and
performance improvements along with a large number of bug fixes.
Highlights include:
• Optional Integer NA Support
• New APIs for accessing the array backing a Series or Index
• A new top-level method for creating arrays
• Store Interval and Period data in a Series or DataFrame
• Support for joining on two MultiIndexes
Check the API Changes and deprecations before updating.
These are the changes in pandas 0.24.0. See release for a full changelog including other versions of pandas.
Enhancements
Pandas has gained the ability to hold integer dtypes with missing values. This long requested feature is
enabled through the use of extension types.
Note: IntegerArray is currently experimental. Its API or implementation may change without warning.
We can construct a Series with the specified dtype. The dtype string Int64 is a pandas ExtensionDtype.
Specifying a list or array using the traditional missing value marker of np.nan will infer to integer dtype.
The display of the Series will also use the NaN to indicate missing values in string outputs. (GH20700,
GH20747, GH22441, GH21789, GH22346)
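The construction itself is not shown in this excerpt; a minimal sketch consistent with the output below
(assuming the usual imports):
import numpy as np
import pandas as pd

# The capitalized 'Int64' string selects the nullable integer extension dtype
s = pd.Series([1, 2, np.nan], dtype='Int64')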
In [2]: s
Out[2]:
0 1
1 2
2 NaN
Length: 3, dtype: Int64
# comparison
In [4]: s == 1
Out[4]:
0 True
1 False
2 False
Length: 3, dtype: bool
# indexing
In [5]: s.iloc[1:3]
Out[5]:
1 2
2 NaN
Length: 2, dtype: Int64
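Similarly, a DataFrame consistent with the display below could be built along these lines (column values
inferred from the output; illustrative only):
# 'A' holds the nullable-integer Series from above; 'B' and 'C' are plain columns
df = pd.DataFrame({'A': s, 'B': [1, 1, 3], 'C': list('aab')})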
In [9]: df
Out[9]:
A B C
0 1 1 a
1 2 1 a
2 NaN 3 b
[3 rows x 3 columns]
In [10]: df.dtypes
Out[10]:
A Int64
B int64
C object
Length: 3, dtype: object
These dtypes can be merged, reshaped, and cast.
In [11]: pd.concat([df[['A']], df[['B', 'C']]], axis=1).dtypes
Out[11]:
A Int64
B int64
C object
Length: 3, dtype: object
In [12]: df['A'].astype(float)
Out[12]:
0 1.0
1 2.0
2 NaN
Name: A, Length: 3, dtype: float64
Reduction and groupby operations such as sum work.
In [13]: df.sum()
Out[13]:
A 3
B 5
C aab
Length: 3, dtype: object
In [14]: df.groupby('B').A.sum()
Out[14]:
B
1 3
3 0
Name: A, Length: 2, dtype: Int64
Warning: The Integer NA support currently uses the capitalized dtype version, e.g. Int8 as compared
to the traditional int8. This may be changed at a future date.
Series.array and Index.array have been added for extracting the array backing a Series or Index.
(GH19954, GH23623)
In [15]: idx = pd.period_range('2000', periods=4)
In [16]: idx.array
Out[16]:
<PeriodArray>
['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04']
Length: 4, dtype: period[D]
In [17]: pd.Series(idx).array
Out[17]:
<PeriodArray>
['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04']
Length: 4, dtype: period[D]
Historically, this would have been done with series.values, but with .values it was unclear whether the
returned value would be the actual array, some transformation of it, or one of pandas custom arrays (like
Categorical). For example, with PeriodIndex, .values generates a new ndarray of period objects each
time.
In [18]: idx.values
Out[18]:
array([Period('2000-01-01', 'D'), Period('2000-01-02', 'D'),
Period('2000-01-03', 'D'), Period('2000-01-04', 'D')], dtype=object)
In [19]: id(idx.values)
Out[19]: 5645036320
In [20]: id(idx.values)
Out[20]: 5616686224
If you need an actual NumPy array, use Series.to_numpy() or Index.to_numpy().
In [21]: idx.to_numpy()
Out[21]:
array([Period('2000-01-01', 'D'), Period('2000-01-02', 'D'),
Period('2000-01-03', 'D'), Period('2000-01-04', 'D')], dtype=object)
In [22]: pd.Series(idx).to_numpy()
Out[22]:
array([Period('2000-01-01', 'D'), Period('2000-01-02', 'D'),
Period('2000-01-03', 'D'), Period('2000-01-04', 'D')], dtype=object)
For Series and Indexes backed by normal NumPy arrays, Series.array will return a new arrays.
PandasArray, which is a thin (no-copy) wrapper around a numpy.ndarray. PandasArray isn’t especially
useful on its own, but it does provide the same interface as any extension array defined in pandas or by a
third-party library.
In [23]: ser = pd.Series([1, 2, 3])
In [24]: ser.array
Out[24]:
<PandasArray>
[1, 2, 3]
Length: 3, dtype: int64
In [25]: ser.to_numpy()
Out[25]: array([1, 2, 3])
We haven't removed or deprecated Series.values or DataFrame.values, but we highly recommend
using .array or .to_numpy() instead.
See Dtypes and Attributes and Underlying Data for more.
A new top-level method array() has been added for creating 1-dimensional arrays (GH22860). This can
be used to create any extension array, including extension arrays registered by 3rd party libraries. See the
dtypes docs for more on extension arrays.
In [26]: pd.array([1, 2, np.nan], dtype='Int64')
Out[26]:
<IntegerArray>
[1, 2, NaN]
Length: 3, dtype: Int64
On its own, a PandasArray isn't a very useful object, but if you need to write low-level code that works
generically for any ExtensionArray, PandasArray satisfies that need.
Notice that by default, if no dtype is specified, the dtype of the returned array is inferred from the data. In
particular, note that the first example of [1, 2, np.nan] would have returned a floating-point array, since
NaN is a float.
In [29]: pd.array([1, 2, np.nan])
Out[29]:
<PandasArray>
[1.0, 2.0, nan]
Length: 3, dtype: float64
Interval and Period data may now be stored in a Series or DataFrame, in addition to an IntervalIndex
and PeriodIndex like previously (GH19453, GH22862).
In [31]: ser
Out[31]:
0 (0, 1]
1 (1, 2]
2 (2, 3]
3 (3, 4]
4 (4, 5]
Length: 5, dtype: interval
In [32]: ser.dtype
Out[32]: interval[int64]
For periods:
In [33]: pser = pd.Series(pd.period_range("2000", freq="D", periods=5))
In [34]: pser
Out[34]:
0 2000-01-01
1 2000-01-02
2 2000-01-03
3 2000-01-04
4 2000-01-05
Length: 5, dtype: period[D]
In [35]: pser.dtype
Out[35]: period[D]
Previously, these would be cast to a NumPy array with object dtype. In general, this should result in better
performance when storing an array of intervals or periods in a Series or column of a DataFrame.
Use Series.array to extract the underlying array of intervals or periods from the Series:
In [36]: ser.array
Out[36]:
IntervalArray([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]],
closed='right',
dtype='interval[int64]')
In [37]: pser.array
Out[37]:
<PeriodArray>
['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04', '2000-01-05']
Length: 5, dtype: period[D]
These return an instance of arrays.IntervalArray or arrays.PeriodArray, the new extension arrays that
back interval and period data.
Warning: For backwards compatibility, Series.values continues to return a NumPy array of objects
for Interval and Period data. We recommend using Series.array when you need the array of data stored
in the Series, and Series.to_numpy() when you know you need a NumPy array.
See Dtypes and Attributes and Underlying Data for more.
DataFrame.merge() and DataFrame.join() can now be used to join multi-indexed DataFrame instances on
the overlapping index levels (GH6360)
See the Merge, join, and concatenate documentation section.
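The left and right frames are not constructed in this excerpt; a construction consistent with the join
output below would be along these lines (names and values inferred from the output):
index_left = pd.MultiIndex.from_tuples(
    [('K0', 'X0'), ('K0', 'X1'), ('K1', 'X2')], names=['key', 'X'])
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']},
                    index=index_left)
index_right = pd.MultiIndex.from_tuples(
    [('K0', 'Y0'), ('K1', 'Y1'), ('K2', 'Y2')], names=['key', 'Y'])
right = pd.DataFrame({'C': ['C0', 'C1', 'C2'], 'D': ['D0', 'D1', 'D2']},
                     index=index_right)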
In [42]: left.join(right)
Out[42]:
A B C D
key X Y
K0 X0 Y0 A0 B0 C0 D0
X1 Y0 A1 B1 C0 D0
K1 X2 Y1 A2 B2 C1 D1
[3 rows x 4 columns]
read_html Enhancements
read_html() previously ignored colspan and rowspan attributes. Now it understands them, treating them
as sequences of cells with the same value. (GH17054)
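The HTML input is not reproduced in this excerpt; a small table with a colspan, along these lines, would
produce the results shown below (illustrative):
html = """<table>
  <thead><tr><th>A</th><th>B</th><th>C</th></tr></thead>
  <tbody><tr><td colspan="2">1</td><td>2</td></tr></tbody>
</table>"""
result = pd.read_html(html)  # requires an HTML parser such as lxml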
Previous behavior:
In [13]: result
Out [13]:
[ A B C
0 1 2 NaN]
New behavior:
In [45]: result
Out[45]:
[ A B C
0 1 1 2
[1 rows x 3 columns]]
The Styler class has gained a pipe() method. This provides a convenient way to apply users’ predefined
styling functions, and can help reduce “boilerplate” when using DataFrame styling functionality repeatedly
within a notebook. (GH23229)
Similar methods already exist for other classes in pandas, including DataFrame.pipe(), GroupBy.pipe(),
and Resampler.pipe().
DataFrame.rename_axis() now supports index and columns arguments and Series.rename_axis() sup-
ports index argument (GH19978).
This change allows a dictionary to be passed so that some of the names of a MultiIndex can be changed.
Example:
In [49]: mi = pd.MultiIndex.from_product([list('AB'), list('CD'), list('EF')],
....: names=['AB', 'CD', 'EF'])
....:
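The DataFrame itself is not constructed in this excerpt; a construction consistent with the display below
(illustrative):
df = pd.DataFrame({'N': list(range(8))}, index=mi)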
In [51]: df
Out[51]:
N
AB CD EF
A C E 0
F 1
D E 2
F 3
B C E 4
F 5
D E 6
F 7
[8 rows x 1 columns]
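A dictionary can then be passed to rename only some of the level names; for example, renaming the CD
level (the new name 'New' is purely illustrative):
df.rename_axis(index={'CD': 'New'})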
See the Advanced documentation on renaming for more details.
Other enhancements
• merge() now directly allows merge between objects of type DataFrame and named Series, without
the need to convert the Series object into a DataFrame beforehand (GH21220)
• ExcelWriter now accepts mode as a keyword argument, enabling append to existing workbooks when
using the openpyxl engine (GH3441)
• FrozenList has gained the .union() and .difference() methods. This functionality greatly simplifies
groupby operations that rely on explicitly excluding certain columns. See Splitting an object into groups
for more information (GH15475, GH15506).
• DataFrame.to_parquet() now accepts index as an argument, allowing the user to override the engine’s
default behavior to include or omit the dataframe’s indexes from the resulting Parquet file. (GH20768)
• read_feather() now accepts columns as an argument, allowing the user to specify which columns
should be read. (GH24025)
• DataFrame.corr() and Series.corr() now accept a callable for generic calculation methods of cor-
relation, e.g. histogram intersection (GH22684)
• DataFrame.to_string() now accepts decimal as an argument, allowing the user to specify which
decimal separator should be used in the output. (GH23614)
• DataFrame.to_html() now accepts render_links as an argument, allowing the user to generate
HTML with links to any URLs that appear in the DataFrame. See the section on writing HTML
in the IO docs for example usage. (GH2679)
• pandas.read_csv() now supports pandas extension types as an argument to dtype, allowing the user
to use pandas extension types when reading CSVs. (GH23228)
• The shift() method now accepts fill_value as an argument, allowing the user to specify a value which
will be used instead of NA/NaT in the empty periods. (GH15486)
• to_datetime() now supports the %Z and %z directive when passed into format (GH13486)
• Series.mode() and DataFrame.mode() now support the dropna parameter which can be used to
specify whether NaN/NaT values should be considered (GH17534)
• DataFrame.to_csv() and Series.to_csv() now support the compression keyword when a file handle
is passed. (GH21227)
• Index.droplevel() is now implemented also for flat indexes, for compatibility with MultiIndex
(GH21115)
• Series.droplevel() and DataFrame.droplevel() are now implemented (GH20342)
• Added support for reading from/writing to Google Cloud Storage via the gcsfs library (GH19454,
GH23094)
• DataFrame.to_gbq() and read_gbq() signature and documentation updated to reflect changes from
the Pandas-GBQ library version 0.8.0. Adds a credentials argument, which enables the use of any
kind of google-auth credentials. (GH21627, GH22557, GH23662)
• New method HDFStore.walk() will recursively walk the group hierarchy of an HDF5 file (GH10932)
• read_html() copies cell data across colspan and rowspan, and it treats all-th table rows as headers
if header kwarg is not given and there is no thead (GH17054)
• Series.nlargest(), Series.nsmallest(), DataFrame.nlargest(), and DataFrame.nsmallest()
now accept the value "all" for the keep argument. This keeps all ties for the nth largest/smallest
value (GH16818)
• IntervalIndex has gained the set_closed() method to change the existing closed value (GH21670)
• DataFrame.to_csv(), Series.to_csv(), DataFrame.to_json(), and Series.to_json() now support
compression='infer' to infer compression based on filename extension (GH15008). The default
compression for the to_csv, to_json, and to_pickle methods has been updated to 'infer' (GH22004).
• DataFrame.to_sql() now supports writing TIMESTAMP WITH TIME ZONE types for supported databases.
For databases that don’t support timezones, datetime data will be stored as timezone unaware local
timestamps. See the Datetime data types for implications (GH9086).
• pandas.DataFrame.to_sql() has gained the method argument to control SQL insertion clause. See
the insertion method section in the documentation. (GH8953)
• DataFrame.corrwith() now supports Spearman’s rank correlation, Kendall’s tau as well as callable
correlation methods. (GH21925)
• DataFrame.to_json(), DataFrame.to_csv(), DataFrame.to_pickle(), and other export methods
now support tilde(~) in path argument. (GH23473)
We have updated our minimum supported versions of dependencies (GH21242, GH18742, GH23774,
GH24767). If installed, we now require:
Additionally we no longer depend on feather-format for feather based storage and replaced it with refer-
ences to pyarrow (GH21639 and GH23053).
DataFrame.to_csv() now uses os.linesep rather than '\n' for the default line terminator (GH20353).
This change only affects Windows, where '\r\n' was previously used as the line terminator even when
'\n' was passed in line_terminator.
Previous behavior on Windows:
On Windows, the value of os.linesep is '\r\n', so if line_terminator is not set, '\r\n' is used for line
terminator.
For file objects, specifying newline is not sufficient to set the line terminator. You must pass in the
line_terminator explicitly, even in this case.
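A sketch of the explicit approach, with an illustrative file name:
df = pd.DataFrame({'a': [1, 2]})
with open('out.csv', mode='w', newline='') as f:
    # newline='' on the file object is not enough; pass line_terminator explicitly
    df.to_csv(f, line_terminator='\n')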
Proper handling of np.NaN in a string data-typed column with the Python engine
There was a bug in read_excel() and read_csv() with the Python engine, where missing values turned into
'nan' with dtype=str and na_filter=True. Now, these missing values are converted to the string missing
indicator, np.nan. (GH20377)
Previous behavior:
New behavior:
Notice how we now output np.nan itself instead of a stringified form of it.
Previously, parsing datetime strings with UTC offsets with to_datetime() or DatetimeIndex would au-
tomatically convert the datetime to UTC without timezone localization. This is inconsistent with parsing
the same datetime string with Timestamp which would preserve the UTC offset in the tz attribute. Now,
to_datetime() preserves the UTC offset in the tz attribute when all the datetime strings have the same
UTC offset (GH17697, GH11736, GH22457)
Previous behavior:
# Different UTC offsets would automatically convert the datetimes to UTC (without a UTC timezone)
New behavior:
In [56]: pd.to_datetime("2015-11-18 15:30:00+05:30")
Out[56]: Timestamp('2015-11-18 15:30:00+0530', tz='pytz.FixedOffset(330)')
Parsing datetime strings with the same UTC offset will preserve the UTC offset in the tz attribute.
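The corresponding example is not shown in this excerpt; a sketch with illustrative values:
pd.to_datetime(["2015-11-18 15:30:00+05:30",
                "2015-11-18 16:30:00+05:30"])
# DatetimeIndex(['2015-11-18 15:30:00+05:30', '2015-11-18 16:30:00+05:30'],
#               dtype='datetime64[ns, pytz.FixedOffset(330)]', freq=None)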
Parsing datetime strings with different UTC offsets will now create an Index of datetime.datetime objects
with different UTC offsets
In [59]: idx = pd.to_datetime(["2015-11-18 15:30:00+05:30",
....: "2015-11-18 16:30:00+06:30"])
....:
In [60]: idx
Out[60]: Index([2015-11-18 15:30:00+05:30, 2015-11-18 16:30:00+06:30], dtype='object')
In [61]: idx[0]
Out[61]: datetime.datetime(2015, 11, 18, 15, 30, tzinfo=tzoffset(None, 19800))
In [62]: idx[1]
Out[62]: datetime.datetime(2015, 11, 18, 16, 30, tzinfo=tzoffset(None, 23400))
Passing utc=True will mimic the previous behavior but will correctly indicate that the dates have been
converted to UTC
>>> import io
>>> content = """\
... a
... 2000-01-01T00:00:00+05:00
... 2000-01-01T00:00:00+06:00"""
>>> df = pd.read_csv(io.StringIO(content), parse_dates=['a'])
>>> df.a
0 1999-12-31 19:00:00
1 1999-12-31 18:00:00
Name: a, dtype: datetime64[ns]
New behavior
In [64]: import io
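The intermediate steps are not shown in this excerpt; presumably the same CSV content parsed without a
custom date parser, e.g.:
df = pd.read_csv(io.StringIO(content), parse_dates=['a'])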
In [67]: df.a
Out[67]:
0 2000-01-01 00:00:00+05:00
1 2000-01-01 00:00:00+06:00
Name: a, Length: 2, dtype: object
As can be seen, the dtype is object; each value in the column is a string. To convert the strings to an array
of datetimes, the date_parser argument can be used:
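The call itself is not shown here; presumably something along these lines (reusing the content string
defined earlier):
df = pd.read_csv(io.StringIO(content), parse_dates=['a'],
                 date_parser=lambda col: pd.to_datetime(col, utc=True))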
In [69]: df.a
Out[69]:
0 1999-12-31 19:00:00+00:00
1 1999-12-31 18:00:00+00:00
Name: a, Length: 2, dtype: datetime64[ns, UTC]
The time values in Period and PeriodIndex objects are now set to ‘23:59:59.999999999’ when calling Series.
dt.end_time, Period.end_time, PeriodIndex.end_time, Period.to_timestamp() with how='end', or
PeriodIndex.to_timestamp() with how='end' (GH17157)
Previous behavior:
In [4]: pd.Series(pi).dt.end_time[0]
Out[4]: Timestamp(2017-01-01 00:00:00)
In [5]: p.end_time
Out[5]: Timestamp(2017-01-01 23:59:59.999999999)
New behavior:
Calling Series.dt.end_time will now result in a time of ‘23:59:59.999999999’ as is the case with Period.
end_time, for example
In [70]: p = pd.Period('2017-01-01', 'D')
In [71]: pi = pd.PeriodIndex([p])
In [72]: pd.Series(pi).dt.end_time[0]
Out[72]: Timestamp('2017-01-01 23:59:59.999999999')
In [73]: p.end_time
Out[73]: Timestamp('2017-01-01 23:59:59.999999999')
The return type of Series.unique() for datetime with timezone values has changed from a numpy.ndarray
of Timestamp objects to an arrays.DatetimeArray (GH24024).
Previous behavior:
In [3]: ser.unique()
Out[3]: array([Timestamp('2000-01-01 00:00:00+0000', tz='UTC')], dtype=object)
New behavior:
In [75]: ser.unique()
Out[75]:
<DatetimeArray>
['2000-01-01 00:00:00+00:00']
Length: 1, dtype: datetime64[ns, UTC]
SparseArray, the array backing SparseSeries and the columns in a SparseDataFrame, is now an extension
array (GH21978, GH19056, GH22835). To conform to this interface and for consistency with the rest of
pandas, some API breaking changes were made:
• SparseArray is no longer a subclass of numpy.ndarray. To convert a SparseArray to a NumPy array,
use numpy.asarray().
• SparseArray.dtype and SparseSeries.dtype are now instances of SparseDtype, rather than np.
dtype. Access the underlying dtype with SparseDtype.subtype.
• numpy.asarray(sparse_array) now returns a dense array with all the values, not just the non-fill-
value values (GH14167)
• SparseArray.take now matches the API of pandas.api.extensions.ExtensionArray.take()
(GH19506):
– The default value of allow_fill has changed from False to True.
– The out and mode parameters are no longer accepted (previously, this raised if they were
specified).
– Passing a scalar for indices is no longer allowed.
• The result of concat() with a mix of sparse and dense Series is a Series with sparse values, rather
than a SparseSeries.
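The input used in the original session is not shown here; a hypothetical construction that yields the same
density (3 non-fill values out of 5, with a fill value of 0):
s = pd.Series(pd.SparseArray([0, 1, 1, 0, 1]))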
In [77]: s.sparse.density
Out[77]: 0.6
Previously, when sparse=True was passed to get_dummies(), the return value could be either a DataFrame
or a SparseDataFrame, depending on whether all or just a subset of the columns were dummy-encoded.
Now, a DataFrame is always returned (GH24284).
Previous behavior
The first get_dummies() call returned a DataFrame because the column A was not dummy-encoded. When just
["B", "C"] were passed to get_dummies, all the columns were dummy-encoded, and a SparseDataFrame
was returned.
In [2]: df = pd.DataFrame({"A": [1, 2], "B": ['a', 'b'], "C": ['a', 'a']})
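The calls illustrating the old inconsistency are not reproduced here; they were presumably along these lines:
# Previously a plain DataFrame, since column 'A' is not dummy-encoded:
type(pd.get_dummies(df, sparse=True))
# Previously a SparseDataFrame, since every requested column is dummy-encoded:
type(pd.get_dummies(df[['B', 'C']], sparse=True))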
New behavior
Now, the return type is consistently a DataFrame.
In [78]: type(pd.get_dummies(df, sparse=True))
Out[78]: pandas.core.frame.DataFrame
Note: There’s no difference in memory usage between a SparseDataFrame and a DataFrame with sparse
values. The memory usage will be the same as in the previous version of pandas.
DataFrame.to_dict() now raises a ValueError when used with orient='index' and a non-unique index,
instead of losing data (GH22801)
In [80]: df = pd.DataFrame({'a': [1, 2], 'b': [0.5, 0.75]}, index=['A', 'A'])
In [81]: df
Out[81]:
a b
A 1 0.50
A 2 0.75
[2 rows x 2 columns]
In [82]: df.to_dict(orient='index')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-82-f5309a7c6adb> in <module>
----> 1 df.to_dict(orient='index')
Creating a Tick object (Day, Hour, Minute, Second, Milli, Micro, Nano) with normalize=True is no longer
supported. This prevents unexpected behavior where addition could fail to be monotone or associative.
(GH21427)
Previous behavior:
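The objects are not constructed in this excerpt; presumably along these lines (values inferred from the
output below):
ts = pd.Timestamp('2018-06-11 18:01:14')
tic = pd.offsets.Hour(n=2, normalize=True)   # allowed before this change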
In [3]: ts
Out[3]: Timestamp('2018-06-11 18:01:14')
In [6]: ts + tic
Out[6]: Timestamp('2018-06-11 00:00:00')
New behavior:
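The new session is not reproduced here; a minimal sketch:
pd.offsets.Hour(n=2, normalize=True)   # now raises, since normalize=True is no longer allowed for Tick offsets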
Period subtraction
Subtraction of a Period from another Period will give a DateOffset instead of an integer (GH21314)
Previous behavior:
New behavior:
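Neither session is reproduced in this excerpt; a sketch with illustrative periods:
pd.Period('2019-03', freq='M') - pd.Period('2019-01', freq='M')
# now returns <2 * MonthEnds>; previously this returned the integer 2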
Similarly, subtraction of a Period from a PeriodIndex will now return an Index of DateOffset objects
instead of an Int64Index
Previous behavior:
In [3]: pi - pi[0]
Out[3]: Int64Index([0, 1, 2], dtype='int64')
New behavior:
In [90]: pi - pi[0]
Out[90]: Index([<0 * MonthEnds>, <MonthEnd>, <2 * MonthEnds>], dtype='object')
Adding or subtracting NaN from a DataFrame column with timedelta64[ns] dtype will now raise a
TypeError instead of returning all-NaT. This is for compatibility with TimedeltaIndex and Series behavior
(GH22163)
In [91]: df = pd.DataFrame([pd.Timedelta(days=1)])
In [92]: df
Out[92]:
0
0 1 days
[1 rows x 1 columns]
Previous behavior:
In [4]: df = pd.DataFrame([pd.Timedelta(days=1)])
In [5]: df - np.nan
Out[5]:
0
0 NaT
New behavior:
In [2]: df - np.nan
...
TypeError: unsupported operand type(s) for -: 'TimedeltaIndex' and 'float'
Previously, the broadcasting behavior of DataFrame comparison operations (==, !=, …) was inconsistent with
the behavior of arithmetic operations (+, -, …). The behavior of the comparison operations has been changed
to match the arithmetic operations in these cases. (GH22880)
The affected cases are:
• operating against a 2-dimensional np.ndarray with either 1 row or 1 column will now broadcast the
same way a np.ndarray would (GH23000).
• a list or tuple with length matching the number of rows in the DataFrame will now raise ValueError
instead of operating column-by-column (GH22880).
• a list or tuple with length matching the number of columns in the DataFrame will now operate row-
by-row instead of raising ValueError (GH22880).
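The array used in the session below is constructed later in this section as np.arange(6).reshape(3, 2);
for completeness:
import numpy as np
arr = np.arange(6).reshape(3, 2)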
In [94]: df = pd.DataFrame(arr)
In [95]: df
Out[95]:
0 1
0 0 1
1 2 3
2 4 5
[3 rows x 2 columns]
Previous behavior:
In [5]: df == arr[[0], :]
...: # comparison previously broadcast where arithmetic would raise
Out[5]:
0 1
0 True True
1 False False
2 False False
In [6]: df + arr[[0], :]
...
ValueError: Unable to coerce to DataFrame, shape must be (3, 2): given (1, 2)
In [7]: df == (1, 2)
...: # length matches number of columns;
...: # comparison previously raised where arithmetic would broadcast
...
ValueError: Invalid broadcasting comparison [(1, 2)] with block values
In [8]: df + (1, 2)
Out[8]:
0 1
0 1 3
1 3 5
2 5 7
In [9]: df == (1, 2, 3)
...: # length matches number of rows
...: # comparison previously broadcast where arithmetic would raise
Out[9]:
0 1
0 False True
1 True False
2 False False
In [10]: df + (1, 2, 3)
...
ValueError: Unable to coerce to Series, length must be 2: given 3
New behavior:
# Comparison operations and arithmetic operations both broadcast.
In [96]: df == arr[[0], :]
Out[96]:
0 1
0 True True
1 False False
2 False False
[3 rows x 2 columns]
In [97]: df + arr[[0], :]
Out[97]:
0 1
0 0 2
1 2 4
2 4 6
[3 rows x 2 columns]
# Comparison operations and arithmetic operations both broadcast.
In [98]: df == (1, 2)
Out[98]:
0 1
0 False False
1 False False
2 False False
[3 rows x 2 columns]
In [99]: df + (1, 2)
Out[99]:
0 1
0 1 3
1 3 5
2 5 7
[3 rows x 2 columns]
# Comparison operations and arithmetic operations both raise ValueError.
In [6]: df == (1, 2, 3)
...
ValueError: Unable to coerce to Series, length must be 2: given 3
In [7]: df + (1, 2, 3)
...
ValueError: Unable to coerce to Series, length must be 2: given 3
DataFrame arithmetic operations with 2-dimensional np.ndarray objects now broadcast in the same way
that np.ndarray objects broadcast (GH23000).
In [100]: arr = np.arange(6).reshape(3, 2)
In [101]: df = pd.DataFrame(arr)
In [102]: df
Out[102]:
0 1
0 0 1
1 2 3
2 4 5
[3 rows x 2 columns]
Previous behavior:
New behavior:
In [103]: df + arr[[0], :] # 1 row, 2 columns
Out[103]:
0 1
0 0 2
1 2 4
2 4 6
[3 rows x 2 columns]
Series and Index constructors now raise when the data is incompatible with a passed dtype= (GH15832)
Previous behavior:
New behavior:
Concatenation Changes
Calling pandas.concat() on a Categorical of ints with NA values now causes them to be processed as
objects when concatenating with anything other than another Categorical of ints (GH19214)
Previous behavior
New behavior
• For DatetimeIndex and TimedeltaIndex with non-None freq attribute, addition or subtraction of
integer-dtyped array or Index will return an object of the same class (GH19959)
• DateOffset objects are now immutable. Attempting to alter one of these will now raise
AttributeError (GH21341)
• PeriodIndex subtraction of another PeriodIndex will now return an object-dtype Index of
DateOffset objects instead of raising a TypeError (GH20049)
• cut() and qcut() now returns a DatetimeIndex or TimedeltaIndex bins when the input is datetime
or timedelta dtype respectively and retbins=True (GH19891)
• A newly constructed empty DataFrame with integer as the dtype will now only be cast to float64 if
index is specified (GH22858)
• Series.str.cat() will now raise if others is a set (GH23009)
• Passing scalar values to DatetimeIndex or TimedeltaIndex will now raise TypeError instead of
ValueError (GH23539)
• max_rows and max_cols parameters removed from HTMLFormatter since truncation is handled by
DataFrameFormatter (GH23818)
• read_csv() will now raise a ValueError if a column with missing values is declared as having dtype
bool (GH20591)
• The column order of the resultant DataFrame from MultiIndex.to_frame() is now guaranteed to
match the MultiIndex.names order. (GH22420)
• Incorrectly passing a DatetimeIndex to MultiIndex.from_tuples(), rather than a sequence of tuples,
now raises a TypeError rather than a ValueError (GH24024)
• pd.offsets.generate_range() argument time_rule has been removed; use offset instead
(GH24157)
• In 0.23.x, pandas would raise a ValueError on a merge of a numeric column (e.g. int dtyped column)
and an object dtyped column (GH9780). We have re-enabled the ability to merge object and other
dtypes; pandas will still raise on a merge between a numeric and an object dtyped column that is
composed only of strings (GH21681)
• Accessing a level of a MultiIndex with a duplicate name (e.g. in get_level_values()) now raises a
ValueError instead of a KeyError (GH21678).
• Invalid construction of IntervalDtype will now always raise a TypeError rather than a ValueError
if the subdtype is invalid (GH21185)
• Trying to reindex a DataFrame with a non unique MultiIndex now raises a ValueError instead of an
Exception (GH21770)
• Index subtraction will attempt to operate element-wise instead of raising TypeError (GH19369)
• pandas.io.formats.style.Styler supports a number-format property when using to_excel()
(GH22015)
• DataFrame.corr() and Series.corr() now raise a ValueError along with a helpful error message
instead of a KeyError when supplied with an invalid method (GH22298)
• shift() will now always return a copy, instead of the previous behaviour of returning self when shifting
by 0 (GH22397)
• DataFrame.set_index() now gives a better (and less frequent) KeyError, raises a ValueError for
incorrect types, and will not fail on duplicate column names with drop=True. (GH22484)
• Slicing a single row of a DataFrame with multiple ExtensionArrays of the same type now preserves the
dtype, rather than coercing to object (GH22784)
• DateOffset attribute _cacheable and method _should_cache have been removed (GH23118)
• Series.searchsorted(), when supplied a scalar value to search for, now returns a scalar instead of
an array (GH23801).
• Categorical.searchsorted(), when supplied a scalar value to search for, now returns a scalar instead
of an array (GH23466).
• Categorical.searchsorted() now raises a KeyError rather that a ValueError, if a searched for key
is not found in its categories (GH23466).
• Index.hasnans() and Series.hasnans() now always return a python boolean. Previously, a python
or a numpy boolean could be returned, depending on circumstances (GH23294).
• The order of the arguments of DataFrame.to_html() and DataFrame.to_string() is rearranged to
be consistent with each other. (GH23614)
• CategoricalIndex.reindex() now raises a ValueError if the target index is non-unique and not
equal to the current index. It previously only raised if the target index was not of a categorical dtype
(GH23963).
• Series.to_list() and Index.to_list() are now aliases of Series.tolist and Index.tolist,
respectively (GH8826)
• The result of SparseSeries.unstack is now a DataFrame with sparse values, rather than a
SparseDataFrame (GH24372).
• DatetimeIndex and TimedeltaIndex no longer ignore the dtype precision. Passing a non-nanosecond
resolution dtype will raise a ValueError (GH24753)
Deprecations
• MultiIndex.labels has been deprecated and replaced by MultiIndex.codes, making the MultiIndex API
more similar to the API for CategoricalIndex (GH13443). As a consequence, other uses of the name
labels in MultiIndex have also been deprecated and replaced with codes:
– You should initialize a MultiIndex instance using a parameter named codes rather than labels.
– MultiIndex.set_labels has been deprecated in favor of MultiIndex.set_codes().
– For method MultiIndex.copy(), the labels parameter has been deprecated and replaced by a
codes parameter.
• DataFrame.to_stata(), read_stata(), StataReader and StataWriter have deprecated the
encoding argument. The encoding of a Stata dta file is determined by the file type and cannot
be changed (GH21244)
• MultiIndex.to_hierarchical() is deprecated and will be removed in a future version (GH21613)
• Series.ptp() is deprecated. Use numpy.ptp instead (GH21614)
• Series.compress() is deprecated. Use Series[condition] instead (GH18262)
• The signature of Series.to_csv() has been uniformed to that of DataFrame.to_csv(): the name of
the first argument is now path_or_buf, the order of subsequent arguments has changed, the header
argument now defaults to True. (GH19715)
• Categorical.from_codes() has deprecated providing float values for the codes argument. (GH21767)
• pandas.read_table() is deprecated. Instead, use read_csv() passing sep='\t' if necessary. This
deprecation has been removed in 0.25.0. (GH21948)
• Series.str.cat() has deprecated using arbitrary list-likes within list-likes. A list-like container may
still contain many Series, Index or 1-dimensional np.ndarray, or alternatively, only scalar values.
(GH21950)
• FrozenNDArray.searchsorted() has deprecated the v parameter in favor of value (GH14645)
• DatetimeIndex.shift() and PeriodIndex.shift() now accept periods argument instead of n
for consistency with Index.shift() and Series.shift(). Using n throws a deprecation warning
(GH22458, GH22912)
• The fastpath keyword of the different Index constructors is deprecated (GH23110).
• Timestamp.tz_localize(), DatetimeIndex.tz_localize(), and Series.tz_localize() have dep-
recated the errors argument in favor of the nonexistent argument (GH8917)
• The class FrozenNDArray has been deprecated. When unpickling, FrozenNDArray will be unpickled
to np.ndarray once this class is removed (GH9031)
• The methods DataFrame.update() and Panel.update() have deprecated the
raise_conflict=False|True keyword in favor of errors='ignore'|'raise' (GH23585)
• The methods Series.str.partition() and Series.str.rpartition() have deprecated the pat key-
word in favor of sep (GH22676)
• Deprecated the nthreads keyword of pandas.read_feather() in favor of use_threads to reflect the
changes in pyarrow>=0.11.0. (GH23053)
• pandas.read_excel() has deprecated accepting usecols as an integer. Please pass in a list of ints
from 0 to usecols inclusive instead (GH23527)
• Constructing a TimedeltaIndex from data with datetime64-dtyped data is deprecated, will raise
TypeError in a future version (GH23539)
• Constructing a DatetimeIndex from data with timedelta64-dtyped data is deprecated, will raise
TypeError in a future version (GH23675)
In the past, users could—in some cases—add or subtract integers or integer-dtype arrays from Timestamp,
DatetimeIndex and TimedeltaIndex.
This usage is now deprecated. Instead add or subtract integer multiples of the object’s freq attribute
(GH21939, GH23878).
Previous behavior:
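The timestamp construction is not shown in this excerpt; the deprecated usage was presumably along these
lines (values inferred from the output below):
ts = pd.Timestamp('1994-05-06 12:15:16', freq='H')
ts + 2   # previously returned Timestamp('1994-05-06 14:15:16', freq='H'); now deprecated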
New behavior:
In [109]: ts + 2 * ts.freq
Out[109]: Timestamp('1994-05-06 14:15:16', freq='H')
The behavior of DatetimeIndex when passed integer data and a timezone is changing in a future version of
pandas. Previously, these were interpreted as wall times in the desired timezone. In the future, these will
be interpreted as wall times in UTC, which are then converted to the desired timezone (GH24559).
The default behavior remains the same, but issues a warning:
pd.to_datetime(integer_data, utc=True).tz_convert(tz)
pd.to_datetime(integer_data).tz_localize(tz)
Out[3]: DatetimeIndex(['2000-01-01 00:00:00-06:00'], dtype='datetime64[ns, US/Central]', freq=None)
As the warning message explains, opt in to the future behavior by specifying that the integer values are
UTC, and then converting to the final timezone:
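The corresponding call is not shown in this excerpt; presumably something like the following, using the
same integer value as in the example below:
pd.to_datetime([946684800000000000], utc=True).tz_convert('US/Central')
# DatetimeIndex(['1999-12-31 18:00:00-06:00'], dtype='datetime64[ns, US/Central]', freq=None)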
The old behavior can be retained by localizing directly to the final timezone:
In [115]: pd.to_datetime([946684800000000000]).tz_localize('US/Central')
Out[115]: DatetimeIndex(['2000-01-01 00:00:00-06:00'], dtype='datetime64[ns, US/Central]', freq=None)
The conversion from a Series or Index with timezone-aware datetime data will change to preserve timezones
by default (GH23569).
NumPy doesn’t have a dedicated dtype for timezone-aware datetimes. In the past, converting a Series or
DatetimeIndex with timezone-aware datetimes would convert to a NumPy array by
1. converting the tz-aware data to UTC
2. dropping the timezone-info
3. returning a numpy.ndarray with datetime64[ns] dtype
Future versions of pandas will preserve the timezone information by returning an object-dtype NumPy array
where each value is a Timestamp with the correct timezone attached
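The Series is not constructed in this excerpt; a construction consistent with the display below
(illustrative):
ser = pd.Series(pd.date_range('2000', periods=2, tz='CET'))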
In [117]: ser
Out[117]:
0 2000-01-01 00:00:00+01:00
1 2000-01-02 00:00:00+01:00
Length: 2, dtype: datetime64[ns, CET]
In [8]: np.asarray(ser)
/bin/ipython:1: FutureWarning: Converting timezone-aware DatetimeArray to timezone-naive
ndarray with 'datetime64[ns]' dtype. In the future, this will return an ndarray
with 'object' dtype where each element is a 'pandas.Timestamp' with the correct 'tz'.
The previous or future behavior can be obtained, without any warnings, by specifying the dtype
Previous behavior
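The previous-behavior call is not shown in this excerpt; presumably the datetime64[ns] conversion, whose
result matches Series.to_numpy(dtype="datetime64[ns]") shown further below:
np.asarray(ser, dtype='datetime64[ns]')
# array(['1999-12-31T23:00:00.000000000', '2000-01-01T23:00:00.000000000'],
#       dtype='datetime64[ns]')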
Future behavior
# New behavior
In [119]: np.asarray(ser, dtype=object)
Out[119]:
array([Timestamp('2000-01-01 00:00:00+0100', tz='CET', freq='D'),
Timestamp('2000-01-02 00:00:00+0100', tz='CET', freq='D')],
dtype=object)
Or by using Series.to_numpy()
In [120]: ser.to_numpy()
Out[120]:
array([Timestamp('2000-01-01 00:00:00+0100', tz='CET', freq='D'),
Timestamp('2000-01-02 00:00:00+0100', tz='CET', freq='D')],
dtype=object)
In [121]: ser.to_numpy(dtype="datetime64[ns]")
Out[121]:
array(['1999-12-31T23:00:00.000000000', '2000-01-01T23:00:00.000000000'],
dtype='datetime64[ns]')
All the above applies to a DatetimeIndex with tz-aware values as well.
Performance improvements
• Slicing Series and DataFrames with a monotonically increasing CategoricalIndex is now very fast
and has speed comparable to slicing with an Int64Index. The speed increase applies both when indexing
by label (using .loc) and by position (.iloc) (GH20395). Slicing a monotonically increasing
CategoricalIndex itself (i.e. ci[1000:2000]) shows similar speed improvements (GH21659)
• Improved performance of CategoricalIndex.equals() when comparing to another
CategoricalIndex (GH24023)
• Improved performance of Series.describe() in case of numeric dtypes (GH21274)
• Improved performance of pandas.core.groupby.GroupBy.rank() when dealing with tied rankings
(GH21237)
• Improved performance of DataFrame.set_index() with columns consisting of Period objects
(GH21582, GH21606)
• Improved performance of Series.at() and Index.get_value() for Extension Arrays values (e.g.
Categorical) (GH24204)
• Improved performance of membership checks in Categorical and CategoricalIndex (i.e. x in cat-
style checks are much faster). CategoricalIndex.contains() is likewise much faster (GH21369,
GH21508)
• Improved performance of HDFStore.groups() and dependent functions like HDFStore.keys() (i.e.
x in store checks are much faster) (GH21372)
• Improved the performance of pandas.get_dummies() with sparse=True (GH21997)
• Improved performance of IndexEngine.get_indexer_non_unique() for sorted, non-unique indexes
(GH9466)
• Improved performance of PeriodIndex.unique() (GH23083)
• Improved performance of concat() for Series objects (GH23404)
Bug fixes
Categorical
• Bug in Categorical.from_codes() where NaN values in codes were silently converted to 0 (GH21767).
In the future this will raise a ValueError. Also changes the behavior of .from_codes([1.1, 2.0]).
• Bug in Categorical.sort_values() where NaN values were always positioned in front regardless of
na_position value. (GH22556).
• Bug when indexing with a boolean-valued Categorical. Now a boolean-valued Categorical is treated
as a boolean mask (GH22665)
• Constructing a CategoricalIndex with empty values and boolean categories was raising a ValueError
after a change to dtype coercion (GH22702).
• Bug in Categorical.take() with a user-provided fill_value not encoding the fill_value, which
could result in a ValueError, incorrect results, or a segmentation fault (GH23296).
• In Series.unstack(), specifying a fill_value not present in the categories now raises a TypeError
rather than ignoring the fill_value (GH23284)
• Bug in DataFrame.resample() when aggregating on categorical data, where the categorical
dtype was getting lost (GH23227)
• Bug in many methods of the .str-accessor, which always failed on calling the CategoricalIndex.str
constructor (GH23555, GH23556)
• Bug in Series.where() losing the categorical dtype for categorical data (GH24077)
• Bug in Categorical.apply() where NaN values could be handled unpredictably. They now remain
unchanged (GH24241)
• Bug in Categorical comparison methods incorrectly raising ValueError when operating against a
DataFrame (GH24630)
• Bug in Categorical.set_categories() where setting fewer new categories with rename=True caused
a segmentation fault (GH24675)
Datetimelike
• Fixed bug where two DateOffset objects with different normalize attributes could evaluate as equal
(GH21404)
• Fixed bug where Timestamp.resolution() incorrectly returned 1-microsecond timedelta instead of
1-nanosecond Timedelta (GH21336, GH21365)
• Bug in to_datetime() that did not consistently return an Index when box=True was specified
(GH21864)
• Bug in DatetimeIndex comparisons where string comparisons incorrectly raises TypeError (GH22074)
• Bug in DatetimeIndex comparisons when comparing against timedelta64[ns] dtyped arrays; in some
cases TypeError was incorrectly raised, in others it incorrectly failed to raise (GH22074)
• Bug in DatetimeIndex comparisons when comparing against object-dtyped arrays (GH22074)
• Bug in DataFrame with datetime64[ns] dtype addition and subtraction with Timedelta-like objects
(GH22005, GH22163)
• Bug in DataFrame with datetime64[ns] dtype addition and subtraction with DateOffset objects
returning an object dtype instead of datetime64[ns] dtype (GH21610, GH22163)
• Bug in DataFrame with datetime64[ns] dtype comparing against NaT incorrectly (GH22242,
GH22163)
• Bug in DataFrame with datetime64[ns] dtype subtracting Timestamp-like object incorrectly returned
datetime64[ns] dtype instead of timedelta64[ns] dtype (GH8554, GH22163)
• Bug in DataFrame with datetime64[ns] dtype subtracting np.datetime64 object with non-
nanosecond unit failing to convert to nanoseconds (GH18874, GH22163)
• Bug in DataFrame comparisons against Timestamp-like objects failing to raise TypeError for inequality
checks with mismatched types (GH8932, GH22163)
• Bug in DataFrame with mixed dtypes including datetime64[ns] incorrectly raising TypeError on
equality comparisons (GH13128, GH22163)
• Bug in DataFrame.values returning a DatetimeIndex for a single-column DataFrame with tz-aware
datetime values. Now a 2-D numpy.ndarray of Timestamp objects is returned (GH24024)
• Bug in DataFrame.eq() comparison against NaT incorrectly returning True or NaN (GH15697,
GH22163)
• Bug in DatetimeIndex subtraction that incorrectly failed to raise OverflowError (GH22492,
GH22508)
• Bug in DatetimeIndex incorrectly allowing indexing with Timedelta object (GH20464)
• Bug in DatetimeIndex where frequency was being set if original frequency was None (GH22150)
• Bug in rounding methods of DatetimeIndex (round(), ceil(), floor()) and Timestamp (round(),
ceil(), floor()) could give rise to loss of precision (GH22591)
• Bug in to_datetime() with an Index argument that would drop the name from the result (GH21697)
• Bug in PeriodIndex where adding or subtracting a timedelta or Tick object produced incorrect
results (GH22988)
• Bug in the Series repr with period-dtype data missing a space before the data (GH23601)
• Bug in date_range() when decrementing a start date to a past end date by a negative frequency
(GH23270)
• Bug in Series.min() which would return NaN instead of NaT when called on a series of NaT (GH23282)
• Bug in Series.combine_first() not properly aligning categoricals, so that missing values in self
were not filled by valid values from other (GH24147)
• Bug in DataFrame.combine() with datetimelike values raising a TypeError (GH23079)
• Bug in date_range() with frequency of Day or higher where dates sufficiently far in the future could
wrap around to the past instead of raising OutOfBoundsDatetime (GH14187)
• Bug in period_range() ignoring the frequency of start and end when those are provided as Period
objects (GH20535).
• Bug in PeriodIndex with attribute freq.n greater than 1 where adding a DateOffset object would
return incorrect results (GH23215)
• Bug in Series that interpreted string indices as lists of characters when setting datetimelike values
(GH23451)
• Bug in DataFrame when creating a new column from an ndarray of Timestamp objects with timezones
creating an object-dtype column, rather than datetime with timezone (GH23932)
• Bug in Timestamp constructor which would drop the frequency of an input Timestamp (GH22311)
• Bug in DatetimeIndex where calling np.array(dtindex, dtype=object) would incorrectly return
an array of long objects (GH23524)
• Bug in Index where passing a timezone-aware DatetimeIndex and dtype=object would incorrectly raise
a ValueError (GH23524)
• Bug in Index where calling np.array(dtindex, dtype=object) on a timezone-naive DatetimeIndex
would return an array of datetime objects instead of Timestamp objects, potentially losing nanosecond
portions of the timestamps (GH23524)
• Bug in Categorical.__setitem__ not allowing setting with another Categorical when both are
unordered and have the same categories, but in a different order (GH24142)
• Bug in date_range() where using dates with millisecond resolution or higher could return incorrect
values or the wrong number of values in the index (GH24110)
• Bug in DatetimeIndex where constructing a DatetimeIndex from a Categorical or
CategoricalIndex would incorrectly drop timezone information (GH18664)
• Bug in DatetimeIndex and TimedeltaIndex where indexing with Ellipsis would incorrectly lose the
index’s freq attribute (GH21282)
• Clarified error message produced when passing an incorrect freq argument to DatetimeIndex with
NaT as the first entry in the passed data (GH11587)
• Bug in to_datetime() where box and utc arguments were ignored when passing a DataFrame or dict
of unit mappings (GH23760)
• Bug in Series.dt where the cache would not update properly after an in-place operation (GH24408)
• Bug in PeriodIndex where comparisons against an array-like object with length 1 failed to raise
ValueError (GH23078)
• Bug in DatetimeIndex.astype(), PeriodIndex.astype() and TimedeltaIndex.astype() ignoring
the sign of the dtype for unsigned integer dtypes (GH24405).
• Fixed bug in Series.max() with datetime64[ns]-dtype failing to return NaT when nulls are present
and skipna=False is passed (GH24265)
• Bug in to_datetime() where arrays of datetime objects containing both timezone-aware and
timezone-naive datetimes would fail to raise ValueError (GH24569)
• Bug in to_datetime() where an invalid datetime format did not coerce the input to NaT even when
errors='coerce' (GH24763)
Timedelta
• Bug in DataFrame with timedelta64[ns] dtype division by Timedelta-like scalar incorrectly returning
timedelta64[ns] dtype instead of float64 dtype (GH20088, GH22163)
• Bug in adding a Index with object dtype to a Series with timedelta64[ns] dtype incorrectly raising
(GH22390)
• Bug in multiplying a Series with numeric dtype against a timedelta object (GH22390)
• Bug in Series with numeric dtype when adding or subtracting an array or Series with timedelta64
dtype (GH22390)
• Bug in Index with numeric dtype when multiplying or dividing an array with dtype timedelta64
(GH22390)
• Bug in TimedeltaIndex incorrectly allowing indexing with Timestamp object (GH20464)
• Fixed bug where subtracting Timedelta from an object-dtyped array would raise TypeError
(GH21980)
• Fixed bug in adding a DataFrame with all-timedelta64[ns] dtypes to a DataFrame with all-integer
dtypes returning incorrect results instead of raising TypeError (GH22696)
• Bug in TimedeltaIndex where adding a timezone-aware datetime scalar incorrectly returned a
timezone-naive DatetimeIndex (GH23215)
• Bug in TimedeltaIndex where adding np.timedelta64('NaT') incorrectly returned an all-NaT
DatetimeIndex instead of an all-NaT TimedeltaIndex (GH23215)
• Bug where Timedelta and to_timedelta() had inconsistencies in their supported unit strings (GH21762)
• Bug in TimedeltaIndex division where dividing by another TimedeltaIndex raised TypeError instead
of returning a Float64Index (GH23829, GH22631)
• Bug in TimedeltaIndex comparison operations where comparing against non-Timedelta-like objects
would raise TypeError instead of returning all-False for __eq__ and all-True for __ne__ (GH24056)
• Bug in Timedelta comparisons when comparing with a Tick object incorrectly raising TypeError
(GH24710)
Timezones
• Bug in Index.shift() where an AssertionError would raise when shifting across DST (GH8616)
• Bug in Timestamp constructor where passing an invalid timezone offset designator (Z) would not raise
a ValueError (GH8910)
• Bug in Timestamp.replace() where replacing at a DST boundary would retain an incorrect offset
(GH7825)
• Bug in Series.replace() with datetime64[ns, tz] data when replacing NaT (GH11792)
• Bug in Timestamp when passing different string date formats with a timezone offset would produce
different timezone offsets (GH12064)
• Bug when comparing a tz-naive Timestamp to a tz-aware DatetimeIndex which would coerce the
DatetimeIndex to tz-naive (GH12601)
• Bug in Series.truncate() with a tz-aware DatetimeIndex which would cause a core dump (GH9243)
• Bug in Series constructor which would coerce tz-aware and tz-naive Timestamp to tz-aware (GH13051)
• Bug in Index with datetime64[ns, tz] dtype that did not localize integer data correctly (GH20964)
• Bug in DatetimeIndex where constructing with an integer and tz would not localize correctly
(GH12619)
• Fixed bug where DataFrame.describe() and Series.describe() on tz-aware datetimes did not show
first and last result (GH21328)
• Bug in DatetimeIndex comparisons failing to raise TypeError when comparing timezone-aware
DatetimeIndex against np.datetime64 (GH22074)
• Bug in DataFrame assignment with a timezone-aware scalar (GH19843)
• Bug in DataFrame.asof() that raised a TypeError when attempting to compare tz-naive and tz-aware
timestamps (GH21194)
• Bug when constructing a DatetimeIndex with Timestamp constructed with the replace method across
DST (GH18785)
• Bug when setting a new value with DataFrame.loc() with a DatetimeIndex with a DST transition
(GH18308, GH20724)
• Bug in Index.unique() that did not re-localize tz-aware dates correctly (GH21737)
• Bug when indexing a Series with a DST transition (GH21846)
• Bug in DataFrame.resample() and Series.resample() where an AmbiguousTimeError or
NonExistentTimeError would raise if a timezone aware timeseries ended on a DST transition
(GH19375, GH10117)
• Bug in DataFrame.drop() and Series.drop() when specifying a tz-aware Timestamp key to drop
from a DatetimeIndex with a DST transition (GH21761)
• Bug in DatetimeIndex constructor where NaT and dateutil.tz.tzlocal would raise an
OutOfBoundsDatetime error (GH23807)
• Bug in DatetimeIndex.tz_localize() and Timestamp.tz_localize() with dateutil.tz.tzlocal
near a DST transition that would return an incorrectly localized datetime (GH23807)
• Bug in Timestamp constructor where a dateutil.tz.tzutc timezone passed with a datetime.
datetime argument would be converted to a pytz.UTC timezone (GH23807)
• Bug in to_datetime() where utc=True was not respected when specifying a unit and
errors='ignore' (GH23758)
• Bug in to_datetime() where utc=True was not respected when passing a Timestamp (GH24415)
• Bug in DataFrame.any() returning an incorrect value when axis=1 and the data is of datetimelike type
(GH23070)
• Bug in DatetimeIndex.to_period() where a timezone aware index was converted to UTC first before
creating PeriodIndex (GH22905)
• Bug in DataFrame.tz_localize(), DataFrame.tz_convert(), Series.tz_localize(), and Series.
tz_convert() where copy=False would mutate the original argument inplace (GH6326)
• Bug in DataFrame.max() and DataFrame.min() with axis=1 where a Series with NaN would be
returned when all columns contained the same timezone (GH10390)
Offsets
• Bug in FY5253 where date offsets could incorrectly raise an AssertionError in arithmetic operations
(GH14774)
• Bug in DateOffset where keyword arguments week and milliseconds were accepted and ignored.
Passing these will now raise ValueError (GH19398)
• Bug in adding DateOffset with DataFrame or PeriodIndex incorrectly raising TypeError (GH23215)
• Bug in comparing DateOffset objects with non-DateOffset objects, particularly strings, raising
ValueError instead of returning False for equality checks and True for not-equal checks (GH23524)
Numeric
Conversion
Strings
Interval
• Bug in the IntervalIndex constructor where the closed parameter did not always override the inferred
closed (GH19370)
• Bug in the IntervalIndex repr where a trailing comma was missing after the list of intervals (GH20611)
• Bug in Interval where scalar arithmetic operations did not retain the closed value (GH22313)
• Bug in IntervalIndex where indexing with datetime-like values raised a KeyError (GH20636)
• Bug in IntervalTree where data containing NaN triggered a warning and resulted in incorrect indexing
queries with IntervalIndex (GH23352)
Indexing
• DataFrame.__getitem__ now accepts dictionaries and dictionary keys as list-likes of labels, consistently
with Series.__getitem__ (GH21294)
• Fixed DataFrame[np.nan] when columns are non-unique (GH21428)
• Bug when indexing DatetimeIndex with nanosecond resolution dates and timezones (GH11679)
• Bug where indexing with a Numpy array containing negative values would mutate the indexer
(GH21867)
• Bug where mixed indexes wouldn’t allow integers for .at (GH19860)
• Float64Index.get_loc now raises KeyError when a boolean key is passed. (GH19087)
• Bug in DataFrame.loc() when indexing with an IntervalIndex (GH19977)
• Index no longer mangles None, NaN and NaT, i.e. they are treated as three different keys. However, for
numeric Index all three are still coerced to a NaN (GH22332)
• Bug in scalar in Index checks when the scalar is a float and the Index has an integer dtype (GH22085)
• Bug in MultiIndex.set_levels() when levels value is not subscriptable (GH23273)
• Bug where setting a timedelta column by Index caused it to be cast to double, and therefore lose
precision (GH23511)
• Bug in Index.union() and Index.intersection() where name of the Index of the result was not
computed correctly for certain cases (GH9943, GH9862)
• Bug in Index slicing with a boolean Index that could raise TypeError (GH22533)
• Bug in PeriodArray.__setitem__ when accepting slice and list-like value (GH23978)
• Bug in DatetimeIndex, TimedeltaIndex where indexing with Ellipsis would lose their freq attribute
(GH21282)
• Bug in iat where using it to assign an incompatible value would create a new column (GH23236)
Missing
• Bug in DataFrame.fillna() where a ValueError would raise when one column contained a
datetime64[ns, tz] dtype (GH15522)
• Bug in Series.hasnans() that could be incorrectly cached and return incorrect answers if null elements
are introduced after an initial call (GH19700)
• Series.isin() now treats all NaN-floats as equal also for np.object-dtype. This behavior is consistent
with the behavior for float64 (GH22119)
• unique() no longer mangles NaN-floats and the NaT-object for np.object-dtype, i.e. NaT is no longer
coerced to a NaN-value and is treated as a different entity. (GH22295)
• DataFrame and Series now properly handle numpy masked arrays with hardened masks. Previously,
constructing a DataFrame or Series from a masked array with a hard mask would create a pandas
object containing the underlying value, rather than the expected NaN. (GH24574)
• Bug in DataFrame constructor where dtype argument was not honored when handling numpy masked
record arrays. (GH24874)
MultiIndex
I/O
• Bug in read_csv() in which a column specified with CategoricalDtype of boolean categories was not
being correctly coerced from string values to booleans (GH20498)
• Bug in read_csv() in which unicode column names were not being properly recognized with Python
2.x (GH13253)
• Bug in DataFrame.to_sql() when writing timezone aware data (datetime64[ns, tz] dtype) would
raise a TypeError (GH9086)
• Bug in DataFrame.to_sql() where a naive DatetimeIndex would be written as TIMESTAMP WITH
TIMEZONE type in supported databases, e.g. PostgreSQL (GH23510)
• Bug in read_excel() when parse_cols is specified with an empty dataset (GH9208)
• read_html() no longer ignores all-whitespace <tr> within <thead> when considering the skiprows
and header arguments. Previously, users had to decrease their header and skiprows values on such
tables to work around the issue. (GH21641)
• read_excel() will correctly show the deprecation warning for previously deprecated sheetname
(GH17994)
• read_csv() and read_table() will throw UnicodeError and not coredump on badly encoded strings
(GH22748)
• read_csv() will correctly parse timezone-aware datetimes (GH22256)
• Bug in read_csv() in which memory management was prematurely optimized for the C engine when
the data was being read in chunks (GH23509)
• Bug in read_csv() in which unnamed columns were being improperly identified when extracting a
multi-index (GH23687)
• read_sas() will parse numbers in sas7bdat-files that have width less than 8 bytes correctly. (GH21616)
• read_sas() will correctly parse sas7bdat files with many columns (GH22628)
• read_sas() will correctly parse sas7bdat files with data page types that also have bit 7 set (so the page
type is 128 + 256 = 384) (GH16615)
• Bug in read_sas() in which an incorrect error was raised on an invalid file format. (GH24548)
• Bug in read_csv() that caused the C engine on Python 3.6+ on Windows to improperly read CSV
filenames with accented or special characters (GH15086)
• Bug in read_fwf() in which the compression type of a file was not being properly inferred (GH22199)
• Bug in pandas.io.json.json_normalize() that caused it to raise TypeError when two consecutive
elements of record_path are dicts (GH22706)
• Bug in DataFrame.to_stata(), pandas.io.stata.StataWriter and pandas.io.stata.StataWriter117
where an exception would leave a partially written and invalid dta file (GH23573)
• Bug in DataFrame.to_stata() and pandas.io.stata.StataWriter117 that produced invalid files
when using strLs with non-ASCII characters (GH23573)
• Bug in HDFStore that caused it to raise ValueError when reading a DataFrame in Python 3 from a fixed
format written in Python 2 (GH24510)
• Bug in DataFrame.to_string() and more generally in the floating repr formatter. Zeros were not
trimmed when inf was present in a column, while they were trimmed in the presence of NA values.
Zeros are now trimmed as in the presence of NA (GH24861).
• Bug in the repr when truncating the number of columns and having a wide last column (GH24849).
Plotting
Groupby/resample/rolling
• Bug in pandas.core.groupby.SeriesGroupBy.mean() when values were integral but could not fit
inside of int64, overflowing instead. (GH22487)
• pandas.core.groupby.RollingGroupby.agg() and pandas.core.groupby.ExpandingGroupby.
agg() now support multiple aggregation functions as parameters (GH15072)
• Bug in DataFrame.resample() and Series.resample() when resampling by a weekly offset ('W')
across a DST transition (GH9119, GH21459)
• Bug in DataFrame.expanding() in which the axis argument was not being respected during
aggregations (GH23372)
• Bug in pandas.core.groupby.GroupBy.transform() which caused missing values when the input
function can accept a DataFrame but renames it (GH23455).
• Bug in pandas.core.groupby.GroupBy.nth() where column order was not always preserved
(GH20760)
• Bug in pandas.core.groupby.GroupBy.rank() with method='dense' and pct=True when a group
has only one member would raise a ZeroDivisionError (GH23666).
• Calling pandas.core.groupby.GroupBy.rank() with empty groups and pct=True was raising a
ZeroDivisionError (GH22519)
• Bug in DataFrame.resample() when resampling NaT in TimeDeltaIndex (GH13223).
• Bug in DataFrame.groupby() which did not respect the observed argument when selecting a column and
instead always used observed=False (GH23970)
• Bug in pandas.core.groupby.SeriesGroupBy.pct_change() or pandas.core.groupby.
DataFrameGroupBy.pct_change(), which would previously work across groups when calculating the
percent change; it now correctly works per group (GH21200, GH21235).
• Bug preventing hash table creation with a very large number (2^32) of rows (GH22805)
• Bug in groupby where grouping on a categorical column caused a ValueError and incorrect grouping if
observed=True and NaN is present in the categorical column (GH24740, GH21151).
Reshaping
• Bug in pandas.concat() when joining resampled DataFrames with timezone aware index (GH13783)
• Bug in pandas.concat() where the names argument was ignored when joining only Series objects
(GH23490)
• Bug in Series.combine_first() with datetime64[ns, tz] dtype which would return tz-naive result
(GH21469)
• Bug in Series.where() and DataFrame.where() with datetime64[ns, tz] dtype (GH21546)
• Bug in DataFrame.where() with an empty DataFrame and empty cond having non-bool dtype
(GH21947)
• Bug in Series.mask() and DataFrame.mask() with list conditionals (GH21891)
• Bug in DataFrame.replace() which raised a RecursionError when converting out-of-bounds datetime64[ns,
tz] values (GH20380)
• pandas.core.groupby.GroupBy.rank() now raises a ValueError when an invalid value is passed for
argument na_option (GH22124)
• Bug in get_dummies() with Unicode attributes in Python 2 (GH22084)
Sparse
Style
Build changes
• Building pandas for development now requires cython >= 0.28.2 (GH21688)
• Testing pandas now requires hypothesis>=3.58. You can find the Hypothesis docs here, and a pandas-
specific introduction in the contributing guide. (GH22280)
• Building pandas on macOS now targets minimum macOS 10.9 if run on macOS 10.9 or above
(GH23424)
Other
• Bug where C variables were declared with external linkage causing import errors if certain other C
libraries were imported before Pandas. (GH24113)
Contributors
A total of 518 people contributed patches to this release. People with a “+” by their names contributed a
patch for the first time.
• 1_x7 +
• AJ Dyka +
• AJ Pryor, Ph.D +
• Aaron Critchley
• Abdullah İhsan Seçer +
• Adam Bull +
• Adam Hooper
• Adam J. Stewart
• Adam Kim
• Adam Klimont +
• Addison Lynch +
• Alan Hogue +
• Albert Villanova del Moral
• Alex Radu +
• Alex Rychyk
• Alex Strick van Linschoten +
• Alex Volkov +
• Alex Watt +
• AlexTereshenkov +
• Alexander Buchkovsky
• Alexander Hendorf +
• Alexander Hess +
• Alexander Nordin +
• Alexander Ponomaroff +
• Alexandre Batisse +
• Alexandre Decan +
• Allen Downey +
• Allison Browne +
• Aly Sivji
• Alyssa Fu Ward +
• Andrew
• Andrew Gaspari +
• Andrew Gross +
• Andrew Spott +
• Andrew Wood +
• Andy +
• Aniket uttam +
• Anjali2019 +
• Anjana S +
• Antoine Viscardi +
• Antonio Gutierrez +
• Antti Kaihola +
• Anudeep Tubati +
• Arjun Sharma +
• Armin Varshokar
• Arno Veenstra +
• Artem Bogachev
• ArtinSarraf +
• Barry Fitzgerald +
• Bart Aelterman +
• Batalex +
• Baurzhan Muftakhidinov
• Ben James +
• Ben Nelson +
• Benjamin Grove +
• Benjamin Rowell +
• Benoit Paquet +
• Bharat Raghunathan +
• Bhavani Ravi +
• Big Head +
• Boris Lau +
• Brett Naul
• Brett Randall +
• Brian Choi +
• Bryan Cutler +
• C John Klehm +
• C.A.M. Gerlach +
• Caleb Braun +
• Carl Johan +
• Cecilia +
• Chalmer Lowe
• Chang She
• Charles David +
• Cheuk Ting Ho
• Chris
• Chris Bertinato +
• Chris Roberts +
• Chris Stadler +
• Christian Haege +
• Christian Hudon
• Christopher Whelan
• Chu Qing Hao +
• Chuanzhu Xu +
• Clemens Brunner
• Da Cheezy Mobsta +
• Damian Kula +
• Damini Satya
• Daniel Himmelstein
• Daniel Hrisca +
• Daniel Luis Costa +
• Daniel Saxton +
• DanielFEvans +
• Darcy Meyer +
• DataOmbudsman
• David Arcos
• David Krych
• David Liu +
• Dean Langsam +
• Deepyaman Datta +
• Denis Belavin +
• Devin Petersohn +
• Diane Trout +
• Diego Argueta +
• Diego Torres +
• Dobatymo +
• Doug Latornell +
• Dr. Irv
• Dylan Dmitri Gray +
• EdAbati +
• Enrico Rotundo +
• Eric Boxer +
• Eric Chea
• Erik +
• Erik Nilsson +
• EternalLearner42 +
• Evan +
• Evan Livelo +
• Fabian Haase +
• Fabian Retkowski
• Fabian Rost +
• Fabien Aulaire +
• Fakabbir Amin +
• Fei Phoon +
• Felix Divo +
• Fernando Margueirat +
• Flavien Lambert +
• Florian Müller +
• Florian Rathgeber +
• Frank Hoang +
• Fábio Rosado +
• Gabe Fernando
• Gabriel Reid +
• Gaibo Zhang +
• Giftlin Rajaiah
• Gioia Ballin +
• Giuseppe Romagnuolo +
• Gjelt
• Gordon Blackadder +
• Gosuke Shibahara +
• Graham Inggs
• Gregory Rome +
• Guillaume Gay
• Guillaume Lemaitre +
• HHest +
• Hannah Ferchland
• Haochen Wu
• Hielke Walinga +
• How Si Wei +
• Hubert +
• HubertKl +
• Huize Wang +
• Hyukjin Kwon +
• HyunTruth +
• Iain Barr
• Ian Dunn +
• Ignacio Vergara Kausel +
• Inevitable-Marzipan +
• Irv Lustig +
• IsvenC +
• JElfner +
• Jacob Bundgaard +
• Jacopo Rota
• Jakob Jarmar +
• James Bourbeau +
• James Cobon-Kerr +
• James Myatt +
• James Winegar +
• Jan Rudolph
• Jan-Philip Gehrcke +
• Jared Groves +
• Jarrod Millman +
• Jason Kiley +
• Javad Noorbakhsh +
• Jay Offerdahl +
• Jayanth Katuri +
• Jeff Reback
• Jeongmin Yu +
• Jeremy Schendel
• Jerod Estapa +
• Jesper Dramsch +
• Jiang Yue +
• Jim Jeon +
• Joe Jevnik
• Joel Nothman
• Joel Ostblom +
• Johan von Forstner +
• Johnny Chiu +
• Jonas +
• Jonathon Vandezande +
• Jop Vermeer +
• Jordi Contestí
• Jorge López Fueyo +
• Joris Van den Bossche
• Jose Quinones +
• Jose Rivera-Rubio +
• Josh
• Josh Friedlander +
• Jun +
• Justin Zheng +
• Kaiqi Dong +
• Kalyan Gokhale
• Kane +
• Kang Yoosam +
• Kapil Patel +
• Kara de la Marck +
• Karl Dunkle Werner +
• Karmanya Aggarwal +
• Katherine Surta +
• Katrin Leinweber +
• Kendall Masse +
• Kevin Markham +
• Kevin Sheppard
• Kimi Li +
• Koustav Samaddar +
• Krishna +
• Kristian Holsheimer +
• Ksenia Gueletina +
• Kyle Kosic +
• Kyle Prestel +
• LJ +
• LeakedMemory +
• Li Jin +
• Licht Takeuchi
• Lorenzo Stella +
• Luca Donini +
• Luciano Viola +
• Maarten Rietbergen +
• Mak Sze Chun +
• Marc Garcia
• Marius Potgieter +
• Mark Sikora +
• Markus Meier +
• Marlene Silva Marchena +
• Martin Babka +
• MatanCohe +
• Mateusz Woś +
• Mathew Topper +
• Matias Heikkilä
• Mats Maiwald +
• Matt Boggess +
• Matt Cooper +
• Matt Williams +
• Matthew Gilbert
• Matthew Roeschke
• Max Bolingbroke +
• Max Kanter
• Max Kovalovs +
• Max van Deursen +
• MeeseeksMachine
• Michael
• Michael Davis +
• Michael Odintsov
• Michael P. Moran +
• Michael Silverstein +
• Michael-J-Ward +
• Mickaël Schoentgen +
• Miguel Sánchez de León Peque +
• Mike Cramblett +
• Min ho Kim +
• Ming Li
• Misha Veldhoen +
• Mitar
• Mitch Negus
• Monson Shao +
• Moonsoo Kim +
• Mortada Mehyar
• Mukul Ashwath Ram +
• MusTheDataGuy +
• Myles Braithwaite
• Nanda H Krishna +
• Nehil Jain +
• Nicholas Musolino +
• Nicolas Dickreuter +
• Nikhil Kumar Mengani +
• Nikoleta Glynatsi +
• Noam Hershtig +
• Noora Husseini +
• Ondrej Kokes
• Pablo Ambrosio +
• Pamela Wu +
• Parfait G +
• Patrick Park +
• Paul
• Paul Ganssle
• Paul Reidy
• Paul van Mulbregt +
• Pauli Virtanen
• Pav A +
• Philippe Ombredanne +
• Phillip Cloud
• Pietro Battiston
• Piyush Aggarwal +
• Prabakaran Kumaresshan +
• Pulkit Maloo
• Pyry Kovanen
• Rajib Mitra +
• Redonnet Louis +
• Rhys Parry +
• Richard Eames +
• Rick +
• Robin
• Roei.r +
• RomainSa +
• Roman Imankulov +
• Roman Yurchak +
• Ruijing Li +
• Ryan +
• Ryan Joyce +
• Ryan Nazareth +
• Ryan Rehman +
• Rüdiger Busche +
• SEUNG HOON, SHIN +
• Sakar Panta +
• Samuel Sinayoko
• Sandeep Pathak +
• Sandrine Pataut +
• Sangwoong Yoon
• Santosh Kumar +
• Saurav Chakravorty +
• Scott McAllister +
• Scott Talbert +
• Sean Chan +
• Sergey Kopylov +
• Shadi Akiki +
• Shantanu Gontia +
• Shengpu Tang +
• Shirish Kadam +
• Shivam Rana +
• Shorokhov Sergey +
• Simon Hawkins +
• Simon Riddell +
• Simone Basso
• Sinhrks
• Soyoun(Rose) Kim +
• Srinivas Reddy Thatiparthy +
• Stefaan Lippens +
• Stefano Cianciulli
• Stefano Miccoli +
• Stephan Hoyer
• Stephen Childs
• Stephen Cowley +
• Stephen Pascoe
• Stephen Rauch
• Sterling Paramore +
• Steve Baker +
• Steve Cook +
• Steve Dower +
• Steven +
• Stijn Van Hoey
• Stéphan Taljaard +
• Sumanau Sareen +
• Sumin Byeon +
• Sören +
• Takuya N +
• Tamas Nagy +
• Tan Tran +
• Tanya Jain +
• Tao He +
• Tarbo Fukazawa
• Terji Petersen +
• Thein Oo +
• Thiago Cordeiro da Fonseca +
• ThibTrip +
• Thierry Moisan
• Thijs Damsma +
• Thiviyan Thanapalasingam +
• Thomas A Caswell
• Thomas Kluiters +
• Thomas Lentali +
• Tilen Kusterle +
• Tim D. Smith +
• Tim Gates +
• Tim Hoffmann
• Tim Swast
• Tom Augspurger
• Tom Neep +
• Tomasz Kluczkowski +
• Tomáš Chvátal +
• Tony Tao +
• Triple0 +
• Troels Nielsen +
• Tuhin Mahmud +
• Tyler Reddy +
• Uddeshya Singh
• Uwe L. Korn +
• Vadym Barda +
• Vaibhav Vishal +
• Varad Gunjal +
• Vasily Litvinov +
• Vibhu Agarwal +
• Victor Maryama +
• Victor Villas
• Vikramjeet Das +
• Vincent La
• Vitória Helena +
• Vladislav +
• Vu Le
• Vyom Jain +
• Víctor Moron Tejero +
• Weiwen Gu +
• Wenhuan
• Wes Turner
• Wil Tan +
• Will Ayd +
• William Ayd
• Wouter De Coster +
• Yeojin Kim +
• Yitzhak Andrade +
• Yoann Goular +
• Yuecheng Wu +
• Yuliya Dovzhenko +
• Yury Bayda +
• Zac Hatfield-Dodds +
• Zach Angell +
• aberres +
• aeltanawy +
• ailchau +
• alimcmaster1
• alphaCTzo7G +
• amphy +
• anmyachev +
• araraonline +
• azure-pipelines[bot] +
• benarthur91 +
• bk521234 +
• cgangwar11 +
• chris-b1
• cxl923cc +
• dahlbaek +
• danielplawrence +
• dannyhyunkim +
• darke-spirits +
• david-liu-brattle-1
• davidmvalente +
• deflatSOCO
• doosik_bae +
• dylanchase +
• eduardo naufel schettino +
• endenis +
• enisnazif +
• euri10 +
• evangelineliu +
• ezcitron +
• fengyqf +
• fjdiod
• fjetter
• fl4p +
• fleimgruber +
• froessler
• gfyoung
• gwrome +
• h-vetinari
• haison +
• hannah-c +
• harisbal +
• heckeop +
• henriqueribeiro +
• himanshu awasthi
• hongshaoyang +
• iamshwin +
• igorfassen +
• jalazbe +
• jamesoliverh +
• jbrockmendel
• jh-wu +
• jkovacevic +
• justinchan23 +
• killerontherun1 +
• knuu +
• kpapdac +
• kpflugshaupt +
• krsnik93 +
• leerssej +
• louispotok
• lrjball +
• marcosrullan +
• mazayo +
• miker985
• nathalier +
• nicolab100 +
• nprad
• nrebena +
• nsuresh +
• nullptr +
• ottiP
• pajachiet +
• pilkibun +
• pmaxey83 +
• raguiar2 +
• ratijas +
• rbenes +
• realead +
• robbuckley +
• saurav2608 +
• shawnbrown +
• sideeye +
• ssikdar1
• sudhir mohanraj +
• svenharris +
• syutbai +
• tadeja +
• tamuhey +
• testvinder +
• thatneat
• tmnhat2001
• tomascassidy +
• tomneep
• topper-123
• vkk800 +
• willweil +
• winlu +
• yehia67 +
• yhaque1213 +
• ym-pett +
• yrhooke +
• ywpark1 +
• zertrin
• zhezherun +
{{ header }}
This is a minor bug-fix release in the 0.23.x series and includes some small regression fixes and bug fixes.
We recommend that all users upgrade to this version.
Warning: Starting January 1, 2019, pandas feature releases will support Python 3 only. See Dropping
Python 2.7 for more.
• Fixed regressions
• Bug fixes
• Contributors
Fixed regressions
• Python 3.7 with Windows gave all missing values for rolling variance calculations (GH21813)
Bug fixes
Groupby/resample/rolling
• Bug where calling DataFrameGroupBy.agg() with a list of functions including ohlc as the non-initial
element would raise a ValueError (GH21716)
• Bug in roll_quantile caused a memory leak when calling .rolling(...).quantile(q) with q in
(0,1) (GH21965)
Missing
• Bug in Series.clip() and DataFrame.clip() cannot accept list-like threshold containing NaN
(GH19992)
Contributors
A total of 6 people contributed patches to this release. People with a “+” by their names contributed a
patch for the first time.
• Jeff Reback
• MeeseeksMachine +
• Tom Augspurger
• chris-b1
• h-vetinari
• meeseeksdev[bot]
{{ header }}
This release fixes a build issue with the sdist for Python 3.7 (GH21785) There are no other changes.
Contributors
A total of 2 people contributed patches to this release. People with a “+” by their names contributed a
patch for the first time.
• Tom Augspurger
• meeseeksdev[bot] +
{{ header }}
This is a minor bug-fix release in the 0.23.x series and includes some small regression fixes and bug fixes.
We recommend that all users upgrade to this version.
Note: Pandas 0.23.2 is the first pandas release that's compatible with Python 3.7 (GH20552)
Warning: Starting January 1, 2019, pandas feature releases will support Python 3 only. See Dropping
Python 2.7 for more.
DataFrame.all() and DataFrame.any() now accept axis=None to reduce over all axes to a scalar (GH19976)
In [2]: df.all(axis=None)
Out[2]: False
This also provides compatibility with NumPy 1.15, which now dispatches to DataFrame.all. With NumPy
1.15 and pandas 0.23.1 or earlier, numpy.all() will no longer reduce over every axis:
With pandas 0.23.2, that will correctly return False, as it did with NumPy < 1.15.
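As a rough sketch of this dispatch (the boolean frame below is an assumption, not the df from the truncated example):

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [True, True], "b": [True, False]})

df.all(axis=None)  # reduces over both axes to a single scalar: False
np.all(df)         # NumPy >= 1.15 dispatches to DataFrame.all and yields the same scalar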
Fixed regressions
• Re-allowed duplicate level names of a MultiIndex. Accessing a level that has a duplicate name by
name still raises an error (GH19029).
• Bug in both DataFrame.first_valid_index() and Series.first_valid_index() raised for a row
index having duplicate values (GH21441)
• Fixed printing of DataFrames with hierarchical columns with long names (GH21180)
• Fixed regression in reindex() and groupby() with a MultiIndex or multiple keys that contain
categorical datetime-like values (GH21390).
• Fixed regression in unary negative operations with object dtype (GH21380)
• Bug in Timestamp.ceil() and Timestamp.floor() when timestamp is a multiple of the rounding
frequency (GH21262)
• Fixed regression in to_clipboard() that defaulted to copying dataframes as space-delimited instead
of tab-delimited text (GH21104)
Build changes
• The source and binary distributions no longer include test data files, resulting in smaller download
sizes. Tests relying on these data files will be skipped when using pandas.test(). (GH19320)
Bug fixes
Conversion
• Bug in constructing Index with an iterator or generator (GH21470)
• Bug in Series.nlargest() for signed and unsigned integer dtypes when the minimum value is present
(GH21426)
Indexing
• Bug in Index.get_indexer_non_unique() with categorical key (GH21448)
• Bug in comparison operations for MultiIndex where an error was raised on equality / inequality
comparison involving a MultiIndex with nlevels == 1 (GH21149)
• Bug in DataFrame.drop() where behaviour was not consistent between unique and non-unique indexes (GH21494)
• Bug in DataFrame.duplicated() with a large number of columns causing a ‘maximum recursion depth
exceeded’ (GH21524).
I/O
• Bug in read_csv() that caused it to incorrectly raise an error when nrows=0, low_memory=True, and
index_col was not None (GH21141)
• Bug in json_normalize() when formatting the record_prefix with integer columns (GH21536)
Categorical
• Bug in rendering Series with Categorical dtype in rare conditions under Python 2.7 (GH21002)
Timezones
• Bug in Timestamp and DatetimeIndex where passing a Timestamp localized after a DST transition
would return a datetime before the DST transition (GH20854)
• Bug in comparing DataFrame with tz-aware DatetimeIndex columns with a DST transition that raised
a KeyError (GH19970)
• Bug in DatetimeIndex.shift() where an AssertionError would raise when shifting across DST
(GH8616)
• Bug in Timestamp constructor where passing an invalid timezone offset designator (Z) would not raise
a ValueError (GH8910)
• Bug in Timestamp.replace() where replacing at a DST boundary would retain an incorrect offset
(GH7825)
• Bug in DatetimeIndex.reindex() when reindexing a tz-naive and tz-aware DatetimeIndex (GH8306)
• Bug in DatetimeIndex.resample() when downsampling across a DST boundary (GH8531)
Timedelta
• Bug in Timedelta where non-zero timedeltas shorter than 1 microsecond were considered False
(GH21484)
Contributors
A total of 17 people contributed patches to this release. People with a “+” by their names contributed a
patch for the first time.
• David Krych
• Jacopo Rota +
• Jeff Reback
• Jeremy Schendel
• Joris Van den Bossche
• Kalyan Gokhale
• Matthew Roeschke
• Michael Odintsov +
• Ming Li
• Pietro Battiston
• Tom Augspurger
• Uddeshya Singh
• Vu Le +
• alimcmaster1 +
• david-liu-brattle-1 +
• gfyoung
• jbrockmendel
{{ header }}
This is a minor bug-fix release in the 0.23.x series and includes some small regression fixes and bug fixes.
We recommend that all users upgrade to this version.
Warning: Starting January 1, 2019, pandas feature releases will support Python 3 only. See Dropping
Python 2.7 for more.
• Fixed regressions
• Performance improvements
• Bug fixes
• Contributors
Fixed regressions
Performance improvements
Bug fixes
Groupby/resample/rolling
• Bug in DataFrame.agg() where applying multiple aggregation functions to a DataFrame with
duplicated column names would cause a stack overflow (GH21063)
• Bug in pandas.core.groupby.GroupBy.ffill() and pandas.core.groupby.GroupBy.bfill()
where the fill within a grouping would not always be applied as intended due to the
implementations' use of a non-stable sort (GH21207)
• Bug in pandas.core.groupby.GroupBy.rank() where results did not scale to 100% when specifying
method='dense' and pct=True
• Bug in pandas.DataFrame.rolling() and pandas.Series.rolling() which incorrectly accepted a
0 window size rather than raising (GH21286)
Data-type specific
• Bug in Series.str.replace() where the method throws TypeError on Python 3.5.2 (GH21078)
• Bug in Timedelta where passing a float with a unit would prematurely round the float precision
(GH14156)
• Bug in pandas.testing.assert_index_equal() which raised AssertionError incorrectly, when
comparing two CategoricalIndex objects with param check_categorical=False (GH19776)
Sparse
• Bug in SparseArray.shape which previously only returned the shape of SparseArray.sp_values
(GH21126)
Indexing
• Bug in Series.reset_index() where an appropriate error was not raised with an invalid level name
(GH20925)
• Bug in interval_range() when start/periods or end/periods are specified with float start or end
(GH21161)
• Bug in MultiIndex.set_names() where error raised for a MultiIndex with nlevels == 1 (GH21149)
• Bug in IntervalIndex constructors where creating an IntervalIndex from categorical data was not
fully supported (GH21243, GH21253)
• Bug in MultiIndex.sort_index() which was not guaranteed to sort correctly with level=1; this was
also causing data misalignment in particular DataFrame.stack() operations (GH20994, GH20945,
GH21052)
Plotting
• New keywords (sharex, sharey) to turn on/off sharing of x/y-axis by subplots generated with
pandas.DataFrame().groupby().boxplot() (GH20968)
I/O
• Bug in IO methods specifying compression='zip' which produced uncompressed zip archives
(GH17778, GH21144)
• Bug in DataFrame.to_stata() which prevented exporting DataFrames to buffers and most file-like
objects (GH21041)
• Bug in read_stata() and StataReader which did not correctly decode utf-8 strings on Python 3 from
Stata 14 files (dta version 118) (GH21244)
• Bug in read_json() where reading an empty JSON schema with orient='table' back to a DataFrame
caused an error (GH21287)
Reshaping
• Bug in concat() where an error was raised when concatenating Series with numpy scalar and tuple names
(GH21015)
• Bug in concat() warning message providing the wrong guidance for future behavior (GH21101)
Other
• Tab completion on Index in IPython no longer outputs deprecation warnings (GH21125)
• Bug preventing pandas being used on Windows without C++ redistributable installed (GH21106)
Contributors
A total of 30 people contributed patches to this release. People with a “+” by their names contributed a
patch for the first time.
• Adam J. Stewart
• Adam Kim +
• Aly Sivji
• Chalmer Lowe +
• Damini Satya +
• Dr. Irv
• Gabe Fernando +
• Giftlin Rajaiah
• Jeff Reback
• Jeremy Schendel +
• Joris Van den Bossche
• Kalyan Gokhale +
• Kevin Sheppard
• Matthew Roeschke
• Max Kanter +
• Ming Li
• Pyry Kovanen +
• Stefano Cianciulli
• Tom Augspurger
• Uddeshya Singh +
• Wenhuan
• William Ayd
• chris-b1
• gfyoung
• h-vetinari
• nprad +
• ssikdar1 +
• tmnhat2001
• topper-123
• zertrin +
{{ header }}
This is a major release from 0.22.0 and includes a number of API changes, deprecations, new features,
enhancements, and performance improvements along with a large number of bug fixes. We recommend that
all users upgrade to this version.
Highlights include:
• Round-trippable JSON format with ‘table’ orient.
• Instantiation from dicts respects order for Python 3.6+.
• Dependent column arguments for assign.
• Merging / sorting on a combination of columns and index levels.
• Extending pandas with custom types.
• Excluding unobserved categories from groupby.
• Changes to make output shape of DataFrame.apply consistent.
Check the API Changes and deprecations before updating.
Warning: Starting January 1, 2019, pandas feature releases will support Python 3 only. See Dropping
Python 2.7 for more.
• New features
– JSON read/write round-trippable with orient='table'
– .assign() accepts dependent arguments
– Merging on a combination of columns and index levels
– Sorting by a combination of columns and index levels
– Extending pandas with custom types (experimental)
– New observed keyword for excluding unobserved categories in groupby
– Rolling/Expanding.apply() accepts raw=False to pass a Series to the function
– DataFrame.interpolate has gained the limit_area kwarg
– get_dummies now supports dtype argument
– Timedelta mod method
– .rank() handles inf values when NaN are present
– Series.str.cat has gained the join kwarg
– DataFrame.astype performs column-wise conversion to Categorical
– Other enhancements
• Backwards incompatible API changes
– Dependencies have increased minimum versions
– Instantiation from dicts preserves dict insertion order for python 3.6+
– Deprecate Panel
– pandas.core.common removals
– Changes to make output of DataFrame.apply consistent
– Concatenation will no longer sort
– Build changes
– Index division by zero fills correctly
– Extraction of matching patterns from strings
– Default value for the ordered parameter of CategoricalDtype
– Better pretty-printing of DataFrames in a terminal
– Datetimelike API changes
– Other API changes
• Deprecations
• Removal of prior version deprecations/changes
• Performance improvements
• Documentation changes
• Bug fixes
– Categorical
– Datetimelike
– Timedelta
– Timezones
– Offsets
– Numeric
– Strings
– Indexing
– MultiIndex
– I/O
– Plotting
– Groupby/resample/rolling
– Sparse
– Reshaping
– Other
• Contributors
New features
A DataFrame can now be written to and subsequently read back via JSON while preserving metadata through
usage of the orient='table' argument (see GH18912 and GH9146). Previously, none of the available orient
values guaranteed the preservation of dtypes and index names, amongst other metadata.
In [1]: df = pd.DataFrame({'foo': [1, 2, 3, 4],
...: 'bar': ['a', 'b', 'c', 'd'],
...: 'baz': pd.date_range('2018-01-01', freq='d', periods=4),
...: 'qux': pd.Categorical(['a', 'b', 'c', 'c'])},
...: index=pd.Index(range(4), name='idx'))
...:
In [2]: df
Out[2]:
foo bar baz qux
idx
0 1 a 2018-01-01 a
1 2 b 2018-01-02 b
2 3 c 2018-01-03 c
3 4 d 2018-01-04 c
[4 rows x 4 columns]
In [3]: df.dtypes
Out[3]:
foo int64
bar object
baz datetime64[ns]
qux category
Length: 4, dtype: object
In [6]: new_df
Out[6]:
foo bar baz qux
idx
0 1 a 2018-01-01 a
1 2 b 2018-01-02 b
2 3 c 2018-01-03 c
3 4 d 2018-01-04 c
[4 rows x 4 columns]
In [7]: new_df.dtypes
Out[7]:
foo int64
bar object
baz datetime64[ns]
qux category
Length: 4, dtype: object
Please note that an index name of 'index' is not supported with the round trip format, as that string is used
by default in write_json to indicate a missing index name.
In [8]: df.index.name = 'index'
In [11]: new_df
Out[11]:
foo bar baz qux
0 1 a 2018-01-01 a
1 2 b 2018-01-02 b
2 3 c 2018-01-03 c
3 4 d 2018-01-04 c
[4 rows x 4 columns]
In [12]: new_df.dtypes
Out[12]:
foo int64
bar object
baz datetime64[ns]
qux category
Length: 4, dtype: object
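The calls that produce new_df are elided above; a minimal sketch of the round trip, assuming a frame like the one shown, would be:

import pandas as pd

df = pd.DataFrame({"foo": [1, 2, 3, 4], "bar": ["a", "b", "c", "d"]},
                  index=pd.Index(range(4), name="idx"))

payload = df.to_json(orient="table")            # writes the data together with a Table Schema
new_df = pd.read_json(payload, orient="table")  # dtypes and the index name survive the round trip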
DataFrame.assign() now accepts dependent keyword arguments on Python 3.6 and later (see also PEP 468).
Later keyword arguments may now refer to earlier ones if the argument is a callable. See the documentation
here (GH14207)
In [13]: df = pd.DataFrame({'A': [1, 2, 3]})
In [14]: df
Out[14]:
A
0 1
1 2
2 3
[3 rows x 1 columns]
[3 rows x 3 columns]
Warning: This may subtly change the behavior of your code when you’re using .assign() to update
an existing column. Previously, callables referring to other variables being updated would get the “old”
values
Previous behavior:
In [2]: df = pd.DataFrame({"A": [1, 2, 3]})
New behavior:
In [16]: df.assign(A=df.A + 1, C=lambda df: df.A * -1)
Out[16]:
A C
0 2 -2
1 3 -3
2 4 -4
[3 rows x 2 columns]
Strings passed to DataFrame.merge() as the on, left_on, and right_on parameters may now refer to either
column names or index level names. This enables merging DataFrame instances on a combination of index
levels and columns without resetting indexes. See the Merge on columns and levels documentation section.
(GH14355)
In [17]: left_index = pd.Index(['K0', 'K0', 'K1', 'K2'], name='key1')
[3 rows x 5 columns]
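The example above is truncated; the following is a small illustrative sketch (the frames left and right are assumptions) of merging on a mix of an index level name and a column name:

import pandas as pd

left = pd.DataFrame({"A": ["A0", "A1", "A2", "A3"], "key2": ["K0", "K1", "K0", "K1"]},
                    index=pd.Index(["K0", "K0", "K1", "K2"], name="key1"))
right = pd.DataFrame({"B": ["B0", "B1", "B2"], "key2": ["K0", "K0", "K0"]},
                     index=pd.Index(["K0", "K1", "K2"], name="key1"))

# 'key1' is an index level name and 'key2' is a column; both may be passed to `on`
merged = left.merge(right, on=["key1", "key2"])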
Strings passed to DataFrame.sort_values() as the by parameter may now refer to either column names or
index level names. This enables sorting DataFrame instances by a combination of index levels and columns
without resetting indexes. See the Sorting by Indexes and Values documentation section. (GH14353)
# Build MultiIndex
In [22]: idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 2),
....: ('b', 2), ('b', 1), ('b', 1)])
....:
# Build DataFrame
In [24]: df_multi = pd.DataFrame({'A': np.arange(6, 0, -1)},
....: index=idx)
....:
In [25]: df_multi
Out[25]:
A
first second
a 1 6
2 5
2 4
b 2 3
1 2
1 1
[6 rows x 1 columns]
[6 rows x 1 columns]
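To make the truncated example concrete, a short sketch (the level names mirror the output above; the construction itself is an assumption):

import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples([("a", 1), ("a", 2), ("a", 2),
                                 ("b", 2), ("b", 1), ("b", 1)],
                                names=["first", "second"])
df_multi = pd.DataFrame({"A": np.arange(6, 0, -1)}, index=idx)

# 'second' is an index level and 'A' is a column; both may be passed to `by`
df_multi.sort_values(by=["second", "A"])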
Pandas now supports storing array-like objects that aren’t necessarily 1-D NumPy arrays as columns in
a DataFrame or values in a Series. This allows third-party libraries to implement extensions to NumPy’s
types, similar to how pandas implemented categoricals, datetimes with timezones, periods, and intervals.
As a demonstration, we’ll use cyberpandas, which provides an IPArray type for storing ip addresses.
IPArray isn’t a normal 1-D NumPy array, but because it’s a pandas ExtensionArray, it can be stored
properly inside pandas’ containers.
In [4]: ser
Out[4]:
0 0.0.0.0
1 192.168.1.1
2 2001:db8:85a3::8a2e:370:7334
dtype: ip
Notice that the dtype is ip. The missing value semantics of the underlying array are respected:
In [5]: ser.isna()
Out[5]:
0 True
1 False
2 False
dtype: bool
For more, see the extension types documentation. If you build an extension array, publicize it on our
ecosystem page.
Grouping by a categorical includes the unobserved categories in the output. When grouping by multiple
categorical columns, this means you get the Cartesian product of all the categories, including combinations
where there are no observations, which can result in a large number of groups. We have added a keyword
observed to control this behavior; it defaults to observed=False for backward compatibility. (GH14942,
GH8138, GH15217, GH17594, GH8669, GH20583, GH20902)
In [31]: df
Out[31]:
A B values C
0 a c 1 foo
1 a d 2 bar
2 b c 3 foo
3 b d 4 bar
[4 rows x 4 columns]
[4 rows x 1 columns]
For pivoting operations, this behavior is already controlled by the dropna keyword:
In [37]: df
Out[37]:
A B values
0 a c 1
1 a d 2
2 b c 3
3 b d 4
[4 rows x 3 columns]
[4 rows x 1 columns]
y NaN
z c NaN
d NaN
y NaN
[9 rows x 1 columns]
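A minimal sketch of the observed keyword (the frame here is an assumption, not the df from the example above):

import pandas as pd

cats = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "z"])
df = pd.DataFrame({"A": cats, "values": [1, 2, 3, 4]})

df.groupby("A", observed=False)["values"].sum()  # the unobserved category 'z' appears in the result
df.groupby("A", observed=True)["values"].sum()   # only categories that actually occur appear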
In [41]: s
Out[41]:
1 0
2 1
3 2
4 3
5 4
Length: 5, dtype: int64
Pass a Series:
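The call itself is elided above; a rough sketch of the raw keyword, reconstructing a Series like the one shown:

import pandas as pd

s = pd.Series(range(5), index=range(1, 6))

s.rolling(2).apply(lambda x: x.iloc[0], raw=False)  # each window arrives as a Series
s.rolling(2).apply(lambda x: x[0], raw=True)        # each window arrives as a NumPy array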
DataFrame.interpolate() has gained a limit_area parameter to allow further control of which NaNs
are replaced. Use limit_area='inside' to fill only NaNs surrounded by valid values, or use
limit_area='outside' to fill only NaNs outside the existing valid values while preserving those inside.
(GH16284) See the full documentation here.
In [45]: ser
Out[45]:
0 NaN
1 NaN
2 5.0
3 NaN
4 NaN
5 NaN
6 13.0
7 NaN
8 NaN
Length: 9, dtype: float64
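The interpolation calls are elided above; a brief sketch using a Series like the one shown:

import numpy as np
import pandas as pd

ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan, np.nan, 13, np.nan, np.nan])

ser.interpolate(limit_area="inside")                           # fill only NaNs surrounded by valid values
ser.interpolate(limit_direction="both", limit_area="outside")  # fill only NaNs outside the valid values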
get_dummies() now accepts a dtype argument, which specifies a dtype for the new columns. The
default remains uint8. (GH18330)
In [49]: df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
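The get_dummies() calls themselves are elided above; a small sketch continuing from that frame:

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

pd.get_dummies(df, columns=["a"]).dtypes              # dummy columns default to uint8
pd.get_dummies(df, columns=["a"], dtype=bool).dtypes  # dummy columns take the requested dtype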
mod (%) and divmod operations are now defined on Timedelta objects when operating with either
timedelta-like or numeric arguments. See the documentation here. (GH19365)
In [52]: td = pd.Timedelta(hours=37)
In [53]: td % pd.Timedelta(minutes=45)
Out[53]: Timedelta('0 days 00:15:00')
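A short sketch of divmod alongside the mod example above (the result values follow from 37 hours divided by 24 hours):

import pandas as pd

td = pd.Timedelta(hours=37)

td % pd.Timedelta(minutes=45)       # Timedelta('0 days 00:15:00'), as shown above
divmod(td, pd.Timedelta(hours=24))  # floor quotient 1 and the remaining Timedelta of 13 hours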
In previous versions, .rank() would assign inf elements NaN as their ranks. Now ranks are calculated
properly. (GH6945)
In [55]: s
Out[55]:
0 -inf
1 0.0
2 1.0
3 NaN
4 inf
Length: 5, dtype: float64
Previous behavior:
In [11]: s.rank()
Out[11]:
0 1.0
1 2.0
2 3.0
3 NaN
4 NaN
dtype: float64
Current behavior:
In [56]: s.rank()
Out[56]:
0 1.0
1 2.0
2 3.0
3 NaN
4 4.0
Length: 5, dtype: float64
Furthermore, when ranking inf or -inf values together with NaN values, the calculation previously did not
distinguish NaN from infinity when using the 'top' or 'bottom' argument.
In [58]: s
Out[58]:
0 NaN
1 NaN
2 -inf
3 -inf
Length: 4, dtype: float64
Previous behavior:
In [15]: s.rank(na_option='top')
Out[15]:
0 2.5
1 2.5
2 2.5
3 2.5
dtype: float64
Current behavior:
In [59]: s.rank(na_option='top')
Out[59]:
0 1.5
1 1.5
2 3.5
3 3.5
Length: 4, dtype: float64
Previously, Series.str.cat() did not – in contrast to most of pandas – align Series on their index
before concatenation (see GH18657). The method has now gained a keyword join to control the manner of
alignment, see examples below and here.
In v0.23, join will default to None (meaning no alignment), but this default will change to 'left' in a future
version of pandas.
In [60]: s = pd.Series(['a', 'b', 'c', 'd'])
In [62]: s.str.cat(t)
Out[62]:
0 ab
1 bd
2 ce
3 dc
Length: 4, dtype: object
3 dd
Length: 4, dtype: object
Furthermore, Series.str.cat() now works for CategoricalIndex as well (previously raised a ValueError;
see GH20842).
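The definition of t is elided above; a plausible reconstruction consistent with the outputs shown would be:

import pandas as pd

s = pd.Series(["a", "b", "c", "d"])
t = pd.Series(["b", "d", "e", "c"], index=[1, 3, 4, 2])  # assumed values, matching the output above

s.str.cat(t)                           # no alignment: concatenated by position
s.str.cat(t, join="left", na_rep="-")  # align t on the index of s first; gaps become '-'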
DataFrame.astype() can now perform column-wise conversion to Categorical by supplying the string
'category' or a CategoricalDtype. Previously, attempting this would raise a NotImplementedError. See
the Object creation section of the documentation for more details and examples. (GH12860, GH18099)
Supplying the string 'category' performs column-wise conversion, with only labels appearing in a given
column set as categories:
In [64]: df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})
In [65]: df = df.astype('category')
In [66]: df['A'].dtype
Out[66]: CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)
In [67]: df['B'].dtype
Out[67]: CategoricalDtype(categories=['b', 'c', 'd'], ordered=False)
Supplying a CategoricalDtype will make the categories in each column consistent with the supplied dtype:
In [68]: from pandas.api.types import CategoricalDtype
In [71]: df = df.astype(cdt)
In [72]: df['A'].dtype
Out[72]: CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)
In [73]: df['B'].dtype
Out[73]: CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)
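The definition of cdt is elided above; a plausible reconstruction consistent with the resulting dtypes would be:

import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

cdt = CategoricalDtype(categories=list("abcd"), ordered=True)  # assumed definition of `cdt`
df = df.astype(cdt)

df["A"].dtype  # CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)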
Other enhancements
• Unary + now permitted for Series and DataFrame as numeric operator (GH16073)
• Better support for to_excel() output with the xlsxwriter engine. (GH16149)
• pandas.tseries.frequencies.to_offset() now accepts leading ‘+’ signs e.g. ‘+1h’. (GH18171)
• MultiIndex.unique() now supports the level= argument, to get unique values from a specific index
level (GH17896)
• pandas.io.formats.style.Styler now has method hide_index() to determine whether the index
will be rendered in output (GH14194)
• pandas.io.formats.style.Styler now has method hide_columns() to determine whether columns
will be hidden in output (GH14194)
• Improved wording of ValueError raised in to_datetime() when unit= is passed with a non-convertible
value (GH14350)
• Series.fillna() now accepts a Series or a dict as a value for a categorical dtype (GH17033)
• pandas.read_clipboard() updated to use qtpy, falling back to PyQt5 and then PyQt4, adding
compatibility with Python3 and multiple python-qt bindings (GH17722)
• Improved wording of ValueError raised in read_csv() when the usecols argument cannot match all
columns. (GH17301)
• DataFrame.corrwith() now silently drops non-numeric columns when passed a Series. Before, an
exception was raised (GH18570).
• IntervalIndex now supports time zone aware Interval objects (GH18537, GH18538)
• Series() / DataFrame() tab completion also returns identifiers in the first level of a MultiIndex().
(GH16326)
• read_excel() has gained the nrows parameter (GH16645)
• DataFrame.append() can now in more cases preserve the type of the calling dataframe’s columns (e.g.
if both are CategoricalIndex) (GH18359)
• DataFrame.to_json() and Series.to_json() now accept an index argument which allows the user
to exclude the index from the JSON output (GH17394)
• IntervalIndex.to_tuples() has gained the na_tuple parameter to control whether NA is returned
as a tuple of NA, or NA itself (GH18756)
• Categorical.rename_categories, CategoricalIndex.rename_categories and Series.cat.
rename_categories can now take a callable as their argument (GH18862)
• Interval and IntervalIndex have gained a length attribute (GH18789)
• Resampler objects now have a functioning pipe method. Previously, calls to pipe were diverted to
the mean method (GH17905).
• is_scalar() now returns True for DateOffset objects (GH18943).
• DataFrame.pivot() now accepts a list for the values= kwarg (GH17160).
• Added pandas.api.extensions.register_dataframe_accessor(), pandas.api.extensions.
register_series_accessor(), and pandas.api.extensions.register_index_accessor(), which allow
libraries downstream of pandas to register custom accessors like .cat on pandas objects.
See Registering Custom Accessors for more (GH14781).
• IntervalIndex.astype now supports conversions between subtypes when passed an IntervalDtype
(GH19197)
• IntervalIndex and its associated constructor methods (from_arrays, from_breaks, from_tuples)
have gained a dtype parameter (GH19262)
• Added pandas.core.groupby.SeriesGroupBy.is_monotonic_increasing() and pandas.core.
groupby.SeriesGroupBy.is_monotonic_decreasing() (GH17015)
• For subclassed DataFrames, DataFrame.apply() will now preserve the Series subclass (if defined)
when passing the data to the applied function (GH19822)
• DataFrame.from_dict() now accepts a columns argument that can be used to specify the column
names when orient='index' is used (GH18529)
• Added option display.html.use_mathjax so MathJax can be disabled when rendering tables in
Jupyter notebooks (GH19856, GH19824)
• DataFrame.replace() now supports the method parameter, which can be used to specify the replace-
ment method when to_replace is a scalar, list or tuple and value is None (GH19632)
• Timestamp.month_name(), DatetimeIndex.month_name(), and Series.dt.month_name() are now
available (GH12805)
• Timestamp.day_name() and DatetimeIndex.day_name() are now available to return day names with
a specified locale (GH12806)
• DataFrame.to_sql() now performs a multi-value insert if the underlying connection supports it,
rather than inserting row by row. SQLAlchemy dialects supporting multi-value inserts include: mysql,
postgresql, sqlite and any dialect with supports_multivalues_insert. (GH14315, GH8953)
• read_html() now accepts a displayed_only keyword argument to control whether or not hidden
elements are parsed (True by default) (GH20027)
• read_html() now reads all <tbody> elements in a <table>, not just the first. (GH20690)
• quantile() now accepts the interpolation keyword, linear by default (GH20497)
• zip compression is supported via compression='zip' in DataFrame.to_pickle(), Series.to_pickle(),
DataFrame.to_csv(), Series.to_csv(), DataFrame.to_json(), Series.to_json(). (GH17778)
• WeekOfMonth constructor now supports n=0 (GH20517).
• DataFrame and Series now support matrix multiplication (@) operator (GH10259) for Python>=3.5
• Updated DataFrame.to_gbq() and pandas.read_gbq() signature and documentation to reflect
changes from the Pandas-GBQ library version 0.4.0. Adds intersphinx mapping to the Pandas-GBQ
library. (GH20564)
• Added new writer for exporting Stata dta files in version 117, StataWriter117. This format supports
exporting strings with lengths up to 2,000,000 characters (GH16450)
• to_hdf() and read_hdf() now accept an errors keyword argument to control encoding error handling
(GH20835)
• cut() has gained the duplicates='raise'|'drop' option to control whether to raise on duplicated
edges (GH20947)
• date_range(), timedelta_range(), and interval_range() now return a linearly spaced index if
start, stop, and periods are specified, but freq is not; see the short sketch after this list. (GH20808, GH20983, GH20976)
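As referenced in the last item, a short illustrative sketch of the linearly spaced behavior (the endpoints here are arbitrary):

import pandas as pd

pd.date_range("2018-01-01", "2018-01-04", periods=7)  # 7 evenly spaced timestamps; no freq needed
pd.interval_range(start=0, end=10, periods=4)         # 4 evenly spaced intervals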
We have updated our minimum supported versions of dependencies (GH15184). If installed, we now require:
Instantiation from dicts preserves dict insertion order for python 3.6+
Until Python 3.6, dicts in Python had no formally defined ordering. For Python version 3.6 and later, dicts
are ordered by insertion order, see PEP 468. Pandas will use the dict’s insertion order, when creating a
Series or DataFrame from a dict and you’re using Python version 3.6 or higher. (GH19884)
Previous behavior (and current behavior if on Python < 3.6):
Notice that the Series is now ordered by insertion order. This new behavior is used for all relevant pandas
types (Series, DataFrame, SparseSeries and SparseDataFrame).
If you wish to retain the old behavior while using Python >= 3.6, you can use .sort_index():
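A minimal sketch of both behaviors (the dict contents are illustrative):

import pandas as pd

data = {"one": 1, "three": 3, "two": 2}

pd.Series(data)               # on Python 3.6+ the insertion order is kept: one, three, two
pd.Series(data).sort_index()  # restores the previously sorted ordering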
Deprecate Panel
Panel was deprecated in the 0.20.x release, showing as a DeprecationWarning. Using Panel will now show
a FutureWarning. The recommended way to represent 3-D data is with a MultiIndex on a DataFrame
via the to_frame() method or with the xarray package. Pandas provides a to_xarray() method to automate this
conversion (GH13563, GH18324).
In [76]: p = tm.makePanel()
In [77]: p
Out[77]:
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 3 (major_axis) x 4 (minor_axis)
Items axis: ItemA to ItemC
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to D
In [78]: p.to_frame()
Out[78]:
ItemA ItemB ItemC
major minor
2000-01-03 A 0.469112 0.721555 0.404705
B -1.135632 0.271860 -1.039268
C 0.119209 0.276232 -1.344312
D -2.104569 0.113648 -0.109050
2000-01-04 A -0.282863 -0.706771 0.577046
B 1.212112 -0.424972 -0.370647
C -1.044236 -1.087401 0.844885
D -0.494929 -1.478427 1.643563
2000-01-05 A -1.509059 -1.039575 -1.715002
B -0.173215 0.567020 -1.157892
C -0.861849 -0.673690 1.075770
D 1.071804 0.524988 -1.469388
In [79]: p.to_xarray()
Out[79]:
<xarray.DataArray (items: 3, major_axis: 3, minor_axis: 4)>
array([[[ 0.469112, -1.135632, 0.119209, -2.104569],
[-0.282863, 1.212112, -1.044236, -0.494929],
[-1.509059, -0.173215, -0.861849, 1.071804]],
pandas.core.common removals
The following error & warning messages are removed from pandas.core.common (GH13634, GH19769):
• PerformanceWarning
• UnsupportedFunctionCall
• UnsortedIndexError
• AbstractMethodError
These are available for import from pandas.errors (since 0.19.0).
DataFrame.apply() was inconsistent when applying an arbitrary user-defined function that returned a list-
like with axis=1. Several bugs and inconsistencies are resolved. If the applied function returns a Series, then
pandas will return a DataFrame; otherwise a Series will be returned. This includes the case where a list-like
(e.g. a tuple or list) is returned (GH16353, GH17437, GH17970, GH17348, GH17892, GH18573, GH17602,
GH18775, GH18901, GH18919).
In [77]: df
Out[77]:
A B C
0 1 2 3
1 1 2 3
2 1 2 3
3 1 2 3
4 1 2 3
5 1 2 3
[6 rows x 3 columns]
Previous behavior: if the returned shape happened to match the length of original columns, this would
return a DataFrame. If the return shape did not match, a Series with lists was returned.
New behavior: When the applied function returns a list-like, this will now always return a Series.
In [78]: df.apply(lambda x: [1, 2, 3], axis=1)
Out[78]:
0 [1, 2, 3]
1 [1, 2, 3]
2 [1, 2, 3]
3 [1, 2, 3]
4 [1, 2, 3]
5 [1, 2, 3]
Length: 6, dtype: object
[6 rows x 3 columns]
To broadcast the result across the original columns (the old behaviour for list-likes of the correct length),
you can use result_type='broadcast'. The shape must match the original columns.
[6 rows x 3 columns]
Returning a Series allows one to control the exact return structure and column names:
[6 rows x 3 columns]
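The two calls referred to above are elided; a minimal sketch of both return modes, reconstructing a frame like the one shown earlier:

import pandas as pd

df = pd.DataFrame({"A": 1, "B": 2, "C": 3}, index=range(6))

# returning a Series makes the result a DataFrame whose columns are the Series labels
df.apply(lambda x: pd.Series([1, 2], index=["foo", "bar"]), axis=1)

# result_type='broadcast' keeps the original columns; the returned shape must match them
df.apply(lambda x: [1, 2, 3], axis=1, result_type="broadcast")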
In a future version of pandas pandas.concat() will no longer sort the non-concatenation axis when it is
not already aligned. The current behavior is the same as the previous (sorting), but now a warning is issued
when sort is not specified and the non-concatenation axis is not aligned (GH4588).
In [83]: df1 = pd.DataFrame({"a": [1, 2], "b": [1, 2]}, columns=['b', 'a'])
[4 rows x 2 columns]
To keep the previous behavior (sorting) and silence the warning, pass sort=True
[4 rows x 2 columns]
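A brief sketch of the sort keyword on concat (df2 here is an assumption; df1 mirrors the definition above):

import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [1, 2]}, columns=["b", "a"])
df2 = pd.DataFrame({"a": [4, 5], "b": [8, 9]}, columns=["a", "b"])

pd.concat([df1, df2])              # columns not aligned: a warning about the future sorting change is emitted
pd.concat([df1, df2], sort=True)   # keep the current (sorting) behavior and silence the warning
pd.concat([df1, df2], sort=False)  # opt in to the future behavior: no sorting of the columns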
Build changes
• Building pandas for development now requires cython >= 0.24 (GH18613)
• Building from source now explicitly requires setuptools in setup.py (GH18113)
• Updated conda recipe to be in compliance with conda-build 3.0+ (GH18002)
Division operations on Index and subclasses will now fill division of positive numbers by zero with np.inf,
division of negative numbers by zero with -np.inf and 0 / 0 with np.nan. This matches existing Series
behavior. (GH19322, GH19347)
Previous behavior:
In [7]: index / 0
Out[7]: Int64Index([0, 0, 0], dtype='int64')
# Previous behavior yielded different results depending on the type of zero in the divisor
In [11]: pd.RangeIndex(1, 5) / 0
ZeroDivisionError: integer division or modulo by zero
Current behavior:
In [87]: index = pd.Int64Index([-1, 0, 1])
In [88]: index / 0
Out[88]: Float64Index([-inf, nan, inf], dtype='float64')
In [92]: pd.RangeIndex(1, 5) / 0
Out[92]: Float64Index([inf, inf, inf, inf], dtype='float64')
By default, extracting matching patterns from strings with str.extract() used to return a Series if a
single group was being extracted (a DataFrame if more than one group was extracted). As of Pandas 0.23.0
str.extract() always returns a DataFrame, unless expand is set to False. Finally, None was an accepted
value for the expand parameter (which was equivalent to False), but now raises a ValueError. (GH11386)
Previous behavior:
In [3]: extracted
Out [3]:
0 10
1 12
dtype: object
In [4]: type(extracted)
Out [4]:
pandas.core.series.Series
New behavior:
In [93]: s = pd.Series(['number 10', '12 eggs'])
In [95]: extracted
Out[95]:
0
0 10
1 12
[2 rows x 1 columns]
In [96]: type(extracted)
Out[96]: pandas.core.frame.DataFrame
To restore previous behavior, simply set expand to False:
In [97]: s = pd.Series(['number 10', '12 eggs'])
In [99]: extracted
Out[99]:
0 10
1 12
Length: 2, dtype: object
In [100]: type(extracted)
Out[100]: pandas.core.series.Series
The default value of the ordered parameter for CategoricalDtype has changed from False to None to allow
updating of categories without impacting ordered. Behavior should remain consistent for downstream
objects, such as Categorical (GH18790)
In previous versions, the default value for the ordered parameter was False. This could potentially lead to
the ordered parameter unintentionally being changed from True to False when users attempt to update
categories if ordered is not explicitly specified, as it would silently default to False. The new behavior
for ordered=None is to retain the existing value of ordered.
New behavior:
In [4]: cat
Out[4]:
[a, b, c, a, b, a]
Categories (3, object): [c < b < a]
In [6]: cat.astype(cdt)
Out[6]:
[a, b, c, a, b, a]
Categories (4, object): [c < b < a < d]
Notice in the example above that the converted Categorical has retained ordered=True. Had the default
value for ordered remained as False, the converted Categorical would have become unordered, despite
ordered=False never being explicitly specified. To change the value of ordered, explicitly pass it to the
new dtype, e.g. CategoricalDtype(categories=list('cbad'), ordered=False).
Note that the unintentional conversion of ordered discussed above did not arise in previous versions due
to separate bugs that prevented astype from doing any type of category to category conversion (GH10696,
GH18593). These bugs have been fixed in this release, and motivated changing the default value of ordered.
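A small sketch reconstructing the example above (the definitions of cat and cdt are assumptions consistent with the output shown):

import pandas as pd
from pandas.api.types import CategoricalDtype

cat = pd.Categorical(list("abcaba"), ordered=True, categories=list("cba"))

cdt = CategoricalDtype(categories=list("cbad"))  # ordered is not specified, i.e. ordered=None
cat.astype(cdt)  # the new categories are applied while the existing ordered=True is retained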
Previously, the default value for the maximum number of columns was pd.options.display.max_columns=20.
This meant that relatively wide data frames would not fit within the terminal width, and pandas would
introduce line breaks to display these 20 columns. This resulted in an output that was relatively difficult
to read:
If Python runs in a terminal, the maximum number of columns is now determined automatically so
that the printed data frame fits within the current terminal width (pd.options.display.max_columns=0)
(GH17023). If Python runs as a Jupyter kernel (such as the Jupyter QtConsole or a Jupyter notebook,
as well as in many IDEs), this value cannot be inferred automatically and is thus set to 20 as in previous
versions. In a terminal, this results in a much nicer output:
Note that if you don’t like the new default, you can always set this option yourself. To revert to the old
setting, you can run this line:
pd.options.display.max_columns = 20
• The default Timedelta constructor now accepts an ISO 8601 Duration string as an argument
(GH19040)
• Subtracting NaT from a Series with dtype='datetime64[ns]' returns a Series with
dtype='timedelta64[ns]' instead of dtype='datetime64[ns]' (GH18808)
• Series.astype() and Index.astype() with an incompatible dtype will now raise a TypeError rather
than a ValueError (GH18231)
• Series construction with an object dtyped tz-aware datetime and dtype=object specified, will now
return an object dtyped Series, previously this would infer the datetime dtype (GH18231)
• A Series of dtype=category constructed from an empty dict will now have categories of
dtype=object rather than dtype=float64, consistently with the case in which an empty list is passed
(GH18515)
• All-NaN levels in a MultiIndex are now assigned float rather than object dtype, promoting consis-
tency with Index (GH17929).
• Levels names of a MultiIndex (when not None) are now required to be unique: trying to create a
MultiIndex with repeated names will raise a ValueError (GH18872)
• Both construction and renaming of Index/MultiIndex with non-hashable name/names will now raise
TypeError (GH20527)
• Index.map() can now accept Series and dictionary input objects (GH12756, GH18482, GH18509).
• DataFrame.unstack() will now default to filling with np.nan for object columns. (GH12815)
• IntervalIndex constructor will raise if the closed parameter conflicts with how the input data is
inferred to be closed (GH18421)
• Inserting missing values into indexes will work for all types of indexes and automatically insert the
correct type of missing value (NaN, NaT, etc.) regardless of the type passed in (GH18295)
• When created with duplicate labels, MultiIndex now raises a ValueError. (GH17464)
• Series.fillna() now raises a TypeError instead of a ValueError when passed a list, tuple or
DataFrame as a value (GH18293)
• pandas.DataFrame.merge() no longer casts a float column to object when merging on int and
float columns (GH16572)
• pandas.merge() now raises a ValueError when trying to merge on incompatible data types (GH9780)
• The default NA value for UInt64Index has changed from 0 to NaN, which impacts methods that mask
with NA, such as UInt64Index.where() (GH18398)
• Refactored setup.py to use find_packages instead of explicitly listing out all subpackages (GH18535)
• Rearranged the order of keyword arguments in read_excel() to align with read_csv() (GH16672)
• wide_to_long() previously kept numeric-like suffixes as object dtype. Now they are cast to numeric
if possible (GH17627)
• In read_excel(), the comment argument is now exposed as a named parameter (GH18735)
• The options html.border and mode.use_inf_as_null were deprecated in prior versions, these will
now show FutureWarning rather than a DeprecationWarning (GH19003)
• IntervalIndex and IntervalDtype no longer support categorical, object, and string subtypes
(GH19016)
• IntervalDtype now returns True when compared against 'interval' regardless of subtype, and
IntervalDtype.name now returns 'interval' regardless of subtype (GH18980)
• KeyError now raises instead of ValueError in Index.drop(), MultiIndex.drop(), Series.drop(), and
DataFrame.drop() when dropping a non-existent element in an axis with duplicates (GH19186)
• Series.to_csv() now accepts a compression argument that works in the same way as the
compression argument in DataFrame.to_csv() (GH18958)
• Set operations (union, difference…) on IntervalIndex with incompatible index types will now raise a
TypeError rather than a ValueError (GH19329)
• DateOffset objects render more simply, e.g. <DateOffset: days=1> instead of <DateOffset:
kwds={'days': 1}> (GH19403)
• Categorical.fillna now validates its value and method keyword arguments. It now raises when
both or none are specified, matching the behavior of Series.fillna() (GH19682)
• pd.to_datetime('today') now returns a datetime, consistent with pd.Timestamp('today'); previ-
ously pd.to_datetime('today') returned a .normalized() datetime (GH19935)
• Series.str.replace() now takes an optional regex keyword which, when set to False, uses literal
string replacement rather than regex replacement (GH16808)
• DatetimeIndex.strftime() and PeriodIndex.strftime() now return an Index instead of a numpy
array to be consistent with similar accessors (GH20127)
• Constructing a Series from a list of length 1 no longer broadcasts this list when a longer index is
specified (GH19714, GH20391).
• DataFrame.to_dict() with orient='index' no longer casts int columns to float for a DataFrame
with only int and float columns (GH18580)
• A user-defined-function that is passed to Series.rolling().aggregate(), DataFrame.rolling().
aggregate(), or its expanding cousins, will now always be passed a Series, rather than a np.array;
.apply() only has the raw keyword, see here. This is consistent with the signatures of .aggregate()
across pandas (GH20584)
• Rolling and Expanding types raise NotImplementedError upon iteration (GH11704).
Deprecations
• The order parameter of factorize() is deprecated and will be removed in a future release (GH19727)
• Timestamp.weekday_name, DatetimeIndex.weekday_name, and Series.dt.weekday_name are depre-
cated in favor of Timestamp.day_name(), DatetimeIndex.day_name(), and Series.dt.day_name()
(GH12806)
• pandas.tseries.plotting.tsplot is deprecated. Use Series.plot() instead (GH18627)
• Index.summary() is deprecated and will be removed in a future version (GH18217)
• NDFrame.get_ftype_counts() is deprecated and will be removed in a future version (GH18243)
• The convert_datetime64 parameter in DataFrame.to_records() has been deprecated and will be
removed in a future version. The NumPy bug motivating this parameter has been resolved. The
default value for this parameter has also changed from True to None (GH18160).
• Series.rolling().apply(), DataFrame.rolling().apply(), Series.expanding().apply(), and
DataFrame.expanding().apply() have deprecated passing an np.array by default. One will need to
pass the new raw parameter to be explicit about what is passed (GH20584)
• The data, base, strides, flags and itemsize properties of the Series and Index classes have been
deprecated and will be removed in a future version (GH20419).
• DatetimeIndex.offset is deprecated. Use DatetimeIndex.freq instead (GH20716)
• Floor division between an integer ndarray and a Timedelta is deprecated. Divide by Timedelta.value
instead (GH19761)
• Setting PeriodIndex.freq (which was not guaranteed to work correctly) is deprecated. Use
PeriodIndex.asfreq() instead (GH20678)
• Index.get_duplicates() is deprecated and will be removed in a future version (GH20239)
• The previous default behavior of negative indices in Categorical.take is deprecated. In a future
version it will change from meaning missing values to meaning positional indices from the right. The
future behavior is consistent with Series.take() (GH20664).
• Passing multiple axes to the axis parameter in DataFrame.dropna() has been deprecated and will be
removed in a future version (GH20987)
Removal of prior version deprecations/changes
• Warnings against the obsolete usage Categorical(codes, categories), which were emitted for instance
when the first two arguments to Categorical() had different dtypes, and recommended the use of
Categorical.from_codes, have now been removed (GH8074)
• The levels and labels attributes of a MultiIndex can no longer be set directly (GH4039).
• pd.tseries.util.pivot_annual has been removed (deprecated since v0.19). Use pivot_table in-
stead (GH18370)
• pd.tseries.util.isleapyear has been removed (deprecated since v0.19). Use .is_leap_year prop-
erty in Datetime-likes instead (GH18370)
• pd.ordered_merge has been removed (deprecated since v0.19). Use pd.merge_ordered instead
(GH18459)
• The SparseList class has been removed (GH14007)
• The pandas.io.wb and pandas.io.data stub modules have been removed (GH13735)
• Categorical.from_array has been removed (GH13854)
• The freq and how parameters have been removed from the rolling/expanding/ewm methods of
DataFrame and Series (deprecated since v0.18). Instead, resample before calling the methods.
(GH18601 & GH18668)
• DatetimeIndex.to_datetime, Timestamp.to_datetime, PeriodIndex.to_datetime, and Index.
to_datetime have been removed (GH8254, GH14096, GH14113)
• read_csv() has dropped the skip_footer parameter (GH13386)
• read_csv() has dropped the as_recarray parameter (GH13373)
• read_csv() has dropped the buffer_lines parameter (GH13360)
• read_csv() has dropped the compact_ints and use_unsigned parameters (GH13323)
• The Timestamp class has dropped the offset attribute in favor of freq (GH13593)
• The Series, Categorical, and Index classes have dropped the reshape method (GH13012)
• pandas.tseries.frequencies.get_standard_freq has been removed in favor of pandas.tseries.
frequencies.to_offset(freq).rule_code (GH13874)
• The freqstr keyword has been removed from pandas.tseries.frequencies.to_offset in favor of
freq (GH13874)
• The Panel4D and PanelND classes have been removed (GH13776)
• The Panel class has dropped the to_long and toLong methods (GH19077)
• The options display.line_width and display.height are removed in favor of display.width and
display.max_rows respectively (GH4391, GH19107)
• The labels attribute of the Categorical class has been removed in favor of Categorical.codes
(GH7768)
• The flavor parameter has been removed from the to_sql() method (GH13611)
• The modules pandas.tools.hashing and pandas.util.hashing have been removed (GH16223)
• The top-level functions pd.rolling_*, pd.expanding_* and pd.ewm* have been removed (Deprecated
since v0.18). Instead, use the DataFrame/Series methods rolling, expanding and ewm (GH18723)
• Imports from pandas.core.common for functions such as is_datetime64_dtype are now removed.
These are located in pandas.api.types. (GH13634, GH19769)
• The infer_dst keyword in Series.tz_localize(), DatetimeIndex.tz_localize() and
DatetimeIndex have been removed. infer_dst=True is equivalent to ambiguous='infer',
and infer_dst=False to ambiguous='raise' (GH7963).
• When .resample() was changed from an eager to a lazy operation, like .groupby() in v0.18.0, we
put in place compatibility (with a FutureWarning), so operations would continue to work. This is now
fully removed, so a Resampler will no longer forward compat operations (GH20554)
• Remove long deprecated axis=None parameter from .replace() (GH20271)
Performance improvements
• Converting a Series of Timedelta objects to days, seconds, etc… sped up through vectorization of
underlying methods (GH18092)
• Improved performance of .map() with a Series/dict input (GH15081)
• The overridden Timedelta properties of days, seconds and microseconds have been removed, leveraging
their built-in Python versions instead (GH18242)
• Series construction will reduce the number of copies made of the input data in certain cases (GH17449)
• Improved performance of Series.dt.date() and DatetimeIndex.date() (GH18058)
• Improved performance of Series.dt.time() and DatetimeIndex.time() (GH18461)
• Improved performance of IntervalIndex.symmetric_difference() (GH18475)
• Improved performance of DatetimeIndex and Series arithmetic operations with Business-Month and
Business-Quarter frequencies (GH18489)
• Series() / DataFrame() tab completion limits to 100 values, for better performance. (GH18587)
• Improved performance of DataFrame.median() with axis=1 when bottleneck is not installed
(GH16468)
• Improved performance of MultiIndex.get_loc() for large indexes, at the cost of a reduction in per-
formance for small ones (GH18519)
• Improved performance of MultiIndex.remove_unused_levels() when there are no unused levels, at
the cost of a reduction in performance when there are (GH19289)
• Improved performance of Index.get_loc() for non-unique indexes (GH19478)
• Improved performance of pairwise .rolling() and .expanding() with .cov() and .corr() operations
(GH17917)
• Improved performance of pandas.core.groupby.GroupBy.rank() (GH15779)
• Improved performance of variable .rolling() on .min() and .max() (GH19521)
• Improved performance of pandas.core.groupby.GroupBy.ffill() and pandas.core.groupby.
GroupBy.bfill() (GH11296)
• Improved performance of pandas.core.groupby.GroupBy.any() and pandas.core.groupby.
GroupBy.all() (GH15435)
• Improved performance of pandas.core.groupby.GroupBy.pct_change() (GH19165)
• Improved performance of Series.isin() in the case of categorical dtypes (GH20003)
• Improved performance of getattr(Series, attr) when the Series has certain index types. This
manifested in slow printing of large Series with a DatetimeIndex (GH19764)
• Fixed a performance regression for GroupBy.nth() and GroupBy.last() with some object columns
(GH19283)
• Improved performance of pandas.core.arrays.Categorical.from_codes() (GH18501)
Documentation changes
Thanks to all of the contributors who participated in the Pandas Documentation Sprint, which took place
on March 10th. We had about 500 participants from over 30 locations across the world. You should notice
that many of the API docstrings have greatly improved.
There were too many simultaneous contributions to include a release note for each improvement, but this
GitHub search should give you an idea of how many docstrings were improved.
Special thanks to Marc Garcia for organizing the sprint. For more information, read the NumFOCUS
blogpost recapping the sprint.
• Changed spelling of “numpy” to “NumPy”, and “python” to “Python”. (GH19017)
• Consistency when introducing code samples, using either colon or period. Rewrote some sentences for
greater clarity, added more dynamic references to functions, methods and classes. (GH18941, GH18948,
GH18973, GH19017)
• Added a reference to DataFrame.assign() in the concatenate section of the merging documentation
(GH18665)
Bug fixes
Categorical
Warning: A class of bugs were introduced in pandas 0.21 with CategoricalDtype that affects the cor-
rectness of operations like merge, concat, and indexing when comparing multiple unordered Categorical
arrays that have the same categories, but in a different order. We highly recommend upgrading or man-
ually aligning your categories before doing these operations.
• Bug in Categorical.equals returning the wrong result when comparing two unordered Categorical
arrays with the same categories, but in a different order (GH16603)
• Bug in pandas.api.types.union_categoricals() returning the wrong result for unordered categoricals
with the categories in a different order. This affected pandas.concat() with Categorical data
(GH19096).
• Bug in pandas.merge() returning the wrong result when joining on an unordered Categorical that
had the same categories but in a different order (GH19551)
• Bug in CategoricalIndex.get_indexer() returning the wrong result when target was an unordered
Categorical that had the same categories as self but in a different order (GH19551)
• Bug in Index.astype() with a categorical dtype where the resultant index is not converted to a
CategoricalIndex for all types of index (GH18630)
• Bug in Series.astype() and Categorical.astype() where an existing categorical data does not get
updated (GH10696, GH18593)
• Bug in Series.str.split() with expand=True incorrectly raising an IndexError on empty strings
(GH20002).
• Bug in Index constructor with dtype=CategoricalDtype(...) where categories and ordered are
not maintained (GH19032)
• Bug in Series constructor with scalar and dtype=CategoricalDtype(...) where categories and
ordered are not maintained (GH19565)
• Bug in Categorical.__iter__ not converting to Python types (GH19909)
• Bug in pandas.factorize() returning the unique codes for the uniques. This now returns a
Categorical with the same dtype as the input (GH19721)
• Bug in pandas.factorize() including an item for missing values in the uniques return value
(GH19721)
• Bug in Series.take() with categorical data interpreting -1 in indices as missing value markers, rather
than the last element of the Series (GH20664)
Datetimelike
Timedelta
• Bug in Timedelta.__mul__() where multiplying by NaT returned NaT instead of raising a TypeError
(GH19819)
• Bug in Series with dtype='timedelta64[ns]' where addition or subtraction of TimedeltaIndex had
results cast to dtype='int64' (GH17250)
Timezones
• Bug in creating a Series from an array that contains both tz-naive and tz-aware values will result in
a Series whose dtype is tz-aware instead of object (GH16406)
• Bug in comparison of timezone-aware DatetimeIndex against NaT incorrectly raising TypeError
(GH19276)
• Bug in DatetimeIndex.astype() when converting between timezone aware dtypes, and converting
from timezone aware to naive (GH18951)
• Bug in comparing DatetimeIndex, which failed to raise TypeError when attempting to compare
timezone-aware and timezone-naive datetimelike objects (GH18162)
• Bug in localization of a naive datetime string in a Series constructor with a datetime64[ns, tz]
dtype (GH174151)
• Timestamp.replace() will now handle Daylight Savings transitions gracefully (GH18319)
• Bug in tz-aware DatetimeIndex where addition/subtraction with a TimedeltaIndex or array with
dtype='timedelta64[ns]' was incorrect (GH17558)
• Bug in DatetimeIndex.insert() where inserting NaT into a timezone-aware index incorrectly raised
(GH16357)
• Bug in DataFrame constructor, where tz-aware Datetimeindex and a given column name will result in
an empty DataFrame (GH19157)
• Bug in Timestamp.tz_localize() where localizing a timestamp near the minimum or maximum valid
values could overflow and return a timestamp with an incorrect nanosecond value (GH12677)
• Bug when iterating over DatetimeIndex that was localized with fixed timezone offset that rounded
nanosecond precision to microseconds (GH19603)
• Bug in DataFrame.diff() that raised an IndexError with tz-aware values (GH18578)
• Bug in melt() that converted tz-aware dtypes to tz-naive (GH15785)
• Bug in DataFrame.count() that raised a ValueError if DataFrame.dropna() was called for a single
column with timezone-aware values (GH13407)
Offsets
• Bug in WeekOfMonth and Week where addition and subtraction did not roll correctly (GH18510,
GH18672, GH18864)
• Bug in WeekOfMonth and LastWeekOfMonth where default keyword arguments for constructor raised
ValueError (GH19142)
• Bug in FY5253Quarter, LastWeekOfMonth where rollback and rollforward behavior was inconsistent
with addition and subtraction behavior (GH18854)
• Bug in FY5253 where datetime addition and subtraction incremented incorrectly for dates on the
year-end but not normalized to midnight (GH18854)
• Bug in FY5253 where date offsets could incorrectly raise an AssertionError in arithmetic operations
(GH14774)
Numeric
• Bug in Series constructor with an int or float list where specifying dtype=str, dtype='str' or
dtype='U' failed to convert the data elements to strings (GH16605)
• Bug in Index multiplication and division methods where operating with a Series would return an
Index object instead of a Series object (GH19042)
• Bug in the DataFrame constructor in which data containing very large positive or very large negative
numbers was causing OverflowError (GH18584)
• Bug in Index constructor with dtype='uint64' where int-like floats were not coerced to UInt64Index
(GH18400)
• Bug in DataFrame flex arithmetic (e.g. df.add(other, fill_value=foo)) with a fill_value other
than None failed to raise NotImplementedError in corner cases where either the frame or other has
length zero (GH19522)
• Multiplication and division of numeric-dtyped Index objects with timedelta-like scalars returns
TimedeltaIndex instead of raising TypeError (GH19333)
• Bug where NaN was returned instead of 0 by Series.pct_change() and DataFrame.pct_change()
when fill_method is not None (GH19873)
Strings
• Bug in Series.str.get() with a dictionary in the values and the index not in the keys, raising
KeyError (GH20671)
Indexing
MultiIndex
• Bug in MultiIndex.__contains__() where non-tuple keys would return True even if they had been
dropped (GH19027)
• Bug in MultiIndex.set_labels() which would cause casting (and potentially clipping) of the new
labels if the level argument is not 0 or a list like [0, 1, … ] (GH19057)
• Bug in MultiIndex.get_level_values() which would return an invalid index on level of ints with
missing values (GH17924)
• Bug in MultiIndex.unique() when called on empty MultiIndex (GH20568)
• Bug in MultiIndex.unique() which would not preserve level names (GH20570)
• Bug in MultiIndex.remove_unused_levels() which would fill nan values (GH18417)
• Bug in MultiIndex.from_tuples() which would fail to take zipped tuples in python3 (GH18434)
• Bug in MultiIndex.get_loc() which would fail to automatically cast values between float and int
(GH18818, GH15994)
• Bug in MultiIndex.get_loc() which would cast boolean to integer labels (GH19086)
• Bug in MultiIndex.get_loc() which would fail to locate keys containing NaN (GH18485)
• Bug in MultiIndex.get_loc() in large MultiIndex, would fail when levels had different dtypes
(GH18520)
• Bug in indexing where nested indexers having only numpy arrays are handled incorrectly (GH19686)
I/O
• read_html() now rewinds seekable IO objects after parse failure, before attempting to parse with a
new parser. If a parser errors and the object is non-seekable, an informative error is raised suggesting
the use of a different parser (GH17975)
• DataFrame.to_html() now has an option to add an id to the leading <table> tag (GH8496)
• Bug in read_msgpack() when a non-existent file is passed in Python 2 (GH15296)
• Bug in read_csv() where a MultiIndex with duplicate columns was not being mangled appropriately
(GH18062)
• Bug in read_csv() where missing values were not being handled properly when
keep_default_na=False with dictionary na_values (GH19227)
• Bug in read_csv() causing heap corruption on 32-bit, big-endian architectures (GH20785)
• Bug in read_sas() where a file with 0 variables gave an AttributeError incorrectly. Now it gives an
EmptyDataError (GH18184)
• Bug in DataFrame.to_latex() where pairs of braces meant to serve as invisible placeholders were
escaped (GH18667)
• Bug in DataFrame.to_latex() where a NaN in a MultiIndex would cause an IndexError or incorrect
output (GH14249)
• Bug in DataFrame.to_latex() where a non-string index-level name would result in an
AttributeError (GH19981)
• Bug in DataFrame.to_latex() where the combination of an index name and the index_names=False
option would result in incorrect output (GH18326)
• Bug in DataFrame.to_latex() where a MultiIndex with an empty string as its name would result in
incorrect output (GH18669)
• Bug in DataFrame.to_latex() where missing space characters caused wrong escaping and produced
non-valid latex in some cases (GH20859)
• Bug in read_json() where large numeric values were causing an OverflowError (GH18842)
• Bug in DataFrame.to_parquet() where an exception was raised if the write destination is S3
(GH19134)
• Interval now supported in DataFrame.to_excel() for all Excel file types (GH19242)
• Timedelta now supported in DataFrame.to_excel() for all Excel file types (GH19242, GH9155,
GH19900)
• Bug in pandas.io.stata.StataReader.value_labels() raising an AttributeError when called on
very old files. Now returns an empty dict (GH19417)
• Bug in read_pickle() when unpickling objects with TimedeltaIndex or Float64Index created with
pandas prior to version 0.20 (GH19939)
• Bug in pandas.io.json.json_normalize() where sub-records are not properly normalized if any
sub-records values are NoneType (GH20030)
• Bug in usecols parameter in read_csv() where error is not raised correctly when passing a string.
(GH20529)
• Bug in HDFStore.keys() when reading a file with a soft link causes exception (GH20523)
• Bug in HDFStore.select_column() where a key which is not a valid store raised an AttributeError
instead of a KeyError (GH17912)
Plotting
• Better error message when attempting to plot but matplotlib is not installed (GH19810).
• DataFrame.plot() now raises a ValueError when the x or y argument is improperly formed (GH18671)
• Bug in DataFrame.plot() when x and y arguments given as positions caused incorrect referenced
columns for line, bar and area plots (GH20056)
• Bug in formatting tick labels with datetime.time() and fractional seconds (GH18478).
• Series.plot.kde() has exposed the args ind and bw_method in the docstring (GH18461). The argu-
ment ind may now also be an integer (number of sample points).
• DataFrame.plot() now supports multiple columns to the y argument (GH19699)
Groupby/resample/rolling
• Bug when grouping by a single column and aggregating with a class like list or tuple (GH18079)
• Fixed regression in DataFrame.groupby() which would not emit an error when called with a tuple key
not in the index (GH18798)
• Bug in DataFrame.resample() which silently ignored unsupported (or mistyped) options for label,
closed and convention (GH19303)
• Bug in DataFrame.groupby() where tuples were interpreted as lists of keys rather than as keys
(GH17979, GH18249)
Sparse
• Bug in which creating a SparseDataFrame from a dense Series or an unsupported type raised an
uncontrolled exception (GH19374)
• Bug in SparseDataFrame.to_csv causing exception (GH19384)
• Bug in SparseSeries.memory_usage which caused segfault by accessing non sparse elements
(GH19368)
• Bug in constructing a SparseArray: if data is a scalar and index is defined it will coerce to float64
regardless of scalar’s dtype. (GH19163)
Reshaping
• Suppressed error in the construction of a DataFrame from a dict containing scalar values when the
corresponding keys are not included in the passed index (GH18600)
• Fixed (changed from object to float64) dtype of DataFrame initialized with axes, no data, and
dtype=int (GH19646)
• Bug in Series.rank() where Series containing NaT modifies the Series inplace (GH18521)
• Bug in cut() which fails when using readonly arrays (GH18773)
• Bug in DataFrame.pivot_table() which fails when the aggfunc arg is of type string. The behavior
is now consistent with other methods like agg and apply (GH18713)
• Bug in DataFrame.merge() in which merging using Index objects as vectors raised an Exception
(GH19038)
• Bug in DataFrame.stack(), DataFrame.unstack(), Series.unstack() which were not returning
subclasses (GH15563)
• Bug in timezone comparisons, manifesting as a conversion of the index to UTC in .concat() (GH18523)
• Bug in concat() when concatenating sparse and dense series it returns only a SparseDataFrame.
Should be a DataFrame. (GH18914, GH18686, and GH16874)
• Improved error message for DataFrame.merge() when there is no common merge key (GH19427)
• Bug in DataFrame.join() which does an outer instead of a left join when being called with multiple
DataFrames and some have non-unique indices (GH19624)
• Series.rename() now accepts axis as a kwarg (GH18589)
• Bug in rename() where an Index of same-length tuples was converted to a MultiIndex (GH19497)
• Comparisons between Series and Index would return a Series with an incorrect name, ignoring the
Index’s name attribute (GH19582)
• Bug in qcut() where datetime and timedelta data with NaT present raised a ValueError (GH19768)
• Bug in DataFrame.iterrows(), which would infer strings not compliant with ISO 8601 as datetimes
(GH19671)
• Bug in Series constructor with Categorical where a ValueError is not raised when an index of
different length is given (GH19342)
• Bug in DataFrame.astype() where column metadata is lost when converting to categorical or a dic-
tionary of dtypes (GH19920)
• Bug in cut() and qcut() where timezone information was dropped (GH19872)
• Bug in Series constructor with dtype=str, which previously raised in some cases (GH19853)
• Bug in get_dummies(), and select_dtypes(), where duplicate column names caused incorrect be-
havior (GH20848)
• Bug in isna(), which cannot handle ambiguous typed lists (GH20675)
• Bug in concat() which raises an error when concatenating TZ-aware dataframes and all-NaT
dataframes (GH12396)
• Bug in concat() which raises an error when concatenating empty TZ-aware series (GH18447)
Other
• Improved error message when attempting to use a Python keyword as an identifier in a numexpr backed
query (GH18221)
• Bug in accessing a pandas.get_option(), which raised KeyError rather than OptionError when
looking up a non-existent option key in some cases (GH19789)
• Bug in testing.assert_series_equal() and testing.assert_frame_equal() for Series or
DataFrames with differing unicode data (GH20503)
Contributors
A total of 328 people contributed patches to this release. People with a “+” by their names contributed a
patch for the first time.
• Aaron Critchley
• AbdealiJK +
• Adam Hooper +
• Albert Villanova del Moral
• Alejandro Giacometti +
• Alejandro Hohmann +
• Alex Rychyk
• Alexander Buchkovsky
• Alexander Lenail +
• Alexander Michael Schade
• Aly Sivji +
• Andreas Költringer +
• Andrew
• Andrew Bui +
• András Novoszáth +
• Andy Craze +
• Andy R. Terrel
• Anh Le +
• Anil Kumar Pallekonda +
• Antoine Pitrou +
• Antonio Linde +
• Antonio Molina +
• Antonio Quinonez +
• Armin Varshokar +
• Artem Bogachev +
• Avi Sen +
• Azeez Oluwafemi +
• Ben Auffarth +
• Bernhard Thiel +
• Bhavesh Poddar +
• BielStela +
• Blair +
• Bob Haffner
• Brett Naul +
• Brock Mendel
• Bryce Guinta +
• Carlos Eduardo Moreira dos Santos +
• Carlos García Márquez +
• Carol Willing
• Cheuk Ting Ho +
• Chitrank Dixit +
• Chris
• Chris Burr +
• Chris Catalfo +
• Chris Mazzullo
• Christian Chwala +
• Cihan Ceyhan +
• Clemens Brunner
• Colin +
• Cornelius Riemenschneider
• Crystal Gong +
• DaanVanHauwermeiren
• Dan Dixey +
• Daniel Frank +
• Daniel Garrido +
• Daniel Sakuma +
• DataOmbudsman +
• Dave Hirschfeld
• Dave Lewis +
• David Adrián Cañones Castellano +
• David Arcos +
• David C Hall +
• David Fischer
• David Hoese +
• David Lutz +
• David Polo +
• David Stansby
• Dennis Kamau +
• Dillon Niederhut
• Dimitri +
• Dr. Irv
• Dror Atariah
• Eric Chea +
• Eric Kisslinger
• Eric O. LEBIGOT (EOL) +
• FAN-GOD +
• Fabian Retkowski +
• Fer Sar +
• Gabriel de Maeztu +
• Gianpaolo Macario +
• Giftlin Rajaiah
• Gilberto Olimpio +
• Gina +
• Gjelt +
• Graham Inggs +
• Grant Roch
• Grant Smith +
• Grzegorz Konefał +
• Guilherme Beltramini
• HagaiHargil +
• Hamish Pitkeathly +
• Hammad Mashkoor +
• Hannah Ferchland +
• Hans
• Haochen Wu +
• Hissashi Rocha +
• Iain Barr +
• Ibrahim Sharaf ElDen +
• Ignasi Fosch +
• Igor Conrado Alves de Lima +
• Igor Shelvinskyi +
• Imanflow +
• Ingolf Becker
• Israel Saeta Pérez
• Iva Koevska +
• Jakub Nowacki +
• Jan F-F +
• Jan Koch +
• Jan Werkmann
• Janelle Zoutkamp +
• Jason Bandlow +
• Jaume Bonet +
• Jay Alammar +
• Jeff Reback
• JennaVergeynst
• Jimmy Woo +
• Jing Qiang Goh +
• Joachim Wagner +
• Joan Martin Miralles +
• Joel Nothman
• Joeun Park +
• John Cant +
• Johnny Metz +
• Jon Mease
• Jonas Schulze +
• Jongwony +
• Jordi Contestí +
• Joris Van den Bossche
• José F. R. Fonseca +
• Jovixe +
• Julio Martinez +
• Jörg Döpfert
• KOBAYASHI Ittoku +
• Kate Surta +
• Kenneth +
• Kevin Kuhl
• Kevin Sheppard
• Krzysztof Chomski
• Ksenia +
• Ksenia Bobrova +
• Kunal Gosar +
• Kurtis Kerstein +
• Kyle Barron +
• Laksh Arora +
• Laurens Geffert +
• Leif Walsh
• Liam Marshall +
• Liam3851 +
• Licht Takeuchi
• Liudmila +
• Ludovico Russo +
• Mabel Villalba +
• Manan Pal Singh +
• Manraj Singh
• Marc +
• Marc Garcia
• Marco Hemken +
• Maria del Mar Bibiloni +
• Mario Corchero +
• Mark Woodbridge +
• Martin Journois +
• Mason Gallo +
• Matias Heikkilä +
• Matt Braymer-Hayes
• Matt Kirk +
• Matt Maybeno +
• Matthew Kirk +
• Matthew Rocklin +
• Matthew Roeschke
• Matthias Bussonnier +
• Max Mikhaylov +
• Maxim Veksler +
• Maximilian Roos
• Maximiliano Greco +
• Michael Penkov
• Michael Röttger +
• Michael Selik +
• Michael Waskom
• Mie~~~
• Mike Kutzma +
• Ming Li +
• Mitar +
• Mitch Negus +
• Montana Low +
• Moritz Münst +
• Mortada Mehyar
• Myles Braithwaite +
• Nate Yoder
• Nicholas Ursa +
• Nick Chmura
• Nikos Karagiannakis +
• Nipun Sadvilkar +
• Nis Martensen +
• Noah +
• Noémi Éltető +
• Olivier Bilodeau +
• Ondrej Kokes +
• Onno Eberhard +
• Paul Ganssle +
• Paul Mannino +
• Paul Reidy
• Paulo Roberto de Oliveira Castro +
• Pepe Flores +
• Peter Hoffmann
• Phil Ngo +
• Pietro Battiston
• Pranav Suri +
• Priyanka Ojha +
• Pulkit Maloo +
• README Bot +
• Ray Bell +
• Riccardo Magliocchetti +
• Ridhwan Luthra +
• Robert Meyer
• Robin
• Robin Kiplang’at +
• Rohan Pandit +
• Rok Mihevc +
• Rouz Azari
• Ryszard T. Kaleta +
• Sam Cohan
• Sam Foo
• Samir Musali +
• Samuel Sinayoko +
• Sangwoong Yoon
• SarahJessica +
• Sharad Vijalapuram +
• Shubham Chaudhary +
• SiYoungOh +
• Sietse Brouwer
• Simone Basso +
• Stefania Delprete +
• Stefano Cianciulli +
• Stephen Childs +
• StephenVoland +
• Stijn Van Hoey +
• Sven
• Talitha Pumar +
• Tarbo Fukazawa +
• Ted Petrou +
• Thomas A Caswell
• Tim Hoffmann +
• Tim Swast
• Tom Augspurger
• Tommy +
• Tulio Casagrande +
• Tushar Gupta +
• Tushar Mittal +
• Upkar Lidder +
• Victor Villas +
• Vince W +
• Vinícius Figueiredo +
• Vipin Kumar +
• WBare
• Wenhuan +
• Wes Turner
• William Ayd
• Wilson Lin +
• Xbar
• Yaroslav Halchenko
• Yee Mey
• Yeongseon Choe +
• Yian +
• Yimeng Zhang
• ZhuBaohe +
• Zihao Zhao +
• adatasetaday +
• akielbowicz +
• akosel +
• alinde1 +
• amuta +
• bolkedebruin
• cbertinato
• cgohlke
• charlie0389 +
• chris-b1
• csfarkas +
• dajcs +
• deflatSOCO +
• derestle-htwg
• discort
• dmanikowski-reef +
• donK23 +
• elrubio +
• fivemok +
• fjdiod
• fjetter +
• froessler +
• gabrielclow
• gfyoung
• ghasemnaddaf
• h-vetinari +
• himanshu awasthi +
• ignamv +
• jayfoad +
• jazzmuesli +
• jbrockmendel
• jen w +
• jjames34 +
• joaoavf +
• joders +
• jschendel
• juan huguet +
• l736x +
• luzpaz +
• mdeboc +
• miguelmorin +
• miker985
• miquelcamprodon +
• orereta +
• ottiP +
• peterpanmj +
• rafarui +
• raph-m +
• readyready15728 +
• rmihael +
• samghelms +
• scriptomation +
• sfoo +
• stefansimik +
• stonebig
• tmnhat2001 +
• tomneep +
• topper-123
• tv3141 +
• verakai +
• xpvpc +
• zhanghui +
{{ header }}
This is a major release from 0.21.1 and includes a single, API-breaking change. We recommend that all users
upgrade to this version after carefully reading the release note (singular!).
Pandas 0.22.0 changes the handling of empty and all-NA sums and products. The summary is that
• The sum of an empty or all-NA Series is now 0
• The product of an empty or all-NA Series is now 1
• We’ve added a min_count parameter to .sum() and .prod() controlling the minimum number of valid
values for the result to be valid. If fewer than min_count non-NA values are present, the result is NA.
The default is 0. To return NaN, the 0.21 behavior, use min_count=1.
Some background: In pandas 0.21, we fixed a long-standing inconsistency in the return value of all-NA series
depending on whether or not bottleneck was installed. See Sum/Prod of all-NaN or empty Series/DataFrames
is now consistently NaN . At the same time, we changed the sum and prod of an empty Series to also be
NaN.
Based on feedback, we’ve partially reverted those changes.
Arithmetic operations
pandas 0.21.x
In [1]: pd.Series([]).sum()
Out[1]: nan
In [2]: pd.Series([np.nan]).sum()
Out[2]: nan
pandas 0.22.0
In [1]: pd.Series([]).sum()
Out[1]: 0.0
In [2]: pd.Series([np.nan]).sum()
Out[2]: 0.0
The default behavior is the same as pandas 0.20.3 with bottleneck installed. It also matches the behavior of
NumPy’s np.nansum on empty and all-NA arrays.
To have the sum of an empty series return NaN (the default behavior of pandas 0.20.3 without bottleneck,
or pandas 0.21.x), use the min_count keyword.
In [3]: pd.Series([]).sum(min_count=1)
Out[3]: nan
Thanks to the skipna parameter, the .sum on an all-NA series is conceptually the same as the .sum of an
empty one with skipna=True (the default).
The min_count parameter refers to the minimum number of non-null values required for a non-NA sum or
product.
Series.prod() has been updated to behave the same as Series.sum(), returning 1 instead.
In [5]: pd.Series([]).prod()
Out[5]: 1.0
In [6]: pd.Series([np.nan]).prod()
Out[6]: 1.0
In [7]: pd.Series([]).prod(min_count=1)
Out[7]: nan
These changes affect DataFrame.sum() and DataFrame.prod() as well. Finally, a few less obvious places in
pandas are affected by this change.
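As a small sketch of how min_count interacts with DataFrame reductions (the frame below is illustrative only):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, np.nan], 'b': [1.0, np.nan]})

df.sum()             # the all-NA column 'a' now sums to 0.0
df.prod()            # and its product is 1.0
df.sum(min_count=1)  # require at least one non-NA value: 'a' becomes NaN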
Grouping by a categorical
Grouping by a Categorical and summing now returns 0 instead of NaN for categories with no observations.
The product now returns 1 instead of NaN.
pandas 0.21.x
pandas 0.22
To restore the 0.21 behavior of returning NaN for unobserved groups, use min_count>=1.
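A short sketch of this, assuming a toy grouper with an unobserved category:
import pandas as pd

grouper = pd.Categorical(['a', 'a'], categories=['a', 'b'])
s = pd.Series([1, 2])

s.groupby(grouper).sum()             # unobserved category 'b' -> 0
s.groupby(grouper).sum(min_count=1)  # unobserved category 'b' -> NaN (0.21 behavior)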
Resample
The sum and product of all-NA bins has changed from NaN to 0 for sum and 1 for product.
pandas 0.21.x
In [12]: s.resample('2d').sum()
Out[12]:
2017-01-01 2.0
2017-01-03 NaN
Freq: 2D, dtype: float64
pandas 0.22.0
In [12]: s.resample('2d').sum()
Out[12]:
2017-01-01 2.0
2017-01-03 0.0
Freq: 2D, Length: 2, dtype: float64
In [13]: s.resample('2d').sum(min_count=1)
Out[13]:
2017-01-01 2.0
2017-01-03 NaN
Freq: 2D, Length: 2, dtype: float64
In particular, upsampling and taking the sum or product is affected, as upsampling introduces missing values
even if the original series was entirely valid.
pandas 0.21.x
pandas 0.22.0
Once again, the min_count keyword is available to restore the 0.21 behavior.
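For instance, a minimal sketch of the upsampling case (the daily series below is illustrative):
import pandas as pd

s = pd.Series([1.0, 2.0], index=pd.date_range('2017-01-01', periods=2, freq='D'))

# Upsampling to 12H introduces bins with no original observations.
s.resample('12H').sum()              # the empty bins now sum to 0.0
s.resample('12H').sum(min_count=1)   # restore NaN for the empty bins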
Rolling and expanding already have a min_periods keyword that behaves similar to min_count. The only
case that changes is when doing a rolling or expanding sum with min_periods=0. Previously this returned
NaN, when fewer than min_periods non-NA values were in the window. Now it returns 0.
pandas 0.21.1
pandas 0.22.0
The default behavior of min_periods=None, implying that min_periods equals the window size, is un-
changed.
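A minimal sketch of the one case that changed:
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])

s.rolling(2, min_periods=0).sum()  # windows with no valid values now return 0.0
s.rolling(2).sum()                 # min_periods=None still means the window size; all NaN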
Compatibility
If you maintain a library that should work across pandas versions, it may be easiest to exclude pandas 0.21
from your requirements. Otherwise, all your sum() calls would need to check if the Series is empty before
summing.
With setuptools, in your setup.py use:
install_requires=['pandas!=0.21.*', ...]
In a conda recipe (meta.yaml), use:
requirements:
  run:
    - pandas !=0.21.0,!=0.21.1
Note that the inconsistency in the return value for all-NA series is still there for pandas 0.20.3 and earlier.
Avoiding pandas 0.21 will only help with the empty case.
Contributors
A total of 1 people contributed patches to this release. People with a “+” by their names contributed a
patch for the first time.
• Tom Augspurger
{{ header }}
This is a minor bug-fix release in the 0.21.x series and includes some small regression fixes, bug fixes and
performance improvements. We recommend that all users upgrade to this version.
Highlights include:
• Temporarily restore matplotlib datetime plotting functionality. This should resolve issues for users
who implicitly relied on pandas to plot datetimes with matplotlib. See here.
• Improvements to the Parquet IO functions introduced in 0.21.0. See here.
Pandas implements some matplotlib converters for nicely formatting the axis labels on plots with datetime
or Period values. Prior to pandas 0.21.0, these were implicitly registered with matplotlib, as a side effect of
import pandas.
In pandas 0.21.0, we required users to explicitly register the converter. This caused problems for some users
who relied on those converters being present for regular matplotlib.pyplot plotting methods, so we’re
temporarily reverting that change; pandas 0.21.1 again registers the converters on import, just like before
0.21.0.
We’ve added a new option to control the converters: pd.options.plotting.matplotlib.
register_converters. By default, they are registered. Toggling this to False removes pandas’
formatters and restores any converters we overwrote when registering them (GH18301).
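A short sketch of toggling the option (names as documented above):
import pandas as pd

# Converters are registered on import, as before 0.21.0. Turning the option off
# removes pandas' formatters and restores any converters they had overwritten.
pd.options.plotting.matplotlib.register_converters = False

# Turn them back on if desired.
pd.options.plotting.matplotlib.register_converters = True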
We’re working with the matplotlib developers to make this easier. We’re trying to balance user convenience
(automatically registering the converters) with import performance and best practices (importing pandas
shouldn’t have the side effect of overwriting any custom converters you’ve already set). In the future we hope
to have most of the datetime formatting functionality in matplotlib, with just the pandas-specific converters
in pandas. We’ll then gracefully deprecate the automatic registration of converters in favor of users explicitly
registering them when they want them.
New features
• DataFrame.to_parquet() will now write non-default indexes when the underlying engine supports it.
The indexes will be preserved when reading back in with read_parquet() (GH18581).
• read_parquet() now allows specifying the columns to read from a parquet file (GH18154)
• read_parquet() now allows specifying kwargs which are passed to the respective engine (GH18216)
Other enhancements
Deprecations
Performance improvements
Bug fixes
Conversion
• Bug in TimedeltaIndex subtraction could incorrectly overflow when NaT is present (GH17791)
• Bug in DatetimeIndex subtracting datetimelike from DatetimeIndex could fail to overflow (GH18020)
• Bug in IntervalIndex.copy() when copying an IntervalIndex with non-default closed (GH18339)
• Bug in DataFrame.to_dict() where columns of datetime that are tz-aware were not converted to
required arrays when used with orient='records', raising TypeError (GH18372)
• Bug in DatetimeIndex and date_range() where mismatching tz-aware start and end timezones
would not raise an error if end.tzinfo is None (GH18431)
• Bug in Series.fillna() which raised when passed a long integer on Python 2 (GH18159).
Indexing
I/O
Plotting
• Bug in DataFrame.plot() and Series.plot() with DatetimeIndex where a figure generated by them
is not pickleable in Python 3 (GH18439)
Groupby/resample/rolling
Reshaping
• Error message in pd.merge_asof() for key datatype mismatch now includes datatype of left and right
key (GH18068)
• Bug in pd.concat when empty and non-empty DataFrames or Series are concatenated (GH18178,
GH18187)
• Bug in DataFrame.filter(...) when unicode is passed as a condition in Python 2 (GH13101)
• Bug when merging empty DataFrames when np.seterr(divide='raise') is set (GH17776)
Numeric
• Bug in pd.Series.rolling.skew() and rolling.kurt() where windows of all-equal values suffered a
floating-point precision issue (GH18044)
Categorical
String
• Series.str.split() will now propagate NaN values across all expanded columns instead of None
(GH18450)
Contributors
A total of 46 people contributed patches to this release. People with a “+” by their names contributed a
patch for the first time.
• Aaron Critchley +
• Alex Rychyk
• Alexander Buchkovsky +
• Alexander Michael Schade +
• Chris Mazzullo
• Cornelius Riemenschneider +
• Dave Hirschfeld +
• David Fischer +
• David Stansby +
• Dror Atariah +
• Eric Kisslinger +
• Hans +
• Ingolf Becker +
• Jan Werkmann +
• Jeff Reback
• Joris Van den Bossche
• Jörg Döpfert +
• Kevin Kuhl +
• Krzysztof Chomski +
• Leif Walsh
• Licht Takeuchi
• Manraj Singh +
• Matt Braymer-Hayes +
• Michael Waskom +
• Mie~~~ +
• Peter Hoffmann +
• Robert Meyer +
• Sam Cohan +
• Sietse Brouwer +
• Sven +
• Tim Swast
• Tom Augspurger
• Wes Turner
• William Ayd +
• Yee Mey +
• bolkedebruin +
• cgohlke
• derestle-htwg +
• fjdiod +
• gabrielclow +
• gfyoung
• ghasemnaddaf +
• jbrockmendel
• jschendel
• miker985 +
• topper-123
{{ header }}
This is a major release from 0.20.3 and includes a number of API changes, deprecations, new features,
enhancements, and performance improvements along with a large number of bug fixes. We recommend that
all users upgrade to this version.
Highlights include:
• Integration with Apache Parquet, including a new top-level read_parquet() function and DataFrame.
to_parquet() method, see here.
• New user-facing pandas.api.types.CategoricalDtype for specifying categoricals independent of the
data, see here.
• The behavior of sum and prod on all-NaN Series/DataFrames is now consistent and no longer depends
on whether bottleneck is installed, and sum and prod on empty Series now return NaN instead of 0,
see here.
• Compatibility fixes for pypy, see here.
• Additions to the drop, reindex and rename API to make them more consistent, see here.
• Addition of the new methods DataFrame.infer_objects (see here) and GroupBy.pipe (see here).
• Indexing with a list of labels, where one or more of the labels is missing, is deprecated and will raise a
KeyError in a future version, see here.
Check the API Changes and deprecations before updating.
• New features
– Integration with Apache Parquet file format
– infer_objects type conversion
– Reshaping
– Numeric
– Categorical
– PyPy
– Other
• Contributors
New features
Integration with Apache Parquet, including a new top-level read_parquet() and DataFrame.to_parquet()
method, see here (GH15838, GH17438).
Apache Parquet provides a cross-language, binary file format for reading and writing data frames efficiently.
Parquet is designed to faithfully serialize and de-serialize DataFrame s, supporting all of the pandas dtypes,
including extension dtypes such as datetime with timezones.
This functionality depends on either the pyarrow or fastparquet library. For more details, see the IO
docs on Parquet.
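A minimal round-trip sketch; the file name is illustrative, and tz-aware columns require an engine that supports them (for example pyarrow):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': pd.date_range('2017-01-01', periods=3, tz='US/Eastern')})

df.to_parquet('example.parquet')           # needs pyarrow or fastparquet installed
result = pd.read_parquet('example.parquet')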
infer_objects type conversion
In [2]: df.dtypes
Out[2]:
A int64
B object
C object
Length: 3, dtype: object
In [3]: df.infer_objects().dtypes
Out[3]:
A int64
B int64
C object
Length: 3, dtype: object
Note that column 'C' was not converted - only scalar numeric types will be converted to a new type.
Other types of conversion should be accomplished using the to_numeric() function (or to_datetime(),
to_timedelta()).
In [4]: df = df.infer_objects()
In [6]: df.dtypes
Out[6]:
A int64
B int64
C int64
Length: 3, dtype: object
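One DataFrame consistent with the dtypes shown above, together with the follow-up to_numeric() conversion, is sketched below (the values are illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': np.array([1, 2, 3], dtype='object'),
                   'C': ['1', '2', '3']})

df.infer_objects().dtypes        # 'B' becomes int64, 'C' stays object

df = df.infer_objects()
df['C'] = pd.to_numeric(df['C'], errors='coerce')
df.dtypes                        # all three columns are now int64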
New users are often puzzled by the relationship between column operations and attribute access on DataFrame
instances (GH7175). One specific instance of this confusion is attempting to create a new column by setting
an attribute on the DataFrame:
This does not raise any obvious exceptions, but also does not create a new column:
In [3]: df
Out[3]:
one
0 1.0
1 2.0
2 3.0
Setting a list-like data structure into a new attribute now raises a UserWarning about the potential for
unexpected behavior. See Attribute Access.
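A minimal sketch of the difference (the column names are illustrative):
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0]})

df.two = [4, 5, 6]      # does NOT create a column; now emits a UserWarning

df['two'] = [4, 5, 6]   # the supported spelling
# or equivalently: df = df.assign(two=[4, 5, 6])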
The drop() method has gained index/columns keywords as an alternative to specifying the axis. This is
similar to the behavior of reindex (GH12392).
For example:
In [7]: df = pd.DataFrame(np.arange(8).reshape(2, 4),
...: columns=['A', 'B', 'C', 'D'])
...:
In [8]: df
Out[8]:
A B C D
0 0 1 2 3
1 4 5 6 7
[2 rows x 4 columns]
[2 rows x 2 columns]
[2 rows x 2 columns]
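A sketch of the two equivalent drop spellings, using the df defined above (the dropped labels are illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=['A', 'B', 'C', 'D'])

df.drop(['B', 'C'], axis='columns')   # previous spelling, via axis
df.drop(columns=['B', 'C'])           # new spelling, via the columns keyword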
The DataFrame.rename() and DataFrame.reindex() methods have gained the axis keyword to specify the
axis to target with the operation (GH12392).
Here’s rename:
In [11]: df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
[3 rows x 2 columns]
[3 rows x 2 columns]
And reindex:
In [14]: df.reindex(['A', 'B', 'C'], axis='columns')
Out[14]:
A B C
0 1 4 NaN
1 2 5 NaN
2 3 6 NaN
[3 rows x 3 columns]
[3 rows x 2 columns]
The “index, columns” style continues to work as before.
In [16]: df.rename(index=id, columns=str.lower)
Out[16]:
a b
4510573680 1 4
4510573712 2 5
4510573744 3 6
[3 rows x 2 columns]
[3 rows x 3 columns]
We highly encourage using named arguments to avoid confusion when using either style.
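A short sketch of both styles, using the df defined above:
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

df.rename(str.lower, axis='columns')          # the new axis keyword...
df.reindex(['A', 'B', 'C'], axis='columns')

df.rename(columns=str.lower)                  # ...or the named-argument style
df.reindex(columns=['A', 'B', 'C'])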
pandas.api.types.CategoricalDtype has been added to the public API and expanded to include the
categories and ordered attributes. A CategoricalDtype can be used to specify the set of categories and
orderedness of an array, independent of the data. This can be useful for example, when converting string
data to a Categorical (GH14711, GH15078, GH16015, GH17643):
In [21]: s.astype(dtype)
Out[21]:
0 a
1 b
2 c
3 a
Length: 4, dtype: category
Categories (4, object): [a < b < c < d]
One place that deserves special mention is in read_csv(). Previously, with dtype={'col': 'category'},
the returned values and categories would always be strings.
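A minimal sketch, with s and dtype chosen to be consistent with the output shown above; the read_csv() line is a commented, hypothetical illustration:
import pandas as pd
from pandas.api.types import CategoricalDtype

s = pd.Series(['a', 'b', 'c', 'a'])

# Categories and orderedness are specified independently of the data.
dtype = CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)
s.astype(dtype)

# The same dtype can be passed to read_csv(), e.g.
# pd.read_csv(path, dtype={'col': dtype})   # 'path' and 'col' are hypothetical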
GroupBy objects now have a pipe method, similar to the one on DataFrame and Series, that allow for
functions that take a GroupBy to be composed in a clean, readable syntax. (GH17871)
For a concrete example on combining .groupby and .pipe , imagine having a DataFrame with columns
for stores, products, revenue and sold quantity. We’d like to do a groupwise calculation of prices (i.e.
revenue/quantity) per store and per product. We could do this in a multi-step operation, but expressing it
in terms of piping can make the code more readable.
First we set the data:
In [27]: n = 1000
In [29]: df.head(2)
Out[29]:
Store Product Revenue Quantity
[2 rows x 4 columns]
[2 rows x 3 columns]
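One plausible setup consistent with the columns shown above, followed by the piped price calculation; the random data and the store/product names are assumptions:
import numpy as np
import pandas as pd

np.random.seed(0)
n = 1000
df = pd.DataFrame({'Store': np.random.choice(['Store_1', 'Store_2'], n),
                   'Product': np.random.choice(['Product_1', 'Product_2'], n),
                   'Revenue': np.random.normal(loc=100, scale=30, size=n).round(2),
                   'Quantity': np.random.randint(1, 10, size=n)})

# Groupwise price (revenue / quantity) per store and product, expressed as a pipe.
(df.groupby(['Store', 'Product'])
   .pipe(lambda grp: grp.Revenue.sum() / grp.Quantity.sum())
   .unstack()
   .round(2))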
rename_categories() now accepts a dict-like argument for new_categories. The previous categories are
looked up in the dictionary’s keys and replaced if found. The behavior of missing and extra keys is the same
as in DataFrame.rename().
Warning: To assist with upgrading pandas, rename_categories treats Series as list-like. Typ-
ically, Series are considered to be dict-like (e.g. in .rename, .map). In a future version of pandas
rename_categories will change to treat them as dict-like. Follow the warning message’s recommenda-
tions for writing future-proof code.
In [33]: c.rename_categories(pd.Series([0, 1], index=['a', 'c']))
FutureWarning: Treating Series 'new_categories' as a list-like and using the values.
In a future version, 'rename_categories' will treat Series like a dictionary.
For dict-like, use 'new_categories.to_dict()'
For list-like, use 'new_categories.values'.
Out[33]:
[0, 0, 1]
Categories (2, int64): [0, 1]
Other enhancements
New keywords
• Added a skipna parameter to infer_dtype() to support type inference in the presence of missing
values (GH17059).
• Series.to_dict() and DataFrame.to_dict() now support an into keyword which allows you to
specify the collections.Mapping subclass that you would like returned. The default is dict, which
is backwards compatible. (GH16122)
• Series.set_axis() and DataFrame.set_axis() now support the inplace parameter. (GH14636)
• Series.to_pickle() and DataFrame.to_pickle() have gained a protocol parameter (GH16252).
By default, this parameter is set to HIGHEST_PROTOCOL
• read_feather() has gained the nthreads parameter for multi-threaded operations (GH16359)
• DataFrame.clip() and Series.clip() have gained an inplace argument. (GH15388)
• crosstab() has gained a margins_name parameter to define the name of the row / column that will
contain the totals when margins=True. (GH15972)
• read_json() now accepts a chunksize parameter that can be used when lines=True. If chunksize
is passed, read_json now returns an iterator which reads in chunksize lines with each iteration.
(GH17048)
• read_json() and to_json() now accept a compression argument which allows them to transparently
handle compressed files. (GH17798)
Various enhancements
• date_range() now accepts ‘Y’ in addition to ‘A’ as an alias for end of year. (GH9313)
• DataFrame.add_prefix() and DataFrame.add_suffix() now accept strings containing the ‘%’ char-
acter. (GH17151)
• Read/write methods that infer compression (read_csv(), read_table(), read_pickle(), and
to_pickle()) can now infer from path-like objects, such as pathlib.Path. (GH17206)
• read_sas() now recognizes much more of the most frequently used date (datetime) formats in
SAS7BDAT files. (GH15871)
• DataFrame.items() and Series.items() are now present in both Python 2 and 3 and are lazy in all
cases. (GH13918, GH17213)
• pandas.io.formats.style.Styler.where() has been implemented as a convenience for pandas.io.
formats.style.Styler.applymap(). (GH17474)
• MultiIndex.is_monotonic_decreasing() has been implemented. Previously returned False in all
cases. (GH16554)
• read_excel() raises ImportError with a better message if xlrd is not installed. (GH17613)
• DataFrame.assign() will preserve the original order of **kwargs for Python 3.6+ users instead of
sorting the column names. (GH14207)
• Series.reindex(), DataFrame.reindex(), Index.get_indexer() now support list-like argument for
tolerance. (GH17367)
We have updated our minimum supported versions of dependencies (GH15206, GH15543, GH15214). If
installed, we now require:
Note: The changes described here have been partially reverted. See the v0.22.0 Whatsnew for more.
The behavior of sum and prod on all-NaN Series/DataFrames no longer depends on whether bottleneck is
installed, and return value of sum and prod on an empty Series has changed (GH9422, GH15507).
Calling sum or prod on an empty or all-NaN Series, or columns of a DataFrame, will result in NaN. See the
docs.
In [33]: s = pd.Series([np.nan])
In [2]: s.sum()
Out[2]: np.nan
In [2]: s.sum()
Out[2]: 0.0
In [34]: s.sum()
Out[34]: 0.0
Note that this also changes the sum of an empty Series. Previously this always returned 0 regardless of a
bottleneck installation:
In [1]: pd.Series([]).sum()
Out[1]: 0
but for consistency with the all-NaN case, this was changed to return NaN as well:
In [35]: pd.Series([]).sum()
Out[35]: 0.0
Previously, selecting with a list of labels, where one or more labels were missing would always succeed,
returning NaN for missing labels. This will now show a FutureWarning. In the future this will raise a
KeyError (GH15747). This warning will trigger on a DataFrame or a Series for using .loc[] or [[]] when
passing a list-of-labels with at least 1 missing label. See the deprecation docs.
In [37]: s
Out[37]:
0 1
1 2
2 3
Length: 3, dtype: int64
Previous behavior
Current behavior
Out[4]:
1 2.0
2 3.0
3 NaN
dtype: float64
The idiomatic way to achieve selecting potentially not-found elements is via .reindex()
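For example, with the Series s shown above:
import pandas as pd

s = pd.Series([1, 2, 3])

s.loc[[1, 2, 3]]      # still works, but emits a FutureWarning (KeyError in the future)
s.reindex([1, 2, 3])  # the idiomatic, warning-free alternative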
NA naming changes
In order to promote more consistency among the pandas API, we have added additional top-level functions
isna() and notna() that are aliases for isnull() and notnull(). The naming scheme is now more
consistent with methods like .dropna() and .fillna(). Furthermore in all cases where .isnull() and
.notnull() methods are defined, these have additional methods named .isna() and .notna(), these are
included for classes Categorical, Index, Series, and DataFrame. (GH15001).
The configuration option pd.options.mode.use_inf_as_null is deprecated, and pd.options.mode.
use_inf_as_na is added as a replacement.
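A brief sketch of the new aliases:
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

pd.isna(s)     # alias of pd.isnull(s)
pd.notna(s)    # alias of pd.notnull(s)
s.isna()       # method aliases exist on Categorical, Index, Series and DataFrame
# pd.options.mode.use_inf_as_na replaces the deprecated use_inf_as_null option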
Previously, when using certain iteration methods for a Series with dtype int or float, you would re-
ceive a numpy scalar, e.g. a np.int64, rather than a Python int. Issue (GH10904) corrected this for
Series.tolist() and list(Series). This change makes all iteration methods consistent, in particular, for
__iter__() and .map(); note that this only affects int/float dtypes. (GH13236, GH13258, GH14216).
In [41]: s
Previously:
In [2]: type(list(s)[0])
Out[2]: numpy.int64
New behavior:
In [42]: type(list(s)[0])
Out[42]: int
Furthermore this will now correctly box the results of iteration for DataFrame.to_dict() as well.
In [44]: df = pd.DataFrame(d)
Previously:
In [8]: type(df.to_dict()['a'][0])
Out[8]: numpy.int64
New behavior:
In [45]: type(df.to_dict()['a'][0])
Out[45]: int
Previously when passing a boolean Index to .loc, if the index of the Series/DataFrame had boolean labels,
you would get a label-based selection, potentially duplicating result labels, rather than a boolean indexing
selection (where True selects elements); this was inconsistent with how a boolean numpy array is indexed.
The new behavior is to act like a boolean numpy array indexer (GH17738).
Previous behavior:
In [47]: s
Out[47]:
False 1
True 2
False 3
Length: 3, dtype: int64
Current behavior
Furthermore, previously if you had an index that was non-numeric (e.g. strings), then a boolean Index would
raise a KeyError. This will now be treated as a boolean indexer.
Previous behavior:
In [50]: s
Out[50]:
a 1
b 2
c 3
Length: 3, dtype: int64
Current behavior
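A minimal sketch, using the Series shown above:
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

# A boolean Index now acts like a boolean mask, selecting where True.
s.loc[pd.Index([True, False, True])]   # -> a 1, c 3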
PeriodIndex resampling
In [4]: resampled
Out[4]:
2017-03-31 1.0
2017-09-30 5.5
2018-03-31 10.0
Freq: 2Q-DEC, dtype: float64
In [5]: resampled.index
Out[5]: DatetimeIndex(['2017-03-31', '2017-09-30', '2018-03-31'], dtype='datetime64[ns]', freq='2Q-DEC')
New behavior:
In [52]: pi = pd.period_range('2017-01', periods=12, freq='M')
In [55]: resampled
Out[55]:
2017Q1 2.5
2017Q3 8.5
Freq: 2Q-DEC, Length: 2, dtype: float64
In [56]: resampled.index
Out[56]: PeriodIndex(['2017Q1', '2017Q3'], dtype='period[2Q-DEC]', freq='2Q-DEC')
Upsampling and calling .ohlc() previously returned a Series, basically identical to calling .asfreq().
OHLC upsampling now returns a DataFrame with columns open, high, low and close (GH13083). This is
consistent with downsampling and DatetimeIndex behavior.
Previous behavior:
In [3]: s.resample('H').ohlc()
Out[3]:
2000-01-01 00:00 0.0
...
2000-01-10 23:00 NaN
Freq: H, Length: 240, dtype: float64
In [4]: s.resample('M').ohlc()
Out[4]:
open high low close
2000-01 0 9 0 9
New behavior:
In [57]: pi = pd.period_range(start='2000-01-01', freq='D', periods=10)
In [59]: s.resample('H').ohlc()
Out[59]:
open high low close
2000-01-01 00:00 0.0 0.0 0.0 0.0
2000-01-01 01:00 NaN NaN NaN NaN
2000-01-01 02:00 NaN NaN NaN NaN
2000-01-01 03:00 NaN NaN NaN NaN
2000-01-01 04:00 NaN NaN NaN NaN
... ... ... ... ...
2000-01-10 19:00 NaN NaN NaN NaN
2000-01-10 20:00 NaN NaN NaN NaN
2000-01-10 21:00 NaN NaN NaN NaN
2000-01-10 22:00 NaN NaN NaN NaN
2000-01-10 23:00 NaN NaN NaN NaN
In [60]: s.resample('M').ohlc()
Out[60]:
open high low close
2000-01 0 9 0 9
[1 rows x 4 columns]
eval() will now raise a ValueError when item assignment malfunctions, or inplace operations are specified,
but there is no item assignment in the expression (GH16732)
Previously, if you attempted the following expression, you would get a not very helpful error message:
This is a very long way of saying numpy arrays don’t support string-item indexing. With this change, the
error message is now this:
It also used to be possible to evaluate expressions inplace, even if there was no item assignment:
However, this input does not make much sense because the output is not being assigned to the target. Now,
a ValueError will be raised when such an input is passed in:
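A rough sketch of the new validation (the frame and expressions are illustrative; the exact error message may differ):
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3]})
df.eval('b = a + 1', inplace=True)    # fine: the expression contains an assignment
# df.eval('a + 1', inplace=True)      # now raises ValueError: no item assignment to do inplace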
Dtype conversions
Previously, assignments, .where() and .fillna() with a bool assignment would coerce to the same type
(e.g. int / float), or raise for datetimelikes. These will now preserve the bools with object dtype.
(GH16821).
Previous behavior
In [6]: s
Out[6]:
0 1
1 1
2 3
dtype: int64
New behavior
In [64]: s
Out[64]:
0 1
1 True
2 3
Length: 3, dtype: object
Previously, an assignment to a datetimelike with a non-datetimelike would coerce the non-datetimelike item
being assigned (GH14145).
Previous behavior:
In [1]: s[1] = 1
In [2]: s
Out[2]:
0 2011-01-01 00:00:00.000000000
1 1970-01-01 00:00:00.000000001
dtype: datetime64[ns]
New behavior:
In [66]: s[1] = 1
In [67]: s
Out[67]:
• Inconsistent behavior in .where() with datetimelikes which would raise rather than coerce to object
(GH16402)
• Bug in assignment against int64 data with np.ndarray with float64 dtype may keep int64 dtype
(GH14001)
The MultiIndex constructors no longer squeeze a MultiIndex with all length-one levels down to a regular
Index. This affects all the MultiIndex constructors. (GH17178)
Previous behavior:
Length 1 levels are no longer special-cased. They behave exactly as if you had length 2+ levels, so a
MultiIndex is always returned from all of the MultiIndex constructors:
Previously, to_datetime() did not localize datetime Series data when utc=True was passed. Now,
to_datetime() will correctly localize Series with a datetime64[ns, UTC] dtype to be consistent with
how list-like and Index data are handled. (GH6415).
Previous behavior
New behavior
Additionally, DataFrames with datetime columns that were parsed by read_sql_table() and
read_sql_query() will also be localized to UTC only if the original SQL columns were timezone aware
datetime columns.
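A minimal sketch of the Series case (illustrative values):
import pandas as pd
s = pd.Series(['2017-01-01', '2017-01-02'])
pd.to_datetime(s, utc=True)   # now returns dtype datetime64[ns, UTC]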
In previous versions, there were some inconsistencies between the various range functions: date_range(),
bdate_range(), period_range(), timedelta_range(), and interval_range(). (GH17471).
One of the inconsistent behaviors occurred when the start, end and periods parameters were all specified,
potentially leading to ambiguous ranges. When all three parameters were passed, interval_range ignored
the periods parameter, period_range ignored the end parameter, and the other range functions raised. To
promote consistency among the range functions, and avoid potentially ambiguous ranges, interval_range
and period_range will now raise when all three parameters are passed.
Previous behavior:
New behavior:
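As a rough sketch of the new validation (illustrative arguments):
import pandas as pd
# specifying start, end and periods together now raises a ValueError
# pd.period_range(start='2017Q1', end='2017Q4', periods=6, freq='Q')
# pd.interval_range(start=0, end=4, periods=6)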
Additionally, the endpoint parameter end was not included in the intervals produced by interval_range.
However, all other range functions include end in their output. To promote consistency among the range
functions, interval_range will now include end as the right endpoint of the final interval, except if freq is
specified in a way which skips end.
Previous behavior:
New behavior:
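A minimal sketch of the endpoint inclusion (illustrative arguments, default closed='right'):
import pandas as pd
pd.interval_range(start=0, end=4)
# the final interval is now (3, 4], so the end point 4 is included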
Pandas no longer registers our date, time, datetime, datetime64, and Period converters with matplotlib
when pandas is imported. Matplotlib plot methods (plt.plot, ax.plot, …) will not nicely format the x-axis
for DatetimeIndex or PeriodIndex values. You must explicitly register these converters:
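One way to do this (a sketch assuming a pandas version where this helper is available, as it is in 0.25.x):
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()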
Pandas built-in Series.plot and DataFrame.plot will register these converters on first-use (GH17710).
Note: This change has been temporarily reverted in pandas 0.21.1, for more details see here.
• The Categorical constructor no longer accepts a scalar for the categories keyword. (GH16022)
• Accessing a non-existent attribute on a closed HDFStore will now raise an AttributeError rather than
a ClosedFileError (GH16301)
• read_csv() now issues a UserWarning if the names parameter contains duplicates (GH17095)
• read_csv() now treats 'null' and 'n/a' strings as missing values by default (GH16471, GH16078)
• pandas.HDFStore’s string representation is now faster and less detailed. For the previous behavior,
use pandas.HDFStore.info(). (GH16503).
• Compression defaults in HDF stores now follow pytables standards. Default is no compression and if
complib is missing and complevel > 0 zlib is used (GH15943)
• Index.get_indexer_non_unique() now returns a ndarray indexer rather than an Index; this is consistent with Index.get_indexer() (GH16819)
• Removed the @slow decorator from pandas.util.testing, which caused issues for some downstream
packages’ test suites. Use @pytest.mark.slow instead, which achieves the same thing (GH16850)
• Moved definition of MergeError to the pandas.errors module.
• The signature of Series.set_axis() and DataFrame.set_axis() has been changed from
set_axis(axis, labels) to set_axis(labels, axis=0), for consistency with the rest of the API.
The old signature is deprecated and will show a FutureWarning (GH14636)
• Series.argmin() and Series.argmax() will now raise a TypeError when used with object dtypes,
instead of a ValueError (GH13595)
• Period is now immutable, and will now raise an AttributeError when a user tries to assign a new
value to the ordinal or freq attributes (GH17116).
• to_datetime() when passed a tz-aware origin= kwarg will now raise a more informative ValueError
rather than a TypeError (GH16842)
• to_datetime() now raises a ValueError when format includes %W or %U without also including day of
the week and calendar year (GH16774)
• Renamed non-functional index to index_col in read_stata() to improve API consistency (GH16342)
• Bug in DataFrame.drop() caused boolean labels False and True to be treated as labels 0 and 1
respectively when dropping indices from a numeric index. This will now raise a ValueError (GH16877)
• Restricted DateOffset keyword arguments. Previously, DateOffset subclasses allowed arbitrary keyword arguments which could lead to unexpected behavior. Now, only valid arguments will be accepted. (GH17176).
Deprecations
• reindex_axis() has been deprecated in favor of reindex(). See here for more (GH17833).
The Series.select() and DataFrame.select() methods are deprecated in favor of using df.loc[labels.
map(crit)] (GH12401)
Out[3]:
A
bar 2
baz 3
[2 rows x 1 columns]
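A minimal sketch of the recommended replacement (the frame and criterion below are illustrative):
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3]}, index=['foo', 'bar', 'baz'])
crit = lambda x: x.startswith('b')
# deprecated: df.select(crit)
df.loc[df.index.map(crit)]   # selects the 'bar' and 'baz' rows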
The behavior of Series.argmax() and Series.argmin() has been deprecated in favor of Series.idxmax()
and Series.idxmin(), respectively (GH16830).
For compatibility with NumPy arrays, pd.Series implements argmax and argmin. Since pandas 0.13.0,
argmax has been an alias for pandas.Series.idxmax(), and argmin has been an alias for pandas.Series.
idxmin(). They return the label of the maximum or minimum, rather than the position.
We’ve deprecated the current behavior of Series.argmax and Series.argmin. Using either of these will
emit a FutureWarning. Use Series.idxmax() if you want the label of the maximum. Use Series.values.
argmax() if you want the position of the maximum. Likewise for the minimum. In a future release Series.
argmax and Series.argmin will return the position of the maximum or minimum.
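A minimal sketch of the distinction (illustrative values):
import pandas as pd
s = pd.Series([10, 30, 20], index=['a', 'b', 'c'])
s.idxmax()          # 'b' -- the label of the maximum
s.values.argmax()   # 1   -- the position of the maximum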
• The function get_offset_name has been dropped in favor of the .freqstr attribute for an offset
(GH11834)
• pandas no longer tests for compatibility with hdf5-files created with pandas < 0.11 (GH17404).
Performance improvements
Documentation changes
Bug fixes
Conversion
• Bug in assignment against datetime-like data with int may incorrectly convert to datetime-like
(GH14145)
• Bug in assignment against int64 data with np.ndarray with float64 dtype may keep int64 dtype
(GH14001)
• Fixed the return type of IntervalIndex.is_non_overlapping_monotonic to be a Python bool for
consistency with similar attributes/methods. Previously returned a numpy.bool_. (GH17237)
• Bug in IntervalIndex.is_non_overlapping_monotonic when intervals are closed on both sides and
overlap at a point (GH16560)
• Bug in Series.fillna() returns frame when inplace=True and value is dict (GH16156)
• Bug in Timestamp.weekday_name returning a UTC-based weekday name when localized to a timezone
(GH17354)
• Bug in Timestamp.replace when replacing tzinfo around DST changes (GH15683)
• Bug in Timedelta construction and arithmetic that would not propagate the Overflow exception
(GH17367)
• Bug in astype() converting to object dtype when passed extension type classes (DatetimeTZDtype,
CategoricalDtype) rather than instances. Now a TypeError is raised when a class is passed
(GH17780).
• Bug in to_numeric() in which elements were not always being coerced to numeric when
errors='coerce' (GH17007, GH17125)
• Bug in DataFrame and Series constructors where range objects are converted to int32 dtype on
Windows instead of int64 (GH16804)
Indexing
• When called with a null slice (e.g. df.iloc[:]), the .iloc and .loc indexers return a shallow copy
of the original object. Previously they returned the original object. (GH13873).
• When called on an unsorted MultiIndex, the loc indexer now will raise UnsortedIndexError only if
proper slicing is used on non-sorted levels (GH16734).
• Fixes regression in 0.20.3 when indexing with a string on a TimedeltaIndex (GH16896).
• Fixed TimedeltaIndex.get_loc() handling of np.timedelta64 inputs (GH16909).
• Fix MultiIndex.sort_index() ordering when ascending argument is a list, but not all levels are
specified, or are in a different order (GH16934).
• Fixes bug where indexing with np.inf caused an OverflowError to be raised (GH16957)
• Bug in reindexing on an empty CategoricalIndex (GH16770)
• Fixes DataFrame.loc for setting with alignment and tz-aware DatetimeIndex (GH16889)
• Avoids IndexError when passing an Index or Series to .iloc with older numpy (GH17193)
• Allow unicode empty strings as placeholders in multilevel columns in Python 2 (GH17099)
• Bug in .iloc when used with inplace addition or assignment and an int indexer on a MultiIndex
causing the wrong indexes to be read from and written to (GH17148)
• Bug in .isin() in which checking membership in empty Series objects raised an error (GH16991)
• Bug in CategoricalIndex reindexing in which specified indices containing duplicates were not being
respected (GH17323)
• Bug in intersection of RangeIndex with negative step (GH17296)
• Bug in IntervalIndex where performing a scalar lookup fails for included right endpoints of non-
overlapping monotonic decreasing indexes (GH16417, GH17271)
• Bug in DataFrame.first_valid_index() and DataFrame.last_valid_index() when no valid entry
(GH17400)
• Bug in Series.rename() when called with a callable, incorrectly alters the name of the Series, rather
than the name of the Index. (GH17407)
• Bug in Series.str.get() raises IndexError instead of inserting NaNs when using a negative index.
(GH17704)
I/O
• Bug in read_hdf() when reading a timezone aware index from fixed format HDFStore (GH17618)
• Bug in read_csv() in which columns were not being thoroughly de-duplicated (GH17060)
• Bug in read_csv() in which specified column names were not being thoroughly de-duplicated
(GH17095)
• Bug in read_csv() in which non integer values for the header argument generated an unhelpful /
unrelated error message (GH16338)
• Bug in read_csv() in which memory management issues in exception handling, under certain conditions, would cause the interpreter to segfault (GH14696, GH16798).
• Bug in read_csv() when called with low_memory=False in which a CSV with at least one column >
2GB in size would incorrectly raise a MemoryError (GH16798).
• Bug in read_csv() when called with a single-element list header would return a DataFrame of all NaN
values (GH7757)
• Bug in DataFrame.to_csv() defaulting to ‘ascii’ encoding in Python 3, instead of ‘utf-8’ (GH17097)
• Bug in read_stata() where value labels could not be read when using an iterator (GH16923)
• Bug in read_stata() where the index was not set (GH16342)
• Bug in read_html() where import check fails when run in multiple threads (GH16928)
• Bug in read_csv() where automatic delimiter detection caused a TypeError to be thrown when a bad
line was encountered rather than the correct error message (GH13374)
• Bug in DataFrame.to_html() with notebook=True where DataFrames with named indices or non-
MultiIndex indices had undesired horizontal or vertical alignment for column or row labels, respectively
(GH16792)
• Bug in DataFrame.to_html() in which there was no validation of the justify parameter (GH17527)
• Bug in HDFStore.select() when reading a contiguous mixed-data table featuring VLArray (GH17021)
• Bug in to_json() where several conditions (including objects with unprintable symbols, objects with
deep recursion, overlong labels) caused segfaults instead of raising the appropriate exception (GH14256)
Plotting
• Bug in plotting methods using secondary_y and fontsize not setting secondary axis font size
(GH12565)
• Bug when plotting timedelta and datetime dtypes on y-axis (GH16953)
• Line plots no longer assume monotonic x data when calculating xlims; they now show the entire lines
even for unsorted x data. (GH11310, GH11471)
• With matplotlib 2.0.0 and above, calculation of x limits for line plots is left to matplotlib, so that its
new default settings are applied. (GH15495)
• Bug in Series.plot.bar or DataFrame.plot.bar with y not respecting user-passed color (GH16822)
• Bug causing plotting.parallel_coordinates to reset the random seed when using random colors
(GH17525)
Groupby/resample/rolling
• Bug in Series.resample(...).apply() where an empty Series modified the source index and did
not return the name of a Series (GH14313)
• Bug in .rolling(...).apply(...) with a DataFrame with a DatetimeIndex, a window of a timedelta-
convertible and min_periods >= 1 (GH15305)
• Bug in DataFrame.groupby where index and column keys were not recognized correctly when the
number of keys equaled the number of elements on the groupby axis (GH16859)
• Bug in groupby.nunique() with TimeGrouper which cannot handle NaT correctly (GH17575)
• Bug in DataFrame.groupby where a single level selection from a MultiIndex unexpectedly sorts
(GH17537)
• Bug in DataFrame.groupby where spurious warning is raised when Grouper object is used to override
ambiguous column name (GH17383)
• Bug in TimeGrouper differs when passed as a list and as a scalar (GH17530)
Sparse
Reshaping
• Bug in pivot_table() where the result’s columns did not preserve the categorical dtype of columns
when dropna was False (GH17842)
• Bug in DataFrame.drop_duplicates where dropping with non-unique column names raised a
ValueError (GH17836)
• Bug in unstack() which, when called on a list of levels, would discard the fillna argument (GH13971)
• Bug in the alignment of range objects and other list-likes with DataFrame leading to operations being
performed row-wise instead of column-wise (GH17901)
Numeric
• Bug in .clip() with axis=1 and a list-like for threshold is passed; previously this raised ValueError
(GH15390)
• Series.clip() and DataFrame.clip() now treat NA values for upper and lower arguments as None
instead of raising ValueError (GH17276).
Categorical
PyPy
Other
• Bug where some inplace operators were not being wrapped and produced a copy when invoked
(GH12962)
• Bug in eval() where the inplace parameter was being incorrectly handled (GH16732)
Contributors
A total of 206 people contributed patches to this release. People with a “+” by their names contributed a
patch for the first time.
• 3553x +
• Aaron Barber
• Adam Gleave +
• Adam Smith +
• AdamShamlian +
• Adrian Liaw +
• Alan Velasco +
• Alan Yee +
• Alex B +
• Alex Lubbock +
• Alex Marchenko +
• Alex Rychyk +
• Amol K +
• Andreas Winkler
• Andrew +
• Andrew 亮
• André Jonasson +
• Becky Sweger
• Berkay +
• Bob Haffner +
• Bran Yang
• Brian Tu +
• Brock Mendel +
• Carol Willing +
• Carter Green +
• Chankey Pathak +
• Chris
• Chris Billington
• Chris Filo Gorgolewski +
• Chris Kerr
• Chris M +
• Chris Mazzullo +
• Christian Prinoth
• Christian Stade-Schuldt
• Christoph Moehl +
• DSM
• Daniel Chen +
• Daniel Grady
• Daniel Himmelstein
• Dave Willmer
• David Cook
• David Gwynne
• David Read +
• Dillon Niederhut +
• Douglas Rudd
• Eric Stein +
• Eric Wieser +
• Erik Fredriksen
• Florian Wilhelm +
• Floris Kint +
• Forbidden Donut
• Gabe F +
• Giftlin +
• Giftlin Rajaiah +
• Giulio Pepe +
• Guilherme Beltramini
• Guillem Borrell +
• Hanmin Qin +
• Hendrik Makait +
• Hugues Valois
• Hussain Tamboli +
• Iva Miholic +
• Jan Novotný +
• Jan Rudolph
• Jean Helie +
• Jean-Baptiste Schiratti +
• Jean-Mathieu Deschenes
• Jeff Knupp +
• Jeff Reback
• Jeff Tratner
• JennaVergeynst
• JimStearns206
• Joel Nothman
• John W. O’Brien
• Jon Crall +
• Jon Mease
• Jonathan J. Helmus +
• Joris Van den Bossche
• JosephWagner
• Juarez Bochi
• Julian Kuhlmann +
• Karel De Brabandere
• Kassandra Keeton +
• Keiron Pizzey +
• Keith Webber
• Kernc
• Kevin Sheppard
• Kirk Hansen +
• Licht Takeuchi +
• Lucas Kushner +
• Mahdi Ben Jelloul +
• Makarov Andrey +
• Malgorzata Turzanska +
• Marc Garcia +
• Margaret Sy +
• MarsGuy +
• Matt Bark +
• Matthew Roeschke
• Matti Picus
• Mehmet Ali “Mali” Akmanalp
• Michael Gasvoda +
• Michael Penkov +
• Milo +
• Morgan Stuart +
• Morgan243 +
• Nathan Ford +
• Nick Eubank
• Nick Garvey +
• Oleg Shteynbuk +
• P-Tillmann +
• Pankaj Pandey
• Patrick Luo
• Patrick O’Melveny
• Paul Reidy +
• Paula +
• Peter Quackenbush
• Peter Yanovich +
• Phillip Cloud
• Pierre Haessig
• Pietro Battiston
• Pradyumna Reddy Chinthala
• Prasanjit Prakash
• RobinFiveWords
• Ryan Hendrickson
• Sam Foo
• Sangwoong Yoon +
• Simon Gibbons +
• SimonBaron
• Steven Cutting +
• Sudeep +
• Sylvia +
• TN+
• Telt
• Thomas A Caswell
• Tim Swast +
• Tom Augspurger
• Tong SHEN
• Tuan +
• Utkarsh Upadhyay +
• Vincent La +
• Vivek +
• WANG Aiyong
• WBare
• Wes McKinney
• XF +
• Yi Liu +
• Yosuke Nakabayashi +
• aaron315 +
• abarber4gh +
• aernlund +
• agustín méndez +
• andymaheshw +
• ante328 +
• aviolov +
• bpraggastis
• cbertinato +
• cclauss +
• chernrick
• chris-b1
• dkamm +
• dwkenefick
• economy
• faic +
• fding253 +
• gfyoung
• guygoldberg +
• hhuuggoo +
• huashuai +
• ian
• iulia +
• jaredsnyder
• jbrockmendel +
• jdeschenes
• jebob +
• jschendel +
• keitakurita
• kernc +
• kiwirob +
• kjford
• linebp
• lloydkirk
• louispotok +
• majiang +
• manikbhandari +
• matthiashuschle +
• mattip
• maxwasserman +
• mjlove12 +
• nmartensen +
• pandas-docs-bot +
• parchd-1 +
• philipphanemann +
• rdk1024 +
• reidy-p +
• ri938
• ruiann +
• rvernica +
• s-weigand +
• scotthavard92 +
• skwbc +
• step4me +
• tobycheese +
• topper-123 +
• tsdlovell
• ysau +
• zzgao +
This is a minor bug-fix release in the 0.20.x series and includes some small regression fixes and bug fixes.
We recommend that all users upgrade to this version.
• Bug fixes
– Conversion
– Indexing
– I/O
– Plotting
– Reshaping
– Categorical
• Contributors
Bug fixes
Conversion
• Bug in pickle compat prior to the v0.20.x series, when UTC is a timezone in a Series/DataFrame/Index
(GH16608)
• Bug in Series construction when passing a Series with dtype='category' (GH16524).
• Bug in DataFrame.astype() when passing a Series as the dtype kwarg. (GH16717).
Indexing
• Bug in Float64Index causing an empty array instead of None to be returned from .get(np.nan) on
a Series whose index did not contain any NaN s (GH8569)
• Bug in MultiIndex.isin causing an error when passing an empty iterable (GH16777)
• Fixed a bug in a slicing DataFrame/Series that have a TimedeltaIndex (GH16637)
I/O
• Bug in read_csv() in which files weren’t opened as binary files by the C engine on Windows, causing
EOF characters mid-field, which would fail (GH16039, GH16559, GH16675)
• Bug in read_hdf() in which reading a Series saved to an HDF file in ‘fixed’ format fails when an
explicit mode='r' argument is supplied (GH16583)
• Bug in DataFrame.to_latex() where bold_rows was wrongly specified to be True by default, whereas
in reality row labels remained non-bold regardless of the parameter provided. (GH16707)
• Fixed an issue with DataFrame.style() where generated element ids were not unique (GH16780)
• Fixed loading a DataFrame with a PeriodIndex, from a format='fixed' HDFStore, in Python 3, that
was written in Python 2 (GH16781)
Plotting
• Fixed regression that prevented RGB and RGBA tuples from being used as color arguments (GH16233)
• Fixed an issue with DataFrame.plot.scatter() that incorrectly raised a KeyError when categorical
data is used for plotting (GH16199)
Reshaping
Categorical
• Bug in DataFrame.sort_values not respecting the kind parameter with categorical data (GH16793)
Contributors
A total of 20 people contributed patches to this release. People with a “+” by their names contributed a
patch for the first time.
• Bran Yang
• Chris
• Chris Kerr +
• DSM
• David Gwynne
• Douglas Rudd
• Forbidden Donut +
• Jeff Reback
• Joris Van den Bossche
• Karel De Brabandere +
• Peter Quackenbush +
• Pradyumna Reddy Chinthala +
• Telt +
• Tom Augspurger
• chris-b1
• gfyoung
• ian +
• jdeschenes +
• kjford +
• ri938 +
This is a minor bug-fix release in the 0.20.x series and includes some small regression fixes, bug fixes and
performance improvements. We recommend that all users upgrade to this version.
• Enhancements
• Performance improvements
• Bug fixes
– Conversion
– Indexing
– I/O
– Plotting
– Groupby/resample/rolling
– Sparse
– Reshaping
– Numeric
– Categorical
– Other
• Contributors
Enhancements
Performance improvements
Bug fixes
• Silenced a warning on some Windows environments about “tput: terminal attributes: No such device
or address” when detecting the terminal size. This fix only applies to python 3 (GH16496)
• Bug in using pathlib.Path or py.path.local objects with io functions (GH16291)
• Bug in Index.symmetric_difference() on two equal MultiIndex’s, results in a TypeError (GH13490)
• Bug in DataFrame.update() with overwrite=False and NaN values (GH15593)
• Passing an invalid engine to read_csv() now raises an informative ValueError rather than
UnboundLocalError. (GH16511)
• Bug in unique() on an array of tuples (GH16519)
• Bug in cut() when labels are set, resulting in incorrect label ordering (GH16459)
• Fixed a compatibility issue with IPython 6.0’s tab completion showing deprecation warnings on
Categoricals (GH16409)
Conversion
• Bug in to_numeric() in which empty data inputs were causing a segfault of the interpreter (GH16302)
• Silence numpy warnings when broadcasting DataFrame to Series with comparison ops (GH16378,
GH16306)
Indexing
I/O
• Bug in read_csv() when comment is passed in a space delimited text file (GH16472)
• Bug in read_csv() not raising an exception with nonexistent columns in usecols when it had the
correct length (GH14671)
• Bug that would force importing of the clipboard routines unnecessarily, potentially causing an import
error on startup (GH16288)
• Bug that raised IndexError when HTML-rendering an empty DataFrame (GH15953)
• Bug in read_csv() in which tarfile object inputs were raising an error in Python 2.x for the C engine
(GH16530)
• Bug where DataFrame.to_html() ignored the index_names parameter (GH16493)
• Bug where pd.read_hdf() returns numpy strings for index names (GH13492)
• Bug in HDFStore.select_as_multiple() where start/stop arguments were not respected (GH16209)
Plotting
Groupby/resample/rolling
Sparse
Reshaping
Numeric
• Bug in .interpolate(), where limit_direction was not respected when limit=None (default) was
passed (GH16282)
Categorical
• Fixed comparison operations considering the order of the categories when both categoricals are unordered (GH16014)
Other
Contributors
A total of 34 people contributed patches to this release. People with a “+” by their names contributed a
patch for the first time.
• Aaron Barber +
• Andrew 亮 +
• Becky Sweger +
• Christian Prinoth +
• Christian Stade-Schuldt +
• DSM
• Erik Fredriksen +
• Hugues Valois +
• Jeff Reback
• Jeff Tratner
• JimStearns206 +
• John W. O’Brien
• Joris Van den Bossche
• JosephWagner +
• Keith Webber +
• Mehmet Ali “Mali” Akmanalp +
• Pankaj Pandey
• Patrick Luo +
• Patrick O’Melveny +
• Pietro Battiston
• RobinFiveWords +
• Ryan Hendrickson +
• SimonBaron +
• Tom Augspurger
• WBare +
• bpraggastis +
• chernrick +
• chris-b1
• economy +
• gfyoung
• jaredsnyder +
• keitakurita +
• linebp
• lloydkirk +
This is a major release from 0.19.2 and includes a number of API changes, deprecations, new features,
enhancements, and performance improvements along with a large number of bug fixes. We recommend that
all users upgrade to this version.
Highlights include:
• New .agg() API for Series/DataFrame similar to the groupby-rolling-resample API’s, see here
• Integration with the feather-format, including a new top-level pd.read_feather() and DataFrame.
to_feather() method, see here.
• The .ix indexer has been deprecated, see here
• Panel has been deprecated, see here
• Addition of an IntervalIndex and Interval scalar type, see here
• Improved user API when grouping by index levels in .groupby(), see here
• Improved support for UInt64 dtypes, see here
• A new orient for JSON serialization, orient='table', that uses the Table Schema spec and that gives
the possibility for a more interactive repr in the Jupyter Notebook, see here
• Experimental support for exporting styled DataFrames (DataFrame.style) to Excel, see here
• Window binary corr/cov operations now return a MultiIndexed DataFrame rather than a Panel, as
Panel is now deprecated, see here
• Support for S3 handling now uses s3fs, see here
• Google BigQuery support now uses the pandas-gbq library, see here
Warning: Pandas has changed the internal structure and layout of the code base. This can affect
imports that are not from the top-level pandas.* namespace, please see the changes here.
Note: This is a combined release for 0.20.0 and 0.20.1. Version 0.20.1 contains one additional change
for backwards-compatibility with downstream projects using pandas’ utils routines. (GH16250)
• New features
– agg API for DataFrame/Series
– dtype keyword for data IO
– .to_datetime() has gained an origin parameter
– Groupby enhancements
New features
Series & DataFrame have been enhanced to support the aggregation API. This is a familiar API from
groupby, window operations, and resampling. This allows aggregation operations in a concise way by using
agg() and transform(). The full documentation is here (GH1623).
Here is a sample
In [3]: df
Out[3]:
A B C
2000-01-01 -1.392054 1.153922 1.181944
2000-01-02 0.391371 -0.881047 0.295080
2000-01-03 1.863801 -1.712274 -1.407085
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 1.076541 0.363177 1.893680
One can operate using string function names, callables, lists, or dictionaries of these.
Using a single function is equivalent to .apply.
In [4]: df.agg('sum')
Out[4]:
A 0.793677
B -1.007232
C 1.264515
Length: 3, dtype: float64
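Passing a list of functions applies each of them, giving one row per function and one column per column of
df; a rough sketch of such a call (reconstructed; only the shape of its output survives below):
df.agg(['sum', 'min'])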
[2 rows x 3 columns]
Using a dict provides the ability to apply specific aggregations per column. You will get a matrix-like output
of all of the aggregators. The output has one row per unique function. Entries for a function that was not
specified for a particular column will be NaN:
[3 rows x 2 columns]
When presented with mixed dtypes that cannot be aggregated, .agg() will only take the valid aggregations.
This is similar to how groupby .agg() works. (GH15015)
In [9]: df.dtypes
Out[9]:
A int64
B float64
C object
D datetime64[ns]
Length: 4, dtype: object
[2 rows x 4 columns]
The 'python' engine for read_csv(), as well as the read_fwf() function for parsing fixed-width text files
and read_excel() for parsing Excel files, now accept the dtype keyword argument for specifying the types
of specific columns (GH14295). See the io docs for more information.
In [11]: data = "a b\n1 2\n3 4"
In [12]: pd.read_fwf(StringIO(data)).dtypes
Out[12]:
a int64
b int64
Length: 2, dtype: object
to_datetime() has gained a new parameter, origin, to define a reference date from where to compute the
resulting timestamps when parsing numerical values with a specific unit specified. (GH11276, GH11745)
For example, with 1960-01-01 as the starting date:
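A minimal sketch (assuming day units, matching the surrounding example):
import pandas as pd
pd.to_datetime([1, 2, 3], unit='D', origin='1960-01-01')
# DatetimeIndex(['1960-01-02', '1960-01-03', '1960-01-04'], dtype='datetime64[ns]', freq=None)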
The default is set at origin='unix', which defaults to 1970-01-01 00:00:00, which is commonly called
‘unix epoch’ or POSIX time. This was the previous default, so this is a backward compatible change.
Groupby enhancements
Strings passed to DataFrame.groupby() as the by parameter may now reference either column names or
index level names. Previously, only column names could be referenced. This makes it easy to group by a
column and an index level at the same time. (GH5677)
In [16]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
....: ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
....:
In [19]: df
Out[19]:
A B
first second
bar one 1 0
two 1 1
baz one 1 2
two 1 3
foo one 2 4
two 2 5
qux one 3 6
two 3 7
[8 rows x 2 columns]
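The output below comes from grouping by the index level 'second' together with column 'A'; the call
producing it is along these lines (reconstructed from the surrounding example):
df.groupby(['second', 'A']).sum()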
one 1 2
2 4
3 6
two 1 4
2 5
3 7
[6 rows x 1 columns]
The compression code was refactored (GH12688). As a result, reading dataframes from URLs in read_csv()
or read_table() now supports additional compression methods: xz, bz2, and zip (GH14570). Previously,
only gzip compression was supported. By default, compression of URLs and paths is now inferred using
their file extensions. Additionally, support for bz2 compression in the python 2 C-engine was improved
(GH14874).
In [24]: df.head(2)
Out[24]:
S X E M
0 13876 1 1 1
1 11608 1 3 0
[2 rows x 4 columns]
read_pickle(), DataFrame.to_pickle() and Series.to_pickle() can now read from and write to compressed pickle files. Compression methods can be an explicit parameter or be inferred from the file extension.
See the docs here.
In [28]: rt.head()
Out[28]:
A B C
0 0.177387 foo 2013-01-01 00:00:00
1 0.983513 foo 2013-01-01 00:00:01
2 0.023505 foo 2013-01-01 00:00:02
3 0.553777 foo 2013-01-01 00:00:03
4 0.353769 foo 2013-01-01 00:00:04
[5 rows x 3 columns]
The default is to infer the compression type from the extension (compression='infer'):
In [29]: df.to_pickle("data.pkl.gz")
In [30]: rt = pd.read_pickle("data.pkl.gz")
In [31]: rt.head()
Out[31]:
A B C
0 0.177387 foo 2013-01-01 00:00:00
1 0.983513 foo 2013-01-01 00:00:01
2 0.023505 foo 2013-01-01 00:00:02
3 0.553777 foo 2013-01-01 00:00:03
4 0.353769 foo 2013-01-01 00:00:04
[5 rows x 3 columns]
In [32]: df["A"].to_pickle("s1.pkl.bz2")
In [33]: rt = pd.read_pickle("s1.pkl.bz2")
In [34]: rt.head()
Out[34]:
0 0.177387
1 0.983513
2 0.023505
3 0.553777
4 0.353769
Name: A, Length: 5, dtype: float64
Pandas has significantly improved support for operations involving unsigned, or purely non-negative, integers. Previously, handling these integers would result in improper rounding or data-type casting, leading to
incorrect results. Notably, a new numerical index, UInt64Index, has been created (GH14937)
In [37]: df.index
Out[37]: UInt64Index([1, 2, 3], dtype='uint64')
• Bug in converting object elements of array-like objects to unsigned 64-bit integers (GH4471, GH14982)
• Bug in Series.unique() in which unsigned 64-bit integers were causing overflow (GH14721)
• Bug in DataFrame construction in which unsigned 64-bit integer elements were being converted to
objects (GH14881)
• Bug in pd.read_csv() in which unsigned 64-bit integer elements were being improperly converted to
the wrong data types (GH14983)
• Bug in pd.unique() in which unsigned 64-bit integers were causing overflow (GH14915)
• Bug in pd.value_counts() in which unsigned 64-bit integers were being erroneously truncated in the
output (GH14934)
GroupBy on categoricals
In previous versions, .groupby(..., sort=False) would fail with a ValueError when grouping on a
categorical series with some categories not appearing in the data. (GH13179)
In [39]: df = pd.DataFrame({
....: 'A': np.random.randint(100),
....: 'B': np.random.randint(100),
....: 'C': np.random.randint(100),
....: 'chromosomes': pd.Categorical(np.random.choice(chromosomes, 100),
....: categories=chromosomes,
....: ordered=True)})
....:
In [40]: df
Out[40]:
A B C chromosomes
0 64 56 54 1
1 64 56 54 Y
2 64 56 54 Y
3 64 56 54 Y
4 64 56 54 11
.. .. .. .. ...
95 64 56 54 14
96 64 56 54 10
97 64 56 54 13
98 64 56 54 18
99 64 56 54 8
Previous behavior:
New behavior:
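A rough sketch of a call that now works (reconstructed; the original example grouped a filtered frame so
that at least one category was unobserved):
df[df.chromosomes != '1'].groupby('chromosomes', sort=False).sum()
# previously raised ValueError; now returns the grouped sums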
The new orient 'table' for DataFrame.to_json() will generate a Table Schema compatible string representation of the data.
In [42]: df = pd.DataFrame(
....: {'A': [1, 2, 3],
....: 'B': ['a', 'b', 'c'],
....: 'C': pd.date_range('2016-01-01', freq='d', periods=3)},
....: index=pd.Index(range(3), name='idx'))
....:
In [43]: df
Out[43]:
A B C
idx
0 1 a 2016-01-01
1 2 b 2016-01-02
2 3 c 2016-01-03
[3 rows x 3 columns]
In [44]: df.to_json(orient='table')
Out[44]: '{"schema": {"fields":[{"name":"idx","type":"integer"},{"name":"A",
,→"type":"integer"},{"name":"B","type":"string"},{"name":"C","type":"datetime"}],
,→"C":"2016-01-01T00:00:00.000Z"},{"idx":1,"A":2,"B":"b","C":"2016-01-02T00:00:00.000Z"},
,→{"idx":2,"A":3,"B":"c","C":"2016-01-03T00:00:00.000Z"}]}'
Pandas now supports creating sparse dataframes directly from scipy.sparse.spmatrix instances. See the
documentation for more information. (GH4343)
All sparse formats are supported, but matrices that are not in COOrdinate format will be converted, copying
data as needed.
In [49]: sp_arr
Out[49]:
<1000x5 sparse matrix of type '<class 'numpy.float64'>'
with 491 stored elements in Compressed Sparse Row format>
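The construction step is not shown above; it would be along these lines (assuming the SparseDataFrame
constructor of that era):
sdf = pd.SparseDataFrame(sp_arr)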
In [51]: sdf
Out[51]:
0 1 2 3 4
0 NaN 0.97914 NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
To convert a SparseDataFrame back to sparse SciPy matrix in COO format, you can use:
In [52]: sdf.to_coo()
Out[52]:
<1000x5 sparse matrix of type '<class 'numpy.float64'>'
with 491 stored elements in COOrdinate format>
Experimental support has been added to export DataFrame.style formats to Excel using the openpyxl
engine. (GH15530)
For example, after running the following, styled.xlsx renders as below:
In [53]: np.random.seed(24)
In [57]: df
Out[57]:
A B C D E
0 1.0 1.329212 NaN -0.316280 -0.990810
1 2.0 -1.070816 -1.438713 0.564417 0.295722
2 3.0 -1.626404 0.219565 0.678805 1.889273
3 4.0 0.961538 0.104011 -0.481165 0.850229
4 5.0 1.453425 1.057737 0.165562 0.515018
5 6.0 -1.336936 0.562861 1.392855 -0.063328
6 7.0 0.121668 1.207603 -0.002040 1.627796
7 8.0 0.354493 1.037528 -0.385684 0.519818
8 9.0 1.686583 -1.325963 1.428984 -2.089354
9 10.0 -0.129820 0.631523 -0.586538 0.290720
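A rough sketch of the export step (the particular styling below is illustrative, not the original):
styled = (df.style
            .applymap(lambda v: 'color: red' if v < 0 else 'color: black')
            .highlight_max())
styled.to_excel('styled.xlsx', engine='openpyxl')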
IntervalIndex
pandas has gained an IntervalIndex with its own dtype, interval as well as the Interval scalar type.
These allow first-class support for interval notation, specifically as a return type for the categories in cut()
and qcut(). The IntervalIndex allows some unique indexing, see the docs. (GH7640, GH8625)
Warning: These indexing behaviors of the IntervalIndex are provisional and may change in a future
version of pandas. Feedback on usage is welcome.
Previous behavior:
The returned categories were strings, representing Intervals
In [2]: c
Out[2]:
[(-0.003, 1.5], (-0.003, 1.5], (1.5, 3], (1.5, 3]]
Categories (2, object): [(-0.003, 1.5] < (1.5, 3]]
In [3]: c.categories
Out[3]: Index(['(-0.003, 1.5]', '(1.5, 3]'], dtype='object')
New behavior:
In [60]: c = pd.cut(range(4), bins=2)
In [61]: c
Out[61]:
[(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]]
Categories (2, interval[float64]): [(-0.003, 1.5] < (1.5, 3.0]]
In [62]: c.categories
Out[62]:
IntervalIndex([(-0.003, 1.5], (1.5, 3.0]],
closed='right',
dtype='interval[float64]')
Furthermore, this allows one to bin other data with these same bins, with NaN representing a missing value
similar to other dtypes.
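For instance, reusing the bins from c above (a sketch; the values being binned are illustrative):
pd.cut([0, 3, 5, 1], bins=c.categories)
# [(-0.003, 1.5], (1.5, 3.0], NaN, (-0.003, 1.5]]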
In [65]: df
Out[65]:
A
B
(-0.003, 1.5] 0
(1.5, 3.0] 1
(-0.003, 1.5] 2
(-0.003, 1.5] 3
[4 rows x 1 columns]
In [67]: df.loc[0]
Out[67]:
A
B
(-0.003, 1.5] 0
(-0.003, 1.5] 2
(-0.003, 1.5] 3
[3 rows x 1 columns]
Other enhancements
• The .to_latex() method will now accept multicolumn and multirow arguments to use the accompanying LaTeX enhancements
• pd.merge_asof() gained the option direction='backward'|'forward'|'nearest' (GH14887)
• Series/DataFrame.asfreq() have gained a fill_value parameter, to fill missing values (GH3715).
• Series/DataFrame.resample.asfreq have gained a fill_value parameter, to fill missing values
during resampling (GH3715).
• pandas.util.hash_pandas_object() has gained the ability to hash a MultiIndex (GH15224)
• Series/DataFrame.squeeze() have gained the axis parameter. (GH15339)
• DataFrame.to_excel() has a new freeze_panes parameter to turn on Freeze Panes when exporting
to Excel (GH15160)
• pd.read_html() will parse multiple header rows, creating a MultiIndex header. (GH13434).
• HTML table output skips colspan or rowspan attribute if equal to 1. (GH15403)
• pandas.io.formats.style.Styler template now has blocks for easier extension, see the example
notebook (GH15649)
• Styler.render() now accepts **kwargs to allow user-defined variables in the template (GH15649)
• Compatibility with Jupyter notebook 5.0; MultiIndex column labels are left-aligned and MultiIndex
row-labels are top-aligned (GH15379)
• TimedeltaIndex now has a custom date-tick formatter specifically designed for nanosecond level precision (GH8711)
• pd.api.types.union_categoricals gained the ignore_ordered argument to allow ignoring the ordered attribute of unioned categoricals (GH13410). See the categorical union docs for more information.
• DataFrame.to_latex() and DataFrame.to_string() now allow optional header aliases. (GH15536)
• Re-enable the parse_dates keyword of pd.read_excel() to parse string columns as dates (GH14326)
• Added .empty property to subclasses of Index. (GH15270)
• Enabled floor division for Timedelta and TimedeltaIndex (GH15828)
• pandas.io.json.json_normalize() gained the option errors='ignore'|'raise'; the default is
errors='raise' which is backward compatible. (GH14583)
• pandas.io.json.json_normalize() with an empty list will return an empty DataFrame (GH15534)
• pandas.io.json.json_normalize() has gained a sep option that accepts str to separate joined fields;
the default is “.”, which is backward compatible. (GH14883)
• MultiIndex.remove_unused_levels() has been added to facilitate removing unused levels.
(GH15694)
• pd.read_csv() will now raise a ParserError error whenever any parsing error occurs (GH15913,
GH15925)
• pd.read_csv() now supports the error_bad_lines and warn_bad_lines arguments for the Python
parser (GH15925)
• The display.show_dimensions option can now also be used to specify whether the length of a Series
should be shown in its repr (GH7117).
• parallel_coordinates() has gained a sort_labels keyword argument that sorts class labels and
the colors assigned to them (GH15908)
• Options added to allow one to turn on/off using bottleneck and numexpr, see here (GH16157)
• DataFrame.style.bar() now accepts two more options to further customize the bar chart. Bar alignment is set with align='left'|'mid'|'zero', the default is “left”, which is backward compatible; You can now pass a list of color=[color_negative, color_positive]. (GH14757)
Possible incompatibility for HDF5 formats created with pandas < 0.13.0
pd.TimeSeries was deprecated officially in 0.17.0, though it had already been an alias since 0.13.0. It has been
dropped in favor of pd.Series. (GH15098).
This may cause HDF5 files that were created in prior versions to become unreadable if pd.TimeSeries was
used. This is most likely to be the case for pandas < 0.13.0. If you find yourself in this situation, you can use
a recent prior version of pandas to read in your HDF5 files, then write them out again after applying the
procedure below.
In [3]: s
Out[3]:
2013-01-01 1
2013-01-02 2
2013-01-03 3
Freq: D, dtype: int64
In [4]: type(s)
Out[4]: pandas.core.series.TimeSeries
In [5]: s = pd.Series(s)
In [6]: s
Out[6]:
2013-01-01 1
2013-01-02 2
2013-01-03 3
Freq: D, dtype: int64
In [7]: type(s)
Out[7]: pandas.core.series.Series
In [69]: idx
Out[69]: Int64Index([1, 2], dtype='int64')
Previous behavior:
In [5]: idx.map(lambda x: x * 2)
Out[5]: array([2, 4])
In [7]: mi.map(lambda x: x)
Out[7]: array([(1, 2), (2, 4)], dtype=object)
New behavior:
In [72]: idx.map(lambda x: x * 2)
Out[72]: Int64Index([2, 4], dtype='int64')
In [74]: mi.map(lambda x: x)
Out[74]:
MultiIndex([(1, 2),
(2, 4)],
)
In [77]: s
Out[77]:
0 2011-01-02 00:00:00+09:00
1 2011-01-02 01:00:00+09:00
2 2011-01-02 02:00:00+09:00
Length: 3, dtype: datetime64[ns, Asia/Tokyo]
Previous behavior:
New behavior:
The datetime-related attributes (see here for an overview) of DatetimeIndex, PeriodIndex and
TimedeltaIndex previously returned numpy arrays. They will now return a new Index object, except
in the case of a boolean field, where the result will still be a boolean ndarray. (GH15022)
Previous behaviour:
In [2]: idx.hour
Out[2]: array([ 0, 10, 20, 6, 16], dtype=int32)
New behavior:
In [80]: idx.hour
Out[80]: Int64Index([0, 10, 20, 6, 16], dtype='int64')
This has the advantage that specific Index methods are still available on the result. On the other hand, this
might have backward incompatibilities: e.g. compared to numpy arrays, Index objects are not mutable. To
get the original ndarray, you can always convert explicitly using np.asarray(idx.hour).
In prior versions, using Series.unique() and pandas.unique() on Categorical and tz-aware data-types
would yield different return types. These are now made consistent. (GH15903)
• Datetime tz-aware
Previous behaviour:
# Series
In [5]: pd.Series([pd.Timestamp('20160101', tz='US/Eastern'),
...: pd.Timestamp('20160101', tz='US/Eastern')]).unique()
# Index
In [7]: pd.Index([pd.Timestamp('20160101', tz='US/Eastern'),
...: pd.Timestamp('20160101', tz='US/Eastern')]).unique()
Out[7]: DatetimeIndex(['2016-01-01 00:00:00-05:00'], dtype='datetime64[ns, US/
,→Eastern]', freq=None)
New behavior:
# Series, returns an array of Timestamp tz-aware
In [81]: pd.Series([pd.Timestamp(r'20160101', tz=r'US/Eastern'),
....: pd.Timestamp(r'20160101', tz=r'US/Eastern')]).unique()
....:
Out[81]:
<DatetimeArray>
['2016-01-01 00:00:00-05:00']
Length: 1, dtype: datetime64[ns, US/Eastern]
• Categoricals
Previous behaviour:
New behavior:
# returns a Categorical
In [85]: pd.Series(list('baabc'), dtype='category').unique()
Out[85]:
[b, a, c]
Categories (3, object): [b, a, c]
S3 file handling
pandas now uses s3fs for handling S3 connections. This shouldn’t break any code. However, since s3fs
is not a required dependency, you will need to install it separately, like boto in prior versions of pandas.
(GH11915).
DatetimeIndex Partial String Indexing now works as an exact match, provided that string resolution coincides
with index resolution, including a case when both are seconds (GH14826). See Slice vs. Exact Match for
details.
Previous behavior:
New behavior:
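A minimal sketch (illustrative data with an index at second resolution):
import pandas as pd
s = pd.Series(range(2), index=pd.DatetimeIndex(['2011-12-31 23:59:59',
                                                '2012-01-01 00:00:00']))
s['2011-12-31 23:59:59']   # exact match: returns the scalar 0
s['2011-12-31 23:59']      # coarser string: still slices, returning a Series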
Previously, concat of multiple objects with different float dtypes would automatically upcast results to a
dtype of float64. Now the smallest acceptable dtype will be used (GH13247)
In [89]: df1.dtypes
Out[89]:
0 float32
Length: 1, dtype: object
In [91]: df2.dtypes
Out[91]:
0 float32
Length: 1, dtype: object
Previous behavior:
New behavior:
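A sketch of the new result (df1 and df2 are the float32 frames shown above):
pd.concat([df1, df2]).dtypes   # stays float32; previously this was upcast to float64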
pandas has split off Google BigQuery support into a separate package pandas-gbq. You can conda install
pandas-gbq -c conda-forge or pip install pandas-gbq to get it. The functionality of read_gbq() and
DataFrame.to_gbq() remain the same with the currently released version of pandas-gbq=0.1.4. Documentation is now hosted here (GH15347)
In previous versions, showing .memory_usage() on a pandas structure that has an index would only include
actual index values and not include structures that facilitated fast indexing. This will generally be different
for Index and MultiIndex and less so for other index types. (GH15237)
Previous behavior:
In [9]: index.memory_usage(deep=True)
Out[9]: 180
In [10]: index.get_loc('foo')
Out[10]: 0
In [11]: index.memory_usage(deep=True)
Out[11]: 180
New behavior:
In [9]: index.memory_usage(deep=True)
Out[9]: 180
In [10]: index.get_loc('foo')
Out[10]: 0
In [11]: index.memory_usage(deep=True)
Out[11]: 260
DataFrame.sort_index changes
In certain cases, calling .sort_index() on a MultiIndexed DataFrame would return the same DataFrame
without seeming to sort. This would happen with lexsorted, but non-monotonic, levels. (GH15622,
GH15687, GH14015, GH13431, GH15797)
This is unchanged from prior versions, but shown for illustration purposes:
In [94]: df
Out[94]:
value
B 0 0
1 1
2 2
A 0 3
1 4
2 5
[6 rows x 1 columns]
In [95]: df.index.is_lexsorted()
Out[95]: False
In [96]: df.index.is_monotonic
Out[96]: False
Sorting works as expected
In [97]: df.sort_index()
Out[97]:
value
A 0 3
1 4
2 5
B 0 0
1 1
2 2
[6 rows x 1 columns]
In [98]: df.sort_index().index.is_lexsorted()
Out[98]: True
In [99]: df.sort_index().index.is_monotonic
Out[99]: True
However, this example, which has a non-monotonic 2nd level, doesn’t behave as desired.
In [100]: df = pd.DataFrame({'value': [1, 2, 3, 4]},
.....: index=pd.MultiIndex([['a', 'b'], ['bb', 'aa']],
.....: [[0, 0, 1, 1], [0, 1, 0, 1]]))
.....:
In [101]: df
Out[101]:
value
a bb 1
aa 2
b bb 3
aa 4
[4 rows x 1 columns]
Previous behavior:
In [11]: df.sort_index()
Out[11]:
value
a bb 1
aa 2
b bb 3
aa 4
In [14]: df.sort_index().index.is_lexsorted()
Out[14]: True
In [15]: df.sort_index().index.is_monotonic
New behavior:
In [102]: df.sort_index()
Out[102]:
value
a aa 2
bb 1
b aa 4
bb 3
[4 rows x 1 columns]
In [103]: df.sort_index().index.is_lexsorted()
Out[103]: True
In [104]: df.sort_index().index.is_monotonic
Out[104]: True
The output formatting of groupby.describe() now labels the describe() metrics in the columns instead
of the index. This format is consistent with groupby.agg() when applying multiple functions at once.
(GH4792)
Previous behavior:
In [2]: df.groupby('A').describe()
Out[2]:
B
A
1 count 2.000000
mean 1.500000
std 0.707107
min 1.000000
25% 1.250000
50% 1.500000
75% 1.750000
max 2.000000
2 count 2.000000
mean 3.500000
std 0.707107
min 3.000000
25% 3.250000
50% 3.500000
75% 3.750000
max 4.000000
New behavior:
In [105]: df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4]})
In [106]: df.groupby('A').describe()
Out[106]:
B
count mean std min 25% 50% 75% max
A
1 2.0 1.5 0.707107 1.0 1.25 1.5 1.75 2.0
2 2.0 3.5 0.707107 3.0 3.25 3.5 3.75 4.0
[2 rows x 8 columns]
[2 rows x 4 columns]
A binary window operation, like .corr() or .cov(), when operating on a .rolling(..), .expanding(..),
or .ewm(..) object, will now return a 2-level MultiIndexed DataFrame rather than a Panel, as Panel is
now deprecated, see here. These are equivalent in function, but a MultiIndexed DataFrame enjoys more
support in pandas. See the section on Windowed Binary Operations for more information. (GH15677)
In [108]: np.random.seed(1234)
In [110]: df.tail()
Out[110]:
bar A B
foo
2016-04-05 0.640880 0.126205
[5 rows x 2 columns]
Previous behavior:
In [2]: df.rolling(12).corr()
Out[2]:
<class 'pandas.core.panel.Panel'>
Dimensions: 100 (items) x 2 (major_axis) x 2 (minor_axis)
Items axis: 2016-01-01 00:00:00 to 2016-04-09 00:00:00
Major_axis axis: A to B
Minor_axis axis: A to B
New behavior:
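The intermediate step creating res is not shown above; it would be along these lines (reconstructed from
the surrounding example):
res = df.rolling(12).corr()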
In [112]: res.tail()
Out[112]:
bar A B
foo bar
2016-04-07 B -0.132090 1.000000
2016-04-08 A 1.000000 -0.145775
B -0.145775 1.000000
2016-04-09 A 1.000000 0.119645
B 0.119645 1.000000
[5 rows x 2 columns]
In [113]: df.rolling(12).corr().loc['2016-04-07']
Out[113]:
bar A B
foo bar
2016-04-07 A 1.00000 -0.13209
B -0.13209 1.00000
[2 rows x 2 columns]
In previous versions, most types could be compared to a string column in an HDFStore, usually resulting in
an invalid comparison that returned an empty result frame. These comparisons will now raise a TypeError
(GH15492)
In [116]: df.dtypes
Out[116]:
unparsed_date object
Length: 1, dtype: object
Previous behavior:
New behavior:
In [18]: ts = pd.Timestamp('2014-01-01')
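The comparison itself is not shown above; it would be along these lines (a sketch assuming the frame was
written to an HDFStore with format='table' and data_columns=True):
pd.read_hdf('store.h5', 'key', where='unparsed_date > ts')   # now raises TypeError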
Index.intersection and inner join now preserve the order of the left Index
Index.intersection() now preserves the order of the calling Index (left) instead of the other Index (right)
(GH15582). This affects inner joins, DataFrame.join() and merge(), and the .align method.
• Index.intersection
In [118]: left
Out[118]: Int64Index([2, 1, 0], dtype='int64')
In [120]: right
Out[120]: Int64Index([1, 2, 3], dtype='int64')
Previous behavior:
In [4]: left.intersection(right)
Out[4]: Int64Index([1, 2], dtype='int64')
New behavior:
In [121]: left.intersection(right)
Out[121]: Int64Index([2, 1], dtype='int64')
In [123]: left
Out[123]:
a
2 20
1 10
0 0
[3 rows x 1 columns]
In [125]: right
Out[125]:
b
1 100
2 200
3 300
[3 rows x 1 columns]
Previous behavior:
In [4]: left.join(right, how='inner')
Out[4]:
a b
1 10 100
2 20 200
New behavior:
[2 rows x 2 columns]
The documentation for pivot_table() states that a DataFrame is always returned. Here a bug is fixed that
allowed this to return a Series under certain circumstances. (GH4386)
In [127]: df = pd.DataFrame({'col1': [3, 4, 5],
.....: 'col2': ['C', 'D', 'E'],
.....: 'col3': [1, 3, 9]})
.....:
In [128]: df
Out[128]:
[3 rows x 3 columns]
Previous behavior:
New behavior:
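A sketch of the call in question (reconstructed from the frame above):
df.pivot_table('col1', index=['col3', 'col2'], aggfunc='sum')
# now returns a one-column DataFrame rather than a Series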
[3 rows x 1 columns]
• numexpr version is now required to be >= 2.4.6 and it will not be used at all if this requisite is not
fulfilled (GH15213).
• CParserError has been renamed to ParserError in pd.read_csv() and will be removed in the future
(GH12665)
• SparseArray.cumsum() and SparseSeries.cumsum() will now always return SparseArray and
SparseSeries respectively (GH12855)
• DataFrame.applymap() with an empty DataFrame will return a copy of the empty DataFrame instead
of a Series (GH8222)
• Series.map() now respects default values of dictionary subclasses with a __missing__ method, such
as collections.Counter (GH15999)
• .loc has compat with .ix for accepting iterators, and NamedTuples (GH15120)
• interpolate() and fillna() will raise a ValueError if the limit keyword argument is not greater
than 0. (GH9217)
• pd.read_csv() will now issue a ParserWarning whenever there are conflicting values provided by the
dialect parameter and the user (GH14898)
• pd.read_csv() will now raise a ValueError for the C engine if the quote character is larger than
one byte (GH11592)
• inplace arguments now require a boolean value, else a ValueError is thrown (GH14189)
• pandas.api.types.is_datetime64_ns_dtype will now report True on a tz-aware dtype, similar to
pandas.api.types.is_datetime64_any_dtype
• DataFrame.asof() will return a null-filled Series instead of the scalar NaN if a match is not found
(GH15118)
• Specific support for copy.copy() and copy.deepcopy() functions on NDFrame objects (GH15444)
• Series.sort_values() accepts a one element list of bool for consistency with the behavior of
DataFrame.sort_values() (GH15604)
• .merge() and .join() on category dtype columns will now preserve the category dtype when possible
(GH10409)
• SparseDataFrame.default_fill_value will be 0, previously was nan in the return from pd.
get_dummies(..., sparse=True) (GH15594)
• The default behaviour of Series.str.match has changed from extracting groups to matching the
pattern. The extracting behaviour was deprecated since pandas version 0.13.0 and can be done with
the Series.str.extract method (GH5224). As a consequence, the as_indexer keyword is ignored
(no longer needed to specify the new behaviour) and is deprecated.
• NaT will now correctly report False for datetimelike boolean operations such as is_month_start
(GH15781)
• NaT will now correctly return np.nan for Timedelta and Period accessors such as days and quarter
(GH15782)
• NaT will now returns NaT for tz_localize and tz_convert methods (GH15830)
• DataFrame and Panel constructors with invalid input will now raise ValueError rather than
PandasError, if called with scalar inputs and not axes (GH15541)
• DataFrame and Panel constructors with invalid input will now raise ValueError rather than pandas.
core.common.PandasError, if called with scalar inputs and not axes; The exception PandasError is
removed as well. (GH15541)
• The exception pandas.core.common.AmbiguousIndexError is removed as it is not referenced
(GH15541)
Some formerly public python/c/c++/cython extension modules have been moved and/or renamed. These
are all removed from the public API. Furthermore, the pandas.core, pandas.compat, and pandas.util
top-level modules are now considered to be PRIVATE. If indicated, a deprecation warning will be issued if
you reference these modules. (GH12588)
Some new subpackages are created with public functionality that is not directly exposed in the top-level
namespace: pandas.errors, pandas.plotting and pandas.testing (more details below). Together with
pandas.api.types and certain functions in the pandas.io and pandas.tseries submodules, these are now
the public subpackages.
Further changes:
• The function union_categoricals() is now importable from pandas.api.types, formerly from
pandas.types.concat (GH15998)
• The type import pandas.tslib.NaTType is deprecated and can be replaced by using type(pandas.
NaT) (GH16146)
• The public functions in pandas.tools.hashing are deprecated from that location, but are now importable
from pandas.util (GH16223)
• The modules in pandas.util: decorators, print_versions, doctools, validators, depr_module
are now private. Only the functions exposed in pandas.util itself are public (GH16223)
pandas.errors
We are adding a standard public module for all pandas exceptions & warnings pandas.errors. (GH14800).
Previously these exceptions & warnings could be imported from pandas.core.common or pandas.io.common.
These exceptions and warnings will be removed from the *.common locations in a future release. (GH15541)
The following are now part of this API:
['DtypeWarning',
'EmptyDataError',
'OutOfBoundsDatetime',
'ParserError',
pandas.testing
We are adding a standard module that exposes the public testing functions in pandas.testing (GH9895).
Those functions can be used when writing tests for functionality using pandas objects.
The following testing functions are now part of this API:
• testing.assert_frame_equal()
• testing.assert_series_equal()
• testing.assert_index_equal()
pandas.plotting
A new public pandas.plotting module has been added that holds plotting functionality that was previously
in either pandas.tools.plotting or in the top-level namespace. See the deprecations sections for more
details.
• Building pandas for development now requires cython >= 0.23 (GH14831)
• Require at least 0.23 version of cython to avoid problems with character encodings (GH14699)
• Switched the test framework to use pytest (GH13097)
• Reorganization of tests directory layout (GH14854, GH15707).
Deprecations
Deprecate .ix
The .ix indexer is deprecated, in favor of the more strict .iloc and .loc indexers. .ix offers a lot of magic
on the inference of what the user wants to do. To wit, .ix can decide to index positionally OR via labels,
depending on the data type of the index. This has caused quite a bit of user confusion over the years. The
full indexing documentation is here. (GH14218)
The recommended methods of indexing are:
• .loc if you want to label index
• .iloc if you want to positionally index.
Using .ix will now show a DeprecationWarning with a link to some examples of how to convert code here.
In [131]: df
Out[131]:
A B
a 1 4
b 2 5
c 3 6
[3 rows x 2 columns]
Previous behavior, where you wish to get the 0th and the 2nd elements from the index in the ‘A’ column.
Using .loc. Here we will select the appropriate indexes from the index, then use label indexing.
Using .iloc. Here we will get the location of the ‘A’ column, then use positional indexing to select things.
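The conversion examples themselves are not reproduced above, so here is a hedged sketch using the small
frame shown earlier (index a/b/c, columns A/B):
# Hedged sketch of converting an .ix lookup of the 0th and 2nd rows of column 'A'.
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=list('abc'))

# previously (deprecated): df.ix[[0, 2], 'A']
df.loc[df.index[[0, 2]], 'A']               # translate positions to labels, then label-index
df.iloc[[0, 2], df.columns.get_loc('A')]    # translate the column label to a position, then position-index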
Deprecate Panel
Panel is deprecated and will be removed in a future version. The recommended way to represent 3-D data
is with a MultiIndex on a DataFrame via the to_frame() method or with the xarray package. Pandas provides a
to_xarray() method to automate this conversion (GH13563).
In [134]: p = tm.makePanel()
In [135]: p
Out[135]:
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 3 (major_axis) x 4 (minor_axis)
In [136]: p.to_frame()
Out[136]:
ItemA ItemB ItemC
major minor
2000-01-03 A 0.628776 -1.409432 0.209395
B 0.988138 -1.347533 -0.896581
C -0.938153 1.272395 -0.161137
D -0.223019 -0.591863 -1.051539
2000-01-04 A 0.186494 1.422986 -0.592886
B -0.072608 0.363565 1.104352
C -1.239072 -1.449567 0.889157
D 2.123692 -0.414505 -0.319561
2000-01-05 A 0.952478 -2.147855 -1.473116
B -0.550603 -0.014752 -0.431550
C 0.139683 -1.195524 0.288377
D 0.122273 -1.425795 -0.619993
In [137]: p.to_xarray()
Out[137]:
<xarray.DataArray (items: 3, major_axis: 3, minor_axis: 4)>
array([[[ 0.628776, 0.988138, -0.938153, -0.223019],
[ 0.186494, -0.072608, -1.239072, 2.123692],
[ 0.952478, -0.550603, 0.139683, 0.122273]],
However, .agg(..) can also accept a dict that allows ‘renaming’ of the result columns. This is a complicated
and confusing syntax, and is not consistent between Series and DataFrame. We are deprecating this
‘renaming’ functionality.
• We are deprecating passing a dict to a grouped/rolled/resampled Series. This allowed one to rename
the resulting aggregation, but this had a completely different meaning than passing a dictionary to a
grouped DataFrame, which accepts column-to-aggregations.
• We are deprecating passing a dict-of-dicts to a grouped/rolled/resampled DataFrame in a similar man-
ner.
This is an illustrative example:
In [135]: df
Out[135]:
A B C
0 1 0 0
1 1 1 1
2 1 2 2
3 2 3 3
4 2 4 4
[5 rows x 3 columns]
Here is a typical and useful syntax for computing different aggregations for different columns. We aggregate
from the dict-to-list by taking the specified columns and applying the list of functions. This returns a
MultiIndex for the columns (this is not deprecated).
[2 rows x 2 columns]
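Since the code for that example is not shown above, here is a hedged sketch of the non-deprecated
dict-of-lists form, reconstructing the frame from its display:
# Hedged sketch: different aggregations per column; the result columns form a MultiIndex.
df = pd.DataFrame({'A': [1, 1, 1, 2, 2], 'B': range(5), 'C': range(5)})
df.groupby('A').agg({'B': ['sum', 'max'], 'C': ['count', 'min']})
# result columns: ('B', 'sum'), ('B', 'max'), ('C', 'count'), ('C', 'min')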
Here’s an example of the first deprecation, passing a dict to a grouped Series. This is a combination
aggregation & renaming:
Out[6]:
foo
A
1 3
2 2
[2 rows x 1 columns]
In [23]: (df.groupby('A')
...: .agg({'B': {'foo': 'sum'}, 'C': {'bar': 'min'}})
...: )
FutureWarning: using a dict with renaming is deprecated and
will be removed in a future version
Out[23]:
B C
foo bar
A
1 3 0
2 7 3
In [138]: (df.groupby('A')
.....: .agg({'B': 'sum', 'C': 'min'})
.....: .rename(columns={'B': 'foo', 'C': 'bar'})
.....: )
.....:
Out[138]:
foo bar
A
1 3 0
2 7 3
[2 rows x 2 columns]
Deprecate .plotting
The pandas.tools.plotting module has been deprecated, in favor of the top level pandas.plotting mod-
ule. All the public plotting functions are now available from pandas.plotting (GH12548).
Furthermore, the top-level pandas.scatter_matrix and pandas.plot_params are deprecated. Users can
import these from pandas.plotting as well.
Previous script:
pd.tools.plotting.scatter_matrix(df)
pd.scatter_matrix(df)
Should be changed to:
pd.plotting.scatter_matrix(df)
Other deprecations
• SparseArray.to_dense() has deprecated the fill parameter, as that parameter was not being re-
spected (GH14647)
• SparseSeries.to_dense() has deprecated the sparse_only parameter (GH14647)
• Series.repeat() has deprecated the reps parameter in favor of repeats (GH12662)
• The Series constructor and .astype method have deprecated accepting timestamp dtypes without a
frequency (e.g. np.datetime64) for the dtype parameter (GH15524)
• Index.repeat() and MultiIndex.repeat() have deprecated the n parameter in favor of repeats
(GH12662)
• Categorical.searchsorted() and Series.searchsorted() have deprecated the v parameter in favor
of value (GH12662)
• TimedeltaIndex.searchsorted(), DatetimeIndex.searchsorted(), and PeriodIndex.
searchsorted() have deprecated the key parameter in favor of value (GH12662)
• DataFrame.astype() has deprecated the raise_on_error parameter in favor of errors (GH14878)
• Series.sortlevel and DataFrame.sortlevel have been deprecated in favor of Series.sort_index
and DataFrame.sort_index (GH15099)
• importing concat from pandas.tools.merge has been deprecated in favor of imports from the pandas
namespace. This should only affect explicit imports (GH15358)
• Series/DataFrame/Panel.consolidate() has been deprecated as a public method. (GH15483)
• The as_indexer keyword of Series.str.match() has been deprecated (ignored keyword) (GH15257).
• The following top-level pandas functions have been deprecated and will be removed in a future version
(GH13790, GH15940)
– pd.pnow(), replaced by Period.now()
– pd.Term, is removed, as it is not applicable to user code. Instead use in-line string expressions
in the where clause when searching in HDFStore
– pd.Expr, is removed, as it is not applicable to user code.
– pd.match(), is removed.
– pd.groupby(), replaced by using the .groupby() method directly on a Series/DataFrame
– pd.get_store(), replaced by a direct call to pd.HDFStore(...)
• is_any_int_dtype, is_floating_dtype, and is_sequence are deprecated from pandas.api.types
(GH16042)
• The pandas.rpy module is removed. Similar functionality can be accessed through the rpy2 project.
See the R interfacing docs for more details.
• The pandas.io.ga module with a google-analytics interface is removed (GH11308). Similar func-
tionality can be found in the Google2Pandas package.
• pd.to_datetime and pd.to_timedelta have dropped the coerce parameter in favor of errors
(GH13602)
• pandas.stats.fama_macbeth, pandas.stats.ols, pandas.stats.plm and pandas.stats.var, as
well as the top-level pandas.fama_macbeth and pandas.ols routines are removed. Similar func-
tionality can be found in the statsmodels package. (GH11898)
• The TimeSeries and SparseTimeSeries classes, aliases of Series and SparseSeries, are removed
(GH10890, GH15098).
• Series.is_time_series is dropped in favor of Series.index.is_all_dates (GH15098)
• The deprecated irow, icol, iget and iget_value methods are removed in favor of iloc and iat as
explained here (GH10711).
• The deprecated DataFrame.iterkv() has been removed in favor of DataFrame.iteritems()
(GH10711)
• The Categorical constructor has dropped the name parameter (GH10632)
• Categorical has dropped support for NaN categories (GH10748)
• The take_last parameter has been dropped from duplicated(), drop_duplicates(), nlargest(),
and nsmallest() methods (GH10236, GH10792, GH10920)
• Series, Index, and DataFrame have dropped the sort and order methods (GH10726)
• Where clauses in pytables are only accepted as strings and expression types and not other data-types
(GH12027)
• DataFrame has dropped the combineAdd and combineMult methods in favor of add and mul respectively
(GH10735)
Performance improvements
Bug fixes
Conversion
• Bug in Timestamp.replace now raises TypeError when incorrect argument names are given; previously
this raised ValueError (GH15240)
• Bug in Timestamp.replace with compat for passing long integers (GH15030)
• Bug in Timestamp returning UTC based time/date attributes when a timezone was provided (GH13303,
GH6538)
• Bug in Timestamp incorrectly localizing timezones during construction (GH11481, GH15777)
• Bug in TimedeltaIndex addition where overflow was being allowed without error (GH14816)
• Bug in TimedeltaIndex raising a ValueError when boolean indexing with loc (GH14946)
• Bug in catching an overflow in Timestamp + Timedelta/Offset operations (GH15126)
• Bug in DatetimeIndex.round() and Timestamp.round() floating point accuracy when rounding by
milliseconds or less (GH14440, GH15578)
• Bug in astype() where inf values were incorrectly converted to integers. astype() now raises an error
for Series and DataFrames (GH14265)
• Bug in DataFrame(..).apply(to_numeric) when values are of type decimal.Decimal. (GH14827)
• Bug in describe() when passing a numpy array which does not contain the median to the percentiles
keyword argument (GH14908)
• Cleaned up PeriodIndex constructor, including raising on floats more consistently (GH13277)
• Bug in using __deepcopy__ on empty NDFrame objects (GH15370)
• Bug in .replace() may result in incorrect dtypes. (GH12747, GH15765)
• Bug in Series.replace and DataFrame.replace which failed on empty replacement dicts (GH15289)
• Bug in Series.replace which replaced a numeric by string (GH15743)
• Bug in Index construction with NaN elements and integer dtype specified (GH15187)
• Bug in Series construction with a datetimetz (GH14928)
• Bug in Series.dt.round() with inconsistent behaviour on NaT values with different arguments (GH14940)
• Bug in Series constructor when both copy=True and dtype arguments are provided (GH15125)
• A Series with an incorrect dtype was returned by comparison methods (e.g., lt, gt, …) against a constant
for an empty DataFrame (GH15077)
• Bug in Series.ffill() with mixed dtypes containing tz-aware datetimes. (GH14956)
• Bug in DataFrame.fillna() where the argument downcast was ignored when fillna value was of type
dict (GH15277)
• Bug in .asfreq(), where frequency was not set for empty Series (GH14320)
• Bug in DataFrame construction with nulls and datetimes in a list-like (GH15869)
• Bug in DataFrame.fillna() with tz-aware datetimes (GH15855)
Indexing
I/O
• Bug in pd.to_numeric() in which float and unsigned integer elements were being improperly casted
(GH14941, GH15005)
• Bug in pd.read_fwf() where the skiprows parameter was not being respected during column width
inference (GH11256)
• Bug in pd.read_csv() in which the dialect parameter was not being verified before processing
(GH14898)
• Bug in pd.read_csv() in which missing data was being improperly handled with usecols (GH6710)
• Bug in pd.read_csv() in which a file containing a row with many columns followed by rows with fewer
columns would cause a crash (GH14125)
• Bug in pd.read_csv() for the C engine where usecols were being indexed incorrectly with
parse_dates (GH14792)
• Bug in pd.read_csv() with parse_dates when multi-line headers are specified (GH15376)
• Bug in pd.read_csv() with float_precision='round_trip' which caused a segfault when a text
entry is parsed (GH15140)
• Bug in pd.read_csv() when an index was specified and no values were specified as null values
(GH15835)
• Bug in pd.read_csv() in which certain invalid file objects caused the Python interpreter to crash
(GH15337)
• Bug in pd.read_csv() in which invalid values for nrows and chunksize were allowed (GH15767)
• Bug in pd.read_csv() for the Python engine in which unhelpful error messages were being raised
when parsing errors occurred (GH15910)
• Bug in pd.read_csv() in which the skipfooter parameter was not being properly validated
(GH15925)
• Bug in pd.to_csv() in which there was numeric overflow when a timestamp index was being written
(GH15982)
• Bug in pd.util.hashing.hash_pandas_object() in which hashing of categoricals depended on the
ordering of categories, instead of just their values. (GH15143)
• Bug in .to_json() where lines=True and contents (keys or values) contain escaped characters
(GH15096)
• Bug in .to_json() causing single byte ascii characters to be expanded to four byte unicode (GH15344)
• Bug in .to_json() for the C engine where rollover was not correctly handled for case where frac is
odd and diff is exactly 0.5 (GH15716, GH15864)
• Bug in pd.read_json() for Python 2 where lines=True and contents contain non-ascii unicode char-
acters (GH15132)
• Bug in pd.read_msgpack() in which Series categoricals were being improperly processed (GH14901)
• Bug in pd.read_msgpack() which did not allow loading of a dataframe with an index of type
CategoricalIndex (GH15487)
Plotting
Groupby/resample/rolling
• Bug in .rolling() where pd.Timedelta or datetime.timedelta was not accepted as a window ar-
gument (GH15440)
• Bug in Rolling.quantile function that caused a segmentation fault when called with a quantile value
outside of the range [0, 1] (GH15463)
• Bug in DataFrame.resample().median() if duplicate column names are present (GH14233)
Sparse
Reshaping
• Bug in pd.merge_asof() where left_index or right_index caused a failure when multiple by was
specified (GH15676)
• Bug in pd.merge_asof() where left_index/right_index together caused a failure when tolerance
was specified (GH15135)
• Bug in DataFrame.pivot_table() where dropna=True would not drop all-NaN columns when the
columns was a category dtype (GH15193)
• Bug in pd.melt() where passing a tuple value for value_vars caused a TypeError (GH15348)
• Bug in pd.pivot_table() where no error was raised when values argument was not in the columns
(GH14938)
• Bug in pd.concat() in which concatenating with an empty dataframe with join='inner' was being
improperly handled (GH15328)
• Bug with sort=True in DataFrame.join and pd.merge when joining on indexes (GH15582)
• Bug in DataFrame.nsmallest and DataFrame.nlargest where identical values resulted in duplicated
rows (GH15297)
• Bug in pandas.pivot_table() incorrectly raising UnicodeError when passing unicode input for
margins keyword (GH13292)
Numeric
• Bug in .eval() which caused multi-line evals to fail with local variables not on the first line (GH15342)
Other
Contributors
A total of 204 people contributed patches to this release. People with a “+” by their names contributed a
patch for the first time.
• Adam J. Stewart +
• Adrian +
• Ajay Saxena
• Akash Tandon +
• Albert Villanova del Moral +
• Aleksey Bilogur +
• Alexis Mignon +
• Amol Kahat +
• Andreas Winkler +
• Andrew Kittredge +
• Anthonios Partheniou
• Arco Bast +
• Ashish Singal +
• Baurzhan Muftakhidinov +
• Ben Kandel
• Ben Thayer +
• Ben Welsh +
• Bill Chambers +
• Brandon M. Burroughs
• Brian +
• Brian McFee +
• Carlos Souza +
• Chris
• Chris Ham
• Chris Warth
• Christoph Gohlke
• Christoph Paulik +
• Christopher C. Aycock
• Clemens Brunner +
• D.S. McNeil +
• DaanVanHauwermeiren +
• Daniel Himmelstein
• Dave Willmer
• David Cook +
• David Gwynne +
• David Hoffman +
• David Krych
• Diego Fernandez +
• Dimitris Spathis +
• Dmitry L +
• Dody Suria Wijaya +
• Dominik Stanczak +
• Dr-Irv
• Dr. Irv +
• Elliott Sales de Andrade +
• Ennemoser Christoph +
• Francesc Alted +
• Fumito Hamamura +
• Giacomo Ferroni
• Graham R. Jeffries +
• Greg Williams +
• Guilherme Beltramini +
• Guilherme Samora +
• Hao Wu +
• Harshit Patni +
• Ilya V. Schurov +
• Iván Vallés Pérez
• Jackie Leng +
• Jaehoon Hwang +
• James Draper +
• James Goppert +
• James McBride +
• James Santucci +
• Jan Schulz
• Jeff Carey
• Jeff Reback
• JennaVergeynst +
• Jim +
• Jim Crist
• Joe Jevnik
• Joel Nothman +
• John +
• John Tucker +
• John W. O’Brien
• John Zwinck
• Jon M. Mease
• Jon Mease
• Jonathan Whitmore +
• Jonathan de Bruin +
• Joost Kranendonk +
• Joris Van den Bossche
• Joshua Bradt +
• Julian Santander
• Julien Marrec +
• Jun Kim +
• Justin Solinsky +
• Kacawi +
• Kamal Kamalaldin +
• Kerby Shedden
• Kernc
• Keshav Ramaswamy
• Kevin Sheppard
• Kyle Kelley
• Larry Ren
• Leon Yin +
• Line Pedersen +
• Lorenzo Cestaro +
• Luca Scarabello
• Lukasz +
• Mahmoud Lababidi
• Mark Mandel +
• Matt Roeschke
• Matthew Brett
• Matthew Roeschke +
• Matti Picus
• Maximilian Roos
• Michael Charlton +
• Michael Felt
• Michael Lamparski +
• Michiel Stock +
• Mikolaj Chwalisz +
• Min RK
• Miroslav Šedivý +
• Mykola Golubyev
• Nate Yoder
• Nathalie Rud +
• Nicholas Ver Halen
• Nick Chmura +
• Nolan Nichols +
• Pankaj Pandey +
• Pawel Kordek
• Pete Huang +
• Peter +
• Peter Csizsek +
• Petio Petrov +
• Phil Ruffwind +
• Pietro Battiston
• Piotr Chromiec
• Prasanjit Prakash +
• Rob Forgione +
• Robert Bradshaw
• Robin +
• Rodolfo Fernandez
• Roger Thomas
• Rouz Azari +
• Sahil Dua
• Sam Foo +
• Sami Salonen +
• Sarah Bird +
• Sarma Tangirala +
• Scott Sanderson
• Sebastian Bank
• Sebastian Gsänger +
• Shawn Heide
• Shyam Saladi +
• Sinhrks
• Stephen Rauch +
• Sébastien de Menten +
• Tara Adiseshan
• Thiago Serafim
• Thoralf Gutierrez +
• Thrasibule +
• Tobias Gustafsson +
• Tom Augspurger
• Tong SHEN +
• Tong Shen +
• TrigonaMinima +
• Uwe +
• Wes Turner
• Wiktor Tomczak +
• WillAyd
• Yaroslav Halchenko
• Yimeng Zhang +
• abaldenko +
• adrian-stepien +
• alexandercbooth +
• atbd +
• bastewart +
• bmagnusson +
• carlosdanielcsantos +
• chaimdemulder +
• chris-b1
• dickreuter +
• discort +
• dr-leo +
• dubourg
• dwkenefick +
• funnycrab +
• gfyoung
• goldenbull +
• [email protected]
• jojomdt +
• linebp +
• manu +
• manuels +
• mattip +
• maxalbert +
• mcocdawc +
• nuffe +
• paul-mannino
• pbreach +
• sakkemo +
• scls19fr
• sinhrks
• stijnvanhoey +
• the-nose-knows +
• themrmax +
• tomrod +
• tzinckgraf
• wandersoncferreira
• watercrossing +
• wcwagner
• xgdgsc +
• yui-knk
{{ header }}
This is a minor bug-fix release in the 0.19.x series and includes some small regression fixes, bug fixes and
performance improvements. We recommend that all users upgrade to this version.
Highlights include:
• Compatibility with Python 3.6
• Added a Pandas Cheat Sheet. (GH13202).
• Enhancements
• Performance improvements
• Bug fixes
• Contributors
Enhancements
Performance improvements
Bug fixes
• Bug in pd.read_csv in which aliasing was being done for na_values when passed in as a dictionary
(GH14203)
• Bug in pd.read_csv in which column indices for a dict-like na_values were not being respected
(GH14203)
• Bug in pd.read_csv where reading files failed if the number of headers was equal to the number of lines
in the file (GH14515)
• Bug in pd.read_csv for the Python engine in which an unhelpful error message was being raised when
multi-char delimiters were not being respected with quotes (GH14582)
• Fix bugs (GH14734, GH13654) in pd.read_sas and pandas.io.sas.sas7bdat.SAS7BDATReader that
caused problems when reading a SAS file incrementally.
• Bug in pd.read_csv for the Python engine in which an unhelpful error message was being raised when
skipfooter was not being respected by Python’s CSV library (GH13879)
• Bug in .fillna() in which timezone aware datetime64 values were incorrectly rounded (GH14872)
• Bug in .groupby(..., sort=True) of a non-lexsorted MultiIndex when grouping with multiple levels
(GH14776)
• Bug in pd.cut with negative values and a single bin (GH14652)
• Bug in pd.to_numeric where a 0 was not unsigned on a downcast='unsigned' argument (GH14401)
• Bug in plotting regular and irregular timeseries using shared axes (sharex=True or ax.twinx())
(GH13341, GH14322).
• Bug in not propagating exceptions in parsing invalid datetimes, noted in python 3.6 (GH14561)
• Bug in resampling a DatetimeIndex in local TZ, covering a DST change, which would raise
AmbiguousTimeError (GH14682)
• Bug in indexing that transformed RecursionError into KeyError or IndexingError (GH14554)
• Bug in HDFStore when writing a MultiIndex when using data_columns=True (GH14435)
• Bug in HDFStore.append() when writing a Series and passing a min_itemsize argument containing
a value for the index (GH11412)
• Bug when writing to a HDFStore in table format with a min_itemsize value for the index and without
asking to append (GH10381)
• Bug in Series.groupby.nunique() raising an IndexError for an empty Series (GH12553)
• Bug in DataFrame.nlargest and DataFrame.nsmallest when the index had duplicate values
(GH13412)
• Bug in clipboard functions on linux with python2 with unicode and separators (GH13747)
• Bug in clipboard functions on Windows 10 and python 3 (GH14362, GH12807)
• Bug in .to_clipboard() and Excel compat (GH12529)
• Bug in DataFrame.combine_first() for integer columns (GH14687).
• Bug in pd.read_csv() in which the dtype parameter was not being respected for empty data
(GH14712)
• Bug in pd.read_csv() in which the nrows parameter was not being respected for large input when
using the C engine for parsing (GH7626)
• Bug in pd.merge_asof() could not handle timezone-aware DatetimeIndex when a tolerance was spec-
ified (GH14844)
• Explicit check in to_stata and StataWriter for out-of-range values when writing doubles (GH14618)
• Bug in .plot(kind='kde') which did not drop missing values to generate the KDE Plot, instead
generating an empty plot. (GH14821)
• Bug in unstack() where, if called with a list of column(s) as an argument, all columns were coerced to
object regardless of their dtypes (GH11847)
Contributors
A total of 33 people contributed patches to this release. People with a “+” by their names contributed a
patch for the first time.
• Ajay Saxena +
• Ben Kandel
• Chris
• Chris Ham +
• Christopher C. Aycock
• Daniel Himmelstein +
• Dave Willmer +
• Dr-Irv
• Jeff Carey +
• Jeff Reback
• Joe Jevnik
• Joris Van den Bossche
• Julian Santander +
• Kerby Shedden
• Keshav Ramaswamy
• Kevin Sheppard
• Luca Scarabello +
• Matt Roeschke +
• Matti Picus +
• Maximilian Roos
• Mykola Golubyev +
• Nate Yoder +
• Nicholas Ver Halen +
• Pawel Kordek
• Pietro Battiston
• Rodolfo Fernandez +
• Tara Adiseshan +
• Tom Augspurger
• Yaroslav Halchenko
• gfyoung
• [email protected] +
• sinhrks
• wandersoncferreira +
{{ header }}
This is a minor bug-fix release from 0.19.0 and includes some small regression fixes, bug fixes and performance
improvements. We recommend that all users upgrade to this version.
• Performance improvements
• Bug fixes
• Contributors
Performance improvements
Bug fixes
• Source installs from PyPI will now again work without cython installed, as in previous versions
(GH14204)
• Compat with Cython 0.25 for building (GH14496)
• Fixed regression where user-provided file handles were closed in read_csv (c engine) (GH14418).
• Fixed regression in DataFrame.quantile when missing values were present in some columns
(GH14357).
• Fixed regression in Index.difference where the freq of a DatetimeIndex was incorrectly set
(GH14323)
• Added back pandas.core.common.array_equivalent with a deprecation warning (GH14555).
• Bug in pd.read_csv for the C engine in which quotation marks were improperly parsed in skipped
rows (GH14459)
• Bug in pd.read_csv for Python 2.x in which Unicode quote characters were no longer being respected
(GH14477)
Contributors
A total of 30 people contributed patches to this release. People with a “+” by their names contributed a
patch for the first time.
• Adam Chainz +
• Anthonios Partheniou
• Arash Rouhani +
• Ben Kandel
• Brandon M. Burroughs +
• Chris
• Chris Warth
• David Krych +
• Iván Vallés Pérez +
• Jeff Reback
• Joe Jevnik
• Jon M. Mease +
• Jon Mease +
• Joris Van den Bossche
• Josh Owen +
• Keshav Ramaswamy +
• Larry Ren +
• Michael Felt +
• Piotr Chromiec +
• Robert Bradshaw +
• Sinhrks
• Thiago Serafim +
• Tom Bird
• bkandel +
• chris-b1
• dubourg +
• gfyoung
• mattrijk +
• paul-mannino +
• sinhrks
{{ header }}
This is a major release from 0.18.1 and includes a number of API changes, several new features, enhancements,
and performance improvements along with a large number of bug fixes. We recommend that all users upgrade
to this version.
Highlights include:
• merge_asof() for asof-style time-series joining, see here
• .rolling() is now time-series aware, see here
• read_csv() now supports parsing Categorical data, see here
• A function union_categoricals() has been added for combining categoricals, see here
• PeriodIndex now has its own period dtype, and changed to be more consistent with other Index
classes. See here
• Sparse data structures gained enhanced support of int and bool dtypes, see here
• Comparison operations with Series no longer ignore the index, see here for an overview of the API
changes.
• Introduction of a pandas development API for utility functions, see here.
• Deprecation of Panel4D and PanelND. We recommend representing these types of n-dimensional data
with the xarray package.
• Removal of the previously deprecated modules pandas.io.data, pandas.io.wb, pandas.tools.rplot.
Warning: pandas >= 0.19.0 will no longer silence numpy ufunc warnings upon import, see here.
• New features
– merge_asof for asof-style time-series joining
– .rolling() is now time-series aware
– read_csv has improved support for duplicate column names
– read_csv supports parsing Categorical directly
– Categorical concatenation
– Semi-month offsets
– New Index methods
– Google BigQuery Enhancements
– Fine-grained numpy errstate
– get_dummies now returns integer dtypes
– Downcast values to smallest possible dtype in to_numeric
– pandas development API
– Other enhancements
• API changes
– Series.tolist() will now return Python types
– Series operators for different indexes
* Arithmetic operators
* Comparison operators
* Logical operators
* Flexible comparison methods
– Series type promotion on assignment
– .to_datetime() changes
– Merging changes
– .describe() changes
– Period changes
New features
A long-time requested feature has been added through the merge_asof() function, to support asof style
joining of time-series (GH1870, GH13695, GH13709, GH13902). Full documentation is here.
The merge_asof() performs an asof merge, which is similar to a left-join except that we match on nearest
key rather than equal keys.
In [1]: left = pd.DataFrame({'a': [1, 5, 10],
...: 'left_val': ['a', 'b', 'c']})
...:
In [3]: left
Out[3]:
a left_val
0 1 a
1 5 b
2 10 c
[3 rows x 2 columns]
In [4]: right
Out[4]:
a right_val
0 1 1
1 2 2
2 3 3
3 6 6
4 7 7
[5 rows x 2 columns]
We typically want to match exactly when possible, and use the most recent value otherwise.
[3 rows x 3 columns]
We can also match rows ONLY with prior data, and not an exact match.
[3 rows x 3 columns]
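The calls for those two merges are not reproduced above; a hedged sketch, reconstructing right from its
displayed values, would be:
# Hedged sketch of the two asof merges described above.
import pandas as pd

left = pd.DataFrame({'a': [1, 5, 10], 'left_val': ['a', 'b', 'c']})
right = pd.DataFrame({'a': [1, 2, 3, 6, 7], 'right_val': [1, 2, 3, 6, 7]})

pd.merge_asof(left, right, on='a')                             # exact match when possible, else most recent
pd.merge_asof(left, right, on='a', allow_exact_matches=False)  # only strictly prior data, no exact matches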
In a typical time-series example, we have trades and quotes and we want to asof-join them. This also
illustrates using the by parameter to group data before merging.
In [9]: trades
Out[9]:
time ticker price quantity
0 2016-05-25 13:30:00.023 MSFT 51.95 75
1 2016-05-25 13:30:00.038 MSFT 51.95 155
2 2016-05-25 13:30:00.048 GOOG 720.77 100
3 2016-05-25 13:30:00.048 GOOG 720.92 100
4 2016-05-25 13:30:00.048 AAPL 98.00 100
[5 rows x 4 columns]
In [10]: quotes
Out[10]:
time ticker bid ask
0 2016-05-25 13:30:00.023 GOOG 720.50 720.93
1 2016-05-25 13:30:00.023 MSFT 51.95 51.96
2 2016-05-25 13:30:00.030 MSFT 51.97 51.98
3 2016-05-25 13:30:00.041 MSFT 51.99 52.00
4 2016-05-25 13:30:00.048 GOOG 720.50 720.93
5 2016-05-25 13:30:00.049 AAPL 97.99 98.01
6 2016-05-25 13:30:00.072 GOOG 720.50 720.88
7 2016-05-25 13:30:00.075 MSFT 52.01 52.03
[8 rows x 4 columns]
An asof merge joins on the on column, typically a datetimelike field, which is ordered, and in this case we are
using a grouper in the by field. This is like a left-outer join, except that forward filling happens automatically,
taking the most recent non-NaN value.
[5 rows x 6 columns]
This returns a merged DataFrame with the entries in the same order as the original left passed DataFrame
(trades in this case), with the fields of the quotes merged.
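The merge call itself is not shown above; assuming the trades and quotes frames as displayed, a hedged
sketch is:
# Hedged sketch: asof-join quotes onto trades, grouped by ticker, on the ordered 'time' column.
pd.merge_asof(trades, quotes, on='time', by='ticker')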
.rolling() objects are now time-series aware and can accept a time-series offset (or convertible) for the
window argument (GH13327, GH12995). See the full documentation here.
In [13]: dft
Out[13]:
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:01 1.0
2013-01-01 09:00:02 2.0
2013-01-01 09:00:03 NaN
2013-01-01 09:00:04 4.0
[5 rows x 1 columns]
This is a regular frequency index. Using an integer window parameter works to roll along the window
frequency.
In [14]: dft.rolling(2).sum()
Out[14]:
B
2013-01-01 09:00:00 NaN
2013-01-01 09:00:01 1.0
2013-01-01 09:00:02 3.0
2013-01-01 09:00:03 NaN
2013-01-01 09:00:04 NaN
[5 rows x 1 columns]
[5 rows x 1 columns]
Specifying an offset allows a more intuitive specification of the rolling frequency.
In [16]: dft.rolling('2s').sum()
Out[16]:
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:01 1.0
2013-01-01 09:00:02 3.0
2013-01-01 09:00:03 2.0
2013-01-01 09:00:04 4.0
[5 rows x 1 columns]
Using a non-regular, but still monotonic index, rolling with an integer window does not impart any special
calculation.
In [17]: dft = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]},
....: index=pd.Index([pd.Timestamp('20130101 09:00:00'),
....: pd.Timestamp('20130101 09:00:02'),
....: pd.Timestamp('20130101 09:00:03'),
....: pd.Timestamp('20130101 09:00:05'),
....: pd.Timestamp('20130101 09:00:06')],
....: name='foo'))
....:
In [18]: dft
Out[18]:
B
foo
2013-01-01 09:00:00 0.0
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 2.0
2013-01-01 09:00:05 NaN
2013-01-01 09:00:06 4.0
[5 rows x 1 columns]
In [19]: dft.rolling(2).sum()
Out[19]:
B
foo
2013-01-01 09:00:00 NaN
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 3.0
2013-01-01 09:00:05 NaN
2013-01-01 09:00:06 NaN
[5 rows x 1 columns]
Using the time-specification generates variable windows for this sparse data.
In [20]: dft.rolling('2s').sum()
Out[20]:
B
foo
2013-01-01 09:00:00 0.0
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 3.0
2013-01-01 09:00:05 NaN
2013-01-01 09:00:06 4.0
[5 rows x 1 columns]
Furthermore, we now allow an optional on parameter to specify a column (rather than the default of the
index) in a DataFrame.
In [21]: dft = dft.reset_index()
In [22]: dft
Out[22]:
foo B
0 2013-01-01 09:00:00 0.0
1 2013-01-01 09:00:02 1.0
2 2013-01-01 09:00:03 2.0
3 2013-01-01 09:00:05 NaN
4 2013-01-01 09:00:06 4.0
[5 rows x 2 columns]
[5 rows x 2 columns]
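A hedged sketch of using that on parameter with the reset frame shown above:
# Hedged sketch: roll over the 'foo' datetime column instead of the index.
dft.rolling('2s', on='foo').sum()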
Duplicate column names are now supported in read_csv() whether they are in the file or passed in as the
names parameter (GH7160, GH9424)
Previous behavior:
The first a column contained the same data as the second a column, when it should have contained the
values [0, 3].
New behavior:
683 )
684
--> 685 return _read(filepath_or_buffer, kwds)
686
687 parser_f.__name__ = name
~/sandbox/pandas-doc/pandas/io/parsers.py in _validate_names(names)
419 if names is not None:
420 if len(names) != len(set(names)):
--> 421 raise ValueError("Duplicate names are not allowed.")
422 return names
423
The read_csv() function now supports parsing a Categorical column when specified as a dtype (GH10153).
Depending on the structure of the data, this can result in a faster parse time and lower memory usage
compared to converting to Categorical after parsing. See the io docs here.
In [27]: data = 'col1,col2,col3\na,b,1\na,b,2\nc,d,3'
In [28]: pd.read_csv(StringIO(data))
Out[28]:
col1 col2 col3
0 a b 1
1 a b 2
2 c d 3
[3 rows x 3 columns]
In [29]: pd.read_csv(StringIO(data)).dtypes
Out[29]:
col1 object
col2 object
col3 int64
Length: 3, dtype: object
Note: The resulting categories will always be parsed as strings (object dtype). If the categories are
numeric they can be converted using the to_numeric() function, or as appropriate, another converter such
as to_datetime().
In [32]: df = pd.read_csv(StringIO(data), dtype='category')
In [33]: df.dtypes
Out[33]:
col1 category
col2 category
col3 category
Length: 3, dtype: object
In [34]: df['col3']
Out[34]:
0 1
1 2
2 3
Name: col3, Length: 3, dtype: category
Categories (3, object): [1, 2, 3]
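The conversion step is not reproduced above; one hedged way to obtain the integer categories shown next is:
# Hedged sketch: convert the string categories of 'col3' to integers.
df['col3'] = df['col3'].cat.rename_categories(pd.to_numeric(df['col3'].cat.categories))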
In [36]: df['col3']
Out[36]:
0 1
1 2
2 3
Name: col3, Length: 3, dtype: category
Categories (3, int64): [1, 2, 3]
Categorical concatenation
• A function union_categoricals() has been added for combining categoricals, see Unioning Categor-
icals (GH13361, GH13763, GH13846, GH14173)
In [37]: from pandas.api.types import union_categoricals
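A minimal, hedged sketch of the combination (the values here are illustrative):
# Minimal sketch: combine two categoricals; categories are unioned in order of appearance.
a = pd.Categorical(["b", "c"])
b = pd.Categorical(["a", "b"])
union_categoricals([a, b])
# -> ['b', 'c', 'a', 'b'] with categories ['b', 'c', 'a']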
• concat and append now can concat category dtypes with different categories as object dtype
(GH13524)
In [41]: s1 = pd.Series(['a', 'b'], dtype='category')
Previous behavior:
In [1]: pd.concat([s1, s2])
ValueError: incompatible categories in categorical concat
New behavior:
In [43]: pd.concat([s1, s2])
Out[43]:
0 a
1 b
0 b
1 c
Length: 4, dtype: object
Semi-month offsets
Pandas has gained new frequency offsets, SemiMonthEnd (‘SM’) and SemiMonthBegin (‘SMS’). These provide
date offsets anchored (by default) to the 15th and end of month, and 15th and 1st of month respectively.
(GH1543)
SemiMonthEnd:
In [45]: pd.Timestamp('2016-01-01') + SemiMonthEnd()
Out[45]: Timestamp('2016-01-15 00:00:00')
SemiMonthBegin:
In [47]: pd.Timestamp('2016-01-01') + SemiMonthBegin()
Out[47]: Timestamp('2016-01-15 00:00:00')
Using the anchoring suffix, you can also specify the day of month to use instead of the 15th.
In [49]: pd.date_range('2015-01-01', freq='SMS-16', periods=4)
Out[49]: DatetimeIndex(['2015-01-01', '2015-01-16', '2015-02-01', '2015-02-16'],␣
,→dtype='datetime64[ns]', freq='SMS-16')
The following methods and options are added to Index, to be more consistent with the Series and DataFrame
API.
Index now supports the .where() function for same shape indexing (GH13170)
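The example for .where() is not reproduced above; a hedged sketch, using a float index with a missing value
consistent with the dropna() output shown next, would be:
# Hedged sketch of Index.where: keep entries where the condition holds, NaN elsewhere.
idx = pd.Index([1., 2., np.nan, 4.])
idx.where(idx > 1)      # -> Float64Index([nan, 2.0, nan, 4.0], dtype='float64')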
In [54]: idx.dropna()
Out[54]: Float64Index([1.0, 2.0, 4.0], dtype='float64')
For MultiIndex, values are dropped if any level is missing by default. Specifying how='all' only drops
values where all levels are missing.
In [55]: midx = pd.MultiIndex.from_arrays([[1, 2, np.nan, 4],
....: [1, 2, np.nan, np.nan]])
....:
In [56]: midx
Out[56]:
MultiIndex([(1.0, 1.0),
(2.0, 2.0),
(nan, nan),
(4.0, nan)],
)
In [57]: midx.dropna()
Out[57]:
MultiIndex([(1, 1),
(2, 2)],
)
In [58]: midx.dropna(how='all')
Out[58]:
MultiIndex([(1, 1.0),
(2, 2.0),
(4, nan)],
)
Index now supports .str.extractall() which returns a DataFrame, see the docs here (GH10008, GH13156)
In [60]: idx.str.extractall(r"[ab](?P<digit>\d)")
Out[60]:
digit
match
0 0 1
1 2
1 0 1
[3 rows x 1 columns]
Index.astype() now accepts an optional boolean argument copy, which allows optional copying if the
requirements on dtype are satisfied (GH13209)
• The read_gbq() method has gained the dialect argument to allow users to specify whether to use
BigQuery’s legacy SQL or BigQuery’s standard SQL. See the docs for more details (GH13615).
• The to_gbq() method now allows the DataFrame column order to differ from the destination table
schema (GH11359).
Previous versions of pandas would permanently silence numpy’s ufunc error handling when pandas was
imported. Pandas did this in order to silence the warnings that would arise from using numpy ufuncs on
missing data, which are usually represented as NaN s. Unfortunately, this silenced legitimate warnings arising
in non-pandas code in the application. Starting with 0.19.0, pandas will use the numpy.errstate context
manager to silence these warnings in a more fine-grained manner, only around where these operations are
actually used in the pandas code base. (GH13109, GH13145)
After upgrading pandas, you may see new RuntimeWarnings being issued from your code. These are likely
legitimate, and the underlying cause likely existed in the code when using previous versions of pandas that
simply silenced the warning. Use numpy.errstate around the source of the RuntimeWarning to control how
these conditions are handled.
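For example, a hedged sketch of silencing such a warning locally rather than globally:
# Illustrative sketch: suppress floating-point RuntimeWarnings only around the offending operation.
import numpy as np

a = np.array([1.0, 0.0])
b = np.array([0.0, 0.0])
with np.errstate(divide='ignore', invalid='ignore'):
    result = a / b          # 1/0 -> inf, 0/0 -> nan, without emitting warnings here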
The pd.get_dummies function now returns dummy-encoded columns as small integers, rather than floats
(GH8725). This should provide an improved memory footprint.
Previous behavior:
Out[1]:
a float64
b float64
c float64
dtype: object
New behavior:
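The new output is not reproduced above; a hedged sketch of what it looks like (the dummy columns are now
small unsigned integers):
# Hedged sketch of the new behavior.
pd.get_dummies(['a', 'b', 'a', 'c']).dtypes
# a    uint8
# b    uint8
# c    uint8
# dtype: object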
pd.to_numeric() now accepts a downcast parameter, which will downcast the data if possible to smallest
specified numerical dtype (GH13352)
In [62]: s = ['1', 2, 3]
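A hedged sketch of downcasting the list s defined above:
# Hedged sketch: downcast to the smallest dtype that can hold the parsed values.
pd.to_numeric(s, downcast='unsigned')   # -> array([1, 2, 3], dtype=uint8)
pd.to_numeric(s, downcast='integer')    # -> array([1, 2, 3], dtype=int8)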
As part of making the pandas API more uniform and accessible in the future, we have created a standard
sub-package of pandas, pandas.api, to hold public APIs. We are starting by exposing type introspection
functions in pandas.api.types. More sub-packages and officially sanctioned APIs will be published in
future versions of pandas (GH13147, GH13634)
The following are now part of this API:
In [68]: pprint.pprint(funcs)
['CategoricalDtype',
'DatetimeTZDtype',
'IntervalDtype',
'PeriodDtype',
'infer_dtype',
'is_array_like',
'is_bool',
'is_bool_dtype',
'is_categorical',
'is_categorical_dtype',
'is_complex',
'is_complex_dtype',
'is_datetime64_any_dtype',
'is_datetime64_dtype',
'is_datetime64_ns_dtype',
'is_datetime64tz_dtype',
'is_datetimetz',
'is_dict_like',
'is_dtype_equal',
'is_extension_array_dtype',
'is_extension_type',
'is_file_like',
'is_float',
'is_float_dtype',
'is_hashable',
'is_int64_dtype',
'is_integer',
'is_integer_dtype',
'is_interval',
'is_interval_dtype',
'is_iterator',
'is_list_like',
'is_named_tuple',
'is_number',
'is_numeric_dtype',
'is_object_dtype',
'is_period',
'is_period_dtype',
'is_re',
'is_re_compilable',
'is_scalar',
'is_signed_integer_dtype',
'is_sparse',
'is_string_dtype',
'is_timedelta64_dtype',
Note: Calling these functions from the internal module pandas.core.common will now show a
DeprecationWarning (GH13990)
Other enhancements
• Timestamp can now accept positional and keyword parameters similar to datetime.datetime()
(GH10758, GH11630)
In [69]: pd.Timestamp(2012, 1, 1)
Out[69]: Timestamp('2012-01-01 00:00:00')
(construction of df not shown; it is indexed by a MultiIndex with levels named 'v' and 'd', where 'd' is a
weekly date_range of 5 periods, as displayed below)
In [72]: df
Out[72]:
date a
v d
1 2015-01-04 2015-01-04 0
2 2015-01-11 2015-01-11 1
3 2015-01-18 2015-01-18 2
4 2015-01-25 2015-01-25 3
5 2015-02-01 2015-02-01 4
[5 rows x 2 columns]
[2 rows x 1 columns]
[2 rows x 1 columns]
• The .get_credentials() method of GbqConnector can now first try to fetch the application default
credentials. See the docs for more details (GH13577).
• The .tz_localize() method of DatetimeIndex and Timestamp has gained the errors keyword, so
you can potentially coerce nonexistent timestamps to NaT. The default behavior remains raising a
NonExistentTimeError (GH13057)
• .to_hdf/read_hdf() now accept path objects (e.g. pathlib.Path, py.path.local) for the file path
(GH11773)
• The pd.read_csv() with engine='python' has gained support for the decimal (GH12933),
na_filter (GH13321) and the memory_map option (GH13381).
• Consistent with the Python API, pd.read_csv() will now interpret +inf as positive infinity (GH13274)
• The pd.read_html() has gained support for the na_values, converters, keep_default_na options
(GH13461)
• Categorical.astype() now accepts an optional boolean argument copy, effective when dtype is cat-
egorical (GH13209)
• DataFrame has gained the .asof() method to return the last non-NaN values according to the selected
subset (GH13358)
• The DataFrame constructor will now respect key ordering if a list of OrderedDict objects are passed
in (GH13304)
• pd.read_html() has gained support for the decimal option (GH12907)
• Series has gained the properties .is_monotonic, .is_monotonic_increasing, .
is_monotonic_decreasing, similar to Index (GH13336)
• DataFrame.to_sql() now allows a single value as the SQL type for all columns (GH11886).
• Series.append now supports the ignore_index option (GH13677)
• .to_stata() and StataWriter can now write variable labels to Stata dta files using a dictionary to
map column names to labels (GH13535, GH13536)
• .to_stata() and StataWriter will automatically convert datetime64[ns] columns to Stata format
%tc, rather than raising a ValueError (GH12259)
• read_stata() and StataReader raise with a more explicit error message when reading Stata files with
repeated value labels when convert_categoricals=True (GH13923)
• DataFrame.style will now render sparsified MultiIndexes (GH11655)
• DataFrame.style will now show column level names (e.g. DataFrame.columns.names) (GH13775)
• DataFrame has gained support to re-order the columns based on the values in a row using df.
sort_values(by='...', axis=1) (GH10806)
In [75]: df = pd.DataFrame({'A': [2, 7], 'B': [3, 5], 'C': [4, 8]},
....: index=['row1', 'row2'])
....:
In [76]: df
Out[76]:
A B C
row1 2 3 4
row2 7 5 8
[2 rows x 3 columns]
[2 rows x 3 columns]
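The sorted output is not reproduced above; a hedged sketch on the frame just shown:
# Hedged sketch: re-order the columns by the values in 'row2' (5 < 7 < 8 gives B, A, C).
df.sort_values(by='row2', axis=1)
#       B  A  C
# row1  3  2  4
# row2  5  7  8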
• Added documentation to I/O regarding the perils of reading in columns with mixed dtypes and how
to handle it (GH13746)
• to_html() now has a border argument to control the value in the opening <table> tag. The default is
the value of the html.border option, which defaults to 1. This also affects the notebook HTML repr,
but since Jupyter’s CSS includes a border-width attribute, the visual effect is the same. (GH11563).
• Raise ImportError in the sql functions when sqlalchemy is not installed and a connection string is
used (GH11920).
• Compatibility with matplotlib 2.0. Older versions of pandas should also work with matplotlib 2.0
(GH13333)
• Timestamp, Period, DatetimeIndex, PeriodIndex and .dt accessor have gained a .is_leap_year
property to check whether the date belongs to a leap year. (GH13727)
• astype() will now accept a dict of column name to data types mapping as the dtype argument.
(GH12086)
• The pd.read_json and DataFrame.to_json has gained support for reading and writing json lines with
lines option see Line delimited json (GH9180)
• read_excel() now supports the true_values and false_values keyword arguments (GH13347)
• groupby() will now accept a scalar and a single-element list for specifying level on a non-MultiIndex
grouper. (GH13907)
• Non-convertible dates in an excel date column will be returned without conversion and the column will
be object dtype, rather than raising an exception (GH10001).
• pd.Timedelta(None) is now accepted and will return NaT, mirroring pd.Timestamp (GH13687)
• pd.read_stata() can now handle some format 111 files, which are produced by SAS when generating
Stata dta files (GH11526)
• Series and Index now support divmod which will return a tuple of series or indices. This behaves like
a standard binary operator with regards to broadcasting rules (GH14208).
API changes
Series.tolist() will now return Python types in the output, mimicking NumPy .tolist() behavior
(GH10904)
Previous behavior:
In [7]: type(s.tolist()[0])
Out[7]:
<class 'numpy.int64'>
New behavior:
In [79]: type(s.tolist()[0])
Out[79]: int
The following Series operators have been changed to make all operators consistent, including DataFrame
(GH1134, GH4581, GH13538)
• Series comparison operators now raise ValueError when index are different.
• Series logical operators align both index of left and right hand side.
Warning: Until 0.18.1, comparing Series with the same length would succeed even if the .index
are different (the result ignores .index). As of 0.19.0, this raises a ValueError to be more strict.
This section also describes how to keep previous behavior or align different indexes, using the flexible
comparison methods like .eq.
Arithmetic operators
In [82]: s1 + s2
Out[82]:
A 3.0
B 4.0
C NaN
D NaN
Length: 4, dtype: float64
[4 rows x 1 columns]
Comparison operators
Previous behavior:
In [1]: s1 == s2
Out[1]:
A False
B True
C False
dtype: bool
New behavior:
In [2]: s1 == s2
Out[2]:
ValueError: Can only compare identically-labeled Series objects
Note: To achieve the same result as previous versions (compare values based on locations ignoring .index),
compare both .values.
If you want to compare Series aligning its .index, see flexible comparison methods section below:
In [87]: s1.eq(s2)
Out[87]:
A False
B True
C False
D False
Length: 4, dtype: bool
Logical operators
Logical operators align both .index of left and right hand side.
New behavior (Series): both .index are kept (previously, only the left hand side index was kept):
In [90]: s1 & s2
Out[90]:
A True
B False
C False
D False
Length: 4, dtype: bool
Note: To achieve the same result as previous versions (compare values based on only left hand side index),
you can use reindex_like:
[4 rows x 1 columns]
Series flexible comparison methods like eq, ne, le, lt, ge and gt now align both indexes. Use these methods
if you want to compare two Series which have different indexes.
In [95]: s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
In [97]: s1.eq(s2)
Out[97]:
a False
b True
c False
d False
Length: 4, dtype: bool
In [98]: s1.ge(s2)
Out[98]:
a False
b True
c True
d False
Length: 4, dtype: bool
Previously, this worked the same as comparison operators (see above).
A Series will now correctly promote its dtype for assignment with incompatible values to the current dtype
(GH13234)
In [99]: s = pd.Series()
Previous behavior:
New behavior:
In [100]: s["a"] = pd.Timestamp("2016-01-01")
In [102]: s
Out[102]:
a 2016-01-01 00:00:00
b 3
Length: 2, dtype: object
In [103]: s.dtype
Out[103]: dtype('O')
.to_datetime() changes
Previously if .to_datetime() encountered mixed integers/floats and strings, but no datetimes with
errors='coerce' it would convert all to NaT.
Previous behavior:
Current behavior:
This will now convert integers/floats with the default unit of ns.
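Since the before/after outputs are not reproduced above, here is a hedged sketch of the change (the values
are illustrative):
# Hedged sketch: with errors='coerce', only the un-parseable string becomes NaT;
# the integer is now interpreted with the default unit of ns.
pd.to_datetime([1, 'foo'], errors='coerce')
# previously: DatetimeIndex(['NaT', 'NaT'], dtype='datetime64[ns]', freq=None)
# now:        DatetimeIndex(['1970-01-01 00:00:00.000000001', 'NaT'], dtype='datetime64[ns]', freq=None)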
Merging changes
Merging will now preserve the dtype of the join keys (GH8596)
In [106]: df1
Out[106]:
key v1
0 1 10
[1 rows x 2 columns]
In [108]: df2
Out[108]:
key v1
0 1 20
1 2 30
[2 rows x 2 columns]
Previous behavior:
New behavior:
We are able to preserve the join keys
In [109]: pd.merge(df1, df2, how='outer')
Out[109]:
key v1
0 1 10
1 1 20
2 2 30
[3 rows x 2 columns]
[2 rows x 3 columns]
.describe() changes
Percentile identifiers in the index of a .describe() output will now be rounded to the least precision that
keeps them distinct (GH13104)
Previous behavior:
The percentiles were rounded to at most one decimal place, which could raise ValueError for a data frame
if the percentiles were duplicated.
New behavior:
In [115]: s.describe(percentiles=[0.0001, 0.0005, 0.001, 0.999, 0.9995, 0.9999])
Out[115]:
count 5.000000
mean 2.000000
std 1.581139
min 0.000000
0.01% 0.000400
0.05% 0.002000
0.1% 0.004000
50% 2.000000
99.9% 3.996000
99.95% 3.998000
99.99% 3.999600
max 4.000000
Length: 12, dtype: float64
Period changes
PeriodIndex now has its own period dtype. The period dtype is a pandas extension dtype like category or
the timezone aware dtype (datetime64[ns, tz]) (GH13941). As a consequence of this change, PeriodIndex
no longer has an integer dtype:
Previous behavior:
In [2]: pi
Out[2]: PeriodIndex(['2016-08-01'], dtype='int64', freq='D')
In [4]: pi.dtype
Out[4]: dtype('int64')
New behavior:
In [117]: pi = pd.PeriodIndex(['2016-08-01'], freq='D')
In [118]: pi
Out[118]: PeriodIndex(['2016-08-01'], dtype='period[D]', freq='D')
In [119]: pd.api.types.is_integer_dtype(pi)
Out[119]: False
In [120]: pd.api.types.is_period_dtype(pi)
Out[120]: True
In [121]: pi.dtype
Out[121]: period[D]
In [122]: type(pi.dtype)
Out[122]: pandas.core.dtypes.dtypes.PeriodDtype
Previously, Period had its own Period('NaT') representation, different from pd.NaT. Now Period('NaT')
has been changed to return pd.NaT. (GH12759, GH13582)
Previous behavior:
New behavior:
These result in pd.NaT without providing freq option.
In [123]: pd.Period('NaT')
Out[123]: NaT
In [124]: pd.Period(None)
Out[124]: NaT
To be compatible with Period addition and subtraction, pd.NaT now supports addition and subtraction
with int. Previously it raised ValueError.
Previous behavior:
In [5]: pd.NaT + 1
...
ValueError: Cannot add integral value to Timestamp without freq.
New behavior:
In [125]: pd.NaT + 1
Out[125]: NaT
In [126]: pd.NaT - 1
Out[126]: NaT
.values is changed to return an array of Period objects, rather than an array of integers (GH13988).
Previous behavior:
New behavior:
In [128]: pi.values
Out[128]: array([Period('2011-01', 'M'), Period('2011-02', 'M')], dtype=object)
Addition and subtraction of the base Index type and of DatetimeIndex (not the numeric index types) pre-
viously performed set operations (set union and difference). This behavior was already deprecated since
0.15.0 (in favor of using the specific .union() and .difference() methods), and is now disabled. When pos-
sible, + and - are now used for element-wise operations, for example for concatenating strings or subtracting
datetimes (GH8227, GH14127).
Previous behavior:
New behavior: the same operation will now perform element-wise addition:
Note that numeric Index objects already performed element-wise operations. For example, the behavior of
adding two integer Indexes is unchanged. The base Index is now made consistent with this behavior.
Further, because of this change, it is now possible to subtract two DatetimeIndex objects resulting in a
TimedeltaIndex:
Previous behavior:
New behavior:
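The before/after outputs are not reproduced above; a hedged sketch of both new element-wise behaviors:
# Hedged sketch: + on object Indexes now concatenates element-wise ...
pd.Index(['a', 'b']) + pd.Index(['x', 'y'])
# -> Index(['ax', 'by'], dtype='object')

# ... and subtracting two DatetimeIndexes yields a TimedeltaIndex.
dti = pd.DatetimeIndex(['2016-01-02', '2016-01-03'])
dti - pd.DatetimeIndex(['2016-01-01', '2016-01-01'])
# -> TimedeltaIndex(['1 days', '2 days'], dtype='timedelta64[ns]', freq=None)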
Index.difference and Index.symmetric_difference will now, more consistently, treat NaN values as any
other values. (GH13514)
Previous behavior:
In [3]: idx1.difference(idx2)
Out[3]: Float64Index([nan, 2.0, 3.0], dtype='float64')
In [4]: idx1.symmetric_difference(idx2)
Out[4]: Float64Index([0.0, nan, 2.0, 3.0], dtype='float64')
New behavior:
In [134]: idx1.difference(idx2)
Out[134]: Float64Index([2.0, 3.0], dtype='float64')
In [135]: idx1.symmetric_difference(idx2)
Out[135]: Float64Index([0.0, 2.0, 3.0], dtype='float64')
Index.unique() now returns unique values as an Index of the appropriate dtype. (GH13395). Previously,
most Index classes returned np.ndarray, and DatetimeIndex, TimedeltaIndex and PeriodIndex returned
Index to keep metadata like timezone.
Previous behavior:
New behavior:
In [136]: pd.Index([1, 2, 3]).unique()
Out[136]: Int64Index([1, 2, 3], dtype='int64')
In [141]: midx
Out[141]:
MultiIndex([('a', 'foo'),
('b', 'bar')],
)
Previous behavior:
In [4]: midx.levels[0]
Out[4]: Index(['b', 'a', 'c'], dtype='object')
In [5]: midx.get_level_values(0)
Out[5]: Index(['a', 'b'], dtype='object')
New behavior:
In [143]: midx.get_level_values(0)
Out[143]: CategoricalIndex(['a', 'b'], categories=['b', 'a', 'c'], ordered=False,␣
,→dtype='category')
Previous behavior:
In [11]: df_grouped.index.levels[1]
Out[11]: Index(['b', 'a', 'c'], dtype='object', name='C')
In [12]: df_grouped.reset_index().dtypes
Out[12]:
A int64
C object
B float64
dtype: object
In [13]: df_set_idx.index.levels[1]
Out[13]: Index(['b', 'a', 'c'], dtype='object', name='C')
In [14]: df_set_idx.reset_index().dtypes
Out[14]:
A int64
C object
B int64
dtype: object
New behavior:
In [147]: df_grouped.index.levels[1]
Out[147]: CategoricalIndex(['b', 'a', 'c'], categories=['b', 'a', 'c'], ordered=False,␣
,→name='C', dtype='category')
In [148]: df_grouped.reset_index().dtypes
Out[148]:
A int64
C category
B float64
Length: 3, dtype: object
In [149]: df_set_idx.index.levels[1]
Out[149]: CategoricalIndex(['b', 'a', 'c'], categories=['b', 'a', 'c'], ordered=False,␣
,→name='C', dtype='category')
In [150]: df_set_idx.reset_index().dtypes
Out[150]:
A int64
C category
B int64
Length: 3, dtype: object
When read_csv() is called with chunksize=n and without specifying an index, each chunk used to have
an independently generated index from 0 to n-1. They are now given instead a progressive index, starting
from 0 for the first chunk, from n for the second, and so on, so that, when concatenated, they are identical
to the result of calling read_csv() without the chunksize= argument (GH12185).
In [151]: data = 'A,B\n0,1\n2,3\n4,5\n6,7'
Previous behavior:
New behavior:
[4 rows x 2 columns]
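A hedged sketch using the data string defined above: concatenating the chunks now reproduces the
un-chunked index.
# Hedged sketch: chunks now carry a progressive index (0..1, then 2..3).
from io import StringIO

pd.concat(pd.read_csv(StringIO(data), chunksize=2))
#    A  B
# 0  0  1
# 1  2  3
# 2  4  5
# 3  6  7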
Sparse Changes
These changes allow pandas to handle sparse data with more dtypes, and aim to make a smoother
experience with data handling.
Sparse data structures have gained enhanced support for int64 and bool dtypes (GH667, GH13849).
Previously, sparse data were float64 dtype by default, even if all inputs were of int or bool dtype. You
had to specify dtype explicitly to create sparse data with int64 dtype. Also, fill_value had to be specified
explicitly because the default was np.nan which doesn’t appear in int64 or bool data.
In [1]: pd.SparseArray([1, 2, 0, 0])
Out[1]:
[1.0, 2.0, 0.0, 0.0]
Fill: nan
IntIndex
Indices: array([0, 1, 2, 3], dtype=int32)
# specifying int64 dtype, but all values are stored in sp_values because
# fill_value default is np.nan
As of v0.19.0, sparse data keeps the input dtype, and uses more appropriate fill_value defaults (0 for
int64 dtype, False for bool dtype).
In [153]: pd.SparseArray([1, 2, 0, 0], dtype=np.int64)
Out[153]:
[1, 2, 0, 0]
Fill: 0
IntIndex
Indices: array([0, 1], dtype=int32)
• Sparse data structures can now preserve dtype after arithmetic ops (GH13848)
In [155]: s = pd.SparseSeries([0, 2, 0, 1], fill_value=0, dtype=np.int64)
In [156]: s.dtype
Out[156]: Sparse[int64, 0]
In [157]: s + 1
Out[157]:
0 1
1 3
2 1
3 2
Length: 4, dtype: Sparse[int64, 1]
BlockIndex
Block locations: array([1, 3], dtype=int32)
Block lengths: array([1, 1], dtype=int32)
• Sparse data structures now support astype to convert the internal dtype (GH13900)
In [158]: s = pd.SparseSeries([1., 0., 2., 0.], fill_value=0)
In [159]: s
Out[159]:
0 1.0
1 0.0
2 2.0
3 0.0
Length: 4, dtype: Sparse[float64, 0]
BlockIndex
Block locations: array([0, 2], dtype=int32)
Block lengths: array([1, 1], dtype=int32)
In [160]: s.astype(np.int64)
Out[160]:
0 1
1 0
2 2
3 0
Length: 4, dtype: Sparse[int64, 0]
BlockIndex
Block locations: array([0, 2], dtype=int32)
Block lengths: array([1, 1], dtype=int32)
astype fails if data contains values which cannot be converted to the specified dtype. Note that the limitation
is applied to fill_value, whose default is np.nan.
• Subclassed SparseDataFrame and SparseSeries now preserve class types when slicing or transposing.
(GH13787)
• SparseArray with bool dtype now supports logical (bool) operators (GH14000)
• Bug in SparseSeries with MultiIndex [] indexing may raise IndexError (GH13144)
• Bug in SparseSeries with MultiIndex [] indexing result may have normal Index (GH13144)
• Bug in SparseDataFrame in which axis=None did not default to axis=0 (GH13048)
• Bug in SparseSeries and SparseDataFrame creation with object dtype may raise TypeError
(GH11633)
• Bug in SparseDataFrame doesn’t respect passed SparseArray or SparseSeries ‘s dtype and
fill_value (GH13866)
• Bug in SparseArray and SparseSeries don’t apply ufunc to fill_value (GH13853)
• Bug in SparseSeries.abs incorrectly keeps negative fill_value (GH13853)
• Bug in single row slicing on multi-type SparseDataFrame s, types were previously forced to float
(GH13917)
Note: This change only affects 64 bit python running on Windows, and only affects relatively advanced
indexing operations
Methods such as Index.get_indexer that return an indexer array, coerce that array to a “platform int”, so
that it can be directly used in 3rd party library operations like numpy.take. Previously, a platform int was
defined as np.int_ which corresponds to a C integer, but the correct type, and what is being used now, is
np.intp, which corresponds to the C integer size that can hold a pointer (GH3033, GH13972).
These types are the same on many platforms, but for 64 bit python on Windows, np.int_ is 32 bits, and
np.intp is 64 bits. Changing this behavior improves performance for many operations on that platform.
Previous behavior:
New behavior:
• Timestamp.to_pydatetime will issue a UserWarning when warn=True, and the instance has a non-zero
number of nanoseconds, previously this would print a message to stdout (GH14101).
• Series.unique() with datetime and timezone now returns an array of Timestamp with timezone
(GH13565).
• Panel.to_sparse() will raise a NotImplementedError exception when called (GH13778).
• Index.reshape() will raise a NotImplementedError exception when called (GH12882).
Deprecations
• Series.reshape and Categorical.reshape have been deprecated and will be removed in a subsequent
release (GH12882, GH12882)
• PeriodIndex.to_datetime has been deprecated in favor of PeriodIndex.to_timestamp (GH8254)
• Timestamp.to_datetime has been deprecated in favor of Timestamp.to_pydatetime (GH8254)
• Index.to_datetime and DatetimeIndex.to_datetime have been deprecated in favor of pd.
to_datetime (GH8254)
• pandas.core.datetools module has been deprecated and will be removed in a subsequent release
(GH14094)
• SparseList has been deprecated and will be removed in a future version (GH13784)
• DataFrame.to_html() and DataFrame.to_latex() have dropped the colSpace parameter in favor of
col_space (GH13857)
• DataFrame.to_sql() has deprecated the flavor parameter, as it is superfluous when SQLAlchemy is
not installed (GH13611)
• Deprecated read_csv keywords:
– compact_ints and use_unsigned have been deprecated and will be removed in a future version
(GH13320)
– buffer_lines has been deprecated and will be removed in a future version (GH13360)
– as_recarray has been deprecated and will be removed in a future version (GH13373)
– skip_footer has been deprecated in favor of skipfooter and will be removed in a future version
(GH13349)
• top-level pd.ordered_merge() has been renamed to pd.merge_ordered() and the original name will
be removed in a future version (GH13358)
• Timestamp.offset property (and named arg in the constructor), has been deprecated in favor of freq
(GH12160)
• pd.tseries.util.pivot_annual is deprecated. Use pivot_table as alternative, an example is here
(GH736)
• pd.tseries.util.isleapyear has been deprecated and will be removed in a subsequent release.
Datetime-likes now have a .is_leap_year property (GH13727)
• Panel4D and PanelND constructors are deprecated and will be removed in a future version. The
recommended way to represent these types of n-dimensional data are with the xarray package. Pandas
provides a to_xarray() method to automate this conversion (GH13564).
• pandas.tseries.frequencies.get_standard_freq is deprecated. Use pandas.tseries.
frequencies.to_offset(freq).rule_code instead (GH13874)
• pandas.tseries.frequencies.to_offset’s freqstr keyword is deprecated in favor of freq
(GH13874)
• Categorical.from_array has been deprecated and will be removed in a future version (GH13854)
• pd.Categorical has dropped setting of the ordered attribute directly in favor of the set_ordered
method (GH13671)
• pd.Categorical has dropped the levels attribute in favor of categories (GH8376)
• DataFrame.to_sql() has dropped the mysql option for the flavor parameter (GH13611)
• Panel.shift() has dropped the lags parameter in favor of periods (GH14041)
• pd.Index has dropped the diff method in favor of difference (GH13669)
• pd.DataFrame has dropped the to_wide method in favor of to_panel (GH14039)
• Series.to_csv has dropped the nanRep parameter in favor of na_rep (GH13804)
• Series.xs, DataFrame.xs, Panel.xs, Panel.major_xs, and Panel.minor_xs have dropped the copy
parameter (GH13781)
• str.split has dropped the return_type parameter in favor of expand (GH13701)
• Removal of the legacy time rules (offset aliases), deprecated since 0.17.0 (these have been aliases since 0.8.0) (GH13590, GH13868). Legacy time rules now raise ValueError. For the list of currently supported offsets, see here.
• The default value for the return_type parameter for DataFrame.plot.box and DataFrame.boxplot
changed from None to "axes". These methods will now return a matplotlib axes by default instead of
a dictionary of artists. See here (GH6581).
• The tquery and uquery functions in the pandas.io.sql module are removed (GH5950).
Performance improvements
Bug fixes
• Bug in groupby().shift(), which could cause a segfault or corruption in rare circumstances when
grouping by columns with missing values (GH13813)
• Bug in groupby().cumsum() calculating cumprod when axis=1. (GH13994)
• Bug in pd.to_timedelta() in which the errors parameter was not being respected (GH13613)
• Bug in io.json.json_normalize(), where non-ascii keys raised an exception (GH13213)
• Bug when passing a not-default-indexed Series as xerr or yerr in .plot() (GH11858)
• Bug where area plots drew the legend incorrectly if subplot is enabled or the legend is moved after plotting (matplotlib 1.5.0 is required to draw area plot legends properly) (GH9161, GH13544)
• Bug in DataFrame assignment with an object-dtyped Index where the resultant column is mutable to
the original object. (GH13522)
• Bug in matplotlib AutoDataFormatter; this restores the second scaled formatting and re-adds microsecond scaled formatting (GH13131)
• Bug in selection from an HDFStore with a fixed format and start and/or stop specified; these now return the selected range (GH8287)
• Bug in Categorical.from_codes() where an unhelpful error was raised when an invalid ordered
parameter was passed in (GH14058)
• Bug in Series construction from a tuple of integers on Windows not returning the default dtype (int64) (GH13646)
• Bug in TimedeltaIndex addition with a Datetime-like object where addition overflow was not being
caught (GH14068)
• Bug in .groupby(..).resample(..) when the same object is called multiple times (GH13174)
• Bug in .to_records() when index name is a unicode string (GH13172)
• Bug in calling .memory_usage() on object which doesn’t implement (GH12924)
• Regression in Series.quantile with nans (also shows up in .median() and .describe()); furthermore now names the Series with the quantile (GH13098, GH13146)
• Bug in SeriesGroupBy.transform with datetime values and missing groups (GH13191)
• Bug where empty Series were incorrectly coerced in datetime-like numeric operations (GH13844)
• Bug in Categorical constructor when passed a Categorical containing datetimes with timezones
(GH14190)
• Bug in Series.str.extractall() with str index raises ValueError (GH13156)
• Bug in Series.str.extractall() with single group and quantifier (GH13382)
• Bug in DatetimeIndex and Period subtraction raises ValueError or AttributeError rather than
TypeError (GH13078)
• Bug in Index and Series created with NaN and NaT mixed data may not have datetime64 dtype
(GH13324)
• Bug in Index and Series may ignore np.datetime64('nat') and np.timedelta64('nat') to infer dtype (GH13324)
• Bug in PeriodIndex and Period subtraction raises AttributeError (GH13071)
• Bug in PeriodIndex construction returning a float64 index in some circumstances (GH13067)
• Bug in .resample(..) with a PeriodIndex not changing its freq appropriately when empty
(GH13067)
• Bug in .resample(..) with a PeriodIndex not retaining its type or name with an empty DataFrame
appropriately when empty (GH13212)
• Bug in groupby(..).apply(..) when the passed function returns scalar values per group (GH13468).
• Bug in groupby(..).resample(..) where passing some keywords would raise an exception (GH13235)
• Bug in .tz_convert on a tz-aware DateTimeIndex that relied on index being sorted for correct results
(GH13306)
• Bug in .tz_localize with dateutil.tz.tzlocal may return incorrect result (GH13583)
• Bug in DatetimeTZDtype dtype with dateutil.tz.tzlocal cannot be regarded as valid dtype
(GH13583)
• Bug in pd.read_hdf() where attempting to load an HDF file with a single dataset, that had one or
more categorical columns, failed unless the key argument was set to the name of the dataset. (GH13231)
• Bug in .rolling() that allowed a negative integer window in construction of the Rolling() object,
but would later fail on aggregation (GH13383)
• Bug in Series indexing with tuple-valued data and a numeric index (GH13509)
• Bug in printing pd.DataFrame where unusual elements with the object dtype were causing segfaults
(GH13717)
• Bug in ranking Series which could result in segfaults (GH13445)
• Bug in various index types, which did not propagate the name of passed index (GH12309)
• Bug in DatetimeIndex, which did not honour the copy=True (GH13205)
• Bug in DatetimeIndex.is_normalized returns incorrectly for normalized date_range in case of local
timezones (GH13459)
• Bug in pd.concat and .append may coerce datetime64 and timedelta to object dtype containing python built-in datetime or timedelta rather than Timestamp or Timedelta (GH13626)
• Bug in PeriodIndex.append may raise AttributeError when the result is object dtype (GH13221)
• Bug in CategoricalIndex.append may accept a normal list (GH13626)
• Bug in pd.concat and .append where data with the same timezone got reset to UTC (GH7795)
• Bug in Series and DataFrame .append raises AmbiguousTimeError if data contains datetime near
DST boundary (GH13626)
• Bug in DataFrame.to_csv() in which float values were being quoted even though quotations were
specified for non-numeric values only (GH12922, GH13259)
• Bug in DataFrame.describe() raising ValueError with only boolean columns (GH13898)
• Bug in MultiIndex slicing where extra elements were returned when level is non-unique (GH12896)
• Bug in .str.replace does not raise TypeError for invalid replacement (GH13438)
• Bug in MultiIndex.from_arrays which didn’t check for input array lengths matching (GH13599)
• Bug in cartesian_product and MultiIndex.from_product which may raise with empty input arrays
(GH12258)
• Bug in pd.read_csv() which may cause a segfault or corruption when iterating in large chunks over
a stream/file under rare circumstances (GH13703)
• Bug in pd.read_csv() which caused errors to be raised when a dictionary containing scalars is passed
in for na_values (GH12224)
• Bug in pd.read_csv() which caused BOM files to be incorrectly parsed by not ignoring the BOM
(GH4793)
• Bug in pd.read_csv() with engine='python' which raised errors when a numpy array was passed in
for usecols (GH12546)
• Bug in pd.read_csv() where the index columns were being incorrectly parsed when parsed as dates
with a thousands parameter (GH14066)
• Bug in pd.read_csv() with engine='python' in which NaN values weren’t being detected after data
was converted to numeric values (GH13314)
• Bug in pd.read_csv() in which the nrows argument was not properly validated for both engines
(GH10476)
• Bug in pd.read_csv() with engine='python' in which infinities of mixed-case forms were not being
interpreted properly (GH13274)
• Bug in pd.read_csv() with engine='python' in which trailing NaN values were not being parsed
(GH13320)
• Bug in pd.read_csv() with engine='python' when reading from a tempfile.TemporaryFile on
Windows with Python 3 (GH13398)
• Bug in pd.read_csv() that prevents usecols kwarg from accepting single-byte unicode strings
(GH13219)
• Bug in pd.read_csv() that prevents usecols from being an empty set (GH13402)
• Bug in pd.read_csv() in the C engine where the NULL character was not being parsed as NULL
(GH14012)
• Bug in pd.read_csv() with engine='c' in which NULL quotechar was not accepted even though
quoting was specified as None (GH13411)
• Bug in pd.read_csv() with engine='c' in which fields were not properly cast to float when quoting
was specified as non-numeric (GH13411)
• Bug in pd.read_csv() in Python 2.x with non-UTF8 encoded, multi-character separated data
(GH3404)
• Bug in pd.read_csv(), where aliases for utf-xx (e.g. UTF-xx, UTF_xx, utf_xx) raised UnicodeDecodeError (GH13549)
• Bug in pd.read_csv, pd.read_table, pd.read_fwf, pd.read_stata and pd.read_sas where files
were opened by parsers but not closed if both chunksize and iterator were None. (GH13940)
• Bug in StataReader, StataWriter, XportReader and SAS7BDATReader where a file was not properly
closed when an error was raised. (GH13940)
• Bug in pd.pivot_table() where margins_name is ignored when aggfunc is a list (GH13354)
• Bug in pd.Series.str.zfill, center, ljust, rjust, and pad, which did not raise TypeError when passed non-integers (GH13598)
• Bug in checking for any null objects in a TimedeltaIndex, which always returned True (GH13603)
• Bug in Series arithmetic raises TypeError if it contains datetime-like as object dtype (GH13043)
• Bug where Series.isnull() and Series.notnull() ignore Period('NaT') (GH13737)
• Bug where Series.fillna() and Series.dropna() do not affect Period('NaT') (GH13737)
• Bug in Index raises OutOfBoundsDatetime if datetime exceeds datetime64[ns] bounds, rather than
coercing to object dtype (GH13663)
• Bug in Index may ignore specified datetime64 or timedelta64 passed as dtype (GH13981)
• Bug where RangeIndex could be created with no arguments rather than raising TypeError (GH13793)
• Bug in .value_counts() raises OutOfBoundsDatetime if data exceeds datetime64[ns] bounds
(GH13663)
• Bug in DatetimeIndex may raise OutOfBoundsDatetime if the input np.datetime64 has a unit other than ns (GH9114)
• Bug in Series creation with np.datetime64 which has a unit other than ns as object dtype resulting in incorrect values (GH13876)
• Bug in resample with timedelta data where data was cast to float (GH13119).
• Bug in pd.isnull() and pd.notnull() raising TypeError if the input datetime-like has a unit other than ns (GH13389)
• Bug in pd.merge() may raise TypeError if the input datetime-like has a unit other than ns (GH13389)
• Bug in HDFStore/read_hdf() discarded DatetimeIndex.name if tz was set (GH13884)
• Bug in Categorical.remove_unused_categories() changes .codes dtype to platform int (GH13261)
• Bug in groupby with as_index=False returning all NaNs when grouping on multiple columns including a categorical one (GH13204)
• Bug in df.groupby(...)[...] where getitem with Int64Index raised an error (GH13731)
• Bug in the CSS classes assigned to DataFrame.style for index names. Previously they were assigned
"col_heading level<n> col<c>" where n was the number of levels + 1. Now they are assigned
"index_name level<n>", where n is the correct level for that MultiIndex.
• Bug where pd.read_gbq() could throw ImportError: No module named discovery as a result of a
naming conflict with another python package called apiclient (GH13454)
• Bug in Index.union returns an incorrect result with a named empty index (GH13432)
• Bugs in Index.difference and DataFrame.join raise in Python3 when using mixed-integer indexes
(GH13432, GH12814)
• Bug in subtracting tz-aware datetime.datetime from a tz-aware datetime64 Series (GH14088)
• Bug in .to_excel() when DataFrame contains a MultiIndex which contains a label with a NaN value
(GH13511)
• Bug where invalid frequency offset strings like “D1” and “-2-3H” may not raise ValueError (GH13930)
• Bug in concat and groupby for hierarchical frames with RangeIndex levels (GH13542).
• Bug in Series.str.contains() for Series containing only NaN values of object dtype (GH14171)
• Bug in agg() function on groupby dataframe changes dtype of datetime64[ns] column to float64
(GH12821)
• Bug in using a NumPy ufunc with PeriodIndex to add or subtract an integer, raising IncompatibleFrequency. Note that using standard operators like + or - is recommended, because they use a more efficient path (GH13980)
• Bug in operations on NaT returning float instead of datetime64[ns] (GH12941)
• Bug in Series flexible arithmetic methods (like .add()) raises ValueError when axis=None
(GH13894)
• Bug in DataFrame.to_csv() with MultiIndex columns in which a stray empty line was added
(GH6618)
• Bug in DatetimeIndex, TimedeltaIndex and PeriodIndex.equals() may return True when input
isn’t Index but contains the same values (GH13107)
• Bug in assignment against datetime with timezone may not work if it contains datetime near DST
boundary (GH14146)
• Bug in pd.eval() and HDFStore query truncating long float literals with python 2 (GH14241)
• Bug in Index raises KeyError displaying incorrect column when column is not in the df and columns
contains duplicate values (GH13822)
• Bug in Period and PeriodIndex creating wrong dates when frequency has combined offset aliases
(GH13874)
• Bug in .to_string() when called with an integer line_width and index=False raising an UnboundLocalError exception because idx was referenced before assignment.
• Bug in eval() where the resolvers argument would not accept a list (GH14095)
• Bugs in stack, get_dummies, make_axis_dummies which don’t preserve categorical dtypes in
(multi)indexes (GH13854)
• PeriodIndex can now accept list and array which contains pd.NaT (GH13430)
• Bug in df.groupby where .median() returns arbitrary values if grouped dataframe contains empty
bins (GH13629)
• Bug in Index.copy() where name parameter was ignored (GH14302)
Contributors
A total of 117 people contributed patches to this release. People with a “+” by their names contributed a
patch for the first time.
• Adrien Emery +
• Alex Alekseyev
• Alex Vig +
• Allen Riddell +
• Amol +
• Amol Agrawal +
• Andy R. Terrel +
• Anthonios Partheniou
• Ben Kandel +
• Bob Baxley +
• Brett Rosen +
• Camilo Cota +
• Chris
• Chris Grinolds
• Chris Warth
• Christian Hudon
• Christopher C. Aycock
• Daniel Siladji +
• Douglas McNeil
• Drewrey Lupton +
• Eduardo Blancas Reyes +
• Elliot Marsden +
• Evan Wright
• Felix Marczinowski +
• Francis T. O’Donovan
• Geraint Duck +
• Giacomo Ferroni +
• Grant Roch +
• Gábor Lipták
• Haleemur Ali +
• Hassan Shamim +
• Iulius Curt +
• Ivan Nazarov +
• Jeff Reback
• Jeffrey Gerard +
• Jenn Olsen +
• Jim Crist
• Joe Jevnik
• John Evans +
• John Freeman
• John Liekezer +
• John W. O’Brien
• John Zwinck +
• Johnny Gill +
• Jordan Erenrich +
• Joris Van den Bossche
• Josh Howes +
• Jozef Brandys +
• Ka Wo Chen
• Kamil Sindi +
• Kerby Shedden
• Kernc +
• Kevin Sheppard
• Matthieu Brucher +
• Maximilian Roos
• Michael Scherer +
• Mike Graham +
• Mortada Mehyar
• Muhammad Haseeb Tariq +
• Nate George +
• Neil Parley +
• Nicolas Bonnotte
• OXPHOS
• Pan Deng / Zora +
• Paul +
• Paul Mestemaker +
• Pauli Virtanen
• Pawel Kordek +
• Pietro Battiston
• Piotr Jucha +
• Ravi Kumar Nimmi +
• Robert Gieseke
• Robert Kern +
• Roger Thomas
• Roy Keyes +
• Russell Smith +
• Sahil Dua +
• Sanjiv Lobo +
• Sašo Stanovnik +
• Shawn Heide +
• Sinhrks
• Stephen Kappel +
• Steve Choi +
• Stewart Henderson +
• Sudarshan Konge +
• Thomas A Caswell
• Tom Augspurger
• Tom Bird +
• Uwe Hoffmann +
• WillAyd +
• Xiang Zhang +
• YG-Riku +
• Yadunandan +
• Yaroslav Halchenko
• Yuichiro Kaneko +
• adneu
• agraboso +
• babakkeyvani +
• c123w +
• chris-b1
• cmazzullo +
• conquistador1492 +
• cr3 +
• dsm054
• gfyoung
• harshul1610 +
• iamsimha +
• jackieleng +
• mpuels +
• pijucha +
• priyankjain +
• sinhrks
• wcwagner +
• yui-knk +
• zhangjinjie +
• znmean +
• 颜发才(Yan Facai) +
This is a minor bug-fix release from 0.18.0 and includes a large number of bug fixes along with several
new features, enhancements, and performance improvements. We recommend that all users upgrade to this
version.
Highlights include:
• .groupby(...) has been enhanced to provide convenient syntax when working with .rolling(..),
.expanding(..) and .resample(..) per group, see here
• pd.to_datetime() has gained the ability to assemble dates from a DataFrame, see here (a brief sketch follows this list)
• Method chaining improvements, see here.
• Custom business hour offset, see here.
• Many bug fixes in the handling of sparse, see here
• Expanded the Tutorials section with a feature on modern pandas, courtesy of @TomAugspurger (GH13045).
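As a brief sketch of the datetime-assembly highlight above (the year/month/day column names are the ones pd.to_datetime() recognizes; the exact frame is illustrative only):

import pandas as pd

# Assemble a datetime Series from separate component columns.
df = pd.DataFrame({'year': [2015, 2016],
                   'month': [2, 3],
                   'day': [4, 5]})
print(pd.to_datetime(df))  # 2015-02-04 and 2016-03-05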
• New features
– Custom business hour
– .groupby(..) syntax with window and resample operations
– Method chaining improvements
* .where() and .mask()
* .loc[], .iloc[], .ix[]
* [] indexing
– Partial string indexing on DateTimeIndex when part of a MultiIndex
– Assembling datetimes
– Other enhancements
• Sparse changes
• API changes
– .groupby(..).nth() changes
– numpy function compatibility
– Using .apply on groupby resampling
– Changes in read_csv exceptions
– to_datetime error changes
– Other API changes
– Deprecations
• Performance improvements
• Bug fixes
• Contributors
New features
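The In [6] and In [7] calls below assume an offset named bhour_us and a starting timestamp dt. A hypothetical setup consistent with the custom business hour feature being highlighted (the original note built the offset from CustomBusinessHour with the US federal holiday calendar; the exact construction, and therefore the exact outputs, may differ):

import datetime

from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessHour

# Business-hour offset that skips dates in the US federal holiday calendar.
bhour_us = CustomBusinessHour(calendar=USFederalHolidayCalendar())

# Friday, 3 pm, before the Martin Luther King, Jr. Day weekend.
dt = datetime.datetime(2014, 1, 17, 15)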
In [6]: dt + bhour_us
Out[6]: Timestamp('2014-01-17 16:00:00')
In [7]: dt + bhour_us * 2
Out[7]: Timestamp('2014-01-20 09:00:00')
.groupby(...) has been enhanced to provide convenient syntax when working with .rolling(..), .expanding(..) and .resample(..) per group (GH12486, GH12738).
You can now use .rolling(..) and .expanding(..) as methods on groupbys. These return another
deferred object (similar to what .rolling() and .expanding() do on ungrouped pandas objects). You can
then operate on these RollingGroupby objects in a similar manner.
Previously you would have to do this to get a rolling window mean per-group:
In [9]: df
Out[9]:
A B
0 1 0
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
7 1 7
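A sketch of that older pattern (assuming the frame shown above, where column A holds the group keys and B the values):

# Pre-0.18.1 pattern: run the rolling mean inside .apply() per group.
df.groupby('A').apply(lambda x: x.rolling(4).B.mean())

With the enhancement, the same result can be written directly with the grouped rolling syntax shown in In [11] below.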
In [11]: df.groupby('A').rolling(4).B.mean()
Out[11]:
A
1 0 NaN
1 NaN
2 NaN
3 1.5
4 2.5
5 3.5
6 4.5
7 5.5
8 6.5
9 7.5
10 8.5
11 9.5
12 10.5
13 11.5
14 12.5
15 13.5
16 14.5
17 15.5
In [13]: df
Out[13]:
group val
date
2016-01-03 1 5
2016-01-10 1 6
2016-01-17 2 7
2016-01-24 2 8
[4 rows x 2 columns]
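One way to construct a frame matching the display above (a sketch; freq='W' defaults to weeks ending on Sunday, which gives the dates shown):

import pandas as pd

# Weekly observations for two groups, indexed by date.
df = pd.DataFrame({'date': pd.date_range(start='2016-01-01', periods=4, freq='W'),
                   'group': [1, 1, 2, 2],
                   'val': [5, 6, 7, 8]}).set_index('date')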
In [15]: df.groupby('group').resample('1D').ffill()
Out[15]:
group val
group date
1 2016-01-03 1 5
2016-01-04 1 5
2016-01-05 1 5
2016-01-06 1 5
2016-01-07 1 5
2016-01-08 1 5
2016-01-09 1 5
2016-01-10 1 6
2 2016-01-17 2 7
2016-01-18 2 7
2016-01-19 2 7
2016-01-20 2 7
2016-01-21 2 7