Pandas: Powerful Python Data Analysis Toolkit
Release 0.7.1
Wes McKinney
CONTENTS

1 What's New
  1.1 v.0.7.1 (February 29, 2012)
  1.2 v.0.7.0 (February 9, 2012)
  1.3 v.0.6.1 (December 13, 2011)
  1.4 v.0.6.0 (November 25, 2011)
  1.5 v.0.5.0 (October 24, 2011)
  1.6 v.0.4.3 through v0.4.1 (September 25 - October 9, 2011)

2 Installation
  2.1 Python version support
  2.2 Binary installers
  2.3 Dependencies
  2.4 Optional dependencies
  2.5 Installing from source
  2.6 Running the test suite

3 Frequently Asked Questions (FAQ)

4 Package overview
  4.1 Data structures at a glance
  4.2 Mutability and copying of data
  4.3 Getting Support
  4.4 Credits
  4.5 Development Team
  4.6 License

5 Intro to Data Structures
  5.1 Series
  5.2 DataFrame
  5.3 Panel

6 Essential basic functionality
  6.1 Head and Tail
  6.2 Attributes and the raw ndarray(s)
  6.3 Flexible binary operations
  6.4 Descriptive statistics
  6.5 Function application
  6.6 Reindexing and altering labels
  6.7 Iteration
  6.8 Sorting by index and value
  6.9 Copying, type casting
  6.10 Pickling and serialization
  6.11 Console Output Formatting

7 Indexing and selecting data
  7.1 Basics
  7.2 Advanced indexing with labels
  7.3 Index objects
  7.4 Hierarchical indexing (MultiIndex)
  7.5 Adding an index to an existing DataFrame
  7.6 Indexing internal details

8 Computational tools
  8.1 Statistical functions
  8.2 Moving (rolling) statistics / moments
  8.3 Exponentially weighted moment functions
  8.4 Linear and panel regression

9 Working with missing data
  9.1 Missing data basics
  9.2 Calculations with missing data
  9.3 Cleaning / filling missing data
  9.4 Missing data casting rules and indexing

10 Group By: split-apply-combine
  10.1 Splitting an object into groups
  10.2 Iterating through groups
  10.3 Aggregation
  10.4 Transformation
  10.5 Dispatching to instance methods
  10.6 Flexible apply
  10.7 Other useful features

11 Merge, join, and concatenate
  11.1 Concatenating objects
  11.2 Database-style DataFrame joining/merging

12 Reshaping and Pivot Tables
  12.1 Reshaping by pivoting DataFrame objects
  12.2 Reshaping by stacking and unstacking
  12.3 Reshaping by Melt
  12.4 Combining with stats and GroupBy
  12.5 Pivot tables and cross-tabulations

13 Time Series / Date functionality
  13.1 DateOffset objects
  13.2 Generating date ranges (DateRange)
  13.3 Time series-related instance methods
  13.4 Up- and downsampling

14 Plotting with matplotlib
  14.1 Basic plotting: plot
  14.2 Other plotting features

15 IO Tools (Text, CSV, HDF5, ...)

16 Sparse data structures
  16.1 SparseArray
  16.2 SparseList
  16.3 SparseIndex objects

17 Caveats and Gotchas
  17.1 NaN, Integer NA values and NA type promotions
  17.2 Integer indexing
  17.3 Label-based slicing conventions
  17.4 Miscellaneous indexing gotchas

18 rpy2 / R interface
  18.1 Transferring R data sets into Python
  18.2 Calling R functions with pandas objects
  18.3 High-level interface to R estimators

19 Related Python libraries
  19.1 la (larry)
  19.2 scikits.statsmodels
  19.3 scikits.timeseries

20 Comparison with R / R libraries
  20.1 data.frame
  20.2 zoo
  20.3 xts
  20.4 plyr
  20.5 reshape / reshape2

21 API Reference
  21.1 General functions
  21.2 Series
  21.3 DataFrame
  21.4 Panel

Python Module Index

Index
PDF Version

Date: February 29, 2012
Version: 0.7.1
Binary Installers: http://pypi.python.org/pypi/pandas
Source Repository: http://github.com/pydata/pandas
Issues & Ideas: https://github.com/pydata/pandas/issues
Q&A Support: http://stackoverflow.com/questions/tagged/pandas
Developer Mailing List: http://groups.google.com/group/pystatsmodels

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with relational or labeled data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

pandas is well suited for many different kinds of data:

- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
- Ordered and unordered (not necessarily fixed-frequency) time series data
- Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
- Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For R users, DataFrame provides everything that R's data.frame provides and much more. pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.

Here are just a few of the things that pandas does well:

- Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
- Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
- Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
- Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
- Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
- Intuitive merging and joining data sets
- Flexible reshaping and pivoting of data sets
- Hierarchical labeling of axes (possible to have multiple labels per tick)
- Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDF5 format
- Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.

Many of these principles are here to address the shortcomings frequently experienced using other languages / scientific research environments. For data scientists, working with data is typically divided into multiple stages: munging and
cleaning data, analyzing / modeling it, then organizing the results of the analysis into a form suitable for plotting or tabular display. pandas is the ideal tool for all of these tasks.

Some other notes:

- pandas is fast. Many of the low-level algorithmic bits have been extensively tweaked in Cython code. However, as with anything else generalization usually sacrifices performance. So if you focus on one feature for your application you may be able to create a faster specialized tool.
- pandas will soon become a dependency of statsmodels, making it an important part of the statistical computing ecosystem in Python.
- pandas has been used extensively in production in financial applications.

Note: This documentation assumes general familiarity with NumPy. If you haven't used NumPy much or at all, do invest some time in learning about NumPy first.

See the package overview for more detail about what's in the library.
CHAPTER
ONE
WHAT'S NEW
These are new features and improvements of note in each release.
- New unified concatenation function for concatenating Series, DataFrame or Panel objects along an axis. Can form union or intersection of the other axes. Improves performance of Series.append and DataFrame.append (GH468, GH479, GH273)
- Can pass multiple DataFrames to DataFrame.append to concatenate (stack) and multiple Series to Series.append too
- Can pass list of dicts (e.g., a list of JSON objects) to DataFrame constructor (GH526)
- You can now set multiple columns in a DataFrame via __getitem__, useful for transformation (GH342)
- Handle differently-indexed output values in DataFrame.apply (GH498)
In [902]: df = DataFrame(randn(10, 4))

In [903]: df.apply(lambda x: x.describe())
Out[903]:
               0          1          2          3
count  10.000000  10.000000  10.000000  10.000000
mean   -0.556258  -0.268695  -0.215066   0.073787
std     0.775352   1.072879   1.431537   0.962624
min    -1.696646  -2.251671  -1.739074  -1.545595
25%    -1.234969  -0.734632  -0.901618  -0.384952
50%    -0.340199  -0.311720  -0.555855   0.001073
75%    -0.037106   0.451343  -0.072265   0.867466
max     0.401255   1.199708   2.997692   1.357252
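As a quick illustration of the new concat function described above, here is a minimal sketch (the frames df1 and df2 are hypothetical placeholders):

import numpy as np
from pandas import DataFrame, concat

df1 = DataFrame(np.random.randn(3, 2), columns=['a', 'b'])
df2 = DataFrame(np.random.randn(3, 2), columns=['a', 'b'])

concat([df1, df2])          # stack along the index (rows)
concat([df1, df2], axis=1)  # glue along the columns instead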
- Add reorder_levels method to Series and DataFrame (PR534)
- Add dict-like get function to DataFrame and Panel (PR521)
- Add DataFrame.iterrows method for efficiently iterating through the rows of a DataFrame
- Add DataFrame.to_panel with code adapted from LongPanel.to_long
- Add reindex_axis method to DataFrame
- Add level option to binary arithmetic functions on DataFrame and Series
- Add level option to the reindex and align methods on Series and DataFrame for broadcasting values across a level (GH542, PR552, others)
- Add attribute-based item access to Panel and add IPython completion (PR563)
- Add logy option to Series.plot for log-scaling on the Y axis
- Add index and header options to DataFrame.to_string
- Can pass multiple DataFrames to DataFrame.join to join on index (GH115)
- Can pass multiple Panels to Panel.join (GH115)
- Added justify argument to DataFrame.to_string to allow different alignment of column headers
- Add sort option to GroupBy to allow disabling sorting of the group keys for potential speedups (GH595)
- Can pass MaskedArray to Series constructor (PR563)
- Add Panel item access via attributes and IPython completion (GH554)
- Implement DataFrame.lookup, fancy-indexing analogue for retrieving values given a sequence of row and column labels (GH338)
- Can pass a list of functions to aggregate with groupby on a DataFrame, yielding an aggregated result with hierarchical columns (GH166)
- Can call cummin and cummax on Series and DataFrame to get cumulative minimum and maximum, respectively (GH647)
- value_range added as utility function to get min and max of a dataframe (GH288)
- Added encoding argument to read_csv, read_table, to_csv and from_csv for non-ascii text (GH717)
- Added abs method to pandas objects
- Added crosstab function for easily computing frequency tables
- Added isin method to index objects
- Added level argument to xs method of DataFrame
This is all exactly identical to the behavior before. However, if you ask for a key not contained in the Series, in versions 0.6.1 and prior, Series would fall back on a location-based lookup. This now raises a KeyError:
In [2]: s[1]
KeyError: 1
6    -0.26598  -2.4184  -0.2658   0.11503
8    -0.58776   0.3144  -0.8566   0.61941
10    0.10940  -0.7175  -1.0108   0.47990
12   -1.16919  -0.3087  -0.6049  -0.43544
14   -0.07337   0.3410   0.0424  -0.16037
In order to support purely integer-based indexing, the following methods have been added:

Method                       Description
Series.iget_value(i)         Retrieve value stored at location i
Series.iget(i)               Alias for iget_value
DataFrame.irow(i)            Retrieve the i-th row
DataFrame.icol(j)            Retrieve the j-th column
DataFrame.iget_value(i, j)   Retrieve the value at row i and column j
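A minimal sketch of these accessors, assuming a Series s and a DataFrame df:

s.iget(1)            # the second value, regardless of the index labels
df.irow(0)           # the first row, as a Series
df.icol(0)           # the first column, as a Series
df.iget_value(0, 1)  # the scalar at positional row 0, column 1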
If the index had been sorted, the range selection would have been possible:
In [912]: s2 = s.sort_index()

In [913]: s2
Out[913]:
a   -1.018111
c    0.787672
e    0.790509
g    1.109413
k   -0.359977
In the case of integer indexes, the behavior will be exactly as before (shadowing ndarray):
In [920]: s = Series(randn(6), index=range(0, 12, 2))

In [921]: s[[4, 0, 2]]
Out[921]:
4   -0.553988
0    0.394646
2   -2.230633

In [922]: s[1:5]
If you wish to do indexing with sequences and slicing on an integer index with label semantics, use ix.
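For instance, a minimal sketch using the integer-indexed s from above (ix treats these values as labels, not positions):

s.ix[[4, 0, 2]]  # label-based: selects the index labels 4, 0 and 2
s.ix[2:6]        # label-based slice over the index values 2 through 6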
- Added skip_footer (GH291) and converters (GH343) options to read_csv and read_table
- Added drop_duplicates and duplicated functions for removing duplicate DataFrame rows and checking for duplicate rows, respectively (GH319)
- Implemented operators &, |, ^, - on DataFrame (GH347)
- Added Series.mad, mean absolute deviation
- Added QuarterEnd DateOffset (PR321)
- Added dot to DataFrame (GH65)
- Added orient option to Panel.from_dict (GH359, GH301)
- Added orient option to DataFrame.from_dict
- Added passing list of tuples or list of lists to DataFrame.from_records (GH357)
- Added multiple levels to groupby (GH103)
- Allow multiple columns in by argument of DataFrame.sort_index (GH92, PR362)
- Added fast get_value and put_value methods to DataFrame (GH360)
- Added cov instance methods to Series and DataFrame (GH194, PR362)
- Added kind='bar' option to DataFrame.plot (PR348)
- Added idxmin and idxmax to Series and DataFrame (PR286)
- Added read_clipboard function to parse DataFrame from clipboard (GH300)
- Added nunique function to Series for counting unique elements (GH297)
- Made DataFrame constructor use Series name if no columns passed (GH373)
- Support regular expressions in read_table/read_csv (GH364)
- Added DataFrame.to_html for writing DataFrame to HTML (PR387)
- Added support for MaskedArray data in DataFrame, masked values converted to NaN (PR396)
- Added DataFrame.boxplot function (GH368)
- Can pass extra args, kwds to DataFrame.apply (GH376)
- Implement DataFrame.join with vector on argument (GH312)
- Added legend boolean flag to DataFrame.plot (GH324)
- Can pass multiple levels to stack and unstack (GH370)
- Can pass multiple values columns to pivot_table (GH381)
- Use Series name in GroupBy for result index (GH363)
- Added raw option to DataFrame.apply for performance if only need ndarray (GH309)
- Added proper, tested weighted least squares to standard and panel OLS (GH303)
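As one example from this list, a minimal sketch of duplicated and drop_duplicates (the frame is illustrative):

from pandas import DataFrame

df = DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']})
df.duplicated()       # boolean Series marking rows that repeat an earlier row
df.drop_duplicates()  # the same frame with those repeated rows dropped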
- VBENCH Improved performance of MultiIndex.from_tuples
- VBENCH Special Cython matrix iterator for applying arbitrary reduction operations
- VBENCH + DOCUMENT Add raw option to DataFrame.apply for getting better performance when only an ndarray is needed
- VBENCH Faster cythonized count by level in Series and DataFrame (GH341)
- VBENCH? Significant GroupBy performance enhancement with multiple keys with many empty combinations
- VBENCH New Cython vectorized function map_infer speeds up Series.apply and Series.map significantly when passed elementwise Python function, motivated by (PR355)
- VBENCH Significantly improved performance of Series.order, which also makes np.unique called on a Series faster (GH327)
- VBENCH Vastly improved performance of GroupBy on axes with a MultiIndex (GH299)
- Substantially improved performance of generic Index.intersection and Index.union
- Implemented BlockManager.take resulting in significantly faster take performance on mixed-type DataFrame objects (GH104)
- Improved performance of Series.sort_index
- Significant groupby performance enhancement: removed unnecessary integrity checks in DataFrame internals that were slowing down slicing operations to retrieve groups
- Optimized _ensure_index function resulting in performance savings in type-checking Index objects
- Wrote fast time series merging / joining methods in Cython. Will be integrated later into DataFrame.join and related functions
CHAPTER
TWO
INSTALLATION
You have the option to install an official release or to build the development version. If you choose to install from source and are running Windows, you will have to ensure that you have a compatible C compiler (MinGW or Visual Studio) installed. How-to install MinGW on Windows
2.3 Dependencies
- NumPy: 1.4.0 or higher. Recommend 1.5.1 or higher
- python-dateutil 1.5
Note: Without the optional dependencies, many useful features will not work. Hence, it is highly recommended that you install these. A packaged distribution like the Enthought Python Distribution may be worth considering.
On Windows, I suggest installing the MinGW compiler suite following the directions linked to above. Once configured properly, run the following on the command line:
python setup.py build --compiler=mingw32 python setup.py install
Note that you will not be able to import pandas if you open an interpreter in the source directory unless you build the C extensions in place:
python setup.py build_ext --inplace
CHAPTER
THREE
FREQUENTLY ASKED QUESTIONS (FAQ)
CHAPTER
FOUR
PACKAGE OVERVIEW
pandas consists of the following things:

- A set of labeled array data structures, the primary of which are Series/TimeSeries and DataFrame
- Index objects enabling both simple axis indexing and multi-level / hierarchical axis indexing
- An integrated group by engine for aggregating and transforming data sets
- Date range generation (DateRange) and custom date offsets enabling the implementation of customized frequencies
- Input/Output tools: loading tabular data from flat files (CSV, delimited, Excel 2003), and saving and loading pandas objects from the fast and efficient PyTables/HDF5 format
- Memory-efficient sparse versions of the standard data structures for storing data that is mostly missing or mostly constant (some fixed value)
- Moving window statistics (rolling mean, rolling standard deviation, etc.)
- Static and moving window linear and panel regression
Series      1D labeled homogeneously-typed array
TimeSeries  Series with index containing datetimes
DataFrame   General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns
Panel       General 3D labeled, also size-mutable array
intended to lend more semantic meaning to the data; i.e., for a particular data set there is likely to be a right way to orient the data. The goal, then, is to reduce the amount of mental effort required to code up data transformations in downstream functions. For example, with tabular data (DataFrame) it is more semantically helpful to think of the index (the rows) and the columns rather than axis 0 and axis 1. And iterating through the columns of the DataFrame thus results in more readable code:
for col in df.columns: series = df[col] # do something with series
4.4 Credits
pandas development began at AQR Capital Management in April 2008. It was open-sourced at the end of 2009. AQR continued to provide resources for development through the end of 2011, and continues to contribute bug reports today. Since January 2012, Lambda Foundry has been providing development resources, as well as commercial support, training, and consulting for pandas. pandas is only made possible by a group of people around the world like you who have contributed new code, bug reports, fixes, comments and ideas. A complete list can be found on GitHub.
4.6 License
======================
PANDAS LICENSING TERMS
======================

pandas is licensed under the BSD 3-Clause (also known as "BSD New" or "BSD Simplified"), as follows:

Copyright (c) 2011-2012, Lambda Foundry, Inc. and PyData Development Team
All rights reserved.

Copyright (c) 2008-2011 AQR Capital Management, LLC
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of any contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDER AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

About the Copyright Holders
===========================

AQR Capital Management began pandas development in 2008. Development was led by Wes McKinney. AQR released the source under this license in 2009. Wes is now an employee of Lambda Foundry, and remains the pandas project lead.

The PyData Development Team is the collection of developers of the PyData project. This includes all of the PyData sub-projects, including pandas. The core team that coordinates development on GitHub can be found here: http://github.com/pydata.

Full credits for pandas contributors can be found in the documentation.

Our Copyright Policy
====================
PyData uses a shared copyright model. Each contributor maintains copyright over their contributions to PyData. However, it is important to note that these contributions are typically only changes to the repositories. Thus, the PyData source code, in its entirety, is not the copyright of any single person or institution. Instead, it is the collective copyright of the entire PyData Development Team. If individual contributors want to maintain a record of what changes/contributions they have specific copyright on, they should indicate their copyright in the commit message of the change when they commit the change to one of the PyData repositories.

With this in mind, the following banner should be used in any source code file to indicate the copyright and license terms:

#-----------------------------------------------------------------------------
# Copyright (c) 2012, PyData Development Team
# All rights reserved.
#
# Distributed under the terms of the BSD Simplified License.
#
# The full license is in the LICENSE file, distributed with this software.
#-----------------------------------------------------------------------------
CHAPTER
FIVE
INTRO TO DATA STRUCTURES
Here is a basic tenet to keep in mind: data alignment is intrinsic. The link between labels and data will not be broken unless done so explicitly by you. We'll give a brief intro to the data structures, then consider all of the broad categories of functionality and methods in separate sections.
5.1 Series
Series is a one-dimensional labeled array (technically a subclass of ndarray) capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:
>>> s = Series(data, index=index)
Here, data can be many different things:

- a Python dict
- an ndarray
- a scalar value (like 5)

The passed index is a list of axis labels. Thus, this separates into a few cases depending on what data is:

From ndarray

If data is an ndarray, index must be the same length as data. If no index is passed, one will be created having values [0, ..., len(data) - 1].
In [228]: s = Series(randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [229]: s
Out[229]:
a
b
c
d
e

In [230]: s.index
Out[230]: Index([a, b, c, d, e], dtype=object)

In [231]: Series(randn(5))
Out[231]:
0    0.654
1   -1.146
2    1.144
3    0.167
4    0.148
Note: The values in the index must be unique. If they are not, an exception will not be raised immediately, but attempting any operation involving the index will later result in an exception. In other words, the Index object containing the labels lazily checks whether the values are unique. The reason for being lazy is nearly all performance-based (there are many instances in computations, like parts of GroupBy, where the index is not used).

From dict

If data is a dict: if index is passed, the values in data corresponding to the labels in the index will be pulled out. Otherwise, an index will be constructed from the sorted keys of the dict, if possible.
In [232]: d = {'a' : 0., 'b' : 1., 'c' : 2.}

In [233]: Series(d)
Out[233]:
a    0
b    1
c    2

In [234]: Series(d, index=['b', 'c', 'd', 'a'])
Out[234]:
b      1
c      2
d    NaN
a      0
Note: NaN (not a number) is the standard missing data marker used in pandas.

From scalar value

If data is a scalar value, an index must be provided. The value will be repeated to match the length of index.
In [235]: Series(5., index=['a', 'b', 'c', 'd', 'e'])
Out[235]:
a    5
b    5
c    5
d    5
e    5
A key difference between Series and ndarray is that operations between Series automatically align the data based on label. Thus, you can write computations without giving consideration to whether the Series involved have the same labels.
In [249]: s[1:] + s[:-1]
Out[249]:
a       NaN
b    -3.074
c     0.326
d    -1.296
e       NaN
The result of an operation between unaligned Series will have the union of the indexes involved. If a label is not found in one Series or the other, the result will be marked as missing (NaN). Being able to write code without doing any explicit data alignment grants immense freedom and flexibility in interactive data analysis and research. The integrated data alignment features of the pandas data structures set pandas apart from the majority of related tools for working with labeled data. Note: In general, we chose to make the default result of operations between differently indexed objects yield the union of the indexes in order to avoid loss of information. Having an index label, though the data is missing, is
typically important information as part of a computation. You of course have the option of dropping labels with missing data via the dropna function.
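For example, continuing the addition above, a minimal sketch:

(s[1:] + s[:-1]).dropna()  # drops the 'a' and 'e' rows, whose values are NaN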
The Series name will be assigned automatically in many cases, in particular when taking 1D slices of DataFrame as you will see below.
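A name can also be attached at construction time; a minimal sketch:

import numpy as np
from pandas import Series

s = Series(np.random.randn(5), name='something')
s.name  # 'something'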
5.2 DataFrame
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:

- Dict of 1D ndarrays, lists, dicts, or Series
- 2-D numpy.ndarray
- Structured or record ndarray
- A Series
- Another DataFrame

Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index. If axis labels are not passed, they will be constructed from the input data based on common sense rules.
In [255]: df
Out[255]:
   one  two
a    1    1
b    2    2
c    3    3
d  NaN    4

In [256]: DataFrame(d, index=['d', 'b', 'a'])
Out[256]:
   one  two
d  NaN    4
b    2    2
a    1    1

In [257]: DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
Out[257]:
   two three
d    4   NaN
b    2   NaN
a    1   NaN
The row and column labels can be accessed respectively by accessing the index and columns attributes:

Note: When a particular set of columns is passed along with a dict of data, the passed columns override the keys in the dict.
In [258]: df.index
Out[258]: Index([a, b, c, d], dtype=object)

In [259]: df.columns
Out[259]: Index([one, two], dtype=object)
Note: DataFrame is not intended to work exactly like a 2-dimensional NumPy ndarray.
5.2. DataFrame
29
DataFrame.from_items

DataFrame.from_items works analogously to the form of the dict constructor that takes a sequence of (key, value) pairs, where the keys are column (or row, in the case of orient='index') names, and the values are the column values (or row values). This can be useful for constructing a DataFrame with the columns in a particular order without having to pass an explicit list of columns:
In [274]: DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6])])
Out[274]:
   A  B
0  1  4
1  2  5
2  3  6
If you pass orient='index', the keys will be the row labels. But in this case you must also pass the desired column names:
In [275]: DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6])],
   .....:                      orient='index', columns=['one', 'two', 'three'])
Out[275]:
   one  two  three
A    1    2      3
B    4    5      6
When inserting a Series that does not have the same index as the DataFrame, it will be conformed to the DataFrame's index:
In [285]: df['one_trunc'] = df['one'][:2]

In [286]: df
Out[286]:
   one   flag  foo  one_trunc
a    1  False  bar          1
b    2  False  bar          2
c    3   True  bar        NaN
d  NaN  False  bar        NaN
You can insert raw ndarrays but their length must match the length of the DataFrames index. By default, columns get inserted at the end. The insert function is available to insert at a particular location in the columns:
In [287]: df.insert(1, 'bar', df['one'])

In [288]: df
Out[288]:
   one  bar   flag  foo
a    1    1  False  bar
b    2    2  False  bar
c    3    3   True  bar
d  NaN  NaN  False  bar
Row selection, for example, returns a Series whose index is the columns of the DataFrame:
In [289]: df.xs('b')
Out[289]:
one              2
bar              2
flag         False
foo            bar
one_trunc        2
Name: b

In [290]: df.ix[2]
Out[290]:
one              3
bar              3
flag          True
foo            bar
one_trunc      NaN
Name: c
Note: if a DataFrame contains columns of multiple dtypes, the dtype of the row will be chosen to accommodate all of the data types (dtype=object is the most general). For a more exhaustive treatment of more sophisticated label-based indexing and slicing, see the section on indexing. We will address the fundamentals of reindexing / conforming to new sets of labels in the section on reindexing.
When doing an operation between DataFrame and Series, the default behavior is to align the Series index on the DataFrame columns, thus broadcasting row-wise. For example:
In [294]: df - df.ix[0]
Out[294]:
       A      B      C      D
0  0.000  0.000  0.000  0.000
1  0.808  1.358  2.420 -1.339
2 -0.070 -0.814  4.027 -0.293
3  0.073 -0.519  2.195 -0.002
4  1.342  0.372  2.510 -1.543
5 -0.523  0.665  0.942  0.018
6  0.101 -0.066  1.943 -0.817
7  0.744  0.834  3.473 -0.665
8  0.045  0.772  1.406 -0.965
9  0.860  1.677  1.462 -1.165
In the special case of working with time series data, if the Series is a TimeSeries (which it will be automatically if the index contains datetime objects), and the DataFrame index also contains dates, the broadcasting will be column-wise:
In [295]: index = DateRange('1/1/2000', periods=8)

In [296]: df = DataFrame(randn(8, 3), index=index,
   .....:                columns=['A', 'B', 'C'])

In [297]: df
Out[297]:
                A      B      C
2000-01-03  0.361 -0.192 -0.058
2000-01-04 -0.646 -1.051 -0.716
2000-01-05  0.613  0.501 -1.380
2000-01-06  0.624  0.790  0.818
2000-01-07  1.559  0.335  0.919
2000-01-10 -1.381  0.365 -1.811
2000-01-11 -0.673 -1.968 -0.401
2000-01-12 -0.583 -0.999 -0.629
In [298]: type(df['A'])
Out[298]: pandas.core.series.TimeSeries

In [299]: df - df['A']
Out[299]:
            A      B      C
2000-01-03  0 -0.553 -0.419
2000-01-04  0 -0.405 -0.070
2000-01-05  0 -0.112 -1.993
2000-01-06  0  0.166  0.195
2000-01-07  0 -1.224 -0.640
2000-01-10  0  1.746 -0.429
2000-01-11  0 -1.294  0.272
2000-01-12  0 -0.416 -0.046
Technical purity aside, this case is so common in practice that supporting the special case is preferable to the alternative of forcing the user to transpose and do column-based alignment like so:
In [300]: (df.T - df['A']).T
Out[300]:
            A      B      C
2000-01-03  0 -0.553 -0.419
2000-01-04  0 -0.405 -0.070
2000-01-05  0 -0.112 -1.993
2000-01-06  0  0.166  0.195
2000-01-07  0 -1.224 -0.640
2000-01-10  0  1.746 -0.429
2000-01-11  0 -1.294  0.272
2000-01-12  0 -0.416 -0.046
For explicit control over the matching and broadcasting behavior, see the section on flexible binary operations. Operations with scalars are just as you would expect:
In [301]: df * 5 + 2
Out[301]:
                A      B
2000-01-03  3.805  1.040
2000-01-04 -1.230 -3.257
2000-01-05  5.064  4.504
2000-01-06  5.118  5.948
2000-01-07  9.795  3.676
2000-01-10 -4.906  3.826
2000-01-11 -1.367 -7.838
2000-01-12 -0.915 -2.993

In [302]: 1 / df
Out[302]:
                A      B       C
2000-01-03  2.770 -5.211 -17.148
2000-01-04 -1.548 -0.951  -1.397
2000-01-05  1.632  1.997  -0.724
2000-01-06  1.604  1.266   1.222
2000-01-07  0.641  2.983   1.088
2000-01-10 -0.724  2.738  -0.552
2000-01-11 -1.485 -0.508  -2.493
2000-01-12 -1.715 -1.001  -1.589

In [303]: df ** 4
Out[303]:
                A       B       C
2000-01-03  0.017   0.001   0.000
2000-01-04  0.174   1.222   0.263
2000-01-05  0.141   0.063   3.631
2000-01-06  0.151   0.389   0.448
2000-01-07  5.908   0.013   0.715
2000-01-10  3.639   0.018  10.748
2000-01-11  0.206  14.988   0.026
2000-01-12  0.116   0.995   0.157
5.2.10 Transposing
To transpose, access the T attribute (also the transpose function), similar to an ndarray:
# only show the first 5 rows
In [310]: df[:5].T
Out[310]:
   2000-01-03  2000-01-04  2000-01-05  2000-01-06  2000-01-07
A       0.361      -0.646       0.613       0.624       1.559
B      -0.192      -1.051       0.501       0.790       0.335
C      -0.058      -0.716      -1.380       0.818       0.919
DataFrame is not intended to be a drop-in replacement for ndarray as its indexing semantics are quite different in places from a matrix.
id      100 non-null values
year    100 non-null values
stint   100 non-null values
team    100 non-null values
lg      100 non-null values
g       100 non-null values
ab      100 non-null values
r       100 non-null values
h       100 non-null values
X2b     100 non-null values
X3b     100 non-null values
hr      100 non-null values
rbi     100 non-null values
sb      100 non-null values
cs      100 non-null values
bb      100 non-null values
so      100 non-null values
ibb     100 non-null values
hbp     100 non-null values
sh      100 non-null values
sf      100 non-null values
gidp    100 non-null values
dtypes: float64(9), int64(10), object(3)
However, using to_string will return a string representation of the DataFrame in tabular form, though it won't always fit the console width:
In [318]: print baseball.ix[-20:, :12].to_string()
              id  year  stint team  lg    g   ab   r    h  X2b  X3b  hr
88641  womacto01  2006      2  CHN  NL   19   50   6   14    1    0   1
88643  schilcu01  2006      1  BOS  AL   31    2   0    1    0    0   0
88645  myersmi01  2006      1  NYA  AL   62    0   0    0    0    0   0
88649  helliri01  2006      1  MIL  NL   20    3   0    0    0    0   0
88650  johnsra05  2006      1  NYA  AL   33    6   0    1    0    0   0
88652  finlest01  2006      1  SFN  NL  139  426  66  105   21   12   6
88653  gonzalu01  2006      1  ARI  NL  153  586  93  159   52    2  15
88662   seleaa01  2006      1  LAN  NL   28   26   2    5    1    0   0
89177  francju01  2007      2  ATL  NL   15   40   1   10    3    0   0
89178  francju01  2007      1  NYN  NL   40   50   7   10    0    0   1
89330   zaungr01  2007      1  TOR  AL  110  331  43   80   24    1  10
89333  witasja01  2007      1  TBA  AL    3    0   0    0    0    0   0
89334  williwo02  2007      1  HOU  NL   33   59   3    6    0    0   1
...
The related method get_dtype_counts will return the number of columns of each type:
In [320]: baseball.get_dtype_counts()
Out[320]:
float64     9
int64      10
object      3
In [321]: df = DataFrame({'foo1' : np.random.randn(5),
   .....:                 'foo2' : np.random.randn(5)})

In [322]: df
Out[322]:
       foo1      foo2
0 -0.548001 -0.966162
1 -0.852612 -0.332601
2 -0.126250 -1.327330
3  1.765997  1.225847
4 -1.593297 -0.348395

In [323]: df.foo1
Out[323]:
0   -0.548001
1   -0.852612
2   -0.126250
3    1.765997
4   -1.593297
Name: foo1
The columns are also connected to the IPython completion mechanism so they can be tab-completed:
In [5]: df.fo<TAB> df.foo1 df.foo2
5.3 Panel
Panel is a somewhat less-used, but still important container for 3-dimensional data. The term panel data is derived from econometrics and is partially responsible for the name pandas: pan(el)-da(ta)-s. The names for the 3 axes are intended to give some semantic meaning to describing operations involving panel data and, in particular, econometric analysis of panel data. However, for the strict purposes of slicing and dicing a collection of DataFrame objects, you may find the axis names slightly arbitrary:

- items: axis 0, each item corresponds to a DataFrame contained inside
- major_axis: axis 1, it is the index (rows) of each of the DataFrames
- minor_axis: axis 2, it is the columns of each of the DataFrames

Construction of Panels works about like you would expect:
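A minimal sketch, with illustrative item names:

import numpy as np
from pandas import DataFrame, Panel

data = {'Item1': DataFrame(np.random.randn(4, 3)),
        'Item2': DataFrame(np.random.randn(4, 2))}
wp = Panel(data)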
Note that the values in the dict need only be convertible to DataFrame. Thus, they can be any of the other valid inputs to DataFrame as per above. One helpful factory method is Panel.from_dict, which takes a dictionary of DataFrames as above, and the following named parameters:

Parameter   Default   Description
intersect   False     drops elements whose indices do not align
orient      items     use 'minor' to use the DataFrames' columns as panel items
Orient is especially useful for mixed-type DataFrames. Note: Unfortunately Panel, being less commonly used than Series and DataFrame, has been slightly neglected featurewise. A number of methods and options available in DataFrame are not available in Panel. This will get worked on, of course, in future releases. And faster if you join me in working on the codebase.
The API for insertion and deletion is the same as for DataFrame. And as with DataFrame, if the item is a valid python identifier, you can access it as an attribute and tab-complete it in IPython.
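For instance, with a Panel wp holding an item named 'Item1' (a minimal sketch):

wp['Item1']  # the DataFrame stored under 'Item1'
wp.Item1     # equivalent attribute-style access, since 'Item1' is a valid identifier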
CHAPTER
SIX
ESSENTIAL BASIC FUNCTIONALITY
6.2 Attributes and the raw ndarray(s)

- shape: gives the axis dimensions of the object, consistent with ndarray
- Axis labels:
  - Series: index (only axis)
  - DataFrame: index (rows) and columns
  - Panel: items, major_axis, and minor_axis

Note, these attributes can be safely assigned to!
In [8]: df[:2]
Out[8]:
                   A         B         C
2000-01-03 -0.397375 -1.964789 -0.170366
2000-01-04  0.316897 -0.090626 -1.399808

In [9]: df.columns = [x.lower() for x in df.columns]

In [10]: df
Out[10]:
                   a         b         c
2000-01-03 -0.397375 -1.964789 -0.170366
2000-01-04  0.316897 -0.090626 -1.399808
2000-01-05 -0.543386 -1.065025  0.424420
2000-01-06 -0.019389  0.330196  0.637809
2000-01-07 -0.246211  0.773164  0.019443
2000-01-10 -0.616430 -0.305674  0.844000
2000-01-11  0.745438 -0.280596 -0.796469
2000-01-12 -0.182547 -0.499865  1.541597
To get the actual data inside a data structure, one need only access the values property:
In [11]: s.values Out[11]: array([ 2.3971, 0.0065, 1.2489, -0.8971, 0.7268])
In [12]: df.values
Out[12]:
array([[-0.3974, -1.9648, -0.1704],
       [ 0.3169, -0.0906, -1.3998],
       [-0.5434, -1.065 ,  0.4244],
       [-0.0194,  0.3302,  0.6378],
       [-0.2462,  0.7732,  0.0194],
       [-0.6164, -0.3057,  0.844 ],
       [ 0.7454, -0.2806, -0.7965],
       [-0.1825, -0.4999,  1.5416]])

In [13]: wp.values
Out[13]:
array([[[-0.7461,  0.6748, -0.5713,  0.1542],
        [-1.3674, -0.1887, -0.9046, -0.8528],
        [ 0.2275, -0.1511,  1.4874,  0.2663],
        [-0.1162,  1.9625, -0.8264,  0.7547],
        [-0.2509,  0.5349, -2.3785,  1.3824]],
       [[ 1.6887,  0.272 ,  1.0161, -1.7139],
        [ 1.1353, -0.6189, -0.7009,  0.4606],
        [ 0.9363,  0.4164, -0.2274,  1.4496],
        [ 0.7311,  2.0043, -1.2222, -0.8807],
        [ 0.1582,  0.7534, -1.5557, -0.522 ]]])
If a DataFrame or Panel contains homogeneously-typed data, the ndarray can actually be modified in-place, and the changes will be reflected in the data structure. For heterogeneous data (e.g. some of the DataFrame's columns are not all the same dtype), this will not be the case. The values attribute itself, unlike the axis labels, cannot be assigned to. Note: When working with heterogeneous data, the dtype of the resulting ndarray will be chosen to accommodate all of the data involved. For example, if strings are involved, the result will be of object dtype. If there are only floats and integers, the resulting array will be of float dtype.
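A minimal sketch of this dtype accommodation:

from pandas import DataFrame

mixed = DataFrame({'a': [1, 2], 'b': ['x', 'y']})
mixed.values.dtype                                    # object, to hold the strings
DataFrame({'a': [1, 2], 'b': [3., 4.]}).values.dtype  # float64: the ints are upcast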
6.3 Flexible binary operations

In [19]: column = df['two']

In [20]: df.sub(column, axis=0)
Out[20]:
        one     three  two
a -1.069035       NaN    0
b  0.053475 -0.500531    0
c -1.447289 -1.514975    0
d       NaN -1.512446    0
With Panel, describing the matching behavior is a bit more difficult, so the arithmetic methods instead (and perhaps confusingly?) give you the option to specify the broadcast axis. For example, suppose we wished to demean the data over a particular axis. This can be accomplished by taking the mean over an axis and broadcasting over the same axis:
In [21]: major_mean = wp.mean(axis='major')

In [22]: major_mean
Out[22]:
      Item1     Item2
A -0.450606  0.929929
B  0.566477  0.565418
C -0.638662 -0.538023
D  0.340967 -0.241284

In [23]: wp.sub(major_mean, axis='major')
Out[23]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 5 (major) x 4 (minor)
Items: Item1 to Item2
Major axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00
Minor axis: A to D
And similarly for axis='items' and axis='minor'. Note: I could be convinced to make the axis argument in the DataFrame methods match the broadcasting behavior of Panel. Though it would require a transition period so users can change their code...
In [25]: df2
Out[25]:
        one     three       two
a  0.779958  1.000000  1.848993
b  0.446548 -0.107458  0.393073
c -0.439897 -0.507583  1.007392
d       NaN -0.433661  1.078785

In [26]: df + df2
Out[26]:
        one     three       two
a  1.559916       NaN  3.697985
b  0.893097 -0.214915  0.786146
c -0.879794 -1.015165  2.014785
d       NaN -0.867323  2.157570
In [27]: df.add(df2, fill_value=0)
Out[27]:
        one     three       two
a  1.559916  1.000000  3.697985
b  0.893097 -0.214915  0.786146
c -0.879794 -1.015165  2.014785
d       NaN -0.867323  2.157570
6.4 Descriptive statistics
In [37]: df.mean(0)
Out[37]:
one      0.262203
three   -0.349567
two      1.082061
All such methods have a skipna option signaling whether to exclude missing data (True by default):
In [38]: df.sum(0, skipna=False)
Out[38]:
one          NaN
three        NaN
two     4.328243

In [39]: df.sum(axis=1, skipna=True)
Out[39]:
a    2.628951
b    0.732164
c    0.059913
d    0.645124
Combined with the broadcasting / arithmetic behavior, one can describe various statistical procedures, like standardization (rendering data zero mean and standard deviation 1), very concisely:
In [40]: ts_stand = (df - df.mean()) / df.std()

In [41]: ts_stand.std()
Out[41]:
one      1
three    1
two      1
Note that methods like cumsum and cumprod preserve the location of NA values:
In [44]: df.cumsum()
Out[44]:
        one     three       two
a  0.779958       NaN  1.848993
b  1.226506 -0.107458  2.242066
c  0.786610 -0.615040  3.249458
d       NaN -1.048702  4.328243
Here is a quick reference summary table of common functions. Each also takes an optional level parameter which applies only if the object has a hierarchical index.
Function   Description
count      Number of non-null observations
sum        Sum of values
mean       Mean of values
mad        Mean absolute deviation
median     Arithmetic median of values
min        Minimum
max        Maximum
abs        Absolute Value
prod       Product of values
std        Unbiased standard deviation
var        Unbiased variance
skew       Unbiased skewness (3rd moment)
kurt       Unbiased kurtosis (4th moment)
quantile   Sample quantile (value at %)
cumsum     Cumulative sum
cumprod    Cumulative product
cummax     Cumulative maximum
cummin     Cumulative minimum
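A minimal sketch of the level parameter, assuming a hierarchically indexed Series:

import numpy as np
from pandas import Series, MultiIndex

index = MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
s = Series(np.random.randn(4), index=index)
s.sum(level=0)  # one summed value per first-level label ('a' and 'b')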
Note that by chance some NumPy methods, like mean, std, and sum, will exclude NAs on Series input by default:
In [45]: np.mean(df['one'])
Out[45]: 0.2622032058481874

In [46]: np.mean(df['one'].values)
Out[46]: nan
Series also has a method nunique which will return the number of unique non-null values:
In [47]: series = Series(randn(500))

In [48]: series[20:500] = np.nan

In [49]: series[10:20] = 5

In [50]: series.nunique()
Out[50]: 11
In [54]: frame = DataFrame(randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e'])

In [55]: frame.ix[::2] = np.nan

In [56]: frame.describe()
Out[56]:
                a           b
count  500.000000  500.000000
mean     0.008892   -0.019903
std      0.976105    0.995283
min     -2.991973   -3.186717
25%     -0.640447   -0.713719
50%      0.014034    0.008756
75%      0.704305    0.641368
max      2.658383    2.933960
For a non-numerical Series object, describe will give a simple summary of the number of unique values and most frequently occurring values:
In [57]: s = Series(['a', 'a', 'b', 'b', 'a', 'a', np.nan, 'c', 'd', 'a'])

In [58]: s.describe()
Out[58]:
count     9
unique    4
top       a
freq      5
There is also a utility function, value_range, which takes a DataFrame and returns a Series with the minimum/maximum values in the DataFrame.
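A minimal sketch, assuming value_range is importable from the top-level pandas namespace:

import numpy as np
from pandas import DataFrame, value_range

df = DataFrame(np.random.randn(4, 3))
value_range(df)  # a Series holding the frame-wide minimum and maximum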
6.5 Function application

In [69]: df.apply(np.cumsum)
Out[69]:
        one     three       two
a  0.779958       NaN  1.848993
b  1.226506 -0.107458  2.242066
c  0.786610 -0.615040  3.249458
d       NaN -1.048702  4.328243

In [70]: df.apply(np.exp)
Out[70]:
        one     three       two
a  2.181381       NaN  6.353417
b  1.562908  0.898115  1.481526
c  0.644103  0.601949  2.738451
d       NaN  0.648132  2.941104
Depending on the return type of the function passed to apply, the result will either be of lower dimension or the same dimension. apply combined with some cleverness can be used to answer many questions about a data set. For example, suppose we wanted to extract the date where the maximum value for each column occurred:
In [71]: tsdf = DataFrame(randn(1000, 3), columns=['A', 'B', 'C'],
   ....:                  index=DateRange('1/1/2000', periods=1000))

In [72]: tsdf.apply(lambda x: x.index[x.dropna().argmax()])
Out[72]:
A    2003-06-09 00:00:00
B    2001-02-05 00:00:00
C    2001-06-21 00:00:00
You may also pass additional arguments and keyword arguments to the apply method. For instance, consider the following function you would like to apply:
def subtract_and_divide(x, sub, divide=1): return (x - sub) / divide
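It could then be applied with extra positional and keyword arguments, which apply passes through to the function (a minimal sketch):

df.apply(subtract_and_divide, args=(5,), divide=3)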
Another useful feature is the ability to pass Series methods to carry out some Series operation on each column or row:
In [73]: tsdf
Out[73]:
                   A         B         C
2000-01-03 -2.569072  0.915718 -0.470533
2000-01-04 -2.177425 -0.028034 -0.758511
2000-01-05  1.863567  1.072148  0.737064
2000-01-06        NaN       NaN       NaN
2000-01-07        NaN       NaN       NaN
2000-01-10        NaN       NaN       NaN
2000-01-11        NaN       NaN       NaN
2000-01-12  2.532888 -1.265961 -0.586280
2000-01-13 -1.857103 -0.500810  1.055360
2000-01-14 -0.616887 -1.097486 -1.673839

In [74]: tsdf.apply(Series.interpolate)
Out[74]:
                   A         B         C
2000-01-03 -2.569072  0.915718 -0.470533
2000-01-04 -2.177425 -0.028034 -0.758511
2000-01-05  1.863567  1.072148  0.737064
2000-01-06  1.997431  0.604526  0.472395
2000-01-07  2.131295  0.136905  0.207726
2000-01-10  2.265160 -0.330717 -0.056943
2000-01-11  2.399024 -0.798339 -0.321612
2000-01-12  2.532888 -1.265961 -0.586280
2000-01-13 -1.857103 -0.500810  1.055360
2000-01-14 -0.616887 -1.097486 -1.673839
Finally, apply takes an argument raw which is False by default; this converts each row or column into a Series before applying the function. When set to True, the passed function will instead receive an ndarray object, which has positive performance implications if you do not need the indexing functionality.

See Also: The section on GroupBy demonstrates related, flexible functionality for grouping by some criterion, applying, and combining the results into a Series, DataFrame, etc.
Series.map has an additional feature: it can be used to easily "link" or "map" values defined by a secondary Series. This is closely related to merging/joining functionality:
In [78]: s = Series(['six', 'seven', 'six', 'seven', 'six'],
   ....:            index=['a', 'b', 'c', 'd', 'e'])

In [79]: t = Series({'six': 6., 'seven': 7.})

In [80]: s
Out[80]:
a      six
b    seven
c      six
d    seven
e      six

In [81]: s.map(t)
Out[81]:
a    6
b    7
c    6
d    7
e    6
Here, the 'f' label was not contained in the Series and hence appears as NaN in the result. With a DataFrame, you can simultaneously reindex the index and columns:
In [85]: df
Out[85]:
        one     three
a  0.779958       NaN
b  0.446548 -0.107458
c -0.439897 -0.507583
d       NaN -0.433661
In [86]: df.reindex(index=['c', 'f', 'b'], columns=['three', 'two', 'one'])
Out[86]:
      three       two       one
c -0.507583  1.007392 -0.439897
f       NaN       NaN       NaN
b -0.107458  0.393073  0.446548
For convenience, you may use the reindex_axis method, which takes the labels and a keyword axis parameter (see the sketch in section 6.6.2 below). Note that the Index objects containing the actual axis labels can be shared between objects. So if we have a Series and a DataFrame, the following can be done:
In [87]: rs = s.reindex(df.index)

In [88]: rs
Out[88]:
a   -0.558130
This means that the reindexed Series's index is the same Python object as the DataFrame's index.

See Also: Advanced indexing is an even more concise way of doing reindexing.

Note: When writing performance-sensitive code, there is a good reason to spend some time becoming a reindexing ninja: many operations are faster on pre-aligned data. Adding two unaligned DataFrames internally triggers a reindexing step. For exploratory analysis you will hardly notice the difference (because reindex has been heavily optimized), but when CPU cycles matter sprinkling a few explicit reindex calls here and there can have an impact.
6.6.2 Reindexing with reindex_axis
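For example, conforming just the columns of df might look like the following (a sketch equivalent to the column reindex shown above):

df.reindex_axis(['three', 'two', 'one'], axis=1)

6.6.3 Aligning objects with each other with align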
The align method is the fastest way to simultaneously align two objects. It supports a join argument (related to joining and merging):

join='outer': take the union of the indexes
join='left': use the calling object's index
join='right': use the passed object's index
join='inner': intersect the indexes

It returns a tuple with both of the reindexed Series:
In [93]: s = Series(randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [94]: s1 = s[:4]

In [95]: s2 = s[1:]

In [96]: s1.align(s2)
Out[96]:
(a    1.007125
b    0.467322
c   -0.514897
d    1.051824
e         NaN,
a         NaN
b    0.467322
c   -0.514897
d    1.051824
e   -0.654226)

In [97]: s1.align(s2, join='inner')
Out[97]:
(b    0.467322
c   -0.514897
d    1.051824,
b    0.467322
c   -0.514897
d    1.051824)

In [98]: s1.align(s2, join='left')
Out[98]:
(a    1.007125
b    0.467322
c   -0.514897
d    1.051824,
a         NaN
b    0.467322
c   -0.514897
d    1.051824)
For DataFrames, the join method will be applied to both the index and the columns by default:
In [99]: df.align(df2, join='inner')
Out[99]:
(        one       two
a  0.779958  1.848993
b  0.446548  0.393073
c -0.439897  1.007392,
        one      two
a  0.517755  0.76584
b  0.184345 -0.69008
c -0.702100 -0.07576)
You can also pass an axis option to only align on the specified axis:
In [100]: df.align(df2, join='inner', axis=0)
Out[100]:
(        one     three       two
a  0.779958       NaN  1.848993
b  0.446548 -0.107458  0.393073
c -0.439897 -0.507583  1.007392,
        one      two
a  0.517755  0.76584
b  0.184345 -0.69008
c -0.702100 -0.07576)
If you pass a Series to DataFrame.align, you can choose to align both objects either on the DataFrame's index or columns using the axis argument:
In [101]: df.align(df2.ix[0], axis=1)
Out[101]:
(        one     three       two
a  0.779958       NaN  1.848993
b  0.446548 -0.107458  0.393073
c -0.439897 -0.507583  1.007392
d       NaN -0.433661  1.078785,
one      0.517755
three         NaN
two      0.765840
Name: a)
Other fill methods could be added, of course, but these are the two most commonly used for time series data. In a way they only make sense for time series or otherwise ordered data, but you may have an application on non-time series data where this sort of interpolation logic is the correct thing to do. More sophisticated interpolation of missing values would be an obvious extension. We illustrate these fill methods on a simple TimeSeries:
In [102]: rng = DateRange('1/3/2000', periods=8)

In [103]: ts = Series(randn(8), index=rng)

In [104]: ts2 = ts[[0, 3, 6]]

In [105]: ts
Out[105]:
2000-01-03
2000-01-04
2000-01-05
2000-01-06
2000-01-07
2000-01-10
2000-01-11
2000-01-12
In [106]: ts2
Out[106]:
2000-01-03    0.377650
2000-01-06   -0.337795
2000-01-11   -0.719562

In [107]: ts2.reindex(ts.index)
Out[107]:
2000-01-03    0.377650
2000-01-04         NaN
2000-01-05         NaN
2000-01-06   -0.337795
2000-01-07         NaN
2000-01-10         NaN
2000-01-11   -0.719562
2000-01-12         NaN

In [108]: ts2.reindex(ts.index, method='ffill')
Out[108]:
2000-01-03    0.377650
2000-01-04    0.377650
2000-01-05    0.377650
2000-01-06   -0.337795
2000-01-07   -0.337795
2000-01-10   -0.337795
2000-01-11   -0.719562
2000-01-12   -0.719562

In [109]: ts2.reindex(ts.index, method='bfill')
Out[109]:
2000-01-03    0.377650
2000-01-04   -0.337795
2000-01-05   -0.337795
2000-01-06   -0.337795
2000-01-07   -0.719562
2000-01-10   -0.719562
2000-01-11   -0.719562
2000-01-12         NaN
Note the same result could have been achieved using fillna:
In [110]: ts2.reindex(ts.index).fillna(method='ffill')
Out[110]:
2000-01-03    0.377650
2000-01-04    0.377650
2000-01-05    0.377650
2000-01-06   -0.337795
2000-01-07   -0.337795
2000-01-10   -0.337795
2000-01-11   -0.719562
2000-01-12   -0.719562
Note these methods generally assume that the indexes are sorted. They may be modified in the future to be a bit more flexible, but as time series data is ordered most of the time anyway, this has not been a major priority.
A method closely related to reindex is the drop function, which removes a set of labels from an axis:

In [111]: df
Out[111]:
        one     three
a  0.779958       NaN
b  0.446548 -0.107458
c -0.439897 -0.507583
d       NaN -0.433661
In [112]: df.drop(['a', 'd'], axis=0)
Out[112]:
        one     three       two
b  0.446548 -0.107458  0.393073
c -0.439897 -0.507583  1.007392

In [113]: df.drop(['one'], axis=1)
Out[113]:
      three       two
a       NaN  1.848993
b -0.107458  0.393073
c -0.507583  1.007392
d -0.433661  1.078785
Note that the following also works, but is a bit less obvious / clean:
In [114]: df.reindex(df.index - ['a', 'd'])
Out[114]:
        one     three       two
b  0.446548 -0.107458  0.393073
c -0.439897 -0.507583  1.007392
The rename method allows you to relabel an axis based on some mapping (a dict or Series) or an arbitrary function. If you pass a function, it must return a value when called with any of the labels (and must produce a set of unique values). But if you pass a dict or Series, it need only contain a subset of the labels as keys:
In [117]: df.rename(columns={'one': 'foo', 'two': 'bar'},
   .....:           index={'a': 'apple', 'b': 'banana', 'd': 'durian'})
Out[117]:
            foo     three       bar
apple  0.779958       NaN  1.848993
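Passing a function works the same way; for example, to upper-case every column label (a sketch):

df.rename(columns=str.upper)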
The rename method also provides a copy named parameter that is by default True and copies the underlying data. Pass copy=False to rename the data in place. The Panel class has a related rename_axis method which can rename any of its three axes.
6.7 Iteration
Because pandas objects are somewhat dict-like, basic iteration produces the "keys" of the objects, namely:

Series: the index labels
DataFrame: the column labels
Panel: the item labels

Thus, for example:
In [118]: for col in df:
   .....:     print col
   .....:
one
three
two
6.7.1 iteritems
Consistent with the dict-like interface, iteritems iterates through key-value pairs:

Series: (index, scalar value) pairs
DataFrame: (column, Series) pairs
Panel: (item, DataFrame) pairs

For example:
In [119]: for item, frame in wp.iteritems():
   .....:     print item
   .....:     print frame
   .....:
Item1
                   A         B         C         D
2000-01-03 -0.746076  0.674788 -0.571265  0.154209
2000-01-04 -1.367352 -0.188750 -0.904606 -0.852790
2000-01-05  0.227502 -0.151056  1.487440  0.266303
2000-01-06 -0.116207  1.962528 -0.826398  0.754717
2000-01-07 -0.250897  0.534875 -2.378483  1.382397
Item2
                   A         B         C         D
2000-01-03  1.688713  0.271984  1.016070 -1.713926
2000-01-04  1.135326 -0.618932 -0.700908  0.460558
2000-01-05  0.936316  0.416414 -0.227416  1.449636
2000-01-06  0.731070  2.004273 -1.222206 -0.880667
2000-01-07  0.158218  0.753351 -1.555652 -0.522021
6.7.2 iterrows
New in v0.7 is the ability to iterate efficiently through rows of a DataFrame. iterrows returns an iterator yielding each index value along with a Series containing the data in each row:
In [120]: for row_index, row in df2.iterrows():
   .....:     print '%s\n%s' % (row_index, row)
   .....:
a
one    0.517755
two    0.765840
Name: a
b
one    0.184345
two   -0.690080
Name: b
c
one   -0.70210
two   -0.07576
Name: c
6.7.3 itertuples
This method will return an iterator yielding a tuple for each row in the DataFrame. The first element of the tuple will be the row's corresponding index value, while the remaining values are the row values proper. For instance,
In [126]: for r in df2.itertuples(): print r
(0, 1, 4)
(1, 2, 5)
(2, 3, 6)
6.8 Sorting by index and value

DataFrame.sort_index can accept an optional by argument for axis=0, which will use an arbitrary vector or a column name of the DataFrame to determine the sort order:
In [131]: df.sort_index(by='two')
Out[131]:
        one     three       two
b  0.446548 -0.107458  0.393073
c -0.439897 -0.507583  1.007392
d       NaN -0.433661  1.078785
a  0.779958       NaN  1.848993
Series has the method order (analogous to R's order function) which sorts by value, with special treatment of NA values via the na_last argument:
In [134]: s[2] = np.nan

In [135]: s.order()
Out[135]:
e   -0.654226
b    0.467322
a    1.007125
d    1.051824
c         NaN

In [136]: s.order(na_last=False)
Out[136]:
c         NaN
e   -0.654226
b    0.467322
a    1.007125
d    1.051824
Some other sorting notes / nuances:

Series.sort sorts a Series by value in-place. This is to provide compatibility with NumPy methods which expect the ndarray.sort behavior.
DataFrame.sort takes a column argument instead of by. This method will likely be deprecated in a future release in favor of just using sort_index.
All pandas objects are equipped with a save method which pickles them to disk:

In [150]: df.save('foo.pickle')
The load function in the pandas namespace can be used to load any pickled pandas object (or any other pickled object) from a file:
There is also a save function which takes any object as its first argument:
In [152]: save(df, 'foo.pickle')

In [153]: load('foo.pickle')
Out[153]:
            a          b          c
0    1.249928   1.275456  0.9290728
1    1.371313    2.53522 -0.6311205
2  -0.3284879 -0.9698651 -0.1706295
3  0.01490762 -0.4734291 -0.7004532
4  -0.3286771   1.794254    1.70885
5    1.157524  0.1118896   1.165142
The set_printoptions function has a number of options for controlling how floating point numbers are formatted in the console (using the precision argument). The max_rows and max_columns options control how many rows and columns of DataFrame objects are shown by default. If max_columns is set to 0 (the default, in fact), the library will attempt to fit the DataFrame's string representation into the current terminal width, defaulting to the summary view otherwise.
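For example (a sketch showing only the options discussed above):

set_printoptions(precision=4, max_rows=200, max_columns=10)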
CHAPTER

SEVEN

INDEXING AND SELECTING DATA
7.1 Basics
As mentioned when introducing the data structures in the last section, the primary function of indexing with [] (a.k.a. __getitem__ for those familiar with implementing class behavior in Python) is selecting out lower-dimensional slices. Thus,

Series: series[label] returns a scalar value
DataFrame: frame[colname] returns a Series corresponding to the passed column name
Panel: panel[itemname] returns a DataFrame corresponding to the passed item name

Here we construct a simple time series data set to use for illustrating the indexing functionality:
In [423]: dates = np.asarray(DateRange('1/1/2000', periods=8))

In [424]: df = DataFrame(randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])

In [425]: df
Out[425]:
                   A         B         C         D
2000-01-03  0.469112 -0.282863 -1.509059 -1.135632
2000-01-04  1.212112 -0.173215  0.119209 -1.044236
2000-01-05 -0.861849 -2.104569 -0.494929  1.071804
2000-01-06  0.721555 -0.706771 -1.039575  0.271860
2000-01-07 -0.424972  0.567020  0.276232 -1.087401
2000-01-10 -0.673690  0.113648 -1.478427  0.524988
2000-01-11  0.404705  0.577046 -1.715002 -1.039268
2000-01-12 -0.370647 -1.157892 -1.344312  0.844885
In [426]: panel = Panel({'one': df, 'two': df - df.mean()})

In [427]: panel
Out[427]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 8 (major) x 4 (minor)
Items: one to two
Major axis: 2000-01-03 00:00:00 to 2000-01-12 00:00:00
Minor axis: A to D
Note: None of the indexing functionality is time series specific unless specifically stated. Thus, as per above, we have the most basic indexing using []:
In [428]: s = df['A']

In [429]: s[dates[5]]
Out[429]: -0.67368970808837059

In [430]: panel['two']
Out[430]:
                   A         B
2000-01-03  0.409571  0.113086
2000-01-04  1.152571  0.222735
2000-01-05 -0.921390 -1.708620
2000-01-06  0.662014 -0.310822
2000-01-07 -0.484513  0.962970
2000-01-10 -0.733231  0.509598
2000-01-11  0.345164  0.972995
2000-01-12 -0.430188 -0.761943
There is an analogous set_value method which has the additional capability of enlarging an object. This method always returns a reference to the object it modified which, in the case of enlargement, will be a new object:
In [433]: df.set_value(dates[5], 'E', 7)
Out[433]:
                   A         B         C         D   E
2000-01-03  0.469112 -0.282863 -1.509059 -1.135632 NaN
2000-01-04  1.212112 -0.173215  0.119209 -1.044236 NaN
2000-01-05 -0.861849 -2.104569 -0.494929  1.071804 NaN
2000-01-06  0.721555 -0.706771 -1.039575  0.271860 NaN
2000-01-07 -0.424972  0.567020  0.276232 -1.087401 NaN
2000-01-10 -0.673690  0.113648 -1.478427  0.524988   7
2000-01-11  0.404705  0.577046 -1.715002 -1.039268 NaN
2000-01-12 -0.370647 -1.157892 -1.344312  0.844885 NaN
If you are using the IPython environment, you may also use tab-completion to see the accessible columns of a DataFrame. You can pass a list of columns to [] to select columns in that order; if a column is not contained in the DataFrame, an exception will be raised. Multiple columns can also be set in this manner:
In [435]: df
Out[435]:
                   A         B         C         D
2000-01-03  0.469112 -0.282863 -1.509059 -1.135632
2000-01-04  1.212112 -0.173215  0.119209 -1.044236
2000-01-05 -0.861849 -2.104569 -0.494929  1.071804
2000-01-06  0.721555 -0.706771 -1.039575  0.271860
2000-01-07 -0.424972  0.567020  0.276232 -1.087401
2000-01-10 -0.673690  0.113648 -1.478427  0.524988
2000-01-11  0.404705  0.577046 -1.715002 -1.039268
2000-01-12 -0.370647 -1.157892 -1.344312  0.844885
In [436]: df[['B', 'A']] = df[['A', 'B']]

In [437]: df
Out[437]:
                   A         B         C         D
2000-01-03 -0.282863  0.469112 -1.509059 -1.135632
2000-01-04 -0.173215  1.212112  0.119209 -1.044236
2000-01-05 -2.104569 -0.861849 -0.494929  1.071804
2000-01-06 -0.706771  0.721555 -1.039575  0.271860
2000-01-07  0.567020 -0.424972  0.276232 -1.087401
2000-01-10  0.113648 -0.673690 -1.478427  0.524988
2000-01-11  0.577046  0.404705 -1.715002 -1.039268
2000-01-12 -1.157892 -0.370647 -1.344312  0.844885
You may find this useful for applying a transform (in-place) to a subset of the columns.
DataFrame has an xs method for retrieving a row as a Series, and Panel provides the major_xs and minor_xs functions for retrieving slices as DataFrames for a given major_axis or minor_axis label, respectively.
In [438]: date = dates[5]

In [439]: df.xs(date)
Out[439]:
A    0.113648
B   -0.673690
C   -1.478427
D    0.524988
Name: 2000-01-10 00:00:00

In [440]: panel.major_xs(date)
Out[440]:
        one       two
A -0.673690 -0.733231
B  0.113648  0.509598
C -1.478427 -0.580194
D  0.524988  0.724113

In [441]: panel.minor_xs('A')
Out[441]:
                 one       two
2000-01-03  0.469112  0.409571
2000-01-04  1.212112  1.152571
2000-01-05 -0.861849 -0.921390
2000-01-06  0.721555  0.662014
2000-01-07 -0.424972 -0.484513
2000-01-10 -0.673690 -0.733231
2000-01-11  0.404705  0.345164
2000-01-12 -0.370647 -0.430188
With DataFrame, slicing inside of [] slices the rows. This is provided largely as a convenience since it is such a common operation.
In [448]: df[:3]
Out[448]:
                   A         B         C         D
2000-01-03 -0.282863  0.469112 -1.509059 -1.135632
2000-01-04 -0.173215  1.212112  0.119209 -1.044236
2000-01-05 -2.104569 -0.861849 -0.494929  1.071804

In [449]: df[::-1]
Out[449]:
                   A         B         C         D
2000-01-12 -1.157892 -0.370647 -1.344312  0.844885
2000-01-11  0.577046  0.404705 -1.715002 -1.039268
2000-01-10  0.113648 -0.673690 -1.478427  0.524988
2000-01-07  0.567020 -0.424972  0.276232 -1.087401
2000-01-06 -0.706771  0.721555 -1.039575  0.271860
2000-01-05 -2.104569 -0.861849 -0.494929  1.071804
2000-01-04 -0.173215  1.212112  0.119209 -1.044236
2000-01-03 -0.282863  0.469112 -1.509059 -1.135632
In [451]: s[(s < 0) & (s > -0.5)]
Out[451]:
2000-01-03   -0.282863
2000-01-04   -0.173215
Name: A
You may select rows from a DataFrame using a boolean vector the same length as the DataFrame's index (for example, something derived from one of the columns of the DataFrame):
In [452]: df[df['A'] > 0]
Out[452]:
                   A         B         C         D
2000-01-07  0.567020 -0.424972  0.276232 -1.087401
2000-01-10  0.113648 -0.673690 -1.478427  0.524988
2000-01-11  0.577046  0.404705 -1.715002 -1.039268
Consider the isin method of Series, which returns a boolean vector that is true wherever the Series elements exist in the passed list. This allows you to select out rows where one or more columns have values you want:
In [453]: df2 = DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
   .....:                  'b': ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
   .....:                  'c': np.random.randn(7)})

In [454]: df2[df2['a'].isin(['one', 'two'])]
Out[454]:
     a  b         c
0  one  x  1.075770
1  one  y -0.109050
2  two  y  1.643563
4  two  y  0.357021
5  one  x -0.674600
Note, with the advanced indexing ix method, you may select along more than one axis using boolean vectors combined with other indexing expressions.
In [457]: df2[df2 < 0] = 0

In [458]: df2
Out[458]:
                   A         B         C         D
2000-01-03  0.000000  0.469112  0.000000  0.000000
2000-01-04  0.000000  1.212112  0.119209  0.000000
2000-01-05  0.000000  0.000000  0.000000  1.071804
2000-01-06  0.000000  0.721555  0.000000  0.271860
2000-01-07  0.567020  0.000000  0.276232  0.000000
2000-01-10  0.113648  0.000000  0.000000  0.524988
2000-01-11  0.577046  0.404705  0.000000  0.000000
2000-01-12  0.000000  0.000000  0.000000  0.844885
Note that such an operation requires that the boolean DataFrame is indexed exactly the same.
Slicing with labels is semantically slightly different because the slice start and stop are inclusive in the label-based case:
In [475]: date1, date2 = dates[[2, 4]]

In [476]: print date1, date2
2000-01-05 00:00:00 2000-01-07 00:00:00

In [477]: df.ix[date1:date2]
Out[477]:
                   A         B         C         D
2000-01-05 -2.104569 -0.861849 -0.494929  1.071804
2000-01-06 -0.706771  0.721555 -1.039575  0.271860
2000-01-07  0.567020 -0.424972  0.276232 -1.087401
In [478]: df['A'].ix[date1:date2]
Out[478]:
2000-01-05   -2.104569
2000-01-06   -0.706771
2000-01-07    0.567020
Name: A
Getting and setting rows in a DataFrame, especially by their location, is much easier:
In [479]: df2 = df[:5].copy()

In [480]: df2.ix[3]
Out[480]:
A   -0.706771
B    0.721555
C   -1.039575
D    0.271860
Name: 2000-01-06 00:00:00

In [481]: df2.ix[3] = np.arange(len(df2.columns))

In [482]: df2
Out[482]:
                   A         B         C         D
2000-01-03 -0.282863  0.469112 -1.509059 -1.135632
2000-01-04 -0.173215  1.212112  0.119209 -1.044236
2000-01-05 -2.104569 -0.861849 -0.494929  1.071804
2000-01-06  0.000000  1.000000  2.000000  3.000000
2000-01-07  0.567020 -0.424972  0.276232 -1.087401
Column or row selection can be combined as you would expect with arrays of labels or even boolean vectors:
In [483]: df.ix[df['A'] > 0, 'B']
Out[483]:
2000-01-07   -0.424972
2000-01-10   -0.673690
2000-01-11    0.404705
Name: B

In [484]: df.ix[date1:date2, 'B']
Out[484]:
2000-01-05   -0.861849
2000-01-06    0.721555
2000-01-07   -0.424972
Name: B

In [485]: df.ix[date1, 'B']
Out[485]: -0.86184896334779992
Slicing with labels is closely related to the truncate method which does precisely .ix[start:stop] but returns a copy (for legacy reasons).
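For example, the following is a sketch equivalent to df.ix[date1:date2] above, except that a copy is returned:

df.truncate(before=date1, after=date2)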
Starting with pandas 0.5, the name, if set, will be shown in the console display:
In [500]: index = Index(range(5), name='rows')

In [501]: columns = Index(['A', 'B', 'C'], name='cols')

In [502]: df = DataFrame(np.random.randn(5, 3), index=index, columns=columns)

In [503]: df
Out[503]:
cols         A         B         C
rows
0     3.357427 -0.317441 -1.236269
1     0.896171 -0.487602 -0.082240
2    -2.182937  0.380396  0.084844
3     0.432390  1.519970 -0.493662
4     0.600178  0.274230  0.132885

In [504]: df['A']
Out[504]:
rows
0    3.357427
All of the MultiIndex constructors accept a names argument which stores string names for the levels themselves. If no names are provided, some arbitrary ones will be assigned:
In [517]: index.names
Out[517]: ['first', 'second']
This index can back any axis of a pandas object, and the number of levels of the index is up to you:
In [518]: df = DataFrame(randn(3, 8), index=['A', 'B', 'C'], columns=index)

In [519]: df
Out[519]:
first        bar                 baz                 foo                 qux
second       one       two       one       two       one       two       one       two
A       0.299368 -0.863838  0.408204 -1.048089 -0.025747 -0.988387  0.094055  1.262731
B       1.289997  0.082423 -0.055758  0.536580 -0.489682  0.369374 -0.034571 -2.484478
C      -0.281461  0.030711  0.109121  1.126203 -0.977349  1.474071 -0.064034 -1.282782

In [520]: DataFrame(randn(6, 6), index=index[:6], columns=index[:6])
Out[520]:
first              bar                 baz                 foo
second             one       two       one       two       one       two
first second
bar   one     0.781836 -1.071357  0.441153  2.353925  0.583787  0.221471
      two    -0.744471  0.758527  1.729689 -0.964980 -0.845696 -1.340896
baz   one     1.846883 -1.328865  1.682706 -1.717693  0.888782  0.228440
      two     0.901805  1.171216  0.520260 -1.197071 -1.066969 -0.303421
foo   one    -0.858447  0.306996 -0.028665  0.384316  1.574159  1.588931
      two     0.476720  0.473424 -0.242861 -0.014805 -0.284319  0.650776
We've sparsified the higher levels of the indexes to make the console output a bit easier on the eyes. It's worth keeping in mind that there's nothing preventing you from using tuples as atomic labels on an axis:
In [521]: Series(randn(8), index=tuples)
Out[521]:
(bar, one)   -1.461665
(bar, two)   -1.137707
(baz, one)   -0.891060
(baz, two)   -0.693921
(foo, one)    1.613616
(foo, two)    0.464000
(qux, one)    0.227371
(qux, two)   -0.496922
The reason that the MultiIndex matters is that it can allow you to do grouping, selection, and reshaping operations as we will describe below and in subsequent areas of the documentation. As you will see in later sections, you can find yourself working with hierarchically-indexed data without creating a MultiIndex explicitly yourself. However, when loading data from a file, you may wish to generate your own MultiIndex when preparing the data set.
In [526]: df['bar']['one']
Out[526]:
A    0.299368
B    1.289997
C   -0.281461
Name: one

In [527]: s['qux']
Out[527]:
second
one    1.063327
two    1.266143
reindex can be called with another MultiIndex or even a list or array of tuples:
In [530]: s.reindex(index[:3])
Out[530]:
first  second
bar    one      -0.023688
       two       2.410179
baz    one       1.450520

In [531]: s.reindex([('foo', 'two'), ('bar', 'one'), ('qux', 'one'), ('baz', 'one')])
Out[531]:
In [534]: df.ix['bar']
Out[534]:
               A         B         C
second
one     0.299368  1.289997 -0.281461
two    -0.863838  0.082423  0.030711

In [535]: df.ix['bar', 'two']
Out[535]:
A   -0.863838
B    0.082423
C    0.030711
Name: (bar, two)
In [538]: df.ix[('baz', 'two'):'foo']
Out[538]:
                      A         B         C
first second
baz   two     -1.048089  0.536580  1.126203
foo   one     -0.025747 -0.489682 -0.977349
      two     -0.988387  0.369374  1.474071
The following does not work, and it's not clear if it should or not:
>>> df.ix[['bar', 'qux']]
The code for implementing .ix makes every attempt to "do the right thing" but as you use it you may uncover corner cases or unintuitive behavior. If you do find something like this, do not hesitate to report the issue or ask on the mailing list.
In [544]: df2 = df.mean(level=0)

In [545]: print df2
              0         1
key_0
zero   0.631562 -0.310490
one   -0.414117 -1.926216

In [546]: print df2.reindex(df.index, level=0)
                0         1
one   y -0.414117 -1.926216
      x -0.414117 -1.926216
zero  y  0.631562 -0.310490
      x  0.631562 -0.310490

In [547]: df_aligned, df2_aligned = df.align(df2, level=0)

In [548]: print df_aligned
                0         1
one   y  0.306389 -2.290613
      x -1.134623 -1.561819
zero  y -0.260838  0.281957
      x  1.523962 -0.902937

In [549]: print df2_aligned
                0         1
one   y -0.414117 -1.926216
      x -0.414117 -1.926216
zero  y  0.631562 -0.310490
      x  0.631562 -0.310490
In [553]: s.sortlevel(0)
Out[553]:
bar  one   -1.069094
     two   -0.057873
baz  one    0.861209
     two    0.800193
foo  one   -0.368204
     two   -1.144073
qux  one    0.068159
     two    0.782098

In [554]: s.sortlevel(1)
Out[554]:
bar  one   -1.069094
baz  one    0.861209
foo  one   -0.368204
qux  one    0.068159
bar  two   -0.057873
baz  two    0.800193
foo  two   -1.144073
qux  two    0.782098
Note, you may also pass a level name to sortlevel if the MultiIndex levels are named.
In [555]: s.index.names = ['L1', 'L2']

In [556]: s.sortlevel(level='L1')
Out[556]:
L1   L2
bar  one   -1.069094
     two   -0.057873
baz  one    0.861209
     two    0.800193
foo  one   -0.368204
     two   -1.144073
qux  one    0.068159
     two    0.782098

In [557]: s.sortlevel(level='L2')
Out[557]:
L1   L2
bar  one   -1.069094
baz  one    0.861209
foo  one   -0.368204
qux  one    0.068159
bar  two   -0.057873
baz  two    0.800193
foo  two   -1.144073
qux  two    0.782098
Some indexing will work even if the data are not sorted, but will be rather inefficient and will also return a copy of the data rather than a view:
In [558]: s['qux']
Out[558]:
L2
one    0.068159
two    0.782098
On higher dimensional objects, you can sort any of the other axes by level if they have a MultiIndex:
In [560]: df.T.sortlevel(1, axis=1)
Out[560]:
       zero       one      zero       one
          x         x         y         y
0  1.523962 -1.134623 -0.260838  0.306389
1 -0.902937 -1.561819  0.281957 -2.290613
The MultiIndex object has code to explicitly check the sort depth. Thus, if you try to index at a depth at which the index is not sorted, it will raise an exception. Here is a concrete example to illustrate this:
In [561]: tuples = [('a', 'a'), ('a', 'b'), ('b', 'a'), ('b', 'b')]

In [562]: idx = MultiIndex.from_tuples(tuples)

In [563]: idx.lexsort_depth
Out[563]: 2

In [564]: reordered = idx[[1, 0, 3, 2]]

In [565]: reordered.lexsort_depth
Out[565]: 1

In [566]: s = Series(randn(4), index=reordered)

In [567]: s.ix['a':'a']
Out[567]:
a  b   -1.099248
   a    0.255269
However:
>>> s.ix[('a', 'b'):('b', 'a')]
Exception: MultiIndex lexsort depth 1, key was length 2
Out[569]:
                0         1
y  one   0.306389 -2.290613
x  one  -1.134623 -1.561819
y  zero -0.260838  0.281957
x  zero  1.523962 -0.902937
You can probably guess that the labels determine which unique element is identified with that location at each layer of the index. It's important to note that sortedness is determined solely from the integer labels and does not check (or care) whether the levels themselves are sorted. Fortunately, the constructors from_tuples and from_arrays ensure that this is true, but if you compute the levels and labels yourself, please be careful.
In [576]: indexed1 = data.set_index('c')

In [577]: indexed1
Out[577]:
     a    b  d
c
z  bar  one  1
y  bar  two  2
x  foo  one  3
w  foo  two  4

In [578]: indexed2 = data.set_index(['a', 'b'])

In [579]: indexed2
Out[579]:
          c  d
a   b
bar one   z  1
    two   y  2
foo one   x  3
    two   w  4
Other options in set_index allow you not to drop the index columns or to add the index in-place (without creating a new object):
In [580]: data.set_index('c', drop=False)
Out[580]:
     a    b  c  d
c
z  bar  one  z  1
y  bar  two  y  2
x  foo  one  x  3
w  foo  two  w  4

In [581]: df = data.set_index(['a', 'b'], inplace=True)

In [582]: data
Out[582]:
          c  d
a   b
bar one   z  1
    two   y  2
foo one   x  3
    two   w  4
The output is more similar to a SQL table or a record array. The names for the columns derived from the index are the ones stored in the names attribute. reset_index takes an optional parameter drop which if true simply discards the index, instead of putting index values in the DataFrame's columns. Note: The reset_index method used to be called delevel, which is now deprecated.
The motivation for having an Index class in the first place was to enable different implementations of indexing. This means that it's possible for you, the user, to implement a custom Index subclass that may be better suited to a particular application than the ones provided in pandas. For example, we plan to add a more efficient datetime index which leverages the new numpy.datetime64 dtype in the relatively near future. From an internal implementation point of view, the relevant methods that an Index must define are one or more of the following (depending on how incompatible the new object internals are with the Index functions):

get_loc: returns an "indexer" (an integer, or in some cases a slice object) for a label
slice_locs: returns the range to slice between two labels
get_indexer: Computes the indexing vector for reindexing / data alignment purposes. See the source / docstrings for more on this
reindex: Does any pre-conversion of the input index then calls get_indexer
union, intersection: computes the union or intersection of two Index objects
insert: Inserts a new label into an Index, yielding a new object
delete: Delete a label, yielding a new object
drop: Deletes a set of labels
take: Analogous to ndarray.take
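To make these contracts concrete, here is a minimal, illustrative sketch of a label-to-location mapping exposing a few of the methods above. It is not a working pandas Index subclass (the real Index is an ndarray subclass with many more responsibilities); all names here are hypothetical:

# Illustrative only: demonstrates the get_loc / slice_locs / union contracts
class ToyIndex(object):

    def __init__(self, labels):
        self.labels = list(labels)
        # label -> integer location, assuming unique labels
        self._engine = dict((lab, i) for i, lab in enumerate(self.labels))

    def get_loc(self, label):
        # return an integer "indexer" for a single label
        return self._engine[label]

    def slice_locs(self, start, end):
        # return the (begin, end) positions to slice between two labels,
        # inclusive of the end label as in label-based slicing
        return self._engine[start], self._engine[end] + 1

    def union(self, other):
        # combine the labels of two indexes, yielding a new object
        return ToyIndex(sorted(set(self.labels) | set(other.labels)))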
CHAPTER
EIGHT
COMPUTATIONAL TOOLS
8.1 Statistical functions
8.1.1 Covariance
The Series object has a method cov to compute covariance between series (excluding NA/null values).
In [157]: s1 = Series(randn(1000))

In [158]: s2 = Series(randn(1000))

In [159]: s1.cov(s2)
Out[159]: 0.019465636696791695
Analogously, DataFrame has a method cov to compute pairwise covariances among the series in the DataFrame, also excluding NA/null values.
In [160]: frame = DataFrame(randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e'])

In [161]: frame.cov()
Out[161]:
          a         b         c         d         e
a  0.953751 -0.029550 -0.006415  0.001020 -0.004134
b -0.029550  0.997223 -0.044276  0.005967  0.044884
c -0.006415 -0.044276  1.050236  0.077775  0.010642
d  0.001020  0.005967  0.077775  0.998485 -0.007345
e -0.004134  0.044884  0.010642 -0.007345  1.025446
8.1.2 Correlation
Several kinds of correlation methods are provided:

Method name          Description
pearson (default)    Standard correlation coefficient
kendall              Kendall Tau correlation coefficient
spearman             Spearman rank correlation coefficient
# Series with Series
In [164]: frame['a'].corr(frame['b'])
Out[164]: 0.013306883832198543

In [165]: frame['a'].corr(frame['b'], method='spearman')
Out[165]: 0.022530330121320486

# Pairwise correlation of DataFrame columns
In [166]: frame.corr()
Out[166]:
          a         b         c         d         e
a  1.000000  0.013307 -0.037801 -0.021905  0.001165
b  0.013307  1.000000 -0.017259  0.079246 -0.043606
c -0.037801 -0.017259  1.000000  0.061657  0.078945
d -0.021905  0.079246  0.061657  1.000000 -0.036978
e  0.001165 -0.043606  0.078945 -0.036978  1.000000
Note that non-numeric columns will be automatically excluded from the correlation calculation. A related method corrwith is implemented on DataFrame to compute the correlation between like-labeled Series contained in different DataFrame objects.
In [167]: index = ['a', 'b', 'c', 'd', 'e']

In [168]: columns = ['one', 'two', 'three', 'four']

In [169]: df1 = DataFrame(randn(5, 4), index=index, columns=columns)

In [170]: df2 = DataFrame(randn(4, 4), index=index[:4], columns=columns)

In [171]: df1.corrwith(df2)
Out[171]:
one      0.344149
two      0.837438
three    0.458904
four     0.712401

In [172]: df2.corrwith(df1, axis=1)
Out[172]:
a    0.404019
b    0.772204
c    0.420390
d   -0.142959
e         NaN
rank is also a DataFrame method and can rank either the rows (axis=0) or the columns (axis=1). NaN values are excluded from the ranking.
In [176]: df = DataFrame(np.random.randn(10, 6))

In [177]: df[4] = df[2][:5]  # some ties

In [178]: df
Out[178]:
          0         1         2         3         4         5
0  0.106333  0.712162 -0.351275  1.176287 -0.351275  1.741787
1 -1.301869  0.612432 -0.577677  0.124709 -0.577677 -1.068084
2 -0.899627  0.822023  1.506319  0.998896  1.506319  0.259080
3 -0.522705 -1.473680 -1.726800  1.555343 -1.726800 -1.411978
4  0.733147  0.415881 -0.026973  0.999488 -0.026973  0.082219
5  0.995001 -1.399355  0.082244 -1.521795       NaN  0.416180
6 -0.779714 -0.226893  0.956567 -0.443664       NaN -0.610675
7 -0.635495 -0.621647  0.406259 -0.279002       NaN -1.153000
8  0.085011 -0.459422 -1.660917 -1.913019       NaN  0.833479
9 -0.557052  0.775425  0.003794  0.555351       NaN -1.169977

In [179]: df.rank(1)
Out[179]:
   0  1    2  3    4  5
0  3  4  1.5  5  1.5  6
1  1  6  3.5  5  3.5  2
2  1  3  5.5  4  5.5  2
3  5  3  1.5  6  1.5  4
4  5  4  1.5  6  1.5  3
5  5  2  3.0  1  NaN  4
6  1  4  5.0  3  NaN  2
7  2  3  5.0  4  NaN  1
8  4  3  2.0  1  NaN  5
9  2  5  3.0  4  NaN  1
Note: These methods are significantly faster (around 10-20x) than scipy.stats.rankdata.
8.2 Moving (rolling) statistics / moments

Function                 Description
rolling_count            Number of non-null observations
rolling_sum              Sum of values
rolling_mean             Mean of values
rolling_median           Arithmetic median of values
rolling_min              Minimum
rolling_max              Maximum
rolling_std              Unbiased standard deviation
rolling_var              Unbiased variance
rolling_skew             Unbiased skewness (3rd moment)
rolling_kurt             Unbiased kurtosis (4th moment)
rolling_quantile         Sample quantile (value at %)
rolling_apply            Generic apply
rolling_cov              Unbiased covariance (binary)
rolling_corr             Correlation (binary)
rolling_corr_pairwise    Pairwise correlation of DataFrame columns
Generally these methods all have the same interface. The binary operators (e.g. rolling_corr) take two Series or DataFrames. Otherwise, they all accept the following arguments:

window: size of moving window
min_periods: threshold of non-null data points to require (otherwise result is NA)
time_rule: optionally specify a time rule to pre-conform the data to

These functions can be applied to ndarrays or Series objects:
In [180]: ts = Series(randn(1000), index=DateRange('1/1/2000', periods=1000))

In [181]: ts = ts.cumsum()

In [182]: ts.plot(style='k--')
Out[182]: <matplotlib.axes.AxesSubplot at 0x10e583e50>

In [183]: rolling_mean(ts, 60).plot(style='k')
Out[183]: <matplotlib.axes.AxesSubplot at 0x10e583e50>
They can also be applied to DataFrame objects. This is really just syntactic sugar for applying the moving window operator to all of the DataFrame's columns:
In [184]: df = DataFrame(randn(1000, 4), index=ts.index,
   .....:                columns=['A', 'B', 'C', 'D'])

In [185]: df = df.cumsum()

In [186]: rolling_sum(df, 60).plot(subplots=True)
Out[186]:
array([Axes(0.125,0.747826;0.775x0.152174),
       Axes(0.125,0.565217;0.775x0.152174),
       Axes(0.125,0.382609;0.775x0.152174),
       Axes(0.125,0.2;0.775x0.152174)], dtype=object)
You can efficiently retrieve the time series of correlations between two columns using ix indexing:
In [191]: correls.ix[:, 'A', 'C'].plot()
Out[191]: <matplotlib.axes.AxesSubplot at 0x10b800250>
You can pass one or the other to these functions but not both. Span corresponds to what is commonly called a "20-day EW moving average" for example, while center of mass has a more physical interpretation: the two are related by com = (span - 1) / 2, so span = 20 corresponds to com = 9.5. Here is the list of functions available:

Function    Description
ewma        EW moving average
ewmvar      EW moving variance
ewmstd      EW moving standard deviation
ewmcorr     EW moving correlation
ewmcov      EW moving covariance
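For instance, given com = (span - 1) / 2, the following two calls should be equivalent (a sketch reusing the ts series from the moving window examples above):

ewma(ts, span=20)
ewma(ts, com=9.5)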
Note: The EW functions perform a standard adjustment to the initial observations whereby if there are fewer observations than called for in the span, those observations are reweighted accordingly.
8.3 Linear and panel regression

Based on the types of y and x, the model will be inferred to be either a panel model or a regular linear model. If the y variable is a DataFrame, the result will be a panel model. In this case, the x variable must either be a Panel, or a dict of DataFrame (which will be coerced into a Panel).
If we had passed a Series instead of a DataFrame with the single GOOG column, the model would have assigned the generic name 'x' to the sole right-hand side variable. We can do a moving window regression to see how the relationship changes over time:
In [205]: model = ols(y=rets['AAPL'], x=rets.ix[:, ['GOOG']],
   .....:             window=250)

# just plot the coefficient for GOOG
In [206]: model.beta['GOOG'].plot()
Out[206]: <matplotlib.axes.AxesSubplot at 0x1105c8fd0>
It looks like there are some outliers rolling in and out of the window in the above regression, influencing the results. We could perform a simple winsorization at the 3 STD level to trim the impact of outliers:
In [207]: winz = rets.copy()

In [208]: std_1year = rolling_std(rets, 250, min_periods=20)

# cap at 3 * 1 year standard deviation
In [209]: cap_level = 3 * np.sign(winz) * std_1year

In [210]: winz[np.abs(winz) > 3 * std_1year] = cap_level

In [211]: winz_model = ols(y=winz['AAPL'], x=winz.ix[:, ['GOOG']],
   .....:                  window=250)

In [212]: model.beta['GOOG'].plot(label="With outliers")
Out[212]: <matplotlib.axes.AxesSubplot at 0x10fb56450>

In [213]: winz_model.beta['GOOG'].plot(label="Winsorized"); plt.legend(loc='best')
Out[213]: <matplotlib.legend.Legend at 0x10fd02550>
So in this simple example we see the impact of winsorization is actually quite significant. Note the correlation after winsorization remains high:
In [214]: winz.corrwith(rets)
Out[214]:
AAPL    0.997979
GOOG    0.970836
MSFT    0.998111
Multiple regressions can be run by passing a DataFrame with multiple columns for the predictors x:
In [215]: ols(y=winz['AAPL'], x=winz.drop(['AAPL'], axis=1))
Out[215]:

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <GOOG> + <MSFT> + <intercept>

Number of Observations:         251
Number of Degrees of Freedom:   3

R-squared:         0.4532
Adj R-squared:     0.4488

Rmse:              0.0121

F-stat (2, 248):   102.7814, p-value:     0.0000

Degrees of Freedom: model 2, resid 248

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
          GOOG     0.4581     0.0542       8.45     0.0000     0.3519     0.5643
          MSFT     0.2956     0.0632       4.68     0.0000     0.1718     0.4194
     intercept     0.0014     0.0008       1.87     0.0631    -0.0001     0.0029
---------------------------------End of Summary---------------------------------
Suppose we were interested in the relationship between returns and trading volume among a group of stocks, and we want to pool all the data together to run one big regression. This is actually quite easy:
# make the units somewhat comparable
In [216]: volume = panel['Volume'] / 1e8

In [217]: model = ols(y=volume, x={'return': np.abs(rets)})

In [218]: model
Out[218]:

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <return> + <intercept>

Number of Observations:         753
Number of Degrees of Freedom:   2

R-squared:         0.0182
Adj R-squared:     0.0168

Rmse:              0.2859

F-stat (1, 751):    13.8849, p-value:     0.0002

Degrees of Freedom: model 1, resid 751

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
        return     3.2352     0.8682       3.73     0.0002     1.5335     4.9369
     intercept     0.2273     0.0150      15.19     0.0000     0.1980     0.2567
---------------------------------End of Summary---------------------------------
In a panel model, we can insert dummy (0-1) variables for the "entities" involved (here, each of the stocks) to account for entity-specific effects (intercepts):
In [219]: fe_model = ols(y=volume, x={'return': np.abs(rets)},
   .....:                entity_effects=True)

In [220]: fe_model
Out[220]:

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <return> + <FE_GOOG> + <FE_MSFT> + <intercept>

Number of Observations:         753
Number of Degrees of Freedom:   4

R-squared:         0.7365
Adj R-squared:     0.7355

Rmse:              0.1483

F-stat (3, 749):   697.8591, p-value:     0.0000

Degrees of Freedom: model 3, resid 749

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
        return     4.4688     0.4512       9.90     0.0000     3.5845     5.3531
       FE_GOOG    -0.1416     0.0132     -10.69     0.0000    -0.1675    -0.1156
       FE_MSFT     0.4335     0.0133      32.71     0.0000     0.4076     0.4595
     intercept     0.1147     0.0110      10.44     0.0000     0.0932     0.1363
---------------------------------End of Summary---------------------------------
Because we ran the regression with an intercept, one of the dummy variables must be dropped or the design matrix will not be full rank. If we do not use an intercept, all of the dummy variables will be included:
In [221]: fe_model = ols(y=volume, x={'return': np.abs(rets)},
   .....:                entity_effects=True, intercept=False)

In [222]: fe_model
Out[222]:

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <return> + <FE_AAPL> + <FE_GOOG> + <FE_MSFT>

Number of Observations:         753
Number of Degrees of Freedom:   4

R-squared:         0.7365
Adj R-squared:     0.7355

Rmse:              0.1483

F-stat (4, 749):   697.8591, p-value:     0.0000

Degrees of Freedom: model 3, resid 749

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
        return     4.4688     0.4512       9.90     0.0000     3.5845     5.3531
       FE_AAPL     0.1147     0.0110      10.44     0.0000     0.0932     0.1363
       FE_GOOG    -0.0269     0.0111      -2.43     0.0154    -0.0485    -0.0052
       FE_MSFT     0.5483     0.0107      51.37     0.0000     0.5274     0.5692
---------------------------------End of Summary---------------------------------
We can also include time effects, which demeans the data cross-sectionally at each point in time (equivalent to including dummy variables for each date). More mathematical care must be taken to properly compute the standard errors in this case:
In [223]: te_model = ols(y=volume, x={'return': np.abs(rets)},
   .....:                time_effects=True, entity_effects=True)

In [224]: te_model
Out[224]:

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <return> + <FE_GOOG> + <FE_MSFT>

Number of Observations:         753
Number of Degrees of Freedom:   254

R-squared:         0.8122
Adj R-squared:     0.7170

Rmse:              0.1436

F-stat (3, 499):     8.5306, p-value:     0.0000

Degrees of Freedom: model 253, resid 499

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
        return     3.7138     0.6877       5.40     0.0000     2.3660     5.0617
       FE_GOOG    -0.1414     0.0128     -11.03     0.0000    -0.1665    -0.1162
       FE_MSFT     0.4325     0.0129      33.66     0.0000     0.4073     0.4577
---------------------------------End of Summary---------------------------------
Here the intercept (the mean term) is dropped by default because it will be 0 according to the model assumptions, having subtracted off the group means.
CHAPTER

NINE

WORKING WITH MISSING DATA
In [720]: df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

In [721]: df2
Out[721]:
        one       two     three four
a  0.059117  1.138469 -2.400634  bar
b       NaN       NaN       NaN  NaN
c -0.280853  0.025653 -1.386071  bar
d       NaN       NaN       NaN  NaN
e  0.863937  0.252462  1.500571  bar
f  1.053202 -2.338595 -0.374279  bar
g       NaN       NaN       NaN  NaN
h -2.359958 -1.157886 -0.551865  bar
Summary: NaN, inf, -inf, and None (in object arrays) are all considered missing by the isnull and notnull functions.
In [727]: a + b
Out[727]:
        one  three
a  0.118234    NaN
b       NaN    NaN
c -0.561707    NaN
d       NaN    NaN
e  1.727874    NaN
The descriptive statistics and computational methods discussed in the data structure overview (and listed here and here) are all written to account for missing data. For example:

When summing data, NA (missing) values will be treated as zero
If the data are all NA, the result will be NA
Methods like cumsum and cumprod ignore NA values, but preserve them in the resulting arrays
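A minimal illustration of these rules on a small Series (a sketch):

s = Series([1.0, np.nan, 2.0])
s.sum()                         # 3.0: the NA is skipped, i.e. treated as zero
Series([np.nan, np.nan]).sum()  # NaN: all of the data is missing
s.cumsum()                      # 1.0, NaN, 3.0: NA ignored but preserved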
In [728]: df
Out[728]:
        one       two
a  0.059117  1.138469
b       NaN       NaN
c -0.280853  0.025653
d       NaN       NaN
e  0.863937  0.252462
f  1.053202 -2.338595
g       NaN       NaN
h -2.359958 -1.157886

In [731]: df.cumsum()
Out[731]:
        one       two
a  0.059117  1.138469
b       NaN       NaN
c -0.221736  1.164122
d       NaN       NaN
e  0.642200  1.416584
f  1.695403 -0.922011
g       NaN       NaN
h -0.664556 -2.079897
In [733]: df2.fillna(0)
Out[733]:
        one       two     three four
a  0.059117  1.138469 -2.400634  bar
b  0.000000  0.000000  0.000000    0
c -0.280853  0.025653 -1.386071  bar
d  0.000000  0.000000  0.000000    0
e  0.863937  0.252462  1.500571  bar
f  1.053202 -2.338595 -0.374279  bar
g  0.000000  0.000000  0.000000    0
h -2.359958 -1.157886 -0.551865  bar
In [734]: df2['four'].fillna('missing')
Out[734]:
a        bar
b    missing
c        bar
d    missing
e        bar
f        bar
g    missing
h        bar
Name: four
Fill gaps forward or backward

Using the same filling arguments as reindexing, we can propagate non-null values forward or backward:
In [735]: df
Out[735]:
        one       two
a  0.059117  1.138469
b       NaN       NaN
c -0.280853  0.025653
d       NaN       NaN
e  0.863937  0.252462
f  1.053202 -2.338595
g       NaN       NaN
h -2.359958 -1.157886
In [736]: df.fillna(method='pad')
Out[736]:
        one       two     three
a  0.059117  1.138469 -2.400634
b  0.059117  1.138469 -2.400634
c -0.280853  0.025653 -1.386071
d -0.280853  0.025653 -1.386071
e  0.863937  0.252462  1.500571
f  1.053202 -2.338595 -0.374279
g  1.053202 -2.338595 -0.374279
h -2.359958 -1.157886 -0.551865
To remind you, these are the available filling methods:

Method            Action
pad / ffill       Fill values forward
bfill / backfill  Fill values backward

With time series data, using pad/ffill is extremely common so that the "last known value" is available at every time point.
        one       two     three
a  0.059117  1.138469 -2.400634
b       NaN  0.000000  0.000000
c -0.280853  0.025653 -1.386071
d       NaN  0.000000  0.000000
e  0.863937  0.252462  1.500571
f  1.053202 -2.338595 -0.374279
g       NaN  0.000000  0.000000
h -2.359958 -1.157886 -0.551865

In [738]: df.dropna(axis=0)
Out[738]:
        one       two     three
a  0.059117  1.138469 -2.400634
c -0.280853  0.025653 -1.386071
e  0.863937  0.252462  1.500571
f  1.053202 -2.338595 -0.374279
h -2.359958 -1.157886 -0.551865

In [739]: df.dropna(axis=1)
Out[739]:
        two     three
a  1.138469 -2.400634
b  0.000000  0.000000
c  0.025653 -1.386071
d  0.000000  0.000000
e  0.252462  1.500571
f -2.338595 -0.374279
g  0.000000  0.000000
h -1.157886 -0.551865

In [740]: df['one'].dropna()
Out[740]:
a    0.059117
c   -0.280853
e    0.863937
f    1.053202
h   -2.359958
Name: one
dropna is presently only implemented for Series and DataFrame, but will be eventually added to Panel. Series.dropna is a simpler method as it only has one axis to consider. DataFrame.dropna has considerably more options, which can be examined in the API.
9.3.3 Interpolation
A basic linear interpolate method has been implemented on Series with intended use for time series data. There has not been a great deal of demand for interpolation methods outside of the filling methods described above.
In [741]: fig, axes = plt.subplots(ncols=2, figsize=(8, 4))

In [742]: ts.plot(ax=axes[0])
Out[742]: <matplotlib.axes.AxesSubplot at 0x1122a7910>

In [743]: ts.interpolate().plot(ax=axes[1])
Out[743]: <matplotlib.axes.AxesSubplot at 0x10c160210>
In [744]: axes[0].set_title('Not interpolated')
Out[744]: <matplotlib.text.Text at 0x10c2ad490>

In [745]: axes[1].set_title('Interpolated')
Out[745]: <matplotlib.text.Text at 0x11485d310>

In [746]: plt.close('all')
In [750]: crit = (s > 0).reindex(range(8))

In [751]: crit
Out[751]:
0    False
1      NaN
2     True
3      NaN
4     True
5      NaN
6     True
7     True

In [752]: crit.dtype
Out[752]: dtype('object')
Ordinarily NumPy will complain if you try to use an object array (even if it contains boolean values) instead of a boolean array to get or set values from an ndarray (e.g. selecting values based on some criteria). If a boolean vector contains NAs, an exception will be generated:
In [753]: reindexed = s.reindex(range(8)).fillna(0)

In [754]: reindexed[crit]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/Users/wesm/code/pandas/doc/<ipython-input-754-2da204ed1ac7> in <module>()
----> 1 reindexed[crit]

/Users/wesm/code/pandas/pandas/core/series.py in __getitem__(self, key)
    388         # special handling of boolean data with NAs stored in object
    389         # arrays. Since we can't represent NA with dtype=bool
--> 390         if _is_bool_indexer(key):
    391             key = self._check_bool_indexer(key)
    392             key = np.asarray(key, dtype=bool)

/Users/wesm/code/pandas/pandas/core/common.py in _is_bool_indexer(key)
    321     if not lib.is_bool_array(key):
    322         if isnull(key).any():
--> 323             raise ValueError('cannot index with vector containing '
    324                              'NA / NaN values')
    325     return False

ValueError: cannot index with vector containing NA / NaN values
However, these can be filled in using fillna and it will work fine:
In [755]: reindexed[crit.fillna(False)]
Out[755]:
2    1.314232
4    0.690579
6    0.995761
7    2.396780

In [756]: reindexed[crit.fillna(True)]
Out[756]:
1    0.000000
2    1.314232
3    0.000000
4    0.690579
5    0.000000
6    0.995761
7    2.396780
CHAPTER

TEN

GROUP BY: SPLIT-APPLY-COMBINE
We aim to make operations like this natural and easy to express using pandas. We'll address each area of GroupBy functionality then provide some non-trivial examples / use cases.
The mapping can be specified many different ways:

A Python function, to be called on each of the axis labels
A list or NumPy array of the same length as the selected axis
A dict or Series, providing a label -> group name mapping
For DataFrame objects, a string indicating a column to be used to group. Of course df.groupby('A') is just syntactic sugar for df.groupby(df['A']), but it makes life simpler
A list of any of the above things

Collectively we refer to the grouping objects as the keys. For example, consider the following DataFrame:
In [358]: df = DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
   .....:                       'foo', 'bar', 'foo', 'foo'],
   .....:                 'B': ['one', 'one', 'two', 'three',
   .....:                       'two', 'two', 'one', 'three'],
   .....:                 'C': randn(8), 'D': randn(8)})

In [359]: df
Out[359]:
     A      B
0  foo    one
1  bar    one
2  foo    two
3  bar  three
4  foo    two
5  bar    two
6  foo    one
7  foo  three
These will split the DataFrame on its index (rows). We could also split by the columns:
In [362]: def get_letter_type(letter):
   .....:     if letter.lower() in 'aeiou':
   .....:         return 'vowel'
   .....:     else:
   .....:         return 'consonant'
   .....:

In [363]: grouped = df.groupby(get_letter_type, axis=1)
Note that no splitting occurs until it's needed. Creating the GroupBy object only verifies that you've passed a valid mapping. Note: Many kinds of complicated data manipulations can be expressed in terms of GroupBy operations (though they can't be guaranteed to be the most efficient implementation). You can get quite creative with the label mapping functions.
Calling the standard Python len function on the GroupBy object just returns the length of the groups dict, so it is largely just a convenience:
In [366]: grouped = df.groupby(['A', 'B'])

In [367]: grouped.groups
Out[367]:
{('bar', 'one'): [1],
 ('bar', 'three'): [3],
 ('bar', 'two'): [5],
 ('foo', 'one'): [0, 6],
 ('foo', 'three'): [7],
 ('foo', 'two'): [2, 4]}

In [368]: len(grouped)
Out[368]: 6
By default the group keys are sorted during the groupby operation. You may however pass sort=False for potential speedups:
In [369]: df2 = DataFrame({'X': ['B', 'B', 'A', 'A'], 'Y': [1, 2, 3, 4]})

In [370]: df2.groupby(['X'], sort=True).sum()
Out[370]:
   Y
X
A  7
B  3

In [371]: df2.groupby(['X'], sort=False).sum()
Out[371]:
   Y
X
B  3
A  7
In [373]: grouped = s.groupby(level=0)

In [374]: grouped.sum()
Out[374]:
first
bar    0.142048
baz   -0.811169
foo   -0.560041
qux   -0.953439
Name: result
If the MultiIndex has names specified, these can be passed instead of the level number:
In [375]: s.groupby(level='second').sum()
Out[375]:
second
one   -2.300857
two    0.118256
Name: result
The aggregation functions such as sum will take the level parameter directly. Additionally, the resulting index will be named according to the chosen level:
In [376]: s.sum(level='second')
Out[376]:
second
one   -2.300857
two    0.118256
Name: result
In [378]: s.groupby(level=['first', 'second']).sum()
Out[378]:
first  second
bar    doo       0.981751
baz    bee      -2.754270
foo    bop      -1.528539
qux    bop      -0.499427
Name: result
This is mainly syntactic sugar for the alternative and much more verbose:
In [382]: df['C'].groupby(df['A'])
Out[382]: <pandas.core.groupby.SeriesGroupBy at 0x113240d10>
Additionally this method avoids recomputing the internal grouping information derived from the passed key.
In the case of grouping by multiple keys, the group name will be a tuple:
In [385]: for name, group in df.groupby(['A', 'B']):
   .....:     print name
   .....:     print group
   .....:
('bar', 'one')
     A    B         C         D
1  bar  one -0.282863 -2.104569
('bar', 'three')
     A      B         C         D
3  bar  three -1.135632  1.071804
('bar', 'two')
     A    B         C         D
5  bar  two -0.173215 -0.706771
('foo', 'one')
     A    B         C         D
0  foo  one  0.469112 -0.861849
6  foo  one  0.119209 -1.039575
('foo', 'three')
     A      B         C        D
7  foo  three -1.044236  0.27186
('foo', 'two')
     A    B         C         D
2  foo  two -1.509059 -0.494929
4  foo  two  1.212112  0.721555
It's standard Python-fu, but remember you can unpack the tuple in the for loop statement if you wish: for (k1, k2), group in grouped:.
10.3 Aggregation
Once the GroupBy object has been created, several methods are available to perform a computation on the grouped data. An obvious one is aggregation via the aggregate or equivalently agg method:
In [386]: grouped = df.groupby('A')

In [387]: grouped.aggregate(np.sum)
Out[387]:
                      B         C         D
A
bar         onethreetwo -1.591710 -1.739537
foo  onetwotwoonethree -0.752861 -1.402938

In [388]: grouped = df.groupby(['A', 'B'])

In [389]: grouped.aggregate(np.sum)
Out[389]:
                  C         D
A   B
bar one   -0.282863 -2.104569
    three -1.135632  1.071804
    two   -0.173215 -0.706771
foo one    0.588321 -1.901424
    three -1.044236  0.271860
    two   -0.296946  0.226626
As you can see, the result of the aggregation will have the group names as the new index along the grouped axis. In the case of multiple keys, the result is a MultiIndex by default, though this can be changed by using the as_index option:
In [390]: grouped = df.groupby(['A', 'B'], as_index=False)

In [391]: grouped.aggregate(np.sum)
Out[391]:
     A      B         C         D
0  bar    one -0.282863 -2.104569
1  bar  three -1.135632  1.071804
2  bar    two -0.173215 -0.706771
3  foo    one  0.588321 -1.901424
4  foo  three -1.044236  0.271860
5  foo    two -0.296946  0.226626
In [392]: df.groupby('A', as_index=False).sum()
Out[392]:
     A         C         D
0  bar -1.591710 -1.739537
1  foo -0.752861 -1.402938
Note that you could use the reset_index DataFrame function to achieve the same result as the column names are stored in the resulting MultiIndex:
In [393]: df.groupby(['A', 'B']).sum().reset_index()
Out[393]:
     A      B         C         D
0  bar    one -0.282863 -2.104569
1  bar  three -1.135632  1.071804
2  bar    two -0.173215 -0.706771
3  foo    one  0.588321 -1.901424
4  foo  three -1.044236  0.271860
5  foo    two -0.296946  0.226626
If a dict is passed, the keys will be used to name the columns. Otherwise the function's name (stored in the function object) will be used.
In [396]: grouped['D'].agg({'result1': np.sum,
   .....:                   'result2': np.mean})
Out[396]:
      result1   result2
A
bar -1.739537 -0.579846
foo -1.402938 -0.280588
On a grouped DataFrame, you can pass a list of functions to apply to each column, which produces an aggregated result with a hierarchical index:
In [397]: grouped.agg([np.sum, np.mean, np.std])
Out[397]:
            C                             D
         mean       std       sum      mean       std       sum
A
bar -0.530570  0.526860 -1.591710 -0.579846  1.591986 -1.739537
foo -0.150572  1.113308 -0.752861 -0.280588  0.753219 -1.402938
Passing a dict of functions has different behavior by default, see the next section.
The function names can also be strings. In order for a string to be valid it must be either implemented on GroupBy or available via dispatching:
In [399]: grouped.agg({'C': 'sum', 'D': 'std'})
Out[399]:
            C         D
A
bar -1.591710  1.591986
foo -0.752861  0.753219
Of course sum and mean are implemented on pandas objects, so the above code would work even without the special versions via dispatching (see below).
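For example, the following two spellings should produce the same result (a sketch):

grouped.sum()
grouped.agg('sum')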
10.4 Transformation
The transform method returns an object that is indexed the same (same size) as the one being grouped. Thus, the passed transform function should return a result that is the same size as the group chunk. For example, suppose we wished to standardize a data set within a group:
In [402]: tsdf = DataFrame(randn(1000, 3),
   .....:                  index=DateRange('1/1/2000', periods=1000),
   .....:                  columns=['A', 'B', 'C'])
In [403]: tsdf
Out[403]:
<class 'pandas.core.frame.DataFrame'>
DateRange: 1000 entries, 2000-01-03 00:00:00 to 2003-10-31 00:00:00
offset: <1 BusinessDay>
Data columns:
A    1000  non-null values
B    1000  non-null values
C    1000  non-null values
dtypes: float64(3)

In [404]: zscore = lambda x: (x - x.mean()) / x.std()

In [405]: transformed = tsdf.groupby(lambda x: x.year).transform(zscore)
We would expect the result to now have mean 0 and standard deviation 1 within each group, which we can easily check:
In [406]: grouped = transformed.groupby(lambda x: x.year)

# OK, close enough to zero
In [407]: grouped.mean()
Out[407]:
       A  B  C
key_0
2000  -0 -0  0
2001   0  0  0
2002  -0 -0 -0
2003   0 -0  0

In [408]: grouped.std()
Out[408]:
       A  B  C
key_0
2000   1  1  1
2001   1  1  1
2002   1  1  1
2003   1  1  1
Aggregating in this style with a lambda works, but it's rather verbose and can be untidy if you need to pass additional arguments. Using a bit of metaprogramming cleverness, GroupBy now has the ability to dispatch method calls to the groups:
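A minimal sketch of dispatching, re-using the yearly grouping of tsdf from above:

grouped = tsdf.groupby(lambda x: x.year)
# verbose: pass a lambda that calls std on each group
grouped.agg(lambda x: x.std())
# dispatched: the method call is forwarded to each group
grouped.std()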
What is actually happening here is that a function wrapper is being generated. When invoked, it takes any passed arguments and invokes the function with any arguments on each group (in the above example, the std function). The results are then combined together much in the style of agg and transform (it actually uses apply to infer the gluing, documented next). This enables some operations to be carried out rather succinctly:
In [412]: tsdf.ix[::2] = np.nan

In [413]: grouped = tsdf.groupby(lambda x: x.year)

In [414]: grouped.fillna(method='pad')
Out[414]:
<class 'pandas.core.frame.DataFrame'>
DateRange: 1000 entries, 2000-01-03 00:00:00 to 2003-10-31 00:00:00
offset: <1 BusinessDay>
Data columns:
A    997  non-null values
B    997  non-null values
C    997  non-null values
dtypes: float64(3)
In this example, we chopped the collection of time series into yearly chunks, then independently called fillna on the groups.
In [416]: grouped = df.groupby('A')

# could also just call .describe()
In [417]: grouped['C'].apply(lambda x: x.describe())
Out[417]:
     count      mean       std       min       25%       50%       75%       max
A
bar      3 -0.530570  0.526860 -1.135632 -0.709248 -0.282863 -0.228039 -0.173215
foo      5 -0.150572  1.113308 -1.509059 -1.044236  0.119209  0.469112  1.212112
Suppose we wished to compute the standard deviation grouped by the A column. There is a slight problem, namely that we don't care about the data in column B. We refer to this as a "nuisance" column. If the passed aggregation function can't be applied to some columns, the troublesome columns will be (silently) dropped. Thus, this does not pose any problems:
In [422]: df.groupby('A').std()
Out[422]:
            C         D
A
bar  0.526860  1.591986
foo  1.113308  0.753219
CHAPTER ELEVEN
Given this DataFrame df:

          0         1         2         3
0  0.469112 -0.282863 -1.509059 -1.135632
1  1.212112 -0.173215  0.119209 -1.044236
2 -0.861849 -2.104569 -0.494929  1.071804
3  0.721555 -0.706771 -1.039575  0.271860
4 -0.424972  0.567020  0.276232 -1.087401
5 -0.673690  0.113648 -1.478427  0.524988
6  0.404705  0.577046 -1.715002 -1.039268
7 -0.370647 -1.157892 -1.344312  0.844885
8  1.075770 -0.109050  1.643563 -1.469388
9  0.357021 -0.674600 -1.776904 -0.968914

the concat function can glue arbitrary pieces of it back together:

# break it into pieces
In [622]: pieces = [df[:3], df[3:7], df[7:]]

In [623]: concatenated = concat(pieces)

In [624]: concatenated
Out[624]:
          0         1         2         3
0  0.469112 -0.282863 -1.509059 -1.135632
1  1.212112 -0.173215  0.119209 -1.044236
2 -0.861849 -2.104569 -0.494929  1.071804
3  0.721555 -0.706771 -1.039575  0.271860
4 -0.424972  0.567020  0.276232 -1.087401
5 -0.673690  0.113648 -1.478427  0.524988
6  0.404705  0.577046 -1.715002 -1.039268
7 -0.370647 -1.157892 -1.344312  0.844885
8  1.075770 -0.109050  1.643563 -1.469388
9  0.357021 -0.674600 -1.776904 -0.968914
Like its sibling function on ndarrays, numpy.concatenate, pandas.concat takes a list or dict of homogeneously-typed objects and concatenates them with some configurable handling of "what to do with the other axes":
concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
       keys=None, levels=None, names=None, verify_integrity=False)
- objs: list or dict of Series, DataFrame, or Panel objects. If a dict is passed, the sorted keys will be used as the keys argument, unless it is passed, in which case the values will be selected (see below)
- axis: {0, 1, ...}, default 0. The axis to concatenate along
- join: {'inner', 'outer'}, default 'outer'. How to handle indexes on the other axis(es). Outer for union and inner for intersection
- join_axes: list of Index objects. Specific indexes to use for the other n - 1 axes instead of performing inner/outer set logic
- keys: sequence, default None. Construct a hierarchical index using the passed keys as the outermost level. If multiple levels are passed, should contain tuples.
- levels: list of sequences, default None. If keys are passed, specific levels to use for the resulting MultiIndex. Otherwise they will be inferred from the keys
- names: list, default None. Names for the levels in the resulting hierarchical index
- verify_integrity: boolean, default False. Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation
- ignore_index: boolean, default False. If True, do not use the index values on the concatenation axis. The resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information.

Without a little bit of context and example, many of these arguments don't make much sense. Let's take the above example. Suppose we wanted to associate specific keys with each of the pieces of the chopped up DataFrame. We can do this using the keys argument:
In [625]: concatenated = concat(pieces, keys=['first', 'second', 'third'])

In [626]: concatenated
Out[626]:
                 0         1         2         3
first  0  0.469112 -0.282863 -1.509059 -1.135632
       1  1.212112 -0.173215  0.119209 -1.044236
       2 -0.861849 -2.104569 -0.494929  1.071804
second 3  0.721555 -0.706771 -1.039575  0.271860
       4 -0.424972  0.567020  0.276232 -1.087401
       5 -0.673690  0.113648 -1.478427  0.524988
       6  0.404705  0.577046 -1.715002 -1.039268
third  7 -0.370647 -1.157892 -1.344312  0.844885
       8  1.075770 -0.109050  1.643563 -1.469388
       9  0.357021 -0.674600 -1.776904 -0.968914
As you can see (if you've read the rest of the documentation), the resulting object has a hierarchical index. This means that we can now select out each chunk by key:
In [627]: concatenated.ix['second']
Out[627]:
          0         1         2         3
3  0.721555 -0.706771 -1.039575  0.271860
4 -0.424972  0.567020  0.276232 -1.087401
5 -0.673690  0.113648 -1.478427  0.524988
6  0.404705  0.577046 -1.715002 -1.039268
It's not a stretch to see how this can be very useful. More detail on this functionality appears below.
In [631]: concat([df.ix[:7, ['a', 'b']], df.ix[2:-2, ['c']],
   .....:         df.ix[-7:, ['d']]], axis=1)
Out[631]:
              a         b         c         d
BGx5h       NaN       NaN  0.974466 -2.006747
Muad1  0.895717  0.805244 -1.206412       NaN
PRh1h -1.294524  0.413738       NaN       NaN
WiR40  1.431256  1.340309 -1.170299 -0.226169
YK0p6  0.410835  0.813850  0.132003 -0.827317
ZRbsV -1.413681  1.607920  1.024180  0.569605
ZW134 -0.076467 -1.187678  1.130127 -1.436737
ZepHk       NaN       NaN       NaN -0.727707
uJToG -0.013960 -0.362543       NaN       NaN
zCvoD       NaN       NaN       NaN -1.219217
Note that the row indexes have been unioned and sorted. Here is the same thing with join='inner':
In [632]: concat([df.ix[:7, ['a', 'b']], df.ix[2:-2, ['c']],
   .....:         df.ix[-7:, ['d']]], axis=1, join='inner')
Out[632]:
              a         b         c         d
WiR40  1.431256  1.340309 -1.170299 -0.226169
YK0p6  0.410835  0.813850  0.132003 -0.827317
ZW134 -0.076467 -1.187678  1.130127 -1.436737
ZRbsV -1.413681  1.607920  1.024180  0.569605
Lastly, suppose we just wanted to reuse the exact index from the original DataFrame:
In [633]: concat([df.ix[:7, ['a', 'b']], df.ix[2:-2, ['c']],
   .....:         df.ix[-7:, ['d']]], axis=1, join_axes=[df.index])
Out[633]:
              a         b         c         d
PRh1h -1.294524  0.413738       NaN       NaN
uJToG -0.013960 -0.362543       NaN       NaN
Muad1  0.895717  0.805244 -1.206412       NaN
WiR40  1.431256  1.340309 -1.170299 -0.226169
YK0p6  0.410835  0.813850  0.132003 -0.827317
ZW134 -0.076467 -1.187678  1.130127 -1.436737
ZRbsV -1.413681  1.607920  1.024180  0.569605
BGx5h       NaN       NaN  0.974466 -2.006747
zCvoD       NaN       NaN       NaN -1.219217
ZepHk       NaN       NaN       NaN -0.727707
In the case of DataFrame, the indexes must be disjoint but the columns do not need to be:
In [638]: df = DataFrame(randn(6, 4), index=DateRange('1/1/2000', periods=6),
   .....:                columns=['A', 'B', 'C', 'D'])

In [639]: df1 = df.ix[:3]
In [640]: df2 = df.ix[3:, :3]

In [641]: df1
Out[641]:
                   A         B         C         D
2000-01-03  0.176444  0.403310 -0.154951  0.301624
2000-01-04 -2.179861 -1.369849 -0.954208  1.462696
2000-01-05 -1.743161 -0.826591 -0.345352  1.314232

In [642]: df2
Out[642]:
                   A         B         C
2000-01-06  0.690579  0.995761  2.396780
2000-01-07  3.357427 -0.317441 -1.236269
2000-01-10 -0.487602 -0.082240 -2.182937

In [643]: df1.append(df2)
Out[643]:
                   A         B         C         D
2000-01-03  0.176444  0.403310 -0.154951  0.301624
2000-01-04 -2.179861 -1.369849 -0.954208  1.462696
2000-01-05 -1.743161 -0.826591 -0.345352  1.314232
2000-01-06  0.690579  0.995761  2.396780       NaN
2000-01-07  3.357427 -0.317441 -1.236269       NaN
2000-01-10 -0.487602 -0.082240 -2.182937       NaN
Note: Unlike the list.append method, which appends to the original list and returns nothing, append here does not modify df1; it returns a new copy with df2 appended.
In [650]: df1
Out[650]:
          A         B         C
0  0.084844  0.432390  1.519970
1  0.600178  0.274230  0.132885
2  2.410179  1.450520  0.206053
3 -2.213588  1.063327  1.266143
4 -0.863838  0.408204 -1.048089
5 -0.988387  0.094055  1.262731
In [651]: df2
Out[651]:
          A         B         C         D
0  0.082423 -0.055758  0.536580 -0.489682
1  0.369374 -0.034571 -2.484478 -0.281461
2  0.030711  0.109121  1.126203 -0.977349
Given this DataFrame:

          0         1         2         3
0  1.474071 -0.064034 -1.282782  0.781836
1 -1.071357  0.441153  2.353925  0.583787
2  0.221471 -0.744471  0.758527  1.729689
3 -0.964980 -0.845696 -1.340896  1.846883
4 -1.328865  1.682706 -1.717693  0.888782
5  0.228440  0.901805  1.171216  0.520260
6 -1.197071 -1.066969 -0.303421 -0.858447
7  0.306996 -0.028665  0.384316  1.574159
8  1.588931  0.476720  0.473424 -0.242861
9 -0.014805 -0.284319  0.650776 -1.461665

# break it into pieces
In [656]: pieces = [df.ix[:, [0, 1]], df.ix[:, [2]], df.ix[:, [3]]]

In [657]: result = concat(pieces, axis=1, keys=['one', 'two', 'three'])

In [658]: result
Out[658]:
        one                 two     three
          0         1         2         3
0  1.474071 -0.064034 -1.282782  0.781836
1 -1.071357  0.441153  2.353925  0.583787
2  0.221471 -0.744471  0.758527  1.729689
3 -0.964980 -0.845696 -1.340896  1.846883
4 -1.328865  1.682706 -1.717693  0.888782
5  0.228440  0.901805  1.171216  0.520260
6 -1.197071 -1.066969 -0.303421 -0.858447
7  0.306996 -0.028665  0.384316  1.574159
8  1.588931  0.476720  0.473424 -0.242861
9 -0.014805 -0.284319  0.650776 -1.461665
You can also pass a dict to concat, in which case the dict keys will be used for the keys argument (unless other keys are specified):
In [659]: pieces = {'one': df.ix[:, [0, 1]],
   .....:           'two': df.ix[:, [2]],
   .....:           'three': df.ix[:, [3]]}

In [660]: concat(pieces, axis=1)
Out[660]:
        one               three       two
          0         1         3         2
0  1.474071 -0.064034  0.781836 -1.282782
1 -1.071357  0.441153  0.583787  2.353925
2  0.221471 -0.744471  1.729689  0.758527
3 -0.964980 -0.845696  1.846883 -1.340896
4 -1.328865  1.682706  0.888782 -1.717693
5  0.228440  0.901805  0.520260  1.171216
6 -1.197071 -1.066969 -0.858447 -0.303421
7  0.306996 -0.028665  1.574159  0.384316
8  1.588931  0.476720 -0.242861  0.473424
9 -0.014805 -0.284319 -1.461665  0.650776

In [661]: concat(pieces, keys=['three', 'two'])
Out[661]:
                2         3
three 0       NaN  0.781836
      1       NaN  0.583787
      2       NaN  1.729689
      3       NaN  1.846883
      4       NaN  0.888782
      5       NaN  0.520260
      6       NaN -0.858447
      7       NaN  1.574159
      8       NaN -0.242861
      9       NaN -1.461665
two   0 -1.282782       NaN
      1  2.353925       NaN
      2  0.758527       NaN
      3 -1.340896       NaN
      4 -1.717693       NaN
      5  1.171216       NaN
      6 -0.303421       NaN
      7  0.384316       NaN
      8  0.473424       NaN
      9  0.650776       NaN
The MultiIndex created has levels that are constructed from the passed keys and the columns of the DataFrame pieces:
In [662]: result.columns.levels
Out[662]: [Index(['one', 'two', 'three'], dtype=object), Int64Index([0, 1, 2, 3])]
If you wish to specify other levels (as will occasionally be the case), you can do so using the levels argument:
In [663]: result = concat(pieces, axis=1, keys=['one', 'two', 'three'],
   .....:                 levels=[['three', 'two', 'one', 'zero']],
   .....:                 names=['group_key'])

In [664]: result
Out[664]:
group_key       one                 two     three
                  0         1         2         3
0          1.474071 -0.064034 -1.282782  0.781836
1         -1.071357  0.441153  2.353925  0.583787
2          0.221471 -0.744471  0.758527  1.729689
3         -0.964980 -0.845696 -1.340896  1.846883
4         -1.328865  1.682706 -1.717693  0.888782
5          0.228440  0.901805  1.171216  0.520260
6         -1.197071 -1.066969 -0.303421 -0.858447
7          0.306996 -0.028665  0.384316  1.574159
8          1.588931  0.476720  0.473424 -0.242861
9         -0.014805 -0.284319  0.650776 -1.461665
In [665]: result.columns.levels
Out[665]: [Index(['three', 'two', 'one', 'zero'], dtype=object), Int64Index([0, 1, 2, 3])]
Yes, this is fairly esoteric, but it is actually necessary for implementing things like GroupBy, where the order of a categorical variable is meaningful.
Given this DataFrame:

          A         B         C         D
0 -1.137707 -0.891060 -0.693921  1.613616
1  0.464000  0.227371 -0.496922  0.306389
2 -2.290613 -1.134623 -1.561819 -0.260838
3  0.281957  1.523962 -0.902937  0.068159
4 -0.057873 -0.368204 -1.144073  0.861209
5  0.800193  0.782098 -1.069094 -1.099248
6  0.255269  0.009750  0.661084  0.379319
7 -0.008434  1.952541 -1.056652  0.533946
In [668]: s = df.xs(3)

In [669]: df.append(s, ignore_index=True)
Out[669]:
          A         B         C         D
0 -1.137707 -0.891060 -0.693921  1.613616
1  0.464000  0.227371 -0.496922  0.306389
2 -2.290613 -1.134623 -1.561819 -0.260838
3  0.281957  1.523962 -0.902937  0.068159
4 -0.057873 -0.368204 -1.144073  0.861209
5  0.800193  0.782098 -1.069094 -1.099248
6  0.255269  0.009750  0.661084  0.379319
7 -0.008434  1.952541 -1.056652  0.533946
8  0.281957  1.523962 -0.902937  0.068159
You should use ignore_index with this method to instruct DataFrame to discard its index. If you wish to preserve the index, you should construct an appropriately-indexed DataFrame and append or concatenate those objects. You can also pass a list of dicts or Series:
In [670]: df = DataFrame(np.random.randn(5, 4),
   .....:                columns=['foo', 'bar', 'baz', 'qux'])

In [671]: dicts = [{'foo': 1, 'bar': 2, 'baz': 3, 'peekaboo': 4},
   .....:          {'foo': 5, 'bar': 6, 'baz': 7, 'peekaboo': 8}]

In [672]: result = df.append(dicts, ignore_index=True)

In [673]: result
Out[673]:
        bar       baz       foo  peekaboo       qux
0  0.040403 -0.507516 -1.226970       NaN -0.230096
1 -1.934370 -1.652499  0.394500       NaN  1.488753
2  0.576897  1.146000 -0.896484       NaN  1.487349
3  2.121453  0.597701  0.604603       NaN  0.563700
4 -1.057909  1.375020  0.967661       NaN -0.928797
5  2.000000  3.000000  1.000000         4       NaN
6  6.000000  7.000000  5.000000         8       NaN
pandas provides a single function, merge, as the entry point for all standard database join operations between DataFrame objects:
merge(left, right, how='inner', on=None, left_on=None, right_on=None,
      left_index=False, right_index=False, sort=True,
      suffixes=('.x', '.y'), copy=True)
Here's a description of what each argument is for:
- left: A DataFrame object
- right: Another DataFrame object
- on: Columns (names) to join on. Must be found in both the left and right DataFrame objects. If not passed and left_index and right_index are False, the intersection of the columns in the DataFrames will be inferred to be the join keys
- left_on: Columns from the left DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame
- right_on: Columns from the right DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame
- left_index: If True, use the index (row labels) from the left DataFrame as its join key(s). In the case of a DataFrame with a MultiIndex (hierarchical), the number of levels must match the number of join keys from the right DataFrame
- right_index: Same usage as left_index for the right DataFrame
- how: One of 'left', 'right', 'outer', 'inner'. Defaults to 'inner'. See below for a more detailed description of each method
- sort: Sort the result DataFrame by the join keys in lexicographical order. Defaults to True; setting to False will improve performance substantially in many cases
- suffixes: A tuple of string suffixes to apply to overlapping columns. Defaults to ('.x', '.y').
- copy: Always copy data (default True) from the passed DataFrame objects, even when reindexing is not necessary. Cannot be avoided in many cases but may improve performance / memory usage. The cases where copying can be avoided are somewhat pathological, but this option is provided nonetheless.

merge is a function in the pandas namespace, and it is also available as a DataFrame instance method, with the calling DataFrame being implicitly considered the left object in the join. The related DataFrame.join method uses merge internally for the index-on-index and index-on-column(s) joins, but joins on indexes by default rather than trying to join on common columns (the default behavior for merge). If you are joining on index, you may wish to use DataFrame.join to save yourself some typing.
Note: When joining columns on columns (potentially a many-to-many join), any indexes on the passed DataFrame objects will be discarded. It is worth spending some time understanding the result of the many-to-many join case. In SQL / standard relational algebra, if a key combination appears more than once in both tables, the resulting table will have the Cartesian product of the associated data. Here is a very basic example with one unique key combination:
In [674]: left = DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})

In [675]: right = DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

In [676]: left
Out[676]:
   key  lval
0  foo     1
1  foo     2

In [677]: right
Out[677]:
   key  rval
0  foo     4
1  foo     5

In [678]: merge(left, right, on='key')
Out[678]:
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5
The how argument to merge specifies how to determine which keys are to be included in the resulting table. If a key combination does not appear in either the left or right tables, the values in the joined table will be NA. Here is a summary of the how options and their SQL equivalent names:

Merge method    SQL Join Name        Description
left            LEFT OUTER JOIN      Use keys from left frame only
right           RIGHT OUTER JOIN     Use keys from right frame only
outer           FULL OUTER JOIN      Use union of keys from both frames
inner           INNER JOIN           Use intersection of keys from both frames
Note that if using the index from either the left or right DataFrame (or both) using the left_index / right_index options, the join operation is no longer a many-to-many join by construction, as the index values are necessarily unique. There will be some examples of this below.
In [689]: df1.join(df2, how='outer')
Out[689]:
          A         B         C         D
0       NaN       NaN  0.377953  0.493672
1 -2.461467 -1.553902  2.015523 -1.833722
2  1.771740 -0.670027  0.049307 -0.521493
3 -3.201750  0.792716  0.146111  1.903247
4 -0.747169 -0.309038  0.393876  1.861468
5  0.936527  1.255746 -2.655452  1.219492
6  0.062297 -0.110388       NaN       NaN
7  0.077849  0.629498       NaN       NaN

In [690]: df1.join(df2, how='inner')
Out[690]:
          A         B         C         D
1 -2.461467 -1.553902  2.015523 -1.833722
2  1.771740 -0.670027  0.049307 -0.521493
3 -3.201750  0.792716  0.146111  1.903247
4 -0.747169 -0.309038  0.393876  1.861468
5  0.936527  1.255746 -2.655452  1.219492
The data alignment here is on the indexes (row labels). This same behavior can be achieved using merge plus additional arguments instructing it to use the indexes:
In [691]: merge(df1, df2, left_index=True, right_index=True, how='outer')
Out[691]:
          A         B         C         D
0       NaN       NaN  0.377953  0.493672
1 -2.461467 -1.553902  2.015523 -1.833722
2  1.771740 -0.670027  0.049307 -0.521493
3 -3.201750  0.792716  0.146111  1.903247
4 -0.747169 -0.309038  0.393876  1.861468
5  0.936527  1.255746 -2.655452  1.219492
6  0.062297 -0.110388       NaN       NaN
7  0.077849  0.629498       NaN       NaN
Obviously you can choose whichever form you find more convenient. For many-to-one joins (where one of the DataFrames is already indexed by the join key), using join may be more convenient. Here is a simple example:
In [692]: df['key'] = ['foo', 'bar'] * 4

In [693]: to_join = DataFrame(randn(2, 2), index=['bar', 'foo'],
   .....:                     columns=['j1', 'j2'])

In [694]: df
Out[694]:
          A         B         C         D  key
0 -0.308853 -0.681087  0.377953  0.493672  foo
1 -2.461467 -1.553902  2.015523 -1.833722  bar
2  1.771740 -0.670027  0.049307 -0.521493  foo
3 -3.201750  0.792716  0.146111  1.903247  bar
4 -0.747169 -0.309038  0.393876  1.861468  foo
5  0.936527  1.255746 -2.655452  1.219492  bar
6  0.062297 -0.110388 -1.184357 -0.558081  foo
7  0.077849  0.629498 -1.035260 -0.438229  bar
In [695]: to_join
Out[695]:
           j1        j2
bar  0.503703  0.413086
foo -1.139050  0.660342

In [696]: df.join(to_join, on='key')
Out[696]:
          A         B         C         D  key        j1        j2
0 -0.308853 -0.681087  0.377953  0.493672  foo -1.139050  0.660342
1 -2.461467 -1.553902  2.015523 -1.833722  bar  0.503703  0.413086
2  1.771740 -0.670027  0.049307 -0.521493  foo -1.139050  0.660342
3 -3.201750  0.792716  0.146111  1.903247  bar  0.503703  0.413086
4 -0.747169 -0.309038  0.393876  1.861468  foo -1.139050  0.660342
5  0.936527  1.255746 -2.655452  1.219492  bar  0.503703  0.413086
6  0.062297 -0.110388 -1.184357 -0.558081  foo -1.139050  0.660342
7  0.077849  0.629498 -1.035260 -0.438229  bar  0.503703  0.413086
In [697]: merge(df, to_join, left_on='key', right_index=True,
   .....:       how='left', sort=False)
Out[697]:
          A         B         C         D  key        j1        j2
0 -0.308853 -0.681087  0.377953  0.493672  foo -1.139050  0.660342
1 -2.461467 -1.553902  2.015523 -1.833722  bar  0.503703  0.413086
2  1.771740 -0.670027  0.049307 -0.521493  foo -1.139050  0.660342
3 -3.201750  0.792716  0.146111  1.903247  bar  0.503703  0.413086
4 -0.747169 -0.309038  0.393876  1.861468  foo -1.139050  0.660342
5  0.936527  1.255746 -2.655452  1.219492  bar  0.503703  0.413086
6  0.062297 -0.110388 -1.184357 -0.558081  foo -1.139050  0.660342
7  0.077849  0.629498 -1.035260 -0.438229  bar  0.503703  0.413086
In [703]: data = DataFrame({'key1': key1, 'key2': key2,
   .....:                   'data': data})

In [704]: data
Out[704]:
       data  key1   key2
0 -1.004168   bar    two
1 -1.377627   bar    one
2  0.499281   bar  three
3 -1.405256   foo    one
4  0.162565   foo    two
5 -0.067785   baz    one
6 -1.260006   baz    two
7 -1.132896   qux    two
8 -2.006481   qux  three
9  0.301016  snap    one

In [705]: to_join
Out[705]:
                  j_one     j_two   j_three
first second
foo   one      0.464794 -0.309337 -0.649593
      two      0.683758 -0.643834  0.421287
      three    1.032814 -1.290493  0.787872
bar   one      1.515707 -0.276487 -0.223762
      two      1.397431  1.503874 -0.478905
baz   two     -0.135950 -0.730327 -0.033277
      three    0.281151 -1.298915 -2.819487
qux   one     -0.851985 -1.106952 -0.937731
      two     -1.537770  0.555759 -2.277282
      three   -0.390201  1.207122  0.178690
Now this can be joined by passing the two key column names:
In [706]: data.join(to_join, on=['key1', 'key2'])
Out[706]:
       data  key1   key2     j_one     j_two   j_three
0 -1.004168   bar    two  1.397431  1.503874 -0.478905
1 -1.377627   bar    one  1.515707 -0.276487 -0.223762
2  0.499281   bar  three       NaN       NaN       NaN
3 -1.405256   foo    one  0.464794 -0.309337 -0.649593
4  0.162565   foo    two  0.683758 -0.643834  0.421287
5 -0.067785   baz    one       NaN       NaN       NaN
6 -1.260006   baz    two -0.135950 -0.730327 -0.033277
7 -1.132896   qux    two -1.537770  0.555759 -2.277282
8 -2.006481   qux  three -0.390201  1.207122  0.178690
9  0.301016  snap    one       NaN       NaN       NaN
The default for DataFrame.join is to perform a left join (essentially a VLOOKUP operation, for Excel users), which uses only the keys found in the calling DataFrame. Other join types, for example inner join, can be just as easily performed:
In [707]: data.join(to_join, on=['key1', 'key2'], how='inner')
Out[707]:
       data key1   key2     j_one     j_two   j_three
0 -1.004168  bar    two  1.397431  1.503874 -0.478905
1 -1.377627  bar    one  1.515707 -0.276487 -0.223762
3 -1.405256  foo    one  0.464794 -0.309337 -0.649593
4  0.162565  foo    two  0.683758 -0.643834  0.421287
6 -1.260006  baz    two -0.135950 -0.730327 -0.033277
7 -1.132896  qux    two -1.537770  0.555759 -2.277282
8 -2.006481  qux  three -0.390201  1.207122  0.178690
As you can see, this drops any rows where there was no match.
CHAPTER TWELVE
For the curious here is how the above DataFrame was created:
import pandas.util.testing as tm; tm.N = 3

def unpivot(frame):
    N, K = frame.shape
    data = {'value': frame.values.ravel('F'),
            'variable': np.asarray(frame.columns).repeat(N),
            'date': np.tile(np.asarray(frame.index), K)}
    return DataFrame(data, columns=['date', 'variable', 'value'])

df = unpivot(tm.makeTimeDataFrame())
But suppose we wish to do time series operations with the variables. A better representation would be one where the columns are the unique variables and an index of dates identifies individual observations. To reshape the data into this form, use the pivot function:
In [762]: df.pivot(index='date', columns='variable', values='value')
Out[762]:
variable           A         B         C         D
date
2000-01-03  0.469112 -1.135632  0.119209 -2.104569
2000-01-04 -0.282863  1.212112 -1.044236 -0.494929
2000-01-05 -1.509059 -0.173215 -0.861849  1.071804
If the values argument is omitted, and the input DataFrame has more than one column of values which are not used as column or index inputs to pivot, then the resulting pivoted DataFrame will have hierarchical columns whose topmost level indicates the respective value column:
In [763]: df['value2'] = df['value'] * 2

In [764]: pivoted = df.pivot('date', 'variable')

In [765]: pivoted
Out[765]:
               value                                  value2
variable           A         B         C         D         A         B         C         D
date
2000-01-03  0.469112 -1.135632  0.119209 -2.104569  0.938225 -2.271265  0.238417 -4.209138
2000-01-04 -0.282863  1.212112 -1.044236 -0.494929 -0.565727  2.424224 -2.088472 -0.989859
2000-01-05 -1.509059 -0.173215 -0.861849  1.071804 -3.018117 -0.346429 -1.723698  2.143608
You can of course then select subsets from the pivoted DataFrame:
In [766]: pivoted['value2']
Out[766]:
variable           A         B         C         D
date
2000-01-03  0.938225 -2.271265  0.238417 -4.209138
2000-01-04 -0.565727  2.424224 -2.088472 -0.989859
2000-01-05 -3.018117 -0.346429 -1.723698  2.143608
Note that this returns a view on the underlying data in the case where the data are homogeneously-typed.
In [769]: df = DataFrame(randn(8, 2), index=index, columns=['A', 'B'])

In [770]: df2 = df[:4]

In [771]: df2
Out[771]:
                     A         B
first second
bar   one     0.721555 -0.706771
      two    -1.039575  0.271860
baz   one    -0.424972  0.567020
      two     0.276232 -1.087401
The stack function "compresses" a level in the DataFrame's columns to produce either:
- a Series, in the case of a simple column Index
- a DataFrame, in the case of a MultiIndex in the columns

If the columns have a MultiIndex, you can choose which level to stack. The stacked level becomes the new lowest level in a MultiIndex on the index:
In [772]: stacked = df2.stack()

In [773]: stacked
Out[773]:
first  second
bar    one     A    0.721555
               B   -0.706771
       two     A   -1.039575
               B    0.271860
baz    one     A   -0.424972
               B    0.567020
       two     A    0.276232
               B   -1.087401
With a stacked DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack is unstack, which by default unstacks the last level:
In [774]: stacked.unstack()
Out[774]:
                      A         B
first second
bar   one      0.721555 -0.706771
      two     -1.039575  0.271860
baz   one     -0.424972  0.567020
      two      0.276232 -1.087401

In [775]: stacked.unstack(1)
Out[775]:
second         one       two
first
bar    A  0.721555 -1.039575
       B -0.706771  0.271860
baz    A -0.424972  0.276232
       B  0.567020 -1.087401

In [776]: stacked.unstack(0)
Out[776]:
first          bar       baz
second
one    A  0.721555 -0.424972
       B -0.706771  0.567020
two    A -1.039575  0.276232
       B  0.271860 -1.087401
If the indexes have names, you can use the level names instead of specifying the level numbers:
In [777]: stacked.unstack('second')
Out[777]:
second         one       two
first
bar    A  0.721555 -1.039575
       B -0.706771  0.271860
baz    A -0.424972  0.276232
       B  0.567020 -1.087401
You may also stack or unstack more than one level at a time by passing a list of levels, in which case the end result is as if each level in the list were processed individually. These functions are intelligent about handling missing data and do not expect each subgroup within the hierarchical index to have the same set of labels. They also can handle the index being unsorted (but you can make it sorted by calling sortlevel, of course). Here is a more complex example:
In [778]: columns = MultiIndex.from_tuples([('A', 'cat'), ('B', 'dog'),
   .....:                                   ('B', 'cat'), ('A', 'dog')],
   .....:                                  names=['exp', 'animal'])

In [779]: df = DataFrame(randn(8, 4), index=index, columns=columns)

In [780]: df2 = df.ix[[0, 1, 2, 4, 5, 7]]

In [781]: df2
Out[781]:
exp                   A         B         B         A
animal              cat       dog       cat       dog
first second
bar   one     -0.370647 -1.157892 -1.344312  0.844885
      two      1.075770 -0.109050  1.643563 -1.469388
baz   one      0.357021 -0.674600 -1.776904 -0.968914
foo   one     -0.013960 -0.362543 -0.006154 -0.923061
      two      0.895717  0.805244 -1.206412  2.565646
qux   two      0.410835  0.813850  0.132003 -0.827317
As mentioned above, stack can be called with a level argument to select which level in the columns to stack:
In [782]: df2.stack('exp')
Out[782]:
animal                 cat       dog
first second exp
bar   one    A   -0.370647  0.844885
             B   -1.344312 -1.157892
      two    A    1.075770 -1.469388
             B    1.643563 -0.109050
baz   one    A    0.357021 -0.968914
             B   -1.776904 -0.674600
foo   one    A   -0.013960 -0.923061
             B   -0.006154 -0.362543
      two    A    0.895717  2.565646
             B   -1.206412  0.805244
qux   two    A    0.410835 -0.827317
             B    0.132003  0.813850

In [783]: df2.stack('animal')
Out[783]:
exp                         A         B
first second animal
bar   one    cat    -0.370647 -1.344312
             dog     0.844885 -1.157892
      two    cat     1.075770  1.643563
             dog    -1.469388 -0.109050
baz   one    cat     0.357021 -1.776904
             dog    -0.968914 -0.674600
foo   one    cat    -0.013960 -0.006154
             dog    -0.923061 -0.362543
      two    cat     0.895717 -1.206412
             dog     2.565646  0.805244
qux   two    cat     0.410835  0.132003
             dog    -0.827317  0.813850
Unstacking when the columns are a MultiIndex is also careful about doing the right thing:
In [784]: df[:3].unstack(0)
Out[784]:
exp             A                   B                   B                   A
animal        cat                 dog                 cat                 dog
first         bar       baz       bar       baz       bar       baz       bar       baz
second
one     -0.370647  0.357021 -1.157892 -0.674600 -1.344312 -1.776904  0.844885 -0.968914
two      1.075770       NaN -0.109050       NaN  1.643563       NaN -1.469388       NaN

In [785]: df2.unstack(1)
Out[785]:
exp             A                   B                   B                   A
animal        cat                 dog                 cat                 dog
second        one       two       one       two       one       two       one       two
first
bar     -0.370647  1.075770 -1.157892 -0.109050 -1.344312  1.643563  0.844885 -1.469388
baz      0.357021       NaN -0.674600       NaN -1.776904       NaN -0.968914       NaN
foo     -0.013960  0.895717 -0.362543  0.805244 -0.006154 -1.206412 -0.923061  2.565646
qux           NaN  0.410835       NaN  0.813850       NaN  0.132003       NaN -0.827317
In [787]: cheese
Out[787]:
  first  height last  weight
0  John     5.5  Doe     130
1  Mary     6.0   Bo     150

In [788]: melt(cheese, id_vars=['first', 'last'])
Out[788]:
  first last variable  value
0  John  Doe   height    5.5
1  Mary   Bo   height    6.0
2  John  Doe   weight  130.0
3  Mary   Bo   weight  150.0
In [792]: df.stack().groupby(level=1).mean()
Out[792]:
exp            A         B
second
one     0.016301 -0.644049
two     0.110588  0.346200

In [793]: df.mean().unstack(0)
Out[793]:
exp            A         B
animal
cat     0.311433 -0.431481
dog    -0.184544  0.133632
The pivot table examples below operate on this DataFrame:

df = DataFrame({'A': ['one', 'one', 'two', 'three'] * 6,
                'B': ['A', 'B', 'C'] * 8,
                'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4,
                'D': np.random.randn(24),
                'E': np.random.randn(24)})

        A  B    C         D         E
0     one  A  foo -0.076467  0.959726
1     one  B  foo -1.187678 -1.110336
2     two  C  foo  1.130127 -0.619976
3   three  A  bar -1.436737  0.149748
4     one  B  bar -1.413681 -0.732339
5     one  C  bar  1.607920  0.687738
6     two  A  foo  1.024180  0.176444
7   three  B  foo  0.569605  0.403310
8     one  C  foo  0.875906 -0.154951
9     one  A  bar -2.211372  0.301624
10    two  B  bar  0.974466 -2.179861
11  three  C  bar -2.006747 -1.369849
12    one  A  foo -0.410001 -0.954208
13    one  B  foo -0.078638  1.462696
14    two  C  foo  0.545952 -1.743161
15  three  A  bar -1.219217 -0.826591
16    one  B  bar -1.226825 -0.345352
17    one  C  bar  0.769804  1.314232
18    two  A  foo -1.281247  0.690579
19  three  B  foo -0.727707  0.995761
20    one  C  foo -0.121306  2.396780
21    one  A  bar -0.097883  0.014871
22    two  B  bar  0.695775  3.357427
23  three  C  bar  0.341734 -0.317441
The result object is a DataFrame having potentially hierarchical indexes on the rows and columns. If the values column name is not given, the pivot table will include all of the data that can be aggregated in an additional level of hierarchy in the columns:
In [799]: pivot_table(df, rows=['A', 'B'], cols=['C'])
Out[799]:
                  D                   E
C               bar       foo       bar       foo
A     B
one   A   -1.154627 -0.243234  0.158248  0.002759
      B   -1.320253 -0.633158 -0.538846  0.176180
      C    1.188862  0.377300  1.000985  1.120915
three A   -1.327977       NaN -0.338421       NaN
      B         NaN -0.079051       NaN  0.699535
      C   -0.832506       NaN -0.843645       NaN
two   A         NaN -0.128534       NaN  0.433512
      B    0.835120       NaN  0.588783       NaN
      C         NaN  0.838040       NaN -1.181568
You can render a nice output of the table omitting the missing values by calling to_string if you wish:
In [800]: table = pivot_table(df, rows=['A', 'B'], cols=['C'])

In [801]: print table.to_string(na_rep='')
                  D                   E
C               bar       foo       bar       foo
A     B
one   A   -1.154627 -0.243234  0.158248  0.002759
      B   -1.320253 -0.633158 -0.538846  0.176180
      C    1.188862  0.377300  1.000985  1.120915
three A   -1.327977           -0.338421
      B             -0.079051            0.699535
      C   -0.832506           -0.843645
two   A             -0.128534            0.433512
      B    0.835120            0.588783
      C              0.838040           -1.181568
In [802]: foo, bar, dull, shiny, one, two = 'foo', 'bar', 'dull', 'shiny', 'one', 'two'

In [803]: a = np.array([foo, foo, bar, bar, foo, foo], dtype=object)

In [804]: b = np.array([one, one, two, one, two, one], dtype=object)

In [805]: c = np.array([dull, dull, shiny, dull, dull, shiny], dtype=object)

In [806]: crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'])
Out[806]:
b     one         two
c    dull  shiny  dull  shiny
a
bar     1      0     0      1
foo     2      1     1      0
CHAPTER THIRTEEN
The basic DateOffset takes the same arguments as dateutil.relativedelta, which works like:
In [831]: d = datetime(2008, 8, 18)

In [832]: d + relativedelta(months=4, days=5)
Out[832]: datetime.datetime(2008, 12, 23, 0, 0)
The key features of a DateOffset object are:
- it can be added to / subtracted from a datetime object to obtain a shifted date
- it can be multiplied by an integer (positive or negative) so that the increment will be applied multiple times
- it has rollforward and rollback methods for moving a date forward or backward to the next or previous "offset date"

Subclasses of DateOffset define the apply function, which dictates custom date increment logic, such as adding business days:
class BDay(DateOffset):
    """DateOffset increments between business days"""
    def apply(self, other):
        ...

In [834]: d - 5 * BDay()
Out[834]: datetime.datetime(2008, 8, 11, 0, 0)

In [835]: d + BMonthEnd()
Out[835]: datetime.datetime(2008, 8, 29, 0, 0)
The rollforward and rollback methods do exactly what you would expect:
In [836]: d
Out[836]: datetime.datetime(2008, 8, 18, 0, 0)

In [837]: offset = BMonthEnd()

In [838]: offset.rollforward(d)
Out[838]: datetime.datetime(2008, 8, 29, 0, 0)

In [839]: offset.rollback(d)
Out[839]: datetime.datetime(2008, 7, 31, 0, 0)
It's definitely worth exploring the pandas.core.datetools module and the various docstrings for the classes.
These can be used as arguments to DateRange and various other time series-related functions in pandas.
<class 'pandas.core.daterange.DateRange'>
offset: <1 BusinessMonthEnd>, tzinfo: None
[2009-01-30 00:00:00, ..., 2009-12-31 00:00:00]
length: 12
Business day frequency is the default for DateRange. You can also strictly generate a DateRange of a certain length by providing either a start or end date and a periods argument:
In [848]: DateRange(start, periods=20)
Out[848]:
<class 'pandas.core.daterange.DateRange'>
offset: <1 BusinessDay>, tzinfo: None
[2009-01-01 00:00:00, ..., 2009-01-28 00:00:00]
length: 20

In [849]: DateRange(end=end, periods=20)
Out[849]:
<class 'pandas.core.daterange.DateRange'>
offset: <1 BusinessDay>, tzinfo: None
[2009-12-07 00:00:00, ..., 2010-01-01 00:00:00]
length: 20
The start and end dates are strictly inclusive. So the range will not contain any dates outside of those dates, if specified.
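A minimal sketch of the inclusive behavior (the start and end values here are illustrative assumptions):

start = datetime(2009, 1, 1)
end = datetime(2010, 1, 1)
rng = DateRange(start, end)
# rng[0] >= start and rng[-1] <= end: both endpoints are respected,
# and no generated date falls outside [start, end]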
<class 'pandas.core.daterange.DateRange'>
offset: <2 BusinessMonthEnds>, tzinfo: None
[2009-01-30 00:00:00, ..., 2009-11-30 00:00:00]
length: 6
More complicated fancy indexing will result in an Index that is no longer a DateRange, however:
In [855]: ts[[0, 2, 6]].index
Out[855]:
Index([2009-01-30 00:00:00, 2009-03-31 00:00:00, 2009-07-31 00:00:00], dtype=object)
The shift method accepts an offset argument, which can be a DateOffset class, another timedelta-like object, or a time rule:
In [858]: ts.shift(5, offset=datetools.bday)
Out[858]:
2009-02-06    0.469112
2009-03-06   -0.282863
2009-04-07   -1.509059
2009-05-07   -1.135632
2009-06-05    1.212112

In [859]: ts.shift(5, offset='EOM')
Out[859]:
2009-06-30    0.469112
2009-07-31   -0.282863
2009-08-31   -1.509059
2009-09-30   -1.135632
2009-10-30    1.212112
2. Use the asof ("as of") function of the DateRange to do a groupby expression
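A hedged sketch of this approach; the minutely series ts and the hourly DateRange are illustrative assumptions, not from the original example:

# an hourly grid covering the span of the minutely series
hourly = DateRange(ts.index[0], ts.index[-1], offset=datetools.Hour())
# map each minutely timestamp to the most recent hourly date, then average
converted = ts.groupby(hourly.asof).mean()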
Some things to note about this method:
- It is rather inefficient because we haven't exploited the orderedness of the data at all. Calling the asof function on every date in the minutely time series is not strictly necessary. We'll be writing some significantly more efficient methods in the near future.
- The dates in the result mark the beginning of the period. Be careful about which convention you use; you don't want to end up misaligning data because you used the wrong upsampling convention.
CHAPTER FOURTEEN
If the index consists of dates, it calls gca().autofmt_xdate() to try to format the x-axis nicely as per above. The method takes a number of arguments for controlling the look of the plot:
In [875]: plt.figure(); ts.plot(style='k--', label='Series'); plt.legend()
Out[875]: <matplotlib.legend.Legend at 0x116fe2e10>
You may set the legend argument to False to hide the legend, which is shown by default.
In [879]: df.plot(legend=False)
Out[879]: <matplotlib.axes.AxesSubplot at 0x118bd4790>
Some other options are available, like plotting each Series on a different axis:
In [880]: df.plot(subplots=True, figsize=(8, 8)); plt.legend(loc='best')
Out[880]: <matplotlib.legend.Legend at 0x118bd0ad0>
14.2.2 Histogramming
In [890]: plt.figure();

In [890]: df['A'].diff().hist()
Out[890]: <matplotlib.axes.AxesSubplot at 0x119c30550>
For a DataFrame, hist plots the histograms of the columns on multiple subplots:
In [891]: plt.figure()
Out[891]: <matplotlib.figure.Figure at 0x119c53dd0>

In [892]: df.diff().hist(color='k', alpha=0.5, bins=50)
Out[892]:
array([[Axes(0.125,0.536364;0.352273x0.363636),
        Axes(0.547727,0.536364;0.352273x0.363636)],
       [Axes(0.125,0.1;0.352273x0.363636),
        Axes(0.547727,0.1;0.352273x0.363636)]], dtype=object)
14.2.3 Box-Plotting
DataFrame has a boxplot method which allows you to visualize the distribution of values within each column. For instance, here is a boxplot representing five trials of 10 observations of a uniform random variable on [0, 1).
In [893]: df = DataFrame(np.random.rand(10, 5))

In [894]: plt.figure();

In [894]: df.boxplot()
Out[894]: <matplotlib.axes.AxesSubplot at 0x119c5f610>
You can create a stratified boxplot using the by keyword argument to create groupings. For instance:
In [895]: df = DataFrame(np.random.rand(10, 2), columns=['Col1', 'Col2'])

In [896]: df['X'] = Series(['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'])

In [897]: plt.figure();

In [897]: df.boxplot(by='X')
Out[897]:
array([Axes(0.1,0.15;0.363636x0.75), Axes(0.536364,0.15;0.363636x0.75)], dtype=object)
You can also pass a subset of columns to plot, as well as group by multiple columns:
In [898]: df = DataFrame(np.random.rand(10, 3), columns=['Col1', 'Col2', 'Col3'])

In [899]: df['X'] = Series(['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'])

In [900]: df['Y'] = Series(['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'])

In [901]: plt.figure();

In [901]: df.boxplot(column=['Col1', 'Col2'], by=['X', 'Y'])
Out[901]:
array([Axes(0.1,0.15;0.363636x0.75), Axes(0.536364,0.15;0.363636x0.75)], dtype=object)
CHAPTER FIFTEEN
- index_col: column number, or list of column numbers, to use as the index (row labels) of the resulting DataFrame. By default, it will number the rows without using any column, unless there is one more data column than there are headers, in which case the first column is taken as the index.
- parse_dates: If True, attempt to parse the index column as dates. False by default.
- date_parser: function to use to parse strings into datetime objects. If parse_dates is True, it defaults to the very robust dateutil.parser. Specifying this implicitly sets parse_dates as True.
- na_values: optional list of strings to recognize as NaN (missing values), in addition to a default set.
- nrows: number of rows to read out of the file. Useful to only read a small portion of a large file
- chunksize: a number of rows to be used to chunk a file into pieces. Will cause a TextParser object to be returned. More on this below in the section on iterating and chunking
- iterator: If True, return a TextParser to enable reading a file into memory piece by piece
- skip_footer: number of lines to skip at the bottom of the file (default 0)
- converters: a dictionary of functions for converting values in certain columns, where keys are either integers or column labels
- encoding: a string representing the encoding to use if the contents are non-ascii, for python versions prior to 3
- verbose: show number of NA values inserted in non-numeric columns

Consider a typical CSV file containing, in this case, some time series data:
In [586]: print open('foo.csv').read()
date,A,B,C
20090101,a,1,2
20090102,b,3,4
20090103,c,4,5
The default for read_csv is to create a DataFrame with simple numbered rows:
In [587]: read_csv('foo.csv')
Out[587]:
       date  A  B  C
0  20090101  a  1  2
1  20090102  b  3  4
2  20090103  c  4  5
In the case of indexed data, you can pass the column number (or a list of column numbers, for a hierarchical index) you wish to use as the index. If the index values are dates and you want them to be converted to datetime objects, pass parse_dates=True:
# Use a column as an index, and parse it as dates.
In [588]: df = read_csv('foo.csv', index_col=0, parse_dates=True)

In [589]: df
Out[589]:
            A  B  C
date
2009-01-01  a  1  2
2009-01-02  b  3  4
2009-01-03  c  4  5
# These are python datetime objects
In [590]: df.index
Out[590]: Index([2009-01-01 00:00:00, 2009-01-02 00:00:00, 2009-01-03 00:00:00], dtype=object)
The parsers make every attempt to "do the right thing" and not be fragile. Type inference is a pretty big deal: if a column can be coerced to integer dtype without altering the contents, the parser will do so. Any non-numeric columns will come through as object dtype, as with the rest of pandas objects.
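A minimal sketch of the type inference (the in-memory CSV here is an illustrative assumption):

from StringIO import StringIO
data = 'a,b\n1,x\n2,y\n3,z'
df = read_csv(StringIO(data))
# df['a'] is coerced to integer dtype; df['b'] stays object dtype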
In this special case, read_csv assumes that the first column is to be used as the index of the DataFrame:
In [592]: read_csv('foo.csv')
Out[592]:
          A  B  C
20090101  a  1  2
20090102  b  3  4
20090103  c  4  5
Note that the dates weren't automatically parsed. In that case you would need to do as before:
In [593]: df = read_csv('foo.csv', parse_dates=True)

In [594]: df.index
Out[594]: Index([2009-01-01 00:00:00, 2009-01-02 00:00:00, 2009-01-03 00:00:00], dtype=object)
The index_col argument to read_csv and read_table can take a list of column numbers to turn multiple columns into a MultiIndex:
In [596]: df = read_csv("data/mindex_ex.csv", index_col=[0, 1])

In [597]: df
Out[597]:
             zit   xit
year indiv
1977 A      1.20  0.60
     B      1.50  0.50
     C      1.70  0.80
1978 A      0.20  0.06
     B      0.70  0.20
     C      0.80  0.30
     D      0.90  0.50
     E      1.40  0.90
1979 C      0.20  0.15
     D      0.14  0.05
     E      0.50  0.15
     F      1.20  0.50
     G      3.40  1.90
     H      5.40  2.70
     I      6.40  1.20
In [598]: df.ix[1978]
Out[598]:
       zit   xit
indiv
A      0.2  0.06
B      0.7  0.20
C      0.8  0.30
D      0.9  0.50
E      1.4  0.90
By specifying a chunksize to read_csv or read_table, the return value will be an iterable object of type TextParser:
In [604]: reader = read_table('tmp.sv', sep='|', chunksize=4)

In [605]: reader
Out[605]: <pandas.io.parsers.TextParser at 0x1132ae390>

In [606]: for chunk in reader:
   .....:     print chunk
   .....:
   year indiv  zit   xit
0  1977     A  1.2  0.60
1  1977     B  1.5  0.50
2  1977     C  1.7  0.80
3  1978     A  0.2  0.06
   year indiv  zit  xit
0  1978     B  0.7  0.2
1  1978     C  0.8  0.3
2  1978     D  0.9  0.5
179
3 4
1978 1978
A B
0.2 0.7
0.06 0.20
First instantiate an ExcelFile, then use its parse instance method with a sheet name, plus the same additional arguments as the parsers above:

xls = ExcelFile('path_to_file.xls')
xls.parse('Sheet1', index_col=None, na_values=['NA'])
To read sheets from an Excel 2007 file, you can pass a filename with a .xlsx extension, in which case the openpyxl module will be used to read the file. To write a DataFrame object to a sheet of an Excel file, you can use the to_excel instance method. The arguments are largely the same as to_csv described above, the first argument being the name of the excel file, and the optional second argument the name of the sheet to which the DataFrame should be written. For example:
df.to_excel('path_to_file.xlsx', sheet_name='sheet1')
Files with a .xls extension will be written using xlwt and those with a .xlsx extension will be written using openpyxl. The Panel class also has a to_excel instance method, which writes each DataFrame in the Panel to a separate sheet. In order to write separate DataFrames to separate sheets in a single Excel file, one can use the ExcelWriter class, as in the following example:
writer = ExcelWriter('path_to_file.xlsx')
df1.to_excel(writer, sheet_name='sheet1')
df2.to_excel(writer, sheet_name='sheet2')
writer.save()
Objects can be written to the le just like adding key-value pairs to a dict:
# open (or create) an HDF5 store file
store = HDFStore('store.h5')

In [611]: index = DateRange('1/1/2000', periods=8)

In [612]: s = Series(randn(5), index=['a', 'b', 'c', 'd', 'e'])
In [613]: df = DataFrame(randn(8, 3), index=index,
   .....:                columns=['A', 'B', 'C'])

In [614]: wp = Panel(randn(2, 5, 4), items=['Item1', 'Item2'],
   .....:            major_axis=DateRange('1/1/2000', periods=5),
   .....:            minor_axis=['A', 'B', 'C', 'D'])

In [615]: store['s'] = s

In [616]: store['df'] = df

In [617]: store['wp'] = wp

In [618]: store
Out[618]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
df    DataFrame
s     Series
wp    Panel
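Retrieval is equally dict-like; a minimal sketch:

# reading back returns the stored object
df2 = store['df']
store.close()  # close the underlying HDF5 file when finished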
CHAPTER SIXTEEN
The to_sparse method takes a kind argument (for the sparse index, see below) and a fill_value. So if we had a mostly zero Series, we could convert it to sparse with fill_value=0:
In [812]: ts.fillna(0).to_sparse(fill_value=0)
Out[812]:
0    0.469112
1   -0.282863
2    0.000000
3    0.000000
4    0.000000
5    0.000000
6    0.000000
7    0.000000
8   -0.861849
9   -2.104569
BlockIndex
Block locations: array([0, 8], dtype=int32)
Block lengths: array([2, 2], dtype=int32)
The sparse objects exist for memory efficiency reasons. Suppose you had a large, mostly NA DataFrame:
In [813]: df = DataFrame(randn(10000, 4))

In [814]: df.ix[:9998] = np.nan

In [815]: sdf = df.to_sparse()

In [816]: sdf
Out[816]:
<class 'pandas.sparse.frame.SparseDataFrame'>
Int64Index: 10000 entries, 0 to 9999
Columns: 0 to 3
dtypes: float64(4)

In [817]: sdf.density
Out[817]: 0.0001
As you can see, the density (% of values that have not been compressed) is extremely low. This sparse object takes up much less memory on disk (pickled) and in the Python interpreter. Functionally, their behavior should be nearly identical to their dense counterparts. Any sparse object can be converted back to the standard dense form by calling to_dense:
In [818]: sts.to_dense()
Out[818]:
0    0.469112
1   -0.282863
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8   -0.861849
9   -2.104569
16.1 SparseArray
SparseArray is the base layer for all of the sparse indexed data structures. It is a 1-dimensional ndarray-like object storing only values distinct from the fill_value:
In [819]: arr = np.random.randn(10)

In [820]: arr[2:5] = np.nan; arr[7:8] = np.nan

In [821]: sparr = SparseArray(arr)

In [822]: sparr
Out[822]:
SparseArray([-1.9557, -1.6589,     nan,     nan,     nan,  1.1589,  0.1453,
        nan,  0.606 ,  1.3342])
IntIndex
Indices: array([0, 1, 5, 6, 8, 9], dtype=int32)
Like the indexed objects (SparseSeries, SparseDataFrame, SparsePanel), a SparseArray can be converted back to a regular ndarray by calling to_dense:
In [823]: sparr.to_dense()
Out[823]:
array([-1.9557, -1.6589,     nan,     nan,     nan,  1.1589,  0.1453,
           nan,  0.606 ,  1.3342])
16.2 SparseList
SparseList is a list-like data structure for managing a dynamic collection of SparseArrays. To create one, simply call the SparseList constructor with a fill_value (defaulting to NaN):
In [824]: spl = SparseList()

In [825]: spl
Out[825]: <pandas.sparse.list.SparseList object at 0x1148f9d10>
The two important methods are append and to_array. append can accept scalar values or any 1-dimensional sequence:
In [826]: spl.append(np.array([1., nan, nan, 2., 3.]))

In [827]: spl.append(5)

In [828]: spl.append(sparr)

In [829]: spl
Out[829]:
<pandas.sparse.list.SparseList object at 0x1148f9d10>
SparseArray([ 1., nan, nan,  2.,  3.])
IntIndex
Indices: array([0, 3, 4], dtype=int32)
SparseArray([ 5.])
IntIndex
Indices: array([0], dtype=int32)
SparseArray([-1.9557, -1.6589,     nan,     nan,     nan,  1.1589,  0.1453,
        nan,  0.606 ,  1.3342])
IntIndex
Indices: array([0, 1, 5, 6, 8, 9], dtype=int32)
As you can see, all of the contents are stored internally as a list of memory-efficient SparseArray objects. Once you've accumulated all of the data, you can call to_array to get a single SparseArray with all the data:
In [830]: spl.to_array()
Out[830]:
SparseArray([ 1.    ,     nan,     nan,  2.    ,  3.    ,  5.    , -1.9557,
       -1.6589,     nan,     nan,     nan,  1.1589,  0.1453,     nan,
        0.606 ,  1.3342])
IntIndex
Indices: array([ 0,  3,  4,  5,  6,  7, 11, 12, 14, 15], dtype=int32)
CHAPTER SEVENTEEN
This trade-off is made largely for memory and performance reasons, and also so that the resulting Series continues to be numeric. One possibility is to use dtype=object arrays instead.
While this may seem like a heavy trade-off, I have found very few cases where this is an issue in practice. Some explanation for the motivation is in the next section.
The R language, by contrast, only has a handful of built-in data types: integer, numeric (floating-point), character, and boolean. NA types are implemented by reserving special bit patterns for each type to be used as the missing value. While doing this with the full NumPy type hierarchy would be possible, it would be a more substantial trade-off (especially for the 8- and 16-bit data types) and implementation undertaking. An alternate approach is that of using masked arrays. A masked array is an array of data with an associated boolean mask denoting whether each value should be considered NA or not. I am personally not in love with this approach as I feel that overall it places a fairly heavy burden on the user and the library implementer. Additionally, it exacts a fairly high performance cost when working with numerical data compared with the simple approach of using NaN. Thus, I have chosen the Pythonic "practicality beats purity" approach and traded integer NA capability for a much simpler approach of using a special value in float and object arrays to denote NA, and promoting integer arrays to floating when NAs must be introduced.
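A minimal sketch of the promotion behavior described above (the names here are illustrative):

s = Series([1, 2, 3], index=['a', 'b', 'c'])   # integer dtype
s2 = s.reindex(['a', 'b', 'c', 'd'])           # 'd' has no value, so NaN
# s2 is upcast to float64 so that NaN can represent the missing entry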
However, if you only had 'c' and 'e', determining the next element in the index can be somewhat complicated. For example, the following does not work:
s.ix['c':'e'+1]
A very common use case is to limit a time series to start and end at two specific dates. To enable this, we made the design decision to make label-based slicing include both endpoints:
In [347]: s.ix['c':'e']
Out[347]:
c    0.606382
d   -0.681101
e   -0.289724
This is most definitely a "practicality beats purity" sort of thing, but it is something to watch out for if you expect label-based slicing to behave exactly the way that standard Python integer slicing works.
In [348]: df = DataFrame(randn(6, 4), columns=['one', 'two', 'three', 'four'],
   .....:                index=list('abcdef'))

In [349]: df
Out[349]:
        one       two     three      four
a -1.407699  1.014104  0.314226 -0.001675
b  0.071823  0.892566  0.680594 -0.339640
c  0.214910 -0.078410 -0.177665  0.490838
d -1.360102  1.592456  1.007100  0.697835
e -1.890591 -0.254002  1.360151 -0.059912
f -0.151652  0.624697 -1.124779  0.072594

In [350]: df.ix[['b', 'c', 'e']]
Out[350]:
        one       two     three      four
b  0.071823  0.892566  0.680594 -0.339640
c  0.214910 -0.078410 -0.177665  0.490838
e -1.890591 -0.254002  1.360151 -0.059912
This is, of course, completely equivalent in this case to using the reindex method:
In [351]: df.reindex(['b', 'c', 'e'])
Out[351]:
        one       two     three      four
b  0.071823  0.892566  0.680594 -0.339640
c  0.214910 -0.078410 -0.177665  0.490838
e -1.890591 -0.254002  1.360151 -0.059912
Some might conclude that ix and reindex are 100% equivalent based on this. This is indeed true except in the case of integer indexing. For example, the above operation could alternately have been expressed as:
In [352]: df.ix[[1, 2, 4]]
Out[352]:
        one       two     three      four
b  0.071823  0.892566  0.680594 -0.339640
c  0.214910 -0.078410 -0.177665  0.490838
e -1.890591 -0.254002  1.360151 -0.059912
If you pass [1, 2, 4] to reindex you will get another thing entirely:
In [353]: df.reindex([1, 2, 4])
Out[353]:
   one  two  three  four
1  NaN  NaN    NaN   NaN
2  NaN  NaN    NaN   NaN
4  NaN  NaN    NaN   NaN
So it's important to remember that reindex is strict label indexing only. This can lead to some potentially surprising results in pathological cases where an index contains, say, both integers and strings:
In [354]: s = Series([1, 2, 3], index=['a', 0, 1])

In [355]: s
Out[355]:
a    1
0    2
1    3

In [356]: s.ix[[0, 1]]
Because the index in this case does not contain solely integers, ix falls back on integer indexing. By contrast, reindex only looks for the values passed in the index, thus nding the integers 0 and 1. While it would be possible to insert some logic to check whether a passed sequence is all contained in the index, that logic would exact a very high cost in large data sets.
CHAPTER EIGHTEEN
RPY2 / R INTERFACE
Note: This is all highly experimental. I would like to get more people involved with building a nice RPy2 interface for pandas If your computer has R and rpy2 (> 2.2) installed (which will be left to the reader), you will be able to leverage the below functionality. On Windows, doing this is quite an ordeal at the moment, but users on Unix-like systems should nd it quite easy. As a general rule, I would recommend using the latest revision of rpy2 from bitbucket:
# if installing for the first time
hg clone http://bitbucket.org/lgautier/rpy2

cd rpy2
hg pull
hg update
sudo python setup.py install
Note: To use R packages with this interface, you will need to install them inside R yourself. At the moment it cannot install them for you. Once you have installed R and rpy2, you should be able to import pandas.rpy.common without a hitch.
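For example, a hedged sketch of pulling one of R's built-in data sets across, assuming the load_data helper in pandas.rpy.common:

import pandas.rpy.common as com
# convert R's 'infert' data set to a DataFrame
infert = com.load_data('infert')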
A few of the columns of the converted infert data set:

   induced  case  spontaneous  stratum  pooled.stratum
0        1     1            2        1               3
1        1     1            0        2               1
2        2     1            0        3               4
3        2     1            0        4               2
4        1     1            1        5              32
18.2 Calling R functions with pandas objects

18.3 High-level interface to R estimators
CHAPTER NINETEEN
19.2 scikits.statsmodels
The main statistics and econometrics library for Python. pandas has become a dependency of this library.
19.3 scikits.timeseries
scikits.timeseries provides a data structure for fixed-frequency time series data based on the numpy.MaskedArray class. For time series data, it provides some of the same functionality as the pandas Series class. It has many more functions for time series-specific manipulation. Also, it has support for many more frequencies, though these are less customizable by the user (so 5-minutely data is easier to do with pandas, for example). We are aiming to merge these libraries together in the near future.
CHAPTER TWENTY
20.1 data.frame

20.2 zoo

20.3 xts

20.4 plyr

20.5 reshape / reshape2
CHAPTER TWENTYONE
API REFERENCE
21.1 General functions
21.1.1 Data manipulations
pivot_table(data[, values, rows, cols, ...])    Create a spreadsheet-style pivot table as a DataFrame

pandas.tools.pivot.pivot_table

static pivot.pivot_table(data, values=None, rows=None, cols=None, aggfunc='mean', fill_value=None, margins=False)

Create a spreadsheet-style pivot table as a DataFrame. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.

Parameters:
- data : DataFrame
- values : column to aggregate, optional
- rows : list. Columns to group on the rows of the pivot table
- cols : list. Columns to group on the columns of the pivot table
- aggfunc : function, default numpy.mean, or list of functions. If a list of functions is passed, the resulting pivot table will have hierarchical columns whose top level are the function names (inferred from the function objects themselves)
- fill_value : scalar, default None. Value to replace missing values with
- margins : boolean, default False. Add all row / columns (e.g. for subtotal / grand totals)

Returns: table : DataFrame
Examples

>>> df
     A    B      C  D
0  foo  one  small  1
1  foo  one  large  2
2  foo  one  large  2
3  foo  two  small  3
4  foo  two  small  3
5  bar  one  large  4
6  bar  one  small  5
7  bar  two  small  6
8  bar  two  large  7

>>> table = pivot_table(df, values='D', rows=['A', 'B'],
...                     cols=['C'], aggfunc=np.sum)
>>> table
          small  large
foo  one      1      4
     two      6    NaN
bar  one      5      4
     two      6      7
merge(left, right[, how, on, left_on, ...])    Merge DataFrame objects by performing a database-style join operation by columns or indexes
concat(objs[, axis, join, join_axes, ...])     Concatenate pandas objects along a particular axis with optional set logic along the other axes

pandas.tools.merge.merge
static merge.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True, suffixes=('.x', '.y'), copy=True)

Merge DataFrame objects by performing a database-style join operation by columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on.

Parameters:
- left : DataFrame
- right : DataFrame
- how : {'left', 'right', 'outer', 'inner'}, default 'inner'. left: use only keys from left frame (SQL: left outer join); right: use only keys from right frame (SQL: right outer join); outer: use union of keys from both frames (SQL: full outer join); inner: use intersection of keys from both frames (SQL: inner join)
- on : label or list. Field names to join on. Must be found in both DataFrames.
- left_on : label or list, or array-like. Field names to join on in left DataFrame. Can be a vector or list of vectors of the length of the DataFrame to use a particular vector as the join key instead of columns
- right_on : label or list, or array-like. Field names to join on in right DataFrame, or vector/list of vectors per left_on docs
- left_index : boolean, default False. Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels
- right_index : boolean, default False. Use the index from the right DataFrame as the join key. Same caveats as left_index
- sort : boolean, default True. Sort the join keys lexicographically in the result DataFrame
- suffixes : 2-length sequence (tuple, list, ...). Suffix to apply to overlapping column names in the left and right side, respectively
- copy : boolean, default True. If False, do not copy data unnecessarily

Returns: merged : DataFrame
Examples

>>> A
  lkey  value
0  foo      1
1  bar      2
2  baz      3
3  foo      4

>>> B
  rkey  value
0  foo      5
1  bar      6
2  qux      7
3  bar      8
>>> merge(A, B, left_on='lkey', right_on='rkey', how='outer')
  lkey  value.x rkey  value.y
0  bar        2  bar        6
1  bar        2  bar        8
2  baz        3  NaN      NaN
3  foo        1  foo        5
4  foo        4  foo        5
5  NaN      NaN  qux        7
pandas.tools.merge.concat static merge.concat(objs, axis=0, join=outer, join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False) Concatenate pandas objects along a particular axis with optional set logic along the other axes. Can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number Parameters objs : list or dict of Series, DataFrame, or Panel objects If a dict is passed, the sorted keys will be used as the keys argument, unless it is passed, in which case the values will be selected (see below). Any None objects will be dropped silently unless they are all None in which case an Exception will be raised axis : {0, 1, ...}, default 0 The axis to concatenate along 21.1. General functions 201
join : {'inner', 'outer'}, default 'outer' How to handle indexes on other axis(es) join_axes : list of Index objects Specific indexes to use for the other n - 1 axes instead of performing inner/outer set logic verify_integrity : boolean, default False Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation keys : sequence, default None If multiple levels passed, should contain tuples. Construct hierarchical index using the passed keys as the outermost level levels : list of sequences, default None Specific levels (unique values) to use for constructing a MultiIndex. Otherwise they will be inferred from the keys names : list, default None Names for the levels in the resulting hierarchical index ignore_index : boolean, default False If True, do not use the index values on the concatenation axis. The resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Returns concatenated : type of objects
Notes
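As a minimal illustration of concat, with the keys argument adding an outer MultiIndex level (the example data here is arbitrary):

from pandas import Series, concat

s1 = Series([0, 1], index=['a', 'b'])
s2 = Series([2, 3, 4], index=['c', 'd', 'e'])

# Stack along the index; keys adds an outer MultiIndex level
# ('first' and 'second') over the original labels
combined = concat([s1, s2], keys=['first', 'second'])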
21.1.2 Pickling
load(path) Load pickled pandas object (or any other pickled object) from the specified file path
save(obj, path) Pickle (serialize) object to input file path
pandas.core.common.load static common.load(path) Load pickled pandas object (or any other pickled object) from the specified file path Parameters path : string File path Returns unpickled : type of object stored in file
pandas.core.common.save static common.save(obj, path) Pickle (serialize) object to input file path Parameters obj : any object path : string File path
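A quick round trip through save and load (the file name is arbitrary):

import numpy as np
from pandas import DataFrame
from pandas.core.common import load, save

df = DataFrame(np.random.randn(5, 2), columns=['a', 'b'])
save(df, 'frame.pickle')    # serialize to disk
df2 = load('frame.pickle')  # read it back; df2 equals df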
21.1.3 File IO
read_table(filepath_or_buffer[, sep, ...]) Read general delimited file into DataFrame
read_csv(filepath_or_buffer[, sep, header, ...]) Read CSV (comma-separated) file into DataFrame
ExcelFile.parse(sheetname[, header, ...]) Read Excel table into DataFrame

pandas.io.parsers.read_table static parsers.read_table(filepath_or_buffer, sep='\t', header=0, index_col=None, names=None, skiprows=None, na_values=None, parse_dates=False, date_parser=None, nrows=None, iterator=False, chunksize=None, skip_footer=0, converters=None, verbose=False, delimiter=None, encoding=None) Read general delimited file into DataFrame Also supports optionally iterating or breaking of the file into chunks. Parameters filepath_or_buffer : string or file handle / StringIO sep : string, default '\t' (tab-stop) Delimiter to use header : int, default 0 Row to use for the column labels of the parsed DataFrame skiprows : list-like or integer Row numbers to skip (0-indexed) or number of rows to skip (int) index_col : int or sequence, default None Column to use as the row labels of the DataFrame. If a sequence is given, a MultiIndex is used. names : array-like List of column names na_values : list-like, default None List of additional strings to recognize as NA/NaN parse_dates : boolean, default False Attempt to parse dates in the index column(s) date_parser : function Function to use for converting strings to dates. Defaults to dateutil.parser nrows : int, default None
Number of rows of file to read. Useful for reading pieces of large files iterator : boolean, default False Return TextParser object chunksize : int, default None Return TextParser object for iteration skip_footer : int, default 0 Number of lines at bottom of file to skip converters : dict, optional Dict of functions for converting values in certain columns. Keys can either be integers or column labels verbose : boolean, default False Indicate number of NA values placed in non-numeric columns delimiter : string, default None Alternative argument name for sep encoding : string, default None Encoding to use for UTF when reading/writing (ex. 'utf-8') Returns result : DataFrame or TextParser pandas.io.parsers.read_csv static parsers.read_csv(filepath_or_buffer, sep=',', header=0, index_col=None, names=None, skiprows=None, na_values=None, parse_dates=False, date_parser=None, nrows=None, iterator=False, chunksize=None, skip_footer=0, converters=None, verbose=False, delimiter=None, encoding=None) Read CSV (comma-separated) file into DataFrame Also supports optionally iterating or breaking of the file into chunks. Parameters filepath_or_buffer : string or file handle / StringIO sep : string, default ',' Delimiter to use. If sep is None, will try to automatically determine this header : int, default 0 Row to use for the column labels of the parsed DataFrame skiprows : list-like or integer Row numbers to skip (0-indexed) or number of rows to skip (int) index_col : int or sequence, default None Column to use as the row labels of the DataFrame. If a sequence is given, a MultiIndex is used. names : array-like List of column names na_values : list-like, default None
List of additional strings to recognize as NA/NaN parse_dates : boolean, default False Attempt to parse dates in the index column(s) date_parser : function Function to use for converting strings to dates. Defaults to dateutil.parser nrows : int, default None Number of rows of file to read. Useful for reading pieces of large files iterator : boolean, default False Return TextParser object chunksize : int, default None Return TextParser object for iteration skip_footer : int, default 0 Number of lines at bottom of file to skip converters : dict, optional Dict of functions for converting values in certain columns. Keys can either be integers or column labels verbose : boolean, default False Indicate number of NA values placed in non-numeric columns delimiter : string, default None Alternative argument name for sep encoding : string, default None Encoding to use for UTF when reading/writing (ex. 'utf-8') Returns result : DataFrame or TextParser pandas.io.parsers.ExcelFile.parse ExcelFile.parse(sheetname, header=0, skiprows=None, index_col=None, parse_dates=False, date_parser=None, na_values=None, chunksize=None) Read Excel table into DataFrame Parameters sheetname : string Name of Excel sheet header : int, default 0 Row to use for the column labels of the parsed DataFrame skiprows : list-like Row numbers to skip (0-indexed) index_col : int, default None Column to use as the row labels of the DataFrame. Pass None if there is no such column na_values : list-like, default None
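For large files, passing chunksize turns the parser into an iterator over DataFrame pieces. A brief sketch; 'data.csv' is a hypothetical file and the loop body is a placeholder:

from pandas.io.parsers import read_csv

# Each chunk is a DataFrame of at most 10000 rows
reader = read_csv('data.csv', chunksize=10000)
for chunk in reader:
    print chunk.shape  # replace with real per-chunk processing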
pandas.stats.moments.rolling_mean static moments.rolling_mean(arg, window, min_periods=None, time_rule=None) Moving mean Parameters arg : Series, DataFrame window : Number of observations used for calculating statistic min_periods : int Minimum number of observations in window required to have a value time_rule : {None, WEEKDAY, EOM, W@MON, ...}, default=None Name of time rule to conform to before computing statistic Returns y : type of input argument pandas.stats.moments.rolling_median static moments.rolling_median(arg, window, min_periods=None, time_rule=None) O(N log(window)) implementation using skip list Moving median Parameters arg : Series, DataFrame window : Number of observations used for calculating statistic min_periods : int Minimum number of observations in window required to have a value time_rule : {None, WEEKDAY, EOM, W@MON, ...}, default=None Name of time rule to conform to before computing statistic Returns y : type of input argument pandas.stats.moments.rolling_var static moments.rolling_var(arg, window, min_periods=None, time_rule=None) Unbiased moving variance Parameters arg : Series, DataFrame window : Number of observations used for calculating statistic min_periods : int Minimum number of observations in window required to have a value time_rule : {None, WEEKDAY, EOM, W@MON, ...}, default=None Name of time rule to conform to before computing statistic Returns y : type of input argument
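A short usage sketch for the rolling functions; min_periods controls how early the window begins producing values:

import numpy as np
from pandas import Series
from pandas.stats.moments import rolling_mean

s = Series(np.random.randn(100))
ma = rolling_mean(s, window=10)                  # first 9 values are NaN
ma5 = rolling_mean(s, window=10, min_periods=5)  # values start at the 5th observation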
pandas.stats.moments.rolling_std static moments.rolling_std(arg, window, min_periods=None, time_rule=None) Unbiased moving standard deviation Parameters arg : Series, DataFrame window : Number of observations used for calculating statistic min_periods : int Minimum number of observations in window required to have a value time_rule : {None, WEEKDAY, EOM, W@MON, ...}, default=None Name of time rule to conform to before computing statistic Returns y : type of input argument pandas.stats.moments.rolling_corr static moments.rolling_corr(arg1, arg2, window, min_periods=None, time_rule=None) Moving sample correlation Parameters arg1 : Series, DataFrame, or ndarray arg2 : Series, DataFrame, or ndarray window : Number of observations used for calculating statistic min_periods : int Minimum number of observations in window required to have a value time_rule : {None, WEEKDAY, EOM, W@MON, ...}, default=None Name of time rule to conform to before computing statistic Returns y : type depends on inputs DataFrame / DataFrame -> DataFrame (matches on columns) DataFrame / Series -> Computes result for each column Series / Series -> Series pandas.stats.moments.rolling_cov static moments.rolling_cov(arg1, arg2, window, min_periods=None, time_rule=None) Unbiased moving covariance Parameters arg1 : Series, DataFrame, or ndarray arg2 : Series, DataFrame, or ndarray window : Number of observations used for calculating statistic min_periods : int Minimum number of observations in window required to have a value time_rule : {None, WEEKDAY, EOM, W@MON, ...}, default=None Name of time rule to conform to before computing statistic Returns y : type depends on inputs
DataFrame / DataFrame -> DataFrame (matches on columns) DataFrame / Series -> Computes result for each column Series / Series -> Series pandas.stats.moments.rolling_skew static moments.rolling_skew(arg, window, min_periods=None, time_rule=None) Unbiased moving skewness Parameters arg : Series, DataFrame window : Number of observations used for calculating statistic min_periods : int Minimum number of observations in window required to have a value time_rule : {None, WEEKDAY, EOM, W@MON, ...}, default=None Name of time rule to conform to before computing statistic Returns y : type of input argument pandas.stats.moments.rolling_kurt static moments.rolling_kurt(arg, window, min_periods=None, time_rule=None) Unbiased moving kurtosis Parameters arg : Series, DataFrame window : Number of observations used for calculating statistic min_periods : int Minimum number of observations in window required to have a value time_rule : {None, WEEKDAY, EOM, W@MON, ...}, default=None Name of time rule to conform to before computing statistic Returns y : type of input argument pandas.stats.moments.rolling_apply static moments.rolling_apply(arg, window, func, min_periods=None, time_rule=None) Generic moving function application Parameters arg : Series, DataFrame window : Number of observations used for calculating statistic func : function Must produce a single value from an ndarray input min_periods : int Minimum number of observations in window required to have a value time_rule : {None, WEEKDAY, EOM, W@MON, ...}, default=None Name of time rule to conform to before computing statistic Returns y : type of input argument
pandas.stats.moments.rolling_quantile static moments.rolling_quantile(arg, window, quantile, min_periods=None, time_rule=None) Moving quantile Parameters arg : Series, DataFrame window : Number of observations used for calculating statistic quantile : 0 <= quantile <= 1 min_periods : int Minimum number of observations in window required to have a value time_rule : {None, WEEKDAY, EOM, W@MON, ...}, default=None Name of time rule to conform to before computing statistic Returns y : type of input argument
ewma(arg[, com, span, min_periods, ...]) Exponentially-weighted moving average
ewmstd(arg[, com, span, min_periods, bias, ...]) Exponentially-weighted moving std
ewmvar(arg[, com, span, min_periods, bias, ...]) Exponentially-weighted moving variance
ewmcorr(arg1, arg2[, com, span, min_periods, ...]) Exponentially-weighted moving correlation
ewmcov(arg1, arg2[, com, span, min_periods, ...]) Exponentially-weighted moving covariance
Either center of mass or span must be specified. EWMA is sometimes specified using a span parameter s; the decay parameter alpha is related to the span as

alpha = 1 - 2 / (s + 1) = c / (1 + c)

where c is the center of mass. Given a span, the associated center of mass is c = (s - 1) / 2, so a 20-day EWMA would have center 9.5.

pandas.stats.moments.ewmstd static moments.ewmstd(arg, com=None, span=None, min_periods=0, bias=False, time_rule=None) Exponentially-weighted moving std Parameters arg : Series, DataFrame com : float, optional Center of mass: alpha = com / (1 + com) span : float, optional Specify decay in terms of span, alpha = 2 / (span + 1) min_periods : int, default 0 Number of observations in sample to require (only affects beginning) time_rule : {None, 'WEEKDAY', 'EOM', 'W@MON', ...}, default None Name of time rule to conform to before computing statistic bias : boolean, default False Use a standard estimation bias correction Returns y : type of input argument
Notes
Either center of mass or span must be specified. EWMA is sometimes specified using a span parameter s; the decay parameter alpha is related to the span as

alpha = 1 - 2 / (s + 1) = c / (1 + c)

where c is the center of mass. Given a span, the associated center of mass is c = (s - 1) / 2, so a 20-day EWMA would have center 9.5.

pandas.stats.moments.ewmvar static moments.ewmvar(arg, com=None, span=None, min_periods=0, bias=False, time_rule=None) Exponentially-weighted moving variance Parameters arg : Series, DataFrame com : float, optional Center of mass: alpha = com / (1 + com) span : float, optional Specify decay in terms of span, alpha = 2 / (span + 1) min_periods : int, default 0 Number of observations in sample to require (only affects beginning) time_rule : {None, 'WEEKDAY', 'EOM', 'W@MON', ...}, default None
Name of time rule to conform to before computing statistic bias : boolean, default False Use a standard estimation bias correction Returns y : type of input argument
Notes
Either center of mass or span must be specified. EWMA is sometimes specified using a span parameter s; the decay parameter alpha is related to the span as

alpha = 1 - 2 / (s + 1) = c / (1 + c)

where c is the center of mass. Given a span, the associated center of mass is c = (s - 1) / 2, so a 20-day EWMA would have center 9.5.

pandas.stats.moments.ewmcorr static moments.ewmcorr(arg1, arg2, com=None, span=None, min_periods=0, time_rule=None) Exponentially-weighted moving correlation Parameters arg1 : Series, DataFrame, or ndarray arg2 : Series, DataFrame, or ndarray com : float, optional Center of mass: alpha = com / (1 + com) span : float, optional Specify decay in terms of span, alpha = 2 / (span + 1) min_periods : int, default 0 Number of observations in sample to require (only affects beginning) time_rule : {None, 'WEEKDAY', 'EOM', 'W@MON', ...}, default None Name of time rule to conform to before computing statistic Returns y : type of input argument
Notes
Either center of mass or span must be specified. EWMA is sometimes specified using a span parameter s; the decay parameter alpha is related to the span as

alpha = 1 - 2 / (s + 1) = c / (1 + c)

where c is the center of mass. Given a span, the associated center of mass is c = (s - 1) / 2, so a 20-day EWMA would have center 9.5.
pandas.stats.moments.ewmcov static moments.ewmcov(arg1, arg2, com=None, span=None, min_periods=0, bias=False, time_rule=None) Exponentially-weighted moving covariance Parameters arg1 : Series, DataFrame, or ndarray arg2 : Series, DataFrame, or ndarray com : float, optional Center of mass: alpha = com / (1 + com) span : float, optional Specify decay in terms of span, alpha = 2 / (span + 1) min_periods : int, default 0 Number of observations in sample to require (only affects beginning) time_rule : {None, 'WEEKDAY', 'EOM', 'W@MON', ...}, default None Name of time rule to conform to before computing statistic Returns y : type of input argument

Notes

Either center of mass or span must be specified. EWMA is sometimes specified using a span parameter s; the decay parameter alpha is related to the span as

alpha = 1 - 2 / (s + 1) = c / (1 + c)

where c is the center of mass. Given a span, the associated center of mass is c = (s - 1) / 2, so a 20-day EWMA would have center 9.5.
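A quick sketch of the span / center-of-mass equivalence stated in the notes above: a span of 20 and a center of mass of 9.5 specify the same decay, so the two calls below should produce identical results:

import numpy as np
from pandas import Series
from pandas.stats.moments import ewma

s = Series(np.random.randn(100))
# c = (s - 1) / 2 = (20 - 1) / 2 = 9.5
ew1 = ewma(s, span=20)
ew2 = ewma(s, com=9.5)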
21.2 Series
21.2.1 Attributes and underlying data
Axes index: axis labels

Series.values Return Series as ndarray
Series.dtype Data-type of the array's elements.
Series.isnull(obj) Replacement for numpy.isnan / -numpy.isfinite which is suitable for use on object arrays.
Series.notnull(obj) Replacement for numpy.isfinite / -numpy.isnan which is suitable for use on object arrays.

pandas.Series.values Series.values Return Series as ndarray Returns arr : numpy.ndarray
pandas.Series.dtype Series.dtype Data-type of the array's elements. Parameters None : Returns d : numpy dtype object See Also: numpy.dtype
Examples
>>> x
array([[0, 1],
       [2, 3]])
>>> x.dtype
dtype('int32')
>>> type(x.dtype)
<type 'numpy.dtype'>
pandas.Series.isnull Series.isnull(obj) Replacement for numpy.isnan / -numpy.isfinite which is suitable for use on object arrays. Parameters arr : ndarray or object value Returns boolean ndarray or boolean pandas.Series.notnull Series.notnull(obj) Replacement for numpy.isfinite / -numpy.isnan which is suitable for use on object arrays. Parameters arr : ndarray or object value Returns boolean ndarray or boolean
operations involving the index. Statistical methods from ndarray have been overridden to automatically exclude missing data (currently represented as NaN) Operations between Series (+, -, /, *, **) align values based on their associated index values; they need not be the same length. The result index will be the sorted union of the two indexes. Parameters data : array-like, dict, or scalar value Contains data stored in Series index : array-like or Index (1d) Values must be unique and hashable, same length as data. Index object (or other iterable of same length as data) Will default to np.arange(len(data)) if not provided. If both a dict and index sequence are used, the index will override the keys found in the dict. dtype : numpy.dtype or None If None, dtype will be inferred copy : boolean, default False Copy input data pandas.Series.astype Series.astype(t) Copy of the array, cast to a specified type. Parameters t : str or dtype Typecode or data-type to which the array is cast. Raises ComplexWarning : When casting from complex to float or int. To avoid this, use a.real.astype(t).
Examples >>> x = np.array([1, 2, 2.5]) >>> x array([ 1. , 2. , 2.5]) >>> x.astype(int) array([1, 2, 2])
pandas.Series.copy Series.copy() Return new Series with copy of underlying values Returns cp : Series
pandas.Series.add Series.add(other, level=None, fill_value=None) Binary operator add with support to substitute a fill_value for missing data in one of the inputs Parameters other : Series or scalar value fill_value : None or float value, default None (NaN) Fill missing (NaN) values with this value. If both Series are missing, the result will be missing level : int or name Broadcast across a level, matching Index values on the passed MultiIndex level Returns result : Series pandas.Series.div Series.div(other, level=None, fill_value=None) Binary operator divide with support to substitute a fill_value for missing data in one of the inputs Parameters other : Series or scalar value fill_value : None or float value, default None (NaN) Fill missing (NaN) values with this value. If both Series are missing, the result will be missing level : int or name Broadcast across a level, matching Index values on the passed MultiIndex level Returns result : Series pandas.Series.mul Series.mul(other, level=None, fill_value=None) Binary operator multiply with support to substitute a fill_value for missing data in one of the inputs Parameters other : Series or scalar value fill_value : None or float value, default None (NaN) Fill missing (NaN) values with this value. If both Series are missing, the result will be missing level : int or name Broadcast across a level, matching Index values on the passed MultiIndex level Returns result : Series pandas.Series.sub Series.sub(other, level=None, fill_value=None) Binary operator subtract with support to substitute a fill_value for missing data in one of the inputs Parameters other : Series or scalar value fill_value : None or float value, default None (NaN)
Fill missing (NaN) values with this value. If both Series are missing, the result will be missing level : int or name Broadcast across a level, matching Index values on the passed MultiIndex level Returns result : Series pandas.Series.combine Series.combine(other, func, fill_value=nan) Perform elementwise binary operation on two Series using given function with optional fill value when an index is missing from one Series or the other Parameters other : Series or scalar value func : function fill_value : scalar value Returns result : Series pandas.Series.combine_first Series.combine_first(other) Combine Series values, choosing the calling Series's values first. Result index will be the union of the two indexes Parameters other : Series Returns y : Series
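A small sketch of fill_value in these flexible arithmetic methods: a label present on only one side is filled before the operation instead of producing NaN:

from pandas import Series

a = Series([1.0, 2.0], index=['x', 'y'])
b = Series([10.0, 20.0], index=['y', 'z'])

a + b                   # 'x' and 'z' come out NaN
a.add(b, fill_value=0)  # x -> 1.0, y -> 12.0, z -> 20.0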
pandas.Series.groupby Series.groupby(by=None, axis=0, level=None, as_index=True, sort=True) Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns Parameters by : mapping function / list of functions, dict, Series, or tuple / list of column names. Called on each element of the object index to determine the groups. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups axis : int, default 0 level : int, level name, or sequence of such, default None If the axis is a MultiIndex (hierarchical), group by a particular level or levels as_index : boolean, default True For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively SQL-style grouped output sort : boolean, default True Sort group keys. Get better performance by turning this off Returns GroupBy object :
Examples
# DataFrame result
>>> data.groupby(func, axis=0).mean()

# DataFrame result
>>> data.groupby(['col1', 'col2'])['col3'].mean()

# DataFrame with hierarchical index
>>> data.groupby(['col1', 'col2']).mean()
pandas.Series.autocorr Series.autocorr() Lag-1 autocorrelation Returns autocorr : float pandas.Series.clip Series.clip(lower=None, upper=None, out=None) Trim values at input threshold(s) Parameters lower : float, default None upper : float, default None Returns clipped : Series pandas.Series.clip_lower Series.clip_lower(threshold) Return copy of series with values below given value truncated
Returns clipped : Series See Also: clip pandas.Series.clip_upper Series.clip_upper(threshold) Return copy of series with values above given value truncated Returns clipped : Series See Also: clip pandas.Series.corr Series.corr(other, method='pearson') Compute correlation between two Series, excluding missing values Parameters other : Series method : {'pearson', 'kendall', 'spearman'} pearson : standard correlation coefficient kendall : Kendall Tau correlation coefficient spearman : Spearman rank correlation Returns correlation : float pandas.Series.count Series.count(level=None) Return number of non-NA/null observations in the Series Parameters level : int, default None If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a smaller Series Returns nobs : int or Series (if level specified) pandas.Series.cumprod Series.cumprod(axis=0, dtype=None, out=None, skipna=True) Cumulative product of values. Preserves locations of NaN values Extra parameters are to preserve ndarray interface. Parameters skipna : boolean, default True Exclude NA/null values Returns cumprod : Series
pandas.Series.cumsum Series.cumsum(axis=0, dtype=None, out=None, skipna=True) Cumulative sum of values. Preserves locations of NaN values Extra parameters are to preserve ndarray interface. Parameters skipna : boolean, default True Exclude NA/null values Returns cumsum : Series pandas.Series.describe Series.describe() Generate various summary statistics of Series, excluding NaN values. These include: count, mean, std, min, max, and 10%/50%/90% quantiles Returns desc : Series pandas.Series.diff Series.diff(periods=1) 1st discrete difference of object Parameters periods : int, default 1 Periods to shift for forming difference Returns diffed : Series pandas.Series.max Series.max(axis=None, out=None, skipna=True, level=None) Return maximum of values. NA/null values are excluded Parameters skipna : boolean, default True Exclude NA/null values level : int, default None If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a smaller Series Returns max : float (or Series if level specified) pandas.Series.mean Series.mean(axis=0, dtype=None, out=None, skipna=True, level=None) Return mean of values. NA/null values are excluded Parameters skipna : boolean, default True Exclude NA/null values level : int, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a smaller Series Extra parameters are to preserve ndarray interface. Returns mean : float (or Series if level specified) pandas.Series.median Series.median(skipna=True, level=None) Return median of values. NA/null values are excluded Parameters skipna : boolean, default True Exclude NA/null values level : int, default None If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a smaller Series Returns median : float (or Series if level specified) pandas.Series.min Series.min(axis=None, out=None, skipna=True, level=None) Return minimum of values. NA/null values are excluded Parameters skipna : boolean, default True Exclude NA/null values level : int, default None If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a smaller Series Returns min : float (or Series if level specified) pandas.Series.prod Series.prod(axis=None, dtype=None, out=None, skipna=True, level=None) Return product of values. NA/null values are excluded Parameters skipna : boolean, default True Exclude NA/null values level : int, default None If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a smaller Series Returns product : float (or Series if level specified)
pandas.Series.quantile Series.quantile(q=0.5) Return value at the given quantile, a la scoreatpercentile in scipy.stats Parameters q : quantile 0 <= q <= 1 Returns quantile : float pandas.Series.skew Series.skew(skipna=True, level=None) Return unbiased skewness of values. NA/null values are excluded Parameters skipna : boolean, default True Exclude NA/null values level : int, default None If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a smaller Series Returns skew : float (or Series if level specified) pandas.Series.std Series.std(axis=None, dtype=None, out=None, ddof=1, skipna=True, level=None) Return unbiased standard deviation of values. NA/null values are excluded Parameters skipna : boolean, default True Exclude NA/null values level : int, default None If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a smaller Series Returns stdev : float (or Series if level specified) pandas.Series.sum Series.sum(axis=0, dtype=None, out=None, skipna=True, level=None) Return sum of values. NA/null values are excluded Parameters skipna : boolean, default True Exclude NA/null values level : int, default None If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a smaller Series Extra parameters are to preserve ndarray interface. Returns sum : float (or Series if level specified)
pandas.Series.var Series.var(axis=None, dtype=None, out=None, ddof=1, skipna=True, level=None) Return unbiased variance of values. NA/null values are excluded Parameters skipna : boolean, default True Exclude NA/null values level : int, default None If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a smaller Series Returns var : float (or Series if level specified) pandas.Series.value_counts Series.value_counts() Returns Series containing counts of unique values. The resulting Series will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values Returns counts : Series
Aligned Series pandas.Series.drop Series.drop(labels, axis=0) Return new object with labels in requested axis removed Parameters labels : array-like axis : int Returns dropped : type of caller pandas.Series.reindex Series.reindex(index=None, method=None, level=None, fill_value=nan, copy=True) Conform Series to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False Parameters index : array-like or Index New labels / index to conform to. Preferably an Index object to avoid duplicating data method : {'backfill', 'bfill', 'pad', 'ffill', None} Method to use for filling holes in reindexed Series pad / ffill: propagate LAST valid observation forward to next valid backfill / bfill: use NEXT valid observation to fill gap copy : boolean, default True Return a new object, even if the passed indexes are the same level : int or name Broadcast across a level, matching Index values on the passed MultiIndex level fill_value : scalar, default np.NaN Value to use for missing values. Defaults to NaN, but can be any compatible value Returns reindexed : Series pandas.Series.reindex_like Series.reindex_like(other, method=None) Reindex Series to match index of another Series, optionally with filling logic Parameters other : Series method : string or None See Series.reindex docstring Returns reindexed : Series
Notes
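A brief sketch of Series.reindex with the fill options described above (arbitrary example data):

from pandas import Series

s = Series([1.0, 2.0, 3.0], index=[0, 2, 4])
s.reindex(range(5))                  # new labels 1 and 3 become NaN
s.reindex(range(5), method='ffill')  # propagate last valid observation forward
s.reindex(range(5), fill_value=0)    # fill the holes with a constant instead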
pandas.Series.rename Series.rename(mapper) Alter Series index using dict or function Parameters mapper : dict-like or function Transformation to apply to each index Returns renamed : Series (new object)
Notes
pandas.Series.select Series.select(crit, axis=0) Return data corresponding to axis labels matching criteria Parameters crit : function To be called on each index (label). Should return True or False axis : int Returns selection : type of caller pandas.Series.take Series.take(indices, axis=0) Analogous to ndarray.take, return Series corresponding to requested indices Parameters indices : list / array of ints Returns taken : Series
pandas.Series.truncate Series.truncate(before=None, after=None, copy=True) Truncate a sorted DataFrame / Series before and/or after some particular dates. Parameters before : date Truncate before date after : date Truncate after date Returns truncated : type of caller
pandas.Series.interpolate Series.interpolate(method='linear') Interpolate missing values (after the first valid value) Parameters method : {'linear', 'time'} Interpolation method. Time interpolation works on daily and higher resolution data to interpolate given length of interval Returns interpolated : Series
pandas.Series.sort Series.sort(axis=0, kind='quicksort', order=None) Sort values and index labels by value, in place. For compatibility with ndarray API. No return value Parameters axis : int (can only be zero) kind : {'mergesort', 'quicksort', 'heapsort'}, default 'quicksort' Choice of sorting algorithm. See np.sort for more information. 'mergesort' is the only stable algorithm order : ignored pandas.Series.sort_index Series.sort_index(ascending=True) Sort object by labels (along an axis) Parameters ascending : boolean, default True Sort ascending vs. descending Returns sorted_obj : Series pandas.Series.sortlevel Series.sortlevel(level=0, ascending=True) Sort Series with MultiIndex by chosen level. Data will be lexicographically sorted by the chosen level followed by the other levels (in order) Parameters level : int ascending : bool, default True Returns sorted : Series pandas.Series.unstack Series.unstack(level=-1) Unstack, a.k.a. pivot, Series with MultiIndex to produce DataFrame Parameters level : int, string, or list of these, default last level Level(s) to unstack, can pass level name Returns unstacked : DataFrame
Examples
>>> s
one  a   1.
one  b   2.
two  a   3.
two  b   4.
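For the Series shown above, a sketch of what unstacking at either level produces:

>>> s.unstack(level=-1)
     a   b
one  1.  2.
two  3.  4.

>>> s.unstack(level=0)
   one  two
a  1.   3.
b  2.   4.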
Dates are assumed to be sorted pandas.Series.shift Series.shift(periods, offset=None, **kwds) Shift the index of the Series by desired number of periods with an optional time offset Parameters periods : int Number of periods to move, can be positive or negative offset : DateOffset, timedelta, or time rule string, optional Increment to use from datetools module or time rule (e.g. 'EOM') Returns shifted : Series pandas.Series.first_valid_index Series.first_valid_index() Return label for first non-NA/null value pandas.Series.last_valid_index Series.last_valid_index() Return label for last non-NA/null value pandas.Series.weekday Series.weekday
21.2.12 Plotting
Series.hist([ax, grid], **kwds) Draw histogram of the input series using matplotlib
Series.plot([label, kind, ...], **kwds) Plot the input series with the index on the x-axis using matplotlib

pandas.Series.hist Series.hist(ax=None, grid=True, **kwds) Draw histogram of the input series using matplotlib Parameters ax : matplotlib axis object If not passed, uses gca() kwds : keywords To be passed to the actual plotting function
Notes
See matplotlib documentation online for more on this pandas.Series.plot Series.plot(label=None, kind='line', use_index=True, rot=30, ax=None, style='-', grid=True, logy=False, **kwds) Plot the input series with the index on the x-axis using matplotlib Parameters label : label argument to provide to plot kind : {'line', 'bar'} rot : int, default 30 Rotation for tick labels use_index : boolean, default True Plot index as axis tick labels ax : matplotlib axis object If not passed, uses gca() style : string, default '-' matplotlib line style to use kwds : keywords To be passed to the actual plotting function
Notes
See matplotlib documentation online for more on this subject Intended to be used in ipython pylab mode
Series.to_csv(path[, index, sep, na_rep, ...]) Write Series to a comma-separated values (csv) file
Series.to_dict() Convert Series to {label -> value} dict
Series.to_sparse([kind, fill_value]) Convert Series to SparseSeries
parse_dates : boolean, default True Parse dates. Different default from read_table header : int, default 0 Row to use at header (skip prior rows) index_col : int or sequence, default 0 Column to use for index. If a sequence is given, a MultiIndex is used. Different default from read_table encoding : string, optional a string representing the encoding to use if the contents are non-ascii, for python versions prior to 3 Returns y : Series pandas.Series.load classmethod Series.load(path) pandas.Series.save Series.save(path) pandas.Series.to_csv Series.to_csv(path, index=True, sep=',', na_rep='', header=False, index_label=None, mode='w', nanRep=None, encoding=None) Write Series to a comma-separated values (csv) file Parameters path : string File path nanRep : string, default '' Missing data representation header : boolean, default False Write out series name index : boolean, default True Write row names (index) index_label : string or sequence, default None Column label for index column(s) if desired. If None is given, and header and index are True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex. mode : Python write mode, default 'w' sep : character, default ',' Field delimiter for the output file. encoding : string, optional
a string representing the encoding to use if the contents are non-ascii, for python versions prior to 3 pandas.Series.to_dict Series.to_dict() Convert Series to {label -> value} dict Returns value_dict : dict pandas.Series.to_sparse Series.to_sparse(kind='block', fill_value=None) Convert Series to SparseSeries Parameters kind : {'block', 'integer'} fill_value : float, defaults to NaN (missing) Returns sp : SparseSeries
21.3 DataFrame
21.3.1 Attributes and underlying data
Axes index: row labels columns: column labels

DataFrame.as_matrix([columns]) Convert the frame to its Numpy-array matrix representation.
DataFrame.dtypes
DataFrame.get_dtype_counts()
DataFrame.values Convert the frame to its Numpy-array matrix representation.
DataFrame.axes
DataFrame.ndim
DataFrame.shape

pandas.DataFrame.as_matrix DataFrame.as_matrix(columns=None) Convert the frame to its Numpy-array matrix representation. Columns are presented in sorted order unless a specific list of columns is provided. Parameters columns : array-like Specific column order Returns values : ndarray If the DataFrame is heterogeneous and contains booleans or objects, the result will be of dtype=object
pandas.DataFrame.dtypes DataFrame.dtypes pandas.DataFrame.get_dtype_counts DataFrame.get_dtype_counts() pandas.DataFrame.values DataFrame.values Convert the frame to its Numpy-array matrix representation. Columns are presented in sorted order unless a specific list of columns is provided. Parameters columns : array-like Specific column order Returns values : ndarray If the DataFrame is heterogeneous and contains booleans or objects, the result will be of dtype=object pandas.DataFrame.axes DataFrame.axes pandas.DataFrame.ndim DataFrame.ndim pandas.DataFrame.shape DataFrame.shape
index : Index or array-like Index to use for resulting frame. Will default to np.arange(n) if no indexing information part of input data and no index provided columns : Index or array-like Will default to np.arange(n) if no column labels provided dtype : dtype, default None Data type to force, otherwise infer copy : boolean, default False Copy data from inputs. Only affects DataFrame / 2d ndarray input See Also: DataFrame.from_records constructor from tuples, also record arrays DataFrame.from_dict from dicts of Series, arrays, or dicts DataFrame.from_csv from CSV files DataFrame.from_items from sequence of (key, value) pairs read_csv
Examples
>>> d = {'col1': ts1, 'col2': ts2}
>>> df = DataFrame(data=d, index=index)
>>> df2 = DataFrame(np.random.randn(10, 5))
>>> df3 = DataFrame(np.random.randn(10, 5),
...                 columns=['a', 'b', 'c', 'd', 'e'])
pandas.DataFrame.astype DataFrame.astype(dtype) Cast object to input numpy.dtype Parameters dtype : numpy.dtype or Python type Returns casted : type of caller pandas.DataFrame.copy DataFrame.copy(deep=True) Make a copy of this object Parameters deep : boolean, default True Make a deep copy, i.e. also copy data Returns copy : type of caller
axis : int, default 0 Axis to retrieve cross-section on copy : boolean, default True Whether to make a copy of the data Returns xs : Series
pandas.DataFrame.add DataFrame.add(other, axis='columns', level=None, fill_value=None) Binary operator add with support to substitute a fill_value for missing data in one of the inputs Parameters other : Series, DataFrame, or constant axis : {0, 1, 'index', 'columns'} For Series input, axis to match Series index on fill_value : None or float value, default None Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing level : int or name Broadcast across a level, matching Index values on the passed MultiIndex level Returns result : DataFrame
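As with the Series methods, fill_value substitutes for cells missing on only one side; a small sketch:

from pandas import DataFrame

df1 = DataFrame({'a': [1.0, 2.0]}, index=[0, 1])
df2 = DataFrame({'a': [10.0, 20.0]}, index=[1, 2])

df1 + df2                   # rows 0 and 2 come out NaN
df1.add(df2, fill_value=0)  # 0 -> 1.0, 1 -> 12.0, 2 -> 20.0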
Notes
Mismatched indices will be unioned together pandas.DataFrame.div DataFrame.div(other, axis='columns', level=None, fill_value=None) Binary operator divide with support to substitute a fill_value for missing data in one of the inputs Parameters other : Series, DataFrame, or constant axis : {0, 1, 'index', 'columns'} For Series input, axis to match Series index on fill_value : None or float value, default None Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing level : int or name Broadcast across a level, matching Index values on the passed MultiIndex level Returns result : DataFrame
Notes
Mismatched indices will be unioned together pandas.DataFrame.mul DataFrame.mul(other, axis='columns', level=None, fill_value=None) Binary operator multiply with support to substitute a fill_value for missing data in one of the inputs Parameters other : Series, DataFrame, or constant axis : {0, 1, 'index', 'columns'} For Series input, axis to match Series index on fill_value : None or float value, default None Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing level : int or name Broadcast across a level, matching Index values on the passed MultiIndex level Returns result : DataFrame
Notes
pandas.DataFrame.sub DataFrame.sub(other, axis='columns', level=None, fill_value=None) Binary operator subtract with support to substitute a fill_value for missing data in one of the inputs Parameters other : Series, DataFrame, or constant axis : {0, 1, 'index', 'columns'} For Series input, axis to match Series index on fill_value : None or float value, default None Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing level : int or name Broadcast across a level, matching Index values on the passed MultiIndex level Returns result : DataFrame
Notes
Mismatched indices will be unioned together pandas.DataFrame.radd DataFrame.radd(other, axis='columns', level=None, fill_value=None) Binary operator radd with support to substitute a fill_value for missing data in one of the inputs Parameters other : Series, DataFrame, or constant axis : {0, 1, 'index', 'columns'} For Series input, axis to match Series index on fill_value : None or float value, default None Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing level : int or name Broadcast across a level, matching Index values on the passed MultiIndex level Returns result : DataFrame
Notes
Mismatched indices will be unioned together pandas.DataFrame.rdiv DataFrame.rdiv(other, axis='columns', level=None, fill_value=None) Binary operator rdivide with support to substitute a fill_value for missing data in one of the inputs Parameters other : Series, DataFrame, or constant axis : {0, 1, 'index', 'columns'}
For Series input, axis to match Series index on fill_value : None or float value, default None Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing level : int or name Broadcast across a level, matching Index values on the passed MultiIndex level Returns result : DataFrame
Notes
Mismatched indices will be unioned together pandas.DataFrame.rmul DataFrame.rmul(other, axis='columns', level=None, fill_value=None) Binary operator rmultiply with support to substitute a fill_value for missing data in one of the inputs Parameters other : Series, DataFrame, or constant axis : {0, 1, 'index', 'columns'} For Series input, axis to match Series index on fill_value : None or float value, default None Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing level : int or name Broadcast across a level, matching Index values on the passed MultiIndex level Returns result : DataFrame
Notes
Mismatched indices will be unioned together pandas.DataFrame.rsub DataFrame.rsub(other, axis='columns', level=None, fill_value=None) Binary operator rsubtract with support to substitute a fill_value for missing data in one of the inputs Parameters other : Series, DataFrame, or constant axis : {0, 1, 'index', 'columns'} For Series input, axis to match Series index on fill_value : None or float value, default None Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level Returns result : DataFrame
Notes
Mismatched indices will be unioned together pandas.DataFrame.combine DataFrame.combine(other, func, fill_value=None) Add two DataFrame objects and do not propagate NaN values, so if for a (column, time) one frame is missing a value, it will default to the other frame's value (which might be NaN as well) Parameters other : DataFrame func : function fill_value : scalar value Returns result : DataFrame pandas.DataFrame.combineAdd DataFrame.combineAdd(other) Add two DataFrame objects and do not propagate NaN values, so if for a (column, time) one frame is missing a value, it will default to the other frame's value (which might be NaN as well) Parameters other : DataFrame Returns DataFrame : pandas.DataFrame.combine_first DataFrame.combine_first(other) Combine two DataFrame objects and default to non-null values in frame calling the method. Result index will be the union of the two indexes Parameters other : DataFrame Returns combined : DataFrame
Examples
>>> a.combine_first(b)
    a's values prioritized, use values from b to fill holes
pandas.DataFrame.combineMult DataFrame.combineMult(other) Multiply two DataFrame objects and do not propagate NaN values, so if for a (column, time) one frame is missing a value, it will default to the other frame's value (which might be NaN as well) Parameters other : DataFrame
Returns DataFrame :
DataFrame.apply(func[, axis, broadcast, ...]) Applies function along input axis of DataFrame.
DataFrame.applymap(func) Apply a function to a DataFrame that is intended to operate elementwise
DataFrame.groupby([by, axis, level, ...]) Group series using mapper (dict or key function, apply given function to group)
Function passed should not have side effects. If the result is a Series, it should have the same index
Examples >>> df.apply(numpy.sqrt) # returns DataFrame >>> df.apply(numpy.sum, axis=0) # equiv to df.sum(0) >>> df.apply(numpy.sum, axis=1) # equiv to df.sum(1)
pandas.DataFrame.applymap DataFrame.applymap(func) Apply a function to a DataFrame that is intended to operate elementwise, i.e. like doing map(func, series) for each series in the DataFrame Parameters func : function Python function, returns a single value from a single value Returns applied : DataFrame pandas.DataFrame.groupby DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True) Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns Parameters by : mapping function / list of functions, dict, Series, or tuple / list of column names. Called on each element of the object index to determine the groups. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups axis : int, default 0 level : int, level name, or sequence of such, default None If the axis is a MultiIndex (hierarchical), group by a particular level or levels as_index : boolean, default True For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively SQL-style grouped output sort : boolean, default True Sort group keys. Get better performance by turning this off Returns GroupBy object :
Examples
# DataFrame result
>>> data.groupby(func, axis=0).mean()

# DataFrame result
>>> data.groupby(['col1', 'col2'])['col3'].mean()

# DataFrame with hierarchical index
>>> data.groupby(['col1', 'col2']).mean()
pandas.DataFrame.corr DataFrame.corr(method='pearson') Compute pairwise correlation of columns, excluding NA/null values Parameters method : {'pearson', 'kendall', 'spearman'} pearson : standard correlation coefficient kendall : Kendall Tau correlation coefficient spearman : Spearman rank correlation Returns y : DataFrame pandas.DataFrame.corrwith DataFrame.corrwith(other, axis=0, drop=False) Compute pairwise correlation between rows or columns of two DataFrame objects. Parameters other : DataFrame axis : {0, 1} 0 to compute column-wise, 1 for row-wise drop : boolean, default False Drop missing indices from result, default returns union of all Returns correls : Series pandas.DataFrame.count DataFrame.count(axis=0, level=None, numeric_only=False) Return Series with number of non-NA/null observations over requested axis. Works with non-floating point data as well (detects NaN and None) Parameters axis : {0, 1} 0 for row-wise, 1 for column-wise level : int, default None If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DataFrame numeric_only : boolean, default False Include only float, int, boolean data Returns count : Series (or DataFrame if level specified) pandas.DataFrame.cumprod DataFrame.cumprod(axis=None, skipna=True) Return cumulative product over requested axis as DataFrame Parameters axis : {0, 1} 0 for row-wise, 1 for column-wise skipna : boolean, default True Exclude NA/null values. If an entire row/column is NA, the result will be NA
Returns y : DataFrame pandas.DataFrame.cumsum DataFrame.cumsum(axis=None, skipna=True) Return DataFrame of cumulative sums over requested axis. Parameters axis : {0, 1} 0 for row-wise, 1 for column-wise skipna : boolean, default True Exclude NA/null values. If an entire row/column is NA, the result will be NA Returns y : DataFrame pandas.DataFrame.describe DataFrame.describe() Generate various summary statistics of each column, excluding NaN values. These include: count, mean, std, min, max, and 10%/50%/90% quantiles Returns DataFrame of summary statistics : pandas.DataFrame.diff DataFrame.diff(periods=1) 1st discrete difference of object Parameters periods : int, default 1 Periods to shift for forming difference Returns diffed : DataFrame pandas.DataFrame.mad DataFrame.mad(axis=0, skipna=True, level=None) Return mean absolute deviation over requested axis. NA/null values are excluded Parameters axis : {0, 1} 0 for row-wise, 1 for column-wise skipna : boolean, default True Exclude NA/null values. If an entire row/column is NA, the result will be NA level : int, default None If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DataFrame Returns mad : Series (or DataFrame if level specified)
pandas.DataFrame.max DataFrame.max(axis=0, skipna=True, level=None) Return maximum over requested axis. NA/null values are excluded Parameters axis : {0, 1} 0 for row-wise, 1 for column-wise skipna : boolean, default True Exclude NA/null values. If an entire row/column is NA, the result will be NA level : int, default None If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DataFrame Returns max : Series (or DataFrame if level specified) pandas.DataFrame.mean DataFrame.mean(axis=0, skipna=True, level=None) Return mean over requested axis. NA/null values are excluded Parameters axis : {0, 1} 0 for row-wise, 1 for column-wise skipna : boolean, default True Exclude NA/null values. If an entire row/column is NA, the result will be NA level : int, default None If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DataFrame Returns mean : Series (or DataFrame if level specified) pandas.DataFrame.median DataFrame.median(axis=0, skipna=True, level=None) Return median over requested axis. NA/null values are excluded Parameters axis : {0, 1} 0 for row-wise, 1 for column-wise skipna : boolean, default True Exclude NA/null values. If an entire row/column is NA, the result will be NA level : int, default None If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DataFrame Returns median : Series (or DataFrame if level specified)
pandas.DataFrame.min DataFrame.min(axis=0, skipna=True, level=None) Return minimum over requested axis. NA/null values are excluded Parameters axis : {0, 1} 0 for row-wise, 1 for column-wise skipna : boolean, default True Exclude NA/null values. If an entire row/column is NA, the result will be NA level : int, default None If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DataFrame Returns min : Series (or DataFrame if level specified) pandas.DataFrame.prod DataFrame.prod(axis=0, skipna=True, level=None) Return product over requested axis. NA/null values are treated as 1 Parameters axis : {0, 1} 0 for row-wise, 1 for column-wise skipna : boolean, default True Exclude NA/null values. If an entire row/column is NA, the result will be NA level : int, default None If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DataFrame Returns product : Series (or DataFrame if level specified) pandas.DataFrame.quantile DataFrame.quantile(q=0.5, axis=0) Return values at the given quantile over requested axis, a la scoreatpercentile in scipy.stats Parameters q : quantile, default 0.5 (50% quantile) 0 <= q <= 1 axis : {0, 1} 0 for row-wise, 1 for column-wise Returns quantiles : Series pandas.DataFrame.skew DataFrame.skew(axis=0, skipna=True, level=None) Return unbiased skewness over requested axis. NA/null values are excluded Parameters axis : {0, 1}
0 for row-wise, 1 for column-wise skipna : boolean, default True Exclude NA/null values. If an entire row/column is NA, the result will be NA level : int, default None If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DataFrame Returns skew : Series (or DataFrame if level specified) pandas.DataFrame.sum DataFrame.sum(axis=0, numeric_only=None, skipna=True, level=None) Return sum over requested axis. NA/null values are excluded Parameters axis : {0, 1} 0 for row-wise, 1 for column-wise skipna : boolean, default True Exclude NA/null values. If an entire row/column is NA, the result will be NA level : int, default None If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DataFrame numeric_only : boolean, default None Include only float, int, boolean data. If None, will attempt to use everything, then use only numeric data Returns sum : Series (or DataFrame if level specified) pandas.DataFrame.std DataFrame.std(axis=0, skipna=True, level=None) Return unbiased standard deviation over requested axis. NA/null values are excluded Parameters axis : {0, 1} 0 for row-wise, 1 for column-wise skipna : boolean, default True Exclude NA/null values. If an entire row/column is NA, the result will be NA level : int, default None If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DataFrame Returns std : Series (or DataFrame if level specified)
pandas.DataFrame.var DataFrame.var(axis=0, skipna=True, level=None) Return unbiased variance over requested axis. NA/null values are excluded Parameters axis : {0, 1} 0 for row-wise, 1 for column-wise skipna : boolean, default True Exclude NA/null values. If an entire row/column is NA, the result will be NA level : int, default None If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DataFrame Returns var : Series (or DataFrame if level specified)
Parameters suffix : string Returns with_suffix : type of caller pandas.DataFrame.align DataFrame.align(other, join='outer', axis=None, level=None, copy=True, fill_value=nan, method=None) Align two DataFrame objects on their index and columns with the specified join method for each axis Index Parameters other : DataFrame or Series join : {'outer', 'inner', 'left', 'right'}, default 'outer' axis : {0, 1, None}, default None Align on index (0), columns (1), or both (None) level : int or name Broadcast across a level, matching Index values on the passed MultiIndex level copy : boolean, default True Always returns new objects. If copy=False and no reindexing is required then original objects are returned. fill_value : scalar, default np.NaN Value to use for missing values. Defaults to NaN, but can be any compatible value method : str, default None Returns (left, right) : (DataFrame, type of other) Aligned objects pandas.DataFrame.drop DataFrame.drop(labels, axis=0) Return new object with labels in requested axis removed Parameters labels : array-like axis : int Returns dropped : type of caller pandas.DataFrame.filter DataFrame.filter(items=None, like=None, regex=None) Restrict frame's columns to set of items or wildcard Parameters items : list-like List of columns to restrict to (must not all be present) like : string Keep columns where arg in col == True regex : string (regular expression) Keep columns with re.search(regex, col) == True
Arguments are mutually exclusive, but this is not checked for pandas.DataFrame.reindex DataFrame.reindex(index=None, columns=None, method=None, level=None, fill_value=nan, copy=True) Conform DataFrame to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False Parameters index : array-like, optional New labels / index to conform to. Preferably an Index object to avoid duplicating data columns : array-like, optional Same usage as index argument method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None Method to use for filling holes in reindexed DataFrame pad / ffill: propagate last valid observation forward to next valid backfill / bfill: use NEXT valid observation to fill gap copy : boolean, default True Return a new object, even if the passed indexes are the same level : int or name Broadcast across a level, matching Index values on the passed MultiIndex level fill_value : scalar, default np.NaN Value to use for missing values. Defaults to NaN, but can be any compatible value Returns reindexed : same type as calling instance
Examples

>>> df.reindex(index=[date1, date2, date3], columns=['A', 'B', 'C'])
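Building on that, a runnable sketch of the filling options; the labels and values here are invented:

>>> from pandas import DataFrame
>>> df = DataFrame({'A': [1.0, 2.0]}, index=['a', 'c'])
>>> df.reindex(['a', 'b', 'c'])                  # new label 'b' gets NaN
>>> df.reindex(['a', 'b', 'c'], method='pad')    # 'b' takes the last valid value
>>> df.reindex(['a', 'b', 'c'], fill_value=0.0)  # or fill holes with a constant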
pandas.DataFrame.reindex_like

DataFrame.reindex_like(other, method=None, copy=True)
Reindex DataFrame to match indices of another DataFrame, optionally with filling logic
Parameters other : DataFrame
method : string or None
copy : boolean, default True
Returns reindexed : DataFrame
Notes
Like calling s.reindex(index=other.index, columns=other.columns, method=...)

pandas.DataFrame.rename

DataFrame.rename(index=None, columns=None, copy=True)
Alter index and / or columns using input function or functions. Function / dict values must be unique (1-to-1). Labels not contained in a dict / Series will be left as-is.
Parameters index : dict-like or function, optional
    Transformation to apply to index values
columns : dict-like or function, optional
    Transformation to apply to column values
copy : boolean, default True
    Also copy underlying data
Returns renamed : DataFrame (new object)
See Also: Series.rename

pandas.DataFrame.select

DataFrame.select(crit, axis=0)
Return data corresponding to axis labels matching criteria
Parameters crit : function
    To be called on each index (label). Should return True or False
axis : int
Returns selection : type of caller

pandas.DataFrame.take

DataFrame.take(indices, axis=0)
Analogous to ndarray.take; return DataFrame corresponding to requested indices along an axis
Parameters indices : list / array of ints
axis : {0, 1}
Returns taken : DataFrame

pandas.DataFrame.truncate

DataFrame.truncate(before=None, after=None, copy=True)
Truncate a sorted DataFrame / Series before and/or after some particular dates.
Parameters before : date
    Truncate before date
after : date
    Truncate after date
Returns truncated : type of caller

pandas.DataFrame.head

DataFrame.head(n=5)
Returns first n rows of DataFrame

pandas.DataFrame.tail

DataFrame.tail(n=5)
Returns last n rows of DataFrame
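To tie several of these label-manipulation methods together, a small sketch with invented data:

>>> from pandas import DataFrame
>>> df = DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]})
>>> df.rename(columns={'a': 'alpha'})  # dict-based 1-to-1 relabeling
>>> df.rename(index=lambda i: i * 10)  # function applied to every label
>>> df.take([0, 2])                    # positional selection along axis 0
>>> df.head(2)                         # first two rows
>>> df.tail(2)                         # last two rows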
value : any kind (should be same type as array)
    Value to use to fill holes (e.g. 0)
inplace : boolean, default False
    If True, fill the DataFrame in place. Note: this will modify any other views on this DataFrame, e.g. if you took a no-copy slice of an existing DataFrame, such as a column in a DataFrame. Returns a reference to the filled object, which is self if inplace=True
Returns filled : DataFrame
See Also: reindex, asfreq
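A minimal sketch of filling missing data (the values are invented):

>>> import numpy as np
>>> from pandas import DataFrame
>>> df = DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 5.0, 6.0]})
>>> filled = df.fillna(0)       # new object with NaN holes replaced by 0
>>> df.fillna(0, inplace=True)  # fills df (and any views on it) directly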
pandas.DataFrame.sort_index

DataFrame.sort_index(axis=0, by=None, ascending=True)
Sort DataFrame either by labels (along either axis) or by the values in a column
Parameters axis : {0, 1}
    Sort index/rows versus columns
by : object
    Column name(s) in frame. Accepts a column name or a list or tuple for a nested sort.
ascending : boolean, default True
    Sort ascending vs. descending
Returns sorted : DataFrame

pandas.DataFrame.delevel

DataFrame.delevel(*args, **kwargs)
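A short sketch of the two sorting modes (column names and values invented):

>>> from pandas import DataFrame
>>> df = DataFrame({'a': [9, 7, 8], 'b': [3, 1, 2]}, index=[2, 0, 1])
>>> df.sort_index()                         # sort rows by index labels
>>> df.sort_index(axis=1, ascending=False)  # sort columns by name, descending
>>> df.sort_index(by='b')                   # sort rows by values in column 'b'
>>> df.sort_index(by=['b', 'a'])            # nested sort on two columns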
pandas.DataFrame.pivot

DataFrame.pivot(index=None, columns=None, values=None)
Reshape data (produce a "pivot" table) based on column values. Uses unique values from index / columns to form axes and return either DataFrame or Panel, depending on whether you request a single value column (DataFrame) or all columns (Panel)
Parameters index : string or object
    Column name to use to make new frame's index
columns : string or object
    Column name to use to make new frame's columns
values : string or object, optional
    Column name to use for populating new frame's values
Returns pivoted : DataFrame
    If no values column specified, will have hierarchically indexed columns
Notes
For finer-tuned control, see the hierarchical indexing documentation along with the related stack/unstack methods
Examples

>>> df
    foo  bar  baz
0   one  A    1.
1   one  B    2.
2   one  C    3.
3   two  A    4.
4   two  B    5.
5   two  C    6.

>>> df.pivot('foo', 'bar', 'baz')
     A  B  C
one  1  2  3
two  4  5  6

>>> df.pivot('foo', 'bar')['baz']
     A  B  C
one  1  2  3
two  4  5  6
pandas.DataFrame.sortlevel

DataFrame.sortlevel(level=0, axis=0, ascending=True)
Sort multilevel index by chosen axis and primary level. Data will be lexicographically sorted by the chosen level followed by the other levels (in order)
Parameters level : int
axis : {0, 1}
ascending : bool, default True
Returns sorted : DataFrame

pandas.DataFrame.swaplevel

DataFrame.swaplevel(i, j, axis=0)
Swap levels i and j in a MultiIndex on a particular axis
Returns swapped : type of caller (new object)

pandas.DataFrame.stack

DataFrame.stack(level=-1, dropna=True)
Pivot a level of the (possibly hierarchical) column labels, returning a DataFrame (or Series in the case of an object with a single level of column labels) having a hierarchical index with a new inner-most level of row labels.
Parameters level : int, string, or list of these, default last level
    Level(s) to stack, can pass level name
dropna : boolean, default True
    Whether to drop rows in the resulting Frame/Series with no valid values
Returns stacked : DataFrame or Series
Examples

>>> s
   one  two
a  1.   3.
b  2.   4.

>>> s.stack()
a  one    1
   two    3
b  one    2
   two    4
pandas.DataFrame.unstack

DataFrame.unstack(level=-1)
Pivot a level of the (necessarily hierarchical) index labels, returning a DataFrame having a new level of column labels whose inner-most level consists of the pivoted index labels. If the index is not a MultiIndex, the output will be a Series (the analogue of stack when the columns are not a MultiIndex)
Parameters level : int, string, or list of these, default last level
    Level(s) of index to unstack, can pass level name
Returns unstacked : DataFrame or Series
Examples

>>> s
one  a   1.
     b   2.
two  a   3.
     b   4.

>>> s.unstack(level=-1)
     a   b
one  1.  2.
two  3.  4.

>>> df = s.unstack(level=0)
>>> df
   one  two
a  1.   3.
b  2.   4.

>>> df.unstack()
one  a  1.
     b  2.
two  a  3.
     b  4.
pandas.DataFrame.T

DataFrame.T
Returns a DataFrame with the rows/columns switched. If the DataFrame is homogeneously-typed, the data is not copied

pandas.DataFrame.transpose

DataFrame.transpose()
Returns a DataFrame with the rows/columns switched. If the DataFrame is homogeneously-typed, the data is not copied
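A one-line illustration of transposing (data invented):

>>> from pandas import DataFrame
>>> df = DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['x', 'y'])
>>> df.T            # index becomes 'a', 'b'; columns become 'x', 'y'
>>> df.transpose()  # equivalent to the T property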
Index should be similar to one of the columns in this one. If a Series is passed, its name attribute must be set, and that will be used as the column name in the resulting joined DataFrame
on : column name, tuple/list of column names, or array-like
    Column(s) to use for joining, otherwise join on index. If multiple columns given, the passed DataFrame must have a MultiIndex. Can pass an array as the join key if not already contained in the calling DataFrame. Like an Excel VLOOKUP operation
how : {'left', 'right', 'outer', 'inner'}
    How to handle indexes of the two objects. Default: 'left' for joining on index, None otherwise
    * left: use calling frame's index
    * right: use input frame's index
    * outer: form union of indexes
    * inner: use intersection of indexes
lsuffix : string
    Suffix to use from left frame's overlapping columns
rsuffix : string
    Suffix to use from right frame's overlapping columns
sort : boolean, default False
    Order result DataFrame lexicographically by the join key. If False, preserves the index order of the calling (left) DataFrame
Returns joined : DataFrame
Notes
on, lsuffix, and rsuffix options are not supported when passing a list of DataFrame objects

pandas.DataFrame.merge

DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True, suffixes=('.x', '.y'), copy=True)
Merge DataFrame objects by performing a database-style join operation by columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on.
Parameters right : DataFrame
how : {'left', 'right', 'outer', 'inner'}, default 'inner'
    left: use only keys from left frame (SQL: left outer join)
    right: use only keys from right frame (SQL: right outer join)
    outer: use union of keys from both frames (SQL: full outer join)
    inner: use intersection of keys from both frames (SQL: inner join)
on : label or list
    Field names to join on. Must be found in both DataFrames.
left_on : label or list, or array-like
    Field names to join on in left DataFrame. Can be a vector or list of vectors of the length of the DataFrame to use a particular vector as the join key instead of columns
right_on : label or list, or array-like
    Field names to join on in right DataFrame or vector/list of vectors per left_on docs
left_index : boolean, default False
    Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels
right_index : boolean, default False
    Use the index from the right DataFrame as the join key. Same caveats as left_index
sort : boolean, default True
    Sort the join keys lexicographically in the result DataFrame
suffixes : 2-length sequence (tuple, list, ...)
    Suffix to apply to overlapping column names in the left and right side, respectively
copy : boolean, default True
    If False, do not copy data unnecessarily
Returns merged : DataFrame
Examples

>>> A
  lkey  value
0  foo      1
1  bar      2
2  baz      3
3  foo      4

>>> B
  rkey  value
0  foo      5
1  bar      6
2  qux      7
3  bar      8

>>> merge(A, B, left_on='lkey', right_on='rkey', how='outer')
  lkey  value.x  rkey  value.y
0  bar        2   bar        6
1  bar        2   bar        8
2  baz        3   NaN      NaN
3  foo        1   foo        5
4  foo        4   foo        5
5  NaN      NaN   qux        7
pandas.DataFrame.append

DataFrame.append(other, ignore_index=False, verify_integrity=True)
Append columns of other to end of this frame's columns and index, returning a new object. Columns not in this frame are added as new columns.
Parameters other : DataFrame or list of Series/dict-like objects
ignore_index : boolean, default False
    If True, do not use the index labels. Useful for gluing together record arrays
Returns appended : DataFrame
Notes
If a list of dict is passed and the keys are all contained in the DataFrame's index, the order of the columns in the resulting DataFrame will be unchanged
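A small sketch of appending two frames (contents invented); note that ignore_index=True sidesteps the verify_integrity check on overlapping index labels:

>>> from pandas import DataFrame
>>> df1 = DataFrame({'a': [1, 2]})
>>> df2 = DataFrame({'a': [3, 4], 'b': [5, 6]})
>>> df1.append(df2, ignore_index=True)  # column 'b' is NaN for the rows from df1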
21.3.12 Plotting
DataFrame.hist(**kwds[, grid])    Draw histograms of the DataFrame's series using matplotlib / pylab
DataFrame.plot(**kwds[, subplots, sharex, ...])    Make line plot of DataFrame's series with the index on the x-axis

pandas.DataFrame.hist

DataFrame.hist(grid=True, **kwds)
Draw histograms of the DataFrame's series using matplotlib / pylab.
Parameters kwds : other plotting keyword arguments
    To be passed to the hist function

pandas.DataFrame.plot

DataFrame.plot(subplots=False, sharex=True, sharey=False, use_index=True, figsize=None, grid=True, legend=True, rot=30, ax=None, kind='line', **kwds)
Make line plot of DataFrame's series with the index on the x-axis using matplotlib / pylab.
Parameters subplots : boolean, default False
    Make separate subplots for each time series
sharex : boolean, default True
    In case subplots=True, share x axis
sharey : boolean, default False
    In case subplots=True, share y axis
use_index : boolean, default True
    Use index as ticks for x axis
kind : {'line', 'bar'}
kwds : keywords
    Options to pass to Axis.plot
Notes
This method doesn't make much sense for cross-sections, and will error.
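A hedged sketch of basic plotting; this assumes matplotlib is installed, and the data are invented:

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from pandas import DataFrame
>>> df = DataFrame(np.random.randn(100, 2), columns=['x', 'y']).cumsum()
>>> df.plot()               # one line per column, index on the x-axis
>>> df.plot(subplots=True)  # one subplot per column, sharing the x axis
>>> df.hist()               # one histogram per column
>>> plt.show()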
pandas.DataFrame.from_csv

classmethod DataFrame.from_csv(path, header=0, sep=',', index_col=0, parse_dates=True, encoding=None)
Read delimited file into DataFrame
Parameters path : string
header : int, default 0
    Row to use as header (skip prior rows)
sep : string, default ','
    Field delimiter
index_col : int or sequence, default 0
    Column to use for index. If a sequence is given, a MultiIndex is used. Different default from read_table
parse_dates : boolean, default True
    Parse dates. Different default from read_table
Returns y : DataFrame
Notes
Preferable to use read_table for most general purposes, but from_csv makes for an easy roundtrip to and from file, especially with a DataFrame of time series data

pandas.DataFrame.from_records

classmethod DataFrame.from_records(data, index=None, exclude=None, columns=None, names=None)
Convert structured or record ndarray to DataFrame
Parameters data : ndarray (structured dtype), list of tuples, or DataFrame
index : string, list of fields, array-like
    Field of array to use as the index, alternately a specific set of input labels to use
exclude : sequence, default None
    Columns or fields to exclude
columns : sequence, default None
    Column names to use, replacing any found in passed data
Returns df : DataFrame

pandas.DataFrame.to_csv

DataFrame.to_csv(path_or_buf, sep=',', na_rep='', cols=None, header=True, index=True, index_label=None, mode='w', nanRep=None, encoding=None)
Write DataFrame to a comma-separated values (csv) file
Parameters path_or_buf : string or file handle / StringIO
    File path
na_rep : string, default ''
    Missing data representation
cols : sequence, optional
    Columns to write
header : boolean, default True
    Write out column names
index : boolean, default True
    Write row names (index)
index_label : string or sequence, default None
    Column label for index column(s) if desired. If None is given, and header and index are True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex.
mode : Python write mode, default 'w'
sep : character, default ','
    Field delimiter for the output file.
encoding : string, optional
    A string representing the encoding to use if the contents are non-ascii, for Python versions prior to 3
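A minimal roundtrip sketch (the file name is invented):

>>> from pandas import DataFrame
>>> df = DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> df.to_csv('example.csv')                 # index written as the first column
>>> df2 = DataFrame.from_csv('example.csv')  # read back; index_col=0 by default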
pandas.DataFrame.to_excel

DataFrame.to_excel(excel_writer, sheet_name='sheet1', na_rep='', cols=None, header=True, index=True, index_label=None)
Write DataFrame to an Excel sheet
Parameters excel_writer : string or ExcelWriter object
    File path or existing ExcelWriter
sheet_name : string, default 'sheet1'
    Name of sheet which will contain DataFrame
na_rep : string, default ''
    Missing data representation
cols : sequence, optional
    Columns to write
header : boolean, default True
    Write out column names
index : boolean, default True
    Write row names (index)
index_label : string or sequence, default None
    Column label for index column(s) if desired. If None is given, and header and index are True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex.
Notes
If passing an existing ExcelWriter object, then the sheet will be added to the existing workbook. This can be used to save different DataFrames to one workbook

>>> writer = ExcelWriter('output.xlsx')
>>> df1.to_excel(writer, 'sheet1')
>>> df2.to_excel(writer, 'sheet2')
>>> writer.save()

pandas.DataFrame.to_dict

DataFrame.to_dict()
Convert DataFrame to nested dictionary
Returns result : dict like {column -> {index -> value}}

pandas.DataFrame.to_records

DataFrame.to_records(index=True)
Convert DataFrame to record array. Index will be put in the 'index' field of the record array if requested
Parameters index : boolean, default True
    Include index in resulting record array, stored in 'index' field
Returns y : recarray

pandas.DataFrame.to_sparse

DataFrame.to_sparse(fill_value=None, kind='block')
Convert to SparseDataFrame
Parameters fill_value : float, default NaN
kind : {'block', 'integer'}
Returns y : SparseDataFrame

pandas.DataFrame.to_string

DataFrame.to_string(buf=None, columns=None, col_space=None, colSpace=None, header=True, index=True, na_rep='NaN', formatters=None, float_format=None, sparsify=True, nanRep=None, index_names=True, justify=None, force_unicode=False)
Render a DataFrame to a console-friendly tabular output.
Parameters frame : DataFrame
    object to render
buf : StringIO-like, optional
    buffer to write to
columns : sequence, optional
    the subset of columns to write; default None writes all columns
col_space : int, optional
    the width of each column
header : bool, optional
    whether to print column labels, default True
index : bool, optional
    whether to print index (row) labels, default True
na_rep : string, optional
    string representation of NaN to use, default 'NaN'
formatters : list or dict of one-parameter functions, optional
    formatter functions to apply to columns' elements by position or name, default None
float_format : one-parameter function, optional
    formatter function to apply to columns' elements if they are floats, default None
sparsify : bool, optional
    Set to False for a DataFrame with a hierarchical index to print every multiindex key at each row, default True
justify : {'left', 'right'}, default None
    Left or right-justify the column labels. If None uses the option from the configuration in pandas.core.common, 'left' out of the box
index_names : bool, optional
    Prints the names of the indexes, default True
force_unicode : bool, default False
    Always return a unicode result
Returns formatted : string (or unicode, depending on data and options)
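A quick sketch of the conversion helpers (data invented):

>>> from pandas import DataFrame
>>> df = DataFrame({'a': [1, 2]}, index=['x', 'y'])
>>> df.to_dict()               # {'a': {'x': 1, 'y': 2}}
>>> df.to_records()            # record array with an 'index' field
>>> df.to_string(index=False)  # tabular text without the row labels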
pandas.DataFrame.save

DataFrame.save(path)

pandas.DataFrame.load

classmethod DataFrame.load(path)

pandas.DataFrame.info

DataFrame.info(verbose=True, buf=None)
Concise summary of a DataFrame, used in __repr__ when very large.
Parameters verbose : boolean, default True
    If False, don't print column count summary
buf : writable buffer, defaults to sys.stdout
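save and load carry no description above; they serialize a DataFrame to and from disk (via pickle, an assumption the docstrings do not spell out). A sketch with an invented path:

>>> from pandas import DataFrame
>>> df = DataFrame({'a': [1, 2, 3]})
>>> df.save('frame.pickle')               # write the frame to disk
>>> df2 = DataFrame.load('frame.pickle')  # read it back
>>> df2.info()                            # concise summary to stdout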
21.4 Panel
21.4.1 Computations / Descriptive Stats