Structured Data Challenges in Finance and Statistics

Structured Data Challenges in Finance and Statistics

Wes McKinney

Rice Statistics, 21 November 2011

Wes McKinney () Structured data challenges Rice Statistics 1 / 43

Me

S.B., MIT Math ’07
3 years in the quant ﬁnance business
Now: starting a software company, initially to build ﬁnancial data
analysis and research systems
My blog: http://blog.wesmckinney.com
Twitter: @wesmckinn
Book! “Python for Data Analysis”, to hit the shelves later next year
from O’Reilly Media


Structured data

cname year agefrom ageto ls lsc pop ccode
0 Australia 1950 15 19 64.3 15.4 558 AUS
1 Australia 1950 20 24 48.4 26.4 645 AUS
2 Australia 1950 25 29 47.9 26.2 681 AUS
3 Australia 1950 30 34 44 23.8 614 AUS
4 Australia 1950 35 39 42.1 21.9 625 AUS
5 Australia 1950 40 44 38.9 20.1 555 AUS
6 Australia 1950 45 49 34 16.9 491 AUS
7 Australia 1950 50 54 29.6 14.6 439 AUS
8 Australia 1950 55 59 28 12.9 408 AUS
9 Australia 1950 60 64 26.3 12.1 356 AUS


Partial list of structured data necessities

Table modification: column insertion/deletion/type changes
Rich axis indexing, metadata
Easy data alignment
Aggregation and transformation by group (“group by”)
Missing data (NA) handling
Pivoting and reshaping
Merging and joining
Time series-specific manipulations
Fast Input/Output: text files, databases, HDF5, ...


Are existing tools good enough?

We care nearly equally about
Ease-of-use (syntax / API ﬁts your mental model)
Expressiveness
Performance (speed and memory usage)
Clean, consistent interface design is hard


Auxiliary concerns

Any tool needs to integrate well with:
Statistical modeling tools
Data visualization (plotting)
Target users
Computer scientists, statisticians, software engineers?
Data scientists?


Are existing tools good enough?

The typical players
R data.frame and friends + CRAN libraries
SQL and other relational databases
Python / NumPy: structured (record) arrays
Commercial products: SAS, Stata, MS Excel...
My conclusion: we still have a ways to go
R has become demonstrably better in the last 5 years (e.g. via plyr,
reshape2)


Deeper problems in many industries

Facilitating the research process only part of problem
Much of academia: “Production systems?”
Industry: a wasteland of misshapen wheels or expensive vendor
products
Explosive growth in data-driven production systems
Hybrid-language systems are not always a good idea


The big data conundrum

Great effort being invested in the (difficult) problem of large-scale
data processing, e.g. MapReduce-based
Less effort in the fundamental tooling for data manipulation /
preparation / integration
Single-node performance does matter
Single-node code development time matters too


pandas: my effort in this arena

Pick your favorite: panel data structures or Python structured data
analysis
Starting building April 2008 back at AQR Capital
Open-sourced (BSD license) mid-2009
Heavily tested, being used by many companies (inc. lots of financial
firms) as the cornerstone of their systems
Goal: optimal balance of ease-of-use, flexibility, and performance
Heavy development the last 6 months


Why did I start from scratch?

Accusations of NIH Syndrome abound
In 2008 I simultaneously needed
Agile, high performance data structures
A high productivity programming language for implementing all of the
non-computational business logic
A production application platform that would seamlessly integrate with
an interactive data analysis / research platform
In short, I was rebuilding major ﬁnancial systems and I found my
options inadequate
Thrilling innovation opportunity!


Why did I use Python?

High productivity general purpose language
Well thought-out object-oriented model
Excellent software-development tools
Easy for MATLAB/R users to learn
Flexible built-in data structures (dicts, sets, lists, tuples)
The right open-source scientiﬁc computing tools
Powerful array processing (NumPy)
Abundant tools for performance computing


But, Python is not perfect

For statistical computing, a chicken-and-egg problem
Python’s plotting libraries are not designed for statistical graphics
Built-in data structures are not especially optimized for my large data
use cases
Occasional semantic / syntactic niggles


Partial list of structured data necessities

Table modification: column insertion/deletion/type changes
Rich axis indexing, metadata
Easy data alignment
Aggregation and transformation by group (“group by”)
Missing data (NA) handling
Pivoting and reshaping
Merging and joining
Time series-specific manipulations
Fast Input/Output: text files, databases, HDF5, ...


DataFrame, the pandas workhorse

A 2D tabular data structure with row and column indexes
Fast for row- and column-oriented operations
Support heterogeneous columns WITHOUT sacriﬁcing performance in
the homogeneous (e.g. ﬂoating point only) case


DataFrame

cname year agefrom ageto ls lsc pop ccode
0 Australia 1950 15 19 64.3 15.4 558 AUS
1 Australia 1950 20 24 48.4 26.4 645 AUS
2 Australia 1950 25 29 47.9 26.2 681 AUS
3 Australia 1950 30 34 44 23.8 614 AUS
4 Australia 1950 35 39 42.1 21.9 625 AUS
5 Australia 1950 40 44 38.9 20.1 555 AUS
6 Australia 1950 45 49 34 16.9 491 AUS
7 Australia 1950 50 54 29.6 14.6 439 AUS
8 Australia 1950 55 59 28 12.9 408 AUS
9 Australia 1950 60 64 26.3 12.1 356 AUS


Axis indexing and metadata

Basic concept: labeled axes in use throughout the library
Need to support
Fast lookups (constant time)
Data realignment / selection by labels (linear)
Munging together irregularly indexed data
Key innovation: index is a data structure itself. Diﬀerent
implementations can support more sophisticated indexing
Axis labels can be any immutable Python object


Irregularly indexed data
DataFrame
Columns

I
N
D
E
X


Axis indexing

Axis Index

d 0

a 1

b 2

c 3

e 4


Why does this matter?

Real world data is highly irregular, especially time series
Operations between DataFrame objects automatically align on the
indexes
Nearly impossible to have errors due to misaligned data
Can vastly facilitate munging unstructured data into structured form
Grants immense freedom in writing research code
Time series are just a special case of a general indexed data structure


Axis indexing

"I have a brain!"

year 1965 1970 1975 1980 1985
cname agefrom
Australia 15 85.5 92.6 91.1 91.7 87.7
20 57.7 70.6 74.1 73.4 72
25 52.8 57.7 71.8 74 73.5
30 53.8 57.7 60.4 72.8 74.1
Me too! 35 46.6 52.6 59.5 63.2 72.7
40 47.5 52.6 56.3 61.4 62.9
45 41.3 48.7 55.9 61.1 61.1
50 41.7 48.7 53.8 60.2 60
55 36.3 42.1 54.1 61 59.2
60 37.5 42.1 48.9 61.6 59.1
65 30.2 30.9 48.8 60.4 59.7
70 30.2 30.9 41.7 60.4 57.6
75 30.2 30.9 41.7 62.5 57.5


Hierarchical indexing

Basic idea: represent high dimensional data in a lower-dimensional
structure that is easier to reason about
Axis index with k levels of indexing
Slice chunks of data in constant time!
Provides a very natural way of implementing reshaping operations
Advantage over a truly N-dimensional object: space-eﬃcient dense
representation if groups are unbalanced
Extremely useful for econometric models on panel data


Hierarchical indexing

agefrom 15 20 25 30 35
cname year
Australia 1965 85.5 57.7 52.8 53.8 46.6
1970 92.6 70.6 57.7 57.7 52.6
1975 91.1 74.1 71.8 60.4 59.5
1980 91.7 73.4 74 72.8 63.2
1985 87.7 72 73.5 74.1 72.7
1990 80 66.4 72 73.5 74.1
1995 66.3 59.9 66.4 72 73.5
2000 73.4 38.2 59.9 66.4 72
2005 82.3 44.8 38.2 59.9 66.4
2010 78.4 41.5 44.8 38.2 59.9
Austria 1965 8.1 50.9 50.9 24.2 24.2
1970 13.3 61.4 50.9 50.9 39.6
1975 23.5 64.3 61.4 56.6 49.6
1980 23.8 73.9 64.3 61.4 61.4
1985 22.4 69.2 71.1 62.9 59.9
1990 27.4 72 68.9 68.3 61.5
1995 38.9 64.2 66.9 66.6 66.7
2000 41.4 73.2 64.2 63.1 64.5
2005 42.9 93.9 75 62.9 59.2
2010 45.4 93.5 92.4 73.2 62.7


Joining and merging

Join and merge-type operations are very easy to implement with
indexing in place
Multi-key join same code as aligning hierarchically-indexed
DataFrames
Will illustrate this with examples


Supporting size mutability

In order to have good row-oriented performance, need to store
like-typed columns in a single ndarray
“Column” insertion: accumulate 1 × N × . . . homogeneous columns,
later consolidate with other like-typed into a single block
I.e. avoid reallocate-copy or array concatenation steps as long as
possible
Column deletions can be no-copy events (since ndarrays support
views)


DataFrame, under the hood

Actually You see
6 2 4 5 1 3 1 2 3 4 5 6

O F I B I F B F F O


Reshaping
The fundamental operations
stack: pivot level from columns to rows
unstack: pivot level from rows to columns
Completely natural and intuitive with hierarchical indexing
No munging of column names necessary

year 1965 1970 1975 1980 1985 1990 1995 2000 2005 2010
cname agefrom
Australia 15 85.5 92.6 91.1 91.7 87.7 80 66.3 73.4 82.3 78.4
20 57.7 70.6 74.1 73.4 72 66.4 59.9 38.2 44.8 41.5
25 52.8 57.7 71.8 74 73.5 72 66.4 59.9 38.2 44.8
30 53.8 57.7 60.4 72.8 74.1 73.5 72 66.4 59.9 38.2
35 46.6 52.6 59.5 63.2 72.7 74.1 73.5 72 66.4 59.9
Austria 15 8.1 13.3 23.5 23.8 22.4 27.4 38.9 41.4 42.9 45.4
20 50.9 61.4 64.3 73.9 69.2 72 64.2 73.2 93.9 93.5
25 50.9 50.9 61.4 64.3 71.1 68.9 66.9 64.2 75 92.4
30 24.2 50.9 56.6 61.4 62.9 68.3 66.6 63.1 62.9 73.2
35 24.2 39.6 49.6 61.4 59.9 61.5 66.7 64.5 59.2 62.7


Reshaping

In [5]: df.unstack(’agefrom’).stack(’year’)
agefrom 15 20 25 30 35
cname year
Australia 1965 85.5 57.7 52.8 53.8 46.6
1970 92.6 70.6 57.7 57.7 52.6
1975 91.1 74.1 71.8 60.4 59.5
1980 91.7 73.4 74 72.8 63.2
1985 87.7 72 73.5 74.1 72.7
1990 80 66.4 72 73.5 74.1
1995 66.3 59.9 66.4 72 73.5
2000 73.4 38.2 59.9 66.4 72
2005 82.3 44.8 38.2 59.9 66.4
2010 78.4 41.5 44.8 38.2 59.9
Austria 1965 8.1 50.9 50.9 24.2 24.2
1970 13.3 61.4 50.9 50.9 39.6
1975 23.5 64.3 61.4 56.6 49.6
1980 23.8 73.9 64.3 61.4 61.4
1985 22.4 69.2 71.1 62.9 59.9
1990 27.4 72 68.9 68.3 61.5
1995 38.9 64.2 66.9 66.6 66.7
2000 41.4 73.2 64.2 63.1 64.5
2005 42.9 93.9 75 62.9 59.2
2010 45.4 93.5 92.4 73.2 62.7


Reshaping implementation nuances

Must carefully deal with unbalanced group sizes / missing data
I play vectorization tricks with the NumPy memory layout: no for
loops!
Care must be taken to handle heterogeneous and homogeneous data
cases


GroupBy

High level process
split data set into groups
apply function to each group (an aggregation or a transformation)
combine results intelligently into a result data structure
Can be used to emulate SQL GROUP BY operations


GroupBy

Grouping closely related to indexing
Create correspondence between axis labels and group labels using one
of:
Array of group labels (like a DataFrame column)
Python function to be applied to each axis tick
Can group by multiple keys
For a hierarchically indexed axis, can select a level and group by that
(or some transformation thereof)


Anatomy of GroupBy

grouped = obj.groupby([key1, key2, key3])

This returns a GroupBy object
Each of the keys could be any of:
A Python function
A vector
A column name


Anatomy of GroupBy

aggregate, transform, and the more general apply supported

group_means = grouped.agg(np.mean)
group_means2 = grouped.mean()
demeaned = grouped.transform(lambda x: x - x.mean())


Anatomy of GroupBy

The GroupBy object is also iterable

group_means = {}
for group_name, group in grouped:
group_means[group_name] = grouped.mean()


GroupBy and hierarchical indexing

Hierarchical indexing came about as the natural result of a multi-key
aggregation:
>>> group_means = df.groupby(['country', 'agefrom']).mean()
>>> group_means[['ls', 'lsc', 'pop']].unstack('country')

ls lsc pop
country Australia Austria Australia Austria Australia Austria
agefrom
15 70.03 31.1 26.17 14.67 6163 3310
20 58.02 59.98 34.51 45.83 1113 531
25 57.02 45.5 33.02 32.73 5021 2791
30 59.16 46.56 35.29 33.87 1082 527
35 59.58 43.29 34.53 30.3 1053 528.8
40 58.8 40.98 33.92 27.88 1005 522.5
45 56.79 39.19 31.71 26 927.2 503.5
50 54.71 37.14 30.37 23.94 836.6 475.2
55 51.93 34.82 26.98 22.11 735.8 443.2
60 49.92 32.35 25.85 20.2 632.6 408.8
65 47.02 29.59 22.98 18.44 522.5 361.9
70 46.27 28.52 22.53 17.72 410.5 295.8
75 46.28 28.06 23.41 18.42 624.5 437.6


What makes GroupBy hard?

factor-izing the group labels is very expensive
Function call overhead on small groups
To sort or not to sort?
Cheaper than computing the group labels!
Munging together results in exceptional cases is tricky


Time series operations

Fixed frequency indexing (DateRange)
Domain-specific time offsets (like business day, business month end)
Frequency conversion
Forward-filling / back-filling / interpolation
Leading/lagging
In the works (later this year), better/faster upsampling/downsampling


Shoot-out with R’s time series objects

“Inner join” addition between two irregular 500K-length time series
sampled from 1 million-length time series. Timestamps are POSIXct (R) /
int64 (Python)

package timing factor
pandas 21.5 ms 1.0
xts 41.3 ms 1.92
fts 370 ms 17.2
its 1117 ms 51.95
zoo 3479 ms 161.8

Where there is smoke, there is ﬁre?
Iron: Macbook Pro Core i7 with R 2.13.1, pandas 0.5.1git / Python 2.7.2


Erm, inner join?

Intersecting time stamps loses information


My (mild) performance obsession


My (mild) performance obsession

Good performance stems from many factors
Well-designed algorithms (from a complexity standpoint)
Minimizing copying of data
Minimizing interpreter / function call overhead
Taking advantage of memory layout
I value functionality over performance, but I do spend a lot of time
proﬁling code and grokking performance characteristics


Demo, time permitting


Structured Data Challenges in Finance and Statistics

Recommended

More Related Content

What's hot (20)

Similar to Structured Data Challenges in Finance and Statistics (20)

More from Wes McKinney (20)

Recently uploaded (20)

Structured Data Challenges in Finance and Statistics