SlideShare a Scribd company logo
Structured Data Challenges in Finance and Statistics

                            Wes McKinney


                   Rice Statistics, 21 November 2011




 Wes McKinney ()            Structured data challenges   Rice Statistics   1 / 43
Me



     S.B., MIT Math โ€™07
     3 years in the quant ๏ฌnance business
     Now: starting a software company, initially to build ๏ฌnancial data
     analysis and research systems
     My blog: http://blog.wesmckinney.com
     Twitter: @wesmckinn
     Book! โ€œPython for Data Analysisโ€, to hit the shelves later next year
     from Oโ€™Reilly Media




     Wes McKinney ()          Structured data challenges     Rice Statistics   2 / 43
Structured data



    cname              year   agefrom       ageto             ls     lsc    pop       ccode
0   Australia          1950   15            19                64.3   15.4   558       AUS
1   Australia          1950   20            24                48.4   26.4   645       AUS
2   Australia          1950   25            29                47.9   26.2   681       AUS
3   Australia          1950   30            34                44     23.8   614       AUS
4   Australia          1950   35            39                42.1   21.9   625       AUS
5   Australia          1950   40            44                38.9   20.1   555       AUS
6   Australia          1950   45            49                34     16.9   491       AUS
7   Australia          1950   50            54                29.6   14.6   439       AUS
8   Australia          1950   55            59                28     12.9   408       AUS
9   Australia          1950   60            64                26.3   12.1   356       AUS




     Wes McKinney ()             Structured data challenges                 Rice Statistics   3 / 43
Partial list of structured data necessities


    Table modi๏ฌcation: column insertion/deletion/type changes
    Rich axis indexing, metadata
    Easy data alignment
    Aggregation and transformation by group (โ€œgroup byโ€)
    Missing data (NA) handling
    Pivoting and reshaping
    Merging and joining
    Time series-speci๏ฌc manipulations
    Fast Input/Output: text ๏ฌles, databases, HDF5, ...




     Wes McKinney ()         Structured data challenges    Rice Statistics   4 / 43
Are existing tools good enough?




   We care nearly equally about
         Ease-of-use (syntax / API ๏ฌts your mental model)
         Expressiveness
         Performance (speed and memory usage)
   Clean, consistent interface design is hard




    Wes McKinney ()           Structured data challenges    Rice Statistics   5 / 43
Auxiliary concerns




    Any tool needs to integrate well with:
         Statistical modeling tools
         Data visualization (plotting)
    Target users
         Computer scientists, statisticians, software engineers?
         Data scientists?




    Wes McKinney ()            Structured data challenges          Rice Statistics   6 / 43
Are existing tools good enough?



   The typical players
         R data.frame and friends + CRAN libraries
         SQL and other relational databases
         Python / NumPy: structured (record) arrays
         Commercial products: SAS, Stata, MS Excel...
   My conclusion: we still have a ways to go
   R has become demonstrably better in the last 5 years (e.g. via plyr,
   reshape2)




    Wes McKinney ()          Structured data challenges    Rice Statistics   7 / 43
Deeper problems in many industries




   Facilitating the research process only part of problem
   Much of academia: โ€œProduction systems?โ€
   Industry: a wasteland of misshapen wheels or expensive vendor
   products
   Explosive growth in data-driven production systems
   Hybrid-language systems are not always a good idea




    Wes McKinney ()         Structured data challenges      Rice Statistics   8 / 43
The big data conundrum




   Great e๏ฌ€ort being invested in the (di๏ฌƒcult) problem of large-scale
   data processing, e.g. MapReduce-based
   Less e๏ฌ€ort in the fundamental tooling for data manipulation /
   preparation / integration
   Single-node performance does matter
   Single-node code development time matters too




    Wes McKinney ()        Structured data challenges      Rice Statistics   9 / 43
pandas: my e๏ฌ€ort in this arena



    Pick your favorite: panel data structures or Python structured data
    analysis
    Starting building April 2008 back at AQR Capital
    Open-sourced (BSD license) mid-2009
    Heavily tested, being used by many companies (inc. lots of ๏ฌnancial
    ๏ฌrms) as the cornerstone of their systems
    Goal: optimal balance of ease-of-use, ๏ฌ‚exibility, and performance
    Heavy development the last 6 months




    Wes McKinney ()         Structured data challenges     Rice Statistics   10 / 43
Why did I start from scratch?



    Accusations of NIH Syndrome abound
    In 2008 I simultaneously needed
         Agile, high performance data structures
         A high productivity programming language for implementing all of the
         non-computational business logic
         A production application platform that would seamlessly integrate with
         an interactive data analysis / research platform
    In short, I was rebuilding major ๏ฌnancial systems and I found my
    options inadequate
    Thrilling innovation opportunity!




    Wes McKinney ()           Structured data challenges       Rice Statistics   11 / 43
Why did I use Python?



   High productivity general purpose language
   Well thought-out object-oriented model
   Excellent software-development tools
   Easy for MATLAB/R users to learn
   Flexible built-in data structures (dicts, sets, lists, tuples)
   The right open-source scienti๏ฌc computing tools
         Powerful array processing (NumPy)
         Abundant tools for performance computing




    Wes McKinney ()           Structured data challenges       Rice Statistics   12 / 43
But, Python is not perfect




    For statistical computing, a chicken-and-egg problem
    Pythonโ€™s plotting libraries are not designed for statistical graphics
    Built-in data structures are not especially optimized for my large data
    use cases
    Occasional semantic / syntactic niggles




    Wes McKinney ()           Structured data challenges      Rice Statistics   13 / 43
Partial list of structured data necessities


    Table modi๏ฌcation: column insertion/deletion/type changes
    Rich axis indexing, metadata
    Easy data alignment
    Aggregation and transformation by group (โ€œgroup byโ€)
    Missing data (NA) handling
    Pivoting and reshaping
    Merging and joining
    Time series-speci๏ฌc manipulations
    Fast Input/Output: text ๏ฌles, databases, HDF5, ...




     Wes McKinney ()         Structured data challenges    Rice Statistics   14 / 43
DataFrame, the pandas workhorse




   A 2D tabular data structure with row and column indexes
   Fast for row- and column-oriented operations
   Support heterogeneous columns WITHOUT sacri๏ฌcing performance in
   the homogeneous (e.g. ๏ฌ‚oating point only) case




    Wes McKinney ()        Structured data challenges   Rice Statistics   15 / 43
DataFrame



    cname              year   agefrom       ageto             ls     lsc    pop        ccode
0   Australia          1950   15            19                64.3   15.4   558        AUS
1   Australia          1950   20            24                48.4   26.4   645        AUS
2   Australia          1950   25            29                47.9   26.2   681        AUS
3   Australia          1950   30            34                44     23.8   614        AUS
4   Australia          1950   35            39                42.1   21.9   625        AUS
5   Australia          1950   40            44                38.9   20.1   555        AUS
6   Australia          1950   45            49                34     16.9   491        AUS
7   Australia          1950   50            54                29.6   14.6   439        AUS
8   Australia          1950   55            59                28     12.9   408        AUS
9   Australia          1950   60            64                26.3   12.1   356        AUS




     Wes McKinney ()             Structured data challenges                 Rice Statistics   16 / 43
Axis indexing and metadata



   Basic concept: labeled axes in use throughout the library
   Need to support
         Fast lookups (constant time)
         Data realignment / selection by labels (linear)
         Munging together irregularly indexed data
   Key innovation: index is a data structure itself. Di๏ฌ€erent
   implementations can support more sophisticated indexing
   Axis labels can be any immutable Python object




    Wes McKinney ()            Structured data challenges   Rice Statistics   17 / 43
Irregularly indexed data
                      DataFrame
                                              Columns




                           I
                           N
                           D
                           E
                           X




    Wes McKinney ()            Structured data challenges   Rice Statistics   18 / 43
Axis indexing

                      Axis Index

                      d                        0


                      a                        1


                      b                        2


                      c                        3


                      e                        4


    Wes McKinney ()   Structured data challenges   Rice Statistics   19 / 43
Why does this matter?




   Real world data is highly irregular, especially time series
   Operations between DataFrame objects automatically align on the
   indexes
         Nearly impossible to have errors due to misaligned data
   Can vastly facilitate munging unstructured data into structured form
   Grants immense freedom in writing research code
   Time series are just a special case of a general indexed data structure




    Wes McKinney ()           Structured data challenges      Rice Statistics   20 / 43
Axis indexing

           "I have a brain!"


             year             1965 1970 1975 1980 1985
            cname     agefrom
            Australia 15       85.5 92.6 91.1 91.7 87.7
                      20       57.7 70.6 74.1 73.4 72
                      25       52.8 57.7 71.8 74   73.5
                      30       53.8 57.7 60.4 72.8 74.1
Me too!               35       46.6 52.6 59.5 63.2 72.7
                      40       47.5 52.6 56.3 61.4 62.9
                      45       41.3 48.7 55.9 61.1 61.1
                      50       41.7 48.7 53.8 60.2 60
                      55       36.3 42.1 54.1 61   59.2
                      60       37.5 42.1 48.9 61.6 59.1
                      65       30.2 30.9 48.8 60.4 59.7
                      70       30.2 30.9 41.7 60.4 57.6
                      75       30.2 30.9 41.7 62.5 57.5




      Wes McKinney ()          Structured data challenges   Rice Statistics   21 / 43
Hierarchical indexing



    Basic idea: represent high dimensional data in a lower-dimensional
    structure that is easier to reason about
    Axis index with k levels of indexing
    Slice chunks of data in constant time!
    Provides a very natural way of implementing reshaping operations
    Advantage over a truly N-dimensional object: space-e๏ฌƒcient dense
    representation if groups are unbalanced
    Extremely useful for econometric models on panel data




    Wes McKinney ()          Structured data challenges     Rice Statistics   22 / 43
Hierarchical indexing

                       agefrom       15   20    25    30    35
                      cname     year
                      Australia 1965 85.5 57.7 52.8 53.8 46.6
                                1970 92.6 70.6 57.7 57.7 52.6
                                1975 91.1 74.1 71.8 60.4 59.5
                                1980 91.7 73.4 74      72.8 63.2
                                1985 87.7 72     73.5 74.1 72.7
                                1990 80    66.4 72     73.5 74.1
                                1995 66.3 59.9 66.4 72       73.5
                                2000 73.4 38.2 59.9 66.4 72
                                2005 82.3 44.8 38.2 59.9 66.4
                                2010 78.4 41.5 44.8 38.2 59.9
                      Austria   1965 8.1   50.9 50.9 24.2 24.2
                                1970 13.3 61.4 50.9 50.9 39.6
                                1975 23.5 64.3 61.4 56.6 49.6
                                1980 23.8 73.9 64.3 61.4 61.4
                                1985 22.4 69.2 71.1 62.9 59.9
                                1990 27.4 72     68.9 68.3 61.5
                                1995 38.9 64.2 66.9 66.6 66.7
                                2000 41.4 73.2 64.2 63.1 64.5
                                2005 42.9 93.9 75      62.9 59.2
                                2010 45.4 93.5 92.4 73.2 62.7




    Wes McKinney ()               Structured data challenges        Rice Statistics   23 / 43
Joining and merging




   Join and merge-type operations are very easy to implement with
   indexing in place
   Multi-key join same code as aligning hierarchically-indexed
   DataFrames
   Will illustrate this with examples




    Wes McKinney ()         Structured data challenges    Rice Statistics   24 / 43
Supporting size mutability



    In order to have good row-oriented performance, need to store
    like-typed columns in a single ndarray
    โ€œColumnโ€ insertion: accumulate 1 ร— N ร— . . . homogeneous columns,
    later consolidate with other like-typed into a single block
    I.e. avoid reallocate-copy or array concatenation steps as long as
    possible
    Column deletions can be no-copy events (since ndarrays support
    views)




    Wes McKinney ()          Structured data challenges     Rice Statistics   25 / 43
DataFrame, under the hood


                Actually                                       You see
        6        2    4   5   1        3            1          2   3   4   5       6




        O             F       I       B              I         F   B   F   F      O




    Wes McKinney ()               Structured data challenges               Rice Statistics   26 / 43
Reshaping
    The fundamental operations
          stack: pivot level from columns to rows
          unstack: pivot level from rows to columns
    Completely natural and intuitive with hierarchical indexing
    No munging of column names necessary

   year                1965   1970    1975     1980      1985     1990   1995   2000    2005       2010
 cname     agefrom
 Australia 15          85.5   92.6    91.1     91.7      87.7     80     66.3   73.4    82.3       78.4
           20          57.7   70.6    74.1     73.4      72       66.4   59.9   38.2    44.8       41.5
           25          52.8   57.7    71.8     74        73.5     72     66.4   59.9    38.2       44.8
           30          53.8   57.7    60.4     72.8      74.1     73.5   72     66.4    59.9       38.2
           35          46.6   52.6    59.5     63.2      72.7     74.1   73.5   72      66.4       59.9
 Austria   15          8.1    13.3    23.5     23.8      22.4     27.4   38.9   41.4    42.9       45.4
           20          50.9   61.4    64.3     73.9      69.2     72     64.2   73.2    93.9       93.5
           25          50.9   50.9    61.4     64.3      71.1     68.9   66.9   64.2    75         92.4
           30          24.2   50.9    56.6     61.4      62.9     68.3   66.6   63.1    62.9       73.2
           35          24.2   39.6    49.6     61.4      59.9     61.5   66.7   64.5    59.2       62.7



     Wes McKinney ()                 Structured data challenges                  Rice Statistics    27 / 43
Reshaping

In [5]: df.unstack(โ€™agefromโ€™).stack(โ€™yearโ€™)
                       agefrom       15   20    25    30    35
                      cname     year
                      Australia 1965 85.5 57.7 52.8 53.8 46.6
                                1970 92.6 70.6 57.7 57.7 52.6
                                1975 91.1 74.1 71.8 60.4 59.5
                                1980 91.7 73.4 74      72.8 63.2
                                1985 87.7 72     73.5 74.1 72.7
                                1990 80    66.4 72     73.5 74.1
                                1995 66.3 59.9 66.4 72       73.5
                                2000 73.4 38.2 59.9 66.4 72
                                2005 82.3 44.8 38.2 59.9 66.4
                                2010 78.4 41.5 44.8 38.2 59.9
                      Austria   1965 8.1   50.9 50.9 24.2 24.2
                                1970 13.3 61.4 50.9 50.9 39.6
                                1975 23.5 64.3 61.4 56.6 49.6
                                1980 23.8 73.9 64.3 61.4 61.4
                                1985 22.4 69.2 71.1 62.9 59.9
                                1990 27.4 72     68.9 68.3 61.5
                                1995 38.9 64.2 66.9 66.6 66.7
                                2000 41.4 73.2 64.2 63.1 64.5
                                2005 42.9 93.9 75      62.9 59.2
                                2010 45.4 93.5 92.4 73.2 62.7



    Wes McKinney ()               Structured data challenges        Rice Statistics   28 / 43
Reshaping implementation nuances




   Must carefully deal with unbalanced group sizes / missing data
   I play vectorization tricks with the NumPy memory layout: no for
   loops!
   Care must be taken to handle heterogeneous and homogeneous data
   cases




    Wes McKinney ()        Structured data challenges    Rice Statistics   29 / 43
GroupBy




   High level process
        split data set into groups
        apply function to each group (an aggregation or a transformation)
        combine results intelligently into a result data structure
   Can be used to emulate SQL GROUP BY operations




   Wes McKinney ()           Structured data challenges      Rice Statistics   30 / 43
GroupBy



   Grouping closely related to indexing
   Create correspondence between axis labels and group labels using one
   of:
        Array of group labels (like a DataFrame column)
        Python function to be applied to each axis tick
   Can group by multiple keys
   For a hierarchically indexed axis, can select a level and group by that
   (or some transformation thereof)




   Wes McKinney ()           Structured data challenges    Rice Statistics   31 / 43
Anatomy of GroupBy




     grouped = obj.groupby([key1, key2, key3])

   This returns a GroupBy object
   Each of the keys could be any of:
        A Python function
        A vector
        A column name




   Wes McKinney ()          Structured data challenges   Rice Statistics   32 / 43
Anatomy of GroupBy




   aggregate, transform, and the more general apply supported

   group_means = grouped.agg(np.mean)
   group_means2 = grouped.mean()
   demeaned = grouped.transform(lambda x: x - x.mean())




   Wes McKinney ()       Structured data challenges   Rice Statistics   33 / 43
Anatomy of GroupBy




   The GroupBy object is also iterable

   group_means = {}
   for group_name, group in grouped:
       group_means[group_name] = grouped.mean()




   Wes McKinney ()         Structured data challenges   Rice Statistics   34 / 43
GroupBy and hierarchical indexing

   Hierarchical indexing came about as the natural result of a multi-key
   aggregation:
         >>> group_means = df.groupby(['country', 'agefrom']).mean()
         >>> group_means[['ls', 'lsc', 'pop']].unstack('country')

                      ls                    lsc                      pop
            country   Australia   Austria   Australia      Austria   Australia   Austria
         agefrom
         15           70.03       31.1      26.17          14.67     6163        3310
         20           58.02       59.98     34.51          45.83     1113        531
         25           57.02       45.5      33.02          32.73     5021        2791
         30           59.16       46.56     35.29          33.87     1082        527
         35           59.58       43.29     34.53          30.3      1053        528.8
         40           58.8        40.98     33.92          27.88     1005        522.5
         45           56.79       39.19     31.71          26        927.2       503.5
         50           54.71       37.14     30.37          23.94     836.6       475.2
         55           51.93       34.82     26.98          22.11     735.8       443.2
         60           49.92       32.35     25.85          20.2      632.6       408.8
         65           47.02       29.59     22.98          18.44     522.5       361.9
         70           46.27       28.52     22.53          17.72     410.5       295.8
         75           46.28       28.06     23.41          18.42     624.5       437.6




    Wes McKinney ()                 Structured data challenges                   Rice Statistics   35 / 43
What makes GroupBy hard?




   factor-izing the group labels is very expensive
   Function call overhead on small groups
   To sort or not to sort?
        Cheaper than computing the group labels!
   Munging together results in exceptional cases is tricky




   Wes McKinney ()          Structured data challenges       Rice Statistics   36 / 43
Time series operations




    Fixed frequency indexing (DateRange)
    Domain-speci๏ฌc time o๏ฌ€sets (like business day, business month end)
    Frequency conversion
    Forward-๏ฌlling / back-๏ฌlling / interpolation
    Leading/lagging
    In the works (later this year), better/faster upsampling/downsampling




    Wes McKinney ()          Structured data challenges   Rice Statistics   37 / 43
Shoot-out with Rโ€™s time series objects

โ€œInner joinโ€ addition between two irregular 500K-length time series
sampled from 1 million-length time series. Timestamps are POSIXct (R) /
int64 (Python)

                               package             timing             factor
                                pandas           21.5 ms                 1.0
                                    xts          41.3 ms                1.92
                                    fts           370 ms                17.2
                                    its          1117 ms              51.95
                                   zoo           3479 ms              161.8


                         Where there is smoke, there is ๏ฌre?
                   Iron: Macbook Pro Core i7 with R 2.13.1, pandas 0.5.1git / Python 2.7.2




     Wes McKinney ()                     Structured data challenges                          Rice Statistics   38 / 43
Erm, inner join?




                      Intersecting time stamps loses information

    Wes McKinney ()                Structured data challenges      Rice Statistics   39 / 43
My (mild) performance obsession




    Wes McKinney ()   Structured data challenges   Rice Statistics   40 / 43
My (mild) performance obsession




   Good performance stems from many factors
         Well-designed algorithms (from a complexity standpoint)
         Minimizing copying of data
         Minimizing interpreter / function call overhead
         Taking advantage of memory layout
   I value functionality over performance, but I do spend a lot of time
   pro๏ฌling code and grokking performance characteristics




    Wes McKinney ()           Structured data challenges      Rice Statistics   41 / 43
Demo, time permitting




Wes McKinney ()         Structured data challenges   Rice Statistics   42 / 43
Ad

More Related Content

What's hot (20)

SciPy 2010 Review
SciPy 2010 ReviewSciPy 2010 Review
SciPy 2010 Review
Enthought, Inc.
ย 
Unit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptxUnit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptx
Malla Reddy University
ย 
Data engineering and analytics using python
Data engineering and analytics using pythonData engineering and analytics using python
Data engineering and analytics using python
Purna Chander
ย 
Large Scale Data Analysis Tools
Large Scale Data Analysis ToolsLarge Scale Data Analysis Tools
Large Scale Data Analysis Tools
boorad
ย 
Getting started with pandas
Getting started with pandasGetting started with pandas
Getting started with pandas
maikroeder
ย 
ffbase, statistical functions for large datasets
ffbase, statistical functions for large datasetsffbase, statistical functions for large datasets
ffbase, statistical functions for large datasets
Edwin de Jonge
ย 
Introduction to Pandas and Time Series Analysis [PyCon DE]
Introduction to Pandas and Time Series Analysis [PyCon DE]Introduction to Pandas and Time Series Analysis [PyCon DE]
Introduction to Pandas and Time Series Analysis [PyCon DE]
Alexander Hendorf
ย 
Data-Intensive Scalable Science
Data-Intensive Scalable ScienceData-Intensive Scalable Science
Data-Intensive Scalable Science
University of Washington
ย 
Introduction to Pandas and Time Series Analysis [Budapest BI Forum]
Introduction to Pandas and Time Series Analysis [Budapest BI Forum]Introduction to Pandas and Time Series Analysis [Budapest BI Forum]
Introduction to Pandas and Time Series Analysis [Budapest BI Forum]
Alexander Hendorf
ย 
Pandas data transformational data structure patterns and challenges final
Pandas   data transformational data structure patterns and challenges  finalPandas   data transformational data structure patterns and challenges  final
Pandas data transformational data structure patterns and challenges final
Rajesh M
ย 
Pandas
PandasPandas
Pandas
maikroeder
ย 
Artificial intelligence and data stream mining
Artificial intelligence and data stream miningArtificial intelligence and data stream mining
Artificial intelligence and data stream mining
Albert Bifet
ย 
Python and Data Analysis
Python and Data AnalysisPython and Data Analysis
Python and Data Analysis
Praveen Nair
ย 
MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016 MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016
Albert Bifet
ย 
Osss manual-5-extract data
Osss manual-5-extract dataOsss manual-5-extract data
Osss manual-5-extract data
cwarner7_11
ย 
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Serban Tanasa
ย 
Visualization-Driven Data Aggregation
Visualization-Driven Data AggregationVisualization-Driven Data Aggregation
Visualization-Driven Data Aggregation
Zbigniew Jerzak
ย 
Introduction to the R Statistical Computing Environment
Introduction to the R Statistical Computing EnvironmentIntroduction to the R Statistical Computing Environment
Introduction to the R Statistical Computing Environment
izahn
ย 
Introduction to R for data science
Introduction to R for data scienceIntroduction to R for data science
Introduction to R for data science
Long Nguyen
ย 
Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
Scalable and Efficient Algorithms for Analysis of Massive, Streaming GraphsScalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
Jason Riedy
ย 
SciPy 2010 Review
SciPy 2010 ReviewSciPy 2010 Review
SciPy 2010 Review
Enthought, Inc.
ย 
Unit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptxUnit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptx
Malla Reddy University
ย 
Data engineering and analytics using python
Data engineering and analytics using pythonData engineering and analytics using python
Data engineering and analytics using python
Purna Chander
ย 
Large Scale Data Analysis Tools
Large Scale Data Analysis ToolsLarge Scale Data Analysis Tools
Large Scale Data Analysis Tools
boorad
ย 
Getting started with pandas
Getting started with pandasGetting started with pandas
Getting started with pandas
maikroeder
ย 
ffbase, statistical functions for large datasets
ffbase, statistical functions for large datasetsffbase, statistical functions for large datasets
ffbase, statistical functions for large datasets
Edwin de Jonge
ย 
Introduction to Pandas and Time Series Analysis [PyCon DE]
Introduction to Pandas and Time Series Analysis [PyCon DE]Introduction to Pandas and Time Series Analysis [PyCon DE]
Introduction to Pandas and Time Series Analysis [PyCon DE]
Alexander Hendorf
ย 
Introduction to Pandas and Time Series Analysis [Budapest BI Forum]
Introduction to Pandas and Time Series Analysis [Budapest BI Forum]Introduction to Pandas and Time Series Analysis [Budapest BI Forum]
Introduction to Pandas and Time Series Analysis [Budapest BI Forum]
Alexander Hendorf
ย 
Pandas data transformational data structure patterns and challenges final
Pandas   data transformational data structure patterns and challenges  finalPandas   data transformational data structure patterns and challenges  final
Pandas data transformational data structure patterns and challenges final
Rajesh M
ย 
Pandas
PandasPandas
Pandas
maikroeder
ย 
Artificial intelligence and data stream mining
Artificial intelligence and data stream miningArtificial intelligence and data stream mining
Artificial intelligence and data stream mining
Albert Bifet
ย 
Python and Data Analysis
Python and Data AnalysisPython and Data Analysis
Python and Data Analysis
Praveen Nair
ย 
MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016 MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016
Albert Bifet
ย 
Osss manual-5-extract data
Osss manual-5-extract dataOsss manual-5-extract data
Osss manual-5-extract data
cwarner7_11
ย 
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Serban Tanasa
ย 
Visualization-Driven Data Aggregation
Visualization-Driven Data AggregationVisualization-Driven Data Aggregation
Visualization-Driven Data Aggregation
Zbigniew Jerzak
ย 
Introduction to the R Statistical Computing Environment
Introduction to the R Statistical Computing EnvironmentIntroduction to the R Statistical Computing Environment
Introduction to the R Statistical Computing Environment
izahn
ย 
Introduction to R for data science
Introduction to R for data scienceIntroduction to R for data science
Introduction to R for data science
Long Nguyen
ย 
Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
Scalable and Efficient Algorithms for Analysis of Massive, Streaming GraphsScalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
Jason Riedy
ย 

Similar to Structured Data Challenges in Finance and Statistics (20)

Documenting sacs compliance with microsoft excel
Documenting sacs compliance with microsoft excelDocumenting sacs compliance with microsoft excel
Documenting sacs compliance with microsoft excel
Sandra Nicks
ย 
Week 1 Lab Directions
Week 1 Lab DirectionsWeek 1 Lab Directions
Week 1 Lab Directions
oudesign
ย 
Practical Data Visualization
Practical Data VisualizationPractical Data Visualization
Practical Data Visualization
Angela Zoss
ย 
From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs u...
From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs u...From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs u...
From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs u...
Craig Knoblock
ย 
summarized best pre-processing techniques
summarized best pre-processing techniquessummarized best pre-processing techniques
summarized best pre-processing techniques
shalinipriya1692
ย 
INT 1010 07-6.pdf
INT 1010 07-6.pdfINT 1010 07-6.pdf
INT 1010 07-6.pdf
Luis R Castellanos
ย 
Data Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notesData Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notes
asnaparveen414
ย 
Data-Analysis-and-Visualization-in-Python-1.pptx
Data-Analysis-and-Visualization-in-Python-1.pptxData-Analysis-and-Visualization-in-Python-1.pptx
Data-Analysis-and-Visualization-in-Python-1.pptx
ChiragNahata2
ย 
2. DATABASE MODELING_Database Fundamentals.pptx
2. DATABASE MODELING_Database Fundamentals.pptx2. DATABASE MODELING_Database Fundamentals.pptx
2. DATABASE MODELING_Database Fundamentals.pptx
Javier Daza
ย 
Beyond Linked Data - Exploiting Entity-Centric Knowledge on the Web
Beyond Linked Data - Exploiting Entity-Centric Knowledge on the WebBeyond Linked Data - Exploiting Entity-Centric Knowledge on the Web
Beyond Linked Data - Exploiting Entity-Centric Knowledge on the Web
Stefan Dietze
ย 
Ee184405 statistika dan stokastik statistik deskriptif 1 grafik
Ee184405 statistika dan stokastik   statistik deskriptif 1 grafikEe184405 statistika dan stokastik   statistik deskriptif 1 grafik
Ee184405 statistika dan stokastik statistik deskriptif 1 grafik
yusufbf
ย 
Sql xp 05
Sql xp 05Sql xp 05
Sql xp 05
Niit Care
ย 
Eyeriss Introduction
Eyeriss IntroductionEyeriss Introduction
Eyeriss Introduction
Michael Lee
ย 
Lecture_1_Intro.pdf
Lecture_1_Intro.pdfLecture_1_Intro.pdf
Lecture_1_Intro.pdf
paijitk
ย 
CIKM Tutorial 2008
CIKM Tutorial 2008CIKM Tutorial 2008
CIKM Tutorial 2008
Peiling Wang
ย 
Retrieval, Crawling and Fusion of Entity-centric Data on the Web
Retrieval, Crawling and Fusion of Entity-centric Data on the WebRetrieval, Crawling and Fusion of Entity-centric Data on the Web
Retrieval, Crawling and Fusion of Entity-centric Data on the Web
Stefan Dietze
ย 
unit 1 DATA MINING.ppt
unit 1 DATA MINING.pptunit 1 DATA MINING.ppt
unit 1 DATA MINING.ppt
BREENAHICETSTAFFCSE
ย 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
AbcdDcba12
ย 
STATISTICSINFORMED DECISIONS USING DATAFifth EditionChapte.docx
STATISTICSINFORMED DECISIONS USING DATAFifth EditionChapte.docxSTATISTICSINFORMED DECISIONS USING DATAFifth EditionChapte.docx
STATISTICSINFORMED DECISIONS USING DATAFifth EditionChapte.docx
rafaelaj1
ย 
Data Mining introduction and basic concepts
Data Mining introduction and basic conceptsData Mining introduction and basic concepts
Data Mining introduction and basic concepts
PritiRishi
ย 
Documenting sacs compliance with microsoft excel
Documenting sacs compliance with microsoft excelDocumenting sacs compliance with microsoft excel
Documenting sacs compliance with microsoft excel
Sandra Nicks
ย 
Week 1 Lab Directions
Week 1 Lab DirectionsWeek 1 Lab Directions
Week 1 Lab Directions
oudesign
ย 
Practical Data Visualization
Practical Data VisualizationPractical Data Visualization
Practical Data Visualization
Angela Zoss
ย 
From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs u...
From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs u...From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs u...
From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs u...
Craig Knoblock
ย 
summarized best pre-processing techniques
summarized best pre-processing techniquessummarized best pre-processing techniques
summarized best pre-processing techniques
shalinipriya1692
ย 
Data Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notesData Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notes
asnaparveen414
ย 
Data-Analysis-and-Visualization-in-Python-1.pptx
Data-Analysis-and-Visualization-in-Python-1.pptxData-Analysis-and-Visualization-in-Python-1.pptx
Data-Analysis-and-Visualization-in-Python-1.pptx
ChiragNahata2
ย 
2. DATABASE MODELING_Database Fundamentals.pptx
2. DATABASE MODELING_Database Fundamentals.pptx2. DATABASE MODELING_Database Fundamentals.pptx
2. DATABASE MODELING_Database Fundamentals.pptx
Javier Daza
ย 
Beyond Linked Data - Exploiting Entity-Centric Knowledge on the Web
Beyond Linked Data - Exploiting Entity-Centric Knowledge on the WebBeyond Linked Data - Exploiting Entity-Centric Knowledge on the Web
Beyond Linked Data - Exploiting Entity-Centric Knowledge on the Web
Stefan Dietze
ย 
Ee184405 statistika dan stokastik statistik deskriptif 1 grafik
Ee184405 statistika dan stokastik   statistik deskriptif 1 grafikEe184405 statistika dan stokastik   statistik deskriptif 1 grafik
Ee184405 statistika dan stokastik statistik deskriptif 1 grafik
yusufbf
ย 
Sql xp 05
Sql xp 05Sql xp 05
Sql xp 05
Niit Care
ย 
Eyeriss Introduction
Eyeriss IntroductionEyeriss Introduction
Eyeriss Introduction
Michael Lee
ย 
Lecture_1_Intro.pdf
Lecture_1_Intro.pdfLecture_1_Intro.pdf
Lecture_1_Intro.pdf
paijitk
ย 
CIKM Tutorial 2008
CIKM Tutorial 2008CIKM Tutorial 2008
CIKM Tutorial 2008
Peiling Wang
ย 
Retrieval, Crawling and Fusion of Entity-centric Data on the Web
Retrieval, Crawling and Fusion of Entity-centric Data on the WebRetrieval, Crawling and Fusion of Entity-centric Data on the Web
Retrieval, Crawling and Fusion of Entity-centric Data on the Web
Stefan Dietze
ย 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
AbcdDcba12
ย 
STATISTICSINFORMED DECISIONS USING DATAFifth EditionChapte.docx
STATISTICSINFORMED DECISIONS USING DATAFifth EditionChapte.docxSTATISTICSINFORMED DECISIONS USING DATAFifth EditionChapte.docx
STATISTICSINFORMED DECISIONS USING DATAFifth EditionChapte.docx
rafaelaj1
ย 
Data Mining introduction and basic concepts
Data Mining introduction and basic conceptsData Mining introduction and basic concepts
Data Mining introduction and basic concepts
PritiRishi
ย 
Ad

More from Wes McKinney (20)

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
Wes McKinney
ย 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
Wes McKinney
ย 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Wes McKinney
ย 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
Wes McKinney
ย 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
Wes McKinney
ย 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
Wes McKinney
ย 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
Wes McKinney
ย 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
Wes McKinney
ย 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
Wes McKinney
ย 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
Wes McKinney
ย 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Wes McKinney
ย 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
Wes McKinney
ย 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
Wes McKinney
ย 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
Wes McKinney
ย 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
Wes McKinney
ย 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
Wes McKinney
ย 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
Wes McKinney
ย 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
Wes McKinney
ย 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
Wes McKinney
ย 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
Wes McKinney
ย 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
Wes McKinney
ย 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
Wes McKinney
ย 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Wes McKinney
ย 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
Wes McKinney
ย 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
Wes McKinney
ย 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
Wes McKinney
ย 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
Wes McKinney
ย 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
Wes McKinney
ย 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
Wes McKinney
ย 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
Wes McKinney
ย 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Wes McKinney
ย 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
Wes McKinney
ย 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
Wes McKinney
ย 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
Wes McKinney
ย 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
Wes McKinney
ย 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
Wes McKinney
ย 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
Wes McKinney
ย 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
Wes McKinney
ย 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
Wes McKinney
ย 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
Wes McKinney
ย 
Ad

Recently uploaded (20)

Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
ย 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
ย 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
ย 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
ย 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
ย 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
ย 
Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
ย 
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
SOFTTECHHUB
ย 
AI Changes Everything โ€“ Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything โ€“ Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything โ€“ Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything โ€“ Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
ย 
HCL Nomad Web โ€“ Best Practices and Managing Multiuser Environments
HCL Nomad Web โ€“ Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web โ€“ Best Practices and Managing Multiuser Environments
HCL Nomad Web โ€“ Best Practices and Managing Multiuser Environments
panagenda
ย 
Procurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptxProcurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptx
Jon Hansen
ย 
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
ย 
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath MaestroDev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
UiPathCommunity
ย 
Technology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data AnalyticsTechnology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data Analytics
InData Labs
ย 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
ย 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
ย 
Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
ย 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
ย 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
ย 
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfThe Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
Abi john
ย 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
ย 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
ย 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
ย 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
ย 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
ย 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
ย 
Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
ย 
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
SOFTTECHHUB
ย 
AI Changes Everything โ€“ Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything โ€“ Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything โ€“ Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything โ€“ Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
ย 
HCL Nomad Web โ€“ Best Practices and Managing Multiuser Environments
HCL Nomad Web โ€“ Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web โ€“ Best Practices and Managing Multiuser Environments
HCL Nomad Web โ€“ Best Practices and Managing Multiuser Environments
panagenda
ย 
Procurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptxProcurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptx
Jon Hansen
ย 
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
ย 
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath MaestroDev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
UiPathCommunity
ย 
Technology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data AnalyticsTechnology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data Analytics
InData Labs
ย 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
ย 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
ย 
Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
ย 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
ย 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
ย 
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfThe Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
Abi john
ย 

Structured Data Challenges in Finance and Statistics

  • 1. Structured Data Challenges in Finance and Statistics Wes McKinney Rice Statistics, 21 November 2011 Wes McKinney () Structured data challenges Rice Statistics 1 / 43
  • 2. Me S.B., MIT Math โ€™07 3 years in the quant ๏ฌnance business Now: starting a software company, initially to build ๏ฌnancial data analysis and research systems My blog: http://blog.wesmckinney.com Twitter: @wesmckinn Book! โ€œPython for Data Analysisโ€, to hit the shelves later next year from Oโ€™Reilly Media Wes McKinney () Structured data challenges Rice Statistics 2 / 43
  • 3. Structured data cname year agefrom ageto ls lsc pop ccode 0 Australia 1950 15 19 64.3 15.4 558 AUS 1 Australia 1950 20 24 48.4 26.4 645 AUS 2 Australia 1950 25 29 47.9 26.2 681 AUS 3 Australia 1950 30 34 44 23.8 614 AUS 4 Australia 1950 35 39 42.1 21.9 625 AUS 5 Australia 1950 40 44 38.9 20.1 555 AUS 6 Australia 1950 45 49 34 16.9 491 AUS 7 Australia 1950 50 54 29.6 14.6 439 AUS 8 Australia 1950 55 59 28 12.9 408 AUS 9 Australia 1950 60 64 26.3 12.1 356 AUS Wes McKinney () Structured data challenges Rice Statistics 3 / 43
  • 4. Partial list of structured data necessities Table modi๏ฌcation: column insertion/deletion/type changes Rich axis indexing, metadata Easy data alignment Aggregation and transformation by group (โ€œgroup byโ€) Missing data (NA) handling Pivoting and reshaping Merging and joining Time series-speci๏ฌc manipulations Fast Input/Output: text ๏ฌles, databases, HDF5, ... Wes McKinney () Structured data challenges Rice Statistics 4 / 43
  • 5. Are existing tools good enough? We care nearly equally about Ease-of-use (syntax / API ๏ฌts your mental model) Expressiveness Performance (speed and memory usage) Clean, consistent interface design is hard Wes McKinney () Structured data challenges Rice Statistics 5 / 43
  • 6. Auxiliary concerns Any tool needs to integrate well with: Statistical modeling tools Data visualization (plotting) Target users Computer scientists, statisticians, software engineers? Data scientists? Wes McKinney () Structured data challenges Rice Statistics 6 / 43
  • 7. Are existing tools good enough? The typical players R data.frame and friends + CRAN libraries SQL and other relational databases Python / NumPy: structured (record) arrays Commercial products: SAS, Stata, MS Excel... My conclusion: we still have a ways to go R has become demonstrably better in the last 5 years (e.g. via plyr, reshape2) Wes McKinney () Structured data challenges Rice Statistics 7 / 43
  • 8. Deeper problems in many industries Facilitating the research process only part of problem Much of academia: โ€œProduction systems?โ€ Industry: a wasteland of misshapen wheels or expensive vendor products Explosive growth in data-driven production systems Hybrid-language systems are not always a good idea Wes McKinney () Structured data challenges Rice Statistics 8 / 43
  • 9. The big data conundrum Great e๏ฌ€ort being invested in the (di๏ฌƒcult) problem of large-scale data processing, e.g. MapReduce-based Less e๏ฌ€ort in the fundamental tooling for data manipulation / preparation / integration Single-node performance does matter Single-node code development time matters too Wes McKinney () Structured data challenges Rice Statistics 9 / 43
  • 10. pandas: my e๏ฌ€ort in this arena Pick your favorite: panel data structures or Python structured data analysis Starting building April 2008 back at AQR Capital Open-sourced (BSD license) mid-2009 Heavily tested, being used by many companies (inc. lots of ๏ฌnancial ๏ฌrms) as the cornerstone of their systems Goal: optimal balance of ease-of-use, ๏ฌ‚exibility, and performance Heavy development the last 6 months Wes McKinney () Structured data challenges Rice Statistics 10 / 43
  • 11. Why did I start from scratch? Accusations of NIH Syndrome abound In 2008 I simultaneously needed Agile, high performance data structures A high productivity programming language for implementing all of the non-computational business logic A production application platform that would seamlessly integrate with an interactive data analysis / research platform In short, I was rebuilding major ๏ฌnancial systems and I found my options inadequate Thrilling innovation opportunity! Wes McKinney () Structured data challenges Rice Statistics 11 / 43
  • 12. Why did I use Python? High productivity general purpose language Well thought-out object-oriented model Excellent software-development tools Easy for MATLAB/R users to learn Flexible built-in data structures (dicts, sets, lists, tuples) The right open-source scienti๏ฌc computing tools Powerful array processing (NumPy) Abundant tools for performance computing Wes McKinney () Structured data challenges Rice Statistics 12 / 43
  • 13. But, Python is not perfect For statistical computing, a chicken-and-egg problem Pythonโ€™s plotting libraries are not designed for statistical graphics Built-in data structures are not especially optimized for my large data use cases Occasional semantic / syntactic niggles Wes McKinney () Structured data challenges Rice Statistics 13 / 43
  • 14. Partial list of structured data necessities Table modi๏ฌcation: column insertion/deletion/type changes Rich axis indexing, metadata Easy data alignment Aggregation and transformation by group (โ€œgroup byโ€) Missing data (NA) handling Pivoting and reshaping Merging and joining Time series-speci๏ฌc manipulations Fast Input/Output: text ๏ฌles, databases, HDF5, ... Wes McKinney () Structured data challenges Rice Statistics 14 / 43
  • 15. DataFrame, the pandas workhorse A 2D tabular data structure with row and column indexes Fast for row- and column-oriented operations Support heterogeneous columns WITHOUT sacri๏ฌcing performance in the homogeneous (e.g. ๏ฌ‚oating point only) case Wes McKinney () Structured data challenges Rice Statistics 15 / 43
  • 16. DataFrame cname year agefrom ageto ls lsc pop ccode 0 Australia 1950 15 19 64.3 15.4 558 AUS 1 Australia 1950 20 24 48.4 26.4 645 AUS 2 Australia 1950 25 29 47.9 26.2 681 AUS 3 Australia 1950 30 34 44 23.8 614 AUS 4 Australia 1950 35 39 42.1 21.9 625 AUS 5 Australia 1950 40 44 38.9 20.1 555 AUS 6 Australia 1950 45 49 34 16.9 491 AUS 7 Australia 1950 50 54 29.6 14.6 439 AUS 8 Australia 1950 55 59 28 12.9 408 AUS 9 Australia 1950 60 64 26.3 12.1 356 AUS Wes McKinney () Structured data challenges Rice Statistics 16 / 43
  • 17. Axis indexing and metadata Basic concept: labeled axes in use throughout the library Need to support Fast lookups (constant time) Data realignment / selection by labels (linear) Munging together irregularly indexed data Key innovation: index is a data structure itself. Di๏ฌ€erent implementations can support more sophisticated indexing Axis labels can be any immutable Python object Wes McKinney () Structured data challenges Rice Statistics 17 / 43
  • 18. Irregularly indexed data DataFrame Columns I N D E X Wes McKinney () Structured data challenges Rice Statistics 18 / 43
  • 19. Axis indexing Axis Index d 0 a 1 b 2 c 3 e 4 Wes McKinney () Structured data challenges Rice Statistics 19 / 43
  • 20. Why does this matter? Real world data is highly irregular, especially time series Operations between DataFrame objects automatically align on the indexes Nearly impossible to have errors due to misaligned data Can vastly facilitate munging unstructured data into structured form Grants immense freedom in writing research code Time series are just a special case of a general indexed data structure Wes McKinney () Structured data challenges Rice Statistics 20 / 43
  • 21. Axis indexing "I have a brain!" year 1965 1970 1975 1980 1985 cname agefrom Australia 15 85.5 92.6 91.1 91.7 87.7 20 57.7 70.6 74.1 73.4 72 25 52.8 57.7 71.8 74 73.5 30 53.8 57.7 60.4 72.8 74.1 Me too! 35 46.6 52.6 59.5 63.2 72.7 40 47.5 52.6 56.3 61.4 62.9 45 41.3 48.7 55.9 61.1 61.1 50 41.7 48.7 53.8 60.2 60 55 36.3 42.1 54.1 61 59.2 60 37.5 42.1 48.9 61.6 59.1 65 30.2 30.9 48.8 60.4 59.7 70 30.2 30.9 41.7 60.4 57.6 75 30.2 30.9 41.7 62.5 57.5 Wes McKinney () Structured data challenges Rice Statistics 21 / 43
  • 22. Hierarchical indexing Basic idea: represent high dimensional data in a lower-dimensional structure that is easier to reason about Axis index with k levels of indexing Slice chunks of data in constant time! Provides a very natural way of implementing reshaping operations Advantage over a truly N-dimensional object: space-e๏ฌƒcient dense representation if groups are unbalanced Extremely useful for econometric models on panel data Wes McKinney () Structured data challenges Rice Statistics 22 / 43
  • 23. Hierarchical indexing agefrom 15 20 25 30 35 cname year Australia 1965 85.5 57.7 52.8 53.8 46.6 1970 92.6 70.6 57.7 57.7 52.6 1975 91.1 74.1 71.8 60.4 59.5 1980 91.7 73.4 74 72.8 63.2 1985 87.7 72 73.5 74.1 72.7 1990 80 66.4 72 73.5 74.1 1995 66.3 59.9 66.4 72 73.5 2000 73.4 38.2 59.9 66.4 72 2005 82.3 44.8 38.2 59.9 66.4 2010 78.4 41.5 44.8 38.2 59.9 Austria 1965 8.1 50.9 50.9 24.2 24.2 1970 13.3 61.4 50.9 50.9 39.6 1975 23.5 64.3 61.4 56.6 49.6 1980 23.8 73.9 64.3 61.4 61.4 1985 22.4 69.2 71.1 62.9 59.9 1990 27.4 72 68.9 68.3 61.5 1995 38.9 64.2 66.9 66.6 66.7 2000 41.4 73.2 64.2 63.1 64.5 2005 42.9 93.9 75 62.9 59.2 2010 45.4 93.5 92.4 73.2 62.7 Wes McKinney () Structured data challenges Rice Statistics 23 / 43
  • 24. Joining and merging Join and merge-type operations are very easy to implement with indexing in place Multi-key join same code as aligning hierarchically-indexed DataFrames Will illustrate this with examples Wes McKinney () Structured data challenges Rice Statistics 24 / 43
  • 25. Supporting size mutability In order to have good row-oriented performance, need to store like-typed columns in a single ndarray โ€œColumnโ€ insertion: accumulate 1 ร— N ร— . . . homogeneous columns, later consolidate with other like-typed into a single block I.e. avoid reallocate-copy or array concatenation steps as long as possible Column deletions can be no-copy events (since ndarrays support views) Wes McKinney () Structured data challenges Rice Statistics 25 / 43
  • 26. DataFrame, under the hood Actually You see 6 2 4 5 1 3 1 2 3 4 5 6 O F I B I F B F F O Wes McKinney () Structured data challenges Rice Statistics 26 / 43
  • 27. Reshaping The fundamental operations stack: pivot level from columns to rows unstack: pivot level from rows to columns Completely natural and intuitive with hierarchical indexing No munging of column names necessary year 1965 1970 1975 1980 1985 1990 1995 2000 2005 2010 cname agefrom Australia 15 85.5 92.6 91.1 91.7 87.7 80 66.3 73.4 82.3 78.4 20 57.7 70.6 74.1 73.4 72 66.4 59.9 38.2 44.8 41.5 25 52.8 57.7 71.8 74 73.5 72 66.4 59.9 38.2 44.8 30 53.8 57.7 60.4 72.8 74.1 73.5 72 66.4 59.9 38.2 35 46.6 52.6 59.5 63.2 72.7 74.1 73.5 72 66.4 59.9 Austria 15 8.1 13.3 23.5 23.8 22.4 27.4 38.9 41.4 42.9 45.4 20 50.9 61.4 64.3 73.9 69.2 72 64.2 73.2 93.9 93.5 25 50.9 50.9 61.4 64.3 71.1 68.9 66.9 64.2 75 92.4 30 24.2 50.9 56.6 61.4 62.9 68.3 66.6 63.1 62.9 73.2 35 24.2 39.6 49.6 61.4 59.9 61.5 66.7 64.5 59.2 62.7 Wes McKinney () Structured data challenges Rice Statistics 27 / 43
  • 28. Reshaping In [5]: df.unstack(โ€™agefromโ€™).stack(โ€™yearโ€™) agefrom 15 20 25 30 35 cname year Australia 1965 85.5 57.7 52.8 53.8 46.6 1970 92.6 70.6 57.7 57.7 52.6 1975 91.1 74.1 71.8 60.4 59.5 1980 91.7 73.4 74 72.8 63.2 1985 87.7 72 73.5 74.1 72.7 1990 80 66.4 72 73.5 74.1 1995 66.3 59.9 66.4 72 73.5 2000 73.4 38.2 59.9 66.4 72 2005 82.3 44.8 38.2 59.9 66.4 2010 78.4 41.5 44.8 38.2 59.9 Austria 1965 8.1 50.9 50.9 24.2 24.2 1970 13.3 61.4 50.9 50.9 39.6 1975 23.5 64.3 61.4 56.6 49.6 1980 23.8 73.9 64.3 61.4 61.4 1985 22.4 69.2 71.1 62.9 59.9 1990 27.4 72 68.9 68.3 61.5 1995 38.9 64.2 66.9 66.6 66.7 2000 41.4 73.2 64.2 63.1 64.5 2005 42.9 93.9 75 62.9 59.2 2010 45.4 93.5 92.4 73.2 62.7 Wes McKinney () Structured data challenges Rice Statistics 28 / 43
  • 29. Reshaping implementation nuances Must carefully deal with unbalanced group sizes / missing data I play vectorization tricks with the NumPy memory layout: no for loops! Care must be taken to handle heterogeneous and homogeneous data cases Wes McKinney () Structured data challenges Rice Statistics 29 / 43
  • 30. GroupBy High level process split data set into groups apply function to each group (an aggregation or a transformation) combine results intelligently into a result data structure Can be used to emulate SQL GROUP BY operations Wes McKinney () Structured data challenges Rice Statistics 30 / 43
  • 31. GroupBy Grouping closely related to indexing Create correspondence between axis labels and group labels using one of: Array of group labels (like a DataFrame column) Python function to be applied to each axis tick Can group by multiple keys For a hierarchically indexed axis, can select a level and group by that (or some transformation thereof) Wes McKinney () Structured data challenges Rice Statistics 31 / 43
  • 32. Anatomy of GroupBy grouped = obj.groupby([key1, key2, key3]) This returns a GroupBy object Each of the keys could be any of: A Python function A vector A column name Wes McKinney () Structured data challenges Rice Statistics 32 / 43
  • 33. Anatomy of GroupBy aggregate, transform, and the more general apply supported group_means = grouped.agg(np.mean) group_means2 = grouped.mean() demeaned = grouped.transform(lambda x: x - x.mean()) Wes McKinney () Structured data challenges Rice Statistics 33 / 43
  • 34. Anatomy of GroupBy The GroupBy object is also iterable group_means = {} for group_name, group in grouped: group_means[group_name] = grouped.mean() Wes McKinney () Structured data challenges Rice Statistics 34 / 43
  • 35. GroupBy and hierarchical indexing Hierarchical indexing came about as the natural result of a multi-key aggregation: >>> group_means = df.groupby(['country', 'agefrom']).mean() >>> group_means[['ls', 'lsc', 'pop']].unstack('country') ls lsc pop country Australia Austria Australia Austria Australia Austria agefrom 15 70.03 31.1 26.17 14.67 6163 3310 20 58.02 59.98 34.51 45.83 1113 531 25 57.02 45.5 33.02 32.73 5021 2791 30 59.16 46.56 35.29 33.87 1082 527 35 59.58 43.29 34.53 30.3 1053 528.8 40 58.8 40.98 33.92 27.88 1005 522.5 45 56.79 39.19 31.71 26 927.2 503.5 50 54.71 37.14 30.37 23.94 836.6 475.2 55 51.93 34.82 26.98 22.11 735.8 443.2 60 49.92 32.35 25.85 20.2 632.6 408.8 65 47.02 29.59 22.98 18.44 522.5 361.9 70 46.27 28.52 22.53 17.72 410.5 295.8 75 46.28 28.06 23.41 18.42 624.5 437.6 Wes McKinney () Structured data challenges Rice Statistics 35 / 43
  • 36. What makes GroupBy hard? factor-izing the group labels is very expensive Function call overhead on small groups To sort or not to sort? Cheaper than computing the group labels! Munging together results in exceptional cases is tricky Wes McKinney () Structured data challenges Rice Statistics 36 / 43
  • 37. Time series operations Fixed frequency indexing (DateRange) Domain-speci๏ฌc time o๏ฌ€sets (like business day, business month end) Frequency conversion Forward-๏ฌlling / back-๏ฌlling / interpolation Leading/lagging In the works (later this year), better/faster upsampling/downsampling Wes McKinney () Structured data challenges Rice Statistics 37 / 43
  • 38. Shoot-out with Rโ€™s time series objects โ€œInner joinโ€ addition between two irregular 500K-length time series sampled from 1 million-length time series. Timestamps are POSIXct (R) / int64 (Python) package timing factor pandas 21.5 ms 1.0 xts 41.3 ms 1.92 fts 370 ms 17.2 its 1117 ms 51.95 zoo 3479 ms 161.8 Where there is smoke, there is ๏ฌre? Iron: Macbook Pro Core i7 with R 2.13.1, pandas 0.5.1git / Python 2.7.2 Wes McKinney () Structured data challenges Rice Statistics 38 / 43
  • 39. Erm, inner join? Intersecting time stamps loses information Wes McKinney () Structured data challenges Rice Statistics 39 / 43
  • 40. My (mild) performance obsession Wes McKinney () Structured data challenges Rice Statistics 40 / 43
  • 41. My (mild) performance obsession Good performance stems from many factors Well-designed algorithms (from a complexity standpoint) Minimizing copying of data Minimizing interpreter / function call overhead Taking advantage of memory layout I value functionality over performance, but I do spend a lot of time pro๏ฌling code and grokking performance characteristics Wes McKinney () Structured data challenges Rice Statistics 41 / 43
  • 42. Demo, time permitting Wes McKinney () Structured data challenges Rice Statistics 42 / 43