DataFrames: The Good, Bad, and Ugly
Wes McKinney @wesmckinn
NY R Conference, 2015-04-25
© Cloudera, Inc. All rights reserved.

Disclaimer: the views presented in this talk are my personal opinions and not necessarily those of Cloudera

This talk
• Some commentary on all the data frame interfaces out there
• Biased observations and cursory judgments
• Thoughts on crafting high quality data tools

Disclaimer #2: This is a nuanced discussion

Who am I?
• Father of pandas (2008 - )
• Financial analytics in R / Python starting 2007
• 2010-2012
  • Hiatus from gainful employment
  • Make pandas ready for primetime
  • Write "Python for Data Analysis"
• 2013-2014: DataPad with Chang She & co
• 2014 - : Cloudera

What's in a DataFrame?
A table with some rows
By any other name
would analyze as sweet.

Got a table?
Put a DataFrame (interface) on it!

What is this "data frame" that you speak of
• A table-like data structure
• An API / user interface for the table
  • Selecting data
  • Math and relational algebra (join, filter, etc.)
  • File / database IO
  • ad infinitum

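The bullets above can be made concrete with a short sketch. It uses pandas, one of the libraries covered later in this talk; the table contents and column names are invented for illustration and are not from the slides.

```python
import io

import pandas as pd

# Hypothetical example data; names and values are illustrative only.
trades = pd.DataFrame({
    "symbol": ["AAPL", "MSFT", "AAPL", "GOOG"],
    "qty": [100, 50, 25, 10],
})
sectors = pd.DataFrame({
    "symbol": ["AAPL", "MSFT", "GOOG"],
    "sector": ["hardware", "software", "internet"],
})

# Selecting data: pull out one column by name.
qty = trades["qty"]

# Relational algebra: filter rows, then join against a second table.
big = trades[trades["qty"] >= 50]
joined = big.merge(sectors, on="symbol", how="left")

# File IO: round-trip the result through CSV text.
csv_text = joined.to_csv(index=False)
restored = pd.read_csv(io.StringIO(csv_text))
```

The same three kinds of operation (select, relate, read/write) exist in every data frame library discussed here; only the spelling differs.
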
Some axes of comparison
• Data structure internals (types, in-memory representation, etc.)
• Basic table API
• Relational algebra support
• Group-by / split-apply-combine API
• Performance, memory use, evaluation semantics
• Missing data
• Data tidying / ETL tools
• IO utilities
• Domain specific tools (e.g. time series)
• …

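Of the axes above, group-by / split-apply-combine is the easiest to show in a few lines. This is a library-free Python sketch with made-up data; data frame libraries typically hide the same three steps behind a single group-by call.

```python
from collections import defaultdict

# Hypothetical rows of (group key, value).
rows = [("a", 1), ("b", 10), ("a", 2), ("b", 20), ("c", 5)]

# Split: partition the values by key.
groups = defaultdict(list)
for key, value in rows:
    groups[key].append(value)

# Apply: reduce each group independently (here, a sum).
# Combine: collect the per-group results into one mapping.
totals = {key: sum(values) for key, values in groups.items()}
```
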
The Great Data Tool Decoupling™
• Thesis: over time, user interfaces, data storage, and execution engines will decouple and specialize
• In fact, you should really want this to happen
  • Share systems among languages
  • Reduce fragmentation and "lock-in"
  • Shift developer focus to usability
• Prediction: we'll be there by 2025; sooner if we all get our act together

Crafting quality data tools
• Quality / usefulness is usually forged by the fire of battle
• Real world use cases and social proof trump theory
• Eat that dog food
• When in doubt? Look at the test suite.

R data frames
• Thin layer on top of R list type
  • Sequence of named vectors
  • Can have row names (any R vector)
• Simple column and row selection API
• Analytics, data transformation, etc. left to base package and add-on libraries
• Richness / usability comes largely from libraries

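To show how thin such a layer can be, here is a toy Python analogue of "a sequence of named vectors with optional row names". The class name `TinyFrame` and its methods are hypothetical and exist only for this sketch; it is not how R implements data frames.

```python
class TinyFrame:
    """A toy table: a mapping of column name -> equal-length list."""

    def __init__(self, columns, row_names=None):
        self.columns = dict(columns)
        lengths = {len(values) for values in self.columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self.nrow = lengths.pop() if lengths else 0
        # Row names are optional; default to 1..n.
        if row_names is None:
            row_names = range(1, self.nrow + 1)
        self.row_names = list(row_names)
        if len(self.row_names) != self.nrow:
            raise ValueError("need exactly one row name per row")

    def column(self, name):
        """Column selection: just a lookup in the mapping."""
        return self.columns[name]

    def row(self, row_name):
        """Row selection: take the same position from every column."""
        i = self.row_names.index(row_name)
        return {name: values[i] for name, values in self.columns.items()}


df = TinyFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]},
               row_names=["r1", "r2", "r3"])
```

Everything beyond these two selection methods (joins, aggregation, reshaping) would have to come from other code, which is the point of the last two bullets.
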
Some awesome R data frame stuff
• "Hadley Stack"
  • dplyr, tidyr
  • legacy: plyr, reshape2
  • ggplot2
• data.table (data.frame + indices, fast algorithms)
• xts: time series

R data frames: rough edges
• Copy-on-write semantics
• API fragmentation / inconsistency
  • Use the "Hadley stack" for improved sanity
• Factor / String dichotomy
  • stringsAsFactors=FALSE a blessing and curse
• Somewhat limited type system

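The factor / string dichotomy is easier to see with the encoding written out. The sketch below is a Python analogy, not R code: a factor stores a sorted list of distinct levels plus one integer code per value. The helper names are hypothetical, and the codes here are 0-based whereas R's are 1-based.

```python
def to_factor(strings):
    """Encode strings as (levels, codes), loosely mirroring an R factor."""
    levels = sorted(set(strings))
    position = {level: i for i, level in enumerate(levels)}
    codes = [position[s] for s in strings]
    return levels, codes


def from_factor(levels, codes):
    """Decode integer codes back into the original strings."""
    return [levels[code] for code in codes]


states = ["NY", "CA", "NY", "TX", "CA"]
levels, codes = to_factor(states)
```

Code that expects plain strings but receives the integer codes (or the reverse) behaves differently, which is why a default that silently picks one representation is both a blessing and a curse.
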
dplyr
• Composable table API
• Good example of what the "decoupled" future might look like
  • New in-memory R/Rcpp execution engine
  • SQL backends for large subset of API

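A rough Python sketch of the "decoupled" idea: one composable query description, two interchangeable execution backends. The names `Query`, `to_sql`, and `run_in_memory` are hypothetical and this is not dplyr's implementation; the SQL rendering does no escaping and is for illustration only.

```python
import operator
from dataclasses import dataclass
from typing import Optional, Tuple

OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}


@dataclass(frozen=True)
class Query:
    """Backend-neutral description of a query; it holds no data itself."""
    table: str
    columns: Tuple[str, ...] = ()
    predicate: Optional[Tuple[str, str, object]] = None

    def filter(self, column, op, value):
        return Query(self.table, self.columns, (column, op, value))

    def select(self, *columns):
        return Query(self.table, columns, self.predicate)


def to_sql(q):
    """Backend 1: render the query as SQL text (no escaping; sketch only)."""
    cols = ", ".join(q.columns) if q.columns else "*"
    sql = f"SELECT {cols} FROM {q.table}"
    if q.predicate is not None:
        column, op, value = q.predicate
        sql += f" WHERE {column} {op} {value!r}"
    return sql


def run_in_memory(q, tables):
    """Backend 2: evaluate the same query over lists of dicts."""
    rows = tables[q.table]
    if q.predicate is not None:
        column, op, value = q.predicate
        rows = [r for r in rows if OPS[op](r[column], value)]
    if q.columns:
        rows = [{c: r[c] for c in q.columns} for r in rows]
    return rows


q = Query("trades").filter("qty", ">", 50).select("symbol")
tables = {"trades": [{"symbol": "AAPL", "qty": 100},
                     {"symbol": "MSFT", "qty": 50}]}
sql_text = to_sql(q)
local_rows = run_in_memory(q, tables)
```

The user-facing part (`filter`, `select`) never changes; only the function that consumes the `Query` does.
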
Spark DataFrames
• R/pandas-inspired API for tabular data manipulation in Scala, Python, etc.
• Logical operation graphs rewritten internally in more efficient form
• Good interop with Spark SQL
• Some interoperability with pandas
• Partial API Decoupling! (it still binds you to Spark)

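"Rewritten internally in more efficient form" can be illustrated with a toy rewrite rule over a logical plan. This is a minimal sketch, assuming a plan made only of scans and filters; the class and function names are hypothetical and this is not Spark's code. The rule merges stacked filters so the data is scanned once.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Scan:
    """Leaf node: read every row of a named table."""
    table: str


@dataclass(frozen=True)
class Filter:
    """Keep rows of `child` that satisfy all predicates."""
    child: "Plan"
    predicates: tuple  # e.g. (("qty", ">", 50),)


Plan = Union[Scan, Filter]


def combine_filters(plan):
    """Rewrite Filter(Filter(x, p), q) into Filter(x, p + q), recursively."""
    if isinstance(plan, Filter):
        child = combine_filters(plan.child)
        if isinstance(child, Filter):
            return Filter(child.child, child.predicates + plan.predicates)
        return Filter(child, plan.predicates)
    return plan


# What the user wrote: two separate filter calls on one table.
naive = Filter(Filter(Scan("trades"), (("qty", ">", 50),)),
               (("symbol", "=", "AAPL"),))
optimized = combine_filters(naive)
```

Because the user only builds a description of the work, the engine is free to reorganize it before anything runs.
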
pandas
• Several key data structures, data frame among them
• Considerably more complex internals than other data frame libraries
• Some good things
  • Born of need
  • A "batteries included" approach
  • Hierarchical axis labeling: addresses some hard use cases at expense of semantic complexity
  • Strong time series support

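A short sketch of hierarchical axis labeling, using pandas `MultiIndex`. The symbols, years, and prices are made up for the example.

```python
import pandas as pd

# Hypothetical data: one price per (symbol, year) pair.
index = pd.MultiIndex.from_product(
    [["AAPL", "MSFT"], [2013, 2014]], names=["symbol", "year"]
)
prices = pd.Series([10.0, 12.0, 30.0, 33.0], index=index)

# Select everything under one outer label; the result is indexed by year.
aapl = prices.loc["AAPL"]

# Pivot the inner level into columns: one row per symbol, one column per year.
wide = prices.unstack("year")

# Aggregate over one level of the hierarchy.
mean_by_symbol = prices.groupby(level="symbol").mean()
```

The convenience is real, but each of these calls depends on knowing which level a label lives on, which is the semantic complexity mentioned above.
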
pandas: rough edges
• Axis labelling can get in the way for folks needing "just a table"
• Ceded control of its type system / data rep'n from day 1 to NumPy
• Inefficient string handling (uses NumPy object arrays)
• Missing data handling less precise than other tools

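Two of these rough edges can be seen directly. The sketch below assumes pandas' default NumPy-backed integer dtype and uses invented values; it shows a missing entry changing a column's type, and strings held in an object array.

```python
import numpy as np
import pandas as pd

# An integer column with NumPy-backed storage.
counts = pd.Series([1, 2, 3])

# Reindexing to a label that has no value forces a missing entry.
with_gap = counts.reindex([0, 1, 5])

# NumPy integers have no bit pattern for "missing", so the result is a
# float column with NaN marking the gap; the integer type is lost.
gap_mask = with_gap.isna().tolist()

# Strings in a NumPy object array: each element is a reference to a
# separate Python str object rather than packed character data.
names = np.array(["ann", "bob"], dtype=object)
```
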
Julia: DataFrames.jl
• Started by Harlan Harris & co
• Part of broader JuliaStats initiative
  • More R-like than pandas-like
  • Very active: > 50 contributors!
• Still comparatively early
  • Less comprehensive API
  • More limited IO capabilities

Other data frames
• Saddle (Scala)
  • Dev'd by Adam Klein (ex-AQR) at Novus Partners (fintech startup)
  • Designed and used for financial use cases
• Deedle (F# / .NET)
  • Dev'd by AK's colleagues at BlueMountain (hedge fund)
• GraphLab / Dato
  • Really good C++ data frame with Python interface
  • Dual-licensed: AGPL + Commercial
• That's not all! Haskell, Go, usw…

We're not done yet
• The future is JSON-like
  • Support for nested types / semi-structured data is still weak
• Wanted: Apache-licensed, community standard C/C++ data frame that we all use (R, Python, Julia)
• Bring on the Great Decoupling

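A small library-free Python sketch of why JSON-like data sits awkwardly in a flat table. The `flatten` helper and the record are hypothetical. Nested objects can be squeezed into dotted column names, but a list-valued field has no natural flat form and stays an opaque cell.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into one level, joining keys with dots."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            # Lists and scalars are kept as-is; a list becomes one cell.
            flat[name] = value
    return flat


record = {
    "user": {"name": "ann", "geo": {"city": "NYC"}},
    "tags": ["r", "python"],
}
row = flatten(record)
```
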
Thank you
@wesmckinn
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Sam Palani
ย 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
Jeff Holoman
ย 
Machine Learning and Hadoop: Present and future
Machine Learning and Hadoop: Present and futureMachine Learning and Hadoop: Present and future
Machine Learning and Hadoop: Present and future
Cloudera, Inc.
ย 
Ad

More from Wes McKinney (20)

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
Wes McKinney
ย 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
Wes McKinney
ย 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Wes McKinney
ย 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
Wes McKinney
ย 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
Wes McKinney
ย 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
Wes McKinney
ย 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
Wes McKinney
ย 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
Wes McKinney
ย 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
Wes McKinney
ย 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
Wes McKinney
ย 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Wes McKinney
ย 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
Wes McKinney
ย 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
Wes McKinney
ย 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
Wes McKinney
ย 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
Wes McKinney
ย 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
Wes McKinney
ย 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
Wes McKinney
ย 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
Wes McKinney
ย 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and Interoperability
Wes McKinney
ย 
PyCon APAC 2016 Keynote
PyCon APAC 2016 KeynotePyCon APAC 2016 Keynote
PyCon APAC 2016 Keynote
Wes McKinney
ย 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
Wes McKinney
ย 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
Wes McKinney
ย 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Wes McKinney
ย 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
Wes McKinney
ย 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
Wes McKinney
ย 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
Wes McKinney
ย 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
Wes McKinney
ย 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
Wes McKinney
ย 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
Wes McKinney
ย 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
Wes McKinney
ย 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Wes McKinney
ย 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
Wes McKinney
ย 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
Wes McKinney
ย 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
Wes McKinney
ย 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
Wes McKinney
ย 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
Wes McKinney
ย 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
Wes McKinney
ย 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
Wes McKinney
ย 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and Interoperability
Wes McKinney
ย 
PyCon APAC 2016 Keynote
PyCon APAC 2016 KeynotePyCon APAC 2016 Keynote
PyCon APAC 2016 Keynote
Wes McKinney
ย 

Recently uploaded (20)

Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
ย 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
ย 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
ย 
Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
ย 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
ย 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
ย 
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven InsightsAndrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell
ย 
HCL Nomad Web โ€“ Best Practices and Managing Multiuser Environments
HCL Nomad Web โ€“ Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web โ€“ Best Practices and Managing Multiuser Environments
HCL Nomad Web โ€“ Best Practices and Managing Multiuser Environments
panagenda
ย 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
ย 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
ย 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
ย 
Semantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AISemantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AI
artmondano
ย 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
ย 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
ย 
Linux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdfLinux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdf
RHCSA Guru
ย 
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
ย 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
ย 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
ย 
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Aqusag Technologies
ย 
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Impelsys Inc.
ย 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
ย 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
ย 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
ย 
Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
ย 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
ย 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
ย 
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven InsightsAndrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell
ย 
HCL Nomad Web โ€“ Best Practices and Managing Multiuser Environments
HCL Nomad Web โ€“ Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web โ€“ Best Practices and Managing Multiuser Environments
HCL Nomad Web โ€“ Best Practices and Managing Multiuser Environments
panagenda
ย 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
ย 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
ย 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
ย 
Semantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AISemantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AI
artmondano
ย 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
ย 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
ย 
Linux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdfLinux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdf
RHCSA Guru
ย 
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
ย 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
ย 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
ย 
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Aqusag Technologies
ย 
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Impelsys Inc.
ย 

DataFrames: The Good, Bad, and Ugly

  • 1. DataFrames: The Good, Bad, and Ugly. Wes McKinney, @wesmckinn. NY R Conference, 2015-04-25. © Cloudera, Inc. All rights reserved.
  • 2. Disclaimer: the views presented in this talk are my personal opinions and not necessarily those of Cloudera
  • 3. This talk
      • Some commentary on all the data frame interfaces out there
      • Biased observations and cursory judgments
      • Thoughts on crafting high quality data tools
  • 4. Disclaimer #2: This is a nuanced discussion
  • 5. Who am I?
      • Father of pandas (2008 - )
      • Financial analytics in R / Python starting 2007
      • 2010-2012
          • Hiatus from gainful employment
          • Make pandas ready for primetime
          • Write "Python for Data Analysis"
      • 2013-2014: DataPad with Chang She & co
      • 2014 - : Cloudera
  • 6. What's in a DataFrame?
  • 7. What's in a DataFrame? A table with some rows / By any other name would analyze as sweet.
  • 8. Got a table?
  • 9. Got a table? Put a DataFrame (interface) on it!
  • 10. What is this "data frame" that you speak of
      • A table-like data structure
      • An API / user interface for the table
          • Selecting data
          • Math and relational algebra (join, filter, etc.)
          • File / database IO
          • ad infinitum
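As a rough illustration of slide 10's "table plus API" idea, here is a minimal sketch in plain Python. It assumes a table is just a dict of equal-length named columns; the names `Table`, `select`, `filter_rows`, and `inner_join` are hypothetical and do not come from the talk or from any library.

```python
from typing import Any, Callable, Dict, List

Table = Dict[str, List[Any]]  # column name -> column values (equal lengths)


def num_rows(table: Table) -> int:
    """Number of rows, read from the first column (0 for an empty table)."""
    return len(next(iter(table.values()))) if table else 0


def select(table: Table, columns: List[str]) -> Table:
    """Projection: keep only the named columns."""
    return {name: table[name] for name in columns}


def filter_rows(table: Table, predicate: Callable[[Dict[str, Any]], bool]) -> Table:
    """Selection: keep the rows for which predicate(row) is true."""
    keep = [
        i for i in range(num_rows(table))
        if predicate({name: col[i] for name, col in table.items()})
    ]
    return {name: [col[i] for i in keep] for name, col in table.items()}


def inner_join(left: Table, right: Table, key: str) -> Table:
    """Inner equi-join on one key column; assumes no other column names clash."""
    # Index the right table's row positions by key value.
    positions: Dict[Any, List[int]] = {}
    for j, value in enumerate(right[key]):
        positions.setdefault(value, []).append(j)

    out: Table = {name: [] for name in left}
    out.update({name: [] for name in right if name != key})
    for i, value in enumerate(left[key]):
        for j in positions.get(value, []):
            for name in left:
                out[name].append(left[name][i])
            for name in right:
                if name != key:
                    out[name].append(right[name][j])
    return out


# Made-up example data.
trades = {"symbol": ["AAPL", "MSFT", "AAPL"], "qty": [10, 5, 7]}
prices = {"symbol": ["AAPL", "MSFT"], "price": [125.0, 47.0]}

joined = inner_join(trades, prices, "symbol")
big = filter_rows(joined, lambda row: row["qty"] >= 7)
result = select(big, ["symbol", "price"])
```

Real data frame libraries layer typed columns, indexing, and IO on top of this basic shape, which is where the differences the talk goes on to compare begin.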
  • 11. Some axes of comparison
      • Data structure internals (types, in-memory representation, etc.)
      • Basic table API
      • Relational algebra support
      • Group-by / split-apply-combine API
      • Performance, memory use, evaluation semantics
      • Missing data
      • Data tidying / ETL tools
      • IO utilities
      • Domain specific tools (e.g. time series)
      • …
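The "group-by / split-apply-combine" axis on slide 11 can be sketched in a few lines of plain Python. The data and column names below are made up for illustration; real libraries expose this pattern through a dedicated group-by API rather than hand-written loops.

```python
from collections import defaultdict

# Made-up rows: one dict per record.
rows = [
    {"dept": "eng", "salary": 100},
    {"dept": "ops", "salary": 80},
    {"dept": "eng", "salary": 120},
]

# Split: bucket the salary values by the grouping key.
groups = defaultdict(list)
for row in rows:
    groups[row["dept"]].append(row["salary"])

# Apply + combine: reduce each bucket, then gather the results into one table.
summary = {dept: sum(values) / len(values) for dept, values in groups.items()}
```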
  • 12. The Great Data Tool Decoupling™
      • Thesis: over time, user interfaces, data storage, and execution engines will decouple and specialize
      • In fact, you should really want this to happen
          • Share systems among languages
          • Reduce fragmentation and "lock-in"
          • Shift developer focus to usability
      • Prediction: we'll be there by 2025; sooner if we all get our act together
  • 13. Crafting quality data tools
      • Quality / usefulness is usually forged by the fire of battle
      • Real world use cases and social proof trump theory
      • Eat that dog food
      • When in doubt? Look at the test suite.
  • 14. R data frames
      • Thin layer on top of R list type
          • Sequence of named vectors
          • Can have row names (any R vector)
      • Simple column and row selection API
      • Analytics, data transformation, etc. left to base package and add-on libraries
      • Richness / usability comes largely from libraries
  • 15. Some awesome R data frame stuff
      • "Hadley Stack"
          • dplyr, tidyr
          • legacy: plyr, reshape2
          • ggplot2
      • data.table (data.frame + indices, fast algorithms)
      • xts: time series
  • 16. R data frames: rough edges
      • Copy-on-write semantics
      • API fragmentation / inconsistency
          • Use the "Hadley stack" for improved sanity
      • Factor / String dichotomy
          • stringsAsFactors=FALSE a blessing and curse
      • Somewhat limited type system
  • 17. dplyr
      • Composable table API
      • Good example of what the "decoupled" future might look like
          • New in-memory R/Rcpp execution engine
          • SQL backends for large subset of API
  • 18. Spark DataFrames
      • R/pandas-inspired API for tabular data manipulation in Scala, Python, etc.
      • Logical operation graphs rewritten internally in more efficient form
      • Good interop with Spark SQL
      • Some interoperability with pandas
      • Partial API Decoupling! (it still binds you to Spark)
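Slide 18's point about logical operation graphs being rewritten before execution can be illustrated with a toy deferred-evaluation table. This is only a sketch of the general idea, not how Spark works internally: `LazyTable` and its methods are hypothetical, and the single rewrite rule (fusing adjacent filters into one pass) is a stand-in for a real query optimizer.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]


@dataclass(frozen=True)
class LazyTable:
    """Records operations instead of running them (hypothetical class)."""
    rows: Tuple[Row, ...]
    plan: Tuple[Tuple[str, Any], ...] = ()

    def filter(self, predicate: Callable[[Row], bool]) -> "LazyTable":
        # Nothing is computed here; the operation is appended to the plan.
        return LazyTable(self.rows, self.plan + (("filter", predicate),))

    def select(self, *columns: str) -> "LazyTable":
        return LazyTable(self.rows, self.plan + (("select", columns),))

    def optimized_plan(self) -> List[Tuple[str, Any]]:
        """Rewrite the plan: fuse adjacent filters into a single pass."""
        fused: List[Tuple[str, Any]] = []
        for op, arg in self.plan:
            if op == "filter" and fused and fused[-1][0] == "filter":
                first = fused[-1][1]
                fused[-1] = ("filter", lambda r, a=first, b=arg: a(r) and b(r))
            else:
                fused.append((op, arg))
        return fused

    def collect(self) -> List[Row]:
        """Execute the rewritten plan and return concrete rows."""
        rows = list(self.rows)
        for op, arg in self.optimized_plan():
            if op == "filter":
                rows = [r for r in rows if arg(r)]
            else:  # "select"
                rows = [{c: r[c] for c in arg} for r in rows]
        return rows


# Made-up example data.
data = (
    {"name": "a", "x": 1, "y": 10},
    {"name": "b", "x": 5, "y": 20},
    {"name": "c", "x": 9, "y": 30},
)
query = (
    LazyTable(data)
    .filter(lambda r: r["x"] > 2)
    .filter(lambda r: r["y"] < 30)
    .select("name")
)
steps = [op for op, _ in query.optimized_plan()]
result = query.collect()
```

Because the user only describes what to compute, the engine behind the API is free to change how it is computed, which is the "partial decoupling" the slide refers to.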
  • 19. pandas
      • Several key data structures, data frame among them
      • Considerably more complex internals than other data frame libraries
      • Some good things
          • Born of need
          • A "batteries included" approach
          • Hierarchical axis labeling: addresses some hard use cases at expense of semantic complexity
          • Strong time series support
  • 20. pandas: rough edges
      • Axis labelling can get in the way for folks needing "just a table"
      • Ceded control of its type system / data representation from day 1 to NumPy
      • Inefficient string handling (uses NumPy object arrays)
      • Missing data handling less precise than other tools
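One plausible reading of "missing data handling less precise" on slide 20 is the use of NaN as a missing-value sentinel, which forces integer data into floating point. That reading is an assumption rather than something the slide states. The plain-Python sketch below contrasts a NaN sentinel with a separate validity mask, using made-up data.

```python
import math

ages = [31, None, 45]  # None marks a missing observation (made-up data)

# Sentinel approach: missing becomes NaN, which forces the column to float.
as_floats = [float("nan") if v is None else float(v) for v in ages]
# NaN is not equal to itself, so missing values must be found with isnan.
missing_mask = [math.isnan(v) for v in as_floats]

# Mask approach: values keep their integer type; validity is tracked separately.
values = [0 if v is None else v for v in ages]  # 0 is a placeholder, never read
valid = [v is not None for v in ages]
total = sum(v for v, ok in zip(values, valid) if ok)
```

The mask costs extra storage but keeps "missing" distinct from a genuine NaN and leaves the column's type unchanged.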
  • 21. Julia: DataFrames.jl
      • Started by Harlan Harris & co
      • Part of broader JuliaStats initiative
          • More R-like than pandas-like
          • Very active: > 50 contributors!
      • Still comparatively early
          • Less comprehensive API
          • More limited IO capabilities
  • 22. Other data frames
      • Saddle (Scala)
          • Developed by Adam Klein (ex-AQR) at Novus Partners (fintech startup)
          • Designed and used for financial use cases
      • Deedle (F# / .NET)
          • Developed by AK's colleagues at BlueMountain (hedge fund)
      • GraphLab / Dato
          • Really good C++ data frame with Python interface
          • Dual-licensed: AGPL + Commercial
      • That's not all! Haskell, Go, and so on…
  • 23. We're not done yet
      • The future is JSON-like
          • Support for nested types / semi-structured data is still weak
      • Wanted: Apache-licensed, community standard C/C++ data frame that we all use (R, Python, Julia)
      • Bring on the Great Decoupling
  • 24. Thank you. @wesmckinn