1© Cloudera, Inc. All rights reserved.
PyData: The Next Generation
Wes McKinney @wesmckinn
Data Day Texas 2015 #ddtx15
PyData: Everything’s awesome… or is it?
Me
• Data systems, tools, Python guru at Cloudera
• Formerly Founder/CEO of DataPad (visual analytics startup)
• Created pandas in 2008, lead developer until 2013
• Python for Data Analysis, published 10/2012
• O’Reilly’s best-selling data book of 2014
• Pythonista since 2007
What’s this about?
• Hopes and fears for the community and ecosystem
• Why do I care?
• Python is fun!
• Leverage
• Accessibility for newbies
• Community: smart, nice, humble people
Python at Cloudera
• Want Cloudera platform users to be successful with Python
• Spark/PySpark part of the Enterprise Data Hub / CDH
• Actively investing in Python tooling
• (p.s. we’re hiring?)
• (p.p.s. we have an Austin office now!)
Historical perspective and background
• 20 years of fast numerical computing in Python (Numeric 1995)
• 10 years of NumPy
• PyData becomes a thing in 2012
• Python as a data language goes mainstream
• Job descriptions tell all
• Shift in larger Python community from web towards data
• PyCon 2015 committee reported substantial growth in data-related submissions!
How’d this happen?
• Data, data everywhere
• Science! scikit-learn, statsmodels, and friends
• Comprehensive data wrangling tools and in-memory analytics/reporting (pandas)
• IPython Notebook
• Learning resources (books, conferences, blogs, etc.)
• Python environment/library management that “just works”
Put a Python (interface) on it!
Something no one got fired for, ever.
Meanwhile…
• Hadoop and Big Data go mainstream in 2009 onward
• First Hadoop World: Fall 2009
• First Strata conference: Spring 2011
• Lots of smart engineers in fast-growing businesses with massive analytics / ETL problems
• Solutions built, frameworks developed, companies founded
• Python was generally not a central part of those solutions
• A lot of our nice things weren’t much help for data munging and counting at scale (more on this later)
We’re lucky to have lots of nice things
• What a language!
• IPython: interactive computing and collaboration
• Libraries to solve nearly any (non-big data) problem
• Trustworthy (medium) data wrangling, statistics, machine learning
• HPC / GPU / parallel computing frameworks
• FFI tools
• … and much more
“If this isn’t nice, what is?”
—Kurt Vonnegut
So, what kind of big data?
• Big multidimensional arrays / linear algebra
• Big tables (structured data)
• Big text data (unstructured data)
• Personally, I am mostly interested in big tables
What kind of big data problems?
• ETL / Data Wrangling
• Python has been used here for years with Hadoop Streaming
• BI / Analytics (“things you can do in SQL”)
• Advanced Analytics / Machine Learning
Some ways we are #winning
• Python seen as a viable alternative to SAS/MATLAB/proprietary software without nearly as much arguing
• Huge uptake in the financial sector
• Many current and upcoming generations of data scientists learning Python as a first language
• Python in HPC / scientific computing
Some ways we are not #winning
• Python still doesn’t have a great “big data story”
• Little venture capital trickling down to Python projects
• Data structures and programming APIs lagging modern realities
• Weak support for emerging data formats
• Many companies with Python big data successes have not open-sourced their work
Python in big data workflows in practice
[Diagram with two regions. "Big Data, Many Machines" (more counting / ETL) and "Small/Medium Data, One Machine" (more insights / reporting). Labels: HDFS, Hadoop-MR, Spark, SQL, DSLs, pandas, viz tools, ML / stats.]
Big data storage formats
• JSON and CSV are not a good way to warehouse data
• Apache Avro
• Compact binary data serialization format
• RPC framework
• Apache Parquet
• Efficient columnar data format optimized for HDFS
• Supports nested and repeated fields, compression, encoding schemes
• Co-developed by Twitter and Cloudera
• Reference implementations in Impala (C++) and standalone Java/Scala (used in Spark)
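To make the "columnar" and "encoding schemes" bullets concrete, here is a minimal pure-Python sketch of the general idea. It is not Parquet's actual on-disk layout: the record fields are invented for illustration, and run-length encoding stands in for the family of encodings a columnar format can apply once values of one field sit next to each other.

```python
from itertools import groupby

# Hypothetical records; the field names are illustrative only.
rows = [
    {"country": "US", "clicks": 3},
    {"country": "US", "clicks": 5},
    {"country": "US", "clicks": 2},
    {"country": "DE", "clicks": 7},
    {"country": "DE", "clicks": 1},
]


def to_columns(records):
    """Pivot a list of row dicts into one list per field (column-oriented)."""
    names = list(records[0])
    return {name: [rec[name] for rec in records] for name in names}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(pairs):
    """Invert run_length_encode."""
    return [value for value, count in pairs for _ in range(count)]


columns = to_columns(rows)

# Repeated values in one column compress well once they are adjacent.
encoded_country = run_length_encode(columns["country"])

# An aggregate over one field only has to read that field's column.
total_clicks = sum(columns["clicks"])
```

With a row-oriented text format such as CSV or JSON, the same aggregate would have to parse every field of every record.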
We’re living in a JVM world
• Scala rapidly taking over big data analytics
• Functional, concise, good for building high level DSLs
• Build nice Scala APIs to clunkier Java frameworks
• JVM legitimately good for concurrent, distributed systems
• Binary interface with Python a major issue
19© Cloudera, Inc. All rights reserved.
Dremel, baby, Dremel…
• VLDB 2010: Dremel: Interactive Analysis of Web-Scale Datasets
• Inspiration for Parquet (cf. the blog post “Dremel made easy with Parquet”)
• Peta-scale analytics directly on nested data
• Google BigQuery said to be an IaaS-ification of Dremel
• Supports SQL variant + new user-defined functions with JavaScript + V8
SELECT COUNT(c1 > c2)
FROM (SELECT SUM(a.b.c.d) WITHIN RECORD AS c1,
SUM(a.b.p.q.r) WITHIN RECORD AS c2
FROM T3)
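The query above can be hard to read without a picture of the data. The following is a rough Python sketch of what it computes, under two assumptions that are not in the slides: the shape of the records in T3 is made up, and COUNT(c1 > c2) is read as "the number of records where c1 exceeds c2". It only illustrates the per-record aggregation over repeated nested fields, not how Dremel executes it.

```python
def iter_path(node, path):
    """Yield every leaf reached by following `path` through nested dicts,
    fanning out over lists (repeated fields) and skipping missing fields."""
    if node is None:
        return
    if isinstance(node, list):
        for item in node:
            yield from iter_path(item, path)
        return
    if not path:
        yield node
        return
    if isinstance(node, dict):
        yield from iter_path(node.get(path[0]), path[1:])


def sum_within_record(record, dotted_path):
    """Rough analogue of SUM(path) WITHIN RECORD for a single record.
    Note: Python's sum() gives 0 for no values, where SQL would give NULL."""
    return sum(iter_path(record, dotted_path.split(".")))


# Hypothetical contents of T3: "b" and "q" are repeated fields.
T3 = [
    {"a": {"b": [{"c": {"d": 5}, "p": {"q": [{"r": 1}, {"r": 2}]}},
                 {"c": {"d": 4}}]}},
    {"a": {"b": [{"c": {"d": 1}, "p": {"q": [{"r": 10}]}}]}},
    {"a": {"b": []}},
]

# Inner SELECT: one (c1, c2) pair per record.
inner = [
    (sum_within_record(rec, "a.b.c.d"), sum_within_record(rec, "a.b.p.q.r"))
    for rec in T3
]

# Outer SELECT: count the records where c1 > c2.
count = sum(1 for c1, c2 in inner if c1 > c2)
```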
Cloudera Impala
• Open-source interactive SQL for Hadoop
• Analytical query processor written in C++ with LLVM code generation
• Optimized to scan tables (best as Parquet format) in HDFS
• SQL front-end and query optimizer / planner
• User-defined function API (C++)
• impyla enables Python UDFs to be compiled with Numba to LLVM IR
Cloudera Impala (cont’d)
• For high performance big data analytics, Impala could be Python’s best friend
• C++/LLVM backend is lower-level than SQL
• Nested data support is coming
Some interesting things in recent times
Set point: Hadley Wickham
• R has upped its game with dplyr, tidyr, and other new projects
• New standard for a uniform interface to either in-memory or in-database data processing
• Composable table primitive operations
• Multiple major versions shipped, getting adopted
80dc69b 2012-10-28 | Initial commit of dplyr [hadley]
tbl %>%
  filter(c == 'bar') %>%
  group_by(a, b) %>%
  summarise(metric = mean(d - f)) %>%
  arrange(desc(metric))
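For readers who do not know R, here is the same chain of table primitives written as a standard-library Python sketch. The helper functions and the sample rows are hypothetical and exist only for this illustration; they are not dplyr's or pandas' API. The point is that each verb takes a table and returns a table, which is what makes the steps composable.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical table with the columns used in the dplyr example.
tbl = [
    {"a": "x", "b": 1, "c": "bar", "d": 10.0, "f": 4.0},
    {"a": "x", "b": 1, "c": "bar", "d": 8.0, "f": 4.0},
    {"a": "y", "b": 2, "c": "bar", "d": 20.0, "f": 5.0},
    {"a": "y", "b": 2, "c": "foo", "d": 99.0, "f": 0.0},
]


def filter_rows(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def group_by(rows, *keys):
    """Bucket rows by the values of the given key columns."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)
    return keys, dict(groups)


def summarise(grouped, **aggregations):
    """Reduce each group to one row: the key columns plus named aggregates."""
    keys, groups = grouped
    out = []
    for key_values, members in groups.items():
        row = dict(zip(keys, key_values))
        for name, fn in aggregations.items():
            row[name] = fn(members)
        out.append(row)
    return out


def arrange_desc(rows, column):
    """Sort rows by one column, largest first."""
    return sorted(rows, key=lambda row: row[column], reverse=True)


result = arrange_desc(
    summarise(
        group_by(filter_rows(tbl, lambda r: r["c"] == "bar"), "a", "b"),
        metric=lambda members: mean(r["d"] - r["f"] for r in members),
    ),
    "metric",
)
```

The nested calls read inside-out, which is exactly the readability problem the `%>%` pipe solves in R.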
Blaze
• Shares some semantics with dplyr
• Uses a generalized datashape protocol
• Fresh start in 2014 under Matthew Rocklin’s (Continuum) direction
• Deferred expression API
• Support for piping data between storage systems
• Multiple backends (pandas, SQL, MongoDB, PySpark, …)
• Growing support for out-of-core analytics
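A "deferred expression API" with "multiple backends" can be sketched in a few lines. The classes below are hypothetical and are not Blaze's API; they only show the pattern, assuming a toy expression language with subtraction and greater-than. Building the expression does no work, and the same tree is then either interpreted over in-memory data or rendered as a SQL fragment.

```python
import operator


class Expr:
    """Node in a deferred expression tree; building it computes nothing."""

    def __sub__(self, other):
        return BinOp(operator.sub, "-", self, wrap(other))

    def __gt__(self, other):
        return BinOp(operator.gt, ">", self, wrap(other))


class Column(Expr):
    def __init__(self, name):
        self.name = name


class Literal(Expr):
    def __init__(self, value):
        self.value = value


class BinOp(Expr):
    def __init__(self, fn, symbol, left, right):
        self.fn = fn
        self.symbol = symbol
        self.left = left
        self.right = right


def wrap(value):
    """Turn plain Python values into Literal nodes."""
    return value if isinstance(value, Expr) else Literal(value)


def evaluate(expr, row):
    """Backend 1: interpret the tree against one in-memory row (a dict)."""
    if isinstance(expr, Column):
        return row[expr.name]
    if isinstance(expr, Literal):
        return expr.value
    return expr.fn(evaluate(expr.left, row), evaluate(expr.right, row))


def to_sql(expr):
    """Backend 2: render the same tree as a SQL expression string."""
    if isinstance(expr, Column):
        return expr.name
    if isinstance(expr, Literal):
        return repr(expr.value)
    return f"({to_sql(expr.left)} {expr.symbol} {to_sql(expr.right)})"


# Nothing is computed here; this only builds a tree.
expr = (Column("d") - Column("f")) > 3

in_memory = evaluate(expr, {"d": 10, "f": 4})
as_sql = to_sql(expr)
```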
libdynd
• Led by Mark Wiebe at Continuum Analytics
• Pure C++11 modern reimagining of NumPy
• Python bindings
• Supports variadic data cells and nested types (datashape protocol)
• Development has focused on the data container design over analytics
PySpark
• Popularity may exceed official Scala API
• Spark was not exactly designed to be an ideal companion to Python
• General architecture
• Users build Spark deferred expression graphs in Python
• User-supplied functions are serialized and broadcast around the cluster
• Spark plans job and breaks work into tasks executed by Python worker jobs
• Data is managed / shuffled by the Spark Scala master process
• Python used largely as a black box to transform input to output
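The architecture bullets above can be illustrated with a toy, single-process sketch. Everything in it is an assumption made for illustration: the length-prefixed framing, the use of the standard-library `marshal` module to ship a function by value, and the `worker` function are stand-ins, not PySpark's actual wire protocol or serializer. It shows the shape of the arrangement: the driver serializes a user function, and a worker treats it as a black box that maps serialized input batches to serialized output batches.

```python
import io
import marshal
import pickle
import struct
import types


def write_frame(stream, payload):
    """Write one length-prefixed binary frame."""
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)


def read_frames(stream):
    """Yield the payload of each frame until the stream is exhausted."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return
        (length,) = struct.unpack(">I", header)
        yield stream.read(length)


def dump_function(func):
    """Serialize a simple, closure-free function by value (its code object)."""
    return marshal.dumps(func.__code__)


def load_function(payload):
    """Rebuild a function from a marshalled code object."""
    return types.FunctionType(marshal.loads(payload), globals())


def worker(task_stream, result_stream):
    """Stand-in for a Python worker: the first frame is the function,
    every later frame is a pickled batch of input records."""
    frames = read_frames(task_stream)
    func = load_function(next(frames))
    for frame in frames:
        batch = pickle.loads(frame)
        write_frame(result_stream, pickle.dumps([func(x) for x in batch]))


def square(x):
    return x * x


# "Driver" side: ship the function, then the data, as bytes.
tasks = io.BytesIO()
write_frame(tasks, dump_function(square))
for batch in ([1, 2, 3], [4, 5]):
    write_frame(tasks, pickle.dumps(batch))
tasks.seek(0)

results = io.BytesIO()
worker(tasks, results)
results.seek(0)
collected = [pickle.loads(frame) for frame in read_frames(results)]
```

Every record crosses the boundary twice as serialized bytes, which is where much of the overhead in this kind of design comes from.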
PySpark: Some more gory details
• Spark master controlled using py4j
• Py4J docs: “If performance is critical to your application, accessing Java objects from Python programs might not be the best idea”
• Data is marshalled mostly with files with various serialization protocols (pickle + bespoke formats)
• Does not natively interface with NumPy (yet)
• But, the in-memory benefits of Spark over Hadoop Streaming alternatives massively outweigh the downsides
# pass large object by py4j is very slow and need much memory
Spartan
• http://github.com/spartan-array/spartan
• Python distributed array expression evaluator (“distributed NumPy”)
• Developed by Russell Power & others at NYU
• Uses ZeroMQ and custom RPC implementation
Things I think we should do
• Create high fidelity data structures for Dremel-style data
• Get serious about Avro, Parquet, and other new data format standards
• Invest in the Python-Impala-LLVM relationship
• Efficient binary protocols to receive and emit data from Python processes
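As one possible reading of the last bullet, here is a small sketch of a typed, columnar binary frame built with the standard library. The frame layout (name length, name, row count, raw float64 values) is invented for this example and is not an existing standard. The idea it illustrates is that a receiver can load a whole column of numbers with one bulk copy instead of deserializing one Python object per value.

```python
import struct
from array import array


def encode_column(name, values):
    """Frame one float64 column: name length, name, row count, raw values.
    The values use the machine's native byte order; a real protocol
    would pin one explicitly."""
    name_bytes = name.encode("utf-8")
    data = array("d", values)
    header = (
        struct.pack("<H", len(name_bytes))
        + name_bytes
        + struct.pack("<Q", len(data))
    )
    return header + data.tobytes()


def decode_column(buffer):
    """Parse a frame produced by encode_column."""
    view = memoryview(buffer)
    (name_len,) = struct.unpack_from("<H", view, 0)
    name = bytes(view[2:2 + name_len]).decode("utf-8")
    (count,) = struct.unpack_from("<Q", view, 2 + name_len)
    start = 2 + name_len + 8
    values = array("d")
    # One bulk copy of the packed values, with no per-value parsing.
    values.frombytes(view[start:start + 8 * count])
    return name, values


frame = encode_column("metric", [1.5, 2.0, -3.25])
name, values = decode_column(frame)
```

Because the payload is a contiguous block of fixed-width values, the same bytes could be read by a C++ or JVM process without going through a Python-specific format such as pickle.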
Conclusions
• Python + PyData stack is as strong as ever, and still gaining momentum
• The time for a “dark horse” Python-centric big data solution has probably passed us by. Maybe better to pursue alliances.
• Focused work is needed to still be relevant in 2020. Some of our competitive advantages are eroding
Thank you
Wes McKinney @wesmckinn
wes@cloudera.com
Ad

More Related Content

What's hot (19)

Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry Analytics
Wes McKinney
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Cloudera, Inc.
 
Data Science in the Cloud with Spark, Zeppelin, and Cloudbreak
Data Science in the Cloud with Spark, Zeppelin, and CloudbreakData Science in the Cloud with Spark, Zeppelin, and Cloudbreak
Data Science in the Cloud with Spark, Zeppelin, and Cloudbreak
DataWorks Summit
 
Format Wars: from VHS and Beta to Avro and Parquet
Format Wars: from VHS and Beta to Avro and ParquetFormat Wars: from VHS and Beta to Avro and Parquet
Format Wars: from VHS and Beta to Avro and Parquet
DataWorks Summit
 
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...
Cloudera, Inc.
 
Uber's data science workbench
Uber's data science workbenchUber's data science workbench
Uber's data science workbench
Ran Wei
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
Wes McKinney
 
Jethro data meetup index base sql on hadoop - oct-2014
Jethro data meetup    index base sql on hadoop - oct-2014Jethro data meetup    index base sql on hadoop - oct-2014
Jethro data meetup index base sql on hadoop - oct-2014
Eli Singer
 
Apache Arrow and Python: The latest
Apache Arrow and Python: The latestApache Arrow and Python: The latest
Apache Arrow and Python: The latest
Wes McKinney
 
PyData Texas 2015 Keynote
PyData Texas 2015 KeynotePyData Texas 2015 Keynote
PyData Texas 2015 Keynote
Peter Wang
 
Using Oracle Big Data Discovey as a Data Scientist's Toolkit
Using Oracle Big Data Discovey as a Data Scientist's ToolkitUsing Oracle Big Data Discovey as a Data Scientist's Toolkit
Using Oracle Big Data Discovey as a Data Scientist's Toolkit
Mark Rittman
 
Using Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataUsing Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch data
DataWorks Summit/Hadoop Summit
 
Spark mhug2
Spark mhug2Spark mhug2
Spark mhug2
Joseph Niemiec
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to Spark
Slim Baltagi
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
Wes McKinney
 
Time-oriented event search. A new level of scale
Time-oriented event search. A new level of scale Time-oriented event search. A new level of scale
Time-oriented event search. A new level of scale
DataWorks Summit/Hadoop Summit
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in London
Dremio Corporation
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
DataWorks Summit
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
Timothy Spann
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry Analytics
Wes McKinney
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Cloudera, Inc.
 
Data Science in the Cloud with Spark, Zeppelin, and Cloudbreak
Data Science in the Cloud with Spark, Zeppelin, and CloudbreakData Science in the Cloud with Spark, Zeppelin, and Cloudbreak
Data Science in the Cloud with Spark, Zeppelin, and Cloudbreak
DataWorks Summit
 
Format Wars: from VHS and Beta to Avro and Parquet
Format Wars: from VHS and Beta to Avro and ParquetFormat Wars: from VHS and Beta to Avro and Parquet
Format Wars: from VHS and Beta to Avro and Parquet
DataWorks Summit
 
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...
Cloudera, Inc.
 
Uber's data science workbench
Uber's data science workbenchUber's data science workbench
Uber's data science workbench
Ran Wei
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
Wes McKinney
 
Jethro data meetup index base sql on hadoop - oct-2014
Jethro data meetup    index base sql on hadoop - oct-2014Jethro data meetup    index base sql on hadoop - oct-2014
Jethro data meetup index base sql on hadoop - oct-2014
Eli Singer
 
Apache Arrow and Python: The latest
Apache Arrow and Python: The latestApache Arrow and Python: The latest
Apache Arrow and Python: The latest
Wes McKinney
 
PyData Texas 2015 Keynote
PyData Texas 2015 KeynotePyData Texas 2015 Keynote
PyData Texas 2015 Keynote
Peter Wang
 
Using Oracle Big Data Discovey as a Data Scientist's Toolkit
Using Oracle Big Data Discovey as a Data Scientist's ToolkitUsing Oracle Big Data Discovey as a Data Scientist's Toolkit
Using Oracle Big Data Discovey as a Data Scientist's Toolkit
Mark Rittman
 
Using Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataUsing Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch data
DataWorks Summit/Hadoop Summit
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to Spark
Slim Baltagi
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
Wes McKinney
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in London
Dremio Corporation
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
DataWorks Summit
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
Timothy Spann
 

Viewers also liked (20)

Dcd As9100 Cert
Dcd As9100 CertDcd As9100 Cert
Dcd As9100 Cert
jcarlton99
 
Breakout: Data Discovery with Hadoop
Breakout: Data Discovery with HadoopBreakout: Data Discovery with Hadoop
Breakout: Data Discovery with Hadoop
Cloudera, Inc.
 
Geo Review 10.1-10.5 Pg 2
Geo Review 10.1-10.5 Pg 2Geo Review 10.1-10.5 Pg 2
Geo Review 10.1-10.5 Pg 2
HeatherHunt
 
Grafo 5º (parte II)
Grafo 5º (parte II)Grafo 5º (parte II)
Grafo 5º (parte II)
revista Grafo
 
kunstbeurs Alkmaar 2010
kunstbeurs Alkmaar 2010kunstbeurs Alkmaar 2010
kunstbeurs Alkmaar 2010
karinwelhuis
 
Mg 4500 Ret 22x14
Mg 4500 Ret 22x14Mg 4500 Ret 22x14
Mg 4500 Ret 22x14
guesta6c4f0
 
Resume
ResumeResume
Resume
Keith Pane
 
Ijniygb7yhnujnigbi9uhjn
Ijniygb7yhnujnigbi9uhjnIjniygb7yhnujnigbi9uhjn
Ijniygb7yhnujnigbi9uhjn
cruze padron
 
Institucion Educativa Indigena Agroambiental Mayker
Institucion Educativa Indigena Agroambiental MaykerInstitucion Educativa Indigena Agroambiental Mayker
Institucion Educativa Indigena Agroambiental Mayker
Wilson Rodrigo Alpala
 
Demystifying Execution
Demystifying ExecutionDemystifying Execution
Demystifying Execution
Malcolm Ryder
 
Vmware针对教育行业it解决方案
Vmware针对教育行业it解决方案 Vmware针对教育行业it解决方案
Vmware针对教育行业it解决方案
ITband
 
F:\2010\Agendas 2010\Agenda Abril 12
F:\2010\Agendas 2010\Agenda Abril 12F:\2010\Agendas 2010\Agenda Abril 12
F:\2010\Agendas 2010\Agenda Abril 12
guestd2ae7b
 
Limites y posiblidades
Limites y posiblidadesLimites y posiblidades
Limites y posiblidades
Rösy Ross
 
Presentation oracle exalogic elastic cloud
Presentation   oracle exalogic elastic cloudPresentation   oracle exalogic elastic cloud
Presentation oracle exalogic elastic cloud
solarisyougood
 
Agencia produza planejamento-marketing-agapes-versao1.0
Agencia produza planejamento-marketing-agapes-versao1.0Agencia produza planejamento-marketing-agapes-versao1.0
Agencia produza planejamento-marketing-agapes-versao1.0
Alex Cavalcante
 
Dcd As9100 Cert
Dcd As9100 CertDcd As9100 Cert
Dcd As9100 Cert
jcarlton99
 
Breakout: Data Discovery with Hadoop
Breakout: Data Discovery with HadoopBreakout: Data Discovery with Hadoop
Breakout: Data Discovery with Hadoop
Cloudera, Inc.
 
Geo Review 10.1-10.5 Pg 2
Geo Review 10.1-10.5 Pg 2Geo Review 10.1-10.5 Pg 2
Geo Review 10.1-10.5 Pg 2
HeatherHunt
 
Grafo 5º (parte II)
Grafo 5º (parte II)Grafo 5º (parte II)
Grafo 5º (parte II)
revista Grafo
 
kunstbeurs Alkmaar 2010
kunstbeurs Alkmaar 2010kunstbeurs Alkmaar 2010
kunstbeurs Alkmaar 2010
karinwelhuis
 
Mg 4500 Ret 22x14
Mg 4500 Ret 22x14Mg 4500 Ret 22x14
Mg 4500 Ret 22x14
guesta6c4f0
 
Ijniygb7yhnujnigbi9uhjn
Ijniygb7yhnujnigbi9uhjnIjniygb7yhnujnigbi9uhjn
Ijniygb7yhnujnigbi9uhjn
cruze padron
 
Institucion Educativa Indigena Agroambiental Mayker
Institucion Educativa Indigena Agroambiental MaykerInstitucion Educativa Indigena Agroambiental Mayker
Institucion Educativa Indigena Agroambiental Mayker
Wilson Rodrigo Alpala
 
Demystifying Execution
Demystifying ExecutionDemystifying Execution
Demystifying Execution
Malcolm Ryder
 
Vmware针对教育行业it解决方案
Vmware针对教育行业it解决方案 Vmware针对教育行业it解决方案
Vmware针对教育行业it解决方案
ITband
 
F:\2010\Agendas 2010\Agenda Abril 12
F:\2010\Agendas 2010\Agenda Abril 12F:\2010\Agendas 2010\Agenda Abril 12
F:\2010\Agendas 2010\Agenda Abril 12
guestd2ae7b
 
Limites y posiblidades
Limites y posiblidadesLimites y posiblidades
Limites y posiblidades
Rösy Ross
 
Presentation oracle exalogic elastic cloud
Presentation   oracle exalogic elastic cloudPresentation   oracle exalogic elastic cloud
Presentation oracle exalogic elastic cloud
solarisyougood
 
Agencia produza planejamento-marketing-agapes-versao1.0
Agencia produza planejamento-marketing-agapes-versao1.0Agencia produza planejamento-marketing-agapes-versao1.0
Agencia produza planejamento-marketing-agapes-versao1.0
Alex Cavalcante
 
Ad

Similar to PyData: The Next Generation | Data Day Texas 2015 (20)

Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
Wes McKinney
 
Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache Hadoop
Cloudera, Inc.
 
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Uri Laserson
 
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinneyIbis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Hakka Labs
 
DataFrames: The Extended Cut
DataFrames: The Extended CutDataFrames: The Extended Cut
DataFrames: The Extended Cut
Wes McKinney
 
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesHadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Cloudera, Inc.
 
DataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and UglyDataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and Ugly
Wes McKinney
 
Practical introduction to hadoop
Practical introduction to hadoopPractical introduction to hadoop
Practical introduction to hadoop
inside-BigData.com
 
Data Science and CDSW
Data Science and CDSWData Science and CDSW
Data Science and CDSW
Jason Hubbard
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
Swiss Big Data User Group
 
Twitter with hadoop for oow
Twitter with hadoop for oowTwitter with hadoop for oow
Twitter with hadoop for oow
Gwen (Chen) Shapira
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
huguk
 
Hadoop Data Modeling
Hadoop Data ModelingHadoop Data Modeling
Hadoop Data Modeling
Adam Doyle
 
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data InsightSyncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Steven Totman
 
Python Data Ecosystem: Thoughts on Building for the Future
Python Data Ecosystem: Thoughts on Building for the FuturePython Data Ecosystem: Thoughts on Building for the Future
Python Data Ecosystem: Thoughts on Building for the Future
Wes McKinney
 
HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...
HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...
HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...
HPCC Systems
 
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data InsightSyncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Precisely
 
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)
Eric Baldeschwieler
 
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Dataconomy Media
 
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Mats Uddenfeldt
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
Wes McKinney
 
Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache Hadoop
Cloudera, Inc.
 
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Uri Laserson
 
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinneyIbis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Hakka Labs
 
DataFrames: The Extended Cut
DataFrames: The Extended CutDataFrames: The Extended Cut
DataFrames: The Extended Cut
Wes McKinney
 
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesHadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Cloudera, Inc.
 
DataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and UglyDataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and Ugly
Wes McKinney
 
Practical introduction to hadoop
Practical introduction to hadoopPractical introduction to hadoop
Practical introduction to hadoop
inside-BigData.com
 
Data Science and CDSW
Data Science and CDSWData Science and CDSW
Data Science and CDSW
Jason Hubbard
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
Swiss Big Data User Group
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
huguk
 
Hadoop Data Modeling
Hadoop Data ModelingHadoop Data Modeling
Hadoop Data Modeling
Adam Doyle
 
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data InsightSyncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Steven Totman
 
Python Data Ecosystem: Thoughts on Building for the Future
Python Data Ecosystem: Thoughts on Building for the FuturePython Data Ecosystem: Thoughts on Building for the Future
Python Data Ecosystem: Thoughts on Building for the Future
Wes McKinney
 
HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...
HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...
HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...
HPCC Systems
 
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data InsightSyncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Precisely
 
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)
Eric Baldeschwieler
 
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Dataconomy Media
 
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Mats Uddenfeldt
 
Ad


PyData: The Next Generation | Data Day Texas 2015

  • 1. PyData: The Next Generation
      Wes McKinney @wesmckinn
      Data Day Texas 2015 #ddtx15
      © Cloudera, Inc. All rights reserved.
  • 2. PyData: Everything’s awesome…or is it?
      Wes McKinney @wesmckinn
      Data Day Texas 2015 #ddtx15
  • 3. Me
      • Data systems, tools, Python guru at Cloudera
      • Formerly Founder/CEO of DataPad (visual analytics startup)
      • Created pandas in 2008, lead developer until 2013
      • Python for Data Analysis, published 10/2012
      • O’Reilly’s best-selling data book of 2014
      • Pythonista since 2007
  • 4. What’s this about?
      • Hopes and fears for the community and ecosystem
      • Why do I care?
      • Python is fun!
      • Leverage
      • Accessibility for newbies
      • Community: smart, nice, humble people
  • 5. Python at Cloudera
      • Want Cloudera platform users to be successful with Python
      • Spark/PySpark part of the Enterprise Data Hub / CDH
      • Actively investing in Python tooling
      • (p.s. we’re hiring?)
      • (p.p.s. we have an Austin office now!)
  • 6. Historical perspective and background
      • 20 years of fast numerical computing in Python (Numeric 1995)
      • 10 years of NumPy
      • PyData becomes a thing in 2012
      • Python as a data language goes mainstream
      • Job descriptions tell all
      • Shift in larger Python community from web towards data
      • PyCon 2015 committee reported substantial growth in data-related submissions!
  • 7. How’d this happen?
      • Data, data everywhere
      • Science! scikit-learn, statsmodels, and friends
      • Comprehensive data wrangling tools and in-memory analytics/reporting (pandas)
      • IPython Notebook
      • Learning resources (books, conferences, blogs, etc.)
      • Python environment/library management that “just works”
  • 8. Put a Python (interface) on it!
      Something no one got fired for, ever.
  • 9. Meanwhile…
      • Hadoop and Big Data go mainstream in 2009 onward
      • First Hadoop World: Fall 2009
      • First Strata conference: Spring 2011
      • Lots of smart engineers in fast-growing businesses with massive analytics / ETL problems
      • Solutions built, frameworks developed, companies founded
      • Python was generally not a central part of those solutions
      • A lot of our nice things weren’t much help for data munging and counting at scale (more on this later)
  • 10. We’re lucky to have lots of nice things
      • What a language!
      • IPython: interactive computing and collaboration
      • Libraries to solve nearly any (non-big data) problem
      • Trustworthy (medium) data wrangling, statistics, machine learning
      • HPC / GPU / parallel computing frameworks
      • FFI tools
      • … and much more
  • 11. “If this isn’t nice, what is?”
      Kurt Vonnegut
  • 12. So, what kind of big data?
      • Big multidimensional arrays / linear algebra
      • Big tables (structured data)
      • Big text data (unstructured data)
      • Empirically I personally am mostly interested in big tables
  • 13. What kind of big data problems?
      • ETL / Data Wrangling
      • Python has been used here for years with Hadoop Streaming
      • BI / Analytics (“things you can do in SQL”)
      • Advanced Analytics / Machine Learning
  • 14. Some ways we are #winning
      • Python seen as a viable alternative to SAS/MATLAB/proprietary software without nearly as much arguing
      • Huge uptake in the financial sector
      • Many current and upcoming generations of data scientists learning Python as a first language
      • Python in HPC / scientific computing
  • 15. Some ways we are not #winning
      • Python still doesn’t have a great “big data story”
      • Little venture capital trickling down to Python projects
      • Data structures and programming APIs lagging modern realities
      • Weak support for emerging data formats
      • Many companies with Python big data successes have not open-sourced their work
  • 16. Python in big data workflows in practice
      [Diagram. Labels: HDFS, Hadoop-MR, Spark, SQL; Big Data, Many machines; Small/Medium Data, One Machine; pandas, Viz tools, ML / Stats; More counting / ETL; More insights / reporting; DSLs]
  • 17. Big data storage formats
      • JSON and CSV are not a good way to warehouse data
      • Apache Avro
      • Compact binary data serialization format
      • RPC framework
      • Apache Parquet
      • Efficient columnar data format optimized for HDFS
      • Supports nested and repeated fields, compression, encoding schemes
      • Co-developed by Twitter and Cloudera
      • Reference impl’s in Impala (C++), and standalone Java/Scala (used in Spark)
  • 18. We’re living in a JVM world
      • Scala rapidly taking over big data analytics
      • Functional, concise, good for building high level DSLs
      • Build nice Scala APIs to clunkier Java frameworks
      • JVM legitimately good for concurrent, distributed systems
      • Binary interface with Python a major issue
  • 19. Dremel, baby, Dremel…
      • VLDB 2010: Dremel: Interactive Analysis of Web-Scale Datasets
      • Inspiration for Parquet (cf. blog “Dremel made easy with Parquet”)
      • Peta-scale analytics directly on nested data
      • Google BigQuery said to be an IaaS-ification of Dremel
      • Supports SQL variant + new user-defined functions with JavaScript + V8

      SELECT COUNT(c1 > c2) FROM
        (SELECT SUM(a.b.c.d) WITHIN RECORD AS c1,
                SUM(a.b.p.q.r) WITHIN RECORD AS c2
         FROM T3)
  • 20. Cloudera Impala
      • Open-source interactive SQL for Hadoop
      • Analytical query processor written in C++ with LLVM code generation
      • Optimized to scan tables (best as Parquet format) in HDFS
      • SQL front-end and query optimizer / planner
      • User-defined function API (C++)
      • impyla enables Python UDFs to be compiled with Numba to LLVM IR
  • 21. Cloudera Impala (cont’d)
      • For high performance big data analytics, Impala could be Python’s best friend
      • C++/LLVM backend is lower-level than SQL
      • Nested data support is coming
  • 22. Some interesting things in recent times
  • 23. Set point: Hadley Wickham
      • R has upped its game with dplyr, tidyr, and other new projects
      • New standard for a uniform interface to either in-memory or in-database data processing
      • Composable table primitive operations
      • Multiple major versions shipped, getting adopted

      80dc69b 2012-10-28 | Initial commit of dplyr [hadley]

      tbl %>%
        filter(c == 'bar') %>%
        group_by(a, b) %>%
        summarise(metric = mean(d - f)) %>%
        arrange(desc(metric))
  • 24. Blaze
      • Shares some semantics with dplyr
      • Uses a generalized datashape protocol
      • Fresh start in 2014 under Matthew Rocklin’s (Continuum) direction
      • Deferred expression API
      • Support for piping data between storage systems
      • Multiple backends (pandas, SQL, MongoDB, PySpark, …)
      • Growing support for out-of-core analytics
  • 25. libdynd
      • Led by Mark Wiebe at Continuum Analytics
      • Pure C++11 modern reimagining of NumPy
      • Python bindings
      • Supports variadic data cells and nested types (datashape protocol)
      • Development has focused on the data container design over analytics
  • 26. PySpark
      • Popularity may exceed official Scala API
      • Spark was not exactly designed to be an ideal companion to Python
      • General architecture
      • Users build Spark deferred expression graphs in Python
      • User-supplied functions are serialized and broadcast around the cluster
      • Spark plans job and breaks work into tasks executed by Python worker jobs
      • Data is managed / shuffled by the Spark Scala master process
      • Python used largely as a black box to transform input to output
  • 27. PySpark: Some more gory details
      • Spark master controlled using py4j
      • Py4J docs: “If performance is critical to your application, accessing Java objects from Python programs might not be the best idea”
      • Data is marshalled mostly with files with various serialization protocols (pickle + bespoke formats)
      • Does not natively interface with NumPy (yet)
      • But, the in-memory benefits of Spark over Hadoop Streaming alternatives massively outweigh the downsides

      # pass large object by py4j is very slow and need much memory
  • 28. Spartan
      • http://github.com/spartan-array/spartan
      • Python distributed array expression evaluator (“distributed NumPy”)
      • Developed by Russell Power & others at NYU
      • Uses ZeroMQ and custom RPC implementation
  • 29. Things I think we should do
      • Create high fidelity data structures for Dremel-style data
      • Get serious about Avro, Parquet, and other new data format standards
      • Invest in the Python-Impala-LLVM relationship
      • Efficient binary protocols to receive and emit data from Python processes
  • 30. Conclusions
      • Python + PyData stack is as strong as ever, and still gaining momentum
      • The time for a “dark horse” Python-centric big data solution has probably passed us by. Maybe better to pursue alliances.
      • Focused work is needed to still be relevant in 2020. Some of our competitive advantages are eroding
  • 31. Thank you
      Wes McKinney @wesmckinn [email protected]
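
Slides 23 and 24 talk about composable table primitives and deferred expressions. As a rough illustration only, here is the slide 23 chain sketched in plain Python. This is not dplyr or Blaze: the helper names (`filter_rows`, `group_summarise`, `arrange_desc`, `pipeline`) and the sample rows are hypothetical, and only the column names a, b, c, d, f come from the slide.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows; only the column names mirror slide 23.
rows = [
    {"a": "x", "b": 1, "c": "bar", "d": 10.0, "f": 4.0},
    {"a": "x", "b": 1, "c": "bar", "d": 8.0, "f": 4.0},
    {"a": "y", "b": 2, "c": "bar", "d": 3.0, "f": 1.0},
    {"a": "y", "b": 2, "c": "foo", "d": 99.0, "f": 0.0},
]


def filter_rows(pred):
    """Primitive: keep the rows for which pred(row) is true."""
    return lambda table: [r for r in table if pred(r)]


def group_summarise(keys, name, agg):
    """Primitive: group rows by `keys`, then reduce each group with `agg`."""
    def step(table):
        groups = defaultdict(list)
        for r in table:
            groups[tuple(r[k] for k in keys)].append(r)
        return [dict(zip(keys, key), **{name: agg(g)})
                for key, g in groups.items()]
    return step


def arrange_desc(col):
    """Primitive: sort rows by `col`, largest first."""
    return lambda table: sorted(table, key=lambda r: r[col], reverse=True)


def pipeline(*steps):
    """Compose primitives; nothing runs until the result is called on data."""
    def run(table):
        for step in steps:
            table = step(table)
        return table
    return run


# Build the query first (deferred), then apply it to a table.
query = pipeline(
    filter_rows(lambda r: r["c"] == "bar"),
    group_summarise(["a", "b"], "metric",
                    lambda g: mean(r["d"] - r["f"] for r in g)),
    arrange_desc("metric"),
)
result = query(rows)
```

Because `query` is only a description until it is applied, a backend could in principle translate the same steps to SQL or another engine instead of running them in memory, which is the idea the slides attribute to dplyr and Blaze.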

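Slide 27 notes that PySpark marshals data mostly with pickle, and slide 29 asks for efficient binary protocols in and out of Python processes. A minimal sketch of the contrast, assuming a made-up wire layout (a 4-byte count followed by little-endian float64 values); `pack_doubles` and `unpack_doubles` are illustrative names, not part of PySpark or any library:

```python
import pickle
import struct

values = [i * 0.5 for i in range(1000)]

# Generic object serialization: flexible, but every float carries an opcode.
pickled = pickle.dumps(values, protocol=pickle.HIGHEST_PROTOCOL)


def pack_doubles(xs):
    """Encode a list of floats as <uint32 count><count x float64>, little-endian."""
    return struct.pack("<I%dd" % len(xs), len(xs), *xs)


def unpack_doubles(buf):
    """Decode a buffer produced by pack_doubles back into a list of floats."""
    (n,) = struct.unpack_from("<I", buf, 0)
    return list(struct.unpack_from("<%dd" % n, buf, 4))


packed = pack_doubles(values)

# Both encodings round-trip the data exactly.
assert pickle.loads(pickled) == values
assert unpack_doubles(packed) == values

print(len(pickled), len(packed))
```

The fixed layout here is 8,004 bytes (4 for the count plus 8 per value), while the pickle spends an extra opcode byte on every float. The larger practical win of a fixed layout is that a non-Python process can read and write it without understanding pickle at all.
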
Editor's Notes

  • #5: Programming should be fun, even if it is work. I’ve been writing a lot of C++ lately…and that feels like work
  • #11: One of the weirdest experiences for me was going to the annual Supercomputing conference: like going to a parallel universe.
  • #12: Some consternation in 2011 when the Strata conferences started that somehow we’d completely missed the big data boat and the world was moving on without us
  • #23: Some consternation in 2011 when the Strata conferences started that somehow we’d completely missed the big data boat and the world was moving on without us