SlideShare a Scribd company logo
Continuum Analytics: What
we are doing and why
Travis Oliphant, PhD
Continuum Analytics, Inc
This talk will be about (mostly) the free and /
or open source software we are building.
Enterprise
Python
Scientific
Computing
Data Processing
Data Analysis
Visualisation
Scalable
Computing
• Products
• Training
• Support
• Consulting
Began operations in January of 2012 with 5 people
with big dreams
Python
We are big backers of NumFOCUS and
organizers of PyData
Spyder
Our Team
Scientists Developers
+5 contractors
+4 interns
+1 Part-time
Business
Big Picture
expertise
insights
NumPy and SciPy are quite successful
Thanks to large and diverse community around it
(Matplotlib, IPython, SymPy, Pandas, etc.)
I estimate 1.5 million to 2 million users
Only incremental improvements possible with
these projects at this point.
Thus, we needed to start new projects...
Related Open Source Projects
Blaze: High-performance Python library for modern
vector computing, distributed and streaming data
Numba:Vectorizing Python compiler for multicore
and GPU, using LLVM
Bokeh: Interactive, grammar-based visualization
system for large datasets
Common Thread: High-level, expressive language for
domain experts; innovative compilers & runtimes
for efficient, powerful data transformation
Conda and Anaconda
• Cross-platform package management
• Multiple environments allows you to have multiple
versions of packages installed in system
• Easy app-deployment
• Taming open-source
Free for all users
Enterprise
support available!
Why Conda?
• Linux users stopped complaining about Python
deployment
• I made major mistakes in management of NumPy/SciPy:
• too much in SciPy (SciPy as distribution) --- scikits model is much
better (tighter libraries developed by smaller teams)
• gave in to community desire for binary ABI compatibility in
NumPy motivated by difficulty of reproducing Python install
• Need for a cross-platform way to install major Python
extensions (many with dependencies on large C or C++
libraries
• Python can’t be ubiquitous if people struggle to just get
it and then manage it.
Why conda
Information Technology
What is Conda
• Full package management (like yum or apt-get)
but cross-platform
• Control over environments (using hard-link
farms) --- better than virtual-env. virtualenv today is like
distutils and setuptools of several years ago (great at first but will end up
hating it)
• Architected to be able to manage any packages
(R, Scala, Clojure, Haskell, Ruby, JS)
• SAT solver to manage dependencies
• User-definable repositories
New Features and Binstar
• Build command from recipe --- many recipes here: https://
github.com/ContinuumIO/conda-recipes
• Upload recipes to Binstar (last mile in binary package
hosting and deployment for any language).
• “binstar in beta” is the beta code
• Personal conda repositories --- https://conda.binstar.org/
travis
• Free Continuum Anaconda repo will be on binstar.org
• Private packages and behind-the-firewall satellites available
• *Free build queue on Linux (Mac and Windows coming
soon) for hosted conda recipes
Demo
create Python 3 environment with IPython and scipy
create new recipe from PyPI (yunomi)
Packaging and Distribution Solved
• conda and binstar solve most of the problems that
we have seen people encounter in managing
Python installations (especially in large-scale
institutions).
• they are supported solutions that can remove the
technology pain of managing Python
• some problems, though, are people
Anaconda (open)
Free enterprise-ready Python distribution of open-
source tools for large-scale data processing,
predictive analytics, and scientific computing
Anaconda Add-Ons (paid-for)
•Revolutionary Python to GPU compiler
•Extends Numba to take a subset of Python
to the GPU (program CUDA in Python)
•CUDA FFT / BLAS interfaces
Fast, memory-efficient Python interface for
SQL databases, NoSQL stores,Amazon S3,
and large data files.
NumPy, SciPy, scikit-learn, NumExpr compiled
against Intel’s Math Kernel Library (MKL)
Launcher
Why Numba?
•Python is too slow for loops
•Most people are not learning C/C++/Fortran today
•Cython is an improvment (but still verbose and
needs C-compiler)
•NVIDIA using LLVM for the GPU
•Many people working with large typed-containers
(NumPy arrays)
•We want to take high-level, tarray-oriented
expressions and compile it to fast code
NumPy + Mamba = Numba
LLVM Library
Intel Nvidia AppleAMD
OpenCLISPC CUDA CLANGOpenMP
LLVMPY
Python Function Machine Code
ARM
Example
Numba
Numba
@jit('void(f8[:,:],f8[:,:],f8[:,:])')
def filter(image, filt, output):
M, N = image.shape
m, n = filt.shape
for i in range(m//2, M-m//2):
for j in range(n//2, N-n//2):
result = 0.0
for k in range(m):
for l in range(n):
result += image[i+k-m//2,j+l-n//2]*filt[k, l]
output[i,j] = result
~1500x speed-up
Numba changes the game!
LLVM IR
x86
C++
ARM
PTX
C
Fortran
Python
Numba turns (a subset of) Python into a
“compiled language” as fast as C (but much more
flexible). You don’t have to reach for C/C++
Laplace Example
@jit('void(double[:,:], double, double)')
def numba_update(u, dx2, dy2):
nx, ny = u.shape
for i in xrange(1,nx-1):
for j in xrange(1, ny-1):
u[i,j] = ((u[i+1,j] + u[i-1,j]) * dy2 +
(u[i,j+1] + u[i,j-1]) * dx2) / (2*(dx2+dy2))
Adapted from http://www.scipy.org/PerformancePython
originally by Prabhu Ramachandran
@jit('void(double[:,:], double, double)')
def numbavec_update(u, dx2, dy2):
u[1:-1,1:-1] = ((u[2:,1:-1]+u[:-2,1:-1])*dy2 +
(u[1:-1,2:] + u[1:-1,:-2])*dx2) / (2*(dx2+dy2))
Results of Laplace example
Version Time Speed Up
NumPy 3.19 1.0
Numba 2.32 1.38
Vect. Numba 2.33 1.37
Cython 2.38 1.34
Weave 2.47 1.29
Numexpr 2.62 1.22
Fortran Loops 2.30 1.39
Vect. Fortran 1.50 2.13
https://github.com/teoliphant/speed.git
LLVMPy worth looking at
LLVM (via
LLVMPy)
has done
much heavy
lifting
LLVMPy =
Compilers for
everybody
What is wrong with NumPy?
• Dtype system is difficult to extend
• Many Dtypes needed (missing data, enums, variable length
strings)
• Immediate mode creates huge temporaries (spawning
Numexpr)
• “Almost” an in-memory data-base comparable to SQL-
lite (missing indexes)
• Integration with sparse arrays
• Standard structure of arrays representation...
• Missing Multi-methods
• Optimization / Minimal support for multi-core / GPU
Now What?
After watching NumPy and SciPy get used all over
Wall Street and by many scientists / engineers in
industry --- what would we do differently?
New Project
Blaze
NumPy
Out of Core,
Distributed and Optimized
NumPy
Blaze Array or Table
Data Descriptor
Data Buffer
Index
Operation
NumPy BLZ
Persistent
Format
RDBMS
CSVData Stream
Blaze Deferred Arrays
+"
A" *"
B" C"
A + B*C
• Symbolic objects which build a graph
• Represents deferred computation
Usually what you have when
you have a Blaze Array
DataShape Type System



• A data description language
• A super-set of NumPy’s dtype
• Provides more flexibility
Shape DType
DataShape
Blaze
Database
GPU Node
Array
Server
NFS
Array
Server
Array
Server
Blaze Client
Synthesized
Array/Table view
array+sql://
array://
file:// array://
Python REPL,
Scripts
Viz Data
Server
C, C++,
FORTRAN
JVM
languages
Progress
• Basic calculations work out-of-core (via Numba and
LLVM)
• Hard dependency on dynd and dynd-python (a dynamic
C++-only multi-dimensional library like NumPy but with
many improvements)
• Persistent arrays from BLZ
• Basic array-server functionality for layering over CSV
files
• 0.2 release in 1-2 weeks. 0.3 within a month after that
(first usable release)
DARPA providing help
DARPA-BAA-12-38: XDATA
TA-1: Scalable analytics and data processing technology	
  
TA-2: Visual user interface technology
Bokeh Plotting Library
• Interactive graphics for the web
• Designed for large datasets
• Designed for streaming data
• Native interface in Python
• Fast JavaScript component
• DARPA funded
• v0.1 release imminent
Reasons for Bokeh
1. Plotting must happen near the data too
2. Quick iteration is essential => interactive visualization
3. Interactive visualization on remote-data => use the browser
4. Almost all web plotting libraries are either:
1. Designed for javascript programmers
2. Designed to output static graphs
5. We designed Bokeh to be dynamic graphing in the web for
Python programmers
6. Will include “Abstract” or “synthetic” rendering (working on
Hadoop and Spark compatibility)
Wakari
• Browser-based data analysis and
visualization platform
• Wordpress /YouTube / Github
for data analysis
• Full Linux environment with
Anaconda Python
• Can be installed on internal
clusters & servers
Why Wakari?
• Data is too big to fit on your desktop
• You need compute power but don’t have easy access to a
large cluster (cloud is sitting there with lots of power)
• Configuration of software on a new system stinks
(especially a cluster).
• Collaborative Data Analytics --- you want to build a
complex technical workflow and then share it with others
easily (without requiring they do painful configuration to
see your results)
• IPython Notebook is awesome --- let’s share it (but we
also need the dependencies and data).
Wakari
• Free account has 512 MB RAM / 10 GB disk and shared
multi-core CPU
• Easily spin-up map-reduce (Disco and Hadoop clusters)
• Use IPython Parallel on many-nodes in the cloud
• Develop GUI apps (possibly in Anaconda) and publish
them easily to Wakari (based on full power of scientific
python --- complex technical workflows (IPython
notebook for now)
Basic Data Explorer
Continuum Data Explorer (CDX)
• Open Source
• Goal is interactivity
• Combination of IPython REPL, Bokeh, and tables
• Tight integration between GUI elements and REPL
• Current features
- Namespace viewer (mapped to IPython namespace)
- DataTable widget with group-by, computed columns, advanced-
filters
- Interactive Plots connected to tables
CDX
Conclusion
Projects circle around giving tools to experts
(occasional programmers or domain
experts) to enable them to move their
expertise to the data to get insights --- keep
data where it is and move high-level but
performant code)
Join us or ask how we can help you!
Ad

More Related Content

What's hot (15)

Scale up and Scale Out Anaconda and PyData
Scale up and Scale Out Anaconda and PyDataScale up and Scale Out Anaconda and PyData
Scale up and Scale Out Anaconda and PyData
Travis Oliphant
 
A look inside pandas design and development
A look inside pandas design and developmentA look inside pandas design and development
A look inside pandas design and development
Wes McKinney
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
Wes McKinney
 
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiHadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Slim Baltagi
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Wes McKinney
 
Python for Big Data Analytics
Python for Big Data AnalyticsPython for Big Data Analytics
Python for Big Data Analytics
Edureka!
 
New Developments in H2O: April 2017 Edition
New Developments in H2O: April 2017 EditionNew Developments in H2O: April 2017 Edition
New Developments in H2O: April 2017 Edition
Sri Ambati
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...
jaxLondonConference
 
Apache Arrow and Python: The latest
Apache Arrow and Python: The latestApache Arrow and Python: The latest
Apache Arrow and Python: The latest
Wes McKinney
 
Social Networks Analysis
Social Networks AnalysisSocial Networks Analysis
Social Networks Analysis
Joud Khattab
 
My Data Journey with Python (SciPy 2015 Keynote)
My Data Journey with Python (SciPy 2015 Keynote)My Data Journey with Python (SciPy 2015 Keynote)
My Data Journey with Python (SciPy 2015 Keynote)
Wes McKinney
 
PyData: Past, Present Future (PyData SV 2014 Keynote)
PyData: Past, Present Future (PyData SV 2014 Keynote)PyData: Past, Present Future (PyData SV 2014 Keynote)
PyData: Past, Present Future (PyData SV 2014 Keynote)
Peter Wang
 
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
Herman Wu
 
H2O Big Join Slides
H2O Big Join SlidesH2O Big Join Slides
H2O Big Join Slides
Sri Ambati
 
Design Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data AnalyticsDesign Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data Analytics
DataWorks Summit
 
Scale up and Scale Out Anaconda and PyData
Scale up and Scale Out Anaconda and PyDataScale up and Scale Out Anaconda and PyData
Scale up and Scale Out Anaconda and PyData
Travis Oliphant
 
A look inside pandas design and development
A look inside pandas design and developmentA look inside pandas design and development
A look inside pandas design and development
Wes McKinney
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
Wes McKinney
 
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiHadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Slim Baltagi
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Wes McKinney
 
Python for Big Data Analytics
Python for Big Data AnalyticsPython for Big Data Analytics
Python for Big Data Analytics
Edureka!
 
New Developments in H2O: April 2017 Edition
New Developments in H2O: April 2017 EditionNew Developments in H2O: April 2017 Edition
New Developments in H2O: April 2017 Edition
Sri Ambati
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...
jaxLondonConference
 
Apache Arrow and Python: The latest
Apache Arrow and Python: The latestApache Arrow and Python: The latest
Apache Arrow and Python: The latest
Wes McKinney
 
Social Networks Analysis
Social Networks AnalysisSocial Networks Analysis
Social Networks Analysis
Joud Khattab
 
My Data Journey with Python (SciPy 2015 Keynote)
My Data Journey with Python (SciPy 2015 Keynote)My Data Journey with Python (SciPy 2015 Keynote)
My Data Journey with Python (SciPy 2015 Keynote)
Wes McKinney
 
PyData: Past, Present Future (PyData SV 2014 Keynote)
PyData: Past, Present Future (PyData SV 2014 Keynote)PyData: Past, Present Future (PyData SV 2014 Keynote)
PyData: Past, Present Future (PyData SV 2014 Keynote)
Peter Wang
 
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
Herman Wu
 
H2O Big Join Slides
H2O Big Join SlidesH2O Big Join Slides
H2O Big Join Slides
Sri Ambati
 
Design Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data AnalyticsDesign Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data Analytics
DataWorks Summit
 

Similar to PyData Boston 2013 (20)

Travis Oliphant "Python for Speed, Scale, and Science"
Travis Oliphant "Python for Speed, Scale, and Science"Travis Oliphant "Python for Speed, Scale, and Science"
Travis Oliphant "Python for Speed, Scale, and Science"
Fwdays
 
Keynote at Converge 2019
Keynote at Converge 2019Keynote at Converge 2019
Keynote at Converge 2019
Travis Oliphant
 
Linux Distribution Collaboration …on a Mainframe!
Linux Distribution Collaboration …on a Mainframe!Linux Distribution Collaboration …on a Mainframe!
Linux Distribution Collaboration …on a Mainframe!
All Things Open
 
PyTables
PyTablesPyTables
PyTables
Ali Hallaji
 
An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015
Wes McKinney
 
Large Data Analyze With PyTables
Large Data Analyze With PyTablesLarge Data Analyze With PyTables
Large Data Analyze With PyTables
Innfinision Cloud and BigData Solutions
 
Py tables
Py tablesPy tables
Py tables
Ali Hallaji
 
PyTables
PyTablesPyTables
PyTables
Ali Hallaji
 
.NET per la Data Science e oltre
.NET per la Data Science e oltre.NET per la Data Science e oltre
.NET per la Data Science e oltre
Marco Parenzan
 
Numba lightning
Numba lightningNumba lightning
Numba lightning
Travis Oliphant
 
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...
 Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo... Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo...
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...
Rogue Wave Software
 
Deep Learning Frameworks 2019 | Which Deep Learning Framework To Use | Deep L...
Deep Learning Frameworks 2019 | Which Deep Learning Framework To Use | Deep L...Deep Learning Frameworks 2019 | Which Deep Learning Framework To Use | Deep L...
Deep Learning Frameworks 2019 | Which Deep Learning Framework To Use | Deep L...
Simplilearn
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup
Ganesan Narayanasamy
 
Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015
Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015
Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015
Mike Broberg
 
Netflix oss season 2 episode 1 - meetup Lightning talks
Netflix oss   season 2 episode 1 - meetup Lightning talksNetflix oss   season 2 episode 1 - meetup Lightning talks
Netflix oss season 2 episode 1 - meetup Lightning talks
Ruslan Meshenberg
 
Sanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticiansSanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticians
Peter Clapham
 
Flexible compute
Flexible computeFlexible compute
Flexible compute
Peter Clapham
 
Amazon Deep Learning
Amazon Deep LearningAmazon Deep Learning
Amazon Deep Learning
Amanda Mackay (she/her)
 
Current Trends in HPC
Current Trends in HPCCurrent Trends in HPC
Current Trends in HPC
Putchong Uthayopas
 
OS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of MLOS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of ML
Nordic APIs
 
Travis Oliphant "Python for Speed, Scale, and Science"
Travis Oliphant "Python for Speed, Scale, and Science"Travis Oliphant "Python for Speed, Scale, and Science"
Travis Oliphant "Python for Speed, Scale, and Science"
Fwdays
 
Keynote at Converge 2019
Keynote at Converge 2019Keynote at Converge 2019
Keynote at Converge 2019
Travis Oliphant
 
Linux Distribution Collaboration …on a Mainframe!
Linux Distribution Collaboration …on a Mainframe!Linux Distribution Collaboration …on a Mainframe!
Linux Distribution Collaboration …on a Mainframe!
All Things Open
 
An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015
Wes McKinney
 
.NET per la Data Science e oltre
.NET per la Data Science e oltre.NET per la Data Science e oltre
.NET per la Data Science e oltre
Marco Parenzan
 
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...
 Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo... Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo...
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...
Rogue Wave Software
 
Deep Learning Frameworks 2019 | Which Deep Learning Framework To Use | Deep L...
Deep Learning Frameworks 2019 | Which Deep Learning Framework To Use | Deep L...Deep Learning Frameworks 2019 | Which Deep Learning Framework To Use | Deep L...
Deep Learning Frameworks 2019 | Which Deep Learning Framework To Use | Deep L...
Simplilearn
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup
Ganesan Narayanasamy
 
Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015
Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015
Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015
Mike Broberg
 
Netflix oss season 2 episode 1 - meetup Lightning talks
Netflix oss   season 2 episode 1 - meetup Lightning talksNetflix oss   season 2 episode 1 - meetup Lightning talks
Netflix oss season 2 episode 1 - meetup Lightning talks
Ruslan Meshenberg
 
Sanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticiansSanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticians
Peter Clapham
 
OS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of MLOS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of ML
Nordic APIs
 
Ad

More from Travis Oliphant (12)

Array computing and the evolution of SciPy, NumPy, and PyData
Array computing and the evolution of SciPy, NumPy, and PyDataArray computing and the evolution of SciPy, NumPy, and PyData
Array computing and the evolution of SciPy, NumPy, and PyData
Travis Oliphant
 
SciPy Latin America 2019
SciPy Latin America 2019SciPy Latin America 2019
SciPy Latin America 2019
Travis Oliphant
 
PyCon Estonia 2019
PyCon Estonia 2019PyCon Estonia 2019
PyCon Estonia 2019
Travis Oliphant
 
Standardizing arrays -- Microsoft Presentation
Standardizing arrays -- Microsoft PresentationStandardizing arrays -- Microsoft Presentation
Standardizing arrays -- Microsoft Presentation
Travis Oliphant
 
Scaling Python to CPUs and GPUs
Scaling Python to CPUs and GPUsScaling Python to CPUs and GPUs
Scaling Python to CPUs and GPUs
Travis Oliphant
 
Python for Data Science with Anaconda
Python for Data Science with AnacondaPython for Data Science with Anaconda
Python for Data Science with Anaconda
Travis Oliphant
 
Anaconda and PyData Solutions
Anaconda and PyData SolutionsAnaconda and PyData Solutions
Anaconda and PyData Solutions
Travis Oliphant
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and Python
Travis Oliphant
 
Effectively using Open Source with conda
Effectively using Open Source with condaEffectively using Open Source with conda
Effectively using Open Source with conda
Travis Oliphant
 
Blaze: a large-scale, array-oriented infrastructure for Python
Blaze: a large-scale, array-oriented infrastructure for PythonBlaze: a large-scale, array-oriented infrastructure for Python
Blaze: a large-scale, array-oriented infrastructure for Python
Travis Oliphant
 
Numba: Array-oriented Python Compiler for NumPy
Numba: Array-oriented Python Compiler for NumPyNumba: Array-oriented Python Compiler for NumPy
Numba: Array-oriented Python Compiler for NumPy
Travis Oliphant
 
Numba
NumbaNumba
Numba
Travis Oliphant
 
Array computing and the evolution of SciPy, NumPy, and PyData
Array computing and the evolution of SciPy, NumPy, and PyDataArray computing and the evolution of SciPy, NumPy, and PyData
Array computing and the evolution of SciPy, NumPy, and PyData
Travis Oliphant
 
SciPy Latin America 2019
SciPy Latin America 2019SciPy Latin America 2019
SciPy Latin America 2019
Travis Oliphant
 
Standardizing arrays -- Microsoft Presentation
Standardizing arrays -- Microsoft PresentationStandardizing arrays -- Microsoft Presentation
Standardizing arrays -- Microsoft Presentation
Travis Oliphant
 
Scaling Python to CPUs and GPUs
Scaling Python to CPUs and GPUsScaling Python to CPUs and GPUs
Scaling Python to CPUs and GPUs
Travis Oliphant
 
Python for Data Science with Anaconda
Python for Data Science with AnacondaPython for Data Science with Anaconda
Python for Data Science with Anaconda
Travis Oliphant
 
Anaconda and PyData Solutions
Anaconda and PyData SolutionsAnaconda and PyData Solutions
Anaconda and PyData Solutions
Travis Oliphant
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and Python
Travis Oliphant
 
Effectively using Open Source with conda
Effectively using Open Source with condaEffectively using Open Source with conda
Effectively using Open Source with conda
Travis Oliphant
 
Blaze: a large-scale, array-oriented infrastructure for Python
Blaze: a large-scale, array-oriented infrastructure for PythonBlaze: a large-scale, array-oriented infrastructure for Python
Blaze: a large-scale, array-oriented infrastructure for Python
Travis Oliphant
 
Numba: Array-oriented Python Compiler for NumPy
Numba: Array-oriented Python Compiler for NumPyNumba: Array-oriented Python Compiler for NumPy
Numba: Array-oriented Python Compiler for NumPy
Travis Oliphant
 
Ad

Recently uploaded (20)

TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
Heap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and DeletionHeap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and Deletion
Jaydeep Kale
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfThe Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
Abi john
 
Technology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data AnalyticsTechnology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data Analytics
InData Labs
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
 
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Impelsys Inc.
 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
 
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes
 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Linux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdfLinux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdf
RHCSA Guru
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
Heap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and DeletionHeap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and Deletion
Jaydeep Kale
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfThe Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
Abi john
 
Technology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data AnalyticsTechnology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data Analytics
InData Labs
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
 
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Impelsys Inc.
 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
 
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes
 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Linux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdfLinux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdf
RHCSA Guru
 

PyData Boston 2013

  • 1. Continuum Analytics: What we are doing and why Travis Oliphant, PhD Continuum Analytics, Inc
  • 2. This talk will be about (mostly) the free and / or open source software we are building. Enterprise Python Scientific Computing Data Processing Data Analysis Visualisation Scalable Computing • Products • Training • Support • Consulting
  • 3. Began operations in January of 2012 with 5 people with big dreams Python
  • 4. We are big backers of NumFOCUS and organizers of PyData Spyder
  • 5. Our Team Scientists Developers +5 contractors +4 interns +1 Part-time Business
  • 7. NumPy and SciPy are quite successful Thanks to large and diverse community around it (Matplotlib, IPython, SymPy, Pandas, etc.) I estimate 1.5 million to 2 million users Only incremental improvements possible with these projects at this point. Thus, we needed to start new projects...
  • 8. Related Open Source Projects Blaze: High-performance Python library for modern vector computing, distributed and streaming data Numba:Vectorizing Python compiler for multicore and GPU, using LLVM Bokeh: Interactive, grammar-based visualization system for large datasets Common Thread: High-level, expressive language for domain experts; innovative compilers & runtimes for efficient, powerful data transformation
  • 9. Conda and Anaconda • Cross-platform package management • Multiple environments allows you to have multiple versions of packages installed in system • Easy app-deployment • Taming open-source Free for all users Enterprise support available!
  • 10. Why Conda? • Linux users stopped complaining about Python deployment • I made major mistakes in management of NumPy/SciPy: • too much in SciPy (SciPy as distribution) --- scikits model is much better (tighter libraries developed by smaller teams) • gave in to community desire for binary ABI compatibility in NumPy motivated by difficulty of reproducing Python install • Need for a cross-platform way to install major Python extensions (many with dependencies on large C or C++ libraries • Python can’t be ubiquitous if people struggle to just get it and then manage it.
  • 12. What is Conda • Full package management (like yum or apt-get) but cross-platform • Control over environments (using hard-link farms) --- better than virtual-env. virtualenv today is like distutils and setuptools of several years ago (great at first but will end up hating it) • Architected to be able to manage any packages (R, Scala, Clojure, Haskell, Ruby, JS) • SAT solver to manage dependencies • User-definable repositories
  • 13. New Features and Binstar • Build command from recipe --- many recipes here: https:// github.com/ContinuumIO/conda-recipes • Upload recipes to Binstar (last mile in binary package hosting and deployment for any language). • “binstar in beta” is the beta code • Personal conda repositories --- https://conda.binstar.org/ travis • Free Continuum Anaconda repo will be on binstar.org • Private packages and behind-the-firewall satellites available • *Free build queue on Linux (Mac and Windows coming soon) for hosted conda recipes
  • 14. Demo create Python 3 environment with IPython and scipy create new recipe from PyPI (yunomi)
  • 15. Packaging and Distribution Solved • conda and binstar solve most of the problems that we have seen people encounter in managing Python installations (especially in large-scale institutions). • they are supported solutions that can remove the technology pain of managing Python • some problems, though, are people
  • 16. Anaconda (open) Free enterprise-ready Python distribution of open- source tools for large-scale data processing, predictive analytics, and scientific computing
  • 17. Anaconda Add-Ons (paid-for) •Revolutionary Python to GPU compiler •Extends Numba to take a subset of Python to the GPU (program CUDA in Python) •CUDA FFT / BLAS interfaces Fast, memory-efficient Python interface for SQL databases, NoSQL stores,Amazon S3, and large data files. NumPy, SciPy, scikit-learn, NumExpr compiled against Intel’s Math Kernel Library (MKL)
  • 19. Why Numba? •Python is too slow for loops •Most people are not learning C/C++/Fortran today •Cython is an improvment (but still verbose and needs C-compiler) •NVIDIA using LLVM for the GPU •Many people working with large typed-containers (NumPy arrays) •We want to take high-level, tarray-oriented expressions and compile it to fast code
  • 20. NumPy + Mamba = Numba LLVM Library Intel Nvidia AppleAMD OpenCLISPC CUDA CLANGOpenMP LLVMPY Python Function Machine Code ARM
  • 22. Numba @jit('void(f8[:,:],f8[:,:],f8[:,:])') def filter(image, filt, output): M, N = image.shape m, n = filt.shape for i in range(m//2, M-m//2): for j in range(n//2, N-n//2): result = 0.0 for k in range(m): for l in range(n): result += image[i+k-m//2,j+l-n//2]*filt[k, l] output[i,j] = result ~1500x speed-up
  • 23. Numba changes the game! LLVM IR x86 C++ ARM PTX C Fortran Python Numba turns (a subset of) Python into a “compiled language” as fast as C (but much more flexible). You don’t have to reach for C/C++
  • 24. Laplace Example @jit('void(double[:,:], double, double)') def numba_update(u, dx2, dy2): nx, ny = u.shape for i in xrange(1,nx-1): for j in xrange(1, ny-1): u[i,j] = ((u[i+1,j] + u[i-1,j]) * dy2 + (u[i,j+1] + u[i,j-1]) * dx2) / (2*(dx2+dy2)) Adapted from http://www.scipy.org/PerformancePython originally by Prabhu Ramachandran @jit('void(double[:,:], double, double)') def numbavec_update(u, dx2, dy2): u[1:-1,1:-1] = ((u[2:,1:-1]+u[:-2,1:-1])*dy2 + (u[1:-1,2:] + u[1:-1,:-2])*dx2) / (2*(dx2+dy2))
  • 25. Results of Laplace example Version Time Speed Up NumPy 3.19 1.0 Numba 2.32 1.38 Vect. Numba 2.33 1.37 Cython 2.38 1.34 Weave 2.47 1.29 Numexpr 2.62 1.22 Fortran Loops 2.30 1.39 Vect. Fortran 1.50 2.13 https://github.com/teoliphant/speed.git
  • 26. LLVMPy worth looking at LLVM (via LLVMPy) has done much heavy lifting LLVMPy = Compilers for everybody
  • 27. What is wrong with NumPy? • Dtype system is difficult to extend • Many Dtypes needed (missing data, enums, variable length strings) • Immediate mode creates huge temporaries (spawning Numexpr) • “Almost” an in-memory data-base comparable to SQL- lite (missing indexes) • Integration with sparse arrays • Standard structure of arrays representation... • Missing Multi-methods • Optimization / Minimal support for multi-core / GPU
  • 28. Now What? After watching NumPy and SciPy get used all over Wall Street and by many scientists / engineers in industry --- what would we do differently?
  • 29. New Project Blaze NumPy Out of Core, Distributed and Optimized NumPy
  • 30. Blaze Array or Table Data Descriptor Data Buffer Index Operation NumPy BLZ Persistent Format RDBMS CSVData Stream
  • 31. Blaze Deferred Arrays +" A" *" B" C" A + B*C • Symbolic objects which build a graph • Represents deferred computation Usually what you have when you have a Blaze Array
  • 32. DataShape Type System    • A data description language • A super-set of NumPy’s dtype • Provides more flexibility Shape DType DataShape
  • 33. Blaze Database GPU Node Array Server NFS Array Server Array Server Blaze Client Synthesized Array/Table view array+sql:// array:// file:// array:// Python REPL, Scripts Viz Data Server C, C++, FORTRAN JVM languages
  • 34. Progress • Basic calculations work out-of-core (via Numba and LLVM) • Hard dependency on dynd and dynd-python (a dynamic C++-only multi-dimensional library like NumPy but with many improvements) • Persistent arrays from BLZ • Basic array-server functionality for layering over CSV files • 0.2 release in 1-2 weeks. 0.3 within a month after that (first usable release)
  • 35. DARPA providing help DARPA-BAA-12-38: XDATA TA-1: Scalable analytics and data processing technology   TA-2: Visual user interface technology
  • 36. Bokeh Plotting Library • Interactive graphics for the web • Designed for large datasets • Designed for streaming data • Native interface in Python • Fast JavaScript component • DARPA funded • v0.1 release imminent
  • 37. Reasons for Bokeh 1. Plotting must happen near the data too 2. Quick iteration is essential => interactive visualization 3. Interactive visualization on remote-data => use the browser 4. Almost all web plotting libraries are either: 1. Designed for javascript programmers 2. Designed to output static graphs 5. We designed Bokeh to be dynamic graphing in the web for Python programmers 6. Will include “Abstract” or “synthetic” rendering (working on Hadoop and Spark compatibility)
  • 38. Wakari • Browser-based data analysis and visualization platform • Wordpress /YouTube / Github for data analysis • Full Linux environment with Anaconda Python • Can be installed on internal clusters & servers
  • 39. Why Wakari? • Data is too big to fit on your desktop • You need compute power but don’t have easy access to a large cluster (cloud is sitting there with lots of power) • Configuration of software on a new system stinks (especially a cluster). • Collaborative Data Analytics --- you want to build a complex technical workflow and then share it with others easily (without requiring they do painful configuration to see your results) • IPython Notebook is awesome --- let’s share it (but we also need the dependencies and data).
  • 40. Wakari • Free account has 512 MB RAM / 10 GB disk and shared multi-core CPU • Easily spin-up map-reduce (Disco and Hadoop clusters) • Use IPython Parallel on many-nodes in the cloud • Develop GUI apps (possibly in Anaconda) and publish them easily to Wakari (based on full power of scientific python --- complex technical workflows (IPython notebook for now)
  • 42. Continuum Data Explorer (CDX) • Open Source • Goal is interactivity • Combination of IPython REPL, Bokeh, and tables • Tight integration between GUI elements and REPL • Current features - Namespace viewer (mapped to IPython namespace) - DataTable widget with group-by, computed columns, advanced- filters - Interactive Plots connected to tables
  • 43. CDX
  • 44. Conclusion Projects circle around giving tools to experts (occasional programmers or domain experts) to enable them to move their expertise to the data to get insights --- keep data where it is and move high-level but performant code) Join us or ask how we can help you!