Building an Enterprise/Cloud
Analytics Platform with Jupyter
Notebooks and Apache Spark
Fred Reiss
Chief Architect, IBM Spark Technology Center
2
Hi!
Fred Reiss
• 2014-present: Chief Architect,
IBM Spark Technology Center.
• 2006-2014: Worked for IBM
Research.
• 2006: Ph.D. from U.C.
Berkeley.
3
The Jupyter Project
• Open Source project that builds software to enable
interactive notebooks for data science
– Started in 2014
– Grew out of the IPython project
4
What is IPython?
https://upload.wikimedia.org/wikipedia/commons/4/47/IPython-shell.png
By Shishirdasika (Own work) [CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
• Interactive console for Python
• Can open a window to display graphics
5
IPython Notebooks
https://upload.wikimedia.org/wikipedia/commons/a/af/IPython-notebook.png
By Shishirdasika (Own work) [CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
• Text and graphics in the same browser window
6
Jupyter Notebooks Today
https://developer.ibm.com/code/patterns/create-visualizations-to-understand-food-insecurity/
[Screenshot callouts: Cells, User Input, System Output (Tables, Graphs, Text output)]
7
Jupyter Notebooks
• Jupyter notebooks are widely used by data scientists, social
scientists, physical scientists, engineers, and others
• Useful for many tasks
– Analyzing data
– Developing and debugging software
– Running experiments
– Keeping track of experimental results
– Presenting results
• Jupyter is a central part of the IBM Data Science Experience (http://datascience.ibm.com)
8
Jupyter in the Enterprise: Key Challenges
• Collaboration among multiple users
• Large-scale data analysis (problems that don’t fit on a laptop)
– Shared cloud infrastructure like Kubernetes
– Parallel frameworks like Spark
• Security and authentication
• Auditing and data access control
9
Isn’t this just shipping strings around?
[Diagram: JavaScript sends “1+1” to the Server, which passes “1+1” to the Python Process; the result “2” travels back along the same path.]
10
Isn’t this just shipping strings around?
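[Diagram: the same path with the Server replaced by FancyNewSystem. JavaScript sends “1+1” to FancyNewSystem, which passes “1+1” to the Python Process; the result “2” travels back along the same path.]
• Labels attached to FancyNewSystem: Security, Multitenancy, Authentication, Spark, Kubernetes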
11
The Five Stages of Enterprise Jupyter Deployment
12
The Five Stages of Enterprise Jupyter Deployment
1. Denial
13
Jupyter does more than just pass strings around.
• Quite a bit more!
14
Asynchronous Operations
• Queue up multiple cells for execution
– …in arbitrary order
• Stream output while a cell is running
• Interrupt any operation
[Screenshot callout: fifteenth cell that executed in this session]
15
Jupyter’s Display System: Much More than Text
https://nbviewer.jupyter.org/github/ipython/ipython/blob/master/examples/IPython%20Kernel/Custom%20Display%20Logic.ipynb
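As a minimal sketch of what "much more than text" means in practice: IPython's display system checks the object a cell returns for rich-representation hooks such as `_repr_html_` and, when one is present, renders that instead of the plain `repr`. The `ProgressBar` class below is a hypothetical example, not something from the talk.

```python
class ProgressBar:
    """Hypothetical object that shows up as an HTML progress bar in a notebook."""

    def __init__(self, fraction):
        # Clamp the input so the rendered bar always stays within 0-100%.
        self.fraction = max(0.0, min(1.0, fraction))

    def __repr__(self):
        # Plain-text fallback, used by front ends without rich display.
        return f"ProgressBar({self.fraction:.0%})"

    def _repr_html_(self):
        # Rich-display hook: notebook front ends render this HTML when available.
        percent = round(self.fraction * 100)
        return f'<progress value="{percent}" max="100">{percent}%</progress>'


bar = ProgressBar(0.4)
print(repr(bar))           # what a plain console would show
print(bar._repr_html_())   # what a notebook front end would render
```

In a notebook, ending a cell with `bar` would show the progress bar itself; the two `print` calls only make both representations visible outside a notebook.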
16
Profiling and Debugging
17
Magics
• Jupyter’s standard Python kernel has over 90 built-in magic commands
http://ipython.readthedocs.io/en/stable/interactive/magics.html
18
Extensions
• Many additional extensions in the ipython-contrib GitHub repository
– https://github.com/ipython-contrib/jupyter_contrib_nbextensions
19
PixieDust
https://developer.ibm.com/code/patterns/analyze-san-francisco-traffic-data-with-ibm-pixiedust-and-data-science-experience/
20
Brunel
https://developer.ibm.com/open/videos/brunel-visualization-update-tech-talk/
21
The Actual Architecture of Jupyter Notebooks
22
The Actual Architecture of Jupyter Notebooks
[Diagram components]
• JavaScript
• Notebook Server Process
– Notebook Management
– Kernel Management
– Notebook Server State
– Kernel Proxy
• Channels between the Kernel Proxy and the kernel
– Shell, IOPub, stdin, control, heartbeat
• Python Process
– IPython Kernel
– Kernel Session State
– User Code
– Libraries: sklearn, Spark, TensorFlow, …
• Local Filesystem
23
The Five Stages of Enterprise Jupyter Deployment
1. Denial
24
The Five Stages of Enterprise Jupyter Deployment
1. Denial
2. Anger
28
The Actual Architecture of Jupyter Notebooks
[Slides 25-28 repeat the architecture diagram from slide 22; slide 28 adds the callout below.]
• Five ZeroMQ message queues over unencrypted TCP sockets… per kernel
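The five channels can be sketched from the shape of a kernel connection file, which names one TCP port per channel plus a signing key. Jupyter messages are signed with an HMAC over four serialized parts (header, parent header, metadata, content) but are not encrypted, which is the concern the callout raises. The ports and key below are made-up values, and the header is abbreviated; this is an illustration of the idea, not a working kernel client.

```python
import hashlib
import hmac
import json

# Shape of a kernel connection file. Port numbers and key are made-up values.
connection_info = {
    "transport": "tcp",
    "ip": "127.0.0.1",
    "shell_port": 50001,
    "iopub_port": 50002,
    "stdin_port": 50003,
    "control_port": 50004,
    "hb_port": 50005,
    "signature_scheme": "hmac-sha256",
    "key": "not-a-real-key",
}

CHANNELS = ("shell", "iopub", "stdin", "control", "hb")


def channel_endpoints(info):
    """Return one endpoint URL per channel: five sockets for every kernel."""
    return {
        name: f"{info['transport']}://{info['ip']}:{info[name + '_port']}"
        for name in CHANNELS
    }


def sign_message(key, header, parent_header, metadata, content):
    """Serialize the four message parts and compute their HMAC-SHA256 digest."""
    frames = [
        json.dumps(part).encode("utf-8")
        for part in (header, parent_header, metadata, content)
    ]
    mac = hmac.new(key.encode("utf-8"), digestmod=hashlib.sha256)
    for frame in frames:
        mac.update(frame)
    return mac.hexdigest(), frames


endpoints = channel_endpoints(connection_info)
signature, frames = sign_message(
    connection_info["key"],
    {"msg_type": "execute_request"},  # abbreviated header, for illustration only
    {},
    {},
    {"code": "1+1"},
)
print(endpoints)
print(signature)
print(frames[3])  # the user's code is still readable on the wire
```

The signature lets the receiver reject forged messages, but anyone who can observe the socket can still read the code and its results.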
29
Third-Party Kernels
• The IPython kernel is
the most common…
• …but there is a long tail
of other Jupyter kernels
– 103 kernels currently
listed on the Jupyter
project’s wiki
30
The Actual Architecture of Jupyter Notebooks
[Same architecture diagram as slide 22, with an added callout]
• To share notebooks among users, you need to share the notebook server
31
The Actual Architecture of Jupyter Notebooks
[Same architecture diagram as slide 22, with an added callout]
• To use Apache Spark™ on YARN, you need to be inside the YARN cluster’s network.
32
Jupyter in the Enterprise: Key Challenges
• Collaboration among multiple users
• Large-scale data analysis
– Shared cloud infrastructure like Kubernetes
– Parallel frameworks like Spark
• Security and authentication
• Auditing and data access control
Bringing these properties to the Jupyter stack is hard!
33
The Five Stages of Enterprise Jupyter Deployment
1. Denial
2. Anger
34
The Five Stages of Enterprise Jupyter Deployment
1. Denial
2. Anger
3. Bargaining
35
Bargaining
• Meeting all the enterprise requirements is expensive
• Compromise to bring down the cost
36
Compromise #1: Gigantic Server
• Find the biggest machine or container you can get
• Run the entire Jupyter stack on that one machine
• Issues:
– Machine needs to be sized for the maximum aggregate
memory of all active users’ active kernels
• Hard upper limit of 256GB-1TB in most organizations
• Very problematic if you have many users and big data
– Need to authenticate all these users to the same machine
and notebook server
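The sizing issue above is simple arithmetic, sketched below. The per-kernel figure of 4 GB is borrowed from the "4GB Heap" label on the deck's scalability charts; the two kernels per user and the user counts are hypothetical.

```python
def required_memory_gb(active_users, kernels_per_user, gb_per_kernel):
    """Memory a single shared machine needs for all active kernels at once."""
    return active_users * kernels_per_user * gb_per_kernel


def max_active_users(machine_gb, kernels_per_user, gb_per_kernel):
    """How many active users fit before the machine runs out of memory."""
    return machine_gb // (kernels_per_user * gb_per_kernel)


GB_PER_KERNEL = 4      # from the charts' "4GB Heap" label
KERNELS_PER_USER = 2   # hypothetical

for machine_gb in (256, 1024):
    users = max_active_users(machine_gb, KERNELS_PER_USER, GB_PER_KERNEL)
    print(f"{machine_gb} GB machine: {users} active users")

print(required_memory_gb(100, KERNELS_PER_USER, GB_PER_KERNEL), "GB for 100 active users")
```

Under these assumptions the 256GB-1TB ceiling caps the deployment at a few dozen to roughly a hundred active users, before counting any memory for the data itself.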
37
Compromise #2: Notebook Server Per User
• Proxy server manages a pool of containers, one per active user
• Each container contains an entire Jupyter notebook stack
• JupyterHub project provides a pre-built implementation of this
approach
• Issues:
– Container needs to be big enough for all the user’s kernels
• What size container to allocate when the user logs in?
• Does a big enough container even exist?
– Disables collaboration features
– Many more moving parts → More failure modes
38
Compromise #3: Replace the Kernel
[Diagram: the IPython kernel on the far side of the Kernel Proxy’s five channels (Shell, IOPub, stdin, control, heartbeat) is swapped for a proxy.]
39
Compromise #3: Replace the Kernel
• Replace the IPython kernel with a proxy
• Put something enterprise-friendly on the other side of the proxy
• Apache Livy, used with sparkmagic, implements this approach
– https://github.com/jupyter-incubator/sparkmagic
• Issues:
– Breaks Jupyter’s magics and extensions
– Breaks data visualization libraries
– Breaks third-party kernels
– Less control over code execution
[Diagram: the five kernel channels (Shell, IOPub, stdin, control, heartbeat) end at the proxy, which forwards to a RESTful web service.]
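A rough sketch of what the proxy approach reduces a cell to: with Livy, the code travels as a string in a JSON body posted to a REST endpoint. The host name and the session ID `0` are assumptions for illustration, and the requests are only constructed here, never sent.

```python
import json
from urllib.request import Request

LIVY_URL = "http://livy.example.com:8998"  # hypothetical host


def livy_post(path, payload):
    """Build (but do not send) a JSON POST request for a Livy endpoint."""
    return Request(
        LIVY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Step 1: ask Livy for an interactive PySpark session.
create_session = livy_post("/sessions", {"kind": "pyspark"})

# Step 2: submit a cell's code as a statement. Session ID 0 is assumed.
run_statement = livy_post("/sessions/0/statements", {"code": "1 + 1"})

for request in (create_session, run_statement):
    print(request.get_method(), request.full_url, request.data.decode("utf-8"))
```

Because only the code string crosses the REST boundary, features that rely on a live kernel in the notebook's own process, such as magics and rich display, have nothing to attach to. That is consistent with the issues listed above.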
40
The Five Stages of Enterprise Jupyter Deployment
1. Denial
2. Anger
3. Bargaining
41
The Five Stages of Enterprise Jupyter Deployment
1. Denial
2. Anger
3. Bargaining
4. Depression
42
The Five Stages of Enterprise Jupyter Deployment
1. Denial
2. Anger
3. Bargaining
4. Depression
5. Jupyter Enterprise Gateway
43
The Origins of Jupyter Enterprise Gateway
• Multiple IBM products embedding Spark on YARN
• All wanted to add Jupyter notebooks with Spark
• Usual enterprise requirements (multitenancy,
scalability, security, etc.)
• Had reached the “Bargaining” stage
– Mix of compromises 1, 2, and 3
44
Initial Prototype
[Diagram: a Notebook Node running nb2kg (proxy) connects through a Security Layer to a YARN Cluster. Inside the cluster, Jupyter Kernel Gateway hosts the Python kernels, each with its Spark driver, over the five kernel channels (Shell, IOPub, stdin, control, heartbeat). The YARN Resource Manager and YARN Workers run the Spark executors.]
45
Initial Prototype
[Same diagram as slide 44, with two callouts]
• Issue #1: All kernels and Spark drivers run on a single node
• Issue #2: All Spark jobs run as same user ID
Issue #1: All kernels run on a single node
[Bar chart: Maximum Number of Simultaneous Kernels. X axis: Cluster Size (32GB Nodes). Y axis: Max Kernels (4GB Heap), scale 0-80.]
• 4 Nodes: 8
• 8 Nodes: 8
• 12 Nodes: 8
• 16 Nodes: 8
46
Jupyter Enterprise Gateway: Initial Goals
• Optimized Resource Allocation
– Run Spark in YARN Cluster Mode to better utilize cluster resources.
– Pluggable architecture for additional Resource Managers
• Multiuser support with user impersonation
– Enhance security and sandboxing by enabling user impersonation
when running kernels (using Kerberos).
– Individual HDFS home folder for each notebook user.
– Use the same user ID for notebook and batch jobs.
• Enhanced Security
– Secure socket communications
– Any network communication should be encrypted
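A hedged sketch of how a client might ask for a kernel to run under a given user ID. It assumes the gateway exposes the Jupyter kernel REST endpoint `POST /api/kernels` and reads the target user from a `KERNEL_USERNAME` entry in the request's `env` field; the host name and kernel name are hypothetical, and the request is only constructed, not sent.

```python
import json
from urllib.request import Request

GATEWAY_URL = "http://gateway.example.com:8888"  # hypothetical host


def start_kernel_request(kernel_name, username):
    """Build (but do not send) a kernel start request for a specific user."""
    body = {
        "name": kernel_name,
        # Assumption: the gateway uses this value for user impersonation.
        "env": {"KERNEL_USERNAME": username},
    }
    return Request(
        GATEWAY_URL + "/api/kernels",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = start_kernel_request("python3", "alice")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

In a secured deployment this request would travel over an encrypted connection and carry credentials, which the sketch leaves out.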
47
Jupyter Enterprise Gateway
48
[Diagram: behind a Security Layer, Jupyter Enterprise Gateway provides multitenancy, remote kernels, and kernel lifecycle management. Inside the YARN Cluster, each Jupyter kernel runs with its Spark driver in its own YARN container on the YARN Workers, alongside Spark executors.]
• Impersonation: Alice’s kernel runs under Alice’s user ID.
Scalability Benefits
[Bar chart: Maximum Number of Simultaneous Kernels. X axis: Cluster Size (32GB Nodes). Y axis: Max Kernels (4GB Heap), scale 0-80.]
• 4 Nodes: 8 before JEG, 16 after JEG
• 8 Nodes: 8 before JEG, 32 after JEG
• 12 Nodes: 8 before JEG, 48 after JEG
• 16 Nodes: 8 before JEG, 64 after JEG
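The chart's numbers follow a simple model, sketched here. Before JEG every kernel shares one 32GB node, so the ceiling is 32 / 4 = 8 kernels regardless of cluster size. After JEG the ceiling grows with the node count. The figure of 4 kernels per node is read off the chart (16 kernels on 4 nodes) rather than derived, and the deck does not say why it is 4 rather than 8.

```python
NODE_MEMORY_GB = 32         # from the chart's x-axis label
KERNEL_HEAP_GB = 4          # from the chart's y-axis label
KERNELS_PER_NODE_AFTER = 4  # read off the chart: 16 kernels on 4 nodes


def max_kernels_before(nodes):
    """All kernels live on one node, so cluster size does not matter."""
    return NODE_MEMORY_GB // KERNEL_HEAP_GB


def max_kernels_after(nodes):
    """Kernels are spread across the cluster, so capacity scales linearly."""
    return nodes * KERNELS_PER_NODE_AFTER


for nodes in (4, 8, 12, 16):
    print(f"{nodes} nodes: {max_kernels_before(nodes)} before, {max_kernels_after(nodes)} after")
```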
49
Jupyter Enterprise Gateway: Open Source
50
• Released through the Jupyter Incubator
– BSD License
– https://github.com/jupyter-incubator/enterprise_gateway
– Current release: 0.7.0
Jupyter Enterprise Gateway: Supported Platforms
• Python / Spark 2.x using IPython kernel
– With Spark Context delayed initialization
• Scala 2.11 / Spark 2.x using Apache Toree kernel
– With Spark Context delayed initialization
• R / Spark 2.x with IRkernel
51
Jupyter Enterprise Gateway – Roadmap
• Add support for other resource managers
– Kubernetes support
• Kernel Configuration Profile
– Enable client to request different resource configuration for kernels (e.g. small,
medium, large)
– Profiles should be defined by Administrators and enabled for user/group of users.
• Administration UI
– Dashboard with running kernels and administration actions
• Time running, stop/kill, Profile Management, etc
• User Environments
• High Availability
52
Jupyter Enterprise Gateway
• Jupyter Enterprise Gateway at IBM Code
– https://developer.ibm.com/code/openprojects/jupyter-enterprise-gateway/
• Jupyter Enterprise Gateway source code at GitHub
– https://github.com/jupyter-incubator/enterprise_gateway
• Docker images
– https://github.com/jupyter-incubator/enterprise_gateway/tree/master/etc/docker
• Jupyter Enterprise Gateway 0.7 release
– https://github.com/jupyter-incubator/enterprise_gateway/releases/tag/v0.7.0
• Jupyter Enterprise Gateway Documentation
– http://jupyter-enterprise-gateway.readthedocs.io/en/latest/
53
Free
IBM Data Science
trial
https://ibm.biz/BdZceR
54
Thank you!
And special thanks to the Jupyter
Enterprise Gateway team: Luciano Resende,
Kevin Bates, Kun Liu, Christian Kadner,
Sanjay Saxena, Alan Chin, Sherry Guo, Alex
Bozarth, Zee Chen
55
Backup
56
Building your own test
environment with
Jupyter Enterprise Gateway
Jupyter Enterprise Gateway: Deployment
57
[Diagram labels: Management Node (powered by Ambari), EG (Enterprise Gateway), Compute Engine based on Apache Spark]
Jupyter Enterprise Gateway: Deployment
• Ansible deployment scripts
– https://github.com/lresende/spark-cluster-install
• One-click deployment of the Spark Cluster
– Configure your host inventory (see example in the git repository)
– Run the "setup-ambari.yml" playbook
• $ ansible-playbook --verbose setup-ambari.yml -i hosts-fyre-ambari -c paramiko
• One-click deployment of the Jupyter Enterprise Gateway
– Run the "setup-enterprise-gateway.yml" playbook
• $ ansible-playbook --verbose setup-enterprise-gateway.yml -i hosts-fyre-ambari -c paramiko
58
Jupyter Enterprise Gateway - Deployment
• Docker images
– yarn-spark: Basic one-node Spark on YARN configuration
– enterprise-gateway: Adds Anaconda and Jupyter Enterprise Gateway to the yarn-spark image
– nb2kg: Minimal Jupyter Notebook client configured with hooks to access the Enterprise Gateway
– https://github.com/jupyter-incubator/enterprise_gateway/tree/master/etc/docker
• Building the latest docker images
– git clone https://github.com/jupyter-incubator/enterprise_gateway
– make docker-clean docker-images
– Note: The Makefile also has individual targets to clean and build individual images (type make for help)
59
Jupyter Enterprise Gateway - Deployment
• Connecting to a Spark Cluster using a docker image
docker run -t --rm \
  -e KG_URL='http://<Enterprise Gateway IP>:8888' \
  -p 8888:8888 \
  -e VALIDATE_KG_CERT='no' \
  -e LOG_LEVEL=DEBUG \
  -e KG_REQUEST_TIMEOUT=40 \
  -e KG_CONNECT_TIMEOUT=40 \
  -v ${HOME}/opensource/jupyter/jupyter-notebooks/:/tmp/notebooks \
  -w /tmp/notebooks \
  elyra/nb2kg:dev
60

More Related Content

PDF
The Five Stages of Enterprise Jupyter Deployment
Frederick Reiss
 
PDF
Droidcon 2013 france - The Growth of Android in Embedded Systems
Benjamin Zores
 
PDF
都立大「ユビキタスロボティクス特論」5月12日
NoriakiAndo
 
PDF
200519 TMU Ubiquitous Robot
NoriakiAndo
 
PDF
ABS 2014 - The Growth of Android in Embedded Systems
Benjamin Zores
 
PDF
An Enterprise Analytics Platform with Jupyter Notebooks and Apache Spark
Luciano Resende
 
PPTX
Contributing to OpenStack
devkulkarni
 
PDF
Jupyter Enterprise Gateway Overview
Luciano Resende
 
The Five Stages of Enterprise Jupyter Deployment
Frederick Reiss
 
Droidcon 2013 france - The Growth of Android in Embedded Systems
Benjamin Zores
 
都立大「ユビキタスロボティクス特論」5月12日
NoriakiAndo
 
200519 TMU Ubiquitous Robot
NoriakiAndo
 
ABS 2014 - The Growth of Android in Embedded Systems
Benjamin Zores
 
An Enterprise Analytics Platform with Jupyter Notebooks and Apache Spark
Luciano Resende
 
Contributing to OpenStack
devkulkarni
 
Jupyter Enterprise Gateway Overview
Luciano Resende
 

What's hot (20)

PDF
Flaky tests and bugs in Apache software (e.g. Hadoop)
Akihiro Suda
 
PDF
Finding and Organizing a Great Cloud Foundry User Group
Daniel Krook
 
PDF
Puppet & Jenkins
Matthew Barr
 
PPTX
Tackling non-determinism in Hadoop - Testing and debugging distributed system...
Akihiro Suda
 
PPTX
Slack Bot: upload NUGET package to Artifactory
Sergey Dzyuban
 
PDF
Continuous Deployment at Disqus (Pylons Minicon)
zeeg
 
PDF
CPOSC2014: Next Generation Cloud -- Rise of the Unikernel
The Linux Foundation
 
PDF
SCALE13x: Next Generation of the Cloud - Rise of the Unikernel
The Linux Foundation
 
PDF
FusionInventory at LSM/RMLL 2012
Nouh Walid
 
PDF
Docker openstack-2014
OpenCity Community
 
PDF
Training Ensimag OpenStack 2016
Bruno Cornec
 
PPT
OaaS:Open as a Strategy
OpenCity Community
 
PPTX
Opening words at DockerCon Europe by Ben Golub
Docker, Inc.
 
PDF
Using Embedded Linux for Infrastructure Systems
Yoshitake Kobayashi
 
PDF
Learn OpenStack from trystack.cn
OpenCity Community
 
PDF
Eclipse e4
Chris Aniszczyk
 
PDF
Platform for a Connected World
All Things Open
 
PDF
Containers & CaaS
OpenCity Community
 
PPTX
Moby Open Source Summit North America 2017
Patrick Chanezon
 
PPTX
Neo4J with Docker and Azure - GraphConnect 2015
Patrick Chanezon
 
Flaky tests and bugs in Apache software (e.g. Hadoop)
Akihiro Suda
 
Finding and Organizing a Great Cloud Foundry User Group
Daniel Krook
 
Puppet & Jenkins
Matthew Barr
 
Tackling non-determinism in Hadoop - Testing and debugging distributed system...
Akihiro Suda
 
Slack Bot: upload NUGET package to Artifactory
Sergey Dzyuban
 
Continuous Deployment at Disqus (Pylons Minicon)
zeeg
 
CPOSC2014: Next Generation Cloud -- Rise of the Unikernel
The Linux Foundation
 
SCALE13x: Next Generation of the Cloud - Rise of the Unikernel
The Linux Foundation
 
FusionInventory at LSM/RMLL 2012
Nouh Walid
 
Docker openstack-2014
OpenCity Community
 
Training Ensimag OpenStack 2016
Bruno Cornec
 
OaaS:Open as a Strategy
OpenCity Community
 
Opening words at DockerCon Europe by Ben Golub
Docker, Inc.
 
Using Embedded Linux for Infrastructure Systems
Yoshitake Kobayashi
 
Learn OpenStack from trystack.cn
OpenCity Community
 
Eclipse e4
Chris Aniszczyk
 
Platform for a Connected World
All Things Open
 
Containers & CaaS
OpenCity Community
 
Moby Open Source Summit North America 2017
Patrick Chanezon
 
Neo4J with Docker and Azure - GraphConnect 2015
Patrick Chanezon
 
Ad

Similar to 2018 02 20-jeg_index (20)

PDF
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
Big Data Spain
 
PDF
The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017
Luciano Resende
 
PDF
Jupyter con meetup extended jupyter kernel gateway
Luciano Resende
 
PDF
Big analytics meetup - Extended Jupyter Kernel Gateway
Luciano Resende
 
PDF
Strata - Scaling Jupyter with Jupyter Enterprise Gateway
Luciano Resende
 
PDF
Luciano Resende - Scaling Big Data Interactive Workloads across Kubernetes Cl...
Codemotion
 
PDF
Building analytical microservices powered by jupyter kernels
Luciano Resende
 
PDF
Data science apps powered by Jupyter Notebooks
Natalino Busa
 
PPTX
Interactive Analytics using Apache Spark
Sachin Aggarwal
 
PDF
Jupyter, A Platform for Data Science at Scale
Matthias Bussonnier
 
PDF
Scaling notebooks for Deep Learning workloads
Luciano Resende
 
PDF
Data analysis with Pandas and Spark
Felix Crisan
 
PDF
20151015 zagreb spark_notebooks
Andrey Vykhodtsev
 
PDF
Ai pipelines powered by jupyter notebooks
Luciano Resende
 
PDF
Jupyter notebooks on steroids
Jose Enrique Ruiz
 
PDF
Jupyter For Data Science Exploratory Analysis Statistical Modeling Machine Le...
ainaniccallo68
 
PPTX
JavaOne 2016: Getting Started with Apache Spark: Use Scala, Java, Python, or ...
David Taieb
 
PPTX
Introduction to Jupyter notebook and MS Azure Machine Learning Studio
Muralidharan Deenathayalan
 
PPTX
Introduction to Jupyter notebook and MS Azure Machine Learning Studio
Muralidharan Deenathayalan
 
PPTX
JupyterCon 2020 - Supercharging SQL Users with Jupyter Notebooks
Michelle Ufford
 
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
Big Data Spain
 
The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017
Luciano Resende
 
Jupyter con meetup extended jupyter kernel gateway
Luciano Resende
 
Big analytics meetup - Extended Jupyter Kernel Gateway
Luciano Resende
 
Strata - Scaling Jupyter with Jupyter Enterprise Gateway
Luciano Resende
 
Luciano Resende - Scaling Big Data Interactive Workloads across Kubernetes Cl...
Codemotion
 
Building analytical microservices powered by jupyter kernels
Luciano Resende
 
Data science apps powered by Jupyter Notebooks
Natalino Busa
 
Interactive Analytics using Apache Spark
Sachin Aggarwal
 
Jupyter, A Platform for Data Science at Scale
Matthias Bussonnier
 
Scaling notebooks for Deep Learning workloads
Luciano Resende
 
Data analysis with Pandas and Spark
Felix Crisan
 
20151015 zagreb spark_notebooks
Andrey Vykhodtsev
 
Ai pipelines powered by jupyter notebooks
Luciano Resende
 
Jupyter notebooks on steroids
Jose Enrique Ruiz
 
Jupyter For Data Science Exploratory Analysis Statistical Modeling Machine Le...
ainaniccallo68
 
JavaOne 2016: Getting Started with Apache Spark: Use Scala, Java, Python, or ...
David Taieb
 
Introduction to Jupyter notebook and MS Azure Machine Learning Studio
Muralidharan Deenathayalan
 
Introduction to Jupyter notebook and MS Azure Machine Learning Studio
Muralidharan Deenathayalan
 
JupyterCon 2020 - Supercharging SQL Users with Jupyter Notebooks
Michelle Ufford
 
Ad

More from Chester Chen (20)

PDF
SFBigAnalytics_SparkRapid_20220622.pdf
Chester Chen
 
PDF
zookeeer+raft-2.pdf
Chester Chen
 
PPTX
SF Big Analytics 2022-03-15: Persia: Scaling DL Based Recommenders up to 100 ...
Chester Chen
 
PDF
SF Big Analytics talk: NVIDIA FLARE: Federated Learning Application Runtime E...
Chester Chen
 
PDF
A missing link in the ML infrastructure stack?
Chester Chen
 
PDF
Shopify datadiscoverysf bigdata
Chester Chen
 
PDF
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
Chester Chen
 
PDF
SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACK...
Chester Chen
 
PDF
SFBigAnalytics_20190724: Monitor kafka like a Pro
Chester Chen
 
PDF
SF Big Analytics 2019-06-12: Managing uber's data workflows at scale
Chester Chen
 
PPTX
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
Chester Chen
 
PPTX
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
Chester Chen
 
PDF
SFBigAnalytics- hybrid data management using cdap
Chester Chen
 
PDF
Sf big analytics: bighead
Chester Chen
 
PPTX
Sf big analytics_2018_04_18: Evolution of the GoPro's data platform
Chester Chen
 
PPTX
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Chester Chen
 
PPTX
2018 data warehouse features in spark
Chester Chen
 
PDF
2018 02-08-what's-new-in-apache-spark-2.3
Chester Chen
 
PDF
Index conf sparkml-feb20-n-pentreath
Chester Chen
 
PDF
Index conf sparkai-feb20-n-pentreath
Chester Chen
 
SFBigAnalytics_SparkRapid_20220622.pdf
Chester Chen
 
zookeeer+raft-2.pdf
Chester Chen
 
SF Big Analytics 2022-03-15: Persia: Scaling DL Based Recommenders up to 100 ...
Chester Chen
 
SF Big Analytics talk: NVIDIA FLARE: Federated Learning Application Runtime E...
Chester Chen
 
A missing link in the ML infrastructure stack?
Chester Chen
 
Shopify datadiscoverysf bigdata
Chester Chen
 
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
Chester Chen
 
SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACK...
Chester Chen
 
SFBigAnalytics_20190724: Monitor kafka like a Pro
Chester Chen
 
SF Big Analytics 2019-06-12: Managing uber's data workflows at scale
Chester Chen
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
Chester Chen
 
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
Chester Chen
 
SFBigAnalytics- hybrid data management using cdap
Chester Chen
 
Sf big analytics: bighead
Chester Chen
 
Sf big analytics_2018_04_18: Evolution of the GoPro's data platform
Chester Chen
 
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Chester Chen
 
2018 data warehouse features in spark
Chester Chen
 
2018 02-08-what's-new-in-apache-spark-2.3
Chester Chen
 
Index conf sparkml-feb20-n-pentreath
Chester Chen
 
Index conf sparkai-feb20-n-pentreath
Chester Chen
 

Recently uploaded (20)

PPTX
IP_Journal_Articles_2025IP_Journal_Articles_2025
mishell212144
 
PPTX
Fluvial_Civilizations_Presentation (1).pptx
alisslovemendoza7
 
PPT
From Vision to Reality: The Digital India Revolution
Harsh Bharvadiya
 
PPTX
short term internship project on Data visualization
JMJCollegeComputerde
 
PPT
Real Life Application of Set theory, Relations and Functions
manavparmar205
 
PDF
Key_Statistical_Techniques_in_Analytics_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 
PPTX
lecture 13 mind test academy it skills.pptx
ggesjmrasoolpark
 
PPTX
MR and reffffffvvvvvvvfversal_083605.pptx
manjeshjain
 
PDF
D9110.pdfdsfvsdfvsdfvsdfvfvfsvfsvffsdfvsdfvsd
minhn6673
 
PPTX
Introduction to computer chapter one 2017.pptx
mensunmarley
 
PPTX
Introduction to Biostatistics Presentation.pptx
AtemJoshua
 
PDF
Blue Futuristic Cyber Security Presentation.pdf
tanvikhunt1003
 
PPTX
HSE WEEKLY REPORT for dummies and lazzzzy.pptx
ahmedibrahim691723
 
PPTX
INFO8116 -Big data architecture and analytics
guddipatel10
 
PDF
SUMMER INTERNSHIP REPORT[1] (AutoRecovered) (6) (1).pdf
pandeydiksha814
 
PDF
202501214233242351219 QASS Session 2.pdf
lauramejiamillan
 
PPTX
Data-Driven Machine Learning for Rail Infrastructure Health Monitoring
Sione Palu
 
PDF
Blitz Campinas - Dia 24 de maio - Piettro.pdf
fabigreek
 
PPTX
The whitetiger novel review for collegeassignment.pptx
DhruvPatel754154
 
PPTX
Future_of_AI_Presentation for everyone.pptx
boranamanju07
 
IP_Journal_Articles_2025IP_Journal_Articles_2025
mishell212144
 
Fluvial_Civilizations_Presentation (1).pptx
alisslovemendoza7
 
From Vision to Reality: The Digital India Revolution
Harsh Bharvadiya
 
short term internship project on Data visualization
JMJCollegeComputerde
 
Real Life Application of Set theory, Relations and Functions
manavparmar205
 
Key_Statistical_Techniques_in_Analytics_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 
lecture 13 mind test academy it skills.pptx
ggesjmrasoolpark
 
MR and reffffffvvvvvvvfversal_083605.pptx
manjeshjain
 
D9110.pdfdsfvsdfvsdfvsdfvfvfsvfsvffsdfvsdfvsd
minhn6673
 
Introduction to computer chapter one 2017.pptx
mensunmarley
 
Introduction to Biostatistics Presentation.pptx
AtemJoshua
 
Blue Futuristic Cyber Security Presentation.pdf
tanvikhunt1003
 
HSE WEEKLY REPORT for dummies and lazzzzy.pptx
ahmedibrahim691723
 
INFO8116 -Big data architecture and analytics
guddipatel10
 
SUMMER INTERNSHIP REPORT[1] (AutoRecovered) (6) (1).pdf
pandeydiksha814
 
202501214233242351219 QASS Session 2.pdf
lauramejiamillan
 
Data-Driven Machine Learning for Rail Infrastructure Health Monitoring
Sione Palu
 
Blitz Campinas - Dia 24 de maio - Piettro.pdf
fabigreek
 
The whitetiger novel review for collegeassignment.pptx
DhruvPatel754154
 
Future_of_AI_Presentation for everyone.pptx
boranamanju07
 

2018 02 20-jeg_index

  • 1. Building an Enterprise/Cloud Analytics Platform with Jupyter Notebooks and Apache Spark Fred Reiss Chief Architect, IBM Spark Technology Center
  • 2. 2 Hi! Fred Reiss • 2014-present: Chief Architect, IBM Spark Technology Center. • 2006-2014: Worked for IBM Research. • 2006: Ph.D. from U.C. Berkeley.
  • 3. 3 The Jupyter Project • Open Source project that builds software to enable interactive notebooks for data science – Started in 2014 – Grew out of the IPython project
  • 4. 4 What is IPython? https://upload.wikimedia.org/wikipedia/commons/4/47/IPython-shell.png By Shishirdasika (Own work) [CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons Interactive console for Python Can open a window to display graphics
  • 5. 5 IPython Notebooks https://upload.wikimedia.org/wikipedia/commons/a/af/IPython-notebook.png By Shishirdasika (Own work) [CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons Text and graphics in the same browser window
  • 7. 7 Jupyter Notebooks • Jupyter notebooks are widely used by data scientists, social scientists, physical scientists, engineers, and others • Useful for many tasks – Analyzing data – Developing and debugging software – Running experiments – Keeping track of experimental results – Presenting results • Jupyter is a central part of the IBM Data Science Experience (http://datascience.ibm.com)
  • 8. 8 Jupyter in the Enterprise: Key Challenges • Collaboration among multiple users • Large-scale data analysis (problems that don’t fit in a laptop) – Shared cloud infrastructure like Kubernetes – Parallel frameworks like Spark • Security and authentication • Auditing and data access control
  • 9. 9 Isn’t this just shipping strings around? JavaScript “1+1” Server “1+1” Python Process “1+1” “2”“2”“2”
  • 10. 10 Isn’t this just shipping strings around? JavaScript “1+1” FancyNewSystem “1+1” Python Process “1+1” “2”“2”“2” Security Multitenancy Authentication Spark Kubernetes
  • 11. 11 The Five Stages of Enterprise Jupyter Deployment
  • 12. 12 The Five Stages of Enterprise Jupyter Deployment 1. Denial
  • 13. 13 Jupyter does more than just pass strings around. • Quite a bit more!
  • 14. 14 Asynchronous Operations • Queue up multiple cells for execution – …in arbitrary order • Stream output while a cell is running • Interrupt any operation Fifteenth cell that executed in this session
  • 15. 15 Jupyter’s Display System: Much More than Text https://nbviewer.jupyter.org/github/ipython/ipython/bl ob/master/examples/IPython%20Kernel/Custom%2 0Display%20Logic.ipynb
  • 17. 17 Magics • Jupyter’s standard Python kernel has over 90 built-in magic commands http://ipython.readthedocs.io/en/stable/interactive/magics.html
  • 18. 18 Extensions • Many additional extensions in the iPython project’s Github repository – https://github.com/ip ython- contrib/jupyter_contri b_nbextensions
  • 21. 21 The Actual Architecture of Jupyter Notebooks
  • 22. Notebook Server Process 22 The Actual Architecture of Jupyter Notebooks JavaScript NotebookManagement Python Process KernelManagement iPythonKernel Notebook Server State KernelProxy Shell IOPub stdin control heartbeat Kernel Session State UserCode sklearn Spark Tensor Flow … Local Filesystem
  • 23. 23 The Five Stages of Enterprise Jupyter Deployment 1. Denial
  • 24. 24 The Five Stages of Enterprise Jupyter Deployment 1. Denial 2. Anger
  • 25. Notebook Server Process 25 The Actual Architecture of Jupyter Notebooks JavaScript NotebookManagement Python Process KernelManagement iPythonKernel Notebook Server State KernelProxy Shell IOPub stdin control heartbeat Kernel Session State UserCode sklearn Spark Tensor Flow … Local Filesystem
  • 26. Notebook Server Process 26 The Actual Architecture of Jupyter Notebooks JavaScript NotebookManagement Python Process KernelManagement iPythonKernel Notebook Server State KernelProxy Shell IOPub stdin control heartbeat Kernel Session State UserCode sklearn Spark Tensor Flow … Local Filesystem
  • 27. Notebook Server Process 27 The Actual Architecture of Jupyter Notebooks JavaScript NotebookManagement Python Process KernelManagement iPythonKernel Notebook Server State KernelProxy Shell IOPub stdin control heartbeat Kernel Session State UserCode sklearn Spark Tensor Flow … Local Filesystem
  • 28. Notebook Server Process 28 The Actual Architecture of Jupyter Notebooks JavaScript NotebookManagement Python Process KernelManagement iPythonKernel Notebook Server State KernelProxy Shell IOPub stdin control heartbeat Kernel Session State UserCode sklearn Spark Tensor Flow … Local Filesystem Five ZeroMQ message queues over unencrypted TCP sockets… …per kernel
  • 29. 29 Third-Party Kernels • The IPython kernel is the most common… • …but there is a long tail of other Jupyter kernels – 103 kernels currently listed on the Jupyter project’s wiki
  • 30. 30 The Actual Architecture of Jupyter Notebooks [Diagram repeated from slide 22] Callout: to share notebooks among users, you need to share the notebook server
  • 31. 31 The Actual Architecture of Jupyter Notebooks [Diagram repeated from slide 22] Callout: to use Apache Spark™ on YARN, you need to be inside the YARN cluster’s network
  • 32. 32 Jupyter in the Enterprise: Key Challenges • Collaboration among multiple users • Large-scale data analysis – Shared cloud infrastructure like Kubernetes – Parallel frameworks like Spark • Security and authentication • Auditing and data access control Bringing these properties to the Jupyter stack is hard!
  • 33. 33 The Five Stages of Enterprise Jupyter Deployment 1. Denial 2. Anger
  • 34. 34 The Five Stages of Enterprise Jupyter Deployment 1. Denial 2. Anger 3. Bargaining
  • 35. 35 Bargaining • Meeting all the enterprise requirements is expensive • Compromise to bring down the cost
  • 36. 36 Compromise #1: Gigantic Server • Find the biggest machine or container you can get • Run the entire Jupyter stack on that one machine • Issues: – Machine needs to be sized for the maximum aggregate memory of all active users’ active kernels • Hard upper limit of 256GB-1TB in most organizations • Very problematic if you have many users and big data – Need to authenticate all these users to the same machine and notebook server
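
The sizing issue on slide 36 is simple arithmetic, sketched below. The user count, kernels per user, and memory per kernel are hypothetical numbers chosen for illustration; only the 256 GB and 1 TB limits come from the slide.

```python
def required_memory_gb(active_users, kernels_per_user, gb_per_kernel):
    """Aggregate memory the single server must hold at peak."""
    return active_users * kernels_per_user * gb_per_kernel


def max_active_kernels(server_memory_gb, gb_per_kernel):
    """How many kernels fit on a server of the given size."""
    return server_memory_gb // gb_per_kernel


# Hypothetical workload: 100 active users, 2 kernels each, 4 GB per kernel.
need = required_memory_gb(active_users=100, kernels_per_user=2, gb_per_kernel=4)

fits_256 = need <= 256    # lower end of the range on the slide
fits_1024 = need <= 1024  # upper end of the range (1 TB)
cap_256 = max_active_kernels(256, 4)
```

Under these assumptions the workload needs 800 GB, so it only fits at the very top of the range, and a 256 GB machine tops out at 64 kernels across all users.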
  • 37. 37 Compromise #2: Notebook Server Per User • Proxy server manages a pool of containers, one per active user • Each container contains an entire Jupyter notebook stack • JupyterHub project provides a pre-built implementation of this approach • Issues: – Container needs to be big enough for all the user’s kernels • What size container to allocate when the user logs in? • Does a big enough container even exist? – Disables collaboration features – Many more moving parts → more failure modes
  • 38. 38 Compromise #3: Replace the Kernel [Diagram labels: IPython Kernel; Kernel Proxy; channels Shell, IOPub, stdin, control, heartbeat]
  • 39. 39 Compromise #3: Replace the Kernel • Replace the IPython kernel with a proxy • Put something enterprise-friendly on the other side of the proxy • Apache Livy implements this approach – https://github.com/jupyter-incubator/sparkmagic • Issues: – Breaks Jupyter’s magics and extensions – Breaks data visualization libraries – Breaks third-party kernels – Less control over code execution [Diagram labels: Kernel Proxy; Proxy; channels Shell, IOPub, stdin, control, heartbeat; RESTful web service]
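
A rough sketch of what "something enterprise-friendly on the other side of the proxy" looks like with Livy: each code cell becomes an HTTP request to a REST service rather than a message to a local kernel. The sketch only builds the requests and never sends them. The host name and the helper function names are hypothetical; the `/sessions` and `/sessions/{id}/statements` paths and port 8998 are Livy's documented REST endpoints and default port.

```python
import json
import urllib.request

# Hypothetical host; 8998 is Livy's default port.
LIVY_URL = "http://livy.example.com:8998"


def json_post(url, payload):
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_session_request(kind="pyspark"):
    """Request that asks Livy to start a remote interactive session."""
    return json_post(f"{LIVY_URL}/sessions", {"kind": kind})


def run_statement_request(session_id, code):
    """Request that ships one cell's source code to the remote session."""
    url = f"{LIVY_URL}/sessions/{session_id}/statements"
    return json_post(url, {"code": code})


session_req = create_session_request()
statement_req = run_statement_request(0, "sc.parallelize(range(100)).sum()")
```

Because the cell is shipped as a plain string to a remote interpreter, anything that depends on a kernel running next to the notebook server, such as the magics and visualization libraries listed above, has no direct equivalent on this path.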
  • 40. 40 The Five Stages of Enterprise Jupyter Deployment 1. Denial 2. Anger 3. Bargaining
  • 41. 41 The Five Stages of Enterprise Jupyter Deployment 1. Denial 2. Anger 3. Bargaining 4. Depression
  • 42. 42 The Five Stages of Enterprise Jupyter Deployment 1. Denial 2. Anger 3. Bargaining 4. Depression 5. Jupyter Enterprise Gateway
  • 43. 43 The Origins of Jupyter Enterprise Gateway • Multiple IBM products embedding Spark on YARN • All wanted to add Jupyter notebooks with Spark • Usual enterprise requirements (multitenancy, scalability, security, etc.) • Had reached the “Bargaining” stage – Mix of compromises 1, 2, and 3
  • 45. 45 Initial Prototype [Diagram labels: Security Layer; Notebook Node; nb2kg (Proxy); YARN Cluster; Jupyter Kernel Gateway; Python Kernel and Spark Driver (two of each); channels Shell, IOPub, stdin, control, heartbeat; YARN Resource Manager; YARN Workers; Spark Executors] Issue #1: All kernels and Spark drivers run on a single node. Issue #2: All Spark jobs run as the same user ID.
  • 46. 46 Issue #1: All kernels run on a single node [Chart: Maximum Number of Simultaneous Kernels. x-axis: Cluster Size (32GB Nodes); y-axis: Max Kernels (4GB Heap). Values: 8 kernels at each of 4, 8, 12, and 16 nodes.]
  • 47. Jupyter Enterprise Gateway: Initial Goals • Optimized Resource Allocation – Run Spark in YARN Cluster Mode to better utilize cluster resources. – Pluggable architecture for additional Resource Managers • Multiuser support with user impersonation – Enhance security and sandboxing by enabling user impersonation when running kernels (using Kerberos). – Individual HDFS home folder for each notebook user. – Use the same user ID for notebook and batch jobs. • Enhanced Security – Secure socket communications – Any network communication should be encrypted 47
  • 48. 48 Jupyter Enterprise Gateway [Diagram labels: Security Layer; Jupyter Enterprise Gateway (multitenancy, remote kernels and kernel lifecycle management); YARN Cluster; YARN Workers; three YARN containers, each with a Jupyter Kernel and Spark Driver; Spark Executors] Impersonation: Alice’s kernel runs under Alice’s user ID.
  • 49. 49 Scalability Benefits [Chart: Maximum Number of Simultaneous Kernels. x-axis: Cluster Size (32GB Nodes); y-axis: Max Kernels (4GB Heap). Before JEG: 8 kernels at each of 4, 8, 12, and 16 nodes. After JEG: 16, 32, 48, and 64 kernels at 4, 8, 12, and 16 nodes.]
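
The numbers in the scalability chart can be reproduced with a few lines of arithmetic. This is one reading of the chart, not something the slides state: the flat value of 8 matches a single 32 GB node divided into 4 GB kernels, and the "after" series works out to 4 kernels per node. The slide does not say what the rest of each node's memory is used for, so the per-node figure is simply fitted to the chart.

```python
NODE_MEMORY_GB = 32   # from the chart's x-axis label
KERNEL_HEAP_GB = 4    # from the chart's y-axis label
cluster_sizes = [4, 8, 12, 16]

# Before: every kernel and Spark driver shares one node, so adding
# nodes to the cluster does not raise the limit.
kernels_on_one_node = NODE_MEMORY_GB // KERNEL_HEAP_GB
before = [kernels_on_one_node for _ in cluster_sizes]

# After: kernels run in YARN containers spread over the cluster.
# Assumption: 4 kernels per node, derived from the chart (16 kernels / 4 nodes).
KERNELS_PER_NODE_AFTER = 4
after = [KERNELS_PER_NODE_AFTER * nodes for nodes in cluster_sizes]
```

The point of the chart is the shape of the two series: the kernel limit is constant when kernels are pinned to one node and grows linearly with cluster size once they are distributed.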
  • 50. 50 Jupyter Enterprise Gateway: Open Source • Released through the Jupyter Incubator – BSD License – https://github.com/jupyter-incubator/enterprise_gateway – Current release: 0.7.0
  • 51. Jupyter Enterprise Gateway: Supported Platforms • Python/Spark 2.x using IPython kernel – With Spark Context delayed initialization • Scala 2.11/ Spark 2.x using Apache Toree kernel – With Spark Context delayed initialization • R / Spark 2.x with IRkernel 51
  • 52. Jupyter Enterprise Gateway – Roadmap • Add support for other resource managers – Kubernetes support • Kernel Configuration Profile – Enable client to request different resource configuration for kernels (e.g. small, medium, large) – Profiles should be defined by Administrators and enabled for user/group of users. • Administration UI – Dashboard with running kernels and administration actions • Time running, stop/kill, Profile Management, etc • User Environments • High Availability 52
  • 53. 53 Jupyter Enterprise Gateway • Jupyter Enterprise Gateway at IBM Code – https://developer.ibm.com/code/openprojects/jupyter-enterprise-gateway/ • Jupyter Enterprise Gateway source code at GitHub – https://github.com/jupyter-incubator/enterprise_gateway • Docker images – https://github.com/jupyter-incubator/enterprise_gateway/tree/master/etc/docker • Jupyter Enterprise Gateway 0.7 release – https://github.com/jupyter-incubator/enterprise_gateway/releases/tag/v0.7.0 • Jupyter Enterprise Gateway Documentation – http://jupyter-enterprise-gateway.readthedocs.io/en/latest/ • Free IBM Data Science trial – https://ibm.biz/BdZceR
  • 54. 54 Thank you! And special thanks to the Jupyter Enterprise Gateway team: Luciano Resende, Kevin Bates, Kun Liu, Christian Kadner, Sanjay Saxena, Alan Chin, Sherry Guo, Alex Bozarth, Zee Chen
  • 56. 56 Building your own test environment with Jupyter Enterprise Gateway
  • 57. Jupyter Enterprise Gateway: Deployment 57 Management Node Powered by Ambari EG Compute Engine based on Apache Spark
  • 58. 58 Jupyter Enterprise Gateway: Deployment • Ansible deployment scripts – https://github.com/lresende/spark-cluster-install • One-click deployment of the Spark cluster – Configure your host inventory (see the example in the Git repository) – Run the “setup-ambari.yml” playbook • $ ansible-playbook --verbose setup-ambari.yml -i hosts-fyre-ambari -c paramiko • One-click deployment of Jupyter Enterprise Gateway – Run the “setup-enterprise-gateway.yml” playbook • $ ansible-playbook --verbose setup-enterprise-gateway.yml -i hosts-fyre-ambari -c paramiko
  • 59. 59 Jupyter Enterprise Gateway - Deployment • Docker images – yarn-spark: Basic one-node Spark on YARN configuration – enterprise-gateway: Adds Anaconda and Jupyter Enterprise Gateway to the yarn-spark image – nb2kg: Minimal Jupyter Notebook client configured with hooks to access the Enterprise Gateway – https://github.com/jupyter-incubator/enterprise_gateway/tree/master/etc/docker • Building the latest Docker images – git clone https://github.com/jupyter-incubator/enterprise_gateway – make docker-clean docker-images – Note: the Makefile also has individual targets to clean and build individual images (type make for help)
  • 60. Jupyter Enterprise Gateway - Deployment • Connecting to a Spark Cluster using a docker image docker run -t --rm -e KG_URL='http://<Enterprise Gateway IP>:8888' -p 8888:8888 -e VALIDATE_KG_CERT='no' -e LOG_LEVEL=DEBUG -e KG_REQUEST_TIMEOUT=40 -e KG_CONNECT_TIMEOUT=40 -v ${HOME}/opensource/jupyter/jupyter-notebooks/:/tmp/notebooks -w /tmp/notebooks elyra/nb2kg:dev 60

Editor's Notes

  • #9: Now, when I first saw these requirements, my initial reaction was, “sounds easy”. I mean, to a first approximation, all that Jupyter is doing is passing strings around.
  • #11: This is what I initially thought, and I’ve met a good number of other people who were in the same situation and came up with the same design. The problem with this design is that it’s actually only the first stage of a much longer process that I like to call…
  • #12: And in particular, the first stage of this process is called…
  • #13: Let me explain.
  • #22: All these cool features of Jupyter notebooks rely on an architecture that is substantially more baroque than the cartoon picture from ten slides back…
  • #24: When an enterprise architect becomes aware of all this complexity, that’s when he or she moves from stage 1 to stage 2, which is…
  • #25: Let me explain.
  • #26: This architecture was designed for an academic setting. When you try to transplant it into an enterprise environment and layer enterprise requirements on top of it, things go downhill rather quickly.
  • #42: …and the purpose of this talk is to help you to work through this fourth stage as quickly as possible and move on to stage 5, which is…
  • #43: …Jupyter Enterprise Gateway. (Bet you thought I was going to say “acceptance”). So, what is Jupyter Enterprise Gateway?
  • #48: Min RK
  • #52: Min RK
  • #53: Min RK