© 2015 IBM Corporation
Interactive Analytics Using Apache Spark
Bangalore Spark Enthusiasts Group
http://www.meetup.com/Bangalore-Spark-Enthusiasts/
Bagavath Subramaniam, IBM Analytics
Shally Sangal, IBM Analytics
Agenda
▪ Overview of Interactive Analytics
▪ Spark Application User Types
▪ Spark Context
▪ Spark Shell
▪ Spark Submit
▪ Spark JDBC Thrift Server
▪ Apache Zeppelin
▪ Jupyter
▪ Spark Kernel
▪ Spark Job Server
▪ Livy
Spark Application User Types
Data Scientist
▪ Data Exploration
▪ Data Wrangling
▪ Build Models from Data using Algorithms - Predict/Prescribe
▪ Knowledge in Statistics & Maths
▪ R/Python, Matlab/SPSS
▪ Ad-hoc analysis using Interactive Shells
Data Analyst
▪ Data Exploration and Visualization
▪ Understands data sources and relationships among them in an Enterprise
▪ Relates data to business and derives insights, can talk business language
▪ May have basic programming skills and analytic tools knowledge
▪ Ad-hoc analysis using canned reports
▪ Limited usage of interactive shells
Typical User Roles
Business Analyst
▪ Industry Expert
▪ Understands business needs and works on solutions
▪ Improve business processes and design new systems to support them
▪ Not a programmer / Analytics expert
▪ Typical user of reporting systems
Data Engineer / Application Developer
▪ Programmer with S/W Engineering background
▪ Builds production data pipelines, data warehouses, reporting solutions and apps
▪ Productionizes models built by data scientists
▪ Builds s/w applications to solve business problems
▪ Maintains, monitors, and tunes the data processing platform and applications
> Roles are often fluid and overlapping
Interactive Tools for Spark
▪ Apache Spark
▪ IBM Spark Kernel (Apache Toree)
▪ Cloudera Livy
▪ Ooyala Spark Job Server
User and Tools
Primary set of tools for each role
Tools: Spark Shell, Spark Submit, Thrift JDBC Server, Zeppelin, Spark Kernel, Jupyter, Livy, Hue, Spark Job Server
Roles: Data Scientist, Data Analyst, Developer, Business Analyst
(Role-to-tool matrix; the per-cell markings are not recoverable from the slide text.)
Spark Context
▪ Common thread for all Spark Interfaces
▪ Main entry point for Spark, represents the connection to a Spark cluster
▪ Standalone, Yarn, Mesos, Local
▪ Holds all the configuration - memory, cores, parallelism, compression
▪ Create RDDs, accumulators, broadcast variables
▪ Run Jobs, Cancel Jobs
▪ Limited to one SparkContext per JVM; each context has one application ID
▪ Supports parallel jobs from separate threads
▪ Scheduler mode - FIFO / Fair (within an Application)
▪ Fair Scheduler
− Pools - spark.scheduler.pool
− Per-pool weight (priority) and scheduling mode
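The pool settings above can be sketched as a small generator for a pool definition file. This is only an illustration: the element names follow the layout of Spark's fairscheduler.xml, while the pool names and values are hypothetical. A thread would then pick a pool by setting the spark.scheduler.pool local property on the SparkContext.

```python
import xml.etree.ElementTree as ET


def build_fair_scheduler_xml(pools):
    """Render pool definitions in the layout of Spark's fairscheduler.xml."""
    root = ET.Element("allocations")
    for name, settings in pools.items():
        pool = ET.SubElement(root, "pool", name=name)
        # schedulingMode: FIFO or FAIR ordering of jobs inside the pool.
        ET.SubElement(pool, "schedulingMode").text = settings["mode"]
        # weight: share of the cluster relative to the other pools.
        ET.SubElement(pool, "weight").text = str(settings["weight"])
        # minShare: minimum number of cores the pool should receive.
        ET.SubElement(pool, "minShare").text = str(settings["min_share"])
    return ET.tostring(root, encoding="unicode")


# Hypothetical pools, for illustration only.
pools = {
    "adhoc": {"mode": "FAIR", "weight": 1, "min_share": 0},
    "reports": {"mode": "FIFO", "weight": 2, "min_share": 2},
}
xml_text = build_fair_scheduler_xml(pools)
print(xml_text)
```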
Spark Shell
▪ Interactive shell (spark-shell for Scala, pyspark for Python)
▪ spark-shell is based on the Scala REPL
▪ Instantiates a SparkContext by default and starts the Spark UI
▪ Also provides sqlContext, which is a HiveContext when Spark is built with Hive support
▪ Internally calls spark-submit
"${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell"
▪ All parameters of Spark submit can be passed to Spark shell as well
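The pass-through of parameters can be sketched as a function that assembles the command line the shell script ends up running. This is a simplified model, not the actual script; the install path /opt/spark and the option values are assumptions for illustration.

```python
import shlex


def spark_shell_command(spark_home, extra_args):
    """Mimic how spark-shell forwards its own arguments to spark-submit."""
    return [
        f"{spark_home}/bin/spark-submit",
        "--class", "org.apache.spark.repl.Main",
        "--name", "Spark shell",
        # Everything the user typed after `spark-shell` is appended unchanged.
        *extra_args,
    ]


# Hypothetical install path and options.
cmd = spark_shell_command(
    "/opt/spark", ["--master", "local[2]", "--driver-memory", "2g"]
)
print(" ".join(shlex.quote(part) for part in cmd))
```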
Spark submit
▪ Launch/Submit a spark application to a spark cluster
▪ org.apache.spark.launcher.Main gets called with org.apache.spark.deploy.SparkSubmit
as a parameter along with the other params passed to spark-submit
org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit "$@"
▪ spark-submit --help => list of supported parameters
▪ Kill a job (spark-submit --kill) and get job status (spark-submit --status)
▪ spark-defaults.conf in SPARK_CONF_DIR
▪ Precedence (highest first): properties set explicitly on SparkConf, then flags passed to spark-submit, then values from spark-defaults.conf
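The precedence rule can be illustrated with a small lookup over the three sources. This is a sketch of the rule only, not Spark's own resolution code; the function name and the sample values are hypothetical.

```python
def resolve_spark_property(key, spark_conf, submit_flags, defaults_file):
    """Return (value, source) for a property, checking the highest-precedence source first."""
    sources = (
        ("SparkConf", spark_conf),
        ("spark-submit flag", submit_flags),
        ("spark-defaults.conf", defaults_file),
    )
    for source_name, source in sources:
        if key in source:
            return source[key], source_name
    # The property is not set anywhere.
    return None, None


# Hypothetical settings: the same key defined at all three levels.
value, source = resolve_spark_property(
    "spark.executor.memory",
    spark_conf={"spark.executor.memory": "4g"},
    submit_flags={"spark.executor.memory": "2g"},
    defaults_file={"spark.executor.memory": "1g"},
)
print(value, source)
```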
Apache Zeppelin
▪ Web-based notebook for interactive analytics.
▪ Provides built-in Spark integration.
▪ Supports many interpreters, such as Scala, PySpark, Spark SQL, Hive, and Shell.
▪ Runs as a Zeppelin server.
▪ Spawns one JVM per interpreter group.
▪ The server communicates with each interpreter group using Thrift.
Zeppelin Demo
To know more: http://zeppelin.incubator.apache.org/
Jupyter Notebook
Web notebook for interactive data analysis. Part of the Jupyter ecosystem.
Evolved from IPython; works on the IPython messaging protocol.
Has the concept of kernels: any language kernel that implements the protocol can be plugged in.
A Spark kernel is available via Apache Toree.
Jupyter Notebook Demo
To know more: http://jupyter.org/
Spark Kernel (Apache Toree)
The kernel provides the foundation for interactive applications to connect to and use Spark.
It provides an interface that allows clients to interact with a Spark cluster. Clients can send code snippets and libraries that are interpreted and run against a preconfigured SparkContext.
It acts as a proxy between your application and the Spark cluster.
Kernel Architecture
The kernel uses ZeroMQ as its messaging middleware over TCP sockets and implements the IPython message protocol.
It is architected in layers, where each layer has a specific purpose in processing requests.
It provides concurrency and code isolation through the Akka framework.
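To make the message protocol concrete, the sketch below builds the frames of an execute_request as a client would put them on a ZeroMQ socket: a delimiter, an HMAC-SHA256 signature, then the header, parent header, metadata, and content as JSON. It follows the published Jupyter/IPython wire format as understood here; the key, session ID, username, and code snippet are made-up values, and no socket is opened.

```python
import hashlib
import hmac
import json
import uuid
from datetime import datetime, timezone

DELIMITER = b"<IDS|MSG>"


def sign(parts, key):
    """HMAC-SHA256 over the four serialized dictionaries, in order."""
    digest = hmac.new(key, digestmod=hashlib.sha256)
    for part in parts:
        digest.update(part)
    return digest.hexdigest()


def build_execute_request(code, session_id, key, username="demo"):
    """Return the list of byte frames for one execute_request message."""
    header = {
        "msg_id": uuid.uuid4().hex,
        "username": username,
        "session": session_id,
        "date": datetime.now(timezone.utc).isoformat(),
        "msg_type": "execute_request",
        "version": "5.0",
    }
    content = {
        "code": code,
        "silent": False,
        "store_history": True,
        "user_expressions": {},
        "allow_stdin": False,
    }
    # Order matters: header, parent_header, metadata, content.
    parts = [json.dumps(d).encode("utf-8") for d in (header, {}, {}, content)]
    return [DELIMITER, sign(parts, key).encode("ascii"), *parts]


# Hypothetical session, key, and Scala snippet.
frames = build_execute_request("sc.parallelize(1 to 10).sum()", "session-1", b"secret")
print(len(frames), json.loads(frames[2])["msg_type"])
```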
How does it talk to Spark?
The kernel is launched by a spark-submit process. It works with local Spark, a standalone Spark cluster, and Spark on YARN.
SPARK_HOME is a mandatory environment variable.
SPARK_OPTS is an optional environment variable that can be used to configure the Spark master, deploy mode, driver memory, number of executors, etc.
It uses the same Scala interpreter as the Spark shell. The interpreter holds a SparkContext and the class server URI used to host compiled code.
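As a rough illustration of the two environment variables, the helper below prepares the environment a kernel launcher would inherit. The helper name, the install path, and the option values are assumptions; the options are ordinary spark-submit flags joined into one string.

```python
import os


def kernel_environment(spark_home, spark_opts=(), base_env=None):
    """Return the environment a kernel launcher script would inherit."""
    if not spark_home:
        raise ValueError("SPARK_HOME is mandatory")
    env = dict(os.environ if base_env is None else base_env)
    env["SPARK_HOME"] = spark_home
    if spark_opts:
        # SPARK_OPTS carries spark-submit style options as a single string.
        env["SPARK_OPTS"] = " ".join(spark_opts)
    return env


# Hypothetical install path and cluster options.
env = kernel_environment(
    "/opt/spark",
    ["--master", "yarn", "--deploy-mode", "client",
     "--driver-memory", "2g", "--num-executors", "4"],
    base_env={},
)
print(env["SPARK_OPTS"])
```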
How to communicate with Kernel
Two forms of communication:
1. Client library for code execution
2. Talking to the kernel directly, as the Jupyter notebook does
Kernel Client Library
Written in Scala. Eliminates the need to understand the ZeroMQ message protocol.
Enables treating the kernel as a remote service.
Shares the majority of its code with the kernel's codebase.
Two steps to using the client:
1. Initialize the client with the connection details of the kernel.
2. Use the execute API to run code snippets with attached callbacks.
How to run Kernel and Client:
https://github.com/ibm-et/spark-kernel/wiki/Getting-Started-with-the-Spark-Kernel
https://github.com/ibm-et/spark-kernel/wiki/Guide-for-the-Spark-Kernel-Client
CODE DEMO
Comm API
As part of the IPython message protocol, the Comm API allows developers to define custom messages to communicate data and perform actions on both the frontend (client) and the backend (kernel). This API is useful when the same action should be performed for every message of a given kind. Either the client or the kernel can start sending messages.
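The three Comm message types defined by the protocol (comm_open, comm_msg, comm_close) can be sketched as plain dictionaries. Only the message content is shown, without the header and signing a real message needs; the target name and payloads are hypothetical.

```python
import uuid


def comm_open(target_name, data=None):
    """Open a comm; the receiving side looks up a handler by target_name."""
    return {
        "msg_type": "comm_open",
        "content": {
            "comm_id": uuid.uuid4().hex,
            "target_name": target_name,
            "data": data or {},
        },
    }


def comm_msg(comm_id, data):
    """Send custom data over an already open comm."""
    return {"msg_type": "comm_msg", "content": {"comm_id": comm_id, "data": data}}


def comm_close(comm_id, data=None):
    """Close the comm from either side."""
    return {"msg_type": "comm_close", "content": {"comm_id": comm_id, "data": data or {}}}


# Hypothetical target name and payloads.
opened = comm_open("progress_updates", {"job": "wordcount"})
comm_id = opened["content"]["comm_id"]
update = comm_msg(comm_id, {"percent": 40})
closed = comm_close(comm_id)
print(update["msg_type"], update["content"]["data"])
```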
Livy
Livy is an open source REST interface for interacting with Spark. It supports executing code snippets in Python, Scala, and R.
It is currently used to power the Spark snippets of the Hadoop Notebook in Hue.
Multiple contexts are supported, either through multiple sessions or through multiple users sharing the same session.
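A session-based interaction with such a REST interface might look like the sketch below, which only builds the HTTP requests and does not send them. The /sessions and /sessions/{id}/statements paths and the "kind" and "code" fields follow Livy's documented REST API as understood here; the host, port, and session ID are assumptions.

```python
import json
import urllib.request


def livy_request(base_url, path, payload):
    """Build (but do not send) a JSON POST request for a Livy endpoint."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_session(base_url, kind="pyspark"):
    """Request a new interactive session (a Spark context on the server side)."""
    return livy_request(base_url, "/sessions", {"kind": kind})


def run_statement(base_url, session_id, code):
    """Submit a code snippet to an existing session."""
    return livy_request(base_url, f"/sessions/{session_id}/statements", {"code": code})


# Hypothetical server address and session ID.
base = "http://localhost:8998"
session_req = create_session(base)
statement_req = run_statement(base, 0, "sc.parallelize(range(10)).sum()")
print(statement_req.get_method(), statement_req.full_url)
```

Passing either request to urllib.request.urlopen would send it to a running server.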
LIVY CODE EXECUTION DEMO
To know more: https://github.com/cloudera/hue/tree/master/apps/spark/java
Spark Job Server
JobServer provides a REST interface for submitting and managing Spark jobs and jars.
It is intended to be run as one or more independent processes, either separate from the Spark cluster or within it. It works with Mesos as well as YARN.
It supports multiple SparkContexts and can run each SparkContext in its own forked JVM process. This is controlled by the config parameter spark.jobserver.context-per-jvm, which defaults to false for local development mode but is recommended to be set to true for production deployments.
It exposes APIs to upload jars, get contexts, run jobs, get data, configure contexts, etc.
It uses Spray, Akka actors, and Akka Cluster for separate contexts.
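The upload, context, and job APIs can be sketched by building the URLs a client would POST to. The routes and query parameters follow the examples in the spark-jobserver README as recalled here; the host, port, application name, context name, and job class are illustrative values, and nothing is sent over the network.

```python
from urllib.parse import urlencode


def jar_upload_url(base_url, app_name):
    """POST the jar bytes here to register them under app_name."""
    return f"{base_url}/jars/{app_name}"


def context_url(base_url, context_name, params=None):
    """POST here to create a named, long-running context."""
    query = f"?{urlencode(params)}" if params else ""
    return f"{base_url}/contexts/{context_name}{query}"


def job_submit_url(base_url, app_name, class_path, context=None, sync=False):
    """POST the job configuration here to start a job."""
    params = {"appName": app_name, "classPath": class_path}
    if context:
        params["context"] = context
    if sync:
        # Ask the server to hold the request open until the job finishes.
        params["sync"] = "true"
    return f"{base_url}/jobs?{urlencode(params)}"


# Hypothetical server, app, and context names.
base = "http://localhost:8090"
print(jar_upload_url(base, "test"))
print(context_url(base, "test-context", {"num-cpu-cores": 4, "memory-per-node": "512m"}))
print(job_submit_url(base, "test", "spark.jobserver.WordCountExample",
                     context="test-context", sync=True))
```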
JOB SERVER DEMO
To know more: https://github.com/spark-jobserver/spark-jobserver
spark example spark example spark examplespark examplespark examplespark example
ShidrokhGoudarzi1
 
Introduction to Apache Beam
Introduction to Apache BeamIntroduction to Apache Beam
Introduction to Apache Beam
Knoldus Inc.
 
A Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark PerformanceA Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark Performance
Tim Ellison
 
20180417 hivemall meetup#4
20180417 hivemall meetup#420180417 hivemall meetup#4
20180417 hivemall meetup#4
Takeshi Yamamuro
 
Big analytics meetup - Extended Jupyter Kernel Gateway
Big analytics meetup - Extended Jupyter Kernel GatewayBig analytics meetup - Extended Jupyter Kernel Gateway
Big analytics meetup - Extended Jupyter Kernel Gateway
Luciano Resende
 
Ad

Recently uploaded (20)

Data Analytics Overview and its applications
Data Analytics Overview and its applicationsData Analytics Overview and its applications
Data Analytics Overview and its applications
JanmejayaMishra7
 
定制学历(美国Purdue毕业证)普渡大学电子版毕业证
定制学历(美国Purdue毕业证)普渡大学电子版毕业证定制学历(美国Purdue毕业证)普渡大学电子版毕业证
定制学历(美国Purdue毕业证)普渡大学电子版毕业证
Taqyea
 
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
ThanushsaranS
 
Modern_Distribution_Presentation.pptx Aa
Modern_Distribution_Presentation.pptx AaModern_Distribution_Presentation.pptx Aa
Modern_Distribution_Presentation.pptx Aa
MuhammadAwaisKamboh
 
Medical Dataset including visualizations
Medical Dataset including visualizationsMedical Dataset including visualizations
Medical Dataset including visualizations
vishrut8750588758
 
Ch3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendencyCh3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendency
ayeleasefa2
 
Geometry maths presentation for begginers
Geometry maths presentation for begginersGeometry maths presentation for begginers
Geometry maths presentation for begginers
zrjacob283
 
Deloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit contextDeloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit context
Process mining Evangelist
 
computer organization and assembly language.docx
computer organization and assembly language.docxcomputer organization and assembly language.docx
computer organization and assembly language.docx
alisoftwareengineer1
 
Cleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdfCleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdf
alcinialbob1234
 
GenAI for Quant Analytics: survey-analytics.ai
GenAI for Quant Analytics: survey-analytics.aiGenAI for Quant Analytics: survey-analytics.ai
GenAI for Quant Analytics: survey-analytics.ai
Inspirient
 
ISO 9001_2015 FINALaaaaaaaaaaaaaaaa - MDX - Copy.pptx
ISO 9001_2015 FINALaaaaaaaaaaaaaaaa - MDX - Copy.pptxISO 9001_2015 FINALaaaaaaaaaaaaaaaa - MDX - Copy.pptx
ISO 9001_2015 FINALaaaaaaaaaaaaaaaa - MDX - Copy.pptx
pankaj6188303
 
VKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptxVKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptx
Vinod Srivastava
 
MASAkkjjkttuyrdquesjhjhjfc44dddtions.docx
MASAkkjjkttuyrdquesjhjhjfc44dddtions.docxMASAkkjjkttuyrdquesjhjhjfc44dddtions.docx
MASAkkjjkttuyrdquesjhjhjfc44dddtions.docx
santosh162
 
Defense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptxDefense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptx
Greg Makowski
 
Stack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptxStack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptx
binduraniha86
 
Molecular methods diagnostic and monitoring of infection - Repaired.pptx
Molecular methods diagnostic and monitoring of infection  -  Repaired.pptxMolecular methods diagnostic and monitoring of infection  -  Repaired.pptx
Molecular methods diagnostic and monitoring of infection - Repaired.pptx
7tzn7x5kky
 
Secure_File_Storage_Hybrid_Cryptography.pptx..
Secure_File_Storage_Hybrid_Cryptography.pptx..Secure_File_Storage_Hybrid_Cryptography.pptx..
Secure_File_Storage_Hybrid_Cryptography.pptx..
yuvarajreddy2002
 
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
James Francis Paradigm Asset Management
 
Customer Segmentation using K-Means clustering
Customer Segmentation using K-Means clusteringCustomer Segmentation using K-Means clustering
Customer Segmentation using K-Means clustering
Ingrid Nyakerario
 
Data Analytics Overview and its applications
Data Analytics Overview and its applicationsData Analytics Overview and its applications
Data Analytics Overview and its applications
JanmejayaMishra7
 
定制学历(美国Purdue毕业证)普渡大学电子版毕业证
定制学历(美国Purdue毕业证)普渡大学电子版毕业证定制学历(美国Purdue毕业证)普渡大学电子版毕业证
定制学历(美国Purdue毕业证)普渡大学电子版毕业证
Taqyea
 
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
ThanushsaranS
 
Modern_Distribution_Presentation.pptx Aa
Modern_Distribution_Presentation.pptx AaModern_Distribution_Presentation.pptx Aa
Modern_Distribution_Presentation.pptx Aa
MuhammadAwaisKamboh
 
Medical Dataset including visualizations
Medical Dataset including visualizationsMedical Dataset including visualizations
Medical Dataset including visualizations
vishrut8750588758
 
Ch3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendencyCh3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendency
ayeleasefa2
 
Geometry maths presentation for begginers
Geometry maths presentation for begginersGeometry maths presentation for begginers
Geometry maths presentation for begginers
zrjacob283
 
Deloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit contextDeloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit context
Process mining Evangelist
 
computer organization and assembly language.docx
computer organization and assembly language.docxcomputer organization and assembly language.docx
computer organization and assembly language.docx
alisoftwareengineer1
 
Cleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdfCleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdf
alcinialbob1234
 
GenAI for Quant Analytics: survey-analytics.ai
GenAI for Quant Analytics: survey-analytics.aiGenAI for Quant Analytics: survey-analytics.ai
GenAI for Quant Analytics: survey-analytics.ai
Inspirient
 
ISO 9001_2015 FINALaaaaaaaaaaaaaaaa - MDX - Copy.pptx
ISO 9001_2015 FINALaaaaaaaaaaaaaaaa - MDX - Copy.pptxISO 9001_2015 FINALaaaaaaaaaaaaaaaa - MDX - Copy.pptx
ISO 9001_2015 FINALaaaaaaaaaaaaaaaa - MDX - Copy.pptx
pankaj6188303
 
VKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptxVKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptx
Vinod Srivastava
 
MASAkkjjkttuyrdquesjhjhjfc44dddtions.docx
MASAkkjjkttuyrdquesjhjhjfc44dddtions.docxMASAkkjjkttuyrdquesjhjhjfc44dddtions.docx
MASAkkjjkttuyrdquesjhjhjfc44dddtions.docx
santosh162
 
Defense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptxDefense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptx
Greg Makowski
 
Stack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptxStack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptx
binduraniha86
 
Molecular methods diagnostic and monitoring of infection - Repaired.pptx
Molecular methods diagnostic and monitoring of infection  -  Repaired.pptxMolecular methods diagnostic and monitoring of infection  -  Repaired.pptx
Molecular methods diagnostic and monitoring of infection - Repaired.pptx
7tzn7x5kky
 
Secure_File_Storage_Hybrid_Cryptography.pptx..
Secure_File_Storage_Hybrid_Cryptography.pptx..Secure_File_Storage_Hybrid_Cryptography.pptx..
Secure_File_Storage_Hybrid_Cryptography.pptx..
yuvarajreddy2002
 
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
James Francis Paradigm Asset Management
 
Customer Segmentation using K-Means clustering
Customer Segmentation using K-Means clusteringCustomer Segmentation using K-Means clustering
Customer Segmentation using K-Means clustering
Ingrid Nyakerario
 

Interactive Analytics using Apache Spark

  • 1. Interactive Analytics Using Apache Spark
    Bangalore Spark Enthusiasts Group, http://www.meetup.com/Bangalore-Spark-Enthusiasts/
    Bagavath Subramaniam, IBM Analytics
    Shally Sangal, IBM Analytics
    © 2015 IBM Corporation
  • 2. Agenda
    ▪ Overview of Interactive Analytics
    ▪ Spark Application User Types
    ▪ Spark Context
    ▪ Spark Shell
    ▪ Spark Submit
    ▪ Spark JDBC Thrift Server
    ▪ Apache Zeppelin
    ▪ Jupyter
    ▪ Spark Kernel
    ▪ Spark Job Server
    ▪ Livy
  • 3. Spark Application User Types
    Data Scientist
    ▪ Data Exploration
    ▪ Data Wrangling
    ▪ Build Models from Data using Algorithms - Predict/Prescribe
    ▪ Knowledge in Statistics & Maths
    ▪ R/Python, Matlab/SPSS
    ▪ Ad-hoc analysis using Interactive Shells
    Data Analyst
    ▪ Data Exploration and Visualization
    ▪ Understands data sources and relationships among them in an Enterprise
    ▪ Relates data to business and derives insights, can talk business language
    ▪ May have basic programming skills and analytic tools knowledge
    ▪ Ad-hoc analysis using canned reports
    ▪ Limited usage of interactive shells
  • 4. Typical User Roles
    Business Analyst
    ▪ Industry Expert
    ▪ Understands business needs and works on solutions
    ▪ Improves business processes and designs new systems to support them
    ▪ Not a programmer / Analytics expert
    ▪ Typical user of reporting systems
    Data Engineer / Application Developer
    ▪ Programmer with S/W Engineering background
    ▪ Builds production data pipelines, data warehouses, reporting solutions and apps
    ▪ Productionizes models built by data scientists
    ▪ Builds s/w applications to solve business problems
    ▪ Maintains, monitors and tunes the data processing platform and applications
    > Roles are often fluid and overlapping
  • 5. Interactive Tools for Spark
    ▪ Apache Spark
    ▪ IBM Spark Kernel (Apache Toree)
    ▪ Cloudera Livy
    ▪ Ooyala Spark Job Server
  • 6. User and Tools
    [Diagram: primary set of tools for each role]
    ▪ Tools: Spark Shell, Spark Submit, Thrift JDBC Server, Zeppelin, Spark Kernel, Jupyter, Livy, Hue, Spark Job Server
    ▪ Roles: Data Scientist, Data Analyst, Developer, Business Analyst
  • 7. Spark Context
    ▪ Common thread for all Spark interfaces
    ▪ Main entry point for Spark; represents the connection to a Spark cluster
    ▪ Cluster managers: Standalone, YARN, Mesos, Local
    ▪ Holds all the configuration - memory, cores, parallelism, compression
    ▪ Creates RDDs, accumulators, broadcast variables
    ▪ Runs jobs, cancels jobs
    ▪ Limitation: one Spark Context per JVM, one application ID
    ▪ Supports parallel jobs from separate threads
    ▪ Scheduler mode - FIFO / Fair (within an application)
    ▪ Fair Scheduler
      − Pools - spark.scheduler.pool
      − Weights/Priorities, Scheduling Mode
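The fair scheduler pools mentioned on this slide are described in an XML allocation file. The following is a minimal sketch, using only the Python standard library, of generating such a file. The pool names `adhoc` and `reports` and their values are hypothetical; the element names (`allocations`, `pool`, `schedulingMode`, `weight`, `minShare`) follow the layout used in Spark's job scheduling documentation.

```python
import xml.etree.ElementTree as ET


def build_fair_scheduler_xml(pools):
    """Render a dict of pool settings as a fair scheduler allocation file."""
    root = ET.Element("allocations")
    for name, props in pools.items():
        pool = ET.SubElement(root, "pool", name=name)
        ET.SubElement(pool, "schedulingMode").text = props["schedulingMode"]
        ET.SubElement(pool, "weight").text = str(props["weight"])
        ET.SubElement(pool, "minShare").text = str(props["minShare"])
    return ET.tostring(root, encoding="unicode")


# Hypothetical pools: interactive queries get twice the weight of reports.
pools = {
    "adhoc": {"schedulingMode": "FAIR", "weight": 2, "minShare": 2},
    "reports": {"schedulingMode": "FIFO", "weight": 1, "minShare": 0},
}
allocations_xml = build_fair_scheduler_xml(pools)
print(allocations_xml)
```

Assuming a standard setup, the application would enable fair scheduling with `spark.scheduler.mode=FAIR`, point `spark.scheduler.allocation.file` at the generated file, and each thread would pick its pool by setting the local property `spark.scheduler.pool` on the Spark Context.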
  • 8. Spark Shell
    ▪ Interactive shell (spark-shell for Scala, pyspark for Python)
    ▪ spark-shell is based on the Scala REPL
    ▪ Instantiates a Spark Context by default, and also starts a UI
    ▪ Also provides sqlContext, which is a HiveContext
    ▪ Internally calls spark-submit:
      "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell"
    ▪ All parameters of spark-submit can be passed to spark-shell as well
  • 9. Spark Submit
    ▪ Launches/submits a Spark application to a Spark cluster
    ▪ org.apache.spark.launcher.Main gets called with org.apache.spark.deploy.SparkSubmit as a parameter, along with the other parameters passed to spark-submit:
      org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit "$@"
    ▪ spark-submit --help lists the supported parameters
    ▪ Kill a job (spark-submit --kill) and get job status (spark-submit --status)
    ▪ Defaults are read from spark-defaults.conf in SPARK_CONF_DIR
    ▪ Precedence (highest first): values set explicitly on SparkConf, then flags passed to spark-submit, then values from spark-defaults.conf
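The precedence rule on this slide can be illustrated with a small model. This is only a sketch of the lookup order, not Spark's own implementation; the function name `effective_conf` and the sample values are hypothetical.

```python
from collections import ChainMap


def effective_conf(spark_conf, submit_flags, defaults_file):
    """Merge the three sources; earlier arguments win over later ones."""
    return dict(ChainMap(spark_conf, submit_flags, defaults_file))


# Hypothetical values for each configuration source.
defaults_file = {
    "spark.master": "local[*]",
    "spark.executor.memory": "1g",
    "spark.eventLog.enabled": "true",
}
submit_flags = {"spark.master": "yarn", "spark.executor.memory": "2g"}
spark_conf = {"spark.executor.memory": "4g"}

resolved = effective_conf(spark_conf, submit_flags, defaults_file)
for key in sorted(resolved):
    print(key, "=", resolved[key])
```

Here `spark.executor.memory` comes from the SparkConf value, `spark.master` from the spark-submit flag, and `spark.eventLog.enabled` falls through to the defaults file.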
  • 10. Apache Zeppelin
    ▪ Web based notebook for interactive analytics.
    ▪ Provides built in Spark integration.
    ▪ Supports many interpreters such as Scala, PySpark, SparkSQL, Hive, Shell etc.
    ▪ It starts a Zeppelin server.
    ▪ Spawns one JVM per interpreter group.
    ▪ Server communicates with the interpreter group using Thrift.
  • 11. Zeppelin Demo
    To know more: http://zeppelin.incubator.apache.org/
  • 12. Jupyter Notebook
    ▪ Web notebook for interactive data analysis. Part of the Jupyter ecosystem.
    ▪ Evolved from IPython; works on the IPython messaging protocol.
    ▪ Has the concept of kernels - any language kernel that implements the protocol can be plugged in.
    ▪ A Spark kernel is available via Apache Toree.
  • 13. Jupyter Notebook Demo
    To know more: http://jupyter.org/
  • 14. Spark Kernel (Apache Toree)
    ▪ The kernel provides the foundation for interactive applications to connect to and use Spark.
    ▪ Provides an interface that allows clients to interact with a Spark cluster.
    ▪ Clients can send code snippets and libraries that are interpreted and run against a preconfigured SparkContext.
    ▪ Acts as a proxy between your application and the Spark cluster.
  • 15. Kernel Architecture
    ▪ The kernel uses ZeroMQ over TCP sockets as its messaging middleware and implements the IPython message protocol.
    ▪ It is architected in layers, where each layer has a specific purpose in the processing of requests.
    ▪ Provides concurrency and code isolation through the Akka framework.
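To make the message protocol concrete, here is a rough sketch of how a client could assemble the frames of an `execute_request`, following the publicly documented Jupyter/IPython wire format (delimiter, HMAC signature, header, parent header, metadata, content). The function name, the username, and the key value are hypothetical; in practice the signing key comes from the kernel's connection file, and the frames would be sent as a ZeroMQ multipart message, which is not shown here.

```python
import hashlib
import hmac
import json
import uuid
from datetime import datetime, timezone

DELIMITER = b"<IDS|MSG>"


def execute_request_frames(code, session_id, key):
    """Build the signed frames for one execute_request message."""
    header = {
        "msg_id": uuid.uuid4().hex,
        "username": "client",  # hypothetical user name
        "session": session_id,
        "date": datetime.now(timezone.utc).isoformat(),
        "msg_type": "execute_request",
        "version": "5.0",
    }
    content = {
        "code": code,
        "silent": False,
        "store_history": True,
        "user_expressions": {},
        "allow_stdin": False,
    }
    # Order matters: header, parent_header, metadata, content.
    parts = [json.dumps(p).encode("utf-8") for p in (header, {}, {}, content)]
    signature = hmac.new(key, digestmod=hashlib.sha256)
    for part in parts:
        signature.update(part)
    return [DELIMITER, signature.hexdigest().encode("ascii"), *parts]


frames = execute_request_frames(
    "sc.parallelize(1 to 10).sum()",
    session_id=uuid.uuid4().hex,
    key=b"hypothetical-shared-key",
)
print(len(frames), "frames, message type:", json.loads(frames[2])["msg_type"])
```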
  • 16. How does it talk to Spark?
    ▪ The kernel is launched by a spark-submit process.
    ▪ It works with local Spark, a Standalone Spark cluster, and Spark on YARN.
    ▪ SPARK_HOME is a mandatory environment variable.
    ▪ SPARK_OPTS is an optional environment variable that can be used to configure the Spark master, deploy mode, driver memory, number of executors etc.
    ▪ Uses the same Scala interpreter as the Spark shell.
    ▪ The interpreter holds a Spark Context and the class server URI used to host compiled code.
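As an illustration of the two environment variables, the sketch below prepares an environment for a process that launches the kernel. The helper name `kernel_environment`, the path `/opt/spark`, and the option values are hypothetical; the flags themselves are ordinary spark-submit options of the kind the slide lists.

```python
import os
import shlex


def kernel_environment(spark_home, spark_options):
    """Return a copy of the current environment with the Spark variables set."""
    env = dict(os.environ)
    env["SPARK_HOME"] = spark_home  # mandatory
    env["SPARK_OPTS"] = shlex.join(spark_options)  # optional
    return env


env = kernel_environment(
    "/opt/spark",  # hypothetical install location
    [
        "--master", "yarn",
        "--deploy-mode", "client",
        "--driver-memory", "2g",
        "--num-executors", "4",
    ],
)
print(env["SPARK_OPTS"])
```

The resulting dictionary could be passed as the `env` argument of `subprocess.Popen` when starting the kernel launcher.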
  • 17. How to communicate with Kernel
    Two forms of communication:
    1. Client library for code execution
    2. Talk to the kernel directly, as the Jupyter notebook does
  • 18. Kernel Client Library
    ▪ Written in Scala.
    ▪ Eliminates the need to understand the ZeroMQ message protocol.
    ▪ Enables treating the kernel as a remote service.
    ▪ Shares the majority of its code with the kernel's codebase.
    Two steps to using the client:
    1. Initialize the client with the connection details of the kernel.
    2. Use the execute API to run code snippets with attached callbacks.
  • 19. How to run Kernel and Client
    ▪ https://github.com/ibm-et/spark-kernel/wiki/Getting-Started-with-the-Spark-Kernel
    ▪ https://github.com/ibm-et/spark-kernel/wiki/Guide-for-the-Spark-Kernel-Client
    CODE DEMO
  • 20. Comm API
    ▪ As part of the IPython message protocol, the Comm API allows developers to specify custom messages to communicate data and perform actions on both the frontend (client) and the backend (kernel).
    ▪ This API is useful in scenarios where we want to perform the same actions for the messages.
    ▪ Either the client or the kernel can start sending messages.
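A small sketch may help picture the Comm API. The message shapes (`comm_open` with `comm_id`, `target_name`, `data`; `comm_msg` and `comm_close` with `comm_id`, `data`) follow the public Jupyter messaging documentation. The `CommRegistry` class and the `progress` target are hypothetical and only model how one side could route custom messages to a handler; they are not the Spark Kernel's actual classes.

```python
import uuid


def comm_open(target_name, data=None):
    """Content of a message that opens a new comm for a named target."""
    content = {"comm_id": uuid.uuid4().hex, "target_name": target_name, "data": data or {}}
    return {"msg_type": "comm_open", "content": content}


def comm_msg(comm_id, data):
    return {"msg_type": "comm_msg", "content": {"comm_id": comm_id, "data": data}}


def comm_close(comm_id, data=None):
    return {"msg_type": "comm_close", "content": {"comm_id": comm_id, "data": data or {}}}


class CommRegistry:
    """Hypothetical receiver: one handler per target, applied to every message."""

    def __init__(self):
        self._handlers = {}  # target_name -> callback
        self._open = {}      # comm_id -> target_name

    def register(self, target_name, handler):
        self._handlers[target_name] = handler

    def dispatch(self, message):
        kind, content = message["msg_type"], message["content"]
        if kind == "comm_open" and content["target_name"] in self._handlers:
            self._open[content["comm_id"]] = content["target_name"]
        elif kind == "comm_msg":
            target = self._open.get(content["comm_id"])
            if target is not None:
                self._handlers[target](content["data"])
        elif kind == "comm_close":
            self._open.pop(content["comm_id"], None)


received = []
registry = CommRegistry()
registry.register("progress", received.append)

opened = comm_open("progress")
comm_id = opened["content"]["comm_id"]
registry.dispatch(opened)
registry.dispatch(comm_msg(comm_id, {"percent": 50}))
registry.dispatch(comm_close(comm_id))
registry.dispatch(comm_msg(comm_id, {"percent": 100}))  # ignored: comm is closed
print(received)
```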
  • 21. Livy
    ▪ Livy is an open source REST interface for interacting with Spark.
    ▪ It supports executing code snippets in Python, Scala and R.
    ▪ It currently powers the Spark snippets of the Hadoop Notebook in Hue.
    ▪ Supports multiple contexts through multiple sessions, or multiple users sharing the same session.
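The REST interaction can be sketched as follows. The requests are only constructed, not sent. The base URL `http://localhost:8998`, the session id `0`, and the helper name `livy_request` are assumptions for illustration; the paths (`/sessions`, `/sessions/{id}/statements`) and JSON bodies follow Livy's documented REST API.

```python
import json
from urllib.request import Request

LIVY_URL = "http://localhost:8998"  # assumed host and port


def livy_request(path, payload=None):
    """Build a POST request when a JSON payload is given, otherwise a GET."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    return Request(
        LIVY_URL + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST" if payload is not None else "GET",
    )


# 1. Create an interactive Scala session.
create_session = livy_request("/sessions", {"kind": "spark"})
# 2. Run a code snippet in session 0 (the id is returned by step 1).
run_statement = livy_request("/sessions/0/statements", {"code": "sc.parallelize(1 to 10).sum()"})
# 3. Poll the result of statement 0.
get_result = livy_request("/sessions/0/statements/0")

for req in (create_session, run_statement, get_result):
    print(req.get_method(), req.full_url)
```

Passing any of these objects to `urllib.request.urlopen` would perform the call against a running Livy server.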
  • 22. Livy Code Execution Demo
    To know more: https://github.com/cloudera/hue/tree/master/apps/spark/java
  • 23. Spark Job Server
    ▪ JobServer provides a REST interface for submitting and managing Spark jobs/jars.
    ▪ It is intended to be run as one or more independent processes, either separate from the Spark cluster or within it.
    ▪ It works with Mesos as well as YARN.
    ▪ It supports multiple Spark Contexts and can run each SparkContext in its own forked JVM process. This is controlled by the config parameter spark.jobserver.context-per-jvm, which defaults to false for local development mode but is recommended to be set to true for production deployments.
    ▪ It exposes APIs to upload your jars, get contexts, run jobs, get data, configure contexts etc.
    ▪ It uses Spray, Akka actors, and Akka Cluster for separate contexts.
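The REST workflow described above (upload a jar, create a context, run a job, fetch its result) can be sketched by building the endpoint URLs. Nothing is sent over the network here. The base URL `http://localhost:8090`, the app name `test`, the context name `test-context` and the placeholder job id are assumptions; the routes and query parameters mirror the examples in the spark-jobserver README.

```python
from urllib.parse import urlencode

JOBSERVER_URL = "http://localhost:8090"  # assumed host and port


def endpoint(path, **params):
    """Join the base URL, a path, and optional query parameters."""
    query = "?" + urlencode(params) if params else ""
    return JOBSERVER_URL + path + query


# Each step is a (HTTP method, URL) pair.
steps = [
    # Upload the application jar under the name "test" (jar bytes go in the body).
    ("POST", endpoint("/jars/test")),
    # Create a long-running context that later jobs can share.
    ("POST", endpoint("/contexts/test-context", **{"num-cpu-cores": 4, "memory-per-node": "512m"})),
    # Run a job class from the uploaded jar in that context and wait for the result.
    ("POST", endpoint("/jobs", appName="test", classPath="spark.jobserver.WordCountExample",
                      context="test-context", sync="true")),
    # Look up a job later by id (placeholder id).
    ("GET", endpoint("/jobs/JOB_ID")),
]

for method, url in steps:
    print(method, url)
```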
  • 24. Job Server Demo
    To know more: https://github.com/spark-jobserver/spark-jobserver