Apache Spark - Installation

Installation
Martin Zapletal Cake Solutions
Apache Spark

Apache Spark and Big Data
1) History and market overview
2) Installation
3) MLlib and machine learning on Spark
4) Porting R code to Scala and Spark
5) Concepts - Core, SQL, GraphX, Streaming
6) Spark’s distributed programming model
7) Deployment

Table of Contents
● Spark architecture
● download, versions, install, startup
● Cluster managers
○ Local
○ Standalone
○ Mesos
○ YARN
● Spark shell
● Job deployment
● Streaming job deployment
● Integration with other tools
● after this session you should be able to install Spark, run a Spark cluster and deploy
basic jobs

Installation
● prebuilt packages for different versions of Hadoop, CDH (Cloudera’s
distribution of Hadoop) and MapR (MapR’s distribution of Hadoop)
○ prebuilt packages currently support Scala 2.10 only
● build from source
○ the build uses Maven (mvn); an sbt wrapper is also available
○ you need to specify the Hadoop version to build against
○ can be built with Scala 2.11 support

Spark architecture
● the cluster is persistent; users submit Jobs to it
● SparkContext (driver) contacts the Cluster Manager, which assigns cluster resources
● it then sends application code to the assigned Executors (distributing computation, not data!)
● finally it sends tasks to the Executors to run
● each master and worker runs a web UI that displays task progress and results
● each application (SparkContext) has its own executors (not shared), which live for the whole duration of the program
and run in separate JVMs using multiple threads
● Cluster Manager agnostic: Spark only needs to acquire executors and have them communicate with each other
Spark streaming
● architecture mostly similar to batch jobs
● Receiver components consume from the data source
● a Receiver sends information to the driver program, which then schedules tasks (discretized streams,
small batches) to run in the cluster
○ the number of assigned cores must be higher than the number of Receivers
● different job lifecycle
○ potentially unbounded
○ needs to be stopped by calling ssc.stop() on the StreamingContext
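The lifecycle described above can be sketched roughly as follows. This is a minimal illustration, not the talk's demo code: the object name, the `localMasterFor` helper, the `localhost:9999` text source and the 10-second timeout are all assumptions made for the example.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingLifecycleExample {
  // Hypothetical helper: one thread per Receiver plus one for processing batches,
  // so the number of cores is always higher than the number of Receivers.
  def localMasterFor(receivers: Int): String = s"local[${receivers + 1}]"

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("streaming-lifecycle-example")
      .setMaster(localMasterFor(receivers = 1))
    val ssc = new StreamingContext(conf, Seconds(1)) // 1-second batches

    // Assumed source: a text server on localhost:9999 (for example `nc -lk 9999`).
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.flatMap(_.split(" ")).count().print() // words per batch

    ssc.start()
    // The stream is unbounded, so this sketch stops itself after 10 seconds.
    ssc.awaitTerminationOrTimeout(10000L)
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}
```

With a single socket Receiver the helper yields `local[2]`; running with plain `local` would leave no thread free to process the received batches.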

SparkContext
● passing configuration
● accessing cluster
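● SparkContext is then used to create RDDs from input data
○ various sources
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
// batch job: configuration is passed to a SparkContext
val conf = new SparkConf().setAppName(appName).setMaster(master)
val sc = new SparkContext(conf)
// streaming job (alternative to the above): StreamingContext with 1-second batches
val streamingConf = new SparkConf().setAppName(appName).setMaster(master)
val ssc = new StreamingContext(streamingConf, Seconds(1))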

Spark architecture
● 5 deployment modes:
1. local
2. standalone
3. YARN
4. Mesos
5. Amazon EC2

Local mode
● for application development purposes, no cluster required
● local
○ Run Spark locally with one worker thread (i.e. no parallelism at all).
● local[K]
○ Run Spark locally with K worker threads (ideally, set this to the number of cores on your
machine).
● local[*]
○ Run Spark locally with as many worker threads as logical cores on your machine.
● example local
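As a rough sketch of the master strings above, the following runs a small job entirely inside one JVM. The object and method names are made up for this illustration; only the `local`, `local[K]` and `local[*]` master values come from the slide.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalModeExample {
  // Sums 1..100 using the given local master string,
  // e.g. "local", "local[4]" or "local[*]". No cluster is required.
  def sumInLocalMode(master: String): Long = {
    val conf = new SparkConf().setAppName("local-mode-example").setMaster(master)
    val sc = new SparkContext(conf)
    try {
      sc.parallelize(1 to 100).map(_.toLong).reduce(_ + _)
    } finally {
      sc.stop() // release the context so another one can be created in this JVM
    }
  }

  def main(args: Array[String]): Unit =
    println(sumInLocalMode("local[*]"))
}
```

The result does not depend on the number of worker threads; only the degree of parallelism changes.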

Standalone mode
● place a compiled version of Spark on each node
● deployment scripts
○ sbin/start-master.sh
○ sbin/start-slaves.sh
○ sbin/stop-all.sh
● various settings, e.g. port, web UI port, memory, cores, Java opts
● drivers use spark://HOST:PORT as master
● across applications, only a simple FIFO scheduler is supported
○ application-level or global configuration determines how many cores and how much memory each application gets
● resilient to Worker failures, but the Master is a single point of failure
● supports ZooKeeper for multiple Masters, with leader election and state recovery; running applications are unaffected
● alternatively, local filesystem recovery mode just restarts the Master if it goes down; single node only, works better with an external monitor
● example 2 start cluster
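A driver connecting to a standalone Master might be configured along these lines. This is a sketch under assumptions: the object name, the host name used in the usage note and the resource values are placeholders, while `spark.cores.max` and `spark.executor.memory` are the standard properties for capping an application's cores and executor memory.

```scala
import org.apache.spark.SparkConf

object StandaloneConfExample {
  // Builds a configuration for a standalone cluster at spark://HOST:PORT.
  // The resource limits are illustrative values, not recommendations.
  def standaloneConf(host: String, port: Int): SparkConf =
    new SparkConf()
      .setAppName("standalone-example")
      .setMaster(s"spark://$host:$port")
      .set("spark.cores.max", "4")        // cap on cores this application takes
      .set("spark.executor.memory", "2g") // memory per executor
}
```

For instance, `StandaloneConfExample.standaloneConf("master-host", 7077)` targets a Master on the default standalone port; because scheduling is FIFO, setting a core cap like this is what lets several applications run at the same time.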

YARN mode
● YARN = Yet Another Resource Negotiator
● decouples resource management and scheduling from the data processing framework
● specific to the Hadoop ecosystem
● requires a binary distribution of Spark built with YARN support
● uses the Hadoop configuration found via HADOOP_CONF_DIR or YARN_CONF_DIR
● master is set to either yarn-client or yarn-cluster

Mesos mode
● Mesos is a cluster operating system
● abstracts CPU, memory, storage and other resources, enabling fault-tolerant and elastic distributed systems
● can run Spark along with other applications (Hadoop, Kafka, ElasticSearch, Jenkins, ...) and manage resources
and scheduling across the whole cluster and all the applications
● Mesos master replaces Spark Master as Cluster Manager
● the Spark binary must be accessible by Mesos (config)
● mesos://HOST:PORT for single-master Mesos, or mesos://zk://HOST:PORT for multi-master Mesos using
ZooKeeper for failover
● In “fine-grained” mode (default), each Spark task runs as a separate Mesos task. This allows multiple instances
of Spark (and other frameworks) to share machines at a very fine granularity
● The “coarse-grained” mode will instead launch only one long-running Spark task on each Mesos machine, and
dynamically schedule its own “mini-tasks” within it.
● project Myriad
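The choice between the two Mesos modes is made through configuration. A minimal sketch, assuming the `spark.mesos.coarse` and `spark.executor.uri` properties of the Spark releases this talk covers; the object name, method name and parameters are hypothetical:

```scala
import org.apache.spark.SparkConf

object MesosConfExample {
  // masterUrl: mesos://HOST:PORT or mesos://zk://HOST:PORT
  // executorUri: location of a Spark binary package that Mesos slaves can fetch
  def mesosConf(masterUrl: String, executorUri: String, coarseGrained: Boolean): SparkConf =
    new SparkConf()
      .setAppName("mesos-example")
      .setMaster(masterUrl)
      .set("spark.executor.uri", executorUri)
      .set("spark.mesos.coarse", coarseGrained.toString) // false = fine-grained
}
```

Coarse-grained mode trades resource sharing for lower task start-up latency, so it tends to suit short, interactive jobs.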

Spark shell
● utility to connect to a cluster/local Spark
● no need to write a program
● constructs and provides a SparkContext
● similar to the Scala console
● example 3 shell

Job deployment
● client or cluster mode
● spark-submit script
● Spark driver program
● the same program can be written for either; only the deployment to the cluster differs

Spark submit script
● need to build and submit a jar with all dependencies (the dependencies
need to be available at worker nodes)
● all other jars need to be specified using --jars
● Spark and Hadoop dependencies can be marked as provided
● ./bin/spark-submit
○ --class <main class>
○ --master <master>
○ --deploy-mode <deploy mode>
○ --conf <key>=<value>
○ <application jar>
○ <application arguments>
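An application intended for spark-submit could look roughly like the sketch below. The object name, the `tokenize` helper and the single input-path argument are assumptions for illustration. The master is deliberately not set in code, so the value passed with --master applies.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountApp {
  // Splits a line on whitespace and drops empty tokens.
  def tokenize(line: String): Seq[String] =
    line.split("\\s+").filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    val inputPath = args(0) // first <application argument>
    // No setMaster here: the master comes from spark-submit's --master option.
    val conf = new SparkConf().setAppName("word-count-app")
    val sc = new SparkContext(conf)
    try {
      val counts = sc.textFile(inputPath).flatMap(tokenize).countByValue()
      counts.take(10).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Packaged into a jar, this object would be the value given to --class, and the input path would follow the application jar on the command line.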

Spark submit script
● automation is non-trivial
● need to build the application jar, have it available at the driver, submit the job with
arguments and collect the result
● a deployment pipeline is necessary

Spark driver program
● can be part of a Scala/Akka application and execute Spark jobs
● needs the Spark dependencies bundled; they cannot be marked as provided
● jars need to be specified using the .setJars() method
● running Spark applications, passing parameters and retrieving results is the
same as running any other code
● concerns: dependency management, versions, compatibility, jar size
● one SparkContext per JVM
● example 3 submit script
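Embedding the driver in an ordinary application might look like this sketch. The object and method names are hypothetical, and the jar list is whatever the host application needs shipped to the executors.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object EmbeddedDriverExample {
  // jars: paths of the application jars to ship to the executors
  def embeddedConf(master: String, jars: Seq[String]): SparkConf =
    new SparkConf()
      .setAppName("embedded-driver-example")
      .setMaster(master)
      .setJars(jars)

  // Parameters go in and results come out like any other method call.
  def countEven(conf: SparkConf, numbers: Seq[Int]): Long = {
    val sc = new SparkContext(conf) // only one SparkContext per JVM at a time
    try {
      sc.parallelize(numbers).filter(_ % 2 == 0).count()
    } finally {
      sc.stop()
    }
  }
}
```

A long-running service would more likely create one SparkContext at start-up and reuse it across jobs rather than open and stop one per call as this sketch does.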

Integration
● streaming
○ Kafka, Flume, Kinesis, Twitter, ZeroMQ, MQTT
● batch
○ HDFS, Cassandra, HBase, Amazon S3, …
○ text files, SequenceFiles, any Hadoop InputFormat
○ when loading a local file, the file must be present on every worker node
at the given path; either copy it to each node or use a distributed file system
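The point about local files can be illustrated with a small sketch. The object and method names are made up for this example; it runs in local mode, where driver and workers share one filesystem, so a temporary file is visible to every task.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object InputPathsExample {
  // Counts lines at a URI such as file:///data/input.txt or hdfs://HOST:PORT/data/input.txt.
  def lineCount(sc: SparkContext, uri: String): Long = sc.textFile(uri).count()

  // Writes the content to a temporary local file and counts its lines in local mode.
  def countLocalLines(content: String): Long = {
    val tmp = Files.createTempFile("spark-input", ".txt")
    Files.write(tmp, content.getBytes("UTF-8"))
    val conf = new SparkConf().setAppName("input-paths-example").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      lineCount(sc, tmp.toUri.toString) // a file:// URI
    } finally {
      sc.stop()
      Files.deleteIfExists(tmp)
    }
  }

  def main(args: Array[String]): Unit =
    println(countLocalLines("first\nsecond\nthird\n"))
}
```

On a real cluster the same file:// URI only works if the file exists at that path on each worker, which is why an hdfs:// or S3 location is usually the simpler choice.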

Conclusion
● getting started with Spark is relatively simple
● tools simplify development (console, local mode)
● cluster deployment is fragile and difficult to troubleshoot
● networking uses Akka remoting
