"Analytics Using Apache Spark": (Lightening Fast Cluster Computing)
"Analytics Using Apache Spark": (Lightening Fast Cluster Computing)
“Apache Spark™ is a fast and general engine for large-scale data processing”
Monitoring
Ref: http://www.storagereview.com/images/ramdiskarticle1.jpg
Spark sets a new record in Petabyte Sort
Ref: https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
Other Benchmarks
Ref: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7847709
Evolution
General Batch Processing (2004 – 2013): MapReduce
Specialized Systems (2007 – 2015?): Giraph, Pregel, Tez, Drill, Storm, Mahout, S4, Impala, GraphLab (iterative, interactive, ML, streaming, graph, SQL, etc.)
General Unified Engine (2014 – ?): Spark
Evolution…
(Timeline: from the MapReduce paper and Hadoop, through the Hadoop Summit, to Spark; see spark.apache.org)
Spark Components
Spark Core
o Provides distributed task dispatching,
scheduling, and basic I/O functionalities.
Spark SQL
o Lets you query structured data as a distributed
dataset (RDD) in Spark
Spark Streaming
o Allows scalable stream processing on micro-
batches of data
MLlib
o Scalable machine learning library
GraphX
o Allows custom iterative graph algorithms using Pregel API
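To give a flavour of the Streaming component, here is a minimal PySpark sketch of word counting over micro-batches (the socket source on localhost:9999 and the 5-second batch interval are assumptions for illustration):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StreamingWordCount")
ssc = StreamingContext(sc, 5)          # micro-batches every 5 seconds

# Assumed source: a text stream on localhost:9999 (e.g. fed by `nc -lk 9999`)
lines = ssc.socketTextStream("localhost", 9999)

counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                        # print the word counts of each micro-batch

ssc.start()
ssc.awaitTermination()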
Spark Core
o Provides distributed task dispatching,
scheduling, and basic I/O functionalities.
Spark SQL
o Lets you query structured data inside Spark
programs, using either SQL or a familiar
DataFrame API. Usable in Java, Scala, Python
and R.
o Runs SQL / HiveQL queries
Features:
o Integrated: Mix Spark programs with RDDs (Python, Scala and Java) and Spark SQL queries
o Unified Data Access: Query from different data sources such as Hive tables, JSON.
o Hive Compatibility: Runs unmodified Hive queries on existing warehouses.
o DB Connectivity: JDBC and ODBC
o Performance and Scalability: scales to thousands of nodes
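A minimal PySpark sketch of querying structured data with either the DataFrame API or SQL (the SparkSession API is Spark 2.x+; the people.json file and its columns are assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# Load structured data (assumed JSON file with "name" and "age" fields)
df = spark.read.json("people.json")

# DataFrame API
df.filter(df.age > 21).select("name", "age").show()

# The same query in SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 21").show()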
GraphX
o Spark API for graph-parallel computation.
o Introduces the Resilient Distributed Property Graph: a directed multigraph with properties attached to each vertex and edge.
o GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.
Scala
Python
Java
R
Spark Packages – Total 96 (as of 27-07-2015)
Spark Packages – Total 190 (as of 02-01-2016)
Spark Packages – Total 378 (as of 26-10-2017)
Spark Packages – Total 451 (as of 17-06-2019)
Hadoop or Spark?
Hadoop MapReduce
Image Source: http://tm.durusau.net/?cat=84
Spark vs. Hadoop
Spark is a better candidate for iterative jobs:
in Hadoop MapReduce, each stage passes through the hard drives, so iterative jobs involve a lot of disk I/O for each repetition and are very slow.
(Diagram: iteration in Hadoop MapReduce, which writes to disk between stages, vs. Apache Spark, which keeps intermediate data in memory.)
Spark vs. Hadoop MR – Programming Complexity
Word Count using MapReduce (Java):

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // ...
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    Job job = new Job(conf, "word count");
    // ...
  }
}

Word Count using Spark (PySpark):

from pyspark import SparkContext

logFile = "hdfs://localhost:9000/user/bigdatavm/input"

sc = SparkContext("spark://bigdata-vm:7077", "WordCount")

textFile = sc.textFile(logFile)

wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
Is Spark replacing Hadoop?
Enemies or frenemies?
They can work together even though there are fundamental differences.
If you want to use HDFS or YARN with Spark, it is better to start with Hadoop.
Installing Apache Spark
Installing Spark
Configuration options such as the host and port to bind to, the web UI port, and (for workers) the number of cores and amount of memory can be passed to the master and worker.
Cluster Launch Scripts
To launch a Spark standalone cluster with the launch scripts, you should create a file called conf/slaves in your Spark directory, which must contain the hostnames of all the machines where you intend to start Spark workers, one per line.
If conf/slaves does not exist, the launch scripts default to a single machine (localhost), which is useful for testing.
Note that the master machine accesses each of the worker machines via ssh.
By default, ssh is run in parallel and requires password-less access (using a private key) to be set up.
If you do not have a password-less setup, you can set the environment variable SPARK_SSH_FOREGROUND and serially provide a password for each worker.
Cluster Launch Scripts
sbin/start-master.sh - Starts a master instance on the machine the script is executed on.
sbin/start-slaves.sh - Starts a slave instance on each machine specified in the conf/slaves file.
sbin/start-slave.sh - Starts a slave instance on the machine the script is executed on.
sbin/start-all.sh - Starts both a master and a number of slaves as described above.
sbin/stop-master.sh - Stops the master that was started via the sbin/start-master.sh script.
sbin/stop-slaves.sh - Stops all slave instances on the machines specified in the conf/slaves file.
sbin/stop-all.sh - Stops both the master and the slaves as described above.
Note that these scripts must be executed on the machine you want to run the Spark master on, not your local machine.
Interactive Shells
Python:
$ pyspark --master local[4]
$ pyspark --master local[*] (uses as many worker threads as there are cores)
$ pyspark --master local[4] code.py
$ pyspark --master spark://IPAddressOfMaster:7077 code.py
Scala:
$ spark-shell (by default it uses only 1 thread, shared by both the driver and the program)
$ spark-shell --master local[4] --packages “org.example:example:0.1”
$ spark-shell --master local[*] --jars code.jar
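Inside these shells a SparkContext is already created for you as sc (and, in Spark 2.x and later, a SparkSession as spark), so you can experiment immediately; for example, in pyspark:

# sc is predefined by the pyspark shell; do not create another one
rdd = sc.parallelize(range(100), 4)
print(rdd.count())               # 100
print(sc.defaultParallelism)     # number of threads/cores given via --master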
Spark Master
The master parameter for Spark determines which type and size of cluster to use:

local[K, F]: Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine) and F maximum failures.
local[*]: Run Spark locally with as many worker threads as logical cores on your machine.
spark://HOST:PORT: Connect to the given Spark standalone cluster master. The port must be whichever one your master is configured to use, which is 7077 by default.
spark://HOST1:PORT1,HOST2:PORT2,…: Connect to the given Spark standalone cluster with standby masters with ZooKeeper. The list must have all the master hosts in the high-availability cluster set up with ZooKeeper. The port must be whichever each master is configured to use, which is 7077 by default.
yarn-cluster: Connect to a YARN cluster. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable.
Launching Applications
Run on a Spark standalone cluster
./bin/spark-submit \
--master spark://207.184.161.138:7077 \
--executor-memory 20G \
--total-executor-cores 100 \
/path/to/examples.jar \
1000
Spark Driver & Workers (with YARN)
Credits:
http://0x0fff.com/spark-architecture/
Spark Program Execution
A Spark program has two components:
Driver Component
Worker Component

Python:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)
Spark jargons
Application: User program built on Spark. Consists of a driver program and executors on the cluster.
Application jar: A jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries; those are added at runtime.
Driver program: The process running the main part of the application and creating the SparkContext.
Cluster manager: Acquires and schedules resources on the cluster (e.g. standalone manager, Mesos, YARN).
Worker node: Any node that can run application code in the cluster; this is where the executors run.
Executor: A process launched for an application on a worker node, which runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
Stage: Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you can see this in the driver's log.
Task: A unit of work that will be sent to one executor.
DAG: Directed Acyclic Graph; in the present context, a DAG of operators.
Resilient Distributed Datasets (RDDs)
What is RDD?
Definition: Resilient Distributed Dataset (RDD) is the primary data abstraction in Apache Spark and the core of Spark.
Spark Core Concept: Resilient Distributed Data
The primary abstraction in Spark.
It is a partitioned, fault-tolerant collection of data elements.
Can be operated on in parallel.
data = [1, 2, 3, 4, 5, 6, 7]
RDD1 = sc.parallelize(data, 4)
One important parameter for parallel collections is the number of partitions to cut the dataset into.
Spark will run one task for each partition of the cluster.
More partitions means more parallelism.
How data is partitioned?
data = [1, 2, 3, 4, 5, 6, 7]
RDD1 = sc.parallelize(data, 4)
Number of partitions = 4
P1: 1, 2
P2: 3, 4
P3: 5, 6
P4: 7
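One way to see the partitioning from PySpark is glom(), which turns each partition into a list (a small sketch; the exact split of elements across partitions may differ from the figure above, depending on how Spark slices the list):

data = [1, 2, 3, 4, 5, 6, 7]
RDD1 = sc.parallelize(data, 4)

print(RDD1.getNumPartitions())   # 4
print(RDD1.glom().collect())     # one list per partition, e.g. [[1], [2, 3], [4, 5], [6, 7]]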
By applying transformations you incrementally build an RDD lineage with all the parent RDDs of the final RDD(s).
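The lineage that Spark records can be inspected from PySpark with toDebugString(); a small sketch using an assumed chain of transformations:

rdd = sc.parallelize(range(10), 2)
mapped = rdd.map(lambda x: x * 2)
filtered = mapped.filter(lambda x: x > 5)

# Prints the chain of parent RDDs that Spark would use to recompute lost partitions
print(filtered.toDebugString())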
Transformations…
# Create sub function to subtract 1
# Args: one integer value; Returns: one integer value
def sub(value):
    return (value - 1)

# Create an RDD
data = [1, 2, 3]
RDD1 = sc.parallelize(data, 3)

# Create an RDD
data = [1, 2, 2, 3, 4, 4, 5, 6, 7]
RDD1 = sc.parallelize(data, 4)
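The filteredRDD used below is not defined on these slides; a minimal sketch of how it might be produced (the map/filter chain and the predicate are assumptions for illustration):

# Assumed transformations: subtract 1 from each element, then keep values below 5
subRDD = RDD1.map(sub)
filteredRDD = subRDD.filter(lambda x: x < 5)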
# The first 4 elements
print(filteredRDD.take(4))
(Diagram labels: Executor, RDD, Cluster, Partition, Stage, Task)
logLinesRDD (input/base RDD), with the records spread across four partitions:
Error, ts, msg1 | Info, ts, msg8 | Error, ts, msg3 | Error, ts, msg4
Warn, ts, msg2 | Warn, ts, msg2 | Info, ts, msg5 | Warn, ts, msg9
Error, ts, msg1 | Info, ts, msg8 | Info, ts, msg5 | Error, ts, msg1

errorsRDD (after a filter keeps only the Error records):
Error, ts, msg1 | Error, ts, msg3 | Error, ts, msg4

Transformation: .coalesce(2) shrinks errorsRDD into two partitions (cleanedRDD).
Action: .collect() returns the results to the Driver, which executes the DAG:
logLinesRDD → .filter(f(x)) → errorsRDD → .coalesce(2) → cleanedRDD → .collect() → Driver
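A PySpark sketch of this pipeline (the input path is an assumption, and f(x) is taken to be a predicate that keeps the Error records):

# Base/input RDD: one record per log line
logLinesRDD = sc.textFile("hdfs://localhost:9000/logs/app.log")

# Transformation: keep only the Error records
errorsRDD = logLinesRDD.filter(lambda line: line.startswith("Error"))

# Transformation: shrink to 2 partitions
cleanedRDD = errorsRDD.coalesce(2)

# Action: bring the results back to the driver; this triggers execution of the DAG
results = cleanedRDD.collect()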
Resource Managers
- Local
- Standalone
- YARN
- Mesos
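In practice the resource manager is chosen through the master URL handed to SparkConf (or to spark-submit); a small sketch with placeholder host names:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("ResourceManagerDemo")
        .setMaster("local[*]"))                  # Local: use all cores on this machine
# .setMaster("spark://master-host:7077")         # Standalone cluster
# .setMaster("yarn")                             # YARN (Spark 2.x+; older releases use yarn-client / yarn-cluster)
# .setMaster("mesos://mesos-host:5050")          # Mesos

sc = SparkContext(conf=conf)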
Local
CPUs, 3 options:
- local
- local[N]
- local[*]
(or SPARK_WORKER_CORES for standalone workers)
JVM: the Executor and the Driver run in the same JVM.
(Diagram: Workers (W) each run an Executor (Ex) that holds RDD partitions (P1, P2, …) and executes Tasks (T) on internal threads; the Driver connects to the Spark Master, which listens on port 7077 by default.)