Apache Spark is an open-source, general-purpose cluster computing framework that provides parallel, fault-tolerant data processing across clusters of commodity hardware. Originally developed at UC Berkeley, Spark is now maintained by the Apache Software Foundation. Spark uses Resilient Distributed Datasets (RDDs) as its main programming abstraction and includes components for streaming, SQL queries, machine learning, and graph processing, as illustrated by the sketch below.
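To make the RDD abstraction concrete, here is a minimal word-count sketch using Spark's core Scala API (SparkConf, SparkContext, parallelize, reduceByKey); the application name, master URL, and sample input are illustrative assumptions, not part of the original text.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Run locally on all cores; on a real cluster the master URL would differ (assumption).
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Build an RDD from an in-memory collection; Spark partitions it across workers.
    val lines = sc.parallelize(Seq("spark is fast", "spark is fault tolerant"))

    val counts = lines
      .flatMap(_.split(" "))   // split each line into words
      .map(word => (word, 1))  // pair each word with a count of 1
      .reduceByKey(_ + _)      // sum counts per word across partitions

    counts.collect().foreach(println)
    sc.stop()
  }
}
```

The transformations (flatMap, map, reduceByKey) are evaluated lazily; work is only distributed across the cluster when an action such as collect() is called, and lost partitions can be recomputed from the RDD's lineage, which is how Spark provides fault tolerance without replicating intermediate data.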