Apache Spark II (SparkSQL)

Contents
Introduction to Spark1
2
3
Spark modules
SparkSQL
4 Workshop

What is Apache Spark?
● Extends MapReduce
● Cluster computing platform
● Runs in memory

Fast
Easy of
development
Unified
Stack
Multi
Language
Support
Deployment
Flexibility
❏ Scala, python, java, R
❏ Deployment: Mesos, YARN, standalone, local
❏ Storage: HDFS, S3, local FS
❏ Batch
❏ Streaming
❏ 10x faster on disk
❏ 100x in memory
❏ Easy code
❏ Interactive shell
Why
Spark

Rise of the data center
Hugh amounts of data spread out
across many commodity servers
MapReduce
lots of data → scale out
Data Processing Requirements
Network bottleneck → Distributed Computing
Hardware failure → Fault Tolerance
Abstraction to organize parallelizable tasks
MapReduce
Abstraction to organize parallelizable tasks

MapReduce
Input Split Map [combine]
Suffle &
Sort
Reduce Output
AA BB AA
AA CC DD
AA EE DD
BB FF AA
AA BB AA
AA CC DD
AA EE DD
BB FF AA
(AA, 1)
(BB, 1)
(AA, 1)
(AA, 1)
(CC, 1)
(DD, 1)
(AA, 1)
(EE, 1)
(DD, 1)
(BB, 1)
(FF, 1)
(AA, 1)
(AA, 2)
(BB, 1)
(AA, 1)
(CC, 1)
(DD, 1)
(AA, 1)
(EE, 1)
(DD, 1)
(BB, 1)
(FF, 1)
(AA, 1)
(AA, 2)
(AA, 1)
(AA, 1)
(AA, 1)
(BB, 1)
(BB, 1)
(CC, 1)
(DD, 1)
(DD, 1)
(EE, 1)
(FF, 1)
(AA, 5)
(BB, 2)
(CC, 1)
(DD, 2)
(EE, 1)
(FF, 1)
AA, 5
BB, 2
CC, 1
DD, 2
EE, 1
FF, 1

Spark Components
Cluster Manager
Driver Program
SparkContext
Worker Node
Executor
Task Task
Worker Node
Executor
Task Task

Spark Components
SparkContext
● Main entry point for Spark functionality
● Represents the connection to a Spark cluster
● Tells Spark how & where to access a cluster
● Can be used to create RDDs, accumulators and
broadcast variables on that cluster
Driver program
● “Main” process coordinated by the
SparkContext object
● Allows to configure any spark process with
specific parameters
● Spark actions are executed in the Driver
● Spark-shell
● Application → driver program + executors
Driver Program
SparkContext

Spark Components
● External service for acquiring resources on the cluster
● Variety of cluster managers
○ Local
○ Standalone
○ YARN
○ Mesos
● Deploy mode:
○ Cluster → framework launches the driver inside of the cluser
○ Client → submitter launches the driver outside of the cluster
Cluster Manager

Spark Components
● Any node that can run application code in the cluster
● Key Terms
○ Executor: A process launched for an application on a worker node, that runs tasks and
keeps data in memory or disk storage across them. Each application has its own executors.
○ Task: Unit of work that will be sent to one executor
○ Job: A parallel computation consisting of multiple tasks that gets spawned in response to a
Spark action (e.g. save, collect)
○ Stage: smaller set of tasks inside any job
Worker Node
Executor
Task Task
Worker

RDD
Resilient Distributed Datasets
● Collection of objects that is distributed across
nodes in a cluster
● Data Operations are performed on RDD
● Once created, RDD are immutable
● RDD can be persisted in memory or on disk
● Fault Tolerant
numbers = RDD[1,2,3,4,5,6,7,8,9,10]
Worker Node
Executor
[1,5,6,9]
Worker Node
Executor
[2,7,8]
Worker Node
Executor
[3,4,10]

MLlib
● Classification: logistic regression, naive Bayes,...
● Regression: generalized linear regression, survival regression,...
● Decision trees, random forests, and gradient-boosted trees
● Recommendation: alternating least squares (ALS)
● Clustering: K-means, Gaussian mixtures (GMMs),...
● Topic modeling: latent Dirichlet allocation (LDA)
● Frequent itemsets, association rules, and sequential pattern mining
ML Algorithms Include

Spark SQL
● Integrated: Query data stored in RDDs. Languages: Python, Scala, Java, R.
● Unified data access: Parquet, JSON, CSV, Hive tables
● Apache Hive compatibility.
● Standard connectivity: JDBC, ODBC.
● Scalability
Features

DataFrame
Column 1 Column 2 Column 3 ... Column N
Column 1 Column 2 Column 3 ... Column N

DataFrame
● Ability to process the data in the size of Kilobytes to Petabytes on a single
node cluster to large cluster.
● Different data formats (JSON, Csv, Elastic Search, ...) and storage systems
(HDFS, HIVE tables, Oracle, ...)
● Easily integrated with others Big Data tools (Spark-Core).
● API for Python, Java, Scala, and R.
Features

WORKSHOP
In order to practice the main concepts, please complete the exercises
proposed at our Github repository by clicking the following link:
○ Homework

THANKS!
Any questions?
@datiobddatio-big-data
Special thanks to Stratio for its theoretical contribution
academy@datiobd.com

Apache Spark II (SparkSQL)

Recommended

More Related Content

What's hot (20)

Similar to Apache Spark II (SparkSQL) (20)

More from Datio Big Data (17)

Recently uploaded (20)

Apache Spark II (SparkSQL)