Intro to Spark with Zeppelin

Robert Hryniewicz
Data Evangelist
@RobHryniewicz

© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Spark Background

What is Spark?
 Apache Open Source Project - originally developed at AMPLab (University of California, Berkeley)
 Data Processing Engine - focused on in-memory distributed computing use-cases
 API - Scala, Python, Java and R

Spark Ecosystem
 Spark Core
 Built on Spark Core: Spark SQL, Spark Streaming, MLlib, GraphX

Why Spark?
 Elegant Developer APIs
– Single environment for data munging and Machine Learning (ML)
 In-memory computation model – Fast!
– Effective for iterative computations and ML
 Machine Learning
– Implementation of distributed ML algorithms
– Pipeline API (Spark ML)

History of Hadoop & Spark

Apache Spark Basics

Spark Context
What is it?
 Main entry point for Spark functionality
 Represents a connection to a Spark cluster
 Represented as sc in your code
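In Zeppelin or spark-shell the context is already created and bound to sc. For a standalone program, the sketch below shows one way a context might be built by hand. It assumes Spark 1.6 on the classpath and a local master; the object name, application name, and the sumTo helper are hypothetical and only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkContextSketch {
  // Sums 1..n on a local Spark "cluster"; the names here are illustrative only.
  def sumTo(n: Int): Double = {
    val conf = new SparkConf()
      .setAppName("spark-context-sketch") // hypothetical application name
      .setMaster("local[2]")              // two local threads stand in for a cluster
    val sc = new SparkContext(conf)       // the connection to the cluster
    try sc.parallelize(1 to n).sum()      // distribute the range, then add it up
    finally sc.stop()                     // always release the context
  }

  def main(args: Array[String]): Unit =
    println(sumTo(100))
}
```

Do not run this inside a Zeppelin paragraph: a context already exists there, and starting a second one in the same JVM would normally fail.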

RDD - Resilient Distributed Dataset
 Primary abstraction in Spark
– An immutable collection of objects (or records, or elements) that can be operated on in parallel
 Distributed
– Collection of elements partitioned across nodes in a cluster
– Each RDD is composed of one or more partitions
– User can control the number of partitions
– More partitions => more parallelism
 Resilient
– Recover from node failures
– An RDD keeps its lineage information -> it can be recreated from parent RDDs
 Created by starting with a file in Hadoop Distributed File System (HDFS) or an existing collection in the driver program
 May be persisted in memory for efficient reuse across parallel operations (caching)
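The properties above can be tried out in a few lines. This is a minimal sketch that assumes sc is the SparkContext provided by Zeppelin or spark-shell; the variable names are made up for the example.

```scala
// Assumes `sc` is the SparkContext that Zeppelin or spark-shell provides.
import org.apache.spark.rdd.RDD

// Create an RDD from a driver-side collection, asking for 4 partitions.
val numbers: RDD[Int] = sc.parallelize(1 to 10, 4)
val partitionCount = numbers.partitions.length

// Transformations are lazy and return new RDDs; `numbers` itself never changes.
val multiplesOfThree: RDD[Int] = numbers.map(_ * 2).filter(_ % 3 == 0)

// Keep the result in memory so later actions reuse it instead of recomputing.
multiplesOfThree.cache()

// The lineage Spark would replay to rebuild lost partitions.
println(multiplesOfThree.toDebugString)

// An action triggers the computation and returns the results to the driver.
val result: Array[Int] = multiplesOfThree.collect()
```

The printed lineage lists the parent RDDs that multiplesOfThree was derived from, which is the information Spark relies on to recover after a node failure.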

RDD – Resilient Distributed Dataset
Partition
1
Partition
2
Partition
3
RDD 2
Partition
1
Partition
2
Partition
3
Partition
4
RDD 1
Cluster
Nodes

Spark SQL

Spark SQL Overview
 Spark module for structured data processing (e.g. DB tables, JSON files)
 Three ways to manipulate data:
– DataFrames API
– SQL queries
– Datasets API
 Same execution engine for all three
 Spark SQL interfaces provide more information about both the structure of the data and the computation being performed than the basic Spark RDD API
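To make the "three ways, one engine" point concrete, the sketch below runs the same filter through each interface. It assumes Spark 1.6 and the sc provided by Zeppelin or spark-shell; the Flight case class and the flights_sketch table name are hypothetical, and the three rows are borrowed from the flight-delay output shown elsewhere in this deck.

```scala
// Assumes Spark 1.6 and the `sc` provided by Zeppelin or spark-shell.
import org.apache.spark.sql.SQLContext

// Hypothetical record type for illustration; not part of the original slides.
case class Flight(origin: String, dest: String, depDelay: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val flights = Seq(
  Flight("IAD", "TPA", 8),
  Flight("IAD", "TPA", 19),
  Flight("IND", "BWI", 34)
)

// 1. DataFrames API
val df = sc.parallelize(flights).toDF()
val viaDataFrame = df.filter($"depDelay" > 15).count()

// 2. SQL query over the same data
df.registerTempTable("flights_sketch")
val viaSql = sqlContext.sql("SELECT * FROM flights_sketch WHERE depDelay > 15").count()

// 3. Datasets API: typed objects, with field access checked at compile time
val ds = df.as[Flight]
val viaDataset = ds.filter(_.depDelay > 15).count()
```

All three counts should agree, since each query is planned and executed by the same engine.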

DataFrames
 Conceptually equivalent to a table in relational DB or data frame in R/Python
 API available in Scala, Java, Python, and R
 Richer optimizations (significantly faster than RDDs)
 Distributed collection of data organized into named columns
 Underneath is an RDD

DataFrames
Created from Various Sources
 DataFrames from HIVE:
– Reading and writing HIVE tables, including ORC
 DataFrames from files:
– Built-in: JSON, JDBC, ORC, Parquet, HDFS
– External plug-in: CSV, HBASE, Avro
 DataFrames from existing RDDs
– with the toDF() function
Data is described as a DataFrame with rows, columns and a schema
[Figure: sources (CSV, Avro, HIVE, Spark SQL, Text) feeding a DataFrame (with RDD underneath) made of rows and columns Col1, Col2, …, ColN]
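A short round trip shows two of these sources in action: an existing RDD and the built-in file formats. This is a sketch under assumptions: Spark 1.6, the sc provided by Zeppelin or spark-shell, and a local-mode run where a local temporary directory is reachable by the executors. The directory prefix and file names are made up; a real job would point at HDFS paths instead.

```scala
// Assumes Spark 1.6, local mode, and the `sc` provided by Zeppelin or spark-shell.
import java.nio.file.Files
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// From an existing RDD: toDF() names the columns of an RDD of tuples.
val delaysRdd = sc.parallelize(Seq(("IAD", "TPA", 8), ("IND", "BWI", 34)))
val fromRdd = delaysRdd.toDF("Origin", "Dest", "DepDelay")

// Scratch location on the local file system (hypothetical; illustration only).
val scratch = Files.createTempDirectory("df-sources-sketch")
  .toUri.toString.stripSuffix("/")

// Built-in file formats: write the DataFrame out, then read it back.
fromRdd.write.json(s"$scratch/flights.json")
fromRdd.write.parquet(s"$scratch/flights.parquet")
val fromJson = sqlContext.read.json(s"$scratch/flights.json")
val fromParquet = sqlContext.read.parquet(s"$scratch/flights.parquet")

// Each reader recovers a schema along with the rows.
fromParquet.printSchema()
```

Parquet stores the schema with the data, while the JSON reader infers it by scanning the records.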

SQL Context and Hive Context
SQLContext
 Entry point into all functionality in Spark SQL
 All you need is SparkContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
HiveContext
 Superset of functionality provided by basic SQLContext
– Read data from Hive tables
– Access to Hive functions (UDFs)
 Use when your data resides in Hive
val hc = new org.apache.spark.sql.hive.HiveContext(sc)

Spark SQL Examples

DataFrame Example
Reading Data From Table
val df = sqlContext.table("flightsTbl")
df.select("Origin", "Dest", "DepDelay").show(5)
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|       8|
|   IAD| TPA|      19|
|   IND| BWI|       8|
|   IND| BWI|      -4|
|   IND| BWI|      34|
+------+----+--------+

DataFrame Example
Using DataFrame API to Filter Data (show delays more than 15 min)
import sqlContext.implicits._ // enables the $"column" syntax
df.select("Origin", "Dest", "DepDelay").filter($"DepDelay" > 15).show(5)
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+

SQL Example
Using SQL to Query and Filter Data (again, show delays more than 15 min)
// Register Temporary Table
df.registerTempTable("flights")
// Use SQL to Query Dataset (triple quotes allow a multi-line string)
sqlContext.sql("""SELECT Origin, Dest, DepDelay
FROM flights
WHERE DepDelay > 15 LIMIT 5""").show()
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+

RDD vs. DataFrame

RDDs vs. DataFrames
RDD
 Lower-level API (more control)
 Lots of existing code & users
 Compile-time type-safety
DataFrame
 Higher-level API (faster development)
 Faster sorting, hashing, and serialization
 More opportunities for automatic optimization
 Lower memory pressure

Data Frames are Intuitive
Find average age by department?
dept  name      age
Bio   H Smith   48
CS    A Turing  54
Bio   B Jones   43
Phys  E Witten  61
[Figure: RDD Example and Equivalent Data Frame Example, shown as two code panels]
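The code from the slide's two panels is not reproduced in this text, so the following is only a sketch of how the two versions might look, not the original code. It assumes Spark 1.6 and the sc provided by Zeppelin or spark-shell; the variable names are made up, and the data is the four-row table from the slide.

```scala
// Assumes Spark 1.6 and the `sc` provided by Zeppelin or spark-shell.
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.avg

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val people = sc.parallelize(Seq(
  ("Bio", "H Smith", 48),
  ("CS", "A Turing", 54),
  ("Bio", "B Jones", 43),
  ("Phys", "E Witten", 61)
))

// RDD version: carry (sum, count) pairs by hand, then divide.
val avgAgeRdd: Map[String, Double] = people
  .map { case (dept, _, age) => (dept, (age, 1)) }
  .reduceByKey { case ((sum1, n1), (sum2, n2)) => (sum1 + sum2, n1 + n2) }
  .mapValues { case (sum, n) => sum.toDouble / n }
  .collect()
  .toMap

// DataFrame version: state the aggregation and let Spark plan it.
val avgAgeDf: Map[String, Double] = people
  .toDF("dept", "name", "age")
  .groupBy("dept")
  .agg(avg("age"))
  .collect()
  .map(row => row.getString(0) -> row.getDouble(1))
  .toMap
```

Both maps hold the same averages. The RDD version spells out how to compute them, while the DataFrame version only states what is wanted.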

Spark SQL Optimizations
 Spark SQL uses an underlying optimization engine (Catalyst)
– Catalyst can perform intelligent optimization since it understands the schema
 Spark SQL materializes only the columns a query needs, not every column as with an RDD
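One way to see what Catalyst does with a query is to ask a DataFrame for its plans. The sketch below assumes Spark 1.6 and the sc provided by Zeppelin or spark-shell; the sample rows are borrowed from the flight-delay output shown elsewhere in this deck, and the variable names are made up.

```scala
// Assumes Spark 1.6 and the `sc` provided by Zeppelin or spark-shell.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val flights = sc.parallelize(Seq(
  ("IAD", "TPA", 8), ("IAD", "TPA", 19), ("IND", "BWI", 34)
)).toDF("Origin", "Dest", "DepDelay")

val delayed = flights.filter($"DepDelay" > 15).select("Origin", "Dest")

// Prints the parsed, analyzed, and optimized logical plans plus the physical plan.
delayed.explain(true)

// The optimized plan is also available as an object for closer inspection.
val optimizedPlan = delayed.queryExecution.optimizedPlan
```

The exact plan text varies between Spark versions, so treat the output as something to read and compare, not to match character for character.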

Apache Zeppelin & HDP Sandbox

Apache Zeppelin
 Web-based Notebook for interactive analytics
 Use Cases
– Data exploration and discovery
– Visualization
– Interactive snippet-at-a-time experience
– “Modern Data Science Studio”
 Features
– Deeply integrated with Spark and Hadoop
– Supports multiple language backends
– Pluggable “Interpreters”
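As an illustration of the snippet-at-a-time workflow, the block below is what the body of one Zeppelin paragraph might contain. It assumes the paragraph is bound to the Spark interpreter (typically selected by starting the paragraph with %spark) and that Zeppelin has already provided sc and sqlContext; the table name flights_sketch is hypothetical.

```scala
// Body of a Zeppelin paragraph bound to the Spark interpreter.
// Zeppelin provides `sc` and `sqlContext`; nothing is constructed here.
import sqlContext.implicits._

val flights = sc.parallelize(Seq(
  ("IAD", "TPA", 8), ("IAD", "TPA", 19), ("IND", "BWI", 34)
)).toDF("Origin", "Dest", "DepDelay")

// Registering a temporary table makes the data visible to other paragraphs
// that share this Spark interpreter, such as a %sql paragraph.
flights.registerTempTable("flights_sketch")
```

A following paragraph that starts with %sql could then run SELECT Origin, Dest, DepDelay FROM flights_sketch WHERE DepDelay > 15, and Zeppelin would render the result as a table or chart.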

What’s not included with Spark?
 Resource Management
 Storage
 Applications
[Figure: Spark stack: Scala, Java, and Python libraries; MLlib (machine learning), Spark SQL*, and Spark Streaming* on top of the Spark Core Engine]

HDP Sandbox
What’s included in the Sandbox?
 Zeppelin
 Latest Hortonworks Data Platform (HDP)
– Spark
– YARN: Resource Management
– HDFS: Distributed Storage Layer
– And many more components...
[Figure: Sandbox stack: Scala, Java, Python, and R APIs over Spark SQL, Spark Streaming, MLlib, and GraphX on the Spark Core Engine, running on YARN over HDFS nodes 1 to N]

Access patterns enabled by YARN
 Batch: needs to happen, but with no timeframe limitations
 Interactive: needs to happen at human time
 Real-Time: needs to happen at machine execution time
[Figure: Batch, Interactive, and Real-Time applications on YARN (Data Operating System) over HDFS (Hadoop Distributed File System) nodes 1 to N]

Why Spark on YARN?
 Utilize existing HDP cluster infrastructure
 Resource management
– share Spark workloads with other workloads like PIG, HIVE, etc.
 Scheduling and queues
[Figure: Client, Spark Driver, and Spark Application Master (in a YARN container) coordinating three Spark Executors, each in its own YARN container running two tasks]

Why HDFS?
Fault Tolerant Distributed Storage
• Divide files into big blocks and distribute 3 copies randomly across the cluster
• Processing Data Locality
• Not Just storage but computation
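The block-and-replica arithmetic can be worked through in a few lines. This is a sketch under stated assumptions: the 128 MB block size is an assumed value (the slide only says "big blocks"), the replication factor of 3 comes from the slide, and the object and method names are hypothetical.

```scala
object HdfsBlockSketch {
  val BlockSizeBytes: Long = 128L * 1024 * 1024 // assumed block size (128 MB)
  val Replication: Int = 3                      // "3 copies" from the slide

  // Number of blocks a file is divided into (the last block may be partial).
  def blockCount(fileSizeBytes: Long): Long =
    (fileSizeBytes + BlockSizeBytes - 1) / BlockSizeBytes

  // Total block replicas stored across the cluster.
  def replicaCount(fileSizeBytes: Long): Long =
    blockCount(fileSizeBytes) * Replication

  def main(args: Array[String]): Unit = {
    val oneGiB = 1024L * 1024 * 1024
    println(s"blocks=${blockCount(oneGiB)} replicas=${replicaCount(oneGiB)}")
  }
}
```

Under these assumptions a 1 GiB file becomes 8 blocks and 24 stored replicas, which gives the scheduler several nodes to choose from when it tries to run a task next to its data.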
[Figure: A logical file divided into blocks 1 to 4, with three copies of each block spread across the cluster]

There’s more to HDP
Hortonworks Data Platform 2.4.x
 GOVERNANCE & INTEGRATION
– Data Lifecycle & Governance: Falcon, Atlas
– Data Workflow: Sqoop, Flume, Kafka, NFS, WebHDFS
 DATA ACCESS (on YARN: Data Operating System)
– Batch: MapReduce
– Script: Pig
– Search: Solr
– SQL: Hive
– NoSQL: HBase, Accumulo, Phoenix
– Stream: Storm
– In-memory
– Others: ISV Engines
– Tez, Slider
 SECURITY
– Administration, Authentication, Authorization, Auditing, Data Protection: Ranger, Knox, Atlas, HDFS Encryption
 OPERATIONS
– Provisioning, Managing, & Monitoring: Ambari, Cloudbreak, Zookeeper
– Scheduling: Oozie
 DATA MANAGEMENT
– HDFS (Hadoop Distributed File System), nodes 1 to N
 Deployment Choice: Linux, Windows, On-Premise, Cloud

Hortonworks Community Connection

community.hortonworks.com

HCC DS, Analytics, and Spark Related Questions Sample

Lab Preview

Link to Tutorials with Lab Instructions
http://tinyurl.com/hwx-intro-to-spark

Thank you!
