Apache Spark and Online Analytics

Spark and Online Analytics
Sudarshan Kadambi
Copyright 2016 Bloomberg L.P. All rights reserved.

Agenda
• Data and Analytics at Bloomberg
• The role of Spark
• The Bloomberg Spark Server
• Spark for Online use-cases

Data and Analytics is our Business

Analytics at Bloomberg
• Human-time, interactive analytics
• Scalability
– Handle increasingly sophisticated client analytic workflows
– Ad-hoc and cross-domain aggregations, filtering
• Heterogeneous data stores
– Analytics often requires data from multiple stores
• Low-latency updates, in addition to queries

Spark for Bloomberg Analytics
• Distributed compute scales well for
– large security universes and
– multi-universe cross-domain queries
• Abstract away heterogeneous data sources and present a consistent interface
for efficient data access
– Spark as a tool for systems integration
• Connectors and primitives to deal with incoming streams
• Cache intermediate compute for fast queries

Spark as a Service?
• Stand-alone Spark Apps on isolated clusters pose
challenges:
– Redundancy in:
» Crafting and Managing of RDDs/DFs
» Coding of the same or similar types of transforms/actions
– Management of clusters, replication of data, etc.
– Analytics are confined to specific content sets making
Cross-Asset Analytics much harder
– Need to handle Real-time ingestion in each App
[Diagram: Spark Apps, Spark Clusters, and a Spark Server]

Bloomberg Spark Server
[Diagram: Spark Server components: Spark Context, Request Handler, Request Processors, MDF Registry, Function Transform Registry (FTR), Ingestion Manager, MDF1, MDF2, RSI …]

Spark Server: Content Caching
• Data access has long tail characteristics
• High value data sub-setted within Spark
• Specified as a filter predicate at time of registration
• Seamless unification of data in Spark and backing store
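
The routing implied by the bullets above can be sketched in plain Python. This is a minimal illustration, not the Spark Server's actual code: `CachedSubset`, `query`, the in-memory `store` list, and the sample rows are all hypothetical. The cached subset is defined by a predicate given at registration; a query is answered from the cache for rows that satisfy that predicate and from the backing store for the long tail.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class CachedSubset:
    """Hypothetical stand-in for a dataset registered with a filter predicate."""

    def __init__(self, backing_store: List[Row], predicate: Callable[[Row], bool]):
        self.backing_store = backing_store
        self.predicate = predicate
        # High-value rows are sub-setted into the cache at registration time.
        self.cache = [row for row in backing_store if predicate(row)]
        self.cache_hits = 0
        self.store_reads = 0

    def query(self, wanted: Callable[[Row], bool]) -> List[Row]:
        """Return matching rows, unifying cached rows with the long tail."""
        hot = [row for row in self.cache if wanted(row)]
        self.cache_hits += len(hot)
        # Rows outside the registered predicate live only in the backing store.
        cold = [row for row in self.backing_store
                if wanted(row) and not self.predicate(row)]
        self.store_reads += len(cold)
        return hot + cold


# Made-up sample data: two high-volume rows and one long-tail row.
store = [
    {"ticker": "AAA", "volume": 900},
    {"ticker": "BBB", "volume": 15},
    {"ticker": "CCC", "volume": 700},
]
subset = CachedSubset(store, predicate=lambda r: r["volume"] > 500)
result = subset.query(lambda r: r["ticker"] in {"AAA", "BBB"})
```

Here the caller never sees which rows came from the cache and which from the backing store; only the hit counters reveal the split.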

Spark HA: State of the World
– Execution lineage in Driver
• Recovery from lost RDDs
– RDD Replication
• Low latency, even with lost executors
• Support for “MEMORY_ONLY”, “MEMORY_ONLY_2”, “MEMORY_ONLY_SER”,
“MEMORY_ONLY_SER_2” modes for in-memory persistence. Easily extensible to
more replicas if needed.
– Speculative execution
• Minimizing performance hit from stragglers
– Off-heap data
• Minimizing GC stalls
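
As a rough illustration of how the knobs above are usually switched on, the sketch below collects the relevant settings in a plain Python dict and renders them as `spark-submit --conf` arguments. The property names `spark.speculation`, `spark.memory.offHeap.enabled`, and `spark.memory.offHeap.size` are standard Spark configuration keys; the values, the helper `to_submit_args`, and the choice of settings are assumptions, since the slides do not say what the deployment actually used.

```python
# Illustrative values only; tune for the actual deployment.
ha_conf = {
    "spark.speculation": "true",             # re-launch slow (straggler) tasks
    "spark.memory.offHeap.enabled": "true",  # keep data off the JVM heap
    "spark.memory.offHeap.size": "4g",       # assumed size; must be positive when enabled
}

# Storage levels named on the slide; the "_2" suffix means two replicas.
replicas_per_level = {
    "MEMORY_ONLY": 1,
    "MEMORY_ONLY_2": 2,
    "MEMORY_ONLY_SER": 1,
    "MEMORY_ONLY_SER_2": 2,
}


def to_submit_args(conf):
    """Render a conf dict as spark-submit `--conf key=value` arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


submit_args = to_submit_args(ha_conf)
```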

Spark Architecture
[Diagram: Driver with RPC Environment and BlockManagerMasterEndpoint; Executor-1 and Executor-2, each with RPC Environment, BlockManagerMaster, and BlockManager; RDD-0 held as RDDBlock(0, Partition-1) and RDDBlock(0, Partition-2)]

RDD Block Replication
[Sequence diagram between Driver, Executor-1, and Executor-2. Messages: Compute RDD; Computation complete; Block stored locally; Get Peers for replication; List of Peers; Replicate block to Peer; Results of computation]

RDD Block Replication: Challenges
– Lost RDD partitions costly to recover
• Data replenished at query time
– RDD replicated to random executors
• On YARN, multiple executors can be brought up on the same node in
different containers
• Hence multiple replicas possible on the same node/rack, susceptible to
node/rack failure
• Lost block replicas not recovered proactively

Topology Aware Replication (SPARK-15352)
– Ideas & Implementation by Shubham Chopra
– Making Peer selection for replication pluggable
• Driver gets topology information for executors
• Executors informed about this topology information
• Executors use prioritization logic to order peers for block replication
• Pluggable TopologyMapper and BlockReplicationPrioritizer
• Default implementation replicates current Spark behavior
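
The pluggable design described above might look roughly like the following plain-Python sketch. It borrows the names `TopologyMapper` and the idea of a prioritizer from the slide, but the classes, method names, and signatures here are hypothetical stand-ins, not Spark's actual interfaces. One prioritizer mimics the default random ordering; the other uses rack information supplied by the mapper.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Peer:
    executor_id: str
    host: str
    rack: Optional[str] = None  # filled in from the topology mapper


class TopologyMapper:
    """Driver side: maps a host name to topology info (here, a rack label)."""

    def __init__(self, host_to_rack: Dict[str, str]):
        self.host_to_rack = host_to_rack

    def rack_for(self, host: str) -> Optional[str]:
        return self.host_to_rack.get(host)


class RandomPrioritizer:
    """Default-style behaviour: peers in random order, topology ignored."""

    def __init__(self, seed: int = 0):
        self.rng = random.Random(seed)

    def prioritize(self, me: Peer, peers: List[Peer]) -> List[Peer]:
        shuffled = list(peers)
        self.rng.shuffle(shuffled)
        return shuffled


class RackAwarePrioritizer:
    """Prefer other racks, then other hosts, then same-host executors."""

    def prioritize(self, me: Peer, peers: List[Peer]) -> List[Peer]:
        def rank(p: Peer) -> int:
            if p.rack is not None and p.rack != me.rack:
                return 0
            if p.host != me.host:
                return 1
            return 2
        return sorted(peers, key=rank)  # stable sort keeps ties in input order


# Hypothetical three-host, two-rack cluster.
mapper = TopologyMapper({"host-a": "rack-1", "host-b": "rack-1", "host-c": "rack-2"})


def make_peer(executor_id: str, host: str) -> Peer:
    return Peer(executor_id, host, mapper.rack_for(host))


me = make_peer("exec-1", "host-a")
peers = [make_peer("exec-2", "host-a"),
         make_peer("exec-3", "host-b"),
         make_peer("exec-4", "host-c")]
ordered = RackAwarePrioritizer().prioritize(me, peers)
shuffled = RandomPrioritizer().prioritize(me, peers)
```

Because both prioritizers expose the same `prioritize` method, swapping one for the other changes replica placement without touching the replication code that consumes the ordered list.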

Topology Aware Replication (SPARK-15352)
– Customizable prioritization strategies to suit different deployments
• Variety of replication objectives – ReplicateToDifferentHost,
ReplicateBlockWithinRack, ReplicateBlockOutsideRack
• Optimizer to find a minimum number of peers to meet the objectives
• Replicate to these peers with a higher priority
– Proactive replenishment of lost replicas
• BlockManagerMasterEndpoint triggers replenishment when an executor
failure is detected.
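
To make the "minimum number of peers to meet the objectives" idea concrete, here is a small plain-Python sketch. The objective names come from the slide, but their exact semantics are an assumed reading (for instance, "within rack" is taken to mean same rack on a different host), and `minimal_peers` uses a brute-force search rather than whatever optimizer the actual implementation uses.

```python
from itertools import combinations
from typing import Callable, Dict, List, NamedTuple, Sequence


class Peer(NamedTuple):
    executor_id: str
    host: str
    rack: str


# Each objective answers: does replicating from `src` to `peer` satisfy it?
Objective = Callable[[Peer, Peer], bool]

# Assumed semantics for the objective names listed on the slide.
OBJECTIVES: Dict[str, Objective] = {
    "ReplicateToDifferentHost":
        lambda src, peer: peer.host != src.host,
    "ReplicateBlockWithinRack":
        lambda src, peer: peer.rack == src.rack and peer.host != src.host,
    "ReplicateBlockOutsideRack":
        lambda src, peer: peer.rack != src.rack,
}


def minimal_peers(src: Peer, peers: Sequence[Peer],
                  objective_names: Sequence[str]) -> List[Peer]:
    """Smallest subset of peers that together satisfy every named objective.

    Exhaustive search over subsets of increasing size; fine for a sketch,
    too slow for large clusters.
    """
    for size in range(1, len(peers) + 1):
        for subset in combinations(peers, size):
            if all(any(OBJECTIVES[name](src, p) for p in subset)
                   for name in objective_names):
                return list(subset)
    return []  # the objectives cannot all be met with these peers


def prioritize(src: Peer, peers: Sequence[Peer],
               objective_names: Sequence[str]) -> List[Peer]:
    """Put the objective-satisfying peers first, everything else after."""
    chosen = minimal_peers(src, peers, objective_names)
    rest = [p for p in peers if p not in chosen]
    return chosen + rest


# Hypothetical cluster: two racks, three hosts.
src = Peer("exec-1", "host-a", "rack-1")
peers = [Peer("exec-2", "host-a", "rack-1"),
         Peer("exec-3", "host-b", "rack-1"),
         Peer("exec-4", "host-c", "rack-2")]
chosen = minimal_peers(src, peers, list(OBJECTIVES))
ordered = prioritize(src, peers, list(OBJECTIVES))
```

With all three objectives requested, no single peer suffices, so the search settles on one same-rack peer on another host plus one peer in another rack; the same-host executor is pushed to the end of the ordering.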

Spark HA: Challenges
– High Availability of Spark Driver
• High bootstrap cost of reconstructing cluster and cached state
• Naïve HA models (such as multiple active clusters) surface query
inconsistency
– High Availability and Low Tail Latency closely related

Spark HA – A Strawman
• Multiple Spark Servers in Leader-Standby configuration
• Each Spark Server backed by a different Spark Cluster
• Each Spark Server refreshed with up-to-date data
• Queries to standbys redirected to leader
• Only leader responds to queries - Data consistency
• RDD Partition loss in the leader still a concern
• Performance still gated by slowest executor in leader
• Resource usage amplified by the number of Spark Servers
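
The strawman can be modelled in a few lines of plain Python. Everything here is hypothetical (`SparkServer`, `ServerGroup`, the dict standing in for cached data, the sample key and value); the point is only to show the three behaviours listed above: every server ingests the same data, standbys forward queries, and only the leader answers.

```python
from typing import Dict, List, Optional


class SparkServer:
    """Hypothetical stand-in for one Spark Server and its own cluster."""

    def __init__(self, name: str):
        self.name = name
        self.data: Dict[str, int] = {}
        self.leader: Optional["SparkServer"] = None  # None means "I am the leader"
        self.queries_answered = 0

    def ingest(self, key: str, value: int) -> None:
        self.data[key] = value

    def query(self, key: str) -> Optional[int]:
        if self.leader is not None:
            # Standby: redirect to the leader so all clients see one view.
            return self.leader.query(key)
        self.queries_answered += 1
        return self.data.get(key)


class ServerGroup:
    """Keeps every server refreshed and tracks which one is the leader."""

    def __init__(self, servers: List[SparkServer]):
        self.servers = servers
        self.elect(servers[0])

    def elect(self, leader: SparkServer) -> None:
        for server in self.servers:
            server.leader = None if server is leader else leader

    def ingest(self, key: str, value: int) -> None:
        # Each server (and its cluster) holds a full copy of the data,
        # which is why resource usage grows with the number of servers.
        for server in self.servers:
            server.ingest(key, value)

    def fail_over(self, failed: SparkServer) -> None:
        self.servers = [s for s in self.servers if s is not failed]
        self.elect(self.servers[0])


a, b = SparkServer("a"), SparkServer("b")
group = ServerGroup([a, b])
group.ingest("px", 101)
answer_via_standby = b.query("px")      # forwarded to leader "a"
answered_by_a = a.queries_answered

group.fail_over(a)                      # standby "b" takes over
answer_after_failover = b.query("px")   # served from b's own copy
```

The sketch also makes the listed drawbacks visible: the leader alone serves every query, so its slowest executor still bounds latency, and each standby duplicates the full data set.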

Spark Driver State
• Spark Driver is an arbitrary Java application
• Only a subset of the state is interesting or expensive to reconstruct
• For online use-cases, only RDDs/DFs created during ingestion are of
interest
• Expressing ingestion using DFs has better decoupling of data/state than
RDDs

Spark Driver State*
• BlockManagerMasterEndpoint holds Block<->Executor assignment
• Cache Manager holds Logical Plan and DataFrame references
– Used to short-circuit queries with pre-cached query plans, if possible
• Job Scheduler
– Keeps track of various stages and tasks being scheduled
• Executor information
– Hostname and ports of live executors
*Illustrative, not exhaustive

Externalizing Driver State
Benefits:
– Quicker recoveries
– No need to restart executors
– State accessible from multiple Active-Active drivers
Solutions:
– Off-heap storage for RDDs
– Residual book-keeping driver state externalized to ZooKeeper
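
A minimal sketch of the externalization idea, in plain Python: book-keeping state (block-to-executor assignments and executor addresses) is written to an external store on every change, so a replacement driver can reload it without restarting executors. The `ExternalStore` class is an in-memory stand-in for ZooKeeper, and the `Driver` class, its methods, the state path, and the JSON layout are all assumptions made for illustration.

```python
import json
from typing import Dict, List


class ExternalStore:
    """In-memory stand-in for ZooKeeper: maps a path to a bytes payload."""

    def __init__(self):
        self._nodes: Dict[str, bytes] = {}

    def set(self, path: str, data: bytes) -> None:
        self._nodes[path] = data

    def get(self, path: str) -> bytes:
        return self._nodes[path]


class Driver:
    STATE_PATH = "/spark-server/driver-state"  # hypothetical path

    def __init__(self, store: ExternalStore):
        self.store = store
        self.block_locations: Dict[str, List[str]] = {}  # block id -> executor ids
        self.executors: Dict[str, str] = {}              # executor id -> host:port

    def register_executor(self, executor_id: str, address: str) -> None:
        self.executors[executor_id] = address
        self._persist()

    def record_block(self, block_id: str, executor_id: str) -> None:
        self.block_locations.setdefault(block_id, []).append(executor_id)
        self._persist()

    def _persist(self) -> None:
        # Write the whole book-keeping state after every change.
        state = {"blocks": self.block_locations, "executors": self.executors}
        self.store.set(self.STATE_PATH, json.dumps(state).encode("utf-8"))

    @classmethod
    def recover(cls, store: ExternalStore) -> "Driver":
        """Build a replacement driver from externalized state."""
        driver = cls(store)
        state = json.loads(store.get(cls.STATE_PATH).decode("utf-8"))
        driver.block_locations = state["blocks"]
        driver.executors = state["executors"]
        return driver


store = ExternalStore()
first = Driver(store)
first.register_executor("exec-1", "host-a:7077")
first.register_executor("exec-2", "host-b:7077")
first.record_block("rdd_0_1", "exec-1")
first.record_block("rdd_0_1", "exec-2")

# The first driver is lost; a new one picks up where it left off.
second = Driver.recover(store)
```

Because the recovered driver learns where cached blocks live from the store, the executors holding those blocks (and any off-heap RDD data) can keep running across the driver restart.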

THANK YOU.
skadambi@bloomberg.net
