The Future of Real-Time in Spark: Reynold Xin @rxin
Real-Time in Spark
Reynold Xin @rxin
Spark Summit, New York, Feb 18, 2016
Why Real-Time?
Making decisions faster is valuable.
• Preventing credit card fraud
• Monitoring industrial machinery
• Human-facing dashboards
• …
Streaming Engine
Noun.
Spark Unified Stack
SQL | Streaming | MLlib | GraphX
Spark Core
STREAM
[Figure: stream of records, e.g. pricing 10:10 30]
• Late events
• Partial outputs to MySQL
• State recovery on failure
• Distributed reads/writes
• ...
Complex Programming Models
Data: Late arrival, varying distribution over time, …
Processing: Business logic change & new ops (windows, sessions)
Output: How do we define output over time & correctness?
Structured Streaming
The simplest way to perform streaming analytics
is not having to reason about streaming.
Spark 1.3: Static DataFrames
Spark 2.0: Infinite DataFrames
Single API!
Structured Streaming
High-level streaming API built on Spark SQL engine
• Runs the same queries on DataFrames
• Event time, windowing, sessions, sources & sinks
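To make the event-time and windowing bullet concrete, here is a minimal plain-Python sketch (not Spark code) of grouping records into tumbling windows by the timestamp carried in each event rather than by arrival order. The record layout, the 60-second window and the names `window_start` and `window_counts` are assumptions made for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed window length, for illustration only


def window_start(event_time: int) -> int:
    """Return the start of the tumbling window that contains event_time."""
    return event_time - (event_time % WINDOW_SECONDS)


def window_counts(events):
    """Count events per (window start, page), keyed by event time.

    `events` is an iterable of (event_time_seconds, page) tuples in
    arrival order, which may differ from event-time order.
    """
    counts = defaultdict(int)
    for event_time, page in events:
        counts[(window_start(event_time), page)] += 1
    return dict(counts)


# Hypothetical records; the third one arrives late (its event time is
# earlier than the record before it) but still lands in the first window.
events = [(5, "pricing"), (65, "pricing"), (50, "pricing"), (70, "home")]
result = window_counts(events)
```

Because the window is derived from the event's own timestamp, a late record updates the window it belongs to instead of the window that happens to be open when it arrives.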
Model
Trigger: every 1 sec
[Figure: timeline with Time = 1, 2, 3. At each trigger the Query produces a Result: output for data at 1, output for data at 2, output for data at 3. Output: complete output.]
[Figure: the same timeline. At each trigger the Query produces a Result: output for data at 1, output for data at 2, output for data at 3. Output: delta output.]
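The difference between complete and delta output can be sketched in plain Python. This is only an illustration of the model, assuming the result at each trigger reflects all input received up to that point; it is not how Spark implements it, and `run_triggers`, `total_time_per_user` and the sample rows are hypothetical.

```python
def run_triggers(batches, query):
    """Run `query` at each trigger over all input received so far.

    Yields (complete_output, delta_output) per trigger, where the delta
    holds only the keys whose value changed since the previous trigger.
    """
    seen = []      # the append-only input table
    previous = {}  # result as of the previous trigger
    for batch in batches:
        seen.extend(batch)
        result = query(seen)
        delta = {k: v for k, v in result.items() if previous.get(k) != v}
        yield result, delta
        previous = result


def total_time_per_user(rows):
    """Hypothetical query: sum of `time` grouped by `user_id`."""
    totals = {}
    for user_id, time in rows:
        totals[user_id] = totals.get(user_id, 0) + time
    return totals


# Made-up data arriving at triggers 1, 2 and 3.
batches = [[("a", 1), ("b", 2)], [("a", 3)], [("c", 4)]]
outputs = list(run_triggers(batches, total_time_per_user))
```

Complete output rewrites the whole result at every trigger, while delta output sends only the rows that changed, which is usually what a sink such as a database table wants.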
Model Details
Input sources: append-only tables
Physically: Spark automatically runs the query in streaming fashion (i.e. incrementally and continuously)
[Figure: Logical Plan → Catalyst optimizer → Continuous, incremental execution]
Example: Batch Aggregation
logs = ctx.read.format("json").open("s3://logs")
(logs.groupBy(logs.user_id).agg(sum(logs.time))
    .write.format("jdbc")
    .save("jdbc:mysql://..."))
Example: Continuous Aggregation
logs = ctx.read.format("json").stream("s3://logs")
(logs.groupBy(logs.user_id).agg(sum(logs.time))
    .write.format("jdbc")
    .stream("jdbc:mysql://..."))
Automatic Incremental Execution
T=0 Aggregate
T=1 Aggregate
T=2 Aggregate
…
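The idea behind incremental execution of an aggregate can be sketched in plain Python: keep the running totals as state and fold in only the new data at each step, instead of rescanning everything. This is a hedged illustration, not Spark's implementation; `RunningSum`, `batch_sum` and the sample rows are hypothetical.

```python
class RunningSum:
    """Keeps per-key sums as state and folds in each new batch."""

    def __init__(self):
        self.totals = {}

    def update(self, batch):
        """Apply one batch of (key, value) rows and return a snapshot."""
        for key, value in batch:
            self.totals[key] = self.totals.get(key, 0) + value
        return dict(self.totals)


def batch_sum(rows):
    """Reference batch aggregation over the full input."""
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return totals


# Made-up data arriving at T=0, T=1 and T=2.
arrivals = [[("a", 10), ("b", 5)], [("a", 2)], [("b", 1), ("c", 7)]]

state = RunningSum()
snapshots = [state.update(batch) for batch in arrivals]

# The incremental result after the last step matches one batch run
# over all the data received so far.
all_rows = [row for batch in arrivals for row in batch]
matches_batch = snapshots[-1] == batch_sum(all_rows)
```

Each step touches only the newly arrived rows, yet the answer stays identical to what the batch query would return over the same prefix of the input.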
Rest of Spark will follow
• Interactive queries should just work