Summary: Structured Streaming provides a simple way to perform streaming analytics using Spark SQL's DataFrame API. Users write streaming queries over continuously updating DataFrames with the same SQL-like operations as batch queries, and the engine executes them incrementally, updating results as new data arrives. This unifies streaming, interactive, and batch processing: queries can be changed at runtime, results can be served through databases, and ML models can be built and applied continuously on streaming data. Structured Streaming also addresses common challenges in building continuous applications, such as integration with non-streaming systems and complex streaming programming models.

The Future of Real-Time in Spark
Reynold Xin @rxin
Spark Summit, New York, Feb 18, 2016
Why Real-Time?
Making decisions faster is valuable.
• Preventing credit card fraud
• Monitoring industrial machinery
• Human-facing dashboards
• …
Streaming Engine
Noun.

Takes an input stream and produces an output stream.


Spark Unified Stack

SQL | Streaming | MLlib | GraphX
Spark Core

Spark Streaming

Introduced 3 years ago, in Spark 0.7

50% of users consider it the most important part of Spark

• First attempt at unifying streaming and batch
• State management built in
• Exactly-once semantics
• Features required for large clusters:
  straggler mitigation, dynamic load balancing, fast fault recovery
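
For context, here is a minimal classic Spark Streaming (DStream) word count sketched in PySpark; the socket source and one-second batch interval are illustrative choices, not from the talk:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="DStreamWordCount")
ssc = StreamingContext(sc, batchDuration=1)        # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)    # illustrative source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                    # print each batch's counts

ssc.start()
ssc.awaitTermination()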
Streaming computations don’t run in isolation.
Use Case: Fraud Detection

[Diagram: an incoming STREAM feeds a machine learning model that continuously updates to detect new anomalies (ANOMALY); historic data is analyzed ad hoc alongside it.]
Continuous Application
noun.

An end-to-end application that acts on real-time data.


Challenges Building Continuous Applications

Integration with non-streaming systems is often an afterthought
• Interactive, batch, relational databases, machine learning, …

Streaming programming models are complex


Integration Example

A stream of (page, time) events flows into a streaming engine, which maintains per-minute visit counts in MySQL:

Stream: (home.html, 10:08), (product.html, 10:09), (home.html, 10:10), ...

MySQL table:
Page    | Minute | Visits
home    | 10:09  | 21
pricing | 10:10  | 30
...     | ...    | ...

What can go wrong?
• Late events
• Partial outputs to MySQL
• State recovery on failure
• Distributed reads/writes
• ...
Complex Programming Models

Data: late arrival, varying distribution over time, …
Processing: business logic changes & new operators (windows, sessions)
Output: how do we define output over time & correctness?
Structured Streaming

The simplest way to perform streaming analytics is not having to reason about streaming.

Spark 1.3: static DataFrames. Spark 2.0: infinite DataFrames. A single API!
Structured Streaming

High-level streaming API built on the Spark SQL engine
• Runs the same queries on DataFrames
• Event time, windowing, sessions, sources & sinks

Unifies streaming, interactive, and batch queries
• Aggregate data in a stream, then serve using JDBC
• Change queries at runtime
• Build and apply ML models
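
As a sketch of what an event-time windowed query looks like in this model (PySpark-style; the window() helper, the column names, and the stream() source call are illustrative assumptions, since the final API was still in design at the time of this talk):

from pyspark.sql.functions import window

# Hypothetical streaming DataFrame of page views with an event-time column `evtime`.
views = ctx.read.format("json").stream("s3://pageviews")

windowed = (views
    .groupBy(window(views.evtime, "1 minute"), views.page)   # 1-minute event-time windows
    .count())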
Model (trigger: every 1 sec; complete output)

At each trigger point t = 1, 2, 3, …:
• Input: all data received up to processing time t
• Result: the query's output over that input
• Output: the complete result is written out
Model (trigger: every 1 sec; delta output)

Same as above, except that at each trigger only the delta (the rows that changed since the previous result) is written out.
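
A toy run to make the two modes concrete, assuming the query is a running count per page and one event arrives per trigger (data invented for illustration):

Trigger | New event | Running result    | Complete output   | Delta output
1       | home      | home=1            | home=1            | home=1
2       | home      | home=2            | home=2            | home=2
3       | pricing   | home=2, pricing=1 | home=2, pricing=1 | pricing=1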
Model Details
Input sources: append-only tables

Queries: new operators for windowing, sessions, etc.

Triggers: based on time (e.g. every 1 sec)

Output modes: complete, deltas, update-in-place


Example: ETL
Input: files in S3

Query: map (transform each record)

Trigger: “every 5 sec”

Output mode: “new records”, into S3 sink
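
A minimal sketch of this ETL job in the API style the talk proposes (the trigger/outputMode calls, the paths, and the parse_record helper are illustrative assumptions, not a shipped API):

logs = ctx.read.format("json").stream("s3://logs")       # input: files in S3

# map: transform each record (parse_record and the `raw` column are placeholders)
cleaned = logs.select(parse_record(logs.raw))

(cleaned.write
    .format("json")
    .trigger("5 seconds")                                # trigger: every 5 sec
    .outputMode("new records")                           # output mode: append only new records
    .stream("s3://cleaned-logs"))                        # sink: S3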


Example: Page View Count
Input: records in Kafka

Query: select count(*) group by page, minute(evtime)

Trigger: “every 5 sec”

Output mode: “update-in-place”, into MySQL sink

Note: this will automatically update “old” records on late data!
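
The same example as a sketch in the proposed API (the Kafka source, minute() helper usage, and sink URL shape are illustrative assumptions):

from pyspark.sql.functions import minute

views = ctx.read.format("kafka").stream("kafka://.../pageviews")   # input: Kafka records

counts = views.groupBy(views.page, minute(views.evtime)).count()   # count per page per minute

(counts.write
    .format("jdbc")
    .trigger("5 seconds")                                          # trigger: every 5 sec
    .outputMode("update-in-place")                                 # late data rewrites old rows
    .stream("jdbc:mysql://..."))                                   # sink: MySQL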


Execution

Logically: DataFrame operations on static data
(i.e. as easy to understand as batch)

Physically: Spark automatically runs the query in streaming fashion
(i.e. incrementally and continuously)

[Diagram: DataFrame → Logical Plan → Catalyst optimizer → continuous, incremental execution]
Example: Batch Aggregation

logs = ctx.read.format("json").open("s3://logs")

(logs.groupBy(logs.user_id).agg(sum(logs.time))
     .write.format("jdbc")
     .save("jdbc:mysql://..."))
Example: Continuous Aggregation

logs = ctx.read.format("json").stream("s3://logs")

(logs.groupBy(logs.user_id).agg(sum(logs.time))
     .write.format("jdbc")
     .stream("jdbc:mysql://..."))

Note: the only changes from the batch version are the source and sink calls (stream() in place of open() and save()); the query itself is identical.
Automatic Incremental Execution

[Diagram: at T=0, T=1, T=2, … the same Aggregate is re-run over the growing input, with each step executed incrementally on top of the previous one.]

Rest of Spark will follow
• Interactive queries should just work
• Spark's data source API will be updated to support seamless streaming integration:
  • Exactly-once semantics end-to-end
  • Different output modes (complete, delta, update-in-place)
• ML algorithms will be updated too

What can we do with this that's hard with other engines?
Ad-hoc, interactive queries

Dynamically changing queries

Benefits of Spark: elastic scaling, straggler mitigation, etc


Use Case: Fraud Detection

[Diagram reprise: the stream feeds a machine learning model that continuously updates to detect new anomalies; historic data is analyzed ad hoc alongside it.]
Timeline

Spark 2.0:
• API foundation
• Kafka, file systems, and databases
• Event-time aggregations

Spark 2.1+:
• Continuous SQL
• BI app integration
• Other streaming sources / sinks
• Machine learning
Thank you.
@rxin
