Data Stream Processing - An Overview: Sangeetha Seshadri, sangeeta@cc.gatech.edu
Data Stream Processing An Overview
CS 4440 Lecture 6
Agenda
Data Streams
What are they?
Why now? Applications..
DSMS: Architecture & Issues
Query Processing
Data Streams: What and Where?
Continuous, unbounded, rapid, time-varying streams of data elements (tuples).
Occur in a variety of modern applications
Network monitoring and traffic engineering
Sensor networks, RFID tags
Telecom call records
Financial applications
Web logs and click-streams
Manufacturing processes
DSMS = Data Stream Management System
stanfordstreamdatamanager
DBMS versus DSMS
DBMS:
Persistent relations
One-time queries
Random access
Access plan determined by query processor and physical DB design
DSMS:
Transient streams (and persistent relations)
Continuous queries
Sequential access
Unpredictable data characteristics and arrival patterns
Continuous Queries
One-time queries: run once to completion over the current data set.
Continuous queries: issued once and then continuously evaluated over the data.
Example:
Notify me when the temperature drops below X
Tell me when prices of stock Y > 300
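The contrast between the two query types can be sketched in Python. This is a minimal illustration only: the (sensor, temperature) schema, the function names and the sample readings are assumptions made for the example, not part of any DSMS.

```python
from typing import Iterable, Iterator, List, Tuple

Reading = Tuple[str, float]  # (sensor_id, temperature); hypothetical schema


def one_time_below(data: List[Reading], threshold: float) -> List[Reading]:
    """One-time query: runs once to completion over the current data set."""
    return [r for r in data if r[1] < threshold]


def continuous_below(stream: Iterable[Reading], threshold: float) -> Iterator[Reading]:
    """Continuous query: registered once, then evaluated on every arriving tuple."""
    for reading in stream:
        if reading[1] < threshold:
            yield reading  # the notification is produced as soon as the tuple arrives


# Illustrative data; a real stream would never end.
readings = [("s1", 21.5), ("s1", 18.0), ("s2", 25.1), ("s2", 17.2)]
alerts = list(continuous_below(iter(readings), threshold=20.0))
```

The generator never needs the whole input, which is what lets the same logic run over an unbounded stream.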
The (Simplified) Big Picture
[Figure: DSMS with Scratch Store. Labels: Input streams, Register Query, Streamed Result, Stored Result, Archive, Stored Relations]
(Simplified) Network Monitoring
[Figure: DSMS with Scratch Store. Labels: Network measurements, Packet traces; Register Monitoring Queries; Intrusion Warnings; Online Performance Metrics; Archive; Lookup Tables]
Triggers?
Recall triggers in traditional DBMSs?
Why not use triggers to process continuous queries over
data streams?
R.Motwani, Models & Issues in Data Streams PODS 2002
Making Things Concrete
Outgoing (call_ID, caller, time, event)
Incoming (call_ID, callee, time, event)
event = start or end
[Figure labels: DSMS, Central Office, Central Office, ALICE, BOB]
R.Motwani, Models & Issues in Data Streams PODS 2002
Query 1 (self-join)
Find all outgoing calls longer than 2 minutes
SELECT O1.call_ID, O1.caller
FROM Outgoing O1, Outgoing O2
WHERE (O2.time - O1.time > 2
AND O1.call_ID = O2.call_ID
AND O1.event = 'start'
AND O2.event = 'end')
Result requires unbounded storage
Can provide result as data stream
Can output after 2 min, without seeing end
R.Motwani, Models & Issues in Data Streams PODS 2002
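The last point (output after 2 minutes, without seeing the end event) can be sketched as a streaming operator. This is a rough sketch under stated assumptions: time is in minutes, tuples arrive in non-decreasing time order, and the name `long_calls` and the sample events are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[int, str, float, str]  # (call_ID, caller, time in minutes, event)


def long_calls(outgoing: Iterable[Event], limit: float = 2.0) -> Iterator[Tuple[int, str]]:
    """Emit (call_ID, caller) once a call is known to have lasted longer than `limit`."""
    open_calls: Dict[int, Tuple[str, float]] = {}  # call_ID -> (caller, start time)
    for call_id, caller, time, event in outgoing:
        # Stream time has advanced to `time`: any call still open for more than
        # `limit` minutes can be reported without waiting for its end event.
        overdue = [c for c, (_, start) in open_calls.items() if time - start > limit]
        for cid in overdue:
            yield cid, open_calls.pop(cid)[0]
        if event == "start":
            open_calls[call_id] = (caller, time)
        elif event == "end":
            open_calls.pop(call_id, None)  # short call, or already reported


events = [
    (1, "alice", 0.0, "start"),
    (2, "bob", 1.0, "start"),
    (2, "bob", 1.5, "end"),
    (3, "carol", 2.5, "start"),
    (1, "alice", 5.0, "end"),
]
result = list(long_calls(events))
```

In this sketch the operator only remembers calls that have been open for at most `limit` minutes; the result stream itself keeps growing. Call 3 is reported although its end event never arrives.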
Query 2 (join)
Pair up callers and callees
SELECT O.caller, I.callee
FROM Outgoing O, Incoming I
WHERE O.call_ID = I.call_ID
Can still provide result as data stream
Requires unbounded temporary storage
unless streams are near-synchronized
R.Motwani, Models & Issues in Data Streams PODS 2002
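A symmetric hash join is one way to see where the temporary storage goes. The sketch below assumes, for simplicity, that each call_ID appears once per stream and that both streams have been merged into a single tagged sequence; the tags "O" and "I" and the function name `pair_up` are illustrative.

```python
from typing import Dict, Iterable, Iterator, Tuple

# ("O", call_ID, caller) from Outgoing or ("I", call_ID, callee) from Incoming
Tagged = Tuple[str, int, str]


def pair_up(merged: Iterable[Tagged]) -> Iterator[Tuple[str, str]]:
    """Join Outgoing and Incoming on call_ID, emitting (caller, callee) pairs."""
    waiting_callers: Dict[int, str] = {}  # Outgoing tuples with no match yet
    waiting_callees: Dict[int, str] = {}  # Incoming tuples with no match yet
    for source, call_id, party in merged:
        if source == "O":
            if call_id in waiting_callees:
                yield party, waiting_callees.pop(call_id)
            else:
                waiting_callers[call_id] = party
        else:
            if call_id in waiting_callers:
                yield waiting_callers.pop(call_id), party
            else:
                waiting_callees[call_id] = party


merged = [
    ("O", 1, "alice"),
    ("O", 2, "carol"),
    ("I", 1, "bob"),
    ("I", 3, "erin"),
    ("I", 2, "dave"),
]
pairs = list(pair_up(merged))
```

Unmatched tuples (call 3 here) stay in the hash tables indefinitely. If the two streams are near-synchronized, matches arrive quickly and the tables stay small.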
Query 3 (group-by aggregation)
Total connection time for each caller
SELECT O1.caller, sum(O2.time - O1.time)
FROM Outgoing O1, Outgoing O2
WHERE (O1.call_ID = O2.call_ID
AND O1.event = 'start'
AND O2.event = 'end')
GROUP BY O1.caller
GROUP BY O1.caller
Cannot provide result in (append-only) stream
Output updates?
Provide current value on demand?
Memory?
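The "output updates" option can be sketched as follows: each time a call ends, the operator emits a new running total that supersedes the previous value for that caller. The function name and the sample events are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[int, str, float, str]  # (call_ID, caller, time, event)


def connection_time_updates(outgoing: Iterable[Event]) -> Iterator[Tuple[str, float]]:
    """Emit (caller, new total) whenever one of the caller's calls ends."""
    starts: Dict[int, float] = {}                    # open calls: call_ID -> start time
    totals: Dict[str, float] = defaultdict(float)    # one entry per distinct caller
    for call_id, caller, time, event in outgoing:
        if event == "start":
            starts[call_id] = time
        elif event == "end" and call_id in starts:
            totals[caller] += time - starts.pop(call_id)
            yield caller, totals[caller]  # update, not an append-only result


events = [
    (1, "alice", 0.0, "start"),
    (2, "bob", 1.0, "start"),
    (1, "alice", 4.0, "end"),
    (3, "alice", 5.0, "start"),
    (2, "bob", 6.0, "end"),
    (3, "alice", 7.0, "end"),
]
updates = list(connection_time_updates(events))
```

The memory question shows up in `totals`, which grows with the number of distinct callers, and in `starts`, which grows with the number of concurrently open calls.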
DSMS Architecture & Issues
Data streams and stored relations: architectural differences.
Declarative language for registering continuous queries
Flexible query plans and execution strategies
Centralized? Distributed?
DSMS Issues
Relation: Tuple Set or Sequence?
Updates: Modifications or Appends?
Query Answer: Exact or Approximate?
Query Evaluation: One or Multiple Passes?
Query Plan: Fixed or Adaptive?
Architectural Issues
DSMS:
Resource (memory, per-tuple computation) limited
Reasonably complex, near real time, query processing
Useful to identify what data to populate in database
Query Evaluation: One pass
Query Plan: Adaptive
DBMS:
Resource (memory, disk, per-tuple computation) rich
Extremely sophisticated query processing, analysis
Useful to audit query results of data stream systems
Query Evaluation: Arbitrary
Query Plan: Fixed
N.Koudas, D. Srivastava (2003) AT&T Labs-Research
STREAM System Challenges
Must cope with:
Stream rates that may be high, variable, bursty
Stream data that may be unpredictable, variable
Continuous query loads that may be high, variable
Overload: need to use resources very carefully.
Changing conditions: need an adaptive strategy.
R.Motwani, Models & Issues in Data Streams PODS 2002
Query Model
Query Registration:
Predefined
Ad-hoc
Predefined, inactive until invoked
Answer Availability:
One-time
Event/timer based
Multiple-time, periodic
Continuous (stored or streamed)
Stream Access:
Arbitrary
Weighted history
Sliding window (special case: size = 1)
[Figure labels: User/Application, DSMS, Query Processor]
Agenda
Data Streams
What are they?
Why now? Applications..
DSMS: Architecture & Issues
Query Processing
Language
Operators
Optimization
Multi-Query Optimization
N.Koudas, D. Srivastava (2003) AT&T Labs-Research
Stream Query Language
SQL extension
Queries reference/produce relations or streams
Examples: GSQL [Gigascope], CQL [STREAM]
[Figure: Stream or Finite Relation -> Stream Query Language -> Stream or Finite Relation]
Example: Continuous Query Language CQL
Start with SQL
Then add
Streams as new data type
Continuous instead of one-time semantics
Windows on streams (derived from SQL-99)
Sampling on streams (basic)
R.Motwani, Models & Issues in Data Streams PODS 2002
Impact of Limited Memory
Continuous streams grow unboundedly
Queries may require unbounded memory
One solution: Approximate query evaluation
R.Motwani, Models & Issues in Data Streams PODS 2002
Approximate Query Evaluation
Why?
Handling load: streams coming too fast
Avoid unbounded storage and computation
Ad hoc queries need approximate history
How? Sliding windows, synopses, samples, load shedding
Major Issues?
Metric for set-valued queries
Composition of approximate operators
How is it understood/controlled by user?
Integrate into query language
Query planning and interaction with resource allocation
Accuracy-efficiency-storage tradeoff and global metric
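As one concrete instance of the "samples" option, reservoir sampling keeps a fixed-size uniform sample of a stream of unknown length. The sketch below uses the classic Algorithm R; it is offered as an illustration of bounded-memory approximation, not as the technique any particular DSMS uses, and the function name is an assumption.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Keep a uniform random sample of k items using O(k) memory (Algorithm R)."""
    sample: List[T] = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            sample.append(item)       # fill the reservoir with the first k items
        else:
            j = rng.randrange(n)      # uniform integer in [0, n)
            if j < k:
                sample[j] = item      # item n is kept with probability k / n
    return sample


sample = reservoir_sample(range(1000), k=10, rng=random.Random(0))
short = reservoir_sample(range(3), k=10, rng=random.Random(0))
```

Aggregates computed on the sample are approximate, which is where the accuracy-efficiency-storage trade-off above comes in.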
Windows
Mechanism for extracting a finite relation from an infinite
stream
Various window proposals for restricting operator scope.
Windows based on ordering attribute (e.g. time)
Windows based on tuple counts
Windows based on explicit markers (e.g. punctuations)
Variants (e.g., partitioning tuples in a window)
[Figure: Stream -> Window specifications -> Finite relations manipulated using SQL -> streamify -> Stream]
N.Koudas, D. Srivastava (2003) AT&T Labs-Research
Windows
Terminology
[Figure: timeline from Start time to Current time with marks t1, t2, t3, t4, t5, showing a Sliding Window and a Tumbling Window]
N.Koudas, D. Srivastava (2003) AT&T Labs-Research
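The two window types can be compared with a small time-based sketch. It assumes tuples of (timestamp, value) arriving in timestamp order; a sliding window of width w covers the interval (t - w, t], while tumbling windows partition time into disjoint intervals [i*w, (i+1)*w). Function names and data are hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Optional, Tuple

Item = Tuple[int, int]  # (timestamp, value)


def sliding_sum(stream: Iterable[Item], width: int) -> Iterator[Item]:
    """After each arrival at time t, emit the sum of values with timestamp in (t - width, t]."""
    window: Deque[Item] = deque()
    total = 0
    for ts, value in stream:
        window.append((ts, value))
        total += value
        while window[0][0] <= ts - width:   # expire tuples that slid out of the window
            total -= window.popleft()[1]
        yield ts, total


def tumbling_sum(stream: Iterable[Item], width: int) -> Iterator[Item]:
    """Emit (window start, sum) once per disjoint window [i*width, (i+1)*width)."""
    current: Optional[int] = None
    total = 0
    for ts, value in stream:
        bucket = ts // width
        if current is not None and bucket != current:
            yield current * width, total    # previous window is closed
            total = 0
        current = bucket
        total += value
    if current is not None:
        yield current * width, total        # flush the last window (finite demo input only)


data = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
sliding = list(sliding_sum(data, width=2))
tumbling = list(tumbling_sum(data, width=2))
```

A sliding window produces one result per arriving tuple; a tumbling window produces one result per window.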
Query Operators
Selections - Where clause
Projections - Select clause
Joins - From clause
Group-by (Aggregations) - Group-by clause
Query Operators
Selections and projections on streams - straightforward
Local per-element operators
Projection may need to include ordering attribute.
Joins - problematic
May need to join tuples that are arbitrarily far apart.
Equijoin on stream ordering attributes may be tractable.
Majority of the work focuses on joins using windows.
R.Motwani, Models & Issues in Data Streams PODS 2002
Blocking Operators
Blocking
No output until entire input seen
Streams: input never ends
Simple aggregates: output update stream
Set Output (sort, group-by)
Root could maintain output data structure
Intermediate nodes try non-blocking analogs
Join
Apply sliding-window restrictions
Optimization in DSMS
Traditionally, table-based cardinalities are used in the query optimizer.
Goal of query optimizer: minimize the size of intermediate results.
Problematic in a streaming environment: all streams are unbounded (infinite size)!
Need novel optimization objectives that are relevant when
the input sources are streams.
N.Koudas, D. Srivastava (2003) AT&T Labs-Research
Query Optimization in DSMS
Novel notions of optimization:
Stream rate based [e.g. NiagaraCQ]
Resource-based [e.g. STREAM]
QoS based [e.g. Aurora]
Continuous adaptive optimization
Possibilities that objectives cannot be met:
Resource constraints
Bursty arrivals under limited processing capabilities.
N.Koudas, D. Srivastava (2003) AT&T Labs-Research
R.Motwani, Models & Issues in Data Streams PODS 2002
Stream Projects
Amazon/Cougar (Cornell) - sensors
Aurora (Brown/MIT) - sensor monitoring, dataflow
Hancock (AT&T) - telecom streams
Niagara (OGI/Wisconsin) - Internet XML databases
OpenCQ (Georgia) - triggers, incr. view maintenance
Stream (Stanford) - general-purpose DSMS
Tapestry (Xerox) - pub/sub content-based filtering
Telegraph (Berkeley) - adaptive engine for sensors
Tribeca (Bellcore) - network monitoring
Optimizing Multiple Distributed Stream Queries Using Hierarchical Network Partitions
Sangeetha Seshadri*
Jointly with: Vibhore Kumar*, Brian F. Cooper†, Ling Liu* and Karsten Schwan*
*College of Computing, Georgia Tech
†Yahoo! Research
IPDPS'07, March 29th, 2007
Talk Outline
Motivation
Challenges
Our Approach
Experimental Results
Future Work
Distributed Data Stream Systems
[Figure labels: Weather, Local Weather, Web sources, Flight information, Travel Agent, Centralized DB]
Example queries: "What is the status of my flight?" "Can low-capacity flights be cancelled?"
Lots of data produced in lots of places
Examples: operational information systems, scientific collaborations,
web traffic data, financial applications
Centralized processing does not scale
Motivation
Challenges
Choosing efficient deployments.
Fast and efficient initial deployments.
Utilize reuse opportunities.
Handling dynamic nature of system.
Queries arrive or leave.
Nodes join (recover) or leave (fail).
Network conditions change.
Data conditions (e.g. rate) change.
Approach Outline
Typical Approaches: Query Planning, Deployment, Adaptivity (separate steps)
Our Approach: Query Planning & Deployment (integrated), Adaptivity
Query Planning
SELECT * FROM A ⋈ B ⋈ C
[Figure: two alternative plans delivering results to the Sink: (B ⋈ C) ⋈ A and (A ⋈ B) ⋈ C]
Query Deployment
[Figure: sources A, B, C; network nodes N1-N5; sinks Sink1-Sink5; operators A ⋈ B and (A ⋈ B) ⋈ C placed on network nodes]
An Illustrative Example..
SELECT * FROM A ⋈ C
SELECT * FROM A ⋈ B ⋈ C
Why an integrated approach?
Integrated approach decreases cost by > 50 %
Setup: 64 node network, 100 queries over 5 stream sources each.
Y-axis represents communication costs.
Problem
Massive Search Space.
Example: 5 stream sources, 64 nodes
2,880,000,000 (approx) plans considered.
Lemma 1: size of the exhaustive search space, O_exhaustive, expressed in terms of K and N [equation garbled in source]
Our Solution:
Trade some optimality for smaller search space
Solution
Organize the nodes into a virtual Network Hierarchy.
Operator reuse through Stream Advertisements
Two approximation based algorithms:
Top-Down
Bottom-Up
46
Optimization Metric
Minimize 'network usage'
Network usage: total amount of data in transit at any
point in time.
Encapsulates both bandwidth and latency of links.
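One common way to compute such a metric is to sum, over every link used by a deployment, the data rate on the link multiplied by the link latency. The sketch below assumes that formulation; the slides do not spell out the formula, and the `LinkFlow` type, the units and the two example plans are hypothetical.

```python
from typing import Iterable, NamedTuple


class LinkFlow(NamedTuple):
    rate: float     # data rate carried on the link, e.g. KB per second (assumed unit)
    latency: float  # link latency in seconds (assumed unit)


def network_usage(flows: Iterable[LinkFlow]) -> float:
    """Data in transit: sum of rate x latency over all links used by a deployment."""
    return sum(f.rate * f.latency for f in flows)


# Two hypothetical deployments of the same query: plan_b sends less data over the slow link.
plan_a = [LinkFlow(100.0, 0.02), LinkFlow(50.0, 0.10)]
plan_b = [LinkFlow(100.0, 0.02), LinkFlow(20.0, 0.10)]
usage_a = network_usage(plan_a)
usage_b = network_usage(plan_b)
```

Under this formulation a plan is penalized both for moving a lot of data and for moving it over slow links, which matches the bandwidth and latency point above.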
Network Hierarchy
Coordinator Nodes
Cluster network nodes based on cost.
User-defined parameter max_cs
Stream Advertisements for Reuse
[Figure: Coordinator Nodes exchanging stream advertisements for base streams A, B, C and the derived stream A ⋈ C]
Optimization Algorithms
Top-Down
Bottom-Up
Planning algorithms
Top down
[Figure: top-down planning of A ⋈ B ⋈ C ⋈ D over the hierarchy, with sub-queries A ⋈ B and C ⋈ D above sources D, C, B, A]
Top-Down Algorithm: Features
Reduced search space:
Search space reduced by a factor expressed in terms of h, N, K and max_cs [expression garbled in source]
(h = height of hierarchy, N = network size, K = number of sources).
User-defined parameter max_cs allows tuning the trade-off between search space and sub-optimality.
Operators are reused when beneficial through stream advertisements.
Planning algorithms
Bottom up
[Figure: bottom-up planning of A ⋈ B ⋈ C ⋈ D over the hierarchy, with sub-queries A ⋈ B and C ⋈ D above sources D, C, B, A]
Bottom-Up Algorithm: Features
Reduced search space.
Deploys only sub-queries within current cluster.
Analytical bounds: Search space reduced by factor .
Operators re-used when beneficial.
But, may choose sub-optimal join-orders.
Experiments
Simulation and prototype based experiments.
128 node network: Used GT-ITM internetwork topology
generator.
Uniformly random workload generator: 10 sources, 100
queries, 2-5 join operators, random sink placements.
Cost with Bottom-Up Algorithm
Comparison with existing approaches
Comparison of Search Space
Future Work
We have built a prototype based on IFLOW, a distributed data stream system built at Georgia Tech.
Aggregations
Modifying existing deployments at runtime
Relaxing filter conditions
Modifying join ordering at runtime.
Related Work
Distributed query optimization
Distributed INGRES, R*, SDD-1
Stream data processing engines
Centralized - STREAM, Aurora, TelegraphCQ
Distributed - Borealis, Flux
Conclusion
Integrated approach to query optimization
Hierarchical clustering of network and stream
advertisements.
Approximation based algorithms
Top-Down
Bottom-Up
Design Highlights
Trade some optimality for smaller search space.
Decrease search space while offering bounds on the sub-optimality.
For further information
http://www.cc.gatech.edu/~sangeeta
Contact: sangeeta@cc.gatech.edu
Thank You!
Deployment Times
Example
Simple use-case for pushing down selections:
Query 1:
SELECT FLIGHTS.Number, FLIGHTS.Status, CARRIER_CODES.Name
FROM FLIGHTS, CARRIER_CODES
WHERE FLIGHTS.Departing = 'ATLANTA'
AND FLIGHTS.Carrier_Code = CARRIER_CODES.Code
AND FLIGHTS.Departure_terminal = 'TERMINAL SOUTH'
Query 2:
SELECT FLIGHTS.Number, FLIGHTS.Status, CARRIER_CODES.Name
FROM FLIGHTS, CARRIER_CODES
WHERE FLIGHTS.Departing = 'ATLANTA'
AND FLIGHTS.Carrier_Code = CARRIER_CODES.Code
AND FLIGHTS.Departure_terminal = 'TERMINAL NORTH'
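The effect of pushing the selections below the join can be sketched for these two queries. This is only an illustration: FLIGHTS is modelled as a stream of dictionaries, CARRIER_CODES as an in-memory lookup table, and the function name, flight numbers and carrier names are made up.

```python
from typing import Dict, Iterable, Iterator, Tuple

Flight = Dict[str, str]


def flights_query(
    flights: Iterable[Flight], carrier_names: Dict[str, str], terminal: str
) -> Iterator[Tuple[str, str, str]]:
    """Evaluate Query 1 / Query 2 with the selections applied before the join."""
    for f in flights:
        # Selections first: tuples that fail the filters never reach the join.
        if f["Departing"] != "ATLANTA" or f["Departure_terminal"] != terminal:
            continue
        # Join with CARRIER_CODES on Carrier_Code = Code.
        name = carrier_names.get(f["Carrier_Code"])
        if name is not None:
            yield f["Number"], f["Status"], name


# Hypothetical sample data.
flights = [
    {"Number": "DL100", "Status": "ON TIME", "Departing": "ATLANTA",
     "Carrier_Code": "DL", "Departure_terminal": "TERMINAL SOUTH"},
    {"Number": "AA200", "Status": "DELAYED", "Departing": "ATLANTA",
     "Carrier_Code": "AA", "Departure_terminal": "TERMINAL NORTH"},
    {"Number": "DL300", "Status": "ON TIME", "Departing": "BOSTON",
     "Carrier_Code": "DL", "Departure_terminal": "TERMINAL SOUTH"},
]
carrier_names = {"DL": "Delta", "AA": "American"}
south = list(flights_query(flights, carrier_names, "TERMINAL SOUTH"))
north = list(flights_query(flights, carrier_names, "TERMINAL NORTH"))
```

Both queries share the Departing filter and the join and differ only in the terminal predicate, which is why they are natural candidates for operator reuse.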
The Big Picture
Large number of possibilities
System Model
Stream processing systems (SQL-style queries)
Pub-sub systems
Runtime annotators (keyword-based queries).
Trade-offs: cost versus
Search space
Reliability
Availability.
Adaptivity
Admission Control
Moving operators
Dropping data
Migrating plans.
Real Enterprise Workload
Delta Airlines Operational information system
Q1 (15%): Terminal Overhead Display (Lifetime = 12 hours)
Q2 (80%): Gate Agent Query (Lifetime = 2 hours)
Q3 (5%): Ad-hoc flight status monitoring queries (Lifetime = 6 hours)
Real Enterprise Workload
Backups
PODS 2002
Sliding Window Approximation
Why?
Approximation technique for bounded memory
Natural in applications (emphasizes recent data)
Well-specified and deterministic semantics
Issues
Extend relational algebra, SQL, query optimization
Algorithmic work
Timestamps?
0 1 1 0 0 0 0 1 1 1 0 0 0 0 0 1 0 1 0 1 0
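A bit stream like the one above is the usual running example for sliding-window algorithms: count the 1s among the last N bits. The sketch below keeps the whole window and is therefore exact but uses O(N) memory; it is meant only to pin down the semantics, and the function name and the choice N = 8 are assumptions.

```python
from collections import deque
from typing import Deque, Iterable, Iterator


def ones_in_window(bits: Iterable[int], n: int) -> Iterator[int]:
    """After each arriving bit, emit the number of 1s among the last n bits."""
    window: Deque[int] = deque()
    count = 0
    for b in bits:
        if len(window) == n:
            count -= window.popleft()  # the oldest bit leaves the window
        window.append(b)
        count += b
        yield count


bits = [0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0]
counts = list(ones_in_window(bits, n=8))
```

The answer depends only on the window contents, so the semantics are deterministic regardless of how much of the stream has been seen.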