Predictive Datacenter Analytics with Strymon

Vasia Kalavri
kalavriv@inf.ethz.ch
QCon San Francisco
14 November 2017

ABOUT ME
▸ Postdoc at ETH Zürich
▸ Systems Group: https://www.systems.ethz.ch/
▸ PMC member of Apache Flink
▸ Research interests
  ▸ Large-scale graph processing
  ▸ Streaming dataflow engines
▸ Current project
  ▸ Predictive datacenter analytics and management
@vkalavri

https://code.facebook.com/posts/1499322996995183/solving-the-mystery-of-link-imbalance-a-metastable-failure-state-at-scale/

https://www.theregister.co.uk/2017/09/28/amadeus_booking_software_outages_lead_to_global_delayed_flights/

DATACENTER MANAGEMENT
▸ Workload fluctuations
▸ Network failures
▸ Configuration updates
▸ Software updates
▸ Resource scaling
▸ Service deployment

Can we predict the effect of changes?
Can we prevent catastrophic faults?

What-if Analysis: Predicting outcomes under hypothetical conditions
‣ How will response time change if we migrate a large service?
‣ What will happen to link utilization if we change the routing protocol costs?
‣ Will SLOs be violated if a certain switch fails?
‣ Which services will be affected if we change the load balancing strategy?

Test deployment?
Data Center → Test Cluster: a physical small-scale cluster to try out configuration changes and what-ifs
▸ expensive to operate and maintain
▸ some errors only occur at large scale!

Analytical model?
Infer workload distribution from samples and analytically model system components (e.g. disk failure rate)
▸ hard to develop, large design space to explore
▸ often inaccurate

MODERN ENTERPRISE DATACENTERS ARE ALREADY HEAVILY INSTRUMENTED

Trace-driven online simulation
Use existing instrumentation to build a datacenter model
▸ construct DC state from real events
▸ simulate the state we cannot directly observe
[Figure: Data Center → current DC state → three forked what-if DC states]
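
The forking idea can be sketched in plain Rust. This is only an illustration under assumptions, not Strymon's implementation: `Event`, `DcState`, `replay`, `fork_with`, and the switch names are hypothetical. The current state is rebuilt by replaying observed events, and a what-if scenario applies hypothetical events to a copy, leaving the current state untouched.

```rust
use std::collections::BTreeSet;

/// Hypothetical events derived from datacenter instrumentation.
#[derive(Clone, Debug)]
enum Event {
    LinkUp(&'static str, &'static str),
    LinkDown(&'static str, &'static str),
}

/// Hypothetical datacenter state: the set of links that are currently up.
#[derive(Clone, Debug, Default, PartialEq)]
struct DcState {
    links: BTreeSet<(&'static str, &'static str)>,
}

impl DcState {
    /// Apply a single event to this state.
    fn apply(&mut self, event: &Event) {
        match event {
            Event::LinkUp(a, b) => {
                self.links.insert((*a, *b));
            }
            Event::LinkDown(a, b) => {
                self.links.remove(&(*a, *b));
            }
        }
    }

    /// Fork: copy the current state and apply hypothetical events to the copy only.
    fn fork_with(&self, hypothetical: &[Event]) -> DcState {
        let mut forked = self.clone();
        for event in hypothetical {
            forked.apply(event);
        }
        forked
    }
}

/// Construct the current state by replaying real events in order.
fn replay(events: &[Event]) -> DcState {
    let mut state = DcState::default();
    for event in events {
        state.apply(event);
    }
    state
}

fn main() {
    let current = replay(&[Event::LinkUp("s1", "s2"), Event::LinkUp("s2", "s3")]);
    // What if the link between s1 and s2 failed?
    let what_if = current.fork_with(&[Event::LinkDown("s1", "s2")]);
    println!("current links: {:?}", current.links);
    println!("what-if links: {:?}", what_if.links);
}
```

Cloning the whole state is the simplest way to fork; a real system would likely share unchanged state between forks.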

STRYMON: ONLINE DATACENTER ANALYTICS AND MANAGEMENT
[Figure: Datacenter → event streams (traces, configuration, topology updates, …) → Strymon (Datacenter state; queries, complex analytics, simulations, …) → Datacenter (policy enforcement, what-if scenarios, …)]
strymon.systems.ethz.ch

TIMELY DATAFLOW
[Figure labels: ingress, egress, feedback]

STRYMON’S OPERATIONAL REQUIREMENTS
‣ Low latency: react quickly to network failures
‣ High throughput: keep up with high stream rates
‣ Iterative computation: complex graph analytics on the network topology
‣ Incremental computation: reuse already computed results when possible
  ‣ e.g. do not recompute unaffected forwarding rules after a link update

TIMELY DATAFLOW: STREAM PROCESSING IN RUST
▸ Data-parallel computations
▸ Arbitrary cyclic dataflows
▸ Logical timestamps (epochs)
▸ Asynchronous execution
▸ Low latency
D. Murray, F. McSherry, M. Isard, R. Isaacs, P. Barham, M. Abadi. Naiad: A Timely Dataflow System. In SOSP, 2013.
https://github.com/frankmcsherry/timely-dataflow

WORDCOUNT IN TIMELY
// Imports are not on the slide; the paths assume a recent timely release.
use timely::dataflow::{InputHandle, ProbeHandle};
use timely::dataflow::operators::{Inspect, Map, Probe};
use timely::dataflow::operators::aggregation::Aggregate;

fn main() {
    // initialize and run a timely job
    timely::execute_from_args(std::env::args(), |worker| {

        // create input and progress handles
        let mut input = InputHandle::new();
        let mut probe = ProbeHandle::new();
        let index = worker.index();

        // define the dataflow and its operators
        worker.dataflow(|scope| {
            input.to_stream(scope)
                .flat_map(|text: String|
                    text.split_whitespace()
                        .map(move |word| (word.to_owned(), 1))
                        .collect::<Vec<_>>()
                )
                .aggregate(
                    |_key, val, agg| { *agg += val; },
                    |key, agg: i64| (key, agg),
                    // hash_str: helper that hashes a word to a u64 (definition not shown)
                    |key| hash_str(key)
                )
                .inspect(|data| println!("seen {:?}", data))
                // watch for progress
                .probe_with(&mut probe);
        });
        //feed data
        …
    }).unwrap();
}
A few Rust peculiarities to get used to :-)

PROGRESS TRACKING
A distributed protocol that allows operators to reason about the possibility of receiving data
▸ All tuples bear a logical timestamp (think event time)
▸ To send a timestamped tuple, an operator must hold a capability for it
▸ Workers broadcast progress changes to other workers
▸ Each worker independently determines progress
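
A much simplified sketch of the bookkeeping behind this protocol, in plain Rust. It is not timely's actual implementation: `ProgressTracker` and its methods are hypothetical names, timestamps are plain `u64` epochs, and updates are assumed to arrive in an order that never drives a count negative. Each worker applies the same broadcast changes and can therefore determine the same frontier on its own.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-worker view of outstanding capabilities and messages,
/// keyed by logical timestamp.
#[derive(Default)]
struct ProgressTracker {
    counts: BTreeMap<u64, i64>,
}

impl ProgressTracker {
    /// Apply one broadcast progress change: `delta` capabilities or messages
    /// were created (positive) or retired (negative) at `time`.
    fn update(&mut self, time: u64, delta: i64) {
        let count = self.counts.entry(time).or_insert(0);
        *count += delta;
        let now_zero = *count == 0;
        if now_zero {
            self.counts.remove(&time);
        }
    }

    /// Earliest timestamp that may still be observed, or None if nothing is outstanding.
    fn frontier(&self) -> Option<u64> {
        self.counts.keys().next().copied()
    }

    /// True if no more data with timestamp `time` can arrive.
    fn is_complete(&self, time: u64) -> bool {
        self.frontier().map_or(true, |f| f > time)
    }
}

fn main() {
    let mut progress = ProgressTracker::default();
    progress.update(0, 1); // the input holds a capability for epoch 0
    println!("epoch 0 complete? {}", progress.is_complete(0));
    progress.update(1, 1); // the input advances to epoch 1 ...
    progress.update(0, -1); // ... and drops its capability for epoch 0
    println!("epoch 0 complete? {}", progress.is_complete(0));
    println!("frontier: {:?}", progress.frontier());
}
```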

MAKING PROGRESS
fn main() {
    …
        …
        …
        for round in 0..10 {
            // push data to the input stream
            input.send(("round".to_owned(), 1));
            // advance the input epoch
            input.advance_to(round + 1);
            // do work while there's still data for this epoch
            while probe.less_than(input.time()) {
                worker.step();
            }
        }
    }).unwrap();
}

TIMELY ITERATIONS
timely::example(|scope| {

    // loop 100 times at most; advance timestamps by 1 in each iteration
    let (handle, stream) = scope.loop_variable(100, 1);
    (0..10).to_stream(scope)
        .concat(&stream)
        .inspect(|x| println!("seen: {:?}", x))
        // create the feedback loop
        .connect_loop(handle);
});
Timestamp structure: t outside any loop, (t, l1) inside a loop, (t, (l1, l2)) inside a nested loop.
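
The behaviour of such a bounded feedback loop can be imitated without timely. The sketch below is an assumption-laden stand-in: `run_loop` is a hypothetical function, the loop counter is a plain `u64`, and records are dropped once their counter would reach the limit (the exact boundary behaviour of `loop_variable` may differ).

```rust
use std::collections::VecDeque;

/// Circulate `input` values around a feedback loop.
/// Every trip around the loop advances a record's loop counter by `step`;
/// a record is dropped once its next counter would reach `limit`.
/// Returns every (loop counter, value) pair that was observed.
fn run_loop(input: Vec<u64>, limit: u64, step: u64) -> Vec<(u64, u64)> {
    assert!(step > 0, "a zero step would never terminate");
    let mut queue: VecDeque<(u64, u64)> = input.into_iter().map(|v| (0, v)).collect();
    let mut seen = Vec::new();
    while let Some((iteration, value)) = queue.pop_front() {
        seen.push((iteration, value)); // plays the role of `inspect`
        let next = iteration + step; // the feedback edge advances the counter
        if next < limit {
            queue.push_back((next, value)); // plays the role of `connect_loop`
        }
    }
    seen
}

fn main() {
    let seen = run_loop((0..10).collect(), 100, 1);
    println!("observed {} records", seen.len());
    println!("first: {:?}, last: {:?}", seen.first(), seen.last());
}
```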

TIMELY & I
Relationship status: it’s complicated
APIs & libraries
performance
ecosystem
fault-tolerance
deployment
debugging
expressiveness
incremental computation
performance

Strymon
‣ Query and analyze state online
‣ Control and enforce configuration
‣ Simulate what-if scenarios
‣ Understand performance
Use cases:
‣ What-if analysis
‣ Real-time datacenter analytics
‣ Incremental network routing
‣ Online critical path analysis

RECONSTRUCTING USER SESSIONS
[Figure: Application A (A.1, A.2, A.3) and Application B (B.1, B.2, B.3, B.4)]
Time: 2015/09/01 10:03:38.599859
Session ID: XKSHSKCBA53U088FXGE7LD8
Transaction ID: 26-3-11-5-1
PERFORMANCE RESULTS
[Figure: log events from A and B plotted by client and time; labels: Inactivity, Transaction, Transaction ID (e.g. 1-1, 1-2, 1-3, n-2-8)]
‣ Logs from 1263 streams and 42 servers
‣ 1.3 million events/s at 424.3 MB/s
‣ 26ms per epoch vs. 2.1s per epoch with Flink
More Results: Zaheer Chothia, John Liagouris, Desislava Dimitrova, and Timothy Roscoe. Online Reconstruction of Structural Information from Datacenter Logs. (EuroSys '17).

ROUTING AS A STREAMING COMPUTATION
[Figure: Network changes → G, R → delta-join, iteration → Updated rules → Forwarding rules]
‣ Compute all-pairs shortest paths (APSP) and keep forwarding rules as operator state
‣ Flow requests translate to lookups on that state
‣ Network updates cause re-computation of affected rules only
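
The "affected rules only" idea can be illustrated with a small stand-in in plain Rust. This sketch does not use the delta-join dataflow: it assumes an unweighted, undirected topology, uses one BFS per destination in place of APSP, and all names (`Graph`, `Rules`, `fail_link`, ...) are hypothetical. When a link fails, only destinations whose rules forward over that link are recomputed.

```rust
use std::collections::{BTreeMap, BTreeSet, VecDeque};

type Node = u32;
type Graph = BTreeMap<Node, BTreeSet<Node>>;
/// For one destination: node -> next hop towards that destination.
type Rules = BTreeMap<Node, Node>;

fn add_link(g: &mut Graph, a: Node, b: Node) {
    g.entry(a).or_default().insert(b);
    g.entry(b).or_default().insert(a);
}

fn remove_link(g: &mut Graph, a: Node, b: Node) {
    if let Some(n) = g.get_mut(&a) { n.remove(&b); }
    if let Some(n) = g.get_mut(&b) { n.remove(&a); }
}

/// BFS from `dst`: every reached node forwards to the neighbour it was discovered from.
fn rules_towards(g: &Graph, dst: Node) -> Rules {
    let mut rules = Rules::new();
    let mut visited = BTreeSet::from([dst]);
    let mut queue = VecDeque::from([dst]);
    while let Some(node) = queue.pop_front() {
        for &next in g.get(&node).into_iter().flatten() {
            if visited.insert(next) {
                rules.insert(next, node); // `next` reaches `dst` via `node`
                queue.push_back(next);
            }
        }
    }
    rules
}

/// Forwarding state for all destinations: destination -> (node -> next hop).
fn all_rules(g: &Graph) -> BTreeMap<Node, Rules> {
    g.keys().map(|&dst| (dst, rules_towards(g, dst))).collect()
}

/// After link (a, b) fails, recompute only the destinations whose rules used it.
/// Returns the destinations that were recomputed.
fn fail_link(g: &mut Graph, table: &mut BTreeMap<Node, Rules>, a: Node, b: Node) -> Vec<Node> {
    remove_link(g, a, b);
    let affected: Vec<Node> = table
        .iter()
        .filter(|(_, r)| r.get(&a) == Some(&b) || r.get(&b) == Some(&a))
        .map(|(&dst, _)| dst)
        .collect();
    for &dst in &affected {
        table.insert(dst, rules_towards(g, dst));
    }
    affected
}

fn main() {
    let mut g = Graph::new();
    for (a, b) in [(1, 2), (2, 3), (1, 3), (3, 4)] {
        add_link(&mut g, a, b);
    }
    let mut table = all_rules(&g);
    let recomputed = fail_link(&mut g, &mut table, 1, 2);
    println!("recomputed destinations: {:?}", recomputed);
    println!("node 2 now reaches node 1 via {:?}", table[&1].get(&2));
}
```

A flow request is then a lookup such as `table[&dst].get(&node)`. Rules that never used the failed link remain valid shortest-path rules, because removing a link cannot shorten any path.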

REACTION TO LINK FAILURES
▸ Fail random link
▸ 500 individual runs
▸ 32 threads
More Results: Desislava C. Dimitrova et al. Quick Incremental Routing Logic for Dynamic Network Graphs. SIGCOMM Posters and Demos '17

ONLINE CRITICAL PATH ANALYSIS
[Figure: an instrumented system (Apache Flink, Apache Spark, TensorFlow, Heron, Timely Dataflow) with input and output streams emits a trace; periodic snapshots form a snapshot stream for the analyzer, which produces performance summaries]

TASK SCHEDULING IN APACHE SPARK
[Figure: DRIVER and workers W1, W2, W3]
Venkataraman, Shivaram, et al. "Drizzle: Fast and adaptable stream processing at scale." Spark Summit (2016).

SCHEDULING BOTTLENECK IN APACHE SPARK
[Figure: two panels, Conventional Profiling and Strymon Profiling; x-axis: Snapshot (0 to 15); y-axis labels: % weight and CP (0.0 to 0.8); series: Processing, Scheduling]
Apache Spark: Yahoo! Streaming Benchmark, 16 workers, 8s snapshots

EVALUATING LOAD BALANCING STRATEGIES
Traffic matrix simulation under topology and load balancing changes
▸ 13k services
▸ ~100K user requests/s
▸ OSPF routing
▸ Weighted Round-Robin load balancing
WHAT-IF DATAFLOW
Sources:
▸ Logging Servers: RPC logs (text)
▸ Topology Discovery: Topology changes (graph)
▸ Configuration DBs: Device specs, application placement (relational)
Dataflow stages: ETL, Load Balancing (LB View Construction), Routing (Routing View Construction), Traffic Matrix Construction, DC Model construction

Strymon 0.1.0 has been released!
Try it out:
https://strymon-system.github.io
https://github.com/strymon-system
Send us feedback:
strymon-users@lists.inf.ethz.ch

THE STRYMON TEAM & FRIENDS
Vasiliki Kalavri
Zaheer Chothia
Andrea Lattuada
Prof. Timothy Roscoe
Sebastian Wicki
Moritz Hoffmann
Desislava Dimitrova
John Liagouris
Frank McSherry

IT COULD BE YOU!
strymon.systems.ethz.ch
