Performance Checklists for SREs
Brendan Gregg
Senior Performance Architect
Performance Checklists
1.  uptime
2.  dmesg -T | tail
3.  vmstat 1
4.  mpstat -P ALL 1
5.  pidstat 1
6.  iostat -xz 1
7.  free -m
8.  sar -n DEV 1
9.  sar -n TCP,ETCP 1
10.  top
Dashboard panels (per instance / cloud wide):
1. RPS, CPU  2. Volume  3. Instances  4. Scaling  5. CPU/RPS  6. Load Avg  7. Java Heap  8. ParNew  9. Latency  10. 99th %-ile
SREcon 2016 Performance Checklists for SREs
Brendan the SRE
•  On the Perf Eng team & primary on-call rotation for Core:
our central SRE team
–  we get paged on SPS dips (starts per second) & more
•  In this talk I'll condense some perf engineering into SRE
timescales (minutes) using checklists
Performance Engineering != SRE Performance Incident Response

Performance Engineering
•  Aim: best price/performance possible
–  Can be endless: continual improvement
•  Fixes can take hours, days, weeks, months
–  Time to read docs & source code, experiment
–  Can take on large projects no single team would staff
•  Usually no prior "good" state
–  No spot the difference. No starting point.
–  Is now "good" or "bad"? Experience/instinct helps
•  Solo/team work
At Netflix: the Performance Engineering team, with help from developers
Performance Engineering

stat tools, tracers, benchmarks, documentation, source code, tuning, PMCs, profilers, flame graphs, monitoring, dashboards
SRE Perf Incident Response
•  Aim: resolve issue in minutes
–  Quick resolution is king. Can scale up, roll back, redirect traffic.
–  Must cope under pressure, and at 3am
•  Previously was in a "good" state
–  Spot the difference with historical graphs
•  Get immediate help from all staff
–  Must be social
•  Reliability & perf issues often related
At Netflix, the Core team (5 SREs), with immediate help
from developers and performance engineers
SRE Perf Incident Response

custom dashboards, central event logs, distributed system tracing, chat rooms, pager, ticket system
Netflix Cloud Analysis Process

(flow diagram) Example SRE response path, enumerated:
1. Check Issue: Atlas Alerts, Atlas Dashboards
2. Check Events: Chronos
3. Drill Down: Atlas Metrics
4. Check Dependencies: Salp, Mogul
5. Root Cause: SSH, instance tools
Also: ICE (cost), redirected to a new target, create new alert. Plus some other tools not pictured.
The Need for Checklists
•  Speed
•  Completeness
•  A Starting Point
•  An Ending Point
•  Reliability
•  Training
Perf checklists have historically
been created for perf engineering
(hours) not SRE response (minutes)
More on checklists: Gawande, A., The Checklist Manifesto. Metropolitan Books, 2008

Boeing 707 Emergency Checklist (1969)
SRE Checklists at Netflix
•  Some shared docs
–  PRE Triage Methodology
–  go/triage: a checklist of dashboards
•  Most "checklists" are really custom dashboards
–  Selected metrics for both reliability and performance
•  I maintain my own per-service and per-device checklists
SRE Performance Checklists
The following are:
•  Cloud performance checklists/dashboards
•  SSH/Linux checklists (lowest common denominator)
•  Methodologies for deriving cloud/instance checklists
Ad Hoc Methodology
Checklists
Dashboards
Including aspirational: what we want to do & build as dashboards
1. PRE Triage Checklist

Our initial checklist (Netflix specific)
PRE Triage Checklist
•  Performance and Reliability Engineering checklist
–  Shared doc with a hierarchical checklist, 66 steps total
1.  Initial Impact
1.  record timestamp
2.  quantify: SPS, signups, support calls
3.  check impact: regional or global?
4.  check devices: device specific?
2.  Time Correlations
1.  pretriage dashboard
1.  check for suspect NIWS client: error rates
2.  check for source of error/request rate change
3.  […dashboard specifics…]
Confirms, quantifies,
& narrows problem.
Helps you reason
about the cause.
PRE Triage Checklist, cont.
•  3. Evaluate Service Health
–  perfvitals dashboard
–  mogul dependency correlation
–  by cluster/asg/node:
•  latency: avg, 90th percentile
•  request rate
•  CPU: utilization, sys/user
•  Java heap: GC rate, leaks
•  memory
•  load average
•  thread contention (from Java)
•  JVM crashes
•  network: tput, sockets
•  […]
custom dashboards
2. predash

Initial dashboard (Netflix specific)

predash
Performance and Reliability Engineering dashboard
A list of selected dashboards suited for incident response
predash
List of dashboards is its own checklist:
1.  Overview
2.  Client stats
3.  Client errors & retries
4.  NIWS HTTP errors
5.  NIWS Errors by code
6.  DRM request overview
7.  DoS attack metrics
8.  Push map
9.  Cluster status
...
3. perfvitals

Service dashboard
(dashboard panels) 1. RPS, CPU  2. Volume  3. Instances  4. Scaling  5. CPU/RPS  6. Load Avg  7. Java Heap  8. ParNew  9. Latency  10. 99th %-ile

perfvitals
4. Cloud Application Performance Dashboard

A generic example

Cloud App Perf Dashboard
1.  Load
2.  Errors
3.  Latency
4.  Saturation
5.  Instances
Cloud App Perf Dashboard
1.  Load
2.  Errors
3.  Latency
4.  Saturation
5.  Instances
All time series, for every application, and dependencies.
Draw a functional diagram with the entire data path.
Same as Google's "Four Golden Signals" (Latency, Traffic,
Errors, Saturation), with instances added due to cloud
–  Beyer, B., Jones, C., Petoff, J., Murphy, N. Site Reliability Engineering.
O'Reilly, Apr 2016
Load: problem of load applied? req/sec, by type
Errors: errors, timeouts, retries
Latency: response time average, 99th %-ile, distribution
Saturation: CPU load averages, queue length/time
Instances: scale up/down? count, state, version
5. Bad Instance Dashboard

An Anti-Methodology

Bad Instance Dashboard
1.  Plot request time per-instance
2.  Find the bad instance
3.  Terminate bad instance
4.  Someone else’s problem now!
In SRE incident response, if it works,
do it.
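The four steps above can be sketched as a script. A minimal sketch, assuming a metrics dump of "instance-id latency_ms" lines (format hypothetical); the terminate call is left commented out:

```shell
#!/bin/sh
# Sketch only: find the instance with the worst latency and mark it for
# termination. The "instance-id latency_ms" input format is hypothetical.
latencies='
i-0a1 112
i-0b2 4890
i-0c3 131
'
bad=$(echo "$latencies" | awk 'NF == 2 && $2 > max { max = $2; bad = $1 }
                               END { print bad }')
echo "bad instance: $bad"
# aws ec2 terminate-instances --instance-ids "$bad"   # someone else's problem now!
```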
(screenshot) 95th percentile latency per instance (Atlas Exploder): the bad instance stands out. Terminate!
Lots More Dashboards
We have countless more,
mostly app specific and
reliability focused
•  Most reliability incidents
involve time correlation with a
central log system
Sometimes, dashboards &
monitoring aren't enough.
Time for SSH.
NIWS HTTP errors: (dashboard) error types, regions, apps, over time
6. Linux Performance Analysis in 60,000 milliseconds

Linux Perf Analysis in 60s
1.  uptime
2.  dmesg -T | tail
3.  vmstat 1
4.  mpstat -P ALL 1
5.  pidstat 1
6.  iostat -xz 1
7.  free -m
8.  sar -n DEV 1
9.  sar -n TCP,ETCP 1
10.  top
Linux Perf Analysis in 60s

1.  uptime                   load averages
2.  dmesg -T | tail          kernel errors
3.  vmstat 1                 overall stats by time
4.  mpstat -P ALL 1          CPU balance
5.  pidstat 1                process usage
6.  iostat -xz 1             disk I/O
7.  free -m                  memory usage
8.  sar -n DEV 1             network I/O
9.  sar -n TCP,ETCP 1        TCP stats
10. top                      check overview

http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html
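One way to run the checklist end to end is a small wrapper script. A sketch, assuming the sysstat package provides mpstat, pidstat, iostat, and sar; missing tools are reported inline rather than aborting the run:

```shell
#!/bin/sh
# Sketch: run the 60-second checklist in order, truncating each tool's output.
# Counts of 2 keep each sampling tool to ~2 seconds.
for cmd in "uptime" \
           "dmesg -T" \
           "vmstat 1 2" \
           "mpstat -P ALL 1 2" \
           "pidstat 1 2" \
           "iostat -xz 1 2" \
           "free -m" \
           "sar -n DEV 1 2" \
           "sar -n TCP,ETCP 1 2"
do
    echo "== $cmd"
    $cmd 2>&1 | tail -n 15       # last lines only; errors shown inline
done
echo "== top"
top -b -n 1 2>&1 | head -n 15    # batch mode, one sample
```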
  
60s: uptime, dmesg, vmstat
$ uptime
23:51:26 up 21:31, 1 user, load average: 30.02, 26.43, 19.02
$ dmesg | tail
[1880957.563150] perl invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[...]
[1880957.563400] Out of memory: Kill process 18694 (perl) score 246 or sacrifice child
[1880957.563408] Killed process 18694 (perl) total-vm:1972392kB, anon-rss:1953348kB, file-rss:0kB
[2320864.954447] TCP: Possible SYN flooding on port 7001. Dropping request. Check SNMP counters.
$ vmstat 1
procs ---------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
34 0 0 200889792 73708 591828 0 0 0 5 6 10 96 1 3 0 0
32 0 0 200889920 73708 591860 0 0 0 592 13284 4282 98 1 1 0 0
32 0 0 200890112 73708 591860 0 0 0 0 9501 2154 99 1 0 0 0
32 0 0 200889568 73712 591856 0 0 0 48 11900 2459 99 0 0 0 0
32 0 0 200890208 73712 591860 0 0 0 0 15898 4840 98 1 1 0 0
^C
60s: mpstat
$ mpstat -P ALL 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)
07:38:49 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
07:38:50 PM all 98.47 0.00 0.75 0.00 0.00 0.00 0.00 0.00 0.00 0.78
07:38:50 PM 0 96.04 0.00 2.97 0.00 0.00 0.00 0.00 0.00 0.00 0.99
07:38:50 PM 1 97.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 2.00
07:38:50 PM 2 98.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
07:38:50 PM 3 96.97 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.03
[...]
60s: pidstat
$ pidstat 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)
07:41:02 PM UID PID %usr %system %guest %CPU CPU Command
07:41:03 PM 0 9 0.00 0.94 0.00 0.94 1 rcuos/0
07:41:03 PM 0 4214 5.66 5.66 0.00 11.32 15 mesos-slave
07:41:03 PM 0 4354 0.94 0.94 0.00 1.89 8 java
07:41:03 PM 0 6521 1596.23 1.89 0.00 1598.11 27 java
07:41:03 PM 0 6564 1571.70 7.55 0.00 1579.25 28 java
07:41:03 PM 60004 60154 0.94 4.72 0.00 5.66 9 pidstat
07:41:03 PM UID PID %usr %system %guest %CPU CPU Command
07:41:04 PM 0 4214 6.00 2.00 0.00 8.00 15 mesos-slave
07:41:04 PM 0 6521 1590.00 1.00 0.00 1591.00 27 java
07:41:04 PM 0 6564 1573.00 10.00 0.00 1583.00 28 java
07:41:04 PM 108 6718 1.00 0.00 0.00 1.00 0 snmp-pass
07:41:04 PM 60004 60154 1.00 4.00 0.00 5.00 9 pidstat
^C
60s: iostat
$ iostat -xmdz 1
Linux 3.13.0-29 (db001-eb883efa) 08/18/2014 _x86_64_ (16 CPU)
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s ...
xvda 0.00 0.00 0.00 0.00 0.00 0.00 ...
xvdb 213.00 0.00 15299.00 0.00 338.17 0.00 ...
xvdc 129.00 0.00 15271.00 3.00 336.65 0.01 ...
md0 0.00 0.00 31082.00 3.00 678.45 0.01 ...
... avgqu-sz await r_await w_await svctm %util
... 0.00 0.00 0.00 0.00 0.00 0.00
... 126.09 8.22 8.22 0.00 0.06 86.40
... 99.31 6.47 6.47 0.00 0.06 86.00
... 0.00 0.00 0.00 0.00 0.00 0.00
(left columns: workload; right columns: resulting performance)
60s: free, sar -n DEV
$ free -m
total used free shared buffers cached
Mem: 245998 24545 221453 83 59 541
-/+ buffers/cache: 23944 222053
Swap: 0 0 0
$ sar -n DEV 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)
12:16:48 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
12:16:49 AM eth0 18763.00 5032.00 20686.42 478.30 0.00 0.00 0.00 0.00
12:16:49 AM lo 14.00 14.00 1.36 1.36 0.00 0.00 0.00 0.00
12:16:49 AM docker0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
12:16:49 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
12:16:50 AM eth0 19763.00 5101.00 21999.10 482.56 0.00 0.00 0.00 0.00
12:16:50 AM lo 20.00 20.00 3.25 3.25 0.00 0.00 0.00 0.00
12:16:50 AM docker0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
^C
60s: sar -n TCP,ETCP
$ sar -n TCP,ETCP 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_
(32 CPU)
12:17:19 AM active/s passive/s iseg/s oseg/s
12:17:20 AM 1.00 0.00 10233.00 18846.00
12:17:19 AM atmptf/s estres/s retrans/s isegerr/s orsts/s
12:17:20 AM 0.00 0.00 0.00 0.00 0.00
12:17:20 AM active/s passive/s iseg/s oseg/s
12:17:21 AM 1.00 0.00 8359.00 6039.00
12:17:20 AM atmptf/s estres/s retrans/s isegerr/s orsts/s
12:17:21 AM 0.00 0.00 0.00 0.00 0.00
^C
60s: top
$ top
top - 00:15:40 up 21:56, 1 user, load average: 31.09, 29.87, 29.92
Tasks: 871 total, 1 running, 868 sleeping, 0 stopped, 2 zombie
%Cpu(s): 96.8 us, 0.4 sy, 0.0 ni, 2.7 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 25190241+total, 24921688 used, 22698073+free, 60448 buffers
KiB Swap: 0 total, 0 used, 0 free. 554208 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20248 root 20 0 0.227t 0.012t 18748 S 3090 5.2 29812:58 java
4213 root 20 0 2722544 64640 44232 S 23.5 0.0 233:35.37 mesos-slave
66128 titancl+ 20 0 24344 2332 1172 R 1.0 0.0 0:00.07 top
5235 root 20 0 38.227g 547004 49996 S 0.7 0.2 2:02.74 java
4299 root 20 0 20.015g 2.682g 16836 S 0.3 1.1 33:14.42 java
1 root 20 0 33620 2920 1496 S 0.0 0.0 0:03.82 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.02 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:05.35 ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
6 root 20 0 0 0 0 S 0.0 0.0 0:06.94 kworker/u256:0
8 root 20 0 0 0 0 S 0.0 0.0 2:38.05 rcu_sched
Other Analysis in 60s
•  We need such checklists for:
–  Java
–  Cassandra
–  MySQL
–  Nginx
–  etc…
•  Can follow a methodology:
–  Process of elimination
–  Workload characterization
–  Differential diagnosis
–  Some summaries: http://www.brendangregg.com/methodology.html
•  Turn checklists into dashboards (many do exist)
7. Linux Disk Checklist
Linux Disk Checklist
1.  iostat -xnz 1
2.  vmstat 1
3.  df -h
4.  ext4slower 10
5.  bioslower 10
6.  ext4dist 1
7.  biolatency 1
8.  cat /sys/devices/…/ioerr_cnt
9.  smartctl -l error /dev/sda1
Linux Disk Checklist

1.  iostat -xnz 1                     any disk I/O? if not, stop looking
2.  vmstat 1                          is this swapping? or, high sys time?
3.  df -h                             are file systems nearly full?
4.  ext4slower 10                     (zfs*, xfs*, etc.) slow file system I/O?
5.  bioslower 10                      if so, check disks
6.  ext4dist 1                        check distribution and rate
7.  biolatency 1                      if interesting, check disks
8.  cat /sys/devices/…/ioerr_cnt      (if available) errors
9.  smartctl -l error /dev/sda1       (if available) errors

Another short checklist. Won't solve everything. FS focused.
ext4slower/dist, bioslower, are from bcc/BPF tools.
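Step 8's sysfs path varies by kernel and driver. A sketch sweeping one common location for SCSI devices (an assumption; cloud instances with virtual disks may expose none):

```shell
#!/bin/sh
# Sketch: sweep SCSI device I/O error counters. The sysfs path below is one
# common location for ioerr_cnt; it varies by kernel/driver, and the loop
# simply prints nothing when no SCSI devices exist.
for f in /sys/class/scsi_device/*/device/ioerr_cnt; do
    [ -e "$f" ] || continue              # glob did not match: no SCSI devices
    printf '%s %s\n' "$f" "$(cat "$f")"
done
```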
ext4slower
•  ext4 operations slower than the threshold:
•  Better indicator of application pain than disk I/O
•  Measures & filters in-kernel for efficiency using BPF
–  From https://github.com/iovisor/bcc
# ./ext4slower 1
Tracing ext4 operations slower than 1 ms
TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME
06:49:17 bash 3616 R 128 0 7.75 cksum
06:49:17 cksum 3616 R 39552 0 1.34 [
06:49:17 cksum 3616 R 96 0 5.36 2to3-2.7
06:49:17 cksum 3616 R 96 0 14.94 2to3-3.4
06:49:17 cksum 3616 R 10320 0 6.82 411toppm
06:49:17 cksum 3616 R 65536 0 4.01 a2p
06:49:17 cksum 3616 R 55400 0 8.77 ab
06:49:17 cksum 3616 R 36792 0 16.34 aclocal-1.14
06:49:17 cksum 3616 R 15008 0 19.31 acpi_listen
[…]
BPF is coming… Free your mind

BPF
•  That file system checklist should be a dashboard:
–  FS & disk latency histograms, heatmaps, IOPS, outlier log
•  Now possible with enhanced BPF (Berkeley Packet Filter)
–  Built into Linux 4.x: dynamic tracing, filters, histograms
System dashboards of 2017+ should look very different
8. Linux Network Checklist

Linux Network Checklist
1.  sar -n DEV,EDEV 1
2.  sar -n TCP,ETCP 1
3.  cat /etc/resolv.conf
4.  mpstat -P ALL 1
5.  tcpretrans
6.  tcpconnect
7.  tcpaccept
8.  netstat -rnv
9.  check firewall config
10.  netstat -s
Linux Network Checklist

1.  sar -n DEV,EDEV 1         at interface limits? or use nicstat
2.  sar -n TCP,ETCP 1         active/passive load, retransmit rate
3.  cat /etc/resolv.conf      it's always DNS
4.  mpstat -P ALL 1           high kernel time? single hot CPU?
5.  tcpretrans                what are the retransmits? state?
6.  tcpconnect                connecting to anything unexpected?
7.  tcpaccept                 unexpected workload?
8.  netstat -rnv              any inefficient routes?
9.  check firewall config     anything blocking/throttling?
10. netstat -s                play 252 metric pickup

tcp*, are from bcc/BPF tools
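Step 2's retransmit rate can also be derived directly from /proc/net/snmp, the same counters sar reads. A sketch (column positions assume the standard Tcp: line layout):

```shell
#!/bin/sh
# Sketch: TCP retransmit ratio from /proc/net/snmp (OutSegs, RetransSegs).
# The header "Tcp:" line is skipped by requiring a numeric field 13; a ratio
# much above ~1% is usually worth chasing.
awk '/^Tcp:/ && $13 ~ /^[0-9]+$/ {
    if ($12 > 0)
        printf "retransmit ratio: %.2f%%\n", 100 * $13 / $12
}' /proc/net/snmp
```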
tcpretrans
•  Just trace kernel TCP retransmit functions for efficiency:
•  From either bcc (BPF) or perf-tools (ftrace, older kernels)
# ./tcpretrans
TIME PID IP LADDR:LPORT T> RADDR:RPORT STATE
01:55:05 0 4 10.153.223.157:22 R> 69.53.245.40:34619 ESTABLISHED
01:55:05 0 4 10.153.223.157:22 R> 69.53.245.40:34619 ESTABLISHED
01:55:17 0 4 10.153.223.157:22 R> 69.53.245.40:22957 ESTABLISHED
[…]
9. Linux CPU Checklist
(too many lines – should be a utilization heat map)
http://www.brendangregg.com/HeatMaps/subsecondoffset.html
$ perf script
[…]
java 14327 [022] 252764.179741: cycles: 7f36570a4932 SpinPause (/usr/lib/jvm/java-8
java 14315 [014] 252764.183517: cycles: 7f36570a4932 SpinPause (/usr/lib/jvm/java-8
java 14310 [012] 252764.185317: cycles: 7f36570a4932 SpinPause (/usr/lib/jvm/java-8
java 14332 [015] 252764.188720: cycles: 7f3658078350 pthread_cond_wait@@GLIBC_2.3.2
java 14341 [019] 252764.191307: cycles: 7f3656d150c8 ClassLoaderDataGraph::do_unloa
java 14341 [019] 252764.198825: cycles: 7f3656d140b8 ClassLoaderData::free_dealloca
java 14341 [019] 252764.207057: cycles: 7f3657192400 nmethod::do_unloading(BoolObje
java 14341 [019] 252764.215962: cycles: 7f3656ba807e Assembler::locate_operand(unsi
java 14341 [019] 252764.225141: cycles: 7f36571922e8 nmethod::do_unloading(BoolObje
java 14341 [019] 252764.234578: cycles: 7f3656ec4960 CodeHeap::block_start(void*) c
[…]
Linux CPU Checklist
1.  uptime
2.  vmstat 1
3.  mpstat -P ALL 1
4.  pidstat 1
5.  CPU flame graph
6.  CPU subsecond offset heat map
7.  perf stat -a -- sleep 10
Linux CPU Checklist

1.  uptime                          load averages
2.  vmstat 1                        system-wide utilization, run q length
3.  mpstat -P ALL 1                 CPU balance
4.  pidstat 1                       per-process CPU
5.  CPU flame graph                 CPU profiling
6.  CPU subsecond offset heat map   look for gaps
7.  perf stat -a -- sleep 10        IPC, LLC hit ratio

htop can do 1-4
htop
CPU Flame Graph

perf_events CPU Flame Graphs
•  We have this automated in Netflix Vector:
•  Flame graph interpretation:
–  x-axis: alphabetical stack sort, to maximize merging
–  y-axis: stack depth
–  color: random, or hue can be a dimension (eg, diff)
–  Top edge is on-CPU, beneath it is ancestry
•  Can also do Java & Node.js. Differentials.
•  We're working on a d3 version for Vector
git clone --depth 1 https://github.com/brendangregg/FlameGraph
cd FlameGraph
perf record -F 99 -a -g -- sleep 30
perf script | ./stackcollapse-perf.pl |./flamegraph.pl > perf.svg
10. Tools Method

An Anti-Methodology

Tools Method
1.  RUN EVERYTHING AND HOPE FOR THE BEST
For SRE response: a mental checklist to see what might
have been missed (no time to run them all)
Linux Perf Observability Tools
Linux Static Performance Tools
Linux perf-tools (ftrace, perf)
Linux bcc tools (BPF): needs Linux 4.x, CONFIG_BPF_SYSCALL=y
11. USE Method

A Methodology

The USE Method
•  For every resource, check:
1.  Utilization
2.  Saturation
3.  Errors
•  Definitions:
–  Utilization: busy time
–  Saturation: queue length or queued time
–  Errors: easy to interpret (objective)
Used to generate checklists. Starts with the questions,
then finds the tools.
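As a concrete instance of "start with the questions, then find the tools": the U and S questions for the CPU resource can be answered straight from /proc. A sketch; load average is a coarse utilization-plus-saturation proxy:

```shell
#!/bin/sh
# Sketch: USE questions for the CPU resource, answered from /proc/loadavg.
# The 4th field ("running/total") gives currently runnable threads.
ncpu=$(getconf _NPROCESSORS_ONLN)
load1=$(awk '{ print $1 }' /proc/loadavg)
runnable=$(awk '{ split($4, a, "/"); print a[1] }' /proc/loadavg)
echo "CPUs=$ncpu  load1=$load1  runnable=$runnable"
awk -v n="$ncpu" -v l="$load1" 'BEGIN {
    if (l > n) print "saturated: load exceeds CPU count"
    else       print "headroom: load is under CPU count"
}'
# Errors: rare for CPUs; check machine-check logs (e.g. dmesg | grep -i mce)
```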
(diagram) Resource: Utilization (%), errors marked X
USE Method for Hardware
•  For every resource, check:
1.  Utilization
2.  Saturation
3.  Errors
•  Including busses & interconnects
(http://www.brendangregg.com/USEmethod/use-linux.html)
USE Method for Distributed Systems
•  Draw a service diagram, and for every service:
1.  Utilization: resource usage (CPU, network)
2.  Saturation: request queueing, timeouts
3.  Errors
•  Turn into a dashboard
Netflix Vector
•  Real time instance analysis tool
–  https://github.com/netflix/vector
–  http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html
•  USE method-inspired metrics
–  More in development, incl. flame graphs
Netflix Vector

CPU: utilization, saturation
Network: utilization, saturation, load
Memory: utilization, saturation
Disk: load, utilization, saturation
12. Bonus: External Factor Checklist

External Factor Checklist
1.  Sports ball?
2.  Power outage?
3.  Snow storm?
4.  Internet/ISP down?
5.  Vendor firmware update?
6.  Public holiday/celebration?
7.  Chaos Kong?
Social media searches (Twitter) often useful
–  Can also be NSFW
Takeaways
•  Checklists are great
–  Speed, Completeness, Starting/Ending Point, Training
–  Can be ad hoc, or from a methodology (USE method)
•  Service dashboards
–  Serve as checklists
–  Metrics: Load, Errors, Latency, Saturation, Instances
•  System dashboards with Linux BPF
–  Latency histograms & heatmaps, etc. Free your mind.
Please create and share more checklists
References
•  Netflix Tech Blog:
•  http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html
•  http://techblog.netflix.com/2015/02/sps-pulse-of-netflix-streaming.html
•  http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html
•  Linux Performance & BPF tools:
•  http://www.brendangregg.com/linuxperf.html
•  https://github.com/iovisor/bcc#tools
•  USE Method Linux:
•  http://www.brendangregg.com/USEmethod/use-linux.html
•  Flame Graphs:
•  http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
•  Heat maps:
•  http://cacm.acm.org/magazines/2010/7/95062-visualizing-system-latency/fulltext
•  http://www.brendangregg.com/heatmaps.html
•  Books:
•  Beyer, B., et al. Site Reliability Engineering. O'Reilly, Apr 2016
•  Gawande, A. The Checklist Manifesto. Metropolitan Books, 2008
•  Gregg, B. Systems Performance. Prentice Hall, 2013 (more checklists & methods!)
•  Thanks: Netflix Perf & Core teams for predash, pretriage, Vector, etc
Thanks
http://slideshare.net/brendangregg
http://www.brendangregg.com
bgregg@netflix.com
@brendangregg
Netflix is hiring SREs!
Brendan Gregg
 
Performance Tuning EC2 Instances
Performance Tuning EC2 InstancesPerformance Tuning EC2 Instances
Performance Tuning EC2 Instances
Brendan Gregg
 
Blazing Performance with Flame Graphs
Blazing Performance with Flame GraphsBlazing Performance with Flame Graphs
Blazing Performance with Flame Graphs
Brendan Gregg
 
ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016
Brendan Gregg
 
Stop the Guessing: Performance Methodologies for Production Systems
Stop the Guessing: Performance Methodologies for Production SystemsStop the Guessing: Performance Methodologies for Production Systems
Stop the Guessing: Performance Methodologies for Production Systems
Brendan Gregg
 
Netflix: From Clouds to Roots
Netflix: From Clouds to RootsNetflix: From Clouds to Roots
Netflix: From Clouds to Roots
Brendan Gregg
 
Linux BPF Superpowers
Linux BPF SuperpowersLinux BPF Superpowers
Linux BPF Superpowers
Brendan Gregg
 
Kernel Recipes 2017: Using Linux perf at Netflix
Kernel Recipes 2017: Using Linux perf at NetflixKernel Recipes 2017: Using Linux perf at Netflix
Kernel Recipes 2017: Using Linux perf at Netflix
Brendan Gregg
 
G1 Garbage Collector: Details and Tuning
G1 Garbage Collector: Details and TuningG1 Garbage Collector: Details and Tuning
G1 Garbage Collector: Details and Tuning
Simone Bordet
 
Troubleshooting PostgreSQL Streaming Replication
Troubleshooting PostgreSQL Streaming ReplicationTroubleshooting PostgreSQL Streaming Replication
Troubleshooting PostgreSQL Streaming Replication
Alexey Lesovsky
 
Am I reading GC logs Correctly?
Am I reading GC logs Correctly?Am I reading GC logs Correctly?
Am I reading GC logs Correctly?
Tier1 App
 
Linux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPFLinux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPF
Brendan Gregg
 
RxNetty vs Tomcat Performance Results
RxNetty vs Tomcat Performance ResultsRxNetty vs Tomcat Performance Results
RxNetty vs Tomcat Performance Results
Brendan Gregg
 
Row Pattern Matching in SQL:2016
Row Pattern Matching in SQL:2016Row Pattern Matching in SQL:2016
Row Pattern Matching in SQL:2016
Markus Winand
 
Designing Tracing Tools
Designing Tracing ToolsDesigning Tracing Tools
Designing Tracing Tools
Brendan Gregg
 
Java Performance Analysis on Linux with Flame Graphs
Java Performance Analysis on Linux with Flame GraphsJava Performance Analysis on Linux with Flame Graphs
Java Performance Analysis on Linux with Flame Graphs
Brendan Gregg
 
Shell,信号量以及java进程的退出
Shell,信号量以及java进程的退出Shell,信号量以及java进程的退出
Shell,信号量以及java进程的退出
wang hongjiang
 
Linux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old SecretsLinux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old Secrets
Brendan Gregg
 
Linux Systems Performance 2016
Linux Systems Performance 2016Linux Systems Performance 2016
Linux Systems Performance 2016
Brendan Gregg
 
SREcon 2016 Performance Checklists for SREs

  • 1. Performance Checklists for SREs — Brendan Gregg, Senior Performance Architect
  • 2. Performance Checklists — per instance: 1. uptime 2. dmesg -T | tail 3. vmstat 1 4. mpstat -P ALL 1 5. pidstat 1 6. iostat -xz 1 7. free -m 8. sar -n DEV 1 9. sar -n TCP,ETCP 1 10. top. Cloud wide: 1. RPS, CPU 2. Volume 3. Instances 4. Scaling 5. CPU/RPS 6. Load Avg 7. Java Heap 8. ParNew 9. Latency 10. 99th %ile
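The per-instance column above is the Linux "first 60 seconds" command set. A minimal sketch of running it as a single bounded pass, assuming a Linux host with the sysstat package installed (which provides mpstat, pidstat, iostat, and sar); the interval/count arguments ("1 5") are added here so each sampler stops on its own:

```shell
#!/bin/sh
# One bounded pass over the per-instance checklist.
# "1 5" = 1-second interval, 5 samples, so each sampler terminates.
for cmd in \
    "uptime" \
    "dmesg -T | tail" \
    "vmstat 1 5" \
    "mpstat -P ALL 1 5" \
    "pidstat 1 5" \
    "iostat -xz 1 5" \
    "free -m" \
    "sar -n DEV 1 5" \
    "sar -n TCP,ETCP 1 5" \
    "top -b -n 1 | head -20"
do
    printf '\n===== %s =====\n' "$cmd"
    # Tolerate missing tools (e.g. sysstat not installed) and keep going.
    sh -c "$cmd" 2>&1 || printf '(%s unavailable on this host)\n' "${cmd%% *}"
done
```

Run sequentially this takes around 45 seconds because of the 5-sample runs; during a real incident the tools are typically run in parallel terminals instead.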
  • 4. Brendan the SRE — On the Perf Eng team & primary on-call rotation for Core, our central SRE team: we get paged on SPS dips (starts per second) & more. In this talk I'll condense some perf engineering into SRE timescales (minutes) using checklists.
  • 5. Performance Engineering != SRE Performance Incident Response
  • 6. Performance Engineering — Aim: best price/performance possible. Can be endless: continual improvement. Fixes can take hours, days, weeks, months: time to read docs & source code, experiment; can take on large projects no single team would staff. Usually no prior "good" state: no spot-the-difference, no starting point; is now "good" or "bad"? Experience/instinct helps. Solo/team work. At Netflix: the Performance Engineering team, with help from developers.
  • 8. Performance  Engineering   stat tools tracers benchmarks documentation source code tuning PMCs profilers flame graphs monitoring dashboards
  • 9. SRE Perf Incident Response • Aim: resolve issue in minutes – Quick resolution is king. Can scale up, roll back, redirect traffic. – Must cope under pressure, and at 3am • Previously was in a "good" state – Spot the difference with historical graphs • Get immediate help from all staff – Must be social • Reliability & perf issues often related. At Netflix: the Core team (5 SREs), with immediate help from developers and performance engineers
  • 10. SRE  Perf  Incident  Response  
  • 11. SRE  Perf  Incident  Response   custom dashboards central event logs distributed system tracing chat rooms pager ticket system
  • 12. Netflix Cloud Analysis Process (example SRE response path enumerated): 1. Check Issue (Atlas Alerts → Atlas Dashboards) 2. Check Events (Chronos) 3. Drill Down (Atlas Metrics → Salp, Mogul) 4. Check Dependencies (Mogul, ICE for cost) 5. Root Cause (SSH, instance tools). May be redirected to a new target, or create a new alert. Plus some other tools not pictured.
  • 13. The  Need  for  Checklists   •  Speed •  Completeness •  A Starting Point •  An Ending Point •  Reliability •  Training Perf checklists have historically been created for perf engineering (hours) not SRE response (minutes) More on checklists: Gawande, A., The Checklist Manifesto. Metropolitan Books, 2008 Boeing  707  Emergency  Checklist  (1969)  
  • 14. SRE  Checklists  at  NeSlix   •  Some shared docs –  PRE Triage Methodology –  go/triage: a checklist of dashboards •  Most "checklists" are really custom dashboards –  Selected metrics for both reliability and performance •  I maintain my own per-service and per-device checklists
  • 15. SRE  Performance  Checklists   The following are: •  Cloud performance checklists/dashboards •  SSH/Linux checklists (lowest common denominator) •  Methodologies for deriving cloud/instance checklists Ad Hoc Methodology Checklists Dashboards Including aspirational: what we want to do & build as dashboards
  • 16. 1. PRE Triage Checklist: our initial checklist (Netflix specific)
  • 17. PRE Triage Checklist • Performance and Reliability Engineering checklist – Shared doc with a hierarchical checklist, 66 steps total. 1. Initial Impact: 1. record timestamp 2. quantify: SPS, signups, support calls 3. check impact: regional or global? 4. check devices: device specific? 2. Time Correlations: 1. pretriage dashboard 1. check for suspect NIWS client: error rates 2. check for source of error/request rate change 3. […dashboard specifics…] Confirms, quantifies, & narrows the problem. Helps you reason about the cause.
  • 18. PRE  Triage  Checklist.  cont.   •  3. Evaluate Service Health –  perfvitals dashboard –  mogul dependency correlation –  by cluster/asg/node: •  latency: avg, 90 percentile •  request rate •  CPU: utilization, sys/user •  Java heap: GC rate, leaks •  memory •  load average •  thread contention (from Java) •  JVM crashes •  network: tput, sockets •  […] custom dashboards
  • 19. 2. predash: initial dashboard (Netflix specific)
  • 20. predash   Performance and Reliability Engineering dashboard A list of selected dashboards suited for incident response
  • 21. predash   List of dashboards is its own checklist: 1.  Overview 2.  Client stats 3.  Client errors & retries 4.  NIWS HTTP errors 5.  NIWS Errors by code 6.  DRM request overview 7.  DoS attack metrics 8.  Push map 9.  Cluster status ...
  • 22. 3.  perfvitals     Service  dashboard  
  • 23. perfvitals: 1. RPS, CPU 2. Volume 3. Instances 4. Scaling 5. CPU/RPS 6. Load Avg 7. Java Heap 8. ParNew 9. Latency 10. 99th %ile
  • 24. 4. Cloud Application Performance Dashboard: a generic example
  • 25. Cloud  App  Perf  Dashboard   1.  Load 2.  Errors 3.  Latency 4.  Saturation 5.  Instances
  • 26. Cloud App Perf Dashboard: 1. Load – problem of load applied? req/sec, by type 2. Errors – errors, timeouts, retries 3. Latency – response time average, 99th %ile, distribution 4. Saturation – CPU load averages, queue length/time 5. Instances – scale up/down? count, state, version. All time series, for every application, and dependencies. Draw a functional diagram with the entire data path. Same as Google's "Four Golden Signals" (Latency, Traffic, Errors, Saturation), with instances added due to cloud – Beyer, B., Jones, C., Petoff, J., Murphy, N. Site Reliability Engineering. O'Reilly, Apr 2016
  • 27. 5. Bad Instance Dashboard: an Anti-Methodology
  • 28. Bad Instance Dashboard: 1. Plot request time per-instance 2. Find the bad instance 3. Terminate bad instance 4. Someone else's problem now! In SRE incident response, if it works, do it. (Screenshot: 95th percentile latency in Atlas Exploder; bad instance stands out. Terminate!)
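The eyeball step above — spot the instance whose latency stands apart — can be mechanized. A toy sketch of steps 1–2, flagging any instance above 2× the median latency; the instance IDs, latencies, and the 2× threshold are all made up for illustration:

```shell
# Flag instances whose latency exceeds 2x the median: a toy version of
# "find the bad instance". Instance IDs and latencies here are invented.
bad=$(printf 'i-aaa 102\ni-bbb 98\ni-ccc 910\ni-ddd 105\n' | sort -k2 -n |
    awk '{ lat[NR] = $2; name[NR] = $1 }
        END {
            median = lat[int((NR + 1) / 2)]
            for (i = 1; i <= NR; i++)
                if (lat[i] > 2 * median) print name[i]
        }')
echo "terminate candidate(s): $bad"
```

In practice the input would come from your metrics system (Atlas, in the Netflix case) rather than printf.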
  • 29. Lots More Dashboards. We have countless more, mostly app specific and reliability focused • Most reliability incidents involve time correlation with a central log system. Sometimes, dashboards & monitoring aren't enough. Time for SSH. (Screenshot: NIWS HTTP errors dashboard, broken down by error type, region, app, and time.)
  • 30. 6.  Linux  Performance  Analysis   in   60,000  milliseconds  
  • 31. Linux  Perf  Analysis  in  60s   1.  uptime 2.  dmesg -T | tail 3.  vmstat 1 4.  mpstat -P ALL 1 5.  pidstat 1 6.  iostat -xz 1 7.  free -m 8.  sar -n DEV 1 9.  sar -n TCP,ETCP 1 10.  top
  • 32. Linux Perf Analysis in 60s: 1. uptime – load averages 2. dmesg -T | tail – kernel errors 3. vmstat 1 – overall stats by time 4. mpstat -P ALL 1 – CPU balance 5. pidstat 1 – process usage 6. iostat -xz 1 – disk I/O 7. free -m – memory usage 8. sar -n DEV 1 – network I/O 9. sar -n TCP,ETCP 1 – TCP stats 10. top – check overview. http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html
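If sysstat (vmstat/mpstat/sar) isn't installed on a box, several of the same first-pass signals can be read straight from /proc. A minimal fallback sketch, not a replacement for the tools above:

```shell
# Fallback first-pass stats from /proc when sysstat isn't installed.
# Covers a few of the 60s-checklist signals; field layouts are the
# standard Linux /proc formats.

# 1. load averages (what uptime reports)
read load1 load5 load15 rest < /proc/loadavg
echo "load averages: $load1 $load5 $load15"

# 3. overall CPU jiffies since boot (the us/sy/id that vmstat summarizes)
read cpuword user nice system idle rest < /proc/stat
echo "cpu jiffies: user=$user sys=$system idle=$idle"

# 7. memory in MB (free -m equivalent)
memtotal_mb=$(awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo)
echo "MemTotal: ${memtotal_mb} MB"
```

These are counters and gauges only; the interval deltas that make vmstat/mpstat useful still need two samples.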
  • 33. 60s:  upQme,  dmesg,  vmstat   $ uptime 23:51:26 up 21:31, 1 user, load average: 30.02, 26.43, 19.02 $ dmesg | tail [1880957.563150] perl invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0 [...] [1880957.563400] Out of memory: Kill process 18694 (perl) score 246 or sacrifice child [1880957.563408] Killed process 18694 (perl) total-vm:1972392kB, anon-rss:1953348kB, file-rss:0kB [2320864.954447] TCP: Possible SYN flooding on port 7001. Dropping request. Check SNMP counters. $ vmstat 1 procs ---------memory---------- ---swap-- -----io---- -system-- ------cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 34 0 0 200889792 73708 591828 0 0 0 5 6 10 96 1 3 0 0 32 0 0 200889920 73708 591860 0 0 0 592 13284 4282 98 1 1 0 0 32 0 0 200890112 73708 591860 0 0 0 0 9501 2154 99 1 0 0 0 32 0 0 200889568 73712 591856 0 0 0 48 11900 2459 99 0 0 0 0 32 0 0 200890208 73712 591860 0 0 0 0 15898 4840 98 1 1 0 0 ^C
  • 34. 60s:  mpstat   $ mpstat -P ALL 1 Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU) 07:38:49 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle 07:38:50 PM all 98.47 0.00 0.75 0.00 0.00 0.00 0.00 0.00 0.00 0.78 07:38:50 PM 0 96.04 0.00 2.97 0.00 0.00 0.00 0.00 0.00 0.00 0.99 07:38:50 PM 1 97.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 2.00 07:38:50 PM 2 98.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 07:38:50 PM 3 96.97 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.03 [...]
  • 35. 60s:  pidstat   $ pidstat 1 Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU) 07:41:02 PM UID PID %usr %system %guest %CPU CPU Command 07:41:03 PM 0 9 0.00 0.94 0.00 0.94 1 rcuos/0 07:41:03 PM 0 4214 5.66 5.66 0.00 11.32 15 mesos-slave 07:41:03 PM 0 4354 0.94 0.94 0.00 1.89 8 java 07:41:03 PM 0 6521 1596.23 1.89 0.00 1598.11 27 java 07:41:03 PM 0 6564 1571.70 7.55 0.00 1579.25 28 java 07:41:03 PM 60004 60154 0.94 4.72 0.00 5.66 9 pidstat 07:41:03 PM UID PID %usr %system %guest %CPU CPU Command 07:41:04 PM 0 4214 6.00 2.00 0.00 8.00 15 mesos-slave 07:41:04 PM 0 6521 1590.00 1.00 0.00 1591.00 27 java 07:41:04 PM 0 6564 1573.00 10.00 0.00 1583.00 28 java 07:41:04 PM 108 6718 1.00 0.00 0.00 1.00 0 snmp-pass 07:41:04 PM 60004 60154 1.00 4.00 0.00 5.00 9 pidstat ^C
  • 36. 60s: iostat   $ iostat -xmdz 1 Linux 3.13.0-29 (db001-eb883efa) 08/18/2014 _x86_64_ (16 CPU) Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s ... xvda 0.00 0.00 0.00 0.00 0.00 0.00 / ... xvdb 213.00 0.00 15299.00 0.00 338.17 0.00 ... xvdc 129.00 0.00 15271.00 3.00 336.65 0.01 / ... md0 0.00 0.00 31082.00 3.00 678.45 0.01 ... ... avgqu-sz await r_await w_await svctm %util ... / 0.00 0.00 0.00 0.00 0.00 0.00 ... 126.09 8.22 8.22 0.00 0.06 86.40 ... / 99.31 6.47 6.47 0.00 0.06 86.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 (Left columns: workload. Right columns: resulting performance.)
  • 37. 60s:  free,  sar  –n  DEV   $ free -m total used free shared buffers cached Mem: 245998 24545 221453 83 59 541 -/+ buffers/cache: 23944 222053 Swap: 0 0 0 $ sar -n DEV 1 Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU) 12:16:48 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil 12:16:49 AM eth0 18763.00 5032.00 20686.42 478.30 0.00 0.00 0.00 0.00 12:16:49 AM lo 14.00 14.00 1.36 1.36 0.00 0.00 0.00 0.00 12:16:49 AM docker0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 12:16:49 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil 12:16:50 AM eth0 19763.00 5101.00 21999.10 482.56 0.00 0.00 0.00 0.00 12:16:50 AM lo 20.00 20.00 3.25 3.25 0.00 0.00 0.00 0.00 12:16:50 AM docker0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ^C
  • 38. 60s:  sar  –n  TCP,ETCP   $ sar -n TCP,ETCP 1 Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU) 12:17:19 AM active/s passive/s iseg/s oseg/s 12:17:20 AM 1.00 0.00 10233.00 18846.00 12:17:19 AM atmptf/s estres/s retrans/s isegerr/s orsts/s 12:17:20 AM 0.00 0.00 0.00 0.00 0.00 12:17:20 AM active/s passive/s iseg/s oseg/s 12:17:21 AM 1.00 0.00 8359.00 6039.00 12:17:20 AM atmptf/s estres/s retrans/s isegerr/s orsts/s 12:17:21 AM 0.00 0.00 0.00 0.00 0.00 ^C
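The counters behind sar -n TCP,ETCP live in /proc/net/snmp; sampling RetransSegs twice and diffing gives a retransmit rate even without sysstat. A sketch using the standard /proc/net/snmp layout (header line of field names followed by a line of values):

```shell
# Retransmitted-segments counter from /proc/net/snmp, the source that
# sar -n TCP,ETCP reads. Two samples one second apart give retrans/s.
tcp_retrans() {
    awk '/^Tcp:/ {
        n++
        # first Tcp: line is headers; find the RetransSegs column
        if (n == 1) { for (i = 1; i <= NF; i++) if ($i == "RetransSegs") col = i }
        else print $col
    }' /proc/net/snmp
}

r1=$(tcp_retrans)
sleep 1
r2=$(tcp_retrans)
echo "retrans/s: $((r2 - r1))"
```

Retransmits are a useful network-health signal because they usually indicate loss or remote saturation rather than local load.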
  • 39. 60s:  top   $ top top - 00:15:40 up 21:56, 1 user, load average: 31.09, 29.87, 29.92 Tasks: 871 total, 1 running, 868 sleeping, 0 stopped, 2 zombie %Cpu(s): 96.8 us, 0.4 sy, 0.0 ni, 2.7 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st KiB Mem: 25190241+total, 24921688 used, 22698073+free, 60448 buffers KiB Swap: 0 total, 0 used, 0 free. 554208 cached Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 20248 root 20 0 0.227t 0.012t 18748 S 3090 5.2 29812:58 java 4213 root 20 0 2722544 64640 44232 S 23.5 0.0 233:35.37 mesos-slave 66128 titancl+ 20 0 24344 2332 1172 R 1.0 0.0 0:00.07 top 5235 root 20 0 38.227g 547004 49996 S 0.7 0.2 2:02.74 java 4299 root 20 0 20.015g 2.682g 16836 S 0.3 1.1 33:14.42 java 1 root 20 0 33620 2920 1496 S 0.0 0.0 0:03.82 init 2 root 20 0 0 0 0 S 0.0 0.0 0:00.02 kthreadd 3 root 20 0 0 0 0 S 0.0 0.0 0:05.35 ksoftirqd/0 5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H 6 root 20 0 0 0 0 S 0.0 0.0 0:06.94 kworker/u256:0 8 root 20 0 0 0 0 S 0.0 0.0 2:38.05 rcu_sched
  • 40. Other Analysis in 60s • We need such checklists for: – Java – Cassandra – MySQL – Nginx – etc… • Can follow a methodology: – Process of elimination – Workload characterization – Differential diagnosis – Some summaries: http://www.brendangregg.com/methodology.html • Turn checklists into dashboards (many do exist)
  • 41. 7.  Linux  Disk  Checklist  
  • 43. Linux Disk Checklist: 1. iostat -xnz 1 2. vmstat 1 3. df -h 4. ext4slower 10 5. bioslower 10 6. ext4dist 1 7. biolatency 1 8. cat /sys/devices/…/ioerr_cnt 9. smartctl -l error /dev/sda1
  • 44. Linux Disk Checklist: 1. iostat -xnz 1 – any disk I/O? if not, stop looking 2. vmstat 1 – is this swapping? or, high sys time? 3. df -h – are file systems nearly full? 4. ext4slower 10 – (zfs*, xfs*, etc.) slow file system I/O? 5. bioslower 10 – if so, check disks 6. ext4dist 1 – check distribution and rate 7. biolatency 1 – if interesting, check disks 8. cat /sys/devices/…/ioerr_cnt – (if available) errors 9. smartctl -l error /dev/sda1 – (if available) errors. Another short checklist. Won't solve everything. FS focused. ext4slower/dist, bioslower, are from bcc/BPF tools.
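The "any disk I/O?" question can also be answered from /proc/diskstats, which is where iostat gets its numbers. A rough sketch summing completed I/Os across all devices; note /proc/diskstats lists partitions as well as whole disks, so this double-counts — iostat -xz does the per-device accounting properly:

```shell
# System-wide I/O completions from /proc/diskstats (field 4 = reads
# completed, field 8 = writes completed). Diffing two samples gives an
# approximate total IOPS; a nonzero answer means keep looking.
total_ios() {
    awk '{ ios += $4 + $8 } END { print ios }' /proc/diskstats
}
io1=$(total_ios)
sleep 1
io2=$(total_ios)
echo "approx total IOPS: $((io2 - io1))"
```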
  • 45. ext4slower • ext4 operations slower than the threshold: • Better indicator of application pain than disk I/O • Measures & filters in-kernel for efficiency using BPF – From https://github.com/iovisor/bcc # ./ext4slower 1 Tracing ext4 operations slower than 1 ms TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME 06:49:17 bash 3616 R 128 0 7.75 cksum 06:49:17 cksum 3616 R 39552 0 1.34 [ 06:49:17 cksum 3616 R 96 0 5.36 2to3-2.7 06:49:17 cksum 3616 R 96 0 14.94 2to3-3.4 06:49:17 cksum 3616 R 10320 0 6.82 411toppm 06:49:17 cksum 3616 R 65536 0 4.01 a2p 06:49:17 cksum 3616 R 55400 0 8.77 ab 06:49:17 cksum 3616 R 36792 0 16.34 aclocal-1.14 06:49:17 cksum 3616 R 15008 0 19.31 acpi_listen […]
  • 46. BPF  is  coming…   Free  your  mind  
  • 47. BPF   •  That file system checklist should be a dashboard: –  FS & disk latency histograms, heatmaps, IOPS, outlier log •  Now possible with enhanced BPF (Berkeley Packet Filter) –  Built into Linux 4.x: dynamic tracing, filters, histograms System dashboards of 2017+ should look very different
  • 48. 8.  Linux  Network  Checklist  
  • 49. Linux  Network  Checklist   1.  sar -n DEV,EDEV 1 2.  sar -n TCP,ETCP 1 3.  cat /etc/resolv.conf 4.  mpstat -P ALL 1 5.  tcpretrans 6.  tcpconnect 7.  tcpaccept 8.  netstat -rnv 9.  check firewall config 10.  netstat -s
  • 50. Linux Network Checklist: 1. sar -n DEV,EDEV 1 – at interface limits? or use nicstat 2. sar -n TCP,ETCP 1 – active/passive load, retransmit rate 3. cat /etc/resolv.conf – it's always DNS 4. mpstat -P ALL 1 – high kernel time? single hot CPU? 5. tcpretrans – what are the retransmits? state? 6. tcpconnect – connecting to anything unexpected? 7. tcpaccept – unexpected workload? 8. netstat -rnv – any inefficient routes? 9. check firewall config – anything blocking/throttling? 10. netstat -s – play 252 metric pickup. tcp*, are from bcc/BPF tools
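For the "at interface limits?" question, throughput can be derived from /proc/net/dev deltas, which is what sar -n DEV and nicstat read. A rough sketch summing receive bytes across all interfaces (including lo):

```shell
# Receive throughput from /proc/net/dev deltas: a crude stand-in for
# nicstat / sar -n DEV. Each interface line is "iface: rxbytes ...";
# the first token after the colon is the rx byte counter.
rx_bytes() {
    awk -F: 'NR > 2 { split($2, f, " "); rx += f[1] } END { print rx }' /proc/net/dev
}
b1=$(rx_bytes)
sleep 1
b2=$(rx_bytes)
echo "rx KB/s (all interfaces): $(( (b2 - b1) / 1024 ))"
```

Comparing the per-interface rate against the NIC's line rate answers the interface-limits question; nicstat does that %util math for you.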
  • 51. tcpretrans   •  Just trace kernel TCP retransmit functions for efficiency: •  From either bcc (BPF) or perf-tools (ftrace, older kernels) # ./tcpretrans TIME PID IP LADDR:LPORT T> RADDR:RPORT STATE 01:55:05 0 4 10.153.223.157:22 R> 69.53.245.40:34619 ESTABLISHED 01:55:05 0 4 10.153.223.157:22 R> 69.53.245.40:34619 ESTABLISHED 01:55:17 0 4 10.153.223.157:22 R> 69.53.245.40:22957 ESTABLISHED […]
  • 52. 9.  Linux  CPU  Checklist  
  • 53. (too many lines – should be a utilization heat map)
  • 55. $ perf script […] java 14327 [022] 252764.179741: cycles: 7f36570a4932 SpinPause (/usr/lib/jvm/java-8 java 14315 [014] 252764.183517: cycles: 7f36570a4932 SpinPause (/usr/lib/jvm/java-8 java 14310 [012] 252764.185317: cycles: 7f36570a4932 SpinPause (/usr/lib/jvm/java-8 java 14332 [015] 252764.188720: cycles: 7f3658078350 pthread_cond_wait@@GLIBC_2.3.2 java 14341 [019] 252764.191307: cycles: 7f3656d150c8 ClassLoaderDataGraph::do_unloa java 14341 [019] 252764.198825: cycles: 7f3656d140b8 ClassLoaderData::free_dealloca java 14341 [019] 252764.207057: cycles: 7f3657192400 nmethod::do_unloading(BoolObje java 14341 [019] 252764.215962: cycles: 7f3656ba807e Assembler::locate_operand(unsi java 14341 [019] 252764.225141: cycles: 7f36571922e8 nmethod::do_unloading(BoolObje java 14341 [019] 252764.234578: cycles: 7f3656ec4960 CodeHeap::block_start(void*) c […]
  • 56. Linux  CPU  Checklist   1.  uptime 2.  vmstat 1 3.  mpstat -P ALL 1 4.  pidstat 1 5.  CPU flame graph 6.  CPU subsecond offset heat map 7.  perf stat -a -- sleep 10
  • 57. Linux CPU Checklist: 1. uptime – load averages 2. vmstat 1 – system-wide utilization, run queue length 3. mpstat -P ALL 1 – CPU balance 4. pidstat 1 – per-process CPU 5. CPU flame graph – CPU profiling 6. CPU subsecond offset heat map – look for gaps 7. perf stat -a -- sleep 10 – IPC, LLC hit ratio. htop can do 1-4
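The "CPU balance / single hot CPU" check can be approximated without mpstat from the per-CPU lines of /proc/stat. A coarse sketch — these counters are since boot, so a skewed workload only shows if it has been running a while; mpstat's 1-second intervals are the better tool:

```shell
# Spot the busiest CPU (roughly what mpstat -P ALL shows) from the
# per-CPU lines of /proc/stat. busy = user + nice + system jiffies,
# accumulated since boot.
hot=$(awk '/^cpu[0-9]/ {
        busy = $2 + $3 + $4
        if (busy > max) { max = busy; hot = $1 }
    } END { print hot }' /proc/stat)
echo "busiest CPU since boot: $hot"
```

One CPU far busier than its siblings often means a single hot thread — a scalability problem that adding instances won't fix.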
  • 60. perf_events CPU Flame Graphs • We have this automated in Netflix Vector: git clone --depth 1 https://github.com/brendangregg/FlameGraph; cd FlameGraph; perf record -F 99 -a -g -- sleep 30; perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg • Flame graph interpretation: – x-axis: alphabetical stack sort, to maximize merging – y-axis: stack depth – color: random, or hue can be a dimension (eg, diff) – Top edge is on-CPU, beneath it is ancestry • Can also do Java & Node.js. Differentials. • We're working on a d3 version for Vector
  • 61. 10. Tools Method: an Anti-Methodology
  • 62. Tools  Method   1.  RUN EVERYTHING AND HOPE FOR THE BEST For SRE response: a mental checklist to see what might have been missed (no time to run them all)
  • 66. Linux  bcc  tools  (BPF)   Needs  Linux  4.x   CONFIG_BPF_SYSCALL=y  
  • 67. 11.  USE  Method     A  Methodology  
  • 68. The USE Method • For every resource, check: 1. Utilization 2. Saturation 3. Errors • Definitions: – Utilization: busy time – Saturation: queue length or queued time – Errors: easy to interpret (objective). Used to generate checklists. Starts with the questions, then finds the tools. (Diagram: a resource with its utilization %.)
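Applied to CPUs with stock Linux sources, the three USE questions can be sketched as follows: utilization from /proc/stat over an interval, saturation from the runnable-task count in /proc/loadavg versus CPU count (CPU errors aren't well exposed without PMC/MCE tooling, so they're omitted here):

```shell
# USE method for CPUs, from /proc. Utilization: non-idle share of
# jiffies over 1s. Saturation: runnable tasks (the field before the
# slash in /proc/loadavg, which includes this reader) vs CPU count.
ncpu=$(nproc)

read c u1 n1 s1 i1 rest < /proc/stat
sleep 1
read c u2 n2 s2 i2 rest < /proc/stat
busy=$(( (u2 - u1) + (n2 - n1) + (s2 - s1) ))
total=$(( busy + (i2 - i1) ))
[ "$total" -gt 0 ] || total=1
util=$(( 100 * busy / total ))

runnable=$(awk -F'[ /]' '{ print $4 }' /proc/loadavg)
echo "CPU utilization: ${util}%  runnable: ${runnable} on ${ncpu} CPUs"
```

Runnable tasks persistently above the CPU count indicates saturation — work is queued even if utilization alone looks tolerable.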
  • 69. USE  Method  for  Hardware   •  For every resource, check: 1.  Utilization 2.  Saturation 3.  Errors •  Including busses & interconnects
  • 71. USE  Method  for  Distributed  Systems   •  Draw a service diagram, and for every service: 1.  Utilization: resource usage (CPU, network) 2.  Saturation: request queueing, timeouts 3.  Errors •  Turn into a dashboard
  • 72. Netflix Vector • Real time instance analysis tool – https://github.com/netflix/vector – http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html • USE method-inspired metrics – More in development, incl. flame graphs
  • 74. Netflix Vector: CPU – utilization, saturation; Network – utilization, saturation; Memory – load, utilization, saturation; Disk – load, utilization, saturation
  • 75. 12.  Bonus:  External  Factor  Checklist  
  • 76. External  Factor  Checklist   1.  Sports ball? 2.  Power outage? 3.  Snow storm? 4.  Internet/ISP down? 5.  Vendor firmware update? 6.  Public holiday/celebration? 7.  Chaos Kong? Social media searches (Twitter) often useful –  Can also be NSFW
  • 77. Take  Aways   •  Checklists are great –  Speed, Completeness, Starting/Ending Point, Training –  Can be ad hoc, or from a methodology (USE method) •  Service dashboards –  Serve as checklists –  Metrics: Load, Errors, Latency, Saturation, Instances •  System dashboards with Linux BPF –  Latency histograms & heatmaps, etc. Free your mind. Please create and share more checklists
  • 78. References • Netflix Tech Blog: http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html • http://techblog.netflix.com/2015/02/sps-pulse-of-netflix-streaming.html • http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html • Linux Performance & BPF tools: http://www.brendangregg.com/linuxperf.html • https://github.com/iovisor/bcc#tools • USE Method Linux: http://www.brendangregg.com/USEmethod/use-linux.html • Flame Graphs: http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html • Heat maps: http://cacm.acm.org/magazines/2010/7/95062-visualizing-system-latency/fulltext • http://www.brendangregg.com/heatmaps.html • Books: Beyer, B., et al. Site Reliability Engineering. O'Reilly, Apr 2016 • Gawande, A. The Checklist Manifesto. Metropolitan Books, 2008 • Gregg, B. Systems Performance. Prentice Hall, 2013 (more checklists & methods!) • Thanks: Netflix Perf & Core teams for predash, pretriage, Vector, etc