Introduction to Prometheus
An Approach to Whitebox Monitoring
Who am I?
Engineer passionate about running software reliably in production.
Studied Computer Science in Trinity College Dublin.
Google SRE for 7 years, working on high-scale reliable systems.
Contributor to many open source projects, including Prometheus, Ansible,
Python, Aurora and Zookeeper.
Founder of Robust Perception, provider of commercial support and consulting
for Prometheus.
What is Whitebox Monitoring?
Blackbox monitoring
Monitoring from the outside
No knowledge of how the application works internally
Examples: ping, HTTP request, inserting data and waiting for it to appear on a
dashboard
Where to use Blackbox
Blackbox monitoring should be treated similarly to smoke tests.
It’s good for finding when things have badly broken in an obvious way, and testing
from outside your network.
Not so good for knowing what’s going on inside a system.
Nor should it be treated like regression testing and try to test every single feature.
Tend to be flaky, as they either pass or fail.
Whitebox Monitoring
Complementary to blackbox monitoring.
Works with information from inside your systems.
Can be simple things like CPU usage, down to the number of requests triggering a
particular obscure codepath.
Prometheus
Inspired by Google’s Borgmon monitoring system.
Started in 2012 by ex-Googlers working at SoundCloud as an open source project.
Mainly written in Go. Version 1.0 released in 2016. Incubating with the CNCF.
500+ companies using it including Digital Ocean, Ericsson, Weave and CoreOS.
What is Monitoring For?
Why monitor?
Know when things go wrong
Be able to debug and gain insight
Trending to see changes over time
Plumbing data to other systems/processes
Knowing when things go wrong
The first thing people think of when you say monitoring is alerting.
What is the wrongness we want to detect and alert on?
A blip with no real consequence, or a latency issue affecting users?
Symptoms vs Causes
Humans are limited in what they can handle.
If you alert on every single thing that might be a problem, you'll get overwhelmed
and suffer from alert fatigue.
Key problem: You care about things like user facing latency. There are hundreds
of things that could cause that.
Alerting on every possible cause is a Sisyphean task, but alerting on the symptom
of high latency is just one alert.
Example: CPU usage
Some monitoring systems don't allow you to alert on the latency of your servers.
The closest you can get is CPU usage.
False positives due to e.g. logrotate running too long.
False negatives due to deadlocks.
End result: Spammy alerts which operators learn to ignore, missing real problems.
Many Approaches have Limited Visibility
Services have Internals
Monitor the Internals
Monitor as a Service, not as Machines
Freedom for Alerting
A system like Prometheus gives you the freedom to alert on whatever you like.
Alerting on error ratio across all the machines in a datacenter? No problem.
Alerting on 95th percentile latency for the service being >200ms? No problem.
Alerting on data taking too long to get through your pipeline? No problem.
Alerting on your VIP not giving the right HTTP response codes? No problem.
Produce alerts that require intelligent human action!
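As a rough illustration of alerting on a service-wide symptom rather than on per-machine causes, the sketch below computes an error ratio across every machine in a datacenter and compares it to a threshold. The machine names, the counts and the 1% threshold are all made up for the example; in Prometheus this would be an alerting rule over aggregated rates rather than Python.

```python
# Hypothetical per-machine request and error counts over the last few minutes.
requests = {"web-1": 1000, "web-2": 1200, "web-3": 800}
errors = {"web-1": 5, "web-2": 6, "web-3": 40}


def error_ratio(errors, requests):
    """Ratio of failed requests to all requests, summed across machines."""
    total = sum(requests.values())
    if total == 0:
        return 0.0  # no traffic, so nothing to alert on
    return sum(errors.values()) / total


THRESHOLD = 0.01  # assumed: alert when more than 1% of requests fail

ratio = error_ratio(errors, requests)
should_alert = ratio > THRESHOLD
print(f"error ratio {ratio:.3%}, alert: {should_alert}")
```

One alert on the aggregate covers all three machines, whichever of them happens to be misbehaving.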
Alerting Architecture
Debugging to Gain Insight
After you receive an alert notification you need to investigate it.
How do you work from a high level symptom alert such as increased latency?
You drill down through your stack with dashboards to find the subsystem that's the
cause!
Dashboards
Metrics from All Levels of the Stack
Many existing integrations: Java, JMX, Python, Go, Ruby, .Net, Machine,
Cloudwatch, EC2, MySQL, PostgreSQL, Haskell, Bash, Node.js, SNMP, Consul,
HAProxy, Mesos, Bind, CouchDB, Django, Mtail, Heka, Memcached, RabbitMQ,
Redis, RethinkDB, Rsyslog, Meteor.js, Minecraft and Factorio.
Graphite, Statsd, Collectd, Scollector, Munin, Nagios integrations aid transition.
It’s so easy, most of the above were written without the core team even knowing
about them!
Metrics are just one Tool
Metrics are good for alerting on issues and letting you narrow down the focus of
your debugging.
Not a panacea though: as with all approaches, there are fundamental limitations
on data volumes.
For successful debugging of complex problems you need a mix of logs, profiling
and source code analysis.
Complementary Debugging Tools
Trending and Reporting
Alerting and debugging are short term.
Trending is medium to long term.
How is cache hit rate changing over time?
Is anyone still using that obscure feature?
With Prometheus you can do analysis beyond this.
Powerful Query Language
Can multiply, add, aggregate, join, predict, take quantiles across many metrics in
the same query. Can evaluate right now, and graph back in time.
Answer questions like:
What’s the 95th percentile latency in each datacenter over the past month?
How full will the disks be in 4 days?
Which services are the top 5 users of CPU?
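To give a feel for the disk question, here is a minimal Python sketch of the underlying idea: fit a straight line through recent samples and extrapolate it. It is similar in spirit to what a prediction function in a query language does, but it is not Prometheus code; the function name `extrapolate_linear` and the sample data are assumptions for illustration.

```python
def extrapolate_linear(samples, seconds_ahead):
    """Fit a least-squares line through (timestamp, value) samples and
    evaluate it seconds_ahead past the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return intercept + slope * (last_t + seconds_ahead)


# Hypothetical free-space samples: (seconds, bytes free), shrinking 1 GB per hour.
GB = 10**9
samples = [(h * 3600, 500 * GB - h * GB) for h in range(6)]

free_in_4_days = extrapolate_linear(samples, 4 * 24 * 3600)
print(f"predicted free space in 4 days: {free_in_4_days / GB:.1f} GB")
```

If the predicted value drops below zero, the disk is on course to fill within the window, which is a far more useful alert than a fixed "90% full" threshold.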
Example: Top 5 Docker images by CPU
topk(5,
  sum by (image)(
    rate(container_cpu_usage_seconds_total{
      id=~"/system.slice/docker.*"}[5m]
    )
  )
)
Structured Data: Labels
Prometheus doesn’t use dotted.strings like metric.grafanacon.nyc.
It uses multi-dimensional labels instead, like
metric{event="grafanacon",aircraft_carrier_location="nyc"}
Can aggregate, cut, and slice along them.
Can come from instrumentation, or be added based on the service you are
monitoring.
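The following Python sketch models what "aggregate, cut, and slice" along labels means. It is a toy model, not how Prometheus stores data: the label names, values and the helper functions `sum_by` and `select` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical samples: each time series is a set of labels plus a value.
series = [
    ({"event": "grafanacon", "location": "nyc", "method": "GET"}, 120.0),
    ({"event": "grafanacon", "location": "nyc", "method": "POST"}, 30.0),
    ({"event": "grafanacon", "location": "dub", "method": "GET"}, 80.0),
]


def sum_by(series, *label_names):
    """Sum values over all series sharing the same values for label_names."""
    totals = defaultdict(float)
    for labels, value in series:
        key = tuple(labels.get(name, "") for name in label_names)
        totals[key] += value
    return dict(totals)


def select(series, **matchers):
    """Keep only the series whose labels equal every given matcher."""
    return [
        (labels, value)
        for labels, value in series
        if all(labels.get(k) == want for k, want in matchers.items())
    ]


by_location = sum_by(series, "location")            # aggregate along one label
nyc_gets = select(series, location="nyc", method="GET")  # slice by two labels
print(by_location)
print(nyc_gets)
```

With dotted strings the same questions would need pattern matching on name positions; with labels each dimension is addressed by name.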
Example: Labels from Node Exporter
Python Instrumentation: An example
pip install prometheus_client

from prometheus_client import Summary, start_http_server

# Tracks the count and total duration of calls to the decorated function.
REQUEST_DURATION = Summary('request_duration_seconds',
                           'Request duration in seconds')

@REQUEST_DURATION.time()
def my_handler(request):
    pass  # Your code here

# Expose the metrics over HTTP on port 8000.
start_http_server(8000)
Adding Dimensions (No Evil Twins Please)
from prometheus_client import Counter

# One time series per value of the 'method' label.
REQUESTS = Counter('requests_total',
                   'Total requests', ['method'])

def my_handler(request):
    REQUESTS.labels(request.method).inc()
    pass  # Your code here
Labels go beyond Prometheus
If you're using Kubernetes, Prometheus can take in your labels and annotations
too.
Similar data models and mutual integrations make your life easier!
Plumbing
Prometheus isn't just open source, it's also an open ecosystem.
We know we can't support everything, so at every level there's a generic interface
to let you get data in and/or out.
So for example if you want to run a shell script when an alert fires, you can make
that happen.
Prometheus Clients as a Clearinghouse
Live Demo!
Monitoring What Matters with Prometheus
To summarise, the key things Prometheus empowers you to build:
Alerting on symptoms. Alerts which require intelligent human action.
Debugging dashboards that let you drill down to where the problem is.
The ability to run complex queries to slice and dice your data.
Easy integration points for other systems.
These are good things to have no matter which monitoring system(s) you use.
10 Tips for Monitoring
With potentially millions of time series across your system, it can be difficult to
know what is and isn't useful.
What approaches help manage this complexity?
How do you avoid getting caught out?
Here are some tips.
#1: Choose your key statistics
Users don't care that one of your machines is short of CPU.
Users care if the service is slow or throwing errors.
For your primary dashboards focus on high-level metrics that directly impact
users.
#2: Use aggregations
Think about services, not machines.
Once you have more than a handful of machines, you should treat them as an
amorphous blob.
Looking at the key statistics is easier for 10 services than for 10 services each
of which is on 10 machines.
Once you have isolated a problem to one service, then you can see if one machine
is the problem.
#3: Avoid the Wall of Graphs
Dashboards tend to grow without bound. Worst I've seen was 600 graphs.
It might look impressive, but humans can't deal with that much data at once.
(and they take forever to load)
Your services will have a rough tree structure. Have a dashboard per service and
walk the tree from the top when you have a problem. Similarly for each service,
have dashboards per subsystem.
Rule of Thumb: Limit of 5 graphs per dashboard, and 5 lines per graph.
#4: Client-side quantiles aren't aggregatable
Many instrumentation systems calculate quantiles/percentiles inside each
process, and export them to the TSDB.
It is not statistically possible to aggregate these.
If you want meaningful quantiles, you should track histogram buckets in each
process, aggregate those in your monitoring system and then calculate the
quantile.
This is done using histogram_quantile() and rate() in Prometheus.
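The approach can be sketched in Python: sum the per-process bucket counts, then estimate the quantile by interpolating inside the bucket that contains the target rank. This is a simplified illustration of the idea, not the Prometheus implementation; the bucket bounds, the counts and the helper names are assumptions.

```python
# Upper bounds (seconds) of cumulative buckets; the last one is +Inf.
BOUNDS = [0.1, 0.2, 0.5, 1.0, float("inf")]

# Hypothetical cumulative bucket counts exported by two processes.
process_a = [50, 90, 98, 100, 100]
process_b = [10, 40, 80, 95, 100]


def merge(*histograms):
    """Bucket counts are plain counters, so summing them is valid."""
    return [sum(counts) for counts in zip(*histograms)]


def quantile(q, bounds, cumulative):
    """Estimate the q-quantile by linear interpolation inside the bucket
    that contains the target rank."""
    rank = q * cumulative[-1]
    prev_bound, prev_count = 0.0, 0
    for bound, count in zip(bounds, cumulative):
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # cannot interpolate into the +Inf bucket
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            fraction = (rank - prev_count) / in_bucket
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


merged = merge(process_a, process_b)
p95 = quantile(0.95, BOUNDS, merged)
print(f"merged buckets: {merged}, estimated p95: {p95:.3f}s")
```

Note that averaging the two per-process 95th percentiles would not give this answer; only the bucket counts can be combined safely.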
#5: Averages are easy to reason about
Q: Say you have a service with two backends. If 95th percentile latency goes up
due to one of the backends, what will you see in 95th percentile latency for that
backend?
A: It depends, could be no change. If the latencies are strongly correlated for each
request across the backends, you'll see the same latency bump.
This is tricky to reason about, especially in an emergency.
Averages don't have this problem, as they include all requests.
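Part of what makes averages easy to reason about is that they combine cleanly: add up the latency sums, add up the request counts, and divide. A small sketch with made-up per-backend totals (the names and numbers are hypothetical):

```python
# Hypothetical per-backend totals over the same time window.
backends = {
    "backend-a": {"latency_sum": 12.0, "requests": 100},
    "backend-b": {"latency_sum": 48.0, "requests": 300},
}


def average_latency(stats):
    """Average latency across any collection of (sum, count) pairs."""
    total_requests = sum(s["requests"] for s in stats)
    if total_requests == 0:
        return 0.0
    return sum(s["latency_sum"] for s in stats) / total_requests


per_backend = {name: average_latency([s]) for name, s in backends.items()}
overall = average_latency(backends.values())
print(per_backend, overall)
```

The overall figure is simply the request-weighted mix of the per-backend averages, so a rise in one backend shows up proportionally in the total.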
#6: Costs and Benefits
1s resolution monitoring of all metrics would be handy for debugging.
But is it ten times more valuable than 10s monitoring? And sixty times more
valuable than 60s monitoring?
Monitoring isn't free. It costs resources to run, and resources in the services being
monitored too. Quantiles and histograms can get expensive fast.
60s resolution is generally a good balance. Reserve 1s granularity for a literal
handful of key metrics.
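The cost side is simple arithmetic. The sketch below counts samples collected per day at each resolution; the figure of one million time series is an assumption chosen only to make the scale concrete.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_day(resolution_seconds, series_count=1):
    """Number of samples collected per day at a given scrape resolution."""
    return series_count * SECONDS_PER_DAY // resolution_seconds


SERIES = 1_000_000  # assumed number of time series

for resolution in (1, 10, 60):
    total = samples_per_day(resolution, SERIES)
    print(f"{resolution:>2}s resolution: {total:,} samples/day")
```

Every one of those samples has to be exposed, transferred, stored and queried, so moving from 60s to 1s multiplies all of those costs by sixty.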
#7: Nyquist-Shannon Sampling Theorem
To reconstruct a signal you need a resolution that's at least double its frequency.
If you've got a 10s resolution time series, you can't reconstruct patterns that are
less than 20s long.
Higher frequency patterns can cause effects like aliasing, and mislead you.
If you suspect that there's something more to the data, try a higher resolution
temporarily or start profiling.
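Aliasing is easy to demonstrate. In this sketch a pattern with an assumed 12s period is sampled every 10s; the samples are indistinguishable from those of a much slower 60s wave, which is what a graph would appear to show. The periods are chosen purely for illustration.

```python
import math

SIGNAL_PERIOD = 12.0    # seconds; assumed true period of the pattern
SAMPLE_INTERVAL = 10.0  # seconds; the monitoring resolution


def signal(t):
    """The real, fast-moving pattern."""
    return math.sin(2 * math.pi * t / SIGNAL_PERIOD)


samples = [(k * SAMPLE_INTERVAL, signal(k * SAMPLE_INTERVAL)) for k in range(13)]

# The alias appears at the difference of the two frequencies.
ALIAS_PERIOD = 1 / abs(1 / SIGNAL_PERIOD - 1 / SAMPLE_INTERVAL)


def alias(t):
    """A slow wave that passes through exactly the same sample points."""
    return -math.sin(2 * math.pi * t / ALIAS_PERIOD)


for t, value in samples:
    print(f"t={t:5.0f}s  sampled={value:+.3f}  slow wave={alias(t):+.3f}")
```

Nothing in the 10s samples reveals that the real pattern repeats every 12s, which is why a temporary higher resolution is the way to check.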
#8: Correlation is not Causation - Confirmation Bias
Humans are great at spotting patterns. Not all of them are actually there.
Always try to look for evidence that'd falsify your hypothesis.
If two metrics seem to correlate on a graph that doesn't mean that they're related.
They could be independent tasks running on the same schedule.
Or if you zoom out there are plenty of times when one spikes but not the other.
Or one could be causing a slight increase in resource contention, pushing the
other over the edge.
#9: Know when to use Logs and Metrics
You want a metrics time series system for your primary monitoring.
Logs have information about every event. This limits the number of fields (<100),
but you have unlimited cardinality.
Metrics aggregate across events, but you can have many metrics (>10000) with
limited cardinality.
Metrics help you determine where in the system the problem is. From there, logs
can help you pinpoint which requests are tickling the problem.
#10: Have a way to deal with non-critical alerts
Most alerts don't justify waking up someone at night, but someone needs to look
at them sometime.
Often they're sent to a mailing list, where everyone promptly filters them away.
Better to have some form of ticketing system that'll assign a single owner for each
alert.
A daily email with all firing alerts that the oncall has to process can also work.
Questions?
Project Website: prometheus.io
Demo: demo.robustperception.io
Company Website: www.robustperception.io
Is your Automation Infrastructure ‘Well Architected’?
Adam Goucher
 
Tef con2016 (1)
Tef con2016 (1)Tef con2016 (1)
Tef con2016 (1)
ggarber
 
Building Cloud Ready Apps
Building Cloud Ready AppsBuilding Cloud Ready Apps
Building Cloud Ready Apps
VMware Tanzu
 
LOPSA East 2013 - Building a More Effective Monitoring Environment
LOPSA East 2013 - Building a More Effective Monitoring EnvironmentLOPSA East 2013 - Building a More Effective Monitoring Environment
LOPSA East 2013 - Building a More Effective Monitoring Environment
Mike Julian
 
Cartographer, or Building A Next Generation Management Framework
Cartographer, or Building A Next Generation Management FrameworkCartographer, or Building A Next Generation Management Framework
Cartographer, or Building A Next Generation Management Framework
ansmtug
 
Prometheus-Grafana-RahulSoni1584KnolX.pptx.pdf
Prometheus-Grafana-RahulSoni1584KnolX.pptx.pdfPrometheus-Grafana-RahulSoni1584KnolX.pptx.pdf
Prometheus-Grafana-RahulSoni1584KnolX.pptx.pdf
Knoldus Inc.
 
Which watcher watches CloudWatch
Which watcher watches CloudWatch Which watcher watches CloudWatch
Which watcher watches CloudWatch
David Lutz
 
Purple Teaming With Adversary Emulation.pdf
Purple Teaming With Adversary Emulation.pdfPurple Teaming With Adversary Emulation.pdf
Purple Teaming With Adversary Emulation.pdf
prithaaash
 
Performance Evaluation of a Network Using Simulation Tools or Packet Tracer
Performance Evaluation of a Network Using Simulation Tools or Packet TracerPerformance Evaluation of a Network Using Simulation Tools or Packet Tracer
Performance Evaluation of a Network Using Simulation Tools or Packet Tracer
IOSRjournaljce
 
The difference between in-depth analysis of virtual infrastructures & monitoring
The difference between in-depth analysis of virtual infrastructures & monitoringThe difference between in-depth analysis of virtual infrastructures & monitoring
The difference between in-depth analysis of virtual infrastructures & monitoring
BettyRManning
 
Ad


An Introduction to Prometheus (GrafanaCon 2016)

  • 1. Introduction to Prometheus An Approach to Whitebox Monitoring
  • 2. Who am I? Engineer passionate about running software reliably in production. Studied Computer Science in Trinity College Dublin. Google SRE for 7 years, working on high-scale reliable systems. Contributor to many open source projects, including Prometheus, Ansible, Python, Aurora and Zookeeper. Founder of Robust Perception, provider of commercial support and consulting for Prometheus.
  • 3. What is Whitebox Monitoring?
  • 4. Blackbox monitoring Monitoring from the outside No knowledge of how the application works internally Examples: ping, HTTP request, inserting data and waiting for it to appear on dashboard
  • 5. Where to use Blackbox Blackbox monitoring should be treated similarly to smoke tests. It’s good for finding when things have badly broken in an obvious way, and testing from outside your network. Not so good for knowing what’s going on inside a system. Nor should it be treated like regression testing and try to test every single feature. Tend to be flaky, as they either pass or fail.
  • 6. Whitebox Monitoring Complementary to blackbox monitoring. Works with information from inside your systems. Can be simple things like CPU usage, down to the number of requests triggering a particular obscure codepath.
  • 7. Prometheus Inspired by Google’s Borgmon monitoring system. Started in 2012 by ex-Googlers working in Soundcloud as an open source project. Mainly written in Go. Version 1.0 released in 2016. Incubating with the CNCF. 500+ companies using it including Digital Ocean, Ericsson, Weave and CoreOS.
  • 9. Why monitor? Know when things go wrong Be able to debug and gain insight Trending to see changes over time Plumbing data to other systems/processes
  • 10. Knowing when things go wrong The first thing people think of when you say monitoring is alerting. What is the wrongness we want to detect and alert on? A blip with no real consequence, or a latency issue affecting users?
  • 11. Symptoms vs Causes Humans are limited in what they can handle. If you alert on every single thing that might be a problem, you'll get overwhelmed and suffer from alert fatigue. Key problem: You care about things like user facing latency. There are hundreds of things that could cause that. Alerting on every possible cause is a Sisyphean task, but alerting on the symptom of high latency is just one alert.
  • 12. Example: CPU usage Some monitoring systems don't allow you to alert on the latency of your servers. The closest you can get is CPU usage. False positives due to e.g. logrotate running too long. False negatives due to deadlocks. End result: Spammy alerts which operators learn to ignore, missing real problems.
  • 13. Many Approaches have Limited Visibility
  • 16. Monitor as a Service, not as Machines
  • 17. Freedom for Alerting A system like Prometheus gives you the freedom to alert on whatever you like. Alerting on error ratio across all the machines in a datacenter? No problem. Alerting on 95th percentile latency for the service being <200ms? No problem. Alerting on data taking too long to get through your pipeline? No problem. Alerting on your VIP not giving the right HTTP response codes? No problem. Produce alerts that require intelligent human action!
  • 19. Debugging to Gain Insight After you receive an alert notification you need to investigate it. How do you work from a high level symptom alert such as increased latency? You drill down through your stack with dashboards to find the subsystem that's the cause!
  • 21. Metrics from All Levels of the Stack Many existing integrations: Java, JMX, Python, Go, Ruby, .Net, Machine, Cloudwatch, EC2, MySQL, PostgreSQL, Haskell, Bash, Node.js, SNMP, Consul, HAProxy, Mesos, Bind, CouchDB, Django, Mtail, Heka, Memcached, RabbitMQ, Redis, RethinkDB, Rsyslog, Meteor.js, Minecraft and Factorio. Graphite, Statsd, Collectd, Scollector, Munin, Nagios integrations aid transition. It’s so easy, most of the above were written without the core team even knowing about them!
  • 22. Metrics are just one Tool Metrics are good for alerting on issues and letting you narrow down the focus of your debugging. Not a panacea though: as with all approaches, there are fundamental limitations on data volumes. For successful debugging of complex problems you need a mix of logs, profiling and source code analysis.
  • 24. Trending and Reporting Alerting and debugging is short term. Trending is medium to long term. How is cache hit rate changing over time? Is anyone still using that obscure feature? With Prometheus you can do analysis beyond this.
  • 25. Powerful Query Language Can multiply, add, aggregate, join, predict, take quantiles across many metrics in the same query. Can evaluate right now, and graph back in time. Answer questions like: What’s the 95th percentile latency in each datacenter over the past month? How full will the disks be in 4 days? Which services are the top 5 users of CPU?
  • 26. Example: Top 5 Docker images by CPU
      topk(5,
        sum by (image)(
          rate(container_cpu_usage_seconds_total{id=~"/system.slice/docker.*"}[5m])
        )
      )
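As a rough illustration of what the query on this slide computes, here is a minimal pure-Python sketch. It is not how Prometheus evaluates the expression: the sample data, the image names and the helpers `per_second_rate` and `top_images_by_cpu` are all made up for illustration, and the rate is simplified to increase divided by elapsed time, ignoring the counter resets and extrapolation that the real `rate()` deals with.

```python
from collections import defaultdict

# Hypothetical counter samples:
# (image, container id) -> [(timestamp_seconds, cpu_seconds_total), ...]
SAMPLES = {
    ("nginx", "a1"): [(0, 100.0), (300, 130.0)],
    ("nginx", "a2"): [(0, 50.0), (300, 65.0)],
    ("redis", "b1"): [(0, 10.0), (300, 100.0)],
    ("batch", "c1"): [(0, 0.0), (300, 3.0)],
}


def per_second_rate(points):
    """Simplified stand-in for rate(): increase divided by elapsed seconds."""
    (t0, v0), (t1, v1) = points[0], points[-1]
    return (v1 - v0) / (t1 - t0)


def top_images_by_cpu(samples, k):
    """Rough equivalent of topk(k, sum by (image)(rate(...)))."""
    totals = defaultdict(float)
    for (image, _container), points in samples.items():
        totals[image] += per_second_rate(points)  # the "sum by (image)" step
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]  # the "topk" step


top2 = top_images_by_cpu(SAMPLES, 2)
for image, cpu in top2:
    print(f"{image}: {cpu:.2f} CPU cores")
```

The container label is dropped during aggregation and only `image` survives, which is the same thing `sum by (image)` does to the other labels.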
  • 27. Structured Data: Labels Prometheus doesn’t use dotted.strings like metric.grafanacon.nyc. Multi-dimensional labels instead, like metric{event="grafanacon",aircraft_carrier_location="nyc"}. Can aggregate, cut, and slice along them. Can come from instrumentation, or be added based on the service you are monitoring.
  • 28. Example: Labels from Node Exporter
  • 29. Python Instrumentation: An example
      pip install prometheus_client

      from prometheus_client import Summary, start_http_server

      # Tracks how many calls were made and how long they took in total.
      REQUEST_DURATION = Summary('request_duration_seconds', 'Request duration in seconds')

      @REQUEST_DURATION.time()
      def my_handler(request):
          pass  # Your code here

      # Expose the metrics over HTTP on port 8000 for Prometheus to scrape.
      start_http_server(8000)
  • 30. Adding Dimensions (No Evil Twins Please)
      from prometheus_client import Counter

      # One counter with a 'method' label, rather than one metric per HTTP method.
      REQUESTS = Counter('requests_total', 'Total requests', ['method'])

      def my_handler(request):
          REQUESTS.labels(request.method).inc()
          pass  # Your code here
  • 31. Labels go beyond Prometheus If you're using Kubernetes, Prometheus can take in your labels and annotations too. Similar data models and mutual integrations make your life easier!
  • 32. Plumbing Prometheus isn't just open source, it's also an open ecosystem. We know we can't support everything, so at every level there's a generic interface to let you get data in and/or out. So for example if you want to run a shell script when an alert fires, you can make that happen.
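One possible way to wire up the "run a shell script when an alert fires" case is a small HTTP receiver for the Alertmanager's webhook notifications. The sketch below assumes that route: the script path, the port and the helper `firing_alert_names` are hypothetical, and it reads only the `alerts`, `status` and `labels` fields of the webhook payload. Treat it as a starting point under those assumptions rather than a recommended deployment.

```python
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical script to run once per firing alert.
SCRIPT = ["/usr/local/bin/on_alert.sh"]


def firing_alert_names(payload):
    """Return the alertname label of every alert in the payload that is firing."""
    return [
        alert.get("labels", {}).get("alertname", "unknown")
        for alert in payload.get("alerts", [])
        if alert.get("status") == "firing"
    ]


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        for name in firing_alert_names(payload):
            # The alert name is passed as the script's first argument.
            subprocess.run(SCRIPT + [name], check=False)
        self.send_response(200)
        self.end_headers()


def serve(port=9095):
    """Block forever, handling webhook POSTs on the given (assumed) port."""
    HTTPServer(("", port), AlertHandler).serve_forever()
```

Calling `serve()` starts the receiver; the Alertmanager would then need a webhook receiver entry pointing at that address.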
  • 33. Prometheus Clients as a Clearinghouse
  • 35. Monitoring What Matters with Prometheus To summarise, the key things Prometheus empowers you to build: Alerting on symptoms. Alerts which require intelligent human action. Debugging dashboards that let you drill down to where the problem is. The ability to run complex queries to slice and dice your data. Easy integration points for other systems. These are good things to have no matter which monitoring system(s) you use.
  • 36. 10 Tips for Monitoring With potentially millions of time series across your system, it can be difficult to know what is and isn't useful. What approaches help manage this complexity? How do you avoid getting caught out? Here are some tips.
  • 37. #1: Choose your key statistics Users don't care that one of your machines is short of CPU. Users care if the service is slow or throwing errors. For your primary dashboards focus on high-level metrics that directly impact users.
  • 38. #2: Use aggregations Think about services, not machines. Once you have more than a handful of machines, you should treat them as an amorphous blob. Looking at the key statistics is easier for 10 services than for 10 services each of which is on 10 machines. Once you have isolated a problem to one service, then you can see if one machine is the problem.
  • 39. #3: Avoid the Wall of Graphs Dashboards tend to grow without bound. Worst I've seen was 600 graphs. It might look impressive, but humans can't deal with that much data at once (and they take forever to load). Your services will have a rough tree structure: have a dashboard per service and walk the tree from the top when you have a problem. Similarly for each service, have dashboards per subsystem. Rule of Thumb: Limit of 5 graphs per dashboard, and 5 lines per graph.
  • 40. #4: Client-side quantiles aren't aggregatable Many instrumentation systems calculate quantiles/percentiles inside each process, and export it to the TSDB. It is not statistically possible to aggregate these. If you want meaningful quantiles, you should track histogram buckets in each process, aggregate those in your monitoring system and then calculate the quantile. This is done using histogram_quantile() and rate() in Prometheus.
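To make the point about client-side quantiles concrete, here is a small sketch with made-up numbers: two processes with very different traffic, the (wrong) average of their individual 95th percentiles, and the answer obtained by summing histogram buckets first. The bucket bounds and the helpers are assumptions for illustration. `quantile_bucket` only reports which bucket the quantile falls in, whereas Prometheus's `histogram_quantile()` additionally interpolates within that bucket.

```python
import math

# Hypothetical bucket upper bounds, in seconds.
BUCKET_BOUNDS = [0.1, 0.5, 1.0, 5.0]


def nearest_rank_quantile(values, q):
    """Exact quantile of raw observations using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def bucket_counts(values):
    """Cumulative count of observations at or below each upper bound."""
    return [sum(1 for v in values if v <= bound) for bound in BUCKET_BOUNDS]


def quantile_bucket(cumulative, q):
    """Upper bound of the first bucket that contains the q-th quantile."""
    target = q * cumulative[-1]
    for bound, count in zip(BUCKET_BOUNDS, cumulative):
        if count >= target:
            return bound
    return BUCKET_BOUNDS[-1]


# Process A serves 1000 fast requests, process B serves 100 slow ones.
proc_a = [0.05] * 1000
proc_b = [2.0] * 100

# Quantiles computed inside each process, then averaged: not meaningful.
p95_a = nearest_rank_quantile(proc_a, 0.95)
p95_b = nearest_rank_quantile(proc_b, 0.95)
averaged = (p95_a + p95_b) / 2

# Ground truth over all requests, which a monitoring system never sees raw.
true_p95 = nearest_rank_quantile(proc_a + proc_b, 0.95)

# Buckets are plain counts, so they can be summed across processes.
merged = [a + b for a, b in zip(bucket_counts(proc_a), bucket_counts(proc_b))]
from_buckets = quantile_bucket(merged, 0.95)

print(f"average of per-process p95s: {averaged}")
print(f"true p95 over all requests:  {true_p95}")
print(f"p95 bucket from merged data: <= {from_buckets}")
```

The averaged figure matches neither process nor the combined traffic, while the merged buckets at least place the quantile in the correct range. How tight that range is depends on how the bucket bounds are chosen.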
  • 41. #5: Averages are easy to reason about Q: Say you have a service with two backends. If 95th percentile latency goes up due to one of the backends, what will you see in 95th percentile latency for that backend? A: ?
  • 42. #5: Averages are easy to reason about Q: Say you have a service with two backends. If 95th percentile latency goes up due to one of the backends, what will you see in 95th percentile latency for that backend? A: It depends, could be no change. If the latencies are strongly correlated for each request across the backends, you'll see the same latency bump. This is tricky to reason about, especially in an emergency. Averages don't have this problem, as they include all requests.
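The "could be no change" answer can be reproduced with a toy model. In this sketch (all numbers invented) each request calls both backends and the service latency is the sum. Backend A always has a slow tail on 4% of requests. Backend B then develops its own slow tail on a different 4% of requests. The service's 95th percentile jumps, yet B's own 95th percentile does not move, while B's average does.

```python
def p95(values):
    """95th percentile using the nearest-rank method with integer arithmetic."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def mean(values):
    return sum(values) / len(values)


N = 100
FAST, SLOW = 0.1, 1.0  # seconds, hypothetical

# Backend A is slow on requests 0-3, both before and after the incident.
backend_a = [SLOW if i < 4 else FAST for i in range(N)]

# Backend B is healthy before, and slow on requests 4-7 afterwards.
backend_b_before = [FAST] * N
backend_b_after = [SLOW if 4 <= i < 8 else FAST for i in range(N)]

service_before = [a + b for a, b in zip(backend_a, backend_b_before)]
service_after = [a + b for a, b in zip(backend_a, backend_b_after)]

print("service p95:", p95(service_before), "->", p95(service_after))
print("backend B p95:", p95(backend_b_before), "->", p95(backend_b_after))
print("backend B mean:", mean(backend_b_before), "->", mean(backend_b_after))
```

Before the incident only 4% of requests are slow, so the service's 95th percentile sits in the fast group. Afterwards 8% are slow and it moves into the slow group, even though B on its own is still slow for fewer than 5% of requests. The averages simply add up: the service mean equals the sum of the two backend means in both cases.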
  • 43. #6: Costs and Benefits 1s resolution monitoring of all metrics would be handy for debugging. But is it ten times more valuable than 10s monitoring? And sixty times more valuable than 60s monitoring? Monitoring isn't free. It costs resources to run, and resources in the services being monitored too. Quantiles and histograms can get expensive fast. 60s resolution is generally a good balance. Reserve 1s granularity for a literal handful of key metrics.
  • 44. #7: Nyquist-Shannon Sampling Theorem To reconstruct a signal you need a resolution that's at least double its frequency. If you've got a 10s resolution time series, you can't reconstruct patterns that are less than 20s long. Higher frequency patterns can cause effects like aliasing, and mislead you. If you suspect that there's something more to the data, try a higher resolution temporarily or start profiling.
  • 45. #8: Correlation is not Causation - Confirmation Bias Humans are great at spotting patterns. Not all of them are actually there. Always try to look for evidence that'd falsify your hypothesis. If two metrics seem to correlate on a graph that doesn't mean that they're related. They could be independent tasks running on the same schedule. Or if you zoom out there are plenty of times when one spikes but not the other. Or one could be causing a slight increase in resource contention, pushing the other over the edge.
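The cost side of this trade-off is simple arithmetic. The sketch below counts how many samples per day have to be scraped and stored at each resolution, assuming a hypothetical one million time series. It says nothing about bytes per sample, which depends on the storage engine.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Hypothetical fleet size; substitute your own series count.
SERIES = 1_000_000


def samples_per_day(series_count, interval_seconds):
    """Number of samples ingested per day at a fixed scrape interval."""
    return series_count * SECONDS_PER_DAY // interval_seconds


for interval in (1, 10, 60):
    total = samples_per_day(SERIES, interval)
    print(f"{interval:>2}s resolution: {total:,} samples/day")
```

Under these assumptions, moving from 60s to 1s resolution multiplies the sample volume by sixty, which is the ratio the slide asks you to weigh against the extra debugging value.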
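Aliasing is easy to demonstrate numerically. In this sketch a made-up signal oscillates with a 12s period, shorter than the 20s limit for 10s sampling. Sampled every 10s, the points land exactly on a much slower, inverted sine wave with a 60s period, so a graph would show a one-minute cycle that does not exist in the underlying system.

```python
import math

TRUE_PERIOD_S = 12.0   # hypothetical real oscillation, too fast for 10s sampling
SAMPLE_INTERVAL_S = 10
ALIAS_PERIOD_S = 60.0  # the slow pattern the samples appear to follow


def signal(t):
    """The real signal: a sine wave with a 12s period."""
    return math.sin(2 * math.pi * t / TRUE_PERIOD_S)


def alias(t):
    """The apparent signal: an inverted sine wave with a 60s period."""
    return -math.sin(2 * math.pi * t / ALIAS_PERIOD_S)


sample_times = range(0, 121, SAMPLE_INTERVAL_S)
samples = [signal(t) for t in sample_times]

# At every sample point the real signal and the slow alias are indistinguishable.
max_diff = max(abs(signal(t) - alias(t)) for t in sample_times)

for t, value in zip(sample_times, samples):
    print(f"t={t:>3}s  sampled={value:+.3f}  60s alias={alias(t):+.3f}")
print(f"largest difference at sample points: {max_diff:.2e}")
```

Between the sample points the two curves differ a great deal, which is why temporarily raising the resolution is a reasonable way to check a suspicious pattern.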
  • 46. #9 Know when to use Logs and Metrics You want a metrics time series system for your primary monitoring. Logs have information about every event. This limits the number of fields (<100), but you have unlimited cardinality. Metrics aggregate across events, but you can have many metrics (>10000) with limited cardinality. Metrics help you determine where in the system the problem is. From there, logs can help you pinpoint which requests are tickling the problem.
  • 47. #10 Have a way to deal with non-critical alerts Most alerts don't justify waking up someone at night, but someone needs to look at them sometime. Often they're sent to a mailing list, where everyone promptly filters them away. Better to have some form of ticketing system that'll assign a single owner for each alert. A daily email with all firing alerts that the oncall has to process can also work.
  • 48. Questions? Project Website: prometheus.io Demo: demo.robustperception.io Company Website: www.robustperception.io