Prometheus Course
Prometheus
Course Introduction
Prometheus
• Thank you for taking this Prometheus Course
Course Overview
Introduction, Concepts, Monitoring, Alerting, Querying, Internals, Use cases
Architecture
• Prometheus
• DevOps advocate
Prometheus
• In Prometheus we talk about Dimensional Data: time series are identified by a metric
name and a set of key/value pairs
Metric name | Label | Sample
Temperature | location=outside | 90
• It stores metrics in memory and on local disk in its own custom, efficient format
• It is written in Go
How does Prometheus work?
• Prometheus collects metrics from monitored targets (e.g. a database server or a
Windows server) by scraping a metrics HTTP endpoint
• This is fundamentally different from most other monitoring and alerting systems
(except Google's Borgmon, which also works this way)
Prometheus
Installation
Prometheus Installation
• I will install Prometheus using scripts from our GitHub repository (https://github.com/
in4it/prometheus-course)
• Feel free to use the scripts with any Cloud Provider, Virtual Machine, or Docker
image, as long as it’s a recent Linux distribution
• To get a free $100 coupon on DigitalOcean, valid for 60 days with a valid payment
method added, use the following link:
https://m.do.co/c/b71b388ab76f
• It’s best to use the scripts we provided so that your environment is the
same as ours when you follow the demos
• metric: go_memstats_alloc_bytes
• instance=localhost:9090
• job=prometheus
• Samples are a float64 value and, optionally, a millisecond-precision timestamp
• For example:
• node_boot_time{instance="localhost:9100",job="node_exporter"}
# my global config
global:
scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
# scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
# - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
# - "first_rules.yml"
# - "second_rules.yml"
• For example, to scrape metrics from Prometheus itself, the following code
block is added by default:
static_configs:
- targets: ['localhost:9090']
• The node exporter will expose machine metrics of Linux / *nix machines
• The node exporter can be used to monitor machines, and later on, you
can create alerts based on these ingested metrics
Monitor nodes
(Diagram: Prometheus scraping a Linux machine and a Windows machine)
• Pushing Metrics
• Querying
• Service Discovery
• Exporters
• Libraries
• Unofficial: Bash, C++, Common Lisp, Elixir, Erlang, Haskell, Lua for
Nginx, Lua for Tarantool, .NET / C#, Node.js, PHP, Rust
• Protocol-buffer format (Prometheus 2.0 removed support for the protocol-buffer format)
metric_name [
"{" label_name "=" `"` label_value `"` { "," label_name "=" `"` label_value `"` } [ "," ] "}"
] value [ timestamp ]
node_filesystem_avail_bytes{device="/dev/vda1",fstype="ext4",mountpoint="/"} 4.9386491904e+10
node_filesystem_avail_bytes{device="/dev/vda15",fstype="vfat",mountpoint="/boot/efi"} 1.05903104e+08
node_filesystem_avail_bytes{device="lxcfs",fstype="fuse.lxcfs",mountpoint="/var/lib/lxcfs"} 0
node_filesystem_avail_bytes{device="tmpfs",fstype="tmpfs",mountpoint="/run"} 2.01273344e+08
node_filesystem_avail_bytes{device="tmpfs",fstype="tmpfs",mountpoint="/run/lock"} 5.24288e+06
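As a sketch of how these exposition-format lines are structured, the sample lines above can be parsed with a small, simplified helper (hypothetical code, not part of any Prometheus client library; it ignores HELP/TYPE comment lines and assumes label values contain no commas or escaped quotes):

```python
import re

# One sample line: metric_name{label="value",...} sample_value
LINE_RE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
                     r'(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$')

def parse_sample(line):
    m = LINE_RE.match(line)
    if not m:
        raise ValueError("not a sample line: %r" % line)
    labels = {}
    if m.group('labels'):
        # Simplified: assumes no commas or escaped quotes inside label values
        for pair in m.group('labels').split(','):
            k, v = pair.split('=', 1)
            labels[k] = v.strip('"')
    return m.group('name'), labels, float(m.group('value'))

name, labels, value = parse_sample(
    'node_filesystem_avail_bytes{device="/dev/vda1",fstype="ext4",mountpoint="/"} 4.9386491904e+10')
```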
• Counter
• Gauge
• Histogram
• Summary
Gauge: a single numeric value that can go up and down (e.g. CPU load,
temperature)
Prometheus
Instrumentation- Python
Client Libraries - Python Example
• https://github.com/prometheus/client_python
from flask import Flask, render_template_string
from prometheus_client import Counter, Gauge, Summary, generate_latest

# Metric definitions (reconstructed to match the handlers below)
REQUESTS = Counter('http_requests_total', 'Total HTTP requests',
                   ['method', 'endpoint', 'status_code'])
IN_PROGRESS = Gauge('http_requests_in_progress', 'In-progress HTTP requests')
TIMINGS = Summary('http_request_duration_seconds', 'HTTP request latency')

app = Flask(__name__)

@app.route('/')
@TIMINGS.time()
@IN_PROGRESS.track_inprogress()
def hello_world():
    REQUESTS.labels(method='GET', endpoint="/", status_code=200).inc()  # Increment the counter
    return 'Hello, World!'

@app.route('/prometheus-course/<name>')
@IN_PROGRESS.track_inprogress()
@TIMINGS.time()
def index(name):
    REQUESTS.labels(method='GET', endpoint="/prometheus-course/<name>", status_code=200).inc()
    return render_template_string('<b>Hello {{name}}, welcome!</b>', name=name)

@app.route('/metrics')
@IN_PROGRESS.track_inprogress()
def metrics():
    return generate_latest()

if __name__ == "__main__":
    app.run(host='0.0.0.0')
• Easy to implement:
package main
import (
"github.com/prometheus/client_golang/prometheus/promhttp"
"net/http"
)
func main() {
http.Handle("/metrics", promhttp.Handler())
panic(http.ListenAndServe(":8080", nil))
}
func init() {
	prometheus.MustRegister(jobsQueued)
}

func runNextJob() {
	job := queue.Dequeue()
	jobsQueued.WithLabelValues(job.Type()).Dec()
	job.Run()
}
start := time.Now()
job.Run()
duration := time.Since(start)
jobsDurationHistogram.WithLabelValues(job.Type()).Observe(duration.Seconds())
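The pattern above (time a job, then observe the duration into a histogram) can be sketched in Python with only the standard library. This is an illustrative mini-histogram, not the real client-library type; the bucket bounds are assumed, and the counts here are per-bucket (Prometheus exposes them cumulatively):

```python
import bisect
import time

class MiniHistogram:
    """Toy histogram: count observations into buckets and track their sum."""
    def __init__(self, buckets=(0.005, 0.05, 0.5, 5.0)):
        self.buckets = sorted(buckets)
        self.counts = [0] * (len(self.buckets) + 1)  # last slot acts as +Inf
        self.sum = 0.0

    def observe(self, value):
        # Find the first bucket whose upper bound is >= value
        self.counts[bisect.bisect_left(self.buckets, value)] += 1
        self.sum += value

hist = MiniHistogram()
start = time.perf_counter()
sum(range(1000))                       # the "job" being timed
hist.observe(time.perf_counter() - start)
```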
• Diagram: the app pushes metrics to the Pushgateway; Prometheus pulls metrics from the Pushgateway
• Pitfall:
• The Pushgateway never forgets metrics unless they are deleted via the API,
for example:
curl -X DELETE http://localhost:9091/metrics/job/prom_course/instance/localhost
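The URL path in the curl example above is built from the job name plus the grouping labels. A small sketch of that construction (a hypothetical helper; note the real Pushgateway additionally supports a base64 encoding for label values that contain `/`):

```python
from urllib.parse import quote

def pushgateway_path(job, **grouping):
    """Build /metrics/job/<job>/<label>/<value>/... for a metric group."""
    parts = ['/metrics', 'job', quote(job, safe='')]
    for k, v in sorted(grouping.items()):
        parts += [k, quote(v, safe='')]
    return '/'.join(parts)

path = pushgateway_path('prom_course', instance='localhost')
# Sending a DELETE request to http://localhost:9091 + path removes that group
```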
• If NAT and/or a firewall is blocking you from using the pull mechanism
registry = CollectorRegistry()
g = Gauge('job_last_success_unixtime', 'Last time the course batch job has finished', registry=registry)
g.set_to_current_time()
push_to_gateway('localhost:9091', job='batchA', registry=registry)
• pushadd_to_gateway only replaces metrics with the same name and grouping key
• delete_from_gateway deletes metrics with the given job and grouping key.
gatewayUrl := "http://localhost:9091/"
throughputGauge := prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "throughput",
	Help: "Throughput in Mbps",
})
throughputGauge.Set(800)
if err := push.Collectors(
	"throughput_job", push.HostnameGroupingKey(),
	gatewayUrl, throughputGauge,
); err != nil {
	fmt.Println("Could not push completion time to Pushgateway:", err)
}
• PromQL is read-only
• Example:
100 - (avg by (instance) (irate(node_cpu_seconds_total{job='node_exporter',mode="idle"}[5m])) * 100)
• Range vector - a set of time series containing a range of data points over
time for each time series
Example: node_cpu_seconds_total[5m]
• Aggregation operators
Examples: sum (calculate sum over dimensions), min (select minimum over dimensions), max
(select maximum over dimensions), avg (calculate the average over dimensions), stddev
(calculate population standard deviation over dimensions), stdvar (calculate population standard
variance over dimensions), count (count number of elements in the vector), count_values (count
number of elements with the same value), bottomk (smallest k elements by sample value), topk
(largest k elements by sample value), quantile (calculate φ-quantile (0 ≤ φ ≤ 1) over dimensions)
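To make the "over dimensions" idea concrete, here is a simplified model of how an aggregation like `avg by (instance)` groups an instant vector: samples sharing the `by` label are averaged together (this sketch drops all other labels, which real PromQL also does for labels not listed in `by`):

```python
from collections import defaultdict

def avg_by(samples, by):
    """Average sample values grouped by one label, like `avg by (<label>)`."""
    groups = defaultdict(list)
    for labels, value in samples:
        groups[labels[by]].append(value)
    return {k: sum(v) / len(v) for k, v in groups.items()}

samples = [
    ({'instance': 'host1', 'mode': 'idle'},   0.9),
    ({'instance': 'host1', 'mode': 'system'}, 0.1),
    ({'instance': 'host2', 'mode': 'idle'},   0.5),
]
result = avg_by(samples, 'instance')   # {'host1': 0.5, 'host2': 0.5}
```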
Prometheus course: Edward Viaene & Jorn Jambers
Demo
Querying
Prometheus
Service Discovery
Service Discovery - Introduction
• Definition:
Service discovery is the automatic detection of devices and services
offered by these devices on a computer network.
scrape_configs:
- job_name: 'node'
ec2_sd_configs:
- region: eu-west-1
access_key: PUT_THE_ACCESS_KEY_HERE
secret_key: PUT_THE_SECRET_KEY_HERE
port: 9100
• Make sure the user has the AmazonEC2ReadOnlyAccess IAM policy attached
• Make sure your security groups allow access to ports 9100 and 9090
scrape_configs:
- job_name: 'node'
ec2_sd_configs:
- region: eu-west-1
access_key: PUT_THE_ACCESS_KEY_HERE
secret_key: PUT_THE_SECRET_KEY_HERE
port: 9100
relabel_configs:
# Only monitor instances with a tag Name starting with "PROD"
- source_labels: [__meta_ec2_tag_Name]
regex: PROD.*
action: keep
# Use the instance ID as the instance label
- source_labels: [__meta_ec2_instance_id]
target_label: instance
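The `keep` action in the relabel config above can be modelled in a few lines: any target whose source label does not match the (fully anchored, as in Prometheus) regex is dropped. This is an illustrative sketch, not Prometheus code:

```python
import re

def relabel_keep(targets, source_label, regex):
    """Keep only targets whose source label fully matches the regex."""
    pattern = re.compile('^(?:%s)$' % regex)   # Prometheus anchors regexes
    return [t for t in targets if pattern.match(t.get(source_label, ''))]

targets = [
    {'__meta_ec2_tag_Name': 'PROD-web-1'},
    {'__meta_ec2_tag_Name': 'DEV-web-1'},
]
kept = relabel_keep(targets, '__meta_ec2_tag_Name', 'PROD.*')
# kept contains only the PROD-web-1 target
```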
kubernetes_sd_configs:
-
api_servers:
- https://kubernetes.default.svc
in_cluster: true
basic_auth:
username: prometheus
password: secret
retry_interval: 5s
- job_name: 'kubernetes-service-endpoints'
kubernetes_sd_configs:
-
api_servers:
- https://kube-master.prometheuscourse.com
in_cluster: true
• Format target.json
[
{
"targets": [ "myslave1:9104", "myslave2:9104" ],
"labels": {
"env": "prod",
"job": "mysql_slave"
}
},
{
"targets": [ "mymaster:9104" ],
"labels": {
"env": "prod",
"job": "mysql_master"
}
}
]
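Since file-based service discovery just watches JSON (or YAML) files, a target file like the one above can be generated from code. A minimal sketch (the file path Prometheus reads from is whatever your `file_sd_configs` entry points at — assumed here):

```python
import json

# Target groups matching the target.json example above
target_groups = [
    {"targets": ["myslave1:9104", "myslave2:9104"],
     "labels": {"env": "prod", "job": "mysql_slave"}},
    {"targets": ["mymaster:9104"],
     "labels": {"env": "prod", "job": "mysql_master"}},
]

doc = json.dumps(target_groups, indent=2)   # write this to your file_sd path
parsed = json.loads(doc)                    # round-trip to verify the structure
```

Prometheus re-reads such files automatically when they change, so no reload is needed after updating the targets.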
• When Prometheus is not able to pull metrics directly (Linux system stats, HAProxy, …)
• Examples:
MySQL server exporter
Memcached exporter
Consul exporter
Node/system metrics exporter
MongoDB
Redis
Many more….
• https://prometheus.io/docs/instrumenting/exporters/
• Diagram: Prometheus pushes alerts to the Alertmanager, which applies routes and sends notifications to receivers such as email and Slack
Alerting - Alerting rules
Alerting Rules
• Rules live in the Prometheus server config
groups:
- name: example
rules:
• Alert example:
- alert: cpuUsage
  expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{job='node_exporter',mode="idle"}[5m])) * 100) > 95
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Machine under heavy load
• Example:
groups:
- name: Important instance
rules:
templates:
- '/etc/alertmanager/template/*.tmpl'
route:
repeat_interval: 1h
receiver: operations-team
receivers:
- name: 'operations-team'
email_configs:
- to: '[email protected]'
slack_configs:
- api_url: https://hooks.slack.com/services/XXXXXX/XXXXXX/XXXXXX
channel: '#prometheus-course'
send_resolved: true
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- localhost:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
# - "first_rules.yml"
# - "second_rules.yml"
static_configs:
- targets: ['localhost:9090']
• You can create a highly available Alertmanager cluster using the mesh config
• Slack
• Set up an alert
From: https://github.com/prometheus/prometheus
Remote Storage
• Remote storage is primarily focused on long-term storage
Source: https://prometheus.io/docs/operating/integrations/#remote-endpoints-and-storage
• Those 2h samples are stored in separate directories (in the data directory of
Prometheus)
• Writes are batched and written to disk in chunks, containing multiple data points
Local Storage
• Every directory also has an index file (index) and a metadata file (meta.json)
• It stores the metric names and the labels, and provides an index from the
metric names and labels to the series in the chunk files
(Diagram: each block directory contains chunks/000001, chunks/000002, …, meta.json and index)
Local Storage
• The most recent data is kept in memory
• You don’t want to lose the in-memory data during a crash, so the data also
needs to be persisted to disk. This is done using a write-ahead log (WAL)
(Diagram: the block directories with chunks, meta.json and index, plus a wal/ directory containing segments 000001, 000002, …)
• If there’s a server crash and the data from memory is lost, then the WAL
will be replayed
• This is more efficient than immediately deleting the data from the chunk files, as
the actual delete can happen at a later time (e.g. when there’s not a lot of load)
Local Storage
• Block characteristics:
• When querying, the blocks not in the time range can be skipped
• Too many blocks could cause too much merging overhead, so blocks
are compacted
• 2 blocks are merged and form a newly created (often larger) block
• The index will contain an inverted index for the labels, for example
for label env=production, it’ll have 1 and 3 as IDs if those series
contain the label env=production
Local Storage
• What about Disk size?
• You can use the following formula to calculate the disk space needed:
• You can increase the scrape interval, which will get you less data
• Or you can reduce the retention (how long you keep the data)
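The sizing rule of thumb from the Prometheus storage documentation is: needed disk space ≈ retention time × ingested samples per second × bytes per sample, with roughly 1–2 bytes per sample after compression. A quick back-of-the-envelope calculator (the 2-bytes-per-sample default is an assumption on the conservative end):

```python
def estimated_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2):
    """Rough TSDB disk-space estimate: retention * ingestion rate * bytes/sample."""
    return retention_days * 24 * 3600 * samples_per_second * bytes_per_sample

# e.g. 15-day retention at 10,000 samples/s -> ~26 GB
estimate = estimated_disk_bytes(15, 10_000)
```

Both knobs in the bullets above show up directly in the formula: a longer scrape interval lowers samples_per_second, and a shorter retention lowers retention_days.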
• You can still enable authentication and TLS, using a reverse proxy
• This is only valid for server components; Prometheus can scrape TLS- and
authentication-enabled targets
• See tls_config in the Prometheus configuration to configure a CA certificate,
user certificate and user key
• You’d still need to set up a reverse proxy for the targets themselves
Demo
Prometheus TLS and authentication
Demo
Prometheus mutual TLS for targets
Prometheus Use Cases
Monitoring a web app
Prometheus with Python Flask and MySQL
Monitoring a web app
• I’m going to integrate Prometheus monitoring with a web application
based on Python
• It will create an HTTP server and I’ll be able to configure routes (e.g. /query)
• I’ll include one normal query and one “badly behaving” query that will
take between 0 and 10 seconds to execute
• A Counter to capture the number of times an HTTP endpoint is hit and the
number of times a MySQL query is executed
• The value of a Counter must always increase; that’s why you should use
the Counter type for these kinds of data
• A Histogram to capture the latency of the HTTP requests and the MySQL Queries
• The default buckets are intended to cover a typical web/rpc request from
milliseconds to seconds
start_time = time.time()
# ... execute the query ...
query_latency = time.time() - start_time
MYSQL_REQUEST_LATENCY.labels(sql[:50]).observe(query_latency)
MYSQL_REQUEST_COUNT.labels(sql[:50]).inc()
• Spring Boot
• Micrometer
• Protected by default
• Adjustable in application.properties
Monitoring a web app
• Micrometer
• pom.xml example
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-core</artifactId>
</dependency>
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
• Code example
…
import io.micrometer.core.instrument.Metrics;
…
private Counter runCounter = Metrics.counter("runCounter");
…
@GetMapping("/api/demo")
@Timed
public String apiUse() throws InterruptedException {
runCounter.increment();
log.info("Hello world app accessed on /api/demo");
return "Hello world";
}
• Rather than using the UI, you can also use YAML and JSON files to provision
Grafana with datasources and dashboards
• This is a much more powerful way of using Grafana, as you can test new
dashboards first on a dev / test server, then import the newly created
dashboards to production
• You can do the import manually through the UI, or using YAML and JSON
files
• When using files, you can keep them within version control to keep
changes, revisions and backups
Grafana Provisioning
• The configuration of Grafana is all kept in /etc/grafana:
/etc/grafana/:
-rw-r----- 1 root grafana 14K Jul 17 12:30 grafana.ini
-rw-r----- 1 root grafana 3.4K Jul 17 12:30 ldap.toml
drwxr-xr-x 4 root grafana 4.0K Jul 17 13:15 provisioning/
/etc/grafana/provisioning/
drwxr-xr-x 2 root grafana 4.0K Jul 17 14:56 dashboards/
drwxr-xr-x 2 root grafana 4.0K Jul 17 15:34 datasources/
Grafana Provisioning
• You can change the database & paths in /etc/grafana/grafana.ini
[paths]
# Path to where grafana can store temp files, sessions, and the sqlite3 db (if that is used)
;data = /var/lib/grafana
# Directory where grafana will automatically scan and look for plugins
;plugins = /var/lib/grafana/plugins
# folder that contains provisioning config files that grafana will apply on startup and while running.
;provisioning = conf/provisioning
…
[database]
# Either "mysql", "postgres" or "sqlite3", it's your choice
;type = sqlite3
;host = 127.0.0.1:3306
;name = grafana
;user = root
# If the password contains # or ; you have to wrap it with triple quotes. Ex """#password;"""
;password =
• Installation
• Querying metrics
• A Service Mesh
• Service Discovery
• A Key-Value store
• Multi-datacenter support
• 1) Prometheus can scrape Consul’s metrics and provide you with all
sorts of information about your running services
Consul integration
• In the next demo I’ll focus on the Prometheus integration with Consul, not on
implementing Consul itself
• I’ll show you the installation of Consul, but not how to integrate Consul with your
infrastructure (it’s out of scope for this Prometheus course)
• https://github.com/in4it/prometheus-course/blob/master/use-cases/ec2-
auto-discovery/lab.txt
Prometheus on Kubernetes
Getting Kubernetes metrics