To get started, we need tools to monitor things with! We'll be using a combination of Prometheus, Alertmanager, and Grafana: Prometheus is a pull-based monitoring and alerting solution, Alertmanager collects any alerts fired by Prometheus and pushes out notifications, and Grafana queries the collected metrics to create visualizations.
If we're going to have a monitoring course, we need something to monitor! Part of that is
going to be our Ubuntu 18.04 host, but another equally important part is going to be a
web application that already exists on the provided Playground server for this course.
The application is a simple to-do list program called Forethought that uses the Express
web framework to do most of the hard work for us. The application has also been
Dockerized and saved as an image (also called forethought) and is ready for us to
deploy.
Want to use your own server and not the provided Playground? See the steps in the study guide!
5. Deploy the web application to a container. Map port 8080 on the container to port
80 on the host:
6. $ docker run --name ft-app -p 80:8080 -d forethought
7. Check that the application is working correctly by visiting the server's provided
URL.
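If you'd rather check from the command line first (assuming the container started cleanly and port 80 is reachable locally), something like this works:
docker ps --filter name=ft-app
curl -I http://localhost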
Prometheus Setup
Now that we have what we're monitoring set up, we need to get our monitoring tool itself
up and running, complete with a service file. Prometheus is a pull-based monitoring
system that scrapes various metrics set up across our system and stores them in a
time-series database, where we can use a web UI and the PromQL language to view
trends in our data. Prometheus provides its own web UI, but we'll also be pairing it with
Grafana later, as well as an alerting system.
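Once it's running (see the steps below), Prometheus's own web UI is usually reachable on port 9090 of the host, e.g. http://localhost:9090 from the server itself (9090 is the default port, assuming it hasn't been changed). A good first expression to try in the expression editor is up, which returns 1 or 0 for each scrape target depending on whether its last scrape succeeded:
up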
Steps in This Video
1. Create a system user for Prometheus:
2. sudo useradd --no-create-home --shell /bin/false prometheus
3. Create the directories in which we'll be storing our configuration files and
libraries:
4. sudo mkdir /etc/prometheus
5. sudo mkdir /var/lib/prometheus
8. Pull down the tar.gz file from the Prometheus downloads page:
9. cd /tmp/
10. wget https://github.com/prometheus/prometheus/releases/download/v2.7.1/prometheus-2.7.1.linux-amd64.tar.gz
13. Move the configuration file and set the owner to the prometheus user:
14. sudo mv console* /etc/prometheus
15. sudo mv prometheus.yml /etc/prometheus
16. sudo chown -R prometheus:prometheus /etc/prometheus
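For reference, the prometheus.yml we just moved into /etc/prometheus is a short YAML file. A minimal sketch (not necessarily the exact file shipped in the tarball) that scrapes Prometheus itself on localhost:9090 looks like this:
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']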
Add the following to the Prometheus systemd unit file:
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
--config.file /etc/prometheus/prometheus.yml \
--storage.tsdb.path /var/lib/prometheus/ \
--web.console.templates=/etc/prometheus/consoles \
--web.console.libraries=/etc/prometheus/console_libraries
[Install]
WantedBy=multi-user.target
47. Reload systemd, and then start the prometheus and alertmanager services:
48. sudo systemctl daemon-reload
49. sudo systemctl start prometheus
50. sudo systemctl start alertmanager
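To confirm the services came up cleanly, and to have Prometheus start at boot, the standard systemd commands apply (assuming the unit names above):
sudo systemctl status prometheus
sudo systemctl enable prometheus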
https://prometheus.io/download/
cd /tmp
Comment out the Alertmanager line for now.
Grafana Setup
While Prometheus provides us with a web UI to view our metrics and craft charts, the
web UI alone is often not the best solution to visualizing our data. Grafana is a robust
visualization platform that will allow us to better see trends in our metrics and give us
insight into what's going on with our applications and servers. It also lets us use multiple
data sources, not just Prometheus, which gives us a full view of what's happening.
3. Download and install Grafana using the .deb package provided on the Grafana
download page:
4. wget https://dl.grafana.com/oss/release/grafana_5.4.3_amd64.deb
5. sudo dpkg -i grafana_5.4.3_amd64.deb
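The .deb installs Grafana as a systemd service named grafana-server (the package's default); starting and enabling it looks like:
sudo systemctl start grafana-server
sudo systemctl enable grafana-server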
Add a Dashboard
https://grafana.com/docs/v3.1/installation/rpm/
https://www.fosslinux.com/8328/how-to-install-and-configure-grafana-on-centos-7.htm
https://www.fosslinux.com/10398/how-to-install-and-configure-prometheus-on-centos-7.htm/
https://www.fosslinux.com/8424/install-and-configure-check_mk-server-on-centos-7.htm
https://grafana.com/grafana/download
The default username and password are both admin.
Click Add data source.
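When adding the Prometheus data source, the URL is simply wherever Prometheus is listening; with the setup from earlier, that would be http://localhost:9090. If you prefer to script it, Grafana's HTTP API can add the data source as well (a sketch using the default admin credentials):
curl -s -u admin:admin -X POST http://localhost:3000/api/datasources \
  -H 'Content-Type: application/json' \
  -d '{"name":"Prometheus","type":"prometheus","url":"http://localhost:9090","access":"proxy"}'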
Push or Pull
Within monitoring there is an age-old battle that puts the Vim-versus-Emacs debate to shame: whether to use a push-based or a pull-based monitoring solution. And while Prometheus is a pull-based monitoring system, it's important to know your options before actually implementing your monitoring; after all, this is a course about gathering and using your monitoring data, not a course on Prometheus itself.
Pull-Based Monitoring
When using a pull system to monitor our environments and applications, the monitoring solution itself queries our metrics endpoints, such as the one located at :3000/metrics on our Playground server. That particular endpoint serves Grafana's own metrics, but the output looks the same regardless of the endpoint.
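For example, we can see exactly what a scraper would pull from that endpoint by requesting it ourselves (assuming Grafana is running locally on port 3000):
curl -s http://localhost:3000/metrics | head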
Pull-based systems allow us to better check the status of our targets, let us run
monitoring from virtually anywhere, and provide us with web endpoints we can check for
our metrics. That said, they are not without their concerns: Since a pull-based system is
doing the scraping, the metrics might not be as "live" as an event-based push system,
and if you have a particularly complicated network setup, then it might be difficult to
grant the monitoring solution access to all the endpoints it needs to connect with.
Push-Based Monitoring
Push-based monitoring solutions offload a lot of the "work" from the monitoring platform
to the endpoints themselves: The endpoints are the ones that push their metrics up to
the monitoring application. Push systems are especially useful when you need event-
based monitoring, and can't wait every 15 or so seconds for the data to be pulled in.
They also allow for greater modularity, offloading most of the difficult work to the clients
they serve.
That said, many push-based systems have greater setup requirements and overhead
than pull-based ones, and the majority of the managing isn't done through only the
monitoring server.
Which to Choose
Despite the debate, one system is not necessarily better than the other, and a lot of it
will depend on your individual needs. Not sure which is best for you? I would suggest
taking the time to set a system of either type up on a dev environment and note the pain
points — because anything causing trouble on a test environment is going to cause
bigger problems on production, and those issues will most likely dictate which system
works best for you.
Patterns and Anti-Patterns
Unfortunately for us, there are a lot of ways to do inefficient monitoring. From monitoring
the wrong thing to spending too much time setting up the coolest new monitoring tool,
monitoring can often become a relentless series of broken and screaming alerts for
problems we're not sure how to fix. In this lesson, we'll address some of the most
common monitoring issues and think about how to avoid them.
Service Discovery
22. Refresh the Targets page on the web UI. All three targets are now available!
Add grafana as a scrape target.
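As a sketch (assuming Grafana is listening on localhost:3000), the new scrape job in prometheus.yml would look something like this; restart or reload Prometheus afterward so it picks up the target:
  - job_name: 'grafana'
    static_configs:
      - targets: ['localhost:3000']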
Infrastructure Monitoring
6. Extract its contents; note that the versioning of the Node Exporter may be
different:
7. $ tar -xvf node_exporter-0.17.0.linux-amd64.tar.gz
44. Wait for about one minute, and then view the graph to see the difference in
activity.
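Before querying Prometheus, you can confirm the Node Exporter is serving metrics by hitting its default port directly (9100, assuming the default configuration):
curl -s http://localhost:9100/metrics | head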
CPU Metrics
Memory Metrics
node_memory_MemTotal_bytes
node_memory_MemFree_bytes
node_memory_MemAvailable_bytes
node_memory_Buffers_bytes
node_memory_Cached_bytes
Those who do a bit of systems administration, incident response, and the like have
probably used free before to check the memory of a system. The metric expressions
listed above provide us with what is essentially the same data as free but in a time
series where we can witness trends over time or compare memory between multiple
system builds.
node_memory_MemTotal_bytes provides us with the amount of memory on the server as
a whole — in other words, if we have 64 GB of memory, then this would always be 64
GB of memory, until we allocate more. While on its own this is not the most helpful
number, it helps us calculate the amount of in-use memory:
node_memory_MemTotal_bytes - node_memory_MemFree_bytes
Here, node_memory_MemFree_bytes denotes the amount of free memory left on the
system, not including caches and buffers that can be cleared. To see the amount
of available memory, including caches and buffers that can be freed up, we would
use node_memory_MemAvailable_bytes. And if we wanted to see the cache and buffer
data itself, we would use node_memory_Cached_bytes and node_memory_Buffers_bytes,
respectively.
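Putting these together, a common derived expression is the percentage of memory actually in use; a sketch using the metrics above (MemAvailable is usually a better baseline than MemFree for this):
100 * (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes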
Disk Metrics
irate(node_disk_write_time_seconds_total[30s]) / irate(node_disk_io_time_seconds_total[30s])
Additionally, we're also provided with a gauge-based metric that lets us see how many
I/O operations are occurring at a point in time:
node_disk_io_now
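If we want raw throughput rather than time spent on I/O, the same counter-and-rate pattern applies; for example, with the standard Node Exporter disk counters:
rate(node_disk_read_bytes_total[30s])
rate(node_disk_written_bytes_total[30s])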
File system metrics contain information about our mounted file systems. These metrics are taken from a few different sources, but all use the node_filesystem prefix when we view them in Prometheus.
Although most of the seven metrics we're provided here are fairly straightforward, there
are some caveats we want to address — the first being the difference
between node_filesystem_avail_bytes and node_filesystem_free_bytes. While for
some systems these two metrics may be the same, in many Unix systems a portion of
the disk is reserved for the root user. In this
case, node_filesystem_free_bytes contains the amount of free space, including the
space reserved for root, while node_filesystem_avail_bytes contains only the
available space for all users.
Let's go ahead and look at the node_filesystem_avail_bytes metric in our expression
editor. Notice how we have a number of file systems mounted that we can view: Our
main xvda disk, the LXC file system for our container, and various temporary file
systems. If we wanted to limit which file systems we view on the graph, we can uncheck
the systems we're not interested in.
The file system collector also supplies us with more labels than we've previously seen.
Labels are the key-value pairs we see in the curly brackets next to the metric. We can
use these to further manipulate our data, as we saw in previous lessons. So, if we
wanted to view only our temporary file systems, we can use:
node_filesystem_avail_bytes{fstype="tmpfs"}
Of course, these features can be used across all metrics and are not just limited to the
file system. Other metrics may also have their own specific labels, much like
the fstype and mountpoint labels here.
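Combining these labels with a little math gives us expressions like the percentage of space used on a given mount; a sketch assuming the standard node_filesystem_size_bytes metric:
100 * (node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_avail_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"}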
Networking Metrics
When we discuss network monitoring through the Node Exporter, we're talking about
viewing networking data from a systems administration or engineering viewpoint: The
Node Exporter provides us with networking device information pulled both
from /proc/net/dev and /sys/class/net/INTERFACE, with INTERFACE being the name of
the interface itself, such as eth0. All network metrics are prefixed with
the node_network name.
Should we take a look at node_network in the expression editor, we can see quite a
number of options — many of these are information gauges whose data is pulled from
that /sys/class/net/INTERFACE directory. So, when we look at node_network_dormant,
we're seeing point-in-time data from the /sys/class/net/INTERFACE/dormant file.
But with regards to metrics that the average user will need in terms of day-to-day
monitoring, we really want to look at the metrics prepended with
either node_network_transmit or node_network_receive, as this contains information
about the amount of data/packets that pass through our networking, both outbound
(transmit) and inbound (receive). Specifically, we want to look at
the node_network_receive_bytes_total or node_network_transmit_bytes_total metrics, because these are what will help us calculate our network bandwidth:
rate(node_network_transmit_bytes_total[30s])
rate(node_network_receive_bytes_total[30s])
The above expressions will show us the 30-second average of bytes either transmitted
or received across our time series, allowing us to see when our network bandwidth has
spiked or dropped.
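The same expressions can be narrowed with the device label if we only care about a particular interface, for example (assuming an interface named eth0):
rate(node_network_receive_bytes_total{device="eth0"}[30s])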
Load Metrics
When we talk about load, we're referencing the number of processes waiting to be
served by the CPU. You've probably seen these metrics before: They're sitting at the
top of any top command run, and are available for us to view in the /proc/loadavg file.
Averaged over the last 1, 5, and 15 minutes, the load average gives us a snapshot of how hard
our system is working. We can view these statistics in Prometheus
at node_load1, node_load5, and node_load15.
That said, load metrics are mostly useless from a monitoring standpoint. What is a
heavy load to one server can be an easy load for another, and beyond looking at any
trends in load in the time series, there is nothing we can alert on here nor any real data
we can extract through queries or any kind of math.
Although we have our host monitored for various common metrics at this time, the Node
Exporter doesn't cross the threshold into monitoring our containers. Instead, if we want
to monitor anything we have in Docker, including our application, we need to add a
container monitoring solution.
Lucky for us, Google's cAdvisor is an open-source solution that works out of the box
with most container platforms, including Docker. And once we have cAdvisor installed,
we can see many of the same metrics we see for our host on our containers, only these are provided to us through the container prefix.
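cAdvisor itself typically runs as a container; a common way to launch it on a Docker host (a sketch, not necessarily the exact command used in the course) publishes its UI and metrics on port 8080:
docker run -d --name cadvisor \
  -p 8080:8080 \
  -v /:/rootfs:ro \
  -v /var/run:/var/run:ro \
  -v /sys:/sys:ro \
  -v /var/lib/docker/:/var/lib/docker:ro \
  google/cadvisor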
cAdvisor also monitors all our containers automatically. That means when we view a
metric, we're seeing it for everything that cAdvisor monitors. Should we want to target
specific containers, we can do so by using the name label, which pulls the container
name from the name it uses in Docker itself.
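For example, to watch memory usage for just our to-do application container, we can filter on that label (assuming the container still uses the ft-app name from earlier):
container_memory_usage_bytes{name="ft-app"}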