DevOps Shack _ Comprehensive Monitoring Guide
DevOps Shack _ Comprehensive Monitoring Guide
DevOps Shack
Comprehensive Monitoring Guide:
Prometheus and Grafana
Table of Contents
1. Introduction to Monitoring with Prometheus and Grafana
o Overview of modern monitoring
o Why choose Prometheus and Grafana?
o Key features and benefits
2. Setting Up Prometheus
o Installation and configuration
o Understanding Prometheus architecture
o Configuring scrape jobs
3. Prometheus Data Model and Query Language (PromQL)
o Metrics, labels, and time-series data
o Writing basic and advanced PromQL queries
o Aggregations, filters, and functions
4. Setting Up Grafana
o Installing Grafana
o Configuring data sources (Prometheus)
o User management and authentication
5. Building Dashboards in Grafana
o Creating and customizing dashboards
o Adding and configuring panels
o Using templating and variables
2
6. Alerting with Prometheus and Grafana
o Configuring Prometheus Alertmanager
o Defining alert rules and notifications
o Setting up Grafana alerting
7. Exporters and Instrumentation
o Understanding Prometheus exporters
o Popular exporters (Node Exporter, cAdvisor, Blackbox, etc.)
o Custom instrumentation for applications
8. Scaling and Performance Optimization
o Handling high cardinality metrics
o Federation and remote storage options
o Best practices for optimizing queries
9. Monitoring Kubernetes with Prometheus and Grafana
o Deploying Prometheus in Kubernetes
o Using kube-state-metrics and cAdvisor
o Building Kubernetes-specific dashboards
10. Security, Logging, and Maintenance
Securing Prometheus and Grafana instances
Logging and auditing
Regular maintenance and troubleshooting
3
1. Introduction to Monitoring with Prometheus and
Grafana
Overview of Modern Monitoring
In today’s fast-paced digital world, monitoring plays a crucial role in
maintaining the reliability, performance, and security of IT infrastructure.
Whether managing traditional servers, cloud-based environments, or
containerized applications, monitoring helps detect issues before they impact
end-users.
Modern monitoring solutions follow these key principles:
Proactive Detection: Identifying issues before they cause failures.
Real-time Insights: Continuous tracking of system performance and
health.
Automated Alerts: Immediate notifications for critical system anomalies.
Scalability: Ability to handle large-scale, distributed environments.
Visualization and Analysis: Converting raw data into meaningful insights.
Traditional monitoring tools relied on polling-based methods, which often had
limitations in handling dynamic infrastructures like microservices and
Kubernetes. This led to the rise of time-series monitoring solutions like
Prometheus and Grafana, which are built to handle real-time, high-volume
metric collection efficiently.
4
2. Powerful Query Language (PromQL) – Enables users to filter, aggregate,
and analyze data efficiently.
3. Time-Series Data Storage – Optimized for storing metrics with
timestamps and labels.
4. Multi-dimensional Data Model – Uses labels to categorize and organize
metrics, making it easy to search and analyze.
5. Built-in Alerting System – Includes Alertmanager to send notifications
via email, Slack, PagerDuty, and other platforms.
6. Scalability & Federation – Allows multiple Prometheus instances to work
together in large-scale environments.
7. Rich Ecosystem of Exporters – Supports monitoring of various services
like databases, Linux servers, Kubernetes, and cloud platforms.
Grafana: The Visualization Layer
Grafana is an open-source data visualization tool that helps convert
Prometheus metrics into interactive dashboards. It is widely used for
monitoring and analyzing system performance trends.
Reasons to choose Grafana:
1. Beautiful Dashboards – Supports graphs, tables, heatmaps, and more for
intuitive data representation.
2. Multiple Data Sources – Can integrate with Prometheus, Loki (for logs),
MySQL, Elasticsearch, AWS CloudWatch, and more.
3. Custom Alerts – Users can configure threshold-based alerts directly from
dashboards.
4. User Management & Permissions – Allows secure access control for
different teams and users.
5. Templating & Variables – Enables dynamic dashboards that adapt to
different environments.
6. Plugins & Extensions – Extensible with community plugins for advanced
features.
Together, Prometheus and Grafana form a powerful, flexible, and scalable
monitoring stack for modern IT environments.
5
Key Features and Benefits
1. End-to-End Observability
With Prometheus and Grafana, organizations can monitor the entire
infrastructure, from bare-metal servers to cloud applications.
2. Real-Time Metrics Collection and Analysis
Prometheus continuously scrapes and stores time-series data, allowing teams
to analyze trends and detect issues in real time.
3. Scalable for Large Deployments
Whether monitoring a few servers or thousands of microservices in
Kubernetes, Prometheus can scale horizontally using federation and remote
storage solutions.
4. Cost-Effective Open-Source Solution
Both Prometheus and Grafana are open-source, eliminating the need for
expensive proprietary monitoring tools.
5. Customizable Dashboards for Actionable Insights
Grafana provides customizable dashboards to visualize key performance
indicators (KPIs), improving decision-making.
6. Automated Alerting & Incident Response
Prometheus Alertmanager and Grafana’s built-in alerts help detect issues and
notify teams before they impact users.
7. Strong Community & Ecosystem
Prometheus and Grafana are backed by large open-source communities,
ensuring continuous updates, extensive documentation, and active support.
6
2. Setting Up Prometheus
Introduction
To start using Prometheus for monitoring, it must be installed and configured
correctly. This section covers:
How to install Prometheus on different environments.
Understanding Prometheus architecture and its components.
Configuring Prometheus to scrape data from target systems.
Installing Prometheus
Prometheus can be installed on various platforms, including Linux, Windows,
Docker, and Kubernetes. Below are step-by-step instructions for each method.
1. Installing Prometheus on Linux (Ubuntu/Debian)
1. Update the system:
sudo apt update && sudo apt upgrade -y
2. Download the latest Prometheus release:
curl -LO
https://github.com/prometheus/prometheus/releases/latest/download/prome
theus-linux-amd64.tar.gz
3. Extract the archive:
tar -xvf prometheus-linux-amd64.tar.gz
cd prometheus-linux-amd64
4. Move the Prometheus binary to /usr/local/bin/:
sudo mv prometheus /usr/local/bin/
sudo mv promtool /usr/local/bin/
5. Verify installation:
prometheus --version
2. Running Prometheus with Docker
7
1. Pull the Prometheus Docker image:
docker pull prom/prometheus
2. Run Prometheus in a container:
docker run -d --name=prometheus -p 9090:9090 prom/prometheus
3. Deploying Prometheus on Kubernetes
1. Add the Helm repository:
helm repo add prometheus-community https://prometheus-
community.github.io/helm-charts
2. Install Prometheus using Helm:
helm install prometheus prometheus-community/prometheus
Configuring Prometheus
8
Prometheus uses a YAML configuration file (prometheus.yml) to define its
behavior. Below is a basic example:
global:
scrape_interval: 15s # How often to scrape targets
scrape_configs:
- job_name: 'node_exporter' # A monitoring target
static_configs:
- targets: ['localhost:9100'] # Replace with the actual target address
Key Configuration Parameters:
scrape_interval – Defines how frequently Prometheus collects metrics
(default: 15s).
scrape_configs – Specifies the list of targets from which Prometheus
should scrape data.
job_name – A label used to identify the scraping job.
static_configs – Defines manually specified targets.
After modifying the configuration, restart Prometheus for the changes to take
effect:
sudo systemctl restart prometheus
9
Execute PromQL queries to analyze collected data.
Check Prometheus logs and alerts.
10
3. Prometheus Data Model and Query Language
(PromQL)
Introduction
Prometheus is a time-series database that stores and retrieves metrics
efficiently. To extract useful insights, it provides PromQL (Prometheus Query
Language), a powerful querying tool designed to filter, aggregate, and analyze
collected metrics.
This section covers:
How Prometheus organizes and stores data.
Understanding metric types and labels.
Writing PromQL queries for real-time monitoring and analysis.
11
1706874000 is the UNIX timestamp.
12
Aggregations and Functions in PromQL
1. Summing Up Values
To get the total number of requests across all instances:
sum(http_requests_total)
2. Finding the Maximum Value
max(cpu_usage)
This returns the highest CPU usage among all instances.
3. Calculating the Rate of Change
For a per-second request rate over the last 5 minutes:
rate(http_requests_total[5m])
This is useful for measuring how fast requests are increasing.
4. Grouping Data by Labels
To find the total requests per method:
sum(http_requests_total) by (method)
This section covered Prometheus’ data model, metric types, and essential
PromQL queries. Next, we will explore how to set up and configure Grafana
for visualizing these metrics. 🚀
13
4. Setting Up Grafana
Introduction
Grafana is a powerful open-source visualization tool that transforms raw
metrics from Prometheus into interactive dashboards and graphs. It allows
users to create meaningful representations of system performance, making
monitoring more intuitive and actionable.
This section covers:
Installing Grafana on different platforms.
Connecting Grafana to Prometheus as a data source.
Managing users, authentication, and access control.
Installing Grafana
Grafana can be installed on Linux, Windows, Docker, and Kubernetes. Below
are the installation steps for different environments.
1. Installing Grafana on Linux (Ubuntu/Debian)
1. Update system packages:
sudo apt update && sudo apt upgrade -y
2. Add the Grafana APT repository:
sudo apt install -y software-properties-common
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
sudo add-apt-repository "deb https://packages.grafana.com/oss/deb stable
main"
3. Install Grafana:
sudo apt install grafana -y
4. Start and enable Grafana service:
sudo systemctl start grafana-server
sudo systemctl enable grafana-server
14
5. Verify installation:
grafana-server -v
2. Running Grafana with Docker
1. Pull the Grafana Docker image:
docker pull grafana/grafana
2. Run Grafana in a container:
docker run -d --name=grafana -p 3000:3000 grafana/grafana
3. Deploying Grafana on Kubernetes
1. Add the Helm repository:
helm repo add grafana https://grafana.github.io/helm-charts
2. Install Grafana using Helm:
helm install grafana grafana/grafana
15
4. Enter the Prometheus server URL:
http://localhost:9090
5. Click "Save & Test" to verify the connection.
Conclusion
This section covered Grafana installation, accessing the web interface,
configuring Prometheus as a data source, and managing authentication. Next,
we will explore how to build custom dashboards and visualizations in
Grafana! 🚀
16
5. Creating Dashboards and Visualizations in Grafana
Introduction
Grafana’s key strength lies in its ability to create custom dashboards with
interactive visualizations. These dashboards help teams analyze performance
trends, detect anomalies, and make data-driven decisions.
This section covers:
Creating a new dashboard in Grafana.
Adding and configuring different types of visualizations.
Using variables for dynamic dashboards.
17
2. Gauge and Single Stat Panels
Useful for real-time monitoring of single metrics (e.g., current memory
usage).
Example:
node_memory_Active_bytes / node_memory_MemTotal_bytes * 100
3. Heatmaps
Displays data distribution over time (e.g., request latency trends).
4. Table View
Converts metrics into tabular format for detailed analysis.
5. Bar Charts and Pie Charts
Good for categorical data visualization (e.g., requests by HTTP method).
18
Adding Alerts to Dashboards
Grafana allows setting up alerts for important metrics.
1. Creating an Alert
1. In the panel editor, go to the "Alert" tab.
2. Click "Create Alert Rule".
3. Define conditions (e.g., CPU usage > 80% for 5 minutes).
4. Choose notification channels (e.g., email, Slack, PagerDuty).
5. Save and apply the alert.
19
6. Grafana Plugins and Integrations
Introduction
Grafana’s flexibility extends beyond built-in features through plugins and
integrations, allowing users to extend functionality, support additional data
sources, and integrate with alerting or automation tools.
This section covers:
Types of Grafana plugins.
Installing and managing plugins.
Integrating Grafana with third-party services.
20
Kubernetes App (for monitoring Kubernetes clusters).
Istio App (for service mesh observability).
21
Slack
Email (SMTP)
PagerDuty
Microsoft Teams
Configuring Slack Alerts
1. Go to "Alerting" → "Notification Channels".
2. Click "Add Channel", select Slack, and enter the webhook URL.
3. Configure the alert message format and save.
2. Logging and Tracing Integrations
Loki: Centralized logging for Grafana.
Jaeger/Zipkin: Distributed tracing for microservices.
3. Automation and CI/CD Integrations
Grafana can be integrated into CI/CD pipelines using:
Prometheus GitHub Actions (for monitoring deployments).
Terraform (for infrastructure monitoring).
Grafana API (for automated dashboard provisioning).
22
7. Configuring Alerting and Notifications in
Prometheus and Grafana
Introduction
Effective monitoring isn’t just about visualizing metrics—it’s about being
notified when something goes wrong. Prometheus and Grafana both support
alerting mechanisms that can send notifications via email, Slack, PagerDuty,
and other services.
This section covers:
Setting up alert rules in Prometheus.
Configuring Alertmanager for notification delivery.
Creating and managing alerts in Grafana.
23
summary: "High CPU Usage Detected"
description: "CPU usage has been above 80% for more than 5 minutes."
2. Loading Alert Rules into Prometheus
Save the alerting rules in a file (e.g., alerts.yml) and reference it in
prometheus.yml:
rule_files:
- "alerts.yml"
Restart Prometheus to apply changes:
systemctl restart prometheus
receivers:
- name: "slack-alerts"
24
slack_configs:
- send_resolved: true
channel: "#alerts"
api_url: "https://hooks.slack.com/services/XXXXX/YYYYY/ZZZZZ"
Start Alertmanager:
./alertmanager --config.file=alertmanager.yml
3. Linking Prometheus with Alertmanager
Modify prometheus.yml to use Alertmanager:
alerting:
alertmanagers:
- static_configs:
- targets:
- "localhost:9093"
Restart Prometheus:
systemctl restart prometheus
25
1. Navigate to "Alerting" → "Notification Channels".
2. Click "Add Channel" and select a service (e.g., Slack, Email).
3. Enter the required details, such as webhook URL for Slack or SMTP
settings for email.
4. Click "Save" and assign this channel to an alert rule.
26
8. Scaling and Optimizing Prometheus for Large
Environments
Introduction
As infrastructure grows, monitoring at scale becomes a challenge. Large-scale
environments generate massive amounts of metrics, which can cause
performance bottlenecks in Prometheus if not managed properly.
This section covers:
Scaling Prometheus horizontally and vertically.
Optimizing Prometheus performance.
Using remote storage for long-term data retention.
Scaling Prometheus
Prometheus is designed to be a pull-based, single-node system, meaning it
does not natively support clustering. However, it can be scaled using federation
and sharding.
1. Vertical Scaling (Upgrading Resources)
Increase CPU and Memory: Adjust server resources based on query
load.
Use SSDs: Improves Prometheus performance by speeding up time-
series database reads/writes.
Tweak storage retention: Modify --storage.tsdb.retention.time to keep
only relevant data.
Example: Set retention to 15 days to reduce storage load:
prometheus --storage.tsdb.retention.time=15d
2. Horizontal Scaling with Federation
Federation allows multiple Prometheus instances to aggregate metrics from
different sources while keeping query loads distributed.
Setting Up Federation
27
1. Deploy multiple Prometheus instances, each scraping a subset of
targets.
2. Use a central Prometheus to pull aggregated data using:
scrape_configs:
- job_name: 'federate'
honor_labels: true
metrics_path: '/federate'
params:
match[]:
- '{job="node_exporter"}'
static_configs:
- targets:
- 'prometheus-instance-1:9090'
- 'prometheus-instance-2:9090'
3. The federated Prometheus queries and visualizes aggregated metrics.
3. Sharding Prometheus with Thanos or Cortex
Thanos: Provides scalable storage, query federation, and long-term
retention.
Cortex: Allows multi-tenancy and distributed querying across
Prometheus instances.
Example: Deploying Thanos for Scaling
1. Install Thanos components (thanos-query, thanos-store, thanos-sidecar).
2. Configure Prometheus to use Thanos Sidecar:
--storage.tsdb.retention.time=30d
--web.enable-lifecycle
3. Query data across multiple Prometheus instances using thanos-query.
28
Optimizing Prometheus Performance
1. Reduce Label Cardinality
High-cardinality metrics (too many unique labels) can slow down Prometheus.
Bad Example:
http_requests_total{user_id="1234", session="abcd"}
o user_id and session create an enormous number of unique series.
Better Alternative:
http_requests_total{status_code="200"}
o Fewer labels, reducing storage and query load.
2. Tune Scrape Intervals
Default scrape intervals (15s) can be adjusted to reduce storage usage:
scrape_configs:
- job_name: 'node_exporter'
scrape_interval: 30s
3. Enable WAL Compression
Write-Ahead Log (WAL) compression reduces storage costs:
storage.tsdb.wal-compression: true
This section covered how to scale Prometheus using federation and Thanos,
optimize performance, and store metrics in remote backends. Next, we will
explore securing and managing user access in Prometheus and Grafana. 🚀
30
9. Securing and Managing User Access in Prometheus
and Grafana
Introduction
In any production environment, ensuring that your monitoring systems are
secure and that user access is well-managed is crucial. Both Prometheus and
Grafana have built-in mechanisms for authentication, authorization, and secure
communication to safeguard sensitive data.
This section covers:
Securing Prometheus with authentication and authorization.
Implementing security best practices for Grafana.
Configuring user access and permissions in Grafana.
Securing Prometheus
Prometheus, by default, does not include authentication or encryption.
However, security can be configured using reverse proxies and external tools
like NGINX or OAuth.
1. Securing Prometheus with Reverse Proxy
A common method to secure Prometheus is by placing it behind a reverse
proxy like NGINX. This allows for handling SSL/TLS encryption and basic HTTP
authentication.
Example: NGINX as a Reverse Proxy for Prometheus
1. Install NGINX on the Prometheus server:
sudo apt install nginx
2. Configure NGINX to reverse proxy Prometheus:
nginx
CopyEdit
server {
listen 443 ssl;
31
server_name prometheus.yourdomain.com;
ssl_certificate /etc/nginx/ssl/prometheus.crt;
ssl_certificate_key /etc/nginx/ssl/prometheus.key;
location / {
proxy_pass http://localhost:9090;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
}
3. Enable the NGINX configuration and restart NGINX:
sudo systemctl restart nginx
2. Enabling Authentication with NGINX
To enable basic authentication:
1. Install apache2-utils to create password files:
sudo apt install apache2-utils
2. Create a password file for HTTP basic authentication:
sudo htpasswd -c /etc/nginx/.htpasswd user_name
3. Add the authentication directive to the NGINX configuration:
location / {
auth_basic "Restricted Access";
auth_basic_user_file /etc/nginx/.htpasswd;
proxy_pass http://localhost:9090;
32
}
4. Restart NGINX:
sudo systemctl restart nginx
Securing Grafana
1. Enabling SSL/TLS in Grafana
To encrypt communication between users and the Grafana interface, use
SSL/TLS.
1. Generate an SSL certificate (self-signed or from a certificate authority).
2. Edit the Grafana configuration file (/etc/grafana/grafana.ini):
[server]
protocol = https
cert_file = /etc/grafana/ssl/grafana.crt
cert_key = /etc/grafana/ssl/grafana.key
3. Restart Grafana to apply the changes:
sudo systemctl restart grafana-server
2. Enabling Authentication in Grafana
Grafana supports several authentication methods:
Basic Authentication
OAuth (Google, GitHub, LDAP)
SAML
LDAP
Example: Configuring OAuth Authentication (Google)
1. In the Grafana configuration file (/etc/grafana/grafana.ini), enable the
OAuth section:
[auth.google]
enabled = true
33
client_id = your-client-id
client_secret = your-client-secret
scopes = openid profile email
auth_url = https://accounts.google.com/o/oauth2/auth
token_url = https://oauth2.googleapis.com/token
api_url = https://www.googleapis.com/oauth2/v2/userinfo
2. Restart Grafana:
sudo systemctl restart grafana-server
34
3. Setting Up Organization Roles
Grafana allows assigning organization-wide roles.
1. Go to "Configuration" → "Users" and select the user.
2. Assign them to specific organizations and set their permissions.
o Admin: Full control of the organization.
o Editor: Can edit dashboards within the organization.
o Viewer: Can view dashboards within the organization.
This section covered securing Prometheus and Grafana with reverse proxies,
SSL, and authentication mechanisms, as well as managing user access in
Grafana using roles and identity providers. Next, we will explore best practices
35
for monitoring and maintaining Prometheus and Grafana in production
environments. 🚀
36
10. Best Practices for Monitoring and Maintaining
Prometheus and Grafana in Production
Introduction
When deploying Prometheus and Grafana in a production environment, the
focus should be on reliability, scalability, and maintainability. It's important to
implement best practices to ensure that the monitoring infrastructure can
handle large volumes of data, provide timely alerts, and scale as your systems
evolve.
This section covers:
Best practices for managing Prometheus.
Maintaining Grafana for optimal performance.
Backup and disaster recovery strategies.
Keeping Prometheus and Grafana up-to-date.
37
--storage.tsdb.retention.time=15d
--storage.tsdb.path=/prometheus/data
3. Regularly Review and Optimize Queries
Use Prometheus query optimization to ensure efficient data retrieval.
Implement promQL best practices, such as using avg over sum where
appropriate to reduce the volume of data returned by queries.
Minimize expensive queries like those that scan large time ranges or
metrics with high cardinality.
4. Monitor Prometheus Itself
Monitor Prometheus for key performance indicators (KPIs) such as:
Memory usage
Disk space
Query latency
Scraping failures
Alert rule firing frequency
Use a Prometheus monitoring instance to track these metrics and create alerts
to ensure optimal operation.
38
Limit the number of queries per panel to avoid overloading Grafana
with requests.
Set appropriate time ranges for each panel (e.g., avoid pulling high-
resolution data over long time periods).
3. Backup and Restore Dashboards
Regularly export and backup dashboards to ensure you can restore them if
needed. Grafana allows dashboards to be exported as JSON files:
Go to the dashboard settings and click on "JSON Model" to export.
Backup regularly and store the exported dashboards securely.
4. Implement Version Control for Dashboards
For teams working on multiple Grafana dashboards, implement a version
control system (VCS) like Git to track changes to dashboard configurations. This
can help revert to older versions or audit changes.
39
For production environments, configure Grafana to use a remote
database (like PostgreSQL or MySQL) to improve reliability and
scalability.
3. Disaster Recovery
In case of failure, having a disaster recovery plan is essential. Ensure you have:
Backups of Prometheus data (including WAL logs) and Grafana
configurations.
A secondary instance of Prometheus or a distributed system like Thanos
in place to ensure availability.
A Grafana backup instance that can be restored quickly in the event of a
failure.
40
Conclusion
In this guide, we have covered a comprehensive journey through Prometheus
and Grafana, from installation and configuration to scaling, securing, and
maintaining these tools in a production environment. By following best
practices, you ensure that your monitoring infrastructure is robust, reliable,
and scalable.
Key takeaways include:
Effective scaling and optimization of Prometheus to handle large
environments.
Securing both Prometheus and Grafana to protect sensitive data and
manage access.
Implementing disaster recovery strategies to ensure your monitoring
system is resilient.
Regular maintenance and updates to ensure long-term sustainability of
your monitoring setup.
Adhering to these practices will not only enhance the performance of your
Prometheus and Grafana setup but also provide the reliability needed for
mission-critical production systems. With Prometheus and Grafana in place,
you’ll be equipped to manage, monitor, and visualize the health of your
systems effectively, ensuring minimal downtime and efficient incident
response. 🚀
41