Prometheus: Getting Started

Latest recommended article published on 2025-06-12 09:04:07

real向往

License: CC 4.0 BY-SA

Category: Prometheus. Tags: Prometheus, exporter, alertmanager, quick start

Article link: https://blog.csdn.net/yuanfangPOET/article/details/108653476

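This article walks through setting up and configuring a Prometheus monitoring system, including installing and configuring the core components (the Prometheus server, node_exporter, and Alertmanager), and covers data visualization, alerting rule definitions, and email notifications.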

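Prometheus is an open-source monitoring system originally developed at SoundCloud and inspired by Google's internal Borgmon monitoring system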

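Features

A powerful data model: monitoring data is stored as metric{labels} time series in the built-in time series database
Fine-grained data: scrape intervals can be as short as 1 to 5 seconds
A flexible query language (PromQL)
Data is collected over HTTP, primarily by pull; push is supported through the Pushgateway
Supports both local and remote storage
Good visualization: the built-in Prometheus UI accepts PromQL queries directly and graphs the results, and Grafana can be used for dashboards
A large ecosystem of exporters and client libraries (for nginx, Tomcat, and so on) makes it easy for Prometheus to collect service metrics
Easy to scale: when a single Prometheus handles too much data, federation and functional sharding let multiple Prometheus servers form one logical cluster
Supports service discovery, so monitoring targets can be discovered dynamically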

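Architecture

Prometheus server: periodically scrapes metrics, stores them, serves queries, and evaluates alerting rules
exporter: exposes collected metrics over HTTP so that the Prometheus server can scrape them on a schedule; it can be installed on the monitored host
Alertmanager: the Prometheus server sends alerts that match its rules to Alertmanager, which groups, silences, or inhibits them and then delivers notifications by email, WeCom, and other channels
Pushgateway: for network or security reasons, some data cannot be exposed directly to Prometheus. In that case the Pushgateway acts as a relay: users push data to the Pushgateway, and Prometheus scrapes it from there using its normal pull model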
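To make the pull model concrete, here is a minimal sketch of what an exporter does: it serves plain-text metrics over HTTP and waits for the Prometheus server to scrape them. The metric names (`demo_temperature_celsius`, `demo_up`), the `room` label, and port 8000 are made up for illustration; real exporters normally rely on an official client library instead of hand-written formatting.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical sample data: {(metric name, ((label, value), ...)): value}
SAMPLES = {
    ("demo_temperature_celsius", (("room", "lab"),)): 21.5,
    ("demo_up", ()): 1,
}


def render_metrics(samples):
    """Render samples as `name{label="value"} value` lines, one per series."""
    lines = []
    for (name, labels), value in sorted(samples.items()):
        label_text = ",".join(f'{key}="{val}"' for key, val in labels)
        series = f"{name}{{{label_text}}}" if label_text else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Prometheus scrapes /metrics by default.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(SAMPLES).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve SAMPLES on /metrics until interrupted."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` and listing `<host>:8000` as a scrape target would, under these assumptions, let Prometheus collect the two demo series.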

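Data storage

Prometheus provides local storage in its built-in TSDB, and it can also use remote storage
https://www.cnblogs.com/zhoujinyi/p/11983859.html
1. Local storage

Prometheus's local time series database stores data in a custom format. Conceptually, each sample is a timestamp and a value attached to a series identified by metric{labels}.

Every two hours Prometheus produces a block that holds the data for that window. Blocks live under the data directory; each block is itself a directory containing one or more chunk files (the time series data), a metadata file, and an index file. The index maps metric names and labels to the chunks that hold the matching data.
The most recently written data is kept in memory and is only flushed to disk after about two hours. To avoid losing that data if the service crashes, every write is also recorded in a write-ahead log (WAL); after a crash, Prometheus replays the WAL on restart to recover the data.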
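The write-ahead idea described above can be sketched in a few lines. This is a toy model only: the `TinyTSDB` class and its JSON-lines log are invented for illustration and do not reflect the real on-disk WAL format used by Prometheus.

```python
import json
import os
import tempfile


class TinyTSDB:
    """Toy store: recent samples live in memory and are also appended to a log file."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.head = []  # in-memory samples that have not been written out as a block
        self._replay()

    def _replay(self):
        # On startup, rebuild the in-memory state from whatever the log holds.
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path, encoding="utf-8") as wal:
            for line in wal:
                self.head.append(json.loads(line))

    def append(self, series, timestamp, value):
        record = {"series": series, "ts": timestamp, "value": value}
        # Write to the log first and force it to disk, then update memory.
        with open(self.wal_path, "a", encoding="utf-8") as wal:
            wal.write(json.dumps(record) + "\n")
            wal.flush()
            os.fsync(wal.fileno())
        self.head.append(record)


wal_file = os.path.join(tempfile.mkdtemp(), "wal.log")
db = TinyTSDB(wal_file)
db.append('node_load1{instance="a"}', 1600000000, 0.42)
del db  # simulate a crash: the in-memory head is gone
recovered = TinyTSDB(wal_file)  # replaying the log restores the sample
```

Because the log is written before memory is updated, any sample that was acknowledged can be reconstructed after a restart.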

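2. Remote storage
Local storage is limited by a single node in terms of scalability and durability, so Prometheus provides a set of interfaces for integrating with remote storage systems. Scraped samples can be written to a remote URL in a standard format and read back from a remote URL in a standard format. InfluxDB, for example, can be used this way.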

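Comparison with Zabbix

Zabbix	Prometheus
Backend written in C and UI in PHP; hard to customize	Written in Go, with Grafana available for the front end; easier to customize
Very strong at traditional server monitoring	Very strong at container monitoring, with native support and the ability to monitor service internals
Upper limit of about 10,000 nodes	Scales in units of tens of thousands of nodes
Deployment is relatively involved and requires setting up dependencies	Deployment is simple: extract the archive and run
Stores data in a relational database, which greatly limits collection performance	Uses its own time series database; the author cites ingestion on the order of ten million samples per second
The Zabbix community is fairly active	Prometheus started later, and the author considered its community less active at the time of writing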

1. Installing Prometheus

1.1 Download and extract

[root@prometheus ~]# wget https://github.com/prometheus/prometheus/releases/download/v2.20.1/prometheus-2.20.1.linux-amd64.tar.gz
[root@prometheus ~]# tar xf prometheus-2.20.1.linux-amd64.tar.gz
[root@prometheus ~]# mv prometheus-2.20.1.linux-amd64 /usr/local/prometheus

1.2 View the Prometheus flags

Some flags must be passed when starting Prometheus; run it with --help to list them


[root@prometheus ~]# cd /usr/local/prometheus/
[root@prometheus prometheus]# ./prometheus --help

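1.3 Start

Specify the configuration file and the data storage path. By default, data is stored in the data directory under the Prometheus directory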

[root@prometheus prometheus]# ./prometheus --config.file="./prometheus.yml" --storage.tsdb.path="/usr/local/prometheus/data"> /dev/null 2>&1 &
[2] 7558

Manage with systemd

[root@prometheus ~]# cat /usr/lib/systemd/system/prometheus.service 
[Unit]
Description=https://prometheus.io
After=network.target

[Service]
ExecStart=/usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml --storage.tsdb.path=/usr/local/prometheus/data

[Install]
WantedBy=multi-user.target

[root@prometheus ~]# systemctl daemon-reload
[root@prometheus ~]# systemctl start prometheus
[root@prometheus ~]# systemctl status prometheus
[root@prometheus ~]# systemctl enable prometheus

1.4 Open http://192.168.71.21:9090 in a browser
1.5 Understanding the configuration file

[root@prometheus ~]# vim /usr/local/prometheus/prometheus.yml 
# Global settings
global:
  scrape_interval:     15s # Scrape targets every 15 seconds (default: 1 minute)
  evaluation_interval: 15s # Evaluate rules every 15 seconds (default: 1 minute)
  #scrape_timeout: 10s     # Scrape timeout (default: 10 seconds)

# Alertmanager settings
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load alerting rules and evaluate them at the global evaluation_interval
rule_files:
  # - "first_rules.yml"   # Location of an alerting rule file
  # - "second_rules.yml"

# List of scrape configurations
scrape_configs:
  # Job name
  - job_name: 'prometheus'
    # Statically configured targets
    static_configs:
      # Targets for this job
    - targets: ['localhost:9090']

2. Installing node_exporter

The Prometheus ecosystem provides a rich set of exporters for monitoring basic host metrics, middleware, network devices, and more.

Here, node_exporter collects basic machine metrics such as CPU, memory, and available disk space. Commonly used exporters include:

Category	Common exporters
Databases	MySQL Exporter, Redis Exporter, MongoDB Exporter, MSSQL Exporter, etc.
Hardware	Apcupsd Exporter, IoT Edison Exporter, IPMI Exporter, Node Exporter, etc.
Message queues	Beanstalkd Exporter, Kafka Exporter, NSQ Exporter, RabbitMQ Exporter, etc.
Storage	Ceph Exporter, Gluster Exporter, HDFS Exporter, ScaleIO Exporter, etc.
HTTP services	Apache Exporter, HAProxy Exporter, Nginx Exporter, etc.
API services	AWS ECS Exporter, Docker Cloud Exporter, Docker Hub Exporter, GitHub Exporter, etc.
Logging	Fluentd Exporter, Grok Exporter, etc.
Monitoring systems	Collectd Exporter, Graphite Exporter, InfluxDB Exporter, Nagios Exporter, SNMP Exporter, etc.
Other	Blackbox Exporter, JIRA Exporter, Jenkins Exporter, Confluence Exporter, etc.

2.1 Download and extract

[root@prometheus ~]# wget https://github.com/prometheus/node_exporter/releases/download/v1.0.1/node_exporter-1.0.1.linux-amd64.tar.gz
[root@prometheus ~]# tar xf node_exporter-1.0.1.linux-amd64.tar.gz
[root@prometheus ~]# mv node_exporter-1.0.1.linux-amd64 /usr/local/node_exporter

2.2 Manage with systemd

[root@prometheus ~]# vim /usr/lib/systemd/system/node_exporter.service 
[Unit]
Description=node-exporter

[Service]
ExecStart=/usr/local/node_exporter/node_exporter

[Install]
WantedBy=multi-user.target

[root@prometheus ~]# systemctl daemon-reload
[root@prometheus ~]# systemctl start node_exporter
[root@prometheus ~]# ss -lnt | grep 9100
LISTEN     0      128         :::9100                    :::*

2.3 View the metrics exposed by node_exporter

[root@prometheus ~]# curl http://192.168.71.21:9100/metrics
# HELP node_network_iface_link iface_link value of /sys/class/net/<iface>.
# TYPE node_network_iface_link gauge
node_network_iface_link{device="ens33"} 2
node_network_iface_link{device="lo"} 1
# HELP node_network_iface_link_mode iface_link_mode value of /sys/class/net/<iface>.
# TYPE node_network_iface_link_mode gauge
node_network_iface_link_mode{device="ens33"} 0
node_network_iface_link_mode{device="lo"} 0

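Explanation

HELP: describes what the metric means
TYPE: the metric type. A counter only ever increases (or resets to zero on restart); a gauge can go up and down; a histogram counts observations in configurable buckets and also exposes their sum and count; a summary exposes quantiles computed on the client side along with the sum and count
Remaining lines: the metric name, its labels, and the current value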
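As a rough illustration of how that output is structured, the sketch below parses the sample text into metric types and (name, labels, value) tuples. It is a simplified parser written for this article, not the one Prometheus uses: it ignores optional timestamps and does not unescape label values.

```python
import re

# name, optional {labels}, then the value
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

EXAMPLE = """\
# HELP node_network_iface_link iface_link value of /sys/class/net/<iface>.
# TYPE node_network_iface_link gauge
node_network_iface_link{device="ens33"} 2
node_network_iface_link{device="lo"} 1
"""


def parse_metrics(text):
    """Return (types, samples) parsed from exposition-format text."""
    types, samples = {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Only TYPE comments are recorded; HELP and other comments are skipped.
            parts = line.split(None, 3)
            if len(parts) == 4 and parts[1] == "TYPE":
                types[parts[2]] = parts[3]
            continue
        match = SAMPLE_RE.match(line)
        if match:
            name, raw_labels, value = match.groups()
            labels = dict(LABEL_RE.findall(raw_labels or ""))
            samples.append((name, labels, float(value)))
    return types, samples


types, samples = parse_metrics(EXAMPLE)
```

Here `types` maps each metric name to its declared type, and `samples` holds one tuple per series line.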

2.4 Add node_exporter to Prometheus

[root@prometheus ~]# vim /usr/local/prometheus/prometheus.yml 
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']
  - job_name: 'node_exporter'  # node_exporter instances on other machines can also be added under targets
    static_configs:
    - targets:
      - '192.168.71.21:9100'

Check the configuration file for syntax errors

[root@prometheus prometheus]# ./promtool check config prometheus.yml 
Checking prometheus.yml
  SUCCESS: 0 rule files found

Restart the service

[root@prometheus ~]# systemctl restart prometheus

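2.5 View the metrics in the web UI

To look up CPU or disk space metrics, type cpu or filesystem in the expression box and matching metric names are suggested below it. To go further, learn PromQL: both querying data and defining alerting rules in Prometheus use PromQL.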
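The same PromQL expressions can also be sent to the Prometheus HTTP API instead of the web UI. The sketch below uses the instant-query endpoint `/api/v1/query`; the helper names are made up for this example, and the server address is the one used throughout this article.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, promql):
    """Build the URL for an instant query against the Prometheus HTTP API."""
    params = urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def instant_query(base_url, promql, timeout=5):
    """Run an instant query and return the list of result series."""
    with urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]
```

With a running server, `instant_query("http://192.168.71.21:9090", "node_filesystem_avail_bytes")` would return one entry per matching series, each carrying its labels and latest value.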

View the Prometheus configuration

View the targets, that is, the exporters being scraped

3. Installing InfluxDB

Store data in InfluxDB

3.1 Install InfluxDB

[root@prometheus ~]# wget https://dl.influxdata.com/influxdb/releases/influxdb-1.8.1.x86_64.rpm
[root@prometheus ~]# rpm -ivh influxdb-1.8.1.x86_64.rpm 
[root@prometheus ~]# systemctl start influxdb
[root@prometheus ~]# systemctl enable influxdb

Check that the ports are listening

[root@prometheus ~]# ss -lntp | grep influxd
LISTEN     0      128    127.0.0.1:8088                     *:*                   users:(("influxd",pid=6392,fd=3))
LISTEN     0      128         :::8086                    :::*                   users:(("influxd",pid=6392,fd=23))

3.2 Log in to InfluxDB and create the database

[root@prometheus ~]# influx
Connected to http://localhost:8086 version 1.8.1
InfluxDB shell version: 1.8.1
> show databases;
name: databases
name
----
_internal
> create database prometheus;
> show databases;
name: databases
name
----
_internal
prometheus
> exit

3.3 Modify the Prometheus configuration to write data to InfluxDB and read it back
Add the remote read and write URLs to the Prometheus configuration file. If InfluxDB requires authentication, see
https://docs.influxdata.com/influxdb/v1.8/supported_protocols/prometheus/

[root@prometheus ~]# vim /usr/local/prometheus/prometheus.yml

remote_write:
    - url: "http://192.168.71.21:8086/api/v1/prom/write?db=prometheus"

remote_read:
    - url: "http://192.168.71.21:8086/api/v1/prom/read?db=prometheus"

Restart Prometheus

[root@prometheus ~]# systemctl restart prometheus

Log in to InfluxDB and check that the scraped data has been written

[root@prometheus ~]# influx
Connected to http://localhost:8086 version 1.8.1
InfluxDB shell version: 1.8.1
> use prometheus
Using database prometheus
> show measurements
name: measurements
name
----
go_gc_duration_seconds
go_gc_duration_seconds_count
go_gc_duration_seconds_sum
go_goroutines
go_info

4. Installing Grafana

Connect Grafana to visualize the data collected by Prometheus. Shared dashboard templates are available at https://grafana.com/dashboards

4.1 Download and install Grafana

[root@prometheus ~]# wget https://dl.grafana.com/oss/release/grafana-7.1.5-1.x86_64.rpm
[root@prometheus ~]# rpm -ivh grafana-7.1.5-1.x86_64.rpm
[root@prometheus ~]# systemctl start grafana-server
[root@prometheus ~]# systemctl enable grafana-server

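4.2 Access and log in
Open 192.168.71.21:3000. The default username and password are admin/admin.
Add a data source
Add a dashboard
Add a panel
Query: the query expression
Legend: formats the series names shown in the legend, for example {{instance}}~{{device}}
Min step: sets the minimum step of the query, which reduces the amount of data fetched from the data source
Resolution: controls how much data Grafana itself renders. At a resolution of 1/10, every ten fetched points are merged into one; the smaller the divisor, the higher the precision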

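5. Creating alerting rules

Alerting rules are defined in Prometheus. Prometheus evaluates them periodically and, when a rule's condition is met, sends the alert to Alertmanager. The rules and their current state are also visible in the Prometheus web UI.

In a rule file, related alerting rules are collected under a group, and each group can hold multiple rules. An alerting rule consists mainly of the following parts:

alert: the name of the alerting rule
expr: the PromQL expression that defines the trigger condition
for: optional wait time. The alert is only sent once the condition has held for this long; until then the alert is pending, and afterwards it is firing
labels: custom labels attached to the alert
annotations: descriptive details about the alert. Values can be templated; for example, {{ $labels.<labelname> }} expands to the value of that label on the current alert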
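The effect of the `for` field is easiest to see as a small state machine. The sketch below is a simplified model written for this article (the `AlertState` class is hypothetical, not Prometheus code): the condition must stay true across consecutive evaluations for at least `for_seconds` before the alert moves from pending to firing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    for_seconds: float
    active_since: Optional[float] = None  # when the condition first became true

    def evaluate(self, condition_true: bool, now: float) -> str:
        """Return the alert state after one rule evaluation at time `now`."""
        if not condition_true:
            # Any evaluation where the condition is false resets the timer.
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


# A rule with `for: 1m`, evaluated every 15 seconds while the condition holds.
alert = AlertState(for_seconds=60)
states = [alert.evaluate(True, t) for t in (0, 15, 30, 45, 60)]
```

In this model the first four evaluations report pending and the fifth reports firing; a single false evaluation in between would send the alert back to inactive and restart the wait.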

5.1 Configure the rule files
Add the path that Prometheus loads rule files from

[root@prometheus ~]# vim /usr/local/prometheus/prometheus.yml 
rule_files:
  - /usr/local/prometheus/rules/*.yml

Add the rule file. This rule checks whether the available space on the root filesystem is below 5120 MB

[root@prometheus ~]# vim /usr/local/prometheus/rules/hoststatus.yml
groups:
- name: hostStatusAlert
  rules:
  - alert: hostDiskAvail
    expr: node_filesystem_avail_bytes{device="/dev/mapper/centos_test-root"}/1024/1024 < 5120
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "{{ $labels.instance }}: available disk space is running low"
      description: "{{ $labels.instance }}: available disk space is below 5 GB (current value: {{ $value }} MiB)"

Restart Prometheus

[root@prometheus ~]# systemctl restart prometheus

5.2 View the alerting rules
5.3 Trigger the alert

6. Installing Alertmanager

Alertmanager processes alerts and, according to its rules, sends them to WeCom, email, and other channels. While doing so it can group, inhibit, and silence alerts, which cuts down on alert noise.

6.1 Install

[root@prometheus ~]# wget https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.linux-amd64.tar.gz
[root@prometheus ~]# tar xf alertmanager-0.21.0.linux-amd64.tar.gz 
[root@prometheus ~]# mv alertmanager-0.21.0.linux-amd64 /usr/local/alertmanager

6.2 Start

Alertmanager stores alert data locally; the storage directory can be specified

[root@prometheus ~]# /usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml --storage.path=/usr/local/alertmanager/data

6.3 Manage with systemd

[root@prometheus ~]# vim /usr/lib/systemd/system/alertmanager.service
[Unit]
Description=https://prometheus.io

[Service]
ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml --storage.path=/usr/local/alertmanager/data

[Install]
WantedBy=multi-user.target

[root@prometheus ~]#  systemctl daemon-reload

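6.4 Configuration file

[root@prometheus ~]# vim /usr/local/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m  # An alert that is not re-sent within this time is treated as resolved

route:   # Routing tree: decides how each alert is handled based on its labels
  group_by: ['alertname']  # Group alerts by alert name
  group_wait: 10s  # How long to wait after a new group appears, so more alerts for the same group can be sent together
  group_interval: 10s  # How long to wait before notifying about new alerts added to an already notified group
  repeat_interval: 1h  # How long to wait before re-sending a notification for alerts that are still firing
  receiver: 'web.hook'  # Default receiver
receivers:  # Receivers, referenced from route
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'
inhibit_rules:    # Inhibition rules; well-chosen rules cut down on redundant alerts
  - source_match:   # Source alerts: while an alert with these labels is firing, it can mute matching target alerts
      severity: 'critical'
    target_match:   # Target alerts: alerts with these labels are muted when a matching source alert is firing
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']   # Source and target must have equal values for these labels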

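The Alertmanager configuration file has two main parts: the route and the receivers. Every alert enters the routing tree at the top-level route and is then sent to the appropriate receiver according to the routing rules. Receivers can be connected to email, WeCom, DingTalk, and other channels.
6.5 View related information
Check the ports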

[root@prometheus ~]# ss -lntp | grep alertmanager
LISTEN     0      128         :::9093                    :::*                   users:(("alertmanager",pid=28862,fd=8))
LISTEN     0      128         :::9094                    :::*                   users:(("alertmanager",pid=28862,fd=3))

View alerts

6.6 Connect Alertmanager to Prometheus

[root@prometheus ~]# vim /usr/local/prometheus/prometheus.yml 

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 192.168.71.21:9093

Restart Prometheus

[root@prometheus ~]# systemctl restart prometheus

6.7 Trigger an alert
View it in Alertmanager
Full configuration of a route

[ receiver: <string> ]
[ group_by: '[' <labelname>, ... ']' ]
[ continue: <boolean> | default = false ]

match:
  [ <labelname>: <labelvalue>, ... ]

match_re:
  [ <labelname>: <regex>, ... ]

[ group_wait: <duration> | default = 30s ]
[ group_interval: <duration> | default = 5m ]
[ repeat_interval: <duration> | default = 4h ]

routes:
  [ - <route> ... ]
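How these fields interact can be sketched as a walk over the routing tree. The code below is a simplified model for illustration, not Alertmanager's implementation: it only handles `receiver`, `match`, `match_re`, `continue`, and `routes`, and the receiver names `oncall` and `lab-team` are invented for the example.

```python
import re


def matches(route, labels):
    """A route matches when every match / match_re entry agrees with the labels."""
    for name, value in route.get("match", {}).items():
        if labels.get(name, "") != value:
            return False
    for name, pattern in route.get("match_re", {}).items():
        # Regular expressions are anchored: the whole label value must match.
        if not re.fullmatch(pattern, labels.get(name, "")):
            return False
    return True


def find_receivers(route, labels, inherited=None):
    """Walk the route tree depth-first and return the receivers for an alert."""
    receiver = route.get("receiver", inherited)
    matched = []
    for child in route.get("routes", []):
        if not matches(child, labels):
            continue
        matched.extend(find_receivers(child, labels, receiver))
        if not child.get("continue", False):
            break  # stop at the first matching child unless continue is set
    # If no child matched, the alert is handled by the current node.
    return matched or [receiver]


tree = {
    "receiver": "default-receiver",
    "routes": [
        {"match": {"severity": "page"}, "receiver": "oncall", "continue": True},
        {"match_re": {"instance": r"192\.168\.71\..*"}, "receiver": "lab-team"},
    ],
}
disk_alert = {"alertname": "hostDiskAvail", "severity": "page", "instance": "192.168.71.21:9100"}
receivers = find_receivers(tree, disk_alert)
```

In this example the alert reaches both `oncall` and `lab-team` because the first child sets `continue: true`; an alert that matches no child falls back to `default-receiver`.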

7. Sending alerts by email

7.1 Modify the Alertmanager configuration

[root@prometheus ~]# vim /usr/local/alertmanager/alertmanager.yml
global:
  smtp_smarthost: smtp.qq.com:25
  smtp_from: xxxxxx@qq.com
  smtp_auth_username: xxxxxx@qq.com
  smtp_auth_identity: xxxxxx@qq.com
  smtp_auth_password: xxxxxx
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  receiver: 'default-receiver'

receivers:
  - name: default-receiver
    email_configs:
      - to: xxxxxx@qq.com
        send_resolved: true

Restart Alertmanager

[root@prometheus ~]# systemctl restart alertmanager

Trigger an alert and check the email
7.2 Configure an email alert template
Create the email alert template

[root@prometheus ~]# mkdir /usr/local/prometheus/alertmanager-tmpl
[root@prometheus ~]# vim /usr/local/prometheus/alertmanager-tmpl/email.tmpl 

{{ define "email.from" }}xxx@qq.com{{ end }}
{{ define "email.to" }}xxxx@qq.com{{ end }}
{{ define "email.to.html" }}
{{ range .Alerts }}
=========start==========<br>
Alert source: prometheus_alert <br>
Severity: {{ .Labels.severity }} <br>
Alert name: {{ .Labels.alertname }} <br>
Affected host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Details: {{ .Annotations.description }} <br>
Started at: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
=========end==========<br>
{{ end }}
{{ end }}

Configure Alertmanager

[root@prometheus ~]# vim /usr/local/alertmanager/alertmanager.yml

global:
  smtp_smarthost: smtp.qq.com:25
  smtp_from: xxxxx@qq.com
  smtp_auth_username: xxxxx@qq.com
  smtp_auth_identity: xxxxx@qq.com
  smtp_auth_password: xxxxx
  resolve_timeout: 5m

templates:   # Add the alert template file
  - '/usr/local/prometheus/alertmanager-tmpl/email.tmpl'

route:
  group_by: ['alertname']
  receiver: 'default-receiver'

receivers:
- name: default-receiver
  email_configs:
  - to: '{{ template "email.to" . }}'
    html: '{{ template "email.to.html" . }}'
    send_resolved: true

Restart the Prometheus and Alertmanager services

[root@prometheus ~]# systemctl restart prometheus
[root@prometheus ~]# systemctl restart alertmanager

7.3 Trigger an alert

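8. Reducing alert noise

8.1 Alert grouping
The grouping configured earlier had no visible effect with a single host. Here a second node_exporter is added and the disk alert is triggered on both machines.
Modify the Prometheus configuration file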

[root@prometheus ~]# vim /usr/local/prometheus/prometheus.yml 
  - job_name: 'node_exporter'
    static_configs:
    - targets:
      - '192.168.71.21:9100'
      - '192.168.71.22:9100'

Restart Prometheus

[root@prometheus ~]# systemctl restart prometheus

Trigger the alert and check the email: the two alerts were grouped by alert name and sent together
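Grouping itself comes down to bucketing alerts by the labels listed in group_by. The sketch below is an illustration only (the `group_alerts` helper is not part of Alertmanager) and uses the two disk alerts from this section.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        # Alerts that share every group_by label value land in the same group.
        key = tuple((name, alert.get(name, "")) for name in group_by)
        groups[key].append(alert)
    return dict(groups)


alerts = [
    {"alertname": "hostDiskAvail", "instance": "192.168.71.21:9100"},
    {"alertname": "hostDiskAvail", "instance": "192.168.71.22:9100"},
]
by_name = group_alerts(alerts, ["alertname"])
by_name_and_instance = group_alerts(alerts, ["alertname", "instance"])
```

With `group_by: ['alertname']` both alerts share one group and go out in a single notification; adding `instance` to group_by would split them into two groups.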
8.2 Silences
Open http://192.168.71.21:9093/ in a browser and create a silence for alerts at the warning level
Review the silence; alerts triggered while it is active do not send email

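8.3 Inhibition
Modify the Alertmanager configuration file

# Add an inhibition rule
[root@prometheus ~]# vim /usr/local/alertmanager/alertmanager.yml
inhibit_rules:
  - source_match:  # Source alert: while an alert with this label is firing, it can mute the target
      instance: '192.168.71.21:9100'
    target_match:  # Target alert: muted when it matches here and shares the equal labels with a firing source
      instance: '192.168.71.22:9100'
    equal: ['alertname']   # Labels that must have the same value on source and target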

Restart Alertmanager

[root@prometheus ~]# systemctl restart alertmanager

Trigger the alert: the alert for instance '192.168.71.22:9100' is inhibited, and only one alert is received
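The inhibition check can be modeled in a few lines. This is a simplified sketch for illustration rather than Alertmanager's own code; the `is_inhibited` helper is hypothetical, and the rule is the one configured in 8.3.

```python
def _has_labels(alert, wanted):
    """True when the alert carries every wanted label with the wanted value."""
    return all(alert.get(name, "") == value for name, value in wanted.items())


def is_inhibited(target, firing_alerts, rule):
    """Return True if `target` is muted by some firing source alert under `rule`."""
    if not _has_labels(target, rule["target_match"]):
        return False
    for source in firing_alerts:
        if source is target:
            continue  # an alert does not inhibit itself in this model
        if not _has_labels(source, rule["source_match"]):
            continue
        # Source and target must agree on every label listed under `equal`.
        if all(source.get(name, "") == target.get(name, "") for name in rule["equal"]):
            return True
    return False


rule = {
    "source_match": {"instance": "192.168.71.21:9100"},
    "target_match": {"instance": "192.168.71.22:9100"},
    "equal": ["alertname"],
}
alert_21 = {"alertname": "hostDiskAvail", "instance": "192.168.71.21:9100"}
alert_22 = {"alertname": "hostDiskAvail", "instance": "192.168.71.22:9100"}
firing = [alert_21, alert_22]
```

Under this rule `is_inhibited(alert_22, firing, rule)` is true while `is_inhibited(alert_21, firing, rule)` is false, which matches the single email received above.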