Updated: June 2025
Versions: Prometheus 3.4.1, Alertmanager 0.28.1
Introduction
A complete example of Prometheus + Alertmanager alerting configuration, covering two notification channels (WeChat Work and email) with configuration and hands-on tests.
Alerting Architecture
Basic Flow
flowchart LR
  subgraph Prometheus
    A1[Scrape metrics] --> A2[Store in TSDB]
    A2 --> A3[Evaluate alerting rules<br>Alerting Rules]
    A3 --> A4[Fire alert events<br>Pending → Firing]
  end
  subgraph Alertmanager
    A4 --> B1[Receive alerts]
    B1 --> B2[Deduplicate]
    B2 --> B3[Group<br>Group by labels]
    B3 --> B4[Silence<br>user-defined silences]
    B3 --> B5[Inhibit<br>higher severity mutes lower]
    B4 --> B6[Route<br>match routes by labels]
    B5 --> B6
  end
  B6 --> C1[Notify receiver<br>WeChat]
  B6 --> C2[Notify receiver<br>Email]
  B6 --> C3[Notify receiver<br>Webhook]
  %% Optional styling
classDef prom fill:#f9f,stroke:#333,stroke-width:1px;
classDef alert fill:#bbf,stroke:#333,stroke-width:1px;
classDef notify fill:#bfb,stroke:#333,stroke-width:1px;
class A1,A2,A3,A4 prom
class B1,B2,B3,B4,B5,B6 alert
class C1,C2,C3 notify
- Prometheus - scrapes and stores monitoring metrics
- Alerting rules - define alert conditions in Prometheus; when a condition is met, an alert event fires
- Alertmanager - receives alerts and handles grouping, deduplication, and silencing
- Notification channels - deliver alert notifications via WeChat, webhook, email, and so on
Component Configuration Layout
#### Prometheus
/etc/prometheus/
├── prometheus.yml                    # Main Prometheus configuration file
├── rules/                            # Rules directory
│   ├── alerts/                       # Alerting rules (layered)
│   │   ├── hardware-alerts.yml       # Hardware-layer alerting rules
│   │   ├── node-alerts.yml           # Node-layer (system) alerting rules
│   │   ├── application-alerts.yml    # Application-layer alerting rules
│   │   ├── middleware-alerts.yml     # Middleware alerting rules
│   │   ├── network-alerts.yml        # Network-layer alerting rules
│   │   └── database-alerts.yml       # Database alerting rules
│   └── records/                      # Recording rules
│       ├── node-metrics.yml          # Node-layer (system) recording rules
│       ├── application-metrics.yml   # Application recording rules
│       ├── business-metrics.yml      # Business-metric recording rules
│       └── sli-slo-metrics.yml       # SLI/SLO recording rules
└── targets/                          # Scrape target configuration directory
#### Alertmanager
/etc/alertmanager/
├── alertmanager.yml   # Main Alertmanager configuration file
└── templates/         # Notification template directory
    ├── wechat.tmpl    # WeChat alert template
    └── email.tmpl     # Email alert template
Preparation
Hosts
Hostname | OS | Arch | IP | Software |
---|---|---|---|---|
prometheus-01.monitor.local | AlmaLinux 9.6 | x86_64 | 192.168.111.197 | Prometheus 3.x, Node Exporter 1.9.x |
alertmanager-01.monitor.local | AlmaLinux 9.6 | x86_64 | 192.168.111.198 | Alertmanager 0.28.x, Node Exporter 1.9.x |
node-exporter-01.monitor.local | AlmaLinux 9.6 | x86_64 | 192.168.111.199 | Node Exporter 1.9.x |
Deployment of the software itself is out of scope here.
Prometheus Configuration
Preparation
Create the directory structure
sudo -u ${PROMETHEUS_USER} mkdir -p ${PROMETHEUS_CONF}/rules/{alerts,records}
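The ${PROMETHEUS_USER}/${PROMETHEUS_CONF} style variables used throughout this post are assumed to come from your own deployment; for reference, a set of hypothetical values (adjust to your install):
# Hypothetical paths/users used in the commands below
PROMETHEUS_USER=prometheus
PROMETHEUS_HOME=/usr/local/prometheus
PROMETHEUS_CONF=/etc/prometheus
ALERTMANAGER_USER=alertmanager
ALERTMANAGER_HOME=/usr/local/alertmanager
ALERTMANAGER_CONF=/etc/alertmanager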
Main configuration
$ sudo -u ${PROMETHEUS_USER} vim ${PROMETHEUS_CONF}/prometheus.yml
#### Global configuration
global:
  # Global scrape interval. Default: 1 minute
  scrape_interval: 15s
  # How often alerting and recording rules are evaluated; usually a multiple of scrape_interval. Default: 1 minute
  evaluation_interval: 15s
  # Scrape timeout. Default: 10 seconds
  scrape_timeout: 10s
#### Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # Alertmanager IP or hostname; list every node if running a cluster
            - 192.168.111.198:9093
#### Rule files
rule_files:
  ## Recording rule files
  # Node-level recording rules
  - "rules/records/node-metrics.yml"
  ## Alerting rule files
  # Hardware-layer alerts
  # - "rules/alerts/hardware-alerts.yml"
  # System-layer alerts
  - "rules/alerts/node-alerts.yml"
  # Application-layer alerts
  # - "rules/alerts/application-alerts.yml"
scrape_configs:
  # Prometheus self-monitoring
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
        labels:
          app: "prometheus"
  # Node monitoring
  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - '192.168.111.197:9100'
          - '192.168.111.198:9100'
          - '192.168.111.199:9100'
        labels:
          env: 'prod'
          level: 'system'
          category: 'monitor'
  # Application monitoring
  # - job_name: 'app-metrics'
  #   static_configs:
  #     - targets:
  #         - '192.168.111.200:8080'
  #       labels:
  #         env: 'prod'
  #         level: 'application'
  #         category: 'service'
  #         service: 'user'
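After the configuration is loaded (see the reload step below), Prometheus' standard HTTP API can confirm that the Alertmanager and the scrape targets were picked up:
# Alertmanager instances Prometheus will send alerts to
curl -s http://localhost:9090/api/v1/alertmanagers
# Scrape targets and their health
curl -s http://localhost:9090/api/v1/targets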
Layered Recording Rules
Node layer
$ sudo -u ${PROMETHEUS_USER} vim ${PROMETHEUS_CONF}/rules/records/node-metrics.yml
groups:
  - name: node_recording_rules
    interval: 30s
    rules:
      # Number of CPU cores (base metric)
      - record: node:cpu:cores
        expr: count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})
      # CPU usage percentage (per instance, 5-minute window)
      - record: node:cpu:usage_percent
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      # CPU iowait percentage (per instance, 5-minute window)
      - record: node:cpu:iowait_percent
        expr: avg by (instance) (irate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100
      # Memory usage percentage
      - record: node:memory:usage_percent
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
      # Swap usage percentage (per instance)
      - record: node:memory:swap_usage_percent
        expr: (1 - node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes) * 100
      # Filesystem usage percentage
      - record: node:partition:usage_percent
        expr: 100 * (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes)
      # Filesystem inode usage percentage (per instance + mountpoint)
      - record: node:partition:inode_usage_percent
        expr: 100 * (1 - node_filesystem_files_free / node_filesystem_files)
      # 1-minute load average per core (per instance)
      - record: node:load:per_core_1m
        expr: node_load1 / node:cpu:cores
      # 15-minute load average per core (per instance)
      - record: node:load:per_core_15m
        expr: node_load15 / node:cpu:cores
      # File descriptor usage percentage (per instance)
      - record: node:filefd:usage_percent
        expr: node_filefd_allocated / node_filefd_maximum * 100
      # Context switch rate (per instance)
      - record: node:context_switches:rate
        expr: rate(node_context_switches_total[5m])
      # Number of running processes (per instance; requires node_exporter's processes collector)
      - record: node:processes:running
        expr: node_processes_state{state="R"}
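Once these rules are loaded and have been evaluated at least once, the recorded series can be spot-checked with an instant query against the HTTP API, for example:
# Should return one sample per instance
curl -s 'http://localhost:9090/api/v1/query?query=node:cpu:usage_percent'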
Layered Alerting Rules
Hardware layer
Hardware monitoring is usually handled through IPMI/BMC; a sketch follows the command below.
$ sudo -u ${PROMETHEUS_USER} vim ${PROMETHEUS_CONF}/rules/alerts/hardware-alerts.yml
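The hardware rules are intentionally left empty in this walkthrough. As a rough sketch only, assuming temperature metrics exposed by something like the prometheus-community ipmi_exporter (metric and label names vary by exporter and hardware, so verify them against your own /metrics output):
groups:
  - name: hardware_alerts
    rules:
      # Assumed metric name from ipmi_exporter; adjust to what your exporter actually exposes
      - alert: HighChassisTemperature
        expr: ipmi_temperature_celsius > 75
        for: 5m
        labels:
          severity: "warning"
          team: "infrastructure"
          layer: "hardware"
          priority: "P2"
        annotations:
          summary: "High temperature on {{ $labels.instance }} (sensor {{ $labels.name }})"
          description: "Sensor reading is {{ printf \"%.1f\" $value }}°C, above the 75°C threshold."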
Node layer
$ sudo -u ${PROMETHEUS_USER} vim ${PROMETHEUS_CONF}/rules/alerts/node-alerts.yml
# System-layer alerting rules based on Node Exporter
# Requires: node_exporter >= 1.3.0
groups:
  - name: system_alerts
    interval: 1m
    rules:
      # Instance down (based on the up metric)
      - alert: InstanceDown
        expr: up{job="node-exporter"} == 0
        for: 1m
        labels:
          severity: "critical"
          team: "infrastructure, management"
          layer: "system"
          priority: "P0"
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "Instance {{ $labels.instance }} has been down for more than 1 minute. Please check it immediately."
  - name: system_resource_alerts
    rules:
      # High CPU usage (based on node_cpu_seconds_total)
      - alert: HighCPUUsage
        expr: node:cpu:usage_percent > 80
        for: 5m
        labels:
          severity: "warning"
          # Infrastructure team
          team: "infrastructure"
          layer: "system"
          priority: "P2"
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage on {{ $labels.instance }} is {{ printf \"%.2f\" $value }}% and has stayed high for 5 minutes. Check for CPU-heavy processes."
      # Critically high CPU usage
      - alert: CriticalCPUUsage
        expr: node:cpu:usage_percent > 90
        for: 2m
        labels:
          severity: "critical"
          # Infrastructure team
          team: "infrastructure"
          layer: "system"
          priority: "P1"
        annotations:
          summary: "Critically high CPU usage on {{ $labels.instance }}"
          description: "CPU usage on {{ $labels.instance }} is {{ printf \"%.2f\" $value }}% and has stayed high for 2 minutes; the system may become unresponsive."
      # High memory usage (based on node_memory_* metrics)
      - alert: HighMemoryUsage
        expr: node:memory:usage_percent > 80
        for: 5m
        labels:
          severity: "warning"
          team: "infrastructure"
          layer: "system"
          priority: "P2"
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage on {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%; available memory: {{ with query (printf \"node_memory_MemAvailable_bytes{instance='%s'}\" $labels.instance) }}{{ . | first | value | humanize1024 }}B{{ end }}."
      # Critically high memory usage
      - alert: CriticalMemoryUsage
        expr: node:memory:usage_percent > 90
        for: 2m
        labels:
          severity: "critical"
          team: "infrastructure"
          layer: "system"
          priority: "P1"
        annotations:
          summary: "Memory critically low on {{ $labels.instance }}"
          description: "Memory usage on {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%; the system may start swapping or trigger the OOM killer."
      # High swap usage (based on node_memory_SwapTotal_bytes)
      - alert: HighSwapUsage
        expr: node:memory:swap_usage_percent > 60
        for: 5m
        labels:
          severity: "warning"
          team: "infrastructure"
          layer: "system"
          priority: "P2"
        annotations:
          summary: "High swap usage on {{ $labels.instance }}"
          description: "Swap usage on {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%; the node may be running short on memory."
      # Disk space alerts (based on node_filesystem_* metrics)
      - alert: HighDiskUsage
        expr: node:partition:usage_percent > 75
        for: 5m
        labels:
          severity: "warning"
          team: "infrastructure"
          layer: "system"
          priority: "P2"
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Mountpoint {{ $labels.mountpoint }} on {{ $labels.instance }} is {{ printf \"%.2f\" $value }}% full; remaining space: {{ with query (printf \"node_filesystem_avail_bytes{instance='%s',mountpoint='%s'}\" $labels.instance $labels.mountpoint) }}{{ . | first | value | humanize1024 }}B{{ end }}."
      # Critically low disk space
      - alert: CriticalDiskUsage
        expr: node:partition:usage_percent > 85
        for: 2m
        labels:
          severity: "critical"
          team: "infrastructure"
          layer: "system"
          priority: "P1"
        annotations:
          summary: "Critically low disk space on {{ $labels.instance }}"
          description: "Mountpoint {{ $labels.mountpoint }} on {{ $labels.instance }} is {{ printf \"%.2f\" $value }}% full. Free up disk space immediately."
      # High inode usage (based on node_filesystem_files)
      - alert: HighInodeUsage
        expr: node:partition:inode_usage_percent > 90
        for: 5m
        labels:
          severity: "warning"
          team: "infrastructure"
          layer: "system"
          priority: "P2"
        annotations:
          summary: "High inode usage on {{ $labels.instance }}"
          description: "Inode usage of mountpoint {{ $labels.mountpoint }} on {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%; creating new files may fail."
      # High IO wait (based on node_cpu_seconds_total)
      - alert: HighIOWait
        expr: node:cpu:iowait_percent > 20
        for: 5m
        labels:
          severity: "warning"
          team: "infrastructure"
          layer: "system"
          priority: "P2"
        annotations:
          summary: "High IO wait on {{ $labels.instance }}"
          description: "IO wait on {{ $labels.instance }} accounts for {{ printf \"%.2f\" $value }}% of CPU time; there may be a disk performance problem."
  - name: system_process_alerts
    rules:
      # High system load (based on node_load* metrics)
      - alert: HighSystemLoad1m
        expr: node:load:per_core_1m > 2
        for: 5m
        labels:
          severity: "warning"
          team: "infrastructure"
          layer: "system"
          priority: "P2"
        annotations:
          summary: "High 1-minute load on {{ $labels.instance }}"
          description: "1-minute load average on {{ $labels.instance }} is {{ printf \"%.2f\" $value }} per core, more than 2x the number of CPU cores."
      # Sustained high system load
      - alert: HighSystemLoad15m
        expr: node:load:per_core_15m > 2
        for: 5m
        labels:
          severity: "critical"
          team: "infrastructure"
          layer: "system"
          priority: "P1"
        annotations:
          summary: "High 15-minute load on {{ $labels.instance }}"
          description: "15-minute load average on {{ $labels.instance }} is {{ printf \"%.2f\" $value }} per core, more than 2x the number of CPU cores."
      # High file descriptor usage (based on node_filefd_* metrics)
      - alert: HighFileDescriptorUsage
        expr: node:filefd:usage_percent > 75
        for: 5m
        labels:
          severity: "warning"
          team: "infrastructure"
          layer: "system"
          priority: "P2"
        annotations:
          summary: "High file descriptor usage on {{ $labels.instance }}"
          description: "File descriptor usage on {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%; currently allocated: {{ with query (printf \"node_filefd_allocated{instance='%s'}\" $labels.instance) }}{{ . | first | value }}{{ end }}."
      # Too many processes (based on node_processes_* metrics)
      - alert: TooManyProcesses
        expr: node:processes:running > 300
        for: 10m
        labels:
          severity: "warning"
          team: "infrastructure"
          layer: "system"
          priority: "P2"
        annotations:
          summary: "Too many running processes on {{ $labels.instance }}"
          description: "{{ $labels.instance }} currently has {{ $value }} running processes; there may be a process leak or other anomaly."
      # High context switch rate (based on node_context_switches_total)
      - alert: HighContextSwitches
        expr: node:context_switches:rate > 10000
        for: 5m
        labels:
          severity: "warning"
          team: "infrastructure"
          layer: "system"
          priority: "P2"
        annotations:
          summary: "High context switch rate on {{ $labels.instance }}"
          description: "Context switch rate on {{ $labels.instance }} is {{ printf \"%.0f\" $value }}/s; there may be a system performance problem."
Application layer
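The application-layer rules depend entirely on what your services expose, so the file is not filled in here. As a starting point, a minimal sketch assuming a hypothetical service that exports a conventional http_requests_total counter with a code label:
groups:
  - name: application_alerts
    rules:
      # Hypothetical error-rate alert; adjust metric and label names to your service
      - alert: HighHTTPErrorRate
        expr: |
          sum by (instance) (rate(http_requests_total{code=~"5.."}[5m]))
            / sum by (instance) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: "warning"
          team: "application"
          layer: "application"
          priority: "P2"
        annotations:
          summary: "High HTTP 5xx rate on {{ $labels.instance }}"
          description: "More than 5% of requests on {{ $labels.instance }} returned 5xx over the last 5 minutes."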
Configuration Validation
Validate the rule files
# Validate the recording rule file
${PROMETHEUS_HOME}/bin/promtool check rules ${PROMETHEUS_CONF}/rules/records/node-metrics.yml
# Validate the alerting rule file
sudo -u ${PROMETHEUS_USER} \
${PROMETHEUS_HOME}/bin/promtool check rules ${PROMETHEUS_CONF}/rules/alerts/node-alerts.yml
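Beyond the syntax check, promtool can unit-test alerting rules against synthetic series. A small sketch (the test file name and location are arbitrary; it is placed next to the rules so the relative path resolves cleanly):
$ sudo -u ${PROMETHEUS_USER} vim ${PROMETHEUS_CONF}/rules/alerts/node-alerts-test.yml
rule_files:
  - node-alerts.yml
evaluation_interval: 1m
tests:
  # Simulate a node-exporter target that stays down for the whole test window
  - interval: 1m
    input_series:
      - series: 'up{job="node-exporter", instance="192.168.111.199:9100"}'
        values: '0x10'
    alert_rule_test:
      - eval_time: 5m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              team: "infrastructure, management"
              layer: system
              priority: P0
              job: node-exporter
              instance: 192.168.111.199:9100
            exp_annotations:
              summary: "Instance 192.168.111.199:9100 is down"
              description: "Instance 192.168.111.199:9100 has been down for more than 1 minute. Please check it immediately."
$ cd ${PROMETHEUS_CONF}/rules/alerts && sudo -u ${PROMETHEUS_USER} ${PROMETHEUS_HOME}/bin/promtool test rules node-alerts-test.yml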
Validate the Prometheus configuration
sudo -u ${PROMETHEUS_USER} \
${PROMETHEUS_HOME}/bin/promtool check config ${PROMETHEUS_CONF}/prometheus.yml
Reload the configuration
curl -X POST http://localhost:9090/-/reload
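Note: the /-/reload endpoint only works when Prometheus is started with --web.enable-lifecycle. Without that flag, send a SIGHUP instead (assuming the binary is named prometheus):
# Alternative reload when the lifecycle API is disabled
kill -HUP $(pidof prometheus)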
Alertmanager Configuration
Preparation
Create the directory structure
sudo -u ${ALERTMANAGER_USER} mkdir -p ${ALERTMANAGER_CONF}/templates
Main configuration
Configuration reference: Configuration | Prometheus
$ sudo -u ${ALERTMANAGER_USER} vim ${ALERTMANAGER_CONF}/alertmanager.yml
#### Global configuration
global:
  # Email settings
  smtp_smarthost: 'smtp.163.com:465'
  smtp_from: 'sre_alerts@163.com'
  smtp_auth_username: 'sre_alerts@163.com'
  smtp_auth_password: '**************'
  smtp_require_tls: false
  # WeChat Work (WeCom) settings
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
  wechat_api_secret: '*******************'
  wechat_api_corp_id: '******************'
#### Templates
templates:
  # Globs are allowed here
  - '/etc/alertmanager/templates/email.tmpl'
  - '/etc/alertmanager/templates/wechat.tmpl'
#### Routing
route:
  group_by: ['env', 'alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email'
  routes:
    - receiver: 'email'
      matchers:
        - severity =~ "warning|info"
      continue: true
    - receiver: 'wechat'
      matchers:
        - severity =~ "warning|info"
      continue: true
    - receiver: 'wechat'
      matchers:
        - severity = "critical"
      group_wait: 10s
      repeat_interval: 1h # repeat the notification every hour
#### Receivers
receivers:
  # Email
  - name: 'email'
    email_configs:
      - to: 'ezra-sullivan@outlook.com'
        html: '{{ template "email.alert.html" . }}'
        headers:
          Subject: '{{ template "email.alert.subject" . }}'
        # Send a notification when the alert resolves
        # Note: the resolved notification reuses the value from the last firing evaluation
        send_resolved: true
  # WeChat
  - name: "wechat"
    wechat_configs:
      - agent_id: "1000004"
        to_user: '@all'
        message: '{{ template "wechat.alert.message.markdown" . }}'
        # Note: when using a markdown template, the message type must be set explicitly; the default is text
        message_type: 'markdown'
        send_resolved: true
  # webhook
#### Inhibition rules
inhibit_rules:
  # When an instance is down, inhibit all other alerts for that instance
  - source_matchers:
      - alertname = InstanceDown
    target_matchers:
      - alertname =~ ".*"
    equal: ["instance"]
  # Critical alerts suppress warnings for the same alert and instance
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: ["alertname", "instance"]
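With the file in place, amtool (shipped with Alertmanager) can print the routing tree and show where a given label set would be delivered:
# Show the routing tree
sudo -u ${ALERTMANAGER_USER} \
${ALERTMANAGER_HOME}/bin/amtool config routes show --config.file=${ALERTMANAGER_CONF}/alertmanager.yml
# Check which receivers a critical alert would reach
sudo -u ${ALERTMANAGER_USER} \
${ALERTMANAGER_HOME}/bin/amtool config routes test --config.file=${ALERTMANAGER_CONF}/alertmanager.yml severity=critical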
Notification Templates
Template reference: Notification template reference | Prometheus
Template examples: Notification template examples | Prometheus
Email template
$ sudo -u ${ALERTMANAGER_USER} vim ${ALERTMANAGER_CONF}/templates/email.tmpl
{{ define "email.alert.subject" }}
{{- $count := len .Alerts -}}
{{- if eq $count 1 -}}
{{- range .Alerts -}}
告警-{{ .Labels.alertname }}
{{- end -}}
{{- else -}}
告警-{{ .GroupLabels.alertname }}-批量告警-{{ $count }} 项
{{- end -}}
{{ end }}
{{ define "email.alert.html" }}
{{- $hasFiring := false -}}
{{- $severity := "" -}}
{{- range .Alerts -}}
{{- if eq .Status "firing" }}
{{- $hasFiring = true -}}
{{- if eq .Labels.severity "critical" }}{{ $severity = "critical" }}
{{- else if and (eq .Labels.severity "warning") (ne $severity "critical") }}{{ $severity = "warning" }}
{{- else if and (eq .Labels.severity "info") (ne $severity "critical") (ne $severity "warning") }}{{ $severity = "info" }}
{{- end }}
{{- end }}
{{- end }}
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Monitoring Alert Notification</title>
<style>
body {
font-family: Arial, sans-serif;
margin: 20px;
{{- if not $hasFiring }}
background-color: #f4fff4;
{{- else if eq $severity "critical" }}
background-color: #fff5f5;
{{- else if eq $severity "warning" }}
background-color: #fffbf0;
{{- else }}
background-color: #f8f9fa;
{{- end }}
}
.header {
text-align: center;
padding: 15px;
border-radius: 5px;
color: white;
{{- if not $hasFiring }}
background-color: #5cb85c;
{{- else if eq $severity "critical" }}
background-color: #d9534f;
{{- else if eq $severity "warning" }}
background-color: #f0ad4e;
{{- else }}
background-color: #5bc0de;
{{- end }}
}
.alert-item {
background-color: #fff;
margin: 20px 0;
padding: 20px;
border-radius: 5px;
border-left: 5px solid;
border-color: #ccc;
}
.firing { border-left-color: #d9534f; }
.resolved { border-left-color: #5cb85c; }
.label { font-weight: bold; color: #333; }
.value { color: #555; }
.time { color: #999; font-size: 0.9em; }
.footer {
background-color: #f8f8f8;
padding: 10px;
border-radius: 5px;
margin-top: 20px;
}
</style>
</head>
<body>
<div class="header">
<h2>
{{- if not $hasFiring }}
✅ Alert Resolved
{{- else if eq $severity "critical" }}
🚨 Critical Alert
{{- else if eq $severity "warning" }}
⚠️ Warning Alert
{{- else }}
ℹ️ Info Alert
{{- end }}
</h2>
<p>Alert group: {{ .GroupLabels.alertname }}</p>
</div>
{{ range .Alerts }}
<div class="alert-item {{ .Status }}">
<h3>{{ .Labels.alertname }}</h3>
{{- if .Labels.env }}
<p><span class="label">Environment:</span><span class="value">{{ .Labels.env }}</span></p>
{{- end }}
<p><span class="label">Severity:</span><span class="value">{{ .Labels.severity }}</span></p>
<p><span class="label">Instance:</span><span class="value">{{ .Labels.instance }}</span></p>
<p><span class="label">Status:</span><span class="value">{{ if eq .Status "firing" }}🔥 Firing{{ else }}✅ Resolved{{ end }}</span></p>
<p><span class="label">Summary:</span><span class="value">{{ .Annotations.summary }}</span></p>
<p><span class="label">Description:</span><span class="value">{{ .Annotations.description }}</span></p>
<p><span class="label">Started at:</span><span class="time">{{ .StartsAt.Local.Format "2006-01-02 15:04:05" }}</span></p>
{{- if eq .Status "resolved" }}
<p><span class="label">Ended at:</span><span class="time">{{ .EndsAt.Local.Format "2006-01-02 15:04:05" }}</span></p>
{{- end }}
</div>
{{ end }}
{{- if not $hasFiring }}
<div class="footer">
<p><strong>📗 All alerts have resolved; no further action is needed.</strong></p>
</div>
{{- else if eq $severity "critical" }}
<div class="footer">
<p><strong>🚨 Handle the critical alerts immediately to avoid production impact.</strong></p>
</div>
{{- end }}
</body>
</html>
{{ end }}
WeChat Work template
$ sudo -u ${ALERTMANAGER_USER} vim ${ALERTMANAGER_CONF}/templates/wechat.tmpl
{{ define "wechat.alert.message" }}
{{- $hasFiring := false -}}
{{- $severity := "" -}}
{{- range .Alerts -}}
{{- if eq .Status "firing" -}}
{{- $hasFiring = true -}}
{{- if eq .Labels.severity "critical" }}{{ $severity = "critical" }}
{{- else if and (eq .Labels.severity "warning") (ne $severity "critical") }}{{ $severity = "warning" }}
{{- else if and (eq .Labels.severity "info") (ne $severity "critical") (ne $severity "warning") }}{{ $severity = "info" }}
{{- end -}}
{{- end -}}
{{- end -}}
{{- if not $hasFiring -}}
✅ Resolved --- {{ .GroupLabels.alertname }}
{{- else if eq $severity "critical" -}}
❗ Critical alert --- {{ .GroupLabels.alertname }}
{{- else if eq $severity "warning" -}}
⚠️ Warning --- {{ .GroupLabels.alertname }}
{{- else -}}
ℹ️ Info --- {{ .GroupLabels.alertname }}
{{- end }}
{{ range .Alerts }}
=======================
{{ with .Labels.env }}
Environment: {{ . }}
{{ end }}
Alert rule: {{ .Labels.alertname }}
Severity: {{ .Labels.severity }}
Instance: {{ .Labels.instance }}
Status: {{ if eq .Status "firing" }}Firing ❗{{ else }}Resolved ✅{{ end }}
Summary: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Started at: ⏰ {{ .StartsAt.Local.Format "2006-01-02 15:04:05" }}
{{- if eq .Status "resolved" }}
Resolved at: ✅ {{ .EndsAt.Local.Format "2006-01-02 15:04:05" }}
{{- end }}
=======================
{{ end }}
{{- if not $hasFiring }}
✅ All alerts in this group have resolved
{{- else if eq $severity "critical" }}
❗ Please handle the alerts immediately
{{- end }}
{{ end }}
{{ define "wechat.alert.message.markdown" }}
{{- $hasFiring := false -}}
{{- $severity := "" -}}
{{- range .Alerts -}}
{{- if eq .Status "firing" -}}
{{- $hasFiring = true -}}
{{- if eq .Labels.severity "critical" }}{{ $severity = "critical" }}
{{- else if and (eq .Labels.severity "warning") (ne $severity "critical") }}{{ $severity = "warning" }}
{{- else if and (eq .Labels.severity "info") (ne $severity "critical") (ne $severity "warning") }}{{ $severity = "info" }}
{{- end -}}
{{- end -}}
{{- end -}}
{{- if not $hasFiring -}}
## Alert resolved - ✅
{{- else if eq $severity "critical" -}}
## Critical alert - ❗
{{- else if eq $severity "warning" -}}
## Warning - ⚠️
{{- else -}}
## Info - ℹ️
{{- end }}
{{- with .GroupLabels.alertname }}
**Alert type**: {{ . }}
{{- end }}
{{ range .Alerts }}
---
{{ with .Labels.env }}
**Environment**: {{ . }}
{{ end }}
**Alert rule**: {{ .Labels.alertname }}
**Severity**: {{ .Labels.severity }}
**Instance**: {{ .Labels.instance }}
**Status**: {{ if eq .Status "firing" }}Firing ❗{{ else }}Resolved ✅{{ end }}
**Summary**: {{ .Annotations.summary }}
**Description**: {{ .Annotations.description }}
**Started at**: ⏰ {{ .StartsAt.Local.Format "2006-01-02 15:04:05" }}
{{- if eq .Status "resolved" }}
**Resolved at**: ✅ {{ .EndsAt.Local.Format "2006-01-02 15:04:05" }}
{{- end }}
{{ end }}
{{- if not $hasFiring }}
---
✅ **All alerts in this group have resolved**
{{- else if eq $severity "critical" }}
---
❗ **Please handle the alerts immediately**
{{- end }}
{{ end }}
Configuration Validation
Validate the main configuration
sudo -u ${ALERTMANAGER_USER} \
${ALERTMANAGER_HOME}/bin/amtool check-config ${ALERTMANAGER_CONF}/alertmanager.yml
Reload the configuration
curl -X POST http://localhost:9093/-/reload
Testing
Trigger manually
Send a test alert with amtool:
sudo -u ${ALERTMANAGER_USER} \
${ALERTMANAGER_HOME}/bin/amtool alert add \
--alertmanager.url=http://localhost:9093 \
env="prod" \
alertname="TestAlert" \
severity="warning" \
instance="test-instance" \
--annotation='summary="summary of the alert"' \
--annotation='description="description of the alert"'
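Besides the notification channels themselves, the test alert should also show up inside Alertmanager:
# List the alerts Alertmanager currently knows about
sudo -u ${ALERTMANAGER_USER} \
${ALERTMANAGER_HOME}/bin/amtool alert query --alertmanager.url=http://localhost:9093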
Simulated triggers
Simulate high CPU usage
# Load 2 CPU cores to 80% for 720 seconds
$ docker run -it --rm --name 'stress-ng' \
--cpus="2" \
registry.cn-hangzhou.aliyuncs.com/kmust/stress-ng-alpine:0.14.00-r0 \
--cpu 2 --cpu-load 80 -t 720
# Check CPU usage
$ top -p $(docker top stress-ng | awk 'NR>1 {print $2}' | paste -sd, -)
# Sample output:
top - 23:39:38 up 1 day, 1:44, 5 users, load average: 1.69, 1.26, 0.69
Tasks: 3 total, 1 running, 2 sleeping, 0 stopped, 0 zombie
%Cpu(s): 70.5 us, 1.0 sy, 0.0 ni, 27.6 id, 0.0 wa, 1.0 hi, 0.0 si, 0.0 st
MiB Mem : 1739.0 total, 434.0 free, 577.3 used, 904.7 buff/cache
MiB Swap: 2060.0 total, 2060.0 free, 0.0 used. 1161.7 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
17303 root 20 0 33448 5580 1792 R 79.4 0.3 3:07.00 stress-ng
17304 root 20 0 33448 5580 1792 S 79.4 0.3 3:07.42 stress-ng
17274 root 20 0 33448 4224 3968 S 0.0 0.2 0:00.04 stress-ng
Simulate high memory usage
# Consume 1.5 GiB of memory for 720 seconds
$ docker run -it --rm --name 'stress-ng' \
--memory="2G" \
registry.cn-hangzhou.aliyuncs.com/kmust/stress-ng-alpine:0.14.00-r0 \
--vm 1 --vm-bytes 1.5G --vm-keep --verify --timeout 720s
# Check memory usage
$ free -h
total used free shared buff/cache available
Mem: 1.7Gi 1.6Gi 133Mi 0.0Ki 105Mi 114Mi
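While the stress test runs, the recorded metric and the alert state can also be watched from the Prometheus side:
# Memory usage as seen by the recording rule
curl -s 'http://localhost:9090/api/v1/query?query=node:memory:usage_percent'
# Alerts currently pending or firing in Prometheus
curl -s http://localhost:9090/api/v1/alerts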
Other Configuration
Silences
Silences can be created through the Alertmanager web UI or with the command-line tool:
# Silence alerts from a specific instance
${ALERTMANAGER_HOME}/bin/amtool silence add \
--alertmanager.url=http://localhost:9093 \
--author="admin" \
--comment="计划维护" \
--duration="2h" \
instance="192.168.111.199:9100"
# List current silences
${ALERTMANAGER_HOME}/bin/amtool silence query \
--alertmanager.url=http://localhost:9093
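A silence can also be removed before it expires, using the ID printed by silence add / silence query (the ID below is a placeholder):
# Expire a silence early
${ALERTMANAGER_HOME}/bin/amtool silence expire <silence-id> \
--alertmanager.url=http://localhost:9093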
Time-based routing
route:
  routes:
    # Alerts during business hours
    - match:
        severity: warning
      active_time_intervals:
        - business_hours
      receiver: 'business-hours-alerts'
    # Urgent alerts outside business hours
    - match:
        severity: critical
      active_time_intervals:
        - non_business_hours
      receiver: 'emergency-alerts'
time_intervals:
  - name: business_hours
    time_intervals:
      - times:
          - start_time: '09:00'
            end_time: '18:00'
        weekdays: ['monday:friday']
  - name: non_business_hours
    time_intervals:
      - times:
          - start_time: '18:00'
            end_time: '24:00'
          - start_time: '00:00'
            end_time: '09:00'
        weekdays: ['monday:friday']
      - weekdays: ['saturday', 'sunday']
Appendix
Alert Rule Design Principles
- Layered design: separate alerts by severity level
- Avoid alert storms: choose sensible `for` durations and inhibition rules
- Label conventions: use a consistent label naming scheme to simplify grouping and routing
Choosing Notification Channels
- Critical: multiple channels (WeChat + email + phone)
- Warning: basic notification (WeChat or email)
- Info: log only, no notification
Template Tips
- Include the key facts: alert name, severity, instance, time, description
- Keep the layout clear: use tables or other structured formats
- Include action links: Grafana dashboards and runbook links (see the sketch below)
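For example, a runbook link can be attached to a rule as an extra annotation and rendered in the notification template (the runbook_url annotation name and the wiki URL are illustrative, not part of the configuration above):
# In an alerting rule
annotations:
  runbook_url: "https://wiki.example.com/runbooks/HighCPUUsage"
# In a notification template, inside the per-alert range block
{{ with .Annotations.runbook_url }}**Runbook**: {{ . }}{{ end }}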
Operational Recommendations
- Test the alerting configuration regularly
- Establish an SOP for handling alerts
- Monitor the health of the alerting stack itself
- Review and tune alert rules periodically