ezra-sullivan
Published 2025-06-28

05 - Prometheus Alerting Examples - (1) Email and WeChat Work Notifications

Last updated: June 2025

Versions: Prometheus 3.4.1, Alertmanager 0.28.1

Introduction

This post provides a complete example of Prometheus + Alertmanager alerting configuration, covering both WeChat Work (企业微信) and email notification channels, with hands-on test cases.


Alerting Architecture

Basic Flow

flowchart LR
    subgraph Prometheus
        A1[Scrape metrics] --> A2[Store in TSDB]
        A2 --> A3[Evaluate alerting rules<br>Alerting Rules]
        A3 --> A4[Fire alert events<br>Pending → Firing]
    end

    subgraph Alertmanager
        A4 --> B1[Receive alerts]
        B1 --> B2[Deduplicate]
        B2 --> B3[Group<br>Group by labels]
        B3 --> B4[Silence<br>user-defined silences]
        B3 --> B5[Inhibit<br>higher severity suppresses lower]
        B4 --> B6[Route<br>match routes by labels]
        B5 --> B6
    end

    B6 --> C1[Receiver<br>WeChat notification]
    B6 --> C2[Receiver<br>Email notification]
    B6 --> C3[Receiver<br>Webhook notification]

    %% Optional styling
    classDef prom fill:#f9f,stroke:#333,stroke-width:1px;
    classDef alert fill:#bbf,stroke:#333,stroke-width:1px;
    classDef notify fill:#bfb,stroke:#333,stroke-width:1px;

    class A1,A2,A3,A4 prom
    class B1,B2,B3,B4,B5,B6 alert
    class C1,C2,C3 notify

  • Prometheus - collects and stores monitoring metrics
  • Alerting rules - define alert conditions in Prometheus; when a condition holds, an alert event fires
  • Alertmanager - receives alerts and handles grouping, deduplication, and silencing
  • Notification channels - deliver notifications via WeChat Work, webhook, email, and so on (a quick way to inspect alerts at each stage is sketched below)
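
To see where an alert currently sits in this pipeline, both components expose their state over HTTP. A quick sketch (the IPs match the lab hosts used later in this post; jq is optional and only used for readability):

# Alerts as evaluated by Prometheus (inactive / pending / firing)
curl -s 'http://192.168.111.197:9090/api/v1/rules?type=alert' | jq '.data.groups[].rules[] | {name, state}'

# Alerts currently held by Alertmanager (after deduplication, grouping, and silencing)
curl -s 'http://192.168.111.198:9093/api/v2/alerts' | jq '.[].labels.alertname'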

Component Configuration Layout

#### Prometheus
/etc/prometheus/
├── prometheus.yml           # Prometheus main configuration file
├── rules/                     # Rules directory
│   ├── alerts/                  # Alerting rules (layered)
│   │   ├── hardware-alerts.yml      # Hardware-layer alerting rules
│   │   ├── node-alerts.yml          # Node-layer (system) alerting rules
│   │   ├── application-alerts.yml   # Application-layer alerting rules
│   │   ├── middleware-alerts.yml    # Middleware alerting rules
│   │   ├── network-alerts.yml       # Network-layer alerting rules
│   │   └── database-alerts.yml      # Database alerting rules
│   └── records/                 # Recording rules directory
│       ├── node-metrics.yml         # Node-layer (system) recording rules
│       ├── application-metrics.yml  # Application recording rules
│       ├── business-metrics.yml     # Business-metric recording rules
│       └── sli-slo-metrics.yml      # SLI/SLO recording rules
└── targets/                   # Scrape target configuration directory


#### Alertmanager
/etc/alertmanager/
├── alertmanager.yml           # Alertmanager main configuration file
└── templates/                 # Alert template directory
    ├── wechat.tmpl                # WeChat Work alert template
    └── email.tmpl                 # Email alert template

Preparation

Host Preparation

| Hostname | OS | Arch | IP | Installed software |
|---|---|---|---|---|
| prometheus-01.monitor.local | AlmaLinux 9.6 | x86_64 | 192.168.111.197 | Prometheus 3.x, Node Exporter 1.9.x |
| alertmanager-01.monitor.local | AlmaLinux 9.6 | x86_64 | 192.168.111.198 | Alertmanager 0.28.x, Node Exporter 1.9.x |
| node-exporter-01.monitor.local | AlmaLinux 9.6 | x86_64 | 192.168.111.199 | Node Exporter 1.9.x |

The software deployment steps are omitted here.

Prometheus Configuration

Preparation

Create the directory structure

sudo -u ${PROMETHEUS_USER} mkdir -p ${PROMETHEUS_CONF}/rules/{alerts,records}

Main configuration

Configuration reference: Configuration | Prometheus

$ sudo -u ${PROMETHEUS_USER} vim ${PROMETHEUS_CONF}/prometheus.yml

#### Global configuration
global:
  # Global scrape interval. Default: 1 minute
  scrape_interval: 15s
  # How often alerting and recording rules are evaluated; usually a multiple of scrape_interval. Default: 1 minute
  evaluation_interval: 15s
  # Scrape timeout. Default: 10 seconds
  scrape_timeout: 10s
  
  
#### Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # Alertmanager IP or hostname; list every node when running a cluster
            - 192.168.111.198:9093


#### Rule files
rule_files:
  ## Recording rule files
  # Node-level recording rules
  - "rules/records/node-metrics.yml"

  ## Alerting rule files
  # Hardware-layer alerting rules
  # - "rules/alerts/hardware-alerts.yml"
  # System-layer (node) alerting rules
  - "rules/alerts/node-alerts.yml"
  # Application-layer alerting rules
  # - "rules/alerts/application-alerts.yml"
  
  



scrape_configs:
  # Prometheus self-monitoring
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
        labels:
          app: "prometheus"
          
  # Node monitoring
  - job_name: 'node-exporter'
    static_configs:
      - targets: 
          - '192.168.111.197:9100'
          - '192.168.111.198:9100'
          - '192.168.111.199:9100'
        labels:
          env: 'prod'
          level: 'system'
          category: 'monitor'
          
          
  # Application monitoring
  # - job_name: 'app-metrics'
  #   static_configs:
  #     - targets: 
  #         - '192.168.111.200:8080'
  #       labels:
  #         env: 'prod'
  #         level: 'application'
  #         category: 'service'
  #         service: 'user'

Layered Recording Rules

Node layer

$ sudo -u ${PROMETHEUS_USER} vim ${PROMETHEUS_CONF}/rules/records/node-metrics.yml


groups:
  - name: node_recording_rules
    interval: 30s
    rules:
      # Number of CPU cores (base metric)
      - record: node:cpu:cores
        expr: count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})

      # CPU usage percent (per instance, 5m rate)
      - record: node:cpu:usage_percent
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

      # CPU I/O wait percent (per instance, 5m rate)
      - record: node:cpu:iowait_percent
        expr: avg by (instance) (irate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100

      # Memory usage percent
      - record: node:memory:usage_percent
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

      # Swap usage percent (per instance)
      - record: node:memory:swap_usage_percent
        expr: (1 - node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes) * 100

      # Filesystem usage percent
      - record: node:partition:usage_percent
        expr: 100 * (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes)

      # Filesystem inode usage percent (per instance and mountpoint)
      - record: node:partition:inode_usage_percent
        expr: 100 * (1 - node_filesystem_files_free / node_filesystem_files)

      # 1-minute load average per CPU core (per instance)
      - record: node:load:per_core_1m
        expr: node_load1 / node:cpu:cores

      # 15-minute load average per CPU core (per instance)
      - record: node:load:per_core_15m
        expr: node_load15 / node:cpu:cores

      # File descriptor usage percent (per instance)
      - record: node:filefd:usage_percent
        expr: node_filefd_allocated / node_filefd_maximum * 100

      # Context switch rate (per instance)
      - record: node:context_switches:rate
        expr: rate(node_context_switches_total[5m])

      # Number of running processes (per instance)
      - record: node:processes:running
        expr: node_processes_state{state="R"}
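
Once Prometheus has loaded the rules, it is worth spot-checking that the recorded series actually produce data. A quick sketch using promtool's instant-query mode against the local server (any PromQL client works the same way):

# Query a recorded series directly from the Prometheus server
${PROMETHEUS_HOME}/bin/promtool query instant http://localhost:9090 'node:cpu:usage_percent'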


Layered Alerting Rules

Configuration reference: Alerting rules | Prometheus

Hardware layer

Hardware monitoring is usually done through IPMI/BMC.

$ sudo -u ${PROMETHEUS_USER} vim ${PROMETHEUS_CONF}/rules/alerts/hardware-alerts.yml
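
This file is left empty in this walkthrough. As a placeholder, here is a minimal sketch of what a hardware-layer rule might look like if the node is scraped by an IPMI exporter; the metric name ipmi_temperature_celsius, the name sensor label, and the threshold are assumptions that depend on your exporter and hardware:

groups:
  - name: hardware_alerts
    rules:
      # Example only: chassis/sensor temperature too high (assumes ipmi_exporter-style metrics)
      - alert: HighChassisTemperature
        expr: ipmi_temperature_celsius > 75
        for: 5m
        labels:
          severity: "warning"
          team: "infrastructure"
          layer: "hardware"
          priority: "P2"
        annotations:
          summary: "High temperature on {{ $labels.instance }}"
          description: "Sensor {{ $labels.name }} on {{ $labels.instance }} reports {{ printf \"%.1f\" $value }}°C."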

  

Node layer

$ sudo -u ${PROMETHEUS_USER} vim ${PROMETHEUS_CONF}/rules/alerts/node-alerts.yml

# System-layer alerting rules based on Node Exporter
# Requires: node_exporter >= 1.3.0
groups:
  - name: system_alerts
    interval: 1m
    rules:
      # Instance down alert (based on the up metric)
      - alert: InstanceDown
        expr: up{job="node-exporter"} == 0
        for: 1m
        labels:
          severity: "critical"
          team: "infrastructure, management"
          layer: "system"
          priority: "P0"
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "Instance {{ $labels.instance }} has been down for more than 1 minute; check its status immediately."

  - name: system_resource_alerts
    rules:
      # CPU usage alert (based on node_cpu_seconds_total)
      - alert: HighCPUUsage
        expr: node:cpu:usage_percent > 80
        for: 5m
        labels:
          severity: "warning"
          # Infrastructure team
          team: "infrastructure"
          layer: "system"
          priority: "P2"
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage on {{ $labels.instance }} is {{ printf \"%.2f\" $value }}% and has stayed high for 5 minutes. Check for CPU-intensive processes."

      # Critically high CPU usage
      - alert: CriticalCPUUsage
        expr: node:cpu:usage_percent > 90
        for: 2m
        labels:
          severity: "critical"
          # Infrastructure team
          team: "infrastructure"
          layer: "system"
          priority: "P1"
        annotations:
          summary: "Critically high CPU usage on {{ $labels.instance }}"
          description: "CPU usage on {{ $labels.instance }} is {{ printf \"%.2f\" $value }}% and has stayed high for 2 minutes; the system may become unresponsive."

      # Memory usage alert (based on node_memory_* metrics)
      - alert: HighMemoryUsage
        expr: node:memory:usage_percent > 80
        for: 5m
        labels:
          severity: "warning"
          team: "infrastructure"
          layer: "system"
          priority: "P2"
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage on {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%. Available memory: {{ with query (printf \"node_memory_MemAvailable_bytes{instance='%s'}\" $labels.instance) }}{{ . | first | value | humanize1024 }}B{{end}}."

      # Critically high memory usage
      - alert: CriticalMemoryUsage
        expr: node:memory:usage_percent > 90
        for: 2m
        labels:
          severity: "critical"
          team: "infrastructure"
          layer: "system"
          priority: "P1"
        annotations:
          summary: "Memory critically low on {{ $labels.instance }}"
          description: "Memory usage on {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%; the system may start swapping or trigger the OOM killer."

      # High swap usage (based on node_memory_SwapTotal_bytes)
      - alert: HighSwapUsage
        expr: node:memory:swap_usage_percent > 60
        for: 5m
        labels:
          severity: "warning"
          team: "infrastructure"
          layer: "system"
          priority: "P2"
        annotations:
          summary: "High swap usage on {{ $labels.instance }}"
          description: "Swap usage on {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%; the node may be running short of memory."

      # Disk space alert (based on node_filesystem_* metrics)
      - alert: HighDiskUsage
        expr: node:partition:usage_percent > 75
        for: 5m
        labels:
          severity: "warning"
          team: "infrastructure"
          layer: "system"
          priority: "P2"
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk usage of mountpoint {{ $labels.mountpoint }} on {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%. Remaining space: {{ with query (printf \"node_filesystem_avail_bytes{instance='%s',mountpoint='%s'}\" $labels.instance $labels.mountpoint) }}{{ . | first | value | humanize1024 }}B{{end}}."

      # Disk space critically low
      - alert: CriticalDiskUsage
        expr: node:partition:usage_percent > 85
        for: 2m
        labels:
          severity: "critical"
          team: "infrastructure"
          layer: "system"
          priority: "P1"
        annotations:
          summary: "Disk space critically low on {{ $labels.instance }}"
          description: "Disk usage of mountpoint {{ $labels.mountpoint }} on {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%; free up disk space immediately."

      # High inode usage (based on node_filesystem_files)
      - alert: HighInodeUsage
        expr: node:partition:inode_usage_percent > 90
        for: 5m
        labels:
          severity: "warning"
          team: "infrastructure"
          layer: "system"
          priority: "P2"
        annotations:
          summary: "High inode usage on {{ $labels.instance }}"
          description: "Inode usage of mountpoint {{ $labels.mountpoint }} on {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%; creating new files may start to fail."

      # High disk I/O wait (based on node_cpu_seconds_total)
      - alert: HighIOWait
        expr: node:cpu:iowait_percent > 20
        for: 5m
        labels:
          severity: "warning"
          team: "infrastructure"
          layer: "system"
          priority: "P2"
        annotations:
          summary: "High I/O wait on {{ $labels.instance }}"
          description: "I/O wait on {{ $labels.instance }} accounts for {{ printf \"%.2f\" $value }}% of CPU time; there may be a disk performance problem."

  - name: system_process_alerts
    rules:
      # High system load (based on node_load* metrics)
      - alert: HighSystemLoad1m
        expr: node:load:per_core_1m > 2
        for: 5m
        labels:
          severity: "warning"
          team: "infrastructure"
          layer: "system"
          priority: "P2"
        annotations:
          summary: "High 1-minute load on {{ $labels.instance }}"
          description: "1-minute load per CPU core on {{ $labels.instance }} is {{ printf \"%.2f\" $value }}; the load exceeds 2x the number of CPU cores."

      # Sustained high system load
      - alert: HighSystemLoad15m
        expr: node:load:per_core_15m > 2
        for: 5m
        labels:
          severity: "critical"
          team: "infrastructure"
          layer: "system"
          priority: "P1"
        annotations:
          summary: "High 15-minute load on {{ $labels.instance }}"
          description: "15-minute load per CPU core on {{ $labels.instance }} is {{ printf \"%.2f\" $value }}; the load exceeds 2x the number of CPU cores."

      # High file descriptor usage (based on node_filefd_* metrics)
      - alert: HighFileDescriptorUsage
        expr: node:filefd:usage_percent > 75
        for: 5m
        labels:
          severity: "warning"
          team: "infrastructure"
          layer: "system"
          priority: "P2"
        annotations:
          summary: "High file descriptor usage on {{ $labels.instance }}"
          description: "File descriptor usage on {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%. Currently allocated: {{ with query (printf \"node_filefd_allocated{instance='%s'}\" $labels.instance) }}{{ . | first | value | humanize }}{{end}}."

      # Too many processes (based on node_processes_* metrics)
      - alert: TooManyProcesses
        expr: node:processes:running > 300
        for: 10m
        labels:
          severity: "warning"
          team: "infrastructure"
          layer: "system"
          priority: "P2"
        annotations:
          summary: "Too many running processes on {{ $labels.instance }}"
          description: "{{ $labels.instance }} currently has {{ $value }} running processes; there may be a process leak or other anomaly."

      # Excessive context switching (based on node_context_switches_total)
      - alert: HighContextSwitches
        expr: node:context_switches:rate > 10000
        for: 5m
        labels:
          severity: "warning"
          team: "infrastructure"
          layer: "system"
          priority: "P2"
        annotations:
          summary: "Excessive context switching on {{ $labels.instance }}"
          description: "Context switch rate on {{ $labels.instance }} is {{ printf \"%.0f\" $value }}/s; this may indicate a system performance problem."

Application layer
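
No application-layer rules are defined in this walkthrough (the application-alerts.yml entry stays commented out in prometheus.yml). For reference, a minimal sketch of what such a rule could look like; the http_requests_total metric and its status/service labels are assumptions about what the application would expose:

groups:
  - name: application_alerts
    rules:
      # Example only: HTTP 5xx error ratio above 5% for 5 minutes
      - alert: HighHTTPErrorRate
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (service) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: "warning"
          team: "application"
          layer: "application"
          priority: "P2"
        annotations:
          summary: "High HTTP error rate for service {{ $labels.service }}"
          description: "More than 5% of requests for {{ $labels.service }} returned 5xx over the last 5 minutes."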

Configuration Validation

Validate the rule files

# Validate the recording rule file
${PROMETHEUS_HOME}/bin/promtool check rules ${PROMETHEUS_CONF}/rules/records/node-metrics.yml

# Validate the alerting rule file
sudo -u ${PROMETHEUS_USER} \
    ${PROMETHEUS_HOME}/bin/promtool check rules ${PROMETHEUS_CONF}/rules/alerts/node-alerts.yml
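
Besides the syntax check, promtool can unit-test alerting rules against synthetic series, which catches broken expressions and label mistakes before they reach production. A minimal sketch for the InstanceDown rule; the test file name and the sample instance are illustrative:

# Example: ${PROMETHEUS_CONF}/rules/alerts/node-alerts-test.yml (path relative to this file)
rule_files:
  - node-alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Target stays down (up == 0) for six 1-minute samples
      - series: 'up{job="node-exporter", instance="192.168.111.199:9100"}'
        values: '0x5'
    alert_rule_test:
      - eval_time: 2m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              job: node-exporter
              instance: 192.168.111.199:9100
              severity: critical
              team: "infrastructure, management"
              layer: system
              priority: P0

Run the test with:

sudo -u ${PROMETHEUS_USER} \
    ${PROMETHEUS_HOME}/bin/promtool test rules ${PROMETHEUS_CONF}/rules/alerts/node-alerts-test.yml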

Validate the Prometheus configuration

sudo -u ${PROMETHEUS_USER} \
    ${PROMETHEUS_HOME}/bin/promtool check config ${PROMETHEUS_CONF}/prometheus.yml

Reload the configuration

# Note: the /-/reload endpoint only works when Prometheus is started with --web.enable-lifecycle;
# otherwise send SIGHUP to the Prometheus process instead.
curl -X POST http://localhost:9090/-/reload

Alertmanager Configuration

Preparation

Create the directory structure

sudo -u ${ALERTMANAGER_USER} mkdir -p ${ALERTMANAGER_CONF}/templates

Main configuration

Configuration reference: Configuration | Prometheus

$ sudo -u ${ALERTMANAGER_USER} vim ${ALERTMANAGER_CONF}/alertmanager.yml

#### Global configuration
global:
  # Email settings
  smtp_smarthost: 'smtp.163.com:465'
  smtp_from: 'sre_alerts@163.com'
  smtp_auth_username: 'sre_alerts@163.com'
  smtp_auth_password: '**************'
  smtp_require_tls: false

  # WeChat Work (企业微信) settings
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
  wechat_api_secret: '*******************'
  wechat_api_corp_id: '******************'
  


#### Template configuration
templates:
  # Glob patterns are allowed
  - '/etc/alertmanager/templates/email.tmpl'
  - '/etc/alertmanager/templates/wechat.tmpl'
  
  
#### Alert routing
route:
  group_by: ['env', 'alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email'
  
  routes:
    - receiver: 'email'
      matchers: 
        - severity =~ "warning|info"
      continue: true
    
    - receiver: 'wechat'
      matchers: 
        - severity =~ "warning|info"
      continue: true
        
    - receiver: 'wechat'
      matchers:
        - severity = "critical"
      group_wait: 10s
      repeat_interval: 1h # re-notify every hour



#### Receivers
receivers:
  # Email
  - name: 'email'
    email_configs:
      - to: 'ezra-sullivan@outlook.com'
        html: '{{ template "email.alert.html" . }}'
        headers:
          Subject: '{{ template "email.alert.subject" . }}'
        # Also send a notification when the alert resolves
        # Note: the resolved notification reuses the value from the last firing evaluation
        send_resolved: true

  # WeChat Work
  - name: "wechat"
    wechat_configs:
      - agent_id: "1000004"
        to_user: '@all'
        message: '{{ template "wechat.alert.message.markdown" . }}'
        # Note: a markdown template requires message_type to be set explicitly; the default is text
        message_type: 'markdown'
        send_resolved: true

  # Webhook (not used in this example)
  
  
# Inhibition rules
inhibit_rules:
  # When an instance is down, suppress its other alerts
  - source_matchers:
      - alertname = InstanceDown
    target_matchers:
      - alertname =~ ".*"
    equal: ["instance"]

  # Critical alerts suppress warnings for the same alert name and instance
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: ["alertname", "instance"]

Alert Template Configuration

Template reference: Notification template reference | Prometheus

Template examples: Notification template examples | Prometheus

Email template

$ sudo -u ${ALERTMANAGER_USER} vim ${ALERTMANAGER_CONF}/templates/email.tmpl

{{ define "email.alert.subject" }}
{{- $count := len .Alerts -}}
{{- if eq $count 1 -}}
  {{- range .Alerts -}}
    Alert - {{ .Labels.alertname }}
  {{- end -}}
{{- else -}}
  Alert - {{ .GroupLabels.alertname }} - {{ $count }} alerts in group
{{- end -}}
{{ end }}

{{ define "email.alert.html" }}
{{- $hasFiring := false -}}
{{- $severity := "" -}}

{{- range .Alerts -}}
  {{- if eq .Status "firing" }}
    {{- $hasFiring = true -}}
    {{- if eq .Labels.severity "critical" }}{{ $severity = "critical" }}
    {{- else if and (eq .Labels.severity "warning") (ne $severity "critical") }}{{ $severity = "warning" }}
    {{- else if and (eq .Labels.severity "info") (ne $severity "critical") (ne $severity "warning") }}{{ $severity = "info" }}
    {{- end }}
  {{- end }}
{{- end }}

<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8">
  <title>Monitoring Alert Notification</title>
  <style>
    body {
      font-family: Arial, sans-serif;
      margin: 20px;
      {{- if not $hasFiring }}
      background-color: #f4fff4;
      {{- else if eq $severity "critical" }}
      background-color: #fff5f5;
      {{- else if eq $severity "warning" }}
      background-color: #fffbf0;
      {{- else }}
      background-color: #f8f9fa;
      {{- end }}
    }
    .header {
      text-align: center;
      padding: 15px;
      border-radius: 5px;
      color: white;
      {{- if not $hasFiring }}
      background-color: #5cb85c;
      {{- else if eq $severity "critical" }}
      background-color: #d9534f;
      {{- else if eq $severity "warning" }}
      background-color: #f0ad4e;
      {{- else }}
      background-color: #5bc0de;
      {{- end }}
    }
    .alert-item {
      background-color: #fff;
      margin: 20px 0;
      padding: 20px;
      border-radius: 5px;
      border-left: 5px solid;
      border-color: #ccc;
    }
    .firing { border-left-color: #d9534f; }
    .resolved { border-left-color: #5cb85c; }
    .label { font-weight: bold; color: #333; }
    .value { color: #555; }
    .time { color: #999; font-size: 0.9em; }
    .footer {
      background-color: #f8f8f8;
      padding: 10px;
      border-radius: 5px;
      margin-top: 20px;
    }
  </style>
</head>
<body>

  <div class="header">
    <h2>
      {{- if not $hasFiring }}
      &#x2705; Alerts Resolved
      {{- else if eq $severity "critical" }}
      &#x1F6A8; Critical Alert
      {{- else if eq $severity "warning" }}
      &#x26A0;&#xFE0F; Warning Alert
      {{- else }}
      &#x2139;&#xFE0F; Informational Alert
      {{- end }}
    </h2>
    <p>Alert group: {{ .GroupLabels.alertname }}</p>
  </div>

  {{ range .Alerts }}
  <div class="alert-item {{ .Status }}">
    <h3>{{ .Labels.alertname }}</h3>
    {{- if .Labels.env }}
    <p><span class="label">Environment:</span><span class="value">{{ .Labels.env }}</span></p>
    {{- end }}
    <p><span class="label">Severity:</span><span class="value">{{ .Labels.severity }}</span></p>
    <p><span class="label">Instance:</span><span class="value">{{ .Labels.instance }}</span></p>
    <p><span class="label">Status:</span><span class="value">{{ if eq .Status "firing" }}&#x1F525; Firing{{ else }}&#x2705; Resolved{{ end }}</span></p>
    <p><span class="label">Summary:</span><span class="value">{{ .Annotations.summary }}</span></p>
    <p><span class="label">Description:</span><span class="value">{{ .Annotations.description }}</span></p>
    <p><span class="label">Started at:</span><span class="time">{{ .StartsAt.Local.Format "2006-01-02 15:04:05" }}</span></p>
    {{- if eq .Status "resolved" }}
    <p><span class="label">Ended at:</span><span class="time">{{ .EndsAt.Local.Format "2006-01-02 15:04:05" }}</span></p>
    {{- end }}
  </div>
  {{ end }}

  {{- if not $hasFiring }}
  <div class="footer">
    <p><strong>&#x1F4D7; All alerts in this group have been resolved; no further action is needed.</strong></p>
  </div>
  {{- else if eq $severity "critical" }}
  <div class="footer">
    <p><strong>&#x1F6A8; Please handle the critical alert immediately to avoid production impact.</strong></p>
  </div>
  {{- end }}

</body>
</html>
{{ end }}

WeChat Work template

$ sudo -u ${ALERTMANAGER_USER} vim ${ALERTMANAGER_CONF}/templates/wechat.tmpl

{{ define "wechat.alert.message" }}
{{- $hasFiring := false -}}
{{- $severity := "" -}}

{{- range .Alerts -}}
  {{- if eq .Status "firing" -}}
    {{- $hasFiring = true -}}
    {{- if eq .Labels.severity "critical" }}{{ $severity = "critical" }}
    {{- else if and (eq .Labels.severity "warning") (ne $severity "critical") }}{{ $severity = "warning" }}
    {{- else if and (eq .Labels.severity "info") (ne $severity "critical") (ne $severity "warning") }}{{ $severity = "info" }}
    {{- end -}}
  {{- end -}}
{{- end -}}

{{- if not $hasFiring -}}
✅ Alerts resolved --- {{ .GroupLabels.alertname }}
{{- else if eq $severity "critical" -}}
❗ Critical alert --- {{ .GroupLabels.alertname }}
{{- else if eq $severity "warning" -}}
⚠️ Warning alert --- {{ .GroupLabels.alertname }}
{{- else -}}
ℹ️ Informational alert --- {{ .GroupLabels.alertname }}
{{- end }}
{{ range .Alerts }}
=======================
{{ with .Labels.env }}
Environment: {{ . }}
{{ end }}
Alert rule: {{ .Labels.alertname }}
Severity: {{ .Labels.severity }}
Instance: {{ .Labels.instance }}
Status: {{ if eq .Status "firing" }}Firing ❗{{ else }}Resolved ✅{{ end }}
Summary: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Started at: ⏰ {{ .StartsAt.Local.Format "2006-01-02 15:04:05" }}
{{- if eq .Status "resolved" }}
Resolved at: ✅ {{ .EndsAt.Local.Format "2006-01-02 15:04:05" }}
{{- end }}
=======================
{{ end }}

{{- if not $hasFiring }}
✅ All alerts in this group have been resolved
{{- else if eq $severity "critical" }}
❗ Please handle the alert immediately
{{- end }}
{{ end }}



{{ define "wechat.alert.message.markdown" }}
{{- $hasFiring := false -}}
{{- $severity := "" -}}

{{- range .Alerts -}}
  {{- if eq .Status "firing" -}}
    {{- $hasFiring = true -}}
    {{- if eq .Labels.severity "critical" }}{{ $severity = "critical" }}
    {{- else if and (eq .Labels.severity "warning") (ne $severity "critical") }}{{ $severity = "warning" }}
    {{- else if and (eq .Labels.severity "info") (ne $severity "critical") (ne $severity "warning") }}{{ $severity = "info" }}
    {{- end -}}
  {{- end -}}
{{- end -}}

{{- if not $hasFiring -}}
## Alerts Resolved - ✅
{{- else if eq $severity "critical" -}}
## Critical Alert - ❗
{{- else if eq $severity "warning" -}}
## Warning Alert - ⚠️
{{- else -}}
## Informational Alert - ℹ️
{{- end }}

{{- with .GroupLabels.alertname }}
**Alert group**: {{ . }}
{{- end }}

{{ range .Alerts }}
---
{{ with .Labels.env }}
**Environment**: {{ . }}
{{ end }}
**Alert rule**: {{ .Labels.alertname }}  
**Severity**: {{ .Labels.severity }}  
**Instance**: {{ .Labels.instance }}  
**Status**: {{ if eq .Status "firing" }}Firing ❗{{ else }}Resolved ✅{{ end }}  
**Summary**: {{ .Annotations.summary }}  
**Description**: {{ .Annotations.description }}  
**Started at**: ⏰ {{ .StartsAt.Local.Format "2006-01-02 15:04:05" }}  
{{- if eq .Status "resolved" }}
**Resolved at**: ✅ {{ .EndsAt.Local.Format "2006-01-02 15:04:05" }}  
{{- end }}
{{ end }}

{{- if not $hasFiring }}
---
✅ **All alerts in this group have been resolved**
{{- else if eq $severity "critical" }}
---
❗ **Please handle the alert immediately**
{{- end }}
{{ end }}


Configuration Validation

Validate the main configuration

sudo -u ${ALERTMANAGER_USER} \
    ${ALERTMANAGER_HOME}/bin/amtool check-config ${ALERTMANAGER_CONF}/alertmanager.yml
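
Besides the syntax check, amtool can print the routing tree and show which receivers a given label set would reach, which is a cheap way to verify the routes before reloading. A sketch (the label value is illustrative):

# Show the routing tree
sudo -u ${ALERTMANAGER_USER} \
    ${ALERTMANAGER_HOME}/bin/amtool config routes show --config.file=${ALERTMANAGER_CONF}/alertmanager.yml

# Which receiver(s) would a critical alert be delivered to?
sudo -u ${ALERTMANAGER_USER} \
    ${ALERTMANAGER_HOME}/bin/amtool config routes test --config.file=${ALERTMANAGER_CONF}/alertmanager.yml severity=critical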

Reload the configuration

curl -X POST http://localhost:9093/-/reload

Testing

Manual trigger

Send a test alert with amtool:

sudo -u ${ALERTMANAGER_USER} \
  ${ALERTMANAGER_HOME}/bin/amtool alert add \
  --alertmanager.url=http://localhost:9093 \
  env="prod" \
  alertname="TestAlert" \
  severity="warning" \
  instance="test-instance" \
  --annotation='summary="summary of the alert"' \
  --annotation='description="description of the alert"'
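
Before checking the mailbox or WeChat, you can confirm that Alertmanager actually accepted the test alert:

# List matching alerts currently known to Alertmanager
sudo -u ${ALERTMANAGER_USER} \
  ${ALERTMANAGER_HOME}/bin/amtool alert query \
  --alertmanager.url=http://localhost:9093 \
  alertname="TestAlert"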
  

Simulated trigger

Simulate high CPU usage

# Run 2 CPU workers at 80% load for 720 seconds
$ docker run -it --rm --name 'stress-ng' \
    --cpus="2" \
    registry.cn-hangzhou.aliyuncs.com/kmust/stress-ng-alpine:0.14.00-r0 \
     --cpu 2 --cpu-load 80  -t 720



# Check CPU usage
$ top -p $(docker top stress-ng | awk 'NR>1 {print $2}' | paste -sd, -)

# Sample output:

top - 23:39:38 up 1 day,  1:44,  5 users,  load average: 1.69, 1.26, 0.69
Tasks:   3 total,   1 running,   2 sleeping,   0 stopped,   0 zombie
%Cpu(s): 70.5 us,  1.0 sy,  0.0 ni, 27.6 id,  0.0 wa,  1.0 hi,  0.0 si,  0.0 st
MiB Mem :   1739.0 total,    434.0 free,    577.3 used,    904.7 buff/cache
MiB Swap:   2060.0 total,   2060.0 free,      0.0 used.   1161.7 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                     
  17303 root      20   0   33448   5580   1792 R  79.4   0.3   3:07.00 stress-ng                                                                   
  17304 root      20   0   33448   5580   1792 S  79.4   0.3   3:07.42 stress-ng                                                                   
  17274 root      20   0   33448   4224   3968 S   0.0   0.2   0:00.04 stress-ng 

Simulate high memory usage

# Allocate 1.5 GiB of memory for 720 seconds
$ docker run -it --rm --name 'stress-ng' \
    --memory="2G" \
    registry.cn-hangzhou.aliyuncs.com/kmust/stress-ng-alpine:0.14.00-r0 \
    --vm 1 --vm-bytes 1.5G --vm-keep --verify --timeout 720s
    
    
# Check memory usage
$ free -h
               total        used        free      shared  buff/cache   available
Mem:           1.7Gi       1.6Gi       133Mi       0.0Ki       105Mi       114Mi

Other Configuration

Silencing alerts

Set silences through the Alertmanager web UI or with the command-line tool:

# Silence alerts for a specific instance
${ALERTMANAGER_HOME}/bin/amtool silence add \
  --alertmanager.url=http://localhost:9093 \
  --author="admin" \
  --comment="Scheduled maintenance" \
  --duration="2h" \
  instance="192.168.111.199:9100"

# List active silences
${ALERTMANAGER_HOME}/bin/amtool silence query \
  --alertmanager.url=http://localhost:9093
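
If maintenance finishes early, a silence can be expired ahead of schedule; the silence ID comes from the output of silence add or silence query:

# Expire a silence before its scheduled end (replace <silence-id> with the real ID)
${ALERTMANAGER_HOME}/bin/amtool silence expire <silence-id> \
  --alertmanager.url=http://localhost:9093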

Time-based alert routing

route:
  routes:
    # Alerts during business hours
    - matchers:
        - severity = "warning"
      active_time_intervals:
        - business_hours
      receiver: 'business-hours-alerts'

    # Urgent alerts outside business hours
    - matchers:
        - severity = "critical"
      active_time_intervals:
        - non_business_hours
      receiver: 'emergency-alerts'

time_intervals:
  - name: business_hours
    time_intervals:
      - times:
          - start_time: '09:00'
            end_time: '18:00'
        weekdays: ['monday:friday']

  - name: non_business_hours
    time_intervals:
      # A time range must not wrap past midnight, so the overnight window is split in two
      - times:
          - start_time: '18:00'
            end_time: '24:00'
          - start_time: '00:00'
            end_time: '09:00'
        weekdays: ['monday:friday']
      - weekdays: ['saturday', 'sunday']
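
Note that Alertmanager evaluates time intervals in UTC unless told otherwise; recent releases accept an optional location field per interval. A sketch, assuming the running Alertmanager version supports it:

time_intervals:
  - name: business_hours
    time_intervals:
      - times:
          - start_time: '09:00'
            end_time: '18:00'
        weekdays: ['monday:friday']
        # Assumption: requires an Alertmanager release with time-zone support for time intervals
        location: 'Asia/Shanghai'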

Appendix

Alert rule design principles

  • Layered design: distinguish alerts by severity level
  • Avoid alert storms: set sensible for durations and inhibition rules
  • Label conventions: standardize label naming to simplify grouping and routing

Choosing notification channels

  • Critical: notify over multiple channels (WeChat + email + phone)
  • Warning: basic notification (WeChat or email)
  • Info: record to the logging system only

Template best practices

  • Include the key information: alert name, severity, instance, time, description
  • Keep the format clear: use tables or another structured layout
  • Include action links: Grafana dashboards, runbook links (see the sketch below)
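
As an example of the last point, a runbook link can travel with the alert as an extra annotation and be rendered by the notification template. A sketch; the runbook_url annotation name and the wiki URL are illustrative:

# In the alerting rule (e.g. rules/alerts/node-alerts.yml)
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          runbook_url: "https://wiki.example.com/runbooks/HighCPUUsage"

# In a notification template (e.g. wechat.tmpl)
{{ with .Annotations.runbook_url }}**Runbook**: {{ . }}{{ end }}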

Operational recommendations

  • Test the alerting configuration regularly
  • Establish an SOP for handling alerts
  • Monitor the health of the alerting system itself
  • Review and tune alert rules periodically
