ezra-sullivan
Published 2025-06-28

05 - Prometheus Alerting Examples - (1) Email and WeChat Work Notifications

Last updated: June 2025

Versions: Prometheus 3.4.1, Alertmanager 0.28.1

Introduction

This post provides a complete example of Prometheus + Alertmanager alerting configuration, covering both WeChat Work (企业微信) and email notification channels, with hands-on test cases.


Alerting Architecture

Basic Flow

flowchart LR
    subgraph Prometheus
        A1[Scrape metrics] --> A2[Store in TSDB]
        A2 --> A3[Evaluate alerting rules<br>Alerting Rules]
        A3 --> A4[Fire alert events<br>Pending → Firing]
    end

    subgraph Alertmanager
        A4 --> B1[Receive alerts]
        B1 --> B2[Deduplicate]
        B2 --> B3[Group<br>Group by labels]
        B3 --> B4[Silence<br>user-defined silences]
        B3 --> B5[Inhibit<br>higher severity suppresses lower]
        B4 --> B6[Route<br>match routes by labels]
        B5 --> B6
    end

    B6 --> C1[Receiver<br>WeChat notification]
    B6 --> C2[Receiver<br>Email notification]
    B6 --> C3[Receiver<br>Webhook notification]

    %% Optional styling
    classDef prom fill:#f9f,stroke:#333,stroke-width:1px;
    classDef alert fill:#bbf,stroke:#333,stroke-width:1px;
    classDef notify fill:#bfb,stroke:#333,stroke-width:1px;

    class A1,A2,A3,A4 prom
    class B1,B2,B3,B4,B5,B6 alert
    class C1,C2,C3 notify

  • Prometheus - collects and stores monitoring metrics
  • Alerting rules - define alert conditions in Prometheus; when a condition holds, an alert event fires
  • Alertmanager - receives alerts and handles grouping, deduplication, and silencing
  • Notification channels - deliver notifications via WeChat Work, webhook, email, and so on (a quick way to inspect alerts at each stage is sketched below)
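
To see where an alert currently sits in this pipeline, both components expose their state over HTTP. A quick sketch (the IPs match the lab hosts used later in this post; jq is optional and only used for readability):

# Alerts as evaluated by Prometheus (inactive / pending / firing)
curl -s 'http://192.168.111.197:9090/api/v1/rules?type=alert' | jq '.data.groups[].rules[] | {name, state}'

# Alerts currently held by Alertmanager (after deduplication, grouping, and silencing)
curl -s 'http://192.168.111.198:9093/api/v2/alerts' | jq '.[].labels.alertname'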

Component Configuration Layout

#### Prometheus
/etc/prometheus/
├── prometheus.yml           # Prometheus main configuration file
├── rules/                     # Rules directory
│   ├── alerts/                  # Alerting rules (layered)
│   │   ├── hardware-alerts.yml      # Hardware-layer alerting rules
│   │   ├── node-alerts.yml          # Node-layer (system) alerting rules
│   │   ├── application-alerts.yml   # Application-layer alerting rules
│   │   ├── middleware-alerts.yml    # Middleware alerting rules
│   │   ├── network-alerts.yml       # Network-layer alerting rules
│   │   └── database-alerts.yml      # Database alerting rules
│   └── records/                 # Recording rules directory
│       ├── node-metrics.yml         # Node-layer (system) recording rules
│       ├── application-metrics.yml  # Application recording rules
│       ├── business-metrics.yml     # Business-metric recording rules
│       └── sli-slo-metrics.yml      # SLI/SLO recording rules
└── targets/                   # Scrape target configuration directory


#### Alertmanager
/etc/alertmanager/
├── alertmanager.yml           # Alertmanager main configuration file
└── templates/                 # Alert template directory
    ├── wechat.tmpl                # WeChat Work alert template
    └── email.tmpl                 # Email alert template

Preparation

Host Preparation

| Hostname | OS | Arch | IP | Installed software |
|---|---|---|---|---|
| prometheus-01.monitor.local | AlmaLinux 9.6 | x86_64 | 192.168.111.197 | Prometheus 3.x, Node Exporter 1.9.x |
| alertmanager-01.monitor.local | AlmaLinux 9.6 | x86_64 | 192.168.111.198 | Alertmanager 0.28.x, Node Exporter 1.9.x |
| node-exporter-01.monitor.local | AlmaLinux 9.6 | x86_64 | 192.168.111.199 | Node Exporter 1.9.x |

The software deployment steps are omitted here.

Prometheus Configuration

Preparation

Create the directory structure

sudo -u ${PROMETHEUS_USER} mkdir -p ${PROMETHEUS_CONF}/rules/{alerts,records}

Main configuration

Configuration reference: Configuration | Prometheus

$ sudo -u ${PROMETHEUS_USER} vim ${PROMETHEUS_CONF}/prometheus.yml

#### Global configuration
global:
  # Global scrape interval. Default: 1 minute
  scrape_interval: 15s
  # How often alerting and recording rules are evaluated; usually a multiple of scrape_interval. Default: 1 minute
  evaluation_interval: 15s
  # Scrape timeout. Default: 10 seconds
  scrape_timeout: 10s
  
  
#### Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # Alertmanager IP or hostname; list every node when running a cluster
            - 192.168.111.198:9093


#### Rule files
rule_files:
  ## Recording rule files
  # Node-level recording rules
  - "rules/records/node-metrics.yml"

  ## Alerting rule files
  # Hardware-layer alerting rules
  # - "rules/alerts/hardware-alerts.yml"
  # System-layer (node) alerting rules
  - "rules/alerts/node-alerts.yml"
  # Application-layer alerting rules
  # - "rules/alerts/application-alerts.yml"
  
  



scrape_configs:
  # Prometheus self-monitoring
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
        labels:
          app: "prometheus"
          
  # Node monitoring
  - job_name: 'node-exporter'
    static_configs:
      - targets: 
          - '192.168.111.197:9100'
          - '192.168.111.198:9100'
          - '192.168.111.199:9100'
        labels:
          env: 'prod'
          level: 'system'
          category: 'monitor'
          
          
  # Application monitoring
  # - job_name: 'app-metrics'
  #   static_configs:
  #     - targets: 
  #         - '192.168.111.200:8080'
  #       labels:
  #         env: 'prod'
  #         level: 'application'
  #         category: 'service'
  #         service: 'user'

Layered Recording Rules

Node layer

$ sudo -u ${PROMETHEUS_USER} vim ${PROMETHEUS_CONF}/rules/records/node-metrics.yml


groups:
  - name: node_recording_rules
    interval: 30s
    rules:
      # Number of CPU cores (base metric)
      - record: node:cpu:cores
        expr: count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})

      # CPU usage percent (per instance, 5m rate)
      - record: node:cpu:usage_percent
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

      # CPU I/O wait percent (per instance, 5m rate)
      - record: node:cpu:iowait_percent
        expr: avg by (instance) (irate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100

      # Memory usage percent
      - record: node:memory:usage_percent
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

      # Swap usage percent (per instance)
      - record: node:memory:swap_usage_percent
        expr: (1 - node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes) * 100

      # Filesystem usage percent
      - record: node:partition:usage_percent
        expr: 100 * (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes)

      # Filesystem inode usage percent (per instance and mountpoint)
      - record: node:partition:inode_usage_percent
        expr: 100 * (1 - node_filesystem_files_free / node_filesystem_files)

      # 1-minute load average per CPU core (per instance)
      - record: node:load:per_core_1m
        expr: node_load1 / node:cpu:cores

      # 15-minute load average per CPU core (per instance)
      - record: node:load:per_core_15m
        expr: node_load15 / node:cpu:cores

      # File descriptor usage percent (per instance)
      - record: node:filefd:usage_percent
        expr: node_filefd_allocated / node_filefd_maximum * 100

      # Context switch rate (per instance)
      - record: node:context_switches:rate
        expr: rate(node_context_switches_total[5m])

      # Number of running processes (per instance)
      - record: node:processes:running
        expr: node_processes_state{state="R"}
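
Once Prometheus has loaded the rules, it is worth spot-checking that the recorded series actually produce data. A quick sketch using promtool's instant-query mode against the local server (any PromQL client works the same way):

# Query a recorded series directly from the Prometheus server
${PROMETHEUS_HOME}/bin/promtool query instant http://localhost:9090 'node:cpu:usage_percent'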


Layered Alerting Rules

Configuration reference: Alerting rules | Prometheus

Hardware layer

Hardware monitoring is usually done through IPMI/BMC.

$ sudo -u ${PROMETHEUS_USER} vim ${PROMETHEUS_CONF}/rules/alerts/hardware-alerts.yml
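
This file is left empty in this walkthrough. As a placeholder, here is a minimal sketch of what a hardware-layer rule might look like if the node is scraped by an IPMI exporter; the metric name ipmi_temperature_celsius, the name sensor label, and the threshold are assumptions that depend on your exporter and hardware:

groups:
  - name: hardware_alerts
    rules:
      # Example only: chassis/sensor temperature too high (assumes ipmi_exporter-style metrics)
      - alert: HighChassisTemperature
        expr: ipmi_temperature_celsius > 75
        for: 5m
        labels:
          severity: "warning"
          team: "infrastructure"
          layer: "hardware"
          priority: "P2"
        annotations:
          summary: "High temperature on {{ $labels.instance }}"
          description: "Sensor {{ $labels.name }} on {{ $labels.instance }} reports {{ printf \"%.1f\" $value }}°C."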

  

Node layer

$ sudo -u ${PROMETHEUS_USER} vim ${PROMETHEUS_CONF}/rules/alerts/node-alerts.yml

# System-layer alerting rules based on Node Exporter
# Requires: node_exporter >= 1.3.0
groups:
  - name: system_alerts
    interval: 1m
    rules:
      # Instance down alert (based on the up metric)
      - alert: InstanceDown
        expr: up{job="node-exporter"} == 0
        for: 1m
        labels:
          severity: "critical"
          team: "infrastructure, management"
          layer: "system"
          priority: "P0"
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "Instance {{ $labels.instance }} has been down for more than 1 minute; check its status immediately."

  - name: system_resource_alerts
    rules:
      # CPU usage alert (based on node_cpu_seconds_total)
      - alert: HighCPUUsage
        expr: node:cpu:usage_percent > 80
        for: 5m
        labels:
          severity: "warning"
          # Infrastructure team
          team: "infrastructure"
          layer: "system"
          priority: "P2"
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage on {{ $labels.instance }} is {{ printf \"%.2f\" $value }}% and has stayed high for 5 minutes. Check for CPU-intensive processes."

      # Critically high CPU usage
      - alert: CriticalCPUUsage
        expr: node:cpu:usage_percent > 90
        for: 2m
        labels:
          severity: "critical"
          # Infrastructure team
          team: "infrastructure"
          layer: "system"
          priority: "P1"
        annotations:
          summary: "Critically high CPU usage on {{ $labels.instance }}"
          description: "CPU usage on {{ $labels.instance }} is {{ printf \"%.2f\" $value }}% and has stayed high for 2 minutes; the system may become unresponsive."

      # Memory usage alert (based on node_memory_* metrics)
      - alert: HighMemoryUsage
        expr: node:memory:usage_percent > 80
        for: 5m
        labels:
          severity: "warning"
          team: "infrastructure"
          layer: "system"
          priority: "P2"
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage on {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%. Available memory: {{ with query (printf \"node_memory_MemAvailable_bytes{instance='%s'}\" $labels.instance) }}{{ . | first | value | humanize1024 }}B{{end}}."

      # Critically high memory usage
      - alert: CriticalMemoryUsage
        expr: node:memory:usage_percent > 90
        for: 2m
        labels:
          severity: "critical"
          team: "infrastructure"
          layer: "system"
          priority: "P1"
        annotations:
          summary: "Memory critically low on {{ $labels.instance }}"
          description: "Memory usage on {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%; the system may start swapping or trigger the OOM killer."

      # High swap usage (based on node_memory_SwapTotal_bytes)
      - alert: HighSwapUsage
        expr: node:memory:swap_usage_percent > 60
        for: 5m
        labels:
          severity: "warning"
          team: "infrastructure"
          layer: "system"
          priority: "P2"
        annotations:
          summary: "High swap usage on {{ $labels.instance }}"
          description: "Swap usage on {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%; the node may be running short of memory."

      # Disk space alert (based on node_filesystem_* metrics)
      - alert: HighDiskUsage
        expr: node:partition:usage_percent > 75
        for: 5m
        labels:
          severity: "warning"
          team: "infrastructure"
          layer: "system"
          priority: "P2"
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk usage of mountpoint {{ $labels.mountpoint }} on {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%. Remaining space: {{ with query (printf \"node_filesystem_avail_bytes{instance='%s',mountpoint='%s'}\" $labels.instance $labels.mountpoint) }}{{ . | first | value | humanize1024 }}B{{end}}."

      # Disk space critically low
      - alert: CriticalDiskUsage
        expr: node:partition:usage_percent > 85
        for: 2m
        labels:
          severity: "critical"
          team: "infrastructure"
          layer: "system"
          priority: "P1"
        annotations:
          summary: "Disk space critically low on {{ $labels.instance }}"
          description: "Disk usage of mountpoint {{ $labels.mountpoint }} on {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%; free up disk space immediately."

      # High inode usage (based on node_filesystem_files)
      - alert: HighInodeUsage
        expr: node:partition:inode_usage_percent > 90
        for: 5m
        labels:
          severity: "warning"
          team: "infrastructure"
          layer: "system"
          priority: "P2"
        annotations:
          summary: "High inode usage on {{ $labels.instance }}"
          description: "Inode usage of mountpoint {{ $labels.mountpoint }} on {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%; creating new files may start to fail."

      # High disk I/O wait (based on node_cpu_seconds_total)
      - alert: HighIOWait
        expr: node:cpu:iowait_percent > 20
        for: 5m
        labels:
          severity: "warning"
          team: "infrastructure"
          layer: "system"
          priority: "P2"
        annotations:
          summary: "High I/O wait on {{ $labels.instance }}"
          description: "I/O wait on {{ $labels.instance }} accounts for {{ printf \"%.2f\" $value }}% of CPU time; there may be a disk performance problem."

  - name: system_process_alerts
    rules:
      # High system load (based on node_load* metrics)
      - alert: HighSystemLoad1m
        expr: node:load:per_core_1m > 2
        for: 5m
        labels:
          severity: "warning"
          team: "infrastructure"
          layer: "system"
          priority: "P2"
        annotations:
          summary: "High 1-minute load on {{ $labels.instance }}"
          description: "1-minute load per CPU core on {{ $labels.instance }} is {{ printf \"%.2f\" $value }}; the load exceeds 2x the number of CPU cores."

      # Sustained high system load
      - alert: HighSystemLoad15m
        expr: node:load:per_core_15m > 2
        for: 5m
        labels:
          severity: "critical"
          team: "infrastructure"
          layer: "system"
          priority: "P1"
        annotations:
          summary: "High 15-minute load on {{ $labels.instance }}"
          description: "15-minute load per CPU core on {{ $labels.instance }} is {{ printf \"%.2f\" $value }}; the load exceeds 2x the number of CPU cores."

      # High file descriptor usage (based on node_filefd_* metrics)
      - alert: HighFileDescriptorUsage
        expr: node:filefd:usage_percent > 75
        for: 5m
        labels:
          severity: "warning"
          team: "infrastructure"
          layer: "system"
          priority: "P2"
        annotations:
          summary: "High file descriptor usage on {{ $labels.instance }}"
          description: "File descriptor usage on {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%. Currently allocated: {{ with query (printf \"node_filefd_allocated{instance='%s'}\" $labels.instance) }}{{ . | first | value | humanize }}{{end}}."

      # Too many processes (based on node_processes_* metrics)
      - alert: TooManyProcesses
        expr: node:processes:running > 300
        for: 10m
        labels:
          severity: "warning"
          team: "infrastructure"
          layer: "system"
          priority: "P2"
        annotations:
          summary: "Too many running processes on {{ $labels.instance }}"
          description: "{{ $labels.instance }} currently has {{ $value }} running processes; there may be a process leak or other anomaly."

      # Excessive context switching (based on node_context_switches_total)
      - alert: HighContextSwitches
        expr: node:context_switches:rate > 10000
        for: 5m
        labels:
          severity: "warning"
          team: "infrastructure"
          layer: "system"
          priority: "P2"
        annotations:
          summary: "Excessive context switching on {{ $labels.instance }}"
          description: "Context switch rate on {{ $labels.instance }} is {{ printf \"%.0f\" $value }}/s; this may indicate a system performance problem."

Application layer
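
No application-layer rules are defined in this walkthrough (the application-alerts.yml entry stays commented out in prometheus.yml). For reference, a minimal sketch of what such a rule could look like; the http_requests_total metric and its status/service labels are assumptions about what the application would expose:

groups:
  - name: application_alerts
    rules:
      # Example only: HTTP 5xx error ratio above 5% for 5 minutes
      - alert: HighHTTPErrorRate
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (service) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: "warning"
          team: "application"
          layer: "application"
          priority: "P2"
        annotations:
          summary: "High HTTP error rate for service {{ $labels.service }}"
          description: "More than 5% of requests for {{ $labels.service }} returned 5xx over the last 5 minutes."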

Configuration Validation

Validate the rule files

# Validate the recording rule file
${PROMETHEUS_HOME}/bin/promtool check rules ${PROMETHEUS_CONF}/rules/records/node-metrics.yml

# Validate the alerting rule file
sudo -u ${PROMETHEUS_USER} \
    ${PROMETHEUS_HOME}/bin/promtool check rules ${PROMETHEUS_CONF}/rules/alerts/node-alerts.yml
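
Besides the syntax check, promtool can unit-test alerting rules against synthetic series, which catches broken expressions and label mistakes before they reach production. A minimal sketch for the InstanceDown rule; the test file name and the sample instance are illustrative:

# Example: ${PROMETHEUS_CONF}/rules/alerts/node-alerts-test.yml (path relative to this file)
rule_files:
  - node-alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Target stays down (up == 0) for six 1-minute samples
      - series: 'up{job="node-exporter", instance="192.168.111.199:9100"}'
        values: '0x5'
    alert_rule_test:
      - eval_time: 2m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              job: node-exporter
              instance: 192.168.111.199:9100
              severity: critical
              team: "infrastructure, management"
              layer: system
              priority: P0

Run the test with:

sudo -u ${PROMETHEUS_USER} \
    ${PROMETHEUS_HOME}/bin/promtool test rules ${PROMETHEUS_CONF}/rules/alerts/node-alerts-test.yml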

Validate the Prometheus configuration

sudo -u ${PROMETHEUS_USER} \
    ${PROMETHEUS_HOME}/bin/promtool check config ${PROMETHEUS_CONF}/prometheus.yml

Reload the configuration

# Note: the /-/reload endpoint only works when Prometheus is started with --web.enable-lifecycle;
# otherwise send SIGHUP to the Prometheus process instead.
curl -X POST http://localhost:9090/-/reload

Alertmanager Configuration

Preparation

Create the directory structure

sudo -u ${ALERTMANAGER_USER} mkdir -p ${ALERTMANAGER_CONF}/templates

Main configuration

Configuration reference: Configuration | Prometheus

$ sudo -u ${ALERTMANAGER_USER} vim ${ALERTMANAGER_CONF}/alertmanager.yml

#### Global configuration
global:
  # Email settings
  smtp_smarthost: 'smtp.163.com:465'
  smtp_from: 'sre_alerts@163.com'
  smtp_auth_username: 'sre_alerts@163.com'
  smtp_auth_password: '**************'
  smtp_require_tls: false

  # WeChat Work (企业微信) settings
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
  wechat_api_secret: '*******************'
  wechat_api_corp_id: '******************'
  


#### Template configuration
templates:
  # Glob patterns are allowed
  - '/etc/alertmanager/templates/email.tmpl'
  - '/etc/alertmanager/templates/wechat.tmpl'
  
  
#### Alert routing
route:
  group_by: ['env', 'alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email'
  
  routes:
    - receiver: 'email'
      matchers: 
        - severity =~ "warning|info"
      continue: true
    
    - receiver: 'wechat'
      matchers: 
        - severity =~ "warning|info"
      continue: true
        
    - receiver: 'wechat'
      matchers:
        - severity = "critical"
      group_wait: 10s
      repeat_interval: 1h # re-notify every hour



#### Receivers
receivers:
  # Email
  - name: 'email'
    email_configs:
      - to: 'ezra-sullivan@outlook.com'
        html: '{{ template "email.alert.html" . }}'
        headers:
          Subject: '{{ template "email.alert.subject" . }}'
        # Also send a notification when the alert resolves
        # Note: the resolved notification reuses the value from the last firing evaluation
        send_resolved: true

  # WeChat Work
  - name: "wechat"
    wechat_configs:
      - agent_id: "1000004"
        to_user: '@all'
        message: '{{ template "wechat.alert.message.markdown" . }}'
        # Note: a markdown template requires message_type to be set explicitly; the default is text
        message_type: 'markdown'
        send_resolved: true

  # Webhook (not used in this example)
  
  
# Inhibition rules
inhibit_rules:
  # When an instance is down, suppress its other alerts
  - source_matchers:
      - alertname = InstanceDown
    target_matchers:
      - alertname =~ ".*"
    equal: ["instance"]

  # Critical alerts suppress warnings for the same alert name and instance
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: ["alertname", "instance"]

Alert Template Configuration

Template reference: Notification template reference | Prometheus

Template examples: Notification template examples | Prometheus

Email template

$ sudo -u ${ALERTMANAGER_USER} vim ${ALERTMANAGER_CONF}/templates/email.tmpl

{{ define "email.alert.subject" }}
{{- $count := len .Alerts -}}
{{- if eq $count 1 -}}
  {{- range .Alerts -}}
    Alert - {{ .Labels.alertname }}
  {{- end -}}
{{- else -}}
  Alert - {{ .GroupLabels.alertname }} - {{ $count }} alerts in group
{{- end -}}
{{ end }}

{{ define "email.alert.html" }}
{{- $hasFiring := false -}}
{{- $severity := "" -}}

{{- range .Alerts -}}
  {{- if eq .Status "firing" }}
    {{- $hasFiring = true -}}
    {{- if eq .Labels.severity "critical" }}{{ $severity = "critical" }}
    {{- else if and (eq .Labels.severity "warning") (ne $severity "critical") }}{{ $severity = "warning" }}
    {{- else if and (eq .Labels.severity "info") (ne $severity "critical") (ne $severity "warning") }}{{ $severity = "info" }}
    {{- end }}
  {{- end }}
{{- end }}

<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8">
  <title>Monitoring Alert Notification</title>
  <style>
    body {
      font-family: Arial, sans-serif;
      margin: 20px;
      {{- if not $hasFiring }}
      background-color: #f4fff4;
      {{- else if eq $severity "critical" }}
      background-color: #fff5f5;
      {{- else if eq $severity "warning" }}
      background-color: #fffbf0;
      {{- else }}
      background-color: #f8f9fa;
      {{- end }}
    }
    .header {
      text-align: center;
      padding: 15px;
      border-radius: 5px;
      color: white;
      {{- if not $hasFiring }}
      background-color: #5cb85c;
      {{- else if eq $severity "critical" }}
      background-color: #d9534f;
      {{- else if eq $severity "warning" }}
      background-color: #f0ad4e;
      {{- else }}
      background-color: #5bc0de;
      {{- end }}
    }
    .alert-item {
      background-color: #fff;
      margin: 20px 0;
      padding: 20px;
      border-radius: 5px;
      border-left: 5px solid;
      border-color: #ccc;
    }
    .firing { border-left-color: #d9534f; }
    .resolved { border-left-color: #5cb85c; }
    .label { font-weight: bold; color: #333; }
    .value { color: #555; }
    .time { color: #999; font-size: 0.9em; }
    .footer {
      background-color: #f8f8f8;
      padding: 10px;
      border-radius: 5px;
      margin-top: 20px;
    }
  </style>
</head>
<body>

  <div class="header">
    <h2>
      {{- if not $hasFiring }}
      &#x2705; Alerts Resolved
      {{- else if eq $severity "critical" }}
      &#x1F6A8; Critical Alert
      {{- else if eq $severity "warning" }}
      &#x26A0;&#xFE0F; Warning Alert
      {{- else }}
      &#x2139;&#xFE0F; Informational Alert
      {{- end }}
    </h2>
    <p>Alert group: {{ .GroupLabels.alertname }}</p>
  </div>

  {{ range .Alerts }}
  <div class="alert-item {{ .Status }}">
    <h3>{{ .Labels.alertname }}</h3>
    {{- if .Labels.env }}
    <p><span class="label">Environment:</span><span class="value">{{ .Labels.env }}</span></p>
    {{- end }}
    <p><span class="label">Severity:</span><span class="value">{{ .Labels.severity }}</span></p>
    <p><span class="label">Instance:</span><span class="value">{{ .Labels.instance }}</span></p>
    <p><span class="label">Status:</span><span class="value">{{ if eq .Status "firing" }}&#x1F525; Firing{{ else }}&#x2705; Resolved{{ end }}</span></p>
    <p><span class="label">Summary:</span><span class="value">{{ .Annotations.summary }}</span></p>
    <p><span class="label">Description:</span><span class="value">{{ .Annotations.description }}</span></p>
    <p><span class="label">Started at:</span><span class="time">{{ .StartsAt.Local.Format "2006-01-02 15:04:05" }}</span></p>
    {{- if eq .Status "resolved" }}
    <p><span class="label">Ended at:</span><span class="time">{{ .EndsAt.Local.Format "2006-01-02 15:04:05" }}</span></p>
    {{- end }}
  </div>
  {{ end }}

  {{- if not $hasFiring }}
  <div class="footer">
    <p><strong>&#x1F4D7; All alerts in this group have been resolved; no further action is needed.</strong></p>
  </div>
  {{- else if eq $severity "critical" }}
  <div class="footer">
    <p><strong>&#x1F6A8; Please handle the critical alert immediately to avoid production impact.</strong></p>
  </div>
  {{- end }}

</body>
</html>
{{ end }}

WeChat Work template

$ sudo -u ${ALERTMANAGER_USER} vim ${ALERTMANAGER_CONF}/templates/wechat.tmpl

{{ define "wechat.alert.message" }}
{{- $hasFiring := false -}}
{{- $severity := "" -}}

{{- range .Alerts -}}
  {{- if eq .Status "firing" -}}
    {{- $hasFiring = true -}}
    {{- if eq .Labels.severity "critical" }}{{ $severity = "critical" }}
    {{- else if and (eq .Labels.severity "warning") (ne $severity "critical") }}{{ $severity = "warning" }}
    {{- else if and (eq .Labels.severity "info") (ne $severity "critical") (ne $severity "warning") }}{{ $severity = "info" }}
    {{- end -}}
  {{- end -}}
{{- end -}}

{{- if not $hasFiring -}}
✅ Alerts resolved --- {{ .GroupLabels.alertname }}
{{- else if eq $severity "critical" -}}
❗ Critical alert --- {{ .GroupLabels.alertname }}
{{- else if eq $severity "warning" -}}
⚠️ Warning alert --- {{ .GroupLabels.alertname }}
{{- else -}}
ℹ️ Informational alert --- {{ .GroupLabels.alertname }}
{{- end }}
{{ range .Alerts }}
=======================
{{ with .Labels.env }}
Environment: {{ . }}
{{ end }}
Alert rule: {{ .Labels.alertname }}
Severity: {{ .Labels.severity }}
Instance: {{ .Labels.instance }}
Status: {{ if eq .Status "firing" }}Firing ❗{{ else }}Resolved ✅{{ end }}
Summary: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Started at: ⏰ {{ .StartsAt.Local.Format "2006-01-02 15:04:05" }}
{{- if eq .Status "resolved" }}
Resolved at: ✅ {{ .EndsAt.Local.Format "2006-01-02 15:04:05" }}
{{- end }}
=======================
{{ end }}

{{- if not $hasFiring }}
✅ All alerts in this group have been resolved
{{- else if eq $severity "critical" }}
❗ Please handle the alert immediately
{{- end }}
{{ end }}



{{ define "wechat.alert.message.markdown" }}
{{- $hasFiring := false -}}
{{- $severity := "" -}}

{{- range .Alerts -}}
  {{- if eq .Status "firing" -}}
    {{- $hasFiring = true -}}
    {{- if eq .Labels.severity "critical" }}{{ $severity = "critical" }}
    {{- else if and (eq .Labels.severity "warning") (ne $severity "critical") }}{{ $severity = "warning" }}
    {{- else if and (eq .Labels.severity "info") (ne $severity "critical") (ne $severity "warning") }}{{ $severity = "info" }}
    {{- end -}}
  {{- end -}}
{{- end -}}

{{- if not $hasFiring -}}
## Alerts Resolved - ✅
{{- else if eq $severity "critical" -}}
## Critical Alert - ❗
{{- else if eq $severity "warning" -}}
## Warning Alert - ⚠️
{{- else -}}
## Informational Alert - ℹ️
{{- end }}

{{- with .GroupLabels.alertname }}
**Alert group**: {{ . }}
{{- end }}

{{ range .Alerts }}
---
{{ with .Labels.env }}
**Environment**: {{ . }}
{{ end }}
**Alert rule**: {{ .Labels.alertname }}  
**Severity**: {{ .Labels.severity }}  
**Instance**: {{ .Labels.instance }}  
**Status**: {{ if eq .Status "firing" }}Firing ❗{{ else }}Resolved ✅{{ end }}  
**Summary**: {{ .Annotations.summary }}  
**Description**: {{ .Annotations.description }}  
**Started at**: ⏰ {{ .StartsAt.Local.Format "2006-01-02 15:04:05" }}  
{{- if eq .Status "resolved" }}
**Resolved at**: ✅ {{ .EndsAt.Local.Format "2006-01-02 15:04:05" }}  
{{- end }}
{{ end }}

{{- if not $hasFiring }}
---
✅ **All alerts in this group have been resolved**
{{- else if eq $severity "critical" }}
---
❗ **Please handle the alert immediately**
{{- end }}
{{ end }}


Configuration Validation

Validate the main configuration

sudo -u ${ALERTMANAGER_USER} \
    ${ALERTMANAGER_HOME}/bin/amtool check-config ${ALERTMANAGER_CONF}/alertmanager.yml
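
Besides the syntax check, amtool can print the routing tree and show which receivers a given label set would reach, which is a cheap way to verify the routes before reloading. A sketch (the label value is illustrative):

# Show the routing tree
sudo -u ${ALERTMANAGER_USER} \
    ${ALERTMANAGER_HOME}/bin/amtool config routes show --config.file=${ALERTMANAGER_CONF}/alertmanager.yml

# Which receiver(s) would a critical alert be delivered to?
sudo -u ${ALERTMANAGER_USER} \
    ${ALERTMANAGER_HOME}/bin/amtool config routes test --config.file=${ALERTMANAGER_CONF}/alertmanager.yml severity=critical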

Reload the configuration

curl -X POST http://localhost:9093/-/reload

Testing

Manual trigger

Send a test alert with amtool:

sudo -u ${ALERTMANAGER_USER} \
  ${ALERTMANAGER_HOME}/bin/amtool alert add \
  --alertmanager.url=http://localhost:9093 \
  env="prod" \
  alertname="TestAlert" \
  severity="warning" \
  instance="test-instance" \
  --annotation='summary="summary of the alert"' \
  --annotation='description="description of the alert"'
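
Before checking the mailbox or WeChat, you can confirm that Alertmanager actually accepted the test alert:

# List matching alerts currently known to Alertmanager
sudo -u ${ALERTMANAGER_USER} \
  ${ALERTMANAGER_HOME}/bin/amtool alert query \
  --alertmanager.url=http://localhost:9093 \
  alertname="TestAlert"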
  

Simulated trigger

Simulate high CPU usage

# Run 2 CPU workers at 80% load for 720 seconds
$ docker run -it --rm --name 'stress-ng' \
    --cpus="2" \
    registry.cn-hangzhou.aliyuncs.com/kmust/stress-ng-alpine:0.14.00-r0 \
     --cpu 2 --cpu-load 80  -t 720



# Check CPU usage
$ top -p $(docker top stress-ng | awk 'NR>1 {print $2}' | paste -sd, -)

# Sample output:

top - 23:39:38 up 1 day,  1:44,  5 users,  load average: 1.69, 1.26, 0.69
Tasks:   3 total,   1 running,   2 sleeping,   0 stopped,   0 zombie
%Cpu(s): 70.5 us,  1.0 sy,  0.0 ni, 27.6 id,  0.0 wa,  1.0 hi,  0.0 si,  0.0 st
MiB Mem :   1739.0 total,    434.0 free,    577.3 used,    904.7 buff/cache
MiB Swap:   2060.0 total,   2060.0 free,      0.0 used.   1161.7 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                     
  17303 root      20   0   33448   5580   1792 R  79.4   0.3   3:07.00 stress-ng                                                                   
  17304 root      20   0   33448   5580   1792 S  79.4   0.3   3:07.42 stress-ng                                                                   
  17274 root      20   0   33448   4224   3968 S   0.0   0.2   0:00.04 stress-ng 

Simulate high memory usage

# Allocate 1.5 GiB of memory for 720 seconds
$ docker run -it --rm --name 'stress-ng' \
    --memory="2G" \
    registry.cn-hangzhou.aliyuncs.com/kmust/stress-ng-alpine:0.14.00-r0 \
    --vm 1 --vm-bytes 1.5G --vm-keep --verify --timeout 720s
    
    
# Check memory usage
$ free -h
               total        used        free      shared  buff/cache   available
Mem:           1.7Gi       1.6Gi       133Mi       0.0Ki       105Mi       114Mi

Other Configuration

Silencing alerts

Set silences through the Alertmanager web UI or with the command-line tool:

# Silence alerts for a specific instance
${ALERTMANAGER_HOME}/bin/amtool silence add \
  --alertmanager.url=http://localhost:9093 \
  --author="admin" \
  --comment="Scheduled maintenance" \
  --duration="2h" \
  instance="192.168.111.199:9100"

# List active silences
${ALERTMANAGER_HOME}/bin/amtool silence query \
  --alertmanager.url=http://localhost:9093
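
If maintenance finishes early, a silence can be expired ahead of schedule; the silence ID comes from the output of silence add or silence query:

# Expire a silence before its scheduled end (replace <silence-id> with the real ID)
${ALERTMANAGER_HOME}/bin/amtool silence expire <silence-id> \
  --alertmanager.url=http://localhost:9093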

Time-based alert routing

route:
  routes:
    # Alerts during business hours
    - matchers:
        - severity = "warning"
      active_time_intervals:
        - business_hours
      receiver: 'business-hours-alerts'

    # Urgent alerts outside business hours
    - matchers:
        - severity = "critical"
      active_time_intervals:
        - non_business_hours
      receiver: 'emergency-alerts'

time_intervals:
  - name: business_hours
    time_intervals:
      - times:
          - start_time: '09:00'
            end_time: '18:00'
        weekdays: ['monday:friday']

  - name: non_business_hours
    time_intervals:
      # A time range must not wrap past midnight, so the overnight window is split in two
      - times:
          - start_time: '18:00'
            end_time: '24:00'
          - start_time: '00:00'
            end_time: '09:00'
        weekdays: ['monday:friday']
      - weekdays: ['saturday', 'sunday']
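
Note that Alertmanager evaluates time intervals in UTC unless told otherwise; recent releases accept an optional location field per interval. A sketch, assuming the running Alertmanager version supports it:

time_intervals:
  - name: business_hours
    time_intervals:
      - times:
          - start_time: '09:00'
            end_time: '18:00'
        weekdays: ['monday:friday']
        # Assumption: requires an Alertmanager release with time-zone support for time intervals
        location: 'Asia/Shanghai'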

Appendix

Alert rule design principles

  • Layered design: distinguish alerts by severity level
  • Avoid alert storms: set sensible for durations and inhibition rules
  • Label conventions: standardize label naming to simplify grouping and routing

Choosing notification channels

  • Critical: notify over multiple channels (WeChat + email + phone)
  • Warning: basic notification (WeChat or email)
  • Info: record to the logging system only

Template best practices

  • Include the key information: alert name, severity, instance, time, description
  • Keep the format clear: use tables or another structured layout
  • Include action links: Grafana dashboards, runbook links (see the sketch below)
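
As an example of the last point, a runbook link can travel with the alert as an extra annotation and be rendered by the notification template. A sketch; the runbook_url annotation name and the wiki URL are illustrative:

# In the alerting rule (e.g. rules/alerts/node-alerts.yml)
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          runbook_url: "https://wiki.example.com/runbooks/HighCPUUsage"

# In a notification template (e.g. wechat.tmpl)
{{ with .Annotations.runbook_url }}**Runbook**: {{ . }}{{ end }}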

Operational recommendations

  • Test the alerting configuration regularly
  • Establish an SOP for handling alerts
  • Monitor the health of the alerting system itself
  • Review and tune alert rules periodically
