DingTalk AI Customer Service: Monitoring and Alerting System


Monitoring is the eyes of an operations team.


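1. Monitoring layers

Layer            Metrics
Infrastructure   CPU, memory, disk
Application      Response time, error rate
Business         Conversation volume, resolution rate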

2. Prometheus configuration

2.1 Metrics collection

# prometheus.yml
scrape_configs:
  - job_name: 'ai-chat'
    static_configs:
      - targets: ['localhost:3000']

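2.2 Application metrics

// Express is assumed here; the original uses `app` without defining it.
const express = require('express');
const client = require('prom-client');

const app = express();

// Chat counter, labeled by outcome (for example status="error")
const chatCounter = new client.Counter({
  name: 'chat_total',
  help: 'Total chats',
  labelNames: ['status']
});

// Response time in seconds
const responseTime = new client.Histogram({
  name: 'response_time_seconds',
  help: 'Response time',
  buckets: [0.1, 0.5, 1, 2, 5]
});

// Expose metrics for Prometheus to scrape.
// register.metrics() returns a Promise in prom-client v13+, so await it.
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

app.listen(3000); // matches the scrape target localhost:3000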
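
To make the `buckets` option more concrete: a Prometheus histogram keeps one cumulative count per upper bound (`le`), plus an implicit `+Inf` bucket. The following is a minimal, dependency-free sketch of that bookkeeping. `bucketCounts` is a hypothetical helper written for illustration, not part of prom-client, and the sample durations are made up.

```javascript
// Illustration only: mimics how a Prometheus histogram counts observations.
// Each bucket is cumulative: it counts every observation <= its upper bound.
function bucketCounts(buckets, observations) {
  const result = buckets.map((le) => ({
    le: String(le),
    count: observations.filter((value) => value <= le).length
  }));
  // The implicit +Inf bucket always equals the total number of observations.
  result.push({ le: '+Inf', count: observations.length });
  return result;
}

// Five made-up response times in seconds, using the buckets from the article.
const counts = bucketCounts([0.1, 0.5, 1, 2, 5], [0.05, 0.3, 0.3, 1.5, 7]);
console.log(counts.map((b) => `${b.le}: ${b.count}`).join(', '));
// 0.1: 1, 0.5: 3, 1: 3, 2: 4, 5: 4, +Inf: 5
```

In this sample the 7-second response only appears in `+Inf`, so the histogram cannot tell how far above 5 seconds it was. It may be worth choosing the largest bucket so that it sits above the slowest response you care about.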

3. Grafana dashboards

3.1 Key panels

  • Request volume trend
  • Response time distribution
  • Error rate over time
  • Resource utilization

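3.2 Alerting rules

groups:
  - name: ai-chat
    rules:
      - alert: HighErrorRate
        # Share of chats that failed over the last 5 minutes
        expr: sum(rate(chat_total{status="error"}[5m])) / sum(rate(chat_total[5m])) > 0.05
        for: 2m
        annotations:
          summary: "Error rate too high"

      - alert: SlowResponse
        # histogram_quantile needs per-bucket rates, aggregated by le
        expr: histogram_quantile(0.99, sum by (le) (rate(response_time_seconds_bucket[5m]))) > 2
        for: 5m
        annotations:
          summary: "P99 response time too high"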
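
A threshold such as 0.05 is easiest to reason about as a fraction of all chats. As a rough sketch of that arithmetic, the hypothetical helper below takes two snapshots of the `chat_total` counter (one value per `status`) and returns the share of chats that failed in between. The `success` label value and the numbers are assumptions for illustration, and counter resets are ignored.

```javascript
// Illustration only: error ratio between two snapshots of per-status counters.
// Snapshots look like { success: 190, error: 12 }; the label values are assumed.
function errorRatio(previous, current) {
  let errors = 0;
  let total = 0;
  for (const status of Object.keys(current)) {
    // Counters only go up, so the increase is current minus previous.
    const increase = current[status] - (previous[status] || 0);
    total += increase;
    if (status === 'error') errors += increase;
  }
  // No traffic in the window means no meaningful ratio; report 0.
  return total === 0 ? 0 : errors / total;
}

const ratio = errorRatio({ success: 100, error: 2 }, { success: 190, error: 12 });
console.log(ratio > 0.05); // true: 10 of the 100 chats in the window failed
```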

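4. Alert notifications

4.1 DingTalk notifications

// Webhook URL of the DingTalk group robot; reading it from an environment
// variable is an assumption, the original leaves WEBHOOK_URL undefined.
const WEBHOOK_URL = process.env.DINGTALK_WEBHOOK_URL;

async function sendAlert(message) {
  const res = await fetch(WEBHOOK_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      msgtype: 'text',
      text: { content: `[Alert] ${message}` }
    })
  });
  // DingTalk reports failures in the response body (errcode !== 0)
  const result = await res.json();
  if (result.errcode !== 0) {
    throw new Error(`DingTalk webhook failed: ${result.errmsg}`);
  }
}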
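
If the DingTalk robot is configured with the signature security option, the webhook URL also needs `timestamp` and `sign` query parameters. The sketch below shows one way to build such a URL, assuming the commonly documented scheme (HMAC-SHA256 over `timestamp + "\n" + secret`, Base64-encoded). `signWebhookUrl`, the example URL, and the secret are all hypothetical; check the robot's own settings page before relying on this.

```javascript
const crypto = require('crypto');

// Sketch: append timestamp and sign parameters to a DingTalk robot webhook URL.
// Assumes the robot uses the signature security setting and `secret` is its key.
function signWebhookUrl(webhookUrl, secret, timestamp = Date.now()) {
  // String to sign: "<timestamp>\n<secret>", HMAC-SHA256 keyed with the secret.
  const stringToSign = `${timestamp}\n${secret}`;
  const sign = crypto
    .createHmac('sha256', secret)
    .update(stringToSign)
    .digest('base64');

  const url = new URL(webhookUrl);
  url.searchParams.set('timestamp', String(timestamp));
  url.searchParams.set('sign', sign); // searchParams handles URL encoding
  return url.toString();
}

// Placeholder URL and secret, for illustration only.
const signed = signWebhookUrl(
  'https://example.invalid/robot/send?access_token=TOKEN',
  'SECexample',
  1700000000000
);
console.log(signed);
```

The signed URL would then be passed to `fetch` in place of the plain webhook URL. Because the timestamp is part of the signature, the URL should be rebuilt for each request rather than cached.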

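4.2 Alert escalation

const ALERT_LEVELS = {
  P1: { threshold: 0.1, notify: ['phone', 'dingtalk'] },
  P2: { threshold: 0.05, notify: ['dingtalk'] },
  P3: { threshold: 0.01, notify: ['email'] }
};

// notify(channel, message) is assumed to be defined elsewhere
async function handleAlert(level, message) {
  const config = ALERT_LEVELS[level];
  if (!config) {
    throw new Error(`Unknown alert level: ${level}`);
  }
  for (const channel of config.notify) {
    await notify(channel, message);
  }
}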
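
The `threshold` values are not used by `handleAlert` itself, so something has to map a measured value to a level first. A minimal sketch of that step follows, assuming the thresholds are error-rate fractions (the article does not say what they measure). `pickLevel` and `exampleLevels` are hypothetical names.

```javascript
// Illustration only: choose the most severe level whose threshold is reached.
// `levels` has the same shape as ALERT_LEVELS: { P1: { threshold: 0.1 }, ... }.
function pickLevel(errorRate, levels) {
  const matching = Object.entries(levels)
    .filter(([, config]) => errorRate >= config.threshold)
    // Highest threshold first, so the most severe matching level wins.
    .sort(([, a], [, b]) => b.threshold - a.threshold);
  return matching.length > 0 ? matching[0][0] : null;
}

const exampleLevels = {
  P1: { threshold: 0.1 },
  P2: { threshold: 0.05 },
  P3: { threshold: 0.01 }
};

console.log(pickLevel(0.07, exampleLevels)); // "P2"
console.log(pickLevel(0.001, exampleLevels)); // null, below every threshold
```

The returned level could then be passed to `handleAlert`, skipping the call when the result is `null`.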

5. Log monitoring

5.1 ELK configuration

# filebeat.yml
filebeat.inputs:
  - type: log
    paths:
      - /var/log/ai-chat/*.log
    json.keys_under_root: true

output.elasticsearch:
  hosts: ["localhost:9200"]

5.2 Log alerts

# Alert on error logs
- alert: ErrorLogs
  expr: increase(log_messages{level="error"}[5m]) > 10
  annotations:
    summary: "Too many error logs"
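
The `log_messages` metric is not defined anywhere in this article, so it has to come from something that turns log lines into counts. One possible approach, sketched under the assumption that each log line is a JSON object with a `level` field, is to tally lines per level. The totals could then feed a counter labeled by `level`. `countByLevel` and the sample lines are hypothetical.

```javascript
// Illustration only: tally JSON log lines by their "level" field.
// Lines that are not valid JSON or lack a level are counted as "unknown".
function countByLevel(lines) {
  const counts = {};
  for (const line of lines) {
    let level = 'unknown';
    try {
      const entry = JSON.parse(line);
      if (entry && typeof entry.level === 'string') level = entry.level;
    } catch (err) {
      // Malformed line: keep level as "unknown".
    }
    counts[level] = (counts[level] || 0) + 1;
  }
  return counts;
}

const sample = [
  '{"level":"error","msg":"upstream timeout"}',
  '{"level":"info","msg":"chat started"}',
  'not json',
  '{"level":"error","msg":"model unavailable"}'
];
console.log(countByLevel(sample)); // { error: 2, info: 1, unknown: 1 }
```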

6. Monitoring best practices

  • Cover every key metric
  • Handle alerts by severity level
  • Avoid alert storms
  • Run drills regularly
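
One common way to avoid alert storms is to suppress repeats of the same alert for a cooldown period. The following is a minimal in-memory sketch of that idea, not the project's actual mechanism. `createAlertThrottle` is a hypothetical helper, and the 5-minute window is an arbitrary choice.

```javascript
// Illustration only: suppress repeats of the same alert within a cooldown window.
// `now` is injectable so the behavior can be checked without waiting.
function createAlertThrottle(cooldownMs, now = () => Date.now()) {
  const lastSent = new Map(); // alert key -> timestamp of the last send

  return function shouldSend(alertKey) {
    const current = now();
    const previous = lastSent.get(alertKey);
    if (previous !== undefined && current - previous < cooldownMs) {
      return false; // still cooling down, drop the duplicate
    }
    lastSent.set(alertKey, current);
    return true;
  };
}

// Usage with a fake clock: at most one alert per key every 5 minutes.
let fakeTime = 0;
const shouldSend = createAlertThrottle(5 * 60 * 1000, () => fakeTime);
console.log(shouldSend('HighErrorRate')); // true
fakeTime = 60 * 1000;
console.log(shouldSend('HighErrorRate')); // false, only 1 minute has passed
fakeTime = 6 * 60 * 1000;
console.log(shouldSend('HighErrorRate')); // true, cooldown has elapsed
```

A notification function could call `shouldSend(alertName)` first and skip the webhook when it returns `false`. State kept in a `Map` is lost on restart and not shared across processes, which may or may not matter for a given deployment.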

Project: GitHub - dingtalk-connector-pro. If you have questions, open an Issue or leave a comment.