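DingTalk AI Customer Service: Monitoring and Alerting System
Monitoring is the eyes of operations. This post walks through the monitoring and alerting setup for the DingTalk AI customer service, from metrics collection to alert notification.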
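1. Monitoring Layers
| Layer | Metrics |
|---|---|
| Infrastructure | CPU, memory, disk |
| Application | Response time, error rate |
| Business | Conversation volume, resolution rate |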
2. Prometheus Configuration
2.1 Metrics Collection
# prometheus.yml
scrape_configs:
  - job_name: 'ai-chat'
    # metrics_path defaults to /metrics
    static_configs:
      - targets: ['localhost:3000']
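2.2 Application Metrics
// Assumes an Express app; the original snippet used app without defining it
const express = require('express');
const client = require('prom-client');
const app = express();
// Conversation counter, labeled by outcome (for example "success" or "error")
const chatCounter = new client.Counter({
  name: 'chat_total',
  help: 'Total chats',
  labelNames: ['status']
});
// Response time in seconds
const responseTime = new client.Histogram({
  name: 'response_time_seconds',
  help: 'Response time',
  buckets: [0.1, 0.5, 1, 2, 5]
});
// Expose metrics; register.metrics() returns a Promise in current prom-client versions
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
// Port matches the scrape target in prometheus.yml
app.listen(3000);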
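
As a minimal sketch of how the two metrics above might be updated around each request, the helper below wraps an async handler. The name `instrument` and the status values `success` and `error` are assumptions and do not come from the original project. The helper relies only on `inc(labels)` and `observe(seconds)`, which prom-client's Counter and Histogram both provide.

```javascript
// Hypothetical helper: wraps an async chat handler so that every call
// updates a counter (by status) and a histogram (duration in seconds).
function instrument(handler, counter, histogram) {
  return async function instrumented(...args) {
    const start = process.hrtime.bigint();
    let status = 'success';
    try {
      return await handler(...args);
    } catch (err) {
      status = 'error';
      throw err; // the caller still sees the original failure
    } finally {
      // Runs on both paths, so every call is counted and timed
      const seconds = Number(process.hrtime.bigint() - start) / 1e9;
      counter.inc({ status });
      histogram.observe(seconds);
    }
  };
}
```

With the definitions from this section, usage could look like `const handleChat = instrument(rawHandleChat, chatCounter, responseTime);`, where `rawHandleChat` stands in for the application's own handler.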
3. Grafana Dashboards
3.1 Key Panels
- Request volume trend
- Response time distribution
- Error rate over time
- Resource utilization
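3.2 Alert Rules
# Prometheus alerting rules, loaded through rule_files in prometheus.yml
groups:
  - name: ai-chat
    rules:
      - alert: HighErrorRate
        # Share of chats that ended in error over the last 5 minutes
        expr: sum(rate(chat_total{status="error"}[5m])) / sum(rate(chat_total[5m])) > 0.05
        for: 2m
        annotations:
          summary: "Error rate too high"
      - alert: SlowResponse
        # histogram_quantile needs per-bucket rates aggregated by the le label
        expr: histogram_quantile(0.99, sum by (le) (rate(response_time_seconds_bucket[5m]))) > 2
        for: 5m
        annotations:
          summary: "P99 response time too high"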
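
To make the P99 rule less opaque, here is a rough sketch of the linear interpolation that `histogram_quantile` performs over cumulative buckets. The function name `histogramQuantile` and the sample counts are illustrative assumptions. Prometheus applies the same idea to per-second bucket rates rather than raw counts.

```javascript
// Sketch of the interpolation behind histogram_quantile.
// `buckets` holds cumulative counts sorted by upper bound `le`,
// with a final +Infinity bucket that contains every observation.
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);
  // Quantile lands in the +Inf bucket: report the highest finite bound
  if (buckets[i].le === Infinity) return buckets[buckets.length - 2].le;
  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const below = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - below;
  // Assume observations are spread evenly inside the bucket
  return lower + (buckets[i].le - lower) * ((rank - below) / inBucket);
}

// Illustrative data only: 100 requests spread over the buckets from section 2.2
const sample = [
  { le: 0.1, count: 50 }, { le: 0.5, count: 80 }, { le: 1, count: 90 },
  { le: 2, count: 97 }, { le: 5, count: 99 }, { le: Infinity, count: 100 },
];
console.log(histogramQuantile(0.99, sample));
```

Because the result is interpolated between bucket bounds, its precision depends on the bucket layout. With the largest finite bucket at 5 seconds, the rule can never report a P99 above 5.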
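4. Alert Notifications
4.1 DingTalk Notifications
// The DingTalk robot webhook address; reading it from the environment is an assumption
const WEBHOOK_URL = process.env.WEBHOOK_URL;
// Uses the global fetch available in Node.js 18 and later
async function sendAlert(message) {
  const res = await fetch(WEBHOOK_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      msgtype: 'text',
      text: { content: `[Alert] ${message}` }
    })
  });
  // Surface HTTP failures instead of dropping the alert silently
  if (!res.ok) {
    throw new Error(`DingTalk webhook returned HTTP ${res.status}`);
  }
}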
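
If the DingTalk robot is configured with the signature security option, the webhook URL needs a `timestamp` and a `sign` parameter. The sketch below assumes that mode. The function name `signWebhookUrl` and the `secret` parameter are hypothetical, and the scheme shown (HMAC-SHA256 over the timestamp, a newline, and the secret, then Base64 and URL encoding) should be checked against the current DingTalk robot documentation before use.

```javascript
const crypto = require('node:crypto');

// Sketch of DingTalk's signature mode for custom robots:
// sign = urlEncode(base64(HMAC-SHA256(key = secret, data = timestamp + "\n" + secret)))
function signWebhookUrl(webhookUrl, secret, timestamp = Date.now()) {
  const stringToSign = `${timestamp}\n${secret}`;
  const sign = crypto
    .createHmac('sha256', secret)
    .update(stringToSign)
    .digest('base64');
  // Webhook URLs normally already carry ?access_token=..., so append with &
  const sep = webhookUrl.includes('?') ? '&' : '?';
  return `${webhookUrl}${sep}timestamp=${timestamp}&sign=${encodeURIComponent(sign)}`;
}
```

The timestamp is in milliseconds and the signed URL should be computed per request rather than cached, since it embeds the time of signing.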
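4.2 Alert Escalation
const ALERT_LEVELS = {
  P1: { threshold: 0.1, notify: ['phone', 'dingtalk'] },
  P2: { threshold: 0.05, notify: ['dingtalk'] },
  P3: { threshold: 0.01, notify: ['email'] }
};
// notify(channel, message) is the application's own dispatcher and is not shown here
async function handleAlert(level, message) {
  const config = ALERT_LEVELS[level];
  // Fail loudly on an unknown level instead of crashing on config.notify
  if (!config) {
    throw new Error(`Unknown alert level: ${level}`);
  }
  for (const channel of config.notify) {
    await notify(channel, message);
  }
}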
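
The `threshold` values in `ALERT_LEVELS` are not used by `handleAlert` itself. One plausible reading, stated here as an assumption, is that they are error-rate ratios used to choose the level. The hypothetical helper `levelForErrorRate` below sketches that mapping.

```javascript
// Same thresholds as ALERT_LEVELS above, repeated so the sketch runs on its own
const LEVEL_THRESHOLDS = { P1: 0.1, P2: 0.05, P3: 0.01 };

// Hypothetical helper: returns the most severe level whose threshold the
// error rate has reached, or null when no level applies.
function levelForErrorRate(errorRate, thresholds = LEVEL_THRESHOLDS) {
  // Check the highest threshold first so the most severe match wins
  const ordered = Object.entries(thresholds).sort((a, b) => b[1] - a[1]);
  for (const [level, threshold] of ordered) {
    if (errorRate >= threshold) return level;
  }
  return null;
}
```

A caller could then write `const level = levelForErrorRate(rate); if (level) await handleAlert(level, message);`.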
5. Log Monitoring
5.1 ELK Configuration
# filebeat.yml
filebeat.inputs:
  - type: log
    paths:
      - /var/log/ai-chat/*.log
    # Decode each line as JSON and place its keys at the top level of the event
    json.keys_under_root: true
output.elasticsearch:
  hosts: ["localhost:9200"]
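
The `json.keys_under_root` setting only helps if the application writes one JSON object per line. As a minimal sketch, the hypothetical helper `formatLogLine` below builds such a line. The field names `timestamp`, `level`, and `message` are assumptions, not a schema from the original project.

```javascript
// Hypothetical helper: builds one JSON object per log line, which is the
// shape Filebeat's JSON decoding expects.
function formatLogLine(level, message, fields = {}, now = new Date()) {
  return JSON.stringify({
    ...fields, // extra context first, so the core keys below always win
    timestamp: now.toISOString(),
    level,
    message,
  });
}

console.log(formatLogLine('error', 'LLM request timed out', { sessionId: 'demo-1' }));
```

JSON.stringify escapes embedded newlines, so a multi-line error message still occupies a single line in the log file.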
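5.2 Log Alerts
# Error log alert; assumes a log_messages counter labeled by level is exported to Prometheus
- alert: ErrorLogs
  expr: increase(log_messages{level="error"}[5m]) > 10
  annotations:
    summary: "Too many error logs"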
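6. Monitoring Best Practices
- Cover the key metrics at every layer: infrastructure, application, and business
- Handle alerts by severity level
- Avoid alert storms
- Run drills regularly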
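
One common way to avoid alert storms is to suppress repeats of the same alert within a time window. The sketch below is only one possible approach. The name `createAlertThrottle` and the keying scheme are assumptions, and the clock is injectable so the behavior is easy to test.

```javascript
// Hypothetical helper: lets an alert through only if the same key has not
// been sent within the last `windowMs` milliseconds.
function createAlertThrottle(windowMs, now = () => Date.now()) {
  const lastSent = new Map(); // alert key -> time of the last allowed send
  return function shouldSend(key) {
    const t = now();
    const prev = lastSent.get(key);
    if (prev !== undefined && t - prev < windowMs) return false;
    lastSent.set(key, t);
    return true;
  };
}
```

For example, with `const shouldSend = createAlertThrottle(5 * 60 * 1000);`, a caller could guard each notification with `if (shouldSend('HighErrorRate')) { ... }` so that a flapping rule produces at most one message every five minutes.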
Project: GitHub - dingtalk-connector-pro. Questions are welcome via Issues or in the comments.