[Automation Series] 004: A Log Analysis and Alerting Tool


Features

This is a log analysis and alerting tool for monitoring and analyzing system logs in real time and sending alert notifications. Its core capabilities are:

  1. Multi-format log support

    • Common log formats (Apache, Nginx, Syslog, JSON, etc.)
    • Custom log format parsing (see the parser sketch after this list)
    • Real-time log stream processing
    • Analysis of historical log files
  2. Smart pattern matching

    • Regular-expression matching
    • Keyword and phrase detection
    • Recognition of anomalous behavior patterns
    • Statistical analysis and trend detection
  3. Flexible alerting

    • Multiple alert channels (email, SMS, Slack, webhook, etc.)
    • Alert levels and priorities
    • Alert suppression and deduplication
    • Alert escalation
  4. Real-time monitoring dashboard

    • Live log stream visualization
    • Key-metric dashboards
    • Alert history
    • Performance charts
  5. Configuration management

    • YAML/JSON configuration files
    • Dynamic rule loading and updates
    • Multi-environment configuration
    • Rule templates and reuse
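
As a minimal sketch of what a custom parser configuration could look like, the snippet below parses one line of a hypothetical application log format with `re`, zipping the captured groups with the configured field names the same way the LogParser class later in this article does. The format string and field names here are illustrative, not part of the tool.

import re

# Hypothetical custom format: "2024-01-15 10:23:45 [ERROR] payment-service: ..."
custom_parser_config = {
    "pattern": r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(\w+)\] ([\w-]+): (.*)',
    "fields": ["timestamp", "level", "source", "message"],
}

line = "2024-01-15 10:23:45 [ERROR] payment-service: upstream timeout"
match = re.compile(custom_parser_config["pattern"]).match(line)
if match:
    # Zip captured groups with the configured field names, as LogParser does below
    parsed = dict(zip(custom_parser_config["fields"], match.groups()))
    print(parsed)
    # {'timestamp': '2024-01-15 10:23:45', 'level': 'ERROR',
    #  'source': 'payment-service', 'message': 'upstream timeout'}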

Use Cases

1. Systems and operations monitoring

  • Watch system logs for errors and exceptions
  • Detect security threats and intrusion attempts
  • Track application performance problems
  • Respond to system failures automatically

2. Security event detection

  • Detect malicious login attempts (see the example rule after this list)
  • Identify suspicious network activity
  • Monitor abnormal file access
  • Alert on security threats in real time
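
As a concrete illustration of the security use case, the rule below uses the rule schema defined later in this article to alert on repeated failed SSH logins ("Failed password" is the phrase sshd writes to auth logs). The IDs, thresholds, and channels are illustrative values, not a shipped default.

# Example rule in this tool's schema: alert when 5 failed SSH logins
# appear within a 2-minute window.
ssh_bruteforce_rule = {
    "id": "ssh_bruteforce",
    "name": "SSH brute-force detection",
    "description": "Repeated failed SSH logins within a short window",
    "enabled": True,
    "pattern": "Failed password",   # plain keyword match
    "level": "ERROR",
    "threshold": 5,                 # at least 5 matches ...
    "window": 120,                  # ... within 120 seconds
    "cooldown": 600,                # then stay quiet for 10 minutes
    "actions": [{"type": "console"}, {"type": "email"}],
}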

3. Business metric monitoring

  • Monitor user behavior patterns
  • Detect abnormal fluctuations in business metrics
  • Track key business indicators
  • Generate business reports automatically

4. Compliance auditing

  • Monitor compliance-related logs
  • Generate audit reports automatically
  • Detect policy violations
  • Help meet regulatory requirements

Error Handling

1. Log file access errors

try:
    with open(log_file, 'r') as f:
        process_log_lines(f)
except FileNotFoundError:
    logger.error(f"Log file not found: {log_file}")
    send_alert(f"Log file missing: {log_file}", level="CRITICAL")
except PermissionError:
    logger.error(f"No permission to read log file: {log_file}")
    send_alert(f"Access to log file denied: {log_file}", level="CRITICAL")
except IOError as e:
    logger.error(f"Failed to read log file: {str(e)}")
    handle_io_error(log_file, e)

2. Log parsing errors

try:
    parsed_log = parse_log_line(log_line)
    if not parsed_log:
        logger.warning(f"Unable to parse log line: {log_line}")
        increment_parse_error_count()
except LogParseError as e:
    logger.error(f"Log parse error: {str(e)}")
    handle_parse_error(log_line, e)
except Exception as e:
    logger.error(f"Unexpected parsing error: {str(e)}")
    handle_unexpected_parse_error(log_line, e)

3. Alert delivery errors

try:
    alert_sender.send(alert_message)
except AlertSendError as e:
    logger.error(f"Failed to send alert: {str(e)}")
    # Retry mechanism (a sketch of retry_send_alert follows below)
    retry_send_alert(alert_message, max_retries=3)
except NetworkError as e:
    logger.error(f"Network connection failed: {str(e)}")
    handle_network_failure(alert_message, e)
except Exception as e:
    logger.error(f"Unexpected error while sending alert: {str(e)}")

4. Configuration file errors

try:
    config = load_config(config_file)
    validate_config(config)
except yaml.YAMLError as e:
    logger.error(f"Invalid YAML in configuration file: {str(e)}")
    raise ConfigError(f"Invalid configuration format: {str(e)}")
except json.JSONDecodeError as e:
    logger.error(f"Invalid JSON in configuration file: {str(e)}")
    raise ConfigError(f"Invalid configuration format: {str(e)}")
except ValidationError as e:
    logger.error(f"Configuration validation failed: {str(e)}")
    raise ConfigError(f"Invalid configuration: {str(e)}")
except Exception as e:
    logger.error(f"Unexpected error while loading configuration: {str(e)}")

Code Implementation

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
日志分析和告警工具
功能:实时监控、分析日志并发送告警
作者:Cline
版本:1.0
"""

import argparse
import sys
import json
import yaml
import logging
import os
import time
import threading
import re
import sqlite3
from datetime import datetime, timedelta
from typing import Dict, List, Any, Optional, Callable
from collections import defaultdict, deque
import smtplib
import requests
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('log_analyzer.log'),
        logging.StreamHandler(sys.stdout)
    ]
)
logger = logging.getLogger(__name__)

class LogAnalyzerError(Exception):
    """Base error for the log analyzer."""
    pass

class ConfigError(Exception):
    """Configuration error."""
    pass

class AlertSendError(Exception):
    """Alert delivery error."""
    pass

class LogEntry:
    """A single log entry."""
    def __init__(self, raw_line: str, parsed_data: Dict = None):
        self.raw_line = raw_line
        self.parsed_data = parsed_data or {}
        # Read from self.parsed_data so a missing parsed_data (None) is handled safely
        self.timestamp = self.parsed_data.get('timestamp') or datetime.now()
        self.level = self.parsed_data.get('level', 'INFO')
        self.source = self.parsed_data.get('source', 'unknown')
        self.message = self.parsed_data.get('message', raw_line)
        
    def to_dict(self):
        """Convert the entry to a JSON-serializable dictionary."""
        return {
            'raw_line': self.raw_line,
            # Serialize any datetime values so the dict can be dumped as JSON
            'parsed_data': {k: (v.isoformat() if isinstance(v, datetime) else v)
                            for k, v in self.parsed_data.items()},
            'timestamp': self.timestamp.isoformat() if isinstance(self.timestamp, datetime) else self.timestamp,
            'level': self.level,
            'source': self.source,
            'message': self.message
        }

class LogParser:
    """Parses raw log lines according to a format configuration."""
    def __init__(self, format_config: Dict):
        self.format_config = format_config
        self.pattern = re.compile(format_config.get('pattern', '.*'))
        self.fields = format_config.get('fields', [])
        
    def parse(self, log_line: str) -> Optional[LogEntry]:
        """Parse a single log line."""
        try:
            match = self.pattern.match(log_line.strip())
            if not match:
                return None
                
            parsed_data = {}
            for i, field in enumerate(self.fields):
                if i < len(match.groups()):
                    parsed_data[field] = match.group(i + 1)
                    
            # Normalize the timestamp if one was captured
            if 'timestamp' in parsed_data:
                try:
                    timestamp_str = parsed_data['timestamp']
                    # Try several common timestamp formats
                    for fmt in [
                        '%Y-%m-%d %H:%M:%S',
                        '%d/%b/%Y:%H:%M:%S %z',
                        '%Y-%m-%dT%H:%M:%S',
                        '%Y-%m-%d %H:%M:%S.%f'
                    ]:
                        try:
                            parsed_data['timestamp'] = datetime.strptime(timestamp_str, fmt)
                            break
                        except ValueError:
                            continue
                except Exception:
                    pass
                    
            return LogEntry(log_line, parsed_data)
        except Exception as e:
            logger.warning(f"Failed to parse log line: {str(e)}")
            return LogEntry(log_line)

class Rule:
    """An alerting rule."""
    # Numeric severities so level names compare correctly
    # (comparing the level strings lexicographically would be wrong)
    LEVELS = {'DEBUG': 10, 'INFO': 20, 'WARNING': 30, 'ERROR': 40, 'CRITICAL': 50}

    def __init__(self, rule_config: Dict):
        self.id = rule_config.get('id')
        self.name = rule_config.get('name', 'Unnamed Rule')
        self.description = rule_config.get('description', '')
        self.enabled = rule_config.get('enabled', True)
        self.pattern = rule_config.get('pattern')
        self.regex = re.compile(rule_config.get('regex', '.*')) if rule_config.get('regex') else None
        self.level = rule_config.get('level', 'INFO')
        self.threshold = rule_config.get('threshold', 1)
        self.window = rule_config.get('window', 60)  # seconds
        self.cooldown = rule_config.get('cooldown', 300)  # seconds
        self.actions = rule_config.get('actions', [])
        self.last_triggered = None
        self.match_count = 0
        self.match_history = deque(maxlen=1000)  # keep the most recent 1000 matches
        
    def should_trigger(self, log_entry: LogEntry) -> bool:
        """Decide whether this rule should fire for the given log entry."""
        if not self.enabled:
            return False
            
        # Respect the cooldown period (total_seconds handles gaps longer than a day)
        if self.last_triggered and (datetime.now() - self.last_triggered).total_seconds() < self.cooldown:
            return False
            
        # Filter by log level, but only when the entry actually carries a parsed level
        if 'level' in log_entry.parsed_data:
            entry_severity = self.LEVELS.get(str(log_entry.level).upper(), 20)
            rule_severity = self.LEVELS.get(str(self.level).upper(), 20)
            if entry_severity < rule_severity:
                return False
            
        # Keyword match
        if self.pattern and self.pattern not in log_entry.message:
            return False
            
        # Regular-expression match
        if self.regex and not self.regex.search(log_entry.message):
            return False
            
        # Record the match
        self.match_history.append({
            'timestamp': datetime.now(),
            'log_entry': log_entry.to_dict()
        })
        
        # Enforce the threshold over the time window
        if self.threshold > 1:
            # Count matches inside the window
            window_start = datetime.now() - timedelta(seconds=self.window)
            recent_matches = [m for m in self.match_history 
                            if m['timestamp'] >= window_start]
            
            if len(recent_matches) < self.threshold:
                return False
                
        return True
        
    def trigger(self, log_entry: LogEntry) -> Dict:
        """Fire the rule and build the alert payload."""
        self.last_triggered = datetime.now()
        self.match_count += 1
        
        alert_data = {
            'rule_id': self.id,
            'rule_name': self.name,
            'timestamp': datetime.now().isoformat(),
            'log_entry': log_entry.to_dict(),
            'match_count': self.match_count,
            # Keep the last 10 matches, with timestamps serialized for JSON output
            'recent_matches': [
                {'timestamp': m['timestamp'].isoformat(), 'log_entry': m['log_entry']}
                for m in list(self.match_history)[-10:]
            ]
        }
        
        return alert_data

class AlertSender:
    """Dispatches alerts over the configured channels."""
    def __init__(self, config: Dict):
        self.config = config
        self.senders = {
            'email': self._send_email,
            'webhook': self._send_webhook,
            'slack': self._send_slack,
            'console': self._send_console
        }
        
    def send(self, alert_data: Dict, action_config: Dict):
        """Send an alert using the channel named in action_config."""
        action_type = action_config.get('type', 'console')
        sender = self.senders.get(action_type)
        
        if not sender:
            raise AlertSendError(f"Unsupported alert type: {action_type}")
            
        try:
            sender(alert_data, action_config)
        except Exception as e:
            raise AlertSendError(f"Failed to send alert: {str(e)}")
            
    def _send_email(self, alert_data: Dict, config: Dict):
        """Send an alert by email."""
        try:
            msg = MIMEMultipart()
            msg['From'] = config.get('sender')
            msg['To'] = ', '.join(config.get('recipients', []))
            msg['Subject'] = f"Log alert - {alert_data['rule_name']}"
            
            body = f"""
Log alert notification

Rule name: {alert_data['rule_name']}
Rule ID: {alert_data['rule_id']}
Triggered at: {alert_data['timestamp']}
Match count: {alert_data['match_count']}

Log line:
{alert_data['log_entry']['raw_line']}

---
Log analyzer
            """
            msg.attach(MIMEText(body, 'plain'))
            
            server = smtplib.SMTP(config.get('smtp_server'), config.get('smtp_port', 587))
            server.starttls()
            server.login(config.get('sender'), config.get('password'))
            server.send_message(msg)
            server.quit()
            
            logger.info(f"Email alert sent: {alert_data['rule_name']}")
            
        except Exception as e:
            logger.error(f"Failed to send email alert: {str(e)}")
            raise
            
    def _send_webhook(self, alert_data: Dict, config: Dict):
        """Send an alert to a generic webhook."""
        try:
            response = requests.post(
                config.get('url'),
                json=alert_data,
                headers=config.get('headers', {}),
                timeout=config.get('timeout', 30)
            )
            response.raise_for_status()
            
            logger.info(f"Webhook alert sent: {alert_data['rule_name']}")
            
        except Exception as e:
            logger.error(f"Failed to send webhook alert: {str(e)}")
            raise
            
    def _send_slack(self, alert_data: Dict, config: Dict):
        """Send an alert to Slack."""
        try:
            payload = {
                'text': f"Log alert: {alert_data['rule_name']}",
                'attachments': [{
                    'color': 'danger',
                    'fields': [
                        {
                            'title': 'Rule name',
                            'value': alert_data['rule_name'],
                            'short': True
                        },
                        {
                            'title': 'Triggered at',
                            'value': alert_data['timestamp'],
                            'short': True
                        },
                        {
                            'title': 'Log line',
                            'value': alert_data['log_entry']['raw_line'],
                            'short': False
                        }
                    ]
                }]
            }
            
            response = requests.post(
                config.get('webhook_url'),
                json=payload,
                timeout=config.get('timeout', 30)
            )
            response.raise_for_status()
            
            logger.info(f"Slack alert sent: {alert_data['rule_name']}")
            
        except Exception as e:
            logger.error(f"Failed to send Slack alert: {str(e)}")
            raise
            
    def _send_console(self, alert_data: Dict, config: Dict):
        """Print an alert to the console."""
        print(f"[ALERT] {alert_data['rule_name']}: {alert_data['log_entry']['message']}")

class LogFileMonitor:
    """Tails a log file and feeds new lines to registered callbacks."""
    def __init__(self, file_path: str, parser: LogParser):
        self.file_path = file_path
        self.parser = parser
        self.file_handle = None
        self.file_position = 0
        self.callbacks = []
        
    def add_callback(self, callback: Callable):
        """Register a callback for new log entries."""
        self.callbacks.append(callback)
        
    def start(self):
        """Start monitoring the file."""
        try:
            # Open the file and seek to the end so only new lines are processed
            self.file_handle = open(self.file_path, 'r', encoding='utf-8', errors='ignore')
            self.file_handle.seek(0, 2)  # move to the end of the file
            self.file_position = self.file_handle.tell()
            
            # Run the monitor loop in a daemon thread
            thread = threading.Thread(target=self._monitor_loop)
            thread.daemon = True
            thread.start()
            
            logger.info(f"Started monitoring log file: {self.file_path}")
            
        except Exception as e:
            logger.error(f"Failed to start log monitoring: {str(e)}")
            raise LogAnalyzerError(f"Failed to start monitoring: {str(e)}")
            
    def stop(self):
        """Stop monitoring the file."""
        if self.file_handle:
            self.file_handle.close()
            self.file_handle = None
        logger.info(f"Stopped monitoring log file: {self.file_path}")
        
    def _monitor_loop(self):
        """Polling loop that picks up newly appended lines."""
        while self.file_handle:
            try:
                # Check whether the file has grown
                current_position = self.file_handle.tell()
                self.file_handle.seek(0, 2)
                file_size = self.file_handle.tell()
                self.file_handle.seek(current_position)
                
                if file_size > current_position:
                    # Read the newly appended lines
                    new_lines = self.file_handle.readlines()
                    for line in new_lines:
                        self._process_line(line)
                        
                # Remember the current position
                self.file_position = self.file_handle.tell()
                
                time.sleep(0.1)  # short sleep between polls
                
            except Exception as e:
                logger.error(f"Log monitoring error: {str(e)}")
                time.sleep(1)
                
    def _process_line(self, line: str):
        """Parse a line and dispatch it to the callbacks."""
        try:
            log_entry = self.parser.parse(line)
            if log_entry:
                for callback in self.callbacks:
                    try:
                        callback(log_entry)
                    except Exception as e:
                        logger.error(f"Callback failed: {str(e)}")
        except Exception as e:
            logger.warning(f"Failed to process log line: {str(e)}")

class DatabaseManager:
    """Persists alerts and statistics in SQLite."""
    def __init__(self, db_path: str = 'log_analyzer.db'):
        self.db_path = db_path
        self.init_database()
        
    def init_database(self):
        """Create the database schema if it does not exist yet."""
        try:
            conn = sqlite3.connect(self.db_path)
            cursor = conn.cursor()
            
            # Alert history table
            cursor.execute('''
                CREATE TABLE IF NOT EXISTS alerts (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    timestamp TEXT NOT NULL,
                    rule_id TEXT NOT NULL,
                    rule_name TEXT NOT NULL,
                    log_entry TEXT,
                    match_count INTEGER
                )
            ''')
            
            # Statistics table
            cursor.execute('''
                CREATE TABLE IF NOT EXISTS statistics (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    timestamp TEXT NOT NULL,
                    rule_id TEXT NOT NULL,
                    match_count INTEGER,
                    processed_lines INTEGER
                )
            ''')
            
            conn.commit()
            conn.close()
            logger.info("Database initialized")
            
        except Exception as e:
            logger.error(f"Database initialization failed: {str(e)}")
            
    def save_alert(self, alert_data: Dict):
        """Store a single alert record."""
        try:
            conn = sqlite3.connect(self.db_path)
            cursor = conn.cursor()
            
            cursor.execute('''
                INSERT INTO alerts 
                (timestamp, rule_id, rule_name, log_entry, match_count)
                VALUES (?, ?, ?, ?, ?)
            ''', (
                alert_data.get('timestamp', ''),
                alert_data.get('rule_id', ''),
                alert_data.get('rule_name', ''),
                # default=str guards against values json cannot serialize directly
                json.dumps(alert_data.get('log_entry', {}), default=str),
                alert_data.get('match_count', 0)
            ))
            
            conn.commit()
            conn.close()
            logger.info(f"Alert record saved: {alert_data.get('rule_name', '')}")
            
        except Exception as e:
            logger.error(f"Failed to save alert record: {str(e)}")
            
    def get_alerts(self, limit: int = 100) -> List[Dict]:
        """Return the most recent alert records."""
        try:
            conn = sqlite3.connect(self.db_path)
            cursor = conn.cursor()
            
            cursor.execute('''
                SELECT timestamp, rule_id, rule_name, log_entry, match_count
                FROM alerts
                ORDER BY timestamp DESC
                LIMIT ?
            ''', (limit,))
            
            rows = cursor.fetchall()
            conn.close()
            
            alerts = []
            for row in rows:
                alerts.append({
                    'timestamp': row[0],
                    'rule_id': row[1],
                    'rule_name': row[2],
                    'log_entry': json.loads(row[3]) if row[3] else {},
                    'match_count': row[4]
                })
                
            return alerts
            
        except Exception as e:
            logger.error(f"Failed to fetch alert records: {str(e)}")
            return []

class LogAnalyzer:
    """Main entry point: wires parsers, rules, monitors, alerting, and storage together."""
    def __init__(self, config_file: str = None):
        self.config_file = config_file
        self.config = {}
        self.rules = []
        self.parsers = {}
        self.monitors = []
        self.alert_sender = None
        self.db_manager = None
        self.running = False
        self.processed_lines = 0
        self.match_statistics = defaultdict(int)
        
        # Load the configuration
        self.load_config()
        
        # Initialize the components
        self._init_components()
        
    def load_config(self):
        """Load the configuration file."""
        if not self.config_file or not os.path.exists(self.config_file):
            logger.info("No configuration file given (or file missing); using the default configuration")
            self.config = self._create_default_config()
            return
            
        try:
            with open(self.config_file, 'r', encoding='utf-8') as f:
                if self.config_file.endswith('.yaml') or self.config_file.endswith('.yml'):
                    self.config = yaml.safe_load(f)
                else:
                    self.config = json.load(f)
                    
            logger.info(f"Loaded configuration file: {self.config_file}")
            
        except Exception as e:
            logger.error(f"Failed to load configuration file: {str(e)}")
            raise ConfigError(f"Configuration loading failed: {str(e)}")
            
    def _create_default_config(self) -> Dict:
        """Build a minimal default configuration."""
        return {
            "parsers": {
                "default": {
                    "pattern": "(.*)",
                    "fields": ["message"]
                }
            },
            "rules": [
                {
                    "id": "error_detector",
                    "name": "Error detector",
                    "description": "Detect log lines containing the ERROR keyword",
                    "enabled": True,
                    "pattern": "ERROR",
                    "level": "ERROR",
                    "threshold": 1,
                    "window": 60,
                    "cooldown": 300,
                    "actions": [
                        {"type": "console"}
                    ]
                }
            ],
            "sources": [
                {
                    "path": "/var/log/syslog",
                    "parser": "default",
                    "enabled": False
                }
            ],
            "actions": {
                "email": {
                    "type": "email",
                    "sender": "loganalyzer@example.com",
                    "password": "your_password",
                    "smtp_server": "smtp.example.com",
                    "smtp_port": 587,
                    "recipients": ["admin@example.com"]
                }
            }
        }
        
    def _init_components(self):
        """Initialize parsers, rules, the alert sender, and the database."""
        # Parsers
        for name, parser_config in self.config.get('parsers', {}).items():
            self.parsers[name] = LogParser(parser_config)
            
        # Rules
        for rule_config in self.config.get('rules', []):
            self.rules.append(Rule(rule_config))
            
        # Alert sender
        self.alert_sender = AlertSender(self.config.get('actions', {}))
        
        # Database manager
        self.db_manager = DatabaseManager()
        
    def start(self):
        """Start the log analyzer."""
        if self.running:
            logger.warning("The log analyzer is already running")
            return
            
        logger.info("Starting the log analyzer...")
        self.running = True
        
        # Start one monitor per enabled log source
        for source_config in self.config.get('sources', []):
            if not source_config.get('enabled', False):
                continue
                
            path = source_config.get('path')
            parser_name = source_config.get('parser', 'default')
            
            if not os.path.exists(path):
                logger.warning(f"Log file does not exist: {path}")
                continue
                
            if parser_name not in self.parsers:
                logger.warning(f"Parser not found: {parser_name}")
                continue
                
            try:
                parser = self.parsers[parser_name]
                monitor = LogFileMonitor(path, parser)
                monitor.add_callback(self._process_log_entry)
                monitor.start()
                self.monitors.append(monitor)
                logger.info(f"Started log monitor: {path}")
                
            except Exception as e:
                logger.error(f"Failed to start log monitor for {path}: {str(e)}")
                
        logger.info("Log analyzer started")
        
    def stop(self):
        """Stop the log analyzer."""
        logger.info("Stopping the log analyzer...")
        self.running = False
        
        # Stop all monitors
        for monitor in self.monitors:
            try:
                monitor.stop()
            except Exception as e:
                logger.error(f"Failed to stop monitor: {str(e)}")
                
        self.monitors.clear()
        logger.info("Log analyzer stopped")
        
    def _process_log_entry(self, log_entry: LogEntry):
        """Run every rule against a newly parsed log entry."""
        self.processed_lines += 1
        
        # Apply all rules
        for rule in self.rules:
            try:
                if rule.should_trigger(log_entry):
                    # Fire the alert
                    alert_data = rule.trigger(log_entry)
                    
                    # Persist it
                    if self.db_manager:
                        self.db_manager.save_alert(alert_data)
                        
                    # Update statistics
                    self.match_statistics[rule.id] += 1
                    
                    # Dispatch the configured actions
                    for action_config in rule.actions:
                        try:
                            self.alert_sender.send(alert_data, action_config)
                        except Exception as e:
                            logger.error(f"Failed to send alert: {str(e)}")
                            
            except Exception as e:
                logger.error(f"Failed to apply rule {rule.name}: {str(e)}")
                
    def get_statistics(self) -> Dict:
        """Return runtime statistics."""
        return {
            'processed_lines': self.processed_lines,
            'match_statistics': dict(self.match_statistics),
            'active_monitors': len(self.monitors),
            'active_rules': len([r for r in self.rules if r.enabled])
        }
        
    def add_rule(self, rule_config: Dict):
        """Add a rule at runtime."""
        try:
            rule = Rule(rule_config)
            self.rules.append(rule)
            logger.info(f"Rule added: {rule.name}")
        except Exception as e:
            logger.error(f"Failed to add rule: {str(e)}")
            
    def remove_rule(self, rule_id: str):
        """Remove a rule by id."""
        self.rules = [r for r in self.rules if r.id != rule_id]
        logger.info(f"Rule removed: {rule_id}")
        
    def get_alerts(self, limit: int = 100) -> List[Dict]:
        """Return recent alert records."""
        if self.db_manager:
            return self.db_manager.get_alerts(limit)
        return []

def create_sample_config():
    """Write a sample configuration file."""
    sample_config = {
        "parsers": {
            "nginx": {
                "pattern": r'(\S+) - - \[(.*?)\] "(\S+) (\S+) (\S+)" (\d+) (\d+) "(.*?)" "(.*?)"',
                "fields": ["ip", "timestamp", "method", "url", "protocol", "status", "size", "referer", "user_agent"]
            },
            "syslog": {
                "pattern": r'<(\d+)>(\d+) (\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+[+-]\d{2}:\d{2}) (\S+) ([^ ]+) ([^ ]+) (.*)',
                "fields": ["priority", "version", "timestamp", "hostname", "app_name", "proc_id", "message"]
            }
        },
        "rules": [
            {
                "id": "nginx_404_errors",
                "name": "Nginx 404 error detection",
                "description": "Detect 404 responses in the Nginx access log",
                "enabled": True,
                "regex": r'" 404 ',
                "level": "WARNING",
                "threshold": 5,
                "window": 60,
                "cooldown": 300,
                "actions": [
                    {"type": "console"},
                    {"type": "email"}
                ]
            },
            {
                "id": "security_login_failures",
                "name": "Login failure detection",
                "description": "Detect authentication failures",
                "enabled": True,
                "pattern": "authentication failure",
                "level": "ERROR",
                "threshold": 3,
                "window": 300,
                "cooldown": 600,
                "actions": [
                    {"type": "console"},
                    {"type": "webhook", "url": "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"}
                ]
            }
        ],
        "sources": [
            {
                "path": "/var/log/nginx/access.log",
                "parser": "nginx",
                "enabled": False
            },
            {
                "path": "/var/log/syslog",
                "parser": "syslog",
                "enabled": False
            }
        ],
        "actions": {
            "email": {
                "type": "email",
                "sender": "loganalyzer@example.com",
                "password": "your_app_password",
                "smtp_server": "smtp.gmail.com",
                "smtp_port": 587,
                "recipients": ["admin@example.com", "security@example.com"]
            },
            "slack_webhook": {
                "type": "webhook",
                "url": "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK",
                "timeout": 30
            }
        }
    }
    
    with open('log_analyzer_sample_config.json', 'w', encoding='utf-8') as f:
        json.dump(sample_config, f, indent=2, ensure_ascii=False)
    logger.info("Sample configuration file created: log_analyzer_sample_config.json")

def main():
    parser = argparse.ArgumentParser(description='Log analysis and alerting tool')
    parser.add_argument('-c', '--config', help='Path to the configuration file')
    parser.add_argument('--start', action='store_true', help='Start the log analyzer')
    parser.add_argument('--sample-config', action='store_true', help='Create a sample configuration file')
    parser.add_argument('--stats', action='store_true', help='Show statistics')
    parser.add_argument('--alerts', action='store_true', help='Show recent alerts')
    parser.add_argument('--limit', type=int, default=10, help='Maximum number of alert records to show')
    
    args = parser.parse_args()
    
    if args.sample_config:
        create_sample_config()
        return
        
    analyzer = LogAnalyzer(args.config)
    
    if args.stats:
        stats = analyzer.get_statistics()
        print(json.dumps(stats, indent=2, ensure_ascii=False))
        return
        
    if args.alerts:
        alerts = analyzer.get_alerts(args.limit)
        print(json.dumps(alerts, indent=2, ensure_ascii=False))
        return
        
    if args.start:
        try:
            analyzer.start()
            # Keep the main thread alive while the daemon monitor threads run
            while True:
                time.sleep(1)
        except KeyboardInterrupt:
            logger.info("Interrupt received, stopping the analyzer...")
        finally:
            analyzer.stop()
    else:
        parser.print_help()

if __name__ == '__main__':
    main()

Usage

1. Install the dependencies

pip install pyyaml requests

2. Create a sample configuration file

python log_analyzer.py --sample-config

3. Start the log analyzer

python log_analyzer.py --config log_analyzer_sample_config.json --start

4. Show statistics

python log_analyzer.py --stats

5. Show alert records

python log_analyzer.py --alerts --limit 20
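
Besides the command line, the LogAnalyzer class can also be driven directly from Python, for example to add a rule at runtime without editing the configuration file. A small sketch, assuming the script above is saved as log_analyzer.py; the rule values and the 60-second run are illustrative:

from log_analyzer import LogAnalyzer
import time

analyzer = LogAnalyzer('log_analyzer_sample_config.json')

# Add a rule at runtime (dynamic rule loading)
analyzer.add_rule({
    "id": "disk_full",
    "name": "Disk full detection",
    "pattern": "No space left on device",   # keyword to look for
    "level": "ERROR",
    "threshold": 1,
    "cooldown": 600,
    "actions": [{"type": "console"}],
})

analyzer.start()
try:
    time.sleep(60)                          # let the monitors run for a minute
    print(analyzer.get_statistics())        # processed lines and per-rule match counts
finally:
    analyzer.stop()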

Configuration File Example

JSON configuration file

{
  "parsers": {
    "nginx": {
      "pattern": "(\\S+) - - \\[(.*?)\\] \"(\\S+) (\\S+) (\\S+)\" (\\d+) (\\d+) \"(.*?)\" \"(.*?)\"",
      "fields": ["ip", "timestamp", "method", "url", "protocol", "status", "size", "referer", "user_agent"]
    },
    "syslog": {
      "pattern": "<(\\d+)>(\\d+) (\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}\\.\\d+[+-]\\d{2}:\\d{2}) (\\S+) ([^ ]+) ([^ ]+) (.*)",
      "fields": ["priority", "version", "timestamp", "hostname", "app_name", "proc_id", "message"]
    }
  },
  "rules": [
    {
      "id": "nginx_404_errors",
      "name": "Nginx 404错误检测",
      "description": "检测Nginx访问日志中的404错误",
      "enabled": true,
      "regex": "\" 404 ",
      "level": "WARNING",
      "threshold": 5,
      "window": 60,
      "cooldown": 300,
      "actions": [
        {"type": "console"},
        {"type": "email"}
      ]
    },
    {
      "id": "security_login_failures",
      "name": "安全登录失败检测",
      "description": "检测认证失败的日志",
      "enabled": true,
      "pattern": "authentication failure",
      "level": "ERROR",
      "threshold": 3,
      "window": 300,
      "cooldown": 600,
      "actions": [
        {"type": "console"},
        {"type": "webhook", "url": "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"}
      ]
    }
  ],
  "sources": [
    {
      "path": "/var/log/nginx/access.log",
      "parser": "nginx",
      "enabled": false
    },
    {
      "path": "/var/log/syslog",
      "parser": "syslog",
      "enabled": false
    }
  ],
  "actions": {
    "email": {
      "type": "email",
      "sender": "loganalyzer@example.com",
      "password": "your_app_password",
      "smtp_server": "smtp.gmail.com",
      "smtp_port": 587,
      "recipients": ["admin@example.com", "security@example.com"]
    },
    "slack_webhook": {
      "type": "webhook",
      "url": "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK",
      "timeout": 30
    }
  }
}
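
The tool also accepts YAML (load_config checks the .yaml/.yml extension). If you prefer YAML, the JSON example above can be converted with a couple of lines; the output file name here is just an example:

import json
import yaml

# Convert the JSON sample configuration into an equivalent YAML file
with open('log_analyzer_sample_config.json', 'r', encoding='utf-8') as f:
    config = json.load(f)

with open('log_analyzer_config.yaml', 'w', encoding='utf-8') as f:
    yaml.safe_dump(config, f, allow_unicode=True, sort_keys=False)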

Advanced Features

1. Smart pattern matching

Rules support both regular-expression and keyword matching, which makes it possible to describe fairly specific log patterns and anomalous behavior.
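
For example, a regex-based rule matching server-side (5xx) errors in an Nginx access log could look like the following; the IDs, threshold, and window are illustrative values in this tool's rule schema:

# Regex-based rule: match HTTP 5xx responses in an Nginx access log.
# The regex targets the status-code field that follows the quoted request.
nginx_5xx_rule = {
    "id": "nginx_5xx_errors",
    "name": "Nginx 5xx error detection",
    "enabled": True,
    "regex": r'" 5\d\d ',   # closing quote, space, a 5xx status code, space
    "level": "ERROR",
    "threshold": 10,        # at least 10 matches ...
    "window": 60,           # ... within one minute
    "cooldown": 300,
    "actions": [{"type": "console"}],
}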

2. Alert suppression

Thresholds, time windows, and cooldown periods prevent duplicate alerts and alert storms.
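
A small demonstration of how threshold and cooldown interact, using the Rule and LogEntry classes from the implementation above (the values are illustrative):

rule = Rule({
    "id": "demo",
    "name": "Demo rule",
    "pattern": "ERROR",
    "threshold": 3,      # need 3 matches ...
    "window": 60,        # ... within 60 seconds
    "cooldown": 300,     # then stay silent for 5 minutes
})

for i in range(5):
    entry = LogEntry(f"ERROR something broke ({i})")
    if rule.should_trigger(entry):
        print("alert fired:", rule.trigger(entry)["match_count"])
    else:
        print("suppressed")

# Expected output: the first two entries are suppressed (below the threshold),
# the third fires an alert, and the remaining two are suppressed by the cooldown.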

3. Multi-channel alerting

Multiple alert channels (email, webhook, Slack, console) are supported, so alerts can be routed wherever they will actually be seen.

4. Historical data analysis

Alert records and statistics are stored in a built-in SQLite database, which supports historical analysis and trend reporting.
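
For instance, the alerts table created by DatabaseManager can be queried directly to count alerts per rule per day (the timestamp column stores ISO-8601 strings, so the first 10 characters are the date):

import sqlite3

conn = sqlite3.connect('log_analyzer.db')
rows = conn.execute("""
    SELECT rule_id,
           substr(timestamp, 1, 10) AS day,   -- YYYY-MM-DD prefix of the ISO timestamp
           COUNT(*)                 AS alerts
    FROM alerts
    GROUP BY rule_id, day
    ORDER BY day DESC, alerts DESC
""").fetchall()
conn.close()

for rule_id, day, count in rows:
    print(f"{day}  {rule_id:<25} {count}")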

Best Practices

1. Rule tuning

  • Choose thresholds and time windows that match the actual traffic of each log source
  • Use specific, anchored regular expressions rather than broad catch-all patterns (see the sketch below)
  • Review and refine alert rules regularly
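
A quick illustration of why specific patterns matter, using the Nginx 404 rule from the sample configuration; the log line is made up:

import re

line = '203.0.113.7 - - [15/Jan/2024:10:23:45 +0000] "GET /missing HTTP/1.1" 404 162 "-" "curl/8.0"'

# Broad pattern: matches anything containing "404", including URLs that merely contain those digits
broad = re.compile(r'404')

# Specific pattern: anchored to the status-code position right after the quoted request
specific = re.compile(r'" 404 ')

print(bool(broad.search('GET /product/404-widget HTTP/1.1 200')))     # True  (false positive)
print(bool(specific.search('GET /product/404-widget HTTP/1.1 200')))  # False
print(bool(specific.search(line)))                                    # True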

2. Performance tuning

  • Keep the number and size of monitored log files reasonable
  • Use efficient regular-expression patterns
  • Clean up historical data regularly

3. Security considerations

  • Protect sensitive values in the configuration file (SMTP passwords, webhook URLs)
  • Restrict access permissions on the monitored log files
  • Review alert records regularly

Summary

This log analysis and alerting tool provides a capable, flexibly configurable log monitoring solution. By analyzing log data in real time and sending timely alerts, it helps operations teams discover and respond to problems quickly, improving system stability and security.