[Automation Series] 005: System Health Check and Reporting Tool

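Features

This is a system health check and reporting tool that periodically assesses system state and generates detailed health reports. Its core capabilities are:

  1. Comprehensive health checks

    • System resource usage (CPU, memory, disk, network)
    • Hardware status monitoring (temperature, fans, voltage, etc.)
    • Service and process status checks
    • Network connectivity and latency tests
    • Security vulnerability and configuration audits
  2. Intelligent diagnostic analysis

    • Performance bottleneck identification
    • Anomaly detection
    • Trend analysis and forecasting
    • Root-cause localization
    • Optimization recommendations
  3. Multiple report outputs

    • HTML visual reports
    • PDF reports
    • JSON data reports
    • Automatic email delivery
    • WeCom / DingTalk push notifications
  4. Flexible configuration management

    • Custom check items and thresholds
    • Multi-environment configuration support
    • Check frequency and scheduling
    • Custom report templates
    • Alert condition settings
  5. Automated scheduled execution

    • Scheduled health check tasks
    • Periodic report generation
    • Immediate alerts on anomalies
    • Comparison against historical data
    • CI/CD pipeline integration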
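
The feature list mentions trend analysis and forecasting without showing what that could look like. As a minimal sketch (not the tool's actual implementation), assume disk-usage percentages are sampled once per day and stored oldest first; a least-squares line can then estimate how many days remain before a threshold is crossed. The function name `days_until_threshold` and its parameters are hypothetical.

```python
from typing import List, Optional


def days_until_threshold(samples: List[float], threshold: float = 85.0) -> Optional[float]:
    """Estimate days until usage reaches `threshold` from daily samples.

    `samples` holds one usage percentage per day, oldest first.
    Returns 0.0 if the latest sample is already at or above the threshold,
    and None if there is too little data or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    if samples[-1] >= threshold:
        return 0.0

    # Ordinary least-squares fit of: usage = slope * day + intercept
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx
    if slope <= 0:
        return None  # flat or shrinking usage: no crossing predicted

    intercept = mean_y - slope * mean_x
    crossing_day = (threshold - intercept) / slope
    # Days remaining, counted from the most recent sample (day n - 1)
    return max(crossing_day - (n - 1), 0.0)
```

A straight line is a deliberately simple model; it suits slow, steady growth such as log accumulation and will mislead on bursty workloads.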

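Use Cases

1. Server operations

  • Check server health on a regular schedule
  • Catch performance bottlenecks and resource shortages early
  • Reduce the risk of system failures and downtime
  • Produce operations reports for management review

2. Cloud environment monitoring

  • Monitor resource usage on cloud servers
  • Detect cloud service anomalies and performance degradation
  • Tune cloud resource allocation and control costs
  • Meet SLA monitoring requirements for cloud services

3. Enterprise IT asset management

  • Check the status of internal devices in bulk
  • Generate asset health reports
  • Identify devices that need maintenance or replacement
  • Support ITIL service management processes

4. Security and compliance auditing

  • Check system security configuration and vulnerabilities
  • Generate compliance audit reports
  • Support compliance requirements such as China's Multi-Level Protection Scheme (等保) and ISO standards
  • Support retrospective analysis of security incidents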

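Error Handling

1. System resource access errors

try:
    cpu_percent = psutil.cpu_percent(interval=1)
except psutil.AccessDenied:
    # psutil raises AccessDenied when the process lacks permission
    logger.error("Permission denied while reading CPU information")
    cpu_percent = None
except psutil.Error as e:
    # psutil.Error is the base class of all psutil exceptions
    logger.error(f"Failed to read CPU information: {e}")
    cpu_percent = None
except Exception as e:
    logger.error(f"Unexpected error during CPU monitoring: {e}")
    cpu_percent = None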
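
Repeating the same try/except around every metric gets verbose. One way to factor it out, as a rough sketch rather than part of the original tool, is a small wrapper that runs any reader function and always returns a result dictionary. The name `collect_metric` and the shape of the returned dictionary are assumptions made for this illustration.

```python
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger(__name__)


def collect_metric(name: str, reader: Callable[[], Any]) -> Dict[str, Any]:
    """Run `reader` and wrap the outcome so one failure never aborts the run."""
    try:
        value = reader()
    except Exception as exc:  # also covers psutil.Error and its subclasses
        logger.error("Failed to read %s: %s", name, exc)
        return {
            "name": name,
            "status": "ERROR",
            "value": None,
            "error": type(exc).__name__,
        }
    return {"name": name, "status": "OK", "value": value}
```

With psutil installed, a call could look like `collect_metric("cpu_percent", lambda: psutil.cpu_percent(interval=1))`. Recording the exception class name keeps permission problems distinguishable from other failures in the report.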

2. Network connectivity test errors

# Fragment from inside a check function; `url` and `timeout` are its parameters.
try:
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
except requests.Timeout:
    logger.error(f"Network connection timed out: {url}")
    return {"status": "TIMEOUT", "error": "Connection timed out"}
except requests.ConnectionError:
    logger.error(f"Network connection failed: {url}")
    return {"status": "CONNECTION_ERROR", "error": "Connection failed"}
except requests.HTTPError as e:
    logger.error(f"HTTP error: {e.response.status_code} - {url}")
    return {"status": "HTTP_ERROR", "error": f"HTTP {e.response.status_code}"}
except Exception as e:
    logger.error(f"Unexpected error during network test: {e}")
    return {"status": "UNKNOWN_ERROR", "error": str(e)}
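
Timeouts and connection failures are often transient, so a single failed probe does not have to mark a target as down. The sketch below, which is an addition and not part of the original tool, retries a check with exponential backoff when its status is `TIMEOUT` or `CONNECTION_ERROR`. The name `check_with_retry`, its parameters, and the extra `attempts` key are hypothetical; the `sleep` parameter exists only so the delay can be replaced in tests.

```python
import time
from typing import Callable, Dict

# Statuses worth retrying, matching the status strings used in the handler above
TRANSIENT_STATUSES = {"TIMEOUT", "CONNECTION_ERROR"}


def check_with_retry(
    check: Callable[[], Dict],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Call `check` until it returns a non-transient status or attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    result: Dict = {}
    for attempt in range(1, attempts + 1):
        result = dict(check())          # copy so the caller's dict is not mutated
        result["attempts"] = attempt    # record how many tries were needed
        if result.get("status") not in TRANSIENT_STATUSES:
            break
        if attempt < attempts:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    return result
```

HTTP errors are returned immediately, on the assumption that a 4xx or 5xx response means the server was reached and a retry is unlikely to change the outcome within seconds.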

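3. Report generation errors

try:
    report_generator.generate_html_report(data, template)
except TemplateError as e:
    # TemplateError stands in for the template engine's exception class
    logger.error(f"Report template error: {e}")
    # Regenerate with the default template
    report_generator.generate_html_report(data, default_template)
except OSError as e:
    # IOError is an alias of OSError in Python 3
    logger.error(f"Failed to write report file: {e}")
    # Check disk space and permissions
    check_disk_space()
    check_file_permissions()
except Exception as e:
    logger.error(f"Unexpected error during report generation: {e}")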
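
The handler above calls `check_disk_space()` and `check_file_permissions()`, which are not defined in the part of the article shown here. A possible standard-library sketch follows; the signatures, the default path, and the 100 MiB minimum are assumptions chosen for illustration, not the author's values.

```python
import os
import shutil


def check_disk_space(path: str = ".", min_free_bytes: int = 100 * 1024 * 1024) -> bool:
    """Return True if the filesystem holding `path` has enough free space."""
    free = shutil.disk_usage(path).free
    return free >= min_free_bytes


def check_file_permissions(path: str = ".") -> bool:
    """Return True if `path` is an existing directory the process can write into."""
    # Creating files in a directory needs both write and search (execute) access
    return os.path.isdir(path) and os.access(path, os.W_OK | os.X_OK)
```

Both return a boolean rather than raising, so the caller can log which precondition failed before deciding whether to retry the write.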

4. Email sending errors

# Fragment from inside a send function.
try:
    smtp_client.send_message(email_message)
except smtplib.SMTPAuthenticationError:
    logger.error("SMTP authentication failed")
    # Try to re-authenticate
    refresh_smtp_credentials()
except smtplib.SMTPRecipientsRefused:
    # Raised when every recipient address was rejected by the server
    logger.error("Recipients were refused")
    return {"status": "FAILED", "error": "Recipients refused"}
except smtplib.SMTPException as e:
    logger.error(f"SMTP error: {e}")
    return {"status": "FAILED", "error": f"SMTP error: {e}"}
except Exception as e:
    logger.error(f"Unexpected error while sending email: {e}")

Code Implementation

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
System health check and reporting tool
Purpose: assess overall system health and generate reports
Author: Cline
Version: 1.0
"""

import argparse
import sys
import json
import yaml
import logging
import os
import time
import threading
import subprocess
import platform
import psutil
import socket
import requests
import smtplib
from datetime import datetime, timedelta
from typing import Dict, List, Any, Optional
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
from email.mime.base import MIMEBase
from email import encoders

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('health_check.log'),
        logging.StreamHandler(sys.stdout)
    ]
)
logger = logging.getLogger(__name__)

class HealthCheckError(Exception):
    """Raised when a health check cannot be carried out."""
    pass

class ReportGenerationError(Exception):
    """Raised when a report cannot be generated."""
    pass

class SystemMetrics:
    """System metrics collector."""
    @staticmethod
    def get_cpu_info() -> Dict:
        """Collect CPU usage, frequency, and core counts."""
        try:
            cpu_percent = psutil.cpu_percent(interval=1, percpu=True)
            cpu_freq = psutil.cpu_freq()
            cpu_count = psutil.cpu_count(logical=False)
            logical_cpu_count = psutil.cpu_count(logical=True)
            # Compute the average once and guard against an empty per-core list,
            # which would otherwise cause a division by zero in the status check
            cpu_avg = sum(cpu_percent) / len(cpu_percent) if cpu_percent else 0
            
            return {
                'cpu_percent': cpu_percent,
                'cpu_avg_percent': cpu_avg,
                'cpu_freq_current': cpu_freq.current if cpu_freq else None,
                'cpu_freq_min': cpu_freq.min if cpu_freq else None,
                'cpu_freq_max': cpu_freq.max if cpu_freq else None,
                'cpu_count': cpu_count,
                'logical_cpu_count': logical_cpu_count,
                'cpu_status': 'HEALTHY' if cpu_avg < 80 else 'WARNING'
            }
        except Exception as e:
            logger.error(f"Failed to get CPU info: {e}")
            return {'cpu_status': 'ERROR', 'error': str(e)}
            
    @staticmethod
    def get_memory_info() -> Dict:
        """获取内存信息"""
        try:
            virtual_memory = psutil.virtual_memory()
            swap_memory = psutil.swap_memory()
            
            return {
                'memory_total': virtual_memory.total,
                'memory_available': virtual_memory.available,
                'memory_used': virtual_memory.used,
                'memory_percent': virtual_memory.percent,
                'swap_total': swap_memory.total,
                'swap_used': swap_memory.used,
                'swap_percent': swap_memory.percent,
                'memory_status': 'HEALTHY' if virtual_memory.percent < 85 else 'WARNING'
            }
        except Exception as e:
            logger.error(f"获取内存信息失败: {str(e)}")
            return {'memory_status': 'ERROR', 'error': str(e)}
            
    @staticmethod
    def get_disk_info() -> Dict:
        """获取磁盘信息"""
        try:
            disk_info = []
            for partition in psutil.disk_partitions():
                try:
                    usage = psutil.disk_usage(partition.mountpoint)
                    disk_info.append({
                        'device': partition.device,
                        'mountpoint': partition.mountpoint,
                        'fstype': partition.fstype,
                        'total': usage.total,
                        'used': usage.used,
                        'free': usage.free,
                        'percent': round(usage.used / usage.total * 100, 2) if usage.total > 0 else 0
                    })
                except Exception:
                    continue
                    
            # 计算总体磁盘使用情况
            total_used = sum(disk['used'] for disk in disk_info)
            total_total = sum(disk['total'] for disk in disk_info)
            overall_percent = round(total_used / total_total * 100, 2) if total_total > 0 else 0
            
            return {
                'disks': disk_info,
                'overall_percent': overall_percent,
                'disk_status': 'HEALTHY' if overall_percent < 85 else 'WARNING'
            }
        except Exception as e:
            logger.error(f"获取磁盘信息失败: {str(e)}")
            return {'disk_status': 'ERROR', 'error': str(e)}
            
    @staticmethod
    def get_network_info() -> Dict:
        """获取网络信息"""
        try:
            net_io = psutil.net_io_counters()
            net_connections = len(psutil.net_connections())
            
            # 获取网络接口信息
            net_interfaces = []
            for interface, addrs in psutil.net_if_addrs().items():
                interface_info = {
                    'name': interface,
                    'addresses': []
                }
                for addr in addrs:
                    interface_info['addresses'].append({
                        'family': str(addr.family),
                        'address': addr.address,
                        'netmask': addr.netmask
                    })
                net_interfaces.append(interface_info)
                
            return {
                'bytes_sent': net_io.bytes_sent,
                'bytes_recv': net_io.bytes_recv,
                'packets_sent': net_io.packets_sent,
                'packets_recv': net_io.packets_recv,
                'connections': net_connections,
                'interfaces': net_interfaces,
                'network_status': 'HEALTHY'
            }
        except Exception as e:
            logger.error(f"获取网络信息失败: {str(e)}")
            return {'network_status': 'ERROR', 'error': str(e)}

class ServiceChecker:
    """服务检查器"""
    @staticmethod
    def check_service_status(service_name: str) -> Dict:
        """检查服务状态"""
        try:
            if platform.system() == "Windows":
                # Windows服务检查
                result = subprocess.run(
                    ['sc', 'query', service_name],
                    capture_output=True,
                    text=True,
                    timeout=10
                )
                if 'RUNNING' in result.stdout:
                    return {'status': 'RUNNING', 'service': service_name}
                elif 'STOPPED' in result.stdout:
                    return {'status': 'STOPPED', 'service': service_name}
                else:
                    return {'status': 'UNKNOWN', 'service': service_name, 'error': '服务不存在'}
            else:
                # Linux服务检查
                result = subprocess.run(
                    ['systemctl', 'is-active', service_name],
                    capture_output=True,
                    text=True,
                    timeout=10
                )
                if result.stdout.strip() == 'active':
                    return {'status': 'RUNNING', 'service': service_name}
                elif result.stdout.strip() == 'inactive':
                    return {'status': 'STOPPED', 'service': service_name}
                else:
                    return {'status': 'UNKNOWN', 'service': service_name, 'error': result.stderr}
        except subprocess.TimeoutExpired:
            return {'status': 'TIMEOUT', 'service': service_name, 'error': '检查超时'}
        except Exception as e:
            return {'status': 'ERROR', 'service': service_name, 'error': str(e)}
            
    @staticmethod
    def check_port_connectivity(host: str, port: int, timeout: int = 5) -> Dict:
        """检查端口连通性"""
        try:
            # The context manager closes the socket even if connect_ex raises
            # (for example, when the host name cannot be resolved)
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
                sock.settimeout(timeout)
                result = sock.connect_ex((host, port))
            
            if result == 0:
                return {'status': 'OPEN', 'host': host, 'port': port}
            else:
                return {'status': 'CLOSED', 'host': host, 'port': port}
        except Exception as e:
            return {'status': 'ERROR', 'host': host, 'port': port, 'error': str(e)}

class NetworkTester:
    """网络测试器"""
    @staticmethod
    def ping_host(host: str, count: int = 4) -> Dict:
        """Ping主机测试"""
        try:
            if platform.system() == "Windows":
                cmd = ['ping', '-n', str(count), host]
            else:
                cmd = ['ping', '-c', str(count), host]
                
            result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
            
            if result.returncode == 0:
                # 解析ping结果
                output = result.stdout
                if 'time=' in output:
                    times = []
                    for line in output.split('\n'):
                        if 'time=' in line:
                            try:
                                time_str = line.split('time=')[1].split('ms')[0]
                                times.append(float(time_str))
                            except Exception:
                                continue
                    avg_time = sum(times) / len(times) if times else 0
                    return {'status': 'SUCCESS', 'host': host, 'avg_time': avg_time, 'packet_loss': 0}
                else:
                    return {'status': 'SUCCESS', 'host': host, 'avg_time': 0, 'packet_loss': 0}
            else:
                return {'status': 'FAILED', 'host': host, 'error': 'Ping失败'}
        except subprocess.TimeoutExpired:
            return {'status': 'TIMEOUT', 'host': host, 'error': 'Ping超时'}
        except Exception as e:
            return {'status': 'ERROR', 'host': host, 'error': str(e)}
            
    @staticmethod
    def http_check(url: str, timeout: int = 10) -> Dict:
        """HTTP连通性检查"""
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            
            return {
                'status': 'SUCCESS',
                'url': url,
                'status_code': response.status_code,
                'response_time': response.elapsed.total_seconds(),
                'content_length': len(response.content)
            }
        except requests.Timeout:
            return {'status': 'TIMEOUT', 'url': url, 'error': '请求超时'}
        except requests.ConnectionError:
            return {'status': 'CONNECTION_ERROR', 'url': url, 'error': '连接失败'}
        except requests.HTTPError as e:
            return {'status': 'HTTP_ERROR', 'url': url, 'error': f'HTTP {e.response.status_code}'}
        except Exception as e:
            return {'status': 'ERROR', 'url': url, 'error': str(e)}

class SecurityAuditor:
    """安全审计器"""
    @staticmethod
    def check_firewall_status() -> Dict:
        """检查防火墙状态"""
        try:
            if platform.system() == "Windows":
                result = subprocess.run(
                    ['netsh', 'advfirewall', 'show', 'allprofiles'],
                    capture_output=True,
                    text=True,
                    timeout=10
                )
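                # netsh pads its columns ("State          ON"), so compare tokens
                # instead of searching for the literal 'State ON'. This assumes
                # English-language netsh output.
                states = [line.split() for line in result.stdout.splitlines()]
                if any(tokens[:2] == ['State', 'ON'] for tokens in states):
                    return {'status': 'ENABLED', 'platform': 'Windows'}
                else:
                    return {'status': 'DISABLED', 'platform': 'Windows'}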
            else:
                # 检查iptables或ufw
                try:
                    ufw_result = subprocess.run(['ufw', 'status'], capture_output=True, text=True, timeout=5)
                    if 'Status: active' in ufw_result.stdout:
                        return {'status': 'ENABLED', 'platform': 'Linux', 'firewall': 'ufw'}
                    elif 'Status: inactive' in ufw_result.stdout:
                        return {'status': 'DISABLED', 'platform': 'Linux', 'firewall': 'ufw'}
                except Exception:
                    pass
                    
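                try:
                    # `iptables -L` always prints chain headers, so non-empty output
                    # says nothing about rules. `iptables -S` prints one line per
                    # default policy (-P) and per rule (-A), which can be inspected.
                    iptables_result = subprocess.run(['iptables', '-S'], capture_output=True, text=True, timeout=5)
                    if iptables_result.returncode == 0:
                        lines = iptables_result.stdout.splitlines()
                        # Heuristic: any explicit rule or non-ACCEPT default policy counts as enabled
                        active = any(
                            l.startswith('-A') or (l.startswith('-P') and not l.endswith('ACCEPT'))
                            for l in lines
                        )
                        return {'status': 'ENABLED' if active else 'DISABLED', 'platform': 'Linux', 'firewall': 'iptables'}
                except Exception:
                    pass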
                    
                return {'status': 'UNKNOWN', 'platform': 'Linux'}
        except Exception as e:
            return {'status': 'ERROR', 'error': str(e)}
            
    @staticmethod
    def check_open_ports() -> Dict:
        """检查开放端口"""
        try:
            connections = psutil.net_connections(kind='inet')
            open_ports = []
            
            for conn in connections:
                if conn.status == 'LISTEN':
                    open_ports.append({
                        'port': conn.laddr.port,
                        'address': conn.laddr.ip,
                        'family': str(conn.family),
                        'type': str(conn.type)
                    })
                    
            return {
                'open_ports': open_ports,
                'count': len(open_ports),
                'status': 'SUCCESS'
            }
        except Exception as e:
            return {'status': 'ERROR', 'error': str(e)}

class HealthChecker:
    """健康检查主类"""
    def __init__(self, config_file: str = None):
        self.config_file = config_file
        self.config = {}
        self.check_results = {}
        
        # 加载配置
        self.load_config()
        
    def load_config(self):
        """加载配置文件"""
        if not self.config_file or not os.path.exists(self.config_file):
            logger.info("未指定配置文件或文件不存在,使用默认配置")
            self.config = self._create_default_config()
            return
            
        try:
            with open(self.config_file, 'r', encoding='utf-8') as f:
                if self.config_file.endswith(('.yaml', '.yml')):
                    # safe_load returns None for an empty file; fall back to {}
                    self.config = yaml.safe_load(f) or {}
                else:
                    self.config = json.load(f)
                    
            logger.info(f"成功加载配置文件: {self.config_file}")
            
        except Exception as e:
            logger.error(f"加载配置文件失败: {str(e)}")
            raise HealthCheckError(f"配置加载失败: {str(e)}")
            
    def _create_default_config(self) -> Dict:
        """创建默认配置"""
        return {
            "checks": {
                "system": {
                    "enabled": True,
                    "cpu_threshold": 80,
                    "memory_threshold": 85,
                    "disk_threshold": 85
                },
                "services": {
                    "enabled": True,
                    "services": ["ssh", "nginx", "mysql"]
                },
                "network": {
                    "enabled": True,
                    "hosts": ["8.8.8.8", "google.com"],
                    "urls": ["https://www.google.com", "https://www.github.com"]
                },
                "security": {
                    "enabled": True
                }
            },
            "report": {
                "formats": ["html", "json"],
                "output_dir": "./reports",
                "retention_days": 30
            },
            "notifications": {
                "email": {
                    "enabled": False,
                    "smtp_server": "smtp.example.com",
                    "smtp_port": 587,
                    "sender": "healthcheck@example.com",
                    "password": "your_password",
                    "recipients": ["admin@example.com"]
                }
            }
        }
        
    def perform_health_check(self) -> Dict:
        """执行健康检查"""
        logger.info("开始执行系统健康检查...")
        start_time = datetime.now()
        
        # 初始化结果
        self.check_results = {
            'timestamp': start_time.isoformat(),
            'system_info': {
                'hostname': platform.node(),
                'platform': platform.system(),
                'platform_version': platform.version(),
                'architecture': platform.machine(),
                'python_version': platform.python_version()
            }
        }
        
        # 执行各项检查
        checks_config = self.config.get('checks', {})
        
        # 系统资源检查
        if checks_config.get('system', {}).get('enabled', True):
            self.check_results['system_metrics'] = self._check_system_metrics(checks_config.get('system', {}))
            
        # 服务检查
        if checks_config.get('services', {}).get('enabled', True):
            self.check_results['services'] = self._check_services(checks_config.get('services', {}))
            
        # 网络检查
        if checks_config.get('network', {}).get('enabled', True):
            self.check_results['network'] = self._check_network(checks_config.get('network', {}))
            
        # 安全检查
        if checks_config.get('security', {}).get('enabled', True):
            self.check_results['security'] = self._check_security()
            
        # 计算总体健康状态
        self.check_results['overall_status'] = self._calculate_overall_status()
        
        end_time = datetime.now()
        self.check_results['duration'] = (end_time - start_time).total_seconds()
        
        logger.info(f"健康检查完成,耗时: {self.check_results['duration']:.2f}秒")
        return self.check_results
        
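    def _check_system_metrics(self, config: Dict) -> Dict:
        """Collect system metrics and apply the configured thresholds."""
        logger.info("Checking system resources...")
        results = {
            'cpu': SystemMetrics.get_cpu_info(),
            'memory': SystemMetrics.get_memory_info(),
            'disk': SystemMetrics.get_disk_info(),
            'network': SystemMetrics.get_network_info(),
        }
        
        # Re-evaluate each status against the thresholds from the config;
        # the collectors themselves only apply their built-in defaults.
        rules = [
            ('cpu', 'cpu_avg_percent', 'cpu_status', config.get('cpu_threshold', 80)),
            ('memory', 'memory_percent', 'memory_status', config.get('memory_threshold', 85)),
            ('disk', 'overall_percent', 'disk_status', config.get('disk_threshold', 85)),
        ]
        for section, value_key, status_key, threshold in rules:
            info = results[section]
            # Leave collector errors untouched
            if info.get(status_key) == 'ERROR' or value_key not in info:
                continue
            info[status_key] = 'HEALTHY' if info[value_key] < threshold else 'WARNING'
        
        return results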
        
    def _check_services(self, config: Dict) -> List[Dict]:
        """检查服务状态"""
        logger.info("检查服务状态...")
        services = config.get('services', [])
        results = []
        
        for service in services:
            result = ServiceChecker.check_service_status(service)
            results.append(result)
            
        return results
        
    def _check_network(self, config: Dict) -> Dict:
        """检查网络连接"""
        logger.info("检查网络连接...")
        results = {
            'ping_results': [],
            'http_results': [],
            'port_results': []
        }
        
        # Ping测试
        hosts = config.get('hosts', [])
        for host in hosts:
            result = NetworkTester.ping_host(host)
            results['ping_results'].append(result)
            
        # HTTP测试
        urls = config.get('urls', [])
        for url in urls:
            result = NetworkTester.http_check(url)
            results['http_results'].append(result)
            
        return results
        
    def _check_security(self) -> Dict:
        """安全检查"""
        logger.info("执行安全检查...")
        results = {}
        
        # 防火墙检查
        firewall_status = SecurityAuditor.check_firewall_status()
        results['firewall'] = firewall_status
        
        # 开放端口检查
        open_ports = SecurityAuditor.check_open_ports()
        results['open_ports'] = open_ports
        
        return results
        
    def _calculate_overall_status(self) -> str:
        """计算总体健康状态"""
        statuses = []
        
        # 检查系统指标状态
        if 'system_metrics' in self.check_results:
            metrics = self.check_results['system_metrics']
            statuses.append(metrics.get('cpu', {}).get('cpu_status', 'UNKNOWN'))
            statuses.append(metrics.get('memory', {}).get('memory_status', 'UNKNOWN'))
            statuses.append(metrics.get('disk', {}).get('disk_status', 'UNKNOWN'))
            
        # 检查服务状态
        if 'services' in self.check_results:
            for service in self.check_results['services']:
                if service.get('status') not in ['RUNNING', 'SUCCESS']:
                    statuses.append('WARNING')
                    
        # 检查网络状态
        if 'network' in self.check_results:
            network = self.check_results['network']
            for ping_result in network.get('ping_results', []):
                if ping_result.get('status') != 'SUCCESS':
                    statuses.append('WARNING')
            for http_result in network.get('http_results', []):
                if http_result.get('status') != 'SUCCESS':
                    statuses.append('WARNING')
                    
        # 如果有任何错误状态,返回ERROR
        if 'ERROR' in statuses:
            return 'ERROR'
        # 如果有任何警告状态,返回WARNING
        elif 'WARNING' in statuses:
            return 'WARNING'
        # 否则返回HEALTHY
        else:
            return 'HEALTHY'

class ReportGenerator:
    """报告生成器"""
    @staticmethod
    def generate_html_report(data: Dict, output_file: str = None) -> str:
        """生成HTML报告"""
        if not output_file:
            timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
            output_file = f'health_report_{timestamp}.html'
            
        try:
            html_content = f"""
<!DOCTYPE html>
<html lang="zh-CN">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>系统健康检查报告</title>
    <style>
        body {{ font-family: Arial, sans-serif; margin: 20px; }}
        .header {{ background-color: #f0f0f0; padding: 20px; border-radius: 5px; }}
        .status-healthy {{ color: green; }}
        .status-warning {{ color: orange; }}
        .status-error {{ color: red; }}
        .section {{ margin: 20px 0; padding: 15px; border: 1px solid #ddd; border-radius: 5px; }}
        .metric {{ display: flex; justify-content: space-between; margin: 5px 0; }}
        table {{ width: 100%; border-collapse: collapse; margin: 10px 0; }}
        th, td {{ border: 1px solid #ddd; padding: 8px; text-align: left; }}
        th {{ background-color: #f2f2f2; }}
    </style>
</head>
<body>
    <div class="header">
        <h1>系统健康检查报告</h1>
        <p>检查时间: {data.get('timestamp', 'N/A')}</p>
        <p>总体状态: <span class="status-{'healthy' if data.get('overall_status') == 'HEALTHY' else 'warning' if data.get('overall_status') == 'WARNING' else 'error'}">{data.get('overall_status', 'UNKNOWN')}</span></p>
        <p>检查耗时: {data.get('duration', 0):.2f}秒</p>
    </div>
    
    <div class="section">
        <h2>系统信息</h2>
        <div class="metric">
            <span>主机名:</span>
            <span>{data.get('system_info', {}).get('hostname', 'N/A')}</span>
        </div>
        <div class="metric">
            <span>操作系统:</span>
            <span>{data.get('system_info', {}).get('platform', 'N/A')} {data.get('system_info', {}).get('platform_version', '')}</span>
        </div>
        <div class="metric">
            <span>架构:</span>
            <span>{data.get('system_info', {}).get('architecture', 'N/A')}</span>
        </div>
    </div>
    
    <div class="section">
        <h2>系统资源</h2>
        <h3>CPU</h3>
        <div class="metric">
            <span>使用率:</span>
            <span>{data.get('system_metrics', {}).get('cpu', {}).get('cpu_avg_percent', 0):.2f}%</span>
        </div>
        <div class="metric">
            <span>核心数:</span>
            <span>{data.get('system_metrics', {}).get('cpu', {}).get('cpu_count', 'N/A')} (逻辑: {data.get('system_metrics', {}).get('cpu', {}).get('logical_cpu_count', 'N/A')})</span>
        </div>
        
        <h3>内存</h3>
        <div class="metric">
            <span>使用率:</span>
            <span>{data.get('system_metrics', {}).get('memory', {}).get('memory_percent', 0):.2f}%</span>
        </div>
        <div class="metric">
            <span>总内存:</span>
            <span>{data.get('system_metrics', {}).get('memory', {}).get('memory_total', 0) / (1024**3):.2f} GB</span>
        </div>
        
        <h3>磁盘</h3>
        <table>
            <tr><th>挂载点</th><th>使用率</th><th>总容量</th><th>已使用</th></tr>
            { ''.join([f"<tr><td>{disk.get('mountpoint', 'N/A')}</td><td>{disk.get('percent', 0):.2f}%</td><td>{disk.get('total', 0) / (1024**3):.2f} GB</td><td>{disk.get('used', 0) / (1024**3):.2f} GB</td></tr>" for disk in data.get('system_metrics', {}).get('disk', {}).get('disks', [])]) }
        </table>
    </div>
    
    <div class="section">
        <h2>服务状态</h2>
        <table>
            <tr><th>服务名</th><th>状态</th></tr>
            { ''.join([f"<tr><td>{service.get('service', 'N/A')}</td><td class='status-{'healthy' if service.get('status') == 'RUNNING' else 'error'}'>{service.get('status', 'N/A')}</td></tr>" for service in data.get('services', [])]) }
        </table>
    </div>
    
    <div class="section">
        <h2>网络连接</h2>
        <h3>Ping测试</h3>
        <table>
            <tr><th>主机</th><th>状态</th><th>平均延迟(ms)</th></tr>
            { ''.join([f"<tr><td>{ping.get('host', 'N/A')}</td><td class='status-{'healthy' if ping.get('status') == 'SUCCESS' else 'error'}'>{ping.get('status', 'N/A')}</td><td>{ping.get('avg_time', 0):.2f}</td></tr>" for ping in data.get('network', {}).get('ping_results', [])]) }
        </table>
        
        <h3>HTTP测试</h3>
        <table>
            <tr><th>URL</th><th>状态码</th><th>响应时间(s)</th></tr>
            { ''.join([f"<tr><td>{http.get('url', 'N/A')}</td><td>{http.get('status_code', 'N/A')}</td><td>{http.get('response_time', 0):.2f}</td></tr>" for http in data.get('network', {}).get('http_results', [])]) }
        </table>
    </div>
</body>
</html>
            """
            
            # 确保输出目录存在
            os.makedirs(os.path.dirname(output_file) if os.path.dirname(output_file) else '.', exist_ok=True)
            
            with open(output_file, 'w', encoding='utf-8') as f:
                f.write(html_content)
                
            logger.info(f"HTML报告已生成: {output_file}")
            return output_file
            
        except Exception as e:
            logger.error(f"生成HTML报告失败: {str(e)}")
            raise ReportGenerationError(f"生成HTML报告失败: {str(e)}")
            
    @staticmethod
    def generate_json_report(data: Dict, output_file: str = None) -> str:
        """生成JSON报告"""
        if not output_file:
            timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
            output_file = f'health_report_{timestamp}.json'
            
        try:
            # 确保输出目录存在
            os.makedirs(os.path.dirname(output_file) if os.path.dirname(output_file) else '.', exist_ok=True)
            
            with open(output_file, 'w', encoding='utf-8') as f:
                json.dump(data, f, indent=2, ensure_ascii=False)
                
            logger.info(f"JSON报告已生成: {output_file}")
            return output_file
            
        except Exception as e:
            logger.error(f"生成JSON报告失败: {str(e)}")
            raise ReportGenerationError(f"生成JSON报告失败: {str(e)}")

class NotificationManager:
    """通知管理器"""
    def __init__(self, config: Dict):
        self.config = config
        
    def send_email(self, subject: str, body: str, attachments: List[str] = None):
        """发送邮件通知"""
        email_config = self.config.get('notifications', {}).get('email', {})
        if not email_config.get('enabled', False):
            return
            
        try:
            msg = MIMEMultipart()
            msg['From'] = email_config.get('sender')
            msg['To'] = ', '.join(email_config.get('recipients', []))
            msg['Subject'] = subject
            
            msg.attach(MIMEText(body, 'html'))
            
            # 添加附件
            if attachments:
                for file_path in attachments:
                    if os.path.exists(file_path):
                        with open(file_path, "rb") as attachment:
                            part = MIMEBase('application', 'octet-stream')
                            part.set_payload(attachment.read())
                            
                        encoders.encode_base64(part)
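                        # Passing filename as a parameter lets the email package
                        # quote it correctly (the original had a stray space)
                        part.add_header(
                            'Content-Disposition',
                            'attachment',
                            filename=os.path.basename(file_path)
                        )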
                        msg.attach(part)
            
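            # The context manager sends QUIT and closes the connection even if
            # login or sending fails; the timeout keeps a dead server from hanging the run
            with smtplib.SMTP(email_config.get('smtp_server'), email_config.get('smtp_port', 587), timeout=30) as server:
                server.starttls()
                server.login(email_config.get('sender'), email_config.get('password'))
                server.send_message(msg)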
            
            logger.info("健康检查报告已通过邮件发送")
            
        except Exception as e:
            logger.error(f"发送邮件失败: {str(e)}")

def create_sample_config():
    """创建示例配置文件"""
    sample_config = {
        "checks": {
            "system": {
                "enabled": True,
                "cpu_threshold": 80,
                "memory_threshold": 85,
                "disk_threshold": 85
            },
            "services": {
                "enabled": True,
                "services": ["ssh", "nginx", "mysql"]
            },
            "network": {
                "enabled": True,
                "hosts": ["8.8.8.8", "google.com", "github.com"],
                "urls": ["https://www.google.com", "https://www.github.com", "https://api.github.com"]
            },
            "security": {
                "enabled": True
            }
        },
        "report": {
            "formats": ["html", "json"],
            "output_dir": "./reports",
            "retention_days": 30
        },
        "notifications": {
            "email": {
                "enabled": True,
                "smtp_server": "smtp.gmail.com",
                "smtp_port": 587,
                "sender": "your_email@gmail.com",
                "password": "your_app_password",
                "recipients": ["admin@example.com", "ops@example.com"]
            }
        }
    }
    
    with open('health_check_sample_config.json', 'w', encoding='utf-8') as f:
        json.dump(sample_config, f, indent=2, ensure_ascii=False)
    logger.info("Sample configuration file created: health_check_sample_config.json")

def clean_old_reports(output_dir: str, retention_days: int):
    """Delete files in output_dir last modified more than retention_days ago."""
    try:
        cutoff_date = datetime.now() - timedelta(days=retention_days)
        
        for filename in os.listdir(output_dir):
            file_path = os.path.join(output_dir, filename)
            if os.path.isfile(file_path):
                file_modified = datetime.fromtimestamp(os.path.getmtime(file_path))
                if file_modified < cutoff_date:
                    os.remove(file_path)
                    logger.info(f"Deleted old report: {filename}")
                    
    except Exception as e:
        logger.error(f"Failed to clean up old reports: {e}")

def main():
    parser = argparse.ArgumentParser(description='System health check and reporting tool')
    parser.add_argument('-c', '--config', help='Path to the configuration file')
    parser.add_argument('--check', action='store_true', help='Run a health check')
    parser.add_argument('--sample-config', action='store_true', help='Create a sample configuration file')
    parser.add_argument('--clean-reports', action='store_true', help='Delete old reports')
    
    args = parser.parse_args()
    
    if args.sample_config:
        create_sample_config()
        return
        
    if args.clean_reports:
        checker = HealthChecker(args.config)
        report_config = checker.config.get('report', {})
        output_dir = report_config.get('output_dir', './reports')
        retention_days = report_config.get('retention_days', 30)
        clean_old_reports(output_dir, retention_days)
        return
        
    if args.check:
        # Run the health check
        checker = HealthChecker(args.config)
        results = checker.perform_health_check()
        
        # Generate reports
        report_config = checker.config.get('report', {})
        output_dir = report_config.get('output_dir', './reports')
        formats = report_config.get('formats', ['html'])
        # Make sure the output directory exists before any report is written
        os.makedirs(output_dir, exist_ok=True)
        
        generated_files = []
        for fmt in formats:
            try:
                if fmt == 'html':
                    html_file = ReportGenerator.generate_html_report(results, os.path.join(output_dir, ''))
                    generated_files.append(html_file)
                elif fmt == 'json':
                    json_file = ReportGenerator.generate_json_report(results, os.path.join(output_dir, ''))
                    generated_files.append(json_file)
                else:
                    logger.warning(f"Skipping unsupported report format: {fmt}")
            except Exception as e:
                logger.error(f"Failed to generate {fmt} report: {e}")
                
        # Send notifications
        notification_manager = NotificationManager(checker.config)
        if generated_files:
            subject = f"System health check report - {results.get('overall_status', 'UNKNOWN')}"
            body = f"""
            <h2>System Health Check Report</h2>
            <p>Checked at: {results.get('timestamp', 'N/A')}</p>
            <p>Overall status: {results.get('overall_status', 'UNKNOWN')}</p>
            <p>Duration: {results.get('duration', 0):.2f} s</p>
            <p>See the attachments for the full report.</p>
            """
            notification_manager.send_email(subject, body, generated_files)
            
        # Print a brief summary
        print(f"Health check finished - status: {results.get('overall_status', 'UNKNOWN')}")
        print(f"Checked at: {results.get('timestamp', 'N/A')}")
        print(f"Duration: {results.get('duration', 0):.2f} s")
        
    else:
        parser.print_help()

if __name__ == '__main__':
    main()

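Usage

1. Install dependencies

pip install psutil requests pyyaml

2. Create a configuration file

python health_checker.py --sample-config

This writes health_check_sample_config.json to the current directory. Copy or rename it to health_check_config.json and replace the placeholder values before running the commands below.

3. Run a health check

python health_checker.py --config health_check_config.json --check

4. Clean up old reports

python health_checker.py --config health_check_config.json --clean-reports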

Sample configuration

JSON configuration file

{
  "checks": {
    "system": {
      "enabled": true,
      "cpu_threshold": 80,
      "memory_threshold": 85,
      "disk_threshold": 85
    },
    "services": {
      "enabled": true,
      "services": ["ssh", "nginx", "mysql"]
    },
    "network": {
      "enabled": true,
      "hosts": ["8.8.8.8", "google.com", "github.com"],
      "urls": ["https://www.google.com", "https://www.github.com", "https://api.github.com"]
    },
    "security": {
      "enabled": true
    }
  },
  "report": {
    "formats": ["html", "json"],
    "output_dir": "./reports",
    "retention_days": 30
  },
  "notifications": {
    "email": {
      "enabled": true,
      "smtp_server": "smtp.gmail.com",
      "smtp_port": 587,
      "sender": "your_email@gmail.com",
      "password": "your_app_password",
      "recipients": ["admin@example.com", "ops@example.com"]
    }
  }
}
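
Before a scheduled run, it can help to catch obvious mistakes in this file. The following is a minimal sketch of such a check, assuming the key layout shown above and the two report formats that `main()` handles. The `validate_config` helper and its messages are hypothetical and not part of the tool.

```python
import json
from typing import List


def validate_config(config: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []

    system = config.get("checks", {}).get("system", {})
    for key in ("cpu_threshold", "memory_threshold", "disk_threshold"):
        value = system.get(key)
        if value is None:
            continue  # missing thresholds are not treated as errors in this sketch
        # bool is a subclass of int, so it is rejected explicitly
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0 < value <= 100:
            problems.append(
                f"checks.system.{key} must be a percentage in (0, 100], got {value!r}"
            )

    report = config.get("report", {})
    unknown = set(report.get("formats", [])) - {"html", "json"}
    if unknown:
        problems.append(
            f"report.formats contains unsupported values: {sorted(unknown)}"
        )

    email = config.get("notifications", {}).get("email", {})
    if email.get("enabled", False):
        for key in ("smtp_server", "sender", "password", "recipients"):
            if not email.get(key):
                problems.append(
                    f"notifications.email.{key} is required when email is enabled"
                )

    return problems


if __name__ == "__main__":
    sample = json.loads('{"checks": {"system": {"cpu_threshold": 180}}}')
    for problem in validate_config(sample):
        print(problem)
```

Returning a list instead of raising on the first error lets a caller report every problem in one pass.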

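Advanced features

1. Comprehensive health assessment

Monitors system resources (CPU, memory, disk, network) and checks service status, network connectivity, and security settings.

2. Automatic status determination

Computes an overall health status from the individual check results so that potential problems stand out quickly.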
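
As a rough illustration of how an overall status can be derived from individual results, the sketch below takes the worst severity among them. The status names (`OK`, `UNKNOWN`, `WARNING`, `CRITICAL`), the `status` key, and the `overall_status_from` helper are assumptions made for this example, and `HealthChecker.perform_health_check` may compute its `overall_status` differently.

```python
from typing import Dict

# Hypothetical severity ranking; a higher number means a worse state.
SEVERITY = {"OK": 0, "UNKNOWN": 1, "WARNING": 2, "CRITICAL": 3}


def overall_status_from(check_results: Dict[str, dict]) -> str:
    """Return the worst status found across all check results."""
    worst = "OK"
    for result in check_results.values():
        status = result.get("status", "UNKNOWN")
        if status not in SEVERITY:
            status = "UNKNOWN"  # unrecognized values are treated as unknown
        if SEVERITY[status] > SEVERITY[worst]:
            worst = status
    return worst


if __name__ == "__main__":
    sample = {
        "cpu": {"status": "OK", "value": 42.0},
        "memory": {"status": "WARNING", "value": 88.5},
        "disk": {"status": "OK", "value": 61.0},
    }
    print(overall_status_from(sample))  # prints WARNING
```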

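3. Multiple report formats

Produces HTML reports for reading and JSON reports for machine processing, covering different use cases.

4. Automated notifications

Sends the report by email to the configured recipients once a check completes.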
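
The sample configuration stores the SMTP password in plain text. One common alternative is to read it from an environment variable and fall back to the file only when the variable is unset. The sketch below assumes a hypothetical variable name, `HEALTH_CHECK_SMTP_PASSWORD`, and a hypothetical helper, `resolve_smtp_password`; neither exists in the tool as shown.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "HEALTH_CHECK_SMTP_PASSWORD"  # hypothetical name chosen for this example


def resolve_smtp_password(email_config: Mapping[str, object],
                          env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Prefer the environment variable; fall back to the config file value."""
    from_env = env.get(ENV_VAR)
    if from_env:
        return from_env
    password = email_config.get("password")
    return str(password) if password else None
```

The returned value could then be passed to `server.login` in place of `email_config.get('password')`.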

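Best practices

1. Run checks regularly

  • Set up a scheduled task so that health checks run at regular intervals
  • Adjust the check frequency to how critical the system is
  • Watch how the results change over time, not just the latest run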
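
On most servers a cron job or systemd timer is the usual way to schedule the check. Where neither is available, a small Python loop can serve the same purpose. The sketch below is one possible approach: `run_periodically` and `run_health_check` are hypothetical helpers, while the script name and flags are the ones from the usage section.

```python
import subprocess
import sys
import time
from typing import Callable, Optional


def run_health_check(config_path: str = "health_check_config.json") -> int:
    """Run the CLI once in a child process and return its exit code."""
    cmd = [sys.executable, "health_checker.py", "--config", config_path, "--check"]
    return subprocess.run(cmd, check=False).returncode


def run_periodically(task: Callable[[], object],
                     interval_seconds: float,
                     max_runs: Optional[int] = None,
                     sleep: Callable[[float], None] = time.sleep) -> int:
    """Call task every interval_seconds and return how many times it ran.

    max_runs=None loops forever. A failing task is reported and the loop continues.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        started = time.monotonic()
        try:
            task()
        except Exception as exc:  # keep the schedule alive after a failed run
            print(f"health check run failed: {exc}", file=sys.stderr)
        runs += 1
        if max_runs is not None and runs >= max_runs:
            break
        # Subtract the task's own duration so runs stay roughly one interval apart.
        elapsed = time.monotonic() - started
        sleep(max(0.0, interval_seconds - elapsed))
    return runs
```

For example, `run_periodically(run_health_check, interval_seconds=3600)` would run the check about once an hour until the process is stopped.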

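2. Tune the configuration

  • Adjust the check thresholds to match your actual environment
  • Customize the list of services and network targets to check
  • Choose notification recipients deliberately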
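
One way to tune thresholds per environment without duplicating the whole file is to layer a small override on top of a base configuration. The sketch below assumes the key layout of the sample configuration; `deep_merge` and the override values are illustrative only.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict in which values from override replace those in base.

    Nested dicts are merged key by key; lists and scalars are replaced whole.
    Neither input is modified.
    """
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


if __name__ == "__main__":
    base = {
        "checks": {
            "system": {"enabled": True, "cpu_threshold": 80, "memory_threshold": 85},
            "services": {"enabled": True, "services": ["ssh", "nginx", "mysql"]},
        }
    }
    # Hypothetical override for a busy database host.
    db_host = {
        "checks": {
            "system": {"cpu_threshold": 90},
            "services": {"services": ["ssh", "mysql"]},
        }
    }
    print(deep_merge(base, db_host)["checks"])
```

Replacing lists whole, instead of concatenating them, keeps it possible for an override to remove a service from the base list.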

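3. Manage reports

  • Clean up old reports regularly to save disk space
  • Archive important reports for later historical comparison
  • Combine this tool with other monitoring tools for fuller coverage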
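
Because `--clean-reports` deletes every file in the output directory that was last modified more than `retention_days` ago, reports worth keeping need to be copied elsewhere first. A minimal sketch, assuming a hypothetical `archive_reports` helper and an archive directory of your choosing:

```python
import os
import shutil
from typing import Iterable, List


def archive_reports(file_paths: Iterable[str], archive_dir: str) -> List[str]:
    """Copy the given report files into archive_dir and return the new paths.

    Paths that are not existing files are skipped. shutil.copy2 also copies
    file metadata such as the modification time.
    """
    os.makedirs(archive_dir, exist_ok=True)
    archived = []
    for path in file_paths:
        if not os.path.isfile(path):
            continue
        target = os.path.join(archive_dir, os.path.basename(path))
        shutil.copy2(path, target)
        archived.append(target)
    return archived
```

The archive directory should not be the report output directory itself, since the cleanup step removes old files found directly in that directory.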

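Summary

This tool provides a comprehensive, automated approach to system health monitoring. By checking key system metrics on a schedule and producing detailed reports, it helps administrators find and fix potential problems early and keep systems running reliably.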