LS-LINUX-003 Single-Server Operations Automation


I. Infrastructure Design (Single-Server Edition)

1. Core component selection
graph TD
    A[Control node] -->|SSH/API| B[Managed server]
    B --> C[Ansible]
    B --> D[Docker]
    B --> E[Prometheus]
    B --> F[ELK Stack]
2. Recommended technology stack
  • Configuration management: Ansible (agentless architecture)
  • Containerization: Docker + Docker Compose
  • Monitoring: Prometheus + Grafana + Node Exporter
  • Logging: Elasticsearch + Logstash + Kibana (ELK)
  • Continuous delivery: GitLab CI/CD or Jenkins
  • Web management: Flask/Django + Bootstrap

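II. Phased Implementation Plan

Phase 1: Infrastructure automation (1-3 days)

Core task: establish the basic operations framework

# Install core tools (refresh the package index first)
sudo apt-get update
sudo apt-get install -y ansible python3-pip git
# Note: this pip package is the legacy Compose v1 and does not install Docker Engine itself
pip3 install docker-compose

# Create the Ansible inventory file (writing under /opt requires root)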
mkdir -p /opt/automation/inventories
echo "[single_server]
192.168.1.100 ansible_user=root" > /opt/automation/inventories/hosts
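The inventory written above is a plain INI-style file: a group header followed by one host per line with optional `key=value` variables. As a minimal sketch (the `render_inventory` helper is hypothetical and not part of Ansible), the same text could be generated from data so that hosts can be added later without hand-editing:

```python
def render_inventory(groups):
    """Render an INI-style Ansible inventory from {group: {host: {var: value}}}."""
    lines = []
    for group, hosts in groups.items():
        lines.append(f"[{group}]")
        for host, host_vars in hosts.items():
            # Host variables follow the address as space-separated key=value pairs
            pairs = " ".join(f"{key}={value}" for key, value in host_vars.items())
            lines.append(f"{host} {pairs}".rstrip())
        lines.append("")  # blank line after each group
    return "\n".join(lines)


inventory = render_inventory(
    {"single_server": {"192.168.1.100": {"ansible_user": "root"}}}
)
print(inventory)
```

Writing the returned string to `/opt/automation/inventories/hosts` should reproduce the file created by the `echo` command.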
Phase 2: Monitoring system setup (example configuration)
# docker-compose-monitor.yml
version: '3'
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"

  node_exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    pid: "host"
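The Compose file mounts `./prometheus.yml`, but that file is not shown. The sketch below generates a minimal one under stated assumptions: Prometheus scrapes itself and Node Exporter, the exporter is reachable by its Compose service name `node_exporter`, and the job names and 15-second interval are illustrative choices rather than values taken from this guide.

```python
def render_prometheus_config(targets, scrape_interval="15s"):
    """Return the text of a minimal prometheus.yml for {job_name: "host:port"}."""
    lines = [
        "global:",
        f"  scrape_interval: {scrape_interval}",
        "scrape_configs:",
    ]
    for job_name, target in targets.items():
        lines += [
            f"  - job_name: {job_name}",
            "    static_configs:",
            f"      - targets: ['{target}']",
        ]
    return "\n".join(lines) + "\n"


# Assumed targets: Prometheus itself and the node_exporter service from the Compose file
config_text = render_prometheus_config(
    {"prometheus": "localhost:9090", "node": "node_exporter:9100"}
)
print(config_text)
```

Saving the output as `prometheus.yml` next to the Compose file would give the mounted path something to load.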
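Phase 3: Web console development (Python example)
# app.py
import re
import subprocess

from flask import Flask, abort

app = Flask(__name__)

# Accept only simple service names so the URL cannot smuggle in paths or options
SERVICE_NAME = re.compile(r"^[A-Za-z0-9_-]+$")


@app.route('/deploy/<service>')
def deploy_service(service):
    if not SERVICE_NAME.fullmatch(service):
        abort(400)
    # Pass the command as a list (no shell) so the service name is never shell-parsed
    result = subprocess.run(
        ["ansible-playbook", "-i", "inventories/hosts", f"deploy_{service}.yml"],
        capture_output=True,
        text=True,
    )
    return f"Deployment output:\n{result.stdout}"


if __name__ == '__main__':
    # Listens on all interfaces without authentication; restrict access before real use
    app.run(host='0.0.0.0', port=5000)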
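Because the route turns a URL segment into a playbook filename, one cautious option is to accept only services for which a `deploy_<service>.yml` file actually exists. The following is a sketch of that allow-list idea; the helper names `available_services` and `build_deploy_command` and the playbook directory layout are assumptions made for illustration.

```python
from pathlib import Path


def available_services(playbook_dir):
    """Return the service names that have a deploy_<service>.yml playbook."""
    return sorted(
        path.stem[len("deploy_"):] for path in Path(playbook_dir).glob("deploy_*.yml")
    )


def build_deploy_command(service, playbook_dir, inventory="inventories/hosts"):
    """Build the ansible-playbook argument list, rejecting unknown services."""
    if service not in available_services(playbook_dir):
        raise ValueError(f"unknown service: {service!r}")
    playbook = Path(playbook_dir) / f"deploy_{service}.yml"
    # A list of arguments is executed without a shell, so nothing is re-parsed
    return ["ansible-playbook", "-i", inventory, str(playbook)]
```

The resulting list can be handed directly to `subprocess.run(..., capture_output=True, text=True)`, and a `ValueError` can be mapped to an HTTP 404 in the route.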

III. Implementing Key Automation Scenarios

1. Automated security hardening
# security_hardening.yml
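- name: Harden the single server
  hosts: single_server
  become: yes
  tasks:
    - name: Apply security updates
      apt:
        upgrade: dist
        update_cache: yes

    - name: Configure firewall
      ufw:
        rule: allow
        port: "{{ item }}"
        proto: tcp
      loop:
        - 22
        - 80
        - 443

    # The sample inventory connects as root: switch ansible_user to a
    # sudo-capable account first, or this task locks Ansible out.
    - name: Disable root SSH login
      lineinfile:
        path: /etc/ssh/sshd_config
        regexp: '^#?PermitRootLogin'
        line: 'PermitRootLogin no'
      notify: restart sshd

  handlers:
    - name: restart sshd
      service:
        name: ssh  # the service unit is named "ssh" on Debian/Ubuntu
        state: restarted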
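The `lineinfile` task is safe to run repeatedly because it replaces the last line matching the pattern, or appends the line when nothing matches. The sketch below is a simplified model of that behaviour in Python, not Ansible's implementation; the `ensure_line` name and the sample `sshd_config` text are made up for illustration, and the pattern is written to also match a commented-out default.

```python
import re


def ensure_line(text, regexp, line):
    """Replace the last line matching regexp with line, or append line if none match."""
    lines = text.splitlines()
    pattern = re.compile(regexp)
    matches = [i for i, existing in enumerate(lines) if pattern.search(existing)]
    if matches:
        lines[matches[-1]] = line
    else:
        lines.append(line)
    return "\n".join(lines) + "\n"


before = "Port 22\n#PermitRootLogin prohibit-password\n"
after = ensure_line(before, r"^#?PermitRootLogin", "PermitRootLogin no")
print(after)

# A second run finds the line already in place and leaves the text unchanged
assert ensure_line(after, r"^#?PermitRootLogin", "PermitRootLogin no") == after
```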
2. Intelligent monitoring and alerting
# prometheus/rules.yml
groups:
- name: node_alerts
  rules:
  - alert: HighCPUUsage
    # rate() is preferred over irate() for alerting because it averages over the whole window
    expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
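The alert expression takes the per-second growth of each core's idle counter (a fraction between 0 and 1), averages it across cores, and subtracts the result from 100 to get a busy percentage. A worked example of that arithmetic, using made-up sample values and a hypothetical `cpu_usage_percent` helper:

```python
def cpu_usage_percent(idle_seconds_per_core, window_seconds):
    """Mirror 100 - avg(idle rate) * 100 for a single instance.

    idle_seconds_per_core: growth of the idle counter for each core over the window.
    """
    idle_fractions = [idle / window_seconds for idle in idle_seconds_per_core]
    average_idle = sum(idle_fractions) / len(idle_fractions)
    return 100 - average_idle * 100


# Two cores over a 300 s (5m) window: one idle for 30 s, the other for 60 s
usage = cpu_usage_percent([30, 60], 300)
print(round(usage, 2))
```

With these numbers the average idle fraction is 0.15, so usage is about 85%, which is above the 80 threshold; the alert would fire only if that stays true for the full `for: 5m` period.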

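IV. Directions for Further Expansion

1. Architecture evolution roadmap
graph LR
    A[Single-server automation] --> B[Container orchestration]
    B --> C[Hybrid cloud management]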
    C --> D[AIOps]

    style A fill:#f9f,stroke:#333
    style D fill:#ccf,stroke:#f66
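2. Suggested extensions
  • Canary release system: use Nginx + Lua to split traffic
  • Configuration center: Apollo or Consul for dynamic configuration
  • Disaster recovery drills: chaos engineering tools such as Chaos Mesh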
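The traffic-splitting idea behind a canary release can be illustrated independently of Nginx and Lua. The sketch below is written in Python purely for illustration: each user is hashed into one of 100 stable buckets, and only the buckets below the canary percentage are sent to the new version. The hashing scheme and the `route_to_canary` name are assumptions, not a description of how an Nginx + Lua setup must work.

```python
import hashlib


def route_to_canary(user_id, canary_percent):
    """Return True if this user falls into the canary share of traffic."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return bucket < canary_percent


users = [f"user-{n}" for n in range(1000)]
canary_share = sum(route_to_canary(user, 10) for user in users) / len(users)
print(f"{canary_share:.1%} of users routed to canary")
```

Hashing a stable identifier rather than picking randomly per request keeps each user on the same version for the whole rollout.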

V. Security Considerations

  1. Secrets management:

    # Encrypt sensitive data with Ansible Vault
    ansible-vault encrypt_string 'super_secret' --name 'db_password'
    
  2. Access control matrix:

    # RBAC example
    def check_permission(user, action):
        permissions = {
            'admin': ['deploy', 'restart', 'config'],
            'developer': ['deploy', 'logs'],
            'guest': ['view']
        }
        return action in permissions.get(user.role, [])
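To show how the permission check might be wired into actual operations, here is a self-contained sketch that repeats the role table from the example and adds a decorator. The `User` dataclass, the `require_permission` decorator, and `restart_service` are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass
from functools import wraps

PERMISSIONS = {
    "admin": ["deploy", "restart", "config"],
    "developer": ["deploy", "logs"],
    "guest": ["view"],
}


@dataclass
class User:
    name: str
    role: str


def check_permission(user, action):
    # Unknown roles get an empty permission list, so they are denied by default
    return action in PERMISSIONS.get(user.role, [])


def require_permission(action):
    """Decorator that refuses to run the wrapped function without the permission."""
    def decorator(func):
        @wraps(func)
        def wrapper(user, *args, **kwargs):
            if not check_permission(user, action):
                raise PermissionError(f"{user.name} may not {action}")
            return func(user, *args, **kwargs)
        return wrapper
    return decorator


@require_permission("restart")
def restart_service(user, service):
    return f"{service} restarted by {user.name}"
```

Calling `restart_service(User("alice", "admin"), "nginx")` succeeds, while the same call with a `developer` role raises `PermissionError`.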
    

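VI. Recommended Learning Resources

  1. Books

    • 《Ansible权威指南》 (literally "The Definitive Guide to Ansible"), 李松涛 (Li Songtao), China Machine Press
    • 《SRE:Google运维解密》 (Chinese edition of "Site Reliability Engineering: How Google Runs Production Systems"), Publishing House of Electronics Industry, Beijing
  2. Courses

    • GeekTime (极客时间): 《运维体系管理课》 (operations management course)
    • Coursera: Google IT Automation with Python
  3. Toolchain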

    graph LR
        A[Development] --> B[GitLab]
        B --> C[Jenkins]
        C --> D[Ansible]
        D --> E[Kubernetes]
        E --> F[Prometheus]
        F --> G[Grafana]
    

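Start with Ansible and Docker, then add monitoring and logging step by step. A single-server environment is an excellent sandbox for learning operations automation; the key is to establish a standardized operations workflow. When you later scale out to a cluster, pay particular attention to standardizing configuration management and refactoring services to be stateless.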