Fly.io 部署维护实操指南(AI生成)

97 阅读6分钟

Fly.io 部署维护实操指南

开发和运维操作完整手册


目录

  1. 快速开始检查清单
  2. 部署操作手册
  3. 日常运维操作
  4. 扩缩容管理
  5. 数据库和存储运维
  6. 网络和安全配置
  7. CI/CD集成实战
  8. 应急响应手册
  9. 成本监控和优化
  10. 运维脚本集合

1. 快速开始检查清单

1.1 环境准备清单

✅ 开发环境检查

# 检查必要工具
[] Docker已安装并运行
[] Git版本控制
[] 代码编辑器配置
[] 本地开发端口(3000, 8080等)可用

# 验证命令
docker --version
git --version
netstat -an | grep :3000

✅ Fly.io 账号设置

# 1. 注册账号
# 访问 https://fly.io/app/sign-up

# 2. 验证邮箱
[] 邮箱验证完成
[] 账号激活确认

# 3. 添加支付方式
[] 信用卡信息已添加
[] 计费信息已确认

✅ CLI工具安装

# macOS安装
brew install flyctl

# Linux安装
curl -L https://fly.io/install.sh | sh

# Windows (PowerShell)
iwr https://fly.io/install.ps1 -useb | iex

# 验证安装
fly version
# 预期输出:flyctl v0.x.x

1.2 认证和基础配置

# 登录账号
fly auth login
# 浏览器会自动打开,完成认证

# 验证登录状态
fly auth whoami
# 预期输出:显示你的邮箱地址

# 查看组织信息
fly orgs list
# 显示你有权限的组织

# 设置默认组织(可选)
fly orgs select YOUR_ORG_NAME

1.3 项目初始化检查清单

✅ 项目准备

[] Dockerfile已创建且测试通过
[] .dockerignore文件已配置
[] 环境变量已梳理
[] 端口配置已确认
[] 健康检查接口已实现

# 本地测试Docker构建
docker build -t test-app .
docker run -p 3000:3000 test-app
curl http://localhost:3000/health

✅ 部署前最终检查

[] 代码已提交到Git
[] 敏感信息已从代码中移除
[] 生产配置已准备
[] 数据库连接字符串已准备
[] 域名已准备(如需要)

2. 部署操作手册

2.1 首次部署完整流程

步骤1: 项目初始化
# 进入项目目录
cd your-project

# 初始化Fly.io应用
fly launch
# 交互式配置:
# - 应用名称
# - 部署区域
# - 数据库需求
# - Redis需求

# 查看生成的配置
cat fly.toml
步骤2: 配置文件调优
# fly.toml 生产配置示例
app = "my-production-app"
primary_region = "nrt"

[build]
  # 使用多阶段构建优化镜像大小

[env]
  NODE_ENV = "production"
  PORT = "3000"

[http_service]
  internal_port = 3000
  force_https = true
  auto_stop_machines = "stop"
  auto_start_machines = true
  min_machines_running = 1
  
  [http_service.concurrency]
    type = "connections"
    hard_limit = 25
    soft_limit = 20
  
  [[http_service.checks]]
    grace_period = "10s"
    interval = "30s"
    method = "GET"
    path = "/health"
    protocol = "http"
    timeout = "5s"

[vm]
  size = "shared-cpu-1x"
  memory = "1gb"

[[statics]]
  guest_path = "/app/public"
  url_prefix = "/static/"
步骤3: 设置环境变量和密钥
# 设置生产环境密钥
fly secrets set DATABASE_URL="postgresql://..."
fly secrets set JWT_SECRET="your-random-secret"
fly secrets set API_KEY="your-api-key"

# 查看已设置的密钥
fly secrets list
步骤4: 执行部署
# 部署应用
fly deploy

# 监控部署过程
fly logs

# 验证部署状态
fly status
fly checks list
步骤5: 部署后验证
# 获取应用URL
fly info

# 测试应用响应
curl https://my-production-app.fly.dev/health

# 检查所有服务状态
fly status --all

# 查看详细日志
fly logs --since 10m

2.2 更新部署策略

滚动更新部署
# 标准更新部署
fly deploy

# 指定特定镜像版本
fly deploy --image my-app:v2.1.0

# 仅更新配置不重新构建
fly deploy --no-build

# 部署到特定区域
fly deploy --region nrt
蓝绿部署
# 1. 部署新版本到新实例
fly scale count 2

# 2. 验证新实例健康状态
fly status

# 3. 如果正常,逐步切换流量
fly scale count 1 --region nrt  # 保留新版本
fly machines stop OLD_MACHINE_ID  # 停止旧版本

# 4. 如果有问题,快速回滚
fly machines start OLD_MACHINE_ID
fly machines stop NEW_MACHINE_ID

2.3 回滚操作指南

快速回滚到上一版本
# 查看部署历史
fly releases

# 回滚到上一个版本
fly releases rollback

# 回滚到特定版本
fly releases rollback v3

# 确认回滚状态
fly status
fly logs --since 5m
紧急回滚流程
#!/bin/bash
# 紧急回滚脚本
echo "开始紧急回滚..."

# 1. 立即回滚到上一版本
fly releases rollback

# 2. 检查回滚状态
if fly status | grep -q "running"; then
    echo "✅ 回滚成功,应用正在运行"
else
    echo "❌ 回滚失败,需要手动干预"
    exit 1
fi

# 3. 验证应用健康状态
if curl -f https://your-app.fly.dev/health; then
    echo "✅ 应用健康检查通过"
else
    echo "⚠️ 应用健康检查失败,请检查日志"
    fly logs --since 2m
fi

echo "回滚操作完成"

2.4 多环境部署

开发环境配置
# fly.dev.toml
app = "my-app-dev"
primary_region = "nrt"

[env]
  NODE_ENV = "development"
  DEBUG = "true"

[http_service]
  auto_stop_machines = "stop"  # 节省成本
  min_machines_running = 0

[vm]
  size = "shared-cpu-1x"
  memory = "256mb"  # 最小配置
测试环境配置
# fly.staging.toml
app = "my-app-staging"
primary_region = "nrt"

[env]
  NODE_ENV = "staging"

[http_service]
  min_machines_running = 1

[vm]
  size = "shared-cpu-1x"
  memory = "512mb"
多环境部署脚本
#!/bin/bash
# deploy-multi-env.sh

ENVIRONMENT=$1
if [ -z "$ENVIRONMENT" ]; then
    echo "Usage: $0 [dev|staging|production]"
    exit 1
fi

case $ENVIRONMENT in
    "dev")
        CONFIG_FILE="fly.dev.toml"
        ;;
    "staging")
        CONFIG_FILE="fly.staging.toml"
        ;;
    "production")
        CONFIG_FILE="fly.toml"
        ;;
    *)
        echo "Unknown environment: $ENVIRONMENT"
        exit 1
        ;;
esac

echo "Deploying to $ENVIRONMENT using $CONFIG_FILE"
fly deploy --config $CONFIG_FILE

echo "Deployment completed for $ENVIRONMENT"
fly status --config $CONFIG_FILE

3. 日常运维操作

3.1 健康检查和监控

应用健康状态检查
# 查看应用整体状态
fly status

# 详细状态信息
fly status --all

# 检查健康检查状态
fly checks list

# 查看特定Machine状态
fly machines list
fly machines show MACHINE_ID
实时监控脚本
#!/bin/bash
# monitor.sh - 实时监控脚本

APP_NAME="your-app"
HEALTH_URL="https://${APP_NAME}.fly.dev/health"

while true; do
    echo "=== $(date) ==="
    
    # 检查应用状态
    STATUS=$(fly status --json | jq -r '.status')
    echo "App Status: $STATUS"
    
    # 检查健康端点
    if curl -f $HEALTH_URL > /dev/null 2>&1; then
        echo "Health Check: ✅ PASS"
    else
        echo "Health Check: ❌ FAIL"
        echo "Checking logs..."
        fly logs --since 2m | tail -10
    fi
    
    # 检查Machine状态
    MACHINES=$(fly machines list --json | jq -r '.[].state' | sort | uniq -c)
    echo "Machine States:"
    echo "$MACHINES"
    
    echo "------------------------"
    sleep 30
done

3.2 日志管理和分析

基础日志操作
# 查看实时日志
fly logs

# 查看历史日志
fly logs --since 1h
fly logs --since "2024-01-15 10:00:00"

# 过滤特定实例日志
fly logs --instance MACHINE_ID

# 只显示应用日志,排除系统日志
fly logs --no-system

# 导出日志到文件
fly logs --since 24h > app-logs-$(date +%Y%m%d).log
日志分析脚本
#!/bin/bash
# log-analysis.sh - 日志分析工具

LOG_FILE="app-logs-$(date +%Y%m%d).log"
FLY_APP="your-app"

# 获取24小时日志
echo "Fetching logs..."
fly logs --since 24h --app $FLY_APP > $LOG_FILE

# 分析错误
echo "=== Error Analysis ==="
grep -i "error\|exception\|fail" $LOG_FILE | head -20

# 分析响应时间
echo "=== Slow Requests (>1000ms) ==="
grep "response_time" $LOG_FILE | awk '$NF > 1000' | head -10

# 分析访问统计
echo "=== Top IPs ==="
grep -o 'ip=[0-9.]*' $LOG_FILE | sort | uniq -c | sort -nr | head -10

# 分析状态码
echo "=== Status Code Distribution ==="
grep -o 'status=[0-9]*' $LOG_FILE | sort | uniq -c | sort -nr

echo "Log analysis saved to: $LOG_FILE"

3.3 性能调优操作

资源使用监控
# 查看资源使用情况
fly ssh console -C "htop"

# 查看内存使用
fly ssh console -C "free -h"

# 查看磁盘使用
fly ssh console -C "df -h"

# 查看网络连接
fly ssh console -C "netstat -an | head -20"

# 查看进程状态
fly ssh console -C "ps aux | head -20"
性能基准测试
#!/bin/bash
# performance-test.sh

APP_URL="https://your-app.fly.dev"
CONCURRENT_USERS=10
DURATION=60

echo "Starting performance test..."
echo "URL: $APP_URL"
echo "Concurrent Users: $CONCURRENT_USERS"
echo "Duration: ${DURATION}s"

# 使用wrk进行压力测试
wrk -t4 -c$CONCURRENT_USERS -d${DURATION}s $APP_URL/api/health

# 或使用ab进行测试
# ab -n 1000 -c $CONCURRENT_USERS $APP_URL/api/health

echo "Performance test completed"

3.4 故障排查流程图

故障发生
    
检查应用状态 (fly status)
    
应用运行?  No  检查部署历史  考虑回滚
     Yes
检查健康检查 (fly checks list)
    
健康检查通过?  No  检查健康检查端点  修复应用逻辑
     Yes
查看最近日志 (fly logs --since 10m)
    
发现错误?  Yes  分析错误类型  修复并重新部署
     No
检查资源使用 (SSH到机器检查)
    
资源不足?  Yes  扩容或优化
     No
检查网络连接和外部依赖
    
网络问题?  Yes  检查DNS、数据库连接等
     No
联系Fly.io支持或社区

4. 扩缩容管理

4.1 手动扩缩容操作

垂直扩容(增加资源)
# 查看当前配置
fly status

# 增加内存
fly scale memory 2048

# 修改CPU配置
fly scale vm performance-1x

# 查看可用配置选项
fly platform vm-sizes

# 验证扩容结果
fly status --all
水平扩容(增加实例)
# 增加总实例数
fly scale count 3

# 在特定区域增加实例
fly scale count 2 --region nrt
fly scale count 1 --region sin

# 查看实例分布
fly status --all

# 检查负载均衡
curl -s https://your-app.fly.dev/debug/instance | grep machine_id

4.2 自动扩缩容配置

基于连接数的自动扩缩容
# fly.toml
[http_service]
  auto_stop_machines = "stop"
  auto_start_machines = true
  min_machines_running = 1
  
  [http_service.concurrency]
    type = "connections"
    hard_limit = 25      # 单实例最大连接数
    soft_limit = 20      # 触发扩容的阈值
基于请求数的自动扩缩容
[http_service.concurrency]
  type = "requests"
  hard_limit = 100     # 单实例最大并发请求
  soft_limit = 80      # 触发扩容的阈值
扩缩容监控脚本
#!/bin/bash
# autoscale-monitor.sh

APP_NAME="your-app"

while true; do
    echo "=== $(date) ==="
    
    # 获取当前实例状态
    MACHINES=$(fly machines list --json --app $APP_NAME)
    RUNNING_COUNT=$(echo $MACHINES | jq '[.[] | select(.state == "started")] | length')
    TOTAL_COUNT=$(echo $MACHINES | jq 'length')
    
    echo "Running Machines: $RUNNING_COUNT/$TOTAL_COUNT"
    
    # 检查负载情况
    for machine_id in $(echo $MACHINES | jq -r '.[].id'); do
        STATE=$(echo $MACHINES | jq -r ".[] | select(.id == \"$machine_id\") | .state")
        echo "Machine $machine_id: $STATE"
    done
    
    # 检查是否需要手动干预
    if [ $RUNNING_COUNT -eq 0 ]; then
        echo "⚠️ 没有运行中的实例!"
        echo "执行: fly machines start [MACHINE_ID]"
    fi
    
    echo "------------------------"
    sleep 60
done

4.3 跨区域部署管理

全球区域部署策略
# 查看可用区域
fly platform regions

# 主要用户区域部署
fly scale count 2 --region nrt  # 东京(亚太主要)
fly scale count 2 --region iad  # 华盛顿(北美主要)
fly scale count 1 --region ams  # 阿姆斯特丹(欧洲主要)

# 查看区域分布
fly status --all
区域性能测试
#!/bin/bash
# region-performance-test.sh

REGIONS=("nrt" "iad" "ams" "sin" "lhr")
APP_NAME="your-app"

for region in "${REGIONS[@]}"; do
    echo "Testing region: $region"
    
    # 创建测试实例
    fly machines create --region $region --config test-config.json
    
    # 等待启动
    sleep 30
    
    # 测试延迟
    URL="https://$APP_NAME.fly.dev"
    LATENCY=$(curl -o /dev/null -s -w "%{time_total}" $URL)
    
    echo "Region $region latency: ${LATENCY}s"
    
    # 清理测试实例
    # fly machines stop TEST_MACHINE_ID
done

4.4 负载均衡配置

配置负载均衡策略
# fly.toml - 负载均衡配置
[http_service]
  internal_port = 3000
  
  # 会话亲和性(可选)
  [http_service.checks]
    path = "/health"
    
  # 超时配置
  [http_service.timeouts]
    read_timeout = "30s"
    write_timeout = "30s"
    idle_timeout = "60s"
健康检查配置优化
[[http_service.checks]]
  grace_period = "5s"     # 启动后等待时间
  interval = "15s"        # 检查间隔
  method = "GET"
  path = "/health"
  protocol = "http"
  timeout = "2s"          # 检查超时
  
  # 失败处理
  [http_service.checks.headers]
    "User-Agent" = "fly-health-check"

5. 数据库和存储运维

5.1 PostgreSQL管理操作

创建和配置数据库
# 创建PostgreSQL数据库
fly postgres create --name my-app-db --region nrt

# 查看数据库信息
fly postgres show --app my-app-db

# 连接到数据库
fly postgres connect --app my-app-db

# 获取数据库连接字符串
fly postgres attach --app my-app my-app-db
数据库维护操作
# 查看数据库状态
fly postgres status --app my-app-db

# 数据库配置调优
fly postgres config show --app my-app-db
fly postgres config update --app my-app-db \
  --shared-preload-libraries="pg_stat_statements" \
  --log-statement="all"

# 重启数据库
fly postgres restart --app my-app-db

5.2 备份和恢复流程

自动备份配置
# 查看备份状态
fly postgres backup list --app my-app-db

# 创建手动备份
fly postgres backup create --app my-app-db

# 下载备份文件
fly postgres backup download BACKUP_ID --app my-app-db
数据库恢复操作
# 从备份恢复
fly postgres restore --app my-app-db --from BACKUP_ID

# 从本地文件恢复
fly postgres import --app my-app-db --local-file backup.sql

# 恢复验证
fly postgres connect --app my-app-db -c "SELECT COUNT(*) FROM your_table;"
备份脚本自动化
#!/bin/bash
# db-backup.sh - 数据库备份脚本

DB_APP="my-app-db"
BACKUP_DIR="/backups"
DATE=$(date +%Y%m%d_%H%M%S)

mkdir -p $BACKUP_DIR

echo "Starting backup for $DB_APP..."

# 创建备份
BACKUP_ID=$(fly postgres backup create --app $DB_APP --json | jq -r '.id')

if [ "$BACKUP_ID" != "null" ]; then
    echo "Backup created with ID: $BACKUP_ID"
    
    # 下载备份文件
    fly postgres backup download $BACKUP_ID --app $DB_APP \
        --output "$BACKUP_DIR/backup_${DATE}.sql"
    
    echo "Backup saved to: $BACKUP_DIR/backup_${DATE}.sql"
    
    # 清理旧备份(保留30天)
    find $BACKUP_DIR -name "backup_*.sql" -mtime +30 -delete
    
    echo "Backup completed successfully"
else
    echo "Failed to create backup"
    exit 1
fi

5.3 数据迁移步骤

从其他数据库迁移
# 1. 导出现有数据
pg_dump postgresql://old-db-url > migration_data.sql

# 2. 准备新数据库
fly postgres create --name new-app-db --region nrt

# 3. 导入数据
fly postgres import --app new-app-db --local-file migration_data.sql

# 4. 验证数据完整性
fly postgres connect --app new-app-db -c "
SELECT 
  schemaname,
  tablename,
  n_tup_ins as inserts,
  n_tup_upd as updates,
  n_tup_del as deletes
FROM pg_stat_user_tables;
"

# 5. 更新应用连接字符串
fly secrets set DATABASE_URL="postgresql://new-db-url"
零停机数据迁移
#!/bin/bash
# zero-downtime-migration.sh

OLD_DB="old-app-db"
NEW_DB="new-app-db"
APP_NAME="my-app"

echo "Starting zero-downtime migration..."

# 1. 创建新数据库
echo "Creating new database..."
fly postgres create --name $NEW_DB --region nrt

# 2. 初始数据同步
echo "Initial data sync..."
pg_dump postgresql://old-db-connection | \
  psql postgresql://new-db-connection

# 3. 设置实时同步(使用逻辑复制)
echo "Setting up logical replication..."
# 配置逻辑复制(具体命令取决于数据库版本)

# 4. 等待同步完成
echo "Waiting for sync to complete..."
sleep 30

# 5. 切换应用到新数据库
echo "Switching application to new database..."
fly secrets set DATABASE_URL="postgresql://new-db-connection" --app $APP_NAME

# 6. 重启应用
echo "Restarting application..."
fly restart --app $APP_NAME

# 7. 验证切换成功
echo "Verifying migration..."
if curl -f https://$APP_NAME.fly.dev/health; then
    echo "✅ Migration successful"
else
    echo "❌ Migration failed, consider rollback"
    exit 1
fi

echo "Migration completed"

5.4 Volume管理

持久化存储操作
# 创建Volume
fly volumes create data_volume --size 10GB --region nrt

# 查看Volumes
fly volumes list

# 扩展Volume大小
fly volumes extend data_volume --size 20GB

# 创建Volume快照
fly volumes snapshot create data_volume

# 从快照恢复Volume
fly volumes create new_volume --snapshot SNAPSHOT_ID --region nrt
Volume挂载配置
# fly.toml
[mounts]
  source = "data_volume"
  destination = "/data"
  
# 多个挂载点
[[mounts]]
  source = "logs_volume"
  destination = "/var/log/app"
  
[[mounts]]
  source = "uploads_volume"  
  destination = "/app/uploads"
Volume备份脚本
#!/bin/bash
# volume-backup.sh

VOLUME_NAME="data_volume"
APP_NAME="my-app"
BACKUP_DATE=$(date +%Y%m%d_%H%M%S)

echo "Creating snapshot of $VOLUME_NAME..."

# 创建快照
SNAPSHOT_ID=$(fly volumes snapshot create $VOLUME_NAME --json | jq -r '.id')

if [ "$SNAPSHOT_ID" != "null" ]; then
    echo "Snapshot created: $SNAPSHOT_ID"
    
    # 记录备份信息
    echo "$BACKUP_DATE,$VOLUME_NAME,$SNAPSHOT_ID" >> volume_backups.log
    
    echo "Volume backup completed"
else
    echo "Failed to create volume snapshot"
    exit 1
fi

6. 网络和安全配置

6.1 域名和证书管理

自定义域名配置
# 添加自定义域名
fly certs add example.com

# 添加通配符域名
fly certs add "*.example.com"

# 查看证书状态
fly certs show example.com

# 查看所有证书
fly certs list

# 删除证书
fly certs remove example.com
SSL证书验证
# 检查证书配置
curl -I https://example.com
openssl s_client -connect example.com:443 -servername example.com

# 证书到期检查脚本
#!/bin/bash
# cert-check.sh
DOMAIN="example.com"

EXPIRY_DATE=$(openssl s_client -connect $DOMAIN:443 -servername $DOMAIN 2>/dev/null | \
  openssl x509 -noout -dates | grep notAfter | cut -d= -f2)

EXPIRY_TIMESTAMP=$(date -d "$EXPIRY_DATE" +%s)
CURRENT_TIMESTAMP=$(date +%s)
DAYS_UNTIL_EXPIRY=$(( (EXPIRY_TIMESTAMP - CURRENT_TIMESTAMP) / 86400 ))

echo "Certificate for $DOMAIN expires in $DAYS_UNTIL_EXPIRY days"

if [ $DAYS_UNTIL_EXPIRY -lt 30 ]; then
    echo "⚠️ Certificate expires soon!"
fi
DNS配置指南
# 查看Fly.io提供的IP地址
fly ips list

# DNS记录配置示例:
# A记录: example.com -> 77.83.142.52
# AAAA记录: example.com -> 2a09:8280:1::4:d2b4
# CNAME记录: www.example.com -> example.com

# 验证DNS配置
dig example.com A +short
dig example.com AAAA +short
nslookup example.com 8.8.8.8

6.2 密钥和环境变量管理

安全密钥管理
# 设置生产环境密钥
fly secrets set DATABASE_URL="postgresql://..." 
fly secrets set JWT_SECRET="$(openssl rand -base64 32)"
fly secrets set API_KEY="your-secure-api-key"

# 查看密钥列表(不显示值)
fly secrets list

# 删除密钥
fly secrets unset OLD_SECRET

# 批量设置密钥
fly secrets set \
  SECRET_1="value1" \
  SECRET_2="value2" \
  SECRET_3="value3"
环境变量最佳实践
#!/bin/bash
# setup-secrets.sh - 密钥设置脚本

# 从.env文件读取并设置(排除敏感信息)
while IFS='=' read -r key value; do
    # 跳过注释和空行
    [[ $key =~ ^#.*$ || -z $key ]] && continue
    
    # 设置非敏感环境变量
    if [[ ! $key =~ (PASSWORD|SECRET|KEY|TOKEN) ]]; then
        echo "Setting env var: $key"
        fly env set "$key=$value"
    else
        echo "Setting secret: $key"
        fly secrets set "$key=$value"
    fi
done < .env.production

echo "Environment setup completed"

6.3 内部网络配置

6PN私有网络配置
# 查看私有网络信息
fly ips private

# 分配私有IPv6地址
fly ips allocate-v6 --private

# 查看网络连接
fly ssh console -C "ip addr show"
服务间通信配置
// 应用内网络配置
const DB_HOST = process.env.DATABASE_URL || 'postgres.internal';
const REDIS_HOST = process.env.REDIS_URL || 'redis.internal';
const API_HOST = process.env.API_URL || 'api-service.internal';

// 内网服务发现
const services = {
  database: 'my-app-db.internal:5432',
  cache: 'my-app-redis.internal:6379',
  api: 'my-app-api.internal:8080'
};
网络安全配置
# fly.toml - 网络安全配置
[http_service]
  internal_port = 3000
  
  # 仅允许HTTPS
  force_https = true
  
  # 配置CORS
  [http_service.cors]
    allowed_origins = ["https://yourdomain.com"]
    allowed_methods = ["GET", "POST", "PUT", "DELETE"]
    allowed_headers = ["Content-Type", "Authorization"]

# 内部服务不暴露外部端口
[[services]]
  internal_port = 5432
  protocol = "tcp"
  # 不配置[[services.ports]]即不暴露外部

6.4 防护策略设置

速率限制配置
// 应用层速率限制
const rateLimit = require('express-rate-limit');

const limiter = rateLimit({
  windowMs: 15 * 60 * 1000, // 15分钟
  max: 100, // 最多100个请求
  message: 'Too many requests from this IP',
  standardHeaders: true,
  legacyHeaders: false,
});

app.use('/api/', limiter);
DDoS防护监控
#!/bin/bash
# ddos-monitor.sh - DDoS监控脚本

APP_NAME="my-app"
THRESHOLD=1000  # 每分钟请求阈值

while true; do
    # 统计最近1分钟的请求数
    REQUEST_COUNT=$(fly logs --since 1m --app $APP_NAME | \
      grep -c "GET\|POST\|PUT\|DELETE")
    
    echo "$(date): Requests in last minute: $REQUEST_COUNT"
    
    if [ $REQUEST_COUNT -gt $THRESHOLD ]; then
        echo "⚠️ High traffic detected! Requests: $REQUEST_COUNT"
        
        # 可以触发自动扩容或告警
        fly scale count 3 --app $APP_NAME
        
        # 发送告警通知
        # curl -X POST "https://hooks.slack.com/..." \
        #   -d "{'text':'High traffic detected on $APP_NAME: $REQUEST_COUNT requests/min'}"
    fi
    
    sleep 60
done
安全头配置
// 安全头中间件
const helmet = require('helmet');

app.use(helmet({
  contentSecurityPolicy: {
    directives: {
      defaultSrc: ["'self'"],
      styleSrc: ["'self'", "'unsafe-inline'"],
      scriptSrc: ["'self'"],
      imgSrc: ["'self'", "data:", "https:"],
    },
  },
  hsts: {
    maxAge: 31536000,
    includeSubDomains: true,
    preload: true
  }
}));

7. CI/CD集成实战

7.1 GitHub Actions配置

基础部署工作流
# .github/workflows/deploy.yml
name: Deploy to Fly.io

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  deploy:
    name: Deploy app
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Setup Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '18'
          cache: 'npm'
      
      - name: Install dependencies
        run: npm ci
      
      - name: Run tests
        run: npm test
      
      - name: Run linter
        run: npm run lint
      
      - name: Setup flyctl
        uses: superfly/flyctl-actions/setup-flyctl@master
      
      - name: Deploy to Fly.io
        run: flyctl deploy --remote-only
        env:
          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
多环境部署工作流
# .github/workflows/multi-env-deploy.yml
name: Multi-Environment Deploy

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '18'
          cache: 'npm'
      - name: Install and test
        run: |
          npm ci
          npm test
          npm run lint

  deploy-staging:
    needs: test
    if: github.ref == 'refs/heads/develop'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: superfly/flyctl-actions/setup-flyctl@master
      - name: Deploy to staging
        run: flyctl deploy --config fly.staging.toml
        env:
          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}

  deploy-production:
    needs: test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: superfly/flyctl-actions/setup-flyctl@master
      - name: Deploy to production
        run: flyctl deploy --config fly.toml
        env:
          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}

7.2 GitLab CI配置

GitLab CI配置文件
# .gitlab-ci.yml
stages:
  - test
  - build
  - deploy

variables:
  DOCKER_HOST: tcp://docker:2376
  DOCKER_TLS_CERTDIR: "/certs"

test:
  stage: test
  image: node:18
  script:
    - npm ci
    - npm test
    - npm run lint
  cache:
    paths:
      - node_modules/

build:
  stage: build
  image: docker:latest
  services:
    - docker:dind
  script:
    - docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA

deploy:
  stage: deploy
  image: flyio/flyctl:latest
  script:
    - flyctl auth docker
    - flyctl deploy --image $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
  environment:
    name: production
    url: https://your-app.fly.dev
  only:
    - main

7.3 自动化部署脚本

部署前检查脚本
#!/bin/bash
# pre-deploy-check.sh

APP_NAME="my-app"
HEALTH_ENDPOINT="/health"

echo "Starting pre-deployment checks..."

# 1. 检查当前应用状态
echo "Checking current app status..."
if ! fly status --app $APP_NAME > /dev/null 2>&1; then
    echo "❌ Cannot access app status"
    exit 1
fi

# 2. 检查健康端点
echo "Checking health endpoint..."
if ! curl -f "https://$APP_NAME.fly.dev$HEALTH_ENDPOINT" > /dev/null 2>&1; then
    echo "❌ Health check failed"
    exit 1
fi

# 3. 检查Docker构建
echo "Testing Docker build..."
if ! docker build -t test-build . > /dev/null 2>&1; then
    echo "❌ Docker build failed"
    exit 1
fi

# 4. 检查环境变量
echo "Checking required secrets..."
REQUIRED_SECRETS=("DATABASE_URL" "JWT_SECRET")
for secret in "${REQUIRED_SECRETS[@]}"; do
    if ! fly secrets list --app $APP_NAME | grep -q $secret; then
        echo "❌ Missing required secret: $secret"
        exit 1
    fi
done

echo "✅ All pre-deployment checks passed"
部署后验证脚本
#!/bin/bash
# post-deploy-verify.sh

APP_NAME="my-app"
TIMEOUT=300  # 5分钟超时

echo "Starting post-deployment verification..."

# 等待部署完成
echo "Waiting for deployment to complete..."
END_TIME=$((SECONDS + TIMEOUT))

while [ $SECONDS -lt $END_TIME ]; do
    if fly status --app $APP_NAME | grep -q "running"; then
        echo "✅ App is running"
        break
    fi
    echo "Waiting for app to start..."
    sleep 10
done

if [ $SECONDS -ge $END_TIME ]; then
    echo "❌ Timeout waiting for app to start"
    exit 1
fi

# 健康检查
echo "Running health checks..."
sleep 30  # 给应用时间完全启动

if curl -f "https://$APP_NAME.fly.dev/health" > /dev/null 2>&1; then
    echo "✅ Health check passed"
else
    echo "❌ Health check failed"
    echo "Recent logs:"
    fly logs --since 5m --app $APP_NAME | tail -20
    exit 1
fi

# 功能测试
echo "Running basic functionality tests..."
RESPONSE=$(curl -s "https://$APP_NAME.fly.dev/api/version")
if [[ $RESPONSE =~ "version" ]]; then
    echo "✅ API responding correctly"
else
    echo "❌ API not responding as expected"
    exit 1
fi

echo "✅ Post-deployment verification completed successfully"

7.4 多阶段部署流程

蓝绿部署CI配置
# .github/workflows/blue-green-deploy.yml
name: Blue-Green Deployment

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - uses: superfly/flyctl-actions/setup-flyctl@master
      
      - name: Blue-Green Deploy
        run: |
          # 获取当前活跃实例
          CURRENT_MACHINES=$(flyctl machines list --json | jq -r '.[].id')
          
          # 部署新版本(绿色环境)
          flyctl deploy --strategy=bluegreen
          
          # 等待新实例就绪
          sleep 60
          
          # 健康检查
          if curl -f https://your-app.fly.dev/health; then
            echo "Green deployment successful"
            
            # 停止旧实例(蓝色环境)
            for machine in $CURRENT_MACHINES; do
              flyctl machines stop $machine
            done
            
            echo "Blue-green deployment completed"
          else
            echo "Health check failed, rolling back"
            flyctl releases rollback
            exit 1
          fi
        env:
          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
分阶段部署脚本
#!/bin/bash
# staged-deployment.sh

APP_NAME="my-app"
REGIONS=("nrt" "iad" "ams")

echo "Starting staged deployment..."

for region in "${REGIONS[@]}"; do
    echo "Deploying to region: $region"
    
    # 部署到特定区域
    flyctl deploy --region $region --app $APP_NAME
    
    # 等待部署完成
    sleep 30
    
    # 验证区域部署
    if flyctl status --app $APP_NAME | grep $region | grep -q "running"; then
        echo "✅ Deployment to $region successful"
    else
        echo "❌ Deployment to $region failed"
        
        # 回滚策略
        echo "Rolling back $region deployment..."
        flyctl releases rollback --app $APP_NAME
        exit 1
    fi
    
    # 区域间等待时间
    sleep 60
done

echo "✅ Staged deployment completed successfully"

8. 应急响应手册

8.1 常见故障处理

应用无法启动
# 故障现象:应用显示状态异常或无法访问

# 1. 快速诊断
fly status
fly logs --since 10m

# 2. 检查健康检查
fly checks list

# 3. 检查资源使用
fly ssh console -C "ps aux | head -10"
fly ssh console -C "free -h"

# 4. 常见解决方案
# 4.1 内存不足
fly scale memory 1024

# 4.2 启动超时
# 检查fly.toml中的grace_period设置
[http_service.checks]
  grace_period = "30s"

# 4.3 端口配置错误
# 确保internal_port与应用监听端口一致
[http_service]
  internal_port = 3000

# 5. 如果仍无法解决,回滚到上一版本
fly releases rollback
数据库连接失败
# 故障现象:应用报数据库连接错误

# 1. 检查数据库状态
fly postgres status --app my-app-db

# 2. 测试数据库连接
fly postgres connect --app my-app-db

# 3. 检查连接字符串
fly secrets list | grep DATABASE

# 4. 验证网络连通性
fly ssh console -C "nc -zv my-app-db.internal 5432"

# 5. 常见解决方案
# 5.1 连接字符串错误
fly secrets set DATABASE_URL="postgresql://correct-url"

# 5.2 连接池耗尽
# 在应用中增加连接池配置
max: 20,
min: 5,
idle: 10000

# 5.3 数据库重启
fly postgres restart --app my-app-db

8.2 紧急回滚流程

自动回滚脚本
#!/bin/bash
# emergency-rollback.sh

APP_NAME="my-app"
HEALTH_URL="https://$APP_NAME.fly.dev/health"

echo "🚨 EMERGENCY ROLLBACK INITIATED"
echo "App: $APP_NAME"
echo "Time: $(date)"

# 1. 记录当前状态
echo "Current status:"
fly status --app $APP_NAME

# 2. 执行回滚
echo "Rolling back to previous version..."
fly releases rollback --app $APP_NAME

# 3. 等待回滚完成
echo "Waiting for rollback to complete..."
sleep 30

# 4. 验证回滚成功
for i in {1..10}; do
    if curl -f $HEALTH_URL > /dev/null 2>&1; then
        echo "✅ ROLLBACK SUCCESSFUL - Health check passed"
        echo "App is responding normally"
        
        # 发送成功通知
        echo "Rollback completed at $(date)" | \
            curl -X POST -H 'Content-Type: application/json' \
            -d @- "https://hooks.slack.com/your-webhook"
        
        exit 0
    fi
    
    echo "Attempt $i: Health check failed, retrying in 10s..."
    sleep 10
done

# 5. 回滚失败处理
echo "❌ ROLLBACK FAILED - Manual intervention required"
echo "Recent logs:"
fly logs --since 5m --app $APP_NAME | tail -20

# 发送失败告警
echo "CRITICAL: Rollback failed for $APP_NAME" | \
    curl -X POST -H 'Content-Type: application/json' \
    -d @- "https://hooks.slack.com/your-webhook"

exit 1
灾难恢复计划
#!/bin/bash
# disaster-recovery.sh

APP_NAME="my-app"
DB_APP="my-app-db"
BACKUP_REGION="iad"  # 备份区域

echo "🆘 DISASTER RECOVERY PROCEDURE"

# 1. 评估损坏程度
echo "Assessing damage..."
if fly status --app $APP_NAME | grep -q "running"; then
    echo "App is running, attempting standard recovery..."
    # 标准恢复流程
    fly releases rollback --app $APP_NAME
else
    echo "App is down, initiating disaster recovery..."
    
    # 2. 创建新应用实例
    echo "Creating new application instance..."
    fly apps create $APP_NAME-recovery --region $BACKUP_REGION
    
    # 3. 恢复数据库
    echo "Restoring database..."
    LATEST_BACKUP=$(fly postgres backup list --app $DB_APP --json | \
        jq -r '.[0].id')
    
    fly postgres create --name $DB_APP-recovery --region $BACKUP_REGION
    fly postgres restore --app $DB_APP-recovery --from $LATEST_BACKUP
    
    # 4. 部署应用到恢复实例
    echo "Deploying to recovery instance..."
    fly deploy --app $APP_NAME-recovery
    
    # 5. 更新DNS(手动步骤提示)
    echo "⚠️ MANUAL STEP REQUIRED:"
    echo "Update DNS records to point to recovery instance:"
    echo "A record: your-domain.com -> $(fly ips list --app $APP_NAME-recovery)"
fi

echo "Disaster recovery procedure completed"

8.3 性能问题诊断

性能监控脚本
#!/bin/bash
# performance-diagnosis.sh

APP_NAME="my-app"
DURATION=300  # 5分钟监控

echo "Starting performance diagnosis for $APP_NAME..."
echo "Duration: ${DURATION}s"

# 1. 基线性能测试
echo "Running baseline performance test..."
ab -n 100 -c 10 "https://$APP_NAME.fly.dev/" > baseline_test.log

# 2. 实时监控
echo "Starting real-time monitoring..."
END_TIME=$((SECONDS + DURATION))

while [ $SECONDS -lt $END_TIME ]; do
    TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
    
    # CPU和内存使用
    RESOURCES=$(fly ssh console --app $APP_NAME -C "top -bn1 | head -5")
    
    # 活跃连接数
    CONNECTIONS=$(fly ssh console --app $APP_NAME -C "netstat -an | grep ESTABLISHED | wc -l")
    
    # 响应时间测试
    RESPONSE_TIME=$(curl -o /dev/null -s -w "%{time_total}" "https://$APP_NAME.fly.dev/health")
    
    echo "$TIMESTAMP,CPU_MEM,$RESOURCES" >> performance_log.csv
    echo "$TIMESTAMP,CONNECTIONS,$CONNECTIONS" >> performance_log.csv
    echo "$TIMESTAMP,RESPONSE_TIME,$RESPONSE_TIME" >> performance_log.csv
    
    # 检查是否有性能问题
    if (( $(echo "$RESPONSE_TIME > 2.0" | bc -l) )); then
        echo "⚠️ Slow response detected: ${RESPONSE_TIME}s"
        
        # 获取详细日志
        fly logs --since 1m --app $APP_NAME | grep -E "(error|slow|timeout)" >> slow_requests.log
    fi
    
    sleep 10
done

echo "Performance diagnosis completed"
echo "Check performance_log.csv and slow_requests.log for details"
数据库性能诊断
#!/bin/bash
# db-performance-diagnosis.sh

DB_APP="my-app-db"

echo "Database performance diagnosis..."

# 1. 连接到数据库
echo "Connecting to database for analysis..."

# 2. 检查慢查询
fly postgres connect --app $DB_APP -c "
SELECT query, mean_time, calls, total_time
FROM pg_stat_statements 
ORDER BY mean_time DESC 
LIMIT 10;
"

# 3. 检查锁等待
fly postgres connect --app $DB_APP -c "
SELECT 
  blocked_locks.pid AS blocked_pid,
  blocked_activity.usename AS blocked_user,
  blocking_locks.pid AS blocking_pid,
  blocking_activity.usename AS blocking_user,
  blocked_activity.query AS blocked_statement,
  blocking_activity.query AS current_statement_in_blocking_process
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity 
  ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks 
  ON blocking_locks.locktype = blocked_locks.locktype
JOIN pg_catalog.pg_stat_activity blocking_activity 
  ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.GRANTED;
"

# 4. 检查表统计信息
fly postgres connect --app $DB_APP -c "
SELECT 
  schemaname,
  tablename,
  n_tup_ins + n_tup_upd + n_tup_del as total_writes,
  n_tup_ins,
  n_tup_upd,
  n_tup_del,
  seq_scan,
  seq_tup_read
FROM pg_stat_user_tables 
ORDER BY total_writes DESC 
LIMIT 10;
"

echo "Database performance analysis completed"

8.4 数据恢复操作

数据库恢复脚本
#!/bin/bash
# data-recovery.sh

DB_APP="my-app-db"
RECOVERY_POINT="$1"  # 恢复点时间戳或备份ID

if [ -z "$RECOVERY_POINT" ]; then
    echo "Usage: $0 <backup_id_or_timestamp>"
    echo "Available backups:"
    fly postgres backup list --app $DB_APP
    exit 1
fi

echo "🔄 DATA RECOVERY PROCEDURE"
echo "Database: $DB_APP"
echo "Recovery Point: $RECOVERY_POINT"
echo "WARNING: This will overwrite current data!"

read -p "Continue? (yes/no): " CONFIRM
if [ "$CONFIRM" != "yes" ]; then
    echo "Recovery cancelled"
    exit 1
fi

# 1. 创建当前状态备份
echo "Creating safety backup of current state..."
SAFETY_BACKUP=$(fly postgres backup create --app $DB_APP --json | jq -r '.id')
echo "Safety backup created: $SAFETY_BACKUP"

# 2. 停止应用实例
echo "Stopping application instances..."
MACHINES=$(fly machines list --json | jq -r '.[].id')
for machine in $MACHINES; do
    fly machines stop $machine
done

# 3. 执行数据恢复
echo "Restoring data from backup..."
if [[ $RECOVERY_POINT =~ ^[0-9]{4}-[0-9]{2}-[0-9]{2} ]]; then
    # 时间点恢复
    fly postgres restore --app $DB_APP --timestamp "$RECOVERY_POINT"
else
    # 备份恢复
    fly postgres restore --app $DB_APP --from "$RECOVERY_POINT"
fi

# 4. 验证数据恢复
echo "Verifying data recovery..."
fly postgres connect --app $DB_APP -c "SELECT COUNT(*) FROM pg_tables WHERE schemaname = 'public';"

# 5. 重启应用实例
echo "Restarting application instances..."
for machine in $MACHINES; do
    fly machines start $machine
done

# 6. 最终验证
echo "Final verification..."
sleep 30
if curl -f "https://my-app.fly.dev/health"; then
    echo "✅ Data recovery completed successfully"
else
    echo "❌ Recovery verification failed"
    echo "Consider rolling back to safety backup: $SAFETY_BACKUP"
fi
文件系统恢复
#!/bin/bash
# volume-recovery.sh

VOLUME_NAME="data_volume"
SNAPSHOT_ID="$1"

if [ -z "$SNAPSHOT_ID" ]; then
    echo "Usage: $0 <snapshot_id>"
    echo "Available snapshots:"
    fly volumes snapshot list $VOLUME_NAME
    exit 1
fi

echo "🗂️ VOLUME RECOVERY PROCEDURE"
echo "Volume: $VOLUME_NAME"
echo "Snapshot: $SNAPSHOT_ID"

# 1. 停止使用该Volume的机器
echo "Stopping machines using volume..."
MACHINES_USING_VOLUME=$(fly machines list --json | \
    jq -r '.[] | select(.config.mounts[]?.volume == "'$VOLUME_NAME'") | .id')

for machine in $MACHINES_USING_VOLUME; do
    echo "Stopping machine: $machine"
    fly machines stop $machine
done

# 2. 创建新Volume从快照
echo "Creating new volume from snapshot..."
NEW_VOLUME="${VOLUME_NAME}_recovered"
fly volumes create $NEW_VOLUME --snapshot $SNAPSHOT_ID --region nrt

# 3. 更新机器配置使用新Volume
echo "Updating machine configurations..."
for machine in $MACHINES_USING_VOLUME; do
    # 这里需要更新机器配置,将旧volume替换为新volume
    # 具体实现取决于你的应用配置
    echo "Update machine $machine to use $NEW_VOLUME"
done

# 4. 重启机器
echo "Restarting machines..."
for machine in $MACHINES_USING_VOLUME; do
    fly machines start $machine
done

echo "Volume recovery completed"

9. 成本监控和优化

9.1 费用监控设置

成本监控脚本
#!/bin/bash
# cost-monitor.sh

ORG_NAME="your-org"
ALERT_THRESHOLD=100  # 美元

echo "💰 Cost monitoring for organization: $ORG_NAME"

# 1. 获取当前月费用
CURRENT_COST=$(fly billing show --org $ORG_NAME --json | jq -r '.current_month.total')

echo "Current month cost: $${CURRENT_COST}"

# 2. 检查是否超过阈值
if (( $(echo "$CURRENT_COST > $ALERT_THRESHOLD" | bc -l) )); then
    echo "⚠️ Cost alert: Current spending ($${CURRENT_COST}) exceeds threshold ($${ALERT_THRESHOLD})"
    
    # 发送告警通知
    curl -X POST "https://hooks.slack.com/your-webhook" \
        -H 'Content-Type: application/json' \
        -d "{\"text\":\"Cost Alert: Current spending \$${CURRENT_COST} exceeds \$${ALERT_THRESHOLD}\"}"
fi

# 3. 获取应用费用明细
echo "Application cost breakdown:"
fly apps list --org $ORG_NAME --json | jq -r '.[] | .name' | while read app; do
    APP_COST=$(fly billing breakdown --app $app --json 2>/dev/null | jq -r '.total' 2>/dev/null || echo "0")
    echo "$app: \$${APP_COST}"
done

# 4. 预测月底费用
DAY_OF_MONTH=$(date +%d)
DAYS_IN_MONTH=$(date -d "$(date +'%Y-%m-01') +1 month -1 day" +%d)
PROJECTED_COST=$(echo "scale=2; $CURRENT_COST * $DAYS_IN_MONTH / $DAY_OF_MONTH" | bc)

echo "Projected month-end cost: \$${PROJECTED_COST}"

if (( $(echo "$PROJECTED_COST > $ALERT_THRESHOLD * 1.5" | bc -l) )); then
    echo "🚨 WARNING: Projected cost significantly exceeds budget!"
fi
资源使用分析
#!/bin/bash
# resource-analysis.sh

echo "📊 Resource utilization analysis"

# 1. 分析所有应用的资源配置
fly apps list --json | jq -r '.[] | .name' | while read app; do
    echo "=== App: $app ==="
    
    # 获取机器配置
    MACHINES=$(fly machines list --app $app --json)
    
    if [ "$MACHINES" != "[]" ]; then
        # 统计配置分布
        echo "Machine configurations:"
        echo "$MACHINES" | jq -r '.[] | "\(.config.size) - \(.config.guest.memory_mb)MB"' | sort | uniq -c
        
        # 统计运行状态
        echo "Machine states:"
        echo "$MACHINES" | jq -r '.[].state' | sort | uniq -c
        
        # 计算总资源
        TOTAL_MEMORY=$(echo "$MACHINES" | jq '[.[] | select(.state == "started") | .config.guest.memory_mb] | add')
        TOTAL_CPUS=$(echo "$MACHINES" | jq '[.[] | select(.state == "started") | .config.guest.cpus] | add')
        
        echo "Total allocated: ${TOTAL_CPUS} CPUs, ${TOTAL_MEMORY}MB RAM"
    else
        echo "No machines found"
    fi
    echo ""
done

9.2 资源优化建议

自动扩缩容优化
#!/bin/bash
# optimize-autoscaling.sh

APP_NAME="$1"
if [ -z "$APP_NAME" ]; then
    echo "Usage: $0 <app_name>"
    exit 1
fi

echo "🎯 Autoscaling optimization for $APP_NAME"

# 1. 分析过去24小时的机器状态变化
echo "Analyzing machine state changes in last 24h..."
fly logs --since 24h --app $APP_NAME | grep -E "(machine.*started|machine.*stopped)" > machine_events.log

START_EVENTS=$(grep "started" machine_events.log | wc -l)
STOP_EVENTS=$(grep "stopped" machine_events.log | wc -l)

echo "Machine start events: $START_EVENTS"
echo "Machine stop events: $STOP_EVENTS"

# 2. 计算平均运行时间
if [ $START_EVENTS -gt 0 ] && [ $STOP_EVENTS -gt 0 ]; then
    AVG_RUNTIME=$(echo "scale=2; 24 * 60 / ($START_EVENTS + $STOP_EVENTS)" | bc)
    echo "Average machine runtime: ${AVG_RUNTIME} minutes"
    
    # 3. 提供优化建议
    if (( $(echo "$AVG_RUNTIME < 10" | bc -l) )); then
        echo "💡 Optimization suggestion:"
        echo "- Very short runtime detected"
        echo "- Consider increasing min_machines_running to 1"
        echo "- This will reduce cold start frequency"
        
        echo "Recommended configuration:"
        echo "[http_service]"
        echo "  min_machines_running = 1"
        echo "  auto_stop_machines = \"suspend\""
    fi
    
    if (( $(echo "$START_EVENTS > 50" | bc -l) )); then
        echo "💡 Optimization suggestion:"
        echo "- High frequency scaling detected"
        echo "- Consider adjusting concurrency limits"
        echo "- Current events suggest traffic spikes"
        
        # 检查当前配置
        CURRENT_CONFIG=$(fly config show --app $APP_NAME)
        echo "Current concurrency config:"
        echo "$CURRENT_CONFIG" | grep -A 5 "concurrency"
    fi
else
    echo "ℹ️ No significant scaling activity detected"
    echo "Current configuration appears stable"
fi
成本优化建议脚本
#!/bin/bash
# cost-optimization.sh

echo "💰 Cost optimization analysis"

# 1. 检查过度配置的机器
echo "=== Checking for over-provisioned machines ==="
fly apps list --json | jq -r '.[] | .name' | while read app; do
    echo "Analyzing app: $app"
    
    # 获取机器配置和使用情况
    MACHINES=$(fly machines list --app $app --json)
    
    if [ "$MACHINES" != "[]" ]; then
        # 检查大内存配置的机器
        HIGH_MEMORY_MACHINES=$(echo "$MACHINES" | jq '[.[] | select(.config.guest.memory_mb > 1024)]')
        
        if [ "$HIGH_MEMORY_MACHINES" != "[]" ]; then
            echo "⚠️ Found high-memory machines (>1GB):"
            echo "$HIGH_MEMORY_MACHINES" | jq -r '.[] | "\(.id): \(.config.guest.memory_mb)MB"'
            
            echo "💡 Suggestion: Monitor memory usage and consider downsizing if underutilized"
            echo "Check with: fly ssh console --app $app -C 'free -h'"
        fi
        
        # 检查停止状态的机器
        STOPPED_MACHINES=$(echo "$MACHINES" | jq '[.[] | select(.state == "stopped")]')
        if [ "$STOPPED_MACHINES" != "[]" ]; then
            echo "ℹ️ Found stopped machines (no cost impact):"
            echo "$STOPPED_MACHINES" | jq -r '.[] | .id'
        fi
    fi
    echo ""
done

# 2. 检查未使用的volumes
echo "=== Checking for unused volumes ==="
fly volumes list --json | jq -r '.[] | select(.attached_machine_id == null) | "\(.name): \(.size_gb)GB - $" + ((.size_gb * 0.15) | tostring) + "/month"'

# 3. 检查多余的IP地址
echo "=== Checking IP address usage ==="
DEDICATED_IPS=$(fly ips list --json | jq '[.[] | select(.type == "v4")]')
if [ "$DEDICATED_IPS" != "[]" ]; then
    echo "💰 Dedicated IPv4 addresses found ($2/month each):"
    echo "$DEDICATED_IPS" | jq -r '.[] | .address'
    echo "💡 Suggestion: Consider if all dedicated IPs are necessary"
fi

echo "=== Cost optimization analysis completed ==="

9.3 成本告警配置

成本告警系统
#!/bin/bash
# cost-alerting.sh

# 配置参数
ORG_NAME="your-org"
DAILY_THRESHOLD=5      # 每日预算
MONTHLY_THRESHOLD=100  # 月度预算
WEBHOOK_URL="https://hooks.slack.com/your-webhook"

# 获取当前成本数据
BILLING_DATA=$(fly billing show --org $ORG_NAME --json)
CURRENT_MONTH_COST=$(echo "$BILLING_DATA" | jq -r '.current_month.total')
YESTERDAY_COST=$(echo "$BILLING_DATA" | jq -r '.yesterday.total')

# 计算每日平均成本
DAY_OF_MONTH=$(date +%d)
DAILY_AVERAGE=$(echo "scale=2; $CURRENT_MONTH_COST / $DAY_OF_MONTH" | bc)

echo "Cost monitoring report:"
echo "Current month total: \$${CURRENT_MONTH_COST}"
echo "Yesterday cost: \$${YESTERDAY_COST}"
echo "Daily average: \$${DAILY_AVERAGE}"

# 检查每日成本告警
if (( $(echo "$YESTERDAY_COST > $DAILY_THRESHOLD" | bc -l) )); then
    MESSAGE="🚨 Daily cost alert: Yesterday's cost (\$${YESTERDAY_COST}) exceeded daily threshold (\$${DAILY_THRESHOLD})"
    echo "$MESSAGE"
    
    curl -X POST "$WEBHOOK_URL" \
        -H 'Content-Type: application/json' \
        -d "{\"text\":\"$MESSAGE\"}"
fi

# 检查月度成本趋势
DAYS_IN_MONTH=$(date -d "$(date +'%Y-%m-01') +1 month -1 day" +%d)
PROJECTED_MONTHLY=$(echo "scale=2; $DAILY_AVERAGE * $DAYS_IN_MONTH" | bc)

if (( $(echo "$PROJECTED_MONTHLY > $MONTHLY_THRESHOLD" | bc -l) )); then
    MESSAGE="⚠️ Monthly projection alert: Projected monthly cost (\$${PROJECTED_MONTHLY}) may exceed budget (\$${MONTHLY_THRESHOLD})"
    echo "$MESSAGE"
    
    curl -X POST "$WEBHOOK_URL" \
        -H 'Content-Type: application/json' \
        -d "{\"text\":\"$MESSAGE\"}"
fi

# 检查成本激增
if (( $(echo "$YESTERDAY_COST > $DAILY_AVERAGE * 2" | bc -l) )); then
    MESSAGE="🔥 Cost spike alert: Yesterday's cost (\$${YESTERDAY_COST}) is more than double the daily average (\$${DAILY_AVERAGE})"
    echo "$MESSAGE"
    
    curl -X POST "$WEBHOOK_URL" \
        -H 'Content-Type: application/json' \
        -d "{\"text\":\"$MESSAGE\"}"
fi
定时成本报告
#!/bin/bash
# weekly-cost-report.sh

ORG_NAME="your-org"
REPORT_DATE=$(date +%Y-%m-%d)

echo "📊 Weekly Cost Report - $REPORT_DATE"
echo "=================================="

# 1. 总体成本概览
BILLING_DATA=$(fly billing show --org $ORG_NAME --json)
CURRENT_MONTH=$(echo "$BILLING_DATA" | jq -r '.current_month.total')
LAST_MONTH=$(echo "$BILLING_DATA" | jq -r '.last_month.total')

echo "Current month: \$${CURRENT_MONTH}"
echo "Last month: \$${LAST_MONTH}"

# 计算月环比变化
if [ "$LAST_MONTH" != "0" ]; then
    CHANGE_PERCENT=$(echo "scale=2; ($CURRENT_MONTH - $LAST_MONTH) * 100 / $LAST_MONTH" | bc)
    echo "Month-over-month change: ${CHANGE_PERCENT}%"
fi

# 2. 按应用分解成本
echo ""
echo "Cost by application:"
echo "-------------------"

fly apps list --org $ORG_NAME --json | jq -r '.[] | .name' | while read app; do
    # 获取应用成本明细(这里需要根据实际API调整)
    MACHINES_COUNT=$(fly machines list --app $app --json | jq 'length')
    echo "$app: $MACHINES_COUNT machines"
done

# 3. 资源使用统计
echo ""
echo "Resource utilization:"
echo "--------------------"

TOTAL_MEMORY=0
TOTAL_CPUS=0
RUNNING_MACHINES=0

fly apps list --org $ORG_NAME --json | jq -r '.[] | .name' | while read app; do
    MACHINES=$(fly machines list --app $app --json)
    
    APP_MEMORY=$(echo "$MACHINES" | jq '[.[] | select(.state == "started") | .config.guest.memory_mb] | add // 0')
    APP_CPUS=$(echo "$MACHINES" | jq '[.[] | select(.state == "started") | .config.guest.cpus] | add // 0')
    APP_RUNNING=$(echo "$MACHINES" | jq '[.[] | select(.state == "started")] | length')
    
    TOTAL_MEMORY=$((TOTAL_MEMORY + APP_MEMORY))
    TOTAL_CPUS=$((TOTAL_CPUS + APP_CPUS))
    RUNNING_MACHINES=$((RUNNING_MACHINES + APP_RUNNING))
done

echo "Total running machines: $RUNNING_MACHINES"
echo "Total allocated CPUs: $TOTAL_CPUS"
echo "Total allocated memory: ${TOTAL_MEMORY}MB"

# 4. 优化建议
echo ""
echo "Optimization recommendations:"
echo "-----------------------------"

# 检查长时间停止的机器
echo "Checking for optimization opportunities..."

# 这里可以添加更多分析逻辑

echo ""
echo "Report generated on: $(date)"

10. 运维脚本集合

10.1 常用bash脚本

应用健康检查脚本
#!/bin/bash
# health-check.sh

APP_NAME="${1:-my-app}"
HEALTH_ENDPOINT="${2:-/health}"

check_health() {
    local url="https://$APP_NAME.fly.dev$HEALTH_ENDPOINT"
    local response_time
    local http_code
    
    response_time=$(curl -o /dev/null -s -w "%{time_total}" -m 10 "$url")
    http_code=$(curl -o /dev/null -s -w "%{http_code}" -m 10 "$url")
    
    echo "Response time: ${response_time}s"
    echo "HTTP code: $http_code"
    
    if [ "$http_code" = "200" ]; then
        if (( $(echo "$response_time < 2.0" | bc -l) )); then
            echo "✅ Health check PASSED"
            return 0
        else
            echo "⚠️ Health check SLOW (>${response_time}s)"
            return 1
        fi
    else
        echo "❌ Health check FAILED (HTTP $http_code)"
        return 1
    fi
}

echo "Checking health for $APP_NAME..."
echo "Endpoint: $HEALTH_ENDPOINT"
echo "================================"

for i in {1..3}; do
    echo "Attempt $i:"
    if check_health; then
        exit 0
    fi
    echo ""
    sleep 5
done

echo "Health check failed after 3 attempts"
exit 1
快速部署脚本
#!/bin/bash
# quick-deploy.sh

set -e  # 遇到错误立即退出

APP_NAME="${1:-my-app}"
ENVIRONMENT="${2:-production}"

echo "🚀 Quick Deploy Script"
echo "App: $APP_NAME"
echo "Environment: $ENVIRONMENT"

# 1. 预检查
echo "Running pre-deployment checks..."

# 检查Git状态
if ! git diff-index --quiet HEAD --; then
    echo "⚠️ You have uncommitted changes"
    read -p "Continue anyway? (y/N): " CONTINUE
    if [ "$CONTINUE" != "y" ]; then
        echo "Deployment cancelled"
        exit 1
    fi
fi

# 检查Docker构建
echo "Testing Docker build..."
if ! docker build -t test-build . > /dev/null 2>&1; then
    echo "❌ Docker build failed"
    exit 1
fi

# 2. 获取当前版本信息
CURRENT_VERSION=$(fly releases --app $APP_NAME --json | jq -r '.[0].version')
echo "Current version: v$CURRENT_VERSION"

# 3. 执行部署
echo "Starting deployment..."
fly deploy --app $APP_NAME

# 4. 等待部署完成
echo "Waiting for deployment to stabilize..."
sleep 30

# 5. 验证部署
echo "Verifying deployment..."
NEW_VERSION=$(fly releases --app $APP_NAME --json | jq -r '.[0].version')
echo "New version: v$NEW_VERSION"

if [ "$NEW_VERSION" != "$CURRENT_VERSION" ]; then
    # 运行健康检查
    if bash health-check.sh $APP_NAME; then
        echo "✅ Deployment successful!"
        echo "Version: v$CURRENT_VERSION → v$NEW_VERSION"
    else
        echo "❌ Deployment health check failed"
        read -p "Rollback to v$CURRENT_VERSION? (Y/n): " ROLLBACK
        if [ "$ROLLBACK" != "n" ]; then
            fly releases rollback --app $APP_NAME
            echo "Rolled back to v$CURRENT_VERSION"
        fi
        exit 1
    fi
else
    echo "⚠️ Version unchanged, deployment may have failed"
    exit 1
fi

10.2 批量操作脚本

批量应用管理
#!/bin/bash
# batch-operations.sh

OPERATION="$1"
PATTERN="${2:-.*}"  # 应用名称模式

if [ -z "$OPERATION" ]; then
    echo "Usage: $0 <operation> [pattern]"
    echo "Operations: status, restart, scale, logs"
    echo "Pattern: regex pattern to match app names (default: all apps)"
    exit 1
fi

# 获取匹配的应用列表
APPS=$(fly apps list --json | jq -r ".[] | select(.name | test(\"$PATTERN\")) | .name")

if [ -z "$APPS" ]; then
    echo "No apps found matching pattern: $PATTERN"
    exit 1
fi

echo "Found apps matching '$PATTERN':"
echo "$APPS"
echo ""

case $OPERATION in
    "status")
        echo "Getting status for all matching apps..."
        echo "$APPS" | while read app; do
            echo "=== $app ==="
            fly status --app $app
            echo ""
        done
        ;;
        
    "restart")
        echo "Restarting all matching apps..."
        echo "$APPS" | while read app; do
            echo "Restarting $app..."
            fly restart --app $app
            sleep 10  # 等待重启完成
        done
        ;;
        
    "scale")
        COUNT="${3:-1}"
        echo "Scaling all matching apps to $COUNT instances..."
        echo "$APPS" | while read app; do
            echo "Scaling $app to $COUNT..."
            fly scale count $COUNT --app $app
        done
        ;;
        
    "logs")
        DURATION="${3:-5m}"
        echo "Getting logs for all matching apps (last $DURATION)..."
        echo "$APPS" | while read app; do
            echo "=== $app logs ===" 
            fly logs --since $DURATION --app $app | tail -10
            echo ""
        done
        ;;
        
    *)
        echo "Unknown operation: $OPERATION"
        exit 1
        ;;
esac
批量环境变量更新
#!/bin/bash
# batch-env-update.sh

ENV_FILE="$1"
APPS_PATTERN="${2:-.*}"

if [ ! -f "$ENV_FILE" ]; then
    echo "Usage: $0 <env_file> [apps_pattern]"
    echo "env_file format: KEY=value (one per line)"
    exit 1
fi

echo "Updating environment variables from: $ENV_FILE"
echo "Apps pattern: $APPS_PATTERN"

# 获取匹配的应用
APPS=$(fly apps list --json | jq -r ".[] | select(.name | test(\"$APPS_PATTERN\")) | .name")

echo "Apps to update:"
echo "$APPS"
echo ""

# 读取环境变量
declare -A ENV_VARS
while IFS='=' read -r key value; do
    # 跳过注释和空行
    [[ $key =~ ^#.*$ || -z $key ]] && continue
    ENV_VARS[$key]=$value
done < "$ENV_FILE"

# 更新每个应用
echo "$APPS" | while read app; do
    echo "Updating $app..."
    
    # 构建secrets命令
    SECRETS_CMD="fly secrets set --app $app"
    ENV_CMD="fly env set --app $app"
    
    for key in "${!ENV_VARS[@]}"; do
        value="${ENV_VARS[$key]}"
        
        # 判断是否为敏感信息
        if [[ $key =~ (PASSWORD|SECRET|KEY|TOKEN) ]]; then
            SECRETS_CMD="$SECRETS_CMD $key=\"$value\""
        else
            ENV_CMD="$ENV_CMD $key=\"$value\""
        fi
    done
    
    # 执行更新
    if [ "$SECRETS_CMD" != "fly secrets set --app $app" ]; then
        eval $SECRETS_CMD
    fi
    
    if [ "$ENV_CMD" != "fly env set --app $app" ]; then
        eval $ENV_CMD
    fi
    
    echo "Updated $app"
    sleep 2
done

echo "Batch environment update completed"

10.3 监控告警脚本

综合监控脚本
#!/bin/bash
# comprehensive-monitor.sh

WEBHOOK_URL="$1"
CHECK_INTERVAL="${2:-300}"  # 5分钟

if [ -z "$WEBHOOK_URL" ]; then
    echo "Usage: $0 <webhook_url> [check_interval_seconds]"
    exit 1
fi

send_alert() {
    local message="$1"
    local severity="${2:-WARNING}"
    
    echo "[$severity] $message"
    
    curl -X POST "$WEBHOOK_URL" \
        -H 'Content-Type: application/json' \
        -d "{\"text\":\"[$severity] $message\"}" \
        > /dev/null 2>&1
}

check_app_health() {
    local app="$1"
    local url="https://$app.fly.dev/health"
    
    if ! curl -f "$url" > /dev/null 2>&1; then
        send_alert "Health check failed for $app" "CRITICAL"
        return 1
    fi
    return 0
}

check_app_performance() {
    local app="$1"
    local url="https://$app.fly.dev/health"
    
    local response_time=$(curl -o /dev/null -s -w "%{time_total}" -m 10 "$url" 2>/dev/null)
    
    if (( $(echo "$response_time > 3.0" | bc -l) )); then
        send_alert "Slow response detected for $app: ${response_time}s" "WARNING"
    fi
}

check_machine_states() {
    local app="$1"
    
    local machines=$(fly machines list --app $app --json 2>/dev/null)
    if [ "$machines" = "[]" ] || [ -z "$machines" ]; then
        return
    fi
    
    local failed_machines=$(echo "$machines" | jq -r '.[] | select(.state == "failed") | .id')
    if [ -n "$failed_machines" ]; then
        send_alert "Failed machines detected in $app: $failed_machines" "CRITICAL"
    fi
    
    local running_count=$(echo "$machines" | jq '[.[] | select(.state == "started")] | length')
    if [ "$running_count" = "0" ]; then
        send_alert "No running machines in $app" "CRITICAL"
    fi
}

monitor_costs() {
    local current_cost=$(fly billing show --json 2>/dev/null | jq -r '.current_month.total // 0')
    local day_of_month=$(date +%d)
    local daily_average=$(echo "scale=2; $current_cost / $day_of_month" | bc 2>/dev/null || echo "0")
    
    if (( $(echo "$daily_average > 10" | bc -l) )); then
        send_alert "High daily cost average: \$${daily_average}" "WARNING"
    fi
}

echo "🔍 Starting comprehensive monitoring..."
echo "Check interval: ${CHECK_INTERVAL}s"
echo "Webhook: $WEBHOOK_URL"

while true; do
    echo "=== $(date) ==="
    
    # 获取所有应用
    APPS=$(fly apps list --json 2>/dev/null | jq -r '.[].name' 2>/dev/null)
    
    if [ -z "$APPS" ]; then
        send_alert "Failed to retrieve apps list" "ERROR"
        sleep $CHECK_INTERVAL
        continue
    fi
    
    # 检查每个应用
    echo "$APPS" | while read app; do
        echo "Checking $app..."
        
        # 健康检查
        check_app_health "$app"
        
        # 性能检查
        check_app_performance "$app"
        
        # 机器状态检查
        check_machine_states "$app"
        
        sleep 2  # 避免过于频繁的API调用
    done
    
    # 成本监控(每小时检查一次)
    if [ $(($(date +%M) % 60)) -eq 0 ]; then
        monitor_costs
    fi
    
    echo "Monitoring cycle completed"
    sleep $CHECK_INTERVAL
done
自动恢复脚本
#!/bin/bash
# auto-recovery.sh

APP_NAME="$1"
MAX_FAILURES="${2:-3}"
RECOVERY_DELAY="${3:-60}"

if [ -z "$APP_NAME" ]; then
    echo "Usage: $0 <app_name> [max_failures] [recovery_delay_seconds]"
    exit 1
fi

FAILURE_COUNT=0
LAST_RECOVERY=0

log_event() {
    local event="$1"
    echo "$(date '+%Y-%m-%d %H:%M:%S') [$APP_NAME] $event" | tee -a recovery.log
}

check_and_recover() {
    # 检查应用健康状态
    if curl -f "https://$APP_NAME.fly.dev/health" > /dev/null 2>&1; then
        if [ $FAILURE_COUNT -gt 0 ]; then
            log_event "Application recovered, resetting failure count"
            FAILURE_COUNT=0
        fi
        return 0
    fi
    
    FAILURE_COUNT=$((FAILURE_COUNT + 1))
    log_event "Health check failed (attempt $FAILURE_COUNT/$MAX_FAILURES)"
    
    if [ $FAILURE_COUNT -ge $MAX_FAILURES ]; then
        local current_time=$(date +%s)
        local time_since_recovery=$((current_time - LAST_RECOVERY))
        
        if [ $time_since_recovery -ge $RECOVERY_DELAY ]; then
            log_event "Starting automatic recovery procedure"
            
            # 尝试重启应用
            if fly restart --app $APP_NAME; then
                log_event "Application restart initiated"
                LAST_RECOVERY=$current_time
                FAILURE_COUNT=0
                
                # 等待重启完成
                sleep 30
                
                # 验证恢复
                if curl -f "https://$APP_NAME.fly.dev/health" > /dev/null 2>&1; then
                    log_event "Automatic recovery successful"
                else
                    log_event "Automatic recovery failed, manual intervention required"
                fi
            else
                log_event "Failed to restart application"
            fi
        else
            log_event "Recovery delay not met, waiting..."
        fi
    fi
}

echo "🔄 Auto-recovery monitoring for $APP_NAME"
echo "Max failures: $MAX_FAILURES"
echo "Recovery delay: ${RECOVERY_DELAY}s"

while true; do
    check_and_recover
    sleep 30
done

结语

这份运维指南涵盖了Fly.io平台的完整部署和维护操作,包括:

  • 快速上手:从环境准备到首次部署的完整流程
  • 日常运维:监控、日志、性能调优等常规操作
  • 高级管理:扩缩容、数据库管理、网络安全配置
  • 自动化集成:CI/CD流程和批量操作脚本
  • 应急响应:故障排查、恢复流程和成本优化

使用建议:

  1. 收藏常用脚本:将frequently used的脚本加入到你的工具箱
  2. 定制化配置:根据你的具体应用需求调整配置参数
  3. 监控告警:设置适合你团队的成本和性能告警阈值
  4. 定期review:定期检查和优化你的部署配置

持续改进:

这份指南会随着Fly.io平台的更新和最佳实践的演进而持续更新。建议定期关注Fly.io官方文档和社区最佳实践。


文档版本: v1.0
更新时间: 2024年12月
维护团队: DevOps Team
反馈方式: 通过项目issue提交改进建议