Fly.io 部署维护实操指南
开发和运维操作完整手册
目录
1. 快速开始检查清单
1.1 环境准备清单
✅ 开发环境检查
# 检查必要工具
[] Docker已安装并运行
[] Git版本控制
[] 代码编辑器配置
[] 本地开发端口(3000, 8080等)可用
# 验证命令
docker --version
git --version
netstat -an | grep :3000
✅ Fly.io 账号设置
# 1. 注册账号
# 访问 https://fly.io/app/sign-up
# 2. 验证邮箱
[] 邮箱验证完成
[] 账号激活确认
# 3. 添加支付方式
[] 信用卡信息已添加
[] 计费信息已确认
✅ CLI工具安装
# macOS安装
brew install flyctl
# Linux安装
curl -L https://fly.io/install.sh | sh
# Windows (PowerShell)
iwr https://fly.io/install.ps1 -useb | iex
# 验证安装
fly version
# 预期输出:flyctl v0.x.x
1.2 认证和基础配置
# 登录账号
fly auth login
# 浏览器会自动打开,完成认证
# 验证登录状态
fly auth whoami
# 预期输出:显示你的邮箱地址
# 查看组织信息
fly orgs list
# 显示你有权限的组织
# 设置默认组织(可选)
fly orgs select YOUR_ORG_NAME
1.3 项目初始化检查清单
✅ 项目准备
[] Dockerfile已创建且测试通过
[] .dockerignore文件已配置
[] 环境变量已梳理
[] 端口配置已确认
[] 健康检查接口已实现
# 本地测试Docker构建
docker build -t test-app .
docker run -p 3000:3000 test-app
curl http://localhost:3000/health
✅ 部署前最终检查
[] 代码已提交到Git
[] 敏感信息已从代码中移除
[] 生产配置已准备
[] 数据库连接字符串已准备
[] 域名已准备(如需要)
2. 部署操作手册
2.1 首次部署完整流程
步骤1: 项目初始化
# 进入项目目录
cd your-project
# 初始化Fly.io应用
fly launch
# 交互式配置:
# - 应用名称
# - 部署区域
# - 数据库需求
# - Redis需求
# 查看生成的配置
cat fly.toml
步骤2: 配置文件调优
# fly.toml 生产配置示例
app = "my-production-app"
primary_region = "nrt"
[build]
# 使用多阶段构建优化镜像大小
[env]
NODE_ENV = "production"
PORT = "3000"
[http_service]
internal_port = 3000
force_https = true
auto_stop_machines = "stop"
auto_start_machines = true
min_machines_running = 1
[http_service.concurrency]
type = "connections"
hard_limit = 25
soft_limit = 20
[[http_service.checks]]
grace_period = "10s"
interval = "30s"
method = "GET"
path = "/health"
protocol = "http"
timeout = "5s"
[vm]
size = "shared-cpu-1x"
memory = "1gb"
[[statics]]
guest_path = "/app/public"
url_prefix = "/static/"
步骤3: 设置环境变量和密钥
# 设置生产环境密钥
fly secrets set DATABASE_URL="postgresql://..."
fly secrets set JWT_SECRET="your-random-secret"
fly secrets set API_KEY="your-api-key"
# 查看已设置的密钥
fly secrets list
步骤4: 执行部署
# 部署应用
fly deploy
# 监控部署过程
fly logs
# 验证部署状态
fly status
fly checks list
步骤5: 部署后验证
# 获取应用URL
fly info
# 测试应用响应
curl https://my-production-app.fly.dev/health
# 检查所有服务状态
fly status --all
# 查看详细日志
fly logs --since 10m
2.2 更新部署策略
滚动更新部署
# 标准更新部署
fly deploy
# 指定特定镜像版本
fly deploy --image my-app:v2.1.0
# 仅更新配置不重新构建
fly deploy --no-build
# 部署到特定区域
fly deploy --region nrt
蓝绿部署
# 1. 部署新版本到新实例
fly scale count 2
# 2. 验证新实例健康状态
fly status
# 3. 如果正常,逐步切换流量
fly scale count 1 --region nrt # 保留新版本
fly machines stop OLD_MACHINE_ID # 停止旧版本
# 4. 如果有问题,快速回滚
fly machines start OLD_MACHINE_ID
fly machines stop NEW_MACHINE_ID
2.3 回滚操作指南
快速回滚到上一版本
# 查看部署历史
fly releases
# 回滚到上一个版本
fly releases rollback
# 回滚到特定版本
fly releases rollback v3
# 确认回滚状态
fly status
fly logs --since 5m
紧急回滚流程
#!/bin/bash
# 紧急回滚脚本
echo "开始紧急回滚..."
# 1. 立即回滚到上一版本
fly releases rollback
# 2. 检查回滚状态
if fly status | grep -q "running"; then
echo "✅ 回滚成功,应用正在运行"
else
echo "❌ 回滚失败,需要手动干预"
exit 1
fi
# 3. 验证应用健康状态
if curl -f https://your-app.fly.dev/health; then
echo "✅ 应用健康检查通过"
else
echo "⚠️ 应用健康检查失败,请检查日志"
fly logs --since 2m
fi
echo "回滚操作完成"
2.4 多环境部署
开发环境配置
# fly.dev.toml
app = "my-app-dev"
primary_region = "nrt"
[env]
NODE_ENV = "development"
DEBUG = "true"
[http_service]
auto_stop_machines = "stop" # 节省成本
min_machines_running = 0
[vm]
size = "shared-cpu-1x"
memory = "256mb" # 最小配置
测试环境配置
# fly.staging.toml
app = "my-app-staging"
primary_region = "nrt"
[env]
NODE_ENV = "staging"
[http_service]
min_machines_running = 1
[vm]
size = "shared-cpu-1x"
memory = "512mb"
多环境部署脚本
#!/bin/bash
# deploy-multi-env.sh
ENVIRONMENT=$1
if [ -z "$ENVIRONMENT" ]; then
echo "Usage: $0 [dev|staging|production]"
exit 1
fi
case $ENVIRONMENT in
"dev")
CONFIG_FILE="fly.dev.toml"
;;
"staging")
CONFIG_FILE="fly.staging.toml"
;;
"production")
CONFIG_FILE="fly.toml"
;;
*)
echo "Unknown environment: $ENVIRONMENT"
exit 1
;;
esac
echo "Deploying to $ENVIRONMENT using $CONFIG_FILE"
fly deploy --config $CONFIG_FILE
echo "Deployment completed for $ENVIRONMENT"
fly status --config $CONFIG_FILE
3. 日常运维操作
3.1 健康检查和监控
应用健康状态检查
# 查看应用整体状态
fly status
# 详细状态信息
fly status --all
# 检查健康检查状态
fly checks list
# 查看特定Machine状态
fly machines list
fly machines show MACHINE_ID
实时监控脚本
#!/bin/bash
# monitor.sh - 实时监控脚本
APP_NAME="your-app"
HEALTH_URL="https://${APP_NAME}.fly.dev/health"
while true; do
echo "=== $(date) ==="
# 检查应用状态
STATUS=$(fly status --json | jq -r '.status')
echo "App Status: $STATUS"
# 检查健康端点
if curl -f $HEALTH_URL > /dev/null 2>&1; then
echo "Health Check: ✅ PASS"
else
echo "Health Check: ❌ FAIL"
echo "Checking logs..."
fly logs --since 2m | tail -10
fi
# 检查Machine状态
MACHINES=$(fly machines list --json | jq -r '.[].state' | sort | uniq -c)
echo "Machine States:"
echo "$MACHINES"
echo "------------------------"
sleep 30
done
3.2 日志管理和分析
基础日志操作
# 查看实时日志
fly logs
# 查看历史日志
fly logs --since 1h
fly logs --since "2024-01-15 10:00:00"
# 过滤特定实例日志
fly logs --instance MACHINE_ID
# 只显示应用日志,排除系统日志
fly logs --no-system
# 导出日志到文件
fly logs --since 24h > app-logs-$(date +%Y%m%d).log
日志分析脚本
#!/bin/bash
# log-analysis.sh - 日志分析工具
LOG_FILE="app-logs-$(date +%Y%m%d).log"
FLY_APP="your-app"
# 获取24小时日志
echo "Fetching logs..."
fly logs --since 24h --app $FLY_APP > $LOG_FILE
# 分析错误
echo "=== Error Analysis ==="
grep -i "error\|exception\|fail" $LOG_FILE | head -20
# 分析响应时间
echo "=== Slow Requests (>1000ms) ==="
grep "response_time" $LOG_FILE | awk '$NF > 1000' | head -10
# 分析访问统计
echo "=== Top IPs ==="
grep -o 'ip=[0-9.]*' $LOG_FILE | sort | uniq -c | sort -nr | head -10
# 分析状态码
echo "=== Status Code Distribution ==="
grep -o 'status=[0-9]*' $LOG_FILE | sort | uniq -c | sort -nr
echo "Log analysis saved to: $LOG_FILE"
3.3 性能调优操作
资源使用监控
# 查看资源使用情况
fly ssh console -C "htop"
# 查看内存使用
fly ssh console -C "free -h"
# 查看磁盘使用
fly ssh console -C "df -h"
# 查看网络连接
fly ssh console -C "netstat -an | head -20"
# 查看进程状态
fly ssh console -C "ps aux | head -20"
性能基准测试
#!/bin/bash
# performance-test.sh
APP_URL="https://your-app.fly.dev"
CONCURRENT_USERS=10
DURATION=60
echo "Starting performance test..."
echo "URL: $APP_URL"
echo "Concurrent Users: $CONCURRENT_USERS"
echo "Duration: ${DURATION}s"
# 使用wrk进行压力测试
wrk -t4 -c$CONCURRENT_USERS -d${DURATION}s $APP_URL/api/health
# 或使用ab进行测试
# ab -n 1000 -c $CONCURRENT_USERS $APP_URL/api/health
echo "Performance test completed"
3.4 故障排查流程图
故障发生
↓
检查应用状态 (fly status)
↓
应用运行? → No → 检查部署历史 → 考虑回滚
↓ Yes
检查健康检查 (fly checks list)
↓
健康检查通过? → No → 检查健康检查端点 → 修复应用逻辑
↓ Yes
查看最近日志 (fly logs --since 10m)
↓
发现错误? → Yes → 分析错误类型 → 修复并重新部署
↓ No
检查资源使用 (SSH到机器检查)
↓
资源不足? → Yes → 扩容或优化
↓ No
检查网络连接和外部依赖
↓
网络问题? → Yes → 检查DNS、数据库连接等
↓ No
联系Fly.io支持或社区
4. 扩缩容管理
4.1 手动扩缩容操作
垂直扩容(增加资源)
# 查看当前配置
fly status
# 增加内存
fly scale memory 2048
# 修改CPU配置
fly scale vm performance-1x
# 查看可用配置选项
fly platform vm-sizes
# 验证扩容结果
fly status --all
水平扩容(增加实例)
# 增加总实例数
fly scale count 3
# 在特定区域增加实例
fly scale count 2 --region nrt
fly scale count 1 --region sin
# 查看实例分布
fly status --all
# 检查负载均衡
curl -s https://your-app.fly.dev/debug/instance | grep machine_id
4.2 自动扩缩容配置
基于连接数的自动扩缩容
# fly.toml
[http_service]
auto_stop_machines = "stop"
auto_start_machines = true
min_machines_running = 1
[http_service.concurrency]
type = "connections"
hard_limit = 25 # 单实例最大连接数
soft_limit = 20 # 触发扩容的阈值
基于请求数的自动扩缩容
[http_service.concurrency]
type = "requests"
hard_limit = 100 # 单实例最大并发请求
soft_limit = 80 # 触发扩容的阈值
扩缩容监控脚本
#!/bin/bash
# autoscale-monitor.sh
APP_NAME="your-app"
while true; do
echo "=== $(date) ==="
# 获取当前实例状态
MACHINES=$(fly machines list --json --app $APP_NAME)
RUNNING_COUNT=$(echo $MACHINES | jq '[.[] | select(.state == "started")] | length')
TOTAL_COUNT=$(echo $MACHINES | jq 'length')
echo "Running Machines: $RUNNING_COUNT/$TOTAL_COUNT"
# 检查负载情况
for machine_id in $(echo $MACHINES | jq -r '.[].id'); do
STATE=$(echo $MACHINES | jq -r ".[] | select(.id == \"$machine_id\") | .state")
echo "Machine $machine_id: $STATE"
done
# 检查是否需要手动干预
if [ $RUNNING_COUNT -eq 0 ]; then
echo "⚠️ 没有运行中的实例!"
echo "执行: fly machines start [MACHINE_ID]"
fi
echo "------------------------"
sleep 60
done
4.3 跨区域部署管理
全球区域部署策略
# 查看可用区域
fly platform regions
# 主要用户区域部署
fly scale count 2 --region nrt # 东京(亚太主要)
fly scale count 2 --region iad # 华盛顿(北美主要)
fly scale count 1 --region ams # 阿姆斯特丹(欧洲主要)
# 查看区域分布
fly status --all
区域性能测试
#!/bin/bash
# region-performance-test.sh
REGIONS=("nrt" "iad" "ams" "sin" "lhr")
APP_NAME="your-app"
for region in "${REGIONS[@]}"; do
echo "Testing region: $region"
# 创建测试实例
fly machines create --region $region --config test-config.json
# 等待启动
sleep 30
# 测试延迟
URL="https://$APP_NAME.fly.dev"
LATENCY=$(curl -o /dev/null -s -w "%{time_total}" $URL)
echo "Region $region latency: ${LATENCY}s"
# 清理测试实例
# fly machines stop TEST_MACHINE_ID
done
4.4 负载均衡配置
配置负载均衡策略
# fly.toml - 负载均衡配置
[http_service]
internal_port = 3000
# 会话亲和性(可选)
[http_service.checks]
path = "/health"
# 超时配置
[http_service.timeouts]
read_timeout = "30s"
write_timeout = "30s"
idle_timeout = "60s"
健康检查配置优化
[[http_service.checks]]
grace_period = "5s" # 启动后等待时间
interval = "15s" # 检查间隔
method = "GET"
path = "/health"
protocol = "http"
timeout = "2s" # 检查超时
# 失败处理
[http_service.checks.headers]
"User-Agent" = "fly-health-check"
5. 数据库和存储运维
5.1 PostgreSQL管理操作
创建和配置数据库
# 创建PostgreSQL数据库
fly postgres create --name my-app-db --region nrt
# 查看数据库信息
fly postgres show --app my-app-db
# 连接到数据库
fly postgres connect --app my-app-db
# 获取数据库连接字符串
fly postgres attach --app my-app my-app-db
数据库维护操作
# 查看数据库状态
fly postgres status --app my-app-db
# 数据库配置调优
fly postgres config show --app my-app-db
fly postgres config update --app my-app-db \
--shared-preload-libraries="pg_stat_statements" \
--log-statement="all"
# 重启数据库
fly postgres restart --app my-app-db
5.2 备份和恢复流程
自动备份配置
# 查看备份状态
fly postgres backup list --app my-app-db
# 创建手动备份
fly postgres backup create --app my-app-db
# 下载备份文件
fly postgres backup download BACKUP_ID --app my-app-db
数据库恢复操作
# 从备份恢复
fly postgres restore --app my-app-db --from BACKUP_ID
# 从本地文件恢复
fly postgres import --app my-app-db --local-file backup.sql
# 恢复验证
fly postgres connect --app my-app-db -c "SELECT COUNT(*) FROM your_table;"
备份脚本自动化
#!/bin/bash
# db-backup.sh - 数据库备份脚本
DB_APP="my-app-db"
BACKUP_DIR="/backups"
DATE=$(date +%Y%m%d_%H%M%S)
mkdir -p $BACKUP_DIR
echo "Starting backup for $DB_APP..."
# 创建备份
BACKUP_ID=$(fly postgres backup create --app $DB_APP --json | jq -r '.id')
if [ "$BACKUP_ID" != "null" ]; then
echo "Backup created with ID: $BACKUP_ID"
# 下载备份文件
fly postgres backup download $BACKUP_ID --app $DB_APP \
--output "$BACKUP_DIR/backup_${DATE}.sql"
echo "Backup saved to: $BACKUP_DIR/backup_${DATE}.sql"
# 清理旧备份(保留30天)
find $BACKUP_DIR -name "backup_*.sql" -mtime +30 -delete
echo "Backup completed successfully"
else
echo "Failed to create backup"
exit 1
fi
5.3 数据迁移步骤
从其他数据库迁移
# 1. 导出现有数据
pg_dump postgresql://old-db-url > migration_data.sql
# 2. 准备新数据库
fly postgres create --name new-app-db --region nrt
# 3. 导入数据
fly postgres import --app new-app-db --local-file migration_data.sql
# 4. 验证数据完整性
fly postgres connect --app new-app-db -c "
SELECT
schemaname,
tablename,
n_tup_ins as inserts,
n_tup_upd as updates,
n_tup_del as deletes
FROM pg_stat_user_tables;
"
# 5. 更新应用连接字符串
fly secrets set DATABASE_URL="postgresql://new-db-url"
零停机数据迁移
#!/bin/bash
# zero-downtime-migration.sh
OLD_DB="old-app-db"
NEW_DB="new-app-db"
APP_NAME="my-app"
echo "Starting zero-downtime migration..."
# 1. 创建新数据库
echo "Creating new database..."
fly postgres create --name $NEW_DB --region nrt
# 2. 初始数据同步
echo "Initial data sync..."
pg_dump postgresql://old-db-connection | \
psql postgresql://new-db-connection
# 3. 设置实时同步(使用逻辑复制)
echo "Setting up logical replication..."
# 配置逻辑复制(具体命令取决于数据库版本)
# 4. 等待同步完成
echo "Waiting for sync to complete..."
sleep 30
# 5. 切换应用到新数据库
echo "Switching application to new database..."
fly secrets set DATABASE_URL="postgresql://new-db-connection" --app $APP_NAME
# 6. 重启应用
echo "Restarting application..."
fly restart --app $APP_NAME
# 7. 验证切换成功
echo "Verifying migration..."
if curl -f https://$APP_NAME.fly.dev/health; then
echo "✅ Migration successful"
else
echo "❌ Migration failed, consider rollback"
exit 1
fi
echo "Migration completed"
5.4 Volume管理
持久化存储操作
# 创建Volume
fly volumes create data_volume --size 10GB --region nrt
# 查看Volumes
fly volumes list
# 扩展Volume大小
fly volumes extend data_volume --size 20GB
# 创建Volume快照
fly volumes snapshot create data_volume
# 从快照恢复Volume
fly volumes create new_volume --snapshot SNAPSHOT_ID --region nrt
Volume挂载配置
# fly.toml
[mounts]
source = "data_volume"
destination = "/data"
# 多个挂载点
[[mounts]]
source = "logs_volume"
destination = "/var/log/app"
[[mounts]]
source = "uploads_volume"
destination = "/app/uploads"
Volume备份脚本
#!/bin/bash
# volume-backup.sh
VOLUME_NAME="data_volume"
APP_NAME="my-app"
BACKUP_DATE=$(date +%Y%m%d_%H%M%S)
echo "Creating snapshot of $VOLUME_NAME..."
# 创建快照
SNAPSHOT_ID=$(fly volumes snapshot create $VOLUME_NAME --json | jq -r '.id')
if [ "$SNAPSHOT_ID" != "null" ]; then
echo "Snapshot created: $SNAPSHOT_ID"
# 记录备份信息
echo "$BACKUP_DATE,$VOLUME_NAME,$SNAPSHOT_ID" >> volume_backups.log
echo "Volume backup completed"
else
echo "Failed to create volume snapshot"
exit 1
fi
6. 网络和安全配置
6.1 域名和证书管理
自定义域名配置
# 添加自定义域名
fly certs add example.com
# 添加通配符域名
fly certs add "*.example.com"
# 查看证书状态
fly certs show example.com
# 查看所有证书
fly certs list
# 删除证书
fly certs remove example.com
SSL证书验证
# 检查证书配置
curl -I https://example.com
openssl s_client -connect example.com:443 -servername example.com
# 证书到期检查脚本
#!/bin/bash
# cert-check.sh
DOMAIN="example.com"
EXPIRY_DATE=$(openssl s_client -connect $DOMAIN:443 -servername $DOMAIN 2>/dev/null | \
openssl x509 -noout -dates | grep notAfter | cut -d= -f2)
EXPIRY_TIMESTAMP=$(date -d "$EXPIRY_DATE" +%s)
CURRENT_TIMESTAMP=$(date +%s)
DAYS_UNTIL_EXPIRY=$(( (EXPIRY_TIMESTAMP - CURRENT_TIMESTAMP) / 86400 ))
echo "Certificate for $DOMAIN expires in $DAYS_UNTIL_EXPIRY days"
if [ $DAYS_UNTIL_EXPIRY -lt 30 ]; then
echo "⚠️ Certificate expires soon!"
fi
DNS配置指南
# 查看Fly.io提供的IP地址
fly ips list
# DNS记录配置示例:
# A记录: example.com -> 77.83.142.52
# AAAA记录: example.com -> 2a09:8280:1::4:d2b4
# CNAME记录: www.example.com -> example.com
# 验证DNS配置
dig example.com A +short
dig example.com AAAA +short
nslookup example.com 8.8.8.8
6.2 密钥和环境变量管理
安全密钥管理
# 设置生产环境密钥
fly secrets set DATABASE_URL="postgresql://..."
fly secrets set JWT_SECRET="$(openssl rand -base64 32)"
fly secrets set API_KEY="your-secure-api-key"
# 查看密钥列表(不显示值)
fly secrets list
# 删除密钥
fly secrets unset OLD_SECRET
# 批量设置密钥
fly secrets set \
SECRET_1="value1" \
SECRET_2="value2" \
SECRET_3="value3"
环境变量最佳实践
#!/bin/bash
# setup-secrets.sh - 密钥设置脚本
# 从.env文件读取并设置(排除敏感信息)
while IFS='=' read -r key value; do
# 跳过注释和空行
[[ $key =~ ^#.*$ || -z $key ]] && continue
# 设置非敏感环境变量
if [[ ! $key =~ (PASSWORD|SECRET|KEY|TOKEN) ]]; then
echo "Setting env var: $key"
fly env set "$key=$value"
else
echo "Setting secret: $key"
fly secrets set "$key=$value"
fi
done < .env.production
echo "Environment setup completed"
6.3 内部网络配置
6PN私有网络配置
# 查看私有网络信息
fly ips private
# 分配私有IPv6地址
fly ips allocate-v6 --private
# 查看网络连接
fly ssh console -C "ip addr show"
服务间通信配置
// 应用内网络配置
const DB_HOST = process.env.DATABASE_URL || 'postgres.internal';
const REDIS_HOST = process.env.REDIS_URL || 'redis.internal';
const API_HOST = process.env.API_URL || 'api-service.internal';
// 内网服务发现
const services = {
database: 'my-app-db.internal:5432',
cache: 'my-app-redis.internal:6379',
api: 'my-app-api.internal:8080'
};
网络安全配置
# fly.toml - 网络安全配置
[http_service]
internal_port = 3000
# 仅允许HTTPS
force_https = true
# 配置CORS
[http_service.cors]
allowed_origins = ["https://yourdomain.com"]
allowed_methods = ["GET", "POST", "PUT", "DELETE"]
allowed_headers = ["Content-Type", "Authorization"]
# 内部服务不暴露外部端口
[[services]]
internal_port = 5432
protocol = "tcp"
# 不配置[[services.ports]]即不暴露外部
6.4 防护策略设置
速率限制配置
// 应用层速率限制
const rateLimit = require('express-rate-limit');
const limiter = rateLimit({
windowMs: 15 * 60 * 1000, // 15分钟
max: 100, // 最多100个请求
message: 'Too many requests from this IP',
standardHeaders: true,
legacyHeaders: false,
});
app.use('/api/', limiter);
DDoS防护监控
#!/bin/bash
# ddos-monitor.sh - DDoS监控脚本
APP_NAME="my-app"
THRESHOLD=1000 # 每分钟请求阈值
while true; do
# 统计最近1分钟的请求数
REQUEST_COUNT=$(fly logs --since 1m --app $APP_NAME | \
grep -c "GET\|POST\|PUT\|DELETE")
echo "$(date): Requests in last minute: $REQUEST_COUNT"
if [ $REQUEST_COUNT -gt $THRESHOLD ]; then
echo "⚠️ High traffic detected! Requests: $REQUEST_COUNT"
# 可以触发自动扩容或告警
fly scale count 3 --app $APP_NAME
# 发送告警通知
# curl -X POST "https://hooks.slack.com/..." \
# -d "{'text':'High traffic detected on $APP_NAME: $REQUEST_COUNT requests/min'}"
fi
sleep 60
done
安全头配置
// 安全头中间件
const helmet = require('helmet');
app.use(helmet({
contentSecurityPolicy: {
directives: {
defaultSrc: ["'self'"],
styleSrc: ["'self'", "'unsafe-inline'"],
scriptSrc: ["'self'"],
imgSrc: ["'self'", "data:", "https:"],
},
},
hsts: {
maxAge: 31536000,
includeSubDomains: true,
preload: true
}
}));
7. CI/CD集成实战
7.1 GitHub Actions配置
基础部署工作流
# .github/workflows/deploy.yml
name: Deploy to Fly.io
on:
push:
branches: [main]
pull_request:
branches: [main]
jobs:
deploy:
name: Deploy app
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Setup Node.js
uses: actions/setup-node@v3
with:
node-version: '18'
cache: 'npm'
- name: Install dependencies
run: npm ci
- name: Run tests
run: npm test
- name: Run linter
run: npm run lint
- name: Setup flyctl
uses: superfly/flyctl-actions/setup-flyctl@master
- name: Deploy to Fly.io
run: flyctl deploy --remote-only
env:
FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
多环境部署工作流
# .github/workflows/multi-env-deploy.yml
name: Multi-Environment Deploy
on:
push:
branches: [main, develop]
pull_request:
branches: [main]
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Setup Node.js
uses: actions/setup-node@v3
with:
node-version: '18'
cache: 'npm'
- name: Install and test
run: |
npm ci
npm test
npm run lint
deploy-staging:
needs: test
if: github.ref == 'refs/heads/develop'
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: superfly/flyctl-actions/setup-flyctl@master
- name: Deploy to staging
run: flyctl deploy --config fly.staging.toml
env:
FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
deploy-production:
needs: test
if: github.ref == 'refs/heads/main'
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: superfly/flyctl-actions/setup-flyctl@master
- name: Deploy to production
run: flyctl deploy --config fly.toml
env:
FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
7.2 GitLab CI配置
GitLab CI配置文件
# .gitlab-ci.yml
stages:
- test
- build
- deploy
variables:
DOCKER_HOST: tcp://docker:2376
DOCKER_TLS_CERTDIR: "/certs"
test:
stage: test
image: node:18
script:
- npm ci
- npm test
- npm run lint
cache:
paths:
- node_modules/
build:
stage: build
image: docker:latest
services:
- docker:dind
script:
- docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
- docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
deploy:
stage: deploy
image: flyio/flyctl:latest
script:
- flyctl auth docker
- flyctl deploy --image $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
environment:
name: production
url: https://your-app.fly.dev
only:
- main
7.3 自动化部署脚本
部署前检查脚本
#!/bin/bash
# pre-deploy-check.sh
APP_NAME="my-app"
HEALTH_ENDPOINT="/health"
echo "Starting pre-deployment checks..."
# 1. 检查当前应用状态
echo "Checking current app status..."
if ! fly status --app $APP_NAME > /dev/null 2>&1; then
echo "❌ Cannot access app status"
exit 1
fi
# 2. 检查健康端点
echo "Checking health endpoint..."
if ! curl -f "https://$APP_NAME.fly.dev$HEALTH_ENDPOINT" > /dev/null 2>&1; then
echo "❌ Health check failed"
exit 1
fi
# 3. 检查Docker构建
echo "Testing Docker build..."
if ! docker build -t test-build . > /dev/null 2>&1; then
echo "❌ Docker build failed"
exit 1
fi
# 4. 检查环境变量
echo "Checking required secrets..."
REQUIRED_SECRETS=("DATABASE_URL" "JWT_SECRET")
for secret in "${REQUIRED_SECRETS[@]}"; do
if ! fly secrets list --app $APP_NAME | grep -q $secret; then
echo "❌ Missing required secret: $secret"
exit 1
fi
done
echo "✅ All pre-deployment checks passed"
部署后验证脚本
#!/bin/bash
# post-deploy-verify.sh
APP_NAME="my-app"
TIMEOUT=300 # 5分钟超时
echo "Starting post-deployment verification..."
# 等待部署完成
echo "Waiting for deployment to complete..."
END_TIME=$((SECONDS + TIMEOUT))
while [ $SECONDS -lt $END_TIME ]; do
if fly status --app $APP_NAME | grep -q "running"; then
echo "✅ App is running"
break
fi
echo "Waiting for app to start..."
sleep 10
done
if [ $SECONDS -ge $END_TIME ]; then
echo "❌ Timeout waiting for app to start"
exit 1
fi
# 健康检查
echo "Running health checks..."
sleep 30 # 给应用时间完全启动
if curl -f "https://$APP_NAME.fly.dev/health" > /dev/null 2>&1; then
echo "✅ Health check passed"
else
echo "❌ Health check failed"
echo "Recent logs:"
fly logs --since 5m --app $APP_NAME | tail -20
exit 1
fi
# 功能测试
echo "Running basic functionality tests..."
RESPONSE=$(curl -s "https://$APP_NAME.fly.dev/api/version")
if [[ $RESPONSE =~ "version" ]]; then
echo "✅ API responding correctly"
else
echo "❌ API not responding as expected"
exit 1
fi
echo "✅ Post-deployment verification completed successfully"
7.4 多阶段部署流程
蓝绿部署CI配置
# .github/workflows/blue-green-deploy.yml
name: Blue-Green Deployment
on:
push:
branches: [main]
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: superfly/flyctl-actions/setup-flyctl@master
- name: Blue-Green Deploy
run: |
# 获取当前活跃实例
CURRENT_MACHINES=$(flyctl machines list --json | jq -r '.[].id')
# 部署新版本(绿色环境)
flyctl deploy --strategy=bluegreen
# 等待新实例就绪
sleep 60
# 健康检查
if curl -f https://your-app.fly.dev/health; then
echo "Green deployment successful"
# 停止旧实例(蓝色环境)
for machine in $CURRENT_MACHINES; do
flyctl machines stop $machine
done
echo "Blue-green deployment completed"
else
echo "Health check failed, rolling back"
flyctl releases rollback
exit 1
fi
env:
FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
分阶段部署脚本
#!/bin/bash
# staged-deployment.sh
APP_NAME="my-app"
REGIONS=("nrt" "iad" "ams")
echo "Starting staged deployment..."
for region in "${REGIONS[@]}"; do
echo "Deploying to region: $region"
# 部署到特定区域
flyctl deploy --region $region --app $APP_NAME
# 等待部署完成
sleep 30
# 验证区域部署
if flyctl status --app $APP_NAME | grep $region | grep -q "running"; then
echo "✅ Deployment to $region successful"
else
echo "❌ Deployment to $region failed"
# 回滚策略
echo "Rolling back $region deployment..."
flyctl releases rollback --app $APP_NAME
exit 1
fi
# 区域间等待时间
sleep 60
done
echo "✅ Staged deployment completed successfully"
8. 应急响应手册
8.1 常见故障处理
应用无法启动
# 故障现象:应用显示状态异常或无法访问
# 1. 快速诊断
fly status
fly logs --since 10m
# 2. 检查健康检查
fly checks list
# 3. 检查资源使用
fly ssh console -C "ps aux | head -10"
fly ssh console -C "free -h"
# 4. 常见解决方案
# 4.1 内存不足
fly scale memory 1024
# 4.2 启动超时
# 检查fly.toml中的grace_period设置
[http_service.checks]
grace_period = "30s"
# 4.3 端口配置错误
# 确保internal_port与应用监听端口一致
[http_service]
internal_port = 3000
# 5. 如果仍无法解决,回滚到上一版本
fly releases rollback
数据库连接失败
# 故障现象:应用报数据库连接错误
# 1. 检查数据库状态
fly postgres status --app my-app-db
# 2. 测试数据库连接
fly postgres connect --app my-app-db
# 3. 检查连接字符串
fly secrets list | grep DATABASE
# 4. 验证网络连通性
fly ssh console -C "nc -zv my-app-db.internal 5432"
# 5. 常见解决方案
# 5.1 连接字符串错误
fly secrets set DATABASE_URL="postgresql://correct-url"
# 5.2 连接池耗尽
# 在应用中增加连接池配置
max: 20,
min: 5,
idle: 10000
# 5.3 数据库重启
fly postgres restart --app my-app-db
8.2 紧急回滚流程
自动回滚脚本
#!/bin/bash
# emergency-rollback.sh
APP_NAME="my-app"
HEALTH_URL="https://$APP_NAME.fly.dev/health"
echo "🚨 EMERGENCY ROLLBACK INITIATED"
echo "App: $APP_NAME"
echo "Time: $(date)"
# 1. 记录当前状态
echo "Current status:"
fly status --app $APP_NAME
# 2. 执行回滚
echo "Rolling back to previous version..."
fly releases rollback --app $APP_NAME
# 3. 等待回滚完成
echo "Waiting for rollback to complete..."
sleep 30
# 4. 验证回滚成功
for i in {1..10}; do
if curl -f $HEALTH_URL > /dev/null 2>&1; then
echo "✅ ROLLBACK SUCCESSFUL - Health check passed"
echo "App is responding normally"
# 发送成功通知
echo "Rollback completed at $(date)" | \
curl -X POST -H 'Content-Type: application/json' \
-d @- "https://hooks.slack.com/your-webhook"
exit 0
fi
echo "Attempt $i: Health check failed, retrying in 10s..."
sleep 10
done
# 5. 回滚失败处理
echo "❌ ROLLBACK FAILED - Manual intervention required"
echo "Recent logs:"
fly logs --since 5m --app $APP_NAME | tail -20
# 发送失败告警
echo "CRITICAL: Rollback failed for $APP_NAME" | \
curl -X POST -H 'Content-Type: application/json' \
-d @- "https://hooks.slack.com/your-webhook"
exit 1
灾难恢复计划
#!/bin/bash
# disaster-recovery.sh
APP_NAME="my-app"
DB_APP="my-app-db"
BACKUP_REGION="iad" # 备份区域
echo "🆘 DISASTER RECOVERY PROCEDURE"
# 1. 评估损坏程度
echo "Assessing damage..."
if fly status --app $APP_NAME | grep -q "running"; then
echo "App is running, attempting standard recovery..."
# 标准恢复流程
fly releases rollback --app $APP_NAME
else
echo "App is down, initiating disaster recovery..."
# 2. 创建新应用实例
echo "Creating new application instance..."
fly apps create $APP_NAME-recovery --region $BACKUP_REGION
# 3. 恢复数据库
echo "Restoring database..."
LATEST_BACKUP=$(fly postgres backup list --app $DB_APP --json | \
jq -r '.[0].id')
fly postgres create --name $DB_APP-recovery --region $BACKUP_REGION
fly postgres restore --app $DB_APP-recovery --from $LATEST_BACKUP
# 4. 部署应用到恢复实例
echo "Deploying to recovery instance..."
fly deploy --app $APP_NAME-recovery
# 5. 更新DNS(手动步骤提示)
echo "⚠️ MANUAL STEP REQUIRED:"
echo "Update DNS records to point to recovery instance:"
echo "A record: your-domain.com -> $(fly ips list --app $APP_NAME-recovery)"
fi
echo "Disaster recovery procedure completed"
8.3 性能问题诊断
性能监控脚本
#!/bin/bash
# performance-diagnosis.sh
APP_NAME="my-app"
DURATION=300 # 5分钟监控
echo "Starting performance diagnosis for $APP_NAME..."
echo "Duration: ${DURATION}s"
# 1. 基线性能测试
echo "Running baseline performance test..."
ab -n 100 -c 10 "https://$APP_NAME.fly.dev/" > baseline_test.log
# 2. 实时监控
echo "Starting real-time monitoring..."
END_TIME=$((SECONDS + DURATION))
while [ $SECONDS -lt $END_TIME ]; do
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
# CPU和内存使用
RESOURCES=$(fly ssh console --app $APP_NAME -C "top -bn1 | head -5")
# 活跃连接数
CONNECTIONS=$(fly ssh console --app $APP_NAME -C "netstat -an | grep ESTABLISHED | wc -l")
# 响应时间测试
RESPONSE_TIME=$(curl -o /dev/null -s -w "%{time_total}" "https://$APP_NAME.fly.dev/health")
echo "$TIMESTAMP,CPU_MEM,$RESOURCES" >> performance_log.csv
echo "$TIMESTAMP,CONNECTIONS,$CONNECTIONS" >> performance_log.csv
echo "$TIMESTAMP,RESPONSE_TIME,$RESPONSE_TIME" >> performance_log.csv
# 检查是否有性能问题
if (( $(echo "$RESPONSE_TIME > 2.0" | bc -l) )); then
echo "⚠️ Slow response detected: ${RESPONSE_TIME}s"
# 获取详细日志
fly logs --since 1m --app $APP_NAME | grep -E "(error|slow|timeout)" >> slow_requests.log
fi
sleep 10
done
echo "Performance diagnosis completed"
echo "Check performance_log.csv and slow_requests.log for details"
数据库性能诊断
#!/bin/bash
# db-performance-diagnosis.sh
DB_APP="my-app-db"
echo "Database performance diagnosis..."
# 1. 连接到数据库
echo "Connecting to database for analysis..."
# 2. 检查慢查询
fly postgres connect --app $DB_APP -c "
SELECT query, mean_time, calls, total_time
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;
"
# 3. 检查锁等待
fly postgres connect --app $DB_APP -c "
SELECT
blocked_locks.pid AS blocked_pid,
blocked_activity.usename AS blocked_user,
blocking_locks.pid AS blocking_pid,
blocking_activity.usename AS blocking_user,
blocked_activity.query AS blocked_statement,
blocking_activity.query AS current_statement_in_blocking_process
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity
ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
ON blocking_locks.locktype = blocked_locks.locktype
JOIN pg_catalog.pg_stat_activity blocking_activity
ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.GRANTED;
"
# 4. 检查表统计信息
fly postgres connect --app $DB_APP -c "
SELECT
schemaname,
tablename,
n_tup_ins + n_tup_upd + n_tup_del as total_writes,
n_tup_ins,
n_tup_upd,
n_tup_del,
seq_scan,
seq_tup_read
FROM pg_stat_user_tables
ORDER BY total_writes DESC
LIMIT 10;
"
echo "Database performance analysis completed"
8.4 数据恢复操作
数据库恢复脚本
#!/bin/bash
# data-recovery.sh
DB_APP="my-app-db"
RECOVERY_POINT="$1" # 恢复点时间戳或备份ID
if [ -z "$RECOVERY_POINT" ]; then
echo "Usage: $0 <backup_id_or_timestamp>"
echo "Available backups:"
fly postgres backup list --app $DB_APP
exit 1
fi
echo "🔄 DATA RECOVERY PROCEDURE"
echo "Database: $DB_APP"
echo "Recovery Point: $RECOVERY_POINT"
echo "WARNING: This will overwrite current data!"
read -p "Continue? (yes/no): " CONFIRM
if [ "$CONFIRM" != "yes" ]; then
echo "Recovery cancelled"
exit 1
fi
# 1. 创建当前状态备份
echo "Creating safety backup of current state..."
SAFETY_BACKUP=$(fly postgres backup create --app $DB_APP --json | jq -r '.id')
echo "Safety backup created: $SAFETY_BACKUP"
# 2. 停止应用实例
echo "Stopping application instances..."
MACHINES=$(fly machines list --json | jq -r '.[].id')
for machine in $MACHINES; do
fly machines stop $machine
done
# 3. 执行数据恢复
echo "Restoring data from backup..."
if [[ $RECOVERY_POINT =~ ^[0-9]{4}-[0-9]{2}-[0-9]{2} ]]; then
# 时间点恢复
fly postgres restore --app $DB_APP --timestamp "$RECOVERY_POINT"
else
# 备份恢复
fly postgres restore --app $DB_APP --from "$RECOVERY_POINT"
fi
# 4. 验证数据恢复
echo "Verifying data recovery..."
fly postgres connect --app $DB_APP -c "SELECT COUNT(*) FROM pg_tables WHERE schemaname = 'public';"
# 5. 重启应用实例
echo "Restarting application instances..."
for machine in $MACHINES; do
fly machines start $machine
done
# 6. 最终验证
echo "Final verification..."
sleep 30
if curl -f "https://my-app.fly.dev/health"; then
echo "✅ Data recovery completed successfully"
else
echo "❌ Recovery verification failed"
echo "Consider rolling back to safety backup: $SAFETY_BACKUP"
fi
文件系统恢复
#!/bin/bash
# volume-recovery.sh
VOLUME_NAME="data_volume"
SNAPSHOT_ID="$1"
if [ -z "$SNAPSHOT_ID" ]; then
echo "Usage: $0 <snapshot_id>"
echo "Available snapshots:"
fly volumes snapshot list $VOLUME_NAME
exit 1
fi
echo "🗂️ VOLUME RECOVERY PROCEDURE"
echo "Volume: $VOLUME_NAME"
echo "Snapshot: $SNAPSHOT_ID"
# 1. 停止使用该Volume的机器
echo "Stopping machines using volume..."
MACHINES_USING_VOLUME=$(fly machines list --json | \
jq -r '.[] | select(.config.mounts[]?.volume == "'$VOLUME_NAME'") | .id')
for machine in $MACHINES_USING_VOLUME; do
echo "Stopping machine: $machine"
fly machines stop $machine
done
# 2. 创建新Volume从快照
echo "Creating new volume from snapshot..."
NEW_VOLUME="${VOLUME_NAME}_recovered"
fly volumes create $NEW_VOLUME --snapshot $SNAPSHOT_ID --region nrt
# 3. 更新机器配置使用新Volume
echo "Updating machine configurations..."
for machine in $MACHINES_USING_VOLUME; do
# 这里需要更新机器配置,将旧volume替换为新volume
# 具体实现取决于你的应用配置
echo "Update machine $machine to use $NEW_VOLUME"
done
# 4. 重启机器
echo "Restarting machines..."
for machine in $MACHINES_USING_VOLUME; do
fly machines start $machine
done
echo "Volume recovery completed"
9. 成本监控和优化
9.1 费用监控设置
成本监控脚本
#!/bin/bash
# cost-monitor.sh
ORG_NAME="your-org"
ALERT_THRESHOLD=100 # 美元
echo "💰 Cost monitoring for organization: $ORG_NAME"
# 1. 获取当前月费用
CURRENT_COST=$(fly billing show --org $ORG_NAME --json | jq -r '.current_month.total')
echo "Current month cost: $${CURRENT_COST}"
# 2. 检查是否超过阈值
if (( $(echo "$CURRENT_COST > $ALERT_THRESHOLD" | bc -l) )); then
echo "⚠️ Cost alert: Current spending ($${CURRENT_COST}) exceeds threshold ($${ALERT_THRESHOLD})"
# 发送告警通知
curl -X POST "https://hooks.slack.com/your-webhook" \
-H 'Content-Type: application/json' \
-d "{\"text\":\"Cost Alert: Current spending \$${CURRENT_COST} exceeds \$${ALERT_THRESHOLD}\"}"
fi
# 3. 获取应用费用明细
echo "Application cost breakdown:"
fly apps list --org $ORG_NAME --json | jq -r '.[] | .name' | while read app; do
APP_COST=$(fly billing breakdown --app $app --json 2>/dev/null | jq -r '.total' 2>/dev/null || echo "0")
echo "$app: \$${APP_COST}"
done
# 4. 预测月底费用
DAY_OF_MONTH=$(date +%d)
DAYS_IN_MONTH=$(date -d "$(date +'%Y-%m-01') +1 month -1 day" +%d)
PROJECTED_COST=$(echo "scale=2; $CURRENT_COST * $DAYS_IN_MONTH / $DAY_OF_MONTH" | bc)
echo "Projected month-end cost: \$${PROJECTED_COST}"
if (( $(echo "$PROJECTED_COST > $ALERT_THRESHOLD * 1.5" | bc -l) )); then
echo "🚨 WARNING: Projected cost significantly exceeds budget!"
fi
资源使用分析
#!/bin/bash
# resource-analysis.sh
echo "📊 Resource utilization analysis"
# 1. 分析所有应用的资源配置
fly apps list --json | jq -r '.[] | .name' | while read app; do
echo "=== App: $app ==="
# 获取机器配置
MACHINES=$(fly machines list --app $app --json)
if [ "$MACHINES" != "[]" ]; then
# 统计配置分布
echo "Machine configurations:"
echo "$MACHINES" | jq -r '.[] | "\(.config.size) - \(.config.guest.memory_mb)MB"' | sort | uniq -c
# 统计运行状态
echo "Machine states:"
echo "$MACHINES" | jq -r '.[].state' | sort | uniq -c
# 计算总资源
TOTAL_MEMORY=$(echo "$MACHINES" | jq '[.[] | select(.state == "started") | .config.guest.memory_mb] | add')
TOTAL_CPUS=$(echo "$MACHINES" | jq '[.[] | select(.state == "started") | .config.guest.cpus] | add')
echo "Total allocated: ${TOTAL_CPUS} CPUs, ${TOTAL_MEMORY}MB RAM"
else
echo "No machines found"
fi
echo ""
done
9.2 资源优化建议
自动扩缩容优化
#!/bin/bash
# optimize-autoscaling.sh
APP_NAME="$1"
if [ -z "$APP_NAME" ]; then
echo "Usage: $0 <app_name>"
exit 1
fi
echo "🎯 Autoscaling optimization for $APP_NAME"
# 1. 分析过去24小时的机器状态变化
echo "Analyzing machine state changes in last 24h..."
fly logs --since 24h --app $APP_NAME | grep -E "(machine.*started|machine.*stopped)" > machine_events.log
START_EVENTS=$(grep "started" machine_events.log | wc -l)
STOP_EVENTS=$(grep "stopped" machine_events.log | wc -l)
echo "Machine start events: $START_EVENTS"
echo "Machine stop events: $STOP_EVENTS"
# 2. 计算平均运行时间
if [ $START_EVENTS -gt 0 ] && [ $STOP_EVENTS -gt 0 ]; then
AVG_RUNTIME=$(echo "scale=2; 24 * 60 / ($START_EVENTS + $STOP_EVENTS)" | bc)
echo "Average machine runtime: ${AVG_RUNTIME} minutes"
# 3. 提供优化建议
if (( $(echo "$AVG_RUNTIME < 10" | bc -l) )); then
echo "💡 Optimization suggestion:"
echo "- Very short runtime detected"
echo "- Consider increasing min_machines_running to 1"
echo "- This will reduce cold start frequency"
echo "Recommended configuration:"
echo "[http_service]"
echo " min_machines_running = 1"
echo " auto_stop_machines = \"suspend\""
fi
if (( $(echo "$START_EVENTS > 50" | bc -l) )); then
echo "💡 Optimization suggestion:"
echo "- High frequency scaling detected"
echo "- Consider adjusting concurrency limits"
echo "- Current events suggest traffic spikes"
# 检查当前配置
CURRENT_CONFIG=$(fly config show --app $APP_NAME)
echo "Current concurrency config:"
echo "$CURRENT_CONFIG" | grep -A 5 "concurrency"
fi
else
echo "ℹ️ No significant scaling activity detected"
echo "Current configuration appears stable"
fi
成本优化建议脚本
#!/bin/bash
# cost-optimization.sh
echo "💰 Cost optimization analysis"
# 1. 检查过度配置的机器
echo "=== Checking for over-provisioned machines ==="
fly apps list --json | jq -r '.[] | .name' | while read app; do
echo "Analyzing app: $app"
# 获取机器配置和使用情况
MACHINES=$(fly machines list --app $app --json)
if [ "$MACHINES" != "[]" ]; then
# 检查大内存配置的机器
HIGH_MEMORY_MACHINES=$(echo "$MACHINES" | jq '[.[] | select(.config.guest.memory_mb > 1024)]')
if [ "$HIGH_MEMORY_MACHINES" != "[]" ]; then
echo "⚠️ Found high-memory machines (>1GB):"
echo "$HIGH_MEMORY_MACHINES" | jq -r '.[] | "\(.id): \(.config.guest.memory_mb)MB"'
echo "💡 Suggestion: Monitor memory usage and consider downsizing if underutilized"
echo "Check with: fly ssh console --app $app -C 'free -h'"
fi
# 检查停止状态的机器
STOPPED_MACHINES=$(echo "$MACHINES" | jq '[.[] | select(.state == "stopped")]')
if [ "$STOPPED_MACHINES" != "[]" ]; then
echo "ℹ️ Found stopped machines (no cost impact):"
echo "$STOPPED_MACHINES" | jq -r '.[] | .id'
fi
fi
echo ""
done
# 2. 检查未使用的volumes
echo "=== Checking for unused volumes ==="
fly volumes list --json | jq -r '.[] | select(.attached_machine_id == null) | "\(.name): \(.size_gb)GB - $" + ((.size_gb * 0.15) | tostring) + "/month"'
# 3. 检查多余的IP地址
echo "=== Checking IP address usage ==="
DEDICATED_IPS=$(fly ips list --json | jq '[.[] | select(.type == "v4")]')
if [ "$DEDICATED_IPS" != "[]" ]; then
echo "💰 Dedicated IPv4 addresses found ($2/month each):"
echo "$DEDICATED_IPS" | jq -r '.[] | .address'
echo "💡 Suggestion: Consider if all dedicated IPs are necessary"
fi
echo "=== Cost optimization analysis completed ==="
9.3 成本告警配置
成本告警系统
#!/bin/bash
# cost-alerting.sh
# 配置参数
ORG_NAME="your-org"
DAILY_THRESHOLD=5 # 每日预算
MONTHLY_THRESHOLD=100 # 月度预算
WEBHOOK_URL="https://hooks.slack.com/your-webhook"
# 获取当前成本数据
BILLING_DATA=$(fly billing show --org $ORG_NAME --json)
CURRENT_MONTH_COST=$(echo "$BILLING_DATA" | jq -r '.current_month.total')
YESTERDAY_COST=$(echo "$BILLING_DATA" | jq -r '.yesterday.total')
# 计算每日平均成本
DAY_OF_MONTH=$(date +%d)
DAILY_AVERAGE=$(echo "scale=2; $CURRENT_MONTH_COST / $DAY_OF_MONTH" | bc)
echo "Cost monitoring report:"
echo "Current month total: \$${CURRENT_MONTH_COST}"
echo "Yesterday cost: \$${YESTERDAY_COST}"
echo "Daily average: \$${DAILY_AVERAGE}"
# 检查每日成本告警
if (( $(echo "$YESTERDAY_COST > $DAILY_THRESHOLD" | bc -l) )); then
MESSAGE="🚨 Daily cost alert: Yesterday's cost (\$${YESTERDAY_COST}) exceeded daily threshold (\$${DAILY_THRESHOLD})"
echo "$MESSAGE"
curl -X POST "$WEBHOOK_URL" \
-H 'Content-Type: application/json' \
-d "{\"text\":\"$MESSAGE\"}"
fi
# 检查月度成本趋势
DAYS_IN_MONTH=$(date -d "$(date +'%Y-%m-01') +1 month -1 day" +%d)
PROJECTED_MONTHLY=$(echo "scale=2; $DAILY_AVERAGE * $DAYS_IN_MONTH" | bc)
if (( $(echo "$PROJECTED_MONTHLY > $MONTHLY_THRESHOLD" | bc -l) )); then
MESSAGE="⚠️ Monthly projection alert: Projected monthly cost (\$${PROJECTED_MONTHLY}) may exceed budget (\$${MONTHLY_THRESHOLD})"
echo "$MESSAGE"
curl -X POST "$WEBHOOK_URL" \
-H 'Content-Type: application/json' \
-d "{\"text\":\"$MESSAGE\"}"
fi
# 检查成本激增
if (( $(echo "$YESTERDAY_COST > $DAILY_AVERAGE * 2" | bc -l) )); then
MESSAGE="🔥 Cost spike alert: Yesterday's cost (\$${YESTERDAY_COST}) is more than double the daily average (\$${DAILY_AVERAGE})"
echo "$MESSAGE"
curl -X POST "$WEBHOOK_URL" \
-H 'Content-Type: application/json' \
-d "{\"text\":\"$MESSAGE\"}"
fi
定时成本报告
#!/bin/bash
# weekly-cost-report.sh
ORG_NAME="your-org"
REPORT_DATE=$(date +%Y-%m-%d)
echo "📊 Weekly Cost Report - $REPORT_DATE"
echo "=================================="
# 1. 总体成本概览
BILLING_DATA=$(fly billing show --org $ORG_NAME --json)
CURRENT_MONTH=$(echo "$BILLING_DATA" | jq -r '.current_month.total')
LAST_MONTH=$(echo "$BILLING_DATA" | jq -r '.last_month.total')
echo "Current month: \$${CURRENT_MONTH}"
echo "Last month: \$${LAST_MONTH}"
# 计算月环比变化
if [ "$LAST_MONTH" != "0" ]; then
CHANGE_PERCENT=$(echo "scale=2; ($CURRENT_MONTH - $LAST_MONTH) * 100 / $LAST_MONTH" | bc)
echo "Month-over-month change: ${CHANGE_PERCENT}%"
fi
# 2. 按应用分解成本
echo ""
echo "Cost by application:"
echo "-------------------"
fly apps list --org $ORG_NAME --json | jq -r '.[] | .name' | while read app; do
# 获取应用成本明细(这里需要根据实际API调整)
MACHINES_COUNT=$(fly machines list --app $app --json | jq 'length')
echo "$app: $MACHINES_COUNT machines"
done
# 3. 资源使用统计
echo ""
echo "Resource utilization:"
echo "--------------------"
TOTAL_MEMORY=0
TOTAL_CPUS=0
RUNNING_MACHINES=0
fly apps list --org $ORG_NAME --json | jq -r '.[] | .name' | while read app; do
MACHINES=$(fly machines list --app $app --json)
APP_MEMORY=$(echo "$MACHINES" | jq '[.[] | select(.state == "started") | .config.guest.memory_mb] | add // 0')
APP_CPUS=$(echo "$MACHINES" | jq '[.[] | select(.state == "started") | .config.guest.cpus] | add // 0')
APP_RUNNING=$(echo "$MACHINES" | jq '[.[] | select(.state == "started")] | length')
TOTAL_MEMORY=$((TOTAL_MEMORY + APP_MEMORY))
TOTAL_CPUS=$((TOTAL_CPUS + APP_CPUS))
RUNNING_MACHINES=$((RUNNING_MACHINES + APP_RUNNING))
done
echo "Total running machines: $RUNNING_MACHINES"
echo "Total allocated CPUs: $TOTAL_CPUS"
echo "Total allocated memory: ${TOTAL_MEMORY}MB"
# 4. 优化建议
echo ""
echo "Optimization recommendations:"
echo "-----------------------------"
# 检查长时间停止的机器
echo "Checking for optimization opportunities..."
# 这里可以添加更多分析逻辑
echo ""
echo "Report generated on: $(date)"
10. 运维脚本集合
10.1 常用bash脚本
应用健康检查脚本
#!/bin/bash
# health-check.sh
APP_NAME="${1:-my-app}"
HEALTH_ENDPOINT="${2:-/health}"
check_health() {
local url="https://$APP_NAME.fly.dev$HEALTH_ENDPOINT"
local response_time
local http_code
response_time=$(curl -o /dev/null -s -w "%{time_total}" -m 10 "$url")
http_code=$(curl -o /dev/null -s -w "%{http_code}" -m 10 "$url")
echo "Response time: ${response_time}s"
echo "HTTP code: $http_code"
if [ "$http_code" = "200" ]; then
if (( $(echo "$response_time < 2.0" | bc -l) )); then
echo "✅ Health check PASSED"
return 0
else
echo "⚠️ Health check SLOW (>${response_time}s)"
return 1
fi
else
echo "❌ Health check FAILED (HTTP $http_code)"
return 1
fi
}
echo "Checking health for $APP_NAME..."
echo "Endpoint: $HEALTH_ENDPOINT"
echo "================================"
for i in {1..3}; do
echo "Attempt $i:"
if check_health; then
exit 0
fi
echo ""
sleep 5
done
echo "Health check failed after 3 attempts"
exit 1
快速部署脚本
#!/bin/bash
# quick-deploy.sh
set -e # 遇到错误立即退出
APP_NAME="${1:-my-app}"
ENVIRONMENT="${2:-production}"
echo "🚀 Quick Deploy Script"
echo "App: $APP_NAME"
echo "Environment: $ENVIRONMENT"
# 1. 预检查
echo "Running pre-deployment checks..."
# 检查Git状态
if ! git diff-index --quiet HEAD --; then
echo "⚠️ You have uncommitted changes"
read -p "Continue anyway? (y/N): " CONTINUE
if [ "$CONTINUE" != "y" ]; then
echo "Deployment cancelled"
exit 1
fi
fi
# 检查Docker构建
echo "Testing Docker build..."
if ! docker build -t test-build . > /dev/null 2>&1; then
echo "❌ Docker build failed"
exit 1
fi
# 2. 获取当前版本信息
CURRENT_VERSION=$(fly releases --app $APP_NAME --json | jq -r '.[0].version')
echo "Current version: v$CURRENT_VERSION"
# 3. 执行部署
echo "Starting deployment..."
fly deploy --app $APP_NAME
# 4. 等待部署完成
echo "Waiting for deployment to stabilize..."
sleep 30
# 5. 验证部署
echo "Verifying deployment..."
NEW_VERSION=$(fly releases --app $APP_NAME --json | jq -r '.[0].version')
echo "New version: v$NEW_VERSION"
if [ "$NEW_VERSION" != "$CURRENT_VERSION" ]; then
# 运行健康检查
if bash health-check.sh $APP_NAME; then
echo "✅ Deployment successful!"
echo "Version: v$CURRENT_VERSION → v$NEW_VERSION"
else
echo "❌ Deployment health check failed"
read -p "Rollback to v$CURRENT_VERSION? (Y/n): " ROLLBACK
if [ "$ROLLBACK" != "n" ]; then
fly releases rollback --app $APP_NAME
echo "Rolled back to v$CURRENT_VERSION"
fi
exit 1
fi
else
echo "⚠️ Version unchanged, deployment may have failed"
exit 1
fi
10.2 批量操作脚本
批量应用管理
#!/bin/bash
# batch-operations.sh
OPERATION="$1"
PATTERN="${2:-.*}" # 应用名称模式
if [ -z "$OPERATION" ]; then
echo "Usage: $0 <operation> [pattern]"
echo "Operations: status, restart, scale, logs"
echo "Pattern: regex pattern to match app names (default: all apps)"
exit 1
fi
# 获取匹配的应用列表
APPS=$(fly apps list --json | jq -r ".[] | select(.name | test(\"$PATTERN\")) | .name")
if [ -z "$APPS" ]; then
echo "No apps found matching pattern: $PATTERN"
exit 1
fi
echo "Found apps matching '$PATTERN':"
echo "$APPS"
echo ""
case $OPERATION in
"status")
echo "Getting status for all matching apps..."
echo "$APPS" | while read app; do
echo "=== $app ==="
fly status --app $app
echo ""
done
;;
"restart")
echo "Restarting all matching apps..."
echo "$APPS" | while read app; do
echo "Restarting $app..."
fly restart --app $app
sleep 10 # 等待重启完成
done
;;
"scale")
COUNT="${3:-1}"
echo "Scaling all matching apps to $COUNT instances..."
echo "$APPS" | while read app; do
echo "Scaling $app to $COUNT..."
fly scale count $COUNT --app $app
done
;;
"logs")
DURATION="${3:-5m}"
echo "Getting logs for all matching apps (last $DURATION)..."
echo "$APPS" | while read app; do
echo "=== $app logs ==="
fly logs --since $DURATION --app $app | tail -10
echo ""
done
;;
*)
echo "Unknown operation: $OPERATION"
exit 1
;;
esac
批量环境变量更新
#!/bin/bash
# batch-env-update.sh
ENV_FILE="$1"
APPS_PATTERN="${2:-.*}"
if [ ! -f "$ENV_FILE" ]; then
echo "Usage: $0 <env_file> [apps_pattern]"
echo "env_file format: KEY=value (one per line)"
exit 1
fi
echo "Updating environment variables from: $ENV_FILE"
echo "Apps pattern: $APPS_PATTERN"
# 获取匹配的应用
APPS=$(fly apps list --json | jq -r ".[] | select(.name | test(\"$APPS_PATTERN\")) | .name")
echo "Apps to update:"
echo "$APPS"
echo ""
# 读取环境变量
declare -A ENV_VARS
while IFS='=' read -r key value; do
# 跳过注释和空行
[[ $key =~ ^#.*$ || -z $key ]] && continue
ENV_VARS[$key]=$value
done < "$ENV_FILE"
# 更新每个应用
echo "$APPS" | while read app; do
echo "Updating $app..."
# 构建secrets命令
SECRETS_CMD="fly secrets set --app $app"
ENV_CMD="fly env set --app $app"
for key in "${!ENV_VARS[@]}"; do
value="${ENV_VARS[$key]}"
# 判断是否为敏感信息
if [[ $key =~ (PASSWORD|SECRET|KEY|TOKEN) ]]; then
SECRETS_CMD="$SECRETS_CMD $key=\"$value\""
else
ENV_CMD="$ENV_CMD $key=\"$value\""
fi
done
# 执行更新
if [ "$SECRETS_CMD" != "fly secrets set --app $app" ]; then
eval $SECRETS_CMD
fi
if [ "$ENV_CMD" != "fly env set --app $app" ]; then
eval $ENV_CMD
fi
echo "Updated $app"
sleep 2
done
echo "Batch environment update completed"
10.3 监控告警脚本
综合监控脚本
#!/bin/bash
# comprehensive-monitor.sh
WEBHOOK_URL="$1"
CHECK_INTERVAL="${2:-300}" # 5分钟
if [ -z "$WEBHOOK_URL" ]; then
echo "Usage: $0 <webhook_url> [check_interval_seconds]"
exit 1
fi
send_alert() {
local message="$1"
local severity="${2:-WARNING}"
echo "[$severity] $message"
curl -X POST "$WEBHOOK_URL" \
-H 'Content-Type: application/json' \
-d "{\"text\":\"[$severity] $message\"}" \
> /dev/null 2>&1
}
check_app_health() {
local app="$1"
local url="https://$app.fly.dev/health"
if ! curl -f "$url" > /dev/null 2>&1; then
send_alert "Health check failed for $app" "CRITICAL"
return 1
fi
return 0
}
check_app_performance() {
local app="$1"
local url="https://$app.fly.dev/health"
local response_time=$(curl -o /dev/null -s -w "%{time_total}" -m 10 "$url" 2>/dev/null)
if (( $(echo "$response_time > 3.0" | bc -l) )); then
send_alert "Slow response detected for $app: ${response_time}s" "WARNING"
fi
}
check_machine_states() {
local app="$1"
local machines=$(fly machines list --app $app --json 2>/dev/null)
if [ "$machines" = "[]" ] || [ -z "$machines" ]; then
return
fi
local failed_machines=$(echo "$machines" | jq -r '.[] | select(.state == "failed") | .id')
if [ -n "$failed_machines" ]; then
send_alert "Failed machines detected in $app: $failed_machines" "CRITICAL"
fi
local running_count=$(echo "$machines" | jq '[.[] | select(.state == "started")] | length')
if [ "$running_count" = "0" ]; then
send_alert "No running machines in $app" "CRITICAL"
fi
}
monitor_costs() {
local current_cost=$(fly billing show --json 2>/dev/null | jq -r '.current_month.total // 0')
local day_of_month=$(date +%d)
local daily_average=$(echo "scale=2; $current_cost / $day_of_month" | bc 2>/dev/null || echo "0")
if (( $(echo "$daily_average > 10" | bc -l) )); then
send_alert "High daily cost average: \$${daily_average}" "WARNING"
fi
}
echo "🔍 Starting comprehensive monitoring..."
echo "Check interval: ${CHECK_INTERVAL}s"
echo "Webhook: $WEBHOOK_URL"
while true; do
echo "=== $(date) ==="
# 获取所有应用
APPS=$(fly apps list --json 2>/dev/null | jq -r '.[].name' 2>/dev/null)
if [ -z "$APPS" ]; then
send_alert "Failed to retrieve apps list" "ERROR"
sleep $CHECK_INTERVAL
continue
fi
# 检查每个应用
echo "$APPS" | while read app; do
echo "Checking $app..."
# 健康检查
check_app_health "$app"
# 性能检查
check_app_performance "$app"
# 机器状态检查
check_machine_states "$app"
sleep 2 # 避免过于频繁的API调用
done
# 成本监控(每小时检查一次)
if [ $(($(date +%M) % 60)) -eq 0 ]; then
monitor_costs
fi
echo "Monitoring cycle completed"
sleep $CHECK_INTERVAL
done
自动恢复脚本
#!/bin/bash
# auto-recovery.sh
APP_NAME="$1"
MAX_FAILURES="${2:-3}"
RECOVERY_DELAY="${3:-60}"
if [ -z "$APP_NAME" ]; then
echo "Usage: $0 <app_name> [max_failures] [recovery_delay_seconds]"
exit 1
fi
FAILURE_COUNT=0
LAST_RECOVERY=0
log_event() {
local event="$1"
echo "$(date '+%Y-%m-%d %H:%M:%S') [$APP_NAME] $event" | tee -a recovery.log
}
check_and_recover() {
# 检查应用健康状态
if curl -f "https://$APP_NAME.fly.dev/health" > /dev/null 2>&1; then
if [ $FAILURE_COUNT -gt 0 ]; then
log_event "Application recovered, resetting failure count"
FAILURE_COUNT=0
fi
return 0
fi
FAILURE_COUNT=$((FAILURE_COUNT + 1))
log_event "Health check failed (attempt $FAILURE_COUNT/$MAX_FAILURES)"
if [ $FAILURE_COUNT -ge $MAX_FAILURES ]; then
local current_time=$(date +%s)
local time_since_recovery=$((current_time - LAST_RECOVERY))
if [ $time_since_recovery -ge $RECOVERY_DELAY ]; then
log_event "Starting automatic recovery procedure"
# 尝试重启应用
if fly restart --app $APP_NAME; then
log_event "Application restart initiated"
LAST_RECOVERY=$current_time
FAILURE_COUNT=0
# 等待重启完成
sleep 30
# 验证恢复
if curl -f "https://$APP_NAME.fly.dev/health" > /dev/null 2>&1; then
log_event "Automatic recovery successful"
else
log_event "Automatic recovery failed, manual intervention required"
fi
else
log_event "Failed to restart application"
fi
else
log_event "Recovery delay not met, waiting..."
fi
fi
}
echo "🔄 Auto-recovery monitoring for $APP_NAME"
echo "Max failures: $MAX_FAILURES"
echo "Recovery delay: ${RECOVERY_DELAY}s"
while true; do
check_and_recover
sleep 30
done
结语
这份运维指南涵盖了Fly.io平台的完整部署和维护操作,包括:
- 快速上手:从环境准备到首次部署的完整流程
- 日常运维:监控、日志、性能调优等常规操作
- 高级管理:扩缩容、数据库管理、网络安全配置
- 自动化集成:CI/CD流程和批量操作脚本
- 应急响应:故障排查、恢复流程和成本优化
使用建议:
- 收藏常用脚本:将frequently used的脚本加入到你的工具箱
- 定制化配置:根据你的具体应用需求调整配置参数
- 监控告警:设置适合你团队的成本和性能告警阈值
- 定期review:定期检查和优化你的部署配置
持续改进:
这份指南会随着Fly.io平台的更新和最佳实践的演进而持续更新。建议定期关注Fly.io官方文档和社区最佳实践。
文档版本: v1.0
更新时间: 2024年12月
维护团队: DevOps Team
反馈方式: 通过项目issue提交改进建议