Ironic's State System and Lifecycle Management

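Overview of the Ironic State System

Ironic uses a finite state machine (FSM) to manage the lifecycle of bare metal nodes. Each node has two main states:

  • Provision state: where the node is in its lifecycle
  • Power state: whether the node is powered on or off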
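
Because both values are plain fields on the node record, they can be read together. The following is a minimal sketch, assuming the `openstack` CLI with the bare metal plugin is installed and authenticated; the helper name `node_states` is hypothetical.

```shell
# Print a node's provision state and power state on one line.
# Hypothetical helper built on `openstack baremetal node show`.
node_states() {
    local node="$1"
    local provision power
    provision=$(openstack baremetal node show "$node" -f value -c provision_state)
    power=$(openstack baremetal node show "$node" -f value -c power_state)
    echo "provision_state=$provision power_state=$power"
}
```

Calling `node_states node-01` prints the two fields side by side, which is convenient in logs.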

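Provision States in Detail

1. Basic States

ENROLL

Description: the node has just been created; its basic information is recorded but not yet verified
Characteristics:
- The initial state of a node newly added to Ironic
- Driver information may be incomplete or unverified
- The node cannot be deployed to until its configuration has been verified

Example:
openstack baremetal node create --driver ipmi --name node-01
# The node is in the ENROLL state after creation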

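MANAGEABLE

Description: the node's configuration has been verified and management operations can be run
Characteristics:
- The driver can communicate with the node
- Cleaning, inspection, and similar operations can be run
- The node still cannot be used to deploy instances

State transition:
ENROLL → MANAGEABLE (via the manage action)

Example:
openstack baremetal node manage node-01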

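AVAILABLE

Description: the node is ready and can be used to deploy instances
Characteristics:
- The node has passed all verification
- The Nova scheduler can select it for a deployment
- This is the normal state of a usable, unoccupied node

State transition:
MANAGEABLE → AVAILABLE (via the provide action)

Example:
openstack baremetal node provide node-01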

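ACTIVE

Description: the node is running a user instance
Characteristics:
- A user instance is running on the node
- The node is occupied, so no new instance can be deployed to it
- This is the normal state after a deployment completes

State transition:
AVAILABLE → ACTIVE (via a deployment)

Example:
openstack server create --flavor baremetal --image cirros my-instance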

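2. Transitional States

VERIFYING

Description: Ironic is verifying the node's configuration and connectivity
Characteristics:
- A temporary state in which the driver configuration is validated
- Connectivity to the BMC is checked
- On success the node moves to MANAGEABLE; on failure it returns to ENROLL

State transition:
ENROLL → VERIFYING → MANAGEABLE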

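DEPLOYING

Description: an instance is being deployed to the node
Characteristics:
- The deployment workflow is running
- It includes PXE boot, image download, system installation, and so on
- The node moves to ACTIVE when deployment succeeds

State transition:
AVAILABLE → DEPLOYING → ACTIVE

Monitoring deployment progress:
while true; do
    state=$(openstack baremetal node show node-01 -f value -c provision_state)
    echo "Current state: $state"
    case "$state" in
        active) break ;;
        # Stop polling on failure instead of looping forever
        "deploy failed"|error) echo "Deployment failed" >&2; break ;;
    esac
    sleep 10
done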

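CLEANING

Description: the node is being cleaned
Characteristics:
- Cleaning steps are run (erasing disks, resetting configuration, and so on)
- Cleaning makes the node safe to reuse
- Automated cleaning ends in AVAILABLE; manual cleaning starts from and returns to MANAGEABLE

Cleaning example:
# Manual cleaning; the node must be in the MANAGEABLE state
openstack baremetal node clean node-01 \
    --clean-steps '[{"interface": "deploy", "step": "erase_devices"}]'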

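INSPECTING

Description: the node's hardware is being inspected
Characteristics:
- Hardware specifications are discovered automatically
- CPU, memory, disk, and similar information is collected
- The node's properties are updated

Inspection example:
# The node must be in the MANAGEABLE state
openstack baremetal node inspect node-01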
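
A polling loop has to tell these temporary states apart from stable and failed ones, otherwise it can spin forever after a failure. A minimal sketch of such a classifier follows; `classify_state` is a hypothetical helper, and the groupings cover common `provision_state` values rather than the complete Ironic state list.

```shell
# Classify a provision_state value as stable, transitional, or failed.
# Illustrative grouping only; extend it for states your deployment uses.
classify_state() {
    case "$1" in
        enroll|manageable|available|active|rescue)
            echo "stable" ;;
        verifying|deploying|cleaning|inspecting|deleting|rescuing|unrescuing)
            echo "transitional" ;;
        # Ironic also reports "... wait" states while it waits on the node
        *" wait"|"wait call-back")
            echo "transitional" ;;
        error|*" failed")
            echo "failed" ;;
        *)
            echo "unknown" ;;
    esac
}
```

A loop can then keep polling while the result is `transitional` and stop on anything else.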

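3. Error and Special States

ERROR

Description: a node operation failed; in Ironic, the error state is reached when tearing down an active deployment fails
Characteristics:
- The previous operation failed
- Manual intervention is needed to resolve the problem
- The error message can be inspected for troubleshooting

Troubleshooting:
# Show the error message
openstack baremetal node show node-01 -c last_error

# Leave the error state by retrying the teardown
# (manage is not a valid action from error; it applies to clean failed and inspect failed)
openstack baremetal node undeploy node-01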

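RESCUE

Description: the node is in rescue mode
Characteristics:
- The node is booted from a rescue image
- Used for system recovery and maintenance
- You can log in over SSH to repair the system

Rescue operations:
openstack baremetal node rescue node-01 --rescue-password mypassword
# Leave rescue mode when finished
openstack baremetal node unrescue node-01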

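MAINTENANCE (maintenance mode)

Description: the node is flagged as under maintenance
Characteristics:
- A flag on the node rather than a provision state, usually set by an operator
- Prevents automated operations on the node
- Used for hardware maintenance or troubleshooting

Maintenance operations:
openstack baremetal node maintenance set node-01 --reason "Hardware upgrade"
openstack baremetal node maintenance unset node-01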
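
Maintenance is stored as a boolean `maintenance` field on the node, separate from `provision_state`, so scripts need to check it on its own. A small sketch, assuming the CLI prints the flag as `True` or `False` with `-f value`; `in_maintenance` is a hypothetical helper.

```shell
# Succeed (exit status 0) when the node's maintenance flag is set.
# Assumes `-f value -c maintenance` prints True or False.
in_maintenance() {
    local flag
    flag=$(openstack baremetal node show "$1" -f value -c maintenance)
    case "$flag" in
        True|true) return 0 ;;
        *) return 1 ;;
    esac
}
```

Used as `if in_maintenance node-01; then ...; fi`, it lets automation skip nodes that an operator has set aside.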

Power States

1. Basic Power States

# Power state values
power on     # powered on
power off    # powered off
rebooting    # reboot in progress
error        # power operation failed

# Power commands
openstack baremetal node power on node-01      # power on
openstack baremetal node power off node-01     # power off
openstack baremetal node reboot node-01        # reboot
openstack baremetal node show node-01 -f value -c power_state    # show the power state
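
A power command may return before the hardware has actually changed state, so a script sometimes has to wait for the reported `power_state` to catch up. A minimal sketch of that wait, reading the field through `openstack baremetal node show`; `wait_for_power` and its default timings are hypothetical.

```shell
# Poll power_state until it equals the wanted value or the timeout expires.
# Arguments: node, wanted state, timeout in seconds (default 60),
# polling interval in seconds (default 5, must be at least 1).
wait_for_power() {
    local node="$1" wanted="$2" timeout="${3:-60}" interval="${4:-5}"
    local waited=0 current
    while [ "$waited" -le "$timeout" ]; do
        current=$(openstack baremetal node show "$node" -f value -c power_state)
        [ "$current" = "$wanted" ] && return 0
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 1  # timed out
}
```

For example, `openstack baremetal node power on node-01 && wait_for_power node-01 "power on" 120` blocks until the node reports `power on` or two minutes pass.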

The Complete Lifecycle Workflow

1. Standard Node Lifecycle

graph TD
    A[ENROLL] -->|manage| B[MANAGEABLE]
    B -->|provide| C[AVAILABLE]
    C -->|deploy| D[DEPLOYING]
    D -->|success| E[ACTIVE]
    D -->|failure| F[DEPLOY FAILED]
    E -->|undeploy| G[DELETING]
    G -->|success| H[CLEANING]
    H -->|success| C
    F -->|undeploy| G
    B -->|clean| I[CLEANING]
    I -->|success| B
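
The main loop of this lifecycle can be written down as a lookup from a (state, event) pair to the next state, which is handy for sanity-checking automation before it calls the API. This is a sketch covering only the path from enroll to active and back to available, not the full Ironic state machine; `next_state` and the event names `success` and `failure` are illustrative.

```shell
# Print the state reached from a given state by a given event.
# Returns non-zero for pairs that are not part of this simplified table.
next_state() {
    case "$1:$2" in
        "enroll:manage")      echo "manageable" ;;
        "manageable:provide") echo "available" ;;
        "available:deploy")   echo "deploying" ;;
        "deploying:success")  echo "active" ;;
        "deploying:failure")  echo "deploy failed" ;;
        "active:undeploy")    echo "deleting" ;;
        "deleting:success")   echo "cleaning" ;;
        "cleaning:success")   echo "available" ;;
        *) return 1 ;;
    esac
}
```

Feeding it a sequence of events, such as manage, provide, deploy, success starting from enroll, walks a node through the table one step at a time and stops at the first invalid step.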

2. Hands-On Workflow

From Enrolling a New Node to Making It Available

#!/bin/bash
# Complete node lifecycle management script

NODE_NAME="node-01"
IPMI_IP="192.168.1.100"
IPMI_USER="admin"
IPMI_PASS="password"

echo "=== Step 1: Create the node (ENROLL state) ==="
openstack baremetal node create \
    --driver ipmi \
    --name "$NODE_NAME" \
    --driver-info ipmi_address="$IPMI_IP" \
    --driver-info ipmi_username="$IPMI_USER" \
    --driver-info ipmi_password="$IPMI_PASS" \
    --property cpus=24 \
    --property memory_mb=65536 \
    --property local_gb=500 \
    --property cpu_arch=x86_64

echo "Current state: $(openstack baremetal node show "$NODE_NAME" -f value -c provision_state)"

echo "=== Step 2: Manage the node (ENROLL → MANAGEABLE) ==="
openstack baremetal node manage "$NODE_NAME"

# Wait for the state to change
while [ "$(openstack baremetal node show "$NODE_NAME" -f value -c provision_state)" = "verifying" ]; do
    echo "Waiting for verification to finish..."
    sleep 5
done

echo "Current state: $(openstack baremetal node show "$NODE_NAME" -f value -c provision_state)"

echo "=== Step 3: Add a network port ==="
# port create expects the node UUID and a boolean value for --pxe-enabled
NODE_UUID=$(openstack baremetal node show "$NODE_NAME" -f value -c uuid)
openstack baremetal port create \
    --node "$NODE_UUID" \
    --pxe-enabled true \
    00:11:22:33:44:55

echo "=== Step 4: Validate the node configuration ==="
openstack baremetal node validate "$NODE_NAME"

echo "=== Step 5: Provide the node (MANAGEABLE → AVAILABLE) ==="
openstack baremetal node provide "$NODE_NAME"

# Wait for cleaning to finish; "clean wait" is also part of cleaning
while true; do
    state=$(openstack baremetal node show "$NODE_NAME" -f value -c provision_state)
    case "$state" in
        cleaning|"clean wait") echo "Waiting for cleaning to finish..."; sleep 10 ;;
        *) break ;;
    esac
done

echo "Node preparation finished, current state: $(openstack baremetal node show "$NODE_NAME" -f value -c provision_state)"
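
The hand-written polling loops can often be replaced by the client's own blocking option. The sketch below assumes a python-ironicclient version whose provision-state commands accept `--wait [<seconds>]`; check `openstack baremetal node provide --help` before relying on it. `prepare_node` and its 1800-second default are hypothetical.

```shell
# Move a node from ENROLL to AVAILABLE, blocking on each transition.
# Assumes the installed client supports `--wait <seconds>` on manage/provide.
prepare_node() {
    local node="$1" timeout="${2:-1800}"
    openstack baremetal node manage "$node" --wait "$timeout" || return 1
    openstack baremetal node provide "$node" --wait "$timeout" || return 1
    # Print the final state so the caller can confirm it is "available"
    openstack baremetal node show "$node" -f value -c provision_state
}
```

The trade-off is less visibility while waiting: the loop version can log every intermediate state, whereas this version only reports the end result.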

Node Deployment Workflow

#!/bin/bash
# Node deployment script

NODE_NAME="node-01"
IMAGE_ID="cirros-image"

echo "=== Starting the deployment workflow ==="

# Check the node state
current_state=$(openstack baremetal node show "$NODE_NAME" -f value -c provision_state)
echo "Current state: $current_state"

if [ "$current_state" != "available" ]; then
    echo "Error: the node is not in the available state and cannot be deployed"
    exit 1
fi

echo "=== Starting deployment (AVAILABLE → DEPLOYING → ACTIVE) ==="
# instance_info is set on the node first; `node deploy` takes no --instance-info option
openstack baremetal node set "$NODE_NAME" \
    --instance-info image_source=glance://$IMAGE_ID \
    --instance-info root_gb=20
openstack baremetal node deploy "$NODE_NAME"

echo "=== Monitoring deployment progress ==="
while true; do
    state=$(openstack baremetal node show "$NODE_NAME" -f value -c provision_state)
    power_state=$(openstack baremetal node show "$NODE_NAME" -f value -c power_state)

    echo "Provision state: $state, power state: $power_state"

    case $state in
        "active")
            echo "✅ Deployment completed successfully!"
            break
            ;;
        "deploy failed")
            echo "❌ Deployment failed!"
            echo "Error message:"
            openstack baremetal node show "$NODE_NAME" -c last_error
            exit 1
            ;;
        "error")
            echo "❌ The node entered the error state!"
            openstack baremetal node show "$NODE_NAME" -c last_error
            exit 1
            ;;
    esac

    sleep 15
done

echo "=== Verifying the deployment result ==="
echo "Node information:"
openstack baremetal node show "$NODE_NAME" -c provision_state -c power_state -c instance_uuid

echo "Instance information:"
instance_uuid=$(openstack baremetal node show "$NODE_NAME" -f value -c instance_uuid)
if [ -n "$instance_uuid" ]; then
    openstack server show "$instance_uuid"
fi

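3. Error Handling and Recovery

General Error Recovery Workflow

#!/bin/bash
# Error state recovery script

NODE_NAME=$1

if [ -z "$NODE_NAME" ]; then
    echo "Usage: $0 <node name>"
    exit 1
fi

echo "=== Node error recovery workflow ==="

# Check the current state
current_state=$(openstack baremetal node show "$NODE_NAME" -f value -c provision_state)
echo "Current state: $current_state"

# Show the error message
echo "=== Error message ==="
openstack baremetal node show "$NODE_NAME" -c last_error

# Maintenance is a separate flag, not a provision state, so check it on its own
maintenance=$(openstack baremetal node show "$NODE_NAME" -f value -c maintenance)
if [ "$maintenance" = "True" ]; then
    echo "=== Leaving maintenance mode ==="
    openstack baremetal node maintenance unset "$NODE_NAME"
fi

case $current_state in
    "error" | "deploy failed")
        # manage is not valid from these states; tear the deployment down instead
        echo "=== Tearing down the failed deployment (undeploy) ==="
        openstack baremetal node undeploy "$NODE_NAME"
        ;;
    "clean failed" | "inspect failed")
        echo "=== Trying to reset to the MANAGEABLE state ==="
        openstack baremetal node manage "$NODE_NAME"

        # Wait for the state to change
        sleep 10
        new_state=$(openstack baremetal node show "$NODE_NAME" -f value -c provision_state)
        echo "New state: $new_state"

        if [ "$new_state" = "manageable" ]; then
            echo "✅ Reset to the MANAGEABLE state"
            echo "=== Providing the node again ==="
            openstack baremetal node provide "$NODE_NAME"
        else
            echo "❌ Reset failed, manual intervention is required"
        fi
        ;;
    *)
        echo "The provision state is normal, no recovery needed"
        ;;
esac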

4. State Monitoring and Reporting

Cluster State Monitoring Script

#!/bin/bash
# Cluster state monitoring script

echo "=== Ironic cluster status report ==="
echo "Generated at: $(date)"
echo

echo "=== Node state statistics ==="
echo "Total nodes: $(openstack baremetal node list -f value | wc -l)"
echo

# Count nodes per state
echo "State distribution:"
openstack baremetal node list -f value -c "Provision State" | sort | uniq -c | while read -r count state; do
    printf "  %-15s: %d\n" "$state" "$count"
done

echo

echo "=== Detailed node states ==="
openstack baremetal node list -c Name -c "Provision State" -c "Power State" -c "Maintenance"

echo

echo "=== Problem node check ==="

# Nodes in the error state
error_nodes=$(openstack baremetal node list -f value -c Name --provision-state error)
if [ -n "$error_nodes" ]; then
    echo "❌ Nodes in the error state:"
    echo "$error_nodes" | while read -r node; do
        echo "  - $node"
        openstack baremetal node show "$node" -f value -c last_error | sed 's/^/    Error: /'
    done
else
    echo "✅ No nodes in the error state"
fi

echo

# Nodes in maintenance mode
maintenance_nodes=$(openstack baremetal node list -f value -c Name --maintenance)
if [ -n "$maintenance_nodes" ]; then
    echo "🔧 Nodes in maintenance mode:"
    echo "$maintenance_nodes" | while read -r node; do
        reason=$(openstack baremetal node show "$node" -f value -c maintenance_reason)
        echo "  - $node (reason: $reason)"
    done
else
    echo "✅ No nodes in maintenance mode"
fi

echo

echo "=== Power state check ==="
openstack baremetal node list -f value -c Name | while read -r node; do
    # The power state is a field on the node; its error value is lowercase
    power_state=$(openstack baremetal node show "$node" -f value -c power_state 2>/dev/null)
    if [ "$power_state" = "error" ]; then
        echo "❌ $node: power state error"
    fi
done

echo

echo "=== Conductor service status ==="
openstack baremetal conductor list
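
The report above looks only at the `error` state, while failed deployments, cleaning runs, and inspections land in their own `... failed` states. A sketch that sweeps all of them, assuming `--provision-state` accepts these values as plain strings; `list_failed_nodes` is a hypothetical helper.

```shell
# Print "<name> <provision state>" for every node in a failure state.
# The state list is an assumption; adjust it to the states you care about.
list_failed_nodes() {
    local state
    for state in "error" "deploy failed" "clean failed" "inspect failed"; do
        openstack baremetal node list --provision-state "$state" \
            -f value -c Name -c "Provision State"
    done
}
```

An empty result means no node is currently in any of the listed failure states.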

5. Advanced State Management

Batch State Management

#!/bin/bash
# Batch state management script

action=$1
shift
nodes=("$@")

if [ -z "$action" ] || [ ${#nodes[@]} -eq 0 ]; then
    echo "Usage: $0 <action> <node1> [node2] ..."
    echo "Actions: manage, provide, clean, inspect, maintain, unmaintain"
    exit 1
fi

echo "=== Running batch action: $action ==="

for node in "${nodes[@]}"; do
    echo "Processing node: $node"

    case $action in
        "manage")
            openstack baremetal node manage "$node"
            ;;
        "provide")
            openstack baremetal node provide "$node"
            ;;
        "clean")
            openstack baremetal node clean "$node" \
                --clean-steps '[{"interface": "deploy", "step": "erase_devices_metadata"}]'
            ;;
        "inspect")
            openstack baremetal node inspect "$node"
            ;;
        "maintain")
            openstack baremetal node maintenance set "$node" --reason "Batch maintenance"
            ;;
        "unmaintain")
            openstack baremetal node maintenance unset "$node"
            ;;
        *)
            echo "Unknown action: $action"
            continue
            ;;
    esac

    if [ $? -eq 0 ]; then
        echo "✅ $node: action succeeded"
    else
        echo "❌ $node: action failed"
    fi

    echo "---"
done

echo "=== Batch run finished ==="

Best Practices

1. State Management Best Practices

# 1. Always check the precondition before a state transition
check_state_transition() {
    local node=$1
    local target_action=$2
    local current_state

    current_state=$(openstack baremetal node show "$node" -f value -c provision_state)

    case $target_action in
        "manage")
            # manage is valid from enroll, available, and the clean/inspect failure states
            case $current_state in
                enroll|available|"clean failed"|"inspect failed") return 0 ;;
                *) return 1 ;;
            esac
            ;;
        "provide")
            [ "$current_state" = "manageable" ]
            ;;
        "deploy")
            [ "$current_state" = "available" ]
            ;;
        *)
            return 1  # unknown action
            ;;
    esac
}

# 2. Handle timeouts when waiting for a state transition
wait_for_state() {
    local node=$1
    local expected_state=$2
    local timeout=${3:-300}  # default timeout: 5 minutes
    local current_state

    local count=0
    while [ $count -lt $timeout ]; do
        current_state=$(openstack baremetal node show "$node" -f value -c provision_state)

        if [ "$current_state" = "$expected_state" ]; then
            return 0
        elif [[ "$current_state" =~ "failed"$ ]] || [ "$current_state" = "error" ]; then
            return 1
        fi

        sleep 5
        count=$((count + 5))
    done

    return 2  # timed out
}

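2. Monitoring and Alerting

#!/bin/bash
# Abnormal state alerting script

ALERT_EMAIL="admin@example.com"
ALERT_THRESHOLD=5   # number of error nodes that triggers an alert
STUCK_SECONDS=3600  # assumed limit for staying in a transitional state

check_cluster_health() {
    local error_count=0
    local issues=()

    # Check nodes in the error state
    error_nodes=$(openstack baremetal node list -f value -c Name --provision-state error)
    if [ -n "$error_nodes" ]; then
        error_count=$(echo "$error_nodes" | wc -l)
        if [ "$error_count" -ge "$ALERT_THRESHOLD" ]; then
            issues+=("Nodes in the error state: $error_count")
        fi
    fi

    # Check nodes that have been in a transitional state for too long
    for state in "deploying" "cleaning" "inspecting"; do
        old_nodes=$(openstack baremetal node list -f value -c Name --provision-state "$state" | \
                   while read -r node; do
                       # provision_updated_at records the last provision state change;
                       # parsing it this way assumes GNU date
                       updated=$(openstack baremetal node show "$node" -f value -c provision_updated_at)
                       updated_ts=$(date -d "$updated" +%s 2>/dev/null) || continue
                       if [ $(( $(date +%s) - updated_ts )) -gt "$STUCK_SECONDS" ]; then
                           echo "$node"
                       fi
                   done)
        if [ -n "$old_nodes" ]; then
            count=$(echo "$old_nodes" | wc -l)
            issues+=("Nodes stuck in $state: $count")
        fi
    done

    # Send the alert
    if [ ${#issues[@]} -gt 0 ]; then
        {
            echo "Ironic cluster health alert"
            echo "Time: $(date)"
            echo
            for issue in "${issues[@]}"; do
                echo "- $issue"
            done
        } | mail -s "Ironic Cluster Alert" "$ALERT_EMAIL"
    fi
}

# Periodic check (invoked from cron)
check_cluster_health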

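Understanding Ironic's state system and lifecycle management is essential to running a bare metal cloud successfully. The state machine ensures that nodes are managed and monitored appropriately at every stage, and it gives each failure state a defined recovery path. In production, put thorough monitoring and automation scripts in place to manage these state transitions.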