Source Code Analysis of the Ironic Clean Workflow


0. What is the clean operation

juejin.cn/post/752054… The Ironic clean operation is a complex process that involves several stages and interacting components. Its code execution flow is walked through below.

1. Triggering the clean operation

A clean request sent through Ironic's RESTful API looks like this:

PUT /v1/nodes/<node_ident>/states/provision

The request body contains target and clean_steps:

{
  "target": "clean",
  "clean_steps": [
    {
      "interface": "raid",
      "step": "create_configuration",
      "args": {"create_nonroot_volumes": false}
    },
    {
      "interface": "deploy",
      "step": "erase_devices"
    }
  ]
}

This request moves the node from the manageable state to the cleaning state and triggers the corresponding handling in the Ironic API.
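
As a rough illustration of the request above, the following sketch builds (but does not send) the PUT call using only the Python standard library. The base URL and node name are hypothetical placeholders, authentication headers are left out, and the microversion value is an assumption (manual cleaning needs API version 1.15 or later).

```python
import json
import urllib.request

# Hypothetical endpoint; substitute your own Ironic API URL.
IRONIC_URL = "http://ironic.example.com:6385"


def build_clean_request(node_ident, clean_steps, base_url=IRONIC_URL):
    """Build (but do not send) the PUT request that starts manual cleaning."""
    body = {"target": "clean", "clean_steps": clean_steps}
    return urllib.request.Request(
        url=f"{base_url}/v1/nodes/{node_ident}/states/provision",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: manual cleaning needs microversion 1.15 or later.
            "X-OpenStack-Ironic-API-Version": "1.15",
        },
        method="PUT",
    )


steps = [
    {"interface": "raid", "step": "create_configuration",
     "args": {"create_nonroot_volumes": False}},
    {"interface": "deploy", "step": "erase_devices"},
]
request = build_clean_request("node-0", steps)
print(request.get_method(), request.full_url)
```

Passing the resulting object to urllib.request.urlopen would send it; a real deployment would also need a token header.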

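2. Ironic API handling

When the Ironic API receives the request, it calls the provision method of NodeStatesController. That method decides what to do based on the value of target.

  • If target is clean, it checks whether clean_steps was provided; if not, the request is rejected with an error.
  • If clean_steps was provided, it calls do_node_clean.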
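
The two checks can be condensed into a few lines. This is a simplified sketch, not the controller code: check_clean_request is a hypothetical helper, and it raises ValueError where the API raises ClientSideError with HTTP 400.

```python
def check_clean_request(target, clean_steps=None):
    """Mirror the two clean_steps checks in simplified form."""
    # clean_steps only makes sense together with target == "clean".
    if clean_steps and target != "clean":
        raise ValueError('"clean_steps" is only valid when target is "clean"')
    # A manual clean request must say which steps to run.
    if target == "clean" and not clean_steps:
        raise ValueError('"clean_steps" is required when target is "clean"')
    return True


print(check_clean_request("clean", [{"interface": "deploy",
                                     "step": "erase_devices"}]))
```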

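2.1 Code walkthrough

ironic/api/controllers/v1/node.py

    # When the Ironic API receives the request, it calls the provision method of
    # NodeStatesController, which decides what to do based on the value of target.
    # If target is clean, it checks whether clean_steps was provided and errors out if not.
    # If clean_steps was provided, it calls do_node_clean.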
    def provision(self, node_ident, target, configdrive=None,
                  clean_steps=None, deploy_steps=None,
                  rescue_password=None, disable_ramdisk=None,
                  service_steps=None, runbook=None):

        # If clean_steps is given but target is not 'clean', reject the request
        if clean_steps and target != ir_states.VERBS['clean']:
            msg = (_('"clean_steps" is only valid when setting target '
                     'provision state to %s') % ir_states.VERBS['clean'])
            if runbook:
                rb_allowed_targets = [ir_states.VERBS['clean'],
                                      ir_states.VERBS['service']]
                msg = (_('"runbooks" is only valid when setting target '
                         'provision state to any of %s') % rb_allowed_targets)
            raise exception.ClientSideError(
                msg, status_code=http_client.BAD_REQUEST)
        ........
        self._do_provision_action(rpc_node, target, configdrive, clean_steps,
                                  deploy_steps, rescue_password,
                                  disable_ramdisk, service_steps)

        # Set the HTTP Location Header
        url_args = '/'.join([node_ident, 'states'])
        api.response.location = link.build_url('nodes', url_args)

Next, jump to _do_provision_action:

    def _do_provision_action(self, rpc_node, target, configdrive=None,
                             clean_steps=None, deploy_steps=None,
                             rescue_password=None, disable_ramdisk=None,
                             service_steps=None):
        ........
        elif target == ir_states.VERBS['clean']:
            if not clean_steps:
                msg = (_('"clean_steps" is required when setting target '
                         'provision state to %s') % ir_states.VERBS['clean'])
                raise exception.ClientSideError(
                    msg, status_code=http_client.BAD_REQUEST)
            _check_clean_steps(clean_steps)
            # Hand off to ironic-conductor over RPC to run the clean steps
            api.request.rpcapi.do_node_clean(
                api.request.context, rpc_node.uuid, clean_steps,
                disable_ramdisk, topic=topic)

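3. Conductor handling

The do_node_clean method runs in the Ironic Conductor. It acquires an exclusive task for the node and, through task.process_event, schedules cleaning.do_node_clean in a worker.

  • cleaning.do_node_clean uses the supplied clean_steps (together with automated_with_steps) to decide whether this is manual or automated cleaning.
  • Manual cleaning runs the supplied clean_steps; automated cleaning may be skipped entirely, depending on configuration.

3.1 Code walkthrough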

ironic/conductor/manager.py

    def do_node_clean(self, context, node_id, clean_steps,
                      disable_ramdisk=False):
        self._concurrent_action_limit(action='cleaning')
        with task_manager.acquire(context, node_id, shared=False,
                                  purpose='node manual cleaning') as task:
            node = task.node
            if node.maintenance:
                raise exception.NodeInMaintenance(op=_('cleaning'),
                                                  node=node.uuid)

            try:
                # Validate the power and network interfaces
                task.driver.power.validate(task)
                task.driver.network.validate(task)
            except exception.InvalidParameterValue as e:
                msg = (_('Validation of node %(node)s for cleaning '
                         'failed: %(msg)s') %
                       {'node': node.uuid, 'msg': e})
                raise exception.InvalidParameterValue(msg)

            try:
                # Call task.process_event to start cleaning. It handles the node's state
                # transition and schedules cleaning.do_node_clean to run the clean steps.
                task.process_event(
                    'clean',
                    callback=self._spawn_worker,
                    call_args=(cleaning.do_node_clean, task,
                               clean_steps, disable_ramdisk, False),
                    err_handler=utils.provisioning_error_handler,
                    target_state=states.MANAGEABLE)
            except exception.InvalidState:
                raise exception.InvalidStateRequested(
                    action='manual clean', node=node.uuid,
                    state=node.provision_state)

task.process_event

def process_event(self, event, callback=None, call_args=None,
                  call_kwargs=None, err_handler=None, target_state=None,
                  last_error=None):
    # 1. Remember the node's current states and the event for later logging and notification
    self._prev_provision_state = self.node.provision_state
    self._prev_target_provision_state = self.node.target_provision_state
    self._event = event

    # 2. Advance the state machine (based on current state and event); an invalid event raises
    self.fsm.process_event(event, target_state=target_state)

    # 3. With both an error handler and a callback, register a spawn error hook
    #    (called if the asynchronous task fails)
    if err_handler and callback:
        self.set_spawn_error_hook(err_handler, self.node,
                                  self.node.provision_state,
                                  self.node.target_provision_state)

    # 4. Set the node's provision_state to the state machine's new current state
    self.node.provision_state = self.fsm.current_state

    # 5. With no callback and a stable state, clear the target state;
    #    otherwise use the state machine's target state
    if not callback and self.fsm.is_stable(self.node.provision_state):
        self.node.target_provision_state = states.NOSTATE
    else:
        self.node.target_provision_state = self.fsm.target_state

    # 6. With a callback, schedule it to run asynchronously (deploy, clean, etc.) and set last_error
    if callback:
        self.node.last_error = last_error
        if call_args is None:
            call_args = ()
        if call_kwargs is None:
            call_kwargs = {}
        self.spawn_after(callback, *call_args, **call_kwargs)
    # 7. With no callback but an error given, write last_error directly
    elif last_error is not None:
        self.node.last_error = last_error

    # 8. Save the node object, persisting the state change (e.g. to the database)
    self.node.save()

    # 9. Log the state change
    log_message = ('Node %(node)s moved to provision state "%(state)s" '
                   'from state "%(previous)s"; target provision state is '
                   '"%(target)s"' %
                   {'node': self.node.uuid,
                    'state': self.node.provision_state,
                    'target': self.node.target_provision_state,
                    'previous': self._prev_provision_state})

    if (self.node.provision_state.endswith('failed')
            or self.node.provision_state == 'error'):
        LOG.error(log_message)
    else:
        LOG.info(log_message)

    # 10. With no callback, send the state-change notification now; with a callback, defer it
    if callback is None:
        self._notify_provision_state_change()
    else:
        # Keep a reference to the node in case it is released before the callback runs
        self._saved_node = self.node
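
The state handling in process_event can be pictured with a toy state machine. The sketch below is not Ironic's FSM: it hard-codes a small subset of the cleaning-related transitions, simplifies target-state handling, and ToyNode and TRANSITIONS are hypothetical names used only for illustration.

```python
# Subset of provision-state transitions relevant to cleaning:
# (current state, event) -> next state
TRANSITIONS = {
    ("manageable", "clean"): "cleaning",
    ("cleaning", "wait"): "clean wait",
    ("clean wait", "resume"): "cleaning",
    ("cleaning", "manage"): "manageable",
    ("cleaning", "done"): "available",
    ("cleaning", "fail"): "clean failed",
    ("clean wait", "fail"): "clean failed",
    ("clean failed", "manage"): "manageable",
}


class ToyNode:
    def __init__(self, provision_state="manageable"):
        self.provision_state = provision_state
        self.target_provision_state = None

    def process_event(self, event, target_state=None):
        """Advance the state, rejecting events that are invalid right now."""
        key = (self.provision_state, event)
        if key not in TRANSITIONS:
            raise ValueError(
                f"event {event!r} not allowed in {self.provision_state!r}")
        self.provision_state = TRANSITIONS[key]
        # Simplification: the caller-supplied target is stored as-is.
        self.target_provision_state = target_state
        return self.provision_state


node = ToyNode()
for event in ("clean", "wait", "resume", "manage"):
    print(event, "->", node.process_event(event))
```

An event that is not valid for the current state raises, which corresponds to the InvalidState handling in do_node_clean.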

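4. Preparing the cleaning environment

Before running any clean step, Ironic calls the prepare_cleaning method to prepare the cleaning environment.

  • This method calls deploy_utils.prepare_inband_cleaning, which does the following:
    • Adds the cleaning network.
    • Prepares the boot parameters.
    • Boots the node if needed.

ironic/conductor/cleaning.py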

def do_node_clean(task, clean_steps=None, disable_ramdisk=False,
                  automated_with_steps=False):
    node = task.node
    # Decide between manual and automated cleaning
    manual_clean = clean_steps is not None and automated_with_steps is False
    clean_type = 'manual' if manual_clean else 'automated'
    LOG.debug('Starting %(type)s cleaning for node %(node)s',
              {'type': clean_type, 'node': node.uuid})

    if not manual_clean and utils.skip_automated_cleaning(node):
        # Skip cleaning, move to AVAILABLE.
        # If this is automated cleaning and the configuration or node attributes
        # say to skip it, finish right away and move the node to AVAILABLE.
        node.clean_step = None
        node.save()

        task.process_event('done')
        how = ('API' if node.automated_clean is False else 'configuration')
        LOG.info('Automated cleaning is disabled via %(how)s, node %(node)s '
                 'has been successfully moved to AVAILABLE state',
                 {'how': how, 'node': node.uuid})
        return
    # If the node is in maintenance mode and the configuration forbids it, fail immediately.
    if (not CONF.conductor.allow_provisioning_in_maintenance
            and node.maintenance):
        msg = _('Cleaning a node in maintenance mode is not allowed')
        return utils.cleaning_error_handler(task, msg,
                                            tear_down_cleaning=False)

    try:
        task.driver.power.validate(task)
        if not disable_ramdisk:
            task.driver.network.validate(task)
    except (exception.InvalidParameterValue, exception.NetworkError) as e:
        msg = (_('Validation of node %(node)s for cleaning failed: %(msg)s') %
               {'node': node.uuid, 'msg': e})
        return utils.cleaning_error_handler(task, msg)

    utils.wipe_cleaning_internal_info(task)
    # With the old internal state wiped, store the new clean steps and options
    # in the node's driver_internal_info field
    if clean_steps:
        node.set_driver_internal_info('clean_steps', clean_steps)
        node.set_driver_internal_info('cleaning_disable_ramdisk',
                                      disable_ramdisk)
        node.set_driver_internal_info('declarative_cleaning', True)
    task.node.save()

    utils.node_update_cache(task)

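    # Call the driver's prepare_cleaning method to set up the environment (e.g. boot the ramdisk).
    # If the agent is not up yet, enter the wait state.
    # Any other exception goes to error handling.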
    try:
        if not disable_ramdisk:
            prepare_result = task.driver.deploy.prepare_cleaning(task)
        else:
            LOG.info('Skipping preparing for in-band cleaning since '
                     'out-of-band only cleaning has been requested for node '
                     '%s', node.uuid)
            prepare_result = None
    except exception.AgentConnectionFailed:
        LOG.info('Agent is not yet running on node %(node)s, waiting for'
                 ' agent to come up for fast track', {'node': node.uuid})
        target_state = states.MANAGEABLE if manual_clean else None
        task.process_event('wait', target_state=target_state)
        return
    except Exception as e:
        msg = (_('Failed to prepare node %(node)s for cleaning: %(e)s')
               % {'node': node.uuid, 'e': e})
        return utils.cleaning_error_handler(task, msg, traceback=True)
    # If the driver returns CLEANWAIT, preparation continues asynchronously;
    # wait for the agent to call back before cleaning resumes.
    if prepare_result == states.CLEANWAIT:
        target_state = states.MANAGEABLE if manual_clean else None
        task.process_event('wait', target_state=target_state)
        return

    try:
        # Set the node's clean steps (e.g. erase disks, reset the BMC) and save them on the node.
        conductor_steps.set_node_cleaning_steps(
            task, disable_ramdisk=disable_ramdisk,
            use_existing_steps=bool(clean_steps)
        )
    except Exception as e:
        msg = (_('Cannot clean node %(node)s: %(msg)s')
               % {'node': node.uuid, 'msg': e})
        return utils.cleaning_error_handler(task, msg)
    # Fetch the list of clean steps and start from the first one, entering the main cleaning loop.
    steps = node.driver_internal_info.get('clean_steps', [])
    step_index = 0 if steps else None
    # Main cleaning flow for the node
    do_next_clean_step(task, step_index, disable_ramdisk=disable_ramdisk)

Let's see what set_node_cleaning_steps does:

def set_node_cleaning_steps(task, disable_ramdisk=False,
                            use_existing_steps=False):
    node = task.node

    if use_existing_steps is True:
        steps = _validate_user_clean_steps(
            task, node.driver_internal_info['clean_steps'],
            disable_ramdisk=disable_ramdisk)
    else:
        # Get the prioritized steps for automated cleaning
        steps = _get_cleaning_steps(task, enabled=True)
    manual_clean = node.target_provision_state == states.MANAGEABLE
    LOG.debug('List of the steps for %(type)s cleaning of node %(node)s: '
              '%(steps)s', {'type': 'manual' if manual_clean else 'automated',
                            'node': node.uuid,
                            'steps': steps})

    node.clean_step = {}
    node.set_driver_internal_info('clean_steps', steps)
    node.set_driver_internal_info('clean_step_index', None)
    node.save()

_get_cleaning_steps returns a value like the following:

[
    {
        'interface': 'deploy',
        'step': 'erase_devices',
        'priority': 20,
        'argsinfo': {
            'quick_erase': {'description': 'Quick erase', 'required': False}
        },
        'abortable': True,
        'requires_ramdisk': True
    },
    {
        'interface': 'raid',
        'step': 'create_configuration',
        'priority': 10,
        'argsinfo': {
            'volume_name': {'description': 'Name', 'required': True}
        },
        'abortable': False,
        'requires_ramdisk': False
    },
    # ... more steps
]
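
The list above is ordered by priority, highest first. A minimal sketch of that ordering follows, assuming that steps with priority 0 count as disabled and are dropped when enabled=True; tie-breaking between interfaces is omitted, the helper name is hypothetical, and the sample priorities are illustrative only.

```python
def enabled_steps_by_priority(steps):
    """Drop disabled steps (priority 0) and sort the rest, highest first."""
    enabled = [s for s in steps if s["priority"] > 0]
    return sorted(enabled, key=lambda s: s["priority"], reverse=True)


# Illustrative data only; real priorities come from the drivers and config.
steps = [
    {"interface": "raid", "step": "create_configuration", "priority": 10},
    {"interface": "deploy", "step": "erase_devices", "priority": 20},
    {"interface": "bios", "step": "factory_reset", "priority": 0},
]
ordered = enabled_steps_by_priority(steps)
print([s["step"] for s in ordered])
```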

do_next_clean_step

def do_next_clean_step(task, step_index, disable_ramdisk=None):
    node = task.node
    # Determine whether this is manual or automated cleaning
    manual_clean = node.target_provision_state == states.MANAGEABLE
    # With no step index there is nothing to run; otherwise take all remaining
    # steps starting at step_index.
    if step_index is None:
        steps = []
    else:
        assert node.driver_internal_info.get('clean_steps') is not None, \
            f"BUG: No clean steps for {node.uuid}, step index is {step_index}"
        steps = node.driver_internal_info['clean_steps'][step_index:]

    if disable_ramdisk is None:
        disable_ramdisk = node.driver_internal_info.get(
            'cleaning_disable_ramdisk', False)
    # Log the node, the cleaning type, and the remaining steps
    LOG.info('Executing %(kind)s cleaning on node %(node)s, remaining steps: '
             '%(steps)s', {'node': node.uuid, 'steps': steps,
                           'kind': 'manual' if manual_clean else 'automated'})
    # Execute each step until we hit an async step or run out of steps
    # Run the steps in order
    for ind, step in enumerate(steps):
        node.clean_step = step
        node.set_driver_internal_info('clean_step_index', step_index + ind)
        node.save()
        eocn = step.get('execute_on_child_nodes', False)
        result = None
        try:
            if async_steps.CLEANING_POLLING in node.driver_internal_info:
                node.del_driver_internal_info(async_steps.CLEANING_POLLING)
            if not eocn:
                LOG.info('Executing %(step)s on node %(node)s',
                         {'step': step, 'node': node.uuid})
                use_step_handler = conductor_steps.use_reserved_step_handler(
                    task, step)
                if use_step_handler:
                    if use_step_handler == conductor_steps.EXIT_STEPS:
                        # Exit the step, i.e. hold step
                        return
                else:
                    interface = getattr(task.driver, step.get('interface'))
                    # Actually execute the clean step
                    result = interface.execute_clean_step(task, step)
            else:
                LOG.info('Executing %(step)s on child nodes for node '
                         '%(node)s.',
                         {'step': step, 'node': node.uuid})
                result = execute_step_on_child_nodes(task, step)

        except Exception as e:
            if isinstance(e, exception.AgentConnectionFailed):
                if task.node.driver_internal_info.get(
                        async_steps.CLEANING_REBOOT):
                    LOG.info('Agent is not yet running on node %(node)s '
                             'after cleaning reboot, waiting for agent to '
                             'come up to run next clean step %(step)s.',
                             {'node': node.uuid, 'step': step})
                    node.set_driver_internal_info(
                        async_steps.SKIP_CURRENT_CLEAN_STEP, False)
                    target_state = (states.MANAGEABLE if manual_clean
                                    else None)
                    task.process_event('wait', target_state=target_state)
                    return
            if isinstance(e, exception.AgentInProgress):
                LOG.info('Conductor attempted to process clean step for '
                         'node %(node)s. Agent indicated it is presently '
                         'executing a command. Error: %(error)s',
                         {'node': task.node.uuid,
                          'error': e})
                node.set_driver_internal_info(
                    async_steps.SKIP_CURRENT_CLEAN_STEP, False)
                target_state = states.MANAGEABLE if manual_clean else None
                task.process_event('wait', target_state=target_state)
                return

            msg = (_('Node %(node)s failed step %(step)s: '
                     '%(exc)s') %
                   {'node': node.uuid, 'exc': e,
                    'step': node.clean_step})
            if not disable_ramdisk:
                driver_utils.collect_ramdisk_logs(task.node, label='cleaning')
            utils.cleaning_error_handler(task, msg, traceback=True)
            return

        if result == states.CLEANWAIT:
            LOG.info('Clean step %(step)s on node %(node)s being '
                     'executed asynchronously, waiting for driver.',
                     {'node': node.uuid, 'step': step})
            target_state = states.MANAGEABLE if manual_clean else None
            task.process_event('wait', target_state=target_state)
            return
        elif result is not None:
            msg = (_('While executing step %(step)s on node '
                     '%(node)s, step returned invalid value: %(val)s')
                   % {'step': step, 'node': node.uuid, 'val': result})
            return utils.cleaning_error_handler(task, msg)
        LOG.info('Node %(node)s finished clean step %(step)s',
                 {'node': node.uuid, 'step': step})
    if CONF.agent.deploy_logs_collect == 'always' and not disable_ramdisk:
        driver_utils.collect_ramdisk_logs(task.node, label='cleaning')

    # Clear clean_step
    node.clean_step = None
    utils.wipe_cleaning_internal_info(task)
    node.save()
    if not disable_ramdisk:
        try:
            task.driver.deploy.tear_down_cleaning(task)
        except Exception as e:
            msg = (_('Failed to tear down from cleaning for node %(node)s, '
                     'reason: %(err)s')
                   % {'node': node.uuid, 'err': e})
            return utils.cleaning_error_handler(task, msg,
                                                traceback=True,
                                                tear_down_cleaning=False)
    utils.node_update_cache(task)
    LOG.info('Node %s cleaning complete', node.uuid)
    event = 'manage' if manual_clean or node.retired else 'done'
    # NOTE(rloo): No need to specify target prov. state; we're done
    task.process_event(event)
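
Stripped of error handling and persistence, the loop in do_next_clean_step follows a simple pattern: run steps synchronously until one reports that it continues asynchronously, then stop and resume later from a saved index. The sketch below shows only that pattern; run_steps and fake_execute are hypothetical stand-ins, not Ironic functions.

```python
CLEANWAIT = "clean wait"


def run_steps(steps, start_index, execute):
    """Run steps from start_index; stop early if a step goes asynchronous.

    Returns ("wait", index) when a step returns CLEANWAIT, so the caller can
    resume later from index + 1, or ("done", None) when all steps finished.
    """
    for offset, step in enumerate(steps[start_index:]):
        index = start_index + offset
        result = execute(step)
        if result == CLEANWAIT:
            return "wait", index
        if result is not None:
            raise RuntimeError(
                f"step {step['step']} returned invalid value {result!r}")
    return "done", None


def fake_execute(step):
    # Pretend that agent-run (deploy) steps are asynchronous.
    return CLEANWAIT if step["interface"] == "deploy" else None


steps = [
    {"interface": "raid", "step": "create_configuration"},
    {"interface": "deploy", "step": "erase_devices"},
    {"interface": "bios", "step": "apply_configuration"},
]
print(run_steps(steps, 0, fake_execute))  # pauses at the asynchronous step
print(run_steps(steps, 2, fake_execute))  # resumed after that step completed
```

In the real flow the index lives in driver_internal_info['clean_step_index'], which is why the function can be re-entered after the node reports back.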

Next, jump to ironic/drivers/modules/agent_base.py:

def execute_clean_step(task, step):
    # NOTE(dtantsur): left for compatibility with agent-based hardware types.
    return execute_step(task, step, 'clean')

def execute_step(task, step, step_type, client=None):
    if client is None:
        client = agent_client.get_client(task)
    ports = objects.Port.list_by_node_id(
        task.context, task.node.id)
    # This ends up as an HTTP POST to the IPA agent carrying the step, node, ports, and so on
    call = getattr(client, 'execute_%s_step' % step_type)
    result = call(step, task.node, ports)
    if not result.get('command_status'):
        _raise(step_type, _(
            'Agent on node %(node)s returned bad command result: '
            '%(result)s') % {'node': task.node.uuid, 'result': result})
    _validate_step_type(step_type)
    step_to_state = {
        'clean': states.CLEANWAIT,
        'deploy': states.DEPLOYWAIT,
        'service': states.SERVICEWAIT,
    }
    return step_to_state[step_type]

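5. Executing the clean steps

Clean steps run in order, and each step may have its own execution logic.

  • For agent-based steps, Ironic calls execute_clean_step in agent_base.py.
  • That function communicates with the Ironic Python Agent (IPA) and sends it the clean command.

The call chain is as follows:

  • Conductor-side entry point: agent_base.py → execute_clean_step → execute_step → agent_client.execute_clean_step (HTTP call)
  • Agent-side entry point: ironic_python_agent/extensions/clean.py → execute_clean_step (REST API handler)

5.1. Diagram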

Ironic Conductor
    |
    |  (HTTP POST /v1/commands/, command name clean.execute_clean_step)
    v
IPA Agent (ironic-python-agent)
    |
    |  (runs the step, returns the result)
    v
Ironic Conductor
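
To make the conductor-to-agent hop more concrete, here is a sketch of what the command document might look like. It assumes the agent accepts a JSON body that names the extension method and carries its parameters; the params keys simply mirror the arguments passed in execute_step (step, node, ports) and may not match a given release exactly. build_agent_command and the sample values are hypothetical.

```python
import json


def build_agent_command(step, node, ports):
    """Build a JSON-serializable agent command (field names are assumptions)."""
    return {
        # Assumed naming scheme: "<extension>.<method>".
        "name": "clean.execute_clean_step",
        "params": {"step": step, "node": node, "ports": ports},
    }


command = build_agent_command(
    step={"interface": "deploy", "step": "erase_devices"},
    node={"uuid": "hypothetical-node-uuid"},
    ports=[{"address": "52:54:00:00:00:01"}],
)
print(json.dumps(command, indent=2))
```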

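6. Monitoring cleaning progress

Ironic tracks cleaning progress through periodic tasks and agent heartbeats.

  • If a step needs a node reboot to finish, Ironic sets the corresponding flag and resumes once the node is back up.
  • If a step fails, the node is moved to the clean failed state.

7. Cleaning complete

Once all clean steps have finished, the node is powered off and the cleaning network is removed. The node then moves to manageable after manual cleaning, or to available after automated cleaning (the manage and done events at the end of do_next_clean_step).

8. Error handling

If an error occurs during cleaning, the node is moved to the clean failed state.

9. Manual intervention

If cleaning fails, an administrator can manually move the node from the clean failed state to the manageable state and then try to repair it.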
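
Leaving clean failed uses the same provision endpoint as the clean request, with the manage verb as the target. A minimal sketch of the request body, with a hypothetical helper name:

```python
import json


def build_manage_body():
    """Body for PUT /v1/nodes/<node_ident>/states/provision to leave clean failed."""
    return json.dumps({"target": "manage"})


print(build_manage_body())
```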

The clean code path as a whole spans several components and methods: from the Ironic API to the Ironic Conductor and on to the Ironic Python Agent, each step has its own logic and role.