[GUI-Agent] StepFun GUI-MCP Walkthrough (3): The Execution Layer

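0x00 Abstract

At the end of 2025, StepFun released an upgraded family of AI Agent models, Step-GUI. The release includes the cloud model Step-GUI and GUI-MCP (Graphical User Interface - Model Context Protocol), presented as the first MCP implementation designed specifically for GUI automation, balancing standardization with privacy protection. This series walks through that MCP protocol and, along the way, looks at how the on-device agent is architected.

This is the third article. It covers the execution layer of Step-GUI, which is used on every path, whether or not the request arrives through MCP.

The walkthrough is reverse-engineered from the code under limited time, so it may contain mistakes; corrections are welcome.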

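0x01 Execution Flow

The lower part of the execution layer is identical on the MCP and non-MCP paths; this section traces the flow end to end.

1.1 Overall Task Flow

To see where the execution layer sits, start with how a task is executed.

The overall task flow is as follows:

Task handling
- Task reception: the task request arrives through an MCP tool
- Parameter validation: the device ID, task description, and other parameters are validated
- Session creation: a task session is created or resumed
Agent execution loop: the gui_agent_loop function maps the abstract task to concrete operations
- State perception: a screenshot captures the current device state
- Action decision:
  - the automate_step method obtains the AI model's decision
  - the LLM generates the action, that is, the ask_llm_anything function runs model inference
- Decision parsing: the action is converted into a device operation
  - Action parsing: parser.str2action turns the model output into a structured action
  - Action conversion: uiTars_to_frontend_action turns the model action into a front-end action
- Action execution
- State update: the task execution state is updated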
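The loop above can be sketched in a few lines. This is only an illustration of the perceive → decide → execute → update cycle, not the repository's gui_agent_loop: the function run_agent_loop and its capture / decide / execute callbacks are hypothetical names introduced here, and the model and device are replaced by stand-ins so the sketch runs offline.

```python
from typing import Callable, Dict, List


def run_agent_loop(
    capture: Callable[[], str],
    decide: Callable[[str, str], Dict],
    execute: Callable[[Dict], None],
    task: str,
    max_steps: int = 10,
) -> List[Dict]:
    """Perceive -> decide -> execute until a terminal action or max_steps."""
    history: List[Dict] = []
    for _ in range(max_steps):
        screenshot = capture()             # state perception
        action = decide(task, screenshot)  # model decision, assumed already parsed
        history.append(action)             # state update
        if action["action_type"] in ("COMPLETE", "ABORT"):
            break                          # terminal actions are not sent to the device
        execute(action)                    # action execution
    return history


# Stand-ins for the screenshot, the model, and the device (all hypothetical).
scripted = iter([
    {"action_type": "CLICK", "point": (500, 500)},
    {"action_type": "COMPLETE"},
])
executed: List[Dict] = []
history = run_agent_loop(
    capture=lambda: "screenshot.png",
    decide=lambda task, shot: next(scripted),
    execute=executed.append,
    task="open settings",
)
print(len(history), len(executed))
```

Passing the three stages in as callbacks keeps the loop itself independent of ADB and of any particular model, which is roughly the separation the flow above describes.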

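1.2 Mapping Abstract Parameters to Native Device Operations

Executing an operation requires mapping abstract parameters to native device parameters.

Parameter mapping flow

Receiving abstract parameters
- MCP reception: the client's abstract-parameter request arrives over the MCP protocol
  - the ask_agent tool receives the abstract parameters
  - these include device_id, task, max_steps, and others
- Parameter validation: parameter format and validity are checked
- Parameter passing: the parameters are handed to the execute_task function
Processing abstract parameters
- Task parsing: the abstract task text is turned into execution steps
- Configuration loading: execution settings are loaded from mcp_server_config.yaml
- Parameter mapping: abstract parameters are mapped to concrete execution parameters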

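Coordinate system mapping

Coordinate normalization
- Normalized coordinates: the GUI Agent uses a normalized 0-1000 coordinate system
  - the point parameter lies in the 0-1000 range
- Device coordinate conversion: the _convert_point_to_realworld_point function converts normalized coordinates into actual device coordinates

real_x = (float(x) / 1000) * wm_size[0]
real_y = (float(y) / 1000) * wm_size[1]

Screen orientation handling
- Orientation detection: the _detect_screen_orientation function detects the device's screen orientation
- Coordinate adaptation: the coordinate system is adjusted according to the orientation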
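A minimal sketch of the conversion and the orientation swap, to make the arithmetic concrete. The function name to_device_point is hypothetical, and truncating to int is an assumption made here (input tap takes integer pixels); the repository's own helper may round differently.

```python
def to_device_point(point, wm_size, orientation=0):
    """Map a 0-1000 normalized point to device pixels.

    orientation follows the 0/1/2/3 convention used in the article:
    1 and 3 mean landscape, so width and height are swapped first.
    """
    width, height = wm_size
    if orientation in (1, 3):
        width, height = height, width
    x, y = point
    # Same formula as above, truncated to whole pixels (an assumption).
    return int(float(x) / 1000 * width), int(float(y) / 1000 * height)


portrait = to_device_point((500, 250), (1080, 2400))
landscape = to_device_point((500, 250), (1080, 2400), orientation=1)
print(portrait, landscape)
```

For a 1080x2400 screen, the normalized point (500, 250) lands at (540, 600) in portrait and at (1200, 270) once the sides are swapped for landscape.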

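Action space mapping

Action type mapping
- CLICK: tap
  - Model output: action:CLICK\tpoint:x,y
  - Native operation: adb shell input tap {x} {y}
- TYPE: text input
  - Model output: action:TYPE\tvalue:text\tpoint:x,y
  - Native operation: adb shell app_process ... -keyboard "{text}"
- AWAKE: app launch
  - Model output: action:AWAKE\tvalue:app_name
  - Native operation: adb shell monkey -p {package_name} -c android.intent.category.LAUNCHER 1
Action parameter validation
- Parameter validation: the action_assertion function validates action parameters
- Type check: the action type must appear in the _ACTION_TYPE_ENUM list
- Field check: required fields must be present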
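To illustrate the tab-separated output format and the validation step, here is a small sketch. It is not the repository's parser.str2action or action_assertion: parse_action_string, check_action, and the REQUIRED_FIELDS table are hypothetical names, and the required-field lists are an assumption based on the three action types listed above.

```python
# Hypothetical required-field table for the three action types shown above.
REQUIRED_FIELDS = {
    "CLICK": ["point"],
    "TYPE": ["value"],
    "AWAKE": ["value"],
}


def parse_action_string(raw: str) -> dict:
    """Split 'key:value' fields separated by tabs into a dict."""
    action = {}
    for field in raw.strip().split("\t"):
        if not field:
            continue
        key, _, value = field.partition(":")  # split on the first colon only
        action[key] = value
    if "point" in action:
        x, y = action["point"].split(",")
        action["point"] = (int(x), int(y))
    return action


def check_action(action: dict) -> None:
    """Raise ValueError for an unknown type or a missing required field."""
    action_type = action.get("action")
    if action_type not in REQUIRED_FIELDS:
        raise ValueError(f"Unknown action type: {action_type}")
    missing = [f for f in REQUIRED_FIELDS[action_type] if f not in action]
    if missing:
        raise ValueError(f"{action_type} is missing fields: {missing}")


parsed = parse_action_string("action:TYPE\tvalue:hello\tpoint:120,880")
check_action(parsed)
print(parsed)
```

Validating before any ADB command is built means a malformed model output fails early, before anything touches the device.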

1.3 Executing Native Operations

Once a concrete, executable action is available, the native operation can be carried out.

Front-end action execution
- Action execution: the act_on_device function performs the concrete device operation
  - CLICK action → adb shell input tap command
  - TYPE action → adb shell app_process -Djava.class.path=/data/local/tmp/yadb /data/local/tmp com.ysbing.yadb.Main -keyboard command
  - AWAKE action → adb shell monkey -p command
The ADB command mapping is as follows:

Tap, in the branch if action_type == "CLICK":

cmd = f"adb -s {device_id} shell input tap {x} {y}"

Text input, in the branch elif action_type == "TYPE" (LONGPRESS also goes through yadb, but with -touch rather than -keyboard):

cmd = f"adb -s {device_id} shell app_process -Djava.class.path=/data/local/tmp/yadb /data/local/tmp com.ysbing.yadb.Main -keyboard '{text}'"

App launch, in the branch elif action_type == "AWAKE":

cmd = f"adb -s {device_id} shell monkey -p {package_name} -c android.intent.category.LAUNCHER 1"

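0x02 The Execution Layer

The device control layer (execution layer) has the following characteristics:

It provides the device operation interface
It executes concrete actions
It communicates with the device over ADB

2.1 act_on_device

Both copilot_front_end\pu_frontend_executor.py and copilot_front_end\mobile_action_helper.py define an act_on_device function.

evaluate_task_on_device uses the one from pu_frontend_executor.py.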

def evaluate_task_on_device(agent_server, device_info, task, rollout_config, extra_info = {}, reflush_app=True, auto_reply = False, reset_environment=True):
        action = agent_server.automate_step(payload)['action']
        action = uiTars_to_frontend_action(action)
        act_on_device(action, device_id, device_wm_size, print_command=True, reflush_app=reflush_app)

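The act_on_device in mobile_action_helper.py is called by step_interaction, but no caller of step_interaction could be found, so the differences between the two can only be inferred, as follows.

Dimension	pu_frontend_executor.act_on_device	mobile_action_helper.act_on_device
Location	copilot_front_end/pu_frontend_executor.py	copilot_front_end/mobile_action_helper.py
Ownership	Front-end executor module (standalone function)	Mobile device helper module (standalone function)
Parameters	frontend_action, device_id, wm_size, print_command, reflush_app	device_id, action, print_command, refush_app (spelled this way in the source), device_wm_size
Action set	CLICK, LONGPRESS, TYPE, SCROLL, AWAKE, SLIDE, BACK, HOME, COMPLETE, ABORT, INFO, WAIT, HOT_KEY	Click, Awake, Type, Pop, Wait, Scroll, LongPress, Abort, Complete (a subset)
ADB implementation	Standard ADB plus the custom yadb tool	Standard ADB plus yadb for Type and LongPress; details differ slightly
Screen orientation	Built-in _detect_screen_orientation; the width and height of wm_size are swapped for orientation 1/3 (in the CLICK branch)	No built-in orientation detection
Keyboard handling	TYPE checks keyboard_exists; if the keyboard is absent and a point is given, it taps the point first to focus the input box; preprocess_text_for_adb handles special characters.	Also checks keyboard_exists; the logic is similar, with slightly different details.
App refresh	In AWAKE, if reflush_app is True, the app is first stopped with am force-stop and then started with monkey.	AWAKE handles app refresh the same way; the logic is essentially identical, only the variable names differ.
Usage	Called by gui_agent_loop to handle the front-end actions produced by the model; the main entry point	Used in the BaseMobileActionHelper class as a helper for low-level device operations
Code status	More complete, with the full action set and extras such as orientation detection; the primary execution function.	A functional subset that appears to have lower maintenance priority

In short:

the pu_frontend_executor.act_on_device version has more parameters, more actions, and orientation adaptation, and is most likely the main entry point;
the mobile_action_helper.act_on_device version covers a subset of actions with no orientation detection, and is most likely only a helper.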

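2.2 act_on_device @ pu_frontend_executor.py

The logic in copilot_front_end\pu_frontend_executor.py is as follows:

Core logic: normalize the action → validate parameters → adapt coordinates / text → build the ADB command → execute and return the result;
Core design: one branch per action type, each with its own parameter checks and adaptation, so that malformed actions fail before any command runs;
Key adaptations: screen orientation, coordinate conversion, and escaping of special characters in text, which address device compatibility issues.

Core execution flow

[Figure: act_on_device @ pu_frontend_executor, core execution flow]

Coordinate adaptation sub-flow (CLICK as the example)

[Figure: act_on_device @ pu_frontend_executor, coordinate adaptation sub-flow]

Text input sub-flow

[Figure: act_on_device @ pu_frontend_executor, text input sub-flow]

Code

The code is as follows: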

def act_on_device(frontend_action, device_id, wm_size, print_command = False, reflush_app = True):
    """
    Execute a front-end action on the specified device.
    Supported action types and formats:
    1. CLICK(point=(x,y)) - tap
    2. LONGPRESS(point=(x,y), duration=sec) - long press
    3. TYPE(value="string", point=None, keyboard_exists=True) - text input (point is the input box coordinate; defaults to the current focus)
    4. SCROLL(point=(x,y), direction="up|down|left|right") - scroll (UI-Tars only)
    5. AWAKE(value=app_name) - launch an app
    6. SLIDE(point1=(x1,y1), point2=(x2,y2), duration=sec) - slide
    7. BACK() - back (UI-Tars only)
    8. HOME() - go to the home screen (UI-Tars only)
    9. COMPLETE() - task completed
    10. ABORT() - abort the task
    11. INFO() - ask the user
    12. WAIT(seconds=sec) - wait
    13. HOT_KEY(key="volume_up|volume_down|power|...") - system hot key

    Standard front-end action format:
    {
        "action_type": "<action type>",
        "param_key": <parameter value>,
        ...
    }
    """
    # All valid action types, used to validate the incoming action
    valid_actions = ["CLICK", "LONGPRESS", "TYPE", "SCROLL", "AWAKE", "SLIDE", "BACK", "HOME", "COMPLETE", "ABORT", "INFO", "WAIT", "HOT_KEY"]

    # The action must contain an action_type field
    assert "action_type" in frontend_action, "Missing action_type in frontend_action"
    # The action type must be in the valid list
    assert frontend_action["action_type"] in valid_actions, f"Invalid action type: {frontend_action['action_type']}"

    # Extract the action type to simplify the branches below
    action_type = frontend_action["action_type"]

    # ========== Action type: tap (CLICK) ==========
    if action_type == "CLICK":
        # A tap must carry a point
        assert "point" in frontend_action, "Missing point in CLICK action"

        # Detect the current screen orientation (landscape / portrait)
        orientation = _detect_screen_orientation(device_id)

        # In landscape (orientation 1/3), swap width and height
        if orientation in [1, 3]:
            wm_size = (wm_size[1], wm_size[0])

        # Convert the normalized point into real device pixel coordinates
        x, y = _convert_point_to_realworld_point(frontend_action["point"], wm_size)

        # Build the ADB tap command (input tap simulates a tap)
        cmd = f"adb -s {device_id} shell input tap {x} {y}"
        # Print the command if requested
        if print_command:
            print(f"Executing command: {cmd}")

        # Run the ADB command, capturing output and return code
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)

        # Return the command result
        return result

    # ========== Action type: long press (LONGPRESS) ==========
    elif action_type == "LONGPRESS":
        # A long press must carry a point and a duration
        assert "point" in frontend_action, "Missing point in LONGPRESS action"
        assert "duration" in frontend_action, "Missing duration in LONGPRESS action"
        # Convert the point into real device pixel coordinates
        x, y = _convert_point_to_realworld_point(frontend_action["point"], wm_size)
        # Long-press duration in seconds
        duration = frontend_action["duration"]
        # Build the long-press command (yadb -touch, duration converted to milliseconds)
        cmd = f"adb -s {device_id} shell app_process -Djava.class.path=/data/local/tmp/yadb /data/local/tmp com.ysbing.yadb.Main -touch {x} {y} {int(duration * 1000)}"

        if print_command:
            print(f"Executing command: {cmd}")
        # Run the command and return the result
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        return result

    # ========== Action type: text input (TYPE) ==========
    elif action_type == "TYPE":
        # A TYPE action must carry the text to enter
        assert "value" in frontend_action, "Missing value in TYPE action"

        # Extract the text and the keyboard state
        value = frontend_action["value"]
        keyboard_exists = frontend_action.get("keyboard_exists", True)
        # If the keyboard is not showing, tap the input box first to focus it
        if not keyboard_exists:
            if "point" in frontend_action:
                x, y = _convert_point_to_realworld_point(frontend_action["point"], wm_size)
                cmd = f"adb -s {device_id} shell input tap {x} {y}"
                if print_command:
                    print(f"Executing command: {cmd}")
                result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
                # Wait 1 second so the input box is focused
                time.sleep(1)
            else:
                print("Warning: keyboard does not exist and point is not given. Using current focus box.")

        # Text preprocessing: newlines and tabs become spaces, then spaces are backslash-escaped
        def preprocess_text_for_adb(text):
            text = text.replace("\n", " ").replace("\t", " ")
            # "\\ " is backslash + space; the original "\ " is an invalid escape sequence in Python
            text = text.replace(" ", "\\ ")
            return text

        # Build the text input command (yadb -keyboard)
        cmd = f"adb -s {device_id} shell app_process -Djava.class.path=/data/local/tmp/yadb /data/local/tmp com.ysbing.yadb.Main -keyboard '{preprocess_text_for_adb(value)}'"
        if print_command:
            print(f"Executing command: {cmd}")

        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        return result
    
    # ========== Action type: scroll (SCROLL) ==========
    elif action_type == "SCROLL":
        # A scroll must carry a point and a direction
        assert "point" in frontend_action, "Missing point in SCROLL action"
        assert "direction" in frontend_action, "Missing direction in SCROLL action"
        # Convert the scroll anchor point into real coordinates
        x, y = _convert_point_to_realworld_point(frontend_action["point"], wm_size)

        # Scroll offset: 30% of the width / height
        deltax = int(0.3 * wm_size[0])
        deltay = int(0.3 * wm_size[1])

        # Compute the swipe start / end points from the direction
        direction = frontend_action["direction"]
        if direction == "down":
            x1, y1 = x, y
            x2, y2 = x, y - deltay
        elif direction == "up":
            x1, y1 = x, y
            x2, y2 = x, y + deltay
        elif direction == "left":
            x1, y1 = x, y
            x2, y2 = x - deltax, y
        elif direction == "right":
            x1, y1 = x, y
            x2, y2 = x + deltax, y
        else:
            raise ValueError(f"Invalid direction: {direction}")

        # Build the scroll command (input swipe, 1200 ms)
        cmd = f"adb -s {device_id} shell input swipe {x1} {y1} {x2} {y2} 1200"
        if print_command:
            print(f"Executing command: {cmd}")

        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        return result

    # ========== Action type: launch app (AWAKE) ==========
    elif action_type == "AWAKE":
        # AWAKE must carry the app name
        assert "value" in frontend_action, "Missing value in AWAKE action"
        app_name = frontend_action["value"]
        # Look up the package name from the app name (via the external find_package_name)
        package_name = find_package_name(app_name)
        if package_name is None:
            raise ValueError(f"App name {app_name} not found in package map.")

        # If app refresh is enabled, force-stop the app first
        if reflush_app:
            cmd = f"adb -s {device_id} shell am force-stop {package_name}"
            if print_command:
                print(f"Executing command: {cmd}")

            result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
            # Wait 1 second so the app has stopped
            time.sleep(1)

        # Build the launch command (monkey starts the app's launcher activity)
        cmd = f"adb -s {device_id} shell monkey -p {package_name} -c android.intent.category.LAUNCHER 1"
        if print_command:
            print(f"Executing command: {cmd}")

        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        return result

    # ========== Action type: slide (SLIDE) ==========
    elif action_type == "SLIDE":
        # A slide must carry a start point and an end point
        assert "point1" in frontend_action, "Missing point1 in SLIDE action"
        assert "point2" in frontend_action, "Missing point2 in SLIDE action"
        # Convert the start / end points into real coordinates
        x1, y1 = _convert_point_to_realworld_point(frontend_action["point1"], wm_size)
        x2, y2 = _convert_point_to_realworld_point(frontend_action["point2"], wm_size)

        # Slide duration (default 1.5 seconds)
        duration = frontend_action.get("duration", 1.5)
        # Build the slide command (duration converted to milliseconds)
        cmd = f"adb -s {device_id} shell input swipe {x1} {y1} {x2} {y2} {int(duration * 1000)}"
        if print_command:
            print(f"Executing command: {cmd}")

        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        return result

    # ========== Action type: back (BACK) ==========
    elif action_type == "BACK":
        # Build the back command (keyevent 4 is the back key)
        cmd = f"adb -s {device_id} shell input keyevent 4"
        if print_command:
            print(f"Executing command: {cmd}")

        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        return result

    # ========== Action type: home (HOME) ==========
    elif action_type == "HOME":
        # Build the home command (keyevent 3 is the home key)
        cmd = f"adb -s {device_id} shell input keyevent 3"
        if print_command:
            print(f"Executing command: {cmd}")

        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        return result

    # ========== Action type: task completed (COMPLETE) ==========
    elif action_type == "COMPLETE":
        if print_command:
            print("Task completed.")
        return None

    # ========== Action type: task aborted (ABORT) ==========
    elif action_type == "ABORT":
        if print_command:
            print("Task aborted.")
        return None

    # ========== Action type: ask the user (INFO) ==========
    elif action_type == "INFO":
        if print_command:
            print("Info action executed.")
        return None

    # ========== Action type: wait (WAIT) ==========
    elif action_type == "WAIT":
        # WAIT must carry a duration
        assert "seconds" in frontend_action, "Missing seconds in WAIT action"
        seconds = frontend_action["seconds"]
        if print_command:
            print(f"Waiting for {seconds} seconds.")
        # Sleep for the requested time
        time.sleep(seconds)
        return None

    # ========== Action type: system hot key (HOT_KEY) ==========
    elif action_type == "HOT_KEY":
        # A hot key must carry the key name
        assert "key" in frontend_action, "Missing key in HOT_KEY action"
        key = frontend_action["key"]
        # Mapping from hot key names to ADB keyevent codes
        key_event_map = {
            "volume_up": 24,    # volume up
            "volume_down": 25,  # volume down
            "power": 26,        # power key
            "home": 3,          # home key
            "back": 4,          # back key
            "menu": 82,         # menu key
        }
        # Check that the hot key is supported
        if key.lower() not in key_event_map:
            raise ValueError(f"Unsupported hot key: {key}")

        # Build the hot key command
        key_event = key_event_map[key.lower()]
        cmd = f"adb -s {device_id} shell input keyevent {key_event}"
        if print_command:
            print(f"Executing command: {cmd}")

        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        return result

    # ========== Unknown action type ==========
    else:
        raise ValueError(f"Unsupported action type: {action_type}")
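The SCROLL branch is the least obvious part of this function, because the direction names the way the content moves, not the way the finger moves: scrolling "down" swipes the finger upward (y decreases). The sketch below isolates that arithmetic as a pure function. The name scroll_endpoints and the fraction parameter are hypothetical; the 30% offset is taken from the SCROLL branch of the code in this section.

```python
def scroll_endpoints(x, y, direction, wm_size, fraction=0.3):
    """Return (start, end) pixel points for an `input swipe` that scrolls.

    The offset is `fraction` of the screen width or height, as in the
    SCROLL branch of act_on_device.
    """
    dx = int(fraction * wm_size[0])
    dy = int(fraction * wm_size[1])
    offsets = {
        "down": (0, -dy),   # finger moves up, content scrolls down
        "up": (0, dy),      # finger moves down, content scrolls up
        "left": (-dx, 0),
        "right": (dx, 0),
    }
    if direction not in offsets:
        raise ValueError(f"Invalid direction: {direction}")
    ox, oy = offsets[direction]
    return (x, y), (x + ox, y + oy)


start, end = scroll_endpoints(540, 1200, "down", (1080, 2400))
print(start, end)
```

A table of offsets replaces the if/elif chain; the behaviour is the same, and the swipe can be checked without a device attached.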

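2.3 act_on_device @ mobile_action_helper

Overall execution flow

[Figure: act_on_device @ mobile_action_helper, overall execution flow]

Coordinate conversion sub-flow (shared by several actions)

[Figure: act_on_device @ mobile_action_helper, coordinate conversion sub-flow]

App launch sub-flow

[Figure: act_on_device @ mobile_action_helper, app launch sub-flow]

Code

Core logic: parse the action → adapt coordinates / parameters → assemble the ADB command → execute it. The essence is turning an abstract action into an ADB command the device can run;
Coordinate adaptation: both physical coordinates and normalized coordinates are supported, which accommodates devices with different resolutions;
Lightweight design: only the core interaction actions are kept, some actions are empty placeholders, and Wait simply sleeps without issuing an ADB call.

The code is as follows: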

def act_on_device(device_id, action, print_command = False, refush_app = True, device_wm_size = None):
    """
    Execute a structured action on the specified Android device.
    :param device_id: unique device identifier (the device ID known to ADB)
    :param action: structured action of the form {"action_type": <type>, "args": <arguments>}
    :param print_command: whether to print the ADB command being executed (for debugging)
    :param refush_app: whether to force-stop the app before launching it (for a clean start state)
    :param device_wm_size: device screen size (width, height), used to convert normalized coordinates
    :return: None (Wait returns immediately; other actions return after the ADB command has run)
    """
    # Base ADB command with the device ID, e.g. "adb -s device123"
    adb_command = _get_adb_command(device_id)

    # ========== Action type: tap (Click) ==========
    if action['action_type'] == "Click":
        # Without a screen size, use the physical point; with one, convert the normalized point
        if device_wm_size is None:
            real_point = action['args']['point']  # physical coordinates, used as-is
        else:
            normalized_point = action['args']['normalized_point']  # normalized coordinates (0-1 range)
            # Physical coordinates: normalized value * screen size
            real_point = (int(normalized_point[0] * device_wm_size[0]), int(normalized_point[1] * device_wm_size[1]))
        # Append the tap command: input tap simulates a screen tap
        adb_command += f" shell input tap {real_point[0]} {real_point[1]}"

    # ========== Action type: launch app (Awake) ==========
    elif action['action_type'] == "Awake":
        # Name of the app to launch
        app_name = action['args']['text']
        # Look up the Android package name (via the external find_package_name)
        package_name = find_package_name(app_name)

        # Raise if the package name is unknown
        if package_name is None:
            raise ValueError(f"App {app_name} not found in package map.")

        # If app refresh is enabled, force-stop the target app first (clears leftover background state)
        if refush_app:
            refush_command = f"{adb_command} shell am force-stop {package_name}"
            if print_command:
                print(f"Executing command: {refush_command}")
            # Run the force-stop command
            subprocess.run(refush_command, shell=True, capture_output=True, text=True)

        # Build the launch command: monkey starts the app's launcher activity
        adb_command = f"{adb_command} shell monkey -p {package_name} -c android.intent.category.LAUNCHER 1"
        # Sleep 2 seconds; the launch command only runs at the end of the function,
        # so this pause falls between the force-stop and the launch
        time.sleep(2)

    # ========== Action type: text input (Type) ==========
    elif action['action_type'] == "Type":
        # Text to type
        text = action['args']['text']

        # Convert the input box coordinate (same logic as Click)
        if device_wm_size is None:
            point = action['args']['point']
        else:
            normalized_point = action['args']['normalized_point']
            point = (int(normalized_point[0] * device_wm_size[0]), int(normalized_point[1] * device_wm_size[1]))

        # If the keyboard is not showing, tap the input box first to focus it
        if "keyboard_exists" in action['args'] and not action['args']['keyboard_exists']:
            click_commmand = f"{adb_command} shell input tap {point[0]} {point[1]}"
            subprocess.run(click_commmand, shell=True, capture_output=True, text=True)

        # Append the text input command: yadb -keyboard (handles Chinese and other non-ASCII text)
        adb_command += f' shell app_process -Djava.class.path=/data/local/tmp/yadb /data/local/tmp com.ysbing.yadb.Main -keyboard "{text}"'

    # ========== Action type: popup handling (Pop) ==========
    elif action['action_type'] == "Pop":
        # Placeholder, no logic yet
        pass

    # ========== Action type: wait (Wait) ==========
    elif action['action_type'] == "Wait":
        # Wait duration
        wait_time = action['args']['duration']
        # Sleep directly; no ADB command is needed
        time.sleep(float(wait_time))
        # Return here so the ADB execution below is skipped
        return

    # ========== Action type: swipe (Scroll) ==========
    elif action['action_type'] == "Scroll":
        # Swipe path (start / end points) in physical coordinates
        path = action['args']['path']

        # With a screen size, convert the normalized path into physical coordinates
        if device_wm_size is not None:
            normalized_path = action['args']['normalized_path']
            path = [
                (int(normalized_path[0][0] * device_wm_size[0]), int(normalized_path[0][1] * device_wm_size[1])),  # start point
                (int(normalized_path[1][0] * device_wm_size[0]), int(normalized_path[1][1] * device_wm_size[1]))   # end point
            ]

        # Append the swipe command: input swipe, 1000 ms
        adb_command += f" shell input swipe {path[0][0]} {path[0][1]} {path[1][0]} {path[1][1]} 1000"

    # ========== Action type: long press (LongPress) ==========
    elif action['action_type'] == "LongPress":
        # Convert the coordinate (same logic as Click)
        if device_wm_size is None:
            point = action['args']['point']
        else:
            normalized_point = action['args']['normalized_point']
            point = (int(normalized_point[0] * device_wm_size[0]), int(normalized_point[1] * device_wm_size[1]))

        # Append the long-press command: yadb -touch for a fixed 2000 ms
        adb_command += f" shell app_process -Djava.class.path=/data/local/tmp/yadb /data/local/tmp com.ysbing.yadb.Main -touch {point[0]} {point[1]} 2000"

    # ========== Action type: abort task (Abort) ==========
    elif action['action_type'] == "Abort":
        # Placeholder, no logic yet
        pass

    # ========== Action type: complete task (Complete) ==========
    elif action['action_type'] == "Complete":
        # Placeholder, no logic yet
        pass

    # ========== Unknown action type ==========
    else:
        raise ValueError(f"Invalid action type: {action['action_type']}")

    # ========== Execute the ADB command ==========
    # Only Wait returns early. Pop / Abort / Complete fall through to here with
    # adb_command still holding just the bare "adb -s <device>" prefix.
    # Print the command if debugging is enabled
    if print_command:
        print(f"Executing command: {adb_command}")

    # Run the ADB command, capturing output, errors, and return code
    result = subprocess.run(adb_command, shell=True, capture_output=True, text=True)

    # Print the command output if debugging is enabled
    if print_command:
        print(f"Command output: {result.stdout}")

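Invocation

act_on_device is called from step_interaction, but no caller of step_interaction itself could be found.

step_interaction relies on model_act2front_act, so that function deserves a closer look. Note also that step_interaction passes no device_wm_size, so act_on_device uses the pixel coordinates already stored in args['point'] rather than the normalized ones.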

    def step_interaction(self, action, capture_duration = 0.5, image_full_path = None, user_comment = None):
        """
        Perform a step interaction on the device, and get the observation.
        """

        # to make sure the screen is on
        _open_screen(self.device_id)
        
        user_comment = ""
        if action is not None and action['action_type'] not in ['INFO', 'COMPLETE', 'ABORT']:
            # to convert the action to front-end action
            front_end_action = model_act2front_act(action, self.wm_size)

            # to perform the action
            act_on_device(self.device_id, front_end_action) 

        elif action is not None and action['action_type'] == "INFO":
            # to convert the action to front-end action
            front_end_action = model_act2front_act(action, self.wm_size)

            value = front_end_action['args']['text']
            
            # to ask the user to input the value
            if user_comment is None:
                user_comment = input(f"Please answer the model's question: {value}: ")

        
        elif action is not None and action['action_type'] in ["COMPLETE", "ABORT"]:
            
            return None

        # to wait for the action to be completed
        time.sleep(capture_duration)

        is_screenshot = False

        # to get the observation
        for i in range(3):
            try:
                screen_shot_pic_path = _capture_save_screenshot(self.device_id, tmp_file_dir="tmp_screenshot", print_command=True)
                is_screenshot = True
                break
            except Exception as e:
                print(f"Error capturing screenshot: {e}")
                time.sleep(0.5)


        if not is_screenshot:
            raise ValueError(f"Error capturing screenshot: {e}")
        # to check if the screenshot is valid

        if image_full_path is not None:
            # to copy the image to the full path
            smart_copy(screen_shot_pic_path, image_full_path)
            screen_shot_pic_path = image_full_path
            

        observation = {
            "image": screen_shot_pic_path,
            "user_comment": user_comment,
        }

        return observation

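model_act2front_act

model_act2front_act converts the abstract action produced by the model into an action the front end (Android) can execute. It is the bridge that lets high-level decisions be carried out on the device.

What it does:

Action type: maps the model's action type (such as CLICK) to the front-end action type (such as Click)
Coordinate conversion: converts normalized coordinates (0-1000) into actual screen coordinates
Parameter handling: extracts and processes the parameters each action needs, such as the tap point or the text to type
Front-end action object: builds the complete action object the front-end executor expects

The code is as follows: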

# convert model action from api to a front-end action
def model_act2front_act(act, wm_size):
    """
    Convert model action to front-end action.
    """
    # to parse the action and convert it to front-end action
    model_action_type_list = ['CLICK', "TYPE", "COMPLETE", "WAIT", "AWAKE", "INFO", "ABORT", "SWIPE", "LONGPRESS"]

    action_type_map = {
        "CLICK": "Click",
        "TYPE": "Type",
        "COMPLETE": "Complete",
        "INFO": "Pop",
        "WAIT": "Wait",
        "AWAKE": "Awake",
        "ABORT": "Abort",
        "SWIPE": "Scroll",
        "LONGPRESS": "LongPress",
    }

    if "action" in act:
        act['action_type'] = act['action']

    assert act['action_type'] in model_action_type_list, f"Invalid action type: {act['action_type']}"

    # action unrelated parameters
    status = act.get('status', None)
    payload_dict = act.get('payload', {})
    plan, summary = payload_dict.get('plan', None), payload_dict.get('summary', None)

    explain = act['explain']

    down_stream_action = {
        "action_type": action_type_map[act['action_type']],
        "args": {
            "status": status,
            "plan": plan,
            "summary": summary,

            "explain": explain,
        }
    }

    if act['action_type'] == 'CLICK':
        # <STATUS>xxx<ACTION>explain:xxx\taction:CLICK\tpoint:x,y\tsearch_type:app|keyboard|none<PAYLOAD>plan:xxx\tsummary:xxx\t
        assert "point" in act, f"Point not found in CLICK action: {act}"

        search_type = act.get('search_type', "none")

        point = act['point']

        zero_one_point = ((float(point[0])) / 1000, (float(point[1])) / 1000)
        real_coordinate = (int(zero_one_point[0] * wm_size[0]), int(zero_one_point[1] * wm_size[1]))

        # click point for several versions
        down_stream_action['args']['coordinate'] = real_coordinate + real_coordinate
        down_stream_action['args']['point'] = real_coordinate
        down_stream_action['args']['normalized_point'] = zero_one_point

        down_stream_action['args']['search_type'] = search_type

    elif act['action_type'] == 'TYPE':
        # <STATUS>xxx<ACTION>explain:xxx\taction:TYPE\tvalue:xxxx\tpoint:x,y\tkeyboard:true|false<PAYLOAD>plan:xxx\tsummary:xxx\t
        assert "value" in act, f"Value not found in TYPE action: {act}"

        value = act['value'].replace(" ", "_")
        # point = act['point']
        # point can be optional
        point = act.get('point', None)

        # to set the keyboard exists default to True, for point is None
        keyboard_exists = act.get('keyboard', True)

        if point is not None:        
            zero_one_point = ((float(point[0])) / 1000, (float(point[1])) / 1000)
            real_coordinate = (int(zero_one_point[0] * wm_size[0]), int(zero_one_point[1] * wm_size[1]))
        else:
            zero_one_point = None
            real_coordinate = [None]

        # click point for several versions
        down_stream_action['args']['coordinate'] = real_coordinate + real_coordinate
        down_stream_action['args']['point'] = real_coordinate
        down_stream_action['args']['normalized_point'] = zero_one_point

        down_stream_action['args']['text'] = value

        down_stream_action['args']['keyboard_exists'] = keyboard_exists

    elif act['action_type'] == "INFO": 
        # <STATUS>xxx<ACTION>explain:xxx\taction:INFO\tvalue:xxxx\t<PAYLOAD>plan:xxx\tsummary:xxx\t
        assert "value" in act, f"Value not found in INFO action: {act}"

        value = act['value']
        down_stream_action['args']['text'] = value


    elif act['action_type'] == "WAIT":
        #<STATUS>xxx<ACTION>explain:xxx\taction:WAIT\tvalue:5\tis_auto_close:true|false\tr1:xxx\tp1:x1,y1\tr2:xxx\tp2:x2,y2\t<PAYLOAD>plan:xxx\tsummary:xxx\t

        assert "value" in act, f"Value not found in WAIT action: {act}"
        value = act['value']
        is_auto_close = act.get('is_auto_close', False)

        clickable_regions = []
        close_reasons = act.get('close_reasons', [])
        for click_area in close_reasons:

            point, reason = click_area['point'], click_area['reason']
            bbox = click_area.get('bbox', None)

            zero_one_point = ((float(point[0])) / 1000, (float(point[1])) / 1000)
            real_coordinate = (int(zero_one_point[0] * wm_size[0]), int(zero_one_point[1] * wm_size[1]))
            
            if bbox is not None:
                zero_one_bbox = ((float(bbox[0])) / 1000, (float(bbox[1])) / 1000, 
                                 (float(bbox[2])) / 1000, (float(bbox[3])) / 1000)
                real_bbox = (int(zero_one_bbox[0] * wm_size[0]), int(zero_one_bbox[1] * wm_size[1]),
                              int(zero_one_bbox[2] * wm_size[0]), int(zero_one_bbox[3] * wm_size[1]))
                
            else:
                zero_one_bbox = (zero_one_point[0], zero_one_point[1], zero_one_point[0], zero_one_point[1])
                real_bbox = (real_coordinate[0], real_coordinate[1], real_coordinate[0], real_coordinate[1])
            
                
            clickable_regions.append({
                "reason": reason,
                "point": real_coordinate,
                "region": real_bbox,
                "normalized_point": zero_one_point,
                "normalized_region": zero_one_bbox,
            })

        # for reason, point in act['']

        down_stream_action['args']['duration'] = value
        down_stream_action['args']['closability'] = {
            "auto_closable": is_auto_close,
            "type": explain,
            "regions": clickable_regions,
        }

    elif act['action_type'] == "AWAKE":
        # <STATUS>xxx<ACTION>explain:xxx\taction:AWAKE\tvalue:xxxx\t<PAYLOAD>plan:xxx\tsummary:xxx\t
        assert "value" in act, f"Value not found in AWAKE action: {act}"

        value = act['value']
        down_stream_action['args']['text'] = value

    elif act['action_type'] == "ABORT":
        # <STATUS>xxx<ACTION>explain:xxx\taction:ABORT\t<PAYLOAD>plan:xxx\tsummary:xxx\t

        down_stream_action['args']['abort_reason'] = explain
    
    elif act['action_type'] == "COMPLETE":
        # <STATUS>xxx<ACTION>explain:xxx\taction:COMPLETE\t<PAYLOAD>plan:xxx\tsummary:xxx\t

        # nothing to add
        pass

    elif act['action_type'] == "SWIPE":
        # <STATUS>xxx<ACTION>explain:xxx\taction:SWIPE\tpoint1:x,y\tpoint2:x,y\t<PAYLOAD>plan:xxx\tsummary:xxx\t  

        point1 = act['point1']
        zero_one_point1 = ((float(point1[0])) / 1000, (float(point1[1])) / 1000)
        real_coordinate1 = (int(zero_one_point1[0] * wm_size[0]), int(zero_one_point1[1] * wm_size[1]))

        point2 = act['point2']
        zero_one_point2 = ((float(point2[0])) / 1000, (float(point2[1])) / 1000)
        real_coordinate2 = (int(zero_one_point2[0] * wm_size[0]), int(zero_one_point2[1] * wm_size[1]))

        path = [(real_coordinate1[0], real_coordinate1[1]), (real_coordinate2[0], real_coordinate2[1])]
        normalized_path = [(zero_one_point1[0], zero_one_point1[1]), (zero_one_point2[0], zero_one_point2[1])]

        down_stream_action['args']['path'] = path
        down_stream_action['args']['normalized_path'] = normalized_path

    elif act['action_type'] == "LONGPRESS":
        # <STATUS>xxx<ACTION>explain:xxx\taction:LONGPRESS\tpoint:x,y\t<PAYLOAD>plan:xxx\tsummary:xxx\t

        point = act['point']
        zero_one_point = ((float(point[0])) / 1000, (float(point[1])) / 1000)
        real_coordinate = (int(zero_one_point[0] * wm_size[0]), int(zero_one_point[1] * wm_size[1]))

        # click point for several versions
        down_stream_action['args']['coordinate'] = real_coordinate + real_coordinate
        down_stream_action['args']['point'] = real_coordinate
        down_stream_action['args']['normalized_point'] = zero_one_point

    else:
        raise ValueError(f"Invalid action type: {act['action_type']}")
    
    return down_stream_action
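One detail in the CLICK branch is easy to misread: real_coordinate + real_coordinate is tuple concatenation, not arithmetic, so coordinate ends up as a degenerate 4-tuple (x, y, x, y). The sketch below reproduces just that part of the conversion; click_args is a hypothetical name, and the rest of the action object (status, plan, summary, explain) is left out.

```python
def click_args(point, wm_size):
    """Reproduce the coordinate fields model_act2front_act builds for CLICK."""
    zero_one = (float(point[0]) / 1000, float(point[1]) / 1000)
    real = (int(zero_one[0] * wm_size[0]), int(zero_one[1] * wm_size[1]))
    return {
        "coordinate": real + real,      # tuple concatenation: (x, y, x, y)
        "point": real,                  # pixel coordinates
        "normalized_point": zero_one,   # 0-1 range
    }


args = click_args((250, 750), (1080, 2400))
print(args)
```

The same action therefore carries the point in three forms (the in-code comment "click point for several versions" suggests this is for compatibility), and mobile_action_helper.act_on_device picks point or normalized_point depending on whether device_wm_size is passed.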

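model_act2front_act VS step_api_to_frontend_action

There is also a step_api_to_frontend_action function, so a comparison is useful.

model_act2front_act: handles action instructions output directly by the model
step_api_to_frontend_action: handles action requests coming from a specific API

Together the two functions form the action-processing chain, letting the system accept actions from several sources and convert them into one executable form.

step_api_to_frontend_action converts an action from the step API into a front-end executable action object. It does the following:

Action type mapping: maps step API action types to the standard front-end action types.
Parameter extraction and conversion: pulls the required parameters from the args field of the input action and converts them to the format the front-end action needs.
Coordinate normalization: converts normalized coordinates (0-1000 range) into fixed-format coordinates through _convert_normalized_point_to_fixed_point.
Default handling: supplies sensible defaults for missing parameters.

The mapping in step_api_to_frontend_action is as follows: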

    action_type_map = {
        # "CLICK": "Click",
        "Click": "CLICK",
        # "TYPE": "Type",
        "Type": "TYPE",
        # "COMPLETE": "Complete",
        "Complete": "COMPLETE",
        # "INFO": "Pop",
        "Pop": "INFO",
        # "WAIT": "Wait",
        "Wait": "WAIT",
        # "AWAKE": "Awake",
        "Awake": "AWAKE",
        # "ABORT": "Abort",
        "Abort": "ABORT",
        # "SWIPE": "Scroll",
        "Scroll": "SLIDE",
        # "LONGPRESS": "LongPress",
        "LongPress": "LONGPRESS",
    }

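Comparison with model_act2front_act:

Aspect	step_api_to_frontend_action	model_act2front_act
Input source	Actions in Step API format	Actions generated by the model
Input structure	Structured data with an args field	Action parameters at the top level
Coordinate handling	Uses _convert_normalized_point_to_fixed_point	Computes coordinates inline
Action mapping	Uniform, standardized mapping	Tailored to the model's output

Example of the difference at code level:

# step_api_to_frontend_action reads parameters from args
point = _convert_normalized_point_to_fixed_point(
    step_api_action["args"]["normalized_point"]
)

# model_act2front_act reads parameters directly from the action object
point = act['point']

In short, step_api_to_frontend_action handles the well-structured input from the Step API, while model_act2front_act performs a custom conversion for the looser structure of model output.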
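The two mapping tables are close to inverses of each other, with one exception that is easy to miss. The sketch below copies both tables from the code quoted in this article and composes them; whether the repository ever chains the two functions this way is not established here, so the composition is purely illustrative, and the names MODEL_TO_HELPER and HELPER_TO_FRONTEND are introduced only for this example.

```python
# action_type_map from model_act2front_act (model type -> helper type)
MODEL_TO_HELPER = {
    "CLICK": "Click", "TYPE": "Type", "COMPLETE": "Complete",
    "INFO": "Pop", "WAIT": "Wait", "AWAKE": "Awake",
    "ABORT": "Abort", "SWIPE": "Scroll", "LONGPRESS": "LongPress",
}

# action_type_map from step_api_to_frontend_action (helper type -> front-end type)
HELPER_TO_FRONTEND = {
    "Click": "CLICK", "Type": "TYPE", "Complete": "COMPLETE",
    "Pop": "INFO", "Wait": "WAIT", "Awake": "AWAKE",
    "Abort": "ABORT", "Scroll": "SLIDE", "LongPress": "LONGPRESS",
}

# Compose the two tables and keep only the types that do not round-trip.
changed = {
    model_type: HELPER_TO_FRONTEND[helper_type]
    for model_type, helper_type in MODEL_TO_HELPER.items()
    if HELPER_TO_FRONTEND[helper_type] != model_type
}
print(changed)
```

Every type maps back to itself except SWIPE, which becomes Scroll in the helper vocabulary and SLIDE in the front-end vocabulary. That matches pu_frontend_executor.act_on_device, where SLIDE takes two points and SCROLL takes a point plus a direction.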

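0x03 Cross-Platform Implementation and the Unified Adaptation Layer

Step-GUI's cross-platform implementation relies on Android Debug Bridge (ADB) as the single communication protocol, with several adaptation layers hiding the differences between host operating systems and devices:

ADB at the core: ADB is the standard protocol for talking to Android devices
Standardized interface: a uniform set of action types and parameter formats
Coordinate conversion: between normalized coordinates and actual pixel coordinates
Platform differences: host-OS-specific handling where needed
Device state management: unified management of device state
Service wrapper: a standardized interface exposed through the MCP protocol

With this architecture the system can run on different host operating systems while interacting with a range of Android devices.

3.1 Cross-Platform Mechanisms

Unified ADB interface layer

_get_adb_command function: a unified interface for building ADB commands
- builds the ADB command prefix from the device ID
- supports two modes: a specific device, or the default device when no ID is given
get_adb_command function: the public accessor for the ADB command
- a thin wrapper around _get_adb_command, which holds the device detection logic
- through it, the device is confirmed to be connected (the assert against list_devices())

The code is as follows: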

def _get_adb_command(device_id=None):
    """
    Get the ADB command for the specified device ID.
    """
    if device_id is None:
        adb_command = "adb "
    else:
        assert device_id in list_devices(), f"Device {device_id} not found in connected devices."
        adb_command = f"adb -s {device_id} "
    return adb_command

def get_adb_command(device_id=None):
    """
    Get the ADB command for the specified device ID.
    """
    adb_command = _get_adb_command(device_id)
    return adb_command

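Platform-specific handling

local_str_grep function: cross-platform string matching
- supports string handling on Windows
- absorbs differences in output format between platforms
detect_screen_on function: cross-platform screen state detection
- on Windows it uses the shell dumpsys display command
- on other platforms it uses the shell dumpsys display | grep mScreenState command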
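The reason for the split is that a Windows host has no grep to pipe into, so the filtering has to happen in Python after the full dumpsys output comes back. The sketch below shows that idea only; grep_lines is a hypothetical stand-in and not the repository's local_str_grep, and the sample text is made up for the example rather than captured from a device.

```python
def grep_lines(text: str, needle: str) -> list:
    """Return the stripped lines of `text` that contain `needle`."""
    return [line.strip() for line in text.splitlines() if needle in line]


# Made-up stand-in for the output of `adb shell dumpsys display`.
sample = "Display Power: state=ON\n  mScreenState=ON\n  mBrightness=0.5\n"

matches = grep_lines(sample, "mScreenState")
screen_on = any(m.endswith("=ON") for m in matches)
print(matches, screen_on)
```

Filtering on the host side works the same on every operating system, at the cost of transferring the whole dumpsys output over ADB.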

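Cross-platform device management

get_manufacturer function: cross-platform device manufacturer detection
- uses the getprop ro.product.manufacturer command
- enables special handling for specific device brands

3.2 Design of the Unified Adaptation Layer

Front-end executor adaptation layer

act_on_device function: the unified device operation adapter
- supports many operation types: CLICK, LONGPRESS, TYPE, AWAKE, SLIDE, and others
- validates the operation type and its parameters
- runs the matching ADB command for each operation type
uiTars_to_frontend_action function: the action format adapter
- converts UI-Tars format actions into the front-end execution format
- handles parameter conversion for WAIT and LONGPRESS

Device helper adaptation layer

model_act2front_act function: the adapter from model actions to front-end actions
- converts the model's output format into the front-end execution format
- supports many action types: CLICK, TYPE, AWAKE, INFO, WAIT, SWIPE, LONGPRESS, and others
- converts coordinates from the 0-1000 normalized system to actual screen coordinates
_convert_point_to_realworld_point function: the coordinate system adapter
- converts normalized coordinates to actual screen coordinates
- takes the screen size into account

Screen orientation adaptation

_detect_screen_orientation function: the screen orientation detection adapter
- uses PowerShell commands on Windows
- uses standard shell commands on Unix/Linux/Mac
- detects the screen orientation and returns the orientation value
Orientation handling inside act_on_device
- adjusts the coordinate system according to the detected orientation
- swaps the width and height of wm_size when orientation is 1 or 3

3.3 Cross-Platform Dependency Management

Dependency abstraction layer

subprocess module: cross-platform process management
- ADB commands are run through the standard subprocess module
- shell=True is used so commands run through the host shell on each platform
os module: cross-platform operating system interface
- os.name is used to detect the operating system type
- OS-specific logic is selected accordingly

File path handling

smart_open function: cross-platform file operations
- uses the megfile library for file operations
- supports the file path formats of different operating systems
smart_remove function: cross-platform file deletion
- safely deletes temporary screenshot files
- supports the file systems of different operating systems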
