[Paper Translation] PIVOT-R: Primitive-Driven Waypoint-Aware World Model for Robotic Manipulation

Original paper: arxiv.org/pdf/2410.10…

PIVOT-R: Primitive-Driven Waypoint-Aware World Model for Robotic Manipulation

Kaidong Zhang¹* Pengzhen Ren²* Bingqian Lin¹ Junfan Lin² Shikui Ma³ Hang Xu⁴ Xiaodan Liang¹,²†

¹ Sun Yat-sen University  ² Peng Cheng Laboratory  ³ Dataa Robotics  ⁴ Huawei Noah's Ark Lab

abliao.github.io/PIVOT-R

Abstract

Language-guided robotic manipulation is a challenging task that requires an embodied agent to follow abstract user instructions to accomplish various complex manipulation tasks. Previous work generally maps instructions and visual perceptions directly to low-level executable actions, neglecting the modeling of critical waypoints (e.g., key states of "close to/grab/move up" in action trajectories) in manipulation tasks. By trivially fitting the data without revealing the relation between instructions and low-level executable actions, these models are prone to memorizing surface patterns of the data instead of acquiring transferable knowledge, and are thus fragile to dynamic environment changes. To address this issue, we propose a PrImitive-driVen waypOinT-aware world model for Robotic manipulation (PIVOT-R) that focuses solely on the prediction of task-relevant waypoints. Specifically, PIVOT-R consists of a Waypoint-aware World Model (WAWM) and a lightweight action prediction module. The former performs primitive action parsing and primitive-driven waypoint prediction, while the latter focuses on decoding low-level actions. Additionally, we design an asynchronous hierarchical executor (AHE) for PIVOT-R, which runs different modules of the model at different execution frequencies, thereby reducing computational redundancy and improving execution efficiency. Our PIVOT-R outperforms state-of-the-art (SoTA) open-source models on the SeaWave benchmark, achieving an average relative improvement of 19.45% across four levels of instruction tasks. Moreover, compared to the synchronously executed PIVOT-R, the execution efficiency of PIVOT-R with AHE is increased 28-fold, with only a 2.9% drop in performance. These results provide compelling evidence that PIVOT-R can significantly improve both the performance and efficiency of robotic manipulation.

1 Introduction

Language-guided robotic manipulation [22, 33, 61, 50, 12, 38] is a key research problem of Embodied AI. This field aims to enable agents to follow abstract language instructions to perform various manipulation tasks. To complete these tasks, the agent needs to transform high-level language instructions into low-level actions and capture environmental dynamics for precise manipulation decision-making.

Having witnessed the immense success of vision-language foundation models (VLMs) [2, 40, 37], many works have explored the utilization of VLMs for facilitating language-guided robotic manipulation in

*Equal contribution

† Corresponding authors

Figure 1: Comparison of PIVOT-R and other models. (a) Sequentially executed robot manipulation models. They execute every module in the model at each timestep to perform manipulation reasoning (e.g., RT-2 [64], RT-X [49], RT-H [5], VILA [20], Octo [36], etc.) or world modeling (e.g., Surfer [42], Daydreamer [56], 3D-VLA [60], etc.). This easily leads to model redundancy and weak prediction of key manipulation nodes. (b) PIVOT-R is a primitive-driven waypoint-aware world model with an asynchronous hierarchical executor. It focuses only on the prediction of waypoints related to the manipulation task, making it easier to predict key nodes in the manipulation task than other methods. In addition, PIVOT-R sets different execution frequencies for different modules, achieving higher execution efficiency and lower redundancy.

recent years [48, 64, 49, 27, 21, 20]. For example, RT-2 [64], RT-X [49], RT-H [5], and RoboFlamingo [27] employ a VLM as the backbone and introduce large-scale vision-language data for manipulation training, which significantly improves generalization. VILA [20] resorts to GPT-4 [37] to generate sequential actionable steps to improve long-horizon planning. In addition, 3D-VLA [60] and Daydreamer [56] have tried to introduce world models into robot manipulation to free the models from a large amount of trial and error and improve learning efficiency. Despite extensive efforts made by researchers, two key challenges remain: (i) weak key-waypoint prediction and world modeling capabilities; (ii) high computational redundancy and inefficient execution.

For the first challenge, consider a task such as "moving a cup": humans intuitively apply their internal world models to seamlessly analyze and predict the task-related key action flow, "get close to the cup → grab the cup → move the cup → put down the cup". Similar to the approaches used in navigation tasks, we define these key action frames as waypoints for manipulation tasks. The right part of Figure 1 (b) shows a robot manipulation task with three waypoints. Enabling robots to acquire this ability is critical. To this end, RT-H [5] uses a VLM to parse key action nodes into natural language and uses language to guide robot manipulation. However, it does not perform world modeling on visual scene information. Therefore, some works [56, 34, 60, 42] have attempted to summarize general dynamic knowledge about the environment and predict future outcomes by introducing world models, in order to generate more executable long-term plans and accurate manipulation decisions. However, they tend to model the world at every timestep of robot manipulation, neglecting the waypoints that have a more direct impact on manipulation success. To make matters worse, under a prolonged lack of key waypoint guidance, the randomness of each action prediction may be continuously amplified because low-level actions are similar under local spatiotemporal conditions.

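To make the waypoint notion concrete, the sketch below illustrates one plausible way an instruction could be decomposed into a sequence of primitive actions whose key frames serve as waypoints. The primitive vocabulary, data structure, and parser are illustrative assumptions; the paper only names example primitives such as "close to", "grab", and "move up" and states that ten extensible primitives are supported.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Waypoint:
    primitive: str     # primitive action active at this key state (assumed labels)
    target: str        # object that the primitive acts on
    frame_index: int   # index of the key observation frame, filled in at runtime

def parse_instruction(instruction: str) -> List[Waypoint]:
    """Toy stand-in for VLM-based primitive action parsing.

    A real system would prompt a pre-trained VLM; here the running example
    from the paper ("moving a cup") is hard-coded to show the expected output:
    an ordered list of primitive-labeled waypoints.
    """
    if "cup" in instruction.lower():
        steps = ["close to", "grab", "move", "put down"]
        return [Waypoint(primitive=p, target="cup", frame_index=-1) for p in steps]
    raise NotImplementedError("Only the 'move a cup' example is sketched here.")

print(parse_instruction("Move the cup to the right side of the table."))
```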

For the second challenge, as shown in Figure 1 (a), previous methods [64, 5, 49, 42, 56] tend to treat the different modules in the model equally and execute all modules sequentially, which is unnecessary and inevitably introduces computational redundancy at a great cost in resources. To this end, MResT [43] proposes a multi-resolution transformer that uses different execution frequencies for different spatial and temporal resolutions to control coarse, precise, and dynamic tasks in real time, thereby effectively reducing unnecessary computational redundancy and improving the real-time performance of robot manipulation. However, it lacks a focus on world modeling capabilities and cannot predict the critical nodes of manipulation tasks as accurately as humans.

Based on the above observations, as shown in Figure 1 (b), in this paper we propose PIVOT-R, a primitive-driven waypoint-aware world model with an asynchronous hierarchical executor for robot manipulation. PIVOT-R mainly consists of a waypoint-aware world model (WAWM) and an action prediction module. Specifically, in WAWM we first use a pre-trained VLM for primitive action parsing and use the result as a primitive prompt for the scene prediction module, helping the model perform waypoint-level scene modeling for robot manipulation. Then, we use waypoints as cues for low-level action prediction. Thanks to WAWM's modeling of key waypoint information, PIVOT-R achieves an average relative performance improvement of 19.45% over the state-of-the-art (SoTA) open-source manipulation model on SeaWave's [42] four levels of instruction tasks. In addition, to reduce model redundancy, we design an asynchronous hierarchical executor (AHE) for PIVOT-R, which sets a slow-to-fast execution frequency schedule for the three modules of primitive action parsing, scene prediction, and action prediction, helping PIVOT-R improve execution efficiency. With the help of AHE, the execution efficiency of PIVOT-R does not drop significantly even though a VLM is integrated: compared with a synchronously executed PIVOT-R (all modules using the same execution frequency), the execution efficiency of PIVOT-R with AHE is increased 28-fold, while the performance only drops by 2.9%. Our contributions can be summarized as follows:

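The asynchronous scheduling idea behind AHE can be sketched as a simple control loop in which each module runs at its own rate and slower modules publish cached outputs for faster ones to consume. The module interfaces and the concrete periods below are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of an asynchronous hierarchical executor (AHE), assuming the
# control loop ticks at the action-prediction rate and slower modules are only
# refreshed every `period` ticks, with their latest outputs cached in between.
class AsyncHierarchicalExecutor:
    def __init__(self, parse_primitive, predict_scene, predict_action,
                 parse_period=64, scene_period=8):
        self.parse_primitive = parse_primitive   # slow: VLM primitive action parsing
        self.predict_scene = predict_scene       # medium: waypoint / scene prediction
        self.predict_action = predict_action     # fast: low-level action decoding
        self.parse_period = parse_period
        self.scene_period = scene_period
        self.primitive = None                    # cached primitive prompt
        self.waypoint = None                     # cached waypoint prediction

    def step(self, t, instruction, obs, state):
        # Refresh the slow modules only on their own schedule.
        if t % self.parse_period == 0:
            self.primitive = self.parse_primitive(instruction, obs)
        if t % self.scene_period == 0:
            self.waypoint = self.predict_scene(self.primitive, obs)
        # The lightweight action head runs at every tick, conditioned on the
        # cached primitive and waypoint cues.
        return self.predict_action(self.waypoint, obs, state)
```

Only the lightweight action head is evaluated at every control step, which is where the efficiency gain over synchronous execution comes from.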

  • We show that modeling of waypoints prevents critical robot dynamics from being submerged in trivial robot manipulations, allowing models to benefit from enhanced dynamic environment modeling.

  • The proposed AHE significantly improves the execution efficiency of the model by setting different frequencies for different modules.

  • Extensive experimental results demonstrate that our PIVOT-R achieves significantly better performance than SoTA baselines, such as Gato [41] and RT-1 [7], in all settings.

2 Related Work

Language-Guided Robotic Manipulation. Robotic manipulation is a long-standing research field in Embodied Artificial Intelligence. Benefiting from its flexibility and practicality in facilitating human-robot interaction, language-guided robotic manipulation has gained extensive research attention in recent years. Many benchmarks have been built to encourage research on language-guided robotic manipulation, such as RLBench [22], CALVIN [33], and VLMBench [61]. Early methods improve manipulation performance by introducing powerful representations [9, 59], elaborate network architectures [15, 13], or effective training mechanisms [32, 44]. With the rapid development of VLMs [2, 40, 37], recent works have attempted to introduce VLMs to improve manipulation accuracy and generalization to unseen scenarios/objects in a trainable [48, 64, 49, 27, 26] or offline [21, 20, 35] manner. However, most previous approaches tend to learn a direct mapping from multi-modal inputs to low-level actions, ignoring the explicit modeling of environmental dynamics. As a result, they may fail to make executable actions or plans and do not generalize well to complex environments. We have also noticed previous work on waypoints and primitive actions, but such work often uses a limited number of actions. For instance, CLIPort [45], Transporter [57], GMRT [47], and VPG [58] are restricted to simple actions like pick/place/push, limiting their use in complex tasks. Some language-guided models [10, 16, 30] define a few primitive actions (≤ 5) and add prompts to aid decision-making. PerAct [46] and RVT [14] use robot states as waypoints to skip trivial action predictions. SUSIE [6] and UniPi [11] predict sub-goals through video predictors, but there is an inconsistency between the predicted video and actions. In this work, we propose a waypoint-aware world model to track the key dynamics that happen during manipulation. Our model performs asynchronous world modeling and action prediction, which significantly promotes both manipulation accuracy and efficiency. PIVOT-R supports 10 primitive actions and is extensible, making it effective in complex tasks.

World Models. World models aim to generate a predictive model of an agent's surroundings, accounting for uncertainties and dynamic changes. They have been widely studied in video generation [4, 53, 8], navigation [51, 24, 39], and autonomous driving [52, 62, 54]. For example, Genie [8] introduces a spatiotemporal video tokenizer and a dynamics model to autoregressively predict the next video frame. DriveDreamer [52] builds a world model derived from real-world driving scenarios to enable reasonable driving policy generation. Given their great potential for acquiring insights into real-world motion and physics rules, some works have also introduced world models for robotic manipulation tasks [56, 34, 60]. Daydreamer [56] applies the Dreamer [17] algorithm to train real-world robots by online reinforcement learning. SWIM [34] collects human videos to train a world model and fine-tunes it on a small amount of robot data. Nevertheless, these methods usually perform world modeling and decision-making alternately, which makes training difficult and is also inefficient. In contrast, our proposed WAWM only predicts task-relevant waypoints, empowering realistic and efficient world modeling to improve manipulation performance.

Figure 2: PIVOT-R overview. It mainly consists of a waypoint-aware world model (WAWM) and an action prediction module, where the two modules cooperate through an asynchronous hierarchical executor (AHE). In WAWM, we first use a pre-trained VLM to perform low-frequency primitive action parsing of user instructions and provide waypoint indications for the scene prediction module. Then, the scene prediction module learns to model world knowledge based on waypoints and manipulation trajectories. Finally, a lightweight action prediction module performs high-frequency action prediction and execution.

Vision-Language Foundation Models. Vision-language foundation models (VLMs) [2, 40, 37] have witnessed striking advancements in recent years. Their ability to understand multimodal inputs and their rich storage of real-world knowledge make them highly adaptable for a wide range of downstream applications such as image captioning [25, 63] and visual question answering [29, 25]. Many works have also explored the utilization of VLMs in robotic manipulation tasks recently [48, 64, 49, 27, 21, 20]. MOO [48] leverages a pre-trained vision-language model to improve zero-shot open-world object manipulation. RoboFlamingo [27] builds a vision-language manipulation framework upon the open-source VLM OpenFlamingo [2]. VILA [20] and CoPa [21] unleash the commonsense knowledge of GPT-4 to generate accurate and reasonable manipulation action decisions. In this work, we develop an elegant combination of VLMs and world models to tackle the challenging language-guided robotic manipulation task, where we query the VLM, the world model, and the action execution model in an asynchronous way.

3 Architecture

Our goal is to build a robot manipulation model that can respond accurately and promptly to user instructions in a variety of complex and changing environments in a zero-shot manner. To this end, as shown in Figure 2, we introduce a primitive-driven waypoint-aware world model for robot manipulation. Next, we describe each module of PIVOT-R in detail.

3.1 Problem Formulation

As shown in Figure 2 (a), we formulate the proposed PIVOT-R as learning a trainable robot manipulation model $\pi$, which maps the user's language instruction $l$, a series of observation images $O_{t-h:t}$, and robot states $S_{t-h:t}$ from time step $t-h$ to the current time step $t$ to an action $A_t$. Here $h$ denotes the length of the historical frames and is set to 3. In addition, we introduce a scene prediction module for WAWM to help the model capture world knowledge. The overall formulation of PIVOT-R is as follows:

$$\left( M_t^{\prime}, A_t^{\prime} \right) = \pi\left( l, O_{t-h:t}, S_{t-h:t} \right),$$

where $M_t^{\prime}$ and $A_t^{\prime}$ are the waypoints and actions of the robot manipulation predicted by the model at timestep $t$, respectively. In particular, we use the pre-trained VLM to parse the primitive actions $P$.

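Read as an interface, the formulation above amounts to a policy that consumes the instruction together with a short history of observations and robot states and emits the predicted waypoint and action. The sketch below is a minimal, hypothetical signature; the tensor shapes, the 7-dimensional action, and the stub internals are assumptions rather than the paper's implementation.

```python
from typing import Tuple

import torch
from torch import nn

HIST = 3  # length h of the observation/state history, as stated in the text

class PivotRPolicy(nn.Module):
    """Hypothetical interface for pi(l, O_{t-h:t}, S_{t-h:t}) -> (M'_t, A'_t)."""

    def forward(
        self,
        instruction: str,            # user instruction l
        observations: torch.Tensor,  # O_{t-h:t}, shape (HIST + 1, C, H, W)
        states: torch.Tensor,        # S_{t-h:t}, shape (HIST + 1, state_dim)
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # A real implementation would run primitive parsing, waypoint/scene
        # prediction, and action decoding here (Figure 2); this stub only
        # documents the input/output contract.
        waypoint_pred = torch.zeros(1)  # M'_t: predicted waypoint representation
        action_pred = torch.zeros(7)    # A'_t: predicted low-level action (assumed 7-DoF)
        return waypoint_pred, action_pred
```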
