RAID: Retrieval-Augmented World Models for Robotics

[Figure: Retrieval-Augmented Inverse Dynamics (RAID) architecture]

6.5× MSE improvement over direct visual MLP at N=25 demonstrations (LIBERO-Spatial)

Highlights

Achieves 6.5× lower action-prediction MSE than a direct MLP baseline on LIBERO-Spatial using only 25 demonstrations.
A cross-attention head retrieves the top-3 most similar transitions from a 24k-entry demonstration memory bank; their pooled actions are blended with the output of a parametric trunk through a learned per-dimension gate.
A GRPO fine-tuning probe reaches a best mean reward of 1.226 and intermittent 25% task success, indicating that retrieval-augmented inverse dynamics can be improved through online interaction.

The action-inference gap

Vision-language-action models can issue motor commands but lack explicit causal grounding. World models can dream the next state but cannot directly infer the action that reaches it. RAID isolates this inverse-dynamics sub-problem: given the current latent state and the next latent state predicted by a frozen GR-1 world model, which 7-DOF action should the robot take to move from one to the other?

Architecture

GR-1, a language-conditioned vision Transformer pretrained on Ego4D-style video, acts as the world model. Its frozen visual encoder produces 384-dimensional class-token features; its one-step prediction head supplies the dreamed next state. The RAID decoder takes the (current, dreamed) feature pair and produces a continuous 7-DOF action through three components operating in parallel. A parametric MLP trunk gives a direct action estimate. A cross-attention prior retrieves the top-3 most similar transitions from a cosine-similarity memory bank and pools their actions. A per-dimension sigmoid gate learns how much to trust retrieval versus the trunk, independently for each action coordinate.
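
The three components above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the project's implementation: the feature size (384), action size (7), and top-3 retrieval come from the text, while the function names, the hidden width, the softmax temperature, the use of the concatenated feature pair as the retrieval query, and the use of raw cosine scores as attention logits (with no learned projections) are all hypothetical choices made for the example.

```python
import numpy as np

FEAT_DIM = 384  # GR-1 class-token feature size (from the text)
ACT_DIM = 7     # 7-DOF action (from the text)
TOP_K = 3       # top-3 retrieval (from the text)
HIDDEN = 64     # assumed trunk width, not from the source


def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def retrieval_prior(query, bank_keys, bank_actions, k=TOP_K, temperature=0.1):
    """Softmax-pool the actions of the k most cosine-similar bank entries.

    The temperature and the use of raw cosine scores as attention logits
    are assumptions; the real head may use learned projections.
    """
    sims = l2_normalize(bank_keys) @ l2_normalize(query)  # (N,) cosine scores
    top = np.argsort(-sims)[:k]                           # best match first
    logits = sims[top] / temperature
    weights = np.exp(logits - logits.max())               # numerically stable softmax
    weights /= weights.sum()
    return weights @ bank_actions[top], top, weights


def init_params(rng, in_dim=2 * FEAT_DIM):
    """Random illustrative parameters (hypothetical layout, untrained)."""
    s = 0.02
    return {
        "W1": rng.normal(scale=s, size=(HIDDEN, in_dim)), "b1": np.zeros(HIDDEN),
        "W2": rng.normal(scale=s, size=(ACT_DIM, HIDDEN)), "b2": np.zeros(ACT_DIM),
        "Wg": rng.normal(scale=s, size=(ACT_DIM, in_dim)), "bg": np.zeros(ACT_DIM),
    }


def raid_decode(z_now, z_next, params, bank_keys, bank_actions):
    """Blend a trunk estimate and a retrieved estimate per action dimension."""
    pair = np.concatenate([z_now, z_next])                 # (2 * FEAT_DIM,)
    hidden = np.tanh(params["W1"] @ pair + params["b1"])
    a_trunk = params["W2"] @ hidden + params["b2"]         # parametric estimate
    a_retr, _, _ = retrieval_prior(pair, bank_keys, bank_actions)
    # One sigmoid gate value per action coordinate.
    gate = 1.0 / (1.0 + np.exp(-(params["Wg"] @ pair + params["bg"])))
    action = gate * a_retr + (1.0 - gate) * a_trunk        # convex blend per dimension
    return action, gate, a_trunk, a_retr
```

Because the gate lies strictly between 0 and 1, each output coordinate is a convex combination of the trunk and retrieval estimates for that coordinate, which is what lets the decoder lean on retrieval for some action dimensions and on the trunk for others.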

Results

At 25 demonstrations on LIBERO-Spatial, RAID achieves a normalized validation MSE of 0.131 versus 0.852 for a direct MLP, a 6.5× reduction (0.852 / 0.131 ≈ 6.5). The retrieval advantage persists across all tested scales up to 200 demonstrations, though the gap narrows as the direct model accumulates more data. This positions RAID primarily as a sample-efficiency mechanism: retrieval helps most when parametric training signal is scarce. A 195-update GRPO fine-tuning run from the 200-demo checkpoint reached a best mean reward of 1.226 and intermittent 25% task success, indicating that the behavior-cloned RAID policy can be improved through closed-loop simulator interaction.
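
For readers unfamiliar with GRPO, its core step is to score each rollout relative to the other rollouts sampled for the same task, rather than against a learned value function. The sketch below shows only that group-relative advantage computation. The GRPO configuration used in the run above is not described here, so the function name, the group size, and the reward values are illustrative assumptions.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same task.

    Rollouts that beat the group mean get positive advantages and the rest
    get negative ones. eps avoids division by zero when all rewards are equal.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Made-up rewards for a hypothetical group of four rollouts.
advantages = group_relative_advantages([0.2, 1.1, 0.4, 0.9])
print(advantages)
```

If every rollout in a group earns the same reward, all advantages are zero and that group contributes no gradient signal, which is one reason sparse task success tends to make this kind of fine-tuning noisy.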

Why GR-1 matters

Earlier experiments across 53 conditions and 171 runs using DINOv2 and SigLIP features showed no consistent retrieval benefit: under those encoders, RAID matched or trailed the direct MLP on both RoboMimic and pooled LIBERO. Switching to GR-1 reversed that result. GR-1's pretraining objective, predicting future image features, is what produces the temporal-delta signal the cross-attention head needs. Without that signal, retrieval has no traction. With it, the attention places most of its weight on a single coherent neighbor and supplies a cleaner action template than the parametric trunk can learn from only 25 trajectories.
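
The claim that attention concentrates on a single neighbor can be checked with a simple diagnostic on the top-k attention weights. The sketch below is one possible way to do that, not a measurement taken from these experiments: the function name and the example weight vectors are hypothetical.

```python
import numpy as np


def attention_concentration(weights):
    """Summarize how peaked a vector of attention weights is.

    Returns the largest weight and the entropy normalized to [0, 1], where
    0 means all mass on one neighbor and 1 means a uniform spread.
    """
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()                       # renormalize defensively
    nonzero = w[w > 0]                    # treat 0 * log(0) as 0
    entropy = -(nonzero * np.log(nonzero)).sum()
    max_entropy = np.log(len(w)) if len(w) > 1 else 1.0
    return {
        "max_weight": float(w.max()),
        "normalized_entropy": float(entropy / max_entropy),
    }


# Illustrative weight vectors over three retrieved neighbors (made up).
peaked = attention_concentration([0.9, 0.05, 0.05])
diffuse = attention_concentration([1 / 3, 1 / 3, 1 / 3])
print(peaked, diffuse)
```

Under this diagnostic, an encoder that gives retrieval "no traction" would be expected to show weights closer to the diffuse case, while the behavior described for GR-1 corresponds to the peaked case.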