Embody4D: A Generalist Data Engine for Embodied 4D World Modeling

Abstract

Embodied agents require robust and comprehensive 3D spatiotemporal representations to support spatial reasoning, manipulation understanding, and downstream decision making. However, existing robot data are typically captured from fixed or sparse viewpoints, providing only partial and view-dependent observations, which limits multi-view perception and generalization across viewpoints. Given the difficulty of collecting additional viewpoints in real-world settings, we propose Embody4D, a dedicated video-to-video world model for embodied scenarios to bridge this observation gap by transforming a monocular robot video into novel-view videos from flexible target camera viewpoints. First, to tackle training data scarcity, we introduce a 3D-aware compositional synthesis pipeline to curate a heterogeneous dataset compositing cross-embodiment robotic arms with diverse backgrounds, promoting broad generalization. Second, to enforce geometric stability, we devise a latent confidence-aware expert modulation strategy, which estimates the reliability of warped latent priors and adaptively routes regions to copy, repair, or inpaint experts for spatiotemporally consistent 4D generation. Finally, to enhance the fidelity of the manipulation, we incorporate an interaction-aware attention mechanism that explicitly attends to the robotic interaction regions. Extensive experiments show that Embody4D achieves state-of-the-art performance on visual evaluation benchmarks, while both simulated and real-world robotic experiments further demonstrate its effectiveness as a robust data engine for synthesizing high-fidelity, view-consistent videos that empower downstream robotic planning and learning.

Framework Overview

Figure: Overview of Embody4D. We build paired embodied 4D training data through compositional synthesis and adopt a “warp-then-inpaint” framework for novel-view video generation. Given a source video, Embody4D reconstructs dynamic geometry, warps RGB/masks to target views, and generates the final video with confidence-aware modulation and interaction-aware attention.
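
The warp and routing steps described above can be illustrated with a small sketch. It is not the Embody4D implementation: it assumes a pinhole camera, a known per-pixel depth map, and a rigid source-to-target transform, and it operates on pixels rather than latents. The names `Intrinsics`, `reproject`, `warp_frame`, and `route_expert`, as well as the thresholds `lo` and `hi`, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Intrinsics:
    """Pinhole intrinsics (hypothetical container for this sketch)."""
    fx: float
    fy: float
    cx: float
    cy: float


def reproject(u, v, depth, k, rotation, translation):
    """Map a source pixel with known depth into the target camera."""
    # Back-project the pixel to a 3D point in the source camera frame.
    p = ((u - k.cx) / k.fx * depth, (v - k.cy) / k.fy * depth, depth)
    # Apply the rigid source-to-target transform.
    q = [sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
         for i in range(3)]
    if q[2] <= 0:
        return None  # the point lies behind the target camera
    return k.fx * q[0] / q[2] + k.cx, k.fy * q[1] / q[2] + k.cy, q[2]


def warp_frame(frame, depth, k, rotation, translation):
    """Forward-warp a frame to the target view with a z-buffer.

    Returns the warped frame (None where nothing landed) and a validity mask.
    """
    h, w = len(frame), len(frame[0])
    warped = [[None] * w for _ in range(h)]
    zbuf = [[float("inf")] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            hit = reproject(u, v, depth[v][u], k, rotation, translation)
            if hit is None:
                continue
            ut, vt, z = round(hit[0]), round(hit[1]), hit[2]
            # Keep only the nearest surface that lands on each target pixel.
            if 0 <= ut < w and 0 <= vt < h and z < zbuf[vt][ut]:
                zbuf[vt][ut] = z
                warped[vt][ut] = frame[v][u]
    mask = [[px is not None for px in row] for row in warped]
    return warped, mask


def route_expert(confidence, lo=0.3, hi=0.8):
    """Pick an expert from a confidence score in [0, 1].

    The thresholds are assumptions; the page does not state how routing
    decisions are made from the estimated confidence.
    """
    if confidence >= hi:
        return "copy"      # warped content is trusted as-is
    if confidence < lo:
        return "inpaint"   # content is missing or unreliable
    return "repair"        # content is present but needs correction
```

Pixels left invalid in the mask correspond to disocclusions or out-of-frame regions, which is where the generative stage has to synthesize content rather than reuse the warped input.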

More Results Gallery

Input and Embody4D output videos for six camera motions: Pan Left, Tilt Down, Arc Right, Zoom In, Pan Right, Arc Left.

Application

For simulation experiments, we use Embody4D to generate view-randomized training videos, while evaluating policies from a fixed third-person camera. This setting isolates the effect of synthetic view augmentation on cross-view generalization and policy robustness.
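
As a rough illustration of view randomization, the sketch below samples target camera positions on a sphere around the workspace by jittering a base azimuth and elevation. The parameterization, the jitter ranges, the z-up convention, and the function names `sample_camera_position` and `sample_views` are all assumptions made for this example; the page does not specify how target viewpoints are sampled.

```python
import math
import random


def sample_camera_position(base_azimuth_deg, base_elevation_deg, radius, rng,
                           max_azimuth_jitter_deg=30.0,
                           max_elevation_jitter_deg=15.0):
    """Sample a camera position on a sphere centered on the workspace origin.

    Angles are jittered uniformly around the base viewpoint (z-up convention).
    """
    azimuth = math.radians(
        base_azimuth_deg
        + rng.uniform(-max_azimuth_jitter_deg, max_azimuth_jitter_deg))
    elevation = math.radians(
        base_elevation_deg
        + rng.uniform(-max_elevation_jitter_deg, max_elevation_jitter_deg))
    # Spherical-to-Cartesian conversion.
    x = radius * math.cos(elevation) * math.cos(azimuth)
    y = radius * math.cos(elevation) * math.sin(azimuth)
    z = radius * math.sin(elevation)
    return (x, y, z)


def sample_views(num_views, seed=0):
    """Draw a reproducible set of randomized target viewpoints."""
    rng = random.Random(seed)
    return [sample_camera_position(45.0, 30.0, 1.5, rng)
            for _ in range(num_views)]
```

Each sampled position would then serve as a target viewpoint for generating one augmented training video, while evaluation keeps the fixed third-person camera.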

For real-world experiments, we introduce an additional Embody4D-generated third-person view during training and inference. By expanding the observational coverage of robot-object interactions, this setting evaluates whether Embody4D can improve spatial reasoning, interaction understanding, and manipulation performance.


Real World Experiments

Video comparison: Input, ReCamMaster, TrajCrafter, Ours.

*ReCamMaster is designed to control camera motion through the camera's extrinsic parameters, which makes generation from a fixed viewpoint difficult. This leads to inconsistent parallax and lower overall quality.

Embody4D acts as a scalable data engine for downstream robotic manipulation and planning. With its synthesized novel views, the success rate of π0.5 improves from 32% to 74%, with particularly strong generalization to out-of-distribution (OOD) tasks.

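| Task | Single-view with wrist (baseline) | ReCamMaster (baseline) | TrajCrafter (baseline) | Embody4D (ours) |
|---|---|---|---|---|
| T1 (Grapes→Bowl) | 5/10 | 4/10 | 6/10 | 8/10 |
| T2 (Grapes→Plate) | 5/10 | 5/10 | 6/10 | 8/10 |
| T3 (Mangoes→Bowl) | 4/10 | 2/10 | 4/10 | 9/10 |
| T4 (Lemons→Bowl, unseen) | 1/10 | 2/10 | 0/10 | 6/10 |
| T5 (Bananas→Plate, unseen) | 1/10 | 2/10 | 7/10 | 6/10 |
| Success Rate (SR) | 32% | 30% | 46% | 74% |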
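
The overall success rates follow directly from the per-task counts, with 10 trials per task and 5 tasks per method. The short check below reproduces them; the dictionary `RESULTS` and the helper `success_rate` are names introduced here for illustration, and the seen/unseen split simply treats T1 to T3 as seen and T4 to T5 as unseen, as labeled in the table.

```python
# Successes out of 10 trials for tasks T1..T5, copied from the table.
RESULTS = {
    "Single-view with wrist": [5, 5, 4, 1, 1],
    "ReCamMaster": [4, 5, 2, 2, 2],
    "TrajCrafter": [6, 6, 4, 0, 7],
    "Embody4D": [8, 8, 9, 6, 6],
}


def success_rate(successes, trials_per_task=10):
    """Return the success rate in percent over the given tasks."""
    return 100 * sum(successes) / (trials_per_task * len(successes))


for method, counts in RESULTS.items():
    overall = success_rate(counts)
    seen = success_rate(counts[:3])     # T1..T3
    unseen = success_rate(counts[3:])   # T4..T5 (unseen objects)
    print(f"{method}: overall {overall:.0f}%, "
          f"seen {seen:.1f}%, unseen {unseen:.1f}%")
```

On the two unseen tasks, the single-view baseline succeeds in 2 of 20 trials and Embody4D in 12 of 20, although TrajCrafter scores higher than Embody4D on T5 alone (7/10 versus 6/10).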
Figure: Comparison of success counts with novel-view synthesis on seen tasks and unseen OOD tasks. Left (without wrist): the third-person view combined with novel views generated by each model. Right (with wrist): the third-person and wrist views combined with novel views generated by each model.

VLA Inference

Application of Embody4D during VLA inference. In addition to video-to-video generation, Embody4D supports image-to-image novel-view generation during VLA inference, providing complementary visual observations that can reveal occluded objects and assist the VLA in generating more reliable actions.