Embody4D: A Generalist 4D World Model for Embodied AI


Figure: Embody4D teaser.

Abstract

World models have made significant progress in modeling dynamic environments; however, most embodied world models remain restricted to 2D representations and lack the comprehensive multi-view information essential for embodied spatial reasoning. Bridging this gap is non-trivial, primarily because of the severe scarcity of paired multi-view data, the difficulty of maintaining spatiotemporal consistency in generated 3D geometries, and the tendency to hallucinate manipulation details. To address these challenges, we propose Embody4D, a dedicated video-to-video world model for embodied scenarios that can synthesize arbitrary novel views from a monocular video. First, to tackle data scarcity, we introduce a 3D-aware compositional synthesis pipeline that curates a heterogeneous dataset by compositing cross-embodiment robotic arms into diverse backgrounds, enabling broad generalization. Second, to enforce geometric stability, we devise an adaptive noise injection strategy that leverages confidence disparities across image regions to selectively regularize the diffusion process and ensure strict spatiotemporal consistency. Finally, to preserve manipulation fidelity, we incorporate an interaction-aware attention mechanism that explicitly attends to the robotic interaction regions. Extensive experiments demonstrate that Embody4D achieves state-of-the-art performance, serving as a robust world model that synthesizes high-fidelity, view-consistent videos to empower downstream robotic planning and learning.
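The abstract does not spell out how the interaction-aware attention is formulated. One common way to make attention "explicitly attend" to chosen regions is to add a bias to the attention logits for keys inside an interaction mask; the NumPy sketch below illustrates only that general idea. The function name, the additive-bias form, and the bias_scale parameter are our illustrative assumptions, not Embody4D's implementation.

```python
import numpy as np

def interaction_aware_attention(q, k, v, interaction_mask, bias_scale=2.0):
    """Scaled dot-product attention with an additive logit bias toward keys
    that lie inside the robot-object interaction region.

    q, k, v:           (num_tokens, dim) token features
    interaction_mask:  (num_tokens,) binary mask, 1 for interaction-region tokens
    bias_scale:        strength of the bias toward interaction tokens (assumed)
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                          # standard attention logits
    logits = logits + bias_scale * interaction_mask        # boost attention to interaction-region keys
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerically stable softmax
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

In Embody4D's setting, such a mask would plausibly cover the end effector and the manipulated object, so the generator devotes more capacity to those regions during denoising.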

Framework Overview

Figure: Overview of Embody4D. We construct paired embodied training videos in 3D via compositional synthesis and process them with a "warp-then-inpaint" architecture. The source video is reconstructed into a point cloud and projected to the target view to produce warped RGB plus occupancy masks; these are concatenated and passed to a confidence module that adaptively injects different noise levels. Finally, a backbone model with an interaction-aware block outputs the target-view video with highly consistent manipulation details.
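As a concrete, simplified illustration of the inputs described above, the NumPy sketch below back-projects a source frame into a point cloud using its depth, reprojects it into the target camera to obtain a warped RGB image and an occupancy mask, and then injects stronger noise where a confidence map is low. The function names, the nearest-pixel splatting, and the variance-preserving noise schedule are simplifying assumptions for illustration, not the paper's released implementation.

```python
import numpy as np

def warp_to_target_view(rgb, depth, K, T_src_to_tgt):
    """Back-project source pixels to 3D with depth, reproject into the target
    camera, and splat to produce a warped RGB image plus an occupancy mask."""
    H, W = depth.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = (np.linalg.inv(K) @ pix.T).T                       # camera-space rays
    pts_src = rays * depth.reshape(-1, 1)                     # 3D points in the source camera
    pts_h = np.concatenate([pts_src, np.ones((H * W, 1))], axis=1)
    pts_tgt = (T_src_to_tgt @ pts_h.T).T[:, :3]               # points in the target camera
    proj = (K @ pts_tgt.T).T
    uv = proj[:, :2] / np.clip(proj[:, 2:3], 1e-6, None)      # perspective divide

    warped = np.zeros((H, W, 3), dtype=rgb.dtype)
    occupancy = np.zeros((H, W), dtype=bool)
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H) & (pts_tgt[:, 2] > 0)
    warped[v[valid], u[valid]] = rgb.reshape(-1, 3)[valid]    # nearest-pixel splatting (no z-buffer)
    occupancy[v[valid], u[valid]] = True
    return warped, occupancy

def adaptive_noise_injection(latents, confidence, sigma_low=0.1, sigma_high=1.0):
    """Inject stronger noise in low-confidence regions (e.g. disocclusions)
    and weaker noise where the warped content is reliable."""
    sigma = sigma_high - confidence * (sigma_high - sigma_low)   # per-pixel noise level
    eps = np.random.randn(*latents.shape)
    return np.sqrt(1.0 - sigma**2)[..., None] * latents + sigma[..., None] * eps
```

A real implementation would additionally handle z-buffering for overlapping points and would operate in the diffusion model's latent space rather than on raw pixels.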

More Results Gallery

Gallery: input videos and the corresponding Embody4D outputs under six camera trajectories: Pan Left, Tilt Down, Arc Right, Zoom In, Pan Right, and Arc Left.

Application

Embody4D acts as a scalable data engine for downstream robotic manipulation and planning. Leveraging its novel-view synthesis capability, Embody4D improves the success rate of the π0.5 policy dramatically, from 32% to 74%, with particularly strong generalization to out-of-distribution (OOD) tasks.

Baselines: Single-view with wrist, ReCamMaster, TrajCrafter. Ours: Embody4D.

Task                        | Single-view with wrist | ReCamMaster | TrajCrafter | Embody4D (Ours)
T1 (Grapes→Bowl)            | 5/10                   | 4/10        | 6/10        | 8/10
T2 (Grapes→Plate)           | 5/10                   | 5/10        | 6/10        | 8/10
T3 (Mangoes→Bowl)           | 4/10                   | 2/10        | 4/10        | 9/10
T4 (Lemons→Bowl, unseen)    | 1/10                   | 2/10        | 0/10        | 6/10
T5 (Bananas→Plate, unseen)  | 1/10                   | 2/10        | 7/10        | 6/10
Success Rate (SR)           | 32%                    | 30%         | 46%         | 74%
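The overall success rates are consistent with totaling successes over the 50 trials (5 tasks × 10 trials each): for example, Embody4D's SR = (8 + 8 + 9 + 6 + 6) / 50 = 37/50 = 74%, while the single-view-with-wrist baseline's SR = (5 + 5 + 4 + 1 + 1) / 50 = 16/50 = 32%.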
Figure: Comparison of success counts on seen tasks and OOD unseen tasks. Left (without wrist view): results using the third-person view combined with novel views generated by other models. Right (with wrist view): results using the third-person view and the wrist view, combined with novel views generated by other models.

Qualitative comparison: Input, ReCamMaster, TrajCrafter, and Ours.

*ReCamMaster is designed to control camera extrinsics (lens movement) rather than to generate from a fixed viewpoint, which makes fixed-angle generation difficult and leads to inconsistent parallax and degraded overall results.

Additional qualitative comparison: Input, ReCamMaster, TrajCrafter, and Ours.