Embody4D: A Generalist 4D World Model for Embodied AI


Figure: Embody4D teaser.

Abstract

World models have made significant progress in modeling dynamic environments; however, most embodied world models remain restricted to 2D representations and lack the comprehensive multi-view information essential for embodied spatial reasoning. Bridging this gap is non-trivial, primarily because of the severe scarcity of paired multi-view data, the difficulty of maintaining spatiotemporal consistency in generated 3D geometries, and the tendency to hallucinate manipulation details. To address these challenges, we propose Embody4D, a dedicated video-to-video world model for embodied scenarios that can synthesize arbitrary novel views from a monocular video. First, to tackle data scarcity, we introduce a 3D-aware compositional synthesis pipeline that curates a heterogeneous dataset by compositing cross-embodiment robotic arms into diverse backgrounds, enabling broad generalization. Second, to enforce geometric stability, we devise an adaptive noise injection strategy that leverages confidence disparities across image regions to selectively regularize the diffusion process and ensure strict spatiotemporal consistency. Finally, to preserve manipulation fidelity, we incorporate an interaction-aware attention mechanism that explicitly attends to the robotic interaction regions. Extensive experiments demonstrate that Embody4D achieves state-of-the-art performance, serving as a robust world model that synthesizes high-fidelity, view-consistent videos to empower downstream robotic planning and learning.
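The abstract does not spell out how the interaction-aware attention is formulated. One common way to make attention "explicitly attend" to chosen regions is to add a bias to the attention logits for keys inside an interaction mask; the NumPy sketch below illustrates only that general idea. The function name, the additive-bias form, and the bias_scale parameter are our illustrative assumptions, not Embody4D's implementation.

```python
import numpy as np

def interaction_aware_attention(q, k, v, interaction_mask, bias_scale=2.0):
    """Scaled dot-product attention with an additive logit bias toward keys
    that lie inside the robot-object interaction region.

    q, k, v:           (num_tokens, dim) token features
    interaction_mask:  (num_tokens,) binary mask, 1 for interaction-region tokens
    bias_scale:        strength of the bias toward interaction tokens (assumed)
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                          # standard attention logits
    logits = logits + bias_scale * interaction_mask        # boost attention to interaction-region keys
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerically stable softmax
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

In Embody4D's setting, such a mask would plausibly cover the end effector and the manipulated object, so the generator devotes more capacity to those regions during denoising.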

Framework Overview

Figure: Overview of Embody4D. We construct paired embodied training videos in 3D via compositional synthesis and process them with a "warp-then-inpaint" architecture. The source video is reconstructed into a point cloud and projected to the target view to produce warped RGB plus occupancy masks; these are concatenated and passed to a confidence module that adaptively injects different noise levels. Finally, a backbone model with an interaction-aware block outputs the target-view video with highly consistent manipulation details.
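As a concrete, simplified illustration of the inputs described above, the NumPy sketch below back-projects a source frame into a point cloud using its depth, reprojects it into the target camera to obtain a warped RGB image and an occupancy mask, and then injects stronger noise where a confidence map is low. The function names, the nearest-pixel splatting, and the variance-preserving noise schedule are simplifying assumptions for illustration, not the paper's released implementation.

```python
import numpy as np

def warp_to_target_view(rgb, depth, K, T_src_to_tgt):
    """Back-project source pixels to 3D with depth, reproject into the target
    camera, and splat to produce a warped RGB image plus an occupancy mask."""
    H, W = depth.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = (np.linalg.inv(K) @ pix.T).T                       # camera-space rays
    pts_src = rays * depth.reshape(-1, 1)                     # 3D points in the source camera
    pts_h = np.concatenate([pts_src, np.ones((H * W, 1))], axis=1)
    pts_tgt = (T_src_to_tgt @ pts_h.T).T[:, :3]               # points in the target camera
    proj = (K @ pts_tgt.T).T
    uv = proj[:, :2] / np.clip(proj[:, 2:3], 1e-6, None)      # perspective divide

    warped = np.zeros((H, W, 3), dtype=rgb.dtype)
    occupancy = np.zeros((H, W), dtype=bool)
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H) & (pts_tgt[:, 2] > 0)
    warped[v[valid], u[valid]] = rgb.reshape(-1, 3)[valid]    # nearest-pixel splatting (no z-buffer)
    occupancy[v[valid], u[valid]] = True
    return warped, occupancy

def adaptive_noise_injection(latents, confidence, sigma_low=0.1, sigma_high=1.0):
    """Inject stronger noise in low-confidence regions (e.g. disocclusions)
    and weaker noise where the warped content is reliable."""
    sigma = sigma_high - confidence * (sigma_high - sigma_low)   # per-pixel noise level
    eps = np.random.randn(*latents.shape)
    return np.sqrt(1.0 - sigma**2)[..., None] * latents + sigma[..., None] * eps
```

A real implementation would additionally handle z-buffering for overlapping points and would operate in the diffusion model's latent space rather than on raw pixels.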

More Results Gallery

Gallery: input videos and the corresponding Embody4D outputs under six camera trajectories: Pan Left, Tilt Down, Arc Right, Zoom In, Pan Right, and Arc Left.

Application

Embody4D acts as a scalable data engine for downstream robotic manipulation and planning. Leveraging its novel-view synthesis capability, Embody4D improves the success rate of the π0.5 policy dramatically, from 32% to 74%, with particularly strong generalization to out-of-distribution (OOD) tasks.

Baselines: Single-view with wrist, ReCamMaster, TrajCrafter. Ours: Embody4D.

Task                        | Single-view with wrist | ReCamMaster | TrajCrafter | Embody4D (Ours)
T1 (Grapes→Bowl)            | 5/10                   | 4/10        | 6/10        | 8/10
T2 (Grapes→Plate)           | 5/10                   | 5/10        | 6/10        | 8/10
T3 (Mangoes→Bowl)           | 4/10                   | 2/10        | 4/10        | 9/10
T4 (Lemons→Bowl, unseen)    | 1/10                   | 2/10        | 0/10        | 6/10
T5 (Bananas→Plate, unseen)  | 1/10                   | 2/10        | 7/10        | 6/10
Success Rate (SR)           | 32%                    | 30%         | 46%         | 74%
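The overall success rates are consistent with totaling successes over the 50 trials (5 tasks × 10 trials each): for example, Embody4D's SR = (8 + 8 + 9 + 6 + 6) / 50 = 37/50 = 74%, while the single-view-with-wrist baseline's SR = (5 + 5 + 4 + 1 + 1) / 50 = 16/50 = 32%.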
Figure: Comparison of success counts on seen tasks and OOD unseen tasks. Left (without wrist view): results using the third-person view combined with novel views generated by other models. Right (with wrist view): results using the third-person view and the wrist view, combined with novel views generated by other models.

Qualitative comparison: Input, ReCamMaster, TrajCrafter, and Ours.

*ReCamMaster is designed to control camera extrinsics (lens movement) rather than to generate from a fixed viewpoint, which makes fixed-angle generation difficult and leads to inconsistent parallax and degraded overall results.

Additional qualitative comparison: Input, ReCamMaster, TrajCrafter, and Ours.