Abstract

Large-scale video diffusion models often fail to preserve 3D structure over time, causing geometric drift and implausible motion under viewpoint changes. VideoWeave is a latent-space post-training framework that uses implicit geometry-model features to constrain the generative distribution without relying on explicit reconstructed depth maps, point clouds, or 3D structures. It adapts geometry features into latents, jointly models them with video latents in a shared denoising space, and distills the resulting joint score field into an efficient generator. Together with GeoVid-80K, VideoWeave improves geometric coherence while preserving strong visual quality for text-to-video and image-to-video generation.

Overview

[Figure: VideoWeave overview. Joint geometry-video modeling for geometrically consistent video generation.]
We present VideoWeave, a latent-space post-training framework that turns geometry from an explicit, reconstruction-derived guidance signal into a temporary latent companion of the video during training. The framework proceeds in three stages: it progressively adapts implicit geometry features, learns a unified geometry-video denoising space, and distills the joint score field into a compact student generator. At inference time, the geometry latent is discarded, so the student generates video efficiently while retaining the distilled 3D-aware prior.

Results

VideoWeave is evaluated in both text-to-video and image-to-video settings. Across video quality, 3D reconstruction consistency, and epipolar consistency metrics, it improves spatial coherence and long-range geometric stability while maintaining competitive visual fidelity.
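The page does not say which epipolar consistency metric is used. One common choice for this kind of evaluation is the Sampson distance, a first-order approximation of how far a matched point pair lies from the epipolar constraint x2^T F x1 = 0. The sketch below is an illustration under that assumption only: the function names `sampson_error` and `mean_sampson_error` are hypothetical, and the fundamental matrix `F` and the keypoint matches between two generated frames are assumed to come from an external estimator.

```python
from typing import List, Sequence, Tuple

Matrix3 = Sequence[Sequence[float]]
Point = Tuple[float, float]


def _matvec(m: Matrix3, v: Sequence[float]) -> List[float]:
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def sampson_error(F: Matrix3, p1: Point, p2: Point) -> float:
    """Sampson distance (squared pixels) of the match p1 <-> p2.

    F maps a point in frame 1 to its epipolar line in frame 2, so a
    perfectly consistent match satisfies x2^T F x1 = 0.
    """
    x1 = [p1[0], p1[1], 1.0]  # homogeneous coordinates
    x2 = [p2[0], p2[1], 1.0]
    Fx1 = _matvec(F, x1)
    Ft = [[F[j][i] for j in range(3)] for i in range(3)]  # transpose of F
    Ftx2 = _matvec(Ft, x2)
    residual = sum(x2[i] * Fx1[i] for i in range(3))
    denom = Fx1[0] ** 2 + Fx1[1] ** 2 + Ftx2[0] ** 2 + Ftx2[1] ** 2
    if denom == 0.0:
        raise ValueError("degenerate configuration: epipolar lines undefined")
    return residual ** 2 / denom


def mean_sampson_error(F: Matrix3, matches: Sequence[Tuple[Point, Point]]) -> float:
    """Average Sampson distance over a non-empty set of matches."""
    return sum(sampson_error(F, a, b) for a, b in matches) / len(matches)
```

As a quick check, take the fundamental matrix of a pure horizontal camera translation, `F = [[0, 0, 0], [0, 0, -1], [0, 1, 0]]`. A match that stays on the same image row, such as `(10, 20)` and `(15, 20)`, has error 0.0, while a match that drifts two pixels vertically, `(10, 20)` and `(15, 22)`, has error 2.0. Lower averages over a video indicate frames that agree better with a single rigid scene.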


Text-to-Video Results

Text-conditioned generations preserve scene structure across extended camera motion while keeping object appearance and layout coherent over time.

[Videos: T2V samples 01 to 04]

Image-to-Video Results

Image-conditioned generations extend the input view with stable geometry, consistent camera trajectories, and reduced drift across long horizons.

[Videos: I2V samples 01 to 04]

Joint Generation Results

VideoWeave jointly denoises video latents and compressed geometry-feature latents, producing both the final video and its geometry-feature companion. The geometry features participate only in the joint denoising process and are discarded once it completes.
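The joint denoising loop described above can be sketched as follows. This is a toy illustration, not the VideoWeave implementation: the flat concatenation of video and geometry latents, the rectified-flow-style Euler sampler, and the names `joint_sample` and `make_toy_denoiser` are all assumptions made for the example. The toy denoiser stands in for a trained network and, unlike a real model, does not mix information between the two parts of the latent.

```python
import random
from typing import Callable, List, Tuple

Latent = List[float]
Denoiser = Callable[[Latent, float], Latent]


def make_toy_denoiser(target: Latent) -> Denoiser:
    """Hypothetical stand-in for a trained joint network.

    Assumes the interpolation z_t = (1 - t) * x0 + t * eps, whose
    velocity is eps - x0 = (z_t - x0) / t for a known clean target x0.
    """
    def denoiser(z: Latent, t: float) -> Latent:
        return [(zi - xi) / t for zi, xi in zip(z, target)]
    return denoiser


def joint_sample(
    denoiser: Denoiser,
    video_dim: int,
    geo_dim: int,
    steps: int = 8,
    seed: int = 0,
) -> Tuple[Latent, Latent]:
    """Euler sampling over one shared latent: video first, geometry after."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(video_dim + geo_dim)]
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt          # runs from 1 down to dt, never reaching 0
        v = denoiser(z, t)        # one prediction for the whole joint latent
        z = [zi - dt * vi for zi, vi in zip(z, v)]
    return z[:video_dim], z[video_dim:]


if __name__ == "__main__":
    clean = [1.0, 2.0, 3.0, -1.0, 0.5]  # 3 video values, 2 geometry values
    video, geometry = joint_sample(make_toy_denoiser(clean), video_dim=3, geo_dim=2)
    del geometry                         # the geometry companion is discarded
    print(video)
```

The point of the sketch is the data flow: both latents share every denoising step, and only the video part is kept at the end.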

Sample 01 Joint
Sample 02 Joint
Sample 03 Joint
Sample 04 Joint

Qualitative Comparison

Compared with recent baselines (Wan and GeoVideo for T2V; ViewCrafter and Gen3C for I2V), VideoWeave produces more temporally stable videos with consistent scene layout, object geometry, and camera motion.

[Video comparison, T2V: Wan / GeoVideo / Ours, two examples]
[Video comparison, I2V: ViewCrafter / Gen3C / Ours, two examples]