VideoWeave — Unlocking Geometric Consistency in Video Generation via Joint Geometry-Video Modeling

Abstract

Large-scale video diffusion models often fail to preserve 3D structure over time, causing geometric drift and implausible motion under viewpoint changes. VideoWeave is a latent-space post-training framework that uses implicit features from a geometry model to constrain the generative distribution, without relying on explicitly reconstructed depth maps, point clouds, or other 3D structures. It compresses the geometry features into latents, models them jointly with video latents in a shared denoising space, and distills the resulting joint score field into an efficient generator. Together with GeoVid-80K, VideoWeave improves geometric coherence while preserving strong visual quality for text-to-video and image-to-video generation.

Overview

Results

VideoWeave is evaluated in both text-to-video and image-to-video settings. Across video quality, 3D reconstruction consistency, and epipolar consistency metrics, it improves spatial coherence and long-range geometric stability while maintaining competitive visual fidelity.
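
This page does not state which epipolar consistency metric is used. As an illustration only, the sketch below computes the Sampson error, a standard first-order approximation of how far a pair of matched points is from satisfying the epipolar constraint x2^T F x1 = 0. The fundamental matrix `F` and the point matches are assumed to be given (for example, estimated from keypoints matched between two generated frames); the function names are hypothetical and not part of VideoWeave.

```python
def mat_vec(m, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def transpose(m):
    """Transpose a 3x3 matrix given as a list of rows."""
    return [[m[j][i] for j in range(3)] for i in range(3)]


def sampson_error(F, p1, p2):
    """Sampson error of one match p1 -> p2 (pixel coordinates) under F."""
    x1 = [p1[0], p1[1], 1.0]  # homogeneous point in the first frame
    x2 = [p2[0], p2[1], 1.0]  # homogeneous point in the second frame
    Fx1 = mat_vec(F, x1)
    Ftx2 = mat_vec(transpose(F), x2)
    residual = sum(x2[i] * Fx1[i] for i in range(3))  # x2^T F x1
    denom = Fx1[0] ** 2 + Fx1[1] ** 2 + Ftx2[0] ** 2 + Ftx2[1] ** 2
    if denom == 0.0:
        raise ValueError("degenerate match: epipolar lines are undefined")
    return residual ** 2 / denom


def mean_sampson_error(F, matches):
    """Average Sampson error over a list of (p1, p2) matches."""
    return sum(sampson_error(F, p1, p2) for p1, p2 in matches) / len(matches)


# Assumed example: identity intrinsics and pure sideways camera translation,
# for which matched points must stay on the same image row.
F_sideways = [[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]]
consistent = sampson_error(F_sideways, (0.0, 0.0), (5.0, 0.0))
drifted = sampson_error(F_sideways, (0.0, 0.0), (5.0, 2.0))
```

A lower average error over many frame pairs indicates that the generated motion is closer to what a single rigid scene viewed by a moving camera would produce.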

Text-to-video Results

Text-conditioned generations preserve scene structure across extended camera motion while keeping object appearance and layout coherent over time.

Sample 01 T2V

Sample 02 T2V

Sample 03 T2V

Sample 04 T2V

Image-to-video Results

Image-conditioned generations extend the input view with stable geometry, consistent camera trajectories, and reduced drift across long horizons.

Sample 01 I2V

Sample 02 I2V

Sample 03 I2V

Sample 04 I2V

Joint Generation Results

VideoWeave jointly denoises video latents and compressed geometry-feature latents, producing both the final video and its geometry-feature companion. The geometry features are used only during joint denoising and are discarded afterward.
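
The control flow described above can be sketched as follows. This is a toy illustration under stated assumptions, not the VideoWeave implementation: latents are flat lists of floats, the update is a plain Euler step, and `toy_denoiser` is a hypothetical stand-in for the learned network. The only point it shows is that the two latent groups share one denoising state, so each can influence the other, and that the geometry part can be dropped once sampling ends.

```python
def joint_denoise(video_latent, geometry_latent, denoiser, num_steps=4):
    """Denoise video and geometry latents together in one shared state."""
    n_video = len(video_latent)
    # Shared denoising space: both latent groups live in a single state.
    joint = list(video_latent) + list(geometry_latent)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = 1.0 - step * dt  # assumed schedule running from 1 down toward 0
        update = denoiser(joint, t)  # predicted direction for the whole state
        joint = [z - dt * u for z, u in zip(joint, update)]
    # Split the shared state back into its two parts.
    return joint[:n_video], joint[n_video:]


def toy_denoiser(joint, t):
    """Hypothetical placeholder: pulls every value toward the joint mean,
    so the video part depends on the geometry part and vice versa."""
    mean = sum(joint) / len(joint)
    return [z - mean for z in joint]


video0 = [1.0, 2.0, 3.0]     # stand-in for noisy video latents
geometry0 = [10.0, 20.0]     # stand-in for noisy geometry-feature latents
video, geometry = joint_denoise(video0, geometry0, toy_denoiser, num_steps=4)

# Only the video latents would go on to decoding; the geometry companion
# is needed during sampling and can be discarded afterward.
final_output = video
```

Because the placeholder denoiser looks at the full joint state, changing `geometry0` changes the resulting `video`, which mirrors how geometry latents constrain the video during joint denoising.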