Obtaining stable video is challenging for amateur videographers and for anyone shooting with a handheld camera. Video stabilization techniques aim to smooth out camera movements to produce easy-to-watch videos, which can be achieved by inputting smooth camera trajectories into ReCamMaster. To verify this, we used videos from the DeepStab dataset (unsteady footage captured with handheld hardware) as input to the model and obtained stable videos as output. The model stabilizes the video while preserving the scenes and actions of the original.
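As a rough illustration of what a "smooth camera trajectory" could look like, the sketch below applies a centered moving average to per-frame camera positions. Everything here is an assumption for illustration: the page does not say how the target trajectory is produced, and the function name `smooth_trajectory`, the `window` parameter, and the synthetic shaky path are hypothetical. Only positions are smoothed; camera rotations would need separate handling.

```python
def smooth_trajectory(positions, window=5):
    """Moving-average smoothing of per-frame camera positions.

    positions: list of (x, y, z) tuples, one per frame.
    window: odd number of frames averaged around each frame
            (hypothetical default, not taken from the paper).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    n = len(positions)
    smoothed = []
    for i in range(n):
        # Shrink the window at the clip boundaries instead of padding.
        lo, hi = max(0, i - half), min(n, i + half + 1)
        count = hi - lo
        smoothed.append(tuple(
            sum(p[axis] for p in positions[lo:hi]) / count
            for axis in range(3)
        ))
    return smoothed


# Hypothetical shaky path: steady forward motion plus vertical jitter.
shaky = [(0.1 * i, 0.05 * (-1) ** i, 0.0) for i in range(10)]
steady = smooth_trajectory(shaky, window=5)
```

The smoothed path keeps the overall forward motion while damping the frame-to-frame jitter; a trajectory of this kind is what would be passed to the model as the target camera path.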
In the realm of Embodied AI, creating large-scale, high-quality datasets like Bridge or AGIBOT can be extremely expensive. ReCamMaster offers a solution by altering video perspectives, making it an effective tool for data augmentation. It could supply robots with multi-perspective observation data, thereby enhancing the performance of downstream tasks.
We have discovered that ReCamMaster demonstrates promising generalization capabilities in autonomous driving scenarios. Consequently, it can serve as an effective data augmentation tool in autonomous driving.
Camera control has been actively studied in text- or image-conditioned video generation tasks. However, altering the camera trajectory of a given video remains under-explored, despite its importance in the field of video creation. The task is non-trivial because it adds the constraints of maintaining appearance consistency across frames and keeping the scene dynamics synchronized. To address this, we present ReCamMaster, a camera-controlled generative video re-rendering framework that reproduces the dynamic scene of an input video at novel camera trajectories. The core innovation lies in harnessing the generative capabilities of pre-trained text-to-video models through a thoroughly explored video conditioning mechanism. Given the scarcity of qualified training data, we constructed a large-scale multi-camera synchronized video dataset using Unreal Engine 5. The dataset is carefully curated to follow real-world filming characteristics and covers diverse scenes and camera movements, which helps trained models generalize to in-the-wild videos. Lastly, we further improve robustness to diverse inputs through a meticulously designed training strategy. Extensive experiments show that our method substantially outperforms existing state-of-the-art approaches and strong baselines. Our method also finds promising applications in video stabilization, super-resolution, and outpainting. Our code and dataset will be publicly available.
To re-shoot a source video with novel camera trajectories, we propose to harness the generative capability of pre-trained text-to-video diffusion models by imposing dual conditions, i.e., the source video and the target camera trajectory, through a meticulously designed framework. The overview of the model is depicted below.
Left: The training pipeline of ReCamMaster. A latent diffusion model is optimized to reconstruct the target video V_t, conditioned on the source video V_s, target camera pose cam_t, and target prompt p_t. Right: Comparison of different video conditioning techniques. (a) Frame-dimension conditioning used in our paper; (b) Channel-dimension conditioning used in baseline methods [1], [2]; (c) View-dimension conditioning in [3].
In our paper, we propose a novel video conditioning scheme that concatenates the tokens of a source video with the target video tokens along the frame dimension. To verify its effectiveness, we compared it against the "channel concatenation" technique used in baseline methods [1], [2] and the "view concatenation" technique from [3]. The comparison shows that the proposed frame-dimension conditioning significantly enhances the model's performance.
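The difference between frame-dimension and channel-dimension conditioning can be sketched at the level of token shapes. The snippet below is a minimal illustration using nested Python lists laid out as `[frames][tokens per frame][channels]`; the helper names (`make_tokens`, `shape`, `frame_concat`, `channel_concat`) and the toy sizes are hypothetical and are not taken from the ReCamMaster code. View-dimension conditioning from [3] is omitted.

```python
def make_tokens(frames, tokens_per_frame, channels, fill):
    """Build a toy token grid of shape [frames, tokens_per_frame, channels]."""
    return [[[fill] * channels for _ in range(tokens_per_frame)]
            for _ in range(frames)]


def shape(tokens):
    """Return (frames, tokens_per_frame, channels) of a token grid."""
    return (len(tokens), len(tokens[0]), len(tokens[0][0]))


def frame_concat(src, tgt):
    """Frame-dimension conditioning: [f, n, c] + [f, n, c] -> [2f, n, c].

    Source frames are placed next to the target frames, so the token
    sequence becomes twice as long while the channel width is unchanged.
    """
    return src + tgt


def channel_concat(src, tgt):
    """Channel-dimension conditioning: [f, n, c] + [f, n, c] -> [f, n, 2c].

    Each target token is fused with the source token at the same
    frame and position; the sequence length is unchanged.
    """
    return [
        [s_tok + t_tok for s_tok, t_tok in zip(s_frame, t_frame)]
        for s_frame, t_frame in zip(src, tgt)
    ]


# Toy sizes (hypothetical): 4 frames, 6 tokens per frame, 8 channels.
source = make_tokens(4, 6, 8, fill=0.0)
target = make_tokens(4, 6, 8, fill=1.0)

frame_cond = frame_concat(source, target)      # shape (8, 6, 8)
channel_cond = channel_concat(source, target)  # shape (4, 6, 16)
```

With frame concatenation the source tokens stay separate entries in the sequence, so a backbone with attention over tokens can, in principle, relate any target token to any source token. Channel concatenation instead ties each target token to the source token at the same frame and position.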