🎵 Rocket Man
🎵 Billie Jean
🎵 Birds of a Feather
Aligning the rhythm of visual motion in a video with a given music track is a practical need in multimedia production, yet it remains underexplored in automatic video editing. Effective alignment between motion and musical beats enhances viewer engagement and visual appeal, particularly in music videos, promotional content, and cinematic editing. Existing methods typically depend on labor-intensive manual cutting, speed adjustments, or heuristic editing rules to achieve synchronization. While some generative models handle joint video and music generation, they often entangle the two modalities, limiting the flexibility to align video to music beats while preserving the full visual content. In this paper, we propose a novel and efficient framework, termed MVAA (Music-Video Auto-Alignment), that automatically edits a video to align with the rhythm of a given music track while preserving its original visual content. To enhance flexibility, MVAA modularizes the task into a two-step process: aligning motion keyframes with audio beats, followed by rhythm-aware video inpainting. Specifically, we first insert keyframes at timestamps aligned with musical beats, then use a frame-conditioned diffusion model to generate coherent intermediate frames that preserve the original video's semantic content. Since comprehensive test-time training is time-consuming, we adopt a two-stage strategy: pretraining the inpainting module on a small video set to learn general motion priors, followed by rapid inference-time fine-tuning for video-specific adaptation. This hybrid approach enables adaptation within ~10 minutes per epoch on a single NVIDIA RTX 4090 GPU with CogVideoX-5b-I2V as the backbone. Extensive experiments show that our approach achieves high-quality beat alignment and visual smoothness. User studies further confirm the natural rhythmic quality of the results, validating their effectiveness for practical music-video editing.
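The first step of the pipeline, placing keyframes at beat-aligned timestamps, can be sketched as a simple snapping operation. This is a minimal illustration, not the actual MVAA implementation: the beat times here are hypothetical inputs that would normally come from a beat tracker (e.g., `librosa.beat.beat_track`), and `align_keyframes_to_beats` is a name we introduce for illustration.

```python
# Sketch of beat-aligned keyframe placement (step 1 of the two-step process).
# Assumptions: beat times are given in seconds by an external beat tracker;
# motion keyframes are timestamps of salient motion in the source video.

def align_keyframes_to_beats(keyframe_times, beat_times, fps=24):
    """Snap each motion keyframe to its nearest musical beat and
    return the corresponding frame indices at the given frame rate."""
    aligned = []
    for t in keyframe_times:
        nearest_beat = min(beat_times, key=lambda b: abs(b - t))
        aligned.append(round(nearest_beat * fps))
    return aligned

beats = [0.5, 1.0, 1.5, 2.0, 2.5]   # seconds, from a beat tracker
keyframes = [0.48, 1.10, 2.43]      # motion peaks in the source video
print(align_keyframes_to_beats(keyframes, beats))  # → [12, 24, 60]
```

The intermediate frames between these beat-aligned keyframes are then filled in by the rhythm-aware inpainting model described above.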
🎵 Rocket Man
🎵 Billie Jean
🎵 Jingle Bells
Our MVAA produces smooth transitions from original to edited videos, aligning with music beats while preserving content consistency.
🎵 Rocket Man
We generate the final output by sequentially producing short video clips and concatenating them. Test-time adaptation is performed only on the first clip, after which the model is directly applied to the remaining clips, demonstrating the strong generalization ability of our MVAA.
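The clip-wise inference procedure above can be sketched as a short loop. This is a hedged illustration only: `fine_tune` and `generate_clip` are hypothetical stand-ins for the actual test-time adaptation and diffusion-sampling routines, shown here as toy functions.

```python
# Sketch of sequential clip generation: test-time adaptation runs only on
# the first clip; the adapted model then generates the remaining clips
# directly, and the clips are concatenated into the final video.

def render_video(model, clip_specs, fine_tune, generate_clip):
    clips = []
    for i, spec in enumerate(clip_specs):
        if i == 0:
            model = fine_tune(model, spec)   # adapt once, on the first clip
        clips.append(generate_clip(model, spec))
    return clips  # concatenated in order to form the final output

# Toy demonstration with stand-in functions:
fine_tune = lambda model, spec: model + 1          # pretend adaptation
generate_clip = lambda model, spec: (model, spec)  # pretend generation
print(render_video(0, ["clip_a", "clip_b"], fine_tune, generate_clip))
# → [(1, 'clip_a'), (1, 'clip_b')]
```

Because adaptation happens once rather than per clip, the cost of test-time fine-tuning is amortized over the whole video.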