Let Your Video Listen to Your Music!

Beat-Aligned, Content-Preserving Video Editing with Arbitrary Music


1University of Adelaide,      2The University of New South Wales
 
🎵 Rocket Man
🎵 Billie Jean
🎵 Birds of a Feather


Abstract

Aligning the rhythm of visual motion in a video with a given music track is a practical need in multimedia production, yet remains an underexplored task in autonomous video editing. Effective alignment between motion and musical beats enhances viewer engagement and visual appeal, particularly in music videos, promotional content, and cinematic editing. Existing methods typically depend on labor-intensive manual cutting, speed adjustments, or heuristic-based editing techniques to achieve synchronization. While some generative models handle joint video and music generation, they often entangle the two modalities, limiting flexibility in aligning video to music beats while preserving the full visual content. In this paper, we propose a novel and efficient framework—termed MVAA (Music-Video Auto-Alignment)—that automatically edits video to align with the rhythm of a given music track while preserving the original visual content. To enhance flexibility, we modularize the task into a two-step process in our MVAA: aligning motion keyframes with audio beats, followed by rhythm-aware video inpainting. Specifically, we first insert keyframes at timestamps aligned with musical beats, then use a frame-conditioned diffusion model to generate coherent intermediate frames, preserving the original video’s semantic content. Since comprehensive test-time training can be time-consuming, we adopt a two-stage strategy: pretraining the inpainting module on a small video set to learn general motion priors, followed by rapid inference-time fine-tuning for video-specific adaptation. This hybrid approach enables adaptation within ~10 minutes per epoch on a single NVIDIA 4090 GPU using CogVideoX-5b-I2V as the backbone. Extensive experiments show that our approach can achieve high-quality beat alignment and visual smoothness. User studies further validate the natural rhythmic quality of the results, confirming their effectiveness for practical music-video editing.
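The first step of the pipeline, placing keyframes at beat-aligned timestamps, can be sketched in outline. The snippet below is a minimal illustration, not the authors' implementation: it assumes beat timestamps have already been extracted by an external beat tracker (e.g. librosa's `beat_track`) and that the video runs at a fixed frame rate.

```python
# Illustrative sketch: snap musical beat timestamps to video frame indices,
# which then serve as the keyframe slots for rhythm-aware inpainting.
# Beat times are assumed to come from any off-the-shelf beat tracker.

def beats_to_keyframes(beat_times, fps, num_frames):
    """Map beat timestamps (seconds) to the nearest valid frame indices."""
    keyframes = []
    for t in beat_times:
        idx = round(t * fps)  # nearest frame to this beat
        if 0 <= idx < num_frames and idx not in keyframes:
            keyframes.append(idx)
    return keyframes

# Example: 24 fps video, beats every 0.5 s (120 BPM)
print(beats_to_keyframes([0.5, 1.0, 1.5, 2.0], fps=24, num_frames=60))
# -> [12, 24, 36, 48]
```

The frames between consecutive keyframes would then be filled by the frame-conditioned diffusion inpainting module described above.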


Method



Gallery 1: Original Videos vs. Edited Videos with Our MVAA


🎵 Rocket Man
🎵 Billie Jean
🎵 Jingle Bells

Our MVAA produces smooth transitions from the original to the edited videos, aligning motion with the music beats while preserving content consistency.



Gallery 2: Long Video with Our MVAA


🎵 Rocket Man

We generate the final output by sequentially producing short video clips and concatenating them. Test-time adaptation is performed only on the first clip, after which the model is directly applied to the remaining clips, demonstrating the strong generalization ability of our MVAA.
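The sequential clip strategy above can be illustrated with a small scheduling sketch. The helper below is hypothetical (not from the paper's code): it partitions a track's beat timestamps into fixed-length clip segments and flags only the first clip for test-time adaptation, mirroring the procedure described.

```python
# Illustrative sketch: split beat timestamps into consecutive clip segments.
# Only the first clip is marked for test-time adaptation; later clips reuse
# the adapted model directly, as in the long-video procedure above.

def schedule_clips(beat_times, clip_len):
    """Group beat timestamps (seconds) into clips of clip_len seconds each."""
    clips = []
    i = 0
    while beat_times and i * clip_len < beat_times[-1]:
        start, end = i * clip_len, (i + 1) * clip_len
        clips.append({
            "start": start,
            "beats": [t for t in beat_times if start <= t < end],
            "adapt": i == 0,  # test-time adaptation on the first clip only
        })
        i += 1
    return clips

beats = [0.5, 1.0, 1.5, 2.5, 3.0, 3.5]
for clip in schedule_clips(beats, clip_len=2.0):
    print(clip)
```

Each scheduled clip would then be generated in turn and the outputs concatenated into the final video.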



BibTeX