MeiGen-InfiniteTalk

InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing


1School of Artificial Intelligence, University of Chinese Academy of Sciences, 2Meituan,
3New Laboratory of Pattern Recognition (NLPR), CASIA, 4Shenzhen Campus of Sun Yat-sen University,
5Division of AMC and Department of ECE, HKUST, 6State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA
*Equal Contribution Corresponding Author


Technical Report | Code

Long sequence video dubbing

Long sequence image-to-video human animation

What is InfiniteTalk?

Given a source video and a target audio track, InfiniteTalk generates audio-synchronized full-body motion while preserving the identity, background, and camera movements of the source video. Our method can also use a single image as the condition to perform long-sequence human animation.

Video-to-video (inputs: source video + audio)

Image-to-video (inputs: source image + audio)

Sparse-frame video dubbing vs. traditional video dubbing

Editing region

Our paradigm edits the whole video instead of only the mouth region (the light green contour marks the editing region).

Traditional: mouth
Sparse-frame: whole video

Comparison

MuseTalk vs. LatentSync vs. Ours (three comparison clips)

Abstract

Recent breakthroughs in video AIGC have ushered in a transformative era for audio-driven human animation. However, conventional video dubbing techniques remain constrained to oral region editing, resulting in discordant facial expressions and body gestures that compromise viewer immersion. To overcome this limitation, we introduce sparse-frame video dubbing—a novel paradigm that strategically preserves reference keyframes to maintain identity, iconic gestures, and camera trajectories while enabling holistic, audio-synchronized full-body motion editing. Through critical analysis, we identify why naive image-to-video models fail in this task, particularly their inability to achieve adaptive conditioning. Addressing this, we propose InfiniteTalk: a streaming audio-driven generator designed for infinite-length long sequence dubbing. This architecture leverages temporal context frames for seamless inter-chunk transitions and incorporates a simple yet effective sampling strategy that optimizes control strength via fine-grained reference frame positioning. Comprehensive evaluations on HDTF, CelebV-HQ, and EMTD datasets demonstrate state-of-the-art performance. Quantitative metrics confirm superior visual realism, emotional coherence, and full-body motion synchronization.

Method

We introduce InfiniteTalk, an audio-driven generator for long-sequence sparse-frame video dubbing. It is built on a streaming video generation base that uses context frames to inject motion momentum, producing smooth inter-chunk transitions. To preserve the human identity, background, and camera movements of the source video, it controls the output by referencing keyframes from the source. To realize the soft reference mechanism required by sparse-frame video dubbing, we investigate how keyframe control works and find that the control strength is determined by the similarity between the video context and the image condition. Based on this finding, we propose a sampling strategy that balances control strength and motion alignment through fine-grained reference-frame positioning, achieving high-quality, infinite-length video dubbing with audio-aligned full-body motion editing. A simplified sketch of the chunk-wise generation loop is given below.
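The following is a minimal Python sketch of this chunk-wise streaming loop, written under our own assumptions rather than from the released code: the chunk and context lengths, the `generate_chunk` denoising call, and the `pick_reference_frame` helper are hypothetical placeholders that only illustrate how temporal context and reference-frame positioning interact.

```python
# Minimal sketch of chunk-wise streaming dubbing (assumed interface, not the
# official InfiniteTalk API). Each chunk is conditioned on (a) the audio
# segment for that chunk, (b) the trailing frames of the previous chunk as
# temporal context, and (c) a reference keyframe taken from the source video
# near the chunk.

import numpy as np

CHUNK_LEN = 81     # frames generated per chunk (assumed value)
CONTEXT_LEN = 5    # trailing frames carried over as motion context (assumed)

def pick_reference_frame(source_video, chunk_start, offset=0):
    """Select a source keyframe near the current chunk.

    `offset` controls where the reference sits relative to the chunk start;
    this fine-grained positioning is what trades off control strength
    (identity / camera preservation) against motion alignment.
    """
    idx = min(chunk_start + offset, len(source_video) - 1)
    return source_video[idx]

def stream_dub(source_video, audio_features, generate_chunk):
    """Generate an arbitrarily long dubbed video chunk by chunk.

    `audio_features` is assumed to be aligned one feature per output frame.
    `generate_chunk(audio_seg, context_frames, ref_frame)` stands in for one
    audio-conditioned denoising pass of the video generator.
    """
    frames, context = [], None
    for start in range(0, len(audio_features), CHUNK_LEN):
        audio_seg = audio_features[start:start + CHUNK_LEN]
        ref_frame = pick_reference_frame(source_video, start)
        chunk = generate_chunk(audio_seg, context, ref_frame)
        frames.extend(chunk)
        # Reuse the last few frames so the next chunk continues the motion
        # smoothly across the chunk boundary.
        context = chunk[-CONTEXT_LEN:]
    return np.stack(frames)
```

On the first chunk `context` is `None`, matching the image-to-video setting; for later chunks the carried-over frames supply the momentum that keeps inter-chunk transitions smooth, while the per-chunk reference frame keeps the output anchored to the source video.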


Some materials and video sources are derived from real videos. The generated content is for academic use only and commercial use is not permitted.