LongCat-Video-Avatar

Meituan LongCat Team

Technical Report | Code

An expressive avatar model built upon LongCat-Video

Actor/Actress

Produces natural facial expressions and lip movements that stay perfectly in sync with dialogue, maintaining character identity flawlessly even during extended scenes. Note that [ATI2V] represents audio-text-image-to-video generation while [AT2V] represents audio-text-to-video generation.

Singer

Generates dynamic body motions that rhythmically align with vocals, enabling vibrant and cohesive performances from start to finish without quality degradation.

Podcast

Supports long-form speaking with stable, high-quality video output, ensuring the host’s appearance remains consistent and engaging throughout the entire session.

Sales

Creates smooth, professional presentations by intelligently handling silent audio segments with natural gestures, avoiding awkward pauses or stiffness.

Multi-Person

Seamlessly handles complex interactive scenarios by generating synchronized videos for multiple speakers, maintaining individual identity characteristics while ensuring natural turn-taking behaviors and group dynamics.

Long Video Generation

Supports long video generation with stable lip synchronization, consistent identity, and no noticeable color-tone accumulation. Here is a 5-minute example.

What is LongCat-Video-Avatar?

LongCat-Video-Avatar is an audio-driven video generation model that generates super-realistic, lip-synchronized long videos with natural dynamics and consistent identity. It supports multiple video generation modes, including audio-text-to-video ([AT2V]) generation, audio-text-image-to-video ([ATI2V]) generation, and video continuation.

Audio-Text-to-Video [AT2V]

"Static medium shot. Man in denim jacket and beanie talking on snowy shore, gazing at city skyline across frozen river. Muted cool tones, winter light, slow pan."
Text input
Audio input

Audio-Text-Image-to-Video [ATI2V]

Image input: (reference image)
Audio & Text input: "A woman holds a microphone with both hands and talks passionately, her voice echoing through the quiet surroundings."

InfiniteTalk vs. LongCat-Video-Avatar

Comparison: LongCat-Video-Avatar achieves more vivid performance and natural dynamics than InfiniteTalk.
(Three side-by-side video pairs: InfiniteTalk vs. LongCat-Video-Avatar (Ours).)

Abstract

Recent advances in audio-driven human video synthesis have significantly improved the realism of half-body and full-body generation. However, generating long-duration sequences remains a persistent challenge, as existing methods often suffer from error accumulation and identity drift over time. While reference image injection strategies (e.g., InfiniteTalk) have been proposed to mitigate these issues, they frequently result in a rigid “copy-paste” phenomenon and limited motion diversity due to conditional image leakage. Furthermore, current models exhibit an undesirable over-reliance on speech signals, leading to unnaturally static behaviors during silent segments. To address these limitations, we present LongCat-Video-Avatar, a unified architecture designed for super-realistic, lip-synchronized long video generation with natural dynamics and consistent identity. It supports multiple video generation modes, including audio-text-to-video, audio-text-image-to-video, and audio-conditioned video continuation. We first analyze the coupling between speech and human motion, proposing a Disentangled Unconditional Guidance strategy that separates audio signals from motion dynamics to ensure natural behavior even in the absence of speech. To alleviate the “copy-paste” issue, we introduce a Reference Skip Attention mechanism that strategically incorporates reference cues to preserve identity while preventing excessive leakage, thereby balancing visual fidelity with motion richness. Additionally, to tackle error accumulation caused by redundant VAE decode-encode cycles in autoregressive generation, we propose a Cross-Chunk Latent Stitching strategy. Extensive evaluations demonstrate the effectiveness of our approach in generating super-realistic, long-duration human videos.
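To make the Disentangled Unconditional Guidance idea concrete, here is a minimal sampling-step sketch: instead of a single null condition, the audio branch falls back to a silent-audio embedding, so guidance can push toward speech without suppressing idle motion during silence. All names, signatures, and guidance weights (`model`, `w_text`, `w_audio`, etc.) are illustrative assumptions, not the released implementation.

```python
def disentangled_cfg(model, x_t, t, text_emb, audio_emb,
                     null_text_emb, silent_audio_emb,
                     w_text=5.0, w_audio=4.0):
    """Sketch of a disentangled guidance step (assumed interface).

    The audio-dropped branch uses a *silent-audio* embedding rather than
    a null token, so "no speech" remains a valid condition that drives
    natural idle motion instead of a frozen pose.
    """
    # Branch 1: no text, silent audio -- the fully unconditional baseline.
    eps_uncond = model(x_t, t, text=null_text_emb, audio=silent_audio_emb)
    # Branch 2: text kept, audio dropped to the silent-audio embedding.
    eps_text = model(x_t, t, text=text_emb, audio=silent_audio_emb)
    # Branch 3: full conditioning on both text and real speech audio.
    eps_full = model(x_t, t, text=text_emb, audio=audio_emb)

    # Text guidance moves the prediction toward the text condition; audio
    # guidance moves it from "silence" toward the actual speech condition.
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)
            + w_audio * (eps_full - eps_text))
```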

Method

We introduce LongCat-Video-Avatar, a unified DiT-based framework designed for generating super-realistic, long-duration audio-driven human videos with consistent identity and natural dynamics. To address the issue of unnatural static postures during silent intervals, we propose Disentangled Unconditional Guidance, which effectively decouples speech signals from global body motion by distinguishing silent audio embeddings from null conditions. Furthermore, to prevent identity drift without inducing rigid "copy-paste" artifacts, we devise a Reference Skip Attention mechanism that selectively incorporates reference cues to balance visual fidelity with motion diversity. Finally, we implement Cross-Chunk Latent Stitching, a training strategy that eliminates redundant VAE decode-encode cycles to reduce pixel degradation and bridge the train-test gap, ensuring seamless video continuation. Our architecture also extends to multi-person scenarios through L-ROPE-based audio-visual binding.
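The two sketches below illustrate the remaining components under assumed interfaces; neither is the released code. First, Reference Skip Attention: reference-image tokens join the attention key/value set only in a subset of DiT blocks, so identity cues are injected periodically rather than at every layer. The stride rule and all module names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ReferenceSkipBlock(nn.Module):
    """Sketch of Reference Skip Attention inside one DiT block.

    Reference tokens are attended to only in a strided subset of blocks:
    enough to anchor identity across long videos, but not so often that
    the reference pose leaks into every generated frame.
    """
    def __init__(self, dim, num_heads, block_idx, skip_every=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Assumed skip rule: use the reference every `skip_every` blocks.
        self.use_ref = (block_idx % skip_every == 0)

    def forward(self, video_tokens, ref_tokens):
        # Key/value set optionally includes the reference tokens.
        kv = (torch.cat([video_tokens, ref_tokens], dim=1)
              if self.use_ref else video_tokens)
        out, _ = self.attn(video_tokens, kv, kv)
        return video_tokens + out
```

Second, Cross-Chunk Latent Stitching: during autoregressive sampling, the tail latents of one chunk directly condition the next, and the VAE decodes only once at the end, so no lossy decode-encode round-trip feeds errors back into the loop. Here `sampler`, `vae`, and the overlap length are hypothetical placeholders.

```python
import torch

def generate_long_video(sampler, vae, audio_chunks, overlap=4):
    """Sketch of cross-chunk latent stitching for long-video sampling."""
    all_latents, prefix = [], None
    for audio in audio_chunks:
        # Condition on the previous chunk's tail latents (None at the start).
        latents = sampler(prefix_latents=prefix, audio=audio)
        # Drop the re-generated overlap frames to avoid duplication.
        new = latents if prefix is None else latents[overlap:]
        all_latents.append(new)
        # Stitch in latent space: no decode-encode cycle between chunks.
        prefix = latents[-overlap:]
    # Decode pixels once; decoded frames never re-enter the generation loop.
    return vae.decode(torch.cat(all_latents, dim=0))
```

Keeping the conditioning signal in latent space means each chunk sees inputs from the same distribution the model was trained on, which is how this strategy bridges the train-test gap mentioned above.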

Ethical Considerations

Some of the images and audio are derived from real videos solely to demonstrate the capabilities of this research, e.g., expressions, gestures, and naturalness. The generated content is for academic use only; commercial use is not permitted. If you have any concerns, please contact us (zhangyong202303@gmail.com) and we will remove the content promptly.