MeiGen-MultiTalk
Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation
Technical Report | Code
We propose MultiTalk, a novel framework for audio-driven multi-person conversational video generation. Given a multi-stream audio input, a reference image, and a prompt, MultiTalk generates a video in which the people interact as described by the prompt, with lip motions accurately synchronized to the corresponding audio streams.
In a cozy, warmly lit room, Nick Wilde, a fox with a mischievous grin, sits across from Judy Hopps, a rabbit with a determined expression. Both are dressed casually; Nick in a green shirt and striped tie, Judy in a blue outfit with headphones resting on the table. A Disney-branded mug sits between them on the wooden table. The background features a rustic interior with a lamp, a window, and various household items, creating a homely atmosphere. A medium shot captures their interaction as Nick picks up the mug and gently touches Judy's head, suggesting a moment of camaraderie and connection.
A man and a woman are seated at an outdoor table, engaged in a conversation. The woman, dressed in a light pink top with a white cardigan, holds a red cup of coffee, takes a sip, and then places it back on the saucer. The man, wearing a striped shirt over a white t-shirt, is engrossed in his smartphone, looking down intently. The table is adorned with two red cups of coffee and a plate with a croissant. The background features a charming European street with pastel-colored buildings, greenery, and a partially open green umbrella. The scene captures a casual, everyday moment with a warm, inviting atmosphere.
Two individuals sit at a white table in a studio with blue-and-white acoustic wall panels. A man on the left wears a dark casual top, holding a coffee mug. The woman on the right has a pair of studio headphones resting near her. The man is speaking while the woman is listening and nodding occasionally. The woman picks up the black headphones. A large wall-mounted TV displays technical interfaces. The scene suggests a collaborative workspace with professional audio-visual equipment in a bright studio environment.
Audio-driven human animation methods, such as talking head and talking body generation, have made remarkable progress in generating synchronized facial movements and videos with appealing visual quality. However, existing methods primarily focus on single-person animation and struggle with multi-stream audio inputs, suffering from incorrect binding between audio streams and persons. They also exhibit limited instruction-following capability. To address these issues, we propose a novel task, Multi-Person Conversational Video Generation, and introduce a new framework, MultiTalk, to tackle the challenges of multi-person generation. Specifically, for audio injection, we investigate several schemes and propose the Label Rotary Position Embedding (L-RoPE) method to resolve the audio-person binding problem. Furthermore, during training, we observe that partial parameter training and multi-task training are crucial for preserving the instruction-following ability of the base model. MultiTalk outperforms other methods on several datasets, including talking head, talking body, and multi-person datasets, demonstrating the strong generation capability of our approach.
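As a rough illustration of the partial-parameter training mentioned above, the sketch below freezes the pretrained backbone and leaves only the newly added audio layers trainable, which is how the base model's instruction-following ability can be preserved. The module-name pattern `audio_cross_attn` is a placeholder assumption, not the actual parameter naming used in MultiTalk.

```python
# Minimal sketch: freeze everything except the added audio cross-attention
# layers (the "audio_cross_attn" name is a hypothetical placeholder).
import torch
from torch import nn

def freeze_all_but_audio_layers(model: nn.Module) -> list[nn.Parameter]:
    trainable = []
    for name, param in model.named_parameters():
        if "audio_cross_attn" in name:
            param.requires_grad = True   # keep the new audio layers trainable
            trainable.append(param)
        else:
            param.requires_grad = False  # freeze the pretrained backbone
    return trainable

# Usage (model construction omitted):
# optimizer = torch.optim.AdamW(freeze_all_but_audio_layers(model), lr=1e-5)
```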
In this work, we propose MultiTalk, an audio-driven video generation framework. The framework adds an audio cross-attention layer to the base video diffusion model to support audio conditioning. To achieve multi-person conversational video generation, we propose Label Rotary Position Embedding (L-RoPE) for multi-stream audio injection.
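A minimal, self-contained sketch of the L-RoPE idea as described above: video tokens belonging to each person and the matching audio stream are assigned nearby rotary position labels, so the audio cross-attention preferentially binds each audio stream to the correct person. All tensor shapes, label values, and helper names below are illustrative assumptions, not the actual MultiTalk implementation.

```python
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply 1D rotary position embedding using per-token positions `pos`."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)  # (half,)
    angles = pos[:, None] * freqs[None, :]                             # (tokens, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Toy setup: 2 persons, a flat sequence of video-latent tokens (queries) and
# one audio-token sequence (keys) per person.
num_video_tokens, num_audio_tokens, dim = 64, 16, 32
video_q = torch.randn(num_video_tokens, dim)
audio_k = [torch.randn(num_audio_tokens, dim) for _ in range(2)]

# Toy "segmentation": first half of the video tokens belongs to person 0,
# second half to person 1.
person_masks = [torch.zeros(num_video_tokens, dtype=torch.bool) for _ in range(2)]
person_masks[0][:32] = True
person_masks[1][32:] = True

# Give each person a distinct label range and place the matching audio
# stream's label inside that range (the values are illustrative).
label_ranges = [(0.0, 4.0), (20.0, 24.0)]
q_labels = torch.zeros(num_video_tokens)
for mask, (lo, hi) in zip(person_masks, label_ranges):
    q_labels[mask] = (lo + hi) / 2.0

q = rope_1d(video_q, q_labels)
for (lo, hi), keys in zip(label_ranges, audio_k):
    k = rope_1d(keys, torch.full((num_audio_tokens,), (lo + hi) / 2.0))
    # Video tokens labeled like this audio stream see no rotary phase shift,
    # so matched audio-person pairs receive higher cross-attention scores.
    scores = (q @ k.T) / dim ** 0.5
```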