We present LLIA, a novel audio-driven portrait video generation framework based on the diffusion model. First, we propose a robust variable-length video generation scheme that reduces the minimum time required to generate the initial video clip or a state transition, which significantly enhances the user experience. Second, we propose a consistency model training strategy for Audio-Image-to-Video generation that enables fast few-step sampling and ensures real-time performance. Model quantization and pipeline parallelism are further employed to accelerate inference. To mitigate the stability loss incurred by the diffusion process and model quantization, we introduce a new inference strategy tailored for long-duration video generation. Together, these methods ensure real-time performance and low latency while maintaining high-fidelity output. Third, we incorporate class labels as a conditional input to seamlessly switch between speaking, listening, and idle states. Finally, we design a novel mechanism for fine-grained facial expression control that exploits our model's inherent capabilities. Extensive experiments demonstrate that our approach achieves low-latency, fluid, and authentic two-way communication. On an NVIDIA RTX 4090D, our model achieves up to 78 FPS at a resolution of 384×384 and 45 FPS at 512×512, with initial video generation latencies of 140 ms and 215 ms, respectively.
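To make the few-step, state-conditioned inference concrete, the following is a minimal sketch of how a consistency-distilled denoiser could be sampled with audio, reference-image, and state-label conditioning. It is not the released implementation; `VideoDenoiser`, `STATE_IDS`, the sigma schedule, and all tensor shapes are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of few-step, class-conditioned sampling
# in the spirit of a consistency-distilled audio-image-to-video model.
import torch
import torch.nn as nn

STATE_IDS = {"speaking": 0, "listening": 1, "idle": 2}  # assumed class labels

class VideoDenoiser(nn.Module):
    """Stand-in denoiser: maps noisy latents plus conditions to a clean-latent estimate."""
    def __init__(self, latent_dim=4, cond_dim=64, num_states=3):
        super().__init__()
        self.state_emb = nn.Embedding(num_states, cond_dim)
        self.net = nn.Conv3d(latent_dim, latent_dim, kernel_size=3, padding=1)

    def forward(self, z_t, sigma, audio_feat, ref_feat, state_id):
        # A real model would fuse audio, reference-image, and state embeddings
        # via cross-attention; here we only show the conditioning interface.
        _ = self.state_emb(state_id) + audio_feat + ref_feat
        return self.net(z_t)

@torch.no_grad()
def few_step_sample(model, audio_feat, ref_feat, state_id, num_frames,
                    sigmas=(14.6, 3.6, 0.9)):  # assumed 3-step noise schedule
    """Few-step sampling: each step predicts a clean latent, then re-noises to the next sigma."""
    z = torch.randn(1, 4, num_frames, 48, 48) * sigmas[0]
    for i, sigma in enumerate(sigmas):
        x0 = model(z, sigma, audio_feat, ref_feat, state_id)
        if i + 1 < len(sigmas):
            z = x0 + sigmas[i + 1] * torch.randn_like(x0)  # re-inject noise at the next level
        else:
            z = x0
    return z  # decoded to RGB frames by a VAE decoder in the full pipeline

if __name__ == "__main__":
    model = VideoDenoiser()
    audio_feat = torch.zeros(1, 64)   # placeholder audio embedding for one clip
    ref_feat = torch.zeros(1, 64)     # placeholder reference-portrait embedding
    state = torch.tensor([STATE_IDS["speaking"]])
    latents = few_step_sample(model, audio_feat, ref_feat, state, num_frames=12)
    print(latents.shape)  # torch.Size([1, 4, 12, 48, 48])
```

The key point of the sketch is that only a handful of denoiser evaluations are needed per clip, which is what allows real-time generation once combined with quantization and pipeline parallelism.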
Overview of the proposed method. Our pipeline introduces several novel modules. 1) Before feeding the reference portrait into the ReferenceNet, we apply portrait animation to adjust its facial expression to match a provided template. 2) The avatar's states are determined from the input audio through class labels; these labels can be inferred directly from acoustic features or, alternatively, guided by an LLM that indicates the appropriate state. 3) The length of the sequential noisy latents is fixed during the early training iterations and then becomes dynamic, enabling the model to acquire the capability of variable-length video generation (see the sketch below).
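For illustration, the snippet below sketches the two-phase latent-length schedule described in (3) under assumed hyperparameters; `warmup_steps`, `fixed_len`, `min_len`, and `max_len` are hypothetical values, not taken from the paper.

```python
import random

def sample_latent_length(step, warmup_steps=10_000, fixed_len=16, min_len=4, max_len=16):
    """Number of noisy latent frames to use at a given training step (assumed schedule)."""
    if step < warmup_steps:
        return fixed_len                       # phase 1: fixed length stabilizes early training
    return random.randint(min_len, max_len)    # phase 2: random lengths teach variable-length generation

# e.g. sample_latent_length(500) -> 16; sample_latent_length(20_000) -> a value in [4, 16]
```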
Our model enables real-time communication between two digital avatars.
Our method is capable of generating realistic facial expressions and natural head movements while ensuring low latency and real-time performance.
We achieve controllable facial expression manipulation by employing portrait animation on the same portrait image.
Virtual interviewer
Chatbot on mobile phone