We present LLIA, a novel audio-driven portrait video generation framework based on the diffusion model. First, we propose a robust variable-length video generation scheme that reduces the minimum time required to generate the initial video clip or a state transition, which significantly enhances the user experience. Second, we propose a consistency model training strategy for Audio-Image-to-Video generation that enables fast few-step sampling and ensures real-time performance. Model quantization and pipeline parallelism are further employed to accelerate inference. To mitigate the stability loss incurred by the diffusion process and model quantization, we introduce a new inference strategy tailored for long-duration video generation. Together, these methods ensure real-time performance and low latency while maintaining high-fidelity output. Third, we incorporate class labels as a conditional input to seamlessly switch between speaking, listening, and idle states. Finally, we design a novel mechanism for fine-grained facial expression control that exploits our model's inherent capabilities. Extensive experiments demonstrate that our approach achieves low-latency, fluid, and authentic two-way communication. On an NVIDIA RTX 4090D, our model achieves up to 78 FPS at a resolution of 384×384 and 45 FPS at 512×512, with initial video generation latencies of 140 ms and 215 ms, respectively.
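To make the few-step, state-conditioned inference concrete, the following is a minimal sketch of how a consistency-distilled denoiser could be sampled with audio, reference-image, and state-label conditioning. It is not the released implementation; `VideoDenoiser`, `STATE_IDS`, the sigma schedule, and all tensor shapes are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of few-step, class-conditioned sampling
# in the spirit of a consistency-distilled audio-image-to-video model.
import torch
import torch.nn as nn

STATE_IDS = {"speaking": 0, "listening": 1, "idle": 2}  # assumed class labels

class VideoDenoiser(nn.Module):
    """Stand-in denoiser: maps noisy latents plus conditions to a clean-latent estimate."""
    def __init__(self, latent_dim=4, cond_dim=64, num_states=3):
        super().__init__()
        self.state_emb = nn.Embedding(num_states, cond_dim)
        self.net = nn.Conv3d(latent_dim, latent_dim, kernel_size=3, padding=1)

    def forward(self, z_t, sigma, audio_feat, ref_feat, state_id):
        # A real model would fuse audio, reference-image, and state embeddings
        # via cross-attention; here we only show the conditioning interface.
        _ = self.state_emb(state_id) + audio_feat + ref_feat
        return self.net(z_t)

@torch.no_grad()
def few_step_sample(model, audio_feat, ref_feat, state_id, num_frames,
                    sigmas=(14.6, 3.6, 0.9)):  # assumed 3-step noise schedule
    """Few-step sampling: each step predicts a clean latent, then re-noises to the next sigma."""
    z = torch.randn(1, 4, num_frames, 48, 48) * sigmas[0]
    for i, sigma in enumerate(sigmas):
        x0 = model(z, sigma, audio_feat, ref_feat, state_id)
        if i + 1 < len(sigmas):
            z = x0 + sigmas[i + 1] * torch.randn_like(x0)  # re-inject noise at the next level
        else:
            z = x0
    return z  # decoded to RGB frames by a VAE decoder in the full pipeline

if __name__ == "__main__":
    model = VideoDenoiser()
    audio_feat = torch.zeros(1, 64)   # placeholder audio embedding for one clip
    ref_feat = torch.zeros(1, 64)     # placeholder reference-portrait embedding
    state = torch.tensor([STATE_IDS["speaking"]])
    latents = few_step_sample(model, audio_feat, ref_feat, state, num_frames=12)
    print(latents.shape)  # torch.Size([1, 4, 12, 48, 48])
```

The key point of the sketch is that only a handful of denoiser evaluations are needed per clip, which is what allows real-time generation once combined with quantization and pipeline parallelism.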
Overview of the proposed method. Our pipeline introduces several novel modules. 1) Before feeding the reference portrait into the ReferenceNet, we apply portrait animation to adjust its facial expression to match a provided template. 2) The avatar's states are determined from the input audio through class labels; these labels can be inferred directly from acoustic features or, alternatively, guided by an LLM that indicates the appropriate state. 3) The length of the sequential noisy latents is fixed during the early training iterations and then becomes dynamic, enabling the model to acquire the capability of variable-length video generation (see the sketch below).
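For illustration, the snippet below sketches the two-phase latent-length schedule described in (3) under assumed hyperparameters; `warmup_steps`, `fixed_len`, `min_len`, and `max_len` are hypothetical values, not taken from the paper.

```python
import random

def sample_latent_length(step, warmup_steps=10_000, fixed_len=16, min_len=4, max_len=16):
    """Number of noisy latent frames to use at a given training step (assumed schedule)."""
    if step < warmup_steps:
        return fixed_len                       # phase 1: fixed length stabilizes early training
    return random.randint(min_len, max_len)    # phase 2: random lengths teach variable-length generation

# e.g. sample_latent_length(500) -> 16; sample_latent_length(20_000) -> a value in [4, 16]
```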
Our model enables real-time communication between two digital avatars.
Our method is capable of generating realistic facial expressions and natural head movements while ensuring low latency and real-time performance.
We achieve controllable facial expression manipulation by employing portrait animation on the same portrait image.
Virtual interviewer
Chatbot on mobile phone