DyStream: Streaming Dyadic Talking Heads Generation via Flow Matching-based Autoregressive Model

Zhejiang University, The University of Tokyo


DyStream is a flow matching-based autoregressive model that generates talking-head videos in real time from a reference image and dual-stream audio. It supports both speaking and listening with under 100 ms latency and achieves state-of-the-art lip-sync quality.
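To make the streaming setting concrete, below is a minimal sketch of a per-frame driver loop that consumes one audio frame from each of the two streams (speaker and listener) per step and emits one video frame. The `pipeline` object and its `init`/`step` interface are hypothetical placeholders for illustration, not a released API.

```python
# Minimal sketch of a per-frame streaming loop. The pipeline object and its
# init/step interface are hypothetical placeholders, not DyStream's released API.
import time

def run_streaming(pipeline, reference_image, speaker_stream, listener_stream):
    """Consume one audio frame from each stream per step and emit one video frame."""
    state = pipeline.init(reference_image)  # encode appearance + initial motion once
    for a_spk, a_lis in zip(speaker_stream, listener_stream):
        t0 = time.perf_counter()
        frame, state = pipeline.step(state, a_spk, a_lis)  # one autoregressive step
        latency_ms = (time.perf_counter() - t0) * 1e3      # target: under 100 ms per frame
        yield frame, latency_ms
```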

System pipeline


DyStream generates talking-head videos from a single reference image and a dyadic audio stream. First, an Autoencoder disentangles the reference image into a static appearance feature v_app and an initial, identity-agnostic motion feature m_0. Next, the Audio-to-Motion Generator takes the initial motion m_0 and the audio stream as input and generates a sequence of audio-aligned motion features m_1:N. Finally, the Autoencoder's decoder synthesizes the output video by warping the appearance feature v_app according to the generated motion sequence m_1:N.
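As a rough illustration of this three-stage pipeline, the sketch below assumes hypothetical `encode`/`decode` methods on the autoencoder and a callable audio-to-motion generator; shapes and names are illustrative only, not the paper's exact interfaces.

```python
import torch

@torch.no_grad()
def generate_video(autoencoder, audio_to_motion, ref_image, speaker_audio, listener_audio):
    # 1) Disentangle the reference image into appearance and initial motion features.
    v_app, m_0 = autoencoder.encode(ref_image)   # static appearance, identity-agnostic motion
    # 2) Generate audio-aligned motion features m_1..m_N from the dyadic audio streams.
    motions = audio_to_motion(m_0, speaker_audio, listener_audio)   # [N, motion_dim]
    # 3) Decode each frame by warping the appearance feature with the corresponding motion.
    frames = [autoencoder.decode(v_app, m_t) for m_t in motions]
    return torch.stack(frames)                   # [N, 3, H, W]
```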

The Architecture of Our Audio-to-Motion Generator


Our model comprises two core modules: an autoregressive network (blue) and a flow matching head (orange). The autoregressive network, built from causal self-attention and MLP blocks, processes the audio, anchor, and previous motion inputs to generate a conditioning signal c_N. This signal is fed into the flow matching head, a stack of MLPs and AdaLN layers, where it is injected via AdaLN to guide a multi-step flow matching process that produces the final motion m_N. Finally, the newly generated motion m_N is used to warp the reference image into the output frame, while simultaneously being fed back into the autoregressive network as input for the next generation step.
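A minimal PyTorch sketch of how the AdaLN-conditioned flow matching head and its multi-step sampling could be wired is shown below; the layer widths, depth, number of integration steps, and class names are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AdaLNBlock(nn.Module):
    """MLP block whose normalization is modulated (scale/shift) by the conditioning signal."""
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x, cond):
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return x + self.mlp(self.norm(x) * (1 + scale) + shift)

class FlowMatchingHead(nn.Module):
    """Predicts a velocity field v(x_t, t | c); sampling integrates it from noise to motion."""
    def __init__(self, motion_dim, cond_dim, hidden=512, depth=4):
        super().__init__()
        self.in_proj = nn.Linear(motion_dim + 1, hidden)   # +1 channel for the flow time t
        self.blocks = nn.ModuleList(AdaLNBlock(hidden, cond_dim) for _ in range(depth))
        self.out_proj = nn.Linear(hidden, motion_dim)

    def forward(self, x_t, t, cond):
        h = self.in_proj(torch.cat([x_t, t], dim=-1))
        for block in self.blocks:
            h = block(h, cond)
        return self.out_proj(h)

@torch.no_grad()
def sample_motion(head, cond, motion_dim, steps=10):
    """Euler integration of the learned flow from Gaussian noise (t=0) to a motion feature (t=1)."""
    x = torch.randn(cond.shape[0], motion_dim)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((cond.shape[0], 1), i * dt)
        x = x + dt * head(x, t, cond)
    return x
```

In this sketch, `cond` would be the conditioning signal c_N produced by the autoregressive network for the current step, and the sampled motion m_N is both decoded into a frame and fed back as input for the next step.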

Results driven by dyadic audio

Results on the test split of our training dataset.

Results on out-of-distribution real images from the wild.

Results on out-of-distribution generated images from Nanobanana.

Comparison with a Chunk-based Offline Model

For each pair of results, the left is our method and the right is INFP (our reproduction). INFP is a chunk-based offline model; we reduce its chunk size to match the latency of our method.

Ablation on the audio lookahead

In each group of videos, we compare the offline model (left), online w. lookahead (middle), and online w/o lookahead (right).

offline model

online w. lookahead

online w/o lookahead

Ablation on anchor

We compare results w/o anchor (left), w. random weighted anchor (middle), and w. last weighted anchor (right).

w/o anchor

w. random weighted anchor

w. last weighted anchor