DyStream: Streaming Dyadic Talking Heads Generation via Flow Matching-based Autoregressive Model

Zhejiang University, The University of Tokyo


DyStream is a flow matching-based autoregressive model that generates talking-head videos in real time from a reference image and dual-stream audio. It supports both speaking and listening with under 100 ms latency and achieves state-of-the-art lip-sync quality.
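To make the streaming setting concrete, below is a minimal sketch of a per-frame driver loop that consumes one audio frame from each of the two streams (speaker and listener) per step and emits one video frame. The `pipeline` object and its `init`/`step` interface are hypothetical placeholders for illustration, not a released API.

```python
# Minimal sketch of a per-frame streaming loop. The pipeline object and its
# init/step interface are hypothetical placeholders, not DyStream's released API.
import time

def run_streaming(pipeline, reference_image, speaker_stream, listener_stream):
    """Consume one audio frame from each stream per step and emit one video frame."""
    state = pipeline.init(reference_image)  # encode appearance + initial motion once
    for a_spk, a_lis in zip(speaker_stream, listener_stream):
        t0 = time.perf_counter()
        frame, state = pipeline.step(state, a_spk, a_lis)  # one autoregressive step
        latency_ms = (time.perf_counter() - t0) * 1e3      # target: under 100 ms per frame
        yield frame, latency_ms
```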

System pipeline


DyStream generates talking-head videos from a single reference image and a dyadic audio stream. First, an Autoencoder disentangles the reference image into a static appearance feature v_app and an initial, identity-agnostic motion feature m_0. Next, the Audio-to-Motion Generator takes the initial motion m_0 and the audio stream as input and generates a sequence of audio-aligned motion features m_1:N. Finally, the Autoencoder's decoder synthesizes the output video by warping the appearance feature v_app according to the generated motion sequence m_1:N.
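As a rough illustration of this three-stage pipeline, the sketch below assumes hypothetical `encode`/`decode` methods on the autoencoder and a callable audio-to-motion generator; shapes and names are illustrative only, not the paper's exact interfaces.

```python
import torch

@torch.no_grad()
def generate_video(autoencoder, audio_to_motion, ref_image, speaker_audio, listener_audio):
    # 1) Disentangle the reference image into appearance and initial motion features.
    v_app, m_0 = autoencoder.encode(ref_image)   # static appearance, identity-agnostic motion
    # 2) Generate audio-aligned motion features m_1..m_N from the dyadic audio streams.
    motions = audio_to_motion(m_0, speaker_audio, listener_audio)   # [N, motion_dim]
    # 3) Decode each frame by warping the appearance feature with the corresponding motion.
    frames = [autoencoder.decode(v_app, m_t) for m_t in motions]
    return torch.stack(frames)                   # [N, 3, H, W]
```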

The Architecture of Our Audio-to-Motion Generator


Our model comprises two core modules: an autoregressive network (blue) and a flow matching head (orange). The autoregressive network, built from causal self-attention and MLP blocks, processes the audio, anchor, and previous motion inputs to generate a conditioning signal c_N. This signal is fed into the flow matching head, a stack of MLPs and AdaLN layers, where it is injected via AdaLN to guide a multi-step flow matching process that produces the final motion m_N. Finally, the newly generated motion m_N is used to warp the reference image into the output frame, while simultaneously being fed back into the autoregressive network as input for the next generation step.
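A minimal PyTorch sketch of how the AdaLN-conditioned flow matching head and its multi-step sampling could be wired is shown below; the layer widths, depth, number of integration steps, and class names are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AdaLNBlock(nn.Module):
    """MLP block whose normalization is modulated (scale/shift) by the conditioning signal."""
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x, cond):
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return x + self.mlp(self.norm(x) * (1 + scale) + shift)

class FlowMatchingHead(nn.Module):
    """Predicts a velocity field v(x_t, t | c); sampling integrates it from noise to motion."""
    def __init__(self, motion_dim, cond_dim, hidden=512, depth=4):
        super().__init__()
        self.in_proj = nn.Linear(motion_dim + 1, hidden)   # +1 channel for the flow time t
        self.blocks = nn.ModuleList(AdaLNBlock(hidden, cond_dim) for _ in range(depth))
        self.out_proj = nn.Linear(hidden, motion_dim)

    def forward(self, x_t, t, cond):
        h = self.in_proj(torch.cat([x_t, t], dim=-1))
        for block in self.blocks:
            h = block(h, cond)
        return self.out_proj(h)

@torch.no_grad()
def sample_motion(head, cond, motion_dim, steps=10):
    """Euler integration of the learned flow from Gaussian noise (t=0) to a motion feature (t=1)."""
    x = torch.randn(cond.shape[0], motion_dim)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((cond.shape[0], 1), i * dt)
        x = x + dt * head(x, t, cond)
    return x
```

In this sketch, `cond` would be the conditioning signal c_N produced by the autoregressive network for the current step, and the sampled motion m_N is both decoded into a frame and fed back as input for the next step.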

Results driven by dyadic audio

Results on the test split of our training dataset.

Results on out-of-distribution real images from the wild.

Results on out-of-distribution generated images from Nanobanana.

Comparison with a Chunk-based Offline Model

For each pair of results, the left is our method and the right is INFP (our reproduction). INFP is a chunk-based offline model; we reduce its chunk size to match the latency of our method.

Ablation on the audio lookahead

In each group of videos, we compare the offline model (left), online w. lookahead (middle), and online w/o lookahead (right).

offline model

online w. lookahead

online w/o lookahead

Ablation on anchor

We compare results w/o anchor (left), w. random weighted anchor (middle), and w. last weighted anchor (right).

w/o anchor

w. random weighted anchor

w. last weighted anchor