DyStream generates talking-head videos from a single reference image and a dyadic audio stream. First, an Autoencoder disentangles the reference image into a static appearance feature v_app and an initial, identity-agnostic motion feature m_0. Next, the Audio-to-Motion Generator takes the initial motion m_0 and the audio stream as input and generates a sequence of audio-aligned motion features m_1:N. Finally, the Autoencoder's decoder synthesizes the output video by warping the appearance feature v_app according to the generated motion sequence m_1:N.
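The pipeline thus reduces to three calls: encode appearance once, generate a motion sequence from audio, then warp one frame per motion step. The PyTorch-style sketch below illustrates this flow; all module names, interfaces, and tensor shapes (e.g. app_enc, audio2motion, warp_decoder) are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of the inference pipeline described above.
# Module names and tensor shapes are assumptions made for illustration.
import torch
import torch.nn as nn

class TalkingHeadPipeline(nn.Module):
    def __init__(self, app_enc: nn.Module, motion_enc: nn.Module,
                 audio2motion: nn.Module, warp_decoder: nn.Module):
        super().__init__()
        self.app_enc = app_enc            # reference image -> static appearance feature v_app
        self.motion_enc = motion_enc      # reference image -> initial motion feature m_0
        self.audio2motion = audio2motion  # (m_0, audio stream) -> motion sequence m_1:N
        self.warp_decoder = warp_decoder  # warps v_app with one motion feature -> one frame

    @torch.no_grad()
    def forward(self, ref_image: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        v_app = self.app_enc(ref_image)          # appearance stays fixed for the whole clip
        m_0 = self.motion_enc(ref_image)         # identity-agnostic starting motion
        m_seq = self.audio2motion(m_0, audio)    # [B, N, D] audio-aligned motion features
        frames = [self.warp_decoder(v_app, m_seq[:, t])   # one warped frame per motion step
                  for t in range(m_seq.size(1))]
        return torch.stack(frames, dim=1)        # [B, N, 3, H, W] output video
```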
Our model comprises two core modules: an autoregressive network (blue) and a flow matching head (orange). The autoregressive network, built from causal self-attention and MLP blocks, processes the audio, anchor, and previous motion inputs to produce a conditioning signal c_N. This signal is fed into the flow matching head, a stack of MLP and AdaLN layers, where it is injected via AdaLN to guide a multi-step flow matching process that produces the final motion m_N. Finally, the newly generated motion m_N is used to warp the reference image into the output frame, while simultaneously being fed back into the autoregressive network as input for subsequent generation.
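To make the conditioning path concrete, the sketch below shows a flow matching head whose MLP blocks are modulated via AdaLN by the conditioning signal c_N, with the motion m_N sampled from Gaussian noise by a few Euler steps. The block structure, layer widths, step count, Euler sampler, and the assumption that the motion and condition share one dimension are all our own illustrative choices, not the paper's implementation; the causal autoregressive network that produces c_N from the audio, anchor, and previous motions is omitted.

```python
# Sketch of the flow matching head with AdaLN conditioning (assumed design).
import torch
import torch.nn as nn

class AdaLNMLPBlock(nn.Module):
    """MLP block whose LayerNorm scale/shift are predicted from the condition c."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(dim, 2 * dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale_shift(c).chunk(2, dim=-1)
        return x + self.mlp(self.norm(x) * (1 + scale) + shift)

class FlowMatchingHead(nn.Module):
    """Predicts a velocity field; a few Euler steps map noise to the motion m_N."""
    def __init__(self, dim: int, depth: int = 4):
        super().__init__()
        self.time_embed = nn.Linear(1, dim)
        self.blocks = nn.ModuleList([AdaLNMLPBlock(dim) for _ in range(depth)])
        self.out = nn.Linear(dim, dim)

    def velocity(self, x: torch.Tensor, t: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        h = x + self.time_embed(t)
        for blk in self.blocks:
            h = blk(h, c)                       # condition c_N injected via AdaLN
        return self.out(h)

    @torch.no_grad()
    def sample(self, c: torch.Tensor, steps: int = 8) -> torch.Tensor:
        x = torch.randn_like(c)                 # start from Gaussian noise
        for i in range(steps):                  # simple Euler integration of dx/dt = v
            t = torch.full((c.size(0), 1), i / steps, device=c.device)
            x = x + self.velocity(x, t, c) / steps
        return x                                # generated motion feature m_N
```

In an autoregressive loop, the sampled m_N would both be decoded into the next output frame and appended to the network's motion context for the following step.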
Results on the test set of our training dataset.
Results on out-of-distribution real images from the wild.
Results on out-of-distribution generated images from Nanobanana.
For each pair of results, the left is our method and the right is INFP (our reproduction). INFP is a chunk-based offline model; we reduce its chunk size to match the latency of our method.
In each group of videos, we compare the offline model (left), online w. lookahead (middle), and online w/o lookahead (right).
offline model
online w. lookahead
online w/o lookahead
offline model
online w. lookahead
online w/o lookahead
offline model
online w. lookahead
online w/o lookahead
We compare the results w/o anchor (left), w. random weighted anchor (middle), and w. last weighted anchor (right).
w/o anchor
w. random weighted anchor
w. last weighted anchor