MECo: Motion-example-controlled Co-speech Gesture Generation Leveraging Large Language Models

SIGGRAPH 2025
Bohong Chen1, Yumeng Li1, Youyi Zheng1, Yao-Xiang Ding1, Kun Zhou1
1Zhejiang University, China

Co-speech gestures generated by MECo, following a motion example.

Abstract

The automatic generation of controllable co-speech gestures has recently gained growing attention. While existing systems typically achieve gesture control through predefined categorical labels or implicit pseudo-labels derived from motion examples, these approaches often compromise the rich details present in the original motion examples. We present MECo, a framework for motion-example-controlled co-speech gesture generation by leveraging large language models (LLMs). Our method capitalizes on LLMs’ comprehension capabilities through fine-tuning to simultaneously interpret speech audio and motion examples, enabling the synthesis of gestures that preserve example-specific characteristics while maintaining speech congruence. Departing from conventional pseudo-labeling paradigms, we position motion examples as explicit query contexts within the prompt structure to guide gesture generation. Experimental results demonstrate state-of-the-art performance across three metrics: Fréchet Gesture Distance (FGD), motion diversity, and example-gesture similarity. Furthermore, our framework enables granular control of individual body parts and accommodates diverse input modalities including motion clips, static poses, human video sequences, and textual descriptions.
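To make the "explicit query context" idea above concrete, the following minimal Python sketch shows one plausible way a motion example and speech audio could be laid out in a single prompt for the fine-tuned LLM. The special tokens, token formats, and sequence layout here are our own illustrative assumptions, not the paper's actual interface.

# Hypothetical prompt layout: the motion example sits in the prompt as an
# explicit query context next to the speech tokens, and the fine-tuned LLM
# continues the sequence with gesture tokens. All names are illustrative.
from dataclasses import dataclass
from typing import List

AUDIO_BOS, AUDIO_EOS = "<audio>", "</audio>"
EXAMPLE_BOS, EXAMPLE_EOS = "<example>", "</example>"
GESTURE_BOS = "<gesture>"

@dataclass
class GesturePrompt:
    audio_tokens: List[int]    # discretized speech features (e.g., from an audio codec)
    example_tokens: List[int]  # quantized motion-example codes (e.g., from a motion VQ-VAE)

    def to_sequence(self) -> List[str]:
        """Flatten the prompt: speech context, then the motion example as an
        explicit query context, then the cue for gesture generation."""
        return (
            [AUDIO_BOS] + [f"<a{t}>" for t in self.audio_tokens] + [AUDIO_EOS]
            + [EXAMPLE_BOS] + [f"<m{t}>" for t in self.example_tokens] + [EXAMPLE_EOS]
            + [GESTURE_BOS]  # the LLM autoregressively emits motion tokens from here
        )

# Usage: a 3-token audio clip and a 4-token motion example.
prompt = GesturePrompt(audio_tokens=[12, 7, 99], example_tokens=[3, 3, 41, 8])
print(" ".join(prompt.to_sequence()))

In such a layout the example is kept as a full token sequence in the prompt rather than being collapsed into a pseudo-label, which is the distinction the abstract draws.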


Given speech audio and a reference motion sequence, MECo synthesizes co-speech gestures that are stylistically consistent with the reference motion. Furthermore, it provides granular control of individual body parts and accommodates diverse input modalities, including motion clips, static poses, human video sequences, and textual descriptions.

Extensive experiments and a user study show that our model achieves state-of-the-art performance across three metrics: Fréchet Gesture Distance (FGD), motion diversity, and example-gesture similarity.

Arbitrary Control with Multimodal Prompts

Our system enables flexible control of synergistic full-body motion generation conditioned on speech and user prompts simultaneously. Some results are shown below, followed by a sketch of how the different prompt modalities could be handled.

Motion Clip Prompt

(The left video is the motion clip prompt, and the right video shows the results.)

Static Pose Prompt

(The left picture is the static pose prompt, and the right video shows the results.)

Human Video Prompt

(The left video is the human video prompt, and the right video shows the results.)


Text Prompt

(The left video is the text prompt, and the right video shows the results.)
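The four prompt types above suggest a common pre-processing step. The sketch below shows one plausible way to reduce each modality to a motion-example representation before tokenization; the conversion routes (tiling a static pose, pose estimation for video, a text-to-motion model for text) are assumptions for illustration and are not specified by the paper.

# Hypothetical routing of multimodal prompts to a shared motion-example format.
import numpy as np

def example_from_motion_clip(clip: np.ndarray) -> np.ndarray:
    """A motion clip (frames x joints x 3) is already in example form."""
    return clip

def example_from_static_pose(pose: np.ndarray, n_frames: int = 60) -> np.ndarray:
    """Repeat a single pose (joints x 3) over time to form a short example."""
    return np.repeat(pose[None, ...], n_frames, axis=0)

def example_from_video(video_frames: list) -> np.ndarray:
    """Placeholder: run a 3D pose estimator per frame (not implemented here)."""
    raise NotImplementedError("plug in a video-based motion capture model")

def example_from_text(description: str) -> np.ndarray:
    """Placeholder: synthesize an example with a text-to-motion model."""
    raise NotImplementedError("plug in a text-to-motion model")

# Usage: a dummy static pose turned into a 60-frame motion example.
dummy_pose = np.zeros((55, 3))   # 55 joints, xyz
example = example_from_static_pose(dummy_pose)
print(example.shape)             # (60, 55, 3)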

Control in Body Parts

Our method supports control at the granularity of individual body parts. We use different motion examples to control the upper body and the lower body; the generated results show that the gestures conform to their respective motion examples while the body parts remain coordinated with each other and aligned with the speech audio.
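As a minimal sketch of how such per-part control could be expressed in the prompt, each body part might receive its own tagged motion-example span. The tag names and the upper/lower split below are illustrative assumptions, not the paper's actual scheme.

# Hypothetical per-part query contexts: one tagged token span per body part.
from typing import Dict, List

PART_TAGS = {"upper_body": ("<upper>", "</upper>"),
             "lower_body": ("<lower>", "</lower>")}

def build_part_context(part_examples: Dict[str, List[int]]) -> List[str]:
    """Concatenate one tagged motion-example span per controlled body part."""
    context: List[str] = []
    for part, tokens in part_examples.items():
        bos, eos = PART_TAGS[part]
        context += [bos] + [f"<m{t}>" for t in tokens] + [eos]
    return context

# Usage: a waving clip drives the upper body, a walking clip the lower body.
context = build_part_context({"upper_body": [5, 5, 17], "lower_body": [88, 2]})
print(" ".join(context))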

More Results

Thank you for watching! If you are interested in our work, please check our full video for more results.

BibTeX


@inproceedings{chen2025meco,
  author = {Bohong Chen and Yumeng Li and Youyi Zheng and Yao-Xiang Ding and Kun Zhou},
  title = {Motion-example-controlled Co-speech Gesture Generation Leveraging Large Language Models},
  year = {2025},
  isbn = {9798400715402},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3721238.3730611},
  doi = {10.1145/3721238.3730611},
  booktitle = {Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers},
  series = {SIGGRAPH Conference Papers '25}
}