The automatic generation of controllable co-speech gestures has recently gained growing attention. While existing systems typically achieve gesture control through predefined categorical labels or implicit pseudo-labels derived from motion examples, these approaches often compromise the rich details present in the original motion examples. We present MECo, a framework for motion-example-controlled co-speech gesture generation by leveraging large language models (LLMs). Our method capitalizes on LLMs' comprehension capabilities through fine-tuning to simultaneously interpret speech audio and motion examples, enabling the synthesis of gestures that preserve example-specific characteristics while maintaining speech congruence. Departing from conventional pseudo-labeling paradigms, we position motion examples as explicit query contexts within the prompt structure to guide gesture generation. Experimental results demonstrate state-of-the-art performance across three metrics: Fréchet Gesture Distance (FGD), motion diversity, and example-gesture similarity. Furthermore, our framework enables granular control of individual body parts and accommodates diverse input modalities including motion clips, static poses, human video sequences, and textual descriptions.
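To give a rough feel for what "motion examples as explicit query contexts" can mean in practice, the sketch below assembles a toy prompt in which quantized motion-example tokens sit next to speech-audio tokens before a generation marker. The codebook, vocabulary offsets, special-token ids, and helper names are illustrative assumptions for exposition, not the actual MECo implementation.

```python
# Illustrative sketch only: a motion example injected as explicit query context
# in an LLM prompt, next to speech tokens. Token layout, codebook, and helper
# names are assumptions, not the authors' actual implementation.
import numpy as np

rng = np.random.default_rng(0)

# Toy VQ-style codebook: each pose frame (here a 64-D feature) is mapped to the
# index of its nearest codebook entry, giving discrete tokens an LLM can consume.
CODEBOOK = rng.normal(size=(512, 64))           # 512 motion "words"
AUDIO_VOCAB_OFFSET = 0                          # ids [0, 1024): audio tokens (assumed)
MOTION_VOCAB_OFFSET = 1024                      # ids [1024, 1536): motion tokens (assumed)

def quantize_motion(frames: np.ndarray) -> list[int]:
    """Map each pose frame to its nearest codebook index (toy VQ)."""
    dists = np.linalg.norm(frames[:, None, :] - CODEBOOK[None, :, :], axis=-1)
    return (dists.argmin(axis=1) + MOTION_VOCAB_OFFSET).tolist()

def build_prompt(audio_tokens: list[int], example_frames: np.ndarray) -> list[int]:
    """Place the motion example as explicit query context, not a pseudo-label."""
    BOS, SEP, GEN = 2000, 2001, 2002            # special ids (assumed)
    example_tokens = quantize_motion(example_frames)
    return [BOS, *audio_tokens, SEP, *example_tokens, GEN]

# Dummy inputs: 20 audio tokens and a 30-frame motion example.
audio_tokens = rng.integers(AUDIO_VOCAB_OFFSET, 1024, size=20).tolist()
example_frames = rng.normal(size=(30, 64))

prompt = build_prompt(audio_tokens, example_frames)
print(len(prompt), prompt[:5])  # a fine-tuned LLM would continue autoregressively from GEN
```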
Given speech audio and a reference motion sequence, MECo synthesizes co-speech gestures that are stylistically consistent with the reference motion. Furthermore, it provides granular control of individual body parts and accommodates diverse input modalities, including motion clips, static poses, human video sequences, and textual descriptions.
Extensive experiments and a user study show that our model achieves state-of-the-art performance across three metrics: Fréchet Gesture Distance (FGD), motion diversity, and example-gesture similarity.
Our system enables flexible control of synergistic full-body motion generation, conditioned on both speech and user prompts simultaneously. Some results are shown below.
(The left video is the motion clip prompt, and the right video shows the results.)
(The left picture is the static pose prompt, and the right video shows the results.)
(The left video is the human video prompt, and the right video shows the results.)
(The left video is the text prompt, and the right video shows the results.)
Our method supports body-part-level granular control: we use different motion examples to control the upper body and the lower body. The generated results demonstrate that we can produce gestures that conform to the motion examples, with coordinated motion between body parts and alignment with the speech audio.
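To make the body-part control above concrete, here is a minimal sketch of how two motion examples could be stitched into a single part-wise conditioning sequence using per-joint masks. The skeleton layout, joint indices, and array shapes are assumptions for exposition and do not reflect the exact MECo pipeline.

```python
# Illustrative sketch of body-part granular control: two motion examples are
# combined with per-joint masks before being used as the query context, so the
# upper body follows one example and the lower body another. Joint indices,
# shapes, and the masking scheme are assumptions for exposition.
import numpy as np

NUM_JOINTS, FEAT = 25, 6                     # assumed skeleton: 25 joints, 6-D per joint
UPPER_JOINTS = list(range(0, 14))            # assumed upper-body joint ids
LOWER_JOINTS = list(range(14, 25))           # assumed lower-body joint ids

def combine_examples(upper_ex: np.ndarray, lower_ex: np.ndarray) -> np.ndarray:
    """Stitch two (T, J, F) examples into one part-wise conditioning sequence."""
    T = min(len(upper_ex), len(lower_ex))
    combined = np.empty((T, NUM_JOINTS, FEAT))
    combined[:, UPPER_JOINTS] = upper_ex[:T, UPPER_JOINTS]
    combined[:, LOWER_JOINTS] = lower_ex[:T, LOWER_JOINTS]
    return combined

rng = np.random.default_rng(1)
waving = rng.normal(size=(60, NUM_JOINTS, FEAT))   # dummy "upper-body" example
walking = rng.normal(size=(90, NUM_JOINTS, FEAT))  # dummy "lower-body" example
conditioning = combine_examples(waving, walking)   # used as the motion-example context
print(conditioning.shape)                          # (60, 25, 6)
```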
Thank you for watching! If you are interested in our work, please check our full video for more results.
@inproceedings{chen2025meco,
author = {Bohong Chen and Yumeng Li and Youyi Zheng and Yao-Xiang Ding and Kun Zhou},
title = {Motion-example-controlled Co-speech Gesture Generation Leveraging Large Language Models},
year = {2025},
isbn = {9798400715402},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3721238.3730611},
doi = {10.1145/3721238.3730611},
booktitle = {Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers},
series = {SIGGRAPH Conference Papers '25}
}