Given a speaker video, we extract the audio and motion of the speaker. From these multimodal speaker inputs, our method synthesizes multiple realistic listener 3D motion sequences (top and bottom) in an autoregressive fashion. The output of our approach can optionally be rendered as photorealistic video. Please see the supplementary video for results.


   title={Learning to Listen: Modeling Non-Deterministic
      Dyadic Facial Motion},
   author={Ng, Evonne and Joo, Hanbyul and Hu, Liwen
      and Li, Hao and Darrell, Trevor
      and Kanazawa, Angjoo and Ginosar, Shiry},
   journal={Proceedings of the IEEE/CVF Conference
      on Computer Vision and Pattern Recognition},


The authors would like to thank Justine Cassell, Alyosha Efros, Alison Gopnik, Jitendra Malik, and the Facebook FRL team for many insightful conversations and comments; Dave Epstein and Karttikeya Mangalam for Transformer advice; and Ruilong Li and Ethan Weber for technical support. The work of Ng and Darrell is supported in part by the DoD, including DARPA’s XAI, LwLL, Machine Common Sense, and/or SemaFor programs, which also supports Hu and Li in part, as well as by BAIR’s industrial alliance programs. Ginosar’s work is funded by the NSF under Grant #2030859 to the Computing Research Association for the CIFellows Project. Parent authors would like to thank their children for the daily reminder that they should learn how to listen.