Given a speaker video, we extract the audio and motion of the
speaker. From these multimodal speaker inputs, our method synthesizes multiple realistic listener 3D motion
sequences (top and bottom) in an autoregressive fashion. The output of our approach can optionally be rendered as
photorealistic video. Please see the supplementary video for results.
Abstract
We present a framework for modeling interactional communication in dyadic conversations: given multimodal inputs
of a speaker, we autoregressively output multiple possibilities of corresponding listener motion. We combine the
motion and speech audio of the speaker using a motion-audio cross-attention transformer. Furthermore, we enable
non-deterministic prediction by learning a discrete latent representation of realistic listener motion with a
novel motion-encoding VQ-VAE. Our method organically captures the multimodal and non-deterministic nature of
nonverbal dyadic interactions. Moreover, it produces realistic 3D listener facial motion synchronous with the
speaker (see video). We demonstrate that our method outperforms baselines qualitatively and quantitatively via a
rich suite of experiments. To facilitate this line of research, we introduce a novel and large in-the-wild
dataset of dyadic conversations.
Overview
The goal of our work is to model the conversational dynamics between a speaker and listener. We introduce a
novel motion VQ-VAE that allows us to output non-deterministic listener motion sequences in an autoregressive
manner. Given speaker motion and audio as inputs, our approach generates realistic, synchronous, and diverse
listener motion sequences that outperform the prior state of the art.
Method
(1) To represent the manifold of realistic listener facial motion, we extend VQ-VAE to the domain of motion
synthesis. The learned discrete representation of motion enables us to model the next time step of motion as a
multinomial distribution (see the quantization sketch below).
(2) We use a motion-audio cross-modal transformer that learns to fuse the speaker's audio and facial
motion (see the fusion sketch below).
(3) We then learn an autoregressive transformer-based predictor that takes the speaker and listener
embeddings as input and outputs a distribution over possible synchronous and realistic listener responses, from
which we can sample multiple trajectories (see the sampling sketch below).
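As a minimal sketch of the quantization step in (1): standard vector quantization maps each continuous motion encoding to its nearest entry in a learned codebook, with a straight-through estimator for gradients. The codebook size, embedding dimension, and commitment weight below are illustrative assumptions, not the values of our released model.

import torch
import torch.nn as nn

class MotionQuantizer(nn.Module):
    """Vector-quantization layer for continuous listener-motion encodings."""
    def __init__(self, num_codes=256, code_dim=128, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment-loss weight

    def forward(self, z_e):
        # z_e: (batch, time, code_dim) continuous encodings of listener motion
        flat = z_e.reshape(-1, z_e.size(-1))             # (B*T, D)
        dists = torch.cdist(flat, self.codebook.weight)  # distance to every code
        idx = dists.argmin(dim=-1)                       # nearest code per time step
        z_q = self.codebook(idx).view_as(z_e)            # quantized motion tokens
        # VQ-VAE losses: move codes toward encoder outputs and commit the encoder to its codes
        loss = ((z_q - z_e.detach()) ** 2).mean() + self.beta * ((z_q.detach() - z_e) ** 2).mean()
        # straight-through estimator so gradients flow back to the encoder
        z_q = z_e + (z_q - z_e).detach()
        return z_q, idx.view(z_e.shape[:-1]), loss

Reducing each time step to a discrete code index is what lets the predictor treat next-step motion as a multinomial distribution over the codebook.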
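The fusion in (2) can be sketched as a cross-attention layer in which speaker motion features query speaker audio features. The layer sizes and the use of PyTorch's built-in MultiheadAttention are simplifying assumptions for illustration.

import torch.nn as nn

class MotionAudioCrossAttention(nn.Module):
    """One cross-modal block: motion features attend to audio features."""
    def __init__(self, dim=128, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, motion_feats, audio_feats):
        # motion_feats: (B, T_m, dim) speaker facial-motion features (queries)
        # audio_feats:  (B, T_a, dim) speaker audio features (keys and values)
        fused, _ = self.attn(query=motion_feats, key=audio_feats, value=audio_feats)
        return self.norm(motion_feats + fused)  # residual connection + layer norm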
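Step (3) then reduces to autoregressive sampling over codebook indices; drawing from the predicted multinomial rather than taking the argmax is what yields multiple distinct listener trajectories. The predictor(speaker_embed, past_codes) callable below is a hypothetical placeholder, not our released API.

import torch

@torch.no_grad()
def sample_listener_codes(predictor, speaker_embed, num_steps, num_samples=3):
    """Draw several listener-motion token sequences conditioned on one speaker."""
    trajectories = []
    for _ in range(num_samples):
        codes = []  # listener motion tokens sampled so far
        for _ in range(num_steps):
            past = torch.tensor(codes, dtype=torch.long).unsqueeze(0) if codes else None
            logits = predictor(speaker_embed, past)     # (1, K) logits over the codebook
            probs = torch.softmax(logits, dim=-1)
            codes.append(torch.multinomial(probs, num_samples=1).item())
        trajectories.append(codes)  # each sequence is decoded to motion by the VQ-VAE decoder
    return trajectories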
Given a speaker's facial motion and audio, our method generates synchronous, realistic listeners.
Multiple Samples
Our method generates multiple possible listener trajectories from a single speaker input.
Comparison vs. Baselines
Our method outperforms existing baselines both qualitatively and quantitatively.
Comparison vs. Ablations
Comparisons against ablations demonstrate the contribution of each component of our method.
Fun Results
Since our method generalizes to unseen speakers, we can generate results on novel speakers. We thank Devi Parikh
for allowing us to use her podcast videos.
Paper
BibTeX
@article{ng2022learning2listen,
  title={Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion},
  author={Ng, Evonne and Joo, Hanbyul and Hu, Liwen and Li, Hao and Darrell, Trevor
          and Kanazawa, Angjoo and Ginosar, Shiry},
  journal={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2022}
}
Acknowledgements
The authors would like to thank Justine Cassell, Alyosha Efros, Alison Gopnik, Jitendra Malik, and the Facebook
FRL team for many insightful conversations and comments; Dave Epstein and Karttikeya Mangalam for Transformer
advice; and Ruilong Li and Ethan Weber for technical support. The work of Ng and Darrell is supported in part by DoD
including DARPA’s XAI, LwLL, Machine Common Sense and/or SemaFor programs, which also supports Hu and Li in part,
as well as BAIR’s industrial alliance programs. Ginosar’s work is funded by the NSF under Grant #2030859 to the
Computing Research Association for the CIFellows Project. Parent authors would like to thank their children for
the daily reminder that they should learn how to listen.