Given a speaker video, we extract the audio and motion of the
speaker. From these multimodal speaker inputs, our method synthesizes multiple realistic listener 3D motion
sequences (top and bottom) in an autoregressive fashion. The output of our approach can optionally be rendered as
photorealistic video. Please see the supplementary video for results.
Abstract
We present a framework for modeling interactional communication in dyadic conversations: given multimodal inputs
of a speaker, we autoregressively output multiple possibilities of corresponding listener motion. We combine the
motion and speech audio of the speaker using a motion-audio cross-attention transformer. Furthermore, we enable
non-deterministic prediction by learning a discrete latent representation of realistic listener motion with a
novel motion-encoding VQ-VAE. Our method organically captures the multimodal and non-deterministic nature of
nonverbal dyadic interactions. Moreover, it produces realistic 3D listener facial motion synchronous with the
speaker (see video). We demonstrate that our method outperforms baselines qualitatively and quantitatively via a
rich suite of experiments. To facilitate this line of research, we introduce a novel and large in-the-wild
dataset of dyadic conversations.
Overview
The goal of our work is to model the conversational dynamics between a speaker and listener. We introduce a
novel motion VQ-VAE that allows us to output non-deterministic listener motion sequences in an autoregressive
manner. Given speaker motion and audio as inputs, our approach generates realistic, synchronous, and diverse
listener motion sequences that outperform the prior state of the art.
Method
(1) To represent the manifold of realistic listener facial motion, we extend VQ-VAE to the domain of motion
synthesis. The learned discrete representation of motion enables us to model the next time step of motion as a
multinomial distribution (see the quantization sketch below).
(2) We use a motion-audio cross-modal transformer that learns to fuse the speaker's audio and facial
motion (see the fusion sketch below).
(3) We then learn an autoregressive transformer-based predictor that takes the speaker and listener
embeddings as input and outputs a distribution over possible synchronous and realistic listener responses, from
which we can sample multiple trajectories (see the sampling sketch below).
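As a minimal sketch of the quantization step in (1): standard vector quantization maps each continuous motion encoding to its nearest entry in a learned codebook, with a straight-through estimator for gradients. The codebook size, embedding dimension, and commitment weight below are illustrative assumptions, not the values of our released model.

import torch
import torch.nn as nn

class MotionQuantizer(nn.Module):
    """Vector-quantization layer for continuous listener-motion encodings."""
    def __init__(self, num_codes=256, code_dim=128, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment-loss weight

    def forward(self, z_e):
        # z_e: (batch, time, code_dim) continuous encodings of listener motion
        flat = z_e.reshape(-1, z_e.size(-1))             # (B*T, D)
        dists = torch.cdist(flat, self.codebook.weight)  # distance to every code
        idx = dists.argmin(dim=-1)                       # nearest code per time step
        z_q = self.codebook(idx).view_as(z_e)            # quantized motion tokens
        # VQ-VAE losses: move codes toward encoder outputs and commit the encoder to its codes
        loss = ((z_q - z_e.detach()) ** 2).mean() + self.beta * ((z_q.detach() - z_e) ** 2).mean()
        # straight-through estimator so gradients flow back to the encoder
        z_q = z_e + (z_q - z_e).detach()
        return z_q, idx.view(z_e.shape[:-1]), loss

Reducing each time step to a discrete code index is what lets the predictor treat next-step motion as a multinomial distribution over the codebook.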
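The fusion in (2) can be sketched as a cross-attention layer in which speaker motion features query speaker audio features. The layer sizes and the use of PyTorch's built-in MultiheadAttention are simplifying assumptions for illustration.

import torch.nn as nn

class MotionAudioCrossAttention(nn.Module):
    """One cross-modal block: motion features attend to audio features."""
    def __init__(self, dim=128, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, motion_feats, audio_feats):
        # motion_feats: (B, T_m, dim) speaker facial-motion features (queries)
        # audio_feats:  (B, T_a, dim) speaker audio features (keys and values)
        fused, _ = self.attn(query=motion_feats, key=audio_feats, value=audio_feats)
        return self.norm(motion_feats + fused)  # residual connection + layer norm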
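Step (3) then reduces to autoregressive sampling over codebook indices; drawing from the predicted multinomial rather than taking the argmax is what yields multiple distinct listener trajectories. The predictor(speaker_embed, past_codes) callable below is a hypothetical placeholder, not our released API.

import torch

@torch.no_grad()
def sample_listener_codes(predictor, speaker_embed, num_steps, num_samples=3):
    """Draw several listener-motion token sequences conditioned on one speaker."""
    trajectories = []
    for _ in range(num_samples):
        codes = []  # listener motion tokens sampled so far
        for _ in range(num_steps):
            past = torch.tensor(codes, dtype=torch.long).unsqueeze(0) if codes else None
            logits = predictor(speaker_embed, past)     # (1, K) logits over the codebook
            probs = torch.softmax(logits, dim=-1)
            codes.append(torch.multinomial(probs, num_samples=1).item())
        trajectories.append(codes)  # each sequence is decoded to motion by the VQ-VAE decoder
    return trajectories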
Given a speaker's facial motion and audio, our method generates synchronous, realistic listeners.
Multiple Samples
Our method generates multiple possible listener trajectories from a single speaker input.
Comparison vs. Baselines
Our method outperforms existing baselines both qualitatively and quantitatively.
Comparison vs. Ablations
Comparisons against ablations demonstrate the contribution of each component of our method.
Fun Results
Since our method generalizes to unseen speakers, we can generate results on novel speakers. We thank Devi Parikh
for allowing us to use her podcast videos.
Paper
BibTeX
@article{ng2022learning2listen,
  title={Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion},
  author={Ng, Evonne and Joo, Hanbyul and Hu, Liwen and Li, Hao and Darrell, Trevor
          and Kanazawa, Angjoo and Ginosar, Shiry},
  journal={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2022}
}
Acknowledgements
The authors would like to thank Justine Cassell, Alyosha Efros, Alison Gopnik, Jitendra Malik, and the Facebook
FRL team for many insightful conversations and comments; Dave Epstein and Karttikeya Mangalam for Transformer
advice; and Ruilong Li and Ethan Weber for technical support. The work of Ng and Darrell is supported in part by DoD
including DARPA’s XAI, LwLL, Machine Common Sense and/or SemaFor programs, which also supports Hu and Li in part,
as well as BAIR’s industrial alliance programs. Ginosar’s work is funded by the NSF under Grant #2030859 to the
Computing Research Association for the CIFellows Project. Parent authors would like to thank their children for
the daily reminder that they should learn how to listen.