Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion

Ng, Evonne; Joo, Hanbyul; Hu, Liwen; Li, Hao; Darrell, Trevor; Kanazawa, Angjoo; Ginosar, Shiry

arXiv:2204.08451 (cs)

[Submitted on 18 Apr 2022]

Abstract: We present a framework for modeling interactional communication in dyadic conversations: given multimodal inputs of a speaker, we autoregressively output multiple possibilities of corresponding listener motion. We combine the motion and speech audio of the speaker using a motion-audio cross attention transformer. Furthermore, we enable non-deterministic prediction by learning a discrete latent representation of realistic listener motion with a novel motion-encoding VQ-VAE. Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions. Moreover, it produces realistic 3D listener facial motion synchronous with the speaker (see video). We demonstrate that our method outperforms baselines qualitatively and quantitatively via a rich suite of experiments. To facilitate this line of research, we introduce a novel and large in-the-wild dataset of dyadic conversations. Code, data, and videos are available at this https URL.
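
The abstract names two generic ingredients: a discrete latent representation (the VQ-VAE codebook) and non-deterministic, autoregressive output. The NumPy sketch below illustrates only those two general ideas and is not the authors' implementation. The function names `quantize` and `sample_next_code`, the toy codebook, and the temperature parameter are all assumptions introduced for illustration; the paper's actual architecture, codebook size, and sampling scheme are not given in this listing.

```python
import numpy as np


def quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry (hypothetical helper).

    latents:  array of shape (T, D), one continuous vector per time step.
    codebook: array of shape (K, D), the K learned discrete codes.
    Returns the chosen indices (T,) and the quantized vectors (T, D).
    """
    # Squared Euclidean distance between every latent and every code.
    diffs = latents[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)      # shape (T, K)
    indices = np.argmin(dists, axis=1)       # shape (T,)
    return indices, codebook[indices]


def sample_next_code(logits, rng, temperature=1.0):
    """Sample one codebook index from predicted logits (hypothetical helper).

    Sampling, rather than taking the argmax, is one common way to obtain
    several different plausible outputs for the same input.
    """
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max()           # subtract max for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs))


# Toy codebook with three 2-D codes (illustrative values only).
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
latents = np.array([[0.1, -0.1], [4.0, 4.5], [0.9, 1.2]])
indices, quantized = quantize(latents, codebook)

rng = np.random.default_rng(0)
next_code = sample_next_code(np.array([0.0, 0.0, 100.0]), rng)
```

In this sketch, calling `sample_next_code` repeatedly on flatter logits would yield different code sequences for the same speaker input, which is the sense in which a model over discrete codes can produce multiple possible listener motions.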

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2204.08451 [cs.CV]
	(or arXiv:2204.08451v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2204.08451

Submission history

From: Evonne Ng
[v1] Mon, 18 Apr 2022 17:58:04 UTC (13,030 KB)
