SARAH: Spatially Aware Real-time Agentic Humans

Meta Reality Labs in Pittsburgh, PA

arXiv 2026

This page walks through the 5-minute video with interactive examples.

TL;DR

We generate full-body conversational motion that is both conversationally-aware and spatially responsive to the user. Our causal, lightweight architecture enables real-time deployment on a live VR headset.

Abstract

As embodied agents become central to VR, telepresence, and digital human applications, their motion must go beyond speech-aligned gestures: agents should turn toward users, respond to their movement, and maintain natural gaze. Current methods lack this spatial awareness.

We close this gap with the first real-time, fully causal method for spatially-aware conversational motion, deployable on a streaming VR headset. Given a user's position and dyadic audio, our approach produces full-body motion that aligns gestures with speech while orienting the agent toward the user.

Our architecture combines a causal transformer-based VAE with interleaved latent tokens for streaming inference and a flow matching model conditioned on user trajectory and audio. To support varying gaze preferences, we introduce a gaze scoring mechanism with classifier-free guidance to decouple learning from control: the model captures natural spatial alignment from data, while users can adjust eye contact intensity at inference time.

On the Embody 3D dataset, our method achieves state-of-the-art motion quality at over 300 FPS—3× faster than non-causal baselines—while capturing the subtle spatial dynamics of natural conversation. We validate our approach on a live VR system, bringing spatially-aware conversational agents to real-time deployment.

Real-time VR Demonstration

Watch SARAH in action on a live VR system

Approach

Method Overview

Architecture Overview.

Given the user's 3D position and dyadic conversational audio, our model generates spatially and conversationally aware 3D motion.

We employ a fully causal transformer-based VAE with interleaved latent tokens at a fixed temporal stride. Both encoder and decoder use causal attention, where each μ/σ token attends only to preceding frames and earlier latents.
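The interleaving and causality described above can be illustrated with a small sketch: latent (μ/σ) tokens are inserted into the frame sequence at a fixed stride, and an ordinary lower-triangular mask over the interleaved sequence then guarantees that each latent attends only to preceding frames and earlier latents. The token ordering and stride value here are illustrative assumptions, not the paper's exact layout.

```python
import numpy as np

def interleave(num_frames: int, stride: int):
    """Token order: motion frames with one latent (mu/sigma) token
    inserted after every `stride` frames."""
    tokens = []
    for i in range(num_frames):
        tokens.append(("frame", i))
        if (i + 1) % stride == 0:
            tokens.append(("latent", (i + 1) // stride - 1))
    return tokens

def causal_mask(n: int) -> np.ndarray:
    """Lower-triangular causal mask over the interleaved sequence:
    each token (frame or latent) attends only to itself and earlier
    tokens, i.e. to preceding frames and earlier latents."""
    return np.tril(np.ones((n, n), dtype=bool))

tokens = interleave(num_frames=8, stride=4)
mask = causal_mask(len(tokens))
# With stride 4, the first latent sits after frame 3 and can only see
# frames 0-3 (and itself), never future frames or later latents.
```

Because the mask is strictly causal, the encoder can emit a new latent every `stride` frames during streaming without ever waiting for future input.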

These latents feed into a transformer-based flow matching model with causal masking, which optionally accepts a gaze score to control the agent's eye contact behavior.
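As a rough sketch of the flow-matching objective on these latents: a rectified linear path interpolates between noise and the encoded motion latents, and the network regresses the constant velocity along that path. The linear path, shapes, and time sampling below are assumptions for illustration; the paper's exact schedule may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_targets(x0, x1, t):
    """Linear (rectified-flow) interpolation used in flow matching:
    x_t = (1 - t) * x0 + t * x1, with velocity target v = x1 - x0.
    x0: noise sample; x1: latent motion tokens from the VAE encoder."""
    xt = (1.0 - t) * x0 + t * x1
    v = x1 - x0
    return xt, v

# One illustrative training pair: a batch of 2 latent tokens of dim 4.
x1 = rng.normal(size=(2, 4))   # data (encoded motion latents)
x0 = rng.normal(size=(2, 4))   # noise
t = rng.uniform(size=(2, 1))   # per-sample time in [0, 1]
xt, v = flow_matching_targets(x0, x1, t)
# A causal network v_theta(xt, t, user_trajectory, audio, gaze_score)
# would regress v with an MSE loss; sampling integrates v_theta from t=0.
```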

Motion Representation

We encode each joint as a 3D icosahedron—a fully Euclidean representation that yields faster convergence and more stable training. Joint positions are recovered as the mean of the icosahedron's vertices, and joint rotations via SVD alignment to the canonical icosahedron.
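The recovery step described above (position from the mean, rotation from SVD alignment) is the classic orthogonal Procrustes / Kabsch solution. A minimal sketch, assuming a canonical icosahedron template centered at the origin:

```python
import numpy as np

def fit_joint_transform(template: np.ndarray, pred: np.ndarray):
    """Recover a joint's position and rotation from predicted
    icosahedron vertices: position = centroid of the vertices,
    rotation = SVD (Kabsch) alignment of the centered vertices to
    the canonical template.
    template: (12, 3) canonical icosahedron vertices, zero-centered.
    pred:     (12, 3) predicted vertices for this joint."""
    pos = pred.mean(axis=0)                  # joint position = centroid
    a = template                             # already centered
    b = pred - pos                           # center the prediction
    u, _, vt = np.linalg.svd(a.T @ b)        # cross-covariance SVD
    d = np.sign(np.linalg.det(vt.T @ u))     # guard against reflections
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    return pos, rot                          # rot @ template_i ~ b_i
```

This yields the closest proper rotation in the least-squares sense, so small per-vertex prediction errors are averaged out rather than propagated into the joint frame.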


Gaze Guidance

Our training data spans diverse gaze behaviors, from sustained eye contact to complete aversion (left). We devise a gaze scoring mechanism that can be enabled at test time to tune the agent's eye contact (right).


Controllable Gaze at Inference

The gaze alignment score g ∈ [0, 1] controls how much the agent faces the user
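A classifier-free-guidance-style blend makes this control concrete: at inference the model is queried without the gaze condition and with it, and the gaze score g interpolates between the two predicted velocities. The exact blending rule below is an assumption for illustration; the paper's guidance formula may differ.

```python
import numpy as np

def guided_velocity(v_uncond, v_gaze, g: float):
    """Blend an unconditional velocity prediction with a
    gaze-conditioned one, CFG-style. g = 0 reproduces the
    unconditional motion; g = 1 fully follows the gaze condition;
    intermediate values tune eye-contact intensity."""
    return v_uncond + g * (v_gaze - v_uncond)

# Example: two flow-matching velocity predictions for the same state.
v_u = np.array([1.0, 0.0])   # model output without gaze conditioning
v_g = np.array([0.0, 1.0])   # model output with gaze conditioning
half = guided_velocity(v_u, v_g, 0.5)
```

Because the blend happens purely at inference time, the same trained model serves every gaze preference without retraining, which is exactly the learning/control decoupling described in the abstract.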

Results

Comparison with Baselines

Our method achieves state-of-the-art motion quality while being both causal and real-time

Qualitative Results

Diverse motion samples generated by our method

BibTeX

@misc{ng2026sarahspatiallyawarerealtime,
title={SARAH: Spatially Aware Real-time Agentic Humans},
author={Evonne Ng and Siwei Zhang and Zhang Chen and Michael Zollhoefer and Alexander Richard},
year={2026},
eprint={2602.18432},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2602.18432},
}

Acknowledgements

We thank all of the Embody 3D team for making this project possible. We would also like to thank Abhay Mittal, Anastasis Stathopoulos, and Ethan Weber for helpful discussions, and Vasu Agrawal, Martin Gleize, and Srivathsan Govindarajan for their help in creating the demo.