Please follow along with the numbered videos for an overview of our approach.
Takeaway: for the body, our joint VQ + diffusion method achieves more dynamic and peaky motion than using either VQ or diffusion alone.
1. We capture a novel, rich dataset of dyadic conversations that allows for photorealistic reconstructions. Dataset here.
2. Our motion model comprises three parts: a face motion model, a guide pose predictor, and a body motion model.
3. Given audio and outputs from a pretrained lip regressor, we train a conditional diffusion model to output facial motion (see the first sketch after this list).
4. For the body, we take audio as input and autoregressively output VQ-ed guide poses at 1 fps.
5. We then pass both the audio and the guide poses into a diffusion model that in-fills high-frequency body motion at 30 fps (see the second sketch after this list).
6. Both the generated face and body motion are then passed into our trained avatar renderer to produce a photorealistic avatar.
7. Voilà! The final result.
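
For readers who want a concrete picture of step 3, the snippet below sketches one way a diffusion denoiser for facial motion could be conditioned on per-frame audio features and the output of a pretrained lip regressor. This is a minimal sketch under assumed names and sizes (FaceDenoiser, face_dim, d_model, the transformer backbone, and the toy noise schedule are all illustrative), not the paper's actual implementation.

```python
# Sketch of a conditional diffusion denoiser for facial motion.
# Conditioning: per-frame audio features + pretrained lip-regressor output.
# All names, dimensions, and the transformer backbone are illustrative assumptions.
import torch
import torch.nn as nn


class FaceDenoiser(nn.Module):
    def __init__(self, face_dim=256, audio_dim=128, lip_dim=64, d_model=512, n_layers=4):
        super().__init__()
        self.in_proj = nn.Linear(face_dim, d_model)
        self.cond_proj = nn.Linear(audio_dim + lip_dim, d_model)
        self.time_mlp = nn.Sequential(nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out_proj = nn.Linear(d_model, face_dim)

    def forward(self, noisy_face, t, audio_feat, lip_feat):
        # noisy_face: (B, T, face_dim); audio_feat: (B, T, audio_dim); lip_feat: (B, T, lip_dim)
        cond = self.cond_proj(torch.cat([audio_feat, lip_feat], dim=-1))
        t_emb = self.time_mlp(t.float().view(-1, 1, 1))           # broadcast over time
        h = self.in_proj(noisy_face) + cond + t_emb
        return self.out_proj(self.backbone(h))                    # predicted clean facial motion


# Toy usage: one denoising training step with a stand-in noise schedule.
model = FaceDenoiser()
face = torch.randn(2, 90, 256)                                    # 3 s of facial motion at 30 fps
audio, lip = torch.randn(2, 90, 128), torch.randn(2, 90, 64)
t = torch.randint(0, 1000, (2,))
alpha = (1.0 - t.float() / 1000).view(-1, 1, 1)                   # placeholder for a real schedule
noisy = alpha.sqrt() * face + (1 - alpha).sqrt() * torch.randn_like(face)
loss = torch.nn.functional.mse_loss(model(noisy, t, audio, lip), face)
```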
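
The body branch (steps 4 and 5, and the takeaway above) couples a coarse autoregressive VQ stage with a diffusion in-filling stage. Below is a minimal sketch of how the two stages could be wired together; the GRU backbone, greedy token decoding, repeat-based upsampling from 1 fps to 30 fps, and every name and dimension here are assumptions for illustration, not the released model.

```python
# Sketch of the two-stage body pipeline: autoregressive VQ guide poses at 1 fps,
# then a diffusion model that in-fills 30 fps body motion conditioned on them.
# The GRU backbone, greedy decoding, upsampling scheme, and all dimensions are assumptions.
import torch
import torch.nn as nn


class GuidePosePredictor(nn.Module):
    """Autoregressively predicts VQ codebook indices for 1 fps guide poses."""

    def __init__(self, n_codes=1024, audio_dim=128, d_model=512):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, d_model)            # learned VQ codebook (assumed)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.gru = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, n_codes)

    @torch.no_grad()
    def generate(self, audio_1fps):
        # audio_1fps: (B, T_sec, audio_dim) -> one guide-pose token per second
        B, T, _ = audio_1fps.shape
        tokens, h = [], None
        prev = torch.zeros(B, 1, self.codebook.embedding_dim, device=audio_1fps.device)
        for t in range(T):
            out, h = self.gru(prev + self.audio_proj(audio_1fps[:, t : t + 1]), h)
            idx = self.head(out).argmax(-1)                       # greedy decode for simplicity
            prev = self.codebook(idx)
            tokens.append(idx)
        return torch.cat(tokens, dim=1)                           # (B, T_sec)


class BodyInfillDenoiser(nn.Module):
    """Diffusion denoiser that fills in 30 fps body motion around coarse guide poses."""

    def __init__(self, body_dim=128, audio_dim=128, d_model=512):
        super().__init__()
        self.in_proj = nn.Linear(body_dim + d_model + audio_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.out_proj = nn.Linear(d_model, body_dim)

    def forward(self, noisy_body, guide_emb_30fps, audio_30fps):
        h = self.in_proj(torch.cat([noisy_body, guide_emb_30fps, audio_30fps], dim=-1))
        return self.out_proj(self.backbone(h))                    # predicted clean body motion


# Toy usage: predict 1 fps guide poses, upsample to 30 fps, run one denoising pass.
predictor, denoiser = GuidePosePredictor(), BodyInfillDenoiser()
tokens = predictor.generate(torch.randn(1, 4, 128))               # 4 s of pooled audio features
guide_30fps = predictor.codebook(tokens).repeat_interleave(30, dim=1)  # (1, 120, 512)
clean_body = denoiser(torch.randn(1, 120, 128), guide_30fps, torch.randn(1, 120, 128))
```

The point mirrored in the sketch is the takeaway above: the diffusion stage does not have to invent the coarse trajectory, it only fills in high-frequency detail between the 1 fps guide poses, which is what lets the combined output stay more dynamic and peaky than either stage alone.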