GestSync: Determining who is speaking without a talking head


Speaker: Sindhu B Hegde

Abstract: In this paper we introduce a new synchronisation task, Gesture-Sync: determining whether a person's gestures are correlated with their speech. Compared to Lip-Sync, Gesture-Sync is far more challenging, as the relationship between the voice and body movement is much looser than that between the voice and lip motion. We introduce a dual-encoder model for this task, and compare a number of input representations, including RGB frames, keypoint images, and keypoint vectors, assessing their performance and advantages. We show that the model can be trained using self-supervised learning alone, and evaluate its performance on the LRS3 dataset. Finally, we demonstrate applications of Gesture-Sync to audio-visual synchronisation and to determining who is speaking in a crowd, without seeing their faces. The code, datasets and pre-trained models can be found at: https://www.robots.ox.ac.uk/~vgg/research/gestsync
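To make the dual-encoder idea concrete, the sketch below shows one plausible way such a model could be structured: one branch embeds an audio window and the other embeds the corresponding gesture (keypoint) window, and the cosine similarity between the two embeddings serves as a sync score. This is a minimal illustrative sketch in PyTorch, not the authors' actual architecture; all layer sizes, input shapes, and class names are hypothetical assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GestureSyncSketch(nn.Module):
    """Hypothetical dual-encoder: an audio branch and a visual (keypoint) branch
    map their respective windows to a shared embedding space; a high cosine
    similarity suggests gesture and speech are in sync."""

    def __init__(self, mel_bins=80, pose_dim=2 * 17, frames=25, embed_dim=512):
        super().__init__()
        # Audio branch: flattened mel-spectrogram window -> embedding
        self.audio_encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(mel_bins * frames * 4, 1024), nn.ReLU(),
            nn.Linear(1024, embed_dim),
        )
        # Visual branch: flattened per-frame 2D keypoint vectors -> embedding
        self.visual_encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(pose_dim * frames, 1024), nn.ReLU(),
            nn.Linear(1024, embed_dim),
        )

    def forward(self, mel, keypoints):
        a = F.normalize(self.audio_encoder(mel), dim=-1)
        v = F.normalize(self.visual_encoder(keypoints), dim=-1)
        return (a * v).sum(dim=-1)  # cosine similarity used as a sync score


# Self-supervised signal (as in sync models generally): positive pairs share the
# same time window, negatives use temporally shifted audio from the same clip.
model = GestureSyncSketch()
mel = torch.randn(8, 80, 25 * 4)    # batch of mel windows (hypothetical shape)
kps = torch.randn(8, 25, 2 * 17)    # batch of keypoint windows (hypothetical shape)
print(model(mel, kps).shape)        # torch.Size([8])
```

In this kind of setup, choosing who is speaking in a crowd would amount to scoring each visible person's gesture window against the audio and picking the highest-scoring one.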

Bio: Sindhu is a DPhil student in the University of Oxford's Visual Geometry Group (VGG), supervised by Prof. Andrew Zisserman. Her research interests span Computer Vision and Machine Learning, with a focus on multi-modal learning and video understanding. Previously, Sindhu worked as a Lead Data Scientist at Verisk Analytics, India, where she developed deep learning-based solutions for image forensics projects. She holds a Master's degree in Computer Science from IIIT Hyderabad, India, where she was supervised by Prof. C V Jawahar (IIIT-H) and Prof. Vinay Namboodiri (University of Bath, UK). Outside of her academic pursuits, Sindhu finds joy in culinary adventures and practising yoga.