Combine and Conquer: Representation Learning From Multiple Data Distributions

Speaker

Jimmy Shi, University of Oxford

Abstract

It is becoming less and less controversial to say that the days of learning representations through label supervision are over. Recent work has shown that such regimes are not only expensive, but also suffer from various generalisation and robustness issues. This is somewhat unsurprising, as perceptual data (vision, language) are rich and cannot be well represented by a single label --- doing so inevitably results in the model learning spurious features that trivially correlate with the label.

In this talk, I will introduce my work during my PhD at Oxford, which looks at representation learning from multiple sources of data, e.g. vision and language. We show for both generative models (VAEs) and discriminative models that learning to extract common abstract concepts shared across multiple modalities/domains can yield higher-quality and more generalisable representations. We also look at improving the data efficiency of such models, both by 1) requiring fewer multimodal pairs through the adoption of contrastive-style objectives and 2) "generating" multimodal pairs via masked image modelling.