Today's visual learning methods require extensive supervision from human teachers. A major goal of the research community has been to remove the need for this supervision by creating methods that, instead, teach themselves by analyzing unlabeled images. In this talk, I will argue that this focus on learning from vision alone, without the use of other sensory modalities, is making the perception problem unnecessarily difficult. To demonstrate this, I will present computer vision methods for learning from co-occurring audio and visual signals. First, I will show that visual models of materials and objects emerge from predicting soundtracks for silent videos. Then, I will present a multimodal video representation that fuses information from both the visual and audio components of a video signal when solving video-understanding tasks. Finally, I will discuss applications of these methods to object recognition and audio-visual source separation.
Andrew Owens is a postdoctoral scholar at UC Berkeley. He received a Ph.D. in computer science from MIT in 2016. He is a recipient of a Computer Vision and Pattern Recognition (CVPR) Best Paper Honorable Mention Award, a Microsoft Research Ph.D. Fellowship, and an NDSEG Graduate Fellowship.