Intelligent agents require the ability to perceive their environments, understand their high-level semantics, and communicate with humans. While computer vision has recently made great strides on visual recognition tasks, the predominant paradigm is to predict one or more fixed visual categories for each image. I will describe a line of work that significantly expands the vocabulary of our computer vision systems and allows them to express visual concepts in natural language, such as “a picture of a girl playing with a stack of legos” or “a couple holding hands and walking on a beach”. In particular, the final model can take an image and both detect and describe in natural language all of its salient regions. My modeling techniques draw on recent advances in Deep Learning that allow us to construct and train neural networks with hundreds of millions of neurons that take raw images and map them directly to natural language sentences. I will show that the model generates qualitatively compelling results, and that quantitative evaluations and control experiments demonstrate the strength of this approach relative to simpler baselines and previous methods.
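To give a concrete sense of the kind of image-to-sentence model described above, the following is a minimal sketch (not the exact model from the talk) of a common design: a convolutional network encodes the raw image into a feature vector, and a recurrent language model generates a caption conditioned on that vector. All class names, dimensions, the choice of ResNet-18 and LSTM, and the vocabulary size are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ImageEncoder(nn.Module):
    """Map a raw image to a fixed-length feature vector with a CNN."""
    def __init__(self, embed_dim=512):
        super().__init__()
        cnn = models.resnet18(weights=None)  # pretrained weights omitted for brevity
        self.backbone = nn.Sequential(*list(cnn.children())[:-1])  # drop the classifier head
        self.project = nn.Linear(cnn.fc.in_features, embed_dim)

    def forward(self, images):                        # images: (B, 3, H, W)
        feats = self.backbone(images).flatten(1)      # (B, 512)
        return self.project(feats)                    # (B, embed_dim)

class CaptionDecoder(nn.Module):
    """Generate a word sequence conditioned on the image embedding."""
    def __init__(self, vocab_size=10000, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_emb, prefix_words):       # prefix_words: (B, T-1) word ids
        words = self.embed(prefix_words)              # (B, T-1, embed_dim)
        # Prepend the image embedding as the first "token" of the sequence.
        inputs = torch.cat([image_emb.unsqueeze(1), words], dim=1)
        hidden, _ = self.lstm(inputs)                 # (B, T, hidden_dim)
        return self.out(hidden)                       # scores over the vocabulary

# Training step (sketch): predict each word given the image and the preceding words.
encoder, decoder = ImageEncoder(), CaptionDecoder()
images = torch.randn(4, 3, 224, 224)                 # dummy image batch
captions = torch.randint(0, 10000, (4, 12))          # dummy caption word ids
logits = decoder(encoder(images), captions[:, :-1])  # (4, 12, vocab)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 10000), captions.reshape(-1))
```

A dense-captioning variant, as described in the abstract, would apply the same idea to many detected image regions rather than to the image as a whole.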
Andrej Karpathy is a Computer Science Ph.D. candidate at Stanford University working with Prof. Fei-Fei Li. He received his M.S. in Computer Science from the University of British Columbia and his B.S. in Computer Science and Physics from the University of Toronto. He is interested in the intersection of computer vision, natural language processing, and reinforcement learning, with the aim of developing agents that can intelligently interact with humans in dynamic environments. His work has been featured in the New York Times and MIT Technology Review. He helped develop and teach a new Computer Science class at Stanford on Convolutional Neural Networks for Visual Recognition. In his spare time he develops Deep Learning libraries in JavaScript and maintains websites that support more efficient meta-research.