04-06
Probabilistic Models of Text and Images

Managing large and growing collections of information is a central goal of modern computer science. Data repositories of texts, images, music, and genetic information have become widely accessible, necessitating good methods of retrieval, organization, and exploration. In this talk, I will describe probabilistic models of information collections, for which the above problems can be cast as statistical queries.

First, I will describe the use of graphical models as a flexible framework for the representation of modeling assumptions. Fast posterior inference algorithms based on variational methods allow us to specify complex Bayesian models and apply them to large datasets.

With this framework in hand, I will develop latent Dirichlet allocation (LDA), a graphical model particularly suited to analyzing text collections. LDA posits an index of hidden topics that describe the underlying documents. The topics are learned from a collection, and new documents can be situated within that collection via posterior inference. Extensions of LDA can index a set of images or multimedia collections of related text and images. I will illustrate the use of such models with several datasets.
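As a rough illustration of the hidden-topic idea (not material from the talk), the sketch below samples one document from the standard LDA generative process: draw per-document topic proportions from a Dirichlet, then draw a topic and a word for each position. The toy topics in `TOPICS`, the parameter values, and the function names are hypothetical and chosen only for this example; the sketch covers generation only, not the variational inference the abstract mentions.

```python
import random

# Hypothetical toy topics: each maps a word to its probability under that topic.
TOPICS = [
    {"gene": 0.5, "dna": 0.3, "cell": 0.2},
    {"pixel": 0.4, "image": 0.4, "camera": 0.2},
]

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """Sample one document: topic proportions, per-word topics, and words."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic proportions
    assignments, words = [], []
    for _ in range(n_words):
        # Choose a topic for this word position according to theta.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # Choose a word from the selected topic's word distribution.
        vocab = list(topics[z])
        w = rng.choices(vocab, weights=[topics[z][v] for v in vocab])[0]
        assignments.append(z)
        words.append(w)
    return theta, assignments, words

if __name__ == "__main__":
    theta, assignments, words = generate_document(
        TOPICS, alpha=[0.5, 0.5], n_words=10, rng=random.Random(0)
    )
    print(theta)
    print(words)
```

Learning reverses this process: given only the words, posterior inference estimates the topics and each document's proportions.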

Finally, I will describe nonparametric Bayesian methods for relaxing the restriction to a fixed number of topics. These methods allow for models based on the natural assumption that the number of topics grows with the collection. I will extend this idea to trees, and to models for discovering both the structure and content of a topic hierarchy.
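The abstract does not say which nonparametric construction is used. As one standard example, assumed here purely for illustration, the Chinese restaurant process assigns each new item to an existing cluster with probability proportional to that cluster's size, or to a new cluster with probability proportional to a concentration parameter. The function name and parameter values below are hypothetical.

```python
import random

def chinese_restaurant_process(n_items, concentration, rng):
    """Sequentially assign items to clusters.

    Returns (sizes, counts): the final cluster sizes, and the number of
    clusters in use after each item has been assigned.
    """
    sizes = []   # sizes[k] = number of items currently in cluster k
    counts = []  # counts[i] = number of clusters after item i is placed
    for _ in range(n_items):
        # Existing clusters are weighted by size; the final slot is a new cluster.
        weights = sizes + [concentration]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)  # open a new cluster
        else:
            sizes[k] += 1    # join an existing cluster
        counts.append(len(sizes))
    return sizes, counts

if __name__ == "__main__":
    sizes, counts = chinese_restaurant_process(1000, 2.0, random.Random(0))
    # Number of clusters in use after 10, 100, and 1000 items.
    print(counts[9], counts[99], counts[999])
```

In this sketch the number of clusters is never fixed in advance and can only increase as more items arrive, which is the behavior the paragraph above describes for topics.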

Joint work with Michael Jordan, Andrew Ng, Thomas Griffiths, and Josh Tenenbaum.

Date and Time
Wednesday, April 6, 2005, 4:00pm - 5:30pm
Location
Computer Science Small Auditorium (Room 105)
Speaker
David Blei, from CMU
Host
Andrea LaPaugh

Contributions to and/or sponsorship of any event do not constitute departmental or institutional endorsement of the specific program, speakers or views presented.
