Information Extraction from User Workstations
Date and Time
Thursday, March 16, 2006 - 4:00pm to 5:30pm
Location
Computer Science Large Auditorium (Room 104)
Type
Colloquium
Speaker
Tom Mitchell, from Carnegie Mellon University
Host
Robert Schapire
Automatically extracting structured facts from unstructured text is a key
step toward natural language understanding. Many researchers study this
problem, typically in the context of text collections such as newsfeeds or
the web. This talk will explore information extraction from user
workstations. While many of the subproblems are the same as for extraction
from other corpora, there are characteristics of workstations that suggest
very different approaches from "traditional" information extraction. For
example, suppose the facts we wish to extract from the workstation consist
of assertions about the key activities of the workstation user (e.g., which
courses they are taking, which committees they serve on), and relations
among the people, meetings, topics, emails, files, etc. associated with each
such activity. Interestingly, workstations contain many
redundant clues regarding these facts (e.g., evidence that Bob and Sue are
both involved in the hiring committee exists in email, the calendar,
individual files, ...). This redundancy suggests considering information
extraction as a problem of integrating diverse clues from multiple sources
rather than a problem of examining a single sentence in great detail. This
talk will explore this formulation of the information extraction problem,
and present our recent work on automatically extracting facts using
workstation-wide information obtained by calling Google Desktop Search as a
subroutine.
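
As a rough illustration of treating extraction as the integration of redundant clues, the sketch below pools weak evidence for the same candidate fact from several workstation sources. It is not the speaker's method: the noisy-OR combination rule, the function name combine_clues, the source labels, and the confidence values are all assumptions made for this example, and noisy-OR treats the sources as independent, which real workstation data may not satisfy.

```python
from collections import defaultdict


def combine_clues(clues):
    """Combine per-source confidences for each candidate fact with noisy-OR.

    `clues` is an iterable of (fact, source, confidence) tuples, where
    confidence is a value in [0, 1]. The source label is carried only for
    readability; this hypothetical rule does not weight sources differently.
    """
    # Probability that every clue seen so far for a fact is wrong.
    miss = defaultdict(lambda: 1.0)
    for fact, _source, confidence in clues:
        miss[fact] *= 1.0 - confidence
    # Belief in a fact is the chance that at least one clue is right.
    return {fact: 1.0 - m for fact, m in miss.items()}


# Hypothetical clues; the confidence values are made up for illustration.
clues = [
    (("Bob", "member_of", "hiring committee"), "email", 0.6),
    (("Bob", "member_of", "hiring committee"), "calendar", 0.5),
    (("Sue", "member_of", "hiring committee"), "files", 0.4),
]

beliefs = combine_clues(clues)
for fact, belief in sorted(beliefs.items()):
    print(fact, round(belief, 2))
```

In this toy setup, two moderately confident clues about Bob yield a higher combined belief than either clue alone, while the single clue about Sue is left unchanged. That is the intuition behind preferring many shallow clues over a deep analysis of one sentence.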