The Content-Aware Similarity
Search (CASS) project investigates research issues in searching, clustering,
classification, and management of feature-rich, non-text data types.
Current research topics include:
- Sketch construction techniques. Feature vectors of feature-rich data
objects are high dimensional. The goal is to develop practical
algorithms that construct sketches, compact representations that
substantially reduce the dimensionality and size of feature vectors
while still supporting high-quality similarity search.
- Efficient filtering and indexing methods. Indexing and filtering large
feature-rich datasets is challenging because noisy data require
similarity matching rather than exact matching, so indexing data
structures designed for exact match do not apply. The goal is to
investigate novel data structures and algorithms for filtering and
indexing large datasets for similarity search.
- Similarity search of multiple data types. Develop an understanding of
similarity search for various data types, including images, audio, video,
3D shapes, scientific sensor data, and documents. We are also interested
in continuous archived data and data with multiple modalities.
- Toolkit for similarity search. Most similarity search efforts are
domain specific. Our goal is to design and implement a toolkit that can
be used to construct search engines for various data types by plugging
in data-type-specific segmentation, feature extraction, and distance
calculation modules.
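To make the first two topics concrete, here is a minimal sketch of one well-known way to build compact sketches and use them as a filter. It uses sign random projection (one bit per random hyperplane) and Hamming distance, which is a stand-in technique chosen for illustration and not necessarily the algorithm developed in this project. All function names and parameters (`make_hyperplanes`, `sketch`, `search`, `n_bits`, `n_candidates`) are hypothetical.

```python
import math
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def sketch(vector, hyperplanes):
    """Reduce a feature vector to an n_bits-bit integer, one bit per hyperplane."""
    bits = 0
    for i, plane in enumerate(hyperplanes):
        dot = sum(p * v for p, v in zip(plane, vector))
        if dot >= 0.0:
            bits |= 1 << i
    return bits


def hamming(a, b):
    """Number of bit positions in which two sketches differ."""
    return bin(a ^ b).count("1")


def euclidean(u, v):
    """Exact distance on the original feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def search(query, dataset, hyperplanes, sketches, n_candidates, k):
    """Filter by sketch distance, then rank the survivors by exact distance."""
    q = sketch(query, hyperplanes)
    # Filtering step: keep the n_candidates items with the closest sketches.
    by_sketch = sorted(range(len(dataset)), key=lambda i: hamming(q, sketches[i]))
    candidates = by_sketch[:n_candidates]
    # Ranking step: compute the exact distance only for the candidates.
    return sorted(candidates, key=lambda i: euclidean(query, dataset[i]))[:k]
```

In this sketch, each object costs `n_bits` bits instead of `dim` floating-point values, and the expensive exact distance is computed only for the items that pass the filter. The filter here is a linear scan over sketches; an index structure would replace that scan for large datasets.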
We have built an initial toolkit
called Ferret, which has been used with four data types: images,
audio recordings, 3D shapes, and genomic microarray data. Below is the
architecture design of the Ferret toolkit. Click here
to see some demos and snapshots of the search systems built using the Ferret
toolkit.
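The plug-in idea behind such a toolkit can be illustrated with a short sketch. This is not Ferret's actual API or architecture: the `Plugin` and `SearchEngine` classes, the toy signal plug-in, and the object-distance rule (average of each query segment's distance to its nearest target segment) are all assumptions made for illustration. The point is only that the engine code stays generic while segmentation, feature extraction, and distance calculation are supplied per data type.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Sequence, Tuple

Feature = Sequence[float]


@dataclass
class Plugin:
    """Hypothetical bundle of the three data-type-specific modules."""
    segment: Callable[[Any], List[Any]]            # object -> segments
    extract: Callable[[Any], Feature]              # segment -> feature vector
    distance: Callable[[Feature, Feature], float]  # segment-level distance


@dataclass
class SearchEngine:
    """Generic engine; assumes every object yields at least one segment."""
    plugin: Plugin
    index: List[Tuple[str, List[Feature]]] = field(default_factory=list)

    def _features(self, obj):
        return [self.plugin.extract(s) for s in self.plugin.segment(obj)]

    def add(self, object_id, obj):
        self.index.append((object_id, self._features(obj)))

    def _object_distance(self, a, b):
        # Average, over query segments, of the distance to the nearest
        # segment of the target object (an illustrative choice).
        nearest = [min(self.plugin.distance(x, y) for y in b) for x in a]
        return sum(nearest) / len(a)

    def query(self, obj, k=1):
        q = self._features(obj)
        scored = sorted((self._object_distance(q, f), oid) for oid, f in self.index)
        return [oid for _, oid in scored[:k]]


# A toy plug-in for 1-D signals, standing in for a real data type.
def windows(signal, size=4):
    return [signal[i:i + size] for i in range(0, len(signal), size)]


def mean_and_range(window):
    return (sum(window) / len(window), max(window) - min(window))


def l1(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))
```

A usage example under the same assumptions:

```python
engine = SearchEngine(Plugin(windows, mean_and_range, l1))
engine.add("flat", [1.0] * 8)
engine.add("ramp", [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
print(engine.query([1.0, 1.0, 1.0, 1.0, 1.1, 0.9, 1.0, 1.0], k=1))
```

Supporting another data type would mean writing three new functions and passing them in a new `Plugin`, with no change to `SearchEngine`.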