Algorithms for the Identification of Functional Sites in Proteins (thesis)
Abstract:
Proteins play an essential role in nearly every process carried out by
the cell. In accomplishing this incredibly diverse array of
functions, proteins interact with one another and other molecules in
their environments. The interactions of proteins with other molecules
are mediated by specific amino acids. For a given protein, the
identification of the residues that participate in its interactions
can be a crucial step in understanding its function. Knowledge of
these so-called functional sites can guide further experimental
analysis of the protein and aid drug design and development. The large
number of protein sequences and structural models that have become
available over the past 10 years presents an exceptional opportunity to
use the methods of computer science and statistics to identify protein
functional sites, and thereby further biological understanding.

This dissertation investigates the computational prediction of
functional sites from protein sequence and structure data. First, we
consider the estimation of evolutionary sequence conservation from a
multiple sequence alignment of homologous proteins--a common first
step in the identification of functionally important sites. We
introduce a fast, information-theoretic algorithm for scoring
conservation and demonstrate that it provides state-of-the-art
performance in predicting catalytic sites, ligand binding sites, and
protein-protein interface residues. Second, we examine the
identification of a class of functional residues that cannot be
identified by considering sequence conservation alone: those that
determine functional substrate specificity within homologous protein
families. We combine sequence information with structural models to
build the first large dataset of these specificity-determining
positions (SDPs). This dataset enabled the first large-scale analysis
of sequence-based SDP prediction methods. We demonstrate that
GroupSim, a new method we developed, outperforms existing approaches.
Finally, we focus on the prediction of ligand binding sites when both
evolutionary sequence information and structural models are available.
We introduce ConCavity, a new algorithm that directly integrates
sequence conservation information into structure-based surface pocket
identification. This algorithm provides significant improvement over
earlier methods and establishes the complementarity of sequence and
structural evidence in ligand binding site prediction. Overall, our
work significantly improves our ability to identify functional sites
from protein sequences and structures.
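
As an illustration of what an information-theoretic conservation score
can look like, the sketch below scores each column of a multiple
sequence alignment by the Jensen-Shannon divergence between the
column's amino acid distribution and a background distribution. The
abstract does not name the specific score, so the choice of
Jensen-Shannon divergence, the uniform background, the pseudocount, and
all function names here are assumptions made for illustration, not the
dissertation's implementation.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_distribution(column, pseudocount=1e-6):
    """Estimate amino acid frequencies for one alignment column.

    Gaps and non-standard characters are ignored; the small pseudocount
    (an assumed value) keeps every probability strictly positive.
    """
    counts = Counter(c for c in column if c in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return [(counts.get(a, 0) + pseudocount) / total for a in AMINO_ACIDS]


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def conservation_scores(alignment):
    """Score every column of an alignment given as equal-length strings.

    Higher scores mean the column differs more from the (assumed uniform)
    background, i.e. it is more conserved.
    """
    background = [1 / len(AMINO_ACIDS)] * len(AMINO_ACIDS)
    return [
        js_divergence(column_distribution(col), background)
        for col in zip(*alignment)
    ]


# Toy alignment: column 0 is invariant, column 2 is fully variable.
toy_alignment = ["ACD", "ACE", "AGF"]
toy_scores = conservation_scores(toy_alignment)
```

In this toy alignment the invariant first column receives the highest
score and the fully variable third column the lowest. A realistic
scorer would typically use an empirical background distribution and
account for gaps and sequence redundancy, none of which is attempted
here.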