DETECTING GENE SIMILARITIES USING LARGE-SCALE CONTENT-BASED SEARCH SYSTEMS
Abstract:
The accumulation of public gene expression datasets offers numerous opportunities for
researchers to utilize these data to characterize gene functions, understand pathway
actions, and formulate hypotheses about the molecular basis of human diseases. Yet,
exploring this extremely large collection of gene expression data has been challenging, due
to a lack of effective tools for reusing existing datasets and exploring them in
targeted analyses. An important challenge is discovering robust gene signatures of
biological processes and diseases, which depends on the ability to detect similar
genes that share expression patterns across a large set of conditions. This thesis
discusses query-based systems that are intended for large-scale integration and
exploration of gene similarities. It also discusses their key biological applications.
In the first part, I present SEEK, a search system and a novel algorithm for
searching for similar (or coexpressed) genes around a multigene query of interest. The search
algorithm uses a sensitive dataset weighting scheme to weight and combine coexpression
results across datasets. Notably, in robust searches across
thousands of human datasets, the retrieval of functionally co-annotated genes always
improves with the inclusion of more datasets, showing the promise of large
compendia. In the second part, I extend the work of SEEK to the expression compendia
of five commonly studied model organisms. The new system, ModSEEK, enables accurate
searches across a wider variety of experiments and has been extensively evaluated. In the third
part, I propose a novel framework for integrating and comparing coexpression context
across a pair of organisms. I leverage both comparative genomics orthology data and
functional genomics coexpression data in an unsupervised framework to identify pairs of
genes in an orthologous group that are similarly highly coexpressed with an orthologous
query in the two organisms. I show that such functionally similar pairs of genes can be used
to improve the performance of single-organism gene retrieval searches. In the final part, I
demonstrate how coexpressed genes can be used to identify important transcription
factors and dysregulated processes underlying breast cancer subtypes. This part
highlights the promise of coexpressed genes for understanding dysregulation in
cancer.
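
To make the idea of a dataset-weighted coexpression search concrete, here is a minimal sketch. It is not SEEK's actual algorithm: the weighting rule (mean correlation among the query genes within each dataset, floored at zero), the function name coexpression_search, and the synthetic data are all assumptions chosen for illustration. It only shows the general pattern of scoring each gene by its coexpression with a multigene query and letting datasets in which the query is coherent count for more.

```python
import numpy as np


def coexpression_search(datasets, query):
    """Rank genes by dataset-weighted coexpression with the query genes.

    datasets: list of 2-D arrays (genes x conditions), same gene order in each.
    query: list of row indices of the query genes (at least two).
    Hypothetical helper for illustration; not the SEEK implementation.
    """
    n_genes = datasets[0].shape[0]
    scores = np.zeros(n_genes)
    total_weight = 0.0
    off_diagonal = ~np.eye(len(query), dtype=bool)
    for data in datasets:
        # Gene-by-gene Pearson correlation (assumes no constant rows).
        corr = np.corrcoef(data)
        # Assumed weight: mean correlation among distinct query genes,
        # floored at zero so incoherent datasets contribute nothing.
        query_corr = corr[np.ix_(query, query)]
        weight = max(query_corr[off_diagonal].mean(), 0.0)
        # Per-gene score in this dataset: mean correlation to the query genes.
        scores += weight * corr[:, query].mean(axis=1)
        total_weight += weight
    if total_weight > 0:
        scores /= total_weight
    scores[query] = -np.inf  # exclude the query genes from the ranking
    return np.argsort(-scores), scores


# Synthetic example: genes 0, 1, 2 share a signal in the first dataset only.
rng = np.random.default_rng(0)
signal = rng.normal(size=100)
informative = rng.normal(size=(6, 100))
informative[[0, 1, 2]] = signal + 0.1 * rng.normal(size=(3, 100))
noisy = rng.normal(size=(6, 100))

ranking, scores = coexpression_search([informative, noisy], query=[0, 1])
print(ranking[:3])
```

In this toy setting the query genes are strongly correlated only in the first dataset, so that dataset dominates the combined score and gene 2, which shares the signal, rises to the top of the ranking. A real system would need a more careful weighting and normalization scheme than this simple average.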