Analysis of Large Genomic Data Collections (thesis)
Abstract:
Modern computational biology draws on the historical strengths both of computer science and of molecular biology. It requires careful attention to algorithmic development, to data structures, storage, and manipulation, to efficient software engineering, and to machine learning; but it also drives towards a deeper understanding of the causes and cures of disease, the organization of life at both micro- and macroscopic levels, and the molecular systems governing every living organism. In particular, the field must take advantage of the ever-increasing availability of experimental assays that provide whole-genome measurements: the sequences of the human and other genomes, the abundances of every transcript in a cell, the distribution of gene activity across cell or tissue types, and the interactions among proteins and protein complexes. These data can be combined and analyzed in an integrated manner to enable biological discoveries not obtainable from single experiments, but this requires both the computational ability to manipulate large, heterogeneous, and noisy datasets and the biological ability to ask targeted questions of these diverse data. This manuscript presents four broad solutions to these challenges. First, we discuss the closer integration of computational techniques with laboratory experimentation to study S. cerevisiae mitochondria. This work confirms that biological discoveries can be made much more efficiently in the laboratory if informed by computational inference, and that computational algorithms can be much more accurate if informed by appropriate experimental design. Second, we present a set of specific software tools created to address biological needs, ranging from the efficient analysis of very large data collections to the visualization of dense biological networks. Third, we detail several ways in which statistical models can be applied to genomic data to describe specific biological phenomena, including cellular growth rate, aneuploidy, and phosphorylation. Finally, we provide methods for integrating heterogeneous genomic data that are designed specifically for very large data collections and complex organisms (e.g., humans); these methods allow one to study not only traditional biological questions, such as the interactions between individual genes and proteins, but also the interactions between entire pathways and processes at a systems level. These tools, techniques, and discoveries provide a basis from which further computational research can readily develop as biological experimentation explores more data, examines new organisms, and brings us ever closer to understanding and eliminating disease.