Algorithms for Representation and Discovery of Transcription Factor Binding Sites
Abstract:
A major objective in molecular biology is to understand how a genome encodes the information that specifies when and where a gene will be transcribed into its protein product. Mediating proteins, known as transcription factors, facilitate this process by interacting with the cell's DNA and the transcription machinery. It is of central importance to identify all sequence-specific DNA binding sites of transcription factors. In this thesis, we consider two relevant computational problems.
The first problem is to develop a representation for a group of known binding sites of a particular transcription factor, in order to facilitate recognition of other binding sites of the same protein. We evaluate the effectiveness of several approaches commonly used for this problem, and show that there are statistically significant differences in their performance. We also consider variants of the basic methods that incorporate pairwise nucleotide dependencies and per-position information content. We find that the use of per-position information content improves all basic methods, and that including local pairwise nucleotide dependencies within binding site models results in better performance for some approaches.
The second problem is that of motif discovery. In this context, given a set of sequences known to contain binding sites of a particular transcription factor, the objective is to identify their locations. We propose a novel combinatorial optimization framework for motif finding, which utilizes both graph pruning techniques and an integer linear programming formulation. Additionally, we introduce a procedure to identify statistically significant motifs. We apply our algorithm to numerous biological datasets as well as to synthetic data, and it performs exceptionally well. Furthermore, we show our framework to be versatile and easily applicable to other variants of the DNA binding site identification problem such as phylogenetic footprinting, the `subtle' motif formulation and the multiple motifs problem.
Studying the optimization framework in greater depth, we introduce a novel, more compact integer linear program that utilizes the discrete nature of the distance metric imposed on pairs of subsequences. We compare the properties of the two alternate formulations from a theoretical perspective and demonstrate that the compact formulation also leads to a method that is highly effective in practice.