COS 126 Deciphering the Genetic Code |
Programming Assignment 9 Due: Wednesday, 11:59pm |
Write a program to find the genetic encoding of amino acids, given a protein and a genetic sequence known to contain that protein, using a one-pass trial and error method. This program is an example of computational biology, an emerging science that uses computers to do biological research. Also, you'll learn about hashing, a widely used technique for making programs more efficient.
For the purposes of this assignment, we define a
Pattern matching. Now, given a protein and a long genetic sequence known to contain that protein, we want to establish the genetic encodings of the amino acids in the protein. An example will illustrate the process that we use to do this. Suppose that we know that the protein M S I Q H M R is found somewhere in the genetic sequence attgctagcaatgctagcaattgctagcaattcat but we don't know where, and we don't know any encodings for the amino acids. The problem seems hopeless at first, but it turns out that a direct approach works if the protein sequence is long enough. First, try aligning the first characters:
attgctagcaatgctagcaattgctagcaattcat M S I Q H M R
This approach would work if a code for M were att, a code for S were gct, etc. But this cannot be case, because gct cannot be an encoding for both S and H. Having two different genetic encodings for an amino acid (M in this case) is allowed, but each genetic code must represent only one amino acid.
The sequences don't align at the beginning, so we next try aligning the protein sequence with the second character of the genetic sequence:
attgctagcaatgctagcaattgctagcaattcat M S I Q H M R
Again, cta can't encode two different amino acids, so this can't be the alignment. The same thing happens again for aligning the protein with the third character of the genetic sequence; how about the fourth character?
attgctagcaatgctagcaattgctagcaattcat M S I Q H M RNot only is there a conflict with agc encoding both S and H, but also gct encodes both M and Q. Continuing to slide the protein sequence along the genetic sequence in this way, we finally get an alignment that could work:
attgctagcaatgctagcaattgctagcaattcat M S I Q H M RThis alignment is evidence that atg and agc are genetic codes for the amino acid M, cta is a code for S, gca for I, etc.
This is an artificial example to illustrate how the method works. In real data, the sequences are much longer, and there's no mistaking a match. Furthermore, the same codes always come out any time the method is run on a genetic sequence known to contain a protein: these may be found in any biochemistry textbook. We have simplified the biochemistry here considerably, and have the benefit of hindsight over decades of experimental work, but one wonders if computational tricks like this could have sped up the process of discovering the relationships between genes and amino acids. This string-processing abstraction for working with the building blocks of life is coming into widespread use, and it seems clear that computational methods will play an important role in new discoveries.
The program.
You may start with the
following code, which takes two filenames as command
line arguments, and prepares the two needed sequences to be read in.
Put it in a file named gene.c.
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#define MAX_GENE_LEN 100000
#define MAX_PROT_LEN 5000
#define NUCLEOTIDES 4
#define CODON_LEN 3
#define CODE_LEN 64
int hash(char p[]) {
/* YOUR CODE HERE */
}
void unhash(int n) {
/* YOUR CODE HERE */
}
int main(int argc, char *argv[]) {
char geneseq[MAX_GENE_LEN], protseq[MAX_PROT_LEN], genecode[CODE_LEN];
int i;
FILE *prot, *gene;
/* YOUR CODE BELOW TO READ IN GENE AND PROTEIN SEQUENCES */
prot = fopen(argv[1], "r");
gene = fopen(argv[2], "r");
/* YOUR CODE BELOW TO FIND MATCH */
/* print out results */
for (i = 0; i < CODE_LEN; i++) {
unhash(i);
printf(" %c\n", genecode[i]);
}
}
You will need to implement the alignment procedure in main(), and two auxiliary procedures: hash(), which produces a different six-bit number (0-63) for each three-character genetic encoding, and unhash(), which prints out the three characters corresponding to each six-bit number. These procedures will allow you to use the 64-entry table genecode[] in the alignment process to keep track of the encoding assignments that have been made. For example, if hash("aat") is 3 and genecode[3] is 'R', then this means that aat is an encoding for R.
Testing. Use the sample data above (
prot.1
and
gene.1
)
for debugging.
The files
prot.2
and
gene.2
contain real genetic data: the first is an amino acid sequence for
a beta-lactamase protein and the genetic sequence for
a human virus.
The files
prot.3,
gene.3,
prot.4,
and
gene.4
contain more real genetic data.
Submission.
Submit the files readme and gene.c.
This assignment was developed by
R. Sedgewick and Gun Sirer.
Copyright © 2000 Robert Sedgewick