COS 126

Deciphering the Genetic Code
Programming Assignment 6

Due: Wednesday, 11:59pm

Write a program to find the genetic encoding of amino acids, given a protein and a genetic sequence known to contain that protein, using a one-pass trial and error method.

For the purposes of this assignment, we define a genetic sequence to be a string made up of one of the four characters 'a', 'c', 'g', or 't' and an amino acid encoding to be a genetic sequence of length 3. There are 25 amino acids, which we will refer to with the capital letters 'A' through 'Y', and we define a protein to be a sequence of amino acids. There are 4*4*4 = 64 different amino acid encodings, each of which corresponds to at most one amino acid, but an amino acid might have more than one encoding.

Now, given a protein and a long genetic sequence known to contain that protein, we want to establish the genetic encodings of the amino acids in the protein. An example will illustrate the process that we use to do this. Suppose that we know that the protein M S I Q H M R is found somewhere in the genetic sequence attgctagcaatgctagcaattgctagcaattcat but we don't know where, and we don't know any encodings for the amino acids. The problem seems hopeless at first, but it turns out that a direct approach works if the protein sequence is long enough. First, try aligning the first characters:

          attgctagcaatgctagcaattgctagcaattcat
          M  S  I  Q  H  M  R

This approach would work if a code for M were att, a code for S were gct, etc. But this cannot be case, because gct cannot be an encoding for both S and H. Having two different genetic encodings for an amino acid (M in this case) is allowed, but each genetic code must represent only one amino acid.

The sequences don't align at the beginning, so we next try aligning the protein sequence with the second character of the genetic sequence:

          attgctagcaatgctagcaattgctagcaattcat
           M  S  I  Q  H  M  R

Again, cta can't encode two different amino acids, so this can't be the alignment. The same thing happens again for aligning the protein with the third character of the genetic sequence; how about the fourth character?

          attgctagcaatgctagcaattgctagcaattcat
             M  S  I  Q  H  M  R
Not only is there a conflict with agc encoding both S and H, but also gct encodes both M and Q. Continuing to slide the protein sequence along the genetic sequence in this way, we finally get an alignment that could work:
          attgctagcaatgctagcaattgctagcaattcat
                    M  S  I  Q  H  M  R
This alignment is evidence that atg and agc are genetic codes for the amino acid M, cta is a code for S, gca for I, etc.

This is an artificial example to illustrate how the method works. In real data, the sequences are much longer, and there's no mistaking a match. Furthermore, the same codes always come out any time the method is run on a genetic sequence known to contain a protein: these may be found in any biochemistry textbook. We have simplified the biochemistry here considerably, and have the benefit of hindsight over decades of experimental work, but one wonders if computational tricks like this could have sped up the process of discovering the relationships between genes and amino acids. This string-processing abstraction for working with the building blocks of life is coming into widespread use, and it seems clear that computational methods will play an important role in new discoveries.

You may start with the following code, which uses UNIX file input to take filenames as arguments and read in the two needed sequences. Put it in a file named gene.c.

      #include <stdio.h>
      #include <stdlib.h>

      int hash(char *p);
      void unhash(int);
      
      main(int argc, char *argv[])
        {
          char geneseq[10000], protseq[1000];
          char genecode[64]; 
          int i, j, c; char *p;
          FILE *prot = fopen(*++argv, "r");
          FILE *gene = fopen(*++argv, "r");
          if ( prot == NULL || gene == NULL ) exit(1);
          for (i = 0; (c = getc(prot)) != EOF; ) 
            if (c >= 'A' && c <= 'Z') protseq[i++] = c;
          protseq[i] = '\0';
          for (i = 0; (c = getc(gene)) != EOF; ) 
            if (c >= 'a' && c <= 'z') geneseq[i++] = c;
          geneseq[i] = '\0';
      
          /* your code here: print out where match is found */
      
          for (j = 0; j < 64; j++)
            { unhash(j); printf(" %c    ", genecode[j]); 
              if (!((j+1) % 4)) printf("\n"); }
        }

You will need to implement the alignment procedure in main, and two auxiliary procedures: hash, which produces a different six bit number for each three-character genetic encoding, and unhash, which prints out the three characters corresponding to each six bit number. These procedures will allow you to use the 64-entry table genecode in the alignment process to keep track of the encoding assignments that have been made.

Use the sample data above for debugging. The attached files protein and geneseq contain real genetic data: the amino acid sequence for a ``beta-lactamase'' protein and the genetic sequence for a human virus. Submit your gene.c code, as usual, and include the result of a.out prot.2 gene.2 in your readme file.


This assignment was developed by R. Sedgewick and Gun Sirer.

Copyright © 1998 Robert Sedgewick