COS 126 Cracking the Genetic Code |
Due: Wed. 4/23, 11:59pm |
For this assignment, you'll write a program, genes.c
, to find
the genetic encoding of amino acids, given a protein and a genetic sequence
known to contain that protein. This program is an example of computational
biology, a new science that uses computers to do biological research. Also,
you'll learn about hashing, a widely used technique for making
programs more efficient.
For the purposes of this assignment, a genetic sequence is a
string comprised of one of the four characters 'a
', 'c
',
'g
', or 't
', and an amino acid encoding
is a genetic sequence of length 3. There are 25 amino acids, which we will refer
to with the capital letters 'A
' through 'Y
'. A protein
is a sequence of amino acids. There are 4×4×4 = 64 different amino
acid encodings, each of which corresponds to at most one amino acid, but some of
the 25 amino acids might have more than one encoding.
Given a protein and a long genetic sequence known to contain that protein,
we want to establish the genetic encodings of the amino acids in the protein. An
example illustrates this process. Suppose that we know that the protein M
S I Q H M R
is somewhere in the genetic sequence attgctagcaatgctagcaattgctagcaattcat
,
but we don't know where, and we don't know any of the encodings for the amino
acids. The problem seems hopeless at first, but it turns out that a
straightforward approachthe
one-pass trial-and-error methodworks
if the protein sequence is long enough. First, try aligning the first
characters:
attgctagcaatgctagcaattgctagcaattcat M S I Q H M R
This would work if a code for M
were att
, a code
for S
were gct
, and so on. But this can't be
correct, because
gct
cannot be an encoding for both S
and H
,
as illustrated by the bold
characters in the display
above. Note that having two different genetic encodings for an amino acid (M
in this case as illustrated by the slanted
characters)
is allowedwhat
is required that each genetic code represent only one amino acid.
The sequences don't align at the beginning, so we next slide the protein one character to the right and try aligning the protein sequence with the second character of the genetic sequence:
attgctagcaatgctagcaattgctagcaattcat M S I Q H M R
Again, cta
can't encode two different amino acids, so this
can't be the alignment. The same thing happens again for aligning the protein
with the third character of the genetic sequence; how about the fourth
character?
attgctagcaatgctagcaattgctagcaattcat M S I Q H M R
Not only is there a conflict with agc
encoding both S
and H
, but also with gct
encoding both M
and Q
. Continuing to slide the protein sequence along the genetic
sequence in this way, we finally get an alignment that could work:
attgctagcaatgctagcaattgctagcaattcat M S I Q H M R
This alignment is evidence that atg
and agc
are
genetic codes for the amino acid M
, cta
is a code
for S
, gca
for I
, etc.
This example is an artificial one designed to illustrate how the method works. In real data, the sequences are much longer, and there's no mistaking a match. Furthermore, errors never happen: The same codes always emerge every time the method is run on a genetic sequence known to contain a protein. (These may be found in any biochemistry textbook.)
Your program will accept two arguments: The name of the file holding the protein sequence and the name of the file holding the gene sequence, and it prints the position in the gene sequence at which it found a match and the amino acid encodings. For example:
% /u/cs126/bin/lcc genes.c % a.out /u/cs126/data/prot.1 /u/cs126/data/gene.1 Match found at gene position 10 aaa . aac . aat R aag . aca . acc . act . acg . ata . atc . att Q atg M aga . agc M agt . agg . caa . cac . cat . cag . cca . ccc . cct . ccg . cta S ctc . ctt . ctg . cga . cgc . cgt . cgg . taa . tac . tat . tag . tca . tcc . tct . tcg . tta . ttc . ttt . ttg . tga . tgc . tgt . tgg . gaa . gac . gat . gag . gca I gcc . gct H gcg . gta . gtc . gtt . gtg . gga . ggc . ggt . ggg .
The files /u/cs126/data/prot.1
and /u/cs126/data/gene.1
contain the sample data above for debugging. The output shows all 64 possible
amino acid encodings and the encodings for each protein in
/u/cs126/data/prot.1
;
periods identify unused encodings. (Your program might print the 64
possibilities in a different order, but the encodings should be the same.)
Here's an outline of genes.c
; copy this code to use as a
starting point.
#include <stdio.h> #include "misc.h" int hash(char *p) { /* return a 6-bit number for the 3-character encoding p */ } void unhash(int x) { /* print the 3-character encoding corresponding to the 6-bit number x */ } int main(int argc, char *argv[]) { char gene[10000], protein[1000], genecode[64]; int j; if (argc < 3) error("usage: %s protein gene\n", argv[0]); /* read the contents of argv[1] into protein[0...] */ /* read the contents of argv[2] into gene[0...] */ /* find an encoding for protein[0...] in gene[0...] and record the encoding in gencode[0..63] */ for (j = 0; j < 64; j++) { unhash(j); printf(" %c ", genecode[j]); if ((j + 1)%4 == 0) printf("\n"); } return 0; }
As the outline above suggests, you'll need to implement the alignment
procedure in main
, and write two auxiliary functions. hash
returns a different 6-bit number for each 3-character genetic encoding, and
unhash
prints the three characters corresponding to each 6-bit
number. hash
allows you to use the 64-element array genecode
in the alignment process to keep track of the encoding assignments. For example,
for the example shown above, if hash("aat")
is 3, then
gencode[3]
is 'R'
; that is, aat
is the
encoding for R
.
To read a file, you'll need to use fopen
, fgetc
,
and fclose
(Deitel and Deitel, Sec. 11.3-5) to open a file, read
its contents, and close the file. You should use
efopen
, declared in
misc.h
, because it checks
the value returned from fopen
. error
is also declared in misc.h
.
Protein sequences are given as uppercase letters and gene sequences are
given as lowercase letters; your code can thus read these sequences a character
at a time with fgetc
and ignore all other characters. Use the
functions declared in ctype.h
to identify characters (Deitel and
Deitel, Sec. 8.3).
The files
/u/cs126/data/prot.2
and /u/cs126/data/gene.2
contain real genetic data: the amino acid sequence for a "beta-lactamase''
protein and the genetic sequence for a human virus. Here's what you should get:
% a.out /u/cs126/data/{prot,gene}.2 Match found at gene position 1687 aaa K aac N aat N aag K aca T acc T act T acg T ata I atc I att I atg M aga R agc S agt S agg . caa Q cac H cat H cag Q cca P ccc P cct P ccg P cta L ctc L ctt L ctg L cga R cgc R cgt R cgg R taa . tac Y tat Y tag . tca S tcc S tct S tcg S tta L ttc F ttt F ttg L tga . tgc C tgt C tgg W gaa E gac D gat D gag E gca A gcc A gct A gcg A gta V gtc V gtt V gtg V gga G ggc G ggt G ggg G
Turn in your code and your documentation with the command
/u/cs126/bin/submit 10 readme genes.c
References
R. Sedgewick, Algorithms in C,
2nd ed., Addison-Wesley, 1990, pp. 231-234.