COS 126 Assignment 7 Checklist

Part 0: preparation (0 points)

Read the assignment in the course packet to get an overview.

Read K+R 5.5-5.10, 7.5, and look over Appendix B3 (string library). Read Sedgewick 3.6.

Copy the following files from /u/cs126/files/gene into an empty directory.

gene.c   prot.1   gene.1   prot.2   gene.2
prot.3   gene.3   prot.4   gene.4

You can copy all the files to the current directory with the command:

cp /u/cs126/files/gene/* .

You can use gene.c as a starting point. It handles the input and output. You will execute with

a.out prot.1 gene.1

Don't accidentally reverse the order of the two command line arguments.

Some genetic jargon you should get used to.

nucleotide - one character from the set {a, c, t, g}

codon - 3 nucleotides, e.g., att, agc, or tat

amino acid - one uppercase character from A through Y

protein - sequence of amino acids, e.g., MSIQHMR

Understand the provided code that reads in the protein sequence. Then, write similar code that will read in the gene sequence. Don't forget to discard all characters that don't belong, e.g., only add lower case characters to the geneseq[] array, and skip over the others. To check that you read it in properly, you can print the gene sequence with printf("%s\n", geneseq);. Since you've removed the newline characters, it will all appear as one long line.
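Here is one way the filtering loop might look. It is only a sketch: the helper name read_gene() and its parameters are made up for illustration and do not come from gene.c, and it assumes maxlen is at least 1.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper (not part of gene.c): reads characters from fp into
   seq[], keeping only lowercase letters, and null-terminates the result.
   maxlen is the capacity of seq[], including room for the '\0'.
   Returns the number of characters stored. */
int read_gene(FILE *fp, char seq[], int maxlen)
{
    int c;
    int n = 0;

    while (n < maxlen - 1 && (c = getc(fp)) != EOF) {
        if (islower(c))          /* skip newlines, digits, spaces, etc. */
            seq[n++] = (char) c;
    }
    seq[n] = '\0';               /* so that printf("%s") and strlen() work */
    return n;
}
```

After a call such as read_gene(fp, geneseq, sizeof geneseq), the printf check described above should show the sequence on one line.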

Part 1: code (2 points)

First, get code() working and debugged.

The input to code() is an array of characters from the set {a, c, t, g}. Only the first 3 characters (corresponding to a codon) of the input array will be used. The output (or return value) is an integer between 0 and 63. For example, if the first three characters of the array are aaa, return 0. If the input is aac, return 1. If the input is aag, return 2. If the input is ttg, return 62. If the input is ttt return 63.

It will be helpful to think of the three characters (codon) as an integer represented in base 4, with the mapping a=0, c=1, g=2, and t=3. Your job is to convert this to a C integer. This is analogous to converting between the base 4 and base 10 representations of an integer.

To get started, you may want to write a small helper function char_to_int() that converts a character into an integer: a to 0, c to 1, g to 2, and t to 3.

One approach for debugging the code() function is to first comment out the portion of code that prints out the results, and replace it with printf statements like the following:

printf("%d %d\n", code("att"), code("gct"));

You should get the following output:

15 39
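The base 4 conversion for Part 1 can be sketched as follows. The parameter types, and the choice to return -1 for a character outside {a, c, g, t}, are assumptions; the assignment does not say how code() should treat bad input.

```c
/* Maps a nucleotide to its base 4 digit: a=0, c=1, g=2, t=3.
   Returning -1 for anything else is an assumption of this sketch. */
int char_to_int(char c)
{
    switch (c) {
    case 'a': return 0;
    case 'c': return 1;
    case 'g': return 2;
    case 't': return 3;
    default:  return -1;
    }
}

/* Treats the first three characters of s as a 3-digit base 4 number.
   Assumes those three characters are all valid nucleotides. */
int code(const char s[])
{
    return 16 * char_to_int(s[0])
         +  4 * char_to_int(s[1])
         +      char_to_int(s[2]);
}
```

For att this computes 16*0 + 4*3 + 3 = 15, and for gct it computes 16*2 + 4*1 + 3 = 39.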

Part 2: decode (2 points)

Now, get the decode() function working and debugged.

The input is an integer between 0 and 63. The function does not return a value. Instead, it prints to the screen the 3 characters corresponding to this integer (see above).

As in the code() function, you may want to start by writing a helper function int_to_char() that converts an integer to its corresponding character. It is the inverse of char_to_int().

There are many ways to write this function. It boils down to converting between the base 10 and base 4 representations of an integer.

To debug, you can replace the printf statements above with:

decode(15);
decode(39);

This should produce the following output:

att gct
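One possible shape for the Part 2 functions is sketched here. The helper codon_of() is hypothetical (it exists only so the conversion can be checked without reading the screen), and whether decode() should print a space or newline after the codon is left open by the text above, so this version prints just the three characters.

```c
#include <stdio.h>

/* Inverse of char_to_int(): 0 -> a, 1 -> c, 2 -> g, 3 -> t.
   Assumes 0 <= d <= 3. */
char int_to_char(int d)
{
    return "acgt"[d];
}

/* Hypothetical helper: writes the codon for n (0..63) into out[]
   as a null-terminated string. */
void codon_of(int n, char out[4])
{
    out[0] = int_to_char(n / 16);       /* most significant base 4 digit */
    out[1] = int_to_char((n / 4) % 4);  /* middle digit */
    out[2] = int_to_char(n % 4);        /* least significant digit */
    out[3] = '\0';
}

/* Prints the three characters for n; returns nothing. */
void decode(int n)
{
    char codon[4];

    codon_of(n, codon);
    printf("%s", codon);
}
```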

Part 3: understanding main (0 points)

Before you can do the pattern matching part of the assignment, be sure that you understand what is going on in main().

geneseq[] is a string that holds the sequence of {a, c, t, g} characters that are read in from the gene input file.

protseq[] is a string that holds the protein that is read in from the protein input file.

genecode[] is a 64 character array that you will use to keep track of the 'matches' you have made. Understanding the purpose of this array is crucial to completing the assignment. An explanation follows, but if you are unsure, get clarification from a preceptor before writing any more code.

Each of the 64 entries in the genecode array corresponds to one of the 3-character codons. Ideally, you would like to be able to use genecode["att"] to access the array value corresponding to the codon att. Unfortunately, C requires that array indices be integers. This was the whole purpose of the code and decode functions - so that you can use codons to index the array. To access the array element corresponding to the codon att, use genecode[code("att")]. This is the same as genecode[15], which is now valid C. Similarly, genecode[0] corresponds to the codon "aaa", genecode[1] corresponds to the codon "aac", and so on.
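To make the indexing concrete, here is a small sketch. The code() shown is a compact stand-in that assumes its argument begins with three characters from {a, c, g, t}; any correct code() will do. The helper set_amino() is hypothetical and is not part of gene.c.

```c
#include <string.h>

/* Compact stand-in for code(): builds the base 4 value one digit at a time.
   Assumes s starts with three characters from {a, c, g, t}. */
int code(const char *s)
{
    static const char bases[] = "acgt";
    int i;
    int n = 0;

    for (i = 0; i < 3; i++)
        n = 4 * n + (int) (strchr(bases, s[i]) - bases);
    return n;
}

/* Hypothetical helper: records in the table that codon encodes amino. */
void set_amino(char genecode[64], const char *codon, char amino)
{
    genecode[code(codon)] = amino;
}
```

Calling set_amino(genecode, "att", 'E') has the same effect as genecode[15] = 'E'.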

Each element of the genecode array holds a single character: a capital letter corresponding to one of the 25 amino acids. Whenever you store an amino acid in the genecode array, you are matching a codon with an amino acid. For example, setting genecode[15] = 'E' says that the amino acid 'E' is encoded by the codon "att". The goal of this assignment is to find a consistent matching of codons to amino acids, and produce a table like gene.3.ans.

The main function begins by opening the protein and gene files using fopen(). The names of these files are specified as command line arguments. These concepts will be discussed in precept, but understanding file manipulation is not central to completing this assignment.

The protein is read from the protein file into the protseq[] array. The code ensures that all values in the array are uppercase characters A through Y. After the last amino acid character is read in, the null character '\0' is inserted at the end of the array to denote the end of the protein string.

The geneseq[] array is similarly initialized.

The last part of gene.c includes code to print the table of amino acid encodings. This is the only place the code uses decode(). It prints out each value of the genecode[] array in 4 columns, along with the corresponding codon.

Part 4a: pattern matching (5 points)

This is the trickiest part of the assignment. You should carefully figure out a plan of attack before writing code. Here's a sketch of what you need to do.

Initialize each element of the genecode[] array to '-' instead of ' '. The '-' is used to denote a blank, and will be easier to see on the screen than ' '.

You will need variables to keep track of your current position in the geneseq[] and protseq[] arrays.

To test for a possible alignment, repeat the following until you run out of amino acids in your protein sequence:

Look up the current codon (from the geneseq[] array) in the genecode[] table.

If an amino acid is already stored there and it does not match the current amino acid (from the protein sequence), exit the loop, perhaps using a break statement.

Otherwise, if the entry in the genecode[] table is blank, store the current amino acid there.

Update the current positions in the geneseq[] and protseq[] arrays (by +3 and +1, respectively).

The tricky part of this step is looking up the codon in the genecode[] table. To do this, you want to call the code function with the character array beginning at the current position in the geneseq array. This requires using pointers. Recall that geneseq + 17 is the address of element 17 of the geneseq[] array, that is, of geneseq[17]. (Another way to write this is &geneseq[17].) You may want to work on this part of the assignment separately before you write the loop. Use prot.1 and gene.1 to test your code.
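If the pointer idea is unclear, a tiny experiment may help. The helper copy_codon() below is hypothetical; it just shows what a function sees when it is handed geneseq + pos rather than geneseq itself.

```c
/* Hypothetical helper: copies the three characters starting at p into
   out[] and null-terminates them. The caller decides where p points. */
void copy_codon(const char *p, char out[4])
{
    out[0] = p[0];   /* p[0] is the character at the position passed in, */
    out[1] = p[1];   /* not necessarily the start of the whole array     */
    out[2] = p[2];
    out[3] = '\0';
}
```

With char geneseq[] = "attgct", the call copy_codon(geneseq + 3, buf) leaves "gct" in buf, because geneseq + 3 and &geneseq[3] are the same address. Passing geneseq + pos to code() works the same way.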

When you exit the loop outlined above, you need to know whether it was because you reached the end of the protein sequence (a match) or because a conflict occurred. If you found a conflict, print out the position in the protein sequence where it occurred.
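The steps above might be put together along these lines. This is a sketch under several assumptions, not the required design: the function name try_alignment(), its parameters, the 0-based conflict position, and the choice to treat running out of nucleotides as a conflict are all made up for illustration. The code() here is a compact stand-in that assumes valid nucleotides.

```c
#include <string.h>

#define BLANK '-'

/* Compact stand-in for code(); assumes three characters from {a, c, g, t}. */
int code(const char *s)
{
    static const char bases[] = "acgt";
    int i;
    int n = 0;

    for (i = 0; i < 3; i++)
        n = 4 * n + (int) (strchr(bases, s[i]) - bases);
    return n;
}

/* Hypothetical function: tries to align protseq with geneseq beginning at
   gene index start, filling in genecode[] as it goes.
   Returns -1 on a match, or the 0-based protein index of the conflict. */
int try_alignment(const char *geneseq, const char *protseq,
                  int start, char genecode[64])
{
    int glen = (int) strlen(geneseq);
    int plen = (int) strlen(protseq);
    int g = start;   /* current position in geneseq */
    int p;           /* current position in protseq */

    for (p = 0; p < plen; p++, g += 3) {
        int k;

        if (g + 3 > glen)
            return p;                     /* fewer than 3 nucleotides left */
        k = code(geneseq + g);
        if (genecode[k] == BLANK)
            genecode[k] = protseq[p];     /* new codon: record the match */
        else if (genecode[k] != protseq[p])
            return p;                     /* codon already means something else */
    }
    return -1;                            /* reached the end of the protein */
}
```

The caller is expected to fill genecode[] with '-' before each call, then print either the conflict position or the table depending on the return value.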

If you initialize the variables that keep track of the current position in the geneseq[] and protseq[] arrays to zero, then you should get the following conflict output.

To test your code, try initializing the variable which indexes the current position in the geneseq[] array to 2, 3, and 10. If the initial value is 10, you should find a match and get the following match output.

Here are some debugging hints.

You may wish to use the strlen() library function.

Lots of people accidentally use = instead of ==, so consider yourself warned.

The solution for the example data in gene.1 and prot.1 is gene.1.ans.

Part 4b: pattern matching

At the end of the last step, you changed the starting position of the geneseq[] array by editing your code. You found the match by repeatedly incrementing this value by 1 and running your code to test for a match. Obviously, this is rather tedious - even for the simple prot.1 and gene.1 input files, this required running your program 11 times. Modify your code so that the program increments the starting index until it finds a match.

You will need to create an outer loop to repeat the pattern matching code you wrote in the previous step. Determine the conditions under which you will want to execute the loop, so that your program won't crash if no match is found.

Be sure to print out the position where the match occurred.

Don't forget to reinitialize the genecode[] array to '-' each time through the outer loop!
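A sketch of the outer loop follows, with the inner matching loop written inline so the example stands alone. The function name find_match() and the convention of returning -1 when no match exists are assumptions, and code() is again a compact stand-in that assumes valid nucleotides.

```c
#include <string.h>

#define BLANK '-'

/* Compact stand-in for code(); assumes three characters from {a, c, g, t}. */
int code(const char *s)
{
    static const char bases[] = "acgt";
    int i;
    int n = 0;

    for (i = 0; i < 3; i++)
        n = 4 * n + (int) (strchr(bases, s[i]) - bases);
    return n;
}

/* Hypothetical function: returns the first gene index at which protseq
   aligns consistently, leaving the matching table in genecode[].
   Returns -1 if no starting index works. */
int find_match(const char *geneseq, const char *protseq, char genecode[64])
{
    int glen = (int) strlen(geneseq);
    int plen = (int) strlen(protseq);
    int start;

    /* Stop once fewer than 3 * plen nucleotides remain, so the inner
       loop can never read past the end of geneseq. */
    for (start = 0; start + 3 * plen <= glen; start++) {
        int p;

        memset(genecode, BLANK, 64);      /* fresh table for each attempt */
        for (p = 0; p < plen; p++) {
            int k = code(geneseq + start + 3 * p);

            if (genecode[k] == BLANK)
                genecode[k] = protseq[p];
            else if (genecode[k] != protseq[p])
                break;                    /* conflict: try the next start */
        }
        if (p == plen)
            return start;                 /* every amino acid was placed */
    }
    return -1;
}
```

The loop condition is one way to keep the program from crashing when no match is found: the search simply ends and the caller can report that there was no match.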

Written by Lisa Worthington and Kevin Wayne