First, get hash() working
and debugged.
The input to hash() is an array of
characters from the set {a, c, t, g}.
Only the first 3 characters (corresponding to a codon)
of the input array will be used.
The output (or return value) is an integer between 0 and 63.
For example, if the first three characters of the array
are aaa, return 0. If the input is aac,
return 1. If the input is aag, return 2.
If the input is ttg, return 62. If the input
is ttt return 63.
It will be helpful to think of the three characters
(codon) as an integer represented in base 4,
with the mapping a=0, c=1, g=2, and t=3. Your job is to
convert this to a C integer. This is analogous to
converting between the base 4 and base 10 representations
of an integer.
To get started, you may want to write a small helper
function char_to_int() that converts a character
into an integer: a to 0, c to 1, g to 2, and t to 3.
One approach for debugging hash()
is to first comment out the portion of code that
prints out the results, and
replace it with printf() statements like the
following:
printf("%d %d\n", hash("att"), hash("gct"));
You should get the following output:
15 39
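To see where these numbers come from, read att as the base 4 number 033: 0*16 + 3*4 + 3 = 15. Likewise gct is 213 in base 4: 2*16 + 1*4 + 3 = 39. The following is a minimal sketch of one way to write the two functions, not the only one. The const char * parameter type, the -1 return value for unexpected characters, and the use of assert() are assumptions made for illustration; the assignment does not specify them.

```c
#include <assert.h>

/* Map one nucleotide to its base 4 digit: a=0, c=1, g=2, t=3.
   Returning -1 for anything else is an assumption; the assignment
   does not say how bad input should be handled. */
int char_to_int(char c)
{
    switch (c) {
    case 'a': return 0;
    case 'c': return 1;
    case 'g': return 2;
    case 't': return 3;
    default:  return -1;
    }
}

/* Treat the first three characters of codon as a 3-digit base 4
   number and return its value (0 to 63). */
int hash(const char *codon)
{
    int value = 0;
    for (int i = 0; i < 3; i++) {
        int digit = char_to_int(codon[i]);
        assert(digit >= 0);          /* only a, c, g, t are expected */
        value = 4 * value + digit;   /* shift left one base 4 place */
    }
    return value;
}
```

Multiplying the running value by 4 before adding each digit avoids computing powers of 4 explicitly.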
Now, get the unhash() function working
and debugged.
The input is an integer between 0 and 63.
The function does not return a value. Instead,
it prints to the screen the 3 characters corresponding
to this integer (see above).
As in the hash() function, you may
want to start by writing a helper function int_to_char()
that converts an integer to its corresponding character.
It is the inverse of char_to_int().
There are many ways to write this function.
It boils down to converting between the base 10 and base 4
representations of an integer.
To debug, you can replace the printf
statements above with:
unhash(15);
unhash(39);
This should produce the following output:
att gct
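Here is one possible sketch of the inverse direction, using integer division and remainder to peel off the three base 4 digits. The helper codon_to_string() is hypothetical (it is not part of the assignment) and is split out only so the conversion can be checked without looking at the screen. The trailing space printed by unhash() is also an assumption, chosen to reproduce the output shown above.

```c
#include <assert.h>
#include <stdio.h>

/* Inverse of char_to_int(): 0 to a, 1 to c, 2 to g, 3 to t. */
char int_to_char(int digit)
{
    static const char letters[] = "acgt";
    assert(digit >= 0 && digit <= 3);
    return letters[digit];
}

/* Hypothetical helper: write the codon for code into out,
   which must have room for 3 letters plus the terminating '\0'. */
void codon_to_string(int code, char out[4])
{
    assert(code >= 0 && code <= 63);
    out[0] = int_to_char(code / 16);        /* most significant digit */
    out[1] = int_to_char((code / 4) % 4);   /* middle digit */
    out[2] = int_to_char(code % 4);         /* least significant digit */
    out[3] = '\0';
}

/* Print the 3-character codon for code, followed by a space. */
void unhash(int code)
{
    char codon[4];
    codon_to_string(code, codon);
    printf("%s ", codon);
}
```

For example, 39 / 16 is 2 (g), (39 / 4) % 4 is 1 (c), and 39 % 4 is 3 (t), which gives gct.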
Before you can do the pattern matching part of the assignment,
you will need to read in the protein and gene sequences using
file input. We'll also review some of the key variables that
you'll use.
geneseq[] is a string that holds the sequence of
{'a', 'c', 't', 'g'} characters that are read in from the gene
input file.
protseq[] is a string that holds the sequence of
{'A', 'B', ... ,'Y'} that are read in from the protein input
file.
genecode[] is a 64 character array that you will use to keep track of
the matches you have made. Understanding the purpose of this array is crucial to
completing the assignment. An explanation follows, but if you are unsure, get
clarification from a preceptor before writing any more code.
Each of the 64 entries in genecode[]
corresponds to one of the 3-character codons.
You would like to be able to use genecode["att"] to
access the array value corresponding to the codon att.
Unfortunately, C requires that array indices be integers. This is the
whole purpose of hash() and unhash():
they allow you to use codons to index the array. To access the array element
corresponding to the codon att, use genecode[hash("att")].
This is the same as genecode[15], which is now valid C.
Similarly, genecode[0] corresponds
to the codon "aaa", and genecode[1] corresponds to the
codon "aac", and so on.
Each element of genecode[] holds a single character:
a capital letter corresponding to one of the 25 amino acids. Whenever
you store an amino acid in genecode[], you are matching a codon with
an amino acid. For example, setting genecode[15] = 'E' says
that the amino acid 'E' is encoded by the codon "att".
The goal
of this assignment is to find a consistent matching of codons to amino
acids, and produce a table like
gene.3.ans.
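The indexing idea can be sketched in a few lines. The hash() below is a compact stand-in so that the example compiles on its own; your version may look different. The function record_example() is hypothetical and exists only to show the two steps: blank out the table, then record that att encodes 'E'.

```c
#include <assert.h>
#include <string.h>

char genecode[64];   /* one amino acid letter (or '-') per codon */

/* Compact stand-in for hash(): base 4 value of the first 3 letters. */
static int hash(const char *codon)
{
    static const char letters[] = "acgt";
    int value = 0;
    for (int i = 0; i < 3; i++) {
        const char *p = (codon[i] != '\0') ? strchr(letters, codon[i]) : NULL;
        assert(p != NULL);                    /* only a, c, g, t expected */
        value = 4 * value + (int)(p - letters);
    }
    return value;
}

/* Hypothetical demo: mark every codon as unmatched, then match
   the codon att with the amino acid 'E'. */
void record_example(void)
{
    memset(genecode, '-', sizeof genecode);
    genecode[hash("att")] = 'E';              /* same as genecode[15] = 'E' */
}
```

After calling record_example(), genecode[15] holds 'E' and every other entry holds '-'.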
Your first task is to read in the protein file into the
protseq[] array. Be sure that all values in the array
are uppercase characters 'A' through 'Y'. After the last amino acid character
is read in, append the null character '\0' to the
array to mark the end of the protein string.
Print out the resulting string to standard output using
printf("%s\n", protseq) to make
sure you read it in successfully.
Hint: see the last exercise question on strings.
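As a rough sketch of the reading step, the function below copies characters from an already opened file into a buffer, keeps only characters from a given alphabet, and terminates the result with '\0'. The name read_sequence(), its parameters, and the decision to silently skip other characters (such as newlines) are all assumptions for illustration; the assignment may expect you to open and read the files differently.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read characters from fp into buf, keeping only
   those that appear in alphabet. Stops at end of file or when buf is
   full, always leaving room for the terminating '\0'.
   Returns the number of characters stored. */
size_t read_sequence(FILE *fp, const char *alphabet, char *buf, size_t cap)
{
    size_t n = 0;
    int c;

    assert(fp != NULL && alphabet != NULL && buf != NULL && cap > 0);
    while (n + 1 < cap && (c = fgetc(fp)) != EOF) {
        /* strchr() would also "find" '\0', so exclude it explicitly */
        if (c != '\0' && strchr(alphabet, c) != NULL)
            buf[n++] = (char)c;
    }
    buf[n] = '\0';   /* mark the end of the string */
    return n;
}
```

With a helper like this, reading the protein would pass the alphabet "ABCDEFGHIJKLMNOPQRSTUVWXY" and the protseq[] array, and the result can be checked with printf("%s\n", protseq).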
Now, write code to read in the genetic sequence
into the geneseq[] array.
Print it out to make sure you read it in properly.
The last part of main() prints out the table of
amino acid encodings. This is the only place the code uses
unhash(). It prints out each value of the
genecode[] array, along with the corresponding codon.
If you like, you can modify it so that it prints out
the table in 4 columns instead of 1.
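A 4-column version might look something like the sketch below. It is not the code supplied with the assignment: the function name print_table(), the FILE * parameter, and the tab-separated layout are assumptions, and the codon letters are computed inline rather than by calling unhash() so that the example stands on its own.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical sketch: print all 64 codons with their amino acids,
   4 entries per line, separated by tabs. */
void print_table(FILE *out, const char genecode[64])
{
    static const char letters[] = "acgt";

    assert(out != NULL && genecode != NULL);
    for (int code = 0; code < 64; code++) {
        fprintf(out, "%c%c%c %c",
                letters[code / 16],         /* first codon letter  */
                letters[(code / 4) % 4],    /* second codon letter */
                letters[code % 4],          /* third codon letter  */
                genecode[code]);
        /* end the line after every 4th entry, otherwise tab over */
        fputc((code % 4 == 3) ? '\n' : '\t', out);
    }
}
```

Calling print_table(stdout, genecode) would print 16 lines of 4 entries each.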
Part 4a: pattern matching
This is the trickiest part of the assignment.
You should carefully figure out a plan of attack before
writing code.
In this part, we describe how to check whether a match occurs at
one particular offset. In the next part, you will add an extra loop
that checks for matches at all possible offsets.
Here's a sketch of what you need to do.
Initialize each element of the genecode[]
array to '-' .
You will probably want two integer variables, say i
and j, to hold the current indices into the geneseq[]
and protseq[] arrays, respectively.
Initially i will be set to the offset, and j to 0.
To test for a possible alignment at the given offset,
repeat the following until you run out of amino acids in your
protein sequence. (Consider writing a loop that counts from j = 0
to the length of the protein sequence.)
The current codon consists of
geneseq[i], geneseq[i+1], and geneseq[i+2].
Look up the current codon in the genecode[] table.
If the entry there is still blank ('-'), store the current
amino acid protseq[j] in it.
Otherwise, if the amino acid stored there does not match
protseq[j], exit the loop, perhaps using a
break statement.
Increment i by 3 and j by 1.
The tricky part is looking up the codon in the genecode[]
table. This involves calling hash() with a pointer to
geneseq[i]. This is the only place in the matching
phase where you'll need to use a pointer.
Recall that geneseq + 17 is one way to denote a pointer to element
17 of the geneseq[] array.
Use prot.1 and gene.1 to test your code.
When you exit the loop outlined above, you need to know
whether it was because you reached the end of the protein sequence
(a match) or because a conflict occurred. If you found a conflict,
print out the position in the protein sequence where it occurred.
If you initialize the variables that keep track of the
current position in the geneseq[] and protseq[]
arrays to zero, then you should get the following
debugging output.
To test your code, try initializing the variable which
indexes the current position in the geneseq[] array to 2, 3,
and 10. If the initial value is 10, you should find a match and
get the following
match output.
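Putting the steps of this part together, a single-offset check could be sketched as follows. This is one possible arrangement under stated assumptions, not the required structure: the function name match_at(), its return convention (-1 for a match, otherwise the protein position of the conflict), and the choice to treat running off the end of the gene as a conflict are all made up for illustration. The hash() here is a compact stand-in so the example is self-contained.

```c
#include <assert.h>
#include <string.h>

/* Compact stand-in for hash(): base 4 value of the first 3 letters. */
static int hash(const char *codon)
{
    static const char letters[] = "acgt";
    int value = 0;
    for (int k = 0; k < 3; k++) {
        const char *p = (codon[k] != '\0') ? strchr(letters, codon[k]) : NULL;
        assert(p != NULL);
        value = 4 * value + (int)(p - letters);
    }
    return value;
}

/* Hypothetical sketch: test for an alignment at one offset.
   Returns -1 on a match, otherwise the index j in protseq
   where the first conflict occurred. */
int match_at(const char *geneseq, const char *protseq,
             int offset, char genecode[64])
{
    assert(offset >= 0);
    size_t glen = strlen(geneseq);
    size_t plen = strlen(protseq);
    size_t i = (size_t)offset;            /* index into geneseq */

    memset(genecode, '-', 64);            /* every codon starts unmatched */
    for (size_t j = 0; j < plen; j++, i += 3) {
        if (i + 3 > glen)
            return (int)j;                /* gene too short (assumption) */
        int h = hash(geneseq + i);        /* pointer to geneseq[i] */
        if (genecode[h] == '-')
            genecode[h] = protseq[j];     /* first time we see this codon */
        else if (genecode[h] != protseq[j])
            return (int)j;                /* codon already means something else */
    }
    return -1;                            /* reached the end of the protein */
}
```

Note that the comparisons use ==, not =, and that geneseq + i is the pointer passed to hash().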
Here are some debugging hints.
You may wish to use the strlen()
library function.
Lots of people accidentally use = instead
of ==, so consider yourself warned.
Part 4b: pattern matching
At the end of the last step, you changed the offset into
the geneseq[] array by editing your code. If you happened
to choose the right offset (10), then you found the match.
Modify your code so that it checks all possible offsets.
You will need to create an outer loop to repeat the pattern matching
code you wrote in the previous step. Determine the conditions under
which you will want to execute the loop, so that your program won't
crash if no match is found.
Be sure to print out the position where the match occurred.
Don't forget to reinitialize the genecode[] array to
'-' next time through the loop!
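The outer loop might be sketched as below. The names matches_at() and find_match() and the return value -1 for "no match" are assumptions for illustration. The loop condition only tries offsets that leave room for every codon of the protein, which is one way to keep the program from reading past the end of geneseq[] when no match exists. The hash() is again a compact stand-in so the example compiles on its own.

```c
#include <assert.h>
#include <string.h>

/* Compact stand-in for hash(): base 4 value of the first 3 letters. */
static int hash(const char *codon)
{
    static const char letters[] = "acgt";
    int value = 0;
    for (int k = 0; k < 3; k++) {
        const char *p = (codon[k] != '\0') ? strchr(letters, codon[k]) : NULL;
        assert(p != NULL);
        value = 4 * value + (int)(p - letters);
    }
    return value;
}

/* Hypothetical inner check: 1 if the protein aligns at offset, else 0.
   The caller guarantees that offset + 3 * strlen(protseq) fits in geneseq. */
static int matches_at(const char *geneseq, const char *protseq,
                      size_t offset, char genecode[64])
{
    size_t plen = strlen(protseq);

    memset(genecode, '-', 64);            /* reinitialize for this offset */
    for (size_t j = 0; j < plen; j++) {
        int h = hash(geneseq + offset + 3 * j);
        if (genecode[h] == '-')
            genecode[h] = protseq[j];
        else if (genecode[h] != protseq[j])
            return 0;                     /* conflict */
    }
    return 1;
}

/* Hypothetical outer loop: return the first offset where a match
   occurs, or -1 if there is none. */
long find_match(const char *geneseq, const char *protseq, char genecode[64])
{
    size_t glen = strlen(geneseq);
    size_t needed = 3 * strlen(protseq);  /* gene characters one match uses */

    for (size_t offset = 0; offset + needed <= glen; offset++) {
        if (matches_at(geneseq, protseq, offset, genecode))
            return (long)offset;
    }
    return -1;
}
```

When a match is found, genecode[] is left holding the codon table for that offset, ready to be printed.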
Written by Lisa Worthington and Kevin Wayne