Assignment Goals

  • Solve a pattern matching problem that arises in computational biology.

  • Learn about strings.

  • Learn about hashing.

Checking Your Work and Hints

  • You might find the following demo helpful if you don't understand the statement of the problem.

  • Some hints on getting started are available here.

  • File format. For the genetic sequence, you should ignore all characters other than 'a', 'c', 'g' and 't'. For the protein sequence, you should ignore all characters other than 'A' through 'Z'. Some of the input files contain other characters, including newlines, spaces, and numbers, so be sure to ignore these when reading in the two sequences.

  • Various test protein and genetic input files are located at /u/cs126/files/gene/. You may wish to use these to test your code. The solution for the example data in prot.1 and gene.1 is gene.1.ans; the solution for prot.3 and gene.3 is gene.3.ans; the solution for prot.3 and gene.2 is "NOT FOUND". Your program should behave properly even if there is no match.

  • You may use /u/cs126/bin/gene126 to test your solutions.
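
The file format rule above (keep only 'a', 'c', 'g', 't' for the genetic sequence and only 'A' through 'Z' for the protein sequence) could be handled with one filtering reader. This is only a minimal sketch, not the required design: the names read_sequence, is_nucleotide, and is_amino are hypothetical, and it assumes the caller supplies a buffer large enough for the whole sequence.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Accept only the four letters allowed in the genetic sequence. */
int is_nucleotide(int c) {
    /* strchr would also match the terminating NUL, so rule it out first. */
    return c != '\0' && strchr("acgt", c) != NULL;
}

/* Accept only 'A' through 'Z', as allowed in the protein sequence.
   Assumes an ASCII-compatible character set where A..Z are contiguous. */
int is_amino(int c) {
    return c >= 'A' && c <= 'Z';
}

/*
 * Read from `in` until EOF (or until the buffer is full), storing only the
 * characters accepted by `keep`. Newlines, spaces, digits, and anything
 * else are silently skipped. At most cap - 1 characters are stored and the
 * result is always NUL-terminated. Returns the number of characters stored.
 */
size_t read_sequence(FILE *in, int (*keep)(int), char *buf, size_t cap) {
    size_t n = 0;
    int c;

    assert(in != NULL && keep != NULL && buf != NULL && cap > 0);
    while (n + 1 < cap && (c = getc(in)) != EOF) {
        if (keep(c))
            buf[n++] = (char) c;
    }
    buf[n] = '\0';
    return n;
}
```

A caller might open the gene file with fopen and call read_sequence(fp, is_nucleotide, gene, sizeof gene), then do the same for the protein file with is_amino. Passing the filter as a function pointer keeps the reading loop identical for both inputs.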

Submission and readme

  • Use the following submit command:
    submit126 8 readme gene.c
    

  • The readme file should contain the following information. Here is a template readme file.

  • Name, precept number, high-level description of your code, any problems encountered, and whatever help (if any) you received.

  • Describe how you implemented hash() and unhash().
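
This page does not say what hash() and unhash() must do, so the following is a sketch under a stated assumption rather than the expected solution: it assumes the hash maps a three-letter codon over 'a', 'c', 'g', 't' to an integer from 0 to 63 by reading it as a base-4 number, and that the inverse recovers the codon. The names codon_hash and codon_unhash, their signatures, and the alphabet order are all hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical digit order: a=0, c=1, g=2, t=3. Any fixed order works as
   long as both functions share it. */
static const char ALPHABET[] = "acgt";

/*
 * Assumed behavior: interpret the first three characters of `codon` as a
 * base-4 number, giving a value in 0..63. Returns -1 if any of the three
 * characters is not one of a, c, g, t (including a premature NUL).
 */
int codon_hash(const char *codon) {
    int h = 0;
    for (int i = 0; i < 3; i++) {
        const char *p = (codon[i] != '\0') ? strchr(ALPHABET, codon[i]) : NULL;
        if (p == NULL)
            return -1;
        h = 4 * h + (int) (p - ALPHABET);
    }
    return h;
}

/*
 * Inverse of codon_hash: write the codon for h (0..63) into `out`, which
 * must have room for 4 bytes (three letters plus the terminating NUL).
 */
void codon_unhash(int h, char *out) {
    assert(h >= 0 && h < 64 && out != NULL);
    for (int i = 2; i >= 0; i--) {
        out[i] = ALPHABET[h % 4];   /* least significant digit goes last */
        h /= 4;
    }
    out[3] = '\0';
}
```

Under this assumption a table of 64 entries indexed by codon_hash is enough to look up the amino acid for any codon, and a readme description would explain the digit order and the base-4 arithmetic.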

Enrichment Links

  • The genetic data is actually cDNA (the coding region of DNA), not DNA. The codon mapping is the same as for RNA, except that RNA has u wherever this data has t. Keep that substitution in mind if you compare with your biology textbook or with the following amino acid table borrowed from EBB 320.

  • The genetic data is taken from the National Center for Biotechnology Information (NCBI).



Kevin Wayne