Programming Assignment Checklist: DNA Sequence Alignment

Pair programming. On this assignment, you are encouraged (not required) to work with a partner provided you practice pair programming. Pair programming "is a practice in which two programmers work side-by-side at one computer, continuously collaborating on the same design, algorithm, code, or test." One partner is driving (designing and typing the code) while the other is navigating (reviewing the work, identifying bugs, and asking questions). The two partners switch roles every 30-40 minutes, and on demand, brainstorm.

Before pair programming, you must read the article All I really need to know about pair programming I learned in kindergarten. You may choose a partner (of similar ability) from the same or a different precept. You might also find the video here helpful.

You and your partner will each turn in a separate individually written readme.txt for the assignment, stating who your partner is and detailing your experience. However, only one of you should turn in the assignment code.

Please note that writing code with a partner without following the pair programming instructions listed above is a serious violation of the course collaboration policy.

Frequently Asked Questions

What are the main goals of this assignment? You will (i) solve a fundamental problem in computational biology, (ii) learn about the analysis of algorithms, and (iii) learn about a powerful programming paradigm known as dynamic programming.

How do I read in the two input strings from the file? Use StdIn.readString() and redirection as usual.

How do I tell Java to use more of my computer's memory? By default, Java will only use 64MB of memory (not very much for this application). You must explicitly ask for more by executing with

java -Xmx300m EditDistance < input.txt
The 300m means 300MB, and you should adjust this number depending on the amount of memory your computer has and the size of the arrays you will need for the data set you are running. (Everyone should be able to request 300m which should get you through ecoli7000.txt. If your computer has sufficient memory, you can request 600m which should handle ecoli10000.txt)

How do I determine how much memory I have? On Mac, select About this Mac from the Apple menubar. On Windows, type Ctrl-Alt-Delete, select the Taskbar, and look in the physical memory entry. It's also instructive to use the Activity Monitor (Mac) or taskbar (Windows) to observe the CPU and memory usage of your computer, as your program is running.

How do I access the length of a string s? The ith character? Use s.length() and s.charAt(i), respectively. As with arrays, indices start at 0. We'll learn about this notation for manipulating (String) objects in Section 3.1. For this assignment, this is all you'll need to know about objects.

Can I assume that the input characters will always be A, C, G or T? NO! Your program should work equally well for any letter, upper case or lower case.

What's a StringIndexOutOfBoundsException? It's just like an ArrayOutOfBoundsException. It results from invoking s.charAt(i) with an illegal value of i.

How could I get a NullPointerException? Did you forget to allocate memory for opt[][]?

How do I declare and initialize a two dimensional array in Java? Review the end of Section 1.4 in Intro to Programming.

It seems strange to define x[i..M] to be the substring consisting of x[i] through x[M-1] instead of x[i] through x[M]. Is this a typo? It's a little strange, but no, it's not a typo. It's consistent with Java's indexing notation where the left endpoint is inclusive and the right endpoint is exclusive.

Which alignment should I print out if there are two or more optimal ones? Output any one you like.

The asymptotic running time of my program is much better than my mathematical analysis predicts. What could I be doing wrong? If you are running your program and accessing the data files from the H: drive (especially if via a wireless network), the bottleneck for medium N might be the network latency instead of the dynamic programming algorithm! Copy all the files to your local hard drive and run from there.

The asymptotic running time of my program is a bit worse than my mathematical analysis predicts. What could I be doing wrong? When you run out of physical memory, your operating system may start using your hard drive as another form of storage. Accessing information from the hard drive is substantially slower than main memory, and you may be observing this effect.

Input, Output, and Testing

Input.   There are many sample data files (with extension .txt) available in the sequence subdirectory. To help you check your work, the edit distances of gene57.txt, stx1230.txt, and ftsa1272.txt are 8, 521, and 758, respectively.

Output.   When you execute your program, it should print out the edit distance and the optimal alignment (either horizontally or vertically as below), along with the individual penalties.

% java EditDistance < example10.txt     % java EditDistance < example10.txt
Edit distance = 7                       Edit distance = 7
A T  1                                  A A C A G T T A C C
A A  0                                  T A - A G G T - C A
C -  2                                  1 0 2 0 0 1 0 2 0 1 
A A  0
G G  0
T G  1
T T  0
A -  2
C C  0
C A  1

Execution.   We will use the following command to execute your program:

java EditDistance < input.txt
so be sure the main() is in a program named EditDistance.java.

Submission

Submission. Submit readme.txt and EditDistance.java.

readme.txt. Use the following readme file template. You will lose points if you don't address these questions.

Possible Progress Steps

These are purely suggestions for how you might make progress. You do not have to follow these steps.

  1. Download the subdirectory sequence to your system. It contains sample data files.

  2. Write the following two short helper methods.
    // return the penalty for aligning character a with character b public static int penalty(char a, char b) // return the min of 3 integers public static int min(int a, int b, int c)
    You will call these from your main method to compute penalties and to determine which of the three cases yields the minimum edit distance.

  3. Write the main() method in EditDistance.java to read in the two strings from standard input, using the method StdIn.readString. For debugging, print them to standard output.

  4. Declare and initialize the (M+1)-by-(N+1) array opt[][]. Include the base cases. Print out the 2D array to check your work.

    To print the matrix out in nicely formatted columns, use

    System.out.printf("%3d", opt[i][j]);
    with nested for loops. Remember to remove this debugging print section before submitting the final version of your program.

  5. Now, it's time to implement the crucial dynamic programming part. First read the dynamic programming portion of Section 9.6 and make sure you understand how dynamic programming works. Think carefully about the order in which you will perform the computation since this is critical. Hint: first fill in the base case of opt[i][j], e.g., when i = M or j = N. Now, fill in the remaining values using a nested loop. Test that it works by printing out the contents of opt.

  6. Now, figure out how to recover the optimal alignment by backtracing.

    This is an iterative process. At each step we look to see which path choice we should make. Using the example from the assignment we start at i = 0, j=0 where x[i] = 'A' and y[i] = 'T'. The choices are to print "A -" and move down with a gap cost of 2, "- T" and move right with a gap cost of 2, or "A T" and move diagonally with a mismatch cost of 1. We know to pick "A T" because 7 - 6 = 1. This is the only choice which matches the matrix. (It is possible to have more than one choice which matches the matrix. In that case, either choice will lead to the same optimal edit distance.)

  7. To analyze the running time of your program, execute with the following command
    java -Xmx300m EditDistance < input.txt > output.txt
    and use a stopwatch. We redirect the output to a file to prevent printing text from becoming a computational bottleneck. Alternatively, use the command
    java -Xprof -Xmx300m EditDistance < input.txt > output.txt
    The -Xprof flag tells the compiler to report timing information. Note that the timing information will appear at the end of the file output.txt.

Enrichment