|
What are the main goals of this assignment? You will (i) solve a fundamental problem in computational biology, (ii) learn about the analysis of algorithms, and (iii) learn about a powerful programming paradigm known as dynamic programming.
How do I tell Java to use more of my computer's memory? By default, Java will only use 64MB of memory (not very much for this application). You must explicitly ask for more by executing with
The 300m means 300MB, and you should adjust this number depending on the amount of memory your computer has and the size of the arrays you will need for the data set you running. (300m should get you through ecoli7000.txt)java -Xmx300m EditDistance < input.txt
How do I access the length of a string? Use x.length().
How do I access the ith character of a Java string x? Use x.charAt(i). As with arrays, indices start at 0.
What's a StringIndexOutOfBoundsException? It's just like an ArrayOutOfBoundsException. You might be invoking charAt(i) with an illegal value of i.
How could I get a NullPointerException? Did you forget to allocate memory for opt[][]?
How do I declare and initialze a two dimensional array in Java? Use int[][] opt = new int[M+1][N+1]; for an (M+1)-by-(N+1) array. All entries are initialized to 0.
It seems strange to define x[i..M] to be the substring consisting of x[i] through x[M-1] instead of x[i] through x[M]. Is this a typo? It's a little strange, but no, it's not a typo. This corresponds exactly with x.substring(i, M) in Java. However, you shouldn't need to use the substring method on this assignment.
Which alignment should I print out if there are two or more optimal ones? Output any one you like.
How do I determine how much memory I have? On Mac, select About this Mac from the Apple menubar. On Windows, type Ctrl-Alt-Delete, select the Taskbar, and look in the physical memory entry. It's also instructive to use the Activity Monitor (Mac) or taskbar (Windows) to observe the CPU and memory usage of your computer, as your program is running.
The asymptotic running time of my program is much better than my mathematical analysis predicts. What could I be doing wrong? If you are running your program and accessing the data files from the H: drive, especially if via a wireless network, the bottleneck for medium N might be the network latency instead of the dynamic programming algorithm! Copy all the files to your local hard drive and run from there.
The asymptotic running time of my program is a bit worse than my mathematical analysis predicts. What could I be doing wrong? When you run out of physical memory, your operating system may start using your hard drive as another form of storage. Accessing information from the hard drive is substantially slower than main memory, and you may be observing this effect.
Who discovered the dynamic programming algorithm we are using? The idea of dynamic programming was first advanced by Bellman (1957). Levenshtein (1966) formalized the notion of edit distance. Needleman-Wunsch (1970) were the first to apply edit distance and dynamic programming for aligning biological sequences, and our algorithm is essentially the one proposed in their seminal paper. The widely-used Smith-Waterman (1981) algorithm is quite similar, but solves a slightly different problem (local sequence alignment instead of global sequence alignment).
|
Input. There are many sample data files (with extension .txt) available in the sequence directory. To help you check your work, the edit distances of gene57.txt, stx1230.txt, and ftsa1272.txt are 8, 521, and 758, respectively.
Output. When you execute your program, it should print out the alignment either horizontally or vertically as below, along with the individual penalties.
% java EditDistance < example10.txt % java EditDistance < example10.txt Edit distance = 7 Edit distance = 7 A T 1 A A C A G T T A C C A A 0 T A A G G T C A C 2 1 0 2 0 0 1 0 2 0 1 A A 0 G G 0 T G 1 T T 0 A 2 C C 0 C A 1
Execution. We will use the following command to execute your program:
so be sure the main is in a program named EditDistance.java.java EditDistance < input.txt
|
Submission. Submit readme.txt and EditDistance.java.
readme.txt. Use the following readme file template. You will lose points if you don't address these questions.
|
These are purely suggestions for how you might make progress. You do not have to follow these steps.
and use a stopwatch. We redirect the output to a file to prevent printing text from becoming a computational bottleneck. Alternatively, use the commandjava -Xmx300m EditDistance < input.txt > output.txt
The -Xprof flag tells the compiler to report timing information. Note that the timing information will appear at the end of the file output.txt.java -Xprof -Xmx300m EditDistance < input.txt > output.txt
|