Computer Science 226
Algorithms and Data Structures
Fall 2013
|
|
|
Course Information |
Assignments |
Exercises |
Lectures |
Exams |
Precepts |
Study Guide |
Booksite
Union Find and Intro
Analysis of Algorithms
Stacks and Queues
Elementary Sorts
Mergesort
Quicksort
Collections, PQs, and Priority Queues
General Sorting
BSTs
Balanced BSTs
Hashing
Geometric Applications of BSTs
Undirected Graphs
Directed Graphs
MSTs
Shortest Paths
Max flow
String Sorts
Tries
Substring Matching
Regular Expressions
Compression
Reductions and Intractability
Study Guide
This page provides a list of what I consider the most important topics and ideas from each lecture. The optional exercises listed on this page are not for a grade. Instead, they are to provide you with a better idea of what we're expecting you to know before the final.
Optional exercises are either C problems, B problems, or A problems. C problems are at the same level as the exercises, and generally require that you only be able to mechanically carry out the algorithms. B problems are harder problems that require you to really know what's going on (the exams will have plenty of these). A problems are problems that either cover tangential material (like how Kosaraju-Shamir works), require deep comprehension of the material, involve very clever tricks, or perhaps even all 3.
Don't be intimidated by its length. I've tried to be as comprehensive as possible.
Union Find and General Intro
Key Concepts
Algorthm Development. Developing a good algorithm is an iterative process. We create a model of the problem, develop an algorithm, and revise the performance of the algorithm until it meets our needs.
Union-Find
. The ultimate goal is to develop a data type that support the following operations:
- union(int p, int q)
- connected(int p, int q)
We do NOT care about finding the actual path from p to q. We only care about their connectedness.
Very closely related to connected is a 3rd operation we can support:
Find is defined such that find(p) = find(q) iff connected(p, q).
We assume that the number of objects N is fixed.
Key Observation: Connectedness is an Equivalence Relation. Saying that two objects are connected is the same as saying they are in an equivalence class.
This is just fancy math-talk for saying "every object is in exactly one bucket, and we want to know if two objects are in the same bucket". When you union two objects, you're basically just pouring everything from one bucket into another.
Quick Find. This is the most natural solution, where each object is given an explicit number. Uses an array id of length N. id[i] is the bucket number of object i (which is returned by find). To union two objects p and q, we set every object in q's bucket to have p's number.
- Union: May require many changes to id. Takes N time in the worst case (to union large sets).
- Connected (and Find): Constant time.
Performing M union-find operations takes NM time. If M were proportional to N, this results in N2 time;
Quadratic Algorithms Don't Scale. Given an N times larger problems on an N times faster computer, the problem takes N times as long to run.
Quick Union. id[i] is the parent object of object i. An object can be its own parent. The find method climbs the ladder of parents until it reaches the root (an object whose parent is itself). To union p and q, we set p's parent to q.
- Union: Requires only one change to id, but also requires root finding (worst case N time).
- Connected (and Find): Requires root finding (worst case N time).
Performing M union-find operations takes NM time in the worst case. Again, this results in quadratic behavior.
Weighted Quick Union
. Rather than union(p, q) pointing p's parent at q, we instead point the smaller tree at the larger one. The tree's size is given by the NUMBER of nodes, not the height of the tree. Results in tree heights of lg N (you should understand this proof).
- Union: Requires only one change to id, but also requires root finding (worst case log N time).
- Connected (and Find): Requires root finding (worst case log N time).
Weighted Quick Union with Path Compression. When find is called, the tree is compressed. Results in nearly flat trees. Making M calls to union and find with N objects results in no more than M log*(N) array accesses. Log*(N) is at most 5 in our universe.
Recommended Problems
C level
- What are the best and worst case tree heights for weighted quick-union and weighted quick-union with path compression? Give your answers in terms of order of growth.
- Textbook: 1.5.1, 1.5.2, 1.5.3
B level
- Fall 11 Midterm, #1
- Fall 12 Midterm, #1
- Textbook: 1.5.8
- Textbook: 1.5.9
A level
- Textbook: 1.5.10
- If we're concerned about tree height, why don't we use height for deciding tree size instead of weight? What is the worst-case tree height for weighted-quick-union vs. heighted-quick-union? What is the average tree height?
- Try writing weighted-quick-union-with-path-compression without looking at the code on the booksite. You may look at the API. Compare your resulting code with the booksite code.
Analysis of Algorithms
Key Concepts
Tilde notation. We say that f(N) ~ g(N) if in the large N limit, f(N)/g(N) converges to 1. This is a general concept that can be applied to any functions, and is not restricted to runtime, memory, or any other specific domain.
Power-law assumption. For empirical analyses in COS226 we will assume that runtimes obey a power law (i.e. T(N) ~ aNb). Empirical derivation of accurate hypotheses for non-power law runtimes is beyond the scope of our course.
Cost model. For theoretical analyses of runtime in COS226, we will assume a 'cost model', namely that some particular operation (operations) dominates the runtime of a function. We can then express the runtime in terms of the total number of that operation. Often, we give this count in Tilde notation.
Order of growth. If we have two functions f(N) and g(N), and f(n) ~ a G(N), we say the order of growth of f(N) is g(N).
Performance depends on inputs. We can characterize an algorithm's performance by its best case, worst case, and average case.
Difficulty of a problem. To understand the difficulty of a particular problem in COS226, we consider only the worst-case order-of-growth of its solutions. We can naturally upper bound the difficulty of a problem by the performance of the best known solution. We can naturally lower bound the difficulty of a problem with general considerations of the problem. This is actually as trivial as it sounds.
Big Oh, Big Omega, Big Theta. Big Theta is precisely the same thing as "order of growth". If we say "f(N) is Θ(g(N))" or "f(N)=Θ(g(N))", we are simply saying f(N) is the same order of growth as g(N). If f(N)=O(g(N)), then f(N) has an order of growth that is less than or equal to g(N). More precisely, we could say that the limit of f(N)/g(N) as N goes to infinity is NOT infinity. If f(N)=Ω(g(N)), then f(N) has an order of growth that is greater than or equal to g(N). We could also say that the limit of f(N)/G(N) as N goes to infinity is greater than 0.
As with Tilde notation, these are all general concepts that can be applied to any functions, and are not limited to discussions of specific topics in computer science.
Worst case order of growth isn't everything. Just because an algorithm has a better order of growth does not mean it is better. We will see examples of this later (Quicksort vs. Mergesort. Quick select vs. Median of medians).
Memory analysis. Know how to calculate the memory utilization of a class with the 64 bit memory model.
Theoretical and empirical analysis. Hypotheses generated through theoretical analysis (or guesswork like our power law assumption) should be validated with data before being fully trusted.
Recommended Problems
C level
- Textbook 1.4.4
- Fall 2010 Midterm, #1d
- Fall 2011 Midterm, #2
B level
- Textbook 1.4.5
- Fall 2008 Midterm, #3
- Fall 2009 Midterm, #1
- Fall 2010 Midterm, #1a, #1c
- Spring 2012 Midterm, #1
- Spring 2013 Final, #4
A level
- Spring 2013 Midterm, #1
- Suppose the optimal (and possibly unknown) solution to problem P has order of growth F(N). Suppose that the best known solution has runtime that is Θ(B(N)). Finally, suppose that there is a clever proof that no solution can possibly have order of growth that is less than L(N). Which of the following can you surmise?
- F(N) = O(B(N))
- B(N) = O(F(N))
- The limit of F(N)/B(N) as N goes to infinity cannot be infinity.
- F(N) = Ω(L(N))
- L(N) = Ω(F(N))
- B(N) > F(N) for sufficiently large N.
Stacks and Queues
Key Concepts
Stacks and Queues. Understand the semantics of each.
Linked Lists. Understand how they can be used to implement a stack or a queue, how to calculate the memory used by a linked list (using the 64 bit memory model) for N items in a queue, and how to show that queue and stack operations complete in constant time.
Resizing Arrays. Understand how they can be used to implement a stack or a queue. Understand why the worst case memory usage and runtimes are different than the best case.
Amortized Analysis. The idea is simpler that it might seem at first. If we say that X-operations take amortized f(M) Y-operations, this means that if we start with an empty data structure, the average cost for any sequence of M X-operations is f(M) Y-operations.
For example, for a resizing array stack, we say that push and pop operations take amortized constant time. This means that if we start with an empty stack, any sequence of pushes or pops take on average a constant number of array accesses per push or pop. In this example X is {push, pop}, Y is array accesses, and f(M) is just 1.
Generics. Understand that these provide the ability to parameterize data types. Understand why this approach is superior to other approaches (writing separate classes, using the Object type and doing runtime casting).
Compile time vs. run time errors. Understand why the former are preferred.
Iterators. Understand what it means for a class to implement Iterable. Understand what a class must do to implement Iterator. Understand how the foreach operator works (a.k.a. the colon operator). Understand how to convert code using : into equivalent longhand code.
Loitering. Understand how to avoid loitering in Java.
Recommended Problems
C level
- Textbook 1.3.6
- 1.3.3
- 1.3.22, then 1.3.23
- Understand why the we prevent loitering by adding the line A[N] = null in Algorithm 1.1 on page 141 of the book. Why do we not have a special purpose line like this for our linked list implementation (page 149, algorithm 1.2).
B level
- Fall 2008 Midterm, #3b (you don't need to know what a symbol table is to do this problem) -- this is a great problem!
- Textbook 1.3.5
- Textbook 1.4.35 - a pushdown stack is just a stack.
- Textbook 1.4.36 [important note: they recommend using a static inner class to reduce Node overhead, hence the disagreement with the lecture slides!]
- Consider Algorithm 1.1 on page 141 of the book. We showed in lecture that the memory usage of this class is ~8N and ~32N bytes in the best and worst cases respectively. If we get rid of the line A[N] = null, which prevents loitering, how does this affect the best and worst case memory usage of our resizing array based stack? Highlight for answer: It doesn't! Follow up question with no provided answer: Then why do we care?
A level
- Textbook 1.4.32
- Suppose we have a resizing array that increases in size by K entries when the array is full, and decreases in size by K entries when the array has K empty entries. Show that the push and pop take amortized M time for some worst case sequence. Give an example of a worst case sequence. Observe that this results in M2 time for M operations.
Elementary sorts
Key Concepts
Inversions. The number of pairs of elements in a sequence that are out of order. An array with no inversions is ordered.
Selection sort. Be able to selection sort an array. How many compares does it use? How many exchanges?
Insertion sort. Be able to insertion sort an array. How many compares does it use in the best and worst case? How many exchanges in the best and worst case?
Partially ordered arrays. Why is insertion sort linear for partially sorted arrays?
Shellsort. Not covered Fall 2013.
Shuffling. Can use sorting to shuffle in N lg N by assigning each value an arbitrary unique value between 0 and 1. Can sort in N time using Knuth shuffle. You don't need to know the Knuth shuffle details.
Convex hull. Understand how sorting leads to a solution to the convex hull problem. You should be able to carry out the algorithm. You do not need to understand the code for determining ccw turns.
Recommended Problems
C level
- Final 2008 midterm, #9
- Give a worst case and best case input for insertion sort and selection sort.
B level
- Textbook 2.1.2
- Textbook 2.1.6, 2.1.7
- Fall 2012 midterm, #8
A level
- Textbook 2.1.28
For more problems, see the general sorting section.
Mergesort
Key Concepts
Mergesort. Merge left, merge right, merge.
Merge. Understand how to carry out the merge operation. How many compares does it use when comparing two arrays of size N in the best case? In the worst case?
Mergesort order of growth. Understand how to show that the order of growth of the number of compares is N lg N. Understand why the entire algorithm is also order N lg N.
Mergesort compare bounding. Know why the best case is ~ 1/2 N lg N and the worst case is ~ N lg N compares.
Mergesort properties. Mergesort is stable (why?). Mergesort uses N memory (why?). Does mergesort do particularly well on already sorted arrays? Partially ordered arrays?
Recommended Problems
C level
- Give a worst case and a best case input for mergesort.
B level
- Textbook 2.2.22
- Textbook 2.2.8
A level
- Using the decision tree idea, rederive the proof that all comparison based sorting algorithms must use at least ~N lg N compares. You may assume Stirling's formula.
For more problems, see the general sorting section.
Quicksort
Key Concepts
Quicksort. Partition on some pivot. Quicksort to the left of the pivot. Quicksort to the right.
Partitioning. Understand exactly how to carry out 2 way partitioning as discussed in class. Be able to recognize 3 way partitioning as discussed in the book / blackboard exercise.
Quicksort order of growth. Understand how to show that in the best case, quicksort is N lg N, and in the worse case is N^2. Shuffling is needed to probabilistically guarantee N lg N behavior.
Quicksort compare counting. Know why the best case is ~N lg N compares and worst case is ~1/2 N^2. Despite the greater number of compares, quicksort is usually faster than mergesort. Be familiar with the fact that shuffling yields 2N ln N compares on average (but you don't need to fully digest this proof -- especially solution of the difficult recurrence relation, as that involves discrete math that is beyond the scope of the course).
Pivot choice. Understand how the pivot affects the size of the subproblems created after partitoning.
Quicksort properties. Quicksort is not stable and is in-place (uses no more than log N memory).
Practical improvements. Cutoff to insertion sort. Using 3-way partitioning to attain faster performance for arrays with a constant number of keys.
Recommended Problems
C level
- Give a worst case input for non-random quicksort that chooses the leftmost element as a pivot. Why is a best case input hard to think of?
- Textbook 2.3.8
- Textbook 2.3.11
- Spring 2013 mitderm, #4a
B level
- Textbook 2.3.3
- Textbook 2.3.13 (don't do 2.3.20)
- Spring 2013 midterm, #4b
A level
- One strategy to speeding up quicksort is to insertion sort subarrays of size approximately 10 or smaller. An alternate approach is to simply NOT sort arrays of size less than 10, then use insertion sort on the entire array. Prove that this alternate approach results in a linear time sort, depite insertion sort's worst case N^2 performance.
For more problems, see the general sorting section.
Collections, PQs and Heaps
Key Concepts
Terminology. Abstract data types are independent of implementation. Collections are abstract data types.
Common collections. Priority queue, set, symbol table. Know what each of them does. You do not need to memorize the names of the the operations shown in the slides.
Heaps. A max (min heap is a binary tree such that every node is larger (smaller) than all of its children. This definition naturally applies recursively, i.e. a heap of height 5 is composed of two heaps of height 4 plus a parent.
Array-heaps. The heap data structure must be implemented in terms of even more elementary data structures in most programming languages, because there is not a heap primitive. In 226, we utilize a clever trick where we represent the heap as an array. You should understand how this works, e.g. why do we leave the 0th position empty in the array?
Runtimes of various PQ implementations. Know the runtimes of the 3 primary PQ operations for an unordered array, ordered array, and heap implementation.
Heapsort. A max PQ provides an easy algorithm for putting items in ascending order. We simply insert everything into the PQ and then delete the max until the PQ is empty. If we use an array representation of a heap, we can do everything in place (this is not obvious). This is somewhat subtle.
Heapification. Understand how the bottom up heapification process works. Know that it is linear time.
Recommended Problems
Note: The reason I've given lots of problems here is not because this is a more important topic, but because there are just so many interesting problems.
C level
- Textbook 2.4.2 (assume we'd also like to support delete operations)
- Textbook 2.4.4
- Textbook 2.4.11, 2.4.12
- Textbook 2.4.14
- Spring 2008 #5, Fall 2008 #4, Fall 2009 #3, etc.
B level
- Textbook 2.4.7
- Textbook 2.4.10
- Textbook 2.4.17, how could you use this technique to find the kth largest item? What would be the runtime of this algorithm in terms of N and k?
- Textbook 2.4.21
- Textbook 2.4.27
- Textbook 2.4.32 (easier than it looks)
- Spring 2008 Midterm #4
- Fall 2012 Midterm #4
- Spring 2013 Midterm #5a and #5b
A level
- Prove that bottom up heapification runs in linear time.
- Textbook 2.4.29
- Textbook 2.4.30 (great problem!)
- Fall 2010 Midterm #7
- Spring 2012 Miderm #8
- Use a resizing array based queue and a symbol table to create an algorithm that solves FastCollinear in N^2. Don't write code. Make sure you can output the segments in sorted order. You may assume the symbol table supports all operations in constant time.
General sorting
On the midterms, sorting questions tend to broadly cover the entire sorting chapter of the book.
Recommended Problems
C level
- What sort does Java use for sorting object arrays? For sorting primitive arrays? Why?
- Spring 2012 Midterm #1
B level
- This classic problem: Spring 2008 Midterm #1, Fall 2008 Midterm #1, Fall 2012 Midterm #2, Spring 2013 Midterm #6
- Spring 2008 Midterm, #2b
- Fall 2008 Midterm, #2
- Fall 2010 Miderm, #2b
A level
- Spring 2008 Midterm, #2a
- Spring 2012 Midterm, #2b
- Fall 2012 Midterm, #2
Symbol Tables and BSTs
Key Concepts
Elementary implementations. Ordered array and linked list. Understand the performance of each.
Comparable vs. equals. Keys must either be comparable or have an equals method which takes into account the semantics of the object. The default equals method only checks reference equality.
API for Symbol Table and Ordered Symbol Table. Know the operations that go with symbol tables and ordered symbol tables. You don't have to be able to list them, but you should understand what they mean.
BST basics. Know the definition of a BST and how to search and insert.
Recursive BST code. You should be able to read recursive BST code. You will not be required to write recursive BST code, since you won't have practice until KdTree.
In-order traversal. Understand how to perform an in order traversal (in optional slides, also at beginning of the BST lecture)
Treesort. Understand how to use a BST to sort items. Insert all the items, then perform an inorder traversal. This algorithm is also known as Quicksort.
Deletion and Hibbard deletion. Understand how to delete an item with 1 or 0 children. Understand how to delete an item with 2 children, a.k.a. Hibbard deletion.
Symbol table performance. Understand the performance of our elementary implementations. Understand the performance of a BST based symbol table in the best, worst, and average (shuffled) case, with an emphasis on the fact that the height of a tree is the key factor. Understand what Hibbard deletions do to tree height. You do not need to know why Hibbard deletes result in sqrt(n) height.
Recommended Problems
C level
- What is the best case BST height? Worst case?
- If shuffling guarantees log N tree height, why don't we simply shuffle our input data before building our BST based symbol table to avoid worst case behavior?
- Textbook 3.2.3, but give only two orderings.
- Textbook 3.2.4
B level
- In our elementary implementations, we considered ordered arrays and linked lists. We did not consider ordered linked lists. Why not?
- Give an example of something that you might like to store as a key in a symbol table for which there is no natural compare method.
- Do there exist any objects for which it is impossible to define a total order? In other words, can we always write a compare method if we're willing to do the work, or are some data types fundamentally impossible to compare other than for equality?
- Stable 2-way partitioning is a partitioning procedure in which items that are smaller than the pivot are kept in the same relative order after partitioning, and likewise with larger items. For example, if we stably partition GADFLY on G, we'd get ADFGLY. Perform an entire quicksort using stable partitioning on the array [5, 3, 2, 1, 9, 7, 6], and record the compares you make along the way. Build the BST for this array. Observe that the same exact compares were made.
- Textbook 3.2.22 (pretty easy, but good to do)
- Textbook 3.2.23
- Textbook 3.2.24 (so easy it might seem hard)
A level
- Spring 2008 Midterm, #8
- BSTs allow us to perform a stable Quicksort. What are the disadvantages of this stable Quicksort compared to the version we discussed in the Quicksort lecture?
Balanced Search Trees
Key Concepts
2-3 trees. Understand how to search and insert. Understand why insertion technique leads to perfect balance and symmetric order.
Terminology. Symmetric order. Perfect balance.
Performance of 2-3 trees. Tree height is between log3(N) and log2(N). All paths are of the same height.
Mapping between LLRBs and 2-3 Trees. Understand how to map a 2-3 tree to an LLRB and vice versa.
LLRBs. BST such that no node has two red links touching it. Perfect black balance. Red links lean left.
LLRB get. Exactly the same as regular BST get.
Storing color. A node's color is the same as the edge connecting that node to its parent.
LLRB insert. Insert as usual, then apply rotations recursively as follows, working upwards from the new node.
- Case 3: A parent node has a black left child and a red right child, so rotate the parent left. The former right child is now the boss. Reminder: null links are considered black for the purposes of deciding cases.
- Case 2: A grandparent node has a red left child whose left child is also red. Rotate this grandparent right (so that one in the middle is now the boss).
- Case 1: A parent node has two red children. Flip colors.
Conveniently, we can always apply case 3, case 2, then case 1 to every node in a tree, and we can guarantee that the entire tree will be an LLRB.
LLRB performance. Perfect black balance ensures worst case performance for get and insert is ~2 log(N).
Recommended Problems
C level
- Given an LLRB, is there exactly one corresponding 2-3 tree? Given a 2-3 tree, is the exactly one corresponding LLRB?
- Draw a 2-3 tree with all 3 nodes. Why is the height log3(N)?
- How many compares does it take in the worst case to decide whether to take the left, middle, or right link from a 3 node?
- Fall 2010 Midterm, #4. Fall 2011 Midterm, #6. Spring 2012 Midterm, #5. Spring 2013 Midterm, #2. Fall 2008 Midterm, #6. Fall 2009 Midterm, #4.
B level
- For a 2-3 tree of height h, what is the best case number of compares to find a key? The worst case?
A level
- Show that in the worst case, we need to perform order log N LLRB operations (rotation and color flipping) after an insert (not as hard as it might seem). Give an example of a tree which requires log N LLRB operations after a single insert (also not so hard).
Hash Tables
Key Concepts
Brute force approach. All data is just a sequence of bits. Can treat key as a gigantic number and use it as an array index. Requires exponentially large amounts of memory.
Hashing. Instead of using the entire key, represent entire key by a smaller value. In Java, we hash objects with a hashCode() method that returns an integer (32 bit) representation of the object.
hashCode() to index conversion. To use hashCode() results as an index, we must convert the hashCode() to a valid index. Modulus does not work since hashCode may be negative. Taking the absolute value then the modulus also doesn't work since Math.abs(Integer.MIN_VALUE) is negative. We use hashCode & 0x7FFFFFFF instead before taking the modulus.
Hash function. Converts a key to a value between 0 and M - 1. In Java, this means calling hashCode(), setting the top bit to 0, then taking the modulus.
Collision resolution. Two philosophies for resolving collisions discussed in class: Separate chaining and 'open addressing'. We didn't use the term open addressing, but it's where you use empty array entries to handle collisions, e.g. linear probing. We will use this term at the beginning of lecture 11.
Separate chaining with a linked list. Key/value pairs are stored in a linked list of nodes of length M. Hash function tells us which of these linked lists to use. Get and insert both require potentially scanning through entire list.
Resizing separate chaining hash tables. Understand how resizing may lead to objects moving from one linked list to another. Primary goal is so that M is always proportional to N.
Performance of separate chaining hash tables. Cost of a given get, insert, or delete is given by number of entries in the linked list that must be examined.
- Expected average amortized search and insert time is given by N / M, which is no larger than some constant (due to resizing).
- Expected amortized worst case time is given by size of largest bin, which is order log N / log log N. This is far beyond the scope of the class. Grows slowly, but is not quite constant.
Linear probing hash tables. If the space that should be occupied by a key is already occupied by something else, try the space to the right. If that's occupied, go right more. Repeat. This philosophy works for get and insert.
Performance of linear probing hash tables. As before, perfmrance determined by number of entries in the key array that must be examined.
- If N / M is sufficiently small, expected amortized worst case run time is given by a constant. For N / M = 0.5, expected cost of a search hit is 3/2 and expected cost of a search miss is 5/2. Once N / M becomes larger, cost grows. See Knuth's parking problem for more.
- Expected amortized worst case run time is given by Theta(log N). We did not prove this in class or even discuss it, and this is extremely distant from the scope of our course.
Hash functions and the uniform hashing assumption. For our analyses above, we assumed that our hash function distributes all input data evenly across bins. Design of such function is beyond the scope of our course. Most important guideline is to use all the bits in the key.
Dangers of hashing. If hashCode() is known and easy to reverse, adversary can design a sequence of inputs that result in everything being placed in one bin. Or if hashCode() is just plain bad, same thing can happen.
Recommended Problems
C level
- Textbook 3.4.5
- Fall 2010 Midterm, #5, Fall 2012 Midterm, #6a
B level
- Textbook 3.4.13, 3.4.14
- Textbook 3.4.15
- Fall 2012 Midterm, #6b, Fall 2009 Midterm, #5
- Spring 2013 Midterm, #3b-#3d
A level
- Textbook 3.4.23, here R is the multiplicative factor (we used 31 in class).
Geometric Applications of BSTs
Key Concepts
1d range search
- How do we use the rank method to support 1d range search? (You don't need to know how to implement rank, since we skipped that part of the BSTs lecture)
Kd-Trees
- Enables rapid 2d range search.
- Enables rapid 2d nearest neighbor search.
- You should understand all of the details of the KdTreeST class that you wrote for assignment 5.
Recommended Problems
None.
Undirected Graphs
Key Concepts
Graph Representation
There are many possible internal representations of a graph. We discussed three. You should be familiar with the tradeoffs for each.
- List of edges
- Adjacency matrix
- Adjacency lists
- We use these in 226, because we assume that our graphs are sparse.
Important graph problems
- Is there a path between s and t?
- Is there a cycle in the graph? How do you find the nodes in a cycle?
- Is a graph bipartite?
- Euler tour. Is there a cycle that uses every edge exactly once?
- Hamilton tour. Is there a cycle that uses every vertex exactly once?
- Planarity problem. Find a vertex layout that results in no crossing edges (see this online game for a fun (?) look at the problem)
- Graph isomorphism. Are two graphs the same graph (but with different vertex names)
You should know how to solve the first three -- not necessarily because these problems are so important, but because you should have at least a few examples of how to use DFS or BFS to solve problems. For the last four, it's good scholarship to be know whether the problem is tractable (polynomial time) and whether there exists an easy to understand tractable algorithm. We may even ask you how these things on an exam. You do not need to be familiar with the details of the best known algorithm for their solution.
One important meta-point is that it is non-trivial to determine the difficulty level of a graph problem. For example, finding an Euler tour is polynomial time and the algorithm is non-trivial but not hideously complex. By contrast, finding a Hamilton tour is an intractable problem. Planarity can be solved in linear time, but it's a rather complex algorithm to follow. Graph isomorphism is particularly notable since its tractability has remained elusive despite decades of research.
Important graph tools
- DFS. You should know this algorithm absolutely cold by the time the exam comes around. For example, you should be able to write DepthFirstSearch.java in your sleep, and the run time if we use adjacency lists (E + V) should be scorched into your neurons.
- BFS. You should have the same level of familiarity as with DFS.
You should understand that almost any problem you care to solve with unweighted graphs can be solved with one of these two algorithms. You should understand when each is appropriate to use.
Graph client design pattern (used for 4.1, 4.2, 4.3, and 4.4). Instead of creating highly complex classes that support basic graph operations AND solution of important graph problems, we have a basic class (e.g. Graph) that gives us basic operations, and special purpose clients that solve subproblems on a graph (e.g. ConnectedComponents(Graph g)).
Recommended Problems
C level
- Fall 09 Final, #2a
- Textbook: 4.1.12
B level
- Fall 09 Final, #2b
- Fall 10 Final, #2c
- Textbook: 4.1.14, 4.1.10
- What is the run time for DFS or BFS if we use an adjacency matrix insted of adjacency lists?
A level
- Fall 12 Final, #15
- Textbook: 4.1.33
Directed Graphs
Key Concepts
Using directed graphs vs. undirected graphs. Know when to use each. This should be fairly natural. For example, you'd use an undirected graph to represent friend connections between Facebook users, and a directed graph to represent following connections on Twitter.
BFS and DFS are directed graph algorithms. Many canonical unweighted digraph problems (e.g. shortest paths and reachability) can be solved using code that is effectively identical to code for solving problems on unweighted graphs.
Important digraph problems.
- Topological sort.
- What is a topological sort?
- Is it unique?
- What are the conditions under which a graph has a topological sort?
- How do you compute a topological sort?
- What is the run time to find a topological sort if we use adjacency lists?
- Bonus: How can you use the idea of a topological sort to find a Hamilton tour (usually an NP complete problem)?
- Strongly connected component identification.
- What is a strongly connected component (SCC) (also sometimes called a strong component)?
- Given Q strong components, how do we know that there is SOME sequence of Q depth-first-searches where each search finds exactly one of our Q strong components?
- How do we find that sequence? Step 1 of Kosaraju-Sharir.
Important digraph tools.
- Preorder, postorder, and reverse postorder.
What are these and how do we find them using DFS? What is the run time to find each order if we use adjacency lists?
- Topological sort.
- Kosaraju-Sharir algorithm.
- What does the first step provide? An ordering of our vertices that allows us to find the SCCs.
- What does the second step do? Finds all the strongly connected components using the ordering provided by the first step.
- What are the steps? 1. Run DFS on GR to compute reverse postorder of GR. 2. Start V depth-first-searches (one for each node) in the order given by first step.
- What is the run time if we use adjacency lists?
Recommended Problems
C level
- Spring 08 Final, #1a, #1b
- Fall 08 Final, #1a, #1b
- Fall 10 Final, #3a
- Spring 12 final, #3a, #3ab
B level
- Spring 08 Final, #1c, #1d
- Fall 08 Final, #1c
- Fall 10 Final, #3b
- Fall 10 Final, #2d, #2e
- Textbook: 4.2.10, 4.2.20
A level
- Fall 2008, #11
- 4.2.19, 4.2.22 (exam will not require you to know why Kosaraju-Sharir works)
- 4.2.27
- 4.2.40
Just for fun
- Write code similar to (or copy) the webcrawler shown in lecture.
Minimum Spanning Trees
Definition of a weighted graph and an MST.
- What is the MST?
- Under what conditions is it unique?
- How many edges are in the MST?
Cuts.
- What is a cut?
- How many cuts are there for a graph with V vertices?
- Cut property - given a cut, the crossing edge between the two partitions belongs to the MST. This is the most important point of this lecture, and you should be very familiar with the proof.
Manually performing Kruskal's and Prim's algorithms. These are easy and you should be able to do them even under the influence of powerful sedatives.
Why Kruskal's and Prim's algorithms work.
Implementing Kruskal's and Prim's algorithms.
- What sort of data do we track in Kruskal's algorithm, and what data structures do we use to track this data?
- What is the resulting run time for typical implementations of these data structures?
- What sort of data do we track in Lazy Prim's algorithm, and what data structures do we use to track this data?
- What is the resulting run time for typical implementations of these data structures?
- For Eager Prim's, why do we track vertices instead of edges? What does this do to our space requirements?
- What is the special data structure that we use to track vertices and their best known distance?
- How do we change the priority of a vertex using this special data structure (you should be able to give an example of a method call)?
Recommended Problems
C level
- Spring 08 Final, #2a, #2b
- Fall 08 Final, #2a, #2b
- Fall 09 Final, #1b
- Fall 09 Final, #3a, #3c
- Fall 10 Final, #4a, #4b
- Fall 10 Final, #3a, #3b
- Spring 12 Final, #4a, #4b
- Would Kruskal's or Prim's algorithm work with weighted directed graphs?
B level
- Fall 09 Final, #3b
- Fall 12 Final, #4a, #4b, #4c, #4d
- Textbook 4.3.8 - this is a particularly handy concept!
- Textbook 4.3.12
- Textbook 4.3.13
- Textbook 4.3.20
- True or False: Given any two components that are generated as Kruskal's algorithm is running (but before it has completed), the smallest edge connecting those two components is part of the MST.
- Textbook 4.3.19 - to be clear, they mean that the PQ is implemented with a sorted list.
Note the solution on the booksite is for an unsorted list. That's also a fine problem.
A level
- Spring 08 Final, #3
- Fall 09 Final, #3d
- Textbook 4.3.26
Just for fun
- Figure out if Kruskal's, Prim's Lazy, or Prim's Eager algorithm are faster on "typical" graphs. Use the code from the booksite.
Shortest Paths
Key Concepts
Graph Representation
- Use DirectedEdge instead of Edge.
- Adjacency list only tracks edges pointing out of node.
Shortest path representation
- Can represent shortest path as a tree.
- In Java, we represent the shortest path tree with two arrays: distTo[] and edgeTo[].
Edge relaxation
- What does it mean to relax an edge? Update shortest path tree (edgeTo[] and distTo[]) to reflect the information contained in that edge. What does it mean to relax a vertex?
- The code for relax should be second nature.
- What is the shortest paths optimality condition? Why is it true?
- What is the generic shortest path algorithm? Why does it always eventually get the right answer?
Manually performing Dijkstra's, Acyclic Shortest Paths, and Bellman Ford - These should be very easy to manully execute.
Bellman Ford algorithm
- Sequence of edge relaxations: Relax all the edges. Then relax them all again. And again. Repeat this V times. You are done.
- Runtime is obviously EV.
- Proof: After ith relaxation of all E edges, we have the shortest path containing i edges. There are at worst V-1 edges between any two vertices in the shortest path tree (otherwise it is not a tree).
- Sledgehammer of algorithms. Works on any graph except a graph with negative cycles. However, we don't really care, because a negative cycle has no definition of shortest path (since you can just go in the negative loop more times and get there sooner).
Dijkstra's algorithm
- Important invariant: Once a vertex belongs to the shortest path tree, its distTo and edgeTo can never change. Why is this invariant true? How does it prove correctness?
- Why is the runtime (V + E) log V? Why do the book and slides just say E log V? If we change our priority queue implementation, how does this runtime change (in general - no need to write Fibonacci heap runtimes on your cheat sheet)?
- How is Dijkstra's algorithm similar to Prim's algorithm? How is it dffferent?
Acyclic shortest paths
- Why does it work?
- Why is the run time E + V?
Recommended Problems
C level
- Fall 2009 Final, #4
- Fall 2010 Final, #5
- Textbook 4.3.1 and 4.4.1
B level
- Spring 2008 Final, #11 (this should be very easy if you've done Seam Carving!)
- Fall 2011 Final, #4
- Spring 2012 Final, #5
- Fall 2012 Final, #5
- Think about how to solve the undirected version of the shortest paths problem. Why is it easy for graphs with positive edge weights? Why is it hard if we include negative edge weights?
- Textbook 4.4.25
- Since Prim's and Dijkstra's algorithms are so similar, why do we need to check for cycle in Prim's, but not in Dijkstra's?
- Textbook 4.4.42 and 4.4.44
A level
- Fall 2008 Final, #11 (I remember getting asked this at a career fair as a freshman!)
- Textbook 4.4.34
- Textbook 4.4.37
Max-Flow
Key Concepts
Graph Representation
- Use FlowEdge instead of DirectedEdge.
- Adjacency list tracks nodes pointing towards and away from node (why?).
Max-flow, min-cut, and Ford-Fulkerson basics
- What is a flow? Assignment of value to every edge.
- What is the max-flow problem? What is the min-cut problem?
- How do we find an augmenting path from s to t? BFS, but with the special rule that we don't traverse a forward (backward) edge if it is full (empty).
- Given a max-flow, how do we find a min-cut? Try to find an augmenting path using algorithm above. The s side of the s-t cut is every vertex reachable from the source.
Ford-Fulkerson details
- How do we prove that it always terminates with integer weights?
- How do we know the answer is right if it terminates?
- Why can't we use the same argument that we used for the generic shortest paths algorithm?
- What are some different strategies for picking a path? Why is DFS potentially very bad?
Recommended Problems
C level
- Fall 2011, #11
- Spring 2012, #11
- Fall 2012, #6
B level
- Textbook 6.36
- Textbook 6.38
A level
- Textbook 6.37
String sorts
Key Concepts
Terminology
- String - sequence of characters from a finite ordered alphabet. In Java, our alphabet is the set of all 16 bit integers (representing unicode characters).
- Radix - just another word for 'base' as in the base of a number system. For example, the radix for words written in lowercase English letters is 26. For number written in arabic numerals it is 10.
- Radix sort - a sort that works with one character at a time (by grouping objects that have the same digit in the same position).
- Note: I will use 'character' and 'digit' interchangably in this study guide.
Key indexed counting. Allows you to sort N keys in linear time (better than linearithmic). Beats linearithmic bound by avoiding any comparisons. This is a completely different philosophy for how things should be sorted. This is the most important concept for this lecture.
Manually performing LSD and MSD. Should be doable in your sleep.
LSD.
- Requires fixed length strings (can pad end of strings with 'smaller-than-everything' character).
- Requires 2NW charAt calls. Why?
- Requires 7WN + 3WR array accesses. Why?
- Why do we consider these run times to be linear, despite the fact that they involve products WN and WR?
- Requires N + R space. Why?
- What sorting technique is used in LSD? Would a standard technique (e.g. Quicksort) work? Does the sort need to be stable?
- For a fixed alphabet and key size, what are the best case and worst case inputs for LSD?
MSD.
- What sorting technique is used in MSD? Would a standard technique (e.g. Quicksort) work? Does the sort need to be stable?
- How much memory does MSD use? Why is MSD so memory hungry? What sort of inputs result in the gratest memory usage?
- Why is it so important to switch to insertion sort (or another sort) for small subarrays? Why did we not have to do this in LSD?
- For a fixed alphabet and key size, what are the best and worst case inputs for MSD?
- What is the role of our special charAt(char, int) method?
Implementation issues.
- Why does the special charAt method have a significant impact on performance?
- What sorts of inputs are especially good for LSD and MSD?
- What sorts of inputs are especially bad for LSD and MSD?
Recommended Problems
C level
- Spring 2012 Final, #6
B level
- Fall 2008 Final, #6
- Fall 2012 Final, #7
Spring 2008 Final, #12
- Textbook 5.1.8, 5.1.10
A level
- How could we avoid the performance hit from our special charAt function?
- What makes MSD cache unfriendly?
- Textbook 5.1.22. The results may surprise you.
Tries
Key Concepts
Terminology and Basics
- Length of our String keys given by L.
- Alphabet size is R.
- You should be able to build an R-way Trie and TST. You should be able to search a R-way Trie and TST. You should be able to delete from an R-way trie. You should appreciate how a TST can become unbalanced with inserts.
- Given a graphical depiction of a TST or Trie, you should be able to tell which keys are in the trie.
Hash Tables and LLRBs with String keys
- With String keys, we don't want to think about counting compareTo() or hashCode() and equals() calls. Why? What do we think about instead?
- Know (but you do not have to prove) that LLRBs take order of growth lg2 N charAt() calls to compare random strings. Based on this, a hit is order L + lg2 N. Be aware that this is not necessarily a meaningful piece of information for practical purposes.
- Understand why hash tables are L charAt() calls on a hit or miss (assuming no collisions!)
Tries
- Special structure such that you only look at each letter one time. Trivially, misses and hits have a max run time of L.
- Be aware that for random inputs, search misses only take an average of logR N, which is order of growth lg N.
- Why do R-way tries use such a huge amount of memory? How much memory in terms of L and R? How does this same problem affect run time?
- What are the two ways of determining that a key does NOT exist in a trie?
TSTS
- When looking for a single character, why is it possible to follow as many as R links before you complete the search for that character?
- If the TST is balanced, typical case inserts and hits cost L + lg N character compares, and typical misses cost lg N character compares.
- If the TST is not balanced, worst case hits, misses, and inserts all include H, where H is the height of the tree.
Recommended Problems
C level
- Spring 2008 Final, #5
- Spring 2008 Final, #4
- Fall 2011 Final, #8
- Fall 2012 Final, #8
- Textbook 5.2.3, 5.2.4
B level
- Fall 2009 Final, #5
- Spring 2012 Final, #9
- Textbook 5.2.21 (just design the API)
- When would we want to use an R-way trie instead of a TST?
- Give a worst case input that causes a TST to be unbalanced (and thus slow).
- Is the number of character compares in an R-way trie strictly better than for an LLRB? For a hash table? Is a trie guaranteed to be faster in overall run time than an LLRB or a hash table?
A level
- When might we want to use a hash table instead of a TST?
- What would happen if we had an R-way trie where we keep our next nodes in a linked list? How would this affect memory performance? Timing? What if we used an LLRB? A hash table?
- Repeat Spring 2013 midterm problem #8, but assume your keys are Strings instead of Comparables.
Substring Matching
Key Concepts
Terminology and Basics
- Substring search problem: Find an instance (or all instances) of a substring of length M in a string of length N.
- Sustring must be precisely defined (no *s or |s).
- You should be able to manually use a KMP DFA, and you should be able to manually carry out Boyers-Moore and Rabin Karp.
KMP
- How do you construct the DFA? How much time does it take if you re-simulate every time you have a mismatch? It's ok if you don't fully understand the linear time construction process.
- What are the best case run times for DFA construction and DFA simulation? The worst case run time?
Boyer-Moore
- What is the mismatched character heuristic? Why do we use the rightmost character?
- Why is the mismatched character heuristic suboptimal? Why do we use it then -- because the basic idea is very similar to KMP and you'll learn it if you ever really need to.
- What is the best case run time? The worst case run time?
- What sort of inputs result in best and worst case performance?
Rabin Karp
- If we know mod(ABCDEFG, R), how do we compute mod(BCDEFGH, R) in constant time (where A through H are arbitrary digits of a number from some alphabet of radix R)?
- What are the Las Vegas and Monte Carlo versions of Rabin-Karp? Why does it not really matter which one we use? If you were worried, which one would YOU use?
- How would we extend Rabin-Karp to efficiently search for any one of P possible patterns in a text of length N? How would this technique compare to using KMP or Boyer-Moore for the same task?
Recommended Problems
C level
- Spring 2008 Final, #6
- Fall 2009 Final, #6
- Fall 2012 Final, #10
B level
- Fall 2009 Final, #5
- Fall 2009 Final, #10 [great problem!]
- Fall 2010 Final, #10
- Fall 2011 Final, #6
- Spring 2012 Final, #7
- Fall 2012 Final, #9
- Give an example of when you might want to use KMP? Boyer Moore? Rabin Karp?
A level
- Fall 2010 Final, #10
- Fall 2012 Final, #14
- Give an example of when you might prefer to use the Monte Carlo version of Rabin Karp over the Las Vegas version.
- Textbook: 5.3.22
- Textbook: 5.3.26
Regular Expressions
Key Concepts
Terminology and Basics
- Understand the regular expression syntax.
- Know how to write simple regular expressions. We will not ask you to do this on the final.
NFA Simulation
- Understand how to manually simulate an NFA operating on a string.
- What happens on a mismatch? All states that don't match are removed from the set of valid states. This may result in an NFA that is not in ANY state. The NFA does NOT go back to state 0.
- Why does an NFA only tell you if an ENTIRE string matches a regular expression, not whether or not some substring matches?
- Given a string and and a regular expression, how do we tell if any substring of the string matches the regular expression? (Hint: Make a new regular expression)
- How does graph search play a role? Does it matter if we use DFS or BFS? Would it ever make sense to use an undirected graph in the regular expression context? A weighted graph?
NFA Construction
- Understand how to build an NFA from a regular expression. If you know how to do this using the stack-based method, you can do this process without even thinking.
- Our NFA construction process is not the only valid construction process. There are other equally valid ones. Ours is just the simplest that Bob and Kevin could come up with.
- Our NFAs are non-unique representations of a given regex (as discussed in lecture).
- Why are there no more than 3M epsilon transitions? Why is this fact vitally important to the NM worst case run time for simulation?
- Since DFAs have worst case run time N, and all regexes go with some DFA, why don't we just use those?
- Advanced and important topic mentioned in passing during lecture that won't be covered on the final: There is no regex that can compile a regex into a DFA. However with a DFA and a stack, you can do this just fine (specifically, this process doesn't require a Turing machine). Compiling Java into class fiels, by contrast, does require a Turing machine.
Recommended Problems
C level
- Spring 2008 Final, #8
- Fall 2008 Final, #8
- Fall 2009 Final, #7
- Fall 2010 Final, #9
- Spring 2012 Final, #10
- Spring 2012 Final, #11
- 5.4.1, 5.4.2
B level
- Fall 2011 Final, #7
- Textbook 5.4.16, 5.4.17, 5.4.18
A level
None!
Just for fun
- 5.4.13
Data Compression
Key Concepts
Terminology and Basics
- Why compress? To save space AND time.
- How does compression work? Compression takes advantage of structure within data.
- What sort of data does not compress well? Random data.
- How do we characterize the randomness of data? The Kolmogorov complexity.
- Lossy compression can further reduce file sizes by throwing away unimportant information.
- What is the compression ratio?
- Why can no compression algorithm possibly compress all bit streams?
- What fraction of bitstreams can be compressed in half by a general purpose algorithm?
Run length coding
- Takes advantage of repeated binary digits.
- How do you handle runs of length more than 2M?
Huffman coding
- Basic idea: Variable length codewords to represent fixed length characters. More common characters are represented by shorter codewords.
- What is a prefix free code? Why is it important that Huffman coding use a prefix free code? Would encoding work with a non-prefix free code? Would decoding work?
- Why is it necessary to transmit the coding trie? Why don't we have to do something similar with run length encoding or LZW?
- Why do we typically use an array for encoding and a trie for decoding?
- You do not need to know the ins and outs of the Huffman code. However, you should conceptually understand the idea of transmitting/reading the trie using an inorder traversal.
LZW
- Basic idea: Fixed length codewords to represent variable length strings. More common characters are represented by shorter codewords.
- Why do we typically use a trie for encoding and an array for decoding?
- How do you handle the 'strange' case (where a codeword is seemingly not in the table during decoding)?
Kolmogorov Complexity
- For 226, we define the Kolmogorov Complexity of a string S as the length of the shortest Java program (including all relevant libraries) that can output S.
- Given: a string S, a program P of length L in programming language X that outputs S, why can we always find a Java program of length L + C, where C is some constant that depends on our choice of X (but not S or L)? Because we can always write a compiler for X written in Java.
- Why does the given fact above prove that any string which is compressible in one programming language can be compressed in any other language (up to a constant factor)?
- Why is it impossible to write a program which finds the Kolmogorov Complexity of a given string? No (the reason why was an optioanl part of the lecture)
- Are there any strings for which Kolmogorov Complexity can be found? Yes! For example, "a".
Recommended Problems
C level
- Spring 2008 Final, #4
- Fall 2008 Final, #7
- Fall 2011 Final, #10b
- Textbook 5.5.3
- Why is the Kolmogorov complexity of "a" (relatively) easy to find?
B level
- Fall 2011 Final, #10a
- Textbook 5.5.13
- Textbook 5.5.17
- Is the Kolmogorov complexity of π small or large?
A level
- Fall 2012 Final, #13
- Textbook 5.5.23 (don't know the answer myself, I'm quite curious)
Reductions and Intractability
Key Concepts
- We consider problems solvable in polynomial time to be efficient.
- Important real world note: There are two common definitions for P and NP. The classic definition defines the classes in terms of decision problems. A later (but also quite common) definition is in terms of search problems. For simplicity, we used decision problems in our lecture, but many other sources use the search problem version. The two definitions are very similar and have the same real world implications, but they are not quite the same.
- To make matters even a bit more confusing, I used P to represent the class of all problems solvable in polynomial time, and NP the class of all decision problems solvable in polynomial time.
- Bottom line: Whether we're talking about the decision or search problem formulation, the fundamental idea is that P is the a of problems that can be solved in polynomial time, and NP is the a of problems whose solution can be verified in polynomial time.
- Completeness: What conditions are necessary for a problem to be Q-complete, where Q is some set of problems?
- What did Rabin/Levin do?
- What did Karp do?
- What is a reduction?
- Why is solving an NP complete problem such a big deal?
- Amusing fact: There exists a well defined algorithm (that you can already write in Java) that solves NP problems in time P, but it's polynomial time only if P = NP. Wikipedia has the scoop. It's madness!
- One of the most common arguments against Searle's Chinese Box is the systems argument (namely that the dude in the box is acting as part of a larger system). I happen to strongly agree with this. I still won't let you blow me up and reassemble me.
Recommended Problems
Note: The Fall 2013 final will not feature one of those really sneaky reduction problems (typically part a of the B level and A level problems below). However, we will probably ask you about how reductions work.
C level
- Textbook 6.52
B level
- Textbook 6.57
- Textbook 6.61, but replace "search" with "decision" to match our terminology from class.
- Textbook 6.62
- Spring 2008 Final, #8
- Fall 2008 Final, #12
- Fall 2009 Final, #11
- What does the Venn diagram look like for NP-Complete and NP?
- If P=NP, then why is the set of NP-Complete problems the same as P?
A level
- Fall 2010 Final, #10
- Fall 2011 Final, #13
- Spring 2012 Final, #13
- Fall 2012 Final, #15