Computer Science 226
Algorithms and Data Structures
Fall 2013

Union Find and Intro
Analysis of Algorithms
Stacks and Queues
Elementary Sorts
Mergesort
Quicksort
Collections, PQs, and Priority Queues
General Sorting
BSTs
Balanced BSTs
Hashing

Geometric Applications of BSTs
Undirected Graphs
Directed Graphs
MSTs
Shortest Paths
Max flow
String Sorts
Tries
Substring Matching
Regular Expressions
Compression
Reductions and Intractability

Study Guide

This page provides a list of what I consider the most important topics and ideas from each lecture. The optional exercises listed on this page are not for a grade. Instead, they are to provide you with a better idea of what we're expecting you to know before the final.

Optional exercises are either C problems, B problems, or A problems. C problems are at the same level as the exercises, and generally require that you only be able to mechanically carry out the algorithms. B problems are harder problems that require you to really know what's going on (the exams will have plenty of these). A problems are problems that either cover tangential material (like how Kosaraju-Shamir works), require deep comprehension of the material, involve very clever tricks, or perhaps even all 3.

Don't be intimidated by its length. I've tried to be as comprehensive as possible.

Union Find and General Intro

Key Concepts

Algorthm Development. Developing a good algorithm is an iterative process. We create a model of the problem, develop an algorithm, and revise the performance of the algorithm until it meets our needs.

Union-Find
. The ultimate goal is to develop a data type that support the following operations:

union(int p, int q)
connected(int p, int q)

We do NOT care about finding the actual path from p to q. We only care about their connectedness.

Very closely related to connected is a 3rd operation we can support:

find(int p)

Find is defined such that find(p) = find(q) iff connected(p, q).

We assume that the number of objects N is fixed.

Key Observation: Connectedness is an Equivalence Relation. Saying that two objects are connected is the same as saying they are in an equivalence class.

This is just fancy math-talk for saying "every object is in exactly one bucket, and we want to know if two objects are in the same bucket". When you union two objects, you're basically just pouring everything from one bucket into another.

Quick Find. This is the most natural solution, where each object is given an explicit number. Uses an array id of length N. id[i] is the bucket number of object i (which is returned by find). To union two objects p and q, we set every object in q's bucket to have p's number.

Union: May require many changes to id. Takes N time in the worst case (to union large sets).
Connected (and Find): Constant time.

Performing M union-find operations takes NM time. If M were proportional to N, this results in N² time;

Quadratic Algorithms Don't Scale. Given an N times larger problems on an N times faster computer, the problem takes N times as long to run.

Quick Union. id[i] is the parent object of object i. An object can be its own parent. The find method climbs the ladder of parents until it reaches the root (an object whose parent is itself). To union p and q, we set p's parent to q.

Union: Requires only one change to id, but also requires root finding (worst case N time).
Connected (and Find): Requires root finding (worst case N time).

Performing M union-find operations takes NM time in the worst case. Again, this results in quadratic behavior.

Weighted Quick Union
. Rather than union(p, q) pointing p's parent at q, we instead point the smaller tree at the larger one. The tree's size is given by the NUMBER of nodes, not the height of the tree. Results in tree heights of lg N (you should understand this proof).

Union: Requires only one change to id, but also requires root finding (worst case log N time).
Connected (and Find): Requires root finding (worst case log N time).

Weighted Quick Union with Path Compression. When find is called, the tree is compressed. Results in nearly flat trees. Making M calls to union and find with N objects results in no more than M log*(N) array accesses. Log*(N) is at most 5 in our universe.

Analysis of Algorithms

Key Concepts

Tilde notation. We say that f(N) ~ g(N) if in the large N limit, f(N)/g(N) converges to 1. This is a general concept that can be applied to any functions, and is not restricted to runtime, memory, or any other specific domain.

Power-law assumption. For empirical analyses in COS226 we will assume that runtimes obey a power law (i.e. T(N) ~ aN^b). Empirical derivation of accurate hypotheses for non-power law runtimes is beyond the scope of our course.

Cost model. For theoretical analyses of runtime in COS226, we will assume a 'cost model', namely that some particular operation (operations) dominates the runtime of a function. We can then express the runtime in terms of the total number of that operation. Often, we give this count in Tilde notation.

Order of growth. If we have two functions f(N) and g(N), and f(n) ~ a G(N), we say the order of growth of f(N) is g(N).

Performance depends on inputs. We can characterize an algorithm's performance by its best case, worst case, and average case.

Difficulty of a problem. To understand the difficulty of a particular problem in COS226, we consider only the worst-case order-of-growth of its solutions. We can naturally upper bound the difficulty of a problem by the performance of the best known solution. We can naturally lower bound the difficulty of a problem with general considerations of the problem. This is actually as trivial as it sounds.

Big Oh, Big Omega, Big Theta. Big Theta is precisely the same thing as "order of growth". If we say "f(N) is Θ(g(N))" or "f(N)=Θ(g(N))", we are simply saying f(N) is the same order of growth as g(N). If f(N)=O(g(N)), then f(N) has an order of growth that is less than or equal to g(N). More precisely, we could say that the limit of f(N)/g(N) as N goes to infinity is NOT infinity. If f(N)=Ω(g(N)), then f(N) has an order of growth that is greater than or equal to g(N). We could also say that the limit of f(N)/G(N) as N goes to infinity is greater than 0.

As with Tilde notation, these are all general concepts that can be applied to any functions, and are not limited to discussions of specific topics in computer science.

Worst case order of growth isn't everything. Just because an algorithm has a better order of growth does not mean it is better. We will see examples of this later (Quicksort vs. Mergesort. Quick select vs. Median of medians).

Memory analysis. Know how to calculate the memory utilization of a class with the 64 bit memory model.

Theoretical and empirical analysis. Hypotheses generated through theoretical analysis (or guesswork like our power law assumption) should be validated with data before being fully trusted.

Stacks and Queues

Key Concepts

Stacks and Queues. Understand the semantics of each.

Linked Lists. Understand how they can be used to implement a stack or a queue, how to calculate the memory used by a linked list (using the 64 bit memory model) for N items in a queue, and how to show that queue and stack operations complete in constant time.

Resizing Arrays. Understand how they can be used to implement a stack or a queue. Understand why the worst case memory usage and runtimes are different than the best case.

Amortized Analysis. The idea is simpler that it might seem at first. If we say that X-operations take amortized f(M) Y-operations, this means that if we start with an empty data structure, the average cost for any sequence of M X-operations is f(M) Y-operations.

For example, for a resizing array stack, we say that push and pop operations take amortized constant time. This means that if we start with an empty stack, any sequence of pushes or pops take on average a constant number of array accesses per push or pop. In this example X is {push, pop}, Y is array accesses, and f(M) is just 1.

Generics. Understand that these provide the ability to parameterize data types. Understand why this approach is superior to other approaches (writing separate classes, using the Object type and doing runtime casting).

Compile time vs. run time errors. Understand why the former are preferred.

Iterators. Understand what it means for a class to implement Iterable. Understand what a class must do to implement Iterator. Understand how the foreach operator works (a.k.a. the colon operator). Understand how to convert code using : into equivalent longhand code.

Loitering. Understand how to avoid loitering in Java.

Elementary sorts

Key Concepts

Inversions. The number of pairs of elements in a sequence that are out of order. An array with no inversions is ordered.

Selection sort. Be able to selection sort an array. How many compares does it use? How many exchanges?

Insertion sort. Be able to insertion sort an array. How many compares does it use in the best and worst case? How many exchanges in the best and worst case?

Partially ordered arrays. Why is insertion sort linear for partially sorted arrays?

Shellsort. Not covered Fall 2013.

Shuffling. Can use sorting to shuffle in N lg N by assigning each value an arbitrary unique value between 0 and 1. Can sort in N time using Knuth shuffle. You don't need to know the Knuth shuffle details.

Convex hull. Understand how sorting leads to a solution to the convex hull problem. You should be able to carry out the algorithm. You do not need to understand the code for determining ccw turns.

Mergesort

Key Concepts

Mergesort. Merge left, merge right, merge.

Merge. Understand how to carry out the merge operation. How many compares does it use when comparing two arrays of size N in the best case? In the worst case?

Mergesort order of growth. Understand how to show that the order of growth of the number of compares is N lg N. Understand why the entire algorithm is also order N lg N.

Mergesort compare bounding. Know why the best case is ~ 1/2 N lg N and the worst case is ~ N lg N compares.

Mergesort properties. Mergesort is stable (why?). Mergesort uses N memory (why?). Does mergesort do particularly well on already sorted arrays? Partially ordered arrays?

Quicksort

Key Concepts

Quicksort. Partition on some pivot. Quicksort to the left of the pivot. Quicksort to the right.

Partitioning. Understand exactly how to carry out 2 way partitioning as discussed in class. Be able to recognize 3 way partitioning as discussed in the book / blackboard exercise.

Quicksort order of growth. Understand how to show that in the best case, quicksort is N lg N, and in the worse case is N^2. Shuffling is needed to probabilistically guarantee N lg N behavior.

Quicksort compare counting. Know why the best case is ~N lg N compares and worst case is ~1/2 N^2. Despite the greater number of compares, quicksort is usually faster than mergesort. Be familiar with the fact that shuffling yields 2N ln N compares on average (but you don't need to fully digest this proof -- especially solution of the difficult recurrence relation, as that involves discrete math that is beyond the scope of the course).

Pivot choice. Understand how the pivot affects the size of the subproblems created after partitoning.

Quicksort properties. Quicksort is not stable and is in-place (uses no more than log N memory).

Practical improvements. Cutoff to insertion sort. Using 3-way partitioning to attain faster performance for arrays with a constant number of keys.

Collections, PQs and Heaps

Key Concepts

Terminology. Abstract data types are independent of implementation. Collections are abstract data types.

Common collections. Priority queue, set, symbol table. Know what each of them does. You do not need to memorize the names of the the operations shown in the slides.

Heaps. A max (min heap is a binary tree such that every node is larger (smaller) than all of its children. This definition naturally applies recursively, i.e. a heap of height 5 is composed of two heaps of height 4 plus a parent.

Array-heaps. The heap data structure must be implemented in terms of even more elementary data structures in most programming languages, because there is not a heap primitive. In 226, we utilize a clever trick where we represent the heap as an array. You should understand how this works, e.g. why do we leave the 0th position empty in the array?

Runtimes of various PQ implementations. Know the runtimes of the 3 primary PQ operations for an unordered array, ordered array, and heap implementation.

Heapsort. A max PQ provides an easy algorithm for putting items in ascending order. We simply insert everything into the PQ and then delete the max until the PQ is empty. If we use an array representation of a heap, we can do everything in place (this is not obvious). This is somewhat subtle.

Heapification. Understand how the bottom up heapification process works. Know that it is linear time.

General sorting

On the midterms, sorting questions tend to broadly cover the entire sorting chapter of the book.

Symbol Tables and BSTs

Key Concepts

Elementary implementations. Ordered array and linked list. Understand the performance of each.

Comparable vs. equals. Keys must either be comparable or have an equals method which takes into account the semantics of the object. The default equals method only checks reference equality.

API for Symbol Table and Ordered Symbol Table. Know the operations that go with symbol tables and ordered symbol tables. You don't have to be able to list them, but you should understand what they mean.

BST basics. Know the definition of a BST and how to search and insert.

Recursive BST code. You should be able to read recursive BST code. You will not be required to write recursive BST code, since you won't have practice until KdTree.

In-order traversal. Understand how to perform an in order traversal (in optional slides, also at beginning of the BST lecture)

Treesort. Understand how to use a BST to sort items. Insert all the items, then perform an inorder traversal. This algorithm is also known as Quicksort.

Deletion and Hibbard deletion. Understand how to delete an item with 1 or 0 children. Understand how to delete an item with 2 children, a.k.a. Hibbard deletion.

Symbol table performance. Understand the performance of our elementary implementations. Understand the performance of a BST based symbol table in the best, worst, and average (shuffled) case, with an emphasis on the fact that the height of a tree is the key factor. Understand what Hibbard deletions do to tree height. You do not need to know why Hibbard deletes result in sqrt(n) height.

Balanced Search Trees

Key Concepts

2-3 trees. Understand how to search and insert. Understand why insertion technique leads to perfect balance and symmetric order.

Terminology. Symmetric order. Perfect balance.

Performance of 2-3 trees. Tree height is between log3(N) and log2(N). All paths are of the same height.

Mapping between LLRBs and 2-3 Trees. Understand how to map a 2-3 tree to an LLRB and vice versa.

LLRBs. BST such that no node has two red links touching it. Perfect black balance. Red links lean left.

LLRB get. Exactly the same as regular BST get.

Storing color. A node's color is the same as the edge connecting that node to its parent.

LLRB insert. Insert as usual, then apply rotations recursively as follows, working upwards from the new node.

Case 3: A parent node has a black left child and a red right child, so rotate the parent left. The former right child is now the boss. Reminder: null links are considered black for the purposes of deciding cases.
Case 2: A grandparent node has a red left child whose left child is also red. Rotate this grandparent right (so that one in the middle is now the boss).
Case 1: A parent node has two red children. Flip colors.

Conveniently, we can always apply case 3, case 2, then case 1 to every node in a tree, and we can guarantee that the entire tree will be an LLRB.

LLRB performance. Perfect black balance ensures worst case performance for get and insert is ~2 log(N).

Hash Tables

Key Concepts

Brute force approach. All data is just a sequence of bits. Can treat key as a gigantic number and use it as an array index. Requires exponentially large amounts of memory.

Hashing. Instead of using the entire key, represent entire key by a smaller value. In Java, we hash objects with a hashCode() method that returns an integer (32 bit) representation of the object.

hashCode() to index conversion. To use hashCode() results as an index, we must convert the hashCode() to a valid index. Modulus does not work since hashCode may be negative. Taking the absolute value then the modulus also doesn't work since Math.abs(Integer.MIN_VALUE) is negative. We use hashCode & 0x7FFFFFFF instead before taking the modulus.

Hash function. Converts a key to a value between 0 and M - 1. In Java, this means calling hashCode(), setting the top bit to 0, then taking the modulus.

Collision resolution. Two philosophies for resolving collisions discussed in class: Separate chaining and 'open addressing'. We didn't use the term open addressing, but it's where you use empty array entries to handle collisions, e.g. linear probing. We will use this term at the beginning of lecture 11.

Separate chaining with a linked list. Key/value pairs are stored in a linked list of nodes of length M. Hash function tells us which of these linked lists to use. Get and insert both require potentially scanning through entire list.

Resizing separate chaining hash tables. Understand how resizing may lead to objects moving from one linked list to another. Primary goal is so that M is always proportional to N.

Performance of separate chaining hash tables. Cost of a given get, insert, or delete is given by number of entries in the linked list that must be examined.

Expected average amortized search and insert time is given by N / M, which is no larger than some constant (due to resizing).
Expected amortized worst case time is given by size of largest bin, which is order log N / log log N. This is far beyond the scope of the class. Grows slowly, but is not quite constant.

Linear probing hash tables. If the space that should be occupied by a key is already occupied by something else, try the space to the right. If that's occupied, go right more. Repeat. This philosophy works for get and insert.

Performance of linear probing hash tables. As before, perfmrance determined by number of entries in the key array that must be examined.

If N / M is sufficiently small, expected amortized worst case run time is given by a constant. For N / M = 0.5, expected cost of a search hit is 3/2 and expected cost of a search miss is 5/2. Once N / M becomes larger, cost grows. See Knuth's parking problem for more.
Expected amortized worst case run time is given by Theta(log N). We did not prove this in class or even discuss it, and this is extremely distant from the scope of our course.

Hash functions and the uniform hashing assumption. For our analyses above, we assumed that our hash function distributes all input data evenly across bins. Design of such function is beyond the scope of our course. Most important guideline is to use all the bits in the key.

Dangers of hashing. If hashCode() is known and easy to reverse, adversary can design a sequence of inputs that result in everything being placed in one bin. Or if hashCode() is just plain bad, same thing can happen.

Geometric Applications of BSTs

Key Concepts

1d range search

How do we use the rank method to support 1d range search? (You don't need to know how to implement rank, since we skipped that part of the BSTs lecture)

Kd-Trees

Enables rapid 2d range search.
Enables rapid 2d nearest neighbor search.
You should understand all of the details of the KdTreeST class that you wrote for assignment 5.

Undirected Graphs

Key Concepts

Graph Representation
There are many possible internal representations of a graph. We discussed three. You should be familiar with the tradeoffs for each.

List of edges
Adjacency matrix
Adjacency lists

Important graph problems

Is there a path between s and t?
Is there a cycle in the graph? How do you find the nodes in a cycle?
Is a graph bipartite?
Euler tour. Is there a cycle that uses every edge exactly once?
Hamilton tour. Is there a cycle that uses every vertex exactly once?
Planarity problem. Find a vertex layout that results in no crossing edges (see this online game for a fun (?) look at the problem)
Graph isomorphism. Are two graphs the same graph (but with different vertex names)

You should know how to solve the first three -- not necessarily because these problems are so important, but because you should have at least a few examples of how to use DFS or BFS to solve problems. For the last four, it's good scholarship to be know whether the problem is tractable (polynomial time) and whether there exists an easy to understand tractable algorithm. We may even ask you how these things on an exam. You do not need to be familiar with the details of the best known algorithm for their solution.

One important meta-point is that it is non-trivial to determine the difficulty level of a graph problem. For example, finding an Euler tour is polynomial time and the algorithm is non-trivial but not hideously complex. By contrast, finding a Hamilton tour is an intractable problem. Planarity can be solved in linear time, but it's a rather complex algorithm to follow. Graph isomorphism is particularly notable since its tractability has remained elusive despite decades of research.

Important graph tools

DFS. You should know this algorithm absolutely cold by the time the exam comes around. For example, you should be able to write DepthFirstSearch.java in your sleep, and the run time if we use adjacency lists (E + V) should be scorched into your neurons.
BFS. You should have the same level of familiarity as with DFS.

Graph client design pattern (used for 4.1, 4.2, 4.3, and 4.4). Instead of creating highly complex classes that support basic graph operations AND solution of important graph problems, we have a basic class (e.g. Graph) that gives us basic operations, and special purpose clients that solve subproblems on a graph (e.g. ConnectedComponents(Graph g)).

Directed Graphs

Key Concepts

Using directed graphs vs. undirected graphs. Know when to use each. This should be fairly natural. For example, you'd use an undirected graph to represent friend connections between Facebook users, and a directed graph to represent following connections on Twitter.

BFS and DFS are directed graph algorithms. Many canonical unweighted digraph problems (e.g. shortest paths and reachability) can be solved using code that is effectively identical to code for solving problems on unweighted graphs.

Important digraph problems.

Topological sort.

What is a topological sort?
Is it unique?
What are the conditions under which a graph has a topological sort?
How do you compute a topological sort?
What is the run time to find a topological sort if we use adjacency lists?
Bonus: How can you use the idea of a topological sort to find a Hamilton tour (usually an NP complete problem)?

Strongly connected component identification.

What is a strongly connected component (SCC) (also sometimes called a strong component)?
Given Q strong components, how do we know that there is SOME sequence of Q depth-first-searches where each search finds exactly one of our Q strong components?
How do we find that sequence? Step 1 of Kosaraju-Sharir.

Important digraph tools.

Preorder, postorder, and reverse postorder.
Topological sort.
Kosaraju-Sharir algorithm.

What does the first step provide? An ordering of our vertices that allows us to find the SCCs.
What does the second step do? Finds all the strongly connected components using the ordering provided by the first step.
What are the steps? 1. Run DFS on G^R to compute reverse postorder of G^R. 2. Start V depth-first-searches (one for each node) in the order given by first step.
What is the run time if we use adjacency lists?

Minimum Spanning Trees

Definition of a weighted graph and an MST.

What is the MST?
Under what conditions is it unique?
How many edges are in the MST?

Cuts.

What is a cut?
How many cuts are there for a graph with V vertices?
Cut property - given a cut, the crossing edge between the two partitions belongs to the MST. This is the most important point of this lecture, and you should be very familiar with the proof.

Manually performing Kruskal's and Prim's algorithms. These are easy and you should be able to do them even under the influence of powerful sedatives.

Why Kruskal's and Prim's algorithms work.

Implementing Kruskal's and Prim's algorithms.

What sort of data do we track in Kruskal's algorithm, and what data structures do we use to track this data?
What is the resulting run time for typical implementations of these data structures?
What sort of data do we track in Lazy Prim's algorithm, and what data structures do we use to track this data?
What is the resulting run time for typical implementations of these data structures?
For Eager Prim's, why do we track vertices instead of edges? What does this do to our space requirements?
What is the special data structure that we use to track vertices and their best known distance?
How do we change the priority of a vertex using this special data structure (you should be able to give an example of a method call)?

Shortest Paths

Key Concepts

Graph Representation

Use DirectedEdge instead of Edge.
Adjacency list only tracks edges pointing out of node.

Shortest path representation

Can represent shortest path as a tree.
In Java, we represent the shortest path tree with two arrays: distTo[] and edgeTo[].

Edge relaxation

What does it mean to relax an edge? Update shortest path tree (edgeTo[] and distTo[]) to reflect the information contained in that edge. What does it mean to relax a vertex?
The code for relax should be second nature.
What is the shortest paths optimality condition? Why is it true?
What is the generic shortest path algorithm? Why does it always eventually get the right answer?

Manually performing Dijkstra's, Acyclic Shortest Paths, and Bellman Ford - These should be very easy to manully execute.

Bellman Ford algorithm

Sequence of edge relaxations: Relax all the edges. Then relax them all again. And again. Repeat this V times. You are done.
Runtime is obviously EV.
Proof: After ith relaxation of all E edges, we have the shortest path containing i edges. There are at worst V-1 edges between any two vertices in the shortest path tree (otherwise it is not a tree).
Sledgehammer of algorithms. Works on any graph except a graph with negative cycles. However, we don't really care, because a negative cycle has no definition of shortest path (since you can just go in the negative loop more times and get there sooner).

Dijkstra's algorithm

Important invariant: Once a vertex belongs to the shortest path tree, its distTo and edgeTo can never change. Why is this invariant true? How does it prove correctness?
Why is the runtime (V + E) log V? Why do the book and slides just say E log V? If we change our priority queue implementation, how does this runtime change (in general - no need to write Fibonacci heap runtimes on your cheat sheet)?
How is Dijkstra's algorithm similar to Prim's algorithm? How is it dffferent?

Acyclic shortest paths

Why does it work?
Why is the run time E + V?

Max-Flow

Key Concepts

Graph Representation

Use FlowEdge instead of DirectedEdge.
Adjacency list tracks nodes pointing towards and away from node (why?).

Max-flow, min-cut, and Ford-Fulkerson basics

What is a flow? Assignment of value to every edge.
What is the max-flow problem? What is the min-cut problem?
How do we find an augmenting path from s to t? BFS, but with the special rule that we don't traverse a forward (backward) edge if it is full (empty).
Given a max-flow, how do we find a min-cut? Try to find an augmenting path using algorithm above. The s side of the s-t cut is every vertex reachable from the source.

Ford-Fulkerson details

How do we prove that it always terminates with integer weights?
How do we know the answer is right if it terminates?
Why can't we use the same argument that we used for the generic shortest paths algorithm?
What are some different strategies for picking a path? Why is DFS potentially very bad?

String sorts

Key Concepts

Terminology

String - sequence of characters from a finite ordered alphabet. In Java, our alphabet is the set of all 16 bit integers (representing unicode characters).
Radix - just another word for 'base' as in the base of a number system. For example, the radix for words written in lowercase English letters is 26. For number written in arabic numerals it is 10.
Radix sort - a sort that works with one character at a time (by grouping objects that have the same digit in the same position).
Note: I will use 'character' and 'digit' interchangably in this study guide.

Key indexed counting. Allows you to sort N keys in linear time (better than linearithmic). Beats linearithmic bound by avoiding any comparisons. This is a completely different philosophy for how things should be sorted. This is the most important concept for this lecture.

Manually performing LSD and MSD. Should be doable in your sleep.

LSD.

Requires fixed length strings (can pad end of strings with 'smaller-than-everything' character).
Requires 2NW charAt calls. Why?
Requires 7WN + 3WR array accesses. Why?
Why do we consider these run times to be linear, despite the fact that they involve products WN and WR?
Requires N + R space. Why?
What sorting technique is used in LSD? Would a standard technique (e.g. Quicksort) work? Does the sort need to be stable?
For a fixed alphabet and key size, what are the best case and worst case inputs for LSD?

MSD.

What sorting technique is used in MSD? Would a standard technique (e.g. Quicksort) work? Does the sort need to be stable?
How much memory does MSD use? Why is MSD so memory hungry? What sort of inputs result in the gratest memory usage?
Why is it so important to switch to insertion sort (or another sort) for small subarrays? Why did we not have to do this in LSD?
For a fixed alphabet and key size, what are the best and worst case inputs for MSD?
What is the role of our special charAt(char, int) method?

Implementation issues.

Why does the special charAt method have a significant impact on performance?
What sorts of inputs are especially good for LSD and MSD?
What sorts of inputs are especially bad for LSD and MSD?

Tries

Key Concepts

Terminology and Basics

Length of our String keys given by L.
Alphabet size is R.
You should be able to build an R-way Trie and TST. You should be able to search a R-way Trie and TST. You should be able to delete from an R-way trie. You should appreciate how a TST can become unbalanced with inserts.
Given a graphical depiction of a TST or Trie, you should be able to tell which keys are in the trie.

Hash Tables and LLRBs with String keys

With String keys, we don't want to think about counting compareTo() or hashCode() and equals() calls. Why? What do we think about instead?
Know (but you do not have to prove) that LLRBs take order of growth lg² N charAt() calls to compare random strings. Based on this, a hit is order L + lg² N. Be aware that this is not necessarily a meaningful piece of information for practical purposes.
Understand why hash tables are L charAt() calls on a hit or miss (assuming no collisions!)

Tries

Special structure such that you only look at each letter one time. Trivially, misses and hits have a max run time of L.
Be aware that for random inputs, search misses only take an average of log_R N, which is order of growth lg N.
Why do R-way tries use such a huge amount of memory? How much memory in terms of L and R? How does this same problem affect run time?
What are the two ways of determining that a key does NOT exist in a trie?

TSTS

When looking for a single character, why is it possible to follow as many as R links before you complete the search for that character?
If the TST is balanced, typical case inserts and hits cost L + lg N character compares, and typical misses cost lg N character compares.
If the TST is not balanced, worst case hits, misses, and inserts all include H, where H is the height of the tree.

Substring Matching

Key Concepts

Terminology and Basics

Substring search problem: Find an instance (or all instances) of a substring of length M in a string of length N.
Sustring must be precisely defined (no *s or |s).
You should be able to manually use a KMP DFA, and you should be able to manually carry out Boyers-Moore and Rabin Karp.

KMP

How do you construct the DFA? How much time does it take if you re-simulate every time you have a mismatch? It's ok if you don't fully understand the linear time construction process.
What are the best case run times for DFA construction and DFA simulation? The worst case run time?

Boyer-Moore

What is the mismatched character heuristic? Why do we use the rightmost character?
Why is the mismatched character heuristic suboptimal? Why do we use it then -- because the basic idea is very similar to KMP and you'll learn it if you ever really need to.
What is the best case run time? The worst case run time?
What sort of inputs result in best and worst case performance?

Rabin Karp

If we know mod(ABCDEFG, R), how do we compute mod(BCDEFGH, R) in constant time (where A through H are arbitrary digits of a number from some alphabet of radix R)?
What are the Las Vegas and Monte Carlo versions of Rabin-Karp? Why does it not really matter which one we use? If you were worried, which one would YOU use?
How would we extend Rabin-Karp to efficiently search for any one of P possible patterns in a text of length N? How would this technique compare to using KMP or Boyer-Moore for the same task?

Regular Expressions

Key Concepts

Terminology and Basics

Understand the regular expression syntax.
Know how to write simple regular expressions. We will not ask you to do this on the final.

NFA Simulation

Understand how to manually simulate an NFA operating on a string.
What happens on a mismatch? All states that don't match are removed from the set of valid states. This may result in an NFA that is not in ANY state. The NFA does NOT go back to state 0.
Why does an NFA only tell you if an ENTIRE string matches a regular expression, not whether or not some substring matches?
Given a string and and a regular expression, how do we tell if any substring of the string matches the regular expression? (Hint: Make a new regular expression)
How does graph search play a role? Does it matter if we use DFS or BFS? Would it ever make sense to use an undirected graph in the regular expression context? A weighted graph?

NFA Construction

Understand how to build an NFA from a regular expression. If you know how to do this using the stack-based method, you can do this process without even thinking.
Our NFA construction process is not the only valid construction process. There are other equally valid ones. Ours is just the simplest that Bob and Kevin could come up with.
Our NFAs are non-unique representations of a given regex (as discussed in lecture).
Why are there no more than 3M epsilon transitions? Why is this fact vitally important to the NM worst case run time for simulation?
Since DFAs have worst case run time N, and all regexes go with some DFA, why don't we just use those?
Advanced and important topic mentioned in passing during lecture that won't be covered on the final: There is no regex that can compile a regex into a DFA. However with a DFA and a stack, you can do this just fine (specifically, this process doesn't require a Turing machine). Compiling Java into class fiels, by contrast, does require a Turing machine.

Data Compression

Key Concepts

Terminology and Basics

Why compress? To save space AND time.
How does compression work? Compression takes advantage of structure within data.
What sort of data does not compress well? Random data.
How do we characterize the randomness of data? The Kolmogorov complexity.
Lossy compression can further reduce file sizes by throwing away unimportant information.
What is the compression ratio?
Why can no compression algorithm possibly compress all bit streams?
What fraction of bitstreams can be compressed in half by a general purpose algorithm?

Run length coding

Takes advantage of repeated binary digits.
How do you handle runs of length more than 2^M?

Huffman coding

Basic idea: Variable length codewords to represent fixed length characters. More common characters are represented by shorter codewords.
What is a prefix free code? Why is it important that Huffman coding use a prefix free code? Would encoding work with a non-prefix free code? Would decoding work?
Why is it necessary to transmit the coding trie? Why don't we have to do something similar with run length encoding or LZW?
Why do we typically use an array for encoding and a trie for decoding?
You do not need to know the ins and outs of the Huffman code. However, you should conceptually understand the idea of transmitting/reading the trie using an inorder traversal.

LZW

Basic idea: Fixed length codewords to represent variable length strings. More common characters are represented by shorter codewords.
Why do we typically use a trie for encoding and an array for decoding?
How do you handle the 'strange' case (where a codeword is seemingly not in the table during decoding)?

Kolmogorov Complexity

For 226, we define the Kolmogorov Complexity of a string S as the length of the shortest Java program (including all relevant libraries) that can output S.
Given: a string S, a program P of length L in programming language X that outputs S, why can we always find a Java program of length L + C, where C is some constant that depends on our choice of X (but not S or L)? Because we can always write a compiler for X written in Java.
Why does the given fact above prove that any string which is compressible in one programming language can be compressed in any other language (up to a constant factor)?
Why is it impossible to write a program which finds the Kolmogorov Complexity of a given string? No (the reason why was an optioanl part of the lecture)
Are there any strings for which Kolmogorov Complexity can be found? Yes! For example, "a".

Reductions and Intractability

Key Concepts

We consider problems solvable in polynomial time to be efficient.
Important real world note: There are two common definitions for P and NP. The classic definition defines the classes in terms of decision problems. A later (but also quite common) definition is in terms of search problems. For simplicity, we used decision problems in our lecture, but many other sources use the search problem version. The two definitions are very similar and have the same real world implications, but they are not quite the same.
To make matters even a bit more confusing, I used P to represent the class of all problems solvable in polynomial time, and NP the class of all decision problems solvable in polynomial time.
Bottom line: Whether we're talking about the decision or search problem formulation, the fundamental idea is that P is the a of problems that can be solved in polynomial time, and NP is the a of problems whose solution can be verified in polynomial time.
Completeness: What conditions are necessary for a problem to be Q-complete, where Q is some set of problems?
What did Rabin/Levin do?
What did Karp do?
What is a reduction?
Why is solving an NP complete problem such a big deal?
Amusing fact: There exists a well defined algorithm (that you can already write in Java) that solves NP problems in time P, but it's polynomial time only if P = NP. Wikipedia has the scoop. It's madness!
One of the most common arguments against Searle's Chinese Box is the systems argument (namely that the dude in the box is acting as part of a larger system). I happen to strongly agree with this. I still won't let you blow me up and reassemble me.

Computer Science 226 Algorithms and Data Structures Fall 2013

Study Guide

Union Find and General Intro

Key Concepts

Recommended Problems

C level

B level

A level

Analysis of Algorithms

Key Concepts

Recommended Problems

C level

B level

A level

Stacks and Queues

Key Concepts

Recommended Problems

C level

B level

A level

Elementary sorts

Key Concepts

Recommended Problems

C level

B level

A level

Mergesort

Key Concepts

Recommended Problems

C level

B level

A level

Quicksort

Key Concepts

Recommended Problems

C level

B level

A level

Collections, PQs and Heaps

Key Concepts

Recommended Problems

C level

B level

A level

General sorting

Recommended Problems

C level

B level

A level

Symbol Tables and BSTs

Key Concepts

Recommended Problems

C level

B level

A level

Balanced Search Trees

Key Concepts

Recommended Problems

C level

B level

A level

Hash Tables

Key Concepts

Recommended Problems

C level

B level

A level

Geometric Applications of BSTs

Key Concepts

Recommended Problems

Undirected Graphs

Key Concepts

Recommended Problems

C level

B level

A level

Directed Graphs

Key Concepts

Recommended Problems

C level

Computer Science 226
Algorithms and Data Structures
Fall 2013