What is meant by an immutable data type?
An object is immutable if its data-type value (state) never changes once it is constructed.
A data type is immutable if every object of that type is immutable.
For example, to make the Autocomplete
data type immutable, you must make a
defensive copy of the terms[]
array in the constructor (because the client
might change one of the entries in the original terms[]
array after constructing the
Autocomplete
object).
Can my methods have side effects that are not described in the API?
No. For example, the Autocomplete
data type should not
change one of the entries in the terms[]
array
(or it could have a serious side effect on the client).
Similarly, the firstIndexOf()
and lastIndexOf()
should not change the entries in their argument arrays.
We usually allow methods to have benign side effects,
such as changing the state of a random number generator.
What is the meaning of the modifier static
in the following function declarations?
It means that there is one (static) method for the entire class, as opposed to one (instance) method per object. Since there is a single total order (by weight), the static modifier is appropriate.public static Comparator<Term> byReverseWeightOrder()
Can a nested class have a constructor and instance variables?
Yes, absolutely.
Recall your nested iterator class in RandomizedQueue
.
What is lexicographic order?
Lexicographic order
(or alphabetical order) is a total order in which strings of characters are ordered
based on the position of the characters in the underlying alphabet (such as the Roman alphabet, ASCII, or Unicode).
To determine which of two strings comes first, you compare the first characters of the two strings.
If they differ, the string whose first character appears earlier in the alphabet comes first.
If the first characters are the same, you compare the second characters, and so forth.
If you reach a position where one of the two strings has no more characters, the shorter string comes first.
For example, "cat"
is less than "dog"
and "dog"
is less than
"dogcatcher"
.
How can I compare two strings lexicographically?
Lexicographic order is the natural order for Java strings, so you can use the
compareTo()
method in String
to do so.
Alternatively, you can use the charAt()
method in the String
class
to access the corresponding characters of the two strings and then compare their Unicode values.
What exactly does it mean to compare two string by their first r characters?
You should compare the two strings lexicographically, as in the
natural order
for Java strings, but you should never examine beyond the first r characters
of either string. For example, if r = 3, then "cat"
is less than "dog"
;
"dog"
is equal to "dogcatcher"
; and "do"
is less than "dogcatcher"
.
Note that in the last example, "do"
has fewer than 3 characters, so we compare the
whole string "do"
to the first 3 characters of "dogcatcher"
.
What exactly is meant by the requirement that the the string comparison functions
must take time proportional to the number of characters needed to resolve the comparison?
As an example, if you are comparing the two strings
"ABCAAAAAAAAAAAAA"
and "ABDEEEEEEEEEEEEE"
,
you should compare only the first three corresponding pairs of characters because,
while the first two characters are the same, the third characters
are different.
What are the running times of the charAt()
and length()
methods in the String
library?
Constant time.
What is the running time of the length()
method
in the String
library?
Constant time.
What is the running time of the compareTo()
method
in the String
library?
Comparing two strings takes time proportional to the number of characters
needed to resolve the comparison.
What is the running time of the substring()
method
in the String
library?
Extracting a substring of length r takes \(\Theta(r)\) time.
Consequently, most uses of the substring()
method in the
string comparison functions will not meet the performance requirements.
What is the meaning of the type parameter Key
in the following function declaration?
This is an example of a generic method, which introduces its own type parameterpublic static <Key> int firstIndexOf(Key[] a, Key key, Comparator<Key> comparator)
Key
. The type parameter
serves as a placeholder for the argument types that are passed to the generic method.
In this case, it enforces the constraint that the type of the elements in the array a[]
must match
the type of the search key key
(so that the compiler would allow searching for an Apple
object in an
Apple[]
array
but reject searching for an Orange
object in an Apple[]
array).
This bit of arcane Java syntax enables the program to compile without warnings.
When I compile BinarySearchDeluxe.java
I get the compiler warning
"unchecked call to compare(T, T) as a member of the raw type java.util.Comparator."
Is that OK?
No, your program should compile without any errors or warnings.
What’s a good way to get a Comparator
object to use for testing?
You can create an array of Term
objects and use
Term.byReverseWeightOrder()
.
But, it's probably easier (and less error-prone) to create an array of String
objects and use String.CASE_INSENSITIVE_ORDER
.
String[] a = { "A", "A", "C", "G", "G", "T" }; Comparator<String> comparator = String.CASE_INSENSITIVE_ORDER; int index = BinarySearchDeluxe.firstIndexOf(a, "G", comparator);
Will I meet the performance requirement if I use binary search to find any matching key and then do a linear scan left to find the first such key? No. That approach will make \(\Theta(n)\) compares in the worst case (e.g., if all keys are equal to the search key).
Can I use Arrays.binarySearch()
?
Arrays.binarySearch()
returns the index of any matching key—not
necessarily the first or last such key. So, it will not be of much help.
My binary search makes more \(1 + \lceil \log_2 n \rceil\) compares. Is this a big problem? If it makes \(\Theta(n)\) compares in the worst case, that is a serious performance bug. If it makes \(\Theta(\log n)\) compares in the worst case (but more than the requisite number), that is just a minor deduction.
My firstIndexOf()
is correct but lastIndexOf()
has a
bug. Any suggestions?
Your code for the two methods should be symmetric
since you can view lastIndexOf()
as finding the first matching
index in the reverse of the array (which is in reverse-sorted order).
That is, replace less with greater,
lo
with hi
,
and index 0
with n-1
.
Should I re-implement mergesort or quicksort?
No. Don’t re-invent the wheel.
Call either Arrays.sort()
or one of the sorting
algorithms in algs4.jar
.
Can I assume that the weights are all distinct?
No. However, if there are two terms with equal weights, the allMatches()
method
can return them in arbitrary order.
Can I assume that the Autocomplete
constructor receives the
terms in descending order by weight?
No. While many of the sample input files are in descending order by weight, do not
make any assumptions about the order.
What should allMatches()
return if there are no matching terms?
It should still return an array containing all of the matching terms—an array of length 0.
Should Autocomplete
be case insensitive? For instance,
if I enter the prefix "prince" and one of my terms has a query
"Princeton", should "Princeton" be included among the matches?
No. On this assignment, queries are case sensitive.
In a real application, you might prefer a case-insensitive
(or diacritic-insensitive) order.
Can a prefix or query string be the empty string?
Yes and yes. For example, the empty-string prefix ""
should
return all queries, in descending order by weight.
It might be harder to think of an application for an empty-string query, but the
API does not exclude this possibility.
What do I need to do to handle Unicode characters and UTF-8 encoding?
You shouldn’t need to do anything special.
The Java String
data type handles Unicode naturally;
The supplied test files are encoded using UTF-8.
When reading input or writing output,
StdIn
, In
, and StdOut
assume UTF-8 encoding.
Why do I get a java.util.NoSuchElementException
from readLong()
when I run the sample client?
Did you cut-and-paste the file instead of downloading it? Your text editor may have corrupted
the file by changing the encoding from UTF-8 to something else.
How can I view the Unicode characters that my program prints out?
export LC_ALL='en_US.UTF-8' export LANG='en_US.UTF-8' export LANGUAGE='en_US.UTF-8'
chcp 65001
.
AutocompleteGUI
. It should work without adjustment.
Input files (for use with Autocomplete). Below are some input files for testing. Each input file consists of an integer n followed by n pairs of terms and integer weights, one pair per line, with the term and weight separated by a tab. The terms can be arbitrary strings consisting of Unicode characters, including spaces (but neither tabs nor newlines).
file number term weight source pu-buildings.txt 166 Princeton buildings age (in Y10K) pokemon.txt 729 Pokemon popularity Allen Qin fortune1000.txt 1,000 company revenue Fortune 1000 nasdaq.txt 2,658 NASDAQ market cap nasdaq.com metal-albums.txt 3,000 metal albums votes Metal Storm pu-courses.txt 6,771 course enrollment Princeton Registrar wiktionary.txt 10,000 word frequency Wiktionary redditors.txt 10,000 reddittor subscribers redditlist.com baby-names.txt 31,109 first name frequency U.S. Social Security trademarks.txt 92,254 company U.S. trademark number USPTO cities.txt 93,827 city population geoba.se mandarin.txt 94,339 Mandarin word frequency Invoke IT bing.txt 250,000 word Bing search frequency Microsoft Web N-gram 2grams.txt 277,718 2-gram frequency Google Web Trillion Word Corpus 3grams.txt 1,020,009 3-gram frequency Corpus of Contemporary American English 4grams.txt 1,034,308 4-gram frequency Corpus of Contemporary American English 5grams.txt 1,044,269 5-gram frequency Corpus of Contemporary American English words-3333333.txt 333,333 word frequency Google Web Trillion Word Corpus artists.txt 43,849 music artist familiarity Million Song Dataset songs.txt 922,230 song hotttnesss Million Song Dataset alexa.txt 1,000,000 web site frequency Alexa imdb-votes.txt 82,455 movie number of votes Internet Movie Database movies.txt 229,447 movies revenue (USD) Internet Movie Database actors.txt 2,875,183 actors revenue (USD) Internet Movie Database
These are purely suggestions for how you might make progress. You do not have to follow these steps.
Term.java
.
toString()
.
implements Comparable<Term>
to the class declaration
and implement the compareTo()
method.
byReverseWeightOrder()
. In support of this method, you will need to
define a private static nested class, as in the lecture slides.
byPrefixOrder()
. In support of this method, you will need
to define another private static nested class. Since you will need to create a comparator that
depends upon an integer parameter r, the nested class should have a constructor
that takes r as an argument and saves it in an instance variable.
BinarySearchDeluxe.java
.
Binary search is a deceptively simple algorithm.
Jon Bentley reports
that in an experiment at Bell Labs and IBM, only about 10% of professional programmers
got it right; so, test, debug, test.
lo
, mid
, and hi
to see where it stalls.
Autocomplete
. You will need to use all three orders
defined in Term
: the natural order, reverse order by weight,
and lexicographic order based on the first r characters.
Comparator<Term> comparator = Term.byReverseWeightOrder();
Arrays.sort(terms)
to sort the array terms[]
according to the natural order
(by query string) and Arrays.sort(terms, comparator)
to sort the array comparator
.
AutocompleteGUI.java
.
Suppose that I want to find only the top-k matching terms (instead of all matching terms). Are there better algorithms? Yes. The problem of finding the top term matching a given prefix can be formulated as a range maximum query problem or equivalently as a lowest common ancestor problem. Here are some lecture notes by Erik Demaine on the equivalence.
Any way to find the range of words that start with a given prefix in time proportional to the length of the prefix (in the worst case)? Yes, use a suffix tree (or a suffix array with the lcp array), where the terms are separated by a word-separation symbol and concatenated together. To use space proportional to the number of terms consider a suffix array on words. Or consider a compact DAWG.