WordNet is a semantic lexicon for the English language that is used extensively by computational linguists and cognitive scientists; for example, it was a key component in IBM's Watson. WordNet groups words into sets of synonyms called synsets and describes semantic relationships between them. One such relationship is the is-a relationship, which connects a hyponym (more specific synset) to a hypernym (more general synset). For example, a plant organ is a hypernym of carrot and plant organ is a hypernym of plant root.
The WordNet digraph. Your first task is to build the wordnet digraph: each vertex v is an integer that represents a synset, and each directed edge v→w represents that w is a hypernym of v. The wordnet digraph is a rooted DAG: it is acylic and has one vertex that is an ancestor of every other vertex. However, it is not necessarily a tree because a synset can have more than one hypernym. A small subgraph of the wordnet digraph is illustrated below.
The WordNet input file formats. We now describe the two data files that you will use to create the wordnet digraph. The files are in CSV format: each line contains a sequence of fields, separated by commas.
means that the synset { AND_circuit, AND_gate } has an id number of 36 and it's gloss is a circuit in a computer that fires only when all of its inputs fire. The individual nouns that comprise a synset are separated by spaces (and a synset element is not permitted to contain a space). The S synset ids are numbered 0 through S − 1; the id numbers will appear consecutively in the synset file.36,AND_circuit AND_gate,a circuit in a computer that fires only when all of its inputs fire
164,21012,56099
means that the the synset 164 ("Actifed") has two hypernyms: 21012 ("antihistamine") and 56099 ("nasal_decongestant"), representing that Actifed is both an antihistamine and a nasal decongestant. The synsets are obtained from the corresponding lines in the file synsets.txt.
164,Actifed,trade name for a drug containing an antihistamine and a decongestant... 21012,antihistamine,a medicine used to treat allergies... 56099,nasal_decongestant,a decongestant that provides temporary relief of nasal...
WordNet data type. Implement an immutable data type WordNet with the following API:
The constructor should throw a java.lang.IllegalArgumentException if the input does not correspond to a rooted DAG. The distance() and sap() methods should throw a java.lang.IllegalArgumentException unless both of the noun arguments are WordNet nouns.// constructor takes the name of the two input files public WordNet(String synsets, String hypernyms) // returns all WordNet nouns public Iterable<String> nouns() // is the word a WordNet noun? public boolean isNoun(String word) // distance between nounA and nounB (defined below) public int distance(String nounA, String nounB) // a synset (second field of synsets.txt) that is the common ancestor of nounA and nounB // in a shortest ancestral path (defined below) public String sap(String nounA, String nounB) // do unit testing of this class public static void main(String[] args)
Your data type should use space linear in the input size (size of synsets and hypernyms files). The constructor should take time linearithmic (or better) in the input size. The method isNoun() should run in time logarithmic (or better) in the number of nouns. The methods distance() and sap() should run in time linear in the size of the WordNet digraph.
Shortest ancestral path. An ancestral path between two vertices v and w in a digraph is a directed path from v to a common ancestor x, together with a directed path from w to the same ancestor x. A shortest ancestral path is an ancestral path of minimum total length. For example, in the digraph at left (digraph1.txt), the shortest ancestral path between 3 and 11 has length 4 (with common ancestor 1). In the digraph at right (digraph2.txt), one ancestral path between 1 and 5 has length 4 (with common ancestor 5), but the shortest ancestral path has length 2 (with common ancestor 0).
SAP data type.
Implement an immutable data type SAP with the following API:
All methods should throw a java.lang.IndexOutOfBoundsException if one (or more) of the input arguments is not between 0 and G.V() - 1. You may assume that the iterable arguments contain at least one integer. All methods (and the constructor) should take time at most proportional to E + V in the worst case, where E and V are the number of edges and vertices in the digraph, respectively. Your data type should use space proportional to E + V.// constructor takes a digraph (not necessarily a DAG) public SAP(Digraph G) // length of shortest ancestral path between v and w; -1 if no such path public int length(int v, int w) // a common ancestor of v and w that participates in a shortest ancestral path; -1 if no such path public int ancestor(int v, int w) // length of shortest ancestral path between any vertex in v and any vertex in w; -1 if no such path public int length(Iterable<Integer> v, Iterable<Integer> w) // a common ancestor that participates in shortest ancestral path; -1 if no such path public int ancestor(Iterable<Integer> v, Iterable<Integer> w) // do unit testing of this class public static void main(String[] args)
Test client. Your main() should take the name of a digraph input file as as a command-line argument, construct the digraph, and use that digraph to test that all SAP methods work as expected. The following test client reads in vertex pairs from standard input, and prints out the length of the shortest ancestral path between the two vertices and a common ancestor that participates in that path: On the left below you can see the contents of digraph1.txt and on the right are a few samples of expected values if your program reads in that input file: