RESULTS

Seventeen groups participated. Group size ranged from one to four. All
analogies consisted of a stem pair and 5 choice pairs.

I compared the results for each of the groups on three data sets:

 * training data (229 analogies from web sources, some prep books, and
   contributions from class members)
 * testing data (152 analogies from class members and real SATs)
 * real SATs only (76 analogies, a subset of the testing data)

I looked at three different scoring criteria:

 * score (correct - incorrect/4; this is how raw SAT scores are computed)
 * correct
 * precision (correct/(correct+incorrect))

To maximize correct, you want to always guess. To maximize precision, you
want to guess only when you are quite sure. To maximize score, you have to
trade these two things off carefully. (A short code sketch after the
training data table below spells out these formulas.)

Training Data

On the training data, the groups averaged 57.8 correct, 27.1 score, and
33.9% precision. Plot 1 (http://www.cs.duke.edu/~mlittman/ftp/plotb.ps)
displays the results for all the groups on the training data. The x-axis
is the number of correct answers and the y-axis is the number of incorrect
answers, so ideally groups want to be in the bottom right of the plot. I
added three lines to the plot corresponding to the maximum values of
correct (dotted), score (dashed), and precision (solid).

As you can see, Group 7 substantially outscored the other groups in both
the score (50.75) and correct (85) categories. They used altavista and a
set of hand-built patterns to identify different types of relations.
Wordnet was used selectively to do query expansions.

Group 11 dominated the precision category (65.5%). Their strategy was
pretty interesting---they answered very few problems, but the ones they
did answer, they got right nearly 2/3 of the time. As a result, they did
amazingly badly on correct, but pretty reasonably on score. They used 14
different types of searches in Wordnet and ignored questions they didn't
handle.

Groups 14, 6, and 1 deserve honorable mentions for occupying the "Pareto
frontier": no other group got both more correct and fewer incorrect
answers than they did.

Complete data:

GROUP  CORRECT (%)  INCORRECT  SKIPPED  SCORE   PRECISION
  01    63 (27.5)      125        41    31.75      33.5
  02    72 (31.4)      157         0    32.75      31.4
  03    26 (11.4)       66       137     9.50      28.3
  04    60 (26.2)      155        14    21.25      27.9
  05    59 (25.8)      170         0    16.50      25.8
  06    70 (30.6)      126        33    38.50      35.7
  07    85 (37.1)      137         7    50.75      38.3
  08    62 (27.1)      167         0    20.25      27.1
  09    32 (14.0)       36       161    23.00      47.1
  10    31 (13.5)      134        64    -2.50      18.8
  11    38 (16.6)       20       171    33.00      65.5
  12    76 (33.2)      153         0    37.75      33.2
  13    57 (24.9)      144        28    21.00      28.4
  14    55 (24.0)       82        92    34.50      40.1
  15    44 (19.2)      111        74    16.25      28.4
  16    77 (33.6)      152         0    39.00      33.6
  17    75 (32.8)      153         1    36.75      32.9
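Here is how the SCORE and PRECISION columns follow from CORRECT and
INCORRECT, as a small Python sketch. The check against Group 7's training
row is just an illustration of the formulas above.

    def score(correct, incorrect):
        """Raw SAT-style score: each incorrect answer costs a quarter point."""
        return correct - incorrect / 4

    def precision(correct, incorrect):
        """Fraction of attempted questions that were answered correctly."""
        return correct / (correct + incorrect)

    # Group 7 on the training data: 85 correct, 137 incorrect, 7 skipped.
    print(score(85, 137))                      # 50.75
    print(round(100 * precision(85, 137), 1))  # 38.3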
Testing Data

On the testing data, the groups averaged 35.3 correct, 14.4 score, and
31.8% precision. Plot 2 (http://www.cs.duke.edu/~mlittman/ftp/plot.ps)
displays the results for all the groups.

Group 7 still comes out ahead in the correct category (53), but two other
groups are nipping at its heels: Group 6 and Group 13 also get 50 or so
questions right. However, Group 13 comes in with fewer incorrect answers
and takes the score category with 31.00 over Group 7's 30.00. Second place
is Group 7 (30.00) and third place is Group 6 (27.75); what a race!

Group 11 wins the precision category again (63.2%).

Take a look at where Group 13 was on the training data. They certainly
don't look like a threat to Group 7 there---they really come out of
nowhere to take the prize. Their technique was to use an online thesaurus
(www.wordsmyth.net) to handle synonyms, Wordnet for antonyms, and a
comparison of the words in the Wordsmyth definitions for more general
relationships.

Group 6 used a method similar to Group 7's: they categorized a set of
analogies, did searches on the web, and made extensive use of Wordnet.
They also computed transitive closures in Wordnet to find a large set of
related words for each word.

Complete data:

GROUP  CORRECT (%)  INCORRECT  SKIPPED  SCORE   PRECISION
  01    26 (17.1)      103        23     0.25      20.2
  02    46 (30.3)      106         0    19.50      30.3
  03    23 (15.1)       35        94    14.25      39.7
  04    40 (26.3)      108         4    13.00      27.0
  05    45 (29.6)      107         0    18.25      29.6
  06    50 (32.9)       89        13    27.75      36.0
  07    53 (34.9)       92         7    30.00      36.6
  08    32 (21.1)      120         0     2.00      21.1
  09    13 (08.6)       22       117     7.50      37.1
  10    30 (19.7)       91        31     7.25      24.8
  11    24 (15.8)       14       114    20.50      63.2
  12    41 (27.0)      111         0    13.25      27.0
  13    51 (33.6)       80        21    31.00      39.0
  14    19 (12.5)       39        94     9.25      32.8
  15    26 (17.1)       84        42     5.00      23.6
  16    45 (29.6)      107         0    18.25      29.6
  17    36 (23.7)      116         0     7.00      23.7

All 17 groups scored better than chance (0.00). In statistics, a result is
*significantly* better than chance if the probability that chance would do
as well is less than .05. Using this criterion, any group that scored over
10.75 is significantly better than chance; ten of the groups achieved
this. (The sketch below is a rough check of that threshold.)
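Here is a rough, hedged check of the 10.75 cutoff. It assumes the chance
baseline is a guesser who answers all 152 five-choice questions at random,
which may not be exactly the model used to compute the threshold.

    from math import comb

    N, P = 152, 1 / 5   # 152 testing questions, 5 choices each

    def prob_at_least(min_correct):
        """P(a pure guesser gets at least min_correct right), Binomial(N, P)."""
        return sum(comb(N, c) * P**c * (1 - P)**(N - c)
                   for c in range(min_correct, N + 1))

    # If a guesser answers everything, score = correct - incorrect/4
    #                                        = 1.25*correct - 38,
    # so a score above 10.75 means at least 40 correct answers.
    print(prob_at_least(39))   # score >= 10.75: just over .05
    print(prob_at_least(40))   # score >  10.75: below .05, i.e. significant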
SATs

The test questions were of mixed quality, so I was curious how the groups
did on the 76 high-quality published SAT questions. Plot 3
(http://www.cs.duke.edu/~mlittman/ftp/plotS.ps) shows the results. Twelve
groups scored above chance level on these questions and three of these
were significantly better (13, 7, 5).

Notes on each group's approach:

GROUP I:
 * wordnet relations (counted); 73.3% accuracy in 20% (?)
 * downloaded dictionaries and wordlists; regexp search locally
 * query expansion using wordnet
 * discovered bridges used for web search
 * generic list as a last resort
 * idea: parse and/or tag definitions?

GROUP II:
 * wordnet + mutual information
 * part of speech classification
 * like part of speech synonym if share at least one common word
 * high-precision portion gets 50% of 23%
 * predefined set of linking words
 * novel 3-word MI measure on altavista
 * 28-33% accuracy using web-related stuff

GROUP III:
 * word graph conception (I like this)
 * couldn't get reliable information from wordnet or wordnet glosses
 * answer elimination: 88%
 * idea: create a network of word relationships from a dictionary

GROUP IV: kitchen sink method
 * synonyms/antonyms via wordnet
 * exact phrase searcher
 * part of speech changer to make this more accurate
 * antonyms, including part of speech change
 * word associator
 * suggest using shopping sites to distinguish objects from concepts
   (why not wordnet?)
 * bad results from meronyms et al.
 * suggest opencyc
 * suggest using probabilities instead of hard cutoffs (promising)
 * (claim to get 4 more correct than my scorer says)
 * idea: could use training data to learn an appropriate weighting scheme

GROUP V:
 * try to solve via wordnet; if ambiguity, remaining choices go to the web
 * 400 pages per pair downloaded from google to find "linking" words
 * stop word removal
 * (probably the cleanest architecture to describe)

GROUP VI: combined wordnet and web
 * used all wordnet relationships
 * transitive closure for meronyms (parts, subparts, subsubparts, ...)
 * puts strengths on relationships and allows multiple ones to be
   considered simultaneously
 * score triples based on google snippets
 * what are the forms of the relationship words?
 * very complete

GROUP VII:
 * score triples based on proximity
 * use wordnet synsets in searches
 * fancy probabilistic formula
 * caching of search results for efficiency in development
 * hand-built configuration file for different analogy types
 * suggestion: blend similarity across relationships
 * 50% accuracy for wordnet
 * suggestion: use more high-quality, local text

GROUP VIII: search engine (no wordnet)
 * "fingerprint" the pairs and compare the fingerprints
 * did not combine with synonyms
 * no skipping, but ranked words based on similarity; not too bad

GROUP IX: wordnet (not web), 6 types of relationships

GROUP X: struggled with wordnet
 * disambiguated POS instead of maintaining multiple possibilities
 * forced choices to have one POS pattern (reasonable in real analogies)
 * hand-built set of searches, depending on POS pattern
 * used glosses as well for counting the strength of a relationship
 * scored searches (nothing picked if no hits recorded)
 * suggestion: categorize by POS, enumerate choices (some groups did this)

GROUP XI:
 * didn't get far with linking words
 * wordnet only (14 searches, used substring matches)

GROUP XII:
 * synonyms/antonyms using wordnet
 * hand-built pattern list, searched on google
 * used the stem to pick a relationship, then checked the choices

GROUP XIII:
 * wordnet for antonyms
 * wordsmyth.net for synonyms
 * wordsmyth.net definitions for general relationships
 * sim(A,C)+sim(B,D) (see the sketch at the end of these notes)

GROUP XIV:
 * classes: antonym, degree of, part of, synonym, function (in order)
 * allowed for ambiguous POS
 * wordnet for all but function (altavista with synonym expansion via
   webster)

GROUP XV:
 * wordnet for synonyms and antonyms with synsets
 * expansion rules some things out if no matching relation (positive
   evidence for some *other* relation)
 * note (as have others) that wordnet relations are within POS; no
   noun-verb connections (pool:swim)
 * wordnet matching of "types" (apple : red :: watermelon : sweet doesn't
   really work because red is a color and sweet is a taste)
 * used glosses (B is in A's gloss)
 * looked at sentences off the web (unstated what search engine); didn't
   really help

GROUP XVI: combination of google search and wordnet
 * count words in common between the return sets for the stem and the
   choices in google (30 pages; take 25 words, working out from the
   search words)
 * count words in common in wordnet searches
 * combine scores
 * disambiguate POS according to verb, noun, adverb, adjective
 * weights on verbs increase with length

GROUP XVII: hand-chosen corpora only (!)
 * tried comparing word complexity (length, etc.)
 * thesaurus and famous literature
 * use a synonym finder to throw out likely wrong answers (synonyms if the
   stem isn't, non-synonyms if the stem is)
 * essential idea is to use frequency matching
 * use an 'x' if the question doesn't look hard enough (measured in
   characters)
 * idea: need to run with/without wordnet/internet
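The Group XIII notes above mention scoring each choice as
sim(A,C)+sim(B,D). Here is a minimal sketch of that general idea. It is
illustrative only: the group apparently compared Wordsmyth definitions,
whereas this sketch substitutes NLTK's WordNet path similarity for sim(),
and the example analogy is made up rather than taken from the class data
sets.

    from nltk.corpus import wordnet as wn  # assumes nltk's wordnet data is installed

    def word_sim(w1, w2):
        """Stand-in similarity: best WordNet path similarity over all synset pairs."""
        scores = [s1.path_similarity(s2) or 0.0
                  for s1 in wn.synsets(w1) for s2 in wn.synsets(w2)]
        return max(scores, default=0.0)

    def pair_score(stem, choice):
        """Score choice pair C:D against stem pair A:B as sim(A,C) + sim(B,D)."""
        (a, b), (c, d) = stem, choice
        return word_sim(a, c) + word_sim(b, d)

    # Made-up example analogy, not from the training or testing data.
    stem = ("ostrich", "bird")
    choices = [("lion", "cat"), ("goose", "flock"), ("ewe", "sheep"),
               ("cub", "bear"), ("primate", "monkey")]
    print(max(choices, key=lambda ch: pair_score(stem, ch)))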