RESULTS

Seventeen groups participated. Group size ranged from one to four. All
analogies consisted of a stem pair and 5 choice pairs.

I compared the results for each of the groups on three data sets:

 * training data (229 analogies from web sources, some prep books, and
   contributions from class members)
 * testing data (152 analogies from class members and real SATs)
 * real SATs only (76 analogies, a subset of the testing data)

I looked at three different scoring criteria:

 * score (correct - incorrect/4; this is how raw SAT scores are computed)
 * correct
 * precision (correct/(correct+incorrect))

To maximize correct, you want to always guess. To maximize precision, you
want to guess only when you are quite sure. To maximize score, you have to
trade these two things off carefully. (A short code sketch after the
training data table below spells out these formulas.)

Training Data

On the training data, the groups averaged 57.8 correct, 27.1 score, and
33.9% precision. Plot 1 (http://www.cs.duke.edu/~mlittman/ftp/plotb.ps)
displays the results for all the groups on the training data. The x-axis
is the number of correct answers and the y-axis is the number of incorrect
answers, so ideally groups want to be in the bottom right of the plot. I
added three lines to the plot corresponding to the maximum values of
correct (dotted), score (dashed), and precision (solid).

As you can see, Group 7 substantially outscored the other groups in both
the score (50.75) and correct (85) categories. They used altavista and a
set of hand-built patterns to identify different types of relations.
Wordnet was used selectively to do query expansions.

Group 11 dominated the precision category (65.5%). Their strategy was
pretty interesting---they answered very few problems, but the ones they
did answer, they got right nearly 2/3 of the time. As a result, they did
amazingly badly on correct, but pretty reasonably on score. They used 14
different types of searches in Wordnet and ignored questions they didn't
handle.

Groups 14, 6, and 1 deserve honorable mentions for occupying the "Pareto
frontier": no other group got both more correct and fewer incorrect
answers than they did.

Complete data:

GROUP  CORRECT (%)  INCORRECT  SKIPPED  SCORE   PRECISION
  01    63 (27.5)      125        41    31.75      33.5
  02    72 (31.4)      157         0    32.75      31.4
  03    26 (11.4)       66       137     9.50      28.3
  04    60 (26.2)      155        14    21.25      27.9
  05    59 (25.8)      170         0    16.50      25.8
  06    70 (30.6)      126        33    38.50      35.7
  07    85 (37.1)      137         7    50.75      38.3
  08    62 (27.1)      167         0    20.25      27.1
  09    32 (14.0)       36       161    23.00      47.1
  10    31 (13.5)      134        64    -2.50      18.8
  11    38 (16.6)       20       171    33.00      65.5
  12    76 (33.2)      153         0    37.75      33.2
  13    57 (24.9)      144        28    21.00      28.4
  14    55 (24.0)       82        92    34.50      40.1
  15    44 (19.2)      111        74    16.25      28.4
  16    77 (33.6)      152         0    39.00      33.6
  17    75 (32.8)      153         1    36.75      32.9
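Here is how the SCORE and PRECISION columns follow from CORRECT and
INCORRECT, as a small Python sketch. The check against Group 7's training
row is just an illustration of the formulas above.

    def score(correct, incorrect):
        """Raw SAT-style score: each incorrect answer costs a quarter point."""
        return correct - incorrect / 4

    def precision(correct, incorrect):
        """Fraction of attempted questions that were answered correctly."""
        return correct / (correct + incorrect)

    # Group 7 on the training data: 85 correct, 137 incorrect, 7 skipped.
    print(score(85, 137))                      # 50.75
    print(round(100 * precision(85, 137), 1))  # 38.3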
Testing Data

On the testing data, the groups averaged 35.3 correct, 14.4 score, and
31.8% precision. Plot 2 (http://www.cs.duke.edu/~mlittman/ftp/plot.ps)
displays the results for all the groups.

Group 7 still comes out ahead in the correct category (53), but two other
groups are nipping at its heels: Group 6 and Group 13 also get 50 or so
questions right. However, Group 13 comes in with fewer incorrect answers
and takes the score category with 31.00 over Group 7's 30.00. Second place
is Group 7 (30.00) and third place is Group 6 (27.75); what a race!

Group 11 wins the precision category again (63.2%).

Take a look at where Group 13 was on the training data. They certainly
don't look like a threat to Group 7 there---they really come out of
nowhere to take the prize. Their technique was to use an online thesaurus
(www.wordsmyth.net) to handle synonyms, Wordnet for antonyms, and a
comparison of the words in the Wordsmyth definitions for more general
relationships.

Group 6 used a method similar to Group 7's: they categorized a set of
analogies, did searches on the web, and made extensive use of Wordnet.
They also computed transitive closures in Wordnet to find a large set of
related words for each word.

Complete data:

GROUP  CORRECT (%)  INCORRECT  SKIPPED  SCORE   PRECISION
  01    26 (17.1)      103        23     0.25      20.2
  02    46 (30.3)      106         0    19.50      30.3
  03    23 (15.1)       35        94    14.25      39.7
  04    40 (26.3)      108         4    13.00      27.0
  05    45 (29.6)      107         0    18.25      29.6
  06    50 (32.9)       89        13    27.75      36.0
  07    53 (34.9)       92         7    30.00      36.6
  08    32 (21.1)      120         0     2.00      21.1
  09    13 (08.6)       22       117     7.50      37.1
  10    30 (19.7)       91        31     7.25      24.8
  11    24 (15.8)       14       114    20.50      63.2
  12    41 (27.0)      111         0    13.25      27.0
  13    51 (33.6)       80        21    31.00      39.0
  14    19 (12.5)       39        94     9.25      32.8
  15    26 (17.1)       84        42     5.00      23.6
  16    45 (29.6)      107         0    18.25      29.6
  17    36 (23.7)      116         0     7.00      23.7

All 17 groups scored better than chance (0.00). In statistics, a result is
*significantly* better than chance if the probability that chance would do
as well is less than .05. Using this criterion, any group that scored over
10.75 is significantly better than chance; ten of the groups achieved
this. (The sketch below is a rough check of that threshold.)
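Here is a rough, hedged check of the 10.75 cutoff. It assumes the chance
baseline is a guesser who answers all 152 five-choice questions at random,
which may not be exactly the model used to compute the threshold.

    from math import comb

    N, P = 152, 1 / 5   # 152 testing questions, 5 choices each

    def prob_at_least(min_correct):
        """P(a pure guesser gets at least min_correct right), Binomial(N, P)."""
        return sum(comb(N, c) * P**c * (1 - P)**(N - c)
                   for c in range(min_correct, N + 1))

    # If a guesser answers everything, score = correct - incorrect/4
    #                                        = 1.25*correct - 38,
    # so a score above 10.75 means at least 40 correct answers.
    print(prob_at_least(39))   # score >= 10.75: just over .05
    print(prob_at_least(40))   # score >  10.75: below .05, i.e. significant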
SATs

The test questions were of mixed quality, so I was curious how the groups
did on the 76 high-quality published SAT questions. Plot 3
(http://www.cs.duke.edu/~mlittman/ftp/plotS.ps) shows the results. Twelve
groups scored above chance level on these questions and three of these
were significantly better (13, 7, 5).

Notes on each group's approach:

GROUP I:
 * wordnet relations (counted); 73.3% accuracy in 20% (?)
 * downloaded dictionaries and wordlists; regexp search locally
 * query expansion using wordnet
 * discovered bridges used for web search
 * generic list as a last resort
 * idea: parse and/or tag definitions?

GROUP II:
 * wordnet + mutual information
 * part of speech classification
 * like part of speech synonym if share at least one common word
 * high-precision portion gets 50% of 23%
 * predefined set of linking words
 * novel 3-word MI measure on altavista
 * 28-33% accuracy using web-related stuff

GROUP III:
 * word graph conception (I like this)
 * couldn't get reliable information from wordnet or wordnet glosses
 * answer elimination: 88%
 * idea: create a network of word relationships from a dictionary

GROUP IV: kitchen sink method
 * synonyms/antonyms via wordnet
 * exact phrase searcher
 * part of speech changer to make this more accurate
 * antonyms, including part of speech change
 * word associator
 * suggest using shopping sites to distinguish objects from concepts
   (why not wordnet?)
 * bad results from meronyms et al.
 * suggest opencyc
 * suggest using probabilities instead of hard cutoffs (promising)
 * (claim to get 4 more correct than my scorer says)
 * idea: could use training data to learn an appropriate weighting scheme

GROUP V:
 * try to solve via wordnet; if ambiguity, remaining choices go to the web
 * 400 pages per pair downloaded from google to find "linking" words
 * stop word removal
 * (probably the cleanest architecture to describe)

GROUP VI: combined wordnet and web
 * used all wordnet relationships
 * transitive closure for meronyms (parts, subparts, subsubparts, ...)
 * puts strengths on relationships and allows multiple ones to be
   considered simultaneously
 * score triples based on google snippets
 * what are the forms of the relationship words?
 * very complete

GROUP VII:
 * score triples based on proximity
 * use wordnet synsets in searches
 * fancy probabilistic formula
 * caching of search results for efficiency in development
 * hand-built configuration file for different analogy types
 * suggestion: blend similarity across relationships
 * 50% accuracy for wordnet
 * suggestion: use more high-quality, local text

GROUP VIII: search engine (no wordnet)
 * "fingerprint" the pairs and compare the fingerprints
 * did not combine with synonyms
 * no skipping, but ranked words based on similarity; not too bad

GROUP IX: wordnet (not web), 6 types of relationships

GROUP X: struggled with wordnet
 * disambiguated POS instead of maintaining multiple possibilities
 * forced choices to have one POS pattern (reasonable in real analogies)
 * hand-built set of searches, depending on POS pattern
 * used glosses as well for counting the strength of a relationship
 * scored searches (nothing picked if no hits recorded)
 * suggestion: categorize by POS, enumerate choices (some groups did this)

GROUP XI:
 * didn't get far with linking words
 * wordnet only (14 searches, used substring matches)

GROUP XII:
 * synonyms/antonyms using wordnet
 * hand-built pattern list, searched on google
 * used the stem to pick a relationship, then checked the choices

GROUP XIII:
 * wordnet for antonyms
 * wordsmyth.net for synonyms
 * wordsmyth.net definitions for general relationships
 * sim(A,C)+sim(B,D) (see the sketch at the end of these notes)

GROUP XIV:
 * classes: antonym, degree of, part of, synonym, function (in order)
 * allowed for ambiguous POS
 * wordnet for all but function (altavista with synonym expansion via
   webster)

GROUP XV:
 * wordnet for synonyms and antonyms with synsets
 * expansion rules some things out if no matching relation (positive
   evidence for some *other* relation)
 * note (as have others) that wordnet relations are within POS; no
   noun-verb connections (pool:swim)
 * wordnet matching of "types" (apple : red :: watermelon : sweet doesn't
   really work because red is a color and sweet is a taste)
 * used glosses (B is in A's gloss)
 * looked at sentences off the web (unstated what search engine); didn't
   really help

GROUP XVI: combination of google search and wordnet
 * count words in common between the return sets for the stem and the
   choices in google (30 pages; take 25 words, working out from the
   search words)
 * count words in common in wordnet searches
 * combine scores
 * disambiguate POS according to verb, noun, adverb, adjective
 * weights on verbs increase with length

GROUP XVII: hand-chosen corpora only (!)
 * tried comparing word complexity (length, etc.)
 * thesaurus and famous literature
 * use a synonym finder to throw out likely wrong answers (synonyms if the
   stem isn't, non-synonyms if the stem is)
 * essential idea is to use frequency matching
 * use an 'x' if the question doesn't look hard enough (measured in
   characters)
 * idea: need to run with/without wordnet/internet
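The Group XIII notes above mention scoring each choice as
sim(A,C)+sim(B,D). Here is a minimal sketch of that general idea. It is
illustrative only: the group apparently compared Wordsmyth definitions,
whereas this sketch substitutes NLTK's WordNet path similarity for sim(),
and the example analogy is made up rather than taken from the class data
sets.

    from nltk.corpus import wordnet as wn  # assumes nltk's wordnet data is installed

    def word_sim(w1, w2):
        """Stand-in similarity: best WordNet path similarity over all synset pairs."""
        scores = [s1.path_similarity(s2) or 0.0
                  for s1 in wn.synsets(w1) for s2 in wn.synsets(w2)]
        return max(scores, default=0.0)

    def pair_score(stem, choice):
        """Score choice pair C:D against stem pair A:B as sim(A,C) + sim(B,D)."""
        (a, b), (c, d) = stem, choice
        return word_sim(a, c) + word_sim(b, d)

    # Made-up example analogy, not from the training or testing data.
    stem = ("ostrich", "bird")
    choices = [("lion", "cat"), ("goose", "flock"), ("ewe", "sheep"),
               ("cub", "bear"), ("primate", "monkey")]
    print(max(choices, key=lambda ch: pair_score(stem, ch)))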