COS 435
Resources for Projects
Please help keep this list useful! Suggest free, stable, preferably
open-source software and free, open data sets. It is best if you can
give a recommendation from personal experience, but we will take other
suggestions subject to further exploration. Thanks!
Data Sets
Here are some data sets of possible value. Some have been used in
past COS435 projects.
IMPORTANT NOTE: Other members of the faculty have data sets that they
are willing to share.
If you need something - either a specific data set or a specific kind
of data - ask, and we'll see if it is available in the
department.
- UCI KDD Archive: University of California at Irvine data sets,
primarily for data mining tasks, but also useful for other information
analysis/search tasks.
Included in the UCI KDD Archive (annotated by 2008 TA Joseph Calandrino):
20 Newsgroups
20000 messages taken from 20 Usenet newsgroups
http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html
A large group of documents covering various topics
Reuters-21578 Text Categorization Collection
A collection of documents that appeared on Reuters newswire in 1987
http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
This seems to be a relatively well-formatted large set of data
NSF Research Awards Abstracts 1990-2003
http://kdd.ics.uci.edu/databases/nsfabs/nsfawards.html
While this data is not as nicely formatted as XML, it appears to be
structured in a fairly parse-able manner, so I think it's usable.
UNIX User Data
9 sets of sanitized user data drawn from the command histories of 8
UNIX computer users over up to 2 years
http://kdd.ics.uci.edu/databases/UNIX_user_data/UNIX_user_data.html
This seems like an interesting case for differentiating between a small
number of richly detailed items (i.e., the eight UNIX users)
Reuters Transcribed Subset
Data created by selecting files from the Reuters-21578 collection, which
were read by foreign speakers and recorded by a speech recognition system
This data seems a little noisier and less well-organized, but that might
be useful for some projects.
- LETOR data set: From Microsoft. The site says "a package of
benchmark data sets for research on LEarning TO Rank. This dataset
contains standard features, relevance judgments, data partitioning,
evaluation tools, and several baselines, for the OHSUMED data collection
and the '.gov' data collection."
- 4 universities data set: From CMU. CS department Web pages from
various universities, hand-classified into 7 categories.
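For anyone planning to use a learning-to-rank collection such as LETOR, the sketch below shows one way to read per-query feature vectors and relevance judgments. It is only a sketch: it assumes an SVMlight-style line layout (a relevance label, a `qid:` field, `index:value` feature pairs, and an optional `#` comment), which should be checked against the documentation shipped with the package. The function names and the sample lines are made up for illustration.

```python
from collections import defaultdict


def parse_ranking_line(line):
    """Parse one line of the ASSUMED form '<label> qid:<id> <k>:<v> ... # comment'."""
    body, _, comment = line.partition("#")
    tokens = body.split()
    label = int(tokens[0])                 # relevance judgment
    qid = tokens[1].split(":", 1)[1]       # query identifier, kept as a string
    features = {}
    for tok in tokens[2:]:
        idx, val = tok.split(":", 1)
        features[int(idx)] = float(val)
    return label, qid, features, comment.strip()


def group_by_query(lines):
    """Collect (label, features) pairs under their query id, skipping blank lines."""
    groups = defaultdict(list)
    for line in lines:
        if line.strip():
            label, qid, features, _ = parse_ranking_line(line)
            groups[qid].append((label, features))
    return dict(groups)


# Made-up sample lines in the assumed layout (not taken from any real data set).
sample = [
    "2 qid:10 1:0.5 2:0.25 # docid = A",
    "0 qid:10 1:0.1 2:0.75 # docid = B",
    "1 qid:11 1:0.3 2:0.0 # docid = C",
]
groups = group_by_query(sample)
```

Grouping by query id matters because ranking quality is evaluated per query, not over the pooled documents.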
Software
Here are some free (as far as I know) software tools of possible
value. Some have been used in past COS435 projects.
Lists last updated Tue Mar 2 13:02:07 EST 2010; minor revision Thu Feb 10 16:50:20 EST 2011
Copyright 2008-2011 Andrea S. LaPaugh