COS 435
Resources for Projects

Please help keep this list useful!  Suggest free, stable, preferably open-source software and free, open data sets.  It is best if you can give a recommendation from personal experience, but we will take other suggestions subject to further exploration.  Thanks!

Data Sets
Here are some data sets of possible value.  Some have been used in past COS 435 projects.

IMPORTANT NOTE:  Other members of the faculty have data sets that they are willing to share.  If you need something - either a specific data set or a specific kind of data - ask, and we'll see if it is available in the department.

Included in the UCI KDD Archive (annotated by 2008 TA Joseph Calandrino):
20 Newsgroups
20,000 messages taken from 20 Usenet newsgroups
A large group of documents covering various topics

Reuters-21578 Text Categorization Collection
A collection of documents that appeared on Reuters newswire in 1987
This seems to be a relatively well-formatted large set of data

NSF Research Awards Abstracts 1990-2003
While this data is not as nicely formatted as XML, it appears to be structured in a fairly parse-able manner, so I think it's usable.

UNIX User Data
9 sets of sanitized user data drawn from the command histories of 8 UNIX computer users over up to 2 years
This seems like an interesting case for differentiating between a small number of richly detailed items (i.e., the eight UNIX users).

Reuters Transcribed Subset
Data created by selecting files from the Reuters-21578 collection, which were read aloud by foreign speakers and transcribed by a speech recognition system
This data seems a little noisier and less well-organized, but that might be useful for some projects.


Software Tools
Here are some free (as far as I know) software tools of possible value.  Some have been used in past COS 435 projects.

Lists last updated Tue Mar  2 13:02:07 EST 2010; minor revision Thu Feb 10 16:50:20 EST 2011
Copyright © 2008-2011 Andrea S. LaPaugh