Resources for Projects in Information Retrieval, Data Mining and
Complex Network Analysis
compiled for
COS 435: Information Retrieval, Discovery and Delivery,
COS 496: Complex Networks - Analysis and Applications,
and COS independent projects
Please help keep this list
useful! Suggest free, stable, preferably open-source
software and free, open data sets. It is best if you can
give a recommendation from personal experience, but we will take
other suggestions subject to further exploration. Thanks!
Note: All quoted descriptions are from the linked web page at
the time it was last visited.
Data Sets
Here are some data sets of possible value. Some have been used
in past COS 435 projects.
UCI Machine Learning Repository - University of California at
Irvine data sets, primarily for data mining tasks, but also
useful for other information analysis/search tasks. 468 data
sets as of February 2019. Some example data sets:
Optical Interconnection Network
3D Road Network (North Jutland, Denmark)
LETOR data set: From Microsoft. The site says "a package of
benchmark data sets for research on LEarning TO Rank, which
contains standard features, relevance judgments, data
partitioning, evaluation tools, and several baselines."
WordNet:
"WordNet® is a large lexical database of English. Nouns, verbs,
adjectives and adverbs are grouped into sets of cognitive
synonyms (synsets), each expressing a distinct concept.
Synsets are interlinked by means of conceptual-semantic and
lexical relations."
Amazon
Web Services (AWS) Registry of Open Data: provides a
variety of public data sets. Examples:
Amazon Customer Reviews, Google Books
Ngrams, Africa Soil Information Service
(AfSIS) Soil Chemistry.
The Common Crawl
corpus -
"contains petabytes of data collected over 8 years of web
crawling. The corpus contains raw web page data, metadata
extracts and text extracts."
The Million Song Dataset is "a freely-available collection of
audio features and metadata for a million contemporary popular
music tracks," from LabROSA at Columbia University. Warning:
this dataset is huge (280GB). You may be able to work with a
subset.
Facebook social graph data from the Online Social Networks
Project at UC Irvine.
(No experience with these datasets.)
Two sources of the social network for a sample of Twitter:
data for "What is Twitter, a Social Network or a News Media?"
by Kwak, Lee, Park, and Moon, Inter. World Wide Web (WWW)
Conf., 2010. See
http://an.kaist.ac.kr/traces/WWW2010.html
data for "Measuring User Influence in Twitter: The Million
Follower Fallacy" by Cha, Haddadi, Benevenuto, and Gummadi,
Inter. AAAI Conf. on Weblogs and Social Media (ICWSM), 2010.
See http://twitter.mpi-sws.org/
The Tweets2011
corpus from the
TREC 2011 microblog
track is provided by NIST.
Quoting from the Web site: "The Tweets2011 corpus is unusual in
that what you get is a list of tweet identifiers, and the actual
tweets are downloaded directly from Twitter, using the
open-source twitter-tools. However, to obtain the lists of
tweets to be downloaded (i.e. the "tweet lists"), a data usage
agreement must be signed." (No experience with this data set.)
Dr. Kevin Chai,
data scientist in the Curtin
Institute for Computation at Curtin University,
Perth, Australia, has compiled a list of data sets and data
set directories. The list overlaps with ours above, but
includes many not listed above. We have not checked every
data set in this list.
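As a starting point for working with the LETOR data set listed above, here is a minimal sketch of a line parser in Python. It assumes the SVMlight-style layout commonly used for learning-to-rank files (a relevance label, a "qid:" token, numbered "index:value" features, and an optional trailing "#" comment). Check the documentation that ships with the release you download, since the layout is an assumption here. The names RankingExample and parse_letor_line and the sample line are made up for illustration.

```python
from typing import Dict, NamedTuple


class RankingExample(NamedTuple):
    """One judged (query, document) pair. Hypothetical container type."""
    relevance: int
    query_id: str
    features: Dict[int, float]
    comment: str


def parse_letor_line(line: str) -> RankingExample:
    """Parse one line, ASSUMING the layout:
    <label> qid:<id> <index>:<value> ... # optional comment
    """
    # Everything after the first '#' is treated as a free-text comment.
    body, _, comment = line.partition("#")
    tokens = body.split()
    if len(tokens) < 2:
        raise ValueError("expected at least a label and a qid token")

    relevance = int(tokens[0])

    key, _, query_id = tokens[1].partition(":")
    if key != "qid":
        raise ValueError("second token should look like qid:<id>")

    # Remaining tokens are feature_index:feature_value pairs.
    features: Dict[int, float] = {}
    for token in tokens[2:]:
        index, _, value = token.partition(":")
        features[int(index)] = float(value)

    return RankingExample(relevance, query_id, features, comment.strip())


# Made-up sample line, not taken from the real data set.
sample = "2 qid:10 1:0.5 2:0.0 3:1.25 # docid = D1"
example = parse_letor_line(sample)
print(example.relevance, example.query_id, example.features)
```

Grouping the parsed examples by query_id is usually the next step, since ranking measures are computed per query.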
Software
Here are some free (as far as I know) software tools of possible
value. Some have been used in past projects.
Stanford Network Analysis Platform (SNAP): "a general purpose,
high performance system for analysis and manipulation of large
networks." The core of SNAP is written in C++. There is a
Python interface called Snap.py.
Gephi: an open-source network visualization and analysis
platform. This is a volunteer effort, but students have been
pleased with it.
NetworkX: From the site: "a Python language software package for
the creation, manipulation, and study of the structure,
dynamics, and functions of complex networks." Open source.
Elasticsearch: "a distributed, JSON-based search and analytics
engine designed for horizontal scalability, maximum reliability,
and easy management."
Apache Lucene: a toolkit for indexing and search, in Java or
Python.
The Lemur Project
provides software tools for information retrieval and data
mining. Includes the Galago toolkit
for experimenting with text search.
HTMLAsText
utility "converts HTML documents to simple text files, by
removing all HTML tags and formatting the text according to your
preferences." (copyright Nir Sofer).
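If you want the same kind of HTML-to-text conversion inside a Python project, a rough equivalent can be sketched with the standard library's html.parser module. This is not HTMLAsText and has none of its formatting options. It only drops tags, skips script and style content, and collapses whitespace. The names TextExtractor and html_to_text and the sample page are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text between tags, ignoring script and style blocks."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces with spaces, then collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Made-up sample page for illustration.
page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Data Sets</h1><p>Free &amp; open data.</p></body></html>"
)
print(html_to_text(page))
```

One limitation of joining pieces with spaces: text split by inline tags in the middle of a word (for example bold applied to part of a word) will come out with an extra space. For indexing experiments this is usually acceptable.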
Lists last updated Sat Feb 9 22:09:53 EST 2019.
Copyright 2008-2019 Andrea S. LaPaugh