data and software resource for data projects

Resources for Projects in Information Retrieval,
Data Mining and Complex Network Analysis

compiled for COS 435: Information Retrieval, Discovery and Delivery,
COS 496: Complex Networks - Analysis and Applications
and COS independent projects

Andrea LaPaugh
Computer Science
Princeton University

Please help keep this list useful! Suggest free, stable, preferably open-source software and free, open data sets. It is best if you can give a recommendation from personal experience, but we will take other suggestions subject to further exploration. Thanks!

Note: All quoted descriptions are from the linked web page at the time it was last visited.

Data Sets
Here are some data sets of possible value. Some have been used in past COS 435 projects.

UCI Machine Learning Repository - University of California at Irvine data sets, primarily for data mining tasks, but also useful for other information analysis and search tasks. 468 data sets as of February 2019. Some example data sets:

Optical Interconnection Network
3D Road Network (North Jutland, Denmark)

LETOR data set: From Microsoft. The site describes it as "a package of benchmark data sets for research on LEarning TO Rank, which contains standard features, relevance judgments, data partitioning, evaluation tools, and several baselines."

Stanford Large Network Dataset Collection (from the SNAP project): Networks of all kinds, including social networks, Web graphs, and Internet networks, to name a few.

WordNet: "WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations."

Amazon Web Services (AWS) Registry of Open Data: provides a variety of public data sets. Examples: Amazon Customer Reviews, Google Books Ngrams, Africa Soil Information Service (AfSIS) Soil Chemistry.

Sloan Digital Sky Survey

The Common Crawl corpus - "contains petabytes of data collected over 8 years of web crawling. The corpus contains raw web page data, metadata extracts and text extracts."

The Million Song Dataset is "a freely-available collection of audio features and metadata for a million contemporary popular music tracks," from LabROSA at Columbia University. Warning: this dataset is huge (280GB). You may be able to work with a subset.

MIRFLICKR-1M and MIRFLICKR-25000: collections of Flickr images (one million and 25,000 images respectively) under the Creative Commons license, with annotations. Created and provided by the LIACS Medialab at Leiden University, The Netherlands.

Stanford WebBase Archive - "topic-focused snapshots of Web sites"

University of Washington Database for Object and Concept Recognition for Content-Based Image Retrieval: a small collection of images organized by topic. Several content labels are provided for each image.

4 universities data set: from CMU. CS department Web pages from various universities, hand-classified into 7 categories.

Wikimedia Downloads, including Wikipedia

Collections provided on the Web site for Search Engines: Information Retrieval in Practice, a textbook by Croft, Metzler and Strohman.

Facebook social graph data from the Online Social Networks Project at UC Irvine. (No experience with these datasets.)

Two sources of the social network for a sample of Twitter:

data for "What is Twitter, a Social Network or a News Media?" by Kwak, Lee, Park, and Moon, Inter. World Wide Web (WWW) Conf., 2010. See http://an.kaist.ac.kr/traces/WWW2010.html
data for "Measuring User Influence in Twitter: The Million Follower Fallacy" by Cha, Haddadi, Benevenuto, and Gummadi, Inter. AAAI Conf. on Weblogs and Social Media (ICWSM), 2010. See http://twitter.mpi-sws.org/

The Tweets2011 corpus from the TREC 2011 microblog track is provided by NIST. Quoting from the Web site: "The Tweets2011 corpus is unusual in that what you get is a list of tweet identifiers, and the actual tweets are downloaded directly from Twitter, using the open-source twitter-tools. However, to obtain the lists of tweets to be downloaded (i.e. the "tweet lists"), a data usage agreement must be signed." (No experience with this data set.)

The International AAAI Conference on Weblogs and Social Media (ICWSM) organization makes available data sets from the papers presented at the conference. (No experience with these data sets.)

The home page of the Online Social Networks Project at the Max Planck Institute for Software Systems includes pointers to the data sets for several of their papers circa 2010, including data sets for Flickr, Facebook, LiveJournal, Orkut, and YouTube. (No experience with these data sets.)

Dr. Kevin Chai, data scientist in the Curtin Institute for Computation at Curtin University, Perth, Australia, has compiled a list of data sets and data set directories. The list overlaps with ours, but includes many data sets not listed above. We have not checked every data set in this list.

Software
Here are some free (as far as I know) software tools of possible value. Some have been used in past projects.

Stanford Network Analysis Platform (SNAP): "a general purpose, high performance system for analysis and manipulation of large networks." The core of SNAP is written in C++. There is a Python interface called Snap.py.
GraphStream: a dynamic graph library.
Gephi: an open-source network visualization and analysis platform. This is a volunteer effort, but students have been pleased with it.
NetworkX: from the site, "a Python language software package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks." Open source.
Elasticsearch: "a distributed, JSON-based search and analytics engine designed for horizontal scalability, maximum reliability, and easy management."
Apache Lucene: a toolkit for indexing and search, in Java or Python.
The Lemur Project provides software tools for information retrieval and data mining. Includes the Galago toolkit for experimenting with text search.
Twitter API
Tweepy: "An easy-to-use Python library for accessing the Twitter API."
YouTube data API
The Movie Database (TMDb) API
Natural Language Toolkit (NLTK): provides Python modules for natural language processing (NLP) and "easy-to-use interfaces to over 50 corpora and lexical resources."

A modified version of the Lancaster stemmer: http://www.scientificpsychic.com/paice/paice.html

XAMPP: "a completely free, easy to install Apache distribution containing MariaDB, PHP, and Perl."
Whoosh: "Fast, pure-Python full text indexing, search, and spell checking library."
NumPy: Python package for scientific computing
Scikit-learn: tools for machine learning in Python
Scikit-image: tools for image processing in Python
A suite of Matlab scripts for data analysis, implemented by Jakob Verbeek, including variants of principal component analysis (PCA).

The following tools should be used for website scraping only if permitted by the terms of service of the site:

Beautiful Soup: a Python library for extracting data from HTML documents. Also handles XML.
WebSphinx: an open-source web crawler.
HTMLAsText utility "converts HTML documents to simple text files, by removing all HTML tags and formatting the text according to your preferences." (copyright Nir Sofer).
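
To give a feel for the kind of tag stripping that tools such as HTMLAsText and Beautiful Soup perform, here is a minimal sketch using only Python's standard html.parser module. It is not the implementation of any tool listed above, and it handles far fewer cases than they do. The names TextExtractor and html_to_text are hypothetical, chosen for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of an HTML document, skipping script and style content.

    Hypothetical helper for illustration; not part of any listed tool.
    """

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []      # pieces of text found between tags
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces with spaces, then collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the text content of an HTML string with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = "<p>Hello, <b>world</b></p><script>var x = 1;</script>"
    print(html_to_text(sample))
```

Because the pieces are joined with spaces, text split by inline tags can pick up an extra space before punctuation. For real projects, a dedicated library such as Beautiful Soup is likely the better choice.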

lists last updated Sat Feb 9 22:09:53 EST 2019
Copyright 2008-2019 Andrea S. LaPaugh