JSON (JavaScript Object Notation) is a very widely used format for information interchange that has gone from its roots in JavaScript to almost universal applicability, with processing libraries available in nearly every language. Not surprisingly, it is particularly often used to send information to and from web pages.
This assignment is partly to show in minimal form how web services operate, and partly to learn to use Python to manipulate JSON, using a dataset that is of particular interest to Princeton students: the Registrar's data on courses. It has, as has been the case with the other assignments, a testing component as well.
The data for this assignment is available at the Registrar's Course Offerings site, but not in a form that is convenient for further processing. A Python program (originally written by Alex Ogier '13 and kept in service by Brian Kernighan and Christopher Moretti) scrapes that web site and produces the information as JSON.
The JSON file is a large array, each of whose elements is the information for a single course. Here are a couple of example courses (formatted and lightly edited to remove extranea):
{"profs": [{"uid": "960638964", "name": "Christopher M. Moretti"}],
 "title": "Advanced Programming Techniques",
 "courseid": "002065",
 "listings": [{"dept": "COS", "number": "333"}],
 "area": "",
 "descrip": "This is a course about the practice of programming ...",
 "classes": [
   {"classnum": "40798", "enroll": "138", "limit": "160", "starttime": "11:00 am", "section": "L01", "endtime": "12:20 pm", "roomnum": "101", "days": "TTh", "bldg": "Friend Center"}
 ]
},
{"profs": [
   {"uid": "960423023", "name": "Bridget A. Alsdorf"},
   {"uid": "910106245", "name": "Denis Feeney"},
   {"uid": "010022721", "name": "Simon A. Morrison"},
   {"uid": "960039380", "name": "Efthymia Rentzou"},
   {"uid": "010000769", "name": "Esther H. Schor"},
   {"uid": "960275842", "name": "Mira L. Siegelberg"}
 ],
 "title": "Interdisciplinary Approaches to Western Culture II: Literature and the Arts",
 "courseid": "003780",
 "listings": [{"dept": "HUM", "number": "218"}],
 "area": "LA",
 "descrip": "... examines European texts, works of art and music from the Renaissance...",
 "classes": [
   {"classnum": "40007", "enroll": "41", "limit": "60", "starttime": "10:00 am", "section": "L01", "endtime": "10:50 am", "roomnum": "010", "days": "TWTh", "bldg": "East Pyne Building"},
   {"classnum": "40008", "enroll": "15", "limit": "15", "starttime": "1:30 pm", "section": "C01", "endtime": "2:50 pm", "roomnum": "15", "days": "TTh", "bldg": "Henry House"},
   {"classnum": "40009", "enroll": "14", "limit": "15", "starttime": "1:30 pm", "section": "C02", "endtime": "2:50 pm", "roomnum": "204", "days": "TTh", "bldg": "Friend Center"},
   ...
 ]
}
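To get a feel for the structure, here is a minimal sketch of loading such an array with the json module and reaching into one record. The inline string is a trimmed copy of the COS 333 record above (the description is omitted); in reg.py you would read courses.json instead. The variable names are only illustrative, and the sketch is written to run under both Python 2.7 and Python 3.

```python
import json

# A trimmed copy of the COS 333 record shown above, wrapped in an array
# the way courses.json is. Real code would read the file instead.
text = '''[{"profs": [{"uid": "960638964", "name": "Christopher M. Moretti"}],
  "title": "Advanced Programming Techniques", "courseid": "002065",
  "listings": [{"dept": "COS", "number": "333"}], "area": "",
  "classes": [{"classnum": "40798", "enroll": "138", "limit": "160",
               "starttime": "11:00 am", "section": "L01",
               "endtime": "12:20 pm", "roomnum": "101", "days": "TTh",
               "bldg": "Friend Center"}]}]'''

courses = json.loads(text)      # a list of dicts, one per course
first = courses[0]

print(first["title"])
# Every value is a string, including numbers such as enroll and limit.
listing = first["listings"][0]
print("%s %s" % (listing["dept"], listing["number"]))
print(first["classes"][0]["enroll"])
```

Note that an empty field such as "area" is an empty string rather than a missing key, at least in the excerpt above.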
Your task in this assignment is to make this JSON information easily searchable from a browser. The file courses.json contains the registrar's data for this semester, as scraped by the program above. You must write a Python web server that provides a RESTful search interface: queries are encoded in the path components of the URL that requests the information and the server parses the URL and generates its response. These queries have the form:
str1/str2/str3/...
where each str# above is a partial query, and the result is all records that satisfy the intersection of the partial queries (that is: the logical AND of the partial queries, or in yet other words: all of them). This means order does not matter in the queries. Case also does not matter in the queries. For example, the query in the first line below should generate the result in the line below it:
stn/TILGH/frs
FRS 144 STN T 1:30-4:20 How the Tabby Cat Got Her Stripes Shirley M. Tilghman Blair Hall T5
The example above demonstrates the format in which your program must display the information for each matching record. Specifically, on a single line:
CourseNumber Area Day Time Title Professors Building Room
There must be at least one space between adjacent result fields. The CourseNumber result field is created by joining the dept and number JSON fields with exactly one space. The Time result field is created by joining the starttime and endtime JSON fields with exactly one dash (-), after removing their AM/PM designations and whitespace.
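The two field rules above might be sketched as follows. The function names are made up for illustration, and returning an empty string when both times are empty is my assumption about how a blank Time field should come out, not something the rules above state.

```python
import re

def course_number(listing):
    # dept and number joined with exactly one space
    return listing["dept"] + " " + listing["number"]

def strip_ampm(t):
    # drop whitespace and an am/pm designation, in any letter case
    return re.sub(r"\s|[ap]m", "", t, flags=re.IGNORECASE)

def time_field(cls):
    start = strip_ampm(cls["starttime"])
    end = strip_ampm(cls["endtime"])
    if not start and not end:
        return ""               # assumption: no times means a blank field
    return start + "-" + end

# Values taken from the COS 333 record in the excerpt above.
print(course_number({"dept": "COS", "number": "333"}))                # COS 333
print(time_field({"starttime": "11:00 am", "endtime": "12:20 pm"}))   # 11:00-12:20
```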
This was perhaps the simplest possible example. Many courses are cross-listed, have multiple professors, and/or have multiple sections, including lectures, classes, precepts, labs, and more. Many courses have no distribution area, no location, or even no sections at all. For cross-listed courses, you must join the CourseNumber result fields with exactly one slash (/). For courses with multiple professors, you must join the Professors result fields with exactly one slash (/). For courses with more than one section, you should match only on the section with the largest enrollment, and also print only the information for this section, ignoring all others; and in the event of a tie for section with largest enrollment, use the first tied entry (that is, the one appearing first in the list). The intent is to grab the main lecture of the course: this is pretty sketchy and doesn't always do exactly that -- for instance, it only shows one of the two approximately-equal lecture sections in COS126 -- but is sufficient for our purposes. For courses with no data in a field, that field shall be left blank in output. Here are some additional courses demonstrating these complications:
WWS 315/POL 393 SA MW 10:00-10:50 Grand Strategy Aaron L. Friedberg/G. John Ikenberry Robertson Hall 016
ELE 498 Senior Independent Work Paul R. Prucnal
ENG 563/MOD 527 M 1:30-4:20 Poetics - 19thC English and American Poetry: New Tools, New Archives Meredith A. Martin/Meredith L. McGill
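The joining and section-selection rules can be sketched like this. The helper names are hypothetical, the WWS/POL listings are reconstructed from the output line above rather than copied from the JSON, and the sketch assumes that enroll always holds a decimal string.

```python
def course_numbers(listings):
    # cross-listings joined with exactly one slash
    return "/".join(l["dept"] + " " + l["number"] for l in listings)

def professors(profs):
    # multiple professors joined with exactly one slash
    return "/".join(p["name"] for p in profs)

def main_section(classes):
    # Largest enrollment wins; the strict > keeps the first entry on a tie.
    # Returns None for a course with no sections at all.
    best = None
    for c in classes:
        if best is None or int(c["enroll"]) > int(best["enroll"]):
            best = c
    return best

# Listings reconstructed from the WWS 315/POL 393 output line (illustrative).
print(course_numbers([{"dept": "WWS", "number": "315"},
                      {"dept": "POL", "number": "393"}]))   # WWS 315/POL 393

# Sections and enrollments from the HUM 218 excerpt, other fields dropped.
hum = [{"section": "L01", "enroll": "41"},
       {"section": "C01", "enroll": "15"},
       {"section": "C02", "enroll": "14"}]
print(main_section(hum)["section"])                         # L01
```

Comparing with int() matters here: as strings, "9" would sort above "41".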
Although the ordering of the queries does not matter, the order of evaluation of a given query is important. A query must be evaluated in this order:
There are certainly queries that do not fit within any of these categories; for example: one-letter queries, two-letter queries that do not match a distribution area, and three-letter queries that contain a mix of letters and digits. This is fine -- queries of these types simply should not match any courses, and thus should return no results.
All queries must be case-insensitive. For example: "mUsIc" matches "music", "Music", and "MUSIC"; and "QR", "qr", "Qr", and "qR" must all be accepted as the quantitative reasoning area. Again, you must use the Python regular expression module re for this.
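A minimal sketch of case-insensitive matching with re follows. The two helpers are hypothetical, and treating the query as literal text (via re.escape) is my assumption; this shows only the case handling, not the category rules.

```python
import re

def contains(query, text):
    # True if query occurs anywhere in text, ignoring case.
    # re.escape makes characters such as '.' or '+' match literally.
    return re.search(re.escape(query), text, re.IGNORECASE) is not None

def same_word(query, word):
    # True if query equals word as a whole, ignoring case.
    return re.match(re.escape(query) + r"$", word, re.IGNORECASE) is not None

for text in ["music", "Music", "MUSIC"]:
    print(contains("mUsIc", text))      # True each time

for q in ["QR", "qr", "Qr", "qR"]:
    print(same_word(q, "QR"))           # True each time
```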
Your program must be called reg.py. It implements a simple web server using this template:
import SocketServer
import SimpleHTTPServer

class Reply(SimpleHTTPServer.SimpleHTTPRequestHandler):
    def do_GET(self):
        # query arrives in self.path; return anything, e.g.,
        self.wfile.write("query was %s\n" % self.path)

def main():
    # read and parse courses.json
    SocketServer.ForkingTCPServer(('', 8080), Reply).serve_forever()

main()
With the server running on your own computer, you can run tests with the curl command, or by typing that same URL into a browser on your computer:
$ curl localhost:8080/stn/TILGH/frs
FRS 144 STN T 1:30-4:20 How the Tabby Cat Got Her Stripes Shirley M. Tilghman Blair Hall T5
$
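For a request like the one above, the handler sees the path /stn/TILGH/frs in self.path. One way to turn that into partial queries and combine them with a logical AND is sketched below. The names are made up, dropping empty components (from the leading slash or a trailing one) is my assumption, and the matcher passed in is a stand-in for whatever per-query test you write. Percent-encoded characters in the URL are not handled here.

```python
def partial_queries(path):
    # "/stn/TILGH/frs" -> ["stn", "tilgh", "frs"]; empty components dropped
    return [part.lower() for part in path.split("/") if part]

def matches_all(queries, record, matcher):
    # Logical AND: every partial query must match the record.
    # matcher(query, record) is supplied by the caller.
    return all(matcher(q, record) for q in queries)

print(partial_queries("/stn/TILGH/frs"))    # ['stn', 'tilgh', 'frs']

# A deliberately crude stand-in matcher, only to exercise the AND logic:
# it checks whether the query appears in a flat, lowercased summary line.
def crude(query, record):
    return query in record.lower()

line = "FRS 144 STN T 1:30-4:20 How the Tabby Cat Got Her Stripes Shirley M. Tilghman Blair Hall T5"
print(matches_all(partial_queries("/stn/TILGH/frs"), line, crude))   # True
print(matches_all(partial_queries("/stn/TILGH/cos"), line, crude))   # False
```

Because all() is used, the order of the partial queries has no effect on the result.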
Some systems (e.g., nobel) won't let you open port 8080, but will let you open ports with high numbers, say 30000-60000. Accordingly, your reg.py must accept an optional command line argument for the port number. You can set the default to whatever you like, but we will use the optional argument so we can test without editing your code.
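Handling the optional port argument takes only a few lines. This is a sketch; the function name and the default of 8080 are just choices made for the example.

```python
import sys

DEFAULT_PORT = 8080     # any default is acceptable

def port_from_args(argv):
    # argv[0] is the program name; argv[1], if present, is the port
    if len(argv) > 1:
        return int(argv[1])
    return DEFAULT_PORT

print(port_from_args(["reg.py"]))             # 8080
print(port_from_args(["reg.py", "31415"]))    # 31415
```

In reg.py you would call port_from_args(sys.argv) and pass the result to the server constructor in place of the literal 8080.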
The testing component of this assignment is similar to that from Assignments 1 and 3: create at least 25 high quality test queries in files named test00, test01, ... , test24, etc. Each file should contain the query on the first line, followed by the matching results, in the same format as the server response. For example:
stn/TILGH/frs
FRS 144 STN T 1:30-4:20 How the Tabby Cat Got Her Stripes Shirley M. Tilghman Blair Hall T5
These queries should thoroughly explore critical boundary conditions and other potential trouble spots of the specification. Again, you might find it helpful to read Chapter 6 of The Practice of Programming on testing. It is typically preferable to return a few interesting courses rather than a large glut of results. Please do not submit any queries that would result in more than 100 courses in the results.
The version of Python installed on the CS servers is shown below. Note that there are significant incompatibilities between Python 2.* and Python 3. We would strongly prefer that you use Python 2.*.
tux:~$ python
Python 2.7.5 (default, Jun 24 2015, 01:06:47)
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
The courses.json file will be present in the directory where your reg.py is (and from where it is executed). You may hardcode the filename. You should not make any assumptions about the content -- we could give you a courses.json with last year's classes, or with only COS333.
You will have to write the code in reg.py that reads and parses courses.json (importing the module json will provide you with some useful functions to do this), accepts search requests, and sends responses. Start with the server template and add code to read and parse the JSON file. Then parse each user query, search the JSON for matching items, then format and return the selected ones. Repeating the advice from all the previous assignments: keep it simple. This program does not need to be fast or the slightest bit clever. As is true with all programming, trying to be fast and/or clever is often a recipe for disaster. My version has only about 110 non-blank lines, so if your solution is a lot longer, you may be off on the wrong track or not working as surgically as you could.
Talking over the precise meaning of the specification with friends is strongly encouraged, as always, but in particular with this assignment due to its large number of potential corner cases. Use Piazza to garner official interpretations (which may just be my opinions, of course).
The JSON file contains a modest number of accented or otherwise non-ASCII characters; for example, FRE 367 is a course about Camus taught by Professor André Benhaïm. These characters are rendered in text as Unicode escapes, and may not print cleanly without special effort (though my implementation seems to handle them), but you do not need to worry about doing anything special with them. We will not focus on these special characters in our testing.
When you are finished, submit your source and tests tarball using the CS Dropbox link dropbox.cs.princeton.edu/COS333_S2016/Four.
Please create your tests tarball using the same command as from Assignment 3:
tar cf tests.tar test??
We will give you some indication that you have not drastically misinterpreted the specification, by running some tests of our own when you submit. These are not a complete test. Do your own testing; don't rely on our tests to validate your code.
We will test your code primarily by running the same queries through your version and our version and comparing the results; we will sort the output lines and ignore empty lines and whitespace differences, so don't get too hung up on minutiae of line formatting aside from the minimal requirements mentioned above. As with prior assignments, we will test your tests for reasonable coverage of expected simple and corner cases.
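If you want to compare your own output against expected results in the same spirit, something like the following sketch may help. It is not our grading script: the function name is made up, and collapsing each run of whitespace to a single space is only my guess at a reasonable reading of "ignore whitespace differences".

```python
def normalize(output):
    # Collapse runs of whitespace within each line, drop empty lines,
    # and sort so that the order of result lines does not matter.
    lines = [" ".join(line.split()) for line in output.splitlines()]
    return sorted(line for line in lines if line)

mine = "ELE 498   Senior Independent Work Paul R. Prucnal\n\nCOS 333 TTh 11:00-12:20\n"
theirs = "COS 333  TTh  11:00-12:20\nELE 498 Senior Independent Work Paul R. Prucnal\n"
print(normalize(mine) == normalize(theirs))     # True
```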