COS 425, Fall 2006 - Problem Set 4, Part I:  querying XML

(postponed from Problem Set 3)

Due at 1:30pm, Monday Nov. 13, 2006.


Collaboration Policy

You may discuss problems with other students in the class. However, each student must write up his or her own solution to each problem independently. That is, while you may formulate the solutions to problems in collaboration with classmates, you must be able to articulate the solutions on your own.

Late Penalties



The eXist XML database is free, open-source software that provides an XQuery processing engine for XML documents.  We are going to use a very small portion of eXist: an application that allows one to write XQuery queries on a database consisting of 3 of Shakespeare's plays captured as XML documents.  Everyone will run under one guest account using a Web interface.  You will express queries and either print the results or save them to a file.  The eXist Web interface does not save state beyond your session, and everyone will be using the same account, so you should save intermediate work by cutting and pasting it into a file.  You can then cut and paste it back into the Web interface when you resume your work.

You can use eXist either by accessing a local copy running on the Computer Science Department studentdb server, http://studentdb.cs.princeton.edu:8080/exist/index.xml, or by using the eXist Project public Web site at http://exist-db.org/.  Either will display a home page that looks essentially the same.  Access to the local copy is restricted to users in the cs.princeton.edu domain.  If you are a CS concentrator or CS graduate student, you should already have a CS account.  If you are not a member of the CS department, you should have received email giving you a temporary CS account for the duration of the semester.  The advantages of using the local copy are that it will not be changed during the rest of the semester except to fix problems (as best we can) and that response may be faster.  The advantages of the eXist Project site are that you don't need to be in the cs.princeton.edu domain and that bugs may be fixed there before we know about them.

The assignment consists of the following steps.  Note that you do not need to submit anything until Step 5:
  1. Follow one of the links above to the eXist home page.
  2. Looking down the left-hand side menu, find "Examples" and click on "XQuery Sandbox" under "Examples".  Note that Sandbox requires JavaScript to be enabled.  You will reach a Web page that looks like this (pdf file), except that color highlighting does not show in the pdf file.  Also, the drop-down choice box labeled "Paste Saved Query" may or may not have a query description showing.  We are not using these "saved queries" in this assignment, but you may want to look at them for more XQuery examples.  Sandbox allows you to write XQuery queries on several pre-loaded XML databases.  We will use only the Shakespeare database, which consists of 3 plays:  Hamlet, Macbeth, and Romeo and Juliet.  Each play is a separate XML document, but the Shakespeare database consists of all 3 plays (each play is a child of the database root).  The default database for queries submitted to Sandbox is the Shakespeare database.
  3. Type (or cut and paste) the following XML query into the text box below the "Paste Saved Query" drop-down box.   Note that eXist  XQuery is case sensitive.
for $play in /PLAY
where $play/TITLE &= "Hamlet"
return $play

Then click the SEND button.  You will see in the result window the full XML text of Hamlet.  The result may take on the order of a minute to appear if you are using the eXist Project public Web site.  Here (pdf) is what the beginning of the result looks like.  This gives you an idea of how the plays are tagged.  Note that "&=" denotes containment and matches full words.  (Try replacing "Hamlet" with "Ham"; then replace it with "Ham*", where "*" matches any string.)  As you can see from the text of the TITLE tag, Hamlet is not the complete name of the play.
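The experiment suggested in the note above can also be run as a single query.  What follows is only a sketch: it assumes that Sandbox accepts a parenthesized sequence of element constructors, and the wrapper element names partial-word and wildcard are hypothetical, chosen here just to label the two results.  It returns the TITLE elements rather than whole plays so that the output stays short.  Like the query in Step 3, it relies on the "&=" operator, which is an eXist extension and not part of standard XQuery.

```xquery
(
  (: titles matched when the search term is only part of a word :)
  <partial-word>{
    for $play in /PLAY
    where $play/TITLE &= "Ham"
    return $play/TITLE
  }</partial-word>,
  (: titles matched when "*" stands for any string after "Ham" :)
  <wildcard>{
    for $play in /PLAY
    where $play/TITLE &= "Ham*"
    return $play/TITLE
  }</wildcard>
)
```

Comparing the contents of the two wrapper elements shows how full-word matching differs from wildcard matching.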
  4. Type (or cut and paste) the following XML query into the text box and send it:
for $speech in /PLAY//SPEECH
where $speech/LINE &= 'war'
return $speech

(Note that the local site and the eXist Project site may return the results in different order;  this is because the 3 plays are in different orders in the different copies of the database.)
  5. Modify the query of Step 4 to return only the lines containing 'war'.  (Hint: recall that text() returns the text content of a tag.)  Turn in the query and the result of the query as part of the submission of this assignment.  You may print the Sandbox page showing the query and result.  If you wish to submit electronically, you can either cut and paste the query and result into a text file or print the Sandbox page to a pdf file (if you have Adobe Acrobat).
  6. Type (or cut and paste) the following XML query into the text box and send it:
for $ghostspeech in /PLAY [TITLE &= 'Hamlet']//SPEECH [SPEAKER = 'Ghost']
let $ghostlines := $ghostspeech/LINE
return
<GhostLines total_number="{count($ghostlines)}">
{$ghostlines}
</GhostLines>

Describe the query in English.  Why is "let $ghostlines := $ghostspeech/LINE" used?  Submit your answers to these questions; do not submit the result of the query.
  7. Type (or cut and paste) the following XML query into the text box and send it:
for $ghostspeech in /PLAY [TITLE &= 'Hamlet']//SPEECH [SPEAKER = 'Ghost']
let $ghostlines := $ghostspeech/LINE
return
<GhostLines total_number="{count($ghostlines)}">
<FIRSTLINE>{$ghostspeech/LINE[1]/text()}</FIRSTLINE>
</GhostLines>

Describe the query in English.  Why is "$ghostspeech/LINE[1]/text()" used in the return for this query?  Submit your answers to these questions; do not submit the result of the query.
Note that TAGNAME[n] is a general form for referring to the nth occurrence of  a child element with tag TAGNAME  in the order the children appear in the document.
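The TAGNAME[n] form can be tried in isolation on a small inline fragment.  The sketch below uses only standard XQuery, so it does not depend on the Shakespeare database; the SPEECH fragment and the RESULT, SECOND, LAST and COUNT element names are made up for illustration and do not come from the plays.

```xquery
(: a made-up fragment, not taken from the Shakespeare database :)
let $speech :=
  <SPEECH>
    <SPEAKER>Example</SPEAKER>
    <LINE>first line</LINE>
    <LINE>second line</LINE>
    <LINE>third line</LINE>
  </SPEECH>
return
  <RESULT>
    <SECOND>{$speech/LINE[2]/text()}</SECOND>
    <LAST>{$speech/LINE[last()]/text()}</LAST>
    <COUNT>{count($speech/LINE)}</COUNT>
  </RESULT>
```

Under standard XQuery semantics, SECOND should hold "second line", LAST should hold "third line", and COUNT should hold 3.  The SPEAKER child does not affect the numbering, because the position is counted among the LINE children only.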
  8. Type (or cut and paste) the following XML query into the text box and send it:
let $allgspeech := /PLAY [TITLE &= 'Hamlet']//SPEECH [SPEAKER = 'Ghost']
let $ghostlines := $allgspeech/LINE
return
<GhostLines total_number="{count($ghostlines)}">
{for $ghostspeech in $allgspeech
return
<FIRSTLINE>{$ghostspeech/LINE[1]/text()}</FIRSTLINE>}
</GhostLines>

How do the results of this query differ from those in Step 7?  Why is "$allgspeech" used?  Why is a nested query used in the return portion?  Submit your answers to these questions; do not submit the result of the query.
  9. Write an XQuery query that returns the titles of the plays that contain the word "fairies" in one or more of the lines of the play.  Turn in the query and the result of the query as part of the submission of this assignment.



NOTE:  eXist can be used for course projects.  You may either have an account on the local eXist server (which you must use from cs.princeton.edu) or download a copy of eXist for yourself from http://exist-db.org/.  There is a Java application programming interface (API).  I must warn you that neither I nor the technical staff of the CS department has tested the API, and the technical staff can provide only minimal support (e.g. they will reinstall eXist on studentdb if you break it, but won't fix it).  eXist has proven to be very fragile, hence the necessity to postpone this exercise from Problem Set 3.  If you enjoy exploring new APIs and installing new software, a project using eXist may be exactly to your liking.  A large component of the project could be understanding how to use eXist in an application, i.e. the application itself could be fairly small.  However, we must discuss contingency plans should eXist be less stable than advertised.