COS 425, Fall 2006 - Problem Set 4, Part I: Querying XML
(postponed from Problem Set 3)
Due at 1:30pm, Monday Nov. 13, 2006.
Collaboration Policy
You may discuss problems with other students in the class. However, each
student must write up his or her own solution to each problem independently.
That is, while you may formulate the solutions to problems in collaboration
with classmates, you must be able to articulate the solutions on your own.
Late Penalties
- A penalty of 10% of the earned score if submitted after class but by 5pm
on the due date.
- A penalty of 30% of the earned score if submitted by 5pm on Wednesday
11/15/06.
- No credit if submitted after the 30% penalty deadline.
eXist is free, open-source XML database software that provides an XQuery
processing engine for XML documents. We are going to use a very small
portion of eXist: an application that allows one to write XQuery queries on
a database consisting of 3 of Shakespeare's plays captured as XML documents.
Everyone will run under one guest account using a Web interface. You will
express queries and either print the results or save them to a file. Because
everyone will be using the same account, and the eXist Web interface will
not save state beyond your session, you should save intermediate work by
cutting and pasting it to a file. You can then paste it back into the Web
interface when you resume your work.
You can use eXist either by accessing a local copy running on the
Computer Science Department studentdb server, http://studentdb.cs.princeton.edu:8080/exist/index.xml,
or by using the eXist Project public Web site at http://exist-db.org/.
Either will display a home page that looks essentially the same.
Access to the local copy is restricted to users in the cs.princeton.edu
domain. If you are a CS concentrator or CS graduate student, you
should already have a CS account. If you are not a member of the
CS department, you should have received email giving you a temporary CS
account for the duration of the semester. The advantages of using the
local copy are that it will not be changed during the rest of the semester,
except to fix problems (as best we can), and that response may be faster.
The advantages of the eXist Project public site are that you don't need to
be in the cs.princeton.edu domain and that bugs may be fixed before we know
about them.
The assignment consists of the following steps. Note that you do not need to submit
anything until Step 5:
- Follow one of the links above to the eXist home page.
- In the left-hand menu, find "Examples" and click on "XQuery Sandbox"
under it. Note that Sandbox requires JavaScript to be enabled. You will
reach a Web page that looks like this (pdf file), except that color
highlighting does not show in the pdf file. Also, the drop-down choice box
labeled "Paste Saved Query" may or may not have a query description showing.
We are not using these "saved queries" in this assignment, but you may want
to look at them for more XQuery examples. Sandbox allows you to write XQuery
queries on several pre-loaded XML databases. We will use only the
Shakespeare database, which consists of 3 plays: Hamlet, Macbeth and Romeo
and Juliet. Each play is a separate XML document, but the Shakespeare
database consists of all 3 plays (each play is a child of the database
root). The default database for queries submitted to Sandbox is the
Shakespeare database.
- Type (or cut and paste) the following XQuery query into the text box below
the "Paste Saved Query" drop-down box. Note that XQuery in eXist is case
sensitive.
for $play in /PLAY
where $play/TITLE &= "Hamlet"
return $play
Then click the SEND button. You will see the full XML text of Hamlet in the
result window. The result may take on the order of a minute to appear if you
are using the eXist Project public Web site. Here (pdf) is what the
beginning of the result looks like. This gives you an idea of how the plays
are tagged. Note that "&=" denotes containment and matches full words. (Try
replacing "Hamlet" with "Ham"; then replace it with "Ham*", where "*"
matches any string.) As you can see from the text of the TITLE tag, "Hamlet"
is not the complete name of the play.
- Type (or cut and paste) the following XQuery query into the text box and
send it:
for $speech in /PLAY//SPEECH
where $speech/LINE &= 'war'
return $speech
(Note that the local site and the eXist Project site may return the results
in a different order; this is because the 3 plays are stored in a different
order in each copy of the database.)
- Modify the query of Step 4 to return only the lines containing 'war'.
(Hint: recall that text() returns the text content of an element.) Turn in
the query and the result of the query as part of the submission of this
assignment. You may print the Sandbox page showing the query and result. If
you wish to submit electronically, you can either cut and paste the query
and result into a text file or print the Sandbox page to a pdf file (if you
have Adobe Acrobat).
- Type (or cut and paste) the following XQuery query into the text box and
send it:
for $ghostspeech in /PLAY[TITLE &= 'Hamlet']//SPEECH[SPEAKER = 'Ghost']
let $ghostlines := $ghostspeech/LINE
return
  <GhostLines total_number="{count($ghostlines)}">
    {$ghostlines}
  </GhostLines>
Describe the query in English. Why is "let $ghostlines := $ghostspeech/LINE"
used? Submit your answers to these questions; do not submit the result of
the query.
- Type (or cut and paste) the following XQuery query into the text box and
send it:
for $ghostspeech in /PLAY[TITLE &= 'Hamlet']//SPEECH[SPEAKER = 'Ghost']
let $ghostlines := $ghostspeech/LINE
return
  <GhostLines total_number="{count($ghostlines)}">
    <FIRSTLINE>{$ghostspeech/LINE[1]/text()}</FIRSTLINE>
  </GhostLines>
Describe the query in English. Why is "$ghostspeech/LINE[1]/text()" used in
the return for this query? Submit your answers to these questions; do not
submit the result of the query.
Note that TAGNAME[n] is a general form for referring to the nth occurrence
of a child element with tag TAGNAME, in the order the children appear in the
document.
- Type (or cut and paste) the following XQuery query into the text box and
send it:
let $allgspeech := /PLAY[TITLE &= 'Hamlet']//SPEECH[SPEAKER = 'Ghost']
let $ghostlines := $allgspeech/LINE
return
  <GhostLines total_number="{count($ghostlines)}">
    {for $ghostspeech in $allgspeech
     return <FIRSTLINE>{$ghostspeech/LINE[1]/text()}</FIRSTLINE>}
  </GhostLines>
How do the results of this query differ from those in Step 7? Why is
"$allgspeech" used? Why is a nested query used in the return portion? Submit
your answers to these questions; do not submit the result of the query.
- Write an XQuery query that returns the titles of the plays that contain
the word "fairies" in one or more of their lines. Turn in the query and the
result of the query as part of the submission of this assignment.
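The TAGNAME[n] and text() notation used in the steps above can also be tried
away from the Shakespeare database. The following is a minimal sketch, not
part of the assignment and not to be submitted. It assumes a made-up SPEECH
fragment built directly inside the query; the speaker name, the line texts
and the Result, First, Last and Count element names are invented
placeholders, not taken from the plays. Because it uses only standard
XQuery, it should run in any XQuery processor, and presumably in Sandbox as
well.

```xquery
(: Toy data: a hypothetical SPEECH element, not from the Shakespeare database :)
let $speech :=
  <SPEECH>
    <SPEAKER>Narrator</SPEAKER>
    <LINE>first toy line</LINE>
    <LINE>second toy line</LINE>
    <LINE>third toy line</LINE>
  </SPEECH>
return
  <Result>
    {(: LINE[1] selects the first LINE child; text() extracts its text content :)}
    <First>{$speech/LINE[1]/text()}</First>
    {(: last() gives the position of the final LINE child :)}
    <Last>{$speech/LINE[last()]/text()}</Last>
    {(: count() returns the number of LINE children :)}
    <Count>{count($speech/LINE)}</Count>
  </Result>
```

Under these assumptions, First should hold the text of the first LINE child,
Last the text of the third, and Count the value 3.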
NOTE: eXist can be used for course projects. You may either have an account
on the local eXist server (which you must use from cs.princeton.edu) or
download a copy of eXist for yourself from http://exist-db.org/.
There is a Java application programming interface (API). I must warn you
that neither I nor the technical staff of the CS department has tested the
API, and the technical staff can provide only minimal support (e.g. they
will reinstall it on studentdb if you break it, but won't fix it). eXist has
proven to be very fragile, hence the necessity to postpone this exercise
from Problem Set 3. If you enjoy exploring new APIs and installing new
software, a project using eXist may be exactly to your liking. A large
component of the project could be understanding how to use eXist in an
application, i.e. the application itself could be fairly small. However, we
must discuss contingency plans should eXist be less stable than advertised.