Updated Mon Feb 13 15:48:19 EST 2023

Assignment 3: Exploratory Data Analysis with Awk and Friends

Play it [again], Sam

This assignment is meant to give you practice with the Unix commands that you learned in the third class, particularly Awk. It uses the file 10k.csv, which contains metadata on nearly 10,000 sonnets from a much wider range of authors and dates, including some living authors. Again, we have trimmed the original data from Mark Andrew Algee-Hewitt so we can focus on the most interesting attributes. The field separator is a slash (/); the original used a comma.

Repeat the metadata experiments that you did on the 18th Century sonnet metadata in 18.csv with the much larger collection in 10k.csv. Be adventurous. What can you learn? What kinds of oddities, anomalies, errors, etc., did you find?

Advice: it's a LOT easier to copy commands from the studio file and paste them into a terminal window than to type them yourself. If something needs to be changed within a command, copy, paste, edit, then push return.

Verify that the data is sensible

Does every record have the same number of fields?
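
One possible way to check, offered as a sketch and not the only answer: print NF (Awk's count of fields in the current record) for every line and tally the results. The file sample.csv below is a made-up stand-in so the command can be run anywhere; substitute 10k.csv for the real thing.

```shell
# Made-up sample with '/' as the field separator (not real data from 10k.csv).
printf 'a/b/c\nd/e\nf/g/h\n' > sample.csv

# Print the number of fields in each record, then count how often each value occurs.
awk -F/ '{ print NF }' sample.csv | sort | uniq -c | sort -n
```

If the tally has more than one line, some records have a different number of fields and deserve a closer look.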

How many lines are in each sonnet?  What is the range?

Check the authors

Who are the ten most prolific authors?
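
A sketch of the usual counting idiom, assuming (hypothetically) that the author's name is in field 1; check the real layout of 10k.csv and change $1 to match. The sample file is made up.

```shell
# Made-up sample: author in field 1 (an assumption), a placeholder title in field 2.
printf 'Smith/T1\nJones/T2\nSmith/T3\n' > sample.csv

# Extract the author field, count occurrences, and list the most frequent first.
awk -F/ '{ print $1 }' sample.csv | sort | uniq -c | sort -nr | head -10
```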

Are the records for any given author contiguous?
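
One way to approach this, again assuming the author is in field 1 (a guess about the layout): uniq without a preceding sort collapses only adjacent duplicates, so an author whose records are split into several runs survives more than once. Sorting the collapsed list and asking uniq -d for duplicates then names those authors. The sample data is made up.

```shell
# Made-up sample: Smith and Jones each appear in two separate runs, Brown in one.
printf 'Smith/A\nJones/B\nSmith/C\nJones/D\nJones/E\nBrown/F\n' > sample.csv

# Collapse adjacent repeats, then report authors that still occur more than once.
awk -F/ '{ print $1 }' sample.csv | uniq | sort | uniq -d
```

No output would mean that every author's records are contiguous.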

What do you see about the gender of authors?

What's the longest sonnet?

What's the longest title?
	How many different ways could you compute this, given what you know?
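
Two of the possible ways, sketched on made-up data and assuming (hypothetically) that the title is in field 2: keep a running maximum inside Awk, or print every length and let sort find the largest.

```shell
# Made-up sample: author in field 1, title in field 2 (both positions are assumptions).
printf 'Smith/Short\nJones/A Much Longer Title\nBrown/Mid Title\n' > sample.csv

# Way 1: remember the longest title seen so far and print it at the end.
awk -F/ 'length($2) > max { max = length($2); title = $2 } END { print max, title }' sample.csv

# Way 2: print each length with its title, sort numerically, and take the last line.
awk -F/ '{ print length($2), $2 }' sample.csv | sort -n | tail -1
```

The two can disagree when several titles tie for longest: the first keeps the earliest such title, while the second reports whichever one sort happens to place last.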

Check the dates

What's the story about the ages of authors? Answer these questions by computing with Awk and other tools.
What's the earliest birth date?

What's the latest death date?

What fraction of the entries have unambiguous complete dates?
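
A minimal sketch, under two loudly stated assumptions: that birth and death are in fields 2 and 3, and that "unambiguous and complete" means exactly four digits. The real file may encode dates differently, so look at the data before trusting either assumption. The sample is made up.

```shell
# Made-up sample: author/birth/death; one uncertain birth year and one empty death year.
printf 'Smith/1700/1770\nJones/1650?/1701\nBrown/1950/\nGreen/1564/1616\n' > sample.csv

# Count records in which both date fields are exactly four digits, then report the fraction.
awk -F/ '
  $2 ~ /^[0-9][0-9][0-9][0-9]$/ && $3 ~ /^[0-9][0-9][0-9][0-9]$/ { ok++ }
  END { printf "%d of %d (%.2f)\n", ok, NR, ok / NR }
' sample.csv
```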

Compute the ages of all authors, including living ones, to produce a sorted list
using the usual sort | uniq -c | sort -n.
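
A sketch of the age computation on made-up data. It assumes birth and death years are four-digit numbers in fields 2 and 3, and that an empty death field marks a living author; none of this is confirmed for 10k.csv, so adjust after inspecting the file. The current year comes from date +%Y and is passed to Awk with -v.

```shell
# Made-up sample: author/birth/death; Brown has no death year and is treated as living.
printf 'Smith/1700/1770\nJones/1650?/1701\nBrown/1950/\nGreen/1564/1616\n' > sample.csv

awk -F/ -v now="$(date +%Y)" '
  $2 !~ /^[0-9][0-9][0-9][0-9]$/ { next }                  # skip unusable birth years
  $3 == ""                       { print now - $2; next }  # assumption: empty death field means living
  $3 ~ /^[0-9][0-9][0-9][0-9]$/  { print $3 - $2 }         # age at death
' sample.csv | sort | uniq -c | sort -n
```

Since each record describes a sonnet, a prolific author is counted once per sonnet here; reduce the input to one line per author first if you want each person counted once.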

Who was the longest-lived author?

How many living authors are there?

Who is the oldest living author?

Submission

Use the submission link on the Google Drive. Collect your answers to the questions above in a TEXT file called x_abc_asgn3.txt where x is the first letter of your last name, and abc is your first name; thus I would submit k_brian_asgn3.txt.

You might find it easiest to either copy and paste the output from running the commands, or redirect their output into a file that you then include. These are good ways to avoid transcription errors that might happen if you retype anything.

Note: submit a .txt file, please, not a Word file.
Use your favorite (or least unfavorite) text editor,
like nano.  I'm not worried if the upload is mangled,
but I do want you to become comfortable with a real text
editor.  

Challenge problem

Compare the entries for William Shakespeare in 10k.csv with the sonnets in shakespeare.zip. Isolate the first lines of the sonnets from the latter and from the former into two files, say v1.txt and v2.txt, using the tools you have learned, or anything else.

How do the lines compare? How would you bring together lines that are the same except for spelling? What does the diff command tell you?

What, if anything, did you find interesting or surprising?