Updated Mon Feb 13 15:48:19 EST 2023
This assignment gives you practice with the Unix commands that you learned in the third class, particularly Awk. It uses the file 10k.csv, which contains metadata on nearly 10,000 sonnets from a much wider range of authors and dates than the 18th Century collection, including some living authors. Again, we have trimmed the original data from Mark Andrew Algee-Hewitt so we can focus on the most interesting attributes. The field separator is a slash; the original used a comma.
Repeat the metadata experiments that you did on the 18th Century sonnet metadata in 18.csv with the much larger collection in 10k.csv. Be adventurous. What can you learn? What kinds of oddities, anomalies, errors, etc., did you find?
Advice: it's a LOT easier to copy commands from the studio file and paste them into a terminal window than to type them yourself. If something needs to be changed within a command, copy, paste, edit, then push return.
Does every record have the same number of fields? How many lines are in each sonnet? What is the range?
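One possible way to start on the field-count question is sketched below. The function name field_counts is made up for illustration; the only thing it relies on from this page is that the separator is a slash. If every record has the same number of fields, the output is a single line.

```shell
# field_counts FILE
# Print one line per distinct field count: how many records have it,
# then the field count itself. (Hypothetical helper, not from the studio file.)
field_counts() {
    awk -F/ '{ print NF }' "$1" | sort -n | uniq -c | sort -n
}

# Run it on the assignment file if it is in the current directory.
if [ -f 10k.csv ]; then
    field_counts 10k.csv
fi
```

Any field count that shows up only a handful of times is a good place to look for stray slashes or damaged records.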
Who are the ten most prolific authors? Are the records for any given author contiguous? What do you see about the gender of authors? What's the longest sonnet? What's the longest title? How many different ways could you compute this, given what you know?
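Here is a rough sketch of two helpers for questions like these. Both names (top_values, longest_field) are invented for this example, and the field number is an argument because this page does not say which column holds the author or the title; check the file first, for instance with head, and pass the right number.

```shell
# top_values FILE FIELD [N]
# Show the N most frequent values of one field (default 10), most frequent first.
top_values() {
    awk -F/ -v f="$2" '{ print $f }' "$1" |
        sort | uniq -c | sort -nr | head -n "${3:-10}"
}

# longest_field FILE FIELD
# Print the length of the longest value in one field, then the whole record.
# If several records tie, the first one seen is printed.
longest_field() {
    awk -F/ -v f="$2" '
        length($f) > max { max = length($f); rec = $0 }
        END              { print max, rec }
    ' "$1"
}
```

If the author turned out to be field 1, top_values 10k.csv 1 would list the ten most prolific authors. The longest title could also be found by printing length and title for every record and sorting numerically, which is one of the "different ways" to compare.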
What's the earliest birth date? What's the latest death date? What fraction of the entries have unambiguous complete dates? Compute the ages of all authors, including living ones, to produce a sorted list using the usual sort | uniq -c | sort -n. Who was the longest-lived author? How many living authors are there? Who is the oldest living author?
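The age computation might be sketched as follows. This rests on several assumptions that you should check against the real data: the function name ages is invented, the field numbers are arguments because the layout is not given here, a date counts as "unambiguous and complete" only if it is exactly four digits, and a living author is assumed to have an empty death field. If 10k.csv marks living authors some other way, change the second pattern.

```shell
# ages FILE AUTHOR_FIELD BIRTH_FIELD DEATH_FIELD
# Print "count age author" sorted by age, where count is the number of
# records for that author. Records with a missing or ambiguous year are skipped.
ages() {
    awk -F/ -v a="$2" -v b="$3" -v d="$4" -v now="$(date +%Y)" '
        $b !~ /^[0-9][0-9][0-9][0-9]$/ { next }   # birth year not usable
        $d == "" { print now - $b, $a, "(living)"; next }  # ASSUMPTION: empty death = living
        $d ~ /^[0-9][0-9][0-9][0-9]$/  { print $d - $b, $a }
    ' "$1" | sort | uniq -c | sort -k2,2n
}
```

The last line of the output is then the longest-lived author, and piping the output through grep living narrows it to living authors only. Negative or implausibly large ages are worth a closer look, since they usually point at errors in the dates.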
You might find it easiest to either copy and paste the output from running the commands, or redirect their output into a file that you then include. These are good ways to avoid transcription errors that might happen if you retype anything.
Note: submit a .txt file, please, not a Word file. Use your favorite (or least unfavorite) text editor, like nano. I'm not worried if the upload is mangled, but I do want you to become comfortable with a real text editor.
How do the lines compare? How would you bring together lines that are the same except for spelling? What does the diff command tell you?
What if anything did you find interesting or surprising?