Updated Tue Feb 14 19:57:35 EST 2023
This studio is approximately what we will walk through in the third class on February 15. We'll use basic Unix commands to explore metadata about 500 different sonnets, that is, not the sonnets themselves, but data like the title, the author, when the author was born and died, etc. We'll do the in-class part with the 18th Century metadata file 18.csv, then use the much bigger one for assignment 3. By the end, you should be comfortable with some core Unix tools, including Awk, a versatile tool for quick and dirty explorations of data.
You will find it helpful to look at a draft of the first three chapters of The AWK Programming Language, second edition, which can be found in the readings for Week 3. This is a work in progress; any comments you offer will be gratefully accepted.
To get started, download the file 18.csv from Google Drive. This is a slimmed-down version of the original 18thCenturySonnets.csv provided by Mark Andrew Algee-Hewitt, to whom we are indebted. I have preserved 7 of the original 13 fields, and I have replaced the original comma separator by /, which will work with all versions of Awk.
$ mkdir metadata        make a directory to work in
$ cd metadata           change to it

Download the file 18.csv from the Google Drive site, either directly to metadata or to your Downloads directory and then move it from there.
Examination of the sonnet metadata reveals that it's somewhat irregular in spite of a lot of work by its creator. Real data is never clean enough, so part of the exercise in this studio is to identify anomalies and talk about how to clean them up.
We call this "exploratory data analysis", since we don't know what's in the data; poking around is the first step. Let's do some exploration. This is what the first few lines will look like:
author_dob/author_dod/author_fname/author_lname/num_lines/title_main/title_sub
1771/1798/E. H. (Elihu Hubbard)/Smith/8/SONNET I./Sent to Miss , with a Braid of Hair.
1771/1798/E. H. (Elihu Hubbard)/Smith/8/SONNET II./Sent to Mrs. , with a Song.
1771/1798/E. H. (Elihu Hubbard)/Smith/8/SONNET III./Sent to Miss. ,
The first or "header" line is conventionally a list of names for the columns or fields that follow. The information is used by Excel, Pandas (a Python data analysis library that we will see soon), and many other programs.
Start by getting an initial sense of the data:
$ wc 18.csv      how big is this data file?
$ head 18.csv    how does it start? the first line is column headings for all later lines
$ tail 18.csv    how does it end?
In the following Awk programs, -F/ sets the field separator to a slash, NR is the number of the current input line (record), NF is the number of fields in the line, and $n is the n-th field.
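To see these built-in variables in action, here is a tiny self-contained example run on two made-up slash-separated lines (sample data, not the real 18.csv):

```shell
# NR is the current line (record) number, NF the number of fields,
# and $2 the second field of the current line
printf '1771/1798/Smith\n1759/1796/Burns\n' |
awk -F/ '{ print "line " NR " has " NF " fields; field 2 is " $2 }'
# line 1 has 3 fields; field 2 is 1798
# line 2 has 3 fields; field 2 is 1796
```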
Hint: as you are experimenting, it is a LOT easier to copy the commands from this page and paste them into a terminal window than to laboriously re-type them.
$ awk -F/ '{print NF}' 18.csv | sort -u
    check number of fields; it should be the same for all lines
    (an awk program with an action but no pattern)
    try omitting the -F/ argument. might sort -un be better?
$ awk -F/ '{print $5}' 18.csv | sort | uniq -c | sort -n
    how many lines are in each sonnet? any surprises?
    is the second sort command necessary, or just convenient?
$ awk -F/ '$5 > 14' 18.csv
    print the longer ones (an awk program with a pattern but no action)
$ awk -F/ 'NR > 1 && $5 > 14 {print $5, $6, $7}' 18.csv
    exclude the first line: NR is the number of the input record (= line)
    (an awk program with both pattern and action)
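Why might the field count differ from line to line? One common cause: a field's text itself contains the separator character. A sketch on made-up lines (not the real data):

```shell
# the second line's text contains an extra slash, so awk sees one more field
printf 'a/b/c\nd/e either/or/f\n' | awk -F/ '{print NF}' | sort -u
# 3
# 4
```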
$ awk -F/ '{print $3, $4}' 18.csv
    author names
$ awk -F/ '{print $4 ", " $3}' 18.csv
    author names, last name first, and separated by comma
$ awk -F/ '{print $4 ", " $3}' 18.csv | sort | uniq -c | sort -n
    unique author names and how much they wrote. any surprises?
$ awk -F/ '{print $3, $4}' 18.csv | sort | uniq -c | sort -nr | awk '$1 == 1'
    how many people wrote only one? fewer than 10?
$ awk -F/ '$3 ~ " " { print $3, $4 }' 18.csv | uniq
    authors with spaces in their first-name field
    are there any with spaces in their last-name field?
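The sort | uniq -c | sort -n counting idiom shows up repeatedly above; it can be seen in miniature on made-up names:

```shell
# sort groups equal lines together, uniq -c counts each group,
# and the final sort -n orders the counts numerically
printf 'Smith\nJones\nSmith\n' | sort | uniq -c | sort -n
```

The last line of the output is the most frequent name (here, Smith with a count of 2).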
$ awk -F/ '{print $2-$1}' 18.csv | sort | uniq -c | sort -nr
    authors' ages. see anything odd?
$ awk -F/ '{print $1, $2}' 18.csv | sort | uniq
    see anything odd?
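One source of odd "ages": in awk arithmetic, an empty or non-numeric field is treated as 0, so a missing birth date silently produces an enormous age rather than an error. A sketch on made-up lines:

```shell
# an empty or non-numeric first field counts as 0 in $2-$1
printf '1771/1798\n/1798\nunknown/1800\n' | awk -F/ '{print $2-$1}'
# 27
# 1798
# 1800
```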
What is a valid date? What do we do about "invalid" ones?
$ awk -F/ '$1 !~ /^[0-9]+$/ || $2 !~ /^[0-9]+$/ {print $1, $2, $3, $4}' 18.csv
    look for weird dates. how would you show an empty field in the output?
    how would you ignore the header line?
$ awk -F/ '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {print $2-$1, $3, $4}' 18.csv | uniq | sort -n
    compute ages if we have valid data. see anything odd?

If you look carefully at 18.csv, you will see that the author names are not contiguous, which makes one wonder what order the data has been sorted into. How would you detect this anomaly mechanically in a larger dataset (in particular if it were not related to dates)? Hint: try uniq | sort | uniq -c | awk '$1>1' with one of the earlier commands above.
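The hinted pipeline in miniature, on made-up names: uniq alone only collapses adjacent duplicates, so a name that recurs non-contiguously survives it, and sorting then counting exposes the repeat.

```shell
# Smith appears twice after uniq because its occurrences are not adjacent;
# sort | uniq -c then counts it twice, and awk keeps only counts > 1
printf 'Smith\nSmith\nJones\nSmith\n' | uniq | sort | uniq -c | awk '$1 > 1'
```

The output is Smith with a count of 2, flagging it as a non-contiguous repeat.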
$ awk -F/ '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/' 18.csv >temp
    make a temporary file with the valid entries
$ wc temp
    how many are there?
$ awk -F/ '{ ages = ages + $2 - $1 }   # add up all the ages
       END { print "average age =", ages / NR }' <temp
    compute average age. NR is the total number of records
    (an awk program with two pattern-action statements)
$ awk -F/ '$2-$1 > max { max = $2 - $1; fname = $3; lname = $4 }
       END { print "oldest:", fname, lname, " age", max }' <temp
    who is the oldest? who is the youngest?

This might still be counting some people twice because of the sorting issue noted above. How might you fix that?
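The average-age program in miniature, run on two made-up valid records instead of temp: the first pattern-action statement runs once per line to accumulate, and the END action runs once after all input has been read.

```shell
# ages accumulates $2-$1 for every line; END fires after the last line,
# when NR holds the total number of records read
printf '1771/1798/Smith\n1759/1796/Burns\n' |
awk -F/ '{ ages += $2 - $1 } END { print "average age =", ages / NR }'
# average age = 32
```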
Most of our use of Awk in this session is for selecting parts of the data. There's a lot more you can do, so this is just a taste. Awk is a complete programming language with a lot of features for text analysis. This is not an Awk course, so we won't go much further here, but if you do want to learn more, say so.