Updated Tue Feb 14 19:57:35 EST 2023

Studio 3: Exploratory Data Analysis with Awk and Friends

Introduction

This studio is approximately what we will walk through in the third class on February 15. We'll use basic Unix commands to explore metadata about 500 different sonnets, that is, not the sonnets themselves, but data like the title, the author, when the author was born and died, etc. We'll do the in-class part with the 18th Century metadata file 18.csv, then use the much bigger one for assignment 3. By the end, you should be comfortable with some core Unix tools, including Awk, a versatile tool for quick and dirty explorations of data.

You will find it helpful to look at a draft of the first three chapters of The AWK Programming Language, second edition, which can be found in the readings for Week 3. This is a work in progress; any comments you offer will be gratefully accepted.

To get started, download the file 18.csv from Google Drive. This is a slimmed-down version of the original 18thCenturySonnets.csv provided by Mark Andrew Algee-Hewitt, to whom we are indebted. I have preserved 7 of the original 13 fields, and I have replaced the original comma separator by /, which will work with all versions of Awk.

$ mkdir metadata
$ cd metadata
		make a directory to work in; change to it
Download the file 18.csv from the Google Drive site, either directly to metadata or to your Downloads directory and then move it from there.
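If the file lands in your Downloads directory, something like this (the path may differ on your system) moves it into metadata:
$ mv ~/Downloads/18.csv .
		move the downloaded file into the current directory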

Preliminary exploration

Examination of the sonnet metadata reveals that it's somewhat irregular in spite of a lot of work by its creator. Real data is never clean enough, so part of the exercise in this studio is to identify anomalies and talk about how to clean them up.

We call this "exploratory data analysis", since we don't know what's in the data; poking around is the first step. Let's do some exploration. This is what the first few lines will look like:

author_dob/author_dod/author_fname/author_lname/num_lines/title_main/title_sub
1771/1798/E. H. (Elihu Hubbard)/Smith/8/SONNET I./Sent to Miss  , with a Braid of Hair.
1771/1798/E. H. (Elihu Hubbard)/Smith/8/SONNET II./Sent to Mrs.  , with a Song.
1771/1798/E. H. (Elihu Hubbard)/Smith/8/SONNET III./Sent to Miss.  ,

The first or "header" line is conventionally a list of names for the columns or fields that follow. The information is used by Excel, Pandas (a Python data analysis library that we will see soon), and many other programs.

Start by getting an initial sense of the data:

$ wc 18.csv 
		how big is this data file?
$ head 18.csv
		how does it start?  the first line gives the column headings for all later lines
$ tail 18.csv
		how does it end?

In the following Awk programs, -F/ sets the field separator to a slash, NR is the number of the current input line (record), NF is the number of fields in the line, and $n is the n-th field.
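As a minimal illustration (using echo to supply a one-line input):
$ echo 'a/b/c' | awk -F/ '{print NR, NF, $2}'
		prints "1 3 b": record 1, 3 fields, second field is b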

Hint: as you are experimenting, it is a LOT easier to copy the commands from this page and paste them into a terminal window than to laboriously re-type them.
$ awk -F/ '{print NF}' 18.csv | sort -u
		check number of fields; it should be the same for all lines
		(an awk program with an action but no pattern)
		try omitting the -F/ argument.  might sort -un be better?
$ awk -F/ '{print $5}' 18.csv | sort | uniq -c | sort -n
		how many lines are in each sonnet?  any surprises?
		is the second sort command necessary, or just convenient?
$ awk -F/ '$5 > 14' 18.csv
		print the longer ones
		(an awk program with a pattern but no action)
$ awk -F/ 'NR > 1 && $5 > 14 {print $5, $6, $7}' 18.csv
		exclude the first line: NR is the number of the input record (= line)
		(an awk program with both pattern and action)

Who wrote this stuff?

Let's look at the author metadata: first and last names, and birth and death dates. Use the first line of 18.csv to figure out which field numbers to use.
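One quick way to do that (a sketch; head -1 18.csv shows the same thing unnumbered) is to number the header fields:
$ awk -F/ 'NR == 1 { for (i = 1; i <= NF; i++) print i, $i }' 18.csv
		print each column name with its field number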
$ awk -F/ '{print $3, $4}' 18.csv
		author names
$ awk -F/ '{print $4 ", " $3}' 18.csv
		author names, last name first, and separated by comma
$ awk -F/ '{print $4 ", " $3}' 18.csv | sort | uniq -c | sort -n
		unique author names and how many sonnets each wrote.  any surprises?
$ awk -F/ '{print $3, $4}' 18.csv | sort | uniq -c | sort -nr | awk '$1 == 1'
		how many people wrote only one?  fewer than 10?
$ awk -F/ '$3 ~ " " { print $3, $4 }' 18.csv | uniq
		authors with spaces in their first-name field
		are there any with spaces in their last-name field?

When did they live and die?

What about the ages of the authors? Awk does arithmetic, so the age is $2-$1:
$ awk -F/ '{print $2-$1}' 18.csv | sort | uniq -c | sort -nr
		authors' ages.  see anything odd?
$ awk -F/ '{print $1, $2}' 18.csv | sort | uniq
		see anything odd?

What is a valid date? What do we do about "invalid" ones?

$ awk -F/ '$1 !~ /^[0-9]+$/ || $2 !~ /^[0-9]+$/ {print $1, $2, $3, $4}' 18.csv
		look for weird dates.  how would you show an empty field in the output?
		how would you ignore the header line?
$ awk -F/ '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {print $2-$1, $3, $4}' 18.csv | uniq | sort -n
		compute ages if we have valid data.  see anything odd?
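As for what to do about the invalid ones: one common cleanup (a sketch, and only one of several reasonable choices) is to substitute a visible placeholder such as NA, so bad values cannot sneak into later arithmetic:
$ awk -F/ 'BEGIN { OFS = "/" }
       NR > 1 && $1 !~ /^[0-9]+$/ { $1 = "NA" }
       NR > 1 && $2 !~ /^[0-9]+$/ { $2 = "NA" }
       { print }' 18.csv >18-na.csv
		mark invalid dates as NA; assigning to a field makes awk
		rebuild the line with OFS between fields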
If you look carefully at 18.csv, you will see that lines by the same author are not all contiguous, which makes one wonder what order the data has been sorted into. How would you detect this anomaly mechanically in a larger dataset (in particular, if it were not related to dates)? Hint: try uniq | sort | uniq -c | awk '$1>1' with one of the earlier commands above.
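For instance, one way to apply that hint to the author names:
$ awk -F/ '{print $4 ", " $3}' 18.csv | uniq | sort | uniq -c | awk '$1 > 1'
		names whose lines show up in more than one separate run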

More arithmetic

Rather than repeating the test over and over, let's collect all the lines that have valid dates into a temporary file, then do some further computations on that.
$ awk -F/ '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/' 18.csv >temp
		make a temporary file with the valid entries
$ wc temp
		how many are there?
$ awk -F/ '{ ages = ages + $2 - $1 } # add up all the ages
       END { print "average age =", ages / NR }' <temp
		compute average age.  NR is the total number of records;
		temp has no header line, so every record counts
		(an awk program with two pattern-action statements)
$ awk -F/ '$2-$1 > max { max = $2 - $1; fname = $3; lname = $4 }
       END { print "oldest:", fname, lname, " age", max }' <temp
		who is the oldest?
		who is the youngest?
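Beware if you adapt this for the youngest: an uninitialized min compares as 0, so $2-$1 < min would never be true. One sketch that avoids the pitfall seeds min from the first record:
$ awk -F/ 'NR == 1 || $2-$1 < min { min = $2 - $1; fname = $3; lname = $4 }
       END { print "youngest:", fname, lname, " age", min }' <temp
		seed min with the first record, then keep the smallest age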
This might still be counting some people twice because of the sorting issue noted above. How might you fix that?
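One classic Awk idiom for that kind of de-duplication (a sketch; picking the right key is the interesting part) keeps only the first line seen for each author:
$ awk -F/ '!seen[$3 "/" $4]++' temp >temp1
		seen[] is an associative array; the pattern is true only the
		first time a first-name/last-name pair appears
Whether the name alone is a good enough key is worth thinking about.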

Envoi

Most of our use of Awk in this session has been for selecting parts of the data, but Awk is a complete programming language with many features for text analysis, so this is just a taste. This is not an Awk course, and we won't go much further here, but if you do want to learn more, say so.
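
As a parting taste (a sketch using Awk's associative arrays, which we have not covered), the sort | uniq -c pipelines above can be done inside Awk itself:
$ awk -F/ 'NR > 1 { count[$4 ", " $3]++ }
       END { for (name in count) print count[name], name }' 18.csv | sort -n
		count sonnets per author with an array instead of uniq -c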