Updated Mon Feb 13 10:28:50 EST 2023
If you have any suggestions for improving the studios or the assignments, please let us know!
This studio will use basic Unix commands to explore Elizabeth Barrett Browning's sonnets. This is approximately what we will walk through in the studio part of the second class on February 8. If you try it beforehand, you will get more out of the class, it will go smoothly for you, and you'll be able to help your colleagues.
By the end of this, you should be comfortable with basic Unix commands for looking at text in multiple files, finding things, counting them, etc.
To get started, make sure the sonnets are in a directory where you can look at them. You already did this in the first studio; they should be in hum307/barrett in your home directory.
open a Terminal window
$ ls                        you should be in your home directory; pwd to be sure
$ cd hum307/barrett         this pathname is relative to your home directory;
                              you could say cd hum307, then cd barrett
$ ls                        check what's there
$ unzip sonnets.zip         if necessary
$ ls -l                     to see all the details
Look at the README file
$ cat README
$ cp README README.backup   make a backup copy
$ ls -l                     are they both there, with the same size and different times?
$ cmp README README.backup  are they identical?
$ rm README.backup          you can remove the backup
$ ls -l                     verify that it's gone
$ cat sonnets.txt           the whole collection in one file
                              how many are there? recognize the most famous one?
$ more sonnets.txt          look at them a page at a time ("less" is "more")
$ wc sonnets.txt            count lines, words, characters in the single file
These sonnets are all about love. How is that word used? How often? Meet grep, the archetypal Unix command. Grep searches for all instances of a text pattern in any number of files. It's one of the most useful of all commands; learn to use it effectively. (The fireside chat with Ken Thompson explains the origin of the program, though not the etymology of the name, around time 35:42.)
$ grep                          with no arguments, it prints a brief synopsis of how to use it
$ grep love sonnets.txt         the first argument is the pattern; subsequent arguments are filenames
                                  there sure is a lot of love
$ grep Love sonnets.txt         and we even missed some!
$ grep -i love sonnets.txt      use the -i option to ignore case
$ grep '[Ll]ove' sonnets.txt    "love" or "Love": our first regular expression!
                                  quotes are not strictly necessary here, but it's better to use them
Can we count them? Sure. Let me count the ways.
$ grep -i love sonnets.txt >love.out      collect grep's output in a file called love.out
$ wc love.out                             count the lines
$ grep '[Ll]ove' sonnets.txt >love2.out   use the regular expression instead
$ diff love.out love2.out                 did we get the same answer?
Did we get the right number of loves?
$ grep 'love.*love' sonnets.txt           another RE: .* means "any number of any characters"
$ grep 'love.*love.*love' sonnets.txt     triplets!
What's the RE to look for quadruplets? How about gerunds like "loving"?
$ grep -i love sonnets.txt >love.out      "love"
$ grep -i lov sonnets.txt >lov.out        "love", but also "loving", etc.
$ wc lov*.out                             same number of lines?
$ diff love.out lov.out                   what was added? surprised?
Filename patterns are another flavor of regular expressions: different syntax and rules, but the same idea.
Unzip sonnets.zip if you haven't already
$ ls
$ wc s*         s* is all files whose names begin with s
                  oops -- this picks up sonnets.zip and sonnets.txt
$ wc s???       4-letter names that begin with s: ? matches any single character
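The difference between * and ? is easy to see with a tiny self-contained experiment. (The filenames here are made up for the demo; run it in a scratch directory, not in your barrett directory.)

```shell
# Scratch demo of filename patterns (hypothetical names, not the sonnet files)
mkdir -p globdemo
cd globdemo
touch s001 s002 sonnets.txt   # two 4-character names plus one longer name
ls s*                         # matches all three: s001 s002 sonnets.txt
ls s???                       # matches only the 4-character names: s001 s002
```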
Pipes are another fundamental Unix idea. You can send the output of one program directly to the input of another without needing a temporary file (like love.out above).
$ grep love sonnets.txt | wc    no temporary file
$ ls s??? | wc                  counting sonnet files without a temporary file
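As an aside, grep can also do the counting itself: the -c option prints the number of matching lines, which covers the common grep | wc case in a single command. A quick sketch, using printf to generate input so it runs without the sonnet files:

```shell
# grep -c counts matching lines directly (no pipe to wc needed)
printf 'How do I love thee?\nLet me count the ways.\nI love thee freely\n' |
  grep -c love       # prints 2: two of the three lines contain "love"
```

Note that -c counts matching lines, not matches, so a line with two loves still counts once.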
$ wordfreq sonnets.txt                  lots of words, with counts
$ wordfreq sonnets.txt | wc             how many distinct words?
$ tr A-Z a-z <sonnets.txt | wordfreq    merge upper and lower case
$ wordfreq sonnets.txt | grep -i lov    hmm -- do you see something unexpected?
"wordfreq" is not a standard Unix command. How can we create one from existing programs? The basic requirement is a program that prints its input one word per line; the result can then be sorted, counted, etc., as before.
This version uses a pipeline of four standard commands; it's a nice example of how program composition can be used effectively. The tr command (with the options -sc) translates each sequence of non-alphabetic characters into a single newline character, thus putting each word on a line by itself. The first sort command puts the words into alphabetical order, the uniq command prints each unique word with a count (-c) of the number of times it occurs, and the second sort command sorts the result in reverse numeric order.
tr -sc 'A-Za-z' '\012' <file | sort | uniq -c | sort -nr
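One way to turn this pipeline into a wordfreq command of your own is to put it in a small shell script. A minimal sketch (the script name and location are up to you; after creating it, you would put it somewhere on your PATH to use it by name):

```shell
# Create a "wordfreq" script wrapping the four-command pipeline.
# cat "$@" reads the named files, or standard input if there are no arguments.
cat >wordfreq <<'EOF'
#!/bin/sh
cat "$@" | tr -sc 'A-Za-z' '\012' | sort | uniq -c | sort -nr
EOF
chmod +x wordfreq
```

Then ./wordfreq sonnets.txt (or ./wordfreq <sonnets.txt) behaves like the pipeline above.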
This raises interesting questions. What is a word? Is "don't" one word or two? How about "wished-for" in the first sonnet? Are upper and lower case words different? (E.g., ERA vs era?) You can change the definition by modifying the list of characters for tr, or by using other tools to do the conversion.
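For example, one possible variant (an illustration, not the definition used above) adds the apostrophe and hyphen to tr's list of word characters, so that "don't" and "wished-for" each come out as a single word:

```shell
# Variant word splitter: apostrophe and hyphen count as part of a word.
# (The trailing - in the set is a literal hyphen, not a range.)
printf "don't stop, don't: wished-for\n" |
  tr -sc "A-Za-z'-" '\012' | sort | uniq -c | sort -nr
# the top line of the output counts don't twice, as one word
```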
The second assignment will be to do the same kinds of experiments with Shakespeare's sonnets, to consolidate your knowledge of command-line tools, and perhaps to learn something about the poems.