Updated Mon Feb 13 10:28:50 EST 2023

Studio 2: Unix Commandline Commands

Introduction

If you have any suggestions for improving the studios or the assignments, please let us know!

This studio will use basic Unix commands to explore Elizabeth Barrett Browning's sonnets. This is approximately what we will walk through in the studio part of the second class on February 8. If you try it beforehand, you will get more out of the class, it will go smoothly for you, and you'll be able to help your colleagues.

By the end of this, you should be comfortable with basic Unix commands for looking at text in multiple files, finding things, counting them, etc.

To get started, make sure the sonnets are in a directory where you can look at them. You already did this in the first studio; they should be in hum307/barrett in your home directory.

open a Terminal window

$ ls
		you should be in your home directory; pwd to be sure
$ cd hum307/barrett
		this pathname is relative to your home directory
		you could say cd hum307, then cd barrett
$ ls
		check what's there
$ unzip sonnets.zip
		if necessary
$ ls -l
		to see all the details

Look at the README file

$ cat README
$ cp README README.backup
		make a backup copy
$ ls -l
		are they both there, with same size and different times?
$ cmp README README.backup
		are they identical?
$ rm README.backup
		you can remove the backup
$ ls -l
		verify that it's gone

Viewing file contents

$ cat sonnets.txt
		the whole collection in one file
		how many are there?
		recognize the most famous one?
$ more sonnets.txt
		look at them a page at a time ("less" is "more")
$ wc sonnets.txt
		count lines, words, characters in the single file
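As a self-contained sketch of what wc reports, here is the same idea on a throwaway file (sample.txt and its text are invented for illustration, not one of the studio files):

```shell
# Create a tiny sample file: 2 lines, 6 words.
printf 'so much love\nso little time\n' > sample.txt

lines=$(wc -l < sample.txt)   # -l: line count only
words=$(wc -w < sample.txt)   # -w: word count only

echo "$lines lines, $words words"
rm sample.txt
```

Without a flag, wc prints all three counts (lines, words, characters) at once.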

Grep

These sonnets are all about love. How is that word used? How often? Meet grep, the archetypal Unix command. Grep searches for all instances of a text pattern in any number of files. It's one of the most useful of all commands; learn to use it effectively. (The fireside chat with Ken Thompson explains the origin of the program, though not the etymology of the name, around time 35:42.)

$ grep
		with no arguments, it prints a brief synopsis of how to use it
$ grep love sonnets.txt
		The first argument is the pattern; subsequent arguments are filenames
		There sure is a lot of love
$ grep Love sonnets.txt
		and we even missed some!
$ grep -i love sonnets.txt
		use the argument -i to ignore case
$ grep '[Ll]ove' sonnets.txt
		"love" or "Love":  our first regular expression!
		quotes are not strictly necessary here but better to use them
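To see exactly what the character class matches, here is the same pattern on a few invented lines (sample.txt and its text are ours, not the sonnets):

```shell
# Invented sample lines.
printf 'Love is not all\nI love thee\nmy beloved\nnothing here\n' > sample.txt

# [Ll]ove matches an upper- or lowercase L followed by "ove";
# quoting the pattern keeps the shell from expanding [ ] itself.
grep '[Ll]ove' sample.txt
matches=$(grep -c '[Ll]ove' sample.txt)
echo "$matches matching lines"
rm sample.txt
```

Note that "beloved" matches too: grep looks for the pattern anywhere in the line, not just at word boundaries.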

Can we count them? Sure. Let me count the ways.

$ grep -i love sonnets.txt >love.out
		collect grep's output in a file called love.out
$ wc love.out
		count the lines
$ grep '[Ll]ove' sonnets.txt >love2.out
		use the regular expression instead
$ diff love.out love2.out
		did we get the same answer?
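grep also has a -c flag that counts matching lines directly, which saves both the temporary file and the wc. A sketch on invented text:

```shell
# Invented sample lines.
printf 'I love thee\nnothing here\nLove again\n' > sample.txt

# -c reports the number of matching lines, so no temporary
# file or separate wc is needed.
lower=$(grep -c 'love' sample.txt)
either=$(grep -c '[Ll]ove' sample.txt)
echo "lower=$lower either=$either"
rm sample.txt
```

Like counting grep's output with wc, -c counts matching lines, not total occurrences; a line with two loves still counts once.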

Did we get the right number of loves?

$ grep 'love.*love' sonnets.txt
		another RE: .* means "any number of any characters"
$ grep 'love.*love.*love' sonnets.txt
		triplets!
		What's the RE to look for quadruplets?

How about gerunds like "loving"?

$ grep -i love sonnets.txt >love.out
		"love"
$ grep -i lov sonnets.txt >lov.out
		"love" but also "loving", etc.
$ wc lov*.out
		same number of lines?
$ diff love.out lov.out
		what was added?  surprised?
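The difference is easy to see on a tiny invented example: the shorter pattern is a substring match, so it also catches "loving" and similar words.

```shell
# Invented sample lines.
printf 'a loving heart\nI love thee\nmy beloved\n' > sample.txt

love=$(grep -ci 'love' sample.txt)   # matches "love" and "beloved"
lov=$(grep -ci 'lov' sample.txt)     # also matches "loving"
echo "love=$love lov=$lov"
rm sample.txt
```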

More filename patterns

Filename patterns are another flavor of regular expressions: different syntax and rules, but the same idea.

Unzip sonnets.zip if you haven't already

$ ls
$ wc s*
		s* is all files whose names begin with s
		oops -- this picks up sonnets.zip and sonnets.txt
$ wc s???
		4-character names that begin with s: ? matches any single character
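To compare the two patterns side by side without touching the real files, here is a sketch in a scratch directory (the directory and file names are invented):

```shell
# Scratch directory with three files whose names begin with s.
mkdir -p globdemo
( cd globdemo && touch s001 s002 sonnets.txt )

star=$(ls globdemo/s* | wc -l)     # s* matches all three names
quest=$(ls globdemo/s??? | wc -l)  # s??? matches only the 4-character names
echo "s*: $star  s???: $quest"
rm -r globdemo
```

Remember that it is the shell, not ls, that expands these patterns into filenames before the command runs.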

Pipes

Pipes are another fundamental Unix idea. You can send the output of one program directly to the input of another without needing a temporary file (like love.out above).

$ grep love sonnets.txt | wc
		no temporary file
$ ls s??? | wc
		counting sonnet files without a temporary file
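The same idea, sketched end to end on invented input:

```shell
# Invented sample lines.
printf 'love one\nhate two\nlove three\n' > sample.txt

# The pipe connects grep's output directly to wc's input;
# no intermediate file such as love.out is created.
piped=$(grep 'love' sample.txt | wc -l)
echo "$piped matching lines"
rm sample.txt
```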

Counting words

What words does EBB use most often? Word-counting is a fundamental idea for analyzing any kind of text.

$ wordfreq sonnets.txt
		lots of words, with counts
$ wordfreq sonnets.txt | wc
		how many distinct words?
$ tr A-Z a-z <sonnets.txt | wordfreq
		merge upper and lower case
$ wordfreq sonnets.txt | grep -i lov
		hmm -- do you see something unexpected?

"wordfreq" is not a standard Unix command. How can we create one from existing programs? The basic requirement is a program that prints its input one word per line. The result can then be sorted, counted, and so on, as before.

NOTE ADDED AFTER CLASS: Don't use this part; use the version on this page.

This version uses a pipeline of four standard commands; it's a nice example of how program composition can be used effectively. The tr command (with the argument -sc) translates each sequence of non-alphabetic characters into a single newline character, putting each word on a line by itself. The first sort command puts the words into alphabetical order, uniq prints each distinct word with a count (-c) of the number of times it occurs, and the second sort sorts the results in reverse numeric order.

tr -sc 'A-Za-z' '\012' <file |
 sort |
 uniq -c |
 sort -nr
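One way to package the pipeline as a command of your own is a small shell function; this is a sketch (the name mirrors the course's wordfreq, but the function and the sample text here are ours):

```shell
# A wordfreq-style function built from the four-command pipeline.
wordfreq() {
    tr -sc 'A-Za-z' '\012' | sort | uniq -c | sort -nr
}

out=$(printf 'love me love my dog\n' | wordfreq)
echo "$out"
top=$(echo "$out" | head -n 1)
```

The most frequent word comes out first; here the top line shows "love" with a count of 2.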

This raises interesting questions. What is a word? Is "don't" one word or two? How about "wished-for" in the first sonnet? Are upper and lower case words different? (E.g., ERA vs era?) You can change the definition by modifying the list of characters for tr, or by using other tools to do the conversion.
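For instance, adding the apostrophe to tr's list of "word" characters keeps "don't" as a single token; a sketch of that variant:

```shell
# Include the apostrophe in the set of kept characters so
# contractions survive as single tokens.
out=$(printf "don't stop, don't\n" | tr -sc "A-Za-z'" '\012')
first=$(printf '%s\n' "$out" | head -n 1)
echo "$out"
```

With the original character list, the same input would split into "don" and "t".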

 

The second assignment will be to do the same kinds of experiments with Shakespeare's sonnets, to consolidate your knowledge of commandline tools, and perhaps to learn something about the poems.