Updated Wed Feb 8 19:30:29 EST 2023

Loose ends from studio 2 (Feb 8)

If you find errors here, or think the explanation could be better, please let me know post haste. Thanks.

Hi, everyone. I apologize for the somewhat chaotic final 20 minutes of today's studio exercise. In the online instructions I was trying to show how to create a personal word-frequency command (in effect answering Tenzing's question), but the on-screen instructions were not clear or complete or even totally correct, and this threw people off. And as always there are more arcane bits and pieces than I remembered, since a lot of this is wired into my fingers and hasn't passed through my brain recently.

So, with that said...

A working word-frequency command

cat $* | tr -sc A-Za-z '\012' | sort | uniq -c | sort -n

Going through this one piece at a time:

cat $*

concatenates the contents of all the files named as command-line arguments; that's what $* means: all the arguments. (Another pattern!)
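
To see $* at work on its own, here is a small sketch. The names a.txt, b.txt and catall are made up for illustration, and everything lives in a scratch directory from mktemp so nothing of yours is touched.

```shell
# Scratch directory so the experiment leaves no clutter behind.
dir=$(mktemp -d)
printf 'one\n' >"$dir/a.txt"
printf 'two\n' >"$dir/b.txt"

# A one-line script containing just: cat $*
printf 'cat $*\n' >"$dir/catall"

# Inside the script, $* becomes the two filenames,
# so cat prints both files, in the order given.
joined=$(sh "$dir/catall" "$dir/a.txt" "$dir/b.txt")
printf '%s\n' "$joined"

rm -rf "$dir"
```

One caveat: a filename containing a space gets split into pieces by $*, so this is best used with names that have no spaces.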

tr -sc A-Za-z '\012'

translates every character that is not in A-Za-z (-c, for "complement") into a newline (\012, the octal code for newline), and squeezes each run of repeated newlines into a single one (-s). This puts each "word" (that is, a sequence of alphabetic characters) on a line by itself. Arcane enough for you?
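
A quick way to convince yourself is to feed tr a single line and look at what comes out. This is only a sketch, and the sample text is made up.

```shell
# One made-up line with punctuation, an apostrophe and digits in it.
words=$(printf '%s\n' "Hello, world -- it's 2023!" | tr -sc A-Za-z '\012')
printf '%s\n' "$words"
# Prints four lines: Hello, world, it, s
# The apostrophe splits "it's" in two, and the digits vanish
# because they are not letters either.
```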

Now

sort | uniq -c | sort -n

sorts the words (which are one per line), converts each sequence of identical words into one line with a count and the word, then sorts the result numerically (-n), so the small counts come out first and the large ones come out last.
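
Here is the same three-stage tail end run on a tiny made-up list of words, one per line, as they would arrive from tr. The amount of blank space that uniq -c puts in front of each count varies from system to system, so don't read anything into the padding.

```shell
# Six words, one per line.
counts=$(printf 'the\ncat\nthe\ndog\nthe\ncat\n' | sort | uniq -c | sort -n)
printf '%s\n' "$counts"
# Three lines come out, smallest count first:
# dog (1), then cat (2), then the (3).
```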

Why cat | tr? The issue is that the tr command, unlike most commands, only reads from its standard input (the keyboard, or often a pipe); it doesn't take a filename argument. Using cat this way allows you to use the pipeline with any number of filenames, including none; with no filenames, cat simply copies its standard input to its output. So, for example, you could convert a file into purely lower case and then count the words:

$ tr A-Z a-z <sonnets.txt | cat | ...
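
Spelled out in full on a made-up line of text instead of sonnets.txt, that might look like the sketch below. The cat in the middle has no filenames, so it just passes along whatever arrives on its standard input.

```shell
# Without the first tr, "The" and "the" would be counted separately,
# and so would "cat" and "Cat".
lower=$(printf 'The cat saw the Cat\n' |
        tr A-Z a-z |
        cat |
        tr -sc A-Za-z '\012' | sort | uniq -c | sort -n)
printf '%s\n' "$lower"
# saw gets a count of 1; cat and the each get a count of 2.
```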

Put it into a file

To make this into a single command with its own name, say wordfreq, create a text file (must be text, not rtf or any other type) using an editor like nano. Put exactly that pipeline into the file. Save it as wordfreq or anything else you like. Make sure you know where you saved it. The barrett directory would be a good place.

$ ls -l wordfreq
-rw-r--r--. 1 bwk fac 57 Feb  8 17:26 wordfreq
$ cat wordfreq
cat $* | tr -sc A-Za-z '\012' | sort | uniq -c | sort -n

Run it

The file wordfreq is a "shell script", that is, a sequence of commands in a file that you want the shell to execute as if you had typed them directly to the shell.

You can say

$ sh wordfreq sonnets.txt

to run it directly. This is probably the easiest thing to do -- no fuss, no muss, just use it.

If you want to make it feel more like a real command, you can tell the operating system that it is an executable program by changing its mode to "executable", with the chmod command:

$ chmod +x wordfreq
		one time only
$ ls -l wordfreq
-rwxr-xr-x. 1 bwk fac 57 Feb  8 17:26 wordfreq
		Notice those x's?  That means it's executable.

Now you can run it like this:

$ ./wordfreq sonnets.txt

Notice the ./ at the front. That tells the shell to look for the command in the current directory (".") rather than in the usual places where commands are stored.
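
If you want to try both ways of running it without touching your real files, something like this sketch should do. It works in a scratch directory, and sample.txt is a made-up stand-in for sonnets.txt. It also creates wordfreq with a "here document" rather than an editor, which is my shortcut for the demonstration, not a required step.

```shell
# Work in a scratch directory, remembering where we started.
start=$PWD
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'the cat and the dog and the bird\n' >sample.txt

# Create wordfreq. The quotes around EOF keep the shell from
# expanding $* while the file is being written.
cat >wordfreq <<'EOF'
cat $* | tr -sc A-Za-z '\012' | sort | uniq -c | sort -n
EOF

via_sh=$(sh wordfreq sample.txt)     # needs no execute permission
chmod +x wordfreq
via_dot=$(./wordfreq sample.txt)     # needs the chmod +x first

printf '%s\n' "$via_dot"

# Clean up and go back to where we were.
cd "$start" || exit 1
rm -rf "$dir"
```

Both ways produce the same list, ending with the most frequent word, which here is "the" with a count of 3.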

Finally, it is possible to set things up so this command becomes part of your personal repertoire, accessible from anywhere in the file system. That's getting too far into the weeds for us, but if you want to know more, let me know. (Hint: look for shell search path.)
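
For the curious, here is a rough sketch of the idea behind that hint, under some assumptions of mine: the scratch directory below stands in for a personal directory where you keep your own commands, and the change to PATH lasts only for the current shell session. Making it permanent means putting a similar PATH line in your shell's startup file, which I am leaving out here.

```shell
# Scratch directory standing in for a personal command directory.
bindir=$(mktemp -d)
cat >"$bindir/wordfreq" <<'EOF'
cat $* | tr -sc A-Za-z '\012' | sort | uniq -c | sort -n
EOF
chmod +x "$bindir/wordfreq"

# Add that directory to the list of places the shell searches for commands.
PATH="$PATH:$bindir"

# Now plain "wordfreq" works, with no ./ and from any directory.
top=$(printf 'the cat and the dog and the bird\n' | wordfreq | tail -1)
printf '%s\n' "$top"

rm -rf "$bindir"
```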