Updated Mon Feb 6 19:56:24 EST 2023

Assignment 2: Unix Commandline Commands

Introduction

This assignment is practice in using the file system and Unix commands that you learned in the second class, using the file shakespeare.zip, which contains William Shakespeare's sonnets. You should already have downloaded this as part of the first assignment.

$ pwd
		make sure you're in the right directory
$ cd hum307/shakespeare
		if necessary
$ curl -L -k 'www.hum307.com/shakespeare.zip' -o shakespeare.zip
		if necessary
		-L says to follow any redirection of links
		-k says to allow insecure server connections
			probably not a good idea in general but ok here
$ curl --help		
		produces a compact list of options; --help works for many commands
$ man curl
		produces distinctly non-compact but thorough documentation

Practice makes perfect

Repeat the Barrett Browning experiments from Studio 2, but with Shakespeare's sonnets. You should repeat the "love" experiments from EBB's sonnets (there are some similarities and some differences), but you should also explore something beyond the word "love". This is a chance to try other aspects of how language was used.

How many sonnets are there?
		think about ls and pipes
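
As a small sketch of the "ls and pipes" idea, on a made-up directory rather than the sonnets (the temporary directory and the three file names are assumptions for illustration only):

```shell
# Hypothetical scratch directory holding three made-up files.
tmp=$(mktemp -d)
touch "$tmp/a.txt" "$tmp/b.txt" "$tmp/c.txt"

# ls prints one name per line when its output goes into a pipe,
# and wc -l counts those lines. tr strips the padding some systems add.
count=$(ls "$tmp" | wc -l | tr -d ' ')
echo "$count"
```

The same shape works in any directory; only the argument to ls changes.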

How many lines, words and characters are there in each sonnet?
		did you see something odd?
		hint:  wc | sort
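
To see what the wc | sort hint does, here is a sketch on three invented files (the names and contents are assumptions, not the sonnets). Sorting numerically puts the shortest file first, which makes outliers easy to spot.

```shell
# Hypothetical scratch directory with three small made-up files.
tmp=$(mktemp -d)
printf 'one two\nthree\n' > "$tmp/a.txt"
printf 'four\n' > "$tmp/b.txt"
printf 'five six seven\neight\nnine\n' > "$tmp/c.txt"

# wc prints lines, words and characters for each file, then a total.
# sort -n orders the report numerically by the first column (lines).
report=$(cd "$tmp" && wc *.txt | sort -n)
echo "$report"

# The file with the fewest lines is named at the end of the first row.
smallest=$(echo "$report" | head -1 | awk '{print $NF}')
echo "$smallest"
```

Note that the "total" row has the largest counts, so it sorts to the bottom.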

By what factor did the zip process shrink the original input?
		hint:  wc * and variations
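
The factor is just the original size divided by the compressed size. A sketch of that arithmetic follows, assuming gzip as a stand-in for zip and a made-up, very repetitive input file, so the number it prints says nothing about the sonnets:

```shell
# Hypothetical input: the same 21-byte line repeated 1000 times.
tmp=$(mktemp -d)
awk 'BEGIN { for (i = 0; i < 1000; i++) print "shall I compare thee" }' > "$tmp/plain.txt"

# gzip stands in for zip here; -c writes the compressed bytes to stdout.
gzip -c "$tmp/plain.txt" > "$tmp/plain.txt.gz"

# wc -c counts bytes; reading from stdin keeps the file name out of the output.
orig=$(wc -c < "$tmp/plain.txt" | tr -d ' ')
comp=$(wc -c < "$tmp/plain.txt.gz" | tr -d ' ')

# awk does the floating-point division that plain shell arithmetic cannot.
factor=$(awk -v o="$orig" -v c="$comp" 'BEGIN { printf "%.1f", o / c }')
echo "$orig bytes -> $comp bytes, factor $factor"
```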

Love is in the air

"Love" is a major theme in both sets of sonnets.


How often does "love" appear literally?

How often including variants like "Love", loving, beloved, etc.?

Compare the frequency of "love" in the two collections by using wordfreq to count the number of occurrences, and wc to count the total number of words.

What percent of EBB's words are "love"?

What percent of WS's words are "love"?
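
A percentage is 100 times the count of one word divided by the total word count. The sketch below works that out for a made-up two-line file and the word "the". It uses grep -o -w as a stand-in counter, since how wordfreq is invoked is not shown here; the file and the word are assumptions for illustration.

```shell
# Hypothetical sample text: 7 words, 3 of them "the".
tmp=$(mktemp -d)
printf 'the cat and the dog\nthe end\n' > "$tmp/sample.txt"

# grep -o prints each match on its own line; -w matches whole words only.
hits=$(grep -o -w 'the' "$tmp/sample.txt" | wc -l | tr -d ' ')

# wc -w counts every word in the file.
total=$(wc -w < "$tmp/sample.txt" | tr -d ' ')

# awk handles the floating-point arithmetic.
pct=$(awk -v h="$hits" -v t="$total" 'BEGIN { printf "%.1f", 100 * h / t }')
echo "$hits of $total words = $pct%"
```

Here that prints 3 of 7 words = 42.9%.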

More about regular expressions

The sonnet number in both collections is a Roman numeral that begins in the first column, with nothing else on that line. Lines of verse begin with three spaces.

How would you find lines that only have Roman numerals in shakespeare.txt?

What grep command would print only the lines of the sonnets, but not the Roman numerals?

What other grep command could you use to do the same thing?
	
	hints:	^ matches the beginning of a line
		$ matches the end of a line
		[abcde] matches any one of those letters
		[abcde]* matches zero or more occurrences
		grep -v prints only lines that *don't* match
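
The hints can be tried out on something simpler first. This sketch uses a made-up file whose headings are ordinary digits rather than Roman numerals (the file and its contents are assumptions), so it shows the pieces without being the answer.

```shell
# Hypothetical sample: digit-only headings in column one, indented body lines.
tmp=$(mktemp -d)
printf '12\n   first body line\n   second body line\n\n345\n   third body line\n' > "$tmp/sample.txt"

# ^ and $ anchor the pattern to the whole line; [0-9][0-9]* is one or more digits.
headings=$(grep '^[0-9][0-9]*$' "$tmp/sample.txt")
echo "$headings"

# -v inverts the match: every line that is not a digit-only heading.
others=$(grep -v '^[0-9][0-9]*$' "$tmp/sample.txt")
echo "$others"

# With a bare *, zero digits is allowed, so empty lines match as well.
loose=$(grep -c '^[0-9]*$' "$tmp/sample.txt")
echo "$loose"
```

The last count is 3, not 2, because the blank line also matches; that difference is worth keeping in mind when a file has empty lines.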

What else is there?

Do something similar to "love", but with any words, phrases
or anything else that appeals.

You can use EBB or WS or both.

Tell us what you tried, with a quick summary of results.

Submission

There is now a link for submissions on the Google Drive. Collect your answers to the questions above in a TEXT file called x_abc_asgn2.txt where x is the first letter of your last name, and abc is your first name; thus I would submit k_brian_asgn2.txt.

You might find it easiest to either copy and paste the output from running the commands, or redirect their output into a file that you then include. These are good ways to avoid transcription errors that might happen if you retype anything.
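
A minimal sketch of redirecting output into a file, using a made-up file name in a scratch directory (both are assumptions; your real file follows the naming rule above):

```shell
# Hypothetical output file in a scratch directory.
tmp=$(mktemp -d)
out="$tmp/answers.txt"

# > creates the file, or overwrites it if it already exists.
echo 'Question 1' > "$out"

# >> appends, so earlier answers are kept.
printf 'a b c\n' | wc -w >> "$out"

# Check what ended up in the file.
cat "$out"
lines=$(wc -l < "$out" | tr -d ' ')
```

Using >> for every answer after the first avoids accidentally wiping out what is already there.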

Note: submit a .txt file, please, not a Word file.
Use your favorite (or least unfavorite) text editor,
like nano.  I'm not worried if the upload is mangled,
but I do want you to be comfortable with a real text
editor.  

Challenge problem

If this assignment was too easy, write a grep regular expression that will solve "Spelling Bee" problems from the NY Times, like this one:

The puzzle is to find all the words you can make with at least five letters and using the central letter ("Q" here). Lower-case words only. Your score is 1 for each word, with 3 points for a word that uses all seven letters.

The file web2 in the Datasets folder on the Google Drive contains the word list from Webster's Second International Dictionary, if you want to see how well your code works.

No extra credit for this beyond the satisfaction of knowing that you understand REs better than most.