Updated Mon Feb 27 20:26:11 EST 2023
[The Python programs shown here are stored as individual files in the directory py. It might be easier to copy them from there than copying from the web page.]
This page shows Python equivalents for some of the Awk programs that appeared in Studio 3. If you want to run a Python program, it's easiest to copy it into a file, save it as whatever.py, and then run it:
$ python whatever.py
On macOS, the default version of python is likely to be Python 2, which has some minor but irritating incompatibilities with Python 3. It might be easier to download Python 3 from python.org than to cope with the differences.
On Windows with WSL, you are likely to already have Python 3.
In either case, to find out, run Python and it will tell you what version you're running:
$ python
Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 03:13:28)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
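The version can also be checked from inside a program. The following is a minimal sketch; the helper name is_python3 is made up for illustration, while sys.version and sys.version_info are part of the standard library.

```python
import sys

def is_python3(version_info=sys.version_info):
    """Return True if the given version tuple is Python 3 or later."""
    return version_info[0] >= 3

# sys.version holds the same text the interpreter prints at startup
print(sys.version)
if not is_python3():
    print("warning: the programs on this page assume Python 3")
```

On many systems the command python3 runs Python 3 explicitly, which may avoid the question of what plain python refers to.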
As an alternative to running Python on your own computer, you can invoke Python explicitly in Colab. Note the exclamation point !, which signals that the rest of the line is a shell command.
$ !python
Python 3.8.10 (default, Nov 14 2022, 12:59:47)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
I do not fully understand this mechanism, but it seems to work if the input to the Python program comes from files that you have uploaded.
For the Awk program
$ awk '{print}'
which prints every input line, the simplest Python equivalent is this version of cat:
# cat: one way to read & copy a file
#    one line at a time from stdin
#    equivalent to Unix cat command for text
#    awk '{print}'

import sys

line = sys.stdin.readline()
while line != "":
    print(line, end="")   # end="" stops print from adding a second newline (python3 only)
    line = sys.stdin.readline()

readline() is a function that reads an input stream a line at a time, returning each line (including the newline character at the end). It returns an empty string when there is no more input. sys.stdin.readline() reads from the standard input stream, which is the keyboard unless the input is redirected from a file with <file or arrives through a pipe.
To run this, put it in a file, say cat.py, then say
$ python cat.py
In Colab, cat.py doesn't seem to work right if the input comes from the keyboard, but it's ok if the input comes from a file like this:
$ !python cat.py <cat.py
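The readline loop in cat.py can also be written as a for loop over the stream, since a text stream in Python yields its lines one at a time when iterated. This is a sketch of that alternative, not a replacement for the version above; the function name copy_lines and the sample text are made up for illustration.

```python
import io
import sys

def copy_lines(f, out=sys.stdout):
    """Copy every line of the text stream f to out."""
    for line in f:          # each line still ends with its newline
        out.write(line)     # write() adds nothing, so end="" is not needed

# Demonstration on an in-memory stream; pass sys.stdin to read real input.
copy_lines(io.StringIO("first line\nsecond line\n"))
```

Calling copy_lines(sys.stdin) should behave like cat.py when the input comes from a file or a pipe.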
The next version reads either the standard input or a list of files named on the command line:

# cat2: read & copy a file one line at a time
#    from stdin or list of filenames
#    equivalent to cat for text
#    awk '{print}'

import sys

def cat(f):    # print a single text file
    line = f.readline()
    while line != "":
        print(line, end="")   # end="" stops print from adding a second newline (python3 only)
        line = f.readline()

def main():
    if len(sys.argv) == 1:
        cat(sys.stdin)
    else:
        for i in range(1, len(sys.argv)):
            f = open(sys.argv[i])
            cat(f)
            f.close()

main()

This program defines two functions, main to handle the overall processing, and cat to print a single file. Program execution begins by calling main (the last line). This is a very standard pattern for organizing a program.
The command-line arguments are available to a Python program in a list called sys.argv; sys.argv[0] is the name of the program itself, and sys.argv[1] is the first filename. So the program tests whether there are any filename arguments; if not, it uses the standard input, and otherwise it loops over the filenames. (Are you starting to get some idea of how Awk simplifies some aspects of computing? We'll mostly just read from the standard input from now on, but you can imagine adding this code to handle the more general case.)
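The standard library's fileinput module packages up the same "files if named, otherwise standard input" logic. The sketch below is one way it might be used; the function name cat_all is made up for illustration.

```python
import fileinput
import sys

def cat_all(filenames, out=sys.stdout):
    """Copy the named files to out; with an empty list, copy standard input."""
    with fileinput.input(files=filenames) as lines:
        for line in lines:
            out.write(line)

# In a script, the filename arguments would be passed like this:
#     cat_all(sys.argv[1:])
```

This is closer to what Awk does automatically, at the cost of hiding where one file ends and the next begins.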
Awk splits each input line into fields, by default strings of characters separated by spaces and/or tabs. To achieve the same in Python, we have to explicitly split each input line.
Here's the Awk program to just count the number of fields on each input line:
$ awk -F/ '{print NF}' 18.csv
And here's the same thing for Python.
# flds: read stdin one line at a time, text only
#    split into fields, white space only
#    awk '{print NF}'

import sys

line = sys.stdin.readline()
while line != "":
    flds = line.strip().split()   # split by white space
    print(len(flds))              # print NF
    line = sys.stdin.readline()

strip() strips white space from both ends of a string of characters; split() with no argument splits a string at runs of white space into separate fields, like the default behavior of Awk. The flds list starts at zero, however, not at 1 as in Awk. Exercise: how could you fix this?
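One possible answer to the exercise, offered as a sketch rather than the intended solution: put the whole line in position 0, so that flds[0] plays the role of $0 and flds[1] is the first field. The function name awk_fields is made up for illustration.

```python
def awk_fields(line):
    """Split line on white space; index 0 is the whole line (like $0)
    and index 1 is the first field (like $1)."""
    text = line.rstrip("\n")
    return [text] + text.split()

flds = awk_fields("alpha beta  gamma\n")
nf = len(flds) - 1          # NF is now one less than len(flds)
print(nf, flds[1], flds[nf])
```

The price is that the number of fields is no longer len(flds), which is easy to forget.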
The Awk command at the start of the pipeline
$ awk -F/ '{print $5}' 18.csv | sort | uniq -c | sort -n
that prints the 5th field could be replaced by a Python program that prints the 5th field; the principal lines would be
    flds = line.strip().split('/')   # split by /
    print(flds[4])
and the pipeline would be
$ python whatever.py <18.csv | sort | uniq -c | sort -n
Some of this could be replaced by Python code that uses a dictionary to accumulate the different counts:
# fld5: print number of times each item in field 5 (base 1) occurs
#    split on /
#    equivalent to
#    awk -F/ '{print $5}' 18.csv | sort | uniq -c

import sys

count = {}
line = sys.stdin.readline()
while line != "":
    flds = line.strip().split('/')   # split by /
    count[flds[4]] = count.get(flds[4],0) + 1
    line = sys.stdin.readline()

for i in count:
    print(count[i], i)

An example for selecting some lines and printing some fields:
$ awk -F/ 'NR > 1 && $5 > 14 {print $5, $6, $7}' 18.csv
# gt14: print fields 5, 6, 7 (origin 1) of lines after the first where field 5 > 14
#    split on /
#    equivalent to
#    awk -F/ 'NR > 1 && $5 > 14 {print $5, $6, $7}' 18.csv

import sys

NR = 0
line = sys.stdin.readline()
while line != "":
    NR += 1
    flds = line.strip().split('/')   # split by /
    if NR > 1 and int(flds[4]) > 14:
        print(flds[4], flds[5], flds[6])
    line = sys.stdin.readline()

There are other ways to skip the first line. Any thoughts?
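One of those other ways, sketched here under the assumption that the first line is a header to be ignored: read it once and throw it away before the main loop, so no line counter is needed. The function name select and the sample data are made up for illustration and are not taken from 18.csv.

```python
import io

def select(stream, threshold=14):
    """Yield fields 5, 6, 7 of each line after the first whose field 5 exceeds threshold."""
    stream.readline()                     # read and discard the header line
    for line in stream:
        flds = line.strip().split('/')    # split by /
        if int(flds[4]) > threshold:
            yield flds[4], flds[5], flds[6]

# Made-up sample: one header line followed by two data lines
sample = io.StringIO("a/b/c/d/n/x/y\n1/2/3/4/20/p/q\n1/2/3/4/7/r/s\n")
for row in select(sample):
    print(*row)
```

As in the original, int() will fail on a field that is not an integer, so this relies on the data lines being clean.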
$ awk -F/ '{print $3, $4}' 18.csv | sort | uniq -c | sort -nr | awk '$1 == 1'
# auth1: print names of authors with only one entry
#    split on /
#    equivalent to
#    awk -F/ '{print $3, $4}' 18.csv | sort | uniq -c | sort -nr | awk '$1 == 1'

import sys

count = {}
line = sys.stdin.readline()
while line != "":
    flds = line.strip().split('/')   # split by /
    author = flds[2] + " " + flds[3]
    count[author] = count.get(author,0) + 1
    line = sys.stdin.readline()

for i in count:
    if count[i] == 1:
        print(count[i], i)
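The count.get(author,0) + 1 idiom can be replaced by collections.Counter from the standard library, which starts every missing key at zero. This is a sketch of that variation; the function name single_entry_authors and the sample lines are made up for illustration.

```python
import io
from collections import Counter

def single_entry_authors(stream):
    """Return, in sorted order, the authors (fields 3 and 4) that occur exactly once."""
    count = Counter()
    for line in stream:
        flds = line.strip().split('/')    # split by /
        count[flds[2] + " " + flds[3]] += 1
    return sorted(a for a, n in count.items() if n == 1)

# Made-up sample: Ann Lee appears twice, Bob Ray once
sample = io.StringIO("1/2/Ann/Lee/x\n1/2/Bob/Ray/x\n1/2/Ann/Lee/y\n")
for author in single_entry_authors(sample):
    print(1, author)
```

Sorting the result stands in for the sort step of the shell pipeline; the dictionary version above prints in whatever order the authors were first seen.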
$ awk -F/ '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/' 18.csv >temp
Here's the Python version:
# ex4.py: print age if first and second field are both integers
#    equivalent to
#    awk -F/ '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {print $2-$1}'

import sys
import re

line = sys.stdin.readline()
while line != "":
    flds = line.strip().split('/')   # split by /
    if re.search('^[0-9]+$', flds[0]) is not None and \
       re.search('^[0-9]+$', flds[1]) is not None:
        print(int(flds[1]) - int(flds[0]))
    line = sys.stdin.readline()

It imports the re library, to use a single function, re.search, which returns a Match object if there was a match, and None if there was no match. Since we only care about whether there was a match or not, the test is simple and we can ignore the Match object.
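Since the same pattern is tested twice per line, it could be compiled once and wrapped in a small function. The sketch below uses fullmatch, which requires the whole string to match and so needs no ^ or $; the names DIGITS and is_int are made up for illustration.

```python
import re

DIGITS = re.compile(r'[0-9]+')    # no ^ or $ needed with fullmatch

def is_int(s):
    """Return True if s consists only of one or more ASCII digits."""
    return DIGITS.fullmatch(s) is not None

print(is_int("1850"), is_int("18a0"), is_int(""))
```

With this helper the test in ex4.py could read: if is_int(flds[0]) and is_int(flds[1]).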
If we have created a temporary file with valid-dates-only data, then subsequent processing is easier.
$ awk -F/ '{ ages = ages + $2 - $1 }   # add up all the ages
      END { print "average age =", ages / NR }' <temp
$ awk -F/ '$2-$1 > max { max = $2 - $1; fname = $3; lname = $4 }
      END { print "oldest:", fname, lname, " age", max }' <temp
Combining these two into a single Python program:
# ages.py: compute ages assuming first and second field are both integers
#    equivalent to
#    awk -F/ '{ ages = ages + $2 - $1 }   # add up all the ages
#          END { print "average age =", ages / NR }' <temp
#    awk -F/ '$2-$1 > max { max = $2 - $1; fname = $3; lname = $4 }
#          END { print "oldest:", fname, lname, " age", max }' <temp

import sys

ages = 0
max = 0
fname = ""
lname = ""
NR = 0

line = sys.stdin.readline()
while line != "":
    NR += 1
    flds = line.strip().split('/')   # split by /
    age = int(flds[1]) - int(flds[0])
    ages += age
    if age > max:
        max = age
        fname = flds[2]
        lname = flds[3]
    line = sys.stdin.readline()

print("average age =", ages / NR)
print("oldest:", fname, lname, "age", max)
As you can see, it's a lot more work to write a Python program to do quick and dirty explorations than it is with Awk. But once you have the lay of the land, then you can switch to Python, perhaps with a set of functions that you have written for your specific case.
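As one illustration of what such a set of functions might look like, here is a sketch of a helper that hands back the line number and the split fields for each input line, so that each program only has to supply the part that differs. The name records and the sample data are made up for illustration.

```python
import io

def records(stream, sep=None):
    """Yield (NR, flds) for each line of stream, Awk style.
    sep=None splits on white space; otherwise split on the string sep."""
    for nr, line in enumerate(stream, start=1):
        yield nr, line.strip().split(sep)

# Made-up sample in the year/year/first/last layout used above
sample = io.StringIO("1800/1850/Ann/Lee\n1900/1980/Bob/Ray\n")
ages = [int(f[1]) - int(f[0]) for nr, f in records(sample, '/')]
print("average age =", sum(ages) / len(ages))
```

With real data the call would be records(sys.stdin, '/'), after importing sys.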