Updated Mon Feb 27 20:26:11 EST 2023

Awk to Python

Introduction

[The Python programs shown here are stored as individual files in the directory py. It might be easier to copy them from there than copying from the web page.]

This page shows Python equivalents for some of the Awk programs that appeared in Studio 3. If you want to run a Python program, it's easiest to copy it into a file, save it as whatever.py, and then run it:

$ python whatever.py

On macOS, the default version of python is likely to be Python 2, which has some minor but irritating incompatibilities with Python 3. It might be easier to download Python 3 from python.org than to cope with the differences.

On Windows with WSL, you are likely to already have Python 3.

In either case, to find out, run Python and it will tell you what version you're running:

$ python
Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 03:13:28) 
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
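
If you would rather check from inside a program, here is a minimal sketch using sys.version_info from the standard library. The function name is_python3 is made up for this illustration.

```python
import sys

def is_python3():
    """Return True if the running interpreter is Python 3 or later."""
    return sys.version_info[0] >= 3

print("Python", sys.version.split()[0])   # version number only
print("Python 3 or later?", is_python3())
```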

As an alternative to running Python on your own computer, you can invoke Python explicitly in Colab. Note the exclamation point !, which signals that the rest of the line is a shell command.

$ !python
Python 3.8.10 (default, Nov 14 2022, 12:59:47) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

I do not fully understand this mechanism, but it seems to work if the input to the Python program comes from files that you have uploaded.

Basics of file handling

First, some preliminaries on how to read files from the file system. Awk does this automatically; in Python, you have to code it yourself. For the Awk program

$ awk '{print}'

which prints every input line, the simplest Python equivalent is this version of cat:

# cat: one way to read & copy a file
# one line at a time from stdin
# equivalent to Unix cat command for text
# awk '{print}'

import sys

line = sys.stdin.readline()
while line != "":
    print(line, end="")  # end="" keeps print from adding a second newline (Python 3 only)
    line = sys.stdin.readline()

readline() is a function that reads an input stream a line at a time, returning each line (including the newline character at the end). It returns an empty string when there is no more input; a blank input line still comes back as "\n", so it does not end the loop. sys.stdin.readline() reads from the standard input stream, which is the keyboard unless the input is redirected from a file (<file) or comes from a pipe.
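
A file object can also be used directly in a for loop, which hands you one line per iteration and stops at end of input. The sketch below is one way to write the same copy loop; the function name copy_lines and the sample text are made up for illustration, and the demonstration uses an in-memory file so that it runs without waiting for keyboard input.

```python
import io
import sys

def copy_lines(src, dst):
    """Copy every line of src to dst, unchanged."""
    for line in src:       # a file object yields one line per iteration
        dst.write(line)    # line already ends with "\n", so write it as is

# Demonstration on an in-memory "file"; in a real script you would
# call copy_lines(sys.stdin, sys.stdout) instead.
sample = io.StringIO("first line\nsecond line\n")
copy_lines(sample, sys.stdout)
```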

To run this, put it in a file, say cat.py, then say

$ python cat.py

In Colab, cat.py doesn't seem to work right if the input comes from the keyboard, but it's ok if the input comes from a file like this:

$ !python cat.py <cat.py

Multiple files; filename arguments

Note that cat.py only reads from the standard input (keyboard or <filename or a pipe), not from a list of filenames. If you want to add this capability (which is built in to Awk), more code is needed:

# cat2: read & copy a file one line at a time
# from stdin or list of filenames
# equivalent to cat for text
# awk '{print}'

import sys

def cat(f):  # print a single text file
   line = f.readline()
   while line != "":
       print(line, end="")  # end="" keeps print from adding a second newline (Python 3 only)
       line = f.readline()

def main():
   if len(sys.argv) == 1:
      cat(sys.stdin)
   else:
      for i in range(1, len(sys.argv)):
         f = open(sys.argv[i])
         cat(f)
         f.close()

main()

This program defines two functions, main to handle the overall processing, and cat to print a single file. Program execution begins by calling main (the last line). This is a very standard pattern for organizing a program.

The command-line arguments are available to a Python program in a list called sys.argv; sys.argv[0] is the name of the program itself, and sys.argv[1] is the first filename. So the program tests whether there are any filename arguments; if not, it uses the standard input, and otherwise it loops over the filenames. (Are you starting to get some idea of how Awk simplifies some aspects of computing? We'll mostly just read from the standard input from now on, but you can imagine adding this code to handle the more general case.)
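
The standard library's fileinput module packages up this "filenames or else standard input" logic. Here is a hedged sketch of cat written with it; the function signature and the temporary file are my own choices for the demonstration, not part of the programs in the py directory.

```python
import fileinput
import os
import sys
import tempfile

def cat(files=None, out=sys.stdout):
    """Copy the named files to out, one line at a time."""
    # With files=None, fileinput reads the files named in sys.argv[1:],
    # and falls back to standard input if there are none.
    with fileinput.input(files=files) as stream:
        for line in stream:
            out.write(line)

# Demonstration with a temporary file, so the sketch runs on its own;
# in a real script the last line would simply be cat().
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("hello\nworld\n")
cat([tmp.name])
os.remove(tmp.name)
```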

Fields

Awk splits each input line into fields, by default strings of characters separated by spaces and/or tabs. To achieve the same in Python, we have to explicitly split each input line.

Here's the Awk program to just count the number of fields on each input line:

$ awk -F/ '{print NF}' 18.csv

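And here's the same thing in Python. Note that this version splits on white space, like Awk without the -F/ option; to split on / instead, use split('/').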

# flds: read stdin one line at a time, text only
# split into fields, white space only
# awk '{print NF}'

import sys

line = sys.stdin.readline()
while line != "":
    flds = line.strip().split()  # split by white space
    print(len(flds))     # print NF
    line = sys.stdin.readline()

strip() strips white space from both ends of a string of characters; split() with no argument splits a string into fields separated by runs of spaces and/or tabs, like the default behavior of Awk. The flds list starts at zero, however, not at 1 as in Awk. Exercise: how could you fix this?
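
One possible approach to the exercise, offered as a sketch rather than the official answer: put the whole line at index 0, so that the list mimics Awk's $0, $1, $2, and so on. The helper name awk_fields is made up.

```python
def awk_fields(line):
    """Split line on white space; index 0 holds the whole line (like $0)
    and index i holds the i-th field (like $i)."""
    line = line.rstrip("\n")
    return [line] + line.split()

flds = awk_fields("alpha beta  gamma\n")
NF = len(flds) - 1        # flds[0] is the whole line, so don't count it
print(NF, flds[1], flds[NF])
```

The cost is that NF is now one less than len(flds), which is easy to forget.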

Counting and summarization

The Awk program

$ awk -F/ '{print $5}' 18.csv | sort | uniq -c | sort -n

that counts the values in the 5th field could have its first stage replaced by a Python program that prints the 5th field; the principal lines would be

    flds = line.strip().split('/')  # split by /
    print(flds[4])

and the pipeline would be

$ python whatever.py <18.csv | sort | uniq -c | sort -n

Some of this could be replaced by Python code that uses a dictionary to accumulate the different counts:

# fld5: print number of times each item in field 5 (base 1) occurs
# split on /
# equivalent to
# awk -F/ '{print $5}' 18.csv | sort | uniq -c

import sys

count = {}
line = sys.stdin.readline()
while line != "":
    flds = line.strip().split('/')  # split by /
    count[flds[4]] = count.get(flds[4],0) + 1
    line = sys.stdin.readline()
for i in count:
    print(count[i], i)
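
The standard library also has collections.Counter, a dictionary specialized for counting. Here is a sketch of the same idea that sorts the output by count as well, so the final sort -n is no longer needed. The function name, the sample lines, and the decision to skip lines with too few fields are my assumptions, not part of fld5.

```python
from collections import Counter

def count_field(lines, index, sep="/"):
    """Count how often each value occurs in the given (0-based) field."""
    counts = Counter()
    for line in lines:
        flds = line.strip().split(sep)
        if index < len(flds):        # skip lines that are too short
            counts[flds[index]] += 1
    return counts

# Made-up sample lines, not taken from 18.csv.
sample = ["a/b/c/d/18\n", "e/f/g/h/17\n", "i/j/k/l/18\n"]
counts = count_field(sample, 4)
# Sort by increasing count, like the final "sort -n" in the pipeline.
for value, n in sorted(counts.items(), key=lambda item: item[1]):
    print(n, value)
```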

An example for selecting some lines and printing some fields:

$ awk -F/ 'NR > 1 && $5 > 14 {print $5, $6, $7}' 18.csv

# gt14: print fields 5, 6, 7 (origin 1) of lines after the first where field 5 > 14
# split on /
# equivalent to
# awk -F/ 'NR > 1 && $5 > 14 {print $5, $6, $7}' 18.csv

import sys

NR = 0
line = sys.stdin.readline()
while line != "":
    NR += 1
    flds = line.strip().split('/')  # split by /
    if NR > 1 and int(flds[4]) > 14:
        print(flds[4], flds[5], flds[6])
    line = sys.stdin.readline()

There are other ways to skip the first line. Any thoughts?
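
One thought, as a sketch: itertools.islice can start the loop at the second line, so no line counter is needed. (Another common idiom is to call readline() once before the loop and throw the result away.) The function name select and the sample lines are made up for illustration.

```python
from itertools import islice

def select(lines, threshold=14, sep="/"):
    """Yield fields 5, 6, 7 (1-based) of each line after the first
    whose field 5, read as an integer, exceeds threshold."""
    for line in islice(lines, 1, None):    # start at the second line
        flds = line.strip().split(sep)
        if int(flds[4]) > threshold:
            yield flds[4], flds[5], flds[6]

# Made-up sample lines, not taken from 18.csv.
sample = [
    "h1/h2/h3/h4/h5/h6/h7\n",   # header line, skipped
    "a/b/c/d/15/x/y\n",
    "a/b/c/d/3/x/y\n",
]
for row in select(sample):
    print(*row)
```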

Another dictionary example

$ awk -F/ '{print $3, $4}' 18.csv | sort | uniq -c | sort -nr | awk '$1 == 1'

# auth1: print names of authors with only one entry
# split on /
# equivalent to
# awk -F/ '{print $3, $4}' 18.csv | sort | uniq -c | sort -nr | awk '$1 == 1'

import sys

count = {}
line = sys.stdin.readline()
while line != "":
    flds = line.strip().split('/')  # split by /
    author = flds[2] + " " + flds[3]
    count[author] = count.get(author,0) + 1
    line = sys.stdin.readline()
for i in count:
    if count[i] == 1:
        print(count[i], i)

Regular expressions in Python

What about the ages of the authors? As we saw, the dates in 18.csv are not always numeric. We wrote Awk code to ignore lines with flaky dates, and we wrote one script to print only the lines with valid dates. Here's the Awk script that takes the latter approach:

$ awk -F/ '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/' 18.csv >temp

Here's a Python version that applies the same test and prints the age:

# ex4.py: print age if first and second field are both integers
# equivalent to
# awk -F/ '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {print $2-$1}'

import sys
import re

line = sys.stdin.readline()
while line != "":
    flds = line.strip().split('/')  # split by /

    if re.search('^[0-9]+$', flds[0]) is not None and \
      re.search('^[0-9]+$', flds[1]) is not None:
        print(int(flds[1]) - int(flds[0]))

    line = sys.stdin.readline()

The program imports the re module to use a single function, re.search, which returns a Match object if there was a match, and None if there was no match. Since we only care about whether there was a match or not, the test is simple and we can ignore the Match object.
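
When the pattern has to cover the whole string, re.fullmatch (Python 3.4 and later) does the anchoring for you, so the ^ and $ can be dropped. A small sketch, with the names INTEGER and is_integer invented for the example:

```python
import re

INTEGER = re.compile(r"[0-9]+")   # one or more ASCII digits

def is_integer(text):
    """Return True if text consists entirely of digits."""
    return INTEGER.fullmatch(text) is not None

for field in ["1775", "c. 1700", ""]:
    print(repr(field), is_integer(field))
```

One difference to be aware of: in re.search, $ also matches just before a trailing newline, while fullmatch insists on the entire string.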

If we have created a temporary file with valid-dates-only data, then subsequent processing is easier.

$ awk -F/ '{ ages = ages + $2 - $1 } # add up all the ages
       END { print "average age =", ages / NR }' <temp
$ awk -F/ '$2-$1 > max { max = $2 - $1; fname = $3; lname = $4 }
       END { print "oldest:", fname, lname, " age", max }' <temp

Combining these two into a single Python program:

# ages.py: compute ages assuming first and second field are both integers
# equivalent to
# awk -F/ '{ ages = ages + $2 - $1 } # add up all the ages
#       END { print "average age =", ages / NR }' <temp
# awk -F/ '$2-$1 > max { max = $2 - $1; fname = $3; lname = $4 }
#       END { print "oldest:", fname, lname, " age", max }' <temp

import sys

ages = 0
max = 0
fname = ""
lname = ""
NR = 0

line = sys.stdin.readline()
while line != "":
    NR += 1
    flds = line.strip().split('/')  # split by /

    age = int(flds[1]) - int(flds[0])
    ages += age
    if age > max:
        max = age
        fname = flds[2]
        lname = flds[3]

    line = sys.stdin.readline()

print("average age =", ages / NR)
print("oldest:", fname, lname, "age", max)
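
Here is a sketch of another way to organize the same computation: collect one record per line, then let the built-in sum and max do the work. It assumes, as the Awk code does, that field 1 is the earlier year, field 2 the later year, and fields 3 and 4 the names. The function name summarize and the sample records are made up, and the check for empty input is my addition.

```python
def summarize(lines, sep="/"):
    """Return (average age, (age, first name, last name) of the oldest)."""
    people = []
    for line in lines:
        flds = line.strip().split(sep)
        age = int(flds[1]) - int(flds[0])
        people.append((age, flds[2], flds[3]))
    if not people:                  # avoid dividing by zero on empty input
        return None, None
    average = sum(age for age, _, _ in people) / len(people)
    oldest = max(people, key=lambda p: p[0])   # first one wins on ties
    return average, oldest

# Made-up records in the layout described above, not taken from 18.csv.
sample = ["1700/1760/Ann/Example\n", "1710/1790/Bob/Sample\n"]
average, oldest = summarize(sample)
print("average age =", average)
print("oldest:", oldest[1], oldest[2], "age", oldest[0])
```

This version also avoids using max as a variable name, which in ages.py hides Python's built-in max function.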

Envoi

As you can see, it's a lot more work to write a Python program to do quick and dirty explorations than it is with Awk. But once you have the lay of the land, then you can switch to Python, perhaps with a set of functions that you have written for your specific case.