awk is a language

Reading time ~4 minutes

Most of us know awk as the program that you use when you want only that one column of your output. Just pipe it through awk '{print $3}' and the third column you get.

Fewer people know that awk is a programming language. It is specialized for processing structured input like tables.

As an example we will analyze this table of events at of the first Quidditch match of the first Harry Potter book, Griffindor vs Slytherin.

player team points
Johnson Griffindor 10
Spinnet Griffindor 10
Flint Slytherin 10
Flint Slytherin 10
Flint Slytherin 10
Flint Slytherin 10
Flint Slytherin 10
Flint Slytherin 10
Potter Griffindor 150

You can run all the awk programs in this blogpost with awk -F, 'PROGRAMM' quidditch.csv or by saving the awk-program into a file and running awk -F, -f 'PROGRAMMFILE' quidditch.csv. The file can be found here, it is the table from above with columns separated by ,.`

Get values from column 3

In awk we think of the input as a sequence of records (lines normally) that consists of sequence of fields (words). These fields can be adressed by $1, $2 etc (and $0 is the whole record).

Lets extract the column "points": The third field of each record.

{print $3}
points
10
10
10
10
10
10
10
10
150

Of course the header is not interesting here and we should remove it.

Remove the header

The syntax of awk is a bit different from the programming languages you might know. In man awk we read:

An awk program is a sequence of patterns and corresponding

actions. When input is read that matches a pattern, the action

associated with that pattern is carried out.

So a program is a series of statements like this:

$$\stackrel{\text{If the input matches this pattern ....}}{\overbrace{\text{pattern}}}\underset{\text{... then run this action on it}}{\underbrace{\left\{ \text{action}\right\} }}$$

In the case of our program above a special case zoccurs: If there is no pattern given every record matches.

To remove the header of the results we can simply add a condition to the program: Don't act on the first line (the record where the record number NR is 1).

NR!=1 {print $3}
10
10
10
10
10
10
10
10
150

Sum of a column ...

Now lets get the sum of all the points scored by the players. Our first thought here is to pipe the result into the program sum. But interestingly enough there is no such program in the world of commandlines as far as I know. It seems like our ancestors with their great beards were fine with writing their own version when they saw need to. And when they did they probably used awk.

NR!=1 {points+=$3}
END{print points}
230
    Two things should be noted here:
  • the line marked with END is only triggered after all the input has
  • been processed. There is a similar keyword (BEGIN) which allows code to be executed before the rest of the code. (Dough!~)
  • the variable points did not have to be initialized. awk is full
  • of sensible defaults and one of them is that a numerical variable with no value is assumed to have the value 0 if not stated otherwise.

... grouped by another column

To get the result of the Quidditch match we need to sum the points for every team separately. We can use arrays (which would be called ~dict~s in most other languages) for that.

NR!=1 {salaries[$2]+=$3}
END {for (key in salaries) { print key " " salaries[key]}}
Griffindor 170
Slytherin 60

And here we have our final score of the game.

I really know only one other language which can do this sort of processing with so little code: SQL. But SQL only works for databases so it needs a big setup to be useful. So when I find myself with a bunch of text files in a unixoid environment, then awk is the way to go.

Oh, and if you think Griffindor totally dominated this match, here is how the score would have looked like if Harry hadn't caught the snitch (which happened more or less by luck).

NR!=1 && $1 !~ "Potter" {salaries[$2]+=$3}
END {for (key in salaries) { print key " " salaries[key]}}
Griffindor 20
Slytherin 60
comments powered by Disqus