awk is a language
Most of us know awk as the program that you use when you want only that one column of your output. Just pipe it through awk '{print $3}' and you get the third column.
Fewer people know that awk is a programming language. It is specialized for processing structured input like tables.
As an example we will analyze this table of events from the first Quidditch match of the first Harry Potter book, Griffindor vs Slytherin.
player | team | points |
---|---|---|
Johnson | Griffindor | 10 |
Spinnet | Griffindor | 10 |
Flint | Slytherin | 10 |
Flint | Slytherin | 10 |
Flint | Slytherin | 10 |
Flint | Slytherin | 10 |
Flint | Slytherin | 10 |
Flint | Slytherin | 10 |
Potter | Griffindor | 150 |
You can run all the awk programs in this blogpost with awk -F, 'PROGRAM' quidditch.csv or by saving the awk program into a file and running awk -F, -f 'PROGRAMFILE' quidditch.csv. The file can be found here; it is the table from above with columns separated by a comma (,).
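In case the linked file is not at hand, here is a small sketch that recreates it and tries both ways of calling awk. It assumes the file has a header line and no spaces around the commas; the file name print3.awk is made up for this example.

```shell
# Recreate the table as a CSV file (assumed layout: header line, comma-separated).
cat > quidditch.csv <<'EOF'
player,team,points
Johnson,Griffindor,10
Spinnet,Griffindor,10
Flint,Slytherin,10
Flint,Slytherin,10
Flint,Slytherin,10
Flint,Slytherin,10
Flint,Slytherin,10
Flint,Slytherin,10
Potter,Griffindor,150
EOF

# Variant 1: the program is given directly on the command line.
inline=$(awk -F, '{print $3}' quidditch.csv)

# Variant 2: the same program stored in a file (print3.awk is a hypothetical name).
printf '%s\n' '{print $3}' > print3.awk
from_file=$(awk -F, -f print3.awk quidditch.csv)

# Both variants print the same column.
echo "$inline"
```

The single quotes around the program matter: they keep the shell from expanding $3 before awk gets to see it.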
Get values from column 3
In awk we think of the input as a sequence of records (normally lines), each of which consists of a sequence of fields (words). These fields can be addressed by $1, $2 etc. (and $0 is the whole record).
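To make the record and field idea concrete, here is a minimal sketch that feeds a single record to awk. The record is the Potter row from the table, assumed to be comma-separated; NF is awk's built-in variable for the number of fields in the current record, so $NF is the last field.

```shell
# One record, piped in directly; -F, makes the comma the field separator.
line='Potter,Griffindor,150'

whole=$(printf '%s\n' "$line" | awk -F, '{print $0}')   # the whole record
first=$(printf '%s\n' "$line" | awk -F, '{print $1}')   # the first field
count=$(printf '%s\n' "$line" | awk -F, '{print NF}')   # the number of fields
last=$(printf '%s\n' "$line" | awk -F, '{print $NF}')   # the last field

echo "$whole"   # Potter,Griffindor,150
echo "$first"   # Potter
echo "$count"   # 3
echo "$last"    # 150
```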
Let's extract the column "points", which is the third field of each record.
{print $3}
points
10
10
10
10
10
10
10
10
150
Of course the header is not interesting here and we should remove it.
Remove the header
The syntax of awk is a bit different from the programming languages you might know. In man awk we read:
An awk program is a sequence of patterns and corresponding actions. When input is read that matches a pattern, the action associated with that pattern is carried out.
So a program is a series of statements like this:
$$\stackrel{\text{If the input matches this pattern ....}}{\overbrace{\text{pattern}}}\underset{\text{... then run this action on it}}{\underbrace{\left\{ \text{action}\right\} }}$$
Our program above is a special case: if no pattern is given, every record matches.
To remove the header of the results we can simply add a condition to the program: Don't act on the first line (the record where the record number NR is 1).
NR!=1 {print $3}
10
10
10
10
10
10
10
10
150
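Patterns are not limited to NR. As a sketch of two more variants, the following uses a shortened copy of the table (assumed to be comma-separated, with a header line): once with a pattern on a field, and once with a pattern but no action, in which case awk prints the whole matching record.

```shell
# Shortened copy of the table, just enough rows for the example.
data='player,team,points
Johnson,Griffindor,10
Flint,Slytherin,10
Potter,Griffindor,150'

# Pattern and action: print the player of every Slytherin record.
slytherin=$(printf '%s\n' "$data" | awk -F, '$2 == "Slytherin" {print $1}')

# Pattern without action: the default action prints the whole record.
# NR != 1 keeps the header out, where $3 is the text "points" and not a number.
big=$(printf '%s\n' "$data" | awk -F, 'NR != 1 && $3 > 100')

echo "$slytherin"   # Flint
echo "$big"         # Potter,Griffindor,150
```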
Sum of a column …
Now let's get the sum of all the points scored by the players. Our first thought here is to pipe the result into a program called sum. But interestingly enough there is no such program for adding up numbers in the world of command lines, as far as I know (the sum command that does exist computes checksums). It seems like our ancestors with their great beards were fine with writing their own version when they saw the need to. And when they did, they probably used awk.
NR!=1 {points+=$3}
END{print points}
230
Two things should be noted here:
- the line marked with END is only triggered after all the input has been processed. There is a similar keyword (BEGIN) which allows code to be executed before any input is read. (D'oh!)
- the variable points did not have to be initialized. awk is full of sensible defaults, and one of them is that a variable with no value is treated as 0 when it is used as a number.
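Both points can be seen together in one sketch. It assumes the same comma-separated data with a header line; FS is awk's built-in field separator variable (set here in BEGIN where -F, was used on the command line before), and n is just a counter introduced for this example.

```shell
# The table from above as comma-separated text (assumed layout).
data='player,team,points
Johnson,Griffindor,10
Spinnet,Griffindor,10
Flint,Slytherin,10
Flint,Slytherin,10
Flint,Slytherin,10
Flint,Slytherin,10
Flint,Slytherin,10
Flint,Slytherin,10
Potter,Griffindor,150'

result=$(printf '%s\n' "$data" | awk '
BEGIN   { FS = "," }                              # runs once, before any input is read
NR != 1 { points += $3; n++ }                     # points and n implicitly start at 0
END     { print n " scores, " points " points" }  # runs once, after the last record
')

echo "$result"   # 9 scores, 230 points
```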
… grouped by another column
To get the result of the Quidditch match we need to sum the points for every team separately. We can use arrays for that. In awk they are indexed by arbitrary strings, so other languages would call them a dict or a map.
NR!=1 {points[$2]+=$3}
END {for (key in points) { print key " " points[key]}}
Griffindor 170
Slytherin 60
And here we have our final score of the game.
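The same array idiom works for counting. As a sketch on the same (assumed comma-separated) data, this counts how many scoring events each player had. The array name goals is made up for the example. Since awk does not promise any particular order for a for (key in array) loop, the output is piped through sort to make it stable.

```shell
# The table from above as comma-separated text (assumed layout).
data='player,team,points
Johnson,Griffindor,10
Spinnet,Griffindor,10
Flint,Slytherin,10
Flint,Slytherin,10
Flint,Slytherin,10
Flint,Slytherin,10
Flint,Slytherin,10
Flint,Slytherin,10
Potter,Griffindor,150'

per_player=$(printf '%s\n' "$data" | awk -F, '
NR != 1 { goals[$1]++ }                         # one counter per player name
END     { for (p in goals) print p, goals[p] }  # iteration order is unspecified
' | sort)

echo "$per_player"
```

With the table above, Flint comes out with 6 events and the three Griffindor players with 1 each.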
I know of only one other language that can do this sort of processing with so little code: SQL. But SQL only works on databases, so it needs a big setup to be useful. So when I find myself with a bunch of text files in a unixoid environment, awk is the way to go.
Oh, and if you think Griffindor totally dominated this match, here is what the score would have looked like if Harry hadn't caught the snitch (which happened more or less by luck).
NR!=1 && $1 !~ "Potter" {points[$2]+=$3}
END {for (key in points) { print key " " points[key]}}
Griffindor 20
Slytherin 60
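One detail about the pattern used here: !~ is a regular expression test, so it excludes every name that contains "Potter", not only the exact name. For this table that makes no difference. The sketch below shows where it would, using a made-up second player "Potterson" who is not in the book or in the data.

```shell
# "Potterson" is a made-up player, only there to show the difference.
data='player,team,points
Potter,Griffindor,150
Potterson,Griffindor,10'

# Regex test: drops every name that contains "Potter".
regex=$(printf '%s\n' "$data" | awk -F, 'NR != 1 && $1 !~ "Potter" {n++} END {print n + 0}')

# String comparison: drops only the exact name "Potter".
exact=$(printf '%s\n' "$data" | awk -F, 'NR != 1 && $1 != "Potter" {n++} END {print n + 0}')

echo "$regex"   # 0 records left
echo "$exact"   # 1 record left
```

The n + 0 in the END block forces the never-assigned counter to print as 0 instead of as an empty string.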