Back to the Classics: Awk

In this age of npm and GitHub and easily available modules in any language of your choice, it is easy to forget the old Unix workhorses. Here’s a look at awk, a shell utility that allows you to treat and manipulate text files as if they were databases. In Part 2 there are a few sample scripts.

What is awk?

Awk is both the name of the command line utility and the language used for it. It was invented at Bell Labs at the peak of punk rock, 1977, and its name is simply the initials of its three creators. Awk reads input (a file, or a stream) one line (one “record”) at a time, splits it into fields by blank space (these are all defaults that can be changed), and then uses the instructions in the awk language to manipulate these fields and generate some output. The ability to read files as streams is a big plus - as long as the program doesn’t store the lines it reads, the memory footprint is the same whether you read a file of 1 KB or 200 TB; a larger file will just take longer.

Awk ships with OS X and with most other Unix-like systems. There is another widespread variant - gawk, GNU awk. It is arguably better than the original, because it offers array sort and length functions, the ability to include files, and more flexible rules for splitting input into fields. Here I will limit myself to standard awk.

Example awk in action

Here’s what the simplest awk program looks like - this is basically cat

# awk loads the short program {print} and waits for the user to type something
$ awk '{print}'
# as you type, the terminal echoes what you are typing. Awk only sees
# the line once you hit <RETURN>
It was a bright cold day in April, and the clocks were striking thirteen.
# now awk kicks in and runs the program on the input
# {print} simply prints the input line as it is, so here it is again
It was a bright cold day in April, and the clocks were striking thirteen.
$

The strong point of awk is that it automatically splits lines of text as if they were “columns” in a spreadsheet and assigns each column to a variable (a “field”). Then you can manipulate them and spit them out.

# awk loads a slightly more complex program and waits
$ awk '{print $3 ": " $1 + $2}'
# awk is now waiting for a line of input
10 20 Toronto
# this line is split into 3 "columns", and
# 10 is assigned to $1, 20 to $2, and Toronto to $3
# then the program {print $3 ": " $1 + $2} is run - it adds $1 and $2 and
# prints the result out, with some extra text (the ": ")
Toronto: 30
# now it waits for the next line
20 30 Miami
# same program run on it
Miami: 50
$

Despite its simplicity, you can take awk quite far - for example creating a random sci-fi plot generator.

Running awk programs and redirecting input, output

Running awk on STDIN is not very useful, but of course you can use Unix magic to redirect the input and / or output of the program

# awk will treat the second argument as a path to a file to read from
$ awk '{print}' some_data.txt
... # prints whatever was in some_data.txt
$
# exactly the same thing but done differently - redirecting file to STDIN
$ awk '{print}' < some_data.txt
... # prints whatever was in some_data.txt
$
# you can read several files, in order
$ awk '{print}' some_data.txt more_data.txt
...
$
# now the processed data goes to a separate file
$ awk '{print}' some_data.txt > result.txt
$
# the awk program itself can be loaded to a file - here this file is created
$ echo '{print}' > awk.txt
# passing the command on to awk with the -f option
$ awk -f awk.txt some_data.txt > result.txt
$
# mixing STDIN with files. The "-" stands for STDIN, which is read
# after some_data.txt
$ ls -l | awk '{print}' some_data.txt - more_data.txt
... # prints all lines from some_data.txt
... # prints result of ls -l (this is the "-")
... # prints all lines from more_data.txt
$
# pass some text into awk, then run an awk program on it
$ echo '1 2 3' | awk '{print}'
1 2 3
$
# using the curl util to download a csv file, piping it to awk, and running
# the simple awk program on it
$ curl http://is.gd/eUrbOZ | awk '{print}'
Forename,Surname,Description on ballot paper,Constituency Name,PANo,Votes,Share
... # etc

Anatomy of an awk program

So far the awk examples consisted of simple one-liners - but awk programs can consist of several instructions (“actions”). You can still write them out on the shell:

# note: the ">" is added automatically when hitting return inside a '',
# and the space between > and { was added manually to make it line up
$ awk '{print}
>      {print}
>      {print}'
# now that the closing ' was typed, awk kicks in. this program simply
# prints out whatever you type three times
oh # typed by you
oh # printed by awk 3 times
oh
oh

In this tutorial I will put the awk program in its own file and load it from the command line - just to make formatting easier and allow comments. The file loaded here has suffix “.awk” but that’s irrelevant, it could be any filename.

$ awk -f example.awk some_input_text.txt

An awk program consists of a list of actions, one after the other, and typically one per line (they can be broken up though). There are two special types of actions - BEGIN actions are executed only once, before the text is scanned, and END only once, afterwards. All other actions are executed in order on every line of text. Assume your input file includes increasing integers, one per line

1
2
3

Then the program below

BEGIN { print "START!" }
{print "--------------"}
{print}
{print}
{print}
END { print "END!" }

Would produce

START!
--------------
1
1
1
--------------
2
2
2
--------------
3
3
3
END!

Note that actions can be in any order (they will be executed in the order they are written) and there can be multiple BEGIN and END, so the following is also a legal program.

{print "--------------"}
{print}
END { print "END!" }
BEGIN { print "START!" }
{print}
END { print "Copyright 2005" }
{print}

The way the program is dealt with is:

  • all the BEGIN actions are executed, in order
  • input is read one line at a time, and for each line
    • the line is split into fields
    • each action, in turn, is run on the fields
  • all the END actions are run at the end
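To see that order in practice, here is a small sketch that feeds a two-line input (made up for this illustration) to the reordered program above. The result is captured in a shell variable only so that it can be printed in one go.

```shell
# Run the reordered program on two input lines: "1" and "2"
out=$(printf '1\n2\n' | awk '
{print "--------------"}
{print}
END { print "END!" }
BEGIN { print "START!" }
{print}
END { print "Copyright 2005" }
{print}
')
printf '%s\n' "$out"
# START!
# --------------
# 1
# 1
# 1
# --------------
# 2
# 2
# 2
# END!
# Copyright 2005
```

BEGIN runs first even though it is written in the middle of the program, and the two END actions run in the order they appear.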

Inside the actions awk offers what most programming languages offer - variables, loops, tests, etc.

Actions formatting

Awk follows Unix conventions on most things, so in case of doubt whatever works in Bash scripts tends to work.

# a 'normal' one-liner
BEGIN { print "START" }
# you can add newlines for formatting - this is equivalent to the above
BEGIN {
  print "START"
}
-> START
-> START

# as in Bash scripts, you can use the semicolon to separate multiple statements
# on the same line...
BEGIN { print "STA"; print "RT";}
# or you can write them one per line, with or without semicolon
BEGIN { print "STA"
        print "RT" }
-> STA
   RT
-> STA
   RT

Variables

Awk makes several variables available to the programs - some are loaded when the program is launched, some are updated with each line read, some are created by the program itself.

Field Variables

Whenever awk reads a line, it splits it into “fields” by white-space / tab (this is the default and can be overridden). Then each field is copied to a variable $1, $2, … in order - there is no limit. Additionally, $0 contains the whole line.

# assume this file
1 2 3 4 5 6 7

# the following two lines are equivalent
{ print }
{ print $0 }
-> 1 2 3 4 5 6 7
-> 1 2 3 4 5 6 7

# only prints some fields we are interested in
{ print $1 " " $3 }
-> 1 3

The field number doesn’t have to be a constant - it can be an expression or a variable. For example, the global variable NF contains the number of fields in the current record (which is also the number of the last field) and is updated with every line read. So if there are 7 fields, NF would be 7, and $NF would be $7, i.e. the last field.

# assume this file
1 2 3 4 5 6 7

# both mean first and last field - but the first version only works if there are
# 7 fields, the second always works
{print $1 " " $7}
{print $1 " " $NF}
-> 1 7
-> 1 7
#
# print the last two fields
{print $(NF-1) " " $NF}
-> 6 7

Another useful global variable that gets updated for each record is NR - this is the record number

# feed a four line input into awk
$ echo 'a
> b
> c
> d' | awk '{print NR ") " $1}'
# it prints the line number, ), and the first (and only) field
1) a
2) b
3) c
4) d

You can assign to a field variable with the ‘=’ operator, thereby changing the record content:

# adding something to a field - this only works if it's a number
{$3 = ($3 + 100)
# now print the updated line
print $0}
-> 1 2 103 4 5 6 7

If you assign to a field variable that doesn’t exist, it will be added to the record

# the record only contains $1 and $2;
$ echo 1 2 | awk '{print $0}'
1 2
$
# the program adds two new fields
$ echo 1 2 | awk '{$3 = 3; $4 = 4; print $0}'
1 2 3 4

Global variables

A few variables are set when the program is launched. Here’s a very short list - if you need to play with these you probably want to get yourself a book on awk.

Variable    Description
ARGV        array of command line arguments (ARGV[0] is the command name)
ARGC        number of command line arguments
ENVIRON     associative array holding the environment; contents depend on the system
FILENAME    name of the file currently being read
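As a quick sketch of what these variables hold, the commands below print them. The file names first.txt and second.txt and the variable HYPOTHETICAL_VAR are made up for the illustration; because the first program only has a BEGIN action, awk never tries to open the two files.

```shell
# ARGC counts the command name too, so two arguments give ARGC = 3
out=$(HYPOTHETICAL_VAR=hello awk 'BEGIN {
  print "ARGC = " ARGC
  for (i = 1; i < ARGC; i++)
    print "ARGV[" i "] = " ARGV[i]
  print "ENVIRON = " ENVIRON["HYPOTHETICAL_VAR"]
}' first.txt second.txt)
printf '%s\n' "$out"
# ARGC = 3
# ARGV[1] = first.txt
# ARGV[2] = second.txt
# ENVIRON = hello

# FILENAME is only meaningful once awk has started reading a file
tmp=$(mktemp)
printf 'some text\n' > "$tmp"
name=$(awk 'NR == 1 { print FILENAME }' "$tmp")
printf '%s\n' "$name"   # prints the path of the temporary file
rm -f "$tmp"
```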

User defined variables

To create your own variables, just start assigning to them with the ‘=’ operator - until then awk treats them as an empty string (which becomes a 0 if used in numeric context). The type of a variable is dynamic and can vary during its lifetime.
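A minimal sketch of that behaviour, with variable names (never_set, v) invented for the illustration:

```shell
out=$(awk 'BEGIN {
  print "[" never_set "]"   # used as a string: empty
  print never_set + 0       # used as a number: 0
  v = "7 apples"            # v holds a string...
  v = v + 1                 # ...and now a number; only the leading 7 is used
  print v
}')
printf '%s\n' "$out"
# []
# 0
# 8
```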

In the example below, awk is used on the ls command to find the total size of a folder.

# ls -la returns listings in the form:
# -rw-rw-r--    1 gotofritz staff   1513 Dec 15  2013 .bash_profile
# awk simply collects each filesize and adds it to a running total,
# then prints it at the end
$ ls -la | awk '    { total += $5 }
                END { print total }'
-> 158448

Arrays

Awk has associative arrays, similar to PHP’s or Javascript. You create an array by using it, no need to initialize it.

# assume this file
10 Life changes fast
20 Life changes in the instant
30 You sit down to dinner and life as you know it ends
40 The question of self-pity

# creates my_array, keyed by the first field, and stores the rest of each line
{ key = $1
  sub(/^[^ ]+ +/, "")   # strip the key and the spaces after it from $0
  my_array[key] = $0 }
# result - note that the array is sparse
my_array[10] = "Life changes fast"
my_array[20] = "Life changes in the instant"
my_array[30] = "You sit down to dinner and life as you know it ends"
my_array[40] = "The question of self-pity"

# string keys are also possible
{ my_array["name"] = "Homer" }

One thing that is different in awk is that multidimensional arrays use a single set of square brackets to wrap both indices

# assume this file
dad homer
mum marge
son bart

# creates a two dimensional array
{ family["simpsons",$1] = $2 }
-> creates
family["simpsons","dad"] = "homer"
family["simpsons","mum"] = "marge"
family["simpsons","son"] = "bart"
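Under the hood a two-index key is a single string: the indices joined by the built-in variable SUBSEP. The sketch below, which reuses the family data from above, shows a membership test on a two-part key and how a key can be split back into its parts. The output is piped through sort because the for-in loop visits keys in no guaranteed order.

```shell
out=$(printf 'dad homer\nmum marge\nson bart\n' | awk '
{ family["simpsons", $1] = $2 }
END {
  # membership test on a two-part key
  if (("simpsons", "mum") in family)
    print "mum is " family["simpsons", "mum"]
  # split each key back into its two indices
  for (key in family) {
    split(key, part, SUBSEP)
    print part[1] "/" part[2] " -> " family[key]
  }
}' | sort)
printf '%s\n' "$out"
# mum is marge
# simpsons/dad -> homer
# simpsons/mum -> marge
# simpsons/son -> bart
```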

Note that arrays in awk are pretty awkward. There are no built in functions to deal with them except for the for … in loop. If you need to sort, or even just find out the length, you’ll have to write your own functions. There are a couple in Part 2. Alternatively, use gawk which has better array handling.
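As a rough idea of what such a hand-written helper looks like, here is a sketch of a length function. The name array_length is made up; the extra parameters k and n are never passed by the caller and simply act as local variables, which is a common awk idiom.

```shell
out=$(printf 'x\ny\nx\nz\n' | awk '
# count the keys of an array by walking over them
function array_length(arr,    k, n) {
  n = 0
  for (k in arr) n++
  return n
}
{ seen[$1]++ }
END { print array_length(seen) " distinct values" }
')
printf '%s\n' "$out"
# 3 distinct values
```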

Regular expressions

A regular expression (regexp) is a mini programming language which is used to describe variable strings; it is embedded in most programming languages. Regexps are enclosed in slashes and use a combination of literal characters and punctuation to describe strings. The operator ~ is used to match a regexp, and !~ to ensure it is not matched.

Regular expressions are a complicated topic of their own; here is just a quick introduction

# print all lines with "gmail" in the 1st field
{ if ($1 ~ /gmail/) print}

# prints all lines EXCEPT those with "gmail"
{ if ($1 !~ /gmail/) print}

# ^ indicates start of string.
# this matches "tom" "tomato" but not "atom"
{ if ($1 ~ /^tom/) print}

# $ indicates end of string
# this matches "tom", "atom" but not "tomato"
{ if ($1 ~ /tom$/) print}

# this matches "tom", not "atom" and not "tomato"
{ if ($1 ~ /^tom$/) print}

# . matches any character.
# this matches "bear" "boar" but not "bar"
{ if ($1 ~ /b..r/) print}

# [ABC] matches one character from the set "A", "B", "C"
# this matches "boar" "bear" but not "blar"
{ if ($1 ~ /b[oe]ar/) print}

# [^ABC] matches one character which is anything except "A", "B", "C"
# this matches "blar" but neither "boar" nor "bear"
{ if ($1 ~ /b[^oe]ar/) print}

# (abc) groups the expression abc as a unit.
# | is an "or"
# \ is used to escape special characters, i.e. treat them as normal characters
# in this case we want to treat the '.' as a period and not "any character"
# the following matches @gmail.com or @yahoo.com
{ if ($1 ~ /@(gmail|yahoo)\.com/) print}

# * means repeat zero or more. + is repeat once or more. ? is repeat 0 or 1
{ if ($1 ~ /<[^>]+>[^<]*<\/[^>]+>\.?/) print }
# the pattern above matches < followed by one or more (+) of anything except >, then >
# then zero or more (*) of anything except <
# then </ followed by one or more (+) of anything except >, then >
# (the / is escaped as \/ so that it does not end the regexp)
# then an optional .

Statements, operators, and functions

Control statements

Awk has the usual loops and conditionals familiar from C. Braces are optional for single nested statements

# braces are optional for single statements
for (name in list_of_names)
  print name

for (capital_city in country) {
  print capital_city
}

# but needed for multiple statements
if (NR % 2 == 0) {
  $2 = $1 * 2
  print $0
}

if-else

Awk doesn’t have booleans. Instead it treats the number 0 and the empty string “” as false, and any other value (including the string constant “0”) as true. Note that a field read from input which looks like a number, such as 0, counts as a number, so it is false. The comparison operators are the familiar ones, with double equal sign for equality, plus the tilde ~ and !~ for regular expression matching, and “in” for array existence

{ if ($1 == "full") ... }
{ if ($2 < 0.5) ... }
{ if ($0 ~ /Republican/) print $0 }   # matches regexp
{ if ($1 !~ /Completed/) print $0 }   # rejects regexp
{ if (capital_city in country) print country[capital_city] }
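The true / false rules can be checked with a short sketch like the one below. It is fed a single input line containing 0, to show how a field that looks like a number differs from the string constant "0".

```shell
out=$(echo '0' | awk '{
  print "number 0: "            (0   ? "true" : "false")
  print "empty string: "        (""  ? "true" : "false")
  print "string constant \"0\": " ("0" ? "true" : "false")
  print "field containing 0: "  ($1  ? "true" : "false")
}')
printf '%s\n' "$out"
# number 0: false
# empty string: false
# string constant "0": true
# field containing 0: false
```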

loops

Awk has both for and while loops (including do-while). Additionally, there is the for-in loop for sparse arrays

# assume file
1 10 100
2 20 200

# both these programs will print each line with fields back-to-front
# while loop version...
{ i = NF
  line = ""
  while (i) {
    # only put a space between fields, not before the first one
    line = (line == "" ? $i : line " " $i)
    i--
  }
  print line
}
# for loop version
{ line = ""
  for (i=NF; i>0; i--) {
    line = (line == "" ? $i : line " " $i)
  }
  print line
}
-> 100 10 1
   200 20 2

# puts each line of input into the array, keyed by line number
{ lines[NR] = $0 }
# at end prints all the lines
# (for-in visits the keys in no guaranteed order)
END {
  for (n in lines)
    print lines[n]
}

break and continue statements are available to exit a loop prematurely or to skip an iteration, respectively.
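A small sketch of both, on a made-up line of numbers: negative fields are skipped with continue, and the loop stops altogether at the first zero with break.

```shell
out=$(echo '5 -1 7 0 9 3' | awk '{
  sum = 0
  for (i = 1; i <= NF; i++) {
    if ($i < 0) continue    # skip negative fields
    if ($i == 0) break      # stop at the first zero
    sum += $i
  }
  print sum
}')
printf '%s\n' "$out"
# 12
```

Only 5 and 7 are added: -1 is skipped, and 9 and 3 are never reached.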

next

next is used to stop processing a record and move on to the next one

{ if ($5 == "") next }
{ print $5 $4 }

Awk numeric operators

The usual maths operators can be used: +, -, /, *, ++, --, plus % for modulus and ^ for exponentiation. Unary + converts its operand to a number

$ echo "1
> 2
> 3
> 4" | awk '{print $1 ^ 2}'
1
4
9
16

String concatenation

Concatenating strings in awk is slightly weird. There is no string concatenation operator; just put the strings next to each other. Because of that it is recommended to use parentheses except for trivial cases. Alternatively, print can take multiple comma separated arguments - and they will be printed with a space separating them

# assume this file
1
2
3
4

# the strings ($1+2) and "a" are concatenated (no space between them) and
# the resulting string is passed to print
{print ($1+2) "a"}
3a
4a
5a
6a

# two separate strings are passed to print -  a space is put between them
{print ($1+2), "a"}
3 a
4 a
5 a
6 a

# string concatenation works for variables too
{ something = $1 "--"
  print something }
1--
2--
3--
4--
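The advice about parentheses comes down to precedence: arithmetic binds tighter than concatenation. A short sketch with made-up values shows how the same tokens give different results depending on the grouping.

```shell
out=$(awk 'BEGIN {
  a = 2 " " 3 + 4       # + binds tighter, so this is 2 " " (3 + 4)
  b = (2 " " 3) + 4     # "2 3" is converted to the number 2, then 4 is added
  print a
  print b
}')
printf '%s\n' "$out"
# 2 7
# 6
```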

Built in functions

There are a number of built in functions: numeric ones like cosine, square root, random; string functions like substring or string length; gawk and some other implementations add time functions and bitwise functions. You can easily find out what your version offers by looking at the output of man awk.
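Here is a small sampler of standard functions, as a sketch on a made-up input line; the comments show what each call prints.

```shell
out=$(echo 'hello world' | awk '{
  print length($1)              # 5
  print substr($1, 2, 3)        # ell
  print index($0, "world")      # 7
  print toupper($2)             # WORLD
  print sqrt(16)                # 4
  print int(7 / 2)              # 3
  n = split("a:b:c", parts, ":")
  print n, parts[3]             # 3 c
  gsub(/o/, "0")                # replace every o in $0
  print                         # hell0 w0rld
}')
printf '%s\n' "$out"
```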

Worth noting that besides print awk also offers printf, i.e. “print formatted”. Printf is common to many Unix tools and languages. You give a string with some placeholders and rules, and then you pass variables to “plug in” those placeholders. The important thing is the rules, which control things like right alignment, decimal precision, zero padding for numbers, etc. A statement looks like this:

{ printf "%-10s %07.3f%% \n", $1, $2 }
# placeholders start with %
# %-10s is a string (s), and is left aligned (-) within a field 10 characters wide (10)
# %07.3f is a decimal number or float (f), the total length has to be at least 7
#        characters (7) and is padded with zeroes if too small (0) and has
#        3 decimals (3), which leaves at least 3 digits for the integer part
#        (7 - 3 decimals - the point)
# %% if you want to print an actual %, you need to type it twice %%
# \n you need to supply the new line manually

More information on printf.
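To see the rules at work, here is a sketch that runs a format with a %07.3f placeholder on two made-up records (no space before the newline here, to keep the output free of trailing blanks).

```shell
out=$(printf 'apples 3.14159\nkiwi 42\n' |
      awk '{ printf "%-10s %07.3f%%\n", $1, $2 }')
printf '%s\n' "$out"
# apples     003.142%
# kiwi       042.000%
```

The names are padded with spaces on the right to 10 characters, and the numbers are rounded to 3 decimals and zero padded on the left to 7 characters.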

User defined functions

You can define functions anywhere in your code, outside actions. They are pretty similar to Javascript

# define function outside rules - could be at the bottom of file
function my_func(field_content) {
  print "FIELD: " field_content
}

# now use in rules
{my_func($1)}

Patterns

Previously I described an awk program as a series of actions, with the special case of BEGIN and END. That’s not entirely correct. An awk program consists of a sequence of actions and optional patterns; BEGIN and END are two special patterns. Incidentally, gawk also has BEGINFILE and ENDFILE, for when processing more than one file at a time.

BEGIN and END are special because they identify actions which are not executed for every line of input, but before or after the whole input is processed. The other patterns are used on every line to determine whether the action should be run for that particular line or not. Patterns are expressions that return false (i.e., 0 or “”) or true (anything else). When the pattern returns true, the action is executed.

Regular expression patterns

Regular expressions can be used as patterns; they are matched against the entire line. An exclamation mark reverses the match. Boolean operators can be used to combine patterns

# print lines with an email address
# (very lazy match - will only work if all email addresses are well formed)
/@/ { print $3}

# prints all lines except those with a gmail address
! /@gmail\./ { print $0 }

# prints lines with an @ and the sequence 0160
/@/ && /0160/ { print }

The regular expressions above are a shortcut for $0 ~ /pattern/, i.e. “apply the regexp to whole line”. Similar rules can be made for individual fields…

# matches the regexp against one field only
$1 ~ /Anthony/ { print }

..and all expressions seen so far

# print even lines
NR % 2 == 0 { print }

# print only if length of 1st field is greater than 3
# length is a string function mentioned above
length($1) > 3 { print }

The reason we have been able to run programs without patterns is that there is a special pattern, the empty pattern, which matches every line. In fact we could have a program which is just a pattern; the default action {print} would be executed.

# prints whole line, default action
$1 == "complete"

Splitting records and fields differently from default

By default awk treats each line as a record. In reality what it does is to split the input by a record separator, stored in the variable RS, which happens to be the new line character. You can change that in an awk program.

# separate records by semicolon
# (printf rather than echo, so no trailing newline ends up in the last record)
$ printf "1 2 3;4 5 6;7 8 9" | awk 'BEGIN {RS = ";" }
>                                   {print}'
1 2 3
4 5 6
7 8 9

Something similar is possible with the field separator, which is stored in the variable FS. By default it behaves like the regexp [ \t\n]+, i.e. any number of consecutive whitespace characters of any type. Note that in reality awk cheats - the default FS is a single space, which awk treats as a special case: besides splitting on [ \t\n]+, it also ignores leading and trailing whitespace in the record.

# separate fields by comma and print the second field of each line
$ echo "1,2,3
> 4,5,6
> 7,8,9" | awk 'BEGIN {FS = "," }
>               {print $2}'
2
5
8
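The special treatment of the default separator is easiest to see by counting fields. This sketch compares the default FS with the regexp [ ] (exactly one space) on a made-up line that has leading, repeated and trailing spaces.

```shell
# default FS: runs of spaces count as one separator, the ends are trimmed
default_nf=$(echo '  a   b  ' | awk '{ print NF }')
# FS set to a regexp for exactly one space: every single space separates
literal_nf=$(echo '  a   b  ' | awk -F'[ ]' '{ print NF }')
printf '%s %s\n' "$default_nf" "$literal_nf"
# 2 8
```

With the one-space regexp each of the seven spaces separates two fields, most of them empty, which gives 8 fields instead of 2.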

You can combine the two if, for example, your data has one field per line and records are separated by blank lines - an empty RS means “records are separated by one or more blank lines”

# assume this data
homer simpson
dad

marge simpson
mum

# separate records by any number of newlines, and have one field per line
BEGIN {RS=""; FS="\n"}
{ print $1 " (" $2 ")" }

-> homer simpson (dad)
   marge simpson (mum)

Passing options to awk

A field separator can also be passed to an awk program from the command line, in two ways. First of all, awk has a special option for it, -F, with the separator following directly after it (there is no equivalent option for the record separator). Secondly, awk allows passing variables with the -v syntax, so you could just pass FS that way.

# change separator from within program
BEGIN {FS = "," }

# pass separator with special option -F - note that you don't need quotes
$ echo "1,2,3
> 4,5,6" | awk -F, '{print $2}'
2
5

# pass separator as external var with -v
$ echo "1,2,3
> 4,5,6" | awk -v FS="," '{print $2}'
2
5

# in fact you can pass any variable of your choice with -v
$ echo "" | awk -v WHAT="grow up" '{print "All children, except one, " WHAT}'
All children, except one, grow up

Reading CSV files in awk

The naive approach would be to simply set FS="," - but that doesn’t cover the fact that some fields are surrounded by quotation marks and others aren’t, and sometimes you have newlines and / or commas inside a field. Here are some example scripts people have put together to solve these issues. They are also good examples of fairly complex awk scripts.
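To make the problem concrete, here is a sketch of the naive approach on a made-up record whose first field is quoted and contains a comma.

```shell
# three CSV fields, but the quoted one contains a comma
out=$(echo '"Simpson, Homer",dad,39' | awk -F, '{ print NF; print $1 }')
printf '%s\n' "$out"
# 4
# "Simpson
```

Awk sees four fields instead of three, and the first one is cut off in the middle of the quoted name.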

Personally I think that’s taking things too far - if you have to force awk to create arrays to store manipulated record fragments you may as well use a fully fledged scripting language.

Another approach is to use gawk and its FPAT variable.

Learning more about awk

With that, all the main awk topics have been touched on. If you want to go deeper I recommend The AWK Manual, or one of the O’Reilly books.
