A Selection of Akw Scripts

Comments

Some examples scripts to complement the previous awk tutorial

All the scripts here assume the fields in the data is space separated - i.e. UK London 10000 etc. There hasn’t been a CSV standard until recently, but the one you get from MS Office doesn’t work too well with awk, because CSV is basically a rubbish format and hard to handle. But if you have a “well behaved” CSV file, i.e. one which doesn’t have commas, quotation marks, or new lines inside fields, e.g UK,London,10000 then you can easily process it by passing -F”,” to the awk call:

1
awk -F"," -f my_awk_script.awk some_input_data.txt

In practice unless you generate the data yourself there is always going to be the odd comma or quotation mark in your data somewhere, so I find it easier to convert the data to tab separated using a free online CSV to TSV converter like this. I then call awk with “\t” as the separator, because some field may have spaces inside them.

The command below is what I use - my awk program is in my_awk_script.awk, the data is in uk_electoral_data_converted.csv, and the results goes into awk_output.txt.

1
awk -F"\t" -f my_awk_script.awk uk_electoral_data_converted.csv > awk_output.txt

Sample data: UK election results

Some of the scripts will use data from the 2015 UK election in CSV format as data, converted to TSV. Here’s what it looks like:

1
2
3
4
5
Forename  Surname Description on ballot paper Constituency Name PANO  Votes Share (%) Change  FIELD9  Incumbent?  FIELD11 Constituency ID Region ID County  Region  Country Constituency type Party name identifier Party abbreviation
Gerald  Howarth The Conservative Party Candidate  Aldershot 7 23369 50.6  3.9   MP    E14000530 E12000008 Hampshire South East  England Borough Conservative  Con
Gary  Puffett Labour Party  Aldershot 7 8468  18.3  6.2       E14000530 E12000008 Hampshire South East  England Borough Labour  Lab
Bill  Walker  UK Independence Party (UKIP)  Aldershot 7 8253  17.9  13.4        E14000530 E12000008 Hampshire South East  England Borough UK Independence Party UKIP
...

And here are the field names in order

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
 1 Forename
 2 Surname
 3 Description on ballot paper
 4 Constituency Name
 5 PANO
 6 Votes
 7 Share (%)
 8 Change
 9 --
10 Incumbent?
11 --
12 Constituency ID
13 Region ID
14 County
15 Region
16 Country
17 Constituency type
18 Party name identifier
19 Party abbreviation

Skipping header row

1
2
3
4
5
6
# put NR > 1 in front of every action to skip the header row
NR > 1 { print }

# result
Gerald  Howarth The Conservative Party Candidate  Aldershot ...
...

Skipping empty records

1
2
# check if the first field is empty
NF { print }

Rearranging fields, and skipping some

1
2
3
4
5
6
7
8
9
10
11
12
NR > 1 && NF { print $4 ": " $2 " " $1 " (" $NF ") " $7 "% " }
# NR > 1    ignore header
# NF  ignore empty record
# print constituence name, name surname, party abbreviation, share of the vote
# ignore otehr fields

Aldershot: Howarth Gerald (Con) 50.6%
Aldershot: Puffett Gary (Lab) 18.3%
Aldershot: Walker Bill (UKIP) 17.9%
Aldershot: Hilliar Alan (LD) 8.8%
Aldershot: Hewitt Carl (Green) 4.4%
...

Print sum of all fields

Finds the national Conservative vote

1
2
3
4
5
6
7
8
$NF == "Con" {total += $6}
END          {print total}
# $NF == "Con"  if a record concerns a tory vote
# {total += $6} add it to a running total
# END           when all records are processed
# {print total} output total

11299609

Work out column averages

Find the total vote of the 6 larger parties and their % of the national vote

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
# keep a running total
NR > 1 && NF  {total += $6}
# keep a total for each party - don' do anything yet
$NF == "Con"        {total_con += $6}
$NF == "Lab"        {total_lab += $6}
$NF == "UKIP"       {total_ukip += $6}
$NF == "LD"         {total_ld += $6}
$NF == "Green"      {total_green += $6}
$NF == "SNP"        {total_snp += $6}
# print a report at the end
END {
  print "TOTAL: " total
  print "Con:   " (100 * total_con / total) "%"
  print "Lab:   " (100 * total_lab / total) "%"
  print "UKIP:  " (100 * total_ukip / total) "%"
  print "LD:    " (100 * total_ld / total) "%"
  print "SNP:   " (100 * total_snp / total) "%"
  print "Green: " (100 * total_green / total) "%"
}

# output:
TOTAL: 30697255
Con:   36.8098%
Lab:   30.449%
UKIP:  12.6431%
LD:    7.87014%
SNP:   4.738%
Green: 3.77112%

Using functions

The same as above, but without copy and paste code

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
# abstracting copy-and-paste code into a function
function print_party_percentage(party_name, party_vote, total_vote) {
  print party_name "   " (100 * party_vote / total_vote) "%"
}
# same program as before
NR > 1 && NF   {total += $6}
$NF == "Con"   {total_con += $6}
$NF == "Lab"   {total_lab += $6}
$NF == "UKIP"  {total_ukip += $6}
$NF == "LD"    {total_ld += $6}
$NF == "Green" {total_green += $6}
$NF == "SNP"   {total_snp += $6}
END {
  print "TOTAL: " total
  print_party_percentage("Con", total_con, total)
  print_party_percentage("Lab", total_lab, total)
  print_party_percentage("UKIP", total_ukip, total)
  print_party_percentage("LD", total_ld, total)
  print_party_percentage("SNP", total_snp, total)
  print_party_percentage("Green", total_green, total)
}

# output - looks messier because previous program was manually formatted
TOTAL: 30697255
Con   36.8098%
Lab   30.449%
UKIP   12.6431%
LD   7.87014%
SNP   4.738%
Green   3.77112%

Formatting numerical precision and alignment with printf

Same as above, but using printf for formatting

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
function print_party_percentage(party_name, party_vote, total_vote) {
  printf "%5s: %4.1f%%\n", party_name, (100 * party_vote / total_vote)
  # %5s a string (s) of fixed width 5 or more (5) aligned right (if it was -5 it would be left)
  # %4.1f a number (f) with one decimal (.1) and total width 4 (4) aligned right (4)
  # %% an actual %
}
# same program as before
NR > 1 && NF   {total += $6}
$NF == "Con"   {total_con += $6}
$NF == "Lab"   {total_lab += $6}
$NF == "UKIP"  {total_ukip += $6}
$NF == "LD"    {total_ld += $6}
$NF == "Green" {total_green += $6}
$NF == "SNP"   {total_snp += $6}
END {
  print "TOTAL: " total
  print_party_percentage("Con", total_con, total)
  print_party_percentage("Lab", total_lab, total)
  print_party_percentage("UKIP", total_ukip, total)
  print_party_percentage("LD", total_ld, total)
  print_party_percentage("SNP", total_snp, total)
}

# output
TOTAL: 30697255
  Con: 36.8%
  Lab: 30.4%
 UKIP: 12.6%
   LD:  7.9%
  SNP:  4.7%

Using arrays to group data

There is still some copy and past code because we are hardcoding the parties. We can use arrays to group whatever parties we find.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
# same formatting function as before
function print_party_percentage(party_name, party_vote, total_vote) {
  printf "%5s: %4.1f%%\n", party_name, (100 * party_vote / total_vote)
}

# skip empty and header lines
NR > 1 && NF  {
  # runnint total
  total += $6
  # create or update running total for current party
  party_totals[$NF] += $6
}
# when all records are processed
END {
  print "TOTAL: " total
  # print a line for each party
  for (party in party_totals)
    print_party_percentage(party, party_totals[party], total)
}

# output - there are a LOT of tiny local parties
TOTAL: 30697255
  UUP:  0.4%
Left Unity - Trade Unionists and Socialists:  0.0%
  IZB:  0.0%
Respect:  0.0%
  SSP:  0.0%
  NSW:  0.0%
The 30-50 Coalition:  0.0%
.... and so on

Oh - turns out if you include all the novelty parties there are 132 of them across the UK. We need to sort the array and only print the top X items. Turns out it is quite complicated.

Sorting array

Standard awk’s array are not sortable. This was a design choice - only associative arrays are supported, so there is no order, hence they can’t be sorted in any meaningful way. gawk, however, has two array sorting functions - how do they do it? They actually create a new associative array, with all the values from the original but none of they keys; they keys are replaced by new ones, in order. Then you use a for loop (not the standard for in) to read all the array “in order”. This is all well and good if you don’t need the keys, but I do (they are the name of the party). Besides, I am using awk and not gawk.

The best approach is to create a new array with just the keys, sort that array, and then loop through it in order in order to find out which keys of the original array to read.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
# kickstarts the sort process
# puts all the sorted keys into a separate array. if i
function homebrew_asort(original, processed) {
  # before we use the array we must be sure it is empty
  empty_array(processed)
  original_length = copy_and_count_array(original, processed)
  qsort(original, processed, 0, original_length)
  return original_length
}

# removes al values
function empty_array(A) {
  for (i in A)
    delete A[i]
}

# awk doesn't even have an array size function... you also have to roll out your own
function copy_and_count_array(original, processed) {
  for (key in original) {
    # awk doesn't seem to like array[0] -  so we start from 1
    size++
    processed[size] = key
  }
  return size
}

# Adapted from a script from awk.info
# http://awk.info/?quicksort
function qsort(original, keys, left, right,   i, last) {
  if (left >= right)  return
  swap(keys, left, left + int( (right - left + 1) * rand() ) )
  last = left
  for (i = left+1; i <= right; i++)
    if (original[keys[i]] < original[keys[left]])
      swap(keys, ++last, i)
  swap(keys, left, last)
  qsort(original, keys, left, last-1)
  qsort(original, keys, last+1, right)
}
function swap(A, i, j,   t) {
  t = A[i]; A[i] = A[j]; A[j] = t
}

# same formatting function as before
function print_party_percentage(party_name, party_vote, total_vote) {
  printf "%5s: %4.1f%%\n", party_name, (100 * party_vote / total_vote)
}

# same main action as before
NR > 1 && NF  {
  total += $6
  party_totals[$NF] += $6
}

# when all records are processed
END {
  parties_count = homebrew_asort(party_totals, keys)
  for  (i = parties_count; i >= parties_count - 5; i--)
    print_party_percentage(keys[i], party_totals[keys[i]], total)
}

And the output

1
2
3
4
5
6
  Con: 36.8%
  Lab: 30.4%
 UKIP: 12.6%
   LD:  7.9%
  SNP:  4.7%
Green:  3.8%

Print first or last x characters of a line

You can easily mimick head -c or tail -c with awk - if you really want to.

1
2
3
4
5
6
7
# equivalent
$ awk '{print substr($0, 1, 32)}' xxx
$ head -c 32 xxx

# equivalent
$ awk 'END {print substr($0, length($0) - 30, 32)}' xxx
$ tail -c 32 xxx

But with awk you can also skip a few characters into a file

1
2
# no heard or tail equivalent - print characters 32 to 64
$ awk '{print substr($0, 32, 32)}' xxx

Other examples

The awk.info website has some one liners with extensive explanations

The gawk manual includes some one liners which are compatible with standard awk.

The Unix School: 10 examples to group data in a CSV or text file

Comments