A selection of akw scripts

cli, awk

Some examples scripts to complement the previous awk tutorial

Input data format

One area where standard awk falls short is dealing with input files from your standard desktop applications - basically, CSV files from Office. There hasn't been a CSV standard until recently, and the CSV generated by MS Office doesn't work too well with awk. The main issue is that CSV is basically a rubbish format which suffers from a few problems: the comma too common a character to be used as a separator (why didn't they choose tab??), newlines are saved without being converted to a safe sequence, sometimes fields are surrounded by quotes and sometimes they aren't.

But if you have a "well behaved" CSV file, i.e. one which doesn't have commas, quotation marks, or new lines inside fields, e.g UK,London,10000 then you can easily process it by passing -F"," to the awk call:

awk -F"," -f my_awk_script.awk some_input_data.txt

In practice unless you generate the data yourself there is always going to be the odd comma or quotation mark in your data somewhere; the safest and most reliable course of action is to use tab as separator. I use a free online CSV to TSV converter like this. I then call awk with "\t" as the separator.

The command below is what I use - my awk program is in my_awk_script.awk, the data is in uk_electoral_data_converted.csv, and the results goes into awk_output.txt.

awk -F"\t" -f my_awk_script.awk uk_electoral_data_converted.csv > awk_output.txt

Sample data: UK election results

Some of the scripts will use data from the 2015 UK election in CSV format as data, converted to TSV. Here's what it looks like:

Forename  Surname Description on ballot paper Constituency Name PANO  Votes Share (%) Change  FIELD9  Incumbent?  FIELD11 Constituency ID Region ID County  Region  Country Constituency type Party name identifier Party abbreviation
Gerald  Howarth The Conservative Party Candidate  Aldershot 7 23369 50.6  3.9   MP    E14000530 E12000008 Hampshire South East  England Borough Conservative  Con
Gary  Puffett Labour Party  Aldershot 7 8468  18.3  6.2       E14000530 E12000008 Hampshire South East  England Borough Labour  Lab
Bill  Walker  UK Independence Party (UKIP)  Aldershot 7 8253  17.9  13.4        E14000530 E12000008 Hampshire South East  England Borough UK Independence Party UKIP
...

And here are the field names in order

1 Forename
2 Surname
3 Description on ballot paper
4 Constituency Name
5 PANO
6 Votes
7 Share (%)
8 Change
9 --
10 Incumbent?
11 --
12 Constituency ID
13 Region ID
14 County
15 Region
16 Country
17 Constituency type
18 Party name identifier
19 Party abbreviation

Skipping header row

# put NR > 1 in front of every action to skip the header row
NR > 1 { print }

# result
Gerald  Howarth The Conservative Party Candidate  Aldershot ...
...

Skipping empty records

# check if the first field is empty
NF { print }

Rearranging fields, and skipping some

NR > 1 && NF { print $4 ": " $2 " " $1 " (" $NF ") " $7 "% " }
# NR > 1    ignore header
# NF  ignore empty record
# print constituency name, name surname, party abbreviation, share of the vote
# ignore other fields

Aldershot: Howarth Gerald (Con) 50.6%
Aldershot: Puffett Gary (Lab) 18.3%
Aldershot: Walker Bill (UKIP) 17.9%
Aldershot: Hilliar Alan (LD) 8.8%
Aldershot: Hewitt Carl (Green) 4.4%
...

Finds the national Conservative vote

$NF == "Con" {total += $6}
END          {print total}
# $NF == "Con"  if a record concerns a tory vote
# {total += $6} add it to a running total
# END           when all records are processed
# {print total} output total

11299609

Work out column averages

Find the total vote of the 6 larger parties and their % of the national vote

# keep a running total
NR > 1 && NF  {total += $6}
# keep a total for each party - don't do anything yet
$NF == "Con"        {total_con += $6}
$NF == "Lab"        {total_lab += $6}
$NF == "UKIP"       {total_ukip += $6}
$NF == "LD"         {total_ld += $6}
$NF == "Green"      {total_green += $6}
$NF == "SNP"        {total_snp += $6}
# print a report at the end
END {
  print "TOTAL: " total
  print "Con:   " (100 * total_con / total) "%"
  print "Lab:   " (100 * total_lab / total) "%"
  print "UKIP:  " (100 * total_ukip / total) "%"
  print "LD:    " (100 * total_ld / total) "%"
  print "SNP:   " (100 * total_snp / total) "%"
  print "Green: " (100 * total_green / total) "%"
}

# output:
TOTAL: 30697255
Con:   36.8098%
Lab:   30.449%
UKIP:  12.6431%
LD:    7.87014%
SNP:   4.738%
Green: 3.77112%

Using functions

The same as above, but without copy and paste code

# abstracting copy-and-paste code into a function
function print_party_percentage(party_name, party_vote, total_vote) {
  print party_name "   " (100 * party_vote / total_vote) "%"
}
# same program as before
NR > 1 && NF   {total += $6}
$NF == "Con"   {total_con += $6}
$NF == "Lab"   {total_lab += $6}
$NF == "UKIP"  {total_ukip += $6}
$NF == "LD"    {total_ld += $6}
$NF == "Green" {total_green += $6}
$NF == "SNP"   {total_snp += $6}
END {
  print "TOTAL: " total
  print_party_percentage("Con", total_con, total)
  print_party_percentage("Lab", total_lab, total)
  print_party_percentage("UKIP", total_ukip, total)
  print_party_percentage("LD", total_ld, total)
  print_party_percentage("SNP", total_snp, total)
  print_party_percentage("Green", total_green, total)
}

# output - looks messier because previous program was manually formatted
TOTAL: 30697255
Con   36.8098%
Lab   30.449%
UKIP   12.6431%
LD   7.87014%
SNP   4.738%
Green   3.77112%

Formatting numerical precision and alignment with printf

Same as above, but using printf for formatting

function print_party_percentage(party_name, party_vote, total_vote) {
  printf "%5s: %4.1f%%\n", party_name, (100 * party_vote / total_vote)
  # %5s a string (s) of fixed width 5 or more (5) aligned right (if it was -5 it would be left)
  # %4.1f a number (f) with one decimal (.1) and total width 4 (4) aligned right (4)
  # %% an actual %
}
# same program as before
NR > 1 && NF   {total += $6}
$NF == "Con"   {total_con += $6}
$NF == "Lab"   {total_lab += $6}
$NF == "UKIP"  {total_ukip += $6}
$NF == "LD"    {total_ld += $6}
$NF == "Green" {total_green += $6}
$NF == "SNP"   {total_snp += $6}
END {
  print "TOTAL: " total
  print_party_percentage("Con", total_con, total)
  print_party_percentage("Lab", total_lab, total)
  print_party_percentage("UKIP", total_ukip, total)
  print_party_percentage("LD", total_ld, total)
  print_party_percentage("SNP", total_snp, total)
}

# output
TOTAL: 30697255
  Con: 36.8%
  Lab: 30.4%
UKIP: 12.6%
  LD:  7.9%
  SNP:  4.7%

Using arrays to group data

There is still some copy and paste code because we are hardcoding the parties. We can use arrays to group whatever parties we find.

# same formatting function as before
function print_party_percentage(party_name, party_vote, total_vote) {
  printf "%5s: %4.1f%%\n", party_name, (100 * party_vote / total_vote)
}

# skip empty and header lines
NR > 1 && NF  {
  # runnint total
  total += $6
  # create or update running total for current party
  party_totals[$NF] += $6
}
# when all records are processed
END {
  print "TOTAL: " total
  # print a line for each party
  for (party in party_totals)
    print_party_percentage(party, party_totals[party], total)
}

# output - there are LOTS of tiny local parties
TOTAL: 30697255
  UUP:  0.4%
Left Unity - Trade Unionists and Socialists:  0.0%
  IZB:  0.0%
Respect:  0.0%
  SSP:  0.0%
  NSW:  0.0%
The 30-50 Coalition:  0.0%
.... and so on

Oh - turns out if you include all the novelty parties there are 132 of them across the UK. We need to sort the array and only print the top X items. Turns out it is quite complicated.

Sorting array

Standard awk's array are not sortable. This was a design choice - only associative arrays are supported, so there is no order, hence they can't be sorted in any meaningful way. gawk, however, has two array sorting functions - how do they do it? They actually create a new associative array, with all the values from the original but none of they keys; they keys are replaced by new ones, in order. Then you use a for loop (not the standard for in) to read all the array "in order". This is all well and good if you don't need the keys, but I do (they are the name of the party). Besides, I am using awk and not gawk.

The best approach is to create a new array with just the keys, sort that array, and then loop through it in order to find out which keys of the original array to read.

# kickstarts the sort process
# puts all the sorted keys into a separate array. if i
function homebrew_asort(original, processed) {
  # before we use the array we must be sure it is empty
  empty_array(processed)
  original_length = copy_and_count_array(original, processed)
  qsort(original, processed, 0, original_length)
  return original_length
}

# removes al values
function empty_array(A) {
  for (i in A)
    delete A[i]
}

# awk doesn't even have an array size function... you also have to roll out your own
function copy_and_count_array(original, processed) {
  for (key in original) {
    # awk doesn't seem to like array[0] -  so we start from 1
    size++
    processed[size] = key
  }
  return size
}

# Adapted from a script from awk.info
# http://awk.info/?quicksort
function qsort(original, keys, left, right,   i, last) {
  if (left >= right)  return
  swap(keys, left, left + int( (right - left + 1) * rand() ) )
  last = left
  for (i = left+1; i <= right; i++)
    if (original[keys[i]] < original[keys[left]])
      swap(keys, ++last, i)
  swap(keys, left, last)
  qsort(original, keys, left, last-1)
  qsort(original, keys, last+1, right)
}
function swap(A, i, j,   t) {
  t = A[i]; A[i] = A[j]; A[j] = t
}

# same formatting function as before
function print_party_percentage(party_name, party_vote, total_vote) {
  printf "%5s: %4.1f%%\n", party_name, (100 * party_vote / total_vote)
}

# same main action as before
NR > 1 && NF  {
  total += $6
  party_totals[$NF] += $6
}

# when all records are processed
END {
  parties_count = homebrew_asort(party_totals, keys)
  for  (i = parties_count; i >= parties_count - 5; i--)
    print_party_percentage(keys[i], party_totals[keys[i]], total)
}

And the output

  Con: 36.8%
  Lab: 30.4%
UKIP: 12.6%
  LD:  7.9%
  SNP:  4.7%
Green:  3.8%

You can easily mimic head -c or tail -c with awk - if you really want to.

# head equivalent
$ awk '{print substr($0, 1, 32)}' xxx
$ head -c 32 xxx

# tail equivalent
$ awk 'END {print substr($0, length($0) - 30, 32)}' xxx
$ tail -c 32 xxx

But with awk you can also skip a few characters into a file

# no head or tail equivalent - print characters 32 to 64
$ awk '{print substr($0, 32, 32)}' xxx

Other examples

The awk.info website has some one liners with extensive explanations

The gawk manual includes some one liners which are compatible with standard awk.

The Unix School: 10 examples to group data in a CSV or text file