Fetching Google Spreadsheets as XML

Comments

The challenge this week: using python to fetch Google Spreadsheets and converting them to XML.

About the challenge

This is part of an ongoing project, a simple recipe manager. So far I have a bunch of recipes, saved as XML on a local machine. I can serve them to a browser on my local machine and use XSLT to make them printable. Now I want to convert some recipes from other formats and locations to the same XML schema and local folder.

The final script is available on GitHub.

Converting Google Spreadsheets to XML

1
2
3
4
Feature: Converting Google Spreadsheets to XML
    In order to consolidate recipe data
    As a geeky recipe author
    I need to fetch recipes from Google Docs and convert them

First stop, Google Spreadsheets then.

Fetching a bunch of Google spreadsheets with googlecl

1
2
3
4
5
6
7
8
9
10
11
12
Scenario: Fetching a bunch of Google spreadsheets with googlecl
    Given that I have googlecl installed
    And I have <spreadsheets> in <folder> in <Google Docs account>
    When I start up googlecl
    And run coomand "docs get --folder <folder> --dest <destination> --format <format>"
    Then <destination> on my local machine should include <downloaded>

Examples:
    | spreadsheets               | folder  | destination    | format | downloaded                                |
    | Tuna, Pasta, Cornish Pasty | Recipes | ~/Recipes_ods  | ods    | Tuna.ods, Pasta.ods, Cornish Pasty.ods    |
    | Tuna, Pasta, Cornish Pasty | Recipes | ~/Recipes_csv  | csv    | Tuna.csv, Pasta.csv, Cornish Pasty.csv    |
    | Tuna, Pasta, Cornish Pasty | Recipes | ~/Recipes_html | html   | Tuna.html, Pasta.html, Cornish Pasty.html |

googlecl is a very useful command line tool built in Python to perform basic actions on Google services (Docs, Picasa, Calendar, etc). I had already installed that using easy_install as per the instructions, and it survived my upgrade to Python 2.7.2.

googlecl can download documents either by name, or by folder name, so I needed to make sure they were in the same folder. That was easily done by searching through all my docs for the word ‘ingredient’ and moving the results to a single folder called, strangely enough, ‘Recipes’.

Next I needed to decide on a download format. The supported Google Docs export formats are CSV, HTML, ODS, PDF, XLS, TXT, which in practice meant a choice between CSV, HTML, and ODS. I downloaded them in all three formats, in three different folders, using the commands google # to start off googlecl

1
2
3
4
$ google
> docs get --folder Recipes --dest ~/Recipes_ods --format ods
> docs get --folder Recipes --dest ~/Recipes_csv --format csv
> docs get --folder Recipes --dest ~/Recipes_html --format html

Generating simple XML from spreadsheet data

1
2
3
4
5
6
7
8
9
10
11
12
13
Scenario: Generating simple XML from spreadsheet data
    Given that I have downloaded <file>
    And the file contains a <row of lines>
    When I parse it with the python script
    Then <xmlsfile> should be created in the recipes folder
    And it should have the following structure
    """
    <recipe lang="en-uk">
        <![CDATA[
            <row of lines>
        ]]>
    </recipe>
    """

Now for the conversion. First step, prove I can read the data, and include it in a valid XML document. I have three possible input formats to choose from.

HTML looks like an interesting format - assuming it is well formed (in the XML sense), in theory I could simply XSL-transform it to the same XML schema as the other recipes. But it turns out Google Docs use HTML 5 and not XHTML5, as some attributes have no quotation marks. So they are not valid XML and there is no quick fix there.

ODS is another option. It is nothing but a zip file, unzip it and it turns out to be a folder. Conveniently, the data is in a document called content.xml at the top level. Not so conveniently, the data structure is quite involved, and not something I’d like to tackle for a quick job.

So in the end I settled on CSV. It is easily handled by Python, that comes with a CSV module built in. CSV has one main drawback, namely it only export the first sheet, but the first sheet is where most of my recipe data is, so that’s ok. Almost.

The script below will use the csv module to read the file, and then simply create an XML file by the same name.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
#! /usr/bin/env python

import csv
import re
import os

csv.register_dialect('custom',
                     delimiter=',',
                     doublequote=True,
                     escapechar=None,
                     quotechar='"',
                     quoting=csv.QUOTE_MINIMAL,
                     skipinitialspace=False)

input_directory  = "/PTH/TO/in"
output_directory = '/PTH/TO/out'

def main( input_dir, output_dir ):
    """Processes all the csv files in a directory, and generates an XML file for each of them.

    Keyword arguments:
    input_dir   -- where to get the input csv files from
    output_dir  -- where to save the output xml files to

    """
    postfix = "csv"

    for csv_file in os.listdir( input_dir ):
        if not csv_file.lower().endswith( postfix ):
            continue
        with open( os.path.join( input_dir , csv_file ), 'rb' ) as file_obj:
            xml_content = spreadsheet_read( file_obj )
            xml_file = re.sub( r'\.%s' % postfix, ".xml", csv_file )
        with open( os.path.join( output_dir , xml_file ) ,'wb' ) as file_obj:
            file_obj.write( recipe_print( xml_content ) )
    return


def spreadsheet_read( spreadsheet_file ):
    """Extract metadata, ingredient list, and directions from a csv file

    Keyword arguments:
    spreadsheet_file --  a file handler

    return a copy of the input

    """
    recipe = ""                              #the steps, returned

    spreadsheet = csv.DictReader(spreadsheet_file, dialect='custom')
    #oreoprocesses data
    for row in spreadsheet:
        recipe += "\n"
        for k,v in row.iteritems():
            recipe += k + ": " + ( v or "-" ) + "\t"

    return recipe


def recipe_print( content ):
    """The string wrapped into an XML document.

    Keyword arguments:
    content  -- a string

    return a copy of the input

    """
    str = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?xml-stylesheet type="text/xsl" href="_assets/recipe.xsl"?>

<recipe lang="en-uk">"""
    str +=  "<![CDATA[" + content + "]]>\n"
    str +=  "</recipe>"
    return str


if __name__ == "__main__":
    main( input_directory, output_directory )

The script runs a function called main - the name == ”main” at the bottom allows the script to be run standalone or imported as a module in another script (not that that’s relevant here, but still). main loops over the csv files in input_directory, runs spreadsheet_read on it to fetch and process its contents, and then passes those contents to recipe_print to create an XML file of it.

spreadsheet_read fetches each row of the spreadsheet in turn as a dictionary object, and serializes it the fastest possible way to show it works. recipe_print does nothing more than wrap a string in a CDATA node.

Converting CSV data to XML

1
2
3
4
5
Scenario: Converting CSV data to XML
    Given that I have <XML schema>
    When I call recipe_print passing dictionaries <meta>, <ingredients>, <directions>
    Then <xmlsfile> should be created in the recipes folder
    And it should be compatible with the schema

Now that I know I can convert data from CSV to XML, I focus on generating the right kind of XML. There is a slight issue here as due to a botched Python upgrade I seem to be unable to install 2.7.2 modules, so using the lxml Python library is not an option. Therefore I couldn’t create the XML by creating an XML tree and adding nodes to it. Instead, I had to create functions to create the type of nodes I needed as strings. I ended up creating functions to format nodes as string.

The functions below will gnerate various XML tags.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
TAB = "   "

def el_CData( nodeName, str, tabs=1 ):
    """Format nodeName as a CData element with optional tabs and return it."""
    return TAB * tabs + "<"+nodeName+"><![CDATA[" + str + "]]></" + nodeName + ">\n";

def el_node( nodeName, str, tabs=1 ):
    """Crete a node nodeName with content str and optional tabs and return it."""
    nodeName = ( nodeName or "preparation" )
    str = ( str or "" )
    return TAB * tabs + "<"+nodeName+">" + str + "</" + nodeName + ">\n";

def el_parent_open( nodeName, tabs=1 ):
    """Format nodeName as an opening tag with optional tabs and return it."""
    return TAB * tabs + "<"+nodeName+">\n";

def el_parent_close( nodeName, tabs=1 ):
    """Format nodeName as a closing tag with optional tabs and return it."""
    return TAB * tabs + "</"+nodeName+">\n";

Then printing the XML is just a matter of calling those functions in the right sequence to reproduce the target XML. To make things easier the method now takes three dictionaries as input, one for metadata, one for the ingredients, and one for directions.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
def recipe_print( meta, ingredients, directions ):
    """Returns an XML document given some metadata, ingredients list, and directions.

    Keyword arguments:
    meta        -- a dictionary of meta information, simple key value pair
    ingredients -- a list of ingredient groups. Each group is dictionary including a name and list of ingredients, and each is a ingredient is itslef a tuple
    directions  -- a list of steps strings

    """
    str = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?xml-stylesheet type="text/xsl" href="_assets/recipe.xsl"?>

<recipe lang="en-uk">"""
    str +=  el_CData( "title", meta['title'] )
    str +=  el_CData( "description", meta['description'] )
    str +=  el_CData( "source", meta['source'] )

    str +=  el_parent_open( "cuisine" )
    str +=  el_node( "style", meta['style'], 2 )
    str +=  el_node( "region", meta['region'], 2 )
    str +=  el_node( "approach", "", 2 )
    str +=  el_parent_close( "cuisine" )

    str +=  el_parent_open( "tags" )
    str +=  el_node( "tag", "", 2 )
    str +=  el_parent_close( "tags" )

    #steps
    str +=  el_parent_open( "directions" )
    for step in directions:
        if step:
            str +=  el_CData( "step", step, 2 )
    str +=  el_parent_close( "directions" )

    #ingredients
    str +=  el_parent_open( "ingredients" )
    str +=  el_node( "serves", "", 2 )
    for group in ingredients:
        if group['list']:
            str +=  el_parent_open( "group", 2 )
            if group['name']:
                str +=  el_node( "name", group['name'], 3 )
            for ingredient in group['list']:
                str +=  el_parent_open( "ingredient", 3 )
                str +=  el_node( "quantity", ingredient['quantity'], 4 )
                str +=  el_node( "measurement", ingredient['measurement'], 4 )
                str +=  el_node( "name", ingredient['name'], 4 )
                str +=  el_node( "preparation", ingredient['preparation'], 4 )
                str +=  el_parent_close( "ingredient", 3 )
            str +=  el_parent_close( "group", 2 )
    str +=  el_parent_close( "ingredients" )

    str +=  el_parent_close( "recipe", 0 )
    return str

Parsing CSV data

1
2
3
4
5
Scenario: Parsing CSV data
    Given that script is parsing <CSV file>
    When <line1> is encountered
    And <line2> is next
    Then it should create dictionaries <ingredients>, and <steps>

As the final step, I will now include real information from the CSV into the XML. The format of the CSV file follows a quite complex grammar, which splits ingrdients and groups and tried to minimize the amount of typing - it was a proof of concept for the interface I will be building. What the converter needs to do is read in lines such as

1
2
3
4
| group name | amount | unit | ingredient name | notes   | step   |
| sauce      | 100    | g    | flour, white    |         |        |
|            | 20     | ml   | butter          |         | mix {} |
| filling    | 100    | g    | onion           | chopped | sauté  |

into dictionaries such as

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
ingredients:
{   sauce: [
        {
            quantity: 100,
            measurement: "g",
            name: "white flour"
            preparation: ""
        },
        {
            quantity: 20,
            measurement: "ml",
            name: "butter"
            preparation: ""
        }
    ],
    filling: [
        {
            quantity: 100,
            measurement: "g",
            name: "onion"
            preparation: "chopped"
        }
    ]
}

steps:
[
    "Mix flour and butter",
    "Sauté onion"
]

The functions below do just that.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
def spreadsheet_read( spreadsheet_file, filename ):
    """Extract metadata, ingredient list, and directions from a csv file

    Keyword arguments:
    spreadsheet_file --  a file handler
    filename         --  the name of the file

    return metadata, ingredient list, and directions dictionaries

    """
    ingGroup = 0                            #pointer to active ingredient group
    lastIngs = []                           #stack of ingredients to be used in a single step
    meta = { 'title' : filename }           #the meta info, returned
    ings = [ { 'name' : '', 'list' : [] } ] #the ingredient groups, returned
    steps = []                              #the steps, returned

    spreadsheet = csv.DictReader(spreadsheet_file, dialect='custom')

    #oreoprocesses data
    for row in spreadsheet:

        #the Group cell is only filled in when the group change
        #when that happens, a new group is added to list
        if row['Group']:
            ings.append( { 'name' : row['Group'], 'list' : [] } )
            ingGroup += 1

        #preparation has no key in the original, so we add one
        cells = {}
        for k, v in row.items():
            k = k or 'preparation'
            cells[k] = v

        #collects ingredients
        ing = {}
        if cells['ingredient']:
            #ingredients are listed as 'oil, olive', so we need to swap them around to get "olive oil"
            if re.search( ",", cells['ingredient'] ):
                ing['name'] = re.sub( "^(.+?), (.+)$", r"\2 \1", cells['ingredient'] )
            else:
                ing['name'] = cells['ingredient']
            ing['measurement'] = cells['unit']
            ing['preparation'] = cells['preparation']
            ing['quantity'] = cells['amount']
            ings[ingGroup]['list'].append( ing )
            #some rows contain ingredients but not actions
            #those ingredients are collected into a list and used for the first action
            #ingredients are listed as 'oil, olive', so that we can extract just 'oil' easily
            lastIngs.append( re.sub( ',.*$', "", cells['ingredient'] ) )

        #collect steps
        step = ""
        if  cells['action']:
            #actions  can contain the string {} to refer to the ingredient - e.g., "add the {} and stir"
            step = re.sub( r'{}', ", ".join( lastIngs ), row['action'] )
            #the rest of the information is quite granular, but we ignore that and just return the one string
            if cells['medium']:
                step += " in"
                if cells['amount medium']:
                    step += " " + cells['amount medium']
                if cells['unit medium']:
                    step += " " + cells['unit medium']
                step += " " + cells['medium']
            if cells['temperature']:
                step += " on " + cells['temperature']
            if cells['ready when']:
                step += " until " + cells['ready when']
            if cells['estimated time']:
                step += ", " + cells['estimated time']
            lastIngs = []
            steps.append( step )
    return meta, ings, steps

Dealing with meta information

1
2
3
4
5
Scenario: Dealing with meta information
    Given that meta information is stored in the second sheet of a spreadsheet
    And that I have 83 recipes to convert
    When I conert the recipes
    Then that information should be part of it

CSV files only include the first sheet of the spreadsheet, so what is missing is three things - the recipe name, the source, and the style (e.g. Italian, Chinese, Modern British, etc). The recipe name I can also fetch from the file name, so all it leaves is a couple of words per recipes, i.e. about 200 words. Rather than trying to fetch them from the other formats and find ways to merge them with the already generated XML files, I decided it would be much easier to copy them over manually.

Challenge 100% complete

The recipes were successfully fetched and converted. The final script is available on GitHub.

Comments