This is part of an ongoing project, a simple recipe manager. So far I have a bunch of recipes, saved as XML on a local machine. I can serve them to a browser on my local machine and use XSLT to make them printable. Now I want to convert some recipes from other formats and locations to the same XML schema and local folder.
Feature: Converting Google Spreadsheets to XML In order to consolidate recipe data As a geeky recipe author I need to fetch recipes from Google Docs and convert them
First stop, Google Spreadsheets then.
Fetching a bunch of Google spreadsheets with googlecl
123456789101112
Scenario: Fetching a bunch of Google spreadsheets with googlecl Given that I have googlecl installedAnd I have <spreadsheets> in <folder> in <Google Docs account>When I start up googleclAnd run coomand "docs get --folder <folder> --dest <destination> --format <format>"Then <destination> on my local machine should include <downloaded>Examples: | spreadsheets | folder | destination | format | downloaded | | Tuna, Pasta, Cornish Pasty | Recipes | ~/Recipes_ods | ods | Tuna.ods, Pasta.ods, Cornish Pasty.ods | | Tuna, Pasta, Cornish Pasty | Recipes | ~/Recipes_csv | csv | Tuna.csv, Pasta.csv, Cornish Pasty.csv | | Tuna, Pasta, Cornish Pasty | Recipes | ~/Recipes_html | html | Tuna.html, Pasta.html, Cornish Pasty.html |
googlecl is a very useful command line tool built in Python to perform basic actions on Google services (Docs, Picasa, Calendar, etc). I had already installed that using easy_install as per the instructions, and it survived my upgrade to Python 2.7.2.
googlecl can download documents either by name, or by folder name, so I needed to make sure they were in the same folder. That was easily done by searching through all my docs for the word ‘ingredient’ and moving the results to a single folder called, strangely enough, ‘Recipes’.
Next I needed to decide on a download format. The supported Google Docs export formats are CSV, HTML, ODS, PDF, XLS, TXT, which in practice meant a choice between CSV, HTML, and ODS. I downloaded them in all three formats, in three different folders, using the commands google # to start off googlecl
1234
$ google> docs get --folder Recipes --dest ~/Recipes_ods --format ods> docs get --folder Recipes --dest ~/Recipes_csv --format csv> docs get --folder Recipes --dest ~/Recipes_html --format html
Generating simple XML from spreadsheet data
12345678910111213
Scenario: Generating simple XML from spreadsheet data Given that I have downloaded <file>And the file contains a <row of lines>When I parse it with the python scriptThen <xmlsfile> should be created in the recipes folderAnd it should have the following structure"""<recipe lang="en-uk"><![CDATA[ <row of lines> ]]></recipe>"""
Now for the conversion. First step, prove I can read the data, and include it in a valid XML document. I have three possible input formats to choose from.
HTML looks like an interesting format - assuming it is well formed (in the XML sense), in theory I could simply XSL-transform it to the same XML schema as the other recipes. But it turns out Google Docs use HTML 5 and not XHTML5, as some attributes have no quotation marks. So they are not valid XML and there is no quick fix there.
ODS is another option. It is nothing but a zip file, unzip it and it turns out to be a folder. Conveniently, the data is in a document called content.xml at the top level. Not so conveniently, the data structure is quite involved, and not something I’d like to tackle for a quick job.
So in the end I settled on CSV. It is easily handled by Python, that comes with a CSV module built in. CSV has one main drawback, namely it only export the first sheet, but the first sheet is where most of my recipe data is, so that’s ok. Almost.
The script below will use the csv module to read the file, and then simply create an XML file by the same name.
#! /usr/bin/env pythonimportcsvimportreimportoscsv.register_dialect('custom',delimiter=',',doublequote=True,escapechar=None,quotechar='"',quoting=csv.QUOTE_MINIMAL,skipinitialspace=False)input_directory="/PTH/TO/in"output_directory='/PTH/TO/out'defmain(input_dir,output_dir):"""Processes all the csv files in a directory, and generates an XML file for each of them. Keyword arguments: input_dir -- where to get the input csv files from output_dir -- where to save the output xml files to """postfix="csv"forcsv_fileinos.listdir(input_dir):ifnotcsv_file.lower().endswith(postfix):continuewithopen(os.path.join(input_dir,csv_file),'rb')asfile_obj:xml_content=spreadsheet_read(file_obj)xml_file=re.sub(r'\.%s'%postfix,".xml",csv_file)withopen(os.path.join(output_dir,xml_file),'wb')asfile_obj:file_obj.write(recipe_print(xml_content))returndefspreadsheet_read(spreadsheet_file):"""Extract metadata, ingredient list, and directions from a csv file Keyword arguments: spreadsheet_file -- a file handler return a copy of the input """recipe=""#the steps, returnedspreadsheet=csv.DictReader(spreadsheet_file,dialect='custom')#oreoprocesses dataforrowinspreadsheet:recipe+="\n"fork,vinrow.iteritems():recipe+=k+": "+(vor"-")+"\t"returnrecipedefrecipe_print(content):"""The string wrapped into an XML document. Keyword arguments: content -- a string return a copy of the input """str="""<?xml version="1.0" encoding="UTF-8" standalone="yes"?><?xml-stylesheet type="text/xsl" href="_assets/recipe.xsl"?><recipe lang="en-uk">"""str+="<![CDATA["+content+"]]>\n"str+="</recipe>"returnstrif__name__=="__main__":main(input_directory,output_directory)
The script runs a function called main - the name == ”main” at the bottom allows the script to be run standalone or imported as a module in another script (not that that’s relevant here, but still). main loops over the csv files in input_directory, runs spreadsheet_read on it to fetch and process its contents, and then passes those contents to recipe_print to create an XML file of it.
spreadsheet_read fetches each row of the spreadsheet in turn as a dictionary object, and serializes it the fastest possible way to show it works. recipe_print does nothing more than wrap a string in a CDATA node.
Converting CSV data to XML
12345
Scenario: Converting CSV data to XML Given that I have <XML schema>When I call recipe_print passing dictionaries <meta>, <ingredients>, <directions>Then <xmlsfile> should be created in the recipes folderAnd it should be compatible with the schema
Now that I know I can convert data from CSV to XML, I focus on generating the right kind of XML. There is a slight issue here as due to a botched Python upgrade I seem to be unable to install 2.7.2 modules, so using the lxml Python library is not an option. Therefore I couldn’t create the XML by creating an XML tree and adding nodes to it. Instead, I had to create functions to create the type of nodes I needed as strings. I ended up creating functions to format nodes as string.
The functions below will gnerate various XML tags.
12345678910111213141516171819
TAB=" "defel_CData(nodeName,str,tabs=1):"""Format nodeName as a CData element with optional tabs and return it."""returnTAB*tabs+"<"+nodeName+"><![CDATA["+str+"]]></"+nodeName+">\n";defel_node(nodeName,str,tabs=1):"""Crete a node nodeName with content str and optional tabs and return it."""nodeName=(nodeNameor"preparation")str=(stror"")returnTAB*tabs+"<"+nodeName+">"+str+"</"+nodeName+">\n";defel_parent_open(nodeName,tabs=1):"""Format nodeName as an opening tag with optional tabs and return it."""returnTAB*tabs+"<"+nodeName+">\n";defel_parent_close(nodeName,tabs=1):"""Format nodeName as a closing tag with optional tabs and return it."""returnTAB*tabs+"</"+nodeName+">\n";
Then printing the XML is just a matter of calling those functions in the right sequence to reproduce the target XML. To make things easier the method now takes three dictionaries as input, one for metadata, one for the ingredients, and one for directions.
defrecipe_print(meta,ingredients,directions):"""Returns an XML document given some metadata, ingredients list, and directions. Keyword arguments: meta -- a dictionary of meta information, simple key value pair ingredients -- a list of ingredient groups. Each group is dictionary including a name and list of ingredients, and each is a ingredient is itslef a tuple directions -- a list of steps strings """str="""<?xml version="1.0" encoding="UTF-8" standalone="yes"?><?xml-stylesheet type="text/xsl" href="_assets/recipe.xsl"?><recipe lang="en-uk">"""str+=el_CData("title",meta['title'])str+=el_CData("description",meta['description'])str+=el_CData("source",meta['source'])str+=el_parent_open("cuisine")str+=el_node("style",meta['style'],2)str+=el_node("region",meta['region'],2)str+=el_node("approach","",2)str+=el_parent_close("cuisine")str+=el_parent_open("tags")str+=el_node("tag","",2)str+=el_parent_close("tags")#stepsstr+=el_parent_open("directions")forstepindirections:ifstep:str+=el_CData("step",step,2)str+=el_parent_close("directions")#ingredientsstr+=el_parent_open("ingredients")str+=el_node("serves","",2)forgroupiningredients:ifgroup['list']:str+=el_parent_open("group",2)ifgroup['name']:str+=el_node("name",group['name'],3)foringredientingroup['list']:str+=el_parent_open("ingredient",3)str+=el_node("quantity",ingredient['quantity'],4)str+=el_node("measurement",ingredient['measurement'],4)str+=el_node("name",ingredient['name'],4)str+=el_node("preparation",ingredient['preparation'],4)str+=el_parent_close("ingredient",3)str+=el_parent_close("group",2)str+=el_parent_close("ingredients")str+=el_parent_close("recipe",0)returnstr
Parsing CSV data
12345
Scenario: Parsing CSV data Given that script is parsing <CSV file>When <line1> is encounteredAnd <line2> is nextThen it should create dictionaries <ingredients>, and <steps>
As the final step, I will now include real information from the CSV into the XML. The format of the CSV file follows a quite complex grammar, which splits ingrdients and groups and tried to minimize the amount of typing - it was a proof of concept for the interface I will be building. What the converter needs to do is read in lines such as
defspreadsheet_read(spreadsheet_file,filename):"""Extract metadata, ingredient list, and directions from a csv file Keyword arguments: spreadsheet_file -- a file handler filename -- the name of the file return metadata, ingredient list, and directions dictionaries """ingGroup=0#pointer to active ingredient grouplastIngs=[]#stack of ingredients to be used in a single stepmeta={'title':filename}#the meta info, returnedings=[{'name':'','list':[]}]#the ingredient groups, returnedsteps=[]#the steps, returnedspreadsheet=csv.DictReader(spreadsheet_file,dialect='custom')#oreoprocesses dataforrowinspreadsheet:#the Group cell is only filled in when the group change#when that happens, a new group is added to listifrow['Group']:ings.append({'name':row['Group'],'list':[]})ingGroup+=1#preparation has no key in the original, so we add onecells={}fork,vinrow.items():k=kor'preparation'cells[k]=v#collects ingredientsing={}ifcells['ingredient']:#ingredients are listed as 'oil, olive', so we need to swap them around to get "olive oil"ifre.search(",",cells['ingredient']):ing['name']=re.sub("^(.+?), (.+)$",r"\2 \1",cells['ingredient'])else:ing['name']=cells['ingredient']ing['measurement']=cells['unit']ing['preparation']=cells['preparation']ing['quantity']=cells['amount']ings[ingGroup]['list'].append(ing)#some rows contain ingredients but not actions#those ingredients are collected into a list and used for the first action#ingredients are listed as 'oil, olive', so that we can extract just 'oil' easilylastIngs.append(re.sub(',.*$',"",cells['ingredient']))#collect stepsstep=""ifcells['action']:#actions can contain the string {} to refer to the ingredient - e.g., "add the {} and stir"step=re.sub(r'{}',", ".join(lastIngs),row['action'])#the rest of the information is quite granular, but we ignore that and just return the one stringifcells['medium']:step+=" in"ifcells['amount medium']:step+=" "+cells['amount medium']ifcells['unit medium']:step+=" "+cells['unit medium']step+=" "+cells['medium']ifcells['temperature']:step+=" on "+cells['temperature']ifcells['ready when']:step+=" until "+cells['ready when']ifcells['estimated time']:step+=", "+cells['estimated time']lastIngs=[]steps.append(step)returnmeta,ings,steps
Dealing with meta information
12345
Scenario: Dealing with meta information Given that meta information is stored in the second sheet of a spreadsheetAnd that I have 83 recipes to convertWhen I conert the recipesThen that information should be part of it
CSV files only include the first sheet of the spreadsheet, so what is missing is three things - the recipe name, the source, and the style (e.g. Italian, Chinese, Modern British, etc). The recipe name I can also fetch from the file name, so all it leaves is a couple of words per recipes, i.e. about 200 words. Rather than trying to fetch them from the other formats and find ways to merge them with the already generated XML files, I decided it would be much easier to copy them over manually.
Challenge 100% complete
The recipes were successfully fetched and converted. The final script is available on GitHub.