I am slowly building a recipe manager web app, and as a first step I am converting all the data I have to the common XML format I defined in part 1. Some of that data was stored in a cute-looking, but ultimately disappointing, OS X app called Yummy Soup!. I need to liberate that data, which the app can export as an OS X plist, an XML format.
Feature: Converting from YummySoup! plist format to XML
    In order to consolidate my recipe data
    As a geeky recipe author
    I need to convert from plist to XML
Converting from one type of XML to another is a job for XSLT, and Apache Ant is what I normally run XSLT with. But first of all, I need to get the YS! recipes out. I can only export them all to a single file, by opening the app, selecting all the recipes in it, and choosing ‘export’. This created a plist with 239 dict nodes in it, each of them a recipe.
Using Apache Ant to run an XSLT job on a file
Scenario: Using Apache Ant to run an XSLT job on a file
    Given that I have an XSLT and an XML file
    When I run the extractFromYS Ant job
    Then a new XML file should be produced
    And it should include data from the input file
This is easily achieved using my standard Ant setup.
Normally I split property files between common and local (i.e., those that go under version control and those that don’t). In this case I need neither, but Ant will carry on working even if it doesn’t find the files, which is nice. The job extractFromYS is set as default for the project, which allows me to run it from Sublime Text 2 without doing any special work - just choose ‘Ant’ as the build system, then hit CMD-B.
The target increaseBuildNumber is a standard one I use which increases a number in a text file every time I run the job. This is often useful, although probably not this time, but I am happy to keep it for consistency with my other projects. Finally the XSLT job - as simple as it gets, with a file in, file out, and XSLT file, plus some output parameters.
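A build file along the lines described above might look roughly like this. It is a minimal sketch, not the original: the project name, the property file names, the build number file, and the input, output and stylesheet paths are all placeholders assumed for illustration. Only the target names extractFromYS and increaseBuildNumber come from the text.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- extractFromYS is the default target, so a plain "ant" (or CMD-B) runs it -->
<project name="recipes" default="extractFromYS" basedir=".">

  <!-- assumed file names; Ant carries on if either file is missing -->
  <property file="local.properties"/>
  <property file="common.properties"/>

  <!-- increments the number stored in a text file on every run (assumed file name) -->
  <target name="increaseBuildNumber">
    <buildnumber file="build.number"/>
  </target>

  <!-- file in, file out, XSLT file, plus some output properties (assumed paths) -->
  <target name="extractFromYS" depends="increaseBuildNumber">
    <xslt in="yummysoup.plist" out="recipes.xml" style="plist2recipe.xsl" force="true">
      <outputproperty name="method" value="xml"/>
      <outputproperty name="indent" value="yes"/>
    </xslt>
  </target>
</project>
```

The force attribute makes Ant rerun the transform even when the output file is newer than the input, which is convenient while the stylesheet is still changing.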
What I was looking for was one string per recipe, to confirm that the XSLT runs and recognizes them. It did.
Transforming plist into XML
Scenario: Transforming plist into XML
    Given that an OS X plist document is being XSL transformed
    And key value pairs are stored as <key>KEY</key><string>VALUE</string>
    When a node is passed through a template and KEY passed as parameter
    Then the associated VALUE should be replaced
The XML schema of plists is rather awkward for XSLT, so I created a template for handling it. Note that ‘string’ is only one of the possible plist types, but it’s the only one I need in this particular case. I started by fetching only the name of the recipe as a test.
The xsl:template match="/" matches the whole document and is the entry point that controls which other templates get to handle which nodes. Every time it encounters a dict node in the source XML document, which is once per recipe, it hands control to xsl:template match="/plist/array/dict". There a new XML document is appended to the output, separated by a row of lines. This is not valid XML of course, but I will split it later.
xsl:processing-instruction name="xml-stylesheet" adds the stylesheet reference to the output. You can’t just type <?xsl... ?>, because it would then look like an instruction to be run rather than a string to be output.
Then a couple of tags are mapped from one file to another: ‘name’ becomes ‘title’, ‘recipeDescription’ becomes ‘description’ and so on. A named template, val, is used to convert from <key>name</key><string>Mulligatawny</string> to <title>Mulligatawny</title>. This is easily achieved using following-sibling to fetch the value associated with a plist key.
Another thing worth noting is the cdata-section-elements="title description source step" in the xsl:output node on line 6. That defines a list of elements whose content will be wrapped in a CDATA section, which is nice.
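The key lookup described in this section can be sketched as a small stand-alone stylesheet. This is not the original listing, only an illustration under a few assumptions: the wrapper element recipes is hypothetical (added so the result is a single well-formed document), only two of the mappings are shown, and the predicate [self::string] is a defensive choice so that a key followed by some other plist type yields an empty value.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- content of the listed elements is wrapped in CDATA sections on output -->
  <xsl:output method="xml" encoding="utf-8" indent="yes"
              cdata-section-elements="title description"/>

  <!-- ENTRY POINT: hypothetical wrapper root around all recipes -->
  <xsl:template match="/">
    <recipes>
      <xsl:apply-templates select="/plist/array/dict"/>
    </recipes>
  </xsl:template>

  <!-- ONCE PER RECIPE: map plist keys to the new element names -->
  <xsl:template match="/plist/array/dict">
    <recipe>
      <title>
        <xsl:call-template name="val">
          <xsl:with-param name="node" select="key[. = 'name']"/>
        </xsl:call-template>
      </title>
      <description>
        <xsl:call-template name="val">
          <xsl:with-param name="node" select="key[. = 'recipeDescription']"/>
        </xsl:call-template>
      </description>
    </recipe>
  </xsl:template>

  <!-- val: returns the text of the string element that immediately follows the given key -->
  <xsl:template name="val">
    <xsl:param name="node"/>
    <xsl:value-of select="$node/following-sibling::*[1][self::string]"/>
  </xsl:template>
</xsl:stylesheet>
```

Given a dict containing <key>name</key><string>Mulligatawny</string>, the title element receives the text Mulligatawny, wrapped in a CDATA section because title is listed in cdata-section-elements.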
Splitting string with XSL built in string functions
Scenario: Splitting string with XSL built in string functions
    Given that cuisine is stored as a string in format "STYLE / REGION"
    When I pass it through the template
    Then it should return a node for style, and one for region
    And it should be equal to "<style>STYLE</style><region>REGION</region>"
    Given that cuisine is stored as a string in format "STYLE"
    And it has no REGION
    When I pass it through the template
    Then it should return a node for style, and an empty one for region
    And it should be equal to "<style>STYLE</style><region></region>"
There are two use cases here: one deals with strings like ‘Italian / Sardinian’ or ‘Indian / Kerala’, the other with ‘English’, ‘Spanish’ etc. First the ‘val’ template is called, to save the ‘cuisine’ string to a variable, then I use a simple xsl:choose and some of XSL’s built-in string functions to handle both use cases.
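One way to handle the two cases with xsl:choose, contains, substring-before and substring-after is sketched below. It is an illustration rather than the original template: the key lookup is inlined instead of going through val so that the stylesheet stands on its own, and the wrapper elements recipes and recipe are assumed for the sake of a well-formed result.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="utf-8" indent="yes"/>

  <!-- hypothetical wrapper root around all recipes -->
  <xsl:template match="/">
    <recipes>
      <xsl:apply-templates select="/plist/array/dict"/>
    </recipes>
  </xsl:template>

  <xsl:template match="/plist/array/dict">
    <!-- the string stored under the 'cuisine' key, e.g. "Italian / Sardinian" or "English" -->
    <xsl:variable name="cuisine"
                  select="string(key[. = 'cuisine']/following-sibling::*[1][self::string])"/>
    <recipe>
      <xsl:choose>
        <!-- "STYLE / REGION": split on the slash and trim both halves -->
        <xsl:when test="contains($cuisine, '/')">
          <style><xsl:value-of select="normalize-space(substring-before($cuisine, '/'))"/></style>
          <region><xsl:value-of select="normalize-space(substring-after($cuisine, '/'))"/></region>
        </xsl:when>
        <!-- "STYLE" only: the region node is still emitted, but empty -->
        <xsl:otherwise>
          <style><xsl:value-of select="normalize-space($cuisine)"/></style>
          <region/>
        </xsl:otherwise>
      </xsl:choose>
    </recipe>
  </xsl:template>
</xsl:stylesheet>
```

The empty case is serialized as <region/>, which is equivalent to <region></region> for any XML parser.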
Splitting string with XSL using a recursive function
Scenario: Splitting string with XSL using a recursive function
    Given that tags are stored as a single comma separated string
    When I pass it through the splitstring template
    Then it should return a node for each tag
A comma-separated list can have any number of items, so a recursive template called splitstring is used. It is a generic template that can be reused in other projects: it takes three parameters, the input string, the delimiter and the output tag to be generated. The string to be passed in is saved to a variable with the val template.
    <!--
    splitstring - splits a string and assigns each substring to a node
    @param {String} list the string to be split
    @param {String} delimiter the string to split by, [optional default ,]
    @param {String} tag the name of the tag to be created, [optional default 'tag']
    -->
    <xsl:template name="splitstring">
        <xsl:param name="list"/>
        <xsl:param name="delimiter" select="','"/>
        <xsl:param name="tag" select="'tag'"/>
        <!-- if no delimiter is left, append one so that substring-before still finds the last item -->
        <xsl:variable name="newstring">
            <xsl:choose>
                <xsl:when test="contains( $list, $delimiter )">
                    <xsl:value-of select="normalize-space($list)"/>
                </xsl:when>
                <xsl:otherwise>
                    <xsl:value-of select="concat(normalize-space($list), $delimiter)"/>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:variable>
        <xsl:variable name="first" select="substring-before($newstring, $delimiter)"/>
        <xsl:variable name="remaining" select="substring-after($newstring, $delimiter)"/>
        <xsl:element name="{$tag}">
            <xsl:value-of select="$first"/>
        </xsl:element>
        <!-- recurse on whatever is left; the parameter must be called 'list' to match xsl:param above -->
        <xsl:if test="$remaining">
            <xsl:call-template name="splitstring">
                <xsl:with-param name="list" select="$remaining"/>
                <xsl:with-param name="delimiter" select="$delimiter"/>
                <xsl:with-param name="tag" select="$tag"/>
            </xsl:call-template>
        </xsl:if>
    </xsl:template>

    <xsl:template match="/plist/array/dict">
        <xsl:variable name="tags">
            <xsl:call-template name="val">
                <xsl:with-param name="node" select="*[. = 'keywords']"/>
            </xsl:call-template>
        </xsl:variable>
        ...
        <xsl:processing-instruction name="xml-stylesheet">href="recipe.xsl" type="text/xsl"
        </xsl:processing-instruction>
        <recipe lang="en-uk">
            ...
            <tags>
                <xsl:call-template name="splitstring">
                    <xsl:with-param name="list" select="$tags"/>
                </xsl:call-template>
            </tags>
            ...
Using XSL extension to process tricky strings
Scenario: Using XSL extension to process tricky strings
    Given that ingredient data in XML node is too complex for XSLT
    And directions are not split in individual steps
    When I run the XSLT
    Then I want to process the list with a different language
I have two issues here. Firstly, the directions are split into steps in various different ways, none of them useful, depending on which version of the app they were created in. This is part of what I found so annoying about the software. Secondly, the ingredients list is more regular, but it is formatted like a Python list of tuples, where each tuple either signals the start of a new group of ingredients or is a single ingredient.
The first case could be solved with a few regular expressions. XSLT 2 has them, but I am still on 1 so no joy there. Either way, the second case is simply too complex for XSLT, so I need to look at ways to bring other languages into the equation.
XSLT can be extended with various languages: Java as expected, but also Python (or rather Jython, the Java implementation thereof), and JavaScript. Alternatively, when running the transform through PHP, the XSLT processor there is able to use PHP functions to process nodes.
I first tried Jython - the ingredients list is (almost) valid Python data, so I thought it should be easy. Then I tried Javascript, because I know it quite well. But I couldn’t get either to work - it looks like Xalan kept treating the scripts as Java instead of using the lang attribute. I posted on StackOverflow and hope that will help.
I had slightly more luck extending XSLT with Java, but only by calling simple methods of built-in Java classes from within Xalan. For example, here’s how to print the date using Java:
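The original snippet did not survive, so the following is only a sketch of what such a call can look like. It assumes the Java extension namespace documented for Xalan-J (http://xml.apache.org/xalan/java); the prefix java and the output element generated are arbitrary choices, and other processors or Xalan builds may need a different namespace.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:java="http://xml.apache.org/xalan/java"
    exclude-result-prefixes="java">
  <xsl:output method="xml" encoding="utf-8" indent="yes"/>

  <xsl:template match="/">
    <!-- java:java.util.Date.new() calls the no-argument constructor of java.util.Date -->
    <xsl:variable name="now" select="java:java.util.Date.new()"/>
    <generated>
      <!-- java:toString($now) calls the instance method toString() on that object -->
      <xsl:value-of select="java:toString($now)"/>
    </generated>
  </xsl:template>
</xsl:stylesheet>
```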
But anything more complex proved to be problematic. That left one last option: good old PHP.
PHP’s XSLT processor has registerPHPFunctions, which is the equivalent of XSL extensions, but you have to run the transform from within PHP. This is not such a big deal though, as I am only processing one file.
First of all, I created the simplest possible PHP file to get started. The following will simply run the XML file through an XSL that copies the XML out as is, without changing it.
That works well, nothing changes in the output XML files. So finally I have a way to call functions written in a different language from XSLT.
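The pass-through stylesheet for such a test can be as small as the standard XSLT 1.0 identity transform. The sketch below shows only the stylesheet side, not the PHP file that runs it, and is not necessarily the exact XSL used here.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="utf-8"/>

  <!-- identity template: copy every attribute and node, then recurse into its children -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

Starting from the identity transform is handy because later steps only need to add more specific templates for the nodes that should change.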
Using PHP functions to process regular expressions in XSLT
Scenario: Using PHP functions to process regular expressions in XSLT
    Given that the directions data is not always split into steps
    But each step could be enclosed in <li> tags
    Or each step could be enclosed by &lt;li&gt; tags
    Or steps could be separated by two new lines
    Or steps could be separated by <br> or &lt;br&gt; tags
    And steps could be preceded by (1) or 1)
    When the PHP function processes them
    Then it should output a list of <step> nodes
The first step here is to arrange the XSLT so that it leaves everything untouched except for the nodes I want to process via PHP. I do this by changing the XSL embedded in the PHP file:
    <?xml version="1.0" encoding="utf-8"?>
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:php="http://php.net/xsl">
        <xsl:output method="html" encoding="utf-8" indent="yes"/>

        <!-- ENTRY POINT -->
        <xsl:template match="/">
            <plist version="1.0">
                <array>
                    <xsl:apply-templates select="/plist/array/dict"/>
                </array>
            </plist>
        </xsl:template>

        <!-- ONCE PER RECIPE -->
        <xsl:template match="/plist/array/dict">
            <dict>
                <xsl:variable name="directions">
                    <xsl:value-of select="./string[preceding-sibling::key='directions'][1]"/>
                </xsl:variable>
                <xsl:copy-of select="php:function('splitSteps', \$directions )"/>
                <!-- not(...) around the whole test, so the first child (no preceding sibling) is copied too -->
                <xsl:copy-of select="./*[not(preceding-sibling::*[1] = 'directions')]"/>
            </dict>
        </xsl:template>
    </xsl:stylesheet>
The xsl:variable finds the node I am interested in, a string node preceded by a key node with value ‘directions’, and stores its value; the first xsl:copy-of then passes it to a PHP function. Note that the PHP function is called in a copy-of node, which allows the function to return DOM nodes of its own. If I used value-of, I could only get a string back.
The second xsl:copy-of finds all the other nodes, i.e. the ones that are not string nodes preceded by a ‘directions’ key, and just copies them as they are.
    /**
     * Handles the directions, which are split in all sorts of weird and wonderful ways,
     * and turns them into a set of nodes. Note that the set of nodes needs to be a valid
     * XML document, i.e. with a single root node. Luckily, that works here.
     * @param {String} $node the content of the node
     * @return {DOMDocument} a DOM tree
     */
    function splitSteps( $node ) {
        $find = array();
        $replace = array();

        # html_entity_decode didn't work, so unescape the angle brackets by hand
        array_push( $find, "'&lt;'" );
        array_push( $replace, "<" );
        array_push( $find, "'&gt;'" );
        array_push( $replace, ">" );

        # turn the various HTML separators into new lines
        array_push( $find, "'<br>'" );
        array_push( $replace, "\n\n" );
        array_push( $find, "'</?[ou]l>'" );
        array_push( $replace, "" );
        array_push( $find, "'<li>'" );
        array_push( $replace, "\n" );
        array_push( $find, "'</li>'" );
        array_push( $replace, "\n" );

        # tidy up the white space
        array_push( $find, "'\n[ \t]+'" );
        array_push( $replace, "\n" );
        array_push( $find, "'\n{2,}'" );
        array_push( $replace, "\n\n" );

        # step numbers like (1) or 1) at the start of a line become a step boundary
        array_push( $find, "'(^|\n)\(?\d+\.?\d?\)\s*'" );
        array_push( $replace, "\n\n" );

        # every remaining double new line separates two steps
        array_push( $find, "'\n\n'" );
        array_push( $replace, "</step>\n<step>" );

        $node = "<directions><step>" . preg_replace( $find, $replace, $node ) . "</step></directions>";
        $dom = new DOMDocument("1.0","UTF-8");
        # die prints the offending string when it is not well-formed XML
        $dom->loadXML( $node ) or die ( $node );
        return $dom;
    }
The PHP function runs a bunch of regular expressions on the strings, using the array form of preg_replace. The end result is an XML tree with a single root node. This is then parsed into an XML document and returned. The PHP XSLT processor will treat that XML as a DOM fragment, and happily add the nodes to the document it is processing. If I had only returned the XML string, all the > and < would have been transformed into &gt; and &lt;.
I ran this, and apart from a few stray missing entities (which is why I added the die statement) it processed fine.
More XSLT processing with PHP
Scenario: More XSLT processing with PHP
    Given that the ingredient list is almost a list of Python tuples
    And strings are not quoted unless empty or longer than one word
    When I parse the ingredient node
    Then I want a list of nodes compatible with the XML schema I have been working with
The ingredient list is stored as a single string in this format:
<string>(
(
"For The Pastry:",
"",
"",
"",
YES,
NO,
945989
),
(
125,
g,
"unsalted butter",
"",
NO,
NO,
2364227
), ....
where the first of the YES/NO flags determines whether the entry is an ingredient group or a single ingredient. Not a great format, but at least regular.
First I change the XSL so that it now passes either ingredients or directions nodes to the respective PHP function, and passes the rest through untouched.
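One possible shape for that stylesheet is sketched below, built on the identity transform with two more specific templates on top. It is not the original XSL. The key name 'ingredients' is an assumption (the text does not give the plist key), and the php:function calls only work when the transform is run from PHP with registerPHPFunctions enabled. Using string(.) instead of a variable avoids having to escape the dollar sign when the XSL is embedded in a PHP string.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:php="http://php.net/xsl"
    exclude-result-prefixes="php">
  <xsl:output method="xml" encoding="utf-8" indent="yes"/>

  <!-- default: copy everything through untouched -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- the string right after the 'directions' key goes to splitSteps -->
  <xsl:template match="string[preceding-sibling::*[1][self::key and . = 'directions']]">
    <xsl:copy-of select="php:function('splitSteps', string(.))"/>
  </xsl:template>

  <!-- the string right after the (assumed) 'ingredients' key goes to getIngredients -->
  <xsl:template match="string[preceding-sibling::*[1][self::key and . = 'ingredients']]">
    <xsl:copy-of select="php:function('getIngredients', string(.))"/>
  </xsl:template>
</xsl:stylesheet>
```

Because templates with predicates take priority over the generic identity template, only the two targeted string nodes are replaced and every other key and value is copied as is.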
The PHP function to handle ingredients is not so hard (in PHP, that is; it would have been much worse in XSL). It uses regular expressions to massage the string into something that resembles a CSV string, then splits it, then copies the various bits into DOM nodes.
    /**
     * parses the ingredient string, which looks like a python tuple, but without quoted string
     * @param String $node the node being processed
     * @return DOMFragment a list of nodes
     */
    function getIngredients( $node ) {
        //removes enclosing ()
        $node = substr( $node, 1, -1 );
        //gets rid of stuff and flatten
        $node = preg_replace( "'\"'", "", $node );
        $node = preg_replace( "'\n\s*'", " ", $node );
        //gets individual ingredients or group names
        $lines = preg_split( "'(^|\),)\s*\('", $node );
        $header = array(
            "quantity" => 0
            , "measurement" => 1
            , "name" => 2
            , "preparation" => 3
            , "isgroup" => 4
            , "ignore" => 5
            , "ignore2" => 6
        );
        $dom = new DOMDocument( "1.0", "UTF-8" );
        $root = $dom->appendChild( new DOMElement( 'ingredients' ) );
        for( $i=0, $i2=sizeof( $lines ); $i<$i2; $i++ ){
            $line = trim( $lines[$i] );
            if( "" === $line ){
                continue;
            }
            //treats line as a CSV
            $fields = preg_split( "':?\,\s*'", $line );
            //a line is either a groupname or an ingredient
            if( "YES" === $fields[ $header["isgroup"] ] ){
                $ndGroup = $root->appendChild( new DOMElement( 'group' ) );
                $ndGroup->appendChild( new DOMElement( 'name', $fields[ $header["quantity"] ] ) );
            } else {
                if( !isset( $ndGroup ) ){
                    $ndGroup = $root->appendChild( new DOMElement( 'group' ) );
                }
                $ndIngredient = $ndGroup->appendChild( new DOMElement( 'ingredient' ) );
                $ndIngredient->appendChild( new DOMElement( 'quantity', $fields[ $header["quantity"] ] ) );
                $ndIngredient->appendChild( new DOMElement( 'measurement', $fields[ $header["measurement"] ] ) );
                $ndIngredient->appendChild( new DOMElement( 'name', $fields[ $header["name"] ] ) );
                $ndIngredient->appendChild( new DOMElement( 'preparation', $fields[ $header["preparation"] ] ) );
            }
        }
        return $root;
    }
Splitting XSL output to separate files with Xalan’s redirect
Scenario: Splitting XSL output to separate files with Xalan's redirect
    Given that XSLT are being processed with Xalan
    And that the input is a single XML file with several recipes in it
    When a recipe start is encountered
    Then the processor should create a new file for it
There are several possible approaches to splitting the XML into separate files, but the easiest is to use one of the XSLT extensions supported by Xalan, the default XSL processor used by Ant.
The extension needed in this case is redirect (can be found on the bottom LHS). To import it into the XSL document, I added a namespace to the xsl:stylesheet declaration:
(NOTE: some sources suggest using xmlns:redirect="org.apache.xalan.xslt.extensions.Redirect" but that doesn’t work for the version of Xalan included with Ant).
Then using it is simply a matter of enclosing the body of the recipe template in redirect:write tags, which take a ‘file’ attribute to specify the file path. Note that ‘file’ is relative to the project home, which by default is where the build.xml file sits.
I also used the xsl:fallback tag for catching errors, but for a one-off job it doesn’t really matter.
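Put together, the redirect setup might look roughly like the sketch below. It rests on several assumptions: the namespace URI is the one given in the Xalan-J extensions documentation (http://xml.apache.org/xalan/redirect), which may not be the exact declaration used here; the output path recipes/recipe-N.xml is a made-up naming scheme; and the file name is computed with the select attribute, which Xalan documents as the XPath-based alternative to ‘file’. The recipe body is cut down to a title.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:redirect="http://xml.apache.org/xalan/redirect"
    extension-element-prefixes="redirect">
  <xsl:output method="xml" encoding="utf-8" indent="yes"/>

  <!-- ENTRY POINT: nothing goes to the main output, each recipe gets its own file -->
  <xsl:template match="/">
    <xsl:apply-templates select="/plist/array/dict"/>
  </xsl:template>

  <!-- ONCE PER RECIPE -->
  <xsl:template match="/plist/array/dict">
    <!-- hypothetical naming scheme: recipes/recipe-1.xml, recipes/recipe-2.xml, ... -->
    <xsl:variable name="path" select="concat('recipes/recipe-', position(), '.xml')"/>
    <redirect:write select="$path">
      <recipe lang="en-uk">
        <title>
          <xsl:value-of select="key[. = 'name']/following-sibling::*[1][self::string]"/>
        </title>
      </recipe>
      <!-- only runs when the processor does not know redirect:write -->
      <xsl:fallback>
        <xsl:message terminate="yes">redirect:write is not supported by this processor</xsl:message>
      </xsl:fallback>
    </redirect:write>
  </xsl:template>
</xsl:stylesheet>
```

Declaring the prefix in extension-element-prefixes is what tells the processor to treat redirect:write as an instruction rather than as a literal result element.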
Note that among the XSLT extensions supported by Xalan there are also some dealing with strings, including splitting a string, but the recursive function I have created works fine and is independent of the processor, so I’ll stick to that.
At first the transform failed with the error: Error! Unrecognized XSLTC extension 'org.apache.xalan.xslt.extensions.Redirect:write'. After a bit of digging around, I downloaded the latest version of Xalan, extracted the zip, and copied the jars with

    sudo cp ~/Downloads/xalan-j_2_7_1/lib/*.jar /usr/share/ant/lib/

That fixed the issue, and I was able to generate all the recipes.
Challenge 100% complete
If this were a proper job, I would first have explored whether I could do it all the way I planned (i.e., using Ant, XSLT, and Jython), and changed plan once I discovered I couldn’t. That would have meant doing the whole thing in PHP rather than as a two-step procedure. But I am only playing around, and the important thing is that the conversion was successful.