I am working on a project that makes heavy use of XSLT to convert XML data from a legacy system to JSON for an HTML5 web application. Converting XML to JSON is actually quite easy using XSLT, and there are even a few open source XSLT templates out there that help you do it. Things got tricky when my client developers wanted HTML snippets in the JSON as well. It certainly made sense: I’d rather put HTML in a server-side template than generate it in JavaScript on the client. However, one thing I did NOT want to do was encode all the HTML by hand, which would be error prone and impossible to maintain. So I found an excellent way to include the HTML as valid XML in the XSLT template and serialize it as a JSON string.

My solution started with another great open source XSLT template that converts XML node-sets to strings. However, I could not use this template as is, for two reasons. First, I wanted to include my HTML snippet in the XSLT template itself as valid XML. So I created a template that serves as a “helper function”: it stores the HTML in a variable and runs the nodetostring templates over it to produce a string. This way, I can call my helper function anytime I need to convert HTML to a string while building my JSON response. I had to store the HTML in a variable because xsl:call-template has no mode attribute; the only way to invoke templates in a given mode is xsl:apply-templates with a select expression, as in <xsl:apply-templates select="$html" mode="nodetostring"/>. The problem is that in XSLT 1.0, XML stored in a variable is not a node-set. It is a Result Tree Fragment, and templates cannot be applied to it directly. So you need an XSLT extension function to convert the Result Tree Fragment to a node-set. Since I am doing all this in Java, I was able to use the xalan:nodeset($html) function for this.
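
To make the helper idea concrete, here is a minimal sketch of what such a template might look like. The template name html-to-string, the parameter name html, the sample snippet, and the tiny nodetostring stand-in at the bottom are all assumptions for illustration; the real open source nodetostring template does far more (attributes, escaping, and so on). The xalan namespace URI is the one used by Xalan-Java.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xalan="http://xml.apache.org/xalan"
    exclude-result-prefixes="xalan">

  <xsl:output method="text"/>

  <!-- Hypothetical helper: receives literal HTML as a parameter.
       In XSLT 1.0 the parameter holds a result tree fragment, so it is
       converted with xalan:nodeset() before templates are applied. -->
  <xsl:template name="html-to-string">
    <xsl:param name="html"/>
    <xsl:apply-templates select="xalan:nodeset($html)/node()"
                         mode="nodetostring"/>
  </xsl:template>

  <!-- Example caller: the HTML lives in the stylesheet as ordinary XML. -->
  <xsl:template match="/">
    <xsl:text>{"snippet":"</xsl:text>
    <xsl:call-template name="html-to-string">
      <xsl:with-param name="html">
        <p>Hello World!</p>
      </xsl:with-param>
    </xsl:call-template>
    <xsl:text>"}</xsl:text>
  </xsl:template>

  <!-- Deliberately tiny stand-in for the nodetostring templates.
       It ignores attributes and does no escaping at all. -->
  <xsl:template match="*" mode="nodetostring">
    <xsl:value-of select="concat('&lt;', name(), '&gt;')"/>
    <xsl:apply-templates mode="nodetostring"/>
    <xsl:value-of select="concat('&lt;/', name(), '&gt;')"/>
  </xsl:template>

</xsl:stylesheet>
```

Passing the HTML through xsl:with-param keeps the snippet well-formed and checked by the XML parser, which is the whole point of not encoding it by hand. On a processor other than Xalan, the equivalent extension function would have a different name and namespace.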

The second issue was that the nodetostring XSLT template did not encode the HTML as JSON exactly the way I wanted. I found four gotchas that I had to correct in the open source template.

  1. I had to escape the quotation marks used around HTML attributes, like so: <img src=\"someimage.png\"/>
  2. I had to escape the forward slash in my HTML end tags, like so: <div>Hello World!<\/div>. This is an ancient artifact of an old HTML spec that didn’t want HTML parsers to get confused by end tags in strings placed inside a <SCRIPT> tag. For some reason, today’s browsers still like it.
  3. This one was totally bizarre. I also needed to include a space between the tag name and the slash on self-closing tags. I have no idea why this is, but on MOST modern browsers, if you use JavaScript to append a <li> tag as a child of an unordered list that is formatted like so: <ul/>, it won’t work. The <li> gets added to the DOM after the ul tag. But if the code looks like this: <ul /> (notice the space before the /), everything works fine. Very strange indeed.
  4. I had to be sure to encode any quotation marks that might be included in the (bad) HTML content itself. This is the only thing that would really break the JSON, by accidentally terminating the string early. I used the escape-quot-string template from xml2json.xsl (link in first paragraph) to search for " and convert it to \".
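
The four adjustments above can be sketched as a small set of nodetostring-mode templates. This is not the open source template itself, just a simplified illustration under stated assumptions: the template name escape-quotes and its parameter s are made up here (standing in for escape-quot-string from xml2json.xsl), and the sketch does not handle backslashes, newlines, comments, or re-escaping of < and & in text.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Serialize the whole input document, for demonstration only. -->
  <xsl:template match="/">
    <xsl:apply-templates select="node()" mode="nodetostring"/>
  </xsl:template>

  <xsl:template match="*" mode="nodetostring">
    <xsl:text>&lt;</xsl:text>
    <xsl:value-of select="name()"/>
    <xsl:apply-templates select="@*" mode="nodetostring"/>
    <xsl:choose>
      <xsl:when test="node()">
        <xsl:text>&gt;</xsl:text>
        <xsl:apply-templates select="node()" mode="nodetostring"/>
        <!-- Gotcha 2: escape the slash in the end tag. -->
        <xsl:text>&lt;\/</xsl:text>
        <xsl:value-of select="name()"/>
        <xsl:text>&gt;</xsl:text>
      </xsl:when>
      <xsl:otherwise>
        <!-- Gotcha 3: space before the slash on self-closing tags. -->
        <xsl:text> /&gt;</xsl:text>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <xsl:template match="@*" mode="nodetostring">
    <xsl:text> </xsl:text>
    <xsl:value-of select="name()"/>
    <!-- Gotcha 1: escaped quotes around the attribute value. -->
    <xsl:text>=\"</xsl:text>
    <xsl:call-template name="escape-quotes">
      <xsl:with-param name="s" select="."/>
    </xsl:call-template>
    <xsl:text>\"</xsl:text>
  </xsl:template>

  <!-- Gotcha 4: escape quotes that appear in text content. -->
  <xsl:template match="text()" mode="nodetostring">
    <xsl:call-template name="escape-quotes">
      <xsl:with-param name="s" select="."/>
    </xsl:call-template>
  </xsl:template>

  <!-- Hypothetical recursive replace of " with \" (XSLT 1.0 has no replace()). -->
  <xsl:template name="escape-quotes">
    <xsl:param name="s"/>
    <xsl:choose>
      <xsl:when test="contains($s, '&quot;')">
        <xsl:value-of select="substring-before($s, '&quot;')"/>
        <xsl:text>\"</xsl:text>
        <xsl:call-template name="escape-quotes">
          <xsl:with-param name="s" select="substring-after($s, '&quot;')"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$s"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

Given an input such as <ul class="menu"><li>He said "hi"</li><br/></ul>, this sketch should produce <ul class=\"menu\"><li>He said \"hi\"<\/li><br /><\/ul>, which can be dropped between the quotes of a JSON string value.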

 

Hopefully anyone attempting to serialize HTML as JSON will find these lessons learned helpful. There was one other issue I had to deal with involving a Java bug, but more on that later.