Converting an XML document's encoding
With a very brief XSLT stylesheet.
A colleague recently asked about converting a collection of XML documents to the US-ASCII encoding (that is, to documents where everything is either a US-ASCII character or a numeric character reference such as &#233; for the é character). I have several utility stylesheets for converting the encoding of XML documents, and a slight change to one of them gave me a new version that would create a US-ASCII version of any XML document:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:output encoding="us-ascii"/>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
The stylesheet's single template rule copies every node in the source document tree to the result document tree unchanged. The key to the stylesheet is the value of the xsl:output element's encoding attribute, which specifies the encoding to use when writing the result tree to a file. My similar stylesheets, which have names like latin1out.xsl and utf8out.xsl, are identical except for this encoding attribute's value.
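To illustrate how little changes between these stylesheets, a Latin-1 variant might look like the following sketch. The encoding name iso-8859-1 is an assumption about what a file like latin1out.xsl would contain; your processor may expect a different spelling.

```xml
<!-- Sketch of a Latin-1 variant; "iso-8859-1" is an assumed encoding name. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <!-- Only this attribute value differs from the US-ASCII version. -->
  <xsl:output encoding="iso-8859-1"/>

  <!-- Identity transform: copy every attribute and node unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Characters that Latin-1 cannot represent would be written as numeric character references, just as non-ASCII characters are in the US-ASCII version.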
Your choices for what to put in this attribute are limited to what your XSLT processor can handle. While Xalan is usually my third favorite XSLT processor, with Xerces as the XML parser underneath, it can read and write quite a few encodings. (I know it's possible to tell Saxon to use Xerces instead of the Aelfred parser that it usually uses, but I'm too lazy to figure out how.)
So if you need a simple tool to convert the encoding of one or more XML documents, find the encoding name in this list (if the one you pick has a name in parentheses on that list, use that), make it the value of the encoding attribute in the xsl:output element above, and you'll have a stylesheet that converts any well-formed XML document to that encoding.
Comments
(Note: I usually close comments for an entry a few weeks after posting it to avoid comment spam.)
Hi Bob,
xmllint (http://xmlsoft.org/xmllint.html) has an "--encode" option which is also useful for this.
Posted by: Dave Holden | September 28, 2007 10:07 AM
It's possibly marginally quicker to use the single template
to save the system doing a template match at every level.
Also, if you use xslt2 you can add omit-xml-declaration="yes"
which is useful, especially with US-ASCII encoding, as it gains the benefit of using ASCII without the potential drawback of the document being rejected with an unknown encoding. (If I recall correctly, early msxml systems wanted "ASCII", not "US-ASCII", and just putting nothing, and so letting the receiving system default to utf-8, works fine for ASCII documents.)
David
Posted by: David Carlisle | September 28, 2007 11:00 AM
David,
Nice idea, thanks. The template rule I did have is part of my starting point when I'm creating a new stylesheet, but yours is terser. (I try to avoid matching on "/" because there have been too many times when it's come back to bite me after I added the document() function somewhere in the stylesheet.)
Bob
Posted by: Bob DuCharme | September 28, 2007 11:18 AM
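A single-template version along the lines David describes might look like the following sketch. The match on "/" comes from Bob's reply; the use of xsl:copy-of is an assumption, chosen because it is the usual way to copy a whole document in one step. The omit-xml-declaration attribute is defined for xsl:output in XSLT 1.0 as well, so the sketch keeps version="1.0".

```xml
<!-- Sketch only: xsl:copy-of is an assumed reading of the "single template" idea. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <!-- Pure ASCII output with no XML declaration, so parsers fall back to UTF-8. -->
  <xsl:output encoding="us-ascii" omit-xml-declaration="yes"/>

  <!-- Match the root once and deep-copy the entire document tree. -->
  <xsl:template match="/">
    <xsl:copy-of select="."/>
  </xsl:template>

</xsl:stylesheet>
```

Because the root is matched only once, no further template matching happens below it. The trade-off Bob mentions still applies: a match on "/" also fires for the root of any document loaded later with document().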