More on Word's mediocre XML
It's not just the index tag markup, but most of the "Insert Field" parts.
After I wrote recently about the awful markup used to identify index entries when you save a Word 2003 file as XML, Jon Udell wrote to me to relay MS Office Program Manager Brian Jones' query about whether I felt similarly about other markup in the XML version of a Word document. I haven't had the time to do a comprehensive review of the XML, and I've written before about a pleasant surprise I found in it (and I was annoyed at the fuss over Microsoft paying Rick Jelliffe to add some perspective to the ODF/OOXML Wikipedia entries—it's Rick Jelliffe, for chrissake) but a bit more investigation let me generalize from my earlier negative comments, and after writing it out to Jon I thought I'd expand on it a bit and post it.
The project I'm working on doesn't need the hyperlinks or table of contents markers in the Word XML, but from what I've seen of them, it looks like the XML representation of most of the Insert Field features is an XML-ized version of the RTF: <w:fldChar w:fldCharType="begin"/>, then a w:instrText element with some cryptic string such as ' TOC \o "1-2" \n \p " " \h \z ' for a table of contents marker, 'HYPERLINK \l "_Toc135558539"' for a hyperlink, ' XE "' for an index entry, and <w:fldChar w:fldCharType="end"/> to finish it.
To test this theory, I created a sample document with about a dozen things added with different Insert Field selections and exported the result as an XML document. The XML version of most of the field constructs begins and ends with w:r elements containing w:fldChar elements with w:fldCharType attribute values of "begin" and "end". Some store their information in a w:r child of a w:fldSimple element instead. The w:fldSimple element's w:instr attribute seems to be the equivalent of the w:instrText cousins of the w:fldChar "begin" and "end" elements, with cryptic strings of uppercase keywords, punctuation, and quotation marks like the TOC one shown above to say something about their purpose. (To be fair, the "Hyperlink" field had an actual w:hlink element to represent it.)
Indicating where the constructs begin and end with two separate, generic empty elements that have a fldCharType attribute value of "begin" and "end" is much more difficult to work with than a matched pair of start- and end-tags. XML isn't simply the representation of data with tags enclosed in angle brackets in such a way that Xerces doesn't complain about it; much of the point of XML is to clearly indicate where things (and sub-things) begin and end using a matching pair of start- and end-tags. I suppose that an XML representation of a Word file must address the possibility of overlap—what if the document has bold text, then bold italic, then just italic?—but if the OpenOffice coders can parse the original Word file and turn it into good markup, we know it can be done.
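To give a rough idea of the extra work involved, here is a minimal Python sketch that pairs up the "begin" and "end" markers by hand and collects the instruction strings between them. The sample paragraph is made up for illustration, the namespace URI is the one I assume Word 2003 uses for its w: prefix, and collect_fields is a hypothetical helper name, not part of any Microsoft API.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the w: prefix in Word 2003 XML; adjust to your files.
W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W}

# Invented sample: one index-entry field between two ordinary text runs.
SAMPLE = f"""<w:p xmlns:w="{W}">
  <w:r><w:t>See </w:t></w:r>
  <w:r><w:fldChar w:fldCharType="begin"/></w:r>
  <w:r><w:instrText> XE "markup" </w:instrText></w:r>
  <w:r><w:fldChar w:fldCharType="end"/></w:r>
  <w:r><w:t>the index.</w:t></w:r>
</w:p>"""


def collect_fields(paragraph):
    """Return the instruction string of each begin/end field in a paragraph."""
    fields = []       # completed fields, in the order they close
    open_fields = []  # stack of instruction fragments for fields still open
    for run in paragraph.findall("w:r", NS):
        for child in run:
            if child.tag == f"{{{W}}}fldChar":
                kind = child.get(f"{{{W}}}fldCharType")
                if kind == "begin":
                    open_fields.append([])
                elif kind == "end" and open_fields:
                    fields.append("".join(open_fields.pop()).strip())
            elif child.tag == f"{{{W}}}instrText" and open_fields:
                # Instruction text belongs to the innermost open field.
                open_fields[-1].append(child.text or "")
    return fields


fields = collect_fields(ET.fromstring(SAMPLE))
print(fields)
```

With a matched pair of start- and end-tags, the parser would have done this bookkeeping for me; here the stack has to live in my own code.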
A new annoyance revealed by my further research is the fact that those w:instrText elements store their cryptic strings of information such as ' TOC \o "1-2" \n \p " " \h \z ' as PCDATA. Using XSLT, it's usually easy to check whether an element has no content (regardless of the number of descendant elements it has) by checking whether normalize-space(.) = "", and when processing XML versions of Word there are often empty paragraphs and maybe even empty sections that you want to throw out, but these w:instrText elements prevent this from working. I know that storing content in PCDATA and metadata in attributes is only a convention, but it's a convention of document-oriented XML going back to SGML days, and an XML version of a Word file is certainly document-oriented XML. (More on this in the comments to my earlier entry on the topic.)
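A workaround is to write an emptiness test that skips the field instructions. The following is a minimal Python sketch of that idea rather than the XSLT itself; the two sample paragraphs are invented, the namespace URI is again my assumption for Word 2003 XML, and visible_text and is_effectively_empty are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the w: prefix in Word 2003 XML.
W = "http://schemas.microsoft.com/office/word/2003/wordml"


def visible_text(element):
    """Concatenate descendant text, skipping anything inside w:instrText."""
    if element.tag == f"{{{W}}}instrText":
        return ""
    parts = [element.text or ""]
    for child in element:
        parts.append(visible_text(child))
        parts.append(child.tail or "")
    return "".join(parts)


def is_effectively_empty(element):
    """True when the only character data present is field instructions or whitespace."""
    return visible_text(element).strip() == ""


# Invented samples: a paragraph holding only an index-entry field, and one with real text.
only_field = ET.fromstring(
    f'<w:p xmlns:w="{W}">'
    '<w:r><w:fldChar w:fldCharType="begin"/></w:r>'
    '<w:r><w:instrText> XE "markup" </w:instrText></w:r>'
    '<w:r><w:fldChar w:fldCharType="end"/></w:r>'
    '</w:p>'
)
with_text = ET.fromstring(f'<w:p xmlns:w="{W}"><w:r><w:t>Hello</w:t></w:r></w:p>')

# The naive test (all descendant text) sees the instruction string as content.
naive_empty = "".join(only_field.itertext()).strip() == ""
print(naive_empty, is_effectively_empty(only_field), is_effectively_empty(with_text))
```

The naive check treats the field-only paragraph as having content, while the filtered check treats it as empty. If the instruction strings had been stored in attributes, the naive check would have been enough.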
The kinds of things that a Word user picks "Insert Field" to add are often very important to what makes a Word or XML document richer than plain ASCII text with no markup, and it's a shame that whoever designed the MS XML to represent these didn't do a little more modeling of the data necessary to represent each field type and instead just mapped the RTF (or whatever internal structures that I'm sure the RTF reflects) to pointy brackets and strings full of internal codes. I'm sure it made their design work go more quickly, but the result is something that offers few good arguments for advocacy as a standard.
Jones' blog has been talking up an open source API for processing the Office XML, and while it's good that such a tool exists and is open source, it doesn't address the issues I describe above. The "don't worry about the data complexity, we have a tool that takes care of it" argument often presented in such cases leads to a software dependency, and the reason we use open data standards is to avoid dependency on specific tools. (A dirty little secret of the SGML world was that while we all preached the gospel of an open ISO data standard as a way to avoid dependency on specific software tools, most serious production work relied on Omnimark, a company that at the time was run by a man who would rather tell developers what they needed than listen to what they needed. One former employer of mine converted their SGML system to use XML purely to eliminate their dependency on Omnimark.) A dependency of a data format on a specific tool takes away from arguments toward making that data format a standard.
The things that a Word doc file or an XML version of that doc file must represent can be complex, and I'm sure that further investigation of the XML, if I had the time, would reveal further pleasant surprises and further annoyances. So far the score, on balance, is pretty low.
Comments
(Note: I usually close comments for an entry a few weeks after posting it to avoid comment spam.)
I wrote about the field problem a while back WRT citations.
On a related note, you might be interested to know that the OpenDocument Metadata SC has just wrapped up its proposal for enhanced metadata support. Based on RDF, it will include a generic field, whose logic is encoded using ... RDF.
Posted by: Bruce D'Arcus | June 10, 2007 12:25 PM
"I suppose that an XML representation of a Word file must address the possibility of overlap—what if the document has bold text, then bold italic, then just italic?—but if the OpenOffice coders can parse the original Word file and turn it into good markup, we know it can be done."
not to mention if CSS and HTML can do it.
Posted by: bryan | June 10, 2007 1:14 PM
Thanks for the nice words! Here are some comments on the structure of Open XML, which I was working on for an article draft, but which may provide some extra general info for interested people.
Open XML's syntax is indeed odd at first, easily enough material for a year's worth of blogs :-), but I have found that there are usually reasons: Open XML has been made using completely different tradeoffs than, say, DOCBOOK has, and consequently looks different.
First, on the superficial syntax. Remember a few years ago when Michael McQueen was saying that the trouble with attributes was that you couldn't have structured attributes, and the SML people were saying that there was no difference between an attribute and an element and that we should reduce our use of attributes to a minimum? That seems to have influenced MS' design choice behind their properties. They have systematically adopted a "head-body" approach to have properties in elements (this is hardly a new thing: I wrote about it in my 1998 book): there is a consistent naming convention of a "Pr" suffix used throughout.
However, they also have the HTML-inspired approach that element content should only have searchable content in mind, so that searching doesn't need to be schema-aware. (With the slight complication that you raise, that deleted text sections and fields still use data content, and the use of numeric indexes to shared string tables in SpreadsheetML.) Then they have decided against using mixed content, again influenced by the SML propaganda but also because it resolves one issue for documents loading into relational DBMS.
Now I never cared for the SML ideas much: but the combination of allowing structured attributes, schema-less searches, and easy loading to DBMS are entirely respectable choices it seems to me. Which is not to say that DOCBOOK or ODF should adopt the same goals.
Second, still at a fairly superficial level, terseness has been a goal of Open XML. This is particularly true in the oldest of the languages, SpreadsheetML, which goes back to the beginning of the decade. Not only does Open XML use short element names, it also reveals the internal optimizations of Excel: sparse matrixes, shared strings, SQL_DATE-style numeric indexes to dates, and so on.
When we think of an application like Office, used by hundreds of millions of people worldwide, load/save/recalculate times are not a secondary issue; for example, saving 1 minute a day for one hundred million people is not nothing! Indeed, it then becomes a challenge to ODF, to say "Why don't you support more optimized forms?" (No criticism of ODF intended: it is growing up.)
Third, on the organizational level, Open XML uses its Open Packaging Conventions to recreate SGML's entities: when referring to another document (part, e.g. equiv of entity) whether internal or external, an id is used (e.g. equiv of entity name), and each part that has such references has a relationships file (e.g. equiv of internal subset containing entity declarations) which maps the ids to URIs (internal or external). ...Indirection!: where is Eliot Kimber?
The reason for this is obviously that in jettisoning DTDs for XML Schemas, you also jettison the mechanism for making compound documents, and the wheel needs to be reinvented. I don't think there would be any need to mention to Bob how useful entities are for production purposes, for mid-sized documents. (For large documents, extra levels of indirection and managed IDs become practical; for smaller HTML-sized documents, direct markup of URLs is easier without an indirection mechanism like URLs; but for middle-sized documents, moving constants to headers helps manageability: for example, if you have a catalog with the same logo displayed 10,000 times, it is preferable to change it once in the relationship file (equiv to entity declaration) rather than in each of the 10,000 references.) Actually, OPC is something that I think ODF could well adopt.
But OPC and relationships do make the markup a little more difficult to read, if you don't know that they are there, because suddenly there is information held in different files. I think people coming from HTML will especially have this problem.
Fourth, there is a difference in the design level too: as far as I can see, what MS were trying to do is to take a *completely* linear format and allow arbitrary interleaving of custom XML as the mechanism for *all* structuring. Office 2007 doesn't do any structural implication that I know of (though I am not an expert in it.)
So saying Open XML is like RTF-in-XML is not unfair, though to say that Open XML is *only* RTF-in-XML would be unfair. Nor would a comparison with HTML (a linear format where structures can be made by the user with DIV and SPAN) be unfair.
Open XML is an "open" format in the sense that the zipper on a flasher's pants is open: you may not like what you see, it may be less or more than you were expecting, but the functionality is exposed unadorned for all the world's education: whether you are repelled or see opportunities is your business :-) The aim of Open XML is to expose everything that goes on inside Office 2007 not to mediate it according to some abstract/ideological view of the perfect document.
So, in Word, a document is a list of blocks, and a block is either a list of runs or a table. Consequently, in WordprocessingML, a document contains a sequence of <p> or <tbl> elements, and a <p> contains a sequence of <r> run elements, which may contain a sequence of <t> text runs and diagrams etc.
The radical thing MS did was to take an interleaving approach to structure: you can open any schema, and use this with a context sensitive editor (in Word) to wrap blocks, runs, rows and cells with "custom" elements from that schema. The schema is used to provide syntax direction, but not for subsequent validation; the created WordprocessingML document can still be validated against its usual schemas because the custom elements are marked up with one level of indirection, as values of customXml elements in the word-processing space. Now at the moment, this is not fully baked: you cannot key styles to customXml elements as far as I know: but the aim is to expose what Office 2007 does not what it *may* or *should* do!
In this way they are trying to turn the linear format from a flaw into a strength: if they had structures in place already (sections, lists, headings) they would have to figure out how not to clash with custom XML structures (which is a problem I expect ODF would have.)
Fifth, I think Open XML is about the first consumer format I have seen which takes the separation of presentation from content in tables really seriously. This is something that Dave Peterson used to comment on, that tables are a presentation format which should link into tabular data held separately. So Open XML provides mapping controls to XPaths and also columns.
Lou Burnard used to quote someone that all DTDs are theories about a document: Open XML is clearly a theory about office documents in which there is a hierarchy of 1) casual (linear) documents, then 2) linear documents containing links to data in highly structured XML data, then finally 3) structured documents. Each of these levels has a smaller user base than the preceding one, and the idea of Office XML is to expose what Word/Excel/Powerpoint does in attempting to add better support for the subsequent layers onto its linear roots. The theory is that the requirement for structured literary documents is less than the requirement for linking to structured data documents from unstructured literary documents.
It is an interesting theory, and not one that can be sniffed at, I think. If we remember the Pinnacles DTDs, as used by chip makers in the early 90s, it had a database section in its header, for example for Vcc voltage levels, and the text used references to it. The value to the user was not the structuring of the information into sections (which can be done by stylesheets, i.e. by attributes/properties) but the ability to reference the database. Is that referencing capability (of XML) in fact more important for most users/uses than the explicitly hierarchical structuring capabilities?
Now, all that being said, I hope one of the opportunities for the ISO process is to find out where there is some syntactic ugliness that causes real problems and to get the rationale explained and, if there is no good rationale and the issue is important, to get an improvement in the works. I don't believe for a moment that Open XML is perfect, not that perfection is required for a standard of its type (i.e. one that exposes a particular deployed application), and the more that we can focus on real production issues of the kind that Bob is raising, rather than the parade of high-volume bogosity we have been treated to, the more chance that Open XML can be blocked as an ISO standard (if those real flaws reach a showstopping level) or improved (if those flaws can be explained or fixed) or accepted but with an understanding of its qualities and attributes.
Posted by: Rick Jelliffe | June 11, 2007 4:39 AM
Rick,
Thanks for all this! Two comments:
>the idea of Office XML is to expose what Word/Excel/Powerpoint
Is there a version of Powerpoint that can save as XML?
>The aim of Open XML is to expose everything that goes on inside Office
>2007 not to mediate it according to some abstract/ideological view of
>the perfect document... the aim is to expose what Office 2007 does
>not what it *may* or *should* do!
I can see the reasons for doing it this way, but how can MS advocate that an XML format designed to expose the internal workings of an aging binary format should be the standard adopted by governments and corporations around the world instead of one in which the abstractions were thought out first and the execution was modeled on those? Put another way, how can they suggest that the standard be for users to adapt their legislated norms to the quirks of one company's tool instead of the other way around? (The answer, of course, is because it's their tool.)
Posted by: Bob DuCharme | June 11, 2007 8:31 AM
Is there a version of Powerpoint? Yes, Office 2007 and AFAIK Office 2000, 2003 and XP with the compatibility kits can save as XML-in-ZIP. (MS also has a save-everything-to-one-file-including-images kind of XML that was in Office 2003, but I hope they have not made that available.)
But Bob, where on earth did you get the idea that anyone on the MS side has ever said that Open XML should be "the standard adopted by governments and corporations around the world instead of one in which the abstractions were thought out first and the execution was modeled on those?" Boy, that is complete propaganda, and nothing like what I have ever advocated and nothing like what I have ever read or heard from any MS person: and I have been following the issue with more than casual interest. Indeed, MS voted for ODF recently at ANSI/INCITS/V1 or whatever.
Any sources for this? Or is it something that "everyone knows"? I think it comes from the idea that there cannot be overlapping standards, or that if there is a standard we are somehow forced to use it. I think it is words being put in MS' mouth. TCP is standard not because it is technically excellent or because it came as a result of great openness initially, but because the RFC describes fairly the pre-existing technology: people chose it (or didn't) over the ISO standards because they were smart enough, not because they were confused by multiple standards or compelled to use it.
ISO standards are voluntary, and the legislatures have by and large resisted the attempt to make them mandatory, especially when most of the mature implementations of ODF are proprietary (IBM word processor, Lotus' new Notes, Sun Star Office, MS Office) and the open source versions are notoriously ratty, in 2006/2007 timeframe at least. You don't fight a monopoly by creating a cartel. Norway has set a really good example recently for documents made public to the external population, by making ISO PDF mandatory for completed public documents, ISO ODF mandatory for incomplete public documents, HTML allowed for websites where appropriate, and any other ISO format allowed (i.e. future ISO Open XML) to provide parallel versions. That is great: a rich range of formats and the guidance about when to use each...the Norwegian standards prudently leave out mandating what standards should be used internally in systems: that is really where you would expect Open XML to be positioned.
Where MS is freaking out AFAICS is not where governments mandate ODF for public documents (few public documents are incomplete and so would be PDF or HTML anyway, and Office has a good ODF export/import story) but in the idea of disallowing the Office native format internally inside production systems. They have put in a lot of XML-based features for which there is no equivalent: it makes Office much more competitive against, say, Crystal Reports or even Web forms and XMetal. It is that internal market for system developers and integrators and archive-openers to which Open XML is targeted, not the market of level-playing-field public document interchange between competing office suites. I think the competitors talk about public documents, but they are playing a bait-and-switch scam to try to block people from choosing Office for use in internal systems with its extra integrator-friendly features. In other words, MS wants to make Office a rich platform with features that go beyond ODF, MS' competitors want to prevent this and fence systems in to only use or exchange ODF (and any extensions that can be grafted on top of it.)
The MS position, as I understand it from their public comments, is that ODF is fine for many simple document exchange uses but, taking a cold hard look despite our hopes and wishes otherwise, is not adequate (in its current form of ODF 1.0 and more so in its form of 2 years ago when the Office 2007 decisions were being made) for the most basic of requirements needed for the default format for Office: that you could save a document with all the information needed to reopen it unchanged. MS' choice was either to add what they needed to ODF 1.0 draft (not a good idea in view of embrace and extend concerns) or become mind readers and adopt two years ago features that ODF 1.3 or 1.4 will have in three years time (not a good idea due to the inaccuracy of clairvoyant technology.)
However, I am not an MS spokesman (I haven't even signed a non-disclosure agreement), so my view may be skewed. I just go on what their public comments and training material say.
Posted by: Rick Jelliffe | June 11, 2007 9:31 AM
Thanks, I hadn't heard about the compatibility kits. I just saw "Save as XML" in Word and Excel but not PowerPoint.
I guess I had an oversimplified view of the standardization issues. I thought it was a case of "let's pick a format in which to store our content as we go forward," with two sides saying "pick our format, not theirs", which is often the case with disagreements over data standards. The different levels and different formats appropriate to each level make sense.
TCP was not a single product from a single company, and the standard gave a blueprint for implementers to work from to ensure interoperability. It sounds like Microsoft's format functions more as an API to their product suite, and while I'm glad that it exists, what is the point of having it stamped as an ISO standard, besides the marketing advantages of being able to say "it's an ISO standard"? In other words, what is the advantage to anyone outside of Microsoft of their XML format being an ISO standard? Wouldn't implementers have to work around the interface decisions of this one company whether the documentation of this interface held "standard" status or not?
Posted by: Bob DuCharme | June 11, 2007 10:31 AM
The XML-in-ZIP is the default format for Office 2007 (Word, Excel, Powerpoint). If you do "Save as XML" you get the crazy all-in-one file format that adds a level of packaging elements instead of the ZIP package. To convert a .DOCX file to ZIP, just change the extension.
They have given up binary formats as the default. You can still save in the old binary formats of course, or ODF and XHTML and ISO PDF/A if you want to, and Excel has a new fast binary format available as well. There are batch converters available to convert old repositories to Open XML too.
TCP indeed pre-existed the RFC by about 7 years: see
Cerf, V., and R. Kahn, "A Protocol for Packet Network
Intercommunication", IEEE Transactions on Communications,
Vol. COM-22, No. 5, pp 637-648, May 1974.
for the first description by researchers Vint Cerf and R. Kahn. Then it went through about 8 incarnations at ARPA before it became an RFC. Almost all the fundamental internet technologies were developed as libraries/(=~applications/products) first then described later: it is not the blue-sky development method at all. It was the ISO OSI protocols that were developed based on blue sky thinking (a la Richard Gabriel's The Right Way is the Wrong Way), to a large extent (or at least that is the mythology passed down.)
Why should Open XML be an ISO standard? Well, my attitude is probably more "why shouldn't it be?" ISO has to be fair, even to Microsoft. If a company can never win by playing the standards game, they never will; the trick is making sure that everyone wins. The basic reason something becomes a standard at ISO is that there is a market requirement for it: now if there isn't a market requirement for MS to document and explain their formats it is hard to think of what else would pass the test... Having Open XML will not stop the progress of ODF for public documents: the dynamics and sweet spots for both are too different.
Posted by: Rick Jelliffe | June 11, 2007 2:46 PM
>If a company can never win by playing
>the standards game, they never will
don't be confused here:
"play the standards game" is not the same as
"game the standards system"
Posted by: marc | June 14, 2007 12:52 PM
They're close enough--you'd be hard-pressed to make a list of companies that play the game with no interest in gaming the system and another list of companies that game the system without playing the game.
Posted by: Bob DuCharme | June 14, 2007 1:41 PM
Rick wrote:
"The basic reason something becomes a standard at ISO is that there is a market requirement for it: now if there isn't a market requirement for MS to document and explain their formats it is hard to think of what else would pass the test..."
This seems to be the crux of the matter: what is "it" when we're talking about standards?
As I read through the ISO website looking for some hints, it seems to me the conclusion I come to is "it" in fact is an XML standard for office documents. On that basis, we already have one, and OOXML would be a competing standard, undermining the purpose of ISO.
The notion that one should have two ISO standards for office documents thus seems to require Rick's more narrow reading of OOXML (that "it" documents a specific dominant product, and that this is the "market requirement").
Posted by: Bruce | June 27, 2007 9:30 AM