Converting DTDs (and DTD developers) to RELAX NG Schemas

Bob DuCharme
LexisNexis
USA
http://www.snee.com/bob


Abstract


The arrival of any new schema language usually brings a utility to convert DTDs to that language, but using such a utility on a large collection of sophisticated DTDs isn't enough to provide a smooth transition for a big publishing organization. Even if all the DTD developers are trained in the new schema language beforehand, a sudden change in their responsibilities to maintaining their large, complex schemas in the new syntax is impractical.

Certain features of RELAX NG and its open-source Trang conversion utility let a DTD development staff continue their use of DTDs as they did before while writing short RELAX NG (or RELAX NG Compact) schemas that specify additional constraints for their data. This lets them continue to use DTD-based applications with an optional step of validating against the RNG schemas to check the additional constraints. During this transitional period, they can get comfortable with RNG's syntax and new features at their own pace, making the complete conversion from DTDs to RNG when they're ready. This paper describes the use of several example DTDs with add-on RNG schemas, concluding with a discussion of the difficulties of implementing a similar transition to W3C Schemas.


Table of Contents


Introduction
DTDs, Parameter Entities, and RNG Patterns
Adding Data Typing and Content Model Constraints
Requiring a Child Based on Attribute Value
Element/Attribute Choice
W3C Schemas?
Conclusion

Introduction

Schemas wouldn't exist if they didn't offer advantages over DTDs. Publishing operations that have used XML for years find these advantages attractive, but the prospect of converting without disrupting their existing production schedules still intimidates many of them.

Tools such as Trang (http://www.thaiopensource.com/relaxng/trang.html) make it easy to convert DTD files to schema files. The human factor is the difficult part: you can't throw a switch and tell a staff that's been developing and maintaining DTDs for years to start doing it with an entirely new syntax. Converted versions of DTDs that they wrote themselves can look strange and foreign, and a class the week before and a few new books on the shelf won't be enough. The transition must be gradual.

One Trang feature combined with a feature of the RNG (RELAX NG) schema language lets a DTD staff begin to take advantage of RNG's extra power while keeping their DTDs in production. These developers can write small RNG schemas that define only the data constraints that DTDs can't, and problems with these new schemas won't be disruptive because their existing DTD-based system will remain in place.

DTDs, Parameter Entities, and RNG Patterns

When a DTD is modularized using parameter entities, a DTD-to-schema converter could simply make the entity substitutions first and then convert the result, so that neither the definitions of the parameter entities nor the references to them survive in the output. Most converters follow these steps, but Trang converts the parameter entities to RNG named patterns and references those from content models. For example, a DTD's quantity.type parameter entity with the value "#PCDATA" that gets used in a quantity element's content model gets converted to an RNG named pattern with the same name.

Because RNG lets you incorporate one schema into another and then override definitions from the included one, you can use an add-on RNG schema to check that a DTD-defined quantity element's contents are always integers with the following steps:

  1. Convert the DTD with the quantity and quantity.type declarations to an RNG schema (for example, bigbook.rng).
  2. Create a small secondary RNG schema that has an include instruction pointing at bigbook.rng and redefines quantity.type to have the value "xsd:integer" instead of "#PCDATA".
  3. Validate against this secondary RNG schema with a tool such as James Clark's Jing (http://www.thaiopensource.com/relaxng/jing.html) or Sun's msv (http://wwws.sun.com/software/xml/developers/multischema/). This flags any non-integer quantity elements and also checks all of the constraints specified in the original DTD, because the add-on schema includes the converted version of that DTD. You can continue to use the DTD with applications that don't understand RELAX NG, such as several commercially available XML editors.
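
The secondary schema from step 2 can be very short. The following is a minimal sketch in RNG's XML syntax, not a file from the original project. It assumes that the converted schema is the bigbook.rng file from step 1 and that the converter kept quantity.type as a named pattern; the comment's file name, bigbook-addon.rng, is purely illustrative.

```xml
<!-- bigbook-addon.rng (hypothetical name): includes the converted DTD
     and overrides a single named pattern. -->
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <include href="bigbook.rng">
    <!-- Replaces the converted definition (plain text) with an integer type. -->
    <define name="quantity.type">
      <data type="integer"/>
    </define>
  </include>
</grammar>
```

The validator in step 3 is then pointed at this add-on file rather than at bigbook.rng itself; every definition that is not overridden inside the include element is used exactly as the converter produced it.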

Remember that the main problem with a sudden large-scale transition is the possibility of making DTD developers suddenly responsible for large, complex schemas converted from their original large DTDs into the new syntax. Using the approach described here, they don't ever have to look at the schemas created by conversion from their original production DTDs; the only RNG syntax they have to worry about is what they added to the small add-on schema.

You can add all the constraints you like to this secondary schema. The creation and use of these additional constraints let DTD developers get accustomed to the power and syntax of RNG gradually while the original production system is still in place. Once the staff is comfortable with the RNG syntax, they can abandon the original DTDs and continue with the completely RNG-based schemas. They can even do this one DTD at a time, with no shock to anyone's system.

If changes are made to the original DTD, a new RNG version must be generated with the same name as the original one, so that the add-on schema that points to it takes the changes into account. As long as the changes don't involve elements or attributes affected by the add-on schema, nothing else is needed; if they do, you have to trace the implications of the change a little more carefully.

Let's look at a more complete example, and several of the constraints that this technique can add to DTDs.

Adding Data Typing and Content Model Constraints

The third customer element in the following document has two problems that no DTD validation will catch:

<testdoc>

  <customer>
    <name><first>Joe</first><last>Smith</last></name>
    <quantity>5</quantity>
  </customer>

  <customer>
    <name>Jane Smith</name>
    <quantity>4</quantity>
  </customer>

  <customer>
    <name><first>Jack</first><last>Smith</last>  Jr. </name>
    <quantity>three</quantity>
  </customer>

</testdoc>

First, we define all the constraints that we can with a DTD. Note how loose the name content model must be to allow the two formats described above—it allows any combination of any number of first elements, last elements, and PCDATA strings:

<!-- ex1.dtd -->
<!ELEMENT testdoc  (customer+)>
<!ELEMENT customer (name,quantity)>

<!ENTITY % name.content "(#PCDATA|first|last)*">
<!ELEMENT name %name.content;>

<!ELEMENT first (#PCDATA)>
<!ELEMENT last  (#PCDATA)>

<!ENTITY % quantity.type "#PCDATA">
<!ELEMENT quantity (%quantity.type;)>

The parts we'll want to redefine in RNG are stored in parameter entities. An XML parser finds no problem with the document above when parsed against this DTD.

After Trang converts this DTD to an RNG schema (I converted ex1.dtd above to an RNG Compact schema named ex1-dtd.rnc), the declaration and referencing of the parameter entities above look like this:

name.content = (text | first | last)*
name = element name { name.content }

quantity.type = text
quantity = element quantity { quantity.type }

To define the new constraints, we write a new schema that includes the generated one and redefines the named patterns to add the desired constraints:

grammar {
   include "ex1-dtd.rnc" {
     quantity.type = xsd:integer
     name.content = (first,last) | text 
   }
}

Validation against the DTD can continue as before. To check whether a document's quantity elements are all integers and whether all the name elements conform to one of the content models shown, use an RNG validator such as Jing or Sun's msv to validate the document against this add-on schema. The example document above validates with no errors against the DTD shown, but Jing gives the following two errors, with line and character numbers, about the Jack Smith Jr. line and the quantity value of "three":

C:\dat\xml\rng\ex1.xml:17:53: error: text not allowed here
C:\dat\xml\rng\ex1.xml:18:31: error: bad character content for element

Sun's msv validator also found both problems. (Because it doesn't support RNG Compact syntax, I converted the add-on schema above to a regular RNG schema and validated against that with msv.)
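
For readers who have only seen the compact syntax, the XML-syntax form of the add-on schema might look like the following. This is a hand-written sketch rather than the file actually used with msv. It assumes the DTD was also converted to an XML-syntax schema named ex1-dtd.rng (a hypothetical name) and that the converted schema defines named patterns called first and last for those elements, as the compact version above implies.

```xml
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <!-- ex1-dtd.rng is an assumed name for the XML-syntax conversion of ex1.dtd. -->
  <include href="ex1-dtd.rng">
    <!-- quantity.type = xsd:integer -->
    <define name="quantity.type">
      <data type="integer"/>
    </define>
    <!-- name.content = (first,last) | text -->
    <define name="name.content">
      <choice>
        <group>
          <ref name="first"/>
          <ref name="last"/>
        </group>
        <text/>
      </choice>
    </define>
  </include>
</grammar>
```

The group element corresponds to the comma-separated sequence in the compact syntax, and the choice element to the vertical bar.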

Requiring a Child Based on Attribute Value

Another useful constraint is the requirement of a particular child element in a content model only if an attribute of the containing element has a certain value. For example, I want each b element in the following example to include an ISBN child element if its medium attribute has the value "book" and a URL child element if medium is "web":

<a>
  <b medium="book"><ISBN>12341234</ISBN></b>
  <b medium="web"><URL>http://www.snee.com/</URL></b>
  <b medium="CD"/>
  <!-- The rest (lines 7 - 9) are bad and should be caught. -->
  <b medium="CD"><ISBN>43214321</ISBN></b>
  <b medium="web"><ISBN>43214321</ISBN></b>
  <b medium="web"/>
</a>

The following DTD for this document constrains it as much as it can. Note how both the content model references to the ISBN and URL child elements and the medium attribute declaration are stored in parameter entities so that they can be redefined in the add-on schema.

<!-- ex2.dtd -->
<!ENTITY % ISBN.ref   "ISBN?">
<!ENTITY % URL.ref    "URL?">
<!ENTITY % medium.att "medium (web|CD|book) #IMPLIED">

<!ELEMENT a (b+)>

<!ELEMENT b (%ISBN.ref;,%URL.ref;)>
<!ATTLIST b id     ID    #IMPLIED
            %medium.att;
            color  CDATA #IMPLIED
>

<!ELEMENT ISBN (#PCDATA)>
<!ELEMENT URL  (#PCDATA)>

The add-on schema takes advantage of RNG's flexibility in its use of named patterns. While DTDs can use parameter entities to store parts of content models or parts of attribute list declarations, and W3C Schemas have separate xsd:group and xsd:attributeGroup elements to store and re-use these, RNG patterns can define and redefine elements and attributes in the same named pattern. After including the RNG Compact schema created by Trang from the DTD above, the following schema redeclares ISBN.ref and URL.ref to be empty and then redeclares the medium.att pattern to specify all the allowable combinations: if the medium attribute equals "book", an ISBN child must be supplied with it; if it equals "web", a URL child must be supplied with it; and if it equals "CD", neither an ISBN nor a URL child element is allowed. Note how the references to the ISBN and URL child elements in the schema have no question marks after them like they do in the DTD (and, consequently, in the Trang-created schema): if medium has the specified values, those children must be there.

grammar {
   include "ex2-dtd.rnc" {

     ISBN.ref = empty
     URL.ref  = empty

     medium.att = ((attribute medium {"book"}, ISBN) |
                   (attribute medium {"web"}, URL) |
                   attribute medium {"CD"}
                  )
   }
}

When Jing and msv validate the document shown earlier against this schema, they catch the problems in the last three b elements.
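
Because msv needs the XML syntax, it may help to see how the attribute-and-element groups above are spelled there. The following is a hand-written sketch, assuming an XML-syntax conversion of the DTD named ex2-dtd.rng (a hypothetical name) that defines named patterns ISBN and URL for those elements.

```xml
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <!-- ex2-dtd.rng is an assumed name for the XML-syntax conversion of ex2.dtd. -->
  <include href="ex2-dtd.rng">
    <define name="ISBN.ref"><empty/></define>
    <define name="URL.ref"><empty/></define>
    <!-- Each branch pairs one attribute value with the child it requires. -->
    <define name="medium.att">
      <choice>
        <group>
          <attribute name="medium"><value>book</value></attribute>
          <ref name="ISBN"/>
        </group>
        <group>
          <attribute name="medium"><value>web</value></attribute>
          <ref name="URL"/>
        </group>
        <attribute name="medium"><value>CD</value></attribute>
      </choice>
    </define>
  </include>
</grammar>
```

Each quoted string in the compact syntax becomes a value element, and an attribute pattern can sit in the same group as an element reference, which is what makes the conditional requirement expressible.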

Element/Attribute Choice

Another useful constraint is the requirement that a piece of information be supplied in either an attribute or a child element, but not both. For example, let's say we want to ensure that each b element in the following has either a color attribute or a color child element and either a flavor attribute or a flavor child element. The last two should be flagged as invalid.

<a>

  <b color="green" flavor="mint"/>
  <b><flavor>chocolate</flavor><color>mint</color></b>
  <b color="white"><flavor>vanilla</flavor></b>

  <!-- These two (lines 9 and 10) should be caught as bad -->
  <b/>
  <b color="purple" flavor="grape"><color>maroon</color></b>

</a>

As with our last example, the DTD first specifies as much as it can by making the color and flavor attributes and child elements all optional and storing the content model element references and attribute declarations in parameter entities:

<!ENTITY % color.ref  "color?">
<!ENTITY % flavor.ref "flavor?">
<!ENTITY % color.att  "color  CDATA #IMPLIED">
<!ENTITY % flavor.att "flavor CDATA #IMPLIED">

<!ELEMENT a (b+)>
<!ELEMENT b (%flavor.ref;, direction?, author?, %color.ref;, editor?)>
<!ATTLIST b att1   CDATA #IMPLIED
            att2   CDATA #IMPLIED
            att3   CDATA #IMPLIED
            %color.att;
            %flavor.att;
>

<!ELEMENT flavor    (#PCDATA)>
<!ELEMENT color     (#PCDATA)>
<!ELEMENT direction (#PCDATA)>
<!ELEMENT author    (#PCDATA)>
<!ELEMENT editor    (#PCDATA)>

The add-on schema sets the named patterns originally used for attribute declaration to empty and puts the attribute declarations in the named patterns used for content models as part of an OR group that makes the conditions of their use clear:

grammar {
   include "ex3-dtd.rnc" {

     color.att = empty
     color.ref = (color | attribute color {text})
     flavor.att = empty
     flavor.ref = (flavor | attribute flavor {text})
   }
}

The color.ref named pattern, which ex3-dtd.rnc uses to define the b element's contents (and, after this redefinition, part of its attribute list), specifies that an element using this pattern must have either a color child element or a color attribute. The flavor.ref named pattern is similar.

Using this schema, Jing and msv found all the problematic lines.

W3C Schemas?

Can we use these techniques to implement a gradual transition from DTDs to W3C Schemas? I found two categories of problems that made this too impractical to be worth the trouble: existing conversion utilities give you no hook to redefine what began as parameter entities, and many of the constraints that I wanted to impose on my documents can't be done with W3C Schemas anyway.

When converting the following DTD declarations to any kind of schema,

<!ENTITY % quantity.type "#PCDATA">
<!ELEMENT quantity (%quantity.type;)>

most conversion utilities make the entity substitution first,

<!ELEMENT quantity (#PCDATA)>

and then convert the result to schema syntax, leaving no indication that the quantity.type entity had been declared and referenced in the original. By "most" I mean "all of the ones I tried except Trang"—the most recent versions (as of the end of October) of these programs:

(Keep in mind that for most of these programs, schema conversion is only one of many features.) Trang does preserve some parameter entities upon conversion to W3C Schema: when converting my first example above, it converted name.content to a complexType and referenced that from the definition of the name element. For the quantity.type parameter entity, however, it made the substitution shown above instead of trying to define some comparable structure in the XSD output. When converting to W3C Schema, then, Trang can't do as much for us as it can when converting to RELAX NG.

Converting a DTD to a W3C Schema and then, for example, adding data typing information would mean editing the schema created by the conversion. When future edits are made to the DTD, after the conversion is redone, the edits must be redone. The more constraints you impose, the more edits you have to redo, leaving more room for error. (Using the RNG-based system described earlier, the straight conversion of the modified DTD is all that's necessary to ensure that validation against the add-on schema that "includes" it takes the changes into consideration.)

W3C Schemas offer a redefine element that lets you redefine types, (content model) groups, and attribute groups from "included" schemas. It doesn't let you redefine entire elements, and the redefinition of types presupposes that the schema declares types separately from the elements—something you can't expect from the output of the automated conversion programs. If a conversion utility translated DTD parameter entities to content model groups or attribute groups, an add-on schema could redefine those. To do this, however, the conversion utility parsing the DTD must know whether each parameter entity declares a piece of a content model or an attribute list, but to a DTD parser they're all just strings.
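
To make the redefine mechanism concrete, here is a sketch of what such an add-on could look like. It is hypothetical: it assumes a converter that produced a file named ex2-dtd.xsd in which the ISBN.ref parameter entity from the second example became an xs:group containing an optional ISBN element reference. That is an assumption made for illustration, not actual converter output.

```xml
<!-- Hypothetical add-on schema; ex2-dtd.xsd and its ISBN.ref group are assumed. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:redefine schemaLocation="ex2-dtd.xsd">
    <!-- A redefined group that does not reference itself must be a valid
         restriction of the original: here ISBN goes from optional to required. -->
    <xs:group name="ISBN.ref">
      <xs:sequence>
        <xs:element ref="ISBN"/>
      </xs:sequence>
    </xs:group>
  </xs:redefine>
</xs:schema>
```

Even under these assumptions, the redefinition makes ISBN required in every b element; it cannot make the requirement depend on the value of the medium attribute.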

Even if W3C Schemas and the associated conversion tools offered these mechanisms, the W3C Schema syntax has no way to describe most of the constraints described in these examples. While it can certainly assign specific data types to elements and attributes, flexibility in specifying child element and attribute requirements is similar to the way it works in DTDs: they're either required or not, with no way to describe conditional constraints on such requirements.

For example, W3C Schemas offer no way to specify that a content model must conform to one of two models, as with the name element in the first example above. There is no way to require certain child elements based on specific attribute values; nor is there a way to specify the kind of "exclusive or" condition demonstrated in the third example, in which a given element or attribute must be present, but not both.

Many of these potential constraints are more valuable to the so-called "document-oriented" XML users than to the "data-oriented" users, while the W3C Schema advantages of more typing control and more straightforward mapping to object-oriented systems hold more appeal to the data-oriented users—those developing systems that use web services, interact with databases, and engage in other transactional processes. Perhaps that explains why both schema languages have continued to slowly grow in popularity, neither supplanting the other: they each serve different sets of needs better.

Conclusion

One great advantage of a gradual transition to schemas that postpones the cutoff from DTD use is its greater suitability for a prototype approach. A given publishing operation can use the techniques described here to gain a practical appreciation of what RNG can and can't do for them with no disturbance to their DTD-based publishing system. They can put anything they've learned about RNG into practice on a small scale and see the results for as long as they want before making their system dependent on RNG schemas, or deciding not to use RNG at all.

For publishing operations whose XML workflows currently use DTDs to the exclusion of any kind of schemas, the potential use of add-on RNG schemas should make further investigation of RNG a much more attractive option.

Copyright 2003 Bob DuCharme