Best practice XSD: the very thing that makes you rich makes me poor

By Rick Jelliffe
August 14, 2008 | Comments: 3

The US government's National Institute of Standards and Technology (NIST) is the body particularly involved with standards testing and implementation. I was recently re-reading Boonserm Kulvatunyou and K.C. Morris' XML Schema Design Quality Test Requirement Document from 2004: it has a great list of XSD best practices from different organizations, seen through the lens of five years ago.

They are not really meant to be taken together (indeed, one of the subtler outcomes of the NIST work is, I think, the appreciation that best practices are not absolute: often we need different horses for different courses).

The Incredible Shrinking Man

But if you do take them together, you end up with very little of XSD left.

For example, OAGIS rule 400 (p10) recommends against allowing extension (of complex types). But it also (rule 250) recommends against using restriction (of complex types): it recommends substitution groups or Schematron.
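To make the alternative concrete, here is a minimal sketch of the substitution-group style that such a rule points to; the element and type names are invented for illustration, not taken from OAGIS:

    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

      <!-- Abstract head element: never appears in instances directly -->
      <xs:element name="Party" type="PartyType" abstract="true"/>

      <!-- Members of the substitution group stand in for Party, so new
           kinds of party can be added without extending or restricting
           the complex type itself -->
      <xs:element name="BuyerParty"  type="PartyType" substitutionGroup="Party"/>
      <xs:element name="SellerParty" type="PartyType" substitutionGroup="Party"/>

      <xs:complexType name="PartyType">
        <xs:sequence>
          <xs:element name="Name" type="xs:string"/>
        </xs:sequence>
      </xs:complexType>

      <!-- Content models reference the head; instances use the members -->
      <xs:element name="Order">
        <xs:complexType>
          <xs:sequence>
            <xs:element ref="Party" maxOccurs="unbounded"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>

    </xs:schema>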

Going over to UBL, and out goes xsd:nil, xsd:any, xsd:choice, mixed content, user-defined attributes, empty elements, and enumerations (use a pattern and an external code list validated by Schematron). But UBL says don't allow substitution groups (unlike the OAGIS rules above). All complex types that contain data should be defined as complexTypes with simpleContent (to allow extension).
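For that last rule, a hedged sketch of the kind of declaration it produces (the names here are illustrative, not quoted from the UBL schemas): the character content stays a simple type, while the simpleContent extension leaves room for attributes.

    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

      <!-- Data-carrying type: simple content plus attributes, so later
           schemas can extend it without disturbing the element's text -->
      <xs:complexType name="AmountType">
        <xs:simpleContent>
          <xs:extension base="xs:decimal">
            <xs:attribute name="currencyID" type="xs:token" use="required"/>
          </xs:extension>
        </xs:simpleContent>
      </xs:complexType>

      <!-- Instance form: <Amount currencyID="AUD">42.50</Amount> -->
      <xs:element name="Amount" type="AmountType"/>

    </xs:schema>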

Going over to the ASC X12 guidelines, and out goes wildcards, abstract types (as used in substitution groups), xsd:redefine, xsd:group, anonymous types, global declarations of elements and attributes (i.e., no flat schemas: use nested ones), use of default or fixed attribute values, xsd:NOTATION, and xsd:appinfo.
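Taken together, those prohibitions push you toward the so-called Venetian Blind style: named global types, locally declared elements, and a single global document element. A small sketch, with invented names:

    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

      <!-- Types are named and global... -->
      <xs:complexType name="AddressType">
        <xs:sequence>
          <xs:element name="Street" type="xs:string"/>
          <xs:element name="City"   type="xs:string"/>
        </xs:sequence>
      </xs:complexType>

      <xs:complexType name="CustomerType">
        <xs:sequence>
          <xs:element name="Name"    type="xs:string"/>
          <xs:element name="Address" type="AddressType"/>
        </xs:sequence>
      </xs:complexType>

      <!-- ...but only the document element is declared globally -->
      <xs:element name="Customer" type="CustomerType"/>

    </xs:schema>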

Going to AEX, and it wants (250, p47) all elements to be optional by default ("validated outside the schema" sounds like a requirement for application-based checking or Schematron).
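The "optional in the grammar, required somewhere else" split is easy to picture with a Schematron layer. A hedged sketch, with invented element names: the XSD would declare DeliveryDate with minOccurs="0", and this rule reinstates the requirement for the contexts that need it.

    <sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
      <sch:pattern>
        <sch:rule context="Order">
          <!-- The grammar says the element may be absent; the business
               rule says that, for this document type, it may not be -->
          <sch:assert test="DeliveryDate">
            An Order must carry a DeliveryDate, even though the schema
            declares it as optional.
          </sch:assert>
        </sch:rule>
      </sch:pattern>
    </sch:schema>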

Many of these are clearly contradictory (and the NIST review even points out contradictions within the guidelines of the same organization, as they stood five years ago).

Of course, it is easy to laugh about how little of XSD is left, but we need to check whether our laughter or fond tut-tutting is a sign that, in our heart of hearts, we expect there is a silver bullet that can make schemas, evolution and interchange trivial. As the recent issue raised by Alex Brown on the use of IDs in RELAX NG shows (the case was ODF, but it is a wider issue), sometimes the very thing that makes you rich makes me poor.

Profiles may cause fragility

There are, of course, a few consistent threads that come out of the NIST and other work. One is that any plan may be better than no plan; however, the trouble with best practices documents is that systems built to ignore the capabilities those documents ban will fail when presented with valid data made according to a different profile.

So even if a best practice approach is taken, systems should still be built to cope with as many of the full XSD (or RELAX NG, or Schematron, or DTD, etc.) capabilities as possible.
Yet again, the deeply unfashionable Postel's law: use the best practices profile for what you send, but as much of the full standard as possible for what you accept.

People interested in best practices should look at the xfront.com site of Roger Costello, of sinister US think-tank the Mitre Corporation. Roger regularly conducts discussions on various mailing lists to figure out consistent approaches to some of these kinds of best practice questions. One of his comments that I agree with strongly is that for maximally consumable data you will have to provide it in several formats: plurality as a fact of life (the poor will always be with us?)

I think there is a strong conflict (or tradeoff) at work between the perceived desire/expectation of users (and tool makers) for a single-file solution to schemas and the hard reality that layered and phased approaches are required.


Meta Rules OK?

So what set of meta-rules can we extract from these best practice documents?

First, as I mentioned above, use the best practices profile for what you send, but as much of the full standard as possible for what you accept.

Second, the schema design will have to cope with, and perhaps reflect, its future history and the organizational structure that made it. Where there are conflicting requirements, there will need to be conflicting (i.e. specialized or sibling-derivation) versions of the schemas. These may also influence which schema language features are relevant.

Third, schemas need to be organized with an explicit re-use model: if simple types are to be re-used, they must be named and global. If you are going to allow extension, then you must use complexTypes with simpleContent up front: you cannot retro-fit it. And so on.
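A minimal sketch of the first point (the names are invented): the simple type is declared once, named and global, so several elements can share it and later schemas can derive from it; an anonymous inline type could not be re-used at all.

    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

      <!-- Named, global simple type: available for re-use and derivation -->
      <xs:simpleType name="PartyIDType">
        <xs:restriction base="xs:token">
          <xs:maxLength value="35"/>
        </xs:restriction>
      </xs:simpleType>

      <!-- The same type re-used by two declarations -->
      <xs:element name="BuyerID"  type="PartyIDType"/>
      <xs:element name="SellerID" type="PartyIDType"/>

    </xs:schema>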

Fourth, the more complex the organization or re-use, the more that a layered rather than integrated approach to schemas should be adopted.

I think the UBL code list approach is very instructive, because it absolutely springs out of pragmatic organizational issues: you want to validate and model with a schema language, yet code lists change fast and may be out of the control of the people making the rest of the schema (which is probably more stable). So a layered approach is warranted, using patterns in the XSD and Schematron for interrogating the external code lists.
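A hedged sketch of that layering (the file name, element names and code list format are all invented here, not the actual UBL machinery). The XSD layer only pins down the lexical shape of the code; a Schematron layer, under an XSLT query binding, checks the value against an external, independently maintained code list document.

    <!-- Layer 1: the XSD constrains the form of the code, not its value -->
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:simpleType name="CurrencyCodeType">
        <xs:restriction base="xs:token">
          <xs:pattern value="[A-Z]{3}"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:schema>

    <!-- Layer 2: Schematron looks the value up in codes.xml, which can be
         maintained and re-issued without touching the schema -->
    <sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt">
      <sch:pattern>
        <sch:rule context="CurrencyCode">
          <sch:assert test=". = document('codes.xml')//Code">
            The currency code must appear in the external code list.
          </sch:assert>
        </sch:rule>
      </sch:pattern>
    </sch:schema>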

When we look at the tools for large systems from the big players, how much evidence do we see that these meta-rules are implemented or accepted or even on the radar? I think this is something where we should not expect leadership from vendors: it is the users' business to be clear about their requirements and the vendors' tasks to scrabble after the money.




3 Comments

Rick, Thanks for bringing some visibility to our work. You said that one of the more "subtler outcomes of the NIST work is, I think, the appreciation that best practices are not absolute: often we need different horses for different courses." We actually took that as one of the more significant outcomes of the work and it has influenced what we've done for the last few years.

To be specific, we've created a tool to help people put together a set of best practices, and tests for them, for their own projects. That tool is publicly available and is called the Quality of Design Tool. Currently, the system contains fairly robust sets of tests based on XML Schema Naming and Design Guidelines from the Department of the Navy, IRS, OAG, and UN/CEFACT, as well as samplings from others.

I'd like to point out that this difficulty in defining best practices goes beyond the XML Schema language itself. One of my favorite examples comes from an activity that we were involved in with the federal government. We were participating with a group (known as the XML COP) that was trying to define a minimum set of core best practices that could be used across the federal government in defining XML Schemas. One of the proposed best practices was not to use abbreviations in tag names but rather to expand them. Most people seemed to think that this was tedious but acceptable. Since abbreviations are so overloaded, the lack of expansion can lead to misunderstandings. However, the "intelligence community" objected to this. Their argument was that they have numerous abbreviations that they exchange but it would be illegal for them to exchange their expansions. (Presumably this would constitute a national security threat -- what a world we live in!)

It does go to show, though, that best practices really need to be determined for the group involved. There is no generic solution and this fact is actually quite reasonable. What makes a good language is that it is very flexible and able to address a large range of needs. What makes for good coding is that the language is used consistently in a systematic and structured way. There's no getting around that; otherwise we would end up with a plethora of specialized languages.

You raise a good point about having a reuse model in mind when defining best practices. We would like to come up with a set of similar overarching principles that you could associate with a collection of rules. In doing this, one would be able to select the principles they want to enforce and then grab a set of rules that do that. So far we haven't done this, but it could be an interesting addition to our work.

I agree with your conclusions regarding the applicability of Postel's law and that multiple formats (and conversion tools) are a fact of life.

Your blog post reminds me of an old paper of mine whose title also happens to reference a (much older) pop tune ;-). It seems that today people are using XPath-based technologies such as Schematron, XSLT, and perhaps DSRL to do what Architectural Forms were intended for - to enable a layered approach to schema development.

Very interesting article and some excellent analysis on the drawbacks of profiling and best practices. I think in theory you are right in that we should all implement 100% of a spec and not have to profile. But in reality, people are not perfect. Profiles give some clarity in a real world where it didn't exist before. Am I willing to bet my product on someone else being 100% conformant to a spec?
Profiles may not be ideal, but they make the real world less unclear. Notice I did not say easier. Just clearer.
It is also illustrative to see what peers are doing:
see http://www.xml.com/pub/a/2006/09/20/profiling-xml-schema.html
