Understanding XML: Thoughts on Agile Schema Development

By Kurt Cagle
June 27, 2008 | Comments: 1

Programmers, by nature, are lazy. No programmer wants to do unnecessary (or more typically, repetitive) work when the work could instead be done by automation, which is typically why most programmers spend time trying to discern (or learn) methodologies and design patterns that could be used to simplify those things that are tedious and unrewarding so that they can spend more time on challenging problems instead.

The project management cycle is no different in this regard. By establishing a formal methodology of development, a good project manager can ensure both that the right people are working on any given project and that the project can be accomplished within the constraints of budget, time and quality. For a number of years, one of the dominant paradigms for such a process was the Software Development Life Cycle - SDLC for short. The SDLC process was also known as the Waterfall Methodology, because each stage of the SDLC, from design to deployment to maintenance, created a step function of development where completion of one stage automatically led to the start of the next.

Roughly eight years ago, however, there was a fairly significant shift in the way that projects were built, one that more accurately reflected development lifecycles on the web. In normal SDLC, a typical project would take between eighteen and thirty-six months to complete, depending upon the complexity of the application in question, with clearly delineated roles and activities that could be checked off according to one of a number of ISO or related standards.

On the web, however, something radically different was emerging. A person or small group would put out a project of some sort, usually at a fairly primitive state. Over the course of a few weeks, the project would be incrementally upgraded, with associated design, development, deployment, unit testing and user testing. A few weeks after that, the same process would take place that inched the application a little further along. The significant thing about this type of development was that at any given stage in the process, beyond the first few weeks, there was always a working application that represented the state of the project at any given time.

The upshot of this approach, eventually dubbed agile or extreme programming, was huge - the customer could, at any time, assess whether the given project was meeting his or her needs, and could change the direction of development on a dime if it became obvious that a given approach wasn't working. For the programmer, there were similar gains - he or she could write code based not on out-of-date specifications that no longer solved the business problem originally at hand, but on what worked best for programmer and customer both, reducing the likelihood that six months (or two years) of their life had gone into code that would never see the light of day.

While the exact nature of agile development tends to vary depending upon which particular agile "guru" you happen to follow, for the most part "traditional" agile programming has a number of clear characteristics:

  • Short, iterative life cycles. Agile programming focuses on development deadlines that can be met within weeks, not months (or years), which allows for frequent evaluations of the direction of the project in order to ensure both that objectives are being met and that the right objectives are being met.

  • Incremental Changes. Additionally, each iteration works upon the assumption that only small changes -- "deltas" -- are introduced from the previous version. This not only makes setting targets easier, but also makes it easier to "roll back" to an earlier version if a given approach didn't work.

  • Cellular development teams. While SDLC teams may number in the dozens or even hundreds of members, Agile teams can be anywhere from two to maybe a dozen programmers tops. This not only keeps development more immediate, but also makes it easier for the team to communicate and mentor one another.

  • Customer Involvement. Most agile processes place a higher burden on the part of the customer - they become part of the development team, in effect, acting as quality control, designer, and tester. This also ensures that at any point, the customer has a considerably better idea of where the particular project is than a customer working within SDLC would.

  • Increased modularization. Projects built around SDLC tend to be fairly monolithic, and as such a given project can be derailed if a key member of the team quits or gets run over by a bus. Agile projects, on the other hand, tend to focus on the small (though a good communication bus is always one of the first things to be laid down in effective projects) and manageable - there are fewer make-or-break components that if not delivered will spell destruction for the whole project. This also makes it easier for others to "extend" a given project, even if they aren't part of the initial development team.

Not surprisingly, both open source projects and web services projects tend to work well using Agile methodologies, and as these are increasingly common, it's meant that Agile has gone from being a somewhat subversive concept to becoming practically the mainstream form of project development in use today.

Agile and the XML Design Conundrum

At first glance, XML should be a natural fit for agile developers - and in contexts where a schema is known, XML is agile. In general, creating documents to a given, known schema implies that you can create transformations or queries on XML documents with impunity, in great part because the data model that the XML represents is itself well known and understood. Compare this with most procedural development processes where the underlying data model is essentially built from a skeleton model with increasing degrees of specificity in associated methods. The principal challenge there is in being able to ascertain ahead of time what specific methods will be required, with the implementation then fleshing out each method over each iteration. Because the data model is itself somewhat indeterminate and fuzzy in most imperative applications, data modeling is a step that realistically only needs to be done fairly late in the process.

In a declarative model, however, the data model itself tends to create fairly complex dependencies. Consider, for instance, shifting from DocBook 4.4 to DocBook 5.0. Even though the data models are similar - both are languages for describing articles and books - there are some differences between the two, which in turn mean that in order to create a PDF document from 5.0 you have to ensure that you have a transformation that will work properly with 5.0, that you have queries that can recognize the distinction between 4.4 and 5.0 elements, and so on.

This becomes more problematic when the data model itself is the thing you are trying to design. An XML document is a representation of a data model, something that is not necessarily the case with imperative programming. Indeed, web programming is rather notoriously bad at actually defining data models, because many web languages essentially exist primarily to create string output of markup code without any clear delineation of a data model at all. Of course, such applications tend to be very brittle in the face of change - adding new "components" onto a web page could very easily break down if a tag isn't properly closed, for instance - but because such languages tend to be easiest for relatively novice programmers to master, the ease-of-use factor generally trumps soundness of modeling.

Additionally, for many people the problem of working with XML has more to do with the often egregiously bad DOM implementations that exist for XML than with any problems inherent in XML itself. DOM is not XML. DOM is a set of methods for navigating an XML-like infoset. Because it was originally designed with parsers and very specialized apps in mind rather than ease of use, the XML DOM interfaces are highly descriptive, very low level, and concerned primarily with "tree-walking". DOM has been implemented in most languages, and in many respects can be thought of as occupying the same role that Latin occupied in the Middle Ages - a common language used by a specialized community that is understood only by the initiated, that's complex and convoluted, and that most people, even those working with it daily, generally find frustrating to the extent of longing for something else.
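To make the tree-walking complaint concrete, here is a small sketch using Python's xml.dom.minidom (the document content is invented for illustration): even extracting a single text value means filtering node types and stepping through children by hand.

```python
from xml.dom import minidom

# A trivial document; in DOM terms, even this is a tree of typed nodes.
doc = minidom.parseString(
    "<character><name>Aleria Delamare</name><level>5</level></character>"
)

# Low-level DOM navigation: walk the children, filter by node type,
# then reach into the text node to get the actual value.
name = None
for node in doc.documentElement.childNodes:
    if node.nodeType == node.ELEMENT_NODE and node.tagName == "name":
        name = node.firstChild.nodeValue

print(name)  # Aleria Delamare
```

Compare this with a single XPath expression like `/character/name` - the contrast is much of what drives the frustration described above.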

Agile programming is possible (and in fact desirable) with XML, but you need to understand both the sweet spots and the pain points for XML that make it different from doing imperative Agile programming. One of the key distinctions comes from the fact that while imperative programs build up data models intrinsically, the data model for XML documents is generally extrinsic - an XML document can in fact be created with no specific reference to a schema at all, and as a consequence it is possible for an XML document to be valid according to multiple schemas simultaneously - either by deploying two different structures in the same schema language (such as XSD) or by deploying the same structure in two different languages (such as XSD and Relax NG).

What this means in practice is that you can in fact design an XML language in an ad hoc fashion until such time as it becomes necessary to communicate the language to others. Put another way, XML schema design is itself an iterative process that's distinct from anything that you do with it, and you need to be cognizant of the fact that every change to the language imposes corresponding changes on any XML related tools, be they DOM, XSD, XPath, XQuery or other XML manipulation languages.

An upshot of this is that it is difficult to create complex tools for XML until you nail down the schema - though that doesn't preclude the deployment of "test harness" routines that can be used to debug and test schemas under development. In general, while it is possible to create whole schemas directly in a given schema language, an agile approach to schema design involves several distinct phases:

  1. Build the initial schema:
    1. Start with a preliminary data instance.
    2. Establish (and record) a given use case.
    3. Modify the data instance to better satisfy the use case + all previous use cases.
    4. Repeat this process until you have a document that satisfies all of the use cases - varying the instance to try different potential combinations.
    5. Once done, generate a schema that can encapsulate the instances and give it a version number.
  2. For this version, create a set of tools (XSLT transformations, for instance) with an aim towards minimizing the semantic bindings within the stylesheet or related tools as much as possible.
  3. When changes become necessary to the underlying data model, such changes should be incremental, and instances, schemas and transformations should all be upgraded at the same time, with a corresponding change in the version number on the model itself.
  4. One useful set of tools for making this transition is one or more XSLT transformations that will transform an instance from version N to version N+1. Upgrading from version N to version N+2 then becomes a matter of successive transformations: v(n+2) = T(n→n+2)(v(n)) = T(n+1→n+2)(T(n→n+1)(v(n))), where v(n) is the schema instance at version n and T(i→j)() is the stylesheet transformation that upgrades an instance from version i to version j. Note that so long as you retain composability (i.e., the transformations introduce no side effects), T(i→j)() can also be an XQuery document.
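The chained-upgrade idea in step 4 can be sketched in a few lines of Python (the specific version changes and element names below are invented for illustration; in practice each step would be an XSLT stylesheet or XQuery): each upgrade step is a function from a version-n instance to a version-n+1 instance, and upgrading across several versions is just function composition.

```python
import xml.etree.ElementTree as ET

def upgrade_1_to_2(root):
    # Hypothetical change: version 2 moves the player's name into a
    # structured <last-name> child rather than bare text.
    for char in root.iter("character"):
        player = char.find("player")
        if player is not None and player.text:
            last = ET.SubElement(player, "last-name")
            last.text = player.text.split()[-1]
            player.text = None
    root.set("version", "2")
    return root

def upgrade_2_to_3(root):
    # Hypothetical change: version 3 requires an id attribute per character.
    for i, char in enumerate(root.iter("character")):
        char.set("id", f"id{i}")
    root.set("version", "3")
    return root

def upgrade(root, steps):
    # v(n+2) = T(n+1→n+2)(T(n→n+1)(v(n))): plain composition, which works
    # so long as each step is free of side effects.
    for step in steps:
        root = step(root)
    return root

root = ET.fromstring(
    '<characters version="1"><character><player>Jeane Tomasic</player>'
    '</character></characters>'
)
root = upgrade(root, [upgrade_1_to_2, upgrade_2_to_3])
print(root.get("version"))  # 3
```

The design point carries over directly to XSLT: keep each transformation small, side-effect free, and named by the version pair it bridges.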

Note that while this approach seems to be very simple (and it is), it does hide a few subtleties. The first step involves the same type of incremental changes that are highlighted as part of imperative agile development - adding a few elements or a small module, adding attributes or modifying the range of attribute values, and so forth. The test document may start out very simple - indeed, the first time through, you're basically working without a schema at all, attempting to make up a first-pass schema that seems to solve the simplest cases.

Note also that at every point you are making small changes to your reference "instance", in essence growing it in parallel to your schema. Now, the danger in this approach comes from the fact that you can end up creating a schema that is so specific to one instance (and one use case) that it fails to satisfy any other use case. Thus, it's worth understanding that you should look upon such schema development as a process of trying to satisfy as many of the underlying use case models as you can within a single definition.

This hints that as you work, you are in essence developing the data model through the different use case iterations. Not all such use cases need to be defined as you're working - indeed, the use cases will tend to suggest themselves as you are going through this iterative process - but you should have at least a baseline idea of what it is you are trying to develop before starting. Keep in mind here that a "use case" is just another name for a business requirement, and as business requirements change over time this means that realistically, even once you have an established schema, you will likely be updating it periodically. By keeping such updates small and self-contained, you are less likely to find yourself in a situation where you need to make a radical change in your underlying data models that force you to re-establish everything from scratch.

One other facet of schema development is worth mentioning. Just as there are design patterns that occur within Java or C++ development, so too do design patterns show up in schema design. Recently, many of these patterns were named and identified as part of a W3C effort, resulting in two working draft documents: Basic XML Schema Patterns for Databinding and Advanced XML Schema Patterns for Databinding.

The "Basic" working draft in this case deals with those patterns that are intrinsic to a self-contained XML document, without extensions or modules, while the "Advanced" working draft focuses on patterns across modules and other edge cases. In both cases, however, the role of these two documents is to identify the most common design patterns that occur within XML documents and, having identified them, to use this information to build better automated tools. In essence you can think of these design patterns as building blocks. If you have an XSLT document that can recognize these patterns in a semantically neutral manner, then it also becomes possible to identify within a given schema those places that don't cleanly fall into such patterns. As one of the roles of effective agile programming is to identify those patterns that minimize the amount of spurious coding (and that typically, as a consequence, make the data more manageable to work with in other contexts), the use of such a pattern identification tool can prove highly useful in troubleshooting problems in design before they become serious.

For instance, one entry in the schema patterns document illustrates such a pattern:


2.7.2 ElementMinOccurs0MaxOccursFinite

An [XML Schema 1.0], or other [XML 1.0] document containing an [XML Schema 1.0] element <xs:schema>, exhibits the ElementMinOccurs0MaxOccursFinite pattern identified using the URI [RFC 3986] http://www.w3.org/2002/ws/databinding/patterns/6/09/ElementMinOccurs0MaxOccursFinite when the following [XPath 2.0] expression applied to a document or element node with a context node of //xs:schema results in an [XPath 2.0] node-set containing at least one node:

.//xs:element[@minOccurs = '0' and @maxOccurs and not(@maxOccurs = '0' or @maxOccurs = '1' or @maxOccurs = 'unbounded')]/ (@minOccurs, @maxOccurs)

The following example [XML Schema 1.0] extract illustrates the use of the ElementMinOccurs0MaxOccursFinite pattern within an [XML Schema 1.0] document [ElementMinOccurs0MaxOccursFinite]:

<xs:element name="colorList" type="ex:ColorList" />
<xs:complexType name="ColorList">
    <xs:sequence>
        <xs:element name="colorValue" type="xs:string" minOccurs="0" maxOccurs="2" />
    </xs:sequence>
</xs:complexType>
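For illustration, a detector for this pattern can be sketched in Python with the standard xml.etree module - this is a hand-rolled mirror of the XPath test quoted above, not part of the W3C tooling:

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# The schema extract from the working draft, inlined for the sketch.
schema = ET.fromstring(
    '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">'
    '<xs:complexType name="ColorList"><xs:sequence>'
    '<xs:element name="colorValue" type="xs:string"'
    ' minOccurs="0" maxOccurs="2"/>'
    '</xs:sequence></xs:complexType></xs:schema>'
)

def exhibits_pattern(schema_root):
    # Mirrors the XPath: minOccurs='0', and maxOccurs present but
    # not '0', '1' or 'unbounded' - i.e., a finite upper bound.
    matches = []
    for el in schema_root.iter(XS + "element"):
        mino, maxo = el.get("minOccurs"), el.get("maxOccurs")
        if mino == "0" and maxo not in (None, "0", "1", "unbounded"):
            matches.append(el.get("name"))
    return matches

print(exhibits_pattern(schema))  # ['colorValue']
```

A real pattern-identification tool would run dozens of such tests and report which parts of a schema fall outside every known pattern.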

The following example [XML 1.0] element is valid against the above example [XML Schema 1.0] when included inside an instance document [ElementMinOccurs0MaxOccursFinite101]:

<ex:colorList/>
as is the following element when included in an instance document [ElementMinOccurs0MaxOccursFinite102]:

<ex:colorList>
    <ex:colorValue>red</ex:colorValue>
    <ex:colorValue>green</ex:colorValue>
</ex:colorList>

The purpose of a data model is to represent a particular "real-world" object in a way that makes it easy to work with inside a programming environment. Thus, being able to build an XSLT that can analyse the model, show where things "fall out of that model", and show where optimizations can be made that will simplify processing down the road is a major part of being an effective data modeler, application designer or information architect.

(In a subsequent article, I'll be posting an XSLT2 file that will do exactly this.)

Note that agile development holds throughout all phases of the process. Working with incremental deltas becomes important in both building and modifying schemas: in the same way that in programming you generally do not add two blocks of code that may both break an application in similar ways, you want to make sure that the changes made to the schemas are small and manageable, and that they can be rolled back readily if problems arise. This also points to a good design principle - before setting up the application, set up a Subversion server that can hold successive versions of document and schema, and commit after each iteration.

Schemas are difficult to test in real-world situations unless you can run enough use cases against them to determine whether they satisfy all of those use cases properly. This means that the more instances you can try out as prototypes for your data model, the more robust the schema will end up being when faced with real-world pressures. This testing - validating each instance against the schema and working with as many instance prototypes as possible - serves much the same role for schemas that unit testing serves for any other kind of code.
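As a sketch of what such schema unit testing looks like - using only Python's standard library, which cannot do full XSD validation, so a hand-rolled required-children check stands in for a real validator here; the rules and instances are invented for illustration:

```python
import xml.etree.ElementTree as ET

# A toy stand-in for a schema: each rule is a required child of <character>.
# In practice you would point an XSD or Relax NG validator at the same set.
REQUIRED = ["name", "player", "level"]

def validate(instance_xml):
    # Return the list of missing required children (empty means valid).
    root = ET.fromstring(instance_xml)
    return [tag for tag in REQUIRED if root.find(tag) is None]

# A battery of instance prototypes, ideally one per use case.
prototypes = {
    "minimal": "<character><name>A</name><player>J</player>"
               "<level>1</level></character>",
    "missing-level": "<character><name>A</name><player>J</player></character>",
}

for label, doc in prototypes.items():
    missing = validate(doc)
    print(label, "OK" if not missing else f"missing {missing}")
```

The point is the shape of the harness, not the checks: each prototype instance is an assertion about the schema, and the battery is rerun after every incremental schema change.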

Building Character(s)

There's been a fair amount of (deliberate) handwaving throughout the last section, so it's probably best at this stage to start focusing on a "real-world" example, though admittedly, real world may not necessarily be the most applicable term here. I've found, over the years, that one of the more interesting things to model (and something that most programmers have at least minimal experience with) is role-playing game characters. One reason for this is that such characters seem fairly simple, but in fact can get complex fairly quickly.

Some years back, Wizards of the Coast released the core Dungeons & Dragons game system as an open standard, calling it the d20 Open Gaming System, under a similarly named License. This system provides a wealth of material for data modelers, though admittedly an earlier effort to provide a formal schema for the various objects in the system didn't meet with success. Nonetheless, because there's enough common knowledge there, it provides a good example of looking at the schema development process in greater detail.

One of the first things to observe here is that the design methodologies discussed here are generally most applicable to modeling of "data-like" objects, rather than textual documents. Part of the reason for this is simple - it is likely that, while many objects being modeled with XML may have some document-like characteristics, for the most part these characteristics tend either to be subsumed as part of a larger data structure, or (and this is more likely) it is possible to use an already well established document language such as DocBook, XHTML, DITA and so forth as a namespace subset of the more formal structure. Thus, the rule of not reinventing wheels holds especially well here.

Consider for a moment the archetypal player character as a data object. A fast first blush at an instance (never think about schemas until you actually have a working instance) might look something like the following:

<character>
    <name>Aleria Delamare</name>
    <player>Jeane Tomasic</player>
    <gender>Female</gender>
    <character-race>Human</character-race>
    <character-class>Wizard</character-class>
    <level>5</level>
    <hitpoints>21</hitpoints>
    <current-hit-points>12</current-hit-points>
    <alignment>neutral good</alignment>
    <armor-class>8</armor-class>
    <strength>8</strength>
    <intelligence>17</intelligence>
    <wisdom>15</wisdom>
    <constitution>7</constitution>
    <dexterity>16</dexterity>
    <charisma>17</charisma>
    <spell-levels>3 2 1</spell-levels>
    <spells>Witch Light,Charm Person,Spell Ward I,Spell Ward II,Detect Evil,Wandering Eye,Sleep</spells>
    <description>
        Aleria is a young, attractive female wizard just settling into her powers.
        With brown hair, green eyes and a trim figure, Aleria tends to turn heads
        even when she'd prefer not to, and for this reason prefers to affect the
        mien of a scholar, though not always with success.
    </description>
</character>

This seems to capture the basic set of information, though it also looks very much like it was taken directly from a table in a database. There are, admittedly, advantages to being table like, but there are also some significant problems, especially when the data model is conceptually more complex than that. To understand that complexity, it's worth iterating through a few use cases:

Use Case 1. There will likely be more, conceivably many more, than one character within the game - perhaps it's a multiplayer game, or even a networked multiplayer game with hundreds to tens of thousands of individual characters. Thus any "record" should be seen as being one of a collection, and should have a way of uniquely identifying that particular record in a collection.

This implies first of all that there is likely a <characters> collection element floating around somewhere, and moreover that the <character> element needs to have an ID. While it's possible to put the ID as an element, it's worth noting that the ID doesn't really describe a characteristic of the record so much as identify the record, meaning that conceptually it works better as an attribute:

<characters version="00.01.0001">
    <character id="id1234">
        <name>Aleria Delamare</name>
        <player>Jeane Tomasic</player>
        <!-- additional content -->
    </character>
</characters>

Additionally, any identification scheme needs to determine the scope of uniqueness. If the character is accessible on the web, it may be that the best ID in this case would be more like a URL, with the name of the character and the last name of the player as key:

<characters version="00.01.0001">
    <character id="http://www.mycoolgame.com/players/tomasic/aleria+delamare">
        <name>Aleria Delamare</name>
        <player>Jeane Tomasic</player>
        <!-- additional content -->
    </character>
</characters>

Use Case 2. Records will need to be sorted by character name, player first name and player last name, with the player also being uniquely identifiable in the event of collisions.

In this case, it is better to break up player into two separate search keys of first-name and last-name in order to speed up such searches. Additionally a reference key should be added to uniquely identify the identifier for a given player, as names may still collide.

<characters version="00.01.0001">
    <character id="http://www.mycoolgame.com/players/tomasic/aleria+delamare">
        <name>Aleria Delamare</name>
        <player ref="http://www.mycoolgame.com/players/tomasic">
            <first-name>Jeane</first-name>
            <last-name>Tomasic</last-name>
            <reference></reference>
        </player>
        <!-- additional content -->
    </character>
</characters>

This way, if there are two Jeane Tomasic players, then the ref argument will be different for each of them and thus identify the proper owner (it also makes it easier to create a separate collection of player objects that also use the same referencing scheme).
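A sketch of how these search keys might be consumed, in Python for illustration (the instance data and URIs below are invented): sorting uses character name, then player first and last name, with the ref URI as the final tie-breaker when two players share a name.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<characters>'
    '<character id="c1"><name>Aleria Delamare</name>'
    '<player ref="http://example.com/players/tomasic-1">'
    '<first-name>Jeane</first-name><last-name>Tomasic</last-name>'
    '</player></character>'
    '<character id="c2"><name>Borin Axehand</name>'
    '<player ref="http://example.com/players/tomasic-2">'
    '<first-name>Jeane</first-name><last-name>Tomasic</last-name>'
    '</player></character>'
    '</characters>'
)

def sort_key(char):
    # Character name first, then the split player keys; the ref URI
    # disambiguates two players who happen to share a name.
    player = char.find("player")
    return (
        char.findtext("name"),
        player.findtext("first-name"),
        player.findtext("last-name"),
        player.get("ref"),
    )

ordered = sorted(doc.findall("character"), key=sort_key)
print([c.get("id") for c in ordered])  # ['c1', 'c2']
```

Splitting first-name and last-name into separate elements is precisely what makes this kind of keyed sort cheap, whether in Python, XSLT's xsl:sort, or an XQuery order by clause.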

One other note about the changes - the version attribute on <characters> indicates which version of a schema the given instance corresponds to. Within the model to come, the assumption is made that in the default case, the version on a given element will be the version of the most recent ancestor that specifically defines the version - i.e., the version for <player> in this case will be "00.01.0001", because the most immediate ancestor that has the version defined, <characters>, has that version number. This also places a requirement on any application that generates a standalone <character> document from a <characters> collection to specifically add this version attribute to the character element at generation time. More information about versioning will be discussed in the section Versioning for Fun and Profit.
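The version-inheritance rule just described can be sketched in Python (the element content is invented for illustration): resolve an element's effective version by walking up to the nearest ancestor that carries a version attribute.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<characters version="00.01.0001">'
    '<character id="id1234"><player><first-name>Jeane</first-name>'
    '</player></character></characters>'
)

# ElementTree has no parent pointers, so build a child -> parent map first.
parents = {child: parent for parent in doc.iter() for child in parent}

def effective_version(el):
    # Walk upward until the element itself or an ancestor carries @version.
    while el is not None:
        v = el.get("version")
        if v is not None:
            return v
        el = parents.get(el)
    return None

player = doc.find("character/player")
print(effective_version(player))  # 00.01.0001
```

An application that extracts a standalone <character> document would call this resolver and stamp the result onto the extracted element, satisfying the generation-time requirement described above.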

Use Case 3. The d20 system is intended to be used for more than just fantasy RPGs - it is also used for horror, comic book adventure, modern era, space opera and other subgenres of the RPG field.


Another great article!

Another approach to agile schema modeling is to start the modeling effort with a domain model, followed by a physical model in the form of UML class diagrams to facilitate collaboration between non-technical subject matter experts (SMEs), modelers, and the technical team. You can then export the UML model in XML Metadata Interchange (XMI) format and use an XSLT transform to map it into an XML Schema and an XML instance. UML stereotypes can be added to the UML class diagram to refine the mapping from class diagrams to XML schemas.

If you’re building the schema from scratch, then the schema shall be designed in an iterative and collaborative manner. During each iteration, add just enough components to your schema to support the specific user stories that are being implemented. As the schema grows, refactor as required.

The schema shall be tested for quality against the adopted naming and design rules (NDRs). The US National Institute of Standards and Technology (NIST) has developed an XML Schema Quality of Design Tool (or QoD Tool) which combines Schematron and JESS rules (a Java-based open source rule engine) to validate schemas against NDRs.

For unit testing, the XMLUnit framework can be helpful in testing the schema as you refactor and implement new user stories. XMLUnit for Java allows you to make assertions about the validity of an XML document against an XML Schema. The execution of these tests shall be part of your build and continuous integration process.

The schema modeling effort shall be an integrated part of an agile project which implements practices such as user stories, acceptance tests, unit test first, coding, refactoring, pair programming, short iterations, common code base, and continuous integration.

Joel Amoussou
