The challenge of validating XML-in-ZIP files in place: how to do it with Schematron

By Rick Jelliffe
August 5, 2008 | Comments: 7

The issue of how to bundle up and transport XML document sets, with their accompanying stylesheets and media, has a good solution in the XML-in-ZIP approach used by ODF, OOXML, and SCORM. But it was not always clear that this was a good approach: the horrible XML format Microsoft adopted for Word 2003 is a good (I mean a bad) example, one big XML file with media converted to base64 characters and embedded directly: a fat, fragile file that could test the resources of conventional XML implementations. Water under the bridge now.

But one of the remaining problems, and one that is very interesting to me, is the issue of validation. When the basic information is kept in a single XML file, validation is reasonably straightforward: structures, ID or key referential integrity, datatyping, co-occurrence constraints, and so on. The current range of schema tools supports these kinds of intra-document invariants quite well.

But no document is an island, so Schematron (and basically only Schematron) also supports a range of inter-document constraints (using the document() function): one notable case is where you want to keep the allowed values of some element on a separately maintained code list, in any XML format you like, which is the approach Ken Holman took for the OASIS UBL Code List Methodology; another case is where you want to check referential integrity for a link between the current document and another; another case is where you want to test invariants between documents (e.g. to confirm that when the input document to a process has value X in some element, that value has indeed been passed through to the output document correctly).
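To give the flavor of the code-list case, here is a minimal sketch; the file name codes.xml and the element names are made up for illustration, and are not the UBL methodology itself:

<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:pattern>
    <sch:rule context="currencyCode">
      <!-- codes.xml is the separately maintained code list -->
      <sch:assert test=". = document('codes.xml')//code">
        The currency code should appear in the external code list.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>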

But the new XML-in-ZIP documents present a new challenge: constraints that formerly would have been kept in a single document are now split over multiple documents (and the name-independence of the Open Packaging Conventions which OOXML uses adds a layer of indirection that complicates the base case as well.) So the value of single-document validation for detecting real problems has decreased.

(I should say here that there is also another issue for validation of these documents. XML validation is largely predicated on the SGML idea of the separation of presentation (processing) and structure: what Charles Goldfarb called rigorous markup was relevant because it did not need to check such ephemera, which could be made system-dependent. The standards (and conformance thinking) around ODF and OOXML have been fairly confused because those formats are all dependent on particular classes of applications, and indeed on particular applications.)

So if there is a stronger need for validation inside ZIP files between the various parts, is ISO Schematron up to the job? Well, yes and no. I have been working with Tak Tran recently on using Schematron to report on structures found in OOXML and ODF files: using the Schematron report element to identify and label candidates, using Schematron as the rules-language front end for an expert system.

The first problem we had was the issue of access to parts in the document. Sun defined a jar: and zip: URL scheme for accessing parts of a JAR archive (a JAR archive is a ZIP file), which is available as a built-in part of the Java API. This uses a bang (!) to separate the ZIP filename from the part locator. For example, jar:http://www.foo.com/bar/jar.jar!/baz/entry.txt.

However, the Gods of URLs have not blessed this syntax: it is not in an RFC and the RFC for generic URI syntax reserves "!". There seems to be some idea that if you want a URL to access this, what you should do is convert the ZIP archive into a MIME multipart file, then have URLs to access inside that: a hopelessly confused and utterly unreasonable approach which shows exactly the wrong way to develop standards: I don't know whether it is NIH or just that ZIP files are only supposed to be unzipped and not accessed directly: whatever the reason, it is past its use-by date.

URL resolvers are the functions (classes, libraries, whatever) that retrieve a file from a URL (the WWW jargon would be that they get a representation of a resource; in SGML terms it would be to provide the entity for the PUBLIC or SYSTEM identifier): in Java this is a File object, in .NET this is a Reader object. It is very common to subclass or implement your own custom resolver, and one of the most common is Norm Walsh's resolver for Java, which implements the OASIS XML Catalog system: Apache has a patched version of this as ApacheCatalogResolver.

So while Sun's Java API provides this jar: capability in its built-in resolver, IBM's Xerces did not (because it was not standard!) and neither has Microsoft. (I really hope the IBM-ers and Microsoft-ies get real on this fast, to help developers.) And for home-made or public resolvers in Java, it is in the hands of fate: do they extend the built-in Java URL resolver (and get this scheme) or not? Basically you have to test and get your own. [I believe from a comment in a newsgroup that the Apache Ant resolver (the one under the tools branch, not the Commons resolver) does support jar:, but I have not tested it: I scanned the code and didn't see how it could, though.]

So that is the first hurdle. Either we have to unzip the whole ZIP archive into a file system (which may take more time and space than it needs to), in which case we can use file access, or we have to make sure our Schematron XSLT engine is called with a URL resolver that understands the zip: or jar: URL schemes. The latter option is doable in Java, but for our current project we need to deliver in .NET and we are stuck without a resolver: fortunately the solution is to build your own. We are using SAXON.NET, which allows external resolvers; I believe the built-in XSLT library for .NET does as well, but it does not support XSLT2 and so is increasingly irrelevant.

So here is a first stab at such a resolver for .NET (with embarrassed disclaimers from the programmer that he has never written or studied C# or .NET before, and that it would benefit from more attention and features): XmlZipResolver.cs. In the absence of an alternative, it might be useful for someone: I am not sure we will develop it further, because it meets our limited requirements currently. You can see that the functionality is quite straightforward.
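To give the flavor of it, here is a minimal sketch of such a resolver. This is a reconstruction for illustration, not the XmlZipResolver.cs code itself: it assumes the later System.IO.Compression API rather than the ZIP libraries of the day, and it assumes the archive part of the URL is a plain file path rather than a nested URL.

using System;
using System.IO;
using System.IO.Compression;
using System.Xml;

// Resolves zip: and jar: URIs of the form zip:archive-path!/entry-path
// by opening the named entry inside the ZIP archive as a stream.
public class ZipUrlResolver : XmlUrlResolver
{
    public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
    {
        string scheme = absoluteUri.Scheme;
        if (scheme != "zip" && scheme != "jar")
            return base.GetEntity(absoluteUri, role, ofObjectToReturn);

        // Split "zip:C:/docs/report.docx!/word/styles.xml" at the bang.
        string spec = absoluteUri.OriginalString.Substring(scheme.Length + 1);
        int bang = spec.IndexOf('!');
        string archivePath = spec.Substring(0, bang);
        string entryPath = spec.Substring(bang + 1).TrimStart('/');

        using (ZipArchive archive = ZipFile.OpenRead(archivePath))
        {
            ZipArchiveEntry entry = archive.GetEntry(entryPath);
            if (entry == null)
                throw new FileNotFoundException(entryPath + " not found in " + archivePath);

            // Copy to a MemoryStream so the archive can be closed promptly.
            var buffer = new MemoryStream();
            using (Stream s = entry.Open())
                s.CopyTo(buffer);
            buffer.Position = 0;
            return buffer;
        }
    }
}

A fuller version would also override ResolveUri() so that relative references inside a part resolve against the part's position in the archive; this sketch only handles the absolute zip:archive!part form that the schemas below build up with concat().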

So how do you use it with Schematron? Firstly, it exposes a limitation in Schematron: Schematron expects to run on a document. I am not sure whether the Schematron language needs to be enhanced (for example, with some extra attributes on the pattern elements to allow a particular archive and part to be specified), but using standard Schematron the basic document must be extracted first, outside the XSLT, and that is what gets validated. (If your XSLT implementation has an extension function for accessing ZIP, I suppose that could be used.) So already we need a special harness with some ZIP capability: but nothing that Ant, say, cannot handle.

So we run Schematron and provide two inputs:

1) The main XML document we are validating, such as document.xml.

2) As a command-line parameter, the name of the archive, used to construct zip: (or jar:) URLs so that supporting files are visible.
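In SAXON.NET terms, the harness might look something like this minimal sketch; the file names, the parameter name archivename, and the ZipUrlResolver class from the sketch above are illustrative assumptions, not fixed conventions.

using System;
using Saxon.Api;

class ValidateInZip
{
    static void Main(string[] args)
    {
        var processor = new Processor();

        // schema.xsl is the Schematron schema, already compiled to XSLT2.
        XsltExecutable exec = processor.NewXsltCompiler()
            .Compile(new Uri("file:///C:/work/schema.xsl"));
        XsltTransformer t = exec.Load();

        // Input 1: the main part, extracted from the archive beforehand.
        t.InitialContextNode = processor.NewDocumentBuilder()
            .Build(new Uri("file:///C:/work/content.xml"));

        // Make zip: and jar: URIs inside document() calls resolvable.
        t.InputXmlResolver = new ZipUrlResolver();

        // Input 2: tell the schema where the original archive lives.
        t.SetParameter(new QName("archivename"),
                       new XdmAtomicValue("C:/work/mydoc.odt"));

        var serializer = new Serializer();
        serializer.SetOutputWriter(Console.Out);
        t.Run(serializer);
    }
}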

ODF

In the case of ODF, there are not many different parts in the ZIP archive, and the ODF standards hardwire names for them: the main document is content.xml and the stylesheet is styles.xml.

So in order to validate content.xml we:

1) Extract content.xml from the ODF document (considered as a ZIP archive).

2) Run Schematron against this, with an input parameter, say archivename, giving the original ODF file's filename.

3) Use Schematron variables to locate files of interest inside the archive, to validate constraints that are between the content.xml file and those other files.

For example, you might have something like this:

<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron"
queryBinding="xslt2" >

<sch:title>Schema for ODF Content</sch:title>

<sch:ns prefix="office"
uri="urn:oasis:names:tc:opendocument:xmlns:office:1.0" />
<sch:ns prefix="text"
uri="urn:oasis:names:tc:opendocument:xmlns:text:1.0" />
<sch:ns prefix="style"
uri="urn:oasis:names:tc:opendocument:xmlns:style:1.0" />

<!-- command-line parameter, supplied by the harness -->
<sch:let name="archivename"/>

<sch:let name="styles-url"
value='concat( "jar:", $archivename, "!/styles.xml" )' />
<sch:let name="styles-root"
value="document( $styles-url )" />

...



Now in the assertion tests, you can access the various parts of the styles file with ease. (XSLT2 is used here, to avoid any XSLT1 funnies.) If you want to validate or report an ODF file with a lot of media, this would be a way to do it. I am not sure, however, that in the case of ODF it would not be simpler just to extract all the .xml files into a file system and run Schematron just using the file: URL scheme (i.e. what you get by default.)
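For example, an assertion that every paragraph style referenced from content.xml is actually declared in styles.xml might be sketched like this (illustration only: a real check would also have to consider the automatic styles held in content.xml itself):

<sch:pattern>
  <sch:rule context="text:p[@text:style-name]">
    <sch:assert test="@text:style-name = $styles-root//style:style/@style:name">
      Paragraph style '<sch:value-of select="@text:style-name"/>'
      should be declared in styles.xml.
    </sch:assert>
  </sch:rule>
</sch:pattern>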

Rather than having separate schemas for validating each XML part separately, they can be combined and marked as different phases: you validate content.xml with the 'content' phase active, styles.xml with the 'styles' phase active, and so on.
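A sketch of the phase declarations (the pattern names are made up for illustration):

<sch:phase id="content">
  <sch:active pattern="content-rules"/>
</sch:phase>
<sch:phase id="styles">
  <sch:active pattern="styles-rules"/>
</sch:phase>

<sch:pattern id="content-rules"> <!-- rules for content.xml --> </sch:pattern>
<sch:pattern id="styles-rules"> <!-- rules for styles.xml --> </sch:pattern>

The phase to activate is then selected at invocation time (in the usual implementations, by passing a phase parameter when the schema is compiled to XSLT).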

[Update: J. David Eisenberg's OpenOffice.org XML Essentials has an interesting entity resolver for ODF that handles Getting Rid of the DTD, and also provides useful template code for ODF.]


OOXML

OOXML is much more complicated, due to the filename independence that the Open Packaging Conventions (OPC) require. Actually, you can usually rely on the hardcoded filenames, such as word/document.xml being the main document for word processing, in which case life is no more complicated than with ODF. But theoretically you need to look in the /_rels/.rels file, which matches the role of each part (i.e. each file within the ZIP archive) with an ID and an internal path and filename.
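A typical /_rels/.rels looks something like this (the Id value is arbitrary; on the exact relationship type URL, see the correction in the comments below):

<Relationships
    xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1"
      Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
      Target="word/document.xml"/>
</Relationships>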

However, we still need to be able to trace arbitrary files for OOXML (and even for ODF), because the OOXML file might contain arbitrary XML files with data that is accessed by custom controls (or XForms, in the case of ODF, but these are out of scope here, which is a euphemism for my not having thought the issues through).

The trick with OOXML is that every XML part X can have its own _rels/X.rels file, which is used to resolve relationship IDs: for example, the relationships of word/document.xml live in word/_rels/document.xml.rels.

So in order to validate the content part of an OOXML wordprocessing file we

1) Extract word/document.xml from the OOXML document (considered as a ZIP archive), because that is the default name, or

1a) Extract the _rels/.rels file from the OOXML file considered as a ZIP archive, then look in that for the file that has the relationship type http://schemas.microsoft.com/office/2006/relationships/officeDocument and then use the filename provided (see the C# sketch after this list); all this before any Schematron processing. Or

1b) [Added, see Jesper's comments below] Extract the [Content_Types].xml file from the ZIP archive, then find the files (or name patterns) with the content type application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml, which will allow you to locate any wordprocessingml parts in the archive, even if they are not the actual main "officeDocument". (The content type says what notation/schema a part is in. The rels files say the significance of the part: its relationship to the whole document. A sketch of this enumeration appears later, in the discussion of worksheets.)

2) Run Schematron against this, with an input parameter, say archivename, giving the original OOXML file's filename, and another, say filename, giving the current part's name including its path.

3) Use Schematron variables to locate files of interest inside the archive, to validate constraints that are between the document.xml file and those other files.
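For the record, here is a minimal C# sketch of step 1a, under the same API assumptions as the resolver sketch above (the relationship type URL is quoted from the text; see the comments below on its exact form):

using System;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

class FindMainPart
{
    // Relationship type quoted from the text; see the comments below.
    const string OfficeDocumentType =
        "http://schemas.microsoft.com/office/2006/relationships/officeDocument";

    static readonly XNamespace R =
        "http://schemas.openxmlformats.org/package/2006/relationships";

    // Returns the archive-internal name of the main document part.
    static string MainPartName(string archivePath)
    {
        using (ZipArchive zip = ZipFile.OpenRead(archivePath))
        {
            XDocument rels;
            using (var s = zip.GetEntry("_rels/.rels").Open())
                rels = XDocument.Load(s);

            XElement rel = rels.Root.Elements(R + "Relationship")
                .First(e => (string)e.Attribute("Type") == OfficeDocumentType);
            return ((string)rel.Attribute("Target")).TrimStart('/');
        }
    }

    static void Main(string[] args)
    {
        Console.WriteLine(MainPartName(args[0]));   // typically word/document.xml
    }
}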

So here is an example of the kind of thing we are testing at the moment, to find the best idiom: we use quite a few variables to draw out the various stages.

<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" 
   queryBinding="xslt2" >

<sch:title>Schema for OOXML Word Processing Content</sch:title>

<sch:ns prefix="w" uri="http://schemas.openxmlformats.org/wordprocessingml/2006/main" />
<sch:ns prefix="r" uri="http://schemas.openxmlformats.org/package/2006/relationships" />

<sch:p>This header shows how to navigate through OPC relationships.</sch:p>

<!-- command-line parameters, supplied by the harness -->
<sch:let name="archivename"/>
<sch:let name="filename"/>

<!-- Tokenize the filename -->
<sch:let name="filename-name"
value="tokenize( $filename, '/')[last()]" />
<sch:let name="filename-path"
value="substring-before( $filename, $filename-name )" />

<!-- get the relationships file -->
<sch:let name="file-relationships-filename"
value='concat( $filename-path, "_rels/", $filename-name, ".rels")' />
<sch:let name="file-relationships-absolute-uri"
value='concat( "zip:", $archivename, "!", $file-relationships-filename )' />
<sch:let name="file-relationship"
value="document( $file-relationships-absolute-uri )" />


<sch:let name="stylesheet-filename"
value="$file-relationship/r:Relationships/r:Relationship
[@Type='http://schemas.microsoft.com/office/2006/relationships/styleSheet']/@Target" />
<!-- the Target is relative to the folder of the current part -->
<sch:let name="styles-url"
value='concat( "zip:", $archivename, "!", $filename-path, $stylesheet-filename )' />
<sch:let name="styles-root"
value="document( $styles-url )" />

...

With headers like these, the assertion tests for doing things like locating the style for a particular paragraph should not be much different between ODF and OOXML.
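For example, the OOXML analogue of the ODF style check sketched earlier might look like this (in WordprocessingML, w:pStyle/@w:val points at w:style/@w:styleId in the styles part; again, a sketch only):

<sch:pattern>
  <sch:rule context="w:p/w:pPr/w:pStyle">
    <sch:assert test="@w:val = $styles-root//w:style/@w:styleId">
      Paragraph style '<sch:value-of select="@w:val"/>'
      should be declared in the styles part.
    </sch:assert>
  </sch:rule>
</sch:pattern>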

But it still makes me wonder whether some higher-level support is necessary. ISO DSDL (Document Schema Definition Languages) is the multi-part standard which RELAX NG, Schematron, NVDL and DSRL are parts of, and it has Part 6 currently reserved for path-based integrity constraints. I have suggested on the public DSDL mailing list that these new compound documents using XML-in-ZIP represent a real challenge to current schema languages, and that we should look at the ideas and systems floating around for something to cope with them: but this requires more experience.

For example, OOXML does have an additional issue: in the case of spreadsheets, there can be multiple worksheet files created as separate parts of the OOXML document, representing the data in different pages or tabs in the user interface. Furthermore, a word processing document could incorporate some spreadsheet data, for example to get some charts. Now in the processing model above, the driver code needs to go through the document and either extract all XML files by brute force, or be smart enough to identify just the files of interest for validation, using whatever combination of namespace, relationship type, and phase.
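A sketch of the smarter option, enumerating the parts of a given content type by consulting [Content_Types].xml (same API assumptions as the earlier sketches; both Override entries and Default extension mappings are consulted):

using System;
using System.Collections.Generic;
using System.IO.Compression;
using System.Xml.Linq;

class PartsByContentType
{
    static readonly XNamespace CT =
        "http://schemas.openxmlformats.org/package/2006/content-types";

    // Yields the names of all parts in the archive with the wanted content type.
    static IEnumerable<string> Find(string archivePath, string wantedType)
    {
        using (ZipArchive zip = ZipFile.OpenRead(archivePath))
        {
            XDocument types;
            using (var s = zip.GetEntry("[Content_Types].xml").Open())
                types = XDocument.Load(s);

            foreach (ZipArchiveEntry entry in zip.Entries)
            {
                string partName = "/" + entry.FullName;
                int dot = partName.LastIndexOf('.');
                string extension = dot < 0 ? "" : partName.Substring(dot + 1);

                string type = null;
                // An Override entry names a specific part...
                foreach (XElement o in types.Root.Elements(CT + "Override"))
                    if ((string)o.Attribute("PartName") == partName)
                        type = (string)o.Attribute("ContentType");
                // ...otherwise the Default mapping for the extension applies.
                if (type == null)
                    foreach (XElement d in types.Root.Elements(CT + "Default"))
                        if (string.Equals((string)d.Attribute("Extension"), extension,
                                          StringComparison.OrdinalIgnoreCase))
                            type = (string)d.Attribute("ContentType");

                if (type == wantedType)
                    yield return entry.FullName;
            }
        }
    }
}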

This is the kind of thing that, in the DSDL model, we are hoping the validation processing standard will handle: we will be looking at W3C XProc for this. But it may well be that, regardless of how well XProc handles OPC, XLink, XInclude, XForms-in-ODF and other scenarios, Schematron may need some nicer declarative mechanism to give better access to files in the same ZIP archive and better iteration over a set of links (one- and two-step), so that the link target becomes a target for validation, perhaps in a different phase.

At the moment, the context node for a rule element in Schematron has to be in the XML document provided at invocation time, and I expect that we can do something better. I think it would be really useful if we could run a Schematron validation on a ZIP archive (or just a directory) and the schema itself were smart and simple enough to handle access to the various parts. Because of the mooted XProc validation framework for DSDL, such a facility does not need to be full-featured; it just needs to concentrate on whether there is any low-hanging fruit that allows clearer schemas for XML-in-ZIP.

Anyway, it is still early days on all this. Ideas and other experience readers have had are most welcome.


7 Comments

Hi Rick,

I think you got the algorithm for finding the "root element" of e.g. a DOCX-file a bit wrong. As I read the OPC, the method should be:

1. Open [Content_Types].xml in the ZIP-archive.
2. Locate the ContentType "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml" if you are working with Wordprocessing files and pick out the URI (PartName) to locate the place of the stream in the ZIP-archive.

I strongly discourage anyone from simply looking at the location /word/document.xml, since it is in no way always located there.

:o)

Jesper: It depends on what you are trying to do.

There could be multiple XML parts with the content type application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml
in an OOXML file.

But only one of these is the main document, and you know that from the _rels/.rels relationship
http://schemas.microsoft.com/office/2006/relationships/officeDocument

So if the aim were to validate the main document, you use the _rels/.rels relationship. But if the aim were to validate every XML file that was of the content type, you would indeed iterate through the content types, and this would be better for coping with, say, spreadsheets that are embedded in word processing documents.

Strictly, I suppose there is another prior constraint that needs to be tested: is the part specified by the officeDocument relationship actually of the content type desired (e.g. is the OOXML file a wordprocessingml document, or a spreadsheetml document, etc.) Just using the relationship should (or could) catch that.

Marc Van Cappellen has an article at Dr Dobbs on XQuery Your Office Documents (http://www.ddj.com/linux-open-source/202401913) which also takes the jar: URL approach.

It's a little odd to say (given his history of making non-factual statements on anything OOXML) but Rick is right when he mentions the matching of "http://schemas.microsoft.com/office/2006/relationships/officeDocument" in _rels/.rels

Matching the content type is a bad idea since the content type changes according to whether we are dealing with a .docx, .docm, .dotm, .dotx.

With that said, it will be interesting to see if Microsoft updates namespaces (the year in particular) in Office 14-generated files. The smell of broken applications.

Stephane: Thanks! At last we see eye to eye on something :-)

I think Microsoft's problem is that (after the first two crappy versions of anything) they typically err on the side of *not* changing things: continuing old bugs and so on.

I think from the ISO side there will be a strong push to

1) use the ODF namespace (or at least ODF element names in a new OOXML namespace, a la ODF's in-house SVG) if there are elements (or at least features) adopted from ODF,

2) use the OPC extension mechanism for new elements.

I think there is zero chance (and zero desire) of renaming existing elements with a new namespace for future upgrades. Renaming the namespace for a version-up creates a new language, not an upgrade, which goes against the purpose of having a standard.

By the way, it's

http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument

not

http://schemas.microsoft.com/office/2006/relationships/officeDocument


"Renaming the namespace for a version-up creates a new language, not an upgrade"

Yes, but as a version disambiguator it's the easiest to do for engineers who think in terms of streams, not XML stitching.

By the way, readers who are using .NET and XSLT should consider using Tony Coates' XmlCatalogResolver.

XML Catalogs provide a way to manage references in XML files external to the files, which can be very useful sometimes.
