A standards-based expert system for detecting structures and annotating XML-in-ZIP documents

By Rick Jelliffe
August 19, 2008

One of the projects I have been working on recently is a proof-of-concept system that takes a rules-based approach to automatically classifying and annotating XML-in-ZIP documents. A few of my recent posts have been in this area: navigating around ZIP, adding foreign elements, and so on.

The brief was for an organization with a large number of documents from multiple sources, where each source was supposed to use stylesheets. The idea was to make a rules base that would distinguish all the different ways that a few structures (titles, tables of contents, potentially citations, etc.) were represented. This would allow classification of documents according to the structures found, the discovery of outliers and exceptions (e.g. incorrectly marked-up documents, or cases where additional rules were needed), and automated annotation back to the original documents.

The approach we have taken is to use Schematron, using the report elements rather than the assert elements. These have opposite logic to the assertions: a report is made whenever you find part of a pattern rather than when it is missing.
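To make the difference in logic concrete, here is a minimal Python sketch rather than Schematron itself. The function names, the toy document, and the use of the standard library's limited XPath support are all assumptions made for illustration; a real schema would express the same thing with report and assert elements.

```python
import xml.etree.ElementTree as ET

def run_report(root, context, test):
    """Report logic: emit a message for each context node where the test FINDS a match."""
    return [f"found <{test}> in <{node.tag}>"
            for node in root.iter(context) if node.find(test) is not None]

def run_assert(root, context, test):
    """Assert logic: emit a message for each context node where the test FAILS."""
    return [f"missing <{test}> in <{node.tag}>"
            for node in root.iter(context) if node.find(test) is None]

# Toy document (hypothetical): one section has a title, the other does not.
doc = ET.fromstring(
    "<doc>"
    "<section><title>Intro</title></section>"
    "<section><p>No title here</p></section>"
    "</doc>"
)

reports = run_report(doc, "section", "title")   # fires on the first section
failures = run_assert(doc, "section", "title")  # fires on the second section
print(reports)
print(failures)
```

The same test produces output on opposite halves of the document, which is why reports suit detection and assertions suit validation.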

The rules base is a Schematron schema. It can wander around the ZIP archive if it has to, in order to get information to test the main XML file. Running the schema generates a report in ISO SVRL, the standard Schematron Validation Report Language, which in particular includes an XPath locator to each matched element. The report also includes dynamically generated text from the document in question, and this gives enough information for the later stages.
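As a rough illustration of what a later stage might consume, the following Python sketch pulls the locator and the generated text out of an SVRL report. The svrl:successful-report element, its location attribute and the svrl:text child come from SVRL; the sample report, the shape of its locator, and the read_reports helper are invented for this example.

```python
import xml.etree.ElementTree as ET

SVRL_NS = {"svrl": "http://purl.oclc.org/dsdl/svrl"}

# Made-up SVRL fragment; a real report carries more attributes and elements.
SAMPLE_SVRL = """<svrl:schematron-output xmlns:svrl="http://purl.oclc.org/dsdl/svrl">
  <svrl:successful-report test="w:pPr/w:pStyle/@w:val = 'Title'"
                          location="/w:document[1]/w:body[1]/w:p[1]">
    <svrl:text>Title found: Annual Report</svrl:text>
  </svrl:successful-report>
</svrl:schematron-output>"""

def read_reports(svrl_text):
    """Return (location XPath, message text) for each successful report."""
    root = ET.fromstring(svrl_text)
    results = []
    for report in root.iterfind("svrl:successful-report", SVRL_NS):
        text = report.findtext("svrl:text", default="", namespaces=SVRL_NS)
        # Collapse whitespace so the message is easy to compare or log.
        results.append((report.get("location"), " ".join(text.split())))
    return results

reports = read_reports(SAMPLE_SVRL)
for location, message in reports:
    print(location, "->", message)
```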

It works!

The only real trick is that writing back multiple annotations to the original file (in the form of customXml elements for OOXML files) has to be done in reverse document order, so as not to disrupt the XPaths that use positional predicates. (Note to self: the XPaths should probably use IDs where available.)
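The ordering problem can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the actual write-back code: the document has no namespaces, the wrapper is a plain customXml element with a made-up element attribute, and all targets are sibling p elements, so sorting on the numeric positions is a fair stand-in for document order.

```python
import re
import xml.etree.ElementTree as ET

STEP = re.compile(r"([^/\[\]]+)\[(\d+)\]")

def parse_path(path):
    """Turn '/doc[1]/body[1]/p[2]' into [('doc', 1), ('body', 1), ('p', 2)]."""
    return [(name, int(pos)) for name, pos in STEP.findall(path)]

def resolve(root, steps):
    """Walk a positional path from the root; return (parent, element)."""
    parent, node = None, root  # the first step names the root element itself
    for name, pos in steps[1:]:
        parent = node
        node = [child for child in parent if child.tag == name][pos - 1]
    return parent, node

def wrap(root, path, label):
    """Wrap the element at `path` in a customXml element (hypothetical shape)."""
    parent, node = resolve(root, parse_path(path))
    wrapper = ET.Element("customXml", {"element": label})
    index = list(parent).index(node)
    parent.remove(node)
    parent.insert(index, wrapper)
    wrapper.append(node)

def annotate(root, annotations):
    """Apply annotations last-position-first so earlier paths stay valid."""
    ordered = sorted(annotations,
                     key=lambda a: [pos for _, pos in parse_path(a[0])],
                     reverse=True)
    for path, label in ordered:
        wrap(root, path, label)

doc = ET.fromstring("<doc><body><p>Title</p><p>Contents</p><p>Text</p></body></doc>")
annotate(doc, [("/doc[1]/body[1]/p[1]", "title"),
               ("/doc[1]/body[1]/p[2]", "toc")])
print(ET.tostring(doc, encoding="unicode"))
```

If the two annotations were applied in forward order instead, wrapping the first p would leave only two p children under body, and p[2] would then resolve to the "Text" paragraph rather than "Contents".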

I have had to make a few improvements to the Schematron skeleton code to support all this, and I hope to roll these out to the public version later this week. This should include a property annotation mechanism that will allow more metadata to be generated.

The reason for using Schematron for the rules file is that it eliminates programming aspects from creating the rules file. The person maintaining the rules file only needs to understand XPaths, not full XSLT or Java etc. And the hope is that many of these rules will be quite similar, so making a new rule will often be just a matter of tweaking an existing XPath rather than writing an entirely new one.

Schematron allows the simplest kind of expert system to be expressed: basically the equivalent of if-thens and case statements but hidden as patterns and rules. This means that the maintainer does not need to have any awareness of higher logic, and so on.
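The case-statement behaviour comes from the way a Schematron pattern works: within one pattern, a node is handled only by the first rule whose context matches it. The Python sketch below imitates that with an ordered list of rules. The rule list, the toy markup and the messages are hypothetical, and each rule here carries a single report that always fires, which is simpler than a real schema.

```python
import xml.etree.ElementTree as ET

# One hypothetical pattern: ordered (context path, message) rules.
TITLE_PATTERN = [
    (".//p[@style='Title']", "title (marked by style)"),
    (".//p[@bold='yes']", "title candidate (marked by bold)"),
    (".//p[cite]", "paragraph containing a citation"),
]

def run_pattern(root, rules):
    """Fire each rule on the nodes it matches that no earlier rule has claimed."""
    claimed = set()   # ids of nodes already handled within this pattern
    results = []
    for context, message in rules:
        for node in root.findall(context):
            if id(node) in claimed:
                continue  # an earlier rule matched first, like a case branch
            claimed.add(id(node))
            results.append((message, "".join(node.itertext()).strip()))
    return results

doc = ET.fromstring(
    "<doc>"
    "<p style='Title' bold='yes'>Annual Report</p>"
    "<p bold='yes'>Summary</p>"
    "<p>Body text <cite>Smith 2008</cite></p>"
    "</doc>"
)

results = run_pattern(doc, TITLE_PATTERN)
for message, text in results:
    print(message, "->", text)
```

The first paragraph is both styled and bold, but only the first rule reports it; the maintainer controls precedence simply by the order of the rules.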

