Is arbitrary XML on the web ever going to happen?

By Philip Fennell
July 2, 2008 | Comments: 2

A recent item posted here by Bryan Rasmussen, entitled 'why XSL-T support in the browser is a failure', brought up an interesting point about arbitrary XML on the web and how it works with search engines:

'...there is a more killing limitation that means we will not have XSL-T transformations via processing instructions anytime soon, and that is that to do it would mean cutting oneself off from meaningful indexing by google'

The notion is that indexers are aware of (X)HTML and know what to look for when indexing those documents, but with the potentially differing semantics of arbitrary XML grammars, the job becomes very hard.

Recently I have been working with RDFa and, in unrelated work, the WAI's Accessible Rich Internet Applications (ARIA) Roles. RDFa provides a framework for annotating documents with additional, machine-readable, semantic information. The main thrust has been towards XHTML+RDFa, but it is equally applicable to any XML grammar. WAI-ARIA Roles annotate a document with information about the function of the document structures used to mark up complex widgets for modern Rich Internet Applications (RIA). The intention is to provide assistive technologies, like screen readers, with hooks that are neutral with respect to the document's semantics and that help the user navigate the document/webapp.

If the indexers used by search engines were tuned to look for RDFa and WAI-ARIA Roles annotations, would the actual semantics of the document be quite so important?

Between the two of them, RDFa and WAI-ARIA Roles can provide useful landmarks for the identification of metadata, main/related content sections, navigation, links and the like for any XML grammar.
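As a sketch of that idea, the fragment below invents a small 'recipe' grammar (the element names are made up for illustration) and annotates it with a WAI-ARIA-style role attribute and an RDFa-style property attribute. An indexer can then locate the heading, main content and navigation without knowing anything about the host vocabulary:

```python
# Sketch: locating landmarks in an arbitrary XML grammar by looking only
# at role/property annotations, not at the grammar's own element names.
# The 'recipe' vocabulary here is hypothetical; the attributes follow the
# pattern of WAI-ARIA roles and RDFa properties.
import xml.etree.ElementTree as ET

doc = """
<recipe xmlns:dc="http://purl.org/dc/elements/1.1/">
  <title role="heading" property="dc:title">Leek Soup</title>
  <steps role="main">
    <step>Chop the leeks.</step>
  </steps>
  <links role="navigation">
    <link href="/recipes">All recipes</link>
  </links>
</recipe>
"""

root = ET.fromstring(doc)

# Find whatever elements play the heading and main-content roles,
# with no knowledge of the 'recipe' grammar itself.
heading = root.find(".//*[@role='heading']")
main = root.find(".//*[@role='main']")

print(heading.text)   # Leek Soup
print(main is not None)
```

The same two queries would work unchanged against any other grammar that carried the same annotations, which is the point: the landmarks travel with the roles, not with the element names.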

Now, I have never been a fan of the willy-nilly creation of arbitrary XML grammars and would love to see developers sign up to an XML grammar non-proliferation treaty. But, given what I have just said, does that now matter so much? Well, yes and no.

Yes, because creating a new XML grammar 'just because you can' is, in my opinion, not good practice. Best practice is to look around to see whether someone else has got there first; after 10 years of XML, there's a pretty good chance they have.

No, because as long as developers are willing to annotate their arbitrary XML documents with recognised metadata and functional vocabularies, their proprietary content will still be useful to a wider audience.

RDFa is in the limelight at present and gathering momentum. WAI-ARIA Roles have uses beyond accessibility, and once people grasp this we may see their wider adoption.

Is arbitrary XML on the web ever going to happen?

If it is, then machine-readable annotations will be an important driving factor for content indexing, and will also provide useful transformational landmarks for the explosion in client-side XSLT that will follow.

If the indexers are to find it, it cannot be arbitrary. I also predict that indexers (Google) won't pay attention to it until it's reasonably guaranteed that the metadata honestly reflects the content of the page. Consider the HTML <meta/> tag: I believe it's widely ignored in indexing.


The point, I believe, is that arbitrary means the semantics can be variable and the document semi-structured. However, when looking for an element node that has a role attribute with a value of 'ariarole:heading' (//*[@role = "ariarole:heading"]), once the ariarole namespace prefix is resolved it is quite unambiguous what the 'role' of that element is to systems that understand WAI-ARIA Roles. Therefore, if roles were used, where appropriate, to enhance content, that can only be a good thing. As for metadata, you are right: without 'trust', and all the baggage that goes with it, there will always be doubt. But that doesn't, and shouldn't, get in the way of finding easier ways to identify content in arbitrary XML to aid indexing.
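The prefix resolution described above can be sketched as follows, with ElementTree standing in for a full XPath engine. The 'ariarole' prefix and its namespace URI are illustrative, not the real WAI-ARIA binding:

```python
# Sketch: resolve a prefixed role value against the document's namespace
# bindings. Once resolved, the role is unambiguous regardless of the
# host grammar. The namespace URI below is a placeholder.
import io
import xml.etree.ElementTree as ET

doc = ('<doc xmlns:ariarole="http://example.org/aria-roles#">'
       '<head role="ariarole:heading">Introduction</head>'
       '</doc>')

# Collect the document's prefix-to-URI bindings as it is parsed.
bindings = {}
for _event, (prefix, uri) in ET.iterparse(io.StringIO(doc),
                                          events=("start-ns",)):
    bindings[prefix] = uri

# Expand any prefixed role value to its full, unambiguous form.
root = ET.fromstring(doc)
for el in root.iter():
    role = el.get("role", "")
    if ":" in role:
        prefix, local = role.split(":", 1)
        resolved = bindings[prefix] + local
        print(el.tag, "->", resolved)
```

Two documents with entirely different element names would resolve the same role to the same URI, which is what makes the match grammar-independent.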
