Efficient XML Interchange Evaluation: a little fragrant?

By Rick Jelliffe
August 4, 2008 | Comments: 3

EXI is a kind of binary compression of XML, intended to be useful in some edge cases where XML is not optimal. I guess I sit in between ardent promoters of it and keen dissers, such as Elliotte Rusty Harold: edge cases are important if they are your edge case, and XML could well be so sub-optimal as to be a last resort technology.

However, I agree with Elliotte that the new Efficient XML Interchange Evaluation document from W3C is highly sus (do other brands of English use that expression?)

For a start, it reads like marketing. When you read a column that says that XML+GZIP "prevents processing efficiency" you know someone has given the bull a laxative. The EXI group has worked in nice stages: first developing a set of goals and concepts like "processing efficiency", then making up ways to measure them, and now this evaluation note.

The trouble? I cannot find where the measurements support the conclusion. It is not so much that from nice quantitative data, they have swung into "yes/no" territory in order to present simplified conclusions. (What is the difference between binary and the fallacy of the excluded middle?)

But that I cannot trace through from the test results to see how XML+GZIP actively prevents efficiency: at first I thought it must be that EXI group's definition of processing efficiency as a technical term had some technicality that would allow disqualification. For example, that if there was no random access, there could be no processing efficiency: XML+GZIP having no such capability in themselves they would not meet the mark. But I don't see anything like that in the definitions.

It is completely possible that I have missed something, and in fact the document is not just making up conclusions that are far stronger than their definitions allow and their evidence supports. It would be good to see this. Or perhaps I am reading too much into column headings, which may be intended to be more like newspaper headlines and not to be taken seriously; I am quite prepared not to take them seriously. Careless wording has let the side down.

The other thing I don't like about this evaluation is that there is no indication of strengths and weaknesses. There is a useful comment in the EXI Interchange Measurement Notes document: "In the Document class all candidates are very close to each other and to gzipped XML, though again Efficient XML has better performance in some cases." Loosely, if you have no schema you can stick with boring deflate compression (which is what gzip uses), as in ZIP. So EXI is really concerned with data interchange (rather than documents) according to tight schemas, where the cost of parsing or random access is onerous.

On the subject of efficient access, last week I spoke (and performed a terrible magic trick!) at the Sydney Open Publish Conference. I caught up briefly with NICTA's Dr Raymond Wong, who showed me his compression/random access scheme, now being marketed as mcontext: he had compressed the entirety of Wikipedia (about 30 gig, I am told) down to fit into his PDA, and we could search for a term (with a full-text search), select the particular page, and view the extracted/formatted page in under 10 seconds total: very slick for a wee PDA!


3 Comments

Hi Rick,

First, thanks for taking a more balanced view of the recent documents published by the W3C EXI WG. The issue you raise about tracing the conclusion back through the test results is a good one. The EXI and XBC groups have generated a lot of data over the years and probably take certain definitions for granted, forgetting that not everyone is intimately familiar with documents published over three years ago.

As you suspected, the reason why XML+gzip fails the "processing efficiency" test is the specific definition of "processing efficiency" [1]. Many of the XBC use cases [2] stated a requirement for faster parsing and/or serialization than that provided by XML. So, the term "processing efficiency" was created to characterize how fast a format could be parsed and/or serialized relative to XML. All other characteristics being equal, a format that parsed 10 times faster than XML on average might be preferred over one that parsed only 3 times faster. A format that was slower to parse and serialize than XML is said to "prohibit" processing efficiency because it doesn't satisfy the requirements of these use cases.

Parsing and serializing gzipped XML is always slower than parsing and serializing plain old XML because it requires parsing and serializing plain old XML as an intermediate step. This is why XML+gzip fails the "processing efficiency" test.
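The point about the intermediate step can be sketched in a few lines of Python. This is only an illustration, not part of the EXI measurements: the sample document, the helper names (`build_sample_xml`, `parse_plain`, `parse_gzipped`, `time_call`) and the repeat count are all made up here, and it times in-memory CPU work only, ignoring any disk or network transfer where the smaller gzipped size might matter.

```python
import gzip
import time
import xml.etree.ElementTree as ET


def build_sample_xml(n_items: int = 2000) -> bytes:
    """Build a small, made-up data-style document for the comparison."""
    root = ET.Element("orders")
    for i in range(n_items):
        order = ET.SubElement(root, "order", id=str(i))
        ET.SubElement(order, "qty").text = str(i % 7)
    return ET.tostring(root, encoding="utf-8")


def parse_plain(xml_bytes: bytes) -> ET.Element:
    """Parse text XML directly."""
    return ET.fromstring(xml_bytes)


def parse_gzipped(gz_bytes: bytes) -> ET.Element:
    """Decompress first, then run exactly the same text XML parse."""
    return ET.fromstring(gzip.decompress(gz_bytes))


def time_call(fn, arg, repeats: int = 20) -> float:
    """Average wall-clock seconds per call over a number of repeats."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(arg)
    return (time.perf_counter() - start) / repeats


xml_bytes = build_sample_xml()
gz_bytes = gzip.compress(xml_bytes)

plain_s = time_call(parse_plain, xml_bytes)
gz_s = time_call(parse_gzipped, gz_bytes)

print(f"plain XML:   {len(xml_bytes):>7} bytes, {plain_s * 1000:.2f} ms per parse")
print(f"gzipped XML: {len(gz_bytes):>7} bytes, {gz_s * 1000:.2f} ms per parse")
```

The gzipped route does everything the plain route does plus a decompression pass, so on the CPU side it cannot come out ahead; the actual numbers will vary by machine and document.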

I hope this is helpful. Thanks again for taking a look and providing feedback.

Cheers!,

John

[1] http://www.w3.org/TR/xbc-properties/#processing-efficiency
[2] http://www.w3.org/TR/xbc-use-cases/

John: Thanks for that. But "is less efficient than" is different to "prevents efficiency": I am sure it was meant to be a provocative term (in a good sense) but it comes across as sleight-of-hand, I think.

Hi Rick,

Sorry for the slow response. I wasn't notified of your response and just happened across it a moment ago.

You are most certainly right that the phrase "is less efficient than" has a completely different meaning than "prevents efficiency" according to the dictionary definitions of these terms. So, I can definitely understand how the words might be rather surprising to anyone who doesn't realize that there is a more specific technical definition being used for the term "processing efficiency".

It would be very difficult to determine whether a candidate binary XML format met the minimum requirements of the XBC Use Cases using the dictionary definition of "processing efficiency", so the XBC group defined a more specific technical definition for "processing efficiency" that could be measured and assessed deterministically for each candidate binary XML format being evaluated.

I can assure you the terms and definitions were not selected to be provocative and there was no attempt at sleight-of-hand. The term "processing efficiency" was first defined in the 24 February 2005 draft of the XBC Properties Document [1] long before the W3C decided to pursue the EXI standard or selected Efficient XML as the basis of the standard.
It was defined as a simple Boolean to determine whether a candidate binary XML format was faster than text XML or not [and thus, met the requirements of use cases that claimed they needed something faster than text].

Note that this term was defined for evaluating whether candidate binary XML formats met the W3C's minimum requirements for the XBC binary XML use cases. It wasn't really defined for evaluating text XML itself. Even so, determining whether text XML meets this binary XML requirement is pretty simple. Is it possible for text XML to meet the requirements of use cases that require something faster than text XML? Of course not.

So, yes -- you should be surprised and shocked if you hear someone walking around saying "XML prevents processing efficiency" with no other context or qualification. However, it should be less surprising to hear someone say "text XML doesn't meet the minimum processing efficiency requirements of some binary XML use cases", which is what this document is trying to say.

The terms "processing efficiency" and "prevents" are admittedly confusing in this context, but the EXI working group did not actually come up with them and was not given the option of defining new terms. The EXI group is required to evaluate formats with respect to the requirements defined by the XBC group. That said, I think there should be a comment next to the "prevents" value that explains this to avoid future confusion. I will recommend this get added to the next draft of the document.

Thanks again for the feedback. It will help improve the quality of the document. I hope my comments were helpful. Please let me know if there's anything else I can do to assist.

All the best!,

John


[1] http://www.w3.org/TR/2005/WD-xbc-properties-20050224/#processing-efficiency
