Wrapping with foreign elements in Word 2007 and OpenOffice Writer

By Rick Jelliffe
August 18, 2008 | Comments: 5

First the caveat: Word and OpenOffice are not general-purpose XML editors.

But the SGML/XML project has been based on the perceived benefits not just of separating presentation from content in the narrow sense of using simple stylesheets, but of the much more radical approach of being able to have and name the arbitrary structures you consider useful in your document. If you have some paragraphs that together form a warning, then you should be able to have an element called warning which wraps the relevant paragraphs, for example.

Now in a sense there is not so much difference between


<warning>blah blah</warning>

with a stylesheet like

warning { display:block; color:red;}

and

<div class="warning">blah blah</div>

with a stylesheet like

.warning { color:red;}

Check

The difference is that conventional grammar schemas cannot validate structures based on stylesheet names (or other attributes) very well (or at all). In Schematron, there is no particular difference: instead of (for element markup)


<rule context="warning">
<assert test="title">A warning should have a title</assert>
</rule>

you have (for attribute markup)


<rule context="*[@class='warning']">
<assert test="*[@class='title']">A warning should have a title</assert>
</rule>

But if you don't use Schematron, then even though the information is the same in the two styles of markup, the attribute-utilizing one defeats schema validation, and therefore rigorous markup. It means you have to hardcode any tests (in which case Schematron may become simpler.)
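As a rough illustration of what hardcoding such a test looks like, here is a minimal Python sketch of the attribute-markup rule above, using only the standard library. The function name warnings_missing_title and the sample document are invented for this example, and the check matches class values exactly, just as the Schematron rule does.

```python
import xml.etree.ElementTree as ET

def warnings_missing_title(xml_text):
    """Return the class-marked warning elements that lack a child marked as a title."""
    root = ET.fromstring(xml_text)
    failures = []
    # root.iter() also visits the root element itself
    for el in root.iter():
        if el.get("class") != "warning":
            continue
        # Same idea as the test *[@class='title']: a direct child with class="title"
        if not any(child.get("class") == "title" for child in el):
            failures.append(el)
    return failures

# Hypothetical sample document, for illustration only
sample = """<body>
  <div class="warning"><p class="title">Danger</p><p>blah blah</p></div>
  <div class="warning"><p>blah blah</p></div>
</body>"""

for el in warnings_missing_title(sample):
    print("A warning should have a title:", ET.tostring(el, encoding="unicode").strip())
```

Every new constraint means another function like this one, which is where the two-line Schematron rule starts to look simpler.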

One difference between ODF and OOXML is that for foreign elements, OASIS/ISO ODF allows you to directly wrap any element in a foreign element. Ecma/ISO OOXML uses a special element, customXml, in which you give the element name in an attribute (and attributes are specified using an attr element underneath).

The advantage of the ODF approach is directness, but the disadvantage is that it defeats schema validation, in the same way that XSLT defeats validation with grammars. In ODF, you have to strip out the foreign elements and validate without them. In OOXML, the customXml element is part of the expected namespace and validated with the schema: you don't need to strip anything to validate, but you don't get any ability to validate the structure of the embedded elements directly. (In fact, you can just extract the customXml elements, transform them into the element-using form, and validate that.) So no system is perfect.
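The transformation mentioned in that parenthesis can be sketched in a few lines of Python. This is only an illustration under some assumptions: it reads the element name and namespace from the w:element and w:uri attributes, takes attributes from w:attr children of w:customXmlPr, and ignores any namespace on those attributes. The function name to_element_form, the http://example.com/ns namespace and the sample fragment are made up for the example.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def w(name):
    """Qualified name in the WordprocessingML namespace."""
    return "{%s}%s" % (W, name)

def to_element_form(node):
    """Recursively copy node, turning each w:customXml into the element it names."""
    if node.tag == w("customXml"):
        uri = node.get(w("uri"))
        local = node.get(w("element"))
        new = ET.Element("{%s}%s" % (uri, local) if uri else local)
        props = node.find(w("customXmlPr"))
        if props is not None:
            # Assumption: attribute namespaces are ignored in this sketch
            for attr in props.findall(w("attr")):
                new.set(attr.get(w("name")), attr.get(w("val")))
        children = [c for c in node if c.tag != w("customXmlPr")]
    else:
        new = ET.Element(node.tag, dict(node.attrib))
        children = list(node)
    new.text = node.text
    new.tail = node.tail
    for child in children:
        new.append(to_element_form(child))
    return new

# Hypothetical fragment of a document.xml body
sample = """<w:body xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:customXml w:uri="http://example.com/ns" w:element="warning">
    <w:customXmlPr><w:attr w:name="level" w:val="high"/></w:customXmlPr>
    <w:p><w:r><w:t>blah blah</w:t></w:r></w:p>
  </w:customXml>
</w:body>"""

converted = to_element_form(ET.fromstring(sample))
print(ET.tostring(converted, encoding="unicode"))
```

The result still contains the WordprocessingML paragraphs, now wrapped in a real warning element, so it can be handed to whatever validator understands the custom vocabulary.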

Wrap

Now that is how it works in theory, what about in practice?

Just following both the OOXML and the ODF 1.1 standards, you will be hard-pressed to make embedded XML work in Office 2007 and OpenOffice. I was.

In the case of OOXML, the standard does not connect enough dots, even though the information is actually there: when you use the customXml element for an element, Office actually looks in settings.xml and checks that there is an attachedSchema element for that namespace. (If there is no namespace, another setting, alwaysMergeEmptyNamespace, may affect whether no-namespace custom XML is an error. The default is OK.) But the text on customXml does not mention attachedSchema, which should be fixed.
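To make the missing link concrete, here is a small Python sketch that compares the namespaces used by customXml elements with the attachedSchema entries in the settings part. It assumes the namespace sits in the w:uri attribute of w:customXml and that each w:attachedSchema carries its namespace in w:val; the function name unattached_namespaces and both sample parts are invented for illustration.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def unattached_namespaces(document_xml, settings_xml):
    """Namespaces used by w:customXml that have no matching w:attachedSchema."""
    doc = ET.fromstring(document_xml)
    settings = ET.fromstring(settings_xml)
    used = {el.get("{%s}uri" % W) for el in doc.iter("{%s}customXml" % W)}
    # No-namespace custom XML is governed by a separate setting, so skip it here
    used.discard(None)
    attached = {el.get("{%s}val" % W) for el in settings.iter("{%s}attachedSchema" % W)}
    return used - attached

# Hypothetical document and settings parts, for illustration only
document_part = """<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:customXml w:uri="http://example.com/ns" w:element="warning">
      <w:p><w:r><w:t>blah blah</w:t></w:r></w:p>
    </w:customXml>
  </w:body>
</w:document>"""

settings_part = """<w:settings xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:attachedSchema w:val="http://example.com/other"/>
</w:settings>"""

print(unattached_namespaces(document_part, settings_part))
```

With a real file, the two strings would be read out of the package with the zipfile module, typically from word/document.xml and word/settings.xml.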

Doug Mahugh wrote this up in a little article. If you are just programmatically adding markup and the user never needs to edit it, then you don't need a schema. (However, schemas are not necessarily saved in the OOXML file by Word: the mapping between namespaces and schema files is saved in application preferences files or registry, not in the OOXML. This is similar to how OpenOffice treats registering import and export XSLT files, for example.)

So after getting grumpy at this deficiency in the OOXML standard, I decided to try ODF and OpenOffice to see how they handled it. This section of the standard has previously caused a little trouble when we tried to figure it out: see A simple ISO NVDL script for preparing ODF XML for validation.

When I tried to add a wrapper element around a paragraph element in an ODF file, OpenOffice (both 2.3 and 2.4) just deleted the element and its contents. No good at all. But when I double-checked the ODF 1.1 standard, section 1.5, it has

Conforming applications that read and write documents may preserve foreign elements and attributes. ...
Foreign elements may have an office:process-content attribute attached that has the value true or false. If the attribute's value is true, or if the attribute does not exist, the element's content should be processed by conforming applications. Otherwise conforming applications should not process the element's content, but may only preserve its content. If the element's content should be processed, the document itself shall be valid against the OpenDocument schema if the unknown element is replaced with its content only.
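The "replaced with its content only" rule is what a pre-validation stripping step has to implement. Here is a minimal Python sketch of one reading of it, with assumptions flagged: an element counts as foreign if its namespace is outside the OASIS OpenDocument family (real documents also use Dublin Core, XLink, MathML and others, which would need allowing), and a foreign element with process-content set to false is dropped along with its content. Foreign attributes and a foreign root element are not handled. All function names are invented for the example.

```python
import xml.etree.ElementTree as ET

OFFICE = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
ODF_PREFIX = "urn:oasis:names:tc:opendocument:xmlns:"

def is_foreign(el):
    # Assumption: anything outside the OpenDocument namespaces is foreign
    return not el.tag.startswith("{" + ODF_PREFIX)

def append_text(parent, text):
    """Add text after parent's last child, or to parent.text if it has no children."""
    if not text:
        return
    if len(parent):
        parent[-1].tail = (parent[-1].tail or "") + text
    else:
        parent.text = (parent.text or "") + text

def strip_foreign(node):
    """Return a copy of node with foreign descendants unwrapped or dropped."""
    copy = ET.Element(node.tag, dict(node.attrib))
    copy.text = node.text
    for child in node:
        cleaned = strip_foreign(child)
        if is_foreign(child):
            if child.get("{%s}process-content" % OFFICE, "true") == "true":
                # Replace the foreign element with its content only
                append_text(copy, cleaned.text)
                for grandchild in list(cleaned):
                    copy.append(grandchild)
            # Otherwise (assumption) drop the element together with its content
        else:
            copy.append(cleaned)
        append_text(copy, child.tail)
    return copy

# Hypothetical content fragment, for illustration only
sample = """<office:text xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
    xmlns:my="http://example.com/ns">
  <my:warning>
    <text:p>blah blah</text:p>
  </my:warning>
  <my:note office:process-content="false"><text:p>hidden</text:p></my:note>
</office:text>"""

cleaned = strip_foreign(ET.fromstring(sample))
print(ET.tostring(cleaned, encoding="unicode"))
```

The cleaned tree is what would then be validated against the OpenDocument schema.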

So the way OpenOffice seems to work (assuming it is not just an unimplemented feature) is that it doesn't preserve the element and doesn't (on my test, maybe I am doing something wrong) pay any attention to the process-content either. I find this difficult to justify against the text: I think this might be an issue of what you define as an element. The classic definition is that an element is the start and end tags and all its contents: the branch. The way that the ODF spec (and many others) use the term, it is just the particular node (and its attributes and perhaps its simple content, but not its descendant nodes.)

Carry out

But the bottom line for foreign elements as wrappers in ODF and OOXML is that ODF allows them to be stripped out while OOXML doesn't allow that; neither, of course, requires that any particular application understands them. The bottom line for OpenOffice and Office seems to be that OpenOffice strips them (dangerously, but perhaps allowed because of bad drafting of that part of the ODF standard) while Office 2007 does allow them. In both cases users would be helped by clearer text (better conformance text for the OASIS/ISO text, better references for the Ecma/ISO text).


5 Comments

Florian: Thanks for that. I am glad if it can be fixed fairly easily, because wrapping is another of these techniques where it has more value if it can be adopted without having thereby to marginalize or favour particular platforms or implementations.

Arbitrary markup is a real niche usage, that really few people need. I agree with various comments that Tim Bray for example has made about its nicheness. But it just happens to be my niche in my work at the moment!

It is something that makes large specialty infosystems easier to assemble, for example for legislation, and it is something that MS makes quite a fuss about for Office (and which frequently gets picked up as a technical advantage by reviewers), so I think it might have some marketing advantages for OpenOffice to get it going.

Agreed, there are tricks to know both with OOXML and the Word 2007 implementation. I had better luck using smartTag than customXml (no schema, no validation, no entry into the schema.xml part).

BTW, does anyone know if you can use customXml to add elements with text contents without having the text display in the word doc? It seemed to me on casual reading that customXml has to have w:r and w:t contents, which will display.

See my experiences with smartTag and customXml at:
http://blogarifficness.blogspot.com/2008/11/adding-metadata-inside-ooxml-document.html

Damon: Yes, the customXML are purely wrappers for existing text which have no impact on rendering. So in order to make the text non-visible, you have to set the visibility of the contained paragraph etc.

So the customXML is only useful in circumstances where there is some non-Office application that requires extra markup to do something additional. It is not much use inside Word. (Let's include having a VBA script in "non-Office application"...)

Peter Sefton (if I recall) commented that using little tables may be more satisfactory than using customXML whenever direct visual effects are desired.
