Can a URL contain an XPath?

By Rick Jelliffe
August 12, 2008 | Comments: 12

Is it possible (under the current standards) to have a URL like this?

http://www.eg.com/documents/act124/body/section[1]/title

Well, let's make it easier. Is it possible to have an IRI with an http scheme that has path components with XPath-syntax predicates and XPath-syntax shortcuts? The scenario is, say, a webserver with a special resolver that extracts and returns fragments.

Let's look at [ and ].

The things that most people call URLs are what you can type into your browser address bar, or put in HTML a/@href, in XML entity declarations, or in webserver URL-rewriting files. But in fact these are a much looser form than standard URLs, which use ASCII syntax and reserve certain characters. Browsers and parsers perform various encodings and decodings to go from the human-friendly form to the strict standard form.

What the RFCs say

The need to give the looser form a proper name has partially been met by RFC 3987, Internationalized Resource Identifiers (IRIs), which establishes the basic structure


IRI = scheme ":" ihier-part [ "?" iquery ]
[ "#" ifragment ]

and sets rules for / and //.

After an IRI has been converted to a URI, the rules of RFC 3986, Uniform Resource Identifier (URI): Generic Syntax, apply, which (like IRIs) reserve certain characters as generic delimiters:

reserved    = gen-delims / sub-delims

gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"

sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="

In both IRIs and URLs, the percent-encoded form of a gen-delims character is regarded as distinct from its direct use. However, which gen-delims (and sub-delims?) are available for use in a particular URL scheme depends on that scheme, so this takes us to RFC 1738, Uniform Resource Locators (URL).

RFC 1738 does not define any special behaviour for [ and ]; however, its productions for http URLs do not admit them as direct characters:

httpurl        = "http://" hostport [ "/" hpath [ "?" search ]]
hpath          = hsegment *[ "/" hsegment ]
hsegment       = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
search         = *[ uchar | ";" | ":" | "@" | "&" | "=" ]

I think there is a persistent problem with the drafting of the RFCs: they don't make an explicit distinction between the repertoire of characters used for encoding and the repertoire of characters that can be represented. It is easy to fall into the trap of thinking that the syntax productions in the RFCs relate to the latter rather than to the former.

So, as far as I can tell, the story on using [ and ] in the hsegment of a URL is this:

  • RFC 1738 does not allow them as direct characters. Therefore in the strict URL form they must be percent encoded.
  • If they are present in an HTTP URL or IRI (by mistake) as direct characters, conforming software along the way (such as a browser or other rewriters) will not percent encode them for you, since they are reserved.
  • However, there is nothing in the RFCs to prevent [ and ] from being present in URLs and IRIs in their percent-encoded form.
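
The first point can be checked mechanically. The following is a minimal sketch, not a validator: it transcribes the RFC 1738 hsegment production (with uchar expanded into the RFC's safe, extra and escape classes) into a regular expression. The names HSEGMENT and is_rfc1738_hsegment are made up for illustration.

```python
import re

# Character classes from the RFC 1738 BNF.
SAFE = "$-_.+"
EXTRA = "!*'(),"
UNRESERVED = "A-Za-z0-9" + re.escape(SAFE + EXTRA)

# hsegment = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
# uchar    = unreserved | escape, where escape = "%" hex hex
HSEGMENT = re.compile("(?:[" + UNRESERVED + ";:@&=]|%[0-9A-Fa-f]{2})*")


def is_rfc1738_hsegment(segment: str) -> bool:
    """True if the whole segment matches the hsegment production."""
    return HSEGMENT.fullmatch(segment) is not None


print(is_rfc1738_hsegment("section[1]"))      # False: direct brackets
print(is_rfc1738_hsegment("section%5B1%5D"))  # True: percent-encoded brackets
```

Note that this production does admit "*", "(" and ")" directly, because they belong to the RFC's "extra" class.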

There are a few other funnies at work for XPath syntax: for example, XPath allows x[../z], which http: URL/IRI interpretation would notionally parse as two segments, x[.. followed by z] (ignoring, for clarity, the square bracket encoding needed).
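
To make the segmentation problem concrete, here is a small sketch contrasting a naive split on "/" with a split that respects predicates. The function name split_steps is hypothetical, and the sketch ignores brackets or slashes that might appear inside quoted string literals in a predicate.

```python
def split_steps(xpath: str) -> list:
    """Split an XPath-like string on "/" only when outside [...] predicates."""
    steps, current, depth = [], [], 0
    for ch in xpath:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth = max(depth - 1, 0)
        if ch == "/" and depth == 0:
            # A solidus outside any predicate ends the current step.
            steps.append("".join(current))
            current = []
        else:
            current.append(ch)
    steps.append("".join(current))
    return steps


print("x[../z]".split("/"))   # ['x[..', 'z]']  (URL-style segmentation)
print(split_steps("x[../z]"))  # ['x[../z]']     (XPath-style segmentation)
```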

So it seems that the answer is yes: a URL can contain an XPath, provided it has been delimited in certain ways (even if it is an IRI):

  • The left and right square bracket must be percent encoded

  • The solidus "/" inside predicates (brackets) must be percent encoded, but otherwise should not be (since it is fulfilling its segmentation function)

  • The "#" and "?" characters must be percent encoded because of their special http: scheme meaning

  • The other gen-delims and sub-delims should be percent encoded too, just for compatibility, except for the ones from the reserved list which are allowed by RFC 1738: ";" | ":" | "@" | "&" | "="
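
The rules in this list can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name encode_xpath_path and the constant KEEP are invented, and quoted string literals inside predicates get no special treatment. It relies on urllib.parse.quote, which percent-encodes everything except letters, digits, "_", ".", "-", "~" and whatever is passed in safe.

```python
from urllib.parse import quote

# The reserved characters that RFC 1738 allows directly in an hsegment.
KEEP = ";:@&="


def encode_xpath_path(xpath: str) -> str:
    """Percent-encode an XPath-like path for use as http: path segments.

    A "/" outside predicates is kept as the segment separator; a "/" inside
    [...] is encoded, as are the brackets, "#", "?" and the other delimiters.
    """
    out, step, depth = [], [], 0
    for ch in xpath:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth = max(depth - 1, 0)
        if ch == "/" and depth == 0:
            out.append(quote("".join(step), safe=KEEP))
            step = []
        else:
            step.append(ch)
    out.append(quote("".join(step), safe=KEEP))
    return "/".join(out)


print(encode_xpath_path("body/section[1]/title"))  # body/section%5B1%5D/title
print(encode_xpath_path("x[../z]"))                # x%5B..%2Fz%5D
```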

However, a resolver which took the URLs and did pattern matching on them should, following Postel's law, accept in their direct form those delimiters that are reserved by the generic syntax but not actually used in the http: scheme, notably "*", while those that are allowed, such as "@", "(", "=" and ")", should also be accepted in their percent-encoded form. [This para and previous para updated]
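
One way a resolver might get this leniency is to normalize every incoming path before pattern matching. The sketch below is an assumption about how that could be done, not a tested implementation; normalize_path is a hypothetical name. It splits on the literal "/" first and only then decodes each segment, so an encoded solidus inside a predicate stays inside its segment.

```python
from urllib.parse import unquote


def normalize_path(path: str) -> tuple:
    """Decode each path segment so direct and percent-encoded forms compare equal."""
    # Split before decoding: %2F must not turn into a segment separator.
    return tuple(unquote(segment) for segment in path.split("/"))


direct = normalize_path("/accounts/general-ledger%5B@version=x%5D/revenue")
encoded = normalize_path("/accounts/general-ledger%5B%40version=x%5D/revenue")
print(direct == encoded)  # True
```

This deliberately conflates the direct and percent-encoded forms of the delimiters, which is looser than the generic syntax, where the two are distinct.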

(Next stage: actually implement this and test it!)

Why would you want to do this?

Now you can already use XPaths with URLs: that is what XPointer specifies. So why would we want this XPath-ish syntax?

The motivating use case comes out of the PRESTO approach I have been working on:


"All documents, views and metadata at all significant levels of granularity and composition should be available in the best formats practical from their own permanent hierarchical URIs."

The trouble with using ? (or even #) is that the resources identified become, in effect, terminal leaves. You cannot merely add more paths underneath, as far as I can see, to drill down further.
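
A generic URL parser shows the effect. In this sketch the URL is a made-up variation on the example at the top of the post; Python's urllib.parse.urlsplit is used only as a convenient stand-in for generic-syntax parsing. Everything after the "?" lands in the query, so the trailing /title is not a path segment at all.

```python
from urllib.parse import urlsplit

# Hypothetical URL: a query parameter followed by an attempted drill-down.
parts = urlsplit("http://www.eg.com/documents/act124?section=1/title")

print(parts.path)   # /documents/act124
print(parts.query)  # section=1/title
```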

Now one of the PRESTO ideas is that the URL should reflect a use-case from the user's conceptual POV, not merely reflect the system-specific details. Isn't this kind of URL a hard-coding of a specific system — the grammar of the XML document? Well, yes but not necessarily... It could be an XPath into a transformed version (or view) of the document that aligned with the use-case better than the original.

But the ability to use XPath predicate syntax in paths of URLs addresses a very particular problem even if there is no underlying XML document. Let's say we have some web services to deliver accounts (and we want to use the PRESTO approach). So we can get the general ledger from


http://www.eg.com/accounts/general-ledger/

which is the same as

http://www.eg.com/accounts/general-ledger/years/2008

And we can drill down further to


http://www.eg.com/accounts/general-ledger/years/2008/months/August/revenue

But what if the calculation of the general-ledger required use of a parameter, for example, because we wanted to see the black-market version of the books, not the version prepared for the tax man?

If we don't have a predicate kind of mechanism, we get a kind of combinatorial explosion in the possible URLs, so that for every value of every parameter we have to have a separate branch. While we might have allowed the URL


http://www.eg.com/accounts/general-ledger?version="black-market"

we cannot drill down further


http://www.eg.com/accounts/general-ledger?version="black-market"/years/2008/months/August/revenue

and instead would have to have something like


http://www.eg.com/accounts/general-ledger/black-market/years/2008/months/August/revenue

This is within the capabilities of a resolver, but falls apart when you start to have multiple parameters or parameters with non-unique values. So you start to need something like


http://www.eg.com/accounts/general-ledger/version=black-market/years/2008/months/August/revenue

I think the XPath predicate syntax is much clearer:


http://www.eg.com/accounts/general-ledger[@version=black-market]/years/2008/months/August/revenue

Indeed, you might go as far as


http://www.eg.com/accounts/general-ledger
[@version="black-market"][@year="2008"][@month="August"]/revenue

which is, in proper URL syntax, the following ugliness:


http://www.eg.com/accounts/general-ledger
%5B%40version=%22black-market%22%5D%5B%40year=%222008%22%5D%5B%40month=%22August%22%5D/revenue

where the server could convert the URL into a request using its (non-PRESTO) system URLs, such as


http://internal.eg.com/account-bot?get-general-ledger?version=black-market?as-at=2008-08-31

then drill down inside the returned XML account to find the revenue fragment using some XPath such as

general-ledger/summary/revenue
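
As a rough sketch of the first half of that conversion, the code below pulls a percent-encoded predicate URL apart into a resource name, a dictionary of parameters and the remaining drill-down steps. The names resolve and PREDICATE are hypothetical, only predicates of the form [@name=value] or [@name="value"] are handled, and mapping the result onto the internal system URL and the final XPath is left to the server.

```python
import re
from urllib.parse import unquote, urlsplit

# Matches [@name="value"] or [@name=value] after percent-decoding.
PREDICATE = re.compile(r'\[@([\w-]+)=(?:"([^"]*)"|([^\]"]*))\]')


def resolve(url: str):
    """Return (resource, parameters, remaining steps) for a predicate URL."""
    # Split the path on "/" before decoding, then decode each segment.
    segments = [unquote(s) for s in urlsplit(url).path.split("/") if s]
    for i, seg in enumerate(segments):
        if "[" in seg:
            name = seg[: seg.index("[")]
            params = {
                m.group(1): m.group(2) if m.group(2) is not None else m.group(3)
                for m in PREDICATE.finditer(seg)
            }
            return "/".join(segments[:i] + [name]), params, segments[i + 1:]
    return "/".join(segments), {}, []


resource, params, rest = resolve(
    "http://www.eg.com/accounts/general-ledger"
    "%5B%40version=%22black-market%22%5D%5B%40year=%222008%22%5D"
    "%5B%40month=%22August%22%5D/revenue"
)
print(resource)  # accounts/general-ledger
print(params)    # {'version': 'black-market', 'year': '2008', 'month': 'August'}
print(rest)      # ['revenue']
```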

So having a predicate syntax available in URLs allows both systematic retrieval of XML fragments (a la the Oracle xmlfs XML filesystem) and a systematic approach to handling parameterized resource selection (queries) without sacrificing drilldown (which is a key requirement of PRESTO).

It does uglify the real URL, but that is only a consideration if the user sees it rather than the presentation form supplied by a browser (or some intermediate version).

P.S.: PRESTO and Apache Sling

I was interested to see Roy Fielding's Sling project is up at the Apache incubator site. It has a lot of connection to the PRESTO approach: in particular the push to noun-ify URLs. I think a system that adopted the full PRESTO approach would go further than Sling in one respect: it seems that Sling resources don't allow further drill-down inside them, so you have to go to some query mechanism, or something outside Sling.

Nevertheless, Sling's separation of data and functions (and the nounification of functions and the construction of synthetic URLs with them) is very PRESTO-friendly.


12 Comments

Please check your HTML, it looks like you have been missing some closing tags, the text is becoming smaller and smaller and finally unreadable.

Lars: Yikes, fixed now. 10,000 apologies.

Sorry, the problem did not show up in the O'Reilly/MT previewer or the final published page viewed in Firefox, but it disrupts IE. (I think the IE semantics are wrong here, because you would expect the effect of an unclosed code element to finish with the paragraph it is in.)

The tiny size of the embedded fragments (and the recent lack of scrolling) is the subject of a continuing fight with the O'Reilly stylesheet supremos. I hope sometime soon the text will magically appear at a readable size. I have to read the pages at a larger size always.

Rick,

Very interesting piece. Another pattern you could use is a Matrix URI. For example, your:

http://www.eg.com/accounts/general-ledger
   [@version="black-market"][@year="2008"][@month="August"]/revenue

could be:

http://www.eg.com/accounts/general-ledger
  /version=black-market;year=2008;month=August/revenue

It wouldn't give you the full power of XPath, but you wouldn't have to worry about those reserved characters.

Jeni

I'm guessing that xpath1 isn't quite what you're seeking, either, and neither is XPointer generally.

You've come up against a basic issue in URI syntax, the notion that there are solid "resources" which can then be broken into (or referenced at) "fragments", where the interpretation of the fragment identifier is determined by the MIME type of that unitary resource.

XML structure is a seriously difficult challenge to that set of expectations, creating a series of excellent koans that are best avoided.

I'd encourage you to simply forge ahead with what you're already doing, incorporating XPath into URLs in a way that you find rational and accepting that your interpretation of them is yours. There are lots of broad bright horizons that direction, and it's definitely time to explore them again.

Jeni: Good point. And as you pointed out on your blog, positional predicates (such as chapter[5]) are unreliable in many cases anyway, so the matrix URI could be useful.

Simon: The T.B-L link that Jeni referenced is good, because it shows that this is indeed an issue that has been around. I am not really at the stage of being convinced about the desirability of particular syntaxes, but the provision of a predicate/attribute feature inside http: URLs seems to be logically consequent from the PRESTO assumptions.

In trying to explore these PRESTO-related issues (or, at least, in having to think about them again in order to deliver systems), a thing that is coming to me is how much developers ignore the address bar on browsers, either by having horrible URLs which the user cannot reverse engineer or infer new paths from, or by having such horrible URLs and missing resource access functions that the user has to be provided with address-bar substitutes (notably breadcrumb bars).

Now breadcrumb bars do a little more than the address bar does, in that they allow choices for moving forward and so on. But still, if you had a PRESTO system where every resource had an index.html file which gave the subresources available under it (i.e. allowing object-oriented-ish introspection) then you could indeed move forward as well. I am not saying that every use of a breadcrumb bar on a webpage represents a failure to take advantage of the address bar, but it should be a consideration: the address bar is cross-platform and universally used.

Has Rick or anyone else implemented a conclusive approach to this concept?

Andrew: I am not aware of any full xpath-in-url implementations, but there are several resolvers that allow regex-based or pattern-based conversion. (Pageseeder uses one, for example, for public urls.)

Great post, thanks for the tutorial.

Most URLs can contain an XPath. PageSeeder's XLink architecture allows writers to edit and validate XML data for demanding schemas through a standard Web browser, and is also perfect if used for XPath in the URL.
