Well, let's make it easier. Is it possible to have an IRI with an http scheme whose path components use XPath-syntax predicates and XPath-syntax shortcuts? The scenario is, say, a web server with a special resolver that extracts and returns fragments.
Let's look at
The things that most people call URLs are what you can type into your browser's address bar, or put in HTML
a/@href, in XML entity declarations, or in web server URL-rewriting files. But these are in fact a much looser form than standard URLs, which use ASCII syntax and reserve certain characters. Browsers and parsers perform various encodings and decodings to go from the human-friendly form to the strict standard form.
What the RFCs say
The need to give the looser form a proper name has partially been met by RFC 3987, Internationalized Resource Identifiers (IRIs), which establishes the basic structure
IRI = scheme ":" ihier-part [ "?" iquery ]
[ "#" ifragment ]
and sets rules for
After an IRI has been converted to a URI, the rules of RFC 3986
Uniform Resource Identifier (URI): Generic Syntax apply, which (like IRIs) reserve certain characters as generic delimiters:
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
In both IRIs and URLs, the percent-encoded form of a gen-delims character is regarded as distinct from its direct use. However, which gen-delims (and sub-delims?) are available for use in a particular URL scheme depends on that scheme, so this takes us to RFC 1738, Uniform Resource Locators (URL).
RFC 1738 does not define any special behaviour for
], however its
httpurl = "http://" hostport [ "/" hpath [ "?" search ]]
hpath = hsegment *[ "/" hsegment ]
hsegment = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
search = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
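To make the hsegment production concrete, here is a minimal Python sketch of a checker for the strict form. It assumes the RFC 1738 definitions of uchar (unreserved characters or a %HH escape); the function name is my own invention, not anything from the RFC.

```python
import re

# Characters RFC 1738 permits directly in an hsegment:
# uchar (an unreserved character or a %HH escape) plus ";", ":", "@", "&", "=".
# Unreserved in RFC 1738 = letters, digits, safe ($-_.+) and extra (!*'(),).
_UNRESERVED = r"A-Za-z0-9$\-_.+!*'(),"
HSEGMENT = re.compile(rf"(?:[{_UNRESERVED};:@&=]|%[0-9A-Fa-f]{{2}})*")

def is_strict_hsegment(segment: str) -> bool:
    """Return True if the whole segment matches the hsegment production."""
    return HSEGMENT.fullmatch(segment) is not None
```

With this check, a segment containing a direct square bracket fails, while the same segment with the brackets percent encoded passes.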
I think there is a persistent problem with the drafting of the RFCs: they don't make an explicit distinction between the repertoire of characters used for encoding and the repertoire of characters that can be represented. It is easy to fall into the trap of thinking that the syntax productions in the RFCs relate to the latter rather than to the former.
So, as far as I can tell, the story on using
] in the
hsegment of a URL is this:
- RFC 1738 does not allow them as direct characters. Therefore in the strict URL form they must be percent encoded.
- If they are present in an HTTP URL or IRI (by mistake) as direct characters, conforming software along the way (such as browsers or other rewriters) will not percent encode them for you, since they are reserved.
- However, there is nothing in the RFCs from
]being present in URLs and IRIs in their percent encoded form
There are a few other funnies at work for XPath syntax: for example, XPaths allow
http: URL/IRI interpretation would notionally parse as two segments,
x[.. followed by
z] (ignoring the square bracket encoding needed for clarity.)
So it seems that the answer is yes: a URL can contain an XPath, provided it has been delimited in certain ways (even if it is an IRI):
- The left and right square bracket must be percent encoded
- The solidus "/" inside predicates (brackets) must be percent encoded, but otherwise should not be (since it is fulfilling its segmentation function)
- The "#" and "?" characters must be percent encoded because of their special
- The other gen-delims and sub-delims should be percent encoded too, just for compatibility, except for the ones from the reserved list which are allowed by RFC 1738: ";" | ":" | "@" | "&" | "="
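The rules in this list can be sketched in Python roughly as follows. This is only an illustration under my own assumptions: the function name is hypothetical, predicates are detected by simple bracket counting (so a bracket inside a quoted string would confuse it), and anything outside the RFC 3986 unreserved set other than the five allowed characters gets percent encoded.

```python
from urllib.parse import quote

# Reserved characters RFC 1738 allows directly in an hsegment. quote() always
# leaves letters, digits and "_.-~" alone, so only these need listing.
KEEP = ";:@&="

def encode_xpath_path(xpath: str) -> str:
    """Percent encode an XPath-like path for use as a strict URL path."""
    out = []
    depth = 0  # nesting depth of [...] predicates
    for ch in xpath:
        if ch == "[":
            depth += 1
        if ch == "/" and depth == 0:
            out.append("/")  # solidus outside predicates still segments the path
        else:
            out.append(quote(ch, safe=KEEP))  # encodes "[", "]", "#", "?", "/" etc.
        if ch == "]" and depth > 0:
            depth -= 1
    return "".join(out)
```

For an illustrative input such as a/b[c/d='x']/e, only the two solidi outside the predicate survive in direct form.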
However, a resolver that takes the URLs and does pattern matching on them should, following Postel's law, accept in their direct form the delimiters that are reserved by the generic syntax but not actually used in the
http: scheme, notably "*", while those that are allowed directly, such as "@", "(", "=", ")", should also be accepted in their percent-encoded form. [This para and previous para updated]
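A lenient resolver of this kind might start with something like the Python sketch below. The function name is hypothetical, and I am assuming the resolver sees the raw, still-encoded path. Because each segment is decoded after splitting, the direct and percent-encoded forms of a character end up identical.

```python
from urllib.parse import unquote

def split_steps(raw_path: str) -> list:
    """Split a raw (still-encoded) URL path into decoded XPath-like steps."""
    # Split first, decode second: an encoded solidus (%2F) inside a predicate
    # must not be mistaken for a segment boundary.
    return [unquote(seg) for seg in raw_path.split("/") if seg]
```

The order matters: decoding the whole path before splitting would turn the encoded solidus inside a predicate back into a segment separator.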
(Next stage: actually implement this and test it!)
Why would you want to do this?
Now, you can already combine XPaths and URLs: that is what XPointer specifies. So why would we want this XPath-ish syntax?
The motivating use case comes out of the PRESTO approach I have been working on:
"All documents, views and metadata at all significant levels of granularity and composition should be available in the best formats practical from their own permanent hierarchical URIs."
The trouble with using
? (or even
#) is that the resources identified become, in effect, terminal leaves. You cannot merely add more paths underneath, as far as I can see, to drill down further.
Now one of the PRESTO ideas is that the URL should reflect a use-case from the user's conceptual POV, not merely reflect the system-specific details. Isn't this kind of URL a hard-coding of a specific system — the grammar of the XML document? Well, yes but not necessarily... It could be an XPath into a transformed version (or view) of the document that aligned with the use-case better than the original.
But the ability to use XPath predicate syntax in paths of URLs addresses a very particular problem even if there is no underlying XML document. Let's say we have some web services to deliver accounts (and we want to use the PRESTO approach). So we can get the general ledger from
which is the same as
And we can drill down further to
But what if the calculation of the general-ledger required use of a parameter, for example, because we wanted to see the black-market version of the books, not the version prepared for the tax man?
If we don't have a predicate kind of mechanism, we get a kind of combinatorial explosion in the possible URLs, so that for every value of every parameter we have to have a separate branch. While we might have allowed the URL
we cannot drill down further
and instead would have to have something like
This is within the capabilities of a resolver, but falls apart when you start to have multiple parameters or parameters with non-unique values. So you start to need something like
I think the XPath predicate syntax is much clearer:
Indeed, you might go as far as
which is, in proper URL syntax, the following ugliness:
where the server could convert the URL into a request in its (non-PRESTO) system URLs such as
then drill down inside the returned XML account to find the revenue fragment using some XPath such as
So having a predicate syntax available in URLs allows both systematic retrieval of XML fragments (a la the Oracle xmlfs XML filesystem) and a systematic approach to handling parameterized resource selection (queries) without sacrificing drill-down (which is a key requirement of PRESTO).
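As a rough sketch of how a resolver might pull parameters out of such predicates while keeping the drill-down steps, consider the Python below. Everything in it is an assumption for illustration: the path, the step names (accounts, general-ledger, revenue), the version parameter and the restricted [@param='value'] predicate shape are all hypothetical, and predicates of any other shape are simply ignored.

```python
import re
from urllib.parse import unquote

# Hypothetical step grammar: a name followed by zero or more [...] predicates.
# Only [@param='value'] predicates are understood; real XPath is far richer.
_STEP = re.compile(r"(?P<name>[^\[\]]+)(?P<preds>(?:\[[^\]]*\])*)")
_PRED = re.compile(r"\[@(?P<key>[\w.-]+)='(?P<value>[^']*)'\]")

def parse_path(raw_path):
    """Return a list of (name, params) pairs, one per path step."""
    steps = []
    for seg in raw_path.split("/"):
        if not seg:
            continue  # skip the empty piece before a leading solidus
        m = _STEP.fullmatch(unquote(seg))
        if m is None:
            raise ValueError(f"unsupported step: {seg!r}")
        params = {p.group("key"): p.group("value")
                  for p in _PRED.finditer(m.group("preds"))}
        steps.append((m.group("name"), params))
    return steps

# Hypothetical request path, in strict (percent-encoded) form.
steps = parse_path(
    "/accounts/general-ledger%5B@version=%27black-market%27%5D/revenue"
)
```

Each step keeps its own parameters, so the resolver can hand the parameterized step to the underlying system and then continue drilling down through the remaining steps.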
It does uglify the real URL, but that is only a consideration if the user sees it rather than the presentation form supplied by a browser (or some intermediate version).
P.S.: PRESTO and Apache Sling
I was interested to see that Roy Fielding's Sling project is up at the Apache incubator site. It has a lot of connection to the PRESTO approach, in particular the push to noun-ify URLs. I think where a system that adopted the full PRESTO approach would go further than Sling is that the Sling resources don't seem to allow further drill-down inside them: you have to go to some query mechanism, or something outside Sling.
Nevertheless, Sling's separation of data and functions (and the nounification of functions and the construction of synthetic URLs with them) is very PRESTO-friendly.