Proposal (Was: When are and aren't two URLs the same?)

Sun Apr 23 01:41:18 UTC 2006

Johannes Ernst wrote:
> I wrote up a proposal on this.
> 
> http://yadis.org/wiki/Draft-029
> 
> It (hopefully) includes what we are doing already, plus the various 
> comments that were made on this list.
> Chances are I'm missing something ...
> 
> One example where this becomes important (even without going into 
> Yadis-using protocols like OpenID and various LID profiles) is Yadis 
> document caching.
> E.g. if I have the Yadis file in cache for URL1, and URL2 is entered 
> that is almost the same as URL1 but not exactly, do I have to try to 
> re-fetch the document or can I assume that I can use the cached version
> for URL1?
> 
> The proposed algorithm would make this unambiguous, or so I hope ;-)
> 

The wiki appears to be down right now, so I can't really comment on it
directly. However, from viewing the discussion that followed I'm still
inclined to go with OpenID's approach:

* Apply the canonicalizations from RFC 2616 so that the URL is
actually retrievable.
* Do a character-by-character string comparison for equality.
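
As a rough sketch of those two steps in Python (not OpenID's actual
code): I'm assuming here that the canonicalizations in question are
adding "http://" when no scheme was typed, lowercasing the scheme and
host, and treating an empty path as "/". The names normalize() and
same_identity() are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(user_input: str) -> str:
    """Make a typed-in identifier retrievable (assumed rules, see above)."""
    url = user_input.strip()
    # Assumption: a missing scheme means plain http.
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    # Scheme and host compare case-insensitively, so fold them to lowercase.
    # (This also lowercases any userinfo in the authority; fine for a sketch.)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    # An empty path is equivalent to "/"; any other path is left untouched.
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


def same_identity(a: str, b: str) -> bool:
    """Step two: plain character-by-character comparison."""
    return normalize(a) == normalize(b)
```

Note that same_identity("http://example.com/frank",
"http://example.com/frank/") is False under these rules, which is
deliberate: only the server can say whether those two are the same.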

It is then up to the server hosting the identity to issue redirects that
canonicalize the URL. In many cases this will already happen, as servers
are usually configured to fix up user shortcuts such as leaving off the
trailing slash, or using http://whatever/ rather than
http://whatever.domain.com/ on a private LAN.

My main reasoning for this is that the server knows better than anyone
else what its URL scheme is, so it makes sense for it to be the party
that handles the canonicalization.

This also works in reverse: a relying party is free to reverse this
canonicalization process for display purposes, but it mustn't make any
assumptions beyond what is given in the canonicalization rules. That is,
a relying party can say:
    "Hey, frank.livejournal.com!"
but it mustn't say
    "Hey, myyadis.net/frank" (when you entered myyadis/frank/)
...because the trailing slash may be significant to that server.
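
To make the display rule concrete, here is a sketch that undoes only
what I assumed the canonicalization added (the "http://" prefix and the
"/" for an empty path) and nothing else. display_name() is a
hypothetical helper, not part of any existing library:

```python
from urllib.parse import urlsplit


def display_name(canonical: str) -> str:
    """Shorten a canonical URL for display without changing its meaning."""
    parts = urlsplit(canonical)
    # A path of exactly "/" is the one slash canonicalization may have
    # added, so it can be dropped. Any other path is shown verbatim,
    # including a trailing slash.
    path = "" if parts.path == "/" else parts.path
    shown = parts.netloc + path
    if parts.query:
        shown += "?" + parts.query
    # Only the default "http://" prefix is safe to hide.
    if parts.scheme != "http":
        shown = parts.scheme + "://" + shown
    return shown
```

So "http://frank.livejournal.com/" is shown as "frank.livejournal.com",
while "http://myyadis.net/frank/" stays "myyadis.net/frank/".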

Incidentally, I have in the past used software which makes a real
distinction between the trailing slash case and the no trailing slash case.

   * Some CMS-like system (from yonks ago; can't remember which one):
        * /blah/ means a page called "blah" that is a child of the home
         page
        * /blah means a resource (embedded image, rss feed, or whatever)
         in the home page that is called "blah".
   * Fotobilder (photo gallery software):
        * Produces an HTML page at http://hostname.com/user/pic/23423/
        * Produces the photo itself at
               http://hostname.com/user/pic/23423

In both these cases it is not safe to add (or remove) a trailing slash.
Admittedly the Fotobilder case is stretching it a little since someone
is unlikely to use a photo page as an identity, but it serves to
illustrate the point that there is software out there that makes this
distinction.