URL canonicalization
Mark Rafn
dagon at dagon.net
Wed Sep 14 16:18:09 PDT 2005
On Wed, 14 Sep 2005, Dan Libby wrote:
> Hi, in my database, I need to uniquely keep track of visitors that are
> logging in via remote OpenID servers. The best key available is their
> identity url. But that leaves me with a question about how exactly to
> canonicalize it, that the spec does not clearly address.
> "Note that the user can leave off "http://" and the trailing "/". A
> consumer must canonicalize the URL, following redirects and noting the
> final URL. The final, canonicalized URL is the user's identity URL."
The spec is unclear on what you should consider to be an identity. By
design, I suspect. There can be multiple claimed identities that all map
to a single canonical identity URL, and multiple canonical identity URLs
that have the same delegate identity URL. Only the delegate identity URL
is actually confirmed by the server.
So, do you choose to treat dagon.net and dagon.net/index.html (which
canonicalize to different URLs, but have the same content and delegate to
the same URL) as the same or different? They're the same delegate URL
now, but could be different later, or one or both could stop delegating
and authenticate as different canonical URLs that happen to have the same
content.
I'd argue that you should treat different claimed identity strings as
unique individuals, even if they currently have the same delegate, or
canonicalize to the same identity URL.
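
As a rough illustration of what that means for your database, here is a
minimal sketch that keys visitor records on the claimed identity string
and stores the canonical and delegate URLs as plain attributes. The
Visitor class, the record_login function, the in-memory dict standing in
for a table, and the example.invalid delegate URL are all hypothetical;
nothing here comes from the spec.

```python
from dataclasses import dataclass


@dataclass
class Visitor:
    claimed: str    # the string the user typed; used as the unique key
    canonical: str  # URL reached after canonicalization and redirects
    delegate: str   # URL the server actually confirmed


# Hypothetical stand-in for a database table keyed on the claimed identity.
visitors: dict[str, Visitor] = {}


def record_login(claimed: str, canonical: str, delegate: str) -> Visitor:
    """Look up or create a visitor by claimed identity string.

    The canonical and delegate URLs are refreshed on every login, since
    either may change over time without the claimed string changing.
    """
    visitor = visitors.get(claimed)
    if visitor is None:
        visitor = Visitor(claimed, canonical, delegate)
        visitors[claimed] = visitor
    else:
        visitor.canonical = canonical
        visitor.delegate = delegate
    return visitor
```

With this layout, dagon.net and dagon.net/index.html stay two separate
visitors even while they share a delegate, so nothing has to be split
apart if they stop sharing it later.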
> Okay, so case-insensitivity is fairly obvious. I'm already lower-casing
> everything.
You WILL fail if you do that. You can pick whatever case you like for a
domain name, but the path component of a URL is case-sensitive.
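
If you still want to fold case, a sketch along these lines lowercases
only the scheme and host and leaves the path, query and fragment alone.
It uses Python's standard urllib.parse; the function name
lowercase_host_only is made up for this example, and it assumes the
input already has a scheme.

```python
from urllib.parse import urlsplit, urlunsplit


def lowercase_host_only(url: str) -> str:
    """Lowercase the scheme and host of a URL, preserving everything else."""
    parts = urlsplit(url)
    # Keep any "user:password@" prefix untouched; it may be case-sensitive.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()
    return urlunsplit(
        (parts.scheme.lower(), netloc, parts.path, parts.query, parts.fragment)
    )
```

So HTTP://Sally.People.COM/Blog/Me and http://sally.people.com/Blog/Me
come out identical, while /Blog/Me and /blog/me remain distinct.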
> But what about http vs https? For example, should
> "https://sally.people.com/" be treated as a separate identity from
> "http://sally.people.com/"? Or should the protocol be ignored?
They can someday have different content, even if they don't today, or
even if one redirects to the other today. You'd best treat them as
different.
> I suppose the issue can be broadened to: the spec is a bit vague about
> canonicalization of identity URLs. Can we get clarification?
The spec is clear about canonicalization of URLs - you must add
http:// and trailing / if needed, and you must follow redirects.
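
Those steps might be sketched as follows, using only Python's standard
library. The function names are mine, not the spec's, and I'm assuming
"trailing / if needed" means adding "/" only when the path is empty, so
that dagon.net/index.html is left as it is. The redirect step needs
network access and does none of the error handling a real consumer
would want.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def add_scheme_and_slash(claimed: str) -> str:
    """Add "http://" if no scheme was given and "/" if the path is empty."""
    url = claimed.strip()
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


def canonicalize(claimed: str) -> str:
    """Follow redirects and return the final URL (performs a network request)."""
    with urlopen(add_scheme_and_slash(claimed)) as response:
        # geturl() reports the URL actually retrieved, after any redirects.
        return response.geturl()
```

Note that an explicit https:// is kept as typed, so the http and https
forms of the same host stay distinct.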
It is NOT clear whether the claimed identity, the canonical identity URL,
or the delegate identity URL should be considered by consumers to be the
unique individual. I'd argue for claimed identity, but others may
disagree.
--
Mark Rafn dagon at dagon.net <http://www.dagon.net/>