URL canonicalization

Thu Sep 15 11:32:02 PDT 2005

On Thu, 15 Sep 2005, Zefiro wrote:

> first, to avoid confusion:
> User types in 'Zefiro.de' in the login box.
> claimed identity: 'Zefiro.de'
> canonical identity / normalized claimed identity: 'http://zefiro.de/'

Normalized, I think, is after adding http:// and / if necessary. 
Canonical is after following redirects, which might change it to 
'http://www.zefiro.de/'.  Maybe use "resolved identity URL" rather than 
"canonical" to indicate the final endpoint URL from which the content is 
retrieved.  "Canonical" implies it is the preferred standard form to use, 
which I don't think it is here.

This terminology should be tightened up in the spec.

> delegate identity: 'http://www.livejournal.com/users/zefirodragon/'

> primary key: a unique string field in the database used to distinguish 
> different users. Depending on the DBMS used an integer would be the real 
> primary key and the OpenID identity just an indexed unique field. Still 
> this is used for login.

I think this stays out of the spec.  There's no reason you couldn't use 
whichever of these strings you consider to identify the individual as the PK.

> We were talking about claimed vs. normalized claimed (may indeed be a 
> better name) identity. The consumer should do nothing with the delegate 
> identity except for using this when asking the OpenID server for 
> authentication. Nothing includes not storing, not using as database 
> key, not assuming same delegates are same identities or different 
> delegates are different identities.

Agree.  I feel the same way about the resolved (after following 
redirects) identity URL.  Whether to treat the claimed or the normalized 
ID as unique is an interesting question, and perhaps should be left up to 
the consumer (though stating a best practice is fine).

I propose something like the following:

A consumer SHOULD treat different normalized (that is, after adding 
http:// and trailing / if needed) claimed identity URLs as distinct if 
they could ever result in different content when fetching.  A consumer 
SHOULD NOT attempt to consolidate or treat two distinct normalized 
identity URLs as the same even if they happen to resolve the same on some 
webservers, redirect to the same place, or use the same delegated identity 
URL.

For example, claimed identities "example.com", "Example.COM", 
"http://example.com", and "http://example.com/" SHOULD be treated as 
identical.  "example.com" and "example.com/index.html" SHOULD be treated 
as distinct, as should "example.com/a+b" and "example.com/a%20b".
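
The normalization above can be sketched in Python roughly as follows.  This 
is only an illustration under my reading of the proposal, not spec text: the 
function name normalize_claimed_identity is hypothetical, "trailing / if 
needed" is taken to mean "only when the path is empty", and only the scheme 
and host are lowercased.  The path is left exactly as typed, so "a+b" and 
"a%20b" stay distinct.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_claimed_identity(claimed: str) -> str:
    """Normalize a claimed identity without fetching anything.

    Hypothetical helper: adds http:// if no scheme was typed, adds a
    trailing / if the path is empty, and lowercases scheme and host.
    It does NOT follow redirects or look at delegates.
    """
    text = claimed.strip()
    if "://" not in text:
        # Assumption: a bare "example.com" means an http URL.
        text = "http://" + text
    parts = urlsplit(text)
    # Only an empty path gets a "/"; "/index.html" is left alone.
    path = parts.path or "/"
    # Simplification: the whole authority (netloc) is lowercased.
    # Path and query are passed through untouched, with no
    # percent-decoding and no "+" handling.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path,
         parts.query, parts.fragment)
    )
```

With this sketch, "example.com", "Example.COM", "http://example.com", and 
"http://example.com/" all compare equal after normalization, while 
"example.com/index.html" does not.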

I'm tempted to suggest that the spec recommend using the claimed 
identity, as typed, as a display name, but I think that's outside the scope, 
and honestly most consumers should keep a display name entered by the user, 
which may be completely separate from authenticated credentials.

>> It seems cleaner to me to use the canonical identity as primary key.

> This depends on how sure you can be that two claimed identities are 
> identical. If you assume normalized claimed identity is what makes the 
> difference, then you'd store this in the database. All other variants of 
> writing it (different claimed identity leading to same normalized id) 
> are then accepted equally, for user convenience.

I'd probably recommend (not in the spec, but if a site asked my opinion) 
that userId be a sequence-generated PK, with normalized (but not 
resolved/canonical) claimed identity as an alternate key, and display name 
as a separate field.  Whether to allow two identical claimed identities to 
have different userid and display name would be a business decision, much 
like whether to allow two users to have the same e-mail address.
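
As a rough sketch of that layout, here is one way it might look using 
Python's sqlite3 module.  The table, column, and function names are all 
made up for illustration, and the UNIQUE constraint on the normalized 
identity reflects just one possible answer to the business decision above; 
a site that wants several accounts per identity would drop it.

```python
import sqlite3


def create_user_table(conn: sqlite3.Connection) -> None:
    """Create a hypothetical users table for an OpenID consumer."""
    conn.execute(
        """
        CREATE TABLE users (
            user_id             INTEGER PRIMARY KEY AUTOINCREMENT,
            -- normalized (not resolved) claimed identity, alternate key
            normalized_identity TEXT NOT NULL UNIQUE,
            -- free-form, stored exactly as the user entered it
            display_name        TEXT NOT NULL
        )
        """
    )


def find_or_create_user(conn: sqlite3.Connection,
                        normalized_identity: str,
                        display_name: str) -> int:
    """Return the user_id for an identity, creating a row if needed.

    The display name is only used when the row is first created; it is
    never derived from, or overwritten by, the identity URL.
    """
    row = conn.execute(
        "SELECT user_id FROM users WHERE normalized_identity = ?",
        (normalized_identity,),
    ).fetchone()
    if row is not None:
        return row[0]
    cur = conn.execute(
        "INSERT INTO users (normalized_identity, display_name) VALUES (?, ?)",
        (normalized_identity, display_name),
    )
    return cur.lastrowid
```

Nothing else in the application needs to know the identity URL; other 
tables would reference user_id, so the alternate key can change form 
later without touching them.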

> Since I really want to be known as my claimed identity, not my 
> normalized claimed id - cause it looks much uglier, with the http etc - 
> I want it to be displayed like I entered it. And then, I personally 
> think it's cleaner to use as primary key exactly what is displayed - 
> i.e. what users using your site would use as their 'primary key in 
> brain' association with me.

I think this is a mistake.  I want my displayed name to be exactly what I 
type, case-preserved, and in many cases including spaces and other 
DNS-illegal characters.  I don't think login name should be used for 
display name in most cases.

I am fine with systems that mangle login names (for instance, storing them 
as all-lowercase), as long as they provide a display name that is not 
altered.  Same goes for OpenID-authenticated systems.  If a consumer 
doesn't want to keep a display name separate from a login name, then it 
behooves them to store the login name exactly as entered.  But 
fundamentally, it's up to the consumer what they want to do and what their 
customers demand.
--
Mark Rafn    dagon at dagon.net    <http://www.dagon.net/>