URL canonicalization

Dan Libby danda at videntity.org
Thu Sep 15 14:20:26 PDT 2005

Nice writeup Mark, thanks.  You make a lot of sense and I'm in full
agreement.

Dan Libby

Mark Rafn wrote:

> On Thu, 15 Sep 2005, Zefiro wrote:
>> first, to avoid confusion:
>> User types in 'Zefiro.de' in the login box.
>> claimed identity: 'Zefiro.de'
>> canonical identity / normalized claimed identity: 'http://zefiro.de/'
> Normalized, I think, is after adding http:// and / if necessary.
> Canonical is after following redirects, which might change it to
> 'http://www.zefiro.de/'.  Maybe use "resolved identity URL" rather
> than "canonical" to indicate the final endpoint URL from which the
> content is retrieved.  "Canonical" implies it is the preferred
> standard form to use, which I don't think it is here.
> This terminology should be tightened up in the spec.
>> delegate identity: 'http://www.livejournal.com/users/zefirodragon/'
>> primary key: a unique string field in the database used to
>> distinguish different users. Depending on the DBMS used an integer
>> would be the real primary key and the OpenID identity just an indexed
>> unique field. Still this is used for login.
> I think this stays out of the spec.  There's no reason you couldn't
> use whichever of these strings you consider to be the individual as
> the PK.
>> We were talking about claimed vs. normalized claimed (may indeed be a
>> better name) identity. The consumer should do nothing with the
>> delegate identity except for using this when asking the OpenID server
>> for authentication. Nothing includes not storing, not using as
>> database key, not assuming same delegates are same identities or
>> different delegates are different identities.
> Agree.  I feel the same way about the resolved (after following
> redirects) identity URL.  Whether to treat the claimed or the
> normalized ID as unique is an interesting question, and perhaps should
> be left up to the consumer (though stating a best practice is fine).
> I propose something like the following:
> A consumer SHOULD treat different normalized (that is, after adding
> http:// and trailing / if needed) claimed identity URLs as distinct if
> they could ever result in different content when fetching.  A consumer
> SHOULD NOT attempt to consolidate or treat two distinct normalized
> identity URLs as the same even if they happen to resolve the same on
> some webservers, redirect to the same place, or use the same delegated
> identity URL.
> For example, claimed identities "example.com", "Example.COM",
> "http://example.com", and "http://example.com/" SHOULD be treated as
> identical.  "example.com" and "example.com/index.html" SHOULD be
> treated as distinct, as should "example.com/a+b" and "example.com/a%20b".
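
To illustrate the normalization rule proposed above, here is a minimal
sketch in Python. The function name normalize_claimed_identity is
hypothetical, and two points are assumptions rather than anything the
proposal spells out: the host is lowercased (inferred from
"Example.COM" being treated as identical to "example.com"), and the
trailing "/" is added only when the path is empty. Redirects and
delegation are deliberately not followed.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_claimed_identity(claimed: str) -> str:
    """Normalize a claimed identity without fetching or resolving it.

    Assumptions (not from the spec): default scheme is http, the host
    part is lowercased, and "/" is added only when the path is empty.
    """
    claimed = claimed.strip()
    # Add the scheme if the user typed a bare host such as "example.com".
    if "://" not in claimed:
        claimed = "http://" + claimed
    parts = urlsplit(claimed)
    # An empty path becomes "/"; any other path is left exactly as typed,
    # so "/a+b" and "/a%20b" stay distinct.
    path = parts.path or "/"
    # Note: netloc.lower() also lowercases any userinfo or port text;
    # that is acceptable for this sketch.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path,
         parts.query, parts.fragment)
    )
```

With this sketch, the four spellings of example.com in the example all
map to "http://example.com/", while "example.com/index.html" keeps its
own path and so remains a separate identity.
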
> I'm tempted to suggest that the spec recommend using the claimed
> identity as typed as a display name, but I think that's outside the
> scope, and honestly most consumers should keep a display name entered
> by the user which may be completely separate from authenticated
> credentials.
>>> It seems cleaner to me to use the canonical identity as primary key.
>> This depends on how sure you can be that two claimed identities are
>> identical. If you assume normalized claimed identity is what makes
>> the difference, then you'd store this in the database. All other
>> variants of writing it (different claimed identity leading to same
>> normalized id) are then accepted equally, for user convenience.
> I'd probably recommend (not in the spec, but if a site asked my
> opinion) that userId be a sequence-generated PK, with normalized (but
> not resolved/canonical) claimed identity as an alternate key, and
> display name as a separate field.  Whether to allow two identical
> claimed identities to have different userid and display name would be a
> business decision, much like whether to allow two users to have the
> same e-mail address.
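
The storage layout suggested above could be sketched as follows, using
Python's sqlite3 module. The table name, column names, and the helper
functions open_user_store and find_or_create_user are all hypothetical.
The UNIQUE constraint on the normalized claimed identity is one possible
answer to the business decision mentioned above, not a requirement.

```python
import sqlite3


def open_user_store() -> sqlite3.Connection:
    """Create an in-memory store with the three suggested fields."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE users (
            user_id          INTEGER PRIMARY KEY AUTOINCREMENT, -- generated PK
            claimed_identity TEXT NOT NULL UNIQUE,  -- normalized, not resolved
            display_name     TEXT NOT NULL          -- stored exactly as entered
        )
        """
    )
    return conn


def find_or_create_user(conn: sqlite3.Connection,
                        normalized_identity: str,
                        display_name: str) -> int:
    """Return the user_id for a normalized identity, inserting if new."""
    row = conn.execute(
        "SELECT user_id FROM users WHERE claimed_identity = ?",
        (normalized_identity,),
    ).fetchone()
    if row is not None:
        return row[0]
    cur = conn.execute(
        "INSERT INTO users (claimed_identity, display_name) VALUES (?, ?)",
        (normalized_identity, display_name),
    )
    conn.commit()
    return cur.lastrowid
```

Keeping the display name in its own column means the login identity can
be normalized freely without changing what other users see.
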
>> Since I really want to be known as my claimed identity, not my
>> normalized claimed id - cause it looks much uglier, with the http etc
>> - I want it to be displayed like I entered it. And then, I personally
>> think it's cleaner to use as primary key exactly what is displayed -
>> i.e. what users using your site would use as their 'primary key in
>> brain' association with me.
> I think this is a mistake.  I want my displayed name to be exactly
> what I type, case-preserved, and in many cases including spaces and
> other DNS-illegal characters.  I don't think login name should be used
> for display name in most cases.
> I am fine with systems that mangle login name (for instance, storing
> them as all-lowercase), as long as they provide a display name that is
> not altered.  Same goes for openID-authenticated systems.  If a
> consumer doesn't want to keep a display name separate from a login
> name, then it behooves them to store the login name exactly as
> entered.  But fundamentally, it's up to the consumer what they want to
> do and what their customers demand.
> -- 
> Mark Rafn    dagon at dagon.net    <http://www.dagon.net/>

More information about the yadis mailing list