URL canonicalization
Mark Rafn
dagon at dagon.net
Thu Sep 15 11:32:02 PDT 2005
On Thu, 15 Sep 2005, Zefiro wrote:
> first, to avoid confusion:
> User types in 'Zefiro.de' in the login box.
> claimed identity: 'Zefiro.de'
> canonical identity / normalized claimed identity: 'http://zefiro.de/'
Normalized, I think, is after adding http:// and / if necessary.
Canonical is after following redirects, which might change it to
'http://www.zefiro.de/'. Maybe use "resolved identity URL" rather than
"canonical" to indicate the final endpoint URL from which the content is
retrieved. "Canonical" implies it is the preferred standard form to use,
which I don't think it is here.
This terminology should be tightened up in the spec.
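To make the distinction concrete, here is a rough sketch of what I mean by "resolved". It takes an already-normalized URL and follows redirects to the final endpoint. The REDIRECTS table is purely hypothetical and stands in for real HTTP fetches, and the function name and hop limit are my own inventions, not anything from the spec.

```python
# Hypothetical redirect table standing in for real HTTP responses.
# A real consumer would issue requests and read Location headers instead.
REDIRECTS = {
    "http://zefiro.de/": "http://www.zefiro.de/",
}


def resolve_identity_url(normalized_url, redirects=REDIRECTS, max_hops=5):
    """Follow redirects from a normalized URL to the final endpoint URL."""
    url = normalized_url
    for _ in range(max_hops):
        target = redirects.get(url)
        if target is None:
            # No further redirect: this is the resolved identity URL.
            return url
        url = target
    raise RuntimeError("too many redirects")
```

With this table, the normalized 'http://zefiro.de/' resolves to 'http://www.zefiro.de/', while a URL with no redirect resolves to itself.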
> delegate identity: 'http://www.livejournal.com/users/zefirodragon/'
> primary key: a unique string field in the database used to distinguish
> different users. Depending on the DBMS used an integer would be the real
> primary key and the OpenID identity just an indexed unique field. Still
> this is used for login.
I think this stays out of the spec. There's no reason you couldn't use
whichever of these strings you consider to be the individual as the PK.
> We were talking about claimed vs. normalized claimed (may indeed be a
> better name) identity. The consumer should do nothing with the delegate
> identity except for using this when asking the OpenID server for
> authentication. "Nothing" includes not storing, not using as database
> key, not assuming same delegates are same identities or different
> delegates are different identities.
Agree. I feel the same way about the resolved (after following
redirects) identity URL. Whether to treat the claimed or the normalized
ID as unique is an interesting question, and perhaps should be left up to
the consumer (though stating a best practice is fine).
I propose something like the following:
A consumer SHOULD treat different normalized (that is, after adding
http:// and trailing / if needed) claimed identity URLs as distinct if
they could ever result in different content when fetching. A consumer
SHOULD NOT attempt to consolidate or treat two distinct normalized
identity URLs as the same even if they happen to resolve the same on some
webservers, redirect to the same place, or use the same delegated identity
URL.
For example, claimed identities "example.com", "Example.COM",
"http://example.com", and "http://example.com/" SHOULD be treated as
identical. "example.com" and "example.com/index.html" SHOULD be treated
as distinct, as should "example.com/a+b" and "example.com/a%20b".
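A minimal sketch of that normalization, assuming it means only: add http:// when no scheme is given, lowercase the scheme and host, and use "/" when the path is empty. The function name is hypothetical. The path is deliberately left untouched (no percent-decoding, no index.html guessing), so the distinct cases above stay distinct. Note that lowercasing the whole network location would also lowercase any userinfo, which a real implementation might want to handle separately.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_claimed_identity(claimed):
    """Normalize a claimed identity as typed by the user (sketch only)."""
    claimed = claimed.strip()
    if "://" not in claimed:
        # Assumption: a bare host or host/path means plain http.
        claimed = "http://" + claimed
    parts = urlsplit(claimed)
    host = parts.netloc.lower()   # host names are case-insensitive
    path = parts.path or "/"      # add the trailing / only if the path is empty
    return urlunsplit((parts.scheme.lower(), host, path,
                       parts.query, parts.fragment))
```

Under these assumptions the four "identical" spellings all come out as 'http://example.com/', and 'example.com/a+b' and 'example.com/a%20b' remain different strings.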
I'm tempted to suggest that the spec recommend using the claimed
identity as typed as a display name, but I think that's outside the scope,
and honestly most consumers should keep a display name entered by the user
which may be completely separate from authenticated credentials.
>> It seems cleaner to me to use the canonical identity as primary key.
> This depends on how sure you can be that two claimed identities are
> identical. If you assume normalized claimed identity is what makes the
> difference, then you'd store this in the database. All other variants of
> writing it (different claimed identity leading to same normalized id)
> are then accepted equally, for user convenience.
I'd probably recommend (not in the spec, but if a site asked my opinion)
that userId be a sequence-generated PK, with normalized (but not
resolved/canonical) claimed identity as an alternate key, and display name
as a separate field. Whether to allow two identical claimed identities to
have different userid and display name would be a business decision, much
like whether to allow two users to have the same e-mail address.
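Something like the following is what I have in mind, sketched with SQLite. The table, column, and function names are all hypothetical. The UNIQUE constraint reflects one possible answer to the business decision above; a site that wants to allow duplicates would use a plain index instead.

```python
import sqlite3


def open_user_db(path=":memory:"):
    """Create the sketch schema: generated PK, identity as alternate key."""
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS users (
            user_id             INTEGER PRIMARY KEY AUTOINCREMENT,
            normalized_identity TEXT NOT NULL UNIQUE,  -- alternate key
            display_name        TEXT NOT NULL          -- stored as entered
        )
    """)
    return conn


def find_or_create_user(conn, normalized_identity, display_name):
    """Look a user up by normalized claimed identity, creating one if absent."""
    row = conn.execute(
        "SELECT user_id FROM users WHERE normalized_identity = ?",
        (normalized_identity,),
    ).fetchone()
    if row is not None:
        return row[0]
    cur = conn.execute(
        "INSERT INTO users (normalized_identity, display_name) VALUES (?, ?)",
        (normalized_identity, display_name),
    )
    conn.commit()
    return cur.lastrowid
```

Neither the resolved URL nor the delegate identity appears in the table, so 'http://zefiro.de/' and 'http://www.zefiro.de/' get separate rows even if one redirects to the other.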
> Since I really want to be known as my claimed identity, not my
> normalized claimed id - cause it looks much uglier, with the http etc -
> I want it to be displayed like I entered it. And then, I personally
> think it's cleaner to use as primary key exactly what is displayed -
> i.e. what users using your site would use as their 'primary key in
> brain' association with me.
I think this is a mistake. I want my displayed name to be exactly what I
type, case-preserved, and in many cases including spaces and other
DNS-illegal characters. I don't think login name should be used for
display name in most cases.
I am fine with systems that mangle login name (for instance, storing them
as all-lowercase), as long as they provide a display name that is not
altered. Same goes for OpenID-authenticated systems. If a consumer
doesn't want to keep a display name separate from a login name, then it
behooves them to store the login name exactly as entered. But
fundamentally, it's up to the consumer what they want to do and what their
customers demand.
--
Mark Rafn dagon at dagon.net <http://www.dagon.net/>