Proposal (Was: When are and aren't two URLs the same?)

Fri Apr 21 21:00:11 UTC 2006

# You are right, there is no strong technical reason why it should be
# so, so you can talk me out of it ;-) but I'd prefer it the way as
# proposed.

I'm uncomfortable applying a long list of transforms to identity
URLs.  If you concede that there's no technical reason *to* do
something, you probably shouldn't.  I'd feel much better about some of
these if you could outline the specific problems they solve (i.e. why
a given transform MUST be applied).  Doing this as a matter of
preference and style is not a good idea, and I don't buy arguments
that involve "most people" or "most servers".  I don't want us to open
up a can of worms with identity canonicalization, but I'd be very glad
to hear solid, itemized reasoning about why this stuff is necessary.
(Some of the transforms feel wrong, and I don't know enough to comment
on some of the others.)

# 4. if the host component refers to a relative name, replace the host
# component with a fully-qualified DNS name. For example, the URL
# http://charlie/foo, used within company example.com's intranet,
# would be converted to http://charlie.example.com/foo.

I think you're really asking for trouble here.  You seem to be
assuming that charlie.example.com has the same meaning regardless of
where it's used.  Correct me if I'm wrong, but this will probably
involve a DNS operation, too.
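
To make the context-dependence concrete, here is a rough sketch in
Python.  The qualify_host function and its search_domain parameter are
hypothetical; a real implementation would presumably get the domain
from local resolver configuration, which is exactly the part that
varies from network to network.

```python
from urllib.parse import urlsplit, urlunsplit

def qualify_host(url, search_domain):
    """Hypothetical helper: append search_domain to a dotless host.

    Userinfo is dropped for brevity; this is a sketch, not a full
    implementation of the proposed rule.
    """
    parts = urlsplit(url)
    host = parts.hostname
    if host is None:
        return url
    if "." not in host:
        host = f"{host}.{search_domain}"
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path,
                       parts.query, parts.fragment))

# The same input yields a different "canonical" identity depending on
# the network it is evaluated in.
inside_example_com = qualify_host("http://charlie/foo", "example.com")
inside_example_org = qualify_host("http://charlie/foo", "example.org")
print(inside_example_com)
print(inside_example_org)
```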

# 6. all components of the path must be unescaped to the maximum
# extent possible. For example, if a URL contained %41 as a character,
# this character needs to be replaced by its unescaped version A.

This should be done anyway (but only once, of course).
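
The "only once" part matters.  A quick sketch using Python's
urllib.parse.unquote (the example path is made up for illustration):

```python
from urllib.parse import unquote

# "%25" is an escaped "%", so this path contains the literal text "%41".
path = "/a%2541"

once = unquote(path)    # one pass: the literal "%41" survives
twice = unquote(once)   # a second pass turns it into "A"

print(once)
print(twice)
```

Unescaping a second time changes which resource the URL names, so a
transform that gets applied repeatedly is not safe.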

# 7. any spaces in the path are replaced by +.

This means that "http://foo.com/a%20b" will be url-unescaped and be
transformed to "http://foo.com/a+b", right?  Doesn't that seem a
little aggressive?
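
Here is a small sketch of what I mean, assuming rules 6 and 7 are
applied in that order to the path.  The function name is made up, and
this is just my reading of the proposal, not a reference
implementation.

```python
from urllib.parse import unquote

def apply_rules_6_and_7(path):
    """Hypothetical reading of rules 6 and 7: unescape, then space -> '+'."""
    unescaped = unquote(path)            # rule 6: "%20" becomes " "
    return unescaped.replace(" ", "+")   # rule 7: " " becomes "+"

with_space = "/a%20b"   # path containing an escaped space
with_plus = "/a+b"      # path containing a literal plus sign

print(apply_rules_6_and_7(with_space))
print(apply_rules_6_and_7(with_plus))
```

Both inputs come out as "/a+b", even though a "+" in a path is a
literal plus sign rather than a space (the "+"-for-space convention
belongs to form-encoded query strings).  Two URLs that a server may
treat as different end up with the same identity.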

-- 
  Jonathan Daugherty
  JanRain, Inc.