Proposal (Was: When are and aren't two URLs the same?)
drummond.reed at cordance.net
Tue Apr 25 06:01:21 UTC 2006
I have just been observing this thread because the issues of URL/URI
canonicalization can be so complex. But I must second Thomas's advice here:
the normalization/canonicalization rules in section 6.2.2 (and also 6.2.3)
of RFC 3986 are highly recommended. In fact I believe they address every one
of the normalization issues raised on this thread; that's why the XRI TC
adopted them wholesale in the XRI syntax spec.
In particular, they address Johanne's original issue. To quote section
The syntax and semantics of URIs vary from scheme to scheme, as
described by the defining specification for each scheme.
Implementations may use scheme-specific rules, at further processing
cost, to reduce the probability of false negatives. For example,
because the "http" scheme makes use of an authority component, has a
default port of "80", and defines an empty path to be equivalent to
"/", the following four URIs are equivalent:
In general, a URI that uses the generic syntax for authority with an
empty path should be normalized to a path of "/". Likewise, an
explicit ":port", for which the port is empty or the default for the
scheme, is equivalent to one where the port and its ":" delimiter are
elided and thus should be removed by scheme-based normalization. For
example, the second URI above is the normal form for the "http"
So the Yadis spec could address this whole area simply by requiring the
normalization/canonicalization rules specified in sections 6.2.2 and 6.2.3
of RFC 3986, i.e., turning the "SHOULDS" into "MUSTS".
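
For illustration, the scheme-based rules quoted above could be sketched in
Python roughly as follows. This is only a sketch under stated assumptions: the
names normalize_http_url and DEFAULT_PORTS are hypothetical, it relies on the
standard-library urllib.parse module, and it ignores userinfo and IPv6
literals. It lowercases the scheme and host, drops an empty or default port,
and turns an empty path into "/".

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical table of default ports; only the schemes relevant here.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_http_url(url: str) -> str:
    """Apply a subset of RFC 3986 section 6.2.3 scheme-based normalization."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    # .hostname is already lowercased; userinfo is dropped (sketch limitation).
    host = (parts.hostname or "").lower()
    # .port is None when the port is absent or empty ("example.com:").
    port = parts.port
    netloc = host
    if port is not None and port != DEFAULT_PORTS.get(scheme):
        # Keep only ports that differ from the scheme default.
        netloc = f"{host}:{port}"
    # An empty path is equivalent to "/" when an authority is present.
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


if __name__ == "__main__":
    for candidate in (
        "http://example.com",
        "http://example.com/",
        "http://example.com:/",
        "http://example.com:80/",
    ):
        print(candidate, "->", normalize_http_url(candidate))
```

The four URIs that RFC 3986 section 6.2.3 lists as equivalent
(http://example.com, http://example.com/, http://example.com:/ and
http://example.com:80/) all come out as http://example.com/ under this sketch,
so a plain string comparison of the normalized forms would treat them as the
same identifier.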
From: yadis-bounces at lists.danga.com [mailto:yadis-bounces at lists.danga.com]
On Behalf Of Thomas Broyer
Sent: Monday, April 24, 2006 12:15 PM
To: yadis at lists.danga.com
Subject: Re: Proposal (Was: When are and aren't two URLs the same?)
2006/4/24, Carl Howells <chowells at janrain.com>:
> > How about the case when there are . or .. components in the absolute
> > path part of a URL?
> > Including when they are trailing, without a slash.
> > E.g. Should http://example.org/users/joe and
> > http://example.org/users/foo/../joe be the same?
> > What about http://alice.example.com/ vs http://alice.example.com/.
> Nothing should be done in that case. Those pairs of URLs aren't the
> same, even if some servers will provide the same content for them. Just
> because the path section of a URL looks like a Unix path doesn't mean
> that it is or should be treated as a Unix path.
Look at the example in section 6.2.2 of RFC3986.
Actually, it would be a bug to assume those are different URIs, based
on section 6.2.2.3 of that very same RFC, reproduced here for
6.2.2.3. Path Segment Normalization
The complete path segments "." and ".." are intended only for use
within relative references (Section 4.1) and are removed as part of
the reference resolution process (Section 5.2). However, some
deployed implementations incorrectly assume that reference resolution
is not necessary when the reference is already a URI and thus fail to
remove dot-segments when they occur in non-relative paths. URI
normalizers should remove dot-segments by applying the
remove_dot_segments algorithm to the path, as described in
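
As a rough illustration of the dot-segment removal referred to above, here is
a Python sketch that follows the step-by-step loop given in RFC 3986 section
5.2.4. The function name mirrors the RFC's remove_dot_segments; everything
else (variable names, the use of a list as the output buffer) is an
assumption of this sketch and not something taken from the Yadis spec.

```python
def remove_dot_segments(path: str) -> str:
    """Remove "." and ".." segments, following RFC 3986 section 5.2.4."""
    output = []  # completed segments, each with its leading "/" if any
    rest = path  # the not-yet-processed input buffer
    while rest:
        if rest.startswith("../"):
            rest = rest[3:]
        elif rest.startswith("./"):
            rest = rest[2:]
        elif rest.startswith("/./"):
            rest = rest[2:]  # "/./" becomes "/"
        elif rest == "/.":
            rest = "/"
        elif rest.startswith("/../"):
            rest = rest[3:]  # "/../" becomes "/"
            if output:
                output.pop()  # drop the previous segment
        elif rest == "/..":
            rest = "/"
            if output:
                output.pop()
        elif rest in (".", ".."):
            rest = ""
        else:
            # Move the first segment (with its leading "/", if present)
            # up to, but not including, the next "/".
            cut = rest.find("/", 1)
            if cut == -1:
                output.append(rest)
                rest = ""
            else:
                output.append(rest[:cut])
                rest = rest[cut:]
    return "".join(output)


if __name__ == "__main__":
    print(remove_dot_segments("/users/foo/../joe"))
    print(remove_dot_segments("/."))
```

Applied to the examples from this thread, the path /users/foo/../joe reduces
to /users/joe and the path /. reduces to /, which is why the two pairs of URLs
in question compare equal once normalized.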