Broken HTML Support
Jens Alfke
jens at mooseyard.com
Sat Feb 11 18:22:54 UTC 2006
On 10 Feb '06, at 12:27 PM, Joseph Holsten wrote:
> I'd really like to start collecting best practices to help pin down
> how the spec is being used, which SHOULDs and recommendations are
> being ignored, etc. In particular, what's the position about
> handling broken html
The long-established general consensus is to accept broken HTML. All
web browsers do it, and so does every other tool I know of that
operates on HTML found in the wild (crawlers, scrapers, microformat
detectors), with the obvious exception of validators :) It's a case of
the well-known Postel's Law: "Be strict in what you generate but
liberal in what you accept."
In the related domain of XML-based formats, especially RSS/Atom
feeds, there is a legitimate debate about whether invalid markup
should be rejected (and I think the majority opinion is to reject it,
because there aren't the same backward-compatibility issues).
I think requiring (or even allowing) YADIS implementations to reject
invalid HTML could end up being a roadblock for adoption. Some sites
may face significant obstacles to fixing their HTML: it may be
generated by a legacy web-app they don't want to open the hood of, or
by a third-party CMS that's not going to fix the problem soon, or the
pages may include user-generated HTML templates or content. In a case
where implementing whizzy new technology requires making such
changes, the new technology may lose.
As for regex-based HTML 'parsers', I'm pretty leery of them. They are
too easily confused by legal but oddball markup, e.g. angle brackets
in CDATA, as I said here last week(?). I don't trust them for
anything security-related.
Doesn't nearly every language/platform have a readily available HTML
parser and tidying library by now?
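
To make the regex concern concrete, here is a rough Python sketch. The
page, the URLs, and the "X-YADIS-Location" meta tag name are made up
for illustration, and the regex is just one plausible naive pattern,
not anything from the spec or an existing implementation. The page has
a decoy tag inside a script block (CDATA content in HTML 4) and some
unclosed tags in the body. The standard library's html.parser is used
here only as an example of a tolerant parser.

```python
import re
from html.parser import HTMLParser

# Hypothetical page: a decoy <meta> sits inside <script> (CDATA content),
# the real one follows it, and the body has unclosed tags.
PAGE = """<html><head>
<script>
  var s = '<meta http-equiv="X-YADIS-Location" content="http://evil.example/yadis">';
</script>
<meta http-equiv="X-YADIS-Location" content="http://example.com/yadis">
<title>Example</title>
</head>
<body><p>Unclosed paragraph <b>and unclosed bold
</html>"""

# Naive pattern (an assumption): first thing that looks like the tag wins.
META_RE = re.compile(
    r'<meta\s+http-equiv="X-YADIS-Location"\s+content="([^"]*)"', re.I)


def regex_find(page):
    """Return the content of the first textual match, or None."""
    m = META_RE.search(page)
    return m.group(1) if m else None


class MetaFinder(HTMLParser):
    """Records the content of the first real X-YADIS-Location meta tag."""

    def __init__(self):
        super().__init__()
        self.location = None

    def handle_starttag(self, tag, attrs):
        # Script contents are delivered as data, never as start tags,
        # so the decoy inside <script> does not reach this method.
        if tag == "meta" and self.location is None:
            d = dict(attrs)
            if (d.get("http-equiv") or "").lower() == "x-yadis-location":
                self.location = d.get("content")


def parser_find(page):
    """Return the meta tag's content as seen by a tolerant HTML parser."""
    finder = MetaFinder()
    finder.feed(page)
    finder.close()
    return finder.location
```

On this page regex_find(PAGE) picks up the decoy URL from inside the
script string, while parser_find(PAGE) returns the one from the real
tag and is not bothered by the unclosed markup in the body.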
--Jens