Broken HTML Support

Sat Feb 11 18:22:54 UTC 2006

On 10 Feb '06, at 12:27 PM, Joseph Holsten wrote:

> I'd really like to start collecting best practices to help pin down  
> how the spec is being used, which SHOULDs and recommendations are  
> being ignored, etc. In particular, what's the position about  
> handling broken html

The long-established general consensus is to accept broken HTML. All
web browsers do it, as does every other tool I know of that operates on
HTML found in the wild (crawlers, scrapers, microformat detectors),
with the obvious exception of validators :) It's a case of the
well-known Postel's Law: "Be strict in what you generate but liberal in
what you accept."

In the related domain of XML-based formats, especially RSS/Atom  
feeds, there is a legitimate debate about whether invalid markup  
should be rejected (and I think the majority opinion is to reject it,  
because there aren't the same backward-compatibility issues.)

I think requiring (or even allowing) YADIS implementations to reject
invalid HTML could end up being a roadblock for adoption. For some
sites there may be significant obstacles to fixing their HTML: it may
be generated by a legacy web app they don't want to open the hood of,
or by a third-party CMS that's not going to fix the problem soon, or
the pages may include user-generated HTML templates or content. In a
case where implementing whizzy new technology requires making such
changes, the new technology may lose.

As for regex-based HTML 'parsers', I'm pretty leery of them. They are  
too easily confused by legal but oddball markup, e.g. angle brackets  
in CDATA, as I said here last week(?). I don't trust them for  
anything security-related.
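
A minimal sketch of the kind of confusion meant here, in Python. The
page, the URLs, and the "X-YADIS-Location" meta tag name are assumptions
chosen for illustration, not quoted from the spec. The script element
(whose content is CDATA in HTML 4) and the comment both contain text
that only looks like a tag; the naive pattern picks up those two decoys
and misses the real tag, while the standard-library tolerant parser
reports only the real one despite the sloppy markup in the body.

```python
import re
from html.parser import HTMLParser

# Hypothetical page: the only real <meta> tag points at example.org.
# The comment and the script string merely contain tag-like text.
PAGE = """<html><head>
<!-- <meta http-equiv="X-YADIS-Location" content="http://evil.example/old"> -->
<script type="text/javascript">
var s = '<meta http-equiv="X-YADIS-Location" content="http://evil.example/js">';
</script>
<META HTTP-EQUIV='x-yadis-location' CONTENT='http://example.org/yadis'>
</head><body><p>unclosed paragraph<br><b>sloppy <i>nesting</b></i></body></html>"""

# A deliberately naive pattern of the sort a regex-based 'parser' might use.
NAIVE = re.compile(r'<meta\s+http-equiv="X-YADIS-Location"\s+content="([^"]*)"')


def regex_locations(html):
    """Return every URL the naive pattern matches, decoys included."""
    return NAIVE.findall(html)


class MetaFinder(HTMLParser):
    """Collect content= values from real <meta http-equiv=...> start tags."""

    def __init__(self):
        super().__init__()
        self.locations = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("http-equiv") or "").lower()
        if name == "x-yadis-location" and attributes.get("content"):
            self.locations.append(attributes["content"])


def parser_locations(html):
    """Return URLs found by the tolerant parser; comments and script text are skipped."""
    finder = MetaFinder()
    finder.feed(html)
    finder.close()
    return finder.locations


if __name__ == "__main__":
    print("regex :", regex_locations(PAGE))
    print("parser:", parser_locations(PAGE))
```

A more careful regex could handle the case and quoting differences, but
the comment and script decoys are harder to rule out without actually
tokenizing the document, which is the point.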

Doesn't nearly every language/platform have a readily available HTML  
parser and tidying library by now?

--Jens