<HTML><BODY style="word-wrap: break-word; -khtml-nbsp-mode: space; -khtml-line-break: after-white-space; "><BR><DIV><DIV>On 10 Feb '06, at 12:27 PM, Joseph Holsten wrote:</DIV><BR class="Apple-interchange-newline"><BLOCKQUOTE type="cite"><P style="margin: 0.0px 0.0px 0.0px 0.0px"><FONT face="Verdana" size="3" style="font: 11.0px Verdana">I'd really like to start collecting best practices to help pin down how the spec is being used, which SHOULDs and recommendations are being ignored, etc. In particular, what's the position about handling broken html</FONT></P> </BLOCKQUOTE></DIV><BR><DIV>The long-established general consensus is to accept broken HTML. All web browsers do it, as does every other tool I know of that operates on HTML found in the wild (crawlers, scrapers, microformat detectors), with the obvious exception of validators :) It's a case of the well-known Postel's Law: "Be strict in what you generate but liberal in what you accept."</DIV><DIV><BR class="khtml-block-placeholder"></DIV><DIV>In the related domain of XML-based formats, especially RSS/Atom feeds, there is a legitimate debate about whether invalid markup should be rejected (I think the majority opinion is to reject it, because there aren't the same backward-compatibility issues).</DIV><DIV><BR class="khtml-block-placeholder"></DIV><DIV>I think requiring (or even allowing) YADIS implementations to reject invalid HTML could end up being a roadblock for adoption. Some sites may face significant obstacles to fixing their HTML: it may be generated by a legacy web-app they don't want to open the hood of, or by a 3rd party CMS that's not going to fix the problem soon, or the pages may include user-generated HTML templates or content. In a case where implementing whizzy new technology requires making such changes, the new technology may lose.</DIV><DIV><BR class="khtml-block-placeholder"></DIV><DIV>As for regex-based HTML 'parsers', I'm pretty leery of them. 
They are too easily confused by legal but oddball markup, e.g. angle brackets in CDATA, as I said here last week(?). I don't trust them for anything security-related.</DIV><DIV><BR class="khtml-block-placeholder"></DIV><DIV>Doesn't nearly every language/platform have a readily available HTML parser and tidying library by now?</DIV><DIV><BR class="khtml-block-placeholder"></DIV><DIV>--Jens</DIV></BODY></HTML>
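To illustrate the point above about regex-based 'parsers' being confused by angle brackets in CDATA, here is a minimal sketch in Python. Everything specific in it is an assumption made up for the example: the rel value "example.server", the URLs, and the function names are hypothetical and not taken from any spec. It compares a naive regex scan with the standard library's html.parser on a page whose script block (wrapped in the usual XHTML-style CDATA markers) contains a string that merely looks like a link tag.

```python
import re
from html.parser import HTMLParser

# Hypothetical page: the first "<link ...>" is only text inside a script
# block; the second one is real markup.
PAGE = """<html><head>
<script type="text/javascript">
//<![CDATA[
var tip = '<link rel="example.server" href="http://evil.example/steal">';
//]]>
</script>
<link rel="example.server" href="http://good.example/server">
</head><body><p>hello</p></body></html>"""

# Naive approach: scan the raw text for anything shaped like the tag.
LINK_RE = re.compile(r'<link\s+rel="example\.server"\s+href="([^"]*)"', re.I)


def regex_find(page):
    """Return every href the regex matches, wherever it appears."""
    return LINK_RE.findall(page)


class LinkFinder(HTMLParser):
    """Collect hrefs from real <link rel="example.server"> start tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr_map = dict(attrs)
        rel = (attr_map.get("rel") or "").lower()
        if rel == "example.server" and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])


def parser_find(page):
    """Return hrefs found by the parser, which treats script content as text."""
    finder = LinkFinder()
    finder.feed(page)
    finder.close()
    return finder.hrefs


if __name__ == "__main__":
    print("regex :", regex_find(PAGE))
    print("parser:", parser_find(PAGE))
```

In this sketch the regex reports both URLs, including the one that only exists inside the script string, while the parser reports just the real link. That difference is the reason for not trusting the regex approach for anything security-related.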