LJ not correctly parsing <link... > tags.

Fri Jul 1 16:30:35 PDT 2005

Carl Howells wrote:
> It looks like LJ isn't correctly parsing the <link..> tags to extract 
> the server URL.  In particular, it isn't processing entities in the 
> href.  Example:
> 
> <link rel="openid.server" 
> href="http://www.schtuff.com/?action&#x3d;openid&#x5f;server" />
> 
> That's a valid HTML link tag that uses two entities in the value of its 
> href attribute.  Those entities should be processed before the 
> attribute is used, meaning the actual value extracted from that 
> attribute should be "http://www.schtuff.com/?action=openid_server".
> 
> At the moment, LJ's consumer code is not extracting that properly.  Will 
> that be fixed?

Ha ha ha. I'd never looked very closely at the consumer code until now. 
Extracting the URLs with regexes? Come on Brad! Surely you know better 
than that?

You can't do HTML with regex. It'll always do something wrong.

The hack solution here would be to use HTML::Entities on the string, 
decoding any entities. That'll only fix this problem, though. No doubt 
others will arise. The only way to do this properly is with an HTML 
parser. That means another dependency, but not an especially harsh one 
given that the code already depends on some crazy crypto/hash modules, 
which have to be much rarer than HTML::Parser and friends.
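
To make the difference concrete, here is a rough sketch in Python rather 
than Perl, purely for illustration. It is not the LJ consumer code: the 
regex, the function names and the LinkFinder class are all made up for 
this example. It compares a naive regex, the "decode the entities 
afterwards" hack, and a real HTML parser on the link tag quoted above.

```python
import re
from html import unescape
from html.parser import HTMLParser

# The link tag from the report above, on one line.
SAMPLE = ('<link rel="openid.server" '
          'href="http://www.schtuff.com/?action&#x3d;openid&#x5f;server" />')


def server_via_regex(html):
    """Hypothetical naive extraction: returns the raw attribute text."""
    m = re.search(r'<link[^>]*rel="openid\.server"[^>]*href="([^"]*)"', html)
    return m.group(1) if m else None


class LinkFinder(HTMLParser):
    """Records the href of the first <link> whose rel lists openid.server."""

    def __init__(self):
        super().__init__()
        self.server = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.server is not None:
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "openid.server" in rels:
            # The parser has already decoded character references here.
            self.server = attr.get("href")


def server_via_parser(html):
    finder = LinkFinder()
    finder.feed(html)
    finder.close()
    return finder.server


raw = server_via_regex(SAMPLE)      # still contains &#x3d; and &#x5f;
hacked = unescape(raw)              # the "hack": decode after extracting
parsed = server_via_parser(SAMPLE)  # decoded by the parser itself

# Same tag with single quotes and the attributes swapped: still valid
# HTML, but this particular regex no longer matches it.
VARIANT = "<link href='http://example.com/server' rel='openid.server'>"
```

On the sample, both the hack and the parser give the expected URL, but 
only the parser copes with the reordered, single-quoted variant. The 
regex could be patched for that case too, and then for the next one, 
which is rather the point.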

I suppose one final solution, which I don't really like to admit the 
existence of, is to just require that entities and character references 
are not used in these attributes. That'll all go horribly wrong as soon 
as an ID server URL has an ampersand in it, though, since a literal 
ampersand in an attribute value has to be written as &amp;.
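
A small sketch of the ampersand problem, again in Python for 
illustration only. The server URL is invented for the example. Any page 
that correctly escapes such a URL in its href puts an entity into the 
attribute, so a consumer that refuses to decode entities ends up with 
the wrong URL.

```python
from html import escape, unescape

# Hypothetical ID server URL containing an ampersand.
server_url = "http://example.com/openid?user=bob&mode=server"

# What a correct page has to put inside href="...".
attr_text = escape(server_url, quote=True)

# A consumer that does not decode entities would use attr_text as-is;
# decoding it recovers the intended URL.
decoded = unescape(attr_text)
```

Here attr_text contains "&amp;" where the URL has "&", and only the 
decoded value matches the URL the page author meant.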