[PATCH] utf8 flag support on perl lib

Tim Bunce Tim.Bunce at pobox.com
Thu Jan 10 20:28:21 UTC 2008


On Thu, Jan 10, 2008 at 07:55:50PM +0100, Peter J. Holzer wrote:
> On 2008-01-10 20:03:13 +0300, Tomash Brechko wrote:
> > On Thu, Jan 10, 2008 at 17:29:42 +0100, Peter J. Holzer wrote:
> > > The same byte sequence, but not the same value. In C (on many systems)
> > > the single precision floating point number 3.1415927 and the integer
> > > 1078530011 have the same byte sequence (0xdb 0xf 0x49 0x40 on little
> > > endian systems), but they hardly have the same value.
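
(For anyone who wants to check those numbers: a minimal Perl sketch,
assuming perl 5.10 or later for the "<" pack modifier and IEEE 754
single precision floats. The variable names are only for illustration.)

```perl
use strict;
use warnings;

# Pack pi as a little-endian single precision float ...
my $bytes = pack("f<", 3.1415927);

# ... then read the same four bytes as hex and as a signed 32-bit integer.
my $hex = unpack("H*", $bytes);   # "db0f4940"
my $int = unpack("l<", $bytes);   # 1078530011

print "bytes: $hex  as integer: $int\n";
```
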
> > 
> > OK, I've got your point, though it's more a question of terminology.
> > 
> > Let me put it another way: my opinion is that C::M (and C::M::F)
> > itself should not save/restore the UTF-8 flag.  Instead, it should work
> > the same way other Perl data streams work.  If you write a string to a
> > file, no magic flags are stored anywhere.  Instead, when you _read_
> > it back you say, "alright, please set the UTF-8 flag on the data if it
> > looks like a UTF-8 string".

> > DBI works the same way (yes, DBD backends actually, thanks for
> > pointing that out, but this doesn't make much difference).

Not for drivers (like DBD::Oracle) that have access to metadata about
the column which tells them the character set being used.

Some drivers for dumb databases do offer an "if it looks like utf8 then
flag it as utf8" mode, but that's just a hack, albeit a very practical
one.
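
A minimal sketch of such an "if it looks like utf8" mode, using only the
core utf8::decode function. The name flag_if_utf8 is made up for
illustration and is not taken from any driver.

```perl
use strict;
use warnings;

# Hypothetical helper: return a character string if the bytes are
# well-formed UTF-8, otherwise return the bytes untouched.
sub flag_if_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;                 # utf8::decode works in place
    return $copy if utf8::decode($copy);
    return $bytes;
}

my $smiley = flag_if_utf8("\xe2\x98\xba");   # valid UTF-8 for U+263A
my $latin1 = flag_if_utf8("caf\xe9");        # a lone 0xE9 is not valid UTF-8
```

The weakness is that Latin-1 data which happens to be well-formed UTF-8
gets misread as characters, so this is a guess and not a guarantee.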

> > Actually, it's possible to store this flag in memcached, and _when
> > asked_ to set UTF-8 back, no string scan would be necessary to see if
> > the string is really in UTF-8. 
> 
> I really think that:
> 
>  my $var = "some arbitrary string";
>  $memcached->set("key", $var);
>  my $var2 = $memcached->get("key");
>  is($var, $var2);
> 
> should always succeed.

I agree.

> That can be done by always encoding and decoding

Which carries a potentially significant performance hit.

> it or by storing the flag and always honoring it if it is present.

Which carries a potentially significant portability hit.

Seems like both should be supported.
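
To make the two options concrete, here is a rough sketch. Everything in
it is an assumption for illustration: %store stands in for a memcached
server, the sub names are invented, and the flag bit 0x4 is an arbitrary
choice, not a value any client is known to reserve.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

my $F_UTF8 = 0x4;   # hypothetical flag bit, chosen arbitrarily
my %store;          # stand-in for the server: key => [bytes, flags]

# Option 1: always encode on set and always decode on get.
sub set_always {
    my ($key, $value) = @_;
    $store{$key} = [ encode_utf8($value), 0 ];
}
sub get_always {
    my ($key) = @_;
    return decode_utf8($store{$key}[0]);
}

# Option 2: encode only strings that carry the UTF-8 flag, record that
# in the item flags, and honor the flag on the way back.
sub set_flagged {
    my ($key, $value) = @_;
    if (utf8::is_utf8($value)) {
        $store{$key} = [ encode_utf8($value), $F_UTF8 ];
    } else {
        $store{$key} = [ $value, 0 ];
    }
}
sub get_flagged {
    my ($key) = @_;
    my ($bytes, $flags) = @{ $store{$key} };
    return decode_utf8($bytes) if $flags & $F_UTF8;
    return $bytes;
}
```

Both pass the is($var, $var2) test above. Option 1 pays for an encode
and a decode on every call. Option 2 only works if every client that
reads the item agrees on what the bit means.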

> > However, I think such an optimization is
> > not worth the risk of missing some UTF-8 data that was uploaded through
> > some other memcached client that doesn't set any special flag

Has anyone taken a look at how different memcached clients use the flags?

Would someone volunteer?

Perhaps there's some scope for establishing some informal conventions for
certain bits.

Tim.

p.s. I'm working on an XS interface to libmemcached.
    http://code.google.com/p/perl-libmemcached/
The groundwork has been done. Volunteers welcome to add flesh to the bones.

