[PATCH] utf8 flag support on perl lib

Sat Jan 12 07:01:35 UTC 2008

Just an FYI for this thread, in the binary protocol we have reserved
some bytes for opaque flags. I think it would make a lot of sense if the
Perl client set a character string / binary string flag and used that to
do the right thing when retrieving the value.

Aaron
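
A minimal sketch of the flag-based approach suggested above, not the patch
itself. It assumes the client keeps one bit in the per-item flags to mark
"character string". The constant F_CHARACTER_STRING, the helpers freeze_value
and thaw_value, and the hash standing in for the server are all hypothetical
names for illustration, not part of Cache::Memcached or the binary protocol.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# Hypothetical flag bit; not a value defined by memcached or any client.
use constant F_CHARACTER_STRING => 0x1;

# Turn a Perl scalar into (octets, flags) suitable for storing.
sub freeze_value {
    my ($value) = @_;
    if (utf8::is_utf8($value)) {
        # Character string: store its UTF-8 encoding and mark it.
        return (encode_utf8($value), F_CHARACTER_STRING);
    }
    # Byte string: store as-is, no flag.
    return ($value, 0);
}

# Rebuild the Perl scalar from (octets, flags) on retrieval.
sub thaw_value {
    my ($bytes, $flags) = @_;
    if ($flags & F_CHARACTER_STRING) {
        # Decode strictly; malformed data dies instead of being guessed at.
        return decode_utf8($bytes, Encode::FB_CROAK());
    }
    return $bytes;
}

# A plain hash stands in for the server in this sketch.
my %store;
my $var = "na\x{ef}ve \x{263a}";    # character string (contains U+263A)
$store{key} = [ freeze_value($var) ];
my $var2 = thaw_value(@{ $store{key} });
my $round_trip_ok = ($var eq $var2);
```

With this scheme the writer never has to ask for anything special, and the
reader never has to scan the data to guess whether it is UTF-8.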

On Thu, 2008-01-10 at 19:55 +0100, Peter J. Holzer wrote:
> On 2008-01-10 20:03:13 +0300, Tomash Brechko wrote:
> > On Thu, Jan 10, 2008 at 17:29:42 +0100, Peter J. Holzer wrote:
> > > The same byte sequence, but not the same value. In C (on many systems)
> > > the single precision floating point number 3.1415927 and the integer
> > > 1078530011 have the same byte sequence (0xdb 0x0f 0x49 0x40 on little
> > > endian systems), but they hardly have the same value.
> > 
> > OK, I've got your point, though it's more a question of terminology.
> > 
> > Let me put it another way: my opinion is that C::M (and C::M::F)
> > itself should not save/restore UTF-8 flag.  Instead, it should work
> > the same way other Perl data streams work.  If you write a string to a
> > file, no magic flags are stored somewhere.  Instead, when you _read_
> > it back you say, "alright, please set the UTF-8 flag on the data if it
> > looks like a UTF-8 string".
> 
> Nope. When you read a file, you specify the encoding the file should be
> in. For all the encodings except raw, the stream is then actually
> decoded and converted to perl character strings: If the file doesn't
> match the specified encoding, an exception is thrown, there is no "if it
> looks like" involved, and the utf8 flag is always set.
> 
> (Perl is more forgiving on output: If the current character cannot be
> represented in the output encoding, the I/O layer substitutes it instead
> of throwing an exception)
> 
> 
> > DBI works the same way (yes, DBD backends
> > actually, thanks for pointing that out, but this doesn't make much
> > difference).
> 
> I hope not. DBD::Oracle converts from and to perl character strings if
> the local character set (in NLS_LANG) is some variation of UTF-8.
> Otherwise it converts from and to the local character set and expects
> and delivers byte strings.  No guessing involved. I think DBD::mysql has
> some flag to control character vs. byte strings, but I haven't used that
> lately. (We are talking only about varchar and clob types here - of
> course a blob must never be converted)
> 
> 
> > Actually, it's possible to store this flag in memcached, and _when
> > asked_ to set UTF-8 back, no string scan would be necessary to see if
> > the string is really in UTF-8. 
> 
> I really think that:
> 
>  my $var = "some arbitrary string";
>  $memcached->set("key", $var);
>  my $var2 = $memcached->get("key");
>  is($var, $var2);
> 
> should always succeed. That can be done by always encoding and decoding
> it or by storing the flag and always honoring it if it is present.
> 
> > However, I think such optimization is
> > not worth the risk of missing some UTF-8 data that was uploaded through
> > some other memcached client that doesn't set any special flag, or of
> > setting the UTF-8 flag on a string that was modified with append/prepend.
> 
> Right. Requiring the programmer to do something special is prone to
> errors. C::M should handle all perl data types transparently.
> 
> Of course if you have different clients accessing the data you need to
> specify the exact format anyway - A python client probably can't decode
> Perl's Storable format.
> 
> Append/prepend may be a problem. But that can be left to the
> application - appending a byte string to a character string is a type
> error, just like adding a length to an area is - perl lets you get away
> with both, but the result won't be sensible.
> 
> > You correctly pointed out that this flag is part of Perl's internals, so
> > it's better not to set it without additional precautions.
> 
> I think there are two layers:
> 
> 1) Perl knows two types of strings: Byte strings and character strings. 
>    The elements of byte strings can be mapped to 8-bit numbers, the
>    elements of character strings can be mapped to 32-bit numbers. 
>    There are also some differences related to character classes, etc.
>    Whether a given scalar is a character string or a byte string can be
>    determined with the (badly named) utf8::is_utf8 function.
> 
>    This is the conceptual model.
> 
> 2) perl character strings are actually stored in UTF-8 format, and there
>    is a flag in the PV structure to distinguish character strings from
>    byte strings, which can be manipulated.
> 
>    These are implementation details.
> 
> I think it is perfectly ok (and even necessary) to take 1) into account
> but one shouldn't rely on 2) unless necessary.
> 
> 	hp
> 
>
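
The two layers described in the quoted message can be seen in a few lines of
Perl. This is only an illustrative sketch using the core utf8:: functions and
the Encode module; the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Layer 1 (conceptual): a character string and its UTF-8 encoding are
# different values, even though they serialize to the same octets.
my $chars = "\x{20ac}";              # one character, U+20AC EURO SIGN
my $bytes = encode_utf8($chars);     # byte string: 0xE2 0x82 0xAC

my $chars_len = length $chars;       # counts characters
my $bytes_len = length $bytes;       # counts bytes
my $same      = ($chars eq $bytes);  # compared element by element

# Layer 2 (implementation): the internal flag may differ between two
# strings that compare equal, because it only records how perl stores them.
my $plain    = "caf\x{e9}";          # all elements fit in 8 bits
my $upgraded = $plain;
utf8::upgrade($upgraded);            # changes internal storage only

my $still_equal   = ($plain eq $upgraded);
my $plain_flag    = utf8::is_utf8($plain)    ? 1 : 0;
my $upgraded_flag = utf8::is_utf8($upgraded) ? 1 : 0;
```

Here $chars has length 1 and $bytes has length 3, and the two are not equal,
while $plain and $upgraded stay equal although only $upgraded carries the
internal flag.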