[PATCH] utf8 flag support on perl lib

Fri Jan 11 08:03:02 UTC 2008

On Thu, Jan 10, 2008 at 19:55:50 +0100, Peter J. Holzer wrote:
> I really think that:
> 
>  my $var = "some arbitrary string";
>  $memcached->set("key", $var);
>  my $var2 = $memcached->get("key");
>  is($var, $var2);
> 
> should always succeed.

It would be nice if that worked automagically, but things are more
complex than that.  Suppose you have two strings: $text is a true
UTF-8 string (it has the UTF-8 flag set, and it contains bytes with
the high bit set), and $binary is the same sequence of bytes but
without the UTF-8 flag, i.e. binary data.  Now what if you store both
strings into the same (plain) file?  When you read them back, you have
to open the file either in :raw mode or in :utf8 mode, and _both_
strings will come back with the UTF-8 flag cleared or _both_ with it
set (unless you read them separately in different modes).  This is
natural, because the flag is not stored along with the data; instead,
the decision about setting the UTF-8 flag is made anew each time you
read the data back.  The same happens if you store a 32-bit integer
and a 32-bit float into a file: it's up to the reading side to decide
which type each 4-byte sequence has, or whether they are processed as
untyped byte sequences.
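
As a rough sketch of the file case (the sample bytes, the temporary
file and the read_lines helper are made up for illustration), both
strings end up as identical bytes on disk, and only the layer chosen
on read decides what comes back:

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Same byte sequence twice: once as binary data, once as decoded text.
my $binary = "\xC3\xA9\xC3\xA8";   # 4 bytes, UTF-8 flag off
my $text   = $binary;
utf8::decode($text);               # 2 characters, UTF-8 flag on

# Store both into one plain file; on disk the two lines are identical.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out, ':raw';
print {$out} $binary, "\n", encode('UTF-8', $text), "\n";
close $out or die "close: $!";

# Hypothetical helper: read all lines back through the given layer.
sub read_lines {
    my ($layer) = @_;
    open my $in, "<$layer", $path or die "open: $!";
    chomp(my @lines = <$in>);
    close $in;
    return @lines;
}

# The layer, not the stored data, decides the flag for _both_ strings.
my @as_bytes = read_lines(':raw');
my @as_chars = read_lines(':encoding(UTF-8)');

for my $set ([raw => \@as_bytes], [utf8 => \@as_chars]) {
    my ($name, $lines) = @$set;
    printf "%-4s flag=%s length=%s\n", $name,
        join(',', map { utf8::is_utf8($_) ? 1 : 0 } @$lines),
        join(',', map { length } @$lines);
}
```

Nothing in the file records which line was "text" and which was
"binary" when it was written.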

So, if it doesn't work automagically for files, why expect it to work
that way for memcached?

Likewise, from the description of how DBD::Oracle works it follows
that it doesn't save the UTF-8 flag of Perl scalars anywhere either
(not a surprise).  Instead, the decision is again made anew each time
the data is retrieved: if it comes from a binary column, the UTF-8
flag will be cleared (even if you saved $text there).  If the data
comes from a text column, _and_ some setting (like an environment
variable or a DBD constructor parameter) is set, then the UTF-8 flag
will be set (even if you originally stored $binary there).

Of course, if the application is designed so that it always saves
$text into text columns and $binary into binary columns, and the
relevant setting is enabled, then you get the impression that the
UTF-8 flag is saved and restored.  But this is because most databases
have a notion of data type, while plain files _and_ memcached lack
it.

Now, one could say, why bother?  Even if memcached is untyped by
itself, let's emulate a text-or-binary type with a flag, which we will
call F_UTF8 or something.  And, like the patch in question does, we'd
save the value of the scalar's UTF-8 flag on store and restore it on
fetch (perhaps only when some external setting permits us to do so,
as for DBD::Oracle).

There is a problem with this.  Suppose I have the following setup:
there are a number of populate scripts that load data from, say,
files into memcached.  There are also a number of processing scripts
that retrieve the data from memcached and do some _character_
processing.  Of course, my processing scripts would have 'use utf8;'
(say, because they contain UTF-8 constants), and would also
_explicitly_ set the UTF-8 flag on the data retrieved from memcached
(because the processing is character-wise).  "Setting UTF-8
explicitly" means that, similar to opening a file in :utf8 mode, you
would set some (yet to be implemented) utf8 => 1 in the C::M
constructor, or just do the string upgrade manually.

But what's important is that the _populate_ scripts do not care about
characters, they just copy bytes around (just as when you copy a file
containing a 32-bit int and a 32-bit float you don't care about these
types, you only move bytes around).  So naturally the populate scripts
do not have 'use utf8;' in them, nor do they open the input files in
:utf8 mode.  What's also important is that populate scripts don't
have to be written in Perl.

Now, if we imagine that we employed F_UTF8, the following would
happen: I may still control whether the data retrieved from memcached
has the UTF-8 flag set, with an environment setting or some other
knob.  But all scripts that store into memcached (like my populate
scripts) _have_ to work in utf8 mode, otherwise the internal F_UTF8
won't be set.  This is very unnatural and would cause all sorts of
confusion, because basically I have to know the data type just to copy
data around.  Note how this differs from typed databases: there you
can't avoid making the decision when you do the store, since you save
your data either into a text _or_ into a binary column.  But memcached
is like an untyped file: no decision is _required_ when you do the
store, you decide everything on fetch.
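
To make the scenario concrete, here is a minimal sketch of such an
F_UTF8 emulation.  Everything in it is an assumption for
illustration: the %cache hash stands in for memcached, and
flagged_set/flagged_get and the flag value are hypothetical, not the
code from the patch:

```perl
use strict;
use warnings;

use constant F_UTF8 => 1;   # hypothetical flag bit, named as in the text

my %cache;                  # stand-in for memcached: key => [flags, bytes]

# Store: record the scalar's UTF-8 flag next to the encoded bytes.
sub flagged_set {
    my ($key, $value) = @_;             # $value is a private copy
    my $flags = utf8::is_utf8($value) ? F_UTF8 : 0;
    utf8::encode($value) if $flags;
    $cache{$key} = [$flags, $value];
    return;
}

# Fetch: restore the flag only if the storing side happened to set it.
sub flagged_get {
    my ($key) = @_;
    my ($flags, $value) = @{ $cache{$key} };
    utf8::decode($value) if $flags & F_UTF8;
    return $value;
}

my $bytes = "\xC3\xA9";     # what a byte-copying populate script holds
my $chars = $bytes;
utf8::decode($chars);       # what a script working in utf8 mode holds

flagged_set(from_populate => $bytes);
flagged_set(from_utf8     => $chars);

# Identical bytes were stored, yet the fetched values differ.
my $got_bytes = flagged_get('from_populate');
my $got_chars = flagged_get('from_utf8');
printf "from_populate: length %d\n", length $got_bytes;
printf "from_utf8:     length %d\n", length $got_chars;
```

The result of a fetch now depends on how the storing script happened
to hold the string, which is exactly the knowledge a byte-copying
script does not have.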

In other words, it is hard to maintain the F_UTF8 setting, because you
are forced to carry the knowledge about the data type across all
store/fetch operations, just to preserve the flag until the time when
you decide to do real character processing.  It would be much safer
(and also more natural) not to pretend that memcached has a notion of
type when it clearly doesn't, and to do all typing explicitly, like we
do with plain files.
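
A minimal sketch of what explicit typing on fetch could look like,
assuming a plain hash as a stand-in for memcached and a hypothetical
get_text helper (neither is part of C::M):

```perl
use strict;
use warnings;
use Encode qw(decode);

my %cache;   # stand-in for memcached: holds plain bytes, no type info

# Populate side: just copies bytes, no 'use utf8;', no :utf8 layer.
$cache{greeting} = "caf\xC3\xA9";

# Processing side: decides at fetch time that this value is UTF-8 text.
sub get_text {
    my ($key) = @_;
    my $bytes = $cache{$key};
    return defined $bytes ? decode('UTF-8', $bytes) : undef;
}

my $text  = get_text('greeting');
my $upper = uc $text;              # character-wise processing
printf "%d characters\n", length $text;
```

The populate side stays type-agnostic, and only the script that
actually needs characters pays for the decoding decision.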

My two cents ;).

-- 
   Tomash Brechko