benchmarking Perl client hash functions

Wed, 21 Jul 2004 11:01:06 -0700 (PDT)

I did an analysis on LiveJournal's keys, figuring it was more realistic
than a dictionary, and found out what's pretty much already been brought
up:

   -- perl's ord() hash sucks.  very, very slow, and buggy.
      the C implementation I copied depends on integer overflow.
      Perl doesn't do that.  end result:  /terrible/ key distribution.
      LiveJournal uses explicit hash values (userids) so we never
      noticed.  we have perhaps 15 keys that we use the hash
      function for.

After fixing that bug:

   -- ord still slow.  distribution okay.

   -- crc32 is very fast, better distribution.

In conclusion, I see no reason to stay with ord.  The perl default as of
the next release (if I hear no loud and valid complaints) will be (CRC32
>> 16) & 0x7ff.  I don't know why just bits 16-30 are used, but since it
looks fine, I'll assume whoever proposed that first knows why.  (I'm
guessing those bits are the most random for that function....)

- Brad

On Fri, 16 Jul 2004, Larry Leszczynski wrote:

> Hi all -
>
> Troy Hakala posted recently about some hashing benchmarks he had run on
> the PHP client:
>
>    http://lists.danga.com/pipermail/memcached/2004-June/000664.html
>
> He found that "(crc32($key)>>16) & 0x7fff" was not only faster than a PHP
> implementation of the Perl client's hash algorithm but it also gave a
> better distribution across buckets.
>
> I did some benchmarking of Perl hashing algorithms and found the same thing
> was true when I used the String::CRC32 module from CPAN.  Details are
> below, but in summary the crc32 scheme was over 400% faster, and gave
> almost perfect distribution across buckets while the current Perl client
> algorithm had uneven distribution.  So I'm wondering a couple things:
>
> 1) For test keys I used all the words in my /usr/share/dict/words, plus
> each of the same words with the characters reversed (about 90,000 keys
> total, average length 8 characters).  Is there any reason to think this
> might skew the results, either because of the length or because the test
> keys only contain alpha characters?  If so, any ideas for generating more
> realistic test keys?
>
> 2) I haven't done any profiling of the Perl client code so I don't know
> how much of a bottleneck (if any) the hashing part might be.  I can
> probably rig up a patch that uses String::CRC32 if it exists, or falls
> back to the current algorithm if not - should I go ahead and give that a
> try?
>
>
> Larry
>
> ==========================================================================
>
> Details:
>
> The current Perl client hash code looks like:
>
>    sub original {
>        my $hash = 0;
>        foreach (split //, shift) {
>            $hash = $hash*33 + ord($_);
>        }
>        return $hash;
>    }
>
> I also tried a slight variation where I did a hash lookup of the ord()
> values instead of the function call, but surprisingly (at least to me)
> that ran about 7% slower than the original:
>
>    my %ord = map { chr($_) => $_ } (0..127);
>    sub ord_lookup {
>        my $hash = 0;0
>        foreach (split //, shift) {
>            $hash = $hash*33 + $ord{$_};
>        }
>        return $hash;
>    }
>
> The String::CRC32 version looks like:
>
>    sub CRC32 {
>       return (crc32(shift) >> 16) & 0x7fff;
>    }
>
> It runs about 400% faster than the original algorithm.  The Digest::CRC
> module also has a crc32 subroutine, which from my testing looks like it
> produces the exact same output as the String::CRC32 version, but the
> Digest::CRC hash code actually runs slower than the even the original
> code:
>
>                   Rate   Digest_CRC     original   CRC32
>    Digest_CRC  37697/s           --         -32%    -87%
>    original    55736/s          48%           --    -80%
>    CRC32      283906/s         653%         409%      --
>
> (Note:  Digest::CRC also includes a *really* slow pure-Perl implementation
> that is used as a fallback if for whatever reason the XS portion does not
> load (I stumbled across that accidentally).  In that case for my tests it
> dropped from about 38K/sec to 400/sec - ouch.)
>
> I also tried the sprintf scheme that Troy tested:
>
>    sub CRC32_sprintf {
>       return sprintf("%u", crc32(shift));
>    }
>
> And I also found that it ran slower than the one that just shifts (but
> still far faster than the original):
>
>                      Rate CRC32_sprintf         CRC32
>    CRC32_sprintf 171415/s            --          -42%
>    CRC32         293065/s           71%            --
>
>
> As far as distribution across buckets, here is what I got for some
> different numbers of buckets (percentage hashed into each bucket rounded
> to nearest integer):
>
> 1
> orig    100
> crc     100
>
> 2
> orig    58,42
> crc     50,50
>
> 3
> orig    24,47,29
> crc     33,34,33
>
> 4
> orig    37,21,21,21
> crc     25,25,25,25
>
> 5
> orig    20,20,20,20,20
> crc     20,20,20,20,20
>
> 6
> orig    15,21,17,10,26,12
> crc     16,17,17,17,17,17
>
> 7
> orig    14,15,14,14,14,14,14
> crc     14,14,14,14,14,14,14
>
> 8
> orig    26,11,10,10,11,11,11,11
> crc     12,12,12,13,13,13,13,13
>
> 9
> orig    12,15,11,7,17,9,6,14,10
> crc     11,11,11,11,11,11,11,11,11
>
> 10
> orig    12,8,12,8,12,9,12,8,11,9
> crc     10,10,10,10,10,10,10,10,10,10
>
> ==========================================================================
>
>