benchmarking Perl client hash functions

DJ Matlock jmat@shutdown.net
Wed, 21 Jul 2004 17:57:14 -0400


This works quite well under PHP... It's a lot faster than the older method
that is in the scripted clients using ord, and it definitely distributes
much better than the other methods...


I just ran 234k words through it to test the distribution...

-- ugly quick php script --

$wordlist = file("/usr/share/dict/words");
printf("Words: %d \n\n",count($wordlist));
list($usec, $sec) = explode(" ", microtime());
$start = ((float)$usec + (float)$sec);
foreach ($wordlist as $word)
{
        $val = (crc32($word) >> 16) & 0x7ff;
        $server = $val % 5;
        $servera[$server]++;
}
list($usec, $sec) = explode(" ", microtime());
$end = ((float)$usec + (float)$sec);
printf("Finished calculating in %f seconds\n",round($end - $start,6));
print_r($servera);

-- Results --

Finished calculating in 0.470279 seconds
Array
(
    [0] => 46961
    [4] => 46851
    [3] => 46868
    [2] => 47138
    [1] => 47119
)



On 7/21/04 2:01 PM, "Brad Fitzpatrick" <brad@danga.com> wrote:

> I did an analysis on LiveJournal's keys, figuring it was more realistic
> than a dictionary, and found out what's pretty much already been brought
> up:
> 
>    -- perl's ord() hash sucks.  very, very slow, and buggy.
>       the C implementation I copied depends on integer overflow.
>       Perl doesn't do that.  end result:  /terrible/ key distribution.
>       LiveJournal uses explicit hash values (userids) so we never
>       noticed.  we have perhaps 15 keys that we use the hash
>       function for.
> 
> After fixing that bug:
> 
>    -- ord still slow.  distribution okay.
> 
>    -- crc32 is very fast, better distribution.
> 
> In conclusion, I see no reason to stay with ord.  The perl default as of
> the next release (if I hear no loud and valid complaints) will be (CRC32
>>> 16) & 0x7ff.  I don't know why just bits 16-30 are used, but since it
> looks fine, I'll assume whoever proposed that first knows why.  (I'm
> guessing those bits are the most random for that function....)
> 
> - Brad
> 
> 
> On Fri, 16 Jul 2004, Larry Leszczynski wrote:
> 
>> Hi all -
>> 
>> Troy Hakala posted recently about some hashing benchmarks he had run on
>> the PHP client:
>> 
>>    http://lists.danga.com/pipermail/memcached/2004-June/000664.html
>> 
>> He found that "(crc32($key)>>16) & 0x7fff" was not only faster than a PHP
>> implementation of the Perl client's hash algorithm but it also gave a
>> better distribution across buckets.
>> 
>> I did some benchmarking of Perl hashing algorithms and found the same thing
>> was true when I used the String::CRC32 module from CPAN.  Details are
>> below, but in summary the crc32 scheme was over 400% faster, and gave
>> almost perfect distribution across buckets while the current Perl client
>> algorithm had uneven distribution.  So I'm wondering a couple things:
>> 
>> 1) For test keys I used all the words in my /usr/share/dict/words, plus
>> each of the same words with the characters reversed (about 90,000 keys
>> total, average length 8 characters).  Is there any reason to think this
>> might skew the results, either because of the length or because the test
>> keys only contain alpha characters?  If so, any ideas for generating more
>> realistic test keys?
>> 
>> 2) I haven't done any profiling of the Perl client code so I don't know
>> how much of a bottleneck (if any) the hashing part might be.  I can
>> probably rig up a patch that uses String::CRC32 if it exists, or falls
>> back to the current algorithm if not - should I go ahead and give that a
>> try?
>> 
>> 
>> Larry
>> 
>> ==========================================================================
>> 
>> Details:
>> 
>> The current Perl client hash code looks like:
>> 
>>    sub original {
>>        my $hash = 0;
>>        foreach (split //, shift) {
>>            $hash = $hash*33 + ord($_);
>>        }
>>        return $hash;
>>    }
>> 
>> I also tried a slight variation where I did a hash lookup of the ord()
>> values instead of the function call, but surprisingly (at least to me)
>> that ran about 7% slower than the original:
>> 
>>    my %ord = map { chr($_) => $_ } (0..127);
>>    sub ord_lookup {
>>        my $hash = 0;0
>>        foreach (split //, shift) {
>>            $hash = $hash*33 + $ord{$_};
>>        }
>>        return $hash;
>>    }
>> 
>> The String::CRC32 version looks like:
>> 
>>    sub CRC32 {
>>       return (crc32(shift) >> 16) & 0x7fff;
>>    }
>> 
>> It runs about 400% faster than the original algorithm.  The Digest::CRC
>> module also has a crc32 subroutine, which from my testing looks like it
>> produces the exact same output as the String::CRC32 version, but the
>> Digest::CRC hash code actually runs slower than the even the original
>> code:
>> 
>>                   Rate   Digest_CRC     original   CRC32
>>    Digest_CRC  37697/s           --         -32%    -87%
>>    original    55736/s          48%           --    -80%
>>    CRC32      283906/s         653%         409%      --
>> 
>> (Note:  Digest::CRC also includes a *really* slow pure-Perl implementation
>> that is used as a fallback if for whatever reason the XS portion does not
>> load (I stumbled across that accidentally).  In that case for my tests it
>> dropped from about 38K/sec to 400/sec - ouch.)
>> 
>> I also tried the sprintf scheme that Troy tested:
>> 
>>    sub CRC32_sprintf {
>>       return sprintf("%u", crc32(shift));
>>    }
>> 
>> And I also found that it ran slower than the one that just shifts (but
>> still far faster than the original):
>> 
>>                      Rate CRC32_sprintf         CRC32
>>    CRC32_sprintf 171415/s            --          -42%
>>    CRC32         293065/s           71%            --
>> 
>> 
>> As far as distribution across buckets, here is what I got for some
>> different numbers of buckets (percentage hashed into each bucket rounded
>> to nearest integer):
>> 
>> 1
>> orig    100
>> crc     100
>> 
>> 2
>> orig    58,42
>> crc     50,50
>> 
>> 3
>> orig    24,47,29
>> crc     33,34,33
>> 
>> 4
>> orig    37,21,21,21
>> crc     25,25,25,25
>> 
>> 5
>> orig    20,20,20,20,20
>> crc     20,20,20,20,20
>> 
>> 6
>> orig    15,21,17,10,26,12
>> crc     16,17,17,17,17,17
>> 
>> 7
>> orig    14,15,14,14,14,14,14
>> crc     14,14,14,14,14,14,14
>> 
>> 8
>> orig    26,11,10,10,11,11,11,11
>> crc     12,12,12,13,13,13,13,13
>> 
>> 9
>> orig    12,15,11,7,17,9,6,14,10
>> crc     11,11,11,11,11,11,11,11,11
>> 
>> 10
>> orig    12,8,12,8,12,9,12,8,11,9
>> crc     10,10,10,10,10,10,10,10,10,10
>> 
>> ==========================================================================
>> 
>>