benchmarking Perl client hash functions
DJ Matlock
jmat@shutdown.net
Wed, 21 Jul 2004 17:57:14 -0400
This works quite well under PHP... It's a lot faster than the older method
that is in the scripted clients using ord, and it definitely distributes
much better than the other methods...
I just ran 234k words through it to test the distribution...
-- ugly quick php script --
$wordlist = file("/usr/share/dict/words");
printf("Words: %d \n\n",count($wordlist));
list($usec, $sec) = explode(" ", microtime());
$start = ((float)$usec + (float)$sec);
foreach ($wordlist as $word)
{
$val = (crc32($word) >> 16) & 0x7ff;
$server = $val % 5;
$servera[$server]++;
}
list($usec, $sec) = explode(" ", microtime());
$end = ((float)$usec + (float)$sec);
printf("Finished calculating in %f seconds\n",round($end - $start,6));
print_r($servera);
-- Results --
Finished calculating in 0.470279 seconds
Array
(
[0] => 46961
[4] => 46851
[3] => 46868
[2] => 47138
[1] => 47119
)
On 7/21/04 2:01 PM, "Brad Fitzpatrick" <brad@danga.com> wrote:
> I did an analysis on LiveJournal's keys, figuring it was more realistic
> than a dictionary, and found out what's pretty much already been brought
> up:
>
> -- perl's ord() hash sucks. very, very slow, and buggy.
> the C implementation I copied depends on integer overflow.
> Perl doesn't do that. end result: /terrible/ key distribution.
> LiveJournal uses explicit hash values (userids) so we never
> noticed. we have perhaps 15 keys that we use the hash
> function for.
>
> After fixing that bug:
>
> -- ord still slow. distribution okay.
>
> -- crc32 is very fast, better distribution.
>
> In conclusion, I see no reason to stay with ord. The perl default as of
> the next release (if I hear no loud and valid complaints) will be (CRC32
>>> 16) & 0x7ff. I don't know why just bits 16-30 are used, but since it
> looks fine, I'll assume whoever proposed that first knows why. (I'm
> guessing those bits are the most random for that function....)
>
> - Brad
>
>
> On Fri, 16 Jul 2004, Larry Leszczynski wrote:
>
>> Hi all -
>>
>> Troy Hakala posted recently about some hashing benchmarks he had run on
>> the PHP client:
>>
>> http://lists.danga.com/pipermail/memcached/2004-June/000664.html
>>
>> He found that "(crc32($key)>>16) & 0x7fff" was not only faster than a PHP
>> implementation of the Perl client's hash algorithm but it also gave a
>> better distribution across buckets.
>>
>> I did some benchmarking of Perl hashing algorithms and found the same thing
>> was true when I used the String::CRC32 module from CPAN. Details are
>> below, but in summary the crc32 scheme was over 400% faster, and gave
>> almost perfect distribution across buckets while the current Perl client
>> algorithm had uneven distribution. So I'm wondering a couple things:
>>
>> 1) For test keys I used all the words in my /usr/share/dict/words, plus
>> each of the same words with the characters reversed (about 90,000 keys
>> total, average length 8 characters). Is there any reason to think this
>> might skew the results, either because of the length or because the test
>> keys only contain alpha characters? If so, any ideas for generating more
>> realistic test keys?
>>
>> 2) I haven't done any profiling of the Perl client code so I don't know
>> how much of a bottleneck (if any) the hashing part might be. I can
>> probably rig up a patch that uses String::CRC32 if it exists, or falls
>> back to the current algorithm if not - should I go ahead and give that a
>> try?
>>
>>
>> Larry
>>
>> ==========================================================================
>>
>> Details:
>>
>> The current Perl client hash code looks like:
>>
>> sub original {
>> my $hash = 0;
>> foreach (split //, shift) {
>> $hash = $hash*33 + ord($_);
>> }
>> return $hash;
>> }
>>
>> I also tried a slight variation where I did a hash lookup of the ord()
>> values instead of the function call, but surprisingly (at least to me)
>> that ran about 7% slower than the original:
>>
>> my %ord = map { chr($_) => $_ } (0..127);
>> sub ord_lookup {
>> my $hash = 0;0
>> foreach (split //, shift) {
>> $hash = $hash*33 + $ord{$_};
>> }
>> return $hash;
>> }
>>
>> The String::CRC32 version looks like:
>>
>> sub CRC32 {
>> return (crc32(shift) >> 16) & 0x7fff;
>> }
>>
>> It runs about 400% faster than the original algorithm. The Digest::CRC
>> module also has a crc32 subroutine, which from my testing looks like it
>> produces the exact same output as the String::CRC32 version, but the
>> Digest::CRC hash code actually runs slower than the even the original
>> code:
>>
>> Rate Digest_CRC original CRC32
>> Digest_CRC 37697/s -- -32% -87%
>> original 55736/s 48% -- -80%
>> CRC32 283906/s 653% 409% --
>>
>> (Note: Digest::CRC also includes a *really* slow pure-Perl implementation
>> that is used as a fallback if for whatever reason the XS portion does not
>> load (I stumbled across that accidentally). In that case for my tests it
>> dropped from about 38K/sec to 400/sec - ouch.)
>>
>> I also tried the sprintf scheme that Troy tested:
>>
>> sub CRC32_sprintf {
>> return sprintf("%u", crc32(shift));
>> }
>>
>> And I also found that it ran slower than the one that just shifts (but
>> still far faster than the original):
>>
>> Rate CRC32_sprintf CRC32
>> CRC32_sprintf 171415/s -- -42%
>> CRC32 293065/s 71% --
>>
>>
>> As far as distribution across buckets, here is what I got for some
>> different numbers of buckets (percentage hashed into each bucket rounded
>> to nearest integer):
>>
>> 1
>> orig 100
>> crc 100
>>
>> 2
>> orig 58,42
>> crc 50,50
>>
>> 3
>> orig 24,47,29
>> crc 33,34,33
>>
>> 4
>> orig 37,21,21,21
>> crc 25,25,25,25
>>
>> 5
>> orig 20,20,20,20,20
>> crc 20,20,20,20,20
>>
>> 6
>> orig 15,21,17,10,26,12
>> crc 16,17,17,17,17,17
>>
>> 7
>> orig 14,15,14,14,14,14,14
>> crc 14,14,14,14,14,14,14
>>
>> 8
>> orig 26,11,10,10,11,11,11,11
>> crc 12,12,12,13,13,13,13,13
>>
>> 9
>> orig 12,15,11,7,17,9,6,14,10
>> crc 11,11,11,11,11,11,11,11,11
>>
>> 10
>> orig 12,8,12,8,12,9,12,8,11,9
>> crc 10,10,10,10,10,10,10,10,10,10
>>
>> ==========================================================================
>>
>>