benchmarking Perl client hash functions
Larry Leszczynski
larryl@emailplus.org
Fri, 16 Jul 2004 18:57:53 -0400 (Eastern Daylight Time)
Hi all -
Troy Hakala posted recently about some hashing benchmarks he had run on
the PHP client:
http://lists.danga.com/pipermail/memcached/2004-June/000664.html
He found that "(crc32($key)>>16) & 0x7fff" was not only faster than a PHP
implementation of the Perl client's hash algorithm but it also gave a
better distribution across buckets.
I did some benchmarking of Perl hashing algorithms and found the same thing
was true when I used the String::CRC32 module from CPAN. Details are
below, but in summary the crc32 scheme was over 400% faster, and gave
almost perfect distribution across buckets while the current Perl client
algorithm had uneven distribution. So I'm wondering a couple things:
1) For test keys I used all the words in my /usr/share/dict/words, plus
each of the same words with the characters reversed (about 90,000 keys
total, average length 8 characters). Is there any reason to think this
might skew the results, either because of the length or because the test
keys only contain alpha characters? If so, any ideas for generating more
realistic test keys?
2) I haven't done any profiling of the Perl client code so I don't know
how much of a bottleneck (if any) the hashing part might be. I can
probably rig up a patch that uses String::CRC32 if it exists, or falls
back to the current algorithm if not - should I go ahead and give that a
try?
Larry
==========================================================================
Details:
The current Perl client hash code looks like:
sub original {
my $hash = 0;
foreach (split //, shift) {
$hash = $hash*33 + ord($_);
}
return $hash;
}
I also tried a slight variation where I did a hash lookup of the ord()
values instead of the function call, but surprisingly (at least to me)
that ran about 7% slower than the original:
my %ord = map { chr($_) => $_ } (0..127);
sub ord_lookup {
my $hash = 0;0
foreach (split //, shift) {
$hash = $hash*33 + $ord{$_};
}
return $hash;
}
The String::CRC32 version looks like:
sub CRC32 {
return (crc32(shift) >> 16) & 0x7fff;
}
It runs about 400% faster than the original algorithm. The Digest::CRC
module also has a crc32 subroutine, which from my testing looks like it
produces the exact same output as the String::CRC32 version, but the
Digest::CRC hash code actually runs slower than the even the original
code:
Rate Digest_CRC original CRC32
Digest_CRC 37697/s -- -32% -87%
original 55736/s 48% -- -80%
CRC32 283906/s 653% 409% --
(Note: Digest::CRC also includes a *really* slow pure-Perl implementation
that is used as a fallback if for whatever reason the XS portion does not
load (I stumbled across that accidentally). In that case for my tests it
dropped from about 38K/sec to 400/sec - ouch.)
I also tried the sprintf scheme that Troy tested:
sub CRC32_sprintf {
return sprintf("%u", crc32(shift));
}
And I also found that it ran slower than the one that just shifts (but
still far faster than the original):
Rate CRC32_sprintf CRC32
CRC32_sprintf 171415/s -- -42%
CRC32 293065/s 71% --
As far as distribution across buckets, here is what I got for some
different numbers of buckets (percentage hashed into each bucket rounded
to nearest integer):
1
orig 100
crc 100
2
orig 58,42
crc 50,50
3
orig 24,47,29
crc 33,34,33
4
orig 37,21,21,21
crc 25,25,25,25
5
orig 20,20,20,20,20
crc 20,20,20,20,20
6
orig 15,21,17,10,26,12
crc 16,17,17,17,17,17
7
orig 14,15,14,14,14,14,14
crc 14,14,14,14,14,14,14
8
orig 26,11,10,10,11,11,11,11
crc 12,12,12,13,13,13,13,13
9
orig 12,15,11,7,17,9,6,14,10
crc 11,11,11,11,11,11,11,11,11
10
orig 12,8,12,8,12,9,12,8,11,9
crc 10,10,10,10,10,10,10,10,10,10
==========================================================================