benchmarking Perl client hash functions
Brad Fitzpatrick
brad@danga.com
Wed, 21 Jul 2004 11:01:06 -0700 (PDT)
I did an analysis on LiveJournal's keys, figuring it was more realistic
than a dictionary, and found out what's pretty much already been brought
up:
-- perl's ord() hash sucks. very, very slow, and buggy.
the C implementation I copied depends on integer overflow.
Perl doesn't do that. end result: /terrible/ key distribution.
LiveJournal uses explicit hash values (userids) so we never
noticed. we have perhaps 15 keys that we use the hash
function for.
After fixing that bug:
-- ord still slow. distribution okay.
-- crc32 is very fast, better distribution.
In conclusion, I see no reason to stay with ord. The perl default as of
the next release (if I hear no loud and valid complaints) will be (CRC32
>> 16) & 0x7ff. I don't know why just bits 16-30 are used, but since it
looks fine, I'll assume whoever proposed that first knows why. (I'm
guessing those bits are the most random for that function....)
- Brad
On Fri, 16 Jul 2004, Larry Leszczynski wrote:
> Hi all -
>
> Troy Hakala posted recently about some hashing benchmarks he had run on
> the PHP client:
>
> http://lists.danga.com/pipermail/memcached/2004-June/000664.html
>
> He found that "(crc32($key)>>16) & 0x7fff" was not only faster than a PHP
> implementation of the Perl client's hash algorithm but it also gave a
> better distribution across buckets.
>
> I did some benchmarking of Perl hashing algorithms and found the same thing
> was true when I used the String::CRC32 module from CPAN. Details are
> below, but in summary the crc32 scheme was over 400% faster, and gave
> almost perfect distribution across buckets while the current Perl client
> algorithm had uneven distribution. So I'm wondering a couple things:
>
> 1) For test keys I used all the words in my /usr/share/dict/words, plus
> each of the same words with the characters reversed (about 90,000 keys
> total, average length 8 characters). Is there any reason to think this
> might skew the results, either because of the length or because the test
> keys only contain alpha characters? If so, any ideas for generating more
> realistic test keys?
>
> 2) I haven't done any profiling of the Perl client code so I don't know
> how much of a bottleneck (if any) the hashing part might be. I can
> probably rig up a patch that uses String::CRC32 if it exists, or falls
> back to the current algorithm if not - should I go ahead and give that a
> try?
>
>
> Larry
>
> ==========================================================================
>
> Details:
>
> The current Perl client hash code looks like:
>
> sub original {
> my $hash = 0;
> foreach (split //, shift) {
> $hash = $hash*33 + ord($_);
> }
> return $hash;
> }
>
> I also tried a slight variation where I did a hash lookup of the ord()
> values instead of the function call, but surprisingly (at least to me)
> that ran about 7% slower than the original:
>
> my %ord = map { chr($_) => $_ } (0..127);
> sub ord_lookup {
> my $hash = 0;0
> foreach (split //, shift) {
> $hash = $hash*33 + $ord{$_};
> }
> return $hash;
> }
>
> The String::CRC32 version looks like:
>
> sub CRC32 {
> return (crc32(shift) >> 16) & 0x7fff;
> }
>
> It runs about 400% faster than the original algorithm. The Digest::CRC
> module also has a crc32 subroutine, which from my testing looks like it
> produces the exact same output as the String::CRC32 version, but the
> Digest::CRC hash code actually runs slower than the even the original
> code:
>
> Rate Digest_CRC original CRC32
> Digest_CRC 37697/s -- -32% -87%
> original 55736/s 48% -- -80%
> CRC32 283906/s 653% 409% --
>
> (Note: Digest::CRC also includes a *really* slow pure-Perl implementation
> that is used as a fallback if for whatever reason the XS portion does not
> load (I stumbled across that accidentally). In that case for my tests it
> dropped from about 38K/sec to 400/sec - ouch.)
>
> I also tried the sprintf scheme that Troy tested:
>
> sub CRC32_sprintf {
> return sprintf("%u", crc32(shift));
> }
>
> And I also found that it ran slower than the one that just shifts (but
> still far faster than the original):
>
> Rate CRC32_sprintf CRC32
> CRC32_sprintf 171415/s -- -42%
> CRC32 293065/s 71% --
>
>
> As far as distribution across buckets, here is what I got for some
> different numbers of buckets (percentage hashed into each bucket rounded
> to nearest integer):
>
> 1
> orig 100
> crc 100
>
> 2
> orig 58,42
> crc 50,50
>
> 3
> orig 24,47,29
> crc 33,34,33
>
> 4
> orig 37,21,21,21
> crc 25,25,25,25
>
> 5
> orig 20,20,20,20,20
> crc 20,20,20,20,20
>
> 6
> orig 15,21,17,10,26,12
> crc 16,17,17,17,17,17
>
> 7
> orig 14,15,14,14,14,14,14
> crc 14,14,14,14,14,14,14
>
> 8
> orig 26,11,10,10,11,11,11,11
> crc 12,12,12,13,13,13,13,13
>
> 9
> orig 12,15,11,7,17,9,6,14,10
> crc 11,11,11,11,11,11,11,11,11
>
> 10
> orig 12,8,12,8,12,9,12,8,11,9
> crc 10,10,10,10,10,10,10,10,10,10
>
> ==========================================================================
>
>