Moving key hashing into the client
Steven Grimm
sgrimm at facebook.com
Sat Jan 12 18:53:52 UTC 2008
On Jan 12, 2008, at 6:24 AM, Tomash Brechko wrote:
> I mean the server computes some hash value to decide where the key
> belongs in server's hash table. I.e. it is how the server finds the
> key in its memory.
If the client passes in the same hash that it uses for server
selection, won't that lead to nonuniform distribution of keys to hash
buckets? Ideally you want a completely different hash function for
server selection than for item bucketing within the server. A simple
example:
Say you have two memcached hosts. The client computes a hash value and
does a simple "modulo by number of hosts" to pick one of them. Then it
sends its request along with the hash value.
The server does a simple "modulo by number of buckets" to figure out
where to put the value. If the number of buckets is a power of two,
half of them are guaranteed to always be unused in this scheme,
because, e.g., only keys whose hash values have the low bit set will
be sent to server #2, whereas only hash values whose low bit is clear
will be sent to server #1. So you end up either wasting memory on
empty buckets or, more likely, not even noticing that the linked lists
in your buckets are twice as long as you'd like just because you're
basing the "do I auto-expand the hashtable?" decision on number of
items / number of buckets.
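A minimal sketch of that effect (the hash values here are just illustrative integers, not memcached's real hash output): with 2 servers and a power-of-two bucket count, every key routed to server #1 has an even hash, so it can only ever land in an even-numbered bucket.

```python
NUM_SERVERS = 2
NUM_BUCKETS = 8  # power of two, as with a doubling hashtable

# Keys whose hash has the low bit clear all go to server #1...
hashes_on_server1 = [h for h in range(1000) if h % NUM_SERVERS == 0]

# ...and those same (even) hashes mod a power of two hit only even buckets.
buckets_used = {h % NUM_BUCKETS for h in hashes_on_server1}

print(sorted(buckets_used))  # [0, 2, 4, 6] -- half the buckets never used
```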
This is less of an issue if the number of servers is a prime number,
or if you change memcached to always use a prime number of buckets,
but I think using the same hash function for server selection and for
internal server data structures is likely to lead to weird, hard to
diagnose inefficiencies.
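To see why a prime bucket count helps, the same sketch with 7 buckets instead of 8: the even hashes routed to one server now reach every bucket, because 2 and 7 share no common factor.

```python
NUM_SERVERS = 2
NUM_BUCKETS = 7  # prime, coprime with the server count

# Same set of even hashes routed to server #1 as before...
hashes_on_server1 = [h for h in range(1000) if h % NUM_SERVERS == 0]

# ...but mod a prime they cover all 7 bucket indices.
buckets_used = {h % NUM_BUCKETS for h in hashes_on_server1}

print(len(buckets_used))  # 7 -- every bucket is reachable
```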
The bigger question: Are you actually seeing a server that's
bottlenecked on the cost of computing the key hash? In our tests,
hashing was so cheap that I didn't even bother moving the hash
computation outside the mutex-protected part of the code. It is
lightning fast because the entire key tends to fit in the CPU's L1
cache, so you pay basically nothing to scan it and do the simple hash
computation memcached uses internally.
If you're seeing lots of lock contention in threaded mode and hashing
looks like a culprit, I'd try moving the hash computation up into the
non-lock-protected code before I'd go messing with the protocol and
breaking compatibility with all existing clients just to put that
minuscule piece of work on the client's plate.
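A hedged sketch of that refactor in Python pseudocode (the names and the toy hash function are illustrative, not memcached's actual internals): compute the key's hash before taking the lock, so the critical section covers only the hashtable access itself.

```python
import threading

lock = threading.Lock()
table = {}  # stand-in for the bucket array

def hash_key(key: bytes) -> int:
    # Placeholder for the server's internal hash function.
    h = 0
    for b in key:
        h = (h * 31 + b) & 0xFFFFFFFF
    return h

def store(key: bytes, value):
    h = hash_key(key)  # cheap, done outside the lock
    with lock:         # lock held only for the table update
        table[h] = value

store(b"foo", 42)
```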
-Steve