locator servers for memcache

Wed, 17 Dec 2003 20:03:02 -0800 (PST)

Avva and I were talking today about the memcached coherency problem after
clients do rehashing (often out of sync with each other).

One solution (which is lame), is to disable hashing in the client.  The
Perl API supports that, but performance would suck.  The alternative also
sucks, which is things getting all out of sync.

We think we've come up with a solution.  Chat log:

(19:16:25) brad: yo, got time to talk?
(19:16:27) brad: about memcache
(19:17:06) avva: yeah, sure
ugh, you have any idea if the feed's title should also specify mode=escaped?
(19:17:32) avva: I'm losing count on how many times we escape stuff, every time I think about it.
(19:18:05) brad: okay, remember my original anti-netsplit idea?
(19:18:15) avva: yes, I remember.
(19:18:25) brad: peers broadcast what buckets they're handling
(19:18:59) brad: well, that breaks down a little... because once something's rehashed (our support for which sucks, actually), multiple people will be handling "bucket 5"
(19:19:11) brad: the people were were once bucket A rehashed to 5, and the real 5
(19:19:12) avva: yeah, I kinda hated it
because I don't want servers to be aware of each other
but I understand that we may not be able to solve the problem without that awareness...
(19:19:48) brad: what we need is fine-grained buckets (say, a 16-bit hash, or more), and map those to actual buckets
(19:20:01) brad: then peers advertise what virtual buckets they're handling
(19:20:19) brad: so say we have 10 real buckets
(19:20:23) avva: hmm, ok, spell it out a bit more fully.

(19:20:33) avva: 10 real servers, you mean
(19:20:43) brad: key "foo" is mapped to logical bucket 234983  and 10 real buckets (maybe over 3 real servers)
(19:21:07) brad: so 234983 % 10 is bucket 3
(19:21:24) brad: but 3 is down.  we take it out, rehash... now 5.  5 is up.
(19:21:30) brad: so we send the key to bucket 5 (server, say, 2)
(19:21:34) brad: and say:
(19:21:44) brad: foo = bar  (logical bucket 234983)
(19:21:54) brad: and that's what they all fight over.
(19:22:43) avva: 1. if we standartise the hash, we don't have to send the logical bucket to the server, it can calculate it from the key.
2. can you elaborate on how they fight over it?
(19:23:10) brad: 1.  perhaps.  but i don't wanna force our hash onto clients.  and the extra CPU on our side.
(19:23:15) brad: 2. yeah
(19:23:31) brad: for each item, we keep track of time we got it, and what virtual bucket it's for
(19:23:49) brad: and we also listen for "I'm handling this bucket" announcements
(19:24:02) brad: we invalidate all our items for that virtual bucket when we hear that
(19:24:32) brad: just like you already do
(19:24:33) brad: only on get
(19:24:56) brad: we make an announcement about handling a bucket when we get a new one we haven't handled in so much time
(19:25:09) avva: there're 65536 virtual buckets, that's quite a few announcements to make.
(19:25:13) brad: basically everybody needs to converge upon an understanding of who is handling what partition
(19:25:23) brad: yeah, i thought about that too.  so maybe less.
(19:25:36) brad: 16k?  8k?
(19:25:50) avva: 512, or configurable. lemme think over what you just described, one minute.
(19:26:14) brad: k.  brb.
(19:29:04) avva: so the plan is, say server A drops out, and it was real bucket 5. clients will now send all keys that mapped to 5 - to 6, this will hold for a few hundred virtual buckets, say.
when 6 gets those keys, it'll announce its handling of those virtual buckets to the world, but 5 isn't hearing it, it's cut off from the network or whatever.
now 5 comes back and suddenly it's 5 again. it doesn't hear 6's announcement of those buckets because that was a while ago and 6 got used to handling them and doesn't send announcements anymore. so 5 feels free to deal out old values for those keys. am I missing something here?

(19:29:07) brad: the key points here are that clients must rehash deterministically from virtual bucket to the same alive bucket.  and servers only deal with virtual buckets, so a dead host's contents can be distributed.
(19:29:41) brad: otherwise, before, say both "foo" and "bar" mapped to bucket 5.  5 goes down.  now we rehash... "foo" goes to 7 and "bar" goes to 8.  that's desirable.
(19:29:46) brad: we don't want all going to 6
(19:30:00) brad: otherwise a big machine could barf on a little machine.
(19:30:07) avva: yes, that's true, ok.
(19:30:16) brad: reading your reply
(19:31:29) brad: servers should also announce and respect the request if they haven't heard an announcement recently.
(19:31:41) brad: but if they have, they should deny the request.
(19:31:56) brad: it might also be good to send along the "rehash count" to the servers
(19:32:07) brad: but i haven't thought about that fully
(19:34:43) avva: another thing I'm not sure this solves is the problem of different ideas about buckets on different clients.
say client A tries to connect to server X and can't do it for some momentary reason, maybe a spike in traffic, a random error, whatever. or it has a "parsing error" (do you still see those in the logs?) anyway, it marks X as dead and rehashes everything. but X is actually fine, so other clients at the same time continue to use it in their bucket schemes. in your contention/announcement idea this would quickly get hairy between servers, I think
(19:36:12) avva: maybe what we should have is some dynamically maintained list of servers rather than the one read from ljconfig.pl by clients?
say on client startup, let it read the current server list from one of the servers, a designated one, whatever. when client marks server as dead, it sends that info to servers, who share it between themselves and give it to starting clients when they ask for it. something like that. not sure.
(19:37:08) avva: we should strive to make the clients have the same idea about alive/dead servers, always, with  very little timeframe for disagreements over this.
(19:37:35) brad: yeah, i agree.
(19:37:46) brad: the server/bucket list on the servers would be ideal
(19:38:00) brad: should be a command which propogates it to all
(19:38:05) brad: and all keep track of dead hosts
(19:38:09) avva: yes
(19:40:42) avva: or you could have something else
you could have a special-purpose server for keeping buckets/dead servers. it's not memcached, but built like it, very fast. it's responsible for translating virtual buckets into real buckets. clients consult it on startup, or whenever they can't get a connection or get an error from a server. so it won't have much traffic this way, only steady trickle of starting clients and a spike when a server goes down.
the advantage is that it doesn't have to negotiate anything with anyone, it gets info from clients/servers and makes a decision for all. however, it'll be a single point of failure.
(19:41:36) brad: so let's make it run in a pair config
(19:42:02) brad: or let us run 'n' of them, and they all talk to each other with TCP, not UDP/broadcast
(19:42:12) brad: on startup, they all need to know each other.
(19:42:23) brad: this is a common strategy, ya know
(19:42:26) avva: since clients consult it on startup, they'll get from it that server 5 is dead, so they won't try to connect to 5 when 5 comes alive again. they'll only try to conncet to 5 again when the bucket server tells them to, and that happens when we manually tell it to do that (it'll inform us of servers going down), or maybe a server will broadcast its readiness every minute.
(19:42:26) brad: with cluster systems
(19:42:57) brad: this daemon could even be perl.
(19:43:05) brad: nothing fancy needed in it.
(19:43:08) avva: with n of them, it becomes a bit heavy perhaps
too convolutes
m memcached servers, n bucket servers
(19:43:08) brad: not high load.
(19:43:15) avva: yes, that's true
(19:43:24) brad: clients aren't configured to talk to memcaches
(19:43:26) brad: they get that from the list
(19:43:39) avva: need to think on an algorithm of when it decides to stop trying to connect to a server
say when it gets errors for this server from 3 different clients, not just one
(19:43:43) brad: from the MLS (memcache locator server)
(19:44:05) brad: the MLS could try to connect on its own
(19:44:12) brad: with a non-blocking select, just like perl client
(19:44:14) avva: so clients are configured for MLSes in ljconfig.pl?
(19:44:18) brad: and it's the final authority on what's dead
(19:44:25) avva: yes, it could, good idea
(19:44:27) brad: yeah, just MLS hosts in config
(19:44:32) brad: and they get memcache list from it
(19:46:38) avva: ok, this is beginning to sound like a nice strategy.
this only solves the clientside coherency problem though.
not the stale data problem. when 5 comes back to life, eventually we want MLSes to mark it live again and send virtual buckets to it.
at this point MLSes can connect to it/other memcache servers and invalidate some virtual buckets. right?
(19:47:15) brad: dina's on the phone
(19:47:37) brad: um, MLS servers can send the memcaches the "delete all" command too
(19:47:40) brad: thoughts on that?
(19:47:51) avva: also possible

(19:48:10) avva: depends on how long they were "dead" maybe. I'll think more about that
(19:48:41) avva: you want to post this log to the memcache list maybe?
(19:49:33) brad: MLS doing delete_all commands would also solve this problem we have now:
(19:49:43) brad: put on host A key foo
(19:49:50) brad: delete key foo.  tries host A.  rehashed to B.
(19:49:55) brad: nothing is deleted from B.
(19:49:57) brad: get foo
(19:50:01) brad: comes from A.
(19:50:03) brad: shouldn't exist.
(19:50:08) brad: now, with new system, works like this:
(19:50:14) brad: client connects to A to delete foo
(19:50:17) brad: but can't connect to A
(19:50:25) brad: so client tells MLS that A is unavailable
(19:50:42) brad: if MLS gets enough of these requests quick enough (or one at all?)
(19:50:47) brad: then it rehashes the virtual buckets
(19:50:54) brad: and deletes everything from "A" as soon as it can
(19:51:02) brad: since the quality of its data is now suspect
(19:51:06) avva: yes
(19:51:34) brad: i'm liking this more and more
(19:51:41) brad: yeah, go ahead and post this all
(19:51:41) avva: one at all sounds too harsh, we'd be losing lots of data due to transient failures on client side
(19:52:01) avva: my client doesn't allow easy export of logs
or I haven't figured it out yet
will you?
(19:52:01) brad: how about 1 at all, once verified by MLS
(19:52:11) brad: sure, i can
(19:52:15) avva: yes, that's better I think
(19:52:54) brad: then after MLS: single LRU.  but the 64 byte chunking scares me.  think how many page tables that is.  TLB thrashing and such, I fear.
(19:53:47) brad: it might be better to keep a data structure of contiguous areas, keyed by size
(19:54:06) brad: so if we have a 1MB item we just fragment it in maybe 30 places
(19:54:14) brad: instead of 1MB/64 bytes
(19:54:41) avva: yeah, I thought about that too
but then we start losing some memory again
over inefficiency of division into areas
(19:54:59) brad: naah, we still have 64 byte boundaries
(19:55:15) brad: so worst case it's no worse than our "everything is 64 byte chunks" case
(19:55:32) brad: it's not 4 byte granular like malloc
(19:56:06) avva: yeah, I see
(19:56:57) brad: this also paves the way towards background memory defragmentation.
(19:57:18) brad: once we have chunks, we can move chunks around since they contain forward pointers.
(19:57:43) brad: but that's probably more fun than useful.
(19:58:02) brad: so far we have awesome CPU usage (none), I'd hate to see it go to shit for some reason
(19:58:13) avva: *nod*
(20:00:55) brad: i'm off to get food.