sean at chittenden.org
Wed Dec 22 11:49:23 PST 2004
>> Yeah... this is a gaping hole in the API at the moment. There are
>> potential solutions to this. The first being:
>> *) Move the server lists to shared memory
>> *) Provide a tool that administrates the lists in shared memory.
> In my experience using mmap(2)ed files is easier than dealing with SYSV
> shared memory.
OOOH! Brilliant... I should've thought of that earlier (no musing on
the sidelines for those that have seen me go several rounds w/
PostgreSQL and advocating the use of mmap(2)).
>> This would work fine and dandy, but is a less than perfect solution as
>> it requires manual management of server lists. The ideal/correct
>> solution is spiffy neato and makes all of this kinda foo automagic.
> Writing a small daemon process that reads a list of servers from a
> configuration file, tries to connect to each server, issues a couple of
> commands and modifies the list of active servers stored in shared
> memory according to the results, should be quite easy. That's the same
> method every load balancer uses for availability checks (i.e. doing
> out-of-band checks).
Yup. mmap(2) the file as private to begin with, write out the binary
server list to the file, then msync(2) the file to make the changes
live. I need to test to make sure that'd work, but I believe that all
locking and semaphore use would get removed and handled by the VM
system. This would all be conditional on a functioning VM, which isn't
necessarily the case in all OSes, even still. Last I checked, a VM w/
a unified buffer cache is scarce outside of the *BSD realm... still,
POSIX should mandate that the above works assuming a non-broken mmap(2).
> Another thing: I think the way the distribution to the servers is done
> should be changed. Currently you are doing a "hash % number of live
> servers". You should do a "hash % number of configured servers". If it
> hits a server which is down, rehash the request (e.g. append "foo" to
> the key and calculate the hash) and loop, unless you find a server
> which is alive (of course the "all servers down" case should be treated
> specially, returning an error without any hashing at all). Otherwise
> you lose the whole cache due to distribution changes after a server
> goes down. With rehashing, only the contents of the server that went
> down is lost.
I thought about that and it's something that I've got on my TODO list.
The reason I didn't is because of a chunk of code that I have XXX'ed.
If a server goes down or fails for some reason, should it try the next
available server? I wasn't able to answer that question decisively
last time I spent any amount of time grokking it, but it's probably high
time I spent a few cycles pondering that scenario. I've been putting
it off until I get the binary protocol work underway, but I see no
reason to not take a stab at getting that working correctly.
Would callers want a command to fail if a server fails, or would they
want the library to repeat the call to another cache until a server
succeeds or the server list is empty? I couldn't answer that last time
because I didn't want to cause application delays... but maybe someone
has a convincing argument that'd push me one way or another.
> The tricky part is how to avoid losing the contents again if the
> server comes back alive. Maybe it would be possible to include a flag
> "this server A was offline during the last 24 hours". If a key isn't
> found on server A, then the client should try to fetch the data from
> server X, which was the server to be used while server A was down.
> Can you follow me? :)
Yup... I think the solution to your above paragraph is going to come w/
the next major version of memcached and the proposed bits from my email
last night to John McCaskey describing how server lists were going to
be automatically managed by the daemons.
> Another nice idea would be a local memcache-proxy on each client
> machine. The application talks to this proxy instead of communicating
> directly with the memcached servers. The proxy does the distribution
> to the different memcached instances. Write requests get a special
> treatment: each is written to two memcached instances. If one memcached
> goes down, the proxy knows the corresponding backup instance and can
> request data from this memcached.
This is possible, but isn't something that I'm going to spend time on
in the meantime. When handling the binary protocol for memcached, I'm
probably going to facilitate this particular daemon popping into
existence by having two libraries:
libmemcache(3), a client library responsible for sending requests and
reading server responses; and
libmemcached(3), a server library responsible for reading requests and
sending responses.
The two libraries would complement each other, but would be separate to
minimize space/RAM. libmemcache(3) is MIT licensed to maximize
embeddability in any widget your mind can wrap itself around.
> I don't know how to get the atomic increase/decrease right, however
> ... And
> may be there are problems with the whole idea which haven't occurred to
> me yet? :)
Just wait for the failover routines that will come down the pipe. What
you're suggesting, however, of a redundant cache, isn't a bad idea and
is one that I haven't honestly considered yet... but probably should.
The server topology could get rather complex, but would look something
like this: trackers maintain server lists according to the memcached
servers that are connected to them. Memcached servers can connect to an
arbitrary server and sync the contents of keys to a peer server; that
way, if one goes down, the client can pick from one of the servers that
handles the virtual bucket range (preferring the primary if possible).
Before the client receives a response from the memcached server, the
memcached server would have to receive a response from its replication
peers. I have a good idea of how to handle this without causing any
performance degradation in the non-redundant case, which gives me warm
fuzzy bunny feelings.
To pull that off, memcached(8) would have to act as a client and a
server. Since none of us would want to introduce threading into
memcached(8), I'd have to start thinking about adding an async IO
interface to libmemcache(3) to prevent the daemon from blocking while
it's waiting for a response from its client. Suddenly I see why Y!
uses the trick of a kqueue(2) of kqueue(2) descriptors to handle some
of its tasks.... *ponders*
> I know that reliability isn't the no. 1 goal of memcached. But using
> it as quick and easy session storage is tempting. MySQL Cluster brings
> licensing issues (commercial licence needed in many cases) and has
> problems of its own (when I tested a four node setup with one node
> crashing, I couldn't bring it alive due to "Unable to alloc node id"
> errors. Somehow the old node id was "stuck". The only solution was to
> restart the management server).
You won't hear a positive thing out of my mouth about MySQL... ever....
except that their marketing and advertising depts do a very good job
and should be commended for their ability to sell and create "buzz."
Getting replication added into memcached(8) is something that one of my
clients would very much like to see, and given that it can be done with
zero overhead in the non-replicated case, it's very likely it could
happen. -sc