sean at chittenden.org
Wed Dec 22 11:49:23 PST 2004
>> Yeah... this is a gaping hole in the API at the moment. There are
>> potential solutions to this. The first being:
>> *) Move the server lists to shared memory
>> *) Provide a tool that administrates the lists in shared memory.
> In my experience using mmap(2)ed files is easier than dealing with SYSV
> shared memory.
OOOH! Brilliant... I should've thought of that earlier (no musing on
the sidelines for those that have seen me go several rounds w/
PostgreSQL and advocating the use of mmap(2)).
>> This would work fine and dandy, but is a less than perfect solution as
>> it requires manual management of server lists. The ideal/correct
>> solution is spiffy neato and makes all of this kinda foo automagic.
> Writing a small daemon process that reads a list of servers from a
> configuration file, tries to connect to each server, issues a couple of
> commands and modifies the list of active servers stored in shared
> memory according to the results, should be quite easy. That's the same
> method every load balancer uses for availability checks (i.e. doing
> out-of-band checks).
Yup. mmap(2) the file as private to begin with, write out the binary
server list to the file, then msync(2) the file to make the changes
live. I need to test to make sure that'd work, but I believe that all
locking and semaphore use would get removed and handled by the VM
system. This would all be conditional on a functioning VM, which isn't
necessarily the case in all OSes, even still. Last I checked, a VM w/
a unified buffer cache is scarce outside of the *BSD realm... still,
POSIX should mandate that the above works assuming a non-broken mmap(2).
> Another thing: I think the way the distribution to the servers is done
> should be changed. Currently you are doing a "hash % number of live
> servers". You should do a "hash % number of configured servers". If it
> hits a server which is down, rehash the request (e.g. append "foo" to
> the key and calculate the hash) and loop, unless you find a server
> which is alive (of course the "all servers down" case should be treated
> specially, returning an error without any hashing at all). Otherwise
> you lose the whole cache due to distribution changes after a server
> goes down. With rehashing, only the contents of the server that went
> down is lost.
I thought about that and it's something that I've got on my TODO list.
The reason I didn't is because of a chunk of code that I have XXX'ed.
If a server goes down or fails for some reason, should it try the next
available server? I wasn't able to answer that question decisively
last time I spent any amount of time grokking it, but it's probably high
time I spent a few cycles pondering that scenario. I've been putting
it off until I get the binary protocol work underway, but I see no
reason to not take a stab at getting that working correctly.
Would callers want a command to fail if a server fails, or would they
want the library to repeat the call to another cache until a server
succeeds or the server list is empty? I couldn't answer that last time
because I didn't want to cause application delays... but maybe someone
has a convincing argument that'd push me one way or another.
> The tricky part is how to avoid losing the contents again if the
> server comes back alive. Maybe it would be possible to include a flag
> "this server A was offline during the last 24 hours". If a key isn't
> found on server A, then the client should try to fetch the data from
> server X, which was the server to be used while server A was down.
> Can you follow me? :)
Yup... I think the solution to your above paragraph is going to come w/
the next major version of memcached and the proposed bits from my email
last night to John McCaskey describing how server lists were going to
be automatically managed by the daemons.
> Another nice idea would be a local memcache-proxy on each client
> machine. The application talks to this proxy instead of communicating
> directly with the memcached servers. The proxy does the distribution
> to the different memcached instances. Write requests get a special
> treatment: each is written to two memcached instances. If one memcached
> goes down, the proxy knows the corresponding backup instance and can
> request data from this memcached.
This is possible, but isn't something that I'm going to spend time on
in the meantime. When handling the binary protocol for memcached, I'm
probably going to facilitate this particular daemon popping into
existence by having two libraries:
libmemcache(3), a client library responsible for sending requests and
reading server responses; and
libmemcached(3), a server library responsible for reading requests and
sending responses.
The two libraries would complement each other, but would be separate to
minimize space/RAM. libmemcache(3) is MIT licensed to maximize
embeddability in any widget your mind can wrap itself around.
> I don't know how to get the atomic increase/decrease right, however
> ... And
> may be there are problems with the whole idea which haven't occurred to
> me yet? :)
Just wait for the failover routines that will come down the pipe. What
you're suggesting, however, of a redundant cache, isn't a bad idea and
is one that I haven't honestly considered yet... but probably should.
The server topology could get rather complex, but would look something
like this: trackers maintain server lists according to the memcached
servers that are connected to them. Memcached servers can connect to an
arbitrary server and sync the contents of keys to a peer server; that
way, if one goes down, the client can pick from one of the servers that
handles the virtual bucket range (preferring the primary if possible).
Before the client receives a response from the memcached server, the
memcached server would have to receive a response from its replication
peers. I have a good idea of how to handle this without causing any
performance degradation in the non-redundant case, which gives me warm
fuzzy bunny feelings.
To pull that off, memcached(8) would have to act as a client and a
server. Since none of us would want to introduce threading into
memcached(8), I'd have to start thinking about adding an async IO
interface to libmemcache(3) to prevent the daemon from blocking while
it's waiting for a response from its client. Suddenly I see why Y!
uses the trick of a kqueue(2) of kqueue(2) descriptors to handle some
of its tasks.... *ponders*
> I know that reliability isn't the no. 1 goal of memcached. But using
> it as quick and easy session storage is tempting. MySQL Cluster brings
> licensing issues (commercial licence needed in many cases) and has
> problems of its own (when I tested a four node setup with one node
> crashing, I couldn't bring it alive due to "Unable to alloc node id"
> errors. Somehow the old node id was "stuck". The only solution was to
> restart the management server).
You won't hear a positive thing out of my mouth about MySQL... ever....
except that their marketing and advertising depts do a very good job
and should be commended for their ability to sell and create "buzz."
Getting replication added into memcached(8) is something that one of my
clients would very much like to see, and given that it can be done with
zero overhead in the non-replicated case, it's very likely it could
happen. -sc