libmemcache(3) 1.1.0rc4...

Sven Paulus sven at karlsruhe.org
Wed Dec 22 13:09:39 PST 2004


On 22.12., Sean Chittenden wrote:
> Yup.  mmap(2) the file as private to begin with, write out the binary 
> server list to the file, then msync(2) the file to make the changes 
> live.  I need to test to make sure that'd work, but I believe that all 
> locking and semaphore use would get removed and handled by the VM 
> system.  This would all be conditional on a functioning VM, which isn't 
> necessarily the case in all OSes, even still.  Last I checked, a VM w/ 
> a unified buffer cache is scarce outside of the *BSD realm... still, 
> posix should mandate that the above works assuming a non-broken mmap(2) 
> implementation.

At least after msync(2) a unified view should be provided to all processes,
right? Then you could split your mmap(2)'ed file into two sections (each
starting on a page boundary, just in case) plus an additional header section,
update section two while section one is being read by the clients, then flip a
single bit in the header section to indicate that section two is now active,
and msync(2). A few seconds later you do the next check of all servers and
update section one. This way the clients should never see an inconsistent
server list.
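
To make that concrete, here is a minimal sketch of the writer side; the
layout, the names and the one-page section size are my assumptions, not
anything from libmemcache:

#include <string.h>
#include <sys/mman.h>

#define SECTION_SIZE 4096                 /* one page per section (assumed) */

struct list_header {
    unsigned char active;                 /* 0 = section one, 1 = section two */
    unsigned char pad[SECTION_SIZE - 1];
};

int
update_server_list(int fd, const char *new_list, size_t len)
{
    size_t total = 3 * SECTION_SIZE;      /* header + two list sections */
    char *base = mmap(NULL, total, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    struct list_header *hdr;
    char *dst;
    int inactive;

    if (base == MAP_FAILED)
        return -1;

    hdr = (struct list_header *)base;
    inactive = hdr->active ? 0 : 1;       /* write into the unused section */
    dst = base + SECTION_SIZE + (size_t)inactive * SECTION_SIZE;

    memcpy(dst, new_list, len < SECTION_SIZE ? len : SECTION_SIZE);
    msync(dst, SECTION_SIZE, MS_SYNC);    /* publish the new list ... */

    hdr->active = (unsigned char)inactive;
    msync(base, SECTION_SIZE, MS_SYNC);   /* ... then flip the switch */

    munmap(base, total);
    return 0;
}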

> >With rehashing only the contents of the server that went down is lost.
> I thought about that and it's something that I've got on my TODO list.  
> The reason I didn't is because of a chunk of code that I have XXX'ed.  
> If a server goes down or fails for some reason, should it try the next 
> available server?  I wasn't able to answer that question decisively 

Hm, if you have an independent process doing the checking, you should use the
"hash again if the hash targets a server that is marked down" approach. If
the library has to detect availability itself just by trying, I'm not sure.
But keeping the hash stable and rehashing for failed servers never hurts:
 
current implementation:
key "AAA" -> server 4
key "BBB" -> server 2 -> broken -> remove server -> return error
key "BBB" -> server 3
key "AAA" -> server 1  (because num_live_servers was decreased)

with rehashing:
key "AAA" -> server 4
key "BBB" -> server 2 -> broken -> mark server "down" -> return error
key "BBB" -> server 2, but is marked "down" -> rehash -> server 3
key "AAA" -> server 4

If a site relies heavily on memcache, the massive cache miss rate after a
single memcached instance goes down might break their neck, because the
database servers can't cope with the increased load. And usually they don't
recover until manual intervention or the night comes (or the day, depending
on the kind of site ...).

Using the second approach is cheap in my eyes: it's just one more array to
check (by index).
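
As a rough sketch of that "just another array" check (all names are made up
here, not the real libmemcache internals, and a real implementation might
re-run the hash instead of probing linearly):

#define MAX_SERVERS 16

struct server_pool {
    int num_servers;                  /* stays constant across failures */
    unsigned char down[MAX_SERVERS];  /* 1 = marked down by the checker */
};

/* Pick a server for a key hash, skipping servers marked down. */
int
pick_server(const struct server_pool *pool, unsigned int hash)
{
    int i, idx;

    for (i = 0; i < pool->num_servers; i++) {
        idx = (int)((hash + i) % pool->num_servers); /* list never shrinks */
        if (!pool->down[idx])
            return idx;
    }
    return -1;                        /* every server is down */
}

Keys whose primary server is still up keep hashing to the same place, so only
the keys of the dead server move.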

> last time I spent any amount of time groking it, but it's probably high 
> time I spent a few cycles pondering that scenario.  I've been putting 
> it off until I get the binary protocol work underway, but I see no 
> reason to not take a stab at getting that working correctly.
> 
> Would callers want a command to fail if a server fails, or would they 
> want the library to  repeat the call to another cache until a server 
> succeeds or the server list is empty?  I couldn't answer that last time 
> because I didn't want to cause application delays... but maybe someone 
> has a convincing argument that'd push me one way or another.

Hm, add a mc_ctl(struct memcache *mc, int type, void *data) function and
let the users do a mc_ctl(mc, MCM_RESUBMIT_ON_ERROR, (void *)1) after
mc_new(). With RESUBMIT enabled the above might look like:

with rehashing and resubmit after errors:
key "AAA" -> server 4
key "BBB" -> server 2 -> broken -> mark server "down" -> rehash -> server 3
key "BBB" -> server 2, but is marked "down" -> rehash -> server 3
key "AAA" -> server 4

> Yup... I think the solution to your above paragraph is going to come w/ 
> the next major version of memcached and the proposed bits from my email 
> last night to John McCaskey describing how server lists were going to 
> be automatically managed by the daemons.

Great :-)

> This is possible, but isn't something that I'm going to spend time on 
> in the meantime.  When handling the binary protocol for memcached, I'm 
> probably going to facilitate this particular daemon popping into 
> existence by having two libraries:
> 	libmemcache(3), a client library responsible for sending requests 
> 	and reading server responses; and
> 	libmemcached(3), a server library responsible for reading requests 
> 	and sending responses.

That's a good idea. And the two libraries can share some source code
(protocol definitions, structures etc.), instead of duplicating the work.

> Trackers maintain server lists according to memcached servers that are 
> connected to them.  Memcached servers can connect to an arbitrary server 
> and sync the contents of keys to a peer server; that way, if one goes 
> down, the client can pick from one of the servers that handles the 
> virtual bucket range (preferring the primary if possible).  Before the 
> client receives a response from the memcached server, the memcached 
> server would have to receive a response from its replication peers.  I 
> have a good handle on how to handle this without causing any 
> performance degradation in the non-redundant case, which gives me warm 
> fuzzy bunny feelings.

This sounds interesting. But of course it adds complexity to the server.
Doing the replication (by duplicating write requests to multiple servers)
in proxy servers in front of the memcached instances would keep the
memcached process simpler, and the proxy would be completely optional. But
with the proxy approach some things might be impossible (e.g. atomic
updates). Your idea of server<->server communication could be a solution
for this, yes.

> To pull that off, memcached(8) would have to act as a client and a 
> server.  Since none of us would want to introduce threading into 
> memcached(8), I'd have to start thinking about adding an async IO 
> interface to libmemcache(3) to prevent the daemon from blocking while 
> it's waiting for a response from its client.  Suddenly I see why Y! 
> uses the trick of a kqueue(2) of kqueue(2) descriptors to handle some 
> of its tasks....  *ponders*

But if the updates to the slave memcached instances are asynchronous, you
lose the guarantee of atomic updates happening on all replicas.

But doing ...

client request 
  -> memcached 1 receives request 
  -> memcached 1 processes request 
  -> memcached 1 duplicates requests to memcached 2 
     -> memcached 2 receives request 
     -> memcached 2 processes request 
     -> memcached 2 sends response to memcached 1
  -> memcached 1 reads response
  -> memcached 1 sends response to client
-> client reads response

... would really hurt performance.
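
Roughly, on memcached 1 that chain would look something like this (purely
hypothetical helper, blocking I/O, hardly any error handling); the point
being that the peer's reply sits on the critical path before the client
sees anything:

#include <sys/types.h>
#include <unistd.h>

/* memcached 1: apply a write locally, replicate it synchronously to
 * memcached 2, and only then acknowledge the client. */
ssize_t
replicate_and_reply(int client_fd, int peer_fd, const char *req, size_t len,
    char *buf, size_t buflen)
{
    ssize_t n;

    /* 1. process the request locally (omitted) */

    /* 2. forward the request to the peer and block on its reply;
     *    this is the extra network round trip */
    if (write(peer_fd, req, len) != (ssize_t)len)
        return -1;
    n = read(peer_fd, buf, buflen);
    if (n <= 0)
        return -1;

    /* 3. only now send the response back to the client */
    return write(client_fd, "STORED\r\n", 8);
}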

> You won't hear a positive thing out of my mouth about MySQL... ever.... 

Hm, I like MySQL itself for simple and fast applications. But if my bank
told me that they store my savings in MySQL, I might get a little worried
;-) And concerning MySQL Cluster, I think it's just not ready for prime
time yet.

> except that their marketing and advertising depts do a very good job 
> and should be commended for their ability to sell and create "buzz."  
> Getting replication added into memcached(8) is something that one of my 
> clients would very much like to see added and given it can be done with 
> zero overhead in the non-replicated case means it's very likely it 
> could happen.  -sc

I'm looking forward to it :-)

Sven

