libmemcache(3) 1.1.0rc4...

Sean Chittenden sean at
Wed Dec 22 15:22:43 PST 2004

>> Yup.  mmap(2) the file as private to begin with, write out the binary
>> server list to the file, then msync(2) the file to make the changes
>> live.  I need to test to make sure that'd work, but I believe that all
>> locking and semaphore use would get removed and handled by the VM
>> system.  This would all be conditional on a functioning VM, which
>> isn't necessarily the case in all OSes, even still.  Last I checked, a
>> VM w/ a unified buffer cache is scarce outside of the *BSD realm...
>> still, POSIX should mandate that the above works assuming a non-broken
>> mmap(2) implementation.
> At least after msync(2) a unified view should be provided to all
> processes, right?  Then you could split your mmap(2)ed file into two
> sections (each starting on page boundaries, just in case) and an
> additional header section, update section two while section one is
> read by the clients, then switch a single bit in the header section
> to indicate that section two is now active and msync(2).  After a few
> seconds you do the next check of all servers and update section one.
> This way no inconsistencies should be seen by the clients.

I'm not worried about page boundaries; a good VM will handle that.  I'm 
more worried about making changes to the list atomic and avoiding the 
use of semaphores or some other kind of synchronization.  msync(2) 
should flush and relink the vnodes atomically....  There are a few 
issues to watch out for in making sure that the mmap(2)'ed regions land 
somewhere the list can grow with ease....  I don't see a server list 
growing beyond getpagesize(3) bytes, but... *shrug*  better safe than 
sorry.
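The double-buffered layout described above could be sketched roughly
like this: one header page plus two page-aligned list sections in a
MAP_SHARED file.  Readers follow the header's active bit; the writer
fills the inactive section, msync(2)s it, then flips the bit.  All
names here (mc_maphdr, mc_map_update, the file layout itself) are
hypothetical, not part of libmemcache(3):

```c
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

struct mc_maphdr {
	uint32_t active;		/* which section readers should use */
};

int
mc_map_update(const char *path, const char *server_list)
{
	long pg = sysconf(_SC_PAGESIZE);
	size_t len = 3 * (size_t)pg;	/* header + two sections */
	int fd, rc = -1;
	char *base;

	if ((fd = open(path, O_RDWR | O_CREAT, 0644)) == -1)
		return -1;
	if (ftruncate(fd, (off_t)len) == -1)
		goto out;

	base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (base == MAP_FAILED)
		goto out;

	struct mc_maphdr *hdr = (struct mc_maphdr *)base;
	uint32_t next = 1 - hdr->active;	/* the inactive section */
	char *sect = base + ((size_t)next + 1) * (size_t)pg;

	/* Write the new list where nobody is reading, sync it, then
	 * flip the header bit so readers cut over to a fully written
	 * section in one step. */
	strncpy(sect, server_list, (size_t)pg - 1);
	if (msync(sect, (size_t)pg, MS_SYNC) == -1)
		goto unmap;
	hdr->active = next;
	if (msync(base, (size_t)pg, MS_SYNC) == -1)
		goto unmap;
	rc = 0;
unmap:
	munmap(base, len);
out:
	close(fd);
	return rc;
}
```

This assumes an aligned 32-bit store is atomic on the target platform,
which is exactly the "functioning VM" caveat from the quote above.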

>>> With rehashing, only the contents of the server that went down are
>>> lost.
>> I thought about that and it's something that I've got on my TODO list.
>> The reason I didn't is because of a chunk of code that I have XXX'ed.
>> If a server goes down or fails for some reason, should it try the next
>> available server?  I wasn't able to answer that question decisively
> Hm, in case you have an independent process checking, you should use
> the "hash again if distribution targets a server that is marked down"
> approach.  In case of the library doing the availability check just
> by trying for itself, I don't know.  But keeping the hash stable and
> rehashing for failed servers never hurts:

Well, I'm not going to rely on an independent process to fix up the 
server lists... having a shared server list via mmap(2) sounds good to 
me, but it's going to get summarily gutted as soon as the new protocol 
hits memcached.

> current implementation:
> key "AAA" -> server 4
> key "BBB" -> server 2 -> broken -> remove server -> return error
> key "BBB" -> server 3
> key "AAA" -> server 1  (because num_live_servers was decreased)
> with rehashing:
> key "AAA" -> server 4
> key "BBB" -> server 2 -> broken -> mark server "down" -> return error
> key "BBB" -> server 2, but is marked "down" -> rehash -> server 3
> key "AAA" -> server 4
> If a site relies heavily on memcache, the massive cache failure rate
> after a single memcached instance goes down might break their neck
> because the database servers can't cope with the increased load.  And
> usually they don't recover until manual intervention or the night
> comes (or the day, depending on the kind of site ...).
> Using the second approach is cheap in my eyes and just another array
> to check (by index).

Patches welcome.  :)
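A sketch of what the quoted "keep the hash stable, rehash past down
servers" idea could look like: the server count never shrinks, so keys
on healthy servers keep their mapping, and only keys that land on a
down server get rehashed.  The hash function and all names here are
stand-ins, not libmemcache(3) internals:

```c
#include <stdint.h>

#define MC_NSERVERS 4

static int mc_down[MC_NSERVERS];	/* 1 = marked down */

static uint32_t
mc_hash(const char *key, uint32_t seed)
{
	uint32_t h = seed * 2654435761u + 7;

	while (*key != '\0')
		h = h * 33 + (unsigned char)*key++;
	return h;
}

int
mc_pick_server(const char *key)
{
	uint32_t seed;
	int i, idx;

	/* num_live_servers stays constant: rehash with a new seed
	 * instead of shrinking the ring when the chosen server is
	 * marked down, so healthy servers keep their keys */
	for (seed = 0; seed < 8; seed++) {
		idx = (int)(mc_hash(key, seed) % MC_NSERVERS);
		if (!mc_down[idx])
			return idx;
	}
	/* pathological case: every rehash hit a down server */
	for (i = 0; i < MC_NSERVERS; i++)
		if (!mc_down[i])
			return i;
	return -1;			/* every server is down */
}
```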

>> last time I spent any amount of time grokking it, but it's probably
>> high time I spent a few cycles pondering that scenario.  I've been
>> putting it off until I get the binary protocol work underway, but I
>> see no reason not to take a stab at getting that working correctly.
>> Would callers want a command to fail if a server fails, or would they
>> want the library to repeat the call to another cache until a server
>> succeeds or the server list is empty?  I couldn't answer that last
>> time because I didn't want to cause application delays... but maybe
>> someone has a convincing argument that'd push me one way or another.
> Hm, add a mc_ctl(struct memcache *mc, int type, void *data) function
> and let the users do a mc_ctl(mc, MCM_RESUBMIT_ON_ERROR, (void *)1)
> after mc_new().  With RESUBMIT enabled the above might look like:

Yeah, that's one of the ideas I've had.  I'm going to wait a bit longer 
for more opinions before I decide what to do and what to make the 
default.
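One way the suggested mc_ctl() knob could look.  mc_ctl(),
MCM_RESUBMIT_ON_ERROR, and the struct memcache field below follow the
quoted proposal, not any shipped libmemcache(3) API:

```c
#include <stddef.h>

#define MCM_RESUBMIT_ON_ERROR 1

struct memcache {
	int resubmit_on_error;	/* retry the next server on failure? */
};

int
mc_ctl(struct memcache *mc, int type, void *data)
{
	switch (type) {
	case MCM_RESUBMIT_ON_ERROR:
		/* data is used as a boolean flag, per the proposal */
		mc->resubmit_on_error = (data != NULL);
		return 0;
	default:
		return -1;	/* unknown control */
	}
}
```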

>> Trackers maintain server lists according to memcached servers that
>> are connected to them.  Memcached servers can connect to an
>> arbitrary server and sync the contents of keys to a peer server;
>> that way, if one goes down, the client can pick from one of the
>> servers that handles the virtual bucket range (preferring the
>> primary if possible).  Before the client receives a response from
>> the memcached server, the memcached server would have to receive a
>> response from its replication peers.  I have a good handle on how to
>> handle this without causing any performance degradation in the
>> non-redundant case, which gives me warm fuzzy bunny feelings.
> This sounds interesting.  But of course it adds complexity to the
> server.  Doing the replication (by duplicating write requests to
> multiple servers) in proxy servers in front of the memcached
> instances would keep the memcached process simpler, and the proxy is
> completely optional.  But with the proxy approach some things might
> be impossible (e.g. atomic updates).  Your idea with server<->server
> communication could be a solution for this, yes.

I'm not keen on the proxy solution, to be honest.  Layering that way 
seems flawed to me... having replication peers seems like a better 
idea, certainly more scalable.  *shrug*

>> To pull that off, memcached(8) would have to act as a client and a
>> server.  Since none of us would want to introduce threading into
>> memcached(8), I'd have to start thinking about adding an async IO
>> interface to libmemcache(3) to prevent the daemon from blocking
>> while it's waiting for a response from its peer.  Suddenly I see why
>> Y! uses the trick of a kqueue(2) of kqueue(2) descriptors to handle
>> some of its tasks....  *ponders*
> But if the updates to the slave memcached instances are asynchronous,
> you lose the guarantee of atomic updates happening on all replicas.

The memcached server making the changes to its peers wouldn't use an 
increment, it would use a set.  Here's the catch: all data has to 
include a tiny version counter.  Not something I'm wild about, but 
there's space in the protocol.
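A sketch of the tiny-version-counter idea: peers replicate with a
versioned set rather than re-running increments, so a stale or
reordered update can be detected and dropped instead of clobbering
newer data.  The struct and function names are illustrative only:

```c
#include <stdint.h>
#include <string.h>

struct mc_value {
	uint64_t version;	/* bumped by the owning server per write */
	char data[64];
};

/* Apply a replicated set only if it is newer than what we hold. */
int
mc_replica_set(struct mc_value *cur, uint64_t version, const char *data)
{
	if (version <= cur->version)
		return -1;	/* stale or duplicate update: drop it */
	cur->version = version;
	strncpy(cur->data, data, sizeof(cur->data) - 1);
	cur->data[sizeof(cur->data) - 1] = '\0';
	return 0;
}
```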

> But doing ...
> client request
>   -> memcached 1 receives request
>   -> memcached 1 processes request
>   -> memcached 1 duplicates requests to memcached 2
>      -> memcached 2 receives request
>      -> memcached 2 processes request
>      -> memcached 2 sends response to memcached 1
>   -> memcached 1 reads response
>   -> memcached 1 sends response to client
> -> client reads response
> ... would really hurt performance.

Yup... but it'd be linear growth for replication as opposed to 
exponential growth, which is livable to me given that most clients who 
would need this want reliability of writes, not speed... it'd be fast 
for writes, don't get me wrong... just not as fast as a non-redundant 
setup.  Fetch requests would still be as snappy as ever, however.


Sean Chittenden
