Why not let the fs handle this ??

Wed Jun 7 17:18:36 UTC 2006

Jon Drukman wrote:
> Steven Grimm wrote:
>> Yes. One common setup is to run a memcached instance on every machine 
>> rather than dedicating machines specifically to it. Have a memcached 
>> on each web server. Since each item is only cached on one server, you 
>> can get by with a relatively small cache on each machine. And then 
>> your database load only goes up a little if you lose or reboot one of 
>> the servers.
> 
> Has this actually worked out well in practice for anybody?  I've found 
> that losing one machine (out of about 100) results in so much db 
> thrashing as the keys get repopulated into different places that the 
> site becomes basically unusable until enough of the cache has been 
> regenerated (5-10 minutes if i'm lucky).
> 
> -jsd-
> 
> 

There have been some threads about this, but in our environment, we DO 
NOT rebalance keys to down/missing servers.  Instead we just deal with 
the cache misses at the DB layer and let the other 99 servers continue 
as planned.

This has a few benefits:

1. When your server comes back up, you don't have to worry about stale 
data being left on old servers should it crash again.

2. You're only out 1/100th of your cache, so the other 99 continue on 
as before and don't get hammered while keys rebalance and the entire 
cache is rebuilt.

The key is to keep that downed 100th machine in your pool, so the key 
allocation algorithm still "counts" it, but to somehow let your 
application know not to write to it while it's in a downed state.
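
As a rough illustration of that idea (a sketch only, not our production 
code), the Python below keeps every server in the pool when hashing and 
simply reports "no server" for keys that land on a downed machine. The 
names FixedPool and server_for, the CRC32-modulo hashing, and the 
host:port strings are all assumptions made up for this example.

```python
import zlib


class FixedPool:
    """Key-to-server mapping that never rebalances around downed servers."""

    def __init__(self, servers):
        self.servers = list(servers)  # full pool, including downed machines
        self.down = set()             # servers currently flagged as down

    def server_for(self, key):
        # Hash against the FULL pool size so the mapping never shifts.
        index = zlib.crc32(key.encode("utf-8")) % len(self.servers)
        server = self.servers[index]
        # A downed server yields None: the caller treats this as a cache
        # miss and goes to the database instead of trying another server.
        return None if server in self.down else server


pool = FixedPool(["cache%03d:11211" % i for i in range(100)])
target = pool.server_for("user:42")
pool.down.add(target)
assert pool.server_for("user:42") is None  # miss, handled at the DB layer
```

Keys that hash to the other 99 servers keep resolving exactly as before, 
which is the whole point: only the downed server's share of keys misses.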

Strangely, we seem to be nearly alone in this approach.  At least, when 
I mention it on the list, people don't seem to get it.  To my mind, it's 
the only sane way to deal with the problem.

In our particular case, any failed request to memcached causes that 
server to be flagged as "down" in our tracker.  Then, asynchronously, 
the state of that server is periodically checked.  When it comes back 
up, it's completely flushed, and then marked as active.  (You have to do 
the flush to get rid of any stale data in case the server was just 
unresponsive, unreachable, or in some other situation short of a hard 
restart.)
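
The down/check/flush/activate cycle described above might look roughly 
like the following Python sketch. It is an assumption-laden outline, not 
the tracker we actually run: ServerTracker and its method names are 
hypothetical, and ping and flush are caller-supplied functions (for 
example, flush could send memcached's flush_all command to the server).

```python
class ServerTracker:
    """Tracks downed cache servers and flushes them before reactivation."""

    def __init__(self, ping, flush):
        self.ping = ping    # hypothetical: ping(server) -> True if reachable
        self.flush = flush  # hypothetical: flush(server) empties that server
        self.down = set()

    def report_failure(self, server):
        # Called by the application whenever a cache request fails.
        self.down.add(server)

    def is_active(self, server):
        return server not in self.down

    def check_down_servers(self):
        # Meant to run periodically, outside the request path.
        for server in list(self.down):
            if not self.ping(server):
                continue  # still unreachable; try again next round
            try:
                # Flush first: the server may hold stale data if it was
                # only unreachable rather than restarted.
                self.flush(server)
            except OSError:
                continue  # flush failed; keep the server marked down
            self.down.discard(server)
```

The ordering matters here: the server is only marked active after the 
flush succeeds, so the application never reads pre-outage data from it.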

If you're using PHP, the PECL extension has some great new enhancements 
(I'm not sure if they've made it into a release yet, but they're 
certainly in pre-release) to enable this sort of high-availability 
functionality to work well with whatever tracking method you use 
externally.

Hope that helps!

Don