Why not let the fs handle this ??
don at smugmug.com
Wed Jun 7 17:18:36 UTC 2006
Jon Drukman wrote:
> Steven Grimm wrote:
>> Yes. One common setup is to run a memcached instance on every machine
>> rather than dedicating machines specifically to it. Have a memcached
>> on each web server. Since each item is only cached on one server, you
>> can get by with a relatively small cache on each machine. And then
>> your database load only goes up a little if you lose or reboot one of
>> the servers.
> Has this actually worked out well in practice for anybody? I've found
> that losing one machine (out of about 100) results in so much db
> thrashing as the keys get repopulated into different places that the
> site becomes basically unusable until enough of the cache has been
> regenerated (5-10 minutes if i'm lucky).
There have been some threads about this, but in our environment, we DO
NOT rebalance keys when a server is down or missing. Instead we just
deal with the missed hits at the DB layer and let the other 99 servers
continue as before.
This has a few benefits:
1. When your server comes back up, you don't have to worry about stale
data being left on old servers should it crash again.
2. You're only out 1/100th of your cache, so the other 99 continue on
as before and don't get hammered while keys rebalance and the entire
cache is rebuilt.
The key is to keep that downed 100th machine in your pool, so the key
allocation algorithm still "counts" it, but to somehow let your
application know not to write to it while it's in a downed state.
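As a rough illustration of that idea (not our actual code), here is a minimal Python sketch. The class name FixedPool, its method names, and the choice of CRC32 modulo the pool size are all assumptions made for the example. The point is only that the hash is always taken over the full pool, and the "down" flag is checked afterwards.

```python
import zlib


class FixedPool:
    """Key-to-server mapping that never rebalances when a server is down.

    Hypothetical sketch: the names and the CRC32-modulo hash are
    illustrative assumptions, not any particular client's API.
    """

    def __init__(self, servers):
        self.servers = list(servers)  # full pool, including downed servers
        self.down = set()             # servers currently flagged as down

    def server_for(self, key):
        # The hash always "counts" every server in the pool, up or down,
        # so a key's home server never changes.
        index = zlib.crc32(key.encode("utf-8")) % len(self.servers)
        return self.servers[index]

    def usable_server_for(self, key):
        # None means: skip the cache for this key and go to the database.
        server = self.server_for(key)
        return None if server in self.down else server
```

With 100 servers in the pool, adding one of them to `down` leaves `server_for()` unchanged for every key. Only keys whose home is the downed server get `None` from `usable_server_for()`, and those are the requests that fall through to the DB.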
Strangely, we seem to be fairly unique in this approach. At least, when
I mention it on the list, people don't seem to get it. To my mind, it's
the only sane way to deal with the problem.
In our particular case, any failures to memcache cause a server to be
flagged as "down" in our tracker. Then, asynchronously, the state of
that server is periodically checked. When it comes back up, it's
completely flushed, and then marked as active. (You have to do the
flush to get rid of any stale data in case the server was just
unresponsive, unreachable, or some other non-hard restart situation).
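A minimal Python sketch of that down/check/flush/activate cycle might look like the following. ServerTracker and its methods are hypothetical names, and `ping` and `flush` are callables you would supply yourself (for example, thin wrappers around your memcached client). This is an assumption-laden illustration, not the tracker we actually run.

```python
class ServerTracker:
    """Tracks which cache servers are flagged as down.

    Hypothetical sketch: `ping(server) -> bool` and `flush(server)` are
    caller-supplied callables, not part of any real client library.
    """

    def __init__(self, ping, flush):
        self.ping = ping
        self.flush = flush
        self.down = set()

    def report_failure(self, server):
        # Any failed cache operation flags the server as down.
        self.down.add(server)

    def is_active(self, server):
        return server not in self.down

    def check_down_servers(self):
        # Meant to be called periodically, outside the request path.
        for server in list(self.down):
            try:
                if not self.ping(server):
                    continue
                # Flush BEFORE reactivating, so stale data from a
                # non-hard-restart outage is never served.
                self.flush(server)
            except Exception:
                continue  # still unhealthy; leave it flagged as down
            self.down.discard(server)
```

The ordering is the important part: the server only becomes active after the flush has succeeded, so a server that was merely unreachable never comes back holding old values.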
If you're using PHP, the PECL extension has some great new enhancements
(I'm not sure if they've made release yet, but certainly pre-release) to
enable this sort of high-availability functionality to work well with
whatever tracking method you use externally.
Hope that helps!