Why not let the fs handle this ??
Casper Langemeijer
casper at bcx.nl
Thu Jun 8 07:09:58 UTC 2006
I might have something to add here:
We use a dedicated IP range to reach the memcached instances.
Every instance has its own IP. If a machine is down, we let another
machine take its place by failing the IP over to it. This is a common
HA technique.
In this case it is special because we use an IP to identify a service.
You could use hostnames for that, but we don't want to bother with
DNS lookups/TTLs and such.
Furthermore, every memcached instance listens on a different port,
enabling us to temporarily run more than one instance on a machine in
case of total meltdown. It doesn't matter how much memory is assigned
to the temporary instance, as long as it accepts connections...
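To make that concrete, here is a minimal sketch of the
application-side pool (Python; the 10.1.0.x addresses and ports are
made up for illustration). The pool never changes from the client's
point of view; which physical machine answers for a service IP is the
failover layer's problem:

    import hashlib

    # One entry per memcached service: (service IP, port).  The IP
    # identifies the service; distinct ports mean two services can
    # temporarily share one machine during a meltdown.
    POOL = [
        ("10.1.0.1", 11211),
        ("10.1.0.2", 11212),
        ("10.1.0.3", 11213),
    ]

    def server_for(key):
        # Map a cache key to a fixed slot in the service pool.
        digest = hashlib.md5(key.encode()).digest()
        return POOL[int.from_bytes(digest[:4], "big") % len(POOL)]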
Not saying this is _the_ solution, it's just a solution.
Grtz!
On Wed, 7 Jun 2006, Don MacAskill wrote:
> > Has this actually worked out well in practice for anybody? I've found that
> > losing one machine (out of about 100) results in so much db thrashing as the
> > keys get repopulated into different places that the site becomes basically
> > unusable until enough of the cache has been regenerated (5-10 minutes if I'm
> > lucky).
[...cut...]
> The key is to keep that downed 100th machine in your pool, so the key
> allocation algorithm still "counts" it, but to somehow let your application
> know not to write to it while it's in a downed state.
>
> In our particular case, any failed memcached operation causes a server to be
> flagged
> as "down" in our tracker. Then, asynchronously, the state of that server is
> periodically checked. When it comes back up, it's completely flushed, and
> then marked as active. (You have to do the flush to get rid of any stale data
> in case the server was just unresponsive, unreachable, or some other non-hard
> restart situation).
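The reason keeping the downed machine counted in the pool matters:
with naive hash-mod-N allocation, shrinking the pool from N to N-1
servers remaps roughly (N-1)/N of all keys, so nearly the whole cache
misses at once. Below is a rough sketch of the pattern Don describes
(Python, raw memcached text protocol; the single-recv reads and the
one-second timeout are simplifications, and none of this is anyone's
production code). The downed server stays in the pool so key
allocation is unchanged; reads against it are plain misses, writes
are skipped, and a periodic check flushes and reactivates it:

    import hashlib
    import socket

    class Pool:
        def __init__(self, servers):
            # servers: list of (host, port).  Each entry carries an
            # "up" flag that the health check maintains.
            self.servers = [{"addr": s, "up": True} for s in servers]

        def _slot(self, key):
            # Downed servers still count here, so the mapping of
            # keys to servers never shifts when one goes down.
            digest = hashlib.md5(key.encode()).digest()
            return int.from_bytes(digest[:4], "big") % len(self.servers)

        def get(self, key):
            server = self.servers[self._slot(key)]
            if not server["up"]:
                return None           # downed server: treat as a miss
            try:
                reply = self._cmd(server["addr"],
                                  b"get %s\r\n" % key.encode())
                return None if reply.startswith(b"END") else reply
            except OSError:
                server["up"] = False  # flag it; the checker takes over
                return None

        def set(self, key, value):
            server = self.servers[self._slot(key)]
            if not server["up"]:
                return                # skip writes while it is down
            try:
                data = value.encode()
                self._cmd(server["addr"],
                          b"set %s 0 0 %d\r\n%s\r\n"
                          % (key.encode(), len(data), data))
            except OSError:
                server["up"] = False

        def check(self):
            # Run this periodically (cron, a thread, whatever).
            for server in self.servers:
                if server["up"]:
                    continue
                try:
                    # flush_all wipes any stale data left from a soft
                    # failure; only then does the server rejoin.
                    self._cmd(server["addr"], b"flush_all\r\n")
                    server["up"] = True
                except OSError:
                    pass              # still down; retry next round

        def _cmd(self, addr, line):
            with socket.create_connection(addr, timeout=1) as sock:
                sock.sendall(line)
                return sock.recv(4096)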