Asymmetric Failure + Best Practices

Tue Jun 24 08:17:07 UTC 2008

Hi Jason,

We thought a bit about automatic failover when we did our .Net client,
BeITMemcached, and we ended up deciding that running slightly uncached is
better than the potential synchronization errors you can get with it. If a
client cannot reach a memcached server, it marks that server as dead
internally and retries it on an increasing timeout. This means that all
operations that would end up on the unreachable server immediately return
false or null, except the occasional operation that gets to retry and has to
wait out the entire connection timeout. You will therefore not get any
synchronization errors, only a partly uncached app; how much is uncached
depends on the number of memcached servers you have.
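
To make the dead-server behaviour above concrete, here is a minimal sketch in
Python. It is not BeITMemcached code: the names ServerState and cache_get, the
doubling backoff, and the 1 second / 60 second limits are all assumptions
chosen for illustration. The only properties taken from the description above
are that a dead server fails fast and that the retry interval grows.

```python
import time


class ServerState:
    """Tracks whether one cache server is considered dead (hypothetical helper)."""

    def __init__(self, base_retry=1.0, max_retry=60.0, clock=time.monotonic):
        self.base_retry = base_retry    # assumed first retry delay, in seconds
        self.max_retry = max_retry      # assumed upper bound on the delay
        self.clock = clock              # injectable so the logic can be tested
        self.dead = False
        self.retry_delay = base_retry
        self.next_retry_at = 0.0

    def should_attempt(self):
        # Live servers are always tried; dead ones only once the timer expires.
        return not self.dead or self.clock() >= self.next_retry_at

    def record_failure(self):
        if self.dead:
            # Repeated failure: double the delay (an assumed policy), capped.
            self.retry_delay = min(self.retry_delay * 2, self.max_retry)
        else:
            self.dead = True
            self.retry_delay = self.base_retry
        self.next_retry_at = self.clock() + self.retry_delay

    def record_success(self):
        self.dead = False
        self.retry_delay = self.base_retry


def cache_get(state, key, fetch):
    """Return the cached value, or None right away while the server is dead.

    `fetch` stands in for the real network call and is expected to raise
    ConnectionError when the server cannot be reached.
    """
    if not state.should_attempt():
        return None                     # fail fast: treated as a cache miss
    try:
        value = fetch(key)
    except ConnectionError:
        state.record_failure()
        return None
    state.record_success()
    return value
```

The caller treats None as an ordinary miss and goes to the backing store, so
keys on the dead server simply run uncached and are never remapped.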

Also note that even in your symmetrical failure scenario, automatic failover
will cause synchronization errors when the failing memcached server is
brought back online. It is very likely that different parts of your
application will detect this at different times, which means that some parts
will have mapped back the keys to the previously failing server, and others
will not, and this may or may not be fatal for your application. Failover
seems like a good idea because the failover itself can be done completely
smoothly, without synchronization errors, but automatic recovery from it can
never be smooth unless you get lucky and all parts of your application
recover at the same time.
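
The recovery problem can be reproduced with a small simulation. This is only
a sketch under stated assumptions: the Client class, pick_server, the use of
CRC32 modulo the live-server count, and the in-memory dicts standing in for
MCA and MCB are all hypothetical, and the outage is assumed to be a network
problem, so MCB keeps its contents while it is unreachable.

```python
import zlib

SERVERS = ["MCA", "MCB"]


def pick_server(key, live):
    """Hash the key over the servers this client currently believes are live."""
    candidates = [s for s in SERVERS if s in live]
    return candidates[zlib.crc32(key.encode()) % len(candidates)]


class Client:
    """A client with automatic failover (hypothetical, for illustration)."""

    def __init__(self, stores):
        self.stores = stores            # shared dicts standing in for servers
        self.live = set(SERVERS)        # this client's own view of the pool

    def set(self, key, value):
        self.stores[pick_server(key, self.live)][key] = value

    def get(self, key):
        return self.stores[pick_server(key, self.live)].get(key)


def simulate():
    stores = {"MCA": {}, "MCB": {}}
    app_a, app_b = Client(stores), Client(stores)

    # Pick a key that maps to MCB while both servers are live.
    key = next(k for k in (f"user:{i}" for i in range(100))
               if pick_server(k, set(SERVERS)) == "MCB")

    app_a.set(key, "v1")                # stored on MCB

    # Symmetric failure: both clients fail over cleanly to MCA.
    app_a.live = {"MCA"}
    app_b.live = {"MCA"}
    app_b.set(key, "v2")                # the update lands on MCA

    # MCB comes back, but only app_a has noticed so far.
    app_a.live = set(SERVERS)

    return app_a.get(key), app_b.get(key)


if __name__ == "__main__":
    print(simulate())
```

In this run the failover step stays consistent, because both clients agree on
MCA. After recovery, app_a reads the old "v1" from MCB while app_b still reads
"v2" from MCA, which is the synchronization error described above.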

Since memcached is designed to explicitly guarantee non-stale data through
deterministic, consistent mapping of a key, and explicitly does not
guarantee persistence (it's an LRU cache after all), we designed our client
to behave along those lines.

In the end, you have to determine what's best for your application. Are you
better at handling synchronization errors or are you better at handling a
high miss-rate?

/Henrik Schröder

On Tue, Jun 24, 2008 at 12:56 AM, Jason McGuirk <
jason.mcguirk at supportsoft.com> wrote:

> Hey Folks,
>
> I'm partially through a .NET web app integration of memcached using the
> Enyim client bindings published on codeplex and had a few 101 questions
> about how the protocol behaves during a failure scenario.
>
> Since the protocol necessitates that the client bindings modulate the hash
> key over available servers, how does this behave during an asymmetrical
> failure?
>
> Consider the following example:
>
> I've got two app servers AppA and AppB, and two Memcache instances MCA and
> MCB. Consider the case where the server running MCB suffers a hardware
> failure and is universally unrouteable. The client bindings on both AppA and
> AppB should detect this failure and re-jigger the key space to map 100% to
> MCA.
>
> Consider an asymmetric failure where only one of the servers (say, AppA) is
> unable to route to MCB. This is substantially less likely than a categorical
> failure, but could happen for numerous reasons: a misconfigured firewall, a
> hosed router, what have you. In this scenario, AppA makes the determination
> that MCB has failed and will remap the keyspace to MCA. AppB, meanwhile, is
> able to route to MCB successfully and will continue placing keys into BOTH
> MCA and MCB, so stale reads on both app servers are inevitable, since
> invalidations will consistently map to different nodes.
>
> My question is this: what's the best practice people have discovered here?
> Should I have AppA detect the failure and notify all other nodes to ignore
> MCB? How does that work for intermittent failure?
>
> Thanks in advance for any insight ☺
>
> Jason McGuirk
> SupportSoft, Inc.
>