Failure, Data Integrity and the PECL Extension

Fri May 18 17:51:33 UTC 2007

Kenner,

If any machine is too CPU-bound to return memcached responses at  
anything other than ethernet speed you should probably step back and  
really evaluate your hardware plan.

You'll get a lot of mileage out of memcached but you'll be better off  
in this case rethinking your approach to caching -- if you really  
need transactional integrity or absolutely assured consistency you'll  
want to use an in-memory database like MySQL cluster. If you rework  
your application's cache layer to "appreciate not require"  
consistency you'll be much happier. Lots of folks on this list  
successfully use memcached as a source of authority but it's  
generally a) backed by a more expensive db query if necessary or b)  
not actually critical data. Whatever you do, make the kind of  
consistency problem you describe cause at most annoyance "grr, have  
to hit the disk and recalculate x y z", not "OMG two users got the  
same UID"

For very common data which must be _available_ we keep a separate  
pool of a couple servers who all get the same data written to them --  
we've written MultiputMemcacheDriver class which handles that logic.  
If you write a timestamp as part of your payload data you can resolve  
ambiguity in a pinch -- data with the later timestamp is 'more  
authoritative'. It's not terribly complex but makes for better sleep.

-Nathan / PBwiki

On May 18, 2007, at 9:59 AM, Kenner Stross wrote:

> Hello,
>
> I am using the PECL php extension for memcached access, and am  
> confused/concerned about data integrity in the case of a failure. I  
> have already found some discussions on this list regarding this  
> issue, but I don't see how those solutions hold up in a multi- 
> server environment.
>
> What I've found so far is basically this: Disable automatic  
> failover, use a callback method to catch the failure and in that  
> callback routine set the server status to off and stop any further  
> retrying (-1), and lastly, implement an external service monitor  
> that can detect the problem, flush the cache and then mark the  
> server as available again. That way, you can be sure all stale  
> entries are flushed before it rejoins the pool of active servers.
>
> Fine for one client accessing the cache server.  But I don't see  
> how that guarantees integrity in a multi-client environment.  In  
> particular, I don't see how it works when the failure is quite  
> temporary, due to a heavy load that made the response too  
> sluggish.  Hopefully I'm just overlooking the obvious and one of  
> you will straighten me out.
>
> Let's imagine a simple 3 machine setup (m1 - m3), where each  
> machine is acting as a web server and a memcached server.
>
> m1 web --> attempts write to m3 cache, but it fails due to extreme  
> load. Marks it as failed and offline (in the callback routine).
> m2 web --> accesses m3 cache successfully (no load problem on m2,  
> so no failure). Doesn't see that m1 took it offline.
>
> m2 is using invalid cache data (it's missing m1's activity) but  
> doesn't realize it. An external service monitor may or may not  
> notice this brief, intermittent problem, but even if it does, that  
> doesn't help m2 avoid the m3 cache once m1 has experienced an m3  
> cache failure.
>
> I'm sure I must be missing something.  Your help is greatly  
> appreciated.
>
> Thanks,
> Kenner