Failure, Data Integrity and the PECL Extension

Fri May 18 18:17:47 UTC 2007

Nathan,

Thanks for the response.  Some thoughts/questions:

I agree that significant throughput problems vis-a-vis memcache warrant
a hardware review, but they remain a possibility, so I really need to
understand just what is and is not possible and appropriate with
memcache.  Your discussion is starting to clarify some of that for me,
but I have two questions:

1) In the "appreciate not require" approach you mention backing up
memcache with a more expensive db query if necessary. My question is how
one knows when the expensive backup query is necessary.  If it's a
matter of policy (e.g., always check db at critical junctures such as
getting a unique id or whatever), then we're effectively saying -- in
that scenario -- not to rely on memcache.  Use it as a 'pretty reliable'
convenience for non-critical data, but not a system-of-record.  Am I
understanding that correctly, or is there a design which allows us to
use memcache for critical data but be able to detect when it is fishy
and therefore requires double-checking the database?

2) Your MultiputMemcacheDriver sounds interesting, but I don't really
understand. Under what conditions would the same tuple of information
have entries with different timestamps?  When would the code be
accessing multiple instances and thus examining the different
timestamps?  If you could walk me through the usage of that driver
(treating me like the dummy that I am), I would greatly appreciate it.

Thanks again,
Kenner

-----Original Message-----
From: Nathan Schmidt [mailto:nschmidt at gmail.com] On Behalf Of Nathan
Schmidt
Sent: Friday, May 18, 2007 10:52 AM
To: kenner at superpowerplanet.com
Cc: memcached at lists.danga.com
Subject: Re: Failure, Data Integrity and the PECL Extension

Kenner,

If any machine is too CPU-bound to return memcached responses at  
anything other than ethernet speed you should probably step back and  
really evaluate your hardware plan.

You'll get a lot of mileage out of memcached but you'll be better off  
in this case rethinking your approach to caching -- if you really  
need transactional integrity or absolutely assured consistency you'll  
want to use an in-memory database like MySQL Cluster. If you rework
your application's cache layer to "appreciate not require"  
consistency you'll be much happier. Lots of folks on this list  
successfully use memcached as a source of authority but it's  
generally a) backed by a more expensive db query if necessary or b)  
not actually critical data. Whatever you do, make sure the kind of
consistency problem you describe causes at most annoyance ("grr, have
to hit the disk and recalculate x y z"), not disaster ("OMG two users
got the same UID").
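
A minimal sketch of that "appreciate not require" layering, in Python
for brevity rather than PHP. Everything here is a hypothetical stand-in
(DictCache, Database, read_through, write, new_uid), not the PECL API
or anyone's production code. The assumption is that the database stays
the system of record, the cache only saves work, and truly critical
values such as UIDs never pass through the cache at all.

```python
class DictCache:
    """Hypothetical stand-in for a memcached client; entries may vanish at any time."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)


class Database:
    """Hypothetical stand-in for the system of record."""

    def __init__(self):
        self.rows = {}
        self.reads = 0
        self._next_uid = 0

    def load(self, key):
        self.reads += 1  # count the "expensive" queries
        return self.rows.get(key)

    def store(self, key, value):
        self.rows[key] = value

    def allocate_uid(self):
        self._next_uid += 1
        return self._next_uid


def read_through(cache, db, key):
    """Try the cache first; a miss costs one db query, never correctness."""
    value = cache.get(key)
    if value is None:
        value = db.load(key)
        if value is not None:
            cache.set(key, value)
    return value


def write(cache, db, key, value):
    """Write to the db first, then drop the cached copy so it gets rebuilt."""
    db.store(key, value)
    cache.delete(key)


def new_uid(db):
    """Critical data: always ask the db, never the cache."""
    return db.allocate_uid()
```

With this shape, losing or flushing a cache server only turns hits into
misses. It does not by itself detect a stale entry that is still being
served, which is why the critical path skips the cache entirely.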

For very common data which must be _available_ we keep a separate  
pool of a couple servers that all get the same data written to them --
we've written a MultiputMemcacheDriver class which handles that logic.
If you write a timestamp as part of your payload data you can resolve  
ambiguity in a pinch -- data with the later timestamp is 'more  
authoritative'. It's not terribly complex but makes for better sleep.
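
A rough sketch of the multiput idea, in Python rather than PHP. This is
not the actual MultiputMemcacheDriver; FakeServer and MultiputCache are
hypothetical names, and the clock parameter is an assumption added so
the example can be driven deterministically. The point it illustrates:
if one server in the pool misses a write (say it was briefly
unreachable), its copy keeps the older timestamp, and a reader that
consults every server can prefer the newest copy.

```python
import time


class FakeServer:
    """Hypothetical in-process stand-in for one memcached instance."""

    def __init__(self):
        self.data = {}
        self.up = True

    def set(self, key, value):
        if not self.up:
            raise ConnectionError("server unavailable")
        self.data[key] = value

    def get(self, key):
        if not self.up:
            raise ConnectionError("server unavailable")
        return self.data.get(key)


class MultiputCache:
    """Write every value to all servers; read back the newest copy found."""

    def __init__(self, servers, clock=time.time):
        self.servers = servers
        self.clock = clock

    def set(self, key, value):
        # The timestamp travels inside the payload, next to the value.
        payload = (self.clock(), value)
        written = 0
        for server in self.servers:
            try:
                server.set(key, payload)
                written += 1
            except ConnectionError:
                pass  # this server now holds an older copy, or none
        return written

    def get(self, key):
        newest = None
        for server in self.servers:
            try:
                payload = server.get(key)
            except ConnectionError:
                continue
            if payload is not None and (newest is None or payload[0] > newest[0]):
                newest = payload
        return None if newest is None else newest[1]
```

Comparing timestamps written by different web servers assumes their
clocks are reasonably in sync; with noticeable skew, "later" is only
approximate.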

-Nathan / PBwiki

On May 18, 2007, at 9:59 AM, Kenner Stross wrote:

> Hello,
>
> I am using the PECL php extension for memcached access, and am
> confused/concerned about data integrity in the case of a failure. I  
> have already found some discussions on this list regarding this  
> issue, but I don't see how those solutions hold up in a multi- 
> server environment.
>
> What I've found so far is basically this: Disable automatic
> failover, use a callback method to catch the failure and in that  
> callback routine set the server status to off and stop any further  
> retrying (-1), and lastly, implement an external service monitor  
> that can detect the problem, flush the cache and then mark the  
> server as available again. That way, you can be sure all stale  
> entries are flushed before it rejoins the pool of active servers.
>
> Fine for one client accessing the cache server.  But I don't see
> how that guarantees integrity in a multi-client environment.  In  
> particular, I don't see how it works when the failure is quite  
> temporary, due to a heavy load that made the response too  
> sluggish.  Hopefully I'm just overlooking the obvious and one of  
> you will straighten me out.
>
> Let's imagine a simple 3 machine setup (m1 - m3), where each
> machine is acting as a web server and a memcached server.
>
> m1 web --> attempts write to m3 cache, but it fails due to extreme
> load. Marks it as failed and offline (in the callback routine).
> m2 web --> accesses m3 cache successfully (no load problem on m2,  
> so no failure). Doesn't see that m1 took it offline.
>
> m2 is using invalid cache data (it's missing m1's activity) but
> doesn't realize it. An external service monitor may or may not  
> notice this brief, intermittent problem, but even if it does, that  
> doesn't help m2 avoid the m3 cache once m1 has experienced an m3  
> cache failure.
>
> I'm sure I must be missing something.  Your help is greatly
> appreciated.
>
> Thanks,
> Kenner