Restarting the MemcacheD Cluster

Sat Apr 15 19:10:12 UTC 2006

FYI, this is exactly why the PECL memcached extension is adding a "no 
failover" flag and functionality.  With failover turned off, your cluster 
stays up and active, and members can drop out and rejoin without stale 
data ending up on the members that are still active.
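Here is a minimal sketch of the routing difference. It assumes, hypothetically, that the client picks a key's home server as `crc32(key) % len(servers)` and, with failover on, moves on to the next live server in the list. `pick_server` and this hashing scheme are illustrative only. They are not the PECL extension's actual API or algorithm.

```python
import zlib
from typing import Optional


def pick_server(key: str, servers: list[str], up: set[str],
                failover: bool) -> Optional[str]:
    """Hypothetical client-side routing, for illustration only.

    The home server is servers[crc32(key) % len(servers)]. With failover on,
    a down home server is skipped in favour of the next live one in the list;
    with failover off, a down home server means a cache miss (None).
    """
    start = zlib.crc32(key.encode()) % len(servers)
    for offset in range(len(servers)):
        name = servers[(start + offset) % len(servers)]
        if name in up:
            return name
        if not failover:
            return None  # never move the key onto another member
    return None  # no live server at all


servers = ["A", "B", "C"]
# Every key whose home server is B (computed while all three servers are up)
b_keys = [k for k in (f"item{i}" for i in range(1000))
          if pick_server(k, servers, {"A", "B", "C"}, failover=False) == "B"]
with_failover = {pick_server(k, servers, {"A", "C"}, failover=True) for k in b_keys}
without_failover = {pick_server(k, servers, {"A", "C"}, failover=False) for k in b_keys}
print(with_failover)     # {'C'}: B's keys now get cached on C
print(without_failover)  # {None}: B's keys are misses; A and C never see them
```

With failover off, a key whose home server is down is simply a miss that falls through to the database. No other member ever receives a copy of it, so nothing stale is left behind when the home server flaps.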

The last build I tested was looking very good, but there were still some 
gotchas.  I haven't seen an update yet.

Also FYI, it was quite a bit faster than a comparable PHP class.

Don

timeless wrote:
> Philip Neustrom wrote:
>> Why would a rolling restart cause cache corruption?  You mentioned
>> that you had cache corruption to begin with (in your application?).
>> Maybe this is why you see it on a rolling restart?
>>   
> 
> (At least with the PHP MemcacheD class we use, from PHP.net):
> 
> Data is stored on nodes via a hash of the key. If a key's node is down, 
> the data is stored on a different node instead. So if a single server 
> goes down, comes back up, and then goes down again, the cluster is 
> almost guaranteed to hold out-of-date data (in a high-volume 
> environment, close enough to guaranteed to make no difference). Example:
> 
>    1. ServerA goes down,
>    2. ItemA is requested, hash now points at ServerB, which doesn't have
>       the item cached,
>    3. ItemA is retrieved from the database and cached onto ServerB,
>    4. ServerA comes up again,
>    5. ItemA is modified in the database,
>    6. ItemA is requested, hash now points at ServerA, which doesn't have
>       the item cached,
>    7. ItemA is retrieved from the database and cached on ServerA,
>    8. ServerA goes down AGAIN (pesky server),
>    9. ItemA is requested, hash now points at ServerB again, which still
>       has the copy cached in step 3, so
>   10. ItemA is retrieved FROM CACHE from ServerB, and that copy is OLDER
>       than ItemA in the database.
> 
> At this point, there is no way for the code to know that the cached 
> ItemA is older than what's in the database. A rolling restart of every 
> memcached daemon in the cluster while data is actively being written to 
> the daemons is exactly this case of a server going down and coming back 
> up, so it will cause flaky cache results if any server subsequently 
> goes back down.
> 
> At least... that's what I think.
> 
> --
> timeless
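The ten steps in the quoted message can be replayed with a toy read-through cache. This is a sketch under stated assumptions. `ToyCluster`, `fetch` and `run_scenario` are hypothetical names. Routing is assumed to be `crc32(key) % N` with a fall-forward to the next live server, and a restarted daemon is assumed to come back with an empty cache. It is not the PHP.net class or the PECL extension.

```python
import zlib
from typing import Optional


class ToyCluster:
    """Hypothetical read-through memcached cluster, for illustration only."""

    def __init__(self, names: list[str], failover: bool) -> None:
        self.names = list(names)
        self.failover = failover
        self.up = {name: True for name in self.names}
        self.caches = {name: {} for name in self.names}

    def primary(self, key: str) -> str:
        # Assumed hashing scheme: crc32 of the key, modulo the server count
        return self.names[zlib.crc32(key.encode()) % len(self.names)]

    def route(self, key: str) -> Optional[str]:
        start = self.names.index(self.primary(key))
        for offset in range(len(self.names)):
            name = self.names[(start + offset) % len(self.names)]
            if self.up[name]:
                return name
            if not self.failover:
                return None  # primary is down: treat as a miss, do not reroute
        return None  # every server is down

    def go_down(self, name: str) -> None:
        self.up[name] = False

    def come_up(self, name: str) -> None:
        # A restarted memcached daemon starts with an empty cache
        self.up[name] = True
        self.caches[name] = {}

    def fetch(self, key: str, db: dict[str, str]) -> str:
        """Serve from cache if present, else load from the database and cache it."""
        server = self.route(key)
        if server is None:
            return db[key]
        cache = self.caches[server]
        if key not in cache:
            cache[key] = db[key]
        return cache[key]


def run_scenario(failover: bool) -> tuple[str, str]:
    """Replay steps 1-10; return (value read from the cluster, value in the database)."""
    cluster = ToyCluster(["ServerA", "ServerB"], failover=failover)
    # Pick a key whose home server is ServerA, standing in for ItemA
    item_a = next(k for k in (f"item{i}" for i in range(1000))
                  if cluster.primary(k) == "ServerA")
    db = {item_a: "v1"}
    cluster.go_down("ServerA")            # 1
    cluster.fetch(item_a, db)             # 2-3: with failover, cached on ServerB
    cluster.come_up("ServerA")            # 4
    db[item_a] = "v2"                     # 5: modified in the database
    cluster.fetch(item_a, db)             # 6-7: cached on ServerA as v2
    cluster.go_down("ServerA")            # 8
    return cluster.fetch(item_a, db), db[item_a]  # 9-10


print(run_scenario(failover=True))   # ('v1', 'v2'): stale read from ServerB
print(run_scenario(failover=False))  # ('v2', 'v2'): miss falls through to the database
```

With failover on, step 9 is served the stale "v1" that ServerB cached back in step 3. With failover off, ServerB never receives a copy of the key. While ServerA is down the read goes straight to the database and returns "v2".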