Restarting the MemcacheD Cluster
don at smugmug.com
Sat Apr 15 19:10:12 UTC 2006
FYI, this is exactly why the PECL memcached extension is adding a "no
failover" flag and functionality. Your cluster will remain up and
active, but members may drop in and out without causing bad data to
arrive at still-active members.
Last build I tested was looking very good, but there were still some
gotchas. Haven't seen an update yet.
Also FYI, it was quite a bit faster than a comparable PHP class.
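To make the "no failover" idea concrete, here is a minimal sketch in Python. It is not the PECL extension's API: the class name NoFailoverClient, its methods, the CRC32 hashing and the two server names are all made up for illustration, and the behaviour is only my assumption of what such a flag would do based on the description above. A key always belongs to the server it hashes to; if that server is down, the request is treated as a miss and nothing is written to another member.

```python
import zlib


class NoFailoverClient:
    """Hypothetical toy client: a key is only ever served by its hashed owner."""

    def __init__(self, names):
        self.names = list(names)
        self.stores = {n: {} for n in self.names}   # one dict stands in for each daemon
        self.alive = {n: True for n in self.names}

    def owner(self, key):
        # Plain modulo hashing; real clients may distribute keys differently.
        return self.names[zlib.crc32(key.encode()) % len(self.names)]

    def get(self, key):
        name = self.owner(key)
        if not self.alive[name]:
            return None                 # owner is down: report a miss, no failover
        return self.stores[name].get(key)

    def set(self, key, value):
        name = self.owner(key)
        if self.alive[name]:            # never write to a substitute server
            self.stores[name][key] = value

    def stop(self, name):
        self.alive[name] = False
        self.stores[name].clear()       # a restarted daemon loses its memory

    def start(self, name):
        self.alive[name] = True


def fetch(cache, database, key):
    """Read-through helper: try the cache, fall back to the database."""
    value = cache.get(key)
    if value is None:
        value = database[key]
        cache.set(key, value)
    return value


database = {"ItemA": "v1"}
cache = NoFailoverClient(["ServerA", "ServerB"])
home = cache.owner("ItemA")

results = []
cache.stop(home)
results.append(fetch(cache, database, "ItemA"))   # miss, read from the database
cache.start(home)
database["ItemA"] = "v2"
results.append(fetch(cache, database, "ItemA"))   # miss, caches v2 on the owner
cache.stop(home)
results.append(fetch(cache, database, "ItemA"))   # miss again, database value

others_empty = all(not cache.stores[n] for n in cache.names if n != home)
print(results, others_empty)
```

The cost is extra database reads while a member is down, but no other member ever holds a copy that can go stale.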
> Philip Neustrom wrote:
>> Why would a rolling restart cause cache corruption? You mentioned
>> that you had cache corruption to begin with (in your application?)
>> Maybe this is why you see it on a rolling restart?
> (At least with the PHP MemcacheD class we use, from PHP.net):
> Data is stored on nodes via a hash. If a node is down, then data is
> stored on a different node. So if a single server goes down, then comes
> up, then goes down again, the cluster is almost guaranteed to hold
> stale data (in a high-volume environment, this is as close to
> guaranteed as makes no difference). Example:
> 1. ServerA goes down,
> 2. ItemA is requested, hash now points at ServerB, which doesn't have
> the item cached,
> 3. ItemA is retrieved from the database and cached onto ServerB,
> 4. ServerA comes up again,
> 5. ItemA is modified in the database,
> 6. ItemA is requested, hash now points at ServerA, which doesn't have
> the item cached,
> 7. ItemA is retrieved from the database and cached on ServerA,
> 8. ServerA goes down AGAIN (pesky server),
> 9. ItemA is requested, hash now points at ServerB, which still has the
> item cached from step 3,
> 10. ItemA is retrieved FROM CACHE from ServerB, and that copy is OLDER
> than ItemA in the database.
> At this point, the code has no way to know that the cached ItemA is
> older than what's in the database. A rolling restart of every memcached
> daemon in the cluster, while data is actively being written to the
> daemons, puts each server through exactly this down-and-up cycle, so
> any server that subsequently goes down again can cause stale cache
> results.
> At least... that's what I think.
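The ten steps above can be replayed with a small simulation. This is only a sketch under assumptions: FailoverClient, read_through and the CRC32 hashing are hypothetical stand-ins, not the PHP class from PHP.net, and each daemon is modelled as a plain dict that is emptied when the server goes down.

```python
import zlib


class FailoverClient:
    """Hypothetical toy client: hashes a key to a server, skipping servers that are down."""

    def __init__(self, names):
        self.names = list(names)
        self.stores = {n: {} for n in self.names}   # one dict stands in for each daemon
        self.up = {n: True for n in self.names}

    def pick(self, key):
        start = zlib.crc32(key.encode()) % len(self.names)
        # Walk forward from the hashed slot until a live server is found.
        for i in range(len(self.names)):
            name = self.names[(start + i) % len(self.names)]
            if self.up[name]:
                return name
        return None

    def get(self, key):
        name = self.pick(key)
        return None if name is None else self.stores[name].get(key)

    def set(self, key, value):
        name = self.pick(key)
        if name is not None:
            self.stores[name][key] = value

    def go_down(self, name):
        self.up[name] = False
        self.stores[name].clear()       # a restarted daemon loses its memory

    def come_up(self, name):
        self.up[name] = True


def read_through(client, db, key):
    """Try the cache first; on a miss, load from the database and cache it."""
    value = client.get(key)
    if value is None:
        value = db[key]
        client.set(key, value)
    return value


db = {"ItemA": "v1"}
client = FailoverClient(["ServerA", "ServerB"])
primary = client.pick("ItemA")              # plays the role of "ServerA" in the steps

client.go_down(primary)                     # step 1
read_through(client, db, "ItemA")           # steps 2-3: v1 cached on the other server
client.come_up(primary)                     # step 4
db["ItemA"] = "v2"                          # step 5
read_through(client, db, "ItemA")           # steps 6-7: v2 cached on the primary
client.go_down(primary)                     # step 8
stale = read_through(client, db, "ItemA")   # steps 9-10: served from the other server

print(stale, db["ItemA"])
```

The final read comes back as "v1" while the database holds "v2", and nothing in the read path can detect the difference.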