Why not let the fs handle this ??

Wed Jun 7 18:05:41 UTC 2006

Steven Grimm wrote:
> Jon Drukman wrote:
>> Has this actually worked out well in practice for anybody?  I've found 
>> that losing one machine (out of about 100) results in so much db 
>> thrashing as the keys get repopulated into different places that the 
>> site becomes basically unusable until enough of the cache has been 
>> regenerated (5-10 minutes if i'm lucky).
> 
> The problem there is not with lots of memcached servers, but that you're 
> letting your client relocate your keys when a server goes down. Our 
> approach is just to let the missing machine cause cache misses, but the 
> others continue to serve up their usual data. (A missing memcached never 
> stays missing for long since our operations team monitors the servers 
> and fixes any broken ones pretty quickly.)

if a machine has actually crashed or gone down (happens once in a while), as 
opposed to the machine still being up but with memcached having 
crashed (hasn't happened to me yet), then connection attempts to the downed 
box take 2 minutes before they give up.  this basically renders the site 
inoperative, as lots of pages can hang for 2 minutes (or more) before 
returning, and that quickly fills up all the apache connection slots.

so removing a downed box from the memcache server pool list promptly is 
essential - which means all keys get reshuffled.
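
to put a rough number on the reshuffling, here is a minimal sketch that assumes the client picks a server as hash(key) modulo the pool size.  the md5-based hash, the host names and the keys are all made up for illustration and are not what any particular memcached client does.

```python
import hashlib

def server_for(key, servers):
    # naive placement (an assumption): hash the key, take it modulo the pool size
    digest = hashlib.md5(key.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]

def fraction_moved(keys, before, after):
    # share of keys whose server differs between the two pool lists
    moved = sum(1 for k in keys if server_for(k, before) != server_for(k, after))
    return moved / len(keys)

servers = ["cache%03d" % i for i in range(100)]  # hypothetical host names
keys = ["user:%d" % i for i in range(10000)]     # hypothetical keys
survivors = servers[:-1]                         # pool after dropping one box

if __name__ == "__main__":
    print("fraction of keys that moved:", fraction_moved(keys, servers, survivors))
```

with a uniform hash, going from 100 servers to 99 should leave only about 1 key in 100 on the same server, so nearly the whole cache has to be repopulated from the db.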

ideally there would be a way to specify a 1 second timeout on the memcache 
connection, which would get around this, but as far as i'm aware the 2 
minute timeout is an OS limitation.  would love to be proved wrong on that :)
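
for what it's worth, the long OS default only applies to a plain blocking connect; a program can put its own bound on the wait.  below is a minimal sketch in python using the standard socket module.  it is not tied to any memcached client library, and connect_with_timeout is a hypothetical helper name, so whether a given client exposes something like this is an open question.

```python
import socket

def connect_with_timeout(host, port, timeout=1.0):
    """Try to open a TCP connection, giving up after `timeout` seconds.

    Returns a connected socket, or None if the server can't be reached.
    """
    try:
        # the timeout is applied to the connect itself, so a dead box
        # costs about `timeout` seconds per address tried, not minutes
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # covers timeouts, refused connections and unreachable hosts
        return None
    # the same timeout stays set on the returned socket for reads and writes
    return sock
```

a caller could treat a None result as a cache miss and go to the db, e.g. `sock = connect_with_timeout("10.0.0.5", 11211)` where the address is made up, which keeps the downed box in the pool list without hanging pages on it.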

-jsd-