Transparent failover and restore?

Sun Dec 19 18:15:53 PST 2004

Kevin A. Burton wrote:
> Greg Whalin wrote:
> 
>>
>> Though, how often do you see motherboards and memory fail on running 
>> machines.  Not that often in my experience.
> 
> 
> Very often in my experience.. the more commodity servers means the more 
> components that can fail. Murphy is a harsh mistress...

Remind me to never buy from your vendor!  :)  In the past 12 years of 
managing high-traffic sites, I have had RAM failures only once or twice 
and have never had a single motherboard failure.  Drive failures are 
frequent, but RAM/motherboard failures are very very very rare (yep, 3 
verys).  They just do not happen that often in a properly set up 
environment (power cleaning/battery backup), even with cheapo machines 
(and I have bought some of the cheapest).

> At any point in time any component could fail and the cluster should not 
> notice...
>> Given you don't really need a drive in a memcached server, seems much 
>> less likely that you will see hardware failure in a memcached server 
>> compared to the average.  In all, it seems pretty unlikely that a 
>> memcache machine will fail, and given the cheap cost, one can build a 
>> pretty large cluster limiting the total percentage of cache lost if 
>> one of these servers does fail.
>>
> Accidents happen.  We had one of our admins hit the power switch to the 
> bottom 1/2 of one of our racks. I want my software smart enough to 
> handle and recover from any problem.  Makes my job a lot easier ;)
> 
> Read the Google cluster architecture.  Machines fail and they take their 
> time replacing them.
> Kevin

My original point was that at least some of the stability we have 
seen with the memcache server comes from its simple design 
(server hashing handled by clients, no server-side state 
management).  As software complexity grows, the chance of bugs, 
instability and errors grows with it.  I like production services to be 
simple, modular, and stable.
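
To make the client-side hashing point concrete, here is a rough sketch. 
It is not the scheme any particular memcached client uses; the host 
names, key names, cluster size and the choice of MD5 are all assumptions 
made up for illustration.  It maps keys to servers using only a server 
list held by the client, then counts how many keys lived on one failed 
box.

```python
import hashlib

def pick_server(key, servers):
    """Map a key to a server using only the client's server list."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Use the first 4 bytes of the digest as an unsigned integer.
    return servers[int.from_bytes(digest[:4], "big") % len(servers)]

# Hypothetical 20-node cluster and 10,000 hypothetical keys.
servers = ["cache%02d:11211" % i for i in range(20)]
keys = ["user:%d" % i for i in range(10000)]

# Pretend one server dies and count the keys that were stored on it.
failed = servers[3]
lost = sum(1 for k in keys if pick_server(k, servers) == failed)
lost_fraction = lost / len(keys)

print("keys on failed server: %.1f%%" % (100 * lost_fraction))
```

With N servers, roughly 1/N of the keys sit on any one machine, so a 
bigger cluster loses a smaller slice of the cache per failure.  Note 
that this only holds if the client keeps the dead server in its list 
and treats those keys as misses; with plain modulo hashing, removing a 
server from the list remaps most keys.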

gw