Transparent failover and restore?

Richard 'toast' Russo russor at msoe.edu
Mon Dec 20 13:08:41 PST 2004


On Mon, 20 Dec 2004, Josh Berkus wrote:

> Greg,
>
>> My original point was that the at least some of the stability we have
>> seen with the memcache server is due in part to its simplistic design
>> (server hashing handled by clients, lack of server side state
>> management).  As software complexity grows, the chance for bugs,
>> instability and errors grows with it.  I like production services to
>> be simple, modular, and stable.
>
> Which is a good reason to make the redundancy optional, but not to
> oppose it.   It's also a good focus for the redundancy; keep it as
> simple as possible.
>
> From my perspective (I used memcached as an accessory to a database, so
> it only holds quickly restorable data) the problem with taking a server
> offline is not primarily one of re-building the cache data, but rather
> one of propagating the server lists.   To give you an example:
>
> For a proposed project, we have 5 PostgreSQL database servers, each one
> running a 250MB instance of Memcached to hold session management
> information.   These servers are replicated by Slony, allowing us to
> take them offline one at a time to upgrade them to PostgreSQL 8.0
> (slony supports this).   Our pooling component, C-JDBC, dynamically
> recognizes which servers are not responding and queries the remaining 4
> servers while the 5th is down.
>
> But memcached doesn't.  In fact, due to the way hash keys are handled,
> not only do we have to propagate new server lists to each one of 8
> webservers, but the entire cache is invalidated and needs to be rebuilt
> from database for all 4 machines.   Then when the 5th machine is back
> online, we have to rebuild the entire cache again.
>
>
Memcached clients* will rehash the key if the server the key hashes 
to is offline.  When you take one machine offline you'll lose 1/5th of 
your cache, and when you bring it back, you'll again lose 1/5th of your 
cache as the keys originally assigned to that server migrate back (once 
the clients notice the server is up again).
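Roughly, the rehash works like this (a quick Python sketch of the idea, 
not the actual perl/java client code; the server names, the crc32 hash, 
and the retry salt are just assumptions for illustration):

  import zlib

  SERVERS = ["db1:11211", "db2:11211", "db3:11211", "db4:11211", "db5:11211"]
  DEAD = set()   # servers the client has noticed are down

  def pick_server(key, max_rehash=20):
      # hash the key onto the server list; if that bucket's server is
      # marked dead, salt the key with a retry counter and hash again
      # until a live server turns up
      for attempt in range(max_rehash):
          salted = key if attempt == 0 else "%d%s" % (attempt, key)
          bucket = zlib.crc32(salted.encode()) % len(SERVERS)
          if SERVERS[bucket] not in DEAD:
              return SERVERS[bucket]
      return None   # everything looks dead

  # with all 5 servers up, each key maps to a fixed bucket
  before = {k: pick_server(k) for k in ("sess:1", "sess:2", "sess:3")}

  # take one server offline: only keys that hashed to that server get
  # rehashed elsewhere; keys on the other 4 keep their buckets, so you
  # lose roughly 1/5th of the cache rather than all of it
  DEAD.add("db3:11211")
  after = {k: pick_server(k) for k in ("sess:1", "sess:2", "sess:3")}
  print(before, after)

(The same thing happens in reverse when the dead server comes back: the 
rehashed keys snap back to their original buckets, so that slice of the 
cache goes cold again.)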

You could make the connection timeouts more aggressive to reduce the cost 
of the client finding dead servers.

*At least the perl and java clients, last I looked.



