System architecture best practices - memcached + webfarms

Mon Jul 9 12:06:03 UTC 2007

We've had situations where the death of a single memcached server 
overloaded the database, because that server held a significant part of 
the cache. With dozens or hundreds of web slaves requesting the same 
missing data at once, the cache never refills: the database is so slow 
under the load that no process ever finishes regenerating a value and 
writing it back.
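
A toy model may make the failure mode easier to see. It is only a sketch: 
the linear slowdown of the database, the function name and all numbers are 
assumptions for illustration, not measurements from our setup.

```python
def regeneration_succeeds(clients, base_seconds, timeout_seconds):
    """Toy model: every client misses the cache and runs the same query.

    Assumption (not a measurement): the database slows down linearly,
    so each query takes base_seconds * clients to complete.
    """
    query_seconds = base_seconds * clients
    return query_seconds <= timeout_seconds


# Hypothetical numbers: a 0.5 s query and a 30 s request timeout.
for clients in (1, 10, 100, 300):
    ok = regeneration_succeeds(clients, base_seconds=0.5, timeout_seconds=30)
    print(clients, "clients:", "cache refilled" if ok else "nobody finishes")
```

In this model, once enough clients pile onto the same query, none of them 
finishes before its timeout, so the value is never written back to the cache.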

What we did:
- Write scripts to warm up the cache before the site goes live again, 
so that the first visitors don't kill everything a second time. These 
scripts were dumb in the beginning (simply crawl the pages in the 
background while the site is offline) but got smarter: they now load 
first the data that has the most impact, meaning data that is both 
expensive to get and often requested.

- Divide up the data and the number of servers so that the database 
can handle the death of one memcached server (and, of course, the 
regeneration of its cache keys on another system) without problems. In 
our tests we found a sweet spot at about 6-8 machines running memcached 
instances (with cache sizes larger than 2 GB). Having one of these fail 
will not compromise the database. With fewer than 4 machines running 
memcached, the failure of one becomes critical.
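
The "smarter" warm-up from the first item could be sketched roughly as 
follows. This is not our actual script: the impact score (cost times 
request rate), the function names, the sample statistics and the dict 
standing in for the cache are all assumptions for illustration.

```python
def warmup_order(items):
    """Return cache keys sorted by estimated impact, highest first.

    items: iterable of (key, cost_seconds, requests_per_minute).
    Impact is estimated as cost * request rate (an assumed heuristic).
    """
    ranked = sorted(items, key=lambda item: item[1] * item[2], reverse=True)
    return [key for key, _cost, _rate in ranked]


def warm_cache(cache, items, load):
    """Fill `cache` (any dict-like store) by calling load(key) in impact order."""
    for key in warmup_order(items):
        cache[key] = load(key)
    return cache


# Hypothetical statistics, e.g. gathered from slow-query and access logs.
stats = [
    ("front_page", 2.0, 600),    # impact 1200
    ("user_profile", 0.1, 900),  # impact 90
    ("tag_cloud", 5.0, 50),      # impact 250
]
print(warmup_order(stats))
```

With a real memcached client, the dict assignment would be replaced by 
that client's store call, and load(key) by the expensive database query.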
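
The sizing argument from the second item can be checked with a 
back-of-the-envelope calculation. The sketch assumes that keys are spread 
evenly over the servers, that only the dead server's keys are lost, and 
that the hit rate before the failure was 95%. The hit rate and the 
function name are made up for illustration.

```python
def extra_db_load(servers, hit_rate=0.95):
    """Estimate the database read multiplier right after one cache server dies.

    Assumptions: keys are spread evenly over `servers`, only the dead
    server's keys are lost, and `hit_rate` is the cache hit rate before
    the failure (0.95 is a made-up value).
    """
    lost = 1 / servers                 # share of cached keys that vanish
    miss_before = 1 - hit_rate         # share of reads reaching the database
    miss_after = miss_before + hit_rate * lost
    return miss_after / miss_before


for n in (2, 3, 4, 6, 8):
    print(n, "servers: lose", f"{1 / n:.0%}", "of the cache,",
          f"{extra_db_load(n):.1f}x database reads")
```

The fewer machines there are, the larger the share of the cache that 
disappears with one of them, and the harder the database is hit until the 
keys have been regenerated.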