System architecture best practices - memcached + webfarms

Clint Webb webb.clint at
Mon Jul 9 13:14:35 UTC 2007

One of my sites is designed to withstand a pretty heavy hit rate.  My
implementation expands on your solution a little bit.

Basically, statistics are generated that track the performance of the
pages.  As part of this, an external process checks the status of all the
running instances of memcache every second.  If one of the processes is no
longer running, it initiates a 'safe-mode': it replaces all the scripts
with safe versions.  The safe versions are almost identical to the normal
ones, except that if the data is not available in memcache, they check
whether another process is already calculating that key.  If one is, they
wait 5 seconds and then check memcache again, looping this way a maximum of
6 times.  (Note that each loop goes back to the initial key request, just
in case the data was generated by something else that doesn't bother with
the 'safe-' keys.)  If a script cannot determine that another process is
performing the work, it does the database query itself.

To check whether another process is already calculating a key, a script
looks up the key with a 'safe-' prefix.  If the result is '1', that key is
being processed by another instance.  If nothing comes back, the script
sets the 'safe-' key itself and then goes ahead with the query.  When the
query has finished, it sets the appropriate key and removes the 'safe-'
key.
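A minimal sketch of the scheme above, in Python.  A plain dict stands in
for the memcache client, and the names (safe_get, db_query, cache) are
mine, not from the real implementation; the retry count and wait time are
the ones described:

```python
import time

cache = {}  # stand-in for a memcache client; only get/set/delete semantics


def db_query(key):
    # placeholder for the real (expensive) database query
    return "computed:" + key


def safe_get(key, retries=6, wait=5):
    """Fetch `key` while limiting how many processes regenerate it at once."""
    for _ in range(retries):
        value = cache.get(key)      # always re-check the real key first, in
        if value is not None:       # case something that ignores the 'safe-'
            return value            # convention already filled it in
        if cache.get("safe-" + key) != "1":
            break                   # nobody else is working on it
        time.sleep(wait)            # someone is; wait, then loop back
    # either no one claimed the key, or we gave up waiting: do it ourselves
    cache["safe-" + key] = "1"      # claim the key
    value = db_query(key)
    cache[key] = value              # publish the result...
    del cache["safe-" + key]        # ...and drop the claim
    return value
```

Note there is a small race between reading and setting the 'safe-' key
here; with a real memcached client you would use the atomic add() command
for the claim, so that only one of two simultaneous processes wins it.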

That sounds complicated, but I only do it for the pieces that I expect to
cost the most in a loaded system, which equates to about a dozen or so
queries.  Since all that 'safe' stuff complicates the code a bit, I don't
have it in the normal running scripts.  The downside is that I have to
maintain two pieces of similar code, but that's not so bad in practice,
because this is done in fairly low-level data access routines that don't
change much after they are initially written.  Higher-level stuff that
changes often would be a different story.

When the statistics processor has determined that things have settled
down, it replaces the 'safe' scripts with the normal ones that don't do
all that extra checking.

In summary, the idea is to limit the number of processes (page views) that
are generating the same content.  Each process checks whether another
process is already generating it.  If so, it waits until the data has been
generated and is in the general cache.  If nothing else is generating it,
it marks that it will, and then generates it.

In my testing it works great, although I am unable to generate realistic
load, so I have not yet had to test it under a real-life scenario.  Even
with all the memcaches turned off, my databases can currently handle my
heaviest loads.  This is all planned for when loads are heavier.

Disclaimer: my actual implementation varies slightly from the above
explanation, but would take much longer to describe, and the differences are
irrelevant to the idea.  This is simpler and what I would like to actually
get to.  But... time constraints and all...

On 7/9/07, Jan Miczaika <jan at> wrote:
> We've had situations where the death of a memcached server caused the
> database to overload, since this server contained a significant part of
> the cache. With dozens or hundreds of web slaves requesting the same
> data, the cache doesn't fill properly, as no process ever reaches the end.
> What we did:
> - write scripts to "warmup" the cache before the site goes live again,
> to prevent the first visitors from killing everything again. These were
> dumb in the beginning (simply crawl the page in the background while it
> is offline) but got smarter (warmup the cache with the data that has the
> most impact (expensive to get and often requested)).
> - dividing up the data and the number of servers in a way that the
> database can handle the death of one memcache (and of course the
> regeneration of cache keys on another system) without db problems. Doing
> tests, we found a sweet spot at about 6-8 machines running memcache
> instances (with cache sizes larger than 2 GB). Having one of these fail
> will not compromise the database. Below 4 machines running memcache, the
> failure of one gets critical.
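A warmup pass like the one Jan describes might look roughly like this.
All the names and numbers here are illustrative, not from his setup; the
only idea taken from the description is ordering keys by impact (expensive
to generate and often requested) and prefilling those first:

```python
cache = {}  # stand-in for a memcache client

# (key, cost to generate in seconds, requests per minute) -- made-up stats
key_stats = [
    ("front_page",  2.0, 900),
    ("user_counts", 5.0, 300),
    ("tag_cloud",   0.5, 120),
]


def generate(key):
    # placeholder for the real query/render that produces this key's data
    return "data:" + key


def warmup(stats):
    # impact = cost * request rate: warm the biggest wins first, so the
    # most damaging cache misses are gone before the site goes live
    for key, cost, rate in sorted(stats, key=lambda s: s[1] * s[2],
                                  reverse=True):
        if cache.get(key) is None:
            cache[key] = generate(key)


warmup(key_stats)
```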

"Be excellent to each other"