One of my sites is designed to withstand a fairly heavy hit rate, and my implementation expands on your solution a little.

Basically, statistics are generated that track the performance of the page. As part of this, an external process checks the status of all the running memcache instances every second. If one of them is no longer running, it initiates a 'safe-mode': it replaces all the scripts with safe versions. The safe versions are almost identical to the normal ones, except that if the data is not available in memcache, they check whether another process is already calculating that key. If so, they wait 5 seconds and then check memcache again, looping this way a maximum of 6 times. (Note that each iteration loops back to the initial key request, in case the value was generated by something else that doesn't bother with the 'safe-' keys.) If a script cannot determine that another process is doing the work, it does the database query itself.
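
For what it's worth, a minimal sketch of that watchdog in Python. Everything here is illustrative rather than my actual setup: the host list and flag-file path are made up, and instead of physically swapping the scripts it just toggles a flag that the data-access layer could check:

    import os
    import socket
    import time

    # Illustrative only: real host lists and the "swap the scripts" step
    # would come from the site's own configuration and deploy process.
    MEMCACHE_HOSTS = [("10.0.0.1", 11211), ("10.0.0.2", 11211)]
    SAFE_MODE_FLAG = "/var/run/site/safe_mode"

    def instance_alive(host, port, timeout=1.0):
        """Cheap liveness check: can we still open a TCP connection?"""
        try:
            socket.create_connection((host, port), timeout).close()
            return True
        except OSError:
            return False

    def watchdog():
        while True:
            all_up = all(instance_alive(h, p) for h, p in MEMCACHE_HOSTS)
            if not all_up and not os.path.exists(SAFE_MODE_FLAG):
                open(SAFE_MODE_FLAG, "w").close()   # enter safe-mode
            elif all_up and os.path.exists(SAFE_MODE_FLAG):
                os.remove(SAFE_MODE_FLAG)           # back to normal
            time.sleep(1)                           # check once a second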
To check whether another process is already calculating a key, a script looks up the key with 'safe-' prefixed to it. If the result is '1', that key is being processed by another instance. If nothing comes back, the script sets the 'safe-' key itself and goes ahead with the query it needs. When the query has finished, it sets the appropriate key and removes the 'safe-' key.
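
Here is a rough sketch of that fallback path in Python, assuming the python-memcached client. The key names, the 5-second wait, and the 6-try cap follow the description above; the server list and the load_from_db callable are placeholders:

    import time
    import memcache  # python-memcached client

    mc = memcache.Client(["10.0.0.1:11211"])  # illustrative server list

    def safe_get(key, load_from_db, ttl=300, lock_ttl=60):
        """Fetch `key`, making sure only one process recomputes it on a miss."""
        for _ in range(6):                      # loop at most 6 times
            value = mc.get(key)                 # always re-check the real key,
            if value is not None:               # in case something else set it
                return value
            if mc.get("safe-" + key) != "1":    # nobody else is working on it
                break
            time.sleep(5)                       # someone is: wait, then retry

        # Nobody claimed the key (or we gave up waiting): claim it, run the
        # expensive query, publish the result, and drop the 'safe-' marker.
        mc.set("safe-" + key, "1", time=lock_ttl)
        value = load_from_db()
        mc.set(key, value, time=ttl)
        mc.delete("safe-" + key)
        return value

In a real implementation you would probably want mc.add() instead of the get/set pair so that claiming the 'safe-' key is atomic, but the flow above is the idea.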
That sounds complicated, but I only do it on the pieces that I expect to cost the most in a loaded system, which comes to about a dozen queries. Since all that 'safe' logic complicates the code a bit, I don't have it in the normal running scripts. The downside is that I have to maintain two pieces of similar code, but that's not so bad, because it lives in fairly low-level data access routines that haven't changed much at all since they were initially written. Higher-level stuff that changes often would be a different story.
When the statistics processor has determined that things have settled down, it replaces the 'safe' scripts with the normal ones that don't do all that extra checking.

In summary, the idea is to limit the number of processes (page views) that are generating the same content. Each process checks whether another process is already generating it; if so, it waits until the data has been generated and is in the general cache. If nothing else is generating it, it marks that it will, and then generates it.
In my testing it works great, although I am unable to generate realistic load, so I have not yet had to prove it under a real-life scenario. Even with all the memcaches turned off, my databases can currently handle my heaviest loads; this is all planned for when loads are heavier.
Disclaimer: my actual implementation varies slightly from the explanation above, but it would take much longer to describe, and the differences are irrelevant to the idea. This is the simpler version and what I would like to actually get to. But... time constraints and all...
On 7/9/07, Jan Miczaika <jan@hitflip.de> wrote:

> We've had situations where the death of a memcached server caused the
> database to overload, since this server contained a significant part of
> the cache. With dozens or hundreds of web slaves requesting the same
> data, the cache doesn't fill properly, as no process ever reaches the end.
>
> What we did:
>
> - write scripts to "warmup" the cache before the site goes live again,
> to prevent the first visitors from killing everything again. These were
> dumb in the beginning (simply crawl the page in the background while it
> is offline) but got smarter (warm up the cache with the data that has
> the most impact: expensive to get and often requested).
>
> - dividing up the data and the number of servers in a way that the
> database can handle the death of one memcache (and of course the
> regeneration of cache keys on another system) without db problems.
> Doing tests we found a sweet spot at about 6-8 machines running
> memcache instances (with cache sizes larger than 2 GB). Having one of
> these fail will not compromise the database. Below 4 machines running
> memcache, the failure of one gets critical.

--
"Be excellent to each other"