Cache miss stampedes

Thu Jul 26 08:47:56 UTC 2007

I understand what you're saying, but I am still a little baffled, 
because it leaves out one important point: If the client code can, in 
the face of a cache miss, correctly repopulate the cache (resulting in 
the stampede scenario that started this thread) then it must be the case 
that there exists working, callable code to generate cache entries from 
the underlying data store. If generating the cached values is a process 
that requires manual intervention, then you're pretty much hosed no 
matter what, because your clients will not be able to recover from a 
cache miss, whether it's a "real" one or not.

So what I'm suggesting this background job do is run the exact same code 
that would be run in production on a cache miss. In fact, that's exactly 
the suggestion I wrote my message in response to: simulate a cache miss 
to cause the cache repopulation client code to be run. The only 
difference is that the suggestion was to make it a server-initiated 
thing and force some number of clients to deal with a cache miss in the 
course of processing a real request from a user, which means that user 
gets to wait longer for a response. All I'm saying is that if you have 
that code, there's no need to actually have any cache misses, and no 
user requests need to be delayed. And if you don't have that code, the 
options under discussion won't help you anyway.
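
To make that concrete, here is a rough sketch in Python of what I have in 
mind. Everything in it is made up for illustration: FakeCache is an 
in-memory stand-in rather than a real memcached client, and fetch, 
refresh_hot_keys and the regenerate callback are hypothetical names. The 
only point is that the client path and the background path call the same 
regeneration code.

```python
from typing import Callable, Dict, Iterable, Optional


class FakeCache:
    """In-memory stand-in for a cache client (an assumption, not a real API)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        # No expiration time: entries are only ever overwritten.
        self._data[key] = value


def fetch(cache: FakeCache, key: str, regenerate: Callable[[str], str]) -> str:
    """Client path: on a miss, rebuild the value and store it."""
    value = cache.get(key)
    if value is None:
        value = regenerate(key)
        cache.set(key, value)
    return value


def refresh_hot_keys(
    cache: FakeCache, hot_keys: Iterable[str], regenerate: Callable[[str], str]
) -> None:
    """Background path: run the same rebuild code without waiting for a miss."""
    for key in hot_keys:
        cache.set(key, regenerate(key))
```

A cron job or a long-running loop would call refresh_hot_keys as often as 
you want the values refreshed. Clients asking for those keys then always 
find them in the cache, so fetch never has to call regenerate while a user 
is waiting.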

Again, though, it is totally possible I'm missing the point altogether.

-Steve

dormando wrote:
> Well. Uh.
>
> Sometimes ya just can't, yaknow? :)
>
> In an ideal world your cache never expires, memcached never flaps, and 
> you have tools updating caches in the background that work flawlessly. 
> Unfortunately developers don't always have the time to make these 
> perfect, but it's "easy" to patch in one of the previous suggestions 
> to deal with the problem.
>
> Let's say you're a typical startup and you're faced with a problem:
>
> You have a complex bit of parsing code for special data. There was 
> never any code written to automatically, or programmatically, update 
> this data. Sometime in the future caching is added. This is easy; 
> cache the result of the parsing request into memcached, let it expire 
> once per minute so it's easy to propagate changes (which are made by 
> hand, rarely). It's wrong, but it's what happened.
>
> It might take someone a nontrivial amount of time to fix this. Write 
> code to serialize the data into the DB, write a CLI tool or webpage to 
> manage the data, so the cache can be updated after the data is edited 
> (either by hand, or whatever). Good luck getting your boss to sign off 
> on that. There're magic widgets that aren't writing themselves!
>
> I do realize there's an easy way to update the data by hand, then run 
> a tool to just refresh the cache, but that's beside the point ;) 
> Imagine again that you have a lot of these situations. Where caching 
> was plugged in as an afterthought. For new development I _always_ 
> recommended a cron to update the data (getting PHP devs to write crons 
> is like pulling teeth!), or a tool that updates the cache. It doesn't 
> always happen.
>
> At Gaia it's also common for this to happen where simple 'query 
> caching' was plugged in as a caching methodology. Everywhere there's 
> SQL that's relatively static, adding an ->enableCache(blah) call makes 
> it faster! Right? Right... Turns out you can also plug in one of the 
> aforementioned algorithms to mitigate this brain damage.
>
> Also, if you have a cluster configured to not auto-rehash, and 
> memcached instances can stay down for multiple minutes during a failure, you 
> will get a similar stampeding problem anyway. You should just cache 
> the data in APC at this point anyway.
>
> Well. In summary: you're right. On that note, I realize in all the 
> wiki updatery I hadn't really stressed what you mentioned enough at 
> all. It's there, but it's not a highlight.
>
> -Dormando
>
> Steven Grimm wrote:
>> I admit I'm a bit baffled by this discussion (and I also admit I have 
>> only been skimming it, so this might be a retread.) It seems like one 
>> of two situations should be true:
>>
>> 1. The underlying data has not changed. The cache is therefore still 
>> correct.
>>
>> 2. The underlying data has changed, and the cache is now stale.
>>
>> In the first case, just don't set an expiration time and you're done, 
>> yes? Since the item is frequently hit (hence the stampedes) it will 
>> never get LRUed out.
>>
>> In the second case, why are you waiting around for some unknown 
>> amount of time to pass -- and for some client to get an actual cache 
>> miss -- before refreshing the cache? If you have a few hot keys that 
>> change often but for whatever reason you can't invalidate / update 
>> the cache at the time the underlying data gets updated, then another 
>> approach is to have some background task periodically updating the 
>> hot items to their current values. Again, you don't let the item 
>> expire in this scenario; it just gets updated every once in a while. 
>> This way nobody has to deal with a cache miss, and the values still 
>> stay as current as you want them to (adjust the frequency of your 
>> background task's updates to taste) with no stampedes.
>>
>> What am I missing?
>>
>> -Steve
>