Erlang memcached

Richard Cameron camster at citeulike.org
Thu Aug 11 05:05:03 PDT 2005


I emailed the list some time ago about patching the memcached binary  
to include primitives to allow the client to implement a dependency  
tree of cached values  
<http://lists.danga.com/pipermail/memcached/2005-March/001270.html>.

The problem - just to recap - is that I'm using memcached behind a  
social bookmarking site <http://www.citeulike.org>. There are  
multiple representations of the data stored in a user's account  
(ordered by date or importance, rendered as HTML, RSS, etc.) which I  
wish to cache.
There's a fairly well-defined concept of dependency in the HTML  
generation process, so if a user updates one particular item of data  
in his or her account then some (but not all) of the cached pages  
ought to be marked as invalid and removed from the cache. I wish to  
record the dependencies when I write the data into memcached rather  
than trying to do the check at fetch time (which is what I'm  
currently doing in the client by storing a list of keys which the  
item in question depends *on*, and then doing a large mget operation  
to check none of the dependencies has been broken... and I can tell  
that by comparing timestamps).
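
Just to make the current fetch-time approach concrete, here's a rough  
sketch of the check, written in Erlang for the sake of this discussion  
(the real client isn't Erlang, and the mc:get/1 and mc:get_multi/1  
client calls, plus the {Timestamp, Deps, Data} entry layout, are  
placeholders of mine):

    -module(dep_check).
    -export([fetch/1]).

    %% Sketch only.  Assumes a hypothetical memcached client exposing
    %%   mc:get/1       -> {ok, Value} | miss
    %%   mc:get_multi/1 -> [{Key, Value}]   (only the keys that hit)
    %% Each cached entry is stored as {Timestamp, Deps, Data}, where
    %% Deps is the list of keys the entry was generated from.

    fetch(Key) ->
        case mc:get(Key) of
            miss ->
                miss;
            {ok, {Stamp, Deps, Data}} ->
                %% Second round-trip: fetch every parent in one mget.
                Parents = mc:get_multi(Deps),
                case stale(Stamp, Deps, Parents) of
                    true  -> miss;      % a dependency was broken; regenerate
                    false -> {ok, Data}
                end
        end.

    %% The entry is stale if any parent is missing, or was written
    %% after the entry itself was cached.
    stale(Stamp, Deps, Parents) ->
        lists:any(
          fun(DepKey) ->
                  case lists:keyfind(DepKey, 1, Parents) of
                      false -> true;
                      {DepKey, {ParentStamp, _, _}} -> ParentStamp > Stamp
                  end
          end,
          Deps).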

I eventually submitted a patch for "atomic" replace operations which  
I thought would allow me to store the dependency tree the other way  
round. That's a list of "items which depend on me" associated with  
each item in memcached rather than "items which I depend on". It  
turned out, however, that there was still a race condition which  
meant it didn't quite work out. I had to revert to the approach of  
checking dependencies at fetch time.

The reasons I'd really prefer to store the dependency tree the other  
way round are:

1) For cacheable data, there ought to be far more reads than writes,  
so it makes sense to pay for updating the dependency tables once at  
write time and then have cheap reads - especially when the  
alternative requires two round-trips to the server.

2) I can be much cleverer about actually flushing stale data out of  
memory. When a dependency is broken by invalidating an item, it would  
make sense to actually walk the dependency tree all the way down and  
free() up all the memory immediately rather than waiting for the next  
read operation to notice that the data has expired and ought to be  
flushed out. Popular science stories about power laws and "long  
tails" should tell us that a surprising amount of data could hang  
around for quite some time without being read and flushed, occupying  
memory rather unnecessarily.
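
To illustrate what I mean by storing the tree the other way round,  
here's a very rough Erlang sketch of the write-time bookkeeping and  
the cascading flush, assuming the item store is a plain ets table.  
The module name, the {Key, Value, Children} layout and the function  
names are all mine for illustration, and it assumes the dependency  
graph is a tree rather than containing cycles:

    -module(dep_cache).
    -export([new/0, store/3, invalidate/1]).

    %% Sketch only: the cache is a named ets table holding
    %% {Key, Value, Children} tuples, where Children are the keys of
    %% items generated from this one ("items which depend on me").

    new() ->
        ets:new(cache, [named_table, public, set]).

    store(Key, Value, Parents) ->
        ets:insert(cache, {Key, Value, []}),
        %% Register this key as a child of each parent, so a later
        %% invalidation of a parent can cascade down to this entry.
        lists:foreach(
          fun(Parent) ->
                  case ets:lookup(cache, Parent) of
                      [{Parent, PVal, Kids}] ->
                          ets:insert(cache, {Parent, PVal, [Key | Kids]});
                      [] ->
                          ok   % parent not cached; nothing to record
                  end
          end, Parents).

    %% Breaking a dependency walks the reverse tree and frees every
    %% descendant immediately, rather than waiting for a read to
    %% notice that the data has expired.
    invalidate(Key) ->
        case ets:lookup(cache, Key) of
            [{Key, _Value, Kids}] ->
                ets:delete(cache, Key),
                lists:foreach(fun invalidate/1, Kids);
            [] ->
                ok
        end.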

While I'm aware that memcached was written with the mantra "it's not  
a database" and designed to be as simple as possible, I'm interested  
in perhaps building this sort of behaviour into the server rather  
than the client. What I'm actually proposing is re-implementing a  
protocol-compatible version of memcached in a language called Erlang  
<http://www.erlang.org>.

Erlang is a language I use quite extensively, and it has certain  
advantages which probably make it well suited to this. Threads are  
extremely cheap to create, and there's a beautifully clean message  
passing model of concurrency. Having a multi-threaded implementation  
(with some reasonable guarantee that it won't suffer from all the  
classic multithreaded C horrors) has several advantages:

1) Ability to sweep the cache and purge time-expired and unused items  
(in the case where the cache is getting close to its maximum  
capacity) in a background thread.

This has always struck me as being a slightly weak spot in the  
current implementation. My memcached process usually runs pretty  
full, and I sometimes have problems where it just refuses to insert a  
new item. Looking at the following code, this seems to be the  
behaviour for deciding on what to free up to make space for the new  
stuff:

        int tries = 50;
        item *search;

        /* If requested to not push old items out of cache when memory
         * runs out, we're out of luck at this point...
         */

        if (!settings.evict_to_free) return 0;

        /*
         * try to get one off the right LRU
         * don't necessarily unlink the tail because it may be locked:
         * refcount>0
         * search up from tail an item with refcount==0 and unlink it;
         * give up after 50 tries
         */

        if (id > LARGEST_ID) return 0;
        if (tails[id] == 0) return 0;

        for (search = tails[id]; tries > 0 && search;
             tries--, search = search->prev) {
            if (search->refcount == 0) {
                item_unlink(search);
                break;
            }
        }
        it = slabs_alloc(ntotal);
        if (it == 0) return 0;

To me, this appears to be a one-in-one-out policy, and I think what  
I'm seeing is the case where there are a bunch of very small items at  
the end of the LRU, and I'm trying to insert something much larger.  
It therefore takes a certain number of failed writes before enough  
item_unlink() operations are performed to actually create enough  
space for the new item and for the operation to succeed. Is this  
fair, or am I missing something?

What I think would be relatively straightforward to do in an Erlang  
implementation would be to impose a "low water mark" and a "high  
water mark" on memory usage... together with an "absolute maximum". A  
sweeper thread could continually run and keep purging old items from  
the end of the LRU whenever the memory usage is greater than "low  
water". Because each connection would effectively have its own thread  
(actually a very lightweight, emulated Erlang process), I could  
artificially introduce a delay (by making the thread dealing with the  
request sleep for a short period) in every write operation as we  
approach "high water". The idea is to effectively keep the cache in a  
"memory equilibrium" where it's never allocating memory faster than  
the sweeper can purge it. The memory usage could temporarily exceed  
"high water" (but not the absolute maximum), in which case all write  
activity (but not reads) would be halted while the sweeper works to  
get the memory back down to (at least) "high water".
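
As a very rough sketch of that arrangement - the module name, the  
water-mark parameters and the mem_used/evict_from_lru_tail hooks are  
placeholders of mine, not existing code:

    -module(sweeper).
    -export([start/1, maybe_throttle_write/2]).

    %% Placeholder hooks: in a real server these would talk to the
    %% item store and its own memory accounting.
    mem_used() -> erlang:memory(total).  %% stand-in for real accounting
    evict_from_lru_tail() -> ok.         %% stand-in: unlink oldest unlocked item

    %% The sweeper runs in its own (very cheap) process and keeps
    %% evicting from the LRU tail whenever usage is above low water.
    start(LowWater) ->
        spawn(fun() -> sweep(LowWater) end).

    sweep(LowWater) ->
        case mem_used() > LowWater of
            true  -> evict_from_lru_tail();
            false -> timer:sleep(100)    %% below low water: idle briefly
        end,
        sweep(LowWater).

    %% Each per-connection process calls this before a write.  Writes
    %% stall between high water and the absolute maximum, and are
    %% refused outright at the absolute maximum, so the sweeper can
    %% always catch up.
    maybe_throttle_write(HighWater, AbsMax) ->
        Used = mem_used(),
        if
            Used >= AbsMax    -> {error, out_of_memory};
            Used >= HighWater -> timer:sleep(50),
                                 maybe_throttle_write(HighWater, AbsMax);
            true              -> ok
        end.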

The advantage of this crazy scheme is that it gives some (soft)  
guarantee that you'll always be able to write an object of size  
(absolute max - high water) into the cache without being at the mercy  
of the size of whatever happens to come off the LRU queue at that  
time. It will also probably give a small performance boost (even on  
single processor machines) as we'll be able to sweep the cache during  
periods while we're waiting on IO, which I don't believe can happen  
at the moment.

I think the guarantee of always being able to get an object of a  
certain size in is important. The problem where the allocation creeps  
up to the maximum, and the cache is littered with tiny objects which  
don't yield enough space when they're freed, is a killer for me. It  
means that I'm constantly regenerating the largest and most expensive  
objects from the DB, and the cache is point blank refusing to store  
them (because it's full of older crap which I don't really need).

2) Improved statistics.

There are many cases where I'd like to be able to walk over all the  
items in the cache and produce some statistics on what I'm actually  
storing, how often they're fetched, and what the hit-rate per item  
type is, etc. Essentially I just want to map a single user-defined  
function over every item in the cache (or, rather, do a "foldl", which  
is another reason why a functional language like Erlang is useful here).
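
As a sketch of the kind of thing I mean - assuming, again, an  
ets-based item table where each row is {Key, Value, Type, FetchCount},  
values are binaries, and the per-item fetch counter is something the  
server would have to maintain:

    -module(cache_stats).
    -export([stats_by_type/0]).

    %% Sketch only: folds a user-supplied function over every item in
    %% a named ets table 'cache' and accumulates, per item type, the
    %% number of items, total bytes stored and total fetches.

    stats_by_type() ->
        ets:foldl(
          fun({_Key, Value, Type, Fetches}, Acc) ->
                  orddict:update(
                    Type,
                    fun({Count, Bytes, Hits}) ->
                            {Count + 1,
                             Bytes + byte_size(Value),
                             Hits + Fetches}
                    end,
                    {1, byte_size(Value), Fetches},  % first of this type
                    Acc)
          end,
          orddict:new(),
          cache).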

Clearly this is impossible in a single threaded implementation - you  
can't just expect to halt everything for a few seconds while you're  
computing this statistic - but it would work quite well in a  
multithreaded world. I don't like not knowing what's in my cache and  
taking up all the space. While it's true that I could build some  
profiling code into a client to make a statistical estimate of this  
(guessing when things get expired via the LRU, etc), it would be an  
order of magnitude easier if the server could just tell me. I'm  
fairly confident that I could tune my cache usage if I knew what was  
actually in there, and what it was being used for.

3) Distributed server/failover

It might sound like I'm straying away from the "memcached is not a  
database" point here, but building distributed systems in Erlang is  
easy (or at least a lot easier than it would be to try to hack  
something in C to do it). It would be relatively straightforward to  
replicate the contents of one memcached to another machine for  
failover. That's not to say that I'd be using it like a database to  
store stuff which I'm really relying on. I can always repopulate the  
contents of the server from the database, but whenever I have to do  
this it puts such a tremendous strain on the DB that the whole site  
practically grinds to a halt for a few minutes. In an ideal world,  
I'd like to try to avoid this - it's probably not the most important  
thing right now though.

What would be very interesting, however, is creating a truly  
distributed version of memcached where nodes can be added  
dynamically, dependencies can be stored across multiple instances,  
etc. This sounds incredibly far-fetched, but it's actually the sort  
of stuff you get for free with Erlang  
<http://www.erlang.se/doc/doc-5.0.1/lib/mnesia-3.9.2/doc/html/part_frame.html>.
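
For example, a RAM-only table replicated across two nodes is only a  
few lines of Mnesia. The node names, record layout and module below  
are purely illustrative sketches of mine; the mnesia calls themselves  
are the stock API:

    -module(dist_cache).
    -export([setup/0, store/3, fetch/1]).

    -record(cache, {key, value, children = []}).

    %% Sketch: a RAM-only table replicated across two nodes.
    setup() ->
        Nodes = ['memcached@hostA', 'memcached@hostB'],
        mnesia:create_schema(Nodes),
        rpc:multicall(Nodes, mnesia, start, []),
        mnesia:create_table(cache,
                            [{ram_copies, Nodes},  %% in RAM, on both nodes
                             {attributes, record_info(fields, cache)}]).

    store(Key, Value, Children) ->
        mnesia:dirty_write(#cache{key = Key, value = Value,
                                  children = Children}).

    fetch(Key) ->
        case mnesia:dirty_read(cache, Key) of
            [#cache{value = Value}] -> {ok, Value};
            []                      -> miss
        end.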


Of course there are some downsides:


1) Performance.

memcached is blindingly fast. It's going to be very difficult to  
compete with its current implementation, although Erlang does have  
support for kqueue and /dev/poll. While I don't think I can compete  
on benchmarks for time to complete one get/set operation, I think I  
should be able to make up for that. My hope is that, because  
dependency support will let users purge data from the cache as soon  
as it becomes stale (rather than after some arbitrary timeout), and  
because of the improved statistics gathering, it should be possible  
to get much higher cache hit-rates, so net performance ought to be  
better.

2) Testing.

memcached has been tested in the wild with perhaps billions of  
operations, and is generally incredibly stable. It's going to take  
some time to get any replacement right. While I can test it out on my  
own production sites ("eating one's own dogfood", as I think the  
Americans say), there's still a bit of a leap of faith. The flipside  
of that is that "it's not a database", so in the worst case, it's  
always possible to regenerate data from the DB if it does crash.


Any thoughts on the above? I don't particularly want to fork and go  
down my own path, but I think there are certain benefits to be had  
from doing such a thing. It's something I'll almost certainly write  
for my own use (unless anyone comes up with a good idea as to why it  
won't work), and I'll release it under an open source license if  
there's any interest in it. I'd love to hear any comments before I  
get too involved with writing the thing though.

Richard.

