Erlang memcached
Richard Cameron
camster at citeulike.org
Thu Aug 11 05:05:03 PDT 2005
I emailed the list some time ago about patching the memcached binary to include primitives to allow the client to implement a dependency tree of cached values <http://lists.danga.com/pipermail/memcached/2005-March/001270.html>.
The problem - just to recap - is that I'm using memcached behind a
social bookmarking site <http://www.citeulike.org>. There are
multiple representations of the data stored in a user's account
(ordered by date or importance, rendered as HTML, RSS, etc.) which I wish to cache.
There's a fairly well-defined concept of dependency in the HTML
generation process, so if a user updates one particular item of data
in his or her account then some (but not all) of the cached pages
ought to be marked as invalid and removed from the cache. I wish to record the dependencies when I write the data into memcached, rather than trying to do the check at fetch time - which is what I'm currently doing in the client, by storing a list of keys which the item in question depends *on* and then doing a large mget operation to check that none of the dependencies has been broken (which I can tell by comparing timestamps).
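To make that concrete, the fetch-time check amounts to something like this (an Erlang sketch; mc_get/1, mc_mget/1 and the {Timestamp, DepKeys, Value} item layout are all stand-ins, not a real client API):

    %% Rough sketch of the current client-side, fetch-time check.
    %% Each cached item is assumed to be stored as {Timestamp, DepKeys,
    %% Value}, where DepKeys lists the keys this item depends *on*.
    fetch(Key) ->
        case mc_get(Key) of
            undefined ->
                miss;
            {Stamp, DepKeys, Value} ->
                %% One large multi-get to pull in every dependency.
                Deps = mc_mget(DepKeys),
                %% Valid only if every dependency still exists and none
                %% has been rewritten since this item was stored.
                Fresh = lists:all(fun({DepStamp, _, _}) -> DepStamp =< Stamp;
                                     (undefined) -> false
                                  end, Deps),
                case Fresh of
                    true  -> {hit, Value};
                    false -> miss    %% caller regenerates from the DB
                end
        end.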
I eventually submitted a patch for "atomic" replace operations which I thought would allow me to store the dependency tree the other way round: a list of "items which depend on me" associated with each item in memcached, rather than "items which I depend on". It turned out, however, that there was still a race condition which meant it didn't quite work out, and I had to revert to the approach of checking dependencies at fetch time.
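The race, as far as I can tell, is the classic lost update on a read-modify-write cycle. Sketched in Erlang with hypothetical mc_get/1 and mc_replace/2 client calls:

    %% Appending to the "items which depend on me" list is a
    %% read-modify-write cycle, and two clients can interleave it:
    add_dependent(Key, NewDep) ->
        Deps = mc_get("deps:" ++ Key),        %% A reads [k1]; B also reads [k1]
        mc_replace("deps:" ++ Key, [NewDep | Deps]).
        %% A writes [k2, k1]; B then writes [k3, k1], and k2 is lost -
        %% invalidating Key will now never purge k2's cached page.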
The reason I'd really prefer to store the dependency tree the other
way round is:
1) For cacheable data, there ought to be far more reads than writes,
so it makes sense to pay for updating the dependency tables once at
write time and then have cheap reads - especially when the
alternative requires two round-trips to the server.
2) I can be much cleverer about actually flushing stale data out of memory. When a dependency is broken by invalidating an item, it would make sense to walk the dependency tree all the way down and free() up all the memory immediately, rather than waiting for the next read operation to notice that the data has expired and ought to be flushed out (there's a sketch of this just below). Popular science stories about power laws and "long tails" should tell us that a surprising amount of data could hang around for quite some time without being read and flushed, occupying memory rather unnecessarily.
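To sketch what I mean in Erlang (mc_get/1, mc_set/2 and mc_delete/1 are hypothetical calls, and a key "deps:K" is assumed to hold the list of items which depend on K):

    %% Write time: pay once to register the new key against everything
    %% it depends on.
    store(Key, Value, DependsOn) ->
        mc_set(Key, Value),
        lists:foreach(
          fun(Dep) ->
                  Old = case mc_get("deps:" ++ Dep) of
                            undefined -> [];
                            L -> L
                        end,
                  mc_set("deps:" ++ Dep, [Key | Old])
          end, DependsOn).

    %% Invalidation: delete the item, then recursively walk everything
    %% downstream of it, freeing the memory immediately.
    invalidate(Key) ->
        mc_delete(Key),
        case mc_get("deps:" ++ Key) of
            undefined -> ok;
            Dependents ->
                mc_delete("deps:" ++ Key),
                lists:foreach(fun invalidate/1, Dependents)
        end.

Run inside a single server process, the read-modify-write on "deps:K" is serialised for free, which is exactly the atomicity the client-side version couldn't get.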
While I'm aware that memcached was written with the mantra "it's not
a database" and designed to be as simple as possible, I'm interested
in perhaps building this sort of behaviour into the server rather
than the client. What I'm actually proposing is re-implementing a
protocol-compatible version of memcached in a language called Erlang
<http://www.erlang.org>.
Erlang is a language I use quite extensively, and it has some properties which probably make it well suited to this. Threads are extremely cheap to create, and there's a beautifully clean message passing model of concurrency. Having a multi-threaded implementation (with some reasonable guarantee that it won't suffer from all the classic multithreaded C horrors) has certain advantages:
1) Ability to sweep the cache and purge time-expired and unused items (in the case where the cache is getting close to its maximum capacity) in a background thread.
This has always struck me as being a slightly weak spot in the
current implementation. My memcached process usually runs pretty
full, and I sometimes have problems where it just refuses to insert a
new item. Looking at the following code, this seems to be the
behaviour for deciding on what to free up to make space for the new
stuff:
    int tries = 50;
    item *search;

    /* If requested to not push old items out of cache when memory runs out,
     * we're out of luck at this point...
     */
    if (!settings.evict_to_free) return 0;

    /*
     * try to get one off the right LRU
     * don't necessarily unlink the tail because it may be locked: refcount>0
     * search up from tail an item with refcount==0 and unlink it; give up
     * after 50 tries
     */
    if (id > LARGEST_ID) return 0;
    if (tails[id] == 0) return 0;

    for (search = tails[id]; tries > 0 && search; tries--, search = search->prev) {
        if (search->refcount == 0) {
            item_unlink(search);
            break;
        }
    }
    it = slabs_alloc(ntotal);
    if (it == 0) return 0;
To me, this appears to be a one-in-one-out policy, and I think what
I'm seeing is the case where there are a bunch of very small items at
the end of the LRU, and I'm trying to insert something much larger.
It therefore takes a certain number of failed writes before enough
item_unlink() operations are performed to actually create enough
space for the new item and for the operation to succeed. Is this
fair, or am I missing something?
What I think would be relatively straightforward to do in an Erlang
implementation would be to impose a "low water mark" and a "high
water mark" on memory usage... together with an "absolute maximum". A
sweeper thread could continually run and keep purging old items from
the end of the LRU whenever the memory usage is greater than "low
water". Because each connection would effectively have its own thread
(a very lightweight, process emulated one), I could artificially
introduce a delay (by making the thread dealing with the request
sleep for a short period) in every write operation as we approach
"high water". The idea is to effectively keep the cache in a "memory
equilibrium" where its never allocating memory faster than the
sweeper can purge it. The memory usage could temporarily exceed "high
water" (but not the absolute maximum) in which case all write
activity (but not read) would be halted while the sweeper works to
get the memory back down to (at least) "high water".
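A first sketch of both halves (erlang:memory/1 and timer:sleep/1 are real; purge_lru_tail/0 and the water marks, as plain byte counts, are assumptions):

    %% Background sweeper: one lightweight process which keeps evicting
    %% from the LRU tail whenever usage sits above "low water".
    %% purge_lru_tail/0 is an assumed helper that unlinks and frees one item.
    sweeper(LowWater) ->
        case erlang:memory(total) > LowWater of
            true  -> purge_lru_tail();
            false -> timer:sleep(100)    %% nothing to do; idle briefly
        end,
        sweeper(LowWater).

    %% Per-connection write throttle, called before each set operation.
    throttle_write(HighWater) ->
        Used = erlang:memory(total),
        if
            Used >= HighWater ->
                %% above "high water": block this write until the
                %% sweeper has pulled usage back down
                timer:sleep(10),
                throttle_write(HighWater);
            Used >= HighWater * 9 div 10 ->
                %% approaching "high water": just slow the writer down
                timer:sleep(2);
            true ->
                ok
        end.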
The advantage of this crazy scheme is that it gives some (soft)
guarantee that you'll always be able to write an object of size
(absolute max - high water) into the cache without being at the mercy
of the size of whatever happens to come off the LRU queue at that
time. It will also probably give a small performance boost (even on single-processor machines), as we'll be able to sweep the cache while we're waiting on IO, which I don't believe can happen at the moment.
I think the guarantee of always being able to get an object of a certain size in is important. The problem where the allocation creeps up to the maximum, and the cache is littered with tiny objects which don't yield enough space when they're freed, is a killer for me. It means that I'm constantly regenerating the largest and most expensive objects from the DB, and the cache is point blank refusing to store them (because it's full of older crap which I don't really need).
2) Improved statistics.
There are many cases where I'd like to be able to walk over all the items in the cache and produce some statistics on what I'm actually storing, how often each item is fetched, what the hit-rate per item type is, etc. Essentially I just want to map a single user-defined function over every item in the cache (or, rather, do a "foldl", which is another reason why a functional language like Erlang is useful here). Clearly this is impossible in a single-threaded implementation - you can't just expect to halt everything for a few seconds while you're computing this statistic - but it would work quite well in a multithreaded world. I don't like not knowing what's in my cache and taking up all the space. While it's true that I could build some profiling code into a client to make a statistical estimate of this (guessing when things get expired via the LRU, etc), it would be an order of magnitude easier if the server could just tell me. I'm fairly confident that I could tune my cache usage if I knew what was actually in there, and what it was being used for.
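For instance, if the items eventually live in an ETS table of {Key, Value, Hits} tuples (pure guesswork about the data layout at this stage), the whole thing is one fold, and it can run in its own process alongside the connection threads:

    %% Aggregate stats over every cached item: item count, total bytes
    %% (Value assumed to be a binary) and total hits. No snapshot
    %% consistency while writers are active, but fine for statistics.
    stats(Table) ->
        ets:foldl(
          fun({_Key, Value, Hits}, {NItems, NBytes, NHits}) ->
                  {NItems + 1, NBytes + size(Value), NHits + Hits}
          end,
          {0, 0, 0},
          Table).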
3) Distributed server/failover
It might sound like I'm straying away from the "memcached is not a
database" point here, but building distributed systems in Erlang is
easy (or at least a lot easier than it would be to try to hack
something in C to do it). It would be relatively straightforward to
replicate the contents of one memcached to another machine for
failover. That's not to say that I'm using it like a database to store stuff which I'm really relying on. I can always repopulate the contents of the server from the database, but whenever I have to do this it puts such a tremendous strain on the DB that the whole site practically grinds to a halt for a few minutes. In an ideal world, I'd like to try to avoid this - it's probably not the most important thing right now though.
What would, however, be very interesting would be creating a truly distributed version of memcached where nodes can be added dynamically, dependencies can be stored across multiple instances, etc. This sounds incredibly far-fetched, but it's actually the sort of stuff you get for free with Erlang <http://www.erlang.se/doc/doc-5.0.1/lib/mnesia-3.9.2/doc/html/part_frame.html>.
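To give a flavour of what Mnesia (the library behind that link) hands you: a RAM-only table with a full copy on every connected node is just the following, assuming Mnesia is already started on each node:

    %% Stock Mnesia usage, not a cache design: a RAM-only table
    %% replicated in full on every connected node. If one node dies,
    %% the others carry on with complete copies.
    -record(item, {key, value}).

    make_table() ->
        mnesia:create_table(item,
                            [{ram_copies, [node() | nodes()]},
                             {attributes, record_info(fields, item)}]).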
Of course there are some downsides:
1) Performance.
memcached is blindingly fast. It's going to be very difficult to
compete with its current implementation, although Erlang does have
support for kqueue and /dev/poll. While I don't think I can compete on benchmarks for the time to complete one get/set operation, I think I should be able to make up for that. My hope is that support for dependencies (which lets the client purge data from the cache the moment it becomes stale, rather than after some arbitrary timeout), together with the improved statistics gathering, should make much higher cache hit-rates possible, so net performance ought to be better.
2) Testing.
memcached has been tested in the wild with perhaps billions of
operations, and is generally incredibly stable. It's going to take
some time to get any replacement right. While I can test it out on my
own production sites ("eating one's own dogfood", as I think the
Americans say), there's still a bit of a leap of faith. The flipside
of that is that "it's not a database", so in the worst case, it's
always possible to regenerate data from the DB if it does crash.
Any thoughts on the above? I don't particularly want to fork and go
down my own path, but I think there are certain benefits to be had
from doing such a thing. It's something I'll almost certainly write
for my own use (unless anyone comes up with a good idea as to why it
won't work), and I'll release it under an open source license if
there's any interest in it. I'd love to hear any comments before I
get too involved with writing the thing though.
Richard.