Erlang memcached

Sat Aug 13 13:52:59 PDT 2005

Sorry, too long to quote. :)

I bought a book on Erlang a couple months back after having been intrigued
by ejabberd.  What you propose would be a fun project.

For distributed/failover, though, I'm more interested in keeping the
memcached core/protocol and instead of using slabs.c and doing local
storage, linking in the ndb client API and using an ndb farm.  (this is
what MySQL Cluster is built on, but we wouldn't use MySQL nor SQL)

But I don't have time at the moment for this, so I'm curious what you end
up with, and Sean Chittenden ends up with.

- Brad

On Thu, 11 Aug 2005, Richard Cameron wrote:

>
> I emailed the list some time ago about patching the memcached binary
> to include primitives to allow the client to implement a dependency
> tree of cached values <http://lists.danga.com/pipermail/memcached/
> 2005-March/001270.html>.
>
> The problem - just to recap - is that I'm using memcached behind a
> social bookmarking site <http://www.citeulike.org>. There are
> multiple representations of the data stored in a user's account
> (order by date, importance, as HTML, RSS, etc) which I wish to cache.
> There's a fairly well-defined concept of dependency in the HTML
> generation process, so if a user updates one particular item of data
> in his or her account then some (but not all) of the cached pages
> ought to be marked as invalid and removed from the cache. I wish to
> record the dependencies when I write the data into memcached rather
> than trying to do the check at fetch time (which is what I'm
> currently doing in the client by storing a list of keys which the
> item in question depends *on*, and then doing a large mget operation
> to check none of the dependencies has been broken... and I can tell
> that by comparing timestamps).
>
> I eventually submitted a patch for "atomic" replace operations which
> I thought would allow me to store the dependency tree the other way
> round. That's a list of "items which depend on me" associated with
> each item in memcached rather than "items which I depend on". It
> turned out, however, that there was still a race condition which
> meant it didn't quite work out. I had to revert back the approach of
> checking dependencies at fetch time.
>
> The reason I'd really prefer to store the dependency tree the other
> way round is:
>
> 1) For cacheable data, there ought to be far more reads than writes,
> so it makes sense to pay for updating the dependency tables once at
> write time and then have cheap reads - especially when the
> alternative requires two round-trips to the server.
>
> 2) I can be much cleverer about actually flushing stale data out of
> memory. When a dependency is broken by invalidating an item, it would
> make sense to actually walk the dependency tree all the way down and
> free() up all the memory immediately rather than waiting for the next
> read operation to notice that the data has expired and ought to be
> flushed out. Popular science stories about power laws and "long
> tails" should tell us that a surprising amount of data could hang
> around for quite some time without being read and flushed, occupying
> memory rather unnecessarily.
>
> While I'm aware that memcached was written with the mantra "it's not
> a database" and designed to be as simple as possible, I'm interested
> in perhaps building this sort of behaviour into the server rather
> than the client. What I'm actually proposing is re-implementing a
> protocol-compatible version of memcached in a language called Erlang
> <http://www.erlang.org>.
>
> Erlang is a language I use quite extensively and has certain
> advantages which probably make it well suited to this. Threads are
> extremely cheap to create, and there's a beautifully clean message
> passing model of concurrency. Having a multi-threaded implementation
> (with some reasonable guarantee that it won't suffer from all the
> classic multithreaded C horrors) has certain advantages:
>
> 1) Ability to sweep the cache and purge time-expired and unused items
> (in the case where the cache is getting close it its maximum
> capacity) in a background thread.
>
> This has always struck me as being a slightly weak spot in the
> current implementation. My memcached process usually runs pretty
> full, and I sometimes have problems where it just refuses to insert a
> new item. Looking at the following code, this seems to be the
> behaviour for deciding on what to free up to make space for the new
> stuff:
>
>         int tries = 50;
>          item *search;
>
>          /* If requested to not push old items out of cache when
> memory runs out,
>           * we're out of luck at this point...
>           */
>
>          if (!settings.evict_to_free) return 0;
>
>          /*
>           * try to get one off the right LRU
>           * don't necessariuly unlink the tail because it may be
> locked: refcount>0
>           * search up from tail an item with refcount==0 and unlink
> it; give up after 50
>           * tries
>           */
>
>          if (id > LARGEST_ID) return 0;
>          if (tails[id]==0) return 0;
>
>          for (search = tails[id]; tries>0 && search; tries--,
> search=search->prev) {
>              if (search->refcount==0) {
>                  item_unlink(search);
>                  break;
>              }
>          }
>          it = slabs_alloc(ntotal);
>          if (it==0) return 0;
>
> To me, this appears to be a one-in-one-out policy, and I think what
> I'm seeing is the case where there are a bunch of very small items at
> the end of the LRU, and I'm trying to insert something much larger.
> It therefore takes a certain number of failed writes before enough
> item_unlink() operations are performed to actually create enough
> space for the new item and for the operation to succeed. Is this
> fair, or am I missing something?
>
> What I think would be relatively straightforward to do in an Erlang
> implementation would be to impose a "low water mark" and a "high
> water mark" on memory usage... together with an "absolute maximum". A
> sweeper thread could continually run and keep purging old items from
> the end of the LRU whenever the memory usage is greater than "low
> water". Because each connection would effectively have its own thread
> (a very lightweight, process emulated one), I could artificially
> introduce a delay (by making the thread dealing with the request
> sleep for a short period) in every write operation as we approach
> "high water". The idea is to effectively keep the cache in a "memory
> equilibrium" where its never allocating memory faster than the
> sweeper can purge it. The memory usage could temporarily exceed "high
> water" (but not the absolute maximum) in which case all write
> activity (but not read) would be halted while the sweeper works to
> get the memory back down to (at least) "high water".
>
> The advantage of this crazy scheme is that it gives some (soft)
> guarantee that you'll always be able to write an object of size
> (absolute max - high water) into the cache without being at the mercy
> of the size of whatever happens to come off the LRU queue at that
> time. It will also probably give a small performance boost (even on
> single processor machines) as we'll be able to sweep the cache during
> periods while we're waiting on IO, which I don't believe can happen
> at the moment.
>
> I think the guarantee of always being able to get an object of a
> certain size in is important. The problem where the allocation creeps
> up to the maximum, and the cache is littered with tiny object which
> don't yield enough space when they're freed is a killer for me. It
> means that I'm constantly regenerating the largest and most expensive
> objects from the DB, and the cache is point blank refusing to store
> them (because it's full of older crap which I don't really need).
>
> 2) Improved statistics.
>
> There are many cases where I'd like to be able to walk over all the
> items in the cache and produce some statistics on what I'm actually
> storing, how often they're fetched, and what the hit-rate per item
> type is, etc. Essentially I just want to map a single user-defined
> function over every item in the cache (or, rather do a "foldl", which
> is another reason why a functional language like Erlang is useful here).
>
> Clearly this is impossible in a single threaded implementation - you
> can't just expect to halt everything for a few seconds while you're
> computing this statistic - but it would work quite well in a
> multithreaded world. I don't like not knowing what's in my cache and
> taking up all the space. While it's true that I could build some
> profiling code into a client to make a statistical estimate of this
> (guessing when things get expired via the LRU, etc), it would be an
> order of magnitude easier of the server could just tell me. I'm
> fairly confident that I could tune my cache usage if I knew what was
> actually in there, and what it was being used for.
>
> 3) Distributed server/failover
>
> It might sound like I'm straying away from the "memcached is not a
> database" point here, but building distributed systems in Erlang is
> easy (or at least a lot easier than it would be to try to hack
> something in C to do it). It would be relatively straightforward to
> replicate the contents of one memcached to another machine for
> failover. That's not saying that I'm using it like a database to
> store stuff which I'm really relying on. I can always repopulate the
> contents of the server from the database, but whenever I have to do
> this it puts such a tremendous strain on the DB that the whole site
> practically grinds to a halt for a few minutes. In an ideal world,
> I'd like to try to avoid this - it's probably not the most important
> thing right now though.
>
> What would, however, be very interesting would be creating a truly
> distributed version of memcached where nodes can be added
> dynamically, dependencies can be stored across multiple instances,
> etc. This sounds incredibly far-fetched, but it's actually the sort
> of stuff you get for free with Erlang <http://www.erlang.se/doc/
> doc-5.0.1/lib/mnesia-3.9.2/doc/html/part_frame.html>
>
>
> Of course there are some downsides:
>
>
> 1) Performance.
>
> memcached is blindingly fast. It's going to be very difficult to
> compete with its current implementation, although Erlang does have
> support for kqueue and /dev/poll. While I don't think I can compete
> on benchmarks for time to complete one get/set operation, I think I
> should be able to make up for that. My hope is that because support
> for dependencies will let users purge stale data out of the cache
> when it becomes stale (rather than after some arbitrary timeout), and
> the improved statistics gathering, it should be possible to get much
> higher cache hit-rates, so net performance ought to be better.
>
> 2) Testing.
>
> memcached has been tested in the wild with perhaps billions of
> operations, and is generally incredibly stable. It's going to take
> some time to get any replacement right. While I can test it out on my
> own production sites ("eating one's own dogfood", as I think the
> Americans say), there's still a bit of a leap of faith. The flipside
> of that is that "it's not a database", so in the worst case, it's
> always possible to regenerate data from the DB if it does crash.
>
>
> Any thoughts on the above? I don't particularly want to fork and go
> down my own path, but I think there are certain benefits to be had
> from doing such a thing. It's something I'll almost certainly write
> for my own use (unless anyone comes up with a good idea as to why it
> won't work), and I'll release it under and open source license if
> there's any interest in it. I'd love to hear any comments before I
> get too involved with writing the thing though.
>
> Richard.
>
>