Patch: CPU efficiency, UDP support, and other changes

Tue May 30 03:55:12 UTC 2006

Steven,

This part doesn't intuitively look right:

+            case TRANSMIT_INCOMPLETE:
+            case TRANSMIT_HARD_ERROR:
+                break;                   /* Continue in state machine. */
+
+            case TRANSMIT_SOFT_ERROR:
+                exit = 1;
+                break;

Should the HARD/SOFT be reversed?

On Wed, 3 May 2006, Steven Grimm wrote:

> This big patch contains all the changes we've made to memcached 1.1.12
> at facebook.com. It includes some changes I've sent to the list as
> separate patches:
>
> * Memory efficiency is increased; we get about 40% more items in a given
> amount of memory vs. the standard 1.1.12 memcached. (This patch has a
> couple of tweaks that aren't in the previous smaller one since they tie
> into other changes.)
> * Support for large memory sizes (64-bit pointers and size_t).
> * Fix for bogus "out of memory" errors caused by memory filling up
> before a slab class has any slabs.
>
> But the big changes here are not in the other patches:
>
> * CPU consumption is reduced 25-30%.
> * A UDP-based interface is supported in addition to the standard TCP one.
>
> No doubt some will ask, "Why are you sending this out as one big patch
> instead of splitting everything out into small independent patches?"
> I'll include a section answering just that question at the end of this
> message.
>
> Details follow. Some of this will look familiar if you've seen the
> earlier patches.
>
>
>
> Memory consumption
> ------------------
> The slab allocator's powers-of-2 size strategy is now a powers-of-N
> strategy, where N may be specified on the command line using the new
> "-f" option.  The default is 1.25. For a large memcached instance, where
> there are enough items of enough different sizes that the increased
> number of slab classes isn't itself a waste of memory, this is a
> significant win: items are placed in chunks whose sizes are much closer
> to the item size, wasting less memory.
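For anyone following along, here is a minimal sketch of what a powers-of-N size table could look like. It is not the patch's slabs_init(). The function name build_chunk_sizes, the 8-byte alignment, and the caller-supplied starting size are assumptions for illustration; only the growth factor (default 1.25) and the 1MB ceiling come from the message.

```c
#include <stddef.h>

#define MAX_CHUNK   ((size_t)1024 * 1024)  /* 1MB ceiling, per the message */
#define CHUNK_ALIGN ((size_t)8)            /* assumed alignment, illustrative */

/*
 * Fill sizes[] with chunk sizes that grow by `factor` from one class to
 * the next, starting at `first` bytes. Returns the number of classes
 * generated, or 0 if the arguments could never produce growing sizes.
 */
static size_t build_chunk_sizes(size_t first, double factor,
                                size_t sizes[], size_t max_classes)
{
    size_t n = 0;
    size_t size = first;

    if (factor <= 1.0 || first == 0)
        return 0;

    while (n < max_classes && size <= MAX_CHUNK) {
        size_t next;

        /* Round the chunk size up to the alignment boundary. */
        if (size % CHUNK_ALIGN)
            size += CHUNK_ALIGN - (size % CHUNK_ALIGN);
        if (size > MAX_CHUNK)
            break;

        sizes[n++] = size;

        next = (size_t)((double)size * factor);
        if (next <= size)
            next = size + CHUNK_ALIGN;     /* guarantee forward progress */
        size = next;
    }
    return n;
}
```

With a factor of 2.0 and a 64-byte start this yields the old powers-of-2 ladder; with 1.25 the steps are 64, 80, 104, ... so an item lands in a chunk much closer to its own size.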
>
> One consequence of this is that slabs are no longer fixed-size; by
> default they are no bigger than 1MB each, but are only as big as they
> need to be to hold a whole number of chunks. That causes the "slabs
> reassign" command to be unavailable; it can be reenabled by compiling
> with -DALLOW_SLABS_REASSIGN at the expense of some wasted memory (all
> slabs will be 1MB).
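The arithmetic behind the variable slab size can be sketched as follows. The helper names are hypothetical and this is only my reading of the paragraph above: a slab holds as many whole chunks as fit in 1MB, and padding every slab to a full 1MB (the -DALLOW_SLABS_REASSIGN case) wastes the remainder.

```c
#include <stddef.h>

#define SLAB_MAX_BYTES ((size_t)1024 * 1024)  /* 1MB cap, per the message */

/* Number of whole chunks of chunk_size that fit in a slab of at most 1MB. */
static size_t chunks_per_slab(size_t chunk_size)
{
    return chunk_size ? SLAB_MAX_BYTES / chunk_size : 0;
}

/* Bytes allocated for one slab when it holds a whole number of chunks. */
static size_t slab_bytes(size_t chunk_size)
{
    return chunks_per_slab(chunk_size) * chunk_size;
}

/* Bytes left unused per slab if every slab were padded to exactly 1MB. */
static size_t fixed_slab_waste(size_t chunk_size)
{
    return chunk_size ? SLAB_MAX_BYTES - slab_bytes(chunk_size) : 0;
}
```

For a 104-byte chunk this gives 10082 chunks in 1,048,528 bytes, so a fixed 1MB slab would leave 48 bytes unused.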
>
> The minimum amount of space for data in chunks of the smallest slab
> class may be adjusted on the command line using the new "-s" option.
> Each chunk is that many bytes plus the size of the fixed-size item
> header. If you have a lot of items with small fixed-size keys and
> values, you can use this option to maximize the number of items per slab
> in the smallest slab class; experiment with your particular data to find
> the optimal value.
>
> Item expiration times and access times are now stored as 32-bit integers
> (number of seconds relative to server start) rather than time_t, which
> is 64 bits on some platforms. This saves 8 bytes per item when compiled
> in 64-bit mode, and is harmless otherwise.
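A rough sketch of the relative-time idea, assuming a typedef and helper names of my own choosing (rel_time_t, clock_init, to_relative, to_absolute), none of which are taken from the patch. Two fields shrinking from an 8-byte time_t to 4 bytes each would account for the 8 bytes saved per item.

```c
#include <stdint.h>
#include <time.h>

/* Seconds since server start; 32 bits regardless of the size of time_t. */
typedef uint32_t rel_time_t;

static time_t server_start;   /* recorded once at startup */

static void clock_init(time_t now)
{
    server_start = now;
}

/* Convert an absolute time to seconds relative to server start.
 * Times at or before startup are clamped to 0 in this sketch. */
static rel_time_t to_relative(time_t abs_time)
{
    if (abs_time <= server_start)
        return 0;
    return (rel_time_t)(abs_time - server_start);
}

/* Convert a relative time back to an absolute time_t. */
static time_t to_absolute(rel_time_t rel)
{
    return server_start + (time_t)rel;
}
```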
>
> CPU consumption
> ---------------
> The implementation of the "get" request is substantially reworked. Now
> the entire response is composed in memory ahead of time, and we write it
> out in (usually) just one system call using sendmsg()'s scatter/gather
> capability.  Since we are no longer doing small writes, the
> TCP_CORK/NOPUSH code is not needed and we can simply set the TCP socket
> to TCP_NODELAY at connect time, saving a couple more system calls per
> request.
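To make the scatter/gather point concrete, here is a minimal sketch of writing a one-item "get" response with a single sendmsg() call. The function name send_get_response is hypothetical, and the real patch builds its iovec list incrementally for multiple items; this only shows the system-call pattern.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

/*
 * Send the "VALUE ..." header, the item data, and the trailer in one
 * sendmsg() call instead of three separate writes. Returns the number
 * of bytes written, or -1 on error (see errno). A short write is
 * possible on a stream socket; a real caller would have to resume.
 */
static ssize_t send_get_response(int fd,
                                 const char *header, size_t header_len,
                                 const char *data, size_t data_len)
{
    static const char trailer[] = "\r\nEND\r\n";
    struct iovec iov[3];
    struct msghdr msg;

    iov[0].iov_base = (void *)header;
    iov[0].iov_len  = header_len;
    iov[1].iov_base = (void *)data;
    iov[1].iov_len  = data_len;
    iov[2].iov_base = (void *)trailer;
    iov[2].iov_len  = sizeof(trailer) - 1;   /* exclude the NUL */

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov    = iov;
    msg.msg_iovlen = 3;

    return sendmsg(fd, &msg, 0);
}
```

Because the kernel receives the whole response at once, there are no small writes left to coalesce, which is why the cork/uncork calls can go away.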
>
> The "VALUE" line (response to a "get" request) is rendered once at item
> creation time, rather than re-rendered on each fetch.
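A small sketch of rendering the "VALUE" line once, at item creation. The struct item_sketch layout, the item_init name, and the 250-byte key limit are assumptions for illustration and do not reflect the patch's actual item structure.

```c
#include <stdio.h>
#include <string.h>

#define KEY_MAX    250                 /* assumed key length limit */
#define HEADER_MAX (KEY_MAX + 48)      /* room for "VALUE ", numbers, CRLF, NUL */

struct item_sketch {
    char         key[KEY_MAX + 1];
    unsigned int flags;
    size_t       nbytes;               /* length of the value in bytes */
    char         header[HEADER_MAX];   /* pre-rendered "VALUE ..." line */
    size_t       header_len;
};

/*
 * Initialize an item and render its response header exactly once.
 * Returns 0 on success, -1 if the key or header does not fit.
 */
static int item_init(struct item_sketch *it, const char *key,
                     unsigned int flags, size_t nbytes)
{
    int n;

    if (strlen(key) > KEY_MAX)
        return -1;
    strcpy(it->key, key);
    it->flags  = flags;
    it->nbytes = nbytes;

    n = snprintf(it->header, sizeof(it->header),
                 "VALUE %s %u %zu\r\n", key, flags, nbytes);
    if (n < 0 || (size_t)n >= sizeof(it->header))
        return -1;
    it->header_len = (size_t)n;
    return 0;
}
```

Each later fetch can then hand header and header_len straight to the output path without another snprintf() call.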
>
> The current system time is stored in a global variable that's updated
> every second by a libevent timer; this eliminates several time() calls
> per request. A minor improvement, but a cycle saved is a cycle earned.
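The cached-clock idea can be sketched without libevent. The names current_time, clock_tick, and is_expired are assumptions, as is treating an expiration time of 0 as "never expires"; in the patch the tick would be driven by a libevent timer firing once per second.

```c
#include <time.h>

/* Cached wall-clock time, refreshed about once per second. */
static volatile time_t current_time;

/* Timer callback body: the only place that calls time(). */
static void clock_tick(void)
{
    current_time = time(NULL);
}

/*
 * Request-path check that reads the cached value instead of making a
 * time() call. An exptime of 0 is taken to mean "never expires".
 */
static int is_expired(time_t exptime)
{
    return exptime != 0 && exptime <= current_time;
}
```

The trade-off is up to one second of staleness, which does not matter for expiration times that are themselves measured in whole seconds.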
>
> UDP support
> -----------
> For large installations with tens of thousands of clients, the amount of
> memory consumed by per-TCP-connection kernel buffers can grow large,
> reducing the amount of memory that can be used by memcached.  There is
> now a UDP protocol, which supports an arbitrarily large number of
> clients using a constant amount of server memory.
>
> In the interest of efficiency and simplicity of implementation, the UDP
> protocol does not support reliable delivery; it should therefore be used
> for "get" requests where a dropped response would simply result in a
> recoverable cache miss. For write requests (set, delete, etc.) or very
> large "get" requests, a nonpersistent TCP connection should be used.
> (This is simply advice; the code will happily accept any kind of request
> via its UDP interface.)
>
> UDP support is only enabled if a UDP port is specified on the command line.
>
> The UDP protocol is described at the bottom of doc/protocol.txt.
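For readers without the file at hand: as far as I recall from memcached's doc/protocol.txt, each UDP datagram starts with an 8-byte frame header of four 16-bit fields in network byte order (request ID, sequence number, total number of datagrams, reserved). Treat that layout as an assumption to verify against the protocol.txt shipped with the patch. The struct and function names below are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define UDP_HEADER_SIZE 8

struct udp_frame_header {
    uint16_t request_id;  /* chosen by the client, echoed by the server */
    uint16_t seq_no;      /* index of this datagram within the message */
    uint16_t total;       /* number of datagrams in the message */
    uint16_t reserved;    /* assumed to be zero */
};

static void put_u16(unsigned char *p, uint16_t v)
{
    p[0] = (unsigned char)(v >> 8);    /* big-endian (network) order */
    p[1] = (unsigned char)(v & 0xff);
}

static uint16_t get_u16(const unsigned char *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Serialize the header into the first 8 bytes of a datagram. */
static void udp_header_encode(const struct udp_frame_header *h,
                              unsigned char out[UDP_HEADER_SIZE])
{
    put_u16(out + 0, h->request_id);
    put_u16(out + 2, h->seq_no);
    put_u16(out + 4, h->total);
    put_u16(out + 6, h->reserved);
}

/* Parse a header; returns 0 on success, -1 if the datagram is too short. */
static int udp_header_decode(const unsigned char *in, size_t len,
                             struct udp_frame_header *h)
{
    if (len < UDP_HEADER_SIZE)
        return -1;
    h->request_id = get_u16(in + 0);
    h->seq_no     = get_u16(in + 2);
    h->total      = get_u16(in + 4);
    h->reserved   = get_u16(in + 6);
    return 0;
}
```

A client that never receives all `total` datagrams for a request ID would simply give up and treat the lookup as a cache miss, in line with the advice above.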
>
> Large memory support
> --------------------
> This mostly involves using size_t rather than unsigned int in a few
> places and compiling in 64-bit mode, which gives us 64-bit pointers and
> makes size_t 64 bits.
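A tiny illustration of why unsigned int is not enough here, using hypothetical helper names. On platforms where unsigned int is 32 bits, converting a 4096MB limit to bytes wraps around to zero, while the size_t version is correct in a 64-bit build.

```c
#include <stddef.h>

/* Megabytes to bytes in unsigned int arithmetic: wraps at 4096MB when
 * unsigned int is 32 bits wide. */
static unsigned int mb_to_bytes_uint(unsigned int mb)
{
    return mb * 1024u * 1024u;
}

/* Megabytes to bytes in size_t arithmetic: exact for large limits when
 * size_t is 64 bits wide. */
static size_t mb_to_bytes_size(size_t mb)
{
    return mb * 1024 * 1024;
}
```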
>
> Fix for "out of memory" errors
> ------------------------------
> Rather than preallocate a slab in each slab class as the memcached
> 1.1.13 prerelease does, we decided to instead allow memcached to exceed
> its memory limit slightly. When a "set" request comes in that requires a
> slab whose slab class is empty, we always allocate a slab, even if
> memcached is already at its configured memory limit.
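The allocation rule described above boils down to one extra condition. This is a sketch with hypothetical names (struct slab_class_sketch, may_allocate_slab), not the patch's code.

```c
#include <stddef.h>

struct slab_class_sketch {
    size_t slab_count;   /* slabs currently owned by this class */
    size_t slab_bytes;   /* size of one slab in this class */
};

/*
 * Decide whether a new slab may be allocated for `cls`.
 * Returns 1 if allowed, 0 if the request should fail instead.
 */
static int may_allocate_slab(const struct slab_class_sketch *cls,
                             size_t mem_allocated, size_t mem_limit)
{
    /* An empty class always gets its first slab, even over the limit. */
    if (cls->slab_count == 0)
        return 1;

    /* Otherwise the new slab must fit under the configured limit. */
    return mem_allocated <= mem_limit &&
           cls->slab_bytes <= mem_limit - mem_allocated;
}
```

The overshoot is bounded by one slab per class, which is where the "few megabytes" figure in the next paragraph comes from.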
>
> Our memcached instances are large enough that going over the limit by a
> few megabytes is barely even detectable. If you are running in a very
> constrained environment, you can lower the memory limit slightly to
> account for this change, but bear in mind that memcached will only
> exceed the memory limit if a "set" request requires it (which will never
> happen if your data always falls within a limited range of sizes).
>
> Why is this one patch?
> ----------------------
> First, this patch is tested. It runs 24x7 on a large number of very busy
> memcached hosts. Thoroughly testing every possible permutation of these
> changes isn't really feasible.
>
> Second, the changes are not all easily separable.  For example, adding
> the UDP support required reorganizing memcached's implementation of the
> "get" request, and that reorganization also resulted in most of the CPU
> time improvement.  Similarly, one of the memory efficiency tweaks is
> only required because compiling in 64-bit mode (for large memory
> support) increases the size of a particular data type, and the
> implementation of that tweak results in part of the CPU time savings.
>
> Third, I *did* send it out as separate patches to the extent it made
> sense to separate out the changes. But rather than excluding those
> changes from the not-easily-separable stuff, I think it makes more sense
> to include it all together. Otherwise anyone who wants to combine
> everything will have to do tedious error-prone manual editing to merge
> it all together, since some of the changes conflict. For example, both
> the large memory support and the slab allocator modification involve
> changing the parameters to slabs_init(), so it would be impossible to
> produce two independent patches against the 1.1.12 release that could be
> applied successfully one after the other.
>
> Credits
> -------
> These changes were made by David Fetterman, Steven Grimm, and Scott
> Marlette.  Send comments to Steven Grimm (sgrimm at facebook.com) or,
> preferably, to the memcached mailing list.
>