Patch: CPU efficiency, UDP support, and other changes

Thu May 4 20:05:28 UTC 2006

Any chance of the patch getting accepted?  Would it be hard to redo the
patch off of 1.1.13 once that releases?  Based on your description, we
would be quite interested in what you did.

Now if you could just add some real namespace support . . . .  Wouldn't
facebook be interested in such a thing?

Thanks,
Earl

> -----Original Message-----
> From: memcached-bounces at lists.danga.com
> [mailto:memcached-bounces at lists.danga.com] On Behalf Of Steven Grimm
> Sent: Wednesday, May 03, 2006 10:07 AM
> To: memcached at lists.danga.com
> Subject: Patch: CPU efficiency, UDP support, and other changes
> 
> This big patch contains all the changes we've made to memcached 1.1.12
> at facebook.com. It includes some changes I've sent to the list as
> separate patches:
> 
> * Memory efficiency is increased; we get about 40% more items in a given
> amount of memory vs. the standard 1.1.12 memcached. (This patch has a
> couple tweaks that aren't in the previous smaller one since they tie
> into other changes.)
> * Support for large memory sizes (64-bit pointers and size_t).
> * Fix for bogus "out of memory" errors caused by memory filling up
> before a slab class has any slabs.
> 
> But the big changes here are not in the other patches:
> 
> * CPU consumption is reduced 25-30%.
> * A UDP-based interface is supported in addition to the standard TCP one.
> 
> No doubt some will ask, "Why are you sending this out as one big patch
> instead of splitting everything out into small independent patches?"
> I'll include a section answering just that question at the end of this
> message.
> 
> Details follow. Some of this will look familiar if you've seen the
> earlier patches.
> 
> 
> 
> Memory consumption
> ------------------
> The slab allocator's powers-of-2 size strategy is now a powers-of-N
> strategy, where N may be specified on the command line using the new
> "-f" option.  The default is 1.25. For a large memcached instance, where
> there are enough items of enough different sizes that the increased
> number of slab classes isn't itself a waste of memory, this is a
> significant win: items are placed in chunks whose sizes are much closer
> to the item size, wasting less memory.
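
A rough sketch of the powers-of-N idea described above, not the code from
the patch. The function names, the 8-byte alignment, and the 64-byte
smallest chunk are assumptions made only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Fill sizes[] with chunk sizes that grow by `factor` per slab class and
 * return the number of classes produced. Rounding each size up to a
 * multiple of 8 bytes is an assumption of this sketch. */
size_t build_chunk_sizes(size_t smallest, double factor, size_t max_chunk,
                         size_t *sizes, size_t max_classes)
{
    size_t n = 0;
    size_t size = smallest;

    assert(smallest > 0 && factor > 1.0 && sizes != NULL);
    while (n < max_classes && size <= max_chunk) {
        size_t aligned = (size + 7) & ~(size_t)7;
        size_t next = (size_t)((double)aligned * factor);

        sizes[n++] = aligned;
        if (next <= aligned)        /* guarantee progress for tiny factors */
            next = aligned + 8;
        size = next;
    }
    return n;
}

/* Return the smallest chunk size that can hold `item` bytes, or 0 if the
 * item is larger than every class. */
size_t chunk_for_item(const size_t *sizes, size_t n, size_t item)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (sizes[i] >= item)
            return sizes[i];
    return 0;
}
```

With these assumptions, a 100-byte item lands in a 128-byte chunk under a
factor of 2 but in a 104-byte chunk under a factor of 1.25, at the cost of
more slab classes over the same size range.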
> 
> One consequence of this is that slabs are no longer fixed-size; by
> default they are no bigger than 1MB each, but are only as big as they
> need to be to hold a whole number of chunks. That causes the "slabs
> reassign" command to be unavailable; it can be reenabled by compiling
> with -DALLOW_SLABS_REASSIGN at the expense of some wasted memory (all
> slabs will be 1MB).
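
The slab sizing rule above could look roughly like this. It is a sketch
under assumptions: slab_bytes_for() and SLAB_MAX_BYTES are made-up names,
and the allow_reassign argument stands in for building with
-DALLOW_SLABS_REASSIGN.

```c
#include <assert.h>
#include <stddef.h>

#define SLAB_MAX_BYTES ((size_t)1024 * 1024)   /* 1MB cap from the text */

/* Size of one slab for a given chunk size. Without reassignment support
 * the slab holds a whole number of chunks and nothing more; with it,
 * every slab is the full 1MB and the tail may go unused. */
size_t slab_bytes_for(size_t chunk_size, int allow_reassign)
{
    assert(chunk_size > 0 && chunk_size <= SLAB_MAX_BYTES);
    if (allow_reassign)
        return SLAB_MAX_BYTES;
    return (SLAB_MAX_BYTES / chunk_size) * chunk_size;
}
```

For a 104-byte chunk this gives 10082 chunks in 1048528 bytes, so the
fixed-size variant would leave 48 bytes per slab unused.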
> 
> The minimum amount of space for data in chunks of the smallest slab
> class may be adjusted on the command line using the new "-s" option.
> Each chunk is that many bytes plus the size of the fixed-size item
> header. If you have a lot of items with small fixed-size keys and
> values, you can use this option to maximize the number of items per slab
> in the smallest slab class; experiment with your particular data to find
> the optimal value.
> 
> Item expiration times and access times are now stored as 32-bit integers
> (number of seconds relative to server start) rather than time_t, which
> is 64 bits on some platforms. This saves 8 bytes per item when compiled
> in 64-bit mode, and is harmless otherwise.
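
One way to picture the relative timestamps, as a sketch only. The type name
rel_time_t and the helpers are hypothetical, and treating an expiration of 0
as "never expires" is an assumption based on the usual memcached convention.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Seconds since server start; 4 bytes regardless of sizeof(time_t). */
typedef uint32_t rel_time_t;

/* Convert an absolute time to seconds relative to server start. */
rel_time_t to_relative(time_t absolute, time_t server_start)
{
    assert(absolute >= server_start);
    return (rel_time_t)(absolute - server_start);
}

/* Convert back, e.g. for reporting. */
time_t to_absolute(rel_time_t rel, time_t server_start)
{
    return server_start + (time_t)rel;
}

/* An item with exptime 0 is assumed never to expire. */
int is_expired(rel_time_t exptime, rel_time_t now)
{
    return exptime != 0 && exptime <= now;
}
```

Two such fields (expiration and last access) take 8 bytes instead of the 16
that a pair of 64-bit time_t values would need, which matches the saving
quoted above.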
> 
> CPU consumption
> ---------------
> The implementation of the "get" request is substantially reworked. Now
> the entire response is composed in memory ahead of time, and we write it
> out in (usually) just one system call using sendmsg()'s scatter/gather
> capability.  Since we are no longer doing small writes, the
> TCP_CORK/NOPUSH code is not needed and we can simply set the TCP socket
> to TCP_NODELAY at connect time, saving a couple more system calls per
> request.
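
A minimal sketch of the scatter/gather write, assuming a single-item
response; send_get_response() is a made-up name and this is not the patch's
code. The pieces of the reply stay where they are in memory and one
sendmsg() call hands them all to the kernel.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send "VALUE ...\r\n<data>\r\nEND\r\n" with one system call. value_line
 * must already end in "\r\n". Returns what sendmsg() returns. */
ssize_t send_get_response(int fd, const char *value_line,
                          const void *data, size_t data_len)
{
    static const char crlf[] = "\r\n";
    static const char end_marker[] = "END\r\n";
    struct iovec iov[4];
    struct msghdr msg;

    assert(value_line != NULL && data != NULL);

    iov[0].iov_base = (void *)value_line;     /* pre-rendered header line */
    iov[0].iov_len  = strlen(value_line);
    iov[1].iov_base = (void *)data;           /* item data, not copied */
    iov[1].iov_len  = data_len;
    iov[2].iov_base = (void *)crlf;
    iov[2].iov_len  = sizeof crlf - 1;
    iov[3].iov_base = (void *)end_marker;
    iov[3].iov_len  = sizeof end_marker - 1;

    memset(&msg, 0, sizeof msg);
    msg.msg_iov = iov;
    msg.msg_iovlen = 4;

    return sendmsg(fd, &msg, 0);
}
```

A real server would also have to cope with a short write (sendmsg()
returning fewer bytes than requested) by resuming from the right offset;
that is left out here.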
> 
> The "VALUE" line (response to a "get" request) is rendered once at item
> creation time, rather than re-rendered on each fetch.
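
Rendering the "VALUE" line once might look something like the following.
The struct, its field names, and the buffer size are hypothetical; the real
item layout in the patch may well differ.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical slice of an item: only the cached header line. */
struct item_sketch {
    char value_line[300];   /* "VALUE <key> <flags> <bytes>\r\n" */
    int  value_line_len;
};

/* Format the header once, when the item is stored. Later "get" requests
 * can send value_line as-is instead of calling snprintf() again. */
int render_value_line(struct item_sketch *it, const char *key,
                      unsigned flags, size_t nbytes)
{
    int n = snprintf(it->value_line, sizeof it->value_line,
                     "VALUE %s %u %zu\r\n", key, flags, nbytes);

    assert(n > 0 && (size_t)n < sizeof it->value_line);
    it->value_line_len = n;
    return n;
}
```

The cached line and its length are exactly what a scatter/gather write
needs as its first buffer.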
> 
> The current system time is stored in a global variable that's updated
> every second by a libevent timer; this eliminates several time() calls
> per request. A minor improvement, but a cycle saved is a cycle earned.
> 
> UDP support
> -----------
> For large installations with tens of thousands of clients, the amount of
> memory consumed by per-TCP-connection kernel buffers can grow large,
> reducing the amount of memory that can be used by memcached.  There is
> now a UDP protocol, which supports an arbitrarily large number of
> clients using a constant amount of server memory.
> 
> In the interest of efficiency and simplicity of implementation, the UDP
> protocol does not support reliable delivery; it should therefore be used
> for "get" requests where a dropped response would simply result in a
> recoverable cache miss. For write requests (set, delete, etc.) or very
> large "get" requests, a nonpersistent TCP connection should be used.
> (This is simply advice; the code will happily accept any kind of request
> via its UDP interface.)
> 
> UDP support is only enabled if a UDP port is specified on the command
> line.
> 
> The UDP protocol is described at the bottom of doc/protocol.txt.
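
For orientation, here is a sketch of the 8-byte frame header as I understand
it from memcached's doc/protocol.txt: four 16-bit fields in network byte
order (request ID, sequence number, total datagrams in the message, and a
reserved field that must be 0). The struct and function names are mine, and
the layout should be checked against the protocol.txt that ships with the
patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UDP_HEADER_BYTES 8

struct udp_frame_header {
    uint16_t request_id;   /* chosen by the client, echoed by the server */
    uint16_t seq_no;       /* index of this datagram within the message */
    uint16_t total;        /* number of datagrams in the message */
    uint16_t reserved;     /* assumed to be 0 */
};

static void put16(unsigned char *p, uint16_t v)
{
    p[0] = (unsigned char)(v >> 8);     /* high byte first */
    p[1] = (unsigned char)(v & 0xff);
}

static uint16_t get16(const unsigned char *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Write the header into the first 8 bytes of out. */
void udp_header_encode(const struct udp_frame_header *h, unsigned char *out)
{
    assert(h != NULL && out != NULL);
    put16(out + 0, h->request_id);
    put16(out + 2, h->seq_no);
    put16(out + 4, h->total);
    put16(out + 6, h->reserved);
}

/* Parse a header; returns 0 on success, -1 if the datagram is too short. */
int udp_header_decode(const unsigned char *buf, size_t len,
                      struct udp_frame_header *h)
{
    assert(buf != NULL && h != NULL);
    if (len < UDP_HEADER_BYTES)
        return -1;
    h->request_id = get16(buf + 0);
    h->seq_no     = get16(buf + 2);
    h->total      = get16(buf + 4);
    h->reserved   = get16(buf + 6);
    return 0;
}
```

Since delivery is unreliable, a client would typically match responses to
requests by request ID and treat a missing datagram as a cache miss.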
> 
> Large memory support
> --------------------
> This mostly involves using size_t rather than unsigned int in a few
> places and compiling in 64-bit mode, which gives us 64-bit pointers and
> makes size_t 64 bits.
> 
> Fix for "out of memory" errors
> ------------------------------
> Rather than preallocate a slab in each slab class as the memcached
> 1.1.13 prerelease does, we decided to instead allow memcached to exceed
> its memory limit slightly. When a "set" request comes in that requires a
> slab whose slab class is empty, we always allocate a slab, even if
> memcached is already at its configured memory limit.
> 
> Our memcached instances are large enough that going over the limit by a
> few megabytes is barely even detectable. If you are running in a very
> constrained environment, you can lower the memory limit slightly to
> account for this change, but bear in mind that this change will only
> exceed the memory limit if a "set" request requires it (which will never
> happen if your data always falls within a limited range of sizes).
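
The allocation rule in this section boils down to a check along these
lines. This is a sketch with made-up names (slab_class_sketch,
may_allocate_slab), not the patch itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-class bookkeeping: only the slab count matters here. */
struct slab_class_sketch {
    size_t slab_count;
};

/* Decide whether a new slab may be allocated for this class. A class with
 * no slabs always gets its first one, even past the memory limit;
 * otherwise the configured limit is enforced. */
int may_allocate_slab(const struct slab_class_sketch *cls,
                      size_t mem_used, size_t slab_bytes, size_t mem_limit)
{
    assert(cls != NULL);
    if (cls->slab_count == 0)
        return 1;
    return mem_used + slab_bytes <= mem_limit;
}
```

The worst-case overshoot under this rule is one slab per slab class, which
is where the "few megabytes" above comes from.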
> 
> Why is this one patch?
> ----------------------
> First, this patch is tested. It runs 24x7 on a large number of very busy
> memcached hosts. Thoroughly testing every possible permutation of these
> changes isn't really feasible.
> 
> Second, the changes are not all easily separable.  For example, adding
> the UDP support required reorganizing memcached's implementation of the
> "get" request, and that reorganization also resulted in most of the CPU
> time improvement.  Similarly, one of the memory efficiency tweaks is
> only required because compiling in 64-bit mode (for large memory
> support) increases the size of a particular data type, and the
> implementation of that tweak results in part of the CPU time savings.
> 
> Third, I *did* send it out as separate patches to the extent it made
> sense to separate out the changes. But rather than excluding those
> changes from the not-easily-separable stuff, I think it makes more sense
> to include it all together. Otherwise anyone who wants to combine
> everything will have to do tedious error-prone manual editing to merge
> it all together, since some of the changes conflict. For example, both
> the large memory support and the slab allocator modification involve
> changing the parameters to slabs_init(), so it would be impossible to
> produce two independent patches against the 1.1.12 release that could be
> applied successfully one after the other.
> 
> Credits
> -------
> These changes were made by David Fetterman, Steven Grimm, and Scott
> Marlette.  Send comments to Steven Grimm (sgrimm at facebook.com) or,
> preferably, to the memcached mailing list.