memory layout idea

Brad Fitzpatrick brad@danga.com
Thu, 8 Apr 2004 11:55:17 -0700 (PDT)


Originally posted at:
http://www.livejournal.com/users/brad/2000605.html

Copying here:

I found this thread interesting:

Large slab cache in 2.6.1
http://www.ussg.iu.edu/hypermail/linux/kernel/0402.2/index.html

My understanding at this point:

Linux has an LRU for cache items, but the cache items are spread across
pages all over the slab.

When the VM needs memory, it pressures the caches into shrinking
themselves to give up pages. But since they're just freeing items off
their LRU, they very often don't give up any whole pages at all; they
just free slab items scattered across a bunch of misc pages.

Experiments were apparently done to LRU the slab pages themselves,
not the items, but since there was apparently no way to actually free
all the items on a slab page (why? locking?), it effectively just
added reclaim pressure from the vfs code, so they did that
instead. (or something)

This is the best analysis of the problem:

http://www.ussg.iu.edu/hypermail/linux/kernel/0402.2/1908.html

    > That was Ed. Because we cannot reclaim
    > slab pages direct from the LRU

    I think that this is needed: Bonwick's slab algorithm (i.e.
    two-level linked lists, implemented in cache_alloc_refill and
    free_block) is intended for unfreeable objects.
    The dentry cache is a cache of freeable objects - a different
    algorithm would be more efficient for shrinking the dentry cache
    after an updatedb.
    I had started prototyping, but didn't get far.
    --
    Manfred

So I think this is all pretty lame, but memcached has exactly the same
problem. (trying to use a slab allocator as a cache, and freeing
objects rather than pages...)

So this has me thinking about memcached improvements:

-- instead of 1MB slab pages, reduce that to, say, 128k

-- higher granularity size classes (not just powers of two). this'll solve
our problem of only utilizing ~65% of memory.
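
as a rough illustration (every number here is made up: a 1.25x growth
factor between classes, a 64-byte base, 8-byte alignment), the classes
could be generated like this. worst-case fit then wastes ~20% of a
chunk instead of ~50%:

    #include <stdio.h>

    #define PAGE_SIZE   (128 * 1024)  /* proposed smaller slab page */
    #define BASE_SIZE   64            /* smallest chunk; illustrative */
    #define GROWTH      1.25          /* vs. 2.0 for powers of two */
    #define MAX_CLASSES 64

    int main(void) {
        double sz = BASE_SIZE;
        for (int i = 0; i < MAX_CLASSES && sz <= PAGE_SIZE; i++) {
            /* round each class size up to 8-byte alignment */
            unsigned int chunk = ((unsigned int)sz + 7) & ~7u;
            printf("class %2d: %7u byte chunks, %5u per page\n",
                   i, chunk, (unsigned int)(PAGE_SIZE / chunk));
            sz *= GROWTH;
        }
        return 0;
    }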

-- a global LRU over slab pages, not one item LRU per size class.
this'll solve our problem of having to run the slab-page-balancer
script against memcached when usage patterns change over time.
(currently memcached never reassigns pages to a different size class
on demand)
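
something like this is what I mean, sketched with made-up names: a
single LRU threaded through the slab pages themselves, so eviction
always reclaims the globally coldest page regardless of which size
class it currently serves, and pages migrate between classes with no
balancer script:

    #include <stddef.h>

    typedef struct slab_page {
        struct slab_page *lru_prev, *lru_next;
        unsigned char clsid;  /* size class is just per-page metadata */
        /* ... chunk bookkeeping, the 128k of memory ... */
    } slab_page;

    static slab_page *lru_head;  /* hottest page */
    static slab_page *lru_tail;  /* coldest page: first eviction victim */

    /* move a page to the LRU head whenever an item on it is touched */
    static void slab_lru_touch(slab_page *p) {
        if (p == lru_head)
            return;
        /* unlink (also a no-op for a brand-new page) */
        if (p->lru_prev) p->lru_prev->lru_next = p->lru_next;
        if (p->lru_next) p->lru_next->lru_prev = p->lru_prev;
        if (p == lru_tail) lru_tail = p->lru_prev;
        /* push onto the head */
        p->lru_prev = NULL;
        p->lru_next = lru_head;
        if (lru_head) lru_head->lru_prev = p;
        lru_head = p;
        if (!lru_tail) lru_tail = p;
    }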

-- fragment items between slabs, so a 1MB object can go over, say, 9
pages. (it'd be 8, but struct overhead.) this is nowhere near as bad
as the TLB thrashing that would've been involved in our old idea
(never implemented, due to fear) of just using a bunch of 64-byte
chunks. (the notion that "memory access is free" inspired that one,
but it's not totally true...)
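
the 9-vs-8 math, with a hypothetical fragment header (two pointers
and a length), just to show where the ninth page comes from:

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SIZE (128 * 1024)

    /* hypothetical per-fragment header */
    typedef struct fragment {
        struct fragment *head;  /* first fragment; holds item metadata */
        struct fragment *next;  /* next fragment of this item, or NULL */
        uint32_t len;           /* value bytes stored in this fragment */
    } fragment;

    #define FRAG_PAYLOAD (PAGE_SIZE - sizeof(fragment))

    int main(void) {
        size_t value = 1024 * 1024;  /* a 1MB object */
        size_t pages = (value + FRAG_PAYLOAD - 1) / FRAG_PAYLOAD;
        printf("%zu pages\n", pages);  /* prints 9: the header
                                          overhead costs us the
                                          ninth page */
        return 0;
    }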

-- when a new item needs to be added, free slabs, not items. this will
require iterating over the slab page to find the items, but we already
do that at one point (during our ghetto manual page-reassignment
code), so we're already guaranteeing that empty parts of slabs are
zeroed out. when freeing a middle fragment of an item, the item-free
code will need to jump to the head fragment, then walk all the
fragments, zeroing out each part on each relevant slab page.
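
a sketch of that, with assumed structures (nothing here is real
memcached code): walk a page's chunks, jump from any fragment to its
head, and zero every piece of the item wherever it lives:

    #include <stddef.h>
    #include <string.h>

    #define PAGE_SIZE (128 * 1024)

    /* every chunk starts with this (hypothetical) header */
    typedef struct chunk {
        struct chunk *frag_head;  /* first fragment of the owning item;
                                     NULL if the chunk is free */
        struct chunk *frag_next;  /* next fragment of that item */
        unsigned int  size;       /* payload bytes in this chunk */
    } chunk;

    /* free an item: jump to its head fragment, then walk and zero
     * every fragment, which may sit on several different pages */
    static void item_free(chunk *c) {
        chunk *f = c->frag_head;
        while (f) {
            chunk *next = f->frag_next;
            memset(f, 0, sizeof(*f) + f->size);  /* keep slabs zeroed */
            f = next;
        }
    }

    /* reclaim a whole page: walk its chunks, freeing each live item.
     * chunks already zeroed by an earlier item_free read as free. */
    static void slab_page_evict(char *page, size_t chunk_size) {
        for (char *p = page; p + chunk_size <= page + PAGE_SIZE;
             p += chunk_size) {
            chunk *c = (chunk *)p;
            if (c->frag_head)
                item_free(c);
        }
    }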

-- if we encounter a slab that can't be freed because one of its items
has a refcount > 0 (transfer in progress), move to the next slab up
the LRU and try again, up to ~20 tries. (just like we do now with the
item LRU)
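
i.e. roughly this (same made-up slab_page idea as above; the evict
function is elided):

    #include <stddef.h>

    #define MAX_TRIES 20

    typedef struct slab_page {
        struct slab_page *lru_prev;  /* toward warmer pages */
        unsigned int pinned_items;   /* items with refcount > 0 */
        /* ... */
    } slab_page;

    extern slab_page *lru_tail;               /* coldest page */
    extern void slab_page_evict(slab_page *); /* frees every item */

    /* pick a victim page: start at the coldest and skip pinned
     * pages, giving up after ~20 tries */
    static slab_page *reclaim_page(void) {
        slab_page *p = lru_tail;
        for (int tries = 0; p && tries < MAX_TRIES;
             tries++, p = p->lru_prev) {
            if (p->pinned_items == 0) {
                slab_page_evict(p);
                return p;
            }
        }
        return NULL;  /* every candidate was mid-transfer */
    }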

avva/others... thoughts?