memcached patch: two-dimensional keys
cfallin at gmail.com
Mon Dec 3 20:35:21 UTC 2007
I'll begin by introductions: I am a computer engineering undergrad,
and our data structures class final project required us to modify some
open source project's data structures. I thought it seemed a bit
imposing to walk into a project and declare I knew better than their
design decisions, so I instead looked for some new feature to add, and
I decided to add two-dimensional keys to memcached. I don't see a huge
demand for this but a) it might be useful for someone and b) it was
sort of fun.
Here is a patch: http://c1f.net/misc/memcached-r651-2dhash.patch . As
the filename implies, it applies against r651, which is what I pulled
down when I began work.
I'll say this right out: I was in a rush to get this out the door
before the semester ends, so there are a few rough edges. I don't
expect this to be accepted into mainline without some additional work
(and even then maybe only as an option for the end-user). However,
it's a starting point for whomever may wish to run with it.
Here's the user-visible portion: all keys containing a period are
split at the first period into the first and second dimensions.
Inserts, deletes, updates, etc work as before, except that they also
update a secondary index (more on that below). A 'get' command on
dim1.* or *.dim2 will return (respectively) a list of all
second-dimensions of keys that have that first dimension, or a list of
all first-dimensions of keys that contain that second dimension. The
'get' datablock returned consists of zero or more of the following:
<key-length (ASCII)> <space> <key> <CR> <LF>.
One flaw remains: I could not (due to lack of time) integrate into the
LRU-expiration (lazy deletion?) mechanism. An explicit delete will
update the index appropriately but (for example) a flush-all will not.
Here's how my data structure works:
I keep two index hashtables (separate from the main hashtable with the
items themselves), keyed by the first and second dimensions
respectively. These hashtables share nodes, so a node has two
linked-list ptrs, two keys, and is always in both tables. The hash
chains are doubly-linked so that a node can be removed from both hash
chains when it's found by either one. Searches by either dimension
simply iterate over the appropriate hash-chain, returning all keys of
the other dimension for nodes that match the search key.
Interestingly, deletes of 2d-keyed items must iterate along the
hashchain for the primary key to find the item with the given
secondary key, so a delete is O(N2) average-case (where N2 is the
average size of the second dimension). I justify this by assuming that
the secondary dimension will be small compared to the primary. I
couldn't think of any easy mechanism to solve this, short of keeping a
hashtable of secondary keys per primary key, which is very heavy on
memory usage. (If anyone has any input on this, I'd be curious, even
post-final report for class :-)
I hope someone finds this useful! If not, thanks at least for obliging
me (and thanks for a project that's pretty much the purest application
of data structures I could imagine).
More information about the memcached