Large lists

Dustin Sallings dustin at spy.net
Mon Nov 5 18:42:14 UTC 2007


On Nov 5, 2007, at 3:24 , K J wrote:
> The problem is, this membership list will also be displayed when  
> users request it.  So would it be best to cache the entire list  
> somehow?

	Yes, but I wouldn't try to think of that as something related to  
your membership check.


On Nov 5, 2007, at 3:34 , K J wrote:

> I had thought of doing a bloom filter on it as well.  The problem  
> here is, the membership list might change sometimes, and reading  
> info on bloom filters, it's not really well-suited for dynamically  
> changing lists?

	Well, bloom filters are fine as long as you're only adding to them.   
I wasn't suggesting actually using a bloom filter, though.  Just to  
think of it more like a bloom filter in that you can only get an  
answer that is either possibly present or definitely not present  
(it's a set representation, not a map representation).

	It was late, though, so I guess I was more typing whatever I was  
thinking than describing what was a best fit for your question.

	memcached is a map, so you can actually store a yes or no, but the  
absence of either is a ``possibly.''

> Suppose I cache 10,000 recently-logged in members.  Also, suppose  
> 50% of traffic actually come from these users.  Then, this cache  
> would have a high hit ratio when testing for membership.

	The 10,000 is just a preload.  You can certainly cache the rest of  
them as they actually come in.

> However, what about the non-members?  For instance let's say 40% of  
> the traffic come from non-members.  This would mean there'd need to  
> be a full listing of members to check?

	You have that in your primary data store.  It just means you'd have  
to go ask it and cache it.

> Hmmm an interesting thought did just come across my mind.  Let me  
> hear your thoughts:
> Cache 10,000 most recently logged-in members
> Bloom filter on the entire list
> This way, you can test for negatives (bloom filter), and if there's  
> a positive, check the 10,000 most recently logged-in users.  If  
> that still yields nothing, then do a database query.  In effect,  
> only a small minority of checks would require a trip to the DB.

	The bloom filter will not help you unless you have too many active  
users to fit into a memcached cluster (and you don't).

	Quick review of what went through my head:

	1)  You want set operations (particularly ∈)
	2)  Thinking of memcached as a normal set falls apart because of  
evictions, so how about ∉
	3)  There *is* a value that will be stored at roughly no cost, so we  
can do both.

	#2 is the bloom filter.  After realizing you'd have to go really far  
out of your way to get any savings out of a 0 byte value vs. a 1 byte  
value, I figured you might as well store a boolean.  Unlike a bloom  
filter, this means:

	1)  You can check for presence in a single operation.
	2)  You can further confirm whether the value indicates presence or  
absence.
	3)  You can both add items *and* remove items after going to the  
source.

	So, I apologize for the confusion, but I'd recommend something like  
this:


def is_special_user(uid):
	val=memcache.get(uid)
	if val is not None:
		val=query_to_get_special_flag_for(uid)
		memcache.set(uid, val)
	return val

	You could preload some if you felt you needed to do so.  Adding or  
removing membership can specifically update the cache.

	As I mentioned above, your listing problem should be considered a  
different one.  I'd suspect it could be done more lazily than the  
membership check, but if not, you could always invalidate the list  
when adding or removing membership.

>
> On Nov 4, 2007, at 23:32, J A wrote:
>
> > I have a fairly large members list that I want to keep in memcache.
> > What I do with this list is query it against particular user IDs to
> > see if they are a member of that list or not.  If they are they get
> > certain priviledges.
> >
> > The problem is, this list has gotten to the point of saturating the
> > PHP's memory when fetching the MySQL query the first time.
> >
> > Is there a way to do this more effectively, for instance,
> > partitioning the list into separate smaller lists, grouped by time
> > of login?  I'm thinking of this, as users who have logged in in the
> > past 3 months are more likely to be in the list anyway.
>
>
>        It'd be easier to not think of it as a list if you're just  
> testing
> for membership.  All you want to know is if a particular object is an
> element of a particular set.  You could do this by key convention if
> you batch populate the records.
>
>        However, memcached semantics don't quite give you what you  
> want.
> Depending on whether you can reasonably get a configuration to do what
> you want, it might be easier to think of memcached as a bloom filter
> than as a set in this case.  That is, if you negatively cache things
> that *aren't* part of your list, then the presence of a key will tell
> you for certain that a particular key is not a member, but the absence
> of a key would mean that you don't know (or perhaps memcached *did*
> know, but had since forgotten).
>
>        You could, of course, record the status either way so as to  
> tell the
> difference between not knowing and knowing whether it's a member or
> not.  This is probably best suited to your needs.
>
>        You could optionally preload objects that are likely to be  
> used if
> you think the natural population wouldn't do it effectively (you can
> measure this with stats).
>
> --
> Dustin Sallings
>
>
>
>

-- 
Dustin Sallings


-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.danga.com/pipermail/memcached/attachments/20071105/7d722a74/attachment-0001.html


More information about the memcached mailing list