Implementing Memcached on EC2

Erik Osterman e at osterman.com
Wed Jan 3 19:42:22 UTC 2007


Hi,

By now I'm sure many of you are familiar with Amazon's EC2 offering.
We're using it for our application to help with the woes of scaling. One
of the nice features is that each instance comes with 1.7GB of memory
that we'd like to partially allocate to a memcached process. These
memcached processes are shared by the cluster of servers running various
custom applications. Problem is, EC2 doesn't have private
networks, VPNs are too slow and introduce another level of failure, and
Memcache provides no authentication. IPTables is not really viable,
since there's no static IP addressing.

What are everyone's thoughts on implementing Memcache in such an
environment?

I brought this question up to Brad (actually, asking more about the auth
factor), and he brought up some good points.

> But:  how are nodes currently discovering the available
> memcached servers? And don't you have rehashing issues if they're
> coming and going?  Or are you using consistent hashing on the
> client side?  Or just local single node caching?  In which case,
> what's wrong with 127.0.0.1?
> So yes, auth solves a bit, but I'm curious how you plan to make
> this work reliably when nodes are coming and going.

Up until now, discovery of Memcache servers has not been an issue (we
used a private network with a limited/fixed number of Memcache servers).
Discovery is a big issue in general with EC2.  We've solved this using a
heart-beat link to a centralized server (though having it centralized is
not ideal). Applications can query this server to get a list of
resources that provide a particular service (e.g., memcache, afs, www,
db, app1, app2, etc...). We've also tied this into our DNS server to
provide an alternate, widely supported mechanism to lookup services.
Right now, we're not using SRV records, but this is the direction we're
headed. (http://en.wikipedia.org/wiki/SRV_record)
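For illustration, SRV records for a memcached pool might look something like this (the zone name, hosts, and TTL are hypothetical; a short TTL bounds how stale a client's view of the pool can get):

```
; _service._proto.name.     TTL  class SRV prio weight port  target
_memcache._tcp.example.com.  60   IN   SRV  0    0      11211 node1.example.com.
_memcache._tcp.example.com.  60   IN   SRV  0    0      11211 node2.example.com.
```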

Clearly, using this approach for Memcache breaks down when the list of
servers changes in any way; the hash function will not be universal
across all instances unless every instance shares an identical list of
servers/connectivity. It appears the burden is on the Memcache clients
to keep their lists in sync. One of the beauties of Memcache is its
simplicity, and I'd hate to change that. I haven't thought it through
thoroughly, but I like how AFS uses the DNS RR type AFSDB. Using SRV
resource records, we could accomplish the same thing, and I think we
could minimize the amount of data duplication by using short TTLs and
timing out persistent Memcache client connections when the SRV TTL is
reached. The DNS serial could be used in some manner too, though I'm
not yet sure how. Also, I don't think this entirely solves the problem
of possible data inconsistency.
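Consistent hashing on the client side (as Brad mentions) limits the damage when the server list changes: only the keys that mapped to a departed node move, instead of nearly everything rehashing as with `hash(key) % len(servers)`. A minimal sketch in Python, assuming a simple MD5-based ring with virtual nodes (all names here are illustrative, not any particular client library's API):

```python
import hashlib
from bisect import bisect

class HashRing:
    """Toy consistent-hash ring. Each server is hashed to many points
    ("replicas") on the ring; a key goes to the first server point at
    or after the key's own hash, wrapping around at the end."""

    def __init__(self, servers, replicas=100):
        self.replicas = replicas
        self.ring = {}           # ring point -> server
        self.sorted_points = []  # sorted ring points, for binary search
        for server in servers:
            self.add(server)

    def _point(self, value):
        # MD5 just as a cheap, well-distributed hash; not for security.
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add(self, server):
        for i in range(self.replicas):
            self.ring[self._point("%s:%d" % (server, i))] = server
        self.sorted_points = sorted(self.ring)

    def remove(self, server):
        for i in range(self.replicas):
            del self.ring[self._point("%s:%d" % (server, i))]
        self.sorted_points = sorted(self.ring)

    def get(self, key):
        if not self.ring:
            return None
        idx = bisect(self.sorted_points, self._point(key))
        idx %= len(self.sorted_points)  # wrap around the ring
        return self.ring[self.sorted_points[idx]]
```

Pair this with short-TTL SRV lookups and each client can rebuild its ring on the fly; keys on surviving nodes keep hashing to the same place.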

As for authorization, I imagine something similar to:

  AUTHORIZE key\r\n

  AUTHORIZED\r\n
    - or -
  CLIENT_ERROR Not Authorized\r\n

SSL/TLS would be another way, but it would hurt the performance of a
multiplexed server when there's a lot of connection thrashing. Also,
authorization should be optional so as not to break compatibility.
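To make the proposal concrete, here's a client-side sketch of that exchange in Python. The AUTHORIZE command is purely hypothetical (it's the proposal above, not part of the memcached protocol); only the \r\n-terminated line framing is taken from the real protocol:

```python
import socket

def authorize(sock, key):
    """Send the hypothetical AUTHORIZE command on an established
    connection and return True iff the server replies AUTHORIZED."""
    sock.sendall(b"AUTHORIZE " + key.encode() + b"\r\n")
    reply = b""
    # Read until we have one full \r\n-terminated reply line.
    while not reply.endswith(b"\r\n"):
        chunk = sock.recv(1024)
        if not chunk:
            raise ConnectionError("server closed connection")
        reply += chunk
    return reply.rstrip(b"\r\n") == b"AUTHORIZED"
```

Since the command is optional, an unauthenticated deployment would simply never send it, and old clients would keep working unchanged.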

Your takes?


Best Regards,

Erik Osterman
