memcached: UDP and byte ranges
Dustin Sallings
dustin at spy.net
Tue Dec 18 08:57:14 UTC 2007
On Dec 17, 2007, at 23:15, Aaron Stone wrote:
> - Do we need to separate the Request ID from the Message ID?
The purpose of the request ID is effectively to recreate the TCP
sequence number. This just isn't necessary when your data are
guaranteed to be delivered in order by TCP.
> - Do we need to be able to request portions of a value starting
> from some offset? (to handle the now-infamous facebook's-udp-mgets
> are-so-fat, your-mamma's-ethernet-take-it-no-more!)
My only concern about this is that the value may change between
requests, so a subsequent request may very well return a section of a
different value.
> - Do we need the server to tell the client how much data is about to
> show up?
The message header already does that.
> I _don't_ see a reason to have separate request id's from message
> id's.
> The combination of a message id and packet number (or byte range, which
> I'll get to in a moment) tells us everything we need to know.
It sounds like facebook (does anyone else even use the UDP-based
protocol?) already sends multiple messages in a single UDP request.
This same thing happens over TCP. UDP is just a different transport,
and needs the additional information to do what other transports do
automatically.
> If we want the ability to request the n-th byte through the end, why not
> just ask for the n-th through m-th byte?
>
> (yes, this is the byte range feature that we've all acknowledged is a
> bad idea. except that it completely subsumes the functionality of the
> UDP packet sequence number and does it even more powerfully.)
No, it's not the same. A UDP get still returns the whole value the
same way it does in TCP, except you have a bit more control over the
packetization. Retrieving a value by asking for a series of parts of
it can't be done atomically.
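
To make the atomicity problem concrete, here is a minimal sketch. It
uses a plain dict as a stand-in for the server; `store`, `range_get`
and `torn_read` are hypothetical names for illustration, not anything
in memcached.

```python
# Hypothetical in-memory stand-in for a cache server (not memcached code).
store = {"k": b"AAAAAAAA"}


def range_get(key, offset, length):
    """Return `length` bytes of the *current* value starting at `offset`."""
    return store[key][offset:offset + length]


def torn_read():
    """Read one value in two parts while another client overwrites it."""
    first = range_get("k", 0, 4)   # first half, taken from the old value
    store["k"] = b"BBBBBBBB"       # a different client sets the key in between
    second = range_get("k", 4, 4)  # second half, taken from the new value
    return first + second


result = torn_read()
print(result)  # a mix of two values that was never stored under "k"
```

Neither the old nor the new value comes back; the client sees half of
each and has no way of telling.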
> Add a field to the GET response akin to DNS's "there's more data but you
> need to ask for it". The first response packet will tell the client how
> long the entire key is in an extras field, and the common header will
> tell the client how long the data it got in the initial response is.
It already does that.
> Add a new command, RGET (range-get), that defines a larger extras
> section with two additional fields, the offset and the length.
If this didn't use the CAS identifier, there'd be no guarantees that
it'd ever be right. If it did, you're left with the problem of
finding out what the CAS identifier is.
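
A rough sketch of what a CAS-guarded range read could look like. This
assumes the first response hands the client the CAS identifier, which
is exactly the open problem mentioned above; `Store`, `get_head` and
`rget` are hypothetical names, not an existing memcached API.

```python
class Store:
    """Hypothetical store that bumps a CAS identifier on every set."""

    def __init__(self):
        self._items = {}    # key -> (cas, value)
        self._next_cas = 1

    def set(self, key, value):
        self._items[key] = (self._next_cas, value)
        self._next_cas += 1

    def get_head(self, key, length):
        """Return (cas, total_size, first `length` bytes of the value)."""
        cas, value = self._items[key]
        return cas, len(value), value[:length]

    def rget(self, key, cas, offset, length):
        """Return the requested range, or None if the value has changed."""
        current_cas, value = self._items[key]
        if current_cas != cas:
            return None     # stale CAS: the earlier parts are now useless
        return value[offset:offset + length]


store = Store()
store.set("k", b"AAAAAAAA")
cas, total, head = store.get_head("k", 4)
unchanged_tail = store.rget("k", cas, 4, 4)  # value untouched, so this works
store.set("k", b"BBBBBBBB")                  # another client overwrites the key
stale_tail = store.rget("k", cas, 4, 4)      # now rejected
```

Even with the guard, a rejected rget means the client has to throw
away what it already received and start over.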
> The client is explicitly allowed to ask for more data than can fit in a
> single UDP packet.
It already can, though. You just can't send more data than will fit
in a single UDP packet.
> The server sends as many RGET response packets as it needs to send, with
> each one containing enough information (offset and length) to reassemble
> the value on the client _without resequencing the packets_!
You can already do that. Once you receive the first packet, you know
how many packets there are and what the total size is. If you can
assume all of the packets before the last one will be the same size,
you can just fill in the value as the packets arrive.
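
Something like the following sketch, which assumes the client has
already learned the total size, the packet count and the per-packet
payload size from the first packet. The `Reassembler` class and its
parameters are made up for illustration and do not describe the wire
format.

```python
class Reassembler:
    """Fill a preallocated buffer as packets arrive, in any order."""

    def __init__(self, total_size, total_packets, chunk_size):
        # Assumption: every packet except the last carries exactly
        # `chunk_size` bytes of the value.
        self.buf = bytearray(total_size)
        self.chunk_size = chunk_size
        self.missing = set(range(total_packets))

    def add(self, seq, payload):
        """Copy one packet's payload into place; return True when complete."""
        offset = seq * self.chunk_size
        self.buf[offset:offset + len(payload)] = payload
        self.missing.discard(seq)
        return not self.missing


# A 10-byte value split into packets of 4, 4 and 2 bytes, arriving out of order.
r = Reassembler(total_size=10, total_packets=3, chunk_size=4)
done_after_last = r.add(2, b"ij")
done_after_first = r.add(0, b"abcd")
done_after_middle = r.add(1, b"efgh")
value = bytes(r.buf)
```

The sequence number alone gives the offset, so nothing has to be held
back and reordered before it is copied into the buffer.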
> Rationale:
>
> By eliminating the packet sequence number, we save the client from
> having to hold all the pieces in order until it can return the value to
> the client application.
Hopefully that's unnecessary anyway.
> By giving offsets in each packet, we avoid the potential problem of
> losing the first packet and then being flooded with subsequent packets
> that we don't know what to do with.
If that happens frequently, you should be using TCP and not trying to
reinvent it.
Note that an rget is *not* a retransmit. If you're not very careful,
you may get part of something unrelated to what the rest of the
packets represented. If you are careful, you still may end up having
to throw away all the other values.
> Thoughts? Comments?
I really think it's better to either accept the lossiness and general
sloppiness of a thin, dumb UDP transport or just use TCP and get all
of the rest of the features handled for you by your OS vendor.
--
Dustin Sallings