UDP support in the binary protocol

Marc marc at facebook.com
Mon Dec 17 20:33:24 UTC 2007


I wanted to mention a few things I've been thinking about to take UDP support to the
next level (supporting update requests and large values).  The main things
missing from the protocol right now are an offset-to-next-request field in the UDP
header, a flow-control mechanism, and the ability to get specific segments of a
large value:

Offset-to-next-request would be carried in bytes 6-7 of the UDP header and would
give the byte offset of the next request boundary within the packet, or 0xffff if
the current packet contains no boundary.  This would allow clients to recover from
dropped packets and still receive the subsequent replies.
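
To make that concrete, here is a rough sketch of the existing 8-byte UDP frame
header with the reserved bytes repurposed; the struct and field names are mine,
purely for illustration:

    #include <stdint.h>

    /* All fields are big-endian on the wire. */
    struct udp_frame_header {
        uint16_t request_id;       /* bytes 0-1: opaque request id */
        uint16_t seq_no;           /* bytes 2-3: sequence number of this datagram */
        uint16_t total_datagrams;  /* bytes 4-5: datagrams making up this message */
        uint16_t next_req_offset;  /* bytes 6-7 (currently reserved): byte offset
                                    * of the next request boundary in this payload,
                                    * or 0xffff if the packet contains no boundary */
    };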

For flow control, the client needs to be able to throttle the server's
reply to prevent implosion.  The problem now is that the client has no idea
how much data a get command will return.  It can vary wildly, and when sending
multigets to multiple servers, the combined replies can momentarily exceed
the capacity of the upstream switches or of the client itself.  The challenge is
that the UDP logic in memcached is currently very stateless.  That is a good thing,
and I'm loath to introduce protocol changes that would require maintaining
state on the server side.

I think the best way to handle this is to again use reserved bytes 6-7 in
the UDP header, this time to carry the client's read-buffer segment size.  (Like
TCP, it would have to be expressed in some multiple; maybe we can use bytes 4-5 to
indicate the scale factor, since memcached never receives request messages larger
than one packet.)  The reply payload cannot exceed this size.  The memcached
implementation can record this value along with the UDP message id it already
records and, when generating the reply, simply track the running total and stop
processing once the limit is reached.
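
A minimal sketch of how the server side might honor that limit while building a
multiget reply; the types and helper functions here are placeholders, not actual
memcached internals:

    #include <stddef.h>
    #include <stdint.h>

    typedef struct item item;                 /* stand-in for the server's item type */
    extern size_t reply_size(const item *it); /* bytes this item adds to the reply */
    extern void   append_reply(const item *it);

    /* Like a TCP window: limit = segment_size << scale. */
    static size_t advertised_limit(uint16_t segment_size, uint16_t scale) {
        return (size_t)segment_size << scale;
    }

    static void build_reply(item **hits, size_t nhits,
                            uint16_t segment_size, uint16_t scale) {
        size_t limit = advertised_limit(segment_size, scale);
        size_t written = 0;
        for (size_t i = 0; i < nhits; i++) {
            size_t n = reply_size(hits[i]);
            if (written + n > limit)
                break;              /* stop here; the client re-requests the rest */
            append_reply(hits[i]);
            written += n;
        }
    }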

Lastly, the ability to issue gets for specific segments of a large value
would let UDP clients recover from packet loss on large values more
efficiently.  Currently, if any packet is dropped, the entire value must be
retransmitted.  Even the offset-to-next-request field will not fix this,
since, for large values, most packets fall within a single request.  What I
have in mind is that if I successfully read bytes 0..M of a value and then hit a
timeout or an out-of-order packet, I'd like to issue the next get starting at
offset M.  The data-version checking logic that exists now means I'm never in
danger of getting the wrong data; I just need an additional flavor of get that
specifies an offset and extent.
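
For illustration only, the extras for such a get might carry nothing more than an
offset and an extent; the layout below is a guess, not a wire-format proposal:

    #include <stdint.h>

    /* Hypothetical request extras for a "ranged get". */
    typedef struct {
        uint32_t offset;   /* M: first byte of the value to return */
        uint32_t extent;   /* bytes requested from that offset; 0 = rest of value */
    } ranged_get_extras;

    /* Example use: after receiving bytes 0..M-1 and then hitting a timeout or an
     * out-of-order packet, the client re-issues the get with offset = M and
     * extent = value_length - M, relying on the existing data-version check to
     * reject a value that changed in between. */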

Potentially, similar logic could be done for sets, but given the
infrequency of sets relative to gets, and that this would again require adding a
lot of state for UDP protocol processing on the server side, I don't think
it's worth pursuing.

On 12/17/07 11:34 AM, "Aaron Stone" <aaron at serendipity.cx> wrote:

> On Mon, Dec 17, 2007, Dustin Sallings <dustin at spy.net> said:
> 
>> 
>> On Dec 16, 2007, at 19:26, Aaron Stone wrote:
>> 
>>> Do we want to add 32 bits to the binary protocol for UDP sequencing?
>>> Has
>>> this been discussed before? If so, please point me in the direction of
>>> such a thread in the mailing list archives!
>> 
>> 
>> No, UDP support seems to be the minimal wrapping around the
>> underlying protocol to provide sequencing.  Not sure if I can point
>> you to archives, but the intention should be somewhat clear.
>> 
>> The purpose of a UDP based protocol would be to provide a
>> connectionless form of the TCP based protocol with less client and
>> server overhead.
>> 
>> When you think about it that way, you're just implementing some of
>> the parts that the transport doesn't give you, so it makes sense to
>> not combine them in such a way that provides redundancy with your
>> transport.  If you're optimistic, you have less overhead in general.
>> 
> 
> Well, ok, but the only thing that the UDP header provides that the binary
> protocol does not now provide directly is sequence numbers for
> reassembling a large SET / GET.
> 
> Here's an idea: we have a different magic byte that indicates that the
> common header is four bytes longer, and we use that magic byte for UDP
> traffic?
> 
> Aaron
> 
> 
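
For what it's worth, a rough sketch of Aaron's longer-header idea might look like
the following; the magic value and field layout are only illustrative and loosely
follow the draft common header:

    #include <stdint.h>

    #define UDP_REQUEST_MAGIC 0x8f   /* hypothetical magic marking the longer header */

    /* Fields are big-endian and packed on the wire; this struct is only a sketch. */
    typedef struct {
        uint8_t  magic;              /* UDP_REQUEST_MAGIC instead of the TCP magic */
        uint8_t  opcode;
        uint16_t key_length;
        uint8_t  extras_length;
        uint8_t  data_type;
        uint16_t reserved;
        uint32_t total_body_length;
        uint32_t opaque;
        uint64_t cas;
        uint32_t sequence;           /* the extra four bytes: UDP sequence number */
    } udp_request_header;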



