binary protocol notes from the facebook hackathon

Wed Jul 11 14:39:51 UTC 2007

Hi,

We have seen the discussion about making a binary protocol.
About a year ago we have written a Perl memcached client,
posted here before, but still available at http://dev.tty.nl/memcached.html
Iirc most of the time in the perl memcached client is spend analysing
the protocol. So having a binary protocol would be great.

For other projects we have made several daemons, clients using a binary protocol.
The daemons, clients all have a similar interface as memcached.
i.e. They have simple operations which are independent/state-less.

After a few iterations we came to a stable design. Applying this 
experience to memcache we get the design we would like to present here.

Characteristics:
Small single state-less commands.
Using 1 command -> 1 response system.
There is no error recovery.

We ended up having a two-layer system:
1) a layer dealing with sending/receiving commands and responses.
2) a layer processing the memcache commands and making a response.

=========================================================================================

The first layer can be implemented in TCP or UDP.
This layer is not memcached specific.

It deals with sending the commands.
For TCP it connects, disconnects. For UDP it would have to track message using message ids or something.
The command is opaque at this layer and is passed to the second layer once it's completely received.
This layer also can do reply to ping commands to check if the server is alive. The second
layer (== memcached) is not bothered with this.

Proposal protocol sending messages using TCP:
- messages look like:
  * magic integer (all integer are 4 bytes, fixed endian)
  * message code (integer): PING, COMMAND, RESPONSE, ERROR, DISCONNECT (something like a HTTP response code)
  * command length, length of the remaining message (excluding magic integers, message
    code and command length) (may be zero)
  * command
  * trailing magic integer

this structure is used for both the request and the response.

We currently have implementations of a protocol similar to the one described above.
- as a C daemon, based upon memcached v1.1.x
- Perl client
- Perl daemon

=========================================================================================

Memcached commands:
first integer is the command, the remaining message is command dependent.

Commands:
- get:
  request:
       - number of keys N (integer) 
       - N times key (20 bytes¹)
  response:
       - N times - response code: FOUND/NOT_FOUND/OUT_OF_DATE (integer)
                 - if FOUND: length of value, value.
       the order of the response is the same as the order of the get command.

- set/add/replace: 
  request:
     - key (20 bytes)
     - expiration time (integer)
     - length of value (integer)
     - value
  response:
     - STORED/NOT_STORED (integer)

- delete:
  request:
    - key (20 bytes)
  response:
    - DELETED/NOT_DELETED (integer)

- ....

- multi-command:
  This command can be used to execute multi commands as one atomic operation, so you 
  can prevent race conditions.
  request:
    - number of commands N (integer)
    - N times  - length of command (integer)
               - command
  response:
    - N times  - length of response (integer)
               - response
    the order of the response is the same as the order of the commands in the request.

¹ I have the keys as fixed length of 20 bytes, of course instead of the fixed length they can
also be replaced with a length prefix followed by the actual key.
I prefer fixed lengths because they are easier(faster) to process on the server.
You can simply use the SHA of 'normal' keys in the client.

=========================================================================================
Example 1:
'get' the key 'xxxxxxxxxxxxxxxxxxxx' (x^20)
request:
<4 bytes magic number>
<4 bytes '28' (length of msg)>
<4 bytes 'COMMAND'>
    <4 bytes command 'GET'>
    <4 bytes '1' (1 key)>
    <20 bytes 'xxxxxxxxxxxxxxxxxxxx' (key) >
<4 bytes trailing magic>

result:
<4 bytes magic number>
<4 bytes '19' (length of msg)>
<4 bytes 'RESPONSE'>
    <4 bytes 'FOUND' (response code)>
    <4 bytes '11' (length of value)>
    <11 bytes 'Hello World' (value) >
<4 bytes trailing magic>

Example 2:
connect.
The client indicates what protocol it speaks for both the layer 1 and the layer 2
protocol. The server can accept this of reject the connection.

request:
<4 bytes magic number>
<4 bytes '36' (length of msg)>
<4 bytes 'CONNECT'>
    <4 bytes '1' (layer 1 protocol major version number)>
    <4 bytes '0' (layer 1 protocol minor version number)>
    <20 bytes 'memcached           ' (layer 2 magic number)>
    <4 bytes '1' (layer 2 protocol major version number)>
    <4 bytes '0' (layer 2 protocol minor version number)>
<4 bytes trailing magic>
result:
<4 bytes magic number>
<4 bytes '37' (length of msg)>
<4 bytes 'CONNECTED'>
    <4 bytes '33' (length of greeting)>
    <33 bytes 'euclid.tty.nl - the moon is rising' (some greeting, mostly for debug)>
<4 bytes trailing magic>

note: the connect is completely handled by level 1. Only 'COMMAND' messages are passed to level 2.

=========================================================================================

There are a lot of details in this email, about the layout of messages and commands.
But Most Important is to make the separation of the protocol into the two layers.

-- 
Gerard Goossen
TTY Internet Solutions