memcached-1.1.12 does infinite loop

Andreas Vierengel avierengel at gatrixx.com
Sun May 29 10:01:43 PDT 2005


Thanks for your ultra-quick answer!!

On Sunday, 29.05.2005, at 18:31 +0300, Anatoly Vorobey wrote:
> On Sun, May 29, 2005 at 04:38:57PM +0200, Andreas Vierengel wrote:
> > I can confirm this behaviour. I have recently tested get_multi with a
> > variable number of keys. On my 2.8 GHz Xeon machine it took 0.8s to
> > process a get_multi with 10,000 keys. During this time, nothing else is
> > done by memcached. (I have also done 500,000 keys, and this stalls
> > everything for minutes.)
> > About 1% of my requests do get_multi with about 1000 keys, so I'm going
> > to limit one get_multi to about 500 keys, too.
> > 
> > After looking at the code, I think the granularity of an event is one
> > command. If the command is a 500.000-keys-get_multi memcached is forced
> > to satisfy this request before continuing with other work.
> 
> The intention of the code has always been: if you get new input on a
> connection, process it and write out the output for its request for 
> as long as you can do it without waiting, i.e. until you either block at 
> reading (input is over) or block at writing (the TCP stack is slower 
> than you, even on a LAN).

that means a very fast client gets "higher priority" than a slow client,
because it can fill the input buffers or empty the output buffers more
aggressively (only in the case of a large get_multi).

> When memcached gets a very large get_multi, here's what *should* happen:
> it very quickly finds all the items that are requested and puts
> pointers to them into a large buffer array; then it asynchronously goes
> over the array and sends out the responses with the data, yielding
> to other requests on other connections when it can't write immediately.

OK, understood.

> If there's a large delay on a very large get_multi, it could be due
> to: either the "very quickly" part above is not as optimised and fast 
> as it should be (I certainly never imagined it could take minutes
> on 500,000 keys), or perhaps the writing part is bugged and blocks at
> some place it shouldn't block. I will try to investigate.

In my test with 500,000 keys, strace showed me only the time() calls, so
memcached must be in the "find all the items" part.
I never saw any sockets blocking in my straces.

> > Another idea is to rewrite portions of the memcached-code, so that
> > event-handling is not only done for async network I/O, but also within a
> > get_multi-request. Maybe in the event of a writeable socket, the
> > get_multi fetches more key/values from this request until the socket
> > write-buffer is full. The "state-machine" must be enhanced for this, I
> > think. Buffering the whole response, like it is done at the moment, will
> > also not be needed.
> 
> The whole response is not buffered; only an intermediate list of 
> pointers to items that have been located (the creation of this list 
> doesn't involve network I/O) is buffered. The state-machine *is* used 
> for asynchronously building and sending out the actual response with the 
> data.

Ah I see.

I straced such a request with large number of keys:

17:51:54.496661 epoll_wait(4, {{EPOLLIN, {u32=3084803448,
u64=3084803448}}}, 30000, 2339) = 1 <0.111071>
17:51:54.607773 gettimeofday({1117381914, 607783}, NULL) = 0 <0.000008>
17:51:54.607812 read(430, "get SNAPSHOT|0051|33|2 SNAPSHOT|"..., 4096) =
1448 <0.000010>
17:51:54.607876 read(430, "|104|2 SNAPSHOT|117356|2|2 SNAPS"..., 2648) =
1448 <0.000010>
17:51:54.607928 read(430, 0x8b8e408, 1200) = -1 EAGAIN (Resource
temporarily unavailable) <0.000008>
17:51:54.607968 read(430, 0x8b8e408, 1200) = -1 EAGAIN (Resource
temporarily unavailable) <0.000013>
17:51:54.608011 gettimeofday({1117381914, 608021}, NULL) = 0 <0.000008>
17:51:54.608049 gettimeofday({1117381914, 608059}, NULL) = 0 <0.000007>
17:51:54.608088 epoll_wait(4, {{EPOLLIN, {u32=3084803448,
u64=3084803448}}}, 30000, 2228) = 1 <0.000149>
17:51:54.608272 gettimeofday({1117381914, 608282}, NULL) = 0 <0.000008>
17:51:54.608313 read(430, "SHOT|113493|4|2 SNAPSHOT|105812|"..., 1200) =
1200 <0.000010>
17:51:54.608380 read(430, "90|104|2 SNAPSHOT|107071|104|2 S"..., 4096) =
1696 <0.000010>
17:51:54.608434 read(430, 0x8f66eb8, 2400) = -1 EAGAIN (Resource
temporarily unavailable) <0.000008>
17:51:54.608498 read(430, "3|18|2 SNAPSHOT|1172|33|2 SNAPSH"..., 2400) =
1448 <0.000011>
17:51:54.608550 read(430, 0x8f67460, 952) = -1 EAGAIN (Resource
temporarily unavailable) <0.000008>
17:51:54.608592 read(430, 0x8f67460, 952) = -1 EAGAIN (Resource
temporarily unavailable) <0.000019>
17:51:54.608640 gettimeofday({1117381914, 608650}, NULL) = 0 <0.000008>
17:51:54.608678 gettimeofday({1117381914, 608688}, NULL) = 0 <0.000008>
17:51:54.608717 epoll_wait(4, {{EPOLLIN, {u32=3084803448,
u64=3084803448}}}, 30000, 2227) = 1 <0.000019>
...
it continues in this way until everything is read.
That means read() is called until two subsequent read() calls both
return EAGAIN.

If the client constantly fills the server's input buffer, read() can be
called quite a lot. In my example the biggest read() loop was 90 calls
between two epoll_wait() calls; about half of them returned EAGAIN.
Would it be better to return to epoll_wait() on the first read() that
returns EAGAIN, to give other requests a better chance of being served?

Next step, processing the get-command:

17:51:54.986638 setsockopt(430, SOL_TCP, TCP_CORK, [1], 4) = 0
<0.000017>
17:51:54.986691 time(NULL)              = 1117381914 <0.000013>
17:51:54.986748 time(NULL)              = 1117381914 <0.000012>
17:51:54.986802 time(NULL)              = 1117381914 <0.000011>
17:51:54.986846 time(NULL)              = 1117381914 <0.000012>
17:51:54.986889 time(NULL)              = 1117381914 <0.000011>
17:51:54.986933 time(NULL)              = 1117381914 <0.000012>
17:51:54.986976 time(NULL)              = 1117381914 <0.000012>
17:51:54.987019 time(NULL)              = 1117381914 <0.000012>
17:51:54.987062 time(NULL)              = 1117381914 <0.000012>
17:51:54.987105 time(NULL)              = 1117381914 <0.000011>
17:51:54.987149 time(NULL)              = 1117381914 <0.000012>
17:51:54.987192 time(NULL)              = 1117381914 <0.000011>
...

there are as many time() calls as keys.
time() is fast, but is it needed for every key?
I mean, if I want key/values at some point in time, one time() call at
the beginning of each request should be enough. Even if the clock
advances by 1 second during the creation of the response, a
key/value pair that expires right then should still be sent, because at
request time it had not expired.
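As a sketch of what I mean (the struct and function names are my
invention, not memcached's real item structure): capture the time once
when the request arrives and test every key against that snapshot:

```c
#include <time.h>

/* Hypothetical item holding an absolute expiry timestamp;
 * 0 means "never expires". */
struct item {
    time_t exptime;
};

/* Expiry check against a timestamp captured once at request arrival,
 * instead of calling time() per key.  An item that expires while the
 * response is being built is still considered live, because it had
 * not expired at request time. */
int item_is_live(const struct item *it, time_t request_time)
{
    return it->exptime == 0 || it->exptime > request_time;
}
```

With 500,000 keys that replaces 500,000 time() syscalls per get_multi
with a single one.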

last step, sending response:

17:51:54.993229 write(430, "VALUE SNAPSHOT|0051|33|2 0 311\r\n", 32) =
32 <0.000016>
17:51:54.993295 write(430, "0051|344|33|12.80000|12.65000|12"..., 313) =
313 <0.000009>
17:51:54.993350 write(430, "VALUE SNAPSHOT|114140|315|2 0 26"..., 35) =
35 <0.000011>
17:51:54.993406 write(430, "114140|276|315|0.00000|104.81000"..., 270) =
270 <0.000009>
17:51:54.993460 write(430, "VALUE SNAPSHOT|112777|104|2 0 27"..., 35) =
35 <0.000009>
17:51:54.993510 write(430, "112777|840|104|129.07265|0.00000"..., 276) =
276 <0.000009>
17:51:54.993563 write(430, "VALUE SNAPSHOT|045936|18|2 0 267"..., 34) =
34 <0.000008>
17:51:54.993613 write(430, "045936|40|18|100.00000|0.00000|0"..., 269) =
269 <0.000009>
17:51:54.993667 write(430, "VALUE SNAPSHOT|108515|104|2 0 27"..., 35) =
35 <0.000009>
17:51:54.993717 write(430, "108515|840|104|201.74745|0.00000"..., 276) =
276 <0.000021>
17:51:54.993782 write(430, "VALUE SNAPSHOT|115266|104|2 0 27"..., 35) =
35 <0.000009>
17:51:54.993832 write(430, "115266|840|104|208.64290|0.00000"..., 278) =
278 <0.000009>
...

this continues for every requested key.
So two write()s are called for each value (header and body). This could
probably be optimized for I/O efficiency. In my understanding, the ideal
case would be just a few write() calls covering the whole response.
Maybe writev()?!?
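Something like this, as a rough sketch (send_value is a made-up name,
and the real code would build the response differently): combine the
"VALUE ..." header line, the data block, and the trailing \r\n into a
single writev() call instead of two write()s per value:

```c
#include <stdio.h>
#include <stddef.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one value in a single syscall: header line, data block, and
 * the "\r\n" terminator go out as one writev() with three iovecs.
 * Returns bytes written, or -1 on error. */
ssize_t send_value(int fd, const char *key, unsigned flags,
                   const char *data, size_t len)
{
    char header[256];
    int hlen = snprintf(header, sizeof header,
                        "VALUE %s %u %zu\r\n", key, flags, len);
    if (hlen < 0 || (size_t)hlen >= sizeof header)
        return -1;

    struct iovec iov[3] = {
        { header,       (size_t)hlen },
        { (void *)data, len          },
        { "\r\n",       2            },
    };
    return writev(fd, iov, 3);
}
```

Going further, many values could share one iovec array (up to IOV_MAX
entries), so a whole get_multi response needs only a handful of
syscalls.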

I will investigate further...

--Andy


