libevent & spinning

Fri, 24 Oct 2003 05:50:15 +0200

This is not strictly speaking a memcached bug, but a related one.
The description might be useful to people reading this list,
and I want a convenient reference to the archives.

We finally figured out the "spinning" bug which has been happening
with memcached every now and then. It only happens under heavy load
and when using epoll (or poll, as it now turns out) on Linux; FreeBSD
users aren't affected. The bug manifested itself by the memcached
process starting to eat huge amounts of CPU, but not crashing.
Eventually we tracked the bug to libevent (which is why I wrote it's
not strictly memcached).

When libevent uses poll or epoll for notification, it uses POLLIN
or POLLOUT or both to register its interest in reading, writing,
or both, when registering an event with the OS. The flags are defined
in <sys/poll.h>. Calls to poll() or epoll_wait() are supposed to
return with appropriate indication of which events were activated, and
for which activities (POLLIN if the descriptor is ready for reading,
POLLOUT if it's ready for writing).

It turns out, however, that the implementations of poll() and epoll
on Linux try to second-guess the programmer by _forcing_ the userspace
caller to wait also for POLLERR (error) and POLLHUP (hangup, such as
TCP connection termination) flags. Even if those flags aren't specified
when poll() or the epoll functions are called, they are still watched
and are relayed to the userspace caller when these conditions occur, in
combination with POLLIN/POLLOUT or without them.
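
To make this concrete, here is a minimal sketch that can be tried on a
Linux box. It uses pipes rather than TCP sockets, purely because a
hangup or error is easy to provoke on a pipe; the function names are
made up for this illustration and are not part of libevent or
memcached. It asks poll() for no events at all and looks at what comes
back.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Poll fd once, without blocking, requesting NO events at all.
 * Returns the revents reported by the kernel, or 0 if poll() failed. */
short poll_requesting_nothing(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = 0, .revents = 0 };

    assert(fd >= 0);
    if (poll(&pfd, 1, 0) < 0)
        return 0;
    return pfd.revents;
}

/* Read end of a pipe whose write end has been closed (a hangup). */
short revents_after_hangup(void)
{
    int p[2];
    short r;

    if (pipe(p) != 0)
        return 0;
    close(p[1]);                      /* the only writer goes away */
    r = poll_requesting_nothing(p[0]);
    close(p[0]);
    return r;
}

/* Write end of a pipe whose read end has been closed (an error). */
short revents_after_error(void)
{
    int p[2];
    short r;

    if (pipe(p) != 0)
        return 0;
    close(p[0]);                      /* the only reader goes away */
    r = poll_requesting_nothing(p[1]);
    close(p[1]);
    return r;
}
```

On Linux I would expect revents_after_hangup() to have POLLHUP set and
revents_after_error() to have POLLERR set, even though events was 0 in
both cases.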

libevent's wrappers of poll and epoll aren't equipped to handle POLLERR
when it's returned despite not being requested, and in this case
they start going through the same loop continuously, doing essentially
nothing and getting the same POLLERR value from the kernel over and
over again. This is how "spinning" occurs. It happens whenever the
waiting call returns from the kernel with some event having neither
POLLIN nor POLLOUT in its activity bitmask. In practice, the bitmask
usually turns out to be POLLERR or POLLERR|POLLHUP. Further
tail-chasing around the kernel sources revealed that the function
responsible for this is tcp_poll() in net/ipv4/tcp.c, and that it
returns POLLERR, for instance, when the socket is in a general error
state. I'm not sure how exactly the socket gets into that state, but
evidently it only happens under very high load, and rarely. Still, it
does happen every now and then, and that is enough to trigger the bug.

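The loop can be imitated in a few lines. This is only a sketch under
assumptions: the dispatcher below is a hypothetical stand-in for an
event loop that knows about POLLIN and POLLOUT only, not libevent's
actual code, and a pipe with its read end closed stands in for a TCP
socket in an error state. All names are invented for the example.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical dispatcher that only understands POLLIN and POLLOUT.
 * Returns 1 if it would have run a callback, 0 if it did nothing. */
int dispatch_in_out_only(short revents)
{
    int handled = 0;

    if (revents & POLLIN)
        handled = 1;                  /* would run the read callback */
    if (revents & POLLOUT)
        handled = 1;                  /* would run the write callback */
    return handled;
}

/* Run the wait/dispatch cycle max_iters times on fd and count the
 * wakeups where the kernel reported the fd but nothing was handled. */
int count_wasted_wakeups(int fd, short events, int max_iters)
{
    int wasted = 0;

    assert(max_iters >= 0);
    for (int i = 0; i < max_iters; i++) {
        struct pollfd pfd = { .fd = fd, .events = events, .revents = 0 };
        int n = poll(&pfd, 1, 100);   /* would block up to 100 ms */

        if (n < 0)
            break;
        if (n == 1 && !dispatch_in_out_only(pfd.revents))
            wasted++;                 /* woke up, did nothing: a spin */
    }
    return wasted;
}

/* Provoke an error condition on a pipe and count the wasted wakeups.
 * Returns -1 if the pipe could not be created. */
int demo_spin(int max_iters)
{
    int p[2];
    int wasted;

    if (pipe(p) != 0)
        return -1;
    close(p[0]);                      /* no reader left: error state */
    /* Ask only for POLLIN, which a write end never reports. */
    wasted = count_wasted_wakeups(p[1], POLLIN, max_iters);
    close(p[1]);
    return wasted;
}
```

On Linux I would expect every iteration to be wasted: poll() comes back
at once with POLLERR, the dispatcher ignores it, and nothing ever
clears the condition, so the next call returns the same thing.
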
If libevent uses select() rather than poll() (select() is in general
even slower than poll()), this kind of error never occurs. The reason
is that select()'s semantics are poor and cannot carry information such
as POLLERR or POLLHUP; select() only returns bitmasks of file
descriptors ready for input, output, or exceptions (e.g. OOB data in
TCP). It turns out that the implementation of select() in the kernel
has to deal with POLLERR and POLLHUP without passing them on to the
userspace caller (which is what poll/epoll do), because its interface
doesn't allow it this luxury. So what it does is translate POLLERR into
POLLIN|POLLOUT (read and write), and POLLHUP into POLLIN (read). This
is a sensible approach which we plan to follow in libevent to get rid
of the spinning bug. We'll coordinate the fix with the maintainer of
libevent, and hopefully there will soon be a new version of libevent
which fixes this bug.
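
The translation itself is tiny. The following is a sketch of the idea
only, not the actual libevent patch; translate_revents is a name made
up for this example. It maps whatever poll() reported onto the two
flags an event loop that only knows POLLIN and POLLOUT can act on.

```c
#include <assert.h>
#include <poll.h>

/* Fold POLLERR and POLLHUP into POLLIN/POLLOUT the way select() does:
 * POLLERR becomes readable and writable, POLLHUP becomes readable. */
short translate_revents(short revents)
{
    short out = (short)(revents & (POLLIN | POLLOUT));

    if (revents & POLLERR)
        out = (short)(out | POLLIN | POLLOUT);
    if (revents & POLLHUP)
        out = (short)(out | POLLIN);

    /* Only the two flags the event loop understands may remain. */
    assert((out & ~(POLLIN | POLLOUT)) == 0);
    return out;
}
```

With this in place the read or write callback gets to run, its read()
or write() call sees the error or end-of-file, and the application can
close the descriptor instead of going around the loop again.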

Oh, and as for FreeBSD and its cousins, they do not force
POLLERR or POLLHUP down the throats of unsuspecting application
developers. Also, their socket code never returns POLLERR or POLLHUP
anyway. Either of these two facts ensures the safety of FreeBSD users
of memcached from this bug.

--
avva