libevent looping

Brad Fitzpatrick brad@danga.com
Thu, 4 Sep 2003 19:11:55 -0700 (PDT)


Niels,

A few of us at LiveJournal.com have been bug-hunting a problem with
libevent and epoll for a number of weeks now.

We have a dozen machines running memcached on libevent.  A couple times
per week a process will start eating all available CPU, but continues to
function.

What we discovered is that libevent's epoll mode isn't handling a (new?)
type of event from Linux, and continually processes the same event over
and over.  When a new event comes in that libevent can handle (EPOLLIN or
EPOLLOUT), it handles it fine, deletes it, and proceeds to be confused
over the mystery event.

Here you can see libevent being confused for a while, getting a new event,
passing it back to our application, and then going back to looping, and so
on.

(Our app always does a syscall, read or write, after calls into libevent,
so this isn't us looping.)

Here's a snippet of an strace:

epoll_wait(0x3, 0x80accb8, 0x2000, 0x1ad) = 1
gettimeofday({1062725530, 924268}, NULL) = 0
gettimeofday({1062725530, 924392}, NULL) = 0
gettimeofday({1062725530, 924515}, NULL) = 0
epoll_wait(0x3, 0x80accb8, 0x2000, 0x1ad) = 1
gettimeofday({1062725530, 924764}, NULL) = 0
gettimeofday({1062725530, 924886}, NULL) = 0
gettimeofday({1062725530, 924976}, NULL) = 0
epoll_wait(0x3, 0x80accb8, 0x2000, 0x1ac) = 1
gettimeofday({1062725530, 925195}, NULL) = 0
gettimeofday({1062725530, 925314}, NULL) = 0
gettimeofday({1062725530, 925440}, NULL) = 0
epoll_wait(0x3, 0x80accb8, 0x2000, 0x1ac) = 1
gettimeofday({1062725530, 925690}, NULL) = 0
gettimeofday({1062725530, 925814}, NULL) = 0
gettimeofday({1062725530, 925937}, NULL) = 0
epoll_wait(0x3, 0x80accb8, 0x2000, 0x1ab) = 1
gettimeofday({1062725530, 926186}, NULL) = 0
gettimeofday({1062725530, 926310}, NULL) = 0
gettimeofday({1062725530, 926433}, NULL) = 0
epoll_wait(0x3, 0x80accb8, 0x2000, 0x1ab) = 1
gettimeofday({1062725530, 926666}, NULL) = 0
gettimeofday({1062725530, 926779}, NULL) = 0
gettimeofday({1062725530, 926901}, NULL) = 0
epoll_wait(0x3, 0x80accb8, 0x2000, 0x1aa) = 2
gettimeofday({1062725530, 927167}, NULL) = 0
read(177, "get userpic.6166503\r\n", 16384) = 21
read(177, 0x87a834d, 16363)             = -1 EAGAIN (Resource temporarily unavailable)
setsockopt(177, SOL_TCP, TCP_CORK, [1], 4) = 0
time(NULL)                              = 1062725530
time(NULL)                              = 1062725530
write(177, "VALUE userpic.6166503 1 33\r\n", 28) = 28
write(177, "\4\4\0041234\4\4\4\10\2\3\0\0\0\n\003100\n\00296\n\006"..., 35) = 35
write(177, "END\r\n", 5)                = 5
setsockopt(177, SOL_TCP, TCP_CORK, [0], 4) = 0
read(177, 0x87a8338, 16384)             = -1 EAGAIN (Resource temporarily unavailable)
gettimeofday({1062725530, 928978}, NULL) = 0
gettimeofday({1062725530, 929099}, NULL) = 0
epoll_wait(0x3, 0x80accb8, 0x2000, 0x1a8) = 1
gettimeofday({1062725530, 929360}, NULL) = 0
gettimeofday({1062725530, 929486}, NULL) = 0
gettimeofday({1062725530, 929690}, NULL) = 0
epoll_wait(0x3, 0x80accb8, 0x2000, 0x1a7) = 2
gettimeofday({1062725530, 929886}, NULL) = 0
read(402, "get sess:1256389:4\r\n", 16384) = 20
read(402, 0x84bb3bc, 16364)             = -1 EAGAIN (Resource temporarily unavailable)


We suspect libevent needs to be changed to delete events it knows nothing
about, if there is indeed something besides EPOLLIN and EPOLLOUT.

Any thoughts?


Thanks!
Brad