libevent looping

Brad Fitzpatrick brad@danga.com
Fri, 5 Sep 2003 08:56:31 -0700 (PDT)


The strace below is from 2.4.21-rc5 (LT), but we also see it on other pure
2.4.21 (LT) machines.  We modified libevent to use ET, but we haven't put
it into production yet.  (this works because our app always reads all
data, even though libevent is essentially LT)

The bug happens so rarely that I don't want to modify libevent to print
out debugging info unless I know exactly what you guys want to see.
Either of you want to make a quick patch to libevent's epoll.c?
We can then throw it into production and get back to you in a ~1 week.

- Brad


On Fri, 5 Sep 2003, Davide Libenzi wrote:

> On Fri, 5 Sep 2003, Niels Provos wrote:
>
> > Hi Brad,
> >
> > sorry for not getting back to you earlier.  I just moved to California
> > and started a new job.  At this point, I do not know what is causing
> > your problems.  I attached a diff that fixes some known problems with
> > epoll.
> >
> > Davide did any epoll semantics change recently?
>
> The latest big change was the double LT/ET support. It'd be interesting to
> see what kind of event in received by epoll_wait() inside libevent (by
> dropping a printf of the fetched events). Which kernel are you using ?
> Are you using ET or LT epoll ?
>
>
>
>
> > On Thu, Sep 04, 2003 at 07:11:55PM -0700, Brad Fitzpatrick wrote:
> > > Niels,
> > >
> > > A few of us at LiveJournal.com have been bug-hunting a problem with
> > > libevent and epoll for a number of weeks now.
> > >
> > > We have a dozen machines running memcached on libevent.  A couple times
> > > per week a process will start eating all available CPU, but continues to
> > > function.
> > >
> > > What we discovered is that libevent's epoll mode isn't handling a (new?)
> > > type of event from Linux, and continually processes the same event over
> > > and over.  When a new event comes in that libevent can handle (EPOLLIN or
> > > EPOLLOUT), it handles it fine, deletes it, and proceeds to be confused
> > > over the mystery event.
> > >
> > > Here you can see libevent being confused for a while, getting a new event,
> > > passing it back to our application, and then going back to looping, and so
> > > on.
> > >
> > > (Our app always does a syscall, read or write, after calls into libevent,
> > > so this isn't us looping.)
> > >
> > > Here's a snippet of an strace:
> > >
> > > epoll_wait(0x3, 0x80accb8, 0x2000, 0x1ad) = 1
> > > gettimeofday({1062725530, 924268}, NULL) = 0
> > > gettimeofday({1062725530, 924392}, NULL) = 0
> > > gettimeofday({1062725530, 924515}, NULL) = 0
> > > epoll_wait(0x3, 0x80accb8, 0x2000, 0x1ad) = 1
> > > gettimeofday({1062725530, 924764}, NULL) = 0
> > > gettimeofday({1062725530, 924886}, NULL) = 0
> > > gettimeofday({1062725530, 924976}, NULL) = 0
> > > epoll_wait(0x3, 0x80accb8, 0x2000, 0x1ac) = 1
> > > gettimeofday({1062725530, 925195}, NULL) = 0
> > > gettimeofday({1062725530, 925314}, NULL) = 0
> > > gettimeofday({1062725530, 925440}, NULL) = 0
> > > epoll_wait(0x3, 0x80accb8, 0x2000, 0x1ac) = 1
> > > gettimeofday({1062725530, 925690}, NULL) = 0
> > > gettimeofday({1062725530, 925814}, NULL) = 0
> > > gettimeofday({1062725530, 925937}, NULL) = 0
> > > epoll_wait(0x3, 0x80accb8, 0x2000, 0x1ab) = 1
> > > gettimeofday({1062725530, 926186}, NULL) = 0
> > > gettimeofday({1062725530, 926310}, NULL) = 0
> > > gettimeofday({1062725530, 926433}, NULL) = 0
> > > epoll_wait(0x3, 0x80accb8, 0x2000, 0x1ab) = 1
> > > gettimeofday({1062725530, 926666}, NULL) = 0
> > > gettimeofday({1062725530, 926779}, NULL) = 0
> > > gettimeofday({1062725530, 926901}, NULL) = 0
> > > epoll_wait(0x3, 0x80accb8, 0x2000, 0x1aa) = 2
> > > gettimeofday({1062725530, 927167}, NULL) = 0
> > > read(177, "get userpic.6166503\r\n", 16384) = 21
> > > read(177, 0x87a834d, 16363)             = -1 EAGAIN (Resource temporarily unavailable)
> > > setsockopt(177, SOL_TCP, TCP_CORK, [1], 4) = 0
> > > time(NULL)                              = 1062725530
> > > time(NULL)                              = 1062725530
> > > write(177, "VALUE userpic.6166503 1 33\r\n", 28) = 28
> > > write(177, "\4\4\0041234\4\4\4\10\2\3\0\0\0\n\003100\n\00296\n\006"..., 35) = 35
> > > write(177, "END\r\n", 5)                = 5
> > > setsockopt(177, SOL_TCP, TCP_CORK, [0], 4) = 0
> > > read(177, 0x87a8338, 16384)             = -1 EAGAIN (Resource temporarily unavailable)
> > > gettimeofday({1062725530, 928978}, NULL) = 0
> > > gettimeofday({1062725530, 929099}, NULL) = 0
> > > epoll_wait(0x3, 0x80accb8, 0x2000, 0x1a8) = 1
> > > gettimeofday({1062725530, 929360}, NULL) = 0
> > > gettimeofday({1062725530, 929486}, NULL) = 0
> > > gettimeofday({1062725530, 929690}, NULL) = 0
> > > epoll_wait(0x3, 0x80accb8, 0x2000, 0x1a7) = 2
> > > gettimeofday({1062725530, 929886}, NULL) = 0
> > > read(402, "get sess:1256389:4\r\n", 16384) = 20
> > > read(402, 0x84bb3bc, 16364)             = -1 EAGAIN (Resource temporarily unavailable)
> > >
> > >
> > > We suspect libevent needs to be changed to delete events it knows nothing
> > > about, if there is indeed something besides EPOLLIN and EPOLLOUT.
> > >
> > > Any thoughts?
> > >
> > >
> > > Thanks!
> > > Brad
> >
>
>
> - Davide
>
>