libevent looping
Brad Fitzpatrick
brad@danga.com
Wed, 17 Sep 2003 14:32:28 -0700 (PDT)
Niels,
We've been running a new build of memcached with your patch to libevent's
epoll support, and I must say it's a lot better. We've had no run-away
processes since then. You should release a new version of libevent.
BTW, we also added a debugging section to epoll_dispatch in libevent, to
log 'what' if '!which', thinking that was the problem before. Over all
our machines running this app, that condition was only hit once, with
what=8 (EPOLLERR). Does that need to be handled, or is it safe to ignore?
- Brad
On Fri, 5 Sep 2003, Niels Provos wrote:
> Hi Brad,
>
> sorry for not getting back to you earlier. I just moved to California
> and started a new job. At this point, I do not know what is causing
> your problems. I attached a diff that fixes some known problems with
> epoll.
>
> Davide did any epoll semantics change recently?
>
> Niels.
>
>
> On Thu, Sep 04, 2003 at 07:11:55PM -0700, Brad Fitzpatrick wrote:
> > Niels,
> >
> > A few of us at LiveJournal.com have been bug-hunting a problem with
> > libevent and epoll for a number of weeks now.
> >
> > We have a dozen machines running memcached on libevent. A couple times
> > per week a process will start eating all available CPU, but continues to
> > function.
> >
> > What we discovered is that libevent's epoll mode isn't handling a (new?)
> > type of event from Linux, and continually processes the same event over
> > and over. When a new event comes in that libevent can handle (EPOLLIN or
> > EPOLLOUT), it handles it fine, deletes it, and proceeds to be confused
> > over the mystery event.
> >
> > Here you can see libevent being confused for a while, getting a new event,
> > passing it back to our application, and then going back to looping, and so
> > on.
> >
> > (Our app always does a syscall, read or write, after calls into libevent,
> > so this isn't us looping.)
> >
> > Here's a snippet of an strace:
> >
> > epoll_wait(0x3, 0x80accb8, 0x2000, 0x1ad) = 1
> > gettimeofday({1062725530, 924268}, NULL) = 0
> > gettimeofday({1062725530, 924392}, NULL) = 0
> > gettimeofday({1062725530, 924515}, NULL) = 0
> > epoll_wait(0x3, 0x80accb8, 0x2000, 0x1ad) = 1
> > gettimeofday({1062725530, 924764}, NULL) = 0
> > gettimeofday({1062725530, 924886}, NULL) = 0
> > gettimeofday({1062725530, 924976}, NULL) = 0
> > epoll_wait(0x3, 0x80accb8, 0x2000, 0x1ac) = 1
> > gettimeofday({1062725530, 925195}, NULL) = 0
> > gettimeofday({1062725530, 925314}, NULL) = 0
> > gettimeofday({1062725530, 925440}, NULL) = 0
> > epoll_wait(0x3, 0x80accb8, 0x2000, 0x1ac) = 1
> > gettimeofday({1062725530, 925690}, NULL) = 0
> > gettimeofday({1062725530, 925814}, NULL) = 0
> > gettimeofday({1062725530, 925937}, NULL) = 0
> > epoll_wait(0x3, 0x80accb8, 0x2000, 0x1ab) = 1
> > gettimeofday({1062725530, 926186}, NULL) = 0
> > gettimeofday({1062725530, 926310}, NULL) = 0
> > gettimeofday({1062725530, 926433}, NULL) = 0
> > epoll_wait(0x3, 0x80accb8, 0x2000, 0x1ab) = 1
> > gettimeofday({1062725530, 926666}, NULL) = 0
> > gettimeofday({1062725530, 926779}, NULL) = 0
> > gettimeofday({1062725530, 926901}, NULL) = 0
> > epoll_wait(0x3, 0x80accb8, 0x2000, 0x1aa) = 2
> > gettimeofday({1062725530, 927167}, NULL) = 0
> > read(177, "get userpic.6166503\r\n", 16384) = 21
> > read(177, 0x87a834d, 16363) = -1 EAGAIN (Resource temporarily unavailable)
> > setsockopt(177, SOL_TCP, TCP_CORK, [1], 4) = 0
> > time(NULL) = 1062725530
> > time(NULL) = 1062725530
> > write(177, "VALUE userpic.6166503 1 33\r\n", 28) = 28
> > write(177, "\4\4\0041234\4\4\4\10\2\3\0\0\0\n\003100\n\00296\n\006"..., 35) = 35
> > write(177, "END\r\n", 5) = 5
> > setsockopt(177, SOL_TCP, TCP_CORK, [0], 4) = 0
> > read(177, 0x87a8338, 16384) = -1 EAGAIN (Resource temporarily unavailable)
> > gettimeofday({1062725530, 928978}, NULL) = 0
> > gettimeofday({1062725530, 929099}, NULL) = 0
> > epoll_wait(0x3, 0x80accb8, 0x2000, 0x1a8) = 1
> > gettimeofday({1062725530, 929360}, NULL) = 0
> > gettimeofday({1062725530, 929486}, NULL) = 0
> > gettimeofday({1062725530, 929690}, NULL) = 0
> > epoll_wait(0x3, 0x80accb8, 0x2000, 0x1a7) = 2
> > gettimeofday({1062725530, 929886}, NULL) = 0
> > read(402, "get sess:1256389:4\r\n", 16384) = 20
> > read(402, 0x84bb3bc, 16364) = -1 EAGAIN (Resource temporarily unavailable)
> >
> >
> > We suspect libevent needs to be changed to delete events it knows nothing
> > about, if there is indeed something besides EPOLLIN and EPOLLOUT.
> >
> > Any thoughts?
> >
> >
> > Thanks!
> > Brad
>