Analysis of the mogilefsd busywait bug
Jared Klett
jared at blip.tv
Wed Feb 28 20:47:43 UTC 2007
I hate to be the squeaky wheel, but I'm still seeing mogilefsd
take 100% CPU (the "busywait" bug) after updating to svn748
(the latest as of today).
I reinitialized my entire MogileFS environment, injected one
file, and mogilefsd on one of the two tracker servers started spinning
at 100% CPU.
I provided some debug info in this thread:
http://lists.danga.com/pipermail/mogilefs/2007-February/000792.html
I checked lsof and strace output and it's pretty much the same
story I laid out in that post.
Is there any new info or resolution on this issue?
cheers,
- Jared
-----Original Message-----
From: mogilefs-bounces at lists.danga.com
[mailto:mogilefs-bounces at lists.danga.com] On Behalf Of Brad Fitzpatrick
Sent: Wednesday, February 14, 2007 3:23 AM
To: David Weekly
Cc: nathan at pbwiki.com; mogilefs at lists.danga.com; Brett G. Durrett
Subject: Re: Analysis of the mogilefsd busywait bug
Committed in svn740.
On Tue, 13 Feb 2007, David Weekly wrote:
> Brad,
>
> Unfortunately, we've had a hard time reproducing the failure mode
> ourselves.
> With the patch in, we haven't seen the problem come up again. That's
> naturally no guarantee that it's fixed, but the patch logically seemed
> like it would solve the problem, and so far it has. If the issue comes
> up again, we'll let you know and dig deeper.
>
> Thanks for your responsiveness!
>
> Cheers,
> -David
>
>
> On 2/13/07, Brad Fitzpatrick <brad at danga.com> wrote:
> >
> > I kill -9'ed the child in a loop while doing tons of queries and
> > wasn't able to reproduce...
> >
> > David, if this patch works for you and fixes your problem, I'll
> > happily commit... I just wanted to see it work for myself first, but
> > I'm not that concerned. Your analysis seems correct.
> >
> > - Brad
> >
> >
> > On Tue, 13 Feb 2007, Brad Fitzpatrick wrote:
> >
> > > Sounds correct. Thanks!
> > >
> > > I'll try and reproduce and verify the fix later today. Should be
> > > as easy as kill -9'ing some child processes during heavy
> > > traffic/communication between them?
> > >
> > >
> > > On Mon, 12 Feb 2007, David Weekly wrote:
> > >
> > > > Nathan,
> > > >
> > > > > So I've taken a peek at the mogilefsd issue you posted to the
> > > > > list in more detail. What follows is my possibly-flawed analysis
> > > > > of what's causing your issue as well as the issue of the others
> > > > > on the list.
> > > >
> > > > > The epoll fds causing spinning in busywait are waiting on inputs
> > > > > from unix sockets, which appear to be sockets created fairly
> > > > > "early" (9-13 in the case above, just above epoll's own fd of 7) -
> > > > > this tells me the sockets are likely the socketpair()s meant to
> > > > > communicate with children. This corresponds well with us seeing
> > > > > unexpected queryworker deaths in our syslog and corresponds
> > > > > exactly with what the others on the mailing list have seen for
> > > > > problems. That seems to line up with Brad's analysis.
> > > >
> > > > > The issue seems to be that MogileFS::Connection::Worker->close()
> > > > > isn't called explicitly when
> > > > > MogileFS::ProcManager::PostEventLoopChecker() notices the pid's
> > > > > death (via a successful wait() reaping). Consequently, I don't
> > > > > believe Danga::Socket->close() is called, which means
> > > > > Danga::Socket->_cleanup() isn't called. Here's the important bit -
> > > > > Danga::Socket->_cleanup() tells epoll it's no longer interested in
> > > > > the file descriptor (EPOLL_CTL_DEL). So epoll continues to report
> > > > > the socketpair() socket as being available for write (although we
> > > > > might get a SIGPIPE if we actually tried writing to the socket).
> > > >
> > > > > Causing MogileFS::Connection::Worker->close() to be called on the
> > > > > worker whose pid died should (in theory) fix this. Here's the
> > > > > suggested patch (on latest svn checkout). It could kill your
> > > > > mother and is wildly untested. Brad? Brett? Am I on crack?
> > > >
> > > > > +++ ProcManager.pm 2007-02-13 01:14:01.000000000 +0000
> > > > > @@ -141,6 +141,7 @@
> > > > >  my $extra = $todie{$pid} ? "expected" : "UNEXPECTED";
> > > > >  error("Child $pid ($job) died: $? ($extra)");
> > > > >  MogileFS::ProcManager->NoteDeadChild($pid);
> > > > > +$jobconn->close();
> > > > >
> > > > >  if (my $jobstat = $jobs{$job}) {
> > > > >      # if the pid is in %todie, then we have asked it to shut down
> > > >
> > >
> > >
> >
>
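
The mechanism in David Weekly's analysis quoted above can be reproduced
outside MogileFS. What follows is a minimal sketch, not MogileFS code: it
is Linux-only and written in Python rather than Perl, only because Python's
standard library exposes epoll directly. The names count_wakeups and demo
are made up for illustration. A parent registers its end of a socketpair()
with epoll, the child is killed with SIGKILL and reaped, and the parent
then polls. The sketch watches for readability instead of writability; the
kernel reports a hang-up on the dead peer's socket whichever events were
requested, so every poll returns at once until the fd is removed from the
epoll set.

```python
import os
import select
import signal
import socket


def count_wakeups(ep, rounds=5, timeout=0.05):
    """Poll `rounds` times; count the polls that returned at least one event."""
    return sum(1 for _ in range(rounds) if ep.poll(timeout))


def demo():
    # Parent and child share a socketpair, like a tracker and its workers.
    parent_end, child_end = socket.socketpair()
    pid = os.fork()
    if pid == 0:
        # Child: keep only its own end, then sleep until it is killed.
        parent_end.close()
        signal.pause()
        os._exit(0)

    child_end.close()  # the parent keeps only its own end
    ep = select.epoll()
    ep.register(parent_end.fileno(), select.EPOLLIN)

    idle = count_wakeups(ep)       # child alive and silent: polls time out

    os.kill(pid, signal.SIGKILL)   # simulate an unexpected worker death
    os.waitpid(pid, 0)             # reap the child, as the parent's wait() does

    spinning = count_wakeups(ep)   # dead peer, fd still registered

    # The step the patch is meant to trigger: drop the fd from the epoll
    # set (EPOLL_CTL_DEL) and close the parent's end of the socket.
    ep.unregister(parent_end.fileno())
    parent_end.close()

    fixed = count_wakeups(ep)      # nothing registered: polls time out again
    ep.close()
    return idle, spinning, fixed


if __name__ == "__main__":
    print(demo())
```

In an event loop that does nothing with the reported event except call
epoll_wait again, the middle phase is the 100% CPU spin. Whether svn748
still reaches that state by the same path is not something this sketch
can show.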