First round of small crash fixes for svn

Mon Nov 6 00:25:27 UTC 2006

Alan,

On Fri, 27 Oct 2006, dormando wrote:

> (does this list do attachments?)

yes.

> Attached is a handful of small fixes to the svn mogilefs (not running
> the release, sorry :P).
>
> - Fix for HTTPFile, didn't import the 'error' subroutine, so it'd bomb
> out if trying to error.

Thanks.  Committed.

> - Added a decent watchdog to the delete job as a default.

Committed.

> Given how many
> files it selects it almost never gets to update in time... I was
> thinking of a better way to do this though. Should delete ping every N
> files it deletes? Every few percent of files it has to delete? That
> would prevent it from timing out if a device is lagging significantly.

Added a ->still_alive call in the loop.  It rate-limits itself to once a
second.
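
A heartbeat call that rate-limits itself in this way can be sketched as
follows. This is an illustrative Python sketch, not the MogileFS Perl code:
the Worker class, delete_files, delete_one, and the pings_sent counter are
hypothetical names, and the one-second interval is the only detail taken
from the message above.

```python
import time


class Worker:
    """Hypothetical worker; not the MogileFS worker API."""

    def __init__(self, min_interval=1.0, clock=time.monotonic):
        self.min_interval = min_interval  # seconds between real pings
        self.clock = clock                # injectable for testing
        self.last_ping = None
        self.pings_sent = 0               # stand-in for notifying the parent

    def still_alive(self):
        """Ping the watchdog, at most once per min_interval seconds."""
        now = self.clock()
        if self.last_ping is not None and now - self.last_ping < self.min_interval:
            return False                  # too soon, skip this ping
        self.last_ping = now
        self.pings_sent += 1
        return True


def delete_files(worker, fids, delete_one):
    """Delete each file, calling still_alive on every iteration."""
    for fid in fids:
        delete_one(fid)
        worker.still_alive()              # cheap: rate-limited internally
```

Because the limit lives inside still_alive, the loop can call it on every
file without having to choose an "every N files" value, and a slow device
still produces a ping as soon as each delete returns.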

> - Also a fix for checking if the path returned is undef. The new
> make_path code has a lot of paths that can return undef, but a lot of
> code in mogile doesn't check to see if what it got back is undef.

Hm.  Yeah, some auditing is needed.  Somewhat harmless, though.
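
The kind of check being discussed might look like the sketch below. It is
illustrative Python rather than the actual Perl: make_path here is a
hypothetical stand-in that returns None (the analogue of undef) for an
unknown device, and the URL layout is made up for the example.

```python
def make_path(devices, devid, fid):
    """Hypothetical stand-in: return a URL, or None if the device is unknown."""
    host = devices.get(devid)
    if host is None:
        return None
    # Made-up layout for illustration only.
    return f"http://{host}/dev{devid}/{fid}"


def paths_for(devices, devids, fid):
    """Collect usable paths, skipping any device that yields no path."""
    paths = []
    for devid in devids:
        path = make_path(devices, devid, fid)
        if path is None:
            continue  # caller-side check: never use a missing path
        paths.append(path)
    return paths
```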

> - Fix a warning in Worker/Query.pm

Checked in a similar version.

> - Make the bogus error code death message more useful in Worker/Replicate.pm

Checked in.

> - Change a socket error to a src err in Worker/Replicate.pm. I'm not
> sure if this gets the intended result though. There're a lot of cases in
> that replicate section where it'll return a "bogus error code". The case
> this fixes is when a mogstored simply dies; if I took down a mogstored
> while heavy replication was going on, it would crash flood and
> eventually kill the parent process (!).
>
> I noticed if a lot of jobs were dying or errors were flying about, the
> parent has a tendency to crash, and the children don't necessarily
> notice :) I haven't tracked down how/why this happens yet.

An strace of the parent would almost certainly give the answer.

Can you reproduce easily enough?  I'd love to fix that.  Sounds scary.

> Finally, the deleter job in the new trackers sucks. We're already 2
> million files behind for deletion. I'll have the bottleneck narrowed
> down sometime on monday.

Yeah, it needs that rescheduling rewrite, so that deletes which error out
are put off into the future, when things are alive again.

> Other than that, it works great! :P We're running it in production as of
> today and they're hella fast for everything but deletes. It's also
> really nice having mogadm not suck anymore. We were able to toy with the
> trackers, and add 16 hosts + devices without having to touch the database.

Wonderful!  Let's make it better, though.

- Brad