First round of small crash fixes for svn

Sun Nov 5 22:47:23 UTC 2006

Great stuff!

On Fri, 27 Oct 2006, dormando wrote:

> (does this list do attachments?)
>
> Attached is a handful of small fixes to the svn mogilefs (not running
> the release, sorry :P).
>
> - Fix for HTTPFile: it didn't import the 'error' subroutine, so it'd bomb
> out when trying to report an error.
>
> - Added a decent watchdog to the delete job as a default. Given how many
> files it selects it almost never gets to update in time... I was
> thinking of a better way to do this though. Should delete ping every N
> files it deletes? Every few percent of files it has to delete? That
> would prevent it from timing out if a device is lagging significantly.
>
> - Also a fix for checking if the path returned is undef. The new
> make_path code has a lot of paths that can return undef, but a lot of
> code in mogile doesn't check to see if what it got back is undef.
>
> - Fix a warning in Worker/Query.pm
>
> - Make the bogus error code death message more useful in Worker/Replicate.pm
>
> - Change a socket error to a src err in Worker/Replicate.pm. I'm not
> sure if this gets the intended result though. There're a lot of cases in
> that replicate section where it'll return a "bogus error code". The case
> this fixes is when a mogstored simply dies; if I took down a mogstored
> while heavy replication was going on, it would crash flood and
> eventually kill the parent process (!).
>
> I noticed if a lot of jobs were dying or errors were flying about, the
> parent has a tendency to crash, and the children don't necessarily
> notice :) I haven't tracked down how/why this happens yet.
>
> Another thing to note: I don't like the max_disk_age check, but I
> haven't thought of something decent to deal with it yet, so I'll
> probably just configure my trackers to set it really high. Very
> occasionally our servers get a huge clock skew, or they boot up with
> clocks that're way off and somehow not adjusted like they should be.
> That caused a brief issue where all of our mogile trackers would start
> spewing "no_devices" for a few minutes.
>
> Also, when in that condition the replicate job would spew billions of
> warnings and eventually kill the parent process. (ran out of suggestions
> for fid blah).
>
> Finally, the deleter job in the new trackers sucks. We're already 2
> million files behind for deletion. I'll have the bottleneck narrowed
> down sometime on Monday.
>
> Other than that, it works great! :P We're running it in production as of
> today and they're hella fast for everything but deletes. It's also
> really nice having mogadm not suck anymore. We were able to toy with the
> trackers, and add 16 hosts + devices without having to touch the database.
>
> -Dormando
>
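
On the "should delete ping every N files?" question above, here is a rough
sketch of the count-based variant. It is written in Python purely for
illustration and is not the MogileFS Perl code: delete_with_heartbeat,
delete_one, still_alive and ping_every are all made-up names, and the
default of 100 is an arbitrary assumption.

```python
from typing import Callable, Iterable


def delete_with_heartbeat(
    fids: Iterable[int],
    delete_one: Callable[[int], None],   # hypothetical: removes one file
    still_alive: Callable[[], None],     # hypothetical: resets the watchdog
    ping_every: int = 100,               # assumed batch size, tune to taste
) -> int:
    """Delete each fid, pinging the watchdog after every `ping_every` deletions.

    Returns the number of fids processed.
    """
    if ping_every < 1:
        raise ValueError("ping_every must be at least 1")

    deleted = 0
    for fid in fids:
        delete_one(fid)
        deleted += 1
        # Ping on a fixed count so a large selection cannot starve the watchdog.
        if deleted % ping_every == 0:
            still_alive()
    return deleted
```

Something like delete_with_heartbeat(fids, remove_file, ping_parent, 500)
would then ping once per 500 deletions. One caveat with a pure count: if a
single delete blocks on a lagging device, no ping happens until it returns,
so the watchdog timeout would still need to be longer than the slowest
single delete.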