First round of small crash fixes for svn

dormando dormando at rydia.net
Sat Oct 28 01:08:11 UTC 2006


(does this list do attachments?)

Attached is a handful of small fixes to the svn mogilefs (not running 
the release, sorry :P).

- Fix for HTTPFile: it didn't import the 'error' subroutine, so it'd bomb 
out whenever it tried to report an error.

- Added a decent watchdog timeout to the delete job as a default. Given how 
many files it selects, it almost never gets to update in time... I was 
thinking of a better way to do this, though. Should delete ping every N 
files it deletes? Every few percent of the files it has to delete? That 
would prevent it from timing out if a device is lagging significantly.
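For what it's worth, the every-N-files idea could look something like this. 
A minimal sketch only; the batch loop, the ping callback, and the constant 
are all illustrative stand-ins, not the actual Delete.pm code:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch: ping the watchdog every N deletions so a long batch
# can't trip the timeout while it's still making progress.
use constant PING_EVERY => 500;   # illustrative value

sub delete_batch {
    my ($files, $ping_cb) = @_;
    my $done = 0;
    for my $path (@$files) {
        # ... issue the actual DELETE against the storage node here ...
        $done++;
        # tell the parent we're still alive every PING_EVERY files
        $ping_cb->() if $done % PING_EVERY == 0;
    }
    return $done;
}

# toy usage: count how often the watchdog would get pinged
my $pings = 0;
my $n = delete_batch([ map { "fid_$_" } 1 .. 1200 ], sub { $pings++ });
print "deleted $n, pinged $pings times\n";   # 1200 deletes -> 2 pings
```

The nice property is that a slow device stretches the wall-clock time but 
never the gap between pings, so the parent only kills the job when it's 
genuinely wedged.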

- Also a fix for checking whether the path returned is undef. The new 
make_path has a lot of code paths that can return undef, but a lot of 
code in mogile doesn't check whether what it got back is undef.
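The guard itself is just a definedness check before the path gets used; a 
minimal sketch of the pattern, with a fake make_path standing in for the 
real Mgd:: one:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for Mgd::make_path: returns undef for a bogus device,
# mimicking the new make_path code paths that can come back empty.
sub make_path {
    my ($devid, $fid) = @_;
    return undef unless $devid && $devid > 0;        # bogus device
    return "http://dev$devid/0/000/000/$fid.fid";    # made-up layout
}

my @ok;
for my $devid (0, 3) {
    my $path = make_path($devid, 1234);
    next unless $path;   # the fix: skip instead of blowing up on undef
    push @ok, $path;
}
print scalar(@ok), " usable path(s)\n";   # only devid 3 survives
```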

- Fix a warning in Worker/Query.pm

- Make the bogus error code death message more useful in Worker/Replicate.pm

- Change a socket error to a src_error in Worker/Replicate.pm. I'm not 
sure if this gets the intended result, though. There are a lot of cases in 
that replicate section where it'll return a "bogus error code". The case 
this fixes is when a mogstored simply dies; if I took down a mogstored 
while heavy replication was going on, it would crash-flood and 
eventually kill the parent process (!).

I noticed that if a lot of jobs were dying or errors were flying about, 
the parent had a tendency to crash, and the children don't necessarily 
notice :) I haven't tracked down how/why this happens yet.

Another thing to note: I don't like the max_disk_age check, but I 
haven't thought of something decent to deal with it yet, so I'll 
probably just configure my trackers to set it really high. Very 
occasionally our servers get a huge clock skew, or they boot up with 
clocks that're way off and somehow aren't adjusted like they should be. 
That caused a brief issue where all of our mogile trackers would start 
spewing "no_devices" for a few minutes.

Also, when in that condition the replicate job would spew billions of 
warnings ("ran out of suggestions for fid blah") and eventually kill 
the parent process.

Finally, the deleter job in the new trackers sucks. We're already 2 
million files behind on deletion. I'll have the bottleneck narrowed 
down sometime on Monday.

Other than that, it works great! :P We're running it in production as of 
today and they're hella fast for everything but deletes. It's also 
really nice having mogadm not suck anymore. We were able to toy with the 
trackers, and add 16 hosts + devices without having to touch the database.
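(For the curious: the host/device additions were all mogadm against the 
tracker, no SQL. Something along these lines; the hostnames, IP, and 
device ids are made up, and the exact flags may differ by version, so 
check `mogadm help` before copying:)

```shell
# add a storage host and a couple of its devices via the tracker
mogadm --trackers=tracker1:7001 host add storage17 --ip=10.0.0.17 --status=alive
mogadm --trackers=tracker1:7001 device add storage17 170
mogadm --trackers=tracker1:7001 device add storage17 171

# sanity-check that the tracker sees everything
mogadm --trackers=tracker1:7001 check
```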

-Dormando
-------------- next part --------------
diff -ru svn/server/lib/MogileFS/HTTPFile.pm rev/HTTPFile.pm
--- svn/server/lib/MogileFS/HTTPFile.pm	2006-10-16 13:37:26.000000000 -0700
+++ rev/HTTPFile.pm	2006-10-27 17:51:24.000000000 -0700
@@ -2,6 +2,7 @@
 use strict;
 use Carp qw(croak);
 use Socket qw(PF_INET IPPROTO_TCP SOCK_STREAM);
+use MogileFS::Util qw(error);
 
 # (caching the connection used for HEAD requests)
 my %head_socket;                # host:port => [$pid, $time, $socket]
diff -ru svn/server/lib/MogileFS/Worker/Delete.pm rev/Worker/Delete.pm
--- svn/server/lib/MogileFS/Worker/Delete.pm	2006-09-18 18:44:25.000000000 -0700
+++ rev/Worker/Delete.pm	2006-10-27 17:51:24.000000000 -0700
@@ -21,6 +21,8 @@
     return $self;
 }
 
+sub watchdog_timeout { 60 }
+
 sub work {
     my $self = shift;
 
@@ -140,6 +142,10 @@
         last if ++$done > PER_BATCH;
 
         my $path = Mgd::make_path($devid, $fid);
+        # There are cases where make_path can return undefined.
+        # Mogile appears to try to replicate to bogus devices sometimes?
+        next unless $path;
+
         my $rv = 0;
         if (my $urlref = Mgd::is_url($path)) {
             # hit up the server and delete it
diff -ru svn/server/lib/MogileFS/Worker/Query.pm rev/Worker/Query.pm
--- svn/server/lib/MogileFS/Worker/Query.pm	2006-10-23 17:33:42.000000000 -0700
+++ rev/Worker/Query.pm	2006-10-27 17:51:25.000000000 -0700
@@ -149,8 +149,10 @@
     my $dmid = $args->{dmid};
     my $key = $args->{key} || "";
     my $multi = $args->{multi_dest} ? 1 : 0;
-    my $fid = ($args->{fid} + 0) || undef; # we want it to be undef if they didn't give one
-                                           # and we want to force it numeric...
+    my $fid = undef;
+    if ($args->{fid}) {
+        $fid = $args->{fid} + 0;
+    }
 
     # get DB handle
     my $dbh = Mgd::get_dbh() or
diff -ru svn/server/lib/MogileFS/Worker/Replicate.pm rev/Worker/Replicate.pm
--- svn/server/lib/MogileFS/Worker/Replicate.pm	2006-10-26 14:05:14.000000000 -0700
+++ rev/Worker/Replicate.pm	2006-10-27 17:51:25.000000000 -0700
@@ -444,7 +444,7 @@
                            errref       => \$copy_err,
                            callback     => sub { $worker->still_alive; },
                            );
-        die "Bogus error code" if !$rv && $copy_err !~ /^(?:src|dest)_error$/;
+        die "Bogus error code: $copy_err" if !$rv && $copy_err !~ /^(?:src|dest)_error$/;
 
         unless ($rv) {
             error("Failed copying fid $fid from devid $sdevid to devid $ddevid (error type: $copy_err)");
@@ -551,7 +551,7 @@
 
     # okay, now get the file
     my $sock = IO::Socket::INET->new(PeerAddr => $shost, PeerPort => $sport, Timeout => 2)
-        or return error("Unable to create socket to $shost:$sport for $spath");
+        or return $src_error->("Unable to create socket to $shost:$sport for $spath");
     $sock->write("GET $spath HTTP/1.0\r\n\r\n");
     return error("Pipe closed retrieving $spath from $shost:$sport")
         if $pipe_closed;


More information about the mogilefs mailing list