First round of small crash fixes for svn
dormando
dormando at rydia.net
Sat Oct 28 01:08:11 UTC 2006
(does this list do attachments?)
Attached is a handful of small fixes to the svn mogilefs (not running
the release, sorry :P).
- Fix for HTTPFile: it didn't import the 'error' subroutine, so it would
bomb out when trying to report an error.
- Added a decent watchdog timeout to the delete job as a default. Given how
many files it selects, it almost never gets to update in time... I was
thinking of a better way to do this, though. Should the delete job ping
every N files it deletes? Every few percent of the files it has to delete?
That would prevent it from timing out if a device is lagging significantly.
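The ping-every-N-files idea above could look something like this. This is a
minimal sketch only: PING_EVERY, the loop shape, and the still_alive stub are
stand-ins for the real delete-job internals, not actual mogilefs code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical interval; the real value would want tuning against
# how slow a lagging device can get.
use constant PING_EVERY => 500;

my $pings = 0;

sub still_alive {
    # In the real delete job this would tell the parent process the
    # worker is still making progress; here we just count the pings.
    $pings++;
}

my $done = 0;
for my $fid (1 .. 1200) {    # pretend batch of files to delete
    # ... actual delete of $fid would happen here ...
    $done++;
    still_alive() if $done % PING_EVERY == 0;
}

print "pinged $pings times for $done deletes\n";
```

The point is just that the watchdog gets touched as a function of work done
rather than wall-clock time, so a slow device stretches the batch without
tripping the timeout.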
- Also a fix for checking whether the path returned is undef. The new
make_path code has a lot of code paths that can return undef, but much of
the code in mogile doesn't check whether what it got back is undef.
- Fix a warning in Worker/Query.pm
- Make the "Bogus error code" death message more useful in Worker/Replicate.pm
- Change a socket error to a src_error in Worker/Replicate.pm. I'm not
sure if this gets the intended result, though. There are a lot of cases in
that replicate section where it'll return a "bogus error code". The case
this fixes is when a mogstored simply dies; if I took down a mogstored
while heavy replication was going on, it would crash-flood and
eventually kill the parent process (!).
I noticed if a lot of jobs were dying or errors were flying about, the
parent has a tendency to crash, and the children don't necessarily
notice :) I haven't tracked down how/why this happens yet.
Another thing to note: I don't like the max_disk_age check, but I
haven't thought of a decent way to deal with it yet, so I'll
probably just configure my trackers to set it really high. Very
occasionally our servers get a huge clock skew, or they boot up with
clocks that are way off and somehow not adjusted like they should be.
That caused a brief issue where all of our mogile trackers would start
spewing "no_devices" for a few minutes.
Also, when in that condition the replicate job would spew billions of
warnings ("ran out of suggestions for fid blah") and eventually kill the
parent process.
Finally, the deleter job in the new trackers sucks. We're already 2
million files behind on deletion. I'll have the bottleneck narrowed
down sometime on Monday.
Other than that, it works great! :P We're running it in production as of
today and they're hella fast for everything but deletes. It's also
really nice having mogadm not suck anymore. We were able to toy with the
trackers and add 16 hosts + devices without having to touch the database.
-Dormando
-------------- next part --------------
diff -ru svn/server/lib/MogileFS/HTTPFile.pm rev/HTTPFile.pm
--- svn/server/lib/MogileFS/HTTPFile.pm 2006-10-16 13:37:26.000000000 -0700
+++ rev/HTTPFile.pm 2006-10-27 17:51:24.000000000 -0700
@@ -2,6 +2,7 @@
use strict;
use Carp qw(croak);
use Socket qw(PF_INET IPPROTO_TCP SOCK_STREAM);
+use MogileFS::Util qw(error);
# (caching the connection used for HEAD requests)
my %head_socket; # host:port => [$pid, $time, $socket]
diff -ru svn/server/lib/MogileFS/Worker/Delete.pm rev/Worker/Delete.pm
--- svn/server/lib/MogileFS/Worker/Delete.pm 2006-09-18 18:44:25.000000000 -0700
+++ rev/Worker/Delete.pm 2006-10-27 17:51:24.000000000 -0700
@@ -21,6 +21,8 @@
return $self;
}
+sub watchdog_timeout { 60 }
+
sub work {
my $self = shift;
@@ -140,6 +142,10 @@
last if ++$done > PER_BATCH;
my $path = Mgd::make_path($devid, $fid);
+ # There are cases where make_path can return undefined.
+ # Mogile appears to try to replicate to bogus devices sometimes?
+ next unless $path;
+
my $rv = 0;
if (my $urlref = Mgd::is_url($path)) {
# hit up the server and delete it
diff -ru svn/server/lib/MogileFS/Worker/Query.pm rev/Worker/Query.pm
--- svn/server/lib/MogileFS/Worker/Query.pm 2006-10-23 17:33:42.000000000 -0700
+++ rev/Worker/Query.pm 2006-10-27 17:51:25.000000000 -0700
@@ -149,8 +149,10 @@
my $dmid = $args->{dmid};
my $key = $args->{key} || "";
my $multi = $args->{multi_dest} ? 1 : 0;
- my $fid = ($args->{fid} + 0) || undef; # we want it to be undef if they didn't give one
- # and we want to force it numeric...
+ my $fid = undef;
+ if ($args->{fid}) {
+ $fid = $args->{fid} + 0;
+ }
# get DB handle
my $dbh = Mgd::get_dbh() or
diff -ru svn/server/lib/MogileFS/Worker/Replicate.pm rev/Worker/Replicate.pm
--- svn/server/lib/MogileFS/Worker/Replicate.pm 2006-10-26 14:05:14.000000000 -0700
+++ rev/Worker/Replicate.pm 2006-10-27 17:51:25.000000000 -0700
@@ -444,7 +444,7 @@
errref => \$copy_err,
callback => sub { $worker->still_alive; },
);
- die "Bogus error code" if !$rv && $copy_err !~ /^(?:src|dest)_error$/;
+ die "Bogus error code: $copy_err" if !$rv && $copy_err !~ /^(?:src|dest)_error$/;
unless ($rv) {
error("Failed copying fid $fid from devid $sdevid to devid $ddevid (error type: $copy_err)");
@@ -551,7 +551,7 @@
# okay, now get the file
my $sock = IO::Socket::INET->new(PeerAddr => $shost, PeerPort => $sport, Timeout => 2)
- or return error("Unable to create socket to $shost:$sport for $spath");
+ or return $src_error->("Unable to create socket to $shost:$sport for $spath");
$sock->write("GET $spath HTTP/1.0\r\n\r\n");
return error("Pipe closed retrieving $spath from $shost:$sport")
if $pipe_closed;