Troubleshooting multiprocess communication
dormando
dormando at rydia.net
Fri Mar 21 17:43:37 UTC 2008
Hey all,
I'm completely sucking at finding a bug. Allow me to talk at you in
hopes that someone knows the answer offhand.
MogileFS::Worker::Delete has the following code:
    # hit up the server and delete it
    # TODO: (optimization) use MogileFS->get_observed_state and don't try to
    # delete things known to be down/etc
    my $sock = IO::Socket::INET->new(PeerAddr => $urlparts->[0],
                                     PeerPort => $urlparts->[1],
                                     Timeout  => 2);
    unless ($sock) {
        # timeout or something, mark this device as down for now and move on
        $self->broadcast_host_unreachable($dev->hostid);
        $reschedule_fid->(60 * 60 * 2, "no_sock_to_hostid");
        next;
    }
(which I've now terribly pasted).
If a host times out, the deleter broadcasts to all workers that the host
is unreachable. I think this is a little excessive, but it should be
okay because:
MogileFS::Worker::Monitor should re-broadcast a 'reachable' state within
the next ten seconds, if the host is actually up and the timeout was a
fluke.
Except the delete job is never getting that message, and the procmanager
code prevents the job monitor's subsequent broadcasts from being sent to
the deleter, since the status hasn't changed.
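To make the suspected failure mode concrete, here is a minimal sketch of how
"only forward state changes" can leave one worker stuck. This is not the
actual procmanager code: every name below (monitor_reports,
worker_broadcasts_unreachable, %last_monitor_state, %deleter_view) is
hypothetical, and it assumes the suppression is keyed on the last state the
monitor reported while a worker's own broadcast bypasses that cache.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the parent process: it remembers the last state
# the monitor reported per host and forwards only changes.
my %last_monitor_state;   # hostid => 'reachable' | 'unreachable'
my %deleter_view;         # what the delete worker currently believes

sub deliver_to_deleter {
    my ($hostid, $state) = @_;
    $deleter_view{$hostid} = $state;
}

# Monitor-originated update: dropped when it matches the cached state.
# Returns 1 if the update was forwarded, 0 if it was suppressed.
sub monitor_reports {
    my ($hostid, $state) = @_;
    return 0 if defined $last_monitor_state{$hostid}
             && $last_monitor_state{$hostid} eq $state;
    $last_monitor_state{$hostid} = $state;
    deliver_to_deleter($hostid, $state);
    return 1;
}

# Worker-originated broadcast (assumption: it does not touch the cache).
sub worker_broadcasts_unreachable {
    my ($hostid) = @_;
    deliver_to_deleter($hostid, 'unreachable');
}

monitor_reports(7, 'reachable');                  # initial state, forwarded
worker_broadcasts_unreachable(7);                 # deleter hit a socket timeout
my $forwarded = monitor_reports(7, 'reachable');  # unchanged, so suppressed

printf "forwarded=%d deleter_view=%s\n", $forwarded, $deleter_view{7};
```

If the real code behaves like this, the monitor's view never changes, so its
"reachable" is never re-sent and the deleter keeps believing the host is down
until the monitor sees an actual state flip.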
The symptom of this is any deletes destined for those hosts get cycled
through file_to_delete_later and back again every 600 seconds.
Not 100% sure I'm looking in the right place. Given the timeout (600
seconds) and that chunk of code, this is probably right? I should be
able to verify this by sending !to commands to the delete job to say
those hosts are back up, but I haven't gotten that to work yet.
Ideas?
-Dormando