Troubleshooting multiprocess communication

Fri Mar 21 17:43:37 UTC 2008

Hey all,

I'm completely sucking at finding a bug. Allow me to talk at you in 
hopes that someone knows the answer offhand.

MogileFS::Worker::Delete has the following code:

         # hit up the server and delete it
         # TODO: (optimization) use MogileFS->get_observed_state and
         # don't try to delete things known to be down/etc
         my $sock = IO::Socket::INET->new(PeerAddr => $urlparts->[0],
                                          PeerPort => $urlparts->[1],
                                          Timeout => 2);
         unless ($sock) {
             # timeout or something, mark this device as down for now
             # and move on
             $self->broadcast_host_unreachable($dev->hostid);
             $reschedule_fid->(60 * 60 * 2, "no_sock_to_hostid");
             next;
         }

(which I've now terribly pasted).
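
Since the failure branch above treats every failed connect the same way ("timeout or something"), it might help while debugging to see why the connect failed. A minimal sketch, assuming a hypothetical helper named try_connect (not part of MogileFS); it relies only on IO::Socket::INET leaving a failure description in $@:

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Hypothetical helper (not MogileFS code): attempt a TCP connect and
# return ($sock, undef) on success or (undef, $reason) on failure.
sub try_connect {
    my ($host, $port, $timeout) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => $port,
        Proto    => 'tcp',
        Timeout  => $timeout,
    );
    return ($sock, undef) if $sock;

    # IO::Socket::INET leaves a description of the failure in $@.
    return (undef, $@ || 'unknown error');
}
```

Logging the returned reason before marking the host down would show whether these are real timeouts or something else, such as refused connections.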

If a host times out, the deleter broadcasts to all workers that the host 
is unreachable. I think this is a little excessive, but it should be 
okay because:

MogileFS::Worker::Monitor should re-broadcast a 'reachable' state within 
the next ten seconds, if the host is actually up and the timeout was a 
fluke.

Except the delete job is never getting that message, and the procmanager 
code prevents the monitor job's subsequent broadcasts from being sent to 
the deleter, since the status hasn't changed.
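
To make the suspected failure mode concrete, here is a toy model of one way it could happen. This is a sketch, not the real procmanager code: ToyParent, ToyWorker, the record_worker_reports flag and run_scenario are all made up for illustration, and the assumption is that the parent suppresses unchanged states based only on what it last recorded from the monitor.

```perl
use strict;
use warnings;

# Hypothetical model (not MogileFS code): a parent that forwards a host's
# state to workers only when it differs from the last state it recorded.
package ToyParent;

sub new {
    my ($class, %args) = @_;
    my $self = {
        last_state => {},   # hostid => last state recorded by the parent
        workers    => [],   # registered ToyWorker objects
        # Assumption: whether a worker's own "unreachable" broadcast
        # updates the parent's recorded state.
        record_worker_reports => $args{record_worker_reports} ? 1 : 0,
    };
    return bless $self, $class;
}

sub add_worker {
    my ($self, $worker) = @_;
    push @{ $self->{workers} }, $worker;
}

# The monitor's periodic observation. Returns 1 if forwarded, 0 if suppressed.
sub monitor_report {
    my ($self, $hostid, $state) = @_;
    my $prev = $self->{last_state}{$hostid};
    return 0 if defined $prev && $prev eq $state;   # unchanged: not forwarded
    $self->{last_state}{$hostid} = $state;
    $_->set_state($hostid, $state) for @{ $self->{workers} };
    return 1;
}

# A worker (e.g. the deleter) declares a host unreachable.
sub worker_report_unreachable {
    my ($self, $hostid) = @_;
    $self->{last_state}{$hostid} = 'unreachable'
        if $self->{record_worker_reports};
    $_->set_state($hostid, 'unreachable') for @{ $self->{workers} };
}

package ToyWorker;

sub new {
    my ($class) = @_;
    return bless { seen => {} }, $class;
}

sub set_state {
    my ($self, $hostid, $state) = @_;
    $self->{seen}{$hostid} = $state;
}

sub state_of {
    my ($self, $hostid) = @_;
    return $self->{seen}{$hostid};
}

package main;

# Replay the sequence from the post; return the deleter's final view of host 1.
sub run_scenario {
    my ($record_worker_reports) = @_;
    my $parent  = ToyParent->new(record_worker_reports => $record_worker_reports);
    my $deleter = ToyWorker->new;
    $parent->add_worker($deleter);

    $parent->monitor_report(1, 'reachable');   # host starts out up
    $parent->worker_report_unreachable(1);     # deleter's connect timed out
    $parent->monitor_report(1, 'reachable');   # monitor's next check

    return $deleter->state_of(1);
}
```

In this model run_scenario(0) leaves the deleter believing the host is unreachable, because the parent never saw a change, while run_scenario(1) lets the monitor's next report through. Whether the real code behaves like either case is exactly what I haven't confirmed.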

The symptom is that any deletes destined for those hosts get cycled 
through file_to_delete_later and back again every 600 seconds.

Not 100% sure I'm looking in the right place. Given the timeout (600 
seconds) and that chunk of code, this is probably right? I should be 
able to verify this by sending !to commands to the delete job to say 
those hosts are back up, but I haven't gotten that to work yet.

Ideas?
-Dormando