Replication when boxes die...

Saunders, Newton nsaunders at corp.untd.com
Mon Oct 30 16:30:32 UTC 2006


Hi,

Our company is researching MogileFS for use as our storage backend.  I
am investigating the replication process and am having trouble getting
files re-replicated up to the class's mindevcount when a box "dies".
Any help would be greatly appreciated.

I have 4 boxes with the following configuration:

Box1: tracker, storage daemon
Box2: tracker, storage daemon
Box3: storage daemon
Box4: storage daemon, mysqld

Here are the steps I take:

* I store a file in a class that has a mindevcount of 2.  The file is
successfully replicated and exists on 2 boxes (say Box2 and Box4).

* I bring down the storage daemon on Box2.

Now, because there is only one "accessible" copy of the file I stored
(on Box4), I expected the file to be replicated again to bring the
number of accessible copies back to 2.  This doesn't occur.

I turned on debugging for one of the trackers and, as expected, see an
error message showing that the monitor has noticed the storage daemon
on Box2 is unreachable:

    [monitor(23670)] Port 22675 not listening on otherwise-alive machine 10.107.30.32?  Error was: 500 Can't connect to 10.107.30.32:22675 (connect: Connection refused)

The line just above this message in the monitor worker code broadcasts
the device as unreachable; however, the device.status field in the
database is never changed.
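
The check behind that log line amounts to asking whether the storage
daemon's port accepts connections. A minimal Python sketch of such a
probe, useful for reproducing the check by hand from another box,
follows. The function name port_is_listening and the 2-second timeout
are assumptions made for illustration; this is not MogileFS code, and
the real monitor may probe the port differently.

```python
import socket


def port_is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds.

    Hypothetical helper for illustration only; not part of MogileFS.
    """
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError)
        # when nothing is listening, or when the attempt times out.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling port_is_listening("10.107.30.32", 22675) while the storage
daemon on Box2 is stopped should return False, matching the
"Connection refused" in the tracker log.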

 

The only way I can get the file replicated back up to the mindevcount
is to mark one of the devices that holds it as "dead" (either using
mogadm or by updating the status field directly in the device table).
Once I mark the device "dead", the file is immediately replicated to
one of the other boxes (Box1 or Box3).
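
The behaviour described so far can be summed up in a small Python
model. It is a sketch of what I observe, under the assumption that
only a device status of "dead" stops a copy from counting toward
mindevcount; it is not taken from the MogileFS source. The function
name copies_needed and the device IDs 1 to 4 (standing in for Box1 to
Box4) are hypothetical.

```python
def copies_needed(file_devids, device_status, mindevcount):
    """Return how many extra copies would be made for one file.

    Assumption (from observed behaviour, not from MogileFS code):
    a copy stops counting only when its device is marked "dead".
    An unreachable device whose status is still "alive" counts.
    """
    counted = [d for d in file_devids if device_status.get(d) != "dead"]
    return max(0, mindevcount - len(counted))


# Hypothetical device IDs 1-4 standing in for Box1-Box4.
status = {1: "alive", 2: "alive", 3: "alive", 4: "alive"}

# Storage daemon on Box2 is down, but its status is still "alive":
print(copies_needed([2, 4], status, mindevcount=2))  # prints 0

# After marking the Box2 device "dead" by hand:
status[2] = "dead"
print(copies_needed([2, 4], status, mindevcount=2))  # prints 1
```

In this model the unreachable-but-alive device still counts as a valid
copy, which would explain why nothing happens until the status is
changed manually.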

 

After a quick grep of the code, I found only one place where the status
field of the device table is modified: the cmd_set_state function in
the query worker.  This function is only called from within the admin
class.

I would appreciate any help you can give.  Have I set up MogileFS
incorrectly?  Or is automatic re-replication when a storage daemon
becomes unreachable not a feature of MogileFS?

Thanks in advance,

Newton Saunders
