Replication when boxes die...

Jay Buffington jaybuffington at gmail.com
Mon Oct 30 17:00:28 UTC 2006


When a box goes down (like when you shut down the storage node on box2),
MogileFS assumes that it is a temporary failure and that the box will become
available again shortly.

If you want the replication to start happening, you'll have to mark the
host as 'dead'.
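
To make the distinction concrete, here is a toy model of the behavior
described in this thread. It is only a sketch, not MogileFS's actual code:
the Device class, the reachable flag, and both function names are made up
for illustration. The assumption is that copies are counted by the stored
device status, not by whether the monitor can currently reach the device.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical stand-in for a row in the device table."""
    host: str
    status: str = "alive"    # stored status; changed only by an admin action
    reachable: bool = True   # what the monitor currently observes


def copies_counted(devices):
    """Count copies on devices not marked 'dead'; reachability is ignored."""
    return sum(1 for d in devices if d.status != "dead")


def needs_replication(devices, mindevcount):
    """True when the counted copies fall below the class's mindevcount."""
    return copies_counted(devices) < mindevcount


# The file lives on Box2 and Box4, in a class with mindevcount 2.
holders = [Device("Box2"), Device("Box4")]

# The storage daemon on Box2 goes down: unreachable, but status is unchanged.
holders[0].reachable = False
before_mark = needs_replication(holders, 2)

# An admin marks the Box2 device as dead.
holders[0].status = "dead"
after_mark = needs_replication(holders, 2)

print(before_mark, after_mark)
```

In this model the unreachable copy on Box2 still counts toward mindevcount
until its status is set to 'dead', which matches what Newton saw.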

Jay


On 10/30/06, Saunders, Newton <nsaunders at corp.untd.com> wrote:
>
> Hi,
>
> Our company is researching MogileFS for use as our storage backend.  I am
> investigating the replication process and am having trouble getting files to
> replicate to the mindevcount when a box "dies".  Any help would be greatly
> appreciated.
>
> I have 4 boxes with the following configuration:
>
> Box1: tracker, storage daemon
> Box2: tracker, storage daemon
> Box3: storage daemon
> Box4: storage daemon, mysqld
>
> Here are the steps I take:
>
> * I store a file to a class that has a mindevcount of 2.  The file is
> successfully replicated and exists on 2 boxes (say Box2 and Box4).
>
> * I bring down the storage daemon on Box2.
>
> Now, because there is only one "accessible" copy of the file I stored (on
> Box4), I expected that the file would be replicated again in order to bring
> the replication count back to 2.  This doesn't occur.
>
> I turned on debugging for one of the trackers and, as expected, see this
> error message that recognizes that the storage daemon on Box2 is
> unreachable: "[monitor(23670)] Port 22675 not listening on otherwise-alive
> machine 10.107.30.32?  Error was: 500 Can't connect to 10.107.30.32:22675
> (connect: Connection refused)".  The line just above this message in the
> monitor worker code broadcasts the device as unreachable; however, the
> device.status field in the database is never changed.
>
> The only way I can get the file to be replicated to the mindevcount is to
> mark one of the devices where it exists as "dead" (either using mogadm or
> updating the status field directly in the device table).  Once I mark the
> device "dead", the file is immediately replicated on one of the other boxes
> (Box1 or Box3).
>
> After a quick grep of the code, I find only one place where the status field
> of the device table is modified: the cmd_set_state function in the query
> worker.  This function is only called from within the admin class.
>
> I would appreciate any help any of you can give.  Have I set up MogileFS
> incorrectly?  Is this not a feature of MogileFS?
>
> Thanks in advance,
>
> Newton Saunders

