MogileFS Stability Issues

Brad Fitzpatrick brad at danga.com
Fri Sep 29 18:08:22 UTC 2006


tbone,

Can I get the actual strace output from both the monitor and the
replicator when this happens?  It would help a lot in finding the issue.

I'll try to reproduce it in the meantime, though.
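
For reference, one way to grab such a trace is to attach strace to the
spinning worker for a few seconds and write the result to a file.  The
sketch below is only an illustration, not part of MogileFS: the helper
names (strace_command, capture), the output path, and the 10-second
window are assumptions; the strace flags used (-tt, -f, -p, -o) are
standard ones.

```python
import subprocess

def strace_command(pid, out_path):
    """Build an strace invocation that attaches to a running process."""
    return [
        "strace",
        "-tt",            # prefix each traced call with a timestamp
        "-f",             # also follow children forked while attached
        "-p", str(pid),   # attach to an already-running process
        "-o", out_path,   # write the trace to a file instead of stderr
    ]

def capture(pid, out_path, seconds=10):
    """Attach for a few seconds, then stop strace and keep the trace file.

    Assumption: the caller may trace the pid (same user or root).
    """
    proc = subprocess.Popen(strace_command(pid, out_path))
    try:
        proc.wait(timeout=seconds)
    except subprocess.TimeoutExpired:
        proc.terminate()  # on SIGTERM strace detaches and exits
        proc.wait()
```

The pid to pass is the one shown in the syslog tag of the busy worker,
for example capture(8420, "/tmp/replicate-8420.strace") for the worker
logged as replicate(8420).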



On Tue, 26 Sep 2006, tbone wrote:

> I have been having some serious stability issues using both the trunk
> and mogilefs-server-2.00_02 and I figured I'd report it.
>
> The configuration I am using is:
>  - 1 Tracker (local mysqld)
>  - 5 Storage Nodes (running lighttpd, 8 devices each)
>
> The storage nodes do not run mogstored; instead I tossed together a
> couple of scripts: one for reaping the test-write files (after 2
> minutes) and one for updating the usage files (once per minute).
>
>
> The tests I am doing are write-heavy (it's basically an import of our
> current filesystem).  I fire up the imports on a few machines (each
> inserting a separate set of files).  As soon as I get 5 or 6 of them
> starting to write files into mogile, all of the scripts lose their
> connection to the tracker.
>
> When I check the tracker machine, I see one of the mogilefs processes
> (sometimes the monitor, sometimes the replicator) using 100% CPU (an
> strace shows it stuck in a select()/read() loop).
>
> When running just a few inserts it's not too bad; the problem only occurs
> when I hammer it hard with multiple machines attempting to insert as
> fast as they can.
>
> Below is what I find in syslog:
>
> Sep 26 16:21:09 10.0.5.100 mogilefsd[8383]: [replicate(8420)] Got HTTP
> status code 403 PUTing to
> http://10.0.5.104:80/dev26/0/002/111/0002111270.fid
> Sep 26 16:21:09 10.0.5.100 mogilefsd[8383]: [replicate(8420)] Failed
> copying fid 2111270 from devid 1 to devid 26 (error type: dest_error)
> Sep 26 16:21:10 10.0.5.100 mogilefsd[8383]: [replicate(8420)] Got HTTP
> status code 403 PUTing to
> http://10.0.5.104:80/dev26/0/002/111/0002111238.fid
> Sep 26 16:21:10 10.0.5.100 mogilefsd[8383]: [replicate(8420)] Failed
> copying fid 2111238 from devid 16 to devid 26 (error type: dest_error)
> Sep 26 16:21:12 10.0.5.100 mogilefsd[8383]: [queryworker(8407)]
> get_file_size() connect timeout for HTTP HEAD for size of
> http://10.0.5.104:80/dev32/0/002/111/0002111306.fid
> Sep 26 16:21:15 10.0.5.100 mogilefsd[8383]: [replicate(8418)]
> replicated=20, attempted=23, ratio=86.96%
> Sep 26 16:21:16 10.0.5.100 mogilefsd[8383]: [replicate(8421)]
> replicated=20, attempted=20, ratio=100.00%
> Sep 26 16:21:18 10.0.5.100 mogilefsd[8383]: [replicate(8421)]
> failed_getting_lock: Unable to obtain lock mgfs:fid:2111577:replicate
> Sep 26 16:21:19 10.0.5.100 mogilefsd[8418]: crash log: Bogus error code
> at /usr/local/share/perl/5.8.7/MogileFS/Worker/Replicate.pm line 449.
> Sep 26 16:21:19 10.0.5.100 mogilefsd[8418]: ending run
> Sep 26 16:21:19 10.0.5.100 mogilefsd[8420]: crash log: Bogus error code
> at /usr/local/share/perl/5.8.7/MogileFS/Worker/Replicate.pm line 449.
> Sep 26 16:21:19 10.0.5.100 mogilefsd[8420]: ending run
> Sep 26 16:21:19 10.0.5.100 mogilefsd[8383]: [replicate(8420)] Unable to
> create socket to 10.0.5.102:80 for /dev13/0/002/111/0002111577.fid
> Sep 26 16:21:19 10.0.5.100 mogilefsd[8383]: crash log: Base class
> event_err called for MogileFS::Connection::Worker=ARRAY(0x12ce150)
> Sep 26 16:21:19 10.0.5.100 mogilefsd[8383]: ending run
> Sep 26 16:21:19 10.0.5.100 mogilefsd[8421]: crash log: Error writing:
> Broken pipe at /usr/local/share/perl/5.8.7/MogileFS/Worker.pm line 60.
> Sep 26 16:21:19 10.0.5.100 mogilefsd[8421]: ending run
> Sep 26 16:21:23 10.0.5.100 mogilefsd[8415]: crash log: Error writing:
> Broken pipe at /usr/local/share/perl/5.8.7/MogileFS/Worker.pm line 60.
> Sep 26 16:21:23 10.0.5.100 mogilefsd[8415]: ending run
> Sep 26 16:21:25 10.0.5.100 mogilefsd[8384]: crash log: No answer in 4
> seconds from parent to child MogileFS::Worker::Checker=ARRAY(0x12b6d90)
> [8384], dying at /usr/local/share/perl/5.8.7/MogileFS/Worker.pm line 129.
> Sep 26 16:21:25 10.0.5.100 mogilefsd[8384]: ending run
> Sep 26 16:21:26 10.0.5.100 mogilefsd[8388]: crash log: No answer in 4
> seconds from parent to child MogileFS::Worker::Delete=ARRAY(0x12b7300)
> [8388], dying at /usr/local/share/perl/5.8.7/MogileFS/Worker.pm line 129.
> Sep 26 16:21:26 10.0.5.100 mogilefsd[8388]: ending run
>
>
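
To see which worker pids crashed and why, an excerpt like the quoted
syslog can be unwrapped and filtered for "crash log:" entries.  This is
a rough sketch, not a MogileFS tool: the function names and the regular
expressions are assumptions based only on the line shapes shown above
(a "Mon DD HH:MM:SS" timestamp starts each record, and the mail client
has wrapped long records and prefixed them with ">").

```python
import re

# A record starts with a syslog timestamp such as "Sep 26 16:21:19 ".
RECORD_START = re.compile(r"^[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} ")
# Crash entries carry the worker pid in the syslog tag.
CRASH = re.compile(r"mogilefsd\[(\d+)\]: crash log: (.*)")

def unwrap(lines):
    """Strip the "> " quote prefix and rejoin wrapped syslog records."""
    records = []
    for line in lines:
        text = re.sub(r"^>\s?", "", line.rstrip("\n"))
        if RECORD_START.match(text):
            records.append(text)          # a new record begins here
        elif records and text:
            records[-1] += " " + text     # continuation of the last one
    return records

def crash_reports(records):
    """Return (pid, message) pairs for every "crash log:" record."""
    reports = []
    for record in records:
        match = CRASH.search(record)
        if match:
            reports.append((int(match.group(1)), match.group(2)))
    return reports
```

Feeding the quoted excerpt through unwrap() and then crash_reports()
gives one pair per crashed process, which makes it easier to line the
crashes up against the pids seen in top or strace.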

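The test-write reaper described in the quoted message (remove files
after 2 minutes) could look roughly like the sketch below.  This is not
tbone's actual script, which was not posted: the function name, the
directory argument, and the idea of passing each device's test-write
directory are all assumptions made for illustration.

```python
import os
import time

def reap(directory, max_age_seconds=120, now=None):
    """Delete regular files in `directory` older than `max_age_seconds`.

    Returns the sorted names of the files that were removed.
    """
    now = time.time() if now is None else now
    removed = []
    for entry in os.scandir(directory):
        if not entry.is_file(follow_symlinks=False):
            continue                      # skip subdirectories and links
        age = now - entry.stat(follow_symlinks=False).st_mtime
        if age > max_age_seconds:
            try:
                os.remove(entry.path)
            except FileNotFoundError:
                continue                  # already removed by someone else
            removed.append(entry.name)
    return sorted(removed)
```

Run from cron once a minute against each device directory, this keeps
only the most recent test writes on disk.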
