MogileFS Stability Issues

Brad Fitzpatrick brad at danga.com
Tue Sep 26 23:54:55 UTC 2006


First off, thanks for the bug report.

Obviously mogilefsd should handle this better, but it looks like the root
of the problem is the errors that lighttpd is returning (HTTP status
code 403 on the PUT).

Can you look into why lighttpd is doing that, and let us know whether
mogilefsd behaves better once the 403s are gone?

I haven't done that much stress testing with lighttpd yet, though it
should "just work", if it speaks DAV.
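
One way to narrow down the 403 is to take mogilefsd out of the picture and
send a single PUT straight to a storage node. The following is only a
minimal sketch, not part of MogileFS: the function name probe_put and the
request body are made up for illustration, and it uses nothing beyond
Python's standard http.client module.

```python
import http.client


def probe_put(host, port, path, body=b"mogilefs-put-probe", timeout=5.0):
    """Send one HTTP PUT to host:port and return (status, reason).

    Hypothetical helper for diagnosis only; it does not reproduce
    everything mogilefsd does when it replicates a fid.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        # http.client sets Content-Length from the bytes body.
        conn.request("PUT", path, body=body)
        resp = conn.getresponse()
        resp.read()  # drain the response so the socket closes cleanly
        return resp.status, resp.reason
    finally:
        conn.close()
```

For example, probe_put("10.0.5.104", 80, "/dev26/put-probe.txt") targets the
host and device from the syslog excerpt (the file name is arbitrary). If that
also comes back 403, the cause is more likely in the lighttpd configuration
or the permissions on that device directory than in mogilefsd.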



On Tue, 26 Sep 2006, tbone wrote:

> I have been having some serious stability issues using both the trunk
> and mogilefs-server-2.00_02 and I figured I'd report it.
>
> The configuration I am using is:
>  - 1 Tracker (local mysqld)
>  - 5 Storage Nodes (running lighttpd, 8 devices each)
>
> The storage nodes do not run mogstored; instead I tossed together a
> couple of scripts: one for reaping the test-write files (after 2
> minutes) and one for updating the usage files (once per minute).
>
>
> The tests I am doing are write heavy (it's basically an import of our
> current filesystem).  I fire up the imports on a few machines (each
> inserting a separate set of files).  As soon as I get 5 or 6 of them
> starting to write files into mogile, all of the scripts lose their
> connection to the tracker.
>
> When I check the tracker machine, I see one of the mogilefs processes
> (sometimes the monitor, sometimes the replicator) using 100% CPU (an
> strace shows it stuck in a select()/read() loop).
>
> When running just a few inserts it's not too bad; the problem only
> occurs when I hammer it hard with multiple machines attempting to insert
> as fast as they can.
>
> Below is what I find in syslog:
>
> Sep 26 16:21:09 10.0.5.100 mogilefsd[8383]: [replicate(8420)] Got HTTP
> status code 403 PUTing to
> http://10.0.5.104:80/dev26/0/002/111/0002111270.fid
> Sep 26 16:21:09 10.0.5.100 mogilefsd[8383]: [replicate(8420)] Failed
> copying fid 2111270 from devid 1 to devid 26 (error type: dest_error)
> Sep 26 16:21:10 10.0.5.100 mogilefsd[8383]: [replicate(8420)] Got HTTP
> status code 403 PUTing to
> http://10.0.5.104:80/dev26/0/002/111/0002111238.fid
> Sep 26 16:21:10 10.0.5.100 mogilefsd[8383]: [replicate(8420)] Failed
> copying fid 2111238 from devid 16 to devid 26 (error type: dest_error)
> Sep 26 16:21:12 10.0.5.100 mogilefsd[8383]: [queryworker(8407)]
> get_file_size() connect timeout for HTTP HEAD for size of
> http://10.0.5.104:80/dev32/0/002/111/0002111306.fid
> Sep 26 16:21:15 10.0.5.100 mogilefsd[8383]: [replicate(8418)]
> replicated=20, attempted=23, ratio=86.96%
> Sep 26 16:21:16 10.0.5.100 mogilefsd[8383]: [replicate(8421)]
> replicated=20, attempted=20, ratio=100.00%
> Sep 26 16:21:18 10.0.5.100 mogilefsd[8383]: [replicate(8421)]
> failed_getting_lock: Unable to obtain lock mgfs:fid:2111577:replicate
> Sep 26 16:21:19 10.0.5.100 mogilefsd[8418]: crash log: Bogus error code
> at /usr/local/share/perl/5.8.7/MogileFS/Worker/Replicate.pm line 449.
> Sep 26 16:21:19 10.0.5.100 mogilefsd[8418]: ending run
> Sep 26 16:21:19 10.0.5.100 mogilefsd[8420]: crash log: Bogus error code
> at /usr/local/share/perl/5.8.7/MogileFS/Worker/Replicate.pm line 449.
> Sep 26 16:21:19 10.0.5.100 mogilefsd[8420]: ending run
> Sep 26 16:21:19 10.0.5.100 mogilefsd[8383]: [replicate(8420)] Unable to
> create socket to 10.0.5.102:80 for /dev13/0/002/111/0002111577.fid
> Sep 26 16:21:19 10.0.5.100 mogilefsd[8383]: crash log: Base class
> event_err called for MogileFS::Connection::Worker=ARRAY(0x12ce150)
> Sep 26 16:21:19 10.0.5.100 mogilefsd[8383]: ending run
> Sep 26 16:21:19 10.0.5.100 mogilefsd[8421]: crash log: Error writing:
> Broken pipe at /usr/local/share/perl/5.8.7/MogileFS/Worker.pm line 60.
> Sep 26 16:21:19 10.0.5.100 mogilefsd[8421]: ending run
> Sep 26 16:21:23 10.0.5.100 mogilefsd[8415]: crash log: Error writing:
> Broken pipe at /usr/local/share/perl/5.8.7/MogileFS/Worker.pm line 60.
> Sep 26 16:21:23 10.0.5.100 mogilefsd[8415]: ending run
> Sep 26 16:21:25 10.0.5.100 mogilefsd[8384]: crash log: No answer in 4
> seconds from parent to child MogileFS::Worker::Checker=ARRAY(0x12b6d90)
> [8384], dying at /usr/local/share/perl/5.8.7/MogileFS/Worker.pm line 129.
> Sep 26 16:21:25 10.0.5.100 mogilefsd[8384]: ending run
> Sep 26 16:21:26 10.0.5.100 mogilefsd[8388]: crash log: No answer in 4
> seconds from parent to child MogileFS::Worker::Delete=ARRAY(0x12b7300)
> [8388], dying at /usr/local/share/perl/5.8.7/MogileFS/Worker.pm line 129.
> Sep 26 16:21:26 10.0.5.100 mogilefsd[8388]: ending run
>
>
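
To see whether the 403s are confined to one host or device, the PUT
failures in the syslog can be tallied. This is a rough sketch that assumes
the raw syslog text (not the ">"-quoted, re-wrapped copy above) and the
message wording shown in the excerpt; the names PUT_ERROR and
tally_put_errors are illustrative only.

```python
import re
from collections import Counter

# Matches the replicate worker's message as it appears in the excerpt,
# capturing the status code, the host:port, and the device directory.
PUT_ERROR = re.compile(
    r"Got HTTP status code (\d{3}) PUTing to http://([^/\s]+)/(dev\d+)/"
)


def tally_put_errors(log_text):
    """Count failed PUTs per (status, host:port, device)."""
    # Collapse all whitespace so a message split across lines still matches.
    flat = " ".join(log_text.split())
    counts = Counter()
    for status, hostport, dev in PUT_ERROR.findall(flat):
        counts[(int(status), hostport, dev)] += 1
    return counts
```

Calling tally_put_errors(open(path).read()).most_common() on the tracker's
syslog file (path being wherever that log lives) lists the worst
combinations first. In the excerpt above, both 403s are for dev26 on
10.0.5.104.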

