MogileFS Stability Issues

tbone tbone at nexopia.com
Tue Sep 26 23:35:51 UTC 2006


I have been having some serious stability issues with both trunk and 
mogilefs-server-2.00_02, and I figured I'd report them.

The configuration I am using is:
 - 1 Tracker (local mysqld)
 - 5 Storage Nodes (running lighttpd, 8 devices each)

The storage nodes do not run mogstored; instead I tossed together a 
couple of scripts: one for reaping the test-write files (after 2 
minutes) and one for updating the usage files (once per minute).
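
A minimal sketch of what such a reaper could look like, in Python. The
layout is an assumption: it supposes each device directory under the web
server docroot is named devN and holds a test-write/ subdirectory for the
probe files. The function name and parameters are hypothetical, and the
real scripts may differ.

```python
import os
import time


def reap_test_writes(docroot, max_age_seconds=120, now=None):
    """Delete files older than max_age_seconds under <docroot>/dev*/test-write/.

    Assumed layout (not confirmed by the post): one directory per device,
    named devN, each with a test-write/ subdirectory.
    Returns the list of paths that were removed.
    """
    now = time.time() if now is None else now
    removed = []
    for dev in sorted(os.listdir(docroot)):
        test_dir = os.path.join(docroot, dev, "test-write")
        if not dev.startswith("dev") or not os.path.isdir(test_dir):
            continue
        for name in sorted(os.listdir(test_dir)):
            path = os.path.join(test_dir, name)
            try:
                too_old = now - os.path.getmtime(path) > max_age_seconds
                if os.path.isfile(path) and too_old:
                    os.remove(path)
                    removed.append(path)
            except FileNotFoundError:
                # The file vanished between listing and removal; skip it.
                pass
    return removed
```

Run from cron once a minute with the docroot as the argument, this keeps
only probe files written in the last two minutes.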


The tests I am doing are write-heavy (it's basically an import of our 
current filesystem).  I fire up the imports on a few machines (each 
inserting a separate set of files).  As soon as 5 or 6 of them start 
writing files into mogile, all of the scripts lose their connection to 
the tracker.

When I check the tracker machine, I see one of the mogilefsd processes 
(sometimes the monitor, sometimes the replicator) using 100% CPU 
(strace shows it stuck in a select()/read() loop).

When running just a few inserts it's not too bad; the problem only 
occurs when I hammer it hard with multiple machines attempting to insert 
as fast as they can.

Below is what I find in syslog:

Sep 26 16:21:09 10.0.5.100 mogilefsd[8383]: [replicate(8420)] Got HTTP 
status code 403 PUTing to 
http://10.0.5.104:80/dev26/0/002/111/0002111270.fid 
Sep 26 16:21:09 10.0.5.100 mogilefsd[8383]: [replicate(8420)] Failed 
copying fid 2111270 from devid 1 to devid 26 (error type: dest_error) 
Sep 26 16:21:10 10.0.5.100 mogilefsd[8383]: [replicate(8420)] Got HTTP 
status code 403 PUTing to 
http://10.0.5.104:80/dev26/0/002/111/0002111238.fid 
Sep 26 16:21:10 10.0.5.100 mogilefsd[8383]: [replicate(8420)] Failed 
copying fid 2111238 from devid 16 to devid 26 (error type: dest_error) 
Sep 26 16:21:12 10.0.5.100 mogilefsd[8383]: [queryworker(8407)] 
get_file_size() connect timeout for HTTP HEAD for size of 
http://10.0.5.104:80/dev32/0/002/111/0002111306.fid 
Sep 26 16:21:15 10.0.5.100 mogilefsd[8383]: [replicate(8418)] 
replicated=20, attempted=23, ratio=86.96% 
Sep 26 16:21:16 10.0.5.100 mogilefsd[8383]: [replicate(8421)] 
replicated=20, attempted=20, ratio=100.00% 
Sep 26 16:21:18 10.0.5.100 mogilefsd[8383]: [replicate(8421)] 
failed_getting_lock: Unable to obtain lock mgfs:fid:2111577:replicate 
Sep 26 16:21:19 10.0.5.100 mogilefsd[8418]: crash log: Bogus error code 
at /usr/local/share/perl/5.8.7/MogileFS/Worker/Replicate.pm line 449. 
Sep 26 16:21:19 10.0.5.100 mogilefsd[8418]: ending run 
Sep 26 16:21:19 10.0.5.100 mogilefsd[8420]: crash log: Bogus error code 
at /usr/local/share/perl/5.8.7/MogileFS/Worker/Replicate.pm line 449. 
Sep 26 16:21:19 10.0.5.100 mogilefsd[8420]: ending run 
Sep 26 16:21:19 10.0.5.100 mogilefsd[8383]: [replicate(8420)] Unable to 
create socket to 10.0.5.102:80 for /dev13/0/002/111/0002111577.fid 
Sep 26 16:21:19 10.0.5.100 mogilefsd[8383]: crash log: Base class 
event_err called for MogileFS::Connection::Worker=ARRAY(0x12ce150) 
Sep 26 16:21:19 10.0.5.100 mogilefsd[8383]: ending run 
Sep 26 16:21:19 10.0.5.100 mogilefsd[8421]: crash log: Error writing: 
Broken pipe at /usr/local/share/perl/5.8.7/MogileFS/Worker.pm line 60. 
Sep 26 16:21:19 10.0.5.100 mogilefsd[8421]: ending run 
Sep 26 16:21:23 10.0.5.100 mogilefsd[8415]: crash log: Error writing: 
Broken pipe at /usr/local/share/perl/5.8.7/MogileFS/Worker.pm line 60. 
Sep 26 16:21:23 10.0.5.100 mogilefsd[8415]: ending run 
Sep 26 16:21:25 10.0.5.100 mogilefsd[8384]: crash log: No answer in 4 
seconds from parent to child MogileFS::Worker::Checker=ARRAY(0x12b6d90) 
[8384], dying at /usr/local/share/perl/5.8.7/MogileFS/Worker.pm line 129. 
Sep 26 16:21:25 10.0.5.100 mogilefsd[8384]: ending run 
Sep 26 16:21:26 10.0.5.100 mogilefsd[8388]: crash log: No answer in 4 
seconds from parent to child MogileFS::Worker::Delete=ARRAY(0x12b7300) 
[8388], dying at /usr/local/share/perl/5.8.7/MogileFS/Worker.pm line 129. 
Sep 26 16:21:26 10.0.5.100 mogilefsd[8388]: ending run
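
To see which failures dominate in a longer log, the "crash log" lines can
be tallied by message. The following is a rough sketch in Python, assuming
each syslog entry is on a single line (the excerpt above is wrapped by the
mail client). The function name count_crashes is hypothetical, and the
pattern is derived only from the lines shown above.

```python
import re
from collections import Counter

# Matches the "crash log" entries as they appear in the excerpt above.
CRASH_RE = re.compile(r"mogilefsd\[\d+\]: crash log: (.*)")


def count_crashes(lines):
    """Count crash-log messages, ignoring the trailing ' at /path line N.' part."""
    counts = Counter()
    for line in lines:
        match = CRASH_RE.search(line)
        if match:
            # Keep only the text before the Perl source location, if any.
            message = match.group(1).split(" at /", 1)[0].strip()
            counts[message] += 1
    return counts
```

Feeding it the unwrapped syslog file (for example, count_crashes(open(path)))
gives a per-message count, which makes it easier to tell whether the
"Bogus error code" crashes or the "Broken pipe" crashes come first.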

