MogileFS Stability Issues
tbone
tbone at nexopia.com
Tue Sep 26 23:35:51 UTC 2006
I have been having some serious stability issues with both trunk and
mogilefs-server-2.00_02, and I figured I'd report them.
The configuration I am using is:
- 1 Tracker (local mysqld)
- 5 Storage Nodes (running lighttpd, 8 devices each)
The storage nodes do not run mogstored; instead I tossed together a
couple of scripts: one for reaping the test-write files (after 2
minutes) and one for updating the usage files (once per minute).
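
For readers who want to picture the two helper scripts, here is a rough
sketch of what they could look like. It is not the code actually running
on these nodes. The function names, the directory passed in, and
especially the key names and 1 KiB units in the usage file are
assumptions; compare them against what mogstored itself writes before
relying on them.

```python
import os
import time

REAP_AFTER_SECONDS = 120  # "after 2 minutes", per the description above


def reap_test_writes(directory, max_age=REAP_AFTER_SECONDS, now=None):
    """Delete regular files in `directory` older than `max_age` seconds.

    `directory` is wherever the tracker's test writes land on a device
    (the exact location is an assumption left to the caller).
    Returns the sorted list of file names that were removed.
    """
    now = time.time() if now is None else now
    removed = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        try:
            if os.path.isfile(path) and now - os.path.getmtime(path) > max_age:
                os.remove(path)
                removed.append(name)
        except FileNotFoundError:
            # The file vanished between listing and deleting; ignore it.
            continue
    return sorted(removed)


def write_usage(device_dir, now=None):
    """Write a `usage` file into `device_dir` and return its text.

    ASSUMPTION: the key names and the 1 KiB units are illustrative only.
    """
    now = time.time() if now is None else now
    st = os.statvfs(device_dir)
    total_kb = st.f_blocks * st.f_frsize // 1024
    avail_kb = st.f_bavail * st.f_frsize // 1024
    used_kb = (st.f_blocks - st.f_bfree) * st.f_frsize // 1024
    pct = 100 * used_kb // total_kb if total_kb else 0
    text = (
        f"available: {avail_kb}\n"
        f"disk: {device_dir}\n"
        f"time: {int(now)}\n"
        f"total: {total_kb}\n"
        f"use: {pct}%\n"
        f"used: {used_kb}\n"
    )
    # Write to a temporary name and rename, so a reader never sees a
    # half-written usage file.
    tmp = os.path.join(device_dir, "usage.tmp")
    with open(tmp, "w") as fh:
        fh.write(text)
    os.replace(tmp, os.path.join(device_dir, "usage"))
    return text
```

Run from cron, the first function would be called every minute or so per
device and the second once per minute, matching the intervals above.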
The tests I am doing are write-heavy (it's basically an import of our
current filesystem). I fire up the imports on a few machines (each
inserting a separate set of files). As soon as 5 or 6 of them start
writing files into mogile, all of the scripts lose their connection to
the tracker.
When I check the tracker machine, I see one of the mogilefs processes
(sometimes the monitor, sometimes the replicator) using 100% CPU (an
strace shows it stuck in a select()/read() loop).
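
The strace output alone does not say why the process spins. One common
way a select()/read() loop ends up at 100% CPU, offered here only as a
hypothesis and not as a diagnosis of mogilefsd, is a descriptor whose
peer has closed: select() keeps reporting it readable, read() keeps
returning zero bytes, and nothing removes it from the watched set. The
self-contained sketch below reproduces that pattern with a socket pair;
`count_wakeups` and its parameters are made up for the illustration.

```python
import select
import socket


def count_wakeups(handle_eof, max_iterations=1000):
    """Run a bounded select()/recv() loop against a peer that has closed.

    Returns how many times select() reported the socket as readable.
    """
    reader, writer = socket.socketpair()
    writer.close()  # the peer goes away; reader is now at end-of-file
    watched = [reader]
    wakeups = 0
    for _iteration in range(max_iterations):
        if not watched:
            break
        readable, _w, _x = select.select(watched, [], [], 0.1)
        for sock in readable:
            wakeups += 1
            data = sock.recv(4096)
            if data == b"" and handle_eof:
                # EOF: stop watching and close so select() is not woken again.
                watched.remove(sock)
                sock.close()
    for sock in watched:
        sock.close()
    return wakeups
```

With `handle_eof=False` the loop wakes on every iteration without ever
blocking, which is what a tight select()/read() sequence in strace looks
like; with `handle_eof=True` it wakes once and stops.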
When running just a few inserts it's not too bad; the problem only
occurs when I hammer it hard with multiple machines attempting to insert
as fast as they can.
Below is what I find in syslog:
Sep 26 16:21:09 10.0.5.100 mogilefsd[8383]: [replicate(8420)] Got HTTP
status code 403 PUTing to
http://10.0.5.104:80/dev26/0/002/111/0002111270.fid
Sep 26 16:21:09 10.0.5.100 mogilefsd[8383]: [replicate(8420)] Failed
copying fid 2111270 from devid 1 to devid 26 (error type: dest_error)
Sep 26 16:21:10 10.0.5.100 mogilefsd[8383]: [replicate(8420)] Got HTTP
status code 403 PUTing to
http://10.0.5.104:80/dev26/0/002/111/0002111238.fid
Sep 26 16:21:10 10.0.5.100 mogilefsd[8383]: [replicate(8420)] Failed
copying fid 2111238 from devid 16 to devid 26 (error type: dest_error)
Sep 26 16:21:12 10.0.5.100 mogilefsd[8383]: [queryworker(8407)]
get_file_size() connect timeout for HTTP HEAD for size of
http://10.0.5.104:80/dev32/0/002/111/0002111306.fid
Sep 26 16:21:15 10.0.5.100 mogilefsd[8383]: [replicate(8418)]
replicated=20, attempted=23, ratio=86.96%
Sep 26 16:21:16 10.0.5.100 mogilefsd[8383]: [replicate(8421)]
replicated=20, attempted=20, ratio=100.00%
Sep 26 16:21:18 10.0.5.100 mogilefsd[8383]: [replicate(8421)]
failed_getting_lock: Unable to obtain lock mgfs:fid:2111577:replicate
Sep 26 16:21:19 10.0.5.100 mogilefsd[8418]: crash log: Bogus error code
at /usr/local/share/perl/5.8.7/MogileFS/Worker/Replicate.pm line 449.
Sep 26 16:21:19 10.0.5.100 mogilefsd[8418]: ending run
Sep 26 16:21:19 10.0.5.100 mogilefsd[8420]: crash log: Bogus error code
at /usr/local/share/perl/5.8.7/MogileFS/Worker/Replicate.pm line 449.
Sep 26 16:21:19 10.0.5.100 mogilefsd[8420]: ending run
Sep 26 16:21:19 10.0.5.100 mogilefsd[8383]: [replicate(8420)] Unable to
create socket to 10.0.5.102:80 for /dev13/0/002/111/0002111577.fid
Sep 26 16:21:19 10.0.5.100 mogilefsd[8383]: crash log: Base class
event_err called for MogileFS::Connection::Worker=ARRAY(0x12ce150)
Sep 26 16:21:19 10.0.5.100 mogilefsd[8383]: ending run
Sep 26 16:21:19 10.0.5.100 mogilefsd[8421]: crash log: Error writing:
Broken pipe at /usr/local/share/perl/5.8.7/MogileFS/Worker.pm line 60.
Sep 26 16:21:19 10.0.5.100 mogilefsd[8421]: ending run
Sep 26 16:21:23 10.0.5.100 mogilefsd[8415]: crash log: Error writing:
Broken pipe at /usr/local/share/perl/5.8.7/MogileFS/Worker.pm line 60.
Sep 26 16:21:23 10.0.5.100 mogilefsd[8415]: ending run
Sep 26 16:21:25 10.0.5.100 mogilefsd[8384]: crash log: No answer in 4
seconds from parent to child MogileFS::Worker::Checker=ARRAY(0x12b6d90)
[8384], dying at /usr/local/share/perl/5.8.7/MogileFS/Worker.pm line 129.
Sep 26 16:21:25 10.0.5.100 mogilefsd[8384]: ending run
Sep 26 16:21:26 10.0.5.100 mogilefsd[8388]: crash log: No answer in 4
seconds from parent to child MogileFS::Worker::Delete=ARRAY(0x12b7300)
[8388], dying at /usr/local/share/perl/5.8.7/MogileFS/Worker.pm line 129.
Sep 26 16:21:26 10.0.5.100 mogilefsd[8388]: ending run
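
To make an excerpt like this easier to scan, a small script can group
the entries by the PID that crashed and by the worker type that logged
through the parent. The sketch below is one way to do that, assuming the
mail line-wrapping has been undone so each syslog entry is on one line.
The regular expressions are inferred from the lines above, not from any
documented mogilefsd log format, and `summarize` is a hypothetical
helper name.

```python
import re
from collections import Counter

# A few entries copied from the log above, with the mail wrapping undone.
SAMPLE = [
    "Sep 26 16:21:09 10.0.5.100 mogilefsd[8383]: [replicate(8420)] Got HTTP "
    "status code 403 PUTing to "
    "http://10.0.5.104:80/dev26/0/002/111/0002111270.fid",
    "Sep 26 16:21:12 10.0.5.100 mogilefsd[8383]: [queryworker(8407)] "
    "get_file_size() connect timeout for HTTP HEAD for size of "
    "http://10.0.5.104:80/dev32/0/002/111/0002111306.fid",
    "Sep 26 16:21:19 10.0.5.100 mogilefsd[8418]: crash log: Bogus error code "
    "at /usr/local/share/perl/5.8.7/MogileFS/Worker/Replicate.pm line 449.",
    "Sep 26 16:21:19 10.0.5.100 mogilefsd[8383]: crash log: Base class "
    "event_err called for MogileFS::Connection::Worker=ARRAY(0x12ce150)",
    "Sep 26 16:21:25 10.0.5.100 mogilefsd[8384]: crash log: No answer in 4 "
    "seconds from parent to child MogileFS::Worker::Checker=ARRAY(0x12b6d90) "
    "[8384], dying at /usr/local/share/perl/5.8.7/MogileFS/Worker.pm line 129.",
]

# Patterns inferred from the excerpt; treat them as assumptions.
CRASH_RE = re.compile(r"mogilefsd\[(\d+)\]: crash log: (.*)")
WORKER_RE = re.compile(r"mogilefsd\[\d+\]: \[(\w+)\((\d+)\)\]")


def summarize(lines):
    """Return (crashes, per_worker).

    crashes maps a crashing PID to its crash message; per_worker counts
    the non-crash entries logged on behalf of each worker type.
    """
    crashes = {}
    per_worker = Counter()
    for line in lines:
        crash = CRASH_RE.search(line)
        if crash:
            crashes[int(crash.group(1))] = crash.group(2)
            continue
        worker = WORKER_RE.search(line)
        if worker:
            per_worker[worker.group(1)] += 1
    return crashes, per_worker


crashes, per_worker = summarize(SAMPLE)
```

On the sample lines this reports crash entries for PIDs 8418, 8383 and
8384, plus one relayed message each from a replicate and a queryworker
process.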