mogstored dying: redux
Greg Connor
gconnor at nekodojo.org
Mon May 19 00:54:08 UTC 2008
I wrote a week or two ago asking for help with my "mogstored dying"
problem. Thanks to those who responded at that time. Since then, I
have upgraded all my nodes (16 storage nodes, 2 of which also act as
trackers) to CentOS 5.1, which runs Perl 5.8.8 (the client machine has
Perl 5.8.5). I'm using the current Subversion tree (r1177) for
trackers, storage nodes, and clients/utils.
Unfortunately I'm still having a problem with mogstored just dying,
and I can't figure out why. Any help or pointers would be appreciated.
I'm currently using mogtool to push a large amount of data: 5 bigfiles
with a total size of 2454G. I'm expecting that to be broken up into
39269 chunks of 64M each, and right now I've got about 19000 chunks
copied.
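(Sanity-checking that expectation, assuming 1G = 1024M: 2454 * 1024 / 64
= 39264 full 64M chunks, plus up to one partial chunk per bigfile, which
lines up with the 39269 I'm expecting.)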
My biggest problem right now is that mogstored just plain dies. It
simply stops, with no message to either syslog or its own output. All
16 of my nodes have had mogstored stop this way, between 4 and 10
times each. To keep the copy going, I check every minute whether
mogstored is running and restart it if it isn't. The only thing that
shows up in syslog comes after the restart: perlbal[pid]:
beginning run.
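The check amounts to roughly this, run from cron (a sketch only; the
pgrep pattern and the mogstored path are particulars of my setup, not
anything shipped with MogileFS):

#!/usr/bin/perl
# Once-a-minute check: restart mogstored if no matching process exists.
use strict;
use warnings;

# pgrep exits non-zero when nothing matches
if (system('pgrep', '-f', 'mogstored') != 0) {
    warn scalar(localtime) . ": mogstored not running, restarting\n";
    system('/usr/bin/mogstored', '--daemonize') == 0
        or warn "restart failed: $?\n";
}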
The start script I have been using passes --daemonize, so I ran
mogstored without the --daemonize flag and got a bit more output:
Running.
Out of memory!
Out of memory!
Callback called exit.
Callback called exit.
END failed--call queue aborted.
beginning run
Running.
There's a bit more information in mogtool's output, but I don't know
whether these errors coincide with the mogstored crashes. Here are a few:
WARNING: Unable to save file 'collect-20080516-vol6,280': Close failed
at /usr/bin/mogtool line 816, <Sock_minime336:7001> line 283.
MogileFS backend error message: unknown_key unknown_key
System error message: Close failed at /usr/bin/mogtool line 816,
<Sock_minime336:7001> line 283.
WARNING: Unable to save file 'collect-20080516-vol6,311':
MogileFS::NewHTTPFile: error reading from node for device 337007:
Connection reset by peer at (eval 18) line 1
MogileFS backend error message: unknown_key unknown_key
System error message: MogileFS::NewHTTPFile: error reading from node
for device 337007: Connection reset by peer at (eval 18) line 1
WARNING: Unable to save file 'collect-20080516-vol6,1341':
MogileFS::NewHTTPFile: error writing to node for device 343012:
Connection reset by peer at /usr/lib64/perl5/5.8.5/x86_64-linux-thread-
multi/IO/Handle.pm line 399
MogileFS backend error message: unknown_key unknown_key
System error message: MogileFS::NewHTTPFile: error writing to node for
device 343012: Connection reset by peer at /usr/lib64/perl5/5.8.5/
x86_64-linux-thread-multi/IO/Handle.pm line 399
WARNING: Unable to save file 'collect-20080516-vol6,1736': Close
failed at /usr/bin/mogtool line 816, <Sock_minime336:7001> line 1739.
MogileFS backend error message: unknown_key unknown_key
System error message: Close failed at /usr/bin/mogtool line 816,
<Sock_minime336:7001> line 1739.
WARNING: Unable to save file 'collect-20080516-vol6,2373':
MogileFS::NewHTTPFile: unable to write to any allocated storage node
at /usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi/IO/Handle.pm line
399
MogileFS backend error message: unknown_key unknown_key
System error message: MogileFS::NewHTTPFile: unable to write to any
allocated storage node at /usr/lib64/perl5/5.8.5/x86_64-linux-thread-
multi/IO/Handle.pm line 399
A few times I observed mogstored not responding to the tracker (mogadm
check just pauses when listing that host); in that case, killing and
restarting mogstored brings it back. I could probably check for this
condition too, but now we're getting beyond a "simple" wrapper/
restart/sentinel script.
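That check would probably look something like the following (a sketch;
7500 is the storage port on my nodes and may differ elsewhere, and the
URL requested doesn't much matter -- I only care that it answers at all
within a few seconds):

#!/usr/bin/perl
# Probe a storage node: connect to mogstored and expect any HTTP
# response line within the timeout, otherwise assume it is hung.
use strict;
use warnings;
use IO::Socket::INET;

my $host = shift @ARGV or die "usage: $0 <storage-node>\n";
my $sock = IO::Socket::INET->new(
    PeerAddr => $host,
    PeerPort => 7500,
    Timeout  => 5,
) or die "$host: connect failed: $!\n";

print $sock "GET /dev1/usage HTTP/1.0\r\nHost: $host\r\n\r\n";
my $line = eval {
    local $SIG{ALRM} = sub { die "timeout\n" };
    alarm 5;
    my $l = <$sock>;
    alarm 0;
    $l;
};
defined $line or die "$host: no response (mogstored hung?)\n";
print "$host answered: $line";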
Is mogstored just plain dying a common experience, or is it pretty
rare? If that were the only thing wrong, I could get around it by
wrapping mogstored with a shell script that relaunches it as soon
as it quits, but I'd rather not have to do that... I'd rather get at
the root of the problem and make it not die in the first place.
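For what it's worth, the wrapper I have in mind would be something
along these lines (sketched in Perl rather than shell; the mogstored
path is specific to my install):

#!/usr/bin/perl
# Relaunch wrapper: run mogstored in the foreground and restart it
# whenever it exits.
use strict;
use warnings;

while (1) {
    my $rc = system('/usr/bin/mogstored');  # no --daemonize, so this blocks until it dies
    warn scalar(localtime) . ": mogstored exited with status $rc, restarting\n";
    sleep 5;  # don't spin if it dies immediately
}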
A more important question I have is: am I trying to do something that
MogileFS is simply not designed for? Is anyone else out there known to
be using MogileFS for really huge files, chunked the way mogtool does
it, and if so, were they happy with the results? If these are really
minor problems, I could probably fix them myself, but I'm concerned
that the lack of documentation on MogileFS's internals would hamper
self-support efforts.
Thanks
gregc