mogstored dying: redux

Greg Connor gconnor at nekodojo.org
Mon May 19 00:54:08 UTC 2008


I wrote a week or two ago and asked for help with my "mogstored dying"  
problem.  Thanks to those who responded at that time.  Since then, I  
have upgraded all my nodes (16 storage nodes, 2 of which also act as  
trackers) to CentOS 5.1, which runs perl 5.8.8 (the client machine has  
perl 5.8.5).  I'm using the current Subversion tree (revision 1177) for  
the trackers, storage nodes, and clients/utils.

Unfortunately I'm still having a problem with mogstored just dying,  
and I can't figure out why.  Any help or pointers would be appreciated.

I'm currently using mogtool to push a large amount of data: 5 bigfiles  
with a total size of 2454G.  I'm expecting that to be broken up into  
39269 chunks of 64M each, and right now I've got about 19000 chunks  
copied.

My biggest problem right now is that mogstored just plain dies.  It  
just stops, with no message to either syslog or its own output.  Each  
of my 16 nodes has had mogstored stop between 4 and 10 times.  In  
order to keep the copy going, I have to check every minute that  
mogstored is running and restart it if it isn't.  The only thing that  
appears in syslog comes after it starts up again: perlbal[pid]:  
beginning run.
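
For reference, the once-a-minute check is roughly the following.  The  
pgrep test and the restart command are just how my setup happens to do  
it (the normal start script launches mogstored with --daemonize), so  
treat this as a sketch rather than anything official:

         #!/bin/sh
         # Run once a minute (e.g. from cron): if mogstored isn't running,
         # start it again the same way the normal start script does.
         if ! pgrep -x mogstored >/dev/null 2>&1; then
             logger "mogstored not running, restarting"
             mogstored --daemonize
         fi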

The start script I have been using passes --daemonize, so I ran  
mogstored without the --daemonize flag and got a bit more output:
         Running.
         Out of memory!
         Out of memory!
         Callback called exit.
         Callback called exit.
         END failed--call queue aborted.
         beginning run
         Running.



There's a bit more information in mogtool's output, but I don't know  
whether these warnings coincide with the mogstored crashes.  Here are  
a few:

WARNING: Unable to save file 'collect-20080516-vol6,280': Close failed
at /usr/bin/mogtool line 816, <Sock_minime336:7001> line 283.
MogileFS backend error message: unknown_key unknown_key
System error message: Close failed at /usr/bin/mogtool line 816,
<Sock_minime336:7001> line 283.

WARNING: Unable to save file 'collect-20080516-vol6,311':
MogileFS::NewHTTPFile: error reading from node for device 337007:
Connection reset by peer at (eval 18) line 1
MogileFS backend error message: unknown_key unknown_key
System error message: MogileFS::NewHTTPFile: error reading from node
for device 337007: Connection reset by peer at (eval 18) line 1

WARNING: Unable to save file 'collect-20080516-vol6,1341':
MogileFS::NewHTTPFile: error writing to node for device 343012:
Connection reset by peer at
/usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi/IO/Handle.pm line 399
MogileFS backend error message: unknown_key unknown_key
System error message: MogileFS::NewHTTPFile: error writing to node for
device 343012: Connection reset by peer at
/usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi/IO/Handle.pm line 399

WARNING: Unable to save file 'collect-20080516-vol6,1736': Close
failed at /usr/bin/mogtool line 816, <Sock_minime336:7001> line 1739.
MogileFS backend error message: unknown_key unknown_key
System error message: Close failed at /usr/bin/mogtool line 816,
<Sock_minime336:7001> line 1739.

WARNING: Unable to save file 'collect-20080516-vol6,2373':
MogileFS::NewHTTPFile: unable to write to any allocated storage node
at /usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi/IO/Handle.pm line 399
MogileFS backend error message: unknown_key unknown_key
System error message: MogileFS::NewHTTPFile: unable to write to any
allocated storage node at
/usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi/IO/Handle.pm line 399

A few times I have also observed mogstored running but not responding  
to the tracker (mogadm check just pauses when it gets to that host);  
in that case, killing and restarting mogstored brings it back.  I  
could probably check for this condition too, but at that point we're  
getting beyond a "simple" wrapper/restart/sentinel script.
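
If I did extend the check to cover that case, it would look something  
like the sketch below: probe the mogstored HTTP port with a short  
timeout and kill/restart the daemon if nothing answers.  I haven't  
actually deployed this, and I'm assuming 7500 is the port mogstored is  
listening on for HTTP on these nodes:

         #!/bin/sh
         # Sketch only: treat mogstored as hung if its HTTP port doesn't
         # answer within 5 seconds, then kill it and start a fresh one.
         # 7500 is assumed to be the mogstored HTTP listen port here.
         if ! curl -s --max-time 5 -o /dev/null http://127.0.0.1:7500/; then
             logger "mogstored unresponsive, restarting"
             pkill -x mogstored
             sleep 2
             mogstored --daemonize
         fi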



Is the experience of mogstored just plain dying a common one, or is it  
pretty rare?  If that were the only thing wrong, I could work around  
it by wrapping mogstored with a shell script that relaunches it as  
soon as it quits, but I'd rather not have to do that... I'd rather get  
at the root of the problem and make it not die in the first place.
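
(For what it's worth, the wrapper I have in mind is nothing fancier  
than a respawn loop along these lines.)

         #!/bin/sh
         # Respawn wrapper: run mogstored in the foreground and relaunch it
         # whenever it exits, pausing briefly so a persistent failure
         # doesn't turn into a tight restart loop.
         while true; do
             mogstored     # no --daemonize, so the script notices the exit
             logger "mogstored exited with status $?, restarting in 5 seconds"
             sleep 5
         done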


A more important question I have is: am I trying to do something with  
MogileFS that it's simply not designed for?  Is anyone else out there  
using MogileFS for really huge files, chunked the way mogtool does it,  
and if so, have you been happy with the results?  If the problems are  
really minor, I could probably fix them myself, but I'm concerned that  
the lack of documentation about MogileFS's internals would hamper  
self-support efforts.


Thanks
gregc


