mogstored dying: redux
dormando
dormando at rydia.net
Wed May 21 18:55:01 UTC 2008
I'm a little amazed you've been having so many problems :( Big files or
not, I've never really seen this happen. It doesn't happen rarely for
me; it doesn't happen at all.
Is your system/OS still acting strangely? Could you be missing big
OS-level errors (aborted writes, etc.) in your logs?
Can you run the mogilefsd tracker with the debug level raised, in
screen, in the foreground, with output redirected to a file? Then mail
it or upload it somewhere so we can take a look?
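A rough sketch of that setup follows; note that how you raise the debug
level varies by install and is an assumption here, so check
`mogilefsd --help` or your mogilefsd.conf for the exact option:

```shell
# Start the tracker in the foreground inside a detached screen session,
# capturing stdout and stderr to a log file that can be mailed around.
# NOTE: raising the debug level is install-specific -- consult
# `mogilefsd --help` or mogilefsd.conf; the config path is a placeholder.
screen -dmS tracker sh -c \
  'mogilefsd --config=/etc/mogilefs/mogilefsd.conf > /tmp/tracker.log 2>&1'

# Watch the log live while reproducing the failure:
tail -f /tmp/tracker.log
```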
I highly doubt it's a code problem. The error messages admittedly do
suck a bit, but I can't imagine why you'd be getting these without
hardware trouble or a really gummed-up OS/Perl installation.
-Dormando
>
> OK, that definitely helps. Lighttpd is back on, and it doesn't look
> like mogstored/lighttpd is dying. Actually, it now looks like
> something is wrong with the trackers.
>
> The various system errors are similar to the ones I've seen before.
> mogtool transfers about 70 chunks successfully and then starts giving
> this error (over and over).
>
>> MogileFS backend error message: unknown_key unknown_key
>> System error message: Close failed at /usr/bin/mogtool line 816,
>> <Sock_minime336:7001> line 78.
>> This was try #1 and it's been 1.06 seconds since we first tried.
>> Retrying...
>
> The failed chunks are tried over and over until I kill the job. Even
> after killing the mogtool job, I couldn't push any file, with mogtool or
> with a simple script similar to Mark's test.
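[Ed.: for reference, a minimal store-and-close test along those lines
looks roughly like this; the domain, class, host, and key are lifted
from the error messages quoted below and should be treated as
placeholders.]

```perl
# Minimal MogileFS store/close sketch. Domain, class, host, and key are
# placeholder values taken from the errors in this thread.
use strict;
use warnings;
use MogileFS::Client;

my $mogc = MogileFS::Client->new(
    domain => 'dbbackups',
    hosts  => [ 'minime336:7001' ],
);

# new_file() performs the create_open against the tracker and returns a
# filehandle pointed at a storage node.
my $fh = $mogc->new_file('test-key', 'dbbackups-recent')
    or die "new_file failed: " . ($mogc->errstr || 'unknown');

print $fh "test payload\n";

# close() is where create_close happens; failures at this step are what
# surface as the "Close failed" / unknown_key errors.
$fh->close
    or die "close failed: " . ($mogc->errstr || 'unknown');

print "stored OK\n";
```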
>
> At this point I don't know if mogtool is at fault, but I don't have
> any other way of getting large files into the system short of rolling
> my own mogtool, which would likely be bad for its own reasons and
> confuse matters further.
>
> I am also seeing a large number of these errors:
>
>> System error message: MogileFS::Backend: tracker socket never became
>> readable (minime336:7001) when sending command: [create_open
>> domain=dbbackups&fid=0&class=dbbackups-recent&multi_dest=1&key=dwh-20080519-vol9,99
>> ] at /usr/lib/perl5/site_perl/5.8.5/MogileFS/Client.pm line 268
>
> Thankfully these seem to be recoverable by mogtool's retries, but they
> are disturbingly frequent. I need to understand what underlying
> condition is causing this error... I know it's entirely possible that
> it's a network/machine/storage/whatever problem, but I'm not sure
> where to look for more information or how to start troubleshooting.
> This concern applies to all the errors I've seen, including:
> > Close failed at /usr/bin/mogtool line 816
> > unable to write to any allocated storage node at
> /usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi/IO/Handle.pm line 399
> > Connection reset by peer
> > tracker socket never became readable
> > socket closed on read at
> /usr/lib/perl5/site_perl/5.8.5/MogileFS/NewHTTPFile.pm line 335
> > couldn't connect to mogilefsd backend at
> /usr/lib/perl5/site_perl/5.8.5/MogileFS/Client.pm line 268
>
> If they happen rarely, and the tool I'm using can recover, it's not a
> huge concern, but if they happen frequently enough to raise eyebrows
> (like 1 transaction in 10) or if they cause an endless loop where we
> can't recover, then it's a showstopper.
>
> The general trend here seems to be that errors happen, and MogileFS
> just aborts the current transaction with an unhelpful "die"-style
> message. I'm reasonably good with Perl, but I'm not familiar enough
> with this code base to dive into all the errors listed above. There
> could be any number of underlying causes, but because error handling
> is not great, I'm forced to read Perl code and guess at the root cause
> instead of going straight to the underlying problem.
>
> Is there some log I should be looking at for more info, or some
> debugging flag I need to turn on?
>
> Does anyone else have a winning strategy for dealing with very-large
> files other than mogtool --bigfile?
>
> Thanks again for the help and patience.