mogstored dying: redux

Greg Connor gconnor at nekodojo.org
Wed May 21 17:41:56 UTC 2008


On May 21, 2008, at 5:19 AM, Ask Bjørn Hansen wrote:

>
> On May 21, 2008, at 3:17, Greg Connor wrote:
>
>> Thanks Mark.  The test script worked fine.  The 403 errors were  
>> only occurring with "lighttpd" used in place of perlbal.  This was  
>> a suggestion (Ask's) which seemed like a good thing to try, but  
>> lighttpd actually made things worse.  With lighttpd, about 1 in 5  
>> requests failed to store, or failed to close.
>
> Oh, I'm sorry.  I realize now that the "make lighttpd work" patch  
> was never committed, darn.  Try the patch below.
>

OK, that definitely helps.  Lighttpd is back on, and it doesn't look
like mogstored/lighttpd is dying anymore.  Now it looks like something
is wrong with the trackers instead.

The various system errors are similar to the ones I've seen before.
mogtool transfers about 70 chunks successfully and then starts giving
this error, over and over:

> MogileFS backend error message: unknown_key unknown_key
> System error message: Close failed at /usr/bin/mogtool line 816, <Sock_minime336:7001> line 78.
> This was try #1 and it's been 1.06 seconds since we first tried.  Retrying...

The failed chunks are retried over and over until I kill the job.  Even
after killing the mogtool job, I couldn't push any file at all, either
with mogtool or with a simple script similar to Mark's test.
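
For reference, the kind of simple test I mean is roughly this (a sketch
only; the domain, class and tracker host are taken from the errors quoted
in this mail, and the key name is just a placeholder):

use MogileFS::Client;

# Minimal store-and-fetch test.  Domain/class/tracker come from the
# errors quoted elsewhere in this mail; the key name is made up.
my $mogc = MogileFS::Client->new(
    domain => 'dbbackups',
    hosts  => [ 'minime336:7001' ],
);

my $fh = $mogc->new_file('test-small-key', 'dbbackups-recent')
    or die "new_file failed: " . ($mogc->errstr || 'unknown error');
print {$fh} "hello mogile\n" or die "write failed";
$fh->close or die "close failed: " . ($mogc->errstr || 'unknown error');

my @paths = $mogc->get_paths('test-small-key');
print "stored at: @paths\n";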

At this point I don't know whether mogtool is at fault, but I don't
actually have any other way of getting large files into the system,
short of rolling my own mogtool, which would likely have problems of
its own and confuse matters further.
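
For what it's worth, the homegrown version would presumably be little more
than splitting the file and storing each piece under a numbered key, along
these lines (a rough sketch only; the "name,N" key pattern is just a guess
based on the create_open error further down, and real mogtool also records
chunk counts and digests that this skips entirely):

use MogileFS::Client;

# Hypothetical chunked upload: split a big file into fixed-size pieces
# and store each piece under "NAME,N".  The key pattern is guessed from
# the create_open error quoted below; no chunk count or checksum record
# is written, unlike real mogtool.
my $mogc = MogileFS::Client->new(
    domain => 'dbbackups',
    hosts  => [ 'minime336:7001' ],
);

my ($file, $name) = ('/backups/dwh-20080519-vol9.dump', 'dwh-20080519-vol9');  # placeholders
my $chunk_size = 64 * 1024 * 1024;   # 64MB per piece

open my $in, '<', $file or die "open $file: $!";
binmode $in;

my ($n, $buf) = (0, '');
while (read($in, $buf, $chunk_size)) {
    $n++;
    my $fh = $mogc->new_file("$name,$n", 'dbbackups-recent')
        or die "new_file chunk $n: " . ($mogc->errstr || 'unknown error');
    print {$fh} $buf or die "write chunk $n failed";
    $fh->close       or die "close chunk $n: " . ($mogc->errstr || 'unknown error');
}
close $in;
print "stored $n chunk(s) for $name\n";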

I am also seeing a large number of these errors:

> System error message: MogileFS::Backend: tracker socket never became readable (minime336:7001) when sending command: [create_open domain=dbbackups&fid=0&class=dbbackups-recent&multi_dest=1&key=dwh-20080519-vol9,99 ] at /usr/lib/perl5/site_perl/5.8.5/MogileFS/Client.pm line 268

Thankfully these seem to be recoverable when mogtool retries, but they are 
disturbingly frequent.  I need to understand what underlying condition is 
causing this error... I know it's entirely possible that it's a 
network/machine/storage/whatever problem, but I'm not sure where to look 
for more information or how to start troubleshooting (a first connectivity 
check is sketched after the list below).  I guess this concern applies to 
all the errors I've seen, including:
 > Close failed at /usr/bin/mogtool line 816
 > unable to write to any allocated storage node at /usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi/IO/Handle.pm line 399
 > Connection reset by peer
 > tracker socket never became readable
 > socket closed on read at /usr/lib/perl5/site_perl/5.8.5/MogileFS/NewHTTPFile.pm line 335
 > couldn't connect to mogilefsd backend at /usr/lib/perl5/site_perl/5.8.5/MogileFS/Client.pm line 268
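
For the "tracker socket never became readable" case, the only first step I
can think of is checking whether the tracker will even accept a connection
and answer anything within a few seconds, something like this rough sketch
(host/port from the errors above; "noop" is just a harmless line to send,
what matters is whether any reply comes back at all):

use IO::Socket::INET;

# Crude sanity check: can we open a TCP connection to the tracker, and
# does anything come back within a few seconds?  Host/port taken from
# the errors above; "noop" is only a harmless line to provoke a reply.
my $tracker = 'minime336:7001';

my $sock = IO::Socket::INET->new(PeerAddr => $tracker, Timeout => 3)
    or die "connect to $tracker failed: $!";

$sock->print("noop\r\n");

my $reply = eval {
    local $SIG{ALRM} = sub { die "timed out waiting for a reply\n" };
    alarm 3;
    my $line = $sock->getline;
    alarm 0;
    $line;
};
print defined $reply ? "tracker answered: $reply" : "connected, but no reply: $@";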

If they happen rarely and the tool I'm using can recover, it's not a 
huge concern; but if they happen frequently enough to raise eyebrows 
(say, 1 transaction in 10), or if they cause an endless loop we can't 
recover from, then it's a showstopper.

The general trend here seems to be that errors happen, and MogileFS just 
plain aborts the current transaction with an unhelpful "die"-style 
message.  I'm reasonably good with Perl, but I'm not familiar enough with 
this code base to dive into all of the errors listed above and get down 
to the root cause.  There could be any number of underlying causes, but 
because the error handling isn't great, I'm forced to read Perl code and 
guess at the root cause instead of going straight to it.

Is there some log I should be looking at for more info, or some 
debugging flag I need to turn on?

Does anyone else have a winning strategy for dealing with very-large 
files other than mogtool --bigfile?

Thanks again for the help and patience.

