mogstored dying: redux
dormando
dormando at rydia.net
Wed May 21 18:55:01 UTC 2008
I'm a little amazed you've been having so many problems :( Big files or
not, I've never really seen this happen. It doesn't happen rarely for
me; it doesn't happen at all.
Is your system/OS still acting strangely? Could you be missing big
OS-level errors (aborted writes, etc.) in your logs?
Can you run the mogilefsd tracker with the debug level raised, in
screen, in the foreground, with output redirected to a file? Then mail
it or upload it somewhere so we can take a look?
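A rough sketch of that setup follows; note that how you raise the debug
level varies by install and is an assumption here, so check
`mogilefsd --help` or your mogilefsd.conf for the exact option:

```shell
# Start the tracker in the foreground inside a detached screen session,
# capturing stdout and stderr to a log file that can be mailed around.
# NOTE: raising the debug level is install-specific -- consult
# `mogilefsd --help` or mogilefsd.conf; the config path is a placeholder.
screen -dmS tracker sh -c \
  'mogilefsd --config=/etc/mogilefs/mogilefsd.conf > /tmp/tracker.log 2>&1'

# Watch the log live while reproducing the failure:
tail -f /tmp/tracker.log
```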
I highly doubt it's a code problem. The error messages admittedly do
suck a bit, but I can't imagine why you'd be getting these without
hardware trouble or a really gummed-up OS/Perl installation.
-Dormando
>
> OK, that definitely helps. Lighttpd is back on, and it doesn't look
> like mogstored/lighttpd is dying. Actually, it now looks like
> something is wrong with the trackers.
>
> The various system errors are similar to the ones I've seen before.
> mogtool transfers about 70 chunks successfully and then starts giving
> this error (over and over).
>
>> MogileFS backend error message: unknown_key unknown_key
>> System error message: Close failed at /usr/bin/mogtool line 816,
>> <Sock_minime336:7001> line 78.
>> This was try #1 and it's been 1.06 seconds since we first tried.
>> Retrying...
>
> The failed chunks are tried over and over until I kill the job. Even
> after killing the mogtool job, I couldn't push any file, with mogtool or
> with a simple script similar to Mark's test.
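[Ed.: for reference, a minimal store-and-close test along those lines
looks roughly like this; the domain, class, host, and key are lifted
from the error messages quoted below and should be treated as
placeholders.]

```perl
# Minimal MogileFS store/close sketch. Domain, class, host, and key are
# placeholder values taken from the errors in this thread.
use strict;
use warnings;
use MogileFS::Client;

my $mogc = MogileFS::Client->new(
    domain => 'dbbackups',
    hosts  => [ 'minime336:7001' ],
);

# new_file() performs the create_open against the tracker and returns a
# filehandle pointed at a storage node.
my $fh = $mogc->new_file('test-key', 'dbbackups-recent')
    or die "new_file failed: " . ($mogc->errstr || 'unknown');

print $fh "test payload\n";

# close() is where create_close happens; failures at this step are what
# surface as the "Close failed" / unknown_key errors.
$fh->close
    or die "close failed: " . ($mogc->errstr || 'unknown');

print "stored OK\n";
```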
>
> At this point I don't know if mogtool is at fault, but I don't have
> any other way of getting large files into the system short of rolling
> my own mogtool, which would likely be bad for its own reasons and
> confuse matters further.
>
> I am also seeing a large number of these errors:
>
>> System error message: MogileFS::Backend: tracker socket never became
>> readable (minime336:7001) when sending command: [create_open
>> domain=dbbackups&fid=0&class=dbbackups-recent&multi_dest=1&key=dwh-20080519-vol9,99
>> ] at /usr/lib/perl5/site_perl/5.8.5/MogileFS/Client.pm line 268
>
> Thankfully these seem to be recoverable by mogtool's retries, but they
> are disturbingly frequent. I need to understand what underlying
> condition is causing this error... I know it's entirely possible that
> it's a network/machine/storage/whatever problem, but I'm not sure
> where to look for more information or how to start troubleshooting.
> This concern applies to all the errors I've seen, including:
> > Close failed at /usr/bin/mogtool line 816
> > unable to write to any allocated storage node at
> /usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi/IO/Handle.pm line 399
> > Connection reset by peer
> > tracker socket never became readable
> > socket closed on read at
> /usr/lib/perl5/site_perl/5.8.5/MogileFS/NewHTTPFile.pm line 335
> > couldn't connect to mogilefsd backend at
> /usr/lib/perl5/site_perl/5.8.5/MogileFS/Client.pm line 268
>
> If they happen rarely, and the tool I'm using can recover, it's not a
> huge concern, but if they happen frequently enough to raise eyebrows
> (like 1 transaction in 10) or if they cause an endless loop where we
> can't recover, then it's a showstopper.
>
> The general trend here seems to be that errors happen, and MogileFS
> just aborts the current transaction with an unhelpful "die"-style
> message. I'm reasonably good with Perl, but I'm not familiar enough
> with this code base to dive into all the errors listed above. There
> could be any number of underlying causes, but because error handling
> is not great, I'm forced to read Perl code and guess at the root cause
> instead of going straight to the underlying problem.
>
> Is there some log I should be looking at for more info, or some
> debugging flag I need to turn on?
>
> Does anyone else have a winning strategy for dealing with very-large
> files other than mogtool --bigfile?
>
> Thanks again for the help and patience.