Tracker error "size_verify_error" on create_close

Adam Rosien adam at rosien.net
Wed May 2 17:37:56 UTC 2007


The problem was mine; sometimes not all data was written to the
storage node and the tracker correctly identified the problem.  Thanks
for your help in exposing my own bug :).
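
For anyone hitting the same symptom, here is a minimal sketch of guarding against a short write on the client side before calling create_close. It assumes the client can count the bytes it actually handed to the HTTP layer; UploadResult, uploadLooksComplete and storeWithRetry are hypothetical names for illustration, not part of MogileFS or libcurl. Since the PUT in this thread returned 200 OK even when data was missing, the status code alone is not treated as proof of a complete upload.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical result of one create_open + PUT attempt (not a MogileFS API).
struct UploadResult {
    long httpStatus;        // status the storage node returned for the PUT
    std::size_t bytesSent;  // bytes the client actually handed to the HTTP layer
};

// True only if the PUT succeeded and every expected byte was sent.
bool uploadLooksComplete(const UploadResult& r, std::size_t expectedSize) {
    const bool statusOk = r.httpStatus >= 200 && r.httpStatus < 300;
    return statusOk && r.bytesSent == expectedSize;
}

// Runs `attempt` (assumed to do create_open + PUT) and, when the upload looks
// complete, `createClose` (assumed to return true when the tracker says OK).
// The whole sequence is retried up to maxAttempts times.
bool storeWithRetry(const std::function<UploadResult()>& attempt,
                    const std::function<bool(std::size_t)>& createClose,
                    std::size_t expectedSize, int maxAttempts) {
    assert(maxAttempts > 0);
    for (int i = 0; i < maxAttempts; ++i) {
        const UploadResult r = attempt();
        if (!uploadLooksComplete(r, expectedSize)) continue;  // short write: start over
        if (createClose(expectedSize)) return true;           // tracker accepted the size
    }
    return false;
}
```

The retry restarts from create_open rather than repeating only create_close, on the assumption that a failed attempt should not reuse the path from the earlier create_open.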

.. Adam
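
The size_verify_error details in the quoted thread below are URL-encoded ('+' for a space, %XX for other bytes), which makes them hard to read in a log. A small sketch of decoding them, assuming that form-style encoding is all that needs undoing; decodeTrackerText is a hypothetical helper name, not something from the MogileFS client libraries.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Value of a single hex digit; the caller guarantees c is a hex digit.
static int hexValue(char c) {
    assert(std::isxdigit(static_cast<unsigned char>(c)));
    if (c >= '0' && c <= '9') return c - '0';
    return std::tolower(static_cast<unsigned char>(c)) - 'a' + 10;
}

// Turns '+' into a space and %XX into the byte it encodes.
// A malformed escape (for example a trailing '%') is copied through unchanged.
std::string decodeTrackerText(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char c = in[i];
        if (c == '+') {
            out += ' ';
        } else if (c == '%' && i + 2 < in.size() &&
                   std::isxdigit(static_cast<unsigned char>(in[i + 1])) &&
                   std::isxdigit(static_cast<unsigned char>(in[i + 2]))) {
            out += static_cast<char>(hexValue(in[i + 1]) * 16 + hexValue(in[i + 2]));
            i += 2;  // skip the two hex digits just consumed
        } else {
            out += c;
        }
    }
    return out;
}
```

Applied to the start of one of the messages below, "Expected:+4%3B+actual:+0+%28missing%29" becomes "Expected: 4; actual: 0 (missing)".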

On 4/19/07, Brad Fitzpatrick <brad at danga.com> wrote:
> More weirdness.
>
> Can you run tcpdump as root on the mogilefsd machine and get me all the
> relevant traffic?
>
> Something like...
>
> # tcpdump -w capture.pcap -s 0 -i eth0 "port 7001 or port 7501"
>
> Then once you see it go all busted, send me the pcap (privately is fine,
> if you prefer) and I'll walk through it?
>
> Otherwise (or in addition), I'll try to reproduce later.
>
> - Brad
>
>
> On Thu, 19 Apr 2007, Adam Rosien wrote:
>
> > Interesting progress.  Your change locked up the tracker because
> > $first has some low bytes in it.  I changed it to write (hex $first)
> > which returns '12', which seems like bogus data from the socket.
> >
> > I'm running a series of unit tests and have also tried seeing if
> > removing some of the tests had any particular effect.  One of my tests
> > did this sequence:
> >
> > 1. create_open -> OK
> > 2. create_close (specifying size 0) -> expected ERR from tracker
> > 3. create_close (proper size specified) -> intermittent size_verify_error
> >
> > If I take out step 2, I no longer see the size_verify_error with "HEAD
> > response to get_file_size looks bogus".  I do, however, very rarely
> > get a different size_verify_error:
> >
> > ERR size_verify_error
> > Expected:+4%3B+actual:+0+%28missing%29%3B+path:+http://10.3.1.126:17500/dev2/0/000/001/0000001000.fid%3B+error:+Job+queryworker+has+only+0,+wants+5,+making+5.
> >
> > I think I can reasonably say that the non unit test code won't be
> > doing the 1-2-3 sequence, but the intermittent error is odd in any
> > case.  The newer "queryworker has only 0, wants 5, making 5" seems
> > suitably rare.
> >
> > In any case I can detect when a create_open/PUT/create_close sequence
> > fails and try again.
> >
> > Shall I do any other tests?
> >
> > .. Adam
> >
> > On 4/19/07, Brad Fitzpatrick <brad at danga.com> wrote:
> > > Weird.
> > >
> > > But we're getting closer...
> > >
> > > Change this line:
> > >
> > >    return undeferr("HEAD response to get_file_size looks bogus");
> > >
> > > to:
> > >
> > >    return undeferr("HEAD response to get_file_size looks bogus: [$first]");
> > >
> > > And let me know what it says?
> > >
> > > - Brad
> > >
> > >
> > > On Thu, 19 Apr 2007, Adam Rosien wrote:
> > >
> > > > The message after upgrading to trunk is now:
> > > >
> > > > ERR size_verify_error
> > > > Expected:+4%3B+actual:+0+%28cantreach%29%3B+path:+http://10.3.1.104:17500/dev1/0/000/000/0000000484.fid%3B+error:+HEAD+response+to+get_file_size+looks+bogus
> > > >
> > > > > If I do a HEAD request with curl to the path in the error response, the
> > > > > response is "200 OK", so one theory is that there is some kind
> > > > > of timing issue, unless you know more about the meaning behind the
> > > > > above message.
> > > >
> > > > .. Adam
> > > >
> > > > On 4/19/07, Adam Rosien <adam at rosien.net> wrote:
> > > > > mogstored.  I'll update and get you the new message.
> > > > >
> > > > > .. Adam
> > > > >
> > > > > On 4/19/07, Brad Fitzpatrick <brad at danga.com> wrote:
> > > > > > Adam,
> > > > > >
> > > > > > > Current trunk should be safe to run... nothing scary.  All big changes are
> > > > > > > in Fsck.pm, and the rest is just cleanups & docs.
> > > > > > >
> > > > > > > But the thing you want is the part where I (today?) improved this exact
> > > > > > > error message so that, instead of just the "HEAD request wasn't 200 OK"
> > > > > > > you're seeing, it shows you exactly what the remote server said during
> > > > > > > the size check.
> > > > > >
> > > > > > Which storage node webserver are you using, btw?  mogstored, apache, lighttpd?
> > > > > >
> > > > > > - Brad
> > > > > >
> > > > > > On Thu, 19 Apr 2007, Adam Rosien wrote:
> > > > > >
> > > > > > > ERR size_verify_error
> > > > > > > Expected:+4%3B+actual:+0+%28error%29%3B+path:+http://10.3.1.104:17500/dev1/0/000/000/0000000372.fid%3B+error:+get_file_size%28%29%27s+HEAD+request+wasn%27t+a+200+OK
> > > > > > >
> > > > > > > (I'm writing a 4 byte file, and got a 200 OK from the PUT to the storage node)
> > > > > > >
> > > > > > > I'm running the mogile code from svn trunk, as of Mar 15, Perlbal
> > > > > > > 1.54.  I see there have been updates in trunk since then, but don't
> > > > > > > have them yet.
> > > > > > >
> > > > > > > .. Adam
> > > > > > >
> > > > > > > On 4/19/07, Brad Fitzpatrick <brad at danga.com> wrote:
> > > > > > > > What's the full response line?  size_verify_error should be returning
> > > > > > > > extra details about why it failed.
> > > > > > > >
> > > > > > > > And what version?
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, 19 Apr 2007, Adam Rosien wrote:
> > > > > > > >
> > > > > > > > > Intermittently I get a "size_verify_error" error code from the tracker
> > > > > > > > > when calling create_close after first calling create_open and then
> > > > > > > > > PUTing the file to the storage node.  Is there a possible latency
> > > > > > > > > between completing the PUT to the storage node and when the tracker
> > > > > > > > > confirms the bytes written with the storage node, so that the
> > > > > > > > > create_close returns this error?
> > > > > > > > >
> > > > > > > > > I am using my own C++ code for the tracker protocol and libcurl for HTTP.
> > > > > > > > >
> > > > > > > > > .. Adam
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>