Tracker error "size_verify_error" on create_close
Brad Fitzpatrick
brad at danga.com
Fri Apr 20 01:02:51 UTC 2007
More weirdness.
Can you run tcpdump as root on the mogilefsd machine and get me all the
relevant traffic?
Something like...
# tcpdump -w capture.pcap -s 0 -i eth0 "port 7001 or port 7501"
Then once you see it go all busted, send me the pcap (privately is fine,
if you prefer) and I'll walk through it.
Otherwise (or in addition), I'll try to reproduce later.
- Brad
On Thu, 19 Apr 2007, Adam Rosien wrote:
> Interesting progress. Your change locked up the tracker because
> $first has some low bytes in it. I changed it to write (hex $first),
> which returns '12', so it looks like bogus data is coming off the socket.
>
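Writing raw control bytes into a one-line text reply is fragile. One way to see exactly what arrived on the socket is to escape every non-printable byte before putting it into the debug string. This is only a Python sketch of the idea (the tracker itself is Perl), and `printable_dump` is a hypothetical helper name:

```python
def printable_dump(raw: bytes) -> str:
    """Render bytes safely for a single log or reply line.

    Printable ASCII is kept as-is; every other byte (CR, LF, other low
    control bytes, high bytes) and the backslash itself become \\xNN, so the
    output is unambiguous and never contains a line break."""
    out = []
    for b in raw:
        if 0x20 <= b < 0x7F and b != 0x5C:  # printable ASCII except backslash
            out.append(chr(b))
        else:
            out.append(f"\\x{b:02x}")
    return "".join(out)


if __name__ == "__main__":
    # A status line preceded by a stray 0x12 byte, as a hypothetical example.
    print(printable_dump(b"\x12HTTP/1.0 200 OK\r\n"))
```

Escaping the backslash too means a literal `\x12` in the data can't be confused with an escaped 0x12 byte.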
> I'm running a series of unit tests and have also tried removing some
> of them to see whether that has any particular effect. One of my tests
> did this sequence:
>
> 1. create_open -> OK
> 2. create_close (specifying size 0) -> expected ERR from tracker
> 3. create_close (proper size specified) -> intermittent size_verify_error
>
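The 1-2-3 sequence, written out as raw tracker request lines, might look roughly like the sketch below. It assumes the tracker's line format is `<command> <url-encoded key=value args>\r\n`; the argument names (domain, key, fid, devid, path, size) are my assumption and should be checked against the tracker source, and `tracker_command` / `repro_sequence` are hypothetical helpers. In a real client, fid, devid and path come from the create_open reply.

```python
from urllib.parse import urlencode


def tracker_command(cmd: str, **args) -> bytes:
    """Build one tracker request line: '<cmd> <url-encoded args>\\r\\n'."""
    return f"{cmd} {urlencode(args)}\r\n".encode("ascii")


def repro_sequence(domain: str, key: str, fid: int, devid: int,
                   path: str, size: int) -> list:
    """The three requests from the test above, in order.

    fid/devid/path are taken as parameters here; a real client would read
    them from the OK reply to create_open."""
    return [
        tracker_command("create_open", domain=domain, key=key),
        # Step 2: deliberately wrong size, the tracker should answer ERR.
        tracker_command("create_close", domain=domain, key=key, fid=fid,
                        devid=devid, path=path, size=0),
        # Step 3: correct size, the call that intermittently failed.
        tracker_command("create_close", domain=domain, key=key, fid=fid,
                        devid=devid, path=path, size=size),
    ]
```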
> If I take out step 2, I no longer see the size_verify_error "HEAD
> response to get_file_size looks bogus". I do, however, very rarely
> get a different size_verify_error:
>
> ERR size_verify_error
> Expected:+4%3B+actual:+0+%28missing%29%3B+path:+http://10.3.1.126:17500/dev2/0/000/001/0000001000.fid%3B+error:+Job+queryworker+has+only+0,+wants+5,+making+5.
>
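The text after the error code is URL-encoded ("+" for spaces, "%3B" for ";"), which makes it hard to read in the archive, where the reply is also wrapped onto two lines. Assuming it arrives as a single line of the form `ERR <code> <url-encoded message>` (and `OK <args>` on success), a minimal Python sketch of a decoder; `parse_tracker_reply` is a hypothetical name:

```python
from urllib.parse import unquote_plus


def parse_tracker_reply(line: str):
    """Split one tracker reply line into (status, code, text).

    OK replies return ("OK", "", raw_args); ERR replies return
    ("ERR", code, decoded_message). Anything else raises ValueError."""
    line = line.rstrip("\r\n")
    if line == "OK" or line.startswith("OK "):
        return "OK", "", line[3:]
    if line.startswith("ERR "):
        code, _, message = line[4:].partition(" ")
        return "ERR", code, unquote_plus(message)
    raise ValueError(f"unrecognized tracker reply: {line!r}")
```

Decoded, the reply above reads "Expected: 4; actual: 0 (missing); path: ...; error: Job queryworker has only 0, wants 5, making 5."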
> I think I can reasonably say that the non-unit-test code won't be
> doing the 1-2-3 sequence, but the intermittent error is odd in any
> case. The newer "queryworker has only 0, wants 5, making 5" error
> seems suitably rare.
>
> In any case I can detect when a create_open/PUT/create_close sequence
> fails and try again.
>
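A minimal Python sketch of that retry, assuming the client raises a hypothetical `TrackerError` carrying the tracker's error code. Each attempt reruns the whole create_open/PUT/create_close sequence rather than just create_close, so every try gets a fresh fid; only codes listed in `retry_codes` are retried, and `store_with_retry` is a hypothetical name:

```python
import time
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


class TrackerError(Exception):
    """Hypothetical exception a client raises when the tracker replies ERR."""

    def __init__(self, code: str, message: str = "") -> None:
        super().__init__(f"{code}: {message}")
        self.code = code


def store_with_retry(attempt: Callable[[], T], retries: int = 3,
                     delay: float = 0.5,
                     retry_codes: Iterable[str] = ("size_verify_error",)) -> T:
    """Call attempt() up to `retries` times.

    attempt must run the full create_open/PUT/create_close sequence.
    Errors whose code is not in retry_codes propagate immediately; the
    pause between tries grows linearly (delay, 2*delay, ...)."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    codes = set(retry_codes)
    for n in range(1, retries + 1):
        try:
            return attempt()
        except TrackerError as err:
            if err.code not in codes or n == retries:
                raise
            time.sleep(delay * n)
    raise AssertionError("unreachable")
```

Retrying only known-transient codes keeps real failures (a bad domain, say) from being masked by the loop.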
> Shall I do any other tests?
>
> .. Adam
>
> On 4/19/07, Brad Fitzpatrick <brad at danga.com> wrote:
> > Weird.
> >
> > But we're getting closer...
> >
> > Change this line:
> >
> > return undeferr("HEAD response to get_file_size looks bogus");
> >
> > to:
> >
> > return undeferr("HEAD response to get_file_size looks bogus: [$first]");
> >
> > And let me know what it says?
> >
> > - Brad
> >
> >
> > On Thu, 19 Apr 2007, Adam Rosien wrote:
> >
> > > The message after upgrading to trunk is now:
> > >
> > > ERR size_verify_error
> > > Expected:+4%3B+actual:+0+%28cantreach%29%3B+path:+http://10.3.1.104:17500/dev1/0/000/000/0000000484.fid%3B+error:+HEAD+response+to+get_file_size+looks+bogus
> > >
> > > If I do a HEAD request with curl to the path in the error response,
> > > the response is "200 OK", so one theory is that there is some kind
> > > of timing issue, unless you know more about what the above message
> > > means.
> > >
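One plausible reading of "looks bogus" (not confirmed in this thread) is that the size check read something other than an `HTTP/x.y NNN` status line at the start of the reply, even though a later curl request sees a clean "200 OK". A Python sketch of that kind of check, assuming the checker reads the raw HEAD response and takes the size from Content-Length; this is an illustration, not the tracker's actual get_file_size code, and `parse_head_response` is a hypothetical name:

```python
import re
from typing import Optional, Tuple

# Status line per HTTP/1.x: "HTTP/<major>.<minor> <3-digit code> <reason>"
STATUS_LINE = re.compile(rb"^HTTP/\d+\.\d+ +(\d{3})\b")


def parse_head_response(raw: bytes) -> Tuple[int, Optional[int]]:
    """Return (status, content_length) from a raw HEAD response.

    Raises ValueError when the first line is not an HTTP status line.
    content_length is None if no Content-Length header is present."""
    head, _, _ = raw.partition(b"\r\n\r\n")
    lines = head.split(b"\r\n")
    m = STATUS_LINE.match(lines[0])
    if not m:
        raise ValueError(f"bogus status line: {lines[0]!r}")
    length = None
    for line in lines[1:]:
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"content-length":
            length = int(value.strip())
    return int(m.group(1)), length
```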
> > > .. Adam
> > >
> > > On 4/19/07, Adam Rosien <adam at rosien.net> wrote:
> > > > mogstored. I'll update and get you the new message.
> > > >
> > > > .. Adam
> > > >
> > > > On 4/19/07, Brad Fitzpatrick <brad at danga.com> wrote:
> > > > > Adam,
> > > > >
> > > > > Current trunk should be safe to run... nothing scary. All big changes are
> > > > > in Fsck.pm, and the rest is just cleanups & docs.
> > > > >
> > > > > But the thing you want is the part where I (today?) improved this exact
> > > > > error message: instead of just the "HEAD request wasn't 200 OK" that
> > > > > you're seeing, it now shows you exactly what the remote server said
> > > > > during the size check.
> > > > >
> > > > > Which storage node webserver are you using, btw? mogstored, apache, lighttpd?
> > > > >
> > > > > - Brad
> > > > >
> > > > > On Thu, 19 Apr 2007, Adam Rosien wrote:
> > > > >
> > > > > > ERR size_verify_error
> > > > > > Expected:+4%3B+actual:+0+%28error%29%3B+path:+http://10.3.1.104:17500/dev1/0/000/000/0000000372.fid%3B+error:+get_file_size%28%29%27s+HEAD+request+wasn%27t+a+200+OK
> > > > > >
> > > > > > (I'm writing a 4 byte file, and got a 200 OK from the PUT to the storage node)
> > > > > >
> > > > > > I'm running the mogile code from svn trunk as of Mar 15, with Perlbal
> > > > > > 1.54. I see there have been updates in trunk since then, but I don't
> > > > > > have them yet.
> > > > > >
> > > > > > .. Adam
> > > > > >
> > > > > > On 4/19/07, Brad Fitzpatrick <brad at danga.com> wrote:
> > > > > > > What's the full response line? size_verify_error should be returning
> > > > > > > extra details about why it failed.
> > > > > > >
> > > > > > > And what version?
> > > > > > >
> > > > > > >
> > > > > > > On Thu, 19 Apr 2007, Adam Rosien wrote:
> > > > > > >
> > > > > > > > Intermittently I get a "size_verify_error" error code from the tracker
> > > > > > > > when calling create_close after first calling create_open and then
> > > > > > > > PUTing the file to the storage node. Is there a possible latency
> > > > > > > > between completing the PUT to the storage node and when the tracker
> > > > > > > > confirms the bytes written with the storage node, so that the
> > > > > > > > create_close returns this error?
> > > > > > > >
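If latency between the PUT and the tracker's size check is the suspect, one client-side experiment is to confirm the size yourself before calling create_close, polling for a bounded time. A Python sketch, assuming a hypothetical `get_size` callable (for example a HEAD request that returns Content-Length, or None on failure); `wait_for_size` is also a hypothetical name:

```python
import time
from typing import Callable, Optional


def wait_for_size(get_size: Callable[[], Optional[int]], expected: int,
                  timeout: float = 2.0, interval: float = 0.1) -> bool:
    """Poll get_size() until it reports `expected` bytes or `timeout` elapses.

    Returns True as soon as the sizes match, False if the deadline passes
    first. Uses a monotonic clock so wall-clock jumps don't affect it."""
    deadline = time.monotonic() + timeout
    while True:
        if get_size() == expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

If this reliably returns True right after the PUT while create_close still fails, that points away from storage-side latency and toward how the tracker reads the HEAD reply.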
> > > > > > > > I am using my own C++ code for the tracker protocol and libcurl for HTTP.
> > > > > > > >
> > > > > > > > .. Adam
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>