fsck hangs when a bad fid is encountered

Sat Sep 8 07:10:29 UTC 2007

13 devices (10 readonly, 3 alive).  The 10 are readonly
because they're almost full; we're waiting on some new
drives to arrive so the devices can be rebalanced.

4 trackers (1 fsck worker on each one).

Didn't see any specific errors, but I'm not sure I was
looking in all the right places.  I manually deleted
fids 1 and 2 from the file table and did a fsck reset
after stopping fsck each time.  The output below is
from running fsck for the third time.  I waited around
30 minutes after first seeing the bad fid 678419 before
stopping fsck.

How long should I wait for SRCH to become GONE?  Also,
how do I go about finding how these fids became
orphaned?  Any way to prevent this in the future?

Thanks.

========================================

Output from "mogadm fsck status" (it's not currently
running)

    Running: No
     Status: 669593 / 41601129 (1.61%)
       Time: 37m (297 fids/s; 2294m remain)
 Check Type: Normal (check policy + files)

 [num_NOPA]: 331
 [num_SRCH]: 331

========================================

mysql> select fid, evcode, count(*) from fsck_log
group by fid, evcode;
+--------+--------+----------+
| fid    | evcode | count(*) |
+--------+--------+----------+
|      1 | NOPA   |      119 | 
|      1 | SRCH   |      119 | 
|      2 | NOPA   |      147 | 
|      2 | SRCH   |      147 | 
| 678419 | NOPA   |      331 | 
| 678419 | SRCH   |      331 | 
+--------+--------+----------+
6 rows in set (0.00 sec)
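
For reference, the "bad fid" condition described in this thread (a
row in the file table with no corresponding row in file_on) can be
expressed as a LEFT JOIN.  The following is only a sketch: it runs
against an in-memory SQLite database with toy rows so it is
self-contained, it assumes both tables are keyed on a column named
fid, and the helper names are made up for illustration.  The real
tables have more columns, and the query has not been run against a
live MogileFS database.

```python
import sqlite3

# Assumption: both tables have a column named "fid". Only the columns
# needed for the join are created here; the real schema has more.
ORPHAN_QUERY = """
    SELECT f.fid
    FROM file AS f
    LEFT JOIN file_on AS fo ON fo.fid = f.fid
    WHERE fo.fid IS NULL
    ORDER BY f.fid
"""


def find_orphan_fids(conn):
    """Return fids that exist in file but have no row in file_on."""
    return [row[0] for row in conn.execute(ORPHAN_QUERY)]


def demo():
    """Build a toy database and list its orphaned fids."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE file (fid INTEGER PRIMARY KEY);
        CREATE TABLE file_on (fid INTEGER, devid INTEGER);
        -- Toy data: fid 3 has two replicas, the others have none.
        INSERT INTO file (fid) VALUES (1), (2), (3), (678419);
        INSERT INTO file_on (fid, devid) VALUES (3, 1), (3, 2);
    """)
    try:
        return find_orphan_fids(conn)
    finally:
        conn.close()


if __name__ == "__main__":
    print(demo())
```

On a table with tens of millions of rows a join like this could be
slow, so it would probably be wise to restrict it to a fid range.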

========================================

 mogadm fsck taillog
unixtime             event           fid      devid
1189153502            NOPA        678419          -
1189153502            SRCH        678419          -
1189153508            NOPA        678419          -
1189153508            SRCH        678419          -
1189153513            NOPA        678419          -
1189153513            SRCH        678419          -
1189153519            NOPA        678419          -
1189153519            SRCH        678419          -
1189153524            NOPA        678419          -
1189153524            SRCH        678419          -
1189153530            NOPA        678419          -
1189153530            SRCH        678419          -
1189153535            NOPA        678419          -
1189153535            SRCH        678419          -
1189153541            NOPA        678419          -
1189153541            SRCH        678419          -
1189153546            NOPA        678419          -
1189153546            SRCH        678419          -
1189153551            NOPA        678419          -
1189153551            SRCH        678419          -

========================================

mysql> select min(utime), max(utime) from fsck_log
where fid = 678419;
+------------+------------+
| min(utime) | max(utime) |
+------------+------------+
| 1189151747 | 1189153551 | 
+------------+------------+
1 row in set (0.00 sec)
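
As a sanity check on the numbers above, the two timestamps and the
SRCH count for fid 678419 give the total time spent on that fid and
the average gap between retries.  This is plain arithmetic on the
values already shown; the function name is made up for illustration.

```python
def retry_stats(first_utime, last_utime, event_count):
    """Return (elapsed_seconds, average_seconds_between_events)."""
    elapsed = last_utime - first_utime
    # N events are separated by N - 1 gaps.
    average_gap = elapsed / (event_count - 1)
    return elapsed, average_gap


if __name__ == "__main__":
    # min(utime), max(utime) and the SRCH count for fid 678419.
    elapsed, gap = retry_stats(1189151747, 1189153551, 331)
    print(f"{elapsed} s total ({elapsed / 60:.1f} min), "
          f"one retry every {gap:.2f} s on average")
```

That works out to roughly 30 minutes on the same fid, with a retry
about every 5 to 6 seconds, which matches the taillog output.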

--- dormando <dormando at rydia.net> wrote:

> > It appears that whenever a bad fid is encountered (ie
> > no corresponding entry in the file_on table), it just
> > gets stuck logging more and more SRCH and NOPA entries
> > to the fsck_log.
> > 
> > If we delete the bad fid from the file table, it will
> > continue until it hits another bad fid.
> > 
> > Is this normal?  We started fsck by doing "mogadm fsck
> > start".
> 
> Can you provide a subset of your fsck log? How long
> did you wait? How many devices do you have? Did you
> see any specific errors?
> 
> I'm pretty sure it shouldn't do that :)
> 
> -Dormando
