mogadm reports entire host as failed after two drive failures

Jared Klett jared at blip.tv
Wed Feb 14 20:38:44 UTC 2007


Hello mogile folks,

	I've been populating my shiny new MogileFS with files for the
last few days, and today two drives failed in one of the storage
hosts. Bummer! (But lucky that it happened before going into production.)

	After the first failure this morning, 'mogadm check' reported what
I expected:

Checking trackers...
        10.0.0.206:6001 ... OK

Checking hosts...
        [ 1] file3 ... OK
        [ 2] file4 ... OK

Checking devices...
        host device            size(G)    used(G)    free(G)   use%
        ---- --------------- ---------- ---------- ---------- ------
        [ 1] dev1              372.506    139.920    232.585  37.56%
        [ 1] dev2              372.506    140.584    231.922  37.74%
        [ 1] dev3              372.506    137.208    235.297  36.83%
        [ 1] dev4              372.506    137.890    234.616  37.02%
        [ 1] dev5              372.506    138.765    233.741  37.25%
        [ 1] dev6              372.506    143.041    229.465  38.40%
        [ 1] dev7              372.506    142.663    229.842  38.30%
        [ 1] dev8              372.506    141.766    230.739  38.06%
        [ 1] dev9              372.506    142.891    229.614  38.36%
        [ 1] dev10             372.506    143.516    228.989  38.53%
        [ 2] dev11             372.506    143.359    229.147  38.48%
        [ 2] dev12             372.506    142.172    230.333  38.17%
        [ 2] dev13             372.506    144.171    228.335  38.70%
        [ 2] dev14             372.506    145.114    227.392  38.96%
        [ 2] dev15             372.506    141.289    231.217  37.93%
        [ 2] dev16             372.506    140.712    231.794  37.77%
        [ 2] dev17             372.506    139.755    232.751  37.52%
        [ 2] dev18             372.506    140.640    231.866  37.76%
        [ 2] dev19             372.506    142.262    230.243  38.19%
        [ 2] dev20     REQUEST FAILURE
        ---- --------------- ---------- ---------- ---------- ------
                      total:  7077.606   2687.718   4389.888  37.97%

	But after the second failure, it reports the entire host (file4) as
failed and leaves all of its devices out of the device table:

Checking trackers...
        10.0.0.206:6001 ... OK

Checking hosts...
        [ 1] file3 ... OK
        [ 2] file4 ... REQUEST FAILURE

Checking devices...
        host device            size(G)    used(G)    free(G)   use%
        ---- --------------- ---------- ---------- ---------- ------
        [ 1] dev1              372.506    140.997    231.508  37.85%
        [ 1] dev2              372.506    141.694    230.812  38.04%
        [ 1] dev3              372.506    138.347    234.158  37.14%
        [ 1] dev4              372.506    138.629    233.877  37.22%
        [ 1] dev5              372.506    140.295    232.211  37.66%
        [ 1] dev6              372.506    144.190    228.316  38.71%
        [ 1] dev7              372.506    144.503    228.003  38.79%
        [ 1] dev8              372.506    143.518    228.988  38.53%
        [ 1] dev9              372.506    144.285    228.221  38.73%
        [ 1] dev10             372.506    144.897    227.608  38.90%
        ---- --------------- ---------- ---------- ---------- ------
                      total:  3725.056   1421.355   2303.701  38.16%

	This is odd, since I'm logged in to file4 right now and it's working
fine. If I run other commands through mogadm, they report what I'd expect:

# mogadm --trackers=10.0.0.206:6001 host list
file3 [1]: alive
  IP:       10.0.0.217:7500

file4 [2]: alive
  IP:       10.0.0.218:7500

	I marked the two failed devices (dev14 and dev20) as dead earlier,
which shows up here:

# mogadm --trackers=10.0.0.206:6001 device list
file3 [1]: alive
                   used(G) free(G) total(G)
  dev1: alive      140.997 231.508 372.505
 dev10: alive      144.894 227.611 372.505
  dev2: alive      141.693 230.812 372.505
  dev3: alive      138.347 234.158 372.505
  dev4: alive      138.628 233.877 372.505
  dev5: alive      140.356 232.148 372.505
  dev6: alive      144.189 228.315 372.505
  dev7: alive      144.502 228.003 372.505
  dev8: alive      143.518 228.987 372.505
  dev9: alive      144.284 228.221 372.505

file4 [2]: alive
                   used(G) free(G) total(G)
 dev11: alive      145.280 227.225 372.505
 dev12: alive      143.927 228.578 372.505
 dev13: alive      145.359 227.146 372.505
 dev14: dead       146.562 225.943 372.505
 dev15: alive      142.580 229.925 372.505
 dev16: alive      142.285 230.220 372.505
 dev17: alive      141.778 230.727 372.505
 dev18: alive      141.899 230.605 372.505
 dev19: alive      143.648 228.856 372.505
 dev20: dead       131.006 241.499 372.505

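	For anyone who wants to cross-check this from a script, here is a
minimal sketch that pulls the non-alive devices out of 'mogadm device list'
output. It assumes the text layout shown above (a host header of the form
`name [id]: status`, then one `devN: status used free total` line per
device); the function name `dead_devices` is hypothetical and not part of
MogileFS.

```python
import re
from collections import defaultdict

# Host header lines look like "file4 [2]: alive"; device lines look like
# " dev14: dead       146.562 225.943 372.505" (layout taken from the output above).
HOST_RE = re.compile(r"^(\S+) \[(\d+)\]: (\w+)\s*$")
DEV_RE = re.compile(r"^\s*(dev\d+): (\w+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*$")


def dead_devices(listing: str) -> dict[str, list[str]]:
    """Map each host name to the devices whose status is not 'alive'."""
    result: dict[str, list[str]] = defaultdict(list)
    host = None
    for line in listing.splitlines():
        m = HOST_RE.match(line)
        if m:
            host = m.group(1)  # remember which host the following devices belong to
            continue
        m = DEV_RE.match(line)
        if m and host is not None and m.group(2) != "alive":
            result[host].append(m.group(1))
    return dict(result)


# Trimmed excerpt of the listing above.
SAMPLE = """file3 [1]: alive
                   used(G) free(G) total(G)
  dev1: alive      140.997 231.508 372.505

file4 [2]: alive
                   used(G) free(G) total(G)
 dev14: dead       146.562 225.943 372.505
 dev20: dead       131.006 241.499 372.505
"""

print(dead_devices(SAMPLE))  # {'file4': ['dev14', 'dev20']}
```

Hosts with no dead devices are simply absent from the result, so an empty
dict means every listed device is alive.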
	Is this behavior from 'mogadm check' correct, or is there an
issue?

cheers,

- Jared

-- 
Jared Klett
Co-founder, Blip.tv
646.526.8948 (cell)
JaredAtWrok  (aim)
http://blog.blip.tv

