disk monitoring

Tue May 1 17:23:34 UTC 2007

	so far I've built all our MogileFS nodes with 3ware/AMCC
controllers (we were using RAID-5 before Mogile), which do a great job of
monitoring disks on a regular basis and sending e-mail notification when
anything from a bad sector to an entire drive failure is detected.

	the downside is that when a drive *does* fail and is replaced,
the 3ware controller doesn't keep its internal unit IDs consistent,
which in turn causes device names in Linux to change, e.g. /dev/sdc
fails and now /dev/sdd shifts to become /dev/sdc, which requires
extremely careful manual remounting at the Mogile mount points.

	hope that makes sense... if anyone has a solution to avoid that,
please let me know. :)

	anyway, in the past I've used homegrown scripts to do things
like write a temporary file to the device to be checked, make sure the
file exists, and send an alert e-mail if a problem arises during those
operations.
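
	to give a rough idea, here is a minimal sketch of that kind of
check in Python. it is not the actual script: the function names, the
SMTP host and the addresses are all made up for illustration, and it
assumes each device is already mounted at a known mount point.

```python
import os
import smtplib
import tempfile
from email.message import EmailMessage


def check_mount(mount_point, payload=b"disk-check"):
    """Return None if a write/read/delete cycle works, else an error string."""
    try:
        # create the probe file on the device being checked
        fd, path = tempfile.mkstemp(prefix=".diskcheck-", dir=mount_point)
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(payload)
                f.flush()
                os.fsync(f.fileno())  # ask the kernel to flush to the device
            # make sure the file exists and holds what we wrote
            # (the read may be served from the page cache, so this mostly
            # proves the filesystem is still writable, not the platters)
            with open(path, "rb") as f:
                if f.read() != payload:
                    return "read back different data from %s" % path
        finally:
            os.unlink(path)
    except OSError as e:
        return "I/O error on %s: %s" % (mount_point, e)
    return None


def check_all(mount_points):
    """Run check_mount on each mount point and collect the error strings."""
    problems = []
    for mp in mount_points:
        err = check_mount(mp)
        if err:
            problems.append(err)
    return problems


def send_alert(problems, to_addr, from_addr, smtp_host="localhost"):
    """Mail the list of problems; assumes an SMTP server on smtp_host."""
    msg = EmailMessage()
    msg["Subject"] = "disk check failed on %d mount(s)" % len(problems)
    msg["From"] = from_addr
    msg["To"] = to_addr
    msg.set_content("\n".join(problems))
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)


# hypothetical usage from cron (paths and addresses are placeholders):
#   problems = check_all(["/var/mogdata/dev1", "/var/mogdata/dev2"])
#   if problems:
#       send_alert(problems, "ops@example.com", "diskcheck@example.com")
```

	note that this only tells you a device has stopped accepting
writes; it says nothing about which physical drive is behind the mount
point.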

cheers,

- Jared

-- 
Jared Klett
Co-founder, Blip.tv
JaredAtWrok (aim)
http://blog.blip.tv

-----Original Message-----
From: mogilefs-bounces at lists.danga.com
[mailto:mogilefs-bounces at lists.danga.com] On Behalf Of Eric Lambrecht
Sent: Tuesday, May 01, 2007 12:21 PM
To: mogilefs at lists.danga.com
Subject: Re: disk monitoring

That paper is fascinating!

Honestly, though, I don't care whether SMART predicts the drive failure.
I just want to know when a drive has actually failed so we can 'dead' it
and not end up in a situation where two drives are down and we start
losing data.

Eric....
...... Original Message .......
On Tue, 1 May 2007 11:11:03 +0300 "Egor Egorov" <egor at fine.kiev.ua>
wrote:
>
>On 1 may 2007, at 03:23, Eric Lambrecht wrote:
>
>>
>> A cursory look on the web makes me think 'smartmontools.sf.net'  
>> might work pretty well, as I think all our machines ultimately use 
>> SATA drives.
>
>
>The Google team found that 36% of the failed drives did not exhibit a 
>single SMART-monitored failure. They concluded that SMART data is 
>almost useless for predicting the failure of a single drive.
>
>http://labs.google.com/papers/disk_failures.pdf
>
>Very interesting paper.
>
>-- 
>     Egor Egorov
>     http://www.fine.kiev.ua/
>
>
>
>