disk monitoring

Tue May 1 00:53:35 UTC 2007

smartd is a good start.  We have it running periodic self-tests with  
email notifications.  Something like this config:

/dev/hdc -a -I 194 -I 190 -m <email> -s (S/../.././09|L/../(01|15)/./05)

The two "-I" flags ignore counters for temperature and some other
value we don't care about; this keeps the logs from filling with
useless notices.  They may not be suitable for your disks.  The odd
regular expression chunks at the end specify test intervals for Short
and Long self-tests.  For boxes with lots of disks you can let smartd
probe (the DEVICESCAN directive) rather than explicitly specifying
devices.

In the last two years, smartd has always caught our SATA disks before  
they finally went lights-out.  Sometimes you don't get much notice,
though: one disk went from an "old-age" soft_read_error_rate warning
to hard failure in under a day.

James
Wikispaces.com

On Apr 30, 2007, at 5:23 PM, Eric Lambrecht wrote:

> What does everybody here use for monitoring the health of disks?  
> We've got quite a heterogeneous setup going on, and at the moment I  
> think manual monitoring of syslog is all we've got since various  
> machines seem to fail in all sorts of new and interesting ways with  
> new and interesting error messages.
>
> A cursory look on the web makes me think 'smartmontools.sf.net'  
> might work pretty well, as I think all our machines ultimately use  
> SATA drives.
>
> We're using the latest mogile with the test writes, but that still  
> doesn't seem to catch everything.
>
> Any sage advice from experienced disk-tenders?
>
> Eric...