jbyers at jbyers.com
Tue May 1 00:53:35 UTC 2007
smartd is a good start. We have it running periodic self-tests with
email notifications. Something like this config:
/dev/hdc -a -I 194 -I 190 -m <email> -s (S/../.././09|L/../(01|15)/./05)
The two "-I" flags ignore counters for temperature and some other
value we don't care about; these prevent filling logs with useless
notices. They may not be suitable for your disks. The odd
regular expression chunks at the end specify the schedules for Short
and Long self-tests. For boxes with lots of disks you can let smartd
probe with DEVICESCAN rather than explicitly specifying devices.
In the last two years, smartd has always caught our SATA disks before
they finally went lights-out. Sometimes you don't get much notice,
though: one disk went from an "old-age" soft_read_error_rate to hard
failure in under a day.
On Apr 30, 2007, at 5:23 PM, Eric Lambrecht wrote:
> What does everybody here use for monitoring the health of disks?
> We've got quite a heterogenous setup going on, and at the moment I
> think manual monitoring of syslog is all we've got since various
> machines seem to fail in all sorts of new and interesting ways with
> new and interesting error messages.
> A cursory look on the web makes me think 'smartmontools.sf.net'
> might work pretty well, as I think all our machines ultimately use
> SATA drives.
> We're using the latest mogile with the test writes, but that still
> doesn't seem to catch everything.
> Any sage advice from experienced disk-tenders?