nathan at pbwiki.com
Tue May 1 01:01:51 UTC 2007
All of our machines dump into some offworld syslogs -- from a crontab:
*/15 * * * * root /usr/sbin/smartctl --all -d ata /dev/sda | grep -E "(SMART overall| Temperature)"|tr -s ' '|tr "\n" "," |logger -p daemon.info -t 'smartd: /dev/sda'
*/15 * * * * root cat /sys/block/sda/stat |tr -s ' '| logger -t 'dstat-sda' -p daemon.info
*/15 * * * * root sensors -f w83793-i2c-0-2f|grep -E "(Temp|Fan)"|tr -s ' '|cut -d'(' -f1|tr "\n" ","|logger -t 'hwmon' -p
*/15 * * * * root cat /sys/class/net/eth0/statistics/*|tr "\n" " "| logger -t 'nstat-eth0' -p daemon.info
So far the only disk failure we've seen was correctly predicted by
smartctl, though with only about 30 minutes' warning (noted after the
fact, of course). FWIW, the smartctl lines above are a little too
terse for making a proper alarm -- smartctl died during its run,
which in itself indicated imminent failure, but since it produced no
output there was no useful info in the syslog stream.
I just saw James Byers' note with explicit flags in the smartctl call
-- it sort of points to the need for a more standard one-line-output
mode for smartctl, or at least a wrapper script which could handle
those more elaborate cases as well as our null output.
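A rough sketch of what such a wrapper might look like, assuming the same grep patterns as the cron entries above. The function name smart_summary and the "exit=" / "NO OUTPUT" wording are made up for illustration; the point is only that an empty smartctl run still produces a line in syslog, along with the exit status that was being thrown away.

```shell
#!/bin/sh
# Hypothetical helper (not part of smartmontools): reads smartctl output
# on stdin, takes smartctl's exit status as $1, and prints exactly one line.
smart_summary() {
    rc=$1
    # Same filtering as the cron entry: keep the health and temperature
    # lines, squeeze repeated spaces, and join the lines with commas.
    summary=$(grep -E '(SMART overall| Temperature)' | tr -s ' ' | tr '\n' ',')
    summary=${summary%,}    # drop the trailing comma left by tr
    if [ -z "$summary" ]; then
        # No matching output at all is itself worth alarming on.
        echo "exit=$rc NO OUTPUT from smartctl"
    else
        echo "exit=$rc $summary"
    fi
}

# Possible use from a script called by cron (same device and tag as above):
#   out=$(/usr/sbin/smartctl --all -d ata /dev/sda 2>&1); rc=$?
#   printf '%s\n' "$out" | smart_summary "$rc" | logger -p daemon.info -t 'smartd: /dev/sda'
```

Capturing the output first and passing the exit status in separately means the status is not lost behind the pipeline, so an alarm can key on either a nonzero "exit=" value or the "NO OUTPUT" marker.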
-Nathan / PBwiki
On Apr 30, 2007, at 5:45 PM, Brandon Ooi wrote:
> we've had a lot of disks fail with no real warning from
> smartmontools, not really sure how it decides a drive is failing.
> we tried using this with a bunch of sata/ide drives. i too would
> like to know what other people are trying to do.
> Eric Lambrecht wrote:
>> What does everybody here use for monitoring the health of disks?
>> We've got quite a heterogeneous setup going on, and at the moment I
>> think manual monitoring of syslog is all we've got since various
>> machines seem to fail in all sorts of new and interesting ways
>> with new and interesting error messages.
>> A cursory look on the web makes me think 'smartmontools.sf.net'
>> might work pretty well, as I think all our machines ultimately use
>> SATA drives.
>> We're using the latest mogile with the test writes, but that still
>> doesn't seem to catch everything.
>> Any sage advice from experienced disk-tenders?