disk monitoring

Tue May 1 06:03:54 UTC 2007

Thanks for reminding me about this ... adding SMART monitoring to Mogile
has been on the TODO list for a while, but I forgot how much easier it'd be
to add nowadays, since some of the recent developments give us a bunch of
the difficult stuff for free.

Now that mogstored is required (and is just a front-end for
lighttpd/apache/etc, if you want to use those), we can rely on the
side-channel management port, and then mogilefsd's [monitor] process can
ask the mogstoreds to export back SMART info, if available.  Then as part
of "mogadm check" we can relay back drive health, not put new/replicated
files on failing drives, etc....
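
As a rough sketch of the storage-node side of that idea (not anything that
exists in mogstored today), a helper could shell out to smartctl from
smartmontools and boil the output down to a healthy/failing/unknown answer.
The function names parse_health and check_device are made up for this
illustration, and Python is used only to keep the sketch short; it assumes
smartctl is installed and that "smartctl -H" prints its usual ATA or SCSI
summary line.

```python
import re
import subprocess
from typing import Optional

# Summary lines printed by `smartctl -H` for ATA and SCSI devices respectively.
_ATA_RE = re.compile(r"SMART overall-health self-assessment test result:\s*(\S+)")
_SCSI_RE = re.compile(r"SMART Health Status:\s*(\S+)")


def parse_health(output: str) -> Optional[bool]:
    """Return True if healthy, False if failing, None if no verdict was found."""
    m = _ATA_RE.search(output)
    if m:
        return m.group(1) == "PASSED"
    m = _SCSI_RE.search(output)
    if m:
        return m.group(1) == "OK"
    return None  # SMART unsupported, disabled, or unrecognised output


def check_device(dev: str) -> Optional[bool]:
    """Hypothetical helper: run `smartctl -H` on dev (usually needs root)."""
    try:
        proc = subprocess.run(
            ["smartctl", "-H", dev],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        return None  # smartmontools not installed on this host
    return parse_health(proc.stdout)
```

A monitor could then treat False as "stop placing new/replicated files on
this device" and None as "no SMART info available", which matches the
"if available" caveat above.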

I don't know much about SMART, though, so if somebody wants to lead this
project, I'd love it... and I'm free to advise if you're new to the code
and need pointers.  But I think the job would be mostly SMART stuff
because all the ugly mechanics of moving info around machines/processes
are done for you at this stage.

- Brad

On Mon, 30 Apr 2007, James Byers wrote:

> smartd is a good start.  We have it running periodic self-tests with
> email notifications.  Something like this config:
>
> /dev/hdc -a -I 194 -I 190 -m <email> -s (S/../.././09|L/../(01|15)/./05)
>
> The two "-I" flags ignore counters for temperature and some other
> value we don't care about; these prevent filling logs with useless
> notices.  They may not be suitable for your disks.  The odd
> regular expression chunks at the end specify test intervals for Short
> and Long self-tests.  For boxes with lots of disks you can let smartd
> probe rather than explicitly specifying devices.
>
> In the last two years, smartd has always caught our SATA disks before
> they finally went lights-out.  Sometimes you don't get much notice,
> though: one disk went from an "old-age" soft_read_error_rate to hard
> failure in under a day.
>
> James
> Wikispaces.com
>
> On Apr 30, 2007, at 5:23 PM, Eric Lambrecht wrote:
>
> > What does everybody here use for monitoring the health of disks?
> > We've got quite a heterogeneous setup going on, and at the moment I
> > think manual monitoring of syslog is all we've got since various
> > machines seem to fail in all sorts of new and interesting ways with
> > new and interesting error messages.
> >
> > A cursory look on the web makes me think 'smartmontools.sf.net'
> > might work pretty well, as I think all our machines ultimately use
> > SATA drives.
> >
> > We're using the latest mogile with the test writes, but that still
> > doesn't seem to catch everything.
> >
> > Any sage advice from experienced disk-tenders?
> >
> > Eric...
>
>
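
For anyone puzzling over the "-s" expression in James's config quoted above:
smartd matches it against a string of the form T/MM/DD/d/HH (test type,
month, day of month, day of week with 1 = Monday, hour). The sketch below
mimics that matching to show when the quoted schedule fires. It is only an
approximation using Python's re module rather than smartd's own POSIX regex
code, and due_tests is a made-up name for this illustration.

```python
import re
from datetime import datetime

# The schedule from the quoted smartd.conf line: a Short test every day at
# 09:00, a Long test on the 1st and 15th of each month at 05:00.
SCHEDULE = re.compile(r"(S/../.././09|L/../(01|15)/./05)")


def due_tests(now: datetime) -> list:
    """Return the test types ('L' long, 'S' short) the schedule selects at now."""
    # Build the MM/DD/d/HH part; isoweekday() gives 1 (Monday) .. 7 (Sunday).
    stamp = now.strftime("%m/%d/") + str(now.isoweekday()) + now.strftime("/%H")
    return [t for t in "LS" if SCHEDULE.fullmatch(f"{t}/{stamp}")]
```

Calling due_tests(datetime(2007, 5, 1, 9)) selects only the short test, while
05:00 on the 1st or 15th selects only the long one.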