Feature #17
EDAC monitoring support
| Status: | New | Start: | 01/23/2009 |
| Priority: | Normal | Due date: | |
| Assigned to: | - | % Done: | 0% |
| Category: | - | | |
| Target version: | 4.0 | | |
Description
EDAC is a generic name for technologies that do ECC memory monitoring and control in modern Linux kernels. We need to implement a monitoring script that will support ECC memory checking / monitoring in Inquisitor.
Most modern Linux kernels support EDAC in some form. Its stability status is the same as that of lm_sensors, i.e. it is declared "unstable", but in practice EDAC is widely supported by distro vendors and backported into a wide range of enterprise kernels. Modern Linux EDAC monitoring appears to work better than the chipset-dependent ECC support in memtest86+, which always seems to be 1-1.5 years behind schedule.
A short list of relevant links about EDAC:
- http://bluesmoke.sourceforge.net/ - main site of the project; nothing fancy, but it is actively developed and all upcoming kernels are supported.
- http://buttersideup.com/edacwiki/ - brief, clear and concise introduction and tutorials.
- http://sourceforge.net/projects/edac-utils - project that hosts userspace utilities to support EDAC.
To make a long story short: EDAC is implemented as Linux kernel modules that should be loaded and configured. After that, they will:
- output errors to the kernel log (dmesg)
- allow querying through the sysfs interface (for control, access to error counters/indicators, resetting, etc.)
In addition, there are userspace utilities:
- edac-ctl - loads the necessary modules, autodetecting what's needed by the CPU/processor (like sensors-detect in lm_sensors)
- edac-util - reads everything from the sysfs interface and outputs it in a ready-to-be-parsed manner (like sensors in lm_sensors)
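To illustrate the sysfs side, here is a minimal sketch that reads the per-controller error counters directly. It assumes the usual layout of memory controller directories under /sys/devices/system/edac/mc, each with ce_count (corrected errors) and ue_count (uncorrected errors) files. The function name read_edac_counters and the shape of the returned dictionary are made up for this example, and the root path is a parameter so the code can be pointed at a test directory.

```python
from pathlib import Path

# Assumed default location of the EDAC memory-controller nodes in sysfs.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counters(root=EDAC_MC_ROOT):
    """Return {controller: {"ce": int, "ue": int}} for every mc* directory.

    A counter that is missing or unreadable is reported as None.
    """
    counters = {}
    root = Path(root)
    if not root.is_dir():
        # EDAC modules not loaded, or hardware not supported.
        return counters
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for key, filename in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc / filename).read_text().strip())
            except (OSError, ValueError):
                entry[key] = None
        counters[mc.name] = entry
    return counters


if __name__ == "__main__":
    for name, c in read_edac_counters().items():
        print(f"{name}: corrected={c['ce']} uncorrected={c['ue']}")
```

An empty result most likely means the EDAC modules are not loaded yet, so a real script would probably want to treat that case as a warning rather than as "no errors".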
Generally, to get this done, we need:
- A kernel with full EDAC support; the drivers are usually already present in most modern kernels.
- edac-utils installed in the chroot.
- Loading of the necessary kernel modules during init (can be done using edac-ctl).
- A monitoring script that wakes up every N seconds, checks whether there are any new errors and logs them to the server.
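The last item could look roughly like the sketch below. It is not tied to the real Inquisitor monitoring API: read_counters and report are hypothetical callables supplied by the caller (for example, something that parses edac-util output or reads sysfs, and something that ships a message to the server). The loop takes a baseline at startup and then reports only counters that have grown since the previous poll.

```python
import time


def new_errors(previous, current):
    """Return {name: increase} for counters that grew since the last poll."""
    grown = {}
    for name, value in current.items():
        delta = value - previous.get(name, 0)
        if delta > 0:
            grown[name] = delta
    return grown


def monitor(read_counters, report, interval=60, max_polls=None, sleep=time.sleep):
    """Poll read_counters() every `interval` seconds and report increases.

    read_counters: hypothetical callable returning {counter_name: int}
    report:        hypothetical callable that sends one message to the server
    max_polls:     stop after this many polls (None means run forever)
    """
    # Baseline: errors counted before startup are not reported again.
    previous = read_counters()
    polls = 0
    while max_polls is None or polls < max_polls:
        sleep(interval)
        current = read_counters()
        for name, delta in sorted(new_errors(previous, current).items()):
            report(f"EDAC: {name} increased by {delta} (now {current[name]})")
        previous = current
        polls += 1


if __name__ == "__main__":
    # Demo with canned samples instead of real hardware counters.
    samples = iter([{"mc0/ce": 0}, {"mc0/ce": 2}, {"mc0/ce": 2}])
    monitor(lambda: next(samples), print, interval=0, max_polls=2)
```

Reporting deltas rather than absolute values keeps the server log quiet on machines that already carry a non-zero corrected-error count; whether errors that predate the script should also be logged once is a design decision left open here.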