Anuket Project
RAS/mcelog Plugin High Level Design
Requirements
1.0 | Collection interval should be configurable |
2.0 | Support for all platform RAS notifications/events |
3.0 | Support for all RAS counters |
4.0 | RAS MIB definition |
5.0 | Documentation on preferred configuration to enable RAS memory features |
6.0 | Failure events must be detected within 10ms |
Overview
The goal of this equivalence feature is to expose the metrics and events of the Reliability, Availability and Serviceability (RAS) features provided by the platform to higher-level fault management applications.
Design
mcelog plugin
Introduction
The purpose of this plugin is to send notifications and statistics relevant to Machine Check Exceptions (MCEs) when they occur. The plugin leverages the mcelog Linux utility to detect that an exception has occurred. mcelog supports a client-server model and does the logging and accounting of exceptions when they occur. The plugin simply uses the client protocol and the logging capabilities of mcelog to detect when an exception has occurred. Note that mcelog supports logging to syslog as well as to a custom log file.
Mcelog
X86 CPUs report errors detected by the CPU as machine check events (MCEs). Most errors can be corrected by the CPU by internal error correction mechanisms (RAS Features). Uncorrected errors cause machine check exceptions which may kill processes or panic the machine. A small number of corrected errors is usually not a cause for worry, but a large number can indicate future failure.
When a corrected or recovered error happens, the x86 kernel writes a record describing the MCE into an internal ring buffer available through the /dev/mcelog device. mcelog retrieves errors from /dev/mcelog, decodes them into a human-readable format and prints them on the standard output or optionally into the system log. The /dev/mcelog device should be read regularly, either by running the mcelog program from a cron job or by running the mcelog daemon instead.
mcelog can optionally do more, such as keeping statistics or triggering shell scripts on specific events. By default mcelog supports offlining memory pages with persistent corrected errors, offlining CPU cores that have developed cache problems, and otherwise logging specific events to the system log after they cross a threshold.
When an uncorrected machine check error happens that the kernel cannot recover from, the kernel will usually panic the system. If a warm reset follows the panic, mcelog should pick up the machine check errors after the reboot. This is not possible after a cold reset.
When the panic triggers a kdump kexec crash kernel, the crash kernel boot-up script should log the machine checks to disk, otherwise they might be lost.
Note that after mcelog retrieves an error the kernel no longer stores it (unlike dmesg(1)), so the output should always be saved somewhere and mcelog should not be run in uncontrolled ways.
http://www.mcelog.org/manpage.html
Plugin Configuration
A description of the plugin configuration parameters can be found at: https://github.com/collectd/collectd/blob/master/src/collectd.conf.pod
An example configuration for the plugin can be found at:
https://github.com/collectd/collectd/blob/master/src/collectd.conf.in
Plugin Functionality
The plugin does the following:
- Checks the liveness of the mcelog server and reports a failure if it is not running or if it fails.
- Retrieves aggregated memory corrected and uncorrected errors through the client protocol (submits an event/statistic).
- Retrieves other machine check exceptions from the log file (submits an event).
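The first two steps can be sketched from the client side as follows. This is a minimal illustration, not the plugin's actual code: the socket path `/var/run/mcelog-client`, the `dump all bios` command and the trailing `done` line are assumptions about the mcelog client protocol and should be checked against the installed mcelog version and /etc/mcelog/mcelog.conf. The function names are hypothetical.

```python
import socket

# Assumption: commonly used default path of the mcelog client socket.
MCELOG_SOCKET = "/var/run/mcelog-client"


def mcelog_is_alive(path=MCELOG_SOCKET, timeout=1.0):
    """Return True if a server accepts connections on the given UNIX socket."""
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect(path)
        return True
    except OSError:
        # Missing socket file, refused connection or timeout: treat as not alive.
        return False
    finally:
        s.close()


def query_mcelog(command, path=MCELOG_SOCKET, timeout=1.0):
    """Send one command line and return the raw reply.

    Reads until the reply ends with a 'done' line (assumed terminator) or EOF.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        s.connect(path)
        s.sendall(command.encode("ascii") + b"\n")
        reply = b""
        while not reply.endswith(b"done\n"):
            data = s.recv(4096)
            if not data:
                break
            reply += data
    return reply.decode("ascii", errors="replace")
```

A caller would typically report a failure when `mcelog_is_alive()` returns False and otherwise pass the text returned by `query_mcelog("dump all bios")` to a parser.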
Sequence diagram
mcelog sample output
mcelog can relay information regarding MCEs in two ways: through the client protocol (memory MCEs only) or by writing to a log file (syslog or other). The next sections show sample output for each case. Note that anything written to a log file is relayed by means of a collectd notification, while information received through the client protocol is relayed as a collectd notification or statistic.
Client Protocol
[mcelog]# mcelog --client
Memory errors
SOCKET 0 CHANNEL 0 DIMM 0
DMI_NAME "DIMM_A1" DMI_LOCATION "NODE 0 CHANNEL 0 DIMM 0"
corrected memory errors:
2 total
2 in 24h
uncorrected memory errors:
0 total
0 in 24h
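As a rough illustration of how this output could be turned into per-DIMM counters, the sketch below parses text laid out like the sample above. The function name and the dictionary layout are hypothetical, and real output may contain additional lines, which this parser simply ignores.

```python
import re


def parse_client_output(text):
    """Parse `mcelog --client` style text into a list of per-DIMM dicts."""
    dimms = []
    current = None
    kind = None  # "corrected" or "uncorrected", set by the section header lines
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("SOCKET "):
            # A "SOCKET ... CHANNEL ... DIMM ..." line starts a new DIMM entry.
            current = {"location": line, "corrected": {}, "uncorrected": {}}
            dimms.append(current)
            kind = None
        elif current is None:
            continue
        elif line.startswith("DMI_NAME"):
            m = re.search(r'DMI_NAME "([^"]*)"', line)
            if m:
                current["dmi_name"] = m.group(1)
        elif line == "corrected memory errors:":
            kind = "corrected"
        elif line == "uncorrected memory errors:":
            kind = "uncorrected"
        elif kind:
            # Counter lines look like "2 total" or "2 in 24h".
            m = re.match(r"(\d+) (total|in \S+)$", line)
            if m:
                current[kind][m.group(2)] = int(m.group(1))
    return dimms
```

For the sample above this yields a single entry for DIMM_A1 with two corrected and zero uncorrected errors, both in total and in the last 24 hours.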
Log file
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 1
ADDR abcd
TIME 1475154705 Thu Sep 29 14:11:45 2016
MCG status:
MCi status:
Corrected error
Error enabled
MCi_ADDR register valid
MCA: No Error
STATUS 9400000000000000 MCGSTATUS 0
MCGCAP 1000c1d APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 62
syslog
Oct 3 09:24:19 localhost mcelog: Hardware event. This is not a software error.
Oct 3 09:24:19 localhost mcelog: MCE 0
Oct 3 09:24:19 localhost mcelog: CPU 0 BANK 1
Oct 3 09:24:19 localhost mcelog: ADDR abcd
Oct 3 09:24:19 localhost mcelog: TIME 1475483059 Mon Oct 3 09:24:19 2016
Oct 3 09:24:19 localhost mcelog: MCG status:
Oct 3 09:24:19 localhost mcelog: MCi status:
Oct 3 09:24:19 localhost mcelog: Corrected error
Oct 3 09:24:19 localhost mcelog: Error enabled
Oct 3 09:24:19 localhost mcelog: MCi_ADDR register valid
Oct 3 09:24:19 localhost mcelog: MCA: No Error
Oct 3 09:24:19 localhost mcelog: STATUS 9400000000000000 MCGSTATUS 0
Oct 3 09:24:19 localhost mcelog: MCGCAP 1000c1d APICID 0 SOCKETID 0
Oct 3 09:24:19 localhost mcelog: CPUID Vendor Intel Family 6 Model 62
The mcelog server output log file (or syslog, depending on the mcelog server configuration provided in /etc/mcelog/mcelog.conf) may contain detailed error messages of many categories, including:
- Memory errors
- IO errors
- CPU errors
- QPI (Intel QuickPath Interconnect) errors
- System errors
All of these error messages are processed by the mcelog plugin and published as collectd notifications.
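One possible way to group log lines into individual MCE records before publishing them is sketched below. It assumes that every record begins with the "Hardware event. This is not a software error." line, as in the samples above, and that syslog lines carry a prefix ending in `mcelog: ` (optionally with a PID in brackets). The names are hypothetical and this is not the plugin's actual parser.

```python
import re

# Assumption: syslog lines look like "Oct 3 09:24:19 localhost mcelog: <text>".
SYSLOG_PREFIX = re.compile(r"^.*?\bmcelog(?:\[\d+\])?: ")
RECORD_START = "Hardware event. This is not a software error."


def split_records(lines):
    """Group log lines into records, one list of lines per MCE."""
    records = []
    for raw in lines:
        # Strip an optional syslog prefix; plain log-file lines pass through unchanged.
        line = SYSLOG_PREFIX.sub("", raw.rstrip("\n")).strip()
        if not line:
            continue
        if line == RECORD_START:
            records.append([])
        elif records:
            records[-1].append(line)
    return records
```

Each returned list holds the body lines of one record and could then be formatted into the text of a single notification.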
Logfile Valid identifier/value pairs
Identifier | Description |
CPU number [bank-nr] | Starts a record. Describes the CPU number, optionally followed by the bank number. Required. Can be followed by STATUS or BANK identifier-pairs. |
CPU cpu-number: Machine Check Exception mcg-status-nr Bank bank-nr: status-nr | Starts a record. Alternative form of CPU that is output by the kernel, never by mcelog. |
BANK number | Describes the machine check bank. |
STATUS number | Describes the machine check status. |
MCGSTATUS number | The mcgstatus field. |
RIP number | The program counter. |
RIP segment:number {symbol} | The program counter in segment/offset format, with a kernel symbol. |
TSC number | The CPU time stamp counter at the time of the event. |
ADDR number | The physical memory address of the error. |
MISC number | The MCi_MISC register. Values are model specific. |
PROCESSOR vendor:cpuid | The CPU vendor. Vendor identifier (1 = Intel, 2 = AMD) and the CPUID 1.EAX identifier. |
MCGCAP number | The MCGCAP register. |
APICID number | The APIC ID of the logical processor where the error occurred. |
SOCKETID number | Describes the physical APIC ID of the socket the error occurred on. |
CPUID Vendor vendor Family family Model model | Describes the CPUID version. Decoded version of PROCESSOR. |
TIME time [decodedtime] | Time when the error occurred, in time_t format. Optionally followed by human-readable ctime. Decoded version of TSC. |
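A subset of these identifier/value pairs can be extracted from a record with a small amount of pattern matching. The sketch below is illustrative only: it handles the identifiers that carry a single numeric token plus TIME, keeps numeric values as strings because their base is not stated here, and uses hypothetical names throughout.

```python
import re

# Identifiers from the table above that are followed by a single numeric token.
NUMERIC_IDS = ("CPU", "BANK", "STATUS", "MCGSTATUS", "TSC", "ADDR",
               "MISC", "MCGCAP", "APICID", "SOCKETID")
PAIR = re.compile(r"\b(%s) ([0-9A-Fa-fx]+)\b" % "|".join(NUMERIC_IDS))


def parse_record(lines):
    """Return a dict of identifier -> value for one MCE record."""
    fields = {}
    for line in lines:
        if line.startswith("TIME "):
            # The first token after TIME is the time_t value; the rest is ctime text.
            fields["TIME"] = int(line.split()[1])
            continue
        for name, value in PAIR.findall(line):
            fields[name] = value
    return fields
```

Lines without a known identifier, such as "Corrected error" or "MCA: No Error", are skipped by this sketch; a fuller parser would likely keep them as free text for the notification message.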
SNMP Support
As traps are not supported, the only information relayed through SNMP will be memory error related:
- corrected memory errors
  - total error number
  - number of errors over a period of time
- uncorrected memory errors
  - total error number
  - number of errors over a period of time
Considerations
Configuration Considerations
BIOS configuration options associated with Errors.
Mcelog must be configured to run on the platform in daemon mode and logging capabilities must be enabled.
Deployment Considerations
Mcelog must also be configured to run on the platform.
A custom RAS-MIB will be deployed and registered on the platform.
API/GUI/CLI Considerations
Equivalence Considerations
Security Considerations
IPC between the server and client is through a UNIX socket.
Alarms, events, statistics considerations
MCEs must be relayed as alarms. Where a statistic can be retrieved for the type of error, it should be relayed as a statistic as well.
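The rule above could look like the following sketch for memory errors, where counters are available. It compares two snapshots of one DIMM's totals, always reports the current totals as statistics, and raises an alarm only when a counter has grown. The function name, the tuple layout and the mapping of corrected errors to WARNING and uncorrected errors to FAILURE are assumptions made for illustration.

```python
def relay_dimm(name, prev, curr):
    """Return (statistics, alarms) for one DIMM from two counter snapshots.

    prev and curr are dicts with integer "corrected" and "uncorrected" totals.
    """
    # Statistics: the current totals are always reported.
    stats = [
        (name, "corrected_memory_errors", curr["corrected"]),
        (name, "uncorrected_memory_errors", curr["uncorrected"]),
    ]
    # Alarms: only raised when a counter increased since the last snapshot.
    alarms = []
    new_uncorrected = curr["uncorrected"] - prev["uncorrected"]
    new_corrected = curr["corrected"] - prev["corrected"]
    if new_uncorrected > 0:
        alarms.append(("FAILURE", "%s: %d new uncorrected memory error(s)"
                       % (name, new_uncorrected)))
    if new_corrected > 0:
        alarms.append(("WARNING", "%s: %d new corrected memory error(s)"
                       % (name, new_corrected)))
    return stats, alarms
```

Errors that only appear in the log file have no counter to compare, so under this scheme they would be relayed as alarms only.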
Redundancy Considerations
Performance Considerations
This feature is not part of Telemetry, so performance is not applicable. However, collectd and its plugins should not impact the overall performance of a workload running on the platform.
Testing Consideration
The mcelog test suites cover large parts (but not all) of the kernel corrected machine check code at the software level. They do not exercise the actual error handling and correcting hardware or any BIOS components. To exercise these components you can:
- Use APEI error injection if your BIOS supports it. See examples in mce-test below.
- Use a known bad DIMM that throws errors in a test system
- Test on a live cluster with enough machines. Enough transistors imply plenty of errors.
- Use special error injecting DIMMs.
- Use extreme measures like hair dryers (not recommended, use at your own risk)
Other Considerations
Mcelog doesn’t support PCIe AER, so relevant errors will not be retrievable.
The user must be root to run collectd with the mcelog plugin enabled.
Impact
The following table outlines possible impact(s) the deployment of this deliverable may have on the current system.
Ref | System Impact Description | Recommendation / Comments |
1 | | |
Key Assumptions
The following assumptions apply to the scope specified in this document.
Ref | Assumption | Status |
1 | | |
Key Exclusions
The following exclusions apply to the scope discussed in this document.
Ref | Exclusion | Status |
1 | | |
Key Dependencies
The following table outlines the key dependencies associated with this deliverable.
Ref | Dependency | Status |
1 | Mcelog daemon | |
2 | Collectd message log parser utility | |
3 | | |
4 | | |
Issues List
Ref | Issue | Status |
1 | | The sample file format is proposed in section 1.1.1.1 |