| | | | | | | | |
|---|
1 | RAS memory plugin configuration | | Enable the RAS memory plugin by uncommenting the 'mcelog' related lines in collectd.conf. Start collectd: "./collectd -C collectd.conf -f" or start it as a service: "service collectd start". Open the collectd csv path, like: "collectd/csv/<DUT>/…". Stop collectd: "pkill collectd" or "service collectd stop". Comment out the RAS memory collectd plugin (mcelog) in the "collectd.conf" file. Delete existing collectd csv files under the "collectd/csv" path. Start collectd. Stop collectd: "pkill collectd" or "service collectd stop". Uncomment the RAS memory collectd plugin (mcelog) in the "collectd.conf" file. Start collectd.
| File is changed. Verify collectd is running: "pidof collectd" returns a process ID or "service collectd status" shows the service is running. RAS memory collectd files exist: total and 24 hour for corrected and uncorrected errors. Collectd RAS related files are updated with the interval set in "collectd.conf". After collectd starts, collectd RAS related files are not created/updated. RAS memory collectd files exist: total and 24 hour for corrected and uncorrected errors. Collectd RAS related files are updated with the interval set in "collectd.conf".
| File is changed. collectd is running. mcelog-SOCKET_0_CHANNEL_0_DIMM_0_DIMM_A1 mcelog-SOCKET_0_CHANNEL_2_DIMM_0_DIMM_C1 mcelog-SOCKET_1_CHANNEL_0_DIMM_0_DIMM_E1 mcelog-SOCKET_1_CHANNEL_2_DIMM_0_DIMM_G1 mcelog-SOCKET_0_CHANNEL_0_DIMM_any mcelog-SOCKET_0_CHANNEL_3_DIMM_0_DIMM_D1 mcelog-SOCKET_1_CHANNEL_0_DIMM_any mcelog-SOCKET_1_CHANNEL_3_DIMM_0_DIMM_H1 mcelog-SOCKET_0_CHANNEL_1_DIMM_0_DIMM_B1 mcelog-SOCKET_0_CHANNEL_any_DIMM_any mcelog-SOCKET_1_CHANNEL_1_DIMM_0_DIMM_F1 mcelog-SOCKET_1_CHANNEL_any_DIMM_any Files are not updated. Files are updated with new values (timestamp and errors).
| | | |
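
The per-location names listed in the actual results of test 1 can be checked mechanically. The following is a minimal sketch, assuming the naming pattern inferred from those names (SOCKET, CHANNEL, DIMM, plus an optional slot label such as DIMM_A1); the helper name `parse_location` is hypothetical and not part of collectd.

```python
import re

# Pattern inferred from the names in the test results; this is an assumption,
# not a documented collectd naming contract.
_LOCATION = re.compile(
    r"^mcelog-SOCKET_(?P<socket>\d+|any)"
    r"_CHANNEL_(?P<channel>\d+|any)"
    r"_DIMM_(?P<dimm>\d+|any)"
    r"(?:_(?P<label>DIMM_\w+))?$"
)


def parse_location(name):
    """Split a plugin instance name into socket, channel, DIMM and slot label."""
    match = _LOCATION.match(name)
    if match is None:
        raise ValueError(f"unexpected RAS memory instance name: {name!r}")
    return match.groupdict()


if __name__ == "__main__":
    print(parse_location("mcelog-SOCKET_0_CHANNEL_0_DIMM_0_DIMM_A1"))
    print(parse_location("mcelog-SOCKET_1_CHANNEL_any_DIMM_any"))
```

A check like this could be run over the directory listing under the csv path to confirm that every expected socket/channel/DIMM combination is present.
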
2 | RAS memory plugin interval configuration | | Open the "collectd.conf" file to check the collectd update interval. Open the collectd csv path, like: "collectd/csv/<DUT>/mce_log…". Change the interval in "collectd.conf" to 60 (seconds). Inject a few memory errors. Change the interval within the range 1-300 seconds. Inject a few memory errors.
| Find the line "Interval <number>". RAS memory collectd files exist: total and 24 hour for corrected and uncorrected errors. RAS memory collectd files are updated with the interval set in "collectd.conf". RAS memory collectd files are updated every 60 seconds. RAS memory collectd files are updated at every set interval.
| The default is 10 seconds. Timestamps are updated every 10 seconds. RAS memory collectd files are updated every 60 seconds. Works correctly for 30, 60, and 300 seconds.
| | | |
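
The interval check in test 2 can be automated by comparing consecutive timestamps in a csv file. This is a minimal sketch, assuming each data row starts with an epoch timestamp and that a non-numeric header row (for example "epoch,value") may be present; the function names and the one-second tolerance are illustrative choices, not part of collectd.

```python
import csv
import io


def read_epochs(csv_text):
    """Return the first column of every data row as a float timestamp."""
    epochs = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        try:
            epochs.append(float(row[0]))
        except ValueError:
            continue  # assumed header row such as "epoch,value"
    return epochs


def interval_violations(epochs, interval, tolerance=1.0):
    """Return consecutive timestamp pairs whose gap differs from the interval."""
    return [
        (earlier, later)
        for earlier, later in zip(epochs, epochs[1:])
        if abs((later - earlier) - interval) > tolerance
    ]


if __name__ == "__main__":
    sample = "epoch,value\n100.0,0\n160.0,0\n220.5,1\n300.0,1\n"
    print(interval_violations(read_epochs(sample), interval=60))
```

An empty result means every update arrived within the tolerance of the configured interval.
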
3 | RAS memory plugin mcelog liveness detection | | Verify collectd, mcelog are running. Stop mcelog service. Start mcelog service. Restart collectd if needed. Terminate mcelog (pkill mcelog). Restart mcelog service. Repeat test three times.
| Collectd and mcelog are running. Service mcelog is stopped. Appropriate messages are printed to syslog with the correct severity by the collectd RAS memory plugin. Collectd and mcelog are running. RAS memory collectd files are updated with the interval set in "collectd.conf". The mcelog service has exited. Appropriate messages are printed to syslog (TBD) with the correct severity by the RAS memory collectd plugin. Collectd and mcelog are running. RAS memory collectd related files are updated with the interval set in "collectd.conf". RAS memory collectd plugin is stopped/started, and messages about this are printed.
| pidof mcelog, collectd: 207803, 207791 syslog messages: collectd[207791]: mcelog: Connection to socket is broken collectd[207791]: plugin_dispatch_notification: severity = 1; message = Connection to mcelog socket is broken.; time = 1477301194.912; host = silpixa00378251; collectd[207791]: plugin_read_thread: Handling `mcelog'. mcelog: mcelog_read collectd[207791]: mcelog: MACHINE CHECK INFO NOT AVAILABLE collectd[207791]: plugin_read_thread: read-function of the `mcelog' plugin took 0.000027 seconds. collectd[207791]: plugin_read_thread: Effective interval of the `mcelog' plugin is 30.000 seconds. collectd[207791]: plugin_read_thread: Next read of the `mcelog' plugin at 1477301754.617.
3. RAS memory collectd files are updated with new timestamps. After error injected to DIMM any new values are recorded. 4. systemd[1]: Stopped LSB: Machine Check Exceptions (MCE) collector & decoder. 5. pidof collectd, mcelog: 209386, 209318 | Pass | | |
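
The severity check in test 3 can be scripted against the syslog output quoted in the actual results. The sketch below assumes the notification line format shown there ("severity = …; message = …; time = …; host = …;"); the function name is hypothetical, and the severity names reflect my understanding of collectd's notification constants (1 failure, 2 warning, 4 okay), so treat that mapping as an assumption to verify against the collectd sources.

```python
import re

# Assumed mapping of collectd notification severities; verify before relying on it.
SEVERITY_NAMES = {1: "FAILURE", 2: "WARNING", 4: "OKAY"}

_NOTIFICATION = re.compile(
    r"plugin_dispatch_notification: "
    r"severity = (?P<severity>\d+); "
    r"message = (?P<message>.*?); "
    r"time = (?P<time>[\d.]+); "
    r"host = (?P<host>[^;\s]+);"
)


def parse_notification(line):
    """Extract severity, message, time and host from a syslog line, or None."""
    match = _NOTIFICATION.search(line)
    if match is None:
        return None
    severity = int(match.group("severity"))
    return {
        "severity": severity,
        "severity_name": SEVERITY_NAMES.get(severity, "UNKNOWN"),
        "message": match.group("message"),
        "time": float(match.group("time")),
        "host": match.group("host"),
    }


if __name__ == "__main__":
    line = (
        "collectd[207791]: plugin_dispatch_notification: severity = 1; "
        "message = Connection to mcelog socket is broken.; "
        "time = 1477301194.912; host = silpixa00378251;"
    )
    print(parse_notification(line))
```

Lines that are not notifications (for example the plugin_read_thread debug lines) return None and can be skipped.
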
4 | RAS memory plugin upon collectd restart | High | Enable the RAS memory plugin by uncommenting the 'mcelog' related lines in collectd.conf. Start collectd: "./collectd -C collectd.conf -f" or start it as a service: "service collectd start". Open the collectd csv path, like: "collectd/csv/<DUT>/…". Stop collectd: "pkill collectd" or "service collectd stop". Repeat the test three times.
| File is changed. Verify collectd is running: "pidof collectd" returns a process ID or "service collectd status" shows the service is running. Collectd RAS memory files exist: total and 24 hour for corrected and uncorrected errors. Collectd RAS related files are updated with the interval set in "collectd.conf". Verify collectd is not running: "pidof collectd" returns nothing or "service collectd status" shows the service is stopped. Collectd RAS related files are not updated. Collectd is functioning correctly. Collectd RAS memory related data is updated in time.
| Success. Collectd service started; mcelog plugin init and read callback calls are present in syslog. Mcelog appends data to log files at the defined interval. Collectd service is stopped. Logs of mcelog are not updated anymore. Repeating the previous steps reproduces the same behavior.
| Pass | | |
5 | RAS memory plugin upon corrected errors injection | | Inject a correctable memory error. $ cat mytest/corrected CPU 0 BANK 0 STATUS 0xcc00008000010090 ADDR 0x0010FFFFFFF $ ./mce-inject mytest/corrected Make sure no errors are injected. Wait for a while. Repeat the test for other correctable memory errors.
| Memory error is recorded to the total and 24 hour files for corrected errors with the correct timestamp and location (node#/channel#/DIMM#) by the collectd RAS memory plugin. The same number of memory errors is recorded to mcelog and to the collectd logfile. No new memory errors are recorded either to mcelog or to the collectd csv plugin files related to RAS memory. Same as in step#1.
| Error is logged to /var/log/mcelog and to collectd, and the values are the same. The same number of memory errors is recorded to mcelog and to the collectd logfile. No server reboot observed. During idle time there are no new records in the mcelog plugin log files, and none are logged by mcelog itself. The unchanged value is updated to collectd at the correct time interval. Once new errors are injected, counting and logging occur as expected.
| Pass | | |
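
To see why the injected STATUS value in test 5 counts as a corrected error, the status word can be decoded bit by bit. This is a minimal sketch, assuming the machine-check status layout as I understand it from the Intel SDM (bit 63 valid, bit 62 overflow, bit 61 uncorrected) and an injection file that uses only numeric "KEY value" pairs as in this test; the function names are hypothetical.

```python
# Assumed MCi_STATUS bit positions (Intel SDM); verify against the manual.
MCI_STATUS_VAL = 1 << 63   # register holds a valid error
MCI_STATUS_OVER = 1 << 62  # an earlier error was overwritten
MCI_STATUS_UC = 1 << 61    # error was not corrected by hardware


def parse_inject_line(line):
    """Parse 'KEY value KEY value ...' pairs with decimal or hex values."""
    tokens = line.split()
    return {key: int(value, 0) for key, value in zip(tokens[0::2], tokens[1::2])}


def classify(status):
    """Return 'invalid', 'uncorrected' or 'corrected' for a status word."""
    if not status & MCI_STATUS_VAL:
        return "invalid"
    return "uncorrected" if status & MCI_STATUS_UC else "corrected"


if __name__ == "__main__":
    fields = parse_inject_line(
        "CPU 0 BANK 0 STATUS 0xcc00008000010090 ADDR 0x0010FFFFFFF"
    )
    print(hex(fields["STATUS"]), classify(fields["STATUS"]))
```

The symbolic form used in the uncorrected tests (for example "STATUS UNCORRECTED SRAO 0xc0") is not handled by this sketch.
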
6 | RAS memory plugin upon uncorrected non-fatal errors injection | | Inject an uncorrectable non-fatal memory error. $ cat mytest/uncorrected_nonfatal CPU 0 BANK 2 STATUS UNCORRECTED SRAO 0xc0 MCGSTATUS RIPV MCIP ADDR 0x1234 MISC 0x8c RIP 0x73:0x1eadbabe $ ./mce-inject mytest/uncorrected_nonfatal Make sure no errors are injected. Wait for a while. Repeat the error injection three times.
| Memory error is recorded to the total and 24 hour files for uncorrected errors with the correct timestamp and location (node#/channel#/DIMM#) by the collectd RAS memory plugin. The same number of memory errors is recorded to mcelog and to the collectd logfile. Note: error injection may cause a system reboot. No new memory errors are recorded either to mcelog or to the collectd csv plugin files related to RAS memory. Same as in step#1.
| Error is logged to /var/log/mcelog and to collectd, and the values are the same. The same number of memory errors is recorded to mcelog and to the collectd logfile. No server reboot observed. During idle time there are no new records in the mcelog plugin log files, and none are logged by mcelog itself. The unchanged value is updated to collectd at the correct time interval. Once new errors are injected, counting and logging occur as expected.
| | | |
7 | RAS memory plugin upon uncorrected fatal errors injection | Medium | Inject an uncorrectable fatal memory error. $ cat mytest/uncorrected_fatal CPU 0 BANK 2 STATUS UNCORRECTED SRAO 0xc0 MCGSTATUS MCIP ADDR 0x1234 MISC 0x8c $ ./mce-inject mytest/uncorrected_fatal Check server behavior. Repeat step#1 again (same or different memory error).
| Memory error is recorded to the total and 24 hour files for uncorrected errors with the correct timestamp and location (node#/channel#/DIMM#) by the collectd RAS memory plugin. The same number of memory errors is recorded to mcelog and to the collectd logfile. Note: error injection may cause a system reboot. No new memory errors are recorded either to mcelog or to the collectd csv plugin files related to RAS memory. Same as in step#1.
| Server is rebooted. Uncorrected error is detected by mcelog and logged by collectd against the correct DIMM location after the server is up. Collectd files don't preserve statistics after an error is injected and the server reboots! Collectd files don't preserve statistics once mcelog is restarted! During idle time there are no new records in the mcelog plugin log files, and none are logged by mcelog itself. The unchanged value is updated to collectd at the correct time interval. Once new errors are injected, counting and logging occur as expected.
| | | |
8 | RAS memory plugin MCE detection on faulty DIMM | | Prepare a server with a faulty DIMM installed in a specific slot. Wait for the expected memory errors. Check for RAS memory errors in mcelog and in the collectd csv files. Repeat the observation for a while, e.g. overnight.
| Start the server. Errors are registered in the mcelog log file and in the "collectd/csv/" files with the correct address: node#/channel#/DIMM#. Errors are detected and MCE statistics are updated.
| Removed because it's difficult to check as host is continuously rebooting. | | | |
9 | RAS memory plugin upon different Unix socket location | Medium | Change the socket location in mcelog.conf (socket-path = /var/run/mcelog-client) and in collectd.conf for the mcelog plugin to another location (default: McelogClientSocket "/var/run/mcelog-client"). Restart mcelog/collectd. Inject an error and check the statistics.
| Configuration changed. Socket is created, mcelog/collectd are running. Memory error is recorded to the total and 24 hour files for errors with the correct timestamp and location (node#/channel#/DIMM#) by the collectd RAS memory plugin. The same number of memory errors is recorded to mcelog and to the collectd logfile.
| Configuration changed. Socket is created, mcelog/collectd are running. Memory error is recorded to the total and 24 hour files for errors with the correct timestamp and location (node#/channel#/DIMM#) by the collectd RAS memory plugin. The same number of memory errors is recorded to mcelog and to the collectd logfile.
| | | |
10 | RAS memory plugin upon different log file location | Medium | Change the log file location in mcelog.conf (logfile = /var/log/newmcelog). Make sure data is not updated though socket-path is defined in mcelog.conf. Change the log file location in collectd.conf for the mcelog plugin to another location (McelogLogfile "/var/log/newmcelog"). Restart collectd. Inject an error and check the statistics.
| Configuration changed. Log file is created, mcelog/collectd are running. Memory error notifications dispatched by exec or python plugin. Stats not updated by CSV plugin!!!
| Log file is not created under new location. Memory error notifications dispatched by exec or python plugin. Stats not updated by CSV plugin!!!
| | | |
11 | RAS memory plugin started with "Plugin mcelog" section commented | | Comment out the "<Plugin mcelog>" section in collectd.conf. Start collectd. Inject a memory error.
| 2. Collectd started. Default path for socket, "McelogClientSocket" - "/var/run/mcelog-client". Default path for log file, "McelogLogfile" - "/var/log/mcelog". 3. Mcelog reports memory error to the "/var/log/mcelog" log file correctly. | 2. Collectd started. Socket is created under "/var/run/mcelog-client". 3. Mcelog reports memory error to the "/var/log/mcelog" log file, values are same as reported by collectd plugin. | | | |
12 | RAS memory plugin started with commented fields | | Comment out "McelogClientSocket" field in collectd.conf. Start collectd. Inject a memory error. Comment out "McelogLogfile" field in collectd.conf. Restart collectd. Inject a memory error.
| 2. Collectd started. Socket file is created under "/var/run/mcelog-client" location. Mcelog reports memory error to the "/var/log/mcelog" log file correctly. 4. Collectd started. Mcelog reports memory error to the "/var/log/mcelog" log file correctly. | 2. Collectd started. Socket file is created under "/var/run/mcelog-client" location. Mcelog reports memory error to the "/var/log/mcelog" log file correctly. 4. Collectd started. Mcelog reports memory error to the "/var/log/mcelog" log file correctly. | | | |
13 | RAS memory plugin data updated for new period (day) | Medium | Start mcelog and collectd. Inject corrected, uncorrected non-fatal and fatal errors. Wait for a new day to start. Inject corrected, uncorrected non-fatal and fatal errors.
| 2. Memory error is recorded to the total and 24 hour files for corrected errors by the collectd RAS memory plugin. The same number of memory errors is recorded to mcelog and to the collectd log file. 3. All corrected/uncorrected memory errors are copied from the previous day to the new day (YYYY-MM-DD) for the total timespan. For the 24h timespan, corrected/uncorrected memory error values are preserved for the previous day but set to zero for the new day. 4. Memory error is recorded for the new day only to the total and 24 hour files for corrected/uncorrected errors by the collectd RAS memory plugin. The same number of memory errors is recorded to mcelog and to the collectd log file. | 2. Memory error is recorded to the total and 24 hour files for corrected errors by the collectd RAS memory plugin. The same number of memory errors is recorded to mcelog and to the collectd log file. 3. All corrected/uncorrected memory errors are copied from the previous day to the new day (YYYY-MM-DD) for the total timespan. For the 24h timespan, corrected/uncorrected memory error values are preserved for the previous day but set to zero for the new day. 4. Memory error is recorded for the new day only to the total and 24 hour files for corrected/uncorrected errors by the collectd RAS memory plugin. The same number of memory errors is recorded to mcelog and to the collectd log file. | | | |
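
The expected values for test 13 can be computed ahead of time from the list of injections. The sketch below models only what the expected results describe: the total carries over from the previous day, while the 24 hour counter starts from zero on each new day. It is an illustrative model with hypothetical names, not the plugin's implementation, and it assumes the 24 hour value resets at the day boundary rather than following a sliding window.

```python
from datetime import date


def expected_counters(injections):
    """Map each day to its expected 'total' and '24h' error counts.

    injections: iterable of (date, number_of_errors) pairs; a day with
    zero injected errors can be listed explicitly to get its expected row.
    """
    per_day = {}
    for day, count in injections:
        per_day[day] = per_day.get(day, 0) + count

    expected = {}
    running_total = 0
    for day in sorted(per_day):
        running_total += per_day[day]  # total carries over between days
        expected[day] = {"total": running_total, "24h": per_day[day]}
    return expected


if __name__ == "__main__":
    plan = [(date(2016, 10, 24), 3), (date(2016, 10, 25), 0), (date(2016, 10, 26), 2)]
    for day, counters in expected_counters(plan).items():
        print(day.isoformat(), counters)
```

Comparing these expected rows with the last value in each day's csv file gives a direct pass/fail check for steps 2 to 4.
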
14 | RAS memory plugin data updated from emulated socket (non mcelog) | Medium | Configure the mcelog plugin to retrieve data from another socket (collectd.conf). Open a socket (using the mcelog emulator). Start collectd (the mcelog service must be stopped). Generate corrected/uncorrected errors through the created socket (using the mcelog emulator).
| 3. Collectd started. 4. Generated corrected/uncorrected memory errors are recorded correctly to the specified DIMM's location. The number of corrected/uncorrected errors retrieved from collectd is the same as the number generated. |