Anuket Project
Memory RAS Plugin Executed Tests
Manual RAS memory plugin testing:
Prerequisites:
Tests precondition:
- Collectd with RAS memory plugin (mcelog) is installed. Collectd plugins csv, mcelog are enabled. Mcelog service is started.
- mcelog and the error injection tools (mce-inject, mce-test, einj) are installed. Corrected memory errors were injected using the einj.ko module.
- The DUT's BIOS is supported by mcelog (BIOS vendor Intel Corp. is confirmed to be supported).
Installation details:
Environment details:
E1 - Bare Metal, U16.04.
Repo/branch used:
- Internal collectd repo: https://git-ger-6.devtools.intel.com/gerrit/gitweb?p=dpdk_xstats-collectd.git;tflink=projects.dpdk_xstats/scm.collectd.
- Plugin feature branch: feat_ras, master.
- Plugin dependency: mcelog.
Error injection details.
- Memory errors injected by mce-test(einj).
To inject corrected memory errors:
- Remove sb_edac and edac_core kernel modules:
$ rmmod sb_edac
$ rmmod edac_core
- Insert einj module: modprobe einj param_extension=1
- Inject an error by specifying details (last command should be repeated at least two times):
$ APEI_IF=/sys/kernel/debug/apei/einj
$ echo 0x8 > $APEI_IF/error_type
$ echo 0x01f5591000 > $APEI_IF/param1
$ echo 0xfffffffffffff000 > $APEI_IF/param2
$ echo 1 > $APEI_IF/notrigger
$ echo 1 > $APEI_IF/error_inject
To inject uncorrected non-fatal or fatal memory errors, change only error_type. For an uncorrected non-fatal error:
- $ echo 0x00000010 > $APEI_IF/error_type
- Corrected memory errors injected by mce-inject.
Install mce-inject as described here.
Load mce_inject module:
modprobe mce_inject
Edit file:
$ vi test/corrected
CPU 0 BANK 0
STATUS 0xcc00008000010090
ADDR 0x0010FFFFFFF
Inject an error:
$ mce-inject test/corrected
- Uncorrected (non-fatal, without reboot) memory error injected using mce-inject and mce-test:
$ mce-inject mce-test/cases/coverage/soft-inj/recoverable_ucr/data/srao_mem_scrub
- Uncorrected fatal memory error injected using einj:
$ echo 0x00000020 > $APEI_IF/error_type
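The einj steps above can be wrapped in a small helper so that the same sequence is reused for corrected and uncorrected injections. This is only a sketch: the function name inject_einj and its argument order are hypothetical, the default address and mask are the values used in this document, and writing to the real debugfs interface requires root and a loaded einj module.

```shell
#!/usr/bin/env bash
# Sketch only: wraps the manual EINJ steps from this page in one function.
# APEI_IF defaults to the debugfs path used above and can be overridden,
# for example to a scratch directory for a dry run.

# inject_einj ERROR_TYPE [ADDRESS] [MASK] [REPEAT]
#   ERROR_TYPE: 0x8 corrected, 0x10 uncorrected non-fatal, 0x20 uncorrected fatal
inject_einj() {
    local apei="${APEI_IF:-/sys/kernel/debug/apei/einj}"
    local error_type="$1"
    local address="${2:-0x01f5591000}"
    local mask="${3:-0xfffffffffffff000}"
    local repeat="${4:-2}"   # the last command is repeated at least twice
    local i

    [ -d "$apei" ] || { echo "EINJ interface not found: $apei" >&2; return 1; }

    echo "$error_type" > "$apei/error_type"
    echo "$address"    > "$apei/param1"
    echo "$mask"       > "$apei/param2"
    echo 1             > "$apei/notrigger"
    for ((i = 0; i < repeat; i++)); do
        echo 1 > "$apei/error_inject"
    done
}
```

For example, `inject_einj 0x8` would repeat the corrected-error injection shown above, and `inject_einj 0x10` the uncorrected non-fatal one.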
Mcelog collectd section:
LoadPlugin mcelog
<Plugin mcelog>
McelogClientSocket "/var/run/mcelog-client"
McelogLogfile "/var/log/mcelog"
</Plugin>
This configuration will change after the branch "feat_mcelog_mem_notification_level" is merged (for now, if everything is commented out, the default source is the socket):
#<Plugin mcelog>
# <Memory>
# McelogClientSocket "/var/run/mcelog-client"
# PersistentNotification false
# </Memory>
# McelogLogfile "/var/log/mcelog"
#</Plugin>
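Several test cases below check that the client socket exists at the configured location. A minimal precondition check could look like the following sketch; the function name mcelog_socket_ready is hypothetical, and the default path is the McelogClientSocket value shown above.

```shell
#!/usr/bin/env bash
# Sketch only: precondition check for the socket-based configuration.

# mcelog_socket_ready [SOCKET_PATH]
# Succeeds only if SOCKET_PATH exists and is a Unix domain socket.
mcelog_socket_ready() {
    local sock="${1:-/var/run/mcelog-client}"
    if [ -S "$sock" ]; then
        echo "mcelog client socket present: $sock"
    else
        echo "mcelog client socket missing: $sock" >&2
        return 1
    fi
}
```

When a different socket location is configured, the path can be passed as the first argument.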
RAS memory general test cases and result details.
# | Test case title | Priority | Steps | Expected result | Actual result | Status | Environment | Automation result |
---|---|---|---|---|---|---|---|---|
1 | RAS memory plugin configuration | High | | | | Pass | E1 | Pass |
2 | RAS memory plugin interval configuration | High | | | | Pass | E1 | Pass |
3 | RAS memory plugin mcelog liveness detection | High | | | 3. RAS memory collectd files are updated with new timestamps. After an error is injected into the DIMM, any new values are recorded. 4. systemd[1]: Stopped LSB: Machine Check Exceptions (MCE) collector & decoder. 5. pidof collectd, mcelog: 209386, 209318 | Pass | E1 | PASS (HAA-1195, Fixed) |
4 | RAS memory plugin upon collectd restart | High | | | | Pass | E1 | Pass |
5 | RAS memory plugin upon corrected errors injection | High | | | | Pass | E1 | Pass |
6 | RAS memory plugin upon uncorrected non-fatal errors injection | Medium | | | | Pass | E1 | Pass |
7 | RAS memory plugin upon uncorrected fatal errors injection | Medium | | | | Pass | E1 | NA |
8 | | | | | Removed because it's difficult to check as host is continuously rebooting. | Invalid | E1 | NA |
9 | RAS memory plugin upon different Unix socket location | Medium | | | | Pass | E1 | NA |
10 | RAS memory plugin upon different log file location | Medium | | | | TBD (PR's awaiting) | E1 | NA |
11 | RAS memory plugin started with "Plugin mce" section commented | High | | 2. Collectd started. Default path for socket, "McelogClientSocket" - "/var/run/mcelog-client". Default path for log file, "McelogLogfile" - "/var/log/mcelog". 3. Mcelog reports memory error to the "/var/log/mcelog" log file correctly. | 2. Collectd started. Socket is created under "/var/run/mcelog-client". 3. Mcelog reports memory error to the "/var/log/mcelog" log file, values are same as reported by collectd plugin. | Pass | E1 | NA |
12 | RAS memory plugin started with commented fields | High | | 2. Collectd started. Socket file is created under "/var/run/mcelog-client" location. Mcelog reports memory error to the "/var/log/mcelog" log file correctly. 4. Collectd started. Mcelog reports memory error to the "/var/log/mcelog" log file correctly. | 2. Collectd started. Socket file is created under "/var/run/mcelog-client" location. Mcelog reports memory error to the "/var/log/mcelog" log file correctly. 4. Collectd started. Mcelog reports memory error to the "/var/log/mcelog" log file correctly. | Pass | E1 | NA |
13 | RAS memory plugin data updated for new period (day) | Medium | | 2. Memory error recorded to total and 24 hour files for corrected errors by collectd RAS memory plugin. Same number of memory errors are recorded to mcelog and to collectd log file. 3. All memory corrected/uncorrected errors are copied from previous day to the new day (YYYY-MM-DD) for total timespan. All memory corrected/uncorrected errors for 24h timespan preserved values for previous day, but set to zero for a new day. 4. Memory error recorded for new day only to total and 24 hour files for corrected/uncorrected errors by collectd RAS memory plugin. Same number of memory errors are recorded to mcelog and to collectd log file. | 2. Memory error recorded to total and 24 hour files for corrected errors by collectd RAS memory plugin. Same number of memory errors are recorded to mcelog and to collectd log file. 3. All memory corrected/uncorrected errors are copied from previous day to the new day (YYYY-MM-DD) for total timespan. All memory corrected/uncorrected errors for 24h timespan preserved values for previous day, but set to zero for a new day. 4. Memory error recorded for new day only to total and 24 hour files for corrected/uncorrected errors by collectd RAS memory plugin. Same number of memory errors are recorded to mcelog and to collectd log file. | Pass | E1 | NA |
14 | RAS memory plugin data updated from emulated socket (non mcelog) | Medium | | 3. Collectd started. 4. Generated memory corrected/uncorrected errors are recorded correctly to specified DIMM's location. Number of corrected/uncorrected errors is same retrieved from collectd and generated. | 3. Collectd started. 4. Generated memory corrected/uncorrected errors are recorded correctly against DIMM's location. Number of corrected/uncorrected errors is same retrieved from collectd and generated. | TBD | E1 | NA |
15 | RAS memory plugin events are received | High | | 3. Mcelog running. Collectd started without errors in syslog. Notification(s) recorded every time interval for corrected/uncorrected memory errors. 4. Number of notification(s) increased and is the same as retrieved by mcelog, and updated every time interval for corrected/uncorrected memory errors. 5. Number of notification(s) increased and is the same as retrieved by mcelog, and updated every time interval for corrected/uncorrected memory errors. | 3. Mcelog running. Collectd started without errors in syslog. Notification(s) recorded every time interval for corrected/uncorrected memory errors. 4. Number of notification(s) increased and is the same as retrieved by mcelog, and updated every time interval for corrected/uncorrected memory errors. 5. Number of notification(s) increased and is the same as retrieved by mcelog, and updated every time interval for corrected/uncorrected memory errors. | Pass | E1 | NA |
16 | RAS memory plugin events are received every 5-10 ms | High | | 2. Mcelog running. Collectd started without errors in syslog. Notification(s) recorded every time interval for corrected/uncorrected memory errors. 3. Number of notification(s) increased and is the same as retrieved by mcelog, and updated every time interval for corrected/uncorrected memory errors. | 2. Mcelog running. Collectd started without errors in syslog. Notification(s) recorded every time interval for corrected/uncorrected memory errors. 3. Number of notification(s) increased and is the same as retrieved by mcelog, and updated every time interval for corrected/uncorrected memory errors. Time needed to process a notification: corrected: 79-116 = 37 ms / 108-148 = 40 ms; uncorrected: 134-160 = 26 ms / 162-188 = 26 ms | TBD | E1 | NA |
17 | RAS memory plugin events are received according to the interval with persist notification option enabled | High | | 4. Mcelog running. Collectd started without errors in syslog. 5. Notifications recorded every time interval for corrected and uncorrected memory errors. | 4. Mcelog running. Collectd started without errors in syslog. 5. Notifications recorded every time interval for corrected and uncorrected memory errors. | Pass (dev testing on branch feat_mcelog_mem_notification_level) | E1 | NA |
18 | RAS memory plugin events are received according to the interval with persist notification option disabled | High | | 4. Mcelog running. Collectd started without errors in syslog. 5. Notifications recorded only once per each error injection for corrected and uncorrected memory errors. | 4. Mcelog running. Collectd started without errors in syslog. 5. Notifications recorded only once per each error injection for corrected and uncorrected memory errors. | Pass (dev testing on branch feat_mcelog_mem_notification_level) | E1 | NA |
19 | RAS memory plugin configuration: memory socket and log file (McelogLogfile) are mutually exclusive | High | | 2. Enabling memory socket and log file is prohibited. Error should be received and plugin exited. | 2. Enabling memory socket and log file is prohibited. Error should be received and plugin exited. Collectd failed on config stage, no other plugins loaded. | Pass (dev testing on branch feat_mcelog_mem_notification_level) | E1 | NA |
20 | RAS memory plugin notifications read from log file | High | | 2. Mcelog running. Collectd started without errors in syslog. 3. Notifications about corrected and uncorrected errors are sent once per injection. | | TBD | E1 | NA |
21 | RAS memory plugin notifications read from log file and dispatched once regardless of “PersistentNotification” | Medium | | 2. Mcelog running. Collectd started without errors in syslog. 3. Notifications about errors are sent once per injection of corrected/uncorrected error. 4. Notifications about errors are sent once per injection of corrected/uncorrected error. | | TBD | E1 | NA |
21 | RAS memory plugin notifications severity sent from socket | High | | 2. Mcelog running. Collectd started without errors in syslog. 3. Notification about corrected error is sent with Warning severity. 4. Notification about uncorrected error is sent with Failure severity. | | TBD | E1 | NA |
22 | RAS memory plugin notifications severity sent from logfile. | High | | 2. Mcelog running. Collectd started without errors in syslog. 3. Notification about corrected error is sent with Warning severity. 4. Notification about uncorrected error is sent with Failure severity. | | TBD | E1 | NA |
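Test case 13 compares the values that collectd writes to the total and 24 hour files with the counts reported by mcelog. Reading the latest value from a collectd csv file can be sketched as below. The function name latest_csv_value is hypothetical, and the sketch assumes the csv plugin layout of an "epoch,value" header line followed by one "timestamp,value" row per interval.

```shell
#!/usr/bin/env bash
# Sketch only: print the most recent value from a collectd csv plugin file.
# Assumes a header line ("epoch,value") followed by "timestamp,value" rows.

# latest_csv_value FILE
latest_csv_value() {
    local file="$1"
    [ -r "$file" ] || { echo "cannot read $file" >&2; return 1; }
    # Skip the header, remember column 2 of every data row, print the last one.
    # Fails when the file holds no data rows yet.
    awk -F, 'NR > 1 { v = $2 } END { if (NR > 1) print v; else exit 1 }' "$file"
}
```

The file to pass depends on the csv plugin DataDir and on the DIMM instance names of the DUT, so the path is left to the tester. The returned value can then be compared with the count reported by mcelog for the same DIMM.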
SNMP RAS memory test cases for manual execution
Q & A:
- Is it expected to have no previous mcelog logs after reboot?
- N/A
- Does snmp-agent plugin depend on snmpd?
- N/A
Manual test results:
# | High level scenario description | Steps to be executed | Expected Result | Test result | Comments | Automated |
---|---|---|---|---|---|---|
1 | Positive scenario snmp-agent plugin configuration. | 1. Enable mcelog and snmp-agent plugins in collectd.conf. 2. Configure snmp-agent to run in various snmp versions (v1, v2c, v3). | Collectd runs as expected with correctly applied configuration settings for the snmp-agent plugin. Collectd service exits normally on service stop. | PASS | Yes(under review) | |
2 | Negative scenario snmp-agent plugin configuration. | 1. Enable mcelog and snmp-agent plugins in collectd.conf. 2. Configure snmp-agent incorrectly (list of options TBD when plugin is available). | Collectd logs an error message against the snmp-agent plugin. Collectd service starts, runs and exits normally only if no service-affecting misconfiguration occurred. Otherwise collectd fails to start with rc=1. | |||
3 | Verify snmp-agent plugin reports corrected errors collected by enabled mcelog plugin | 1. Run collectd with enabled mcelog and snmp-agent plugins. | Collectd service starts and runs normally. | PASS | Yes(under review) | |
2. Get memory errors summary using mcelog utility. | Get initial number of corrected errors. | |||||
3. Get corrected memory errors number using snmpget utility, within an interval time window. | Verify that initial values taken from two sources are the same. | |||||
4. During 5 intervals, monitor if data changes. | Verify that data does not change without errors injection. | |||||
5. Inject 1 or few corrected errors. | ||||||
6. Get memory errors summary using mcelog utility. | Get current number of corrected errors. Verify the counter difference corresponds to the number of injected corrected errors. | |||||
7. Get corrected memory errors value using snmpget utility, within an interval time window. | Verify the value is same as one retrieved from mcelog. | |||||
4 | Verify snmp-agent plugin reports timed out corrected errors collected by enabled mcelog plugin | 1. Run collectd with enabled mcelog and snmp-agent plugins. | Collectd service starts and runs normally. | PASS | Yes(under review) | |
2. Get memory errors summary using mcelog utility. | Get initial number of timed out corrected errors, note the date. | |||||
3. Get timed out corrected memory errors value using snmpget utility, within an interval time window. | Verify that initial values taken from two sources are the same, and belong to same date. | |||||
4. During 5 intervals, monitor if data changes. | Verify that data does not change without errors injection. | |||||
5. Inject 1 or few corrected errors. | ||||||
6. Get memory errors summary using mcelog utility, and the corresponding date. | Verify the counter difference corresponds to the number of timed out corrected errors for this specific date. | |||||
7. Get timed out corrected memory errors value using snmpget utility, within an interval time window. | Verify the value is same as one retrieved from mcelog. | |||||
5 | Verify snmp-agent plugin reports uncorrected errors collected by enabled mcelog plugin | 1. Run collectd with enabled mcelog and snmp-agent plugins. | Collectd service starts and runs normally. | PASS | Yes(under review) | |
2. Get memory errors summary using mcelog utility. | Get initial number of uncorrected errors. | |||||
3. Get uncorrected memory errors number using snmpget utility, within an interval time window. | Verify that initial values taken from two sources are the same. | |||||
4. During 5 intervals, monitor if data changes. | Verify that data does not change without errors injection. | |||||
5. Inject an uncorrected error. | Verify that it causes system reset, but system is available again after OS restart. | |||||
6. Get memory errors summary using mcelog utility. | Verify the injected uncorrected error was logged. | |||||
7. Get uncorrected memory errors value using snmpget utility, within an interval time window. | Verify the value is same as one retrieved from mcelog. | |||||
6 | Verify snmp-agent plugin reports timed out uncorrected errors collected by enabled mcelog plugin | 1. Run collectd with enabled mcelog and snmp-agent plugins. | Collectd service starts and runs normally. | PASS | Yes(under review) | |
2. Get memory errors summary using mcelog utility. | Get initial number of timed out uncorrected errors. | |||||
3. Get timed out uncorrected memory errors number using snmpget utility, within an interval time window. | Verify that initial values taken from two sources are the same. | |||||
4. During 5 intervals, monitor if data changes. | Verify that data does not change without errors injection. | |||||
5. Inject an uncorrected error. | Verify that it causes system reset, but system is available again after OS restart. | |||||
6. Get memory errors summary using mcelog utility. | Verify the injected uncorrected error was logged. | |||||
7. Get timed out uncorrected memory errors value using snmpget utility, within an interval time window. | Verify the value is same as one retrieved from mcelog. | |||||
7 | Verify snmp-agent plugin behavior when snmpd service is stopped | 1. Run collectd with enabled mcelog and snmp-agent plugins. 2. Stop snmpd: service snmpd stop | User cannot send snmpwalk requests. Message appears in log/syslog: "Warning: Failed to connect to the agentx master agent ([NIL])" | PASS | ||
8 | Verify that snmp-agent plugin does not report any data when mcelog plugin is disabled | 1. Run collectd with enabled snmp-agent and disabled mcelog plugin. | Collectd service starts and runs normally. | PASS | Yes(under review) | |
2. Get memory errors summary using mcelog utility. | Get initial number of timed out uncorrected errors. | |||||
3. Get memory errors number using snmpwalk utility, within an interval time window. | Verify that no data is returned, but only an error message. | |||||
9 | Verify correct behavior of snmp-agent collectd plugin when mcelog plugin is enabled but mcelog service is stopped | 1. Stop mcelog: service mcelog stop 2. Run collectd with enabled snmp-agent and mcelog plugin. | Error is raised: "Failed to connect to mcelog server. Connection refused" | PASS | Yes(under review) | |
10 | Verify correct behavior of snmp-agent collectd plugin when mcelog plugin is enabled but mcelog service is restarted | 1. Run collectd with enabled snmp-agent and mcelog plugin. 2. Restart mcelog service. 3. Trigger another count of errors. 4. Verify that mcelog snmp values are equal to triggered errors count. | Mcelog snmp values are equal to triggered errors count. | PASS | Yes(under review) | |
11 | Verify snmp-agent plugin behavior when snmpd service is restarted. | | TBD | TBD |
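Step 4 of scenarios 3 to 6 ("During 5 intervals, monitor if data changes") can be sketched as a small polling helper that runs a query command several times and fails as soon as the output differs. The function name value_stays_constant is hypothetical, and the query command is supplied by the caller.

```shell
#!/usr/bin/env bash
# Sketch only: run a query COUNT times, INTERVAL_SECONDS apart, and verify
# that its output never changes (no errors injected in the meantime).

# value_stays_constant COUNT INTERVAL_SECONDS COMMAND [ARGS...]
# Prints the stable value on success; returns 1 if the value changed,
# 2 if the query command itself failed.
value_stays_constant() {
    local count="$1" interval="$2"
    shift 2
    local first current i
    first="$("$@")" || return 2
    for ((i = 1; i < count; i++)); do
        sleep "$interval"
        current="$("$@")" || return 2
        if [ "$current" != "$first" ]; then
            echo "value changed: '$first' -> '$current'" >&2
            return 1
        fi
    done
    echo "$first"
}
```

A possible invocation, assuming a collectd interval of 10 seconds, an SNMP v2c community named public, and an OID variable set by the tester (all three are assumptions, not values from this page): `value_stays_constant 5 10 snmpget -v2c -c public -Oqv localhost "$OID"`.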