Memory RAS Plugin Executed Tests

Anuket Project


 

Manual RAS memory plugin testing:

Prerequisites:

Tests precondition:

 

Installation details:

 

Environment details:

E1 - Bare Metal, U16.04.

 

Repo/branch used:

 

Error injection details.

  1. Memory errors injected by mce-test(einj).
    To inject corrected memory errors:

    1. Remove the sb_edac and edac_core kernel modules: rmmod sb_edac && rmmod edac_core

    2. Insert einj module: modprobe einj param_extension=1

    3. Inject an error by specifying its details (repeat the last command at least twice):
      $ APEI_IF=/sys/kernel/debug/apei/einj
      $ echo 0x8 > $APEI_IF/error_type
      $ echo 0x01f5591000 > $APEI_IF/param1
      $ echo 0xfffffffffffff000 > $APEI_IF/param2
      $ echo 1 > $APEI_IF/notrigger
      $ echo 1 > $APEI_IF/error_inject

    Check the MCE statistics: mcelog --client. Check the mcelog log for details of the injected error: less /var/log/mcelog.

    To inject uncorrected non-fatal or uncorrected fatal memory errors, change only error_type:

    1. Uncorrected non-fatal: $ echo 0x00000010 > $APEI_IF/error_type

    2. Uncorrected fatal: $ echo 0x00000020 > $APEI_IF/error_type
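The EINJ steps above can be wrapped in a small shell function. This is only a minimal sketch and not part of the original procedure: the function name einj_inject, its argument order and the repeat count parameter are assumptions made for illustration, while the file names and values are the ones used above. The interface directory is passed in as an argument so the function can be dry-run against a scratch directory.

```shell
# Sketch: write EINJ parameters and trigger the injection.
# Hypothetical helper (not part of mce-test). Arguments:
#   $1  EINJ interface directory (on the DUT: /sys/kernel/debug/apei/einj)
#   $2  error type (0x8 corrected, 0x10 uncorrected non-fatal, 0x20 uncorrected fatal)
#   $3  param1 (physical address)
#   $4  param2 (address mask)
#   $5  how many times to repeat the final injection step (default: 2)
einj_inject() {
    local apei_if="$1" error_type="$2" param1="$3" param2="$4" repeat="${5:-2}"
    local i=0

    echo "$error_type" > "$apei_if/error_type" || return 1
    echo "$param1"     > "$apei_if/param1"     || return 1
    echo "$param2"     > "$apei_if/param2"     || return 1
    echo 1             > "$apei_if/notrigger"  || return 1

    # The last command is repeated, as the procedure above asks.
    while [ "$i" -lt "$repeat" ]; do
        echo 1 > "$apei_if/error_inject" || return 1
        i=$((i + 1))
    done
}
```

On the test machine this would be run as root with the einj module loaded, for example: einj_inject /sys/kernel/debug/apei/einj 0x8 0x01f5591000 0xfffffffffffff000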

  2. Corrected memory errors injected by mce-inject.
    Install mce-inject as described here.
    Load mce_inject module:
        modprobe mce_inject
    Edit file:
        $ vi test/corrected
        CPU 0 BANK 0
        STATUS 0xcc00008000010090
        ADDR 0x0010FFFFFFF
    Inject an error:
        mce-inject test/corrected

  3. Uncorrected (non-fatal, without reboot) memory error injected using mce-inject and mce-test.
        $ mce-inject mce-test/cases/coverage/soft-inj/recoverable_ucr/data/srao_mem_scrub
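The input file for mce-inject shown in item 2 can also be generated rather than edited by hand. The helper below is a sketch under that assumption; the name write_mce_file and its argument order are hypothetical, and it only emits the three fields (CPU/BANK, STATUS, ADDR) that appear in the corrected-error example above.

```shell
# Sketch: write an mce-inject input file with the fields used above.
# Hypothetical helper. Arguments:
#   $1  output file
#   $2  CPU number
#   $3  bank number
#   $4  STATUS value
#   $5  ADDR value
write_mce_file() {
    local out="$1" cpu="$2" bank="$3" status="$4" addr="$5"
    printf 'CPU %s BANK %s\nSTATUS %s\nADDR %s\n' \
        "$cpu" "$bank" "$status" "$addr" > "$out"
}
```

For example, write_mce_file test/corrected 0 0 0xcc00008000010090 0x0010FFFFFFF reproduces the file from item 2, which can then be passed to mce-inject.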

Mcelog collectd section:

LoadPlugin mcelog
<Plugin mcelog>
  McelogClientSocket "/var/run/mcelog-client"
  McelogLogfile "/var/log/mcelog"
</Plugin>

The section will change as follows after the branch "feat_mcelog_mem_notification_level" is merged (for now, if everything is commented out, the socket is used by default):

#<Plugin mcelog>
#  <Memory>
#    McelogClientSocket "/var/run/mcelog-client"
#    PersistentNotification false
#  </Memory>
#  McelogLogfile "/var/log/mcelog"
#</Plugin>
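When checking which socket and log file the plugin is actually configured with, it can help to read the values straight from collectd.conf. The following is a rough sketch, assuming one option per line as in the sections above; the function name mcelog_option is hypothetical. It skips commented lines, so it returns nothing when the whole section is commented out, and it also works when the option sits inside a nested block of the mcelog section.

```shell
# Sketch: print the value of an option inside the <Plugin mcelog> section.
# Hypothetical helper. Arguments:
#   $1  path to collectd.conf
#   $2  option name, e.g. McelogClientSocket or McelogLogfile
mcelog_option() {
    awk -v opt="$2" '
        /^[ \t]*#/                          { next }              # ignore commented lines
        /^[ \t]*<Plugin[ \t]+"?mcelog"?>/   { inside = 1; next }  # section start
        /^[ \t]*<\/Plugin>/                 { inside = 0; next }  # section end
        inside && $1 == opt {
            val = $2
            gsub(/"/, "", val)                                    # strip the quotes
            print val
            exit
        }
    ' "$1"
}
```

For example, mcelog_option collectd.conf McelogClientSocket prints the configured socket path, or nothing if the option is commented out.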

 

  1. RAS memory general test cases and result details.

#

Test case title

Priority

Steps

Expected result

Actual result

Status

Environment

Automation result

1

RAS memory plugin configuration

High

  1. Enable the RAS memory plugin by uncommenting the 'mcelog' related lines in collectd.conf.

  2. Start collectd: "./collectd -C collectd.conf -f" or start as a service "service collectd start".

  3. Open collectd csv path, like: "collectd/csv/<DUT>/…".

  4. Stop collectd: "pkill collectd" or "service collectd stop". Comment out RAS memory collectd plugin in "collectd.conf" file (mcelog). Delete existing collectd csv files under "collectd/csv" path. Start collectd.

  5. Stop collectd: "pkill collectd" or "service collectd stop". Uncomment RAS memory collectd plugin in "collectd.conf" file (mcelog). Start collectd.

  1. File is changed.

  2. Verify collectd is running: "pidof collectd" returns process ID or "service collectd status" service is running.

  3. RAS memory collectd files exist: total and 24-hour files for corrected and uncorrected errors. RAS-related collectd files are updated at the interval set in "collectd.conf".

  4. After collectd starts, the RAS-related collectd files are not created or updated.

  5. RAS memory collectd files exist: total and 24-hour files for corrected and uncorrected errors. RAS-related collectd files are updated at the interval set in "collectd.conf".

  1. File is changed.

  2. collectd is running.

  3. mcelog-SOCKET_0_CHANNEL_0_DIMM_0_DIMM_A1
    mcelog-SOCKET_0_CHANNEL_2_DIMM_0_DIMM_C1
    mcelog-SOCKET_1_CHANNEL_0_DIMM_0_DIMM_E1
    mcelog-SOCKET_1_CHANNEL_2_DIMM_0_DIMM_G1
    mcelog-SOCKET_0_CHANNEL_0_DIMM_any
    mcelog-SOCKET_0_CHANNEL_3_DIMM_0_DIMM_D1
    mcelog-SOCKET_1_CHANNEL_0_DIMM_any
    mcelog-SOCKET_1_CHANNEL_3_DIMM_0_DIMM_H1
    mcelog-SOCKET_0_CHANNEL_1_DIMM_0_DIMM_B1
    mcelog-SOCKET_0_CHANNEL_any_DIMM_any
    mcelog-SOCKET_1_CHANNEL_1_DIMM_0_DIMM_F1
    mcelog-SOCKET_1_CHANNEL_any_DIMM_any

  4. Files are not updated.

  5. Files are updated with new values (timestamp and errors).

 Pass

E1

Pass
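The check in step 3 of this test case (that the per-DIMM entries exist) could be scripted along the following lines. This is a sketch that assumes the mcelog-SOCKET_... entries listed in the actual result are directories directly under the host path of the csv output; the function name list_mcelog_dirs is hypothetical.

```shell
# Sketch: list the mcelog-SOCKET_* directories found under a csv host path.
# Hypothetical helper. Argument:
#   $1  host directory under the csv output path, e.g. collectd/csv/<DUT>
list_mcelog_dirs() {
    local d
    for d in "$1"/mcelog-SOCKET_*; do
        # An unmatched glob stays literal, so confirm the entry really exists.
        if [ -d "$d" ]; then
            basename "$d"
        fi
    done
}
```

An empty result after collectd has been running for at least one interval would correspond to the "files are not created" outcome expected in step 4.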

2

RAS memory plugin interval configuration

High

  1. Open "collectd.conf" file to check the collectd update interval.

  2. Open collectd csv path, like: "collectd/csv/<DUT>/mce_log…".

  3. Change the interval in "collectd.conf" to 60 (seconds). Inject a few memory errors.

  4. Change the interval to other values in the range 1-300 seconds. Inject a few memory errors.

  1. Find line "Interval     <number>".

  2. RAS memory collectd files exist: total and 24-hour files for corrected and uncorrected errors. RAS memory collectd files are updated at the interval set in "collectd.conf".

  3. RAS memory collectd files are updated every 60 seconds.

  4. RAS memory collectd files are updated every set interval.

  1. The default is 10 seconds.

  2. Timestamps are updated every 10 seconds.

  3. RAS memory collectd files are updated every 60 seconds.

  4. Works correctly for 30, 60 and 300 seconds.

 Pass

E1

Pass
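The interval checks in this test case amount to comparing the gaps between consecutive timestamps in a csv data file with the configured Interval. A minimal sketch follows, assuming the usual csv plugin layout of a header line followed by "<epoch>,<value>" rows; the function name csv_intervals is hypothetical.

```shell
# Sketch: print the gaps (in whole seconds) between consecutive samples.
# Hypothetical helper. Argument:
#   $1  a collectd csv data file whose first column is an epoch timestamp
csv_intervals() {
    awk -F, '
        $1 !~ /^[0-9]/ { next }                   # skip the header line
        seen { printf "%d\n", $1 - prev + 0.5 }   # round to the nearest second
        { prev = $1; seen = 1 }
    ' "$1"
}
```

Piping the output through sort -u should leave a single value equal to the configured interval (for example 60) if every update arrived on time.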

3

RAS memory plugin mcelog liveness detection

High

  1. Verify collectd, mcelog are running.

  2. Stop mcelog service.

  3. Start mcelog service. Restart collectd if needed.

  4. Terminate mcelog (pkill mcelog).

  5. Restart mcelog service.

  6. Repeat test three times. 

  1. Collectd, mcelog are running.

  2. Service mcelog is stopped. Appropriate messages are printed to syslog with correct severity by collectd RAS memory plugin.

  3. Collectd and mcelog are running. RAS memory collectd files are updated with interval set in "collectd.conf".

  4. The mcelog service has exited. Appropriate messages are printed to syslog (TBD) with correct severity by RAS memory collectd plugin.

  5. Collectd and mcelog are running. RAS memory collectd related files are updated with interval set in "collectd.conf".

  6. RAS memory collectd plugin is stopped/started, messages about this are printed. 

  1. pidof mcelog, collectd: 207803, 207791

  2. syslog messages:
    collectd[207791]: mcelog: Connection to socket is broken
    collectd[207791]: plugin_dispatch_notification: severity = 1; message = Connection to mcelog socket is broken.; time = 1477301194.912; host = silpixa00378251;
    collectd[207791]: plugin_read_thread: Handling `mcelog'. mcelog: mcelog_read
    collectd[207791]: mcelog: MACHINE CHECK INFO NOT AVAILABLE
    collectd[207791]: plugin_read_thread: read-function of the `mcelog' plugin took 0.000027 seconds.
    collectd[207791]: plugin_read_thread: Effective interval of the `mcelog' plugin is 30.000 seconds.
    collectd[207791]: plugin_read_thread: Next read of the `mcelog' plugin at 1477301754.617.

  3. RAS memory collectd files are updated with new timestamps. After an error is injected to DIMM any, new values are recorded.

  4. systemd[1]: Stopped LSB: Machine Check Exceptions (MCE) collector & decoder.

  5. pidof collectd, mcelog: 209386, 209318

 

Pass

E1

Pass

(HAA-1195, Fixed)
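The liveness check in this test case relies on spotting the plugin's "Connection to socket is broken" message in syslog. The sketch below counts such lines in a log file; the function name count_socket_broken is hypothetical, the pattern is taken from the syslog excerpt in the actual result above, and the log file location is an assumption that depends on the distribution.

```shell
# Sketch: count collectd log lines that report a lost mcelog connection.
# Hypothetical helper. Argument:
#   $1  log file to scan (for example /var/log/syslog; location varies)
# Note: grep -c prints 0 and returns a non-zero status when nothing matches.
count_socket_broken() {
    grep -c 'collectd\[[0-9]*]: mcelog: Connection to socket is broken' "$1"
}
```

Comparing the count before and after stopping mcelog gives a quick indication of whether the plugin noticed the lost connection.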

4

RAS memory plugin upon collectd restart

High

  1. Enable the RAS memory plugin by uncommenting the 'mcelog' related lines in collectd.conf.

  2. Start collectd: "./collectd -C collectd.conf -f" or start as a service "service collectd start".

  3. Open collectd csv path, like: "collectd/csv/<DUT>/…".

  4. Stop collectd: "pkill collectd" or "service collectd stop". 

  5. Repeat test three times.

  1. File is changed.

  2. Verify collectd is running: "pidof collectd" returns process ID or "service collectd status" service is running.

  3. Collectd RAS memory files exist: total and 24-hour files for corrected and uncorrected errors. RAS-related collectd files are updated at the interval set in "collectd.conf".

  4. Verify collectd is not running: "pidof collectd" returns nothing or "service collectd status" service is stopped. Collectd RAS related files are not updated.

  5. Collectd is functioning correctly. Collectd RAS memory related data is updated in time.

  1. Success

  2. Collectd service started, mcelog plugin init and read callback calls present in syslog.

  3. Mcelog appends data to log files with defined interval.

  4. Collectd service is stopped. Logs of mcelog are not updated anymore.

  5. Repeating previous steps reproduces same behavior.

Pass

E1

Pass

5

RAS memory plugin upon corrected errors injection

High

  1. Inject a correctable memory error.

    $ cat mytest/corrected
    CPU 0 BANK 0
    STATUS 0xcc00008000010090
    ADDR 0x0010FFFFFFF

    $ ./mce-inject mytest/corrected

  2. Make sure no errors are injected. Wait for a while.

  3. Repeat test for other correctable memory errors.

  1. The collectd RAS memory plugin records the memory error in the total and 24-hour files for corrected errors with the correct timestamp and location (node#/channel#/DIMM#). The same number of memory errors is recorded in mcelog and in the collectd log file.

  2. No new memory errors are recorded in mcelog or in the collectd csv plugin files related to RAS memory.

  3. Same as in step#1.

  1. The error is logged to /var/log/mcelog and to collectd, and the values are the same. The same number of memory errors is recorded in mcelog and in the collectd log file. No server reboot observed.

  2. During idle time there are no new records in the mcelog plugin log files or in the log written by mcelog itself. The unchanged value is written to collectd at the correct time interval.

  3. Once new errors are injected, counting and logging occurs as expected.

Pass

E1

Pass
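Comparing the counts reported by mcelog --client with what collectd recorded requires reading the most recent value from a csv data file. A minimal sketch, assuming rows of the form "<epoch>,<value>" after a header line; the function name csv_last_value is hypothetical.

```shell
# Sketch: print the most recent value stored in a collectd csv data file.
# Hypothetical helper. Argument:
#   $1  csv data file with rows of the form "<epoch>,<value>"
csv_last_value() {
    awk -F, '$1 ~ /^[0-9]/ { last = $2 } END { print last }' "$1"
}
```

After an injection, the value printed for the corrected-errors total file should match the total reported by mcelog for the same DIMM location.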

6

RAS memory plugin upon uncorrected non-fatal errors injection

 Medium

  1. Inject an uncorrectable non-fatal memory error.
    $ cat mytest/uncorrected_nonfatal
    CPU 0 BANK 2
    STATUS UNCORRECTED SRAO 0xc0
    MCGSTATUS RIPV MCIP
    ADDR 0x1234
    MISC 0x8c
    RIP 0x73:0x1eadbabe
    $ ./mce-inject mytest/uncorrected_nonfatal

  2. Make sure no errors are injected. Wait for a while.

  3. Repeat error injection three times.

  1. The collectd RAS memory plugin records the memory error in the total and 24-hour files for uncorrected errors with the correct timestamp and location (node#/channel#/DIMM#). The same number of memory errors is recorded in mcelog and in the collectd log file.
    Note: error injection may cause a system reboot.

  2. No new memory errors are recorded in mcelog or in the collectd csv plugin files related to RAS memory.

  3. Same as in step#1.

  1. The error is logged to /var/log/mcelog and to collectd, and the values are the same. The same number of memory errors is recorded in mcelog and in the collectd log file. No server reboot observed.

  2. During idle time there are no new records in the mcelog plugin log files or in the log written by mcelog itself. The unchanged value is written to collectd at the correct time interval.

  3. Once new errors are injected, counting and logging occurs as expected.

Pass

E1

Pass

7

RAS memory plugin upon uncorrected fatal errors injection

 Medium

  1. Inject an uncorrectable fatal memory error.
    $ cat mytest/uncorrected_fatal
    CPU 0 BANK 2
    STATUS UNCORRECTED SRAO 0xc0
    MCGSTATUS MCIP
    ADDR 0x1234
    MISC 0x8c
    $ ./mce-inject mytest/uncorrected_fatal

  2. Check server behavior.

  3. Repeat step#1 again (same or different memory error).

  1. The collectd RAS memory plugin records the memory error in the total and 24-hour files for uncorrected errors with the correct timestamp and location (node#/channel#/DIMM#). The same number of memory errors is recorded in mcelog and in the collectd log file.
    Note: error injection may cause a system reboot.

  2. No new memory errors are recorded in mcelog or in the collectd csv plugin files related to RAS memory.

  3. Same as in step#1.

  1. Server is rebooted. The uncorrected error is detected by mcelog and, once the server is up, logged by collectd against the correct DIMM location. Collectd files do not preserve statistics after error injection and reboot!
    Collectd files do not preserve statistics once mcelog is restarted!

  2. During idle time there are no new records in the mcelog plugin log files or in the log written by mcelog itself. The unchanged value is written to collectd at the correct time interval.

  3. Once new errors are injected, counting and logging occurs as expected.

Pass

E1

NA 

8

RAS memory plugin MCE detection on faulty DIMM

 Low

  1. Prepare a server with a faulty DIMM installed in a specific slot.

  2. Wait for expected memory errors. Check for RAS memory errors in mcelog and in collectd csv files.

  3. Repeat observation for a while, overnight.

  1. Start the server.

  2. Errors are registered in mcelog log file and in "collectd/csv/" files with correct address: node#/channel#/DIMM#.

  3. Errors are detected and MCE statistic is updated.

Removed because it's difficult to check as host is continuously rebooting.

Invalid

E1

NA

9

RAS memory plugin upon different Unix socket location

 Medium

  1. Change socket location in mcelog.conf (socket-path = /var/run/mcelog-client) and collectd.conf for mcelog plugin to other location (default: McelogClientSocket "/var/run/mcelog-client"). Restart mcelog/collectd.

  2. Inject an error and check the statistic.

  1. Configuration changed. Socket is created, mcelog/collectd are running.

  2. The collectd RAS memory plugin records the memory error in the total and 24-hour files with the correct timestamp and location (node#/channel#/DIMM#). The same number of memory errors is recorded in mcelog and in the collectd log file.

  1. Configuration changed. Socket is created, mcelog/collectd are running.

  2. The collectd RAS memory plugin records the memory error in the total and 24-hour files with the correct timestamp and location (node#/channel#/DIMM#). The same number of memory errors is recorded in mcelog and in the collectd log file.

Pass

E1

NA 
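For the socket location test it is useful to confirm that mcelog.conf and collectd.conf point at the same path. The sketch below extracts the active socket-path setting from mcelog.conf, assuming the "socket-path = <path>" form quoted in step 1; the function name mcelog_socket_path is hypothetical, and the location of mcelog.conf on the DUT is left to the caller.

```shell
# Sketch: print the active (uncommented) socket-path setting from mcelog.conf.
# Hypothetical helper. Argument:
#   $1  path to mcelog.conf
mcelog_socket_path() {
    awk -F= '
        /^[ \t]*#/ { next }                       # ignore commented settings
        $1 ~ /^[ \t]*socket-path[ \t]*$/ {
            val = $2
            gsub(/^[ \t]+|[ \t]+$/, "", val)      # trim surrounding blanks
            path = val                            # the last setting wins
        }
        END { if (path != "") print path }
    ' "$1"
}
```

The printed path can be compared with the McelogClientSocket value in collectd.conf, and [ -S "$path" ] confirms that the socket was actually created after mcelog restarted.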

10

RAS memory plugin upon different log file location

Medium

  1. Change the log file location in mcelog.conf (logfile = /var/log/newmcelog). Make sure data is not updated through the socket-path defined in mcelog.conf.

  2. Change log file location in collectd.conf for mcelog plugin to other location (McelogLogfile "/var/log/newmcelog"). Restart collectd.

  3. Inject an error and check the statistic.

  1. Configuration changed. Log file is created, mcelog/collectd are running.

  2. Memory error notifications are dispatched by the exec or python plugin. Statistics are not updated by the CSV plugin!

  1. Log file is not created under new location.

  2. Memory error notifications are dispatched by the exec or python plugin. Statistics are not updated by the CSV plugin!

TBD

(awaiting PRs)

 E1

NA

11

RAS memory plugin started with "Plugin mcelog" section commented

High

  1. Comment out "<Plugin mcelog>" section in collectd.conf.

  2. Start collectd.

  3. Inject a memory error.

  2. Collectd started.
    Default path for the socket, "McelogClientSocket": "/var/run/mcelog-client".
    Default path for the log file, "McelogLogfile": "/var/log/mcelog".

  3. Mcelog reports the memory error to the "/var/log/mcelog" log file correctly.

  2. Collectd started. Socket is created under "/var/run/mcelog-client".

  3. Mcelog reports the memory error to the "/var/log/mcelog" log file; the values are the same as those reported by the collectd plugin.

Pass

E1

NA

12

RAS memory plugin started with commented fields 

High

  1. Comment out "McelogClientSocket" field in collectd.conf.

  2. Start collectd. Inject a memory error.

  3. Comment out  "McelogLogfile" field in collectd.conf.

  4. Restart collectd. Inject a memory error.

  2. Collectd started. Socket file is created under "/var/run/mcelog-client" location. Mcelog reports memory error to the "/var/log/mcelog" log file correctly.

  4. Collectd started. Mcelog reports memory error to the "/var/log/mcelog" log file correctly.

  2. Collectd started. Socket file is created under "/var/run/mcelog-client" location. Mcelog reports memory error to the "/var/log/mcelog" log file correctly.

  4. Collectd started. Mcelog reports memory error to the "/var/log/mcelog" log file correctly.

Pass

E1

NA

13

RAS memory plugin data updated for new period (day)

Medium

  1. Start mcelog, collectd.

  2. Inject corrected, uncorrected non-fatal and fatal errors.

  3. Wait for a new day to start.

  4. Inject corrected, uncorrected non-fatal and fatal errors.

  2. The collectd RAS memory plugin records the memory errors in the total and 24-hour files for corrected errors. The same number of memory errors is recorded in mcelog and in the collectd log file.

  3. For the total timespan, all corrected/uncorrected memory error counts are copied from the previous day to the new day (YYYY-MM-DD).
    For the 24h timespan, the corrected/uncorrected memory error counts are preserved for the previous day but set to zero for the new day.

  4. Memory errors are recorded for the new day only, in the total and 24-hour files for corrected/uncorrected errors, by the collectd RAS memory plugin. The same number of memory errors is recorded in mcelog and in the collectd log file.

  2. The collectd RAS memory plugin records the memory errors in the total and 24-hour files for corrected errors. The same number of memory errors is recorded in mcelog and in the collectd log file.

  3. For the total timespan, all corrected/uncorrected memory error counts are copied from the previous day to the new day (YYYY-MM-DD).
    For the 24h timespan, the corrected/uncorrected memory error counts are preserved for the previous day but set to zero for the new day.

  4. Memory errors are recorded for the new day only, in the total and 24-hour files for corrected/uncorrected errors, by the collectd RAS memory plugin. The same number of memory errors is recorded in mcelog and in the collectd log file.

Pass

E1

NA 
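The day rollover in this test case shows up as a new set of csv files carrying the new date suffix. The sketch below lists the distinct day suffixes present in a directory, assuming the files are named "<name>-YYYY-MM-DD" as implied by the expected result above; the function name csv_days is hypothetical.

```shell
# Sketch: list the distinct day suffixes (YYYY-MM-DD) of the files in a directory.
# Hypothetical helper. Argument:
#   $1  directory holding per-day csv files named "<name>-YYYY-MM-DD"
csv_days() {
    local f
    for f in "$1"/*; do
        if [ -f "$f" ]; then
            basename "$f"
        fi
    done | sed -n 's/.*-\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\)$/\1/p' | sort -u
}
```

Before midnight the function should print one date; after step 3 it should print two, at which point the first value in the new day's total file can be compared with the last value of the previous day's file.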

14

RAS memory plugin data updated from emulated socket (non mcelog)

Medium

  1. Configure mcelog plugin to retrieve data from other socket (collectd.conf). 

  2. Open a socket (using mcelog emulator).

  3. Start collectd (mcelog service must be stopped).

  4. Generate corrected/uncorrected errors through created socket (using mcelog emulator).

  3. Collectd started.

  4. Generated corrected/uncorrected memory errors are recorded correctly against the specified DIMM location. The number of corrected/uncorrected errors retrieved from collectd is the same as the number generated.