RAS Other Executed Tests

Anuket Project

Test environment details:

  • Bare Metal,  Ubuntu 16.04.2 LTS

Repo/branch used:

Test preconditions:

  • Mcelog installed.

  • mce-inject tool installed.

  • Collectd installed.

  • Exec/python collectd plugin configured.
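The preconditions above can be checked before a run with a short script. This is a minimal sketch that only looks for executables on PATH. The executable names are assumptions, and because the steps below call mce-inject as ./mce-inject from a local directory, that tool may be reported missing even when it is available.

```python
import shutil

# Assumed executable names; adjust to match the system under test.
REQUIRED_TOOLS = ("mcelog", "mce-inject", "collectd")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    print("missing:", ", ".join(missing) if missing else "none")
```

The `which` parameter exists only so the lookup can be replaced in a dry run; it is not part of any collectd or mcelog tooling.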

RAS Other

Collectd configuration (default):

LoadPlugin mcelog

#<Plugin mcelog>
# McelogClientSocket "/var/run/mcelog-client"
# McelogClientSocketEnabled true
# <McelogLogfile "/var/log/mcelog">
#   <Match>
#     Name "DISCLAIMER"
#     Regex "(Hardware event.*)"
#     Excluderegex "kernel"
#     IsMandatory true
#   </Match>
#   <Match>
#     Name "MCE details"
#     Regex "(.*)"
#     SubmatchIdx 0
#     Excluderegex "kernel|Hardware event|TIME|CPUID"
#     IsMandatory false
#   </Match>
#   <Match>
#     Name "ORIGIN"
#     Regex "MCA: (.*)[ _][Ee][Rr]{2}"
#     SubmatchIdx 1
#     Excluderegex "kernel|Hardware event|TIME|CPUID|No Error"
#     IsMandatory false
#   </Match>
#   <Match>
#     Name "TIME"
#     Regex "TIME ([0-9]*)"
#     Excluderegex "kernel"
#     IsMandatory false
#   </Match>
#   <Match>
#     Name "CPUID"
#     Regex "CPUID (Vendor.*)"
#     Excluderegex "kernel"
#     IsMandatory true
#   </Match>
# </McelogLogfile>
# McelogLogfileEnabled true
#</Plugin> 
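To make the Match rules above easier to reason about, the following sketch applies the same regular expressions to a few lines of an mcelog record in plain Python. It is not the plugin's implementation. It assumes that a line matching Excluderegex is skipped for that rule, and that the first capture group is used when SubmatchIdx is omitted, which is consistent with the notifications shown in the table below.

```python
import re

# Match rules copied from the default configuration above.
# Tuple layout: (name, regex, submatch index, exclude regex).
# Using index 1 when SubmatchIdx is omitted is an assumption.
MATCHES = [
    ("DISCLAIMER", r"(Hardware event.*)", 1, r"kernel"),
    ("MCE details", r"(.*)", 0, r"kernel|Hardware event|TIME|CPUID"),
    ("ORIGIN", r"MCA: (.*)[ _][Ee][Rr]{2}", 1,
     r"kernel|Hardware event|TIME|CPUID|No Error"),
    ("TIME", r"TIME ([0-9]*)", 1, r"kernel"),
    ("CPUID", r"CPUID (Vendor.*)", 1, r"kernel"),
]


def parse_record(lines):
    """Return (name, value) pairs for every rule that matches each line."""
    fields = []
    for line in lines:
        for name, regex, idx, exclude in MATCHES:
            if re.search(exclude, line):
                continue  # assumed semantics: excluded lines are ignored
            match = re.search(regex, line)
            if match:
                fields.append((name, match.group(idx)))
    return fields


if __name__ == "__main__":
    sample = [
        "Hardware event. This is not a software error.",
        "TIME 1492529725 Tue Apr 18 16:35:25 2017",
        "MCA: BUS error: 0 0 Level-3 Generic Generic IO Request-did-not-timeout",
        "CPUID Vendor Intel Family 6 Model 79",
    ]
    for name, value in parse_record(sample):
        print(f"{name}: {value}")
```

With these sample lines the ORIGIN rule captures "BUS", which corresponds to the PluginInstance value seen in the notifications.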

Table#1: RAS IO test cases

#

Test Summary

Steps

Expected

Observed 

Status

Comments

1

RAS plugin notifications upon collectd start with "McelogLogfileEnabled false"

  1. Collectd initial configuration.

  2. Set "McelogLogfileEnabled false". Start collectd.

  3. Verify notifications dispatched by the mcelog plugin.

  4. Inject IO error: echo "CPU 0 BANK 1 STATUS 0x8800000000000E0B" | ./mce-inject

2. Collectd started.

3. Notification that mcelog is connected to server dispatched.

4. Notification is not dispatched.

 

Pass

 

2

RAS plugin notifications upon collectd start with "McelogLogfileEnabled true"

  1. Collectd initial configuration.

  2. Verify notifications dispatched by the mcelog plugin.

  1. Collectd started.

  2. Notification that mcelog is connected to server dispatched.

  3. Other old notifications read from mcelog are dispatched.

 

Fail

Internal JIRA Filed

 

3

RAS plugin dispatches notifications after every collectd restart

  1. Collectd initial configuration. Start collectd.

  2. Inject IO error.

  3. Restart collectd.

  4. Inject IO error (corrected):
    ./mce-inject io_err
    # cat io_err
    CPU 0 BANK 1 STATUS 0x8800000000000E0B 

  1. Collectd started.

  2. Notification about the IO error is dispatched.

  3. Collectd started.

  4. Notification about the IO error is dispatched.

 

Pass
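The inject-and-restart steps in this test are easy to script. The sketch below builds an inject file in the one-line format used in this table and passes it to ./mce-inject. The helper names and the file name are hypothetical, and running the tool is assumed to require root privileges and a kernel with MCE injection support.

```python
import subprocess


def format_mce_record(cpu, bank, status):
    """Build one line in the format used by the inject files in this table."""
    return f"CPU {cpu} BANK {bank} STATUS 0x{status:016X}"


def write_inject_file(path, records):
    """Write one record per line to the given file."""
    with open(path, "w") as handle:
        handle.write("\n".join(records) + "\n")


def inject(path, tool="./mce-inject"):
    """Run the inject tool on the file (assumed invocation: tool <file>)."""
    subprocess.run([tool, path], check=True)


if __name__ == "__main__":
    # Corrected IO error status value taken from the test steps.
    print(format_mce_record(0, 1, 0x8800000000000E0B))
```

Calling `write_inject_file("io_err", [...])` followed by `inject("io_err")` would reproduce step 4; neither is called here so the example can run without the tool installed.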

 

4

RAS plugin upon mcelog LoadPlugin commented

  1. Comment out mcelog part. Restart collectd.
    #LoadPlugin mcelog
    #<Plugin mcelog>
    # ...
    #</Plugin>

  2. Inject IO error.

2. No notification dispatched.

2. No notification dispatched.

Pass 

 

5

RAS plugin upon mcelog Plugin commented (default)

  1. Comment out mcelog part. Restart collectd.
    LoadPlugin mcelog
    #<Plugin mcelog>
    # ...
    #</Plugin>

  2. Inject IO error.

2. Notification is dispatched with correct values for all fields.

Severity:WARNING
Time:0.000
Host:silpixa00398942
Plugin:mcelog
PluginInstance:BUS
Type:gauge
TypeInstance:Corrected error
DISCLAIMER:Hardware event. This is not a software error.
MCEdetails: MCE 0
MCEdetails: CPU 0 BANK 1
MCEdetails: MISC 0
MCEdetails: MCG status:
MCEdetails: MCi status:
MCEdetails: Corrected error
MCEdetails: MCi_MISC register valid
MCEdetails: MCA: BUS error: 0 0 Level-3 Generic Generic IO Request-did-not-timeout
MCEdetails: Running trigger `bus-error-trigger'
MCEdetails: IO MCA reported by root port 0:00:00.0
MCEdetails: Running trigger `iomca-error-trigger'
MCEdetails: STATUS 8800000000000e0b MCGSTATUS 0
MCEdetails: MCGCAP 7000c16 APICID 0 SOCKETID 0
CPUID:CPUID Vendor Intel Family 6 Model 79
GotMachine Check Exception

Fail

Internal JIRA Filed

SA: Time is incorrect in notification but valid in "/var/log/mcelog".

Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 1
MISC 0
TIME 1492529725 Tue Apr 18 16:35:25 2017
MCG status:
MCi status:
Corrected error
MCi_MISC register valid
MCA: BUS error: 0 0 Level-3 Generic Generic IO Request-did-not-timeout
Running trigger `bus-error-trigger'
IO MCA reported by root port 0:00:00.0
Running trigger `iomca-error-trigger'
STATUS 8800000000000e0b MCGSTATUS 0
MCGCAP 7000c16 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 79
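The time mismatch noted in the comment can be checked mechanically by comparing the Time field of the notification with the TIME line of the mcelog record. This is a minimal sketch; the function names and the 10-second tolerance are assumptions chosen for illustration.

```python
import re


def mcelog_time(record_text):
    """Return the epoch seconds from the TIME line of an mcelog record."""
    match = re.search(r"TIME ([0-9]+)", record_text)
    return int(match.group(1)) if match else None


def notification_time(notification_text):
    """Return the value of the 'Time:' field of a notification dump."""
    match = re.search(r"^Time:([0-9.]+)", notification_text, re.MULTILINE)
    return float(match.group(1)) if match else None


def times_agree(notification_text, record_text, tolerance=10.0):
    """True when both timestamps exist and differ by at most `tolerance` seconds."""
    n_time = notification_time(notification_text)
    m_time = mcelog_time(record_text)
    if n_time is None or m_time is None:
        return False
    return abs(n_time - m_time) <= tolerance


if __name__ == "__main__":
    notification = "Severity:WARNING\nTime:0.000\nHost:silpixa00398942\n"
    record = "MCE 0\nTIME 1492529725 Tue Apr 18 16:35:25 2017\n"
    print(times_agree(notification, record))
```

For the values observed in this test (Time:0.000 against TIME 1492529725) the check returns False, matching the reported failure.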

 

6

RAS plugin upon mcelog Plugin "McelogLogfile ..." part commented

  1. Comment out mcelog part. Restart collectd.
    LoadPlugin mcelog
    <Plugin mcelog>
     McelogClientSocket "/var/run/mcelog-client"
     McelogClientSocketEnabled true
     #<McelogLogfile "/var/log/mcelog">
     # ...
     #</McelogLogfile>
    McelogLogfileEnabled true
    </Plugin>

  2. Inject IO error.

2. Notification is dispatched with correct values for all fields.

Same as above.

Fail

Internal JIRA Filed

SA: Time is incorrect in notification but valid in "/var/log/mcelog".

7

RAS plugin upon mcelog Plugin Match part commented

  1. Comment out mcelog part. Restart collectd.
    LoadPlugin mcelog
    <Plugin mcelog>
     McelogClientSocket "/var/run/mcelog-client"
     McelogClientSocketEnabled true
     <McelogLogfile "/var/log/mcelog">
     # ...
    </McelogLogfile>
    McelogLogfileEnabled true
    </Plugin>

  2. Inject IO error.

2. Notification is dispatched with correct values for all fields.

Same as above.

Fail

Internal JIRA Filed

SA: Time is incorrect in notification but valid in "/var/log/mcelog".

8

RAS plugin upon mcelog Plugin all fields uncommented

(same as default configuration)

  1. Uncomment default mcelog part. Restart collectd.
    LoadPlugin mcelog
    <Plugin mcelog>
     McelogClientSocket "/var/run/mcelog-client"
     McelogClientSocketEnabled true
     <McelogLogfile "/var/log/mcelog">
      ...
    </McelogLogfile>
    McelogLogfileEnabled true
    </Plugin>

  2. Inject IO error.

2. Notification is dispatched with correct values for all fields.

Severity:WARNING
Time:1492529930.000
Host:silpixa00398942
Plugin:mcelog
PluginInstance:BUS
Type:gauge
TypeInstance:Corrected error
DISCLAIMER:Hardware event. This is not a software error.
MCEdetails: MCE 0
MCEdetails: CPU 0 BANK 1
MCEdetails: MISC 0
MCEdetails: MCG status:
MCEdetails: MCi status:
MCEdetails: Corrected error
MCEdetails: MCi_MISC register valid
MCEdetails: MCA: BUS error: 0 0 Level-3 Generic Generic IO Request-did-not-timeout
MCEdetails: Running trigger `bus-error-trigger'
MCEdetails: IO MCA reported by root port 0:00:00.0
MCEdetails: Running trigger `iomca-error-trigger'
MCEdetails: STATUS 8800000000000e0b MCGSTATUS 0
MCEdetails: MCGCAP 7000c16 APICID 0 SOCKETID 0
CPUID:Vendor Intel Family 6 Model 79
GotMachine Check Exception. 

Pass

 

9

RAS plugin upon mcelog Plugin commented/removed Match part with "IsMandatory false"

  1. Comment out mcelog part. Restart collectd.
    LoadPlugin mcelog
    <Plugin mcelog>
     McelogClientSocket "/var/run/mcelog-client"
     McelogClientSocketEnabled true
     <McelogLogfile "/var/log/mcelog">
       <Match>
         Name "DISCLAIMER"
         Regex "(Hardware event.*)"
         Excluderegex "kernel"
         IsMandatory true
       </Match>
     # ...
       <Match>
         Name "CPUID"
         Regex "CPUID (Vendor.*)"
         Excluderegex "kernel"
         IsMandatory true
       </Match>
    </McelogLogfile>
    McelogLogfileEnabled true
    </Plugin>

  2. Inject IO error.

2. Notification is dispatched with correct values for all fields.

Notification:

Severity:FAILURE
Time:1492530303.353
Host:silpixa00398942
Plugin:mcelog
PluginInstance:other
Type:gauge
TypeInstance:Uncorrected error
DISCLAIMER:Hardware event. This is not a software error.
CPUID:Vendor Intel Family 6 Model 79
GotMachine Check Exception.

mcelog:

Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 1
MISC 0
TIME 1492606180 Wed Apr 19 13:49:40 2017
MCG status:
MCi status:
Corrected error
MCi_MISC register valid
MCA: BUS error: 0 0 Level-3 Generic Generic IO Request-did-not-timeout
Running trigger `bus-error-trigger'
IO MCA reported by root port 0:00:00.0
Running trigger `iomca-error-trigger'
STATUS 8800000000000e0b MCGSTATUS 0
MCGCAP 7000c16 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 79

Fail

Internal JIRA Filed

 

SA: Looks like fields in notifications are filtered: "MCEdetails" part is missing.

Error type is different: Corrected vs Uncorrected.

"PluginInstance" is changed to "other".

Time is different: mcelog: "TIME 1492530298 Tue Apr 18 16:44:58 2017"; notification: "1492530303.353 Tue Apr 18 16:45:03 IST 2017"; (attempt#2: 16:50:49 vs Tue Apr 18 16:50:53 IST 2017)

10

RAS plugin correctly reads severity of injected IO errors

  1. Collectd initial configuration. Start collectd.

  2. Inject corrected IO error.
    # ./mce-inject io_err
    # cat io_err
    CPU 0 BANK 1 STATUS 0x8800000000000E0B 

  3. Inject uncorrected non fatal IO error.
    # ./mce-inject io_uncor_err
    # cat io_uncor_err
    ?

2. Notification is dispatched with severity WARNING for corrected error.

3. Notification is dispatched with severity FAILURE for uncorrected error.

2. Notification is dispatched with severity WARNING for corrected error.

3. Notification is dispatched with severity FAILURE for uncorrected error???

2-Pass

SA: How to inject uncorrected non fatal/fatal?
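When checking severity, it can help to decode the injected STATUS value. The sketch below reads the flag bits of the value used in step 2, using the IA32_MCi_STATUS layout from the Intel SDM (VAL in bit 63, UC in bit 61, and so on). The severity mapping follows the Expected column of this test and is an assumption about the plugin's behavior. The sketch does not say what io_uncor_err should contain.

```python
# Bit positions of the IA32_MCi_STATUS flags (Intel SDM layout).
FLAG_BITS = {
    "VAL": 63,    # register contents are valid
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MCi_MISC register is valid
    "ADDRV": 58,  # MCi_ADDR register is valid
    "PCC": 57,    # processor context corrupt
}


def decode_status(status):
    """Return the flag bits and the 16-bit MCA error code of a status value."""
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    flags["mca_error_code"] = status & 0xFFFF
    return flags


def expected_severity(status):
    """Assumed mapping from this test: corrected is WARNING, uncorrected is FAILURE."""
    return "FAILURE" if decode_status(status)["UC"] else "WARNING"


if __name__ == "__main__":
    status = 0x8800000000000E0B  # value injected in step 2
    decoded = decode_status(status)
    print({name: value for name, value in decoded.items() if value})
    print(expected_severity(status))
```

For the step 2 value only VAL and MISCV are set, which agrees with the "Corrected error" and "MCi_MISC register valid" lines in the mcelog output.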

11

RAS plugin upon memory and IO error injection

  1. Collectd initial configuration. Start collectd.

  2. Inject corrected IO error.
    # ./mce-inject io_err
    # cat io_err
    CPU 0 BANK 1 STATUS 0x8800000000000E0B

  3. Inject corrected memory error.

2. Notification is dispatched about IO error once.

3. Notification is dispatched about memory corrected error once.

2. Notification is dispatched about IO error once.

3. Notification is dispatched about memory corrected error every time interval.

Fail

Internal JIRA Filed

 

12

RAS plugin events received from different mcelog location

  1. Change mcelog file location in mcelog.conf and collectd.conf. Restart mcelog, restart collectd services.

  2. Inject IO error.

  1. Mcelog, collectd are running. Collectd plugins are loaded.

  2. Notification is dispatched about IO error.

 

Pass
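Step 1 changes the log location in two files, so a quick consistency check can catch a mismatch before errors are injected. This sketch extracts the path from each configuration text and compares them. It assumes mcelog.conf sets the path with a `logfile = <path>` line and that commented-out lines should be ignored; the paths in the example are hypothetical.

```python
import re


def collectd_logfile(collectd_conf):
    """Return the path from an uncommented <McelogLogfile "..."> line."""
    match = re.search(r'^\s*<McelogLogfile\s+"([^"]+)">', collectd_conf, re.MULTILINE)
    return match.group(1) if match else None


def mcelog_logfile(mcelog_conf):
    """Return the path from an uncommented 'logfile = ...' line (assumed key name)."""
    match = re.search(r"^\s*logfile\s*=\s*(\S+)", mcelog_conf, re.MULTILINE)
    return match.group(1) if match else None


def same_logfile(collectd_conf, mcelog_conf):
    """True when both files name the same log path."""
    path = collectd_logfile(collectd_conf)
    return path is not None and path == mcelog_logfile(mcelog_conf)


if __name__ == "__main__":
    # Hypothetical relocated path used only for illustration.
    collectd_conf = '<Plugin mcelog>\n <McelogLogfile "/tmp/mcelog_test">\n'
    mcelog_conf = "logfile = /tmp/mcelog_test\n"
    print(same_logfile(collectd_conf, mcelog_conf))
```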

 

13

RAS plugin events received from mcelog-client socket upon "McelogClientSocketEnabled false/true" is changed

  1. Change collectd.conf. Restart collectd.
    LoadPlugin mcelog
    <Plugin mcelog>
     McelogClientSocket "/var/run/mcelog-client"
     McelogClientSocketEnabled true
     <McelogLogfile "/var/log/mcelog">
       <Match>
         Name "Host:silpixa00398942"
         Regex "(Host.*)"
         Excluderegex "kernel"
         IsMandatory true
       </Match>
       <Match>
         Name "MCE details"
         Regex "(.*)"
         SubmatchIdx 0
         Excluderegex "kernel|Hardware event|TIME|CPUID"
         IsMandatory false
       </Match>
       <Match>
         Name "Gotmemory"
         Regex "(Gotmemory.*)"
         Excluderegex "kernel"
         IsMandatory true
       </Match>
     </McelogLogfile>
     McelogLogfileEnabled false
    </Plugin>

  2. Inject corrected memory error (memory errors are sent over the socket).

  3. Change "McelogClientSocketEnabled false". Restart collectd.

  4. Inject corrected memory error (memory errors are sent over the socket).

  1. Collectd started.

  2. Notification about an error is dispatched with parsing.

  3. Collectd started.

  4. Notification about an error is dispatched without parsing.

2. Notification about an IO error is dispatched.

 

 

4. Notification about an error is not dispatched.

Fail

Internal JIRA Filed

 

 

14

RAS plugin events received from mcelog file upon "McelogClientSocketEnabled false" and "McelogLogfileEnabled true"

  1. Change collectd.conf. Restart collectd.
    <Plugin mcelog>
     McelogClientSocket "/var/run/mcelog-client"
     McelogClientSocketEnabled false
     <McelogLogfile "/var/log/mcelog">
       <Match>
         Name "DISCLAIMER"
         Regex "(Hardware event.*)"
         Excluderegex "kernel"
         IsMandatory true
       </Match>
       <Match>
         Name "MCE details"
         Regex "(.*)"
         SubmatchIdx 0
         Excluderegex "kernel|Hardware event|TIME|CPUID"
         IsMandatory false
       </Match>
       <Match>
         Name "CPUID"
         Regex "CPUID (Vendor.*)"
         Excluderegex "kernel"
         IsMandatory true
       </Match>
     </McelogLogfile>
     McelogLogfileEnabled true
    </Plugin>