Collectd DCPMM Plugin HLD

Anuket Project

Requirement





Name

1.0 

Uses APIs available with libpmwapi in Intel® PMWatch v3.0 or above

2.0

Uses libipmctl v01.00.00.3262 or above

3.0

Uses DCPMM DIMM firmware v01.00.00.5127 or above

4.0

Should have configurable metrics group selection

5.0

Should have configurable collection interval

Overview

The collectd DCPMM plugin monitors Intel® Optane™ DC Persistent Memory and reports memory performance and health information metrics.



Design

dcpmm plugin

The dcpmm plugin collects the memory performance metrics, health information metrics, and timestamp information listed below.



Name

Type

Type Instance

Description

Comment

Memory performance metrics

total_bytes_read

media

total_bytes_read

Number of bytes transacted by the read operations



total_bytes_written

media

total_bytes_written

Number of bytes transacted by the write operations



read_64B_ops_rcvd

media

read_64B_ops_rcvd

Number of read operations performed to the physical media at 64-byte granularity



write_64B_ops_rcvd

media

write_64B_ops_rcvd

Number of write operations performed to the physical media at 64-byte granularity



media_read_ops

media

media_read_ops

Number of read operations performed to the physical media



media_write_ops

media

media_write_ops

Number of write operations performed to the physical media



host_reads

controller

host_reads

Number of read operations received from the CPU (memory controller)



host_writes

controller

host_writes

Number of write operations received from the CPU (memory controller)



read_hit_ratio

buffer

read_hit_ratio

Measures the efficiency of the buffer in the read path. Range of 0.0 - 1.0



write_hit_ratio

buffer

write_hit_ratio

Measures the efficiency of the buffer in the write path. Range of 0.0 - 1.0



Health information metrics

health_status

health

health_status

Overall health summary (0: normal | 1: non-critical | 2: critical | 3: fatal)



lifespan_remaining

health

lifespan_remaining

The module’s remaining life as a percentage value of factory expected life span



lifespan_used

health

lifespan_used

The module’s used life as a percentage value of factory expected life span



power_on_time

health

power_on_time

The total time the DIMM has been powered on over its lifetime, in seconds



uptime

health

uptime

The current uptime of the DIMM for the current power cycle in seconds



last_shutdown_time

health

last_shutdown_time

The time the system was last shut down, expressed as epoch time in seconds



media_temperature

health

media_temperature

The media’s current temperature in degrees Celsius



controller_temperature

health

controller_temperature

The controller’s current temperature in degrees Celsius



max_media_temperature

health

max_media_temperature

The media’s highest temperature reported in degrees Celsius



max_controller_temperature

health

max_controller_temperature

The controller’s highest temperature reported in degrees Celsius



Timestamp

tsc_cycles

timestamp

tsc_cycles

The number of TSC cycles during each interval



epoch

timestamp

epoch

The timestamp in seconds at which the metrics are collected from DCPMM DIMMs
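
To make the coded and range-bound values in the table easier to consume, the following is a minimal sketch of how a downstream consumer might decode them. It assumes a hypothetical per-DIMM sample dictionary keyed by the type instance names above; the function names and the dictionary shape are illustrative and are not part of the plugin.

```python
from enum import IntEnum


class HealthStatus(IntEnum):
    """health_status codes as listed in the metrics table."""
    NORMAL = 0
    NON_CRITICAL = 1
    CRITICAL = 2
    FATAL = 3


def summarize_health(sample: dict) -> str:
    """Return a one-line summary for one DIMM's health sample.

    `sample` is a hypothetical dict keyed by the type instance names
    from the table; the plugin itself does not define this structure.
    """
    status = HealthStatus(int(sample["health_status"]))
    return (
        f"{status.name.lower()}: "
        f"{sample['lifespan_remaining']:.0f}% life remaining, "
        f"media {sample['media_temperature']:.0f} C"
    )


def check_hit_ratio(value: float) -> float:
    """Reject buffer hit ratios outside the documented 0.0 - 1.0 range."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"hit ratio out of range: {value}")
    return value
```

An unknown health_status value raises ValueError from the enum constructor, which surfaces unexpected codes instead of silently mapping them.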





Plugin configuration

The following configuration options should be supported by the dcpmm collectd plugin:

Name

Description

Comment

Interval

The collection interval in seconds at which the metric counts are collected

Defaults to the global Interval value. When set, it overrides the global Interval for the dcpmm plugin only; other plugins are not affected.

CollectHealth

Health information metrics will be collected if set to true

Default value is false.

CollectPerfMetrics

Memory performance metrics will be collected if set to true

Default value is true.

EnableDispatchAll

Reserved for enabling simultaneous collection of health and memory performance metrics in the future.

This is unused at the moment and must always be false.



Here is an example of the plugin configuration section of the collectd.conf file:

<Plugin dcpmm>
  Interval 10.0
  CollectHealth false
  CollectPerfMetrics true
  EnableDispatchAll false
</Plugin>
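
The option rules above (defaults, the Interval override, and the requirement that EnableDispatchAll stays false) can be sketched as a small validator. This is an illustration only, not the plugin's parser: the function names are hypothetical, only the literal values true and false are accepted for booleans, and the global interval is passed in as a parameter.

```python
# Defaults taken from the option table in this section.
DEFAULTS = {
    "CollectHealth": False,
    "CollectPerfMetrics": True,
    "EnableDispatchAll": False,
}


def parse_bool(text: str) -> bool:
    """Accept only 'true'/'false' (case-insensitive); a simplification."""
    lowered = text.lower()
    if lowered not in ("true", "false"):
        raise ValueError(f"not a boolean: {text}")
    return lowered == "true"


def parse_dcpmm_block(conf_text: str, global_interval: float = 10.0) -> dict:
    """Extract the <Plugin dcpmm> options from collectd.conf-style text."""
    options = dict(DEFAULTS, Interval=global_interval)
    inside = False
    for raw in conf_text.splitlines():
        line = raw.strip()
        if line == "<Plugin dcpmm>":
            inside = True
        elif line == "</Plugin>":
            inside = False
        elif inside and line and not line.startswith("#"):
            key, _, value = line.partition(" ")
            value = value.strip()
            if key == "Interval":
                options[key] = float(value)
            elif key in DEFAULTS:
                options[key] = parse_bool(value)
            else:
                raise ValueError(f"unknown dcpmm option: {key}")
    # The option table states this parameter must always be false.
    if options["EnableDispatchAll"]:
        raise ValueError("EnableDispatchAll must be false")
    return options
```

Rejecting unknown keys means a misspelled option name fails loudly instead of silently falling back to its default.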



Implementation details

To monitor Intel® Optane™ DC Persistent Memory, the plugin uses the APIs provided by Intel® PMWatch, an open source tool available at https://github.com/intel/intel-pmwatch.

The following diagram shows the high-level architecture of the dcpmm plugin.



plugin API

libpmwapi API

Description

dcpmm_config



Parse configuration provided in collectd.conf and register read callback if the correct configuration is provided

dcpmm_init

PMWAPIGetDIMMCount

Obtain the number of DCPMM DIMMs

PMWAPIStart

Passes the configuration and starts the collection

dcpmm_read

PMWAPIRead

Reads the metric values

dcpmm_shutdown

PMWAPIStop

Stops the collection
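
The callback-to-API mapping above can be read as a call sequence. The sketch below models it with a hypothetical stand-in for libpmwapi: the class and method names are invented for illustration, the real PMWAPI* signatures are not reproduced, and only the order of calls follows the table. The zero-DIMM branch reflects the deployment note in this document that the plugin is unloaded during the initialization callback when no DCPMM is present.

```python
class FakePmwApi:
    """Hypothetical stand-in for libpmwapi that records the calls made."""

    def __init__(self, dimm_count: int):
        self.dimm_count = dimm_count
        self.calls = []

    def get_dimm_count(self) -> int:      # stands in for PMWAPIGetDIMMCount
        self.calls.append("PMWAPIGetDIMMCount")
        return self.dimm_count

    def start(self, config: dict) -> None:  # stands in for PMWAPIStart
        self.calls.append("PMWAPIStart")

    def read(self) -> list:               # stands in for PMWAPIRead
        self.calls.append("PMWAPIRead")
        return [{} for _ in range(self.dimm_count)]  # one record per DIMM

    def stop(self) -> None:               # stands in for PMWAPIStop
        self.calls.append("PMWAPIStop")


class DcpmmPlugin:
    """Models the four plugin callbacks listed in the table."""

    def __init__(self, api: FakePmwApi):
        self.api = api
        self.options = {}

    def config(self, options: dict) -> None:
        # dcpmm_config: keep the parsed options; no libpmwapi call here.
        self.options = dict(options)

    def init(self) -> int:
        # dcpmm_init: query the DIMM count, then start the collection.
        if self.api.get_dimm_count() == 0:
            return -1  # non-zero models "plugin is unloaded"
        self.api.start(self.options)
        return 0

    def read(self) -> list:
        # dcpmm_read: fetch the metric values for this interval.
        return self.api.read()

    def shutdown(self) -> None:
        # dcpmm_shutdown: stop the collection.
        self.api.stop()
```

Keeping the library behind a small object like this also makes the call order checkable in unit tests without DCPMM hardware.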

Considerations

Configuration Considerations

The recommended collection interval is 1 second or above.

Deployment Considerations

If Intel® Optane™ DC Persistent Memory (DCPMM) is not available in the system, the plugin will be unloaded during the initialization callback.

API/GUI/CLI Considerations

Equivalence Considerations

Security Considerations

Alarms, events, statistics considerations

Redundancy Considerations

Performance Considerations

The recommended collection interval is 1 second or above.

Testing Consideration

The timing interval requirement needs to be taken into consideration when conducting tests.

Tests should be carried out on a system under load as well as on a relatively idle system.
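
One way to check the timing interval in a test is to compare the spacing of consecutive epoch timestamps against the configured Interval. The helper below is a sketch under assumptions: the function name and the tolerance value are hypothetical, and the timestamps are taken to be the epoch metric values collected for a single DIMM.

```python
def interval_violations(epochs, interval, tolerance=0.5):
    """Return (index, gap) pairs where sample spacing deviates from `interval`.

    `epochs` is a sequence of epoch timestamps in seconds, in collection
    order; `tolerance` (seconds) is an illustrative threshold.
    """
    violations = []
    for i in range(1, len(epochs)):
        gap = epochs[i] - epochs[i - 1]
        if abs(gap - interval) > tolerance:
            violations.append((i, gap))
    return violations
```

Running the same check on a loaded and on an idle system gives a direct comparison of how well the collection interval holds in each case.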

Other Considerations

Impact

The following table outlines possible impact(s) the deployment of this deliverable may have on the current system.



Ref

System Impact Description

Recommendation / Comments

1





Key Assumptions

The following assumptions apply to the scope specified in this document.



Ref

Assumption

Status

1





Key Exclusions

The following exclusions apply to the scope discussed in this document.



Ref

Exclusion

Status

1