Anuket Project

RDT Cache plugin High Level Design

Requirement

1.0  Use CMT and MBM to monitor last level cache occupancy, memory bandwidth utilization and the Instructions Per Clock (IPC) on a per-core basis.

2.0  Use libpqos.

3.0  Report IPC, memory bandwidth utilization and last level cache occupancy.

4.0  Should have a configurable interval.

5.0  Provide SNMP support for any collectd values, through an RDT MIB.

6.0  The polling interval must satisfy 1 second < interval < 2 seconds, because a 24-bit counter could overrun.

 

Overview

Cache Monitoring Technology (CMT), Memory Bandwidth Monitoring (MBM), Cache Allocation Technology (CAT) and Code and Data Prioritization (CDP) provide the hardware framework to monitor and control the utilization of shared resources such as last level cache and memory bandwidth. Together, these technologies make up Intel’s Resource Director Technology (RDT). As multithreaded and multicore platforms run workloads in single-threaded, multithreaded, or complex virtual machine environments, the last level cache and memory bandwidth become key resources to manage. Intel introduced CMT, MBM, CAT and CDP to manage these workloads across shared resources.

CMT and MBM

CMT and MBM are new features that allow an operating system (OS) or hypervisor/virtual machine monitor (VMM) to determine the usage of cache and memory bandwidth by applications running on the platform. CMT and MBM can be used to do the following:

  • Detect whether the platform supports these monitoring capabilities (via CPUID).
  • Let an OS or VMM assign an ID to each application or VM that is scheduled to run on a core. This ID is called the Resource Monitoring ID (RMID).
  • Monitor cache occupancy and memory bandwidth on a per-RMID basis.
  • Let an OS or VMM read LLC occupancy and memory bandwidth for a given RMID at any time.

CAT and CDP

CAT and CDP are new features that allow an OS or hypervisor/VMM to control allocation of a CPU's shared last level cache. Once CAT or CDP is configured, the processor allows access to portions of the cache according to the established class of service (COS). The processor obeys the COS rules when it runs an application thread or application process. This can be accomplished by performing these steps:

  • Determine if the CPU supports the CAT and CDP features.
  • Configure the COS to define the amount of resources (cache space) available. This configuration is at the processor level and is common to all logical processors.
  • Associate each logical processor with an available COS.
  • Run the application on the logical processor that uses the desired COS.

Design

intel_rdt plugin

The intel_rdt plugin collects information provided by the monitoring features of Intel Resource Director Technology (Intel(R) RDT): Cache Monitoring Technology (CMT) and Memory Bandwidth Monitoring (MBM). Using these monitoring technologies, the intel_rdt plugin should collect the following metrics:

 

| Name | Type | Type Instance | Description | Comment |
|------|------|---------------|-------------|---------|
| LLC | bytes | llc | Last level cache occupancy (CMT) | Existing type |
| MBL | memory_bandwidth | local | Bandwidth of accesses to memory associated with the local socket (MBM) | Existing type |
| MBR | memory_bandwidth | remote | Bandwidth of accesses to the remote socket (MBM) | Existing type |
| IPC | ipc | | Instructions per clock | New type introduced in types.db |
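As an illustration of the rate-style metrics in the table above, the following minimal sketch shows how IPC and a bandwidth figure are conventionally derived from counter deltas taken over one polling interval. The function names and sample numbers are hypothetical, and the sketch does not claim to show where the plugin, libpqos, or collectd actually performs this arithmetic.

```python
def instructions_per_clock(instructions_delta: int, cycles_delta: int) -> float:
    """IPC over one interval: retired instructions divided by elapsed core cycles."""
    if cycles_delta <= 0:
        # No cycles counted in this interval; avoid dividing by zero.
        return 0.0
    return instructions_delta / cycles_delta


def bandwidth_bytes_per_second(bytes_delta: int, interval_seconds: float) -> float:
    """Average memory bandwidth over one interval, in bytes per second."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    return bytes_delta / interval_seconds


# Hypothetical sample: 3,000,000 instructions retired in 2,000,000 cycles.
print(instructions_per_clock(3_000_000, 2_000_000))   # 1.5
# Hypothetical sample: 3,000,000 bytes transferred during a 1.5 second interval.
print(bandwidth_bytes_per_second(3_000_000, 1.5))     # 2000000.0
```

LLC occupancy, by contrast, is a point-in-time value in bytes and needs no delta.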

 

 

Plugin configuration

The following configuration options should be supported by the intel_rdt collectd plugin:

| Name | Description | Comment |
|------|-------------|---------|
| Interval | The interval, in seconds, at which statistics on monitored events are retrieved. | The Interval option is supported by collectd itself and is defined in the <LoadPlugin> block. No additional functionality needs to be developed in the intel_rdt plugin to support this option. |
| Cores | Core groups definition. Monitored metrics are reported as aggregated statistics per group. | The field is represented as a list of strings with core group values. Each string represents a list of cores in a group. Allowed formats are: "0,1,2,3" "0-10,20-18" "1,3,5-8,10,0x10-12". If an empty string is provided as the value for this field, the default cores configuration should be applied: a separate group for each core. |


Here is an example of the plugin configuration section of the collectd.conf file:

  <Plugin "intel_rdt">
    Cores "0-2" "3,4,6" "8-10,15"
  </Plugin>
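
To make the Cores formats concrete, here is a minimal parsing sketch. It is not the plugin's parser: the function names are hypothetical, and it assumes that a 0x prefix marks a hexadecimal core id and that a reversed range such as "20-18" is read as 18 through 20, which is one plausible reading of the allowed formats listed above.

```python
def parse_core_group(spec: str) -> list[int]:
    """Parse one core-group string such as "1,3,5-8,10,0x10-12" into sorted core ids."""

    def to_int(token: str) -> int:
        token = token.strip()
        # Assumption: a 0x/0X prefix marks a hexadecimal core id.
        return int(token, 16) if token.lower().startswith("0x") else int(token, 10)

    cores: set[int] = set()
    for part in spec.split(","):
        if "-" in part:
            first, last = (to_int(t) for t in part.split("-", 1))
            # Assumption: a reversed range such as "20-18" means 18..20.
            low, high = min(first, last), max(first, last)
            cores.update(range(low, high + 1))
        else:
            cores.add(to_int(part))
    return sorted(cores)


def parse_cores_option(values: list[str], online_cores: list[int]) -> list[list[int]]:
    """Turn the values of the Cores option into a list of core groups."""
    if not values or all(v.strip() == "" for v in values):
        # Default configuration: a separate group for each core.
        return [[core] for core in online_cores]
    return [parse_core_group(v) for v in values]


groups = parse_cores_option(["0-2", "3,4,6", "8-10,15"], online_cores=list(range(16)))
print(groups)  # [[0, 1, 2], [3, 4, 6], [8, 9, 10, 15]]
```

Malformed tokens raise ValueError here, which is one simple way to reject an invalid configuration at startup.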

Implementation details

To enable support of Intel RDT features, the PQoS library will be used. This software package is maintained, updated and developed at https://github.com/01org/intel-cmt-cat. The API provided by the PQoS library aligns very well with the collectd plugin API, which makes the plugin implementation simple and straightforward. The following table describes the correspondence between the collectd plugin API and the PQoS API that should be used to implement the plugin's functionality.

 
| Plugin API | PQoS API | Description |
|------------|----------|-------------|
| rdt_config | | Parse and validate the core groups configuration provided by the user in collectd.conf |
| rdt_init | pqos_init | Initialize the PQoS library |
| | pqos_cap_get | Get the capabilities of the current platform to detect which monitoring events are supported |
| | pqos_mon_start | Start monitoring of all supported events for the configured core groups |
| rdt_read | pqos_mon_poll | Get monitored data from PQoS and dispatch it to collectd |
| rdt_shutdown | pqos_mon_stop | Stop monitoring of the configured core groups |
| | pqos_fini | Shut down the PQoS library |


For more details on the plugin API, see the collectd plugin implementation guide: https://collectd.org/wiki/index.php/Plugin_architecture.
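
The call ordering implied by the table above can be sketched as follows. This is only an ordering illustration: FakePqos is a hypothetical recorder, not a libpqos binding, RdtPluginSketch is an invented name, and the arguments passed to the pqos_* names are placeholders that do not mirror the real libpqos signatures.

```python
class FakePqos:
    """Hypothetical stand-in that records calls; it is NOT a libpqos binding."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Any pqos_* attribute becomes a function that logs its own name.
        def record(*args):
            self.calls.append(name)
        return record


class RdtPluginSketch:
    """Mirrors the callback-to-PQoS mapping from the table (illustrative only)."""

    def __init__(self, pqos, core_groups):
        self.pqos = pqos
        self.core_groups = core_groups   # already parsed by the config callback

    def init(self):
        self.pqos.pqos_init()
        self.pqos.pqos_cap_get()         # discover which events are supported
        for group in self.core_groups:
            self.pqos.pqos_mon_start(group)

    def read(self):
        self.pqos.pqos_mon_poll(self.core_groups)

    def shutdown(self):
        for group in self.core_groups:
            self.pqos.pqos_mon_stop(group)
        self.pqos.pqos_fini()


pqos = FakePqos()
plugin = RdtPluginSketch(pqos, core_groups=[[0, 1, 2], [3, 4, 6]])
plugin.init()
plugin.read()
plugin.shutdown()
print(pqos.calls)
```

With two core groups, the recorded sequence starts and stops monitoring once per group, with a single poll per read cycle in between.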

 

SNMP Support

All metrics collected by the intel_rdt plugin should be available through SNMP. This will be achieved by creating the proper configuration for the snmp_agent collectd plugin. No additional functionality is needed in the intel_rdt plugin to support SNMP. See the description of the SNMP feature for more details on the snmp_agent plugin.

Considerations

Configuration Considerations

The polling interval used for this plugin must be: 1 second < Interval < 2 seconds.
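
The requirement behind this bound is that a 24-bit counter could overrun. The sketch below illustrates the general reason with modular arithmetic; the function and constant names are hypothetical, and the sample values are made up rather than taken from hardware. A wrapping counter yields a correct delta only if it wraps at most once between two reads, which is why the time between reads has to stay short.

```python
COUNTER_BITS = 24                      # counter width named in the requirement
COUNTER_MODULUS = 1 << COUNTER_BITS    # 16,777,216 distinct raw values


def counter_delta(previous: int, current: int, modulus: int = COUNTER_MODULUS) -> int:
    """Return how far a wrapping counter advanced between two reads.

    Correct only if the counter advanced by less than `modulus` between reads.
    """
    return (current - previous) % modulus


# No wrap: the counter simply moved forward.
print(counter_delta(1_000, 5_000))         # 4000
# One wrap: the raw value dropped, but the modular difference is still right.
print(counter_delta(16_777_000, 200))      # 416
```

If the counter advances by a full modulus or more between reads, the extra wraps are invisible and the computed delta silently under-reports.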

Deployment Considerations

Because the RDT plugin monitors core groups, the applications you want to monitor must be pinned (for example with taskset) to isolated cores until process support is implemented.

It’s not recommended to use this plugin in conjunction with the virt plugin if you plan to retrieve the extended virt statistics that include perf statistics, as perf corrupts CAT.

If your platform does not support RDT, this plugin will be unloaded at initialization time.

API/GUI/CLI Considerations

Equivalence Considerations

The SNMP MIB used for this plugin is a newly defined MIB.

Security Considerations

Alarms, events, statistics considerations

Certain platform generations will not support all the metrics intended to be collected by the plugin. Unsupported metrics will not be reported.

Redundancy Considerations

Performance Considerations

Not part of telemetry, so performance is not applicable.

Testing Considerations

The timing interval requirement needs to be taken into consideration when conducting tests.

Tests should be carried out on a system under load as well as on a relatively idle system.

Other Considerations

Impact

The following table outlines possible impact(s) the deployment of this deliverable may have on the current system.

 

| Ref | System Impact Description | Recommendation / Comments |
|-----|---------------------------|---------------------------|
| 1 | | |


Key Assumptions

The following assumptions apply to the scope specified in this document.

 

| Ref | Assumption | Status |
|-----|------------|--------|
| 1 | | |


Key Exclusions

The following exclusions apply to the scope discussed in this document.

 

| Ref | Exclusion | Status |
|-----|-----------|--------|
| 1 | | |


Key Dependencies

The following table outlines the key dependencies associated with this deliverable.

 

| Ref | Dependency | Status |
|-----|------------|--------|
| 1 | libpqos | |
| 2 | Net-SNMP | |