Collectd Metrics and Events

Anuket Project




Distinction between metrics and events

For the purposes of Platform Service Assurance, it's important to distinguish between metrics and events as well as how they are measured (from a timing perspective).

A Metric is a (standard) definition of a quantity describing the performance and/or reliability of a monitored function, which has an intended utility and is carefully specified to convey the exact meaning of the measured value. A measured value of a metric is produced in an assessment of a monitored function according to a method of measurement. For example, the number of dropped packets for a networking interface is a metric.

 

An Event is defined as an important state change in a monitored function. The monitoring system is notified that an event has occurred using a message with a standard format. The Event notification describes the significant aspects of the event, such as the name and ID of the monitored function, the type of event, and the time the event occurred. For example, an event notification would take place if the link status of a networking device suddenly changes from up to down on a compute node hosting VNFs in an NFV deployment.



Statistics

Statistics in collectd consist of a value list. A value list includes:

 

| Value list field | Description | Example | Comment |
|---|---|---|---|
| Values | | 99.8999 | percentage |
| Value length | The number of values in the data set. | | |
| Time | Timestamp at which the value was collected. | 1475837857 | epoch |
| Interval | Interval at which to expect a new value. | 10 | interval |
| Host | Used to identify the host. | localhost | Can be a UUID for the VM or host, or a name given to the host. |
| Plugin | Used to identify the plugin. | cpu | |
| Plugin instance (optional) | Used to group a set of values together, e.g. values belonging to a DPDK interface. | 0 | |
| Type | Unit used to measure a value; in other words, it refers to a data set. | percent | |
| Type instance (optional) | Used to distinguish between values that have an identical type. | user | |
| Meta data | An opaque data structure that enables the passing of additional information about a value list. “Meta data in the global cache can be used to store arbitrary information about an identifier” | | |
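To make the fields above concrete, here is a minimal Python sketch that models a value list and builds the `host/plugin-plugin_instance/type-type_instance` identifier collectd uses to name a value. The `ValueList` class and its attribute names are illustrative assumptions, not collectd's own data structure or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ValueList:
    """Illustrative stand-in for a collectd value list (hypothetical class)."""
    values: List[float]
    time: float          # epoch timestamp at which the values were collected
    interval: float      # seconds until the next value is expected
    host: str
    plugin: str
    type: str
    plugin_instance: Optional[str] = None
    type_instance: Optional[str] = None
    meta: Dict[str, object] = field(default_factory=dict)

    def identifier(self) -> str:
        """Join the naming fields; optional instances are appended with '-'."""
        plugin = self.plugin
        if self.plugin_instance:
            plugin += "-" + self.plugin_instance
        type_ = self.type
        if self.type_instance:
            type_ += "-" + self.type_instance
        return f"{self.host}/{plugin}/{type_}"


# Values taken from the Example column of the table.
example = ValueList(
    values=[99.8999],
    time=1475837857,
    interval=10,
    host="localhost",
    plugin="cpu",
    plugin_instance="0",
    type="percent",
    type_instance="user",
)
print(example.identifier())
print(len(example.values))  # the "value length"
```

The optional fields default to `None` so that a plugin without instances still yields a valid identifier such as `localhost/memory/memory`.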

 

 

Notifications


Notifications in collectd are generic messages containing:

  • An associated severity, which can be one of OKAY, WARNING, or FAILURE.

  • A time.

  • A message.

  • A host.

  • A plugin.

  • A plugin instance (optional).

  • A type.

  • A type instance (optional).

  • Meta data.



Example notification:

 

Severity:FAILURE

Time:1472552207.385

Host:pod3-node1

Plugin:dpdkevents

PluginInstance:dpdk0

Type:gauge

TypeInstance:link_status

DataSource:value

CurrentValue:1.000000e+00

WarningMin:nan

WarningMax:nan

FailureMin:2.000000e+00

FailureMax:nan

Host pod3-node1, plugin dpdkevents (instance dpdk0) type gauge (instance link_status): Data source "value" is currently 1.000000. That is below the failure threshold of 2.000000.
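The example notification can be read mechanically: a set of `Key:Value` header lines followed by a free-text message. The sketch below parses that layout and re-derives the severity from the threshold fields. Both function names are hypothetical, and the threshold check is a simplification written for illustration; it is not collectd's threshold implementation.

```python
import math
from typing import Dict, Tuple

EXAMPLE = """\
Severity:FAILURE
Time:1472552207.385
Host:pod3-node1
Plugin:dpdkevents
PluginInstance:dpdk0
Type:gauge
TypeInstance:link_status
DataSource:value
CurrentValue:1.000000e+00
WarningMin:nan
WarningMax:nan
FailureMin:2.000000e+00
FailureMax:nan

Host pod3-node1, plugin dpdkevents (instance dpdk0) type gauge (instance link_status): Data source "value" is currently 1.000000. That is below the failure threshold of 2.000000.
"""


def parse_notification(text: str) -> Tuple[Dict[str, str], str]:
    """Split 'Key:Value' header lines from the trailing free-text message."""
    fields: Dict[str, str] = {}
    message_lines = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, sep, value = line.partition(":")
        # Header keys are single words; anything else belongs to the message.
        if sep and " " not in key and not message_lines:
            fields[key] = value.strip()
        else:
            message_lines.append(line)
    return fields, " ".join(message_lines)


def severity_for(value: float, warning_min: float = math.nan,
                 warning_max: float = math.nan, failure_min: float = math.nan,
                 failure_max: float = math.nan) -> str:
    """Classify a value; a NaN bound never matches, so it acts as 'unset'."""
    if value < failure_min or value > failure_max:
        return "FAILURE"
    if value < warning_min or value > warning_max:
        return "WARNING"
    return "OKAY"


fields, message = parse_notification(EXAMPLE)
computed = severity_for(
    float(fields["CurrentValue"]),
    float(fields["WarningMin"]), float(fields["WarningMax"]),
    float(fields["FailureMin"]), float(fields["FailureMax"]),
)
print(fields["Severity"], computed)
```

Here the current value (1.0) is below `FailureMin` (2.0), which matches the FAILURE severity carried in the notification.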

 

Supported Metrics and Events

Recommended Intervals for plugins

  • For events plugins: to meet a 10 ms detection time, it's recommended to use a 5 ms interval.

  • For stats plugins: generally these can run at the global interval configured for collectd. Exceptions are pointed out in the table below.
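One way to reason about the events recommendation is the following back-of-the-envelope sketch. It assumes a polling plugin, where a state change that happens just after one poll is only seen at the next poll, so the worst-case detection time is roughly one interval plus whatever time is spent reading and dispatching. The function names and the processing budget are assumptions made for illustration.

```python
def worst_case_detection_ms(interval_ms: float, processing_ms: float = 0.0) -> float:
    """Upper bound on the delay between a state change and its notification.

    Assumes polling: the change may occur immediately after a poll, so it
    waits almost a full interval before being read, then is processed.
    """
    return interval_ms + processing_ms


def processing_budget_ms(target_ms: float, interval_ms: float) -> float:
    """Time left for reading and dispatching once the interval is spent."""
    return target_ms - interval_ms


# A 5 ms interval leaves 5 ms of headroom inside a 10 ms detection target.
print(processing_budget_ms(10.0, 5.0))
print(worst_case_detection_ms(5.0, processing_ms=5.0))
```

Under these assumptions, an interval equal to the full 10 ms target would leave no headroom at all for processing.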

Metrics

Reference starting point: https://github.com/collectd/collectd/blob/master/src/types.db  

The table below maps the "base" plugins that would run on the host or the guest.

 

 

 

The table columns are: Where collectd is running, Plugin, Plugin Instance, Type, Type Instance, Description, Range, Comment, and Additional Info.

| Where collectd is running | Plugin | Plugin Instance | Type | Type Instance | Description | Range | Comment | Additional Info |
|---|---|---|---|---|---|---|---|---|
| Host/guest | CPU (a read plugin that retrieves CPU usage in nanoseconds or as a percentage) | | percent/nanoseconds | idle | Time the CPU spends idle. | | Can be per CPU or aggregated across all CPUs. For more info, see http://man7.org/linux/man-pages/man1/top.1.html and http://blog.scoutapp.com/articles/2015/02/24/understanding-linuxs-cpu-stats. Note that jiffies operate on a variable time base, HZ. The default value of HZ (100) should be used, yielding a jiffy value of 0.01 seconds [time(7)]. Also, the actual number of jiffies in each second is subject to system factors, such as use of virtualization. Thus, the percent calculation based on jiffies will nominally sum to 100% plus or minus error. | |
| | | | percent/nanoseconds | nice | Time the CPU spent running user space processes that have been niced. The priority level of a user space process can be tweaked by adjusting its niceness. | | | |
| | | | percent/nanoseconds | interrupt | Time the CPU has spent servicing interrupts. | | | |
| | | | percent/nanoseconds | softirq | Time spent handling softirqs: interrupts that are synthesized in software and almost as important as hardware interrupts (above). "In current kernels there are ten softirq vectors defined; two for tasklet processing, two for networking, two for the block layer, two for timers, and one each for the scheduler and read-copy-update processing. The kernel maintains a per-CPU bitmask indicating which softirqs need processing at any given time." [Ref] | | | |
| | | | percent/nanoseconds | steal | CPU steal is a measure of the fraction of time that a machine is in a state of “involuntary wait.” It is time for which the kernel cannot otherwise account in one of the traditional classifications like user, system, or idle. It is time that went missing, from the perspective of the kernel. See http://www.stackdriver.com/understanding-cpu-steal-experiment/ | | | |
| | | | percent/nanoseconds | system | Time that the CPU spent running the kernel. | | | |
| | | | percent/nanoseconds | user | Time the CPU spends running un-niced user space processes. | | | |
| | | | percent/nanoseconds | wait | Time the CPU spends idle while waiting for an I/O operation to complete. | | | |
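The note on jiffies for the CPU states can be illustrated with a small sketch. It converts two samples of per-state jiffy counters into percentages against the nominal number of jiffies per interval (HZ × interval), which is why the result only nominally sums to 100%. The function name and the sample counter values are invented for illustration; this is not the code the cpu plugin runs.

```python
from typing import Dict


def cpu_percent_nominal(prev: Dict[str, int], curr: Dict[str, int],
                        interval_s: float, hz: int = 100) -> Dict[str, float]:
    """Percent of the nominal jiffy budget spent in each CPU state."""
    nominal_jiffies = hz * interval_s  # e.g. 100 jiffies per second at HZ=100
    return {
        state: 100.0 * (curr[state] - prev[state]) / nominal_jiffies
        for state in curr
    }


# Two hypothetical samples taken one second apart on a single CPU.
prev = {"user": 1000, "nice": 10, "system": 500, "idle": 8000,
        "wait": 100, "interrupt": 5, "softirq": 15, "steal": 0}
curr = {"user": 1030, "nice": 10, "system": 520, "idle": 8038,
        "wait": 105, "interrupt": 5, "softirq": 20, "steal": 0}

percent = cpu_percent_nominal(prev, curr, interval_s=1.0)
total = sum(percent.values())
print(percent["user"], percent["idle"], total)
```

In this made-up sample only 98 jiffies elapsed during the nominal 100-jiffy second, so the states sum to 98% rather than 100%. Dividing by the sum of the deltas instead would force the total to exactly 100%.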

 

 

 

| Where collectd is running | Plugin | Plugin Instance | Type | Type Instance | Description | Range | Comment | Additional Info |
|---|---|---|---|---|---|---|---|---|
| Host/guest | Interface (a read plugin that retrieves Linux interface statistics) | | if_dropped | in | The total number of received packets dropped. | | | |
| | | | if_errors | in | The total number of received error packets. | | http://www.onlamp.com/pub/a/linux/2000/11/16/LinuxAdmin.html | |
| | | | if_octets | in | The total number of received bytes. | | | |
| | | | if_packets | in | The total number of received packets. | | | |
| | | | if_dropped | out | The total number of transmit packets dropped. | | | |
| | | | if_errors | out | The total number of transmit error packets. (This is the total of error conditions encountered when attempting to transmit a packet. The referenced code explains the possibilities, but it is no longer present in /net/core/dev.c on master at present; it appears to have moved to /net/core/net-procfs.c.) | | | |
| | | | if_octets | out | The total number of bytes transmitted. | | | |
| | | | if_packets | out | The total number of transmitted packets. | | | |

 

 

 

| Where collectd is running | Plugin | Plugin Instance | Type | Type Instance | Description | Range | Comment | Additional Info |
|---|---|---|---|---|---|---|---|---|
| Host/guest | Memory (a read plugin that retrieves memory usage statistics) | | memory | buffered | The amount, in kibibytes, of temporary storage for raw disk blocks. | | https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Deployment_Guide/s2-proc-meminfo.html | |
| | | | memory | cached | The amount of physical RAM, in kibibytes, used as cache memory. | | | |
| | | | memory | free | The amount of physical RAM, in kibibytes, left unused by the system. | | | |
| | | | memory | slab_recl | The part of Slab that can be reclaimed, such as caches. | | Slab: the total amount of memory, in kibibytes, used by the kernel to cache data structures for its own use. | |
| | | | memory | slab_unrecl | The part of Slab that cannot be reclaimed even when lacking memory. | | https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Deployment_Guide/s2-proc-meminfo.html | |
| | | | memory | total | Total amount of usable RAM, in kibibytes, which is physical RAM minus a number of reserved bits and the kernel binary code. | | https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Deployment_Guide/s2-proc-meminfo.html This was the only undefined metric in the mem_used calculation below. | |
| | | | memory | used | mem_used = mem_total - (mem_free + mem_buffered + mem_cached + mem_slab_total); | | https://github.com/collectd/collectd/blob/master/src/memory.c#L325 | |
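The `mem_used` formula can be checked with a short worked example. The sketch below mirrors the formula quoted above; it assumes that the total slab figure is the sum of the reclaimable and unreclaimable parts, and the numbers are invented sample values in kibibytes rather than measurements.

```python
def mem_used(total: int, free: int, buffered: int, cached: int,
             slab_recl: int, slab_unrecl: int) -> int:
    """mem_used = mem_total - (mem_free + mem_buffered + mem_cached + mem_slab_total)."""
    # Assumption: total slab = reclaimable + unreclaimable slab.
    slab_total = slab_recl + slab_unrecl
    return total - (free + buffered + cached + slab_total)


# Hypothetical values, all in kibibytes.
used = mem_used(
    total=16_000_000,
    free=4_000_000,
    buffered=500_000,
    cached=3_000_000,
    slab_recl=300_000,
    slab_unrecl=200_000,
)
print(used)
```

With these sample values, half of the total (8,000,000 KiB) is attributed to free, buffered, cached, and slab memory, and the remainder is reported as used.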

 

disk

(A read plugin that retrieves disk usage statistics)

 

disk_io_time

io_time

Time spent doing I/O (ms). This metric can be treated as a device load percentage: 1 second of I/O time per second of elapsed time corresponds to 100% load.

 

 

 

 

disk_io_time

weighted_io_time

measure of both I/O completion time and the backlog that may be accumulating.

 

 

 

 

disk_merged

read

The number of operations that could be merged into other, already queued operations, i.e. one physical disk access served two or more logical operations. The higher the number, the better.

 

 

 

 

disk_merged

write

The number of operations that could be merged into other, already queued operations, i.e. one physical disk access served two or more logical operations. The higher the number, the better.

 

 

 

 

disk_octets

read

the number of octets read from a disk or partition