Anuket Project
Collectd Metrics and Events
Distinction between metrics and events
For the purposes of Platform Service Assurance, it's important to distinguish between metrics and events as well as how they are measured (from a timing perspective).
A Metric is a (standard) definition of a quantity describing the performance and/or reliability of a monitored function, which has an intended utility and is carefully specified to convey the exact meaning of the measured value. A measured value of a metric is produced in an assessment of a monitored function according to a method of measurement. For example, the number of dropped packets on a network interface is a metric.
An Event is defined as an important state change in a monitored function. The monitoring system is notified that an event has occurred using a message with a standard format. The Event notification describes the significant aspects of the event, such as the name and ID of the monitored function, the type of event, and the time the event occurred. For example, an event notification would be sent if the link status of a networking device suddenly changes from up to down on a compute node hosting VNFs in an NFV deployment.
Statistics
Statistics in collectd consist of a value list. A value list includes:
| Value list | Description | Example | Comment |
|---|---|---|---|
| Values | | 99.8999 | percentage |
| Value length | The number of values in the data set. | | |
| Time | Timestamp at which the value was collected. | 1475837857 | epoch |
| Interval | Interval at which to expect a new value. | 10 | interval |
| Host | Used to identify the host. | localhost | can be uuid for vm or host… or can give host a name |
| Plugin | Used to identify the plugin. | cpu | |
| Plugin instance (optional) | Used to group a set of values together, e.g. values belonging to a DPDK interface. | 0 | |
| Type | Unit used to measure a value. In other words, used to refer to a data set. | percent | |
| Type instance (optional) | Used to distinguish between values that have an identical type. | user | |
| Meta data | An opaque data structure that enables the passing of additional information about a value list. “Meta data in the global cache can be used to store arbitrary information about an identifier” | | |
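The fields above can be pictured as a small record. The following is a minimal Python sketch rather than collectd's own C structure: the class name `ValueList`, the attribute names, and the `identifier()` helper are illustrative assumptions. The identifier layout it produces, `host/plugin-plugin_instance/type-type_instance`, follows collectd's usual naming schema, with the optional instance parts dropped when they are empty.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ValueList:
    """Illustrative stand-in for a collectd value list (not the C struct)."""
    values: List[float]
    time: float                # epoch seconds at which the values were collected
    interval: float            # seconds until the next value is expected
    host: str
    plugin: str
    type: str
    plugin_instance: str = ""  # optional
    type_instance: str = ""    # optional
    meta: Dict[str, object] = field(default_factory=dict)

    @property
    def values_len(self) -> int:
        # "Value length" is simply the number of values carried.
        return len(self.values)

    def identifier(self) -> str:
        # Optional instances are appended with a hyphen only when present.
        plugin = self.plugin + (f"-{self.plugin_instance}" if self.plugin_instance else "")
        type_ = self.type + (f"-{self.type_instance}" if self.type_instance else "")
        return f"{self.host}/{plugin}/{type_}"


# Built from the example column of the table above.
sample = ValueList(values=[99.8999], time=1475837857, interval=10,
                   host="localhost", plugin="cpu", type="percent",
                   plugin_instance="0", type_instance="user")
print(sample.identifier())  # localhost/cpu-0/percent-user
```

Keeping the two instance fields optional mirrors the table: a plugin that reports a single value per type can leave both empty and still yield a well-formed identifier.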
Notifications
Notifications in collectd are generic messages containing:
An associated severity, which can be one of OKAY, WARNING, and FAILURE.
A time.
A message.
A host.
A plugin.
A plugin instance (optional).
A type.
A type instance (optional).
Meta-data.
Example notification:
Severity:FAILURE
Time:1472552207.385
Host:pod3-node1
Plugin:dpdkevents
PluginInstance:dpdk0
Type:gauge
TypeInstance:link_status
DataSource:value
CurrentValue:1.000000e+00
WarningMin:nan
WarningMax:nan
FailureMin:2.000000e+00
FailureMax:nan
Host pod3-node1, plugin dpdkevents (instance dpdk0) type gauge (instance link_status): Data source "value" is currently 1.000000. That is below the failure threshold of 2.000000.
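The example shows a value of 1.0 falling below a failure minimum of 2.0 while the other three bounds are `nan`, i.e. unset. A minimal Python sketch of that comparison follows. The function name `classify` and its keyword arguments are illustrative assumptions and are not collectd's API; the sketch only reproduces the min/max check implied by the notification fields.

```python
import math


def classify(value, warning_min=math.nan, warning_max=math.nan,
             failure_min=math.nan, failure_max=math.nan):
    """Return 'FAILURE', 'WARNING' or 'OKAY'; NaN bounds are treated as unset."""
    def below(bound):
        return not math.isnan(bound) and value < bound

    def above(bound):
        return not math.isnan(bound) and value > bound

    # Failure bounds take precedence over warning bounds.
    if below(failure_min) or above(failure_max):
        return "FAILURE"
    if below(warning_min) or above(warning_max):
        return "WARNING"
    return "OKAY"


# Values from the example notification: CurrentValue 1.0, FailureMin 2.0.
print(classify(1.0, failure_min=2.0))  # FAILURE
print(classify(2.0, failure_min=2.0))  # OKAY
```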
Supported Metrics and Events
Recommended Intervals for plugins
For events plugins: to meet a 10 ms detection time, a 5 ms interval is recommended.
For stats plugins: these can generally run at the global interval configured for collectd. Exceptions are noted in the table below.
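One way to read the 5 ms recommendation: a polled plugin only notices a state change on its next read, so the wait before detection can approach one full interval, and the rest of the 10 ms budget is left for reading the state and dispatching the notification. The Python sketch below works through that arithmetic. The function name and the 5 ms processing allowance are illustrative assumptions, not figures taken from collectd.

```python
def worst_case_detection_ms(interval_ms: float, processing_ms: float) -> float:
    """Upper bound on detection time for a polled events plugin.

    Assumes the state change happens just after a read, so it is only
    seen one full interval later, plus a hypothetical processing allowance.
    """
    return interval_ms + processing_ms


budget_ms = 10.0
print(worst_case_detection_ms(5.0, 5.0) <= budget_ms)   # True
print(worst_case_detection_ms(10.0, 5.0) <= budget_ms)  # False
```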
Metrics
Reference starting point: https://github.com/collectd/collectd/blob/master/src/types.db
The table below maps the "base" plugins that would run on the host or the guest.
| Where collectd is running | Plugin | Plugin Instance | Type | Type Instance | Description | Range | Comment | Additional Info |
|---|---|---|---|---|---|---|---|---|
| Host/guest | CPU (A read plugin that retrieves CPU usage in nanoseconds or as a percentage) | | percent/nanoseconds | idle | Time the CPU spends idle. | | Can be per CPU or aggregated across all the CPUs. For more info, please see: http://man7.org/linux/man-pages/man1/top.1.html | |
| | | | percent/nanoseconds | nice | Time the CPU spent running user space processes that have been niced. The priority level of a user space process can be tweaked by adjusting its niceness. | | | |
| | | | percent/nanoseconds | interrupt | Time the CPU has spent servicing interrupts. | | | |
| | | | percent/nanoseconds | softirq | | | | |
| | | | percent/nanoseconds | steal | CPU steal is a measure of the fraction of time that a machine is in a state of “involuntary wait.” It is time for which the kernel cannot otherwise account in one of the traditional classifications like user, system, or idle. It is time that went missing, from the perspective of the kernel. http://www.stackdriver.com/understanding-cpu-steal-experiment/ | | | |
| | | | percent/nanoseconds | system | Time that the CPU spent running the kernel. | | | |
| | | | percent/nanoseconds | user | Time the CPU spends running un-niced user space processes. | | | |
| | | | percent/nanoseconds | wait | The time the CPU spends idle while waiting for an I/O operation to complete. | | | |
| | Interface (A read plugin that retrieves Linux Interface statistics) | | if_dropped | in | The total number of received dropped packets. | | | |
| | | | if_errors | in | The total number of received error packets. | | http://www.onlamp.com/pub/a/linux/2000/11/16/LinuxAdmin.html | |
| | | | if_octets | in | The total number of received bytes. | | | |
| | | | if_packets | in | The total number of received packets. | | | |
| | | | if_dropped | out | The total number of transmit packets dropped. | | | |
| | | | if_errors | out | The total number of transmit error packets. (This is the total of error conditions encountered when attempting to transmit a packet. The code here explains the possibilities, but this code is no longer present in /net/core/dev.c master at present - it appears to have moved to /net/core/net-procfs.c.) | | | |
| | | | if_octets | out | The total number of bytes transmitted. | | | |
| | | | if_packets | out | The total number of transmitted packets. | | | |
| | Memory (A read plugin that retrieves memory usage statistics) | | memory | buffered | The amount, in kibibytes, of temporary storage for raw disk blocks. | | | |
| | | | memory | cached | The amount of physical RAM, in kibibytes, used as cache memory. | | | |
| | | | memory | free | The amount of physical RAM, in kibibytes, left unused by the system. | | | |
| | | | memory | slab_recl | The part of Slab that can be reclaimed, such as caches. | | Slab: the total amount of memory, in kibibytes, used by the kernel to cache data structures for its own use. | |
| | | | memory | slab_unrecl | The part of Slab that cannot be reclaimed even when lacking memory. | | | |
| | | | memory | total | Total amount of usable RAM, in kibibytes, which is physical RAM minus a number of reserved bits and the kernel binary code. | | This was the only undefined metric in the mem_used calculation below. | |
| | | | memory | used | mem_used = mem_total - (mem_free + mem_buffered + mem_cached + mem_slab_total); | | https://github.com/collectd/collectd/blob/master/src/memory.c#L325 | |
| | disk (A read plugin that retrieves disk usage statistics) | | disk_io_time | io_time | Time spent doing I/Os (ms). You can treat this metric as a device load percentage (a value of 1 sec time spent matches 100% of load). | | | |
| | | | disk_io_time | weighted_io_time | Measure of both I/O completion time and the backlog that may be accumulating. | | | |
| | | | disk_merged | read | The number of operations that could be merged into other, already queued operations, i.e. one physical disk access served two or more logical operations. Of course, the higher that number, the better. | | | |
| | | | disk_merged | write | The number of operations that could be merged into other, already queued operations, i.e. one physical disk access served two or more logical operations. Of course, the higher that number, the better. | | | |
| disk_octets | read | the number of octets read from a disk or partition