The tables and lists of questions have been created by Sridhar Rao <Sridhar.Rao@spirent.com>
...
Term | What we mean by that? |
Metric | A measurement of a particular characteristic. Ex: percentage of CPU used, amount of bandwidth used, etc. Complete definition can be found here |
Event | A record of something that has happened - A simple immutable fact. Example: Link has gone down. A packet from a flow is dropped, etc. Complete definition can be found here |
Agent | Software that runs on a node/system that needs to be monitored. |
Client Node | A node that is monitored (Node on which agent runs) |
Server Node | A node that collects metrics and events from the client node. |
Sampling Interval | How frequently the metrics are sent. |
Push Mode | Fetching of events by subscribing |
Poll Mode | Fetching of events via polling. |
Writing of Metrics/events | sending/outputting of metrics or events. |
Reading of Metrics/events | receiving/reading of measurements |
Logging of Metrics/events | Logging of monitored/received metric or event |
Metric Types (data source types) | Gauge: Value stored as-is |
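The metric types compared on this page (gauge, derive, counter, absolute) differ in how a raw reading becomes a stored value. The sketch below assumes the collectd/RRDtool-style meaning of these names; the function names and the `bits` parameter are hypothetical, and other agents may define the types differently.

```python
def gauge(value):
    """Gauge: the reading is stored as-is."""
    return value


def derive(prev, curr, dt):
    """Derive: rate of change between two readings (may be negative)."""
    return (curr - prev) / dt


def counter(prev, curr, dt, bits=32):
    """Counter: like derive, but a decrease is treated as a wrap-around.

    Assumption: the counter width is known and passed in as `bits`.
    """
    delta = curr - prev
    if delta < 0:
        delta += 2 ** bits  # the counter overflowed and restarted from zero
    return delta / dt


def absolute(value, dt):
    """Absolute: the source resets on every read, so rate = value / interval."""
    return value / dt
```

For example, two readings of a 32-bit packet counter taken 10 seconds apart, one just before and one just after the wrap, still give a positive rate with `counter`, whereas `derive` would report a large negative value.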
Parameter Table
Note: 1. For some parameters the answer could be just YES/NO. 2. For some, we may have to provide a description/details. 3. For some, we may have to choose from the list [], whereas for others we may append a value to the list. 4. For some parameters, please provide the number of 'actual metrics' provided under that category. For example, collectd would provide 12 metrics for the Processes category. Use NA if not applicable. Use NK if not known.
Parameters\Tools | Collectd | Ceilometer Polling agent | Monasca | SNAP | node-exporter and other exporters | sensu client: metric collection plugins | munin | telegraf | NRPE + Plugins (NSClient++, ICINGA, OpenNMS) | diamond | Riemann | Elastic Beats | Centreon | NSClient++ (Same as NRPE) | icinga (Same as NRPE) | OpenNMS (Same as NRPE) | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CPU metrics | idle, system, wait, stolen, user (% & time), util, vcpus | idle, system, wait, stolen, user (% & time), util, vcpus | idle, system, wait, stolen, user (% & time) | idle, system, wait, stolen, user, guest, irq, nice (% & jiffies) | idle, system, wait, stolen, user (% & time), util, vcpus | idle, system, wait, stolen, user (% & time), util, vcpus | Freq, usage - idle, system, wait, user, util and vcpus. | Same as ceilometer or monasca | user, system, iowait, idle in (% and time). average-load | idle, system, wait, user, nice. | idle, system, wait, user, nice, stolen, irq | idle, system, wait, user, nice, stolen, irq | |||||||||||||
Disk IO metrics | Read and write (bytes, rate, time, sectors) disk-free | read and write (bytes, rate, req) | read and write (bytes, rate, req) | read and write (ops, octets, merged, time) disk-free | read and write (bytes, rate, req) | Read and write (bytes, rate, time, sectors) | read and write (bytes, rate, req) | Same as ceilometer or monasca | read and write (ops, octets, merged, time) disk-free | read and write (bytes, rate, req) | read and write (merged, sector, time, req) io- reqs, time, weighted | read and write (count, time and bytes) | |||||||||||||
Memory metrics | free, swap, total, used (bytes and percentages) | usage, bandwidth | free, swap, total, used | free, available, total, used. | free, swap, total, used | free, swap, total, used (MB and percentages) | free, swap, total, used, slab. | Same as ceilometer or monasca | free, available, total, used. (bytes, %ges) | free, total, swap, active, dirty, inactive, buffers. | free, used, (bytes and %ges) actual-used. | free, used, (bytes and %ges) actual-used. | |||||||||||||
Process metrics | I/O, memory, CPU-Usage, read-write (bytes and count) | NO | NO | I/O, memory, CPU-Usage, (bytes and count). | Same as collectd. | status, thread-count, uptime. IO, memory, cpu-usage. connections. | Cpu and memory, read-write (bytes, count), and various other fields | Cpu and memory, read-write (bytes, count) | CPU, memory, uptime, | btime, ctxt, processes, blocked, running | I/O, memory, CPU-Usage, read-write (bytes and count) | I/O, memory, CPU-Usage, read-write (bytes and count) | |||||||||||||
Network Interface Metrics | Interface plugin: Standard 4 fields of rx/tx (octets, packets, errors, dropped). Netlink plugin: uses netlink sockets and covers others | Standard 4 fields of rx/tx (octets, packets, errors, dropped). | Standard 4 fields of rx/tx (octets, packets, errors, dropped). | sent and recv : bytes, compressed, drops, errors, fifo, frame, multicast, packets | Standard 4 fields of rx/tx (octets, packets, errors, dropped). | Standard 4 fields of rx/tx (octets, packets, errors, dropped). Also includes, fifo, compressed, and frame stats. | rx/tx (octets, packets, errors, dropped). | Same as ceilometer or monasca | rx/tx (octets, packets, errors, dropped). SNMP (3) | Rx and Tx. MBs | Standard 4 fields of rx/tx (octets, packets, errors, dropped) | Standard 4 fields of rx/tx (octets, packets, errors, dropped). | |||||||||||||
Libvirt Metrics | YES | YES | YES | YES | YES | NO | NO | NO | YES | YES | NO | NO | |||||||||||
Container resource usage Monitoring (memory, restarts, status, uptime, etc) | YES | NO | NO | Docker | Docker | Docker | NO | Docker | YES (Docker, LXC) | Docker | YES (Docker) | YES (4) | |||||||||||||
Databases Monitoring : [Influxdb, MongoDb, MySql, PostgreSql, Carbon(graphite), Prometheus, RRDCache,Redis, TSDB] | YES for all | MySql, PostgreSql, MongoDb | Influxdb, Vertica, MySql, PostgreSql, Cassandra | Influxdb, mysql, mongodb, Cassandra | ALL (4) | All | NO | All. | YES for all | MongoDb, mysql, postgresql, and Redis | YES for all | YES for all (4) | |||||||||||||
Publish metrics to databases - (influxdb, mysql, TSDB, Postgresql, MongoDb, Carbon, Elasticsearch) | YES for all | NO | NO | YES for all. | NO | NO (1) | NO | Yes for all | NO | Yes for All | YES for all. | YES (4) | |||||||||||||
Encryption Support | YES | NO | NO | YES | NO | NO | NO | NO | YES | YES | YES | YES | |||||||||||||
Language (written) | C | Python | Python | Go | Go | Ruby | Perl | Go | perl, shell, c, (varies) | Python | Varies - ruby, c, c++, etc. | Go | |||||||||||||
Extensibility - multilanguage support [Python, Java, Golang, C/C++, Lua] | YES for all | Java | Java | Python C++ | Java, Python, Ruby | Go, Python. | Python, Ruby | None. | Perl, shell, C. | None | Multiple | NO? | |||||||||||||
Interoperability [with other monitoring solutions] | Sensu, statsd, telegraf? | Nagios, Zabbix | ceilometer | Ceilometer, Facter, Riemann, Prometheus | Collectd | Nagios, Zabbix. | NO | Riemann | NSClient, Icinga. | Nagios | Collectd | Collectd? | |||||||||||||
Write to Message Queues and protocols (AMQP, Kafka, MQTT, NSQ) | YES for ALL | AMQP | Kafka | AMQP, Kafka. | NO | AMQP | NO | kafka, MQTT, NSQ | NO | Yes for ALL | YES for all | YES for all (4) | |||||||||||||
Metrics Pub/sub Mode Support (Metrics push/pull mode support ?) | YES | YES | YES | YES | YES | YES | NO | YES | NO | YES | YES | YES | |||||||||||||
Metrics Req/Resp Mode Support | NO | NO | NO | YES | NO | YES | YES | NO | YES | NO | YES | YES | |||||||||||||
Support for Events (polling, Pushing) | Yes | NO (1) | NO (1) | YESNO | NO | YES | NO | YES | YES | NO | YES | YES | |||||||||||||
Notification Support | YES | NO (1) | NO (1) | YESNO | NO (1) | YES | NO | NO | YES | NO | YES | YES | |||||||||||||
Logging Support | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES | |||||||||||||
Hypervisor metrics | YES | NO | NO | YES (KVM) | YES | YES (XenTop) | NO | NO | YES | XEN, KVM. | NO | NO | |||||||||||||
Log-File Analysis | YES | NO | NO | YES | YES (mtail) | NO | NO | YES | YES | NO | YES | YES | |||||||||||||
Other Writing (output) Support: [CSV, HTTP, RRD, UnixSocket, Multicast] | ALL that are listed. | NO | NO | NO | HTTP | NO | RRD | Socket, | NO | HTTP | NO | YES? | |||||||||||||
Transport Protocol | Depends on the end point it's communicating with. | TCP* | TCP* | TCP | TCP, UDP. (5) | TCP | TCP | TCP, UDP | TCP | TCP | TCP, UDP | TCP, UDP | |||||||||||||
Data-Format [XML, JSON, etc] | JSON, Custom, XML | JSON XML | JSON | JSON | JSON ? | JSON | Custom | Custom | Custom | JSON | Custom | Custom, JSON | |||||||||||||
Data-model | Custom | KVP | KVP | KVP | KVP | KVP | Custom | Custom | Custom | KVP | KVP | KVP | |||||||||||||
Hardware: IPMI, Battery, Sensors | YES for all | IPMI | IPMI | IPMI | YES for all | YES - IPMI | YES (3) | IPMI sensors | YES | NO | NO? | YES for all | |||||||||||||
Metric Types: Gauge, Derive, Counter, Absolute | YES for all | Gauge, cumulative, delta | Gauge, rate, counter. | gauge, derive, counter. | Gauge, Counter, Histogram, summary | Gauge, Counter, derive. | Gauge, Counter, derive. | Gauge, Counter. | Gauge, Derivative, delta | Gauge, sum, counter, derive | Gauge, sum, counter, derive | ||||||||||||||
Last-Updated | 2017 | 2017 | 2017 | Varies(5) | Varies (5) | Varies (5) | Varies (5) | 2017 | varies(5) | Varies (5) | Varies(5) | Varies(5) | |||||||||||||
Commercial Versions? | NO | NO | ? | NO | NO | YES | NO | NO | YES | YES? | YES? | YES? | |||||||||||||
Resource consumption by the agent |
Run-Time Analysis [^] | CPU: 14.8% | CPU:17.5% | |||||||||||||||||||||||
License | MIT/GPL v2 or later | Apache License, Version 2.0 | Apache License, Version 2.0 | Apache License, Version 2.0 | Multiple (5) | MIT | GPL V2. | MIT | GPL V3 | MIT | MIT | Apache License, Version 2.0 | |||||||||||||
Webserver monitoring [Nginx, Apache] | YES for all | Apache | Apache | YES for all. | Nginx, Apache, Passenger, Varnish | Apache, Nginx, Unicorn. | NO | Yes for all | YES for all | NO | YES for all. | Yes for all | |||||||||||||
Platforms - OS? Linux (unix'es), Windows. | Supports windows, linux, freebsd, etc. | Linux | Linux | Linux, MAC, Windows (soon) | Linux Windows(3) | Linux, Windows, | Linux, Windows | Linux | ALL | Linux | ALL | ALL | |||||||||||||
Configuration Tool support [Puppet, Chef, Ansible, Salt] | YES for all | Puppet Chef | Puppet, Chef, Ansible, | Yes for all. | Yes for all. | YES for all | NO | Yes for All. | Yes for all | Puppet | ALL | ALL | |||||||||||||
Deployments: servers, VMs, containers, | ALL | ALL | ALL | ALL | ALL | ALL. | ALL | All | ALL | ALL | ALL | ALL | |||||||||||||
OpenStack Modules | NO (2) | NO | ALL. | CEPH, Cinder, Glance, Keystone, Neutron, Nova | NO | NO | NO | NO | YES (All) | NO | NO | NO | ||||||||||||
Intel PCM and SSDs SMART metrics | NO | NO | NO | YES | NO | NO | NO | NO | NO | NO | NO | NO | |||||||||||||
Cluster Mgmt. (Kubernetes, Mesos, Swarm) | NO | NO | NO | Kubernetes and Mesos | Kubernetes and mesos | Kubernetes and mesos | NO | Kubernetes and Mesos | YES | NO | YES | YES | |||||||||||||
Modifiers - (filtering, threshold, tags, contexts) | Filtering and threshold - yes. Tags - YES. Contexts - No. (1) | NO | YES | YES for all. | Tags, Filtering and threshold. | NO (1) | NO | Tagging | YES | Tags | YES | YES | |||||||||||||
Dynamic Loading of plugins. | NO | NO | NO | YES | YES | YES. | YES? | NO | YES | NO | YES | YES | |||||||||||||
Intervals: | Lowest Sampling Interval - How frequently the plugins can read values from source(s) of truth. | Can go down to nanosecond resolution | LSI: can go down to nanosecond resolution. NTI: cannot be specified; depends on the size of the buffer and the reading interval | PI: configurable; default: 60s | CF: configurable; further controlled by per-plugin "collect_period" | Based on task configuration. Interval: can go down to ns resolution. | Configurable scrape interval. | Command-specific check interval. | |||||||||||||||||
Interval for transmitting over the network | Cannot be specified - depends on size of the buffer and reading interval | ||||||||||||||||||||||||
Other Services monitoring: (DHCP, DNS, FTP, NTP, HAProxy, Consul) | HAProxy, DNS, NTP | NO | HAProxy, NTP. | HAProxy | DHCP, HAproxy, NTP, Consul. | YES for all. | NO | HAproxy, NTP, Consul, DNS, | YES | NO | YES (4) | YES(4) |
Legends
(1) This aspect is realized either as a server-side component or by a 'customized' agent.
(2) Custom solutions exist, and may not be part of the main distribution.
(3) Supported, with a strong dependency on an additional tool/library.
(4) Supports more options than the ones listed in column 1.
(5) A single value cannot be entered due to the development of logically independent modules by different community groups.
...
[^]: Runtime analysis process and considerations:
- Isolate the CPUs on the monitoring node. [Add the isolcpus option in GRUB. CPU0]
- Run the agent on the isolated CPU (CPU0). [Use the taskset command to run agent processes with the appropriate CPU mask: 0x01]
- Plugins: Configure the agent to monitor the following metrics - CPU, Memory, Disk, Interface, IPMI, processes, libvirt, Caches, OVS, hugepages.
- Output: Have the agent send metrics over the network (Ex: influxdb running on a separate node)
- Workload: stress-ng + iperf.
- Monitoring duration: 5 minutes.
- Frequency: 1 sec.
- Collect metrics (using any other tool) to analyze the agent's runtime performance [Ex: Used Snap to collect 'collectd-process' metrics and CPU and memory data]
- Note the iperf performance [to study any effect on it due to collectd]
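The CPU mask in the taskset step is a bitmask with one bit per CPU index, so CPU0 alone gives 0x01. A minimal sketch of building that mask and the corresponding command line follows; the helper names are hypothetical, and `collectd -f` (run in the foreground) is used only as an example agent command.

```python
def cpu_mask(cpus):
    """Return a taskset-style hex mask with one bit set per CPU index."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU index must be non-negative")
        mask |= 1 << cpu
    return f"0x{mask:02x}"


def taskset_command(cpus, agent_cmd):
    """Build the argument list: taskset <mask> <agent command...>."""
    return ["taskset", cpu_mask(cpus)] + list(agent_cmd)


if __name__ == "__main__":
    # Pin the example agent to the isolated CPU0, as in the steps above.
    print(" ".join(taskset_command([0], ["collectd", "-f"])))
```

The resulting list can be passed to `subprocess.run` on the monitoring node; the sketch only builds the command so that it can be inspected first.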
#: Interval Definitions:
Lowest Sampling Interval (LSI) - How frequently the plugins can read values from source(s) of truth.
Network Transmit Interval (NTI) - Interval at which the metrics are sent over the network.
Polling Interval (PI): Frequency at which metrics are read.
Check Frequency (CF): Frequency at which all plugins are run. This may map to LSI and NTI.
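Where the table says the NTI cannot be specified because it depends on the size of the buffer and the reading interval, the relationship can be approximated as below. This is a simplified model under a stated assumption: the agent flushes its buffer to the network only when the buffer is full. The function and parameter names are hypothetical, and real agents may also flush on timers or on shutdown.

```python
import math


def estimated_transmit_interval(buffer_capacity, values_per_read, read_interval_s):
    """Estimate seconds between network sends for a flush-when-full buffer.

    buffer_capacity: number of values the send buffer can hold
    values_per_read: number of values produced by each read cycle
    read_interval_s: sampling (read) interval in seconds
    """
    if buffer_capacity <= 0 or values_per_read <= 0 or read_interval_s <= 0:
        raise ValueError("all arguments must be positive")
    # Number of read cycles needed before the buffer fills and is flushed.
    reads_to_fill = math.ceil(buffer_capacity / values_per_read)
    return reads_to_fill * read_interval_s
```

Under this model, a buffer of 100 values filled with 25 values per one-second read is sent roughly every 4 seconds, so lowering the sampling interval or enabling more plugins shortens the effective NTI.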
Inference Questions
The Questions | The Answer |
Lowest Interval: Which agent supports the lowest sampling interval, and what is the value? | |
Interoperability: Which agent is 'most interoperable'? (Works with the maximum number of 'servers' (collection nodes)) | |
Large-scale deployment: Which agent is ideal for large-scale monitoring (Provide description in a separate page, if needed) | |
Low-footprint: Which agent has the lowest footprint (memory and CPU)? | |
Metrics: Which agent supports the maximum number of metrics? | |
Gaps: Are there any metrics that are not supported by any of the agents and that are relevant to NFV? | |
Which agent is ideal for realtime analytics?- [Support for maximum scalable datastores, visualization tools and Analytics engines?] | |
Have any of the agents been used in large-scale real-world deployments? If so, please provide details on the performance. | |
Which agent has the least/maximum dependency - Libraries, OS/Kernel versions, etc.? | |
Which agent provides maximum 'freedom' w.r.t. Licenses (core agent + plugins)? | |
Which agent is best for the following datastores: Influxdb, Graphite, ElasticSearch? | |
Which agent supports dynamic configuration? | |