Anuket Project
Monitoring Agents Comparative Study
The tables and lists of questions have been created by @Sridhar Rao <Sridhar.Rao@spirent.com>
There are numerous opensource monitoring solutions available, with varying approaches and architectures. In this study, we compare only the 'agent' component of the monitoring solution, and will not consider the server-side component(s). Because, there can be multiple implementation options of the 'server' - for example, with collectd, it could be simple collectd-web or a timeseries database such as Influxdb, telemetry system based on Apache Kafka, etc. - and considering all the options would be extremely difficult. Typically the server side components could include some or all of the following (a) Metric collection infrastructure - raw-metric receiver, message-queues, etc. (b) Metric Modifier - add contexts, perform-aggregation, filter, etc. (c) Storage solution (d) Alarm/Alerting System (e) Visualization/Graphing - dashboards. (f) Publishing.
Terminology Definition
Term | What we mean by that? |
Metric | A Measurement of a particular characteristic. |
Event | A record of something that has happened - A simple immutable fact. |
Agent | Software that runs on a node/system that needs to be monitored. |
Client Node | A node that is monitored (Node on which agent runs) |
Server Node | A node that collects metrics and events from the client node. |
Sampling Interval | How frequently the metrics are sent. |
Push Mode | Fetching of events by subscribing |
Poll Mode | Fetching of events via polling. |
Writing of Metrics/events | sending/outputting of metrics or events. |
Reading of Metrics/events | receiving/reading of measurements |
Logging of Metrics/events | Logging of monitored/received metric or event |
Metric Types (data source types) | Guage: Value stored as-is |
Parameter Table
Parameters\Tools | Collectd | Ceilometer Polling agent. | Monasca | SNAP | node-exporter and other exporters | sensu client: metric collection plugins | munin | telegraf | NRPE + Plugins (NSClient++, ICINGA, OpenNMS) | diamond | Reimann | Elastic Beats | Note: | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CPU metrics | idle, system, wait, stolen, user (% & time), util, vcpus | idle, system, wait, stolen, user (% & time), util, vcpus | idle, system, wait, stolen, user (% & time) | idle, system, wait, stolen, user, guest, irq, nice (% & jiffies) | idle, system, wait, stolen, user (% & time), util, vcpus | idle, system, wait, stolen, user (% & time), util, vcpus | Freq, usage - idle, system, wait, user, util and vcpus. | Same as ceilometer or monasca | user, system, iowait, idle in (% and time). average-load | idle, system, wait, user, nice. | idle, system, wait, user, nice, stolen, irq | idle, system, wait, user, nice, stolen, irq | ||
Disk IO metrics | Read and write (bytes, rate, time, sectors) disk-free | read and write (bytes, rate, req) | read and write (bytes, rate, req) | read and write (ops, octets, merged, time) disk-free | read and write (bytes, rate, req) | Read and write (bytes, rate, time, sectors) | read and write (bytes, rate, req) | Same as ceilometer or monasca | read and write (ops, octets, merged, time) disk-free | read and write (bytes, rate, req) | read and write (merged, sector, time, req) io- reqs, time, weighted | read and write (count, time and bytes) | ||
Memory metrics | free, swap, total, used (bytes and percetages) | usage, bandwidth | free, swap, total, used | free, available, total, used. | free, swap, total, used | free, swap, total, used (Mb and percentages) | free, swap, total, used, slab. | Same as ceilometer or monasca | free, available, total, used. (bytes, %ges) | free, total, swap, active, dirty, inactive, buffers. | free, used, (bytes and %ges) actual-used. | free, used, (bytes and %ges) actual-used. | ||
Process metrics | I/O, memory, CPU-Usage, read-write (bytes and count) | NO | NO | I/O, memory, CPU-Usage, (bytes and count). | Same as collectd. | status, thread-count, uptime. IO, memory, cpu-usage. connections. | Cpu and memory, read-write (bytes, count), and various other fields | Cpu and memory, read-write (bytes, count) | CPU, memory, uptime, | btime, ctxt, processes, blocked, running | I/O, memory, CPU-Usage, read-write (bytes and count) | I/O, memory, CPU-Usage, read-write (bytes and count) | ||
Network Interface Metrics | Interface plugin: Standard 4 fields of rx/tx (octets, packets, errors, dropped). | Standard 4 fields of rx/tx (octets, packets, errors, dropped). | Standard 4 fields of rx/tx (octets, packets, errors, dropped). | sent and recv : bytes, compressed, drops, errors, fifo, frame, multicast, packets | Standard 4 fields of rx/tx (octets, packets, errors, dropped). | Standard 4 fields of rx/tx (octets, packets, errors, dropped). Also includes, fifo, compressed, and frame stats. | rx/tx (octets, packets, errors, dropped). | Same as ceilometer or monasca | rx/tx (octets, packets, errors, dropped). SNMP (3) | Rx and Tx. MBs | Standard 4 fields of rx/tx (octets, packets, errors, dropped) | Standard 4 fields of rx/tx (octets, packets, errors, dropped). | ||
Libvirt Metrics | YES - | YES | YES | YES | YES | NO | NO | NO | YES | YES | NO | NO | ||
Container resource usage Monitoring (memory, restarts, status, uptime, etc) | YES | NO | NO | Docker | Docker | Docker | NO | Docker | YES (Docker, LXC) | Docker | YES (Docker) | YES (4) | ||
Databases Monitoring : [Influxdb, MongoDb, MySql, PostgreSql, Carbon(graphite), Prometheus, RRDCache,Redis, TSDB] | YES for all | MySql, PostgreSql, MongoDb | Influxdb, Vertica, MySql, PostgreSql, Cassandra | Influxdb, mysql, mongodb, Cassandra | ALL (4) | All | NO | All. | YES for all | MongoDb, mysql, postgresql, and Redis | YES for all | YES for all (4) | ||
Publish metrics to databases - (influxdb, mysql, TSDB, Postgresql, MongoDb, Carbon, Elasticsearch) | YES for all | NO | NO | YES for all. | NO | NO (1) | NO | Yes for all | NO | Yes for All | YES for all. | YES (4) | ||
Encryption Support | YES | NO | NO | YES | NO | NO | NO | NO | YES | YES | YES | YES | ||
Language (written) | C | Python | Python | Go | Go | Ruby | Perl | Go | perl, shell, c, (varies) | Python | Varies - ruby, c, c++, etc. | Go | ||
Extensibility - multilanguage support [Python, Java, Golang, C/C++, Lua] | YES for all | Java | Java | Python C++ | Java, Python, Ruby | Go, Python. | Python, Ruby | None. | Perl, shell, C. | None | Multiple | NO? | ||
Interoperability [with other monitoring solutions] | Sensu, statsd, telegraf? | Nagios zabbix | ceilometer | Ceilometer, Facter, Reimann, Prometheus | Collectd | Nagios, Zabbix. | NO | Reimann | NSClient, Icinga. | Nagios | Collectd | Collectd? | ||
Write to Message Queues and protocols (AMQP, Kafka, MQTT, NSQ) | YES for ALL | AMQP | Kafka | AMQP, Kafka. | NO | AMQP | NO | kafka, MQTT, NSQ | NO | Yes for ALL | YES for all | YES for all (4) | ||
Metrics Pub/sub Mode Support (Metrics push/pull mode support ?) | YES | YES | YES | YES | YES | YES | NO | YES | NO | YES | YES | YES | ||
Metrics Req/Resp Mode Support | NO | NO | NO | YES | NO | YES | YES | NO | YES | NO | YES | YES | ||
Support for Events (polling, Pushing) | Yes | NO (1) | NO (1) | NO | NO | YES | NO | YES | YES | NO | YES | YES | ||
Notification Support | YES | NO (1) | NO (1) | NO | NO (1) | YES | NO | NO | YES | NO | YES | YES | ||
Logging Support | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES | ||
Hypervisor metrics | YES | NO | NO | YES (KVM) | ||||||||||