Monitoring Agents Comparative Study

Anuket Project


 

The tables and lists of questions have been created by @Sridhar Rao <Sridhar.Rao@spirent.com>

 

There are numerous open-source monitoring solutions available, with varying approaches and architectures. In this study, we compare only the 'agent' component of the monitoring solution and do not consider the server-side component(s). This is because the 'server' can be implemented in many ways (with collectd, for example, it could be a simple collectd-web, a time-series database such as InfluxDB, or a telemetry system based on Apache Kafka), and considering all the options would be extremely difficult. Typically, the server-side components could include some or all of the following: (a) Metric collection infrastructure: raw-metric receiver, message queues, etc. (b) Metric modifier: add context, perform aggregation, filter, etc. (c) Storage solution. (d) Alarm/alerting system. (e) Visualization/graphing: dashboards. (f) Publishing.
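To make the "metric modifier" stage (b) more concrete, the following is a minimal Python sketch of adding context, filtering, and aggregating samples on the server side. The `Sample` type, the function names, and the metric names (`cpu.idle`, `mem.free`) are hypothetical and chosen only for illustration; none of them comes from any of the tools compared in this study.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Sample:
    """One raw metric sample as an agent might send it (hypothetical shape)."""
    name: str
    value: float
    tags: dict = field(default_factory=dict)


def add_context(samples, **context):
    """Return copies of the samples with extra tags (e.g. the host) attached."""
    return [Sample(s.name, s.value, {**s.tags, **context}) for s in samples]


def keep_prefix(samples, prefix):
    """Filter: keep only samples whose metric name starts with the prefix."""
    return [s for s in samples if s.name.startswith(prefix)]


def aggregate(samples):
    """Aggregate: average the values per metric name."""
    grouped = {}
    for s in samples:
        grouped.setdefault(s.name, []).append(s.value)
    return {name: mean(values) for name, values in grouped.items()}


raw = [Sample("cpu.idle", 90.0), Sample("cpu.idle", 80.0), Sample("mem.free", 512.0)]
tagged = add_context(raw, host="node-1")
cpu_only = keep_prefix(tagged, "cpu.")
averages = aggregate(cpu_only)
print(averages)
```

Because these steps run after the agent has handed over its data, they can differ between deployments without changing the agent, which is the reason the study limits itself to the agent side.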

 

Terminology Definition

Term

What we mean by that?

Metric

A measurement of a particular characteristic.
Example: percentage of CPU used, amount of bandwidth used, etc. Complete definition can be found here

Event

A record of something that has happened: a simple immutable fact.
Example: a link has gone down, a packet from a flow has been dropped, etc. Complete definition can be found here

Agent

Software that runs on a node/system that needs to be monitored.

Client Node

A node that is monitored (Node on which agent runs)

Server Node

A node that collects metrics and events from the client node.

Sampling Interval

How frequently the metrics are sent.

Push Mode

Fetching of events by subscribing

Poll Mode

Fetching of events via polling.

Writing of Metrics/events

sending/outputting of metrics or events.

Reading of Metrics/events

receiving/reading of measurements

Logging of Metrics/events

Logging of monitored/received metric or event

Metric Types (data source types)

Gauge: Value stored as-is
Derive: Derivative - Change of the value over time (rate)
Counter: Similar to Derive, but the rate is NEVER negative (a decrease in the value is treated as a wrap-around)
Absolute: 
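The difference between the Gauge, Derive, and Counter types can be illustrated with a short Python sketch. The function names and the default 32-bit counter width are assumptions made for this example and are not taken from any particular agent's implementation. The Absolute type is left out because it is not defined above.

```python
def gauge(value):
    """Gauge: the value is stored as-is."""
    return value


def derive_rate(previous, current, interval_s):
    """Derive: change of the value per second; may be negative."""
    return (current - previous) / interval_s


def counter_rate(previous, current, interval_s, width_bits=32):
    """Counter: like Derive, but a decrease is treated as a wrap-around.

    width_bits is an assumed counter width; a real agent has to know or
    guess the width of the underlying counter.
    """
    delta = current - previous
    if delta < 0:
        # The counter passed its maximum and restarted from zero.
        delta += 2 ** width_bits
    return delta / interval_s


# Two readings taken 10 seconds apart (the sampling interval).
print(gauge(42.5))
print(derive_rate(100, 160, 10))
print(derive_rate(160, 100, 10))
print(counter_rate(2 ** 32 - 10, 20, 10))
```

In the last call the raw difference is negative, so the sketch adds 2^32 before dividing. This is why a Counter never reports a negative rate, whereas a Derive does when the value falls.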

 

 

Parameter Table

 

Parameters\Tools

Collectd

Ceilometer polling agent

Monasca

SNAP

node-exporter and other exporters

sensu client: metric collection plugins

munin

telegraf

NRPE + Plugins (NSClient++, ICINGA, OpenNMS)

diamond

Riemann

Elastic Beats

Note:
1. For some parameters the answer could be just YES/NO,
2. Whereas, for some we may have to provide a description/details
3. For some we may have to choose from the list [], whereas for some we may append a value to the list.
4. For some parameters, please provide the number of 'actual metrics' provided under that category. For example, collectd would provide 12 metrics for Processes-category

Use NA - If Not applicable.
Use NK - If it is Not Known

CPU metrics

idle, system, wait, stolen, user (% & time), util, vcpus

idle, system, wait, stolen, user (% & time), util, vcpus

idle, system, wait, stolen, user (% & time)

idle, system, wait, stolen, user, guest, irq, nice (% & jiffies)

idle, system, wait, stolen, user (% & time), util, vcpus

idle, system, wait, stolen, user (% & time), util, vcpus

Freq, usage - idle, system, wait, user, util and vcpus.

Same as ceilometer or monasca

user, system, iowait, idle (% and time), average-load

idle, system, wait, user, nice.

idle, system, wait, user, nice, stolen, irq

idle, system, wait, user, nice, stolen, irq

Disk IO metrics

Read and write (bytes, rate, time, sectors)

disk-free

read and write (bytes, rate, req)

read and write (bytes, rate, req)

read and write (ops, octets, merged, time)

disk-free

read and write (bytes, rate, req)

Read and write (bytes, rate, time, sectors)

read and write (bytes, rate, req)

Same as ceilometer or monasca

read and write (ops, octets, merged, time)

disk-free

read and write (bytes, rate, req)

read and write (merged, sector, time, req)

io- reqs, time, weighted

read and write (count, time and bytes)

Memory metrics

free, swap, total, used (bytes and percentages)

usage, bandwidth

free, swap, total, used

free, available, total, used.

free, swap, total, used

free, swap, total, used (Mb and percentages)

free, swap, total, used, slab.

Same as ceilometer or monasca

free, available, total, used (bytes and percentages).

free, total, swap, active, dirty, inactive, buffers.

free, used (bytes and percentages), actual-used.

free, used (bytes and percentages), actual-used.

Process metrics

I/O, memory, CPU-Usage, read-write (bytes and count)

NO

NO

I/O, memory, CPU-Usage, (bytes and count).

Same as collectd.

status, thread-count, uptime. IO, memory, cpu-usage. connections.

Cpu and memory, read-write (bytes, count), and various other fields

Cpu and memory, read-write (bytes, count)

CPU, memory, uptime,

btime, ctxt, processes, blocked, running

I/O, memory, CPU-Usage, read-write (bytes and count)

I/O, memory, CPU-Usage, read-write (bytes and count)

Network Interface Metrics

Interface plugin: Standard 4 fields of rx/tx (octets, packets, errors, dropped).
Netlink plugin: uses netlink sockets and covers others

Standard 4 fields of rx/tx (octets, packets, errors, dropped).

Standard 4 fields of rx/tx (octets, packets, errors, dropped).

sent and recv : bytes, compressed, drops, errors, fifo, frame, multicast, packets

Standard 4 fields of rx/tx (octets, packets, errors, dropped).

Standard 4 fields of rx/tx (octets, packets, errors, dropped). Also includes, fifo, compressed, and frame stats.

rx/tx (octets, packets, errors, dropped).

Same as ceilometer or monasca

rx/tx (octets, packets, errors, dropped). SNMP (3)

Rx and Tx.

MBs

Standard 4 fields of rx/tx (octets, packets, errors, dropped)

Standard 4 fields of rx/tx (octets, packets, errors, dropped).

Libvirt Metrics

YES - 

YES

YES

YES

YES

NO

NO

NO

YES

YES

NO

NO

Container resource usage monitoring (memory, restarts, status, uptime, etc.)

YES

NO

NO

Docker

Docker

Docker

NO

Docker

YES (Docker, LXC)

Docker

YES (Docker)

YES (4)

Database Monitoring: [InfluxDB, MongoDB, MySQL, PostgreSQL, Carbon (Graphite), Prometheus, RRDCache, Redis, TSDB]

YES for all

MySql, PostgreSql, MongoDb

Influxdb, Vertica, MySql, PostgreSql, Cassandra

Influxdb, mysql, mongodb, Cassandra

ALL (4)

All

NO

All.

YES for all

MongoDb, mysql, postgresql, and Redis

YES for all

YES for all (4)

Publish metrics to databases - (influxdb, mysql, TSDB, Postgresql, MongoDb, Carbon, Elasticsearch)

YES for all

NO

NO

YES for all.

NO

NO (1)

NO

Yes for all

NO

Yes for All

YES for all.

YES (4)

Encryption Support

YES

NO

NO

YES

NO

NO

NO

NO

YES

YES

YES

YES

Language (written)

C

Python

Python

Go

Go

Ruby

Perl

Go

perl, shell, c, (varies)

Python

Varies - ruby, c, c++, etc.

Go

Extensibility - multilanguage support [Python, Java, Golang, C/C++, Lua]

YES for all

Java

Java

Python

C++

Java, Python, Ruby

Go, Python.

Python, Ruby

None.

Perl, shell, C.

None

Multiple

NO?

Interoperability [with other monitoring solutions]

Sensu, statsd, telegraf?

Nagios, Zabbix

ceilometer

Ceilometer, Facter, Riemann, Prometheus

Collectd

Nagios, Zabbix.

NO

Riemann

NSClient, Icinga.

Nagios

Collectd

Collectd?

Write to Message Queues and protocols (AMQP, Kafka, MQTT, NSQ)

YES for ALL

AMQP

Kafka

AMQP, Kafka.

NO

AMQP

NO

Kafka, MQTT, NSQ

NO

Yes for ALL

YES for all

YES for all (4)

Metrics Pub/sub Mode Support (Metrics push/pull mode support?)

YES

YES

YES

YES

YES

YES

NO

YES

NO

YES

YES

YES

Metrics Req/Resp Mode Support 

NO

NO

NO

YES

NO

YES

YES

NO

YES

NO

YES

YES

Support for Events (polling, Pushing)

Yes

NO (1)

NO (1)

NO

NO

YES

NO

YES

YES

NO

YES

YES

Notification Support

YES

NO (1)

NO (1)

NO

NO (1)

YES

NO

NO

YES

NO

YES

YES

Logging Support 

YES

YES

YES

YES

YES

YES

YES

YES

YES

YES

YES

YES

Hypervisor metrics

YES

NO

NO

YES (KVM)