
The tables and lists of questions have been created by Sridhar Rao <Sridhar.Rao@spirent.com>

...

Term | What we mean by that?
Metric | A measurement of a particular characteristic. Ex: percentage of CPU used, amount of bandwidth used, etc. Complete definition can be found here
Event | A record of something that has happened - a simple immutable fact. Ex: a link has gone down, a packet from a flow is dropped, etc. Complete definition can be found here
Agent | Software that runs on a node/system that needs to be monitored.
Client Node | A node that is monitored (node on which the agent runs).
Server Node | A node that collects metrics and events from the client node.
Sampling Interval | How frequently the metrics are sent.
Push Mode | Fetching of events by subscribing.
Poll Mode | Fetching of events via polling.
Writing of Metrics/events | Sending/outputting of metrics or events.
Reading of Metrics/events | Receiving/reading of measurements.
Logging of Metrics/events | Logging of a monitored/received metric or event.
Metric Types (data source types)

Gauge: Value stored as-is
Derive: Derivative - change of the value (rate)
Counter: Similar to Derive, but the rate is NEVER negative (a decrease is treated as a wrap-around)
Absolute: 
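
The Gauge, Derive and Counter definitions above can be illustrated with a minimal Python sketch. The function names and the 32-bit counter width are assumptions made for illustration only; they are not taken from any particular agent. Absolute is left out because the table gives no definition for it.

```python
def gauge(value: float) -> float:
    """GAUGE: the reading is stored as-is."""
    return value


def derive(prev: int, curr: int, interval_s: float) -> float:
    """DERIVE: rate of change per second; negative if the value went down."""
    return (curr - prev) / interval_s


def counter(prev: int, curr: int, interval_s: float, bits: int = 32) -> float:
    """COUNTER: like DERIVE, but a decrease is read as a wrap-around.

    Assumption: the source is an unsigned counter that is `bits` wide,
    so the computed rate is never negative.
    """
    delta = curr - prev
    if delta < 0:
        delta += 2 ** bits  # the counter wrapped past its maximum value
    return delta / interval_s
```

For example, a reading that drops from 100 to 40 over 10 seconds gives a negative Derive rate, whereas a 32-bit Counter that moves from 2**32 - 10 to 20 over the same period is treated as having advanced by 30.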

 

 

Parameter Table

 

Parameters\Tools | Collectd | Ceilometer (Polling agent) | Monasca | SNAP | node-exporter and other exporters | sensu client: metric collection plugins | munin | telegraf | NRPE + Plugins (NSClient++, ICINGA, OpenNMS) | diamond | Riemann | Elastic Beats

Note:
1. For some parameters the answer could be just YES/NO,
2. Whereas, for some we may have to provide a description/details
3. For some we may have to choose from the list [], whereas for some we may append a value to the list.
4. For some parameters, please provide the number of 'actual metrics' provided under that category. For example, collectd would provide 12 metrics for Processes-category

Use NA - If Not applicable.
Use NK - If it is Not Known

Lowest Sampling Interval -

(for transmitting over network)

can go down to a nano second resolution

(1-sec)

               
CPU metricsCPU metricsidle, system, wait, stolen, user (% & time), util, vcpusidle, system, wait, stolen, user (% & time), util, vcpusidle, system, wait, stolen, user (% & time) idle, system, wait, stolen, user, guest, irq, nice (% & jiffies)idle, system, wait, stolen, user (% & time), util, vcpusidle, system, wait, stolen, user (% & time), util, vcpus 

Freq,

usage - idle, system, wait, user, util and vcpus.

Same as ceilometer or monasca 

user, system, iowait, idle in (% and time).

average-load

idle, system, wait, user, nice      .idle, system, wait, user, nice, stolen, irqidle, system, wait, user, nice, stolen, irq
Disk IO metrics

Read and write (bytes, rate, time, sectors)

disk-free

read and write (bytes, rate, req)read and write (bytes, rate, req)

read and write (ops, octets, merged, time)

 

disk-free

read and write (bytes, rate, req)Read and write (bytes, rate, time, sectors) read and write (bytes, rate, req)Same as ceilometer or monasca 

read and write (ops, octets, merged, time)

disk-free

read and write (bytes, rate, req)      Memory metrics usage, bandwidthfree, swap, total, used 

read and write (merged, sector, time, req)

io- reqs, time, weighted

read and write (count, time and bytes)
Memory metricsfree, swap, total, used (bytes and percetages)usage, bandwidthfree, swap, total, used (Mb and percentages) Same as ceilometer or monasca free, available, total, used.free, swap, activetotal, dirtyusedfree, inactive, buffers.      
swap, total, used (Mb and percentages)free, swap, total, used, slab.Same as ceilometer or monascafree, available, total, used. (bytes, %ges)free, total, swap, active, dirty, inactive, buffers.free, used, (bytes and %ges) actual-used.free, used, (bytes and %ges) actual-used.
Process metricsI/O, memory, CPU-Usage, read-write (bytes and count)NONOI/O, memory, CPU-Usage, (bytes and count). Same as collectd.status, thread-count, uptime. IO, memory, cpu-usage. connections.Cpu and memory, read-write (bytes, count), and various other fields Cpu and memory, read-write (bytes, count)CPU, memory, uptime,btime, ctxt, processes, blocked, runningI/O, memory, CPU-Usage, read-write (bytes and count)I/O, memory, CPU-Usage, read-write (bytes and count)
Network Interface MetricsInterface plugin: Standard 4 fields of rx/tx (octets, packets, errors, dropped).
Netlink plugin: uses netlink sockets and covers others
Standard 4 fields of rx/tx (octets, packets, errors, dropped).Standard 4 fields of rx/tx (octets, packets, errors, dropped).sent and recv : bytes, compressed, drops, errors, fifo, frame, multicast, packetsStandard 4 fields of rx/tx (octets, packets, errors, dropped).Standard 4 fields of rx/tx (octets, packets, errors, dropped). Also includes, fifo, compressed, and frame stats.rx/tx (octets, packets, errors, dropped).Same as ceilometer or monascarx/tx (octets, packets, errors, dropped). SNMP (3)

Rx and Tx.

MBs

Standard 4 fields of rx/tx (octets, packets, errors, dropped)Standard 4 fields of rx/tx (octets, packets, errors, dropped).
Libvirt Metrics | YES | YES | YES | YES | YES | NO | NO | NO | YES | YES | NO | NO
Container resource usage Monitoring (memory, restarts, status, uptime, etc) | YES | NO | NO | Docker | Docker | Docker | NO | Docker | YES (Docker, LXC) | Docker | YES (Docker) | YES (4)
Databases Monitoring [Influxdb, MongoDb, MySql, PostgreSql, Carbon (graphite), Prometheus, RRDCache, Redis, TSDB] | YES for all | MySql, PostgreSql, MongoDb | Influxdb, Vertica, MySql, PostgreSql, Cassandra | Influxdb, mysql, mongodb, Cassandra | ALL (4) | All | NO | All | YES for all | MongoDb, mysql, postgresql, and Redis | YES for all | YES for all (4)
Publish metrics to databases - (influxdb, mysql, TSDB, Postgresql, MongoDb, Carbon, Elasticsearch)YES for all?NO?             NOYES for all.NONO (1)NOYes for allNOYes for AllYES for all.YES (4)
Encryption SupportYESNONO YESNONO NO NO YESYES YES     Extensibility - multilanguage support [Python, Java, Golang, C/C++, Lua]YES for allJavaJava Java, Python, RubyGo, Python.          YES
Language (written) | C | Python | Python | Go | Go | Ruby | Perl | Go | perl, shell, c (varies) | Python | Varies - ruby, c, c++, etc. | Go
Extensibility - multilanguage support [Python, Java, Golang, C/C++, Lua]YES for allJavaJava

Python

C++

Java, Python, RubyGo, Python.Python, RubyNone.Perl, shell, C.NoneMultipleNO?
Interoperability [with other monitoring solutions]Sensu, statsd, telegraf?

Nagios zabbix

ceilometer Ceilometer, Facter, Reimann, PrometheusCollectdNagios, Zabbix. NO  ReimannNSClient, Icinga.Nagios Collectd     Write Collectd?
Write to Message Queues and protocols (AMQP, Kafka, MQTT, NSQ) | YES for all | AMQP | Kafka | AMQP, Kafka | NO | AMQP | NO | Kafka, MQTT, NSQ | NO | YES for all | YES for all | YES for all (4)

Metrics Pub/sub Mode Support (Metrics push/pull mode support?) | YES | YES | YES | YES | YES | YES | NO | YES | NO | YES | YES | YES
Metrics Req/Resp Mode Support | NO | NO | NO | YES | NO | YES | YES | NO | YES | NO | YES | YES
Support for Events (polling, pushing) | YES | NO (1) | NO (1) | NO | NO | YES | NO | YES | YES | NO | YES | YES
Notification Support | YES | NO (1) | NO (1) | NO | NO (1) | YES | NO | NO | YES | NO | YES | YES
Logging Support | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES
Hypervisor metrics | YES | NO | NO | YES (KVM) | YES | YES (XenTop) | NO | NO | YES | XEN, KVM | NO | NO
Log-File Analysis | YES | NO | NO | YES | YES (mtail) | NO | NO | YES | YES | NO | YES | YES
Other Writing (output) Support [CSV, HTTP, RRD, UnixSocket, Multicast] | ALL that are listed | NO | NO | NO | HTTP | NO | RRD | Socket | NO | HTTP | NO | YES?
Transport Protocol | Depends on the end point it's communicating with | TCP* | TCP* | TCP | TCP, UDP (5) | TCP | TCP | TCP, UDP | TCP | TCP | TCP, UDP | TCP, UDP
 Data-Format
[XML, JSON, etc]
JSON, Custom, XMLJSON XMLJSON JSONJSON ?JSON CustomCustom Custom JSON      CustomCustom, JSON
Data-model | Custom | KVP | KVP | KVP | KVP | KVP | Custom | Custom | Custom | KVP | KVP | KVP
Hardware: IPMI, Battery, Sensors | YES for all | IPMI | IPMI | IPMI | YES for all | YES - IPMI | YES (3) | IPMI sensors | YES | NO | NO? | YES for all
Metric Types: Guage, Derive, Counter, absoluteYES for allGuage Gauge cumulative delta  GuageGauge, rate, counter.gauge, derive, counter.Gauge, Counter, Histogram, summary   Gauge, Counter, derive.Gauge, Counter, derive.Gauge, Counter.        
Language (written)CPythonPython GoRuby Go        
Last-Updated201720172017 Gauge, Derivative, deltaGauge, sum, counter, deriveGauge, sum, counter, derive
Last-Updated | 2017 | 2017 | 2017 | Varies (5) | Varies (5) | Varies (5) | Varies (5) | 2017 | Varies (5) | Varies (5) | Varies (5) | Varies (5)
Commercial Versions? NO NO ? NO NO YES NO No YESCommercial VersionsYES?NOYES?NOYES? NOYES No 
Run-Time Analysis [^]

CPU: 14.8%
VM:958205952
RSS: 2066
Code:14442496
Data:831705088
StackSize:2288

      Resource consumption by the agent

Binary: 617Kb

 

        

CPU:17.5%
VM:345899008
RSS:7880
Code:15036416
Data:321084800
StackSize:416

       
License | MIT/GPL v2 or later | Apache License, Version 2.0 | Apache License, Version 2.0 | Apache License, Version 2.0 | Multiple (5) | MIT | GPL V2 | MIT | GPL V3 | MIT | MIT | Apache License, Version 2.0
Webserver monitoring [Nginx, Apache] | YES for all | Apache | Apache | YES for all | Nginx, Apache, Passenger, Varnish | Apache, Nginx, Unicorn | NO | YES for all | YES for all | NO | YES for all | YES for all

Platforms - OS?

Linux (unix'es), Windows.

Supports windows, linux, freebsd, etc.LinuxLinux

Linux

 

Unix, MAC,

Windows (soon)

Linux

Windows(3)

Linux, Windows, Linux, Windows Linux ALL Linux ALL     ALL
Configuration Tool support [Puppet, Chef, Ansible, Salt] | YES for all | Puppet, Chef | Puppet, Chef, Ansible | YES for all | YES for all | YES for all | NO | YES for all | YES for all | Puppet | ALL | ALL
Deployments: servers, VMs, containers | ALL | ALL | ALL | ALL | ALL | ALL | ALL | ALL | ALL | ALL | ALL | ALL
Openstack Modules | NO (2) | NO | ALL | CEPH, Cinder, Glance, Keystone, Neutron, Nova | NO | NO | NO | NO | YES (All) | NO | NO | NO
Intel PCM and SSDs SMART metrics | NO | NO | NO | YES | NO | NO | NO | NO | NO | NO | NO | NO


Cluster Mgmt. (Kubernetes, Mesos, Swarm) | NO | NO | NO | Kubernetes and Mesos | Kubernetes and Mesos | Kubernetes and Mesos | NO | Kubernetes and Mesos | YES | NO | YES | YES
Modifiers (filtering, threshold, tags, contexts) | Filtering and threshold: YES; Tags: YES; Contexts: NO (1) | NO | YES | YES for all | Tags, filtering and threshold | NO (1) | NO | Tagging | YES | Tags | YES | YES
Dynamic Loading of plugins | NO | NO | NO | YES | YES | YES | YES? | NO | YES | NO | YES | YES

Intervals:

 

LSI: can go down to a nano second resolution

NTI: Cannot be specified - depends on size of the buffer and reading interval

PI: Configurable default: 60s

CF: Configurable

Further controlled by per-plugin "collect_period"

 

Based on Task Configuration.

Interval: Can go down to ns resolution.

Configurable scrape interval.

Command specific check-interval.

 

      

Other Services monitoring (DHCP, DNS, FTP, NTP, HAProxy, Consul) | HAProxy, DNS, NTP | NO | HAProxy, NTP | HAProxy | DHCP, HAProxy, NTP, Consul | YES for all | NO | HAProxy, NTP, Consul, DNS | YES | NO | YES (4) | YES (4)

Legends

(1) This aspect is realized either as a server-side component or by a 'customized' agent.

(2) Custom solutions exist, but may not be part of the main distribution.

(3) Supported, with a strong dependency on an additional tool/library.

(4) Supports more options than the ones listed in column 1.

(5) A single value cannot be entered because logically independent modules are developed by different community groups.

[^]: Runtime analysis process and considerations:

  1. Isolate the CPUs on the monitoring node. [Add the isolcpus option in grub: CPU0]
  2. Run the agent on the isolated CPU (CPU0). [Use the taskset command to run agent processes with the appropriate CPU mask: 0x01]
  3. Plugins: Configure the agent to monitor the following metrics - CPU, Memory, Disk, Interface, IPMI, processes, libvirt, Caches, OVS, hugepages.
  4. Output: Make the agent send metrics over the network (Ex: influxdb running on a separate node).
  5. Workload: stress-ng + iperf.
  6. Monitoring duration: 5 minutes.
  7. Frequency: 1 sec.
  8. Collect metrics (using any other tool) to analyze the agent's runtime performance. [Ex: Used Snap to collect 'collectd-process' metrics and CPU and memory data]
  9. Note the iperf performance (to study any effect on it due to collectd).
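
Step 8 can be approximated without a second agent by reading the Linux /proc filesystem. The sketch below is illustrative only: it assumes that the VM, RSS, Code, Data and StackSize figures in the Run-Time Analysis row correspond to the VmSize, VmRSS, VmExe, VmData and VmStk fields of /proc/<pid>/status, which the page does not state. The helper names and the sample text are hypothetical.

```python
import re

# Assumed mapping from /proc/<pid>/status fields to the figures in the table.
FIELDS = {
    "VmSize": "vm_kb",
    "VmRSS": "rss_kb",
    "VmExe": "code_kb",
    "VmData": "data_kb",
    "VmStk": "stack_kb",
}

# Illustrative input only; these are not measured values.
SAMPLE_STATUS = """Name:\tagent
VmSize:\t  120000 kB
VmRSS:\t    8000 kB
VmData:\t   64000 kB
VmStk:\t     132 kB
VmExe:\t     600 kB
"""


def parse_status(text: str) -> dict:
    """Extract the memory fields of interest from /proc/<pid>/status text."""
    result = {}
    for line in text.splitlines():
        match = re.match(r"(\w+):\s+(\d+)\s+kB", line)
        if match and match.group(1) in FIELDS:
            result[FIELDS[match.group(1)]] = int(match.group(2))
    return result


def cpu_percent(ticks_start: int, ticks_end: int, elapsed_s: float,
                ticks_per_s: int = 100) -> float:
    """CPU share of one core between two samples of utime + stime.

    Assumption: the tick counts come from /proc/<pid>/stat and the clock
    rate is 100 ticks per second (check with os.sysconf("SC_CLK_TCK")).
    """
    return 100.0 * (ticks_end - ticks_start) / (ticks_per_s * elapsed_s)
```

On a live system the text passed to parse_status would be read from /proc/<pid>/status for the agent's process ID, sampled at the 1 sec frequency for the 5 minute monitoring duration.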

#: Interval Definitions:

Lowest Sampling Interval (LSI): How frequently the plugins can read values from the source(s) of truth.

Network Transmit Interval (NTI): Interval at which the metrics are sent over the network.

Polling Interval (PI): Frequency at which metrics are read.

Check Frequency (CF): Frequency at which all plugins are run. This may map to LSI and NTI.
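
The difference between LSI and NTI can be shown with a small simulation: values are read every LSI, held in a buffer, and sent as a batch every NTI. This is only a sketch under the assumption of a fixed NTI (for at least one agent the table notes that NTI depends on the buffer size and reading interval). The function name is hypothetical and times are in milliseconds.

```python
def transmit_batches(lsi_ms: int, nti_ms: int, duration_ms: int) -> list:
    """Return (send_time, [sample_times]) for each network transmission.

    Samples are read every lsi_ms and buffered; the buffer is flushed
    every nti_ms. Samples still buffered at the end are not sent.
    """
    batches = []
    buffer = []
    next_send = nti_ms
    for t in range(lsi_ms, duration_ms + 1, lsi_ms):
        # Flush transmit deadlines that passed before this sample was read.
        while next_send < t:
            if buffer:
                batches.append((next_send, buffer))
                buffer = []
            next_send += nti_ms
        buffer.append(t)
        if t == next_send:
            batches.append((next_send, buffer))
            buffer = []
            next_send += nti_ms
    return batches
```

With an LSI of 1 second and an NTI of 5 seconds, a 10 second run produces two transmissions of five samples each. When the two intervals are equal, every sample is sent as soon as it is read.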

Inference Questions

Attached file: MonitoringAgents-Inference.pptx

The Questions The Answer
Lowest Interval: Which agent supports the lowest sampling interval, and what is the value?
Interoperability: Which agent is 'most interoperable'? (Works with the maximum number of 'servers' (collection nodes))
Large-scale deployment: Which agent is ideal for large-scale monitoring? (Provide a description on a separate page, if needed)
Low-footprint: Which agent has the lowest footprint (memory and CPU)?
Metrics: Which agent supports the maximum number of metrics?
Gaps: Are there any metrics that are relevant to NFV but not supported by any of the agents?
Which agent is ideal for real-time analytics? [Support for the maximum number of scalable datastores, visualization tools and analytics engines?]
Have any of the agents been used in large-scale real-world deployments? If so, please provide details on the performance.
Which agent has the least/maximum dependency - libraries, OS/kernel versions, etc.?
Which agent provides maximum 'freedom' w.r.t. licenses (core agent + plugins)?
Which agent is best for the following datastores: Influxdb, Graphite, ElasticSearch?
Which agent supports dynamic configuration?