This wiki will hold the minutes of discussion topics from the Barometer Weekly Call.
DMA Project
- Distribute some monitoring and analysis capabilities to the edge
- Allow faster polling rates locally without creating a bottleneck for transfer of large amounts of data to a central site.
- Allow fast remediation of node-local events
- Project is looking for an upstream community
- Would Barometer be a good fit?
Discussion topics for the “ideal” monitoring agent
- Polling vs Event capture for the monitoring agent
- Platform independent monitor agent
- Network Interfaces
- Kernel events
- VM / Container monitoring
- Common bus for Events / Telemetry / Config
- Common Object model
- Agent configuration
- Performance
- << 50 ms and other timing requirements
Decisions
Polling vs Event capture for the monitoring agent <Feb 07 2017>
The scope of polling being discussed is that of the monitoring agent itself (on the node that’s being observed). Collectd is configured to run at a particular interval, by default every 10 seconds. The question is: should the read plugins poll for stats and events every time the read interval fires?
A. Both polling and event-driven updates should be supported; which one applies depends on the subsystem being monitored. The default would be to leverage event-based systems where they exist, but polling should be supported as a configuration option that the end user can select.
If we consider the scope from the VIM to the monitoring agent, should polling and event-driven updates be supported within this context?
Fault events should always use a push model, and the mechanism over which events are sent needs to be reliable.
Telemetry can be polled or pushed (polling could be used to spread the load on the collection side).
Network (over)load should be taken into consideration when choosing a model (push vs pull); you don't want to destabilize the network. Push is more scalable overall and is preferred for fault management.
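The decision above (fault events pushed, telemetry polled on an interval) can be illustrated with a minimal sketch. This is not collectd code and not an agreed design; the `Agent` class and its methods `register_poller`, `push_event`, and `run_once` are hypothetical names chosen for illustration. Only the 10 second default interval comes from the notes above.

```python
import queue


class Agent:
    """Hypothetical agent mixing polled telemetry with pushed fault events."""

    def __init__(self, default_interval=10.0):
        # 10 s mirrors the default read interval mentioned in the minutes.
        self.default_interval = default_interval
        self._pollers = {}            # name -> [read_fn, interval, next_due]
        self._events = queue.Queue()  # thread-safe buffer for pushed events

    def register_poller(self, name, read_fn, interval=None):
        """Register a read function that is polled every `interval` seconds."""
        if interval is None:
            interval = self.default_interval
        self._pollers[name] = [read_fn, interval, 0.0]

    def push_event(self, source, payload):
        """Called by event-driven sources as soon as something happens."""
        self._events.put((source, payload))

    def run_once(self, now):
        """Process one scheduler tick at time `now`; return (events, samples)."""
        # Pushed events are drained on every tick, regardless of intervals.
        events = []
        while True:
            try:
                events.append(self._events.get_nowait())
            except queue.Empty:
                break
        # Pollers are read only when their own interval has elapsed.
        samples = {}
        for name, entry in self._pollers.items():
            read_fn, interval, next_due = entry
            if now >= next_due:
                samples[name] = read_fn()
                entry[2] = now + interval
        return events, samples
```

In this sketch a subsystem without an event source registers a poller, while a subsystem that can signal changes calls `push_event` directly, so a fault never waits for the next read interval. Reliable delivery of events to the collector, which the notes require, is outside the scope of the sketch.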
Agent configuration <Feb 14 2017>
Should be able to dynamically:
* Enable, disable, or restart resource monitoring
* Get values/notifications
* Get capabilities
* Get the list of metrics being collected
* Flush the list of metrics
* Set thresholds for resources
* Blacklist resources
* Support some sort of buffering mechanism, and be able to configure it
* Get the timing information for the agent and do a timing sync if required
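As a rough illustration of the requirements listed above, the following sketch shows what a dynamic control surface might look like. The `AgentConfig` class and all of its method names are hypothetical and do not correspond to any existing collectd or Barometer API; timing sync and restart are left out to keep the example short.

```python
from collections import deque


class AgentConfig:
    """Hypothetical runtime control surface for a monitoring agent."""

    def __init__(self, capabilities, buffer_size=100):
        self.capabilities = set(capabilities)    # resources the agent can monitor
        self.enabled = set()                     # resources currently collected
        self.blacklist = set()                   # resources that must not be enabled
        self.thresholds = {}                     # resource -> (low, high)
        self.buffer = deque(maxlen=buffer_size)  # bounded sample buffer

    def get_capabilities(self):
        return sorted(self.capabilities)

    def enable(self, resource):
        """Enable monitoring; returns False if the resource is blacklisted."""
        if resource not in self.capabilities:
            raise ValueError("unknown resource: " + resource)
        if resource in self.blacklist:
            return False
        self.enabled.add(resource)
        return True

    def disable(self, resource):
        self.enabled.discard(resource)

    def blacklist_resource(self, resource):
        self.blacklist.add(resource)
        self.enabled.discard(resource)

    def set_threshold(self, resource, low=None, high=None):
        self.thresholds[resource] = (low, high)

    def list_metrics(self):
        return sorted(self.enabled)

    def flush_metrics(self):
        self.enabled.clear()

    def set_buffer_size(self, size):
        # Rebuild the deque; the newest samples are kept if it shrinks.
        self.buffer = deque(self.buffer, maxlen=size)

    def record(self, resource, value):
        """Buffer a sample; return 'above'/'below' on a threshold breach."""
        if resource not in self.enabled:
            return None
        self.buffer.append((resource, value))
        low, high = self.thresholds.get(resource, (None, None))
        if high is not None and value > high:
            return "above"
        if low is not None and value < low:
            return "below"
        return None
```

A caller might enable `cpu`, set a high threshold of 90, and treat a non-`None` return from `record` as the trigger for a notification. How such calls would reach the agent at runtime (socket, bus, or config reload) is left open here, as it was in the discussion.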