Streaming Telemetry CPU Monitoring

Overview

Streaming telemetry CPU monitoring in OcNOS is designed to maintain critical system performance even under high CPU load. This feature ensures that telemetry operations, including both dial-in and dial-out, do not degrade control plane applications or other essential system functions by regulating telemetry activities based on CPU usage.

Feature Characteristics

OcNOS implements CPU monitoring through these mechanisms:

CPU Usage Monitoring: The system monitors the 5-minute average CPU usage, enabled by default through the CML daemon (CMLd).
Configurable Threshold: Users can define a CPU usage threshold between 20% and 80%; the default is 40%.
State Transition Logic: The telemetry functionality transitions between the "Normal" and "Paused" states based on the monitored CPU usage relative to the configured threshold.
Transition to Paused: When the 5-minute average CPU usage rises above the default or configured threshold and stays above it for at least one minute, the telemetry state transitions from "Normal" to "Paused".
Transition to Normal: When the 5-minute average CPU usage drops below three-quarters of the default or configured threshold (for example, below 30% with the default 40% threshold) and stays below that level for 300 seconds, the telemetry state transitions from "Paused" to "Normal".
Actions in "Paused" State:
Pause existing dial-in subscriptions by sending unsubscribe messages to the protocol modules (PMs), which frees CPU cycles in the PMs, redis-server, and gnmid. No gNMI responses are sent for these paused subscriptions.
Pause existing dial-out telemetry subscriptions. No gNMI responses are sent for these paused subscriptions.
Reject new incoming dial-in connections.
Place the sensor-paths of any dial-out subscription activated while in the "Paused" state into a paused state immediately.
Actions in "Normal" State:
Resume paused dial-in and dial-out subscriptions by resending subscribe messages to the relevant sensor-paths.
Accept new dial-in subscription connections. In dial-out mode, the corresponding sensor-paths are activated immediately on subscription activation. gNMI responses are sent as usual at the sample interval.
Status Display: Use the command show streaming-telemetry to check the current status of the telemetry functionality (Normal or Paused).
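
The transition rules above amount to a two-state machine with hysteresis. The following Python sketch is only an illustration of that logic, not the OcNOS implementation: the class name TelemetryCpuGate, the update() method, and the way hold timers restart when usage crosses back over the limit are all assumptions made for the example. Only the 20% to 80% range, the 40% default, the three-quarters resume level, and the 60-second and 300-second hold times come from the description above.

```python
from dataclasses import dataclass
from typing import Optional

NORMAL = "Normal"
PAUSED = "Paused"


@dataclass
class TelemetryCpuGate:
    """Hypothetical model of the Normal/Paused transition rules."""

    threshold: float = 40.0       # percent; default from the text
    pause_hold_s: float = 60.0    # time above threshold before pausing
    resume_hold_s: float = 300.0  # time below 3/4 threshold before resuming
    state: str = NORMAL
    _since: Optional[float] = None  # start of the current qualifying period

    def __post_init__(self) -> None:
        if not 20.0 <= self.threshold <= 80.0:
            raise ValueError("threshold must be between 20 and 80 percent")

    @property
    def resume_threshold(self) -> float:
        # Usage must fall below three-quarters of the threshold to resume.
        return 0.75 * self.threshold

    def update(self, now: float, cpu_5min_avg: float) -> str:
        """Feed one 5-minute-average sample taken at time `now` (seconds)."""
        if self.state == NORMAL:
            qualifies = cpu_5min_avg > self.threshold
            hold, target = self.pause_hold_s, PAUSED
        else:
            qualifies = cpu_5min_avg < self.resume_threshold
            hold, target = self.resume_hold_s, NORMAL

        if not qualifies:
            # Assumption: the hold timer restarts if the condition lapses.
            self._since = None
        else:
            if self._since is None:
                self._since = now
            if now - self._since >= hold:
                self.state = target
                self._since = None
        return self.state


if __name__ == "__main__":
    gate = TelemetryCpuGate()  # default 40% threshold
    for t, cpu in [(0, 50), (30, 50), (60, 50), (70, 25), (370, 25)]:
        print(t, cpu, gate.update(t, cpu))
```

In this sketch the gap between the pause level (40%) and the resume level (30%) prevents the state from flapping when CPU usage hovers near the threshold, and the longer 300-second resume hold makes the return to "Normal" more conservative than the move to "Paused".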

The OpenConfig data model does not support CPU monitoring.

Benefits

CPU monitoring protects system health by dynamically reducing telemetry overhead during CPU stress conditions:

In scale scenarios, or when critical control plane applications need more CPU cycles, non-critical applications such as telemetry consume fewer CPU cycles by entering the "Paused" state.
Ensures more CPU cycles are available for high-priority processes.