Dynamic Load Balancing Based on RoCEv2 Destination-QPair

Overview

Traditional data center load balancing uses hash-based methods to keep packets of the same flow on a single path. However, this approach ignores real-time link bandwidth usage, which can lead to uneven traffic distribution and congestion. These issues are especially critical for Artificial Intelligence (AI) and Machine Learning (ML) workloads, which frequently involve large data transfers that heavily strain network bandwidth.
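
The limitation can be illustrated with a minimal Python sketch of static hash-based member selection. The CRC32 hash, the flow tuples, and the rates are illustrative assumptions and do not represent the hash function used by the switch hardware. The point is only that the selected member depends on the flow fields and never on how much traffic each member already carries.

```python
import zlib
from collections import defaultdict


def static_ecmp_member(flow, num_members):
    """Pick an ECMP member from a hash of the flow fields only (no load input)."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % num_members


def load_per_member(flow_rates, num_members):
    """Sum the offered rate (Gb/s) that lands on each member index."""
    load = defaultdict(int)
    for flow, rate in flow_rates.items():
        load[static_ecmp_member(flow, num_members)] += rate
    return dict(load)


# Hypothetical flows: (src_ip, dst_ip, protocol, src_port, dst_port) -> rate in Gb/s
flow_rates = {
    ("10.0.0.1", "10.0.1.1", 17, 49152, 4791): 90,
    ("10.0.0.2", "10.0.1.2", 17, 49153, 4791): 80,
    ("10.0.0.3", "10.0.1.3", 6, 33000, 443): 5,
    ("10.0.0.4", "10.0.1.4", 6, 33001, 443): 5,
}

# Two large flows may share a member while other members stay idle.
print(load_per_member(flow_rates, num_members=4))
```

Because the hash is deterministic, a flow always maps to the same member, which preserves packet order but leaves any resulting imbalance in place for the lifetime of the flow.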

The Dynamic Load Balancing (DLB) feature advances conventional hash-based load balancing by incorporating various strategies that optimize traffic distribution across members of an Equal-Cost Multi-Path (ECMP) group.

Unlike static hash-based approaches, which allocate flows without factoring in real-time link conditions, DLB continuously monitors and adjusts traffic patterns.

This strategy improves performance and link utilization during dynamic traffic allocation:

The system allocates new data flows based on the real-time load of each ECMP member.
When load conditions change, it reassigns existing flows to different paths while maintaining flow integrity and avoiding packet reordering.
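
The first behavior above, load-aware placement of new flows, can be sketched as follows. This is a simplified model and not the hardware algorithm: the member names are hypothetical, and the offered rate in Gb/s stands in for the real-time load measurement.

```python
def assign_new_flow(member_load, flow_rate):
    """Place a new flow on the currently least-loaded member and update its load."""
    member = min(member_load, key=member_load.get)  # ties go to the first member
    member_load[member] += flow_rate
    return member


# Hypothetical ECMP members with their current load in Gb/s.
member_load = {"eth1": 0, "eth2": 0, "eth3": 0, "eth4": 0}

# Two large flows followed by three small ones.
placements = [assign_new_flow(member_load, rate) for rate in (90, 80, 5, 5, 5)]

print(placements)   # ['eth1', 'eth2', 'eth3', 'eth4', 'eth3']
print(member_load)  # {'eth1': 90, 'eth2': 80, 'eth3': 10, 'eth4': 5}
```

In this model the two large flows end up on separate members, and the small flows fill the remaining members instead of sharing a link with a large flow.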

DLB balances bandwidth utilization across group members, so a large flow on a single link does not disrupt smaller flows in the ECMP group.

Feature Characteristics

DLB overcomes the limitations of hash-based load balancing by providing several load-balancing modes.

DLB Modes

Dynamic load balancing offers users multiple operational modes for distributing network traffic efficiently:

Fixed Packet Mode: A flow is assigned to a specific member port, and the assignment remains unchanged even after periods of inactivity.
Per-Packet Mode: Each packet is assigned to a member port individually, so packets of the same flow can be distributed across member ports.
Optimal Flow Mode: If a flow remains idle for the duration of the inactivity timer, it is reassigned to the most suitable member, determined by the link quality of the egress ports.
Random Flow Mode: Idle flows are reassigned to a randomly chosen member.
Reactive Path Rebalancing: DLB supports reactive path rebalancing, also called long-lived flow reassignment. Within an ECMP group, if a continuous incoming stream occupies an egress member port and a better-quality (less loaded) egress member exists, the stream is probabilistically reassigned to the better member, provided that the quality improvement meets a configured delta.
To support reactive path rebalancing, DLB provides the following configurable values:
Reassignment threshold: The probability threshold at which an existing continuous IP stream egressing a DLB group is reassigned to a better available member.
Reassignment quality delta: The quality difference required between the current member and the available member for the stream to be considered for reassignment.
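
The interaction of the two reactive path rebalancing values can be sketched in Python. This is an interpretation rather than the hardware behavior: the function and parameter names are hypothetical, the reassignment threshold is modeled as a probability between 0 and 1, and quality is modeled as a number where higher is better.

```python
import random


def maybe_reassign(current, qualities, quality_delta, reassign_probability, rng=random):
    """Return the member a long-lived stream should use after one rebalancing check.

    qualities maps member name -> quality value (higher is better).
    """
    best = max(qualities, key=qualities.get)
    improvement = qualities[best] - qualities[current]
    if improvement < quality_delta:
        return current  # the better member is not better by the configured delta
    if rng.random() < reassign_probability:
        return best  # probabilistic move to the better member
    return current


# Hypothetical members: the stream currently occupies eth1.
qualities = {"eth1": 2, "eth2": 6}
print(maybe_reassign("eth1", qualities, quality_delta=3, reassign_probability=1.0))  # eth2
print(maybe_reassign("eth1", qualities, quality_delta=5, reassign_probability=1.0))  # eth1
```

Making the move probabilistic means that streams sharing a congested member do not all jump to the same better member in the same check.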


1. Users can configure up to four Ethernet types to be eligible for Dynamic Load Balancing (DLB). This is supported only on Tomahawk4 (TH4) and Tomahawk5 (TH5) platforms.
2. The Random and Reactive Path Rebalancing DLB modes are supported only on Tomahawk5 (TH5) platforms.

Traffic distribution takes into account key factors such as the flowset, the inactivity timer, and port quality.

Flowset: A collection of macroflows (grouped microflows) managed as a unit for traffic distribution.
Inactivity Timer: Duration for which a flow must be idle before it becomes eligible for reassignment. It is supported only in the Optimal and Random DLB modes.
Port Quality Band: Members are rated on a scale from 0 (lowest) to 7 (highest) based on real-time port load and queue depth.
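
How these three factors combine can be sketched with a small flowset table. The class, its method names, the CRC32 hash, and the use of seconds for the timer are assumptions made for illustration; the quality bands are passed in as already computed values from 0 to 7.

```python
import random
import zlib


class FlowsetTable:
    """Toy model: a flowset entry keeps its member until it has been idle long enough."""

    def __init__(self, size, inactivity_timeout, mode="optimal", rng=None):
        self.size = size
        self.inactivity_timeout = inactivity_timeout
        self.mode = mode  # "optimal" or "random"
        self.rng = rng or random.Random()
        self.entries = {}  # flowset index -> (member, last_seen)

    def index_for(self, flow):
        """Map a flow to a flowset index; many flows can share one index."""
        key = "|".join(map(str, flow)).encode()
        return zlib.crc32(key) % self.size

    def select_member(self, flow, now, quality):
        """Return the egress member for a packet of this flow arriving at time now."""
        idx = self.index_for(flow)
        entry = self.entries.get(idx)
        idle = entry is None or now - entry[1] >= self.inactivity_timeout
        if not idle:
            member = entry[0]  # active flowset: keep the member, preserve order
        elif self.mode == "random":
            member = self.rng.choice(sorted(quality))
        else:
            member = max(quality, key=quality.get)  # highest quality band wins
        self.entries[idx] = (member, now)
        return member


table = FlowsetTable(size=256, inactivity_timeout=1.0)
flow = ("10.0.0.1", "10.0.1.1", 17, 49152, 4791)
print(table.select_member(flow, now=0.0, quality={"eth1": 3, "eth2": 7}))  # eth2
print(table.select_member(flow, now=0.5, quality={"eth1": 7, "eth2": 2}))  # eth2
print(table.select_member(flow, now=2.0, quality={"eth1": 7, "eth2": 2}))  # eth1
```

The second call keeps eth2 even though eth1 now has the better band, because the flowset was active within the timeout. Only after the idle gap does the flowset move, which is what avoids packet reordering.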

RoCE Destination-QPair

In AI/ML clusters, Remote Direct Memory Access (RDMA) is used for memory-to-memory communication between GPUs over the network. RDMA over Converged Ethernet (RoCE) is an extension of InfiniBand (IB) that uses Ethernet forwarding. RoCEv2 encapsulates the IB transport in Ethernet, IP, and UDP headers, so it can be routed over Ethernet networks.

For RoCEv2 transport, the network must provide high throughput and low latency while avoiding traffic drops when congestion occurs.

For RoCEv2 traffic between two GPUs, additional entropy is required for load balancing. The Destination-QPair provides this entropy and can be enabled using the load-balance rtag7 command with the ipv4/ipv6 rocev2-dest-qpair options.
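
The following Python sketch shows where the destination queue pair sits in a RoCEv2 packet and how it can add entropy to a hash key. The field offsets follow the InfiniBand Base Transport Header (BTH), which is 12 bytes long and carries the 24-bit destination QP in bytes 5 to 7, directly after the UDP header with destination port 4791. The function names and the CRC32 key are hypothetical and do not represent the RTAG7 hash itself.

```python
import zlib

ROCEV2_UDP_PORT = 4791  # IANA-assigned UDP destination port for RoCEv2
BTH_LENGTH = 12         # Base Transport Header size in bytes


def dest_qp_from_bth(udp_payload: bytes) -> int:
    """Extract the 24-bit destination QP from the BTH at the start of the UDP payload."""
    if len(udp_payload) < BTH_LENGTH:
        raise ValueError("payload shorter than a Base Transport Header")
    return int.from_bytes(udp_payload[5:8], "big")


def rocev2_hash_key(src_ip, dst_ip, udp_dst_port, udp_payload):
    """Build an illustrative hash key that includes the destination QP for RoCEv2."""
    fields = [src_ip, dst_ip, udp_dst_port]
    if udp_dst_port == ROCEV2_UDP_PORT:
        fields.append(dest_qp_from_bth(udp_payload))
    return zlib.crc32("|".join(map(str, fields)).encode())


# BTH layout: opcode, flags, P_Key (2 bytes), reserved, DestQP (3), AckReq/reserved, PSN (3)
bth_qp_300 = bytes([0x0A, 0x40, 0xFF, 0xFF, 0x00, 0x00, 0x01, 0x2C, 0x80, 0x00, 0x00, 0x01])
bth_qp_301 = bytes([0x0A, 0x40, 0xFF, 0xFF, 0x00, 0x00, 0x01, 0x2D, 0x80, 0x00, 0x00, 0x01])

print(dest_qp_from_bth(bth_qp_300))  # 300
print(rocev2_hash_key("10.0.0.1", "10.0.1.1", 4791, bth_qp_300)
      != rocev2_hash_key("10.0.0.1", "10.0.1.1", 4791, bth_qp_301))  # True
```

Two connections between the same pair of GPUs share the same IP addresses and UDP destination port, so including the destination QP lets them produce different hash keys.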

DLB Flow Monitoring

DLB includes support for flow monitoring to help administrators observe and troubleshoot traffic distribution across ECMP members.

It includes the following monitoring parameters (per sampled packet):

DLB ID: Unique identifier for the DLB group
Source Port: Ingress interface of the sampled packet
Flowset Index: Index representing the macroflow group
Egress Nexthop: Selected member port for the flow
Monitoring is performed at the macroflow level, ensuring insights into how collections of flows are routed and balanced.

This functionality is supported only on Tomahawk4 (TH4), Tomahawk5 (TH5), and Trident4 (TR4) platforms.
It is not supported for per-packet DLB mode.
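
As an illustration of how the monitoring parameters listed above might be consumed, the following Python sketch models one sampled record and counts samples per egress next hop for a DLB group. The record type, the field names, and the sample values are hypothetical; the actual export format is not described here.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DlbSample:
    """One sampled packet, with the four monitoring parameters listed above."""
    dlb_id: int
    source_port: str
    flowset_index: int
    egress_nexthop: str


def samples_per_nexthop(samples, dlb_id):
    """Count sampled packets per egress next hop for a single DLB group."""
    return Counter(s.egress_nexthop for s in samples if s.dlb_id == dlb_id)


# Hypothetical samples for illustration only.
samples = [
    DlbSample(dlb_id=1, source_port="eth10", flowset_index=17, egress_nexthop="eth1"),
    DlbSample(dlb_id=1, source_port="eth10", flowset_index=17, egress_nexthop="eth1"),
    DlbSample(dlb_id=1, source_port="eth11", flowset_index=42, egress_nexthop="eth2"),
    DlbSample(dlb_id=2, source_port="eth12", flowset_index=5, egress_nexthop="eth3"),
]

print(samples_per_nexthop(samples, dlb_id=1))  # Counter({'eth1': 2, 'eth2': 1})
```

A strongly skewed count for one next hop within a group is the kind of pattern this data can help an administrator spot.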

Benefits

Dynamic Load Balancing (DLB) addresses key limitations of traditional hash-based load-balancing by introducing intelligent, adaptive traffic distribution mechanisms.

Limitations

RTAG7 configuration is mandatory for DLB functionality.
DLB configurations for ECMP groups are applied at the global level.
ECMP DLB is not supported when any ECMP member is configured as a LAG.
ECMP groups configured with hash-based load balancing cannot be modified to use DLB mode. A node reboot is necessary to recreate the ECMP groups with DLB mode enabled.
The number of ECMP groups supported with DLB depends on the configured flow set size. For example, a flow set size of 256 supports up to 128 ECMP groups.
When a next hop (NH) is added or removed from an ECMP group, the hardware ECMP group is rebuilt with new members, causing a change in its DLB ID. As a result, existing flows are rehashed and reassigned.
On TH5 platforms, ECMP DLB is supported only for port speeds ranging from 50G to 800G.
The Random and Reactive Path Rebalancing DLB modes are supported only on TH5.
RTAG7 RoCEv2 Destination-QPair is not supported on TR3.
Port quality information is not supported on TR3.
Flow monitoring is not supported for “per-packet” DLB mode.
The inactivity timer is applicable only to the Optimal and Random DLB modes.
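
The flow set size example in the limitations above (a size of 256 supports up to 128 ECMP groups) suggests that the groups share a fixed pool of flowset entries. The following Python sketch works through that arithmetic under this assumption. The pool size of 32768 entries is inferred from the single example and is not a documented constant, so results for sizes other than 256 are extrapolations that should be checked against the platform documentation.

```python
# Inferred from the one documented data point: 256 entries x 128 groups.
ASSUMED_FLOWSET_POOL = 256 * 128  # 32768 entries; an assumption, not a documented value


def max_dlb_groups(flow_set_size):
    """Estimate how many ECMP DLB groups fit if all groups share one entry pool."""
    if flow_set_size <= 0:
        raise ValueError("flow set size must be positive")
    return ASSUMED_FLOWSET_POOL // flow_set_size


print(max_dlb_groups(256))  # 128, matching the documented example
print(max_dlb_groups(512))  # 64, an extrapolation under the shared-pool assumption
```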

Prerequisites

Ensure the following:

RTAG7 hashing is enabled for load balancing.