Robust and Validated RoCE (RDMA over Converged Ethernet) on TERALYNX

RDMA (Remote Direct Memory Access) is a key technology for running high-performance distributed applications in deployments such as cloud computing, HPC and machine learning. It enables fast, efficient movement of data between compute and storage and from compute to compute. It achieves this by eliminating multiple data copies and by freeing up valuable CPU resources. When RDMA traffic is transported over Ethernet networks, it is called RDMA over Converged Ethernet (RoCE), and it is being deployed widely across private and public cloud environments.

The Ethernet network needs to provide several essential capabilities to enable high-performance RoCE. They include:

PFC (Priority-based Flow Control): A flow-control mechanism that operates per priority/class of traffic to support lossless transport of data.
ECN (Explicit Congestion Notification): An extension to the IP and TCP protocols that enables end-to-end congestion notification without dropping data packets. The switch sets ECN bits on data packets when it detects congestion, so that the receiver can signal the sender to slow down.
ETS (Enhanced Transmission Selection): A bandwidth management mechanism that allocates bandwidth among different priorities/classes of traffic.
RoCE v2: The ability to run RoCE over routed networks (i.e. using layer 3 to scale to larger data centers/networks), with QoS/priority information also carried across layer 3 boundaries.
Rich QoS features: The ability to mark and prioritize different types of data traffic across layer 2 and layer 3 networks.
Scalable performance with low latency: The ability to run RDMA across a large data center while maintaining low latency and high performance.
Efficient buffer management: The ability of the switch to allocate, free up and manage buffers so that it can forward data packets without dropping them, thereby enabling high-performance networks.
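
To make the ECN and ETS items above more concrete, here is a minimal Python sketch. It assumes a RED-style marking curve for ECN (no marking below a minimum queue depth, a linear ramp up to a maximum probability, and always marking above a maximum depth) and a simple weight-proportional split for ETS. The function names, thresholds and traffic-class names are hypothetical and chosen only for illustration; they do not describe TERALYNX configuration or any Innovium API.

```python
from typing import Dict


def ecn_mark_probability(queue_kb: float, kmin_kb: float,
                         kmax_kb: float, pmax: float) -> float:
    """Probability that a packet is ECN-marked at a given queue depth.

    Assumed RED-style curve: 0 at or below kmin, linear ramp towards pmax
    between kmin and kmax, and 1.0 (always mark) at or above kmax.
    """
    if queue_kb <= kmin_kb:
        return 0.0
    if queue_kb >= kmax_kb:
        return 1.0
    return pmax * (queue_kb - kmin_kb) / (kmax_kb - kmin_kb)


def ets_bandwidth(link_gbps: float, weights: Dict[str, int]) -> Dict[str, float]:
    """Split link bandwidth among traffic classes in proportion to their weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {name: link_gbps * w / total for name, w in weights.items()}


if __name__ == "__main__":
    # Hypothetical thresholds: start marking at 200 KB, always mark at 800 KB.
    for depth_kb in (100, 500, 900):
        p = ecn_mark_probability(depth_kb, kmin_kb=200, kmax_kb=800, pmax=0.2)
        print(f"queue={depth_kb} KB -> mark probability {p:.2f}")

    # Hypothetical traffic classes sharing a 100 Gb/s port.
    print(ets_bandwidth(100.0, {"roce": 50, "storage": 30, "default": 20}))
```

In this sketch, ETS weights only set guaranteed minimum shares under contention; a real scheduler would typically let an idle class's bandwidth be used by the others.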

Figure: Applications using RoCE v2 over TERALYNX switches

The Innovium team, with a strong DNA and track record in switching and data center solutions, is proud to deliver the extremely robust, high-performance RoCE v2 capabilities listed above in TERALYNX. Additionally, TERALYNX has a large and efficient on-chip buffer (70MB), which helps drive the highest RoCE v2 performance when the switch must absorb RoCE traffic that is still arriving while PFC takes effect at the source. Innovium has also implemented various patented optimizations to handle PFC better. These comprehensive capabilities have been tested and validated extensively by Innovium and by our OEM and cloud customers, running over large-scale multi-rack server/storage environments across multiple scenarios. The test results, which include throughput, delay and packet-drop measurements, have exceeded customer expectations.
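
The point about absorbing traffic while PFC takes effect can be illustrated with a rough estimate of the buffer headroom one lossless queue needs. The Python sketch below uses a simplified model, not Innovium's method: after the switch sends a pause frame, data keeps arriving for one cable round trip plus the sender's response time, plus one maximum-size frame that may already be in transmission at each end. The function name, the 5 ns per meter propagation figure (typical for optical fiber) and the example values are assumptions.

```python
def pfc_headroom_bytes(link_gbps: float, cable_m: float, mtu_bytes: int,
                       response_ns: float = 0.0, ns_per_m: float = 5.0) -> float:
    """Rough headroom needed by one lossless queue after it sends a PFC pause.

    Simplified model (an assumption, not a vendor formula):
      - data in flight for one cable round trip plus the sender's response time
      - plus one MTU-sized frame already being transmitted at each end
    """
    bytes_per_ns = link_gbps / 8.0            # 1 Gb/s equals 0.125 bytes per ns
    round_trip_ns = 2.0 * cable_m * ns_per_m  # pause frame out, data still inbound
    in_flight = (round_trip_ns + response_ns) * bytes_per_ns
    return in_flight + 2 * mtu_bytes


if __name__ == "__main__":
    # Hypothetical example: 100 Gb/s port, 100 m of fiber, 4096-byte MTU.
    per_queue = pfc_headroom_bytes(link_gbps=100, cable_m=100, mtu_bytes=4096)
    print(f"headroom per lossless queue: {per_queue:.0f} bytes")
```

Because this headroom is needed for every lossless port and priority pair, the total grows quickly with port count, link speed and cable length, which is one reason a large on-chip buffer matters for RoCE.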

Innovium’s TERALYNX data-center Ethernet switch family delivers 2 Tbps to an industry-highest 12.8 Tbps of bandwidth, along with low latency, advanced telemetry, programmability and a comprehensive feature set that includes high-performance RoCE. Multiple switch platforms are already available from OEMs and ODMs, supporting a broad range of fixed and modular switch configurations. Innovium software, with a clean-sheet modern design, offers customers a comprehensive set of features, including support for the popular Software for Open Networking in the Cloud (SONiC). We have spent the last year validating interoperability with an extensive PAM4-based ecosystem to enable faster and more reliable customer deployments. Further, we have a compelling roadmap to meet future customer requirements.

We are excited to deliver robust and validated RoCE capabilities on TERALYNX switches for data center customers. Please contact us at [email protected] if you need further information.