DNA 26-02: To InfiniBand and Beyond

I’ve been posing the question “Will Ethernet succeed InfiniBand as the transport mechanism of choice for AI workloads?”, and the responses are predictably coloured by the vendor in question. As you would expect, the major Ethernet vendors very much see Ethernet as the future. However, the word from the server vendors, as well as the largest of the GPU and CPU vendors, is that for the very biggest workloads, with tens of thousands of GPUs, InfiniBand is still very much the “gold standard”, a phrase which I heard independently from multiple vendors.

I am curious as to why this might be, so I’ve started to break it down by looking at performance, capability, and the future development of both standards. I will dig deeper into each area in the coming weeks, but for now, here are a few high-level thoughts.

On headline speed, Cisco’s Silicon One and Broadcom’s Jericho3-AI chips at 800 Gbps compare well with the latest InfiniBand XDR 800 Gbps standard. However, port speed is only part of the picture: effective bandwidth, port-to-port latency, switching capacity, and forwarding rate also need to be considered.

From a features perspective, two fundamentals of InfiniBand are RDMA (Remote Direct Memory Access) and its lossless fabric. Ethernet provides RoCE (RDMA over Converged Ethernet), and depending on the vendor there will be some mechanism to achieve a lossless fabric, such as PFC (Priority Flow Control) and ECN (Explicit Congestion Notification) in the case of Cisco.

The feature uniquely available to InfiniBand, and perhaps one of the key reasons why it remains the “gold standard”, is SHARP (Scalable Hierarchical Aggregation and Reduction Protocol). SHARP is an NVIDIA proprietary protocol that provides significant performance benefits by enabling in-network computing: the switches themselves carry out AI collective operations such as data aggregation and reduction, rather than leaving that work to the endpoints.
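To make the in-network idea a little more concrete, here is a minimal Python sketch of a two-level reduction tree, set against the textbook per-GPU traffic of a ring allreduce. It is my own simplified model, not SHARP's actual implementation or API: the function names, the leaf/root topology, and the byte-counting model are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def reduce_sum(vectors: List[Vector]) -> Vector:
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]


def in_network_allreduce(gpu_vectors: List[Vector], gpus_per_leaf: int) -> List[Vector]:
    """Toy model of a switch-side allreduce (hypothetical, not the SHARP API).

    Each leaf switch sums the vectors of its attached GPUs, a root switch
    sums the leaf partials, and the total is sent back to every GPU.
    """
    # Step 1: each leaf switch reduces the contributions of its own GPUs.
    leaf_partials = [
        reduce_sum(gpu_vectors[i:i + gpus_per_leaf])
        for i in range(0, len(gpu_vectors), gpus_per_leaf)
    ]
    # Step 2: the root switch reduces the partial sums from the leaves.
    total = reduce_sum(leaf_partials)
    # Step 3: the result is distributed back down to every GPU.
    return [list(total) for _ in gpu_vectors]


def ring_bytes_sent_per_gpu(num_gpus: int, message_bytes: int) -> float:
    """Bytes each GPU sends in a bandwidth-optimal ring allreduce."""
    return 2 * (num_gpus - 1) / num_gpus * message_bytes


def in_network_bytes_sent_per_gpu(message_bytes: int) -> float:
    """Bytes each GPU sends when the fabric reduces: one copy of its data."""
    return float(message_bytes)


if __name__ == "__main__":
    grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    print(in_network_allreduce(grads, gpus_per_leaf=2)[0])
    print(ring_bytes_sent_per_gpu(4, 1000))
    print(in_network_bytes_sent_per_gpu(1000))
```

In this simplified model, the ring's per-GPU send volume approaches twice the message size as the GPU count grows, while the in-network version stays at one copy per GPU. Real-world gains will depend on message size, topology, and implementation details that this sketch ignores.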

The market landscape is ever shifting. Ethernet vendors such as Cisco, HPE, and DriveNets are bringing solutions to market. The Ultra Ethernet Consortium is developing new standards intended to give Ethernet capabilities comparable to RDMA and SHARP. Finally, NVIDIA have a horse in each race, with Spectrum-X for Ethernet and Quantum-X for InfiniBand.

For any comments, questions, or corrections, please message me directly or reply below.

https://ultraethernet.org/

https://lnkd.in/eUZzVeji

https://lnkd.in/eq6vnF44

https://drivenets.com/