InfiniBand Performance Tuning for AI Workloads
Distributed training performance is increasingly network-bound rather than compute-bound.
For large-scale LLM training, inefficient RDMA communication can reduce cluster efficiency by more than 40%.
This guide provides a production-grade methodology for tuning InfiniBand in GPU clusters.
Why InfiniBand Performance Matters for LLM Training
In modern training workloads:
- AllReduce dominates iteration time
- Communication overlaps with compute
- Network imbalance breaks scaling efficiency
At scale:
- Small latency increase → global throughput drop
- PCIe misalignment → NCCL bandwidth collapse
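The sensitivity to latency can be illustrated with back-of-envelope arithmetic (all numbers below are illustrative, not measurements): per-step throughput scales as compute_time / (compute_time + exposed_comm_time), so even a modest increase in non-overlapped communication time cuts global throughput.

```shell
# Illustrative only: step_time = compute + exposed (non-overlapped) comm.
# Throughput is proportional to 1/step_time.
compute_ms=100
for comm_ms in 10 25; do
  awk -v c="$compute_ms" -v n="$comm_ms" \
    'BEGIN { printf "exposed comm %d ms -> relative throughput %.2f\n", n, c/(c+n) }'
done
```

Growing exposed communication from 10 ms to 25 ms against 100 ms of compute drops relative throughput from 0.91 to 0.80 in this toy model.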
End-to-End Data Path
The real performance path is:
GPU → PCIe Switch → HCA → IB Fabric → Remote HCA → Remote GPU
Key bottlenecks:
- PCIe lane width
- NUMA crossing
- Retimer latency
- GDR capability
Key Performance Metrics
1. RDMA Bandwidth Test
ib_write_bw
ib_read_bw
ib_send_bw
2. NCCL Test
nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
Focus on:
- Bus Bandwidth
- Algorithm Bandwidth
- Latency at small message sizes
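The bus bandwidth reported by all_reduce_perf is derived from algorithm bandwidth using the nccl-tests AllReduce factor 2(n-1)/n, where n is the number of ranks. It can be reproduced by hand (the 50 GB/s algorithm bandwidth below is an illustrative number):

```shell
# busBW = algBW * 2*(n-1)/n for AllReduce (nccl-tests convention)
n=8          # number of ranks
algbw=50     # algorithm bandwidth in GB/s (illustrative)
awk -v n="$n" -v a="$algbw" \
  'BEGIN { printf "busBW = %.2f GB/s\n", a * 2 * (n - 1) / n }'
```

For 8 ranks at 50 GB/s algorithm bandwidth this yields 87.50 GB/s of bus bandwidth, which is why busBW, not algBW, is the number to compare against link speed.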
PCIe and NUMA Affinity Optimization
Check Topology
nvidia-smi topo -m
lspci -tv
numactl -H
Goal:
- GPU and HCA under same NUMA
- Avoid SYS distance in topology matrix
Manual Binding
export NCCL_IB_HCA=mlx5_0,mlx5_1
export NCCL_TOPO_FILE=/path/to/custom_topo.xml
GPUDirect RDMA Optimization
Verify GDR
nvidia-smi -q | grep GPUDirect
Common Issues:
- ACS enabled in PCIe switch
- IOMMU enabled
- Insufficient BAR space
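Common mitigations for these issues live in the kernel boot line. A sketch of the relevant fragment is below; the exact flags are platform-dependent, so treat this as a starting point to validate against your vendor's GPUDirect RDMA guidance, not a drop-in setting.

```shell
# /etc/default/grub — sketch only; validate per platform before applying.
# iommu=pt keeps the IOMMU in passthrough mode, which avoids the DMA
# remapping path that commonly breaks or slows GPUDirect RDMA.
# pci=realloc can help when the kernel fails to assign large BAR windows.
GRUB_CMDLINE_LINUX="iommu=pt pci=realloc"
# ACS is typically disabled per PCIe downstream port via BIOS or setpci,
# not via the kernel command line.
```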
NCCL Environment Variable Tuning
Core Parameters
export NCCL_NET_GDR_LEVEL=2
export NCCL_IB_QPS_PER_CONNECTION=4
export NCCL_IB_GID_INDEX=3
export NCCL_IB_TC=136
Tuning strategy depends on:
- GPU count per node
- Rail-optimized network
- Fabric oversubscription ratio
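On rail-optimized fabrics, one common pattern is to pin each rank's traffic to the HCA on its own rail instead of letting NCCL pick. A minimal sketch follows; the two-GPUs-per-HCA mapping and the LOCAL_RANK variable are assumptions (launchers such as torchrun or mpirun set an equivalent), so derive the real mapping from nvidia-smi topo -m.

```shell
# Sketch: map 8 local GPUs onto 4 rail HCAs, two GPUs per HCA.
# LOCAL_RANK is assumed to be provided by the launcher.
LOCAL_RANK=${LOCAL_RANK:-0}
hca_index=$(( LOCAL_RANK / 2 ))   # GPUs 0-1 -> mlx5_0, 2-3 -> mlx5_1, ...
export NCCL_IB_HCA="mlx5_${hca_index}"
echo "rank ${LOCAL_RANK} -> ${NCCL_IB_HCA}"
```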
InfiniBand Fabric Tuning
Cluster-level tuning:
- MTU = 4096
- Adaptive routing = enabled
- Proper SL mapping
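The MTU setting above is usually applied through the subnet manager. A sketch of an OpenSM partitions.conf entry is shown below; the syntax and the mtu encoding (mtu=5 corresponds to 4096 bytes in the IB MTU enumeration) should be checked against your OpenSM version's documentation before use.

```shell
# /etc/opensm/partitions.conf — sketch, verify against your OpenSM docs.
# mtu=5 encodes a 4096-byte MTU; rate and SL mapping are fabric-specific.
Default=0x7fff, ipoib, mtu=5 : ALL=full;
```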
Validate using:
ibdiagnet
perfquery
RoCE vs InfiniBand
For a lossless RoCE tuning guide see:
➡ /guide/03-network/roce-ai-fabric
Key differences:
- Congestion control
- Buffer design
- PFC impact
Benchmark Methodology
A correct benchmarking process:
- Single link RDMA test
- Intra-node NCCL
- Inter-node NCCL
- Real training workload validation
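When automating the single-link stage, it helps to extract the average bandwidth column from perftest output rather than eyeballing it. The sample line below is illustrative of the ib_write_bw result table (#bytes, #iterations, BW peak, BW average, MsgRate); verify the column layout against your perftest build before scripting against it.

```shell
# Extract the "BW average" column from a captured ib_write_bw result line.
# Sample line is illustrative; check columns for your perftest version.
sample=' 65536      5000             97.32              96.85              0.184942'
printf '%s\n' "$sample" | awk '{ printf "avg BW: %s MB/sec\n", $4 }'
```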
Real-World Tuning Case Study
Initial State:
- 8×GPU per node
- AllReduce Bus BW: 43 GB/s
Optimizations Applied:
- NUMA realignment
- GDR enabled
- QPs increased
- Rail-aware NCCL tuning
Final Result:
- AllReduce Bus BW: 92 GB/s
- Scaling efficiency improved from 58% → 91%
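The bandwidth numbers above can be sanity-checked with a one-liner: 92 GB/s over the 43 GB/s baseline is roughly a 2.1× gain in bus bandwidth.

```shell
# Ratio of tuned vs baseline AllReduce bus bandwidth from the case study.
awk 'BEGIN { printf "speedup = %.2fx\n", 92 / 43 }'
```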