InfiniBand Performance Tuning for AI Workloads
Distributed training performance is increasingly network-bound rather than compute-bound.
For large-scale LLM training, inefficient RDMA communication can reduce cluster efficiency by more than 40%.
This guide provides a production-grade methodology for tuning InfiniBand in GPU clusters.
Why InfiniBand Performance Matters for LLM Training
In modern training workloads:
- AllReduce dominates iteration time
- Communication overlaps with compute
- Network imbalance breaks scaling efficiency
At scale:
- Small latency increase → global throughput drop
- PCIe misalignment → NCCL bandwidth collapse
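The sensitivity to latency can be illustrated with back-of-envelope arithmetic (all numbers below are illustrative, not measurements): per-step throughput scales as compute_time / (compute_time + exposed_comm_time), so even a modest increase in non-overlapped communication time cuts global throughput.

```shell
# Illustrative only: step_time = compute + exposed (non-overlapped) comm.
# Throughput is proportional to 1/step_time.
compute_ms=100
for comm_ms in 10 25; do
  awk -v c="$compute_ms" -v n="$comm_ms" \
    'BEGIN { printf "exposed comm %d ms -> relative throughput %.2f\n", n, c/(c+n) }'
done
```

Growing exposed communication from 10 ms to 25 ms against 100 ms of compute drops relative throughput from 0.91 to 0.80 in this toy model.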
End-to-End Data Path
The real performance path is:
GPU → PCIe Switch → HCA → IB Fabric → Remote HCA → Remote GPU
Key bottlenecks:
- PCIe lane width
- NUMA crossing
- Retimer latency
- GDR capability
Key Performance Metrics
1. RDMA Bandwidth Test
ib_write_bw
ib_read_bw
ib_send_bw
2. NCCL Test
nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
Focus on:
- Bus Bandwidth
- Algorithm Bandwidth
- Latency at small message sizes
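The bus bandwidth reported by all_reduce_perf is derived from algorithm bandwidth using the nccl-tests AllReduce factor 2(n-1)/n, where n is the number of ranks. It can be reproduced by hand (the 50 GB/s algorithm bandwidth below is an illustrative number):

```shell
# busBW = algBW * 2*(n-1)/n for AllReduce (nccl-tests convention)
n=8          # number of ranks
algbw=50     # algorithm bandwidth in GB/s (illustrative)
awk -v n="$n" -v a="$algbw" \
  'BEGIN { printf "busBW = %.2f GB/s\n", a * 2 * (n - 1) / n }'
```

For 8 ranks at 50 GB/s algorithm bandwidth this yields 87.50 GB/s of bus bandwidth, which is why busBW, not algBW, is the number to compare against link speed.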
PCIe and NUMA Affinity Optimization
Check Topology
nvidia-smi topo -m
lspci -tv
numactl -H
Goal:
- GPU and HCA under same NUMA
- Avoid SYS distance in topology matrix
Manual Binding
export NCCL_IB_HCA=mlx5_0,mlx5_1
export NCCL_TOPO_FILE=/path/to/custom_topo.xml
GPUDirect RDMA Optimization
Verify GDR
nvidia-smi -q | grep GPUDirect
Common Issues:
- ACS enabled in PCIe switch
- IOMMU enabled
- Insufficient BAR space
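Common mitigations for these issues live in the kernel boot line. A sketch of the relevant fragment is below; the exact flags are platform-dependent, so treat this as a starting point to validate against your vendor's GPUDirect RDMA guidance, not a drop-in setting.

```shell
# /etc/default/grub — sketch only; validate per platform before applying.
# iommu=pt keeps the IOMMU in passthrough mode, which avoids the DMA
# remapping path that commonly breaks or slows GPUDirect RDMA.
# pci=realloc can help when the kernel fails to assign large BAR windows.
GRUB_CMDLINE_LINUX="iommu=pt pci=realloc"
# ACS is typically disabled per PCIe downstream port via BIOS or setpci,
# not via the kernel command line.
```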
NCCL Environment Variable Tuning
Core Parameters
export NCCL_NET_GDR_LEVEL=2
export NCCL_IB_QPS_PER_CONNECTION=4
export NCCL_IB_GID_INDEX=3
export NCCL_IB_TC=136
Tuning strategy depends on:
- GPU count per node
- Rail-optimized network
- Fabric oversubscription ratio
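On rail-optimized fabrics, one common pattern is to pin each rank's traffic to the HCA on its own rail instead of letting NCCL pick. A minimal sketch follows; the two-GPUs-per-HCA mapping and the LOCAL_RANK variable are assumptions (launchers such as torchrun or mpirun set an equivalent), so derive the real mapping from nvidia-smi topo -m.

```shell
# Sketch: map 8 local GPUs onto 4 rail HCAs, two GPUs per HCA.
# LOCAL_RANK is assumed to be provided by the launcher.
LOCAL_RANK=${LOCAL_RANK:-0}
hca_index=$(( LOCAL_RANK / 2 ))   # GPUs 0-1 -> mlx5_0, 2-3 -> mlx5_1, ...
export NCCL_IB_HCA="mlx5_${hca_index}"
echo "rank ${LOCAL_RANK} -> ${NCCL_IB_HCA}"
```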
InfiniBand Fabric Tuning
Cluster-level tuning:
- MTU = 4096
- Adaptive routing = enabled
- Proper SL mapping
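The MTU setting above is usually applied through the subnet manager. A sketch of an OpenSM partitions.conf entry is shown below; the syntax and the mtu encoding (mtu=5 corresponds to 4096 bytes in the IB MTU enumeration) should be checked against your OpenSM version's documentation before use.

```shell
# /etc/opensm/partitions.conf — sketch, verify against your OpenSM docs.
# mtu=5 encodes a 4096-byte MTU; rate and SL mapping are fabric-specific.
Default=0x7fff, ipoib, mtu=5 : ALL=full;
```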
Validate using:
ibdiagnet
perfquery
RoCE vs InfiniBand
For a lossless RoCE tuning guide see:
➡ /guide/03-network/roce-ai-fabric
Key differences:
- Congestion control
- Buffer design
- PFC impact
Benchmark Methodology
A correct benchmarking process:
- Single link RDMA test
- Intra-node NCCL
- Inter-node NCCL
- Real training workload validation
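When automating the single-link stage, it helps to extract the average bandwidth column from perftest output rather than eyeballing it. The sample line below is illustrative of the ib_write_bw result table (#bytes, #iterations, BW peak, BW average, MsgRate); verify the column layout against your perftest build before scripting against it.

```shell
# Extract the "BW average" column from a captured ib_write_bw result line.
# Sample line is illustrative; check columns for your perftest version.
sample=' 65536      5000             97.32              96.85              0.184942'
printf '%s\n' "$sample" | awk '{ printf "avg BW: %s MB/sec\n", $4 }'
```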
Real-World Tuning Case Study
Initial State:
- 8×GPU per node
- AllReduce Bus BW: 43 GB/s
Optimizations Applied:
- NUMA realignment
- GDR enabled
- QPs increased
- Rail-aware NCCL tuning
Final Result:
- AllReduce Bus BW: 92 GB/s
- Scaling efficiency improved from 58% → 91%
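The bandwidth numbers above can be sanity-checked with a one-liner: 92 GB/s over the 43 GB/s baseline is roughly a 2.1× gain in bus bandwidth.

```shell
# Ratio of tuned vs baseline AllReduce bus bandwidth from the case study.
awk 'BEGIN { printf "speedup = %.2fx\n", 92 / 43 }'
```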