Mellanox IB Operations & Tuning Guide
Abstract: This guide consolidates best practices for InfiniBand network lifecycle management, including driver installation, IP configuration, switch firmware upgrades, MPI benchmarking, performance tuning, and BER troubleshooting.
1. Driver Installation & Config
1.1 Install Mellanox OFED
For RHEL/CentOS based systems.
# 1. Extract
tar zxvf MLNX_OFED_LINUX-5.3-1.0.0.1-rhel7.5-x86_64.tgz
cd MLNX_OFED_LINUX-5.3-1.0.0.1-rhel7.5-x86_64/
# 2. Install (Force all modules)
./mlnxofedinstall --all --force
# 3. Reboot
reboot
1.2 Configure IPoIB
Set a static IP address on the ib0 interface for management or IPoIB communication.
# Edit: /etc/sysconfig/network-scripts/ifcfg-ib0
NAME=ib0
DEVICE=ib0
BOOTPROTO=static
ONBOOT=yes
TYPE=InfiniBand
IPADDR=11.11.11.200
NETMASK=255.255.0.0
- Verify: ip a and ibstat (State: Active).
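The ibstat check above can be scripted. The following is a minimal sketch, assuming ibstat prints one "State:" line per port (with "Physical state:" as a separate field); ib_ports_active is a hypothetical helper name, and it reads text on stdin so it can be tried without hardware.

```shell
# Sketch: report whether every port in ibstat-style text is Active.
# ib_ports_active is a hypothetical helper, not part of OFED.
ib_ports_active() {
    awk '
        # Count only the logical state field; "Physical state:" starts
        # with a different first word and is ignored.
        $1 == "State:" { total++; if ($2 == "Active") active++ }
        END {
            printf "%d/%d ports Active\n", active, total
            # Exit 0 only if at least one port was seen and all are Active.
            exit !(total > 0 && active == total)
        }
    '
}
```

On a node with the driver loaded, this would typically be used as ibstat | ib_ports_active, with the exit status driving a health check.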
1.3 Start Subnet Manager (SM)
For small clusters without a managed switch, start OpenSM on at least one node.
systemctl start opensmd
2. Switch Firmware Upgrade
2.1 MST Initialization
Mellanox Software Tools (MST) are used to access IB devices.
mst start
mst ib add
mst status
Output: /dev/mst/SW_MT54000_..._lid-0x0006 (The switch device)
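When several switches are present, picking the device path out of the status text by hand is error-prone. The following is a rough sketch, assuming in-band switch devices appear as /dev/mst/SW_... paths ending in a lid- suffix, as in the output shown above; list_switch_devices and device_lid are hypothetical helper names.

```shell
# Sketch: list switch device paths found in "mst status"-style text on stdin.
# Assumption: switch devices are named /dev/mst/SW_... as shown above.
list_switch_devices() {
    grep -o '/dev/mst/SW_[^[:space:]]*' | sort -u
}

# Print the LID suffix (the text after "lid-") of a device path.
device_lid() {
    printf '%s\n' "${1##*lid-}"
}
```

A possible use is mst status | list_switch_devices, feeding each resulting path to the firmware commands one device at a time.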
2.2 Burning Firmware
- Query:
flint -d /dev/mst/SW_MT54000_..._lid-0x0002 query
- Burn:
flint -d /dev/mst/SW_MT54000_..._lid-0x0002 -i firmware.bin burn
- Reset:
flint -d /dev/mst/SW_MT54000_..._lid-0x0002 swreset
Note: For high-end models (e.g., 400G MQM9790), a physical power cycle (unplug for 3 mins) may be required.
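Because a firmware burn is hard to undo, it can help to print the query, burn, and reset sequence for review before running anything. The following is a dry-run sketch only; fw_upgrade_plan is a hypothetical helper name (not part of MFT), and it echoes the commands without executing them.

```shell
# Sketch: print the query -> burn -> swreset sequence for one device
# without executing it. fw_upgrade_plan is a hypothetical helper.
fw_upgrade_plan() {
    local dev="$1" image="$2"
    if [ -z "$dev" ] || [ -z "$image" ]; then
        echo "usage: fw_upgrade_plan <mst-device> <firmware-image>" >&2
        return 2
    fi
    printf 'flint -d %s query\n' "$dev"
    printf 'flint -d %s -i %s burn\n' "$dev" "$image"
    printf 'flint -d %s swreset\n' "$dev"
}
```

The printed lines can be checked against the intended device path and image file, then run by hand in order.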
3. Benchmarking (MPI & OSU)
3.1 Intel MPI Benchmarks (IMB)
Test PingPong bandwidth and latency. Use Intel Compiler 2020+.
# PingPong (2 processes)
mpirun -iface ib0 -f hostfile -np 2 -ppn 1 ./IMB-MPI1 pingpong
- Metrics:
  - Latency: t[usec] for 0-byte msg (< 2us).
  - Bandwidth: Mbytes/sec for 4MB msg (Near line rate).
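The two headline numbers can be pulled out of the PingPong table automatically. The following is a minimal sketch, assuming the usual four-column IMB layout (#bytes, #repetitions, t[usec], Mbytes/sec) and taking 4MB to mean 4194304 bytes; imb_pingpong_summary is a hypothetical helper name.

```shell
# Sketch: summarize IMB PingPong output read from stdin.
# Assumption: data rows have four columns, #bytes #repetitions t[usec] Mbytes/sec.
imb_pingpong_summary() {
    awk '
        # Data rows start with a plain integer byte count; header rows start with "#".
        NF == 4 && $1 ~ /^[0-9]+$/ {
            if ($1 == 0)       lat = $3
            if ($1 == 4194304) bw  = $4
        }
        END {
            if (lat == "" || bw == "") { print "incomplete output"; exit 1 }
            printf "latency_0B_usec=%s bandwidth_4MB_MBps=%s\n", lat, bw
            # Exit 0 only if the 0-byte latency meets the < 2 us target.
            exit !(lat + 0 < 2)
        }
    '
}
```

The bandwidth figure is only printed, not judged, since "near line rate" depends on the link speed of the fabric under test.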
3.2 OSU Micro-Benchmarks (Script)
Iterate through all node pairs using OpenMPI.
#!/bin/bash
MPI_RUN="/usr/mpi/gcc/openmpi-4.1.7a1/bin/mpirun"
OSU_DIR="/path/to/osu-micro-benchmarks"
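for i in $(cat host1); do
  for j in $(cat host2); do
    # Skip self-pairs: two ranks on one host would not exercise the IB link
    [ "$i" = "$j" ] && continue
    # Force IB device mlx5_0:1
    "$MPI_RUN" -np 2 -x UCX_NET_DEVICES=mlx5_0:1 -H "$i,$j" "$OSU_DIR/osu_bw"
  done
done
4. Performance Tuning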
4.1 CPU Power Management
High Performance Computing requires disabling power saving to avoid CPU wake-up latency.
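Before changing anything, it can be useful to audit which CPUs are not yet on the performance governor. The following is a small sketch that walks the cpufreq entries in sysfs; non_performance_cpus is a hypothetical helper name, and the sysfs root is a parameter so the check can be pointed at a test directory.

```shell
# Sketch: list CPUs whose cpufreq governor is not "performance".
# non_performance_cpus is a hypothetical helper; the default root is the
# standard sysfs location, overridable for testing.
non_performance_cpus() {
    local root="${1:-/sys/devices/system/cpu}" f gov cpu rc=0
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue   # no cpufreq support, or the glob did not match
        gov=$(cat "$f")
        if [ "$gov" != "performance" ]; then
            cpu=${f%/cpufreq/scaling_governor}
            printf '%s %s\n' "${cpu##*/}" "$gov"
            rc=1
        fi
    done
    return "$rc"
}
```

An empty result with exit status 0 means every CPU that exposes cpufreq is already set to performance.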
# Set to Performance
cpupower -c all frequency-set -g performance
# Verify
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Expected: performance
4.2 NIC Tuning
Use mlnx_tune for one-click optimization.
# High Throughput Profile
mlnx_tune -p HIGH_THROUGHPUT
5. Health Check & Troubleshooting
5.1 Deep Inspection with ibdiagnet
Run a 30-minute stress test to catch intermittent errors.
ibdiagnet --pc --pm_pause_time 1800 -P all=1 \
--get_phy_info --get_cable_info --sc \
--extended_speeds all --pm_per_lane --routing --sharp
5.2 Failure Criteria
- BER: Must be < $10^{-12}$ (strict threshold), or at the very least < $10^{-8}$ (minimum acceptable).
- Link Down Counter: Delta must be 0.
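The two criteria above can be combined into a single per-link verdict. The following is a sketch only: it assumes the BER and the link-down delta have already been extracted as plain numbers (how they are obtained from the ibdiagnet output is not covered here), and link_verdict with its verdict labels is hypothetical.

```shell
# Sketch: classify one link from its BER and link-down counter delta.
# Thresholds are the ones listed above; labels are illustrative.
# Usage: link_verdict <ber> <link_down_delta>
link_verdict() {
    awk -v ber="$1" -v delta="$2" 'BEGIN {
        if (delta + 0 != 0)  { print "FAIL link_down_delta"; exit 1 }
        if (ber + 0 < 1e-12) { print "PASS strict";  exit 0 }
        if (ber + 0 < 1e-8)  { print "PASS minimum"; exit 0 }
        print "FAIL ber"; exit 1
    }'
}
```

For example, link_verdict 1.5e-15 0 reports a strict pass, while any non-zero link-down delta fails regardless of BER.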
5.3 Isolation Method
Locate device via LID, then apply Cross-Swap:
- Clean: Clean fiber endpoints.
- Swap:
  - Replace Cable.
  - Replace Transceiver.
  - Swap NIC/Switch ports.
- Retest: Run ibdiagnet again to confirm zero errors.
