
Building an All-in-One HPC Benchmark Toolkit

Abstract: Traditional HPC acceptance testing involves tedious installation of compilers and libraries on every node. This article proposes a Docker-based solution: consolidating GPU HPL, CPU Linpack, STREAM, FIO, and OSU Micro-benchmarks into a single "Super Image" for efficient, "Build Once, Run Anywhere" performance validation.

1. Foundation Setup

The toolkit relies on the host's Docker environment and the NVIDIA Container Toolkit.

1.1 Host Preparation

RedHat / CentOS:

bash
# Install Docker CE
yum install -y yum-utils
yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
yum install -y docker-ce docker-ce-cli containerd.io

# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.repo | tee /etc/yum.repos.d/nvidia-container-toolkit.repo
yum install -y nvidia-container-toolkit
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker

1.2 Base Image

We recommend extending the official NVIDIA HPC Benchmarks image:

bash
docker pull nvcr.io/nvidia/hpc-benchmarks:24.03
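
Before building on top of it, a quick smoke test confirms that the NVIDIA runtime is wired up correctly (the tag matches the pull above):

bash
docker run --rm --gpus all nvcr.io/nvidia/hpc-benchmarks:24.03 nvidia-smi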

2. Integration Strategy

We integrate the following components via Dockerfile or interactive commit.
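
As a sketch of the interactive-commit route (the build container name hpc-build is a placeholder; the final tag matches the one used in Section 3):

bash
# Start an interactive build container from the NVIDIA base image
docker run -it --gpus all --name hpc-build nvcr.io/nvidia/hpc-benchmarks:24.03 /bin/bash

# ... install oneAPI, STREAM, FIO/IOzone, and OSU inside the container ...

# Commit the result as the consolidated toolkit image
docker commit hpc-build my-hpc-toolkit:v1.0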

2.1 GPU HPL (Built-in)

The base image already includes optimized HPL and HPL-MxP.

Single Node 8-GPU Command:

bash
mpirun --bind-to none -np 8 \
  hpl.sh --cpu-affinity 0-31:32-63:64-95:96-127:128-159:160-191:192-223:224-255 \
  --gpu-affinity 0:1:2:3:4:5:6:7 \
  --dat /workspace/hpl-linux-x86_64/sample-dat/HPL-8GPU.dat
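
The same wrapper scales out over MPI for multi-node runs. A two-node, 16-GPU sketch (the hostnames and the 16-GPU .dat file are placeholders to replace with your own):

bash
mpirun --bind-to none -np 16 -H node1:8,node2:8 \
  hpl.sh --cpu-affinity 0-31:32-63:64-95:96-127:128-159:160-191:192-223:224-255 \
  --gpu-affinity 0:1:2:3:4:5:6:7 \
  --dat /workspace/hpl-linux-x86_64/sample-dat/HPL-16GPU.dat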

2.2 Intel CPU Linpack & oneAPI

Install the Intel oneAPI Base and HPC Toolkits inside the container for CPU benchmarking.

bash
# Install Base Kit
sh ./l_BaseKit_p_2024.1.0.596_offline.sh -a --silent --eula accept

# Install HPC Kit (MPI + MKL)
sh ./l_HPCKit_p_2024.1.0.560_offline.sh -a --silent --eula accept

# Env Setup
echo "source /opt/intel/oneapi/setvars.sh" >> /etc/profile

2.3 AMD CPU Linpack

For AMD platforms, copy the AMD Linpack package (AMDLinpack.zip) into the image and unpack it.

Note

Disable SMT in the BIOS for AMD CPU benchmarking so results reflect the peak performance of the physical cores.
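
A minimal run sketch, assuming the archive unpacks to a BLIS-linked xhpl binary with its HPL.dat (directory, thread count, and rank count are placeholders for your platform):

bash
unzip AMDLinpack.zip -d /opt/amd-linpack && cd /opt/amd-linpack
# One MPI rank per socket, OpenMP threads pinned to physical cores
export OMP_NUM_THREADS=64 OMP_PROC_BIND=close OMP_PLACES=cores
mpirun --allow-run-as-root -np 2 --map-by socket ./xhpl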

2.4 STREAM (Memory Bandwidth)

Compile STREAM inside the container.

bash
# CPU STREAM (oneAPI 2024 ships icx; the classic icc compiler has been removed)
icx -O3 -xCORE-AVX512 -qopenmp -DSTREAM_ARRAY_SIZE=80000000 -o stream_cpu stream.c

# GPU STREAM (Built-in)
/workspace/stream-gpu-linux-x86_64/stream-gpu-test.sh
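
When running the CPU binary, pinning one OpenMP thread per physical core gives the most reproducible bandwidth figures (the thread count is a placeholder for the node's core count):

bash
export OMP_NUM_THREADS=64 OMP_PROC_BIND=spread OMP_PLACES=cores
./stream_cpu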

2.5 FIO & IOzone (Storage I/O)

For testing local NVMe RAID or Parallel File Systems.

bash
# Install FIO
yum install -y fio

# Compile IOzone (run from src/current in the IOzone source tree)
make linux-AMD64
cp iozone /usr/local/bin/
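
A representative FIO invocation for sequential write throughput (the target directory, size, and job count are placeholders to adapt to the filesystem under test):

bash
fio --name=seq-write --directory=/mnt/test --rw=write --bs=1M \
    --size=10G --numjobs=8 --iodepth=32 --ioengine=libaio \
    --direct=1 --group_reporting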

2.6 OSU Micro-benchmarks (Network)

For testing InfiniBand/RoCE latency and bandwidth.

bash
# Point-to-point bandwidth between two nodes
mpirun --allow-run-as-root -np 2 -H host1,host2 \
  -x UCX_NET_DEVICES=mlx5_0:1 \
  /opt/osu-micro-benchmarks/mpi/pt2pt/osu_bw
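
If the image does not already ship OSU, it can be built against the container's MPI. A build sketch (the version shown is an example, and the installed directory layout varies between OSU releases, so adjust the benchmark paths accordingly):

bash
wget https://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-7.4.tar.gz
tar xzf osu-micro-benchmarks-7.4.tar.gz && cd osu-micro-benchmarks-7.4
./configure CC=mpicc CXX=mpicxx --prefix=/opt/osu-micro-benchmarks
make -j && make install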

3. Multi-Node Deployment

3.1 Container Launch

Run a "Privileged Container" on all nodes:

bash
docker run -d --net=host --privileged --ipc=host \
  --gpus all --name hpc-tool \
  -v /root/.ssh:/root/.ssh \
  my-hpc-toolkit:v1.0 sleep infinity

  • --net=host: Use the host network stack so MPI traffic can reach the InfiniBand/RoCE interfaces.
  • --ipc=host: Share the host IPC namespace for shared-memory transports.
  • -v /root/.ssh: Reuse the host SSH keys for passwordless MPI launches.
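
A small loop on the head node (node names are placeholders) brings the container up everywhere:

bash
for node in node1 node2 node3 node4; do
  ssh "$node" docker run -d --net=host --privileged --ipc=host \
    --gpus all --name hpc-tool -v /root/.ssh:/root/.ssh \
    my-hpc-toolkit:v1.0 sleep infinity
done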

3.2 Orchestration

Use a simple script on the head node to trigger mpirun inside the containers across the cluster.
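
One workable pattern (an assumption, not the only option) is to run sshd inside each container on an alternate port and let mpirun fan out from the head node's container; the port, hostfile, and hostnames below are placeholders:

bash
# Inside the head node's container
cat > /tmp/hostfile <<EOF
node1 slots=8
node2 slots=8
EOF

# plm_rsh_args points Open MPI at the containers' sshd port (assumed 2222)
mpirun --allow-run-as-root -hostfile /tmp/hostfile -np 2 --map-by node \
  -mca plm_rsh_args "-p 2222" \
  -x UCX_NET_DEVICES=mlx5_0:1 \
  /opt/osu-micro-benchmarks/mpi/pt2pt/osu_bw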

4. Portable Live USB Solution

To protect customer environments (non-intrusive testing):

  1. Create USB: Flash Ubuntu/Rocky Live ISO.
  2. Load Image: Place hpc-toolkit.tar on the data partition.
  3. On-Site (see the sketch after this list):
    • Boot server from USB.
    • Install Docker (offline).
    • docker load -i hpc-toolkit.tar.
    • Run benchmarks.
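
The on-site steps then reduce to a few commands (the RPM directory and mount point mirror whatever was staged on the data partition):

bash
# Offline Docker install from RPMs staged on the USB data partition
yum localinstall -y /mnt/usb-data/docker-rpms/*.rpm
systemctl start docker

# Load and verify the toolkit image
docker load -i /mnt/usb-data/hpc-toolkit.tar
docker images | grep hpc-toolkit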

5. Summary

This approach standardizes HPC benchmarking:

  • Unified Baseline: Eliminates variance caused by compiler/library versions.
  • Fast Delivery: Reduces setup time from hours to minutes.
  • Full Coverage: Compute (HPL), Memory (STREAM), Storage (FIO), and Network (OSU) in a single image.

AI-HPC Organization