Overview

This guide targets engineers with basic Linux and system-administration experience and documents an end-to-end deployment of a small Slurm HPC cluster on Rocky Linux 9.4. The reference topology is a single control/management node (mu01) and multiple compute nodes (cu01-cu19). The document covers system preparation, network and time synchronization, shared storage, user management, Munge authentication, MariaDB for SlurmDBD accounting, building and installing Slurm, configuration, verification steps, and operational recommendations.

Readers are assumed to be familiar with basic Linux commands, SSH, yum/DNF package management, and systemd service operations.

Step 1 — Base Environment

  • Operating system: Rocky Linux 9.4
  • Scheduler: Slurm
  • Authentication: Munge
  • Accounting database: MariaDB (for slurmdbd)
  • Shared storage: NFS
  • Build tools: GCC, make, rpmbuild

1.1 Install OS

Install Rocky Linux 9.4 using a minimal or server image. Partition according to your organization’s policy and ensure adequate space for /var, /home, and /opt. Enabling the EPEL repository is recommended for convenience when installing additional packages.
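
On Rocky Linux 9, EPEL and the CRB repository (which provides several -devel packages used later in this guide) can be enabled with, for example:

dnf install -y epel-release
dnf config-manager --set-enabled crb
dnf makecache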

1.2 Network and Hostnames

Assign static IP addresses on each machine (via nmtui/nmcli) and ensure hostnames are unique and resolvable via DNS or /etc/hosts. Example /etc/hosts on the management node (mu01):

127.0.0.1   localhost localhost.localdomain
::1         localhost localhost.localdomain
192.168.8.100 mu01
192.168.8.101 cu01
192.168.8.102 cu02
...
192.168.8.119 cu19
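
For the static addressing mentioned above, a minimal nmcli sketch for one compute node (the connection name "ens192", gateway, and DNS server are assumptions; adjust to your network):

nmcli con mod ens192 ipv4.method manual \
    ipv4.addresses 192.168.8.101/24 ipv4.gateway 192.168.8.1 ipv4.dns 192.168.8.1
nmcli con up ens192
hostnamectl set-hostname cu01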

1.3 Firewall and SELinux Recommendations

In a test or controlled internal environment you can stop firewalld temporarily, but in production keep it enabled and open only the required ports (Slurm: 6817/6818/6819 TCP, NFS, MariaDB, etc.; Munge uses a local Unix socket and needs no open network port).

systemctl stop firewalld
systemctl disable firewalld
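
If firewalld stays enabled in production, a sketch of the minimum openings (apply the Slurm ports on all nodes; NFS and MariaDB only on the hosts that run those services):

firewall-cmd --permanent --add-port=6817-6819/tcp   # slurmctld, slurmd, slurmdbd
firewall-cmd --permanent --add-service=nfs          # NFS server (mu01)
firewall-cmd --permanent --add-port=3306/tcp        # only if MariaDB must accept remote slurmdbd connections
firewall-cmd --reload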

Prefer setting SELinux to permissive while developing and then adapt policies rather than permanently disabling it:

setenforce 0
sed -i 's/SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config

1.4 SSH Key-Based Login (use non-root admin account)

Generate an SSH key on the management node for the admin account and distribute it to compute nodes:

ssh-keygen -t rsa -b 4096 -C "admin@hpc"
for i in {1..19}; do printf "cu%02d\n" "$i"; done > nodes.txt
for host in $(cat nodes.txt); do ssh-copy-id "admin@$host"; done

Use Ansible or pssh to perform bulk administration and avoid manual inconsistencies.
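
For example, a quick sanity check over all compute nodes (assumes pssh and/or Ansible are installed and reuses nodes.txt from above as the host list):

pssh -h nodes.txt -l admin -i "hostname && uptime"
ansible all -i nodes.txt -u admin -m ping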

1.5 Time Synchronization (Chrony)

Configure the management node as an NTP server for the LAN and point compute nodes to it.

Management node /etc/chrony.conf example:

allow 192.168.8.0/24
pool ntp.aliyun.com iburst

Compute nodes:

server 192.168.8.100 iburst

Enable and verify:

systemctl enable --now chronyd
chronyc sources
chronyc tracking

Step 2 — NFS Shared Storage

Install and enable NFS on the management node:

yum install -y nfs-utils rpcbind
systemctl enable --now rpcbind nfs-server

Example /etc/exports:

/home   192.168.8.0/24(rw,sync,no_root_squash)
/opt    192.168.8.0/24(rw,sync,no_root_squash)
/data   192.168.8.0/24(rw,sync,no_root_squash)

Apply and check exports:

exportfs -rav
showmount -e

Clients can mount the exports persistently via /etc/fstab or systemd mount units (install nfs-utils on each client first). Example fstab entries:

192.168.8.100:/home /home nfs defaults,_netdev 0 0
192.168.8.100:/opt  /opt  nfs defaults,_netdev 0 0
192.168.8.100:/data /data nfs defaults,_netdev 0 0

Mount and validate:

mount -a
df -h /home /opt /data

Security note: evaluate no_root_squash carefully; avoid it in environments where root mapping is a concern.
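
For example, a more restrictive export line that maps remote root to an unprivileged user:

/home   192.168.8.0/24(rw,sync,root_squash)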

Step 3 — User and Group Management

Use centralized identity management (LDAP/FreeIPA) wherever possible. If you temporarily synchronize /etc/* files across nodes, automate with Ansible/rsync and be cautious with /etc/shadow permissions.

Temporary example:

useradd zhangsan
for host in $(cat nodes.txt); do
    rsync -a /etc/passwd /etc/group /etc/shadow /etc/gshadow root@$host:/etc/
done

In production, use centralized authentication combined with sudo and auditing.
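
As a sketch of the centralized approach, enrolling a node into an existing FreeIPA domain looks roughly like this (the domain and server names are placeholders):

dnf install -y ipa-client
ipa-client-install --mkhomedir --domain=hpc.example.com --server=ipa.hpc.example.com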

Step 4 — Munge (Cluster Authentication)

Install Munge on all nodes and distribute the same /etc/munge/munge.key:

yum install -y munge munge-libs munge-devel
# or build RPM from source
# rpmbuild -tb munge-*.tar.xz

/usr/sbin/mungekey -f
chown munge:munge /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
for host in $(cat nodes.txt); do
    scp -p /etc/munge/munge.key root@$host:/etc/munge/
    ssh root@$host "chown munge:munge /etc/munge/munge.key; chmod 400 /etc/munge/munge.key; systemctl enable --now munge"
done
systemctl enable --now munge
munge -n | unmunge

Verify Munge on all nodes.
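
A quick cross-node check from mu01 (each node should report STATUS: Success):

for host in $(cat nodes.txt); do
    echo "== $host"; munge -n | ssh root@$host unmunge | grep STATUS
done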

Step 5 — MariaDB (Slurm Accounting)

Install MariaDB on the management node and create the slurm database and user:

yum install -y mariadb-server
systemctl enable --now mariadb
mysql_secure_installation
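
slurmdbd benefits from a larger InnoDB buffer pool. A commonly recommended starting point (the sizes below are assumptions; tune them to the host) in /etc/my.cnf.d/innodb.cnf, followed by systemctl restart mariadb:

[mysqld]
innodb_buffer_pool_size=1024M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900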

SQL to create database and user (replace with a strong password):

CREATE DATABASE slurm_acct_db;
CREATE USER 'slurm'@'localhost' IDENTIFIED BY 'strong_password_here';
GRANT ALL PRIVILEGES ON slurm_acct_db.* TO 'slurm'@'localhost';
FLUSH PRIVILEGES;

If slurmdbd is remote from the database, create a remote user and restrict access by firewall rules.
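
For example, to allow a slurmdbd on another host in the cluster subnet to connect (replace the host pattern and password as appropriate):

CREATE USER 'slurm'@'192.168.8.%' IDENTIFIED BY 'strong_password_here';
GRANT ALL PRIVILEGES ON slurm_acct_db.* TO 'slurm'@'192.168.8.%';
FLUSH PRIVILEGES;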

Step 6 — Build and Install Slurm

Preconditions: Chrony and Munge are running on all nodes, the slurm user exists with the same UID/GID everywhere (created below), and configuration may optionally live on shared storage.

Install the build dependencies on mu01, build the Slurm RPMs, and install them on every node:

yum -y install rpm-build wget gcc make perl pam-devel python3 readline-devel \
    lua-devel hwloc-devel munge-devel mariadb-devel perl-ExtUtils-MakeMaker
wget https://download.schedmd.com/slurm/slurm-24.05.2.tar.bz2
rpmbuild -ta slurm-*.tar.bz2
yum localinstall -y ~/rpmbuild/RPMS/x86_64/slurm-*.rpm
# Install the same RPMs on the compute nodes:
for host in $(cat nodes.txt); do
    scp ~/rpmbuild/RPMS/x86_64/slurm-*.rpm root@$host:/tmp/
    ssh root@$host "yum localinstall -y /tmp/slurm-*.rpm"
done

Create the slurm system user and the work directories on every node (keep the slurm UID/GID identical across nodes):

useradd -r -m -d /var/lib/slurm -s /sbin/nologin slurm
mkdir -p /var/spool/slurmd /var/spool/slurmctld /var/log/slurm
chown -R slurm:slurm /var/spool/slurmd /var/spool/slurmctld /var/log/slurm

Prepare and distribute /etc/slurm/slurm.conf (use the Slurm configurator or adapt to your environment). Key settings example:

ClusterName=hpccluster
SlurmctldHost=mu01
SlurmUser=slurm
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity
StateSaveLocation=/var/spool/slurmctld
SlurmdSpoolDir=/var/spool/slurmd
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=mu01
JobAcctGatherType=jobacct_gather/linux
NodeName=mu01 CPUs=24 RealMemory=94000
NodeName=cu[01-19] CPUs=24 RealMemory=94000
PartitionName=CU Nodes=cu[01-19] Default=YES MaxTime=INFINITE State=UP
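
The CPUs and RealMemory figures above are examples. On each node, slurmd -C prints the NodeName line Slurm detects (CPUs, sockets, cores, threads, RealMemory), which can be pasted into slurm.conf:

slurmd -C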

Distribute slurm.conf to the compute nodes (mu01 already has it) and set ownership:

for host in $(cat nodes.txt); do scp /etc/slurm/slurm.conf root@$host:/etc/slurm/; done
chown slurm:slurm /etc/slurm/slurm.conf

Enable cgroup support by creating /etc/slurm/cgroup.conf and copying it to every node the same way as slurm.conf, then start the services:

ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes

# On mu01; slurmd on the compute nodes is started in the verification step below
systemctl enable --now slurmctld slurmd

Configure /etc/slurm/slurmdbd.conf on mu01 to connect to MariaDB:

AuthType=auth/munge
DbdHost=localhost
DbdPort=6819
SlurmUser=slurm
StorageType=accounting_storage/mysql
StorageHost=127.0.0.1
StorageUser=slurm
StoragePass=strong_password_here
StorageLoc=slurm_acct_db
LogFile=/var/log/slurm/slurmdbd.log

Restrict the file permissions (slurmdbd requires slurmdbd.conf to be readable only by the slurm user), start slurmdbd, and register the cluster:

chown slurm:slurm /etc/slurm/slurmdbd.conf
chmod 600 /etc/slurm/slurmdbd.conf
systemctl enable --now slurmdbd
systemctl status slurmdbd
sacctmgr add cluster hpccluster
sacctmgr show cluster

Start-up and Cluster Verification

systemctl enable --now slurmdbd slurmctld
# Start slurmd on compute nodes (use pssh/ansible)
pssh -h nodes.txt "systemctl enable --now slurmd"
sinfo
srun -N2 --ntasks-per-node=1 hostname

Common checks: munge -n | unmunge, chronyc tracking, sinfo, squeue, sacct.
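
A minimal batch job against the CU partition defined above (the job name and output file are arbitrary):

#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --partition=CU
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --output=hello_%j.out

srun hostname

Submit it and inspect accounting:

sbatch hello.sh
squeue
sacct --format=JobID,JobName,Partition,NodeList,State,Elapsed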

Operational and Security Recommendations

  • Do not disable SELinux in production; prefer policy adjustments.
  • Protect the Munge key and database credentials, and rotate them periodically.
  • Use firewall rules to restrict access to necessary ports and hosts.
  • Adopt centralized identity management rather than manual /etc syncing.
  • Implement backup and recovery for slurmctld state and the slurmdbd database (see the sketch below).
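
A minimal backup sketch for the item above (the target directory and schedule are assumptions; run it from cron or a systemd timer):

mysqldump --single-transaction slurm_acct_db > /backup/slurm_acct_db-$(date +%F).sql
tar -czf /backup/slurmctld-state-$(date +%F).tar.gz /var/spool/slurmctld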

Optional: Monitoring and Scaling

  • Collect Slurm metrics using Prometheus + Grafana for visualization.
  • For higher availability, evaluate active/passive slurmctld designs or HA tools.

Conclusion

This guide provides practical steps to deploy a small Slurm cluster on Rocky Linux 9.4, from system preparation to Slurm installation and verification. Adjust configurations according to your network, security, and scale requirements, and add monitoring, backups, and high-availability as needed.