Overview
This guide targets engineers with basic Linux and system-administration experience and documents an end-to-end deployment of a small Slurm HPC cluster on Rocky Linux 9.4. The reference topology is a single control/management node (mu01) and multiple compute nodes (cu01-cu19). The document covers system preparation, network and time synchronization, shared storage, user management, Munge authentication, MariaDB for SlurmDBD accounting, building and installing Slurm, configuration, verification steps, and operational recommendations.
Readers are assumed to be familiar with basic Linux commands, SSH, yum/DNF package management, and systemd service operations.
Step 1 — Base Environment
- Operating system: Rocky Linux 9.4
- Scheduler: Slurm
- Authentication: Munge
- Accounting database: MariaDB (for slurmdbd)
- Shared storage: NFS
- Build tools: GCC, make, rpmbuild
1.1 Install OS
Install Rocky Linux 9.4 using a minimal or server image. Partition according to your organization’s policy and ensure adequate space for /var, /home, and /opt. Enabling the EPEL repository is recommended for convenience when installing additional packages.
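A minimal sketch of enabling EPEL on Rocky Linux 9 (assuming dnf-plugins-core is present for the CRB step; CRB carries some -devel packages used later in the build steps):
dnf install -y epel-release
dnf config-manager --set-enabled crb
dnf makecache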
1.2 Network and Hostnames
Assign static IP addresses on each machine (via nmtui/nmcli) and ensure hostnames are unique and resolvable via DNS or /etc/hosts. Example /etc/hosts on the management node (mu01):
127.0.0.1 localhost localhost.localdomain
::1 localhost localhost.localdomain
192.168.8.100 mu01
192.168.8.101 cu01
192.168.8.102 cu02
...
192.168.8.119 cu19
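If you use nmcli for static addressing, a minimal sketch for one compute node looks like this (the connection name ens160, gateway, and DNS values are placeholders for your environment):
nmcli connection modify ens160 ipv4.method manual \
    ipv4.addresses 192.168.8.101/24 ipv4.gateway 192.168.8.1 ipv4.dns 192.168.8.1
nmcli connection up ens160
hostnamectl set-hostname cu01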
1.3 Firewall and SELinux Recommendations
You can stop firewalld temporarily in a test or controlled internal environment, but in production only open the required ports (Slurm: 6817/6818/6819, Munge, NFS, MariaDB, etc.).
systemctl stop firewalld
systemctl disable firewalld
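If firewalld stays enabled, a sketch of opening only the required ports might look like the following (adjust to which services each node actually runs):
firewall-cmd --permanent --add-port=6817-6819/tcp        # slurmctld, slurmd, slurmdbd
firewall-cmd --permanent --add-service=nfs --add-service=rpc-bind --add-service=mountd   # NFS server only
firewall-cmd --permanent --add-port=3306/tcp             # only if MariaDB must be reached remotely
firewall-cmd --reload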
Prefer setting SELinux to permissive while developing and then adapt policies rather than permanently disabling it:
setenforce 0
sed -i 's/SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config
1.4 SSH Key-Based Login (use non-root admin account)
Generate an SSH key on the management node for the admin account and distribute it to compute nodes:
ssh-keygen -t rsa -b 4096 -C "admin@hpc"
for i in {1..19}; do printf "cu%02d\n" "$i"; done > nodes.txt
for host in $(cat nodes.txt); do ssh-copy-id admin@$host; done
Use Ansible or pssh to perform bulk administration and avoid manual inconsistencies.
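For example, a quick parallel check across all compute nodes with pssh (assuming the admin account's key is already distributed):
pssh -h nodes.txt -l admin -i "hostname && uptime"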
1.5 Time Synchronization (Chrony)
Configure the management node as an NTP server for the LAN and point compute nodes to it.
Management node /etc/chrony.conf example:
allow 192.168.8.0/24
pool ntp.aliyun.com iburst
Compute nodes:
server 192.168.8.100 iburst
Enable and verify:
systemctl enable --now chronyd
chronyc sources
chronyc tracking
Step 2 — NFS Shared Storage
Install and enable NFS on the management node:
yum install -y nfs-utils rpcbind
systemctl enable --now rpcbind nfs-server
Example /etc/exports:
/home 192.168.8.0/24(rw,sync,no_root_squash)
/opt 192.168.8.0/24(rw,sync,no_root_squash)
/data 192.168.8.0/24(rw,sync,no_root_squash)
Apply and check exports:
exportfs -rav
showmount -e
Clients can mount exports persistently via /etc/fstab or systemd mounts. Example fstab entries:
192.168.8.100:/home /home nfs defaults,_netdev 0 0
192.168.8.100:/opt /opt nfs defaults,_netdev 0 0
192.168.8.100:/data /data nfs defaults,_netdev 0 0
Mount and validate:
mount -a
df -h /home /opt /data
Security note: evaluate no_root_squash carefully; avoid it in environments where root mapping is a concern.
Step 3 — User and Group Management
Use centralized identity management (LDAP/FreeIPA) wherever possible. If you temporarily synchronize /etc/* files across nodes, automate with Ansible/rsync and be cautious with /etc/shadow permissions.
Temporary example:
useradd zhangsan
for host in $(cat nodes.txt); do
rsync -a /etc/passwd /etc/group /etc/shadow /etc/gshadow root@$host:/etc/
done
In production, use centralized authentication combined with sudo and auditing.
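As a hedged sketch, enrolling a node into an existing FreeIPA domain usually amounts to the following (the domain and server names are placeholders):
dnf install -y ipa-client
ipa-client-install --domain=hpc.example.com --server=ipa.hpc.example.com --mkhomedir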
Step 4 — Munge (Cluster Authentication)
Install Munge on all nodes and distribute the same /etc/munge/munge.key:
yum install -y munge munge-libs munge-devel
# or build RPM from source
# rpmbuild -tb munge-*.tar.xz
/usr/sbin/mungekey -f
chown munge:munge /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
for host in $(cat nodes.txt); do scp /etc/munge/munge.key root@$host:/etc/munge/; done
# Repeat the chown/chmod on every compute node, then enable and test munge everywhere:
systemctl enable --now munge
munge -n | unmunge
Verify Munge on all nodes.
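A cross-node test from the management node confirms that the shared key and clocks line up, for example:
munge -n | ssh cu01 unmunge    # should report STATUS: Success (0)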
Step 5 — MariaDB (Slurm Accounting)
Install MariaDB on the management node and create the slurm database and user:
yum install -y mariadb-server
systemctl enable --now mariadb
mysql_secure_installation
SQL to create database and user (replace with a strong password):
CREATE DATABASE slurm_acct_db;
CREATE USER 'slurm'@'localhost' IDENTIFIED BY 'strong_password_here';
GRANT ALL PRIVILEGES ON slurm_acct_db.* TO 'slurm'@'localhost';
FLUSH PRIVILEGES;
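SchedMD's accounting documentation suggests a few InnoDB settings for slurmdbd workloads; a sketch for a drop-in such as /etc/my.cnf.d/innodb.cnf (the file name and sizes are examples to tune for your memory) is:
[mysqld]
innodb_buffer_pool_size=4096M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900
Restart the mariadb service after applying the change.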
If slurmdbd is remote from the database, create a remote user and restrict access by firewall rules.
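As a sketch for that remote case, the grants would name the slurmdbd host instead of localhost (here mu01 is assumed to be that host):
CREATE USER 'slurm'@'mu01' IDENTIFIED BY 'strong_password_here';
GRANT ALL PRIVILEGES ON slurm_acct_db.* TO 'slurm'@'mu01';
FLUSH PRIVILEGES;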
Step 6 — Build and Install Slurm
Preconditions: Chrony and Munge running, a consistent slurm user, and shared config storage (optional).
Install build dependencies and build Slurm RPMs:
yum -y install gcc make perl rpm-build wget pam-devel python3 readline-devel lua-devel hwloc-devel munge-devel mariadb-devel
wget https://download.schedmd.com/slurm/slurm-24.05.2.tar.bz2
rpmbuild -ta slurm-*.tar.bz2
cd ~/rpmbuild/RPMS/x86_64/
yum localinstall -y slurm-*.rpm
Create the slurm system user and directories:
useradd -r -m -d /var/lib/slurm -s /sbin/nologin slurm
mkdir -p /var/spool/slurmd /var/spool/slurmctld /var/log/slurm
chown -R slurm:slurm /var/spool/slurmd /var/spool/slurmctld /var/log/slurm
Prepare and distribute /etc/slurm/slurm.conf (use the Slurm configurator or adapt to your environment). Key settings example:
ClusterName=hpccluster
SlurmctldHost=mu01
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity
StateSaveLocation=/var/spool/slurmctld
SlurmUser=slurm
AccountingStorageType=accounting_storage/slurmdbd
JobAcctGatherType=jobacct_gather/linux
NodeName=mu01 CPUs=24 RealMemory=94000
NodeName=cu[01-19] CPUs=24 RealMemory=94000
PartitionName=CU Nodes=cu[01-19] Default=YES MaxTime=INFINITE State=UP
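To avoid guessing CPUs and RealMemory for the NodeName lines, run slurmd in configuration-print mode on a compute node and copy its output into slurm.conf:
slurmd -C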
Distribute and set permissions:
for host in $(cat nodes.txt); do scp /etc/slurm/slurm.conf root@$host:/etc/slurm/; done
chown slurm:slurm /etc/slurm/slurm.conf
Enable cgroup support (/etc/slurm/cgroup.conf) and start services:
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
systemctl enable --now slurmctld slurmd
Configure /etc/slurm/slurmdbd.conf (owned by the slurm user, mode 600, since it holds the database password) to connect to MariaDB:
AuthType=auth/munge
DbdHost=localhost
DbdPort=6819
StorageHost=127.0.0.1
StorageUser=slurm
StoragePass=strong_password_here
StorageLoc=slurm_acct_db
Start and verify slurmdbd:
systemctl enable --now slurmdbd
systemctl status slurmdbd
sacctmgr show cluster
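If sacctmgr shows no cluster, register it once; the name must match ClusterName in slurm.conf:
sacctmgr add cluster hpccluster
sacctmgr show cluster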
Start-up and Cluster Verification
systemctl enable --now slurmdbd slurmctld
# Start slurmd on compute nodes (use pssh/ansible)
pssh -h nodes.txt -l root "systemctl enable --now slurmd"
sinfo
srun -N2 --ntasks-per-node=1 hostname
Common checks: munge -n | unmunge, chronyc tracking, sinfo, squeue, sacct.
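Beyond interactive srun tests, a small batch job exercises scheduling and accounting end to end; the script below is only an example:
cat > hello.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --partition=CU
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
srun hostname
EOF
sbatch hello.sbatch
squeue
sacct -X --format=JobID,JobName,Partition,State,Elapsed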
Operational and Security Recommendations
- Do not disable SELinux in production; prefer policy adjustments.
- Protect the Munge key and database credentials, and rotate them periodically.
- Use firewall rules to restrict access to necessary ports and hosts.
- Adopt centralized identity management rather than manual /etc syncing.
- Implement backup and recovery for slurmctld and slurmdbd data.
Optional: Monitoring and Scaling
- Collect Slurm metrics using Prometheus + Grafana for visualization.
- For higher availability, evaluate active/passive slurmctld designs or HA tools.
Conclusion
This guide provides practical steps to deploy a small Slurm cluster on Rocky Linux 9.4, from system preparation to Slurm installation and verification. Adjust configurations according to your network, security, and scale requirements, and add monitoring, backups, and high availability as needed.