
Incident Review: How LD_LIBRARY_PATH Broke dracut and XFS Mount

Abstract

A GPU/HPC server failed to boot with the error: mount: unknown filesystem type 'xfs'. The root cause was not disk failure, but environment variable pollution that broke the initramfs generation process.


1. Symptoms

The system dropped into emergency mode. Manual mounting of XFS partitions worked, but the automated boot process failed to recognize the filesystem type.

2. The Red Flag

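blkid -V reported a version from 2003, even though the system was a modern CentOS 7 install. That mismatch showed the blkid binary was not running against the libraries it shipped with. Because blkid was malfunctioning, dracut could not detect the root partition's filesystem type and built an initramfs without the xfs.ko module.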
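One way to check for this kind of failure is to compare the filesystem type that blkid probes for the root device with the type the kernel reports for the mounted root. The sketch below is illustrative, not the procedure used during the incident. The function name check_root_fstype is hypothetical, and it assumes findmnt and blkid are on the PATH and that / is already mounted.

```shell
# Compare the filesystem type blkid probes for the root device with the
# type the running kernel reports for the mounted root. An empty or
# mismatched blkid answer suggests the probing tool is at fault, not the disk.
check_root_fstype() (
    root_dev=$(findmnt -n -o SOURCE /)             # device backing /
    mounted=$(findmnt -n -o FSTYPE /)              # type according to the kernel
    probed=$(blkid -o value -s TYPE "$root_dev")   # type according to blkid
    echo "device=$root_dev mounted=$mounted probed=${probed:-<none>}"
    # Succeed only if blkid returned a type and it matches the kernel's view.
    [ -n "$probed" ] && [ "$probed" = "$mounted" ]
)
```

Calling check_root_fstype prints one summary line and returns non-zero when blkid cannot identify the root filesystem. That is the condition under which dracut would be expected to leave out the matching driver.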

3. Root Cause

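LD_LIBRARY_PATH was set globally in /etc/profile to point to CUDA and MPI library directories. Those directories contained libraries with the same names as system libraries, and the dynamic linker searches LD_LIBRARY_PATH before the system locations. As a result, critical binaries such as blkid loaded incompatible library versions at run time and malfunctioned.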
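To see whether a binary is affected by this kind of shadowing, its library resolution can be compared with and without the inherited LD_LIBRARY_PATH. The following is a minimal sketch under stated assumptions: the helper names resolved_libs and shadowed_libs are hypothetical, and ldd, comm, and mktemp are assumed to be available.

```shell
# Print the "soname => path" lines that ldd reports for a binary, with the
# per-run load addresses stripped so that two runs can be compared.
resolved_libs() {
    ldd "$1" | sed 's/ (0x[0-9a-f]*)$//' | sort
}

# Print the libraries whose resolution changes once LD_LIBRARY_PATH is unset.
# No output means the variable does not affect this binary.
shadowed_libs() (
    bin=$1
    tmp=$(mktemp -d) || exit 1
    resolved_libs "$bin" > "$tmp/with"
    ( unset LD_LIBRARY_PATH; resolved_libs "$bin" ) > "$tmp/without"
    # comm -3 keeps only the lines that are unique to one of the two files.
    comm -3 "$tmp/with" "$tmp/without"
    rm -rf "$tmp"
)
```

For example, shadowed_libs "$(command -v blkid)" would list any library that blkid picks up from an LD_LIBRARY_PATH directory instead of the system location. Running grep -rn LD_LIBRARY_PATH /etc/profile /etc/profile.d is one way to find where the variable is being exported.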

4. Resolution

  1. Reinstall util-linux and e2fsprogs via RPM to restore binary integrity.
  2. Rebuild the initramfs using dracut -f.
  3. Verify that xfs.ko is present in the image, then reboot.
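The steps above can be sketched as a single function. This is an assumed reconstruction, not the exact commands run during the incident: the source says only that the packages were reinstalled "via RPM", so yum reinstall is one possible way to do that, and the function name repair_initramfs is hypothetical. It is meant to be run as root on the affected host.

```shell
# Reinstall the affected packages, rebuild the initramfs, and check that
# the XFS module is in the new image. Runs in a subshell so that clearing
# LD_LIBRARY_PATH does not change the caller's environment.
repair_initramfs() (
    # Keep the polluted variable away from every command below.
    unset LD_LIBRARY_PATH
    # 1. Restore the packaged binaries and libraries.
    yum -y reinstall util-linux e2fsprogs || exit 1
    # 2. Regenerate the initramfs for the running kernel, overwriting the old one.
    dracut -f || exit 1
    # 3. Confirm the XFS module is listed (it may appear as xfs.ko or xfs.ko.xz).
    lsinitrd | grep 'xfs\.ko'
)
```

The function returns non-zero if any step fails or if no XFS module is listed in the image, in which case the host should not be rebooted yet. Note that dracut -f without further arguments rebuilds the image for the currently running kernel only.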

5. Prevention

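Avoid setting LD_LIBRARY_PATH globally, for example in /etc/profile. Register library directories with ldconfig instead, or use Lmod so that the HPC software stack is loaded only in the sessions that need it.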
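As an illustration of the ldconfig route, the sketch below writes a drop-in file for the dynamic linker cache instead of exporting a variable. The function name register_lib_dir is hypothetical, /etc/ld.so.conf.d is assumed as the default drop-in directory, and any library path passed in is site-specific. Directories registered this way can still carry libraries whose names collide with system ones, so it is worth checking for overlaps first.

```shell
# Register a library directory with the dynamic linker cache.
# Usage: register_lib_dir NAME LIB_DIR [CONF_DIR]
# CONF_DIR defaults to /etc/ld.so.conf.d (an assumption about the host).
register_lib_dir() (
    name=$1
    lib_dir=$2
    conf_dir=${3:-/etc/ld.so.conf.d}
    # Refuse to register a directory that does not exist.
    [ -d "$lib_dir" ] || { echo "no such directory: $lib_dir" >&2; exit 1; }
    # One directory per line is the format ldconfig expects in drop-in files.
    printf '%s\n' "$lib_dir" > "$conf_dir/$name.conf" || exit 1
    # Rebuild the linker cache so the new directory takes effect.
    ldconfig
)
```

A call such as register_lib_dir cuda /usr/local/cuda/lib64 (the path is an assumed example) would be run once as root, after which the export in /etc/profile can be removed.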

AI-HPC Organization