Slurm Clusters are currently in beta. If you’d like to provide feedback, please join our Discord.
Key features
Slurm Clusters eliminate the traditional complexity of cluster orchestration by providing:
- Zero configuration setup: Slurm and munge are pre-installed and fully configured.
- Instant provisioning: Clusters deploy rapidly with minimal setup.
- Automatic role assignment: Runpod automatically designates controller and agent nodes.
- Built-in optimizations: Pre-configured for optimal NCCL performance.
- Full Slurm compatibility: All standard Slurm commands work out-of-the-box.
If you prefer to manually configure your Slurm deployment, see Deploy an Instant Cluster with Slurm (unmanaged) for a step-by-step guide.
Deploy a Slurm Cluster
- Open the Instant Clusters page on the Runpod console.
- Click Create Cluster.
- Select Slurm Cluster from the cluster type dropdown menu.
- Configure your cluster specifications:
- Cluster name: Enter a descriptive name for your cluster.
- Pod count: Choose the number of Pods in your cluster.
- GPU type: Select your preferred GPU type.
- Region: Choose your deployment region.
- Network volume (optional): Add a network volume for persistent/shared storage. If using a network volume, ensure the region matches your cluster region.
- Pod template: Select a Pod template or click Edit Template to customize start commands, environment variables, ports, or container/volume disk capacity.
Slurm Clusters currently support only official Runpod PyTorch images. If you deploy using a different image, the Slurm process will not start.
- Click Deploy Cluster.
Connect to a Slurm Cluster
Once deployment completes, you can access your cluster from the Instant Clusters page. Select a cluster to view its component nodes, including labels indicating the Slurm controller (primary node) and Slurm agents (secondary nodes). Expand a node to view details such as availability, GPU/storage utilization, and options for connection and management. Connect to a node using the Connect button, or using any of the connection methods supported by Pods.
Submit and manage jobs
All standard Slurm commands are available without configuration. For example, you can check cluster status and available resources, run test jobs across nodes, and submit and monitor batch jobs, as shown in the sketch below.
Advanced configuration
While Runpod’s Slurm Clusters work out of the box, you can customize your configuration by connecting to the Slurm controller node using the web terminal or SSH. Access Slurm configuration files in their standard locations:
- /etc/slurm/slurm.conf: Main configuration file.
- /etc/slurm/gres.conf: Generic resource configuration.
Troubleshooting
If you encounter issues with your Slurm Cluster, try the following:
- Jobs stuck in pending state: Check resource availability with sinfo and ensure requested resources are available. If you need more resources, you can add more nodes to your cluster.
- Authentication errors: Munge is pre-configured, but if issues arise, verify the munge service is running on all nodes.