SLURM V2.2
User's Guide
extreme computing
Software
July 2010
BULL CEDOC
357 AVENUE PATTON
B.P.20845
49008 ANGERS CEDEX 01
FRANCE
REFERENCE
86 A2 45FD 01
The following copyright notice protects this book under Copyright laws which prohibit such actions as, but not limited
to, copying, distributing, modifying, and making derivative works.
The information in this document is subject to change without notice. Bull will not be liable for errors
contained herein, or for incidental or consequential damages in connection with the use of this material.
Table of Contents

Preface .......................................................................... ix
Chapter 1. SLURM Overview ........................................................ 1
1.3 SLURM Daemons ................................................................ 2
1.3.1 SLURMCTLD .................................................................. 2
1.3.2 SLURMD ..................................................................... 4
2.2.3 MySQL Configuration ....................................................... 15
2.6.1 Introduction .............................................................. 21
3.7 Logging ..................................................................... 28
3.9 Security .................................................................... 29
5.2 MPI Support ................................................................. 37
5.3 SRUN ........................................................................ 39
5.4 SBATCH (batch) .............................................................. 40
5.6 SATTACH ..................................................................... 42
5.7 SACCTMGR .................................................................... 43
5.8 SBCAST ...................................................................... 44
5.13 STRIGGER ................................................................... 49
5.14 SVIEW ...................................................................... 50
6.2.4 Timers .................................................................... 56
6.2.6 Hard Limits ............................................................... 56
Glossary ........................................................................ 65
Index ........................................................................... 67

List of figures
Figure 1-1
Figure 1-2
Figure 4-1
Figure 5-1

List of tables
Table 1-1
Table 1-2
Table 1-3
Preface
Note
The Bull Support Web site may be consulted for product information, documentation,
downloads, updates and service offers:
http://support.bull.com
Scope and Objectives
A resource manager is used to allocate resources, to find out the status of resources, and to collect task execution information. Bull Extreme Computing platforms use SLURM, an open-source, scalable resource manager.
This guide describes how to configure, manage, and use SLURM.
Intended Readers
This guide is for Administrators and Users of Bull Extreme Computing systems.
Prerequisites
This manual applies to SLURM version 2.2 and later, unless otherwise indicated.
Important
The Software Release Bulletin contains the latest information for your delivery.
This should be read first. Contact your support representative for more
information.
1.1
See
SACCTMGR to view and modify SLURM account information. Used with the slurmdbd daemon.
SACCT to display data for all jobs and job steps in the SLURM accounting log.
SREPORT used to generate reports from the SLURM accounting data when using an accounting database.
The man pages for the commands above for more information.
1.2 SLURM Components
SLURM consists of two types of daemons and various command-line user utilities. The
relationships between these components are illustrated in the following diagram:
1.3 SLURM Daemons
1.3.1 SLURMCTLD
The central control daemon for SLURM is called SLURMCTLD. SLURMCTLD is multi-threaded; thus, some threads can handle problems without delaying services to normal jobs that are also running and need attention. SLURMCTLD runs on a single management node (with a fail-over spare copy elsewhere for safety), reads the SLURM configuration file, and maintains state information on the nodes, partitions and jobs.
The SLURMCTLD daemon in turn consists of three software subsystems, each with a specific
role:
Software Subsystem    Role Description
Node Manager
Partition Manager     Groups nodes into disjoint sets (partitions) and assigns job limits and access controls to each partition. The partition manager also allocates nodes to jobs (at the request of the Job Manager) based on job and partition properties. SCONTROL is the (privileged) user utility that can alter partition properties.
Job Manager
Table 1-1.
The following figure illustrates these roles of the SLURM Software Subsystems.
1.3.2 SLURMD
The SLURMD daemon runs on all the Compute Nodes of each cluster that SLURM manages
and performs the lowest level work of resource management. Like SLURMCTLD (previous
subsection), SLURMD is multi-threaded for efficiency; but, unlike SLURMCTLD, it runs with
root privileges (so it can initiate jobs on behalf of other users).
SLURMD carries out five key tasks and has five corresponding subsystems. These
subsystems are described in the following table.
SLURMD Subsystem       Role Description
Machine Status
Job Status
Remote Execution
Stream Copy Service    Handles all STDERR, STDIN, and STDOUT for remote tasks. This may involve redirection, and it always involves locally buffering job output to avoid blocking local tasks.
Job Control            Propagates signals and job-termination requests to any SLURM-managed processes (often interacting with the Remote Execution subsystem).
Table 1-2.
1.3.3
1.4 Scheduler Types
The system administrator for each machine can configure SLURM to invoke one of several alternative local job schedulers. To determine which scheduler SLURM is currently using on any machine, execute the following command:
scontrol show config |grep SchedulerType
where the returned string will have one of the values described in the following table.
Returned String Value    Description
builtin                  A first-in-first-out (FIFO) scheduler. SLURM executes jobs strictly in the order in which they were submitted (for each resource partition), unless those jobs have different priorities. Even if resources become available to start a specific job, SLURM will wait until there is no previously-submitted job pending (which sometimes confuses impatient job submitters). This is the default.
backfill
wiki
gang
hold                     Hold scheduling holds all newly submitted jobs if the designated hold file exists; otherwise SLURM defaults to the built-in FIFO scheduling described above.
Table 1-3.
1.5
The SLURM configuration file includes a wide variety of parameters. This configuration file
must be available on each node of the cluster.
The slurm.conf file should define at least the configuration parameters as defined in the
examples provided and any additional ones that are required. Any text following a # is
considered a comment. The keywords in the file are not case sensitive, although the
argument usually is (e.g., "SlurmUser=slurm" might be specified as "slurmuser=slurm"). Port
numbers to be used for communications are specified as well as various timer values.
A description of the nodes and their grouping into partitions is required. A simple node range expression may be used to specify a range of nodes and avoid building a configuration file with a large number of entries. The node range expression can contain one pair of square brackets with a sequence of comma-separated numbers and/or ranges of numbers separated by a "-" (e.g. "linux[0-64,128]", or "lx[15,18,32-33]").
Node names can have up to three name specifications: NodeName is the name used by all SLURM tools when referring to the node, NodeAddr is the name or IP address SLURM uses to communicate with the node, and NodeHostname is the name returned by the /bin/hostname -s command. Only NodeName is required (the others default to the same name), although supporting all three parameters provides complete control over the naming and addressing of the nodes.
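For example, a node whose SLURM name differs from its hostname might be declared as follows (the names and the address are hypothetical):

NodeName=lx0001 NodeAddr=10.1.0.1 NodeHostname=node0001 Procs=8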
Nodes can be in more than one partition, with each partition having different constraints
(permitted users, time limits, job size limits, etc.). Each partition can thus be considered a
separate queue. Partition and node specifications use node range expressions to identify
nodes in a concise fashion. An annotated example configuration file for SLURM is provided
with this distribution in /etc/slurm/slurm.conf.example. Edit this configuration file to suit the
needs of the user cluster, and then copy it to /etc/slurm/slurm.conf.
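As an illustration (with hypothetical node and partition names), the same nodes may appear in several partitions, each partition imposing its own limits:

NodeName=lx[0-31] Procs=8 State=UNKNOWN
PartitionName=prod Nodes=lx[0-31] MaxTime=1440 Default=YES State=UP
PartitionName=debug Nodes=lx[0-15] MaxTime=30 State=UP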
Configuration Parameters
Refer to the slurm.conf man page, using the command below, for configuration details,
options, parameter descriptions, and configuration file examples.
Example:
$ man slurm.conf
1.6
Most SCONTROL options and commands can only be used by System Administrators.
Some SCONTROL commands report useful configuration information or manage job
checkpoints, and any user can benefit from invoking them appropriately.
The 'Shared=YES or NO' option at the job level relates to the permission to share a node, whereas the 'Shared=YES, NO or FORCE' option at the partition level relates to the permission to share a specific resource (which means a node, socket, core or even thread).
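As a sketch of how the two levels combine (the partition name is hypothetical): with Shared=YES at the partition level, resources are shared only between jobs that themselves request sharing with srun's --share option.

PartitionName=shared Nodes=lx[0-15] Shared=YES State=UP

$ srun --share -p shared -N2 ./a.out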
NAME
SCONTROL - Used to view and modify SLURM configuration and state.
SYNOPSIS
SCONTROL
[OPTIONS...] [COMMAND...]
DESCRIPTION
SCONTROL is used to view or modify the SLURM configuration including: job, job step,
node, partition, reservation, and overall system configuration. Most of the commands can
only be executed by user root. If an attempt to view or modify configuration information is
made by an unauthorized user, an error message will be printed and the requested action
will not occur. If no command is entered on the execute line, SCONTROL will operate in
an interactive mode and prompt for input. It will continue prompting for input and
executing commands until explicitly terminated. If a command is entered on the execute
line, SCONTROL will execute that command and terminate. All commands and options
are case-insensitive, although node names and partition names are case-sensitive (node
names "LX" and "lx" are distinct). All commands and options can be abbreviated to the
extent that the specification is unique.
OPTIONS
For options, examples and details please refer to the man page.
Example:
$ man scontrol
2.1 Installing SLURM
If not already installed, install SLURM on the Management Node and on the Reference
Nodes as described in the Installation and Configuration Guide related to your software.
2.2
2.2.1
The name of the machine where the SLURM control functions will run. This will be the
Management Node, and will be set as shown in the example below.
ClusterName=<clustername>
ControlMachine=<basename>
ControlAddr=<basename>
2.
3.
4.
Any port numbers, paths for log information and SLURM state information. If they do
not already exist, the path directories must be created on all of the nodes. If these are
not set, all logging information will go to the scontrol log.
SlurmctldPort=6817
SlurmdPort=6818
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log.%h
StateSaveLocation=/var/log/slurm/log_slurmctld
SlurmdSpoolDir=/var/log/slurm/log_slurmd/
5.
6.
PreemptType: Configure to the desired mechanism used to identify which jobs can
preempt other jobs.
preempt/none indicates that jobs will not preempt each other (default).
preempt/partition_prio indicates that jobs from one partition can preempt
jobs from lower priority partitions.
preempt/qos indicates that jobs from one Quality Of Service (QOS) can
preempt jobs from a lower QOS.
Shared: Configure the partition's Shared setting to FORCE for all partitions in
which job preemption is to take place. The FORCE option supports an additional
parameter that controls how many jobs can share a resource
(FORCE[:max_share]). By default the max_share value is 4. In order to preempt
jobs (and not gang schedule them), always set max_share to 1. To allow up to 2
jobs from this partition to be allocated to a common resource (and gang
scheduled), set Shared=FORCE:2.
To enable preemption after making the configuration changes described above, restart
SLURM if it is already running. Any change to the plugin settings in SLURM requires a
full restart of the daemons. If you just change the partition Priority or Shared setting,
this can be updated with scontrol reconfig.
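A minimal sketch of such a preemption setup in slurm.conf (the partition names and node list are illustrative) could be:

PreemptType=preempt/partition_prio
PartitionName=low Nodes=bali[10-37] Priority=10 Shared=FORCE:1 State=UP
PartitionName=high Nodes=bali[10-37] Priority=100 Shared=FORCE:1 State=UP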
7.
Enable the topology plugin parameter according to the characteristics of your systems
network, to allow SLURM to allocate resources to jobs in order to minimize network
contention and optimize execution performance. Different plugins exist for hierarchical
or three-dimensional torus networks. The basic algorithm is to identify the lowest level
switch in the hierarchy that can satisfy a job's request and then allocate resources on
its underlying leaf switches using a best-fit algorithm. Use of this logic requires a
configuration setting of:
TopologyPlugin=topology/tree (for hierarchical networks) or
TopologyPlugin=topology/3d_torus (for 3D torus networks)
TopologyPlugin=topology/none (default)
The description of your system's network topology should be given in a separate file called topology.conf, as presented in section 2.2.4.
8.
Provide accounting requirements. The path directories must be created on all of the
nodes, if they do not already exist. For Job completion:
SLURM can be configured to collect accounting information for every job and job step
executed. Accounting records can be written to a simple text file or a database.
Information is available about both currently executing jobs and jobs that have already
terminated. The sacct command can report resource usage for running or terminated
jobs including individual tasks, which can be useful to detect load imbalance between
the tasks. The sstat command can be used to status only currently running jobs. It also
can give you valuable information about imbalance between tasks. The sreport
command can be used to generate reports based upon all jobs executed in a
particular time interval.
There are three distinct plugin types associated with resource accounting. We recommend the first option, and will give examples for this type of configuration. More information can be found in the official SLURM documentation. Presently job completion is not supported with the SlurmDBD, but can be written directly to a database, script or flat file. If you are running with the accounting storage, you may not need to run this since it contains much of the same information. You may select both options, but much of the information is duplicated. The SLURM configuration parameters (in slurm.conf) associated with these plugins are illustrated in the examples below.
For accounting, we recommend using the mysql database along with the slurmDBD
daemon.
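A minimal sketch of the accounting-related parameters for this recommended setup (the host name and values are illustrative) might be:

AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=<management node>
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30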
9.
Provide the paths to the job credential keys. The keys must be copied to all of the
nodes.
Note: If using MUNGE, these keys are ignored.
JobCredentialPrivateKey=/etc/slurm/private.key
JobCredentialPublicCertificate=/etc/slurm/public.key
10. Provide the cryptographic signature tool to be used when jobs are created. You may
use openssl or munge. Munge is recommended:
CryptoType=crypto/openssl
# default is crypto/munge
Or:
AuthType=auth/munge
CryptoType=crypto/munge
Note
The crypto/munge default setting is recommended by Bull, and requires the munge plugin
to be installed.
See section 2.6 Installing and Configuring Munge for SLURM Authentication (MNGT).
11. Provide Compute Node details. Example :
NodeName=bali[10-37] Procs=8 State=UNKNOWN
12. Provide information about the partitions. MaxTime is the maximum wall-time limit for
any job in minutes. The state of the partition may be UP or DOWN.
PartitionName=global Nodes=bali[10-37] State=UP Default=YES
PartitionName=test Nodes=bali[10-20] State=UP MaxTime=UNLIMITED
PartitionName=debug Nodes=bali[21-30] State=UP
13. To enable Nagios monitoring inside Bull System Manager HPC Edition, the SLURM Event Handler mechanism has to be active. This means that the following line in the slurm.conf file on the Management Node has to be uncommented, or added if it does not appear there.
SlurmEventHandler=/usr/lib/clustmngt/slurm/slurmevent
Note
If the value of the ReturnToService parameter in the slurm.conf is set to 0, then when a
node that is down is re-booted, the Administrator will have to change the state of the node
manually with a command similar to that below, so that the node appears as idle and
available for use:
$ scontrol update NodeName=bass State=idle Reason=test
See
The slurm.conf man page for more information on all the configuration parameters,
including the ReturnToService parameter, and those referred to above.
For the mysql accounting database configuration parameters shown below refer to
https://computing.llnl.gov/linux/slurm/accounting.html
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
CryptoType=crypto/openssl
SlurmctldDebug=5
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=5
SlurmdLogFile=/var/log/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
######## mysql
AccountingStorageEnforce=limits
AccountingStorageLoc=slurm_acct_db
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoragePort=8544
#AccountingStoragePass=slurm
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
#NodeName=linux[1-32] Procs=1 State=UNKNOWN
#PartitionName=debug Nodes=linux[1-32] Default=YES MaxTime=INFINITE State=UP
NodeName=incare[193,194,196-198,200,204,206] Procs=4 State=UNKNOWN
PartitionName=debug Nodes=incare[193,194,196-198,200,204,206] Default=YES MaxTime=INFINITE State=UP
2.2.2
DbdAddr=localhost
DbdHost=localhost
DbdPort=7031
SlurmUser=slurm
MessageTimeout=300
#DebugLevel=4
#DefaultQOS=normal,standby
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
PluginDir=/usr/lib/slurm
#PrivateData=accounts,users,usage,jobs
#TrackWCKey=yes
#
# Database info
StorageType=accounting_storage/mysql
StorageHost=localhost
StoragePort=1234
StoragePassword=password
StorageUser=slurm
StorageLoc=slurm_acct_db
2.2.3 MySQL Configuration
While SLURM will create the database automatically, you need to make sure that the StorageUser is given permission in MySQL to do so. As the mysql user (you need to be root), grant privileges to that user using a command such as:
GRANT ALL ON StorageLoc.* TO 'StorageUser'@'StorageHost';
Example:
with a default password:
[mysql@machu ~]$ mysql
mysql> grant all on slurm_acct_db.* TO 'slurm'@'localhost';
2.2.4
In order to configure your system's network topology, you will need to activate the necessary plugin in the slurm.conf file, as explained in section 2.2.1 step 7, and provide a topology.conf file. This is an ASCII file which describes the cluster's network topology for optimized job resource allocation. The file location can be modified at system build time using the DEFAULT_SLURM_CONF parameter. Otherwise, the file will always be located in the same directory as the slurm.conf file. Please use the man page to get more complete information. An example of this file is shown below.
Example
$ man topology.conf
topology.conf Example
##################################################################
# SLURMs network topology configuration file for use with the
# topology/tree plugin
##################################################################
SwitchName=s0 Nodes=dev[0-5]
SwitchName=s1 Nodes=dev[6-11]
SwitchName=s2 Nodes=dev[12-17]
SwitchName=s3 Switches=s[0-2]
2.2.5
2.2.6
The files and directories used by SLURMCTLD must be readable or writable by the user
SlurmUser (the SLURM configuration files must be readable; the log file directory and state
save directory must be writable).
Create a SlurmUser
The SlurmUser must be created before SLURM is started. The SlurmUser will be referenced by the slurmctld daemon. Create a SlurmUser on the Compute, Login/IO or Login Reference nodes with the same uid and gid (106, for instance):
groupadd -g 106 slurm
useradd -u 106 -g slurm slurm
mkdir -p /var/log/slurm
chmod 755 /var/log/slurm
The gid and uid numbers do not have to match the ones indicated above, but they have to be the same on all nodes in the cluster.
The user name in the example above is slurm; another name can be used, but it has to be the same on all nodes in the cluster.
Configure the SLURM job credential keys as root
Unique job credential keys for each job should be created using the openssl program.
These keys are used by the slurmctld daemon to construct a job credential, which is sent to
the srun command and then forwarded to slurmd to initiate job steps.
Important
When you are within the directory where the keys will reside, run the commands below:
cd /etc/slurm
openssl genrsa -out private.key 1024
openssl rsa -in private.key -pubout -out public.key
The private.key file must be readable by SlurmUser only. If this is not the case, then use the commands below to change the setting.
chown slurm.slurm /etc/slurm/private.key
chmod 600 /etc/slurm/private.key
The public.key file must be readable by all users. If this is not the case, then use the commands below to change the setting.
chown slurm.slurm /etc/slurm/public.key
chmod 644 /etc/slurm/public.key
2.3
2.3.1
slurm.conf man page for more information on the parameters of the slurm.conf file,
and slurm_setup.sh man page for information on the SLURM setup script.
The slurm.conf file must have been created on the Management Node and all the
necessary parameters defined BEFORE the script is used to propagate the information
to the Reference Nodes.
The use of the script requires root access, and depends on the use of the ssh, pdcp
and pdsh tools.
The SLURM setup script is found in /etc/slurm/slurm_setup.sh and is used to automate and
customize the installation process. The script reads the slurm.conf file created previously
and does the following:
1.
Creates the SlurmUser, using the SlurmUID, SlurmGroup, SlurmGID, and SlurmHome
optional parameter settings in the slurm.conf file to customize the user and group. It
also propagates the identical Slurm User and Group settings to the reference nodes.
2.
Validates the pathnames for log files, accounting files, scripts, and credential files. It
then creates the appropriate directories and files, and sets the permissions. For user
supplied scripts, it validates the path and warns if the files do not exist. The directories
and files are replicated on both the Management Node and reference nodes.
3.
Creates the job credential validation private and public keys on the Management and
reference nodes.
4.
5.
Copies the slurm.conf file from the Management Node to the reference nodes.
-v
-u
-f, -F
-d
Note
2.3.2
Skip the next section, which describes how to complete the configuration of SLURM
manually, if the slurm_setup.sh script has been used successfully.
Create a SlurmUser
The SlurmUser must be created before SLURM is started. SlurmUser will be referenced by the slurmctld daemon. Create a SlurmUser on the Compute, Login/IO or Login Reference nodes with the same uid and gid (106, for instance):
groupadd -g 106 slurm
useradd -u 106 -g slurm slurm
mkdir -p /var/log/slurm
chmod 755 /var/log/slurm
The gid and uid numbers do not have to match the ones indicated above, but they have to be the same on all the nodes in the cluster.
The user name in the example above is slurm; another name can be used, but it has to be the same on all the nodes in the cluster.
2.
/etc/slurm/slurm.conf
Note
The public key must be on the KSIS image deployed to ALL the Compute Nodes, otherwise SLURM will not start.
3.
4.
2.3.3
or
or
/etc/init.d/slurm startclean
Note
2.4
or
The startclean argument will start the daemon on that node without preserving saved state
information (all previously running jobs will be purged and the node state will be restored
to the values specified in the configuration file).
2.
3.
Verify that the daemons have started by running the scontrol command again.
scontrol show node --all
4.
If you are using the mysql slurm accounting, check to make sure the slurmdbd has
been started.
2.5
account    required      /lib/security/pam_slurm.so
If it is necessary to always allow access for an administrative group (for example wheel),
stack the pam_access module ahead of pam_slurm:
account    sufficient    /lib/security/pam_access.so
account    required      /lib/security/pam_slurm.so
When access is denied because the user does not have an active job running on the node,
an error message is returned to the application:
Access denied: user foo (uid=1313) has no active jobs.
This message can be suppressed by specifying the no_warn argument in the PAM
configuration file.
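For example, the pam_slurm entry might then read:

account    required      /lib/security/pam_slurm.so no_warn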
2.6
2.6.1 Introduction
This software component is required if the authentication method for the communication between the SLURM components is munge (where AuthType=auth/munge). On most platforms, the munged daemon does not require root privileges. If possible, the daemon should be run as a non-privileged user. This can be controlled by the init script, as detailed in the 2.6.3 Starting the Daemon section below.
See
/etc/munge/
This directory contains the daemon's secret key. The recommended permissions for it
are 0700.
/var/lib/munge/
This directory contains the daemon's PRNG seed file. It is also where the daemon creates pipes for authenticating clients via file-descriptor-passing. If the file-descriptor-passing authentication method is being used, this directory must allow execute permissions for all; however, it must not expose read permissions. The recommended permissions for it are 0711.
/var/log/munge/
This directory contains the daemon's log file. The recommended permissions for it are
0700.
/var/run/munge/
This directory contains the Unix domain socket for clients to communicate with the
daemon. It also contains the daemon's PID file. This directory must allow execute
permissions for all. The recommended permissions for it are 0755.
These directories must be owned by the user that the munged daemon will run as. They
cannot allow write permissions for group or other (unless the sticky-bit is set). In addition,
all of their parent directories in the path up to the root directory must be owned by either
root or the user that the munged daemon will run as. None of them can allow write
permissions for group or other (unless the sticky-bit is set).
2.6.2
or
$ dd if=/dev/urandom bs=1 count=1024 >/etc/munge/munge.key
This file must be given 0400 permissions and owned by the munged daemon user.
2.6.3
2.6.4
2.
3.
4.
5.
Also, check the log file (/var/log/munge/munged.log) or try running the daemon in
the foreground:
/usr/sbin/munged --foreground
3.2
The SLURMD daemon executes on all Compute Nodes. It resembles a remote shell daemon which exports control to SLURM. Because SLURMD initiates and manages user jobs, it must execute as the user root.
The SLURMDBD daemon executes on the Management Node. It is used to write data to the SLURM accounting database.
window can be used to execute commands such as srun -N1 /bin/hostname, to confirm functionality.
Another important option for the daemons is -c, to clear the previous state information. Without the -c option, the daemons will restore any previously saved state information: node state, job state, etc. With the -c option, all previously running jobs will be purged and the node state will be restored to the values specified in the configuration file. This means that a node configured down manually using the SCONTROL command will be returned to service unless it is also noted as being down in the configuration file. In practice, SLURM is normally restarted with state preservation.
The /etc/init.d/slurm and slurmdbd scripts can be used to start, startclean or stop the
daemons for the node on which it is being executed.
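For example, assuming the standard init scripts are installed:

/etc/init.d/slurm startclean
/etc/init.d/slurmdbd start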
3.3
DESCRIPTION
SLURMCTLD is the central management daemon of SLURM. It monitors all other SLURM
daemons and resources, accepts work (jobs), and allocates resources to those jobs. Given
the critical functionality of SLURMCTLD, there may be a backup server to assume these
functions in the event that the primary server fails.
OPTIONS
-c
Clear all previous SLURMCTLD states from its last checkpoint. If not specified,
previously running jobs will be preserved along with the state of DOWN,
DRAINED and DRAINING nodes and the associated reason field for those
nodes.
-D
-f <file>
Read configuration from the specified file. See NOTE under ENVIRONMENT
VARIABLES below.
-h
-L <file>
-v
Verbose operation. Using more than one v (e.g., -vv, -vvv, -vvvv, etc.)
increases verbosity.
-V
ENVIRONMENT VARIABLES
The following environment variables can be used to override settings compiled into
SLURMCTLD.
SLURM_CONF
The location of the SLURM configuration file. This is overridden by explicitly naming a
configuration file in the command line.
Note
3.4
DESCRIPTION
SLURMD is the Compute Node daemon of SLURM. It monitors all tasks running on the
compute node, accepts work (tasks), launches tasks, and kills running tasks upon request.
OPTIONS
-c
-D
Run SLURMD in the foreground. Error and debug messages will be copied to
stderr.
-M
Lock SLURMD pages into system memory using mlockall to disable paging of
the SLURMD process. This may help in cases where nodes are marked
DOWN during periods of heavy swap activity. If the mlockall system call is
not available, an error will be printed to the log and SLURMD will continue as
normal.
-h
-f <file>
-L <file>
-v
Verbose operation. Using more than one v (e.g., -vv, -vvv, -vvvv, etc.)
increases verbosity.
-V
ENVIRONMENT VARIABLES
The following environment variables can be used to override settings compiled into
SLURMD.
SLURM_CONF
The location of the SLURM configuration file. This is overridden by explicitly naming a
configuration file on the command line.
Note
3.5
3.6
-D
-h
-v
Verbose operation. Using more than one v (e.g., -vv, -vvv, -vvvv, etc.)
increases verbosity.
-V
Node Selection
The node selection mechanism used by SLURM is controlled by the SelectType
configuration parameter. If you want to execute multiple jobs per node, but apportion the
processors, memory and other resources, the cons_res (consumable resources) plug-in is
recommended. If you tend to dedicate entire nodes to jobs, the linear plug-in is
recommended.
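For instance, a consumable resources configuration might contain lines such as the following in slurm.conf (the parameter values are illustrative):

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory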
3.7 Logging
SLURM uses the syslog function to record events. It uses a range of importance levels for
these messages. Be certain that your system's syslog functionality is operational.
3.8 Corefile Format
SLURM is designed to support generating a variety of core file formats for application
codes that fail (see the --core option of the srun command).
3.9 Security
Unique job credential keys for each site should be created using the openssl program. openssl must be used (not ssh-keygen) to construct these keys. An example of how to do this is shown below.
Specify file names that match the values of JobCredentialPrivateKey and
JobCredentialPublicCertificate in the configuration file. The JobCredentialPrivateKey file
must be readable only by SlurmUser. The JobCredentialPublicCertificate file must be
readable by all users. Both files must be available on all nodes in the cluster. These keys
are used by slurmctld to construct a job credential, which is sent to srun and then
forwarded to slurmd to initiate job steps.
> openssl genrsa -out /path/to/private/key 1024
> openssl rsa -in /path/to/private/key -pubout -out /path/to/public/key
3.10
Print the detailed state of job 477 and change its priority to zero. A priority of zero prevents a job from being initiated (it is held in "pending" state).
adev0: scontrol
scontrol: show job 477
scontrol: update JobId=477 Priority=0
Print the state of node adev13 and drain it. To drain a node, specify a new state of DRAIN, DRAINED, or DRAINING. SLURM will automatically set it to the appropriate value of either DRAINING or DRAINED depending on whether the node is allocated or not. Return it to service later.
adev0: scontrol
scontrol: show node adev13
scontrol: update NodeName=adev13 State=DRAIN Reason="maintenance"
scontrol: update NodeName=adev13 State=IDLE
Reconfigure all SLURM daemons on all nodes. This should be done after changing the
SLURM configuration file.
adev0: scontrol reconfig
Print the current SLURM configuration. This also reports if the primary and secondary controllers (slurmctld daemons) are responding. To just see the state of the controllers, use the scontrol ping command.
adev0: scontrol show config
ControlAddr = bones
ControlMachine = bones
CryptoType = crypto/munge
DebugFlags = (null)
DefMemPerCPU = UNLIMITED
DisableRootJobs = NO
EnforcePartLimits = NO
Epilog = (null)
EpilogMsgTime = 2000 usec
EpilogSlurmctld = (null)
FastSchedule = 0
FirstJobId = 1
GetEnvTimeout = 2 sec
HealthCheckInterval = 0 sec
HealthCheckProgram = (null)
InactiveLimit = 0 sec
JobAcctGatherFrequency = 30 sec
JobAcctGatherType = jobacct_gather/linux
JobCheckpointDir = /var/slurm/checkpoint
JobCompHost = localhost
JobCompLoc = slurm_acct_db
JobCompPass = (null)
JobCompPort = 8544
JobCompType = jobcomp/mysql
JobCompUser = slurm
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobFileAppend = 0
JobRequeue = 1
KillOnBadExit = 0
KillWait = 30 sec
Licenses = (null)
MailProg = /bin/mail
MaxJobCount = 5000
MaxMemPerCPU = UNLIMITED
MaxTasksPerNode = 128
MessageTimeout = 10 sec
MinJobAge = 300 sec
MpiDefault = none
MpiParams = (null)
NEXT_JOB_ID = 2015
OverTimeLimit = 0 min
PluginDir = /usr/lib64/slurm
PlugStackConfig = /etc/slurm/plugstack.conf
PreemptMode = OFF
PreemptType = preempt/none
PriorityType = priority/basic
PrivateData = none
ProctrackType = proctrack/linuxproc
Prolog = (null)
PrologSlurmctld = (null)
PropagatePrioProcess = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
ResumeProgram = (null)
ResumeRate = 300 nodes/min
ResumeTimeout = 60 sec
ResvOverRun = 0 min
ReturnToService = 0
SallocDefaultCommand = (null)
SchedulerParameters = (null)
SchedulerPort = 7321
SchedulerRootFilter = 1
SchedulerTimeSlice = 30 sec
SchedulerType = sched/builtin
SelectType = select/cons_res
SelectTypeParameters = CR_CORE
SlurmUser = slurm(200)
SlurmctldDebug = 9
SlurmctldLogFile = /var/slurm/logs/slurmctld.log
SlurmSchedLogFile = (null)
SlurmSchedLogLevel = 0
SlurmctldPidFile = /var/slurm/logs/slurmctld.pid
SlurmctldPort = 6817
SlurmctldTimeout = 120 sec
SlurmdDebug = 3
SlurmdLogFile = /var/slurm/logs/slurmd.log.%h
SlurmdPidFile = /var/slurm/logs/slurmd.pid
SlurmEventHandler = /opt/slurm/event_handler.sh
SlurmEventHandlerLogfile = /var/slurm/ev-logfile
SlurmEventHandlerPollInterval = 10
SlurmdPort = 6818
SlurmdSpoolDir = /var/slurm/slurmd.spool
SlurmdTimeout = 300 sec
SlurmdUser = root(0)
SLURM_CONF = /etc/slurm/slurm.conf
SLURM_VERSION = 2.1.0
SrunEpilog = (null)
SrunProlog = (null)
StateSaveLocation = /var/slurm/slurm.state
SuspendExcNodes = (null)
SuspendExcParts = (null)
SuspendProgram = (null)
SuspendRate = 60 nodes/min
SuspendTime = NONE
SuspendTimeout = 30 sec
SwitchType = switch/none
TaskEpilog = (null)
TaskPlugin = task/affinity
TaskPluginParam = (null type)
TaskProlog = (null)
TmpFS = /tmp
TopologyPlugin = topology/none
TrackWCKey = 0
TreeWidth = 50
UsePam = 0
UnkillableStepProgram = (null)
UnkillableStepTimeout = 60 sec
WaitTime = 0 sec
Slurmctld(primary/backup) at bones/(NULL) are UP/DOWN
4.1.1
Create the gfs2 SLURM file system using the HA_MGMT:SLURM label
For example:
mkfs.gfs2 -j 2 -t HA_MGMT:SLURM /dev/sdm
2.
3.
4.
BackupAddr=mngt1
SlurmUser=slurm
StateSaveLocation=/gfs2shared/slurmdata
5.
If necessary, modify the slurm.conf file according to your system and copy the
modified /etc/slurm/slurm.conf on to all Compute Nodes and Management Nodes.
4.1.2
Figure 5-2. SLURM High Availability using Database Accounting
4.2
2.
Manually mount the shared file system for SLURM on both Management Nodes:
pdsh -w mngt0,mngt1 "mount LABEL=HA_MGMT:SLURM /gfs2shared/slurmdata"
3.
4.
5.
Note
For users of database accounting, slurmdbd will be started when the node boots, with no extra steps needed.
4.3
2.
Wait a few seconds and check that HA Cluster Suite is available by using the clustat
command.
3.
Mount the SLURM shared storage on the Primary Management Node, by running the
command below:
mount LABEL=HA_MGMT:SLURM /gfs2shared/slurmdata
4.
5.
Wait a minute and check that SLURM is running on both Management Nodes.
SACCTMGR to view and modify SLURM account information. Used with the slurmdbd daemon.
SACCT to display data for all jobs and job steps in the SLURM accounting log.
Global Accounting API for merging the data from a LSF accounting file and the SLURM accounting file into a single record.
Important
SLURM does not work with PBS Professional Resource Manager and should only be installed on clusters which do not use PBS PRO.
Note
There is only a general explanation of each command in the following sections. For complete and detailed information, please refer to the man pages. For example, man srun.
5.2 MPI Support
The PMI (Process Management Interface) is provided by Bullx MPI to launch processes on a
cluster and provide services to the MPI interface. For example, a call to pmi_get_appnum
returns the job id. This interface uses sockets to exchange messages.
In Bullx MPI, this mechanism uses the MPD daemons running on each Compute Node.
Daemons can exchange information and answer the PMI calls.
SLURM replaces the Process Management Interface with its own implementation and its
own daemons. No MPD is needed and when a PMI request is sent (for example
pmi_get_appnum), a SLURM extension must answer this request.
The following diagrams show the difference between the use of PMI with and without a
resource manager that allows process management.
Figure 5-1. MPI Process Management With and Without Resource Manager
Bullx MPI jobs can be launched directly by the srun command. SLURM's none MPI plug-in
must be used to establish communications between the launched tasks. This can be
accomplished either using the SLURM configuration parameter MpiDefault=none in
slurm.conf or srun's --mpi=none option. The program must also be linked with SLURM's
implementation of the PMI library so that tasks can communicate host and port information
at startup. (The system administrator can add this option to the mpicc and mpif77
commands directly, so the user will not need to bother). Do not use SLURM's MVAPICH
plug-in for Bullx MPI.
$ mpicc -L<path_to_slurm_lib> -lpmi ...
$ srun -n20 --mpi=none a.out
Notes
Some Bullx MPI functions are not currently supported by the PMI library integrated with SLURM.
Set the environment variable PMI_DEBUG to a numeric value of 1 or higher for the PMI library to print debugging information.
5.3 SRUN
SRUN submits jobs to run under SLURM management. SRUN can submit an interactive job
and then persist to shepherd the job as it runs. SLURM associates every set of parallel tasks
("job steps") with the SRUN instance that initiated that set.
SRUN options allow the user to both:
Specify the parallel environment for job(s), such as the number of nodes used, node
partition, distribution of processes among nodes, and total time.
Control the behavior of a parallel job as it runs, such as redirecting or labeling its
output or specifying its reporting verbosity.
NAME
srun - run parallel jobs
SYNOPSIS
srun [OPTIONS] executable [args...]
DESCRIPTION
Run a parallel job on a cluster managed by SLURM. If necessary, srun will first create a resource allocation in which to run the parallel job.
OPTIONS
Please refer to the man page for more details on the options, including examples of use.
Example
$ man srun
5.4 SBATCH (batch)
NAME
SBATCH - Submit a batch script to SLURM.
SYNOPSIS
sbatch [OPTIONS] SCRIPT [ARGS]
DESCRIPTION
sbatch submits a batch script to SLURM. The batch script may be given to sbatch by its file name on the command line. If no file name is specified, sbatch will read in a script from standard input. The batch script may contain options preceded with #SBATCH before any executable commands in the script.
sbatch exits immediately after the script has been successfully transferred to the SLURM
controller and assigned a SLURM job ID. The batch script may not be granted resources
immediately, and may sit in the queue of pending jobs for some time, before the required
resources become available.
When the batch script is granted the resources for its job allocation, SLURM will run a
single copy of the batch script on the first node in the set of allocated nodes.
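As an illustration (the partition name and program are hypothetical), a simple batch script submitted with sbatch myjob.sh might look like:

#!/bin/bash
#SBATCH -N 2
#SBATCH -p global
#SBATCH -t 30
srun ./a.out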
OPTIONS
Please refer to the man page for more details on the options, including examples of use.
Example
$ man sbatch
5.5 SALLOC (allocation)
NAME
SALLOC - Obtain a SLURM job allocation (a set of nodes), execute a command, and then
release the allocation when the command is finished.
SYNOPSIS
salloc [OPTIONS] [<command> [command_args]]
DESCRIPTION
salloc is used to define a SLURM job allocation, which is a set of resources (nodes),
possibly with some constraints (e.g. number of processors per node). When salloc obtains
the requested allocation, it will then run the command specified by the user. Finally, when
the user specified command is complete, salloc relinquishes the job allocation.
The command may be any program the user wishes. Some typical commands are xterm, a
shell script containing srun commands, and srun.
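For instance (the executable is hypothetical), the following allocates two nodes and runs a parallel program inside that allocation:

$ salloc -N 2 srun ./a.out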
OPTIONS
Please refer to the man page for more details on the options, including examples of use.
Example
$ man salloc
5.6 SATTACH
NAME
sattach - Attach to a SLURM job step.
SYNOPSIS
sattach [OPTIONS] <jobid.stepid>
DESCRIPTION
sattach attaches to a running SLURM job step. By attaching, it makes available the I/O
streams for all the tasks of a running SLURM job step. It is also suitable for use with a
parallel debugger like TotalView.
OPTIONS
Please refer to the man page for more details on the options, including examples of use.
Example
$ man sattach
5.7 SACCTMGR
NAME
sacctmgr - Used to view and modify SLURM account information.
SYNOPSIS
sacctmgr [OPTIONS] [COMMAND]
DESCRIPTION
sacctmgr is used to view or modify SLURM account information. The account information is
maintained within a database with the interface being provided by slurmdbd (SLURM
Database daemon). This database serves as a central storehouse of user and computer
information for multiple computers at a single site. SLURM account information is recorded
based upon four parameters that form what is referred to as an association.
These parameters are user, cluster, partition, and account:
user is the login name.
cluster is the name of a SLURM managed cluster as specified by the ClusterName
parameter in the slurm.conf configuration file.
partition is the name of a SLURM partition on that cluster.
account is the bank account for a job.
The intended mode of operation is to initiate the sacctmgr command, add, delete, modify,
and/or list association records then commit the changes and exit.
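A typical session (the cluster, account and user names are hypothetical) might therefore look like:

$ sacctmgr add cluster bali
$ sacctmgr add account physics Cluster=bali
$ sacctmgr add user bob Account=physics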
OPTIONS
Please refer to the man page for more details on the options, including examples of use.
Example
$ man sacctmgr
5.8 SBCAST
sbcast is used to copy a file to local disk on all nodes allocated to a job. This should be
executed after a resource allocation has taken place and can be faster than using a single
file system mounted on multiple nodes.
NAME
sbcast - transmit a file to the nodes allocated to a SLURM job.
SYNOPSIS
sbcast [-CfpsvV] SOURCE DEST
DESCRIPTION
sbcast is used to transmit a file to all nodes allocated to the SLURM job which is currently active. This command should only be executed within a SLURM batch job or within the shell spawned after the resources have been allocated to a SLURM job. SOURCE is the name of the file on the current node. DEST should be the fully qualified pathname for the file copy to be created on each node. DEST should be on the local file system for these nodes.
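For example (the program name is hypothetical), within a batch script the executable can be copied to local disk on the allocated nodes before being launched:

#!/bin/bash
#SBATCH -N 4
sbcast my_program /tmp/my_program
srun /tmp/my_program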
Note
5.9 SQUEUE
[OPTIONS...]
DESCRIPTION
SQUEUE is used to view job and job step information for jobs managed by SLURM.
OPTIONS
Please refer to the man page for more details on the options, including examples of use.
Example
$ man squeue
5.10 SINFO
DESCRIPTION
SINFO is used to view partition and node information for a system running SLURM.
OPTIONS
Please refer to the man page for more details on the options, including examples of use.
Example
$ man sinfo
5.11 SCANCEL
DESCRIPTION
SCANCEL is used to signal or cancel jobs or job steps. An arbitrary number of jobs or job
steps may be signaled using job specification filters or a space-separated list of specific job
and/or job step IDs. A job or job step can only be signaled by the owner of that job or
user root. If an attempt is made by an unauthorized user to signal a job or job step, an
error message will be printed and the job will not be signaled.
OPTIONS
Please refer to the man page for more details on the options, including examples of use.
Example
$ man scancel
5.12 SACCT
DESCRIPTION
Accounting information for jobs invoked with SLURM is logged in the job accounting log
file.
The SACCT command displays job accounting data stored in the job accounting log file in a variety of forms for your analysis. The SACCT command displays information about jobs, job steps, status, and exit codes by default. The output can be tailored with the use of the --fields= option to specify the fields to be shown.
For the root user, the SACCT command displays job accounting data for all users, although
there are options to filter the output to report only the jobs from a specified user or group.
For the non-root user, the SACCT command limits the display of job accounting data to
jobs that were launched with their own user identifier (UID) by default. Data for other users
can be displayed with the --all, --user, or --uid options.
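For example (the job ID and user name are hypothetical):

$ sacct -j 1234
$ sacct --user=bob --starttime=2010-07-01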
Note
Much of the data reported by SACCT has been generated by the wait3() and getrusage()
system calls. Some systems gather and report incomplete information for these calls;
SACCT reports values of 0 for this missing data. See the getrusage man page for your
system to obtain information about which data are actually available on your system.
OPTIONS
Please refer to the man page for more details on the options, including examples of use.
Example
$ man sacct
5.13 STRIGGER
NAME
strigger - Used to set, get or clear SLURM trigger information.
SYNOPSIS
strigger --set [OPTIONS...]
strigger --get [OPTIONS...]
strigger --clear [OPTIONS...]
DESCRIPTION
strigger is used to set, get or clear SLURM trigger information. Triggers include events such
as a node failing, a job reaching its time limit or a job terminating.
These events can cause actions such as the execution of an arbitrary script. Typical uses
include notifying system administrators regarding node failures and terminating a job when
its time limit is approaching.
Trigger events are not processed instantly; instead, a check is performed for trigger events on a periodic basis (currently every 15 seconds). Any trigger event which occurs within that interval will be compared against the trigger programs set at the end of the time interval. The trigger program will be executed once for any event occurring in that interval, with a hostlist expression for the nodelist or job ID as an argument to the program. The record of those events (e.g. nodes which went DOWN in the previous 15 seconds) will then be cleared. The trigger program must set a new trigger before the end of the next interval to ensure that no trigger events are missed. If desired, multiple trigger programs can be set for the same event.
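For example (the script path is hypothetical), a trigger that runs a notification script whenever a node goes DOWN could be registered with:

$ strigger --set --node --down --program=/usr/local/sbin/slurm_admin_notify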
Important
This command can only set triggers if run by the user SlurmUser unless SlurmUser
is configured as root user. This is required for the slurmctld daemon to set the
appropriate user and group IDs for the executed program. Also note that the
program is executed on the same node that the slurmctld daemon uses rather
than on an allocated Compute Node. To check the value of SlurmUser, run the
command:
scontrol show config | grep SlurmUser
OPTIONS
Please refer to the man page for more details on the options, including examples of use.
Example
$ man strigger
5.14 SVIEW
NAME
sview - Graphical user interface to view and modify SLURM state.
Note
See
5.15 Global Accounting API
The Global Accounting API only applies to clusters which use SLURM and the Load Sharing
Facility (LSF) batch manager from Platform Computing together.
Both the LSF and SLURM products can produce an accounting file. The Global Accounting
API offers the capability of merging the data from these two accounting files and presenting
it as a single record to the program using this API.
Perform the following steps to call the Global Accounting API:
1.
After SLURM has been installed (assumes /usr folder), build the Global Accounting API
library by going to the /usr/lib/slurm/bullacct folder and executing the following
command:
make -f makefile-lib
This will build the library libcombine_acct.a. This makefile-lib assumes that the SLURM
product is installed in the /usr folder, and LSF is installed in /app/slurm/lsf/6.2. If this
is not the case, the SLURM_BASE and LSF_BASE variables in the makefile-lib file must
be modified to point to the correct location.
2.
3.
#include "/usr/lib/slurm/bullacct/combine_acct.h"

// define file pointers for the LSF and SLURM log files
FILE *lsb_acct_fg = NULL;    // file pointer for LSF accounting log file
FILE *slurm_acct_fg = NULL;  // file pointer for Slurm log file
int status, jobId;
struct CombineAcct newAcct;  // define variable for the new records

// call cacct_init routine to open lsf and slurm log file,
// and initialize the newAcct structure
status = cacct_init(&lsb_acct_fg, &slurm_acct_fg, &newAcct);

// get_combine_acct_info will use the input LSF job ID to locate the LSF accounting
// information in the LSF log file, then get the SLURM_JOBID and locate the
// SLURM accounting information in the SLURM log file.
// This routine will return a zero to indicate that both records are found.
jobId = 2010;
status = get_combine_acct_info(lsb_acct_fg, slurm_acct_fg, jobId, &newAcct);

// display the combined record
display_combine_acct_record(&newAcct);

// when finished accessing the record, the user must close the log files and
// free the memory used in the newAcct variable by calling the cacct_wrapup routine.
// For example:
if (lsb_acct_fg != NULL)     // if opened successfully before
    cacct_wrapup(&lsb_acct_fg, &slurm_acct_fg, &newAcct);

// when done, do the following to free the memory used by the otherAcct variable:
free_cacct_ptrs(&otherAcct);
The CombineAcct structure contains the following fields (the first part holds the LSF accounting information):

evenType[50];
versionNumber[50];
eventTime;
jobId;
userId;
options;
numProcessors;
submitTime;
beginTime;
termTime;
startTime;
userName[MAX_LSB_NAME_LEN];
queue[MAX_LSB_NAME_LEN];
*resReq;
*dependCond;
*preExecCmd;                          /* the command string to be pre_executed */
fromHost[MAXHOSTNAMELEN];
cwd[MAXFILENAMELEN];
inFile[MAXFILENAMELEN];
outFile[MAXFILENAMELEN];
errFile[MAXFILENAMELEN];
jobFile[MAXFILENAMELEN];
int    numAskedHosts;
char   **askedHosts;
int    numExecHosts;
char   **execHosts;
int    jStatus;                       /* job status */
double hostFactor;
char   jobName[MAXLINELEN];
char   command[MAXLINELEN];
struct lsfRusage LSFrusage;
char   *mailUser;
char   *projectName;
int    exitStatus;
int    maxNumProcessors;
char   *loginShell;                   /* login shell specified by user */
char   *timeEvent;
int    idx;                           /* array idx, must be 0 in JOB_NEW */
int    maxRMem;
int    maxRswap;
char   inFileSpool[MAXFILENAMELEN];   /* spool input file */
char   commandSpool[MAXFILENAMELEN];  /* spool command file */
char   *rsvId;
char   *sla;                          /* The service class under which the job runs. */
int    exceptMask;
char   *additionalInfo;
int    exitInfo;
char   *warningAction;                /* warning action, SIGNAL | CHKPNT | command, NULL if unspecified */
int    warningTimePeriod;             /* warning time period in seconds, -1 if unspecified */
char   *chargedSAAP;
char   *licenseProject;               /* License Project */
int    slurmJobId;                    /* job id from slurm */
/* part two is the SLURM info minus the duplicated information from LSF */
long   priority;                      /* priority */
char   partition[64];                 /* partition node */
int    gid;                           /* group ID */
int    blockId;                       /* Block ID */
int    numTasks;                      /* nproc */
double aveVsize;                      /* ave vsize */
int    maxRss;                        /* max rss */
int    maxRssTaskId;                  /* max rss task */
double aveRss;                        /* ave rss */
int    maxPages;                      /* max pages */
int    maxpagestaskId;                /* max pages task */
double avePages;                      /* ave pages */
int    minCpu;                        /* min cpu */
int    minCpuTaskId;                  /* min cpu task */
char   stepName[NAME_SIZE];           /* step process name */
char   stepNodes[STEP_NODE_BUF_SIZE]; /* step node list */
int    maxVsizeNode;                  /* max vsize node */
int    maxRssNodeId;                  /* max rss node */
int    maxPagesNodeId;                /* max pages node */
int    minCpuTimeNodeId;              /* min cpu node */
char   *account;                      /* account number */
};
6.2
6.2.1
6.2.2
6.2.3 Node Configuration
SLURM can track the amount of memory and disk space available for each Compute Node
and use it for scheduling purposes; however this will entail an extra overhead. Optimize
performance by specifying the expected configuration using the parameters that are
available (RealMemory, Procs, and TmpDisk). If the node is found to have fewer resources
than the configured amounts, it will be marked as DOWN and not be used. Also, the
FastSchedule parameter should be set.
While SLURM can easily handle a heterogeneous cluster, configuring the nodes using the
minimal number of lines in the slurm.conf file will make administration easier and result in
better performance.
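A sketch of such a node description, together with FastSchedule (the values are illustrative), might be:

FastSchedule=1
NodeName=bali[10-37] Procs=8 RealMemory=24000 TmpDisk=100000 State=UNKNOWN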
6.2.4 Timers
The configuration parameter SlurmdTimeout determines the interval at which slurmctld routinely communicates with slurmd. Communications occur at half the SlurmdTimeout value. If a Compute Node fails, the time of failure is identified and jobs are no longer allocated to it. Longer intervals decrease system noise on Compute Nodes (these requests are synchronized across the cluster, but there will be some impact on applications). For large clusters, SlurmdTimeout values of 120 seconds or more are reasonable.
6.2.5 TreeWidth parameter
SLURM uses hierarchical communications between the slurmd daemons in order to
increase parallelism and improve performance. The TreeWidth configuration parameter
controls the fanout of messages. The default value is 50, meaning each slurmd daemon can
communicate with up to 50 other slurmd daemons and up to 2500 nodes can be contacted
with two message hops. The default value will work well for most clusters. Optimal system
performance can usually be achieved if TreeWidth is set to the square root of the number
of nodes in the cluster for systems having no more than 2500 nodes, or the cube root for
larger systems.
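For example, for a cluster of roughly 1024 nodes the square-root rule suggests a value of about 32:

TreeWidth=32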
6.2.6 Hard Limits
The srun command automatically increases its open file limit to the hard limit in order to
process all the standard input and output connections to the launched tasks. It is
recommended that you set the open file hard limit to 8192 across the cluster.
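Assuming the standard /etc/security/limits.conf mechanism is used on the nodes, this could be done with entries such as:

*    hard    nofile    8192
*    soft    nofile    8192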
6.3
cpufreq governor that can change CPU frequency and voltage (note that the cpufreq driver must be enabled in the Linux kernel configuration). Of particular note is the fact that SLURM can power nodes up or down at a configurable rate, to prevent rapid changes in power demands. For example, without SLURM's support for increasing power demands in a gradual fashion, starting a 1000-node job on an idle cluster could result in an instantaneous surge, of the order of multiple megawatts, in the power demand.
6.3.1
SuspendTime: Nodes become eligible for power saving mode after being idle for this number of seconds. The configured value should not exceed the time to suspend and resume a node. A negative number disables power saving mode. The default value is -1 (disabled).
SuspendRate: Maximum number of nodes to be placed into power saving mode per
minute. A value of zero results in no limits being imposed. The default value is 60. Use
this to prevent rapid drops in power requirements.
ResumeRate: Maximum number of nodes to be removed from power saving mode per
minute. A value of zero results in no limits being imposed. The default value is 300.
Use this to prevent rapid increases in power requirements.
SuspendTimeout: Maximum time permitted (in seconds) between when a node suspend
request is issued, and when the node shutdown is complete. When the time specified
has expired the node must be ready for a resume request to be issued as needed for a
new workload. The default value is 30 seconds.
ResumeTimeout: Maximum time permitted (in seconds) between when a node resume
request is issued and when the node is actually available for use. Nodes which fail to
respond in this time-frame may be marked DOWN and the jobs scheduled on the
node requeued. The default value is 60 seconds.
SuspendExcNodes: List of nodes that will never be placed in power saving mode. Use
SLURM's hostlist expression format. By default, no nodes are excluded.
SuspendExcParts: List of partitions that will never be placed in power saving mode.
Multiple partitions may be specified using a comma separator. By default, no nodes
are excluded.
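Taken together, a minimal power saving block in slurm.conf might resemble the sketch below; the program paths, node list and partition name are purely illustrative:
SuspendTime=600
SuspendRate=60
ResumeRate=300
SuspendTimeout=30
ResumeTimeout=60
SuspendProgram=/etc/slurm/node_suspend.sh
ResumeProgram=/etc/slurm/node_resume.sh
SuspendExcNodes=node[1-2]
SuspendExcParts=service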
Note that SuspendProgram and ResumeProgram execute as SlurmUser on the node where
the slurmctld daemon runs (Primary and Backup server nodes). Use of sudo may be
required for SlurmUser to power down and restart nodes. If you need to convert SLURM's
hostlist expression into individual node names, the scontrol show hostnames command may
prove useful. The commands used to boot or shut down nodes will depend upon the cluster
management tools that are available.
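For example, given a hypothetical hostlist expression, the command expands it to one node name per line:
scontrol show hostnames node[1-3]
node1
node2
node3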
SuspendProgram and ResumeProgram are not subject to any time limits. They should perform the required action, ideally verify that it succeeded (e.g. that the node has booted and that the slurmd daemon has started, so that the node is no longer reported as non-responsive to slurmctld), and then terminate. Long-running programs will be logged by slurmctld, but not aborted.
#!/bin/bash
# Example SuspendProgram
# slurmctld passes the hostlist expression of the nodes to suspend as $1
hosts=`scontrol show hostnames $1`
for host in $hosts
do
   sudo node_shutdown $host
done

#!/bin/bash
# Example ResumeProgram
# slurmctld passes the hostlist expression of the nodes to resume as $1
hosts=`scontrol show hostnames $1`
for host in $hosts
do
   sudo node_startup $host
done
Subject to the various rates, limits and exclusions, the power save code follows the logic
below:
1. Identify nodes which have been idle for at least the SuspendTime.
2. Execute SuspendProgram with the names of the idle nodes as its argument.
3. Identify the nodes which are in power save mode (a flag in the node's state field), but have been allocated to jobs.
4. Execute ResumeProgram with the names of the allocated nodes as its argument.
5. Once the slurmd responds, initiate the job and/or job steps allocated to it.
6. If the slurmd fails to respond within the value configured for SlurmdTimeout, the node will be marked DOWN and the job requeued if possible.
7. Repeat indefinitely.
The slurmctld daemon will periodically (every 10 minutes) log how many nodes are in
power save mode using messages of this sort:
[May 02 15:31:25] Power save mode 0 nodes
...
[May 02 15:41:26] Power save mode 10 nodes
...
[May 02 15:51:28] Power save mode 22 nodes
Using these logs you can easily see the effect of SLURM's power saving support. You can also configure SuspendProgram and ResumeProgram to run programs that perform no action, in order to assess the potential impact of power saving mode before enabling it.
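One simple way to do this, assuming a standard Linux installation, is to point both parameters at a command that does nothing, so that only the state transitions and logging are exercised:
SuspendProgram=/bin/true
ResumeProgram=/bin/true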
6.3.2 Fault tolerance
If the slurmctld daemon is terminated gracefully, it will wait up to the SuspendTimeout or ResumeTimeout interval (whichever is larger) for any spawned SuspendProgram or ResumeProgram to terminate before the daemon itself terminates. If the spawned program does not terminate within that time period, the event will be logged and slurmctld will exit in order to permit another slurmctld daemon to be initiated. Synchronization problems could also occur if, and when, the slurmctld daemon crashes (a rare event) and is restarted.
In either event, the newly initiated slurmctld daemon (or the backup server) will recover saved node state information that may not accurately describe the actual node state. In the case of a failed SuspendProgram, the negative impact is limited to increased power consumption. No special action is currently taken to execute SuspendProgram multiple times in order to ensure that the nodes remain in a reduced power mode. The case of a failed ResumeProgram call is more serious, as the node could be placed into a DOWN state and/or jobs could fail. In order to minimize this risk, when the slurmctld daemon is started and a node which should be allocated to a job fails to respond, the ResumeProgram will be executed (possibly for a second time).
7.1
7.2
The version numbers depend on the release and are indicated by the letter x above.
1. Run the command scontrol ping to determine whether the primary and backup controllers are responding.
2. If they respond, there may be a network or configuration problem; see section 7.5 Networking and Configuration Problems.
3. If there is no response, log on to the machines to rule out any network problems.
4. Check to see if the slurmctld daemon is active by running the following command:
ps -ef | grep slurmctld
a. If slurmctld is not active, restart it as the root user using the following command:
service slurm start
b. Check the log file specified by SlurmctldLogFile in the slurm.conf file for an indication of why it failed.
c. If slurmctld is running but not responding (a very rare situation), then kill and restart it as the root user using the following commands:
d.
5. If SLURM continues to fail without an indication of the failure mode, stop the service, add the controller option "-c" to the /etc/slurm/slurm.sh script, as shown below, and restart.
service slurm stop
SLURM_OPTIONS_CONTROLLER=-c
service slurm start
Note:
All running jobs and other state information will be lost when using this option.
7.3
1. This is dependent upon the scheduler used by SLURM. Run the following command to identify the scheduler:
scontrol show config | grep SchedulerType
See section 1.4 Scheduler Types for details on the scheduler types.
2. For any scheduler, the priorities of jobs can be checked using the following command:
scontrol show job
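For a single job, the priority field can be extracted directly; the job ID below is hypothetical, and the grep simply filters the relevant line from the scontrol output:
scontrol show job 1234 | grep Priority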
7.4
1. Check to determine why the node is down using the following command:
scontrol show node <name>
This will show the reason why the node was set as down and the time when this
happened. If there is insufficient disk space, memory space, etc. compared to the
parameters specified in the slurm.conf file, then either fix the node or change
slurm.conf.
For example, if the temporary disk space specification is TmpDisk=4096, but the
available temporary disk space falls below 4 GB on the system, SLURM marks it as
down.
2. If the reason is Not responding, then check the communication between the Management Node and the DOWN node by using the following command:
ping <address>
Check that the <address> specified matches the NodeAddr values in the slurm.conf
file. If ping fails, then fix the network or the address in the slurm.conf file.
3. Log on to the node that SLURM considers to be in a DOWN state and check to see if the slurmd daemon is running using the following command:
ps -ef | grep slurmd
4. If slurmd is not running, restart it as the root user using the following command:
service slurm start
5. Check the log file specified by SlurmdLogFile in the slurm.conf file for an indication of why it failed.
a. If slurmd is running but not responding (a very rare situation), then kill and restart it as the root user using the following commands:
6. If the node is still not responding, there may be a network or configuration problem; see section 7.5 Networking and Configuration Problems.
7. If the node is still not responding, increase the verbosity of the debug messages by increasing SlurmdDebug in the slurm.conf file, and restart. Again, check the log file for an indication of why it failed.
8. If the node is still not responding without an indication as to the failure mode, stop the service, add the daemon option "-c" to the /etc/slurm/slurm.sh script, as shown below, and restart.
service slurm stop
SLURM_OPTIONS_DAEMONS=-c
service slurm start
Note:
All running jobs and other state information will be lost when using this option.
7.5 Networking and Configuration Problems
1. Use the following command to examine the status of the nodes and partitions:
sinfo --all
2. Use the following commands to confirm that the control daemons are up and running on all nodes:
scontrol ping
scontrol show node
3. Check the controller and/or slurmd log files (SlurmctldLogFile and SlurmdLogFile in the slurm.conf file) for an indication of why a particular node is failing.
4. Check for consistent slurm.conf and credential files on the node(s) experiencing problems.
5. If the problem is a user-specific problem, check that the user is configured on the Management Node as well as on the Compute Nodes. The user does not need to be able to log in, but the user ID must exist. User authentication must be available on every node; if it is not, non-root users will be unable to run jobs.
6. Verify that the security mechanism is in place; see Chapter 3 for more information on SLURM and security.
7. Check that a consistent version of SLURM exists on all of the nodes by running one of the following commands:
sinfo -V
or
rpm -qa | grep slurm
If the first two digits of the version number match, the versions should work together; however, version 1.1 commands will not work with version 1.2 daemons, or vice-versa. Errors can result unless these conditions are met.
8. Each node must be synchronized to the correct time. Communication errors occur if the node clocks differ.
Execute the following command to confirm that all nodes display the same time:
pdsh -a date
7.6 More Information
For more information on SLURM troubleshooting, see http://www.llnl.gov/linux/slurm/slurm.html
Glossary
A
API
Application Programmer Interface
L
LSF
Load Sharing Facility
M
MPI
Message Passing Interface
P
PDSH
Parallel Distributed Shell
PMI
Process Management Interface
R
RPM
RPM Package Manager
S
SLURM
Simple Linux Utility for Resource Management; an open source, highly scalable cluster management and job scheduling system.
SSH
Secure Shell
Index
/
  /etc/init.d/slurm script, 25
  /gfs2shared/slurmdata file system, 35
A
  Authentication, 12, 21
C
  Command Line Utilities, 37
  Commands
    sacct, 1, 37, 48
    sacctmgr, 1, 37, 43
    salloc, 1, 37, 41
    sattach, 1, 37, 42
    sbatch, 1, 37, 40
    sbcast, 1, 37, 44
    scancel, 1, 37, 47
    scontrol, 2, 7
    sinfo, 1, 37, 46
    squeue, 1, 37, 45
    sreport, 1
    srun, 1, 37, 39
    sstat, 1
    strigger, 1, 37, 49
    sview, 1, 37, 50
  Compute node daemon, 25
  Controller daemon, 25
  cpufreq governor, 57
D
  Daemons
    munged, 21
    SLURMCTLD, 2, 25, 26
    SLURMD, 4, 25, 27
    SlurmDBD, 4, 25, 28
  Draining a node, 30
F
  FastSchedule parameter, 56
  Fault tolerance, 59
  Files
    slurm.conf, 6
    slurmdbd.conf, 14
  Functions, 1
G
  Global Accounting API, 37, 51
H
  HA Cluster Suite, 35
  Hard Limit, 56
  High Availability, 33
J
  Job Accounting, 55
  JobAcctGatherType parameter, 55
  JobCredentialPrivateKey, 29
  JobCredentialPublicCertificate, 29
L
  large clusters, 55
  LSF, 51
M
  MPI Support, 37
  Munge
    installation, 12, 21
  munged daemon, 21
N
  NodeAddr, 6
  NodeHostname, 6
  NodeName, 6
O
  openssl, 16, 29
P
  pam_slurm module, 21
  Power saving, 56
R
  resource manager, 1
  ResumeRate parameter, 57
S
  Scheduler Types, 5
    backfill, 5
    builtin, 5
    gang, 5
    hold, 5
    wiki, 5
  scontrol command, 2, 7
  Scontrol examples, 29
  Secret Key, 22
  security, 16, 29
  SelectType configuration parameter, 28
  SelectType parameter, 55
  sinfo command, 1, 37, 46
  slurm.conf file, 6
  slurm.conf file example, 13
  slurm_setup.sh Script, 17
  slurmctld daemon, 59
  SLURMCTLD daemon, 2, 25, 26
  slurmdbd.conf file, 14
  SlurmdTimeout parameter, 56
  Slurmstepd, 55
  SlurmUser, 16
  sreport command, 1
  sstat command, 1
  SuspendExcNodes parameter, 57
  SuspendExcParts parameter, 57
  SuspendProgram parameter, 57, 59
  SuspendRate parameter, 57
  SuspendTime parameter, 57
  SuspendTimeout parameter, 57, 59
  sview command, 1, 37, 50
  syslog, 28
T
  testing configuration, 20
  Timers for Slurmd and Slurmctld daemons, 56
  topology.conf file, 15
  TreeWidth parameter, 56
  troubleshooting, 61
U
  using openssl, 16