Bunya User Guide 2022 12 06

Bunya (very short) user guide

General HPC training is available via the RCC and QCIF Training resources.

To get a basic understanding of what you need to be aware of when using HPC for your
research please listen to the following videos:

Connecting to HPC via PuTTY
Transferring files with FileZilla
Where does my data go on HPC
Directories (folders) I should be aware of on HPC
Message of the day (info on current status and problems)
Relative and absolute path (common problem in user scripts)
How to load installed software
Why no calculation should be run on the login nodes
PBS qstat (monitoring jobs)
PBS qsub (submitting jobs)
PBS qdel (deleting jobs)

For UQ users and QCIF users with a QRIScloud collection please also listen to
General overview of Q RDM
Q RDM on HPC

What is changing

Bunya replaces Awoonga and FlashLite (and eventually Tinaroo).

Hardware:

• Bunya currently has around 6000 cores, with 96 physical cores per node
  (2 x 48-core CPUs per node).
• These CPUs are AMD EPYC 3 (Milan). They are not Intel CPUs as was the
case with FlashLite and Tinaroo.
• These CPU cores are based on the industry standard x86_64 architecture.
• Each standard Bunya node has 2TB of RAM.
• There are also 3 high memory nodes that each have 4TB of RAM.

Resource Scheduler:

• Bunya uses the Slurm scheduler and batch queue system, which is different from the
PBS scheduler and batch queue used on FlashLite and Tinaroo. Users will not be able
to reuse their PBS scripts from Tinaroo/FlashLite but will have to change to Slurm
scripts.
• Bunya is currently CPU only for the standard user. The standard queues do not have
GPU hardware resources associated with them yet.

Software:

• Software is still available via the module system as it was on FlashLite and Tinaroo.
• Bunya will, however, have different software and versions installed than Tinaroo,
FlashLite or Awoonga did.
• Users should use module avail to check which software and versions are
installed.
• Users who install their own software will need to recompile it for
Bunya.

What remains the same

• Locations for data: /home, /scratch/user, /scratch/project and /QRISdata remain the
same.
• /RDS has been retired and users are now required to use /QRISdata. (/RDS was set up
as a link to /QRISdata, so for users accustomed to /RDS only the name changes.)
• Users will see the same data in /home, /scratch/user, /scratch/project and /QRISdata
on Tinaroo/FlashLite and Bunya. There is no need for users to transfer any data
from Tinaroo/FlashLite before using Bunya.

Guide

Connecting

Set 1 of the Training resources explains how to use PuTTY to connect to an HPC, with the
basics found here. To connect to Bunya please use:

Hostname: bunya3.rcc.uq.edu.au
Port: 22

For those using command line ssh:

ssh [email protected]

Bunya enforces UQ Sign In MFA.

File Transfer

The basic use of FileZilla for file and data transfer is shown here.
If you experience problems with disconnection, then try this: Go to Edit -> Settings and
change the number under “Timeout” from 20 seconds to 120 or more.

With MFA you need to use an interactive session in FileZilla to connect. Click on the icon
directly under “File” (top left corner), which opens the Site Manager window. Then select
Interactive from the Logon Type drop-down menu.
Software

The training resources have a short video on how to use software modules to load installed
software on HPC.

The basic commands are:

module avail - shows all available main modules


module --show_hidden avail - shows all available modules
module avail [SOFTWARE-NAME or KEYWORD] - shows all modules for SOFTWARE-NAME or KEYWORD
module spider [SOFTWARE-NAME or KEYWORD] - shows all possible modules for SOFTWARE-NAME or KEYWORD
module load [SOFTWARE-NAME/VERSION] - loads a specific software version
module unload [SOFTWARE-NAME/VERSION] - unloads a specific software version
module list - lists all currently loaded software modules
module purge - unloads ALL currently loaded software modules

Bunya uses EasyBuild to build and install software and modules. Modules on Bunya are self-
contained which means users do not need to load any dependencies for the module to
work. This is similar to how modules worked on Tinaroo and FlashLite but different to
Wiener.

Using module avail will show only the main software modules installed. It will not show
all the different dependency modules that are also available. To show ALL modules including
hidden modules use

module --show_hidden avail
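
For example, a typical workflow to find and load software might look like the following. The
module name and version here are placeholders only; check module avail to see what is
actually installed on Bunya.

# find any Python modules (keyword and version are examples only)
module avail python

# load a specific version found in the listing above
module load python/3.9.5

# check what is currently loaded, and clean up when done
module list
module purge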

How to build your own software


Users can use EasyBuild to build their own software against existing modules on Bunya.
https://docs.easybuild.io/en/latest/index.html

EasyBuild recipes can be found for a very wide range of software. Some might need
tweaking for newer versions, but this is often relatively easy. You can also write your own.

Users can build into their own home directory while using all existing software and software
tool chains that are already available. Users need to load the EasyBuild module first:

module load easybuild/4.6.1

For example, if you create a folder called EasyBuild in your home directory and have a recipe
located in this directory, you can build the software via this command:

eb --prefix=/home/YourUsername/EasyBuild \
   --installpath=/home/YourUsername/EasyBuild \
   --buildpath=/home/YourUsername/EasyBuild/build \
   --robot=/home/YourUsername/EasyBuild ./EasyBuild-recipe-file.eb

If you add the -D option, it will do a dry run first. Please use eb -H to get the help manual.
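
As a rough sketch, assuming the directory layout above and a recipe file called
Example-1.0.eb (a placeholder name), the sequence could be:

module load easybuild/4.6.1

# dry run first: -D lists what would be built without building anything
eb --prefix=/home/YourUsername/EasyBuild \
   --installpath=/home/YourUsername/EasyBuild \
   --buildpath=/home/YourUsername/EasyBuild/build \
   --robot=/home/YourUsername/EasyBuild -D ./Example-1.0.eb

# if the dry run looks sensible, repeat the same command without -D to build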

Users who have a working EasyBuild recipe, and have tested that the software it installs
works on Bunya, can offer the recipe for inclusion in the cluster-wide installed
software; it would then be available to everyone via modules.

Do not run on the login nodes

Users are reminded that no calculation, no matter how quick or small, should be run on the
login nodes. So no, that quick Python, R, or bash script (or similar) should NOT be run
from the command line just because it is so much more convenient. All calculations are required
to be done on the compute nodes.

Users can use interactive jobs which will give them that command line feel and flexibility
and allow the use of graphical user interfaces.

Users have access to a debug queue for quick testing of new jobs and codes etc.

Interactive jobs

Users should use interactive jobs to do quick testing and if they need to use a graphical user
interface (GUI) to run their calculations. This could include Jupyter, Spyder, etc. salloc is
used to submit an interactive job and you should specify the required resources via the
command line:

This seems to work:


salloc --nodes=1 --ntasks-per-node=1 --cpus-per-task=1 --mem=50G \
       --job-name=TEST --time=05:00:00 --partition=general \
       --account=AccountString srun --export=PATH,TERM,HOME,LANG --pty /bin/bash -l

Please use --partition=general unless you have been given permission to use ai or
gpu.

For an interactive session on the gpu or ai nodes you will need to add
--gres=gpu:[number] to the salloc request. For the gpu partition you will need to
specify which type of GPU you are requesting, as there are now AMD and NVIDIA GPUs. See
below for more information.

This will log you onto a node. To run a job, just type as you would usually do on the
command line. As srun was already used in the above command, there is no need to use
srun to run your executables; it will just mess things up.
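
For example, once you have landed on the node, a session might look like this (the module
and script names are placeholders only):

module load python/3.9.5     # placeholder: load whatever software your job needs
python my_script.py          # run directly on the command line; do NOT prefix with srun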

Once you are done type

exit

on the command line which will stop any processes still running and will release the
allocation for the job.

At the moment there are issues with testing MPI jobs through an interactive session.

For those using the foss tool chain, please do


export OMP_NUM_THREADS=1
Otherwise we have found that you get multiple processes per MPI process which eventually
locks up the node.

For those using the intel tool chain please do


salloc --nodes=1 --ntasks-per-node=96 --cpus-per-task=1 --ntasks=96 \
       --mem=500G --job-name=MPI-test --time=05:00:00 --partition=general \
       --account=AccountString

This will give you a new shell and an allocation, but you are still on the login node. You can
now use srun to actually start a job on a node.
1) Load all the modules you need
2) export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so
3) export SLURM_MPI_TYPE=pmi2
4) srun --ntasks=[number up to 96] --export=ALL executable < input > output

You can use any number of cores you need, up to the full 96 you requested via salloc. You
need --export=ALL to pass the environment, with the loaded modules and the pmi2 settings,
to the job. This will only work for the general and debug partitions. For the GPU
partitions you might have to do some testing and provide a long list of what needs to be
exported.
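
Put together, and assuming an MPI executable called my_mpi_prog (a placeholder name, as are
the module and file names), the sequence inside the salloc shell might look like:

# still on the login node, inside the shell created by salloc above
module load intel-compilers/2022.1.0    # placeholder: load the modules your job needs

export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so
export SLURM_MPI_TYPE=pmi2

# start the MPI job on the allocated node
srun --ntasks=48 --export=ALL ./my_mpi_prog < input > output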

This will start the job. Once it is done or has crashed, you get your prompt back, but you are
still in the salloc allocation, so you are able to submit more under that allocation. To exit and
release the job allocation, type exit.

Slurm scripts

Users should keep in mind that Bunya has 96 cores per node. 96 cores or --cpus-per-task=96 is
therefore the maximum a multi core job can request. Please note not all calculations scale
well with cores, so before requesting all 96 cores do some testing first.

Users with MPI jobs should run in multiples of nodes, so in multiples of 96 cores. This means
the calculation needs to scale well to such numbers of cores. Most will not, so do some
testing first!

The Pawsey Centre has an excellent guide on how to migrate from PBS to Slurm. The
Pawsey Centre also provides a good general overview of job scheduling with Slurm and
example workflows like array jobs.

Below are examples for single core, single node but multiple cores, MPI, and array job
submission scripts. The different request flags mean the following:

#SBATCH --nodes=[number] - how many nodes the job will use


#SBATCH --ntasks-per-node=[number] - This is 1 for single core jobs and multi core
jobs. This is 96 (or less if single node) for MPI jobs.
#SBATCH --cpus-per-task=[number] - This is 1 for single core jobs, the number of cores
for multi core jobs, and 1 for MPI jobs. --cpus-per-task can be understood as
OMP_NUM_THREADS.
#SBATCH --mem=[number M|G|T] - RAM per job given in megabytes (M), gigabytes (G),
or terabytes (T). Ask for 2000000M to get the maximum memory on a standard node. Ask
for 4000000M to get the maximum memory on a high memory node.
(#SBATCH --mem-per-cpu=[number M|G|T] - alternative to the request above, only
relevant to MPI jobs.)
#SBATCH --gres=gpu:[type]:[number] - to request the use of GPUs on a GPU node.
On the gpu partition there are 2 per node and on the ai partition there are 3 per node.
Please see the example scripts below for the available types of GPUs.
#SBATCH --time=[hours:minutes:seconds] - time the job needs to complete
#SBATCH -o filename - filename where the standard output should go to
#SBATCH -e filename - filename where the standard error should go to
#SBATCH --job-name=[Name] - Name for the job that is seen in the queue
#SBATCH --account=[Name] - Account String for your research or accounting group, all
Account Strings start with “a_”
#SBATCH --partition=general/gpu/debug/ai
#SBATCH --array=[range] - Indicates that this is an array job with range number of
tasks.
srun - runs the executable and will receive info on the number of cores etc. from Slurm. There
is no need to specify them here.

See
man sbatch
and
man srun
for more options (use arrow keys to scroll up and down and q to quit)

Please note: The default partition is debug, which will give you a bare minimum of
resources. For example, the maximum walltime in the debug queue is 30 minutes. Most
users will want to run in the general partition. Important: the Slurm defaults are usually
not sufficient for most user jobs. If you want appropriate resources, you are required to
request them.

Please note: using the SBATCH options -o and -e in a script will result in the standard error
and standard output files appearing as soon as the job starts to run. This behaviour is
different from standard PBS behaviour on Tinaroo and FlashLite (unless you specified paths for
those files there too), where the standard error (.e) and standard output (.o) files only
appeared when the job was finished or had crashed.

Please note: In Slurm your job will start in the directory/folder you submitted from. This is
different from PBS behaviour on Tinaroo/FlashLite, where your job started in your home
directory. So on Bunya, using Slurm, there is no need to change into the job directory, unless
it is different from the directory you submitted from.

Please note: There is currently no equivalent to the $TMPDIR that was available on FlashLite
and Tinaroo. Until this has been set up users are required to use their /scratch/user
directory for temporary files. RCC is working to set up a large and fast space for temporary
files which will accommodate similar loads as was possible on FlashLite, if not more.
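
As an interim workaround, a job script can create its own temporary directory under
/scratch/user and remove it at the end. This is only a sketch of one possible pattern (the
path layout is an example, not an official replacement for $TMPDIR):

# create a per-job temporary directory on scratch (adjust the path to your scratch directory)
export TMPDIR=/scratch/user/YourUsername/tmp_${SLURM_JOB_ID}
mkdir -p $TMPDIR

# ... run the calculation that needs temporary space here ...

# clean up the temporary files once the job no longer needs them
rm -rf $TMPDIR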

Accounting has now been switched on and will be enforced. Users cannot run jobs without
a valid Account String. All valid AccountStrings start with “a_” and are all lower case
letters. If you do not have a valid AccountString then please contact your supervisor.
AccountStrings and access are managed by research groups and group leaders. Groups
who wish to use Bunya are required to apply to set up a group with a valid AccountString.
Only group leaders can apply to set up such a group. A PhD student or postdoc without
their own funding and group should not apply. Applications can be made by contacting
[email protected].

Simple script for AI GPUs. Nodes bun003, bun004, and bun005. The AI GPUs are restricted to a
specific set of users. If you have not been given explicit permission do not use these.
Only certain AccountStrings have access to these GPUs. If you should have access but cannot
run a job, please contact your supervisor.

#!/bin/bash --login
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=10G
#SBATCH --job-name=Test
#SBATCH --time=1:00:00
#SBATCH --partition=ai
#SBATCH --account=AccountString
#SBATCH --gres=gpu:1 #you can ask for up to 3 here
#SBATCH -o slurm.output
#SBATCH -e slurm.error

module-loads-go-here

srun executable < input > output

Simple script for AMD GPUs. Nodes bun001 and bun002. The AMD GPUs are restricted to
a specific set of users. If you have not been given explicit permission do not use these.

#!/bin/bash --login
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=10G
#SBATCH --job-name=Test
#SBATCH --time=1:00:00
#SBATCH --partition=gpu
#SBATCH --account=AccountString
#SBATCH --gres=gpu:mi210:1 #you can ask for up to 2 here
#SBATCH -o slurm.output
#SBATCH -e slurm.error

module-loads-go-here

srun executable < input > output

Simple script for A100 GPUs. Node bun068. These are not set up yet, so please do not try
and use these.

#!/bin/bash --login
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=10G
#SBATCH --job-name=Test
#SBATCH --time=1:00:00
#SBATCH --partition=gpu
#SBATCH --account=AccountString
#SBATCH --gres=gpu:a100:1 #you can ask for up to 2 here
#SBATCH -o slurm.output
#SBATCH -e slurm.error

module-loads-go-here

srun executable < input > output

Simple script for CPUs and single node

#!/bin/bash --login
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=10G
#SBATCH --job-name=Test
#SBATCH --time=1:00:00
#SBATCH --partition=general
#SBATCH --account=AccountString
#SBATCH -o slurm.output
#SBATCH -e slurm.error

module-loads-go-here

srun executable < input > output

To ask for more than 1 core, change the --cpus-per-task line, for example to

#SBATCH --cpus-per-task=12

to run over 12 cores.

Simple MPI script (uses 2 nodes, giving 192 cores, as an example)

#!/bin/bash --login
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=96
#SBATCH --cpus-per-task=1
#SBATCH --mem=5G
#SBATCH --job-name=MPI-Test
#SBATCH --time=1:00:00
#SBATCH --partition=general
#SBATCH --account=AccountString
#SBATCH -o slurm.output
#SBATCH -e slurm.error

module-loads-go-here

srun executable < input > output

Job Arrays

Here is one example of an array job script with 5 tasks.

#!/bin/bash --login
#SBATCH --job-name=testarray
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=5G
#SBATCH --time=00:01:00
#SBATCH --account=AccountString
#SBATCH --partition=general
#SBATCH --output=test_array_%A_%a.out
#SBATCH --array=1-5

module-loads-go-here

srun executable < input > output

Useful variables for array jobs

$SLURM_ARRAY_JOB_ID = Job array's master job ID number.
$SLURM_ARRAY_TASK_COUNT = Total number of tasks in a job array.
$SLURM_ARRAY_TASK_ID = Job array ID (index) number.
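
A common use of these variables is to have each array task pick its own input file. The sketch
below assumes input files named input_1.txt to input_5.txt in the submission directory; the
naming is purely illustrative:

# each array task reads the input file matching its index and writes its own output
INPUT=input_${SLURM_ARRAY_TASK_ID}.txt
OUTPUT=output_${SLURM_ARRAY_TASK_ID}.txt

srun executable < $INPUT > $OUTPUT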

To check on submitted jobs you use squeue.


squeue -u YourUsername - will only print your jobs

Here are some other useful additions to the squeue command. For information on what all
of these mean please consult the man pages.

squeue -o "%.18i %.9P %.8j %.8u %.8T %.10M %.9l %.6D %.10a %.4c %R"
squeue -o"%.7i %.9P %.8j %.8u %.2t %.10M %.6D %C"

sinfo is used to obtain information about the actual nodes. Here are some useful examples.
sinfo -o "%n %e %m %a %c %C"
sinfo -O Partition,NodeList,Nodes,Gres,CPUs
sinfo -o "%.P %.5a %.10l %.6D %.6t %N %.C %.E %.g %.G %.m"
