2022 - HPC Training 04 - HPC Basic Usage
Basic Usage
16 GB Memory 2 TB Memory
activate
1 GPUs Unlimited GPUs
Example:
$ scp /path/to/filename username@umhpc.dicc.um.edu.my:/home/username
$ scp -r /path/to/directory username@umhpc.dicc.um.edu.my:/home/username
/home vs /lustre
              Home Directory   Lustre Directory
File System   NFS              Lustre
Backup        Daily            -
CG / COMPLETING The job has finished executing and is now completing.
Priority Weightage
QoS 10,000
Fairshare 50,000
Factor 1: Job Age
● Newly submitted jobs are less likely to run first than jobs that have been
queued for a very long time.
● Job Age Priority increases as the job spends more time in the queue.
● Job Age Priority reaches its maximum value when the job has been queued for
ONE month.
● The longer a job waits in the queue, the higher its priority.
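The Job Age rule above can be sketched as a factor between 0 and 1 that grows with time in the queue and stops growing after one month. This is only an illustration, not the scheduler's own code: the function name age_factor is hypothetical, and both the linear growth and the reading of ONE month as 30 days are assumptions.

```shell
# age_factor DAYS_IN_QUEUE
# Hypothetical helper: prints a value between 0 and 1 that rises linearly
# with queue time and stops rising after 30 days (assumed "one month").
age_factor() {
    awk -v days="$1" -v max_days=30 'BEGIN {
        f = days / max_days
        if (f > 1) f = 1
        if (f < 0) f = 0
        printf "%.2f\n", f
    }'
}

age_factor 3     # queued for 3 days
age_factor 15    # queued for half of the assumed month
age_factor 45    # queued longer than a month, so the factor is capped at 1
```

The scheduler would multiply a factor like this by the Job Age weight before adding it to the other priority factors.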
Factor 2: Quality of Service (QoS)
● Each QoS has its own defined priority value.
● Jobs should be submitted with the appropriate QoS based on the amount of time
needed.
● Longer jobs might clash with the system maintenance schedule. Jobs could be
terminated when system maintenance starts.
● Longer jobs are more prone to random hardware or software failures.
Type of Quality of Service
QoS        Maximum Walltime   Priority Boost
short      1 hour             +10,000
normal *   1 day              0
long       7 days             0
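Choosing a QoS from the walltime limits in the table above can be sketched as a small helper. The function name pick_qos is hypothetical, and the rule of picking the shortest QoS that still covers the requested time is an assumption based on the advice to match the QoS to the time needed.

```shell
# pick_qos HOURS
# Hypothetical helper: prints the shortest QoS from the table whose maximum
# walltime covers the requested number of hours.
pick_qos() {
    awk -v h="$1" 'BEGIN {
        if (h <= 1)        print "short"    # up to 1 hour
        else if (h <= 24)  print "normal"   # up to 1 day
        else if (h <= 168) print "long"     # up to 7 days
        else { print "no QoS allows " h " hours" > "/dev/stderr"; exit 1 }
    }'
}

pick_qos 0.5   # prints: short
pick_qos 12    # prints: normal
pick_qos 72    # prints: long
```

The result could then be passed on the command line, for example sbatch --qos="$(pick_qos 12)" submission_script.sh.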
Cannot enter input halfway during the execution. Can enter input interactively during the execution.
#SBATCH --partition=cpu-opteron
#SBATCH --job-name=first_job
#SBATCH --output=%x.out
#SBATCH --error=%x.err
#SBATCH --nodes=1
#SBATCH --mem=2G
#SBATCH --ntasks=16
#SBATCH --qos=normal
#SBATCH --mail-type=ALL
#SBATCH --mail-user=george@um.edu.my
OR
● (Recommended) sbatch with parameters within submission script
$ sbatch submission_script.sh
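A complete submission_script.sh built from the directives shown earlier might look like the sketch below. The directives come from the slide (the mail notification lines are left out); the shebang line and everything under the directives are assumed placeholders, since the real workload is not shown.

```shell
#!/bin/bash
#SBATCH --partition=cpu-opteron
#SBATCH --job-name=first_job
#SBATCH --output=%x.out
#SBATCH --error=%x.err
#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --mem=2G
#SBATCH --qos=normal

# Everything below the directives is an assumed placeholder workload.
# Slurm sets SLURM_JOB_ID and SLURM_NTASKS inside a job; the defaults after
# ":-" only apply when the script is run outside Slurm as a dry run.
job_id="${SLURM_JOB_ID:-none}"
ntasks="${SLURM_NTASKS:-1}"

echo "Job ${job_id} started on $(uname -n) with ${ntasks} task(s)"
# Replace the next line with the real application command.
echo "Hello from first_job"
echo "Job ${job_id} finished"
```

Because the #SBATCH lines are ordinary comments to the shell, the same file can be run directly with bash to check the commands before it is submitted with sbatch.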
Delete / Cancel Submitted Job
● Use scancel to cancel and remove a batch job from the queue.
$ scancel <job_id>
● A cancelled job cannot be recovered!
When to use Interactive Mode
● You should use Interactive Mode to run your jobs when:
○ You need to input commands during the application execution.
○ You are trying to compile your own application.
○ You need to analyse real-time terminal output.
○ You are trying to debug or troubleshoot your calculation.
Starting Interactive Session
● Standard CPU Jobs
$ salloc -n4 --mem=16G
$ srun --jobid=1234 --pty bash -l
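The 1234 in the srun command above stands for the job ID that salloc reports when the allocation is granted. A minimal sketch of picking that number out of salloc's message is shown below; the function name job_id_from_salloc is hypothetical, and the input line is hand-written in the format salloc normally prints.

```shell
# job_id_from_salloc
# Hypothetical helper: reads salloc's messages on standard input and prints
# the job ID found on the "Granted job allocation" line.
job_id_from_salloc() {
    sed -n 's/^salloc: Granted job allocation \([0-9][0-9]*\).*$/\1/p'
}

# Hand-written example message, not captured from a real session:
echo "salloc: Granted job allocation 1234" | job_id_from_salloc
```

The printed number is the value to pass to srun --jobid.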
64 for OPTERON
200G long Standard large job
48 for EPYC
Monitor Job Resources
● Log in to the DICC OnDemand Portal at https://umhpc.dicc.um.edu.my.
● Go to Jobs > Active Jobs
● The maximum CPU Usage is 50% for EPYC and 100% for Opteron.
Resources Allocation & Usage
● Jobs can only access the resources allocated to them.
● Jobs should be stopped if they are no longer needed.
● Resources allocated to jobs that stay idle for a long period of time will be
released. These idle jobs will be held and prevented from being scheduled.
● Jobs exceeding their allocated memory will be TERMINATED automatically by the
system cgroup with an Out-of-Memory (OOM) error.
● Jobs that generate more CPU load than their allocated CPU cores will be
TERMINATED by the administrator to avoid service degradation. A warning will
be sent to the user who submitted the problematic jobs.
Checking Output & Error Log
● Output and error logs contain useful information to help determine if your
calculation is successful or not.
● Output and error logs can normally be found in the directory where you ran
the sbatch command to submit the submission script.
● The filenames for the output and error logs can be specified with the
following parameters:
--error=error.log
--output=output.log
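Besides fixed names such as output.log, Slurm accepts filename patterns: %x is replaced by the job name (as in the --output=%x.out directive used earlier) and %j by the job ID. The sketch below only imitates that substitution to show what the resulting names look like; the helper expand_log_name is hypothetical, and Slurm performs the real substitution itself.

```shell
# expand_log_name PATTERN JOB_NAME JOB_ID
# Hypothetical helper that imitates how Slurm fills in two of its filename
# patterns: %x (job name) and %j (job ID). Illustration only.
expand_log_name() {
    printf '%s\n' "$1" | sed -e "s/%x/$2/g" -e "s/%j/$3/g"
}

expand_log_name "%x.out" first_job 1234       # prints: first_job.out
expand_log_name "%x-%j.err" first_job 1234    # prints: first_job-1234.err
```

Including %j in the name keeps the logs of different jobs apart when they are submitted from the same directory.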
Account History
Checking Account History
● Use the sacct command to view your past job history.
$ sacct --starttime=2022-01-01 --endtime=2022-01-10
● Additional optional parameters:
○ -l, --long
■ full details, containing every column of the job history data
○ -p, --parsable
■ a parsable record of the history data
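The parsable format separates columns with "|", which makes the history easy to summarise with standard tools. The sketch below counts records per job state; the helper name count_job_states is hypothetical, and the sample records are made up for illustration (real output also contains job step lines).

```shell
# count_job_states
# Hypothetical helper: reads pipe-delimited sacct output (as produced with -p)
# on standard input and prints how many records ended in each state.
count_job_states() {
    awk -F'|' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "State") col = i; next }
        col     { count[$col]++ }
        END     { for (s in count) print s, count[s] }
    ' | sort
}

# Made-up sample in the shape of "sacct -p" output, for illustration only.
count_job_states <<'EOF'
JobID|JobName|Partition|Account|AllocCPUS|State|ExitCode|
101|first_job|cpu-opteron|demo|16|COMPLETED|0:0|
102|first_job|cpu-opteron|demo|16|FAILED|1:0|
103|first_job|cpu-opteron|demo|16|COMPLETED|0:0|
EOF
```

On the cluster the same helper could read real data, for example sacct -p --starttime=2022-01-01 --endtime=2022-01-10 | count_job_states.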
Sample Account History
$ sacct --starttime=2010-01-01 --endtime=2010-01-10