Checking the status of your jobs
So you’ve launched a job, yay!
Let’s go over some ways to check the status of your job, see activity across the cluster, and delete a job.
Slurm: squeue
Running squeue (think: Slurm queue) from your terminal on a front-end will give you a table of statistics about all of the jobs currently running on that sub-cluster. The output will look like:
12 [bora] squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 4114     batch    run_2 pmcardle  R   21:17:52      2 bo[30-31]
 4113     batch    run_2 pmcardle  R   21:18:01      2 bo[28-29]
 4112     batch    run_2 pmcardle  R   21:18:16      2 bo[26-27]
 4111     batch    run_2 pmcardle  R   21:18:27      2 bo[02-03]
 4133     batch 80_nf5.5  ychuang  R    7:13:43      1 bo01
 4141     batch      ccc xliang06  R    3:49:10     20 bo[05-08,10-25]
 4137      hima 78_n9E5_  ychuang  R    6:44:10      1 hi01
 4136      hima 78_n9E5_  ychuang  R    6:47:10      1 hi01
 4135      hima 78_n9E5_  ychuang  R    6:54:52      1 hi01
 4134      hima 78_n9E5_  ychuang  R    6:59:18      1 hi01
You can also limit this to just the jobs under a particular username (your own or someone else's) with the -u flag:
13 [bora] squeue -u pmcardle
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 4114     batch    run_2 pmcardle  R   21:21:29      2 bo[30-31]
 4113     batch    run_2 pmcardle  R   21:21:38      2 bo[28-29]
 4112     batch    run_2 pmcardle  R   21:21:53      2 bo[26-27]
 4111     batch    run_2 pmcardle  R   21:22:04      2 bo[02-03]
This command lists:
- Job ID: The unique ID given to the job
- Name: The name of the job
- User: The user who launched the job
- Time: Time elapsed since the job started
- Nodes / Nodelist: The number of nodes reserved, and which nodes they are
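If you want different columns or wider fields, squeue also accepts a custom format string via its -o flag. As a minimal sketch (these are standard squeue format specifiers, but the widths are arbitrary and worth tuning for your cluster):

14 [bora] squeue -u pmcardle -o "%.8i %.9P %.12j %.10u %.2t %.10M %.6D %R"

Here %i is the job ID, %P the partition, %j the job name, %u the user, %t the state, %M the elapsed time, %D the node count, and %R the node list (or the reason the job is still waiting).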
Slurm: Deleting a job
You may want to delete a job before it reaches its walltime limit. Find the job ID (using squeue) and run scancel [JOB ID]. You should see Terminated, which indicates the job has been successfully deleted.
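For example, using the job IDs from the listing above (a sketch; scancel takes the job ID as its only required argument, and its -u flag cancels all of a user's jobs at once):

15 [bora] scancel 4114
16 [bora] scancel -u pmcardle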
Torque: qstat
For good measure, let's cover the equivalents on Torque. Running qstat from your terminal will give you a table with statistics about all of the jobs currently running on the sub-cluster you are logged into. The output of this command will look like:
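(The example output that originally appeared here is missing. As a purely illustrative sketch, with the server name, session ID, memory, and requested walltime invented, and the user, job name, node count, and elapsed time echoed from the squeue example above, the table has roughly this shape:

                                                           Req'd    Req'd     Elap
Job ID        Username Queue Jobname  SessID NDS TSK Memory    Time  S    Time
------------- -------- ----- -------- ------ --- --- ------ -------- - --------
4114.server   pmcardle batch run_2     12345   2  40    --   48:00:00 R 21:17:52

The S column is the job state; R means running.)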
To limit this to a particular username, we can again use the -u flag: qstat -u [USERNAME]. Or equivalently, the HPC has prepopulated all users' .bashrc files with the alias qsu for qstat -u, so you can just run qsu [USERNAME].
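The alias itself presumably looks something like this in your .bashrc (the exact definition on your system may differ):

alias qsu='qstat -u'

Since bash appends anything after the alias to the expanded command, qsu pmcardle runs qstat -u pmcardle.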
What does this table tell us?
- Job ID: The unique ID given to our job
- Username: The user who launched the job
- NDS: Number of nodes reserved
- TSK: Number of total processors reserved
- Req'd Time: Amount of walltime requested
- Elap Time: How long the job has been running. If the job is still queued and waiting to be launched, this field will look like -----------
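To inspect a single job in more detail, qstat also accepts a job ID directly, and its -f flag prints the full status record; both are standard Torque usage, sketched here with a placeholder ID:

qstat [JOB ID]
qstat -f [JOB ID]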
Torque: Deleting a job
If for any reason you decide that you want to cancel a job before it has reached its walltime limit, find the job ID (using either of the two methods on the previous page) and run qdel [JOB ID]. You should see Terminated, which indicates that the job has been successfully deleted.
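As a short sketch, borrowing the job ID from the examples above (substitute your own):

qsu pmcardle
qdel 4114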
PBSTOP
Torque also had a utility called pbstop (a modification of the base top command) that gave a very nice overview of activity across the cluster. We're not aware of a Slurm equivalent yet, but for historical reference, this is what pbstop output looked like: