Checking the status of your jobs
So you’ve launched a job, yay!
Let’s go over some ways to check the status of your job, see activity across the cluster, and delete a job.
Slurm: squeue
Running squeue (think: Slurm queue) from your terminal on a front-end will give you a table of statistics about all of the jobs currently running on that sub-cluster. The output will look like:
12 [bora] squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 4114     batch    run_2 pmcardle  R   21:17:52      2 bo[30-31]
 4113     batch    run_2 pmcardle  R   21:18:01      2 bo[28-29]
 4112     batch    run_2 pmcardle  R   21:18:16      2 bo[26-27]
 4111     batch    run_2 pmcardle  R   21:18:27      2 bo[02-03]
 4133     batch 80_nf5.5  ychuang  R    7:13:43      1 bo01
 4141     batch      ccc xliang06  R    3:49:10     20 bo[05-08,10-25]
 4137      hima 78_n9E5_  ychuang  R    6:44:10      1 hi01
 4136      hima 78_n9E5_  ychuang  R    6:47:10      1 hi01
 4135      hima 78_n9E5_  ychuang  R    6:54:52      1 hi01
 4134      hima 78_n9E5_  ychuang  R    6:59:18      1 hi01
You can also limit this to just the jobs under a particular username (your own or someone else's) with the -u flag:
13 [bora] squeue -u pmcardle
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 4114     batch    run_2 pmcardle  R   21:21:29      2 bo[30-31]
 4113     batch    run_2 pmcardle  R   21:21:38      2 bo[28-29]
 4112     batch    run_2 pmcardle  R   21:21:53      2 bo[26-27]
 4111     batch    run_2 pmcardle  R   21:22:04      2 bo[02-03]
This command lists:
- Job ID: The unique ID given to the job
- Name: The name of the job
- User: The user who launched the job
- Time: Time elapsed since the job started
- Nodes / Nodelist: The number of nodes reserved, and which nodes they are
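If you want different columns or wider fields, squeue also accepts a custom format string via its -o flag. As a minimal sketch (these are standard squeue format specifiers, but the widths are arbitrary and worth tuning for your cluster):

14 [bora] squeue -u pmcardle -o "%.8i %.9P %.12j %.10u %.2t %.10M %.6D %R"

Here %i is the job ID, %P the partition, %j the job name, %u the user, %t the state, %M the elapsed time, %D the node count, and %R the node list (or the reason the job is still waiting).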
Slurm: Deleting a job
You may want to delete a job before it reaches its walltime limit. Find the job ID (using squeue) and run scancel [JOB ID]. You should see Terminated, which indicates the job has been successfully deleted.
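For example, using the job IDs from the listing above (a sketch; scancel takes the job ID as its only required argument, and its -u flag cancels all of a user's jobs at once):

15 [bora] scancel 4114
16 [bora] scancel -u pmcardle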
Torque: qstat
For good measure, let's cover the equivalents on Torque. Running qstat from your terminal will give you a table with statistics about all of the jobs currently running on the sub-cluster you are logged into. The output of this command will look like:
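(The example output that originally appeared here is missing. As a purely illustrative sketch, with the server name, session ID, memory, and requested walltime invented, and the user, job name, node count, and elapsed time echoed from the squeue example above, the table has roughly this shape:

                                                           Req'd    Req'd     Elap
Job ID        Username Queue Jobname  SessID NDS TSK Memory    Time  S    Time
------------- -------- ----- -------- ------ --- --- ------ -------- - --------
4114.server   pmcardle batch run_2     12345   2  40    --   48:00:00 R 21:17:52

The S column is the job state; R means running.)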
To limit this to a particular username, we can again use the -u flag: qstat -u [USERNAME]. Or equivalently, the HPC has prepopulated all users' .bashrc files with the alias qsu for qstat -u, so you can just run qsu [USERNAME].
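The alias itself presumably looks something like this in your .bashrc (the exact definition on your system may differ):

alias qsu='qstat -u'

Since bash appends anything after the alias to the expanded command, qsu pmcardle runs qstat -u pmcardle.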
What does this table tell us?
- Job ID: The unique ID given to our job
- Username: The user who launched the job
- NDS: Number of nodes reserved
- TSK: Number of total processors reserved
- Req'd Time: Amount of walltime requested
- Elap Time: How long the job has been running. If the job is still queued and waiting to be launched, this field will look like -----------
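To inspect a single job in more detail, qstat also accepts a job ID directly, and its -f flag prints the full status record; both are standard Torque usage, sketched here with a placeholder ID:

qstat [JOB ID]
qstat -f [JOB ID]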
Torque: Deleting a job
If for any reason you decide that you want to cancel a job before it has reached its walltime limit, find the job ID (using either of the two methods on the previous page) and run qdel [JOB ID]. You should see Terminated, which indicates that the job has been successfully deleted.
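As a short sketch, borrowing the job ID from the examples above (substitute your own):

qsu pmcardle
qdel 4114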
PBSTOP
Torque also had a utility called pbstop (a modification of the base top command) that gave a very nice overview of activity across the cluster. We're not aware of a Slurm equivalent yet, but for historical reference, this is what pbstop output looked like: