mstk.scheduler.Slurm¶
- class mstk.scheduler.Slurm(queue, n_proc, n_gpu, n_node=0, exclude=None, env_cmd=None)¶
Slurm job scheduler with support for GPU allocation and MPI/OpenMP hybrid parallelization.
Slurm is a powerful and complicated job scheduler with a large number of configurations and options. It is not the goal of mstk to provide a comprehensive wrapper for Slurm. Therefore, it’s very likely that the job script generated by this class doesn’t fully fit the requirements of a specific computing center. In that case, the generated job script can be post-processed before it is submitted.
- Parameters:
queue (str) – The jobs will be submitted to this partition.
n_proc (int) – The number of CPU cores a job can use.
n_gpu (int) – The number of GPU cards a job can use.
n_node (int) – The number of nodes a job can use. If 0, it is decided by Slurm.
exclude (str, Optional) – The nodes to be excluded, in Slurm format.
env_cmd (str, Optional) – The commands for setting up the environment before running real calculations. They are inserted at the top of job scripts.
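To give a feel for how these parameters relate to Slurm itself, here is a minimal sketch that maps them onto standard `#SBATCH` directives. The function `sbatch_header` and its `job_name` argument are hypothetical, and the mapping (for example, `n_proc` to `--ntasks`) is an assumption; the header that mstk actually writes may differ.

```python
def sbatch_header(queue, n_proc, n_gpu, n_node=0, exclude=None, job_name="job"):
    """Build illustrative #SBATCH lines (hypothetical helper, not part of mstk)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={queue}",
        # Assumption: one task per requested CPU core.
        f"#SBATCH --ntasks={n_proc}",
    ]
    if n_gpu > 0:
        lines.append(f"#SBATCH --gres=gpu:{n_gpu}")
    if n_node > 0:
        # With n_node == 0 the directive is omitted and Slurm decides.
        lines.append(f"#SBATCH --nodes={n_node}")
    if exclude:
        lines.append(f"#SBATCH --exclude={exclude}")
    return lines


if __name__ == "__main__":
    print("\n".join(sbatch_header("gpu", 8, 1, exclude="node[01-02]")))
```

Leaving out `--nodes` when `n_node` is 0 mirrors the documented behaviour that Slurm then chooses the node count.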
- queue¶
The jobs will be submitted to this queue.
- Type:
str
- n_proc¶
The number of CPU cores a job can use.
- Type:
int
- n_gpu¶
The number of GPU cards a job can use.
- Type:
int
- n_node¶
The number of nodes a job can use. If 0, it is decided by Slurm.
- Type:
int
- exclude¶
The nodes to be excluded, in Slurm format
- Type:
str
- env_cmd¶
The commands for setting up the environment before running real calculations.
- Type:
str
- sh¶
The default name of the job script
- Type:
str
- max_running_hour¶
The wall time limit for a job in hours.
- Type:
int
- username¶
The current user
- Type:
str
- cached_jobs_expire¶
The lifetime of cached jobs in seconds.
- Type:
int
- submit_cmd¶
The command for submitting the job script. It is sbatch by default, but extra arguments can be provided, e.g. sbatch --qos=debug.
- Type:
str
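Since `submit_cmd` may carry extra arguments, it has to be split into separate tokens before the script name is appended. The sketch below shows one way to do that with the standard library. The helper `build_submit_argv` and the file name `job.sh` are hypothetical, and this is not necessarily how mstk invokes the command.

```python
import shlex


def build_submit_argv(submit_cmd, sh):
    """Split a submit command string into argv and append the script path."""
    # shlex.split honours quoting, so 'sbatch --comment="two words"' also works.
    return shlex.split(submit_cmd) + [sh]


if __name__ == "__main__":
    print(build_submit_argv("sbatch --qos=debug", "job.sh"))
```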
Methods
__init__(queue, n_proc, n_gpu[, n_node, ...])
download(**kwargs) – Download the simulation files to target folder.
generate_sh(commands, name[, workdir, sh, ...]) – Generate a shell script for commands to be executed by the job scheduler on compute nodes.
get_all_jobs() – Retrieve all the jobs that are currently managed by job scheduler.
get_job_from_name(name) – Get the job with specified name.
is_running(name) – Check whether or not a job is pending or running (not killed or finished or failed).
is_working() – Check whether or not Slurm is working normally on this machine.
kill_job(name) – Kill a job which has the specified name.
submit([sh]) – Submit a job script to scheduler.
upload(**kwargs) – Upload the simulation files to target folder.
Attributes
all_jobs – Retrieve all the jobs that are currently managed by job scheduler.
is_remote – Whether or not this is a remote job scheduler
n_running_jobs – The number of jobs that are currently pending or running (not killed or finished or failed)
- is_remote = False¶
Whether or not this is a remote job scheduler
- is_working() bool¶
Check whether or not Slurm is working normally on this machine.
It calls sinfo --version and checks the output.
- Returns:
is
- Return type:
bool
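A check of this kind can be sketched with `subprocess`. The function name `scheduler_responds`, the timeout, and the exact test applied to the output are assumptions for illustration; the source does not state what mstk looks for in the output.

```python
import subprocess


def scheduler_responds(cmd=("sinfo", "--version"), timeout=10):
    """Return True if the command runs and reports a Slurm version."""
    try:
        result = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired):
        # Command not installed, not executable, or hanging.
        return False
    # Assumption: a healthy installation prints a line starting with "slurm".
    return result.returncode == 0 and result.stdout.lower().startswith("slurm")


if __name__ == "__main__":
    print(scheduler_responds())
```

Catching `OSError` makes the check return `False` on machines where Slurm is not installed, instead of raising.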
- generate_sh(commands, name, workdir=None, sh=None, id_prior=None, **kwargs)¶
Generate a shell script for commands to be executed by the job scheduler on compute nodes.
Because of the complexity of Slurm configurations, it’s probable that the job script generated here is not fully valid. In that case, the generated job script can be post-processed before it is submitted.
- Parameters:
commands (list of str) – The commands to be executed step by step
name (str) – The name of the job to be submitted
workdir (str, Optional) – The working directory of this job
id_prior (int, Optional) – The id of prior job this one depends on
sh (str, Optional) – The name (path) of shell script to be generated. If not set, the default sh will be used.
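The post-processing suggested above could be as simple as adding site-specific directives to the script text. The following is a minimal sketch under the assumption that the generated script contains `#SBATCH` lines near the top. `add_sbatch_directives` is a hypothetical helper, and `--account=proj01` is only an example value.

```python
def add_sbatch_directives(script_text, directives):
    """Insert extra '#SBATCH <directive>' lines after the last existing one."""
    lines = script_text.splitlines()
    positions = [i for i, line in enumerate(lines) if line.startswith("#SBATCH")]
    # Without any #SBATCH line, fall back to just after the first line (shebang).
    insert_at = positions[-1] + 1 if positions else min(1, len(lines))
    lines[insert_at:insert_at] = [f"#SBATCH {d}" for d in directives]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    script = "#!/bin/bash\n#SBATCH --partition=cpu\n#SBATCH --ntasks=4\n\necho hello\n"
    print(add_sbatch_directives(script, ["--account=proj01"]))
```

In practice the text would be read from the file written by generate_sh and written back before calling submit.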
- submit(sh=None)¶
Submit a job script to scheduler.
- Parameters:
sh (str, optional) – The file name of the job script. If not set, the default sh will be used.
- Returns:
id – Job ID. -1 means failed
- Return type:
int
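The return convention (a job ID, or -1 on failure) can be illustrated by parsing the message that `sbatch` prints on success, `Submitted batch job <id>`. The helper `parse_job_id` is hypothetical, and whether mstk extracts the ID this way is an assumption.

```python
import re


def parse_job_id(sbatch_stdout):
    """Extract the job ID from sbatch output, or return -1 if none is found."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    return int(match.group(1)) if match else -1


if __name__ == "__main__":
    print(parse_job_id("Submitted batch job 12345\n"))
    print(parse_job_id("sbatch: error: invalid partition specified\n"))
```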
- kill_job(name) bool¶
Kill a job which has the specified name.
- Parameters:
name (str) – The name of the job to kill.
- Returns:
killed
- Return type:
bool
- get_all_jobs()¶
Retrieve all the jobs that are currently managed by job scheduler.
It calls scontrol show job for Slurm or qstat -f -u for Torque to get the list of jobs. If some jobs have finished long max_running_hour ago (depending on the settings of the job scheduler on the machine), they may have disappeared from the list output by the job scheduler.
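For orientation, `scontrol show job` prints one block of `Key=Value` pairs per job, with blocks separated by blank lines. The sketch below pulls a few fields out of such text. The function `parse_scontrol_jobs` and the shortened sample are assumptions for illustration: real output carries many more fields, and this is not necessarily how mstk parses it.

```python
import re


def parse_scontrol_jobs(text):
    """Parse 'scontrol show job'-style text into a list of small dicts."""
    jobs = []
    for record in re.split(r"\n\s*\n", text.strip()):
        # Values containing spaces are truncated; fine for the fields used here.
        fields = dict(re.findall(r"(\w+)=(\S+)", record))
        if "JobId" in fields:
            jobs.append({
                "id": int(fields["JobId"]),
                "name": fields.get("JobName"),
                "state": fields.get("JobState"),
            })
    return jobs


if __name__ == "__main__":
    sample = (
        "JobId=101 JobName=npt\n"
        "   UserId=alice(1001) GroupId=users(100)\n"
        "   JobState=RUNNING Reason=None\n"
        "\n"
        "JobId=102 JobName=nvt\n"
        "   UserId=alice(1001) GroupId=users(100)\n"
        "   JobState=PENDING Reason=Priority\n"
    )
    for job in parse_scontrol_jobs(sample):
        print(job)
```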
The difference between this method and the property all_jobs is that this method does not use the cached job list. Therefore it gets the up-to-date jobs, but it also puts more pressure on the job scheduler. Usually, just call all_jobs instead of this method.
- Returns:
jobs
- Return type:
list of PbsJob
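The trade-off between the cached all_jobs property and a fresh get_all_jobs() call can be sketched with a small time-based cache, where the expiry plays the role of cached_jobs_expire. The class `CachedJobList` and its arguments are hypothetical; mstk's internal caching may be implemented differently.

```python
import time


class CachedJobList:
    """Cache the result of an expensive job query for a fixed lifetime."""

    def __init__(self, fetch, expire_seconds, clock=time.monotonic):
        self._fetch = fetch            # callable that queries the scheduler
        self._expire = expire_seconds  # lifetime of the cached list in seconds
        self._clock = clock            # injectable clock, handy for testing
        self._jobs = None
        self._stamp = None

    def get(self):
        """Return the cached list, refreshing it only after it has expired."""
        now = self._clock()
        if self._stamp is None or now - self._stamp >= self._expire:
            self._jobs = self._fetch()
            self._stamp = now
        return self._jobs


if __name__ == "__main__":
    cache = CachedJobList(lambda: ["job-a", "job-b"], expire_seconds=30)
    print(cache.get())  # first call queries the scheduler
    print(cache.get())  # second call is served from the cache
```

Repeated polling then hits the scheduler at most once per expiry window, which is the reason the cached property is preferred for routine use.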