mstk.scheduler.Scheduler

class mstk.scheduler.Scheduler(queue=None, n_proc=1, n_gpu=0, env_cmd=None)

Base class for job schedulers.

Scheduler should not be constructed directly. Use its subclasses instead.

queue

The jobs will be submitted to this queue.

Type:

str

n_proc

The number of CPU cores a job can use.

Type:

int

n_gpu

The number of GPU cards a job can use.

Type:

int

env_cmd

The commands for setting up the environment before running real calculations.

Type:

str, Optional

sh

The default name of the job script

Type:

str

max_running_hour

The wall time limit for a job in hours.

Type:

int

username

The current user

Type:

str

cached_jobs_expire

The lifetime of cached jobs in seconds.

Type:

int

Methods

__init__([queue, n_proc, n_gpu, env_cmd])

download(**kwargs)

Download the simulation files to target folder.

generate_sh(commands, name[, workdir, sh])

Generate a shell script for commands to be executed by the job scheduler on compute nodes.

get_all_jobs()

Retrieve all the jobs that are currently managed by job scheduler.

get_job_from_name(name)

Get the job with specified name.

is_running(name)

Check whether or not a job is pending or running (not killed or finished or failed).

is_working()

Whether or not this job scheduler is running normally.

kill_job(name)

Kill a job which has the specified name.

submit([sh])

Submit a job script to scheduler.

upload(**kwargs)

Upload the simulation files to target folder.

Attributes

all_jobs

Retrieve all the jobs that are currently managed by job scheduler.

is_remote

Whether or not this is a remote job scheduler

n_running_jobs

The number of jobs that are currently pending or running (not killed or finished or failed)

is_remote = False

Whether or not this is a remote job scheduler

property all_jobs

Retrieve all the jobs that are currently managed by job scheduler.

It calls scontrol show job for Slurm or qstat -f -u for Torque to get the list of jobs. Jobs that finished long ago (how long depends on the job scheduler settings on the machine) may have disappeared from the list returned by the job scheduler.

To avoid putting too much pressure on the job scheduler, the job list is cached. This property returns the cached results without calling scontrol or qstat if the cache has not expired. The lifetime of the cache, in seconds, is set by the attribute cached_jobs_expire.

Returns:

jobs

Return type:

list of PbsJob
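The caching behaviour described for all_jobs can be illustrated with a minimal, self-contained sketch. It does not use mstk: CachedJobList, fake_fetch and the expire_seconds argument are hypothetical names, and the real implementation may differ.

```python
import time


class CachedJobList:
    """Hypothetical helper that mimics a time-based job cache (not part of mstk)."""

    def __init__(self, fetch, expire_seconds=60):
        self._fetch = fetch            # callable that queries the scheduler
        self._expire = expire_seconds  # plays the role of cached_jobs_expire
        self._jobs = []
        self._updated_at = None        # monotonic timestamp of the last refresh

    @property
    def all_jobs(self):
        now = time.monotonic()
        # Refresh only when nothing is cached yet or the cache has expired.
        if self._updated_at is None or now - self._updated_at >= self._expire:
            self._jobs = self._fetch()
            self._updated_at = now
        return self._jobs


calls = []


def fake_fetch():
    # Stand-in for running scontrol or qstat; records each invocation.
    calls.append(1)
    return ['job-a', 'job-b']


cache = CachedJobList(fake_fetch, expire_seconds=60)
first = cache.all_jobs   # triggers one query
second = cache.all_jobs  # served from the cache
```

With a lifetime of 60 seconds, the second access reuses the stored list, so the stand-in query runs only once.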

is_working()

Whether or not this job scheduler is running normally.

Returns:

is

Return type:

bool

generate_sh(commands, name, workdir=None, sh=None)

Generate a shell script for commands to be executed by the job scheduler on compute nodes.

Parameters:
  • commands (str) – List of commands to be executed by the job scheduler on compute node step by step.

  • name (str) – The name of the job to be submitted.

  • workdir (str, Optional) – The working directory.

  • sh (str, Optional) – The name (path) of the shell script being written. If not set, the default sh is used.
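As a rough illustration of what generating a job script involves, the sketch below writes a minimal Slurm-style script. It is not the mstk implementation: write_job_script, its queue and n_proc arguments, the default file name, and the choice to pass commands as a list of strings are all assumptions made for this example, and the actual header depends on the scheduler subclass.

```python
import os
import tempfile


def write_job_script(commands, name, workdir=None, sh='_job.sh',
                     queue='cpu', n_proc=1):
    """Write a minimal Slurm-style job script and return its path (hypothetical)."""
    workdir = workdir or os.getcwd()
    lines = [
        '#!/bin/bash',
        f'#SBATCH --job-name={name}',
        f'#SBATCH --partition={queue}',
        f'#SBATCH --ntasks={n_proc}',
        '',
        f'cd "{workdir}"',  # run the commands from the working directory
    ]
    lines.extend(commands)  # one command per line, executed in order
    with open(sh, 'w') as f:
        f.write('\n'.join(lines) + '\n')
    return sh


with tempfile.TemporaryDirectory() as tmp:
    path = write_job_script(['echo start', 'echo done'], name='demo',
                            workdir=tmp, sh=os.path.join(tmp, '_job.sh'))
    with open(path) as f:
        script = f.read()
```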

upload(**kwargs) → bool

Upload the simulation files to target folder.

This method should be implemented by subclasses that are remote schedulers, as indicated by the attribute is_remote. If the scheduler is not remote, this method simply returns True.

Returns:

successful – Whether or not the upload is successful

Return type:

bool

download(**kwargs) → bool

Download the simulation files to target folder.

This method should be implemented by subclasses that are remote schedulers, as indicated by the attribute is_remote. If the scheduler is not remote, this method simply returns True.

Returns:

successful – Whether or not the download is successful

Return type:

bool
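The contract of upload and download can be sketched with stand-in classes. Nothing here comes from mstk: LocalStub and CopyingRemoteStub are hypothetical, and a local directory plays the role of the remote host so that the example runs anywhere.

```python
import shutil
import tempfile
from pathlib import Path


class LocalStub:
    """Stand-in for a non-remote scheduler: transfers are no-ops."""
    is_remote = False

    def upload(self, **kwargs) -> bool:
        return True

    def download(self, **kwargs) -> bool:
        return True


class CopyingRemoteStub(LocalStub):
    """Stand-in for a remote scheduler; remote_dir is a local directory here."""
    is_remote = True

    def __init__(self, local_dir, remote_dir):
        self.local_dir = Path(local_dir)
        self.remote_dir = Path(remote_dir)

    def _copy(self, src, dst) -> bool:
        try:
            shutil.copytree(src, dst, dirs_exist_ok=True)  # Python 3.8+
        except OSError:
            return False  # report failure instead of raising
        return True

    def upload(self, **kwargs) -> bool:
        return self._copy(self.local_dir, self.remote_dir)

    def download(self, **kwargs) -> bool:
        return self._copy(self.remote_dir, self.local_dir)


with tempfile.TemporaryDirectory() as tmp:
    local = Path(tmp) / 'local'
    remote = Path(tmp) / 'remote'
    local.mkdir()
    (local / 'input.txt').write_text('data')
    scheduler = CopyingRemoteStub(local, remote)
    uploaded = scheduler.upload()
    copied = (remote / 'input.txt').read_text()
    missing = CopyingRemoteStub(Path(tmp) / 'absent', remote).upload()
```

Returning a bool rather than raising mirrors the documented return value, so callers can check the result before submitting a job.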

submit(sh=None, **kwargs)

Submit a job script to scheduler.

Parameters:

sh (str, Optional) – The file name of the job script. If not set, the default sh is used.

Returns:

id – Job ID. -1 means the submission failed.

Return type:

int

get_job_from_name(name)

Get the job with specified name.

If no such job can be found, None is returned. If several jobs have the same name, the most recently submitted one is returned.

Parameters:

name (str) –

Returns:

job

Return type:

PbsJob or None
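The lookup rule described above (None when nothing matches, otherwise the most recent submission) can be sketched as follows. JobRecord is a hypothetical stand-in for PbsJob with assumed field names, and the sketch assumes that a larger job ID means a later submission.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JobRecord:
    """Hypothetical stand-in for a job entry; field names are assumptions."""
    id: int
    name: str
    state: str


def find_job_by_name(jobs: List[JobRecord], name: str) -> Optional[JobRecord]:
    matches = [job for job in jobs if job.name == name]
    if not matches:
        return None
    # Assumption: job IDs increase with submission order.
    return max(matches, key=lambda job: job.id)


jobs = [
    JobRecord(101, 'npt', 'done'),
    JobRecord(105, 'npt', 'running'),
    JobRecord(103, 'nvt', 'pending'),
]
latest_npt = find_job_by_name(jobs, 'npt')
absent = find_job_by_name(jobs, 'minimize')
```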

is_running(name)

Check whether or not a job is pending or running (not killed or finished or failed).

Parameters:

name (str) –

Returns:

is

Return type:

bool

kill_job(name)

Kill a job which has the specified name.

Parameters:

name (str) –

Returns:

killed

Return type:

bool

get_all_jobs()

Retrieve all the jobs that are currently managed by job scheduler.

It calls scontrol show job for Slurm or qstat -f -u for Torque to get the list of jobs. Jobs that finished long ago (how long depends on the job scheduler settings on the machine) may have disappeared from the list returned by the job scheduler.

Unlike the property all_jobs, this method does not use the cached job list. It therefore returns up-to-date jobs, but it also puts more pressure on the job scheduler. In most cases, use all_jobs instead of this method.

Returns:

jobs

Return type:

list of PbsJob

property n_running_jobs: int

The number of jobs that are currently pending or running (not killed or finished or failed)

Only jobs that belong to the specified queue are counted.

Returns:

n

Return type:

int
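The counting rule for n_running_jobs can be sketched without mstk. QueuedJob, its field names, and the state strings 'pending' and 'running' are assumptions for this example, not the library's actual job representation.

```python
from dataclasses import dataclass


@dataclass
class QueuedJob:
    """Hypothetical job entry; field names and state strings are assumptions."""
    name: str
    queue: str
    state: str


def count_active_jobs(jobs, queue):
    """Count jobs in the given queue that are pending or running."""
    active_states = {'pending', 'running'}
    return sum(1 for job in jobs
               if job.queue == queue and job.state in active_states)


jobs = [
    QueuedJob('a', 'cpu', 'running'),
    QueuedJob('b', 'cpu', 'pending'),
    QueuedJob('c', 'cpu', 'failed'),
    QueuedJob('d', 'gpu', 'running'),
]
n_cpu = count_active_jobs(jobs, 'cpu')
n_gpu = count_active_jobs(jobs, 'gpu')
```

Jobs in other queues and jobs that were killed, finished, or failed are excluded from the count.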