mstk.scheduler.RemoteSlurm

class mstk.scheduler.RemoteSlurm(queue, n_proc, n_gpu, host, username, remote_dir, port=22, n_node=0, exclude=None, env_cmd=None)

Slurm job scheduler running on a remote machine.

Parameters:
  • queue (str) – The jobs will be submitted to this partition.

  • n_proc (int) – The number of CPU cores a job can use.

  • n_gpu (int) – The number of GPU cards a job can use.

  • host (str) – The IP address of the remote host that is running Slurm.

  • username (str) – The username for logging in to the remote host.

  • remote_dir (str) – The default directory on the remote host in which calculations are run.

  • port (int) – The SSH port for logging in to the remote host.

  • n_node (int) – The number of nodes a job can use. If 0, the node count is decided by Slurm.

  • exclude (str, Optional) – The nodes to be excluded, in Slurm format.

  • env_cmd (str, Optional) – The commands for setting up the environment before running real calculations. They will be inserted at the top of job scripts.
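A minimal construction sketch, assuming only the signature shown above. Every value below (partition name, IP address, username, directory, `module load` line) is a placeholder for illustration, not a default of the library. `make_scheduler` is a hypothetical helper that imports mstk lazily, so the settings can be inspected without mstk installed.

```python
# Placeholder values for illustration only; replace with your cluster's details.
settings = {
    "queue": "gpu",                    # Slurm partition name (hypothetical)
    "n_proc": 8,                       # number of CPU cores per job
    "n_gpu": 1,                        # number of GPU cards per job
    "host": "192.0.2.10",              # documentation-range IP, not a real host
    "username": "alice",               # hypothetical login name
    "remote_dir": "/home/alice/jobs",  # hypothetical remote working directory
    "port": 22,                        # SSH port (22 is the documented default)
    "n_node": 0,                       # 0 leaves the node count to Slurm
    "exclude": None,                   # or a Slurm node list such as "node[01-03]"
    "env_cmd": "module load gromacs",  # hypothetical environment setup command
}


def make_scheduler(settings):
    """Build a RemoteSlurm from keyword settings (requires mstk to be installed)."""
    from mstk.scheduler import RemoteSlurm  # imported lazily on purpose
    return RemoteSlurm(**settings)
```

Calling `make_scheduler(settings)` would then return the scheduler object whose attributes and methods are described below.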

queue

The jobs will be submitted to this queue (Slurm partition).

Type:

str

n_proc

The number of CPU cores a job can use.

Type:

int

n_gpu

The number of GPU cards a job can use.

Type:

int

n_node

The number of nodes a job can use. If 0, the node count is decided by Slurm.

Type:

int

exclude

The nodes to be excluded, in Slurm format

Type:

str, Optional

env_cmd

The commands for setting up the environment before running real calculations.

Type:

str

sh

The default name of the job script

Type:

str

host

The IP address of the remote host that is running Slurm

Type:

str

username

The username for logging in to the remote host

Type:

str

remote_dir

The default directory on the remote host in which calculations are run

Type:

str

port

The SSH port for logging in to the remote host

Type:

int

max_running_hour

The wall time limit for a job in hours.

Type:

int

cached_jobs_expire

The lifetime of cached jobs in seconds.

Type:

int

submit_cmd

The command for submitting the job script. It is sbatch by default, but extra arguments can be provided, e.g. sbatch --qos=debug.

Type:

str
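To illustrate how extra arguments in `submit_cmd` combine with a script name, the sketch below splits the command string into an argument list with the standard `shlex` module. `build_submit_argv` and the script name `job.sh` are hypothetical; how mstk itself assembles the command is not shown in this reference.

```python
import shlex


def build_submit_argv(submit_cmd, script):
    """Split a submit command such as 'sbatch --qos=debug' and append the script."""
    return shlex.split(submit_cmd) + [script]


print(build_submit_argv("sbatch", "job.sh"))
# ['sbatch', 'job.sh']
print(build_submit_argv("sbatch --qos=debug", "job.sh"))
# ['sbatch', '--qos=debug', 'job.sh']
```

Splitting with `shlex` rather than `str.split` keeps quoted arguments intact if the command ever contains them.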

Methods

__init__(queue, n_proc, n_gpu, host, ...[, ...])

download([remote_dir, local_dir])

Download all the files in the remote directory to the current local directory.

generate_sh(commands, name[, workdir, sh, ...])

Generate a shell script for commands to be executed by the job scheduler on compute nodes.

get_all_jobs()

Retrieve all the jobs that are currently managed by job scheduler.

get_job_from_name(name)

Get the job with specified name.

is_running(name)

Check whether or not a job is pending or running (not killed or finished or failed).

is_working()

Check whether or not Slurm is working normally on the remote machine.

kill_job(name)

Kill a job which has the specified name.

submit([sh, remote_dir])

Submit a job script to the Slurm scheduler on the remote machine.

upload([local_dir, remote_dir])

Upload all the files in the current local directory to the remote directory.

Attributes

all_jobs

Retrieve all the jobs that are currently managed by job scheduler.

is_remote

Whether or not this is a remote job scheduler

n_running_jobs

The number of jobs that are currently pending or running (not killed, finished, or failed)

is_remote = True

Whether or not this is a remote job scheduler

is_working() → bool

Check whether or not Slurm is working normally on the remote machine.

It calls sinfo --version and checks the output.

Returns:

is

Return type:

bool
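The check described above can be sketched independently of mstk as follows. This is an illustration under stated assumptions, not the library's implementation: it assumes an OpenSSH `ssh` client on the local machine, and that `sinfo --version` prints a line beginning with `slurm` (for example `slurm 23.02.6`). The helper names are hypothetical.

```python
import subprocess


def ssh_argv(host, username, port, remote_command):
    """Argument list for running one command on the remote host with OpenSSH."""
    return ["ssh", "-p", str(port), f"{username}@{host}", remote_command]


def looks_like_slurm(output):
    """Assumption: `sinfo --version` output starts with the word 'slurm'."""
    return output.strip().lower().startswith("slurm")


def remote_slurm_is_working(host, username, port=22, timeout=30):
    """Run `sinfo --version` over SSH and inspect what comes back."""
    try:
        result = subprocess.run(
            ssh_argv(host, username, port, "sinfo --version"),
            capture_output=True, text=True, timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False  # ssh missing, unreachable host, or no answer in time
    return result.returncode == 0 and looks_like_slurm(result.stdout)


# Only the pure helpers are exercised here; no network connection is made.
print(ssh_argv("192.0.2.10", "alice", 22, "sinfo --version"))
print(looks_like_slurm("slurm 23.02.6\n"))
```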

upload(local_dir=None, remote_dir=None)

Upload all the files in the current local directory to the remote directory.

Parameters:
  • local_dir (dir, optional) – If not set, will use the current dir.

  • remote_dir (dir, optional) – If not set, will use the default remote_dir.

Returns:

successful – Whether or not the upload is successful

Return type:

bool

download(remote_dir=None, local_dir=None) → bool

Download all the files in the remote directory to the current local directory.

Parameters:
  • remote_dir (dir, optional) – If not set, will use the default remote_dir.

  • local_dir (dir, optional) – If not set, will use the current dir.

Returns:

successful – Whether or not the download is successful

Return type:

bool

submit(sh=None, remote_dir=None)

Submit a job script to the Slurm scheduler on the remote machine.

Parameters:
  • sh (str) – The job script to be submitted.

  • remote_dir (str) – The directory on the remote machine from which the script is submitted.

Returns:

id

Return type:

int
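The methods documented so far suggest an upload, submit, wait, download cycle. The sketch below shows one way to chain them. It assumes that calling `upload()`, `submit()` and `download()` without arguments falls back to the documented defaults, and that `is_running(name)` returns a boolean. `run_remote_job` and `FakeScheduler` are hypothetical; the fake exists only so the sketch runs without a cluster.

```python
import time


def run_remote_job(scheduler, name, poll_seconds=30.0, sleep=time.sleep):
    """Upload inputs, submit, wait until the job leaves the queue, download results."""
    if not scheduler.upload():
        raise RuntimeError("upload failed")
    job_id = scheduler.submit()
    while scheduler.is_running(name):
        sleep(poll_seconds)  # avoid hammering the scheduler between polls
    if not scheduler.download():
        raise RuntimeError("download failed")
    return job_id


class FakeScheduler:
    """Hypothetical stand-in that mimics the documented method signatures."""

    def __init__(self, polls_until_done=2):
        self.polls_left = polls_until_done
        self.calls = []

    def upload(self, local_dir=None, remote_dir=None):
        self.calls.append("upload")
        return True

    def submit(self, sh=None, remote_dir=None):
        self.calls.append("submit")
        return 12345  # made-up job id

    def is_running(self, name):
        self.calls.append("is_running")
        self.polls_left -= 1
        return self.polls_left >= 0

    def download(self, remote_dir=None, local_dir=None):
        self.calls.append("download")
        return True


fake = FakeScheduler()
job_id = run_remote_job(fake, "demo", sleep=lambda seconds: None)
print(job_id, fake.calls)
```

With a real scheduler object, the same helper would be called as `run_remote_job(scheduler, name)`, where `name` is the job name used when the script was generated.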

kill_job(name) → bool

Kill a job which has the specified name.

Parameters:

name (str) – The name of the job to kill.

Returns:

killed

Return type:

bool

get_all_jobs()

Retrieve all the jobs that are currently managed by job scheduler.

It calls scontrol show job for Slurm or qstat -f -u for Torque to get the list of jobs. If a job finished long ago (how long depends on the job scheduler's configuration on the machine), it may have disappeared from the list reported by the job scheduler.

The difference between this method and the property all_jobs is that this method does not use the cached job list. It therefore returns up-to-date jobs, but it also puts more load on the job scheduler. Usually, just use all_jobs instead of this method.

Returns:

jobs

Return type:

list of PbsJob
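The trade-off between `get_all_jobs()` and the cached `all_jobs` property can be illustrated with a small time-based cache. This is a generic sketch, not mstk's code: `CachedJobList`, `fetch` and `clock` are hypothetical names, `expire_seconds` plays the role of `cached_jobs_expire`, and whether the real `get_all_jobs()` also refreshes the cache is an assumption made here.

```python
import time


class CachedJobList:
    """Serve a job list from cache until it is older than expire_seconds."""

    def __init__(self, fetch, expire_seconds, clock=time.monotonic):
        self._fetch = fetch            # callable that queries the scheduler
        self._expire = expire_seconds  # cache lifetime in seconds
        self._clock = clock            # injectable for testing
        self._jobs = None
        self._stamp = None

    def get_all_jobs(self):
        """Always query the scheduler, and remember the result."""
        self._jobs = self._fetch()
        self._stamp = self._clock()
        return self._jobs

    @property
    def all_jobs(self):
        """Return cached jobs, querying again only when the cache has expired."""
        if self._stamp is None or self._clock() - self._stamp >= self._expire:
            return self.get_all_jobs()
        return self._jobs


# Demonstration with a fake clock and a counter instead of a real scheduler.
now = [0.0]
queries = []
cache = CachedJobList(lambda: queries.append(1) or ["job-a"], 60, clock=lambda: now[0])
cache.all_jobs        # first access queries the scheduler
cache.all_jobs        # served from cache
now[0] = 61.0
cache.all_jobs        # cache expired, queries again
print(len(queries))   # 2
```

Repeated reads within the lifetime cost nothing extra, which is why the property is the usual choice for polling loops.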