mstk.scheduler.RemoteSlurm¶
- class mstk.scheduler.RemoteSlurm(queue, n_proc, n_gpu, host, username, remote_dir, port=22, n_node=0, exclude=None, env_cmd=None)¶
Slurm job scheduler running on a remote machine.
- Parameters:
queue (str) – The jobs will be submitted to this partition.
n_proc (int) – The number of CPU cores a job can use.
n_gpu (int) – The number of GPU cards a job can use.
host (str) – The IP address of the remote host that is running Slurm.
username (str) – The username for logging in to the remote host.
remote_dir (str) – The default directory on the remote host in which calculations are run.
port (int) – The SSH port for logging in to the remote host.
n_node (int) – The number of nodes a job can use. If 0, the number is decided by Slurm.
exclude (str, Optional) – The nodes to be excluded, in Slurm format.
env_cmd (str, Optional) – The commands for setting up the environment before running the actual calculations. They are inserted at the top of job scripts.
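A minimal construction sketch, using only the signature documented above. The partition name, host address, username, directory, and environment command are placeholder values. This page does not say whether the constructor opens an SSH connection immediately, so the import and instantiation are kept inside a helper function (`make_scheduler`, a hypothetical name) that is only called when a real cluster is available.

```python
# Placeholder values throughout; replace them with your cluster's settings.
settings = dict(
    queue='gpu',                    # Slurm partition (placeholder name)
    n_proc=8,                       # number of CPU cores per job
    n_gpu=1,                        # number of GPU cards per job
    host='192.0.2.10',              # documentation-only IP address
    username='alice',               # placeholder login name
    remote_dir='/home/alice/jobs',  # placeholder working directory
    port=22,                        # documented default SSH port
    n_node=0,                       # 0 leaves the node count to Slurm
    exclude=None,                   # no nodes excluded
    env_cmd='module load gromacs',  # assumed environment setup command
)


def make_scheduler(options):
    """Create a RemoteSlurm from keyword options (requires mstk to be installed)."""
    from mstk.scheduler import RemoteSlurm
    return RemoteSlurm(**options)
```

Calling `make_scheduler(settings)` would then return the scheduler object whose attributes are listed below.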
- queue¶
The jobs will be submitted to this partition.
- Type:
str
- n_proc¶
The number of CPU cores a job can use.
- Type:
int
- n_gpu¶
The number of GPU cards a job can use.
- Type:
int
- n_node¶
The number of nodes a job can use. If 0, the number is decided by Slurm.
- Type:
int
- exclude¶
The nodes to be excluded, in Slurm format
- Type:
str, Optional
- env_cmd¶
The commands for setting up the environment before running the actual calculations.
- Type:
str
- sh¶
The default name of the job script
- Type:
str
- host¶
The IP address of the remote host that is running Slurm
- Type:
str
- username¶
The username for logging in to the remote host.
- Type:
str
- remote_dir¶
The default directory on the remote host in which calculations are run.
- Type:
str
- port¶
The SSH port for logging in to the remote host.
- Type:
int
- max_running_hour¶
The wall time limit for a job in hours.
- Type:
int
- cached_jobs_expire¶
The lifetime of cached jobs in seconds.
- Type:
int
- submit_cmd¶
The command for submitting the job script. It is sbatch by default, but extra arguments can be provided, e.g. sbatch --qos=debug.
- Type:
str
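As an illustration of what a submit_cmd with extra arguments amounts to on the command line, the sketch below tokenizes the string the way a POSIX shell would. It only uses the standard library; how mstk itself invokes the command is not described on this page, and the script name `job.sh` is a hypothetical placeholder.

```python
import shlex

submit_cmd = 'sbatch --qos=debug'  # the example value from the description above
script = 'job.sh'                  # hypothetical job script name

# Split the command into tokens and append the script to be submitted.
argv = shlex.split(submit_cmd) + [script]
print(argv)  # ['sbatch', '--qos=debug', 'job.sh']
```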
Methods
__init__(queue, n_proc, n_gpu, host, ...[, ...])
download([remote_dir, local_dir]) – Download all the files in the remote directory to the current local directory.
generate_sh(commands, name[, workdir, sh, ...]) – Generate a shell script for commands to be executed by the job scheduler on compute nodes.
get_all_jobs() – Retrieve all the jobs that are currently managed by the job scheduler.
get_job_from_name(name) – Get the job with the specified name.
is_running(name) – Check whether a job is pending or running (not killed, finished, or failed).
is_working() – Check whether Slurm is working normally on the remote machine.
kill_job(name) – Kill the job that has the specified name.
submit([sh, remote_dir]) – Submit a job script to the Slurm scheduler on the remote machine.
upload([local_dir, remote_dir]) – Upload all the files in the current local directory to the remote directory.
Attributes
all_jobs – Retrieve all the jobs that are currently managed by the job scheduler.
is_remote – Whether or not this is a remote job scheduler.
n_running_jobs – The number of jobs that are currently pending or running (not killed, finished, or failed).
- is_remote = True¶
Whether or not this is a remote job scheduler
- is_working() bool¶
Check whether or not Slurm is working normally on the remote machine.
It calls sinfo --version and checks the output.
- Returns:
is
- Return type:
bool
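The check described above can be sketched as follows. This is not mstk's implementation: the exact test it applies to the output is not stated here, so the sketch assumes the usual `slurm <version>` output of `sinfo --version`. The `run` hook is a hypothetical parameter; for a remote host it could wrap an SSH call instead of a local subprocess.

```python
import subprocess


def slurm_version(run=None):
    """Return the Slurm version string, or None if `sinfo --version` fails.

    `run` is a hypothetical hook that takes an argv list and returns the
    command's standard output as text. It defaults to a local subprocess call.
    """
    if run is None:
        def run(argv):
            return subprocess.run(
                argv, capture_output=True, text=True, check=True
            ).stdout
    try:
        output = run(['sinfo', '--version'])
    except (OSError, subprocess.SubprocessError):
        # Command missing, not executable, or exited with an error.
        return None
    parts = output.split()
    # Assumed output format: "slurm 23.02.6"
    if len(parts) >= 2 and parts[0].lower() == 'slurm':
        return parts[1]
    return None
```

A scheduler would be considered working when the function returns a version string rather than None.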
- upload(local_dir=None, remote_dir=None)¶
Upload all the files in the current local directory to the remote directory.
- Parameters:
local_dir (dir, optional) – If not set, will use the current dir.
remote_dir (dir, optional) – If not set, will use the default remote_dir.
- Returns:
successful – Whether or not the upload is successful
- Return type:
bool
- download(remote_dir=None, local_dir=None) bool¶
Download all the files in the remote directory to the current local directory.
- Parameters:
remote_dir (dir, optional) – If not set, will use the default remote_dir.
local_dir (dir, optional) – If not set, will use the current dir.
- Returns:
successful – Whether or not the download is successful
- Return type:
bool
- submit(sh=None, remote_dir=None)¶
Submit a job script to the Slurm scheduler on the remote machine.
- Parameters:
sh (str) – The job script to be submitted.
remote_dir (str) – The directory on the remote machine from which the script is submitted.
- Returns:
id
- Return type:
int
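The upload, submit, and download methods are typically used together. The sketch below chains them with is_running from the methods table, using only the documented names and keyword arguments. The helper `run_remote_job` and its `poll_seconds` parameter are hypothetical, and it assumes that `name` matches the job name used when the script was generated.

```python
import time


def run_remote_job(scheduler, name, sh=None, local_dir=None,
                   remote_dir=None, poll_seconds=30):
    """Upload inputs, submit the script, wait for the job, then fetch results."""
    if not scheduler.upload(local_dir=local_dir, remote_dir=remote_dir):
        raise RuntimeError('upload failed')
    job_id = scheduler.submit(sh=sh, remote_dir=remote_dir)
    # is_running() is true while the job is pending or running.
    while scheduler.is_running(name):
        time.sleep(poll_seconds)
    if not scheduler.download(remote_dir=remote_dir, local_dir=local_dir):
        raise RuntimeError('download failed')
    return job_id
```

Passing the scheduler in as an argument keeps the helper independent of how the object was constructed, so it can be exercised with a stand-in object before touching a real cluster.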
- kill_job(name) bool¶
Kill a job which has the specified name.
- Parameters:
name (str) – The name of the job to be killed.
- Returns:
killed
- Return type:
bool
- get_all_jobs()¶
Retrieve all the jobs that are currently managed by job scheduler.
It calls scontrol show job for Slurm or qstat -f -u for Torque to get the list of jobs. Jobs that finished long ago (how long depends on the job scheduler's configuration on the machine) may have disappeared from the list returned by the job scheduler.
The difference between this method and the property all_jobs is that this method does not use the cached job list. It therefore returns the up-to-date jobs, but it also puts more load on the job scheduler. Usually, just use all_jobs instead of this method.
- Returns:
jobs
- Return type:
list of PbsJob
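The trade-off between the uncached method and the cached property can be illustrated with a small time-based cache. This is a sketch of the general technique, not mstk's code: the class `CachedJobList` and its parameters are hypothetical, with `fetch` playing the role of get_all_jobs() and `expire_seconds` the role of cached_jobs_expire.

```python
import time


class CachedJobList:
    """Cache the result of an expensive job query for a fixed number of seconds."""

    def __init__(self, fetch, expire_seconds, clock=time.monotonic):
        self._fetch = fetch            # plays the role of get_all_jobs()
        self._expire = expire_seconds  # plays the role of cached_jobs_expire
        self._clock = clock            # injectable clock, useful for testing
        self._jobs = None
        self._stamp = None

    @property
    def all_jobs(self):
        now = self._clock()
        # Query the scheduler only when there is no cache or it has expired.
        if self._stamp is None or now - self._stamp >= self._expire:
            self._jobs = self._fetch()
            self._stamp = now
        return self._jobs
```

Repeated reads of `all_jobs` within the expiry window reuse the stored list, which is why the property is the lighter choice for frequent polling.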