Slurm job submission in Python¶
This package provides a thin Python layer on top of the Slurm workload manager for submitting Slurm batch scripts into a Slurm queue. Its core features are:
- Python classes representing a Slurm batch script,
- simple file transfer mechanism between shared file system and node,
- macro support based on Python’s format specification mini-language,
- JSON-encoding and decoding of Slurm batch scripts,
- new submission command ssub,
- successive submission of Slurm batch scripts, and
- rescue of failed jobs.
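The macro feature can be illustrated with plain str.format, since it is built on the same format specification mini-language. The following stdlib-only sketch mimics how a "{macros[...]}" placeholder is filled; it does not use pyssub itself:

```python
# A skeleton argument string with macro placeholders, as used in
# pyssub batch script skeletons.
arguments = "--in {macros[inputfile]} --out {macros[outputfile]}"

# Concrete macro values for one job.
macros = {
    "inputfile": "input_00.txt",
    "outputfile": "output_00.txt",
}

# str.format fills the placeholders via the format mini-language.
rendered = arguments.format(macros=macros)
print(rendered)  # --in input_00.txt --out output_00.txt
```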
This example shows how to submit a JSON-encoded Slurm batch script into a Slurm queue via ssub:
ssub submit --in pyssub_example.json --out pyssub_example.out
The JSON-encoded Slurm batch script pyssub_example.json has the following content:
{
   "pyssub_example": {
      "executable": "echo",
      "arguments": "'Hello World!'"
   }
}
A more detailed introduction is given in the Getting started guide.
Note¶
I wrote this package because I was working with a small Slurm cluster during my PhD. This cluster was configured in a way that made it easiest to submit multiple single-task Slurm batch scripts instead of a single multi-task Slurm batch script containing multiple srun commands. The package reflects this approach and therefore may not be the best solution for your cluster.
Installation¶
This package is a pure Python 3 package (it requires at least Python 3.6) and does not depend on any third-party package. All releases are uploaded to PyPI and the newest release can be installed via
pip install pyssub
I recommend creating a dedicated virtual Python 3 environment for the installation (e.g. via virtualenvwrapper):
source /usr/share/virtualenvwrapper/virtualenvwrapper.sh
mkvirtualenv -p /usr/bin/python3.6 -i pyssub py3-slurm
If you prefer to work with the newest revision, you can also install the package directly from GitHub:
pip install 'git+https://github.com/kkrings/pyssub#egg=pyssub'
Contributing¶
I welcome contributions, either via issues or pull requests. For the latter, please make sure that all unit tests pass. The unit tests can be executed via
python setup.py test
Getting started¶
Imagine you have an executable that you want to execute on a Slurm batch farm for a list of input files. Each job should process one input file. Both the executable and the input file should be copied to the computing node.
Create a skeleton batch script pyssub_example_one.json:

{
   "executable": "/home/ga65xaz/pyssub_example.py",
   "arguments": "--in {macros[inputfile]} --out {macros[outputfile]}",
   "options": {
      "job-name": "{macros[jobname]}",
      "ntasks": 1,
      "time": "00:10:00",
      "chdir": "/var/tmp",
      "error": "/scratch9/kkrings/logs/{macros[jobname]}.out",
      "output": "/scratch9/kkrings/logs/{macros[jobname]}.out"
   },
   "transfer_executable": true,
   "transfer_input_files": [
      "/scratch9/kkrings/{macros[inputfile]}"
   ],
   "transfer_output_files": [
      "/scratch9/kkrings/{macros[outputfile]}"
   ]
}
The script pyssub_example.py must be executable. In this example, we use macros, which are based on Python’s format specification mini-language, for the job name and for the file names of both the input and the output file.

Warning
In case of Python scripts, you have to be careful if the shebang starts with

#!/usr/bin/env python

because Slurm will transfer the user environment of the submit node to the computing node. This can lead to unwanted results if, for example, you use pyssub from within a dedicated virtual Python 3 environment that does not correspond to the one the Python script is supposed to use.

Create a batch script collection pyssub_example.json:

{
   "pyssub_example_00": {
      "script": "/home/ga65xaz/pyssub_example.script",
      "macros": {
         "jobname": "pyssub_example_00",
         "inputfile": "pyssub_example_input_00.txt",
         "outputfile": "pyssub_example_output_00.txt"
      }
   },
   "pyssub_example_01": {
      "script": "/home/ga65xaz/pyssub_example.script",
      "macros": {
         "jobname": "pyssub_example_01",
         "inputfile": "pyssub_example_input_01.txt",
         "outputfile": "pyssub_example_output_01.txt"
      }
   }
}
The collection is a mapping of job names to JSON objects that contain the absolute path to the batch script skeleton and the macro values that will be injected into the skeleton.
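Because the collection is plain JSON, it can be inspected with the standard json module alone. A minimal sketch with a single job entry (stdlib only, no pyssub required):

```python
import json

# A one-entry collection: job name -> skeleton path plus macro values.
collection = json.loads("""
{
  "pyssub_example_00": {
    "script": "/home/ga65xaz/pyssub_example.script",
    "macros": {"jobname": "pyssub_example_00"}
  }
}
""")

entry = collection["pyssub_example_00"]
print(entry["script"])
print(entry["macros"]["jobname"])
```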
Note
By default, this job name is not the one that Slurm will assign to the job internally, but it is best practice to tell Slurm to use the same name via the Slurm option job-name. In the example above, this is achieved with the help of the macro jobname.

Submit the batch script collection via ssub. The ssub command also allows you to control the maximum allowed number of queuing jobs (the default is 1000) and to specify how long it should wait before trying to submit more jobs into the queue (the default is 120 seconds). The output file pyssub_example.out will contain the job name and job ID of each submitted job.
ssub submit \
    --in pyssub_example.json \
    --out pyssub_example.out
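The throttling behaviour described above can be sketched as a simple loop. The helper names below (numjobs, submit) are stand-ins for the corresponding functions in pyssub.scmd; this is an illustration, not ssub's actual code:

```python
import time

def submit_throttled(scripts, numjobs, submit, limit=1000, wait=120):
    """Submit scripts while keeping at most `limit` jobs in the queue.

    `numjobs` returns the current number of queuing jobs and `submit`
    submits one script and returns its job ID; both are stand-ins for
    the corresponding functions in pyssub.scmd.
    """
    jobs = {}
    for name, script in scripts.items():
        while numjobs() >= limit:
            time.sleep(wait)  # queue is full; wait before retrying
        jobs[name] = submit(script)
    return jobs

# Demo with a fake in-memory queue instead of a real Slurm cluster.
queue = []
jobs = submit_throttled(
    {"job_00": "skeleton", "job_01": "skeleton"},
    numjobs=lambda: len(queue),
    submit=lambda script: (queue.append(script), len(queue))[-1],
    limit=10,
    wait=0,
)
print(jobs)  # {'job_00': 1, 'job_01': 2}
```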
After your jobs are done, collect the failed ones. This feature requires the sacct command to be available, which allows querying the Slurm job database. It checks the status of each job listed in pyssub_example.out and saves the job name and job ID of each finished job that has failed.

ssub rescue \
    --in pyssub_example.out \
    --out pyssub_example.rescue
If the jobs have failed because of temporary problems, for example with the computing node, you can simply resubmit only the failed jobs:
ssub submit \
    --in pyssub_example.json \
    --out pyssub_example.out \
    --rescue pyssub_example.rescue
The next step is to use a Python script for creating the same collection of batch scripts, which is shown in the Advanced example page.
Advanced example¶
Following up the Getting started guide, this more advanced example shows how to create the same collection of batch scripts via a Python script.
./example.py \
--sbatch-name pyssub_example \
--sbatch-exec /home/ga65xaz/pyssub_example.py \
--sbatch-jobs 2 \
--sbatch-stdout /scratch9/kkrings/logs \
--sbatch-in '/scratch9/kkrings/pyssub_example_input_{macros[jobid]:02d}.txt' \
--sbatch-out '/scratch9/kkrings/pyssub_example_output_{macros[jobid]:02d}.txt' \
--in 'pyssub_example_input_{macros[jobid]:02d}.txt' \
--out 'pyssub_example_output_{macros[jobid]:02d}.txt'
The example script example.py looks like this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""
Create a collection of Slurm batch scripts for executing an executable
on a Slurm cluster. Each job has the macros job name ``jobname`` and job
ID ``jobid``, which can be passed to the executable and/or the file
transfer mechanism.

"""
import json
import os

import pyssub.sbatch


def main(config, arguments=""):
    script = pyssub.sbatch.SBatchScript(config.executable, arguments)

    script.options.update({
        "job-name": "{macros[jobname]}",
        "time": "00:10:00",
        "chdir": "/var/tmp",
        "error": os.path.join(config.stdout, "{macros[jobname]}.out"),
        "output": os.path.join(config.stdout, "{macros[jobname]}.out")
    })

    script.transfer_executable = True
    script.transfer_input_files.extend(config.transfer_input_files)
    script.transfer_output_files.extend(config.transfer_output_files)

    scriptfile = config.name + ".script"
    with open(scriptfile, "w") as stream:
        json.dump(script, stream, cls=pyssub.sbatch.SBatchScriptEncoder)

    njobs = len(config.jobs)
    suffix = "_{{:0{width}d}}".format(width=len(str(njobs)))

    collection = {}
    for jobid in config.jobs:
        jobname = config.name + suffix.format(jobid)

        collection[jobname] = {
            "script": scriptfile,
            "macros": {
                "jobname": jobname,
                "jobid": jobid
            }
        }

    with open(config.name + ".jobs", "w") as stream:
        json.dump(collection, stream, cls=pyssub.sbatch.SBatchScriptEncoder)


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(
        description=__doc__,
        epilog="Additional arguments are passed to the executable.")

    parser.add_argument(
        "--sbatch-name",
        nargs="?",
        type=str,
        help="jobs' prefix",
        required=True,
        dest="name")

    parser.add_argument(
        "--sbatch-exec",
        nargs="?",
        type=str,
        help="path to executable",
        required=True,
        metavar="PATH",
        dest="executable")

    parser.add_argument(
        "--sbatch-jobs",
        nargs="?",
        type=str,
        help="sequence of job IDs: ``%(default)s``",
        default="[1]",
        metavar="EXPR",
        dest="jobs")

    parser.add_argument(
        "--sbatch-stdout",
        nargs="?",
        type=str,
        help="path to stdout/stderr output directory: ``%(default)s``",
        default="/scratch9/kkrings/logs",
        metavar="PATH",
        dest="stdout")

    parser.add_argument(
        "--sbatch-in",
        nargs="+",
        type=str,
        help="transfer input files to node: ``None``",
        default=[],
        metavar="PATH",
        dest="transfer_input_files")

    parser.add_argument(
        "--sbatch-out",
        nargs="+",
        type=str,
        help="transfer output files from node: ``None``",
        default=[],
        metavar="PATH",
        dest="transfer_output_files")

    config, arguments = parser.parse_known_args()
    config.jobs = eval(config.jobs)

    main(config, arguments=" ".join(arguments))
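One detail worth noting in the script above is the zero-padded job-name suffix: the pad width is derived from the total number of jobs, so job names sort in numerical order. Isolated, the mechanism looks like this:

```python
njobs = 100

# Build a format string like "_{:03d}"; the width equals len("100").
suffix = "_{{:0{width}d}}".format(width=len(str(njobs)))

print(suffix)            # _{:03d}
print(suffix.format(7))  # _007
```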
sbatch¶
Module containing classes representing a Slurm batch script and corresponding JSON encoder and decoder
class pyssub.sbatch.SBatchScript(executable: str, arguments: str = '')¶
Slurm batch script

Represents a single-task Slurm batch script. Additionally, a simple file transfer mechanism between node and shared file system is realized.

executable¶
Path to executable
Type: str

arguments¶
Arguments that will be passed to executable
Type: str

options¶
Mapping of sbatch options to (string-representable) objects representing the option values
Type: dict(str, object)

transfer_executable¶
Transfer executable to node.
Type: bool

transfer_input_files¶
Sequence of input files that are copied to the node before executing executable
Type: list(str)

transfer_output_files¶
Sequence of output files that are moved after executing executable
Type: list(str)
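The attributes above map one-to-one onto the keys of the JSON-encoded script shown in the introduction. A stdlib-only sketch of that correspondence (a plain dict standing in for an SBatchScript instance; no pyssub required):

```python
import json

# Dict mirroring SBatchScript's attributes and its JSON representation.
script = {
    "executable": "echo",
    "arguments": "'Hello World!'",
    "options": {"ntasks": 1, "time": "00:10:00"},
    "transfer_executable": True,
    "transfer_input_files": [],
    "transfer_output_files": [],
}

# The JSON representation round-trips without loss.
decoded = json.loads(json.dumps(script))
print(decoded["executable"])  # echo
```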
class pyssub.sbatch.SBatchScriptDecoder¶
JSON decoder for Slurm batch script

This callable class can be used as an object_hook when loading a JSON object from disk. All objects that represent a Slurm batch script, with or without macros, are decoded into the corresponding Python type.

decode(description: Dict[str, Any]) → pyssub.sbatch.SBatchScript¶
Decode Slurm batch script.
Parameters: description (dict(str, object)) – Script’s JSON-compatible representation
Returns: Slurm batch script
Return type: SBatchScript

decode_macro(description: Dict[str, Any]) → pyssub.sbatch.SBatchScriptMacro¶
Decode Slurm batch script containing macros.
If script points to a string, it is interpreted as a path to a JSON-encoded Slurm batch script on disk, which will be loaded and decoded.
Parameters: description (dict(str, object)) – Script’s JSON-compatible representation
Returns: Slurm batch script
Return type: SBatchScriptMacro
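The object_hook pattern the decoder relies on can be sketched with the standard json module. The toy decode function below is an analogue, not pyssub's implementation:

```python
import json
from types import SimpleNamespace

def decode(description):
    # Turn JSON objects that look like a batch script into a Python
    # object; leave all other JSON objects untouched, as an
    # object_hook must.
    if "executable" in description:
        return SimpleNamespace(**description)
    return description

script = json.loads(
    '{"executable": "echo", "arguments": "hello"}',
    object_hook=decode)
print(script.arguments)  # hello
```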
class pyssub.sbatch.SBatchScriptEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)¶
JSON encoder for Slurm batch script

This class provides a JSON-compatible representation of a Slurm batch script; both SBatchScript and SBatchScriptMacro are supported.

default(o: Any) → Dict[str, Any]¶
Try to encode the given object.
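The default method hooks into json.JSONEncoder in the standard way: it is called only for objects json cannot encode natively. A minimal analogue with a hypothetical Script type (not pyssub's classes):

```python
import json

class Script:
    def __init__(self, executable, arguments=""):
        self.executable = executable
        self.arguments = arguments

class ScriptEncoder(json.JSONEncoder):
    def default(self, o):
        # Return a JSON-compatible dict for our type; defer to the
        # base class for anything else (which raises TypeError).
        if isinstance(o, Script):
            return {"executable": o.executable, "arguments": o.arguments}
        return super().default(o)

encoded = json.dumps(Script("echo", "hello"), cls=ScriptEncoder)
print(encoded)
```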
class pyssub.sbatch.SBatchScriptMacro(script: pyssub.sbatch.SBatchScript, macros: Dict[str, Any])¶
Slurm batch script with macro support

The macro support allows you to put variables (macros) into the script and to reuse it for different values. It is based on Python’s format specification mini-language.

script¶
Slurm batch script containing macros
Type: SBatchScript

macros¶
Macro values that are inserted into the script when its string representation is requested
Type: dict(str, object)

Examples
Create a script with one macro.

>>> skeleton = SBatchScript("echo", "'{macros[msg]}'")
>>> script = SBatchScriptMacro(skeleton, {"msg": "Hello World!"})
scmd¶
Module containing functions wrapping Slurm commands
pyssub.scmd.failed(jobs: Dict[str, int]) → Dict[str, int]¶
Failed jobs
Check which of the given jobs have failed, meaning that their states are not equal to COMPLETED.
Parameters: jobs (dict(str, int)) – Mapping of job names to job IDs
Returns: Mapping of names to IDs of the jobs that have failed
Return type: dict(str, int)
Raises: RuntimeError – If the job ID cannot be matched in sacct’s output.
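The selection logic of failed can be sketched without a Slurm cluster at all, given a mapping of job states; in the real function, those states come from a sacct query:

```python
def failed(jobs, states):
    # Keep only jobs whose state is not COMPLETED; `states` maps job
    # IDs to Slurm state strings and stands in for a sacct query.
    return {
        name: jobid for name, jobid in jobs.items()
        if states[jobid] != "COMPLETED"
    }

jobs = {"job_00": 101, "job_01": 102}
states = {101: "COMPLETED", 102: "FAILED"}
print(failed(jobs, states))  # {'job_01': 102}
```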
pyssub.scmd.numjobs(user: str, partition: Optional[str] = None) → int¶
Number of queuing jobs
Check the number of queuing jobs for the given user and partition.
Parameters:
- user (str) – User name or ID
- partition (str, optional) – Partition name
Returns: Number of queuing jobs
Return type: int
pyssub.scmd.submit(script: pyssub.sbatch.SBatchScript, partition: Optional[str] = None) → int¶
Submit Slurm batch script.
Parameters:
- script (SBatchScript) – Slurm batch script
- partition (str, optional) – Partition for resource allocation
Returns: Job ID
Return type: int
Raises: RuntimeError – If the job ID cannot be matched in sbatch’s output.
shist¶
Slurm job history: save/load names and IDs of submitted jobs
pyssub.shist.load(filename: str) → Dict[str, int]¶
Load Slurm jobs from disk.
Load names and IDs of submitted Slurm jobs from disk.
Parameters: filename (str) – Path to input file
Returns: Mapping of job names to job IDs
Return type: dict(str, int)
pyssub.shist.save(filename: str, jobs: Dict[str, int]) → None¶
Save Slurm jobs to disk.
Save names and IDs of submitted Slurm jobs to disk.
Parameters:
- filename (str) – Path to output file
- jobs (dict(str, int)) – Mapping of job names to job IDs
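A save/load round trip only involves a mapping of job names to job IDs. The sketch below uses plain JSON for the file format; this is an illustrative assumption, not necessarily the on-disk format shist actually uses:

```python
import json
import os
import tempfile

jobs = {"pyssub_example_00": 4711, "pyssub_example_01": 4712}

# Write the job history to disk and read it back.
with tempfile.TemporaryDirectory() as tmpdir:
    filename = os.path.join(tmpdir, "pyssub_example.out")
    with open(filename, "w") as stream:
        json.dump(jobs, stream)
    with open(filename) as stream:
        loaded = json.load(stream)

print(loaded == jobs)  # True
```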