Slurm job submission in Python

This package provides a thin Python layer on top of the Slurm workload manager for submitting batch scripts into a Slurm queue. Its core features are:

  • Python classes representing a Slurm batch script,
  • simple file transfer mechanism between shared file system and node,
  • macro support based on Python’s format specification mini-language,
  • JSON-encoding and decoding of Slurm batch scripts,
  • new submission command ssub,
  • successive submission of Slurm batch scripts, and
  • rescue of failed jobs.

The following example shows how to submit a JSON-encoded Slurm batch script into a Slurm queue via ssub:

ssub submit --in pyssub_example.json --out pyssub_example.out

The JSON-encoded Slurm batch script pyssub_example.json has the following content:

{
   "pyssub_example": {
      "executable": "echo",
      "arguments": "'Hello World!'"
   }
}

A more detailed introduction is given in the Getting started guide.

Note that I wrote this package because I was working with a small Slurm cluster during my PhD. This cluster was configured in a way that made it easiest to submit multiple single-task Slurm batch scripts instead of a single multi-task Slurm batch script containing multiple srun commands. The package reflects this approach and therefore may not be the best solution for your cluster.

Installation

This package is a pure Python 3 package (it requires at least Python 3.6) and does not depend on any third-party package. All releases are uploaded to PyPI, and the newest release can be installed via

pip install pyssub

I would recommend creating a dedicated virtual Python 3 environment for the installation (e.g. via virtualenvwrapper):

source /usr/share/virtualenvwrapper/virtualenvwrapper.sh
mkvirtualenv -p /usr/bin/python3.6 -i pyssub py3-slurm

If you prefer to work with the newest revision, you can also install the package directly from GitHub:

pip install 'git+https://github.com/kkrings/pyssub#egg=pyssub'

Contributing

I welcome input from your side, either by creating issues or via pull requests. For the latter, please make sure that all unit tests pass. The unit tests can be executed via

python setup.py test

Getting started

Imagine you have an executable that you want to execute on a Slurm batch farm for a list of input files. Each job should process one input file. Both the executable and the input file should be copied to the computing node.

  1. Create a batch script skeleton pyssub_example.script:

    {
       "executable": "/home/ga65xaz/pyssub_example.py",
       "arguments": "--in {macros[inputfile]} --out {macros[outputfile]}",
       "options": {
          "job-name": "{macros[jobname]}",
          "ntasks": 1,
          "time": "00:10:00",
          "chdir": "/var/tmp",
          "error": "/scratch9/kkrings/logs/{macros[jobname]}.out",
          "output": "/scratch9/kkrings/logs/{macros[jobname]}.out"
       },
       "transfer_executable": true,
       "transfer_input_files": [
          "/scratch9/kkrings/{macros[inputfile]}"
       ],
       "transfer_output_files": [
          "/scratch9/kkrings/{macros[outputfile]}"
       ]
    }
    

    The script pyssub_example.py must be executable; a minimal sketch of such a script is shown after the warning below. In this example, we use macros, which are based on Python’s format specification mini-language, for the job name and for the names of the input and output files.

    Warning

    In case of Python scripts, be careful if the shebang starts with #!/usr/bin/env python, because Slurm transfers the user environment of the submit node to the computing node. This can lead to unwanted results if, for example, you use pyssub from within a dedicated virtual Python 3 environment that is not the one the Python script is supposed to run in.
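    For reference, here is a minimal sketch of what such an executable could look like. The file name pyssub_example.py and its --in/--out interface match the skeleton above; the processing step itself is just a placeholder:

    #!/usr/bin/env python3
    # Hypothetical pyssub_example.py: read one input file and write one
    # output file, matching the "--in ... --out ..." arguments above.
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--in", dest="inputfile", required=True)
    parser.add_argument("--out", dest="outputfile", required=True)
    args = parser.parse_args()

    with open(args.inputfile) as stream:
        content = stream.read()

    # Placeholder for the actual processing step.
    with open(args.outputfile, "w") as stream:
        stream.write(content)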

  2. Create a batch script collection pyssub_example.json:

    {
       "pyssub_example_00": {
          "script": "/home/ga65xaz/pyssub_example.script",
          "macros": {
             "jobname": "pyssub_example_00",
             "inputfile": "pyssub_example_input_00.txt",
             "outputfile": "pyssub_example_output_00.txt"
          }
       },
       "pyssub_example_01": {
          "script": "/home/ga65xaz/pyssub_example.script",
          "macros": {
             "jobname": "pyssub_example_01",
             "inputfile": "pyssub_example_input_01.txt",
             "outputfile": "pyssub_example_output_01.txt"
          }
       }
    }
    

    The collection is a mapping of job names to JSON objects that contain the absolute path to the batch script skeleton and the macro values that will be injected into the skeleton.
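    The injection itself is plain Python string formatting; conceptually, each macro placeholder is resolved like this:

    >>> "--in {macros[inputfile]}".format(
    ...     macros={"inputfile": "pyssub_example_input_00.txt"})
    '--in pyssub_example_input_00.txt'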

    Note

    The job name given here is not automatically the name that Slurm assigns to the job internally; it is best practice to tell Slurm to use the same name via the Slurm option job-name. In the example above, this is achieved with the help of the macro jobname.

  3. Submit the batch script collection via ssub. The ssub command also allows you to control the maximum allowed number of queuing jobs (the default is 1000) and to specify how long it should wait before trying to submit more jobs into the queue (the default is 120 seconds). The output file pyssub_example.out will contain the job name and job ID of each submitted job.

    ssub submit \
       --in pyssub_example.json \
       --out pyssub_example.out
    
  4. After your jobs are done, collect the failed ones. This feature requires the sacct command to be available, which allows querying the Slurm job database. It will query the status of each job listed in pyssub_example.out and save the job name and job ID of each finished job that has failed.

    ssub rescue \
       --in pyssub_example.out \
       --out pyssub_example.rescue
    
  5. If the jobs have failed because of, for example, temporary problems with the computing node, you can simply resubmit only the failed jobs:

    ssub submit \
       --in pyssub_example.json \
       --out pyssub_example.out \
       --rescue pyssub_example.rescue
    

The next step is to use a Python script for creating the same collection of batch scripts, which is shown on the Advanced example page.

Advanced example

Following up on the Getting started guide, this more advanced example shows how to create the same collection of batch scripts via a Python script.

./example.py \
   --sbatch-name pyssub_example \
   --sbatch-exec /home/ga65xaz/pyssub_example.py \
   --sbatch-jobs 'range(2)' \
   --sbatch-stdout /scratch9/kkrings/logs \
   --sbatch-in '/scratch9/kkrings/pyssub_example_input_{macros[jobid]:02d}.txt' \
   --sbatch-out '/scratch9/kkrings/pyssub_example_output_{macros[jobid]:02d}.txt' \
   --in 'pyssub_example_input_{macros[jobid]:02d}.txt' \
   --out 'pyssub_example_output_{macros[jobid]:02d}.txt'

The example script example.py looks like this:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""
Create a collection of Slurm batch scripts for executing an executable
on a Slurm cluster. Each job has the macros job name ``jobname`` and job
ID ``jobid``, which can be passed to the executable and/or the file
transfer mechanism.

"""
import json
import os

import pyssub.sbatch


def main(config, arguments=""):
    script = pyssub.sbatch.SBatchScript(config.executable, arguments)

    script.options.update({
        "job-name": "{macros[jobname]}",
        "time": "00:10:00",
        "chdir": "/var/tmp",
        "error": os.path.join(config.stdout, "{macros[jobname]}.out"),
        "output": os.path.join(config.stdout, "{macros[jobname]}.out")
        })

    script.transfer_executable = True
    script.transfer_input_files.extend(config.transfer_input_files)
    script.transfer_output_files.extend(config.transfer_output_files)

    scriptfile = config.name + ".script"
    with open(scriptfile, "w") as stream:
        json.dump(script, stream, cls=pyssub.sbatch.SBatchScriptEncoder)

    njobs = len(config.jobs)
    suffix = "_{{:0{width}d}}".format(width=len(str(njobs)))

    collection = {}
    for jobid in config.jobs:
        jobname = config.name + suffix.format(jobid)

        collection[jobname] = {
            "script": scriptfile,
            "macros": {
                "jobname": jobname,
                "jobid": jobid
                }
            }

    with open(config.name + ".jobs", "w") as stream:
        json.dump(collection, stream, cls=pyssub.sbatch.SBatchScriptEncoder)


if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(
        description=__doc__,
        epilog="Additional arguments are passed to the exectuable.")

    parser.add_argument(
        "--sbatch-name",
        nargs="?",
        type=str,
        help="jobs' prefix",
        required=True,
        dest="name")

    parser.add_argument(
        "--sbatch-exec",
        nargs="?",
        type=str,
        help="path to executable",
        required=True,
        metavar="PATH",
        dest="executable")

    parser.add_argument(
        "--sbatch-jobs",
        nargs="?",
        type=str,
        help="sequence of job IDs: ``%(default)s``",
        default="[1]",
        metavar="EXPR",
        dest="jobs")

    parser.add_argument(
        "--sbatch-stdout",
        nargs="?",
        type=str,
        help="path to stdout/stderr output directory: ``%(default)s``",
        default="/scratch9/kkrings/logs",
        metavar="PATH",
        dest="stdout")

    parser.add_argument(
        "--sbatch-in",
        nargs="+",
        type=str,
        help="transfer input files to node: ``None``",
        default=[],
        metavar="PATH",
        dest="transfer_input_files")

    parser.add_argument(
        "--sbatch-out",
        nargs="+",
        type=str,
        help="transfer output files from node: ``None``",
        default=[],
        metavar="PATH",
        dest="transfer_output_files")

    config, arguments = parser.parse_known_args()

    # The job-ID expression (e.g. 'range(2)') is evaluated as Python code.
    config.jobs = eval(config.jobs)

    main(config, arguments=" ".join(arguments))

sbatch

Module containing classes representing a Slurm batch script and the corresponding JSON encoder and decoder.

class pyssub.sbatch.SBatchScript(executable, arguments='')

Slurm batch script

Represents a single-task Slurm batch script. Additionally, a simple file transfer mechanism between the node and the shared file system is provided.

executable

Path to executable

Type: str
arguments

Arguments that will be passed to executable

Type: str
options

Mapping of sbatch options to (string-convertible) objects representing values

Type: dict(str, object)
transfer_executable

Transfer executable to node.

Type: bool
transfer_input_files

Sequence of input files that are copied to the node before executing executable

Type: list(str)
transfer_output_files

Sequence of output files that are moved after executing executable

Type: list(str)
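
A minimal construction example (a sketch; the option value mirrors the Getting started guide):

>>> script = SBatchScript("echo", "'Hello World!'")
>>> script.options["time"] = "00:10:00"
>>> script.transfer_executable = True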
class pyssub.sbatch.SBatchScriptDecoder

JSON decoder for Slurm batch script

This callable class can be used as an object_hook when loading a JSON object from disk. All objects that represent a Slurm batch script, with or without macros, are decoded into the corresponding Python type.

decode(description)

Decode Slurm batch script.

Parameters: description (dict) – Script’s JSON-compatible representation
Returns: Slurm batch script
Return type: SBatchScript
decode_macro(description)

Decode Slurm batch script containing macros.

If the value of script is a string, it is interpreted as the path to a JSON-encoded Slurm batch script on disk, which will be loaded and decoded.

Parameters: description (dict) – Script’s JSON-compatible representation
Returns: Slurm batch script
Return type: SBatchScriptMacro
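
For example, a batch script collection can be loaded from disk by passing an instance as object_hook (a sketch, assuming the collection file from the Getting started guide):

>>> import json
>>> with open("pyssub_example.json") as stream:
...     collection = json.load(
...         stream, object_hook=SBatchScriptDecoder())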
class pyssub.sbatch.SBatchScriptEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)

JSON encoder for Slurm batch script

This class provides a JSON-compatible representation of a Slurm batch script; both SBatchScript and SBatchScriptMacro are supported.

default(o)

Try to encode the given object.
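
Usage mirrors the example script on the Advanced example page (a sketch, assuming script is an SBatchScript or SBatchScriptMacro):

>>> import json
>>> with open("pyssub_example.script", "w") as stream:
...     json.dump(script, stream, cls=SBatchScriptEncoder)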

class pyssub.sbatch.SBatchScriptMacro(script, macros)

Slurm batch script with macro support

The macro support allows you to put variables (macros) into the script and to reuse it with different values. It is based on Python’s format specification mini-language.

script

Slurm batch script containing macros

Type: SBatchScript
macros

Macro values that are inserted into the script when its string representation is created

Type: dict(str, object)

Examples

Create a script with one macro.

>>> skeleton = SBatchScript("echo", "'{macros[msg]}'")
>>> script = SBatchScriptMacro(skeleton, {"msg": "Hello World!"})
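
The macro values are substituted when the script’s string representation is created:

>>> rendered = str(script)  # '{macros[msg]}' is replaced by 'Hello World!'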

scmd

Module containing functions wrapping Slurm commands

pyssub.scmd.failed(jobs)

Failed jobs

Check which of the given jobs have failed, meaning that their states are not equal to COMPLETED.

Parameters: jobs (dict(str, int)) – Mapping of job names to job IDs
Returns: Mapping of names to IDs of the jobs that have failed
Return type: dict(str, int)
pyssub.scmd.numjobs(user, partition=None)

Number of queuing jobs

Check the number of queuing jobs for the given user and partition.

Parameters:
  • user (str) – User name or ID
  • partition (str, optional) – Partition name
Returns: Number of queuing jobs
Return type: int

pyssub.scmd.submit(script, partition=None)

Submit Slurm batch script.

Parameters:
  • script (SBatchScript) – Slurm batch script
  • partition (str, optional) – Partition for resource allocation
Returns: Job ID
Return type: int
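
Together with numjobs, this allows a simple throttled submission loop similar to what ssub does internally. A sketch, assuming scripts is a mapping of job names to batch scripts (the limits mirror ssub’s defaults):

import time

import pyssub.scmd

jobs = {}
for name, script in scripts.items():
    # Wait until the number of queuing jobs drops below the limit;
    # 1000 jobs and 120 seconds are ssub's defaults.
    while pyssub.scmd.numjobs("ga65xaz") >= 1000:
        time.sleep(120)
    jobs[name] = pyssub.scmd.submit(script)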

shist

Slurm job history: save/load names and IDs of submitted jobs

pyssub.shist.load(filename)

Load Slurm jobs from disk.

Load names and IDs of submitted Slurm jobs from disk.

Parameters: filename (str) – Path to input file
Returns: Mapping of job names to job IDs
Return type: dict(str, int)
pyssub.shist.save(filename, jobs)

Save Slurm jobs to disk.

Save names and IDs of submitted Slurm jobs to disk.

Parameters:
  • filename (str) – Path to output file
  • jobs (dict(str, int)) – Mapping of job names to job IDs
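
The rescue step from the Getting started guide can be expressed with these functions and scmd.failed (a sketch):

import pyssub.scmd
import pyssub.shist

# Load the submitted jobs, keep only those that did not complete, and
# save them so that they can be resubmitted via ssub's rescue option.
jobs = pyssub.shist.load("pyssub_example.out")
failed = pyssub.scmd.failed(jobs)
pyssub.shist.save("pyssub_example.rescue", failed)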
