How to Submit Job Arrays

Introduction

This primer describes how to submit job arrays, namely sets of serial jobs that run the same script with different parameters, where each case is identified by a unique task number. Typical uses include reducing a set of data or running a grid of models.

This primer complements the introductory primer on how to submit jobs, so read that one first.

For job arrays, the qsub file will specify a range of tasks to start, while the job file is written in such a way as to compute a specific case, defined by a unique task number.

At run time, the Grid Engine sets the following four environment variables:

SGE_TASK_ID        unique ID for the specific task
SGE_TASK_FIRST     ID of the first task, as specified with qsub
SGE_TASK_LAST      ID of the last task, as specified with qsub
SGE_TASK_STEPSIZE  ID step size, as specified with qsub
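
As a minimal sketch of how a job script might use these variables, the following computes a task's rank within the array and the total number of tasks. It is written in sh syntax rather than csh, and the function name task_position is hypothetical, not part of the Grid Engine.

```shell
#!/bin/sh
# Hypothetical helper: print "rank total" for one task of a job array.
# $1 = task ID, $2 = first ID, $3 = last ID, $4 = step size
task_position() {
    rank=$(( ($1 - $2) / $4 + 1 ))
    total=$(( ($3 - $2) / $4 + 1 ))
    echo "$rank $total"
}

# Inside a job array the Grid Engine sets the four variables; the defaults
# of 1 are an assumption so that the sketch also runs outside the scheduler.
task_position "${SGE_TASK_ID:-1}" "${SGE_TASK_FIRST:-1}" \
              "${SGE_TASK_LAST:-1}" "${SGE_TASK_STEPSIZE:-1}"
```

For a task with ID 70 in an array that starts at 50, ends at 140 and uses a step of 10, task_position 70 50 140 10 reports rank 3 out of 10 tasks.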

  • The job file is written to make use of these environment variables.
  • The qsub file (or command) is written as for serial jobs, but uses the additional flag -t n[-m[:s]].
  • This flag tells the job scheduler to start a set of tasks, as follows:
    -t 1-20 run 20 tasks, with IDs from 1 to 20 (SGE_TASK_ID = 1, 2, 3, ... , 20)
    -t 10-30 run 21 tasks, with IDs from 10 to 30
    -t 50-140:10 run 10 tasks, with IDs from 50 to 140, in steps of 10
    -t 20 run one task, with ID 20 (SGE_TASK_ID = 20)
  • You can limit the number of concurrently running tasks with the flag -tc k (or the #$ -tc k directive),
    • where k is the maximum number of tasks allowed to run concurrently;
    • e.g., -t 1-1000 -tc 200 will submit 1000 tasks but limit the job array to no more than 200 concurrent tasks.
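
To check which task IDs a given range selects before submitting, the n[-m[:s]] syntax can be expanded by hand. The sketch below is one way to do that in sh syntax; expand_tasks is a hypothetical helper, not a Grid Engine command, and it assumes a well-formed range.

```shell
#!/bin/sh
# Hypothetical helper: list the task IDs selected by a -t n[-m[:s]] range.
expand_tasks() {
    spec=$1
    step=1
    case $spec in
        # Split off the optional ":s" step size.
        *:*) step=${spec##*:}; spec=${spec%%:*} ;;
    esac
    first=${spec%%-*}
    last=${spec##*-}      # same as first when no "-m" part is given
    id=$first
    while [ "$id" -le "$last" ]; do
        echo "$id"
        id=$(( id + step ))
    done
}
```

For instance, expand_tasks 50-140:10 lists the ten IDs 50, 60, ..., 140, and expand_tasks 20 lists the single ID 20.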

Example

The following example shows how to run a grid of models.

  • Let's assume that the program mymodel produces a model from a specified temperature, density and pressure.
  • It reads the 3 values from stdin and writes the model to stdout.

One simple implementation is to write, let's say, 1000 input files (for a grid of 10 temperatures, 10 densities and 10 pressures), and to name them input.1, input.2, input.3, and so on up to input.1000.
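
The input files could be generated with three nested loops, as in the sketch below (sh syntax). The helper name make_inputs is hypothetical, and the sketch assumes that mymodel accepts the three values on a single whitespace-separated line; adapt the echo line to whatever layout the program actually expects.

```shell
#!/bin/sh
# Hypothetical helper: write input.1 ... input.N in the current directory.
# $1, $2, $3 = quoted lists of temperatures, densities and pressures.
# Prints N, the number of files written.
make_inputs() {
    n=0
    # The lists are left unquoted on purpose so the shell splits them.
    for t in $1; do
        for d in $2; do
            for p in $3; do
                n=$(( n + 1 ))
                echo "$t $d $p" > "input.$n"
            done
        done
    done
    echo "$n"
}
```

Called with three lists of 10 values each, it writes the 1000 files, with the pressure varying fastest and the temperature slowest.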

The corresponding job file (csh syntax) is as simple as:

hydra% cat example-ja.job
#!/bin/csh
#$ -j y -cwd
#
./mymodel < input.$SGE_TASK_ID

The corresponding qsub file (or qsub command) is:

qsub -t 1-1000 -N models example-ja.job

and, when submitted, the queuing system will reply with something like:

Your job-array 5145619.1-1000:1 ("models") has been submitted

  • This will start the 1000 tasks that compute the 1000 models.
  • The output files will be named models.oNNNNNNN.M, where NNNNNNN is the job ID of the job array (all 1000 tasks share the same job ID) and M is the task ID, a value between 1 and 1000.
  • In this example, the number of tasks running simultaneously is limited only by the cluster's limit rules and load.
  • A trivial example (no actual computation) is available on hydra in ~hpc/tests/queues as example-ja.job and example-ja.qsub, along with the models.o* files it produced.
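
Once the job array has finished, the output file naming convention makes it easy to spot tasks that left no output. The sketch below (sh syntax) is one way to do so; missing_tasks is a hypothetical helper, and it only tests whether each file exists, not whether the task succeeded.

```shell
#!/bin/sh
# Hypothetical helper: print the task IDs that have no output file.
# $1 = job name, $2 = job ID, $3 = first task ID, $4 = last task ID
# Assumes a step size of 1 and output files named <name>.o<jobid>.<taskid>
# in the current directory.
missing_tasks() {
    id=$3
    while [ "$id" -le "$4" ]; do
        [ -e "$1.o$2.$id" ] || echo "$id"
        id=$(( id + 1 ))
    done
}
```

For the example above, missing_tasks models 5145619 1 1000 would list any task ID without a models.o5145619.M file; a single missing task can then be rerun with the one-task form of the -t flag.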

-- SylvainKorzennikHPCAnalyst - 30 Jan 2012

Topic revision: r3 - 2012-06-29 - SylvainKorzennikHPCAnalyst
 