
Available Queues and Limits

A new set of queues and limits has been implemented with the deployment of the InfiniBand fabric (summer 2012).


Introduction

The cluster hydra was upgraded to Rocks 5.4.3 in March 2012, and part of the cluster is now connected via an InfiniBand (IB) fabric.

A new set of queues and limits has been implemented to establish simple but flexible rules, with set limits, that allow fair use of the cluster.

The limits in the tables below have values that can/will be adjusted if/when needed. Email me (at hpc@cfa) if you run into problems or have concerns with this configuration.


Queues

  • There is an array of queues (5/4/4/3).
  • One set for serial jobs (5), one set for parallel jobs (4), one set for parallel jobs that make use of the IB (4), and one set for serial jobs with high memory needs (3).
  • There are short, medium, long and very long execution time queues,
    plus one unlimited execution time queue, running at low priority and only for serial jobs.
  • The size of any parallel job is only limited by the slot limit on the associated queue.

  • This results in 16 queues, named as follows:

    Type                     short-T  medium-T  long-T  veryLong-T  unlimited-T  PE available
    serial                   sTz.q    mTz.q     lTz.q   vTz.q       uTz.q
    parallel, IB not needed  sTN.q    mTN.q     lTN.q   vTN.q                    orte, mpich, openmp
    parallel, IB is needed   sTNi.q   mTNi.q    lTNi.q  vTNi.q                   orte_ib, mpich_ib
    high memory                       mThM.q    lThM.q  vThM.q                   orte, mpich, openmp

Note:

  • The queue all.q is not available (disabled).
  • All the queues allow only BATCH execution, and have associated memory use and time limits (except one).
  • The parallel queues support orte, mpich and openmp (non-IB queues), or orte_ib and mpich_ib (IB queues).
  • Read the section below on memory limits, unless your application uses less than 1GB/slot.
  • The relevant SGE commands are
    • to get the list of queues: qconf -sql
    • to get the list of queues, their status and load: qstat -g c


Limits

Time Limits

  • All the queues, except uTz.q, have time limits.
  • There are 4 time limits associated with each queue:
    • two soft ones, two hard ones,
    • one pair for CPU time, one pair for R/T (real time, i.e., elapsed or wall clock time).
  • The soft limits can be caught by your job, and you have until the hard limit to do something about it,
    like saving what you have computed so far (see the sketch after the table below).
  • The hard limits are just that: the job is terminated when the limit is reached.

  • The time limits are set, using a 12x progression (except for vT*.q), as follows:

    type        short-T  medium-T    long-T      veryLong-T     resource name
    CPU (soft)  3h       36h (1.5d)  432h (18d)  2592h (108d)   s_cpu
    CPU (hard)  CPU (soft) + 15m                                h_cpu
    R/T (soft)  2x CPU (soft)                                   s_rt
    R/T (hard)  R/T (soft) + 15m                                h_rt
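For instance, your job script can catch the signals GE sends when a soft limit is reached (SIGUSR1 for s_rt, SIGXCPU for s_cpu) and save its state before the hard limit, 15m later, terminates it. A minimal sketch, where ./my-program and ./save-state are hypothetical placeholders:

    #!/bin/sh
    # Trap the soft-limit signals so the job can checkpoint
    # before the hard limit kills it 15 minutes later.
    trap 'echo "soft limit reached, saving state"; ./save-state; exit 0' USR1 XCPU
    ./my-program &   # hypothetical long-running executable
    wait             # the trap fires while the shell waits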

Note:

  • To check the time limits of a queue, use something like qconf -sq sTz.q
    and look for the values associated to the above-listed resource names.
  • To check the time limits for all the queues, use qconf -sq '*' | egrep 'qname|_rt|_cpu'.
  • The unlimited serial queue (uTz.q) runs at priority=19; see below on how to use it.
    (qconf -sq uTz.q | grep priority.)
  • To check the actual limits, use the command qconf -srqs.
  • See below for more on qsub flags.

Memory Limits

New feature as of July 2013

All the Queues Have Memory Limits.

  • The cluster has between 2 and 4GB of memory per core on each compute node. While the most recently purchased nodes, with 64 cores, have 256GB of aggregate memory, this is still only 4GB/core.

  • Using virtual memory becomes very inefficient when it significantly exceeds the available physical memory (causing very high disk I/O), and some applications need substantially more than the 2 to 4GB range.

  • Since the scheduler will launch as many tasks as there are slots (i.e., cores) on each compute node, tasks that use more than 4GB of physical memory can eventually deplete the available physical memory. This causes the node to crash, killing all the tasks running on that node.

  • To prevent this, all the queues, except the high-memory ones (see below), have a limit of 4GB/slot of physical memory (h_data, h_rss) and a limit of 6GB of virtual memory (h_vmem).

  • A mechanism (a complex consumable resource in GE speak) to reserve and track memory use has been set up (July 2013).
    • If your jobs are using more than 1GB but less than 4 to 6GB of memory per slot, you are asked to use the memory_reserved (mr) resource, as follows:
      as option to qsub: qsub -l memory_reserved=3.5G
      or: qsub -l mr=3.5G
      or as an embedded directive: #$ -l mr=3.5G
      This option/directive tells the job scheduler that you expect to use up to 3.5GB of physical memory per slot.
      The scheduler will keep track of this (against the nodes' available memory) and will not start other tasks on a given node if not enough memory would be left.
    • This is a voluntary mechanism that all users are asked to use. By default (if you don't do anything) the scheduler will reserve 1GB/slot.
    • Jobs using more than 1GB/slot should use something like this:
      qsub -l mr=3.5G,h_data=3.5G,h_vmem=4G
      meaning:
      mr=3.5G reserve 3.5 GB of memory (per slot)
      h_data=3.5G kill my job if I exceed 3.5G of physical memory
      h_vmem=4G kill my job if I exceed 4.0G of virtual memory

  • Use values appropriate to what you need. You can check how much memory a job has used with qacct -j NNNNNN, where NNNNNN is the job number ($JOB_ID).
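To put it all together, here is a minimal sketch of a job file using these directives (the job name, executable and resource values are illustrative placeholders, not prescriptions):

    #!/bin/sh
    #$ -N mem-example     # job name (hypothetical)
    #$ -cwd               # run in the current working directory
    #$ -l mr=3.5G         # reserve 3.5GB of memory per slot
    #$ -l h_data=3.5G     # kill the job above 3.5GB of physical memory
    #$ -l h_vmem=4G       # kill the job above 4GB of virtual memory
    ./my-program          # hypothetical executable

Once the job has finished, something like qacct -j 123456 | egrep 'maxvmem|ru_maxrss' (123456 being a hypothetical job number) shows the peak virtual and physical memory it actually used.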

Special Queues for High Memory Use

  • Three special queues are available for jobs with high memory needs: mThM.q, lThM.q, and vThM.q
  • All three have a physical memory limit of 16GB and a virtual memory limit of 18GB (per slot).
  • We now have 640 cores over 10 nodes available for high memory jobs (Dec 2014). Each node has 64 cores and 256GB of total memory (4GB/core).
  • To run in the high memory queues, you need to
    • specify -l use_himem or -l hm
    • and request to be allowed to use them (email hpc@cfa.harvard.edu with a short justification paragraph, or come and see me).
  • Example:
    qsub -l hm,mr=10G,h_data=10G,h_vmem=12G
    meaning:
    hm I need to use a lot of memory
    mr=10G reserve 10 GB of memory
    h_data=10G kill my job if I exceed 10G of physical memory
    h_vmem=12G kill my job if I exceed 12G of virtual memory

NOTES for high memory parallel jobs:

  1. Memory specification (and limit) is per slot, i.e.:
    qsub -pe orte 10 -l hm,mr=12G is a request for 10 slots and 120GB of total distributed memory (12GB for each slot).
    qsub -pe openmp 24 -l hm,mr=10G is a request for 240GB on a single node using 24 cores for multi-threaded use.
  2. There is only 256GB of total memory in each high memory node. The job qsub -pe openmp 64 -l hm,mr=10G, i.e. requesting 640GB of memory on a single node, cannot run.
  3. Asking for extra memory tells the scheduler not to start other jobs that will need memory on these nodes - so ask only for what you will need/use.
  4. There are only 10 high memory nodes; asking for most of the memory of a single node (like 240GB as 24 x 10GB with openmp) and submitting multiple such jobs will result in your jobs waiting a long time in the queue for the requested resource(s) to become available.
  5. As always, use values appropriate to what you need:
    • You can check how much memory a job has used, when it has finished, with qacct -j NNNNNN, where NNNNNN is the job number ($JOB_ID). The command qacct must be run on hydra-2.
    • During execution, you can monitor memory use with
      • qhost -h compute_node_list,
      • the command top, which you run on a compute node via my script ~hpc/sbin/rtop+,
      • via qstat (or ~hpc/sbin/q+ -mem), or
      • using the script ~hpc/sbin/ck-mem-use.csh
  6. Remember that memory is both an expensive and scarce resource.
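As an illustration, here is a sketch of a job file for a distributed high memory job, assuming you have been authorized to use these queues; the executable and resource values are hypothetical, and the mpirun invocation follows the usual convention for the orte PE:

    #!/bin/sh
    #$ -q mThM.q                            # medium-T high-memory queue
    #$ -pe orte 10                          # 10 slots, orte PE
    #$ -l hm,mr=12G,h_data=12G,h_vmem=14G   # per slot: reserve 12GB, with hard limits
    # 10 slots x 12GB = 120GB of total distributed memory
    mpirun -np $NSLOTS ./my-mpi-program     # hypothetical MPI executable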

Job Limits (slots)

Each queue has an associated, per user, slot limit.

  • The total number of slots (CPUs) used by a given user for all his running jobs in a given queue is limited.
  • These limits are set as follows:

    Queue   Limit   Queue   Limit   Queue    Limit   Queue    Limit
    sTz.q     768   sTN.q     768   sTNi.q     384
    mTz.q     384   mTN.q     384   mTNi.q     192   mThM.q     128
    lTz.q     192   lTN.q     192   lTNi.q      96   lThM.q      64
    vTz.q      98   vTN.q      98   vTNi.q      48   vThM.q      32
    uTz.q      48

    since we currently have some 3,000 slots, 856 of them with IB. As the number of nodes on the IB fabric increases, these will be adjusted if need be.

The logic is that the longer the job, the fewer of them should run concurrently.

Overall Limits

Since a user can fill more than one queue, there is also an overall limit:

  • Each user is limited to a total of around 800 slots (out of 3,000) for all his jobs in all the queues. That limit is adjusted depending on the cluster's overall use.
  • A limit on the total of available slots for all users (together) prevents overloading the cluster.
  • A limit on how many jobs each user is allowed to queue at one time prevents the scheduler from being overwhelmed (set to 2,500).

What Are the Limits?

  • The command qconf -srqs lists all the slot limits, known as resource quota sets (RQS) in SGE lingo.
    The name and description fields should help you understand what the associated limits mean.
  • The command qquota shows the current resource quotas, i.e. the limits activated by the jobs currently in the queue,
    so it is not necessarily the complete list of limits.
  • Queue-specific limits can be queried with qconf -sq sTz.q; you can use REs and parse the output with egrep:
    qconf -sq \?Tz.q | egrep 'name|s_cpu|s_rt|h_rss|h_data|h_vmem'


The Qsub Flags

  • With a set of queues comes the need to be more specific when submitting a job with qsub.
  • Just add the flag to qsub, or include it in the job file using the embedded directives (#$).
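For example, the same time request can be expressed either way (my-job.sh is a hypothetical job file):

    # on the command line:
    qsub -l s_cpu=20:00:00 my-job.sh

    # or inside my-job.sh, as an embedded directive:
    #$ -l s_cpu=20:00:00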

Qsub Flags For Serial Jobs

  • By default the scheduler will pick a queue, following two criteria:
    1. the ordering of the queues
    2. the load of the queues

  • The ordering of the serial queues is set to (1) short, (2) medium, (3) long, (4) veryLong.
  • Unless you specify the queue, or your time requirement, your job is likely to go to the short queue, and may run out of time.
  • You can direct your job to the right queue by either:

    specifying the queue   -q mTz.q           request the medium-T serial queue
    specifying cpu time    -l s_cpu=20:00:00  request 20h of cpu
    specifying r/t time    -l s_rt=40:00:00   need 40h of wall clock time
  • We ask that, if your job uses more than 1GB/slot, you specify its anticipated memory use, with something like -l mr=2.5G.
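Putting these together, a medium-length serial submission could look like this sketch (my-job.sh is a hypothetical job file):

    qsub -l s_cpu=20:00:00,mr=2.5G my-job.sh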

Note:

  • Jobs will not go to the unlimited/low priority queue, unless you direct them there specifically:
    1. specify the queue with -q uTz.q, and
    2. add the option -l lp,
      (which is equivalent to -l low_pri=true.)
    • Using simply -q uTz.q will not work (see the example after this list).
  • Jobs will not go to the high memory queues, unless:
    1. you have requested to be authorized to use them,
    2. you specify a high memory queue, with for example, -q mThM.q, and
    3. you add the option -l hm,
      (which is equivalent to -l use_himem=true), and
    4. you specify the amount of memory you will need with something like -l mr=10G,h_data=10G,h_vmem=12G.
    • Using simply -q mThM.q will not work (see the example after this list).
  • Remember that each queue has R/T and cpu limits and a memory limit, and that if your job is getting few cpu cycles
    (like when bogged down by I/Os), it may run out of R/T (elapsed time) and get terminated.
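To illustrate both cases, here are minimal submission sketches (my-job.sh is a hypothetical job file):

    # unlimited/low-priority serial queue:
    qsub -q uTz.q -l lp my-job.sh

    # high-memory queue (requires prior authorization):
    qsub -q mThM.q -l hm,mr=10G,h_data=10G,h_vmem=12G my-job.sh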

Qsub Flags For Parallel Jobs

  • For parallel jobs you must specify the PE and the number of CPUs, and should specify a time requirement, or the queue.
    Without a time requirement, the scheduler will queue your job in the queue with the lowest order (shortest time limit) and the lowest load.
  • The ordering of the parallel queues is set to (11) short, (12) medium, (13) long, (14) veryLong for the non-IB and to (21) short, (22) medium, (23) long, (24) veryLong for the with-IB queues.
  • You specify the PE and the number of CPUs by using something like -pe mpich 8 or -pe orte 8.
    This example requests 8 CPUs.

Note:

  • Like for serial jobs, you should also specify an R/T or cpu limit,
    otherwise the scheduler will pick one for you, and your job might run out of time and get terminated:

    specifying cpu time    -l s_cpu=20:00:00  request 20h of cpu
    specifying r/t time    -l s_rt=40:00:00   need 40h of wall clock time

  • You can instead specify a queue, but make sure to use a queue that offers the PE you need:

    specifying the queue   -q mTN.q   request medium-T, no IB (PEs: orte, mpich, openmp)
    specifying the queue   -q mTNi.q  request medium-T, with IB (PEs: orte_ib, mpich_ib)
  • If you specify incompatible options, you will get the message
    Unable to run job: error: no suitable queues (unless you specified -w w or -w n).
  • You can use wildcards or REs (regular expressions) in the specification of the queue and the PE.
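For instance, a wildcard in the PE name lets the scheduler choose between the IB and non-IB flavors of mpich; this sketch assumes GE's usual pattern matching for PE names:

    qsub -pe 'mpich*' 8 -l s_rt=40:00:00   # matches mpich or mpich_ib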


Examples

short serial job:
  qsub -q sTz.q                            request explicitly the sTz.q (short-T/serial) queue
  qsub -l s_cpu=10:00                      specify 10m of cpu (not guaranteed to go to sTz.q)
long serial job:
  qsub -q lTz.q                            request explicitly the lTz.q (long-T/serial) queue
  qsub -l s_cpu=240:00:00                  specify 10d (240h) of cpu
unlimited low priority serial job:
  qsub -l lp                               request low-priority
small parallel job, orte:
  qsub -pe orte 4 -l s_cpu=10:00           request 4 CPUs, orte PE, and 10m of cpu (not guaranteed to go to sTN.q)
medium parallel job, over IB, mpich_ib:
  qsub -pe mpich_ib 64 -l s_cpu=10:00:00   request 64 CPUs, mpich_ib PE, and 10h of cpu
long parallel job, no IB, mpich:
  qsub -pe mpich 64 -q lTN.q               request 64 CPUs, mpich PE, and explicitly the long-T no-IB queue
long parallel job, over IB, mpich_ib:
  qsub -pe mpich_ib 64 -q lTNi.q           request 64 CPUs, mpich_ib PE, and explicitly the long-T with-IB queue
openmp parallel job:
  qsub -pe openmp 16                       request 16 CPUs, openmp PE, no other resources
By default, all qsub requests are verified for compatibility and will not be submitted if they can never be scheduled.
  Using -w p allows you to test (poke) whether there is a suitable queue for the job, without submitting it, based on the cluster's as-is status.
  Using -w v allows you to test (verify) whether there is a suitable queue for the job, without submitting it, based on an empty cluster.
  Alternatively, you can use -w n (none) or -w w (warning) to override the default -w e (error).
The command qalter can be used to change some of the attributes of a job that has been queued and is not yet running (man qalter); see the example below.
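For example, to raise the soft cpu limit of a pending job (123456 is a hypothetical job number):

    qalter -l s_cpu=48:00:00 123456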

-- SylvainKorzennikHPCAnalyst - 11 Jul 2012
