
Queues FAQs

  • What queuing system is running on hydra and where is the documentation?
The queuing system running on hydra is the Sun Grid Engine (SGE), simply because it is the default system with the Rocks distribution.

You run jobs on the compute nodes by submitting them to the queue from the head node. The head node is hydra, the machine you log on to. You do not log on to the compute nodes, nor try to force which nodes your job(s) will run on. Instead you tell the queue scheduler what your jobs need when submitting them with the command qsub. Do not run jobs out of band, i.e., do not start jobs directly on the nodes, and do not spawn more tasks than the queue scheduler knows about (for parallel jobs using MPI, refer to the relevant primer).

To view the documentation on SGE, use:

host@cfa% firefox https://hydra-2.si.edu/roll-documentation/sge/5.4.3/
This should work from any machine at CfA, or any machine connected to CfA via our VPN. Note that the 5.4.3/ part will change when the system gets upgraded to a newer Rocks release.

You can view the documentation on Rocks with:

host@cfa% firefox https://hydra-2.si.edu/roll-documentation/base/5.4.3/
or for that matter all available documentation by pointing your browser to https://hydra-2.si.edu/roll-documentation/.

NOTES

  • Use hydra-2, not hydra (as of Mar 2012).
  • The URLs start with https, not http; any connection to http://hydra-2.si.edu will fail with a timeout.
  • Since we do not use a commercial (i.e., one we pay for) SSL certificate provider, firefox will warn you of imminent doom when connecting for the first time to https://hydra-2.si.edu.

You will see a page with

This Connection is Untrusted 
You have asked Firefox to connect securely to hydra-2.si.edu, 
but we can't confirm that your connection is secure.
Add a (permanent) exception for hydra-2.si.edu, telling Firefox to "trust this site's identification". If you make the exception permanent, you will only have to do this once.

  • How do I submit a job to the queue?

Refer to the corresponding primer.

Here is a simple example. Let us assume that you have a program crunch.f that you have successfully compiled into the executable crunch and want to run. The program (crunch) is in /pool/cluster/myname/test, all the files needed to run it are in that same directory, and any files it produces will go there too. In any directory, create a file qCrunch.job as follows:
hydra% cat qCrunch.job
cd /pool/cluster/myname/test
./crunch

Remember, unless you tell qsub otherwise, this file should be using csh syntax (not sh).

Then submit the job, as follows:

hydra% qsub -j y -o $cwd/qCrunch.log -N qCrunch qCrunch.job
Your job NNNNNNN ("qCrunch") has been submitted

where

  -j y                 merge stderr with stdout
  -o $cwd/qCrunch.log  put the merged stderr & stdout in $cwd/qCrunch.log
  -N qCrunch           name the job "qCrunch"
  qCrunch.job          the actual instructions (csh) to run the job

Read man qsub for more info on the qsub command.
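The same qsub options can also be embedded in the job file itself as SGE "#$" directives, keeping the command line down to qsub qCrunch.job. A sketch, reusing the crunch example above (the -cwd directive and the relative -o path are additions here, not part of the original example):

```
#!/bin/csh
# qCrunch.job, with the qsub options embedded as SGE "#$" directives,
# so the job can be submitted with just:  qsub qCrunch.job
#$ -j y
#$ -o qCrunch.log
#$ -N qCrunch
#$ -cwd
# -cwd runs the job (and writes qCrunch.log) in the submission directory
cd /pool/cluster/myname/test
./crunch
```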

NOTE: this is not the way to submit parallel jobs (i.e., MPI), nor the most efficient way to submit job arrays. Read the primers on job submission.

ALSO, if your job spawns a background task, make sure to add a wait at the end of the job script. This way the scheduler does not think that your job is finished while it is in fact still running (in the background). The best practice is not to start/spawn background tasks at all.
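A minimal sketch (plain sh, with sleep standing in for hypothetical real work) of why the wait matters:

```shell
#!/bin/sh
# Sketch: a job script that spawns a background task and waits for it.
run_with_background() {
    sleep 1 &                   # side task runs in the background
    echo "main task running"    # foreground work happens here
    wait                        # block until all background children exit
    echo "all tasks done"       # only now may the script (the job) end
}
run_with_background
```

Without the wait, the script would exit right after the foreground work, and the scheduler would consider the job finished while the background task is still running.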


  • How do I list my jobs?
Use qstat -u $USER -s r to get a list of your running job(s), -s p for pending job(s), or omit the -s option to see them all. Use -u '*' to see all users' jobs.

  • How do I monitor the queue?

Refer to the corresponding primer.

  • qmon is the GUI-based queue monitoring tool (SGE) and requires that your ssh connection to hydra have X-tunneling enabled. FYI, I'm not a fan of it.
    hydra% qmon
    
  • ganglia is the cluster monitoring tool (Rocks). It runs in a web browser; go to https://hydra-2.si.edu/ganglia/

  • How do I kill my job(s)?
Use qdel NNNNNNN where NNNNNNN is the job ID listed by qstat. You can kill all your jobs with qdel -u $USER.

  • What is the limit of jobs I can submit?
Read the primer on queues, as the limits depend on which queue the job is/will be running in.

You can view these limits with:

hydra% qconf -srqs

They are explained in the queues primer, under limits, both as per-job and as overall limits.

There is also a limit on the number of jobs you can queue; still, out of consideration for the user community, we ask that you limit the number of jobs you queue to a reasonable value.


  • How do I request specific resources (like memory)?
You can request specific resources for your jobs. For instance, you can tell the scheduler that your job needs at least a given amount of memory. The scheduler will start that job only when that resource becomes available.

Resources can be specified either as directives in the submitted script, or after the -l flag to qsub, e.g., the command

hydra% qsub -l mem_free="2G" script.csh
requests 2 GB of free memory for script.csh to run.

This is explained in the queues primer, under memory limits.
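The same request can be embedded in the script as a "#$" directive instead of being given on the qsub command line. A sketch, reusing the hypothetical crunch example:

```
#!/bin/csh
# request 2 GB of free memory as an embedded directive
#$ -l mem_free=2G
#$ -N qCrunch
cd /pool/cluster/myname/test
./crunch
```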


  • How do I limit the number of jobs I submit?
If you submit your jobs as a job array (using the -t flag of qsub), the -tc 24 flag will limit the number of concurrently running tasks (here, to 24). (See the primer on submitting jobs for more details on job arrays.)
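A sketch of such a job array script: SGE sets SGE_TASK_ID to each task's index, and the input.$SGE_TASK_ID file naming here is only an assumption for illustration.

```
#!/bin/csh
#$ -N crunch-array
#$ -t 1-100
#$ -tc 24
# 100 tasks total, at most 24 running at any one time;
# each task processes the input file matching its index
cd /pool/cluster/myname/test
./crunch input.$SGE_TASK_ID
```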

You can otherwise limit the number of jobs in queue with the following csh script:

hydra% cat << "EOF" > q-wait.csh
#!/bin/csh
#
@ NMAX = 24
#
loop:
  @ N = `qstat -u $user | tail --lines=+3 | wc -l`
  if ($N >= $NMAX) then
    sleep 180
    goto loop
  endif
EOF
hydra% chmod +x q-wait.csh
and use the instruction ./q-wait.csh in your queuing script. (Note that the quotes around "EOF" are needed, so the backquoted qstat command is written to the file verbatim instead of being executed when the file is created.)

  • How do I submit a job to start only after another one has completed?
Use the -hold_jid flag when qsub'ing, i.e., something like:
hydra% qsub -N FirstOne pre-process.job
Your job 12345678 ("FirstOne") has been submitted
hydra% qsub -hold_jid 12345678 -N SecondStage post-process.job
Your job 12345679 ("SecondStage") has been submitted
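If you submit from a script, you can capture the first job's ID instead of copying it by hand, using qsub's -terse flag (available in recent SGE versions; it prints only the job ID). A sketch, reusing the hypothetical job names above:

```
#!/bin/sh
# submit the pre-processing job, capturing its job ID
jid=`qsub -terse -N FirstOne pre-process.job`
# the second job starts only once the first has completed
qsub -hold_jid "$jid" -N SecondStage post-process.job
```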

  • What do the job states Eqw, t, or dr mean?
* Eqw stands for error while waiting in queue. You can find out the reason for the error state with
hydra% qstat -j NNNNNNNN -explain E
where NNNNNNNN is the job number.

* t stands for transferring. It usually occurs just before a job is started, i.e., when a job is being transferred from pending (qw: waiting in the queue) to running (r).

* dr stands for deletion, i.e., deleted while running. It typically reflects jobs whose compute node died (NA) and that were subsequently deleted by the user with qdel.


  • What shell does qsub use?
Despite a #!/bin/sh as the first line of the job file you submit, qsub interprets the file as csh, unless you override the shell used by qsub with -S /bin/sh.

Namely, suppose that your script simulate.sh is

#!/bin/sh
#
cd /home/myname/simulations
MODEL=universe
export MODEL
./simulate
and is executable (chmod +x). It will run fine as ./simulate.sh on the head node, but produce the error message export: Command not found. when qsub'ed... Indeed, the command export is not a csh command, and without any specific directive, qsub passes the file to /bin/csh and ignores the #! in the first line.

You must instead either use

hydra% qsub -S /bin/sh simulate.sh
when submitting the job, or add the directive #$ -S /bin/sh to the script file, i.e.,
#!/bin/sh
#$ -S /bin/sh
#
cd /home/myname/simulations
MODEL=universe
export MODEL
./simulate
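Alternatively, write the script in csh syntax to match qsub's default shell; in csh, setenv replaces the sh-style assignment-plus-export pair:

```
#!/bin/csh
#
cd /home/myname/simulations
# setenv is the csh equivalent of  MODEL=universe; export MODEL
setenv MODEL universe
./simulate
```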

SylvainKorzennikHPCAnalyst - 12 Dec 2011

Topic revision: r13 - 2013-07-11 - SylvainKorzennikHPCAnalyst
 