- What queuing system is running on hydra and where is the documentation?
The queuing system running on hydra is the Sun Grid Engine (SGE), simply because it is the default system with the Rocks distribution.
You run jobs on compute nodes by submitting them to the queue from the head node. The head node is hydra, the machine you log on to.
You do not log on to the compute nodes, nor try to force which node(s) your job(s) will run on.
Instead you tell the queue scheduler what your jobs need when submitting a job with the command qsub.
Do not run jobs out of band, namely starting jobs directly on nodes, or spawning more tasks without letting the queue scheduler know what resources you need (e.g., parallel jobs using MPI; refer to the relevant primer).
To view the documentation on SGE, point your browser to:
host@cfa% firefox https://hydra-2.si.edu/roll-documentation/sge/5.4.3/
This should work from any machine at CfA, or any machine connected to CfA via our VPN. Note that the 5.4.3/ part will change when the system gets upgraded to a newer Rocks release.
You can view the documentation on Rocks with:
host@cfa% firefox https://hydra-2.si.edu/roll-documentation/base/5.4.3/
or for that matter all available documentation by pointing your browser to https://hydra-2.si.edu/roll-documentation/.
You will see a page listing all the documentation available on hydra (as of Mar 2012).
- the URLs start with https, not http: any connection to http://hydra-2.si.edu will fail on a timeout.
- Since we do not use a commercial (i.e., one we pay for) SSL certificate provider,
firefox will warn you of imminent doom when connecting for the first time to hydra-2.si.edu:
  This Connection is Untrusted
  You have asked Firefox to connect securely to hydra-2.si.edu,
  but we can't confirm that your connection is secure.
Add a (permanent) exception for firefox to "trust this site's identification". If you make the exception permanent, you will only have to do this once.
- How do I submit a job to the queue?
Refer to the corresponding primer.
Here is a simple example. Let us assume that you have a program crunch.f that you have successfully compiled as crunch and want to run.
The program (crunch) is in /pool/cluster/myname/test; all the files needed to run it are in that same directory, and the files it produces will go there too.
In any directory, create a file qCrunch.job as follows:
hydra% cat qCrunch.job
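The contents of qCrunch.job did not survive in this copy; a minimal csh job script for this example might look like the following sketch (only the cd path comes from the example above, the rest is illustrative):

```shell
#!/bin/csh
# go to the directory that holds crunch and its input files
cd /pool/cluster/myname/test
# run the program; its output files land in this same directory
./crunch
```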
Remember, unless you tell qsub otherwise, this file should be using csh syntax (not sh syntax).
Then submit the job, as follows:
hydra% qsub -j y -o $cwd/qCrunch.log -N qCrunch qCrunch.job
Your job NNNNNNN ("qCrunch") has been submitted
-j y merge stderr w/ stdout
-o $cwd/qCrunch.log put merged stderr & stdout in $cwd/qCrunch.log
-N qCrunch name that job "qCrunch"
qCrunch.job the actual instructions (csh) to run the job
See the man pages (man qsub, man qstat) for more info on these commands and their options.
NOTE: this is not the way to submit parallel jobs (e.g., MPI jobs), nor the most efficient way to submit job arrays. Read the primers on job submission.
ALSO, if your job spawns a background task, make sure to add a wait at the end of the job script. This way the scheduler doesn't think that your job is finished while in fact it is still running (in the background). The best-practice rule is to not start/spawn background tasks.
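As an illustration, here is a sketch of a hypothetical csh job script that spawns a background task (the file names are illustrative):

```shell
#!/bin/csh
cd /pool/cluster/myname/test
./crunch input1 > out1.log &   # spawned in the background
./crunch input2 > out2.log     # runs in the foreground
wait                           # do not exit until the background task is done
```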
Use qstat -u $USER -s r to get a list of your running job(s), -s p for pending job(s); omit the -s option to see them all. Use -u '*' to see all users' jobs.
- How do I monitor the queue?
Refer to the corresponding primer.
qmon is the GUI-based queue monitoring tool (part of SGE) and requires that your ssh connection has X-tunneling enabled. FYI, I'm not a fan of it.
ganglia is the cluster monitoring tool (part of Rocks). It runs in a web browser; go to https://hydra-2.si.edu/ganglia/
- How do I kill a job?
Use qdel NNNNNNN, where NNNNNNN is the job ID listed by qstat. You can kill all your jobs with qdel -u $USER.
- What is the limit on the number of jobs I can submit?
Read the primer on queues, as the limits depend on which queue the job is/will be running in.
You can view these limits with:
hydra% qconf -srqs
They are explained in the queues primer, under limits.
There is also a limit on the number of jobs you can queue; still, out of consideration for the user community, we ask you to keep the number of queued jobs to a reasonable value.
- How do I request specific resources (like memory)?
You can request specific resources for your jobs.
For instance you can tell the scheduler that your
job needs at least so much memory. The scheduler
will start that job only when that resource becomes available.
Resources can be specified either as directives in the submitted script, or after the -l flag to qsub. For example:
hydra% qsub -l mem_free="2G" script.csh
requests 2GB of free memory for script.csh to run.
This is explained in the queues primer, under memory limits.
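Equivalently, the same memory request can be embedded in the job script as an SGE directive (a sketch; the script body is illustrative):

```shell
#!/bin/csh
#$ -l mem_free=2G
cd /pool/cluster/myname/test
./crunch
```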
- How do I limit the number of jobs I submit?
If you submit your jobs using job arrays (using the -t flag to qsub), the flag -tc 24 will limit the number of concurrently running tasks (i.e., to 24).
(See the primer on submitting jobs for more details on job arrays).
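For example, the following (hypothetical) submission starts a 240-task job array but caps it at 24 concurrently running tasks:

```shell
hydra% qsub -t 1-240 -tc 24 -N crunchArray qCrunch.job
```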
You can otherwise limit the number of jobs in the queue with the following script:
hydra% cat <<'EOF' > q-wait.csh
#!/bin/csh
@ NMAX = 24
while (1)
  @ N = `qstat -u $USER | tail --lines=+3 | wc -l`
  if ($N < $NMAX) exit
  sleep 60
end
EOF
hydra% chmod +x q-wait.csh
and use the instruction ./q-wait.csh in the script you use to submit your jobs.
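For instance, a hypothetical csh submission script could throttle itself like this (the job names and files are illustrative):

```shell
#!/bin/csh
# submit 100 jobs, pausing whenever too many are already queued
foreach i (`seq 1 100`)
  ./q-wait.csh
  qsub -N crunch$i crunch$i.job
end
```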
- How do I submit a job to start only after another one has completed?
Use the -hold_jid flag when qsub'ing, i.e., something like:
hydra% qsub -N FirstOne pre-process.job
Your job 12345678 ("FirstOne") has been submitted
hydra% qsub -hold_jid 12345678 -N SecondStage post-process.job
Your job 12345679 ("SecondStage") has been submitted
- What does the job status Eqw mean?
Eqw stands for error while waiting in the queue. You can find out the reason for the error state with:
hydra% qstat -j NNNNNNNN -explain E
where NNNNNNNN is the job number.
t stands for transferring. It usually occurs just before a job is started, i.e., when a job is being transferred from pending (qw: waiting in the queue) to running (r).
dr stands for deletion, i.e., deleted while running. It typically reflects jobs whose compute node died (went NA) and that were subsequently deleted by the user with qdel.
- What shell does qsub use?
Even if you put #!/bin/sh as the first line of the job file you submit, qsub interprets the file as csh, unless you override the shell used by qsub.
Namely, suppose that your script simulate.sh uses sh syntax (e.g., it sets a variable with export) and is executable (chmod +x simulate.sh). It will run fine as ./simulate.sh on the head node, but produce the error message
  export: Command not found.
when qsub'ed... Indeed, the command export is not a csh command, and without any specific directive, qsub passes the file to /bin/csh and ignores the #! in the first line.
You must instead either use
hydra% qsub -S /bin/sh simulate.sh
when submitting the job, or add the directive #$ -S /bin/sh to the script file, i.e.,
#$ -S /bin/sh
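In full, a minimal simulate.sh with the directive in place might look like this sketch (the export line and program name are illustrative):

```shell
#!/bin/sh
#$ -S /bin/sh                            # tell qsub to use /bin/sh, not csh
export SCRATCH=/pool/cluster/myname/test
cd $SCRATCH
./simulate
```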
- 12 Dec 2011