Error Messages FAQs

  • error: can't unpack gdi request
If qstat returns an error message like the following:
hydra % qstat -j 302388
 error: can't unpack gdi request
 error: error unpacking gdi request: bad argument
 failed receiving gdi request
the file system /var is likely full. Email the cluster's sysadmin (DJ) to have this fixed.
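If you want to confirm this before emailing, df reports how full a file system is (this assumes the full /var is on the host you are logged into, which may not always be the case):

  df -h /var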

  • error: ending connection before all data received

Users of the ORTE parallel environment (qsub -pe orte N and /opt/openmpi/bin) can run into the following typical error:
/opt/openmpi/bin/mpirun -np 8 mycode
error: error: ending connection before all data received
error: 
error reading job context from "qlogin_starter"
--------------------------------------------------------------------------
A daemon (pid 8307) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished

This error is the result of ORTE (the Open Run-Time Environment) trying to use Grid Engine's qrsh, rather than ssh, to start the slave processes. This will be explained in more detail in the How to submit MPI jobs primer (not yet ready).

The solution is to add, if you use the csh, the line:

  setenv  OMPI_MCA_plm_rsh_disable_qrsh 1

or, if you use the sh, the line:

  export OMPI_MCA_plm_rsh_disable_qrsh=1

to your job file (the file you submit with qsub), before the /opt/openmpi/bin/mpirun command.
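For example, a minimal sh job file might look like the following sketch (the slot count and the mycode executable are placeholders, and the #$ directives should be adjusted to your needs):

  #!/bin/sh
  #$ -S /bin/sh
  #$ -cwd
  #$ -pe orte 8
  # disable qrsh so mpirun launches its remote daemons via ssh
  export OMPI_MCA_plm_rsh_disable_qrsh=1
  /opt/openmpi/bin/mpirun -np 8 ./mycode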


  • Illegal instruction
I ran into the Illegal instruction error when running some of my executables on some compute nodes. The code ran fine on the head node, but once queued it failed only some of the time. I eventually managed to have it fail on specific nodes and could reproduce a systematic "runs or fails" pattern.

The problem turned out to be the result of compilation optimization (I use the aggressive PGI -fast -Mipa=fast optimization flags) and the fact that the CPUs on some nodes were older than the one on the head node. The Illegal instruction refers to a binary instruction in the executable that is not compatible with the CPU on that specific node.

The solution (if using the PGI compiler) is to add the flag -tp k8-64, since compiling on the head node defaults to -tp k8-64e (you can see this by adding the -v option when compiling). The -tp flag stands for target processor (TP), and the compiler assumes that your TP is the same as the machine you compile on. When optimizing, compilers may use op-codes (and registers) specific to that TP.

The solution I opted for was to use -tp k8-64,k8-64e when using the PGI compiler, so as to produce an executable for both architectures; a sample compile line is shown below. Other compilers have similar TP flags. This will be explained in more detail in the How to compile your programs primer (not yet ready).
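For instance, a compile line along these lines (mycode.f90 is just a placeholder source file) produces a unified binary that runs on both TPs:

  pgf90 -fast -Mipa=fast -tp k8-64,k8-64e -o mycode mycode.f90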

This condition is likely to occur when the head node gets upgraded while some compute nodes are not, and thus still have older processors.

BTW, pgf90 -help -tp lists the potential TPs, while the PGI user guide (p. 210 in pgiug.pdf) addresses this in detail.

-- SylvainKorzennikHPCAnalyst - 13 Dec 2011
