
HPC Primer Introduction

Introduction

  • The cluster, hydra, is made of
    • two login nodes,
    • one front-end node,
    • a queue manager/scheduler (SGE), and
    • a slew of compute nodes.
  • You access the cluster by logging in to one of the login nodes (hydra.si.edu or hydra-login.si.edu) using ssh.
  • From either login node you submit and monitor your job(s) via the queue manager/scheduler:
    • the queue manager/scheduler is SGE, the grid engine, or simply GE.
    • SGE was formerly known as the Sun Grid Engine;
    • it is now either OGE, for Open Grid Engine (the free version), or the Oracle Grid Engine.
    • On hydra, we run the open version of SGE.
  • The front-end node (hydra-2.si.edu) runs the queue manager/scheduler;
    • it should not be used as a login node.
  • All the nodes (login, front-end, and compute nodes) are interconnected via Ethernet (1 Gbps).
  • Some of the compute nodes, and the front-end node, are also interconnected via InfiniBand (40 Gbps).

  • Hydra cluster schematic: see the figure cluster.jpg.

You can only log in to hydra from a trusted IP address, so if your desktop is not managed by the CF or HEA, you must

  • either login on the CF gateway login.cfa.harvard.edu,
  • or authenticate your desktop/laptop via VPN (check the CF's help page on how to connect to CfA's VPN),
and then ssh to hydra.si.edu.
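For example, a typical login sequence looks like the following sketch (replace username with your hydra account name):

    # from a trusted host, or after authenticating via VPN:
    ssh username@hydra.si.edu

    # or, coming from outside, go through the CF gateway first:
    ssh username@login.cfa.harvard.edu
    ssh username@hydra.si.edu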

The login nodes are for normal interactive use, like editing, compiling, script writing, etc., and for submitting jobs.

  • They are not compute nodes, nor is the front-end node, and thus they should not be used for actual computations, except for short interactive debugging sessions or short ancillary computations.
  • The compute nodes are the hosts on which you run your computations, by submitting a job to the queue system, via the qsub command, from a login node (see the example below).
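A sketch of the basic workflow from a login node (my_job.job is a hypothetical job script name; qstat and qdel are the standard SGE commands to monitor and delete jobs):

    # submit a job script to the queue system
    qsub my_job.job

    # monitor your queued and running jobs
    qstat

    # delete a job, using the job id reported by qstat
    qdel <job_id>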

Do not run jobs on the head node and do not run jobs out of band; this means:

  • do not log on a compute node to manually start a computation, always use qsub,
  • do not run scripts/programs that spawn additional tasks, unless you have requested the corresponding resources (assuming you know how to),
  • if you run something in the background (you really shouldn't), use wait so your job terminates only when all the associated processes have finished (see the sketch after this list),
  • If you run parallel jobs (MPI), read the relevant primer(s) and follow the instructions (you don't start MPI jobs on the cluster the way you do it on your workstation or laptop).
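If you do start processes in the background inside a job script, a minimal sketch of using wait looks like this (prog_a and prog_b are hypothetical program names):

    # two hypothetical programs started in the background
    ./prog_a &
    ./prog_b &

    # wait returns only once all background processes have finished,
    # so the job does not terminate while they are still running
    wait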

Remember that

  • you probably should optimize your executables with the appropriate compilation flags for production runs,
  • you probably need to write a script to submit a job,
  • you probably need to specify multiple options when qsub 'ing your script (a minimal example follows this list),
  • things don't always scale up; as you submit a lot of jobs that will run concurrently, ask yourself:
    • is there a name space conflict? (do all the jobs write to the same file, or have the same job name, ...)
    • what will be the resulting I/O load? (all read the same file, all write a lot of useless stuff)
    • how much will I use/abuse disk space? (fill up shared public disk space, heavy I/O load compared to CPU load)
  • you are not the sole user.
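A minimal sketch of a job script you could qsub (the embedded #$ lines are SGE options; the names and values shown are illustrative, see the job submission entries of this primer for the options appropriate to hydra):

    #!/bin/sh
    # file my_job.job -- minimal illustrative SGE job script
    #$ -N my_job          # job name
    #$ -cwd               # run the job in the directory it was submitted from
    #$ -o my_job.log      # file for the job's standard output
    #$ -j y               # merge standard error into that same file
    #
    echo "job started on $HOSTNAME at `date`"
    ./my_program          # hypothetical executable
    echo "job done at `date`"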

Login

  • Accounts on the cluster (passwords and home directories) are separate and distinct from CF or HEA unix accounts.
  • To use hydra you need to request a separate account.

Support

  • The cluster is managed by D.J. Ding (DingDJ@si.edu), the system administrator in Herndon, VA. Do not contact the CF or HEA support sys-admins.
  • Additional support for SAO is provided by SAO's HPC analyst (hpc@cfa). This role is currently assumed by Sylvain Korzennik (at 25% FTE).

Sylvain is not the sys-admin, so contact D.J. for problems, contact Sylvain for advice & suggestions.

A mailing list is available for the cluster users:

  • The mailing list (HPCC-L@si-listserv.si.edu) is read by the cluster sysadmin and the HPC analyst as well as by other cluster users.
  • Use this mailing list to communicate with other cluster users, share ideas, ask for help with cluster use problems, offer advice and solutions to problems, etc.
  • To email that list you must log in to the listserv and post your message.
  • Replies to these messages are by default broadcast to the list.
  • You will need to set up a password the first time you use it (upper right, under "Options").

Software

  • As for any unix system, you must properly configure your account to access all the system resources.
  • The configuration on the cluster is different from the CF- or HEA-managed machines:

Your ~/.bash_profile, ~/.bashrc, and/or ~/.cshrc need to be adjusted accordingly. You can look in ~hpc for examples (with ls -la ~hpc).
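For instance, to look at the example configuration files (the exact file you copy from will depend on your shell and on which packages you use):

    # list the example dot files in the hpc account's home directory
    ls -la ~hpc

    # inspect one of them, e.g. the example .bashrc (assuming it is there)
    cat ~hpc/.bashrc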

  • Compilers
    • GNU compilers (gcc, g++, gfortran, g90)
    • Portland Group (PGI) compilers (v12.5) and their Cluster Dev Kit (debugger, profiler, etc...)
    • Intel compilers (v12.0) and their Cluster Studio (debugger, profiler, etc...)

  • Libraries
    • MPI, for the GNU, PGI, and Intel compilers, with InfiniBand (IB) support
    • math libraries that come with compilers
    • AMD math libraries

  • S/W Packages
    • IDL, including 128 run-time licenses (for batch processing)
    • IRAF (v1.7)

More can be installed upon request.

To properly configure access to these, refer to the respective entries in the primer. Once configured, a typical compilation might look like the sketch below.
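As an illustrative sketch only (the exact compiler and MPI wrapper names to use on hydra are described in those entries; mpif90 is the usual MPI wrapper for Fortran):

    # serial compile with the GNU Fortran compiler, with optimization
    gfortran -O2 -o my_prog my_prog.f90

    # MPI compile, using the MPI compiler wrapper for the chosen suite
    mpif90 -O2 -o my_mpi_prog my_mpi_prog.f90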

-- SylvainKorzennikHPCAnalyst - 23 Jan 2012
