
High Memory Cluster Lattice

Introduction

The OCIO/ORIS has purchased two new nodes, each with 24 cores and 504GB of memory. These nodes are intended for jobs that require a lot of memory.

Until they are merged with Hydra (this requires a software upgrade of Hydra, which will be scheduled soon), we have set up a separate, temporary cluster, Lattice.

Currently, access to Lattice (ssh lattice.si.edu) is restricted to users known to require a lot of memory.

If you'd like to try Lattice, let me and DJ know. Lattice will be used by DJ to test some other features we consider porting to Hydra.

Configuration

Lattice has been configured as follows:

  • One of the two new nodes serves as both head node and compute node (lattice); the second one (compute-1-0) is a 'pure' compute node.

  • The other nodes are for DJ's tests.

  • They are connected via InfiniBand (IB), but the required software and queues await testing and verification.

  • Lattice has 24 + 20 = 44 slots and 504GB + 440GB = 944GB of memory available for computation.

  • Most (but not all) disks available on Hydra are mounted on Lattice.

  • The disk /home on Lattice is not shared with Hydra; you can access what is /home/XXXX on Hydra as /hydra_home/XXXX on Lattice.

  • The disk /home on Lattice is small; do not use it for data, use a /pool disk instead.

  • The GNU compilers are available (gcc), and version 4.8.2 is available under /share/apps/gcc/4.8.2 (see the example after this list).

  • OpenMPI 1.6.5 for gcc 4.7.1 and 4.8.2 has been installed under /share/apps/openmpi but needs verification and testing.

  • The PGI and Intel compilers are NOT available on Lattice, neither is IDL. They may be ported to Lattice if there is justified demand and we do not run into licensing issues.
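
For example, to use gcc 4.8.2 in an interactive session or in a job script, you would prepend it to your path along these lines (a minimal sketch: the bin/ and lib64/ sub-directory layout under /share/apps/gcc/4.8.2 is an assumption, so verify the actual tree first):

# put gcc 4.8.2 ahead of the system gcc (assumes a standard bin/ layout)
export PATH=/share/apps/gcc/4.8.2/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/gcc/4.8.2/lib64:$LD_LIBRARY_PATH

# check which gcc is now picked up
which gcc
gcc --version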

Queues

  • Lattice has been configured so far with only two queues:
    • all.q: 15-day cpu-time / 30-day elapsed-time limit,
      max of 10GB of memory allowed (per core if using OpenMP),
      batch mode only,
      serial and OpenMP jobs (no MPI).
    • himem.q: 15-day cpu-time / 30-day elapsed-time limit,
      max of 450GB of memory allowed (almost all of a node's memory),
      batch mode only,
      serial and OpenMP jobs (no MPI).

  • Use qconf -sq all.q or qconf -sq himem.q to get all the configuration details on each queue.
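
For example, the following standard Grid Engine commands list the defined queues and pull the cpu- and elapsed-time limits (the h_cpu and h_rt attributes) out of one of them:

# list the names of all defined queues
qconf -sql

# show just the cpu and elapsed-time limits of all.q
qconf -sq all.q | grep -E 'h_cpu|h_rt'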

Limits

  • Overall limits are:
    • 20 slots/user, 44 for all users
    • 600GB total mem/user, 960GB for all users (values to be tweaked)

  • These may be fine-tuned, so check them with qconf -srqs.

  • Also, the mem_reserved mechanism (a consumable) has been implemented (as explained for Hydra in the memory limits writeup); by default, a job consumes 10GB.

  • Reserve the memory you will need with -l mem_reserved=XXX, where XXX is something like 50G.

  • Do not use -l mem_free=MMM, as it generates the 'no suitable queue' error.

  • To run in the himem.q queue, you MUST specify -l hm (see the job-file sketch after this list).
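
These -l options can also be embedded in the job file itself, using the usual Grid Engine #$ directives. Here is a minimal sketch (the job name, the 50G values, and the executable are placeholders, not a prescription):

# file: big-mem.job -- a hypothetical example; adjust names and values
#$ -N big-mem
#$ -cwd -j y
# target the high-memory queue, which requires -l hm (see above)
#$ -q himem.q
#$ -l hm
# reserve 50GB and cap the job's memory use at 50GB
#$ -l mem_reserved=50G,h_data=50G,h_vmem=50G
#
echo "running on $HOSTNAME"
./my-big-mem-program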

Note

  • You may need to re-compile your code on Lattice.

Examples:

qsub -l mem_reserved=5G test.job

-- submits test.job to the all.q queue, but reserves only 5GB (instead of the default 10GB)

qsub -q himem.q -l hm,mem_reserved=80G,h_data=80G,h_vmem=80G test.job

-- submits test.job to the himem.q queue, and uses up to 80GB of memory (reserves 80GB and limits the job to 80GB)

  • You will find examples under ~hpc/tests (on Lattice):

test.job     a trivial job
test-hm.job  a trivial job, but one that runs in the himem.q queue
memhog.job   a memory-hog job that uses memhog (an executable built from memhog.c)

Examples of how to run memhog.job are in q-memhog.sou

In that example, the command q ... memhog.job 16 8 1 passes 3 arguments to memhog.job. It launches a job that allocates 16GB of memory (1st argument), sets that array to zero, increments its elements 8 times (2nd argument, which controls how long memhog runs and needs to remain smaller than 128), and repeats it all 1 time (3rd argument, which again lets you control how long memhog runs). If you increase 16 (to hog more memory), you must also increase the values listed in the -l specification, i.e., the -l hm,mem_reserved=16G,h_data=16G,h_vmem=16G option (see the reconstruction below).
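
For illustration, the plain qsub equivalent of that 16GB example would look something like the line below (a reconstruction, not a copy of q-memhog.sou; check that file for the exact form):

# reserve/cap 16GB to match the 16 (GB) passed as the 1st argument
qsub -q himem.q -l hm,mem_reserved=16G,h_data=16G,h_vmem=16G memhog.job 16 8 1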

To submit OpenMP jobs, refer to Hydra's primer; a brief sketch follows.
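
As a sketch, Grid Engine OpenMP jobs are typically submitted through a parallel environment; the PE name below is an assumption (it is the one used on Hydra), so check what is actually defined on Lattice first:

# list the parallel environments defined on Lattice
qconf -spl

# hypothetical 8-thread OpenMP submission; 'mthread' is Hydra's PE name
# and may differ on Lattice
qsub -pe mthread 8 test.job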

-- SylvainKorzennikHPCAnalyst - 2014-03-17
