Running CASA on the Hydra Cluster: A Progress Report

The Hydra Cluster
SAO has access to a Linux-based Beowulf cluster called hydra. The cluster consists of 296 compute nodes distributed over 10 racks, totalling 3116 compute cores (CPUs), and uses the Sun Grid Engine queuing system.

The CASA version
We use CASA 4.1.0, released on May 28, 2013. The Release Notes detail major improvements over previous versions, such as the "experimental parallelization of most visibility-related tasks (but excluding those that produce new output MSs)."

The Test
We calibrate the M100 Band 3 Science Verification dataset using an edited version of the Python script alma-m100-analysis-hpc-regression.py, which is included in the CASA 4.1.0 tarball. For comparison, we run the initial tests on the RTDC machines rtdc7 and rglinux12.

Notes:
  1. A representative edited script is m100-rtdc7-8mms-7engines.py.
  2. Steps 0-18 constitute the calibration.
  3. All interactive lines have been removed to enable comparison with hydra batch jobs, which do not accept them.
  4. References to ALMA Science Data Model (ASDM) format files have also been removed, since they are not readily available. Before running the script, we rename:
            mv X54.ms X54-monolith.ms
            mv X220.ms X220-monolith.ms
  5. The number of engines is set to seven, a number acceptable on all test machines. See the initial lines of the Python script and the cluster configuration file.
  6. The script is executed using the commands in m100-rtdc7-8mms-7engines.sh.
    "--nogui" is necessary for hydra batch jobs, even if no X-windows are used.
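
The preparation described in notes 4 and 6 can be sketched in Python. This is only an illustration: the contents of m100-rtdc7-8mms-7engines.sh are not reproduced in this report, so the executable name casapy and the use of its -c option to run a script are assumptions, and the helper names (monolith_name, rename_to_monolith, casapy_command) are hypothetical.

```python
import os
import shutil


def monolith_name(ms_name):
    """Return the renamed form, e.g. 'X54.ms' -> 'X54-monolith.ms'."""
    root, ext = os.path.splitext(ms_name)
    return root + "-monolith" + ext


def rename_to_monolith(ms_names, workdir="."):
    """Rename each measurement set (a directory) found in workdir.

    Names that are missing, or whose target already exists, are skipped.
    Returns the list of new paths that were actually created.
    """
    renamed = []
    for name in ms_names:
        src = os.path.join(workdir, name)
        dst = os.path.join(workdir, monolith_name(name))
        if os.path.isdir(src) and not os.path.exists(dst):
            shutil.move(src, dst)
            renamed.append(dst)
    return renamed


def casapy_command(script, gui=False):
    """Build the argument list for a non-interactive CASA run.

    Assumption: the wrapper calls 'casapy' and passes the script with '-c'.
    '--nogui' is the flag the notes above require for hydra batch jobs.
    """
    cmd = ["casapy"]
    if not gui:
        cmd.append("--nogui")
    cmd += ["-c", script]
    return cmd


if __name__ == "__main__":
    print(monolith_name("X54.ms"))
    print(" ".join(casapy_command("m100-rtdc7-8mms-7engines.py")))
```

Returning the command as a list rather than a single string means it can be passed directly to subprocess without shell quoting, should the run be launched from Python instead of a shell script.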

The Results
Machine     Mode           CPU (cores x GHz)   RAM (GB)   user          system        elapsed
rtdc7       non-batch      16x3.5              48.0       18m 30.924s   9m 22.946s    82m 41.580s
rglinux12   "              8x2.9               "          23m 48.106s   9m 35.289s    71m 56.657s
.           .              .                   .          .             .             .
hydra       serial batch   ---                 ---        coming        coming        coming
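
To compare the rows numerically, the timings can be converted to seconds. The sketch below assumes the user, system, and elapsed columns follow the "Nm S.SSSs" format of the shell time command; the function names to_seconds and cpu_fraction are hypothetical, and only the two completed rows are included.

```python
import re

# Timings copied from the results table; the hydra row is still pending.
TIMINGS = {
    "rtdc7": {"user": "18m 30.924s", "system": "9m 22.946s",
              "elapsed": "82m 41.580s"},
    "rglinux12": {"user": "23m 48.106s", "system": "9m 35.289s",
                  "elapsed": "71m 56.657s"},
}


def to_seconds(text):
    """Convert a string such as '18m 30.924s' to seconds."""
    match = re.match(r"\s*(\d+)m\s*(\d+(?:\.\d+)?)s\s*$", text)
    if match is None:
        raise ValueError("unrecognized time format: %r" % text)
    return int(match.group(1)) * 60 + float(match.group(2))


def cpu_fraction(row):
    """Return (user + system) / elapsed for one table row.

    This is the CPU time charged to the timed process divided by the
    wall-clock time. It may not include time spent in separate engine
    processes, so treat it as a rough indicator only.
    """
    cpu = to_seconds(row["user"]) + to_seconds(row["system"])
    return cpu / to_seconds(row["elapsed"])


if __name__ == "__main__":
    for machine in sorted(TIMINGS):
        row = TIMINGS[machine]
        print("%-10s elapsed %7.1f s  (user+system)/elapsed %.2f"
              % (machine, to_seconds(row["elapsed"]), cpu_fraction(row)))
```

When the hydra timings become available, they can be added to TIMINGS as a third entry in the same format.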


A more detailed progress report is also available.