Running Parallelized CASA on the Hydra Cluster:
A Progress Report


The Hydra Cluster
SAO has access to a Linux-based Beowulf cluster called hydra. The cluster consists of 296 compute nodes distributed over 10 racks, totalling 3116 compute cores (CPUs), and uses the Sun Grid Engine queuing system.

The CASA version
We use the two latest full releases, CASA 4.1.0 (released June 2013) and CASA 4.2.0 (released January 2014). The Release Notes detail major improvements over previous versions, such as the "experimental parallelization of most visibility-related tasks (but excluding those that produce new output MSs)."

The Test
We calibrate the M100 Band 3 Science Verification dataset using an edited version of the Python script alma-m100-analysis-hpc-regression.py, which is included in the CASA 4.1.0 and CASA 4.2.0 tarballs. For comparison, we first run the test on the RTDC machines rtdc7 and rglinux12.

Notes:
  1. A representative edited script is m100-rtdc7-8mms-7engines.py.
  2. Steps 0-18 constitute the calibration.
  3. All interactive lines have been removed to enable comparison with hydra batch jobs, which do not accept them.
  4. References to ALMA Science Data Model (ASDM) format files have also been removed, since they are not readily available. Before running the script, we rename:
            mv X54.ms X54-monolith.ms
            mv X220.ms X220-monolith.ms
  5. The number of engines is set to seven, a number acceptable on all test machines. See the initial lines of the Python script and the cluster configuration file.
  6. The script is executed using the commands in m100-rtdc7-8mms-7engines.sh.
    The "--nogui" option is necessary for hydra batch jobs, even if no X windows are used.

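The preparation steps in notes 4 and 6 can be sketched in Python as follows. This is only an illustration, not the contents of m100-rtdc7-8mms-7engines.sh, which are not reproduced here. The helper names rename_to_monolith and casa_batch_command are hypothetical, and the sketch assumes that the CASA executable is called casapy and that the script is passed with "-c".

```python
import os
import shlex


def rename_to_monolith(ms_name, directory="."):
    """Rename e.g. X54.ms to X54-monolith.ms and return the new name.

    A MeasurementSet is a directory on disk; os.rename handles
    directories as well as plain files.
    """
    root, ext = os.path.splitext(ms_name)
    new_name = root + "-monolith" + ext
    os.rename(os.path.join(directory, ms_name),
              os.path.join(directory, new_name))
    return new_name


def casa_batch_command(script, executable="casapy"):
    """Build a non-interactive CASA command line.

    "--nogui" comes from the notes above; the executable name and the
    "-c" option are assumptions about how the script is launched.
    """
    return [executable, "--nogui", "-c", script]


# Show the command line that would be run for the representative script.
command = casa_batch_command("m100-rtdc7-8mms-7engines.py")
print(" ".join(shlex.quote(part) for part in command))
```

Calling rename_to_monolith("X54.ms") and rename_to_monolith("X220.ms") in the data directory would reproduce the two mv commands of note 4.
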
The Results
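MACHINE                    Additional    CPU (GHz)   RAM (GB)   user             system           elapsed
-----------------------------------------------------------------------------------------------------------------
rtdc7 (RTDC)               -             16x3.5      48.0       0h 18m 30.924s   0h 9m 22.946s    1h 22m 41.580s
rglinux12 (RTDC)           -             8x2.9       48.0       0h 23m 48.106s   0h 9m 35.289s    1h 11m 56.657s
-----------------------------------------------------------------------------------------------------------------
hydra compute-0-32.local   NetApp disk   64x2.2      256        1h 29m 18.459s   1h 18m 32.504s   4h 46m 53.30s
hydra compute-0-32.local   local disk    64x2.2      256        1h 30m 2.212s    1h 37m 44.280s   3h 43m 13.11s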
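
To compare the runs more easily, the timings reported above can be converted to seconds. The following is a minimal sketch, not part of the original test scripts. The names to_seconds, TIMINGS and summarize are hypothetical. It computes the ratio of CPU time (user plus system) to elapsed time for each run, and the elapsed time relative to rtdc7.

```python
import re

# Timings copied from the results above: (user, system, elapsed).
TIMINGS = {
    "rtdc7": ("0h 18m 30.924s", "0h 9m 22.946s", "1h 22m 41.580s"),
    "rglinux12": ("0h 23m 48.106s", "0h 9m 35.289s", "1h 11m 56.657s"),
    "hydra (NetApp disk)": ("1h 29m 18.459s", "1h 18m 32.504s", "4h 46m 53.30s"),
    "hydra (local disk)": ("1h 30m 2.212s", "1h 37m 44.280s", "3h 43m 13.11s"),
}

_PATTERN = re.compile(r"\s*(\d+)h\s+(\d+)m\s+([\d.]+)s\s*$")


def to_seconds(text):
    """Convert a string such as '1h 22m 41.580s' to seconds."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError("unrecognized time format: %r" % text)
    hours, minutes, seconds = match.groups()
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def summarize(timings, reference="rtdc7"):
    """Return {machine: (cpu/elapsed ratio, elapsed relative to reference)}."""
    ref_elapsed = to_seconds(timings[reference][2])
    summary = {}
    for machine, (user, system, elapsed) in timings.items():
        cpu = to_seconds(user) + to_seconds(system)
        wall = to_seconds(elapsed)
        summary[machine] = (cpu / wall, wall / ref_elapsed)
    return summary


for machine, (cpu_ratio, relative) in sorted(summarize(TIMINGS).items()):
    print("%-22s cpu/elapsed = %.2f   elapsed vs rtdc7 = %.2f"
          % (machine, cpu_ratio, relative))
```

A CPU-to-elapsed ratio below 1 means that the reported CPU time was smaller than the wall-clock time of the run. How that should be read depends on how the timing tool accounts for the parallel engines, which the sketch does not address.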

Notes:
  1. The results displayed above are from CASA 4.1.0. CASA 4.2.0 did not complete, even on our best performing machine, rtdc7.
  2. The hydra tests were run on a single dedicated machine (compute-0-32.local) as sole user. Every other way of running the test gave far poorer results.
  3. Parallel CASA (through 4.2.0) is clearly still in the development phase. It is not designed to make efficient use of the capabilities of a cluster such as hydra.
  4. Detailed progress reports can be found at the links below:
            9/16/2013 (CASA 4.1.0)
            9/02/2014 (CASA 4.2.0)