------------------------------------------------------------------------------------------------------------------
Progress Report on Running Parallel CASA on the Hydra Cluster
Alice Argon, 9/16/2013
------------------------------------------------------------------------------------------------------------------

5/20/2013: CASA 4.1.0 release (libraries not included)
6/07/2013: CASA 4.1.0 release (libraries included)

The CASA 4.0.1 Release Notes claim: "experimental parallelization of most visibility-related tasks
(but excluding those that produce new output MSs)."

Notes: MS = measurement set, i.e., data set.
       Nothing new regarding parallelization in this release.

Overview of the Test:

Calibrate the M100 Band 3 Science Verification dataset:
      https://almascience.nrao.edu/alma-data/science-verification
using steps 0-18 of "alma-m100-analysis-hpc-regression.py", included in the CASA 4.1.0 tarball.

Preparation:

1) Remove all references to plots, X-windows, etc., from the python script.  Hydra batch jobs will
   not accept them.

2) Use the "-nogui" option when running CASA.  This is required even if plot routines are not
   explicitly called.

3) Force the number of "engines" to be 7 (a number all test machines can accommodate) by means of a
   configuration file:
      http://www.cfa.harvard.edu/rtdc/ALMA/HYDRA/
   If this is not set, CASA makes its own selection based on machine resources, but not necessarily
   the same number from run to run.  For reasons unclear to me, it is almost always an odd number
   (7, 9, 15, 21, ...).

Note: CASA splits the MS into sub-MSs (more about this below).  Each engine takes one of the sub-MSs
and performs the calibration at hand, allegedly in parallel.

Non-hydra test results, for comparison:

   ---------------------------------------------
        rtdc7                rglinux12
     [ 16 CPUs ]           [ 8 CPUs ]
     [ 48 GB RAM ]         [ 48 GB RAM ]
   ---------------------------------------------
     78m 26.600s           71m 56.657s     real
     18m 11.368s           23m 48.106s     user
      8m 27.040s            9m 35.289s     sys
   ---------------------------------------------

Notes:
1) The rglinux12 test was run on 8/22 and not reported to Sylvain or Jeff.
2) The python script (above) calls for 8 sub-MSs (mynumsubmss = 8).  All attempts to set this to any
   other number result in a failed run.

Initial hydra tests (pre-08/05/2013):

Ran many hydra tests, but could do no better than a real run time of 55156.0393691 s (15.3 hours),
using
      qsub -pe orte 8 -q mTN.q -l mr=6G
(sufficient time, sufficient memory).
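For reference, a run like the initial hydra tests above can be submitted from Python roughly as
follows.  This is a minimal sketch, not the actual submission used for these tests: the job-script
name ("run_casa_m100.sh") and what it wraps are hypothetical placeholders, while the qsub flags are
the ones quoted above.

   import subprocess

   # Hypothetical job script; assumed to wrap something like
   #    casapy -nogui -c alma-m100-analysis-hpc-regression.py
   job_script = "run_casa_m100.sh"

   cmd = ["qsub",
          "-pe", "orte", "8",   # parallel environment and slot count used above
          "-q", "mTN.q",        # queue used in the initial hydra tests
          "-l", "mr=6G",        # memory request
          job_script]

   print("submitting: " + " ".join(cmd))
   subprocess.check_call(cmd)   # raises CalledProcessError if qsub rejects the job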
Sought Sylvain Korzennik's (hydra cluster) and Jeff Kern's (CASA @ NRAO) expert advice:

8/5     : Met with Sylvain to go through the specifics of the test.
8/5     : Wrote to Jeff to ask if he'd be open to Sylvain's detailed questions.
8/6     : Jeff replied, saying he would be open.
8/5-8/9 : Sylvain's test results, all cc'd to Jeff.
9/3     : Having not heard from Jeff since 8/6 (above), wrote a gentle "picking up the conversation"
          email with a few basic questions.
9/16    : As of today, Jeff has not responded.

Sylvain's tests:

1) Ran on all three hydra queues (orte, mpich, and openmp).  Queue runs are submitted to a
   scheduler, which then selects the compute nodes to be used, based on the requested resources.

2) Ran on a single dedicated machine (compute-0-32.local) as sole user.  Sylvain summarizes:

   "This test was run on a Dell R815, 64 cores, 256GB of RAM (2.2GHz AMD 6274, 2M cache).
    I set NMMS = 8, no. of engines = 7 (to duplicate Alice's test, see below, 01:18:26 real,
    00:18:11 user).  Nothing else was running on that node.  I got:
    1- using the NetApp disk:
       5358.459u 4712.504s 4:46:53.30 58.5% 0+0k 0+0io 44pf+0w
       so the I/Os (s-time) are comparable to CPUs (u-time)
    2- using a local disk:
       5402.212u 5864.280s 3:43:13.11 84.1% 0+0k 0+0io 76pf+0w"

3) Ran tests with different NMMS (number of sub-MSs) and numbers of engines, including:

      NMMS = 32, Engines = 31
             16            15
              8            21

   All resulted in failed runs.

Some of Sylvain's questions to Jeff (8/6-8/9):

1) "If there is a writeup I should consult, please direct me."

2) "Maybe you can explain all this to us: meaning of number of engines, number of threads used by
   simple_cluster.py, the amount of memory, how to optimize all this and the number of MMS, how does
   the memory use scale w/ engines, threads, MMS..."

3) When a run failed, e.g., with NMMS not equal to 8, "the script keep chugging (garbage) afterwards
   - is there no error catching :-}"

My questions to Jeff (9/3):

1) "What machine(s) did you test parallelized CASA on at NRAO?  Number of CPUs, RAM, etc.  I assume
   the test machines were servers and not clusters."

2) "Will parallelized CASA be more cluster-friendly in the next release?  As Sylvain demonstrated in
   his tests, we have been unable to make use of our cluster's (HYDRA's) considerable resources.  To
   date, the best results (quickest times) have been on the non-HYDRA server, rtdc7."

3) "Can "mynumsubmss" in "alma-m100-analysis-hpc-regression.py" be set to anything other than 8?
   Both Sylvain and I (working on completely different sets of machines) have been unable to
   successfully complete the test (steps 0-18) if this is set to anything other than 8."

Final note about the factor of three in the test:

   1h18m of elapsed time (18m of user time) on rtdc7
   3h43m of elapsed time (1h30m of user time) on compute-0-32

The rtdc7 CPUs are 1.5x faster (but not 3x).  The additional factor of two is at present
unaccounted for.  Could it be due to RAIDs (rtdc7)?
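The factor-of-three arithmetic can be checked with the numbers quoted above (a minimal sketch using
only the elapsed times reported in this document and the stated 1.5x CPU-speed ratio; the variable
names are mine):

   def to_seconds(h=0, m=0, s=0.0):
       """Convert hours/minutes/seconds to seconds."""
       return 3600 * h + 60 * m + s

   rtdc7_real = to_seconds(m=78, s=26.600)      # 78m 26.600s  (~1h18m elapsed on rtdc7)
   node_real  = to_seconds(h=3, m=43, s=13.11)  # 3:43:13.11   (Sylvain's local-disk run)

   cpu_speed_ratio = 1.5                        # stated rtdc7 CPU-speed advantage

   elapsed_ratio = node_real / rtdc7_real            # ~2.8x: the "factor of three"
   residual      = elapsed_ratio / cpu_speed_ratio   # ~1.9x: the unaccounted factor of two

   print("elapsed-time ratio    : %.1fx" % elapsed_ratio)
   print("explained by CPU speed: %.1fx" % cpu_speed_ratio)
   print("unaccounted factor    : %.1fx" % residual)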