HOP

What is HOP and what does it do?

HOP is an algorithm for finding groups of particles based on their instantaneous densities. Roughly speaking, it works by connecting particles to nearby particles in the direction of increasing density. We have described the method in our paper "HOP: A new group finding algorithm for N-body simulations" to be published in ApJ.

While our paper contains a full description of the method, a quick summary is:

  1. Compute the density around each particle, using an adaptive kernel with a length scale set by the distance to the N_dens nearest neighbor.

  2. Now consider each particle and record the ID number of the neighboring particle with the highest density. Consider N_hop neighbors in this step. The particle with the highest density could be the chosen particle itself.

  3. Using these ID numbers and starting at each particle in turn, hop from a particle to its densest neighbor until a particle that is its own densest neighbor is found. The starting particle then becomes a member of the group with the final particle as its center. The density of this central particle is the peak density of the group, a quantity needed in step 5.

  4. As described in our paper, the above scheme tends to break large groups into (probably unphysical) small ones due to the presence of local density maxima. To correct this, we recombine the groups on the basis of a dual density threshold. To set up for this, we need to catalog the densest boundaries between groups. This is done by looking at each particle in turn and asking whether the particle and any of its N_merge nearest neighbors are members of different groups. If so, the particle is a boundary between the two groups; the density of the boundary is the average of the density of the two particles. We catalog the highest density boundary found between each pair of groups and store the result.

  5. Groups are considered viable if their densest particle (found in step 3) is greater than some threshold delta_peak. Two viable groups are merged if they share a boundary greater than delta_merge. An unviable group is merged to the viable group with which it shares the highest boundary. For the latter purpose, boundaries are transmitted by intervening unviable groups; that is, if group A is viable and groups B and C are not, but C does not touch A, C can still be joined to A if the *lower* of the density of the AB boundary and the BC boundary is greater than the threshold delta_merge. If this step is unclear to you, you might look at our paper and especially Figure 1.

  6. All particles whose density are less than a threshold delta_outer are excluded from groups.
In our implementation of HOP, steps (1), (2-4), and (5-6) are run separately and the needed output is saved to disk. Hence, one can vary certain parameters without repeating all 6 steps.

Step (1) is the most CPU expensive. Steps (2) and (4) are also reasonably expensive. The others are quick. Moreover, steps (1), (2), and (4) are the most memory intensive as well, because they require more information about the particles.


Our implementation of HOP

We are distributing our implementation of the HOP algorithm. Our version actually involves two programs that must be run sequentially. The first (hop.c) performs steps (1) through (4) above and requires three of the parameters (N_dens, N_hop, and N_merge). The second (regroup.c) performs steps (5) and (6) and requires the other three parameters (delta_peak, delta_merge, and delta_outer). The second program takes far less CPU time and a significantly less memory; hence, one can vary the second trio of parameters fairly inexpensively.

How to compile these programs

  1. First, you need to customize the input subroutine for hop.c so that it can read your simulation format. All the instructions are in the file hop_input.c along with an example.

    You will also need to decide at this point if you are assigning equal mass to each particle or not. This will affect how you store the mass and will affect how you compile.

  2. Next, you need to edit the file Makefile so as to set the proper compiler optimization flags. You should be able to use fairly aggressive optimizations. Set the flags in the CFLAGS variable and the compiler (cc, gcc, whatever) in the CC variable. If you have some library to replace , set it in the LIBS variable.

    While still editing Makefile, if you decided in step 1 to give the particles unequal masses, you need to set the flag -DDIFFERENT_MASSES so as to communicate this definition to the preprocessor. Uncomment this line in Makefile.

  3. From the shell command line, type "make". This should compile both hop.c and regroup.c, leaving you with the executables "hop" and "regroup".

    Type "make clean" if you want to delete the .o files.

How to use these programs

I've covered most of this in the sections entitled User Control, so here I will be more general. The HOP algorithm has six user-chosen parameters. In our paper, we discuss how varying these affects the mass function of groups in a cosmological simulation. In this case, variations in delta_outer mattered a fair bit, but the others mattered significantly less. We found that N_dens = 64, N_hop = 16, and N_merge = 4 with delta_outer = delta_saddle/2.5 = delta_peak/3 gave robust answers.

Possible extensions

The HOP algorithm is based solely on the instantaneous positions of the particles. An extension to this and similar approaches is to remove from these groups those particles whose velocities are large enough to escape the group. For example, see SKID (from the HPCC), or DENMAX [ Gelb & Bertschinger, Astroph. J., 436, 467 (1994)]. Typically, one removes the calculates the energies of all particles, removes the most unbound, updates the energies, and iterates. I haven't included such a scheme here, but I have included hooks for it in regroup.c. Or you can pipe the output of regroup to your own program.

Another direction for customization is that one need not maximize the density field. Any scalar field attached to the set of particles could be inputed through the .den file.

While we primarily had the case delta_outer < delta_saddle < delta_peak in mind, the code does not require this. It does require N_merge < N_hop, although this is due to my coding for an optimization not due to particular requirement of the algorithm.

At heart, the hopping part of the HOP algorithm is a method for tracking the gradient of a scalar field on a discrete set, while the group merging part of the algorithm is a method for manipulating isodensity contours on such a set. There may well be other applications of such ideas, for example to genus statistics.


HOP.C

The hop.c program performs all the hard work of finding and using the nearest neighbor lists. The main engine is the kd-tree search algorithm written by Joachim Stadel and the University of Washington NASA HPCC ESS. This engine was publically distributed as part of the SMOOTH program (which calculates adaptive densities and velocity fields from N-body simulations, available here). They have given us permission to redistribute this engine as part of the HOP program. Any compliments regarding the speed of hop.c should be directed to them.

The source files involved in hop.c are:

hop.c
The guts of HOP, plus all the input/output, written by DJE (although main() was customized from that of SMOOTH).
hop_input.c
Since each user will have a different format for their simulation data, I've pulled the input routine into a separate file for the user to customize.
smooth.c
The kd-tree search engine, from SMOOTH v2.01, written by Joachim Stadel and the HPCC. Note, however, that I have made some minor customizations (regarding the treatment of the particle masses) and removed code not needed in HOP. Fetch the latest version of SMOOTH from their web site if you want the whole feature list.
kd.c
Some utility routines for smooth.c. Again, part of SMOOTH v2.01. I have removed unneeded code (e.g. the input/output routines).
smooth.h, kd.h
The header files for the SMOOTH package. I've added and removed some fields from their structures.

Input

Hop.c needs to read the number of particles, their positions, and perhaps their masses from an input file. Since I can't guess what format your simulation data files are in, you'll need to customize the input routine. I've extracted the input routine into the source file hop_input.c and included in that file instructions on how to customize the input routine.

Hop.c can also read in a file containing all the densities. This avoids the most time-consuming step (namely calculating the smoothed densities from the particle positions). Since hop.c stores the densities IF it calculates them, you can save compute time when rerunning the code while varying parameters other than N_dens. Alternatively, if you wanted to supply your own density estimates or base your group-finding on a quantity other than density, you could supply the information this way. The required input format is the same as described for the density file (.den) in the Output section, directly below. Alternatively, you could alter it by changing the binInDensity() subroutine.

Output

Hop.c outputs three files. These are suffixed ".den", ".hop", and ".gbound". The first two are binary files of (approximate) size 4*N bytes; the last is ASCII and is generally smaller, although it varies in size according to N_merge and the details of the simulation data.
Density File (".den")
This binary file contains the densities associated with each particle. The format is:
  1. Bytes 1-4 -- The number of particles, as a 4-byte integer.
  2. Rest of file -- The densities of each of the particles, in order, stored as 4-byte floats.

Group Membership file (".hop")
This binary file contains the group ID number of each particle. This is the "raw" group membership, in the sense that no merging of groups according the density has been applied. Groups are numbered from 0 to n_group-1 in order of decreasing group size (not that this necessarily means much before the remerging the groups....); particles not assigned to a group because their density is uninterestingly low are assigned ID-1. The format is:
  1. Bytes 1-4 -- The number of particles, as a 4-byte integer.
  2. Bytes 5-8 -- The number of groups, as a 4-byte integer.
  3. Rest of file -- The group ID numbers of each of the particles, in order, stored as 4-byte integers.

Group Boundary file (".gbound")
This ASCII file contains all the relevant information concerning how groups may be merged according to their densities. The file actually contains two lists of information separated by a line beginning with "###". The format is:
  1. Line 1 -- The number of groups
  2. Next approximately n_groups lines -- These contain information on the groups themselves, in space-separated columns.
    • Column 1 -- The group ID number
    • Column 2 -- The number of particles in the group
    • Column 3 -- The particle ID number of densest particle in the group.
    • Column 4-6 -- The position of that particle
    • Column 7 -- The density of that particle.
    Only the last column is actually used by the regroup.c program, although the first column is monitored to make sure that the information is in order. Any lines that begin with a "#" are ignored (but note that "###" is reserved!). This is used to place some header information at the top of the file.
  3. Some single line in the middle of the file -- Begins with "###" to mark the end of the first section and the beginning of the second.
  4. Next many lines -- Each line contains information on pairs of groups that share a boundary. The information is carried in three space-separated columns, such that:
    • Column 1-2 -- The two group ID numbers
    • Column 3 -- The density of the boundary, which is the highest average density achieved by any pair of neighboring particles that were split between the groups.

User Control

Hop.c uses command-line flags for user input. You can get a list of these by typing "hop" (or by giving illegal arguments....).

Flags related to the calculation of densities:

-den file
Specifies a file from which to read the density (.den) file. If present, then the densities will not be recalculated. If not present, densities will be calculated according to the following parameters:

-nd int
Specifies N_dens, the number of particles to smooth over when calculating the density. To be precise, the list of N_dens particles includes the primary particle, while the outermost particle in the list will be exactly at the edge of the cubic spline and will therefore receive zero weight. Default: 64.

-tophat/-th
Use a tophat smoothing kernel rather than a cubic spline. To be precise, the list of N_dens particles includes the primary particle, while outermost particle will be at the edge of the tophat, where it is given full weight. Default: cubic spline

-g
Use gather kernel rather than gather-scatter. Only available with the cubic spline. Gather-scatter is more usual. Default: gather-scatter
Flags related to hopping and group boundaries:
-nh int
Specifies N_hop, the number of neighbors to look at when searching for the highest density neighbor. To be precise, the list of N_hop particles includes the primary particle; all are considered when looking for the densest. Default: 16

-nm int
Specifies N_merge, the number of neighbors to look at when searching for boundaries. To be precise, the list of N_hop particles does NOT include the primary particle. Note that this must be strictly less than N_hop because of an optimization that I made. Default: 4

-dt float
Particles below this density threshold are not included in the group catalog. This doesn't save much CPU time, but it does substantially reduce the size of the .gbound file, because we don't need to store information on all those uninteresting very low density groups. Even a few times the background density will make a substantial reduction in the size of the .gbound file.

You should set this threshold to be lower than any of the density thresholds (delta_outer, delta_saddle, or delta_peak) that you plan to use in regroup.c, and at least a factor of two lower than delta_saddle (since group boundaries are based on *average* densities). Default: none

Flags related to the input simulation file:
-in file
The name of the input simulation file. If this is omitted, the simulation data is read from stdin.

-p float
The periodicity of the positions in the input file. I've hardwired these to be the same in each direction, but if you look in main(), you'll see where to change this assumption. Default: none (so for cosmological simulations, you need to set this!).
Flags related to the output:
-o/-out file
The (root) name of the output files. Suffixes (".den", ".hop", and ".gbound") will be attached.

-nodensity
Don't write the ".den" output file. This is automatic if the densities were not computed but rather read from file. But this doesn't stop the densities from being computed.

-nohop
Don't write the ".hop" output file. But this doesn't stop the hopping from being computed, since the group boundaries step will need them.

-nogbound
Don't write the ".gbound" output file. This does stop the group boundaries from being computed, since they are not otherwise needed.

-densityonly
This computes only the densities and quits. The ".hop" and ".gbound" files are not written.
Flags related to performance:
-b int
This is left over from the SMOOTH program. It controls the number of particles to be stored in each of the leaves of the kd-tree. Smaller numbers would marginally improve the search time but would require substantially more nodes in the tree and hence more memory. I kept this at its default value of 16 and haven't played with it much. At b=16, the tree takes about 15% of the memory demanded by the program.
Hence, an example invocation of HOP would be:

hop -in my_simulation_file -p 1 -nd 48 -nh 20 -nm 5 -dt 1 -out my_hop_out

This would read from the simulation "my_simulation_file", placing the particles in a unit box, and perform HOP using N_dens=48, N_hop=20, and N_merge=5, and excluding particles with density less than two (useful if the particle mass has been normalized so that the total mass is 1). The output "my_hop_out.den", "my_hop_out.hop", and "my_hop_out.gbound" will be written.

If one then wanted to change N_hop to 16, it would be computationally faster to use:

hop -in my_simulation_file -p 1 -den my_hop_out.den -nh 16 -nm 5 -dt 1 -out my_hop_out2

because this will avoid recalculating the densities.

Performance

Regarding memory consumption, hop.c requires 29 bytes per particle (33 if compiled with DIFFERENT_MASSES) for the particle information, plus approximately 6 bytes per particle to hold the tree (assuming that one hasn't used the -b flag; the number 6 is inversely proportional to the setting of -b). I was able to analyze a 256^3 particle simulation on a 512 meg machine without being swamped by the swapping.

Regarding CPU consumption, each of the three sweeps invoked by the HOP algorithm scales roughly as N*M, where N is the number of particles and M is the number of neighbors being requested. This scaling doesn't quite hold for small M. The runtime is fairly independent of the degree of clustering (unlike, say, a grid-based FOF code). Typically, N_dens is greater than N_hop or N_merge, hence the density calculation has the largest M and takes the most time.

I found that with the default settings, the code took about 2 hours on an UltraSparc 170 for a 256^3 particle data set and 15 minutes on a 128^3 data set. Setting -b to 4 rather than 16 did not alter the performance.

Warnings

I have only tested the code for cases with periodic boundary conditions. Clearly in the non-periodic case, edge effects in the density estimation should be considered.


REGROUP.C

The regroup.c program merges and prunes groups according the chosen density threshold parameters. The output is the new group membership (.tag) file, a sorted list of the group sizes, and a summary of which input groups were merged.

This program was written by DJE and consists of the following source files:

regroup.c
The program itself, including all the input/output.

slice.c
Some utility routines for managing the main data structure.

slice.h
The header file for the main data structure.

Input

Regroup.c requires three input files, all generated by hop.c. The densities at each particle must be supplied as a ".den" file, the group membership of each particle given as a ".hop" file, and the group boundary information given as a ".gbound" file. Alternatively, the latter file can be replaced by a ".gmerge" file, which simply instructs regroup.c as to which input groups to merge together.

The formats of ".den", ".hop", and ".gbound" files are given in the hop.c information. The format of ".gmerge" files are given below.

Output

Regroup.c generates three output files. These are suffixed ".tag", ".sort", and ".gmerge". The first of these is a binary file of (approximate) size 4*N bytes. The last two are ASCII files and are considerably smaller.
Group membership file (".tag")
This binary file contains the group ID numbers of all of the particles, after the groups have been merged and pruned by density considerations. The format is identical to that of the ".hop" file described above, namely a sequence of N+2 4-byte integers consisting of the number of particles, the number of groups, and then the group ID's of the N particles. Groups are numbered 0 to n_group-1; particles not assigned to any group are given the number -1.

If the -f77 flag has been set, the format of the ".tag" file is altered so as to be compatible with FORTRAN unformatted I/O. The file should be readable in the following manner

int*4 n_particles, n_groups
real*4 group_id(n_particles)
read (*) n_particles, n_groups
read (*) (group_id(j),j=1,n_particles)
where group_id's run from 0 to n_groups-1. In detail, the file format is:
  1. Bytes 1-4 -- The integer 8.
  2. Bytes 5-8 -- The number of particles, N.
  3. Bytes 9-12 -- The number of groups.
  4. Bytes 13-16 -- The integer 8.
  5. Bytes 17-20 -- The integer 4*N.
  6. Next many bytes -- The group ID numbers for all the particles.
  7. Last 4 bytes -- The integer 4*N.

Group multiplicity file (".size")
This ASCII file contains the number of particles in each group. The format is:
  1. Line 1 -- The number of particles
  2. Line 2 -- The total number of particles in groups
  3. Line 3 -- The number of groups
  4. Remaining lines -- Two space-separated columns
    • Column 1 -- The ID number of the group
    • Column 2 -- The number of particles in that group

Group merging log file (".gmerge")
This ASCII file contains the mapping of the input group numbering scheme to the output group numbering scheme.
  1. Line 1 -- The number of particles
  2. Line 2 -- The number of input groups
  3. Line 3 -- The number of output groups
  4. Line 4 -- delta_peak
  5. Line 5 -- delta_saddle
  6. Remaining lines -- Two space-separated columns
    • Column 1 -- Input group ID number
    • Column 2 -- Output group ID number

User Control

Regroup.c uses command-line flags for user input. You can get a list of these by typing "regroup" (or by giving illegal arguments....).

Flags related to the input files:

-hop file
The .hop input file containing the input group memberships.

-den file
The .den input file containing all the particle densities.

-gbound file
The .gbound input file containing the group boundary info.

-root file
Sets all three of the above to a common root plus the default suffixs (".hop", ".den", and ".gbound"). However, you can also override one or more of these choices with an explicit setting of -hop, -den, -gbound, or -gmerge.

-gmerge file
Reads the group merging map from the given file rather than constructing it from the .gbound file and density parameters.
Flags related to the input parameters:
-douter float
Sets delta_outer, the density required for a particle to be in a group.

-nodens
This turns off the delta_outer cut, which means that the .den file is not read and need not be given. Note, however, that if you use this option, you should probably not use the -douter flag. See the Warnings section, below.

-dpeak float
Sets delta_peak, the central density needed for a group to be independently viable.

-dsaddle float
Sets delta_saddle, the boundary density needed two viable groups to be merged.

-nomerge
Turns off group merging. The input groups will be unchanged.

-mingroup int
Only groups with this many particles or more will be outputted. Default: 10

-nosort
Don't sort the groups by size. Also don't output .size.
Flags related to output:
-o/-out file
Write the output files to the given root plus the default suffixes (".tag", ".gmerge", and ".sort"). If not specied, this defaults to "zregroup".

-otag/-outtag file
Override the above choice with regard to the .tag file.

-notagoutput
Don't write any .tag file. This is useful if you only want to track the size of groups, rather than explicit particle membership.

-pipe
Send the .tag output to stdout, but write the .gmerge and .sort files as usual.

-pipequiet
Send the .tag output to stdout and don't write the .gmerge or .sort files at all.

-f77
Write the output .tag file in a binary format compatible with FORTRAN unformatted I/O.
Piping the output of regroup.c is useful if you have another program that analyzes the properties of the set of groups. Having separate membership files for each choice of density thresholds can take a lot of disk space. Because regroup.c takes very little time to run, it may be easier to keep only the output of hop.c on disk and to re-generate the final, density- manipulated catalog on demand.

Hence, an invocation of regroup.c might look like

regroup -root my_hop_out -douter 80 -dsaddle 140 -dpeak 160 -mingroup 8 -out my_final_catalog

This would read the input of my_hop_out.hop, my_hop_out.den, and my_hop_out.gbound, impose the density thresholds, sort the final catalog by group size, exclude any groups smaller than 8 particles, and write output as my_final_catalog.sort, my_final_catalog.gmerge, and my_final_catalog.tag. Adding a -pipe flag would dump the last of these to stdout rather than disk, so it could taken as input for an analysis program. Using -f77 would allow a simpler interface to a FORTRAN program.

Performance

Regroup.c runs in a manner of seconds and is essentially I/O-limited. reading the large ".den" and ".hop" files and writing the ".tag" file takes most of the time. Each of these files is 4*N bytes of memory.

It takes 4*N bytes of memory, where N is the number of particles, plus some storage that scales with the number of groups.

Warnings

Remember that HOP is biased against finding groups smaller than N_dens or N_hop; hence, setting -mingroup to be tiny is usually pointless.

The code assumes that all densities are greater than the internal quantity MINDENS, which I've set to be a large negative number close to the most negative representable number. But if you want to input some astronomically negative number, beware (i.e. choose more sensible units).

If you specify -nodens, you most likely want to avoid setting delta_outer > 0. Doing the latter will cause the merging section of the code to throw out low-density groups on the basis of the boundary densities, which is likely to produce strangely inconsistent group edges.


How to contact us

The HOP algorithm was presented in a paper by Daniel J. Eisenstein (deisenstein@cfa.harvard.edu) and Piet Hut (piet@sns.ias.edu). We were at the Institute for Advanced Study when HOP was written.

This is the documentation for version 1.1 of HOP. Version 1.1 fixes a small bug in regroup.c that could cause crashes on large data sets. No results could be corrupted, as the code will simply crash. Further information and updates will be posted at www.cfa.harvard.edu/~deisenst/hop/hop.html.