HOP
What is HOP and what does it do?
HOP is an algorithm for finding groups of particles based on their
instantaneous densities. Roughly speaking, it works by connecting
particles to nearby particles in the direction of increasing density.
We have described the method in our paper
"HOP: A new group finding algorithm for N-body simulations"
to be published in ApJ.
While our paper contains a full description of the method, a quick
summary is:
- Compute the density around each particle, using an adaptive kernel
with a length scale set by the distance to the
N_dens
nearest neighbor.
- Now consider each particle and record the ID number of the
neighboring particle with the highest density. Consider
N_hop
neighbors in this step. The particle with the highest
density could be the chosen particle itself.
- Using these ID numbers and starting at each particle in turn,
hop from a particle to its densest neighbor until a particle that is
its own densest neighbor is found. The starting particle then becomes
a member of the group with the final particle as its center. The
density of this central particle is the peak density of the group, a
quantity needed in step 5.
- As described in our paper, the above scheme tends to break
large groups into (probably unphysical) small ones due to the presence
of local density maxima. To correct this, we recombine the groups on
the basis of a dual density threshold. To set up for this, we need to
catalog the densest boundaries between groups. This is done by looking
at each particle in turn and asking whether the particle and any of its
N_merge
nearest neighbors are members of different groups.
If so, the particle is a boundary between the two groups; the density
of the boundary is the average of the density of the two particles. We
catalog the highest density boundary found between each pair of groups
and store the result.
- Groups are considered viable if their densest particle (found
in step 3) is greater than some threshold
delta_peak
. Two
viable groups are merged if they share a boundary greater than
delta_merge
. An unviable group is merged to the viable
group with which it shares the highest boundary. For the latter
purpose, boundaries are transmitted by intervening unviable groups;
that is, if group A is viable and groups B and
C are not, but C does not touch A,
C can still be joined to A if the *lower* of the
density of the AB boundary and the BC boundary is
greater than the threshold delta_merge
. If this step is
unclear to you, you might look at our paper and especially Figure 1.
- All particles whose density are less than a threshold
delta_outer
are excluded from groups.
In our implementation of HOP, steps (1), (2-4), and (5-6) are
run separately and the needed output is saved to disk. Hence, one
can vary certain parameters without repeating all 6 steps.
Step (1) is the most CPU expensive. Steps (2) and (4) are also
reasonably expensive. The others are quick. Moreover, steps (1),
(2), and (4) are the most memory intensive as well, because they
require more information about the particles.
Our implementation of HOP
We are distributing our implementation of the HOP algorithm. Our
version actually involves two programs that must be run sequentially.
The first (hop.c) performs steps (1) through (4) above and
requires three of the parameters (N_dens
, N_hop
,
and N_merge
). The second (regroup.c) performs
steps (5) and (6) and requires the other three parameters
(delta_peak
, delta_merge
, and
delta_outer
). The second program takes far less CPU time
and a significantly less memory; hence, one can vary the second trio of
parameters fairly inexpensively.
How to compile these programs
- First, you need to customize the input subroutine for
hop.c so that it can read your simulation format. All the
instructions are in the file hop_input.c along with an example.
You will also need to decide at this point if you are assigning
equal mass to each particle or not. This will affect how you store
the mass and will affect how you compile.
- Next, you need to edit the file Makefile so as to set the
proper compiler optimization flags. You should be able to use fairly
aggressive optimizations. Set the flags in the CFLAGS variable and the
compiler (cc, gcc, whatever) in the CC variable. If you have some
library to replace , set it in the LIBS variable.
While still editing Makefile, if you decided in step 1 to give the
particles unequal masses, you need to set the flag
-DDIFFERENT_MASSES
so as to communicate this definition to
the preprocessor. Uncomment this line in Makefile.
- From the shell command line, type "make". This
should compile both hop.c and regroup.c, leaving
you with the executables "hop" and "regroup".
Type "make clean" if you want to delete the .o files.
How to use these programs
I've covered most of this in the sections entitled User Control, so
here I will be more general. The HOP algorithm has six user-chosen
parameters. In our paper, we discuss how varying these affects the
mass function of groups in a cosmological simulation. In this case,
variations in delta_outer
mattered a fair bit, but the
others mattered significantly less. We found that N_dens
=
64, N_hop
= 16, and N_merge
= 4 with
delta_outer
= delta_saddle
/2.5 =
delta_peak
/3 gave robust answers.
Possible extensions
The HOP algorithm is based solely on the instantaneous positions of
the particles. An extension to this and similar approaches is to
remove from these groups those particles whose velocities are large
enough to escape the group. For example, see
SKID
(from the HPCC),
or DENMAX
[
Gelb & Bertschinger, Astroph. J., 436, 467 (1994)].
Typically, one removes the calculates the energies of all particles,
removes the most unbound, updates the energies, and iterates. I
haven't included such a scheme here, but I have included hooks for it
in regroup.c. Or you can pipe the output of regroup to your
own program.
Another direction for customization is that one need not maximize
the density field. Any scalar field attached to the set of particles
could be inputed through the .den file.
While we primarily had the case delta_outer
<
delta_saddle
< delta_peak
in mind, the code
does not require this. It does require N_merge
<
N_hop
, although this is due to my coding for an optimization
not due to particular requirement of the algorithm.
At heart, the hopping part of the HOP algorithm is a method for
tracking the gradient of a scalar field on a discrete set, while the
group merging part of the algorithm is a method for manipulating
isodensity contours on such a set. There may well be other
applications of such ideas, for example to genus statistics.
HOP.C
The hop.c program performs all the hard work of finding and
using the nearest neighbor lists. The main engine is the kd-tree
search algorithm written by
Joachim Stadel and the
University of Washington NASA HPCC ESS.
This engine was publically distributed as part of the
SMOOTH program
(which calculates adaptive densities and velocity fields from N-body
simulations, available
here).
They have given us permission to redistribute this engine as part of
the HOP program. Any compliments regarding the speed of
hop.c should be directed to them.
The source files involved in hop.c are:
-
hop.c
-
The guts of HOP, plus all the input/output, written by DJE
(although
main()
was customized from that of
SMOOTH).
-
hop_input.c
-
Since each user will have a different format for their
simulation data, I've pulled the input routine into a
separate file for the user to customize.
-
smooth.c
-
The kd-tree search engine, from
SMOOTH
v2.01, written by
Joachim Stadel
and the
HPCC. Note, however,
that I have made some minor customizations (regarding the
treatment of the particle masses) and removed code not needed
in HOP. Fetch the latest version of
SMOOTH
from their web site if you want the whole feature list.
-
kd.c
-
Some utility routines for smooth.c. Again, part of
SMOOTH
v2.01. I have removed unneeded code (e.g. the input/output
routines).
-
smooth.h, kd.h
-
The header files for the
SMOOTH
package. I've added and removed some fields from their
structures.
Input
Hop.c needs to read the number of particles, their positions, and
perhaps their masses from an input file. Since I can't guess what
format your simulation data files are in, you'll need to customize the
input routine. I've extracted the input routine into the source file
hop_input.c and included in that file instructions on how to customize
the input routine.
Hop.c can also read in a file containing all the
densities. This avoids the most time-consuming step (namely
calculating the smoothed densities from the particle positions). Since
hop.c stores the densities IF it calculates them, you can
save compute time when rerunning the code while varying parameters
other than N_dens
. Alternatively, if you wanted to supply
your own density estimates or base your group-finding on a quantity
other than density, you could supply the information this way. The
required input format is the same as described for the density file
(.den) in the Output section, directly below. Alternatively, you could
alter it by changing the binInDensity() subroutine.
Output
Hop.c outputs three files. These are suffixed ".den",
".hop", and ".gbound". The first two are binary files of (approximate)
size 4*N bytes; the last is ASCII and is generally smaller, although it
varies in size according to N_merge
and the details of the
simulation data.
- Density File (".den")
-
This binary file contains the densities
associated with each particle. The format is:
- Bytes 1-4 -- The number of particles, as a 4-byte integer.
- Rest of file -- The densities of each of the particles, in
order, stored as 4-byte floats.
- Group Membership file (".hop")
-
This binary file contains the group
ID number of each particle. This is the "raw" group membership, in the
sense that no merging of groups according the density has been
applied. Groups are numbered from 0 to
n_group
-1 in order of
decreasing group size (not that this necessarily means much before the
remerging the groups....); particles not assigned to a group because
their density is uninterestingly low are assigned ID-1. The format is:
- Bytes 1-4 -- The number of particles, as a 4-byte integer.
- Bytes 5-8 -- The number of groups, as a 4-byte integer.
- Rest of file -- The group ID numbers of each of the particles, in
order, stored as 4-byte integers.
- Group Boundary file (".gbound")
-
This ASCII file contains all the
relevant information concerning how groups may be merged according to
their densities. The file actually contains two lists of information
separated by a line beginning with "###". The format is:
- Line 1 -- The number of groups
- Next approximately
n_groups
lines -- These contain
information on the groups themselves, in space-separated columns.
- Column 1 -- The group ID number
- Column 2 -- The number of particles in the group
- Column 3 -- The particle ID number of densest particle
in the group.
- Column 4-6 -- The position of that particle
- Column 7 -- The density of that particle.
Only the last column is actually used by the
regroup.c program, although the first column is
monitored to make sure that the information is in order. Any
lines that begin with a "#" are ignored (but note that "###" is
reserved!). This is used to place some header information at
the top of the file.
- Some single line in the middle of the file -- Begins with "###" to
mark the end of the first section and the beginning of the second.
- Next many lines -- Each line contains information on pairs of
groups that share a boundary. The information is carried in
three space-separated columns, such that:
- Column 1-2 -- The two group ID numbers
- Column 3 -- The density of the boundary, which is the highest
average density achieved by any pair of neighboring particles
that were split between the groups.
User Control
Hop.c uses command-line flags for user input. You can get a
list of these by typing "hop" (or by giving illegal arguments....).
Flags related to the calculation of densities:
- -den file
-
Specifies a file from which to read the density (.den) file.
If present, then the densities will not be recalculated.
If not present, densities will be calculated according to
the following parameters:
- -nd int
-
Specifies
N_dens
, the number of particles to
smooth over when calculating the density. To be
precise, the list of N_dens
particles
includes the primary particle, while the outermost
particle in the list will be exactly at the edge of the
cubic spline and will therefore receive zero weight.
Default: 64.
- -tophat/-th
-
Use a tophat smoothing kernel rather than a cubic
spline. To be precise, the list of
N_dens
particles includes the primary particle, while
outermost particle will be at the edge of the tophat,
where it is given full weight. Default: cubic spline
- -g
-
Use gather kernel rather than gather-scatter. Only available
with the cubic spline. Gather-scatter is more usual.
Default: gather-scatter
Flags related to hopping and group boundaries:
- -nh int
-
Specifies
N_hop
, the number of neighbors to
look at when searching for the highest density
neighbor. To be precise, the list of N_hop
particles includes the primary particle; all are
considered when looking for the densest. Default: 16
- -nm int
-
Specifies
N_merge
, the number of neighbors
to look at when searching for boundaries. To be
precise, the list of N_hop
particles does
NOT include the primary particle. Note that this must
be strictly less than N_hop
because of an
optimization that I made. Default: 4
- -dt float
-
Particles below this density threshold are not included
in the group catalog. This doesn't save much CPU time,
but it does substantially reduce the size of the
.gbound file, because we don't need to store
information on all those uninteresting very low density
groups. Even a few times the background density will
make a substantial reduction in the size of the .gbound
file.
You should set this threshold to be lower than any
of the density thresholds (delta_outer
,
delta_saddle
, or delta_peak
) that
you plan to use in regroup.c, and at least a
factor of two lower than delta_saddle
(since
group boundaries are based on *average* densities).
Default: none
Flags related to the input simulation file:
- -in file
-
The name of the input simulation file. If this is omitted,
the simulation data is read from stdin.
- -p float
-
The periodicity of the positions in the input file.
I've hardwired these to be the same in each direction,
but if you look in
main()
, you'll see where
to change
this assumption. Default: none (so for cosmological
simulations, you need to set this!).
Flags related to the output:
- -o/-out file
-
The (root) name of the output files. Suffixes (".den",
".hop", and ".gbound") will be attached.
- -nodensity
-
Don't write the ".den" output file. This is automatic if
the densities were not computed but rather read from file.
But this doesn't stop the densities from being computed.
- -nohop
-
Don't write the ".hop" output file. But this doesn't
stop the hopping from being computed, since the group
boundaries step will need them.
- -nogbound
-
Don't write the ".gbound" output file. This does stop the
group boundaries from being computed, since they are not
otherwise needed.
- -densityonly
-
This computes only the densities and quits. The ".hop"
and ".gbound" files are not written.
Flags related to performance:
- -b int
-
This is left over from the
SMOOTH program. It controls
the number of particles to be stored in each of the
leaves of the kd-tree. Smaller numbers would marginally
improve the search time but would require substantially
more nodes in the tree and hence more memory. I kept this
at its default value of 16 and haven't played with it much.
At b=16, the tree takes about 15% of the memory demanded by
the program.
Hence, an example invocation of HOP would be:
hop -in my_simulation_file -p 1 -nd 48 -nh 20 -nm 5 -dt 1
-out my_hop_out
This would read from the simulation "my_simulation_file", placing
the particles in a unit box, and perform HOP using
N_dens
=48, N_hop
=20, and N_merge
=5,
and excluding particles with density less than two (useful if the
particle mass has been normalized so that the total mass is 1). The
output "my_hop_out.den", "my_hop_out.hop", and "my_hop_out.gbound" will
be written.
If one then wanted to change N_hop
to 16, it would be
computationally faster to use:
hop -in my_simulation_file -p 1 -den my_hop_out.den -nh 16 -nm 5
-dt 1 -out my_hop_out2
because this will avoid recalculating the densities.
Performance
Regarding memory consumption, hop.c requires 29 bytes per
particle (33 if compiled with DIFFERENT_MASSES
) for the
particle information, plus approximately 6 bytes per particle to hold
the tree (assuming that one hasn't used the -b flag; the
number 6 is inversely proportional to the setting of -b).
I was able to analyze a 256^3 particle simulation on a 512 meg machine
without being swamped by the swapping.
Regarding CPU consumption, each of the three sweeps invoked by the
HOP algorithm scales roughly as N*M, where N is the number of particles
and M is the number of neighbors being requested. This scaling doesn't
quite hold for small M. The runtime is fairly independent of the
degree of clustering (unlike, say, a grid-based FOF code). Typically,
N_dens
is greater than N_hop
or
N_merge
, hence the density calculation has the largest M and
takes the most time.
I found that with the default settings, the code took about 2 hours on
an UltraSparc 170 for a 256^3 particle data set and 15 minutes on a
128^3 data set. Setting -b to 4 rather than 16 did not alter the
performance.
Warnings
I have only tested the code for cases with periodic boundary conditions.
Clearly in the non-periodic case, edge effects in the density estimation
should be considered.
REGROUP.C
The regroup.c program merges and prunes groups according the
chosen density threshold parameters. The output is the new group
membership (.tag) file, a sorted list of the group sizes, and a summary
of which input groups were merged.
This program was written by DJE and consists of the following source files:
- regroup.c
-
The program itself, including all the input/output.
- slice.c
-
Some utility routines for managing the main data structure.
- slice.h
-
The header file for the main data structure.
Input
Regroup.c requires three input files, all generated by
hop.c. The densities at each particle must be supplied as a
".den" file, the group membership of each particle given as a ".hop"
file, and the group boundary information given as a ".gbound" file.
Alternatively, the latter file can be replaced by a ".gmerge" file,
which simply instructs regroup.c as to which input groups to
merge together.
The formats of ".den", ".hop", and ".gbound" files are given in the
hop.c information. The format of ".gmerge" files are given below.
Output
Regroup.c generates three output files. These are suffixed ".tag",
".sort", and ".gmerge". The first of these is a binary file of
(approximate) size 4*N bytes. The last two are ASCII files and are
considerably smaller.
- Group membership file (".tag")
-
This binary file contains the group ID numbers of all of the
particles, after the groups have been merged and pruned by density
considerations. The format is identical to that of the ".hop" file
described above, namely a sequence of N+2 4-byte integers consisting of
the number of particles, the number of groups, and then the group ID's
of the N particles. Groups are numbered 0 to
n_group
-1;
particles not assigned to any group are given the number -1.
If the -f77
flag has been set, the format of the ".tag"
file is altered so as to be compatible with FORTRAN unformatted I/O.
The file should be readable in the following manner
int*4 n_particles, n_groups
real*4 group_id(n_particles)
read (*) n_particles, n_groups
read (*) (group_id(j),j=1,n_particles)
where group_id
's run from 0 to n_groups-1
.
In detail, the file format is:
- Bytes 1-4 -- The integer 8.
- Bytes 5-8 -- The number of particles, N.
- Bytes 9-12 -- The number of groups.
- Bytes 13-16 -- The integer 8.
- Bytes 17-20 -- The integer 4*N.
- Next many bytes -- The group ID numbers for all the particles.
- Last 4 bytes -- The integer 4*N.
- Group multiplicity file (".size")
-
This ASCII file contains the
number of particles in each group. The format is:
- Line 1 -- The number of particles
- Line 2 -- The total number of particles in groups
- Line 3 -- The number of groups
- Remaining lines -- Two space-separated columns
- Column 1 -- The ID number of the group
- Column 2 -- The number of particles in that group
- Group merging log file (".gmerge")
-
This ASCII file contains the mapping
of the input group numbering scheme to the output group numbering scheme.
- Line 1 -- The number of particles
- Line 2 -- The number of input groups
- Line 3 -- The number of output groups
- Line 4 --
delta_peak
- Line 5 --
delta_saddle
- Remaining lines -- Two space-separated columns
- Column 1 -- Input group ID number
- Column 2 -- Output group ID number
User Control
Regroup.c uses command-line flags for user input. You can
get a list of these by typing "regroup" (or by giving illegal
arguments....).
Flags related to the input files:
- -hop file
-
The .hop input file containing the input group memberships.
- -den file
-
The .den input file containing all the particle densities.
- -gbound file
-
The .gbound input file containing the group boundary info.
- -root file
-
Sets all three of the above to a common root plus the
default suffixs (".hop", ".den", and ".gbound").
However, you can also override one or more of these
choices with an explicit setting of -hop,
-den, -gbound, or -gmerge.
- -gmerge file
-
Reads the group merging map from the given file rather
than constructing it from the .gbound file and density
parameters.
Flags related to the input parameters:
- -douter float
-
Sets
delta_outer
, the density required for a
particle to be in a group.
- -nodens
-
This turns off the
delta_outer
cut, which
means that the .den file is not read and need not be given.
Note, however, that if you use this option, you should
probably not use the -douter
flag. See
the Warnings section, below.
- -dpeak float
-
Sets
delta_peak
, the central density needed
for a group to be independently viable.
- -dsaddle float
-
Sets
delta_saddle
, the boundary density needed two
viable groups to be merged.
- -nomerge
-
Turns off group merging. The input groups will be unchanged.
- -mingroup int
-
Only groups with this many particles or more will be outputted.
Default: 10
- -nosort
-
Don't sort the groups by size. Also don't output .size.
Flags related to output:
- -o/-out file
-
Write the output files to the given root plus the default
suffixes (".tag", ".gmerge", and ".sort"). If not specied,
this defaults to "zregroup".
- -otag/-outtag file
-
Override the above choice with regard to the .tag file.
- -notagoutput
-
Don't write any .tag file. This is useful if you only
want to track the size of groups, rather than explicit
particle membership.
- -pipe
-
Send the .tag output to stdout, but write the .gmerge and
.sort files as usual.
- -pipequiet
-
Send the .tag output to stdout and don't write the .gmerge
or .sort files at all.
- -f77
-
Write the output .tag file in a binary format compatible
with FORTRAN unformatted I/O.
Piping the output of regroup.c is useful if you have another
program that analyzes the properties of the set of groups. Having
separate membership files for each choice of density thresholds can
take a lot of disk space. Because regroup.c takes very
little time to run, it may be easier to keep only the output of
hop.c on disk and to re-generate the final, density-
manipulated catalog on demand.
Hence, an invocation of regroup.c might look like
regroup -root my_hop_out -douter 80 -dsaddle 140 -dpeak 160
-mingroup 8 -out my_final_catalog
This would read the input of my_hop_out.hop, my_hop_out.den, and
my_hop_out.gbound, impose the density thresholds, sort the final
catalog by group size, exclude any groups smaller than 8 particles,
and write output as my_final_catalog.sort, my_final_catalog.gmerge,
and my_final_catalog.tag. Adding a -pipe flag would dump the last
of these to stdout rather than disk, so it could taken as input for
an analysis program. Using -f77 would allow a simpler
interface to a FORTRAN program.
Performance
Regroup.c runs in a manner of seconds and is essentially
I/O-limited. reading the large ".den" and ".hop" files and writing the
".tag" file takes most of the time. Each of these files is 4*N bytes
of memory.
It takes 4*N bytes of memory, where N is the number of particles, plus
some storage that scales with the number of groups.
Warnings
Remember that HOP is biased against finding groups smaller than
N_dens
or N_hop
; hence, setting
-mingroup to be tiny is usually pointless.
The code assumes that all densities are greater than the internal
quantity MINDENS
, which I've set to be a large negative
number close to the most negative representable number. But if you
want to input some astronomically negative number, beware (i.e. choose
more sensible units).
If you specify -nodens
, you most likely want to avoid
setting delta_outer
> 0. Doing the latter will cause
the merging section of the code to throw out low-density groups on the
basis of the boundary densities, which is likely to produce strangely
inconsistent group edges.
How to contact us
The HOP algorithm was presented in a
paper by
Daniel J. Eisenstein
(deisenstein@cfa.harvard.edu)
and Piet Hut
(piet@sns.ias.edu). We were at the
Institute for Advanced Study when HOP was
written.
This is the documentation for version 1.1 of HOP. Version 1.1 fixes
a small bug in regroup.c that could cause crashes on large data sets.
No results could be corrupted, as the code will simply crash.
Further information and updates will be posted at
www.cfa.harvard.edu/~deisenst/hop/hop.html.