Research Blog
Welcome to my Research Blog.
This is mostly meant to document what I am working on for myself, and to communicate with my colleagues. It is likely filled with errors!
I ran into some problems running the \(2048^3\) simulations, so here I am going to document my issues and how I solved them.
I found through trial and error that MUSIC requires about 500 GB of memory to run the \(2048^3\) simulations, which is more than is available on the regular Pleiades nodes. There is documentation here on how to run higher-memory jobs.
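As a rough sanity check on that number (my own back-of-envelope, not from the MUSIC documentation), a single single-precision \(2048^3\) grid is already large, and MUSIC holds several such arrays at once:

```shell
# Back-of-envelope: memory for one single-precision 2048^3 grid.
# MUSIC keeps several such arrays in memory at once, so on the order
# of 15 of these lands near the ~500 GB I saw (the factor is my guess).
N=2048
BYTES_PER_CELL=4                                   # single-precision float
GB_PER_GRID=$(( N**3 * BYTES_PER_CELL / 1024**3 ))
echo "one float grid: ${GB_PER_GRID} GB"
```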
Here is the jobscript I ended up using:
#PBS -l select=1:ncpus=16:ompthreads=16:model=ldan:mem=700GB
#PBS -l walltime=2:00:00
#PBS -q normal
module load comp-intel/2018.3.222
cd /nobackup/ndrakos/MUSIC/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/nasa/pkgsrc/sles12/2016Q4/lib:/u/ndrakos/install_to_here/gsl_in/lib
./MUSIC /nobackup/ndrakos/wfirst2048/wfirst2048_ics.conf > /nobackup/ndrakos/wfirst2048/MUSIC_output
One thing to note is that MUSIC creates some large files when generating the ICs, so I had to move MUSIC to the nobackup drive.
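To get a feel for how large those files are (my own estimate, based on the standard Gadget-2 snapshot layout of three floats for position, three for velocity, and a 4-byte ID per particle):

```shell
# Approximate total size of the 2048^3 Gadget-2 IC set, assuming
# ~28 bytes per particle (pos + vel + 4-byte ID, ignoring headers).
N=$(( 2048**3 ))                                   # 8589934592 particles
BYTES_PER_PARTICLE=28
TOTAL_GB=$(( N * BYTES_PER_PARTICLE / 1024**3 ))
echo "approx IC size: ${TOTAL_GB} GB"
```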
After fixing the memory problem, I kept getting the following error (this happens at the point in the code where the ICs have been generated and are being written to the output file):
MUSIC: src/plugins/output_gadget2.cc:128: void gadget2_output_plugin<T_store>::distribute_particles(unsigned int, std::vector<std::vector<unsigned int> >&, std::vector<unsigned int>&) [with T_store = float]: Assertion `n2dist[i]==0' failed.
/var/spool/pbs/mom_priv/jobs/8833017.pbspl1.nas.nasa.gov.SC: line 11: 53443 Aborted (core dumped) ./MUSIC /nobackup/ndrakos/wfirst2048/wfirst2048_ics.conf > /nobackup/ndrakos/wfirst2048/MUSIC_output
I tried a few different things to debug this. Eventually, I found the user group for the code, and after reading through some of the posts found the following:
for the Gadget-2 output plugin multiple output files are needed since any single standard Gadget-2 IC file can only hold up to 2**32 particles = (~ 1600**3).
I hadn’t realized this, and had left the parameter gadget_num_files = 1
in the MUSIC configuration file. After increasing this parameter, the code ran fine.
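The Gadget read-in log later in this post shows each IC file containing 134217728 particles, which implies my run ended up with 64 files. A quick sanity check (my numbers, inferred from that log) that this stays under the per-file limit quoted above:

```shell
# Check that the per-file particle count stays below the 2^32 limit
# quoted from the user group. FILES=64 is inferred from the Gadget
# read-in log (134217728 = 512^3 particles per file).
N=$(( 2048**3 ))                # total particles
FILES=64                        # gadget_num_files
PER_FILE=$(( N / FILES ))
LIMIT=$(( 2**32 ))
[ "$PER_FILE" -lt "$LIMIT" ] && echo "ok: ${PER_FILE} particles per file"
```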
I also ran into problems with the Gadget Simulation.
The relevant output for the error is:
Allocated 1872.46 MByte for particle storage.
reading file `/nobackup/ndrakos/wfirst2048/wfirst2048_ics.dat.0' on task=0 (contains 134217728 particles.)
distributing this file to tasks 0-14
Type 0 (gas): 0 (tot= 0000000000) masstab=0
Type 1 (halo): 134217728 (tot= 0000000000) masstab=0.00152861
Type 2 (disk): 0 (tot= 0000000000) masstab=0
Type 3 (bulge): 0 (tot= 0000000000) masstab=0
Type 4 (stars): 0 (tot= 0000000000) masstab=0
Type 5 (bndry): 0 (tot= 0000000000) masstab=0
too many particles
task 497: endrun called with an error level of 1313
I tried increasing the number of cores and the number of IC files, but neither worked.
I then recompiled MUSIC and Gadget, making sure there were no warnings and that all the modules were loaded properly. I then regenerated the ICs and reran. This seemed to fix the error.
After that I ran into memory issues. These occur in two different places: (1) internally, when Gadget throws an error, or (2) when there is not enough memory available on the Pleiades nodes.
The former gives the following error, “No domain decomposition that stays within memory bounds is possible”, and can be fixed by increasing TreeAllocFactor/PartAllocFactor or by running on more processors. Currently, I am using TreeAllocFactor=1.5 and PartAllocFactor=2.0 with 512 MPI processes. That seems to be working so far; if the job dies at some point I might have to increase those parameters further. Note that if you ask for too many MPI processes, Gadget throws the following error:
We are out of Topnodes. Increasing the constant MAXTOPNODES might help.
task 1287: endrun called with an error level of 13213
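For reference, the corresponding lines in my Gadget-2 parameter file look like this (these are the standard Gadget-2 parameter names; the values are the ones quoted above):

```
% Memory-related allocation factors in the Gadget-2 parameter file.
% Increase these (or run on more processors) if the domain
% decomposition error above appears.
PartAllocFactor       2.0
TreeAllocFactor       1.5
```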
To get enough memory on Pleiades, I ended up requesting fewer cores per node. I also found that I got seg faults if I tried to run with too many processors. Eventually I got the code working with the following job script:
#PBS -l select=32:ncpus=16:model=bro
#PBS -l walltime=120:00:00
#PBS -q long
module load mpi-sgi/mpt
module load comp-intel/2018.3.222
cd /u/ndrakos/Gadget2/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/nasa/pkgsrc/sles12/2016Q4/lib:/u/ndrakos/install_to_here/gsl_in/lib
mpiexec -np 512 ./Gadget2 /nobackup/ndrakos/wfirst2048/wfirst2048_gadget.param > /nobackup/ndrakos/wfirst2048/output
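The arithmetic behind that node request, for my own reference (the 128 GB per Broadwell node figure is from the Pleiades hardware documentation; treat it as my assumption):

```shell
# Memory per MPI rank under the request above: 32 Broadwell nodes,
# 16 ranks per node (out of 28 cores each), 128 GB per node assumed.
NODES=32
RANKS_PER_NODE=16
MEM_PER_NODE_GB=128
TOTAL_RANKS=$(( NODES * RANKS_PER_NODE ))
GB_PER_RANK=$(( MEM_PER_NODE_GB / RANKS_PER_NODE ))
echo "total ranks: ${TOTAL_RANKS}"     # matches mpiexec -np 512
echo "GB per rank: ${GB_PER_RANK}"
```

Requesting only 16 of the 28 cores per node is what gives each rank more memory headroom.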