Yes. If you do a direct comparison between the speeds of say a Cray XE6 and a Blue Gene/Q on the same number of cores you will find that the Blue Gene is significantly slower (4 times slower in the test that I – MDT – did). What can we do about this?
With a bit of digging you then find that BG/Qs are supposed to be run with multiple threads per core, since you simply don’t get full instruction throughput with 1 thread per core. It’s not that more threads help, it’s that fewer than 2 threads per core is like running with only one leg. You just can’t compare BG/Q to anything unless you’re running at least 2 hardware threads per core. Most applications max out at 3 or 4 per core (and CASINO maxes out at 4 – see below). The BG/Q core is completely in-order and single-issue per hardware thread, whereas x86 cores are usually multi-issue.
Here are some timings, for a short VMC run on a large system (TiO2 – 648 electrons) on 512 cores:
Hector - Cray XE6 : 55.1 sec
Vesta – Blue Gene/Q : 222.19 sec
Now try ‘overloading’ each core on the BG/Q, which can take up to 4 threads/core (in powers of 2):
--ppn=48 : error, "48 is not a valid ranks per node value"
--ppn=64 : 138.44 sec
So by using 512 cores to run 2048 MPI processes, we improve to ~2.5 times slower than Hector (512 cores running 512 processes) rather than 4 times slower as before. Note that the number of processes per core has to be a power of 2 (i.e. not 3) – unless you do special tricks.
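The ratios quoted above follow directly from the timings. A minimal Python sketch of the arithmetic (the variable names are illustrative and not part of CASINO):

```python
# Timings quoted above (seconds) for the TiO2 VMC run on 512 cores.
hector_xe6 = 55.1      # Cray XE6, 1 process per core
vesta_ppn16 = 222.19   # Blue Gene/Q, 1 process per core
vesta_ppn64 = 138.44   # Blue Gene/Q, 4 processes per core

cores = 512
procs_per_core = 4
mpi_processes = cores * procs_per_core

# How many times slower the BG/Q run is than the XE6 run.
slowdown_1_per_core = vesta_ppn16 / hector_xe6
slowdown_4_per_core = vesta_ppn64 / hector_xe6

print(f"MPI processes on {cores} cores: {mpi_processes}")
print(f"BG/Q vs XE6, 1 process/core:   {slowdown_1_per_core:.2f}x slower")
print(f"BG/Q vs XE6, 4 processes/core: {slowdown_4_per_core:.2f}x slower")
```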
Now try OpenMP on the cores instead of extra MPI processes (OpenmpShm mode):
--ppn=16 --tpp=1 : 222.19 sec
--ppn=16 --tpp=2 : 197.18 sec
--ppn=16 --tpp=4 : 196.42 sec
Thus for this case we see (a) not much point going beyond --tpp=2, and (b) OpenMP is not as good as just using extra MPI processes. In other words, running multiple processes per core on a fixed number of cores, so that each process does fewer moves (parallelizing over moves), is much faster than trying to calculate the wave function faster for a fixed number of moves per process (parallelizing over electrons).
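To put numbers on that comparison, here is a short Python sketch that expresses each run as a speedup over the 1-thread-per-core baseline. It uses only the timings quoted above; the labels are just for printing.

```python
# Baseline: --ppn=16 --tpp=1 on the Blue Gene/Q (seconds).
baseline = 222.19

# Timings quoted above for the other ways of filling each core.
runs = {
    "--ppn=16 --tpp=2 (OpenMP)": 197.18,
    "--ppn=16 --tpp=4 (OpenMP)": 196.42,
    "--ppn=64 (extra MPI processes)": 138.44,
}

# Speedup = baseline time / run time.
speedups = {label: baseline / seconds for label, seconds in runs.items()}

for label, factor in speedups.items():
    print(f"{label}: {factor:.2f}x faster than 1 thread per core")
```

On these figures the OpenMP runs gain a little over 10%, while 4 MPI processes per core gain about 60%.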
Thus – if memory allows it – the best way to run CASINO on a Blue Gene/Q such as Mira at Argonne National Lab seems to be:
runqmc -n xxx --ppn=64 -s --user.shmemsize=yyy
i.e. run a shared memory calculation with a shared memory block of yyy MB on xxx 16-core nodes with 64 MPI processes per node (4 per core).
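If you generate job commands from a script, something like the following Python sketch could fill in xxx and yyy. The helper name `build_runqmc_command` is hypothetical, only the flags shown in the command above are used, and the power-of-2 check is an assumption based on the "not a valid ranks per node value" error quoted earlier. The example values (32 nodes, 1024 MB) are arbitrary placeholders, not recommendations.

```python
def build_runqmc_command(nodes, shmem_mb, ppn=64):
    """Assemble the runqmc command line quoted above (hypothetical helper)."""
    # Assumption from the text: processes per node must be a power of 2,
    # with at most 4 per core on a 16-core node (i.e. at most 64).
    is_power_of_two = ppn > 0 and (ppn & (ppn - 1)) == 0
    if not is_power_of_two or ppn > 64:
        raise ValueError(f"{ppn} is not a valid ranks per node value")
    args = [
        "runqmc",
        "-n", str(nodes),                 # number of 16-core nodes (xxx)
        f"--ppn={ppn}",                   # MPI processes per node
        "-s",                             # shared memory run
        f"--user.shmemsize={shmem_mb}",   # shared memory block in MB (yyy)
    ]
    return " ".join(args)


# Placeholder values: 32 nodes (512 cores) and a 1024 MB shared memory block.
print(build_runqmc_command(nodes=32, shmem_mb=1024))
```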
Note however, that all this comes with a big caveat regarding MEMORY.
Consider the big Blue Gene/Q at Argonne – Mira, which has 16 GB available per 16-core node.
Forgetting about shared memory for the moment, it is common to think of all this memory as being generally available to all cores on the node, such that e.g. the master process could allocate 15 GB and all the slaves could allocate a total of 0.5 GB and everything would fit in the node.
In fact, since there is no swap space available, the 16 GB per node is divided among the MPI processes as evenly as possible (which apparently might not be that even in some cases due to how the TLB works). So when you’re using 64 MPI processes per node – which we supposedly recommend above – each individual process has a maximum possible memory usage of 16 GB / 64 = 250 MB in principle.
In practice, this will be reduced by (a) any shared memory partition – the default for this being c. 32 MB over the whole node, which you may increase to whatever value you like using the --user.shmemsize command line argument to runqmc, (b) memory used by MPI, and (c) other stuff.
According to some data I found, in practice for 64 MPI processes per node, the available memory per process is between 180 and 240 MB (unevenly distributed between the processes).
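The per-process ceiling is simple division. A minimal Python sketch, assuming decimal units (16 GB = 16000 MB), which is what the 250 MB figure above implies; the function name is illustrative and not part of CASINO:

```python
NODE_MEMORY_MB = 16_000  # 16 GB per node, decimal units as in the text


def max_memory_per_process_mb(ppn, node_memory_mb=NODE_MEMORY_MB):
    """Upper bound on memory per MPI process when a node is split evenly."""
    return node_memory_mb / ppn


for ppn in (64, 32, 16, 8):
    limit = max_memory_per_process_mb(ppn)
    print(f"--ppn={ppn:<2d} : at most {limit:.0f} MB per process")
```

These are upper bounds only; the shared memory block, MPI buffers and other overheads reduce them, as described above.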
You may therefore run out of memory for large systems on large numbers of processes because:
- 180 MB is not very much.
- The memory required by MPI increases as a function of the partition size.
- There are various large arrays in CASINO which either cannot be put in shared memory (e.g. because they refer to configs which live only on the MPI process in question, such as the array containing the interparticle distances for a given set of configs) or have not yet been coded to be put in shared memory (e.g. Fourier coefficients for orbitals expanded in plane waves), and these just won’t fit in 180 MB.
What to do about it
(1) Don’t use --ppn=64 for the number of processes per node. Every time you halve it, the available memory per process will approximately double. The disadvantage then is that the effective speed of a single core will go down (for --ppn=16, i.e. 1 process per core, it will be something like 4 times slower than a Cray core running a single process). For really big systems you can even use e.g. --ppn=8. Note that the available processes that you are not using can be turned into OpenMP threads, which parallelize over the electrons instead of the configs (you would need to compile the code in OpenmpShm mode first, then use runqmc --ppn=2 --tpp=2).
(2) Setting an environment variable called BG_MAPCOMMONHEAP=1 can even out the memory available per process, with the cost that “memory protection between processes is not as stringent” (I don’t know what that means practically). CASINO will do this automatically if the arch file has been set up to do so (e.g. on Mira/Cetus/Vesta).
(3) Make sure you use shared memory if the system supports it, and tune the value of --user.shmemsize to be as small as possible. CASINO prints the amount of shared memory required at the end of setup when run with ‘testrun : T’, by default for the number of processors used in the test run; it will estimate the amount required on N processors if you set the input keyword shm_size_nproc to N.
(4) Badger the CASINO developers to put more things into shared memory where possible.
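As a rough guide to point (1), the following Python sketch picks the largest number of processes per node whose estimated memory per process covers a given requirement. It is a heuristic under stated assumptions, not a CASINO feature: it takes the worst-case 180 MB observed at --ppn=64 and applies the "halve the processes, roughly double the memory" rule from the text. The function names and the example requirements are illustrative.

```python
WORST_CASE_MB_AT_PPN64 = 180  # lower end of the 180-240 MB range quoted above


def estimated_available_mb(ppn):
    """Rough estimate: memory per process doubles each time ppn is halved."""
    return WORST_CASE_MB_AT_PPN64 * 64 / ppn


def largest_ppn_that_fits(required_mb, candidates=(64, 32, 16, 8, 4, 2, 1)):
    """Return the largest ppn whose estimate covers required_mb, else None."""
    for ppn in candidates:  # candidates are ordered from largest to smallest
        if estimated_available_mb(ppn) >= required_mb:
            return ppn
    return None


# Illustrative per-process requirements in MB.
for required in (150, 300, 1000):
    print(f"{required} MB per process -> --ppn={largest_ppn_that_fits(required)}")
```

Treat the result as a starting point and check it against the memory usage CASINO actually reports.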
That was a long answer to a simple question.