Hi Cyrus,
Welcome to the CASINO Forum! And thanks for posting in the Computational Electronic Structure subforum - it was getting very lonely with all the action going on in the other subforums.
Suppose you have a large array that is used by every process but never altered during the run. Examples of this are the B-splines file in VMC/DMC or the integrals file in FCIQMC/SQMC. If every process on a node were to have its own copy of this array, then the memory size would be exceeded. So, one needs to have a single copy per node. One way to do this would be to run a single process per node (as opposed to a process per core, which is what I currently do), and then use OpenMP to parallelize within the node. Mike Towler and George Booth tell me that another possibility is to continue to use one process per core, but to use Posix/SystemV to have a single copy of the array per node; this is the way it is done in CASINO and in the FCIQMC program. What are the pros and cons of the two approaches, both in terms of ease of coding and in terms of parallel efficiency? (I am familiar with MPI but do not know OpenMP or Posix/SystemV at all.)
For your purposes (i.e. making your CHAMP QMC code use shared memory within a node) the best way forward is to use System V/Posix. This is because CHAMP is currently - if I understand correctly - parallelized using MPI only. Thus a preliminary implementation of System V shared memory involving only the vector of blip coefficients requires only one change to the code that you have already written, namely where it currently says:
allocate(blip_vector(10234234928374923847))
you need to change that to
call shallocate(blip_vector(10234234928374923847)).
(and strictly speaking a 'deshallocate' at the end as well). Then the rest of your code will work as normal (but with e.g. 32 times less memory required for that array per node, if you have 32-core nodes).
Now of course you also need to
add some routines which define what you mean by 'shallocate'. In CASINO this consists of:
(1) alloc_shm.c - a low-level C routine which does the actual allocating and deallocating of shared memory using either the System V commands (on most machines) or Posix commands (in practice only on Blue Gene machines). There's a sketch of what this boils down to just after this list.
(2) A Fortran module
shalloc_smp.f90 defining the 'shallocate' function - it looks at the
type of 'blip_vector' (i.e. is it integer, double precision, single precision, complex, etc.) and at how many dimensions it has, then calls the stuff in alloc_shm appropriately.
(3) A 'fake' Fortran module
shalloc_nonsmp.f90 for when you don't want to use shared memory mode (a very few machines, such as the Japanese K computer, physically won't allow it). This simply allocates blip_vector using a normal Fortran allocate statement for each MPI process. Then a non-Shm computer without System V/Posix won't get confused by trying to 'call shallocate'.
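To give a feel for what the low-level piece in (1) boils down to, here is a minimal sketch in C of the System V route. To be clear, this is illustrative only and not the actual alloc_shm.c - the function names are made up, and deciding which process on a node is the 'creator', agreeing the key between the processes, and the barrier that makes the creator go first are all left to the calling layer:

#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* One process per node creates the segment; the others attach to the same
   key, so all of them end up pointing at a single physical copy of the array. */
void *shm_attach(key_t key, size_t nbytes, int creator)
{
    int flags = creator ? (IPC_CREAT | 0600) : 0600;
    int id = shmget(key, nbytes, flags);      /* create or look up the segment  */
    if (id == -1) { perror("shmget"); exit(EXIT_FAILURE); }
    void *p = shmat(id, NULL, 0);             /* map it into this address space */
    if (p == (void *)-1) { perror("shmat"); exit(EXIT_FAILURE); }
    /* Once every process on the node has attached (i.e. after a barrier), the
       creator would typically call shmctl(id, IPC_RMID, NULL) so the segment
       disappears automatically when the last process detaches. */
    return p;
}

/* 'deshallocate' ultimately ends up somewhere like here. */
void shm_detach(void *p)
{
    if (shmdt(p) == -1) perror("shmdt");
}

The Posix variant is the same idea with shm_open/ftruncate/mmap instead of shmget/shmat, and the Fortran module's job is essentially to work out the byte count from the array's type and shape, call something like the above, and hand the resulting address back to Fortran (e.g. via a pointer association).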
There's a slight complication with 'NUMA nodes' (Non-Uniform Memory Access). Simplifying massively, let's say that the 32-core node consists of 4 physical 8-core processors plugged into a board, and each 8-core processor can access its own local memory faster than the memory local to the other 3 processors. Then - if you have enough memory available - it would be
faster to run with 4 copies of blip_vector, each one shared by all the cores on one processor. In practice most people don't bother reading the documentation deeply enough to realize that this is likely to benefit them, and end up not doing it. (I have to admit that not enough practical timing tests have been done to determine how much this kind of thing matters with CASINO.)
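If you did want to chase the NUMA point, here is one hedged sketch of how the processes could be grouped. It assumes MPI-3's MPI_Comm_split_type is available and that the ranks are placed consecutively on the cores, 8 per processor - both assumptions on my part, not necessarily how CASINO does it:

#include <mpi.h>

/* Build a communicator containing only the ranks on one NUMA domain, so that
   each 8-core processor can have its own shared copy of blip_vector. */
void make_socket_comm(int cores_per_socket, MPI_Comm *socket_comm)
{
    MPI_Comm node_comm;
    int node_rank;
    /* First, all the ranks that can share memory (i.e. on the same node). */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);
    /* Then subdivide the node: ranks 0-7 -> socket 0, 8-15 -> socket 1, etc. */
    MPI_Comm_split(node_comm, node_rank / cores_per_socket, node_rank,
                   socket_comm);
    MPI_Comm_free(&node_comm);
}

Rank 0 of each socket_comm would then create the shared segment and the rest of that communicator would attach, giving one copy per processor rather than one per node.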
Now you could use OpenMP, and that would involve - as you say - running e.g. 1 MPI process per node, and this then 'spawns' 32 OpenMP threads, one of which runs on each of the 32 physical cores [Again, for a NUMA node, it might be better to run 4 MPI processes per node, and 8 OpenMP threads per MPI process, but whatever].
Now those 32 threads are effectively 'sharing memory'. The trouble, from a CHAMP perspective, is that you then need to define what it is those 32 OpenMP threads are going to do. This will involve a significant rewrite of your code - at the very least adding loads of compiler directives to likely-looking parallelizable loops - rather than changing one line and adding what are effectively some library routines as in the System V/Posix case.
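To make concrete what 'adding compiler directives' looks like, here is a made-up loop for illustration (not actual CHAMP or CASINO code):

#include <omp.h>

/* Each OpenMP thread takes a share of the electrons; the big read-only
   blip_vector is shared between the threads within the MPI process. */
void evaluate_orbitals(int nelec, int norb, const double *blip_vector,
                       double *orb_values)
{
    #pragma omp parallel for schedule(static)
    for (int ie = 0; ie < nelec; ie++) {
        for (int io = 0; io < norb; io++) {
            /* Placeholder: the real code would interpolate the blips at the
               position of electron ie to get the value of orbital io. */
            orb_values[ie * norb + io] = blip_vector[io];
        }
    }
}

And you have to do something like that - and convince yourself there are no race conditions - for every hot loop in the code, which is where the rewriting effort goes.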
Now CASINO can do OpenMP as well, and in hybrid MPI-OpenMP mode its OpenMP threads are used to parallelize over e.g. electrons and orbitals and stuff like that. Now it's important to realize that running 32 OpenMP threads on our 32-core node won't make it go 32 times faster. What we found (and our implementation is probably not that efficient) is that running 4 OpenMP threads per MPI process gives you about a 2.5x speedup; more than 4 OpenMP threads per MPI process gives you very little additional benefit.
On the other hand, if you discount an issue with DMC equilibration time, CASINO running in pure MPI mode has been shown to scale more or less linearly with the number of MPI processes when running with 1 MPI process per core (i.e. if you double the number of cores, the code goes twice as fast). I've found this to be essentially true on up to half a million cores - on the rare machines that have that many...
This latter conclusion is of course dependent on the fact that - in DMC - independently propagating walkers don't need to talk to each other very much, and in CASINO the little talking that
is done is hidden by using asynchronous MPI communication and other tricks.
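For what it's worth, the flavour of that trick is just the standard non-blocking MPI pattern - post the walker transfer, carry on propagating the walkers you already hold, and only wait on it when you must. A sketch (not CASINO's actual routines):

#include <mpi.h>

/* Start shipping a block of walker data to another process; returns
   immediately so the local propagation can continue in the meantime. */
void start_walker_send(double *walker_buf, int count, int dest,
                       MPI_Request *req)
{
    MPI_Isend(walker_buf, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, req);
}

/* Complete the transfer once the work that hides it has been done. */
void finish_walker_send(MPI_Request *req)
{
    MPI_Wait(req, MPI_STATUS_IGNORE);
}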
So as far as I know, hardly any CASINO user bothers to use OpenMP mode, and usually the best thing to do is to run a single MPI process per core and use System V/Posix shared memory. On machines like Blue Gene/Qs, with their peculiar architecture, it can be beneficial to run with up to 4 MPI processes per physical core, i.e. 64 CASINO MPI processes running on a 16-core BG/Q node (subject to rather low per-process memory limits). See the relevant question in the CASINO FAQ for a discussion of this:
http://vallico.net/casinoqmc/faqs/b9/.
So, in my opinion, there is simply no contest - if you want to get CHAMP doing this quickly you should choose System V/Posix. Now, of course, it wouldn't happen quickly if we made you reinvent the wheel and write your own routines to implement 'shallocate'! Therefore - providing the other developers (particularly Lucian Anton, who wrote much of the low-level stuff) are happy with it - I don't see why we can't donate the CASINO routines to aid the CHAMP cause, if that's what you want. Let me know.
Hope this helps,
Mike