OpenMP versus Posix/SystemV

Any other relevant topic not directly about QMC, including DFT, quantum chemistry, etc.
Cyrus_Umrigar
Posts: 4
Joined: Wed Jan 29, 2014 3:45 pm

OpenMP versus Posix/SystemV

Post by Cyrus_Umrigar »

This is a general question about parallel programming. Suppose you have a large array that is used by every process but never altered during the run. Examples of this are the B-splines file in VMC/DMC or the integrals file in FCIQMC/SQMC. If every process on a node were to have its own copy of this array, the node's memory would be exceeded, so one needs a single copy per node. One way to do this would be to run a single process per node (as opposed to a process per core, which is what I currently do) and then use OpenMP to parallelize across the cores of the node. Mike Towler and George Booth tell me that another possibility is to continue to use one process per core, but to use Posix/SystemV shared memory so that there is a single copy of the array per node; this is the way it is done in CASINO and in the FCIQMC program. What are the pros and cons of the two approaches, both in terms of ease of coding and in terms of parallel efficiency? (I am familiar with MPI but do not know OpenMP or Posix/SystemV at all.)

A further question, of lesser importance: does either approach easily allow one to run different numbers of processes on the various nodes used by a job? (This would be useful if one were running on mixed architectures, or if one did not have dedicated nodes.)
Mike Towler
Posts: 239
Joined: Thu May 30, 2013 11:03 pm
Location: Florence

Re: OpenMP versus Posix/SystemV

Post by Mike Towler »

Hi Cyrus,

Welcome to the CASINO Forum! And thanks for posting in the Computational Electronic Structure subforum - it was getting very lonely with all the action going on in other subforums... :D
Cyrus_Umrigar wrote: Suppose you have a large array that is used by every process but never altered during the run. [...] What are the pros and cons of the two approaches, both in terms of ease of coding and in terms of parallel efficiency?
For your purposes (i.e. making your CHAMP QMC code use shared memory within a node) the best way forward is to use System V/Posix. This is because CHAMP is currently - if I understand correctly - parallelized using MPI only. Thus a preliminary implementation of System V shared memory involving only the vector of blip coefficients requires only one change to the code that you have already written, namely where it currently says:

allocate(blip_vector(10234234928374923847))

you need to change that to

call shallocate(blip_vector(10234234928374923847))

(and, strictly speaking, a 'deshallocate' at the end as well). Then the rest of your code will work as normal (but with e.g. 32 times less memory required per node, if you have 32-core nodes).

Now of course you also need to add some routines which define what you mean by 'shallocate'. In CASINO this consists of:

(1) alloc_shm.c - a low-level C routine which does the actual allocating and deallocating of shared memory using either the System V commands (on most machines) or Posix commands (in practice only on Blue Gene machines).

(2) A Fortran module shalloc_smp.f90 defining the 'shallocate' routine - it looks at the type of 'blip_vector' (i.e. is it integer, double precision, single precision, complex, etc.) and at how many dimensions it has, then calls the routines in alloc_shm appropriately.

(3) A 'fake' Fortran module shalloc_nonsmp.f90 for when you don't want to use shared-memory mode (a very few machines, such as the Japanese K computer, physically won't allow it). This simply allocates blip_vector using a normal Fortran allocate statement for each MPI process, so a non-Shm machine without System V/Posix won't get confused by trying to 'call shallocate'. (A much-simplified sketch of this fallback idea is given below.)
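
To make the shape of this concrete, here is a deliberately stripped-down sketch of the non-shared-memory fallback idea. It is not the CASINO module - the real shalloc_nonsmp.f90 is generic over type, kind and rank, and its argument list differs - and the names here are purely illustrative:

Code: Select all

! Much-simplified sketch of the *idea* behind shalloc_nonsmp.f90 (illustrative
! names only, not the CASINO interface). In non-shared-memory mode every MPI
! process simply gets its own ordinary copy of the array.
module shalloc_nonsmp_sketch
  implicit none
contains

  subroutine shallocate_d1(a, n)
    ! Plain per-process allocation; no shared memory involved.
    real(8), pointer, intent(out) :: a(:)
    integer, intent(in) :: n
    allocate(a(n))
  end subroutine shallocate_d1

  subroutine deshallocate_d1(a)
    real(8), pointer, intent(inout) :: a(:)
    deallocate(a)
  end subroutine deshallocate_d1

  subroutine shallocate_barrier
    ! Nothing to synchronize when each process owns its own copy.
  end subroutine shallocate_barrier

end module shalloc_nonsmp_sketch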

There's a slight complication with 'NUMA nodes' (Non-Uniform Memory Access). Simplifying massively, let's say that the 32-core node consists of 4 physical 8-core processors plugged into a board, and each 8-core processor can access its own local memory faster than the memory local to the other 3 processors. Then - if you have enough memory available - it would be faster to run with 4 copies of blip_vector, and each one will be shared by all cores on a processor. In practice most people don't bother reading the documentation deeply enough to realize that this is likely to benefit them, and end up not doing it. (I have to admit that not enough practical timing tests have been done to determine how much this kind of thing matters with CASINO).

Now you could use OpenMP, and that would involve - as you say - running e.g. 1 MPI process per node, and this then 'spawns' 32 OpenMP threads, one of which runs on each of the 32 physical cores [Again, for a NUMA node, it might be better to run 4 MPI processes per node, and 8 OpenMP threads per MPI process, but whatever].

Now those 32 threads are effectively 'sharing memory'. The trouble, from a CHAMP perspective, is that you then need to define what it is those 32 OpenMP threads are going to do. This will involve a significant rewrite of your code - at the very least adding loads of compiler directives to likely-looking parallelizable loops - rather than changing one line and adding what are effectively some library routines as in the System V/Posix case.
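
To give a flavour of what that rewrite involves, here is a minimal, made-up sketch (none of the variable names come from CHAMP or CASINO) of the kind of directive you end up adding to every loop you want parallelized:

Code: Select all

! Hypothetical sketch only: the names below are invented for illustration.
! The point is that each parallelizable loop needs a directive like this one.
program omp_loop_sketch
  implicit none
  integer, parameter :: nelec = 1024
  integer :: ie
  real(8) :: e_local, contrib(nelec)

  call random_number(contrib)   ! stand-in for real per-electron work
  e_local = 0.d0

  ! The threads of a single MPI process split this loop between them,
  ! all reading the same (shared) copy of the big arrays.
  !$omp parallel do default(shared) private(ie) reduction(+:e_local)
  do ie = 1, nelec
     e_local = e_local + contrib(ie)
  end do
  !$omp end parallel do

  print *, 'local energy accumulator =', e_local
end program omp_loop_sketch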

Now CASINO can do OpenMP as well, and in hybrid MPI-OpenMP mode its OpenMP threads are used to parallelize over e.g. electrons and orbitals and stuff like that. It's important to realize, though, that running 32 OpenMP threads on our 32-core node won't make it go 32 times faster. What we found (and our implementation is probably not that efficient) is that running 4 OpenMP threads per MPI process gives you about a 2.5x speedup; more than 4 OpenMP threads per MPI process gives you very little additional benefit.

On the other hand, if you discount an issue with DMC equilibration time, CASINO running in pure MPI mode has been shown to scale more or less linearly with the number of MPI processes when running with 1 MPI process per core (i.e. if you double the number of cores, the code goes twice as fast). I've found this to be essentially true on up to half a million cores - on the rare machines that have that many...

This latter conclusion is of course dependent on the fact that - in DMC - independently propagating walkers don't need to talk to each other very much, and in CASINO the little talking that is done is hidden by using asynchronous MPI communication and other tricks.

So as far as I know, hardly any CASINO user bothers to use OpenMP mode, and usually the best thing to do is to run a single MPI process per core and use System V/Posix shared memory. On machines like Blue Gene/Qs, with their peculiar architecture, it can be beneficial to run with up to 4 MPI processes per physical core, i.e. 64 CASINO MPI processes running on a 16-core BG/Q node (subject to rather low per-process memory limits). See the relevant question in the CASINO FAQ for a discussion of this: http://vallico.net/casinoqmc/faqs/b9/.

So, in my opinion, there is simply no contest - if you want to get CHAMP doing this quickly you should choose System V/Posix. Now, of course, it wouldn't happen quickly if we made you reinvent the wheel and write your own routines to implement 'shallocate'! Therefore - providing the other developers (particularly Lucian Anton, who wrote much of the low-level stuff) are happy with it - I don't see why we can't donate the CASINO routines to aid the CHAMP cause, if that's what you want. Let me know..

Hope this helps,
Mike
Cyrus_Umrigar
Posts: 4
Joined: Wed Jan 29, 2014 3:45 pm

Re: OpenMP versus Posix/SystemV

Post by Cyrus_Umrigar »

Mike,

Thanks very much for your reply!
Your solution sounds like magic. I would guess it is a bit harder to do than you indicate, but if it is even remotely that simple to use it would save a lot of time. So, yes, I would be very grateful to have permission to use it. Actually my immediate need was to use it in our SQMC program (which is mostly what I have worked on for the last 3 years) for the integrals file, rather than in CHAMP for the blips file. Even for SQMC I do not need it urgently, because I have for the moment abandoned the project I wanted it for, but at some point it would be nice to have it.

Am I correct in thinking that every process makes a call to this magical shallocate routine, even though we want a single copy of the array to be allocated per node?

Thanks!
Cyrus
Lucian_Anton
Posts: 2
Joined: Thu Jan 23, 2014 1:11 am

Re: OpenMP versus Posix/SystemV

Post by Lucian_Anton »

Hi Cyrus,

Just a quick comment on Mike's description: shared memory may need synchronisation between the ranks on a node.
Consider the code below.

Code: Select all

call shallocate(a,….)
if(am_smpmaster) then
   read(u) a
endif
call shallocate_barrier
! now other ranks can access shared data
b=a(i)
Without the call to shallocate_barrier (which is an MPI barrier at node level) the value of b is affected by a race condition.

Otherwise, using shared memory in a mature MPI code is much easier than retrofitting OpenMP, which has its own problems scaling beyond 8 threads (see http://www.cs.uiuc.edu/~snir/PDF/CCGrid13.pdf).

I am very happy for you to use the CASINO shared-memory subroutines.
It is worth noting that the latest MPI-3 standard has added shared memory; a very good presentation of this new feature can be found at http://htor.inf.ethz.ch/publications/im ... amming.pdf, which includes an example involving a QMC code.

I have tested this feature in the latest MPICH release, and I have read that OpenMPI has implemented it too; Cray MPI already has it on ARCHER. I plan to add MPI-3 support to the CASINO subroutines later this year, when time allows.
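
For the curious, here is a rough sketch (not CASINO code - the array size and variable names are invented for illustration) of what the MPI-3 shared-memory route looks like using the Fortran 2008 bindings: one communicator per node, one rank allocating the segment, and the other ranks querying its address and mapping it onto a Fortran array.

Code: Select all

! Rough sketch, not CASINO code: MPI-3 shared memory via the mpi_f08 bindings.
! One rank per node allocates the segment; the others map the same memory.
program mpi3_shm_sketch
  use mpi_f08
  use, intrinsic :: iso_c_binding, only: c_ptr, c_f_pointer
  implicit none
  integer, parameter :: n = 1000000        ! illustrative array length
  integer :: node_rank, disp_unit
  integer(kind=MPI_ADDRESS_KIND) :: winsize
  type(MPI_Comm) :: node_comm
  type(MPI_Win)  :: win
  type(c_ptr)    :: baseptr
  real(8), pointer :: a(:)

  call MPI_Init()

  ! Split MPI_COMM_WORLD into one communicator per shared-memory node
  call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                           MPI_INFO_NULL, node_comm)
  call MPI_Comm_rank(node_comm, node_rank)

  ! Only the node master asks for memory; everyone else requests 0 bytes
  winsize = 0_MPI_ADDRESS_KIND
  if (node_rank == 0) winsize = int(n, MPI_ADDRESS_KIND) * 8_MPI_ADDRESS_KIND
  call MPI_Win_allocate_shared(winsize, 8, MPI_INFO_NULL, node_comm, &
                               baseptr, win)

  ! Non-masters query the master's segment and map it onto a Fortran array
  if (node_rank /= 0) call MPI_Win_shared_query(win, 0, winsize, disp_unit, baseptr)
  call c_f_pointer(baseptr, a, [n])

  if (node_rank == 0) a = 1.d0             ! node master fills the shared data
  call MPI_Barrier(node_comm)              ! others must wait before reading
  if (node_rank /= 0) write(*,*) 'rank', node_rank, 'sees a(1) =', a(1)

  call MPI_Win_free(win)
  call MPI_Finalize()
end program mpi3_shm_sketch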

Kind regards,

Lucian Anton.
Lucian_Anton
Posts: 2
Joined: Thu Jan 23, 2014 1:11 am

Re: OpenMP versus Posix/SystemV

Post by Lucian_Anton »

Cyrus_Umrigar wrote: Am I correct in thinking that every process makes a call to this magical shallocate routine, even though we want a single copy of the array to be allocated per node?

Thanks!
Cyrus
Yes, shallocate must be called by all ranks on a node, and it returns an array that can be read and written by all ranks on the node.

MPI-3 offers more flexibility: its allocation subroutine returns a pointer to a section of the shared array with a variable local size. MPI-3 also provides a query function that returns the addresses of the segments owned by the other ranks (Hoefler's paper explains this in detail).

Lucian
Mike Towler
Posts: 239
Joined: Thu May 30, 2013 11:03 pm
Location: Florence

Re: OpenMP versus Posix/SystemV

Post by Mike Towler »

Hi Cyrus,
Cyrus_Umrigar wrote: Your solution sounds like magic. I would guess it is a bit harder to do than you indicate...
Nope, it really should be that simple. Apart from the issue that Lucian mentioned - using a few well-placed barrier calls to make sure that the smpslave processes don't access the shared memory segment before it's been filled with data by the smpmaster - there really are no problems beyond modifying your build system to accommodate the new routines.

One quick thing: Lucian and I are currently kicking a few modifications around that will make sure that vectors can be repeatedly allocated and deallocated during a run on Posix machines (something which CASINO didn't attempt to do until recently) without causing memory leaks or other problems. There should therefore be a new version of the Shm stuff put into the code in the next few days, so don't take the routines from the version of CASINO that you downloaded last week - wait until I tell you, which should be very soon.

M.
Mike Towler
Posts: 239
Joined: Thu May 30, 2013 11:03 pm
Location: Florence

Re: OpenMP versus Posix/SystemV

Post by Mike Towler »

Dear Cyrus,

The modifications to CASINO's shared memory system have now been completed (it took a little longer than anticipated, due to the sheer stubbornness of the wretched Blue Gene/Q machines). Please feel free to download the current beta, and to nick the alloc_shm.c, shalloc_smp.f90, and shalloc_nonsmp.f90 routines for incorporation into your code. I've also very significantly expanded the discussion of all these issues in the manual. Let me know if you have any questions.

Relevant extracts from the DIARY file below.

Best wishes,
Mike

Code: Select all

---[v2.13.295]---
* Couple of minor changes to alloc_shm.c to (1) stop craycc moaning that 'the
  function "getpid" is declared implicitly', and (2) fix a typo in the
  OpenmpShm bit.
  -- Mike Towler, 2014-02-19

---[v2.13.290]---
* New implementation of Posix shared memory for Blue Gene/Qs. Should finally
  fix Shm errors that have been apparent on these machines for the last couple
  of months.
  -- Mike Towler, 2014-02-18

  In general with CASINO one selects whether to use System V or Posix shared
  memory by setting the CFLAGS_SHM parameter in the arch file to be -DSHM_SYSV
  or -DSHM_POSIX. There is now a third variant "-DSHM_POSIX_BGQ", which is
  designed to overcome the apparently buggy implementation of Posix shared
  memory on BG/Qs. The plain Posix version actually worked well originally,
  when we used Shm only for a single large array (the blip coefficients)
  persistent through the entire calculation. When we began to treat
  'shallocate' and 'deshallocate' as the shared-memory equivalents of Fortran
  allocate and deallocate, for arrays which are meant to appear and disappear
  multiple times, it was found not to work. (See subsequent patches 2.13.214,
  2.13.242, 2.13.265, 2.13.266, 2.13.281.)

  The BG/Q Posix bugginess includes but is not necessarily limited to: unlinked
  files not actually being removed, ftruncate producing unexpected results, and
  mmap not using the offset argument. To bypass these problems the
  implementation now opens only one shared memory file, and the memory
  allocation needed by CASINO arrays is controlled in this file via a linked
  list of pointers to blocks of memory. The basic memory model of the algorithm
  is a stack of memory blocks which reuses freed blocks if and only if the size
  of the new memory request fits into an already-existing block. It is not
  particularly smart but it should do for the current operation pattern used by
  CASINO in shared memory (i.e. allocate 1-2 files to store data and use a
  third as a temporary buffer for IO). The smartness factor may be reviewed
  in the future as Shm usage patterns change.

  Many thanks to Lucian Anton for his hard work on this.

---[v2.13.289]---
* Hugely expanded and improved the section of the manual concerning how to use
  CASINO in parallel (in particular there was hardly any discussion of shared
  memory, how to get good scaling on large numbers of processors, how to use
  the peculiar features of Blue Gene machines, how to use OpenMP/OpenMPShm etc.
  etc.). See the new Section 38.
  -- Mike Towler, 2014-02-18
Cyrus_Umrigar
Posts: 4
Joined: Wed Jan 29, 2014 3:45 pm

Re: OpenMP versus Posix/SystemV

Post by Cyrus_Umrigar »

Dear Mike and Lucian,

Thank you very much for the information and for your willingness to share your 3 shared memory subroutines!

Cyrus