B8. To use CASINO in shared memory mode on a Blue Gene system I need to set (at runtime) an environment variable, BG_SHAREDMEMPOOLSIZE (Blue Gene/P) or BG_SHAREDMEMSIZE (Blue Gene/Q), to the size of the shared memory partition in Mb. How do I do this, why do I need to do it, and what value should I set it to?

This variable may be set by using the --user.shmemsize option to the runqmc script (exactly what environment variable this defines is set in the appropriate .arch file for the machine in question).

Note that on Blue Gene/Ps the default is zero (i.e. the user always needs to set it explicitly), whereas on Blue Gene/Qs the default is something like 32-64 Mb, depending on the number of cores per node requested. Thus for very small jobs on Blue Gene/Qs you don’t need to set it explicitly.

An example of such a machine is Intrepid (a Blue Gene/P at Argonne National Lab) and you can look in the following file to see exactly what is done with the value of --user.shmemsize:

CASINO/arch/data/bluegene-xlf-cobalt-parallel.intrepid.arch

For a Blue Gene/Q you can look in:

CASINO/arch/data/bluegene-xlf-ll-parallel.bluejoule.arch
CASINO/arch/data/bluegene-xlf-ll-parallel.mira.arch
CASINO/arch/data/bluegene-xlf-ll-parallel.cetus.arch
CASINO/arch/data/bluegene-xlf-ll-parallel.vesta.arch

Let us use the Intrepid Blue Gene/P as an example.

Nodes on Intrepid have 4 cores and 2Gb of available memory, and the machine can run in 3 different “modes”:

SMP - 1 process per node, which can use all 2Gb of memory.
DUAL - 2 processes per node, each of which can use 1Gb of memory.
VN - 4 processes per node, each of which can use 512Mb of memory.

Using shared memory in SMP mode doesn’t seem to work (any job run in this way is ‘Killed with signal 11’), presumably due to an IBM bug. (One might instead consider using OpenMP to split the configs over the 4 cores: you would run in SMP mode and request multiple threads with --tpp.)

So – taking VN mode as an example – we would like to be able to allocate (say) 1.5Gb of blip coefficients on the node, and for all four cores to have access to this single copy of the data – which is the point of shared memory runs. Such a calculation would be impossible if all 4 cores had to have a separate copy of the data.
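The arithmetic behind this can be sketched in a few lines of Python. This is only an illustration using the Intrepid figures quoted above (2Gb per node, taken here as 2048 Mb); the function names are made up for the example, and the check ignores the memory needed by the kernel, the program text and each process's private data, so it gives an optimistic upper bound.

```python
# Illustrative figures for an Intrepid (Blue Gene/P) node, as quoted in the text.
NODE_MEMORY_MB = 2048  # 2 Gb per node
PROCESSES_PER_NODE = {"SMP": 1, "DUAL": 2, "VN": 4}  # mode -> processes per node


def memory_per_process_mb(mode):
    """Memory (in Mb) available to each process in the given mode."""
    return NODE_MEMORY_MB // PROCESSES_PER_NODE[mode]


def fits_as_private_copies(blip_mb, mode):
    """True if every process could hold its own copy of the blip vector."""
    return blip_mb <= memory_per_process_mb(mode)


def fits_as_shared_copy(blip_mb):
    """True if one copy fits on the node (kernel and per-process data ignored)."""
    return blip_mb <= NODE_MEMORY_MB


if __name__ == "__main__":
    blip_mb = 1536  # the 1.5 Gb example from the text
    for mode in PROCESSES_PER_NODE:
        print(mode, memory_per_process_mb(mode), fits_as_private_copies(blip_mb, mode))
    print("shared:", fits_as_shared_copy(blip_mb))
```

In VN mode each process has only 512 Mb, so four private copies of a 1.5Gb vector cannot fit, whereas a single shared copy can.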

Unfortunately, on Intrepid, the user needs to know how much of the 2Gb will be taken up by shared memory allocations. This is easy enough, since the only vector which is ‘shallocked’ is the vector of blip coefficients. This will be a little smaller than the size of the binary blip file, which you can find with e.g. ‘du -sh bwfn.data.bin’.

On a machine with limited memory like this one, it will pay to use the ‘sp_blips’ option in input, and to use a smaller plane-wave cutoff if you can get away with it.

Thus if your blip vector is 1.2Gb in size, then run the code with something like :

runqmc --shmem=4 --ppn=4 --user.shmemsize=1250 --walltime=2h30m

where the --shmem indicates we want to share memory over 4 cores, the --ppn (‘processes per node’) indicates we want to use VN mode, and --walltime is the maximum job time. Note the value for --user.shmemsize is in Mb.
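As a rough sketch of how one might pick the value, the following Python helper takes the size of the binary blip file (an upper bound on the blip vector), rounds it up to whole Mb, adds a small safety margin and builds the command line shown above. The helper names, the 25 Mb margin and the default file name are assumptions made for illustration and are not part of CASINO; the 1800 Mb ceiling is the practical limit reported for Intrepid later in this answer. Only the runqmc options already shown in the example are used.

```python
import math
import os

MB = 1024 * 1024  # bytes per Mb (megabyte), the unit used by --user.shmemsize


def suggest_shmemsize_mb(blip_bytes, margin_mb=25, max_pool_mb=1800):
    """Suggest a --user.shmemsize value (in Mb) for a blip vector of blip_bytes.

    margin_mb is an arbitrary safety margin (an assumption, not a CASINO rule);
    max_pool_mb is the largest value reported to work on Intrepid.
    """
    size_mb = math.ceil(blip_bytes / MB) + margin_mb
    if size_mb > max_pool_mb:
        raise ValueError(
            f"{size_mb} Mb exceeds the {max_pool_mb} Mb limit; "
            "consider sp_blips or a smaller plane-wave cutoff"
        )
    return size_mb


def suggest_from_file(path="bwfn.data.bin"):
    """Use the binary blip file size as an upper bound on the blip vector size."""
    return suggest_shmemsize_mb(os.path.getsize(path))


def runqmc_command(shmemsize_mb, shmem=4, ppn=4, walltime="2h30m"):
    """Build a runqmc command line using the options from the example above."""
    return (
        f"runqmc --shmem={shmem} --ppn={ppn} "
        f"--user.shmemsize={shmemsize_mb} --walltime={walltime}"
    )


if __name__ == "__main__":
    print(runqmc_command(suggest_shmemsize_mb(1200 * MB)))
```

Calling suggest_from_file() in the directory containing bwfn.data.bin gives a value slightly above the file size; since the blip vector is a little smaller than the file, this errs on the safe side.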

A technical explanation for this behaviour (from a sysadmin) might run as follows:

The reason why the pool allocation needs to be made up-front is that when the node boots it sets up the TLB to divide the memory space according to the mode (SMP, DUAL, or VN). If you look at Figure 5-3 in the Apps Red Book (see http://www.redbooks.ibm.com/abstracts/sg247287.html?Open), you’ll see how the (shared) kernel and shared memory pool are laid out first, then the remainder of the node’s memory is split in 2 for DUAL mode. Diagram 5-2 is, I believe, erroneous and should look more like the one for DUAL, except with 4 processes. With this scheme, it is impossible to grow the pool dynamically unless you fixed each process’s memory instead (probably more of a burden).

In theory, in VN mode you should be able to allocate a pool that is 2GB – kernel (~10MB IIRC) – RO program text – 4 * RW process_size. One limitation is that the TLB does not have many slots, and the pages referenced there can only have certain sizes: 1MB, 16MB, 256MB, or 1GB. (There is some evidence that 1GB pages may only be used in SMP mode.) So depending on the size of the pool you ask for, it may take more than 1 slot, and there is the possibility of running out of slots. We don’t know whether or not the pool size is padded in any way. So one thing you could try is to increase it slightly to 3*256MB.
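To make the two points above concrete, here is a small Python sketch. The first function simply evaluates the pool bound quoted above; the second counts how many pages would be needed if a pool were covered exactly using the largest allowed page sizes first. That greedy mapping is purely an assumption for illustration: as noted, it is not known how the system actually maps or pads the pool. All names and the sample figures other than the ~10MB kernel are hypothetical.

```python
# 1 GB pages are left out because they may only be usable in SMP mode.
PAGE_SIZES_MB = (256, 16, 1)


def max_pool_mb(node_mb, kernel_mb, ro_text_mb, rw_process_mb, processes):
    """Pool bound from the text: node - kernel - RO text - processes * RW size."""
    return node_mb - kernel_mb - ro_text_mb - processes * rw_process_mb


def greedy_page_count(pool_mb, page_sizes_mb=PAGE_SIZES_MB):
    """Pages of each size needed to cover pool_mb exactly, largest pages first.

    Assumed mapping for illustration only; the real TLB layout may differ.
    """
    remaining = pool_mb
    pages = {}
    for size in sorted(page_sizes_mb, reverse=True):
        pages[size], remaining = divmod(remaining, size)
    return pages


def total_slots(pool_mb):
    """Total number of TLB slots used under the greedy assumption."""
    return sum(greedy_page_count(pool_mb).values())


if __name__ == "__main__":
    for pool in (700, 768, 1250, 1280):
        print(pool, greedy_page_count(pool), total_slots(pool))
```

Under this assumption a pool that is an exact multiple of 256MB uses far fewer slots than a nearby arbitrary size, which is the motivation for rounding the request up to something like 3*256MB.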

How much memory can you shallocate in general? In practice I (MDT) find on Intrepid that any attempt to set BG_SHAREDMEMPOOLSIZE to be greater than 1800 Mb results in an “Insufficient memory to start application” error. Values up to that should be fine. Sometimes with larger values (from 1500 Mb to 1800 Mb) one sees the following error:

* Out of memory in file /bghome/bgbuild/V1R4M2_200_2010-100508P/ppc/bgp/comm/lib/dev/mpich2/src/mpi/romio/adio/ad_bgl/ad_bgl_wrcoll.c, line 498

but this doesn’t seem to affect the answer.

Note there is more information on Blue Gene memory at the end of question B9.

I prefer Cray.



Category: B: Using CASINO
