B: Using CASINO
To remove the Jastrow, you need to delete all parameters (except the cutoffs) from all Jastrow terms in the correlation.data file. CASINO will apply the cusp conditions on the Jastrow parameters (=> non-zero alpha_1 parameter) if any parameter is provided in the file, even if it’s zero.
Alternatively, you can set ‘use_jastrow : F’ in the input file, provided you do not want to optimize the Jastrow parameters in this run.
MDT Note added 9/10/2013: the default behaviour for empty Jastrows in energy minimization changed in CASINO v2.13.94, so that it now does the following:
“If one is studying a homogeneous system, or a 3D-periodic system, or the optimization method is energy minimisation with the CASINO standard Jastrow factor, a simple default for u will be chosen that satisfies the Kato cusp conditions; otherwise, only the Slater wave function will be used for the first configuration generation run when performing wave function optimization (varmin, madmin, or emin).”
‘casinohelp xxx’, or ‘casinohelp search xxx’, or ‘casinohelp all’. The ‘casinohelp’ utility is very useful, and is the most up-to-date keyword reference available, often more so than the CASINO manual.
Your compiler/OS is not correctly handling over-sized memory allocation; CASINO should exit on ‘Allocation problem’ rather than crash the machine. CASINO v2.2 solves this problem.
This is likely to be the case when using fluid orbitals in an excitonic regime, where the HF configurations appear not to be very good for optimizing Jastrow parameters. Try setting ‘opt_maxiter : 1’ and ‘opt_method : madmin’, which will run single-iteration NL2SOL optimizations. It may take a few VMC-madmin cycles to reach sensible energies, after which you can use the default ‘opt_maxiter : 20’ to get to the minimum more quickly.
Short answer: No, you must compare the VMC energies and variances and ignore what varmin tells you. Optimization fails if the energy of the second VMC run is significantly greater than that of the first VMC run.
Long answer: The ‘unreweighted variance’ is a good target function to minimize to lower the TRUE energy (that of the later VMC run), but the values it takes are of no physical significance. This often applies to the ‘reweighted variance’ as well. Notice that the initial value of the two target functions must be the true variance of the first VMC run (or the true variance of a subset of the configurations if you set vmc_nconfig_write < vmc_nstep, which is usually the case). The same applies to the ‘mean energy’ reported in varmin.
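To illustrate why both target functions should start out equal to the variance of the stored configurations, here is a minimal Python sketch. It assumes the usual textbook definition (a weighted variance of local energies, with all weights set to one in the unreweighted case); the function name and the sample energies are invented for illustration, and this is not CASINO's own code.

```python
def weighted_variance(local_energies, weights=None):
    """Weighted variance of local energies; weights=None means unreweighted."""
    if weights is None:
        weights = [1.0] * len(local_energies)
    norm = sum(weights)
    mean = sum(w * e for w, e in zip(weights, local_energies)) / norm
    return sum(w * (e - mean) ** 2 for w, e in zip(weights, local_energies)) / norm

# Hypothetical local energies (a.u.) of the stored configurations.
energies = [-1.02, -0.97, -1.10, -0.95, -1.01]

# Before any parameter has changed, the wave function ratio is 1 for every
# configuration, so every weight is 1 and both target functions reduce to
# the plain variance of the stored configurations.
initial_weights = [1.0] * len(energies)
unreweighted = weighted_variance(energies)
reweighted = weighted_variance(energies, initial_weights)
```

Once the parameters move away from their initial values, the weights differ from one and the two numbers are no longer directly comparable to a VMC variance.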
Your Linux computer is dynamically changing the processor frequency to save power. It should set the frequency to the maximum when you run CASINO, but by default it ignores processes with a nice value greater than zero (CASINO’s default is +15). To fix this, supply the ‘--user.nice=0’ option to runqmc.
You will see wild fluctuations in the block timings only if some other process triggers a change in the frequency, otherwise you will only notice slowness.
The above problem should not appear in modern Linux distributions (from 2008 onwards).
Other than this, block times are bound to oscillate, since during the course of the simulation particles are moved in/out of (orbital/Jastrow/backflow) cutoff regions, which increases/reduces the expense of calculating things for particular configurations. However, provided the blocks contain a sufficient number of moves, the block times should be equally long on average.
In this check, CASINO computes the numerical gradient and Laplacian of the wave function at a VMC-equilibrated configuration by finite differences with respect to the position of an electron. The results are compared against the analytic gradient and Laplacian, which are used in the computation of kinetic energies all over CASINO.
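The idea of the check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Gaussian toy "wave function", the step size `h`, and all function names are invented here, and CASINO's actual finite-difference scheme may differ.

```python
import math

A = 0.5  # exponent of the toy Gaussian wave function (assumption)

def psi(r):
    """Toy one-electron wave function exp(-A*|r|^2)."""
    return math.exp(-A * sum(x * x for x in r))

def analytic_gradient(r):
    return [-2.0 * A * x * psi(r) for x in r]

def analytic_laplacian(r):
    r2 = sum(x * x for x in r)
    return (4.0 * A * A * r2 - 6.0 * A) * psi(r)

def numerical_derivatives(f, r, h=1e-4):
    """Central-difference gradient and Laplacian of f at position r."""
    grad, lap = [], 0.0
    f0 = f(r)
    for i in range(len(r)):
        plus = list(r)
        plus[i] += h
        minus = list(r)
        minus[i] -= h
        fp, fm = f(plus), f(minus)
        grad.append((fp - fm) / (2.0 * h))
        lap += (fp - 2.0 * f0 + fm) / (h * h)
    return grad, lap

def relative_difference(a, b):
    return abs(a - b) / max(abs(a), abs(b), 1e-300)

position = [0.3, -0.7, 1.1]  # arbitrary electron position
num_grad, num_lap = numerical_derivatives(psi, position)
grad_diff = max(relative_difference(n, a)
                for n, a in zip(num_grad, analytic_gradient(position)))
lap_diff = relative_difference(num_lap, analytic_laplacian(position))
```

A mistake in either analytic expression would show up as a large `grad_diff` or `lap_diff`, which is the kind of inconsistency the check is meant to catch.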
CASINO reports the degree of accuracy to which the numerical and analytical derivatives agree using four different levels: optimal, good, poor and bad. These correspond to relative differences of <1E-7, <1E-5, <1E-3, and >1E-3, respectively. If the accuracy is ‘bad’, the test reports a failure, and the analytical and numerical gradient and Laplacian are printed for debugging purposes.
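The four levels amount to a simple threshold lookup, sketched below in Python. The function name is hypothetical, and treating a relative difference of exactly 1E-3 as ‘bad’ is an assumption, since the text does not say which side of the boundary it falls on.

```python
def accuracy_level(relative_difference):
    """Map a relative difference onto the four levels described above."""
    if relative_difference < 1e-7:
        return "optimal"
    if relative_difference < 1e-5:
        return "good"
    if relative_difference < 1e-3:
        return "poor"
    return "bad"  # the test reports a failure at this level

levels = [accuracy_level(x) for x in (5e-9, 3e-6, 2e-4, 0.5)]
```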
This check should detect any inconsistencies between the coded expressions for the values, gradients and Laplacians of orbitals, Jastrow terms and backflow terms, as well as inconsistencies in the process of calculating kinetic energies and wave-function ratios. Therefore it’s reassuring to see that a given wave function passes the kinetic energy test.
However there are cases where the results from the test should not be taken too seriously:
– The thresholds defining the optimal/good/poor/bad levels are arbitrary. For small systems they seem to be a good partition (one usually gets good or optimal), but for large systems the procedure is bound to become more numerically unstable and the thresholds may not be appropriate. Thus a ‘poor’ gradient or Laplacian need not be signalling an error.
– For ill-behaved wave functions (e.g. after an unsuccessful optimization) it is not uncommon for the check to report a failure (‘bad’ level). This is not a bug in the code; you’ll just need to try harder at optimizing the wave function.
BG_SHAREDMEMPOOLSIZE (Blue Gene/P) or BG_SHAREDMEMSIZE (Blue Gene/Q) to be the size of the shared memory partition in Mb. How do I do this, why do I need to do it, and what value should I set it to?
This variable may be set by using the --user.shmemsize option to the runqmc script (exactly what environment variable this defines is set in the appropriate .arch file for the machine in question).
Note that on Blue Gene/Ps the default for this is zero (i.e. the user always needs to set it explicitly), whereas on Blue Gene/Qs the default is something like 32-64 Mb and depends on the number of cores per node requested; thus for very small jobs on Blue Gene/Qs you don’t need to set it explicitly.
An example of such a machine is Intrepid (a Blue Gene/P at Argonne National Lab) and you can look in the following file to see exactly what is done with the value of
For a Blue Gene/Q you can look in:
Let us use the Intrepid Blue Gene/P as an example.
Nodes on Intrepid have 4 cores and 2Gb of available memory, and the machine can run in 3 different “modes”:
SMP - 1 process per node, which can use all 2Gb of memory.
DUAL - 2 processes per node, each of which can use 1Gb of memory.
VN - 4 processes per node, each of which can use 512Mb of memory.
Using shared memory in SMP mode doesn’t seem to work (any job run in this way is ‘Killed with signal 11’) presumably due to an IBM bug. (One might consider using OpenMP to split the configs over the 4 cores you would run in SMP mode and have multiple
So – taking VN mode as an example – we would like to be able to allocate (say) 1.5Gb of blip coefficients on the node, and for all four cores to have access to this single copy of the data – which is the point of shared memory runs. Such a calculation would be impossible if all 4 cores had to have a separate copy of the data.
Unfortunately, on Intrepid, the user needs to know how much of the 2Gb will be taken up by shared memory allocations. This is easy enough, since the only vector which is ‘shallocked’ is the vector of blip coefficients. This will be a little bit smaller than the size of the binary blip file, which you can work out with e.g. ‘du -sh bwfn.data.bin’.
On a machine with limited memory like this one, it will pay to use the ‘sp_blips’ option in input, and to use a smaller plane-wave cutoff if you can get away with it.
Thus if your blip vector is 1.2Gb in size, then run the code with something like :
runqmc --shmem=4 --ppn=4 --user.shmemsize=1250 --walltime=2h30m
--shmem indicates we want to share memory over 4 cores, --ppn (‘processes per node’) indicates we want to use VN mode, and --walltime is the maximum job time. Note the value for --user.shmemsize is in Mb.
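As a rough aid for picking the --user.shmemsize value, the following Python sketch turns a size in bytes into a value in Mb. The helper names and the choice of rounding up to the next multiple of 50 Mb are assumptions made here for illustration; CASINO does not prescribe any particular margin. Since the blip vector is a little smaller than the binary file, using the file size gives a safe upper bound.

```python
import math
import os

def suggest_shmemsize_mb(size_bytes, round_to_mb=50):
    """Round a size in bytes up to the next multiple of round_to_mb (in Mb)."""
    size_mb = size_bytes / (1024 * 1024)
    return int(math.ceil(size_mb / round_to_mb) * round_to_mb)

def suggest_from_file(path="bwfn.data.bin"):
    """Use the size of the binary blip file as an upper bound on the blip vector."""
    return suggest_shmemsize_mb(os.path.getsize(path))

# A blip vector of 1.2 Gb, as in the example above.
example_mb = suggest_shmemsize_mb(int(1.2 * 1024 ** 3))
```

With these assumptions the 1.2 Gb example yields 1250, the value used in the runqmc command above.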
A technical explanation for this behaviour (from a sysadmin) might run as follows:
“The reason why the pool allocation needs to be up-front is that when the node boots it sets up the TLB to divide the memory space corresponding to the mode (SMP, DUAL, or VN). If you look at Figure 5-3 in the Apps Red Book (see http://www.redbooks.ibm.com/abstracts/sg247287.html?Open), you’ll see how the (shared) kernel and shared memory pool are laid out first, then the remainder of the node’s memory is split in 2 for DUAL mode. The diagram 5-2 I believe is erroneous and should look more like the one for DUAL except with 4 processes. With this scheme, it is impossible to grow the pool dynamically unless you fixed each process’s memory instead (probably more of a burden).
In theory, in VN mode you should be able to allocate a pool that is 2GB – kernel (~10MB IIRC) – RO program text – 4 * RW process_size. There is a limitation that the TLB does not have that many slots and the pages referenced there can only have certain sizes: 1MB, 16MB, 256MB, or 1GB. (There is some evidence 1GB may only be used in SMP mode.) So depending on the size of the pool you ask for, it may take more than 1 slot, and there is the possibility of running out of slots. We don’t know whether or not the pool size is padded in any way. So one thing you could try is to increase it slightly to 3*256MB.”
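The arithmetic in that explanation can be sketched as follows in Python. Only the 2GB node size, the ~10MB kernel figure and the page sizes come from the quote; the program-text and per-process sizes are invented for illustration, and the greedy split into pages is purely an assumption about how slots might be consumed (the quote itself notes that the real padding behaviour is unknown).

```python
def max_pool_mb(node_mb, kernel_mb, text_mb, n_procs, rw_process_mb):
    """Theoretical pool size: node memory minus kernel, text and per-process memory."""
    return node_mb - kernel_mb - text_mb - n_procs * rw_process_mb

def tlb_pages(pool_mb, page_sizes_mb=(256, 16, 1)):
    """Greedy split of a pool into pages; returns {page_size_mb: count}.

    The 1GB page size is left out because it may only be usable in SMP mode.
    """
    pages, remaining = {}, pool_mb
    for size in page_sizes_mb:
        count, remaining = divmod(remaining, size)
        if count:
            pages[size] = count
    return pages

# VN mode: text_mb and rw_process_mb are hypothetical numbers.
upper_bound = max_pool_mb(node_mb=2048, kernel_mb=10, text_mb=20,
                          n_procs=4, rw_process_mb=100)

# Under the greedy assumption, a round 3*256MB pool needs far fewer slots
# than a slightly smaller but awkward size.
slots_700 = sum(tlb_pages(700).values())
slots_768 = sum(tlb_pages(768).values())
```

Under these assumptions a 768 Mb pool fits in 3 pages while a 700 Mb pool needs many more, which is the reasoning behind the suggestion to round up to a multiple of 256MB.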
How much memory can you shallocate in general? In practice I (MDT) find on Intrepid that any attempt to set BG_SHAREDMEMPOOLSIZE to be greater than 1800 Mb results in an “Insufficient memory to start application” error. Values up to that should be fine. Sometimes with larger values (from 1500 Mb to 1800 Mb) one sees the following error:
“* Out of memory in file /bghome/bgbuild/V1R4M2_200_2010-100508P/ppc/bgp/comm/lib/dev/mpich2/src/mpi/romio/adio/ad_bgl/ad_bgl_wrcoll.c, line 498”
but this doesn’t seem to affect the answer.
Note there is more information on Blue Gene memory at the end of question B9.
I prefer Cray.
Yes. If you do a direct comparison between the speeds of say a Cray XE6 and a Blue Gene/Q on the same number of cores you will find that the Blue Gene is significantly slower (4 times slower in the test that I – MDT – did). What can we do about this?
With a bit of digging you then find that BG/Qs are supposed to be run with multiple threads per core, since you simply don’t get full instruction throughput with 1 thread per core. It’s not that more threads help, it’s that fewer than 2 threads per core is like running with only one leg. You just can’t compare BG/Q to anything unless you’re running on >2 hardware threads per core. Most applications max out at 3 or 4 per core (and CASINO maxes out at 4; see below). BG/Q is a completely in-order, single-issue-per-hardware-thread core, whereas x86 cores are usually multi-issue.
Here are some timings, for a short VMC run on a large system (TiO2 – 648 electrons) on 512 cores:
Hector - Cray XE6 : 55.1 sec
Vesta – Blue Gene/Q : 222.19 sec
Now try ‘overloading’ each core on the BG/Q
Can take up to 4 threads/core (in powers of 2)
--ppn=48 "48 is not a valid ranks per node value"
--ppn=64 138.44 sec
So by using 512 cores to run 2048 MPI processes, we improve to ~2.5 times slower than Hector (512 cores running 512 processes) rather than 4 times slower as before. Note that the number of processes per core has to be a power of 2 (i.e. not 3) – unless you do special tricks.
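The ratios quoted here follow directly from the timings; a short Python check is given below. The variable names are illustrative, and the figure of 16 cores per node is taken from the description of the Blue Gene/Q nodes elsewhere in this answer.

```python
hector_sec = 55.1  # Cray XE6, 512 cores, 512 processes

# Vesta (Blue Gene/Q) timings on 512 cores, keyed by processes per node.
vesta_sec = {16: 222.19, 64: 138.44}

# How many times slower than Hector each Blue Gene/Q run is.
slowdown = {ppn: t / hector_sec for ppn, t in vesta_sec.items()}

cores = 512
cores_per_node = 16
nodes = cores // cores_per_node
mpi_processes_at_ppn64 = nodes * 64
```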
Now try OpenMP on the cores instead of extra MPI processes (OpenmpShm mode):
--ppn=16 --tpp=1 222.19 sec
--ppn=16 --tpp=2 197.18 sec
--ppn=16 --tpp=4 196.42 sec
Thus for this case we see (a) not much point going beyond --tpp=2, and (b) OpenMP is not as good as just using extra MPI processes. In this case, using the ability to run multiple processes per core for a fixed number of cores, and thus doing fewer moves per process (parallelizing over moves), is much faster than attempting to calculate the wave function faster for a fixed number of moves per process (parallelizing over electrons).
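To put numbers on that comparison, the Python sketch below computes the speedup of each option relative to the one-process-per-core Blue Gene/Q run, using only the timings listed above. The labels are just dictionary keys chosen for this example.

```python
baseline_sec = 222.19  # --ppn=16 --tpp=1, i.e. one process per core

timings_sec = {
    "MPI, --ppn=64": 138.44,
    "OpenMP, --ppn=16 --tpp=2": 197.18,
    "OpenMP, --ppn=16 --tpp=4": 196.42,
}

# Speedup relative to the baseline for each way of using the extra hardware threads.
speedup = {label: baseline_sec / t for label, t in timings_sec.items()}
best = max(speedup, key=speedup.get)
```

Extra MPI processes give a speedup of roughly 1.6, whereas OpenMP threads give only about 1.13 for this particular system.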
Thus – if memory allows it – the best way to run CASINO on a Blue Gene/Q such as Mira at Argonne National Lab seems to be:
runqmc -n xxx --ppn=64 -s --user.shmemsize=yyy
i.e. run a shared memory calc with a shared memory block of yyy Mb on xxx 16-core nodes with 64 MPI processes per node (4 per core).
Note however, that all this comes with a big caveat regarding MEMORY.
Consider the big Blue Gene/Q at Argonne – Mira, which has 16Gb available per 16-core node.
Forgetting about shared memory for the moment, it is common to think of all this memory as being generally available to all cores on the node, such that e.g. the master process could allocate 15Gb and all the slaves could allocate a total of 0.5Gb and everything would fit in the node.
In fact, since there is no swap space available, the 16 Gb per node is divided among the MPI processes as evenly as possible (which apparently might not be that even in some cases due to how the TLB works). So when you’re using 64 MPI processes per node, which we supposedly recommend above, each individual process has a maximum possible memory usage of 16Gb / 64 ≈ 250 Mb of memory in principle.
In practice, this will be reduced by (a) any shared memory partition (the default for this being c. 32Mb over the whole node, which you may increase to whatever value you like using the --user.shmemsize command line argument to runqmc), (b) memory used by MPI, and (c) other stuff.
According to some data I found, in practice for 64 MPI processes per node, the available memory per process is between 180 and 240 Mb (unevenly between the processes).
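The nominal figure can be reproduced with the small Python sketch below. The function name and the `mb_per_gb` parameter are assumptions made for illustration: taking 1 Gb = 1000 Mb reproduces the 250 Mb quoted above, while 1 Gb = 1024 Mb gives 256 Mb. Memory used by MPI and "other stuff" is not modelled, which is why the observed 180 to 240 Mb is lower.

```python
def nominal_mb_per_process(ppn, node_gb=16, shared_pool_mb=32, mb_per_gb=1000):
    """Even split of the node memory left after the shared pool is removed."""
    return (node_gb * mb_per_gb - shared_pool_mb) / ppn

# Nominal memory per process for several choices of processes per node.
per_process = {ppn: nominal_mb_per_process(ppn) for ppn in (64, 32, 16, 8)}
```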
You may therefore run out of memory for large systems on large numbers of processes because:
- 180 Mb is not very much.
- The memory required by MPI increases as a function of the partition size.
- There are various large arrays in CASINO which either cannot be put in shared memory (e.g. because they refer to configs which live only on the MPI process in question, such as the array containing the interparticle distances for a given set of configs) or have not yet been coded to be put in shared memory (e.g. Fourier coefficients for orbitals expanded in plane waves), and these just won’t fit in 180 Mb.
What to do about it
(1) Don’t use --ppn=64 for the number of processes per node. Every time you halve it, the available memory per process will approximately double. The disadvantage then is that the effective speed of a single core will go down (for --ppn=16, i.e. 1 process per core, it will be something like 4 times slower than a Cray core running a single process). For really big systems you can even do e.g. --ppn=8 etc. Note that the available processes that you are not using can be turned into OpenMP threads, which parallelize over the electrons instead of the configs (you would need to compile the code in OpenmpShm mode first, then use runqmc --ppn=2 --tpp=2).
(2) Setting an environment variable called BG_MAPCOMMONHEAP=1 can even out the memory available per process, with the cost that “memory protection between processes is not as stringent” (I don’t know what that means practically). CASINO will do this automatically if the arch file has been set up to do so (e.g. on Mira/Cetus/Vesta).
(3) Make sure you use shared memory if the system supports it, and tune the value of --user.shmemsize to be as small as possible (CASINO will print out the amount of shared memory required at the end of setup within the scope of ‘testrun : T’, by default using the number of processors used in the test run. It will estimate the amount of shared memory required on N processors if you set the input keyword shm_size_nproc to be N.)
(4) Badger the CASINO developers to put more things into shared memory where possible.
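Suggestion (1) can be turned into a rough rule of thumb, sketched below in Python. It assumes the pessimistic 180 Mb per process at --ppn=64 mentioned above and the "halve the processes, double the memory" scaling; the function name and the candidate list are hypothetical, and real usage should be checked with a test run.

```python
def largest_ppn_that_fits(required_mb, mb_at_ppn64=180.0,
                          candidates=(64, 32, 16, 8, 4, 2, 1)):
    """Return the largest ppn whose estimated per-process memory is sufficient.

    Assumes the memory per process scales as 64/ppn times the ppn=64 value.
    Returns None if even one process per node is not enough.
    """
    for ppn in candidates:
        available_mb = mb_at_ppn64 * 64 / ppn
        if available_mb >= required_mb:
            return ppn
    return None

# A run needing about 500 Mb per process, under the assumptions above.
suggested_ppn = largest_ppn_that_fits(500)
```

Preferring the largest workable value reflects the timings above, where more MPI processes per core gave the best throughput.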
That was a long answer to a simple question.
The authors of the various interfaces between CASINO and plane-wave codes such as PWSCF/CASTEP/ABINIT make different assumptions about which unoccupied orbitals should be included in the file.
For example, with CASTEP/ABINIT you always have to produce a formatted pwfn.data file as a first step, then this must be transformed into a formatted blip bwfn.data file using the blip utility, then when this is read in by CASINO, this will be transformed into a much smaller bwfn.data.bin file (if keyword write_binary_blips is T, which is the default).
With PWSCF, you may produce any of these files (bwfn.data or old format bwfn.data.b1) directly with the DFT code without passing through a sequence of intermediaries.
In the CASTEP case, the unstated assumption is that the formatted file is kept as a reference (either pwfn.data or the larger bwfn.data depending on whether disk space is a problem) and that this contains all the orbitals written out by the DFT code. When converting to binary, only the orbitals occupied in the requested state are written out. If later, you want to do a different excited state, then the bwfn.data.bin file should be regenerated for that state.
In the PWSCF case, because the formatted file need never exist, then all orbitals written by the DFT code are included in the binary blip file (old format bwfn.data.b1) including all unoccupied ones. Thus these files can be considerably larger than ones produced through the ABINIT/CASTEP/blip utility/CASINO route.
In general one should control the number of unoccupied orbitals in the blip file through some parameter in the DFT code itself. For example, you might try the following:
- CASTEP : Increase ‘nextra_bands’ to something positive.
Note that CASTEP has a help system just like CASINO’s: ‘castep -help nextra_bands’.
- PWSCF : Play with the ‘nbnd’ keyword.
- ABINIT : Play with the ‘nband’ keyword.
See http://www.abinit.org/documentation/helpfiles/for-v6.8/input_variables/varbas.html#nband
Note that it is planned to tidy this system considerably in a future release of CASINO (including making the blip utility write in binary directly from the pwfn.data, without the formatted bwfn.data ever having existed).