A: Installing CASINO
You’ll need a Fortran 90 compiler, a UNIX environment with the bash shell, the [t]csh shell, and, if you have a parallel machine, an MPI library. Optionally, if you plan to use the provided plotting utilities, you should install xmgrace and gnuplot.
These dependencies can be installed automatically by the ‘install’ script, so go to A2 below. Keep reading if you would rather do this by hand.
Here are full setup commands for popular Linux distributions, which you can copy and paste into your terminal:
- Ubuntu 10.04+, Linux Mint 9+:
sudo apt-get install make gcc gfortran g++ tcsh openmpi-bin libopenmpi-dev grace gnuplot
- Debian Lenny+ (5.0+):
su -c "apt-get install make gcc gfortran g++ tcsh openmpi-bin libopenmpi-dev grace gnuplot"
- Fedora 9+, CentOS:
su -c "yum install make gcc gcc-gfortran gcc-c++ tcsh openmpi openmpi-devel grace gnuplot"
- openSUSE 11.3+:
sudo zypper install make gcc gcc-fortran gcc-c++ tcsh openmpi openmpi-devel xmgrace gnuplot
- Mandriva 2010.2+:
su -c "gurpmi make gcc gcc-gfortran gcc-c++ tcsh openmpi grace gnuplot"
- Gentoo:
su -c "emerge make gcc tcsh openmpi grace sci-visualization/gnuplot"
- Arch Linux:
su -c "pacman -S make gcc gcc-fortran tcsh openmpi grace gnuplot"
- Slackware:
This distribution has no official package manager with automated dependency resolution. Slackware users, have fun. Both of you.
- For openSUSE (up to at least version 11.3), after installing the packages above and before trying to compile the code, you may need to log out and back in for changes to take effect.
- As a bleeding-edge rolling-release distro, Arch Linux as of May 2011 is the first to hit a compilation problem with the default gfortran 4.6.0. This will get fixed eventually on gfortran’s side, or worked around on ours.
- Installing OpenMPI under Ubuntu 10.10 and 11.04 will pull in a package called blcr-dkms as a ‘recommends’, but it does not compile against the Linux kernel versions distributed in either of these releases. Ignore the error message after installing the above packages, and run
sudo apt-get remove --purge blcr-dkms
to remove the problematic package. It is not needed for normal operation. Ubuntu bug: https://bugs.launchpad.net/bugs/700036
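Once the packages are installed, a quick shell loop can confirm that the build tools are on your PATH. This is just a convenience sketch; the tool list is an example, and mpif90 is only needed for parallel builds:

```shell
# Check that the compilers and tools CASINO needs are installed.
# Adjust the tool list to your distribution and build type.
missing=0
for tool in make gcc g++ gfortran mpif90; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: $(command -v "$tool")"
  else
    echo "$tool: MISSING"
    missing=$((missing + 1))
  fi
done
echo "$missing tool(s) missing"
```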
After this, go to A2 below. Your CASINO_ARCH is ‘linuxpc-gcc’ for the non-parallel version and ‘linuxpc-gcc-parallel’ for the parallel one. Fedora 12 and later versions on multi-core systems should use ‘linuxpc-gcc-parallel’.
Comprehensive instructions are given in the CASINO manual; a brief summary follows.
Change into the CASINO directory and type ‘./install’, then follow the prompts. This script helps you find the correct CASINO_ARCH for your machine, or create a new one if you need to.
The script allows you to configure several CASINO_ARCHs for the same installation, e.g. to keep binaries from different compilers, or to share the installation between machines with different architectures.
The script creates a ~/.bashrc.casino file which is loaded on login and defines a function called casinoarch, so that at any point you can type ‘casinoarch’ at the prompt to switch to the configuration of your choice.
You can run the install script as many times as you like, to do things like adding/removing CASINO_ARCHs or reordering them by preference.
Should the abilities of the script fail to satisfy your needs, you can always define a CASINO_ARCH by hand by creating the file CASINO/arch/data/<CASINO_ARCH>.arch. The syntax of .arch files is explained in CASINO/arch/README and in Appendix 5 of the manual.
In our tests the Intel Fortran compiler ifort often generates the fastest executables for x86-based processors (NB, the PathScale Fortran compiler is tied in speed with ifort on AMD processors). The GNU gfortran compiler is free, widely available and also often very fast. Basically, if you have access to multiple compilers on any given machine, you should just try each one and see what happens.
Here is a sample set of timings on 16 processors of a Cray XK6 system (Jaguar) in December 2011 (MDT). Again, Intel ifort wins, with GNU gfortran a close second. The test used an input file with the following keywords changed:
vmc_nstep : 800 #*! Number of steps (Integer)
vmc_nconfig_write : 800 #*! Number of configs to write (Integer)
dmc_equil_nstep : 20 #*! Number of steps (Integer)
dmc_stats_nstep : 20 #*! Number of steps (Integer)
dmc_target_weight : 800.d0 #*! Total target weight in DMC (Real)
            Total CASINO CPU time   DMC energy
Ifort     : 54.6394 seconds         -63.253019810592 +/- 0.007612017376
Gnu       : 62.6400 seconds         -63.253019816018 +/- 0.007612049672
Pathscale : 67.8962 seconds         -63.259034999613 +/- 0.011000738147
PGF       : 72.9000 seconds         -63.245897141026 +/- 0.009693089487
Cray      : 84.0300 seconds         -63.192701255463 +/- 0.006393938977
NB: The PathScale compiler is being deprecated on Crays and is no longer supported on Jaguar (as of June 2012).
Here is another set of runs for H on graphene (MDT 1.2013) on a Cray XK7 (Titan).
VMC:
Gnu   :  79.65 seconds  -282.890716861446 +/- 0.160222570613
PGF   :  83.74 seconds  -282.895599950580 +/- 0.121131726135
Ifort :  84.48 seconds  -282.970112923443 +/- 0.128262416613
Cray  :  87.28 seconds  -282.890716855184 +/- 0.160222562116
DMC:
Gnu   : 397.20 seconds  -284.256578870330 +/- 0.048408651743
Cray  : 456.63 seconds  -284.256578740638 +/- 0.048408642717
Ifort : 477.75 seconds  -284.252588591933 +/- 0.117686071014
PGF   : 490.10 seconds  -283.915347739145 +/- 0.077180929508
Moral: use the GNU compiler (gfortran).
Note: GNU and Cray answers essentially agree with each other, but Ifort (VMC) and PGF (DMC) are giving significantly different answers. This needs to be investigated.
Use the install script to create a new CASINO_ARCH. If you choose to edit the compiler options when prompted, you will be offered the choice of enabling external BLAS and LAPACK libraries (as opposed to compiling the BLAS and LAPACK source code included in the CASINO distribution). You will need to provide the linker flags required to link to these libraries (this depends on your setup; ask your system administrator or read your library’s documentation). For example, the Intel MKL is linked (at least in the Cambridge TCM group) by setting:
BLAS: -lmkl -lguide -lpthread
LAPACK: -lmkl_lapack -lmkl -lguide -lpthread
NB: CASINO does not make particularly intensive use of linear algebra routines, so the improvement from doing this would normally be expected to be small. However, in favourable cases – and cases become more favourable as the system size increases – improvements of up to 20% have been observed. Feel free to experiment, but beware that doing this can make things slower as well.
Recent Linux distributions offer ready-to-use MPI-enabled compilers. Installing the relevant packages should provide you with an ‘mpif90’ executable, an ‘mpirun’ launcher, libraries, etc., and you will not need to go through the instructions below (see A1 above instead).
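A quick way to confirm that your distribution's packages actually provided the wrappers (a generic check, nothing CASINO-specific):

```shell
# Look for the MPI compiler wrapper and launcher mentioned above.
if command -v mpif90 >/dev/null 2>&1 && command -v mpirun >/dev/null 2>&1; then
  status="MPI wrappers found: $(command -v mpif90)"
else
  status="mpif90/mpirun not found; install your distribution's OpenMPI packages (see A1)"
fi
echo "$status"
```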
If you want a different compiler, or if your distribution does not provide said package(s), follow these instructions:
- Install the Fortran compiler(s) you want to use.
- Download OpenMPI from www.open-mpi.org. (NB, we use OpenMPI in this example, but you can choose any other MPI implementation so long as it fully supports MPI-2; LAM/MPI cannot be used.)
- Extract the archive and change into the newly created directory.
- Choose a directory for the installation. We will refer to this directory as <install-dir> (e.g. /opt/openmpi).
- For each compiler you want OpenMPI to work with, let:
<fc>   = name of Fortran compiler binary,
<f77>  = name of Fortran 77 compiler (may be the same as <fc>),
<cc>   = name of C compiler,
<c++>  = name of C++ compiler, and
<comp> = name under which we will refer to this setup.
(e.g., <fc>=<f77>=<comp>=ifort, <cc>=icc, <c++>=icpc).
Configure and build OpenMPI with:
./configure CC=<cc> CXX=<c++> FC=<fc> F77=<f77> --prefix=<install-dir>/<comp>
make
sudo make install
Repeat this step with the next compiler.
- Add <install-dir>/<comp>/bin to your PATH, and <install-dir>/<comp>/lib to your LD_LIBRARY_PATH. Use the compiler as mpif90 / mpicc.
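For example, with <install-dir>=/opt/openmpi and <comp>=gfortran (both purely illustrative values), the exports would look like:

```shell
# Make a particular OpenMPI build the active one for this shell session.
MPI_ROOT=/opt/openmpi/gfortran   # assumption: <install-dir>/<comp>
export PATH="$MPI_ROOT/bin:$PATH"
export LD_LIBRARY_PATH="$MPI_ROOT/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

Put these lines in your ~/.bashrc to make the choice permanent.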
- Refer to A2 above to set up CASINO for this compiler.
- To switch compilers, change your CASINO_ARCH. Simple bash functions (which can be put in your .bashrc file) can be written to simplify this task.
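One way to write such a helper, sketched here with assumed values: ‘linuxpc-ifort-parallel’ as the Intel arch name and the /opt/openmpi paths are illustrations of your own setup, not values CASINO defines:

```shell
# Switch CASINO_ARCH and the matching MPI build in one go.
use_casino_compiler() {
  case "$1" in
    gnu)   export CASINO_ARCH=linuxpc-gcc-parallel
           export PATH="/opt/openmpi/gfortran/bin:$PATH" ;;
    intel) export CASINO_ARCH=linuxpc-ifort-parallel   # assumed arch name
           export PATH="/opt/openmpi/ifort/bin:$PATH" ;;
    *)     echo "usage: use_casino_compiler gnu|intel" >&2
           return 1 ;;
  esac
  echo "CASINO_ARCH is now $CASINO_ARCH"
}
use_casino_compiler gnu
```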
Short answer: no.
Long answer: you shouldn’t unless you know what you’re doing. CASINO is designed to be installed under the user’s home directory. If you want to do a system-wide installation we can provide no help, as we haven’t ever done this. A future version of CASINO might support this.
Contact your system administrator to check you have done the right setup. If you have, ask a question on the discussion forum.
a) In the linking stage, I get something like:
1586-346 (S) An error occurred during code generation.
The code generation return code was -1.
ld: 0706-005 Cannot find or open file: /tmp/ipajR8xEe
ld:open(): No such file or directory
1586-347 (S) An error occurred during linking of the object
produced by the IPA Link step.
The link return code was 255.
Your machine probably has a small CPU-time limit for interactive commands set by default. Type ‘limit cputime 10:00’ (or a larger value) and type ‘make’ again.
b) Why does the utilities compilation die on an IBM SP3 sometimes?
If you have a ‘TMPDIR’ environment variable set, unset it. The compiler uses it and may be confused by it already having a value.
Use something like ‘jlimit processes 24’ to increase the maximum number of concurrent processes.
If you can get root privileges and xmgrace is available for your distribution in .deb form, go for that.
The following are instructions for installing xmgrace as a non-root user when Motif-compatible libraries are not available:
- Get the sources of grace and LessTif (http://lesstif.sourceforge.net/), and untar them in a temp directory.
- Go into the lesstif directory and install it by issuing:
$ ./configure --prefix $HOME/misc/lesstif
$ make all install
- Go into the grace directory and install it by issuing:
$ ./configure --prefix $HOME/misc --with-extra-incpath="$HOME/misc/lesstif/include:/usr/X11R6/include" --with-extra-ldpath="$HOME/misc/lesstif/lib:/usr/X11R6/lib:/usr/X11R6/lib64" --with-motif-library="-lXm -lXt -lXft -lXrender -lXext -lX11"
$ make ; make install
- Then link $HOME/misc/grace/bin/xmgrace into your path (e.g., into $HOME/bin) for convenience. You should then be able to start ‘xmgrace’ at the command prompt.
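Assuming the --prefix used above, the linking step might look like this:

```shell
# Put the newly built xmgrace on the PATH via a symlink in ~/bin.
mkdir -p "$HOME/bin"
ln -sf "$HOME/misc/grace/bin/xmgrace" "$HOME/bin/xmgrace"
```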
CASINO was written for UNIX/Linux systems and is not supported directly under Windows. Nevertheless, it runs well enough in Windows using Cygwin, an emulation layer that provides a Linux-like compilation and execution environment for applications. Under Cygwin, CASINO works as if it’s running on a Linux workstation using the GCC compiler suite. Although Cygwin is considered an emulation layer, this only concerns system calls. The calculation speed of CASINO is native, and its performance is quite acceptable.
These instructions do not cover the installation and general use of Cygwin; please refer to Cygwin’s documentation for that. The instructions were tested under Windows 7 Ultimate 64-bit and Windows 8.1 Enterprise 64-bit, using Cygwin 1.7.28, with all Cygwin packages up-to-date as of March 2014.
Installation and compilation
1. Download and install Cygwin (www.cygwin.com), either 32-bit or 64-bit version.
2. Install at least the following packages and all their dependencies using Cygwin’s setup program (which is also its package manager) in addition to those packages that are installed by default:
3. If you want to run in parallel (which, believe me, you do if your machine has more than 1 CPU core) then you should also install the following to get MPI support:
It’s not obvious if all these are necessary, but that’s what we installed and everything subsequently worked. Essentially, MPI should just work on Cygwin straight out of the box.
4. Set up your Cygwin environment so that you are using the bash shell (should be the default anyway).
5. Follow the instructions in the CASINO manual for downloading, unpacking, installing and configuring CASINO using its install script. The option [p] of the install script allows you to specify the value of CASINO_ARCH manually. The correct value is “cygwin-gcc” under Cygwin, or “cygwin-gcc-parallel” if, as is likely, you want to run CASINO in parallel. The auto-detect option of the install script should also be able to figure out the existence of these two CASINO_ARCHs all by itself (though because of a Cygwin bug preventing asynchronous operations, the auto-detection will run much more slowly than usual).
6. Use the install script’s compilation option to compile CASINO.
7. Run CASINO using the runqmc script from the Cygwin command line. Use of e.g. the -p flag to runqmc will allow you to run on more than one processor.
Technical Note about targeting Cygwin vs native environment under Windows
There are two ways to use the Cygwin environment to compile an application originally written for Linux and run it under Windows.
The first (recommended) is to compile it targeting the Cygwin environment. In this case, the application will require Cygwin to run, but will see an almost complete Linux-like environment, so usually little or no special care needs to be taken. The application does not need to be aware that it’s running on Windows. Specifying “cygwin-gcc” or “cygwin-gcc-parallel” as the value for CASINO_ARCH is intended to do this.
The second way is to compile the application to natively target Windows. You still need Cygwin for compilation (as it includes the necessary tools), but in principle the application then does not require Cygwin to run; it will run as a native Windows application. From our point of view this obviously involves extra headache because of issues such as the difference in path separator (/ vs \) and the fact that Windows doesn’t know what a symlink is. Specifying the CASINO make tag “NATIVE_WINDOWS = yes” in your arch file indicates that Unix filenames are to be converted to Windows filenames, and symbolic links are to be substituted with their targets. Specifying the “windowspc-g95” or “windowspc-ifort” value for CASINO_ARCH attempts to do this. However, note that doing so is not recommended, as it adds a needless layer of additional complication. Running outside of the Cygwin layer will also prevent the use of all of CASINO’s useful utilities, in particular the runqmc script. The windowspc-xx arches should therefore be considered obsolete and unsupported.
We at CASINO HQ understand that, as a Windows user, you are only able to contemplate performing actions for which there is a corresponding small picture of it on your computer that you are able to click. You will probably also find it helpful if an animation of a dog or something pops up after you type go, in order to say something like “Hi there. You appear to be running a quantum Monte Carlo calculation. Would you like me to help you with that..?“. We’re sorry, but CASINO doesn’t do either of these things, and we appreciate that your granny probably isn’t going to like that. If you wish to continue in your old Windows traditions, nonetheless, you might contemplate sending us a very large amount of money on a regular basis so that we can buy a yacht.
This warning message is generated at link time when the ‘--whole-archive’ option and static linking of the SuSE Linux-provided libpthread.a library is used. The warning will not occur if the program is linked dynamically, and since that is what the general Linux community does, it is unlikely that SuSE will address this anytime soon. For all compilers except PathScale, Cray was able to remove these options and replace them with others.
Cray has added the following to the Cray Application Developer’s Environment User’s Guide (S-2396)
4.7.2 Known Warnings
Code compiled using the options --whole-archive,-lpthread get the
following warning message issued by libpthread.a(sem_open.o): warning:
the use of 'mktemp' is dangerous, better use 'mkstemp'. The
--whole-archive option is necessary to avoid a runtime segmentation
fault when using OpenMP libraries. This warning can be safely ignored.
In short, there is nothing you/we/Cray can do about this until SuSE addresses it.
File = /home/billy/CASINO/src/nl2sol.f90,
Line = 58, Column = 12
This numeric constant is out of range.
This appears to be a harmless pathscale/glibc bug – ignore.
B: Using CASINO
To remove the Jastrow, you need to delete all parameters (except the cutoffs) from all Jastrow terms in the correlation.data file. CASINO will apply the cusp conditions on the Jastrow parameters (=> non-zero alpha_1 parameter) if any parameter is provided in the file, even if it’s zero.
Alternatively, you can set ‘use_jastrow : F’ in the input file, provided you do not want to optimize the Jastrow parameters in this run.
MDT Note added 9/10/2013: the default behaviour for empty Jastrows in energy minimization changed in CASINO v2.13.94, so that it now does the following:
“If one is studying a homogeneous system, or a 3D-periodic system, or the optimization method is energy minimisation with the CASINO standard Jastrow factor, a simple default for u will be chosen that satisfies the Kato cusp conditions; otherwise, only the Slater wave function will be used for the first configuration generation run when performing wave function optimization” (varmin, madmin, or emin).
Type ‘casinohelp xxx’, ‘casinohelp search xxx’, or ‘casinohelp all’. The ‘casinohelp’ utility is very useful, and is the most up-to-date keyword reference available, often more so than the CASINO manual.
Your compiler/OS is not correctly handling over-sized memory allocation; CASINO should exit on ‘Allocation problem’ rather than crash the machine. CASINO v2.2 solves this problem.
This is likely to be the case when using fluid orbitals in an excitonic regime, where the HF configurations appear not to be very good for optimizing Jastrow parameters. Try setting ‘opt_maxiter : 1’ and ‘opt_method : madmin’, which will run single-iteration NL2SOL optimizations. It may take a few VMC-madmin cycles to reach sensible energies, after which you can use the default ‘opt_maxiter : 20’ to get to the minimum more quickly.
Short answer: No, you must compare the VMC energies and variances and ignore what varmin tells you. Optimization fails if the energy of the second VMC run is significantly greater than that of the first VMC run.
Long answer: The ‘unreweighted variance’ is a good target function to minimize to lower the TRUE energy (that of the later VMC run), but the values it takes are of no physical significance. This often applies to the ‘reweighted variance’ as well. Notice that the initial value of the two target functions must be the true variance of the first VMC run (or the true variance of a subset of the configurations if you set vmc_nconfig_write < vmc_nstep, which is usually the case). The same applies to the ‘mean energy’ reported in varmin.
Your Linux computer is dynamically changing the processor frequency to save power. It should set the frequency to the maximum when you run CASINO, but by default it ignores processes with a nice value greater than zero (CASINO’s default is +15). To fix this, supply the ‘--user.nice=0’ option to runqmc.
You will see wild fluctuations in the block timings only if some other process triggers a change in the frequency, otherwise you will only notice slowness.
The above problem should not appear in modern Linux distributions (from 2008 onwards).
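To check whether frequency scaling is active on a given machine, you can inspect the standard Linux cpufreq sysfs interface. This is a generic diagnostic, not part of CASINO:

```shell
# Report the frequency-scaling governor of CPU 0, if the interface exists.
gov_file=/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
if [ -r "$gov_file" ]; then
  governor=$(cat "$gov_file")
else
  governor="none (no cpufreq interface; scaling probably inactive)"
fi
echo "cpu0 governor: $governor"
```

A governor of "performance" means the CPU stays at full speed regardless of nice values.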
Other than this, block times are bound to oscillate, since during the course of the simulation particles are moved in/out of (orbital/Jastrow/ backflow) cutoff regions, which increases/reduces the expense of calculating things for particular configurations. However, provided the blocks contain a sufficient number of moves, the block times should be equally long on average.
In this check, CASINO computes the numerical gradient and Laplacian of the wave function at a VMC-equilibrated configuration by finite differences with respect to the position of an electron. The results are compared against the analytic gradient and Laplacian, which are used in the computation of kinetic energies all over CASINO.
CASINO reports the degree of accuracy to which the numerical and analytical derivatives agree using four different levels: optimal, good, poor and bad. These correspond to a relative difference of <1E-7, <1E-5, <1E-3, and >1E-3, respectively. If the accuracy is ‘bad’, the test reports a failure, and the analytical and numerical gradient and Laplacian are printed for debugging purposes.
This check should detect any inconsistencies between the coded expressions for the values, gradients and Laplacians of orbitals, Jastrow terms and backflow terms, as well as inconsistencies in the process of calculating kinetic energies and wave-function ratios. Therefore it’s reassuring to see that a given wave function passes the kinetic energy test.
However there are cases where the results from the test should not be taken too seriously:
– The thresholds defining the optimal/good/poor/bad levels are arbitrary. For small systems they seem to be a good partition (one usually gets good or optimal), but for large systems the procedure is bound to become more numerically unstable and the thresholds may not be appropriate. Thus a ‘poor’ gradient or Laplacian need not be signalling an error.
– For ill-behaved wave functions (e.g. after an unsuccessful optimization) it is not uncommon for the check to report a failure (‘bad’ level). This is not a bug in the code, you’ll just need to try harder at optimizing the wave function.
BG_SHAREDMEMPOOLSIZE (Blue Gene/P) or BG_SHAREDMEMSIZE (Blue Gene/Q) to be the size of the shared memory partition in Mb. How do I do this, why do I need to do it, and what value should I set it to?
This variable may be set by using the --user.shmemsize option to the runqmc script (exactly which environment variable this defines is set in the appropriate .arch file for the machine in question).
Note that on Blue Gene/Ps the default for this is zero (i.e. the user always needs to set it explicitly), while on Blue Gene/Qs the default is something like 32-64 Mb, dependent on the number of cores per node requested — thus for very small jobs on Blue Gene/Qs you don’t need to set it explicitly.
An example of such a machine is Intrepid (a Blue Gene/P at Argonne National Lab), and you can look in its .arch file to see exactly what is done with the value of BG_SHAREDMEMPOOLSIZE (likewise, for a Blue Gene/Q, the corresponding .arch file).
Let us use the Intrepid Blue Gene/P as an example.
Nodes on Intrepid have 4 cores and 2Gb of available memory, and the machine can run in 3 different “modes”:
SMP - 1 process per node, which can use all 2Gb of memory.
DUAL - 2 processes per node, each of which can use 1Gb of memory.
VN - 4 processes per node, each of which can use 512Mb of memory.
Using shared memory in SMP mode doesn’t seem to work (any job run in this way is ‘Killed with signal 11’), presumably due to an IBM bug. (One might consider using OpenMP to split the configs over the 4 cores one would otherwise run in SMP mode.)
So – taking VN mode as an example – we would like to be able to allocate (say) 1.5Gb of blip coefficients on the node, and for all four cores to have access to this single copy of the data – which is the point of shared memory runs. Such a calculation would be impossible if all 4 cores had to have a separate copy of the data.
Unfortunately, on Intrepid, the user needs to know how much of the 2Gb will be taken up by shared memory allocations. This is easy enough, since the only vector which is ‘shallocked’ is the vector of blip coefficients. This will be a little bit smaller than the size of the binary blip file, which you can work out with e.g. ‘du -sh bwfn.data.bin’.
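A small sketch of turning that file size into a --user.shmemsize value, with roughly 5% headroom added on top. The rounding and margin are arbitrary choices for illustration, not CASINO requirements:

```shell
# Suggest a shared-memory pool size (in Mb) from the binary blip file size.
blip_file=bwfn.data.bin
blip_bytes=$(stat -c %s "$blip_file" 2>/dev/null || echo 0)
shmem_mb=$(( blip_bytes / 1024 / 1024 * 105 / 100 + 1 ))
echo "suggest --user.shmemsize=$shmem_mb for $blip_file"
```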
On a machine with limited memory like this one, it will pay to use the ‘sp_blips‘ option in input, and to use a smaller plane-wave cutoff if you can get away with it.
Thus if your blip vector is 1.2Gb in size, then run the code with something like:
runqmc --shmem=4 --ppn=4 --user.shmemsize=1250 --walltime=2h30m
--shmem=4 indicates we want to share memory over 4 cores, --ppn=4 (‘processes per node’) indicates we want to use VN mode, and --walltime is the maximum job time. Note the value for --user.shmemsize is in Mb.
A technical explanation for this behaviour (from a sysadmin) might run as follows:
“The reason why the pool allocation needs to be up-front is that when the node boots it sets up the TLB to divide the memory space corresponding to the mode (SMP, DUAL, or VN). if you look at Figure 5-3 in the Apps Red Book (See http://www.redbooks.ibm.com/abstracts/sg247287.html?Open), you’ll see how the (shared) kernel and shared memory pool are layed out first then the remainder of the node’s memory is split in 2 for DUAL mode. The diagram 5-2 I believe is erroneous and should look more like the the one for DUAL except with 4 processes. With this scheme, it is impossible to grow the pool dynamically unless you fixed each processes’ memory instead (probably more of a burden).
In theory, in VN mode you should be able to allocate a pool that is 2GB – kernel (~10MB IIRC) – RO program text – 4 * RW process_size. There is a limitation that the TLB has not that many slots and the pages referenced there can only have certain sizes: 1MB, 16MB or 256MB, or 1GB. (There is some evidence 1GB may only be used in SMP mode). So depending on the size of the pool you ask for, it may take more than 1 slot, and there is the possibility of running out of slots. We don’t know whether or not the pool size is padded in any way. So one thing you could try is to increase it slightly to 3*256MB.”
How much memory can you shallocate in general? In practice I (MDT) find on Intrepid that any attempt to set BG_SHAREDMEMPOOLSIZE to greater than 1800 Mb results in an “Insufficient memory to start application” error. Values up to that should be fine. Sometimes with larger values (from 1500 Mb to 1800 Mb) one sees the following error:
“* Out of memory in file /bghome/bgbuild/V1R4M2_200_2010-100508P/ppc/bgp/comm/lib/dev/mpich2/src/mpi/romio/adio/ad_bgl/ad_bgl_wrcoll.c, line 498”
but this doesn’t seem to affect the answer.
Note there is more information on Blue Gene memory at the end of question B9.
I prefer Cray.
Yes. If you do a direct comparison between the speeds of say a Cray XE6 and a Blue Gene/Q on the same number of cores you will find that the Blue Gene is significantly slower (4 times slower in the test that I – MDT – did). What can we do about this?
With a bit of digging you then find that BG/Qs are supposed to be run with multiple threads per core, since you simply don’t get full instruction throughput with 1 thread per core. It’s not that more threads help; it’s that fewer than 2 threads per core is like running with only one leg. You just can’t compare a BG/Q to anything unless you’re running at least 2 hardware threads per core. Most applications max out at 3 or 4 per core (CASINO maxes out at 4 – see below). The BG/Q has a completely in-order, single-issue-per-hardware-thread core, whereas x86 cores are usually multi-issue.
Here are some timings, for a short VMC run on a large system (TiO2 – 648 electrons) on 512 cores:
Hector - Cray XE6 : 55.1 sec
Vesta – Blue Gene/Q : 222.19 sec
Now try ‘overloading’ each core on the BG/Q (it can take up to 4 threads per core, in powers of 2):
--ppn=48 : “48 is not a valid ranks per node value”
--ppn=64 : 138.44 sec
So by using 512 cores to run 2048 MPI processes, we improve to ~2.5 times slower than Hector (512 cores running 512 processes) rather than 4 times slower as before. Note that the number of processes per core has to be a power of 2 (i.e. not 3) – unless you do special tricks.
Now try OpenMP on the cores instead of extra MPI processes (OpenmpShm mode):
--ppn=16 --tpp=1 : 222.19 sec
--ppn=16 --tpp=2 : 197.18 sec
--ppn=16 --tpp=4 : 196.42 sec
Thus for this case we see (a) there is not much point going beyond --tpp=2, and (b) OpenMP is not as good as just using extra MPI processes. In this case this means that running multiple processes per core for a fixed number of cores, thus allowing each process to do fewer moves (parallelizing over moves), is much faster than attempting to calculate the wave function faster for a fixed number of moves per process (parallelizing over electrons).
Thus – if memory allows it – the best way to run CASINO on a Blue Gene/Q such as Mira at Argonne National Lab seems to be:
runqmc -n xxx --ppn=64 -s --user.shmemsize=yyy
i.e. run a shared memory calc with a shared memory block of yyy Mb on xxx 16-core nodes with 64 MPI processes per node (4 per core).
Note however, that all this comes with a big caveat regarding MEMORY.
Consider the big Blue Gene/Q at Argonne – Mira, which has 16Gb available per 16-core node.
Forgetting about shared memory for the moment, it is common to think of all this memory as being generally available to all cores on the node, such that e.g. the master process could allocate 15Gb, all the slaves could allocate a total of 0.5Gb, and everything would fit in the node.
In fact, since there is no swap space available, the 16 Gb per node is divided among the MPI processes as evenly as possible (which apparently might not be that even in some cases due to how the TLB works). So when you’re using 64 MPI processes per node – which we supposedly recommend above – then each individual process has a maximum possible memory usage of 16Gb / 64 = 250 Mb of memory in principle.
In practice, this will be reduced by (a) any shared memory partition – the default for this being c. 32Mb over the whole node, which you may increase to whatever value you like using the --user.shmemsize command line argument to runqmc, (b) memory used by MPI, and (c) other stuff.
According to some data I found, in practice for 64 MPI processes per node, the available memory per process is between 180 and 240 Mb (unevenly between the processes).
You may therefore run out of memory for large systems on large numbers of processes because:
- 180 Mb is not very much.
- The memory required by MPI increases as a function of the partition size.
- There are various large arrays in CASINO which either cannot be put in shared memory (e.g. because they refer to configs which live only on the MPI process in question, such as the array containing the interparticle distances for a given set of configs) or have not yet been coded to be put in shared memory (e.g. Fourier coefficients for orbitals expanded in plane waves), and these just won’t fit in 180 Mb.
What to do about it
(1) Don’t use --ppn=64 for the number of processes per node. Every time you halve it, the available memory per process will approximately double. The disadvantage then is that the effective speed of a single core will go down (for --ppn=16, i.e. 1 process per core, it will be something like 4 times slower than a Cray core running a single process). For really big systems you can even do e.g. --ppn=8 etc. Note that the process slots you are not using can be turned into OpenMP threads, which parallelize over the electrons instead of the configs (you would need to compile the code in OpenmpShm mode first, then use e.g. runqmc --ppn=2 --tpp=2).
(2) Setting an environment variable called BG_MAPCOMMONHEAP=1 can even out the memory available per process, at the cost that “memory protection between processes is not as stringent” (I don’t know what that means in practice). CASINO will do this automatically if the arch file has been set up to do so (e.g. on Mira/Cetus/Vesta).
(3) Make sure you use shared memory if the system supports it, and tune the value of --user.shmemsize to be as small as possible. (CASINO will print out the amount of shared memory required at the end of setup when run with ‘testrun : T’ – by default using the number of processors used in the test run. It will estimate the amount of shared memory required on N processors if you set the input keyword shm_size_nproc to be N.)
(4) Badger the CASINO developers to put more things into shared memory where possible.
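The shared-memory estimate in point (3) can be requested with an input fragment like the following (the values and inline comments are illustrative; the format follows the input-file excerpts earlier in this FAQ):

```
testrun        : T     #*! Do a test run only (Boolean)
shm_size_nproc : 8192  #*! Estimate shared memory for this many processes (Integer)
```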
That was a long answer to a simple question.
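To see how point (1) plays out numerically, here is a back-of-the-envelope sketch. The 16 GB-per-node figure is an assumption for a Blue Gene/Q node, and the printed values ignore the OS/MPI overheads that whittle the --ppn=64 case down to roughly the 180 MB mentioned above:

```shell
# Raw memory per MPI process on a (assumed) 16 GB Blue Gene/Q node,
# before OS and MPI overheads are subtracted.
node_mb=16384
for ppn in 64 32 16 8; do
  echo "--ppn=$ppn -> $((node_mb / ppn)) MB per process"
done
```

Halving --ppn doubles the per-process figure, exactly as the answer above says, which is why dropping from 64 to 16 processes per node buys you roughly four times the memory per process.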
The authors of the various interfaces between CASINO and plane-wave codes such as PWSCF/CASTEP/ABINIT make different assumptions about which unoccupied orbitals should be included in the file.
For example, with CASTEP/ABINIT you always have to produce a formatted pwfn.data file as a first step; this must then be transformed into a formatted blip bwfn.data file using the blip utility, and when that is read in by CASINO it will be transformed into a much smaller bwfn.data.bin file (if keyword write_binary_blips is T, which is the default).
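Schematically, the CASTEP/ABINIT route just described is (a sketch only; check the blip utility’s own README for the exact invocation):

```
DFT code (CASTEP/ABINIT)  -->  pwfn.data       (formatted plane waves)
blip utility              -->  bwfn.data       (formatted blips)
CASINO, first run         -->  bwfn.data.bin   (binary blips, written when
                                                write_binary_blips : T)
```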
With PWSCF, you may produce any of these files (bwfn.data, or the old-format bwfn.data.b1) directly with the DFT code, without passing through a sequence of intermediaries.
In the CASTEP case, the unstated assumption is that the formatted file is kept as a reference (either pwfn.data, or the larger bwfn.data, depending on whether disk space is a problem) and that this contains all the orbitals written out by the DFT code. When converting to binary, only the orbitals occupied in the requested state are written out. If you later want to do a different excited state, the bwfn.data.bin file should be regenerated for that state.
In the PWSCF case, because the formatted file need never exist, all orbitals written by the DFT code are included in the binary blip file (old-format bwfn.data.b1), including all unoccupied ones. Thus these files can be considerably larger than ones produced through the ABINIT/CASTEP/blip-utility/CASINO route.
In general one should control the number of unoccupied orbitals in the blip file through some parameter in the DFT code itself. For example, you might try the following:
- CASTEP : Increase ‘nextra_bands‘ to something positive. Note that CASTEP has a help system just like CASINO’s: try ‘castep -help nextra_bands‘.
- PWSCF : Play with the ‘nbnd‘ keyword.
- ABINIT : Play with the ‘nband‘ keyword.
See http://www.abinit.org/documentation/helpfiles/for-v6.8/input_variables/varbas.html#nband
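As a concrete illustration for the PWSCF case, the number of bands (and hence the number of unoccupied orbitals carried through to the blip file) is set in the &system namelist of the input file; the value 30 below is purely illustrative:

```
&system
  ...
  nbnd = 30
/
```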
Note that it is planned to tidy this system considerably in a future release of CASINO (including making the blip utility write in binary directly, without a formatted bwfn.data ever having existed).
C: Using the utilities
There are README files in most subdirectories under CASINO/utils, which should contain the information you need.
D: Using CASINO with external programs
Note this sort of advice is likely to date quickly.
UK Hector machine (PWSCF current SVN version 26/11/2011)
None of the four compilers works by default – I (MDT) eventually
managed to make the GNU one work. There may be other ways.
Check the current default module with ‘module list‘ – if PrgEnv-gnu is not listed then run the relevant one of the next three lines:
module unload PrgEnv-path
module unload PrgEnv-cray
module unload PrgEnv-pgi
module load PrgEnv-gnu
then in the espresso base directory run ‘./configure‘ and edit the make.sys file that gets produced, changing the CFLAGS line from
CFLAGS = -fast $(DFLAGS) $(IFLAGS)
to
CFLAGS = -O3 $(DFLAGS) $(IFLAGS)
Then type ‘make pw‘.
UK Hartree Centre – Blue Joule Blue Gene/Q (Espresso version 5.02 Feb 2013)
This requires a bit of hacking. Not all of the following may be necessary, or even the right thing to do (you’re supposed to fiddle with some preliminary files in the install directory) but this is a recipe that
worked for me.
module load scalapack
- In the espresso base directory, run ‘./configure‘. This will create ~/espresso/make.sys, in which you should change the following things (again, these may not all be necessary, but I couldn’t be bothered to check):
The compiler definitions should be changed from
MPIF90 = mpixlf90
#F90 = /opt/ibmcmp/xlf/bg/14.1/bin/bgxlf90_r
CC = /opt/ibmcmp/vacpp/bg/12.1/bin/bgxlc_r
F77 = /opt/ibmcmp/xlf/bg/14.1/bin/bgxlf_r
to
MPIF90 = mpixlf90_r
CC = mpixlc_r
F77 = mpixlf77_r
the linker from
LD = /opt/ibmcmp/vacpp/bg/12.1/bin/bgxlc_r -qarch=qp -qtune=qp
to
LD = mpixlf90_r -qarch=qp -qtune=qp
LDLIBS = -L/opt/ibmcmp/xlf/bg/14.1/lib64 -lxlopt -lxl -lxlf90_r -lxlfmath
and the Blas/Lapack libs should be changed from whatever they are to
BLAS_LIBS = -L/bgsys/ibm_essl/prod/opt/ibmmath/essl/5.1/lib64/ -lesslbg
BLAS_LIBS_SWITCH = external
LAPACK_LIBS = -L/bgsys/ibm_essl/prod/opt/ibmmath/essl/5.1/lib64/ \
  -L/gpfs/packages/ibm/lapack/3.4.2/lib -lesslbg -llapack
LAPACK_LIBS_SWITCH = external
Note that the 3.4.2 version of the lapack directory might have changed by the time you come to read this – check that this directory exists.
- Then type ‘make pw‘ in the espresso base directory.
Various versions of PWSCF from summer 2011 produce an apparently miscompiled executable, which when told to print out a list of k points in the output file (or in xwfn.data), just prints out a string of zeroes i.e. all k points are listed as (0.0 0.0 0.0).
Having dug around in the PWSCF source code, it seems that the k-point grid was being defined before the reciprocal lattice vectors, hence all the k points really were zero. I have no idea why no-one noticed this – you would think it would be pretty fundamental. Looking at the latest version (26/11/2011), this has now been fixed.
Solution : upgrade your PWSCF.
EDIT: later investigation showed this problem was extant between commit 8051 and commit 8121.
I try to use a pseudopotential converted with casino2upf in PWSCF, but it stops and whines about not having been compiled with support for hybrid functionals. I didn’t specify a hybrid functional, so why?
The casino2upf utility marks any UPF files it creates as having been generated using Hartree-Fock (since they generally are). If you do not supply a value for the ‘input_dft‘ keyword in the ‘system’ section of the PWSCF input file, then PWSCF will attempt to use the functional specified in the pseudopotential file, i.e. it will try to do a Hartree-Fock calculation. Since this is only possible if PWSCF was compiled having invoked ‘configure‘ with the ‘--enable-exx‘ flag, the code may stop and whine about not having been compiled with support for hybrid functionals. This can be confusing. Solution: specify ‘input_dft‘ in the input file.
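A minimal sketch of the fix, in the &system namelist of the PWSCF input file (choose whichever functional you actually want; ‘PBE’ here is just an example):

```
&system
  ...
  input_dft = 'PBE'
/
```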
According to CRYSTAL author Roberto Orlando:
“There was a stupid restriction in the input for pseudopotentials that limited the maximum allowed angular quantum number to L=4, even if algorithms are general. Thus, we have extended it to L=5. Unfortunately, this implies the addition of one datum in the record below INPUT. This change is reported in the manual, but everybody using pseudopotentials now fails. Maybe we should change the error message …”
There are clearly backward-compatible ways in which this could have been done, but anyhow, the point is that from CRYSTAL09 onwards, all input decks constructed from pseudopotentials obtained from the CASINO pseudopotential library before Feb 2012 will fail.
The solution is to add an extra zero to the end of the second line of each pseudopotential (effectively stating that your pseudopotential contains no g functions). Thus:
1.000 8 8 8 0 0
51.12765602 1.00000000 -1
38.05848906 -860.41240728 0
becomes:
1.000 8 8 8 0 0 0
51.12765602 1.00000000 -1
38.05848906 -860.41240728 0
On 14/2/2012 MDT converted all the files in the CASINO pseudopotential library and in the examples to reflect this change so everything should now work with CRYSTAL09.
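If you have many such files to convert by hand, the edit can be scripted. This is a sketch only: the filename is invented for the demo, and GNU sed’s -i flag is assumed:

```shell
# Append ' 0' to the end of the second line of a CRYSTAL
# pseudopotential input (stating that it contains no g functions).
fix_pp() { sed -i '2s/$/ 0/' "$1"; }

# Demo on a scratch file:
printf 'INPUT\n1.000 8 8 8 0 0\n51.12765602 1.00000000 -1\n' > /tmp/demo_pp
fix_pp /tmp/demo_pp
sed -n '2p' /tmp/demo_pp   # -> 1.000 8 8 8 0 0 0
```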
CASINO gives incorrect results with gwfn.data files derived from CRYSTAL06 or CRYSTAL09. Why?
The converters in
utils/wfn_converters/crystal0[6,9] were broken for a period of several years. The error was introduced in patch 2.4.41 and fixed in 2.11.5 (and backfixed into the official 2.10 release).
The essence of the problem was that the block of orbital coefficients for the down-spin orbitals was incorrectly just a copy of those for the up-spin-orbitals, rather than being the correct down-spin ones. Apologies for our having taken so long to notice this.
To fix this problem manually:
Find line 466 in the following two files:
In the incorrect version of these files, this should read:
Change this to:
The runpwscf script uses the CASINO architecture scheme in order to know how to run calculations on any given machine. The data file containing the machine definition is CASINO/arch/data/xx.arch (the ‘arch file’).
If your CASINO arch file defines a command for running CASINO and PWSCF (such as SCRIPT_RUN: mpirun -np &NPROC& &BINARY&), then it must include a tag &BINARY_ARGS& following the &BINARY& tag. This is because the PWSCF executable takes command-line arguments such as ‘-npool‘ and ‘< in.pwscf >> out.pwscf‘ etc. which are not required by CASINO. For this reason an arch file which works for CASINO may not necessarily work for PWSCF without this modification. To check this on a batch machine, type ‘runpwscf --qmc --check-only‘ and examine the resulting ‘pw.x‘ batch file. If the line containing e.g. the mpirun command above does not have ‘-pw2casino‘ following ‘pw.x‘, then you will need to add &BINARY_ARGS& to your arch file.
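A corrected SCRIPT_RUN line would then look like this (using the mpirun form from the example above; the exact launcher command varies by machine):

```
SCRIPT_RUN: mpirun -np &NPROC& &BINARY& &BINARY_ARGS&
```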
Here is a recipe that I (MDT) got to work on my machines (at work and at home) in connection with the Cetus machine at Argonne.
(1) If you are trying to do this at home without a static IP address, then subscribe to a free dynamic DNS service like http://www.noip.com/ . You end up with a daemon permanently running on your home machine that sends your current IP address (whenever it changes) to Noip, who keep it linked with a name like ‘mikes_house.no-ip.biz‘. Then you can do ‘ssh mikes_house.no-ip.biz‘ from anywhere and it supposedly will figure out the current corresponding IP address.
Also make sure you instruct your router to open the ssh port and to forward any ssh traffic to the machine you want (which you will have set up with a static IP like 192.168.0.42 or whatever on your internal network).
You might also need to tell your home machine to open the correct ports for VNC (with OpenSuse Linux there’s an option for this in Yast2).
(2) On Cetus, type:
vncserver -geometry 1920x1080
It will respond
New 'cetuslac1:2 (your_user_name)' desktop is cetuslac1:2
Keep a note of the ‘:2‘ – this display number determines the port you will need (and it might be different every time you do this).
(3) On Cetus, type:
ssh -c blowfish -C -f -R5902:localhost:5902 [email protected]_house.no-ip.biz
vncviewer -display :0 localhost:2
where all occurrences of ‘2‘ (as in ‘:2‘ and ‘5902‘) refer to the display number from step (2).
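The relationship between the display number and the TCP port is simply port = 5900 + display, which is where the 5902 above comes from:

```shell
# VNC maps display :N to TCP port 5900+N.
display=2
port=$((5900 + display))
echo "$port"   # -> 5902
```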
The Totalview user support guy told me to add
-encodings '"tight zlib copyrect hextile rre corre raw"'
to the list of arguments to
ssh, but that either corrupted my display, or had no effect. I’m not sure what this is supposed to do (and can’t be bothered to look it up – feel free).
(4) You should then get a big window containing a desktop, which will either be a fully functioning clone of the desktop on your home machine, or a partially functioning black desktop where you can’t e.g. resize windows, depending on which machine you use. (I don’t know how to fix the black case – nobody said this was easy!)
(5) Inside the desktop should be a terminal window with which you can issue commands on Cetus. On Cetus, you would then do
runqmc -n 1 --ppn=2 -T 1h -s --user.shmemsize=1000 --debug
Totalview will then launch inside the virtual desktop, and when you press a button it should respond quickly, rather than taking half an hour over it, as it would otherwise.
Then, assuming you’ve studied the Totalview manual for several days, you ought to be able to debug CASINO. Various bizarre behaviours may be encountered if you try to do this, but it essentially works.
(6) Once you’ve finished with it, close the Totalview window, then issue the following command on Cetus:
vncserver -kill :2
:2 is the port number from step (2).
Note finally that Totalview also produces a ‘Remote Display Client’ which will supposedly do all this quasi-automatically, and which you can download and install on your personal machine. However, not even the Totalview user support guys can make it work with Cetus (and I did ask).
Unmodified versions 5.1 or 5.1.1 of the PWSCF DFT code (extant from April –> December 2014) contain a bug introduced by the developers that affected the CASINO converter routine. SVN development versions from December 2014 contain a fix for this bug, as will subsequent releases.
To fix the bug manually, replace line 366 of PW/src/pw2casino_write.f90, which reads
IF( nks > 1 ) CALL get_buffer (evc, nwordwfc, iunwfc, ikk )
with
CALL get_buffer (evc, nwordwfc, iunwfc, ikk )
E: QMC questions
In general, yes. Although there may be cases where using different pseudopotentials is of little consequence, this is not true in most situations. CASINO requires the pseudopotential on a grid, while quantum chemistry codes tend to require it expanded in Gaussians. The online library has the pseudopotentials in both formats (the latter done specifically for GAUSSIAN, CRYSTAL and GAMESS). There are also some notes supplied below the table which you should read.
Have a look in the
CASINO/utils/pseudo_converters directory for utilities that convert pseudopotentials formatted for other non-Gaussian codes into the correct format for CASINO. Utilities are currently available for ABINIT, CHAMP, CASTEP, PWSCF, and GP.
You shouldn’t; see D1.a. If you generate all-electron orbitals, you should run all-electron QMC calculations. Note that the scaling of all-electron QMC with atomic number is problematic; you should almost always use pseudopotentials to simulate everything but first-row atoms.
The source code is necessary for getting CASINO to work where it’s most useful, i.e., on high-performance clusters. It’s impractical to generate binaries for each machine we know of, particularly given that we presently don’t have access to most of the supported architectures.
However .deb and .rpm packages for popular Linux distributions are a planned feature, which we will get to once we enable system-wide installations (see A6 above).
G: Development of CASINO
We are always happy to accept the help of competent people with developing CASINO. If you wish to get involved, then please contact Mike Towler (mdt26 at cantab.net) with some details of what it is you wish to implement/improve. You will be given a developer’s password which will allow you to access the main development source code and documentation.
CASINO is developed using the git revision control system (see
CASINO/doc/git_guide.pdf). Access to the developer git repository may be granted by sending a request to Mike Towler and then following the instructions at casinoqmc.net/git/
Use the Cambridge University Computing Service VPN service, which they call VPDN. Despite the fact that every other person has an Android device these days, the UCS specifically state that they do not support Android (and secretly would rather everyone still used Windows XP). In practice it works fine – they just pretend not to be aware of it and are extremely resistant to being told so. Hence this FAQ entry, which they have been made aware of and refuse to put on their website, despite it being perfectly correct, legal, legitimate, etc.
Parts of this advice might work for other universities if they run a VPN service.
VPDN instructions for Android devices on the Cambridge network
- Go to http://userforms.csx.cam.ac.uk/vpdn (Raven Login)
- Fill in form requesting VPDN_USERNAME and VPDN_PASSWORD.
- Wait for UCS to send you these by regular mail.
- From a terminal, type:
ssh -x [email protected]
then:
Login : VPDN_USERNAME
Password : VPDN_PASSWORD
Press C <enter>
Press K <enter>
The SHARED_KEY will be displayed. Write it down, then log out.
- Install the Android app ‘VpnCilla‘ from the Google Play store (costs about 3 UK pounds) on your device. This is the only such app I could get to work, and I did try quite a few of them.
- Start VpnCilla.
- Add a new connection by pressing the + sign.
- Type a name for the new connection e.g. ‘Cambridge VPN’ or whatever you like.
- Type in the following connection settings:
VPDN Server Address : vpdn-access-cisco.csx.cam.ac.uk
Group ID : vpdn
Group Password : SHARED_KEY
Your Username : VPDN_USERNAME
User Password : VPDN_PASSWORD
VPN via mobile network : <tick the box>
Optional vpnc flags : --enable-1des
NB: I had to *guess* the Group ID was ‘vpdn’ – the UCS don’t seem to give you this information, which is presumably one reason why Android devices are ‘not supported’.
- That’s it.
To make things easier there is a VpnCilla ‘Widget’ in the App Locker that you can add to your homescreen; this can be used to toggle the VPN connection on and off with one click. You can also rig this to start automatically when you turn on the phone, so that VPN is effectively ‘Always On’.
Result: your phone/tablet now always acts like it’s plugged into the wall in the Cavendish Laboratory wherever it might happen to be physically, so you can read Physical Review Letters on the beach just by pressing a PDF link, rather than pressing the PDF link and spending the next 10 minutes scrolling through menus and filling in forms.