
memory requirements in dmc

Posted: Thu Nov 02, 2017 3:32 pm
by Katharina Doblhoff
Dear all,
could somebody briefly explain the memory requirements in DMC (when using T-moves)?
The reason I am asking is the following: I have a large system with around 1300 electrons. It gets through the VMC calculations fine but dies with memory issues when starting the DMC calculations.
I have 64GB of RAM per node and my wavefunction file should use about 8GB of that in shared memory mode (24 CPUs per node).
Aren't the main other memory requirements the determinant with (#electrons)^2 double entries? This would be 1300^2*8B≈13.5MB. For T-moves, I think this should be needed twice --> 27MB. Since I am using shared memory, I should be fine. What am I overlooking? (Obviously something pretty big...)
Thank you for explaining!
Katharina

Re: memory requirements in dmc

Posted: Thu Nov 02, 2017 3:54 pm
by Pablo_Lopez_Rios
Hi Katharina,

The main difference between VMC and DMC in this regard is that each MPI process deals with a single walker in VMC, whereas in DMC there are on average P / Nproc configurations per process (where P is the target population). The memory use per configuration is O(N^2), which in your case is pretty significant. How many DMC configurations per process are you using in your calculations?
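To make the scaling concrete, here is a rough sketch of the estimate in Python. It assumes that one configuration costs some multiple of a single N x N double-precision matrix. The prefactor `matrices_per_config`, the target populations and the process counts are illustrative assumptions, not values taken from CASINO or from this thread.

```python
DP_BYTES = 8  # one double-precision number


def dmc_node_bytes(n_elec, target_pop, n_procs, procs_per_node,
                   matrices_per_config=1):
    """Rough per-node memory for the O(N^2) per-configuration data.

    matrices_per_config is a hypothetical prefactor: how many
    N x N double arrays each configuration needs.
    """
    configs_per_proc = target_pop / n_procs            # P / Nproc on average
    bytes_per_config = matrices_per_config * n_elec**2 * DP_BYTES
    return configs_per_proc * bytes_per_config * procs_per_node


# VMC-like case: one walker per process (24 processes per node).
vmc_like = dmc_node_bytes(1300, target_pop=240, n_procs=240, procs_per_node=24)

# DMC-like case: same machine, but ten configurations per process.
dmc_like = dmc_node_bytes(1300, target_pop=2400, n_procs=240, procs_per_node=24)

print(f"VMC-like: {vmc_like / 1e9:.2f} GB per node")
print(f"DMC-like: {dmc_like / 1e9:.2f} GB per node")
```

The point is only that the per-node cost grows linearly with the number of configurations per process, so whatever fits in VMC can be multiplied many times over in DMC.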

Best,
Pablo

Re: memory requirements in dmc

Posted: Thu Nov 02, 2017 3:56 pm
by Mike Towler
Hi Katharina

Read DIARY entry 2.13.281 (and weep)..

M.

Re: memory requirements in dmc

Posted: Thu Nov 02, 2017 4:48 pm
by Katharina Doblhoff
@ Pablo
Thanks for your answer! I had just figured that out... - stupid me!
@ Mike
Hang on, I do not get it:
  • How do you get to the 128GB per node in the example in the logfile? The array has dimension 50x3x1000x1000. For a double array that is 50x3x1000^2*8B=1.2GB. For 64 processes this makes 77GB and not 128GB.
    And why do you need to keep those arrays for every angle anyway? That just goes into a sum in the end, I thought. - Though admittedly I did not think for long - so I am likely to be overlooking something.
Unfortunately, it looks as if these were rhetorical questions anyway, since - from my current test using fewer configs - it looks as if the N^3 scaling per step would already take over at this number of electrons. Which is even more restrictive than simply changing to more nodes and no T-moves. But I would like to understand it anyhow... Any chance of getting an explanation without having to dig into the code myself?

Re: memory requirements in dmc

Posted: Thu Nov 09, 2017 5:52 pm
by Mike Towler
Hi Katharina

Sorry for the delay in responding - I've been moving house (again..).

From my vague memory of looking at this years ago, I got to 128 GB because there are 6 or 7 T-move arrays with dangerous scaling with system size (e.g. including netot * nitot * ...), not just 1 - and you can't put them into shared memory because they depend on details of the particular configs associated with each MPI process..

It was obviously coded at a time when we didn't do very large systems and nobody gave any thought to the memory requirements.. The DIARY entry was a note to myself to go back and see if there is a cleverer way of coding this (there almost certainly is) but I never got round to it before I was forced into the shopkeeping industry..

If you bribe me (suitable items are described in the CASINO manual - search for the word 'shiny') then I'll try to find time to look at it over the next few weeks..

M.

Re: memory requirements in dmc

Posted: Fri Nov 10, 2017 11:02 am
by Katharina Doblhoff
Hi Mike,
For some reason or the other, I still do not get it: I would have thought that there is one array of dimension 50 x 3 x nelectrons x nions...
And since (according to the manual) poems count as "shiny things" and since I still want to understand how it works, here comes a limerick (Though I think that I am better at QMC than at rhyming ;-) ):
There was once a chemist (that's me).
I sighed when I did QMC.
It could not be done:
too long took a run!
Should I thus leave the lab sans esprit?

Re: memory requirements in dmc

Posted: Mon Nov 13, 2017 1:19 pm
by Mike Towler
Perfect poem, thank you. I shall print it out and stick it to the wall of the church in Vallico next to the song you composed when you were last here..!
It seems that I am therefore now obliged to fix the T-move memory hogging for you - I should be able to have a look next week.
Katharina wrote: "For some reason or the other, I still do not get it: I would have thought that there is one array of dimension 50 x 3 x nelectrons x nions..."
You might think that - however, if you look at the source you find there are five relevant T-move arrays.

Some of these depend on the number of points in the non-local pseudopotential spherical grid, where the relevant quantities are defined as:

! Maximum number of points in spherical grid
INTEGER,PARAMETER :: nl_maxnrefgrid=50
! Actual number of points in current spherical grid
INTEGER,ALLOCATABLE :: nl_nrefgrid(:)

and let's say that

maxnlang=maxval(nlang) ! Highest l required for any pseudopotential.

Assume 1000 electrons, 1000 nuclei (i.e. a thousand H atoms in a box), maxnlang=2
Integer takes 4 bytes
Double precision takes 8 bytes

Originally the arrays were allocated as follows:

! Set up arrays for T move if necessary.
if(use_tmove)then
allocate(
i tmove_no_points(nitot,netot) : 1000000 * 4 = 4000000
dp tmove_T(nl_maxnrefgrid,nitot,netot) : 50000000 * 8 = 400000000
dp tmove_T_moved(nl_maxnrefgrid,nitot,netot) : 50000000 * 8 = 400000000
dp tmove_T_full(nl_maxnrefgrid,0:maxnlang,nitot) : 150000 * 8 = 1200000
dp tmove_points(3,nl_maxnrefgrid,nitot,netot) : 150000000* 8 = 1200000000
)
endif ! use_tmove

Total memory requirement: 4000000+400000000+400000000+1200000+1200000000
= 2,005,200,000 bytes, i.e. about 2 GB

At the time I was looking at this, I was using a 64 core per node Blue Gene Q machine, thus - since it seems we can't put these things in shared memory - a total of 128 GB per node was required just for these five arrays.
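For anyone who wants to redo this sum for their own system, the tally above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the array names and shapes are copied from the allocate statement above, while the function name and the dictionary layout are made up for illustration.

```python
INT_BYTES = 4  # default integer
DP_BYTES = 8   # double precision


def tmove_array_bytes(ngrid, nitot, netot, maxnlang):
    """Bytes per MPI process for the five T-move arrays listed above."""
    return {
        "tmove_no_points": INT_BYTES * nitot * netot,
        "tmove_T":         DP_BYTES * ngrid * nitot * netot,
        "tmove_T_moved":   DP_BYTES * ngrid * nitot * netot,
        "tmove_T_full":    DP_BYTES * ngrid * (maxnlang + 1) * nitot,  # 0:maxnlang
        "tmove_points":    DP_BYTES * 3 * ngrid * nitot * netot,
    }


# 1000 H atoms in a box, allocated with nl_maxnrefgrid = 50.
sizes = tmove_array_bytes(ngrid=50, nitot=1000, netot=1000, maxnlang=2)
per_process_bytes = sum(sizes.values())
per_node_bytes = 64 * per_process_bytes  # 64 MPI processes per node

for name, nbytes in sizes.items():
    print(f"{name:16s} {nbytes:>13,d} bytes")
print(f"per process: {per_process_bytes / 1e9:.2f} GB")
print(f"per node:    {per_node_bytes / 1e9:.1f} GB")
```

Note that three of the five arrays carry the full ngrid * nitot * netot product, which is where nearly all of the memory goes.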

The original obvious way that I improved this (in the DIARY entry mentioned above) was to notice that it was using nl_maxnrefgrid = 50 to allocate the latter four arrays, but this amount of space is never required unless you use the highest accuracy spherical grid, which no-one ever does because it is unnecessary and wasteful.
Remember:

+----------------+--------------------------+------------+
| NON_LOCAL_GRID | Exactly integrates l=... | No. points |
+----------------+--------------------------+------------+
|              1 |                        0 |          1 |
|              2 |                        2 |          4 |
|              3 |                        3 |          6 |
|              4 |                        5 |         12 |
|              5 |                        5 |         18 |
|              6 |                        7 |         26 |
|              7 |                       11 |         50 |
+----------------+--------------------------+------------+

If you use the actual size of the grid in this H-atom case instead (12 points; call it nl_nrefgrid) then you get:

i tmove_no_points(nitot,netot) : 1000000 * 4 = 4000000
dp tmove_T(nl_nrefgrid,nitot,netot) : 12000000 * 8 = 96000000
dp tmove_T_moved(nl_nrefgrid,nitot,netot) : 12000000 * 8 = 96000000
dp tmove_T_full(nl_nrefgrid,0:maxnlang,nitot) : 36000 * 8 = 288000
dp tmove_points(3,nl_nrefgrid,nitot,netot) : 36000000 * 8 = 288000000

Total : 4000000+96000000+96000000+288000+288000000
= 484,288,000 bytes, i.e. about 484 MB (* 64 = 31.0 GB per node, a saving of about 97 GB)

which is a huge saving of course, but these arrays are still ridiculously large.
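To see how the total depends on the choice of grid, here is a small Python sketch that sweeps over the NON_LOCAL_GRID table above. It collapses the five array sizes into one closed-form expression. The function name and the dictionary are illustrative; the point counts come from the table and the 1000-electron, 1000-ion, maxnlang=2 test case is the one used above.

```python
# Number of spherical grid points for each NON_LOCAL_GRID value (table above).
GRID_POINTS = {1: 1, 2: 4, 3: 6, 4: 12, 5: 18, 6: 26, 7: 50}


def tmove_total_bytes(ngrid, nitot, netot, maxnlang):
    """Per-process bytes for the five T-move arrays, in closed form.

    tmove_T, tmove_T_moved and tmove_points together hold
    (1 + 1 + 3) * ngrid * nitot * netot doubles.
    """
    ints = 4 * nitot * netot                                     # tmove_no_points
    doubles = 8 * ngrid * (5 * nitot * netot + (maxnlang + 1) * nitot)
    return ints + doubles


totals = {
    grid: tmove_total_bytes(npts, nitot=1000, netot=1000, maxnlang=2)
    for grid, npts in GRID_POINTS.items()
}

for grid, nbytes in totals.items():
    print(f"NON_LOCAL_GRID={grid}: {nbytes / 1e6:8.1f} MB per process, "
          f"{64 * nbytes / 1e9:6.1f} GB per 64-core node")
```

Grid 4 reproduces the 484 MB figure and grid 7 the original 2 GB one; even the one-point grid still costs tens of MB per process because of the nitot * netot factor.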

Now it's pretty clear that the original author didn't think about memory requirements, or he would have done the above grid thing in the first place (the change was very easy..). Therefore if we ask ourselves whether it is necessary to store all this information in this way, I bet that we find that it isn't and that we could do this in a more efficient way. I will have a look next week!

Cheers,
Mike

Re: memory requirements in dmc

Posted: Wed Jan 03, 2018 12:09 pm
by Mike Towler
Hi Katharina,

Neil came up with an idea to reduce the T-move memory requirements, and his fix is now in the distribution. DIARY entry:

Significantly reduced the memory requirements of the T-move scheme by only
evaluating and storing the T-move matrix elements for at most five ions (i.e.,
assuming that an electron is never inside the nonlocal pseudopotential cutoff
radii of more than five ions at the same time). This reduces the memory
requirements by a factor of 5/nitot.


This should help a lot.
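As a rough illustration of the 5/nitot factor, here is a Python sketch for the same 1000-atom test case. It rests on an assumption about the fix that the DIARY entry does not spell out: that the ion dimension of the four arrays indexed by both ion and electron shrinks from nitot to at most five slots. The function name and the `n_ion_slots` parameter are hypothetical, and the actual CASINO data layout may differ.

```python
def electron_dependent_bytes(ngrid, n_ion_slots, netot):
    """Bytes for the arrays indexed by both ion and electron.

    Per (ion slot, electron) pair: one 4-byte integer (tmove_no_points)
    plus (1 + 1 + 3) * ngrid doubles (tmove_T, tmove_T_moved, tmove_points).
    """
    return n_ion_slots * netot * (4 + 8 * 5 * ngrid)


nitot, netot, ngrid = 1000, 1000, 12
max_ions = 5  # cap from the DIARY entry

full = electron_dependent_bytes(ngrid, nitot, netot)
capped = electron_dependent_bytes(ngrid, min(max_ions, nitot), netot)

print(f"all ions stored:   {full / 1e6:.1f} MB per process")
print(f"at most five ions: {capped / 1e6:.2f} MB per process")
print(f"ratio: {capped / full}")
```

Under that assumption the electron-dependent storage drops from 484 MB to about 2.4 MB per process in this example, i.e. the ratio is 5/nitot = 0.005.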

Best wishes,
Mike