Calc. dying with memory issues after exec. of several blocks

Posted: Tue Oct 25, 2016 11:55 am
by Katharina Doblhoff
Dear all and dear casino developers in particular,

I have nasty and big calculations running. The blip file is 50G. I have 64G RAM per node. My calculations started off fine (at least, I cannot see anything weird), but then all of them died with memory issues. The weird thing is that they all got through a couple of blocks (the entire equilibration and about 4 to 8 stats blocks with 80 steps per block) before they died. In spite of the fact that all jobs run the same calculation (difference is only in the twist), they all died at different moments in execution time. The calculation at the gamma point ran fine, but that of course does not need complex wfns.
So, basically I have three questions:
a.) How can this (dying after several blocks have been executed) happen? (This is a curiosity-driven question and is actually of minor immediate urgency.)
b.) Should my runs be fine so far? I.e., can I restart them using the config.out from the last block, possibly using nodes which have more memory or using fewer blocks if that would help? Or do I have to fear that there were some severe memory spillages in the earlier blocks too?
c.) If restarting is fine, does it matter that these nodes aren't Haswell nodes while the others were? (It will affect runtime of course, but would it harm the calculations in some way when restarting?)

Thanks for any help
Katharina

Re: Calc. dying with memory issues after exec. of several bl

Posted: Tue Oct 25, 2016 12:28 pm
by Neil Drummond
Dear Katharina,

Sorry to hear about the memory problems.

a) CASINO allocates configurations when they are born and deallocates them when they die, so the memory requirements fluctuate in time. The details of memory management are handled by the compiler and we have to hope that it is sensible. Perhaps this is behind the issue?

b) If CASINO has completed writing out a config.out file (you can check using format_configs) then it should be safe to restart from it.

c) Restarting on a different machine is fine so long as the endianness is the same. If the endianness is different then CASINO should protest when it tries to read the bwfn.data.bin file. If endianness is a problem then you can use format_configs to produce an unambiguous, formatted version of config.in, which you can then unformat on the new machine by running format_configs again.
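If you are unsure whether two machines share the same endianness, a small stand-alone test along the following lines can be compiled and run on each of them. This is only a minimal sketch and is not part of CASINO; the program name is made up for illustration. It copies the bytes of a 32-bit integer 1 into a byte array and looks at which byte holds the 1.

```fortran
program check_endianness
  ! Stand-alone sketch (not part of CASINO): report the byte order of
  ! the machine this program runs on.
  use iso_fortran_env, only: int8, int32
  implicit none
  integer(int8) :: bytes(4)

  ! Reinterpret the 4 bytes of the integer 1 as an array of single bytes.
  bytes = transfer(1_int32, bytes)

  ! On a little-endian machine the least significant byte is stored first.
  if (bytes(1) == 1_int8) then
    print '(a)', 'little-endian'
  else
    print '(a)', 'big-endian'
  end if
end program check_endianness
```

If both machines print the same answer, the unformatted files should be readable on either one as far as byte order is concerned.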

Best wishes,

Neil.

Re: Calc. dying with memory issues after exec. of several bl

Posted: Tue Oct 25, 2016 1:11 pm
by Katharina Doblhoff
Hi Neil!

Thank you for your fast reply; I will restart the calcs. But your answer to a.) does not sound reasonable to me: typically (and this is also what I see for the gamma-point calculation) the number of configs is highest during the first stage of equilibration, then goes down and then oscillates, but basically never reaches its initial high. So if I get through equilibration, why do I not get through the stats?

Thank you and all the best,
Katharina

Re: Calc. dying with memory issues after exec. of several bl

Posted: Tue Oct 25, 2016 4:51 pm
by Neil Drummond
Dear Katharina,

Thanks very much for reporting the issue - I agree there is a memory problem. It seems to have been introduced in patch 1d6d14ae, in which the call to dmc_annihilate_configs at the end of dmc_main was bypassed for the equilibration stage of a DMC calculation. This means that the data for the configuration population at the end of equilibration becomes detached and sits unused in memory during the subsequent statistics accumulation. If you start again with runtype=dmc_stats, you should have a bit more memory available. To fix the problem, replace

Code:

! Clear configs.
 if(iaccum.or.(isdmc.and..not.iaccum))then
  call dmc_annihilate_configs
 endif
with

Code:

! Clear configs.
 call dmc_annihilate_configs
in dmc.f90.
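
To illustrate how skipping the cleanup call can leave a population "detached", here is a generic sketch. It is not CASINO's actual code: the config type, the linked-list layout, the array sizes and the routine names create_configs and annihilate_configs are all hypothetical stand-ins. The point is only that if the list from one stage is not freed before the head pointer is reset for the next stage, the old nodes stay allocated but can no longer be reached.

```fortran
program config_leak_sketch
  implicit none
  integer, parameter :: dp = kind(1.d0)

  ! Hypothetical stand-in for a walker configuration in a linked list.
  type config
    real(dp), allocatable :: r(:,:)
    type(config), pointer :: next => null()
  end type config

  type(config), pointer :: head => null()

  ! Stage 1 ("equilibration"): build a population.
  call create_configs(head, 4)

  ! Free the old population before the next stage starts. If this call
  ! is skipped, the stage-1 list stays allocated but becomes unreachable
  ! as soon as create_configs resets head below.
  call annihilate_configs(head)

  ! Stage 2 ("statistics accumulation"): start from a fresh list.
  call create_configs(head, 4)
  call annihilate_configs(head)

  print '(a,l1)', 'list empty at end: ', .not. associated(head)

contains

  subroutine create_configs(head, n)
    type(config), pointer, intent(inout) :: head
    integer, intent(in) :: n
    type(config), pointer :: new
    integer :: i
    nullify(head)   ! any list still attached here is detached, not freed
    do i = 1, n
      allocate(new)
      allocate(new%r(3, 100))
      new%r = 0.0_dp
      new%next => head   ! push the new node onto the front of the list
      head => new
    end do
  end subroutine create_configs

  subroutine annihilate_configs(head)
    type(config), pointer, intent(inout) :: head
    type(config), pointer :: dead
    do while (associated(head))
      dead => head
      head => head%next
      deallocate(dead)   ! also frees the allocatable component r
    end do
  end subroutine annihilate_configs

end program config_leak_sketch
```

Removing the first call to annihilate_configs in this toy program reproduces the pattern described above: the program still runs and gives the same output, but the first population is never released.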

At least this bug doesn't affect any results and, since nobody has noticed until just now, obviously hasn't caused too much inconvenience.

Thanks again,

Neil.

Re: Calc. dying with memory issues after exec. of several bl

Posted: Wed Oct 26, 2016 9:07 am
by Mike Towler
Well spotted! I've added Neil's fix to the main distribution available on the website.

M.

Re: Calc. dying with memory issues after exec. of several bl

Posted: Thu Oct 27, 2016 6:43 am
by Katharina Doblhoff
Hi Neil!
Ah, now this makes sense! I guess it is very bad luck to really run into this situation! Thanks for spotting the problem!
All the best,
Katharina