Calc. dying with memory issues after exec. of several blocks

General discussion of the Cambridge quantum Monte Carlo code CASINO; how to install and setup; how to use it; what it does; applications.
Katharina Doblhoff
Posts: 84
Joined: Tue Jun 17, 2014 6:50 am

Post by Katharina Doblhoff »

Dear all and dear casino developers in particular,

I have some big, nasty calculations running. The blip file is 50 GB and I have 64 GB of RAM per node. My calculations started off fine (at least, I cannot see anything weird), but then all of them died with memory issues. The weird thing is that they all got through a couple of blocks (the entire equilibration and about 4 to 8 stats blocks with 80 steps per block) before they died. Although all jobs run the same calculation (the only difference is the twist), they all died at different points in the run. The calculation at the gamma point ran fine, but that of course does not need complex wfns.
So, basically I have three questions:
a.) How can this (dying after several blocks have been executed) happen? (This is a curiosity-driven question and is actually of minor immediate urgency.)
b.) Should my runs be fine so far? I.e., can I restart them using the config.out from the last block, possibly using nodes which have more memory, or using fewer blocks if that would help? Or do I have to fear that there were some severe memory spillages in the earlier blocks too?
c.) If restarting is fine, does it matter that these nodes aren't Haswell nodes while the others were? (It will affect runtime of course, but would it harm the calculations in some way when restarting?)

Thanks for any help
Katharina
Neil Drummond
Posts: 113
Joined: Fri May 31, 2013 10:42 am
Location: Lancaster

Post by Neil Drummond »

Dear Katharina,

Sorry to hear about the memory problems.

a) CASINO allocates configurations when they are born and deallocates them when they die, so the memory requirements fluctuate in time. The details of memory management are handled by the compiler and we have to hope that it is sensible. Perhaps this is behind the issue?

b) If CASINO has completed writing out a config.out file (you can check using format_configs) then it should be safe to restart from it.

c) Restarting on a different machine is fine so long as the endianness is the same. If the endianness is different then CASINO should protest when it tries to read the bwfn.data.bin file. If endianness is a problem then you can use format_configs to produce an unambiguous, formatted version of config.in, which you can then unformat on the new machine by running format_configs again.
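
As a quick check before moving a restart to different nodes, the native byte order of each machine can be compared. The following is a minimal Python sketch, independent of CASINO; the function names are made up for illustration. It reports the native byte order and shows how the same eight bytes decode to different doubles under the two conventions, which is why an unformatted binary file written with one byte order cannot be read directly with the other.

```python
import struct
import sys


def native_byte_order():
    """Return 'little' or 'big' for the machine running this script."""
    return sys.byteorder


def decode_both_ways(raw):
    """Interpret 8 raw bytes as a double: first little-endian, then big-endian."""
    (as_little,) = struct.unpack("<d", raw)
    (as_big,) = struct.unpack(">d", raw)
    return as_little, as_big


if __name__ == "__main__":
    print("native byte order:", native_byte_order())
    # The value 1.5 as a little-endian machine would write it to disk.
    raw = struct.pack("<d", 1.5)
    as_little, as_big = decode_both_ways(raw)
    print("read back as little-endian:", as_little)
    print("read back as big-endian:   ", as_big)
```

Running the script on both the old and the new nodes and comparing the first line of output is enough. x86-64 processors, Haswell included, are little-endian, so a move between such nodes would not be expected to change the byte order.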

Best wishes,

Neil.

Post by Katharina Doblhoff »

Hi Neil!

Thank you for your fast reply; I will restart the calcs. But your answer to a.) does not sound reasonable to me: typically (and this is also what I see for the gamma-point calculation) the number of configs is highest during the first stage of equilibration, then goes down and then oscillates, but basically never reaches its initial high. So if I get through equilibration, why do I not get through the stats?

Thank you and all the best,
Katharina

Post by Neil Drummond »

Dear Katharina,

Thanks very much for reporting the issue - I agree there is a memory problem. It seems to have been introduced in patch 1d6d14ae, in which the call to dmc_annihilate_configs at the end of dmc_main was bypassed for the equilibration stage of a DMC calculation. This means that the data for the configuration population at the end of equilibration becomes detached and sits unused in memory during the subsequent statistics accumulation. If you start again with runtype=dmc_stats, you should have a bit more memory available. To fix the problem, replace

! Clear configs.
 if(iaccum.or.(isdmc.and..not.iaccum))then
  call dmc_annihilate_configs
 endif
with

! Clear configs.
 call dmc_annihilate_configs
in dmc.f90.
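
One plausible reading of why the runs survived equilibration and then died at different moments can be sketched with a toy Python model. This is not CASINO code, and every name and number in it is hypothetical. Memory is counted in units of one configuration: the live population fluctuates from step to step, and in the leaky case the population left over from equilibration stays allocated on top of it. The run fails at the first step where the total exceeds a fixed budget.

```python
import random


def first_failure_step(budget, leaked, target=1000, steps=640, seed=0):
    """Return the first step at which live + leaked configs exceed budget.

    Returns None if the run completes. All quantities are in units of one
    configuration; 'leaked' stands in for the detached equilibration
    population that is never freed.
    """
    rng = random.Random(seed)
    live = target
    for step in range(steps):
        # Crude stand-in for branching: configs are born and die at random.
        live += rng.randint(-20, 20)
        # Crude stand-in for population control: pull back towards target.
        live += (target - live) // 10
        if live + leaked > budget:
            return step
    return None


if __name__ == "__main__":
    # 640 steps mimics 8 blocks of 80 steps; the seed plays the role of the twist.
    for seed in range(4):
        fixed = first_failure_step(budget=2000, leaked=0, seed=seed)
        leaky = first_failure_step(budget=2000, leaked=950, seed=seed)
        print(f"seed {seed}: fixed code -> {fixed}, leaky code -> {leaky}")
```

With `leaked=0` the model never exceeds the budget, because the live population stays close to its target. With a leaked population held on top, the total sits just under the budget and an ordinary upward fluctuation can push it over, at a step that depends on the random sequence.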

At least this bug doesn't affect any results and, since nobody has noticed until just now, obviously hasn't caused too much inconvenience.

Thanks again,

Neil.
Mike Towler
Posts: 239
Joined: Thu May 30, 2013 11:03 pm
Location: Florence

Post by Mike Towler »

Well spotted! I've added Neil's fix to the main distribution available on the website.

M.

Post by Katharina Doblhoff »

Hi Neil!
Ah, now this makes sense! I guess it is very bad luck to really run into this situation! Thanks for spotting the problem!
All the best,
Katharina