Calc. dying with memory issues after exec. of several blocks


Post by Katharina Doblhoff » Tue Oct 25, 2016 11:55 am

Dear all and dear casino developers in particular,

I have nasty and big calculations running. The blip file is 50G. I have 64G RAM per node. My calculations started off fine (at least, I cannot see anything weird), but then all of them died with memory issues. The weird thing is that they all got through a couple of blocks (the entire equilibration and about 4 to 8 stats blocks with 80 steps per block) before they died. In spite of the fact that all jobs run the same calculation (difference is only in the twist), they all died at different moments in execution time. The calculation at the gamma point ran fine, but that of course does not need complex wfns.
So, basically I have three questions:
a.) How can this (dying after several blocks have been executed) happen? (This is a curiosity-driven question and is actually of minor immediate urgency.)
b.) Should my runs be fine so far? I.e., can I restart them using the config.out from the last block, possibly using nodes which have more memory or using fewer blocks if that would help? Or do I have to fear that there were some severe memory spillages in the earlier blocks too?
c.) If restarting is fine, does it matter that these nodes aren't Haswell nodes while the others were? (It will affect runtime of course, but would it harm the calculations in some way when restarting?)

Thanks for any help
Katharina

Re: Calc. dying with memory issues after exec. of several bl

Post by Neil Drummond » Tue Oct 25, 2016 12:28 pm

Dear Katharina,

Sorry to hear about the memory problems.

a) CASINO allocates configurations when they are born and deallocates them when they die, so the memory requirements fluctuate in time. The details of memory management are handled by the compiler and we have to hope that it is sensible. Perhaps this is behind the issue?

b) If CASINO has completed writing out a config.out file (you can check using format_configs) then it should be safe to restart from it.

c) Restarting on a different machine is fine so long as the endianness is the same. If the endianness is different then CASINO should protest when it tries to read the bwfn.data.bin file. If endianness is a problem then you can use format_configs to produce an unambiguous, formatted version of config.in, which you can then unformat on the new machine by running format_configs again.
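
To compare byte order on the old and new machines before moving unformatted files, a small stand-alone Fortran test can help. This is only a sketch and is not part of CASINO; the program name check_endianness is made up for illustration.

```fortran
! Hypothetical stand-alone check; not part of the CASINO distribution.
program check_endianness
  use iso_fortran_env, only: int8, int32
  implicit none
  integer(int8) :: bytes(4)

  ! Reinterpret the four bytes of the 32-bit integer 1 as single bytes.
  bytes = transfer(1_int32, bytes)

  ! On a little-endian machine the least significant byte is stored first.
  if (bytes(1) == 1_int8) then
    print '(a)', 'little-endian'
  else
    print '(a)', 'big-endian'
  end if
end program check_endianness
```

Compile it with any Fortran 2008 compiler (for example gfortran) and run it on both machines. If the two answers agree, the endianness is the same.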

Best wishes,

Neil.

Re: Calc. dying with memory issues after exec. of several bl

Post by Katharina Doblhoff » Tue Oct 25, 2016 1:11 pm

Hi Neil!

Thank you for your fast reply; I will restart the calcs. But your answer to a.) does not sound reasonable to me: typically (and this is also what I see for the gamma-point calculation) the number of configs is highest during the first stage of equilibration, then goes down and then oscillates, but basically never reaches its initial high. So if I get through equilibration, why do I not get through the stats?

Thank you and all the best,
Katharina

Re: Calc. dying with memory issues after exec. of several bl

Post by Neil Drummond » Tue Oct 25, 2016 4:51 pm

Dear Katharina,

Thanks very much for reporting the issue - I agree there is a memory problem. It seems to have been introduced in patch 1d6d14ae, in which the call to dmc_annihilate_configs at the end of dmc_main was bypassed for the equilibration stage of a DMC calculation. This means that the data for the configuration population at the end of equilibration becomes detached and sits unused in memory during the subsequent statistics accumulation. If you start again with runtype=dmc_stats, you should have a bit more memory available. To fix the problem, replace

Code:
! Clear configs.
 if(iaccum.or.(isdmc.and..not.iaccum))then
  call dmc_annihilate_configs
 endif


with

Code:
! Clear configs.
 call dmc_annihilate_configs


in dmc.f90.
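
As a rough illustration of when the original test skips the cleanup, the sketch below tabulates the old logical condition for every combination of the two flags. It assumes nothing about CASINO beyond the expression quoted above; the program name and loop variables are hypothetical, and what iaccum and isdmc mean inside dmc.f90 is not modelled here.

```fortran
! Hypothetical stand-alone sketch; not part of CASINO.
program annihilate_condition
  implicit none
  logical, parameter :: vals(2) = [.false., .true.]
  logical :: iaccum, isdmc, old_cond
  integer :: i, j

  print '(a)', ' iaccum  isdmc  old_cond'
  do i = 1, 2
    do j = 1, 2
      iaccum = vals(i)
      isdmc = vals(j)
      ! Condition copied from the original snippet in dmc.f90.
      old_cond = iaccum .or. (isdmc .and. .not. iaccum)
      ! old_cond is true when dmc_annihilate_configs would have been called.
      print '(3l7)', iaccum, isdmc, old_cond
    end do
  end do
end program annihilate_condition
```

Since a .or. (b .and. .not. a) is logically the same as a .or. b, the old test calls dmc_annihilate_configs unless both flags are false, which is presumably the equilibration case described above. The replacement removes the test altogether, so the configs are always cleared.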

At least this bug doesn't affect any results and, since nobody has noticed until just now, obviously hasn't caused too much inconvenience.

Thanks again,

Neil.

Re: Calc. dying with memory issues after exec. of several bl

Post by Mike Towler » Wed Oct 26, 2016 9:07 am

Well spotted! I've added Neil's fix to the main distribution available on the website.

M.

Re: Calc. dying with memory issues after exec. of several bl

Post by Katharina Doblhoff » Thu Oct 27, 2016 6:43 am

Hi Neil!
Ah, now this makes sense! I guess it is very bad luck to really run into this situation! Thanks for spotting the problem!
All the best,
Katharina

