Calc. dying with memory issues after exec. of several blocks
Posted: Tue Oct 25, 2016 11:55 am
Dear all and dear casino developers in particular,
I have nasty and big calculations running. The blip file is 50G. I have 64G RAM per node. My calculations started off fine (at least, I cannot see anything weird), but then all of them died with memory issues. The weird thing is that they all got through a couple of blocks (the entire equilibration and about 4 to 8 stats blocks with 80 steps per block) before they died. In spite of the fact that all jobs run the same calculation (difference is only in the twist), they all died at different moments in execution time. The calculation at the gamma point ran fine, but that of course does not need complex wfns.
So, basically I have three questions:
a.) How can this (dying after several blocks have been executed) happen? (This is a curiosity-driven question and is actually of minor immediate urgency.)
b.) Should my runs be fine so far? I.e., can I restart them using the config.out from the last block, possibly using nodes which have more memory, or using fewer blocks if that would help? Or do I have to fear that there were some severe memory spillages in the earlier blocks too?
c.) If restarting is fine, does it matter that the nodes with more memory aren't Haswell nodes while the others were? (It will affect runtime of course, but would it harm the calculations in some way when restarting?)
Thanks for any help
Katharina