Re: Error PPOTS

Posted: Wed Jul 17, 2013 12:06 am
by Mike Towler
Also - it would help to have the exact error code so we know what the thing is whining about.

Find the following block of code in CASINO/src/ppots.f90

Code:

open(io_pp,file=psp_filename,status='old',iostat=ierr)
if(ierr/=0)call errstop('READ_PPOTS','Error opening '//trim(psp_filename)&
 &//' file.')

and insert the following line after the open statement:

Code:

if(ierr/=0)write(my_node+400,*)'ierr=',ierr ; call flush(my_node+400)

Recompile the code, then run it again in a case where you expect to see the error.
In fact, as you are up against a 48h time limit, you need to find a shorter run which exhibits the same error.

To do this set max_cpu_time = 10 min in the input file, then use 'runqmc --auto-continue' - this will force a restart after 10 minutes.

If you encounter the error, one or more files called e.g. fort.4xx should appear, containing something like 'ierr=123'. Could you let me know what the actual value is in place of 123?
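
For reference, here is a minimal standalone sketch of the same diagnostic written as a block IF. In the one-line form the `;` ends the IF statement, so `call flush` runs whether or not the open failed; that should be harmless, but the block form makes the intent explicit. The unit number, node index and file name are placeholder values chosen for illustration, not taken from CASINO, and the standard `flush` statement is used in place of `call flush`. Writing to a unit that was never opened relies on the compiler creating a file such as fort.400, which is processor-dependent behaviour.

```fortran
program open_diagnostic
  ! Standalone illustration only: io_pp, my_node and psp_filename mimic the
  ! names used in the thread, but their values here are hypothetical.
  implicit none
  integer, parameter :: io_pp = 20
  integer :: my_node, ierr
  character(len=80) :: psp_filename

  my_node = 0                       ! would be the MPI rank in a parallel run
  psp_filename = 'no_such_pp.data'  ! deliberately missing, to trigger the error

  open(io_pp, file=trim(psp_filename), status='old', iostat=ierr)
  if (ierr /= 0) then
    ! Both statements now execute only when the open fails.
    write(my_node+400, *) 'ierr=', ierr
    flush(my_node+400)              ! Fortran 2003 FLUSH statement
  else
    close(io_pp)
  end if
end program open_diagnostic
```

The value written after 'ierr=' is compiler-specific, so it has to be looked up in the documentation of the compiler that built the binary.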

M.

Re: Error PPOTS

Posted: Wed Jul 17, 2013 12:28 am
by elaheh
> Describe to me exactly the (presumably manual) process you use to restart the calculation.
To restart, for example, a DMC calculation, all the required files are present in my directory. I only change config.out to "config.in", newrun to "F" and runtype to "dmc_stats". Then I start it running again. Sometimes it runs without a problem, and occasionally it stops with the error I mentioned before.

It is not at all a big problem, but it is a nuisance when I submit my job, wait in a queue, and it then suddenly stops with the error.

Re: Error PPOTS

Posted: Wed Jul 17, 2013 7:30 am
by Mike Towler
> To restart, for example, a DMC calculation, all the required files are present in my directory. I only change config.out to "config.in", newrun to "F" and runtype to "dmc_stats". Then I start it running again. Sometimes it runs without a problem, and occasionally it stops with the error I mentioned before.
OK - that's fine.
> It is not at all a big problem
Yes it is, and we need to find out what's causing it. Let me have the error code when you find out what it is.

M.

Re: Error PPOTS

Posted: Wed Jul 17, 2013 7:53 pm
by elaheh
I am running other jobs on the cluster. If I change the code and recompile, will it affect those other jobs?

Elaheh.

Re: Error PPOTS

Posted: Wed Jul 17, 2013 8:37 pm
by Mike Towler
No problem - just use a different binary name.

Make the change to ppots.f90 then:

cd CASINO/src
make EXECUTABLE=casino_test

then use

runqmc --binary=casino_test -p 512 -T 48h -s etc..

Re: Error PPOTS

Posted: Wed Jul 17, 2013 11:48 pm
by elaheh
After making the following change in ppots.f90:
open(io_pp,file=psp_filename,status='old',iostat=ierr)
if(ierr/=0)write(my_node+400,*)'ierr=',ierr ; call flush(my_node+400)
if(ierr/=0)call errstop('READ_PPOTS','Error opening pseudopotential file ' &
&//trim(psp_filename)//'.')

and typing "make EXECUTABLE=casino_test", I get an error:
/home/polaris_lan1/lanem/CASINO/src/ppots.f90(172): error #6404: This name does not have a type, and must have an explicit type. [MY_NODE]
if(ierr/=0)write(my_node+400,*)'ierr=',ierr ; call flush(my_node+400)
------------------------^
compilation aborted for /home/polaris_lan1/lanem/CASINO/src/ppots.f90 (code 1)
make: *** [/home/polaris_lan1/lanem/CASINO/src/zlib/linuxpc-ifort-sge-parallel.polaris//opt/ppots.o] Error 1

Re: Error PPOTS

Posted: Thu Jul 18, 2013 6:19 am
by Mike Towler
Sorry.

At the top of the read_ppots routine in ppots.f90, change the line:

USE parallel, ONLY : am_master

to

USE parallel, ONLY : am_master,my_node

Re: Error PPOTS

Posted: Thu Jul 18, 2013 2:06 pm
by Neil Drummond
Dear Elaheh & Mike,

I have occasionally encountered this problem on the Polaris cluster in Leeds. The problem is not reproducible (a job that fails can simply be resubmitted). I don't think it is a problem with CASINO; it is more likely to be a problem with the filesystem on Polaris. We could perhaps reduce the chance of encountering the issue by reading the pseudopotentials on the master node and broadcasting.
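
As a rough illustration of the read-on-master-and-broadcast idea, here is a minimal standalone MPI sketch. It is not CASINO code: the file name, the file layout (a count followed by that many real values) and all variable names are assumptions made up for the example. Only the master rank touches the filesystem; every other rank receives the data through `mpi_bcast`.

```fortran
program bcast_ppot
  ! Hypothetical example: only rank 0 opens the file, then broadcasts its
  ! contents. File name and layout are invented for illustration.
  use mpi
  implicit none
  integer, parameter :: io_pp = 20, master = 0
  integer :: my_node, ierr, ios, n
  double precision, allocatable :: pp_data(:)

  call mpi_init(ierr)
  call mpi_comm_rank(mpi_comm_world, my_node, ierr)

  n = 0
  if (my_node == master) then
    ! Create a tiny data file so the sketch is self-contained.
    open(io_pp, file='example_pp.data', status='replace', action='write')
    write(io_pp, *) 3
    write(io_pp, *) 1.0d0, 2.0d0, 3.0d0
    close(io_pp)

    ! Only the master reads the file.
    open(io_pp, file='example_pp.data', status='old', action='read', iostat=ios)
    if (ios /= 0) call mpi_abort(mpi_comm_world, 1, ierr)
    read(io_pp, *) n
    allocate(pp_data(n))
    read(io_pp, *) pp_data
    close(io_pp)
  end if

  ! Send the size first so the other ranks can allocate, then the data.
  call mpi_bcast(n, 1, mpi_integer, master, mpi_comm_world, ierr)
  if (my_node /= master) allocate(pp_data(n))
  call mpi_bcast(pp_data, n, mpi_double_precision, master, mpi_comm_world, ierr)

  write(*, *) 'rank', my_node, 'has', n, 'values'
  call mpi_finalize(ierr)
end program bcast_ppot
```

With this pattern the number of processes opening the file at the same time drops from the full job size to one, which is the point of the suggestion.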

Best wishes,

Neil.

Re: Error PPOTS

Posted: Thu Jul 18, 2013 2:53 pm
by Mike Towler
> I have occasionally encountered this problem on the Polaris cluster in Leeds. The problem is not reproducible (a job that fails can simply be resubmitted). I don't think it is a problem with CASINO; it is more likely to be a problem with the filesystem on Polaris. We could perhaps reduce the chance of encountering the issue by reading the pseudopotentials on the master node and broadcasting.
Yes - the thought had occurred to me - but I was trying to be systematic about it (it took a while to establish exactly what the problem was and that nothing silly was going on).

However, given that Polaris has a Lustre file system and you're using the Intel Fortran compiler, I think the problem might be this:

http://www.nas.nasa.gov/hecc/support/kb ... n_276.html

In which case we ought to be able to cure it by putting "action=read" in the open statement (which we probably ought to be doing anyway).
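
A minimal standalone sketch of what such an open statement could look like is given here. It is not the actual CASINO patch: the unit number and file name are placeholders, and the program creates its own small file so that it can be run as-is. Without an ACTION specifier the access mode is left to the compiler, so stating `action='read'` makes the read-only intent explicit.

```fortran
program open_readonly
  ! Hypothetical example: io_pp and psp_filename echo the names in the thread,
  ! but the values are invented for illustration.
  implicit none
  integer, parameter :: io_pp = 20
  character(len=*), parameter :: psp_filename = 'example_pp.data'
  integer :: ierr
  character(len=16) :: act

  ! Create a small file so the example is self-contained.
  open(io_pp, file=psp_filename, status='replace', action='write', iostat=ierr)
  if (ierr /= 0) stop 'could not create example file'
  write(io_pp, '(a)') 'dummy pseudopotential data'
  close(io_pp)

  ! Open the existing file for reading only.
  open(io_pp, file=psp_filename, status='old', action='read', iostat=ierr)
  if (ierr /= 0) then
    write(*, *) 'Error opening ', psp_filename, ', ierr=', ierr
    stop 1
  end if

  ! Ask the runtime which access mode the unit was connected with.
  inquire(unit=io_pp, action=act)
  write(*, *) 'unit opened with action = ', trim(act)
  close(io_pp)
end program open_readonly
```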

I'll submit a patch with that fix and let's see how we get on.

M.

Re: Error PPOTS

Posted: Thu Jul 18, 2013 3:27 pm
by Mike Towler
OK - the current beta CASINO v2.13.95 (downloadable from the website) now contains this fix. If my suspicion is correct, your problem should go away. Let me know whether it does or not.

M.