Stale PBS jobs

defusco
Posts: 1
Joined: Thu Nov 07, 2013 1:48 pm

Stale PBS jobs

Post by defusco »

Hi,

We are running CASINO on our cluster with PBS Torque and the Moab scheduler. Quite frequently I see orphaned CASINO processes on nodes after a job has been deleted or crashed. I am working with the users in an attempt to determine if these jobs were deleted by hand, crashed or ran out of time.

Has anyone else seen this problem, or does anyone know why there might be slave processes left over when the master dies?


Thanks,
Albert DeFusco, Ph.D.
Research Assistant Professor
Technical Director, Center for Simulation and Modeling
University of Pittsburgh
Pittsburgh, PA 15260
412-648-3094
http://www.sam.pitt.edu
Mike Towler
Posts: 239
Joined: Thu May 30, 2013 11:03 pm
Location: Florence

Re: Stale PBS jobs

Post by Mike Towler »

Hi Albert,

Sorry to hear you're having problems.

First, standard information request: what version of CASINO are you using?

If the jobs were deleted by the user typing qdel or whatever, or by the job hitting an externally imposed time limit, then the scheduler should just kill everything. If it doesn't, then it's a bug in the scheduler and it's not our problem. :D

If the job is stopping because CASINO has detected an error, then the code can stop in one of two ways: either by calling mpi_finalize or by calling mpi_abort.

Calling mpi_finalize is preferred, because it stops the job neatly and without excessively verbose output, but it can only be done safely if the same error is guaranteed to be encountered on all MPI processes - then effectively each process 'shuts itself down'. If the error is only encountered on a subset of the MPI processes (let's say one for simplicity) then this single process must call mpi_abort, which not only stops the process on which it was called, but aggressively signals all the others to stop whatever they're doing and shut down too. This also unavoidably produces a lot of hideous warning messages in the CASINO output file, which are best avoided if possible.
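
If it helps, here is a minimal sketch of the two kinds of shutdown. This is a toy program of my own, not CASINO code: only mpi_init, mpi_comm_rank, mpi_finalize and mpi_abort are real MPI calls, and the 'bad_input' flag just stands in for whatever error is detected.

  program finalize_vs_abort
    use mpi
    implicit none
    integer :: ierr, rank
    logical :: bad_input

    call mpi_init(ierr)
    call mpi_comm_rank(mpi_comm_world, rank, ierr)

    ! Pretend every rank detects the same problem (e.g. a bad input keyword).
    bad_input = .true.

    if (bad_input) then
      ! All ranks reach this point, so each one can shut itself down
      ! cleanly and quietly with mpi_finalize.
      call mpi_finalize(ierr)
      stop
    end if

    ! If only ONE rank had detected the error, it would instead have to
    ! tear the whole job down, noisily, with:
    !   call mpi_abort(mpi_comm_world, 1, ierr)

    call mpi_finalize(ierr)
  end program finalize_vs_abort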

Let's say only the master process encounters an error and then calls mpi_finalize. The other processes don't know about this and just carry on until the next occurrence of a collective communication (such as averaging the energy over the cores), at which point they will stop and wait for the master to hit this call too - which it never does, because it has stopped. Hence the slave processes will hang forever - or until the scheduler kills them.
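
To make the hang concrete, here is another toy program (again not CASINO code; strictly speaking the fate of the lone mpi_finalize is implementation-dependent, but the slaves blocking in the collective is exactly the symptom you describe). Rank 0 stops alone, and every other rank waits forever in the mpi_allreduce:

  program orphaned_slaves
    use mpi
    implicit none
    integer :: ierr, rank
    double precision :: e_local, e_total

    call mpi_init(ierr)
    call mpi_comm_rank(mpi_comm_world, rank, ierr)
    e_local = dble(rank)

    if (rank == 0) then
      ! Only the master 'detects' an error and shuts itself down alone.
      write(*,*) 'Master: error detected, stopping.'
      call mpi_finalize(ierr)
      stop
    end if

    ! The slaves know nothing about this and carry on to the next
    ! collective (here, summing a local energy), where they wait for
    ! rank 0 forever - these are the orphaned processes on the nodes.
    call mpi_allreduce(e_local, e_total, 1, mpi_double_precision, &
                       mpi_sum, mpi_comm_world, ierr)

    call mpi_finalize(ierr)
  end program orphaned_slaves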

Now in CASINO there are formalized error routines called 'errstop' and 'errstop_master', which end up calling mpi_abort and mpi_finalize respectively. But it's easy for a developer either to be unaware of the difference, or simply to write the wrong variant of call errstop('ROUTINE','There was an error') out of habit, without checking whether the error is encountered by all the processes. I usually correct a couple of errors of this nature per year.
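
Schematically the two routines look something like the following. This is only a sketch of the idea, not the actual CASINO source:

  subroutine errstop(routine, message)
    ! Safe to call from any single process: tears down the whole job.
    use mpi
    implicit none
    character(len=*), intent(in) :: routine, message
    integer :: ierr
    write(*,*) 'ERROR in ', trim(routine), ': ', trim(message)
    call mpi_abort(mpi_comm_world, 1, ierr)
  end subroutine errstop

  subroutine errstop_master(routine, message)
    ! Only safe if ALL processes are guaranteed to call it: each one
    ! finalizes itself, and only the master prints the message.
    use mpi
    implicit none
    character(len=*), intent(in) :: routine, message
    integer :: ierr, rank
    call mpi_comm_rank(mpi_comm_world, rank, ierr)
    if (rank == 0) write(*,*) 'ERROR in ', trim(routine), ': ', trim(message)
    call mpi_finalize(ierr)
    stop
  end subroutine errstop_master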

So if it's CASINO's fault, then you have two options:

(1) Use the absolute latest version of the code, in which I may already have fixed the offending errstop call, or
(2) Tell me which error message appears in the CASINO output file, so I can check whether the wrong version of errstop is being used to produce it.

Best wishes,
Mike