NCORE parameter

Kayahan
Posts: 22
Joined: Fri Jun 07, 2013 5:56 pm

NCORE parameter

Post by Kayahan »

Hello CASINO Users,

I was trying to run CASINO on Hopper, one of the computers at NERSC, but I ran into the following problem. I used the arch file provided in the distribution and had no problems compiling. Here is the output when I use the runqmc command:

Code:

runqmc -p 96 -n 4 -T 1h -v
Loading tags from linuxpc-path-pbs-parallel.hopper-nersc.arch
TYPE of machine is 'cluster'
Have set SUBMIT_SCRIPT='qsub &SCRIPT&'
CORES_PER_NODE_CLUSTER is defined: CORES_PER_NODE overridden
Evaluated CORES_PER_NODE=24
Evaluated NPROC=96
Evaluated TPP=1
Evaluated PPN=24
Evaluated NNODE=4

ERROR: NCORE parameter required on this machine but could not be deduced from
       input. Please provide more input.

NNODE*CORES_PER_NODE should give NCORE automatically, and the manual says NCORE is a redundant parameter, so I couldn't figure out why it causes this problem.
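
To spell out the arithmetic I expected, using the evaluated values from the verbose output above:

Code:

# NCORE should follow from the evaluated values:
#   NCORE = NNODE * CORES_PER_NODE = 4 * 24 = 96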

Thanks,
Kayahan
Mike Towler
Posts: 239
Joined: Thu May 30, 2013 11:03 pm
Location: Florence

Re: NCORE parameter

Post by Mike Towler »

Hi Kayahan,

There are two relevant parameters in the arch file: CORES_PER_NODE and CORES_PER_NODE_CLUSTER. The latter is meant for the (slightly unusual) situation where you have login nodes and compute nodes with differing numbers of cores. If CORES_PER_NODE_CLUSTER is present in the arch file, it overrides CORES_PER_NODE when deciding how to run the job (as it says in the verbose output you're getting because you've specified the -v flag).

In the arch file arch/data/machine/hopper-nersc.arch, CORES_PER_NODE_CLUSTER is defined as 12 (not 24).
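
In sketch form (single-line tag syntax inferred by analogy with the multi-line tags; 16 is the login-node value the script reports), the relevant lines of that file look something like:

Code:

#-! CORES_PER_NODE: 16
#-! CORES_PER_NODE_CLUSTER: 12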

Your runqmc flags define the total number of cores as 96 and the number of nodes as 4, i.e. 24 cores per node, which is obviously incompatible with 12 cores per node.

So run it with either 'runqmc -p 96' or 'runqmc -n 8', or change CORES_PER_NODE_CLUSTER in the arch file.
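
That is, with the 12 cores per node defined in the distributed arch file, either of these requests is self-consistent:

Code:

runqmc -p 96 -T 1h   # NNODE deduced as 96/12 = 8
runqmc -n 8 -T 1h    # NPROC deduced as 8*12 = 96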

Cheers,
Mike
Kayahan
Posts: 22
Joined: Fri Jun 07, 2013 5:56 pm

Re: NCORE parameter

Post by Kayahan »

Thanks, Mike, for the quick reply. You are right that the CORES_PER_NODE_CLUSTER parameter is set to 12 in the original file; I changed it to 24 because there are actually 24 cores in Hopper's compute nodes. I forgot to mention that in my previous post.

http://www.nersc.gov/users/computationa ... ute-nodes/

However, the same problem existed before I changed that parameter. Here I have restored the hopper-nersc.arch file in the ~/arch/data/machines directory to its original state; the output of the code is:

Code:

runqmc -p 96 -T 1h -v                                                                                                                                                                                    
Loading tags from linuxpc-path-pbs-parallel.hopper-nersc.arch
TYPE of machine is 'cluster'
Have set SUBMIT_SCRIPT='qsub &SCRIPT&'
CORES_PER_NODE_CLUSTER is defined: CORES_PER_NODE overridden
Evaluated CORES_PER_NODE=12
Evaluated NPROC=96
Evaluated TPP=1
Evaluated PPN=12
Evaluated NNODE=8

ERROR: NCORE parameter required on this machine but could not be deduced from
       input. Please provide more input.

Thanks,
Kayahan
Mike Towler
Posts: 239
Joined: Thu May 30, 2013 11:03 pm
Location: Florence

Re: NCORE parameter

Post by Mike Towler »

OK. If what you say is correct, then that shouldn't happen.

Can you post the output if you change -v to --verbosity=5 in the list of runqmc flags?

What version of CASINO is this?

M.
Kayahan
Posts: 22
Joined: Fri Jun 07, 2013 5:56 pm

Re: NCORE parameter

Post by Kayahan »

Here is the output with the verbosity set to 5. The CASINO version is 2.12.1.

Code:

runqmc -p 96 -T 1h --verbosity=5                            
Loading tags from linuxpc-path-pbs-parallel.hopper-nersc.arch
TYPE of machine is 'cluster'
Have set SUBMIT_SCRIPT='qsub &SCRIPT&'
Dependency tree:
tags[4] = user_allowed_QUEUE user_default_QUEUE user_max_QUEUE user_min_QUEUE
vars[3] = USER.QUEUE
tags[3] = ALLOWED_NCORE ALLOWED_NNODE CORES_PER_NODE CORES_PER_NODE_CLUSTER
 MAX_NCORE MAX_NNODE MIN_NCORE MIN_NNODE user_allowed_ACCOUNT
 user_default_ACCOUNT user_max_ACCOUNT user_min_ACCOUNT
vars[2] = META.RUN_TOPOLOGY USER.ACCOUNT
tags[2] = ALLOWED_WALLTIME internal_ACCOUNT_LINE MAX_CORETIME MAX_WALLTIME
 MIN_CORETIME MIN_WALLTIME TIME_FORMAT WALLTIME_CODES
vars[1] = BINARY BINARY_ARGS INTERNAL.ACCOUNT_LINE OUT SCRIPT WALLTIME
tags[1] = SCRIPT_HEAD SCRIPT_RUN SUBMIT_SCRIPT
Evaluated user_allowed_QUEUE=''
Evaluated user_default_QUEUE='regular'
Evaluated user_max_QUEUE=''
Evaluated user_min_QUEUE=''
Evaluated QUEUE='regular'
Evaluated ALLOWED_NCORE=''
Evaluated ALLOWED_NNODE=''
Evaluated CORES_PER_NODE='16'
Evaluated CORES_PER_NODE_CLUSTER='24'
Evaluated MAX_NCORE=''
Evaluated MAX_NNODE=''
Evaluated MIN_NCORE=''
Evaluated MIN_NNODE=''
Evaluated user_allowed_ACCOUNT=''
Evaluated user_default_ACCOUNT='CASINO_UNSET'
Evaluated user_max_ACCOUNT=''
Evaluated user_min_ACCOUNT=''
CORES_PER_NODE_CLUSTER is defined: CORES_PER_NODE overridden
Evaluated CORES_PER_NODE=24
Made var_NPROC_TOTAL=96 as var_NJOB=1 and var_NPROC=96
Made nthread=96 as var_TPP=1 and var_NPROC=96
Made nthread_total=96 as var_NJOB=1 and nthread=96
Implicit topology assumption (T1): TPN=24
Made var_PPN=24 as var_TPN=24 and var_TPP=1
Made var_NNODE=4 as var_NPROC=96 and var_PPN=24
Made var_NNODE_TOTAL=4 as var_NJOB=1 and var_NNODE=4
Evaluated NPROC=96
Evaluated TPP=1
Evaluated PPN=24
Evaluated NNODE=4

ERROR: NCORE parameter required on this machine but could not be deduced from
       input. Please provide more input.
Thanks,
Kayahan
Mike Towler
Posts: 239
Joined: Thu May 30, 2013 11:03 pm
Location: Florence

Re: NCORE parameter

Post by Mike Towler »

OK, the runqmc script is smart enough to know that it needs the value of NCORE because whoever wrote your arch file included some multi-line bash scripting to define the maximum wall time (MAX_WALLTIME) as follows:

Code:

#-! *MAX_WALLTIME:
#-!  case "&USER.QUEUE&" in
#-!  interactive) echo 30m ;;
#-!  debug) echo 30m ;;
#-!  premium) echo 12h ;;
#-!  regular)
#-!   if ((&NCORE&<12288)) ; then
#-!    echo 24h
#-!   elif ((&NCORE&<49152)) ; then
#-!    echo 24h
#-!   elif ((&NCORE&<98304)) ; then
#-!    echo 24h
#-!   else
#-!    echo 12h
#-!   fi ;;
#-!  low) echo 12h ;;
#-!  esac

You may think that you are defining NCORE specifically with 'runqmc -p 96', but -p actually gives the 'number of MPI processes' NPROC -- as you can see from the verbose output -- and this is not necessarily the same as the number of reserved physical cores. Parallel machines are so much more complicated these days! :D
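
As a sketch of the bookkeeping (assuming one physical core per thread):

Code:

# With TPP=1 (one thread per MPI process), as in your output:
#   cores needed = NPROC * TPP = 96 * 1 = 96
# With e.g. TPP=2, the same 96 MPI processes would reserve
# twice as many physical cores:
#   cores needed = NPROC * TPP = 96 * 2 = 192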

You may also think that you have given enough information for it to work out NCORE, but strictly speaking you haven't. If you give it the number of MPI processes NPROC (a theoretical characteristic of the requested job) and CORES_PER_NODE_CLUSTER (a physical characteristic of the machine), you would also need to specify the number of MPI processes per node through the --ppn flag. That said, there should probably be a default of assuming processes per node = cores per node, and I'm not entirely sure why it isn't doing something like that here. The next time we have a rainy day I might plough through the hideously complicated scripts that implement this and try to figure out exactly what they're doing and why.
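
So, for example, something like this should pin the topology down completely (assuming --ppn takes the same '--flag=value' form as --verbosity):

Code:

runqmc -p 96 --ppn=24 -T 1h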

In the meantime, you can probably fix your problem by rewriting the MAX_WALLTIME definition in terms of variables that the arch file is aware of, e.g. by changing the NCORE-dependent bit of it to:

Code:

#-!   if ((&NNODE&<512)) ; then
#-!    echo 24h
#-!   elif ((&NNODE&<2048)) ; then
#-!    echo 24h
#-!   elif ((&NNODE&<4096)) ; then
#-!    echo 24h
#-!   else
#-!    echo 12h
#-!   fi ;;

noting that the if block can be simplified, since all but one of the branches end up defining MAX_WALLTIME as '24h' (see the sketch below). As previously discussed, you also need to change CORES_PER_NODE_CLUSTER from 12 to the correct value of 24.
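
Something along these lines should be equivalent (a sketch, untested):

Code:

#-!   if ((&NNODE&<4096)) ; then
#-!    echo 24h
#-!   else
#-!    echo 12h
#-!   fi ;;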

Let me know if that works.

M.
Kayahan
Posts: 22
Joined: Fri Jun 07, 2013 5:56 pm

Re: NCORE parameter

Post by Kayahan »

Thanks a lot for the clear explanation; it is working well.

Best,
Kayahan
Mike Towler
Posts: 239
Joined: Thu May 30, 2013 11:03 pm
Location: Florence

Re: NCORE parameter

Post by Mike Towler »

Good. I've updated the hopper arch file in the distribution (CASINO current beta 2.13.331).

Note that I've also fixed some other errors in the maximum number of nodes allowed in different queues (assuming the online documentation is correct).

M.
Kayahan
Posts: 22
Joined: Fri Jun 07, 2013 5:56 pm

Re: NCORE parameter

Post by Kayahan »

The same arch file also works for Edison, though I haven't checked whether there might be minor incompatibilities in the queue setup etc.

Kayahan
Mike Towler
Posts: 239
Joined: Thu May 30, 2013 11:03 pm
Location: Florence

Re: NCORE parameter

Post by Mike Towler »

Well, if you can be bothered, it's probably worth making a separate arch file for Edison.

In particular you would need to change the HOSTNAME tag (or the automatic detection facility of the install script won't work).
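
That is, something along these lines (a sketch; the single-line tag form is inferred by analogy with the other tags, and the value should be whatever 'hostname' actually reports):

Code:

#-! HOSTNAME: edison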

If it has different numbers of nodes, different queue structures, maximum walltimes etc., then you should change the runtime stuff at the top of the arch file. If you let me know the output of the 'hostname' command on Edison, I'm happy to do this for you.