OMRunner - The DRMAA job runner

Synopsis

omrunner [OPTIONS] -f jdf

Description

The OMRunner uses SSH and DRMAA to submit co-allocated OpenMPI jobs to remote clusters. DRMAA provides a common interface to the autonomous local resource managers of those clusters. OpenMPI is an open-source, highly configurable MPI-2 implementation that is developed and maintained by a consortium of academic, research, and industry partners.

When a job is submitted to multiple clusters on DAS-3, the OMRunner can select a fast interconnect for it. In most cases the high-speed Myri-10G interconnect is used; if the Delft cluster is selected, the Gigabit Ethernet interconnect is used instead.

In addition to OpenMPI jobs, the OMRunner can submit non-co-allocated jobs to multiple remote clusters. Jobs compiled with other MPI implementations, such as MPICH, cannot be submitted with the OMRunner.
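The interconnect rule described above can be written down as a small decision function. This is only an illustrative sketch, not the OMRunner's actual code: the function name is hypothetical, the Delft cluster is assumed to be identified by the head-node name fs3.das3.tudelft.nl that appears in the example output on this page, and the "in most cases" qualifier is simplified to "always".

```python
# Assumption: the Delft cluster is identified by this head-node name,
# taken from the example output on this page.
DELFT_CLUSTER = "fs3.das3.tudelft.nl"


def select_interconnect(clusters):
    """Return the interconnect the rule above would pick (hypothetical helper).

    Gigabit Ethernet if the Delft cluster is among the selected clusters,
    Myri-10G otherwise.
    """
    if DELFT_CLUSTER in clusters:
        return "Gigabit Ethernet"
    return "Myri-10G"


# Placement taken from example 1 below: one component at the VU, one in Delft.
print(select_interconnect(["fs0.das3.cs.vu.nl", "fs3.das3.tudelft.nl"]))
print(select_interconnect(["fs0.das3.cs.vu.nl", "fs2.das3.science.uva.nl"]))
```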

Options

-flex            : the job request is flexible
-optComm         : if possible, try to optimize communication
-cm              : if possible, try to minimize the number of clusters used
-x <clusters>    : comma-separated list of clusters not to be used
-np <processes>  : number of processes to run per node
-l <LEVEL>       : set the log4j output level (FATAL | ERROR | WARN | DEBUG)
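For scripted submissions, the options above can be assembled into a command line programmatically. The sketch below uses only the switches listed on this page; the helper name build_omrunner_command and its parameter names are hypothetical, and the exact form of cluster names accepted by -x is not specified here, so any names passed in are the caller's responsibility.

```python
from typing import List, Optional, Sequence

# Log levels listed for the -l option on this page.
LOG_LEVELS = ("FATAL", "ERROR", "WARN", "DEBUG")


def build_omrunner_command(
    jdf: str,
    flexible: bool = False,
    optimize_comm: bool = False,
    minimize_clusters: bool = False,
    exclude: Sequence[str] = (),
    procs_per_node: Optional[int] = None,
    log_level: Optional[str] = None,
) -> List[str]:
    """Build an argv list for omrunner (hypothetical helper)."""
    cmd = ["omrunner"]
    if flexible:
        cmd.append("-flex")
    if optimize_comm:
        cmd.append("-optComm")
    if minimize_clusters:
        cmd.append("-cm")
    if exclude:
        # -x takes a single comma-separated list of clusters to avoid.
        cmd += ["-x", ",".join(exclude)]
    if procs_per_node is not None:
        if procs_per_node < 1:
            raise ValueError("procs_per_node must be at least 1")
        cmd += ["-np", str(procs_per_node)]
    if log_level is not None:
        if log_level not in LOG_LEVELS:
            raise ValueError(f"unknown log level: {log_level}")
        cmd += ["-l", log_level]
    cmd += ["-f", jdf]
    return cmd


# Reproduces the invocation used in example 2 below.
print(" ".join(build_omrunner_command("pois-ompi.jdf", procs_per_node=2)))
```

The resulting list can be passed to subprocess.run on a machine where omrunner is installed.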

Examples

The following are examples of running jobs with the OMRunner.

1. Simple co-allocated job execution.

This example executes an MPI application that calculates pi and exits. The application has been compiled with OpenMPI on DAS-3.

 

[hashim@fs3 JDFs]$ cat  cpi-das3.jdf
+(
&( count = "2")
 ( directory = "/home/hashim/bin" )
 (maxWallTime = "15" )
 ( executable = "cpi-ompi" )
)
(&( count = "2")
 ( directory = "/home/hashim/bin" )
 (maxWallTime = "15" )
 ( executable = "cpi-ompi" )
)

 

[hashim@fs3 JDFs]$ omrunner -f  cpi-das3.jdf
Ksched -  Assigned job ID 78755
Ksched - Job 78755 Assigned  LOW_PRIORITY
Ksched - Reservation for component 1 succeed
Ksched - Placed component 2 on fs3.das3.tudelft.nl
Ksched - Placed component 1 on fs0.das3.cs.vu.nl
Ksched - Reservation for component 2 succeed
Runner - Submitting for execution component 1 to fs0.das3.cs.vu.nl
Ksched - Claiming for  processors for job 78755 begins
Runner - Submitting for execution component 2 to fs3.das3.tudelft.nl
DRMAA  - Component2@ fs3.das3.tudelft.nl: QUEUED
DRMAA  - Component1@ fs0.das3.cs.vu.nl: QUEUED
DRMAA  - Component2@ fs3.das3.tudelft.nl: ACTIVE
DRMAA  - Component1@ fs0.das3.cs.vu.nl: ACTIVE
Process 0 of 4 on node319
Process 3 of 4 on node076
Process 2 of 4 on node077
Process 1 of 4 on node332
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.022639
Runner - Job 78755 has completed successfully

 

Compare the output of the OMRunner and that of the KRunner to spot the differences.
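When comparing runs, it can help to reduce the output to the last reported state of each component. The sketch below parses the DRMAA status lines exactly as they appear in the output of example 1; the regular expression and the function name last_states are assumptions based on that sample, not on a documented log format.

```python
import re

# Matches lines such as:
#   DRMAA  - Component2@ fs3.das3.tudelft.nl: QUEUED
# The pattern is inferred from the sample output only.
STATUS_LINE = re.compile(r"^DRMAA\s+-\s+Component(\d+)@\s*(\S+):\s*(\w+)$")


def last_states(output: str) -> dict:
    """Map component number -> (host, last state seen) (hypothetical helper)."""
    states = {}
    for line in output.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match:
            component, host, state = match.groups()
            states[int(component)] = (host, state)
    return states


sample = """\
Runner - Submitting for execution component 2 to fs3.das3.tudelft.nl
DRMAA  - Component2@ fs3.das3.tudelft.nl: QUEUED
DRMAA  - Component1@ fs0.das3.cs.vu.nl: QUEUED
DRMAA  - Component2@ fs3.das3.tudelft.nl: ACTIVE
DRMAA  - Component1@ fs0.das3.cs.vu.nl: ACTIVE
"""

for component, (host, state) in sorted(last_states(sample).items()):
    print(component, host, state)
```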

2. Another co-allocated job example

In this example we execute the Poisson application, which implements a parallel iterative algorithm to find a discrete approximation to the solution of the two-dimensional Poisson equation on the unit square. The job request has four non-fixed components of 16 nodes each, 64 nodes in total. With the -np 2 switch, two processes run on each node, so the job runs on 128 cores.

 

[hashim@fs3 JDFs]$ cat pois-ompi.jdf
+
( &(count = "16")
   ( directory = "/home/hashim/bin")
   ( maxWallTime = "15" )
   ( executable = "/home/hashim/bin/Pois-ompi" )
   ( arguments = "16" "8" )
 )
( &(count = "16")
   ( directory = "/home/hashim/bin")
   ( maxWallTime = "15" )
   ( executable = "/home/hashim/bin/Pois-ompi" )
   ( arguments = "16" "8" )
 )
( &(count = "16")
   ( directory = "/home/hashim/bin")
   ( maxWallTime = "15" )
   ( executable = "/home/hashim/bin/Pois-ompi" )
   ( arguments = "16" "8" )
 )
( &(count = "16")
   ( directory = "/home/hashim/bin")
   ( maxWallTime = "15" )
   ( executable = "/home/hashim/bin/Pois-ompi" )
   ( arguments = "16" "8" )
 )

 

[hashim@fs3 JDFs]$ omrunner -np 2 -f  pois-ompi.jdf
Ksched -  Assigned job ID 78760
Ksched - Job 78760 Assigned  LOW_PRIORITY
Ksched - Reservation for component 1 succeed
Ksched - Reservation for component 2 succeed
Ksched - Reservation for component 3 succeed
Ksched - Reservation for component 4 succeed
Ksched - Claiming for  processors for job 78760 begins
Ksched - Placed component 4 on fs0.das3.cs.vu.nl
Ksched - Placed component 2 on fs3.das3.tudelft.nl
Ksched - Placed component 1 on fs3.das3.tudelft.nl
Ksched - Placed component 3 on fs2.das3.science.uva.nl
Runner - Submitting for execution component 1 to fs3.das3.tudelft.nl
Runner - Submitting for execution component 3 to fs2.das3.science.uva.nl
Runner - Submitting for execution component 4 to fs0.das3.cs.vu.nl
Runner - Submitting for execution component 2 to fs3.das3.tudelft.nl
DRMAA  - Component1@ fs3.das3.tudelft.nl: QUEUED
DRMAA  - Component2@ fs3.das3.tudelft.nl: QUEUED
DRMAA  - Component4@ fs0.das3.cs.vu.nl: QUEUED
DRMAA  - Component3@ fs2.das3.science.uva.nl: QUEUED
DRMAA  - Component1@ fs3.das3.tudelft.nl: ACTIVE
DRMAA  - Component2@ fs3.das3.tudelft.nl: ACTIVE
DRMAA  - Component4@ fs0.das3.cs.vu.nl: ACTIVE
DRMAA  - Component3@ fs2.das3.science.uva.nl: ACTIVE
Iter.= 315 Proc. 0/128 : Elapsed total Wtime: 9.37  ( 99.7% CPU)
Runner - Job 78760 has completed successfully
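The process count in the output above (Proc. 0/128) follows from the component counts in the JDF multiplied by the -np value. A minimal sketch of that arithmetic, assuming count gives the number of nodes per component and that omitting -np means one process per node (consistent with example 1, where two components of 2 nodes yield 4 processes); the function names are hypothetical:

```python
import re


def component_counts(jdf_text: str) -> list:
    """Return the count value of every component in a JDF, in file order."""
    return [int(n) for n in re.findall(r'count\s*=\s*"(\d+)"', jdf_text)]


def total_processes(jdf_text: str, procs_per_node: int = 1) -> int:
    """Nodes requested over all components, times processes per node."""
    return sum(component_counts(jdf_text)) * procs_per_node


# One component of pois-ompi.jdf, repeated four times as in example 2.
component = '''( &(count = "16")
   ( directory = "/home/hashim/bin")
   ( maxWallTime = "15" )
   ( executable = "/home/hashim/bin/Pois-ompi" )
   ( arguments = "16" "8" )
 )
'''
pois_jdf = "+\n" + component * 4

print(component_counts(pois_jdf))      # [16, 16, 16, 16]
print(total_processes(pois_jdf, 2))    # 128
```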