Synopsis

    omrunner [OPTIONS] -f jdf

Description

The OMRunner uses SSH and DRMAA to submit co-allocated OpenMPI jobs to remote clusters. DRMAA provides a common interface to the autonomous local resource managers of the remote clusters. OpenMPI is an open-source, highly configurable MPI-2 implementation that is developed and maintained by a consortium of academic, research, and industry partners.

The OMRunner can select a fast interconnect when a job is submitted to multiple clusters on DAS-3. In most cases the high-speed Myri-10G interconnect is used. If the Delft cluster is selected, the Gigabit Ethernet interconnect is used instead.

In addition to OpenMPI jobs, the OMRunner can be used to submit other, non-co-allocated jobs to multiple remote clusters. Jobs compiled with other MPI implementations, such as MPICH, cannot be submitted with the OMRunner.

Options

    -flex            : the job request is flexible
    -optComm         : if possible, try to optimize communication
    -cm              : if possible, try to minimize the number of clusters used
    -x <clusters>    : comma-separated list of clusters not to be used
    -np <processes>  : number of processes to run per node
    -l <LEVEL>       : set the log4j output level <FATAL|ERROR|WARN|DEBUG>

Examples

The following are examples of running jobs with the OMRunner.

1. Simple co-allocated job execution

This example executes an MPI application that calculates pi and exits. The application has been compiled with OpenMPI on DAS-3.

    [hashim@fs3 JDFs]$ cat cpi-das3.jdf
    +
    ( &( count = "2" )
       ( directory = "/home/hashim/bin" )
       ( maxWallTime = "15" )
       ( executable = "cpi-ompi" )
    )
    ( &( count = "2" )
       ( directory = "/home/hashim/bin" )
       ( maxWallTime = "15" )
       ( executable = "cpi-ompi" )
    )

    [hashim@fs3 JDFs]$ omrunner -f cpi-das3.jdf
    Ksched - Assigned job ID 78755
    Ksched - Job 78755 Assigned LOW_PRIORITY
    Ksched - Reservation for component 1 succeed
    Ksched - Placed component 2 on fs3.das3.tudelft.nl
    Ksched - Placed component 1 on fs0.das3.cs.vu.nl
    Ksched - Reservation for component 2 succeed
    Runner - Submitting for execution component 1 to fs0.das3.cs.vu.nl
    Ksched - Claiming for processors for job 78755 begins
    Runner - Submitting for execution component 2 to fs3.das3.tudelft.nl
    DRMAA - Component2@ fs3.das3.tudelft.nl: QUEUED
    DRMAA - Component1@ fs0.das3.cs.vu.nl: QUEUED
    DRMAA - Component2@ fs3.das3.tudelft.nl: ACTIVE
    DRMAA - Component1@ fs0.das3.cs.vu.nl: ACTIVE
    Process 0 of 4 on node319
    Process 3 of 4 on node076
    Process 2 of 4 on node077
    Process 1 of 4 on node332
    pi is approximately 3.1415926544231239, Error is 0.0000000008333307
    wall clock time = 0.022639
    Runner - Job 78755 has completed successfully

Compare the output of the OMRunner with that of the KRunner to spot the differences.

2. Another co-allocated job example

In this example we execute the Poisson application, which implements a parallel iterative algorithm to find a discrete approximation to the solution of the two-dimensional Poisson equation on the unit square. The job request has four non-fixed components that together request 64 nodes. With the -np 2 switch, two processes run on each node, so the job runs on 128 cores.

    [hashim@fs3 JDFs]$ cat pois-ompi.jdf
    +
    ( &( count = "16" )
       ( directory = "/home/hashim/bin" )
       ( maxWallTime = "15" )
       ( executable = "/home/hashim/bin/Pois-ompi" )
       ( arguments = "16" "8" )
    )
    ( &( count = "16" )
       ( directory = "/home/hashim/bin" )
       ( maxWallTime = "15" )
       ( executable = "/home/hashim/bin/Pois-ompi" )
       ( arguments = "16" "8" )
    )
    ( &( count = "16" )
       ( directory = "/home/hashim/bin" )
       ( maxWallTime = "15" )
       ( executable = "/home/hashim/bin/Pois-ompi" )
       ( arguments = "16" "8" )
    )
    ( &( count = "16" )
       ( directory = "/home/hashim/bin" )
       ( maxWallTime = "15" )
       ( executable = "/home/hashim/bin/Pois-ompi" )
       ( arguments = "16" "8" )
    )

    [hashim@fs3 JDFs]$ omrunner -np 2 -f pois-ompi.jdf
    Ksched - Assigned job ID 78760
    Ksched - Job 78760 Assigned LOW_PRIORITY
    Ksched - Reservation for component 1 succeed
    Ksched - Reservation for component 2 succeed
    Ksched - Reservation for component 3 succeed
    Ksched - Reservation for component 4 succeed
    Ksched - Claiming for processors for job 78760 begins
    Ksched - Placed component 4 on fs0.das3.cs.vu.nl
    Ksched - Placed component 2 on fs3.das3.tudelft.nl
    Ksched - Placed component 1 on fs3.das3.tudelft.nl
    Ksched - Placed component 3 on fs2.das3.science.uva.nl
    Runner - Submitting for execution component 1 to fs3.das3.tudelft.nl
    Runner - Submitting for execution component 3 to fs2.das3.science.uva.nl
    Runner - Submitting for execution component 4 to fs0.das3.cs.vu.nl
    Runner - Submitting for execution component 2 to fs3.das3.tudelft.nl
    DRMAA - Component1@ fs3.das3.tudelft.nl: QUEUED
    DRMAA - Component2@ fs3.das3.tudelft.nl: QUEUED
    DRMAA - Component4@ fs0.das3.cs.vu.nl: QUEUED
    DRMAA - Component3@ fs2.das3.science.uva.nl: QUEUED
    DRMAA - Component1@ fs3.das3.tudelft.nl: ACTIVE
    DRMAA - Component2@ fs3.das3.tudelft.nl: ACTIVE
    DRMAA - Component4@ fs0.das3.cs.vu.nl: ACTIVE
    DRMAA - Component3@ fs2.das3.science.uva.nl: ACTIVE
    Iter.= 315 Proc. 0/128 : Elapsed total Wtime: 9.37 ( 99.7% CPU)
    Runner - Job 78760 has completed successfully
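
Example 2 repeats the same component four times, so a JDF of that shape can be generated instead of typed by hand. The following is a minimal Python sketch of that idea. It assumes that whitespace and line breaks between the parenthesised fields are not significant to the JDF parser, which the examples above do not confirm. The helper names jdf_component, build_jdf, and total_processes are hypothetical and are not part of the OMRunner. The last helper only reproduces the arithmetic of example 2: the sum of the component counts multiplied by the -np value.

```python
def jdf_component(count, directory, executable, max_wall_time, arguments=()):
    """Render one job component using the fields shown in the examples.

    Hypothetical helper: the field layout is copied from the sample JDFs,
    and the spacing is an assumption about what the parser accepts.
    """
    fields = [
        f'( count = "{count}" )',
        f'( directory = "{directory}" )',
        f'( maxWallTime = "{max_wall_time}" )',
        f'( executable = "{executable}" )',
    ]
    if arguments:
        # Each argument is quoted separately, as in ( arguments = "16" "8" ).
        quoted = " ".join(f'"{arg}"' for arg in arguments)
        fields.append(f"( arguments = {quoted} )")
    return "( &" + "\n   ".join(fields) + "\n)"


def build_jdf(components):
    """Join rendered components into a multi-component request led by '+'."""
    return "+\n" + "\n".join(components) + "\n"


def total_processes(counts, procs_per_node=1):
    """Total process count: requested nodes times the -np value."""
    return sum(counts) * procs_per_node


if __name__ == "__main__":
    component = jdf_component(
        count=16,
        directory="/home/hashim/bin",
        executable="/home/hashim/bin/Pois-ompi",
        max_wall_time=15,
        arguments=("16", "8"),
    )
    # Four identical components, as in pois-ompi.jdf.
    print(build_jdf([component] * 4), end="")
    # 4 components x 16 nodes x 2 processes per node.
    print("total processes:", total_processes([16] * 4, procs_per_node=2))
```

Redirecting the first part of the output to a file would give a request equivalent in content to pois-ompi.jdf. The printed total of 128 matches the "Proc. 0/128" line in the run above.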