KRunner - The default Globus runner

Synopsis

krunner [OPTIONS] -f jdf

Description

The KRunner is the default Globus runner of KOALA. It implements the most basic way of running a job on a grid and can be used for almost any kind of job, but it does not cater for the specific requirements that certain job types may have.

Options

-l <LEVEL>     : set log4j <FATAL|ERROR|WARN|DEBUG> output level
-g             : stage executable to the execution site
-flex          : the job request is flexible
-optComm       : if possible, try to optimize communication
-cm            : if possible, try to minimize the number of clusters used
-x <clusters>  : comma-separated list of clusters not to be used
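
The options listed above can be combined on one command line. The following Python sketch assembles such a command line from keyword arguments. It is only an illustration: the function name build_krunner_command and its parameter names are hypothetical and not part of KOALA, and the sketch only builds the argument list, it does not run krunner.

```python
# Hypothetical helper, not part of KOALA: builds a krunner argument list
# using only the options documented in the Options section.

LOG_LEVELS = ("FATAL", "ERROR", "WARN", "DEBUG")


def build_krunner_command(jdf_path, log_level=None, stage_executable=False,
                          flexible=False, optimize_comm=False,
                          minimize_clusters=False, excluded_clusters=()):
    """Return the krunner command as a list of arguments."""
    argv = ["krunner"]
    if log_level is not None:
        if log_level not in LOG_LEVELS:
            raise ValueError("unknown log level: %s" % log_level)
        argv += ["-l", log_level]
    if stage_executable:
        argv.append("-g")
    if flexible:
        argv.append("-flex")
    if optimize_comm:
        argv.append("-optComm")
    if minimize_clusters:
        argv.append("-cm")
    if excluded_clusters:
        # -x takes one comma-separated list of cluster names
        argv += ["-x", ",".join(excluded_clusters)]
    # The job description file is always passed with -f
    argv += ["-f", jdf_path]
    return argv


if __name__ == "__main__":
    cmd = build_krunner_command("uname-1.jdf", log_level="DEBUG",
                                excluded_clusters=["fs0.das3.cs.vu.nl"])
    print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run on a machine where krunner is installed.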

Examples

The following are examples of running jobs with the KRunner.

1. Simple single job execution.

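The first example is a very simple job that just executes "uname -n" and exits. The RSL for this job, given below, is stored in the file 'uname-1.jdf'. Since count is set to "5", five instances of the command are run, which is why five node names appear in the output. The job is started in the simplest way, by passing only the job description file to krunner.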

&
( directory = "/bin" )
( arguments = "-n" )
( executable = "uname" )
( maxWallTime = "15" )
( count = "5" )
 
[hashim@fs3 JDFs]$ krunner -f uname-1.jdf
Ksched - Assigned job ID 78624
Ksched - Job 78624 Assigned LOW_PRIORITY
Ksched - Reservation for component 1 succeed
Ksched - Placed component 1 on fs3.das3.tudelft.nl
Ksched - Claiming for processors for job 78624 begins
Runner - Submitting for execution component 1 to fs3.das3.tudelft.nl
GRAM - Component1 @ fs3.das3.tudelft.nl: PENDING
node358
node301
node362
node342
node310
GRAM - Component1 @ fs3.das3.tudelft.nl: DONE
Runner - Job 78624 has completed successfully

The KRunner sends a new job request to the Ksched, which is the KOALA scheduler. If the RSL is correct, the Ksched responds with a KOALA job ID and the priority level assigned to the job. After the job has been placed successfully, the Ksched informs the runner of the execution site selected for the component, in this case fs3.das3.tudelft.nl. At the predetermined job claiming time, the Ksched instructs the runner to start claiming processors for the job components. The runner then submits the job component to the selected execution site for execution. The lines node358, node301, node362, node342, and node310 are the standard output of uname -n, redirected from the nodes on which the command ran. The GRAM status messages are the state transitions reported by the local resource manager to show the progress of the job. Not every transition is necessarily printed: in the run above only PENDING and DONE appear. A successful job component goes through the following stages:

  • UNSUBMITTED
  • STAGE_IN
  • PENDING
  • ACTIVE
  • STAGE_OUT
  • DONE
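
As an illustration of how these stages relate to the runner output, the following Python sketch extracts the GRAM status lines from a log and checks that each component only moves forward through the stages listed above. The line format is taken from the example output in this document. The function name track_components is hypothetical, and rejecting any state that is not in the list is an assumption of this sketch, not documented KRunner behaviour.

```python
import re

# Stages of a successful job component, in order, as listed above.
GRAM_STATES = ["UNSUBMITTED", "STAGE_IN", "PENDING",
               "ACTIVE", "STAGE_OUT", "DONE"]

# Matches lines such as "GRAM - Component1 @ fs3.das3.tudelft.nl: PENDING"
STATUS_LINE = re.compile(r"^GRAM - Component(\d+) @ (\S+): (\w+)$")


def track_components(lines):
    """Return {component number: [states seen, in order]}.

    Raises ValueError if a state is unknown or if a component
    moves backwards through GRAM_STATES.
    """
    seen = {}
    for line in lines:
        match = STATUS_LINE.match(line.strip())
        if match is None:
            continue  # scheduler/runner messages and job stdout
        component = int(match.group(1))
        state = match.group(3)
        if state not in GRAM_STATES:
            raise ValueError("unexpected state: %s" % state)
        history = seen.setdefault(component, [])
        if history and GRAM_STATES.index(state) < GRAM_STATES.index(history[-1]):
            raise ValueError("component %d went from %s back to %s"
                             % (component, history[-1], state))
        history.append(state)
    return seen


if __name__ == "__main__":
    log = [
        "Runner - Submitting for execution component 1 to fs3.das3.tudelft.nl",
        "GRAM - Component1 @ fs3.das3.tudelft.nl: PENDING",
        "node358",
        "GRAM - Component1 @ fs3.das3.tudelft.nl: DONE",
    ]
    print(track_components(log))
```

Lines that do not match the status pattern, such as scheduler messages and the job's own standard output, are skipped.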

2. An MPI job execution

In this example we run an MPICH application that calculates pi. The job request, shown below, is semi-fixed and consists of two components: the first is fixed to an execution site through the "resourcemanagercontact" attribute, while the placement of the second is left to the scheduler. The standard output of the run is appended to the file out.dat. Note that the RSL also contains the "jobtype" attribute, which Globus GRAM requires for MPI jobs.

[hashim@fs3 JDFs]$ cat cpi-mpich.jdf
+
(
&( count = "2")
( directory = "/home/hashim/bin" )
(maxWallTime = "15" )
(jobtype = "mpi" )
(stdout = "out.dat")
( executable = "/home/hashim/bin/cpi.mpich" )
( resourcemanagercontact = "fs2.das3.science.uva.nl" )
)
(
&( count = "2")
( directory = "/home/hashim/bin" )
(maxWallTime = "15" )
(jobtype = "mpi" )
(stdout = "out.dat")
( executable = "/home/hashim/bin/cpi.mpich" )
)

[hashim@fs3 JDFs]$ krunner -f cpi-mpich.jdf
Ksched - Assigned job ID 78647
Ksched - Job 78647 Assigned LOW_PRIORITY
Ksched - Reservation for component 1 succeed
Ksched - Placed component 2 on fs0.das3.cs.vu.nl
Ksched - Reservation for component 2 succeed
Ksched - Placed component 1 on fs2.das3.science.uva.nl
Ksched - Claiming for processors for job 78647 begins
Runner - Submitting for execution component 2 to fs0.das3.cs.vu.nl
Runner - Submitting for execution component 1 to fs2.das3.science.uva.nl
GRAM - Component1 @ fs2.das3.science.uva.nl: STAGE_IN
GRAM - Component2 @ fs0.das3.cs.vu.nl: STAGE_IN
GRAM - Component2 @ fs0.das3.cs.vu.nl: PENDING
GRAM - Component1 @ fs2.das3.science.uva.nl: PENDING
GRAM - Component2 @ fs0.das3.cs.vu.nl: DONE
GRAM - Component1 @ fs2.das3.science.uva.nl: DONE
Runner - Job 78647 has completed successfully

[hashim@fs3 JDFs]$ more out.dat
Process 1 of 2 on node011.beowulf.cluster
Process 0 of 2 on node004.beowulf.cluster
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.000000
Process 1 of 2 on node218.beowulf.cluster
Process 0 of 2 on node230.beowulf.cluster
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.000000

The components are sent to fs2.das3.science.uva.nl, the site fixed in the job request, and to fs0.das3.cs.vu.nl. Since the KRunner does not support co-allocation, the two components are executed independently, and hence each produces its own output.
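
The job requests in both examples follow a regular layout: a single-component request starts with "&", while a multi-component request starts with "+" and wraps each component in parentheses. The Python sketch below generates text in that layout from dictionaries of attributes. The function names are hypothetical and not part of KOALA, and the sketch assumes that attribute values contain no double-quote characters, so it performs no escaping.

```python
def format_relations(attrs):
    """Format each attribute as one ( name = "value" ) line."""
    return "\n".join('( %s = "%s" )' % (name, value)
                     for name, value in attrs.items())


def format_job_request(components):
    """Return job description text for a list of attribute dicts.

    One component yields a plain "&" request, as in uname-1.jdf.
    Several components yield a "+" request, as in cpi-mpich.jdf.
    """
    if not components:
        raise ValueError("at least one component is required")
    if len(components) == 1:
        return "&\n" + format_relations(components[0]) + "\n"
    parts = ["+"]
    for attrs in components:
        parts.append("(\n&" + format_relations(attrs) + "\n)")
    return "\n".join(parts) + "\n"


if __name__ == "__main__":
    common = {
        "count": "2",
        "directory": "/home/hashim/bin",
        "maxWallTime": "15",
        "jobtype": "mpi",
        "stdout": "out.dat",
        "executable": "/home/hashim/bin/cpi.mpich",
    }
    # Only the first component is fixed to an execution site.
    fixed = dict(common, resourcemanagercontact="fs2.das3.science.uva.nl")
    print(format_job_request([fixed, common]), end="")
```

Leaving out resourcemanagercontact for a component, as in the second dictionary, leaves its placement to the scheduler.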