The PAMELA Project


Introduction

The Pamela project aims at the development of a modeling methodology that enables a rapid development of low complexity, parameterized performance models (aka symbolic performance models) of parallel programs running on shared-memory as well as distributed-memory (vector) machines. The methodology is based on the use of the formalism Pamela, which is an acronym for PerformAnce ModEling LAnguage. Pamela is a process-oriented performance simulation language. After modeling the (parallel) program and machine in terms of a Pamela model, the source is compiled to a symbolic performance model.

Symbolic performance models offer analytic and diagnostic insight in the complex interplay between parallel program and machine. Moreover, symbolic performance modeling offers the potential of reduction of model evaluation cost. As parallel programs, machines and code/data mappings typically feature a large degree of regularity, the cost models are also regular, which offers the opportunity of a huge reduction of evaluation cost through symbolic simplification. However, the derivation of symbolic performance models is a labor-intensive and error-prone process. This is where Pamela comes in. As Pamela is intuitively close to the program or machine, modeling is relatively straightforward. As the subsequent derivation of the symbolic model is mechanized (through the Pamela compiler) symbolic performance modeling effort is significantly reduced.

Example

Here's a simple modeling example. Consider P workstations sharing one remote file server. The average time a workstation spends computing locally is t_l. The average workload the workstation requires of the file server is t_s. In Pamela the system would be modeled by the following equations:
    resource parameter fcfs(i)         % predefined FCFS array
    resource fs = fcfs(0,1)            % fs is an FCFS resource

    numeric parameter P                % # workstations
    numeric parameter N                % each WS loops N times
    numeric parameter t_l              % local compute time
    numeric parameter t_s              % service time

    process my_model = par (p = 1, P)
                           seq (i = 1, N) {
                               delay(t_l) ; use(fs,t_s)
                           }
Compiling the above source using the Pamela compiler produces a new Pamela model:
    numeric parameter P
    numeric parameter N
    numeric parameter t_l
    numeric parameter t_s
    numeric T_my_model = 
            max(((N * t_l) + (N * t_s)),max((t_s * (N * (P * [ 1 ])))))
    numeric phi_my_model = 
            ((N * t_l) + (N * t_s))
    numeric delta_my_model = 
            (t_s * (N * (P * [ 1 ])))
    numeric omega_my_model = 
            max((t_s * (N * (P * [ 1 ]))))
In this model T_my_model represents the symbolic performance model of the system. The other equations are generated for specific performance-diagnostic purposes (see the Pamela papers). As t_s, N, and P are scalars the T_my_model expression reduces to
    numeric T_my_model = N * max((t_l + t_s),(P * t_s))
This last transformation was done by hand. Soon however, also this optimization will be part of the Pamela compiler's symbolic expression simplification repertoire.

Why is the above symbolic performance model also expressed in Pamela? Typically, the above model is evaluated for various parameter settings. One could use a standard mathematical tool, of course. On the other hand, one could now also use the Pamela compiler itself, using the evaluation engine within the Pamela compiler that reduces (evaluates) numeric expressions. For instance, suppose one would want to study the execution time as a function of P for t_l = 10, t_s = 0.1, and N = 1,000,000. Then simply remove the parameter modifiers from the corresponding numeric equations and enter the numeric values, as in

    numeric parameter P
    numeric N = 1000000
    numeric t_l = 10
    numeric t_s = 0.1
    numeric T_my_model = 
            N * max((t_l + t_s),(P * t_s))
    numeric phi_my_model = 
            ((N * t_l) + (N * t_s))
    numeric delta_my_model = 
            (t_s * (N * P * [ 1 ]))
    numeric omega_my_model = 
            (t_s * (N * P))
Recompilation will trigger a number of additional simplifications, resulting in the following result.
    numeric parameter P
    numeric N = 1000000
    numeric t_l = 10
    numeric t_s = 0.1
    numeric T_my_model =
            (1000000 * max(1.010000e+01,(1.000000e-01 * P)))
    numeric phi_my_model =
            10100000
    numeric delta_my_model =
            (100000 * (P * [ 1 ]))
    numeric omega_my_model =
            (100000 * P)
This model evaluates even faster as 3 parameters (and a number of expressions) have been numerically reduced. In larger models (e.g., 50 parameters) this form of "parameter refinement" may yield considerable savings in evaluation cost.

As can be seen by the above example, the Pamela compiler supports symbolic modeling in a much broader sense than just translating process models to numeric models. It is an algebraic translation, simplification, and evaluation tool that applies to process algebras as well as ordinary algebras ("numeric" in terms of Pamela). All the model manipulations can be performed by Pamela is needed. An important part of the tool is the numeric and symbolic simplification engine. This (prototype) engine is still being extended.

Simple and Automatic

The example shows how Pamela can serve as a convenient modeling tool. As the language is, in fact, a (restricted) performance simulation language, modeling in terms of Pamela is much more intuitive than directly deriving the symbolic result. Once, the model is coded, deriving the symbolic execution model is merely a matter of compiling. Also note that, in contrast to some other analytic approaches, such as those using Petri nets, queuing networks, or process algebras, our model has a solution complexity of O(1)!

On top of this, the choice of the Pamela modeling language also allows for automatic performance modeling of parallel programs (or algorithms) that are expressed in a high-level parallel programming model, such as the data parallel programming model, where synchronization structure is limited to series-parallel (SP) structure. Apart from some program annotation, required as a result of the inherent undecidability problems in static program analysis, Pamela model generation from data parallel programs can be entirely mechanized. In practice this means that one can compile a data parallel program into a symbolic performance model and have it generate, e.g., a speedup plot in a matter of seconds! See our latest technical report on the results of such an automatic cost estimator in the papers section.

Trade-off

How come deriving symbolic performance models turns out to be so easy? Well, there's a catch, of course. Giving model development cost the highest priority, the Pamela approach trades accuracy for solution cost. This has two consequences. First, in order to enable automatic symbolic analysis, the synchronization structure of Pamela models must be series-parallel (SP) structured, introducing modeling restrictions that may affect modeling accuracy. Second, the symbolic predictions are essentially a lower bound on the actual execution time. Although theory and practice tells us this bound is limited to tens of percents (typically less), the prediction error is still much more than what one gets from simulating the Pamela model.

Utility

Despite these prediction errors, Pamela proves to be a helpful tool at the early development stages of high-performance algorithm and architecture development. We've performed many manual modeling case studies as well as recent modeling studies using a Pamela generator, which is an engine built within a compiler that translates data parallel programs for distributed-memory machines. The results of these experiments show that the average prediction error ranges around 10% to 20%. The accuracy is more than sufficient to provide a good assessment of program speedup potential and also allows a correct selection between alternative coding or data partitioning strategies. Especially for the data parallel program performance modeling experiments the results are compelling. With a minimum of program annotation effort, symbolic performance models are compiled in a matter of seconds, while for regular kernels such as ADI, matrix multiplication, Gaussian elimination, and parallel sorting (PSRS), the evaluation cost of the symbolic models ranges in the milliseconds! (again, see our latest technical report on the results of the automatic cost estimator in the papers section.)

Simulation or Analytic

As Pamela is a performance simulation language, simulation is also a way of obtaining performance predictions from a Pamela model. The following figure shows the entire modeling and analysis road map offered by Pamela. The upper road applies to symbolic performance modeling. The lower road applies to simulation.

For each route there is a (prototype) compiler available (we hope to merge them into one some day). In the simulation case, the Pamela compiler generates a (C) simulator where each parallel Pamela process is mapped onto a thread. While prediction accuracy will generally be much better compared to the symbolic model evaluation, prediction cost is huge compared to the analytic model. Let's take a look at the above file server example with P = 1,000 workstations executing applications that generate N = 1,000,000 service calls. On a 350 MHz Pentium II simulation takes in the order of 100,000 seconds (we estimated this number based on extrapolation). Evaluating the symbolic model, in contrast, only takes about 100 microseconds! Still, simulation can be of use when (1) more accuracy is required, and (2) when functional simulation is required. The latter feature is possible as Pamela allows C code to be inlined (simulation compiler only). The following example shows how a simulation model is derived of a sparse DSCAL vector routine running on a simple bus-based multiprocessor. The example also discusses the migration from simulation model to simulation performance model (the latter can be compiled to a symbolic performance model). While simulation is supported, the Pamela methodology is primarily focused at enabling symbolic performance modeling.

Publications

NOTE: The following material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.

There is a journal paper that summarizes the Pamela methodology:

The following technical report presents an application of Pamela within an automatic cost estimator for data parallel programs. This work has recently been performed within an ESPRIT II compiler research project called JOSES. The report provides an impression of what Pamela is currently capable of. Here are also a few conference papers that focus on various aspects of the Pamela methodology: There is also a PhD thesis on the Pamela methodology: The following technical reports describe the Pamela tools.

Software

WARNING: This section contains links to software that may be covered by copyright. Retrieving, copying, or distributing these files may violate copyright protection laws. We recommend that the user abides international law in accessing these links. As in many university research projects, our tools are developed to prove the concept, rather than to serve a commercial purpose. Hence, the tools are essentially prototypes. Of course, we like to give you the opportunity to experiment with them. Needless to say, however, the tools are provided as is without any warranty; without even the implied warranty of merchantibility or fitness for a particular purpose, and are intended for academic, non-commercial use.

Other Links

Pamela has been used or is being used at various Dutch universities, the TNO national research organization (modeling distributed, embedded systems), Philips research (architecture research), and the Indian Institute of Science (architecture research), The technology is used in research projects such as JOSES (EC Esprit, embedded systems), and Automap (Netherlands, high-performance computing). A stochastic extension of the methodology is currently being developed by PhD student Hasyim Gautama. His work will allow us to derive symbolic execution time distribution models. For instance, we predict the mean, variance, skewness, and kurtosis of the execution time of a sorting application as a symbolic function of the data set length at ultra-low cost. Being able to (symbolically) predict the effects of data dependent program execution on performance, is essential in time-critical (embedded) programs. Also check out this link for some of the research work in which Pamela is being used. The Pamela tools are also used in performance modeling courses at our department (in4078 MSc and DOI courses).
Arjan van Gemund <A.J.C.vanGemund@its.tudelft.nl>
Last modified: Oct 2001