The purpose of this lecture is to spend a little time talking about the scheduling of concurrent processes for performance. The term scheduling is often used to mean different things (much like the terms task, or thread). We'll start with a few definitions, most of which are widely accepted. As always in the systems end of the computer science pool, your mileage may vary.
scheduling: the assignment of application or program components to computer, communication, or storage resources at a specified point in time for a specified duration
partitioning: the decomposition of a program or application into sequential components and the communication requirements that are necessary to ensure that they correctly implement the desired semantics
resource allocation: the acquisition of resources by a user or program scheduler
For the most part, these terms seem fairly straight-forward and distinct, but you'd be surprised how often they are misused or interchanged.
performance as an objective function -- Formally, the process of scheduling involves the optimization of some objective function. In this class, we will study scheduling using objective functions that have quantifiable metrics. We will broadly define these metrics as the performance characteristics of the application.
For example, much of the Computational Grid community is concerned with minimizing the execution time of an application. The quantifiable metric is application execution time (which is described by some objective function) and it is the goal of the scheduler to minimize this function. It seems obvious, doesn't it? Stay tuned.
Before we talk more extensively about scheduling, it is important to note that there are many ways to schedule programs (and other things) that do not involve, or only partially involve quantifiable metrics. For example, researchers are sometimes funded to use a particular set of machines even if a better alternative is available. It is possible to consider these part of a multi-dimensional objective function in which these dimensions have 1 or 0 values. For example, you could form some function that returns two values:
(a,b) = schedule(program)
such that
a = execution time in seconds
b = binary variable indicating that the SDSC Blue Meanie is used (1) or is not used (0)
Your objective would be to minimize the value of a and maximize the value of b. Alternatively, you can specify b as a constraint. There are many formal optimization methods that will optimize a given objective function subject to a set of constraints.
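To make the "b as a constraint" idea concrete, here is a minimal sketch in C. The struct candidate type, its field names, and pick_schedule() are hypothetical names invented for illustration. The function scans a list of candidate schedules and returns the one with the smallest a among those that satisfy the boolean constraint.

```c
#include <assert.h>
#include <stddef.h>

/* One candidate schedule, described by the (a,b) pair from the text. */
struct candidate {
    double exec_seconds;        /* a: predicted execution time */
    int uses_required_machine;  /* b: 1 if the required machine is used */
};

/*
 * Return the index of the candidate with the smallest execution time.
 * If require_machine is nonzero, b is treated as a hard constraint and
 * candidates with b == 0 are skipped.  Returns -1 if nothing qualifies.
 */
int pick_schedule(const struct candidate *c, size_t n, int require_machine)
{
    int best = -1;
    size_t i;

    assert(c != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (require_machine && !c[i].uses_required_machine)
            continue;
        if (best < 0 || c[i].exec_seconds < c[best].exec_seconds)
            best = (int)i;
    }
    return best;
}
```

Called with require_machine set to 0, the function simply minimizes a. Called with 1, the fastest schedule overall may be rejected in favor of a slower one that honors the constraint.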
qualitative metrics -- Sometimes, though, you'd like to schedule subject to "metrics" that are fundamentally or just presently qualitative. Security is an excellent example of a qualitative metric. You might like to schedule your program so that it only uses "secure" resources but what are the units of security? How can you measure the security of a system? The security research community grapples with this problem mightily and will some day, no doubt, come up with a measurement theory. At present, though, the current practice is to express the security of a system in terms of a tuple of boolean variables, each of which indicates that a particular security measure is either present or absent. Is the machine behind a firewall? Does the machine allow clear text passwords? Does the machine run the Apache web server? You might be able to make interesting statistical arguments about measuring security. For example, it is probably true empirically that most machines are secure in inverse proportion to the number of people who have legal logins on them. This statement A) may not be useful, B) is statistical, certainly, and C) does not imply a causal relationship. It is fun to think about, though.
But I digress.
We'll leave the realm of qualities behind, at this point. You should know it is there, but for our purposes, scheduling will involve quantifiable metrics and may be subject to boolean constraints.
bash$ time ./test_rand_matrix -r 200 -c 200 -i 1 > /dev/null
real 0m2.978s
user 0m2.940s
sys 0m0.030s
Presumably we can observe the effect of any scheduling decision as a perturbation of the time values. But which time values do you care about? Here is an example where I run the same program, but with contention from another CPU-bound program:
bash$ time ./test_rand_matrix -r 200 -c 200 -i 1 > /dev/null
real 0m5.332s
user 0m2.660s
sys 0m0.000s
Is my schedule a good one? I waited almost twice as long for the answer, but the execution actually took less CPU time. If I'm not paying for CPU time, I probably care about the wall-clock time, but if I were running on a big, expensive machine for which I was charged for CPU occupancy time only, I might be happier with the second execution.
This particular ambiguity occurs most frequently with batch scheduled machines. Consider the following problem. You have a large program that will take 5 minutes to run on the IBM Blue Meanie at SDSC, and 6 hours to run on your workstation here at UCSB. Which do you choose? Okay -- it turns out that there is a queuing mechanism that you must use to get access to the Blue Meanie which takes your job specifications (inputs, binaries, etc.) and runs them when the machine is free. Lots of people want to use the Blue Meanie (almost as many, in fact, as the number who love hearing Ringo Starr say "Blue Meanie"). You are not charged anything to wait in the queue, but your job won't start executing for 8 hours. Now which one do you choose? Okay -- it is 9:00 PM and you are about to go home so, either way, you won't be looking at the results until the next day. Now which one do you choose?
The point here (in addition to the obligatory pop icon references) is that it is important to be clear about what time means when discussing application performance. In this class, we will be concerned primarily with wall-clock time as it measures turn-around time or response time. That is, when you measure the performance of your application you should do so from the time you press return to start the entire application until the time the application finishes. Even this description is a bit arbitrary, but we'll clarify as we go along.
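If you want to take both measurements from inside a program rather than with the shell's time command, something like the following sketch may help. The function names wall_seconds() and cpu_seconds() are made up for this example. It assumes a C11 library that provides timespec_get(), which follows the system clock. On POSIX systems, clock_gettime() with CLOCK_MONOTONIC is a common alternative that is immune to clock adjustments.

```c
#include <assert.h>
#include <time.h>

/* Wall-clock time in seconds, read from the system clock (C11). */
double wall_seconds(void)
{
    struct timespec ts;
    int base = timespec_get(&ts, TIME_UTC);

    assert(base == TIME_UTC);
    (void)base;
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Processor time consumed by this process so far, in seconds. */
double cpu_seconds(void)
{
    return (double)clock() / (double)CLOCKS_PER_SEC;
}
```

Read both clocks before and after the region of interest and subtract. Under contention, the wall-clock difference grows while the CPU-time difference stays roughly the same, which is the effect visible in the two runs shown earlier.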
Ops, Flops, and Mop Tops -- Sorry. It is a Beatles day. A related ambiguity has to do with machine efficiency. In particular, performance oriented applications are often concerned with the number of operations/second a particular computation can achieve. In light of the previous paragraph, we need to be careful about how we define the seconds in this metric, but let's say we settle on wall-clock time as we know we should. It can still be a little confusing. Consider the following code for matrix multiply:
double A[100*100];
double B[100*100];
double matrix[100*100];
for(i=0; i < 100; i++)
{
    for(j=0; j < 100 ; j++)
    {
        matrix[i*100+j] = 0.0;
        for(k=0; k < 100; k++)
        {
            matrix[i*100+j] +=
                (A[i*100+k] * B[k*100+j]);
        }
    }
}
It is hopefully not too difficult for you to convince yourself that this code
will do 2,000,000 floating-point operations. My laptop runs this nested loop
in about .06 seconds yielding 30.37 Mflops/sec when I use gcc,
unoptimized. How many instructions does this code execute? Here is
the assembly language for this loop:
.L4:
    movl -4(%ebp),%eax ;; i load and test
    cmpl 12(%ebp),%eax
    jl .L7
    jmp .L5
    .p2align 4,,7
.L7:
    movl $0,-8(%ebp)
    .p2align 4,,7
.L8:
    movl -8(%ebp),%eax ;; j load and test
    cmpl 28(%ebp),%eax
    jl .L11
    jmp .L6
    .p2align 4,,7
.L11:
    movl -4(%ebp),%eax ;; matrix initialize to 0.0
    imull 28(%ebp),%eax
    movl %eax,%edx
    addl -8(%ebp),%edx
    leal 0(,%edx,8),%eax
    movl -16(%ebp),%edx
    movl $0,(%edx,%eax)
    movl $0,4(%edx,%eax)
    movl $0,-12(%ebp)
    .p2align 4,,7
.L12:
    movl -12(%ebp),%eax ;; k load and test
    cmpl 16(%ebp),%eax
    jl .L15
    jmp .L10
    .p2align 4,,7
.L15: ;; inner loop starts here
    movl -4(%ebp),%eax
    imull 28(%ebp),%eax
    movl %eax,%edx
    addl -8(%ebp),%edx
    leal 0(,%edx,8),%eax
    movl -16(%ebp),%edx
    movl -4(%ebp),%ecx
    imull 28(%ebp),%ecx
    movl %ecx,%ebx
    addl -8(%ebp),%ebx
    leal 0(,%ebx,8),%ecx
    movl -16(%ebp),%ebx
    movl -4(%ebp),%esi
    imull 16(%ebp),%esi
    movl %esi,%edi
    addl -12(%ebp),%edi
    leal 0(,%edi,8),%esi
    movl 8(%ebp),%edi
    movl %edi,-36(%ebp)
    movl -12(%ebp),%edi
    imull 28(%ebp),%edi
    movl %edi,-20(%ebp)
    movl -20(%ebp),%edi
    addl -8(%ebp),%edi
    movl %edi,-24(%ebp)
    movl -24(%ebp),%edi
    leal 0(,%edi,8),%edi
    movl %edi,-28(%ebp)
    movl 20(%ebp),%edi
    movl %edi,-32(%ebp)
    movl -36(%ebp),%edi
    fldl (%edi,%esi)
    movl -28(%ebp),%esi
    movl -32(%ebp),%edi
    fmull (%edi,%esi)
    fldl (%ebx,%ecx)
    faddp %st,%st(1)
    fstpl (%edx,%eax) ;; inner loop ends == 39 instructions
.L14:
    incl -12(%ebp) ;; increment k
    jmp .L12
    .p2align 4,,7
.L13:
.L10:
    incl -8(%ebp) ;; increment j
    jmp .L8
    .p2align 4,,7
.L9:
.L6:
    incl -4(%ebp) ;; increment i
    jmp .L4
    .p2align 4,,7
.L5:
By my count (and it may be off a little) my pitiful laptop does a whopping 42,120,400 instructions in .06 seconds for a sustained rate of 638.78 Mops/sec. What, then, is the performance of the application? On the one hand, we get 30.37 Mflops/s and on the other 638.78 Mops/s.
overhead -- If you want to account for performance correctly, you must accumulate the useful work and divide by the time metric you care about (which almost always should be wall-clock time). The rest you must consider overhead. In this example, then, the compiler does 37 instructions worth of overhead in the inner loop for each floating point add and multiply.
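The bookkeeping described here is simple enough to write down. The sketch below uses two hypothetical helpers, rate_per_second() and overhead_fraction(), that are not part of any course code. The first divides a count of operations by wall-clock time. The second reports what fraction of all executed operations was overhead rather than useful work.

```c
#include <assert.h>

/* Operations per second, given a count and an elapsed wall-clock time. */
double rate_per_second(double op_count, double wall_seconds)
{
    assert(op_count >= 0.0 && wall_seconds > 0.0);
    return op_count / wall_seconds;
}

/* Fraction of all executed operations that were not useful work. */
double overhead_fraction(double useful_ops, double total_ops)
{
    assert(total_ops > 0.0);
    assert(useful_ops >= 0.0 && useful_ops <= total_ops);
    return (total_ops - useful_ops) / total_ops;
}
```

Plugging in the counts from the matrix multiply example, overhead_fraction(2000000.0, 42120400.0) comes out at roughly 0.95. In other words, about 95 percent of the instructions executed by the unoptimized code are overhead.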
Incidentally, when I recompile the same code with the -O2 option, the execution time drops to 0.008 seconds yielding a flop rate of 252.40 Mflops/sec. I'll leave it to you to discover the mysteries and moral triumphs associated with the field of compiler optimization. You should know, though, that it can make a real difference in performance, especially for numerical codes.
for(i=0; i < 7; i++)
{
    B[i] = A[i] + 5;
}
If you think about it for a little bit in purely non-practical terms, you will
notice that each assignment to an element of B[] requires only the
corresponding value of A[] and the constant 5. Diagrammatically,
we can represent each addition in the following way:
The "node" in this small graph represents the operation to be performed and each edge represents either inputs or outputs of the operation. This representation is sometimes called a dataflow representation since it essentially denotes program semantics in terms of data flowing between operations. The chief advantage, from a notational point of view, of dataflow is that it makes parallelism obvious. For example, if it is not obvious from the C code above, the dataflow graph for the whole loop should make it obvious that all seven additions can occur in parallel. As such, the code in silly.c, if executed on a parallel, shared-memory machine with at least 7 free processors would do the 7 adds in parallel.
/*
 * silly.c -- parallel array addition
 */
#include <unistd.h>
#include <stdlib.h>
#include <pthread.h>
#include <stdio.h>

struct msg
{
    double left_input;
    double right_input;
    double result;
};

/* thread body: add the two inputs and hand the message back to the joiner */
void *AddIt(void *arg)
{
    struct msg *m = (struct msg *)arg;
    m->result = m->left_input + m->right_input;
    pthread_exit((void *)m);
}

int main(void)
{
    struct msg *m;
    int i;
    double A[7] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0};
    double B[7];
    pthread_t tids[7];
    pthread_attr_t attr;
    int err;

    pthread_attr_init(&attr);
    pthread_attr_setscope(&attr,PTHREAD_SCOPE_SYSTEM);

    /* spawn one thread per addition */
    for(i=0; i < 7; i++)
    {
        m = (struct msg *)malloc(sizeof(struct msg));
        if(m == NULL)
        {
            perror("malloc");
            exit(1);
        }
        m->left_input = A[i];
        m->right_input = 5;
        err = pthread_create(&(tids[i]),&attr,AddIt,(void *)m);
        if(err != 0)
        {
            fprintf(stderr,"pthread_create failed: %d\n",err);
            exit(1);
        }
    }

    /* collect the results in index order and release each message */
    for(i=0; i < 7; i++)
    {
        pthread_join(tids[i],(void **)&m);
        B[i] = m->result;
        free(m);
    }
    pthread_attr_destroy(&attr);

    for(i=0; i < 7; i++)
    {
        printf("B[%d]: %f\n",i,B[i]);
    }
    return 0;
}
Or would it?
It takes a certain amount of time to spawn a thread. Since each thread in this code has very little to do, by the time the 7th thread is spawned, the 1st one is probably finished. Is that parallelism? The 1st and 7th thread are not actually executed in parallel.
granularity -- What is happening here is that there is insufficient work in each thread to cover the overhead cost associated with starting the thread and communicating its inputs and result.
Partitioning is the process of determining the work in the application that will be executed sequentially. The code in silly.c represents the minimum granularity partition since it is not possible to break the add instruction up further (at least in C).
Consider, however, the partitioning shown in the following figure:
In it, some of the addition operations have been sequentialized to try to amortize the start-up and communication cost associated with running the threads in parallel. The code in less_silly.c implements this partitioning.
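The file less_silly.c is not reproduced in these notes, so what follows is only a sketch of the idea and not the original code. It assumes the partition is a set of contiguous index ranges, one per thread. The names struct chunk, add_chunk(), and chunked_add() are invented for the example. Each thread performs several additions sequentially, so the thread start-up cost is paid once per range rather than once per add.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* One partition: a contiguous range of additions done sequentially. */
struct chunk {
    const double *in;
    double *out;
    int begin;      /* first index, inclusive */
    int end;        /* last index, exclusive */
    double addend;
};

static void *add_chunk(void *arg)
{
    struct chunk *c = arg;
    int i;

    for (i = c->begin; i < c->end; i++)
        c->out[i] = c->in[i] + c->addend;
    return NULL;
}

/* Compute B[i] = A[i] + addend using nthreads partitions; 0 on success. */
int chunked_add(const double *A, double *B, int n, int nthreads, double addend)
{
    pthread_t *tids;
    struct chunk *chunks;
    int t, started = 0, status = 0;

    assert(n > 0 && nthreads > 0);
    if (nthreads > n)
        nthreads = n;
    tids = malloc(nthreads * sizeof *tids);
    chunks = malloc(nthreads * sizeof *chunks);
    if (tids == NULL || chunks == NULL) {
        free(tids);
        free(chunks);
        return -1;
    }
    for (t = 0; t < nthreads; t++) {
        chunks[t].in = A;
        chunks[t].out = B;
        chunks[t].begin = (int)((long)n * t / nthreads);
        chunks[t].end = (int)((long)n * (t + 1) / nthreads);
        chunks[t].addend = addend;
        if (pthread_create(&tids[t], NULL, add_chunk, &chunks[t]) != 0) {
            status = -1;    /* remaining ranges are left uncomputed */
            break;
        }
        started++;
    }
    for (t = 0; t < started; t++)
        pthread_join(tids[t], NULL);
    free(tids);
    free(chunks);
    return status;
}
```

With n = 7 and nthreads = 3 the ranges are [0,2), [2,4), and [4,7). Setting nthreads to 7 recovers the one-add-per-thread partition of silly.c, and setting it to 1 gives the purely sequential loop.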
a word about dataflow -- So obvious is the parallelism in the dataflow representation that not only is it useful as a human-readable representation, but parallelizing and many optimizing compilers use it internally as well. The dataflow relationship is sometimes termed a dependence relationship or dependence graph.
Now consider the following code segment
A[0] = 1.0;
for(i=1; i < 7; i++)
{
    A[i] = A[i-1] + 5;
}
Is it possible to parallelize this loop? Draw the dataflow graph.
Notice that there is no parallelism available because of the way the application code has been written. It would be possible to run parts of the code in different threads (i.e. different partitions) but no matter how the code is partitioned, it will be a sequential code.
In general, the problem of determining an optimal partition is NP-complete. For special cases, like the one above, it can be done, but for arbitrary computations, the assignment of nodes to threads that minimizes the overall execution time cannot be determined (yet) in polynomial time.
a general model -- Notice, though, that this model of program representation and partitioning is fairly general. Consider the assignment specified as project 2. Each matrix operation can be represented as a sequential node, each input array as an input edge, and each output array as an output edge. Parts of the code are sequential, and parts parallel. What partition have you chosen?
Scheduling is the process of assigning the partitions of a program to the resources of some computer system such that a specific objective function is optimized. Typically, there is only a single choice of communication medium available for communication between processors. In this case, the problem reduces to the problem of assigning sequential computation units (e.g. threads) to processors. On more complicated systems, however, the communication events (e.g. edges in the dataflow graph) must also be mapped.
To make the following exposition less periphrastic, we will adopt the following nomenclature:
To determine a schedule, the scheduler must specify both a resource to service each node and edge, and a time frame in which the servicing should take place. Formally, then, a schedule can be represented as the assignment of a 3-tuple to each component in a program graph where
Often it is useful to represent a program schedule as a two-dimensional Gantt Chart. In a Gantt chart, one dimension represents time, and the other represents location in terms of resource assignment. For example,
Depicts a possible schedule of the partitioned parallel loop shown before. Time runs vertically downward and the processor dimension runs horizontally to the right. Notice that it is possible to observe a start time and end time (length being a measure of duration) for all program components as well as the processor mapping. Also, it is typical to assume that the chart begins at time 0 and that all times are relative to that origin. Also, in the time dimension, distance relationships typically matter, but in the processor dimension they do not. That is, processor P0 is not logically farther away from processor P1 than it is from P2, but the thread running on P0 runs for a longer time than the one running on P2.
accounting for time -- Gantt charts are a pretty wonderful abstraction. Indeed, all sorts of important relationships can be conveyed visually in a way that is intuitively appealing (at least to me). For example, the schedule shown above shows the thread running on P0 as beginning at time 0 and some time later, the thread running on P1 beginning. Why? If we assume the array A[] to be resident in the memory belonging to P0 and we wish to use P1 it will take some time for the necessary components of A[] to be communicated from P0 to P1. Hence, P1 must start later. If the communication medium can only send one message at a time, and P1's message is sent first, then the thread on P2 must start even later still.
If P0, P1, and P2 are separate machines connected by a network, however, the following Gantt chart is probably a more realistic depiction of an execution schedule.
Here, P0 must send data to the other processes sequentially. If it waits for a reply after each message, the messages go one-at-a-time. The thread on P0, then, can only begin executing after both send operations complete.
minimizing execution time -- If execution time is the objective function, then scheduling becomes the process of minimizing the longest distance from 0 to the end of any thread in the Gantt chart, given all of the incumbent start times and durations.
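As a small illustration, a schedule can be stored as an array of entries and the objective computed directly. The sketch below assumes each entry is a (resource, start time, duration) triple, which is how I read the definition of scheduling at the top of these notes. The names struct sched_entry and makespan() are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* One bar on the Gantt chart. */
struct sched_entry {
    int resource;       /* processor or link the component is mapped to */
    double start;       /* start time, relative to time 0 */
    double duration;    /* length of the bar */
};

/* Distance from time 0 to the end of the latest-finishing component. */
double makespan(const struct sched_entry *s, size_t n)
{
    double latest = 0.0;
    size_t i;

    for (i = 0; i < n; i++) {
        double end;

        assert(s[i].start >= 0.0 && s[i].duration >= 0.0);
        end = s[i].start + s[i].duration;
        if (end > latest)
            latest = end;
    }
    return latest;
}
```

A scheduler that uses execution time as its objective function is searching for the assignment of these triples that makes makespan() as small as possible.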
variance -- A fundamental problem with Grid scheduling is that the performance that a Grid program will get out of the resources it uses cannot be predicted perfectly. Even if you knew exactly how long each add would take and exactly what the communication times were in the two previous examples, you would have a hard time building an accurate Gantt chart unless absolutely nothing else was going on while you were running the program. Nothing. Not NFS updating its permission cache, not cron jobs running and polluting your TLB, not the NTP daemon updating its time sync -- absolutely nothing.
precise knowledge -- Even then, prediction is hard because of processor cache issues (is your data cache-word aligned?) and because you need to know exactly what is going on in your message system. Unix sockets, for example, often use dynamically allocated buffers. The time to allocate a buffer is often a function of the state of the memory allocator. If nothing has ever been allocated or the allocation pool is not fragmented, the allocation time can be less than if it has been a busy day on your system.
contention -- With the load from other users' programs causing contention, the prediction problem is worse. A performance model for the application (even a perfect one) would need to be able to take, as parameters, estimates of resource load to make load sensitive predictions. And those estimates, themselves, must be predictions of the load that will be there when the application actually runs.
The problem, then, is that a perfect schedule requires perfect information about the performance response that the application will get in the future, and predicting the future is hard.
two approaches -- There are two broad approaches to mitigating the variance in performance response that is likely. I will term them work-stealing and predictive mapping although the literature is rife with cute names, acronyms, and all manner of nomenclature. Scheduling is a fairly old systems discipline.
adaptive -- Better still, work-stealing is adaptive. If the loads change while the application is running, the request frequency from all of the processors will change as well. A processor that starts out making lots of requests may slow down as load increases due to contention by other users. Slower processors may speed up as load lightens.
the halting problem -- There is another potential source of uncertainty in scheduling that has nothing to do with resource load: in general, a program's execution time depends on its inputs. Another way of saying the same thing is that it is impossible to determine what the output will be (i.e. what path will be taken through a program) for all programs and all inputs. If it were possible, then it would also be possible to solve the halting problem. Work-stealing also addresses this source of variance. If a processor gets a piece of work that takes a long time to compute, other processors will take up the slack.
Note that not all programs are so unpredictable as a result of their inputs. The performance of Matrix-multiply, for example, can be predicted very accurately from the size of the inputs.
popular -- This scheduling approach is, by far, the most popular in heterogeneous, shared environments because it does not require accurate estimates of predicted load or data-dependent execution times. It is also fairly simple to implement. If there is one master responsible for handing out work and keeping track of the partial results, this type of scheduling is called master/worker or master/slave. The pseudo-code for the master is usually something like
master()
{
    while there is uncomputed work left to do
    {
        message = wait until a worker contacts me;
        if(message.type == WORK_COMPLETED)
        {
            add results to completed work;
        }
        if(message.type == REQUEST_NEW_WORK)
        {
            send requester a piece of uncompleted work;
        }
    }
    send all workers TERMINATE;
}
The code for workers is also fairly easy:
worker()
{
    do forever
    {
        send master REQUEST_NEW_WORK;
        message = wait for master to reply;
        if(message.type == TERMINATE)
            exit;
        do whatever work is specified in the message;
        send master WORK_COMPLETED;
    }
}
Obviously, it is possible to piggy-back a new request with each completed
message, but you get the point, I hope. If the master is threaded, it can
also be doing work while the workers are working. Hence the workers
steal work from the master.
peer-to-peer work stealing -- It is also possible for workers to steal work from each other. That is, each worker is given a list of things to do and workers that finish early steal uncompleted work from their neighbors. This approach requires some additional complexity in that workers need to be able to find each other, but the fundamental principle is the same.
The first issue has to do with the overhead associated with the request/response cycle and load imbalance. To understand the tradeoff, let's go back to the silly partitioning example shown above, but let's make the problem bigger.
for(i=0; i < 10000000; i++)
{
    B[i] = A[i] + 5;
}
Let's now imagine that we implement the code as a master/worker application in
which a single value of A[] is sent to each worker on request, the
worker increments it by 5, and sends the result back to the master, who stores
it in B[]. Here is pseudocode for the master.
master()
{
    num_completed = 0;          /* results received so far */
    next_index_to_assign = 0;
    /* results can arrive out of order, so count them rather than
       tracking the largest completed index */
    while(num_completed < 10000000)
    {
        message = wait until a worker contacts me;
        if(message.type == WORK_COMPLETED)
        {
            B[message.index] = message.result;
            num_completed += 1;
        }
        /* only hand out indices that exist; a requester that gets no
           reply simply waits for TERMINATE */
        if(message.type == REQUEST_NEW_WORK &&
           next_index_to_assign < 10000000)
        {
            send requestor (next_index_to_assign,
                            A[next_index_to_assign] )
            next_index_to_assign += 1;
        }
    }
    send all workers TERMINATE;
}
And the worker
worker()
{
    do forever
    {
        send master REQUEST_NEW_WORK;
        message = wait for master to reply;
        if(message.type == TERMINATE)
            exit;
        new_message.result = message.A + 5;
        new_message.index = message.index;
        send master WORK_COMPLETED, new_message;
    }
}
Let's assume (for the sake of clarity and not reality) that every index that
is assigned will be completed (i.e. nothing gets lost). Don't try this at
home. Fixing this problem in the pseudocode would obscure the topic at hand,
but you would need to account for this possibility in any real version of such
a program.
Obviously, for any real network and machine combination, the time required to do the messaging completely overwhelms the benefit that comes from being able to run in parallel. The solution is to increase the amount of work that is assigned to each worker so the messaging cost is amortized by the additional parallelism.
load imbalance -- Handing out more sequential work from the master on each request solves the overhead problem. If you can't see why, try drawing a pair of Gantt charts - one where the communication cost is much greater than the sequential work, and one for the reverse case.
By "thickening" up on the work, though, the possibility of having one host become slow while the computation takes place can delay the entire computation. The longer the sequential piece of work, the greater the delay when the unexpected happens. Conversely, the shorter the piece of work, the less extra delay occurs when a process is late.
tradeoff -- The scheduler, then, must choose a work size that balances the overhead of the request/response cycle against the risk of load imbalance.
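To experiment with the work-size knob on a single shared-memory machine, here is a rough sketch. It is not the message-passing master/worker shown above. The "master" is reduced to a mutex-protected counter, an assumption made to keep the example short, and the names struct work_pool, pool_worker(), and self_schedule() are invented. Each request to the pool returns a block of chunk consecutive indices, so a larger chunk means fewer requests but coarser load balancing.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Shared pool of unassigned work, standing in for the master. */
struct work_pool {
    pthread_mutex_t lock;
    const double *A;
    double *B;
    long n;         /* total number of elements */
    long next;      /* next unassigned index */
    long chunk;     /* indices handed out per request */
};

static void *pool_worker(void *arg)
{
    struct work_pool *p = arg;

    for (;;) {
        long i, begin, end;

        /* REQUEST_NEW_WORK: claim the next block of indices */
        pthread_mutex_lock(&p->lock);
        begin = p->next;
        end = begin + p->chunk;
        if (end > p->n)
            end = p->n;
        p->next = end;
        pthread_mutex_unlock(&p->lock);

        if (begin >= end)
            return NULL;            /* nothing left: TERMINATE */
        for (i = begin; i < end; i++)
            p->B[i] = p->A[i] + 5;  /* results land directly in B */
    }
}

/* Run nworkers threads over n elements; 0 on success, -1 on failure. */
int self_schedule(const double *A, double *B, long n, int nworkers, long chunk)
{
    struct work_pool pool;
    pthread_t *tids;
    int t, started = 0;

    assert(n >= 0 && nworkers > 0 && chunk > 0);
    tids = malloc(nworkers * sizeof *tids);
    if (tids == NULL)
        return -1;
    pthread_mutex_init(&pool.lock, NULL);
    pool.A = A;
    pool.B = B;
    pool.n = n;
    pool.next = 0;
    pool.chunk = chunk;
    while (started < nworkers &&
           pthread_create(&tids[started], NULL, pool_worker, &pool) == 0)
        started++;
    for (t = 0; t < started; t++)
        pthread_join(tids[t], NULL);
    pthread_mutex_destroy(&pool.lock);
    free(tids);
    return started > 0 ? 0 : -1;
}
```

Timing this with chunk set to 1 and then to a few thousand is a quick way to see the request overhead. Making one thread artificially slow shows the opposite effect, where a very large chunk leaves the other workers idle while the slow one finishes.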
double *A;
double *new_A;
double *temp;
int i;    /* row index */
int j;    /* column index */
int l;    /* iteration counter */
for(l=0; l < ITERATIONS; l++)
{
    for(i=1; i < (rows_A-1); i++)
    {
        for(j=1; j < (cols_A-1); j++)
        {
            new_A[i*cols_A+j] =
                (A[(i-1)*cols_A+j] +
                 A[(i+1)*cols_A+j] +
                 A[i*cols_A+(j-1)] +
                 A[i*cols_A+(j+1)]) / 4.0;
        }
    }
    /* swap so the next iteration averages the just-computed values */
    temp = A;
    A = new_A;
    new_A = temp;
}
This loop nest implements a 2-D Jacobi iterative solver
algorithm. Each element of new_A is set equal to the average of the north,
south, east, and west neighbors of the corresponding position in A. Then,
A and new_A are switched so that the next round of averaging is performed on
the just-computed array. Jacobi is usually run in one of two ways. Either
this strange form of averaging goes on for a fixed number of iterations, or it
runs until the maximum difference between corresponding values of new_A
and A falls below some specified threshold.
Jacobi as Master/Worker -- Consider implementing a distributed version of Jacobi using the master/worker scheduling paradigm we have been discussing. How would you do it?
partition one dimension -- The first thing you need is a partitioning of the problem. Each computation requires four values, so you could send out messages with four values each, but as we discussed before, the overheads will probably be too high to get much benefit from such a fine partition. You could pick four neighboring values from random parts of A but probably the best thing to do is to let a worker have a contiguous region of A and to compute all of the values for that region. To make the indexing simple, the easiest thing to do is to partition the arrays along one dimension.
You could partition A in this way so that each worker would get some number of columns (for example) in each message. Each iteration of the algorithm could hand out partitions of A and get back partitions of new_A. The edges would have to overlap so that each worker could compute its boundary values.
data locality -- Notice that you don't get to do much computation for each value sent. If a worker gets 100 elements it will do 400 operations (3 adds and a divide for each element). If you think about it for a while, you realize that each processor gets a piece of A on each request, and returns a piece of new_A. The piece of A it gets on this iteration was a piece of new_A on the last. If each processor worked on the same region of A and new_A each time, they would not have to be sent back and forth between the master and the workers. The boundary regions would have to be communicated between processors at each iteration, but the interior could remain in place time after time. The key here is to realize that there is data locality in this problem that can be exploited.
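The index arithmetic for a column-wise partition with overlapping edges might look like the sketch below. The names struct strip and strip_for_worker() are invented here. It assumes the outermost columns 0 and cols-1 are fixed boundary values that nobody updates, as in the loop nest above, and that the interior columns are divided as evenly as possible.

```c
#include <assert.h>

/* Column range for one worker; all bounds are inclusive. */
struct strip {
    int first_owned;    /* first column this worker updates */
    int last_owned;     /* last column this worker updates */
    int first_needed;   /* first column it must be able to read */
    int last_needed;    /* last column it must be able to read */
};

/* Divide interior columns 1..cols-2 among nworkers; w is 0-based. */
struct strip strip_for_worker(int cols, int nworkers, int w)
{
    struct strip s;
    int interior = cols - 2;

    assert(cols >= 3 && nworkers > 0 && nworkers <= interior);
    assert(w >= 0 && w < nworkers);
    s.first_owned = 1 + (int)((long)interior * w / nworkers);
    s.last_owned = (int)((long)interior * (w + 1) / nworkers);
    /* each update reads one column to the west and one to the east */
    s.first_needed = s.first_owned - 1;
    s.last_needed = s.last_owned + 1;
    return s;
}
```

For cols = 10 and three workers, the owned ranges are 1..2, 3..5, and 6..8. The middle worker needs columns 2..6, so columns 2 and 6 are the overlap it shares with its two neighbors. Only those overlap columns have to move between processors on each iteration if the strips stay in place.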
SPMD -- It turns out that there are many algorithms that can be implemented so that the data structures (the arrays in this case) remain in place across iterations. Often these codes are implemented as "Single Program Multiple Data" programs in which each worker executes the same program over a different region of the data structure. In addition to making the distributed implementation more efficient, this approach has the added benefit of not requiring the entire data structure to be located in any one memory all at once. If you need to compute an array so big that it won't fit on one computer, you can, but there will need to be interprocessor communication for any non-local operations.
load imbalance -- The problem with this approach is that because the cores of each partition remain resident, the scheduler must incur a much greater communication cost (relatively speaking) to move them. If a processor becomes slow, the scheduler is faced with two bad alternatives: either move the data (which is expensive) or wait (which could be expensive).
prediction -- In either case (master/worker or distributed data structure) the scheduler needs to choose the "right" size of the pieces of work for the workers to compute. To do so, it must make a prediction of the communication and execution times involved. For example, if the master "knew"
First -- there is the Jacobi AppLeS. The key realization to have here is that the partition sizes can be scaled to match processor power. For example, if you have three processors, and one is faster than the other two, you can assign twice as much work to the processor that is twice as powerful:
Next, you need an objective function. The one we chose was to try and balance the sum of the times each processor spent computing and communicating. We called this form of scheduling time balancing since it is analogous to load balancing, but it takes into account the time lost to communication.
Another important realization to have is that if you partition a single dimension, only neighboring processors need to communicate.
Finally, you need to realize that there are generally simple rules for calculating how much communication and computation time will be spent given the times for a minimal partition.
performance model -- The time spent by processor i can be expressed as
T[i] = (Compute[i] * N * p[i]) + Comm(N,i,i-1) + Comm(N,i,i+1)
where
N = the number of elements along the non-partitioned dimension
p[i] = the number of elements partitioned to processor i
Compute[i] = time to compute a single element on processor i
Comm(N,i,j) = time to communicate N elements between processor i and processor j
If you benchmark the code on each processor, you can calculate the predicted value of Compute[i] using NWS predictions.
Compute[i] = benchmark[i] / NWS_CPU_AVAIL[i]
Similarly, if you know the predicted bandwidth available between any two processors from the NWS, you can calculate the predicted Comm(N,i,j).
Comm(N,i,j) = (N * sizeof(element)) / NWS_BW(i,j)
With this performance model, a problem size, a set of processors, a set of benchmark values, and the NWS you can set up a set of linear equations that look like
for all i, j in 1 to Processors: T[i] = T[j]
SUM over i of (N * p[i]) = N^2
I'll leave it to you to work out the algebra, but the only unspecified variables in these equations are the p[i], which you can solve for very quickly. The resulting list of values are the predicted "widths" for each partition that will balance the time spent by the processors.
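One way to do that algebra is sketched below. It is not the AppLeS code. It assumes the two Comm terms for processor i have already been added together into a single number comm[i], and the name time_balance() is made up. Setting every T[i] equal to a common value T and requiring the p[i] to sum to N gives T = (N + SUM(comm[i] / (Compute[i]*N))) / SUM(1 / (Compute[i]*N)), after which each p[i] follows from the performance model.

```c
#include <assert.h>

/*
 * Solve T[i] = compute[i]*N*p[i] + comm[i] = T for all i, with the p[i]
 * summing to N.  compute[i] is the predicted time per element and comm[i]
 * the predicted total communication time for processor i.  Fills p[] and
 * returns T, or returns -1.0 if some processor would get a negative width
 * (its communication alone already exceeds the balanced time).
 */
double time_balance(int nproc, double N, const double *compute,
                    const double *comm, double *p)
{
    double inv_sum = 0.0, comm_sum = 0.0, T;
    int i;

    assert(nproc > 0 && N > 0.0);
    for (i = 0; i < nproc; i++) {
        double per_width;

        assert(compute[i] > 0.0 && comm[i] >= 0.0);
        per_width = compute[i] * N;     /* time for one unit of width */
        inv_sum += 1.0 / per_width;
        comm_sum += comm[i] / per_width;
    }
    T = (N + comm_sum) / inv_sum;
    for (i = 0; i < nproc; i++) {
        p[i] = (T - comm[i]) / (compute[i] * N);
        if (p[i] < 0.0)
            return -1.0;
    }
    return T;
}
```

With three processors, no communication cost, and the first processor twice as fast as the other two, the solution gives the first processor half of the columns and each of the others a quarter. That matches the "twice as much work for twice the power" rule mentioned earlier. The widths come out as real numbers, so a real scheduler would still have to round them to whole columns.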
The only remaining piece you need is a way to select which processors to use. This methodology gives you a schedule for a specified set of processors. It also gives you a way to estimate the time T[i] that each processor will take. To select processors, you need a heuristic that chooses subsets from the global pool. Our heuristic was to
choose the most powerful processor in the pool
calculate T[i] for that processor
while(not done)
{
    add processor to list
    choose most powerful processor that is best connected to the processor
        on the end of the list
    calculate T[i] if processors on list and this processor are used
}
The finishing criteria varied (I'll let you read about them) but the trick
here is to remember which list of processors gave you the best (lowest)
T[i]. That list is then used to generate a set of p[i]
for the application.
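The "remember the best list" part of this heuristic can be sketched as follows. Everything here is hypothetical: select_processors(), the predict_fn callback type, and the toy cost model exist only for illustration. The sketch assumes the processors have already been put in the order the heuristic would add them, and that the finishing criterion is simply running out of processors.

```c
#include <assert.h>
#include <stddef.h>

/* Predicted balanced time when the first `count` processors are used. */
typedef double (*predict_fn)(const int *procs, int count, void *ctx);

/*
 * Grow the list one processor at a time and remember which prefix gave
 * the lowest predicted time.  Returns the length of that prefix and
 * stores its predicted time in *best_time when best_time is not NULL.
 */
int select_processors(const int *ordered, int n, predict_fn predict,
                      void *ctx, double *best_time)
{
    int k, best_count = 1;
    double best;

    assert(ordered != NULL && n > 0 && predict != NULL);
    best = predict(ordered, 1, ctx);
    for (k = 2; k <= n; k++) {
        double t = predict(ordered, k, ctx);

        if (t < best) {
            best = t;
            best_count = k;
        }
    }
    if (best_time != NULL)
        *best_time = best;
    return best_count;
}

/* Toy model: work divides evenly, each extra processor adds a fixed cost. */
struct toy_model {
    double work;
    double per_proc_cost;
};

double toy_predict(const int *procs, int count, void *ctx)
{
    const struct toy_model *m = ctx;

    (void)procs;    /* the toy model ignores which processors are chosen */
    return m->work / count + m->per_proc_cost * count;
}
```

In a real scheduler the callback would solve the time-balancing equations for the chosen processors and return the resulting T[i]. The toy model is only there to show that adding processors stops paying off once the added communication cost outweighs the work removed from each one.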
And that's it. The AppLeS scheduler did all of this work at run time just before launching the actual computation using the p[i] values it obtained.
complib AppLeS -- We built another AppLeS for a genetic sequencing algorithm called Complib that combined the time balancing technique we used for Jacobi with master/worker scheduling. This was joint work with my student at the time, Neil Spring.
Complib is a library sequencing application. It takes a known target library of sequences and an unknown source library of sequences and applies the FASTA algorithm to find the best matches between the unknown sequences and the known sequences. The version we had was originally written to use Mentat, an early version of the Legion Grid computing system. It was originally written to use the master/worker paradigm exclusively.
The main data structure, however, is a 2-D array of "scores" that are generated for each possible source-target match.
The collector at the end reduces this array by finding the "max" score for any match on each worker and returning the global max. If it were not for the variance in the system, the more efficient way to program Complib would be as a distributed data structure "SPMD" application.
two sources of variance -- Complib actually contains two sources of variance. The first is that the resources may be varying. The second is that it is difficult to predict how long each score will take to compute. Some scores are computed very quickly, while others are not. The master/worker paradigm takes care of both.
scheduling with predictions -- Our approach, though, was to use the NWS to generate a measure of the dependable performance we could expect from each processor and then to partition part of the problem using that metric. The idea here was to identify the minimum performance we could expect from each processor, to partition the corresponding fraction according to that minimum, and then to use master/worker to schedule the rest.
The expected performance came from NWS predictions of CPU and network availability. We calculated the dependable performance by subtracting 3 times the square root of the mean square prediction error from the expected value.
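As a rough sketch of that arithmetic, assuming you have a history of past forecasts and the measurements that followed them: the function name, the argument names, and the clamp at zero are my assumptions here, not part of the NWS interface.

```python
import math


def dependable_performance(forecasts, measurements, next_forecast, k=3.0):
    """Return next_forecast minus k times the root mean square forecast error.

    ASSUMPTIONS: forecasts[i] was the prediction made for measurements[i];
    the result is clamped at zero so that a very unpredictable resource
    is simply treated as contributing nothing dependable.
    """
    squared = [(f - m) ** 2 for f, m in zip(forecasts, measurements)]
    rmse = math.sqrt(sum(squared) / len(squared))
    return max(0.0, next_forecast - k * rmse)


# Forecasts of 80% availability that were each off by 10 points.
value = dependable_performance([0.8] * 4, [0.9, 0.7, 0.9, 0.7], 0.8)
```

In this made-up example the root mean square error is 0.1, so the dependable value is roughly 0.5: the scheduler would count on only half of the processor even though it expects 80%.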
some statistics -- not much -- To see how this works, consider the following extremely hypothetical execution performance graph for 10,000 hypothetical program executions.
The x-axis shows a range of execution times from 100 units up to 900 units of time (say, seconds). The y-axis shows the fraction of the total number of jobs that took as long as the corresponding x-value indicates. Clearly, the most popular execution time is 500 time units.
expectation -- Knowing this information, what execution time would you guess for the 10,001st job? More realistically, what value would you guess for the next 10,000 jobs? If you are like me, and I usually am, you'll probably pick 500. Congratulations. You've just discovered the concept of expected value. The right way to think of making this choice is not that you'll be right on the next prediction. It is that, over time, by choosing this value you'll be less wrong than by choosing any other value (all other things remaining exactly the same). It turns out that the way you measure "wrong" matters, but we won't go into that here because that would mean actually learning a little about statistics. Yikes -- the fear.
For the Complib AppLeS case, however, things are a little different. If the AppLeS scheduler guesses that the execution time will be 500 seconds on a particular processor and partitions the program at the start, the pieces that finish early do not really slow the overall program down, but the pieces that finish late absolutely do. If those late pieces were scheduled using master/worker, they might have found a better home. What is needed is an estimate of how much the scheduler can count on with, say, 95% confidence.
The question for you, then, is: what is the x-value that is less than 95% of all x-values? If you think about it a minute (and I urge you to engage in this process now), you can hopefully convince yourself that the answer can be had by integrating this curve and finding the x-value that determines a 5% area on the left-hand side of the graph. See? Calculus was actually useful.
cumulative distribution function -- A graphical way to accomplish the same purpose can be had by plotting the cumulative distribution function (CDF), in which the x-values are the same, but each y-value is the sum of the y-values that come before it.
Now, you can see what the 5% value is pretty clearly. Find 5% on the y-axis and see where the corresponding x-value is on the curve. In this case it is about 335. That means, if the AppLeS scheduler guessed 335, it would under-estimate 95% of the time.
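Here is a small sketch of building a CDF from a histogram and reading off the 5% point. The bin counts are made up for illustration and are much coarser than the graph described above, so the answer lands on a bin boundary rather than at 335. The function names are hypothetical.

```python
from itertools import accumulate


def empirical_cdf(counts):
    """Running sum of the histogram counts, as a fraction of all jobs."""
    total = sum(counts)
    return [c / total for c in accumulate(counts)]


def cutoff(xs, counts, level=0.05):
    """Smallest x whose cumulative fraction reaches the given level."""
    for x, fraction in zip(xs, empirical_cdf(counts)):
        if fraction >= level:
            return x
    return xs[-1]


# Made-up histogram of 10,000 jobs in 100-unit bins (illustrative only).
xs = [100, 200, 300, 400, 500, 600, 700, 800, 900]
counts = [50, 300, 1150, 2000, 3000, 2000, 1150, 300, 50]
five_percent_point = cutoff(xs, counts)
```

With finer bins (or the raw 10,000 measurements, sorted) the same walk along the running sum gives a sharper answer.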
still more statistics -- It turns out that for a Normal distribution, if you know the mean and the standard deviation you can calculate exactly where any cut-off is without having all this data around. For data that doesn't conform to a Normal distribution (or some form of exponential), figuring the cut-off essentially requires that you build a CDF. Notice that the CDF representation shown here requires all 10,000 data points. There are other problems as well, again, which we will defer for now, but the bottom line is that the Complib AppLeS used the NWS forecast as a measure of expectation, and the square root of the mean square forecasting error as a measure of standard deviation. The multiplier is bigger than for a Normal, but we determined empirically that it worked fairly well.
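For the Normal case, the calculation is a one-liner with the Python standard library. The mean of 500 and standard deviation of 100 below are values I am assuming for illustration, and normal_cutoff is a hypothetical helper name.

```python
from statistics import NormalDist


def normal_cutoff(mean, stdev, level=0.05):
    """x-value with the given fraction of a Normal distribution to its left."""
    return NormalDist(mean, stdev).inv_cdf(level)


# ASSUMED parameters: mean 500, standard deviation 100.
five_percent = normal_cutoff(500, 100)   # about 1.645 deviations below the mean
three_sigma = 500 - 3 * 100              # cut-off with a multiplier of 3
```

For a true Normal, the 5% point sits about 1.645 standard deviations below the mean, so a multiplier of 3 is noticeably more conservative.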
In a sense, this technique gave us a conservative "confidence interval" for each processor that was based on the degree to which we could predict its behavior with the NWS.
By partitioning the conservative part and leaving the rest for standard master/worker, we were able to improve the performance over either master/worker or static partitioning alone.
These results were quite startling to the Legion implementers, who could not see any reason why master/worker scheduling alone wouldn't yield the best performance. More importantly, they illustrate an important scheduling principle. It may not be the expected value, but rather the range of values, that is important to consider.
unified field theory -- An interesting byproduct of this work is that I believe it forms a framework for all scheduling. In a sense, a scheduler must determine an initial "placement" of the work, and then possibly go through a "replacement" as time evolves. In a master/worker application, the initial placement is all on the master, and everything gets replaced. In a distributed data structure application, everything is placed where it will stay until the program terminates, and nothing gets replaced. The hybrid approach we have come up with can implement all combinations in between, thereby unifying the two approaches.
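To make the placement/replacement idea concrete, here is one possible reading of the hybrid split, sketched under my own assumptions: each processor is statically handed the share of the work that its dependable rate covers, relative to the total expected rate, and whatever is left goes into a shared master/worker queue. The function and argument names are hypothetical, and this is not the Complib AppLeS code.

```python
def hybrid_partition(total_units, expected, dependable):
    """Split work into static per-processor shares plus a shared queue.

    ASSUMPTION: processor i is statically given the fraction
    dependable[i] / sum(expected) of the work; the remainder is left
    for master/worker scheduling at run time.
    """
    capacity = sum(expected)
    static = [int(total_units * d / capacity) for d in dependable]
    queued = total_units - sum(static)
    return static, queued


# Two processors with equal expected rates but unequal predictability.
static, queued = hybrid_partition(1000, [10, 10], [8, 4])

# The two extremes: nothing dependable is pure master/worker, and
# fully dependable resources give a pure static partition.
all_queue = hybrid_partition(1000, [10, 10], [0, 0])
all_static = hybrid_partition(1000, [10, 10], [10, 10])
```

In the first call the predictable processor is handed 400 units up front, the less predictable one 200, and 400 units wait in the queue for whichever processor turns out to have time.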