CS140 Final Project: Parallel Sorting (MPI & maybe Cilk++)

The object of this project is to write at least two parallel programs to sort double-precision floating-point numbers, and to compare their running times (actually, their "sort rates" as described below) with each other and with a good sequential code. One of the parallel programs will be a distributed-memory sort written in MPI; the other can be either a different MPI sort, or a Cilk++ Quicksort that you write and tune. You should run all your codes on the same machine (probably OnDemand).

Distributed-memory sorts:

Your MPI sort routine(s) should start out with each of p processors holding n/p numbers. It should end up with each processor again holding n/p numbers, but with all n numbers in nondecreasing order. That is, P(0) has the smallest n/p numbers, in order within the processor's array; then P(1) has the next smallest n/p, in order within the processor's array; then P(2), and so on up to P(p-1), which has the largest n/p numbers, in order.

To get the numbers in the first place, use a parallel random number generator as in Program 1 to generate n/p random numbers locally on each processor. This will be faster (and use much less memory) than generating all the numbers on one processor and sending them out to be sorted. You should also write a routine to verify that the answer is correct. You need to verify three things:

1. The n/p numbers on each processor are in nondecreasing order.
2. For each i from 0 to p-2, the largest number on P(i) is no larger than the smallest number on P(i+1).
3. The n numbers at the end are (as a multiset) the same n numbers you generated at the start, not just any n numbers in sorted order.
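Here is one possible shape for such a verification routine, as a sketch: the function name and the idea of passing in the pre-sort local sum are illustrative choices, not part of the assignment. Checks 1 and 2 are exact; check 3 uses a global sum as a proxy for the multiset property (the tolerance is needed because reordering a floating-point sum changes its rounding), which catches most bugs but is not an airtight proof.

    #include <math.h>
    #include <mpi.h>

    /* Sketch: returns 1 if the distributed sort output passes all three
       checks, 0 otherwise.  orig_local_sum is the sum of this rank's
       inputs, computed before sorting.  Assumes nlocal > 0. */
    int verify_sorted(double *a, int nlocal, double orig_local_sum,
                      MPI_Comm comm)
    {
        int rank, p, ok = 1;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);

        /* Check 1: locally nondecreasing. */
        for (int i = 1; i < nlocal; i++)
            if (a[i-1] > a[i]) ok = 0;

        /* Check 2: my largest element <= my right neighbor's smallest. */
        double my_max = a[nlocal-1], left_max = 0.0;
        int left  = (rank == 0)     ? MPI_PROC_NULL : rank - 1;
        int right = (rank == p - 1) ? MPI_PROC_NULL : rank + 1;
        MPI_Sendrecv(&my_max, 1, MPI_DOUBLE, right, 0,
                     &left_max, 1, MPI_DOUBLE, left, 0,
                     comm, MPI_STATUS_IGNORE);
        if (rank > 0 && left_max > a[0]) ok = 0;

        /* Check 3 (approximate): global sum unchanged by the sort. */
        double new_local_sum = 0.0, local[2], sums[2];
        for (int i = 0; i < nlocal; i++) new_local_sum += a[i];
        local[0] = orig_local_sum; local[1] = new_local_sum;
        MPI_Allreduce(local, sums, 2, MPI_DOUBLE, MPI_SUM, comm);
        if (fabs(sums[0] - sums[1]) > 1e-6 * fabs(sums[0])) ok = 0;

        /* Every rank must agree. */
        int all_ok;
        MPI_Allreduce(&ok, &all_ok, 1, MPI_INT, MPI_LAND, comm);
        return all_ok;
    }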

You can choose which distributed-memory algorithms to experiment with, but I suggest that one of them should be some version of the bucket sort in Section 4.2.1 of the text or the sample sort in Section 10.4.4. You can also read about a more recent distributed-memory sorting code, fairly complicated but very fast, here.

In designing distributed-memory parallel sorting algorithms, it's important to bear in mind that communication is extremely expensive and is to be avoided whenever possible. Thus, parallelizing a standard sorting algorithm one element at a time is likely to be very costly, since it might have to send a message every time it compares two elements. A better philosophy is for each processor to do some work on its own n/p numbers sequentially, use a small amount of communication to figure out roughly where each element is supposed to end up, and then send each element just once from where it started to where it belongs. This is the philosophy behind the parallel bucket and sample sorts.
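To make the philosophy concrete, here is a sketch of the "figure out where everything goes" phase of a sample sort. The helper's name and signature, the sample size s = p, and the evenly spaced sampling are all illustrative choices, not the only reasonable ones:

    #include <stdlib.h>
    #include <string.h>
    #include <mpi.h>

    static int cmp_dbl(const void *x, const void *y) {
        double a = *(const double *)x, b = *(const double *)y;
        return (a > b) - (a < b);
    }

    /* Sketch: sort locally, gather a small sample, compute p-1 global
       splitters, and count how many local elements belong in each bucket.
       Assumes nlocal >= p.  Sending the buckets is shown further below. */
    void choose_buckets(double *a, int nlocal, int p,
                        double *splitters /* length p-1 */,
                        int *sendcounts /* length p */, MPI_Comm comm)
    {
        int s = p;                       /* sample size per rank (tunable) */
        double *sample  = malloc(s * sizeof(double));
        double *allsamp = malloc((size_t)s * p * sizeof(double));

        qsort(a, nlocal, sizeof(double), cmp_dbl);  /* local, sequential */

        for (int i = 0; i < s; i++)      /* evenly spaced local sample */
            sample[i] = a[(long)i * nlocal / s];

        MPI_Allgather(sample, s, MPI_DOUBLE, allsamp, s, MPI_DOUBLE, comm);
        qsort(allsamp, (size_t)s * p, sizeof(double), cmp_dbl);

        for (int i = 1; i < p; i++)      /* p-1 splitters from the sample */
            splitters[i-1] = allsamp[i * s];

        /* a is sorted, so one linear pass assigns elements to buckets. */
        memset(sendcounts, 0, p * sizeof(int));
        int b = 0;
        for (int i = 0; i < nlocal; i++) {
            while (b < p - 1 && a[i] > splitters[b]) b++;
            sendcounts[b]++;
        }
        free(sample); free(allsamp);
    }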

Also, as described in Section 10.4.5, it's likely that things will go faster if you use high-level MPI collectives like "MPI_Alltoallv" to send the data around instead of using individual sends and recvs.
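Continuing the sketch above, the data movement itself then takes just two collectives, one to exchange counts and one to exchange elements (again a sketch; it assumes the includes and the cmp_dbl comparator from the previous fragment):

    /* Sketch: a is this rank's locally sorted data and sendcounts came
       from the splitter pass above.  On return, *out holds this rank's
       bucket, sorted, and *nout its length. */
    void exchange_buckets(double *a, int *sendcounts, int p, MPI_Comm comm,
                          double **out, int *nout)
    {
        int *recvcounts = malloc(p * sizeof(int));
        int *sdispls    = malloc(p * sizeof(int));
        int *rdispls    = malloc(p * sizeof(int));

        /* Everyone learns how much it will receive from everyone else. */
        MPI_Alltoall(sendcounts, 1, MPI_INT, recvcounts, 1, MPI_INT, comm);

        sdispls[0] = rdispls[0] = 0;
        for (int i = 1; i < p; i++) {
            sdispls[i] = sdispls[i-1] + sendcounts[i-1];
            rdispls[i] = rdispls[i-1] + recvcounts[i-1];
        }
        *nout = rdispls[p-1] + recvcounts[p-1];
        *out  = malloc(*nout * sizeof(double));

        /* One collective moves every element straight to its owner. */
        MPI_Alltoallv(a, sendcounts, sdispls, MPI_DOUBLE,
                      *out, recvcounts, rdispls, MPI_DOUBLE, comm);

        qsort(*out, *nout, sizeof(double), cmp_dbl);  /* or merge p runs */
        free(recvcounts); free(sdispls); free(rdispls);
    }

One caveat: plain sample sort leaves each rank with only approximately n/p elements, while the assignment asks for exactly n/p at the end, so you may need a final rebalancing step (or a more careful choice of splitters).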

Multicore sort:

If you want to write a Cilk++ sort, I recommend that you implement a quicksort, which is a natural divide-and-conquer sort. The hard part is doing the partition step in parallel, which is necessary to get span less than n and therefore parallelism more than log n. You should implement a parallel partition step, and you may also want to experiment with a pivot selection that is more sophisticated than picking a pivot at random. Here is a paper that describes such a sort.
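As a starting point, here is a minimal quicksort skeleton written against OpenCilk's cilk_spawn/cilk_sync keywords; the cutoff value and the last-element pivot are placeholders to tune. Note that partition_seq below is deliberately the ordinary sequential partition, i.e. exactly the O(n)-span bottleneck you are asked to replace with a parallel one:

    #include <cilk/cilk.h>

    /* Naive sequential (Lomuto) partition around a[hi]; returns the
       pivot's final index.  Replace this with a parallel partition. */
    static long partition_seq(double *a, long lo, long hi) {
        double pivot = a[hi];
        long i = lo;
        for (long j = lo; j < hi; j++)
            if (a[j] < pivot) { double t = a[i]; a[i] = a[j]; a[j] = t; i++; }
        double t = a[i]; a[i] = a[hi]; a[hi] = t;
        return i;
    }

    /* Sort a[lo..hi] inclusive; call as quicksort(a, 0, n-1). */
    void quicksort(double *a, long lo, long hi) {
        if (hi - lo < 64) {                  /* serial base case (tunable) */
            for (long i = lo + 1; i <= hi; i++) {   /* insertion sort */
                double x = a[i]; long j = i - 1;
                while (j >= lo && a[j] > x) { a[j+1] = a[j]; j--; }
                a[j+1] = x;
            }
            return;
        }
        long m = partition_seq(a, lo, hi);
        cilk_spawn quicksort(a, lo, m - 1);  /* two halves run in parallel */
        quicksort(a, m + 1, hi);
        cilk_sync;
    }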

For your Cilk++ sort, the data will start out in one big array, and should finish up in the same array but in sorted order. Again, you should write a validation routine to verify that your sort got the right result. This time there are only two things to verify:

1. The array is in nondecreasing order.
2. It contains (as a multiset) the same n numbers it started with.
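One simple, if memory-hungry, way to check both properties at once is to save a copy of the input before sorting, sort the copy with the library qsort, and compare elementwise. A sketch, reusing cmp_dbl from the earlier fragment:

    #include <stdlib.h>
    #include <string.h>

    /* Sketch: a is the (supposedly) sorted array, orig is an untouched
       copy of the original input, both of length n. */
    int verify_shared(const double *a, const double *orig, long n) {
        for (long i = 1; i < n; i++)
            if (a[i-1] > a[i]) return 0;       /* not nondecreasing */
        double *ref = malloc(n * sizeof(double));
        memcpy(ref, orig, n * sizeof(double));
        qsort(ref, n, sizeof(double), cmp_dbl);
        int ok = 1;                            /* same multiset? */
        for (long i = 0; i < n; i++)
            if (ref[i] != a[i]) ok = 0;
        free(ref);
        return ok;
    }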

What to measure and report:

You should run each of your sorts on at least three different inputs:

Run on a range of values of n, up to the largest you can; run on a range of values of p, up to 4 for Cilk++ and up to at least 16 for your MPI sort(s). When you measure running time, you should only measure the time of your sorting routine, not the time to generate the random numbers or the time to verify the answer.

In addition to the running time, for each test you do you should compute the "sort rate", which is defined as n*log(n)/(p*t), where n is the number of elements, p is the number of processors, and t is the measured running time in seconds. The reason to compute the sort rate is that, in the best of all possible worlds, it would be more or less the same no matter what values of n and p you use (provided you use an O(n log n) sorting algorithm). You should compute the sort rate for both (or all) of the parallel algorithms you implement, for a variety of values of p, and for the largest values of n you can manage. For comparison, you should also compute the sort rate for your algorithms running on a single processor, p = 1.
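For the MPI runs, the timing and sort-rate bookkeeping might look like the fragment below; my_mpi_sort and the variable names are placeholders, and log base 2 is an arbitrary but consistent convention (any fixed base scales every rate by the same constant). The fragment assumes math.h, stdio.h, and mpi.h are included:

    MPI_Barrier(MPI_COMM_WORLD);              /* start all ranks together */
    double t0 = MPI_Wtime();
    my_mpi_sort(a, nlocal, MPI_COMM_WORLD);   /* placeholder: your sort */
    MPI_Barrier(MPI_COMM_WORLD);              /* wait for the slowest rank */
    double t = MPI_Wtime() - t0;

    if (rank == 0)                            /* sort rate = n*log(n)/(p*t) */
        printf("n=%ld p=%d time=%g s rate=%g\n",
               n, p, t, (double)n * log2((double)n) / (p * t));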

For comparison, you should measure the running time of a good sequential sort (e.g. the C qsort routine), and (if one of your sorts is in Cilk++) of the Cilk++ quicksort example, and compute their sort rates. The idea is to see how close your parallel algorithm, with large p and large n, can come to matching the sort rate of a good sequential code.