CS140 Final Project: Parallel Sorting

The object of this project is to write at least two parallel programs to sort double-precision floating-point numbers, and to compare their running times with each other and with a good sequential code.

Each sort routine should start out with each of p processors holding n/p numbers. It should end up with each processor again holding n/p numbers, but with all n numbers in nondecreasing order. That is, P(0) has the smallest n/p numbers, in order within the processor's array; then P(1) has the next smallest n/p, in order within the processor's array; then P(2) and so on up to P(p-1), which has the largest n/p numbers, in order.
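As a rough illustration of the final-state condition above, here is a minimal C sketch that checks it for data held in one address space. The function names and the "array of blocks" layout are assumptions made for illustration; in a real MPI program each block lives on a different processor, so the boundary comparison between P(r) and P(r+1) would need a small message. The sketch checks only the ordering condition, nothing else.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if a[0..len-1] is in nondecreasing order, 0 otherwise. */
int is_sorted_block(const double *a, size_t len)
{
    for (size_t i = 1; i < len; i++)
        if (a[i - 1] > a[i])
            return 0;
    return 1;
}

/* Hypothetical helper: blocks[r] points at the len numbers held by P(r).
   Returns 1 if every block is sorted and the last element of P(r)
   does not exceed the first element of P(r+1); 0 otherwise. */
int is_globally_sorted(const double *const *blocks, int p, size_t len)
{
    assert(p > 0);
    for (int r = 0; r < p; r++) {
        if (!is_sorted_block(blocks[r], len))
            return 0;
        if (r + 1 < p && len > 0 && blocks[r][len - 1] > blocks[r + 1][0])
            return 0;
    }
    return 1;
}
```

Equal values on either side of a processor boundary are accepted, since the required order is nondecreasing rather than strictly increasing.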

To get the numbers in the first place, use a parallel random number generator as in Program 1 to generate n/p random numbers locally on each processor. This will be faster (and use much less memory) than generating all the numbers on one processor and sending them out to be sorted. You should also write a routine to verify that the answer is correct. You need to verify three things:

When you measure running time, you should only measure the time of your sorting routine, not the time to generate the random numbers or the time to verify the answer.

In addition to the running time, for each test you run you should compute the "sort rate", which is defined as n*log(n)/(p*t), where n is the number of elements, p is the number of processors, and t is the measured running time in seconds. The reason to compute the sort rate is that, in the best of all possible worlds, it would be more or less the same no matter what values of n and p you use (provided you use an O(n log n) sorting algorithm). You should compute the sort rate for both (or all) of the parallel algorithms you implement, for a variety of values of p, and for the largest values of n you can manage. For comparison, you should also compute the sort rate for your algorithms running on a single processor, p = 1. You should also compute the sort rate for a system built-in sort function such as qsort (your choice) on a single processor for large n. The idea is to see how close your parallel algorithm, with large p and very large n, can come to matching the sort rate of a good sequential code.
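As a worked example of the sort rate formula, take the purely hypothetical values n = 2^20, p = 8, and t = 0.5 seconds (these are not measured results). The definition above does not say which base the logarithm uses; base 2 is assumed here, and a different base only rescales every rate by the same constant factor.

```latex
\[
\text{sort rate}
  = \frac{n \log_2 n}{p\,t}
  = \frac{2^{20} \cdot 20}{8 \cdot 0.5}
  = \frac{20\,971\,520}{4}
  = 5\,242\,880 \approx 5.2 \times 10^{6}
\]
```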

You can read about parallel sorting algorithms in Section 4.2.1 and Chapter 10 of the textbook. Here is an excellent paper from 1991 on parallel sorting algorithms. You can also read about a more recent, fairly complicated but very fast, parallel sorting code here.

You can choose which two (or more) algorithms to experiment with, but I suggest that one of them should be some version of the bucket sort in Section 4.2.1 or the sample sort in Section 10.4.4.

In designing parallel sorting algorithms, it's important to bear in mind that communication is extremely expensive and is to be avoided whenever possible. Thus, parallelizing a standard sorting algorithm one element at a time is likely to be very costly, since it might have to send a message every time it compares two elements. A better philosophy is for each processor to do some work on its own n/p numbers sequentially, then use a small amount of communication to figure out more or less where every element is supposed to end up, and finally send every element just once from where it started out to where it is supposed to end up. This is the philosophy behind the parallel bucket and sample sorts.
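The "figure out where every element is supposed to end up" step can be sketched locally as follows. This is a minimal sketch, not the textbook's algorithm: it assumes the processor has already sorted its own numbers and that all processors have somehow agreed on p-1 splitter values (how the splitters are chosen is not shown). The function name and parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* local:     this processor's numbers, already sorted locally.
   splitters: p-1 agreed-upon values in nondecreasing order (assumed given).
   counts[d]: on return, how many local elements are destined for P(d),
              namely those x with splitters[d-1] < x <= splitters[d]. */
void bucket_counts(const double *local, size_t len,
                   const double *splitters, int p, int *counts)
{
    assert(p > 0);
    for (int d = 0; d < p; d++)
        counts[d] = 0;

    /* Because local is sorted, the destination index never decreases,
       so one linear pass over the data is enough. */
    int d = 0;
    for (size_t i = 0; i < len; i++) {
        while (d < p - 1 && local[i] > splitters[d])
            d++;
        counts[d]++;
    }
}
```

Since the local array is sorted, the elements bound for each destination already sit in one contiguous run, so each element can be sent exactly once without any further rearrangement.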

Also, as described in Section 10.4.5, it's likely that things will go faster if you use high-level MPI collectives like "MPI_Alltoallv" to send the data around instead of using individual sends and recvs.
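MPI_Alltoallv takes, for each destination, a count and a displacement into the send buffer (and likewise for the receive side). A minimal sketch of turning per-destination counts into displacements is shown below. The helper name is hypothetical and no MPI call is made here; it only prepares the integer arrays such a call would use.

```c
#include <assert.h>

/* Given counts[d] = number of elements for P(d), fill displs[d] with the
   offset of the first such element in a contiguous buffer (an exclusive
   prefix sum). Returns the total number of elements.
   The arrays are int because MPI_Alltoallv takes int counts and
   displacements. */
int counts_to_displs(const int *counts, int p, int *displs)
{
    assert(p > 0);
    int total = 0;
    for (int d = 0; d < p; d++) {
        displs[d] = total;
        total += counts[d];
    }
    return total;
}
```

In an MPI program, one plausible sequence is to exchange the send counts first (for example with MPI_Alltoall) so that each processor learns its receive counts, apply the same helper to both sets of counts, and then pass the four arrays to MPI_Alltoallv. The returned total on the receive side gives the size of the receive buffer needed.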