Slabpose Columnsort: A New Oblivious Algorithm for Out-of-Core Sorting on Distributed-Memory Clusters

Notes by Aydin Buluc (Original document)

The authors develop an out-of-core sorting algorithm for a distributed-memory cluster. They not only introduce a new algorithm, but also introduce a new paradigm which they call oblivious sorting. In the existing merging-based and partitioning-based paradigms for out-of-core sorting, the I/O and communication patterns are determined mostly by the input keys. This nondeterminism makes algorithm engineering, such as pipelining, very difficult. The third paradigm that is introduced, oblivious sorting algorithms, are comparison based algorithms with predetermined sequence of comparisons. Therefore, they are amenable to improvements through overlapping I/O, communication and computation because they have predetermined I/O and communication patterns.
The basic columnsort algorithm is introduced by Leighton(1985). Columnsort is a generalization of the odd-even transposition sort which itself is a parallelizable variant of bubblesort. The algorithm sorts N records arranged as an r x s matrix where N=rs with the two restrictions: s divides r (s | r) and r ³ 2s2. When s = 2, one may check that it degenerates into odd-even transposition sort.
In the parallel setting, each processor i owns columns i, i+P, i+2P.... which is the columnwise cyclic striping ( More on partitioning ). Memory buffers hold exactly r records (i.e. B = r) which allows the algorithm to load a whole column into memory. Each processor thus have [s/P] of such columns and for each column it does the following four stages, mostly in a pipelined setting (Figure 1).
  1. Read a column of r records
  2. Sort them locally using an in-memory sorting.
  3. Exchange records and permute them. That stage corresponds to one of the permutation steps (transpose-reshape, reshape-transpose, shift-up, shift-down) of sequential columnsort.They need communication in the parallel setting.
  4. Write a column of r records to disk
Figure 1: Stages that a column (= buffer of size r) passes
The problem with this formulation is the height interpretation: Assuming M is the overall memory of the whole cluster and P is the number of processors. Then this algorithm sets r=[M/P] in order to minimize communication and I/O. However the height restriction r ³ 2s2, combined with this interpretation, limits the maximum number of records that can be sorted N:
N £   æ
Ö

(M /P)3

2
 
Obviously that doesn't scale since if we increase the number of processors, the maximum number of records that can be sorted does not increase. They introduce another variant of their algorithm, called the M-Columnsort, that uses r=M as their height interpretation and improves the problem-size restriction to:
N £   æ
Ö

M3

2
 
However, M-Algorithm suffers from drastically increased communication costs and eventually performs worse than other variants.
The best variant they developed, Slabpose Columnsort, mainly uses a block striped partitioning of columns which they call "slaps".This new algorithm relaxes the problem-size restriction without increasing the communication costs. The performance comparisons of the algorithms can be found in Figure 2
Figure 2: Actual competion times, with 4 GB as the per-processor problem size
Their experimental setting uses only off-the-shelf software such as pthreads for overlapping I/O, communication and computation and MPI for message passing. They use a Beawolf cluster that is composed of 32 dual 2.8-GHz Intel Xeon nodes, each having 4 GB of RAM and 36 GB of HDD. Their use of parallelism is mainly to allow storing large data sets and parallelize I/O operations, simply because their algorithms (except for the M-Algorithm) is I/O bound.
My personal opinion favors M-Columnsort in spite of its relatively weak performance since it is the only algorithm that really scales with the increasing number of processors. Slabpose Columnsort is better that straightforward implementation of the columnsort (3-pass columnsort), but it still has a problem-size restriction that depends on the size of the external memory of one node. One possibility of future research might be to improve the communication costs of M-Columnsort. Lastly, as in the case of almost every external-memory algorithm, this algorithm might also be adapted to cache-oblivious model.



File translated from TEX by TTH, version 3.77.
On 11 Apr 2007, 03:08.