Slabpose Columnsort: A New Oblivious Algorithm for Out-of-Core
Sorting on Distributed-Memory Clusters
The authors develop an out-of-core sorting algorithm for a distributed-memory
cluster. They not only introduce a new algorithm, but also introduce a new
paradigm which they call oblivious sorting. In the existing merging-based and
partitioning-based paradigms for out-of-core sorting, the I/O and communication
patterns are determined mostly by the input keys. This nondeterminism makes
algorithm engineering, such as pipelining, very difficult. The third paradigm
that is introduced, oblivious sorting algorithms, are comparison based
algorithms with predetermined sequence of comparisons. Therefore, they are
amenable to improvements through overlapping I/O, communication and computation
because they have predetermined I/O and communication patterns.
The basic columnsort algorithm is introduced by Leighton(1985). Columnsort is a
generalization of the odd-even
transposition sort which itself is a parallelizable variant of
bubblesort. The algorithm sorts N records arranged as an r x s matrix where
N=rs with the two restrictions: s divides r (s | r)
and r ³ 2s2. When s = 2, one may check
that it degenerates into odd-even transposition sort.
In the parallel setting, each processor i owns columns i, i+P, i+2P.... which
is the columnwise cyclic striping (
More on partitioning ). Memory buffers hold exactly r records (i.e. B =
r) which allows the algorithm to load a whole column into memory. Each
processor thus have [s/P] of such columns and for each column it does the
following four stages, mostly in a pipelined setting (Figure 1).
-
Read a column of r records
-
Sort them locally using an in-memory sorting.
-
Exchange records and permute them. That stage corresponds to one of the
permutation steps (transpose-reshape, reshape-transpose, shift-up, shift-down)
of sequential columnsort.They need communication in the parallel setting.
-
Write a column of r records to disk
Figure 1: Stages that a column (= buffer of size r) passes
The problem with this formulation is the height interpretation: Assuming M is
the overall memory of the whole cluster and P is the number of processors. Then
this algorithm sets r=[M/P] in order to minimize communication and I/O. However
the height restriction r ³ 2s2, combined
with this interpretation, limits the maximum number of records that can be
sorted N:
Obviously that doesn't scale since if we increase the number of processors, the
maximum number of records that can be sorted does not increase. They introduce
another variant of their algorithm, called the M-Columnsort, that uses
r=M as their height interpretation and improves the problem-size restriction
to:
However, M-Algorithm suffers from drastically increased communication costs
and eventually performs worse than other variants.
The best variant they developed, Slabpose Columnsort, mainly uses a
block striped partitioning of columns which they call "slaps".This new
algorithm relaxes the problem-size restriction without increasing the
communication costs. The performance comparisons of the algorithms can
be found in Figure 2
Figure 2: Actual competion times, with 4 GB as the
per-processor problem size
Their experimental setting uses only off-the-shelf software such as pthreads
for overlapping I/O, communication and computation and MPI for message passing.
They use a Beawolf cluster that is composed of 32 dual 2.8-GHz Intel Xeon
nodes, each having 4 GB of RAM and 36 GB of HDD. Their use of parallelism is
mainly to allow storing large data sets and parallelize I/O operations, simply
because their algorithms (except for the M-Algorithm) is I/O bound.
My personal opinion favors M-Columnsort in spite of its relatively weak
performance since it is the only algorithm that really scales with the
increasing number of processors. Slabpose Columnsort is better that
straightforward implementation of the columnsort (3-pass columnsort), but it
still has a problem-size restriction that depends on the size of the external
memory of one node. One possibility of future research might be to improve the
communication costs of M-Columnsort. Lastly, as in the case of almost every
external-memory algorithm, this algorithm might also be adapted to
cache-oblivious model.
File translated from TEX by
TTH, version 3.77.
On 11 Apr 2007, 03:08.