Report ID
2008-15
Report Date
Abstract
We implemented and evaluated several Gaussian elimination based algorithms
on Graphic Processing Units (GPUs). These algorithms, LU decomposition
without pivoting,
all-pairs shortest-paths, and transitive closure, all have similar data
access patterns. The impressive computational power and memory bandwidth of
the GPU make it an attractive
platform to run such computationally intensive algorithms.
Although improvements over CPU implementations have previously been achieved
for those algorithms in terms of raw speed, the utilization of the
underlying computational resources
was quite low. We implemented a recursively partioned all-pairs
shortest-paths algorithm that harnesses the power of GPUs better than
existing implementations. The alternate schedule of
path computations allowed us to cast almost all operations into
matrix-matrix multiplications on a semiring.
Since matrix-matrix multiplication is highly optimized and has a high ratio
of computation to communication, our implementation does not suffer from the
premature saturation of bandwidth
resources as iterative algorithms do. By increasing temporal locality, our
implementation runs more than two orders of magnitude faster on an NVIDIA
8800 GPU than on an Opteron.
Our work provides evidence that programmers should rethink algorithms
instead of directly porting them to GPU.
Document
2008-15.pdf212.33 KB