Report ID
2008-15
Report Authors
Aydin Buluc, John R. Gilbert, Ceren Budak
Report Date
Abstract
We implemented and evaluated several Gaussian elimination based algorithms on Graphic Processing Units (GPUs). These algorithms, LU decomposition without pivoting, all-pairs shortest-paths, and transitive closure, all have similar data access patterns. The impressive computational power and memory bandwidth of the GPU make it an attractive platform to run such computationally intensive algorithms. Although improvements over CPU implementations have previously been achieved for those algorithms in terms of raw speed, the utilization of the underlying computational resources was quite low. We implemented a recursively partioned all-pairs shortest-paths algorithm that harnesses the power of GPUs better than existing implementations. The alternate schedule of path computations allowed us to cast almost all operations into matrix-matrix multiplications on a semiring. Since matrix-matrix multiplication is highly optimized and has a high ratio of computation to communication, our implementation does not suffer from the premature saturation of bandwidth resources as iterative algorithms do. By increasing temporal locality, our implementation runs more than two orders of magnitude faster on an NVIDIA 8800 GPU than on an Opteron. Our work provides evidence that programmers should rethink algorithms instead of directly porting them to GPU.
Document
2008-15.pdf212.33 KB