S+: Fast Sparse Gaussian Elimination (LU Factorization)

Sparse LU factorization with partial pivoting is important for many scientific applications, and delivering high performance for this problem is difficult on distributed memory machines. This project studies the properties of elimination forests and uses them to guide supernode partitioning/amalgamation and execution scheduling. Combined with a 2D mapping, this design effectively identifies dense structures without introducing too many zeros into the BLAS computation, and it exploits asynchronous parallelism with low buffer space cost. The implementation, called S+, uses supernodal matrix multiplication, which retains BLAS-3 level efficiency and avoids unnecessary arithmetic operations. This project also studies two space optimization techniques that can greatly improve the worst-case performance of static symbolic factorization.

The experiments show that S+ can achieve up to 10.85 GFLOPS on 128 Cray T3E 450 MHz nodes, which is the highest performance reported in the literature. The previous record was 2.583 GFLOPS on a shared memory machine.

This project is currently supported in part by NSF CAREER CCR-9702640. S+ has also been tested and packaged by Sun Microsystems and is released as part of Sun HPC ClusterTools 4 Software.
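The trade-off behind supernode partitioning and amalgamation can be illustrated with a small sketch. This is not the S+ algorithm (which is guided by elimination forests); it is a simplified greedy pass, under stated assumptions, that groups consecutive columns of L into one supernode as long as storing them as a single dense block would add no more than a chosen number of explicit zeros. The names `col_struct`, `padding`, `partition_supernodes`, and `max_extra_zeros` are hypothetical, and the sketch treats each diagonal block as dense, counting only the padding below it.

```python
# Simplified illustration only: not the S+ partitioning algorithm.
# col_struct[j] is assumed to hold the row indices i > j that are
# structurally nonzero in column j of L (hypothetical input format).


def padding(col_struct, start, end):
    """Count explicit zeros stored below the diagonal block when columns
    start..end (inclusive) are forced to share one common row set."""
    below = [{i for i in col_struct[j] if i > end} for j in range(start, end + 1)]
    union = set().union(*below)
    return sum(len(union) - len(rows) for rows in below)


def partition_supernodes(col_struct, max_extra_zeros=0):
    """Greedily group consecutive columns into supernodes.

    max_extra_zeros = 0 accepts only columns whose structure below the
    block is identical; a larger value amalgamates columns at the cost
    of that many padded zeros per supernode.
    """
    n = len(col_struct)
    supernodes = []
    start = 0
    for j in range(1, n):
        # Close the current group if adding column j needs too much padding.
        if padding(col_struct, start, j) > max_extra_zeros:
            supernodes.append((start, j - 1))
            start = j
    if n:
        supernodes.append((start, n - 1))
    return supernodes


# A made-up 6-column structure used only for illustration.
col_struct = [{1, 4, 5}, {4, 5}, {3}, {5}, {5}, set()]

strict = partition_supernodes(col_struct)                       # [(0, 1), (2, 2), (3, 5)]
relaxed = partition_supernodes(col_struct, max_extra_zeros=1)   # [(0, 1), (2, 5)]
print(strict, relaxed)
```

Allowing a single padded zero merges column 2 into the following supernode, which yields a larger dense block for BLAS-3 kernels at the price of one stored zero. Choosing that tolerance is the balance the paragraph above describes.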

Download S+ 1.0. Please cite this if you use S+ in your work.

S+ 1.0 runs on the SGI Origin 2000/Power Challenge and the Cray T3E. This release uses MPI and a BLAS library. You may port it to another machine that also supports MPI and BLAS; please let us know if you have done so successfully or have application experience with S+.

Selected Publications on Fast Sparse Gaussian Elimination (LU Factorization):

Current members:

Past members: