240A Winter 2013 HW1
Due on Feb 8, 2013.
You will port and parallelize code for matrix multiplication, which is a
basic building block in many scientific
computations. The most naive code to multiply square matrices is:
for i = 1 to n
    for j = 1 to n
        for k = 1 to n
            C[i,j] = C[i,j] + A[i,k] * B[k,j]
        end
    end
end
Initialize A[i,j] = i+j and B[i,j] = i*j.
There are 3 options to implement the sequential code:
1) The naive approach listed above.
2) Use a submatrix partitioning (blocked version).
3) Use a BLAS dgemm library function.
The sample C/C++ code for the above 3 options, with timing and a test driver, is available from
this tar file.
What to do
- Write or port the C or C++ code to Triton with the Intel MKL library.
Report the megaflops numbers using the above 3 options with n=50, 100, 200, 400, 600, 800, and 1000 on one core.
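For option 3, a call through MKL's CBLAS interface might take the following shape (a sketch; the exact header and link flags depend on the MKL installation on Triton):

```c
#include <mkl.h>   /* header name may vary with the MKL version */

/* C = 1.0 * A * B + 0.0 * C, all matrices n-by-n, row-major. */
void matmul_dgemm(int n, const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, A, n,
                B, n,
                0.0, C, n);
}
```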
- Parallelize the naive sequential program using OpenMP.
Report megaflops numbers, parallel time, and speedup for n=800 with 2, 4, 8 threads (cores).
- Parallelize the naive program using MPI.
Process 0 collects the final results from all processes.
Report megaflops numbers, parallel time, and speedup for n=800 with 2, 4, 8, 16, 32 processes (processors).
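One straightforward MPI decomposition gives each process an equal band of rows of A and a full copy of B, with MPI_Gather bringing the result back to process 0; a sketch, assuming the process count divides n:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Row-striped MPI matrix multiply sketch: each rank computes its own
   band of rows of C, and rank 0 gathers the full matrix. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs, n = 800;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    int rows = n / nprocs;                       /* assumes nprocs divides n */

    double *B    = malloc(n * n * sizeof(double));
    double *Aloc = malloc(rows * n * sizeof(double));
    double *Cloc = malloc(rows * n * sizeof(double));
    double *C    = (rank == 0) ? malloc(n * n * sizeof(double)) : NULL;

    /* Every rank can build its own data from the closed-form initialization,
       so no initial scatter of A is needed. */
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < n; j++)
            Aloc[i * n + j] = (rank * rows + i) + j;   /* A[i,j] = i+j */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            B[i * n + j] = (double)i * j;              /* B[i,j] = i*j */

    double t0 = MPI_Wtime();
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += Aloc[i * n + k] * B[k * n + j];
            Cloc[i * n + j] = sum;
        }
    /* Process 0 collects the final results from all processes. */
    MPI_Gather(Cloc, rows * n, MPI_DOUBLE,
               C, rows * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("n=%d p=%d time=%g s mflops=%g\n", n, nprocs,
               t1 - t0, 2.0 * n * n * n / (t1 - t0) / 1e6);
    MPI_Finalize();
    return 0;
}
```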
- Write the optimized pthreads code for parallel matrix multiplication
so that you can obtain the "best" megaflops performance for n=800 running on a cluster node with 8 cores.
Report the megaflops numbers and parallel time achieved.
What to submit
- Turn in the source code directory without binary files
using the turnin program if you have a CSIL account (turnin HW1@cs240a directory-name).
Otherwise, email it to the TA.
- The code directory must contain instructions on how to compile, how to test, and how to collect performance numbers.
The code must contain a sample mechanism to check that the multiplied results are correct.
Namely, show that a few final array values are correct.
- A simple text report containing your group name(s) and the performance numbers.
For all problems, explain whether the performance numbers obtained are reasonable.
For Problem #4, explain your design and justify why your design/implementation has optimized the megaflops performance.
Reference links: