240A Winter 2013 HW1

Due on Feb 8, 2013.

You will port and parallelize code for matrix multiplication, a basic building block in many scientific computations. The most naive code to multiply two n-by-n square matrices is:

  for i = 1 to n
    for j = 1 to n
      for k = 1 to n
        C[i,j] = C[i,j] + A[i,k] * B[k,j]
      end
    end
  end
Initialize A[i,j] = i+j and B[i,j] = i*j.

There are three options for implementing the sequential code.

Sample C/C++ code for the above three options, with a timing and test driver, is available from this tar file.

What to do

  1. Write or port the C or C++ code to Triton using the Intel MKL library. Report the megaflops numbers for the above three options with n = 50, 100, 200, 400, 600, 800, and 1000 on one core.

  2. Parallelize the naive sequential program using OpenMP. Report the megaflops numbers, parallel time, and speedup for n = 800 with 2, 4, and 8 threads (cores).

  3. Parallelize the naive program using MPI. Process 0 collects the final results from all processes. Report the megaflops numbers, parallel time, and speedup for n = 800 with 2, 4, 8, 16, and 32 processes (processors).

  4. Write optimized pthreads code for parallel matrix multiplication so that you obtain the "best" megaflops performance for n = 800 running on a cluster node with 8 cores. Report the megaflops numbers and parallel time achieved.

What to submit

Reference links: