240A Winter 2013 HW1
Due on Feb 8, 2013.
You will port and parallelize code for matrix multiplication, which is a
basic building block in many scientific
computations. The most naive code to multiply square matrices is:
for i = 1 to n
    for j = 1 to n
        for k = 1 to n
            C[i,j] = C[i,j] + A[i,k] * B[k,j]
        end
    end
end
Initialize A[i,j] = i+j and B[i,j] = i*j.
There are 3 options to implement the sequential code:
1) The naive approach listed above.
2) Use a submatrix partitioning (blocked version).
3) Use a BLAS dgemm library function.
The sample C/C++ code for the above 3 options, with timing and a test driver, is available from
this tar file.
What to do
- Write or port the C or C++ code to Triton with the Intel MKL library.
Report the megaflops numbers using the above 3 options with n=50, 100, 200, 400, 600, 800, and 1000 on one core.
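For option 3, a call through MKL's CBLAS interface might take the following shape (a sketch; the exact header and link flags depend on the MKL installation on Triton):

```c
#include <mkl.h>   /* header name may vary with the MKL version */

/* C = 1.0 * A * B + 0.0 * C, all matrices n-by-n, row-major. */
void matmul_dgemm(int n, const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, A, n,
                B, n,
                0.0, C, n);
}
```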
- Parallelize the naive sequential program using OpenMP.
Report megaflops numbers, parallel time, and speedup for n=800 with 2, 4, 8 threads (cores).
- Parallelize the naive program using MPI.
Process 0 collects the final results from all processes.
Report megaflops numbers, parallel time, and speedup for n=800 with 2, 4, 8, 16, 32 processes (processors).
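One straightforward MPI decomposition gives each process an equal band of rows of A and a full copy of B, with MPI_Gather bringing the result back to process 0; a sketch, assuming the process count divides n:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Row-striped MPI matrix multiply sketch: each rank computes its own
   band of rows of C, and rank 0 gathers the full matrix. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs, n = 800;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    int rows = n / nprocs;                       /* assumes nprocs divides n */

    double *B    = malloc(n * n * sizeof(double));
    double *Aloc = malloc(rows * n * sizeof(double));
    double *Cloc = malloc(rows * n * sizeof(double));
    double *C    = (rank == 0) ? malloc(n * n * sizeof(double)) : NULL;

    /* Every rank can build its own data from the closed-form initialization,
       so no initial scatter of A is needed. */
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < n; j++)
            Aloc[i * n + j] = (rank * rows + i) + j;   /* A[i,j] = i+j */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            B[i * n + j] = (double)i * j;              /* B[i,j] = i*j */

    double t0 = MPI_Wtime();
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += Aloc[i * n + k] * B[k * n + j];
            Cloc[i * n + j] = sum;
        }
    /* Process 0 collects the final results from all processes. */
    MPI_Gather(Cloc, rows * n, MPI_DOUBLE,
               C, rows * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("n=%d p=%d time=%g s mflops=%g\n", n, nprocs,
               t1 - t0, 2.0 * n * n * n / (t1 - t0) / 1e6);
    MPI_Finalize();
    return 0;
}
```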
- Write the optimized pthreads code for parallel matrix multiplication
so that you can obtain the "best" megaflops performance for n=800 running on a cluster node with 8 cores.
Report the megaflops numbers and parallel time achieved.
What to submit
- Turn in the source code directory without binary files
using the turnin program if you have a CSIL account (turnin HW1@cs240a directory-name).
Otherwise, email it to the TA.
- The code directory must contain instructions on how to compile, how to test, and how to collect performance numbers.
The code must contain a sample mechanism to check that the multiplied results are correct.
Namely, show that a few final array values are correct.
- A simple text report containing your group name(s) and the performance numbers.
For all problems, explain whether the performance numbers obtained are reasonable.
For Problem #4, explain your design and justify why your design/implementation has optimized the megaflops performance.
Reference links: