
CS110B Programming Exercises

Modify the hello program so that the even-numbered nodes print ``hello from node'' followed by their node numbers. All even nodes also print the total number of nodes allocated.
Modify the broadcasting program (broadcast.c) so that node 0 broadcasts the message ``hello0'' to every other processor and node 1 broadcasts the message ``hello1'' to every other processor.
Each node prints the message(s) it receives.
In the ring.c (fring.f) program, each node sends a message to its successor in the ring. Modify the ring program so that node 0 also passes a second message, ``how are you'', to processor P-1, P-1 passes it to P-2, and so on, so that this message travels around the ring counterclockwise. The program stops when node 0 receives the message it sent to P-1. Verify that the content of the received message is correct.
Modify the ring program to reproduce the behavior of the program in Problem 2: node 0 sends the message ``hello0'' to node 1, then node 1 passes this message to node 2, and so on until node P-1 receives it. Then node 0 passes the message ``hello1'' to the nodes in this ring.

Modify the ring program to implement all-to-all broadcasting: every node broadcasts a message to all other nodes. To avoid excessive communication, the all-to-all broadcast is arranged in a ring-pipeline fashion. In general, for p processors, it takes p-1 steps to complete. For example, assuming p=4:

At time step 1, proc 0 sends message 0 to proc 1.
                proc 1 sends message 1 to proc 2.
                proc 2 sends message 2 to proc 3.
                proc 3 sends message 3 to proc 0.
 
At time step 2, proc 0 sends message 3 to proc 1.
                proc 1 sends message 0 to proc 2.
                proc 2 sends message 1 to proc 3.
                proc 3 sends message 2 to proc 0.
 
At time step 3, proc 0 sends message 2 to proc 1.
                proc 1 sends message 3 to proc 2.
                proc 2 sends message 0 to proc 3.
                proc 3 sends message 1 to proc 0.

Assume that each node Pi owns a vector Xi. Perform a global sum so that every node ends up with the sum of the vectors Xi from all processors.
Implement this in three ways: 1) using csend/crecv without the broadcasting function; 2) using csend (broadcasting)/crecv; 3) using gdsum.
The example programs in directory /fs/meiko-user/tyang/cs110b/newtiming include new_matrix_para.c (a simple parallel program for matrix multiplication). This program contains statements to measure the time spent on computation (defined in file meiko_timing.c).
Run these two programs and record the time for 2, 4, and 8 processors when the matrix dimension is 200 and 400.
Modify the Pi program to use a block mapping: processor 0 executes tasks 1, 2, ..., P, and processor 1 executes tasks P+1, P+2, ..., 2P.
Measure the parallel time (use node 0) in two cases: 1) including the communication of gdsum and 2) excluding the communication of gdsum. Check the parallel time in both cases for p=2, p=4 and p=8.
Modify the program matrix_para.c so that each node uses memory on the order of n**2/p instead of n**2. Measure the time for p=2, p=4, p=8 when n=200 and n=400.
Modify the program lusimple.c so that each node uses memory on the order of n**2/p instead of n**2. Measure the time for p=2, p=4, p=8 when n=200 and n=400.
For the matrix-vector multiplication problem c=Ax, the number of processors is p and the dimension of matrix A is nxn, where n=p*r. We partition matrix A into n rows, giving n tasks, and distribute these tasks evenly among the p processors using the block method. We assume that each processor initializes its own part of A and that processor 0 holds the value of x. The parallel program first needs to broadcast x from processor 0 using csend/crecv. After the computation, each processor sends its partial result of vector c to processor 0 using csend/crecv. The message type of crecv should be -1 (why?). Write a C or Fortran program and run it on the Paragon. Note that each node program should only use space on the order of n**2/p.
Test example: the elements of matrix A and vector x are all 1, n=1000, and p=2, 4, 8.
For p=2, 4, 8, report the parallel time that includes initialization, broadcasting, multiplication, and result collection, and report the parallel time that includes the multiplication part only. You also need to determine the sequential time (using p=1) and compute the speedup and efficiency for each case.
You need to hand in your program, running trace, and performance data. The running trace includes the output of your program, but you don't need to print the entire result vector c; just print the first 10 elements.
Assume all upper triangular elements of matrix A are 0. Change the algorithm in 1) so that no operations are performed on the upper triangular part. Notice that the processor loads are then uneven, so you need to use a cyclic method to map the computation onto processors. Modify your program accordingly.
Test example: the elements of matrix A are all 1 except in the upper triangular part, which is 0. The elements of x are all 1. n=1000 and p=1, 2, 4, 8.
Report the performance data in the same way as 1).
Analyze the parallel time difference between 1) and 2).
Modify the program lusimple.c to solve a linear system using Gaussian elimination with pivoting. Each node uses memory on the order of n**2/p. Measure the time for p=2, p=4, p=8 when n=200 and n=400.


Tao Yang
Sun Jan 21 15:34:03 PST 1996