Computer Science 338
Parallel Processing

Williams College
Spring 2006


Tutorial Assignment 5: More MPI
Due: Tuesday, March 7, 2006 at 9:00 AM

Readings

The majority of this week's readings are from the textbook. Please read Quinn Chapters 5, 6 and 8. These chapters use three major examples of parallel algorithms to demonstrate program design and implementation using the MPI model. For next week's tutorial meetings, I would like each of you to come with a list of the most important ideas covered in each chapter, and of course any questions you have about the readings.

There are also a few matrix-matrix multiplication examples that I would like you to look at.

Distributed Data Structures

The MPI version of Conway's Game of Life used a distributed data structure. Each process maintains its own subset of the computational domain, in this case a contiguous block of rows of the grid. Other processes do not know about the data on a given process. Only the data needed to compute the next generation, a one-cell overlap, is exchanged between iterations.

The "slice by slice" method of distributing the grid was chosen only for its simplicity of implementation, both in determining which rows are assigned to which processes and in the straightforward communication patterns that can be used to exchange boundary data. We could partition in more complicated patterns, but there would be extra work involved.

Matrix-Matrix Multiplication

The possibilities for the matrix-matrix multiply are numerous. The absolute easiest way to do it is to distribute the matrix A by rows, replicate B everywhere, and distribute C by rows as well. If we distribute our matrices this way in the first place, everything is simple:

See: /usr/cs-local/share/cs338/examples/matmult_mpi_toosimple

This program has very little MPI communication - this is by design, as we distributed our matrices so that each process would have exactly what it needs.

Unfortunately, this is not likely to be especially useful. More likely, we will want all three matrices distributed the same way.

To make the situation more realistic, but still straightforward, let's assume that our initial matrices A and B are distributed by rows, in the same fashion as the Life simulator. Further, the result matrix C is also to be distributed by rows.

The process that owns each row will do the computation for that row. What information does each process have locally? What information will it need to request from other processes?

Matrix multiplication is a pretty "dense" operation, and we need to send all of B to all processes.

See: /usr/cs-local/share/cs338/examples/matmult_mpi_simple

Note that we initialize the rows of B on only one process, but since all of B is needed on every process, we need to broadcast those rows.

Can we do better? Can we get away without storing all of B on each process? We know we need to send all of it, but can we do all the computation that needs each row before continuing on to the next?

See: /usr/cs-local/share/cs338/examples/matmult_mpi_better

Yes, all we had to do was rearrange the loops that do the actual computation of the entries of C. We can broadcast each row of B, use it for everything it is needed for, and then move on. We save memory!

Even though we do exactly the same amount of communication, our memory usage per process goes from O(n^2) to O(n^2/p).

Next week, we will look at more complicated examples of distributed data structures and methods for determining the domain decomposition of those structures.

Lab Tasks

You may work alone or in pairs again on this week's programs. Any changes from last week's groups must be confirmed by an e-mail to me no later than Friday, March 3.

There are several files to turn in for this assignment. They should all be included in a file named tut05.tar that you submit using the turnin utility. Please use the filenames specified and be sure to include your name (and your partner's name, if you are working in a group) in each file.

  1. Modify one of the matrix-matrix multiply examples that uses MPI for communication so that all three matrices exist only on the master process at the start and finish of the computation. Use MPI functions to distribute the entries of A and B, as needed, to the processes (include the master in the computation), do the multiplication, then use MPI functions to gather the result C back to the master.

    Your submitted tar file should include a subdirectory matmult that includes your Makefile, your C source code, your PBS script(s), and a brief README file that explains how to run your program, describes how and why you chose to parallelize your program, and describes and analyzes your timing results. Please do not include object files or your executable in your tar file.

  2. Convert your OpenMP program that solves Laplace's equation using Jacobi iteration to a distributed memory program using message passing with the MPI library for communication.

    Most of the requirements for the program are the same as those of the OpenMP version.

    Your submitted tar file should include a subdirectory laplace that includes your Makefile, your C source code, your PBS script(s), and a brief README file that explains how to run your program, describes how and why you chose to parallelize your program, and describes and analyzes your timing results. Please do not include object files or your executable in your tar file.

    Honor code guidelines: While the programs are to be done only by you (meaning your group, if you choose to work in a group), along the lines of a laboratory program, I want to encourage you to ask questions and discuss the programs with me and with classmates outside your group as you develop them. However, no sharing of code between groups is permitted. If you have any doubts, please check first and avoid honor code problems later.

    Grading guidelines: Your grade for the programs will be determined by correctness, design, documentation, and style, as well as the required items in the detailed comments and README file.