The first part of the notes on MPI Applications are available in Lecture 9.
The MPI version of Conway's Game of Life used a distributed data structure. Each process maintains its own subset of the computational domain, in this case just a number of rows of the grid. Other processes do not know about the data on a given process. Only that data that is needed to compute the next generation, a one-cell overlap, is exchanged between iterations.
The "slice by slice" method of distributing the grid was chosen only for its simplicity of implementation, both in the determination of what processes are given what rows, and the straightforward communication patterns that can be used to exchange boundary data. We could partition in more complicated patterns, but there would be extra work involved.
The possiblities for the matrix-matrix multiply are numerous. Now the absolute easiest way to do it would be to distribute the matrix A by rows, have B replicated everywhere, and then have C by rows. If we distributed our matrices this way in the first place, everything is simple:
Demo: matmult_mpi_toosimple.tgz Also available in /home/faculty/terescoj/shared/cs338/lect10.
This program has very little MPI communication - this is by design, as we distributed our matrices so that each process would have exactly what it needs.
Unfortunately, this is not likely to be especially useful. More likely, we will want all three matrices distributed the same way.
To make the situation more realistic, but still straightforward, let's assume that our initial matrices A and B are distributed by rows, in the same fashion as the Life simulator. Further, the result matrix C is also to be distributed by rows.
The process that owns each row will do the computation for that row. What information does each process have locally? What information will it need to request from other processes?
Matrix multiplication is a pretty "dense" operation, and we to send all the columns of B to all processes.
Demo: matmult_mpi_simple.tgz Also available in /home/faculty/terescoj/shared/cs338/lect10.
Note that we only initialize rows of B on one process, but since it's all needed on every process, we need to broadcast those rows.
Can we do better? Can we get away without storing all of B on each process? We know we need to send it, but we we do all the computation that needs each row before continuing on to the next?
Demo: matmult_mpi_better.tgz Also available in /home/faculty/terescoj/shared/cs338/lect10.
Yes, all we had to do was rearrange the loops that do the actual computation of the entries of C. We can broadcast each row, use it for everything it needs to be used for, then we move on. We save memory!
Even though we do the exact same amount of communication, our memory usage per process goes from O(n2) to O((n2)/(p)).
So far, we have looked at distributed data structures that use only arrays. Arrays usually are the easiest to distribute: we simply assign a range of subscripts to each process.
We can consider distributing things like sets, linked lists, trees, and graphs. However, we will look at the distribution of a mesh structure, as meshes are commonly used as the underlying structure for many scientific computations that use distributed memory computers.
Meshes come in a variety of types. Quads and hexes are structured meshes. Triangles and tets are unstructured meshes. There are big differences in terms of what happens when you solve on them and how hard they are to generate, but for our purposes here, the issues are similar. Our examples will be unstructured meshes.
Terminology: A typical mesh entity hierarchy consists of three-dimensional regions, and their bounding faces, edges, and vertices, with bidirectional links between mesh entities of consecutive order, allowing queries such as "what faces bound this region" and "what edges are incident on this vertex" to be made efficiently.
The term "element" often is used to refer to the highest-dimension entity, often is the place where the solution lives.
How to store these things?
The use of adaptivity (and some other considerations) necessitate linked data structures for the mesh.