Computer Science 341-02
Parallel Processing

Mount Holyoke College
Fall 2007

Assignment 6: Jacobi Iteration on a Distributed Quadtree
Due: 2:30 PM, Thursday, November 29, 2007

Schedule

This document will take us through the next couple of weeks worth of meetings. Here is our meeting schedule for the remainder of November:

November 8: discussion of adaptivity and distributed linked structures
November 12: (special 10 AM Monday meeting) discussion of partitioning and dynamic load balancing algorithms: geometric algorithms
November 20: discussion of partitioning and dynamic load balancing algorithms: graph algorithms
November 27: discussion of partitioning and dynamic load balancing algorithms: other algorithms
November 29: implementations of partitioning and dynamic load balancing algorithms

Readings

In addition to the readings contained in this document, we will be looking at three articles:

The article "Parallel Structures and Dynamic Load Balancing for Adaptive Finite Element Computation" by Flaherty et al., from Applied Numerical Mathematics, 26 (1998), pp. 241-263.
The book chapter "Partitioning and Dynamic Load Balancing for the Numerical Solution of Partial Differential Equations" by Teresco, Devine, and Flaherty, from Numerical Solution of Partial Differential Equations on Parallel Computers, Are Magnus Bruaset, Petter Bjørstad, Aslak Tveito, editors. © Springer-Verlag, 2005.
The article "Zoltan data management services for parallel dynamic applications," by Devine, et al., Computing in Science and Engineering, 4 (2002), pp. 90-97. This describes the Zoltan library, which includes implementations of many of the procedures from the book chapter.

Adaptivity and Distributed Linked Structures

So far, we have looked at distributed data structures that use only arrays. Arrays usually are the easiest to distribute; we simply assign a range of subscripts to each process. Quinn spends a lot of time discussing local vs. global indexing.

In your Jacobi solver, solution points were distributed evenly through the domain:

-------------------------------------
|                                   |
|   .   .   .   .   .   .   .   .   |
|                                   |
|   .   .   .   .   .   .   .   .   |
|                                   |
|   .   .   .   .   .   .   .   .   |
|                                   |
|   .   .   .   .   .   .   .   .   |
|                                   |
|   .   .   .   .   .   .   .   .   |
|                                   |
|   .   .   .   .   .   .   .   .   |
|                                   |
|   .   .   .   .   .   .   .   .   |
|                                   |
|   .   .   .   .   .   .   .   .   |
|                                   |
-------------------------------------

If you wanted a more accurate solution, you could add more solution points. But since your program used a uniform distribution of the points, better accuracy requires adding points everywhere.

-------------------------------------
| . . . . . . . . . . . . . . . . . |
| . . . . . . . . . . . . . . . . . |
| . . . . . . . . . . . . . . . . . |
| . . . . . . . . . . . . . . . . . |
| . . . . . . . . . . . . . . . . . |
| . . . . . . . . . . . . . . . . . |
| . . . . . . . . . . . . . . . . . |
| . . . . . . . . . . . . . . . . . |
| . . . . . . . . . . . . . . . . . |
| . . . . . . . . . . . . . . . . . |
| . . . . . . . . . . . . . . . . . |
| . . . . . . . . . . . . . . . . . |
| . . . . . . . . . . . . . . . . . |
| . . . . . . . . . . . . . . . . . |
| . . . . . . . . . . . . . . . . . |
| . . . . . . . . . . . . . . . . . |
-------------------------------------

While this is possible, it increases the total amount of work very quickly.

In some cases, we really only need more points in just parts of the domain. In a heat distribution problem, if there is a heat source in the northwest corner, we may only need more accuracy near that heat source. Fewer points may provide a sufficiently accurate solution further from the source.

-------------------------------------
| ..... . . . . . .   .   .   .   . |
| ..... . . . . . .   .   .   .   . |
| ..... . . . .   .   .   .   .   . |
| . . . . .   .   .   .   .   .   . |
| . . . . .   .   .   .   .   .   . |
| . . .   .   .   .   .   .   .   . |
| . . .   .   .   .   .   .   .   . |
| .   .   .   .   .   .   .   .   . |
| .   .   .   .   .   .   .   .   . |
| .   .   .   .   .   .   .       . |
| .   .   .   .   .   .   .         |
| .   .   .   .   .       .       . |
| .   .   .   .   .                 |
| .   .   .   .   .       .       . |
| .   .   .                         |
| .   .   .       .       .       . |
-------------------------------------

This is significantly more efficient in terms of the amount of computation we need to do to obtain a solution of acceptable accuracy.

If we know ahead of time where the extra work is needed, we could assign extra points there at the start of the computation. However, we often do not know this information. After all, we probably wouldn't be solving problems for which we already have a solution handy, so an adaptive approach is taken. Periodically, the accuracy of the solution can be checked, and extra points added as needed. In the context of the Jacobi solver, we may have a threshold for the greatest allowable difference in temperature between adjacent points. If the difference exceeds the threshold, points are added in that vicinity and the solution is recomputed.

Adaptivity with arrays is difficult, as the completely regular structure is lost. In many cases, a mesh structure, often implemented as a linked data structure, is used instead of arrays.

Meshes come in a variety of types. Meshes consisting of quadrilaterals (hexahedra in three dimensions) are called structured meshes. Meshes constructed from triangles (tetrahedra in three dimensions) are called unstructured meshes. There are big differences in terms of what happens when you solve on them and how hard they are to generate, but for our purposes here, the issues are similar.

Some terminology: a typical mesh consists of three-dimensional regions, and their bounding faces, edges, and vertices. A data structure implementing the mesh will often allow queries such as "what faces bound this region" and "what edges are incident on this vertex" to be made efficiently.

The term "element" often is used to refer to the highest-dimension entity, and in many cases is the entity with which the solution is stored.

In fact, our Jacobi example is really just a computation on a uniform quadrilateral structured mesh. Here, the squares serve as the mesh elements and these contain the solution values.

A (hypothetical) adaptive refinement of such a mesh might look like:

Here is a simple unstructured mesh:

The use of adaptivity (and some other considerations) necessitate linked data structures for the mesh. Here is the triangular mesh after a hypothetical refinement operation:

Such mesh structures may be stored in memory as arrays of entities or with a full topological hierarchy. In your lab tasks this week, you will implement an adaptive structured mesh using a quadtree data structure.

Here are some examples of three-dimensional unstructured meshes:

Mesh of the space about the wing of an Onera M6 airplane. Contains 85,567 tetrahedral elements.
Mesh of a very fake airplane.
Mesh of a simplified perforated shock tube (well, a half cylinder joined with a larger quarter cylinder). Contains about 35,000 tertrahedral elements.
Mesh of human arteries, created from MRI data. Contains 1,103,018 tetrahedral elements.

These sample unstructured meshes are stored in memory using a hierarchical linked structure. These structures are implemented in a software library called the SCOREC Mesh Database (MDB). We could write programs to operate entirely on these meshes (and many have been written) but it is not yet parallel.

A second library, the Parallel Mesh Database (PMDB) is built on top of MDB and provides for distributed meshes.

We would like our distributed meshes to provide all the functionality and flexibility that is provided by the regular MDB. See the description and figures in the Applied Numerical Mathematics article. PMDB takes care of all of the message passing to migrate the mesh entities as appropriate and to update all of the partition boundary structures.

A variety of algorithms can be used to determine just how to distribute mesh elements among a set of cooperating processes and several of these mesh partitioning algorithms will be topics of study for us very soon.

Adaptive mesh refinement of PMDB meshes is provided by the (very scary) refdref library.

Refinement of tetrahedra:
Coarsening of tetrahedra:

We will pick up with more on these topics later in the semester.

Lab Tasks

Your final major lab programming task is to develop a parallel adaptive version of the Jacobi solver. This is a complex undertaking, and we will tackle it as a group (that is, all three of us) over the next few weeks. We will use a Subversion repository to facilitate our cooperative development effort. We will need to make a steady effort - there are many stages to the development and we will need time to coordinate our efforts.

We could approach this problem using an unstructured mesh as described above, but that can get complicated very quickly, even more so than the approach we will take. Instead, we will develop a C or C++ program to solve Laplace's equation on a square domain using Jacobi iteration that operates on an adaptive quadtree structure. We will then parallelize this adaptive Jacobi solver, using the SPMD model with MPI for message passing. Finally, we will implement a redistribution procedure to rebalance the load after imbalance is introduced by adaptivity.

Quadtrees

A quadtree is a spatial tree data structure. Each node in the tree is called a quadrant. Leaf nodes are called terminal quadrants. These terminal quadrants will serve as the elements on which we will perform computation.

The tree's root represents the entire domain, which, in our case, is a square.

The four children of the root each represent one quarter of the space taken by the root.

These children can then be divided in four, continuing down as many levels as desired. Different parts of the domain may be refined to different levels.

Computing Sequentially on a Quadtree

The Jacobi iteration on a given quadtree is similar to the iteration you've done in the array-based versions. However, since there may be a deeper quadtree in some parts of the domain, some leaf quadrants have neighbors more or less refined than themselves.

In the above figure, the element "E" should compute its value based on the values of the shaded neighbor quadrants. Note that E's north neighbor is actually refined one level further, so it has two immediate north neighbors. Ideally, the north neighbor value should be the average of the two immediate quadrant neighbors, but it is sufficient to use the value at the shaded quadrant (which is the average of the four leaf quadrants).

One obstacle to overcome is to locate neighboring quadrants efficiently. It is possible to maintain direct neighbor links within your quadtree structure, but you may find it more useful to search based on coordinate information. Each quadrant knows its own bounding box coordinates and from this can easily compute the coordinates of the adjacent quadrants at the same level. A nice feature of the quadtree structure is that the quadrants that contain any point in space can easily be found with a simple traversal from the root, determining the correct child at each step by comparing coordinates. If the neighbor point is outside the root quadrant, you know to apply a boundary condition.

Sequential Program Requirements

To make the program more interesting in the context of adaptivity, our program should allow a wider range of initial and boundary conditions than your previous implementations.

Initial conditions should be specified by a C function that provides values given (x,y) coordinates of a point in the domain. Each leaf quadrant's solution value is initialized to the value obtained by passing the quadrant's centroid to this function.
Boundary conditions are also specified through C functions. The left and right boundaries take the y coordinate as a parameter and the top and bottom boundaries take the x coordinate as a parameter, allowing boundary conditions to be functions in addition to simple constants. When a quadrant needs to query a boundary condition when computing its value, it should call the appropriate function.
Special internal boundary conditions are specified by one more C function. The function takes a leaf quadrant and sets its value if the quadrant contains any points that have internal boundary values. For example, if there is a point heat source at (0.25,0.25) keeping a constant temperature of 2, this function will set the solution value of any quadrant containing (0.25,0.25) to 2. This function should be called on each leaf quadrant during each Jacobi iteration step. If the function sets the quadrant's solution value, that value should be used instead of the solution computed based on its neighbors.

These functions can be hard-coded into our program, but we will implement a system where initial and boundary conditions can be specified through a configuration file, time permitting.

The program should take several parameters:

an initial global refinement level, which in turn determines the initial number of leaf quadrants
a Jacobi iteration tolerance, similar to the tolerance from the previous implementations
a limit on the number of Jacobi iteration steps
a refinement threshold, specifying the maximum difference in temperature allowed between adjacent leaf quadrants
a limit on the total number of refinement steps

Here is the result of my program when run on the unit square with the x=0 boundary condition set to 2y, the other side boundaries fixed at 0, and a special boundary condition setting the point at (0.75,0.75) to 3. The entire domain is initialized to 0. The initial quadtree is refined 3 levels (64 leaf quadrants), a Jacobi tolerance of 0.0001 with a maximum number of iterations of 2000, and a refinement threshold of 0.05 and a maximum number of refinement levels of 3.

Sequential Program Design and Implemenation

The structure of the program might look something like this:

  create quadtree to initial refinement level
  set initial solution values of leaf quadrants to initial conditions
  do {
    do {
      foreach leaf quadrant {
        if (!special boundary condition applied)
          do Jacobi iteration step
      }
    } while (error > Jacobi tolerance && jac_steps < jac_max)
    refine quadrants based on refinement threshold
  } while (any quadrants were refined && ref_steps < ref_max)

Even if we choose to write our program in C instead of C++, we should follow good object-oriented design. Separate the functionality of the quadtree from the solution process.

There are a few C macros and functions may be useful in the shared area for this assignment.

As difficult as it was to print your solutions in a meaningful way in the uniform grid version, it will be even more so once we include adaptivity. We will print solutions to a file in a format that includes the coordinates of each leaf quadrant and its solution value. These values may then be plotted with gnuplot. A script that I used to generate some of the solution plots in this handout is available in the class shared area and here. This is a very simple gnuplot script and I am sure you can do better.

Parallelization

This is, after all, an upper-level elective and a parallel processing course, so our ultimate goal in the assignment is to parallelize this program.

There are many choices to make when parallelizing a program such as this. Which parts of the computation are to be performed by each processor? What is the granularity of the units of work to be divided among the processors? What information must be maintained and what communication is needed to support the computation once the work has been divided? Will the workloads need to be adjusted following adaptive steps? How can such a rebalancing be performed?

In order to keep the task manageable, we will follow the following phased development process:

Phase 1: Build the initial quadtree to the requested refinement level, but replicate it completely on each process. Assign a unique owner process to each leaf quadrant, essentially partitioning the quadtree. Each quadrant's solution is computed only by its owner, but quadrants along partition boundaries will need to exchange solution values between iterations. There are many possible approaches to this partitioning problem, but it should be fairly straightforward if you divide the quadrants into partitions based on a tree traversal. If you have n leaf quadrants and p processes, assign the first (n)/(p) to the first partition (process 0), the next (n)/(p) to the next partition (process 1), and so on, handling remainder quadrants in some reasonable way. If child quadrants are ordered NW-NE-SW-SE, a partitioning of a base quadtree refined to three levels might be done as follows:

For this phase, we need to implement the message passing needed to send solution information from owned quadrants on partition boundaries to those other processes (and only to those other processes) that will need the information during the solution process. Note that we are ignoring adaptivity for this phase, so the working program should compute only on the initial quadtree. Solution output procedures are needed to compare the solution from the parallel version with those from the sequential version. Once we are satisfied with phase 1, we will save a copy before moving to phase 2. You will need to use the phase 1 code for parts of the timing study described below.
Phase 2: Now, to reintroduce adaptivity. Adaptivity will work much as it did in the sequential program, except that when a quadrant is refined, it need only be refined on the owner process. This means that the quadtree, which was originally completely replicated on each process, will grow beyond the initial refinement level only on the owner process of a given quadrant. This should be easy on the interiors of partitions, but will complicate both the error checking (neighbors may now be off-process) and the solution process (off-process copies of initial leaf quadrants may have been refined). The first can be addressed by a similar solution-value exchange as you needed for phase 1 computation. The second may be addressed by doing at least partial refinements of the off-process copies of quadrants involved in interprocess communication. When this works, we will again save a copy before moving to phase 3.
Phase 3: All of this adaptivity will likely introduce a load imbalance. If all or most of the refinement takes place in just a few processes, those processes will have a larger workload during each Jacobi iteration, causing other processes to wait before the boundary exchange. After a refinement phase, we will implement a rebalancing phase, where the partitions of the initial quadtree structure can be adjusted (and refined parts of the tree migrated appropriately) to ensure that each process has approximately the same number of owned leaf quadrants. To keep this relatively straightforward, the units of work that are allowed to be migrated are the original leaf quadrants. This means that the same techniques you used to keep track of off-process neighbor quadrants in phase 1 can be used here. The downside is that the granularity of the "work objects" that you are partitioning can get large after several refinement steps have occurred, meaning a perfect load balance may not be possible. I suggest using the same rule for deciding which parts of the quadtree are assigned to which processes. Traverse the tree (only to the level of the original quadrants) and fill the partitions. If you broadcast the sizes of each quadrant (that is, the number of leaf quadrants below it), each process can compute this new decomposition independently and determine which parts of its tree need to be sent elsewhere.

I would like to encourage enhancements, and we can discuss them to make sure they will not lead to unexpected complications.

A Brief Timing Study

Using the FreeBSD nodes of our cluster, conduct a performance study that, for a given problem setup (tolerances and boundary conditions), considers

an adaptive solution on a single process (use the sequential code to get a fair comparison),
the same solution, but with the entire grid initially refined to the finest level from the previous solution (again, with the sequential code),
the non-adaptive solution from (2) run with 8 processes (something the phase 1 parallelized code should be able to do),
the adaptive solution from (1) run with 8 processes but no load balancing (something our phase 2 code parallelized should be able to do), and
the adaptive solution from (1) and (4) run with 8 processes and load balancing.

Repeat the above, but for a larger number of processes, using one or more of the computing environments available on the TeraGrid.

Describe the experiment, the computing environments in which you conduct it, and summarize your findings in a file study.txt.

Submission and Grading Guidelines

Since we are all working together, there is no formal submission process for the program. However, I would like each of you to generate and submit your own study.txt file describing the study, and a file contributions.txt that describes your contributions to the design and development of the program. Place these in a directory named assignment6 that you submit using the turnin utility.

Your grade for the program will be determined by correctness, design, documentation, style, and efficiency both overall and specifically for your contributions to the program we produce.

Computer Science 341-02 Parallel Processing

Mount Holyoke College Fall 2007

Computer Science 341-02
Parallel Processing

Mount Holyoke College
Fall 2007