Computer Science 338 Parallel Processing Williams College Spring 2006
Tutorial Assignment 7: Jacobi on a Distributed Quadtree
Due: Friday, April 7, 2006 at 5:00 PM
This week we focus on how to distribute the workload of a scientific
computation (e.g., mesh elements for a finite element computation)
among the cooperating processors to perform an efficient parallel
computation. We have seen the basics of how these computations are
set up, and it is clear that we wish to provide an appropriate amount
of work to each processor so that they all finish their computing
phase at the same time, and to minimize the number of elements on
interprocess boundaries, as these lead to communication during the
computation phase.
Class of 60s Talks
Our Class of 60s speaker, Christopher Johnson from the University of
Utah, will be at Williams on Monday and Tuesday, April 3-4. His talks
are scheduled for Monday at 4 PM and Monday at 8 PM, with a dinner
between the talks and a reception following the second talk. Please
do your best to attend the talks, dinner, and reception.
Readings
This week's readings will form the basis for our tutorial discussions
on April 4 and 5. Please read these with an eye toward how you think
some of the ideas discussed could be applied to the parallel adaptive
Jacobi solver on a quadtree that you will be implementing this week.
Please come to your tutorial meeting with a few written paragraphs
summarizing the key points in these papers. You may work with your
tutorial partner on this, or work independently.
Lab Tasks
This week's task is to parallelize your Jacobi solver that operates on
adaptive quadtrees. Your program should use the SPMD model with MPI
for message passing. You may again work alone or in groups of two or
three. You may compile and test your programs anywhere you'd like (so
long as MPI is available there), but I will compile and test your
programs on bullpen or dhanni (your choice). Please provide a Makefile that works on one of these systems.
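If you have not written an MPI program before, the overall SPMD
structure looks roughly like the following minimal sketch in C; the
comments indicate where your quadtree and solver code would go, and
nothing here is provided code.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);                /* every process runs this same program */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id: 0 .. size-1 */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */

        /* build the quadtree, decide which leaves this rank owns, then iterate:
           compute on owned leaves, exchange boundary values, check convergence */

        if (rank == 0)
            printf("running with %d processes\n", size);

        MPI_Finalize();
        return 0;
    }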
There are many choices to make when parallelizing a program such as
this. Which parts of the computation are to be performed by each
processor? What is the granularity of the units of work to be divided
among the processors? What information must be maintained and what
communication is needed to support the computation once the work has
been divided? Will the workloads need to be adjusted following
adaptive steps? How can such a rebalancing be performed?
In order to keep your task manageable, you should follow this phased
development process:
Phase 1: Build your initial quadtree to the requested
refinement level, but replicate it completely on each process. Assign
a unique owner process to each leaf quadrant, essentially partitioning
the quadtree. Each quadrant's solution is computed only by its owner,
but quadrants along partition boundaries will need to exchange
solution values between iterations. There are many possible
approaches to this partitioning problem, but it should be fairly
straightforward if you divide the quadrants into partitions based on a
tree traversal. If you have n leaf quadrants and p processes,
assign the first n/p to the first partition (process 0), the
next n/p to the next partition (process 1), and so on,
handling remainder quadrants in some reasonable way. If you order
your child quadrants NW-NE-SW-SE, a partitioning of a base quadtree
refined to three levels follows directly from this traversal order; a
sketch of such an ownership assignment appears below.
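As a rough illustration (not provided code), here is one way in C to
assign owners to the traversal-ordered leaves. The owner[] array is a
hypothetical representation; you may prefer to store the owner in the
quadrant structure itself.

    /* Assign an owner rank to each of the n leaf quadrants, listed in
       NW-NE-SW-SE traversal order, splitting them into p nearly equal
       contiguous partitions; the first n % p partitions absorb the
       remainder.  owner[] is an output array of length n. */
    void assign_owners(int n, int p, int owner[])
    {
        int base  = n / p;   /* every partition gets at least this many leaves */
        int extra = n % p;   /* the first "extra" partitions get one more */
        int i = 0;

        for (int rank = 0; rank < p; rank++) {
            int count = base + (rank < extra ? 1 : 0);
            while (count-- > 0)
                owner[i++] = rank;
        }
    }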
Implement the message passing needed to send solution information from
owned quadrants on partition boundaries to those other processes (and
only to those other processes) that will need the information during
the solution process. Note that we are ignoring adaptivity for this
phase, so your working program should compute only on the initial
quadtree. Implement solution output procedures so you can compare
your solution from the parallel version with the sequential version.
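One common pattern for the boundary exchange is sketched below. All of
the neighbor bookkeeping (which ranks to talk to, and which solution
values go into each buffer) is hypothetical and must be built from your
own quadtree structures; only the MPI calls themselves are standard.

    #include <mpi.h>

    /* Exchange boundary solution values with neighboring processes.
       nbr[i] is a neighboring rank, sendbuf[i]/sendcount[i] hold the values
       of my owned boundary quadrants that rank nbr[i] needs, and
       recvbuf[i]/recvcount[i] receive the values my non-owned boundary
       copies need from rank nbr[i]. */
    void exchange_boundaries(int nnbrs, const int nbr[],
                             double *sendbuf[], const int sendcount[],
                             double *recvbuf[], const int recvcount[])
    {
        MPI_Request reqs[2 * nnbrs];          /* C99 variable-length array */

        for (int i = 0; i < nnbrs; i++)       /* post receives first */
            MPI_Irecv(recvbuf[i], recvcount[i], MPI_DOUBLE, nbr[i],
                      0, MPI_COMM_WORLD, &reqs[i]);

        for (int i = 0; i < nnbrs; i++)       /* then the matching sends */
            MPI_Isend(sendbuf[i], sendcount[i], MPI_DOUBLE, nbr[i],
                      0, MPI_COMM_WORLD, &reqs[nnbrs + i]);

        MPI_Waitall(2 * nnbrs, reqs, MPI_STATUSES_IGNORE);
    }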
Once you have completed phase 1, save a copy of your working program
before moving to phase 2.
Phase 2: Now reintroduce adaptivity, which will work
much as it did in your sequential program, except that when a quadrant
is refined, it need only be refined on the owner process. This means
that the quadtree, which was originally completely replicated on each
process, will grow beyond the initial refinement level only on the
owner process of a given quadrant. This should be easy on the
interiors of partitions, but will complicate both the error checking
(neighbors may now be off-process) and the solution process
(off-process copies of initial leaf quadrants may have been refined).
The first can be addressed by a similar solution-value exchange as you
needed for phase 1 computation. The second may be addressed by doing
at least partial refinements of the off-process copies of quadrants
involved in interprocess communication. When this works, save a copy
before moving to phase 3.
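As a sketch of the owner-only refinement step described in this phase
(error_estimate and refine stand in for your own routines; none of this
is provided code):

    /* Hypothetical stand-ins for your own quadtree routines. */
    typedef struct Quadrant Quadrant;
    extern double error_estimate(Quadrant *q);
    extern void   refine(Quadrant *q);

    /* One adaptive step: only leaves owned by this rank are refined here.
       leaves[] and owner[] list the current leaf quadrants and their owners. */
    void adaptive_step(Quadrant *leaves[], const int owner[], int nleaves,
                       int my_rank, double tol)
    {
        for (int i = 0; i < nleaves; i++) {
            if (owner[i] != my_rank)
                continue;                  /* off-process copies stay coarse here */
            if (error_estimate(leaves[i]) > tol)
                refine(leaves[i]);         /* the subtree grows only on the owner */
        }
        /* Separately, neighbors must be told which off-process copies along the
           partition boundary need at least partial refinement so that solution
           exchanges still match up. */
    }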
Phase 3: All of this adaptivity will likely introduce a load
imbalance. If all or most of the refinement takes place in just a few
processes, those processes will have a larger workload during each
Jacobi iteration, causing other processes to wait before the boundary
exchange. After a refinement phase, implement a rebalancing phase,
where the partitions of the initial quadtree structure can be adjusted
(and refined parts of the tree migrated appropriately) to ensure that
each process has approximately the same number of owned leaf
quadrants. To keep this relatively straightforward, the units of work
that are allowed to be migrated are the original leaf quadrants. This
means that the same techniques you used to keep track of off-process
neighbor quadrants in phase 1 can be used here. The downside is that
the granularity of the "work objects" that you are partitioning can
get large after several refinement steps have occurred, meaning a
perfect load balance may not be possible. I suggest using the same
rule for deciding which parts of the quadtree are assigned to which
processes. Traverse the tree (only to the level of the original
quadrants) and fill the partitions. If you broadcast the sizes of
each quadrant (that is, the number of leaf quadrants below it), each
process can compute this new decomposition independently and determine
which parts of its tree need to be sent elsewhere.
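One way to compute the new decomposition is sketched below, under the
assumption that each process knows the current size of each original
leaf quadrant it owns; the function and array names are hypothetical.
The sizes are combined with an all-reduce (one way to realize the
broadcast described above), and the traversal-ordered list of original
quadrants is then cut into p nearly equal pieces on every process.

    #include <mpi.h>
    #include <stdlib.h>

    /* Recompute the partition of the m original leaf quadrants so each
       process gets roughly the same number of current leaves.
       local_size[i] is the number of leaves below original quadrant i if
       this process owns it, and 0 otherwise; new_owner[i] receives
       quadrant i's new owner.  Every process runs this and, because the
       combined sizes are identical everywhere, gets the same answer. */
    void rebalance(int m, int p, long local_size[], int new_owner[])
    {
        long *size = malloc(m * sizeof(long));
        long total = 0, target, running = 0;
        int rank = 0;

        /* combine everyone's counts so all processes see every quadrant's weight */
        MPI_Allreduce(local_size, size, m, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

        for (int i = 0; i < m; i++)
            total += size[i];
        target = (total + p - 1) / p;     /* ideal number of leaves per process */

        /* walk the original quadrants in traversal order, starting a new
           partition whenever the running total reaches the target */
        for (int i = 0; i < m; i++) {
            new_owner[i] = rank;
            running += size[i];
            if (running >= target && rank < p - 1) {
                rank++;
                running = 0;
            }
        }
        free(size);
    }

Because the units being moved are whole original quadrants, a single
heavily refined quadrant can still leave the partitions uneven, which is
exactly the granularity limitation noted above.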
Enhancements are welcome, but if you have ideas that you think would
be useful and interesting, check with me before implementing
them. They may add unexpected
complications, or they may be features planned for inclusion in the
next assignment.
Using one of the clusters, conduct a performance study that, for a
given problem setup (tolerances and boundary conditions), considers
1. an adaptive solution on a single process (use last week's
sequential code to get a fair comparison),
2. the same solution, but with the entire grid initially refined to
the finest level from the previous solution (again, with last week's
code),
3. the non-adaptive solution from (2) run with 8 processes
(something your phase 1 code from this week should be able to do),
4. the adaptive solution from (1) run with 8 processes but no load
balancing (something your phase 2 code from this week should be able
to do), and
5. the adaptive solution from (1) and (4) run with 8 processes and
load balancing.
Describe the experiment, the computing environment in which you will
conduct it, and summarize your findings in a file tut07.txt.
There are several files to turn in for this assignment. They should
all be included in a file named tut07.tar that you submit using
the turnin utility. Please use the filenames specified
and be sure to include your name (and your partner's name, if you are
working in a group) in each file.
Your submitted tar file should include subdirectories phase1,
phase2, and phase3, each of which includes a Makefile that will compile and link your code on bullpen or dhanni,
your C or C++ source code, and a brief README file that explains
how and where to run your program. At the top level, include a file
tut07.txt that describes your performance study. Please do not include object files or your executables in your tar file.
Honor code guidelines: While the program is to be done only by
you (meaning your group, if you choose to work in a group), along the
lines of a laboratory program, I want to encourage you to ask
questions and discuss the program with me and with classmates outside
your group as you develop it. However, no sharing of code between
groups is permitted. If you have any doubts, please check first and
avoid honor code problems later.
Grading guidelines: Your grade for the programs will be
determined by correctness, design, documentation, style, and
efficiency.