Computer Science 338 Parallel Processing Williams College Spring 2006
Tutorial Assignment 7: Jacobi on a Distributed Quadtree
Due: Friday, April 7, 2006 at 5:00 PM
This week we focus on how to distribute the workload of a scientific
computation (e.g., mesh elements for a finite element computation)
among the cooperating processors to perform an efficient parallel
computation. We have seen the basics of how these computations are
set up, and it is clear that we wish to provide an appropriate amount
of work to each processor so that they all finish their computing
phase at the same time, and to minimize the number of elements on
interprocess boundaries, as these lead to communication during the
computation phase.
Class of 60s Talks
Our Class of 60s speaker, Christopher Johnson from the University of
Utah, will be at Williams on Monday and Tuesday, April 3-4. His talks
are scheduled for Monday at 4 PM and Monday at 8 PM, with a dinner
between the talks and a reception following the second talk. Please
do your best to attend the talks, dinner, and reception.
Readings
This week's readings will form the basis for our tutorial discussions
on April 4 and 5. Please read these with an eye toward how you think
some of the ideas discussed could be applied to the parallel adaptive
Jacobi solver on a quadtree that you will be implementing this week.
Please come to your tutorial meeting with a few written paragraphs
summarizing the key points in these papers. You may work with your
tutorial partner on this, or work independently.
Lab Tasks
This week's task is to parallelize your Jacobi solver that operates on
adaptive quadtrees. Your program should use the SPMD model with MPI
for message passing. You may again work alone or in groups of two or
three. You may compile and test your programs anywhere you'd like (so
long as MPI is available there), but I will compile and test your
programs on bullpen or dhanni (your choice). Please provide a Makefile that works on one of these systems.
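If you have not written an MPI program before, the overall SPMD
structure looks roughly like the following minimal sketch in C; the
comments indicate where your quadtree and solver code would go, and
nothing here is provided code.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);                /* every process runs this same program */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id: 0 .. size-1 */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */

        /* build the quadtree, decide which leaves this rank owns, then iterate:
           compute on owned leaves, exchange boundary values, check convergence */

        if (rank == 0)
            printf("running with %d processes\n", size);

        MPI_Finalize();
        return 0;
    }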
There are many choices to make when parallelizing a program such as
this. Which parts of the computation are to be performed by each
processor? What is the granularity of the units of work to be divided
among the processors? What information must be maintained and what
communication is needed to support the computation once the work has
been divided? Will the workloads need to be adjusted following
adaptive steps? How can such a rebalancing be performed?
In order to keep your task manageable, you should follow this phased
development process:
Phase 1: Build your initial quadtree to the requested
refinement level, but replicate it completely on each process. Assign
a unique owner process to each leaf quadrant, essentially partitioning
the quadtree. Each quadrant's solution is computed only by its owner,
but quadrants along partition boundaries will need to exchange
solution values between iterations. There are many possible
approaches to this partitioning problem, but it should be fairly
straightforward if you divide the quadrants into partitions based on a
tree traversal. If you have n leaf quadrants and p processes,
assign the first n/p to the first partition (process 0), the
next n/p to the next partition (process 1), and so on,
handling remainder quadrants in some reasonable way. If you order
your child quadrants NW-NE-SW-SE, a partitioning of a base quadtree
refined to three levels follows directly from this traversal order; a
sketch of such an ownership assignment appears below.
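As a rough illustration (not provided code), here is one way in C to
assign owners to the traversal-ordered leaves. The owner[] array is a
hypothetical representation; you may prefer to store the owner in the
quadrant structure itself.

    /* Assign an owner rank to each of the n leaf quadrants, listed in
       NW-NE-SW-SE traversal order, splitting them into p nearly equal
       contiguous partitions; the first n % p partitions absorb the
       remainder.  owner[] is an output array of length n. */
    void assign_owners(int n, int p, int owner[])
    {
        int base  = n / p;   /* every partition gets at least this many leaves */
        int extra = n % p;   /* the first "extra" partitions get one more */
        int i = 0;

        for (int rank = 0; rank < p; rank++) {
            int count = base + (rank < extra ? 1 : 0);
            while (count-- > 0)
                owner[i++] = rank;
        }
    }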
Implement the message passing needed to send solution information from
owned quadrants on partition boundaries to those other processes (and
only to those other processes) that will need the information during
the solution process. Note that we are ignoring adaptivity for this
phase, so your working program should compute only on the initial
quadtree. Implement solution output procedures so you can compare
your solution from the parallel version with the sequential version.
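One common pattern for the boundary exchange is sketched below. All of
the neighbor bookkeeping (which ranks to talk to, and which solution
values go into each buffer) is hypothetical and must be built from your
own quadtree structures; only the MPI calls themselves are standard.

    #include <mpi.h>

    /* Exchange boundary solution values with neighboring processes.
       nbr[i] is a neighboring rank, sendbuf[i]/sendcount[i] hold the values
       of my owned boundary quadrants that rank nbr[i] needs, and
       recvbuf[i]/recvcount[i] receive the values my non-owned boundary
       copies need from rank nbr[i]. */
    void exchange_boundaries(int nnbrs, const int nbr[],
                             double *sendbuf[], const int sendcount[],
                             double *recvbuf[], const int recvcount[])
    {
        MPI_Request reqs[2 * nnbrs];          /* C99 variable-length array */

        for (int i = 0; i < nnbrs; i++)       /* post receives first */
            MPI_Irecv(recvbuf[i], recvcount[i], MPI_DOUBLE, nbr[i],
                      0, MPI_COMM_WORLD, &reqs[i]);

        for (int i = 0; i < nnbrs; i++)       /* then the matching sends */
            MPI_Isend(sendbuf[i], sendcount[i], MPI_DOUBLE, nbr[i],
                      0, MPI_COMM_WORLD, &reqs[nnbrs + i]);

        MPI_Waitall(2 * nnbrs, reqs, MPI_STATUSES_IGNORE);
    }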
Once you have completed phase 1, save a copy of your working program
before moving to phase 2.
Phase 2: Now reintroduce adaptivity, which will work
much as it did in your sequential program, except that when a quadrant
is refined, it need only be refined on the owner process. This means
that the quadtree, which was originally completely replicated on each
process, will grow beyond the initial refinement level only on the
owner process of a given quadrant. This should be easy on the
interiors of partitions, but will complicate both the error checking
(neighbors may now be off-process) and the solution process
(off-process copies of initial leaf quadrants may have been refined).
The first can be addressed by a similar solution-value exchange as you
needed for phase 1 computation. The second may be addressed by doing
at least partial refinements of the off-process copies of quadrants
involved in interprocess communication. When this works, save a copy
before moving to phase 3.
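As a sketch of the owner-only refinement step described in this phase
(error_estimate and refine stand in for your own routines; none of this
is provided code):

    /* Hypothetical stand-ins for your own quadtree routines. */
    typedef struct Quadrant Quadrant;
    extern double error_estimate(Quadrant *q);
    extern void   refine(Quadrant *q);

    /* One adaptive step: only leaves owned by this rank are refined here.
       leaves[] and owner[] list the current leaf quadrants and their owners. */
    void adaptive_step(Quadrant *leaves[], const int owner[], int nleaves,
                       int my_rank, double tol)
    {
        for (int i = 0; i < nleaves; i++) {
            if (owner[i] != my_rank)
                continue;                  /* off-process copies stay coarse here */
            if (error_estimate(leaves[i]) > tol)
                refine(leaves[i]);         /* the subtree grows only on the owner */
        }
        /* Separately, neighbors must be told which off-process copies along the
           partition boundary need at least partial refinement so that solution
           exchanges still match up. */
    }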
Phase 3: All of this adaptivity will likely introduce a load
imbalance. If all or most of the refinement takes place in just a few
processes, those processes will have a larger workload during each
Jacobi iteration, causing other processes to wait before the boundary
exchange. After a refinement phase, implement a rebalancing phase,
where the partitions of the initial quadtree structure can be adjusted
(and refined parts of the tree migrated appropriately) to ensure that
each process has approximately the same number of owned leaf
quadrants. To keep this relatively straightforward, the units of work
that are allowed to be migrated are the original leaf quadrants. This
means that the same techniques you used to keep track of off-process
neighbor quadrants in phase 1 can be used here. The downside is that
the granularity of the "work objects" that you are partitioning can
get large after several refinement steps have occurred, meaning a
perfect load balance may not be possible. I suggest using the same
rule for deciding which parts of the quadtree are assigned to which
processes. Traverse the tree (only to the level of the original
quadrants) and fill the partitions. If you broadcast the sizes of
each quadrant (that is, the number of leaf quadrants below it), each
process can compute this new decomposition independently and determine
which parts of its tree need to be sent elsewhere.
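One way to compute the new decomposition is sketched below, under the
assumption that each process knows the current size of each original
leaf quadrant it owns; the function and array names are hypothetical.
The sizes are combined with an all-reduce (one way to realize the
broadcast described above), and the traversal-ordered list of original
quadrants is then cut into p nearly equal pieces on every process.

    #include <mpi.h>
    #include <stdlib.h>

    /* Recompute the partition of the m original leaf quadrants so each
       process gets roughly the same number of current leaves.
       local_size[i] is the number of leaves below original quadrant i if
       this process owns it, and 0 otherwise; new_owner[i] receives
       quadrant i's new owner.  Every process runs this and, because the
       combined sizes are identical everywhere, gets the same answer. */
    void rebalance(int m, int p, long local_size[], int new_owner[])
    {
        long *size = malloc(m * sizeof(long));
        long total = 0, target, running = 0;
        int rank = 0;

        /* combine everyone's counts so all processes see every quadrant's weight */
        MPI_Allreduce(local_size, size, m, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

        for (int i = 0; i < m; i++)
            total += size[i];
        target = (total + p - 1) / p;     /* ideal number of leaves per process */

        /* walk the original quadrants in traversal order, starting a new
           partition whenever the running total reaches the target */
        for (int i = 0; i < m; i++) {
            new_owner[i] = rank;
            running += size[i];
            if (running >= target && rank < p - 1) {
                rank++;
                running = 0;
            }
        }
        free(size);
    }

Because the units being moved are whole original quadrants, a single
heavily refined quadrant can still leave the partitions uneven, which is
exactly the granularity limitation noted above.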
Enhancements are welcome, but if you have ideas that you think would
be useful and interesting, check with me before implementing
them. They may add unexpected
complications, or they may be features planned for inclusion in the
next assignment.
Using one of the clusters, conduct a performance study that, for a
given problem setup (tolerances and boundary conditions), considers
1. an adaptive solution on a single process (use last week's
sequential code to get a fair comparison),
2. the same solution, but with the entire grid initially refined to
the finest level from the previous solution (again, with last week's
code),
3. the non-adaptive solution from (2) run with 8 processes
(something your phase 1 code from this week should be able to do),
4. the adaptive solution from (1) run with 8 processes but no load
balancing (something your phase 2 code from this week should be able
to do), and
5. the adaptive solution from (1) and (4) run with 8 processes and
load balancing.
Describe the experiment, the computing environment in which you will
conduct it, and summarize your findings in a file tut07.txt.
There are several files to turn in for this assignment. They should
all be included in a file named tut07.tar that you submit using
the turnin utility. Please use the filenames specified
and be sure to include your name (and your partner's name, if you are
working in a group) in each file.
Your submitted tar file should include subdirectories phase1,
phase2, and phase3, each of which includes a Makefile that will compile and link your code on bullpen or dhanni,
your C or C++ source code, and a brief README file that explains
how and where to run your program. At the top level, include a file
tut07.txt that describes your performance study. Please do not include object files or your executables in your tar file.
Honor code guidelines: While the program is to be done only by
you (meaning your group, if you choose to work in a group), along the
lines of a laboratory program, I want to encourage you to ask
questions and discuss the program with me and with classmates outside
your group as you develop it. However, no sharing of code between
groups is permitted. If you have any doubts, please check first and
avoid honor code problems later.
Grading guidelines: Your grade for the programs will be
determined by correctness, design, documentation, style, and
efficiency.