Lecture 13 - Partitioning and Load Balancing


Agenda

Announcements

Octree/SFC Partitioning

See notes from Lecture 12.

Other Partitioning Issues

See notes from Lecture 12.

Dynamic Load Balancing

See notes from Lecture 12 for the intro.

With dynamic load balancing, the mesh is already distributed, so algorithms must operate in parallel on this distributed data. Computing the new partitioning on a single process is not scalable, in terms of time or memory.

Many of the algorithms we have looked at so far can serve as dynamic load balancers. All of the geometric algorithms can be (and have been) implemented in parallel. ParMETIS and Jostle provide multilevel graph partitioners that operate in parallel.

Octree/SFC rebalancing

To be used as a dynamic rebalancer, the octree structure must be created automatically, based on a previously-existing mesh distribution.

Initially, we calculate a bounding box for the entire domain by examining the centroid of each object to be inserted into the tree. A root node, representing the entire domain, is created on each process. Each of the NP processes refines this root octant in parallel to the initial refinement level

IRL = ceil(log8(NP))
to create 8^IRL leaf octants. For a quadtree,
IRL = ceil(log4(NP)) ,
which produces one level of refinement for our three-process example, since ceil(log4(3)) = 1.
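The relationship between the process count and the initial refinement level can be sketched as follows (the function name is illustrative; an integer loop avoids floating-point logarithm round-off):

```python
def initial_refinement_level(num_procs, children=8):
    """Smallest uniform refinement level whose leaf count covers all
    processes: children ** level >= num_procs.
    Use children=8 for an octree, children=4 for a quadtree."""
    level, leaves = 0, 1
    while leaves < num_procs:
        level += 1
        leaves *= children
    return level

# Quadtree on 3 processes -> one level of refinement, as in the example.
```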

This produces an identical global octree on each process. This global octree, while small relative to the full tree to be generated, simplifies and improves the efficiency of traversal and object insertion. Each terminal octant in the global octree is called a global octant. While global octants are replicated on all processes, each is assigned a unique and permanent owner: the process that initially contains its portion of the spatial domain.

A map array represents the entire global octree as a linear array of the global octants. Once the map array has been created, the top levels of the tree are no longer needed.

Next, each process calculates its range of leaf octants by a depth-first traversal (DFT). If the number of processes does not evenly divide the number of leaf octants, the excess octants are placed in the higher-numbered partitions, with at most one extra octant assigned per partition. The construction of the global tree structure and the map array, and their initial distribution, are shown below.
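The assignment of DFT-ordered leaf octants to processes, with the remainder absorbed by the higher-numbered partitions, can be sketched as follows (the function name is illustrative):

```python
def leaf_ranges(num_leaves, num_procs):
    """Assign contiguous DFT-ordered leaf ranges to processes.
    Each process gets num_leaves // num_procs leaves; the remainder r
    is absorbed by the r highest-numbered processes, one extra each."""
    base, extra = divmod(num_leaves, num_procs)
    ranges, start = [], 0
    for p in range(num_procs):
        count = base + (1 if p >= num_procs - extra else 0)
        ranges.append((start, start + count))
        start += count
    return ranges

# 8 quadtree leaves on 3 processes: process 0 gets 2, processes 1 and 2
# each get 3 (the two excess leaves go to the higher-numbered partitions).
```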

Initial root quadrant on each process:

Global quadrant on each process:

Initial partitioning of global octree:

The global octants assigned to each process are initially placed into a local root list. The local roots are the roots of the local subtrees that exist on each process. They are maintained in their DFT order.

Next, the objects to be partitioned are inserted into the octree. Parts of the tree are refined when inserted objects exceed the prescribed limit on the number of objects per leaf octant. The tree resulting from object insertion for the example of Figure , with a refinement limit of five elements per quadrant, is shown below:

Objects to be inserted may reside on any process, and some objects will likely reside on processes other than those owning their destination octants. Such objects are called orphans and must be migrated to the appropriate process, as determined by an O(log(NP)) search using the map array.
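The owner lookup can be sketched as a binary search over the map array; the layout here (a sorted array of each global octant's starting DFT/SFC key, plus a parallel owner array) is an assumed representation for illustration:

```python
import bisect

def find_owner(object_key, map_starts, map_owners):
    """Locate the process owning the global octant that contains
    object_key. map_starts holds the sorted starting DFT/SFC key of each
    global octant; map_owners holds the owning process of each (this
    layout is hypothetical). Binary search costs O(log n) comparisons,
    and with roughly one global octant per process, n ~ NP."""
    i = bisect.bisect_right(map_starts, object_key) - 1
    return map_owners[i]
```

An orphan's key is computed from its centroid, then the object is migrated to the process `find_owner` returns.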

After object insertion, each process computes costs for each locally rooted subtree using traversals within its domain with no interprocess communication. Prefix and global costs are computed from the per-process cost totals, enabling each process to determine its position in the global traversal.
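The prefix and global costs amount to an exclusive prefix sum over the per-process cost totals; in practice this would be an MPI scan, but a serial sketch shows the idea (function name illustrative):

```python
def prefix_costs(local_costs):
    """Exclusive prefix sum over per-process cost totals: process p's
    prefix cost is the sum of the costs on processes 0..p-1. Returns
    (prefix cost per process, global cost)."""
    prefixes, running = [], 0
    for c in local_costs:
        prefixes.append(running)
        running += c
    return prefixes, running
```

Each process's prefix cost tells it where its octants fall in the global traversal order.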

As with the serial procedure, each process traverses its subtrees to create partitions. Each process determines its initial destination partition as

IDP = floor(PC / OPS) ,
where PC is the process's prefix cost and OPS is the optimal partition size (the global cost divided by the number of partitions). Each process traverses its subtrees with no interprocess communication. A load counter accumulates the costs of the octants assigned to the current destination process (CDP). Once the load counter exceeds OPS, the CDP is incremented, the load counter is reset, and the traversal continues. All remaining octants are assigned to the last partition, even if its load exceeds OPS:
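The traversal step can be sketched as follows, assuming the initial destination is floor(PC / OPS) and that the load counter starts from the portion of the initial partition already filled by lower-numbered processes (that initialization is one reasonable choice, not spelled out above):

```python
def assign_partitions(costs, prefix_cost, global_cost, num_parts):
    """Assign this process's DFT-ordered octant costs to destination
    partitions. OPS is the optimal partition size; the current
    destination process (CDP) advances whenever the load counter
    exceeds OPS, and the last partition absorbs any remainder."""
    ops = global_cost / num_parts
    cdp = int(prefix_cost // ops)
    # Assumed: charge cdp with the cost already assigned to it by
    # lower-numbered processes.
    load = prefix_cost - cdp * ops
    dest = []
    for c in costs:
        dest.append(cdp)
        load += c
        if load > ops and cdp < num_parts - 1:
            cdp += 1
            load = 0
    return dest
```

For example, a single process holding costs [4, 4, 4, 4, 4] split two ways (OPS = 10) yields destinations [0, 0, 0, 1, 1].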

The numbers above indicate the destination processes for each octant.

When the traversals are complete, subtrees and their associated data are migrated, if necessary, to their assigned destination processes. Octant migration occurs in three stages. First, octants migrating to the same process are collected, and the destination process is notified. Second, the destination processes allocate space for the arriving data and notify the source processes of the new addresses, so that migrating octants' remote parent and offspring links can be updated. Finally, the updated octants are sent to their destinations and removed locally. This strategy preserves the octant traversal order.

Local roots are updated and communicated to every process. Each root octant is added to the remote octant list in the map array element corresponding to the octant's global octree ancestor. Below, we see the final partitioning and distributed tree structure for our two-dimensional example.

Zoltan Dynamic Load Balancing

Zoltan is a library for dynamic load balancing developed at Sandia National Laboratories.

Zoltan is a data-structure neutral software package, meaning it does not depend on any application-specific data structures.

To use Zoltan algorithms, an application must call functions in the Zoltan API to set parameters and choose an algorithm. It also must provide a set of callback functions that Zoltan will call when it needs information from the application, such as the number of objects to be partitioned, their coordinates, and their interconnections.
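Zoltan's actual API is C (e.g., Zoltan_Create, Zoltan_Set_Param, Zoltan_LB_Partition). The data-structure-neutral callback idea can be illustrated with a minimal Python sketch; every name here is hypothetical, and the "partitioner" is a toy 1-D split standing in for Zoltan's real algorithms:

```python
class Balancer:
    """Toy stand-in for a callback-driven load balancer (not Zoltan's API)."""

    def __init__(self):
        self.callbacks = {}

    def set_callback(self, name, fn):
        # The application registers query functions; the balancer never
        # touches application data structures directly.
        self.callbacks[name] = fn

    def partition(self, num_parts):
        # All application data is obtained through the callbacks.
        n = self.callbacks["num_objects"]()
        coords = [self.callbacks["coords"](i) for i in range(n)]
        # Toy partitioner: sort by x-coordinate, split into equal chunks.
        order = sorted(range(n), key=lambda i: coords[i][0])
        return {i: min(rank * num_parts // n, num_parts - 1)
                for rank, i in enumerate(order)}

# Usage: the "application" exposes its mesh only through callbacks.
mesh = [(3.0,), (1.0,), (2.0,), (0.0,)]
b = Balancer()
b.set_callback("num_objects", lambda: len(mesh))
b.set_callback("coords", lambda i: mesh[i])
parts = b.partition(2)
```

The point of the design is the inversion of control: the library asks the application for object counts, coordinates, and connectivity on demand, so it never needs to know how the mesh is stored.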

Zoltan uses the term objects to refer to the items to be partitioned. In the examples we have been looking at, Zoltan objects would correspond to mesh regions (3D) or faces (2D).

For PMDB meshes, the file pmdb_zoltan.c makes appropriate Zoltan API calls and defines callbacks to allow Zoltan to be used for partitioning and dynamic load balancing.

We first consider how to use Zoltan from the PMDBtool program. We looked briefly at PMDBtool last week. It is a program that can manipulate PMDB meshes through a simple scripting language. Here is a sample PMDBtool script that reads in a PMDB mesh called muzzle-refined.sms and partitions it using Zoltan's Octree partitioner:

trace
modeler null
model "muzzle-refined"
read_sms "muzzle-refined"
migstats reset
PMDB_LB_PRINT_ZOLTANTIMES true
zoltan "DEBUG_LEVEL" "3"
zoltan "OCT_OUTPUT_LEVEL" "1"
zoltan "LB_METHOD" "OCTPART"
zoltan
migstats print
#write_sms "muzzle_part"
dx "muzzle_part"
metrics "muzzle_part" "1"

After the read_sms command, process 0 contains the entire mesh, and the other processes contain nothing.

After the zoltan command, the mesh is partitioned evenly among the processes, using the decomposition computed by Zoltan's octree partitioning algorithm.