Lecture 13 - Partitioning and Load Balancing


Agenda

Announcements

Octree/SFC Partitioning

See notes from Lecture 12.

Other Partitioning Issues

See notes from Lecture 12.

Dynamic Load Balancing

See notes from Lecture 12 for the intro.

With dynamic load balancing, the mesh is already distributed, so algorithms must operate in parallel on this distributed data. Computing the new partitioning on a single process is not scalable, in terms of time or memory.

Many of the algorithms we have looked at so far can serve as dynamic load balancers. All of the geometric algorithms can be (and have been) implemented in parallel. ParMETIS and Jostle provide multilevel graph partitioners that operate in parallel.

Octree/SFC rebalancing

To be used as a dynamic rebalancer, the octree structure must be created automatically, based on a previously-existing mesh distribution.

Initially, we calculate a bounding box for the entire domain by examining the centroid of each object to be inserted into the tree. A root node, representing the entire domain, is created on each process. Each of the NP processes refines this root octant in parallel to the initial refinement level

IRL = ceil(log8(NP))
to create 8^IRL leaf octants. For a quadtree,
IRL = ceil(log4(NP)) ,
which produces one level of refinement for our three-process example, since ceil(log4(3)) = 1.
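The relationship between the process count and the initial refinement level can be sketched as follows (the function name is illustrative; an integer loop avoids floating-point logarithm round-off):

```python
def initial_refinement_level(num_procs, children=8):
    """Smallest uniform refinement level whose leaf count covers all
    processes: children ** level >= num_procs.
    Use children=8 for an octree, children=4 for a quadtree."""
    level, leaves = 0, 1
    while leaves < num_procs:
        level += 1
        leaves *= children
    return level

# Quadtree on 3 processes -> one level of refinement, as in the example.
```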

This produces an identical global octree on each process. This global octree, while small relative to the full tree to be generated, simplifies and improves the efficiency of traversal and object insertion. Each terminal octant in the global octree is called a global octant. While global octants are replicated on all processes, each is assigned a unique and permanent owner: the process that initially contains its portion of the spatial domain.

A map array represents the entire global octree as a linear array of the global octants. Once the map array has been created, the top levels of the tree are no longer needed.

Next, each process calculates its range of leaf octants by a depth-first traversal (DFT). If the number of processes does not evenly divide the number of leaf octants, the excess octants are placed in the higher-numbered partitions, with at most one extra octant assigned per partition. The construction of the global tree structure and the map array, and their initial distribution, are shown below.
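The assignment of DFT-ordered leaf octants to processes, with the remainder absorbed by the higher-numbered partitions, can be sketched as follows (the function name is illustrative):

```python
def leaf_ranges(num_leaves, num_procs):
    """Assign contiguous DFT-ordered leaf ranges to processes.
    Each process gets num_leaves // num_procs leaves; the remainder r
    is absorbed by the r highest-numbered processes, one extra each."""
    base, extra = divmod(num_leaves, num_procs)
    ranges, start = [], 0
    for p in range(num_procs):
        count = base + (1 if p >= num_procs - extra else 0)
        ranges.append((start, start + count))
        start += count
    return ranges

# 8 quadtree leaves on 3 processes: process 0 gets 2, processes 1 and 2
# each get 3 (the two excess leaves go to the higher-numbered partitions).
```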

Initial root quadrant on each process:

Global quadrant on each process:

Initial partitioning of global octree:

The global octants assigned to each process are initially placed into a local root list. The local roots are the roots of the local subtrees that exist on each process. They are maintained in their DFT order.

Next, the objects to be partitioned are inserted into the octree. Parts of the tree are refined when inserted objects exceed the prescribed limit on the number of objects per leaf octant. The tree resulting from object insertion for the example of Figure , with a refinement limit of five elements per quadrant, is shown below:

Objects to be inserted may reside on any process, and some objects will likely reside on processes other than those owning their destination octants. Such objects are called orphans and must be migrated to the appropriate process, as determined by an O(log(NP)) search using the map array.
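The owner lookup can be sketched as a binary search over the map array; the layout here (a sorted array of each global octant's starting DFT/SFC key, plus a parallel owner array) is an assumed representation for illustration:

```python
import bisect

def find_owner(object_key, map_starts, map_owners):
    """Locate the process owning the global octant that contains
    object_key. map_starts holds the sorted starting DFT/SFC key of each
    global octant; map_owners holds the owning process of each (this
    layout is hypothetical). Binary search costs O(log n) comparisons,
    and with roughly one global octant per process, n ~ NP."""
    i = bisect.bisect_right(map_starts, object_key) - 1
    return map_owners[i]
```

An orphan's key is computed from its centroid, then the object is migrated to the process `find_owner` returns.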

After object insertion, each process computes costs for each locally rooted subtree using traversals within its domain with no interprocess communication. Prefix and global costs are computed from the per-process cost totals, enabling each process to determine its position in the global traversal.
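The prefix and global costs amount to an exclusive prefix sum over the per-process cost totals; in practice this would be an MPI scan, but a serial sketch shows the idea (function name illustrative):

```python
def prefix_costs(local_costs):
    """Exclusive prefix sum over per-process cost totals: process p's
    prefix cost is the sum of the costs on processes 0..p-1. Returns
    (prefix cost per process, global cost)."""
    prefixes, running = [], 0
    for c in local_costs:
        prefixes.append(running)
        running += c
    return prefixes, running
```

Each process's prefix cost tells it where its octants fall in the global traversal order.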

As with the serial procedure, each process traverses its subtrees to create partitions. Each process determines its initial destination partition as

IDP = floor(PC / OPS) ,
where PC is the process's prefix cost and OPS is the optimal partition size (the global cost divided by the number of partitions). Each process traverses its subtrees with no interprocess communication. A load counter accumulates the costs of the octants assigned to the current destination process (CDP). Once the load counter exceeds OPS, the CDP is incremented, the load counter is reset, and the traversal continues. All remaining octants are assigned to the last partition, even if its load exceeds OPS:
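The traversal step can be sketched as follows, assuming the initial destination is floor(PC / OPS) and that the load counter starts from the portion of the initial partition already filled by lower-numbered processes (that initialization is one reasonable choice, not spelled out above):

```python
def assign_partitions(costs, prefix_cost, global_cost, num_parts):
    """Assign this process's DFT-ordered octant costs to destination
    partitions. OPS is the optimal partition size; the current
    destination process (CDP) advances whenever the load counter
    exceeds OPS, and the last partition absorbs any remainder."""
    ops = global_cost / num_parts
    cdp = int(prefix_cost // ops)
    # Assumed: charge cdp with the cost already assigned to it by
    # lower-numbered processes.
    load = prefix_cost - cdp * ops
    dest = []
    for c in costs:
        dest.append(cdp)
        load += c
        if load > ops and cdp < num_parts - 1:
            cdp += 1
            load = 0
    return dest
```

For example, a single process holding costs [4, 4, 4, 4, 4] split two ways (OPS = 10) yields destinations [0, 0, 0, 1, 1].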

The numbers above indicate the destination processes for each octant.

When the traversals are complete, subtrees and their associated data are migrated, if necessary, to their assigned destination processes. Octant migration occurs in three stages. First, octants migrating to the same process are collected, and the destination process is notified. Second, the destination processes allocate space for the arriving data and notify the source processes of the new addresses, so that migrating octants' remote parent and offspring links can be updated. Finally, the updated octants are sent to their destinations and removed locally. This strategy preserves the octant traversal order.

Local roots are updated and communicated to every process. Each root octant is added to the remote octant list in the map array element corresponding to the octant's global octree ancestor. Below, we see the final partitioning and distributed tree structure for our two-dimensional example.

Zoltan Dynamic Load Balancing

Zoltan is a library for dynamic load balancing developed at Sandia National Laboratories.

Zoltan is a data-structure neutral software package, meaning it does not depend on any application-specific data structures.

To use Zoltan algorithms, an application must call functions in the Zoltan API to set parameters and choose an algorithm. It also must provide a set of callback functions that Zoltan will call when it needs information from the application, such as the number of objects to be partitioned, their coordinates, and their interconnections.
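Zoltan's actual API is C (e.g., Zoltan_Create, Zoltan_Set_Param, Zoltan_LB_Partition). The data-structure-neutral callback idea can be illustrated with a minimal Python sketch; every name here is hypothetical, and the "partitioner" is a toy 1-D split standing in for Zoltan's real algorithms:

```python
class Balancer:
    """Toy stand-in for a callback-driven load balancer (not Zoltan's API)."""

    def __init__(self):
        self.callbacks = {}

    def set_callback(self, name, fn):
        # The application registers query functions; the balancer never
        # touches application data structures directly.
        self.callbacks[name] = fn

    def partition(self, num_parts):
        # All application data is obtained through the callbacks.
        n = self.callbacks["num_objects"]()
        coords = [self.callbacks["coords"](i) for i in range(n)]
        # Toy partitioner: sort by x-coordinate, split into equal chunks.
        order = sorted(range(n), key=lambda i: coords[i][0])
        return {i: min(rank * num_parts // n, num_parts - 1)
                for rank, i in enumerate(order)}

# Usage: the "application" exposes its mesh only through callbacks.
mesh = [(3.0,), (1.0,), (2.0,), (0.0,)]
b = Balancer()
b.set_callback("num_objects", lambda: len(mesh))
b.set_callback("coords", lambda i: mesh[i])
parts = b.partition(2)
```

The point of the design is the inversion of control: the library asks the application for object counts, coordinates, and connectivity on demand, so it never needs to know how the mesh is stored.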

Zoltan uses the term objects to refer to the items to be partitioned. In the examples we have been looking at, Zoltan objects would correspond to mesh regions (3D) or faces (2D).

For PMDB meshes, the file pmdb_zoltan.c makes appropriate Zoltan API calls and defines callbacks to allow Zoltan to be used for partitioning and dynamic load balancing.

We first consider how to use Zoltan from the PMDBtool program. We looked briefly at PMDBtool last week. It is a program that can manipulate PMDB meshes through a simple scripting language. Here is a sample PMDBtool script that reads in a PMDB mesh called muzzle-refined.sms and partitions it using Zoltan's Octree partitioner:

trace
modeler null
model "muzzle-refined"
read_sms "muzzle-refined"
migstats reset
PMDB_LB_PRINT_ZOLTANTIMES true
zoltan "DEBUG_LEVEL" "3"
zoltan "OCT_OUTPUT_LEVEL" "1"
zoltan "LB_METHOD" "OCTPART"
zoltan
migstats print
#write_sms "muzzle_part"
dx "muzzle_part"
metrics "muzzle_part" "1"

After the read_sms command, process 0 contains the entire mesh, and the other processes contain nothing.

After the zoltan command, the mesh is partitioned evenly among the processes, using the decomposition computed by Zoltan's octree partitioning algorithm.