Lecture 12 - Partitioning and Load Balancing


Agenda

Announcements

Mesh Partitioning

Last time, we looked at some of the details of the data structures that underlie a typical distributed scientific computation.

A typical computation follows this sort of model:

      start with a model
      step through time:  for [t = start to finish] {
                             compute
                             BARRIER
                             update
                             BARRIER
                          }
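As a concrete (if simplified) illustration, here is the loop above in shared-memory Python, using threading.Barrier in place of an MPI barrier. The "compute" step is a made-up nearest-neighbor average, and all names are for illustration only; a real distributed code would use message passing.

```python
import threading

NWORKERS = 4
NSTEPS = 3
barrier = threading.Barrier(NWORKERS)

# One "old" and one "new" value per worker. Each compute step reads
# neighbors' old values, so updates must not race ahead of computes.
old = [float(i) for i in range(NWORKERS)]
new = [0.0] * NWORKERS

def worker(rank):
    for t in range(NSTEPS):
        # compute: average with nearest neighbors (periodic boundaries)
        left = old[(rank - 1) % NWORKERS]
        right = old[(rank + 1) % NWORKERS]
        new[rank] = (left + old[rank] + right) / 3.0
        barrier.wait()          # BARRIER: all computes finish before any update
        old[rank] = new[rank]   # update
        barrier.wait()          # BARRIER: all updates finish before next compute

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NWORKERS)]
for th in threads:
    th.start()
for th in threads:
    th.join()
```

Without the two barriers, a fast worker could overwrite an old value that a slow neighbor has not yet read, which is exactly the hazard the model above guards against.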

We will look at partitioning of meshes used in the solution of things like systems of partial differential equations, appropriate for methods such as the finite element method.

We will assume that the highest-dimension entities (regions in 3D, faces in 2D) are the units that are uniquely distributed, and other lower-dimension bounding entities are replicated on boundaries as needed. This is not true in all cases, but it is a common approach. Most of what we will discuss this week in relation to partitioning and load balancing will not depend on the specifics of the underlying mesh structures or which entities are distributed. For example, we could distribute the vertices and share higher-dimension entities, given appropriate underlying data structures.

So we have an idea what it takes to develop the data structures for this, but what about the actual partitioning algorithm?

Some obvious possibilities:

The part of the mesh assigned to each process is called a "partition" and in most cases there is one partition assigned to each process.

Partition Quality Metrics

Those are some obvious approaches, and we will consider many more. How can we compare them?

In general, we want the same number of elements assigned to each process, and as few of them on the partition boundaries as we can.

Why? An equal distribution of elements means a good load balance.

Few elements at boundaries means less communication -- remember that a typical calculation on these meshes involves getting information from nearest neighbors. When that information is on a different process, some communication (be it message passing or shared memory access) is necessary.

Two surface index measures - the maximum local surface index and the global surface index - are defined in Teresco and Ungar, Section 2.4. The two are similar but distinct. What does each measure? When is each important?
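As a sketch of how such statistics might be gathered, here is one common formulation in Python. The function name and data layout are made up (see Teresco and Ungar for the precise definitions): an element counts as a boundary element if any of its dual-graph neighbors lies in a different partition.

```python
def surface_indices(adjacency, part):
    """adjacency: dict element -> list of neighboring elements
       part:      dict element -> partition id
       Returns (global_surface_index, max_local_surface_index).
       Hypothetical helper for illustration, not a library routine."""
    boundary_per_part = {}
    count_per_part = {}
    for elem, nbrs in adjacency.items():
        p = part[elem]
        count_per_part[p] = count_per_part.get(p, 0) + 1
        if any(part[n] != p for n in nbrs):   # touches another partition?
            boundary_per_part[p] = boundary_per_part.get(p, 0) + 1
    total = sum(count_per_part.values())
    # global: fraction of all elements that sit on partition boundaries
    gsi = sum(boundary_per_part.values()) / total
    # max local: worst per-partition fraction of boundary elements
    mlsi = max(boundary_per_part.get(p, 0) / count_per_part[p]
               for p in count_per_part)
    return gsi, mlsi

# 1-D chain of 8 elements split into two halves: only elements 3 and 4
# touch the cut, so gsi = 2/8 and mlsi = 1/4
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}
part = {i: 0 if i < 4 else 1 for i in range(8)}
```

The global index tracks total communication volume; the maximum local index tracks the most communication-burdened process, which often bounds the achievable speedup.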

So we can gather these statistics. They can help us measure how well the partitioners perform.

But what else might be important? The number of neighbors each process must communicate with matters as well. Why? Interprocess adjacencies are also defined in Teresco and Ungar.

Another factor, this one not mentioned in the paper, is the intraprocess connectivity - how many "connected components" are within each partition. This is also known as "subdomain splitting" and can be an important factor.
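Subdomain splitting is straightforward to detect: count connected components per partition with a breadth-first search restricted to each partition. A small Python sketch (the function name and data layout are made up for illustration):

```python
from collections import deque

def components_per_partition(adjacency, part):
    """adjacency: dict element -> list of neighboring elements
       part:      dict element -> partition id
       Returns dict partition id -> number of connected components.
       A count above 1 indicates subdomain splitting."""
    counts = {}
    seen = set()
    for start in adjacency:
        if start in seen:
            continue
        # each unseen element starts a new component of its partition
        counts[part[start]] = counts.get(part[start], 0) + 1
        seen.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for u in adjacency[v]:
                # BFS stays inside the partition of the starting element
                if u not in seen and part[u] == part[v]:
                    seen.add(u)
                    queue.append(u)
    return counts

# chain of 6 elements; partition 0 is split into {0} and {4, 5}
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}
part = {0: 0, 1: 1, 2: 1, 3: 1, 4: 0, 5: 0}
```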

Geometric Partitioning

Geometric partitioning uses only the coordinates of the mesh elements being partitioned to determine the partitioning. Some of the ideas we have mentioned are examples of geometric methods.

In each of these cases, a "cutting plane" is placed through the domain: all elements on one side are assigned to one partition, and all elements on the other side to the other. For higher-dimensional elements, the element's centroid is typically the location used to determine which side of the plane it falls on.
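One classic geometric method along these lines is recursive coordinate bisection (RCB): cut at the median along the coordinate of largest spread, then recurse on each half. A minimal Python sketch, assuming element centroids as input and a power-of-two number of partitions (the function name is made up):

```python
def rcb(points, nparts, ids=None):
    """points: list of element-centroid tuples; nparts: a power of two.
       Returns dict element-id -> partition id.
       Illustrative sketch, not a production implementation."""
    if ids is None:
        ids = list(range(len(points)))
    if nparts == 1:
        return {i: 0 for i in ids}
    # cut perpendicular to the dimension with the largest extent
    ndims = len(points[ids[0]])
    dim = max(range(ndims),
              key=lambda d: max(points[i][d] for i in ids) -
                            min(points[i][d] for i in ids))
    # the "cutting plane" sits at the median coordinate in that dimension
    ids = sorted(ids, key=lambda i: points[i][dim])
    half = len(ids) // 2
    left = rcb(points, nparts // 2, ids[:half])
    right = rcb(points, nparts // 2, ids[half:])
    return {**left, **{i: p + nparts // 2 for i, p in right.items()}}

# 4x4 grid of centroids into 4 parts: each part gets a 2x2 block
pts = [(x, y) for y in range(4) for x in range(4)]
assignment = rcb(pts, 4)
```

Cutting at the median guarantees equal-sized halves at every level, which is why RCB gives perfect load balance whenever the element count divides evenly.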

Graph Partitioning

A popular way to think of mesh partitioning is as a graph partitioning problem, by working with the "dual graph." For a three-dimensional mesh, each mesh region becomes a graph vertex, and each mesh face that is shared by two mesh regions becomes a graph edge.

[Figures: a sample mesh; the mesh with its induced dual graph; the graph and its partitioning; the partitioned mesh]

We want to minimize the number of "edge cuts" in the graph while assigning the same number of graph vertices to each partition. We could certainly do this, but it is an NP-complete problem, so we're pretty unlikely to achieve an optimal solution in any reasonable amount of time, for large graphs anyway. Heuristic approaches are necessary. And there are plenty of them out there.
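To make "heuristic approaches" concrete, here is a toy greedy graph-growing sketch in Python: grow each partition by breadth-first search from a seed until it reaches the target size. Production partitioners use far more sophisticated multilevel methods; this code and its names are for illustration only.

```python
from collections import deque

def graph_grow(adjacency, nparts):
    """Grow partitions one at a time by BFS until each reaches the
       target size. Hypothetical helper, not a library routine."""
    n = len(adjacency)
    target = n // nparts
    part = {}
    current, size = 0, 0
    for seed in adjacency:              # fresh seed for each new region
        if seed in part:
            continue
        queue = deque([seed])
        while queue:
            v = queue.popleft()
            if v in part:
                continue
            part[v] = current
            size += 1
            if size == target and current < nparts - 1:
                current, size = current + 1, 0
                queue.clear()           # close this part; reseed the next
            else:
                queue.extend(u for u in adjacency[v] if u not in part)
    return part

# 4x4 grid graph (the dual graph of a structured quad mesh)
grid = {(x, y): [(x + dx, y + dy)
                 for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                 if 0 <= x + dx < 4 and 0 <= y + dy < 4]
        for y in range(4) for x in range(4)}
parts = graph_grow(grid, 4)
# count edge cuts: each crossing edge is seen from both endpoints
cut = sum(parts[u] != parts[v] for u in grid for v in grid[u]) // 2
```

Growing by BFS keeps each partition connected, but nothing here minimizes the cut; that is where the real heuristics earn their keep.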

Several packages exist to do graph partitioning, including METIS and its parallel version ParMETIS, Chaco, and Jostle.

Octree/SFC Partitioning

Another approach that fits mostly into the domain of geometric methods is based on octree structures, and is closely related to methods that use space-filling curves.

Octree partitioning was motivated by octree-based mesh generation, where a problem domain is embedded in a cubic universe that is recursively subdivided into eight octants wherever more resolution is required, producing an octree structure. The universe is represented by the root octant. Each octant is either a parent (or interior) octant, with exactly eight children, or a leaf (or terminal) octant with no children.

We'll start by thinking in two dimensions, using a quadtree instead of the three-dimensional octree.

[Figure: quadtree construction, Levels 1 (entire domain) through 4]

Above, the construction of a quadtree for a square domain with a small hole. At Level 1, only the root quadrant exists. At Level 2, one refinement (bisection) has occurred. Level 3 shows refinement of two of the Level 2 quadrants, and Level 4 shows refinement of one Level 3 quadrant.

This shows a triangular 40-element mesh generated from the previous quadtree, and association of mesh entities with leaf quadrants.

Octree structures may be constructed for meshes generated by other procedures or created by adaptive refinement by associating each element with the octant containing its centroid. The entire domain and mesh are embedded in a cubic universe, and each element is inserted into the octant of the tree containing its centroid. If the number of elements assigned to an octant exceeds a prescribed tolerance, the octant is refined and its elements are distributed to the appropriate offspring. The granularity of the tree should be fine enough to allow for a good balance, since all elements in a terminal octant will be assigned to the same partition, yet coarse enough that the number of octants remains an order of magnitude smaller than the number of elements being partitioned, for efficiency.
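The insertion-and-refinement procedure just described might be sketched as follows in two dimensions (a quadtree; the class and names are made up, and the refinement tolerance is the hypothetical MAX_PER_LEAF):

```python
import random

MAX_PER_LEAF = 4    # hypothetical tolerance on elements per leaf

class Quadrant:
    def __init__(self, x, y, size):
        self.x, self.y, self.size = x, y, size    # lower-left corner, side
        self.elements = []                        # centroids; leaves only
        self.children = None

    def insert(self, pt):
        if self.children is not None:             # interior: pass downward
            self._child_for(pt).insert(pt)
            return
        self.elements.append(pt)
        if len(self.elements) > MAX_PER_LEAF:     # overfull: refine
            half = self.size / 2
            self.children = [Quadrant(self.x + dx * half,
                                      self.y + dy * half, half)
                             for dy in (0, 1) for dx in (0, 1)]
            for e in self.elements:               # redistribute to offspring
                self._child_for(e).insert(e)
            self.elements = []

    def _child_for(self, pt):
        half = self.size / 2
        ix = 1 if pt[0] >= self.x + half else 0
        iy = 1 if pt[1] >= self.y + half else 0
        return self.children[iy * 2 + ix]

    def leaves(self):
        if self.children is None:
            return [self]
        return [l for c in self.children for l in c.leaves()]

# embed 50 random element centroids in the unit-square "universe"
random.seed(1)
root = Quadrant(0.0, 0.0, 1.0)
for _ in range(50):
    root.insert((random.random(), random.random()))
```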

A depth-first traversal (DFT) of the octree determines all subtree "costs." Since the total cost (TC) of the octree and the number of partitions (NP) are known, the optimal partition size (OPS) is

OPS = TC / NP .
A second (truncated) DFT of the octree adds octants to the current partition if their inclusion does not exceed OPS. If adding an octant's entire subtree exceeds OPS, the traversal descends the tree and continues. Terminal octants are not split; thus, if a terminal octant overfills a partition, a decision must be made whether to add it or to close the current partition, leaving it slightly under-filled, and start work on the next partition. This decision is based on the relative level of imbalance and the cumulative cost of closed partitions to avoid a very large final partition. Three-way partitioning of the example mesh is shown above.
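A sketch of the two traversals in Python, on a generic tree (the class and names are made up, and the add-or-close decision for an overfilling terminal octant is simplified to "always add"):

```python
class Octant:
    def __init__(self, name=None, cost=0, children=()):
        self.name = name
        self.cost = cost                  # a leaf's own cost (element count)
        self.children = list(children)
        self.subtree_cost = 0

def compute_costs(node):
    """First DFT: compute every subtree's total cost."""
    node.subtree_cost = node.cost + sum(compute_costs(c)
                                        for c in node.children)
    return node.subtree_cost

def leaves(node):
    if not node.children:
        return [node]
    return [l for c in node.children for l in leaves(c)]

def assign(node, ops, state, parts):
    """Second, truncated DFT. state = [current_partition, its_size].
       Simplified: an overfilling terminal octant is always added rather
       than weighed against closing the partition early."""
    if state[1] + node.subtree_cost <= ops or not node.children:
        for leaf in leaves(node):         # whole subtree fits: take it all
            parts[leaf.name] = state[0]
        state[1] += node.subtree_cost
        if state[1] >= ops:               # partition full: close it
            state[0] += 1
            state[1] = 0
    else:
        for c in node.children:           # too big: descend and continue
            assign(c, ops, state, parts)

# tiny example: 8 unit-cost leaves under 4 interior octants, NP = 2
root = Octant(children=[
    Octant(children=[Octant(f"L{2*i}", 1), Octant(f"L{2*i+1}", 1)])
    for i in range(4)])
TC = compute_costs(root)                  # total cost
OPS = TC / 2                              # optimal partition size for NP = 2
parts = {}
assign(root, OPS, [0, 0], parts)
```

Because whole subtrees are taken whenever they fit, most octants are assigned without ever visiting their descendants, which is what makes the second traversal "truncated" and cheap.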

The traversal of the leaf octants is important to the resulting partitions. The traversals can be organized using space-filling curves.

Morton - simple, but large discontinuities
Gray Code - better, but still has discontinuities
Hilbert - much better, reduces discontinuities, but more complex to construct

The Level 1, Level 2, and adaptively refined curves are shown.
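Morton order is the easiest of the three to construct: a cell's key is obtained by interleaving the bits of its integer coordinates, and sorting cells by key traverses them along the curve. A small 2-D Python sketch (the function name is made up):

```python
def morton_key(ix, iy, bits=16):
    """Interleave the bits of integer cell coordinates (ix, iy):
       x bits land in the even key positions, y bits in the odd ones."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (2 * b)
        key |= ((iy >> b) & 1) << (2 * b + 1)
    return key

# order the cells of a 4x4 grid along the Morton (Z-order) curve
cells = [(x, y) for y in range(4) for x in range(4)]
order = sorted(cells, key=lambda c: morton_key(*c))
```

The first four cells visited form the lower-left 2x2 block in a "Z" pattern; the jump from one block to the next is where the curve's large discontinuities appear.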

And in three dimensions:

[Figures: the three-dimensional Morton and Hilbert curves]

Other Partitioning Issues

Dynamic Load Balancing

Thus far, we have considered only an "initial partitioning" step. If the mesh is adaptive, that is, it is changing over time as the computation proceeds, we will need to repartition to account for these changes:

[Figures: initial balanced partition; adaptivity introduces imbalance; elements migrate as needed; rebalanced partition]

Many important factors must be considered.