Lecture 12 - Partitioning and Load Balancing


Agenda

Announcements

Mesh Partitioning

Last time, we looked at some of the details of the data structures that underlie a typical distributed scientific computation.

A typical computation follows this sort of model:

      start with a model
      step through time:  for [t = start to finish] {
                             compute
                             BARRIER
                             update
                             BARRIER
                          }
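As a concrete (if simplified) illustration, here is the loop above in shared-memory Python, using threading.Barrier in place of an MPI barrier. The "compute" step is a made-up nearest-neighbor average, and all names are for illustration only; a real distributed code would use message passing.

```python
import threading

NWORKERS = 4
NSTEPS = 3
barrier = threading.Barrier(NWORKERS)

# One "old" and one "new" value per worker. Each compute step reads
# neighbors' old values, so updates must not race ahead of computes.
old = [float(i) for i in range(NWORKERS)]
new = [0.0] * NWORKERS

def worker(rank):
    for t in range(NSTEPS):
        # compute: average with nearest neighbors (periodic boundaries)
        left = old[(rank - 1) % NWORKERS]
        right = old[(rank + 1) % NWORKERS]
        new[rank] = (left + old[rank] + right) / 3.0
        barrier.wait()          # BARRIER: all computes finish before any update
        old[rank] = new[rank]   # update
        barrier.wait()          # BARRIER: all updates finish before next compute

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NWORKERS)]
for th in threads:
    th.start()
for th in threads:
    th.join()
```

Without the two barriers, a fast worker could overwrite an old value that a slow neighbor has not yet read, which is exactly the hazard the model above guards against.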

We will look at partitioning of meshes used in the solution of things like systems of partial differential equations, appropriate for methods such as the finite element method.

We will assume that the highest-dimension entities (regions in 3D, faces in 2D) are the units that are uniquely distributed, and other lower-dimension bounding entities are replicated on boundaries as needed. This is not true in all cases, but it is a common approach. Most of what we will discuss this week in relation to partitioning and load balancing will not depend on the specifics of the underlying mesh structures or which entities are distributed. For example, we could distribute the vertices and share higher-dimension entities, given appropriate underlying data structures.

So we have an idea what it takes to develop the data structures for this, but what about the actual partitioning algorithm?

Some obvious possibilities:

The part of the mesh assigned to each process is called a "partition" and in most cases there is one partition assigned to each process.

Partition Quality Metrics

Those are some obvious approaches, and we will consider many more. How can we compare them?

In general, we want the same number of elements assigned to each process, and as few of them on the partition boundaries as we can.

Why? An equal distribution of elements means a good load balance.

Few elements at boundaries means less communication -- remember that a typical calculation on these meshes involves getting information from nearest neighbors. When that information is on a different process, some communication (be it message passing or shared memory access) is necessary.

Two surface index measures - the maximum local surface index and the global surface index - are defined in Teresco and Ungar, Section 2.4. The two are similar but distinct. What does each measure? When is each important?
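As a sketch of how such statistics might be gathered, here is one common formulation in Python. The function name and data layout are made up (see Teresco and Ungar for the precise definitions): an element counts as a boundary element if any of its dual-graph neighbors lies in a different partition.

```python
def surface_indices(adjacency, part):
    """adjacency: dict element -> list of neighboring elements
       part:      dict element -> partition id
       Returns (global_surface_index, max_local_surface_index).
       Hypothetical helper for illustration, not a library routine."""
    boundary_per_part = {}
    count_per_part = {}
    for elem, nbrs in adjacency.items():
        p = part[elem]
        count_per_part[p] = count_per_part.get(p, 0) + 1
        if any(part[n] != p for n in nbrs):   # touches another partition?
            boundary_per_part[p] = boundary_per_part.get(p, 0) + 1
    total = sum(count_per_part.values())
    # global: fraction of all elements that sit on partition boundaries
    gsi = sum(boundary_per_part.values()) / total
    # max local: worst per-partition fraction of boundary elements
    mlsi = max(boundary_per_part.get(p, 0) / count_per_part[p]
               for p in count_per_part)
    return gsi, mlsi

# 1-D chain of 8 elements split into two halves: only elements 3 and 4
# touch the cut, so gsi = 2/8 and mlsi = 1/4
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}
part = {i: 0 if i < 4 else 1 for i in range(8)}
```

The global index tracks total communication volume; the maximum local index tracks the most communication-burdened process, which often bounds the achievable speedup.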

So we can gather these statistics. They can help us measure how well the partitioners perform.

But what else might be important? The number of neighbors each process must communicate with matters as well. Why? Interprocess adjacencies are also defined in Teresco and Ungar.

Another factor, this one not mentioned in the paper, is the intraprocess connectivity - how many "connected components" are within each partition. This is also known as "subdomain splitting" and can be an important factor.
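Subdomain splitting is straightforward to detect: count connected components per partition with a breadth-first search restricted to each partition. A small Python sketch (the function name and data layout are made up for illustration):

```python
from collections import deque

def components_per_partition(adjacency, part):
    """adjacency: dict element -> list of neighboring elements
       part:      dict element -> partition id
       Returns dict partition id -> number of connected components.
       A count above 1 indicates subdomain splitting."""
    counts = {}
    seen = set()
    for start in adjacency:
        if start in seen:
            continue
        # each unseen element starts a new component of its partition
        counts[part[start]] = counts.get(part[start], 0) + 1
        seen.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for u in adjacency[v]:
                # BFS stays inside the partition of the starting element
                if u not in seen and part[u] == part[v]:
                    seen.add(u)
                    queue.append(u)
    return counts

# chain of 6 elements; partition 0 is split into {0} and {4, 5}
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}
part = {0: 0, 1: 1, 2: 1, 3: 1, 4: 0, 5: 0}
```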

Geometric Partitioning

Geometric partitioning uses only the coordinates of the mesh elements being partitioned to determine the partitioning. Some of the ideas we have mentioned are examples of geometric methods.

In each of these cases, a "cutting plane" is placed through the domain: all elements on one side are assigned to one partition, and all elements on the other side to the other. For higher-dimensional elements, the element's centroid is typically the location used to determine which side of the plane it falls on.
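One classic geometric method along these lines is recursive coordinate bisection (RCB): cut at the median along the coordinate of largest spread, then recurse on each half. A minimal Python sketch, assuming element centroids as input and a power-of-two number of partitions (the function name is made up):

```python
def rcb(points, nparts, ids=None):
    """points: list of element-centroid tuples; nparts: a power of two.
       Returns dict element-id -> partition id.
       Illustrative sketch, not a production implementation."""
    if ids is None:
        ids = list(range(len(points)))
    if nparts == 1:
        return {i: 0 for i in ids}
    # cut perpendicular to the dimension with the largest extent
    ndims = len(points[ids[0]])
    dim = max(range(ndims),
              key=lambda d: max(points[i][d] for i in ids) -
                            min(points[i][d] for i in ids))
    # the "cutting plane" sits at the median coordinate in that dimension
    ids = sorted(ids, key=lambda i: points[i][dim])
    half = len(ids) // 2
    left = rcb(points, nparts // 2, ids[:half])
    right = rcb(points, nparts // 2, ids[half:])
    return {**left, **{i: p + nparts // 2 for i, p in right.items()}}

# 4x4 grid of centroids into 4 parts: each part gets a 2x2 block
pts = [(x, y) for y in range(4) for x in range(4)]
assignment = rcb(pts, 4)
```

Cutting at the median guarantees equal-sized halves at every level, which is why RCB gives perfect load balance whenever the element count divides evenly.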

Graph Partitioning

A popular way to think of mesh partitioning is as a graph partitioning problem, by working with the "dual graph." For a three-dimensional mesh, each mesh region becomes a graph vertex, and each mesh face that is shared by two mesh regions becomes a graph edge.

[Figures: a sample mesh; the mesh with its induced dual graph; the graph and its partitioning; the partitioned mesh]

We want to minimize the number of "edge cuts" in the graph while assigning the same number of graph vertices to each partition. We could certainly do this, but it is an NP-complete problem, so we're pretty unlikely to achieve an optimal solution in any reasonable amount of time, for large graphs anyway. Heuristic approaches are necessary. And there are plenty of them out there.
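To make "heuristic approaches" concrete, here is a toy greedy graph-growing sketch in Python: grow each partition by breadth-first search from a seed until it reaches the target size. Production partitioners use far more sophisticated multilevel methods; this code and its names are for illustration only.

```python
from collections import deque

def graph_grow(adjacency, nparts):
    """Grow partitions one at a time by BFS until each reaches the
       target size. Hypothetical helper, not a library routine."""
    n = len(adjacency)
    target = n // nparts
    part = {}
    current, size = 0, 0
    for seed in adjacency:              # fresh seed for each new region
        if seed in part:
            continue
        queue = deque([seed])
        while queue:
            v = queue.popleft()
            if v in part:
                continue
            part[v] = current
            size += 1
            if size == target and current < nparts - 1:
                current, size = current + 1, 0
                queue.clear()           # close this part; reseed the next
            else:
                queue.extend(u for u in adjacency[v] if u not in part)
    return part

# 4x4 grid graph (the dual graph of a structured quad mesh)
grid = {(x, y): [(x + dx, y + dy)
                 for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                 if 0 <= x + dx < 4 and 0 <= y + dy < 4]
        for y in range(4) for x in range(4)}
parts = graph_grow(grid, 4)
# count edge cuts: each crossing edge is seen from both endpoints
cut = sum(parts[u] != parts[v] for u in grid for v in grid[u]) // 2
```

Growing by BFS keeps each partition connected, but nothing here minimizes the cut; that is where the real heuristics earn their keep.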

Several packages exist to do graph partitioning, including METIS and its parallel version ParMETIS, Chaco, and Jostle.

Octree/SFC Partitioning

Another approach that fits mostly into the domain of geometric methods is based on octree structures, and is closely related to methods that use space-filling curves.

Octree partitioning was motivated by octree-based mesh generation, where a problem domain is embedded in a cubic universe that is recursively subdivided into eight octants wherever more resolution is required, producing an octree structure. The universe is represented by the root octant. Each octant is either a parent (or interior) octant, with exactly eight children, or a leaf (or terminal) octant with no children.

We'll start by thinking in two dimensions, using a quadtree instead of the three-dimensional octree.

[Figure: quadtree construction, Levels 1 (entire domain) through 4]

Above, the construction of a quadtree for a square domain with a small hole. At Level 1, only the root quadrant exists. At Level 2, one refinement (bisection) has occurred. Level 3 shows refinement of two of the Level 2 quadrants, and Level 4 shows refinement of one Level 3 quadrant.

This shows a triangular 40-element mesh generated from the previous quadtree, and association of mesh entities with leaf quadrants.

Octree structures may be constructed for meshes generated by other procedures or created by adaptive refinement by associating each element with the octant containing its centroid. The entire domain and mesh are embedded in a cubic universe, and each element is inserted into the octant of the tree containing its centroid. If the number of elements assigned to an octant exceeds a prescribed tolerance, the octant is refined and its elements are distributed to the appropriate offspring. The granularity of the tree should be fine enough to allow for a good balance, since all elements in a terminal octant will be assigned to the same partition, yet coarse enough that the number of octants remains an order of magnitude smaller than the number of elements being partitioned, for efficiency.
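The insertion-and-refinement procedure just described might be sketched as follows in two dimensions (a quadtree; the class and names are made up, and the refinement tolerance is the hypothetical MAX_PER_LEAF):

```python
import random

MAX_PER_LEAF = 4    # hypothetical tolerance on elements per leaf

class Quadrant:
    def __init__(self, x, y, size):
        self.x, self.y, self.size = x, y, size    # lower-left corner, side
        self.elements = []                        # centroids; leaves only
        self.children = None

    def insert(self, pt):
        if self.children is not None:             # interior: pass downward
            self._child_for(pt).insert(pt)
            return
        self.elements.append(pt)
        if len(self.elements) > MAX_PER_LEAF:     # overfull: refine
            half = self.size / 2
            self.children = [Quadrant(self.x + dx * half,
                                      self.y + dy * half, half)
                             for dy in (0, 1) for dx in (0, 1)]
            for e in self.elements:               # redistribute to offspring
                self._child_for(e).insert(e)
            self.elements = []

    def _child_for(self, pt):
        half = self.size / 2
        ix = 1 if pt[0] >= self.x + half else 0
        iy = 1 if pt[1] >= self.y + half else 0
        return self.children[iy * 2 + ix]

    def leaves(self):
        if self.children is None:
            return [self]
        return [l for c in self.children for l in c.leaves()]

# embed 50 random element centroids in the unit-square "universe"
random.seed(1)
root = Quadrant(0.0, 0.0, 1.0)
for _ in range(50):
    root.insert((random.random(), random.random()))
```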

A depth-first traversal (DFT) of the octree determines all subtree "costs." Since the total cost (TC) of the octree and the number of partitions (NP) are known, the optimal partition size (OPS) is

OPS = TC / NP .
A second (truncated) DFT of the octree adds octants to the current partition if their inclusion does not exceed OPS. If adding an octant's entire subtree exceeds OPS, the traversal descends the tree and continues. Terminal octants are not split; thus, if a terminal octant overfills a partition, a decision must be made whether to add it or to close the current partition, leaving it slightly under-filled, and start work on the next partition. This decision is based on the relative level of imbalance and the cumulative cost of closed partitions to avoid a very large final partition. Three-way partitioning of the example mesh is shown above.
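A sketch of the two traversals in Python, on a generic tree (the class and names are made up, and the add-or-close decision for an overfilling terminal octant is simplified to "always add"):

```python
class Octant:
    def __init__(self, name=None, cost=0, children=()):
        self.name = name
        self.cost = cost                  # a leaf's own cost (element count)
        self.children = list(children)
        self.subtree_cost = 0

def compute_costs(node):
    """First DFT: compute every subtree's total cost."""
    node.subtree_cost = node.cost + sum(compute_costs(c)
                                        for c in node.children)
    return node.subtree_cost

def leaves(node):
    if not node.children:
        return [node]
    return [l for c in node.children for l in leaves(c)]

def assign(node, ops, state, parts):
    """Second, truncated DFT. state = [current_partition, its_size].
       Simplified: an overfilling terminal octant is always added rather
       than weighed against closing the partition early."""
    if state[1] + node.subtree_cost <= ops or not node.children:
        for leaf in leaves(node):         # whole subtree fits: take it all
            parts[leaf.name] = state[0]
        state[1] += node.subtree_cost
        if state[1] >= ops:               # partition full: close it
            state[0] += 1
            state[1] = 0
    else:
        for c in node.children:           # too big: descend and continue
            assign(c, ops, state, parts)

# tiny example: 8 unit-cost leaves under 4 interior octants, NP = 2
root = Octant(children=[
    Octant(children=[Octant(f"L{2*i}", 1), Octant(f"L{2*i+1}", 1)])
    for i in range(4)])
TC = compute_costs(root)                  # total cost
OPS = TC / 2                              # optimal partition size for NP = 2
parts = {}
assign(root, OPS, [0, 0], parts)
```

Because whole subtrees are taken whenever they fit, most octants are assigned without ever visiting their descendants, which is what makes the second traversal "truncated" and cheap.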

The traversal of the leaf octants is important to the resulting partitions. The traversals can be organized using space-filling curves.

Morton - simple, but large discontinuities
Gray Code - better, but still has discontinuities
Hilbert - much better, reduces discontinuities, but more complex to construct

The Level 1, Level 2, and adaptively refined curves are shown.
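Morton order is the easiest of the three to construct: a cell's key is obtained by interleaving the bits of its integer coordinates, and sorting cells by key traverses them along the curve. A small 2-D Python sketch (the function name is made up):

```python
def morton_key(ix, iy, bits=16):
    """Interleave the bits of integer cell coordinates (ix, iy):
       x bits land in the even key positions, y bits in the odd ones."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (2 * b)
        key |= ((iy >> b) & 1) << (2 * b + 1)
    return key

# order the cells of a 4x4 grid along the Morton (Z-order) curve
cells = [(x, y) for y in range(4) for x in range(4)]
order = sorted(cells, key=lambda c: morton_key(*c))
```

The first four cells visited form the lower-left 2x2 block in a "Z" pattern; the jump from one block to the next is where the curve's large discontinuities appear.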

And in three dimensions:

[Figures: the three-dimensional Morton and Hilbert curves]

Other Partitioning Issues

Dynamic Load Balancing

Thus far, we have considered only an "initial partitioning" step. If the mesh is adaptive, that is, it is changing over time as the computation proceeds, we will need to repartition to account for these changes:

[Figures: initial balanced partition; adaptivity introduces imbalance; elements migrate as needed; rebalanced partition]

Many important factors must be considered.