Lecture 21

Lecture 21 - Parallel Architectures, Architecture-Aware Computing

Agenda

Announcements
Parallel Architectures
Architecture-Aware Computing

Announcements

CS Colloquium this week is a bioinformatics talk by Andrew Zimmer.

Architecture-Aware Computing

There are a number of ways in which a programmer might wish to account for the architecture when programming a parallel computer.

We begin by looking at a mesh-based computation, as we had seen a few weeks ago, and how the partitioning and load balancing algorithms might account for architecture, though many of the ideas apply to other parallel computations as well.

Heterogeneous Processor Speeds

The first case we consider is one where the processors are of different speeds.

First, we will look at an iterative approach that can be used with any load balancing software which supports heterogeneous object weights.

Use existing weighted balancers (ParMetis used here)
Assign higher weights to elements on slower processors
Iterate to convergence
32,016-element mesh is subdivied into four partitions on a Linux 4-way SMP
Weights {1.25, 0.5, 0.5, 1.25} produce partitions of {4,884, 11,127, 11,159, 4,890} in 19 steps
Weights {1.25, 0.25, 0.05, 1.25} produce partitions of sizes {990, 5,374, 24,612, and 1,084} in 25 iterations
Advantages:
- Existing weighted balancers used with no modification
- Can adjust to changing loads by modifying weights
- Incremental mesh changes will produce incremental changes to partitions, requiring a small number of iterations
Disadvantages:
- Iteration is expensive
- Some items may be migrated multiple times
- Does not account for relative network speeds

Perhaps it makes more sense to weight the partitions rather than the objects being partitioned. If we can simply create partitions with different numbers of objects, we can avoid the iteration described above.

Is this difficult in the algorithms we have discussed?

With the straightforward geometric algorithms (slice partitioning, coordinate bisection, inertial bisection) we can adjust the cutting plane positions to put different amounts of the work on each side.

The Jostle developers describe a technique of adding "penalty weights" to partitions which should get smaller portions:

To partition 75 objects onto two processors where processor 1 is twice as fast as processor 2, a penalty weight on processor 2. In this case, a penalty weight of 25 is imposed on processor 2, giving a total to be partitioned of 100. Each partition gets 50, though 25 of those are the penalty weights on the partition assigned to processor 2, so processor 1 gets 50 objects while processor 2 gets 25.

For a slightly more complicated example, suppose we have 4 processors, 2 with a speed of 1, 1 with a speed of 2, and 1 with a speed of 4, and we want to partition a collection of 128 objects taking into account the relative processor speeds. We know that we want the processor with a speed of 4 to have the most objects in the final partitioning (64) so we must add penalty weights to the other partitions: 48 to each of the speed 1 processors, 32 to the speed 2 processor. Now the load balancer will partition using a total weight of 256, but the appropriate portion of this work on each of the slower processors will be made up of penalty weights.

With octree partitioning, recall that the space-filling curve traversal of the leaf octants forms a one-dimensional ordering of the three-dimensional space. p partitions are formed by cutting the SFC into p pieces of equal weight. To partition for weighted partitions, the sizes of the subintervals need to be adjusted.

As an example, consider a situation with 4 processors, where P₀ should get 30% of the work, P₁ should get 15% of the work, P₂ should get 20% of the work, and P₃ should get the remaining 35% of the work. We begin with 100 objects distributed equally, 25 to each processor.

Recall that the octree load balancing algorithm allows each processor to compute the final partition assignment of all of its currently-assigned work with no communication other than a prefix sum to learn what percentage of work is currently assigned to lower-rank processors.

Power	30%	15%	20%	35%
Prefix Power	0%	30%	45%	65%
Current Load	25	25	25	25
Current Prefix Work	0%	25%	50%	75%
Start Proc	0	30 > 25, 0	50 > 30, 50 > 45, 50 < 65, 2	75 > 30, 75 > 45, 75 > 65, 3
Assignments	25 to 0	5 to 0, 15 to 1, 5 to 2	15 to 2, 10 to 3	25 to 3