Lecture 21 - Parallel Architectures, Architecture-Aware Computing


Agenda

Announcements

Parallel Architectures

Notes are in previous lecture.

Architecture-Aware Computing

There are a number of ways in which a programmer might wish to account for the architecture when programming a parallel computer.

We begin by looking at a mesh-based computation, as we had seen a few weeks ago, and how the partitioning and load balancing algorithms might account for architecture, though many of the ideas apply to other parallel computations as well.

Heterogeneous Processor Speeds

The first case we consider is one where the processors are of different speeds.

First, we will look at an iterative approach that can be used with any load balancing software which supports heterogeneous object weights.

Perhaps it makes more sense to weight the partitions rather than the objects being partitioned. If we can simply create partitions with different numbers of objects, we can avoid the iteration described above.

Is this difficult in the algorithms we have discussed?

With the straightforward geometric algorithms (slice partitioning, coordinate bisection, inertial bisection) we can adjust the cutting plane positions to put different amounts of the work on each side.

The Jostle developers describe a technique of adding "penalty weights" to partitions which should get smaller portions:

To partition 75 objects onto two processors where processor 1 is twice as fast as processor 2, a penalty weight on processor 2. In this case, a penalty weight of 25 is imposed on processor 2, giving a total to be partitioned of 100. Each partition gets 50, though 25 of those are the penalty weights on the partition assigned to processor 2, so processor 1 gets 50 objects while processor 2 gets 25.

For a slightly more complicated example, suppose we have 4 processors, 2 with a speed of 1, 1 with a speed of 2, and 1 with a speed of 4, and we want to partition a collection of 128 objects taking into account the relative processor speeds. We know that we want the processor with a speed of 4 to have the most objects in the final partitioning (64) so we must add penalty weights to the other partitions: 48 to each of the speed 1 processors, 32 to the speed 2 processor. Now the load balancer will partition using a total weight of 256, but the appropriate portion of this work on each of the slower processors will be made up of penalty weights.

With octree partitioning, recall that the space-filling curve traversal of the leaf octants forms a one-dimensional ordering of the three-dimensional space. p partitions are formed by cutting the SFC into p pieces of equal weight. To partition for weighted partitions, the sizes of the subintervals need to be adjusted.

As an example, consider a situation with 4 processors, where P0 should get 30% of the work, P1 should get 15% of the work, P2 should get 20% of the work, and P3 should get the remaining 35% of the work. We begin with 100 objects distributed equally, 25 to each processor.

Recall that the octree load balancing algorithm allows each processor to compute the final partition assignment of all of its currently-assigned work with no communication other than a prefix sum to learn what percentage of work is currently assigned to lower-rank processors.

Power 30% 15% 20% 35%
Prefix Power 0% 30% 45% 65%
Current Load 25 25 25 25
Current Prefix Work 0% 25% 50% 75%
Start Proc 0 30 > 25, 0 50 > 30, 50 > 45, 50 < 65, 2 75 > 30, 75 > 45, 75 > 65, 3
Assignments 25 to 0 5 to 0, 15 to 1, 5 to 2 15 to 2, 10 to 3 25 to 3