Lecture 21 - Multiprocessing
- Announcements
- Multiprocessing
- Hardware
- Synchronization/Locks
- CPU Scheduling
- Distributed Systems
This week includes a little preview of both electives being offered
next semester: parallel processing and networks. Here, we'll
concentrate on the OS-related issues that arise when we start putting
multiple processors in a box or when we connect computers over a
network.
Why?
How?
The type of multiprocessing we will be discussing is Multiple-Instruction Multiple-Data (MIMD), where each CPU is running
its own program on its own data. An alternative is Single-Instruction Multiple-Data (SIMD), where all CPUs run the same
program, in fact the same instructions in lockstep, but on different
data. This is often special hardware such as a vector processor.
In a SIMD system, a single instruction might be applied concurrently
on n processors on each element of an n-element array.
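As a loose illustration (not from the original notes), the plain C loop
below does the kind of element-wise array work a SIMD machine would
collapse into a single vector instruction applied to all N elements at
once; the array names and size are made up for the example.

  /* Illustrative only: on SIMD/vector hardware, the addition loop below
   * is conceptually one instruction applied to all N elements in lockstep. */
  #include <stdio.h>

  #define N 8

  int main(void) {
      double a[N], b[N], c[N];
      for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

      /* a vector processor would perform this whole loop as one vector add */
      for (int i = 0; i < N; i++)
          c[i] = a[i] + b[i];

      for (int i = 0; i < N; i++)
          printf("%g ", c[i]);
      printf("\n");
      return 0;
  }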
Given a shared-memory architecture, how should the OS manage the CPUs
and the memory?
each CPU could get its own copy of the OS, memory is
partitioned among CPUs
relatively simple
processes cannot migrate among CPUs, as each process is
pinned to the CPU whose OS's scheduler knows about it
individual OSs cannot coordinate on I/O service - they may be
making inconsistent modifications to file systems, for example
another option is to have the OS run on one CPU (the master) and have the others run only user processes
one CPU must handle all system calls!
does not scale
one OS but it can run on any CPU - symmetric
multiprocessing (SMP)
eliminates bottlenecks of master CPU
but what if two processes make system calls at the same
time - multiple processes in the OS at once!
we had only logical concurrency with a single CPU, here
there is real concurrency
could restrict all system calls to one process at a time,
but this is too restrictive
need to restrict some system calls to be mutually
exclusive - protect access to shared kernel structures
this can be achieved by kernel locks - any processor that
wants to access a certain part of the kernel must first acquire the
lock
use separate locks for different parts of the kernel,
allowing multiple CPUs to be doing different things in the kernel
at the same time, as long as they do not interfere with each other
(a sketch of the idea follows)
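As a rough, user-space sketch of that last idea (not real kernel code -
the structure names are hypothetical and POSIX spinlocks stand in for
whatever locks the kernel actually uses), two subsystems each get their
own lock so work in one does not serialize work in the other:

  #include <pthread.h>

  /* hypothetical shared kernel structures */
  struct ready_queue    { int count; } rq;
  struct open_file_tab  { int count; } oft;

  pthread_spinlock_t rq_lock;   /* protects rq only  */
  pthread_spinlock_t oft_lock;  /* protects oft only */

  void enqueue_ready(void) {
      pthread_spin_lock(&rq_lock);
      rq.count++;                   /* touch scheduler state */
      pthread_spin_unlock(&rq_lock);
  }

  void open_file(void) {
      pthread_spin_lock(&oft_lock);
      oft.count++;                  /* touch file-system state */
      pthread_spin_unlock(&oft_lock);
  }

  int main(void) {
      pthread_spin_init(&rq_lock, PTHREAD_PROCESS_PRIVATE);
      pthread_spin_init(&oft_lock, PTHREAD_PROCESS_PRIVATE);
      enqueue_ready();    /* on an SMP, this could proceed in parallel     */
      open_file();        /* with open_file() since the locks are disjoint */
      return 0;
  }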
These kernel locks may seem simple at first, given that in a
single-CPU environment, we know that the kernel rules the world - no
other process is going to do anything without the kernel knowing, as
no other process could be scheduled until the kernel did it. But
with SMP, it is more complex.
Assuming for the moment that we can get kernel locks, we need to be
careful of deadlock. Since a number of CPUs might be holding locks,
one or more of the techniques we discussed earlier in the semester
must be used. It is not good to have a kernel deadlock - the system
will likely grind to a halt. Search the archives on your favorite
open-source kernel's SMP mailing list and you're sure to find examples
of deadlocks.
Way back in September, we talked about process synchronization, and
looked at a number of things, such as Peterson's algorithm and
hardware instructions, to implement synchronization primitives such as
semaphores. We decided that disabling interrupts was a great way to
do this, but in a multiprocessor environment, it just doesn't work.
We can only disable interrupts on one CPU, we can't stop all other
CPUs from continuing their work and introducing unwanted real
concurrency.
Let's look back at our test-and-set implementation of mutual
exclusion:
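(The code itself isn't reproduced in these notes; the usual textbook
version looks roughly like this - TestAndSet is written as a C function
only to show its effect, but on real hardware it is a single indivisible
instruction.)

  int lock = 0;                  /* 0 = free, 1 = held */

  /* atomically return the old value of *target and set it to 1;
   * a plain C function here, one indivisible instruction in hardware */
  int TestAndSet(int *target) {
      int old = *target;
      *target = 1;
      return old;
  }

  void enter_critical_section(void) {
      while (TestAndSet(&lock))
          ;                      /* spin (busy wait) until lock was 0 */
  }

  void leave_critical_section(void) {
      lock = 0;                  /* release */
  }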
TestAndSet is atomic in that the process cannot be interrupted
until the function completes, but it is possible for two processes
on two different CPUs to start executing the TestAndSet at the
same time. It is not hard to break the above mutual exclusion in this
case.
For this to work, additional hardware is needed (an extra system bus
line) to ensure that no other CPU can start accessing the bus until
the first one to start has completed its operation. Even with that
hardware, we still have a busy wait/spin lock. Here, we could have
several processes tying up several CPUs waiting to acquire a lock.
This may sound bad, but it's even worse - these spin locks will cause
cache thrashing if we are not careful.
Let's think about this cache thrashing. Consider 3 processes, all
trying to get access to a critical section, using a test-and-set
instruction on a lock variable that starts out as 0 in memory. Assume
that test-and-set really is atomic, using the hardware support
mentioned above.
Consider this sequence of events:
- P0 calls test-and-set on lock
- lock's cache line is moved into P0's cache (bus access)
- lock is read as 0 from P0's cache
- lock is set to 1 in P0's cache
- any other cache lines containing lock are invalidated
(bus access)
- the cache line is written back to memory (bus access)
- P0 has the lock - it is happy
- P1 calls test-and-set on lock
- lock's cache line is moved into P1's cache (bus access)
- lock is read as 1 from P1's cache
- lock is set to 1 in P1's cache
- any other cache lines containing lock (now, P0) are invalidated
(bus access)
- the cache line is written back to memory (bus access)
- P1 did not get the lock, it goes back to try again
- P2 calls test-and-set on lock
- lock's cache line is moved into P2's cache (bus access)
- lock is read as 1 from P2's cache
- lock is set to 1 in P2's cache
- any other cache lines containing lock (now, P1) are invalidated
(bus access)
- the cache line is written back to memory (bus access)
- P2 did not get the lock, it goes back to try again
- P1 calls test-and-set on lock
- lock's cache line is moved into P1's cache (bus access)
- lock is read as 1 from P1's cache
- lock is set to 1 in P1's cache
- any other cache lines containing lock (now, P2) are invalidated
(bus access)
- the cache line is written back to memory (bus access)
- P1 did not get the lock, it goes back to try again
- P2 calls test-and-set on lock
- lock's cache line is moved into P2's cache (bus access)
- lock is read as 1 from P2's cache
- lock is set to 1 in P2's cache
- any other cache lines containing lock (now, P1) are invalidated
(bus access)
- the cache line is written back to memory (bus access)
- P2 did not get the lock, it goes back to try again
- ... and so on, until P0 releases the lock. At that point, P0 will
write a 0 to lock, P1's or P2's cache line will be invalidated, and
whoever grabs it next wins.
This is bad. Not only do we have the undesirable busy wait, but we
are copying chunks of memory over the system bus like crazy.
The text describes three approaches that can help here.
Avoid test-and-set, which includes a memory write even if we
are not really changing memory, by doing a plain old read first. If
the read returns 0, it means there's a chance we can get the lock, at
which point we try the test-and-set. We may not get it, but at least
we knew it was worth a shot. (This idea, together with the next one,
is sketched in code after this list.)
Use ideas from Ethernet to avoid the tight spin loop and "wait
a little" before trying again.
Use private lock variables and a waiting list of CPUs.
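A minimal sketch of the first two ideas combined - read the lock before
attempting the (write-causing) test-and-set, and back off after a failed
attempt. It reuses the lock variable and TestAndSet placeholder from the
sketch above; real code would also need volatile/atomic accesses and a
smarter delay.

  extern int lock;                     /* 0 = free, 1 = held (from the sketch above) */
  extern int TestAndSet(int *target);  /* atomic on real hardware */

  void delay(int iterations) {         /* crude stand-in for "wait a little" */
      for (volatile int i = 0; i < iterations; i++)
          ;
  }

  void acquire(void) {
      int backoff = 1;
      for (;;) {
          while (lock != 0)            /* idea 1: plain read - no bus write,   */
              ;                        /* line stays shared while lock is held */
          if (TestAndSet(&lock) == 0)  /* looked free - worth a shot           */
              return;                  /* got it                               */
          delay(backoff);              /* idea 2: Ethernet-style backoff       */
          backoff *= 2;                /* before trying again                  */
      }
  }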
CPU scheduling can be much more complicated when we have multiple CPUs.
Straightforward approach:
have one ready queue
whenever a CPU becomes available, the next process in the queue
is selected to run on it
processes can run on any CPU and are likely to bounce around
among the CPUs during their lifetimes
no CPU will be left idle unless no processes are ready
This can be a reasonable approach, but it does have some problems:
moving a process around means that any information it builds up
in the cache and/or TLB on one CPU will no longer be there if it is
next scheduled to run on a different CPU. Scheduling that tries to keep
sending a process back to the same CPU is called affinity scheduling
(a Linux example of requesting affinity from user space appears after
this list). Solaris and Irix (SGI Unix) do some amount of affinity
scheduling.
a process holding a spin lock should not be preempted if
possible, as there may be other processes in a busy wait trying to
acquire the lock. It makes no sense to keep those other processes
spinning while the process that holds the lock (and hence is the only
one that can release it) waits in a ready queue for its next turn on
the CPU.
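Affinity scheduling itself is a decision the scheduler makes, but as a
related, concrete mechanism, Linux lets a process ask for hard affinity
from user space. A small Linux-specific example (illustrative only, not
how the scheduler's soft affinity works internally):

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void) {
      cpu_set_t mask;
      CPU_ZERO(&mask);
      CPU_SET(0, &mask);                 /* allow this process to run only on CPU 0 */

      if (sched_setaffinity(getpid(), sizeof(mask), &mask) != 0) {
          perror("sched_setaffinity");
          return 1;
      }
      printf("pid %d now runs only on CPU 0\n", (int)getpid());
      /* ... cache- and TLB-sensitive work benefits from staying put ... */
      return 0;
  }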
See also in the text: gang scheduling
Now, we consider distributed systems, where we have a collection of
computers connected by a network. These could be connected by a
private, fast network, in which case we might call it a cluster, or be connected by general-purpose networks, even the
Internet.
The interconnection network itself is not our focus - take the
Networks course. For now, we'll assume that the computers can
communicate with each other.
A networked system allows for
resource sharing
share files
remote hardware devices: tape drives, printers
computational speedup
reliability - if a node fails, others are available while
repairs are made
The OS for a distributed system could be:
Network Operating System - users are aware of multiple
nodes and access resources explicitly:
remote login to appropriate machine (telnet, rlogin, ssh)
transferring files explicitly (ftp, scp)
just use standard OS with network support
Distributed Operating System - users need not know where
things are on the network
system handles the transparency
data migration - automated file transfer (scp), file sharing
(NFS, AFS, Samba)
computation migration - transfer computation across the
system
remote procedure call (RPC) - a process on one node makes a
function call (essentially) that runs on a different node (a toy
sketch of the idea appears after this list)
process migration - execute an entire process, or parts of
it, on different nodes
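A toy sketch of the RPC idea, with made-up names and with the "network"
replaced by a direct call in the same process, just to show the
stub/marshal/dispatch structure: the caller invokes what looks like an
ordinary local function.

  #include <stdio.h>
  #include <string.h>

  struct msg { char op[16]; int a, b, result; };

  /* server side: the real procedure and a dispatcher */
  static int real_add(int a, int b) { return a + b; }

  static void server_handle(struct msg *m) {
      if (strcmp(m->op, "add") == 0)
          m->result = real_add(m->a, m->b);
  }

  /* client side: the stub the caller actually invokes */
  static int rpc_add(int a, int b) {
      struct msg m;                   /* marshal the call into a message */
      strcpy(m.op, "add");
      m.a = a;
      m.b = b;
      server_handle(&m);              /* stand-in for send/receive over the network */
      return m.result;                /* unmarshal the reply */
  }

  int main(void) {
      printf("rpc_add(2, 3) = %d\n", rpc_add(2, 3));   /* looks like a local call */
      return 0;
  }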
Network OS is simpler, but puts more burden on the user.
Distributed OS is more complex, but more automated.
Design Issues for A Distributed OS
Transparency - hide distinction between local and remote
resources
access transparency - ability to access local and remote
resources in a uniform manner
location transparency - users have no awareness of object
locations
migration transparency - object can be moved without
changing name or access method
concurrency transparency - share objects without
interference
replication transparency - consistency of multiple
instances of (or partitioned) files and data
parallelism transparency - parallelism without users needing
to be aware of the details
failure transparency - graceful degradation rather than
system disruption, minimize damage to users - fault tolerance
performance transparency - consistent and predictable
performance when the system structure or load changes
size transparency - system can grow without users'
knowledge - scalability - difficult issue, as bottlenecks may
arise in large systems