Lecture 21 - Multiprocessing
- Announcements
- Multiprocessing
- Hardware
- Synchronization/Locks
- CPU Scheduling
- Distributed Systems
This week includes a little preview of both electives being offered
next semester: parallel processing and networks. Here, we'll
concentrate on the OS-related issues that arise when we start putting
multiple processors in a box or when we connect computers over a
network.
Why?
How?
The type of multiprocessing we will be discussing is Multiple-Instruction Multiple-Data (MIMD), where each CPU is running
its own program on its own data. An alternative is Single-Instruction Multiple-Data (SIMD), where all CPUs run the same
program, in fact the same instructions in lockstep, but on different
data. This is often special hardware such as a vector processor.
In a SIMD system, a single instruction might be applied concurrently
on n processors on each element of an n-element array.
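As a loose illustration (not from the original notes), the plain C loop
below does the kind of element-wise array work a SIMD machine would
collapse into a single vector instruction applied to all N elements at
once; the array names and size are made up for the example.

  /* Illustrative only: on SIMD/vector hardware, the addition loop below
   * is conceptually one instruction applied to all N elements in lockstep. */
  #include <stdio.h>

  #define N 8

  int main(void) {
      double a[N], b[N], c[N];
      for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

      /* a vector processor would perform this whole loop as one vector add */
      for (int i = 0; i < N; i++)
          c[i] = a[i] + b[i];

      for (int i = 0; i < N; i++)
          printf("%g ", c[i]);
      printf("\n");
      return 0;
  }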
Given a shared-memory architecture, how should the OS manage the CPUs
and the memory?
each CPU could get its own copy of the OS, memory is
partitioned among CPUs
relatively simple
processes cannot migrate among CPUs, as each process is
pinned to the CPU whose OS's scheduler knows about it
individual OSs cannot coordinate on I/O service - they may be
making inconsistent modifications to file systems, for example
another option is to have the OS run on one CPU (the master) and have the others run only user processes
one CPU must handle all system calls!
does not scale
one OS but it can run on any CPU - symmetric
multiprocessing (SMP)
eliminates bottlenecks of master CPU
but what if two processes make system calls at the same
time - multiple processes in the OS at once!
we had only logical concurrency with a single CPU, here
there is real concurrency
could restrict all system calls to one process at a time,
but this is too restrictive
need to restrict some system calls to be mutually
exclusive - protect access to shared kernel structures
this can be achieved by kernel locks - any processor that
wants to access a certain part of the kernel must first acquire the
lock
use separate locks for different parts of the kernel,
allowing multiple CPUs to be doing different things in the kernel
at the same time, as long as they do not interfere with each other
(a sketch of the idea follows)
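As a rough, user-space sketch of that last idea (not real kernel code -
the structure names are hypothetical and POSIX spinlocks stand in for
whatever locks the kernel actually uses), two subsystems each get their
own lock so work in one does not serialize work in the other:

  #include <pthread.h>

  /* hypothetical shared kernel structures */
  struct ready_queue    { int count; } rq;
  struct open_file_tab  { int count; } oft;

  pthread_spinlock_t rq_lock;   /* protects rq only  */
  pthread_spinlock_t oft_lock;  /* protects oft only */

  void enqueue_ready(void) {
      pthread_spin_lock(&rq_lock);
      rq.count++;                   /* touch scheduler state */
      pthread_spin_unlock(&rq_lock);
  }

  void open_file(void) {
      pthread_spin_lock(&oft_lock);
      oft.count++;                  /* touch file-system state */
      pthread_spin_unlock(&oft_lock);
  }

  int main(void) {
      pthread_spin_init(&rq_lock, PTHREAD_PROCESS_PRIVATE);
      pthread_spin_init(&oft_lock, PTHREAD_PROCESS_PRIVATE);
      enqueue_ready();    /* on an SMP, this could proceed in parallel     */
      open_file();        /* with open_file() since the locks are disjoint */
      return 0;
  }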
These kernel locks may seem simple at first, given that in a
single-CPU environment, we know that the kernel rules the world - no
other process is going to do anything without the kernel knowing, as
no other process could be scheduled until the kernel did it. But
with SMP, it is more complex.
Assuming for the moment that we can get kernel locks, we need to be
careful of deadlock. Since a number of CPUs might be holding locks,
one or more of the techniques we discussed earlier in the semester
must be used. It is not good to have a kernel deadlock - the system
will likely grind to a halt. Search the archives on your favorite
open-source kernel's SMP mailing list and you're sure to find examples
of deadlocks.
Way back in September, we talked about process synchronization, and
looked at a number of things, such as Peterson's algorithm and
hardware instructions, to implement synchronization primitives such as
semaphores. We decided that disabling interrupts was a great way to
do this, but in a multiprocessor environment, it just doesn't work.
We can only disable interrupts on one CPU, we can't stop all other
CPUs from continuing their work and introducing unwanted real
concurrency.
Let's look back at our test-and-set implementation of mutual
exclusion:
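(The code itself isn't reproduced in these notes; the usual textbook
version looks roughly like this - TestAndSet is written as a C function
only to show its effect, but on real hardware it is a single indivisible
instruction.)

  int lock = 0;                  /* 0 = free, 1 = held */

  /* atomically return the old value of *target and set it to 1;
   * a plain C function here, one indivisible instruction in hardware */
  int TestAndSet(int *target) {
      int old = *target;
      *target = 1;
      return old;
  }

  void enter_critical_section(void) {
      while (TestAndSet(&lock))
          ;                      /* spin (busy wait) until lock was 0 */
  }

  void leave_critical_section(void) {
      lock = 0;                  /* release */
  }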
TestAndSet is atomic in that the process cannot be interrupted
until the function completes, but it is possible for two processes
on two different CPUs to start executing the TestAndSet at the
same time. It is not hard to break the above mutual exclusion in this
case.
For this to work, additional hardware is needed (an extra system bus
line) to ensure that no other CPU can start accessing the bus until
the first one to start has completed its operation. Even with that
hardware, we still have a busy wait/spin lock. Here, we could have
several processes tying up several CPUs waiting to acquire a lock.
This may sound bad, but it's even worse - these spin locks will cause
cache thrashing if we are not careful.
Let's think about this cache thrashing. Consider 3 processes, all
trying to get access to a critical section, using a test-and-set
instruction on a lock variable that starts out as 0 in memory. Assume
that test-and-set really is atomic, using the hardware support
mentioned above.
Consider this sequence of events:
- P0 calls test-and-set on lock
- lock's cache line is moved into P0's cache (bus access)
- lock is read as 0 from P0's cache
- lock is set to 1 in P0's cache
- any other cache lines containing lock are invalidated
(bus access)
- the cache line is written back to memory (bus access)
- P0 has the lock - it is happy
- P1 calls test-and-set on lock
- lock's cache line is moved into P1's cache (bus access)
- lock is read as 1 from P1's cache
- lock is set to 1 in P1's cache
- any other cache lines containing lock (now, P0) are invalidated
(bus access)
- the cache line is written back to memory (bus access)
- P1 did not get the lock, it goes back to try again
- P2 calls test-and-set on lock
- lock's cache line is moved into P2's cache (bus access)
- lock is read as 1 from P2's cache
- lock is set to 1 in P2's cache
- any other cache lines containing lock (now, P1) are invalidated
(bus access)
- the cache line is written back to memory (bus access)
- P2 did not get the lock, it goes back to try again
- P1 calls test-and-set on lock
- lock's cache line is moved into P1's cache (bus access)
- lock is read as 1 from P1's cache
- lock is set to 1 in P1's cache
- any other cache lines containing lock (now, P2) are invalidated
(bus access)
- the cache line is written back to memory (bus access)
- P1 did not get the lock, it goes back to try again
- P2 calls test-and-set on lock
- lock's cache line is moved into P2's cache (bus access)
- lock is read as 1 from P2's cache
- lock is set to 1 in P2's cache
- any other cache lines containing lock (now, P1) are invalidated
(bus access)
- the cache line is written back to memory (bus access)
- P2 did not get the lock, it goes back to try again
- ... and so on, until P0 releases the lock. At that point, P0 will
write a 0 to lock, P1's or P2's cache line will be invalidated, and
whoever grabs it next wins.
This is bad. Not only do we have the undesirable busy wait, but we
are copying chunks of memory over the system bus like crazy.
The text describes three approaches that can help here.
Avoid test-and-set, which includes a memory write even if we
are not really changing memory, by doing a plain old read first. If
the read returns 0, it means there's a chance we can get the lock, at
which point we try the test-and-set. We may not get it, but at least
we knew it was worth a shot. (This idea, together with the next one,
is sketched in code after this list.)
Use ideas from Ethernet to avoid the tight spin loop and "wait
a little" before trying again.
Use private lock variables and a waiting list of CPUs.
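A minimal sketch of the first two ideas combined - read the lock before
attempting the (write-causing) test-and-set, and back off after a failed
attempt. It reuses the lock variable and TestAndSet placeholder from the
sketch above; real code would also need volatile/atomic accesses and a
smarter delay.

  extern int lock;                     /* 0 = free, 1 = held (from the sketch above) */
  extern int TestAndSet(int *target);  /* atomic on real hardware */

  void delay(int iterations) {         /* crude stand-in for "wait a little" */
      for (volatile int i = 0; i < iterations; i++)
          ;
  }

  void acquire(void) {
      int backoff = 1;
      for (;;) {
          while (lock != 0)            /* idea 1: plain read - no bus write,   */
              ;                        /* line stays shared while lock is held */
          if (TestAndSet(&lock) == 0)  /* looked free - worth a shot           */
              return;                  /* got it                               */
          delay(backoff);              /* idea 2: Ethernet-style backoff       */
          backoff *= 2;                /* before trying again                  */
      }
  }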
CPU scheduling can be much more complicated when we have multiple CPUs.
Straightforward approach:
have one ready queue
whenever a CPU becomes available, the next process in the queue
is selected to run on it
processes can run on any CPU and are likely to bounce around
among the CPUs during their lifetimes
no CPU will be left idle unless no processes are ready
This can be a reasonable approach, but it does have some problems:
moving a process around means that any information it builds up
in the cache and/or TLB on one CPU will no longer be there if it is
next scheduled to run on a different CPU. Scheduling that tries to keep
sending a process back to the same CPU is called affinity scheduling
(a Linux example of requesting affinity from user space appears after
this list). Solaris and Irix (SGI Unix) do some amount of affinity
scheduling.
a process holding a spin lock should not be preempted if
possible, as there may be other processes in a busy wait trying to
acquire the lock. It makes no sense to keep those other processes
spinning while the process that holds the lock (and hence is the only
one that can release it) waits in a ready queue for its next turn on
the CPU.
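Affinity scheduling itself is a decision the scheduler makes, but as a
related, concrete mechanism, Linux lets a process ask for hard affinity
from user space. A small Linux-specific example (illustrative only, not
how the scheduler's soft affinity works internally):

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void) {
      cpu_set_t mask;
      CPU_ZERO(&mask);
      CPU_SET(0, &mask);                 /* allow this process to run only on CPU 0 */

      if (sched_setaffinity(getpid(), sizeof(mask), &mask) != 0) {
          perror("sched_setaffinity");
          return 1;
      }
      printf("pid %d now runs only on CPU 0\n", (int)getpid());
      /* ... cache- and TLB-sensitive work benefits from staying put ... */
      return 0;
  }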
See also in the text: gang scheduling
Now, we consider distributed systems, where we have a collection of
computers connected by a network. These could be connected by a
private, fast network, in which case we might call it a cluster, or be connected by general-purpose networks, even the
Internet.
The interconnection network itself is not our focus - take the
Networks course. For now, we'll assume that the computers can
communicate with each other.
A networked system allows for
resource sharing
share files
remote hardware devices: tape drives, printers
computational speedup
reliability - if a node fails, others are available while
repairs are made
The OS for a distributed system could be:
Network Operating System - users are aware of multiple
nodes and access resources explicitly:
remote login to appropriate machine (telnet, rlogin, ssh)
transferring files explicitly (ftp, scp)
just use standard OS with network support
Distributed Operating System - users need not know where
things are on the network
system handles the transparency
data migration - automated file transfer (scp), file sharing
(NFS, AFS, Samba)
computation migration - transfer computation across the
system
remote procedure call (RPC) - a process on one node makes a
function call (essentially) that runs on a different node (a toy
sketch of the idea appears after this list)
process migration - execute an entire process, or parts of
it, on different nodes
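A toy sketch of the RPC idea, with made-up names and with the "network"
replaced by a direct call in the same process, just to show the
stub/marshal/dispatch structure: the caller invokes what looks like an
ordinary local function.

  #include <stdio.h>
  #include <string.h>

  struct msg { char op[16]; int a, b, result; };

  /* server side: the real procedure and a dispatcher */
  static int real_add(int a, int b) { return a + b; }

  static void server_handle(struct msg *m) {
      if (strcmp(m->op, "add") == 0)
          m->result = real_add(m->a, m->b);
  }

  /* client side: the stub the caller actually invokes */
  static int rpc_add(int a, int b) {
      struct msg m;                   /* marshal the call into a message */
      strcpy(m.op, "add");
      m.a = a;
      m.b = b;
      server_handle(&m);              /* stand-in for send/receive over the network */
      return m.result;                /* unmarshal the reply */
  }

  int main(void) {
      printf("rpc_add(2, 3) = %d\n", rpc_add(2, 3));   /* looks like a local call */
      return 0;
  }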
Network OS is simpler, but puts more burden on the user.
Distributed OS is more complex, but more automated.
Design Issues for A Distributed OS
Transparency - hide distinction between local and remote
resources
access transparency - ability to access local and remote
resources in a uniform manner
location transparency - users have no awareness of object
locations
migration transparency - object can be moved without
changing name or access method
concurrency transparency - share objects without
interference
replication transparency - consistency of multiple
instances of (or partitioned) files and data
parallelism transparency - parallelism without users needing
to be aware of the details
failure transparency - graceful degradation rather than
system disruption, minimize damage to users - fault tolerance
performance transparency - consistent and predictable
performance when the system structure or load changes
size transparency - system can grow without users'
knowledge - scalability - difficult issue, as bottlenecks may
arise in large systems