Lecture 21 - Multiprocessing


Agenda

Announcements

  • None yet.
    Multiprocessing

    This week includes a little preview of both electives being offered next semester: parallel processing and networks. Here, we'll concentrate on the OS-related issues that arise when we start putting multiple processors in a box or when we connect computers over a network.

    Why?

  • we want more computing power than is possible with a single CPU
    How?

  • tightly-coupled: put multiple CPUs in a box
  • loosely-coupled: connect CPUs with a network

    where the network might be internal to the system, a local area network (Ethernet), or even the Internet.

    Shared-memory Multiprocessing

    The type of multiprocessing we will be discussing is Multiple-Instruction Multiple-Data (MIMD), where each CPU runs its own program on its own data. An alternative is Single-Instruction Multiple-Data (SIMD), where all CPUs run the same program - in fact, the same instructions in lockstep - but on different data. SIMD is often implemented in special hardware such as a vector processor. In a SIMD system, a single instruction might be applied concurrently, by n processors, to each element of an n-element array.
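
    To make the SIMD idea concrete, here is a small C fragment; on a SIMD machine, the single element-wise "add" below would be issued once and carried out on all N elements in lockstep, one per processing element, whereas the MIMD systems we discuss here run an independent instruction stream on each CPU. The fragment itself is ordinary C, written as a loop because a plain CPU handles one element at a time.

      #define N 8   /* imagine one processing element per array slot */

      /* c = a + b, element by element */
      void vector_add(const int a[N], const int b[N], int c[N])
      {
          for (int i = 0; i < N; i++)
              c[i] = a[i] + b[i];
      }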

  • two or more CPUs accessing a shared physical memory
  • same memory can be accessed by all CPUs
  • each CPU may have its own cache
  • what if two CPUs have the same memory location in their caches, and both modify it?
  • if a memory location in a CPU's cache is modified, the typical approach is to invalidate all cache lines for that same memory on other CPUs
  • if we restrict caches to contain disjoint parts of memory, we may lose some efficiency in cases where the memory will not be modified
  • memory access may be
  • uniform (UMA) if all CPUs can access all memory with equal cost
  • non-uniform (NUMA) if some CPUs have faster access to some (more local) memory
  • Given a shared-memory architecture, how should the OS manage the CPUs and the memory?

  • each CPU could get its own copy of the OS, memory is partitioned among CPUs
  • relatively simple
  • processes cannot migrate among CPUs, as each process is pinned to the CPU whose OS's scheduler knows about it
  • the individual OS copies cannot coordinate I/O service - they may make inconsistent modifications to file systems, for example
  • another option is to have the OS run on one CPU (the master) and have the others run only user processes
  • one CPU must handle all system calls!
  • does not scale
  • one OS but it can run on any CPU - symmetric multiprocessing (SMP)
  • eliminates the bottleneck of the master CPU
  • but what if two processes make system calls at the same time - multiple processes in the OS at once!
  • we had only logical concurrency with a single CPU; here there is real concurrency
  • we could allow only one process at a time to be executing a system call, but this is too restrictive
  • need to restrict some system calls to be mutually exclusive - protect access to shared kernel structures
  • this can be achieved by kernel locks - any processor that wants to access a certain part of the kernel must first acquire the lock
  • use separate locks for different parts of the kernel, allowing multiple CPUs to be in the kernel at once, doing different things that do not interfere with each other
  • These kernel locks may seem simple at first, given that in a single-CPU environment, the kernel rules the world - no other process is going to do anything without the kernel knowing, since no other process can be scheduled until the kernel schedules it. But with SMP, it is more complex.

    Assuming for the moment that we can get kernel locks, we need to be careful of deadlock. Since a number of CPUs might be holding locks, one or more of the deadlock-handling techniques we discussed earlier in the semester must be used. It is not good to have a kernel deadlock - the system will likely grind to a halt. Search the archives of your favorite open-source kernel's SMP mailing list and you're sure to find examples of deadlocks.
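
    As a rough illustration (not any particular kernel's actual code), here is what per-subsystem kernel locks might look like, using pthread mutexes to stand in for kernel locks and a fixed acquisition order as a simple deadlock-prevention technique; the lock names and the do_rename example are made up.

      #include <pthread.h>

      /* One lock per kernel subsystem, so CPUs working on unrelated things
       * (say, the scheduler and the file system) need not serialize on a
       * single giant kernel lock. */
      pthread_mutex_t file_table_lock   = PTHREAD_MUTEX_INITIALIZER;
      pthread_mutex_t buffer_cache_lock = PTHREAD_MUTEX_INITIALIZER;

      /* Deadlock prevention by ordering: every code path that needs both
       * locks must take file_table_lock before buffer_cache_lock. */
      void do_rename(void)
      {
          pthread_mutex_lock(&file_table_lock);
          pthread_mutex_lock(&buffer_cache_lock);

          /* ... update the file table and buffer cache consistently ... */

          pthread_mutex_unlock(&buffer_cache_lock);
          pthread_mutex_unlock(&file_table_lock);
      }

    If some other code path took the two locks in the opposite order, two CPUs could each end up holding one lock while waiting forever for the other - exactly the circular wait we want to rule out.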

    Locks/Synchronization

    Way back in September, we talked about process synchronization, and looked at a number of things, such as Peterson's algorithm and hardware instructions, to implement synchronization primitives such as semaphores. We decided that disabling interrupts was a great way to do this, but in a multiprocessor environment, it just doesn't work: we can only disable interrupts on one CPU, and we can't stop all the other CPUs from continuing their work and introducing unwanted real concurrency.

    Let's look back at our test-and-set implementation of mutual exclusion:
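
    (The code itself isn't reproduced in these notes; the sketch below captures the idea. TestAndSet stands in for the atomic hardware instruction, which sets the lock to 1 and returns its old value in a single indivisible step - the C function only models that behavior.)

      /* models the atomic hardware instruction */
      int TestAndSet(int *target)
      {
          int old = *target;   /* this read ... */
          *target = 1;         /* ... and this write happen as one atomic step */
          return old;
      }

      int lock = 0;            /* 0 = free, 1 = held */

      void acquire(int *lock)
      {
          while (TestAndSet(lock))
              ;                /* spin until we observed the lock free */
      }

      void release(int *lock)
      {
          *lock = 0;
      }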

    TestAndSet is atomic in that the process cannot be interrupted until the function completes, but it is still possible for two processes on two different CPUs to start executing TestAndSet at the same time. It is not hard to break the above mutual exclusion in this case.

    For this to work, additional hardware is needed (an extra system bus line) to ensure that no other CPU can access the bus until the first CPU has completed its operation. Even with that hardware, we still have a busy wait/spin lock, and we could have several processes tying up several CPUs waiting to acquire a lock. This may sound bad, but it's even worse - these spin locks will cause cache thrashing if we are not careful.

    Let's think about this cache thrashing. Consider three processes, all trying to get access to a critical section using a test-and-set instruction on a lock variable that starts out as 0 in memory. Assume that test-and-set really is atomic, using the hardware support mentioned above.

    Consider this sequence of events:

    1. P0 calls test-and-set on lock
    2. lock's cache line is moved into P0's cache (bus access)
    3. lock is read as 0 from P0's cache
    4. lock is set to 1 in P0's cache
    5. any other cache lines containing lock are invalidated (bus access)
    6. the cache line is written back to memory (bus access)
    7. P0 has the lock - it is happy
    8. P1 calls test-and-set on lock
    9. lock's cache line is moved into P1's cache (bus access)
    10. lock is read as 1 from P1's cache
    11. lock is set to 1 in P1's cache
    12. any other cache lines containing lock (now, P0) are invalidated (bus access)
    13. the cache line is written back to memory (bus access)
    14. P1 did not get the lock, it goes back to try again
    15. P2 calls test-and-set on lock
    16. lock's cache line is moved into P2's cache (bus access)
    17. lock is read as 1 from P2's cache
    18. lock is set to 1 in P2's cache
    19. any other cache lines containing lock (now, P1) are invalidated (bus access)
    20. the cache line is written back to memory (bus access)
    21. P2 did not get the lock, it goes back to try again
    22. P1 calls test-and-set on lock
    23. lock's cache line is moved into P1's cache (bus access)
    24. lock is read as 1 from P1's cache
    25. lock is set to 1 in P1's cache
    26. any other cache lines containing lock (now, P2) are invalidated (bus access)
    27. the cache line is written back to memory (bus access)
    28. P1 did not get the lock, it goes back to try again
    29. P2 calls test-and-set on lock
    30. lock's cache line is moved into P2's cache (bus access)
    31. lock is read as 1 from P2's cache
    32. lock is set to 1 in P2's cache
    33. any other cache lines containing lock (now, P1) are invalidated (bus access)
    34. the cache line is written back to memory (bus access)
    35. P2 did not get the lock, it goes back to try again
    36. ... until P0 releases the lock. At that point, P0 will write a 0 to lock and P1 or P2's cache line will be invalidated, and whoever can grab it next, wins.

    This is bad. Not only do we have the undesirable busy wait, but we are copying chunks of memory over the system bus like crazy.

    The text describes three approaches that can help here.

  • Avoid the test-and-set - which involves a memory write, even when we are not really changing the value - by doing a plain old read first. If the read returns 0, there's a chance we can get the lock, at which time we try the test-and-set. We may not get it, but at least we knew it was worth a shot. (This idea and the next are sketched after this list.)
  • Use ideas from Ethernet (exponential backoff) to avoid the tight spin loop and "wait a little" before trying again.
  • Use private lock variables and a waiting list of CPUs.
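
    Here is a sketch of the first two ideas combined, building on the TestAndSet sketch above. The usleep-based delay is just one way to "wait a little", and the constants are arbitrary; a real implementation would also declare the lock volatile (or use C11 atomics) so the compiler actually re-reads it in the spin loop.

      #include <unistd.h>          /* for usleep */

      int TestAndSet(int *target); /* the sketch from earlier in these notes */

      void acquire(int *lock)
      {
          unsigned delay = 1;

          for (;;) {
              /* read-only spin: stays in our own cache, causing no
               * invalidations while someone else holds the lock */
              while (*lock != 0)
                  ;

              /* the lock looked free - now it's worth a (bus-locking) try */
              if (TestAndSet(lock) == 0)
                  return;

              usleep(delay);       /* "wait a little" before trying again */
              if (delay < 1000)
                  delay *= 2;      /* Ethernet-style exponential backoff */
          }
      }
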
    CPU Scheduling

    CPU scheduling can be much more complicated when we have multiple CPUs.

    Straightforward approach:

  • have one ready queue, shared by all CPUs (a small sketch follows this list)
  • whenever a CPU becomes available the next process is selected
  • processes can run on any CPU and are likely to bounce around among the CPUs during their lifetimes
  • no CPU will be left idle unless no processes are ready
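
    A minimal sketch of that single shared ready queue - struct proc, the field names, and next_ready() are invented for illustration, and a pthread mutex stands in for whatever lock the kernel would really use:

      #include <pthread.h>
      #include <stddef.h>

      struct proc {
          int pid;
          struct proc *next;
      };

      static struct proc *ready_head = NULL;
      static pthread_mutex_t ready_lock = PTHREAD_MUTEX_INITIALIZER;

      /* Called by whichever CPU becomes available; returns NULL if nothing
       * is ready.  Because any CPU may call this, a process can run on a
       * different CPU each time it is dispatched. */
      struct proc *next_ready(void)
      {
          pthread_mutex_lock(&ready_lock);
          struct proc *p = ready_head;
          if (p != NULL)
              ready_head = p->next;
          pthread_mutex_unlock(&ready_lock);
          return p;
      }
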
  • This can be a reasonable approach, but it does have some problems:

  • moving a process around means that any information it builds up in the cache and/or TLB on one CPU will no longer be there if it is next scheduled to run on a different CPU. Scheduling that tries to keep sending a process back to the same CPU is called affinity scheduling. Solaris and Irix (SGI Unix) do some amount of affinity scheduling. (A small Linux example of pinning a process to one CPU follows this list.)
  • a process holding a spin lock should not be preempted if possible, as there may be other processes in a busy wait trying to acquire the lock. It makes no sense to keep those other processes spinning while the process that holds the lock (and hence is the only one that can release it) waits in a ready queue for its next turn on the CPU.
  • See also in the text: gang scheduling
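
    On Linux, a process (or the OS on its behalf) can express affinity explicitly. Here is a small sketch using sched_setaffinity, pinning the calling process to CPU 0 - hard-pinning like this is the bluntest form of affinity; a scheduler doing affinity scheduling would merely prefer the old CPU:

      #define _GNU_SOURCE
      #include <sched.h>
      #include <stdio.h>

      int main(void)
      {
          cpu_set_t set;
          CPU_ZERO(&set);
          CPU_SET(0, &set);   /* allow this process to run only on CPU 0 */

          /* pid 0 means "the calling process" */
          if (sched_setaffinity(0, sizeof(set), &set) != 0) {
              perror("sched_setaffinity");
              return 1;
          }

          /* from here on, the cache/TLB state we build up stays useful,
           * because we always run on the same CPU */
          return 0;
      }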

    Distributed Systems

    Now, we consider distributed systems, where we have a collection of computers connected by a network. These could be connected by a private, fast network, in which case we might call it a cluster, or be connected by general-purpose networks, even the Internet.

    The interconnection network itself is not our focus - take the Networks course. For now, we'll assume that the computers can communicate with each other.

    A networked system allows for

  • resource sharing
  • share files
  • remote hardware devices: tape drives, printers
  • computational speedup
  • reliability - if a node fails, others are available while repairs are made
  • The OS for a distributed system could be:

  • Network Operating System - users are aware of multiple nodes - access resources explicitly:
  • remote login to appropriate machine (telnet, rlogin, ssh)
  • transferring files explicitly (ftp, scp)
  • just use standard OS with network support
  • Distributed Operating System - users need not know where things are on the network
  • system handles the transparency
  • data migration - automated file transfer (scp), file sharing (NFS, AFS, Samba)
  • computation migration - transfer computation across the system
  • remote procedure call (RPC) - a process on one node makes what is essentially a function call that runs on a different node (a toy client stub is sketched below)
  • process migration - execute an entire process, or parts of it, on different nodes
  • Network OS is simpler, but puts more burden on the user.

    Distributed OS is more complex, but more automated.
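
    To see why RPC is attractive, here is a toy client stub in C: the caller just calls remote_add() and gets a sum back, and the stub hides the marshalling and the network. The host, port, and wire format (two 32-bit integers in network byte order, answered with their sum) are invented for this example; real RPC systems generate stubs like this automatically from an interface description.

      #include <arpa/inet.h>
      #include <netinet/in.h>
      #include <stdint.h>
      #include <string.h>
      #include <sys/socket.h>
      #include <unistd.h>

      #define SERVER_HOST "10.0.0.2"   /* hypothetical remote node */
      #define SERVER_PORT 9000         /* hypothetical "add server" port */

      /* Looks like an ordinary function call to the caller; returns 0 on
       * success and fills in *result with the sum computed on the remote node. */
      int remote_add(int32_t a, int32_t b, int32_t *result)
      {
          int fd = socket(AF_INET, SOCK_STREAM, 0);
          if (fd < 0)
              return -1;

          struct sockaddr_in addr;
          memset(&addr, 0, sizeof(addr));
          addr.sin_family = AF_INET;
          addr.sin_port = htons(SERVER_PORT);
          inet_pton(AF_INET, SERVER_HOST, &addr.sin_addr);

          if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
              close(fd);
              return -1;
          }

          /* marshal the arguments and send the request */
          int32_t req[2] = { (int32_t)htonl((uint32_t)a), (int32_t)htonl((uint32_t)b) };
          if (write(fd, req, sizeof(req)) != (ssize_t)sizeof(req)) {
              close(fd);
              return -1;
          }

          /* wait for the reply and unmarshal the result */
          int32_t reply;
          if (read(fd, &reply, sizeof(reply)) != (ssize_t)sizeof(reply)) {
              close(fd);
              return -1;
          }
          *result = (int32_t)ntohl((uint32_t)reply);

          close(fd);
          return 0;
      }

    The matching server would accept a connection, read the two integers, add them, and write the sum back; neither side's "business logic" needs to know the other is on a different machine.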

    Design Issues for a Distributed OS

    Transparency - hide distinction between local and remote resources

  • access transparency - ability to access local and remote resources in a uniform manner
  • location transparency - users have no awareness of object locations
  • migration transparency - object can be moved without changing name or access method
  • concurrency transparency - share objects without interference
  • replication transparency - consistency of multiple instances of (or partitioned) files and data
  • parallelism transparency - parallelism without users needing to be aware of the details
  • failure transparency - graceful degradation rather than system disruption, minimize damage to users - fault tolerance
  • performance transparency - consistent and predictable performance when the system structure or load changes
  • size transparency - system can grow without users' knowledge - scalability - difficult issue, as bottlenecks may arise in large systems