Lecture 23 - Current and Future OS Issues


Agenda

Announcements

  • No colloquium this week. Enjoy the break.
  • Reminder: get a draft of the final project papers to me before you see Williamstown in the rear-view mirror for Thanksgiving.
  • Presentation Information

  • The schedule for presentations is on the pages linked from the course web page for the next two lectures.
  • The schedule is tight, so presentations will need to start and end on time.
  • Attendance is required. There is no "credit" for attendance, but there are penalties for missing the presentations.
  • A good presentation has to be planned and probably rehearsed. Especially when time is limited, you need to have some slides prepared.
  • Don't put your whole talk on the slides, and don't read them, but use them to guide the audience through your important points.
  • Know your audience - include enough background information so they can understand your main points, but skip material you expect they already know.
  • Some Current and Future OS Issues

    Much of what we have discussed this semester applies to modern desktop and server operating systems. Last week, we talked about some of the issues that arise when multiprocessors or distributed systems are used.

    Some of the most interesting (to me, anyway) current and future OS-related issues arise in the high-end parallel and distributed systems that are the fastest computers in the world today.

    Before we look at today's systems, consider a few parallel systems that were precursors to today's systems:

  • Connection Machine, by Thinking Machines. Long, long gone. The CM-2 had 64K processors connected in a mesh network, with typically 2 GB of memory and 10 GB of disk - massive for the time (1990). The CM-5 had some of the best LED displays.
  • Cray, vector supercomputing. SPMD model, typically. Cray was bought by SGI and later sold to Tera, which renamed itself Cray.
  • Intel Paragon. Paragon was another mesh-interconnected system. The processors were partitioned among jobs. Intel has since gotten out of the supercomputer business.
  • Typical current systems:

  • IBM SP. Basically a cluster with high-end RS6000/PowerPC processors and a high-speed switch interconnect.
  • SGI Origin 3000, and its predecessor, the SGI Origin 2000. CC-NUMA and NUMA-Flex - non-uniform memory access, but all memory is shared among all processors.
  • PC Cluster. Try this at home. Many fall into the category of Beowulf Clusters.
  • Workstation Cluster. Bullpen is an example.
  • Larger Cluster-based systems

  • Chiba City at Argonne.
  • Some issues for the clustered systems:

  • assignment of nodes to jobs - want to keep as many nodes as possible busy computing, but don't want to let small jobs (few processors) starve larger jobs (many processors)
  • some jobs may require special resources available only on some nodes - for example, in the bullpen cluster, some nodes have four processors while others have two, some have more memory than others, and some have a gigabit interconnect while others don't
  • a scheduler might require job submissions to include a maximum running time so that it can schedule more intelligently
  • scheduling is likely non-preemptive, but we can consider an SJF approach (see the sketch after this list)
  • these issues are addressed by the Maui Scheduler
  • are compute nodes granted exclusively to one job, or are they shared?
  • use of local disks, parallel file systems - see last time
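
    To make the scheduling trade-offs above concrete, here is a minimal sketch (in C) of non-preemptive shortest-job-first node assignment. The cluster size, job names, node counts, and time estimates are all hypothetical, and this is not how the Maui Scheduler works - it only shows why user-supplied time estimates help and how large jobs can starve when nothing is reserved for them.

      /* sjf_sketch.c - a minimal, hypothetical sketch of non-preemptive
       * shortest-job-first node assignment on a cluster.  The cluster
       * size, job names, node counts, and time estimates are made up
       * for illustration. */
      #include <stdio.h>
      #include <stdlib.h>

      struct job {
          const char *name;
          int nodes_needed;   /* how many compute nodes the job requests */
          int est_minutes;    /* user-supplied maximum running time      */
      };

      /* Order jobs by estimated running time (shortest first). */
      static int by_est_time(const void *a, const void *b)
      {
          const struct job *ja = a, *jb = b;
          return ja->est_minutes - jb->est_minutes;
      }

      int main(void)
      {
          int free_nodes = 16;            /* hypothetical cluster size */
          struct job queue[] = {
              { "climate",  8, 240 },
              { "render",   2,  15 },
              { "fem",      4,  60 },
              { "genome",  12, 480 },
          };
          int njobs = sizeof(queue) / sizeof(queue[0]);

          /* SJF pass: sort the queue by estimated time, then start every
           * job that still fits in the free nodes.  Jobs that do not fit
           * stay queued; nothing is preempted once it has started. */
          qsort(queue, njobs, sizeof(queue[0]), by_est_time);

          for (int i = 0; i < njobs; i++) {
              if (queue[i].nodes_needed <= free_nodes) {
                  free_nodes -= queue[i].nodes_needed;
                  printf("start %-8s on %2d nodes (%3d min est.)\n",
                         queue[i].name, queue[i].nodes_needed,
                         queue[i].est_minutes);
              } else {
                  printf("hold  %-8s (needs %d nodes, only %d free)\n",
                         queue[i].name, queue[i].nodes_needed, free_nodes);
              }
          }
          printf("%d nodes left idle\n", free_nodes);
          return 0;
      }

    Running this, the three smaller jobs start immediately and the 12-node job waits; without reservations or backfill, a steady stream of small jobs could postpone it indefinitely, which is exactly the starvation problem schedulers like Maui address.
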
  • ASCI-class supercomputers: Achieving the Teraflop.

    Mid-to-late 1990s: a Department of Energy program to push toward a teraflop system.

  • Intel Cluster. ASCI Red, Sandia National Labs (SNL). The world's first teraflop system. A Linux cluster, but different from most: its OS is based on Linux but was customized significantly, stripping out anything that was not needed so that as much physical memory as possible is available for user jobs.
  • SGI Origin 2000. ASCI Blue Mountain, Los Alamos National Labs. 48 128-way SMP systems.
  • IBM SP. ASCI Blue Pacific, Lawrence Livermore National Labs (LLNL).
  • Into the "New Millennium"

  • IBM was selected to build ASCI White, at LLNL, the next-generation ASCI platform. A smaller version is Blue Horizon at the San Diego Supercomputer Center (SDSC), available to a wider group. These are larger clusters built from larger SMPs. Single-node processor utilization is important, but may be difficult to achieve.
  • CPlant at SNL takes the cluster idea to a new level. The cluster is designed to evolve over time, by adding new (faster, cheaper) nodes and pruning off obsolete parts. This "pruning" can be done without reconfiguring the rest of the cluster.
  • Number one on the Top 500 Supercomputers for a while was Terascale at Pittsburgh Supercomputing Center. Recently nicknamed "LeMieux." Built from Alpha processors.
  • What's up right now

  • Cray SV2, the next-generation vector supercomputer. Both the MTA and the SV2 rely heavily on compiler technology to extract parallelism from high-level programs.
  • TeraGrid: a massive, distributed Linux cluster. The analogy is with a power grid - you plug into the wall, expect to get whatever electricity you need, and are charged for what you use. The idea is to provide a "computational grid" where you can submit your computation, get CPU cycles "out there somewhere," and get your answer back.

    Entirely new issues arise here - the resource is no longer housed in one building, in one place, under one administrative entity. We need to worry about scheduling on a larger scale and about user authentication, and using computers at multiple sites simultaneously involves slow wide-area network links. How do we get the data to the computers, and where do we store the results? (A small, hypothetical cost-model sketch follows this list.)

  • IBM Blue Gene, also Blue Gene/L at LLNL. An even larger SP system.
  • The Top 500 Supercomputers list for November 2002 was announced last week.
  • At positions 2 and 3, LANL's ASCI Q - two HP AlphaServer-based systems, in fact - now up to 7 teraflops.
  • Number 1 is Japan's NEC Earth Simulator.
  • What is there "inside the fence"? Certainly some government or military agencies have more powerful versions of these machines, but they are not public.
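
    To make the grid issues above a little more concrete, here is a small, hypothetical sketch of one grid-level decision: picking a site for a job when the input data must first cross a slow wide-area link. The site names, bandwidths, queue waits, and sizes are invented, and no real grid scheduler is being described - the point is only that data movement can dominate the choice of where to run.

      /* grid_pick.c - a hypothetical sketch of one grid scheduling
       * concern: choosing a site for a job when the input data must
       * first cross a slow wide-area link.  All names and numbers are
       * invented for illustration. */
      #include <stdio.h>

      struct site {
          const char *name;
          double queue_wait_s;   /* expected wait in that site's queue      */
          double gflops;         /* compute rate available to this job      */
          double wan_mbytes_s;   /* WAN bandwidth from where the data lives */
      };

      int main(void)
      {
          double work_gflop = 5.0e6;      /* total floating-point work */
          double input_mbytes = 200000.0; /* 200 GB of input to ship   */

          struct site sites[] = {
              { "site-a", 3600.0, 500.0, 5.0 },   /* fast WAN, long queue  */
              { "site-b",  600.0, 200.0, 1.0 },   /* short queue, slow WAN */
              { "site-c", 1800.0, 800.0, 2.5 },
          };
          int nsites = sizeof(sites) / sizeof(sites[0]);
          int best = 0;
          double best_t = 0.0;

          for (int i = 0; i < nsites; i++) {
              /* crude completion-time model: queue wait + data transfer
               * over the WAN + compute time at the remote site */
              double t = sites[i].queue_wait_s
                       + input_mbytes / sites[i].wan_mbytes_s
                       + work_gflop / sites[i].gflops;
              printf("%s: %.0f s estimated\n", sites[i].name, t);
              if (i == 0 || t < best_t) {
                  best = i;
                  best_t = t;
              }
          }
          printf("submit to %s\n", sites[best].name);
          return 0;
      }

    With these made-up numbers, the site with the longest queue still wins because its wide-area link moves the 200 GB of input far faster - a reminder that "where is the data?" matters as much as "where are the idle CPUs?"
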
  • Coming Soon

  • The contract for the first 100-teraflop system was recently awarded to IBM, which will build ASCI Purple.
  • Something Different

  • Tera/Cray MTA. Only one MTA-1 is in production, at SDSC. The OS is MTX, a fully distributed Unix variant. The system is thoroughly multithreaded: each "processor" actually consists of a number of streams, each of which is fed the code and data to do some part of the computation. A sufficiently multithreaded application can keep most of these streams busy, meaning that the penalty we usually pay for memory latency is gone. In fact, the system has no traditional memory cache, since it can rely on this multithreading to mask memory latency. The MTA-2 has since arrived. Its architecture is similar to the MTA-1's, but it is built entirely from CMOS instead of GaAs.

    A little more from a Cray developer:

    The MTA-2 is quite different. The instruction set is the same as in the MTA-1. But the MTA-2 is built from all-CMOS parts. It also has a completely different mechanical structure, and a completely different network structure. It can also handle more memory, I believe. Processor boards come in groups of 16 called "cages." Each processor board (we call them system boards) has one CPU, two memory controllers, one IOP. Those four entities are called "resources." Each resource is attached to a network node. The network nodes are connected to each other in a ring on the system board, and to other boards with various kinds of inter-board connections.

    The MTA system is fully scannable, i.e., with few exceptions, every flop in the machine can be written and read by diagnostic control programs. This is how we boot the machine - essentially we write the state we want into the machine, and then say "go." When we bring the machine down we read the state out and can get diagnostic information that way. We use the scan system heavily - for booting, part testing, development, all kinds of things.

  • HTMT (Hybrid Technology Multithreaded Architecture) petaflop machine, 2007? NASA JPL among others. Yes, that's petaflop: one million billion floating-point operations per second. Highlights: superconducting 100 GHz processors, running at 4 kelvin, with smart in-processor memories. Even some physicists think 4 K is kind of chilly. An SRAM section cooled by liquid nitrogen. Optical packet switching - the "data vortex." Massive storage - a high-density holographic storage system.
  • Issues:

  • Breakdown of SMP scalability?
  • Job scheduling; partitioning the computer among running jobs
  • Wire length is a key factor - longer wires mean it takes longer for information to arrive. Even on the bullpen cluster, the gigabit network is limited to short cables (less than 2 feet). Locality, locality, locality. (A small single-node demonstration follows this list.)
  • Disks? High performance filesystems? Hierarchical filesystems.
  • Memory hierarchies can be extreme - consider the HTMT.
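
    As a small single-node illustration of the locality point above, the sketch below (in C) sums the same matrix twice: once with a cache-friendly row-major traversal and once column-major. The matrix size is arbitrary; only the relative timings matter.

      /* locality.c - demonstrate why locality matters even on one node:
       * the same sum computed with a cache-friendly (row-major) and a
       * cache-hostile (column-major) traversal.  N is arbitrary. */
      #include <stdio.h>
      #include <time.h>

      #define N 2048

      static double a[N][N];

      static void time_sum(int by_rows)
      {
          double sum = 0.0;
          clock_t start = clock();
          for (int i = 0; i < N; i++)
              for (int j = 0; j < N; j++)
                  sum += by_rows ? a[i][j] : a[j][i];
          printf("%-13s %.2f s (sum=%g)\n",
                 by_rows ? "row-major:" : "column-major:",
                 (double)(clock() - start) / CLOCKS_PER_SEC, sum);
      }

      int main(void)
      {
          for (int i = 0; i < N; i++)
              for (int j = 0; j < N; j++)
                  a[i][j] = 1.0;

          time_sum(1);   /* consecutive addresses: good spatial locality */
          time_sum(0);   /* stride of N doubles: poor spatial locality   */
          return 0;
      }

    The column-major pass touches memory with a stride of N doubles, so nearly every access misses the cache; the same effect, scaled up, is why locality dominates the design of the interconnects and memory hierarchies above.
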
  • Course Evaluations