Lecture 20 - Internet/Grid Computing, Parallel Architectures


Agenda

Announcements

Internet/Grid Computing

See the Condor and Globus notes from the previous lecture.

We will hear about another approach to internet computing, the actor model, in one of the final project presentations.

Parallel Architectures

As a start of our discussion of architecture-aware parallel programming, let's consider the kinds of computers where we might want to run parallel programs.

The bullpen cluster falls into the "heterogeneous cluster" category.

Historical Systems
  • Connection Machine, by Thinking Machines. Long, long gone. The CM-2 had 64K processors connected in a mesh network, with 2 GB of memory and 10 GB of disk typically. Massive for the time (1990). The CM-5 had some of the best LED displays.
  • Cray, vector supercomputing. SPMD model, typically. Cray was bought by SGI and later sold to Tera, which renamed itself Cray.
  • Intel Paragon. Paragon was another mesh-interconnected system. The processors were partitioned among jobs. Intel has since gotten out of the supercomputer business.
  • Typical current systems
  • IBM SP. Basically a cluster with high-end RS6000/PowerPC processors and a high-speed switch interconnect.
  • SGI Origin 3000, and its predecessor, the SGI Origin 2000. CC-NUMA and NUMA-Flex - non-uniform memory access, but all memory is shared among all processors.
  • PC Cluster. Try this at home. Many fall into the category of Beowulf Clusters.
  • Workstation Cluster. Bullpen is an example.
  • Larger Cluster-based systems
  • Chiba City at Argonne.
  • Some issues for the clustered systems:

  • assignment of nodes to jobs - want to keep as many nodes as possible busy computing, but don't want to let small jobs (few processors) starve larger jobs (many processors)
  • some jobs may require special resources available only on some nodes - for example, in the bullpen cluster, some nodes have four processors while others have two, some nodes have more memory than others, and some nodes have a gigabit interconnect while others don't
  • a scheduler might require job submissions to include a maximum running time so that it can schedule more intelligently
  • scheduling is likely non-preemptive, but we can consider an SJF (shortest-job-first) approach - see the sketch after this list
  • these issues are addressed by the Maui Scheduler
  • are compute nodes granted exclusively to one job, or are they shared?
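
    To make the SJF idea above concrete, here is a minimal sketch in C, assuming hypothetical job fields (nodes_needed, max_minutes); it is not the Maui Scheduler's actual algorithm. Among the queued jobs whose node requests fit the currently free nodes, it picks the one with the smallest user-supplied maximum running time.

      /* Non-preemptive shortest-job-first selection (illustrative sketch only). */
      #include <stdio.h>

      struct job {
          const char *name;
          int nodes_needed;    /* nodes (or processors) requested */
          int max_minutes;     /* user-supplied maximum running time */
      };

      /* Return the index of the queued job that fits in the free nodes and has
         the shortest estimated running time, or -1 if nothing fits. */
      int pick_next(const struct job *queue, int njobs, int free_nodes)
      {
          int best = -1;
          for (int i = 0; i < njobs; i++) {
              if (queue[i].nodes_needed <= free_nodes &&
                  (best < 0 || queue[i].max_minutes < queue[best].max_minutes))
                  best = i;
          }
          return best;
      }

      int main(void)
      {
          struct job queue[] = {
              { "large-sim",  16, 720 },
              { "small-test",  2,  10 },
              { "medium-run",  8, 120 },
          };
          int chosen = pick_next(queue, 3, 8);   /* suppose only 8 nodes are free */
          if (chosen >= 0)
              printf("schedule %s\n", queue[chosen].name);   /* "small-test" */
          return 0;
      }

    Picking the shortest job that fits keeps nodes busy, but on its own it can starve large jobs; production schedulers add mechanisms such as reservations and backfill to avoid that.
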
  • ASCI-class supercomputers: Teraflop Computing

    Mid-to-late 1990s - a Department of Energy program to push toward a teraflop system.

  • Intel Cluster. ASCI Red, Sandia National Labs (SNL). The world's first teraflop system. A Linux cluster, but different from most: its OS is based on Linux but was customized significantly, stripping out anything that was not needed so that as much physical memory as possible was available for user jobs.
  • SGI Origin 2000. ASCI Blue Mountain, Los Alamos National Laboratory (LANL). 48 128-way SMP systems.
  • IBM SP. ASCI Blue Pacific, Lawrence Livermore National Laboratory (LLNL).
  • The Latest
  • IBM was selected to build ASCI White at LLNL, the next-generation ASCI platform. A smaller version is Blue Horizon at the San Diego Supercomputer Center (SDSC), available to a wider group. These are larger clusters built using larger SMPs. Single-node processor utilization is important, but may be difficult to achieve.
  • CPlant at SNL takes the cluster idea to a new level. The cluster is designed to evolve over time, by adding new (faster, cheaper) nodes and pruning off obsolete parts. This "pruning" can be done without reconfiguring the rest of the cluster.
  • Number one on the Top 500 Supercomputers for a while was Terascale at Pittsburgh Supercomputing Center. Recently nicknamed "LeMieux." Built from Alpha processors.
  • Some Current Trends
  • Cray SV2, the next-generation vector supercomputer. Both the MTA and the SV2 rely heavily on compiler technology to extract parallelism from high-level programs.
  • TeraGrid: massive, distributed Linux cluster. The "computational grid" idea.

    We talked about some of the new issues that arise here - it is no longer a resource housed in one building, in one place, under one administrative entity. We need to worry about scheduling on a larger scale and about user authentication. Using computers at multiple sites simultaneously will involve slow wide-area network links. How do we get the data to the computers, and where do we store the results?

  • IBM Blue Gene, also Blue Gene/L at LLNL. An even larger SP system.
  • The Top 500 Supercomputers list of November 2002 is the latest ranking (the next one is due out in June).
  • At positions 2 and 3 is LANL's ASCI Q (see the ASCI Q specs). This is a large cluster of HP/Compaq AlphaServers (two of them, in fact). Now up to 7 teraflops.
  • Number 1 is Japan's NEC Earth Simulator.
  • What is there "inside the fence?" Certainly some government or military agencies have more powerful versions of these computers available, but these are not public.
  • Coming Soon

  • The contract for the first 100+ teraflop system was recently awarded to IBM, which will build ASCI Purple.
  • Something Different

  • Tera/Cray MTA. Only one MTA-1 is in production, at SDSC. Its OS is MTX, a fully distributed Unix variant. The system is thoroughly multithreaded: each "processor" actually consists of a number of streams, each of which is fed the code and data to do some part of the computation. A sufficiently multithreaded application can keep most of these streams busy, meaning that the penalty we usually pay for memory latency is gone. In fact, the system has no traditional memory cache, since it can rely on this multithreading to mask memory latency (a sketch of this style of code appears below). The MTA-2 has since arrived. Its architecture is similar to that of the MTA-1, but it is built entirely from CMOS rather than GaAs.

    A little more from a Cray developer:

    The MTA-2 is quite different. The instruction set is the same as in the MTA-1. But the MTA-2 is built from all-CMOS parts. It also has a completely different mechanical structure, and a completely different network structure. It can also handle more memory, I believe. Processor boards come in groups of 16 called "cages." Each processor board (we call them system boards) has one CPU, two memory controllers, one IOP. Those four entities are called "resources." Each resource is attached to a network node. The network nodes are connected to each other in a ring on the system board, and to other boards with various kinds of inter-board connections.

    The MTA system is fully scannable, i.e., with few exceptions, every flop in the machine can be written and read by diagnostic control programs. This is how we boot the machine - essentially we write the state we want into the machine, and then say "go." When we bring the machine down we read the state out and can get diagnostic information that way. We use the scan system heavily - for booting, part testing, development, all kinds of things.
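
    To illustrate the latency-hiding idea in conventional terms, here is a minimal pthreads sketch (ordinary C, not MTA code): a memory-bound array sum is split across many independent threads, roughly the style of program that keeps the MTA's streams busy. The thread count and array size are arbitrary choices for illustration.

      /* Split a memory-bound reduction across many independent threads,
         an illustrative sketch of the "many concurrent streams" style. */
      #include <pthread.h>
      #include <stdio.h>

      #define N        (1 << 22)       /* 4M doubles, chosen arbitrarily */
      #define NTHREADS 16

      static double data[N];

      struct slice { int start, end; double sum; };

      static void *sum_slice(void *arg)
      {
          struct slice *s = arg;
          double total = 0.0;
          for (int i = s->start; i < s->end; i++)
              total += data[i];        /* memory-bound inner loop */
          s->sum = total;
          return NULL;
      }

      int main(void)
      {
          pthread_t tid[NTHREADS];
          struct slice slices[NTHREADS];
          int chunk = N / NTHREADS;

          for (int i = 0; i < N; i++)
              data[i] = 1.0;

          for (int t = 0; t < NTHREADS; t++) {
              slices[t].start = t * chunk;
              slices[t].end   = (t == NTHREADS - 1) ? N : (t + 1) * chunk;
              pthread_create(&tid[t], NULL, sum_slice, &slices[t]);
          }

          double total = 0.0;
          for (int t = 0; t < NTHREADS; t++) {
              pthread_join(tid[t], NULL);
              total += slices[t].sum;
          }
          printf("sum = %.1f\n", total);   /* expect 4194304.0 */
          return 0;
      }

    On the MTA, the hardware switches among streams so that stalls on memory references are overlapped by other streams' work rather than hidden by a cache, which is why exposing plenty of independent work, as above, is what matters.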

  • HTMT (Hybrid Technology Multithreaded Architecture). A petaflop machine by 2007? NASA JPL, among others. Yes, that's petaflop: one million billion (10^15) floating point operations per second. Highlights: superconducting 100 GHz processors, running at 4 degrees Kelvin, with smart in-processor memories. Even some physicists think 4 K is kind of chilly. The SRAM section is cooled by liquid nitrogen. Optical packet switching - the "data vortex." Massive storage - a high-density holographic storage system.