Lecture 19 - Internet/Grid Computing


Agenda

Announcements

Clarification of Analysis of Floyd's Algorithm

The notes from the previous lecture have been corrected and clarified regarding the analysis of the communication costs of the parallel implementation of Floyd's Algorithm.

Internet/Grid Computing

The evolution of the Internet has led to a number of approaches to make use of networked computing resources for parallel computation.

A number of projects have worked to make use of Internet-connected computing resources.

Many of these approaches fall under the category of grid computing, sometimes also called metacomputing.

The name "grid computing" can be misleading - it does not refer to a collection of processors connected by a grid-like network. The name comes from an analogy with the power grid. You "plug in" to the Internet and get your computational needs satisfied from available resources.

One view of the goal of grid computing is that users should see one integrated, dependable, global computing resource.

Some of the issues to consider:

Both producers and consumers of grid resources join in for economic reasons:

Example application: visualization of the output of an electron microscope:

NSF is supporting a project called the Distributed Terascale Facility (DTF), which intends to build the TeraGrid, capable of 11.6 trillion calculations per second, with components located at NCSA, SDSC, ANL, and Caltech.

Some projects:

SETI@Home

This project (http://setiathome.ssl.berkeley.edu) uses otherwise idle personal computers to analyze the massive volume of radio telescope data collected by the Search for Extraterrestrial Intelligence (SETI) project.

Their idea was to break up the data into chunks to be farmed out to computers that are volunteered for use by their owners.

The program runs as a screen saver, so it uses the volunteer's computing resources only when they are not otherwise in use.

The computation is a good candidate for this approach, since it is embarrassingly parallel. The data is broken into chunks of about 250K, which are sent off to participating computers. One of these "work units" takes several hours to a few days of computing time, involving a few trillion mathematical operations.
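As a rough illustration of the work-unit idea, a server-side routine might split the recorded data into fixed-size chunks like the sketch below. This is only a hypothetical sketch; the chunk-size constant, function name, and file layout are invented for illustration and are not part of the SETI@Home code.

    # Hypothetical sketch of server-side work-unit creation (not the actual
    # SETI@Home code): split a large recorded data file into fixed-size
    # chunks that can be handed out to volunteer machines.
    WORK_UNIT_SIZE = 250 * 1024   # roughly the ~250K chunks described above

    def make_work_units(path):
        """Yield (unit_id, data) pairs read sequentially from the data file."""
        unit_id = 0
        with open(path, "rb") as f:
            while True:
                chunk = f.read(WORK_UNIT_SIZE)
                if not chunk:
                    break
                yield unit_id, chunk
                unit_id += 1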

Some concerns here:

The project addresses these concerns by:

One estimate is that 500,000 computers participate, providing 1000 CPU-years per day!

Twin Primes

A similar approach was used by the Twin Primes project: http://www.cs.rpi.edu/research/twinp

This project was an effort to break the record for the largest known pair of "twin primes" - a pair of prime numbers whose difference is 2.

Concerns are similar to those of SETI@Home, where compute nodes can come and go. The computer running the server could also crash, and the amount of work lost in that case needs to be minimized.

A worker process gets a range of 10 billion numbers to check for the existence of twin primes and reports its results back.

In addition to volunteers who run the process on their own systems, this project included the use of a large number of workstations at RPI. However, the only way to get permission to run on these systems was to promise not to interfere with their normal work. A system was developed called "SCATTERS" (Simple Cool Admin Tool To Everywhere Run Something) that starts up a worker when the system is idle, but kills it when other activity is detected. This allowed the use of 247 systems.

The actual computation uses the Sieve of Eratosthenes.
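A toy version of the worker's computation is sketched below (in Python, purely for illustration). It sieves all numbers up to n and scans for pairs (p, p+2); the real worker instead sieves a window of about 10 billion numbers at much larger magnitudes, which calls for a segmented sieve and more careful arithmetic.

    def twin_primes_up_to(n):
        """Return all twin-prime pairs (p, p+2) with p + 2 <= n, found with a
        simple Sieve of Eratosthenes over [0, n]."""
        is_prime = [True] * (n + 1)
        is_prime[0] = is_prime[1] = False
        for i in range(2, int(n ** 0.5) + 1):
            if is_prime[i]:
                for j in range(i * i, n + 1, i):
                    is_prime[j] = False
        return [(p, p + 2) for p in range(2, n - 1) if is_prime[p] and is_prime[p + 2]]

    print(twin_primes_up_to(100))
    # [(3, 5), (5, 7), (11, 13), (17, 19), (29, 31), (41, 43), (59, 61), (71, 73)]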

The project continued for 2 years, and broke the record for the largest known pair of twin primes, which were over 10^16.

Both SETI@Home and the Twin Primes project use a client-server model. The server farms out chunks of work to willing clients. It is up to the client to decide when to request work. We can think of this as a more distributed version of the bag-of-tasks idea.
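The pattern can be sketched as below, using threads within one process as stand-ins for remote clients. Everything here (the task contents, the squaring "computation", the names) is illustrative rather than taken from either project; the point is that each client pulls work from the shared bag whenever it is ready for more.

    # Minimal bag-of-tasks sketch: threads play the role of remote clients.
    import queue
    import threading

    tasks = queue.Queue()
    results = queue.Queue()

    for unit_id in range(20):              # the server's bag of work units
        tasks.put(unit_id)

    def client(name):
        while True:
            try:
                unit = tasks.get_nowait()  # the client decides when to ask for work
            except queue.Empty:
                return                     # bag is empty: this client is done
            results.put((name, unit, unit * unit))   # stand-in "computation"

    workers = [threading.Thread(target=client, args=("client%d" % i,)) for i in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(results.qsize(), "results collected")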

Condor

Condor (http://www.cs.wisc.edu/condor) has been developed since the late 1980's as a way to make use of available CPU cycles on idle workstations.

Target: high-throughput computing.

The system is intended to run general Unix processes on available resources by "scavenging" cycles.

It has been expanded to be able to run parallel (MPI, PVM) jobs as well.

Services needed:

Condor is intended to use both dedicated and non-dedicated resources.

Condor runs jobs even if some machines:

Mechanisms:

A key feature of this system is the ability to do process migration, using the checkpointing feature.

To stop a job running on one system and start it up on another from where it left off, it needs to remember everything about the current state of the process, remove itself from that system, then move to another system, start up, and restore its state.

Condor's checkpointing feature:

Saving the state of a process:

The rest of the details are omitted here.
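The flavor of checkpoint/restart can be seen in the application-level sketch below, where a program periodically writes its own state to disk and resumes from it when restarted on another machine. This is only an illustration of the idea; Condor's actual mechanism saves the process state transparently, without the application managing its own checkpoint file, and the file name and state layout here are invented.

    # Application-level checkpoint/restart sketch (not Condor's mechanism):
    # the program periodically saves its own state and resumes from it.
    import os
    import pickle

    CHECKPOINT = "state.ckpt"              # hypothetical checkpoint file name

    def load_state():
        """Resume from the last checkpoint if one exists, else start fresh."""
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT, "rb") as f:
                return pickle.load(f)
        return {"i": 0, "total": 0}

    def save_state(state):
        """Write the current state; a real system would write-then-rename."""
        with open(CHECKPOINT, "wb") as f:
            pickle.dump(state, f)

    state = load_state()
    while state["i"] < 1000000:            # stand-in for a long-running computation
        state["total"] += state["i"]
        state["i"] += 1
        if state["i"] % 100000 == 0:       # checkpoint periodically
            save_state(state)
    print(state["total"])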

How should processes be scheduled in a Condor "flock"?

Given the ability to migrate processes, we can do opportunistic scheduling - assign a process to a node even if we're pretty sure it will not have a chance to execute to completion before the node becomes unavailable.

But for things like MPI jobs, we do not want to allow processes to migrate, and we do not want one of our processes to stop while the others continue (since they will not continue for long). So Condor also allows dedicated scheduling on nodes that advertise that capability. Checkpointing parallel jobs is hard.

Most dedicated schedulers require jobs to specify a maximum running time so that shorter/smaller jobs can be backfilled. Condor can fill in any "holes" in the schedule with regular (preemptable) jobs without delaying dedicated jobs, since those preemptable jobs can simply be preempted.
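As a small illustration of the "holes" idea (all names hypothetical, not Condor's scheduler), the sketch below computes the idle gaps between the reserved time windows of dedicated jobs on one node; preemptable jobs can run in those gaps and are simply preempted when a reservation begins.

    def idle_holes(reservations, horizon):
        """reservations: non-overlapping (start, end) windows reserved for
        dedicated jobs.  Returns the idle (start, end) gaps up to horizon,
        where preemptable jobs can be placed."""
        holes, now = [], 0
        for start, end in sorted(reservations):
            if start > now:
                holes.append((now, start))
            now = max(now, end)
        if now < horizon:
            holes.append((now, horizon))
        return holes

    print(idle_holes([(2, 5), (8, 10)], horizon=12))
    # [(0, 2), (5, 8), (10, 12)]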

Globus Toolkit

The Globus Toolkit (http://www.globus.org/) is part of the Globus project that is working to support grid computing.