Lecture 22 - Distributed Systems
"I think ultimately success is good. Failure not so good...uh..."
- Words of Wisdom from Leonardo DiCaprio in an
interview with Barbara Walters
- Announcements
- Distributed File Systems
- NFS
- Caching and Replication
- AFS
- Parallel File Systems
Last time, we listed and briefly discussed several kinds of
transparency that might be desirable in a distributed system. Now we
will look in more detail at network file systems.
Consider a small network of four systems: a file server, a "big"
workstation, a regular workstation, and a diskless workstation. Files
on some of these systems' disks may be shared. Perhaps the user files
for all four systems reside on the server. A copy of the OS resides
on the big workstation and the regular workstation, with extra space
on the big workstation used for some special software. The
diskless workstation needs everything, including its OS, from the
server. A distributed file system (DFS) manages a
collection of such storage devices.
Terminology:
For a DFS, the client interface most likely includes a set of file
operations much like those available for accessing local disks (create,
delete, read, write).
DFS Issues:
Naming Schemes
We share files in our lab using NFS (the Network File System),
originally developed by Sun and now available on many systems.
- NFS provides a means to share file systems (or parts of file systems) among
clients transparently
- a remote directory is mounted over a local file system
directory, just as we mount local disk partitions into the directory
hierarchy
- mounting requires knowledge of the physical location of the files -
both the hostname and the local path on the server
- usage, however, is transparent - a mounted NFS file system works just like
any file system mounted from a local disk
- the mount mechanism is separate from the file service mechanism; each uses
RPC (remote procedure call)
- interoperable - can work across different machine architectures,
OSes, and networks
- the mount mechanism requires that the client be authorized to connect
(see /etc/exports in most Unix variants, /etc/dfs/dfstab
in Solaris) - the mountd process services mount requests
- when a mount is completed, a file handle (essentially just
the inode of the remote directory) is returned that the client OS can
use to translate local file requests into NFS file service requests -
the nfsd process services file access requests (see the sketch after
this list)
- NFS fits in at the virtual file system (VFS) layer in the OS -
instead of translating a path to a particular local partition and file
system type, requests are converted to NFS requests
- NFS servers are stateless - each request has to provide
all necessary information; the server does not maintain information
about client connections
- there are two main ways for clients to know what to request and from
where: entries in the file system table (/etc/fstab or /etc/vfstab), or
automount tables (see /etc/auto_* on bullpen, for example). fstab
entries are mounted when the system comes up; automount entries are
mounted on demand and are unmounted when no longer active
- many NFS implementations include an extra client-side process,
nfsiod, that acts as an intermediary, allowing things like
read-ahead and delayed writes. This improves performance, though the
delayed writes add some danger (see below).
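To make the file handle and statelessness ideas concrete, here is a
minimal client-side sketch in Python. It is only an illustrative
model, not the real protocol: the server name, export path, and
NFSClient class are invented, but it shows how a mount associates a
local directory with a (server, file handle) pair, and how each read
request then carries everything the server needs (handle, offset, count).

    # Illustrative model of the client-side view of an NFS mount (not real NFS
    # code).  A "file handle" here is just an opaque token standing in for the
    # remote directory's inode; the server and export names are hypothetical.

    class NFSClient:
        def __init__(self):
            self.mounts = {}    # local mount point -> (server, root file handle)

        def mount(self, server, export, mount_point):
            # In real NFS this is an RPC to the server's mountd, which checks
            # its export list and returns a handle for the exported directory.
            root_handle = ("fh", server, export)          # opaque token
            self.mounts[mount_point] = (server, root_handle)

        def read(self, path, offset, count):
            # Translate a local path into an NFS-style request.  Because the
            # server is stateless, the request carries the handle, offset, and
            # count - there is no server-side "open" file or current position.
            for mount_point, (server, handle) in self.mounts.items():
                if path.startswith(mount_point):
                    relative = path[len(mount_point):]
                    request = {"op": "READ", "handle": handle, "name": relative,
                               "offset": offset, "count": count}
                    return server, request
            raise FileNotFoundError(path + " is not on an NFS mount")

    client = NFSClient()
    client.mount("fileserver", "/export/home", "/home")
    print(client.read("/home/user/notes.txt", offset=8192, count=4096))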
Caching is important - the client can use its regular buffer cache,
and could use client disk space as well. The server will automatically
use its buffer cache for NFS requests as well as for any local
requests.
Cache on disk vs. main memory:
Advantages of disk caches:
- reliability (non-volatile)
- contents can remain across reboots
- can be larger
Advantages of memory caches:
- can be used for diskless workstations
- faster
- we already have a memory cache for local access, and putting a
remote file cache there allows reuse of that mechanism
Cache Update Policy - mostly the same issues we have seen in other
contexts:
- Write-through - write data through to disk as soon as the
write call is made - reliable, but performance is poor
- Delayed-write - write to the cache now, to the server "later"
- much faster writes, but dangerous! (see the sketch after this list)
  - Advantage: some data (temp files, for example) may be
overwritten or removed before ever being written to the disk at
all.
  - Danger: unwritten data may be lost if a client machine
crashes
  - Can have the system scan the cache regularly to flush modified
blocks
- Write-on-close: flush all of a file's cache blocks when it is
closed
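These trade-offs can be seen in a small sketch. This is a simplified
model, not NFS code - server_write and the three-second flush interval
are assumptions - but it shows a delayed-write cache that collects
dirty blocks in memory, flushes a file's blocks when it is closed
(write-on-close), and also flushes periodically to limit how much a
crash could lose.

    # Simplified model of a delayed-write client cache with write-on-close.
    # server_write stands in for whatever actually pushes a block to the server.

    import time

    class DelayedWriteCache:
        def __init__(self, server_write, flush_interval=3.0):
            self.server_write = server_write      # callback: (file, block_no, data)
            self.flush_interval = flush_interval  # flush "reasonably" quickly
            self.dirty = {}                       # (file, block_no) -> data
            self.last_flush = time.time()

        def write(self, file, block_no, data):
            # Delayed write: update the cache now, send to the server later.
            # If the block is overwritten (or the file removed) before the
            # next flush, the earlier data never crosses the network at all.
            self.dirty[(file, block_no)] = data
            self.maybe_flush()

        def close(self, file):
            # Write-on-close: push all of this file's dirty blocks to the server.
            for key in [k for k in self.dirty if k[0] == file]:
                self.server_write(*key, self.dirty.pop(key))

        def maybe_flush(self):
            # Periodic scan: bounds how much unwritten data a client crash loses.
            if time.time() - self.last_flush >= self.flush_interval:
                for key in list(self.dirty):
                    self.server_write(*key, self.dirty.pop(key))
                self.last_flush = time.time()

    cache = DelayedWriteCache(lambda f, b, d: print("flushing", f, b))
    cache.write("temp.txt", 0, b"scratch data")
    cache.close("temp.txt")   # flushes now, even before the timer expires

With a write-through policy, write() would simply call server_write
right away instead of recording a dirty block.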
Cache Consistency
We need to know whether the copy of a disk block in our cache is
consistent with the master copy.
- client-initiated: a client that wants to reuse a block checks with
the server
- server-initiated: the server notifies clients of any changes made by
other processes
For a stateless system like NFS, the server has no knowledge of which
clients are connected, let alone what is in their caches. Therefore, any
cache consistency protocol must be client-initiated.
Still, it is very difficult to guarantee anything here. Even if the
client can check with the server, it is possible that a dirty block
remains in another client's cache. NFS generally deals with this by
writing modified cache blocks back to disk "reasonably" quickly,
usually within a few seconds. So in practice, problems arise only in
systems with many concurrent writes. In those situations, the user
processes should do some sort of file locking to ensure that the cache
will not lead to inconsistencies.
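A client-initiated check can be sketched as follows. This is a
simplified illustration, not the actual NFS code: get_attributes
stands in for a GETATTR-style request, and the idea is just to compare
the cached copy's modification time against fresh attributes from the
server before reusing the cached data.

    # Simplified client-initiated consistency check: before reusing cached
    # data, ask the server for the file's current attributes and compare
    # modification times.  get_attributes() stands in for a GETATTR-style RPC.

    class ValidatingCache:
        def __init__(self, get_attributes):
            self.get_attributes = get_attributes  # callback: handle -> attributes
            self.entries = {}                     # handle -> {"mtime": ..., "data": ...}

        def read(self, handle, fetch):
            cached = self.entries.get(handle)
            attrs = self.get_attributes(handle)   # check with the server
            if cached is not None and cached["mtime"] == attrs["mtime"]:
                return cached["data"]             # still consistent; reuse it
            # Stale or missing: fetch fresh data and remember the new mtime.
            data = fetch(handle)
            self.entries[handle] = {"mtime": attrs["mtime"], "data": data}
            return data

Real clients also cache the attributes themselves for a few seconds,
so they do not pay a round trip to the server on every single access.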
Stateless vs. Stateful File Service
We said that NFS requests are stateless - each request is
self-contained. There are no open and close operations at the server;
each request effectively reopens the file and reads or writes data at
a specific position in it. This seems inefficient...
A stateful file server would "remember" what requests have been
made, so a request could be something like "read the next block of
this file."
This can increase performance by allowing fewer disk accesses on the
server side, using read-ahead to get blocks into the server's cache
in anticipation of upcoming requests.
Failure recovery:
- a stateful file server loses all of this state if it crashes
- it may need to contact all possible clients to reconstruct the state
- or it could return error conditions to any client requests and
force them to start over
The server will also need a way to decide that a client has gone away - it
is maintaining information about each client, and a client may have
forgotten to close a file, forgotten to unmount a partition, or
simply crashed.
- a stateless server can usually recover seamlessly from a
failure
- if bull and eringer in the lab reboot, your processes may
stop and you might get an "NFS server eringer not responding"
message, but as soon as the servers come back up, the clients can
continue what they were doing
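The difference between the two designs shows up in what a read request
must contain. The sketch below is illustrative only (the class names
and structures are invented): the stateful server keeps an open-file
table with a current position - exactly the state that is lost in a
crash or left behind by a client that forgets to close a file - while
the stateless server needs nothing beyond what is in the request itself.

    # Contrast between a stateful and a stateless read interface (illustrative).

    class StatefulServer:
        def __init__(self):
            # One open file per client connection, for simplicity.  This table
            # IS the server state: it vanishes if the server crashes, and
            # stale entries pile up if a client forgets to close or goes away.
            self.open_files = {}                  # connection id -> current offset

        def open(self, conn, path):
            self.open_files[conn] = 0

        def read_next_block(self, conn, blocksize=4096):
            offset = self.open_files[conn]        # "read the next block of this file"
            self.open_files[conn] = offset + blocksize
            return ("read", offset, blocksize)    # could also read ahead here

    class StatelessServer:
        def read(self, handle, offset, count):
            # Every request is self-contained: there is nothing to reconstruct
            # after a crash, so clients simply retry once the server is back.
            return ("read", handle, offset, count)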
Replication
Similar issues arise when we want to replicate some files to enhance
reliability, efficiency, and availability.
- reliability/availability: if one server goes down, use another that has a
replica
- efficiency: use the closest or least-busy server
- this technique is used by web servers - a request for a file at
www.something.com may be silently redirected to one of a number of
servers, like www2.something.com, www28.something.com, etc.
- main issue: keeping replicas up to date when one or more of them is
changing
- if there is a "master copy" we can use caching ideas - just
treat the replicas like cached copies
- if there is no master, any change must be made to all replicas (see
the sketch after this list)
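A rough sketch of the no-master case, assuming every replica is
reachable: reads go to the least-busy replica, while a change is
applied to all replicas before it is considered complete. The replica
objects and load counter are invented for illustration, and the sketch
ignores failures during an update, which is where the real difficulty lies.

    # Illustrative replica management with no master copy: reads can use any
    # replica (here, the least-busy one); writes must update every replica.

    class ReplicatedFile:
        def __init__(self, replicas):
            self.replicas = replicas               # list of server objects
            self.load = {r: 0 for r in replicas}   # crude per-server load counter

        def read(self, path):
            server = min(self.replicas, key=lambda r: self.load[r])
            self.load[server] += 1
            return server.read(path)               # efficiency: least-busy replica

        def write(self, path, data):
            # No master: apply the change to all replicas so that no later
            # read can return an out-of-date copy.
            for server in self.replicas:
                server.write(path, data)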
The Andrew File System, now known just as AFS, is a
globally distributed file system. It was originally developed at CMU
and later supported by a company called Transarc, which has since
become part of IBM. IBM has released AFS as an open source product.
- we saw the naming convention that includes a site (cell) name in the path
- a file cache on local disks is used - important, since network
latency is now (potentially) that of the Internet
- the system caches entire files locally, not individual disk
blocks
- file permissions are now very important, as many users can
browse - AFS supports more complicated file permissions, including
access control lists (ACLs):
17:50:32 24.cortez:~ -> fs la
Access list for . is
Normal rights:
system:backup l
system:anyuser rl
teresj rlidwka
Permissions are set on directories rather than on individual files, and
can grant read, list, insert, delete, write, lock, and administer
rights (a small sketch decoding these rights strings appears below). See
http://cac.engin.umich.edu/resources/storage/afs/afs_acl.html
for more on AFS access rights.
- files can move among servers in the same cell without their names
changing
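The rights strings in the fs la output above are just combinations of
the seven letters listed. Here is a small sketch (not an AFS tool)
that decodes them; the letter-to-right mapping follows the list of
rights given above.

    # Decode an AFS ACL rights string such as "rlidwka" into the named rights.

    AFS_RIGHTS = {
        "r": "read",
        "l": "list",
        "i": "insert",
        "d": "delete",
        "w": "write",
        "k": "lock",
        "a": "administer",
    }

    def decode_rights(rights):
        return [AFS_RIGHTS[letter] for letter in rights]

    print(decode_rights("rl"))        # system:anyuser above: read and list only
    print(decode_rights("rlidwka"))   # the directory owner's full set of rights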
For more on OpenAFS, see
http://www.openafs.org/
Parallel File Systems
Consider a cluster in which each node has one or more CPUs, local
memory, and a large local disk.
User processes should be able to run on any node and have access to a
common file space (home directory, data sets, etc.).
One option is to have all nodes access shared files from an external
file server; another is to designate each node in the cluster as
holding certain files on its local disk, with the nodes sharing the
files using, e.g., NFS.
The bullpen cluster uses a combination of these approaches. All nodes
have access to the department's file servers, plus files on the front
end node of the system. Each node has a local disk, and these are
shared (NFS automounted) among the nodes.
Potential problems:
- the file server, or the network to the file server, may become a
bottleneck if many nodes are accessing files at once
- if we try to get around this by writing to local disks, we lose
some of the transparency - we need to know which node's disks we
used, etc.
Potential solution: borrow ideas from RAID, and make one big
parallel file system out of the disks attached to cluster nodes!
Just as RAID hides the details of which disk actually stored a given
file, a parallel file system hides which disk and which node stores
a file.
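A minimal sketch of that RAID-style idea, with made-up parameters
(stripe unit size, node names): file blocks are striped round-robin
across the nodes' disks, so the file system, not the user, knows which
node holds which block.

    # Round-robin striping of a file's blocks across the local disks of a
    # cluster.  Illustrative only; the stripe unit and node list are arbitrary.

    BLOCK_SIZE = 65536    # bytes per stripe unit

    def locate_block(file_offset, nodes):
        # Map a byte offset within a file to (node, local block number).
        block_no = file_offset // BLOCK_SIZE
        node = nodes[block_no % len(nodes)]     # spread the load across all disks
        local_block = block_no // len(nodes)    # position on that node's disk
        return node, local_block

    nodes = ["node0", "node1", "node2", "node3"]
    for offset in (0, 65536, 131072, 262144):
        print(offset, "->", locate_block(offset, nodes))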
Issues to consider:
- how do we allocate disk blocks?
- how much shared directory information is needed on all nodes?
- how do we access files on remote disks? cache them locally? migrate
them to the local disk?
- are new files written to the local disk?
- do we use striping to keep all disks active and spread out the load?
- what about fault tolerance? if one node goes down, is the
entire file system unavailable?
- are we making the network an unmanageable bottleneck?
- how much complexity are we adding to the kernel's file system
modules?
- what about concurrent writes? concurrent reads and writes?
- some decisions may depend on expected access patterns -
scientific computing is likely to produce large reads and writes but
rarely a write conflict, whereas a database is likely to
have many small transactions
Examples of systems that do this kind of thing:
- IBM General Parallel File System (GPFS) - in particular, look at the
technical paper. For IBM AIX and Linux.
- Sun Parallel File System. For Solaris.
- The Parallel Virtual File System (PVFS). For Linux.
- Sistina Global File System (GFS). For Linux.