Lecture 18 - More File Systems


Agenda

Announcements

  • CS Colloquium tomorrow: Duane Bailey on "Earning Cache" - 2:35 PM in TCL 202.
  • Directories, continued

    Acyclic-Graph Directories

    The tree model does not allow the same file to appear in more than one directory. We can support this by organizing the directory structure as an acyclic graph.

    Two or more directory entries can point to the same subdirectory or file, but (for now) we restrict it to disallow any directory entry pointing "back up" the directory structure.

    These kinds of directory graphs can be made using links in the Unix world, shortcuts in the Windows world, or aliases in the Mac world.

    We have multiple names for and multiple paths to the same file.

    Unix links can be

  • symbolic or soft link - stores a path to the file (logical) - created with ln -s - the original file is the "real" one; the others just point to it
  • hard link - an actual link to the same file on disk from multiple directories (physical) - created with ln - all hard links are equal

    This allows sharing of files, but introduces complications: what happens when the file is removed from one of the directories? If other references to the file may still exist, can we delete it? With symbolic links, the file simply gets deleted and we have dangling pointers. With hard links, a reference count is maintained, and the actual file is deleted only when all references to it have been removed. The sketch below illustrates the difference.
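
    A minimal sketch of the difference, using the POSIX calls (the file names here are arbitrary):

        #include <stdio.h>
        #include <unistd.h>
        #include <fcntl.h>
        #include <sys/stat.h>

        int main(void) {
            int fd = open("original", O_CREAT | O_WRONLY, 0644);
            write(fd, "hello\n", 6);
            close(fd);

            link("original", "hardlink");     /* physical: another name for the same inode */
            symlink("original", "softlink");  /* logical: stores only the path */

            struct stat st;
            stat("original", &st);
            printf("link count: %d\n", (int) st.st_nlink);  /* 2: original + hardlink */

            unlink("original");               /* remove one name */
            /* the inode survives: "hardlink" still refers to it; "softlink"
               now dangles, since the path it stores no longer exists */
            printf("hardlink: %s\n", open("hardlink", O_RDONLY) >= 0 ? "ok" : "gone");
            printf("softlink: %s\n", open("softlink", O_RDONLY) >= 0 ? "ok" : "dangling");
            return 0;
        }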

    General Graph Directory

    What if we allow links back up the chain?

    Unix directories have this built in - every directory has a special entry .. that indicates the parent directory (in the root directory, .. refers to the root itself), and an entry . that indicates the current directory.

    But they also allow links to be created back "up the chain" of the directory structures, potentially introducing cycles in the directory graph.

    Need to be careful with commands like find that search a directory and its subdirectories for something. The search would be infinite if cycles were followed. Typically, a program like find will not follow symlinks.

    Problematic cycles can be avoided by allowing "up" links only to files, not to directories. Cycle detection could also be run every time a new link is added, if this is a concern. Unix leaves it up to programs to make sure they treat symlinks appropriately.

    BSD 4.3 limits the number of links allowed to be traversed for any given path name to 8 to avoid undetected cycles. This limit is actually 32 in the current version of FreeBSD, and 20 for Solaris.

    Class demo program: morelinks.c
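
    (Not the demo itself, but the flavor of the thing: two symlinks that point at each other form a cycle, and the kernel's traversal limit turns the infinite chase into an ELOOP error.)

        #include <stdio.h>
        #include <errno.h>
        #include <fcntl.h>
        #include <unistd.h>

        int main(void) {
            symlink("loop-a", "loop-b");   /* loop-b -> loop-a */
            symlink("loop-b", "loop-a");   /* loop-a -> loop-b: a cycle */
            if (open("loop-a", O_RDONLY) < 0 && errno == ELOOP)
                printf("gave up after the link-traversal limit: ELOOP\n");
            return 0;
        }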

    Directory Implementation

    In any case, an individual subdirectory will typically contain a list of files. How to store this list?

  • Linear list - list of names, each of which has a pointer to the file's data blocks. This is straightforward, but requires a costly search on large directories.
  • Hash Table - hashed linear list - decrease search time, but more complex to implement.

    Another consideration: case sensitivity of filenames. Recent Windows and MacOS filesystems are case-preserving - filenames remember their case, but searches are case insensitive. Most Unix filesystems are truly case sensitive.
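
    A minimal sketch of the linear-list option (the entry layout here is invented for illustration, not any real on-disk format):

        #include <string.h>
        #include <stddef.h>

        #define NAME_MAX_LEN 255

        struct dir_entry {
            char     name[NAME_MAX_LEN + 1];
            unsigned first_block;   /* where the file's data begins */
        };

        /* linear search - cost grows with the size of the directory,
           which is exactly the drawback noted above */
        struct dir_entry *dir_lookup(struct dir_entry *entries, int n,
                                     const char *name) {
            for (int i = 0; i < n; i++)
                if (strcmp(entries[i].name, name) == 0)
                    return &entries[i];
            return NULL;   /* not found */
        }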

    Disks and Partitions

    A system may have a number of disks, each with one or more partitions. These are logical subdivisions of the physical disk, often created to help better organize data on the disk.

    A partition is where a filesystem gets created (more on that soon). Once we have a filesystem, we need to make it accessible to the world.

    In DOS/Windows, this typically involves assigning a letter to each partition. Then there is a directory hierarchy within each partition.

    In Unix, there are no drive letters, everything is considered to be part of one big hierarchy. One partition forms the root directory (/) and all others are mounted into the structure it defines.

    The places where partitions get mounted are called mount points, and are nothing more than regular directories. When a partition is mounted onto a mount point, the directory's previous contents are hidden and replaced by the contents of the mounted partition.

    When the mount point directory is accessed, the virtual file system layer of the OS notices that the directory has been used as a mount point, and redirects the access to the root of the partition mounted there.

    The partitions mounted can be of any type, but all appear to be part of the same directory structure. The system delivers requests to the appropriate partition, and a filesystem-type-specific set of operations are used to access the actual filesystem.
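
    This dispatch is commonly implemented with a table of function pointers, one per filesystem type. A hypothetical sketch (the names are invented, not any particular kernel's API):

        /* each filesystem type supplies its own implementations */
        struct fs_operations {
            int (*read_block)(void *fs_data, unsigned block, char *buf);
            int (*write_block)(void *fs_data, unsigned block, const char *buf);
            int (*lookup)(void *fs_data, const char *name);
        };

        struct mounted_fs {
            const char           *mount_point;   /* e.g., "/usr" */
            void                 *fs_data;       /* per-filesystem state */
            struct fs_operations *ops;           /* type-specific operations */
        };

        /* the virtual file system layer picks the right table and calls through it */
        int vfs_read_block(struct mounted_fs *m, unsigned block, char *buf) {
            return m->ops->read_block(m->fs_data, block, buf);
        }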

    The list of partitions and their mount points and types are listed in a file system table file. In FreeBSD, it is located in /etc/fstab; in Solaris, it is /etc/vfstab.
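
    For example, a FreeBSD /etc/fstab entry lists the device, mount point, filesystem type, mount options, and dump/fsck pass numbers (the device names below are just representative):

        # Device        Mountpoint   FStype   Options   Dump   Pass#
        /dev/ad0s1a     /            ufs      rw        1      1
        /dev/ad0s1e     /usr         ufs      rw        2      2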

    The partitions mounted may be remote as well as local - more on this next week.

    File System Implementation

    Suppose we have partitioned a disk, and it's time to take those disk blocks that have been reserved for our partition and create an actual file system to hold our files. Several issues need to be considered, such as how we allocate disk blocks within our partitions to files and directories, how we decide what blocks are available, and the complexity and efficiency of the choices we might make at this stage.

    Allocation Methods

    First, we consider how disk blocks are allocated to files.

    Contiguous Allocation

    Each file is allocated a set of contiguous disk blocks

  • similar to contiguous allocation of memory
  • simple - directory entry needs only starting location (block number) and length (number of blocks)
  • supports random access into files - can easily compute and read the block that contains a certain part of the file.
  • can lead to holes (external fragmentation)
  • may be difficult to have a file grow
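
    The random access computation mentioned above is simple arithmetic - a sketch, assuming fixed-size blocks:

        /* contiguous allocation: the disk block holding byte `offset` of a
           file that starts at `start_block` */
        unsigned block_of(unsigned start_block, unsigned offset,
                          unsigned block_size) {
            return start_block + offset / block_size;
        }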

    Extents

    Extents are analogous to segmentation for memory allocation. Files are allocated as a collection of extents, which are contiguous chunks of disk blocks. Each has a starting block and a size.
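
    A sketch of the idea (the layout is invented for illustration): locating a logical block means walking the extent list until the right chunk is found.

        struct extent {
            unsigned start_block;   /* first disk block of this chunk */
            unsigned length;        /* number of contiguous blocks */
        };

        /* find the disk block holding logical block `n` of the file */
        unsigned extent_block(struct extent *ext, int n_extents, unsigned n) {
            for (int i = 0; i < n_extents; i++) {
                if (n < ext[i].length)
                    return ext[i].start_block + n;
                n -= ext[i].length;   /* skip past this chunk */
            }
            return 0;   /* past the end of the file */
        }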

    Linked Allocation

    Each disk block has a pointer to the next disk block in the file as well as some file data.

  • need to reserve part of each data block for a pointer - can make for odd-sized data blocks
  • directory entry requires only starting block
  • easy to append to a file
  • no external fragmentation
  • no random access - have to traverse each block
  • a bad disk block means the entire file from that block on is lost
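
    A sketch of a linked data block, assuming 512-byte blocks and a 2-byte block pointer (hence the odd-sized data area noted above):

        #define BLOCK_SIZE 512

        struct linked_block {
            unsigned short next;    /* next block of the file; 0 = end */
            char data[BLOCK_SIZE - sizeof(unsigned short)];   /* 510 bytes */
        };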

    A variation on this is the File Allocation Table (FAT) used by MS-DOS and pre-NT Windows versions. This gathers the links into one table.

  • get to use the whole disk block for data
  • a bad disk block means only that block is lost
  • unless... the FAT itself goes bad, in which case we have a problem - have backup copies on the disk, then run your favorite rescue program
  • somewhat better random access - only the FAT needs to be traversed; disk blocks are read only for the data stored there
  • each disk block needs a FAT entry - the total number of blocks, and in turn the total size of a partition, is limited by the size of the FAT
  • increased block size means fewer blocks/FAT entries, but more internal fragmentation
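
    Random access then walks the chain in the (cached) table rather than on disk - a sketch, assuming 16-bit FAT entries:

        /* find the disk block holding logical block `n` of a file, given the
           file's starting block from its directory entry; assumes `n` does
           not run past the end-of-chain marker */
        unsigned short fat_block(unsigned short *fat, unsigned short start,
                                 unsigned n) {
            unsigned short block = start;
            while (n-- > 0)
                block = fat[block];   /* follow the chain in the table */
            return block;
        }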

    Indexed Allocation

    Use disk blocks as index blocks that don't hold file data, but hold pointers to the disk blocks that hold file data.

  • directory entry now contains a pointer to the index block
  • each file's index block contains pointers to all of its data blocks
  • random access is similar to FAT
  • a bad data block costs only that block, bad index block could cost the entire file
  • size of a file is limited by the number of pointers a data block can hold - if a block holds 512 bytes, and a pointer to a disk block takes 2 bytes, we are limited to 256-block, or 128 KB files
  • now even small files require two disk blocks (index plus data) - extra disk reads, and potentially wasted space

    We can get around the file size limitation in a few ways:

  • linked indexed allocation - use the last entry in the index block as a pointer to another index block
  • this removes file size limitations
  • random access becomes a bit harder
  • two-level index - the index block points only to other index blocks
  • file size limitation is not as severe - for the example above, files are now addressed by a 256-entry index block, each entry of which points to a 256-entry index block, meaning we can store 65536-block, or 32 MB, files
  • random access is better
  • but all files take at least 3 blocks of space and access time
  • Can add more levels for larger files
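
    For the two-level index above, locating logical block n takes two table lookups - a sketch, assuming 256 entries per index block and a hypothetical read_index_block() helper that reads an index block into memory:

        #define ENTRIES_PER_BLOCK 256

        /* hypothetical helper: read disk block `b` and return its contents
           as an array of block pointers */
        extern unsigned short *read_index_block(unsigned short b);

        unsigned short two_level_block(unsigned short top, unsigned n) {
            unsigned short *outer = read_index_block(top);
            unsigned short *inner = read_index_block(outer[n / ENTRIES_PER_BLOCK]);
            return inner[n % ENTRIES_PER_BLOCK];
        }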

    Unix Inodes

    Many Unix filesystems (Berkeley Fast Filesystem, Linux ext2fs, Sun ufs, ...) take an approach that combines some of the ideas above.

  • each file is indexed by an inode
  • inodes are special disk blocks set aside just for this purpose (see df -i to see how many of these exist on your favorite Unix filesystem)
  • they are created when the filesystem is created
  • the number of inodes limits the total number of files/directories that can be stored in the filesystem
  • the inode itself consists of
  • administrative information (permissions, timestamps, etc.)
  • a number of direct block pointers (typically 12) that point to the first 12 blocks of the file
  • a single indirect pointer that points to a disk block which in turn is used as an index block, if the file is too big to be indexed entirely by the direct blocks
  • a double indirect pointer that points to a disk block which is a collection of pointers to disk blocks which are index blocks, used if the file is too big to be indexed by the direct and single indirect blocks
  • a triple indirect pointer that points to an index block of index blocks of index blocks...
  • interesting reading on your favorite FreeBSD system: /sys/ufs/ufs/dinode.h
  • small files need only the direct blocks, so there is little waste in space or extra disk reads in those cases
  • medium sized files may use indirect blocks
  • only large files make use of (and incur the overhead of) the double or triple indirect blocks, and that is reasonable since those files are large anyway
  • since the disk is now broken into two different types of blocks - inodes and data blocks - there must be some way to determine where the inodes are, and to keep track of free inodes and disk blocks. This is done by a superblock, located at a fixed position in the filesystem. The superblock is usually replicated on the disk to avoid catastrophic failure in case the main superblock is corrupted. A sketch of how a logical block number selects among the direct and indirect pointers follows this list.
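
    A sketch of that selection, assuming 12 direct pointers and 256 pointers per index block as in the earlier examples (classification only; the disk reads are omitted):

        #define NDIRECT   12
        #define NINDIRECT 256   /* pointers per index block */

        /* which inode pointer covers logical block `n` of the file? */
        const char *classify_block(unsigned n) {
            if (n < NDIRECT)
                return "direct";
            n -= NDIRECT;
            if (n < NINDIRECT)
                return "single indirect";
            n -= NINDIRECT;
            if (n < (unsigned) NINDIRECT * NINDIRECT)
                return "double indirect";
            return "triple indirect";
        }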

    Disk Allocation Considerations

  • limitations on file size, total partition size
  • internal, external fragmentation
  • overhead to store and access index blocks
  • layout of files, inodes, directories, etc, as they affect performance - disk head movement, rotational latency - many unix filesystems keep clusters of inodes at a variety of locations throughout the file system, to allow inodes and the disk blocks they reference to be close together
  • may want to reorganize files occasionally to improve layout (see hw7 question)

    Free Space Management

    With any of these methods of allocation, we need some way to keep track of free disk blocks.

    Two main options:

    1. bit vector - keep a vector, one bit per disk block
       • 0 means the corresponding block is free, 1 means it is in use
       • a search for a free block is a search for the first 0 bit, which can be efficient given hardware support
       • the vector is too big to keep in main memory, so it must be on disk, which makes traversal slow
       • with block size 2^12 (4 KB) and disk size 2^33 (8 GB), we need 2^21 bits (256 KB) for the bit vector
       • easy to allocate contiguous space for files
    2. free list - keep a linked list of free blocks
       • with linked allocation, can just use the existing links to form a free list
       • with FAT, use the FAT entries for unallocated blocks to store the free list
       • no wasted space
       • can be difficult to allocate contiguous blocks
       • allocate from the head of the list and add deallocated blocks to the tail - both O(1) operations
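
    A sketch of the bit-vector search (a real implementation would use a hardware find-first-zero instruction where available; assumes the block count is a multiple of 8):

        #include <limits.h>

        /* find the first free block, one bit per block, 1 = in use;
           returns -1 if the disk is full */
        int first_free_block(unsigned char *bitmap, int n_blocks) {
            for (int i = 0; i < n_blocks / CHAR_BIT; i++)
                if (bitmap[i] != 0xFF)                     /* some bit here is 0 */
                    for (int b = 0; b < CHAR_BIT; b++)
                        if (!(bitmap[i] & (1 << b)))
                            return i * CHAR_BIT + b;       /* block number */
            return -1;
        }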

    Performance Optimization

    Caching is an important optimization for disk accesses.

    A disk cache may be located:

  • main memory
  • disk controller
  • internal to disk drive

    See Homework 7 for more about a strategy used by many Unix variants to use main memory as a disk cache: the buffer cache.

    Safety and Recovery

    When a disk cache is used, there could be data in memory that has been "written" by programs but has not yet been physically written to the disk. This can cause problems in the event of a system crash or power failure.

    If the system detects this situation, typically on bootup after such a failure, a consistency checker is run. In Unix, this is usually the fsck program, and in Windows, scandisk or some variant. This checks for and repairs, if possible, inconsistencies in the filesystem.

    Journaling Filesystems

    One way to avoid data loss when a filesystem is left in an inconsistent state is to move to a log-structured or journaling filesystem.

  • record updates to the filesystem as transactions
  • transactions are written immediately to a log, though the actual filesystem may not yet be updated
  • transactions in the log are asynchronously applied to the actual filesystem, at which time the transaction is removed from the log
  • if the system crashes, any pending transactions can still be applied to the filesystem - the main benefits are less chance of significant inconsistencies, and that those inconsistencies can be corrected from the unfinished transactions, avoiding the long consistency check
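
    A minimal sketch of this write-ahead discipline (the names are invented; a simple in-memory array stands in for the on-disk log):

        struct txn {
            unsigned block;            /* which disk block the update touches */
            char     new_data[512];    /* the contents to be written */
            int      committed;        /* safely recorded in the log? */
            int      applied;          /* written to the real filesystem? */
        };

        /* 1. record the transaction in the log and force the log to disk
              before touching the filesystem itself */
        void log_append(struct txn *t) {
            t->committed = 1;          /* stand-in for: write + flush the log */
        }

        /* 2. later, asynchronously apply it to the actual filesystem and
              retire it from the log */
        void log_apply(struct txn *t) {
            /* stand-in for: write t->new_data to t->block on disk */
            t->applied = 1;
        }

        /* 3. on reboot after a crash, redo any committed but unapplied
              transactions instead of running a full consistency check */
        void log_recover(struct txn *log, int n) {
            for (int i = 0; i < n; i++)
                if (log[i].committed && !log[i].applied)
                    log_apply(&log[i]);
        }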

    Examples:
  • ReiserFS, a linux journaling filesystem - I recommend reading this page
  • ext3fs, also for linux
  • jfs, IBM journaling filesystem, available for AIX, Linux
  • Related idea in FreeBSD's filesystem: Soft Updates
  • Journaling extensions to Macintosh HFS disks, called Elvis, supposedly coming in OS X 10.2.2
  • NTFS does some journaling, but some claim it is not "fully journaled"
  • the term "journaling" may also refer to systems that maintain the transaction log for a longer time, giving the ability to "undo" changes and retrieve a previous state of a filesystem