
Program Development using Unix - File System Internals

Contents

Structure
Types
Access
Allocation
Contiguous
Linked list
Index
Inodes
Free list
Disk caching

Structure

Files may be organised in several ways. MSDOS and Unix both treat files as byte sequences and impose no internal structure on them. Individual applications may require an internal structure, but the O/S does not.

Many old systems that used to read 80 column cards use an 80 character record structure (eg the Burroughs A9)

Some mainframe systems use a tree organisation for files, where each node is indexed by a key.

Types

There are several types of file under Unix. Regular files are binary or ASCII files, and contain compiled programs, data files, source code, text, etc. These are the ``ordinary'' files.

Directory files are used by the O/S to maintain the hierarchical file system.

Character special files are devices at a fixed location, with volatile contents. Reads from such files are actually reads from the device. No random access is available on such files. Examples are keyboard, mouse etc. In Unix, these are /dev/kbd, /dev/mouse, /dev/audio, etc

Block special files are also devices, but which allow random access. Examples are SCSI disks /dev/sd0, etc, memory /dev/kmem and floppy disks /dev/fd0

Access

Access to files may be sequential, random or both. It depends on the file type and support from the O/S. Most regular files allow both.
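
As a small sketch (the file name is invented for illustration), sequential access just issues one read after another, while random access uses lseek to move to an arbitrary position first:

/* Sketch: sequential versus random access on a regular file.
   The file name "data" is hypothetical. */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    char buf[512];
    int fd = open("data", O_RDONLY);
    if (fd < 0)
        return 1;

    /* Sequential access: each read continues where the last one stopped. */
    read(fd, buf, sizeof(buf));
    read(fd, buf, sizeof(buf));

    /* Random access: move the file offset explicitly, then read. */
    lseek(fd, 4096L, SEEK_SET);
    read(fd, buf, sizeof(buf));

    close(fd);
    return 0;
}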

Allocation

Files can be laid out on the disk in a variety of ways

Contiguous

In a contiguous layout, the entire file is laid out as one sequence of disk blocks. It is easy to keep track of the file, because you only need its starting block and size. This means that space must be pre-allocated for the file before writing can take place.

[Figure: contiguous allocation]

Linked list

In a linked list layout, the file is divided into blocks and part of each block is a pointer to the next block. Keeping track of each file is also easy, because you only need the start of each list. Sequential access is easy, but random access is hard. Because part of each block is taken up by the pointer, the amount of data in a block is no longer a power of two, which makes reads less convenient.
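
A possible layout of such a block, with invented names and sizes, might be:

/* Sketch of a linked-list disk block: part of the block is a pointer
   (here a block number) to the next block, so the data area is not a
   power of two in size.  Names and sizes are illustrative only. */
#define BLOCK_SIZE 1024

struct ll_block {
    long next;                             /* block number of the next block, -1 at the end */
    char data[BLOCK_SIZE - sizeof(long)];  /* remaining space holds file data */
};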

Index

In an index method, a table is maintained with one entry for each block of the disk. The entry for a block holds the index of the next block of the file that uses it. Thus a file starting in block 4 and using blocks 6 and 2 would have table entries

index   link
  4       6
  6       2
  2      end of file

Then you can store the start of the file somewhere else. This makes it easier to find where a file is, because the index table can be stored in fast memory instead of on disk, and actual data can be full blocks.
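
A minimal sketch of reading a file by following its chain through such a table (the names fat, FAT_EOF and read_block are invented for illustration):

/* Sketch: following a file's chain of blocks through an in-memory
   index table.  fat[], FAT_EOF and read_block() are hypothetical. */
#define FAT_EOF  (-1)

extern long fat[];                 /* one entry per disk block */
extern void read_block(long blk);  /* read one block from the disk */

void read_file(long start)
{
    long blk = start;              /* e.g. 4 in the example above */
    while (blk != FAT_EOF) {
        read_block(blk);           /* fetch the data block */
        blk = fat[blk];            /* the entry gives the next block: 4 -> 6 -> 2 */
    }
}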

MSDOS uses this system. The root directory is at a fixed location on the disk, and is contiguous and of fixed size. (This is why you cannot have an unlimited number of files in the root directory in MSDOS.) The entry for each file in the root directory gives its starting block, which points into the FAT (File Allocation Table), the index table for the rest of the disk.

When a new file is created or extended, new blocks can be found by looking through the FAT for unused entries.
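
A sketch of that search, again with invented names and sizes:

/* Sketch: scanning the index table for an unused entry when a file
   is created or extended.  FAT_FREE and NBLOCKS are assumptions. */
#define FAT_FREE  0
#define NBLOCKS   20000

extern long fat[NBLOCKS];

long find_free_entry(void)
{
    long i;
    for (i = 0; i < NBLOCKS; i++)
        if (fat[i] == FAT_FREE)
            return i;          /* caller links it onto the end of its chain */
    return -1;                 /* disk full */
}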

Inodes

The last method uses a fixed-size table of inodes. Each entry in the table contains a set of pointers. These point to the first, second, ... blocks of the file. This allows direct indexing of a fixed number of blocks. If the file is bigger than that, then the next entries point to a single indirect block, to a double indirect block and to a triple indirect block.

[Figure: inode with direct and indirect block pointers]

The advantages are that there is quick access to the first blocks, and potentially huge file sizes. This method is used by Unix.

The inode is used to store all information about a file, and there is one inode per file. This information includes: owner, time last modified, time last accessed, size, file permissions, number of links, etc.

For example, Minix is a simple version of Unix, and Linux has the option to use a Minix file system. A Minix inode is defined by

struct minix_inode {
        unsigned short i_mode;          /* file type and permissions */
        unsigned short i_uid;           /* owner's user id */
        unsigned long i_size;           /* size in bytes */
        unsigned long i_time;           /* time of last modification */
        unsigned char i_gid;            /* owner's group id */
        unsigned char i_nlinks;         /* number of links to the file */
        unsigned short i_zone[9];       /* 7 direct zones, 1 indirect, 1 double indirect */
};
This has 7 direct blocks and 2 levels of indirect block.
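
As a rough worked example (assuming, as Minix does, 1kb blocks and 2 byte zone numbers, so 512 zone numbers fit in one indirect block), the largest file such an inode can describe is:

/* Rough calculation of the largest file a Minix inode can address,
   assuming 1kb blocks and 2-byte zone numbers (512 per indirect block). */
#include <stdio.h>

int main(void)
{
    long block = 1024;
    long per_indirect = block / 2;                 /* 512 zone numbers per block */
    long blocks = 7                                /* direct zones */
                + per_indirect                     /* single indirect */
                + per_indirect * per_indirect;     /* double indirect */

    printf("max file size = %ld blocks = %ld bytes\n",
           blocks, blocks * block);                /* about 256Mb */
    return 0;
}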

When a new file is created (by creat), a new inode entry is allocated in the inode table.
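
Some of the information kept in the inode can be seen from a program with stat. A sketch (the file name afile is invented):

/* Sketch: create a file and examine some of the information kept in
   its inode using stat().  The file name "afile" is hypothetical. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

int main(void)
{
    struct stat sb;
    int fd = creat("afile", 0644);   /* allocates a new inode */

    if (fd < 0 || stat("afile", &sb) < 0)
        return 1;

    printf("inode %ld, size %ld, links %ld, mode %o\n",
           (long) sb.st_ino, (long) sb.st_size,
           (long) sb.st_nlink, (unsigned) sb.st_mode);
    close(fd);
    return 0;
}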

Free list

When a file grows, it will fill up any blocks that it currently has, and then will have to request a new block. Unused or spare blocks will need to be maintained using a free list. For example, in MSDOS the free list is just those entries in the FAT table that are unused. Unix keeps a separate free list.

One method of keeping the free list is as a linked list. Traversal of this may be slow, and it takes up space. Another method (used for small computers) is a bit vector. In this, the first block is represented by bit 1, the second by bit 2, etc. A block is free if its corresponding bit is zero, not free otherwise. Special hardware instructions may support quick perusal of this vector.
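
A sketch of such a bit vector, with invented names and numbering blocks from zero:

/* Sketch of a bit-vector free list.  Bit n is 0 if block n is free,
   1 if it is in use.  All names and sizes are illustrative. */
#define NBLOCKS 20000

static unsigned char freemap[(NBLOCKS + 7) / 8];

int block_is_free(long n)
{
    return (freemap[n / 8] & (1 << (n % 8))) == 0;
}

void mark_in_use(long n)
{
    freemap[n / 8] |= 1 << (n % 8);
}

/* First come first served: hand out the first free block found. */
long allocate_block(void)
{
    long n;
    for (n = 0; n < NBLOCKS; n++)
        if (block_is_free(n)) {
            mark_in_use(n);
            return n;
        }
    return -1;   /* no free blocks */
}

The allocate_block function here is also an example of the FCFS allocation mentioned below.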

The simplest way of allocating a block from this list is FCFS (first come first served), but other algorithms are possible. For example, since directories are accessed very frequently, placing blocks for them close to the middle of the disk may speed up access to them.

Disk caching

Files are stored in blocks, and read and written a block at a time. A ``block'' will have its size determined by hardware and software characteristics. For example, it may be related to the sector size of the disk. Since a file uses at least one block, small block sizes waste less space, but since blocks have to be read from and written to disk, large block sizes mean fewer transfers. For example, most Unix systems now use 1kb blocks.

It is becoming increasingly common to reduce the amount of disk access required by caching. In this a copy of something is kept in fast memory rather than on disk. For example, a block may be read into memory. If another read is required from the block, then instead of re-reading it from disk, the read is taken from a copy of the block in RAM.

The high level I/O functions are often implemented using a cache or buffer. When a read is first requested, say by getchar, a block is read into an input cache for that file. Successive getchar's are satisfied from this buffer until it becomes empty, at which point another block read is done. This block read is tuned to the disk.
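
A sketch of such an input cache, in the style of getchar but with invented names (this is not the real stdio source):

/* Sketch of a buffered character read: a whole block is read from the
   file descriptor at once, and characters are handed out from the
   buffer until it is empty.  Names are illustrative, not real stdio. */
#include <unistd.h>

#define BUFSIZE 1024

static char buf[BUFSIZE];
static char *next;
static int  left = 0;

int my_getchar(int fd)
{
    if (left == 0) {                      /* buffer empty: refill it */
        left = read(fd, buf, BUFSIZE);    /* one block-sized disk read */
        next = buf;
        if (left <= 0)
            return -1;                    /* end of file or error */
    }
    left--;
    return (unsigned char) *next++;
}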

Caching may be of read-only data, such as compiled program code. The cache size should be big enough that a substantial amount of execution is done using just the code in the cache.

When data is changed, caching becomes more interesting. Suppose I am modifying a file, and I make a change. To reduce disk I/O this change should be cached locally, and not written until a lot can be written at once. Suppose the system crashes, or someone else wants to make use of the file in the meantime. The cache and the disk will be inconsistent. The simplest solution is a write-through cache, so that writes are reflected immediately on the disk. Of course, this loses some of the potential savings...
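
A sketch of the write-through idea, with invented names: the cached copy is updated and the block is written straight out as well, so cache and disk cannot disagree.

/* Sketch of a write-through block cache: the in-memory copy is
   updated and the block is written back to disk at once, so cache
   and disk stay consistent.  All names are illustrative. */
#include <string.h>

#define BLOCK_SIZE 1024

extern void disk_write(long blk, const char *data);  /* hypothetical driver call */

struct cache_entry {
    long blk;                    /* which disk block this entry holds */
    char data[BLOCK_SIZE];
};

void cache_write(struct cache_entry *e, const char *data)
{
    memcpy(e->data, data, BLOCK_SIZE);   /* update the cached copy */
    disk_write(e->blk, e->data);         /* write through to disk immediately */
}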


Jan Newmarch (http://jan.newmarch.name) <jan@newmarch.name>

Copyright © Jan Newmarch
Last modified: Wed Nov 19 17:57:59 EST 1997