Introduction
This lecture looks at filesystems from the O/S viewpoint.
It covers file operations in general, and the Unix API
(Application Programmer's Interface) for these.
It examines the choices in how a file system can be implemented,
and how the Unix file system is implemented.
File Systems
We have already covered many things from the user's viewpoint, such as file naming,
directory structure, file permissions, etc, using Unix and MSDOS as primary
examples. Now we look at such things from the O/S viewpoint:
- the O/S view of file systems
- how these can be implemented
- what is a typical API for file services
Structure
Files may be organised as
- a sequence of bytes
- a sequence of records
- a tree
MSDOS and Unix both treat files as byte sequences and impose no internal structure
on them. Individual applications may require an internal structure, but not
the O/S.
Many old systems that used to read 80-column cards use an 80-character record
structure (eg the Burroughs A9).
Some mainframe systems use a tree organisation for files, where each node is
indexed by a key.
Types
The types of file under Unix may be
- regular
- directory
- character special
- block special
Regular files are binary or Ascii files, and contain compiled programs, data
files, source code, text, etc. These are the ``ordinary'' files.
Directory files are used by the O/S to maintain the hierarchical file system.
Character special files are devices at a fixed location,
with volatile contents.
Reads from such files are actually reads from the device. No random access
is available on such files. Examples are the keyboard, mouse, etc. In Unix, these
are /dev/kbd, /dev/mouse, /dev/audio, etc.
Block special files are also devices, but ones that allow random access. Examples
are SCSI disks /dev/sd0, etc, memory /dev/kmem, and floppy disks /dev/fd0.
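As a small illustrative sketch, a program can ask the O/S which of these types a file is, using the stat call from the low-level API described below (the program name and layout here are made up for the illustration):

#include <sys/types.h>
#include <sys/stat.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    struct stat sb;

    if (argc != 2) {
        fprintf(stderr, "usage: filetype path\n");
        return 1;
    }
    if (stat(argv[1], &sb) == -1) {       /* ask the O/S about the file */
        perror("stat");
        return 1;
    }
    if (S_ISREG(sb.st_mode))
        printf("regular\n");
    else if (S_ISDIR(sb.st_mode))
        printf("directory\n");
    else if (S_ISCHR(sb.st_mode))
        printf("character special\n");
    else if (S_ISBLK(sb.st_mode))
        printf("block special\n");
    return 0;
}

Run on /dev/fd0 it should report block special; on /dev/mouse, character special.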
Access
Access to files may be sequential, random or both. It depends on the file type
and support from the O/S. Most regular files allow both.
File operations
The set of operations allowed on files may include
- create
- delete
- open
- close
- read
- write
- append
- seek
- get attributes (security)
- set attributes (security)
- rename
Unix low-level file operations
The stdio library has a high-level interface to the file operations, with printf,
putchar, etc. These are built from a lower-level interface. Typical calls from
this are
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int creat(char *path, mode_t mode);
int open(char *path, int oflag, ...);
int close(int fildes);
int read(int fildes, char *buf, unsigned nbyte);
int write(int fildes, char *buf, unsigned nbyte);
int chmod(char *path, mode_t mode);
int stat(char *path, struct stat *buf);
creat takes the pathname of a file and creates a new file, removing the contents
if it already existed. The mode sets the access permissions. This is a bit-wise
OR of the file permissions:
S_IRUSR 0400
S_IWUSR 0200
S_IXUSR 0100
S_IRGRP 040
S_IWGRP 020
S_IXGRP 010
S_IROTH 04
S_IWOTH 02
S_IXOTH 01
This follows the rwx for user, group and other that is shown by ``ls -l''.
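For example, a sketch of creating a file that the owner can read and write but that group and others can only read (mode 0644); the filename is just a placeholder:

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
    /* rw-r--r--  ==  0644; "newfile" is just a placeholder name */
    int fd = creat("newfile", S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);

    if (fd == -1) {
        perror("creat");
        return 1;
    }
    /* fd can now be passed to write() */
    return 0;
}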
The open call opens the file in the mode given by oflag, and returns a file
descriptor. This is an integer which the O/S uses to index into a per-process
file descriptor table, which keeps information about open files. Descriptor 0 is
always stdin, 1 is stdout, and 2 is stderr.
oflag is a bit-wise OR of a number of constants
O_RDONLY
O_WRONLY
O_CREAT
O_TRUNC
When O_CREAT is one of these flags, an additional argument to open is the file
permission mode.
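For example, a sketch of opening one file for reading and another for writing, creating the second if necessary (the filenames are placeholders):

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
    /* open an existing file for reading only */
    int in = open("input.dat", O_RDONLY);

    /* open for writing, creating the file if it does not exist and
       truncating any old contents; because O_CREAT is given, a
       permission mode is supplied as the extra argument */
    int out = open("output.dat", O_WRONLY | O_CREAT | O_TRUNC,
                   S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);

    if (in == -1 || out == -1) {
        perror("open");
        return 1;
    }
    return 0;
}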
The read call is a block read of a given number of bytes. The size can be tuned
to the physical characteristics of the device being read from: for example, it may
be much faster to do a block read of 1024 bytes from a hard disk than to do 1024
reads of 1 byte each.
The value returned is the actual number of bytes read.
This may be less than the number asked for. For example
- if you ask for 1024 bytes to be read from a file that only has 53 bytes in it,
  you will only get 53.
- input from the keyboard is normally line-buffered, so no matter how many you
  ask for, you will only get a line at a time.
- a pipeline is often implemented with a buffer size of 4k. Reads of more than
  that may cause the pipeline buffer to be filled and emptied several times.
``read'' returns zero on EOF, and -1 on error. A while loop to read using
this call should always test the return value for > 0:
while ((nread = read(...)) > 0)
    ...
if (nread == 0)
    /* EOF code */
else
    /* error code */
The write call is also a block write of a number of bytes. It returns -1 if
a write error occurs (such as no space left).
Example
This is a simple version of cp, using the low-level I/O.
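A minimal sketch of such a program, using only open, creat, read, write and close (this is an illustrative version, not necessarily the original listing), might look like this:

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

#define BUFSIZE 1024               /* one block, tuned to the disk */

int main(int argc, char *argv[])
{
    int in, out;
    int nread;
    char buf[BUFSIZE];

    if (argc != 3) {
        fprintf(stderr, "usage: cp from to\n");
        return 1;
    }
    if ((in = open(argv[1], O_RDONLY)) == -1) {
        perror(argv[1]);
        return 1;
    }
    /* create the destination with rw-r--r-- permissions */
    if ((out = creat(argv[2], S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH)) == -1) {
        perror(argv[2]);
        return 1;
    }
    /* copy a block at a time until read returns 0 (EOF) or -1 (error) */
    while ((nread = read(in, buf, BUFSIZE)) > 0)
        if (write(out, buf, nread) != nread) {
            perror("write");
            return 1;
        }
    if (nread < 0) {
        perror("read");
        return 1;
    }
    close(in);
    close(out);
    return 0;
}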
When should you use these low-level functions? Only when you have to. In general,
use the high-level functions, although there will be many times when we can't.
Note that these functions are specific to the Unix API, and are not in Microsoft
C, for example. So code written using these functions will only be portable
across Unix systems, not to other operating systems.
Allocation
Files can be laid out on the disk in a variety of ways
- contiguous
- linked list
- index
- inodes
Contiguous
In a contiguous layout, the entire file is laid out as one sequence of disk
blocks. It is easy to keep track of the file, because you only need its starting
block and size. On the other hand, space must be pre-allocated for the file before
writing can take place.
Linked list
In a linked list layout, the file is divided into blocks and part of each block
is a pointer to the next block. Keeping track of each file is also easy, because
you only need the start of each list. Sequential access is easy. Random access
is hard. Because part of each block is taken up by the pointer, the data in a
block is no longer a power of 2, so reads in powers of 2 no longer match the
block layout.
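As a sketch (the block and pointer sizes here are only illustrative), a block in such a scheme might be declared as:

#define BLOCK_SIZE 1024       /* illustrative block size */

struct list_block {
    unsigned long next;       /* disk address of the next block, 0 at the end */
    char data[BLOCK_SIZE - sizeof(unsigned long)];   /* the data area is
                                                        no longer a power of 2 */
};

A 1024-byte read from such a file now spans two blocks, which is why reads in powers of 2 no longer fit the layout neatly.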
Index
In an index method, a table is maintained with one entry for each block of the
disk. The entry for a block holds the number of the block that follows it in a
file. Thus a file starting in block 4 and using blocks 6 and 2 would have table
entries

    entry 4: 6
    entry 6: 2
    entry 2: end of file

Then you can store the start of the file somewhere else. This makes it easier
to keep track of where a file is, because the index table can be stored in fast
memory instead of on disk, and the data blocks themselves can be used in full.
MSDOS uses this system. The root directory is at a fixed location on the disk,
and is contiguous and of fixed size.
(This is why you cannot have an unlimited number of files in the root
directory in MSDOS.)
Each entry in the root directory holds the starting block of a file, and this
is an index into the FAT (File Allocation Table), which is the index table for
the rest of the disk.
When a new file is created or extended, new blocks can be found by looking
through the FAT for unused entries.
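A sketch of following such a chain, assuming the whole index table has been read into memory (the table size, end-of-chain marker and read_block helper are all made up for the illustration):

#define NBLOCKS 2880        /* illustrative: number of blocks on the disk */
#define FAT_EOF (-1)        /* simplified end-of-chain marker */

int fat[NBLOCKS];           /* fat[b] holds the block that follows block b */

extern void read_block(int b);   /* hypothetical: read one data block from disk */

void read_file(int start_block)
{
    int b;

    /* for the file above the chain is 4, then 6, then 2 */
    for (b = start_block; b != FAT_EOF; b = fat[b])
        read_block(b);
}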
Inodes
The last method uses a fixed-size table of inodes. Each entry in the table
contains a set of pointers. These point to the first, second, ... blocks of
the file. This allows direct indexing of a fixed number of blocks. If the file
is bigger than that, then the next entries are to a single indirect block,
to a double indirect block and to a triple indirect block.
The advantages are that there is quick access to the first blocks,
and potentially
huge file sizes. This method is used by Unix.
The inode is used to store all information about a file,
and there is one inode
per file. This information includes: owner, time last modified,
time last accessed,
size, file permissions, number of links, etc.
For example, Minix is a simple version of Unix, and Linux has the option
to use a Minix file system. A Minix inode is defined by
struct minix_inode {
    unsigned short i_mode;     /* file type and permissions */
    unsigned short i_uid;      /* user id of the owner */
    unsigned long  i_size;     /* size in bytes */
    unsigned long  i_time;     /* time of last modification */
    unsigned char  i_gid;      /* group id of the owner */
    unsigned char  i_nlinks;   /* number of links to this file */
    unsigned short i_zone[9];  /* 7 direct, 1 single indirect, 1 double indirect */
};
This has 7 direct blocks and 2 levels of indirect block.
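As a sketch of how the i_zone array of the struct above might be used, the following finds the zone (disk block) holding block n of a file. It assumes 1 Kbyte zones and 2-byte zone numbers, so that one indirect zone holds 512 zone numbers; read_zone is a made-up helper that reads one zone from the disk:

#define NR_DIRECT 7        /* zones 0..6 point directly at data */
#define PER_ZONE  512      /* assumed: zone numbers held in one indirect zone */

extern void read_zone(unsigned short z, unsigned short *buf);  /* hypothetical */

unsigned short zone_for_block(struct minix_inode *ip, unsigned long n)
{
    unsigned short zone[PER_ZONE];

    if (n < NR_DIRECT)                 /* direct zones */
        return ip->i_zone[n];

    n -= NR_DIRECT;
    if (n < PER_ZONE) {                /* single indirect zone */
        read_zone(ip->i_zone[7], zone);
        return zone[n];
    }

    n -= PER_ZONE;                     /* double indirect zone */
    read_zone(ip->i_zone[8], zone);
    read_zone(zone[n / PER_ZONE], zone);
    return zone[n % PER_ZONE];
}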
When a new file is created (by creat), a new entry is made in the inode table.
Free list
When a file grows, it will fill up any blocks that it currently has, and then
will have to request a new block. Unused or spare blocks will need to be maintained
using a free list. For example, in MSDOS the free list is just those entries
in the FAT table that are unused. Unix keeps a separate free list.
One method of keeping the free list is as a linked list. Traversal of this
may be slow, and it takes up space. Another method (used for small computers)
is a bit vector. In this, the first block is represented by bit 1, the second
by bit 2, etc. A block is free if its corresponding bit is zero, not free otherwise.
Special hardware instructions may support quick perusal of this vector.
The simplest way of allocating a block from this list is FCFS (first come first
served), but other algorithms are possible. For example, since directories
are accessed very frequently, placing blocks for them close to the middle of
the disk may speed up access to them.
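A sketch of a bit-vector free list with a first-come-first-served allocator that simply scans for the first zero bit (the number of blocks is illustrative, and the vector is indexed from 0 here):

#define NBLOCKS 8192                        /* illustrative disk size in blocks */

static unsigned char freemap[NBLOCKS / 8];  /* one bit per block, 0 = free */

/* FCFS allocation: return the first free block, or -1 if the disk is full */
int alloc_block(void)
{
    int b;

    for (b = 0; b < NBLOCKS; b++)
        if ((freemap[b / 8] & (1 << (b % 8))) == 0) {
            freemap[b / 8] |= 1 << (b % 8);      /* mark it in use */
            return b;
        }
    return -1;
}

/* freeing a block just clears its bit again */
void free_block(int b)
{
    freemap[b / 8] &= ~(1 << (b % 8));
}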
Disk caching
Files are stored in blocks, and read by blocks. A ``block'' will have its
size determined by hardware and software characteristics; for example, it may
be related to the sector size of the disk. Since a file uses at least one block,
small block sizes are desirable to reduce wasted space, but since blocks have to
be read from and written to disk, large sizes are desirable to reduce the number
of transfers. As a compromise, most Unix systems now use 1kb blocks.
It is becoming increasingly common to reduce the amount of disk access required
by caching. In a cache, a copy of something is kept in fast memory rather than
being fetched from disk each time. For example, a block may be read into memory.
If another read is required from that block, then instead of re-reading it from
disk, the read is satisfied from the copy of the block in RAM.
The high-level I/O functions are often implemented using a cache or buffer.
When a read is first requested, say by getchar, a block is read into an input
cache for that file. Successive getchar's take characters from this buffer until
it becomes empty, at which point another block read is done. This block read is
tuned to the disk.
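A sketch of the idea, a getchar-like function built on top of the low-level read (an illustration only, not the actual stdio code):

#include <unistd.h>

#define BUFSIZE 1024                  /* one block, tuned to the disk */

static char inbuf[BUFSIZE];           /* the input cache for the file */
static int  bufpos = 0, buflen = 0;

int my_getchar(int fildes)
{
    if (bufpos >= buflen) {           /* cache empty: do one block read */
        buflen = read(fildes, inbuf, BUFSIZE);
        bufpos = 0;
        if (buflen <= 0)
            return -1;                /* EOF or error */
    }
    return inbuf[bufpos++] & 0xff;    /* hand out one cached character */
}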
Caching may be of read-only data, such as compiled program code. The cache
size should be big enough that a substantial amount of execution is done using
just the code in the cache.
When data is changed, caching becomes more interesting. Suppose I am modifying
a file, and I make a change. To reduce disk I/O this should be cached locally,
and not written until a lot can be written at once. Suppose the system crashes, or
someone else wants to make use of the file in the meantime. The cache and the disk
will be inconsistent. The simplest solution is a write-through cache, so that writes
are reflected immediately on the disk. Of course, this loses some of the potential
savings...