CS170 Lecture notes -- File this one under "11"

Matthew Allen
Directory: /cs/faculty/rich/public_html/class/cs170/notes/FileSystem
Lecture Notes: http://www.cs.ucsb.edu/~rich/class/cs170/notes/FileSystem/index.html
additional reference notes are available to help elucidate this important topic.

As it turns out, there are lots of different ways to organize a file system. We will specifically discuss how it was done in UNIX because it is a relatively straightforward, general purpose system.

The UNIX ideal is to provide an abstraction that is easily portable to many different methods for retrieving data. The abstraction they chose is to make everything appear as if it was a sequential steam of bytes. This is called a byte-stream. It is a nice abstraction because it means to the program all types of input are going to look the same. Some forms of input, like pipes, sockets, and terminals, fit this model automatically. This is because they are a stream of bytes coming, in order, from a device or another processes, local or not.

If we want to implement a byte-stream file interface, we need to provide a set of functions that maintain this abstraction. They are:

open: this will allow us to open an existing file or create a new file and brings it into memory so the processes can use it. When a process calls open, we should try to locate the file the process identifies and return a file descriptor, which is the process's means of identifying a file. If we cannot locate what they describe, we should either create a new file or provide them with an error.
close: close should take release a file descriptor from use and make the file inaccessible until another open call is made.
read: this will allow us to read n bytes out of a file and into some space in a process's memory. When a read call is made, it should start where the previous read left off. To do this, we keep a seek pointer associated with the file. When we make a read call, our call starts at the seek pointer, and when we are done, we update the seek pointer to point to just after the end of the read.
write: write should let us write n bytes from a process's memory into a file. If a write call would go past the end of a file, the file should be made larger to accommodate the new data. The write call should also use the seek pointer. New data written into a file should start at the seek pointer, and when write is complete the seek pointer should be updated.
truncate: we need to be able to remove data from the end of the file stream.
seek: this will allow us to move the seek pointer in a file, allowing us to choose which part of the stream we read and write to.

One thing you will probably notice about these commands is that not all of them are applicable to byte-streams like pipes and sockets. Specifically, truncate and seek are meaningless on these types of input. However, they are essential for making a file system useful. Also, other commands that might seem important are not present. Copy, for instance, seems like an important function. However, it is easily implemented with open, read, and write, so it is not included. I will leave these definitions alone for a while so I can describe how the OS deals with files. We will get back to them a little later.

The difficulty with disks

The problem with UNIX's nice little abstraction is that disk devices don't support it very well. When you think of a disk, think of it as drums of magnetic storage medium called cylinders. Each cylinder is mounted on a motor, and there is an arm that mounts a magnetic reader that retrieves data off the disk as it rotates. Consecutive bytes are stored in a ring around the outside of the cylinder called a track. Tracks are further subdivided into sectors, which is the smallest unit of data that disks recognize. So, when you issue a read request to a disk, you specify a cylinder, track, and sector for the disk to read. The disk then returns to the OS that entire sector of data. This is not exactly how things work, but it is pretty close.

If we know the size of a sector and how many cylinders, tracks per cylinder, and sectors per track, we can easily calculate the disk's capacity. Let's say we have an imaginary disk with the following properties: 1 cylinder, 512 tracks, and 32 sectors per track. Also, each sector holds 512 bytes of data. We can calculate the number of bytes per track as the number of bytes per sector times the number of sectors per track. In this case, we have 32 (sectors) * 512 (bytes) = 16,384. Since we have 512 tracks on the disk, that means the total disk capacity is 16,384 * 512 = 8,388,608, or 8MB. When we want to create a file on the disk, what we will do is assign a set of sectors to contain the data on that disk. So, if we want to store a 2000 byte file, we will need at least 4 sectors. If we want to utilize each sector in its entirety, the first 3 sectors will be full of data, and the last will hold 464 bytes of data.

Many modern disks use a slightly different addressing mode called logical block addressing in which sectors are addressed from 0 to n, where n is the total number of sectors on all cylinders and tracks. This abstracts away the need to refer to a sector by its cylinder, track, and sector. Instead we can refer to it with a single addresses. When we use this addressing method, we usually refer to sectors as blocks. This is the method we will use throughout this lecture.

If we want to implement a byte-stream file system, then the most intuitive way to save a file's data is as a contiguous set of bytes on the disk. Let's say we are going to use this method and store the 2,000 byte file we discussed above. If the first block of our file happens to be block 1,203, then the file] will consume blocks 1,203, 1,204, 1,205, and 1,206. If we want to write 400 more bytes onto the end of the file, then we will run past the end of block 1,206, which only has 48 bytes left. Thus, we will need to move on to the next sector, 1,207, to store the remaining 352 bytes.

This, as many of you probably immediately recognize, is the problem with this method of storing data. If there is another file being on the disk that starts at 1,207, then for this write to succeed we have to relocate one of the file. Probably, we would want to relocate the file that wants to grow to a new location. This has two problems. The least important is that moving the file will be slow. Worse, however, is that unless the disk is sparsely used we may not be able to find 5 other contiguous sectors on the disk. If the disk is heavily used, we might have to move around other files to create 5 empty sectors. If the file is larger, it might be extremely hard to find space for it. This is a problem, but is easily avoided.

The solution to our problem is to lift the requirement that we want all of the sectors of a file to be contiguous. Instead, we associate a record with each file that tells us which blocks are responsible for holding the file. This record is an array that associates each 512 bytes of the file with a block. For instance, the first element of this array will point to a block containing bytes 0 to 511 of the file. The next points to another block that contains bytes 512 to 1,023. This method is called direct addressing. Now consider our example file. The first 512 bytes can be stored on block 1,203. The next 512 bytes might be on block 723. The rest of the data is stored on two other blocks that can be located anywhere on the disk. Now, if we were to write another 400 bytes, we would just have to find one empty block and assign it to the next element in the array.

Now, this solves the problems we had before from trying to allocate in contiguous blocks. However, we still have a few problems. One is that we are using more space for overhead than the contiguous method. In that method, we only have to know the first block of the file and the size. If addresses are 4 bytes and we have an 4 bytes for size, then this takes 8 bytes. In our new method, we have the size of the array * 4, plus another 4 for size. If we have 14 elements in our array, this is 60 bytes. This data needs to be kept on the disk or in core, so we want to keep it small, but the difference between 8 and 60 is not a problem. However, a worse problem is the maximum size of a file is fixed by the size of the array. Since we have 14 elements in our array, then at most our files can have 14 * 512 = 7,168 bytes. If we want to make it bigger, we have to add more elements to this array. So, if we want our maximum file size to be on 10 MB (which is small), then we would need an array with 20,480 elements, which requires 80KB by itself. This is significant.

The way we get around the problem above is to use what is called indirect addressing. The trick here is that we can use a block on the disk to store an index of block addresses as well. Now, each element of our array points to a disk block, and this block contains a list of block addresses that contain the file's contents. So for our example, the first element of the array points to block 3,413 This sector contains 4 addresses, that point to the 4 blocks we need to store the file's contents. If we run out of space in the block for addresses we assign a new index block to the array.

This increases our storage capacity dramatically. Each block can index 512 / 4 = 128 block, which is 128 * 512 = 65,536 bytes of file data. If we keep our same 14 element array, we now can store 917,504 bytes per file. This is 128 times what we could hold before. This is still restrictive, but this technique can be extended to deal with more levels of indirection. For instance, if we use double-indirect addressing, each array element points to a sector containing the addresses of index sectors that in turn point to the file's contents. Now each array element refers to an index of 128 index blocks with 128 data sectors, for 128 * 128 * 512 = 8,388,608 bytes, and 117,440,512 bytes total for the file. And with triple-indirect addressing we can store 15,032,385,536 (15 GB) bytes total. We do induce a bit of overhead because now if we want to read a byte from a file, we have to issue a read for each level of indirection, but we get a lot more potential space with a lot lower space overhead for small files.

As it turns out, it is best to use a mix of the addressing modes discussed above. If you analyze the statistics of file system usage, you will find out that most files on the system are small and frequently accessed. A small percentage is fairly large and occasionally accessed, and there are a few very large files that get accessed infrequently. The choice of addressing modes is chosen to reflect this, and balances access frequency with speed. Usually most of the array elements use direct addressing, then there are two or three indirect, one or two double-indirect, and one or two triple-indirect. So, our array might be set up so there are 10 direct addresses, 2 indirect addresses, one double-indirect, and one triple-indirect. Now the first 5,120 bytes are accessed with direct addressing, the next 131,072 bytes are accessed using indirect addressing, and then there are 8,388,608 and 1,073,741,824 bytes that use double- and triple-indirect addressing. Now we can allow file with 1,082,266,624 bytes max. Its nice because inodes are a reasonable size and smaller files will be quicker to access because they use more direct addressing modes, but large files can still be represented.

Organizing the disk

What you should be asking yourself now is how we associate these arrays of disk contents with actual files on the system. The answer, in UNIX at least, is using a data structure called an inode. An inode completely defines a file, and contains a couple things: an index pointing to the file's contents and a set of file attributes. The index we have discussed in detail, and the attributes are any details you are accustomed to associating with a file like the owner, permissions, size, file type, creation, access, and modification times, etc.

These inodes spend most of their lives out on disk, in an array of inodes that sits at the beginning of the disk. Inodes are crafted in such a way that their size is a factor of the size of a block. So, for instance, we might craft an inode so we have 16 addresses (64 bytes total) and 56 bytes of attributes, yielding us an inode of size 128. Notice that exactly four of these will fit on a block (128 * 4 = 512). Each of these inodes is uniquely identified by an inode number, which is related to the address on the disk. Inodes are numbered 1 through n. Given an inode number, the block it lives on is the number / 4, and the offset into that block is (number % 4) * 128.

So, lets talk about a hypothetic inode. Here are all the fields of this inode (some of them you won't know... we will see them soon):

18 bytes for the owners user name (char owner[18])
18 bytes for the owner group's name (char group[18])
4 bytes for the file type (unsigned long type)
4 bytes for the file's permissions (unsigned long perms)
4 bytes apiece for the access, modification, and creation times (unsigned long accessed, modified, created)
4 bytes for the size (unsigned long size)
4 bytes for the number of links (unsigned long links)
16 4 byte addresses (10 direct, 3 indirect, 2 double-indirect, and 1 triple indirect. unsigned long address[16])

If you want to find inode 23, it will be on block 5 (23 / 4 = 5) and will be 384 ((23 % 4) * 128) bytes into the block. If the os needs to access inode 23, it will read block 5 into memory and find the inode on it. When you look there, it has these values. The owner is set to "msa", and the group is set to "grad". It a normal file, so the type is INODE_FILE (set to 0x1). I created it on 2003 November 5th, 14:44ish, and that was the last time I messed with it, so the access, modification, and creation times are all 1068072256. Its 202,148 bytes long, and there is one link to it. The questions you should ask yourself is how many blocks am I using? Well, 10 direct block give us 5,120 bytes, so once those are gone we have 197,028 bytes remaining. The three indirect blocks will hold 196,608 of the data, leaving 420. This last piece will have to be held in a double-indirect block. All of our data takes 395 blocks. Each of the three indirect addresses use 1 index block. The one double-indirect address uses 2 index block, one holds the address of the data block and the other holds the address of the index block. Thats 400 blocks total.

These inodes describe files, but they don't really describe all of the file system. There are other details the OS needs to know about the file system, like the total number of inodes and blocks on the disk, as well as a list of which blocks and inodes are free. This information is kept in another data structure called the super block. You will notice that inodes start from 1, not 0. Conveniently, this unused bit of space starting at the beginning of block 0 is where the super block lives. After the super block is a list of all inodes in the system. The rest of the blocks for file contents and indirect indexes live on the disk after the inodes. In general, it looks like this:

Many of you will have noticed that the inodes take up a fixed amount of space on the disk. This means that we can have a fixed number of inodes in the system, as well as a fixed number of blocks that can be used for file contents. If we don't have enough inodes, we might run out of files. If we have too many inodes, we might leave a lot unused and have less space for actual file contents. These needs must be balanced by the administrator when the system is configured. Setting all of this up is what happens when you format a disk for a specific file system type.

You may also have noticed that, to access this information, you need access to very important information from the super block. What you might as yourself is, "what happens if the super block gets ruined somehow?" The answer, unfortunately, is that your file system might be lost. Its a real fear. Disk sectors go bad when disks get old, are dropped, or get wet. If your superblock or inodes get trashed, there are ways you can try to get them back, but they may not work. It sucks, but thats the way it goes.

Managing your files

Now we know where and how files are stored on disk, and we know how files are maintained. What we don't know yet is how the user interacts with the file system. We are used to seeing is files with names that are located in a hierarchy of directories. In what we've discussed so far, we don't have names and directories. All we have are files identified by numbers that live on disk. So let's talk about the other stuff.

First thing we need to discuss is how directories work. Each directory is represented by an inode on disk. This inode has a specific flag set in it's 'type' field that says it is a directory. It differs from a normal file in that the contents aren't user supplied data but are a set of entries that describe the files that are present in this directory. Each entry contains only two things: a file name and an inode number. This entry is a fixed size, so file names are limited in length. If you know where a directory is and you know the name of a file in that directory, you can ask the os to find it and it will scan through the list of the directory's contents, find the file name you specified, and get the inode associated with it off the disk.

For example, lets say our file system allows 16 bytes per directory entry. We are going to store the inode number in an unsigned short, because we know we will not allow the fs to be configured with more than 65536 inodes. A short takes up 2 bytes, so we use the other 14 bytes for a string describing the name of the file. Our example directory holds 5 files:

    "."             : 147
    ".."            : 91
    "cat"           : 133
    "dog"           : 211
    "fish"          : 12

This inode's first direct index points to a block on disk that contains 80 bytes of data: five 16 byte entries describing the directory's contents. "." and ".." are special directory entries. "." refers to the directory's inode, so this means that the directory we are discussing is inode 147. ".." refers to the parent directory, which means that this directory is referred to in the directory in inode 91. This directory contains three other files: cat, dog, and fish. From the information we have so far, we cannot tell if these are regular files or directories, but we know which inodes they use. So, if we tell the os we would like to open "cat" in the current directory, the OS will scan through the directory listing until it finds an entry named "cat". Finding it, it will fetch inode 133 (which the second inode in block 33) off the disk.

Now we know how directories work, lets get to talking about the entire directory structure. The entire directory tree extends from a single directory called the root directory. This directory is special because the OS knows how to find it without having to search for it. This root directory is known by the super block, so when the OS is asked to look for it, it just checks there. In Unix, the root directory is called "/". So, if we want to find the file /cs/class/cs170/fish, here is what we do. We see that the first directory in the path is "/", the root directory. The OS knows where this is because the super block specifies it. So it looks in the root directory's contents. Let's say it looks like this:

    "."             : 0
    ".."            : 0
    "usr"           : 1021
    "green"         : 777
    "cs"            : 2331
    "tmp"           : 8

The OS scan through until it finds "cs", and sees that it is inode 2331. The OS reads 2331 into memory, and scans through its contents until it finds the entry "class". It then opens the "class" inode and looks for "cs170". It recursively moves through the path like this until it finds the final name, "fish". It finds the file fish in inode 12, which it loads into memory and returns to the user.

You don't always specify a pathname relative to root. Frequently, you just specify it relative to a file. This works because processes keep a current directory field in the PCB. The refers to the directory the process is considered to be running in, and files are found in this directory if the calling process does not specify that you want the OS to look from root.

Before I move on, I am going to talk briefly about mount points. As it turns out, we can accesses multiple disks, each with their own super blocks and inodes, from UNIX. The way this is done is similar to how we access the root directory. We can create other special entries in directories that are called mount points. They look like files when you scan through the directory's contents, but they have inode numbers that are not valid (0 is a common one, although not the only one). When the OS sees a file with this inode, it understands that it is special and consults the directory of mounted devices. It then finds the device, and consults the root directory of that devices described by its super block. As it traverses the path, it now looks for inodes on the mounted device, not the root disk. This is how Unix handles multiple disks as well as removable medium like cd-rom and floppy disks.

Processes, files, and the byte-stream

Finally, lets get to talking about how files get handled in memory by the OS and processes. Lets start with the OS. When a processes requests that a file be opened, the OS does two things. First, it reads the inode off the disk and into the OS memory in a statically declared, fixed sized array called the inode table. The entire inode is brought into memory so that file accesses do not require an additional disk read to find inode details. Also, it is easier to synchronize multiple processes accessing one file if they are stored in core, and each entry in this table keeps a count of how many times the file has been opened since it came into memory. The OS also creates an entry in another table called the open file table.

The open file table entries are what the processes uses to interact with a file. They contain a couple details that are important to the process, like the seek pointer and the number of references to the inode. There is one and only one open file table entry for each open call made by a processes. When a processes dups a file or calls for, all additional references to the copied open file refer to the same entry in the open file table. If another open call is made for a file that is already opened, another open file table entry is created and it refers to the inode table entry for that inode that was opened before. The inode table changes the number of references that inode has to reflect the additional open file table entry that refers to it. However, the new open file table entry has its own seek pointer and is independent of the other entry that refers to the same inode.

Each PCB has a keeps a record of all of the files it has open in the file table. These file table entries refer directly to the entries in the open file table. The program refers to these entries using a file descriptor, which is an integer index into this open file. When reads and writes are made, the OS looks in that processes file table to find the open file reference to that inode.

Lets consider the sequence of events that leads to the arrangement of our example illustration:

Process 1 makes two open calls: one for inode 26 and another for inode 13.
Process 1 forks, creating processes 2
Process 2 closes its reference to inode 26, but shares the reference to inode 13 with process 1.
Process 2 reads 50 bytes from inode 13, updating the seek pointer. It also opens inode 47 and reads 15 bytes from it.
Process 3 independently opens inode 47, dups the file table entry, and writes 200 bytes to it.

What you will see from this is that there are two ways processes can access the same file. One is to share the open file table entry. In this case, they share the same seek pointer, so file accesses by one process affect the seek pointer for the other. This is done with dups and forks. The other way they can both open the file independently. Then, they have separate open file table entries and seek pointers for that file. However, the inode table knows that inode has been opened twice.

Now, lets discuss how we would finally implement the system calls we discussed in the beginning of this lecture:

open: the os finds the inode of the file named. If the file does not exist and the calling processes wants it created, a free inode is allocated to the file and any directories that need updating are dealt with. If that file is not already in memory, it is loaded into memory into the inode table. If it is in memory already, the reference count is updated to reflect that another open call has been made. A new open file table entry is created to refer to the in-core inode. The calling process allocates a file table entry to point to the open file table entry. The index of the file table entry is returned to the process.
close: the os uses the file descriptor from the system call and the PCB file table to find the open file table entry for this file. It then decrements the reference count on the open file table entry named. If that count is 0, the open file table entry is freed and the open count in the inode table is decremented. If this count reaches 0, the inode is removed from the table and written back to disk.
read: the os uses the open file table to find the inode referred to. It then reads n bytes starting from the seek pointer into the process memory, and updates the seek pointer. If the seek pointer is at the end of the file, EOF is returned to the process.
write: works just like read, except that if the write proceeds past the end of the file, additional blocks are found in a list of free blocks and added to the inode address entries.
truncate: removes all blocks from the file and adds them to the list of free blocks.
seek: finds the open file table and updates the seek pointer to the value specified