this page last updated:
Wed Sep 23 15:34:22 PDT 2015
Roadmap
In this class you are free to design your file system in any way you choose.
If this is your first up-close encounter with a file system, however, or if
you are having trouble understanding how the pieces all fit together, this
document will provide one possible roadmap for the project. It represents,
more or less, how I implemented it. You need not consider it a prescription.
Rather, if you don't have a strong feeling about how to proceed, you might
consult this text, as I know that a design and implementation that follows it
will result in a working file system.
Requirements and Style
The first thing to understand is that the project is not to create a fully
production-quality implementation. Your file system must be production quality
in terms of its robustness (it must not lose data or crash), but the other
aspects of a file system (portability, extensibility, etc.) that we'd
typically want in the implementation need not be there.
For example, it is fine to design your data structures so that there is only
one file system of your type mounted at a time. If you were building this
file system for a real OS, you'd need to handle having multiple file systems
mounted simultaneously. Feel free to design for the more general case, but it
is not necessary.
The other way to look at the requirements for this project is to ask "what
must my file system do?" At the end of the quarter, I will ask you to add my
ssh public key to your instance and for you to start up your file system
using a single mount point. As root, I will install several test routines by
copying them into your file system through this mount point and I will run the
routines. They will both stress test your implementation and record some
performance stats.
I will also ask you to demo any cool features or features of which you are
particularly proud.
And that's it. The goals, in order of importance, are: first, to enjoy the
process; second, not to have your file system crash or corrupt the storage;
and third, to make your file system performant.
By way of style, it has been my experience that building this type of system
is best accomplished using two basic principles.
- Build incrementally and generate tests as you go. Don't move on
until all of your tests pass for a particular stage. Don't move on until you
have understood and fixed each bug.
- Understand that an OS is fundamentally implementing four operations:
  - discover: find the thing I'm looking for by its name or ID
  - allocate: from a pool of available resources allocate one
    unambiguously
  - map: create a data structure that maps one abstraction to one or
    more resources
  - deallocate: unambiguously return a resource to its correct pool
    when it is no longer in use
If you keep these two basic tenets in mind, I think the project is more
straightforward to comprehend.
Design
For Phases 1 and 2 (i.e. without optimization), design the file system as
four layers, which I'll go through from the bottom up:
- Layer 0: The Disk Layer: This layer modularizes access to the physical
storage medium.
- Layer 1: The Data Structure Layer: This layer builds pools of blocks and
inodes on the disk and defines ways to allocate and deallocate each.
- Layer 2: The Abstraction Layer: This layer implements the Linux file
abstractions.
- Layer 3: The Interface Layer: This layer implements an interface between
FUSE and the file abstraction layer.
Software layering is a design principle that can be taken to an extreme. In
this case, it can be used fairly faithfully so that each layer makes calls
only to the layer below it.
Phase 1
Layer 0
The first step is to build an interface to the disk. For debugging purposes,
build an in-memory interface that stores and retrieves data from an
in-memory buffer, but does so based on logical block number. By doing this in
memory it is possible to use the debugger to "see" what is on the disk which
makes debugging easier.
In Phase 2, the idea is to rewrite this layer to use the raw disk rather than
an in-memory buffer as storage. It is important that the interface to
this layer be a block interface (i.e. data is read or written only in full
block units). Henceforth, I will use "on disk" to mean "through Layer 0".
Layer 0 will eventually read and write a real disk, although until Phase 2 it
will really read and write a memory buffer, a block at a time.
Your test routines should verify that you can access all of the blocks on disk
individually and that there is no corruption (e.g. due to a miscalculation
resulting in overlap) in the blocks.
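As a rough illustration of the idea above, here is a minimal sketch of an in-memory Layer 0 together with an overlap test. The names MemoryDisk, read_block, write_block, and check_no_overlap are hypothetical, as are the default sizes. The sketch is in Python for brevity, and the same shape carries over to whatever language you implement in.

```python
class MemoryDisk:
    """Layer 0 sketch: a fixed-size 'disk' held in one in-memory buffer."""

    def __init__(self, num_blocks, block_size=512):
        # block_size is assumed to be a multiple of 4 (used by the test below)
        self.num_blocks = num_blocks
        self.block_size = block_size
        self._buf = bytearray(num_blocks * block_size)

    def _check(self, block_no):
        if not 0 <= block_no < self.num_blocks:
            raise IndexError(f"block {block_no} out of range")

    def read_block(self, block_no):
        """Return exactly one block, addressed by logical block number."""
        self._check(block_no)
        start = block_no * self.block_size
        return bytes(self._buf[start:start + self.block_size])

    def write_block(self, block_no, data):
        """Overwrite exactly one block; partial writes are rejected."""
        self._check(block_no)
        if len(data) != self.block_size:
            raise ValueError("writes must be exactly one block")
        start = block_no * self.block_size
        self._buf[start:start + self.block_size] = data


def check_no_overlap(disk):
    """Fill every block with its own number, then verify each block."""
    def pattern(n):
        return n.to_bytes(4, "little") * (disk.block_size // 4)

    for n in range(disk.num_blocks):
        disk.write_block(n, pattern(n))
    return all(disk.read_block(n) == pattern(n)
               for n in range(disk.num_blocks))
```

Because the whole disk is one buffer, you can inspect it directly in the debugger after any operation.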
Layer 1
There are three kinds of functions to implement at layer 1:
- make-fs: creates the superblock, free inode list, and free block list on
disk
- inode allocate, free, read, and write: routines to get, free and access
inodes on disk
- buffer allocate, free, read, and write: routines to get, free and access
blocks on disk
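One possible shape for the free block list, shown only as a sketch: each free block stores the number of the next free block in its first four bytes, and the superblock (assumed here to be block 0) stores the head of the chain. The sizes, the list named disk standing in for Layer 0, and the function names make_fs, alloc_block, and free_block are all assumptions made for illustration.

```python
import struct

BLOCK_SIZE = 64      # hypothetical, kept tiny for testing
NUM_BLOCKS = 16      # hypothetical
NIL = 0              # block 0 is the superblock, so 0 can mean "end of list"

# Stand-in for Layer 0: one bytes object per block.
disk = [bytes(BLOCK_SIZE) for _ in range(NUM_BLOCKS)]


def _put_u32(block_no, value):
    """Store a 32-bit number in the first four bytes of a block."""
    disk[block_no] = struct.pack("<I", value) + disk[block_no][4:]


def _get_u32(block_no):
    """Load the 32-bit number from the first four bytes of a block."""
    return struct.unpack_from("<I", disk[block_no])[0]


def make_fs():
    """Chain blocks 1..NUM_BLOCKS-1 together; the superblock holds the head."""
    for n in range(1, NUM_BLOCKS):
        _put_u32(n, n + 1 if n + 1 < NUM_BLOCKS else NIL)
    _put_u32(0, 1)


def alloc_block():
    """Pop the head of the free list and return its block number."""
    head = _get_u32(0)
    if head == NIL:
        raise OSError("no free blocks")
    _put_u32(0, _get_u32(head))       # superblock now points at the next one
    disk[head] = bytes(BLOCK_SIZE)    # hand out a zeroed block
    return head


def free_block(block_no):
    """Push a block back onto the head of the free list."""
    _put_u32(block_no, _get_u32(0))
    _put_u32(0, block_no)
```

A free inode list can follow the same pattern. A bitmap is another common choice and makes the free state easier to inspect at a glance.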
Your test routines should verify that the free lists look correct
(uncorrupted) after block and inode allocate and free calls. Since multiple
inodes will fit into a block, they should also verify that inode reads and
writes work correctly.
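Because several inodes share one block, an inode write is a read-modify-write of the containing block. The sketch below shows the arithmetic under assumed values: a 32-byte inode layout (mode, link count, owner, size, four block pointers), an inode table starting at block 1, and a list named disk standing in for Layer 0. None of these choices comes from the project specification.

```python
import struct

BLOCK_SIZE = 128                      # hypothetical
INODE_FMT = "<HHIQ4I"                 # mode, nlink, uid, size, 4 block pointers
INODE_SIZE = struct.calcsize(INODE_FMT)          # 32 bytes with this layout
INODES_PER_BLOCK = BLOCK_SIZE // INODE_SIZE      # 4 with these sizes
INODE_TABLE_START = 1                 # hypothetical: right after the superblock

# Stand-in for Layer 0: one bytes object per block.
disk = [bytes(BLOCK_SIZE) for _ in range(8)]


def locate(ino):
    """Return (block number, byte offset within that block) for an inode."""
    block_no = INODE_TABLE_START + ino // INODES_PER_BLOCK
    offset = (ino % INODES_PER_BLOCK) * INODE_SIZE
    return block_no, offset


def write_inode(ino, fields):
    """Read the containing block, patch one inode slot, write it back."""
    block_no, offset = locate(ino)
    buf = bytearray(disk[block_no])
    struct.pack_into(INODE_FMT, buf, offset, *fields)
    disk[block_no] = bytes(buf)


def read_inode(ino):
    """Unpack one inode slot from its containing block."""
    block_no, offset = locate(ino)
    return struct.unpack_from(INODE_FMT, disk[block_no], offset)
```

A useful test writes two inodes that live in the same block and checks that neither write disturbs the other.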
Layer 2
This layer implements files and directories. For this project,
at some level, you'll need to implement the following Linux system calls:
- mkdir: makes a directory
- mknod: makes a file
- readdir: reads a directory
- unlink: removes a file or directory
- open/close: opens/closes a file
- read/write: reads/writes a file
You will want to study the man pages on these calls to understand their
specific semantics. For example, a write past the end of a file simply
extends the file in Linux (it does not generate an EOF error). You may also
wish to implement additional calls like truncate, chown, and chmod, depending
on how realistic you'd like your file system to be.
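To make the write-past-the-end semantics concrete, here is a small model of a file as a byte string. The helper name write_at is hypothetical. The zero-filled gap mirrors how Linux reads back a region that was skipped over by seeking past the end of the file before writing.

```python
def write_at(contents, offset, data):
    """Return the new file contents after writing data at offset."""
    buf = bytearray(contents)
    if offset > len(buf):
        # The gap between the old end of file and the write reads as zeros.
        buf.extend(bytes(offset - len(buf)))
    # Overwrites existing bytes and extends the file if the write runs past it.
    buf[offset:offset + len(data)] = data
    return bytes(buf)
```

For instance, writing b"Z" at offset 5 of a 3-byte file yields a 6-byte file whose bytes 3 and 4 are zero.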
Your test code for Layer 2 should be able to make directories and files. It
should follow the correct creation semantics (e.g. a mknod fails if it
specifies a path that contains non-existent directories). You should test
file reads/writes that use direct blocks in your inodes, indirect blocks, and
double indirect blocks. You should also make sure that files get deleted
properly and that the free lists look reasonable as blocks and inodes are
allocated and released.
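To pick file offsets that exercise each kind of block, it helps to know where the boundaries fall. The sketch below maps a file-relative block index to the level that holds it. The function name classify_block and the defaults (10 direct pointers, 256 pointers per indirect block) are assumptions, so substitute the numbers from your own inode layout.

```python
def classify_block(index, num_direct=10, ptrs_per_block=256):
    """Map a file-relative block index to (level, slot indices).

    For "double", the slots are (index into the double indirect block,
    index into the indirect block it points to).
    """
    if index < num_direct:
        return ("direct", (index,))
    index -= num_direct
    if index < ptrs_per_block:
        return ("indirect", (index,))
    index -= ptrs_per_block
    if index < ptrs_per_block ** 2:
        return ("double", divmod(index, ptrs_per_block))
    raise ValueError("file too large for this inode layout")
```

With these defaults, tests should at least touch blocks 9, 10, 265, and 266, which are the last and first blocks on each side of the two boundaries.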
Layer 3
The final layer connects Layer 2 to the FUSE interface. Try looking at the
FUSE hello world example to understand how to build a basic set of FUSE
bindings. Note that FUSE has several different interface facilities. In
particular, it passes the path to each object in each call, starting at the
root of the mounted file system. Layer 3 can always call a function to convert
a path to an inode (this routine is called namei in some Unix implementations)
in each call. It is also possible to get FUSE to pass back a file info data
structure in which you can store your own information (e.g. the inode number)
for subsequent calls. You are free to use this facility if you so choose.
Using namei each time means that each call will get the true conversion from
the disk, but it will be really slow. You might start with the namei approach
and then see whether having FUSE pass back the inode number, when it can,
improves performance.
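A path-to-inode conversion can be sketched as a walk from the root, one component at a time. In the toy version below, the dictionary named directories stands in for reading directory contents through Layer 2, and ROOT_INO and the sample entries are made up for illustration.

```python
ROOT_INO = 1  # hypothetical inode number of the root directory

# Toy directory table: directory inode -> {entry name: inode number}.
# A real implementation would read directory blocks instead.
directories = {
    1: {"home": 2},
    2: {"notes.txt": 3},
}


def namei(path):
    """Resolve an absolute path to an inode number, starting at the root."""
    ino = ROOT_INO
    for name in path.split("/"):
        if not name:              # skip empty pieces from "/" and "//"
            continue
        entries = directories.get(ino)
        if entries is None:       # tried to descend through a non-directory
            raise NotADirectoryError(path)
        if name not in entries:
            raise FileNotFoundError(path)
        ino = entries[name]
    return ino
```

Every component costs at least one directory lookup, which is why repeating this walk on every call is slow compared with reusing a stored inode number.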
Also, the debugger is most helpful for development at this layer. There isn't
much documentation that explains exactly what comes across the FUSE interface
in gory detail. It is instructive to write stubs at Layer 3 and to set
breakpoints (using the debugger) in the stubs just to see what FUSE is
passing into your code.
Testing at this stage involves mounting a small file system and using Linux to
test it out. Consider writing test routines that use ASCII text, since it is
easy to use the shell with such tests, and it is also easy to spot corrupted
files. Although the file system is small (it must be able to fit in memory),
all of the "standard" file operations should work when your tests are
complete.
Phase 1 Complete
At this stage, you should have a working file system that uses FUSE and an
in-memory buffer as the disk store. You can pretty much get all of the system
calls to work. The only restriction is that the sizes will need to be pretty
restrictive. Considering usiang a small block size and small constants to
test everything
and then moving to Phase 2 before beginning stress tests. The larger sizes possible
with a real disk may expose some sizing bugs. One possibility
is to do a Phase 1.5 in which you rewrite Layer 0 to use a Linux file
and an intermediate step. Especially if your test routines use text, you can
interrogate the raw disk this way by examining the contents of the file that
is acting as your raw disk.
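A file-backed Layer 0 for the Phase 1.5 idea might look like the sketch below. The class name FileDisk and its method names are hypothetical. It relies on os.pread and os.pwrite, which read and write at an explicit byte offset and are available on Linux.

```python
import os


class FileDisk:
    """Layer 0 sketch backed by an ordinary file instead of memory."""

    def __init__(self, path, num_blocks, block_size=512):
        self.num_blocks = num_blocks
        self.block_size = block_size
        self._fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        # Size the backing file; any newly added space reads as zeros.
        os.ftruncate(self._fd, num_blocks * block_size)

    def _check(self, block_no):
        if not 0 <= block_no < self.num_blocks:
            raise IndexError(f"block {block_no} out of range")

    def read_block(self, block_no):
        self._check(block_no)
        return os.pread(self._fd, self.block_size, block_no * self.block_size)

    def write_block(self, block_no, data):
        self._check(block_no)
        if len(data) != self.block_size:
            raise ValueError("writes must be exactly one block")
        os.pwrite(self._fd, data, block_no * self.block_size)

    def close(self):
        os.close(self._fd)
```

Since the backing store is a plain file, a hex dump of it shows exactly what your upper layers believe is on disk.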
Phase 2
Rewrite your Layer 0 to use the Linux file operations (open, lseek, read,
write) on a raw block device in /dev. Launch an instance in Eucalyptus, create
a volume, and attach it to the instance. The new device can be accessed like a
file through the /dev entry.
For example, if the attached volume is /dev/vdb, then the sequence
- open /dev/vdb
- lseek to 4096
- read 1024 bytes
will open the raw disk device, move the file pointer to byte 4096, and read
1024 bytes from the raw disk.
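The three steps above correspond one-to-one to system calls. Here is a sketch using Python's thin wrappers around them. The helper name read_at is hypothetical, and the same calls exist under the same names in C.

```python
import os


def read_at(path, offset, length):
    """Open path, seek to a byte offset, and read up to length bytes."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(fd, offset, os.SEEK_SET)   # move the file pointer
        return os.read(fd, length)          # may return fewer bytes at EOF
    finally:
        os.close(fd)
```

With a volume attached as /dev/vdb, read_at("/dev/vdb", 4096, 1024) would perform the sequence described above. Reading a raw device usually requires root privileges.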
Rewrite Layer 0 and rerun your tests with a file system that is at least 2 GB.
At this point you should also write stress tests that do lots of operations
with different sizes and offsets to make sure that your file system doesn't have
a latent bug or two.
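One way to structure such a stress test, offered only as a sketch, is to mirror every write in an in-memory model and compare the file against the model. The function name stress and its default limits are assumptions. Point path at a file inside your mounted file system.

```python
import os
import random


def stress(path, ops=200, max_offset=1 << 16, max_len=4096, seed=0):
    """Do random writes at random offsets; return True if the file matches."""
    rng = random.Random(seed)        # fixed seed makes failures reproducible
    model = bytearray()              # what the file should contain
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(ops):
            off = rng.randrange(max_offset)
            size = rng.randrange(1, max_len)
            data = bytes(rng.getrandbits(8) for _ in range(size))
            if off > len(model):     # a gap past the end reads as zeros
                model.extend(bytes(off - len(model)))
            model[off:off + size] = data
            os.pwrite(fd, data, off)
            if os.pread(fd, size, off) != data:   # immediate read-back
                return False
        # Final check: the whole file must equal the model.
        return os.pread(fd, len(model), 0) == bytes(model)
    finally:
        os.close(fd)
```

Logging the seed with each run lets you replay the exact sequence of operations that exposed a bug.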
And that's it. Phase 2 is done.