Scribe Notes by: TaeSung Roh and Michael Liao
Terminology
- error - mistake made by designer
- fault - latent bug: in the built system, but nothing observably bad has happened
- failure - fault has triggered
we need to attack all 3!
File System Goals
- durability: write at time t, need later, get some result even if powere fails (disk crashes, errors on bus...)
- performance: throughput + low latency
- atomicity: write (or, more generally, any changes) either happen or don’t happen
Recall: transactions in main memory
Hard Links and Trouble
inode has a link count along with other stuff (data, block ptrs)
we have stuff like usr/bin/bash is really a link to bin/sh
we can do something like link /bin/sh","home/eggert/bin/s")what's happening is that these links point at an inode, which has link count 3
the directory which we linked also has its own inode -> inside its data blocks we have a directory entry which points it at /bin/sh
rename("/home/eggert/bin/s", "/home/eggert/bin/shell")
when we do this, we need to mess with the data structure in the directory entry
what happens is we write the new disk block in RAM, when we start to write it to disk, we pull the plug! now we have a corrupted directoryrename("/home/eggert/bin/s","/home/eggert/test/sh");
we do two operations here: read bin dir into RAM, read test dir into RAM, clear "s" from bin in RAM, write bin block out, add "s" to test in RAM, write test block outmust assume that the system might crash between any two steps just named
we are ok if power failure occurs, unless we have a power failure between the write to blockwhat we do is we put write bin block out as the last step
if there is power failure now, between write test block out and write bin block out, we essentially do a link instead of a rename
the problem here is that the link count is never incremented
why it's bad: later, we can have a link count of 0, but still a link to the inodewe could also take the approach where we increment the link count on the test block out, subtract on the bin block out, but this doesn't work either
we’ll come back laterAnother Problem: Hard links to directories
what happens if we link bin to usr/bin? the bin diretory has a link count of 3 (usr points to bin, bin points to usr/bin, . points to itself) Sub directory linking to parent directory: this causes a cycle
Hard links to directories are prohibited
- this avoids cycles
- makes link counts reliable
How do we do this then? Symbolic link
Symbolic Links
Symbolic link has an inode with a data block, stuff in the in data block, like normal stuff, however in the type field, there is something that denotes a symlink
execvp("/bin/sh") - if "/bin/sh" is a symlink to usr/bin, also works recursively (if in turn usr/bin is a symlink, it links to that too)
symlinks don't count in the link count, so we can create a cycle this wayPros:
- avoided garbage collection problem
- they can cross file systems
Cons:
- complex
- slower
- dangling sym links
- loops (limit on count)
File System Invariants
Invariant: Boolean expression that’s always true
Let’s assume typical Unix f.s.
- every block of the f.s. is used for exactly 1 purpose (*: doesn’t matter)
boot, super block, bitmap*, inode, data* (regular files - but for directories, it matters)
- all referenced blocks are initialized to valid data
- all referenced blocks in the data are marked used in bitmap
- all unreferenced blocks in the data are marked unused in bitmap
Consequences of Violations:
- Disaster
- Disaster
- Disaster
- Leak: wasted space
Example
rename(“d/a”, “e/b”)
read d
read e
read a’s inode
read b’s inode
+ 1 a’s link count in RAM
write a’s inode
write updated e
write updated d
- 1 a’s link count
Write a’s inode
(We violated 4 but not 1~3)
This requires 5 writes and a move! There “should” be only 2!Interaction Between Performance & Robustness
Elevator algorithm + rename op.
Ops are not done in order requested (for performance reasons)
Fix: (write read) requests list dependencies (cheap to implement)
I/O to data blocks have no dependencies
Huge backwards compatibility issue
ext4: new data structures on disk
Some new stuff:
8 byte
+ 2 bits to time stamps (@ high-end)
+ 30 bits @ low end (ns resolution)
(ext3: 32-bit time_t value (1970-2038)Extents
improve performance (128 MB each)
Containing 4KiB blocks
allocating contiguously
this is not backward compatible at all
if your file is file is <= 512 MB, you can be represented by 4 extents, if not, we use an indirect approach (similar to what we've seen before)
new system call "fallocate" - advises file system about expected size of filedelayed allocation
don’t extend on write(), extend on flush, improve performance