File System (Internals) Mar. 31, 2004: "... Does This Look Familiar?... "

Download as pdf or txt
Download as pdf or txt
You are on page 1of 44

...Does this look familiar?...

15-410

File System (Internals) Mar. 31, 2004


Dave Eckhardt Bruce Maggs

-1-

L26_Filesystem

15-410, S04

Synchronization
Project 3 status
Several groups skipped Checkpoint 3 (the easiest one!) Not everybody took advantage of opportunity to plan Several groups seem on track to finish early Several groups dangerously close to the 90% problem
First 90% of the work takes the first 90% of the time Last 10% of the work takes the second 90% of the time

We want everybody to finish!


Project 3 is the core experience of the class Can't bury it and move on!

-2-

15-410, S04

Synchronization
Project 3 / Project 4 hurdle test suite
Released this week Two sections
Basic tests, solidity tests

At P3 deadline, you will run the tests


Goal: pass ~80% of each section Register to begin Project 4 (some P3 extensions) Extra week to work on P3 Cannot submit P4, grade will be 0%
15-410, S04

Not passing the hurdle?

-3-

Synchronization
Today
Chapter 12 (not: Log-structured, NFS)

-4-

15-410, S04

Outline
File system code layers (abstract) Disk, memory structures Unix VFS layering indirection Directories Block allocation strategies, free space Cache tricks Recovery, backups

-5-

15-410, S04

File System Layers


Device drivers
read/write(disk, start-sector, count) read/write(partition, block) [cached] read/write (file, block) manage directories, free space

Block I/O File I/O File system


-6-

15-410, S04

File System Layers


Multi-filesystem namespace
Partitioning, names for devices Mounting Unifying multiple file system types
UFS, ext2fs, ext3fs, reiserfs, FAT, 9660, ...

-7-

15-410, S04

Shredding Disks
Split disk into partitions/slices/minidisks/...
PC: 4 partitions Windows, FreeBSD, Plan 9 Mac: volumes OS 9, OS X, system vs. user data

Or: glue disks together into volumes/logical disks Partition may contain...
Paging area
Indexed by in-memory structures random garbage when OS shuts down Block allocation: file # block list Directory: name file #
15-410, S04

File system

-8-

Disk Structures
Boot area (first block/track/cylinder)
Interpreted by hardware bootstrap ( BIOS ) May include partition table Key parameters: #blocks, metadata layout Unix: superblock ownership/permissions data location

File system control block


File control block (Unix: inode )


-9-

15-410, S04

Memory Structures
In-memory partition tables
Sanity check file system I/O in correct partition

Cached directory information System-wide open-file table


In-memory file control blocks Open mode (read/write/append/...) Cursor (read/write position)

Process open-file tables


- 10 -

15-410, S04

VFS layer
Goal
Allow one machine to use multiple file system types
Unix FFS MS-DOS FAT CD-ROM ISO9660 Remote/distributed: NFS/AFS

Standard system calls should work transparently Insert a level of indirection!

Solution

- 11 -

15-410, S04

Single File System


n = read(fd, buf, size) INT 54 sys_read(fd, buf, len) namei() iget() iput()

sleep() rdblk(dev, N) wakeup() startIDE()


- 12 -

IDEintr()
15-410, S04

VFS Virtualization
n = read(fd, buf, size) INT 54 vfs_read() ufs_read() namei() procfs_read() procfs_domem() iput()
15-410, S04

iget()
- 13 -

VFS layer file system operations


struct vfsops { char *name; int (*vfs_mount)(); int (*vfs_statfs)(); int (*vfs_vget)(); int (*vfs_unmount)(); ... }

- 14 -

15-410, S04

VFS layer file operations


Each VFS provides an array of methods
VOP_LOOKUP(vnode, new_vnode, name) VOP_CREATE(vnode, new_vnode, name, attributes) VOP_OPEN(vnode, mode, credentials, process) VOP_READ(vnode, uio, readwrite, credentials) Validating system call parameters Moving data from/to user memory Thread sleep/wakeup Caches (data blocks, name inode mappings)
15-410, S04

Operating system provides fs-independent code


- 15 -

Directories
External interface
vnode2 = lookup(vnode1, name) List of (name,inode #) - not sorted! Names are variable-length Lookup is linear
How long does it take to delete N files?

Traditional Unix FFS directories


Common alternative: hash-table directories

- 16 -

15-410, S04

Allocation / Mapping
Allocation problem
Where do I put the next block of this file? Near the previous block? Where is block 32 of this file? Similar to virtual memory
Multiple large address spaces specific to each file Only one underlying address space of blocks Source address space may be sparse!

Mapping problem

- 17 -

15-410, S04

Allocation Contiguous
Approach
File location defined as (start, length) Sequential disk accesses are cheap Bookkeeping is easy Dynamic storage allocation (fragmentation, compaction) Must pre-declare file size at creation

Motivation

Issues

- 18 -

15-410, S04

Allocation Linked
Approach
File location defined as (start) Each disk block contains pointer to next Avoid fragmentation problems Allow file growth

Motivation

Issues?

- 19 -

15-410, S04

Allocation Linked
Issues
508-byte blocks don't match memory pages In general, one seek per block read/written - slow! Very hard to access file blocks at random
lseek(fd, 37 * 1024, SEEK_SET);

Benefit
Can recover files even if directories destroyed Linked multi-block clusters, not blocks

Common modification

- 20 -

15-410, S04

Allocation FAT
Used by MS-DOS, OS/2, Windows
Digital cameras, GPS receivers, printers, PalmOS, ...

Semantically same as linked allocation Links stored out of band in table


Result: nice 512-byte sectors for data Next-block pointer array Indexed by block number Next=0 means free
15-410, S04

Table at start of disk


- 21 -

Allocation FAT
7 2 5 -1 3 -1 0 -1
- 22 -

hello.jav dir. c sys.ini

0 1 4

15-410, S04

Allocation - FAT
7 2 5 -1 3 -1 0 -1
- 23 -

hello.jav dir. c sys.ini

0 1 4

15-410, S04

Allocation - FAT
7 2 5 -1 3 -1 0 -1
- 24 -

hello.jav dir. c sys.ini

0 1 4

15-410, S04

Allocation - FAT
7 2 5 -1 3 -1 0 -1
- 25 -

hello.jav dir. c sys.ini

0 1 4

hello.jav: 0, 7
15-410, S04

Allocation FAT
Issues
Damage to FAT scrambles entire disk
Solution: backup FAT Seek to FAT, read, seek to actual block (repeat) Unless FAT can be cached Linear time to walk through FAT

Generally two seeks per block read/write


Still very hard to access random file blocks

- 26 -

15-410, S04

Allocation Indexed
Motivation
Avoid fragmentation problems Allow file growth Improve random access Per-file block array

Approach

99 100 101 3001 3002 -1 -1 -1

3004 -1 -1 -1 6002 -1 -1 -1
15-410, S04

- 27 -

Allocation Indexed
Allows holes
foo.c is sequential foo.db, blocks 1..3 -1
logically blank

sparse allocation
a.k.a. holes read() returns nulls write() requires alloc file size
ls -l ls -s

file size

foo.c 99 100 101 3001 3002 -1 -1 -1

foo.db 3004 -1 -1 -1 6002 -1 -1 -1


15-410, S04

- 28 -

Allocation Indexed
How big should index block be?
Too small: limits file size Too big: lots of wasted pointers Linked Multi-level What Unix actually does

Combining index blocks


- 29 -

15-410, S04

Linked Index Blocks


Last pointer indicates next index block Simple Access is not-so-random
O(n/c) is still O(n) O(n) disk transfers

99 100 101 3001 3002 10459 10460 45789

10461 10462 10463 -1 -1 -1 -1 -1


15-410, S04

- 30 -

Multi-Level Index Blocks


Index blocks of index blocks Does this look familiar? Allows big holes

9987 9988 -1 -1

99 100 101 3001 3002 10459 10460 10461


15-410, S04

- 31 -

Unix Index Blocks


Intuition
Many files are small
Length = 0, length = 1, length < 80, ...


Some files are huge (3 gigabytes) 12 (direct) block pointers: 12 * 8 KB = 96 KB


Availability is free - you need inode to open() file anyway single, double, triple

Clever heuristic in Unix FFS inode




3 indirect block pointers

- 32 -

15-410, S04

Unix Index Blocks


15 16 17 18 100 500 1000 19 20 101 102 501 502 21 22 23 24 103 104 105 106 25 26 27 28 29 30 31 32
15-410, S04

- 33 -

Unix Index Blocks


15 16 17 18 -1 -1 -1

Direct blocks Indirect pointer Double-indirect Triple-indirect

- 34 -

15-410, S04

Unix Index Blocks


15 16 17 18 100 -1 -1 19 20

- 35 -

15-410, S04

Unix Index Blocks


15 16 17 18 100 500 -1 19 20 101 102 21 22 23 24

- 36 -

15-410, S04

Unix Index Blocks


15 16 17 18 100 500 1000 19 20 101 102 501 502 21 22 23 24 103 104 105 106 25 26 27 28 29 30 31 32
15-410, S04

- 37 -

Tracking Free Space


Bit-vector
1 bit per block: boolean free Check each word vs. 0 Use first bit set instruction Text example
1.3 GB disk, 512 B sectors: 332 KB bit vector
  

Need to keep (much of) it in RAM

- 38 -

15-410, S04

Tracking Free Space


Linked list
Superblock points to first free block Each free block points to next Free block can point to multiple free blocks
512 bytes = 128 4-byte block numbers
  

Cost to allocate N blocks is linear


FAT approach provides free-block list for free (block,sequential-block-count)


Keep free-extent lists




- 39 -

15-410, S04

Unified Buffer Cache


Some memory frames back virtual pages Some memory frames cache file blocks Would be silly to double-cache vmem pages
Page cache, file-system cache often totally independent
Page cache chunks according to hardware page size File cache chunks accoding to file system block size Different code, different RAM pools
     

Observation
How much RAM to devote to each one? Why not have just one cache?
Mix automatically varies according to load
15-410, S04


- 40 -

Cache tricks
Read-ahead for (i = 0; i < filesize; ++i) putc(getc(infile), outfile);
System observes sequential reads
can pipeline reads to overlap computation , read latency
   

Free-behind
Discard buffer from cache when next is requested Good for large files Anti-LRU

- 41 -

15-410, S04

Recovery
System crash...now what?
Some RAM contents were lost Free-space list on disk may be wrong Scan file system
Check invariants Unreferenced files Double-allocated blocks Unallocated blocks Fix problems Expert user???
   

- 42 -

15-410, S04

Backups
Incremental approach
Monthly: dump entire file system Weekly: dump changes since last monthly Daily: dump changes since last weekly Collect changes since yesterday
Scan file system by modification time
   

Merge approach - www.teradactyl.com


Two tape drives merge yesterday's tape, today's delta


- 43 -

15-410, S04

Summary
Block-mapping problem
Similar to virtual-to-physical mapping for memory Large, often-sparse address spaces
Holes not the common case, but not impossible
      

Map any logical address to any physical address Key difference: file maps often don' fit in memory t Multiple file system types on one machine Grow your block-allocation map ...
15-410, S04

Insert a level of indirection

- 44 -

You might also like