ZFS: The Last Word in File Systems
Jeff Bonwick
ZFS Overview
Immense capacity
Simple administration
"You're going to put a lot of people out of work." (Jarod Jenson, ZFS beta customer)
Smokin' performance
The Trouble with Existing Filesystems
No defense against silent data corruption
  Any defect in disk, controller, cable, driver, or firmware can corrupt data silently; like running a server without ECC memory
Brutal to manage
  Labels, partitions, volumes, provisioning, grow/shrink, /etc/vfstab...
  Lots of limits: filesystem/volume size, file size, number of files, files per directory, number of snapshots, ...
  Not portable between platforms (e.g. x86 to/from SPARC)
Dog slow
  Linear-time create, fat locks, fixed block size, naïve prefetch, slow random writes, dirty region logging
ZFS Objective: End the Suffering
  Figure out why it's gotten so complicated
  Blow away 20 years of obsolete assumptions
  Design an integrated system from scratch
ZFS Design Principles
Pooled storage
  Completely eliminates the antique notion of volumes
  Does for storage what VM did for memory
End-to-end data integrity
  Historically considered too expensive
  Turns out, no it isn't
  And the alternative is unacceptable
Transactional operation
  Keeps things always consistent on disk
  Removes almost all constraints on I/O order
  Allows us to get huge performance wins
Why Volumes Exist
  Rewrite filesystems to handle many disks: hard
  Insert a little shim (volume) to cobble disks together: easy
  Filesystems and volume managers sold as separate products
  Inherent problems in the FS/volume interface can't be fixed
[Diagram: one FS per 1G disk, versus an FS per volume: a 2G concat (lower 1G + upper 1G), a 2G stripe (even 1G + odd 1G), and a 1G mirror (left 1G + right 1G)]
FS/Volume Model
  Abstraction: virtual disk
  Partition/volume for each FS
  Grow/shrink by hand
  Each FS has limited bandwidth
  Storage is fragmented, stranded
ZFS Pooled Storage
  Abstraction: malloc/free
  No partitions to manage
  Grow/shrink automatically
  All bandwidth always available
  All storage in the pool is shared
FS/Volume I/O Stack
  FS: block-device interface
    Write this block, then that block, ...
    Loss of power = loss of on-disk consistency
    Workaround: journaling, which is slow & complex
  Volume: block-device interface
    Write each block to each disk immediately to keep mirrors in sync
    Loss of power = resync
    Synchronous and slow
ZFS I/O Stack
  ZFS: object-based transactions
    Make these changes to these objects, all-or-nothing
  DMU: transaction group commit
    Again, all-or-nothing
    Always consistent on disk
    No journal: not needed
  Storage pool: transaction group batch I/O
    Schedule, aggregate, and issue I/O at will
    No resync if power lost
    Runs at platter speed
Everything is copy-on-write
  Never overwrite live data
  On-disk state always valid: no windows of vulnerability
  No need for fsck(1M)
Everything is transactional
  Related changes succeed or fail as a whole
  No need for journaling
Everything is checksummed
  No silent data corruption
  No panics due to silently corrupted metadata
Copy-On-Write Transactions
  1. Initial block tree
  2. COW some blocks
  3. COW the indirect blocks above them
  4. Rewrite the uberblock (atomic)
Bonus: Constant-Time Snapshots
  Just retain the old uberblock
  [Diagram: the snapshot uberblock and the current uberblock point at trees that share every unmodified block]
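To make the sequence concrete, here is a toy C sketch of a copy-on-write update; it is illustration only, not ZFS source, and every name in it (block_t, indirect_t, uberblock_t, cow_leaf, cow_indirect) is invented. Data blocks are never overwritten, modified paths are copied bottom-up, the single atomic step is swinging the uberblock to the new root, and keeping the old uberblock around amounts to a free snapshot.

    /* Toy copy-on-write tree: illustration only, not ZFS source. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct block {          /* a leaf data block */
        char data[32];
    } block_t;

    typedef struct indirect {       /* an indirect block: pointers to leaves */
        block_t *child[4];
    } indirect_t;

    typedef struct uberblock {      /* root of the whole tree */
        indirect_t *root;
    } uberblock_t;

    /* COW a leaf: allocate a new block, never touch the old one. */
    static block_t *cow_leaf(const block_t *old, const char *newdata)
    {
        block_t *nb = malloc(sizeof(*nb));
        (void)old;                          /* the old copy stays valid on "disk" */
        snprintf(nb->data, sizeof(nb->data), "%s", newdata);
        return nb;
    }

    /* COW the indirect block so it points at the new leaf. */
    static indirect_t *cow_indirect(const indirect_t *old, int slot, block_t *newleaf)
    {
        indirect_t *ni = malloc(sizeof(*ni));
        *ni = *old;                         /* copy all child pointers ...        */
        ni->child[slot] = newleaf;          /* ... then swing just one of them    */
        return ni;
    }

    int main(void)
    {
        /* 1. Initial block tree */
        block_t a = { "old A" }, b = { "old B" };
        indirect_t root0 = { { &a, &b, NULL, NULL } };
        uberblock_t ub = { &root0 };

        /* 2. COW some blocks, 3. COW the indirect blocks above them */
        block_t    *a2    = cow_leaf(ub.root->child[0], "new A");
        indirect_t *root1 = cow_indirect(ub.root, 0, a2);

        /* Keeping the old root around is a constant-time snapshot. */
        uberblock_t snapshot = ub;

        /* 4. Rewrite the uberblock: the one atomic step */
        ub.root = root1;

        printf("current : %s, %s\n", ub.root->child[0]->data, ub.root->child[1]->data);
        printf("snapshot: %s, %s\n", snapshot.root->child[0]->data, snapshot.root->child[1]->data);
        return 0;
    }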
End-to-End Checksums
Disk block checksums
  Checksum stored with the data block
  Any self-consistent block will pass
  Can't even detect stray writes
  Inherent FS/volume interface limitation
ZFS checksums
  Checksum stored in the parent block pointer (see the sketch below)
  Fault isolation between data and checksum
  Entire pool (block tree) is self-validating
[Diagram: each block pointer stores the address and checksum of the block below it; checksums live with the parents, not with the data blocks they describe]
Catches the whole I/O path's failure modes: phantom writes, misdirected reads and writes, DMA parity errors, driver bugs, accidental overwrite
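A sketch of the checksum-in-parent idea, with invented types rather than ZFS's real block-pointer layout: the parent records both where the child lives and what its checksum should be, so a read is validated against information held in a different block than the data itself.

    /* Checksum-in-parent sketch: illustration only, types are invented. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define BLOCK_SIZE 512

    static uint8_t disk[8][BLOCK_SIZE];     /* pretend disk: 8 blocks */

    /* A trivial stand-in checksum (real ZFS uses fletcher or SHA-256). */
    static uint64_t checksum(const uint8_t *buf, size_t len)
    {
        uint64_t sum = 0;
        for (size_t i = 0; i < len; i++)
            sum = sum * 31 + buf[i];
        return sum;
    }

    /* Block pointer: lives in the PARENT, names the child and its checksum. */
    typedef struct blkptr {
        int      addr;       /* which disk block the child lives in        */
        uint64_t cksum;      /* expected checksum of the child's contents  */
    } blkptr_t;

    /* Write a child block and fill in the parent's block pointer. */
    static void write_block(blkptr_t *bp, int addr, const uint8_t *buf)
    {
        memcpy(disk[addr], buf, BLOCK_SIZE);
        bp->addr  = addr;
        bp->cksum = checksum(buf, BLOCK_SIZE);
    }

    /* Read through the block pointer; fail if the data doesn't match it. */
    static int read_block(const blkptr_t *bp, uint8_t *buf)
    {
        memcpy(buf, disk[bp->addr], BLOCK_SIZE);
        if (checksum(buf, BLOCK_SIZE) != bp->cksum)
            return -1;              /* bit rot, phantom write, misdirection ... */
        return 0;
    }

    int main(void)
    {
        uint8_t data[BLOCK_SIZE] = "important payload";
        uint8_t out[BLOCK_SIZE];
        blkptr_t bp;

        write_block(&bp, 3, data);

        disk[3][7] ^= 0x40;          /* simulate silent corruption on disk */

        printf("read %s\n", read_block(&bp, out) == 0 ? "ok" : "FAILED checksum");
        return 0;
    }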
Traditional Mirroring
  1. Application issues a read. The mirror reads the first disk, which has a corrupt block. It can't tell, so the bad data is passed up to the application.
Self-Healing Data in ZFS
  [Diagram, three panels: the application reads through a ZFS mirror; the checksum exposes the corrupt copy on one disk, ZFS reads the good copy from the other disk, returns it to the application, and repairs the damage]
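A rough sketch of a self-healing mirror read under the same toy checksum scheme (invented helpers, not the real vdev code): accept the first copy whose checksum matches the parent's expectation, then rewrite any copy that disagrees.

    /* Self-healing mirror read, toy version: not the real ZFS vdev code. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define NCOPIES 2
    #define BLKSZ   64

    static uint8_t mirror[NCOPIES][BLKSZ];      /* two copies of one block */

    static uint64_t checksum(const uint8_t *b, size_t n)
    {
        uint64_t s = 0;
        while (n--) s = s * 31 + *b++;
        return s;
    }

    /*
     * Read with the expected checksum taken from the parent block pointer.
     * Returns the index of the copy that was used, or -1 if all copies are bad.
     */
    static int mirror_read(uint8_t *out, uint64_t expected)
    {
        int good = -1;

        for (int i = 0; i < NCOPIES && good < 0; i++) {
            if (checksum(mirror[i], BLKSZ) == expected) {
                memcpy(out, mirror[i], BLKSZ);
                good = i;
            }
        }
        if (good < 0)
            return -1;                            /* every copy is corrupt */

        /* Self-healing: rewrite any copy that doesn't match the good data. */
        for (int i = 0; i < NCOPIES; i++)
            if (memcmp(mirror[i], out, BLKSZ) != 0)
                memcpy(mirror[i], out, BLKSZ);

        return good;
    }

    int main(void)
    {
        uint8_t buf[BLKSZ];

        strcpy((char *)mirror[0], "good data");
        strcpy((char *)mirror[1], "good data");
        uint64_t expected = checksum(mirror[0], BLKSZ);

        mirror[0][0] = 'X';                       /* scribble on one side */

        int used = mirror_read(buf, expected);
        printf("served from copy %d; copy 0 now says \"%s\"\n",
               used, (char *)mirror[0]);
        return 0;
    }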
Traditional RAID-4/RAID-5 parity update requires read-modify-write (see the sketch below):
  Read old data and old parity (two synchronous disk reads)
  Compute new parity = new data ^ old data ^ old parity
  Write new data and new parity
The RAID-5 write hole:
  Loss of power between the data and parity writes leaves the stripe inconsistent; reconstructing from that parity yields garbage
  Workaround: $$$ NVRAM in hardware (i.e., don't lose power!)
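The read-modify-write arithmetic and the write-hole window are easy to see in a toy C example (array-based stand-ins for disks):

    /* RAID-5 read-modify-write sketch: illustration of the write hole. */
    #include <stdint.h>
    #include <stdio.h>

    #define NDATA 3        /* data disks */

    int main(void)
    {
        /* One stripe: three data sectors plus one parity sector (1 byte each). */
        uint8_t data[NDATA] = { 0x11, 0x22, 0x33 };
        uint8_t parity      = data[0] ^ data[1] ^ data[2];

        /* Partial-stripe write: update data[1] only. */
        uint8_t new_val = 0x7e;

        /* 1. Read old data and old parity (two synchronous disk reads). */
        uint8_t old_val    = data[1];
        uint8_t old_parity = parity;

        /* 2. Compute new parity = new data ^ old data ^ old parity. */
        uint8_t new_parity = new_val ^ old_val ^ old_parity;

        /* 3. Write new data ... */
        data[1] = new_val;

        /* ... power fails HERE, before the parity write lands.           */
        /* parity = new_parity;   <-- never happens                       */

        /* Later, disk 0 dies and we reconstruct it from parity + survivors. */
        uint8_t reconstructed = parity ^ data[1] ^ data[2];

        printf("disk 0 was 0x%02x, reconstruction gives 0x%02x (%s)\n",
               0x11, reconstructed,
               reconstructed == 0x11 ? "ok" : "garbage: the write hole");
        (void)new_parity;
        return 0;
    }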
RAID-Z
  Dynamic stripe width: a 3-sector logical block = 3 data blocks + 1 parity block, etc.
  All writes are full-stripe writes
    Eliminates read-modify-write (it's fast)
    Eliminates the RAID-5 write hole (you don't need NVRAM)
  Integrated stack is key: metadata drives reconstruction
  Checksum-driven combinatorial reconstruction (sketch below)
  Currently single-parity; a double-parity version is in the works
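A sketch of the "combinatorial" part for a toy single-parity stripe (invented helpers; real RAID-Z layouts and checksums are richer): when a read fails its checksum and no device reported an error, assume each device in turn is the silently-bad one, rebuild it from parity, and keep the combination that finally satisfies the checksum.

    /* Checksum-driven reconstruction sketch for a single-parity stripe. */
    /* Toy code: real RAID-Z layouts and checksums are more involved.    */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define NDATA 3                      /* data sectors per stripe */

    static uint64_t checksum(const uint8_t *b, size_t n)
    {
        uint64_t s = 0;
        while (n--) s = s * 31 + *b++;
        return s;
    }

    /* Rebuild one data sector from parity and the other data sectors. */
    static uint8_t rebuild(const uint8_t *data, uint8_t parity, int skip)
    {
        uint8_t v = parity;
        for (int i = 0; i < NDATA; i++)
            if (i != skip)
                v ^= data[i];
        return v;
    }

    int main(void)
    {
        uint8_t data[NDATA] = { 'Z', 'F', 'S' };
        uint8_t parity      = data[0] ^ data[1] ^ data[2];
        uint64_t expected   = checksum(data, NDATA);   /* from block pointer */

        data[1] = '?';                   /* silent corruption, no I/O error  */

        uint8_t buf[NDATA];
        memcpy(buf, data, NDATA);

        if (checksum(buf, NDATA) != expected) {
            /* Try each "maybe this one is bad" hypothesis in turn. */
            for (int bad = 0; bad < NDATA; bad++) {
                memcpy(buf, data, NDATA);
                buf[bad] = rebuild(data, parity, bad);
                if (checksum(buf, NDATA) == expected) {
                    printf("sector %d was bad; repaired to '%c%c%c'\n",
                           bad, buf[0], buf[1], buf[2]);
                    break;
                }
            }
        }
        return 0;
    }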
Disk Scrubbing
  Like ECC memory scrubbing, but for disks
  Traverses pool metadata to read every copy of every block
  Verifies each copy against its 256-bit checksum
  Self-healing as it goes
  Traditional resilver: whole-disk copy, no validity check
  ZFS resilver: live-data copy, everything checksummed
  All data-repair code uses the same reliable mechanism
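A minimal sketch of the scrub idea under the same toy scheme (invented structures, not the real traversal code): walk the tree of block pointers, verify every copy of every block against the checksum its parent recorded, and overwrite any copy that disagrees with a verified good one.

    /* Scrub sketch: check every copy of every block against its parent's checksum. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define NCOPIES 2
    #define BLKSZ   16

    typedef struct blk {
        uint8_t copy[NCOPIES][BLKSZ];    /* the block's redundant copies       */
        struct blk *child[2];            /* metadata blocks point at others ... */
        uint64_t child_cksum[2];         /* ... and carry their checksums       */
    } blk_t;

    static uint64_t cksum(const uint8_t *b, size_t n)
    {
        uint64_t s = 0;
        while (n--) s = s * 31 + *b++;
        return s;
    }

    /* Verify one block against the checksum its parent recorded; self-heal. */
    static void scrub_block(blk_t *b, uint64_t expected)
    {
        int good = -1;

        for (int i = 0; i < NCOPIES; i++)
            if (cksum(b->copy[i], BLKSZ) == expected) { good = i; break; }

        for (int i = 0; i < NCOPIES; i++) {
            if (good >= 0 && cksum(b->copy[i], BLKSZ) != expected) {
                printf("repairing copy %d\n", i);
                memcpy(b->copy[i], b->copy[good], BLKSZ);
            }
        }

        for (int c = 0; c < 2; c++)                /* descend to children */
            if (b->child[c])
                scrub_block(b->child[c], b->child_cksum[c]);
    }

    int main(void)
    {
        blk_t leaf = { .copy = { "leaf data", "leaf data" } };
        blk_t root = { .copy = { "root meta", "root meta" } };

        root.child[0]       = &leaf;
        root.child_cksum[0] = cksum(leaf.copy[0], BLKSZ);

        leaf.copy[1][3] ^= 0xff;                    /* rot one copy */

        scrub_block(&root, cksum(root.copy[0], BLKSZ));
        return 0;
    }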
ZFS Scalability
  Immense capacity: ZFS is a 128-bit filesystem
  Moore's Law: need the 65th bit in 10-15 years
  Zettabyte = 70-bit (a billion TB)
  ZFS capacity: 256 quadrillion ZB
  Exceeds the quantum limit of Earth-based storage
Seth Lloyd, "Ultimate physical limits to computation." Nature 406, 1047-1054 (2000)
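As a back-of-the-envelope check of those figures, treating a zettabyte as 2^70 bytes and a quadrillion as roughly 2^50 (about 10^15):

    \[
      2^{128}\ \text{bytes} \;=\; 2^{8}\cdot 2^{50}\cdot 2^{70}\ \text{bytes}
      \;\approx\; 256\ \text{quadrillion ZB},
      \qquad
      2^{70}\ \text{bytes} \;\approx\; 10^{21}\ \text{bytes} \;=\; \text{a billion TB}
    \]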
  No limits on files, directory entries, etc.
  No wacky knobs (e.g. inodes/cg)
Concurrent everything
  Parallel read/write, parallel constant-time directory operations, etc.
ZFS Performance
Copy-on-write design
  Turns random writes into sequential writes
  Maximizes throughput
Multiple block sizes
  Automatically chosen to match workload
Pipelined I/O
  Scoreboarding, priority, deadline scheduling, sorting, aggregation (sketch below)
Intelligent prefetch
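One way to picture the sorting and aggregation step (a toy sketch, not the actual ZFS I/O pipeline): queue the pending writes, sort them by device offset, and merge adjacent requests into one larger I/O before issuing.

    /* I/O sorting and aggregation sketch: not the real ZFS I/O pipeline. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct io {
        long offset;     /* byte offset on the device */
        long size;       /* length of the request     */
    } io_t;

    static int by_offset(const void *a, const void *b)
    {
        long d = ((const io_t *)a)->offset - ((const io_t *)b)->offset;
        return (d > 0) - (d < 0);
    }

    int main(void)
    {
        /* Requests arrive in "random write" order. */
        io_t q[] = { {4096, 4096}, {0, 4096}, {12288, 4096},
                     {8192, 4096}, {65536, 4096} };
        int n = sizeof(q) / sizeof(q[0]);

        /* Sort by offset so the disk sees one sequential sweep ... */
        qsort(q, n, sizeof(io_t), by_offset);

        /* ... then aggregate runs of adjacent requests into single I/Os. */
        for (int i = 0; i < n; ) {
            long off = q[i].offset, len = q[i].size;
            int j = i + 1;
            while (j < n && q[j].offset == off + len) {
                len += q[j].size;           /* merge the adjacent request */
                j++;
            }
            printf("issue I/O: offset %ld, length %ld\n", off, len);
            i = j;
        }
        return 0;
    }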
Dynamic Striping
  Example: add a fifth mirror to a pool of four
    Writes: striped across all five mirrors
    Reads: wherever the data was written
  No need to migrate existing data
    Old data striped across mirrors 1-4
    New data striped across mirrors 1-5
    COW gently reallocates old data
  [Diagram: several ZFS filesystems sharing the pool; after the fifth mirror is added, new writes immediately use all five devices]
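A sketch of the allocation side, with hypothetical names and a plain round-robin policy (real ZFS also weighs free space and other factors): new allocations rotate across whatever top-level devices exist at that moment, so a newly added mirror absorbs writes immediately, while reads simply follow the device recorded in each block pointer.

    /* Dynamic striping sketch: new writes rotate over all current vdevs.   */
    /* Hypothetical, simplified allocator: real ZFS also weighs free space. */
    #include <stdio.h>

    static int nvdevs = 4;           /* the pool starts with four mirrors */
    static int next   = 0;           /* round-robin cursor                */

    typedef struct blkptr { int vdev; long offset; } blkptr_t;

    /* Allocate: pick a vdev among those that exist right now. */
    static blkptr_t alloc_block(long offset)
    {
        blkptr_t bp = { next, offset };
        next = (next + 1) % nvdevs;
        return bp;
    }

    int main(void)
    {
        blkptr_t old_blocks[4], new_blocks[5];

        for (int i = 0; i < 4; i++)               /* old data: spread over 1-4 */
            old_blocks[i] = alloc_block(i * 128);

        nvdevs = 5;                                /* add mirror 5 to the pool  */

        for (int i = 0; i < 5; i++)                /* new data: spread over 1-5 */
            new_blocks[i] = alloc_block(1000 + i * 128);

        for (int i = 0; i < 5; i++)
            printf("new block %d lives on mirror %d\n", i, new_blocks[i].vdev + 1);

        /* Reads never care how many vdevs exist now: they just follow        */
        /* old_blocks[i].vdev, i.e. wherever the data was originally written. */
        (void)old_blocks;
        return 0;
    }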
Intelligent Prefetch
  Multiple independent prefetch streams
    Crucial for any streaming service provider
    Example: three viewers streaming The Matrix (2 hours, 16 minutes) from one file, each at a different offset: Jeff at 0:07, Bill at 0:33, Matt at 1:42
  Automatic length and stride detection
    Great for HPC applications
    ZFS understands the matrix multiply problem: a strided read pattern (e.g. walking a column of a row-major matrix) is detected and prefetched along its stride
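A sketch of stride detection with an invented per-stream structure (the real prefetch code tracks several streams per file): remember each stream's last offset and inferred stride, and once the stride repeats a few times, prefetch a few blocks ahead along it, whether the stride is one block (video streaming) or N blocks (a matrix column).

    /* Strided prefetch detection sketch: not the real ZFS prefetch code. */
    #include <stdio.h>

    typedef struct stream {
        long last;      /* last block read by this stream           */
        long stride;    /* inferred distance between its reads      */
        int  hits;      /* how many times the stride has repeated   */
    } stream_t;

    /* Feed one read into the detector; prefetch once the pattern is clear. */
    static void observe_read(stream_t *s, long blk)
    {
        long stride = blk - s->last;

        if (stride != 0 && stride == s->stride) {
            if (++s->hits >= 2) {                     /* pattern confirmed */
                for (int i = 1; i <= 3; i++)
                    printf("  prefetch block %ld\n", blk + i * stride);
            }
        } else {
            s->stride = stride;                       /* new candidate stride */
            s->hits   = 0;
        }
        s->last = blk;
    }

    int main(void)
    {
        /* Reading one column of a row-major matrix: every 1000th block. */
        stream_t col = { 0, 0, 0 };
        long reads[] = { 0, 1000, 2000, 3000, 4000 };

        for (int i = 0; i < 5; i++) {
            printf("read block %ld\n", reads[i]);
            observe_read(&col, reads[i]);
        }
        return 0;
    }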
ZFS Administration
  All storage is shared: no wasted space, no wasted bandwidth
  Filesystems become administrative control points
    Per-dataset policy: snapshots, compression, backups, privileges, etc.
    Who's using all the space? df(1M) is cheap, du(1) takes forever!
    Manage logically related filesystems as a group
    Control compression, checksums, quotas, reservations, and more
    Mount and share filesystems without /etc/vfstab or /etc/dfs/dfstab
    Inheritance makes large-scale administration a snap
Online everything
Setting Properties
ZFS Snapshots
  Instantaneous creation, unlimited number
  No additional space used: blocks are copied only when they change
  Accessible through .zfs/snapshot in the root of each filesystem
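A sketch of why snapshots cost nothing up front; the fields are hypothetical, but the idea loosely follows ZFS's birth-time bookkeeping: taking a snapshot just records the current transaction group, and a block freed from the live filesystem is reclaimable only if it was born after the newest snapshot, since otherwise the snapshot still references it.

    /* Snapshot accounting sketch: a simplified view of birth-time checks. */
    #include <stdio.h>
    #include <stdbool.h>

    typedef struct block {
        long birth_txg;           /* transaction group in which it was written */
    } block_t;

    static long latest_snapshot_txg = 0;   /* 0 means "no snapshots yet" */

    /* Taking a snapshot is constant time: just remember "now". */
    static void take_snapshot(long current_txg)
    {
        latest_snapshot_txg = current_txg;
    }

    /* When the live filesystem stops using a block, can we reuse its space? */
    static bool can_reclaim(const block_t *b)
    {
        /* Born after the newest snapshot: only the live tree ever saw it. */
        return b->birth_txg > latest_snapshot_txg;
    }

    int main(void)
    {
        block_t old_block = { .birth_txg = 100 };

        take_snapshot(150);                       /* instantaneous, no copying */

        block_t new_block = { .birth_txg = 200 }; /* written (COW) after snap  */

        printf("old block reclaimable: %s\n", can_reclaim(&old_block) ? "yes" : "no");
        printf("new block reclaimable: %s\n", can_reclaim(&new_block) ? "yes" : "no");
        return 0;
    }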
ZFS Clones
  Powered by snapshots: a clone is a writable copy of a snapshot
  Instantaneous creation, unlimited number
  Ideal for storing many private copies of mostly-shared data
  Change the server from x86 to SPARC: it just works
  Adaptive endianness: neither platform pays a tax
    Writes always use the writer's native endianness and set a bit in the block pointer
    Reads byteswap only if the host endianness differs from the block's endianness
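A sketch of the adaptive-endianness rule with a toy block format and invented field names: the writer stores values in its own byte order and records that order in a flag; the reader byteswaps only when its own order differs, so a pool written and read on the same architecture never swaps at all.

    /* Adaptive endianness sketch: write native, byteswap on read only if needed. */
    #include <stdint.h>
    #include <stdio.h>

    enum byteorder { ORDER_LITTLE, ORDER_BIG };

    /* Detect this host's byte order at runtime. */
    static enum byteorder host_order(void)
    {
        uint16_t probe = 1;
        return *(uint8_t *)&probe == 1 ? ORDER_LITTLE : ORDER_BIG;
    }

    static uint64_t byteswap64(uint64_t x)
    {
        x = (x >> 32) | (x << 32);
        x = ((x & 0xffff0000ffff0000ULL) >> 16) | ((x & 0x0000ffff0000ffffULL) << 16);
        x = ((x & 0xff00ff00ff00ff00ULL) >>  8) | ((x & 0x00ff00ff00ff00ffULL) <<  8);
        return x;
    }

    /* Toy on-disk block: payload plus a flag naming the writer's byte order. */
    typedef struct block {
        enum byteorder order;
        uint64_t       value;
    } block_t;

    /* Writes always use native endianness; just remember which one it was. */
    static void write_block(block_t *b, uint64_t value)
    {
        b->order = host_order();
        b->value = value;               /* stored exactly as the CPU holds it */
    }

    /* Reads byteswap only if host endianness != block endianness. */
    static uint64_t read_block(const block_t *b)
    {
        return b->order == host_order() ? b->value : byteswap64(b->value);
    }

    int main(void)
    {
        block_t b;
        write_block(&b, 0x123456789abcdef0ULL);

        /* Same host: no swap. A host of the other endianness would swap once. */
        printf("read back: 0x%016llx\n", (unsigned long long)read_block(&b));
        return 0;
    }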
  Forget about device paths, config files, /etc/vfstab, etc.
  ZFS will share/unshare, mount/unmount, etc. as necessary
NFSv4/NT-style ACLs
  Allow/deny with inheritance
Cryptographic data authentication
  User-selectable 256-bit checksum algorithms, including SHA-256
  Data can't be forged: the checksums detect it
  The uberblock checksum provides a digital signature for the entire pool
Encryption (coming soon): protects against spying, SAN snooping, physical device theft
Secure deletion (coming soon): thoroughly erases freed blocks
Object-Based Storage
  The DMU is a general-purpose transactional object store that can back:
    Filesystems
    Databases
    Swap space
    Sparse volume emulation
    Third-party applications
  [Diagram: NFS, database, and application consumers layered over UFS, iSCSI, raw, and swap devices, all ultimately served by the same object store]
ZFS Test Methodology
  ZFS was designed to run in either user or kernel context
  The nightly ztest program does all of the following in parallel:
    Read, write, create, and delete files and directories
    Create and destroy entire filesystems and storage pools
    Turn compression on and off (while the filesystem is active)
    Change checksum algorithms (while the filesystem is active)
    Add and remove devices (while the pool is active)
    Change I/O caching and scheduling policies (while the pool is active)
    Scribble random garbage on one side of a live mirror to test self-healing data
    Force violent crashes to simulate power loss, then verify pool integrity
  Probably more abuse in 20 seconds than you'd see in a lifetime
  ZFS has been subjected to over a million forced, violent crashes without losing data integrity or leaking a single block
ZFS Summary
End the Suffering
Free Your Mind
Simple
Powerful
Safe
Fast
Open
http://www.opensolaris.org/os/community/zfs
Free
ZFS
Jeff Bonwick