
ZFS
THE LAST WORD IN FILE SYSTEMS

Jeff Bonwick
Distinguished Engineer, Sun Microsystems

ZFS The Last Word in File Systems

ZFS Overview

Provable data integrity

Detects and corrects silent data corruption

Immense capacity

The world's first 128-bit filesystem

Simple administration

"You're going to put a lot of people out of work." (Jarod Jenson, ZFS beta customer)

Smokin' performance


Trouble With Existing Filesystems

No defense against silent data corruption
  Any defect in disk, controller, cable, driver, or firmware can corrupt data silently; like running a server without ECC memory

Brutal to manage
  Labels, partitions, volumes, provisioning, grow/shrink, /etc/vfstab...
  Lots of limits: filesystem/volume size, file size, number of files, files per directory, number of snapshots, ...
  Not portable between platforms (e.g. x86 to/from SPARC)

Dog slow
  Linear-time create, fat locks, fixed block size, naive prefetch, slow random writes, dirty region logging


ZFS Objective: End the Suffering

Data management should be a pleasure
  Simple
  Powerful
  Safe
  Fast


You Can't Get There From Here: Free Your Mind

Figure out why it's gotten so complicated
Blow away 20 years of obsolete assumptions
Design an integrated system from scratch


ZFS Design Principles

Pooled storage
  Completely eliminates the antique notion of volumes
  Does for storage what VM did for memory

End-to-end data integrity
  Historically considered "too expensive"
  Turns out, no it isn't
  And the alternative is unacceptable

Transactional operation
  Keeps things always consistent on disk
  Removes almost all constraints on I/O order
  Allows us to get huge performance wins


Why Volumes Exist


In the beginning, each filesystem managed a single disk.

Customers wanted more space, bandwidth, reliability
  Rewrite filesystems to handle many disks: hard
  Insert a little shim ("volume") to cobble disks together: easy

An industry grew up around the FS/volume model
  Filesystems, volume managers sold as separate products
  Inherent problems in FS/volume interface can't be fixed


[Figure: four stacks. FS on a single 1G disk; FS on a 2G concat volume (lower 1G + upper 1G); FS on a 2G stripe volume (even 1G + odd 1G); FS on a 1G mirror volume (left 1G + right 1G)]


FS/Volume Model vs. ZFS


Traditional Volumes
  Abstraction: virtual disk
  Partition/volume for each FS
  Grow/shrink by hand
  Each FS has limited bandwidth
  Storage is fragmented, stranded

ZFS Pooled Storage
  Abstraction: malloc/free
  No partitions to manage
  Grow/shrink automatically
  All bandwidth always available
  All storage in the pool is shared

[Figure: three FS/Volume pairs, each on its own disks, next to several ZFS filesystems sharing one ZFS Storage Pool]


FS/Volume Model vs. ZFS


FS/Volume I/O Stack

  FS to Volume: Block Device Interface
    "Write this block, then that block, ..."
    Loss of power = loss of on-disk consistency
    Workaround: journaling, which is slow & complex

  Volume to disks: Block Device Interface
    Write each block to each disk immediately to keep mirrors in sync
    Loss of power = resync
    Synchronous and slow

ZFS I/O Stack

  ZFS to DMU: Object-Based Transactions
    "Make these 7 changes to these 3 objects"
    All-or-nothing

  DMU to Storage Pool: Transaction Group Commit
    Again, all-or-nothing
    Always consistent on disk
    No journal: not needed

  Storage Pool to disks: Transaction Group Batch I/O
    Schedule, aggregate, and issue I/O at will
    No resync if power lost
    Runs at platter speed


ZFS Data Integrity Model

Everything is copy-on-write
  Never overwrite live data
  On-disk state always valid: no "windows of vulnerability"
  No need for fsck(1M)

Everything is transactional
  Related changes succeed or fail as a whole
  No need for journaling

Everything is checksummed
  No silent data corruption
  No panics due to silently corrupted metadata


Copy-On-Write Transactions
1. Initial block tree
2. COW some blocks
3. COW indirect blocks
4. Rewrite uberblock (atomic)
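The four steps above can be sketched in a few lines of Python. This is only an illustrative in-memory model, not the ZFS on-disk format: the names `Block`, `Pool`, and `cow_update` are hypothetical, and a Python attribute assignment stands in for the atomic uberblock rewrite.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """An immutable block: either leaf data or a tuple of child blocks."""
    data: bytes = b""
    children: tuple = ()


def cow_update(node, path, new_data):
    """Return a new tree with the leaf at `path` replaced; `node` is never modified."""
    if not path:
        return Block(data=new_data)             # step 2: COW the data block
    kids = list(node.children)
    kids[path[0]] = cow_update(kids[path[0]], path[1:], new_data)
    return Block(children=tuple(kids))          # step 3: COW the indirect block


class Pool:
    """Holds a single pointer to the root of the live tree (the 'uberblock' here)."""

    def __init__(self, root):
        self.uberblock = root

    def commit(self, new_root):
        self.uberblock = new_root               # step 4: one pointer swap


# Step 1: initial block tree with two indirect blocks and four leaves.
left = Block(children=(Block(data=b"a"), Block(data=b"b")))
right = Block(children=(Block(data=b"c"), Block(data=b"d")))
pool = Pool(Block(children=(left, right)))

old_root = pool.uberblock
new_root = cow_update(old_root, [0, 1], b"B")   # live tree is still untouched
pool.commit(new_root)
```

Until `commit` runs, every reader still sees the old, fully consistent tree; afterwards every reader sees the new one. Subtrees that were not on the modified path (here `right`) are shared between the two roots rather than copied.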


Bonus: Constant-Time Snapshots

At end of TX group, don't free COWed blocks

Actually cheaper to take a snapshot than not!

[Figure: block tree with two roots, the snapshot uberblock and the current uberblock]


End-to-End Checksums
Disk Block Checksums
  Checksum stored with data block
  Any self-consistent block will pass
  Can't even detect stray writes
  Inherent FS/volume interface limitation

  [Figure: data blocks, each with its own checksum stored alongside]

  Disk checksum only validates media
    Detected: bit rot
    Not detected: phantom writes, misdirected reads and writes, DMA parity errors, driver bugs, accidental overwrite

ZFS Checksum Trees
  Checksum stored in parent block pointer
  Fault isolation between data and checksum
  Entire pool (block tree) is self-validating

  [Figure: parent blocks holding address + checksum pairs that point to child blocks and, at the bottom, data blocks]

  ZFS validates the entire I/O path
    Detected: bit rot, phantom writes, misdirected reads and writes, DMA parity errors, driver bugs, accidental overwrite
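The difference between the two checksum placements can be illustrated with a toy model. The sketch below is an assumption-laden simplification: each "disk" is a Python dict from address to contents, SHA-256 stands in for whichever checksum is configured, and all function names are hypothetical.

```python
import hashlib


def cksum(data):
    return hashlib.sha256(data).digest()


def write_self_checksummed(disk, addr, data):
    """Disk-block style: the checksum is stored right next to the data it covers."""
    disk[addr] = (data, cksum(data))


def read_self_checksummed(disk, addr):
    data, stored = disk[addr]
    return data if cksum(data) == stored else None


def write_with_pointer(disk, addr, data):
    """Checksum-tree style: the checksum is returned for storage in the parent pointer."""
    disk[addr] = data
    return {"addr": addr, "cksum": cksum(data)}


def read_via_pointer(disk, ptr):
    data = disk[ptr["addr"]]
    return data if cksum(data) == ptr["cksum"] else None


# Scheme 1: a stray write lands on address 3. The block it leaves behind is
# perfectly self-consistent, so the read succeeds with the wrong contents.
disk_a = {}
write_self_checksummed(disk_a, 3, b"payroll records")
write_self_checksummed(disk_a, 3, b"somebody else's data")
stale_read = read_self_checksummed(disk_a, 3)

# Scheme 2: the same stray write, but the parent still expects the old checksum.
disk_b = {}
ptr = write_with_pointer(disk_b, 3, b"payroll records")
disk_b[3] = b"somebody else's data"
checked_read = read_via_pointer(disk_b, ptr)    # None: mismatch detected
```

Because the expected checksum travels with the pointer rather than with the block, a block that is internally valid but is not the block the parent asked for still fails verification.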


Traditional Mirroring
1. Application issues a read. Mirror reads the first disk, which has a corrupt block. It can't tell.
2. Volume manager passes bad block up to filesystem. If it's a metadata block, the filesystem panics. If not...
3. Filesystem returns bad data to the application.

[Figure: the Application / FS / xxVM mirror stack, shown once for each step]


Self-Healing Data in ZFS


1. Application issues a read. ZFS mirror tries the first disk. Checksum reveals that the block is corrupt on disk.
2. ZFS tries the second disk. Checksum indicates that the block is good.
3. ZFS returns good data to the application and repairs the damaged block.

[Figure: the Application / ZFS mirror stack, shown once for each step]
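The three steps above amount to "try each side until the checksum matches, then rewrite the sides that failed". A minimal sketch, assuming each mirror side is a dict from address to bytes and that the expected checksum comes from the parent block pointer; `mirror_read` is a hypothetical name, not a ZFS function.

```python
import hashlib


def cksum(data):
    return hashlib.sha256(data).digest()


def mirror_read(sides, addr, expected):
    """Return verified data for `addr`, repairing every side that failed on the way."""
    failed = []
    for side in sides:
        data = side.get(addr)
        if data is not None and cksum(data) == expected:
            for bad in failed:
                bad[addr] = data        # step 3: repair the damaged copy
            return data
        failed.append(side)             # steps 1-2: this copy is corrupt, try the next
    raise IOError("no side of the mirror holds a valid copy")


first_disk = {0: b"corrupted!!"}
second_disk = {0: b"hello world"}
result = mirror_read([first_disk, second_disk], 0, cksum(b"hello world"))
```

After the call, `first_disk` holds the good data again, so the next read succeeds on the first try.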


Traditional RAID-4 and RAID-5

Several data disks plus one parity disk
  [Figure: XOR (^) across all disks in the stripe = 0]

Fatal flaw: partial stripe writes
  Parity update requires read-modify-write (slow)
    Read old data and old parity (two synchronous disk reads)
    Compute new parity = new data ^ old data ^ old parity
    Write new data and new parity
  Suffers from write hole:
    [Figure: XOR (^) across all disks in the stripe = garbage]
    Loss of power between data and parity writes will corrupt data
    Workaround: $$$ NVRAM in hardware (i.e., don't lose power!)

Can't detect or correct silent data corruption
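The parity formula and the write hole can both be reproduced on a tiny stripe. The following is a sketch with hypothetical helper names and two-byte "disks"; the `crash_before_parity` flag simulates losing power after the data write but before the parity write.

```python
from functools import reduce


def xor(*blocks):
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def partial_write(data, parity, index, new_block, crash_before_parity=False):
    """Read-modify-write of one data block; returns the stripe as it ends up on disk."""
    new_parity = xor(new_block, data[index], parity)   # new data ^ old data ^ old parity
    data = data[:index] + [new_block] + data[index + 1:]
    if crash_before_parity:
        return data, parity          # data reached disk, parity did not
    return data, new_parity


def rebuild(data, parity, lost_index):
    """Reconstruct one lost data block from the survivors and the parity."""
    survivors = [blk for i, blk in enumerate(data) if i != lost_index]
    return xor(parity, *survivors)


data = [b"\x01\x02", b"\x10\x20", b"\x0a\x0b"]
parity = xor(*data)                                    # XOR over the stripe is now zero

ok_data, ok_parity = partial_write(data, parity, 1, b"\xff\xee")
torn_data, torn_parity = partial_write(data, parity, 1, b"\xff\xee",
                                       crash_before_parity=True)
```

In the torn stripe the XOR over all blocks is no longer zero, and rebuilding an untouched block (index 0) from the survivors yields bytes that were never written. Nothing in the stripe itself reveals that this has happened.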


RAID-Z

Dynamic stripe width
  Each logical block is its own stripe
    3 sectors (logical) = 3 data blocks + 1 parity block, etc.
  Integrated stack is key: metadata drives reconstruction
  Currently single-parity; double-parity version in the works

All writes are full-stripe writes
  Eliminates read-modify-write (it's fast)
  Eliminates the RAID-5 write hole (you don't need NVRAM)

Detects and corrects silent data corruption
  Checksum-driven combinatorial reconstruction

No special hardware: ZFS loves cheap disks
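One way to read "checksum-driven combinatorial reconstruction" for the single-parity case: if the block fails its checksum, assume each data column in turn is the bad one, rebuild it from parity, and accept the first combination that verifies. The sketch below follows that reading under simplifying assumptions (equal-length columns, SHA-256 as the checksum, hypothetical names); it is not the actual RAID-Z layout or code.

```python
import hashlib
from functools import reduce


def xor(*blocks):
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def cksum(data):
    return hashlib.sha256(data).digest()


def read_stripe(columns, parity, expected):
    """Return the logical block, reconstructing one silently bad column if needed."""
    block = b"".join(columns)
    if cksum(block) == expected:
        return block                                   # common case: nothing is wrong
    for i in range(len(columns)):
        others = columns[:i] + columns[i + 1:]
        candidate = list(columns)
        candidate[i] = xor(parity, *others)            # pretend column i is the bad one
        block = b"".join(candidate)
        if cksum(block) == expected:
            return block                               # the checksum picks the winner
    raise IOError("more damage than single parity can repair")


good = [b"AB", b"CD", b"EF"]
parity = xor(*good)
expected = cksum(b"".join(good))       # in ZFS this would come from the parent pointer

damaged = [b"AB", b"XX", b"EF"]        # the disk returned wrong data without an error
recovered = read_stripe(damaged, parity, expected)
```

The disk never reported an error for the damaged column; it is the checksum held outside the stripe that both detects the problem and identifies which reconstruction is correct.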


Disk Scrubbing

Finds latent errors while they're still correctable
  ECC memory scrubbing for disks

Verifies the integrity of all data
  Traverses pool metadata to read every copy of every block
  Verifies each copy against its 256-bit checksum
  Self-healing as it goes

Provides fast and reliable resilvering
  Traditional resilver: whole-disk copy, no validity check
  ZFS resilver: live-data copy, everything checksummed
  All data-repair code uses the same reliable mechanism
    Mirror resilver, RAID-Z resilver, attach, replace, scrub


ZFS Scalability

Immense capacity (128-bit)
  Moore's Law: need 65th bit in 10-15 years
  Zettabyte = 70-bit (a billion TB)
  ZFS capacity: 256 quadrillion ZB
  Exceeds quantum limit of Earth-based storage
    Seth Lloyd, "Ultimate physical limits to computation." Nature 406, 1047-1054 (2000)

100% dynamic metadata
  No limits on files, directory entries, etc.
  No wacky knobs (e.g. inodes/cg)

Concurrent everything
  Parallel read/write, parallel constant-time directory operations, etc.
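The capacity figures above are consistent with each other if the prefixes are read as binary multiples, which is an assumption here: "billion" as 2^30, "quadrillion" as 2^50, and 128 bits addressing individual bytes. A quick check:

```python
TERABYTE = 2 ** 40
ZETTABYTE = 2 ** 70

# "Zettabyte = 70-bit (a billion TB)", with a binary billion of 2**30
billion_tb = 2 ** 30 * TERABYTE

# "ZFS capacity: 256 quadrillion ZB", with a binary quadrillion of 2**50
capacity_bytes = 2 ** 128
capacity_zb = capacity_bytes // ZETTABYTE          # 2**58 zettabytes
quadrillions_of_zb = capacity_zb // 2 ** 50        # 2**8

print(billion_tb == ZETTABYTE, quadrillions_of_zb)
```

With decimal prefixes the same 2^58 ZB comes out closer to 288 quadrillion, so the slide's round number only works in powers of two.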


ZFS Performance

Copy-on-write design
  Turns random writes into sequential writes

Dynamic striping across all devices
  Maximizes throughput

Multiple block sizes
  Automatically chosen to match workload

Pipelined I/O
  Scoreboarding, priority, deadline scheduling, sorting, aggregation

Intelligent prefetch


Dynamic Striping

Automatically distributes load across all devices

With four mirrors
  Writes: striped across all four mirrors
  Reads: wherever the data was written
  Block allocation policy considers:
    Capacity
    Performance (latency, BW)
    Health (degraded mirrors)

After adding mirror 5
  Writes: striped across all five mirrors
  Reads: wherever the data was written
  No need to migrate existing data
    Old data striped across 1-4
    New data striped across 1-5
    COW gently reallocates old data

[Figure: ZFS filesystems on a ZFS Storage Pool before and after "Add Mirror 5"]


Intelligent Prefetch

Multiple independent prefetch streams
  Crucial for any streaming service provider

  [Figure: The Matrix (2 hours, 16 minutes), with three viewers at different positions: Jeff 0:07, Bill 0:33, Matt 1:42]

Automatic length and stride detection
  Great for HPC applications
  ZFS understands the matrix multiply problem
    Detects any linear access pattern
    Forward or backward

  [Figure: The Matrix (10K rows, 10K columns): row-major access, column-major storage]
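Stride detection of the kind described above can be approximated by watching the gaps between recent accesses. The sketch below is a deliberately simple stand-in, not the ZFS prefetcher: `detect_stride` and `predict_next` are hypothetical names, offsets are element indices, and a pattern counts as linear once `min_run` consecutive accesses are evenly spaced.

```python
def detect_stride(offsets, min_run=3):
    """Return the constant gap between the last `min_run` accesses, or None."""
    if len(offsets) < min_run:
        return None
    recent = offsets[-min_run:]
    gaps = {later - earlier for earlier, later in zip(recent, recent[1:])}
    if len(gaps) != 1:
        return None                      # accesses are not evenly spaced
    stride = gaps.pop()
    return stride if stride != 0 else None


def predict_next(offsets, count=2, min_run=3):
    """Offsets worth prefetching next, or an empty list if no pattern is seen."""
    stride = detect_stride(offsets, min_run)
    if stride is None:
        return []
    return [offsets[-1] + stride * k for k in range(1, count + 1)]


# Walking along a row of a 10K x 10K matrix stored column-major touches
# elements that are 10,000 apart.
row_walk = [0, 10_000, 20_000]
forward = predict_next(row_walk)

# A negative stride covers the "backward" case.
backward = predict_next([90, 80, 70], count=1)
```

Tracking several such histories at once, one per reader, would correspond to the "multiple independent prefetch streams" on the same slide.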


ZFS Administration

Pooled storage: no more volumes!
  All storage is shared: no wasted space, no wasted bandwidth

Hierarchical filesystems with inherited properties
  Filesystems become administrative control points
    Per-dataset policy: snapshots, compression, backups, privileges, etc.
    Who's using all the space? df(1M) is cheap, du(1) takes forever!
  Manage logically related filesystems as a group
  Control compression, checksums, quotas, reservations, and more
  Mount and share filesystems without /etc/vfstab or /etc/dfs/dfstab
  Inheritance makes large-scale administration a snap

Online everything


Creating Pools and Filesystems

Create a mirrored pool named tank


# zpool create tank mirror c0t0d0 c1t0d0

Create home directory filesystem, mounted at /export/home


# zfs create tank/home
# zfs set mountpoint=/export/home tank/home

Create home directories for several users


# zfs create tank/home/ahrens
# zfs create tank/home/bonwick
# zfs create tank/home/billm

Note: automatically mounted at /export/home/{ahrens,bonwick,billm} thanks to inheritance

Add more space to the pool


# zpool add tank mirror c2t0d0 c3t0d0
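The inheritance note on this slide (home directories mounting under /export/home without being told to) can be modeled as a walk up the dataset name to the nearest ancestor that sets the property. This is a sketch of the idea only: `effective_mountpoint` and the `local_props` dict are hypothetical, and the fallback of mounting at "/" plus the dataset name is an assumption about the default.

```python
def effective_mountpoint(dataset, local_props):
    """Resolve a dataset's mountpoint from the nearest ancestor that sets one."""
    parts = dataset.split("/")
    for depth in range(len(parts), 0, -1):
        ancestor = "/".join(parts[:depth])
        mountpoint = local_props.get(ancestor, {}).get("mountpoint")
        if mountpoint is not None:
            # Descendants append their own path below the ancestor's mountpoint.
            suffix = "/".join(parts[depth:])
            return mountpoint + ("/" + suffix if suffix else "")
    return "/" + dataset                 # assumed default when nothing is set


# Only tank/home sets the property, as in the commands above.
local_props = {"tank/home": {"mountpoint": "/export/home"}}

for name in ("tank/home/ahrens", "tank/home/bonwick", "tank/home/billm"):
    print(name, "->", effective_mountpoint(name, local_props))
```

Setting the property once on `tank/home` is enough; each child derives its own path, which is why no per-user mount configuration appears on the slide.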


Setting Properties

Automatically NFS-export all home directories


# zfs set sharenfs=rw tank/home

Turn on compression for everything in the pool


# zfs set compression=on tank

Limit Eric to a quota of 10g


# zfs set quota=10g tank/home/eschrock

Guarantee Tabriz a reservation of 20g


# zfs set reservation=20g tank/home/tabriz


ZFS Snapshots

Read-only point-in-time copy of a filesystem
  Instantaneous creation, unlimited number
  No additional space used: blocks copied only when they change
  Accessible through .zfs/snapshot in root of each filesystem
    Allows users to recover files without sysadmin intervention

Take a snapshot of Mark's home directory


# zfs snapshot tank/home/marks@tuesday

Roll back to a previous snapshot


# zfs rollback tank/home/perrin@monday

Take a look at Wednesday's version of foo.c


$ cat ~maybee/.zfs/snapshot/wednesday/foo.c


ZFS Clones

Writable copy of a snapshot
  Instantaneous creation, unlimited number

Ideal for storing many private copies of mostly-shared data
  Software installations
  Workspaces
  Diskless clients

Create a clone of your OpenSolaris source code


# zfs clone tank/solaris@monday tank/ws/lori/fix


ZFS Backup / Restore

Powered by snapshots
  Full backup: any snapshot
  Incremental backup: any snapshot delta

Very fast: cost proportional to data changed
  So efficient it can drive remote replication

Generate a full backup

# zfs backup tank/fs@A >/backup/A

Generate an incremental backup


# zfs backup -i tank/fs@A tank/fs@B >/backup/B-A

Remote replication: send incremental once per minute


# zfs backup -i tank/fs@11:31 tank/fs@11:32 | ssh host zfs restore -d /tank/fs
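Why an incremental can cost only as much as the data that changed: in a copy-on-write tree, rewriting a leaf also rewrites every block above it, so any subtree older than the earlier snapshot can be skipped without looking inside. The sketch below assumes each block records the transaction group in which it was written (`birth`); that field, and the names `Node` and `changed_since`, are assumptions for illustration and are not taken from the slide.

```python
from dataclasses import dataclass


@dataclass
class Node:
    birth: int               # transaction group in which this block was written
    data: bytes = b""
    children: tuple = ()


def changed_since(node, txg):
    """Yield leaf data written after `txg`, pruning subtrees that are older."""
    if node.birth <= txg:
        return               # nothing below here changed: skip the whole subtree
    if not node.children:
        yield node.data
        return
    for child in node.children:
        yield from changed_since(child, txg)


# Snapshot A was taken at txg 1; every block in it was born at txg 1.
a1, a2, a3, a4 = (Node(1, data=d) for d in (b"w", b"x", b"y", b"z"))
untouched = Node(1, children=(a3, a4))

# By snapshot B (txg 2) one leaf was rewritten, along with its ancestors.
rewritten = Node(2, children=(a1, Node(2, data=b"X")))
root_b = Node(2, children=(rewritten, untouched))

delta = list(changed_since(root_b, 1))
```

Here the walk visits the rewritten path and stops at `untouched` without descending, which is the behaviour the "cost proportional to data changed" bullet describes.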


ZFS Data Migration

Host-neutral on-disk format
  Change server from x86 to SPARC, it just works
  Adaptive endianness: neither platform pays a tax
    Writes always use native endianness, set bit in block pointer
    Reads byteswap only if host endianness != block endianness

ZFS takes care of everything
  Forget about device paths, config files, /etc/vfstab, etc.
  ZFS will share/unshare, mount/unmount, etc. as necessary

Export pool from the old server


old# zpool export tank

Physically move disks and import pool to the new server


new# zpool import tank
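The adaptive endianness rule on this slide (write native, record the byte order, swap on read only when the orders differ) fits in a few lines. A minimal sketch: the dict stands in for a block plus the flag in its block pointer, a single 64-bit integer stands in for the payload, and both function names are hypothetical.

```python
import struct
import sys


def write_block(value, host_order=sys.byteorder):
    """Store a 64-bit value in the writer's native order and record that order."""
    fmt = "<Q" if host_order == "little" else ">Q"
    return {"order": host_order, "payload": struct.pack(fmt, value)}


def read_block(block, host_order=sys.byteorder):
    """Load the value natively; byteswap only if the writer's order differs."""
    fmt = "<Q" if host_order == "little" else ">Q"
    (value,) = struct.unpack(fmt, block["payload"])
    if block["order"] != host_order:
        value = int.from_bytes(value.to_bytes(8, "little"), "big")   # byteswap
    return value


# Written on a big-endian host (e.g. SPARC), read on a little-endian one (e.g. x86).
block = write_block(0x0102030405060708, host_order="big")
moved = read_block(block, host_order="little")

# Written and read on the same kind of host: no swap is performed.
same = read_block(write_block(42, host_order="little"), host_order="little")
```

A pool that stays on one architecture never executes the swap branch, which is the sense in which neither platform "pays a tax".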


ZFS Data Security

NFSv4/NT-style ACLs
  Allow/deny with inheritance

Authentication via cryptographic checksums
  User-selectable 256-bit checksum algorithms, including SHA-256
  Data can't be forged: checksums detect it
  Uberblock checksum provides digital signature for entire pool

Encryption (coming soon)
  Protects against spying, SAN snooping, physical device theft

Secure deletion (coming soon)
  Thoroughly erases freed blocks


Object-Based Storage

DMU is a general-purpose transactional object store
  Filesystems
  Databases
  Swap space
  Sparse volume emulation
  Third-party applications

[Figure: layered stack. Consumers at the top: NFS, Database, App, UFS, iSCSI, Raw, Swap. Below them the ZFS POSIX Layer and the ZFS Volume Emulator (zvol), both on the Data Management Unit (DMU), which sits on the Storage Pool Allocator (SPA)]


ZFS Test Methodology

A product is only as good as its test suite
  ZFS was designed to run in either user or kernel context
  Nightly "ztest" program does all of the following in parallel:
    Read, write, create, and delete files and directories
    Create and destroy entire filesystems and storage pools
    Turn compression on and off (while filesystem is active)
    Change checksum algorithm (while filesystem is active)
    Add and remove devices (while pool is active)
    Change I/O caching and scheduling policies (while pool is active)
    Scribble random garbage on one side of live mirror to test self-healing data
    Force violent crashes to simulate power loss, then verify pool integrity
  Probably more abuse in 20 seconds than you'd see in a lifetime
  ZFS has been subjected to over a million forced, violent crashes without losing data integrity or leaking a single block


ZFS Summary
End the Suffering, Free Your Mind

Simple
  Concisely expresses the user's intent

Powerful
  Pooled storage, snapshots, clones, compression, scrubbing, RAID-Z

Safe
  Detects and corrects silent data corruption

Fast
  Dynamic striping, intelligent prefetch, pipelined I/O

Open
  http://www.opensolaris.org/os/community/zfs

Free

ZFS
Jeff Bonwick

THE LAST WORD IN FILE SYSTEMS


[email protected]
