Spiffy: Enabling File-System Aware Storage Applications
Kuei Sun, Daniel Fryer, Joseph Chu, Matthew Lakier, Angela Demke Brown, and Ashvin Goel, University of Toronto
https://www.usenix.org/conference/fast18/presentation/sun

This paper is included in the Proceedings of the 16th USENIX Conference on File and Storage Technologies.
February 12–15, 2018 • Oakland, CA, USA
ISBN 978-1-931971-42-3

Open access to the Proceedings of the 16th USENIX Conference on File and Storage Technologies is sponsored by USENIX.

Abstract

Many file-system applications such as defragmentation tools, file system checkers or data recovery tools, operate at the storage layer. Today, developers of these storage applications require detailed knowledge of the file-system format, which takes a significant amount of time to learn, often by trial and error, due to insufficient documentation or specification of the format. Furthermore, these applications perform ad-hoc processing of the file-system metadata, leading to bugs and vulnerabilities.

We propose Spiffy, an annotation language for specifying the on-disk format of a file system. File-system developers annotate the data structures of a file system, and we use these annotations to generate a library that allows identifying, parsing and traversing file-system metadata, providing support for both offline and online storage applications. This approach simplifies the development of storage applications that work across different file systems because it reduces the amount of file-system specific code that needs to be written.

We have written annotations for the Linux Ext4, Btrfs and F2FS file systems, and developed several applications for these file systems, including a type-specific metadata corruptor, a file system converter, and an online storage layer cache that preferentially caches files for certain users. Our experiments show that applications that use the library to access file system metadata can achieve good performance and are robust against file system corruption errors.

1 Introduction

There are many file-system aware storage applications that bypass the virtual file system interface and operate directly on the file system image. These applications require a detailed understanding of the format of a file system, including the ability to identify, parse and traverse file system structures. These applications can operate in an offline or online context, as shown in Table 1. Examples of offline tools include a file system checker that traverses the file system image to check the consistency of its metadata [17], and a data recovery tool that helps recover deleted files [4].

Storage Applications            Category  Purpose
Differentiated services [18]    online    performance
Defragmentation tool            either    performance
File system checker [13]        either    reliability
Data recovery tool [4]          offline   reliability
IO shepherding [12]             online    reliability
Runtime verification [8]        online    reliability
File system conversion tool     offline   administrative
Partition editor [11]           offline   administrative
Type-specific corruption [2]    offline   debugging
Metadata dump tool              offline   debugging

Table 1: Example file-system aware storage applications. Offline applications have exclusive access to the file system; online applications operate on an in-use file system.

Online storage applications need to understand the file-system semantics of blocks as they are accessed at runtime (e.g., whether the block contains data or metadata, whether it belongs to a specific type of file, etc.). These applications improve the performance or reliability of a storage system by performing file-system specific processing at the storage layer. For example, differentiated storage services [18] improve performance by preferentially caching blocks that contain file-system metadata or the data of small files. I/O shepherding [12] improves reliability by using file structure information to implement checksumming and replication. Similarly, Recon [8] improves reliability by verifying the consistency of file-system metadata at the storage layer.

Today, developers of these storage applications perform ad-hoc processing of file system metadata because most file systems do not provide the requisite library code. Even when such library code exists, its interface may not be usable by all storage applications. For example, the libext2fs library only supports offline interpretation of a Linux Ext3/4 file system partition; it does not support online use. Furthermore, the libraries of different file systems, even when they exist, do not provide similar interfaces. As a result, these storage applications have to be developed from scratch, or significantly rewritten for each file system, impeding the adoption of new file systems or new file-system functionality.

To make matters worse, many file systems do not provide detailed and up-to-date documentation of their metadata format. The ad-hoc processing performed by these storage applications is thus error-prone and can lead to system instability, security vulnerability, and data corruption [3]. For example, fsck can sometimes further corrupt a file system [33]. Some storage applications reduce the amount of file-system specific code in their im-

plementation by modifying their target file system and operating system [18, 12]. This approach only works for specific file systems, and can introduce its own bugs. It also requires custom system software, which may be impractical in virtual machine and cloud environments.

Our aim is to reduce the burden of developing file-system aware storage applications. To do so, we enable file system developers to specify the format of their file system using a domain-specific language so that the file system metadata can be parsed, traversed and updated correctly. We introduce Spiffy,¹ a language for annotating file system data structures defined in the C language. Spiffy allows file system developers to unambiguously specify the physical layout of the file system. The annotations handle low level details such as the encoding of specific fields, and the pointer relationships between file system structures. We compile the annotated sources to generate a Spiffy library that provides interfaces for type-safe parsing, traversal and update of file system metadata. The library allows a developer to write actions for different file system metadata structures, invoking file-system specific or generic code as needed, for their offline or online application. We support online applications that need to read metadata, such as differentiated storage services [18], but not ones that need to modify metadata such as online defragmentation.

¹ Specifying and Interpreting the Format of Filesystems

The generic interfaces provided by the library simplify the development of applications that work across different file systems. Consider an application that shows file-system fragmentation by plotting a histogram of the size of free extents in the file system. This application needs to traverse the file system to find and parse structures that represent free space, and then collect the extent information. With Spiffy, the application code for finding and parsing structures is similar for different file systems. File-system specific actions are only needed for collecting the extent information from the free space structures (e.g., bitmaps for Ext4 and free space extents for Btrfs).

The complexity of modern file systems [16] raises several challenges for our specification-based approach. Many aspects of file system structures and their relationships are not captured by their declarations in header files. First, an on-disk pointer in a file-system structure may be implicitly specified, e.g., as an integer, as shown below. The naming convention suggests that this field is a pointer, but that fact cannot be deduced from the structure definition because it is embedded in file system code.

    struct foo {
        __le32 bar_block_ptr;
    };

Second, the interpretation of file system structures can depend on other structures. For example, the size of an inode structure in a Linux Ext3/4 file system is stored in a field within the super block that must be accessed to correctly interpret an inode block. Similarly, many structures are variable sized, with the size information being stored in other structures. Third, the semantics of metadata fields may be context-sensitive. For example, pointers inside an inode structure can refer to either directory blocks or data blocks, depending on the type of the inode. Fourth, the placement of structures on disk may be implicit in the code that operates on them (e.g., an instance of structure B optionally follows structure A) and some structures may not be declared at all (e.g., treating a buffer as an array of integers). Finally, metadata interpretation must be performed efficiently, but it is impractical to load all file-system metadata into memory for large file systems. These challenges are not addressed by existing specification tools, as discussed in Section 7.

In Spiffy, the key to specifying the relationships between file system structures is a pointer annotation that specifies that a field holds an address to a data structure on physical storage. Pointers have an address space type that indicates how the address should be mapped to the physical location. In the struct foo example above, this annotation would help clarify that bar_block_ptr holds an address to a structure of type bar, and its address space type is a (little-endian) block pointer. We expose cross-structure dependencies by using a name resolution mechanism that allows annotations to name the necessary structures unambiguously. We handle context-sensitive fields and structures by providing support for conditional types and conditionally inherited structures. We also provide support for specifying implicit fields that are computed at runtime. Last, annotations can specify the granularity at which the structures should be accessed from storage, allowing efficient data access and reducing the memory footprint of the applications.

Together, these Spiffy features have allowed us to properly annotate three widely deployed file systems, 1) Ext4, an update-in-place file system, 2) Btrfs, a copy-on-write file system, and 3) F2FS, a log-structured file system [15]. We have implemented five applications that are designed to work across file systems: a file system dump tool, a file system corruption tool, a free space display tool, a file system converter, and a storage layer service that preferentially caches data for specific users.

2 Bugs in File-System Applications

We motivate this work by presenting various bugs caused by incorrect parsing of file-system metadata in storage applications (outlined in Table 2). Some of these bugs cause crashes, while others may result in file system corruption. For each bug, we discuss the root cause.

1. An extra memory allocation caused uninitialized bytes

#  Tool       FS     Bug Title                                                                                  Closed
1  libparted  Fat32  #22266: jump instruction and boot code corrupted with random bytes after fat is resized  2016-05
2  ntfsprogs  NTFS   Bug 723343 - Negative Number of Free Clusters in NTFS Not Properly Interpreted           2014-02
3  e2fsck     Ext4   #781110 e2fsprogs: e2fsck does not detect corruption                                     2016-05
4  e2fsck     Ext4   #760275 e2fsprogs: e2fsck corrupts Hurd filesystems                                      2015-05
5  btrfsck    Btrfs  Bug 104141 - Malformed input causing crash / floating point exception in btrfsck         2015-10

Table 2: Bugs due to incorrect parsing of file system formats.
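
As a rough illustration of the kind of format detail behind Bug 2 in Table 2, the sketch below decodes an NTFS MFT record size field under the rule that a positive value counts clusters per record and any other value v denotes 2^|v| bytes. The function name decode_mft_record_size, the signed 8-bit field width, and the overflow guard are assumptions made for this example; it is not code from ntfsprogs or from Spiffy.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not taken from ntfsprogs or Spiffy.
// raw:          on-disk size field, assumed here to be a signed 8-bit value
// cluster_size: bytes per cluster
// Returns false when the decoded size cannot be represented in 64 bits.
bool decode_mft_record_size(std::int8_t raw, std::uint32_t cluster_size,
                            std::uint64_t& size_out) {
    assert(cluster_size > 0);  // precondition of this sketch
    if (raw > 0) {
        // Positive: the field counts clusters per record.
        size_out = static_cast<std::uint64_t>(raw) * cluster_size;
        return true;
    }
    // Otherwise: the record size is 2^|raw| bytes.
    const int shift = -static_cast<int>(raw);
    if (shift >= 64) {
        return false;  // shifting by 64 or more would be undefined behavior
    }
    size_out = std::uint64_t{1} << shift;
    return true;
}
```

With 4096-byte clusters, a raw value of -10 decodes to 1024 bytes and a raw value of 2 decodes to 8192 bytes. A parser that handles only the positive case would misread the first of these.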


to be written to the boot jump field of Fat32 file systems during resizing. Since Windows depends on the correctness of this field, the bug rendered the file system unrecognizable by the operating system.

2. NTFS has a complex specification for the size of the MFT record. If the value is positive, it is interpreted as the number of clusters per record. Otherwise, the size of the record is 2^|value| bytes (e.g., −10 would mean that the record size is 1024 bytes). The developers of ntfsprogs were unaware of this detail, and so the GParted partition editing tool would fail when attempting to resize an NTFS partition.

3. The e2fsck file system checker failed to detect corrupted directory entries if the size field of the entries was set to zero, which resulted in no repair being performed. Ironically, other programs, such as debugfs, ls, and the file system itself, could correctly detect the corruption.

4. Ext2/3/4 inodes contain union fields for storing operating system (OS) specific metadata. A sanity check was omitted in e2fsck prior to accessing this field, and repairs were always performed assuming that the creator OS is Linux. Consequently, the file system becomes corrupt for Hurd and possibly other OSs.

5. A fuzzer [34] was able to craft corrupted super blocks that would crash the btrfsck tool. In response, Btrfs developers added 15 extra checks (for a total of 17 checks) to the super block parsing code.

The common theme among all these bugs is that: 1) they are simple errors that occur because they require a detailed understanding of the file system format; 2) they can cause serious data loss or corruption; and 3) most of these bugs were fixed in fewer than 5 lines of code. Our domain-specific language allows generating libraries that can sanitize file system metadata by checking various structural constraints before it is accessed in memory. In the presence of corrupted metadata, our libraries generate error codes, rather than crashing the tools or propagating the corruption further. Section 3.1 discusses how our approach can help prevent or detect these bugs.

3 Approach

Our annotation language enables type-safe interpretation of file system structures, in both offline and online contexts. Type safety ensures that parsing and serialization of file system structures will detect data corruption that leads to type violations, thus reducing the chance of corruption propagation, and avoiding crash failures.

Ideally, data structure types and their relationships could be extracted from file system source code. Although the C header files of a file system contain the structural definitions for various metadata types, they are incomplete descriptions of the file system format because information is often hidden within the file system code. Our annotations augment the C language, helping specify parts of a file system’s format that cannot be easily expressed in C.

After a file system developer annotates his or her file system’s data structures, we use a compiler to parse the annotated structures and to generate a library that provides file-system specific interpretation routines. The library supports traversal and selective retrieval of metadata structures through type introspection. These facilities allow writing generic or file-system specific actions on specific file system metadata structures. For example, the application may wish to operate on the directory entries of a file system. Instead of attempting to parse the entire file system and find all directory entries, which requires significant file-system specific code, a developer using Spiffy would use generic type introspection code to find and operate on all directory entries. However, since the directory entry format may not be the same across file systems, the application may require file-system specific actions on the directory entry structures.

Our annotation-based approach has several advantages. First, it provides a concise and clear documentation of the file system’s format. Second, our generated libraries enable rapid prototyping of file-system aware storage applications. The libraries provide a uniform API, easing the development of applications that work across file systems so that the programmer can focus on the logic and not the format of the file systems. Third, our approach requires minimal changes to the file system source code (the annotations are only in the C header files and are backwards compatible with existing binary code), reducing the chance of introducing file system bugs. In contrast, differentiated storage services [18] needed to modify the file system and the kernel’s storage stack to enable I/O classification. With our approach, this application can be implemented by using introspec-

tion at the block layer for an unmodified file system, or at the hypervisor for an existing virtual machine. Finally, file system formats are known to be stable over time, so there is minimal cost for maintaining annotations.

3.1 Designing Annotations

The design of our annotation language for specifying the format of file system structures was motivated by several key concepts.

File System Pointers. File system pointers connect the metadata structures in a file system, but they are not well specified in C data structure definitions, as explained in Section 1. The difference between a file system pointer and an in-memory pointer is that the content of an in-memory pointer is always interpreted as the in-memory address of the pointed-to data, but interpreting the address contained by a file system pointer may involve multiple layers of translation. The most common type of file system pointer is a block pointer, where the address maps to a physical block location that contains a contiguous data structure. However, file system structures may also be laid out discontiguously. For example, the journal of an Ext4 file system is a logically contiguous structure that can be stored on disk non-contiguously, as a file. Similarly, Btrfs maps logical addresses to physical addresses for supporting RAID configurations.

Our design incorporates this requirement by associating an address space with each file system pointer. Each address space specifies a mapping of its addresses to physical locations. In the case of the Ext4 journal, we use the inode number, which uniquely identifies files in Unix file systems, as an address in the file address space.

Cross-Structure Dependencies. File system structures often depend on other structures. For example, the length of a directory entry’s name in Ext4 is stored in a field called name_len, as shown in Figure 1. However, this data structure definition does not provide the linkage between the two fields.² Structures may depend on fields in other structures as well. For example, several fields of the super block are frequently accessed to determine the block size, the features that are enabled in the file system, etc. To support these dependencies, we need to name these structures. For example, the expression sb.s_inode_size helps determine the size of an inode object, where sb is the name assigned to the super block.

    struct ext4_dir_entry {
        __le32 inode;               /* Inode number */
        __le16 rec_len;             /* Directory entry length */
        __u16 name_len;             /* Name length */
        char name[EXT4_NAME_LEN];   /* File name */
    };

Figure 1: Ext4 directory entry structure definition.

² Confusingly, name has a fixed size in the definition.

The naming mechanism must ensure that a name refers to the correct structure. For example, the F2FS file system contains two checkpoint packs for ensuring file system consistency, as shown in Figure 2. The number of orphan blocks in a F2FS checkpoint pack is determined by a field inside the checkpoint header. Our naming mechanism must ensure that when this field is accessed, it refers to the header structure in the correct checkpoint pack.

[Figure 2 diagram: the Checkpoint Region holds Checkpoint Pack #1 and Checkpoint Pack #2. Each Checkpoint Pack consists of a Checkpoint Header (cphdr), Orphan Blocks, Data Summary Blocks, Node Summary Blocks, and a Checkpoint Footer.]

Figure 2: Each F2FS checkpoint pack contains a header followed by a variable number of orphan blocks.

Spiffy uses a path-based name resolution mechanism, based on the observation that every file system structure is accessed along a path of pointers starting from the super block. In the simplest case, the automatic self variable is used to reference the fields of the same structure. Otherwise, a name lookup is performed in the reverse order of the path that was used to access the data structure. For example, in Figure 2, when we need to reference the checkpoint header (cphdr in the figure) while parsing the orphan block, the name resolution mechanism can unambiguously determine that it is referring to its parent checkpoint header. This strategy also makes it easy to use reference counting to ensure that a referenced structure is valid in memory when it needs to be accessed.

Context-Sensitive Types. File system metadata are frequently context-sensitive. A pointer may reference different types of metadata, or a structure may have optional fields, based on a field value. For example, the type of a journal block in Ext4 depends on a common field called h_blocktype. If the field’s value is 3, then it is the journal super block that contains many additional fields that can be parsed. However, if its value is 2, then it is a commit block that contains no other fields. We need to be able to handle such context-sensitive structures and pointers. We use a when expression, evaluated at runtime, to support such context-sensitive types. These conditional expressions also allow us to specify when different fields of a union are valid, which enables Spiffy to enforce a strict access discipline at runtime, and would prevent Bug #4 from Section 2.

Computed Fields. Sometimes file systems compute a value from one or more fields and use it to locate structures. For example, the block group descriptor table in

Base Class  Member Function                                    Description

Spiffy File System Library
Entity      int process_fields(Visitor & v)                    allows v to visit all fields of this object
            int process_pointers(Visitor & v)                  allows v to visit all pointer fields of this object
            int process_by_type(int t, Visitor & v)            allows v to visit all structures of type t
Pointer     Entity * fetch()                                   retrieves the pointed-to container from disk
Container   int save(bool alloc=true)                          serializes and then persists the container, may
                                                               assign a new address to the container
FileSystem  FileSystem(IO & io)                                instantiates a new file system object
            Entity * fetch_super()                             retrieves the super block from disk
            Entity * create_container(int type, Path & p)      creates a new container of metadata type
            Entity * parse_by_type(int type, Path & p,         parses the buffer as metadata type, using
              Address & addr, const char * buf, size_t len)    p to resolve cross structure dependencies

File System Developer
IO          int read(Address & addr, char * & buf)             reads from an address space specified by addr
            int write(Address & addr, const char * buf)        writes to an address space specified by addr
            int alloc(Address & addr, int type)                allocates an on-disk address for metadata type

Application Programmer
Visitor     int visit(Entity * e)                              visits an entity and possibly processes it

Table 3: Spiffy C++ Library API.
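
To make the division of labor in Table 3 more concrete, the following sketch mimics a visitor-based traversal using small in-memory stand-ins. The class names Entity, Pointer and Visitor follow Table 3, but every body here is hypothetical: the real classes are generated from annotations and fetch() reads a container from disk, whereas this mock simply returns a pre-built object. NameCollector, PtrVisitor, make_sample_tree and the structure names are invented for illustration, and visit takes a reference here although Table 3 lists a pointer parameter.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

class Entity;

// Callback interface implemented by the application programmer.
class Visitor {
public:
    virtual ~Visitor() = default;
    virtual int visit(Entity& e) = 0;
};

// Mock metadata structure; the real class would be generated by Spiffy.
class Entity {
public:
    explicit Entity(std::string name) : name_(std::move(name)) {}
    virtual ~Entity() = default;
    const std::string& get_name() const { return name_; }
    void add_pointer(std::unique_ptr<Entity> p) {
        assert(p != nullptr);
        pointers_.push_back(std::move(p));
    }
    // Lets v visit each pointer field; stops at the first nonzero return code.
    int process_pointers(Visitor& v) {
        for (auto& p : pointers_) {
            int rc = v.visit(*p);
            if (rc != 0) return rc;
        }
        return 0;
    }
private:
    std::string name_;
    std::vector<std::unique_ptr<Entity>> pointers_;
};

// Mock pointer: fetch() returns the in-memory target instead of reading disk.
class Pointer : public Entity {
public:
    Pointer(std::string name, std::unique_ptr<Entity> target)
        : Entity(std::move(name)), target_(std::move(target)) {}
    Entity* fetch() { return target_.get(); }
private:
    std::unique_ptr<Entity> target_;
};

// Follows a pointer and hands the pointed-to entity to another visitor.
class PtrVisitor : public Visitor {
public:
    explicit PtrVisitor(Visitor& ent) : ent_(ent) {}
    int visit(Entity& e) override {
        Entity* target = static_cast<Pointer&>(e).fetch();
        return target != nullptr ? ent_.visit(*target) : 0;
    }
private:
    Visitor& ent_;
};

// Records entity names in depth-first order.
class NameCollector : public Visitor {
public:
    NameCollector() : ptr_visitor_(*this) {}
    int visit(Entity& e) override {
        names.push_back(e.get_name());
        return e.process_pointers(ptr_visitor_);
    }
    std::vector<std::string> names;
private:
    PtrVisitor ptr_visitor_;
};

// Builds super_block -> group_descriptor_block -> inode_block (made-up names).
std::unique_ptr<Entity> make_sample_tree() {
    std::unique_ptr<Entity> inodes(new Entity("inode_block"));
    std::unique_ptr<Entity> gdt(new Entity("group_descriptor_block"));
    gdt->add_pointer(std::unique_ptr<Entity>(
        new Pointer("inode_table_ptr", std::move(inodes))));
    std::unique_ptr<Entity> sb(new Entity("super_block"));
    sb->add_pointer(std::unique_ptr<Entity>(
        new Pointer("gdt_ptr", std::move(gdt))));
    return sb;
}
```

Calling visit on the root returned by make_sample_tree with a NameCollector records the three structure names in depth-first order. Only the visitor classes correspond to code an application programmer would write; everything else stands in for generated or file-system developer code.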
Ext4 is implicitly the block(s) that immediately follows the super block. However, the exact address of the descriptor blocks depends on the block size, which is specified in the super block. We annotate this information as an implicit field of the super block that is computed at runtime. This approach allows the field to be dereferenced like a normal pointer, allowing traversal of the file system without requiring any changes to the underlying format. A computed field annotation can also be used to specify the size calculation for an NTFS MFT record, avoiding Bug #2 from Section 2.

Metadata Granularity. Existing file systems assume that the underlying storage media is a block device and access data in block units. Data structures can exist within such blocks or they can span contiguous physical blocks. Some data structures that span blocks are read in their entirety. For example, the Btrfs B-tree nodes are (by default) 16KB, or 4 blocks, and these blocks are read from disk together. In other cases, the data structure is read in portions. For example, an Ext4 inode table contains a group of inode blocks. The file system does not load the entire table in memory because it can be very large. Instead, it only loads the portions that are needed.

We define an access unit for file system structures so that the compiler can generate efficient code for traversing the file system. We call the unit of disk access a container. The container size is typically the file system block size but it may span multiple blocks, as in the Btrfs example. A structure that is placed inside a container is called an object. Finally, structures that span containers are called extents. We load extents on demand, when their containers are accessed.

Constraint Checking. The values of metadata fields within or across different objects often have constraints. For example, an Ext4 extent header always begins with the magic number 0xF30A to help detect corrupt blocks. Similarly, the name_len field of an Ext4 directory entry should be less than the rec_len field. Such constraints can be specified for each structure so that they can be checked to ensure correctness when parsing the structure. The use of constraint annotations could have helped prevent Bug #1, and detect Bugs #3 and #5 from Section 2.

The set of valid addresses for a metadata container may also have a placement constraint. For example, F2FS NAT blocks can only be placed inside the NAT area, which is specified in the F2FS super block. By annotating the placement constraint of a metadata container, Spiffy can verify that the address assigned to newly allocated metadata is within the correct bounds before the metadata is persisted to disk.

3.2 The Spiffy API

Table 3 shows a subset of the API for building Spiffy applications. The API consists of three sets of functions. The first set are automatically generated by Spiffy based on the annotated file system data structures. The second set need to be implemented by file system developers and are reusable across different applications. The last set are written by the application programmer for implementing application and file-system specific logic.

The Spiffy library uses the visitor pattern [9], allowing a programmer to customize the operations performed on each file system metadata type by implementing the visit function of the abstract base class Visitor.

The Entity base class provides a common interface for all metadata structures and their fields. The process_pointers function invokes the visit function of an application-defined Visitor class on each

pointer within the entity. The process_by_type function allows visiting a specific type of structure that is reachable from the entity. Unlike the other process functions, process_by_type will automatically follow pointers. For example, invoking process_by_type on the super block with the inode structure as an argument results in visiting all inodes in the file system.

Every container (and extent) has an address associated with it that allows accessing the container from disk. Figure 3 shows the format of an address, consisting of an address space, an identifier and an offset within the address space, and the size of the container. The offset field is used when a container belongs to an extent.

    struct Address {
        int aspc;           /* address space type */
        long id;            /* id of the address */
        unsigned offset;    /* offset from id */
        unsigned size;      /* size of object */
    };

Figure 3: Address structure to locate container on disk.

The Pointer class stores the address of a container (or an extent), and its fetch function reads the pointed-to container from disk. Figure 4 shows the generated code for the fetch function for a pointer to a container named IBlock (inode block). The file-system developer implements an IO class with a read function for each address space defined for the file system. When the IBlock is constructed, it invokes the constructors of its fields, thus creating all the objects (e.g., inodes) within the container. The constructors for inodes, in turn, invoke the constructors of block pointers in the inodes, which initialize a part of the address (address space, size and offset) of the block pointers based on the annotations. Then the container is parsed, which initializes the container fields in a nested manner, including setting the id component of the address of all the block pointers in the inodes contained in the IBlock.

    Entity * IBlockPtr::fetch() {
        IBlock * ib;
        Address & addr = this->address;
        char * buf = new char[addr.size];
        this->fs.io.read(addr, buf);
        ib = new IBlock(this->fs, addr, this->path);
        ib->parse(buf, addr.size);
        return ib;
    }

Figure 4: Example of a generated fetch function. IBlockPtr is a subclass of Pointer.

The Path object is associated with every entity and contains the list of structures that are needed to resolve cross-structure dependencies during parsing or serializing the container. It is set up based on the sequence of constructor calls, with each constructor adding the current object to the path passed to it.

The save function serializes a container by invoking nested serialization on its fields. Then, it invokes the alloc function for newly created metadata, or when existing metadata has to be reallocated (e.g., copy-on-write allocator). The allocator finds a new address for the container and updates any metadata that tracks allocation (e.g., the Ext4 block bitmap). If the address passes placement constraint checks, the buffer is written to disk.

The create_container function constructs empty containers of a given type. The application developer can then fill the container with data and invoke save to allocate and write the newly created container to disk.

3.3 Building Applications

Figure 5 shows a sample application built using the Spiffy API. This application prints the type of each metadata block in an Ext4 file system in depth-first order. The Ext4IO class implements the block and the file address space, as described in Section 5. The program starts by invoking fetch_super, which fetches the super block from a known location on disk and parses it. Then it uses two mutually recursive visitors, EntVisitor and PtrVisitor, to traverse the file system.

The EntVisitor::visit function takes an entity as input, prints its name, and then invokes process_pointers, which calls the PtrVisitor::visit function for every pointer in the entity. The PtrVisitor::visit function invokes fetch, which fetches the pointed-to entity from disk, and invokes EntVisitor::visit on it.

3.4 Limitations

The correctness of Spiffy applications depends on correctly written annotations. Therefore, if and when file system format changes do occur, the specifications will need to be updated. Spiffy applications will also need to update all file-system specific code that is affected by the format changes. These changes will likely only affect code that directly operates on the updated metadata structures, since the Spiffy library will provide safe traversal and parsing of any intermediate structures.

Currently, we have implemented an online application at the storage layer (metadata caching, see Section 5) that reads file system metadata, but does not modify it. We are exploring modifying file system metadata using Spiffy at the storage layer (which requires hooks into the file system code, e.g., for transactions and allocation [12]), and at the file system level (which enables more powerful applications).

Unlike typical file-system applications that operate at the VFS layer and are file-system independent, Spiffy applications operate directly on file-system specific struc-

96 16th USENIX Conference on File and Storage Technologies USENIX Association


EntVisitor ev; serializer will refuse to serialize a corrupted value that
PtrVisitor pv; violates its type constraints. Instead, corruption is per-
int PtrVisitor::visit(Entity & e) { formed after a block is serialized but before it is written.
Entity * tmp = ((Pointer &)e).fetch(); Free Space Tool This tool shows file-system fragmen-
if (tmp != nullptr) {
tation by plotting a histogram of the size of free ex-
ev.visit(*tmp);
tmp->destroy();
tents. The tool retrieves the metadata structures that
} store free space information and processes them (e.g.,
return 0; block bitmaps for Ext4, extent items for Btrfs, and seg-
} ment information table (SIT) for F2FS). This logic is im-
int EntVisitor::visit(Entity & e) { plemented using process_by_type (see Table 3) and
cout << e.get_name() << endl; a custom visit function that processes all the retrieved
return e.process_pointers(pv); metadata structures. Code to traverse the file system and
} parse intermediate structures is provided by our library.
void main(void) {
Ext4IO io("/dev/sdb1"); File System Conversion Tool Converting an existing
Ext4 fs(io); file system into a file system of another type is a time-
Entity * sup; consuming process, involving copying files to another
if ((sup = fs.fetch_super()) != nullptr) { disk, reformatting the disk, and then copying the files
ev.visit(*sup); back to the new file system. In-place file system conver-
sup->destroy(); sion that updates file system metadata without moving
} most file data can speed up the conversion dramatically.
} While some such conversion tools exist,3 they are hard
Figure 5: Code for traversing and printing the types of to implement correctly and not generally available.
all the metadata blocks in an Ext4 file system. We have designed an in-place file system conversion
tool using the Spiffy framework. Such a conversion tool
tures and are thus file-system dependent. Since file sys- requires detailed knowledge of the source and the des-
tems share common abstractions (e.g. files, directories, tination file systems, and is thus a challenging applica-
inodes), it may be possible to carefully abstract the func- tion for our approach. In-place conversion involves sev-
tionality that is shared between implementations, reduc- eral steps. First, the file and directory related metadata,
ing file-system dependence even further. such as inodes, extent mappings, and directory entries
of the source file system, are parsed into a standard for-
4 File System Applications mat. Second, the free space in the source file system is
tracked. Third, if any source file data occupies blocks
We have written five file-system aware storage applica- that are statically allocated in the destination file system,
tions using the Spiffy framework: a dump tool, a free then those blocks are reallocated to the free space, and
space reporting tool, a type-specific metadata corruptor, the conversion aborted if sufficient free space is not avail-
a file system conversion tool, and a prioritized block able. Finally, the metadata for the destination file system
layer cache. The first four applications operate offline, is created and written to disk. In our current tool, a power
while the last one is an online application. failure during the last step would corrupt the source file
File System Dump Tool The file system dump tool system. We plan to add failure atomicity in the future.
parses all the metadata in a file system image and exports Our tool currently converts extent-based Ext4 file sys-
the result in an XML format, using file system traver- tems to log-structured F2FS file systems. The source
sal code similar to the example in Figure 5. In addi- file system is read using a custom set of visitors that ef-
tion to process_pointers, the entity class provides a ficiently traverse the file system and create in-memory
process_fields method that allows iterating over all copies of relevant metadata. For example, unused block
fields (not just pointer fields) of the class. The dump tool groups can be skipped while processing block group de-
can be configured to prevent structures such as unallo- scriptors. Next, we generate the free space list by reusing
cated inode structures from being exported. components from the free space tool, and then removing
F2FS’s static metadata area from the list. Then, Ext4 ex-
Type-Specific Corruption Tool This tool is a variant
tents in the F2FS metadata area are relocated to the free
of the dump tool that injects file-system corruption in a
space with their mappings updated. Finally, F2FS meta-
type-specific manner [2], allowing us to test the robust-
data is created from the in-memory copies and written to
ness of file systems and their tools. When we decide to
corrupt a field, we cannot simply modify its in-memory 3 Theconvert utility converts FAT32 to NTFS [27], and updating to
value, since serialization is type-safe. For example, the iOS 10.3 upgrades the file system from HFS+ to APFS [28]



disk, which involves allocation and pointer management, requiring significant file-system-specific logic.

Fortunately, various pieces of the code can be reused for different combinations of source and destination file system when adapting new file systems. As an example, only the code to copy Btrfs metadata from an existing file system and to list its free space is required to support the conversion from Btrfs to F2FS, since the in-memory data structures are generic across file systems that support VFS. If the file system does not support VFS, suitable default values can be used, which would be helpful for upgrading from a legacy file system such as FAT32.

Prioritized Block Layer Cache: We have implemented a file-system aware block layer cache based on Bcache [20]. Our cache preferentially caches the files of certain priority users, identified by the uid of the file. This caching policy can dramatically improve workload performance by improving the cache hit rate for prioritized workloads, as shown in previous work [26]. Bcache uses an LRU replacement policy; in our implementation, blocks belonging to priority users are given a second chance and are only evicted if they return to the head of the LRU list without being referenced.

We use a runtime interpretation module, described in more detail in Section 5, to identify metadata blocks at the block layer without any modifications to the file system. We track the data extents that belong to file inodes containing the uid of a priority user, so that we can preferentially cache these extents. For Ext4, we use custom visit functions to parse inodes and determine the priority extent nodes. Similarly, we parse the priority extent nodes to determine the priority extent leaves, which contain the priority data extents.

For Btrfs, the inodes and their file extent items may not be placed close together (e.g., within the same B-tree leaf block), and so parsing an inode object will not provide information about its extents. Fortunately, the key of a file extent item is its associated inode number, making it easy to track the file extents of priority users.

5 Implementation

We implemented a compiler that parses Spiffy annotations. The compiler generates the file system’s internal representation in a symbol table, containing the definitions of all the file system metadata, their annotations, their fields (including type and symbolic name), and each of their field’s annotations. Next, it detects errors such as duplicate declarations or missing required arguments. Finally, the symbol table and compiler options are exported for use by the compiler’s backend.

Spiffy’s backend generates C++ code for a file-system specific metadata library using Jinja2 [22]. The library can be compiled as either a user space library or as part of a Linux kernel module. We linked our module, including our generated library, into the Linux kernel by porting some C++ standard containers to the kernel environment and integrating the GNU g++ compiler into the kernel build process, which required minor changes.

Every annotated structure is wrapped in a class that allows introspection. Each field in the wrapped class can refer to its name, type and size, and has a reference to the containing structure. The generated library performs various types of error-checking operations. For example, the parsing of offset fields ensures that objects do not cross container boundaries, and that all variable-sized structures fit within their containers. These checks are essential if an application aims to handle file system corruption. When parsing does fail, an error code is propagated to the caller of the parse or serialize function.

Address Spaces: Annotation developers must implement the IO interface shown in Table 3. The Ext4 file address space implementation for the Ext4IO class (see Figure 5) requires fetching the file contents associated with an inode number. For Btrfs, we currently support the RAID address space for a single device, which only allows metadata mirroring (RAID-1). For F2FS, we support the NID address space, which maps a NID (node id) to a node block. The implementation involves a lookup to see if a valid mapping entry is in the journal. If not, the mapping is obtained from the node address table.

Runtime Interpretation: Offline Spiffy applications use variants of the file-system traversal algorithm in Figure 5. Spiffy also supports online file-system aware storage applications via a kernel module that performs file system interpretation at the block layer of the Linux kernel using the generated libraries. These storage applications are typically difficult to write and error prone, since manual parsing code is needed for each block type. However, our implementation only requires a small amount of bootstrap code to support any annotated file system. The rest of the code is file-system independent.

In offline applications, the fetch function reads data from disk and parses the structure. The type of the structure is known from the pointer that is passed to the fetch function. In contrast, for online interpretation, the file system performs the read, and the application just needs to parse it. The parse_by_type function in Table 3 allows parsing of arbitrary buffers and constructing the corresponding containers, without the need for an IO object to read data from disk. However, it needs to know the type of the block before parsing is possible. Our runtime interpretation depends on the fact that a pointer to a metadata block must be read before the pointed-to block is read. When a pointer is found during the parsing of a block, the module tracks the type of the pointed-to block so that its type is known when it is read.
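The pointer-driven type tracking used for online interpretation can be illustrated with a minimal sketch. The `BlockTypeTracker` class, the `BlockPointer` struct, and the type names used as strings are hypothetical and are not part of the Spiffy-generated library, whose API is not reproduced here. The sketch assumes that parsing a block yields the list of pointers it contains, each carrying the type named by its annotation.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a parsed pointer field: where it points and
// which metadata type its annotation says lives there.
struct BlockPointer {
    std::uint64_t target;
    std::string target_type;
};

// Hypothetical tracker that remembers the type of every block that some
// already-interpreted block points to.
class BlockTypeTracker {
public:
    // Seed the table with a block whose location and type are fixed,
    // for example the super block.
    void bootstrap(std::uint64_t block, const std::string& type) {
        assert(!type.empty());
        types_[block] = type;
    }

    // Called after a read completes. Returns the tracked type of `block`,
    // or an empty string if no pointer to it has been seen yet.
    // `pointers` stands in for the pointers found while parsing the block.
    std::string on_read(std::uint64_t block,
                        const std::vector<BlockPointer>& pointers) {
        auto it = types_.find(block);
        if (it == types_.end()) {
            return "";  // unknown block: nothing to interpret
        }
        std::string type = it->second;  // copy before the table is modified
        for (const auto& p : pointers) {
            types_[p.target] = p.target_type;  // add or update the mapping
        }
        return type;
    }

private:
    std::unordered_map<std::uint64_t, std::string> types_;
};
```

In this sketch a block can only be interpreted after a block pointing to it has been read, which mirrors the ordering assumption stated above; blocks that are never pointed to are simply reported as unknown.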



Our module exports several functions, including interpret_read and interpret_write, that need to be placed in the I/O path to perform runtime interpretation. These functions operate on locked block buffers. The module maintains a mapping between block numbers and their types. After intercepting a completed read request, it checks whether a mapping exists, and if so, it is a metadata block and it gets parsed. Next, process_pointers is invoked with a visitor that adds (or updates) all the pointers that are found in the block into the mapping table. If a parsed block will be referenced later (e.g., super block), we make a copy so that it is available during subsequent parsing of structures that depend on the value of its fields (e.g., parsing the Ext4 inode block requires knowing the size of an inode, which is in the super block). The local copy is atomically replaced when a new version of the block is written to disk.

When the I/O operation is a write, the module needs to determine the type of the written block. A statically allocated block can be immediately parsed because its type will not change. For example, most metadata blocks in Ext4 are statically allocated. However, in Btrfs, the super block is the only statically allocated metadata block. For dynamically allocated blocks, the block must first be labeled as unknown and its contents cached, since its type may either be unknown or have changed. Interpretation for this block is deferred until it is referenced by a block that is subsequently accessed (either read or written), and whose type is known. At that point, the module will interpret all unknown blocks that are referenced.

Since most dynamically-typed blocks are data blocks, they should be discarded immediately to reduce memory overhead. For the Btrfs file system, this is relatively easy because metadata blocks are self-identifying. For Ext4, these blocks need to be temporarily buffered until they can be interpreted. However, we use a heuristic for Ext4 to quickly identify dynamically-typed blocks that are definitely not metadata, to reduce the memory overhead of deferred interpretation. The block is first parsed as if it were a dynamically allocated block (e.g., a directory block or extent metadata block), and if the parsing results in an error, then the block is assumed to be data and discarded. This heuristic could be used in other file systems as well because most file systems have a small number of dynamically allocated metadata block types, or their blocks are self-identifying.

The module currently relies on the file system to issue trim operations to detect deallocation of blocks so that stale entries can be removed from the mapping table. Since file systems do not guarantee correct implementation of trim, the module additionally flushes out entries for dynamically allocated blocks that have not been accessed recently. This works for a caching application, but may lead to mis-classification for other runtime applications. Accurate classification can be implemented by keeping the previous versions of blocks and comparing the versions at transaction commit time. However, it comes with a higher memory overhead [8].

6 Evaluation

In this section, we discuss the effort required to annotate the structures of existing file systems, the effort required to write Spiffy applications, and the robustness of Spiffy libraries. We then evaluate the performance of our file-system conversion tool and the file-system aware block-layer caching mechanism.

6.1 Annotation Effort

File System   Line Count   Annotated   Structures
Ext4          491          113         15+10+4
Btrfs         556          151         27+4+1
F2FS          462          127         14+16+5

Table 4: File system structure annotation effort.

Table 4 shows the effort required to correctly annotate the Ext4, Btrfs and F2FS file systems. The second column shows the number of lines of code of existing on-disk data structures in these file systems. The lines of code count was obtained using cloc [6] to eliminate comments and empty lines. The third column shows the number of annotation lines. This number is less than one-third of the total line count for all the file systems.

The last column is listed as A + B + C, with A showing no modification to the data structure (other than adding annotations), B showing the number of data structures that were added, and C showing the number of data structures that needed to be modified. Structure declarations needed to be added or modified for three reasons:

1. We break down structures that benefit from being declared as conditionally inherited types. For example, btrfs_file_extent_item is split into two parts: the header and an optional footer, depending on whether it contains inline data or extent information.
2. Simple structures such as Ext4 extent metadata blocks, are not declared in the original source code. However, for annotation purposes, they need to be explicitly declared. All of the added structures in Ext4 belong to this category.
3. Some data structures with a complex or backward-compatible format require modifications to enable proper annotation. For example, Ext4 inode retains its Ext3 definition in the official header file even though the i_block field now contains extent tree information rather than block pointers. We redefined the Ext4 inode structure and replaced i_block with the extent header followed by four extent entries.
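The third reason can be made concrete with a plain C++ sketch of the layout that takes the place of i_block. The field names follow the Linux ext4 on-disk extent structures; the wrapper struct `i_block_as_extents` and the helpers `entry_at` and `extent_start` are hypothetical names introduced for illustration, and the sketch carries no Spiffy annotations, since the annotation syntax is not shown in this section. It also assumes a little-endian host and a typical ABI with no unexpected padding, which the static assertions check.

```cpp
#include <cassert>
#include <cstdint>

// Header found at the start of every ext4 extent tree node.
struct ext4_extent_header {
    std::uint16_t eh_magic;       // magic number identifying an extent node
    std::uint16_t eh_entries;     // number of valid entries that follow
    std::uint16_t eh_max;         // capacity of the entry array
    std::uint16_t eh_depth;       // 0 means the entries map data blocks
    std::uint32_t eh_generation;  // generation of the tree
};

// Leaf entry mapping a run of logical blocks to physical blocks.
struct ext4_extent {
    std::uint32_t ee_block;     // first logical block covered
    std::uint16_t ee_len;       // number of blocks covered
    std::uint16_t ee_start_hi;  // high 16 bits of the physical block
    std::uint32_t ee_start_lo;  // low 32 bits of the physical block
};

// Hypothetical replacement for the 15-word i_block array.
struct i_block_as_extents {
    ext4_extent_header header;
    ext4_extent entries[4];
};

static_assert(sizeof(ext4_extent_header) == 12, "header is 12 bytes");
static_assert(sizeof(ext4_extent) == 12, "extent entry is 12 bytes");
static_assert(sizeof(i_block_as_extents) == 15 * sizeof(std::uint32_t),
              "header plus four entries fill the 60-byte i_block area");

// Return entry i, checking it against both the array size and the
// number of valid entries recorded in the header.
inline const ext4_extent& entry_at(const i_block_as_extents& ib, unsigned i) {
    assert(i < 4 && i < ib.header.eh_entries);
    return ib.entries[i];
}

// Combine the split physical start block into one 48-bit value.
inline std::uint64_t extent_start(const ext4_extent& e) {
    return (static_cast<std::uint64_t>(e.ee_start_hi) << 32) | e.ee_start_lo;
}
```

The size check shows why exactly four entries fit: the 12-byte header plus four 12-byte entries occupy the same 60 bytes as the fifteen 32-bit block pointers of the Ext3 definition.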



6.2 Developer Effort

Dump Tool: The file system dump tool includes a file-system independent XML writer module, written in 565 lines of code. The main function for each file system is written in 40 to 50 lines of code. The dump tool is helpful for debugging issues with real file systems. In addition, an expert can verify that the annotations are correct when the output of the dump tool matches the expected contents of the file system. Therefore, this tool has become an integral part of our development process.

Type-Specific Corruptor: This tool is written in 455 lines of code, with less than 30 lines of code required for the main function of each file system. The structure that the user wants to corrupt is specified via the command line and the tool uses process_by_type to find it, without the need for file-system specific code.

Free Space Tool: The file system free space tool has 271 lines of file-system independent code. File-system specific parts require 76 lines for Ext4, 77 lines for Btrfs, and 194 lines for F2FS. F2FS requires more code due to the complex format of its block allocation information.

Conversion Tool: The Spiffy file system conversion tool framework is written in 504 lines of code. The code for reading Ext4 takes 218 lines, the code to convert to the F2FS file system requires 1760 lines, and the file-system developer code for F2FS, which is reused in other applications such as the dump tool, consists of 383 lines.

We also wrote a manual converter tool that uses the libext2fs [30] library to copy Ext4 metadata from the source file system, and manually writes raw data to create an F2FS file system. The manual converter has 223 lines of Ext4 code, and 2260 lines for the F2FS code.

While the two converters have similar number of lines of code, the Spiffy converter has several other benefits. For the source file system, the manual converter takes advantage of the libext2fs library. Writing the code to convert from a different source file system would require significant effort, and would require much more code for a file system such as ZFS that lacks a similar user-level library. On the destination side, the Spiffy converter requires many file-system specific lines of code to manually initialize each newly created object. However, Spiffy checks constraints on objects and uses the create_container and save functions to create and serialize objects in a type-safe manner, while the manual converter writes raw data, which is error-prone, leading to the types of bugs discussed in Section 2.

Prioritized Cache: The original Bcache code consisted of 10518 lines of code. To implement prioritized caching we added 289 lines to this code, which invoke our generic runtime metadata interpretation framework, consisting of 2158 lines of code. This framework provides hooks to specify file-system specific policies. Our Ext4-specific policy requires 111 lines of code, and the Btrfs-specific policy requires 134 lines of code. Currently, we have not implemented prioritized caching for F2FS, which would require tracking NAT entries, similar to how we track inode numbers for Btrfs to find file extents.

6.3 Corruption Experiments

We use our type-specific corruption tool to evaluate the robustness of Spiffy generated libraries. The experiment fills a 128MB file system image with 12,000 files and some directories, then clobbers a chosen field in a specific metadata structure (e.g., one of the inode structures) to create a corrupted file system image. We corrupt each field in each type of metadata structure three times, twice to a random value and once to zero.

The Spiffy dump tool was able to generate correctly formatted XML files in the face of arbitrary single-field corruptions for all of these images. When corruption is detected during the parsing of a container or a pointer fetch (i.e., pointer address is out-of-bound or fails a placement constraint), an error is printed and the program stops the traversal.

Table 5 describes the crashes we found when we ran existing tools on the same corrupted images. For dumpe2fs (dump tool for Ext4) v1.42.13, we found a single crash when the s_creator_os field of the super block is corrupted. For dump.f2fs v1.6.1-1, we observed 5 instances of segmentation faults. Three of the crashes were due to corruption in the super block, and one crash each was detected for the summary block and inode structures. We were unable to trigger any crash-related bugs in btrfs-debug-tree v4.4.

Tool Name   Structure       Field                Description
dumpe2fs    super block     s_creator_os         index out of bound error during OS name lookup
dump.f2fs   super block     log_blocks_per_seg   index out of bound error while building nat bitmap
            super block     segment_count_main   null pointer dereference after calloc fails
            super block     cp_blkaddr           double free error during error handling (no valid checkpoint)
            summary block   n_nats               index out of bound error during nid lookup
            inode           i_namelen            index out of bound error when adding null character to end of name

Table 5: List of segmentation faults found during type-specific corruption experiments.

These results are not unexpected since F2FS is a relatively young file system. Btrfs uses metadata checksumming to detect corruption, and thus requires corruption to be injected before checksum generation to fully test the robustness of its dump tool. Lastly, dumpe2fs does not traverse the full file system metadata, and so does not encounter most of the metadata corruption. Our Spiffy dump tool is both more complete and more robust than dumpe2fs, without requiring significant testing effort. We also tried an extensive set of random corruption experiments, and none of the existing tools crashed, showing that our type-specific corruptor is a useful tool for testing the robustness of these applications.

6.4 File System Conversion Performance

We compare the time it takes to perform copy-based conversion, versus using the Spiffy-based and the manually written in-place file-system conversion tools. The results are shown in Table 6. The experiments are run on an Intel 510 Series SATA SSD. We create the file set using Filebench 1.5-a3 [32] in an Ext4 partition on the SSD, and then convert the partition to F2FS. The 20K file set uses the msnfs file size distribution with the largest file size up to 1GB. The rest of the file sets have progressively fewer small files. All file sets have a total size of 16GB. For the copy converter, we run tar -aR at the root of the SSD partition and save the tar file on a separate local disk. We then reformat the SSD partition and extract the file set back into the partition.

# files   Copy Converter   Manual Conv.   Spiffy Conv.
20000     188.2 ± 3.7s     6.6 ± 0.5s     7.0 ± 0.2s
1000      192.7 ± 2.3s     3.3 ± 0.1s     3.8 ± 0.0s
100       195.1 ± 0.2s     3.3 ± 0.1s     3.7 ± 0.1s

Table 6: Time required for each technique to convert from Ext4 to F2FS for different number of files.

The copy converter requires transferring two full copies of the file set, and so it takes 30x to 50x longer than using the conversion tools, which only need to move data blocks out of F2FS’s static metadata area and then create the corresponding F2FS metadata. Both conversion tools take more time with larger file sets since they need to handle the conversion of more file system metadata. The library-assisted conversion tool performs reasonably compared to its manually-written counterpart, with at most a 16.7% overhead for the added type-safety protection that the library offers.

6.5 Prioritized Cache Performance

We measure the performance of our prioritized block layer cache (see Section 4), and compare it against LRU caching with one or two instances of the same workload.

Our experimental setup includes a client machine connected to a storage server over a 10Gb Ethernet using the iSCSI protocol. The storage server runs Linux 3.11.2 and has 4 Intel Processor E7-4830 CPUs for a total of 32 cores, 256GB of memory and a software RAID-6 volume consisting of 13 Hitachi HDS721010 SATA2 7200 RPM disks. The client machine runs Linux 4.4.0 with Intel Processor E5-2650, and an Intel 510 Series SATA SSD that is used for client-side caching. To mimic the memory-to-cache ratio of real-world storage servers, we limit the memory on the client to 4GB and use 8GB of the SSD for write-back caching. The RAID partition is formatted with either the Ext4 or Btrfs file system and is used as the primary storage device. To avoid any scheduling related effects, the NOOP I/O scheduler is used in all cases for both the caching and primary device.

We use a pair of identical Filebench fileserver workloads to simulate a shared hosting scenario with two users where one requires higher storage performance than the other. We generate a total file set size of 8GB with an average file size of 128KB, for each workload. The fileserver personality performs a series of create, write, append, read and delete of random files throughout the experiment. Filebench reports performance metrics every 60 seconds over a period of 90 minutes. Performance initially fluctuates as the cache fills, therefore we present the average throughput over the last 60 minutes of the experiment, after performance stabilizes.

[Figure 6 plot not reproduced. Panels: Ext4 and Btrfs. Configurations: Fileserver A, alone; Fileserver A + Fileserver B, no preference; Fileserver A + Fileserver B, A is preferred. Series: Fileserver A, Fileserver B. Axis: ops/s, 0 to 3000.]

Figure 6: Throughput of prioritized caching over LRU caching with one or two file servers for Ext4 and Btrfs.

Figure 6 shows the average throughput for each of the experiments in operations per second. The error bars show 95% confidence intervals. First, we establish the baseline performance of a single fileserver instance running alone, which has a cache hit ratio of 64% and 54% for Ext4 and Btrfs, respectively. Next, we run two instances of fileserver to observe the effect of cache contention. We see a drastic reduction in cache hit ratio to 23% and 24% for Ext4 and Btrfs, respectively. Both fileservers have similar performance, which is between 2.3x and 2.7x less than when running alone. When we apply preferential caching to the files used by fileserver A, however, its throughput improves by 60% over non-prioritized LRU caching when running concurrently with fileserver B, with the overall cache hit ratio improving to 46% and 53% for Ext4 and Btrfs, respectively. Prioritized caching also improves the aggregate throughput of the system by 14% to 22%. Giving priority to one of the two jobs implicitly reduces cache contention.
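For reference, the replacement policy measured in this section (blocks of priority users get a second chance before eviction, as described in Section 4) can be sketched as follows. This is an illustrative user-space model under stated assumptions, not the Bcache modification itself: the `SecondChanceLru` class and its members are hypothetical, the cache is modeled as a fixed number of block slots, and the caller supplies the priority flag that the real system derives from file-system metadata.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <list>
#include <unordered_map>

// Hypothetical LRU cache model in which priority blocks are spared once
// when they reach the eviction end (the head) of the list.
class SecondChanceLru {
public:
    explicit SecondChanceLru(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Record an access to `block`. Returns true on a cache hit.
    bool access(std::uint64_t block, bool priority) {
        auto it = index_.find(block);
        if (it != index_.end()) {
            it->second->chance_used = false;  // referenced again
            // Move the entry to the most recently used end of the list.
            order_.splice(order_.end(), order_, it->second);
            return true;
        }
        if (order_.size() == capacity_) {
            evict_one();
        }
        order_.push_back(Entry{block, priority, false});
        index_[block] = std::prev(order_.end());
        return false;
    }

    bool contains(std::uint64_t block) const {
        return index_.count(block) != 0;
    }

private:
    struct Entry {
        std::uint64_t block;
        bool priority;
        bool chance_used;  // true once the block has been spared
    };

    // Evict from the head; a priority block that has not yet used its
    // second chance is moved to the tail instead of being evicted.
    void evict_one() {
        for (;;) {
            Entry& victim = order_.front();
            if (victim.priority && !victim.chance_used) {
                victim.chance_used = true;
                order_.splice(order_.end(), order_, order_.begin());
                continue;
            }
            index_.erase(victim.block);
            order_.pop_front();
            return;
        }
    }

    std::size_t capacity_;
    std::list<Entry> order_;  // front is the next eviction candidate
    std::unordered_map<std::uint64_t, std::list<Entry>::iterator> index_;
};
```

With a capacity of two, inserting a priority block, then two non-priority blocks, evicts the first non-priority block while the priority block survives; a further miss evicts the priority block only because it returned to the head without being referenced in between.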

These results show that storage applications using our generated library can provide reasonable performance improvements without changing the file system code.

7 Related Work

A large body of work has focused on storage-layer applications that perform file-system specific processing for improving performance or reliability. Semantically-smart disks [24] used probing to gather detailed knowledge of file system behavior, allowing functionality or performance to be enhanced transparently at the block layer. The probing was designed for Ext4-like file systems and would likely require changes for copy-on-write and log-structured file systems. Spiffy annotations avoid the need for probing, helping provide accurate block type information based on runtime interpretation.

I/O shepherding [12] improves reliability by using file structure information to implement checksumming and replication. Block type information is provided to the storage layer I/O shepherd by modifying the file system and the buffer-cache code. Our approach enables I/O shepherding without requiring these changes. Also, unlike I/O shepherding, Spiffy allows interpreting block contents, enabling more powerful policies, such as caching the files of specific users.

A type-safe disk extends the disk interface by exposing primitives for block allocation and pointer relationships [23], which helps enforce invariants such as preventing access to unallocated blocks, but this interface requires extensive file system modifications. We believe that our runtime interpretation approach allows enforcing such type-safety invariants on existing file systems.

Serialization of structured data has been explored through interface languages such as ASN.1 [25] and Protocol Buffers [31], which allow programmers to define their data structures so that marshaling routines can be generated for them. However, the binary serialization format for the structures is specified by the protocol and not under the control of the programmer. As a result, these languages cannot be used to interpret the existing binary format of a file system.

Data description languages such as Hammer [21] and PADS [7] allow fine-grained byte-level data formats to be specified. However, they have limited support for non-sequential processing, and thus their parsers cannot interpret file system I/O, where a graph traversal is required rather than a sequential scan. Furthermore, with online interpretation, this traversal is performed on a small part of the graph, and not on the entire data.

Nail [3] shares many goals with our work. Its grammar provides the ability to specify arbitrarily computed fields. It also supports non-linear parsing, but its scope is limited to a single packet or file, and so it does not support references to external objects. Our annotation language overcomes this limitation by explicitly annotating pointers, which defines how file system metadata reference each other. We also provide support for address spaces, so that address values can be mapped to user-specified physical locations on disk.

Several projects have explored C extensions for expressing additional semantic information [19, 35, 29]. CCured [19] enables type and memory safety, and the Deputy Type System [35] prevents out-of-bound array errors. Both projects annotate source code, perform static analysis, and add runtime checks, but they are designed for in-memory structures.

Formal specification approaches for file systems [1, 5] require building a new file system from scratch, while our work focuses on building tools for existing file systems. Chen et al. [5] use logical address spaces as abstractions for writing higher-level file system specifications. This idea inspired our use of an address space type for specifying pointers. Another method for specifying pointers is by defining paths that enable traversing the metadata tree to locate a metadata object, such as finding the inode structure from an inode number [14, 10]. These approaches focus on the correctness of file-system operations at the virtual file system layer, whereas our goal is to specify the physical structures of file systems.

8 Conclusion

Spiffy is an annotation language for specifying the on-disk file system data structures. File system developers annotate their data structures using Spiffy, which enables generating a library that allows parsing and traversing file system data structures correctly.

We have shown the generality of our approach by annotating three vastly different file systems. The annotated file system code serves as detailed documentation for the metadata structures and the relationships between them. File-system aware storage applications can use the Spiffy libraries to improve their resilience against parsing bugs, and to reduce the overall programming effort needed for supporting file-system specific logic in these applications. Our evaluation suggests that applications using the generated libraries perform reasonably well. We believe our approach will enable interesting applications that require an understanding of storage structures.

Acknowledgements

We thank the anonymous reviewers and our shepherd, André Brinkmann, for their valuable feedback. We specially thank Michael Stumm, Ding Yuan, Mike Qin, and Peter Goodman for their insightful suggestions. This work was supported by NSERC Discovery.

References

[1] Amani, S., Ryzhyk, L., and Murray, T. Towards a fully verified file system, 2012. EuroSys Doctoral Workshop 2012.
[2] Bairavasundaram, L. N., Rungta, M., Agrawa, N., Arpaci-Dusseau, A. C., Arpaci-Dusseau, R. H., and Swift, M. M. Analyzing the effects of disk-pointer corruption. In 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN) (2008), IEEE, pp. 502–511.
[3] Bangert, J., and Zeldovich, N. Nail: A practical tool for parsing and generating data formats. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14) (2014), pp. 615–628.
[4] Buckeye, B., and Liston, K. Recovering deleted files in linux. http://collaboration.cmc.ec.gc.ca/science/rpn/biblio/ddj/Website/articles/SA/v11/i04/a9.htm, 2006.
[5] Chen, H., Ziegler, D., Chajed, T., Chlipala, A., Kaashoek, M. F., and Zeldovich, N. Using crash hoare logic for certifying the fscq file system. In Proceedings of the 25th Symposium on Operating Systems Principles (2015), ACM, pp. 18–37.
[6] Danial, A. Cloc–count lines of code. Open source (2009). http://cloc.sourceforge.net/.
[7] Fisher, K., and Walker, D. The pads project: an overview. In Proceedings of the 14th International Conference on Database Theory (2011), ACM, pp. 11–17.
[8] Fryer, D., Sun, K., Mahmood, R., Cheng, T., Benjamin, S., Goel, A., and Brown, A. D. Recon: Verifying file system consistency at runtime. ACM Transactions on Storage 8, 4 (Dec. 2012), 15:1–15:29.
[9] Gamma, E. Design patterns: elements of reusable object-oriented software. Pearson Education India, 1995.
[10] Gardner, P., Ntzik, G., and Wright, A. Local reasoning for the posix file system. In European Symposium on Programming Languages and Systems (2014), Springer, pp. 169–188.
[11] Gedak, C. Manage Partitions with GParted How-to. Packt Publishing Ltd, 2012.
[12] Gunawi, H. S., Prabhakaran, V., Krishnan, S., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. Improving file system reliability with I/O shepherding. In Proc. of the Symposium on Operating Systems Principles (SOSP) (2007), pp. 293–306.
[13] Gunawi, H. S., Rajimwale, A., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. SQCK: A declarative file system checker. In Proc. of the USENIX Symposium on Operating Systems Design and Implementation (OSDI) (Dec. 2008).
[14] Hesselink, W. H., and Lali, M. I. Formalizing a hierarchical file system. Electronic Notes in Theoretical Computer Science 259 (2009), 67–85.
[15] Lee, C., Sim, D., Hwang, J., and Cho, S. F2fs: A new file system for flash storage. In 13th USENIX Conference on File and Storage Technologies (FAST 15) (2015), pp. 273–286.
[16] Lu, L., Arpaci-Dusseau, A. C., Arpaci-Dusseau, R. H., and Lu, S. A study of Linux file system evolution. In Proc. of the USENIX Conference on File and Storage Technologies (FAST) (Feb. 2013).
[17] Ma, A., Dragga, C., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. ffsck: The fast file system checker. In Proc. of the USENIX Conference on File and Storage Technologies (FAST) (Feb. 2013).
[18] Mesnier, M., Chen, F., Luo, T., and Akers, J. B. Differentiated storage services. In Proc. of the Symposium on Operating Systems Principles (SOSP) (2011), pp. 57–70.
[19] Necula, G. C., McPeak, S., and Weimer, W. Ccured: type-safe retrofitting of legacy code. In Proceedings of the 29th ACM SIGPLAN-SIGACT symposium on Principles of programming languages (New York, NY, USA, 2002), POPL ’02, ACM, pp. 128–139.
[20] Overstreet, K. Linux bcache, Aug. 2016. https://bcache.evilpiepirate.org/.
[21] Patterson, M., and Hirsch, D. Hammer parser generator, march 2014. https://github.com/UpstandingHackers/hammer.
[22] Ronacher, A. Jinja2 documentation, 2011.
[23] Sivathanu, G., Sundararaman, S., and Zadok, E. Type-safe disks. In Proc. of the USENIX Symposium on Operating Systems Design and Implementation (OSDI) (2006), pp. 15–28.
[24] Sivathanu, M., Prabhakaran, V., Popovici, F. I., Denehy, T. E., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. Semantically-smart disk systems. In USENIX Conference on File and Storage Technologies (FAST) (2003), pp. 73–88.
[25] Steedman, D. Abstract syntax notation one (ASN.1): the tutorial and reference. Technology appraisals, 1993.
[26] Stefanovici, I., Thereska, E., O’Shea, G., Schroeder, B., Ballani, H., Karagiannis, T., Rowstron, A., and Talpey, T. Software-defined caching: Managing caches in multi-tenant data centers. In Proceedings of the Sixth ACM Symposium on Cloud Computing (2015), ACM, pp. 174–181.
[27] TechNet, M. How to convert fat disks to ntfs. https://technet.microsoft.com/en-us/library/bb456984.aspx.
[28] Tom Warren. Apple is upgrading millions of iOS devices to a new modern file system today. https://www.theverge.com/2017/3/27/15076244/apple-file-system-apfs-ios-10-3-features. Accessed: 2017-03-27.
[29] Torvalds, L., Triplett, J., and Li, C. Sparse–a semantic parser for c. see http://sparse.wiki.kernel.org (2007).
[30] Ts’o, T. E2fsprogs: Ext2/3/4 filesystem utilities. http://e2fsprogs.sourceforge.net/, 2017.
[31] Varda, K. Protocol buffers: Google’s data interchange format. Google Open Source Blog, Available at least as early as Jul (2008).
[32] Wilson, A. The new and improved filebench. In Proceedings of 6th USENIX Conference on File and Storage Technologies (2008). https://github.com/filebench/filebench/.
[33] Yang, J., Twohey, P., Engler, D., and Musuvathi, M. Using model checking to find serious file system errors. ACM Transactions on Computer Systems (TOCS) 24, 4 (2006), 393–423.
[34] Zalewski, M. American fuzzy lop. http://lcamtuf.coredump.cx/afl/, 2016.
[35] Zhou, F., Condit, J., Anderson, Z., Bagrak, I., Ennals, R., Harren, M., Necula, G., and Brewer, E. Safedrive: Safe and recoverable extensions using language-based techniques. In Proceedings of the 7th symposium on Operating systems design and implementation (2006), USENIX Association, pp. 45–60.
